Practical ways to index the fediverse

I’m writing a bot that scouts the fediverse for “job postings” and then lets people search for those jobs on a webpage.

I’m looking for ways to “scout and index the public fediverse” more reliably and more in line with ActivityPub.

First and foremost: yes, there are social issues with indexing the fediverse. People don’t want to be indexed, some instances ban bots, others enable tech to discourage spiders, and so on. This thread is not so much about discussing whether or not indexing is evil in itself. So, for the sake of this thread, let’s presume crawling and indexing can have valid use-cases, such as providing a job board of jobs advertised on the fediverse. Many other use-cases are possible (index book reviews, find toots that mention your brand so you can engage, find users with similar interests, find people with certain interests who are looking for a job, etc.). And let’s presume that the indexing strategy follows the privacy settings provided by users and instances and ignores any content and profiles that don’t want to be indexed.

Given that there are proper use-cases, how would we best attack this problem?

What my bot does now is both naïve and flaky: it just uses the Mastodon REST API of the instance (botsin.space) that it’s registered at to search for hashtags and walk over the toots in those hashtags. It does this several times a day. Additionally, it follows the public websocket stream to index toots as they appear.
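
For reference, a minimal sketch of what that amounts to, assuming a plain Mastodon instance and illustrative crates (reqwest + serde_json); this is not the bot’s actual code:

```rust
// Poll the public hashtag timeline of one instance via the Mastodon REST API.
// Needs reqwest with the "blocking" and "json" features, plus serde_json.
use serde_json::Value;

fn fetch_hashtag(instance: &str, tag: &str) -> Result<Vec<Value>, reqwest::Error> {
    // GET /api/v1/timelines/tag/:hashtag returns up to 40 public statuses
    let url = format!("https://{instance}/api/v1/timelines/tag/{tag}?limit=40");
    reqwest::blocking::get(url)?.json()
}

fn main() -> Result<(), reqwest::Error> {
    // the instance and hashtag are illustrative
    for status in fetch_hashtag("botsin.space", "vacancy")? {
        if let Some(link) = status.get("url").and_then(Value::as_str) {
            println!("{link}");
        }
    }
    Ok(())
}
```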

It is naïve because it obviously misses a lot of content: only content that is federated to the server the bot uses will be indexed. It is Mastodon-API-only, and following the websocket breaks easily, especially when we want to do this over months, not hours.

Some things I considered:

I can write (or employ) a crawler: just an old-fashioned spider that crawls the web and uses known fediverse servers as boundaries and as input. This seems clumsy, and it requires a rather heavy spider, as the public side of the fediverse employs a lot of JavaScript to render content.

I can improve the stability and frequency of the current “read from the REST API” method and just accept that we’ll miss out on content.

I can build a bot that follows everyone and then starts indexing their posts as they get federated to me. This makes it hard to distinguish “public” content from more private content: I now get pushed content that mightn’t be meant to go into a (public) search index. AFAICS this is frowned upon by the community for exactly that reason.
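
For what it’s worth, part of that distinction is visible in the addressing: ActivityPub marks public posts by addressing the Public collection in “to” or “cc”. A hedged sketch (the helper names are mine) of filtering on that before anything reaches a public index:

```rust
use serde_json::Value;

// The ActivityStreams Public collection; the spec also allows the short
// forms "as:Public" and "Public".
const PUBLIC: &str = "https://www.w3.org/ns/activitystreams#Public";

fn is_public(value: &Value) -> bool {
    value
        .as_str()
        .map_or(false, |s| s == PUBLIC || s == "as:Public" || s == "Public")
}

/// True if the activity is addressed to the Public collection in `to` or `cc`.
fn is_publicly_addressed(activity: &Value) -> bool {
    ["to", "cc"].iter().any(|field| match activity.get(*field) {
        Some(Value::Array(items)) => items.iter().any(is_public),
        Some(single) => is_public(single),
        None => false,
    })
}
```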

And probably some more strategies.

How would you attack this? What has been tried? Are there existing bots, crawlers or other software that tackle some of the “indexing of the fediverse” already?

FYI, @realaravinth of https://forgeflux.org (forge federation) is writing a crawler, also in Rust. Maybe there’s overlap in ideas.

updated: corrected link

@aschrijver: Thanks for the ping! The website is at https://forgeflux.org :slight_smile:


Hello! :wave:

The spider I’m working on is called Starchart, and it is limited to crawling software forges (Gitea, GitLab, sourcehut, etc.). Since the popular forges don’t yet implement ForgeFed, the work-in-progress ActivityPub-based federation standard for software forges, I am also relying on their REST APIs as the data source.
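
For Gitea, for example, that currently means paging through the public repository search endpoint instead of receiving ForgeFed activities. A rough sketch, assuming an unauthenticated instance (the instance name, crates and page size are illustrative):

```rust
use serde_json::Value;

// One page of Gitea's public repository search API.
// Needs reqwest with the "blocking" and "json" features, plus serde_json.
fn gitea_repos(instance: &str, page: u32) -> Result<Vec<Value>, reqwest::Error> {
    let url = format!("https://{instance}/api/v1/repos/search?page={page}&limit=50");
    let body: Value = reqwest::blocking::get(url)?.json()?;
    // Gitea wraps results in {"ok": ..., "data": [...]}
    Ok(body
        .get("data")
        .and_then(Value::as_array)
        .cloned()
        .unwrap_or_default())
}

fn main() -> Result<(), reqwest::Error> {
    // illustrative instance name
    for repo in gitea_repos("gitea.example.org", 1)? {
        if let Some(name) = repo.get("full_name").and_then(Value::as_str) {
            println!("{name}");
        }
    }
    Ok(())
}
```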

We have some overlapping concerns and here’s our approach to them:

1. Privacy

Some people would prefer not to be spidered, so they should have an option to decline spidering.

In Starchart, a forge admin must sign up for spidering. We use a DNS-based challenge to verify ownership of a forge and admins who want their forge to be spidered should complete the challenge.
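
In rough terms the check boils down to something like the sketch below; the record name, token format and helper are my assumptions for illustration, not necessarily Starchart’s actual layout:

```rust
// Hypothetical helper standing in for a real DNS client (e.g. the
// trust-dns-resolver crate): returns all TXT strings published at `name`.
fn lookup_txt(_name: &str) -> Vec<String> {
    unimplemented!("real code would query DNS here")
}

/// The spider hands the admin a random token; the forge is only enrolled for
/// spidering once that token shows up in a TXT record under the forge's
/// domain. The `_starchart-challenge` label is an assumption for illustration.
fn forge_ownership_verified(forge_domain: &str, expected_token: &str) -> bool {
    let record = format!("_starchart-challenge.{forge_domain}");
    lookup_txt(&record)
        .iter()
        .any(|value| value.trim() == expected_token)
}
```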

I don’t know how to implement user-level consent within the constraints of a software forge.

2. Spiders can be abusive

We plan on implementing rate-limiting and other configuration options using TXT records that forge admins could use to configure how Starchart interacts with their forge:

```
TXT starchart-starchart.forgeflux.org.forge.example.org spidering=false,rate=500
```

This is in the pipeline and is not yet integrated.
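
If it helps to make it concrete, parsing such a record into per-forge settings could be as small as this sketch; the “spidering” and “rate” keys come from the example above, while the struct, the defaults and the rate unit are assumptions:

```rust
#[derive(Debug)]
struct ForgePolicy {
    spidering: bool,
    rate: u32, // assumed unit: maximum requests per hour
}

/// Parse the comma-separated key=value pairs from the TXT record value,
/// e.g. "spidering=false,rate=500", falling back to permissive defaults.
fn parse_policy(txt_value: &str) -> ForgePolicy {
    let mut policy = ForgePolicy { spidering: true, rate: 100 };
    for pair in txt_value.split(',') {
        match pair.trim().split_once('=') {
            Some(("spidering", v)) => policy.spidering = v.eq_ignore_ascii_case("true"),
            Some(("rate", v)) => policy.rate = v.trim().parse().unwrap_or(policy.rate),
            _ => {}
        }
    }
    policy
}
```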

3. Crawled data should be freely shared

To me, federation is about enabling the common person to run their own infrastructure. So, IMO, the spidered data should be made publicly available for download. That way, anyone can bootstrap a new spider with ease.

This feature is, again, very much work in progress. We are using publiccodeyml to publish the data we’ve crawled. Please see here for more information on how it works and here to browse the crawled data.

These are some of the issues that I could think of when I started working on the project. I’m sure plenty more will pop up as the project continues to be developed. :slight_smile:


What I’d recommend is implementing a bot Actor which people can tag in posts (e.g. instead of including #vacancy in a post, they’d mention @vacancy@flockingbird.social), and then you index all posts that arrive in the bot’s inbox.
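
The bot would then simply filter its inbox for Create activities whose Note carries a Mention tag pointing at its own actor. A sketch along those lines, with the actor IRI being a hypothetical one for @vacancy@flockingbird.social:

```rust
use serde_json::Value;

// Hypothetical actor IRI for @vacancy@flockingbird.social
const BOT_ACTOR: &str = "https://flockingbird.social/users/vacancy";

/// True for Create activities whose object carries a Mention tag pointing at
/// the bot's actor, i.e. posts people deliberately tagged for indexing.
fn should_index(activity: &Value) -> bool {
    if activity.get("type").and_then(Value::as_str) != Some("Create") {
        return false;
    }
    activity
        .pointer("/object/tag")
        .and_then(Value::as_array)
        .map_or(false, |tags| {
            tags.iter().any(|tag| {
                tag.get("type").and_then(Value::as_str) == Some("Mention")
                    && tag.get("href").and_then(Value::as_str) == Some(BOT_ACTOR)
            })
        })
}
```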

Maybe create a job postings group?


Thanks for the suggestion.
We do this for candidates already: you’ll have to send a toot to @hunter2@botsin.space including the words “index me”, after which we crawl that user’s profile.

Job postings, IMO, are different in that respect. Most of them explicitly want and need as much exposure as possible. Many actively ask for boosts or retweets, set up syndication to distribute to as many platforms as possible, and so on. Some might not want indexing, but for this use-case that is rare. And if encountered, it can and will be dealt with on a per-case basis, rather than building it in from the start.

Also, and this is personal: toots that are available on the public web should be considered public (on the web). I’ve always considered bots to be “just another actor” or “just another client consuming the content”. I understand this isn’t how everyone looks at the web, but my point is mostly: if you don’t want it crawled or indexed, there are numerous technical things you can easily do to indicate or even prevent this. Mastodon offers those behind a simple switch. In other words: if you don’t want to be indexed, just indicate that you don’t want to be indexed. Everything else, posted publicly, should be considered public. But that’s just one opinion.

I don’t know how to implement user-level consent within the constraints of a software forge.

I guess that depends on how much the consent matters. The simplest solution would be to say: if it is public and no guidelines (robots.txt, rel=nofollow, meta noindex, etc.) are in place, one could interpret that as “consent given”. After all, the data is published on the public web. But that’s more a philosophical question.
The hard technical issue is timeliness: when consent is first given, data is indexed, and then consent is revoked or the data is taken offline, what should we do with the data? It’s an unsolved discussion in the entire federation, actually, WRT deleting, tombstoning, etc.
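
The least a spider can do mechanically is honour Delete activities and Tombstone objects as they arrive; a hedged sketch (the index trait is made up for illustration) of what revocation could look like on the indexing side:

```rust
use serde_json::Value;

/// Stand-in for whatever storage the index actually uses.
trait Index {
    fn remove(&mut self, object_id: &str);
}

/// On an incoming Delete activity (or an object replaced by a Tombstone),
/// drop the object from the index. This doesn't solve mirrored/shared dumps,
/// but at least keeps the live index in sync with revocations.
fn handle_activity(index: &mut impl Index, activity: &Value) {
    let activity_type = activity.get("type").and_then(Value::as_str);
    let object_type = activity.pointer("/object/type").and_then(Value::as_str);

    if activity_type == Some("Delete") || object_type == Some("Tombstone") {
        // `object` may be a bare IRI or an embedded object with an `id`
        let id = activity
            .get("object")
            .and_then(Value::as_str)
            .or_else(|| activity.pointer("/object/id").and_then(Value::as_str));
        if let Some(id) = id {
            index.remove(id);
        }
    }
}
```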

Spiders can be abusive

Certainly. My initial take on this was to err on the safe side and tune a spider so it browses like a “normal human”. A human (using a JavaScript client in a web browser) may scroll and cause a few paginated requests to fire at once, but in general won’t browse the entire website at breakneck pace.

With that in mind, I thought about setting up the crawler in such a way that it can still browse the entire fediverse at breakneck pace (thousands of pages per minute), but spreads the load over each server/domain evenly. In simpler terms: don’t crawl a.com, then b.com, then c.com, etc., but instead mix them up. The crawler keeps its speed, but the load on each separate domain is evenly distributed and spread out.
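
Concretely, that could be a frontier with one queue per host that hands out URLs round-robin; a sketch with illustrative data structures:

```rust
use std::collections::{HashMap, VecDeque};

/// Crawl frontier that keeps one queue per host and hands out URLs
/// round-robin, so overall throughput stays high while each individual
/// domain only sees a slow, evenly spaced trickle of requests.
#[derive(Default)]
struct Frontier {
    queues: HashMap<String, VecDeque<String>>, // host -> pending URLs
    rotation: VecDeque<String>,                // round-robin order of hosts
}

impl Frontier {
    fn push(&mut self, host: &str, url: String) {
        if !self.queues.contains_key(host) {
            self.rotation.push_back(host.to_string());
        }
        self.queues.entry(host.to_string()).or_default().push_back(url);
    }

    /// Pop the next URL, rotating to a different host on every call.
    fn next_url(&mut self) -> Option<String> {
        for _ in 0..self.rotation.len() {
            let host = self.rotation.pop_front()?;
            if let Some(url) = self.queues.get_mut(&host).and_then(|q| q.pop_front()) {
                self.rotation.push_back(host); // host goes to the back of the line
                return Some(url);
            }
            self.queues.remove(&host); // host exhausted, drop it from the rotation
        }
        None
    }
}
```

Per-host crawl delays (or the TXT-based rate limits mentioned above) could then hang off the same per-host queues.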

I like the TXT record idea. But it seems to solve a problem that may not be there in the first place.

Crawled data should be freely shared

I like this idea, but it does make the first part more difficult. E.g. if a dataset is shared and mirrored, or version-controlled, or such, there may never be a way to get your personal data removed from it; there might always be a torrent tracker in some basement seeding that one copy that still holds the data a person wants removed. But then again, like above: if the index contains only public data that was shared publicly and was available on the web for everyone and everything to see, it should be less of a surprise to see it turn up in publicly shared databases.

Licensing of that is another issue, though, I think. Obviously no one handed over the copyright. So if your database contains posts, comments, or other content, you are violating copyright. Moreover, you cannot just redistribute that dataset under ODbL or some such. Quite probably, you cannot redistribute it at all under any licence. Maybe only if the author (or instance) provided a licence under which all their content is published, but I haven’t seen this used in practice on the fediverse (yet).

Overall, thanks for your thoughts. I’ll have a look at the crawler and will certainly check whether this is a project that I should use and contribute to, instead of my dedicated, niche “crawler” that basically just eats REST/JSON results.