I’m writing a bot that scouts the fediverse for “job postings” and then lets people search for those jobs on a web page.
I’m looking for ways to “scout and index the public fediverse” more reliably and more in line with ActivityPub.
First and foremost: yes, there are social issues with indexing the fediverse. People don’t want to be indexed, some instances ban bots, others deploy tech to discourage spiders, and so on. This thread is not so much about discussing whether or not indexing is evil in itself. So, for the sake of this thread, let’s presume crawling and indexing can have valid use-cases, such as providing a job board of jobs advertised on the fediverse. Many other use-cases are possible (index book reviews, find toots that mention your brand so you can engage, find users with similar interests, find people with certain skills who are looking for a job, etc.). And let’s presume that the indexing strategy follows the privacy settings provided by users and instances, and ignores any content and profiles that don’t want to be indexed.
Given that there are proper use-cases, how would we best attack this problem?
What my bot does now is both naïve and flaky: it just uses the Mastodon REST API of the instance it’s registered at (botsin.space) to search for hashtags and walk over the toots in those hashtags, several times a day. Additionally, it follows the public websocket stream to index toots as they appear.
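The hashtag-walking part above could be sketched roughly like this, assuming a Mastodon-compatible endpoint such as `GET /api/v1/timelines/tag/<tag>` with a `since_id` parameter. `fetch_page` is injected (in a real bot it would wrap an HTTP GET against the instance’s REST API), so the cursor logic itself can be tested offline:

```python
from typing import Callable, List, Optional, Tuple

def poll_hashtag(
    fetch_page: Callable[[str, Optional[str]], List[dict]],
    tag: str,
    since_id: Optional[str] = None,
) -> Tuple[List[dict], Optional[str]]:
    """Fetch statuses for `tag` newer than `since_id`; return them and the new cursor."""
    statuses = fetch_page(tag, since_id)
    if statuses:
        # Mastodon status ids sort as integers; remember the newest as the cursor
        # so the next poll only asks for toots we haven't seen yet.
        since_id = max(statuses, key=lambda s: int(s["id"]))["id"]
    return statuses, since_id
```

Each poll passes the cursor from the previous one, so repeated runs only pick up new toots on that instance — which is exactly why it misses anything that never federates there.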
It is naïve because it (obviously) misses a lot of content: only content that is federated to the bot’s own server gets indexed. It is flaky because it is Mastodon-API-only, and following the websocket breaks easily, especially when we want to keep it running for months, not hours.
Some things I considered:
I can write (or employ) a crawler. Just an old-fashioned spider that crawls the web, using known fediverse servers as both input and boundaries. This seems clumsy, and requires a rather heavy spider, since the public web side of the fediverse relies heavily on JavaScript to render content.
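The “known servers as boundaries” rule from that option is straightforward to express: only follow links whose host is on an allow-list of known instances. The host names below are placeholders for whatever seed list the crawler starts from:

```python
from urllib.parse import urlsplit

# Example allow-list; a real crawler would load this from its instance database.
KNOWN_INSTANCES = {"mastodon.social", "botsin.space"}

def in_scope(url: str, known_hosts=KNOWN_INSTANCES) -> bool:
    """True if `url` points at one of the known fediverse instances."""
    host = (urlsplit(url).hostname or "").lower()
    return host in known_hosts
```

One way to sidestep the JavaScript-rendering problem, by the way, is that Mastodon (and most ActivityPub servers) will serve the underlying JSON for a status or actor URL when the request carries an `Accept: application/activity+json` header, so the spider need not render HTML at all.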
I can improve the stability and frequency of the current “read from the REST API” method and just accept that we’ll miss out on content.
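Most of the stability work for a months-long poll/stream loop boils down to reconnecting with capped exponential backoff plus jitter. A generic sketch (the base and cap values are arbitrary, not from any particular library):

```python
import random

def backoff_delay(attempt: int, base: float = 2.0, cap: float = 300.0,
                  rng=random) -> float:
    """Seconds to sleep before retry number `attempt` (0-based), "full jitter" style."""
    ceiling = min(cap, base * (2 ** attempt))
    # Randomising over the full window avoids reconnect stampedes when an
    # instance comes back up and every client retries at once.
    return rng.uniform(0.0, ceiling)
```

The caller resets `attempt` to 0 after a successful fetch or a websocket message, so a healthy connection never pays the penalty.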
I can build a bot that follows everyone and then indexes their posts as they get federated to me. This makes it hard to distinguish “public” content from more private content: I’d now be pushed content that might not be meant to go into a (public) search index. For this reason, AFAICS, this approach is frowned upon by the community.
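For the public/private distinction, ActivityPub does give a mechanical signal: public posts are addressed to the special Public collection in `to` or `cc` (the spec allows three spellings). A sketch of filtering pushed activities so only explicitly public ones reach the index:

```python
PUBLIC_ALIASES = {
    "https://www.w3.org/ns/activitystreams#Public",
    "as:Public",
    "Public",
}

def _as_list(value):
    # `to`/`cc` may be absent, a single string, or a list.
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

def is_public(obj: dict) -> bool:
    """True if the object is explicitly addressed to the Public collection."""
    audience = _as_list(obj.get("to")) + _as_list(obj.get("cc"))
    return any(a in PUBLIC_ALIASES for a in audience)
```

Note that Mastodon additionally distinguishes “public” (Public in `to`) from “unlisted” (Public only in `cc`), so an index that wants to respect the unlisted setting would check `to` alone — though addressing doesn’t capture per-user “don’t index me” preferences, which is why the follow-everyone approach still draws criticism.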
And probably some more strategies.
How would you attack this? What has been tried? Are there existing bots, crawlers, or other tools that already tackle some of this “indexing the fediverse” problem?