Dropsitenews has published a list (from a Meta whistleblower) of “the roughly 100,000 top websites and content delivery network addresses scraped to train Meta’s proprietary AI models” – including quite a few fedi sites. Meta denies everything of course, but they routinely lie through their teeth so who knows. In any case, whether or not the specific details in the report are accurate, it’s certainly a threat worth thinking about.
Here’s a Mastodon thread asking what (if anything) instance admins are doing to defend against this kind of scraping. And it’s also potentially a question at the fedi software level: what kinds of defenses can be built in – a sample robots.txt, an installation choice to enable firewall-level blocking of IP ranges, integration with Anubis (or at least instructions to integrate it), etc etc etc?
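As a sketch of the robots.txt idea: a starter file might disallow the crawler user-agents that AI companies have publicly documented. The bot names below are ones that have been documented by their operators (meta-externalagent is Meta's stated AI-training crawler), but the list changes often, and robots.txt only deters well-behaved crawlers – it's a polite request, not an enforcement mechanism.

```
# Sample robots.txt: opt out of known AI-training crawlers.
# Only honored by well-behaved bots; combine with firewall rules
# or a proof-of-work gate like Anubis for actual enforcement.

User-agent: GPTBot
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

A fedi server could ship something like this as a default and let admins extend the list.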
Thoughts on either of those, or related subjects, welcome!
My response takes a somewhat different line of thought than your thread perhaps intended. Taking quotes from the toot…
Are there other approaches that you think might be promising that you just haven’t had the time or resources to try?
The fediverse represents a veritable goldmine for AI scrapers, with all the natural human conversation taking place on huge public social squares. Public squares where people stand on their soapbox and address following crowds and random passers-by are nice. There’s certainly a need for them, and I enjoy my Mastodon time a lot. Yet there is very little equivalence between how we engage socially on the fediverse and how we do social networking in real life.
A future social web should, imho, explore Personal social networking to become better tuned to the needs of people in their everyday life. In real life we know how to choose our friends and avoid our enemies; if someone is a thief, we have locks on our doors to keep them out. Personal social networking moves towards a fedi that has an intricate network of small communities and social groups in all kinds of relationships and overlaps with each other.
It means not everything is public by default. In real life we find that normal, but online we show a weird resistance to the idea. The places that are not public are the places where we are protected from scraping AIs.
Do you have any language in your terms of service that attempts to prohibit training for AI?
There are a couple of initiatives that investigate new license types, specifically designed to prohibit AI training. The ones I know of are:
(As part of Social coding commons I’m maintaining a list thread with various innovative licensing schemes more suitable for sustainable FOSS projects, or ‘Sustainable open social systems’ as I call them, i.e. SOSS initiatives.)
I didn’t realize that @next@fedi.copyleft.org had relaunched, that’s certainly worth following – thanks! Cara is another example of a license where they’ve put a lot of thought into the AI issues – https://cara.app/terms The Cyber License project, however, is discontinued: “While we thought we were onto something innovative, it’s become clear that Cyber License might have been more of a miss than a hit.”
Agreed that there’s a lot more value to social networking than “the public square”, and that non-public data is much more resistant to AI scraping, both technically and legally. In general I think this is the sweet spot for fedi, and projects like Bonfire, GoToSocial, and Frequency really get that. Alas, most people in today’s fedi are on Mastodon, which doesn’t even have local-only posting.