ActivityPub data dumps?

Hi!
I’m researching the performance of streaming RDF systems, and I’m very interested in gathering as many datasets for the benchmarks as possible. The benchmark suite I’m working on is called RiverBench.

As far as I understand ActivityPub, the core of the protocol is a mechanism for inbox and outbox streams, which consist of messages in JSON-LD. JSON-LD of course is RDF, so this counts as an RDF stream :slight_smile: and it’s a very interesting use case.

My question is: are there any public dumps of ActivityPub data? Basically, I’m looking for an archive of a bunch of these JSON-LD messages. If not, could you point me to some APIs that I can scrape to get this data?

There is unfortunately the question of licensing… to make this benchmark data useful, it needs to be under some kind of an open license (e.g., CC BY). I have zero idea if there are any projects in the Fediverse that specify open licenses for user-created content… If it’s an optional thing (i.e., some users choose to use an open license), then I can filter the data to only include those.

I’m not entirely sure this is directly the case. You can have a look at an example outbox here:

https://mastodon.social/users/trwnh/outbox

ActivityPub doesn’t directly deal with streams of data; rather, it has the concept of a Collection (which is loosely related to an LDP Container but isn’t the same thing), and the Collection can be paged using certain properties (as:first, as:last, as:next, as:prev, and so on). I think it would take some pre-processing to treat “paging a Collection” as “streaming RDF”.
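To give a rough idea of that pre-processing, here is a minimal Python sketch that walks a paged Collection and yields its items one by one. It assumes compacted ActivityStreams JSON where `first` and `next` are either URLs or embedded page objects and where `items`/`orderedItems` are lists. The names `fetch_json` and `iter_collection` are made up for illustration, and real servers may need signed requests or rate limiting, which this ignores.

```python
import json
from typing import Callable, Iterator
from urllib.request import Request, urlopen


def fetch_json(url: str) -> dict:
    """Hypothetical helper: GET a URL and parse it as ActivityStreams JSON."""
    req = Request(url, headers={"Accept": "application/activity+json"})
    with urlopen(req) as resp:
        return json.load(resp)


def _items(node: dict) -> list:
    # Assumption: items are stored under "orderedItems" or "items" as a list.
    return node.get("orderedItems", node.get("items", []))


def iter_collection(
    url: str,
    fetch: Callable[[str], dict] = fetch_json,
    max_pages: int = 100,
) -> Iterator[dict]:
    """Yield items from a Collection by following first/next links."""
    collection = fetch(url)
    # An unpaged Collection carries its items directly.
    yield from _items(collection)

    page = collection.get("first")
    seen = 0
    # max_pages is a safety cap against very long or cyclic collections.
    while page is not None and seen < max_pages:
        if isinstance(page, str):
            # The page is referenced by URL rather than embedded.
            page = fetch(page)
        yield from _items(page)
        page = page.get("next")
        seen += 1
```

Passing `fetch` in as a parameter means the paging logic can be tried against canned pages before pointing it at a live server. Each yielded item could then be handed to a JSON-LD processor to get the RDF view of it.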

I grant you permission to page through my outbox (linked above), under CC BY-NC-SA. Hopefully this is open enough?

There are many ways in which you can view a piece of data :wink: For me, a sequence of RDF documents is a stream. It will require some light preprocessing, but that’s fine.

Thanks! But no, sorry, it’s not. I would need this to be at least as open as CC BY-SA; the non-commercial clause is pretty restrictive when it comes to benchmarks that may actually be used in industry. I absolutely understand why you would not want to publish your social media posts under such a permissive license (to be honest, I would even consider adding an ND clause…).

Maybe I should rephrase the question: is there any use case in the Fediverse where the stuff put into ActivityPub is freely licensed by default? The only similar thing that comes to my mind is the discussion pages on Wikimedia projects (e.g., Wikipedia), but this is of course not an ActivityPub thing.