Querying ActivityPub collections

grishka · June 19, 2021, 2:18am

It’s me again with my efforts to extend collections. This is somewhat of a continuation of the discussion started in the #fep-400e topic, but I feel like it warrants having a topic (and at some point a FEP) of its own.

Sometimes, it’s useful to have the ability to issue a single request to ask a remote instance “does this collection contain this object?” for an authoritative answer. This can be useful both for collection synchronization and to enforce bilateral relationships.

Example #1. There’s a wall post made by a user from instance A on the wall of a group on instance B (in accordance with my #fep-400e). The user from instance A mentions a user from instance C in a reply to that post. Instance C fetches both the post and the reply to display them to its local user in a notification. How does instance C verify that instance B has approved that parent post?

Example #2. A user from instance A and a user from instance B add each other as friends. They each have a “friends” collection that contains all such connections. Each sends an Add activity to their followers to keep friend lists in sync across instances. How do recipients verify that this relationship was actually established on the other end if they only follow one of the users?

Currently, there’s no non-ugly way to do this — the only way you can get something out of a collection is by going through its pages. I’d like to discuss an easy way to query a collection, for example, using a well-defined query parameter or an HTTP header. You’d send a request to the collection ID URL, specifying the ID of the object you’re checking for, and the response would be one of three discernible states:

The collection contains the object
The collection does not contain the object
Inconclusive — the server doesn’t support responding to these requests
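
To make the three states above concrete, here is a minimal client-side sketch in Python. Nothing in it is specified anywhere yet: the query parameter name `contains` and the boolean response field `containsResult` are hypothetical placeholders chosen purely for illustration. The sketch assumes that a server which does not support the query simply ignores the unknown parameter and returns the collection as usual, so the absence of the field is read as "inconclusive".

```python
from enum import Enum
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical names, not taken from any spec or FEP.
QUERY_PARAM = "contains"
RESULT_FIELD = "containsResult"


class Membership(Enum):
    CONTAINS = "contains"
    DOES_NOT_CONTAIN = "does_not_contain"
    INCONCLUSIVE = "inconclusive"


def build_query_url(collection_id: str, object_id: str) -> str:
    """Append the (hypothetical) membership parameter to the collection ID URL."""
    parts = urlsplit(collection_id)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((QUERY_PARAM, object_id))
    return urlunsplit(parts._replace(query=urlencode(query)))


def interpret_response(status: int, body: object) -> Membership:
    """Map an HTTP status and decoded JSON body onto the three states."""
    if status != 200 or not isinstance(body, dict):
        return Membership.INCONCLUSIVE
    answer = body.get(RESULT_FIELD)
    if answer is True:
        return Membership.CONTAINS
    if answer is False:
        return Membership.DOES_NOT_CONTAIN
    # Field missing or malformed: the server probably ignored the parameter.
    return Membership.INCONCLUSIVE
```

A caller would fetch `build_query_url(collection_id, object_id)` with its usual signed GET and pass the status and decoded body to `interpret_response`. Treating anything unexpected as inconclusive lets the caller fall back to paging through the collection.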

naturzukunft · June 19, 2021, 5:28am

“using a well-defined query parameter or an HTTP header.” This sounds to me like the beginning of a new query language. To query triples, there is already the standard SPARQL.

I have a similar problem. An actor sets a task as an object of an activity and the actor that processes this task needs to be able to query for open tasks, for example.

I had thought about limiting this to collections, but decided to query the whole storage. (This was also easier for me at first.) However, I could still filter this by collection, which is certainly relevant for some use cases. There are probably also use cases where a query across collections is desired.

My ActivityPub C2S interface has an endpoint that accepts SPARQL queries and returns the result as SPARQL bindings.

naturzukunft · June 19, 2021, 5:37am

This became exciting for me because I assigned the task of an actor A, who created the task, to another actor B. So as:to and wf:assigned was ‘Actor B’ and now Actor B is my processing service.

But if I only store references in the inbox of Actor B instead of duplicating all activities and objects in my database, then I can’t query the activities and objects at all. I would have to send a query to distributed creator servers ;-(

Also a cache will be weird, as my cache will be organised by AP server rather than by inbox of a specific actor.

The question of whether an inbox contains objects or references is left to each implementation. Both have advantages and disadvantages.
Copying all references into my database somehow doesn’t feel like linked data.

@cjs has noted in another task that an inbox contains only references, not objects. (if I remember correctly)

grishka · June 19, 2021, 1:11pm

No. Absolutely, categorically not. I mean for this to do one thing and one thing only. And it should be trivial to implement. It definitely shouldn’t require a library, or a lengthy spec with hundreds of unit tests. I’ve already had enough “fun” implementing JSON-LD processing algorithms. Except that a bug in a query language implementation also has great potential for security consequences.

So, no, I don’t want a query language. I don’t want this to be much extensible, if at all.

And now you’ve met the exact reason why I think c2s can’t be made practical. You have to make way too many requests to way too many servers to render something like a news feed. You’d need to fetch your inbox, filter the activities you want, issue a request for potentially each actor you encounter… Now imagine doing that on a crappy EDGE connection. Good luck.

If you used a domain-specific client API instead, like Mastodon does and like I’m planning to do, that would’ve been a single request, with response containing everything you need, and the server itself handling the rest behind the scenes.

aschrijver · June 21, 2021, 5:59am

I don’t know the applicability of this, but just wanted to mention the Meld protocol here (project will be presenting at @j8ter NGI Linked Data webinar at 9.30 CEST).

At the heart of m-ld is a decentralised protocol for distributing live state among clones. Using m-ld, every app instance has read-write access to the shared information via its local clone, with zero network latency. Changes to the information are propagated to all other app instances, so they are all eventually consistent.

The specification includes a query language json-rql that is a superset of JSON-LD, designed for query expressions.

It’s JSON: straightforward to construct in code, manipulate and serialize, and also to constrain. Use standard JSON tooling such as JSON schema to limit your API to the queries that your back-end has been designed and tested for.

It’s SPARQL: in context, all queries can be translated to the W3C standard language for directed, labeled graph data. This means that your API can be extended to cover future query requirements, without breaking changes.

With Meld, apps use one of available Message-layer adapters to synchronise clones of ‘domain’ data. It seems like this layer may as well be an ActivityPub adapter. A Java platform implementation is apparently in the works.

gsvarovsky · June 21, 2021, 11:30am

Thanks for the shout Arnold, and for posting the links. I don’t know enough about ActivityPub to really get the details of the use-case here but I’ll make some comments about json-rql and m-ld in case they’re of interest.

Judging by this:

No. Absolutely, categorically not [a new query language]

I think, a little contrarily, that json-rql might be something interesting to you. Its principle is very simple: just encode SPARQL in JSON. But the key contribution is that this not only makes it easy to embed, serialise and parse, but also you can easily document which parts of it your service supports – even if it’s very little. There’s some narrative about this on the project wiki.

Anyway happy to discuss further if you like the idea. There’s a Java library available, which has not had much love for a while but would be easy to re-invigorate.

In principle json-rql is agnostic about whether you’re on the client or the server (it’s just a serialisation of a language), so it should lend itself fine to having the application and the deployment making decisions about where to run the query. (Not really different to SPARQL in this regard – just maybe a bit easier to work with, especially if the only query supported is really simple.)

m-ld is a bit more of a machine with moving parts of its own. In effect it’s like a local database, with a json-rql API, which incidentally is very inclined to clone itself all over the place.

The Java engine for m-ld does exist, but needs quite a bit of work. I was maybe too hopeful last year for how much I would be able to do. Use-cases would help me prioritise, for sure!

~~With regard to this, yes, an ActivityPub adapter might be a great thing for me to have.~~

Edit: Having read a little more, I think this sentence does not really make sense – I was assuming some peer-to-peer messaging going on, but I think ActivityPub is strongly client-server and a standard for activity publication (duh), agnostic to low-level message distribution. Let me know if I’m mistaken.

I’ve just added a socket.io adapter (still incubating – I’ll update when it’s baked), but I still don’t have a great story for a fully decentralised messaging approach (even though m-ld itself is uninterested in servers or central coordination).

naturzukunft · June 21, 2021, 1:21pm

My ActivityPub (AP) implementation uses RDF4J as its RDF store. One use case in the client-to-server (C2S) part of AP is that a client app sends an Activity to the outbox of an Actor.
E.g. a web app calls the C2S API of an AP implementation and sends an activity (including its objects) to an actor’s outbox.
An activity can have recipients, like an email: to, cc, bto, bcc, …
The server-to-server (S2S) part then defines delivery to the recipients, which means each recipient gets a reference to the sent activity in their inbox. See this picture: https://activitypub.rocks/

Now I have created a way to query the outbox of an actor with SPARQL. This works well. But queries on the inbox are difficult, because it holds only references. Hence the idea of managing a copy of all activities and objects. aschrijver mentioned Meld as a possible solution, but I see a lot of problems or challenges,
because it’s not just one copy: it’s a copy of some activities from different outboxes on different servers.

grishka · June 21, 2021, 1:40pm

If it needs a library, it’s not simple enough for my needs.

grishka · June 21, 2021, 1:42pm

@marius showed me a draft of a draft of his FEP that might be relevant to this discussion, and it is simple enough and doesn’t try its hardest to be SQL:

gsvarovsky · June 21, 2021, 2:04pm

Ah! Then in retrospect I should not have mentioned it. The library is only for convenience if you want to translate from json-rql to SPARQL or otherwise handle arbitrarily complex json-rql. Which I do – but you probably don’t. The language is just a simple way of expressing SPARQL in JSON. That’s it – you can interpret the JSON yourself of course, especially if it’s only a small query.

grishka · June 21, 2021, 2:22pm

That’s still way too complex for what I’m trying to achieve. Marius’s solution is also kinda complex in that it allows querying on arbitrary fields (on which I might not have indexes), but at least doesn’t support joining, sorting, or aggregation.

Sebastian · June 22, 2021, 2:52pm

Well, I came to the conclusion that we can use a subset of JSON-Schema or better JSON-LD Framing for filtering and querying.

I agree with the conclusion in this article:

Especially because “Framing” is well specified in the underlying specs.

See
https://www.w3.org/TR/json-ld11-framing/#matching-on-properties

Opinions?

naturzukunft · June 22, 2021, 6:13pm

I think it’s a shame when applications that use a triple store have to struggle to provide the ActivityPub API. I don’t like this JSON-LD thing, and for me it is more trouble than fun.
To me, ActivityPub sounds more and more like an RDF-to-JSON bridge.

marius · June 23, 2021, 8:57am

@grishka I didn’t reply on Mastodon, but there is a way to automatically populate columns in PostgreSQL/SQLite based on extracting data from JSON, so maybe that can allow you to modify your model to include more columns/indexes.

For my own application, in the case of relational databases storage, I’m thinking of using dual filtering:

Filtering based on existing columns in the db, which can take advantage of indexes and such (ie, generating an SQL query)
Filtering the remaining criteria by iterating over the returned set/collection and further removing what doesn’t match.

Also I have an sqlite example for the basic tables I’m using in fedbox using this approach. This is work in progress, so some things (like indexes) are missing.
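
The dual filtering described above could be sketched roughly as follows, using Python’s standard `sqlite3` module. This is not the fedbox schema or code: the `activities` table, its `type` and `actor` columns, the `body` column holding the raw JSON, and the helper names are all assumptions made up for illustration. Criteria that match a real column go into the SQL `WHERE` clause, and everything else is checked in application code against the decoded JSON of the rows that come back.

```python
import json
import sqlite3

# Hypothetical: the only properties stored as real (indexable) columns.
INDEXED_COLUMNS = ("type", "actor")


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with an illustrative schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE activities (id TEXT PRIMARY KEY, type TEXT, actor TEXT, body TEXT)"
    )
    conn.execute("CREATE INDEX activities_type_actor ON activities (type, actor)")
    return conn


def store(conn: sqlite3.Connection, activity: dict) -> None:
    """Store the raw JSON and copy the indexed properties into columns."""
    conn.execute(
        "INSERT INTO activities (id, type, actor, body) VALUES (?, ?, ?, ?)",
        (activity["id"], activity.get("type"), activity.get("actor"), json.dumps(activity)),
    )


def query(conn: sqlite3.Connection, filters: dict) -> list:
    """Filter in SQL where a column exists, then in Python for the rest."""
    in_sql = {k: v for k, v in filters.items() if k in INDEXED_COLUMNS}
    in_code = {k: v for k, v in filters.items() if k not in INDEXED_COLUMNS}

    # Step 1: column names come from the fixed whitelist; values are bound.
    sql = "SELECT body FROM activities"
    if in_sql:
        sql += " WHERE " + " AND ".join(f"{column} = ?" for column in in_sql)
    sql += " ORDER BY id"
    rows = conn.execute(sql, list(in_sql.values()))

    # Step 2: drop rows that fail the remaining, non-indexed criteria.
    matches = []
    for (body,) in rows:
        activity = json.loads(body)
        if all(activity.get(k) == v for k, v in in_code.items()):
            matches.append(activity)
    return matches
```

For example, `query(conn, {"type": "Create", "inReplyTo": some_id})` narrows by `type` in SQL and then checks `inReplyTo` in Python. The second step only does top-level equality checks; nested properties or array values would need more work.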

Sebastian · June 23, 2021, 10:41am

Hey @marius –

thanks for posting and also thank you for the FEP.
Would like to get your opinion on the above posted article.

ActivityPub is JSON-LD and

JSON-LD Framing allows developers to query by example and force a specific tree layout to a JSON-LD document.
JSON-LD 1.1 Framing

I mean, it is maybe not as easy-to-use as the FEP but we can basically use it to query anything, also nested and it’s already a W3C recommendation.

marius · June 23, 2021, 3:44pm

Hi @Sebastian, it’s the first time I’ve seen this document, I wasn’t aware that there’s an attempt at producing a querying specification for JSON-LD.

On a (very) quick glance, it seems like a step in the direction of GraphQL, which I personally don’t like. I consider it to be overkill for this specific use case, querying an ActivityPub server’s collection. It might be useful if there are libraries that seamlessly integrate the framing querying mechanism on top of some specific RDF storage, but I wouldn’t try to implement something like this from scratch on my own.

I think most developers in the AP space are pretending pretty hard that the -LD part is absent from the documents they process and treat everything like plain JSON. I know I do.

PS. The FEP draft-draft is pretty far from being ready. I haven’t finished putting on paper all the details. I offered it to Grishka as he wanted ideas for something he can use immediately and it’s probably sufficient for that purpose.

grishka · June 26, 2021, 2:43am

Since I do actually run JSON-LD processing algorithms on everything I receive to convert it to my “local context”, at least twice now someone has forgotten to add "https://w3id.org/security/v1" to their @context, causing my code to not see the public key of an actor.

aschrijver · June 26, 2021, 3:28am

I didn’t verify if/how this is handled, but this is where @dansup’s fedidb.org and @cjs’s test suite can be very useful in conformance reporting. We should encourage app devs to run these.

aschrijver · June 27, 2021, 12:47pm

5 posts were merged into an existing topic: Introducing FediDB - DevTools for ActivityPub

grishka · May 6, 2022, 6:20pm

Anyway. It seems easier to gather feedback after you’ve done something rather than when you’re only planning to do it.

Incomplete and unreleased yet, but you get the idea.