Implementing Paginated Inbox Collections

mag · July 7, 2021, 5:08pm

Hi All!

I’ve been enjoying reading through the AS/AP specs, and have started writing my own little project - I might be confident enough to publicise it here once it’s further developed.

Sorry to drop a question as soon as I join the community; I do plan on interacting with you folks in more than just this topic

When reading the specs, specifically the portions on inboxes and collections, I’ve been having trouble figuring out how to make paginated collections when the underlying collection is prone to change frequently.

From the ActivityPub spec:

The inbox is discovered through the inbox property of an actor’s profile. The inbox MUST be an OrderedCollection.

I think I understand this: when an authorised client GETs someone’s inbox URL, they’re sent back an OrderedCollection describing that someone’s inbox. If the content of their inbox is small enough, this response could include references to all of the items in someone’s inbox (or maybe it could even include all of the items), but I’m interested in the case where a person’s inbox is too large to include al of the references in a single response: the case where pagination is required.

Collections with pages can include the properties first, current, and last; the distinction between the first two confuses me.

From the ActivityStreams vocabulary:

current: Notes: In a paged Collection , indicates the page that contains the most recently updated member items.

first: Notes: In a paged Collection , indicates the furthest preceeding page of items in the collection.

In an OrderedCollection in reverse chronological order, are these properties different from one another? In my head, “furthest-preceeding” and “contains the most recently updated member items” can both mean two different things in this situation, and so both properties could mean either of:

The provided reference will always point to the same page of items, as it was the “newest”/“youngest”/“most-recently-updated” page at the time of your GET request to the inbox;
The provided reference is a reference to content that may change over time, because it is a reference to the “newest”/etc. content at the time of your GET request to the page.

I can think of two possibilities that seem sensible to me:

first and current both refer to #2: you can include both with the same value, or omit one;
first refers to #1; current refers to #2.

There are solutions for #1 and #2, so really my issue is just down to what these properties are meant to represent. Are there solid examples of these properties in use that I haven’t come across?

Hopefully there’s enough possible discussion here to warrant this being a “topic” - I’m brand new to Discourse, so I’m not really sure what I’m doing

weex · July 8, 2021, 6:10pm

Hey @mag, Welcome and thank you for raising this question about the protocols! I’m trying to get acquainted with them as well, though I’m not developing anything against the spec currently.

As I look at OrderedCollection, I don’t see that the ordering property to be strictly specified so I think this may turn out to be implementation dependent. The most actively developed projects that support AP include Mastodon, Pleroma, PeerTube, and Pixelfed so if you’re looking for examples those would be good to look at.

mag · July 10, 2021, 12:48am

Thanks for the insight @weex!

With regards to the specific ordering of reverse-chronological OrderedCollections, the AP spec has a note that seems to align with what you’re saying:

What property is used to determine the reverse chronological order is intentionally left as an implementation detail. For example, many SQL-style databases use an incrementing integer as an identifier, which can be reasonably used for handling insertion order in most cases. In other databases, an insertion time timestamp may be preferred. What is used isn’t important, but the ordering of elements must remain intact, with newer items first. A property which changes regularly, such a “last updated” timestamp, should not be used.

Having re-read this note, I think I’ll be using row ID to sort my results - I was previously planning on using insertion time and hoping that not too many items were inserted in the same second

Thanks also for pointing to some project names; I’ll have a look at them in the next couple days and report back here if I find anything useful.

mag · July 19, 2021, 10:34pm

Well this took me longer than I expected to get back to I’ve been thinking more about this, and while this all could be considered quite a niche portion of the spec to agonise over, I do feel like having a consistent way of handling small issues in common situations would be good for implementers of clients. Below I detail a possible issue that can arise from the vague restrictions of the the specification w.r.t. handling user’s inboxes.

A Possible Issue

Say we’re trying to fetch a user’s inbox: this user is very active and so they have thousands of events in their inbox, meaning their server should paginate the results of their inbox (to spread out to load given to the client, and to save data in the case that the client only wants recent entries). The server will serve the client pages at a time, and the client will request the next page when it’s able to receive it. Since the inbox is sorted in reverse-chronological order, new items are added to the top of the list.

In a naïve implementation, where pages contain a fixed number of entries, the client may receive duplicate entries when fetching multiple pages. Consider the following order of events:

Client A requests a user’s inbox from server S. S returns an OrderedCollection, with property "first": ".../inbox?page=1".
A fetches the first page from the given URL; S replies with the 10 most recent items in an OrderedCollectionPage, with property "next": ".../inbox?page=2".
Another server, T, POSTs a new item into the user’s inbox. This new item is now the most recent item in the collection, and so becomes the first item in page 1. Since pages are a fixed size, this means the item that was previously at the bottom of page 1 is pushed onto page 2.
A fetches the second page from the inbox; this second page has 9 new items, but the first item in the page has already been seen by this client, when it fetched page 1.

As I read the AP spec, it is servers that should handle this issue:

The server MUST perform de-duplication of activities returned by the inbox. Duplication can occur if an activity is addressed both to an actor’s followers, and a specific actor who also follows the recipient actor, and the server has failed to de-duplicate the recipients list. Such deduplication MUST be performed by comparing the id of the activities and dropping any activities already seen.

This scenario would also happen in reverse: if a client were to attempt to start at the last page of the Collection and iterate backwards, some items might be skipped as they shift pages.

Example Inbox Implementations

I had a look at the existing projects mentioned by @weex: unless I’m missing something, it seems that Pleroma and PeerTube would both suffer from the issue described.

Mastodon

Mastodon’s documentation on AP support is quite short. The only mention of collections is right at the very bottom of the page:

Mastodon has a concept of “followers-only” posts, but expanding the followers collection is currently handled at the destination rather than at the origin (i.e., not with explicit addressing). Therefore, a mechanism to detect synchronization issues and correct them is needed. This mechanism must work on partial followers collections, rather than the full collection (which may not be public information).

When delivering a message to a remote user, an optional Collection-Synchronization HTTP header is attached, following the same syntax as the Signature header, with the following attributes:

collectionId = MUST be the sender’s followers collection

url = a URL to a partial collection containing the identifiers of the sender’s followers residing on the receiver’s instance. MUST reside on the same domain as the actor itself, and SHOULD be only accessible with a signed query from the receiver’s instance

digest = hexadecimal representation of the XORed SHA256 digests of each of the identifiers in the partial collection

Example:
POST https://mastodon.social/users/foo/inbox
Collection-Synchronization:
  collectionId="https://social.sitedethib.com/users/Thib/followers",
  url="https://social.sitedethib.com/users/Thib/followers_synchronization",
  digest="b08ab6951c7d6cc2b91e17ebd9557da7fae02489728e9332fcb3a97748244d50"
When a remote user attempts to GET the partial collection url , this request must be signed with HTTP signatures. Example:
GET https://social.sitedethib.com/users/Thib/followers_synchronization
Signature: ... # a signature from an account on mastodon.social
{
  "@context": "https://www.w3.org/ns/activitystreams",
  "id": "https://social.sitedethib.com/users/Thib/followers?domain=mastodon.social",
  "type": "OrderedCollection",
  "orderedItems": [
    "https://mastodon.social/users/Gargron"
  ]
}

I couldn’t find the place where Collections and their pages were constructed in the Mastodon codebase (Ruby is not my forte), though the ActivityPub-related code seems to be located in app/serializers/activitypub/ and app/services/activitypub, and tested in spec/services/activitypub. I couldn’t see a mention of activitypub in their Gemfile, either.

Pleroma

I couldn’t find any docs detailing Pleroma’s implementation of ActivityPub, though by looking at their code it looks as though they define the next page of a paginated collection by incrementing the index of the current page:

Map.put(map, "next", "#{iri}?page=#{page + 1}")

PeerTube

PeerTube is much the same as Pleroma, in that their docs don’t seem to include specifics about how they handle CollectionPages, and their code simply increments a page index:

next = baseUrl + '?page=' + (page + 1)

The rest of that function fetches the page based purely on the index, meaning that it likely falls prey to the duplicate-item issue laid out above.

Possible Solutions to the Possible Issue

I’ve also had a little think about ways to avoid this problem on the server side of things:

Have page indices count in the opposite collection, where page 0 contains the oldest items in the Collection (and is therefore the Collection’s last page). The first page’s index can be calculated based on the total number of items in the Collection. This has the issue of making database queries trickier in some setups.
Base the index of the next page on the timestamp of the oldest item in the current page. This way, the contents of the page can be calculated as the X most recent items older than the page’s index.
Have pages’ contents be based on a fixed time span. This way, items are never “pushed” out of a page, but there may be many empty pages, and in the opposite case could be an opening for a DoS attack (by piling a bunch of items into a short time span).

Again, I’d love to hear what others have to say on the issue. For now, I’ll likely make an implementation similar to those I looked at for this post, and make a more watertight solution once I have things operational on a basic level.

grishka · July 19, 2021, 11:41pm

I’d make the inbox page endpoint take the maximum row ID as a query parameter. So for the inbox collection itself you’d do something like SELECT MAX(id) FROM inbox WHERE ... and put that ID into the first page URL. This way you’ll have a stable starting point to paginate from regardless of whether, and how many, new rows are inserted. You’d then advance that to the row that’s immediately next to the last row in the page you’re serving and use that for next.

I have a similar arrangement for outboxes, except it supports paginating in both directions:

github.com

grishka/Smithereen/blob/d88f8455bc37c342ebb65495e54a29e365da9719/src/main/java/smithereen/routes/ActivityPubRoutes.java#L273-L320

    
      
          	public static Object userOutbox(Request req, Response resp) throws SQLException{
          		int id=Utils.parseIntOrDefault(req.params(":id"), 0);
          		User user=UserStorage.getById(id);
          		if(user==null || user instanceof ForeignUser){
          			throw new ObjectNotFoundException();
          		}
          		int _minID=Utils.parseIntOrDefault(req.queryParams("min_id"), -1);
          		int _maxID=Utils.parseIntOrDefault(req.queryParams("max_id"), -1);
          		int minID=Math.max(0, _minID);
          		int maxID=Math.max(0, _maxID);
          		int[] _total={0};
          		List<Post> posts=PostStorage.getWallPosts(user.id, false, minID, maxID, 0, _total, true);
          		int total=_total[0];
          		CollectionPage page=new CollectionPage(true);
          		page.totalItems=total;
          		page.items=new ArrayList<>();
          		try{
          			for(Post post:posts){
          				Create activity=new Create();
          				activity.object=new LinkOrObject(post);

This file has been truncated. show original

mag · July 20, 2021, 11:20am

Thanks @grishka for the suggestion! I hadn’t heard of Smithereen, but having a look now it seems very cool

Looking at the implementation of getWallPosts in the Smithereen codebase, it looks like all user posts are stored in one big table (I’m also doing this for my toy implementation). This would mean that the IDs for a single user’s posts would not be sequential. Are there any downsides to doing things this way? I guess it means clients can’t guess a user’s post IDs (which might be a good thing, really).

Also, I’d just like to clarify: if a request to a Smithereen instance includes a "min_id" query parameter, does the instance return the page of posts whose IDs are as low as possible while being higher than the parameter?

grishka · July 20, 2021, 12:14pm

Yes, I’ll tell you more — the wall_posts table actually contains all posts AND all comments. It’s actually more convenient that way.

As far as clients go, I don’t plan on supporting C2S, but I’ll make a domain-specific client API at some point. The current web UI is rendered on the server side.

No, they’re always sorted by newest first. Having min_id allows starting from the oldest posts (last page) and working your way up to newest (first page).

Actually, it’s strange that I’m not returning the last field in the collection
Also the try-catch, it’s ugly, should use URI.create instead because these URIs are known to be valid (they were validated before they’ve been put into the database). So this is in need of a bit of a rewrite.

weex · July 21, 2021, 6:01am

Very interesting. I’ve created an issue in the Pleroma git and pinged the PeerTube irc channel since they have some issue templates that a potential issue doesn’t quite fit into.