Implementing Paginated Inbox Collections

Well this took me longer than I expected to get back to :stuck_out_tongue: I’ve been thinking more about this, and while this could all be considered quite a niche portion of the spec to agonise over, I do feel like having a consistent way of handling small issues in common situations would be good for implementers of clients. Below I detail a possible issue that can arise from the vague restrictions of the specification w.r.t. handling users’ inboxes.

A Possible Issue

Say we’re trying to fetch a user’s inbox: this user is very active and so they have thousands of events in their inbox, meaning their server should paginate it (to spread out the load placed on the client, and to save data in the case that the client only wants recent entries). The server will serve the client one page at a time, and the client will request the next page when it’s ready to receive it. Since the inbox is sorted in reverse-chronological order, new items are added to the top of the list.

In a naïve implementation, where pages contain a fixed number of entries, the client may receive duplicate entries when fetching multiple pages. Consider the following order of events:

  1. Client A requests a user’s inbox from server S. S returns an OrderedCollection, with property "first": ".../inbox?page=1".
  2. A fetches the first page from the given URL; S replies with the 10 most recent items in an OrderedCollectionPage, with property "next": ".../inbox?page=2".
  3. Another server, T, POSTs a new item into the user’s inbox. This new item is now the most recent item in the collection, and so becomes the first item in page 1. Since pages are a fixed size, this means the item that was previously at the bottom of page 1 is pushed onto page 2.
  4. A fetches the second page from the inbox; this second page has 9 new items, but the first item in the page has already been seen by this client, when it fetched page 1.
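
To make that concrete, here’s a small TypeScript sketch (the item names and page function are my own, purely illustrative) of how fixed-size, index-based pages produce the duplicate:

// `inbox` is sorted reverse-chronologically: newest item first.
const PAGE_SIZE = 10;

// Naive fixed-offset pagination: page n is just a slice by index.
function page(inbox: string[], n: number): string[] {
  return inbox.slice((n - 1) * PAGE_SIZE, n * PAGE_SIZE);
}

let inbox = Array.from({ length: 25 }, (_, i) => `item-${25 - i}`);

const page1 = page(inbox, 1);  // item-25 … item-16
inbox = ['item-26', ...inbox]; // server T POSTs a new item mid-fetch
const page2 = page(inbox, 2);  // item-16 … item-7; item-16 appears again

console.log(page1[PAGE_SIZE - 1] === page2[0]); // true: a duplicate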

As I read the AP spec, it is servers that should handle this issue:

The server MUST perform de-duplication of activities returned by the inbox. Duplication can occur if an activity is addressed both to an actor’s followers, and a specific actor who also follows the recipient actor, and the server has failed to de-duplicate the recipients list. Such deduplication MUST be performed by comparing the id of the activities and dropping any activities already seen.
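
Taken at face value, that id-based de-duplication is straightforward to sketch; seenIds and dedupe below are names of my own invention:

// Minimal sketch of the id-based de-duplication the spec describes:
// remember ids already delivered and drop repeats across pages.
const seenIds = new Set<string>();

function dedupe<T extends { id: string }>(activities: T[]): T[] {
  return activities.filter((activity) => {
    if (seenIds.has(activity.id)) return false;
    seenIds.add(activity.id);
    return true;
  });
}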

The duplication scenario above can also happen in reverse: if a client were to start at the last page of the Collection and iterate backwards, some items might be skipped as they shift between pages.

Example Inbox Implementations

I had a look at the existing projects mentioned by @weex: unless I’m missing something, it seems that Pleroma and PeerTube would both suffer from the issue described.

Mastodon

Mastodon’s documentation on AP support is quite short. The only mention of collections is right at the very bottom of the page:

Mastodon has a concept of “followers-only” posts, but expanding the followers collection is currently handled at the destination rather than at the origin (i.e., not with explicit addressing). Therefore, a mechanism to detect synchronization issues and correct them is needed. This mechanism must work on partial followers collections, rather than the full collection (which may not be public information).

When delivering a message to a remote user, an optional Collection-Synchronization HTTP header is attached, following the same syntax as the Signature header, with the following attributes:

  • collectionId = MUST be the sender’s followers collection
  • url = a URL to a partial collection containing the identifiers of the sender’s followers residing on the receiver’s instance. MUST reside on the same domain as the actor itself, and SHOULD be only accessible with a signed query from the receiver’s instance
  • digest = hexadecimal representation of the XORed SHA256 digests of each of the identifiers in the partial collection

Example:

POST https://mastodon.social/users/foo/inbox
Collection-Synchronization:
  collectionId="https://social.sitedethib.com/users/Thib/followers",
  url="https://social.sitedethib.com/users/Thib/followers_synchronization",
  digest="b08ab6951c7d6cc2b91e17ebd9557da7fae02489728e9332fcb3a97748244d50"

When a remote user attempts to GET the partial collection url, this request must be signed with HTTP signatures. Example:

GET https://social.sitedethib.com/users/Thib/followers_synchronization
Signature: ... # a signature from an account on mastodon.social
{
  "@context": "https://www.w3.org/ns/activitystreams",
  "id": "https://social.sitedethib.com/users/Thib/followers?domain=mastodon.social",
  "type": "OrderedCollection",
  "orderedItems": [
    "https://mastodon.social/users/Gargron"
  ]
}
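
Presumably the receiving side then recomputes the same digest over the followers it has on record and, on a mismatch, fetches the partial collection to resynchronise. Something like the following, reusing the hypothetical collectionDigest from the sketch above:

function needsResync(localFollowerIds: string[], headerDigest: string): boolean {
  return collectionDigest(localFollowerIds) !== headerDigest;
}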

I couldn’t find the place where Collections and their pages are constructed in the Mastodon codebase (Ruby is not my forte), though the ActivityPub-related code seems to be located in app/serializers/activitypub/ and app/services/activitypub, and tested in spec/services/activitypub. I couldn’t see a mention of activitypub in their Gemfile, either.

Pleroma

I couldn’t find any docs detailing Pleroma’s implementation of ActivityPub, though from their code it appears that they define the next page of a paginated collection by incrementing the index of the current page:

Map.put(map, "next", "#{iri}?page=#{page + 1}")

PeerTube

PeerTube is much the same as Pleroma, in that their docs don’t seem to include specifics about how they handle CollectionPages, and their code simply increments a page index:

next = baseUrl + '?page=' + (page + 1)

The rest of that function fetches the page based purely on the index, meaning that it likely falls prey to the duplicate-item issue laid out above.

Possible Solutions to the Possible Issue

I’ve also had a little think about ways to avoid this problem on the server side of things:

  • Have page indices count in the opposite direction, where page 0 contains the oldest items in the Collection (and is therefore the Collection’s last page). The first page’s index can then be calculated from the total number of items in the Collection. This has the issue of making database queries trickier in some setups.
  • Base the index of the next page on the timestamp of the oldest item in the current page. This way, the contents of a page can be calculated as the X most recent items older than the page’s index (sketched after this list).
  • Have pages’ contents be based on a fixed time span. This way, items are never “pushed” out of a page, but there may be many empty pages, and in the opposite case this could open the door to a DoS attack (by piling a bunch of items into a short time span).
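
For the second option, here’s a minimal sketch of what I have in mind (TypeScript; the shapes and names are my own, not taken from any of the projects above):

interface InboxItem {
  id: string;
  published: Date; // when the item arrived in the inbox
}

const PAGE_SIZE = 10;

// `items` is sorted reverse-chronologically. Rather than an index, the
// "next" link carries the timestamp of the oldest item on this page, so
// a page's contents are always the PAGE_SIZE most recent items older
// than the cursor; new arrivals at the top can no longer shift them.
function fetchPage(items: InboxItem[], before?: Date) {
  const page = items
    .filter((item) => before === undefined || item.published < before)
    .slice(0, PAGE_SIZE);
  const oldest = page[page.length - 1];
  return {
    orderedItems: page,
    // serialised as e.g. "next": ".../inbox?before=<timestamp>"
    next: oldest?.published.toISOString(),
  };
}

In a real database this would be a WHERE published < :before ORDER BY published DESC LIMIT :n query rather than an in-memory filter. One wrinkle: two items can share a timestamp, so a production cursor would want a tie-breaker (the item’s id, say) to avoid skipping or repeating items at page boundaries.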

Again, I’d love to hear what others have to say on the issue. For now, I’ll likely write an implementation similar to those I looked at for this post, and move to a more watertight solution once I have things working at a basic level.