Stricter specifications for pagination of Collections and OrderedCollections

trwnh · November 5, 2022, 7:20am

Overview

ActivityPub primarily depends on direct delivery of activities, but many implementations expose old post history via the outbox. In theory, you could fetch the outbox to discover old posts that weren’t delivered to you but should still be visible to you (e.g. Public posts that persist in the outbox). However, there is one big problem: pagination.

Specifically, pagination is an issue because you will have to fetch multiple pages, and you don’t know exactly when to stop, or how to discover gaps. You may be able to fetch up until you see a post that you already have, but there may be other unseen posts beyond that one. The only way to be sure is to fetch every single page of the outbox, which can be a costly operation.

Recommendations

Arguably, this situation can be improved by making some specific recommendations:

Construct chronologically, but present in reverse order

Because an OrderedCollection is mandated to be “reverse chronological” specifically, extra care needs to be taken to allow stable page references. Perhaps pages should be built chronologically and simply presented in reverse order, crucially with the first (newest) page being the only one that may contain fewer items than the max page size.

Example: A page size of 10 and a collection of 23 items should be split into pages of 3, 10, 10. These pages would be presented as such:

[
[23 22 21]
[20 19 18 17 16 15 14 13 12 11]
[10 9 8 7 6 5 4 3 2 1]
]

Stable page references should also be reversed

Furthermore, in order to maintain stable page references, such that if you’ve fetched a page before you don’t have to fetch it again, page counters should be assigned in reverse order as well.

Taking the example from above, the pages would be identified as 3, 2, 1:

[
[23 22 21] // page 3
[20 19 18 17 16 15 14 13 12 11] // page 2
[10 9 8 7 6 5 4 3 2 1] // page 1
]
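As a rough illustration of the two recommendations so far, here is a minimal Python sketch that fills pages oldest-first, numbers them from the oldest page upward, and then presents both the pages and their items newest-first. The function name `paginate` and its parameters are hypothetical and not taken from any implementation.

```python
def paginate(items, page_size):
    """Split chronologically ordered items into stable, numbered pages.

    Pages are filled oldest-first, so only the newest page can be partial.
    Returns (page_number, items) pairs, newest page first, with the items
    inside each page also newest-first.
    """
    pages = []
    for start in range(0, len(items), page_size):
        number = start // page_size + 1  # page 1 holds the oldest items
        chunk = items[start:start + page_size]
        pages.append((number, list(reversed(chunk))))
    return list(reversed(pages))


pages = paginate(list(range(1, 24)), page_size=10)
for number, chunk in pages:
    print(number, chunk)
```

Because existing pages are never refilled, appending item 24 only changes page 3; pages 1 and 2 keep the same number and the same contents.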

Deleted items should either be Tombstoned or change the page size

All this work is useless if the pages have to be recalculated or the items get shifted to a different page. To prevent this, either serve a Tombstone in place of deleted items, or otherwise freeze the upper and lower bound of a page while allowing variable page sizes.

For example, let’s say we delete post 17. The result might look like this:

[
[23 22 21] // page 3 (3 items)
[20 19 18 T 16 15 14 13 12 11] // page 2 (10 items, 1 Tombstone)
[10 9 8 7 6 5 4 3 2 1] // page 1 (10 items)
]

Or, it might look like this:

[
[23 22 21] // page 3 (3 items)
[20 19 18 16 15 14 13 12 11] // page 2 (9 items)
[10 9 8 7 6 5 4 3 2 1] // page 1 (10 items)
]

Notice that page 3 remains unchanged, rather than item 21 becoming part of the 2nd page.
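Both options can be sketched in a few lines of Python. This is only an illustration under the assumption that every item is identified by its 1-based chronological position; the names `render_page`, `page_number` and `TOMBSTONE` are made up for the example.

```python
TOMBSTONE = "T"  # stand-in for an ActivityStreams Tombstone object


def page_number(position, page_size):
    """Page that the item at a 1-based chronological position belongs to."""
    return (position - 1) // page_size + 1


def render_page(number, page_size, total, deleted, use_tombstones):
    """Return one page, newest-first, with fixed lower and upper bounds."""
    low = (number - 1) * page_size + 1
    high = min(number * page_size, total)
    page = []
    for position in range(high, low - 1, -1):
        if position not in deleted:
            page.append(position)
        elif use_tombstones:
            page.append(TOMBSTONE)
        # otherwise the deleted item is skipped and the page shrinks
    return page


print(render_page(2, 10, 23, deleted={17}, use_tombstones=True))
print(render_page(2, 10, 23, deleted={17}, use_tombstones=False))
```

In both variants the bounds of a page depend only on its number and the page size, so a deletion never moves an item onto a neighbouring page.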

Accessing pages should be done in a consistent way

The final piece of the puzzle is a way to consistently load specific pages. For example, consider a collection at /collection/id. You might be able to attach a query parameter ?page=N to access the Nth page via /collection/id?page=N. Or you might have some route such as /collection/id/page/N. Whatever the case, there should be a way of getting pages that can be expected to work across all implementations. Or, at the very least, a way that may be inferred easily, but a standard pagination technique would be better.

My thinking is that /page/N would be better, because it would more easily allow for static pages as an option.

Also for consistency: Tombstone is preferable to exclusion, because every item keeps its position, which lets dynamic servers that use query parameters compute pages of any size on the fly.
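To show what a consumer gains from this, here is a hedged sketch of a resync helper. It assumes the /page/N route discussed above, a fixed page size that the client somehow knows, and a totalItems count that includes Tombstones; none of these are standardized, and `pages_to_fetch` is a hypothetical name.

```python
import math


def pages_to_fetch(base_url, total_items, page_size, last_page_seen):
    """URLs of pages a client still needs, newest page first.

    last_page_seen is the highest page number fetched earlier (0 if none).
    That page is fetched again because it may have been partial at the time.
    """
    newest = math.ceil(total_items / page_size)
    oldest_needed = max(last_page_seen, 1)
    return [f"{base_url}/page/{n}" for n in range(newest, oldest_needed - 1, -1)]


print(pages_to_fetch("/collection/id", total_items=47, page_size=10, last_page_seen=3))
```

A client that last saw page 3 fetches pages 5, 4 and 3 and can skip pages 1 and 2 entirely, which is the saving that stable references are meant to provide.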

silverpill · November 6, 2022, 8:33pm

Perhaps a timestamp filter could solve this problem?
For example, /collection/id?after=1667766000 can return a paginated subset of the collection, and if the client knows the time of the last sync, it can retrieve missing objects with fewer requests.
This way implementations can continue to use their preferred pagination mechanism.
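A minimal sketch of what such a filter might do on a dynamic server, assuming each item carries a Unix `published` timestamp. The function name `items_after` and the `count` parameter are hypothetical.

```python
def items_after(items, after, count=20):
    """Return up to `count` items published strictly after a Unix timestamp.

    `items` is a list of (published_timestamp, item_id) tuples in any order.
    Results are newest-first to match an OrderedCollection.
    """
    newer = [item for item in items if item[0] > after]
    newer.sort(key=lambda item: item[0], reverse=True)
    return newer[:count]


outbox = [(1667700000, "a"), (1667770000, "b"), (1667780000, "c")]
print(items_after(outbox, after=1667766000))
```

If more than `count` items are newer than the timestamp, the client would still have to page through the remainder using whatever mechanism the server prefers.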

trwnh · November 7, 2022, 3:28am

Having a timestamp filter might work for dynamic server implementations but not for static server implementations. I took care in making sure, while writing the above post, that the recommendations would be applicable to both static and dynamic servers.

mro · November 8, 2022, 8:53am

I have been thinking about pagination for seppo.app a lot. My current take on it is:

page content SHOULD be immutable (except for deletions),
there are no edits of posts, only modified re-posts plus a delete where needed,
due to deletions page size is mutable(!), and there may be degenerate empty pages,
numbering is reversed for stable, static page contents (as long as there are no deletes),
the current (most recent) page has no number,
the current (most recent) page is always full, the 2nd (thus the most recent bearing a number) has volatile size, all others are stable.

I have mostly been doing it this way at mro.name/microblog for years now.

P.S.: works for dynamic pages, too.

marius · November 10, 2022, 5:29am

ActivityPub collections seem to be specifically designed for keyset pagination in my opinion.

If you use that, you don’t even need stable references (because the top of the collection is always changing). You just use the collection’s IRI (e.g. user.example.com/outbox) and append an after/before query parameter with the id of the last/first element in the current collection page. E.g.:

* ID: user.example.com/outbox (full collection)
* Type: OrderedCollection
* TotalItems: 13
* First: user.example.com/outbox?count=3
* OrderedItems:
  ├─ /1  Article
  ├─ /2  Page
  ├─ /3  Video
  ├─ /4  Video
  ├─ /5  Image
  ├─ /6  Note
  ├─ /7  Article
  ├─ /8  Article
  ├─ /9  Page
  ├─ /10 Video
  ├─ /11 Video
  ├─ /12 Image
  └─ /13 Note

* ID: user.example.com/outbox?count=3 (first page)
* Type: OrderedCollectionPage
* TotalItems: 3
* Next: user.example.com/outbox?count=3&after=3
* OrderedItems:
  ├─ /1  Article
  ├─ /2  Page
  └─ /3  Video

* ID: user.example.com/outbox?count=3&after=3 (second page)
* Type: OrderedCollectionPage
* TotalItems: 3
* Previous: user.example.com/outbox?count=3&before=4
* Next: user.example.com/outbox?count=3&after=6
* OrderedItems:
  ├─ /4 Video
  ├─ /5 Image
  └─ /6 Note

* ID: user.example.com/outbox?count=3&after=6 (third page, same)
* Type: OrderedCollectionPage
* TotalItems: 3
* Previous: user.example.com/outbox?count=3&before=7
* Next: user.example.com/outbox?count=3&after=9
* OrderedItems:
  ├─ /7  Article
  ├─ /8  Article
  └─ /9  Page

This gives you a proper mechanism for going through the collection in sequential order, and you can delegate this to an async mechanism and show the user only what gets loaded, etc. This is all already working in the go-activitypub Go libraries that I have on GitHub. I also have a basic example that should work with any service that uses proper Next/Previous attributes for their collections: https://git.sr.ht/~mariusor/fedbox-client-example/tree/master/item/main.go
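For readers who don’t want to dig through the Go code, the idea can be sketched in Python as follows. This is not the go-activitypub implementation; it is a simplified illustration that assumes item ids are sortable integers kept in ascending order as in the example above, and `keyset_page` is a hypothetical name. A real server would express the two comparisons as a database query with a limit.

```python
from urllib.parse import urlencode


def keyset_page(ids, base, count, after=None, before=None):
    """Return one page of `ids` plus its prev/next links.

    `ids` is the full list of item ids in ascending order. The page is
    selected by comparing against the `after` or `before` key, so it still
    works if the item with that id has since been deleted.
    """
    if after is not None:
        chunk = [i for i in ids if i > after][:count]
    elif before is not None:
        chunk = [i for i in ids if i < before][-count:]
    else:
        chunk = ids[:count]

    page = {"orderedItems": chunk}
    if chunk and chunk[0] != ids[0]:
        page["prev"] = f"{base}?{urlencode({'count': count, 'before': chunk[0]})}"
    if chunk and chunk[-1] != ids[-1]:
        page["next"] = f"{base}?{urlencode({'count': count, 'after': chunk[-1]})}"
    return page


ids = list(range(1, 14))
print(keyset_page(ids, "https://user.example.com/outbox", count=3, after=3))
```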

trwnh · January 5, 2023, 2:10pm

in writing this up i seem to have overlooked the startIndex property of OrderedCollectionPage, which is basically just a positive offset for how far into the OrderedCollection you are with the first item. unfortunately this is less useful due to reverse chronology being mandated by ActivityPub; it would be far more useful in a forward-chronological collection. it also doesn’t apply to regular CollectionPage sadly.

there is still the option of having a regular Collection contain OrderedCollectionPage, though, or otherwise simply disregarding the “MUST be reverse chronological” bit and committing a spec violation.
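to make the startIndex problem concrete, here is a small sketch. it assumes a zero-based offset and the oldest-first page numbering proposed at the top of this thread; `start_index` is a made-up helper, not anything defined by a spec.

```python
def start_index(page, page_size, total):
    """Zero-based offset of a page's first item in the newest-first view.

    Pages are numbered from the oldest items upward, so page 1 holds the
    oldest items and the highest-numbered page holds the newest ones.
    """
    newest_on_page = min(page * page_size, total)
    return total - newest_on_page


print(start_index(2, 10, 23))  # page 2 starts 3 items in
print(start_index(2, 10, 24))  # one new post later, the same page starts 4 items in
```

in a forward-chronological collection the offset of page 2 would stay at 10 no matter how many items are added later.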

yvolk · July 16, 2023, 3:19pm

Hello @trwnh
Regarding “deleted or changed items”
If we talk about Activities, my understanding is that an Activity is immutable. Just like you cannot change the past.
In order to change or delete a “Note”, you just create a new activity for that and post it. The new activity has the current time and is located on a new “page”…

stevebate · July 16, 2023, 4:58pm

Activities are immutable but their object referents will generally not be. However, the AP specification allows activities to be removed from the outbox at any time.

…, there is no guarantee that time the Activity may appear in the outbox. The Activity might appear after a delay or disappear at any period.

So that’s still a potential issue for static outbox pages. Personally, I think static pages are not a great idea for the reasons others discussed earlier. I believe collection paging should be considered a data communication optimization rather than a storage and representation technique.

Some of my recent reading has given me more insight into the shortcomings of AP collection paging.

trwnh · July 17, 2023, 3:16am

Activities are not immutable, but they aren’t generally expected to change. In fact, they do sometimes change, due to implementation decisions and quirks. For example, Mastodon doesn’t store any activities, and it serializes ActivityPub entities upon request. This means that if you are looking in the outbox, editing a status will rewrite the original Create rather than pushing an Update to the start of the outbox.

As for static paging being a good or bad idea, I would want to leave the door open for alternative implementations. It should be at least possible to pre-generate static pages, which could then be served by any generic HTTP server that supports authorization and content negotiation headers, and they could be stored in a variety of data stores other than a database.

stevebate · July 17, 2023, 10:08am

I actually wrote about that topic in my previous message and then deleted it because I thought it might be tangential. I think this is the standard AP behavior and not Mastodon-specific. An activity references an object. This is true even if the activity is communicated with the object embedded for various reasons. If that object is updated later, it doesn’t change the activity itself (the activity still references the same URI). However, it means that the accuracy of activities cannot be trusted and that object history is lost.

The outbox Create is a somewhat special case. The Create activity object posted to the outbox is effectively a blank/anonymous node in the graph. The outbox processing creates an object with an assigned URI, but the Create activity doesn’t (or shouldn’t) reference that object. The object originally embedded in the Create activity can’t be dereferenced without querying the activity itself and it can’t be updated since there’s no public URI. This is where there are probably implementation quirks in various servers. I’d guess that most servers effectively replace the Create activity object reference with a reference to the created object.

There’s still the same issue with other outbox activities like Update and Delete. After a series of outbox Updates (or a Delete) for an object, all non-Create activities for that object will be referencing the most recently updated version of the object, which is generally not the object version they were originally updating.

(As a tangent to the tangent, I’ve been wondering about Undo/Delete. It appears to be spec-compliant, but I don’t know of any servers that support it.)

As for static paging being a good or bad idea, I would want to leave the door open for alternative implementations. It should be at least possible to pre-generate static pages…

I’m not opposed to static pages in principle. I think it’s an interesting thought experiment and I’m keeping an open mind about it. However, using AP pages as the internal representation of collections feels like a conflation of architectural layers to me.

yvolk · July 17, 2023, 3:51pm

As I read your responses, “mutations” of Activities are caused by implementation shortcuts, simplifications. These are deviations from ActivityPub spec. Ideally each Activity changing a Note should contain (or reference somehow, if this is possible…) its own version of the Note. So reading and applying these Activities chronologically any Client app instance could get the same resulting Note. Even a deleted Note.

stevebate · July 17, 2023, 5:05pm

Speaking for myself, I wasn’t saying that (except maybe for the Create special case). I was saying this is the behavior implied by the AP specification. When several Update activities refer to the same object URI, then they all will refer to the latest updated object as it changes. To fix this issue, the Update could refer to an anonymous object (a blank node in RDF terms) that contains the data for the update. It could also use a different property (“target”?) to identify which object will be updated instead of the “id” property of the “object” (Section 7.3). However, this would not be compliant with the spec. I agree that, in the Create case, there’s no reason not to use a blank node already for the activity’s object, but some implementations don’t do that for various reasons.

yvolk · July 17, 2023, 5:55pm

Reading the spec: “The Update activity is used when updating an already existing object. The side effect of this is that the object MUST be modified to reflect the new structure as defined in the update activity…”
For me this means that each “Update activity” MUST contain enough information to modify (update) the current version of the target object. Hence it cannot simply refer to that object.

That “current version” may also be cached by a Client app, for example, so a Client app could apply the same update AND retain the change history, if it’s needed.

stevebate · July 17, 2023, 10:27pm

An example of what I’m suggesting (outbox, partial update):

{
  "id": "https://example.test/update-1",
  "type": "Update",
  "object": {
    "name": "My added name",
    "content": "My updated content",
    "summary": null
  },
  "target": "https://example.test/object-1"
}

This activity instructs the server to apply the update (the direct “object”, an anonymous resource) to the indirect “target” object. It’s consistent with the quotes above and it has an advantage over:

{
  "id": "https://example.test/update-1",
  "type": "Update",
  "object": {
    "id": "https://example.test/object-1",
    "name": "My added name",
    "content": "My updated content",
    "summary": null
  }
}

because the linked data integrity is not broken with the overloading of the semantics of an “id” property to be a “target” reference. In the second example, the IRI “https://example.test/object-1” is associated with two different objects: the one in the Update activity and the existing object. If you consider the activity as part of a linked data (RDF) graph, my point may be more clear.

Yes, but those are client-specific implementation details outside the scope of the AP specification.

nightpool · July 17, 2023, 11:38pm

They’re referring to the activity itself, which may have many versions over time. There’s no concept of a “latest” version just as there’s no concept of a “first” version. For example, if you have 15 likes of e.g. https://nightpool.club/post/10230, then there’s no way to tell which Like activities are referring to which versions unless they contain some other information that would convey that. Same with Update activities, which is why it’s important to set the updated property on the object itself, so that you can make sure you have the most recent version of the status and not an old, cached copy.

This quote is from the C2S spec, so it’s about the Client sending data to the authoritative Server. Obviously in that case there’s no possibility of “dereferencing” the object, since the server being dereferenced is the one the Client is trying to send the updated data to! The relevant section from the S2S spec is as follows:

the receiving server SHOULD update its copy of the object of the same id to the copy supplied in the Update activity. Unlike the client to server handling of the Update activity, this is not a partial update but a complete replacement of the object

That section seems to me to be somewhat less prescriptive about whether the object link should be dereferenced or not. In practical terms, the choice of whether or not to embed the entire object is going to be up to the individual server and its own performance / optimization tradeoffs.
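As a sketch of the S2S behavior described above, a receiving server might handle an Update with an embedded object roughly like this. The assumptions are that the object is embedded rather than linked, that `updated` is an ISO 8601 timestamp, and that `store` is a plain dict; the function names are hypothetical, and the authorization checks a real server needs are left out.

```python
from datetime import datetime


def parse(timestamp):
    """Parse an ISO 8601 timestamp, accepting a trailing Z for UTC."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def apply_s2s_update(store, activity):
    """Replace the stored copy of an object with the one embedded in an Update.

    `store` maps object ids to dicts. The incoming copy is ignored when it is
    not newer than the stored one, judged by the `updated` property.
    Returns True if the store changed.
    """
    incoming = activity["object"]
    current = store.get(incoming["id"])
    if current is not None and "updated" in current and "updated" in incoming:
        if parse(incoming["updated"]) <= parse(current["updated"]):
            return False  # stale or duplicate Update; keep what we have
    store[incoming["id"]] = dict(incoming)  # complete replacement, not a merge
    return True
```

Because this is a complete replacement, a property that is missing from the incoming copy disappears from the stored copy, unlike the partial update in the C2S case.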

stevebate · July 18, 2023, 12:50am

What do you mean by that?

When I said “latest version” of an activity object, I meant the currently stored activity object state rather than that the object was being versioned in some way. I don’t see any mention of versioning any kind of data in the spec.

I’m thinking of dereferencing a resource/object given an IRI rather than dereferencing a server. The definition I’m using is similar to this SO answer: “all URIs you can map to a resource can be considered dereferenceable”.

A C2S Update is instructing the server to partially update an existing target object using specific instructions about which properties to replace, add or remove. Whatever is processing the update only has the target object IRI and the update instructions. In that sense, it must “dereference” the IRI to update the existing object. Is there a different terminology you prefer for that?

nightpool · July 18, 2023, 2:30pm

But both objects have the same set of properties, which is why it’s not two “different” objects, it’s just that one document represents only a subset of the properties that one object has, which is completely valid.

e.g. the algorithm for processing an Update activity from a linked data perspective could be something like “iterate through the triples defined against the object in the document you’ve been sent, and then update those triples to point to their new value, deleting them if their value is null”. This doesn’t break linked data integrity in any way.
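A simplified version of that algorithm, working on top-level JSON properties rather than actual RDF triples, might look like the following. The name `apply_partial_update` is made up for this sketch.

```python
def apply_partial_update(stored, patch):
    """Return a copy of `stored` with the properties in `patch` applied.

    A property set to None (JSON null) is removed; any other value replaces
    or adds the property. `id` is only used for addressing, never changed.
    """
    result = dict(stored)
    for key, value in patch.items():
        if key == "id":
            continue
        if value is None:
            result.pop(key, None)
        else:
            result[key] = value
    return result


stored = {
    "id": "https://example.test/object-1",
    "content": "My original content",
    "summary": "My summary",
}
patch = {
    "id": "https://example.test/object-1",
    "name": "My added name",
    "content": "My updated content",
    "summary": None,
}
print(apply_partial_update(stored, patch))
```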

trwnh · July 18, 2023, 4:16pm

I think we’ve gone off-topic for this thread, which was about paginating Collections and being able to reliably fetch “page 14” or “items 396-400” as an ahead-of-time optimized action set by the producer (and possibly by the consumer, if standardized parameters or subpaths are developed and supported). an example use-case of which is to avoid “infinite scroll” dark patterns by always knowing how much you’ve read and how far you’ve read.

nonetheless, my take on the Update thing is that it should follow S2S Update semantics. Update should have its object inlined or embedded in its entirety, so that the “complete replacement” logic can be followed as described in the S2S Update section.

example flow:

Client sends C2S Update with partial replacement
Server makes the requested changes if possible
Server publishes an S2S Update in the outbox with the fully embedded or inlined object, as resulting from the C2S Update

# Before

id: <doc>
type: Document
name: "foo"
summary: "A document of some kind"

# POST /outbox HTTP/1.1

type: Update
name: "Some Update"
summary: "Rename foo to bar"
object:
  id: <doc>
  name: "bar"

# After

id: <doc>
type: Document
name: "bar"
summary: "A document of some kind"

# GET /outbox/some-update HTTP/1.1

id: </outbox/some-update>
type: Update
name: "Some Update"
summary: "Rename foo to bar"
object:
  id: <doc>
  type: Document
  name: "bar"
  summary: "A document of some kind"
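the same flow as a rough python sketch, assuming an in-memory `store` dict and an `outbox` list kept newest-first. `handle_c2s_update` and its parameters are hypothetical names, and delivery to other servers is left out.

```python
def handle_c2s_update(store, outbox, activity, activity_id):
    """Apply a partial C2S Update and publish the full object in the outbox."""
    patch = activity["object"]
    merged = dict(store[patch["id"]])
    for key, value in patch.items():
        if value is None:
            merged.pop(key, None)  # null removes a property
        else:
            merged[key] = value
    store[patch["id"]] = merged

    # The published activity embeds the complete resulting object.
    published = dict(activity, id=activity_id, object=dict(merged))
    outbox.insert(0, published)  # newest first
    return published


store = {"<doc>": {"id": "<doc>", "type": "Document", "name": "foo",
                   "summary": "A document of some kind"}}
outbox = []
update = {"type": "Update", "name": "Some Update", "summary": "Rename foo to bar",
          "object": {"id": "<doc>", "name": "bar"}}
print(handle_c2s_update(store, outbox, update, "/outbox/some-update"))
```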

there is maybe some use of result that can be proposed for linking to specific revisions, assuming that specific revisions are supported and tracked. but that’s yet another topic. i assume people would want to be able to Like or otherwise refer to or act upon certain revisions, so there’s certainly a lot of things that need to be figured out if one wishes to pursue that.

stevebate · July 18, 2023, 10:37pm

I agree it’s a tangent and I apologize for that. I’ve created another topic to continue the discussion.

C2S/S2S Update Issues