Delivery - reliability and scaling

mnot · January 12, 2024, 11:18pm

I’m reading ActivityPub with an eye to how it uses HTTP, and am fairly new to the protocol, so apologies in advance if I’ve missed something obvious or already discussed.

Section 7.1 defines delivery as the heart of server-to-server interaction in ActivityPub.

An HTTP POST request (with authorization of the submitting user) is then made to the inbox, with the Activity as the body of the request.
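
For readers newer to the protocol, here is a minimal sketch of what constructing such a delivery request could look like. The function name `build_delivery_request` and the example URLs are hypothetical, and the authorization step is left out entirely, since the sentence quoted above does not say how it is done.

```python
import json
import urllib.request

# Media type the ActivityPub spec uses for Activity documents.
ACTIVITY_TYPE = 'application/ld+json; profile="https://www.w3.org/ns/activitystreams"'


def build_delivery_request(inbox_url: str, activity: dict) -> urllib.request.Request:
    """Build (but do not send) a POST carrying the Activity as the body.

    Hypothetical helper: authorization of the submitting user is omitted here.
    """
    body = json.dumps(activity).encode("utf-8")
    return urllib.request.Request(
        inbox_url,
        data=body,
        headers={"Content-Type": ACTIVITY_TYPE},
        method="POST",
    )


# Example with made-up actor and inbox URLs.
request = build_delivery_request(
    "https://remote.example/users/bob/inbox",
    {
        "@context": "https://www.w3.org/ns/activitystreams",
        "type": "Create",
        "actor": "https://local.example/users/alice",
        "object": {"type": "Note", "content": "Hello"},
    },
)
```

Every recipient inbox gets its own request of this shape, which is why the cost of delivery falls on the sending server.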

As far as I can tell, pretty much all server-to-server interactions defined in ActivityPub use this mechanism because they use the ‘deliver / delivery’ terminology.

Section 1, however, implies that there’s an alternative mechanism for server-to-server updates:

You can GET from someone’s outbox to see what messages they’ve posted (or at least the ones you’re authorized to see). (client-to-server and/or server-to-server)

Of course, if that last one (GET’ing from someone’s outbox) was the only way to see what people have sent, this wouldn’t be a very efficient federation protocol! Indeed, federation happens usually by servers posting messages sent by actors to actors on other servers’ inboxes.

This last statement is very interesting to me. GET is cacheable, whereas POST is not (at least in this particular use case). GET can be scaled out by a proxy cache, which can serve hundreds of thousands of requests per second on modern hardware, and be geographically distributed very easily because it’s a generic function of HTTP; POST handling in most implementations requires application-specific code that often struggles to achieve single-digit thousands of requests a second (or even hundreds).

GET is also resilient, because it’s idempotent; if a client doesn’t have a complete view of the state of the server, it can make requests to complete its view. Section 7.1 addresses this with:

For federated servers performing delivery to a third party server, delivery SHOULD be performed asynchronously, and SHOULD additionally retry delivery to recipients if it fails due to network error.

This isn’t specific enough to ensure that messages will be delivered reliably and interoperably – implementations will make different decisions about when and how they ‘give up’ on failure.
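
To make the ambiguity concrete, here is a rough sketch of one retry policy a sender might pick. Every number in `RetryPolicy` is an assumption of mine, not something ActivityPub defines, and `send` is a hypothetical callable that raises `ConnectionError` on a network failure.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class RetryPolicy:
    # All values are illustrative assumptions; ActivityPub does not define them.
    base_delay: float = 60.0      # seconds to wait before the first retry
    factor: float = 2.0           # multiplier applied after each failure
    max_delay: float = 86400.0    # cap on any single wait (one day)
    max_attempts: int = 8         # total attempts before giving up

    def delays(self) -> List[float]:
        """Waits between attempts: one fewer than max_attempts."""
        return [min(self.base_delay * self.factor ** i, self.max_delay)
                for i in range(self.max_attempts - 1)]


def deliver_with_retry(send: Callable[[], None],
                       policy: RetryPolicy,
                       sleep: Callable[[float], None]) -> bool:
    """Call send() until it succeeds or the policy is exhausted.

    Returns True on success, False once the sender gives up.
    """
    waits = policy.delays()
    for attempt in range(policy.max_attempts):
        try:
            send()
            return True
        except ConnectionError:
            # Wait before the next attempt; no wait after the final one.
            if attempt < len(waits):
                sleep(waits[attempt])
    return False
```

With these defaults the sender gives up after 7620 seconds of waiting, a little over two hours. A peer configured with a different `max_attempts` or `max_delay` would give up at a different point, and the receiver has no way to learn which policy is in effect.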

Has there been much discussion of these aspects of the delivery protocol? My initial sense is that it would be helpful to have a negotiation mechanism that allows two servers to agree on how updates will flow between them, so that (for example) one that desires timely updates can use POST, whereas one that wants to take advantage of caching mechanisms for scale can ask its peers to use GET polling. That work might also include extensions to communicate expectations about retry behaviour on POSTs.

nightpool · January 13, 2024, 1:00am

If someone wants to use GETs, they can simply use RSS. ActivityPub was designed as an improvement on systems that used GET only, so that’s why it uses POSTs for delivery. GETs have mostly fallen by the wayside because polling did not prove responsive enough to create a good user experience.

Furthermore, POST requests can be distributed geographically using a message passing system, especially a PubSub-based one. Most Mastodon instances run with several background workers. It’s more complicated than putting a proprietary CDN in front of your server, sure, but when using your own hardware the two aren’t substantially different.

nightpool · January 13, 2024, 1:20am

While this is true in theory, in practice, when users can easily have thousands of posts in their outbox, it’s very unlikely that any server would spend the time crawling and syncing through all of them. Especially since new activities keep being created while the server tries to catch up, and it’s hard to tell how fast they are arriving, so it has to restart crawling from the very beginning (reverse-chronologically) every time.
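
For illustration, a rough sketch of that reverse-chronological walk. `fetch` is a hypothetical helper that GETs a URL and returns parsed JSON, and the code assumes `first` and `next` are plain URL strings and that every item is an object with an `id`; real servers may embed pages or list bare URLs instead. Stopping at the first already-seen activity is also an assumption, not something the spec promises is safe.

```python
from typing import Callable, Dict, List, Optional, Set


def crawl_outbox(outbox_url: str,
                 fetch: Callable[[str], Dict],
                 seen_ids: Set[str],
                 max_pages: int = 50) -> List[Dict]:
    """Walk an outbox newest-first and return activities not yet seen.

    Every call starts again from the first (newest) page.
    """
    new_items: List[Dict] = []
    page_url: Optional[str] = fetch(outbox_url).get("first")
    pages = 0
    while page_url and pages < max_pages:
        page = fetch(page_url)
        pages += 1
        for item in page.get("orderedItems", []):
            if item["id"] in seen_ids:
                return new_items  # reached known history; stop early
            new_items.append(item)
        page_url = page.get("next")
    return new_items
```

A server with nothing in `seen_ids` ends up fetching pages until it runs out of them or hits `max_pages`, which is the expensive case described above.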

That’s true, but it’s acceptable because servers are in charge of the delivery of their own messages, so they’re the ones who are most impacted by their failure to deliver. This allows servers to make the right tradeoffs in terms of how much processing power to spend on notifying servers of historical failures vs giving up and moving on with more relevant future content.