FEP-8fcf: Followers collection synchronization across servers

system · December 1, 2020, 5:25pm

Source: https://codeberg.org/fediverse/fep/src/branch/main/feps/fep-8fcf.md (Moved from old location at git.activitypub.dev)

authors: @Claire
status: FINAL
dateReceived: 2020-10-24
dateFinalized: 2022-02-07


Summary

In ActivityPub, follow relationships are established, updated and removed by
sending activities such as Follow, Accept or Reject, which are assumed to
be correctly and promptly processed upon receipt.

However, due to incompatible protocol extensions, software bugs, server crashes
or database rollbacks, the two ends of a Follow relationship may end up out of
sync.

This can be especially damaging when a remote instance has outdated information
about follow relationships that should have been revoked, as some
implementations may deliver activities addressed to the sender’s followers
collection by using the sharedInbox mechanism and letting the recipient use
the sender’s followers collection for local delivery and access control.

This proposal describes an optional mechanism for detecting discrepancies in
following relationships across instances, with minimal overhead and without loss
of privacy.

Requirements

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”,
“SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this
specification are to be interpreted as described in [RFC-2119].

The proposed protocol for followers collection synchronization makes a number of
assumptions that may not be suitable to every implementation or deployment.

Implementations and deployments MUST NOT implement the mechanisms described in
this proposal unless they match the following requirements:

- actors managed by an instance are required to all share the exact same URI
  scheme and authority for their id, inbox and sharedInbox URIs
- such instances are required to manage all actors using the same URI scheme and
  authority for either their id, inbox or sharedInbox URIs (that is, for
  instance, two fediverse implementations cannot implement this proposal if they
  are set up on the exact same domain name, unless implementing an
  additional mechanism to share follower information between them, which is out
  of scope for this proposal).

The reason for those requirements is to prevent the partial followers collection
described below from missing legitimate followers, which could result in
followers being removed for no reason.

Failing to implement this proposed synchronization mechanism should not impact
compatibility with other implementations, as it is completely optional.

Partial follower collection

For efficiency and privacy purposes, we consider a subset of an actor’s
followers collection. This subset is the set of an actor’s followers whose id
shares an instance’s specific URI scheme and authority.

For instance, if https://example.org/users/1 has the following followers:

https://example.org/users/2
https://testing.example.org/users/1
https://next.example.org/users/foo
https://testing.example.org/users/2

The partial follower collection of https://example.org/users/1 for the
instance serving https://testing.example.org/users/1 is:

https://testing.example.org/users/1
https://testing.example.org/users/2
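
As a minimal sketch of the filtering described above (not taken from any implementation), the partial follower collection can be derived by comparing the URI scheme and authority of each follower id with those of the receiving instance. The helper names `origin` and `partial_followers` are hypothetical, and lower-casing the authority before comparing is an assumption of this sketch.

```python
from urllib.parse import urlsplit


def origin(uri: str) -> tuple[str, str]:
    """Return the (scheme, authority) pair of a URI.

    Lower-casing the authority is an assumption of this sketch.
    """
    parts = urlsplit(uri)
    return parts.scheme, parts.netloc.lower()


def partial_followers(followers: list[str], instance_uri: str) -> list[str]:
    """Keep only the followers sharing scheme and authority with instance_uri."""
    target = origin(instance_uri)
    return [follower for follower in followers if origin(follower) == target]


followers = [
    "https://example.org/users/2",
    "https://testing.example.org/users/1",
    "https://next.example.org/users/foo",
    "https://testing.example.org/users/2",
]

# Partial collection for the instance serving https://testing.example.org/users/1
print(partial_followers(followers, "https://testing.example.org/users/1"))
```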

Partial follower collection digest

To enable quick checking of partial followers consistency across instances, a
partial follower collection digest is computed.

This digest is created by XORing together the individual SHA256 digests of each
follower’s id.

partialCollectionDigest = SHA256(follower1) XOR SHA256(follower2) XOR ... XOR SHA256(followerN)

For instance, the partial follower collection digest of
https://example.org/users/1 for the instance serving
https://testing.example.org/users/1 is:
3a06e99569547f444c352ab7f52e4bab207abec5ca6f07b0045cfdc9723f8fa9 XOR f939a1585d4a8f02ee339210dbe7315d7003476663d6095f7d996fc4bc7a49b6 = c33f48cd341ef046a206b8a72ec97af65079f9a3a9b90eef79c5920dce45c61f
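
The computation can be sketched as follows. The function names are hypothetical, and hashing the UTF-8 encoding of each id and rendering the result as lowercase hexadecimal are assumptions of this sketch. `xor_hex` combines the two individual digests quoted above, so the XOR step of the example can be checked without recomputing the hashes.

```python
import hashlib


def partial_collection_digest(follower_ids: list[str]) -> str:
    """XOR together the SHA256 digests of each follower id, as lowercase hex.

    Assumption: each id is hashed as its UTF-8 encoding.
    """
    acc = 0
    for follower_id in follower_ids:
        digest = hashlib.sha256(follower_id.encode("utf-8")).digest()
        acc ^= int.from_bytes(digest, "big")
    return format(acc, "064x")


def xor_hex(a: str, b: str) -> str:
    """XOR two equal-length hexadecimal strings."""
    return format(int(a, 16) ^ int(b, 16), f"0{len(a)}x")


# Combining the two individual digests from the example above.
print(xor_hex(
    "3a06e99569547f444c352ab7f52e4bab207abec5ca6f07b0045cfdc9723f8fa9",
    "f939a1585d4a8f02ee339210dbe7315d7003476663d6095f7d996fc4bc7a49b6",
))
```

Because XOR is commutative and associative, the result does not depend on the order in which followers are processed, so no sorting step is needed.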

The `Collection-Synchronization` HTTP Header

The Collection-Synchronization HTTP header provides a mechanism for quickly
checking whether the sender’s followers collection part that is relevant to the
recipient is consistent with the recipient’s knowledge.

The header field name is Collection-Synchronization and its value is a list of
parameters and values, formatted according to the signature syntax defined in
[HTTP-Signatures], Section 4.1.

Example:

Collection-Synchronization: collectionId="https://example.org/users/1/followers", url="https://example.org/users/1/followers_synchronization", digest="c33f48cd341ef046a206b8a72ec97af65079f9a3a9b90eef79c5920dce45c61f"

Collection Synchronization Header Parameters

The Collection-Synchronization header’s parameters are defined as follows:

- collectionId: this is the URI of the collection that supports synchronization.
  It must be the sender’s followers collection.
- url: this is the URL of the partial followers collection intended for the
  receiving instance.
  Accessing it should require authentication from the receiving instance.
- digest: the partial follower collection digest intended for the receiving
  instance.
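
A rough sketch of reading these parameters from a header value is shown below. It is a simplification rather than a complete parser for the [HTTP-Signatures] syntax: it assumes every value is double-quoted and contains no escaped quotes. The function name is hypothetical.

```python
import re

# Matches name="value" pairs; assumes values contain no escaped quotes.
PARAM_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_collection_synchronization(value: str) -> dict[str, str]:
    """Parse a Collection-Synchronization header value into a dict."""
    return dict(PARAM_RE.findall(value))


header_value = (
    'collectionId="https://example.org/users/1/followers", '
    'url="https://example.org/users/1/followers_synchronization", '
    'digest="c33f48cd341ef046a206b8a72ec97af65079f9a3a9b90eef79c5920dce45c61f"'
)
params = parse_collection_synchronization(header_value)
print(params["collectionId"])
```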

Synchronization procedure

On the sender end

When delivering an Activity to an inbox (or sharedInbox), an instance MAY
set a Collection-Synchronization header intended for the corresponding
instance (determined by the inbox URI scheme and authority).

When exactly to set this header is up to the sender, but it is recommended to
at least send it for any Create activity addressed specifically to the
sender’s followers collection.
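
One possible way for a sender to follow this recommendation is sketched below. Both function names are hypothetical, and checking only the `to` property for the followers collection is a simplifying assumption, not something the proposal prescribes.

```python
def should_attach_header(activity: dict, followers_collection: str) -> bool:
    """Return True for Create activities addressed to the followers collection.

    Assumption: only the "to" property is inspected.
    """
    recipients = activity.get("to", [])
    if isinstance(recipients, str):
        recipients = [recipients]
    return activity.get("type") == "Create" and followers_collection in recipients


def build_header_value(collection_id: str, url: str, digest: str) -> str:
    """Format the Collection-Synchronization header value."""
    return f'collectionId="{collection_id}", url="{url}", digest="{digest}"'


activity = {
    "type": "Create",
    "to": ["https://example.org/users/1/followers"],
}
if should_attach_header(activity, "https://example.org/users/1/followers"):
    print(build_header_value(
        "https://example.org/users/1/followers",
        "https://example.org/users/1/followers_synchronization",
        "c33f48cd341ef046a206b8a72ec97af65079f9a3a9b90eef79c5920dce45c61f",
    ))
```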

On the receiving end

On the receiving end, upon receiving an Activity delivery with a
signed Collection-Synchronization header, the receiver MUST check that:

- the collectionId attribute matches the sender’s followers collection id
- the url attribute has the same authority as the sender’s followers
  collection (so that the instance cannot get tricked into requesting the
  followers list of a third-party individual)

If any of those checks fails, the receiver MUST ignore the
Collection-Synchronization header.
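
These two checks could look like the following sketch. It assumes the header has already been parsed into a dictionary and that its signature has been verified elsewhere. The function name is hypothetical, and comparing authorities case-insensitively is an assumption of this sketch.

```python
from urllib.parse import urlsplit


def header_is_acceptable(params: dict[str, str], sender_followers_id: str) -> bool:
    """Apply the two receiver-side checks to already-parsed header parameters."""
    # Check 1: collectionId must be the sender's followers collection.
    if params.get("collectionId") != sender_followers_id:
        return False
    # Check 2: url must share the authority of that collection.
    url = params.get("url")
    if not url:
        return False
    return urlsplit(url).netloc.lower() == urlsplit(sender_followers_id).netloc.lower()


params = {
    "collectionId": "https://example.org/users/1/followers",
    "url": "https://example.org/users/1/followers_synchronization",
    "digest": "c33f48cd341ef046a206b8a72ec97af65079f9a3a9b90eef79c5920dce45c61f",
}
print(header_is_acceptable(params, "https://example.org/users/1/followers"))
```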

The receiver SHOULD then compute the partial collection digest for the sender’s
followers based on its own knowledge. If the digest does not match the digest
attribute of the header, it SHOULD then query the url, authenticating itself
to the remote server using [HTTP-Signatures] or another method.

Having fetched the up-to-date partial followers collection from the authoritative
server, the receiving end:

- SHOULD remove from its local copy of the followers collection any local actor
  not listed in the partial followers collection.
- MAY consider any pending outgoing follow listed in the partial followers
  collection as accepted.
- SHOULD send an Undo Follow for any other local follower listed in the
  partial followers collection but not known locally.
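
The three outcomes above reduce to set operations, as in this sketch. The function and variable names are hypothetical, and how an implementation stores accepted and pending follows is left open here.

```python
def reconcile(
    local_followers: set[str],
    pending_follows: set[str],
    remote_partial: set[str],
) -> tuple[set[str], set[str], set[str]]:
    """Compare local knowledge with the fetched partial followers collection.

    Returns (to_remove, to_accept, to_undo):
    - to_remove: local followers the authoritative server does not list
    - to_accept: pending outgoing follows the authoritative server lists
    - to_undo:   listed followers unknown locally, to answer with Undo Follow
    """
    to_remove = local_followers - remote_partial
    to_accept = pending_follows & remote_partial
    to_undo = remote_partial - local_followers - pending_follows
    return to_remove, to_accept, to_undo


local = {"https://testing.example.org/users/1", "https://testing.example.org/users/2"}
pending = {"https://testing.example.org/users/3"}
remote = {
    "https://testing.example.org/users/2",
    "https://testing.example.org/users/3",
    "https://testing.example.org/users/4",
}
to_remove, to_accept, to_undo = reconcile(local, pending, remote)
print(to_remove, to_accept, to_undo)
```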

Implementations

This proposal has been implemented in Mastodon since the following pull request: Add follower synchronization mechanism, by ClearlyClaire (Pull Request #14510, mastodon/mastodon on GitHub)

References

[RFC-2119] S. Bradner, Key words for use in RFCs to Indicate Requirement Levels
[HTTP-Signatures] A. Backman, J. Richer, M. Sporny, Signing HTTP Messages

Copyright

CC0 1.0 Universal (CC0 1.0) Public Domain Dedication

To the extent possible under law, the authors of this Fediverse Enhancement Proposal have waived all copyright and related or neighboring rights to this work.

Claire · August 4, 2020, 1:29pm

Context

Due to software bugs (e.g., Pleroma forgetting follow relationships a while ago, Mastodon mishandling Reject on Accepted Follows for the time being, etc.), crashes, backup rollbacks, etc., state may get out of sync between instances, in particular regarding followers information. With Mastodon’s follower-only privacy scope, this can be particularly damaging, as receiving remote instances are asked to deliver the message to whoever they know to be following the poster: if that gets out of sync, then the messages may reach people that the poster expects not to have access to them.

Therefore, a mechanism to detect synchronization issues and correct them is needed. Such a mechanism has to take into account that the full list of an account’s followers may not be public information, and should thus only consider followers residing on a specific domain.

Proposal

Protocol extensions and an implementation are proposed in the following Mastodon Pull Request: https://github.com/tootsuite/mastodon/pull/14510

The basic idea is the addition of an optional collectionSynchronization attribute to activities such as Create and Announce. This attribute, typically set on activities addressed to the author’s followers specifically, would contain one or more SynchronizationItem objects with the following attributes:

- object: the identifier of the collection this is describing. This collection MUST be owned by the activity’s author. For our current purposes, it must be the activity author’s followers collection.
- domain: the domain name of the instance the message is addressed to. It can be used to catch implementation bugs or to allow implementations to inline multiple SynchronizationItem objects corresponding to different servers in the same payload (e.g. when using relays).
- partialCollection: a URI to a collection that is the subset of the object collection made of the URIs of objects owned by domain. Accessing this collection SHOULD require the request to be signed (HTTP Sigs) with an account owned by the domain. For security reasons, it must be on the same domain as the collection it is describing.
- digest: a Digest of the partialCollection's items, in alphabetical order, and joined by a newline. In the proposed implementation, SHA256 (http://www.w3.org/2001/04/xmlenc#sha256) is the only acceptable algorithm. Choosing a common algorithm enables caching that information on each instance as necessary.

A server receiving an activity with such a SynchronizationItem would check that it is the intended receiver (domain attribute), that the object is indeed the author’s follower collection, check that the partialCollection resides on the same domain name as the author, then check the digest, and in case that digest doesn’t match, it would fetch the partialCollection, remove any local follower not listed there, and send an Undo Follow for any follower listed there that it doesn’t know to be followers.
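
As a rough illustration of the digest described above, and not the code from the Pull Request, the value could be computed like this. Sorting by Python's default string ordering as the "alphabetical order", hashing the UTF-8 encoding, and returning lowercase hexadecimal are all assumptions of this sketch, and `joined_digest` is a hypothetical name.

```python
import hashlib


def joined_digest(items: list[str]) -> str:
    """SHA256 over the items sorted and joined by a newline.

    Assumptions: Python's default string ordering, UTF-8 encoding, hex output.
    """
    payload = "\n".join(sorted(items))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


print(joined_digest(["https://mastodon.social/users/Gargron"]))
```

Unlike a scheme that combines per-item hashes, this one needs the whole list in sorted order before hashing, so the digest has to be recomputed from scratch whenever the partial collection changes.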

Example

Create activity:

{
  "@context": [
    "https://www.w3.org/ns/activitystreams",
    "https://w3id.org/security/v1",
    {
      "toot":"http://joinmastodon.org/ns#",
      "collectionSynchronization":"toot:collectionSynchronization",
      "SynchronizationItem":"toot:SynchronizationItem",
      "partialCollection": "toot:partialCollection"
  }],
  "id": "https://social.sitedethib.com/users/Thib/statuses/1234",
  "type": "Note",
  "attributedTo": "https://social.sitedethib.com/users/Thib",
  "to": ["https://social.sitedethib.com/users/Thib/followers"],
  "content": "This is just a private toot",
  "collectionSynchronization": [
    {
      "type": "SynchronizationItem",
      "domain": "mastodon.social",
      "object": "https://social.sitedethib.com/users/Thib/followers",
      "partialCollection": "https://social.sitedethib.com/users/Thib/followers_synchronization",
      "digest": {
        "type": "Digest",
        "digestAlgorithm": "http://www.w3.org/2001/04/xmlenc#sha256",
        "digestValue": "b08ab6951c7d6cc2b91e17ebd9557da7fae02489728e9332fcb3a97748244d50"
      }
    }
  ]
}

Result of querying https://social.sitedethib.com/users/Thib/followers_synchronization (when signing the request on behalf of a mastodon.social account):

{
  "@context": "https://www.w3.org/ns/activitystreams",
  "id": "https://social.sitedethib.com/users/Thib/followers?domain=mastodon.social",
  "type": "OrderedCollection",
  "orderedItems": [
    "https://mastodon.social/users/Gargron"
  ]
}

cjs · August 4, 2020, 2:52pm

I see the PR acknowledges it is not an AP-friendly solution. What were the reasons for discarding an ActivityPub-esque solution, like the following?

1. Have a single Service actor representing the Mastodon server, with an HTTP Signature public key.
2. Have Mastodon respond to requests for /user/{user}/followers with a query parameter like ?domain=mastodon.social that returns a properly-filtered OrderedCollection, but only in response to requests from a Service actor that is properly authenticated using HTTP Signatures.
3. Have the receiving server call the authoritative server at authoritative.server.example.com/user/{user}/followers?domain=receiving.server.example.com using the Service when it receives a "followers-only scoped Note" and it gets the authoritative domain-filtered list.

Edit: In fact, with HTTP Signatures, you could even eliminate the query parameter entirely, as by authenticating the Service you’d already have the hostname and therefore could respond with a filtered list anyway. So, uh, optionally eliminate half of step 2 above. /edit

This would:

Not leak the followers collection outside of its domain, work through relays, and maintain the same synchronous-mechanism-outcome (a partial list of followers in an OrderedCollection)
Represent a Mastodon server as an ActivityPub Service actor, which is also an investment for future innovations
I’d hope it would touch less Mastodon internals than the proposed solution, for easier maintainability
Is easier to understand by other developers or maintainers than inventing whole new unofficial vocabularies
Less bytes being shuffled around on the wire
Just as many additional HTTP requests as in the original proposal

Claire · August 4, 2020, 2:59pm

1. Mastodon already has a Service actor representing the Mastodon server, and it is already used for signing HTTP requests (and would sign this one). However, there is no way to know which actor is the actor representing an instance, in the general case.
2. You would still need to know how to construct that query parameter, that is not defined in AP. The PR uses a different endpoint for that, but it could reuse the same. It’s just a matter of the code being simpler. There is also no query parameter, because the HTTP signature is required and is sufficient to identify the domain name.
3. How would it decide to make that query? Would it make the query every single time it receives a “followers-only scoped Note”? This would be pretty expensive. This is why I decided to add a hash value there.

cjs · August 4, 2020, 3:29pm

RE: Number 1

Sure but that’s a really weak rationale. Mastodon isn’t a general case, it is a specific software that is creating a convention here. The least that it could do is pick a sensible one.

The same argument could be turned around and say that there’s no way to know that the actor being signed in the original proposal isn’t an admin, who would then be able to acquire the same exact “leak” of information as the original proposal, since they can access all the keys.

Furthermore, this kind of answer raises worrying questions: What’s the purpose of the existing Mastodon Service actor signing requests with HTTP Signatures if Mastodon doesn’t know/care that it is the actor representing the server? Surely it wouldn’t need to exist in the first place.

RE: Number 2

As you mention, there’s no need for a query parameter. One less thing to worry about! Woot!

RE: Number 3

“The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times”. Computers are cheap, human beings are expensive. I would strongly encourage you to favor having the computer do harder work so the rest of the ecosystem can think easier, as a matter of principle.

Then, there’s the notion that there’s other ways to tackle this problem: rather than:

{
  [ ... ]
  "type": "Create",
  "object": "https://social.sitedethib.com/users/Thib/statuses/1234",
  "to": ["https://social.sitedethib.com/users/Thib/followers"]
}

One could do the perfectly-acceptable:

{
  [ ... ]
  "to": [
    {
      "@context": [...],
      "id": "https://social.sitedethib.com/users/Thib/followers",
      "type": "OrderedCollection",
      "first": [...],
      "digest": "ABC123"
    }
  ]
}

And there’s the digest, already in the proper location.

Plus, I’ve noticed Mastodon likes to try to tackle the “optimization” problem ahead of time (ex: sharedInbox). Before proposing adding extra crud in the name of optimization/performance it would help to actually measure that performance and have numbers to back that up. Measure it, otherwise there will be endless debates over how often followers-only is actually used, and how much load such a problem would actually put on peer servers. Only then can a reasonable discussion over performance and optimization occur. And the answer may be to just use standard HTTP tools to do so: batching requests (for small servers receiving a lot from one monolithic instance) or requiring the maintainer of monolithic instances to rightfully maintain a gateway for managing the massive traffic for incoming requests, because you can’t rely on performance shortcuts from federated peers anyway.

Claire · August 4, 2020, 3:44pm

RE: Number 1

This account is used to do all the signed fetches, forward reports, etc.
Previously, we’d just pick an existing user (often, the instance admin, but that wasn’t a guarantee), but that could be confusing for remote users. It’s just a dedicated user.
Anyway, I’m not sure how this consideration fits in the current proposal.

RE: Number 2

Yup. However, if you do re-use the followers collection endpoint, then you have the issue that you may want a few different behaviors from it:

list all followers, regardless of the domain (default in Mastodon, can be opted out by the user, not suitable for the synchronization mechanism as there are potentially hundreds or thousands of unrelated accounts)
list none of the followers (when the user has decided to hide their followers, obviously not suitable for the synchronization mechanism)
list all followers from the instance of the requesting account

Having something do the second behavior when no signature is provided but the third one when one is, would make sense, but then the case of the first behavior being “downgraded” by the presence of a signature does not (furthermore, you could want to show your full followers list, but only to your friends).

RE: Number 3

This makes sense, I considered it, and I do like this proposal to a large extent, however:

I fear it might break backward-compatibility with previous Mastodon versions (haven’t fully checked it, but a quick glance didn’t look good) and possibly other software
same considerations as Number 2
pretty minor, but does not allow you to inline hashes for multiple domains

Performance concerns: right, I haven’t actually measured the cost of performing a query every single time, but having all your follower instances request a non-cacheable (not on edge servers anyway) endpoint every time you send them a followers-only toot does not seem good (especially with Mastodon, which uses technology that doesn’t provide asynchronous handling of incoming requests).

EDIT: I checked, and yeah, Mastodon isn’t going to like non-uris in the to attribute (we could improve that, but it’s not going to fix current versions)

cjs · August 4, 2020, 3:59pm

Ah gotcha. Thanks for considering my feedback. I don’t think I have any further comments since your concerns get pretty Mastodon-specific, so while your software engineering skills could address them, mine can’t.

To summarize my understanding of the discussion:

The problem can be broken down into detecting the out-of-sync-ness and correcting it, no debate about digest for the former nor a HTTP request for the latter.

Detecting: Current proposal is to enshrine the digest into a whole new vocabulary, I think I landed at embedding the digest into the very content-less followers OrderedCollection as a literal in the to property. Mastodon’s concerns are: backwards compatibility, no domain-specific hashes, harder to reason about Mastodon’s behavior.

Correcting: The current proposal involves an HTTP Sigs fetch to followers by the user’s actor, I think I landed at an HTTP Sigs fetch to followers by the server’s Service actor. But in hindsight I agree w/ you, I think this kind of looks silly now. A vestigial remnant of my thought process for how to reason about this in an ActivityPub way. Mastodon’s concerns are: harder to reason about Mastodon’s behavior.

Final thing I have to say is that, on a personal note, I’m not very thrilled at the prospect of translating yet another whole new Mastodon-specific vocabulary for go-fed as it usually has very little and/or insufficient corresponding documentation, and I can’t read Ruby to save my life.

Claire · August 4, 2020, 4:09pm

On the Correcting part: the signed fetch is performed by the Service actor, but that doesn’t change much of anything overall.

I understand your concern about adding custom vocabulary, and I tried to keep that down (in the end, the new stuff is “only” collectionSynchronization, SynchronizationItem and partialCollection). We could have gotten rid of the first two by using your “embedding the OrderedCollection” proposal, but unfortunately that would break compatibility with at least past and current versions of Mastodon, and likely other software.

I also can hear you about the lack of documentation. We have some documentation on https://docs.joinmastodon.org/spec/activitypub/#extensions but we could definitely improve it.

darius · August 4, 2020, 5:01pm

I don’t have technical commentary but I do think this makes sense as something that most AP servers should have the ability to do from time to time.

cwebber · September 4, 2020, 4:46pm

There’s a lot here… but here are my thoughts.

You’ve already heard from me that I don’t love the way AP deployments have become domain-oriented, and I know @Claire has already acknowledged it’s an issue so I guess we won’t rehash that here. I do fear this further entrenches that. However I also want to acknowledge the amount of thought and care that has gone into this proposal by @claire (and also the follow-up thinking by @cjs).

I actually think that the pairwise Service endpoint is the right solution though. I also think there’s a question of “is this something that should be checked every request, or time to time for error correction?”

I’m going to suggest a modified workflow based on the latter, maybe as an alternate starting point:

The check is not done on every receipt, but is done every now and then for error correction.
Attach the Service actor to every actor, including collections (such as the Follow collection). This now becomes the equivalent “domain” for every actor. (It would be better if the Service actor also signs off on this being true, though it’s probably fine to skip that and just say “well we’re not going to bother yet because we’ll just require it’s on the same domain”… an okay intermediate step. That’s actually a security concern for other issues but not for this one.)
So now we can traverse from an Actor’s Followers collection -> the appropriate Service endpoint. Ok.
Now send the Service endpoint, “Hey, could you give me a hash of what the contents are for the following collections for the followers at my Service endpoint: [, , , …]” (man, we really need to build promise-resolution into AP systems huh… for now I guess you could use http request-response to do it) So this is really a Service asking a Service for sanity checks on collections. The request is signed via an http signature from the requesting endpoint.
The Service says “Sure, here you go:”

{"type": "CollectionCheck",
 "checkResult": [
   {"uri": "https://social.example/u/alice/c/followers",
    "checkHash": <hash-goes-here>}]}

If things are off, then the service which is mirroring the followers can request to the canonical service of those URIs to give them a corrected list.

(Mastodon does nonces on its http signatures right? That’ll be extra important here to prevent replay attacks in both requests to the Service).

This also has the advantage of not increasing the message size of each sent message by a large amount, which I think the original proposal appeared to do. Instead it changes into an “occasional repair” approach.

What do you think @Claire (and @cjs)?

cjs · September 4, 2020, 7:33pm

Thanks for the suggestion, Chris! I still strongly prefer an AP-friendly solution, and I would categorize your proposal as such.

Claire · September 5, 2020, 10:24am

I am mostly ok with this proposal, but there are a few things I’m not sure about.

Which objects should have an attached service actor

I’m not sure having a different attached actor for every collection would make a lot of sense. It could be useful for flexibility and for ease of use, however, what would be the point of an actor and its followers collection having a different attached service account? What would you do when that happens?

Upgrade path

This actor/collection → service actor thing would be a new thing, and the upgrade path really isn’t obvious.

For the followers synchronization mechanism to work, you need to know the attached service of all the followers/followed users, but realistically, you’d only know about them for individual users when getting an Update or re-fetching one actor’s definition once both servers are upgraded.

So, unless specific action is taken, you’ll only know the attached service of a subset of the known accounts managed by that service. And running the followers synchronization mechanism when you know only about the attached service of a subset of those followers means the synchronization mechanism will effectively break the synchronization, removing the unaccounted followers from the following end, but not from the followed end.

This might be salvageable, as it seems reasonable to assume that if one actor on a specific domain advertises its attached service, all of them do (or at least all of them managed by the same service, which is what we’re interested in), so upon discovering an attached service on an account, we can schedule a re-fetch of all the actors on the same domain for which we do not know the attached service, but:

it might involve a lot of requests (e.g., my small, single-user instance, knows about 12K mastodon.social users)
what do we do if one of those requests fails?
the synchronization mechanism remains unusable until this process has finished, so one needs to keep track of that
there is also the fact that actor json can get cached on an edge server, so the initial assumption might not always be true

cwebber · September 5, 2020, 12:37pm

I’m not sure having a different attached actor for every collection would make a lot of sense. It could be useful for flexibility and for ease of use, however, what would be the point of an actor and its followers collection having a different attached service account? What would you do when that happens?

Think of it this way: we’re saying “Hm, we need to check that our understanding of the state of this collection is correct or not…” Now consider that Followers is only one kind of collection, in theory people can have photo galleries, etc… and one might want to use the same technique. From such an analysis, it is clear that the correct generalization is that it is the object being synchronized, namely the collection, that we look to for its service endpoint.

For the followers synchronization mechanism to work, you need to know the attached service of all the followers/followed users, but realistically, you’d only know about them for individual users when getting an Update or re-fetching one actor’s definition once both servers are upgraded.

This might not be a problem if this is done as a periodic repair step:

Occasionally see if repair checks can be done… including against those which you know the service endpoint (you can start the repair! only one side needs to know in this case), and those which don’t. If you’re testing if a server has gotten new service endpoint support, you could use a heuristic where you query one or two users’ collections as a short probe, and if you see service endpoints listed, begin fetching; if not, abort. That allows for upgrading within the current world of domains stuff or not.
Or push out updates to start that process.
As evidence for “only one side needs to know in this case”, consider that when requesting from the requestor of the check, it already knows what the response-end’s endpoint is. So the response-end can therefore say “oh, well this claims it’s checking against all these endpoints which are theoretically managed by this service… if that’s true, then I guess these objects must have services listed now”. Start checking, and if you see some, then great, start noting them down and get the rest. If not abort and return an error because their signature won’t be acceptable for this entire set anyway.
This might seem like a large amount of overhead, and it might be for a short period. But in the long run, I’m going to argue that it’s less: consider how much is not being added to every message going forward in history

Claire · September 5, 2020, 1:34pm

Occasionally see if repair checks can be done… including against those which you know the service endpoint (you can start the repair! only one side needs to know in this case), and those which don’t. If you’re testing if a server has gotten new service endpoint support, you could use a heuristic where you query one or two users’ collections as a short probe, and if you see service endpoints listed, begin fetching; if not, abort. That allows for upgrading within the current world of domains stuff or not.

Or push out updates to start that process.

This will only work when both ends have support for the “attached service actor” thing.

As evidence for “only one side needs to know in this case”, consider that when requesting from the requestor of the check, it already knows what the response-end’s endpoint is. So the response-end can therefore say “oh, well this claims it’s checking against all these endpoints which are theoretically managed by this service… if that’s true, then I guess these objects must have services listed now”. Start checking, and if you see some, then great, start noting them down and get the rest. If not abort and return an error because their signature won’t be acceptable for this entire set anyway.

Sorry, I do not understand this part at all.

Anyway, there are two things here:

1. discovering that a followed account’s “followers” collection supports “attached service actor”/“follower synchronization mechanism”:
   - on the negative side, it’s difficult to come up with a heuristic to check that (currently, Mastodon doesn’t refetch profiles of followed accounts unless some explicit action is taken or receiving an Update)
   - on the plus side, only having partial info is fine for this, you don’t need to know about other accounts to initiate a check on a remote followed account
2. discovering if your followers support “attached service actor”/“follower synchronization mechanism”:
   - when receiving a check request, to provide a sound result, you MUST know the “attached service actor” of all the requested account’s followers who are managed by the requesting service actor
   - since by definition you may not have this info yet for every follower, you don’t know the exact set of actors to re-fetch
   - since the whole goal of this thing is to not rely on domain-based reasoning, you can’t rely on “re-fetch only the accounts for which the acct: domain part is the same as the requesting account”
   - it might be reasonable, however, to only re-fetch actors whose id is hosted on the same domain name as the service’s id, as that would be consistent with the current security assumptions
   - one cannot, however, assume that one domain name is served by the same AP software, and in particular, one cannot assume that all the AP software serving one domain support the “attached service actor” thing; therefore, knowing accounts that do not have an attached service actor does not mean that you have only a partial view. This means there isn’t any easy criterion to know when you need to re-fetch actors… the only case in which you are sure you don’t have to re-fetch is the case where you know every followed account has an attached service actor
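
To make the “same domain name as the service’s id” criterion concrete, here is a minimal sketch of how a receiving server might pick which followers to re-fetch before answering a check request. Everything in it is an assumption made for illustration: `actors_to_refetch`, the `known_service_actor` mapping and the example URLs are hypothetical and do not come from any existing implementation.

```python
from urllib.parse import urlsplit


def host_of(uri: str) -> str:
    """Return the lowercased host part of a URI ('' if there is none)."""
    return (urlsplit(uri).hostname or "").lower()


def actors_to_refetch(follower_ids, service_actor_id, known_service_actor):
    """Pick followers worth re-fetching before answering a check request.

    follower_ids: ids of the requested account's followers (hypothetical input).
    service_actor_id: id of the requesting service actor.
    known_service_actor: dict mapping a follower id to its attached service
        actor id, for followers where that is already known.
    """
    service_host = host_of(service_actor_id)
    return [
        fid
        for fid in follower_ids
        # Only actors whose id is hosted on the same domain name as the
        # service's id are candidates...
        if host_of(fid) == service_host
        # ...and only if no attached service actor has been recorded yet.
        and fid not in known_service_actor
    ]
```

Note that an actor served by software without support for the mechanism never gets an entry in `known_service_actor`, so it would be selected again on every check. That is the unbounded re-fetching problem described in the last bullet.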

This might seem like a large amount of overhead, and it might be for a short period. But in the long run, I’m going to argue that it’s less: consider how much is not being added to every message going forward in history

I’d argue it’s an immense amount of overhead, and also pretty error-prone. And in the (admittedly very rare, but that’s the kind of thing your proposal addresses to begin with, otherwise my proposal would be good enough) case where a domain is served by multiple AP software implementations, not all of which handle this mechanism, this may have to be done an unbounded number of times.

Thankfully, in most cases, the heaviest part (discovering support for all remote followers) would only have to be done once per pair of instances.

There’s also the question of when to check for corrections. Request too often, and you have significant overhead; request not often enough, and you risk leaking private messages to desynchronized followers. My proposal addressed that by inlining the required info on each followers-only message (at the cost of some overhead for each of those messages, yes), which are the messages that would be affected by a synchronization error.

cwebber · September 5, 2020, 2:46pm

Arguably this problem is another thing that stems from the mistake of rushing in a domain-oriented sharedInbox distribution in the first place (which also breaks ocap security, and is my biggest regret in ActivityPub…). The chat system I’m working on is a proof of concept of how to get around these problems in an ocap manner that doesn’t require so much state synchronization and instead relies on message flow while still reducing the number of messages (the core idea is the “amphitheater” that is only vaguely demonstrated on this page from Electric Communities’ documentation), but I realize that I am getting it out a bit too late and now the network as a whole has adapted to the broken domain-oriented model. I do fear that by the time I have my solutions out there, the ActivityPub network as a whole won’t be able to adapt to them even if it’s done in a spec-compatible way. :\

I realize that the route I’m suggesting as an intermediate fix also seems complicated (though I’d argue that the original is too, in different ways; I think the solution I’ve suggested is a simpler pattern, but with a more complicated upgrade path in the meanwhile); I’m attempting to encourage this because I think in the long run it’ll have fewer structural issues and still preserve a path forward to a better design. I really do fear that if we keep trying to patch up the “origin” style system we’re going to end up digging our heels in enough that it won’t be possible to get out of it again. But maybe that’s already the case.

nightpool · September 5, 2020, 2:50pm

I haven’t dug into the technical details of your proposal Chris, but I think it’s really very important that, when processing addressing that’s given by a collection, two servers can agree about the state of the world that’s implied by that collection. So in general I would prefer to keep thinking about this as something that needs to be checked for every request, not a long term best-effort consistency sort of situation.

cjs · September 5, 2020, 3:09pm

My interpretation is that what you’re asking for is distributed atomicity and versioning, which is a way bigger problem than I think you’re intending, especially in an intermittently-connected, possibly delay-delivered, federated network context.

For example: If my AP software receives a message from a peer who composed it a year ago (but never delivered it until now, a year later) with followers at v1, and it’s now far obsolete because my instance’s users have all unfollowed, is the onus on my server to treat the peer server as authoritative and go back and treat the exchange as atomic and distribute it to the once-were followers on my AP server?

I think it is an unnecessary constraint to attempt to rigidly adhere to an every-request design, especially when it is for solving a problem due to the sharedInbox hack (and thus the solutions explored tend to solve only that particular problem, and not examine the general solution space). For example, a path forward could be to abandon sharedInbox entirely and maintain preauthenticated channels like kaniini’s suggestion, to solve the sharedInbox performance problem, and then use that single multiplexed channel to better manage atomicity/distributed versioning (ex: as part of startup, or as often as in between every bundled message), which would also be a general solution.

cwebber · September 5, 2020, 3:13pm

I haven’t dug into the technical details of your proposal Chris, but I think it’s really very important that, when processing addressing that’s given by a collection, two servers can agree about the state of the world that’s implied by that collection. So in general I would prefer to keep thinking about this as something that needs to be checked for every request, not a long term best-effort consistency sort of situation.

I understand that and it makes sense within the context of the way ActivityPub has been rolled out (which I still think is unfortunate but I am partly to blame by not realizing the implications of a sharedInbox “mirror the followers collection” approach or knowing of an alternative at the time). I do think the amphitheater pattern would fix this but it requires more ocap stuff than I think implementors are yet ready for. I am trying to figure out how to make it more understandable, but I’m trying to do that through prototyping.

In the meanwhile, we should acknowledge something important: whenever messages are “reposted”, the collectionSynchronization will need to be stripped out, because otherwise you’re leaking some dangerous information: who, on this server, has subscribed to whom. That might be private, and if someone has a guess they can uncover it (hmmmmm… let me just see if these three users are in there… aha, they are!) and they can also check for when it has changed to observe when the collection’s contents have changed. I’m not sure that message signatures are still really being used, but if they are you’d have to strip them out for signing/verifying too.
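
As a rough illustration of the stripping step described above, a server that re-shares an activity could drop the property before forwarding it. This is only a sketch: the function name is hypothetical, and it assumes the `collectionSynchronization` property sits at the top level of the activity or of an embedded object, which the proposal under discussion may or may not do.

```python
import copy


def strip_collection_synchronization(activity: dict) -> dict:
    """Return a copy of the activity without collectionSynchronization.

    Assumption: the property appears at the top level and/or on an
    embedded "object"; other placements are not handled here.
    """
    cleaned = copy.deepcopy(activity)  # leave the caller's dict untouched
    cleaned.pop("collectionSynchronization", None)
    embedded = cleaned.get("object")
    if isinstance(embedded, dict):
        embedded.pop("collectionSynchronization", None)
    return cleaned
```

A node that is unaware of the property would never run such a step, which is exactly the leak being warned about.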

But wait! I’ve just thought of an alternative that I think is a hack, but actually is no more of a hack than anything presented, and manages to not clutter up the activitystreams messages AND incorporates the kind of workflow that @claire wanted. What if we just took the same information that was in @claire’s collectionSynchronization message and put it in an HTTP header? Like X-Collection-Synchronization: <base64-encoded synchronization data>

That is fully backwards compatible, provides the information desired, and also is forwards compatible towards other designs that are more actor/ocap’y. Maybe everyone can be happy with that. (Plus since it’s not put in the json-ld, that can be one less thing to define as any kind of special term… it could be as simple as: [[<collection-id>, <domain>, <hash>], ...])
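
To make the header idea concrete, here is a minimal sketch of encoding and decoding such a value. The post only says “base64-encoded synchronization data” in the shape `[[<collection-id>, <domain>, <hash>], ...]`; serializing that list as JSON before base64-encoding is an assumption made here, the function names are hypothetical, and the hash is treated as an opaque string because how it is computed is not specified at this point in the thread.

```python
import base64
import json


def encode_sync_header(entries):
    """Encode (collection_id, domain, hash) triples into a header value.

    Assumption: the triples are serialized as a JSON array of arrays
    and then base64-encoded.
    """
    payload = json.dumps([list(entry) for entry in entries], separators=(",", ":"))
    return base64.b64encode(payload.encode("utf-8")).decode("ascii")


def decode_sync_header(value):
    """Decode a header value back into a list of 3-tuples."""
    return [tuple(entry) for entry in json.loads(base64.b64decode(value))]
```

A sender would then attach something like `X-Collection-Synchronization: <encode_sync_header(...)>` to the delivery request, and the receiver would decode it before comparing the hash with its own view of the collection.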

@nightpool, @claire, @cjs … what do you think? Maybe everyone can be happy with this!

Claire · September 5, 2020, 3:17pm

I think this last solution has exactly the same shortcomings as my proposal? The only thing that would be different is that you’d consider this as “not quite an ActivityPub thing, more of an interim solution” but I don’t really know if that’s really a good thing.

cwebber · September 5, 2020, 3:21pm

Well, it is intentionally using the same mechanism as yours, just moving it up to the http headers. And yes, there are two advantages:

- yes, one is exactly that it’s “not quite an activitypub thing, more of an interim solution”, because I think there are better options out there
- it doesn’t have the complexities of accidentally leaking information when sharing posts that I warned about, which will be a real problem with nodes that aren’t aware of this addition… any reposts from there won’t know to strip that off and thus will perpetuate it as a security issue

I’m trying to accommodate your “our current design needs this for every request” bit while trying not to introduce the security vulnerability for now and allowing a path towards a different solution as well. I think those are both important.
