FEP-8fcf: Followers collection synchronization across servers

Claire · September 5, 2020, 3:27pm

I don’t think there’s any security issue in my proposal? The only thing you’d potentially leak is support for the synchronization mechanism. Maybe also the domains where someone has followers (if using the “inline all domains in a single payload” option), but not the individual followers.

Furthermore, in the proposed implementation, those properties would only be set on the Create and Announce of followers-only posts, so not things that are usually shared, and in case of Mastodon, not things that are LDSigned.

I don’t have a strong opinion wrt. AS attribute VS HTTP header, although I think it’s slightly dubious to push extension stuff outside of AP, which is made to be extensible in the first place.

nightpool · September 5, 2020, 3:38pm

I also don’t understand what information leakage may exist. Are you saying the hash of the followers list may expose some information (if you have a guess at the complete followers list for a domain and wish to confirm it)?

cjs · September 5, 2020, 3:43pm

I think I see what Chris is saying:

If big-instance leaks a digest of a collectionSynchronization field of some small instance X, and the attacker enumerates all the N users of small instance X, the attacker can simply create N! permutations of users in their alphabetical order and find the colliding digest and know who is following the person on big-instance.

Illustration: SHA-256 hashing performance is pretty good in the age of Bitcoin. I’ll assume a very rough 1 million h/s for illustrative purposes. Note: speed does depend on input size, but we’re not talking about MB of data here for a small instance. A single hash for a 14-user instance or smaller is crackable in 1 day or less compute time worst case (not average case). So small instances suffer the most, in this case. The smaller the instance, the more likely someone could reconstruct the specific folks one follows.
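To make the attack concrete, here is a minimal sketch of the brute-force search. It assumes the leaked digest is a SHA-256 hash over the sorted, newline-joined actor ids; that format, the function names and the example URLs are all assumptions for illustration, not something the proposal specifies. Note that when the list is hashed in a canonical sorted order, the candidates are the subsets of the known users (2^N of them), not their orderings.

```python
import hashlib
from itertools import combinations


def list_digest(actor_ids):
    """Hash a follower list in canonical (sorted) order. Assumed format."""
    joined = "\n".join(sorted(actor_ids))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def recover_followers(target_digest, known_users):
    """Try every subset of the enumerated users until one matches the digest.

    Returns the matching follower list, or None if no subset matches.
    """
    users = sorted(known_users)
    for size in range(len(users) + 1):
        for subset in combinations(users, size):
            if list_digest(subset) == target_digest:
                return list(subset)
    return None


# Hypothetical small instance with 8 enumerable users, 3 of whom follow the target.
users = [f"https://small.example/users/u{i}" for i in range(8)]
secret = [users[1], users[4], users[6]]
leaked = list_digest(secret)
recovered = recover_followers(leaked, users)
```

The search cost doubles with each additional user on the small instance, which is why small instances are the most exposed.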

Chris, is this what you had in mind?

EDIT: Updated to hopefully be more clear.

Claire · September 5, 2020, 3:48pm

Indeed, if the hash is leaked, and assuming an attacker has a way to enumerate users of the target domain, they can recover the list. This is a good case against the “inline multiple SynchronizationItem” point I made in my proposal.

However, the proposed implementation sets only one SynchronizationItem and only sets it to Create and Announce activities which should be considered private.

cwebber · September 5, 2020, 3:48pm

Yes, that’s it.

You also don’t need to fully enumerate everyone if you have a reasonable guess; many cases can be done in shorter attempts.

cjs · September 5, 2020, 3:57pm

This is possible on Mastodon: it’ll respond with an HTTP payload if a user exists with that username.

Plus, as Chris said, if it is just one individual being targeted, or just one specific case to be checked, potentially any-size instance is vulnerable (I don’t know how many usernames fit into a 1-second SHA-256 hash of ~200MB).

I wasn’t worried about this before… but now…

Claire · September 5, 2020, 4:17pm

Again, I really don’t think that’s an issue at all in my proposed implementation (the info is only inlined to Create and Announce activities, not the objects themselves, and only to that associated with private objects, with each domain receiving its SynchronizationItem and no other), but it’s worth making that issue clear in the protocol proposal itself (though I can’t edit my first post in this thread anymore, apparently).

cjs · September 5, 2020, 4:30pm

I understand you don’t believe so, but I really want to impress what this means. “Inlining” properties into the delivered objects means peer federated servers are liable to just cache the entire thing without sanitizing it, treating it as their entry in an inbox. Once something leaves your server, you have no control over what the other software does with it, and now your software (or a peer’s software) is one Announce-using-the-inlined-object away from sharing this with the world (no maliciousness required by anyone), precisely because your proposal is the first which makes a JSON-LD field security-sensitive if leaked. So not only does Mastodon need to scrub this JSON-LD property from incoming requests before re-boosting, but so does peer server software as well.

Claire · September 5, 2020, 4:39pm

Yes, but those properties are only inlined into activities that are private. If those activities get leaked by other servers, you have a much bigger problem: private conversations being leaked.

cjs · September 5, 2020, 4:49pm

You’re basically saying: “If non-Mastodon servers don’t understand the non-AP Mastodon-specific private-and-followers-only concept, they’re the ones broken”.

I am saying as a writer of non-Mastodon software that doesn’t have the concept of “private-and-followers-only” that it’s not a reasonable assumption to make. Just because I included followers and omitted as:public does not mean it is private (privacy is not a binary concept), and I don’t want to be accidentally leaking security fields because of a Mastodon-specific interpretation, and then be told my software is the problematic one.

Claire · September 5, 2020, 6:09pm

It is true that ActivityPub doesn’t explicitly define anything about who is allowed to see an object, beyond its initial addressing. The only thing it says is that objects addressed to as:Public should be accessible to everyone without authentication. But I’m not sure what you’re getting at. If you consider that software would be right to treat anything it sees as public because AS doesn’t explicitly specify that something should not be displayed to someone it isn’t addressed to… then it means you shouldn’t use ActivityPub for anything private at all ever, regardless of the addressing.

nightpool · September 5, 2020, 6:30pm

If a post doesn’t have as:public, it shouldn’t be visible to anyone outside of its authorized recipients. That seems like it’s pretty clear from the spec? It talks about this restriction in a couple of different locations (but never in the context of the receiving server in a s2s system, because AP doesn’t specify that kind of “lookup” interface for the receiving server).

There aren’t any rules preventing you from Announcing that post to a different audience or whatever, but it seems like a mistake to inline the full object without being aware of the individual properties. I would expect nearly all systems to just reference the post by id or inline a couple of properties that they know are relevant.

cjs · September 5, 2020, 6:35pm

Perhaps inverting my argument will make it clearer: having this synchronization option only be available for a private Activity is problematic; a synchronization solution should work independently of visibility/privacy-model/addressing of the Activity it is attached-to/associated-with (including publicly-available ones).

The fact that the best defense against the security flaw is an appeal to the way this particular solution couples these concerns, which should be separate, will not convince me that it “solves” or “is not” a security problem, as a matter of software engineering principle.

What is “private”? An account that automatically approves its follower requests (I know this too is optional in Mastodon but it is not encoded into the Activities you send out and therefore is not documented as part of the delivery scope) and sends a non-as:public message to potentially thousands of followers does not mean it is “private”, to me. You may think differently. That’s how Mastodon operates today. I believe it is deceiving to tell users it is a “private” message. You’re right, perhaps I’m more paranoid than the average user, but that’s survivorship bias (I treat all my Mastodon content – including DMs and follower-only – as fully public; people who can’t accept that don’t use Mastodon, which skews your average to those that accept it).

Correct, which is not the same as the message being “private”.

Do we really want to go down this privacy discussion road? I’d prefer to just agree to disagree and save everyone’s time, rather than have y’all try to convince me I am wrong. I’m not trying to convince either of you that you’re wrong. It’s just a different interpretation.

I’m just trying to say “here is a security problem Chris identified” and our viewpoint is rejected. I can see your viewpoint; I would just prefer a no-bullshit answer of “we’re not going to address that” to y’all trying to force y’all’s particular privacy model on everyone else (“We introduce this generalized AP solution to synchronization but it requires Mastodon’s interpretation of privacy to address security concerns”). My interpretation of the spec may be stricter, but I’m not going to force y’all into my world view either. Woo, peering & federation! Unfortunately, this non-reciprocal listening to viewpoints is currently a growing source of frustration for me. I am sorry if I come across as upset; it is just that the leeway AP gives for having different privacy guarantees is something I would like to effectively preserve in the ecosystem. Mastodon should want that too, or else a different AP-based Fediverse could be made that is fundamentally un-interoperable.

Claire · October 19, 2020, 12:48pm

So, I’m sad we still haven’t managed to reach an agreement on such a protocol, especially since, due to issues on our (Mastodon’s) end, it is sorely needed.

To sum up the disagreements with my proposal, I think there are basically two of them:

it entrenches the domain-based thing further. I understand this (this goes slightly beyond the existing security considerations, and assumes that software managing one account on a domain manages all accounts on that same domain), and I regret it, but I’m afraid the alternative is too difficult and error-prone to handle. In the extremely unlikely event one handle dispatches accounts to several different software implementations, we can allow disabling that feature, or enabling the implementations to collaborate.
privacy concerns regarding implementations that would store/relay “private” activities to platforms/actors that don’t need to know that information. The “fix” for that being to use an HTTP header instead of additional AS attributes. I think that’s also an unlikely edge case, and I’m not happy with pushing extensibility outside of ActivityStreams and using an HTTP header instead, but I have no fundamental opposition to that.

I can work on changing the proposal/PR to use an HTTP header instead of custom AS attributes. Would that work for you?

Claire · October 19, 2020, 4:43pm

I have now added a commit to the PR to replace the collectionSynchronization attribute in activities with the Collection-Synchronization HTTP header, with the same grammar as the Signature header and the following fields:

collectionId: the collection this header refers to (the only one supported at that point is the sender’s collection)
digest: hexadecimal representation of XORed digests of the actor ids on the receiver’s instance, joined by newlines
url: the URL of the endpoint returning a Collection

In the current implementation of the PR, the list of followers is the list of followers whose actor id shares the same URI scheme and netloc as the inbox being delivered to. I’m afraid this introduces, again, a new constraint on the inbox and actor id needing to be on the same domain, and I’m not too sure what to do about that.
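As a rough illustration of the header described above, the following sketch filters followers by URI scheme and netloc, XORs per-follower digests, and formats the result with the key="value" grammar of the Signature header. It assumes SHA-256 as the per-id digest (the post does not name the hash function) and does not model the “joined by newlines” wording; the function names and all example URLs are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def same_origin(actor_id, inbox_url):
    """True if the actor id shares the inbox's URI scheme and netloc."""
    a, b = urlsplit(actor_id), urlsplit(inbox_url)
    return (a.scheme, a.netloc) == (b.scheme, b.netloc)


def xor_digest(actor_ids):
    """XOR the SHA-256 digests of each actor id and return the result as hex."""
    acc = bytes(32)  # 32 zero bytes: the identity element for XOR
    for actor_id in actor_ids:
        h = hashlib.sha256(actor_id.encode("utf-8")).digest()
        acc = bytes(x ^ y for x, y in zip(acc, h))
    return acc.hex()


def synchronization_header(collection_id, url, followers, inbox_url):
    """Build a Collection-Synchronization value for one receiving inbox."""
    local = [f for f in followers if same_origin(f, inbox_url)]
    return 'collectionId="{}", url="{}", digest="{}"'.format(
        collection_id, url, xor_digest(local)
    )


followers = [
    "https://remote.example/users/alice",
    "https://remote.example/users/bob",
    "https://other.example/users/carol",
]
header = synchronization_header(
    "https://origin.example/users/claire/followers",
    "https://origin.example/users/claire/followers_synchronization",
    followers,
    "https://remote.example/inbox",
)
```

Because XOR is commutative, the receiver can compute the same value without agreeing on an ordering of its local followers.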

cjs · October 20, 2020, 3:46pm

I am sorry. I do want to thank you for enabling this discussion in the first place. I have learned a lot even when the discussions get tough. I know you put a lot of effort into the PR, and didn’t have to come here in the first place, and didn’t have to come back to revisit it. I am also sorry to @nightpool – I know we often see things differently, and I genuinely appreciate your persistent willingness to push back against me.

I’m not intending to block y’all doing what you need to do to fix the problems you’re seeing day-to-day. I just wanted to be sure objections were heard, they were recognized, even if not addressed. I am sorry that I got a bit upset around this, I do think y’all recognized them but in the moment it didn’t feel that way.

In that spirit, I don’t think I have any new objections, and don’t feel the need to revisit old ones. Feel empowered to carry on as you see fit.

Whichever course you choose (submitting the PR as-is with the HTTP header, reverting back to the original proposal, or something else), I hope I haven’t soured the thread so badly that y’all aren’t considering trying out a FEP. Since you’re breaking new ground it would be nice if it existed as a FEP so others could use your solution in an interoperable way.

Claire · October 21, 2020, 3:22pm

It’s fine. I’m in the process of writing a FEP regarding this extension. Hopefully I’ll have that ready soon.

Claire · October 25, 2020, 10:25am

Just an update that it has been implemented in the development version of Mastodon and has been formalized as FEP-8fcf: Followers collection synchronization across servers

RaphJ · November 25, 2020, 8:57am

Thinking about performance, I wonder whether we should introduce a hash-tree syncing mechanism for large collections of followers?

I mean, if the sending account has 20k followers, is the receiving instance supposed to fetch them all again (or even a subset of 10k for a large instance) when only a single follower mismatches?

I guess this could be a separate generic proposal in order to optimize synchronizing of any collection, as an alternative to collection paging
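One way to picture the hash-tree idea is a single level of buckets: both sides group followers by a prefix of each id's hash, compare one digest per bucket, and refetch only the buckets that differ. The sketch below is not part of the proposal; the bucketing scheme, the use of SHA-256 with XOR, and the names are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def bucket_digests(actor_ids, prefix_len=1):
    """Group ids by the first hex characters of their hash; XOR digests per bucket."""
    buckets = defaultdict(lambda: bytes(32))
    for actor_id in actor_ids:
        h = hashlib.sha256(actor_id.encode("utf-8")).digest()
        key = h.hex()[:prefix_len]
        buckets[key] = bytes(x ^ y for x, y in zip(buckets[key], h))
    return {key: value.hex() for key, value in buckets.items()}


def mismatched_buckets(sender, receiver):
    """Return the bucket keys whose digests differ between the two sides."""
    keys = set(sender) | set(receiver)
    return sorted(k for k in keys if sender.get(k) != receiver.get(k))


# Hypothetical example: the receiver is missing exactly one of 200 followers.
ids = [f"https://big.example/users/u{i}" for i in range(200)]
missing = ids[17]
sender_view = bucket_digests(ids)
receiver_view = bucket_digests([i for i in ids if i != missing])
to_refetch = mismatched_buckets(sender_view, receiver_view)
```

With a one-character prefix there are at most 16 buckets, so a single mismatch requires refetching roughly one sixteenth of the list; a deeper tree would narrow that further at the cost of more digests to exchange.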

BTW: Do you have any statistics about the topology of the Fediverse and the performance of ActivityPub? For example:

The distribution of the number of followers (average, max, percentiles, …)
How widely followers tend to be spread across instances
The number of AP requests for each instance, and the types of requests that consume the most resources (CPU, bandwidth, etc.)

Claire · December 1, 2020, 9:08am

I’m not worried about performance wrt. detecting issues (which should remain a rare occurrence), but indeed the current proposal is not very efficient for fixing them when large sets of followers are involved, as the receiving instance will fetch the whole list (filtered so it only contains followers on the receiving instance).

Unfortunately, I do not have any statistics about the topology of the Fediverse.