Data storage questions

@nutomic and I are working on federating Lemmy, and I have some questions about data storage and replication.

Questions on activitypub storage

  • User tables (at least in Pleroma) seem to hold both local and federated users. Does this mean that when two ActivityPub instances connect, each has to pull all of the other instance’s users, including their entire outboxes, into its own data storage model?
    • What about a user’s toots? Do all of a user’s toots also get pulled?
    • Is there some standard for this? Pulling all data from an instance?
  • What if one of those users changes their profile? How to deal with mutable data? How often should this data be “cached”?
  • Are all objects and activities posted to all instances, and replicated everywhere?
  • Many users of instance 1 are following user B on instance 2. User B toots, and that Create Note gets posted to the inboxes of all of those users on instance 1. How is this de-duped in data storage? Just by the id on the Create activity? Is there maybe just a single POST endpoint for all data?
  • When should OrderedCollection be used, and when UnorderedCollection?
Why would this be required? What purpose does this serve? I can say that Pleroma does not do this, and I’m not sure it’s even possible without spidering or crawling the other instance in some way (and even then you’d never be able to get everything).

I can’t speak for Pleroma, but when users change their profile on Mastodon, we send out an Update activity to the user’s followers. I believe we also refetch the profile if we’re processing an activity from the user and we haven’t refetched the profile recently, but I’m not sure what the actual cadence is. Either way, refetching profile metadata is definitely “best effort” in the absence of an explicit Update activity.

No, this is a very basic part of how ActivityPub works. Activities are posted only to the inboxes of the actors you wish to deliver them to. In a c2s model, those actors are specified using the to and cc properties; when not implementing c2s, you can choose the to and cc properties freely, based on the set of users you wish to deliver the activity to. For example, Mastodon delivers its activities to the inboxes of a user’s followers, as you can see in the targeting information below:


{
  "@context": "https://www.w3.org/ns/activitystreams",
  "id": "https://cybre.space/users/nightpool/statuses/102148702213596342/activity",
  "type": "Create",
  "actor": "https://cybre.space/users/nightpool",
  "published": "2019-05-24T02:38:20Z",
  "to": ["https://www.w3.org/ns/activitystreams#Public"],
  "cc": ["https://cybre.space/users/nightpool/followers"],
  "object": {
    ...
  }
}

OrderedCollection and Collection are both part of the ActivityStreams spec, and they’re defined there. I’ll quote the spec:

The items within a Collection can be ordered or unordered. The OrderedCollection type MAY be used to identify a Collection whose items are always ordered. In the JSON serialization, the unordered items of a Collection are represented using the items property while ordered items are represented using the orderedItems property.

If your Collection is ordered in any way, you should use OrderedCollection. The difference is that the items property on Collection gets parsed as an unordered set, while the orderedItems property gets parsed as an array, so unless you use orderedItems, other implementations might not retain any order information. For example, in a relational database I would represent a Collection using a simple join table, and for OrderedCollection I would add an order property to the table rows.
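A minimal sketch of that join-table idea, using Python’s built-in sqlite3. The table name, the column names, and the choice to keep the order in a nullable `position` column are assumptions for illustration, not how any particular implementation stores collections:

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    # Hypothetical schema: one join table serves both collection kinds.
    # `position` stays NULL for a plain Collection and is set for an
    # OrderedCollection.
    db.execute(
        "CREATE TABLE collection_items ("
        " collection_id TEXT NOT NULL,"
        " object_id TEXT NOT NULL,"
        " position INTEGER,"
        " PRIMARY KEY (collection_id, object_id))"
    )
    return db


def add_unordered(db: sqlite3.Connection, collection_id: str, object_id: str) -> None:
    # Membership only; adding the same object twice is a no-op.
    db.execute(
        "INSERT OR IGNORE INTO collection_items (collection_id, object_id)"
        " VALUES (?, ?)",
        (collection_id, object_id),
    )


def add_ordered(db: sqlite3.Connection, collection_id: str, object_id: str) -> None:
    # Append at the next free position within this collection.
    (next_pos,) = db.execute(
        "SELECT COALESCE(MAX(position), -1) + 1 FROM collection_items"
        " WHERE collection_id = ?",
        (collection_id,),
    ).fetchone()
    db.execute(
        "INSERT OR IGNORE INTO collection_items VALUES (?, ?, ?)",
        (collection_id, object_id, next_pos),
    )


def items(db: sqlite3.Connection, collection_id: str) -> set:
    # A plain Collection is read back as a set: no order is promised.
    rows = db.execute(
        "SELECT object_id FROM collection_items WHERE collection_id = ?",
        (collection_id,),
    )
    return {row[0] for row in rows}


def ordered_items(db: sqlite3.Connection, collection_id: str) -> list:
    # An OrderedCollection is read back newest first.
    rows = db.execute(
        "SELECT object_id FROM collection_items WHERE collection_id = ?"
        " ORDER BY position DESC",
        (collection_id,),
    )
    return [row[0] for row in rows]
```

Reading an ordered collection with `ORDER BY position DESC` gives the reverse insertion order, which matches reverse chronological order only if items are appended as they arrive.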

ActivityPub defines several properties on actors that should be OrderedCollections, and several properties where it allows implementations to decide whether the items should be ordered or not. For example, an actor’s inbox collection should always be an OrderedCollection, presented in reverse chronological order: https://www.w3.org/TR/activitypub/#collections, while the liked collection may be ordered or unordered, as the implementation prefers: https://www.w3.org/TR/activitypub/#liked.

Thanks! This is really helpful.

Does Mastodon have a users table that stores federated users (from other instances)? If so, how does it populate it? When it receives, let’s say, a Create Note from an actor, does it check its local DB to see whether that actor exists in its own users table, and populate it if not?

I’m having trouble with this part. Don’t I need to bring this data into my local data model, so that it can be queried and presented to my users?

Otherwise, I’d have to fetch the data on demand. This could mean my server having to do dozens of on-demand http fetches to get content.

And let’s say it’s not even content that’s been posted to my users’ inboxes, it’s just their public feed. Does this mean it has to be fetched on demand?

I wish this were documented a little better. It seems like every single follow activity would have to check for other users on that same inbox… and I have no idea where the sharedInbox fits in with the followers collection.

Outside of the sharedInbox though, is the flow:

  1. A user on instance 1 creates a note, it gets posted to the inbox of all their followers.
  2. A following user on instance 2 receives an inbox post, and brings that data into their local data model for querying.

You only pull the users as needed. For example, when you receive an activity made by an unknown user, you fetch that user and store it in your table.

Side note about my implementation: I decided to separate users from accounts. Every local user has an account, so there are rows in both tables for those. Every remote user only has a row in the users table because they aren’t supposed to log in to this instance.

I re-fetch the remote users whenever I receive an activity made by a user that was last fetched more than 24 hours ago.
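A minimal sketch of that fetch-when-unknown, refetch-when-stale rule. The class name and the injected `fetch` callable are hypothetical; in a real server `fetch` would be an HTTP GET of the actor’s id asking for an ActivityPub media type, and the dictionaries here would be rows in the users table:

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, Optional

# Assumed policy, taken from the post above: refetch after 24 hours.
REFETCH_AFTER = timedelta(hours=24)


class ActorCache:
    """Keeps remote actors locally and refreshes them lazily."""

    def __init__(self, fetch: Callable[[str], dict]):
        self._fetch = fetch  # hypothetical: retrieves the actor document
        self._actors: Dict[str, dict] = {}
        self._fetched_at: Dict[str, datetime] = {}

    def get(self, actor_id: str, now: Optional[datetime] = None) -> dict:
        now = now or datetime.now(timezone.utc)
        last = self._fetched_at.get(actor_id)
        # Fetch if the actor is unknown, or if the stored copy is stale.
        if last is None or now - last > REFETCH_AFTER:
            self._actors[actor_id] = self._fetch(actor_id)
            self._fetched_at[actor_id] = now
        return self._actors[actor_id]
```

Calling `get` with the actor id of every incoming activity means profiles are refreshed only for users who are actually active, which keeps the number of outgoing requests proportional to traffic rather than to the size of the users table.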

No, they’re posted according to addressing (the to/cc/bto/bcc fields). If a followers collection is addressed, you would get the list of the followers’ inboxes from your database (sharedInbox if available, otherwise simply inbox) and deduplicate it. That’s the way I do it, and it seems to be working so far. For individual actors, you’d simply post the activity to that actor’s inbox or to their instance’s sharedInbox.
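A minimal sketch of that inbox selection, assuming each follower is already stored locally as its actor document. The function name is hypothetical; `inbox` and `endpoints.sharedInbox` are the property names ActivityPub defines on actors:

```python
from typing import Iterable, List


def delivery_inboxes(followers: Iterable[dict]) -> List[str]:
    """Return each URL to POST to exactly once, in first-seen order."""
    seen = set()
    targets = []
    for actor in followers:
        endpoints = actor.get("endpoints") or {}
        # Prefer the instance-wide sharedInbox, fall back to the personal inbox.
        inbox = endpoints.get("sharedInbox") or actor.get("inbox")
        if inbox and inbox not in seen:
            seen.add(inbox)
            targets.append(inbox)
    return targets
```

With this, a hundred followers on one instance that advertises a sharedInbox collapse into a single delivery, while followers on instances without one still each get their own.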

Yes. That @id is supposed to be unique internet-wide. Also, sharedInbox exists precisely to avoid posting several copies of the same activity to different inboxes on the same instance. If you use it, you’ll only have to post one copy of your activity per instance.
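On the receiving side, a minimal sketch of de-duplicating by id: store each incoming activity under its id and let a primary key reject repeats. The table layout and function names are assumptions for illustration:

```python
import json
import sqlite3


def open_store() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    # Hypothetical table: the activity id is the primary key.
    db.execute("CREATE TABLE activities (id TEXT PRIMARY KEY, body TEXT NOT NULL)")
    return db


def store_once(db: sqlite3.Connection, activity: dict) -> bool:
    """Store the activity; return False if this id was already stored."""
    try:
        db.execute(
            "INSERT INTO activities (id, body) VALUES (?, ?)",
            (activity["id"], json.dumps(activity)),
        )
    except sqlite3.IntegrityError:
        # Same id delivered again, for example to another local inbox.
        return False
    return True
```

Which local users should see the activity is then a separate question, answered from its addressing and your follow relationships rather than from how many times it was delivered.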

Depends on the semantics of your collection. Where the order is important (for example, outboxes where activities are sorted reverse chronologically) you’d use OrderedCollection. Otherwise, I think it’s fine to use any.

Thanks!

Things like this are extremely helpful.

One question: what about data that is only stored remotely, such as a remote user’s public feed? Do you fetch that data on demand and, since it’s not in your local data model, apply different transformations to show it on your front end? Or do you simply not show it, and force your users to view that instance directly?

In my app currently, I’m only showing data on the front end that comes from my local data model, but this seems to imply that I need to be able to have a completely different path, that fetches activitypub user feeds on demand, and transform them into objects my front end can read, completely bypassing my local data model / stored data.

Also for both @nightpool and @grishka, is there any more documentation for how sharedInboxes work? This doc isn’t really showing how it works in practice.

Thanks!

This is still an open question for me. Jumping back and forth between instances isn’t a great UX — imagine opening someone’s profile on their instance and wanting to like their post that your instance doesn’t have. Fetching all of the user’s posts in advance isn’t good because, well, you’ll potentially download and store a lot of data.

Someone somewhere suggested fetching remote collections “as needed”, almost proxying. That was for followers/following which won’t work for me, but it has a good chance of working for posts. Like when someone requests the next page of someone’s post feed and you don’t have it, you’d fetch it from the remote server, store it locally and then render the page.
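A minimal sketch of that “fetch when missing, then store” idea for pages of a remote collection. The class name and the injected `fetch_page` callable are hypothetical; a real version would make an HTTP request and write into the local data model instead of a dictionary:

```python
from typing import Callable, Dict, List


class PageCache:
    """Fetches a remote collection page only the first time it is requested."""

    def __init__(self, fetch_page: Callable[[str], dict]):
        self._fetch_page = fetch_page  # hypothetical: retrieves one page by URL
        self._pages: Dict[str, dict] = {}

    def get_page(self, page_url: str) -> dict:
        if page_url not in self._pages:
            # Not stored locally yet: fetch from the remote server and keep it.
            self._pages[page_url] = self._fetch_page(page_url)
        return self._pages[page_url]


def page_items(page: dict) -> List:
    # Ordered pages carry orderedItems, unordered ones carry items.
    return page.get("orderedItems") or page.get("items") or []
```

This sketch never expires a stored page, so a real implementation would also need some refresh rule for the first page, where new posts appear.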

Mastodon doesn’t seem to be fetching any collections for remote users at all. Not even followers/following.


This says “40 following”, but only shows 4 users that are from mastodon.social, the instance I took this screenshot from.

I don’t think there’s any. On my server I don’t even make a distinction between the two, they’re both handled by the same code even though they’re different URLs. That said, I think this post series might answer some of the questions you currently have and might have in the future.

About sharedInbox, maybe this thread has some clarification.
If not, please add it to the guide:
Introduction to ActivityPub

About “I’m having trouble with this part. Don’t I need to bring this data into my local data model, so that it can be queried and presented to my users?”:

Yes, but this is specific to your app. We talked about it in the journalism session at 36c3, and everybody handles this differently apart from the “id”. For example, some only store hashtags, while others store a special search index and a link index plus from/to relations.

is there any more documentation for how sharedInboxes work?

See the section “What’s special about the sharedInbox?” in Draft: Guide for new ActivityPub implementers. Does that help?
