Pruning of remote content

This week I sorted out a feature that I've been putting off for far too long... content pruning!

ActivityPub had been enabled since March of this year, and in that time, the database had accumulated tens of thousands¹ of pieces of content from the fediverse. This was causing our database to grow, but that was hardly insurmountable, since it's mostly text.

However, the principle of the matter was that we were being pushed a mountain of content, most of which wasn't even consumed. For reference, when I ran the script against the data on this site:

2024-06-10T18:04:52.445Z [4567,4568/3631171] - info: [notes/prune] Found 32531 topics older than 30 days (since last activity).

We then take those topics and determine whether each one received "engagement": essentially, whether a local user replied to or liked a post within the topic.

Filtering for those, we get:

2024-06-10T18:07:31.860Z [4567,4568/3631171] - info: [notes/prune] 32252 topics eligible for pruning

... which essentially means ~99.14% of those topics (32252 of 32531) received zero engagement. That's not entirely surprising given that content streams in at all hours of the day, and I only reply to or like a fraction of it.

In the backend, I'm making the age cutoff configurable, with the default being 30 days.
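
To make the selection logic concrete, here is a minimal Python sketch of the two steps described above: an age cutoff based on last activity, followed by an engagement check. It is not the actual implementation. The `RemoteTopic` shape, its field names, and the `prunable_topics` helper are all assumptions made up for illustration; the only details taken from the post are the 30-day default and the "replied to or liked" definition of engagement.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class RemoteTopic:
    # Hypothetical shape; these field names are illustrative, not a real schema.
    tid: int
    last_activity: datetime  # time of the most recent activity in the topic
    local_replies: int       # replies written by users of this site
    local_likes: int         # likes given by users of this site


def prunable_topics(topics, now, max_age_days=30):
    """Return topics idle for longer than max_age_days with no local engagement."""
    cutoff = now - timedelta(days=max_age_days)
    # Step 1: keep only topics whose last activity is older than the cutoff.
    stale = [t for t in topics if t.last_activity < cutoff]
    # Step 2: of those, keep only topics nobody here replied to or liked.
    return [t for t in stale if t.local_replies == 0 and t.local_likes == 0]


now = datetime(2024, 6, 10, tzinfo=timezone.utc)
topics = [
    RemoteTopic(1, now - timedelta(days=45), 0, 0),  # stale and ignored: prunable
    RemoteTopic(2, now - timedelta(days=45), 1, 0),  # stale but replied to: kept
    RemoteTopic(3, now - timedelta(days=5), 0, 0),   # recent: kept
]
print([t.tid for t in prunable_topics(topics, now)])
```

Passing a different `max_age_days` stands in for the configurable cutoff; a real implementation would presumably query the database in batches rather than filter an in-memory list.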

Next up, remote user pruning!


¹ Actually, the number ended up being closer to 116k pieces of content. In contrast, this forum has been running for about a decade and only recorded just under 100k posts!