Image fetching - is DOFV (Download on First View) AP-compliant?

strypey · September 28, 2024, 7:18am

I’m told that it’s inherent to ActivityPub that if an account someone using my server follows posts an image, my server goes and actively downloads it and stores it in its local storage. Is this correct? Because except in use cases where every person following an account wants to see every image posted on it, DOFV (Download on First View) seems like a wiser default. But would this be AP-compliant?

Edit: I changed “Use” to “View” in the acronym before I initially posted, but forgot to change “DOFU” to “DOFV”. Fixed.

kopper · September 28, 2024, 10:41am

AFAICT, ActivityPub or its adjacent specs don’t specify how to handle media. Some software (like Mastodon) just downloads everything immediately, some proxies on request (with some caching in front), and some just passes the sent media URL as is directly for clients to show.
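To make the "fetch on request, then cache" option concrete, here is a minimal Python sketch of a download-on-first-view cache. It is an illustration only, not how any particular server implements it: the `FirstViewMediaCache` class, the `fetch` callable, and the on-disk layout are all assumptions made for this example.

```python
import hashlib
from pathlib import Path
from typing import Callable


class FirstViewMediaCache:
    """Fetch remote media only when a local user first requests it (hypothetical helper)."""

    def __init__(self, cache_dir: Path, fetch: Callable[[str], bytes]):
        self.cache_dir = cache_dir
        self.fetch = fetch  # assumed downloader, e.g. a wrapper around an HTTP GET
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path_for(self, url: str) -> Path:
        # Hash the URL so arbitrary remote URLs map to safe local file names.
        return self.cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()

    def get(self, url: str) -> bytes:
        path = self._path_for(url)
        if path.exists():
            # Later views are served from local storage.
            return path.read_bytes()
        # First view: exactly one request goes to the origin server.
        data = self.fetch(url)
        path.write_bytes(data)
        return data
```

Nothing is downloaded when the post arrives; the origin is contacted only when `get()` is first called for a URL, and media that nobody opens never touches local storage.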

nightpool · September 28, 2024, 3:39pm

I’m told that it’s inherent to ActivityPub that if an account someone using my server follows posts an image, my server goes and actively downloads it and stores it in its local storage

It’s not part of the spec in any way, shape, or form. But many implementors have looked at the problem and concluded it was the most performant and privacy-preserving solution.

Because except in use cases where every person following an account wants to see every images posted on it, DOFU (Download on First View) seems like a wiser default

All you’re doing is adding more latency to the first user to see the image, and you’re getting very little benefit by doing so. You’re basically optimizing for posts that nobody will ever look at—why? Do you not have active users on your social network? When I was running a Mastodon server, well over 75% of posts made got viewed by somebody within a few seconds of them being created. Our communities were very dense—everybody followed everybody else. Maybe if you’re running a very small or single-user instance you could get away with this (stop downloading images for the 8 hours you’re asleep?). But now you’re also leaking more privacy information—it’s easier for people to track when you’re at your computer and when you’re away. So I don’t think this tradeoff would make sense for most implementors.

Link previews on Mastodon used to be DOFU and in fact specifically moved from being DOFU to “download 30-90 seconds at random after receiving them” due to the latter strategy DECREASING the amount of bandwidth experienced by the remote server. So I think you can use that to argue that DOFU hasn’t really worked well to reduce costs in practice.
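As a rough illustration of the "30-90 seconds at random" strategy described above, the following Python sketch shows how many receiving servers could spread their preview fetches over time instead of all hitting the linked site at once. The function names and the simulation are assumptions for this example, not Mastodon's actual code.

```python
import random


def preview_fetch_delay(min_seconds: float = 30.0, max_seconds: float = 90.0, rng=random) -> float:
    """Pick a random delay before fetching a link preview (hypothetical helper)."""
    return rng.uniform(min_seconds, max_seconds)


def schedule_preview_fetches(received_at: float, server_count: int, seed=None) -> list[float]:
    """Simulate when `server_count` servers would each fetch the same link."""
    rng = random.Random(seed)
    return sorted(received_at + preview_fetch_delay(rng=rng) for _ in range(server_count))
```

With the default bounds, every simulated fetch lands somewhere in the 60-second window that starts 30 seconds after the post is received.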

stevebate · September 29, 2024, 6:13am

I run a single user Mastodon instance with several relays and I’d be surprised if my view rates are anywhere near that high. How do I measure it? I don’t see a flag in the database Statuses table to indicate a status has been viewed.

I understand that Mastodon is optimized for larger instances, but it would be nice to have DOFV as an opt-in feature for smaller instances. Mastodon consumes a crazy amount of disk space (even with aggressive periodic media cache pruning) to store media that is never viewed (in my case).

strypey · October 2, 2024, 5:25am

Ok, another wall of text from me. Sorry! Strap in…

Mastodon’s defaults on fetching and caching seem whacky to me. The admin of the server I’m on says the default is to download all text/image posts, and then delete it all after a week.

Including posts that @mention me. Even ones that are part of the context for ongoing conversations. Even ones I haven’t even seen. Even the Direct posts that @mention me.

Why?

My dream defaults are quite different.

Yes, there will be a bit more latency for them. One obvious benefit is never downloading images for posts nobody ever sees. Reducing storage costs for the receiving server, and bandwidth use at both ends.

A less obvious benefit is that a server never ends up storing dodgy media (potentially illegal stuff like CSAM), without someone using their server seeing it. If it is viewed, and therefore downloaded, either the person who viewed it will report it for purging (and further action as appropriate), or mods can hold them responsible for not reporting it.

That’s exactly what Mastodon’s doing by downloading images before anyone wants to look at them. Potentially wasting piles of storage. Which is then addressed by aggressively pruning text data people on the server will want to see. Text that uses a fraction of the per-post storage space used by image data. To me, this seems whacky, ‘cheery Londoners dancing on roofs with brooms’ level whacky.

What you’re optimising for with DOFV is precisely the images people do want to look at. As evidenced by the fact that someone has. With a whole lot less media flying around the network unnecessarily, the overall latency of everything would likely improve.

This may surprise you, but not everyone looks at every post made by everyone their account follows. I follow anyone who posts stuff I like. I look at very little of it. But if I do a search in my app, I’m more likely to find good stuff (or for specific searches, any stuff at all). As is anyone else using the same server.

At least we were, before Mastodon started stuffing it all down the memory hole after a week. Our admin pushed out auto-pruning to a month after I queried what’s going on. But we’re still losing our conversation history on a daily basis.

In which case wouldn’t there be latency for the first person viewing them anyway?

How many were never viewed at all? Wasting significant resources in a totally avoidable way (see above). Now multiply that by every server in the network. Now imagine the network being 10 times bigger in 5 years, and 100 times bigger in 10 (just as a hypothetical).

The waste of storage and bandwidth scales up with growth of the network. Convince me this is good design.

Ok, firstly, this seems like a purely theoretical risk. Most people who use social media walk around with a computer in our pockets. How sensitive is data about the times it’s in our hands instead? In XMPP networks, telegraphing presence is a feature, not a bug.

Also, as you say, this is really only an issue for single-account servers. If this is part of their threat model, they can just change the default to suit their needs.

But in general, people using single-account servers are even more likely to follow more stuff than they’ll ever look at. To increase the scope of discovery (see the linked Fediverse Ideas issue about my dream defaults). So unless they have a generous budget for storage, or they don’t pay for it (eg server in the closet), a DOFV default for anything beyond post metadata and text still makes more sense to me.

Secondly, I think we need to watch out for prioritising a theoretical risk over solving a clear and present danger: making running a small-to-medium fediverse server so expensive that it becomes unaffordable for most communities and organisations to even consider hosting their own. That’s a big part of how the DataFarmers ate the web in the first place.

This is a totally different situation, involving interaction between the fediverse and the document web. Also AFAIK the Mastodon policy on link previews wasn’t DOFV, it was DOEV (Download On Every Viewing). Thus slashdotting any non-fediverse site linked in a post that went viral across the verse.

The solution they went with was, as you say, to change the style of DOEV…

So Mastodon traffic is still caning the webservers hosting people’s blogs etc, avoidably consuming their resources. Just slowly enough not to DDoS them off the web. Moving to DOFV would actually solve the problem.

I’m not convinced that Mastodon is optimised at all. See;

bumblefudge · October 2, 2024, 8:42am

re: DOEV versus caching, deduplication seems like the elephant in the room, and there’s even an argument to be made for shared caching for at least moderation purposes (i.e. by IFTAS) if not also for economy-of-scale and discovery purposes (i.e. by Relays), particularly if the average server size stays below 100 users…

In any case, I wrote up some notes on this when I was researching how IPFS might help in AP and/or Mastodon-API implementations:

erincandescent · October 2, 2024, 10:59am

FWIW even in Mastodon you can set different pruning times for images & for posts themselves (it’ll try and refetch expired images on first view). In the 5 years I was operating a Mastodon server, we never pruned old posts; in fact it wasn’t a feature before 4.0.

erincandescent · October 2, 2024, 11:08am

I don’t think many implementors have looked at it and decided it was the most performant option. In fact I’m not sure anyone besides Mastodon does this.

Both Akkoma/Pleroma and Misskey offer the option of either hotlinking (which might be ideal for single user instances, where no real anonymisation is possible anyway) or their media proxy feature. In the latter case, there may be caching (Akkoma instructions have information on how to configure most webservers to cache said proxied images), and there can also be cache pre-warming (download-on-post-reception).

But even if you’re using a caching media proxy with fetch-on-reception, it’s a much smarter setup than is used by Mastodon - images are not kept for a fixed N days and generally there is a much smarter (e.g. LRU) eviction algorithm.
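For readers unfamiliar with the difference, here is a minimal Python sketch of LRU eviction under a storage budget, as opposed to deleting everything older than N days. It is an in-memory toy, assuming a hypothetical `LruMediaCache` class; real media proxies typically leave this to the web server's cache.

```python
from collections import OrderedDict
from typing import Optional


class LruMediaCache:
    """Keep media under a byte budget, evicting the least recently viewed first."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._items: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, url: str) -> Optional[bytes]:
        data = self._items.get(url)
        if data is not None:
            # A view marks the item as most recently used.
            self._items.move_to_end(url)
        return data

    def put(self, url: str, data: bytes) -> None:
        if url in self._items:
            self.used_bytes -= len(self._items.pop(url))
        self._items[url] = data
        self.used_bytes += len(data)
        # Evict from the least recently used end until we fit the budget.
        while self.used_bytes > self.max_bytes and self._items:
            _, evicted = self._items.popitem(last=False)
            self.used_bytes -= len(evicted)
```

Under this policy an image that keeps getting viewed stays cached indefinitely, while media nobody looks at is the first to go once space runs out.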

(And, an old pet peeve of mine: When media drops out of the cache, Mastodon is soo slow at refetching it - it seems like it fetches it in its entirety before forwarding any of it on to the viewer, rather than streaming it from origin?)
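The streaming alternative mentioned here can be sketched in a few lines of Python. This is only an illustration of the general technique, with a hypothetical `stream_and_cache` function working on file-like objects; it makes no claim about what Mastodon actually does internally.

```python
import io
from typing import BinaryIO, Iterator


def stream_and_cache(origin: BinaryIO, cache_file: BinaryIO, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield chunks to the viewer as they arrive, while also writing them to the cache."""
    while True:
        chunk = origin.read(chunk_size)
        if not chunk:
            break
        cache_file.write(chunk)  # refill the cache as a side effect
        yield chunk              # the viewer gets this chunk without waiting for the rest
```

For example, with `origin = io.BytesIO(b"abcdefghij")` and `chunk_size=4`, the viewer receives `b"abcd"` as soon as the first read completes, before the remaining bytes have been read from the origin.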

tesaguri · October 2, 2024, 12:44pm

Well, if tracking were a purely theoretical threat, you wouldn’t need to be afraid of big tech companies collecting your private data and browsers wouldn’t need to implement sophisticated tracking protection features. And browsers’ tracking protection isn’t sufficient by default either, because browsers still expose plenty of fingerprinting information by default, such as User-Agent and Accept-Language, since these are too useful for ensuring a minimum level of experience for average users. That data, combined with your IP address, is usually sufficient for identifying you among hundreds of others in the access log.

Also, the fingerprinting information isn’t needed at all if a tracker DMs you an image URL with a tracking parameter, a common technique in the e-mail world. When your client downloads the image, the tracker not only learns that you viewed the post at that time, but also gets your fingerprint to link with your past and future activities elsewhere.

“Presence status” is a feature only because you opt into making it public, and I don’t think it should be exposed without your knowledge. I believe Mastodon’s behavior is a sensible default. If there is a problem, it is probably that they don’t offer a configuration to opt out of that behavior, not that it’s a bad default.

devnull · October 2, 2024, 2:50pm

I’ll add my two cents.

In my ten years developing NodeBB from the ground up, I have been approached exactly two times about images being hotlinked, and that causing an IP address leakage.

Both of those times were from a self-proclaimed “security researcher” spraying vulnerability reports to everyone who would listen, and praying that someone would pay them for it.

Nevertheless, NodeBB supports the use of a camo proxy through a plugin, and that’s that. Otherwise we hotlink directly. It’s how the internet was built, and it works absolutely fine in the vast majority of cases.

As for the “75% viewership” metric, I’ll counter with my own. My instance when I first added scheduled pruning was seeing view rates of around 0.01%. As in 99.9% of the content we received wasn’t interacted with in any way. Prematurely downloading and processing media for that would be an absolute waste of resources.

strypey · October 2, 2024, 10:02pm

Intriguing. Did you have a look at the dream defaults linked in my post? Is there a way to keep only the text posts that @mention accounts on the server, or have hashtag, while aggressively pruning everything else?

In its current form, it’s not a feature, it’s a bug.

Well, that strawman won’t be getting up again. But since you seem to have missed all the points I actually made, as well as those made in the responses by @bumblefudge and @erincandescent, all I can suggest is have another read? The response following yours by @devnull is worth a read too.

BTW welcome to the party @devnull! (this may be belated, I’ve been away for a while).

nightpool · October 5, 2024, 6:09pm

I mean, the answer explains itself, right? Relays are antithetical to the design of AP and Mastodon. They were a bad idea that has refused to die. Obviously, if you use a relay you’re going to get thousands of useless posts that nobody cares about. That’s the whole point of how relays work.

Sure, that makes sense. NodeBB is a completely different UX than a microblogging platform, so it makes sense that the view rates would be completely different. As you might be able to see from my views about relays, I personally think that a UX design that leads to downloading 100x more content than you need is a pretty inefficient one. Have you considered moving to a client-based model where posts are fetched from the remote server instead of a push-based model? This is one of the reasons I don’t think AP is a good fit for the “threadiverse”: the interaction modalities are just too different.

stevebate · October 5, 2024, 6:18pm

We’re talking about images, not posts. The images are not pushed to my server.

trwnh · October 5, 2024, 6:39pm

you mean “how relays currently work”? because sure, a service that exists only as a firehose is not very useful unless you intend to process that firehose. but there are myriad other ways that relays could be structured. it could be a small members-only group for example. it could share reports instead of posts. it could function like irc where the relay represents a room. there are so many uses for the general concept that are all far more interesting and useful than the way they ended up implemented 5 or 6 years ago (and then never iterated on, except now they’re sort of being reworked into Fediverse Auxiliary Service Providers which as far as i can tell don’t work over activitypub, they use their own bespoke apis).

not speaking for nodebb here but i see “interact over activitypub” in the same way that you can comment on github issues via email.

strypey · October 5, 2024, 7:10pm

Again, I ask you to look at this from the POV of a person trying to use the fediverse. Particularly a newcomer.

If Alice sets up a new server - especially a single-account server - it is a tabula rasa. Like a new email address. AFAIK there is no in-built discovery mechanism that will allow people using Alice’s server to find people to interact with. Because all the existing discovery tools rely on searching posts the server has already downloaded from followed accounts. Without anyone to interact with, there’s no easy way for anyone outside Alice’s server to discover and interact with people using it.

Connecting to a relay gives Alice and other people using her server access to thousands of posts that may or may not be useful in themselves. But it does allow them to find some people to follow, or reply to, or favourite/boost. Which starts to populate Alice’s server with a database of posts that can be searched (either by hashtag or keyword), and helps people on other servers find accounts on her server, and follow or interact with them.

You seem enthusiastic about sneering at other people’s ideas and innovations. But I’m still waiting for a reply to the arguments I made for DOFV in response to your earlier comments, and my critique of Mastodon’s whacky data retention defaults.

devnull · October 7, 2024, 1:58pm

Wanted to quickly address this because there might be a misunderstanding. Technically you may be right that a forum would not need the entire firehose of fedi data. When a forum only interacts with the fediverse by federating out and only accepting replies to topics it already knows about, then yes, the view rates would be so wildly different as to be incomparable.

Discourse (right now, can’t speak for later) is one of those platforms. Only the federating categories send content outwards, and only replies to topics already known about are accepted — everything else is dropped.

NodeBB is not like that. There is a discovery mode called “world” which contains that firehose, scoped to users you follow (akin to the Mastodon home timeline), and selecting the view to show recent posts scoped to the “Uncategorized” category is akin to the Mastodon “Explore” tab. You can follow others from NodeBB and use it as a full ActivityPub client if you wish.

So while the quoted text above is not incorrect (view rates are different), it’s not completely different.

devnull · October 7, 2024, 2:03pm

This is the main hurdle that relays attempt to overcome, and I’d say they do a decent job of it considering there is no centralized algorithm to artificially boost content.

Part of my integrating AP in NodeBB is because starting a forum sucks. It sucked before social media, but is worse now because the alternatives are orders of magnitude easier to start with, even though the UX and data privacy issues are a nightmare.

AP promised to solve this, but does not deliver on that promise until you put in a fair amount of manual effort. A great deal less effort than normal (i.e. normal being @aschrijver going to great lengths on fedi to get people to post here), but effort nonetheless. His workload is lessened somewhat now that some categories are federating, but would be lessened even more if relays were used and content posted here were automatically shared with a number of servers.

strypey · October 8, 2024, 10:43am

Thanks for this insight @devnull. You summed up in a couple of phrases what my paragraph-length use story was trying to explain;

For “forum”, substitute anything subject to network effects, and the point remains.

Bringing this back to the starting topic, if you did this, you could ingest and store a lot of text without any major hassles. But you wouldn’t want all of those servers downloading and storing all Hamish’s meme graphics. Unless and until someone using the server was actually reading the post they’re in. Ergo; DOFV.