FEP-5feb: Search indexing consent for actors

Hello!

This is a discussion thread for the proposed FEP-5feb: Search indexing consent for actors.
Please use this thread to discuss the proposed FEP and any potential problems
or improvements that can be addressed.

Summary

This FEP introduces an actor-level attribute that can be used to explicitly express an actor’s consent (or lack thereof) to their public objects being indexed for search purposes.

Akin to robots.txt and noindex meta tags, this attribute is advisory and relies on indexers respecting the directive, as public objects cannot technically be prevented from being indexed.

What is the scope of the directive? Does it include the actor’s local instance in addition to remote instances?

The FEP seems to conflate “indexing” (an object retrieval optimization technique) with “availability for search” (search authorization). The text implies that objects may be available for search by the author but “MUST NOT be made available for search to other users”. To be available for search, even only for the author, the objects would typically have been indexed. I’m wondering if indexing and search authorization should be considered separately. Of course, the name of the property implies the object will not be indexed, which would usually mean it’s not available for search by anyone, including the author.

Does preventing indexing also mean the actor’s profile cannot be searched? If so, does that mean the Mastodon account search will not resolve that actor’s account identifier and that any potential followers will need to know the address already (and that similar behavior is expected from other servers)?

There are references to robots.txt and noindex, but this is a simpler mechanism. Was there any discussion about a rule-based approach (like search opt-in/opt-out for specific instances, for example)?

A missing indexable attribute SHOULD be handled as indexable: false.

What’s the reason for this? In the case of Lemmy, the assumption is definitely that all content is public and searchable. In fact, there is already a search engine for Lemmy, and people are working on better SEO so that Lemmy shows up on Google etc.

I would say yes, this should include the local instance as well, although to be honest this FEP is written with Mastodon in mind, where only S2S is provided, so the actor is only represented for other servers.

I agree the name of the attribute is not ideal. The intent is really to provide consent for the purpose of generalized search.

There was not. There was a discussion of a per-object flag, but that was ruled out as impractical if a user were to change their mind.

The reason is, again, rooted in Mastodon’s history: from the moment search features were introduced back in 2017, they were purposefully limited to not have generalized search. This means that (many) Mastodon users have that expectation, and treating it otherwise would suddenly make their existing posts searchable by default without their explicit consent. We understand that different projects have different expectations, and that is why this is a SHOULD and not a MUST.
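
A minimal sketch of that default, assuming the actor document has already been parsed into a Hash. The helper name `indexing_consent?` is hypothetical and is not taken from the FEP or from Mastodon:

```ruby
# Hypothetical helper: only an explicit boolean `true` counts as consent.
# A missing, null, or non-boolean `indexable` is handled as `indexable: false`,
# which is the SHOULD-level default discussed above.
def indexing_consent?(actor)
  actor['indexable'] == true
end

with_consent    = { 'type' => 'Person', 'indexable' => true }
without_consent = { 'type' => 'Person', 'indexable' => false }
unspecified     = { 'type' => 'Person' }

puts indexing_consent?(with_consent)
puts indexing_consent?(without_consent)
puts indexing_consent?(unspecified)
```

A project with different expectations (Lemmy, for example) could flip the fallback for missing values while still honouring an explicit `false`.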

There are numerous social norms, social contracts, legislation and regulation that reflect a growing desire for social media users to be able to expect a certain level of privacy, even for content posted in “public”. In ActivityPub settings, there are various schools of policy and thought that show end users wish to describe the intent for their content, with a reasonable expectation that they are sharing content with defined groups or servers or followers etc., without additionally allowing third parties or unintended databases or footprints of their content to be created and used for purposes outside their stated intent.

Service platforms are adopting strategies to allow individual users to signal their intent, and good faith services can choose to respect these signals. Just because something /can/ be scraped and searched, doesn’t mean it /should/ be scraped and searched. Some problems are technical, some problems are social. This one is a heady mix of both.

Some reading:

Handling updates to the indexable attribute
Whenever an actor is updated and its attribute is set to indexable: true, its objects SHOULD be made available for search as described in the previous section.
Whenever an actor is updated and its attribute is set to indexable: false, its objects MUST be removed from search as described in the previous section.

This section should contain some disclaimers that indexing is expensive, so implementers need to take a lot of care not to open up a DoS vector.

Personally, I also feel that “indexing is expensive” will lead people to treat this property as “searchable”. Indexing everything, and then not showing results, is probably cheaper as far as computational costs go.
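
To make the “index everything, filter at query time” reading concrete, here is a toy in-memory sketch. The class and method names are hypothetical, and whether keeping non-consenting posts in the index at all is acceptable is a policy question this sketch doesn’t settle:

```ruby
require 'set'

# Toy index, for illustration only. Every public post is stored; consent is
# checked when a query runs, so an actor Update never triggers re-indexing.
class ToySearchIndex
  def initialize
    @posts = []            # all received public posts, regardless of consent
    @consenting = Set.new  # actor IDs whose latest `indexable` value was true
  end

  def add_post(actor_id, text)
    @posts << { actor: actor_id, text: text }
  end

  # Called when an actor Update arrives: a cheap set operation.
  def update_consent(actor_id, indexable)
    indexable == true ? @consenting.add(actor_id) : @consenting.delete(actor_id)
  end

  # Authors can always find their own posts; other users only see posts
  # from actors who currently consent.
  def search(term, searcher_id)
    needle = term.downcase
    @posts.select do |post|
      post[:text].downcase.include?(needle) &&
        (post[:actor] == searcher_id || @consenting.include?(post[:actor]))
    end
  end
end
```

With this shape, flipping `indexable` costs one set update per actor instead of one index write per post, which is the cost asymmetry described above.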

IMO there needs to be a distinction between “user and their content can be found using search on other servers, as long as relevant objects are already cached locally” vs “profile and content can be indexed by search engines”. “Their public objects being indexed for search purposes” is too vague and can mean both.

I really don’t like that this is a property on one object that’s supposed to influence the handling of other objects. It would make sense on the content objects themselves, or even on activities that convey the object. But having to look up a property on the actor that an object is attributed to in order to know how to process the object adds a lot of complexity and extra round trips to either the origin server, the local cache, or both.

:100:

I would also include that I consider “searchable” the natural step above Public in Mastodon’s visibility settings, i.e.

Searchable > Public > Unlisted > Followers only > Mentioned people only

We discussed doing it per-object, but this makes changing your mind impractical: you’d need to update every single object to make them searchable or non-searchable, while with the current proposal you need to only transmit one message.

Additionally, it is likely that you already have actor information at hand when processing a new object from that actor.

Changing your mind is impractical no matter what. You can’t put the genie back in the bottle.

We think it’s important for someone to be able to change their mind on this, for the same reasons it’s important to provide the feature in the first place, even though the posts are public and can technically be indexed without your consent.

And this is not just about changing your mind! But also about making your choice explicit for your past posts when there was previously no provision to do so.

I’m not sure that’s viable in practice. It’s hard to imagine re/de-indexing historical posts from actors on other servers every time they change their mind. That seems like it’s just inventing new ways to be vulnerable to DOS attacks.

And then, here’s another point to consider: the AP spec allows for activities and objects to be attributed to multiple actors. Which could represent co-authorship, as one example. How should services that receive an activity attributed to multiple actors with different indexing consent flags proceed? And what if only one of those multiple actors changes their mind later?
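
The FEP text quoted in this thread doesn’t answer that. Purely as an illustration of one conservative option (an assumption, not something the proposal specifies), a receiver could require consent from every attributed actor. The helper name and the `consent_by_actor` lookup are hypothetical:

```ruby
# Hypothetical policy: an object is searchable only if *every* actor in
# `attributedTo` currently consents. `consent_by_actor` maps actor IDs to the
# last `indexable` value seen for them; a missing entry means no consent.
def searchable_object?(object, consent_by_actor)
  # `attributedTo` may be a single ID, an embedded object, or an array of either.
  authors = [object['attributedTo']].flatten.compact.map do |author|
    author.is_a?(Hash) ? author['id'] : author
  end
  return false if authors.empty?

  authors.all? { |id| consent_by_actor[id] == true }
end
```

Under this policy, one co-author withdrawing consent removes the object from search for everyone, which may or may not be the desired outcome.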

I think what is important for FEP-5feb is that it is viable for Mastodon. I doubt that many other implementations will be willing to implement the reindexing requirement on Actor updates.

Fortunately, a virtue of FEP is that a FEP is a proposal. So if one thinks one can do better, one can submit a competing proposal.

It has been viable for good-faith uses on mastodon.social so far, but as @helge pointed out earlier, it makes sense to outline the possible costs, and I will amend it.

Maybe something like:

Whenever an actor is updated and its attribute is set to indexable: true, its objects SHOULD be made available for search as described in the previous section. This MAY be delayed, rate-limited, batched, or throttled to limit costly indexing operations.

Whenever an actor is updated and its attribute is set to indexable: false, its objects MUST be removed from search as described in the previous section. This SHOULD happen in a timely manner, but it MAY be delayed or batched to limit costly indexing operations.
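
As a rough sketch of what “delayed or batched” could look like, assuming an in-memory queue (a real server would use its own job system; the class and method names here are hypothetical):

```ruby
# Hypothetical queue that coalesces consent changes per actor and hands them
# out in bounded batches.
class ConsentChangeQueue
  def initialize(batch_size:)
    @batch_size = batch_size
    @pending = {} # actor ID => latest requested state (true or false)
  end

  # Repeated flips by the same actor collapse into a single pending entry,
  # which bounds the work a rapidly flip-flopping actor can cause.
  def enqueue(actor_id, indexable)
    @pending[actor_id] = (indexable == true)
  end

  # Removals are served before additions, since removal SHOULD be timely
  # while making objects searchable MAY simply wait.
  def next_batch
    removals, additions = @pending.partition { |_id, indexable| !indexable }
    batch = (removals + additions).first(@batch_size)
    batch.each { |id, _indexable| @pending.delete(id) }
    batch
  end
end
```

A worker would call `next_batch` on a timer and re-index or de-index each actor’s objects, so the cost per interval is capped by `batch_size`.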

@Claire it is a minor change to the FEP, but I feel it’d be better if the example JSON-LD in the text showed indexable set to false, since best practice is to treat this as opt-in. Explicitly mentioning the ‘humane technology’ best practice of “implement as opt-in” might be a further improvement.

I think indexable should be nullable, and a null value should be treated as a third state rather than as indexable: false, much like the way most browsers implement the DNT header.

I think the default interpretation of indexable should reflect the service user’s intention to be indexed or not, rather than Mastodon’s philosophy.
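
A minimal sketch of that three-state reading, assuming a parsed actor Hash. The helper name and the state symbols are hypothetical:

```ruby
# Hypothetical mapping from the raw property to three states, so that
# "the user said no" and "the user never said anything" stay distinguishable.
def consent_state(actor)
  case actor['indexable']
  when true  then :granted
  when false then :denied
  else :unspecified # missing, null, or any non-boolean value
  end
end

puts consent_state({ 'indexable' => true })
puts consent_state({ 'indexable' => nil })
```

Each implementation could then decide how to treat `:unspecified` according to its own users’ expectations.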

There is prior art for a search consent extension. Fedibird, a fork of Mastodon, has a searchableBy property (JSON-LD term definition: {"@id": "http://fedibird.com/ns#searchableBy", "@type": "@id"}) for both per-actor and per-object search consent. The property follows the model of Activity Streams’ audience targeting and takes a set of @ids to specify who should be able to search for the actor’s posts. The implementation currently supports three types of search consent:

  • “Public”: "https://www.w3.org/ns/activitystreams#Public"
  • “Followers-only”: (the followers collection of the author)
  • “Reacted-users-only” (searchable by mentioned users only): []

The granularity allows users to make “unlisted” posts discoverable by their followers (whom the user might trust to some extent, at least if they manually approve follow requests) while remaining undiscoverable by random strangers. In addition, bot actors may want to make their posts searchable by followers only, so that users can opt into searching their posts by following them while ensuring that the automated posts won’t overwhelm the search timelines of other users.

(The ideal approach regarding the matter of bot actors in search timelines might be something like a client configuration to exclude bots in search timelines, but such a solution would require explicit support by consuming implementations, and making automated posts “unlisted” is a common practice today. I think the searchable-by-followers-only discoverability would be a good-enough compromise between compatibility with existing implementations and utility for new implementations that support the feature.)

Also, having both the per-actor (default) consent and the per-object consents may serve as a mitigation of the problem of users changing their minds: users would usually use the default consent level (objects with no search consent property) and use the per-object consent only if necessary. That way, users would be able to change the search consent level of normal posts at once. Users might still want to change per-object consent levels too, but with this approach, the per-object search consent is only set by users’ explicit will and thus expected to be less likely to need to change. And even if users really want to change per-object consent levels, the number of affected objects would be far fewer and likely to be manageable by manual Update activities.
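
A rough sketch of how a consumer might resolve the effective audience, with a per-object searchableBy overriding the per-actor default. This is not Fedibird’s actual code: the helper name and the `follower_ids` argument (the locally known followers of the author) are assumptions, and the reacted/mentioned-users case is left out, so an empty audience simply matches nobody here:

```ruby
AS_PUBLIC = 'https://www.w3.org/ns/activitystreams#Public'

# Hypothetical check. `object` and `actor` are parsed JSON hashes.
def searchable_by?(searcher_id, object, actor, follower_ids)
  # The per-object value wins if present (even when empty); otherwise the
  # actor-level value acts as the default consent level.
  raw = object.key?('searchableBy') ? object['searchableBy'] : actor['searchableBy']
  audience = [raw].flatten.compact

  return true if audience.include?(AS_PUBLIC)

  followers_collection = actor['followers']
  !followers_collection.nil? &&
    audience.include?(followers_collection) &&
    follower_ids.include?(searcher_id)
end
```

Changing the actor-level value then changes the outcome for every object that carries no searchableBy of its own, without touching those objects.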

See also the comments by the fork’s author on Mastodon’s GitHub repository for details: https://github.com/mastodon/mastodon/pull/23808#issuecomment-1543273137.

There were some recent discussions in the Fediverse Developers Network / General room about this topic. Mastodon has a “noindex” setting distinct from the account/actor’s indexable property, and they aren’t necessarily consistent with each other. My actor shows "indexable": false, but the Mastodon 4.2.9 profile and post pages still allow search engine indexing (and Google does actually index them). I can control that with the UI-based preferences setting, but it seems like indexable should be consistent with that setting (and vice versa).

Also, it’s not clear what indexing is included in the scope of indexable. Some devs assume it’s only full-text search of post content, but the FEP doesn’t state that explicitly. It seems strange that a non-indexable account could be found using Mastodon search, for example, even in remote instances where the account information is cached. That’s just another form of indexing content for search purposes.

To enable this, couldn’t we extend the indexable property to accept an audience-targeting-like notation like as:Public and https://example.com/users/1/followers (i.e. extend its range to xsd:boolean | as:Actor | as:Collection)?

Yes, Mastodon might only want to support the “public” and “private” consent levels, but I think not every implementation wants to stick to that coarse granularity, so I believe the protocol should be flexible enough to support both needs. If you don’t want to support the intermediate consent levels, you can just fall back to “private” consent, like so:

diff --git a/app/services/activitypub/process_account_service.rb b/app/services/activitypub/process_account_service.rb
index 1e2d614d7..6c5e89067 100644
--- a/app/services/activitypub/process_account_service.rb
+++ b/app/services/activitypub/process_account_service.rb
@@ -115,7 +115,7 @@ class ActivityPub::ProcessAccountService < BaseService
     @account.fields                  = property_values || {}
     @account.also_known_as           = as_array(@json['alsoKnownAs'] || []).map { |item| value_or_id(item) }
     @account.discoverable            = @json['discoverable'] || false
-    @account.indexable               = @json['indexable'] || false
+    @account.indexable               = @json['indexable'] == true || as_array(@json['indexable'] || []).include?(ActivityPub::TagManager::COLLECTIONS[:public])
     @account.memorial                = @json['memorial'] || false
     @account.attribution_domains     = as_array(@json['attributionDomains'] || []).map { |item| value_or_id(item) }
   end
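
For readers without a Mastodon checkout, the same fallback can be sketched as a standalone function. The names `PUBLIC_COLLECTION` and `indexable_from` are hypothetical stand-ins for Mastodon’s internals, and compact forms such as as:Public are not handled here:

```ruby
PUBLIC_COLLECTION = 'https://www.w3.org/ns/activitystreams#Public'

# Collapses an extended `indexable` value to a boolean: `true` or an audience
# containing the public collection means "public" consent; anything else
# (false, null, followers-only, unknown IDs) falls back to "private".
def indexable_from(value)
  value == true || [value].flatten.compact.include?(PUBLIC_COLLECTION)
end

puts indexable_from(true)
puts indexable_from([PUBLIC_COLLECTION])
puts indexable_from('https://example.com/users/1/followers')
```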