ID normalization

The ActivityStreams 2.0 spec says IDs are IRIs (RFC 3987): Activity Streams 2.0
The ActivityPub spec is stricter and says that IDs are “Publicly dereferencable URIs” (RFC 3986): ActivityPub

But what does it mean in practice?

Example 1: https://嘟文.com/users/OldBig/statuses/111971511872396431

Should this ID be treated as equivalent to https://xn--j5r817a.com/users/OldBig/statuses/111971511872396431 ? If so, I guess implementations are expected to normalize IDs and always work with the ASCII form (e.g. when searching through a local cache), to avoid duplicates.
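To make the question concrete, here is a minimal Python sketch of what "always work with the ASCII form" could look like. The helper name `host_to_ascii` is hypothetical, and it relies on Python's built-in `idna` codec, which implements IDNA 2003; a real implementation might pick a different IDNA version and would need to decide what to do with userinfo, which this sketch drops.

```python
from urllib.parse import urlsplit, urlunsplit


def host_to_ascii(object_id: str) -> str:
    """Return the ID with its host converted to the ASCII (punycode) form.

    Hypothetical helper for illustration only; userinfo is dropped.
    """
    parts = urlsplit(object_id)
    host = parts.hostname or ""
    # The built-in "idna" codec converts each non-ASCII label to "xn--...".
    ascii_host = host.encode("idna").decode("ascii")
    netloc = ascii_host if parts.port is None else f"{ascii_host}:{parts.port}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


unicode_id = "https://嘟文.com/users/OldBig/statuses/111971511872396431"
ascii_id = "https://xn--j5r817a.com/users/OldBig/statuses/111971511872396431"

# The raw strings differ, but they match once the host is converted.
print(unicode_id == ascii_id)
print(host_to_ascii(unicode_id) == ascii_id)
```

A cache lookup keyed on the converted form would treat both spellings as the same object; one keyed on the raw string would not.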

Example 2: https://chrichri.ween.de

Should this ID be treated as equivalent to https://chrichri.ween.de/? At least from the WHATWG URL spec perspective, these two forms seem to be equivalent.
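As a rough illustration of that equivalence, the sketch below compares two IDs after treating an empty path as `/`. RFC 3986 (section 6.2.3) describes the same normalization for URIs with an authority. The function name `same_modulo_root_path` is hypothetical, and this is not a full WHATWG URL parser; it only covers the trailing-slash case from Example 2.

```python
from urllib.parse import urlsplit


def same_modulo_root_path(a: str, b: str) -> bool:
    """Compare two IDs, treating an empty path as equivalent to "/".

    Hypothetical helper; every other component is compared as-is.
    """
    def key(url: str):
        p = urlsplit(url)
        return (p.scheme, p.netloc, p.path or "/", p.query, p.fragment)

    return key(a) == key(b)


# Plain string comparison sees two different IDs...
print("https://chrichri.ween.de" == "https://chrichri.ween.de/")
# ...while the path-aware comparison treats them as the same.
print(same_modulo_root_path("https://chrichri.ween.de", "https://chrichri.ween.de/"))
```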

See my non-captured reply here.

Per example one, the URL should not be converted to punycode. It has no utility from a technical perspective and only serves to cause issues when compared against the Unicode form.

The reason for converting is spec compliance: ActivityPub requires IDs to be URIs, and URIs can only contain ASCII characters.

Non-ASCII characters need to be either percent-encoded (when they occur in the path) or punycoded (when they occur in the domain name).
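For the path half of that rule, a minimal Python sketch might look like this. The helper name `percent_encode_path` is hypothetical. It assumes the path is encoded as UTF-8 before escaping and that any `%` already present starts a valid escape, so a stray literal `%` would pass through unchanged.

```python
from urllib.parse import quote


def percent_encode_path(path: str) -> str:
    """Percent-encode non-ASCII characters in a path (UTF-8 bytes).

    Hypothetical helper. "/" and "%" are left alone so that path
    structure and existing escapes are not double-encoded.
    """
    return quote(path, safe="/%:@!$&'()*+,;=")


print(percent_encode_path("/users/嘟文"))
print(percent_encode_path("/users/OldBig/statuses/111971511872396431"))
```

An all-ASCII path like the second one comes back unchanged, which is why most existing IDs are unaffected.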

The good news is that almost everyone already uses URIs. I found only one actor with Unicode in its ID: a WordPress actor (reported).

I also came to the conclusion that consumers shouldn’t normalize IDs. There are many ways to do it, and nobody knows how to do it properly, so I’d store the ID as is (as long as it is a valid URI).
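A "store as is, but only if it is a valid URI" policy could be sketched roughly like this in Python. The name `looks_like_uri_id` is hypothetical, and the check is deliberately loose (ASCII only, no whitespace, http(s) scheme, non-empty host). It is not a full RFC 3986 validator.

```python
from urllib.parse import urlsplit


def looks_like_uri_id(object_id: str) -> bool:
    """Loose acceptance check for an ID, with no normalization applied.

    Hypothetical helper; restricting the scheme to http(s) is an
    assumption of this sketch.
    """
    if not object_id.isascii() or any(ch.isspace() for ch in object_id):
        return False
    try:
        parts = urlsplit(object_id)
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.hostname)


# The Unicode form is rejected; the ASCII form is accepted and stored unchanged.
print(looks_like_uri_id("https://嘟文.com/users/OldBig/statuses/111971511872396431"))
print(looks_like_uri_id("https://xn--j5r817a.com/users/OldBig/statuses/111971511872396431"))
```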