IDN/punycode/non-Latin domain names

What is the proper way to handle these? For example, I own гришка.рф, for which the canonical representation is xn--80afpi2a3c.xn--p1ai. How do I match actors with such domains (i.e. I have the punycode version in my database, but I receive the Cyrillic one from somewhere)? How do I represent these internally? How do I represent these in ActivityPub objects? How do I represent these in my web interface and client API? The spec doesn’t say anything on this subject.


I would normalize all DB records and lookups to ASCII, but leave JSON payloads and display alone. In Tavern, I’ll probably use golang.org/x/net/idna on hostnames when storing actors for the first time. When doing an ActorID to ActorRowID lookup, the domain part of the URL should also be normalized.
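
A minimal sketch of that normalization, written in Python rather than Go so it needs only the standard library. The function names `host_to_ascii` and `normalize_actor_id` are hypothetical, and Python's built-in `idna` codec implements the older IDNA 2003 rules, so treat this as an illustration of the idea rather than what Tavern does:

```python
from urllib.parse import urlsplit, urlunsplit


def host_to_ascii(host: str) -> str:
    """Return the ASCII (punycode) form of a hostname, lowercased."""
    # The built-in "idna" codec passes pure-ASCII labels through
    # unchanged, so lowercase explicitly to get a single canonical form.
    return host.encode("idna").decode("ascii").lower()


def normalize_actor_id(actor_id: str) -> str:
    """Rewrite only the host part of an actor URL to its ASCII form."""
    parts = urlsplit(actor_id)
    host = host_to_ascii(parts.hostname or "")
    # Keep an explicit port if present; any userinfo is dropped
    # (assumption: actor IDs never carry credentials).
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


# Both spellings of the same actor collapse to one lookup key.
print(normalize_actor_id("https://münchen.de/users/alice"))
print(normalize_actor_id("https://XN--MNCHEN-3YA.de/users/alice"))
```

Running the same function on every incoming actor ID before the database lookup, and on every ID before it is stored, is what keeps the Cyrillic and punycode spellings from producing two rows.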

Actually, as far as Smithereen is concerned: TIL that Java has had a built-in punycode converter (java.net.IDN) since about forever.

Looks like PHP does too: https://www.php.net/manual/en/ref.intl.idn.php

Is there a consensus about normalizing IDNs to ASCII for storage and using a UTF-8 representation for exchange? Would it be useful to have such a consensus? How do the various implementations treat this case so far?

During resolution, either ASCII or UTF-8 should result in the same endpoint being reached. I think it really only matters for internal dereferencing so you don’t end up with duplicate actor records.


[2020-04-13 23:13:11+0000] Nick Gerakines via SocialHub:

During resolution, either ASCII or UTF-8 should result in the same endpoint being reached. I think it really only matters for internal dereferencing so you don’t end up with duplicate actor records.

Should, but for better compatibility (for example with systems not supporting IDN) the ASCII representation should be used during exchanges.

For the internal representation, I think it also makes sense to keep the canonical ASCII form: fewer shenanigans with database encoding. I think the UTF-8 representation should only be used for display, since you might want to filter out homograph glyphs, for example. This is what WebKitGTK does, and probably other browsers too, by the way.
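
To make the display-time filtering concrete, here is a rough Python sketch that decodes a stored ASCII host for display only when no label mixes scripts. The helper names are hypothetical, and guessing the script from the first word of each character's Unicode name is a crude heuristic chosen for brevity; it is not the check WebKitGTK or any other browser actually performs:

```python
import unicodedata


def scripts_in(label: str) -> set:
    """Rough script guess: first word of each letter's Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in label
        if ch.isalpha()  # ignore digits and hyphens
    }


def host_for_display(ascii_host: str) -> str:
    """Decode punycode for display unless a label mixes scripts."""
    unicode_host = ascii_host.encode("ascii").decode("idna")
    for label in unicode_host.split("."):
        if len(scripts_in(label)) > 1:
            return ascii_host  # suspicious: keep the punycode form
    return unicode_host


# A Latin-only name is decoded for display.
print(host_for_display("xn--mnchen-3ya.de"))

# "paypal" with a Cyrillic "а" (U+0430) stays in punycode.
spoof = "p\u0430ypal.example".encode("idna").decode("ascii")
print(host_for_display(spoof))
```

This only catches labels that mix scripts; a label written entirely in look-alike characters from one script would still be decoded, so a real filter needs more than this.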

Also if you want an example of an actor using an IDN domain name: https://pénibles.transposées.eu/users/estradiol

Actually, Mastodon already does this: all URLs in ActivityPub objects are ASCII, but the web interface shows the domain decoded, with accents.
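
A small Python sketch of that split between wire format and display. The `/users/<name>` URL layout and the `actor_views` helper are assumptions for illustration only and are not taken from Mastodon's code:

```python
def to_unicode_host(ascii_host: str) -> str:
    """Decode a punycode hostname with the built-in IDNA 2003 codec."""
    return ascii_host.encode("ascii").decode("idna")


def actor_views(username: str, ascii_host: str) -> dict:
    """Return the ASCII ID for ActivityPub objects and a decoded handle for the UI."""
    return {
        # Goes into ActivityPub objects: always the ASCII form.
        "id": f"https://{ascii_host}/users/{username}",
        # Shown in the web interface only: decoded, with accents.
        "display_handle": f"@{username}@{to_unicode_host(ascii_host)}",
    }


print(actor_views("alice", "xn--bcher-kva.example"))
```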


(The REST client I’m using handles punycode itself, but I assume the domain is supposed to be in the encoded form in the Host header anyway.)
