Skip to content

Improve pure ruby IDNA implementation to match browsers behavior (IDNA2008 and UTS#46) #491

@jarthod

Description

@jarthod

Forked from #408, this is the separate ticket to deal with the improvement of the default (pure ruby) IDNA implementation.

  • IDNA2003 is the older standard (which was design for unicode 3.2), it's pretty permissive.
  • IDNA2008 is the newer one, which supports all unicode versions up to now (more characters supported, but also makes a lot of string invalid while IDNA2003 was pretty permissive, to reduce confusion/phishing risk and stuff like that).
  • UTS#46 is some kind of standard character mapping used in conjuction with IDNA2008, in order to avoid breaking compatibility with older domains to make the transision smoother.
  • There are unfortunately 4 "Deviations" chars between IDNA2003 and IDNA2008 which can't be smoothly transitionned, they are simply converted differently in both version. So clients using IDNA2008+UTS#46 have to choose between the old (called "Transitional") and new (called "Non-Transitional") behavior.

AFAICS the expectation is that while all registrar are upgrading to IDNA2008 and only allow valid hostnames (=approximately forever), browsers and web clients in general are encouraged to use IDNA2008+UTS#46 to widen support. So basically unless you're running a registrar, IDNA2008+UTS#46 is the target.

That's why libidn2 is implementing IDNA2008 + UTS#46 and default to the Non-Transitional (=new) mode. Which is also used by curl for example and probably many other web clients. Firefox and Safari also seems to do IDNA2008+UTS#46 Non-Transitional. Chrome was lagging a bit as it was still using Transitional mode up until very recently, apparently they juuust changed this to Non-Transitional in Chome 110. I can't verify this yet as I only have Chome 109 on Linux ^^

Edit (February 13th 2023): I just received Chrome 110 and confirmed the new behavior, http://faß.de now resolves to http://xn--fa-hia.de (and stays displayed as http://faß.de). Whereas in Chrome 109 it was transformed into http://fass.de (IDNA2003).

libidn (the current "native" option) implements IDNA2003 standard (the "older" one). IMO we should upgrade to libidn2, this will be discussed in #247.

The "pure" implementation is IDNA2008iiiisssshhhhh, but not compliant. As we can see in this example with an emoji modifier:

irb(main):004:0> s1 = "https://l♥️h.ws"
=> "https://l♥️h.ws"
irb(main):006:0> Addressable::URI.parse(s1).normalize
=> #<Addressable::URI:0x243d8 URI:https://xn--lh-t0xz926h.ws/>
irb(main):008:0> s1.codepoints
=> [104, 116, 116, 112, 115, 58, 47, 47, 108, 9829, 65039, 104, 46, 119, 115]

If we compare that to the official Unicode test website):
image
https://xn--lh-t0xz926h.ws (returned by current "pure" implementation) is not even an option, no matter what standard we use, it's either xn--lh-t0x.ws or invalid (IDNA2008)

In order to make the pure implementation up to the state of art, we'll have to rewrite some of it (or bring in a dependency).
As I was looking at options for dependencies, I found:

Good news: the Unicode team provide some awesome comformance testing file with thousands of input string and the desired output for IDNA2008+UTS#46, for every version of Unicode, example: https://www.unicode.org/Public/idna/15.0.0/IdnaTestV2.txt

My suggestion here would be to go with an incremental rewrite in order to:

  • Remove all the custom unicode normalization functions by relying on ruby's instead.
  • Simplify and improve performance by rubyfiyng the punnycode function which is still in C (similar to simpleidn implementation)
  • Slightly update the code to the stricter IDNA2008 rule (rejecting invalid chars, etc..)
  • Use the official UTS#46 mapping tables to implement UTS#46 compatibility layer.
  • Use the extensive comformance testing file provided by the unicode team to robustly test this implementation

@sporkmonger @dentarg what do you think?

Metadata

Metadata

Assignees

No one assigned

    Labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions