Improve pure Ruby IDNA implementation to match browser behavior (IDNA2008 and UTS#46)

Forked from #408, this is a separate ticket for improving the default (pure Ruby) IDNA implementation.

- **IDNA2003** is the older standard, designed for Unicode 3.2. It's pretty permissive.
- **IDNA2008** is the _newer_ one, which supports all Unicode versions up to now. It supports more characters, but it also makes a lot of strings invalid that the permissive IDNA2003 accepted, to reduce confusion and phishing risk.
- **UTS#46** is a standard character mapping used in conjunction with IDNA2008, to avoid breaking compatibility with older domains and make the transition smoother.
- There are unfortunately [4 "Deviation" characters](https://unicode.org/reports/tr46/#Table_Deviation_Characters) (ß, ς, ZWJ and ZWNJ) between IDNA2003 and IDNA2008 that can't be transitioned smoothly: they are simply converted differently in each version. So clients using IDNA2008+UTS#46 have to choose between the old (called "Transitional") and the new (called "Non-Transitional") behavior.
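
To make the Transitional / Non-Transitional difference concrete, here is a minimal sketch that only handles the four deviation characters. The constant and method names are made up for illustration and are not part of Addressable; a real implementation would read these entries from the UTS#46 mapping table instead of hard-coding them.

```ruby
# Transitional (IDNA2003-compatible) handling of the four deviation characters.
# Hypothetical constant, hard-coded here for illustration only.
DEVIATION_TRANSITIONAL = {
  0x00DF => [0x0073, 0x0073], # ß is mapped to "ss"
  0x03C2 => [0x03C3],         # ς (final sigma) is mapped to σ
  0x200C => [],               # ZERO WIDTH NON-JOINER is removed
  0x200D => []                # ZERO WIDTH JOINER is removed
}.freeze

# Hypothetical helper: Non-Transitional keeps deviation characters as-is,
# Transitional replaces them with their IDNA2003-style mapping.
def map_deviations(label, transitional:)
  return label unless transitional

  label.codepoints
       .flat_map { |cp| DEVIATION_TRANSITIONAL.fetch(cp, [cp]) }
       .pack("U*")
end

map_deviations("faß", transitional: true)  # => "fass"
map_deviations("faß", transitional: false) # => "faß"
```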

AFAICS the expectation is that while registrars upgrade to IDNA2008 and only allow valid hostnames (which will take approximately forever), browsers and web clients in general are encouraged to use IDNA2008+UTS#46 to widen support. So basically, unless you're running a registrar, IDNA2008+UTS#46 is the target.

That's why `libidn2` implements [IDNA2008 + UTS#46](https://gitlab.com/libidn/libidn2) and defaults to the **Non-Transitional** (=new) mode. It is also used by curl, for example, and probably by many other web clients. Firefox and Safari also seem to do IDNA2008+UTS#46 Non-Transitional. Chrome was lagging a bit as it was still using Transitional mode until very recently; apparently [they just changed this to Non-Transitional in Chrome 110](https://chromestatus.com/feature/5105856067141632). I can't verify this yet as I only have Chrome 109 on Linux ^^

Edit (February 13th 2023): I just received Chrome 110 and confirmed the new behavior: http://faß.de now resolves to `http://xn--fa-hia.de` (and stays displayed as http://faß.de), whereas in Chrome 109 it was transformed into http://fass.de (IDNA2003).

`libidn` (the current "native" option) implements the **IDNA2003** standard (the "older" one). IMO we should upgrade to `libidn2`; this will be discussed in #247.

The "pure" implementation is **IDNA2008-ish**, but not compliant, as we can see in this example with an emoji variation selector (U+FE0F, code point 65039 below):
```ruby
irb(main):004:0> s1 = "https://l♥️h.ws"
=> "https://l♥️h.ws"
irb(main):006:0> Addressable::URI.parse(s1).normalize
=> #<Addressable::URI:0x243d8 URI:https://xn--lh-t0xz926h.ws/>
irb(main):008:0> s1.codepoints
=> [104, 116, 116, 112, 115, 58, 47, 47, 108, 9829, 65039, 104, 46, 119, 115]
```
If we compare that to the official [Unicode test website](https://util.unicode.org/UnicodeJsps/idna.jsp?a=l%E2%99%A5%EF%B8%8Fh.ws%0D%0A):
![image](https://user-images.githubusercontent.com/201687/217281398-aea01ef7-1991-485f-b0c3-2e023a358841.png)
`https://xn--lh-t0xz926h.ws` (returned by the current "pure" implementation) is not even an option: no matter which standard we use, it's either `xn--lh-t0x.ws` or invalid (IDNA2008).
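
As a rough sketch of why the variation selector should disappear before Punycode encoding: UTS#46 marks some code points as "ignored" (dropped) and others as "mapped" (replaced). The snippet below parses a tiny hand-written excerpt in the layout of `IdnaMappingTable.txt` and applies it. The excerpt, the struct and the method names are illustrative assumptions, the real table is far larger and has more statuses (deviation, disallowed, ...), which this sketch simply passes through unchanged.

```ruby
# Hand-written excerpt in the "code point(s) ; status ; mapping # comment"
# layout of IdnaMappingTable.txt. Illustration only, not the real file.
SAMPLE_MAPPING_TABLE = <<~TABLE
  # comment lines and blank lines are skipped

  0041 ; mapped ; 0061 # LATIN CAPITAL LETTER A
  00AD ; ignored # SOFT HYPHEN
  FE00..FE0F ; ignored # VARIATION SELECTOR-1..VARIATION SELECTOR-16
TABLE

# Hypothetical record type for one table row.
MappingEntry = Struct.new(:range, :status, :mapping)

def parse_mapping_table(text)
  text.each_line.filter_map do |line|
    data = line.sub(/#.*/, "").strip # drop trailing comment
    next if data.empty?

    codes, status, mapping = data.split(";").map(&:strip)
    first, last = codes.split("..").map { |hex| hex.to_i(16) }
    targets = (mapping || "").split.map { |hex| hex.to_i(16) }
    MappingEntry.new(first..(last || first), status, targets)
  end
end

def apply_mapping(string, entries)
  string.codepoints.flat_map do |cp|
    entry = entries.find { |e| e.range.cover?(cp) }
    case entry&.status
    when "ignored" then []            # dropped entirely
    when "mapped"  then entry.mapping # replaced by its mapping
    else [cp]                         # everything else kept as-is in this sketch
    end
  end.pack("U*")
end

entries = parse_mapping_table(SAMPLE_MAPPING_TABLE)
apply_mapping("l\u2665\uFE0Fh", entries).codepoints # => [108, 9829, 104]
```

A linear `find` over the entries is fine for a three-line excerpt; for the full table a sorted array with binary search (or a precomputed hash) would be the more realistic choice.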

To bring the pure implementation up to the state of the art, we'll have to rewrite some of it (or bring in a dependency).
While looking at options for dependencies, I found:
- https://github.com/mmriis/simpleidn which has a much simpler and much more readable Ruby implementation. It also happens to use the [official UTS#46 mapping table](https://www.unicode.org/Public/idna/15.0.0/IdnaMappingTable.txt). It relies on a compiled dependency, `unf`, for Unicode normalization though :neutral_face:, which is not great, especially as Ruby does this itself. It could be a good help for a rewrite though.
- https://github.com/HoneyryderChuck/idnx which also has a pure Ruby implementation, but it's very much like the current one and officially supports IDNA2003 only, so it doesn't help much.
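
For reference, this is the built-in normalization the `unf` remark refers to: Ruby ships `String#unicode_normalize` (since 2.2), so no compiled dependency is needed. The helper name below is made up for illustration; UTS#46 processing normalizes to NFC after the mapping step.

```ruby
# Hypothetical helper: NFC-normalize a label using Ruby's built-in support.
def nfc_label(label)
  label.unicode_normalize(:nfc)
end

# "e" + COMBINING ACUTE ACCENT composes into a single "é" (U+00E9).
nfc_label("e\u0301").codepoints # => [233]

# NFKC additionally folds compatibility characters, e.g. the "ﬁ" ligature.
"\uFB01".unicode_normalize(:nfkc) # => "fi"
```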

Good news: the Unicode team provides some awesome conformance test files with thousands of input strings and the desired output for IDNA2008+UTS#46, for every version of Unicode. Example: https://www.unicode.org/Public/idna/15.0.0/IdnaTestV2.txt
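
A minimal sketch of how such a file could be turned into test cases, assuming the layout described in the file's header as I understand it: seven `;`-separated columns (source, toUnicode, toUnicodeStatus, toAsciiN, toAsciiNStatus, toAsciiT, toAsciiTStatus) where a blank result column means "same as the previous result". The struct, the method name and the sample lines are illustrative; the status columns and the file's escape sequences are deliberately not handled here.

```ruby
# Hypothetical record for one conformance test line.
IdnaTestCase = Struct.new(:source, :to_unicode, :to_ascii_n, :to_ascii_t,
                          keyword_init: true)

# Parses one line in the assumed IdnaTestV2.txt layout.
# Returns nil for blank lines and comment-only lines.
def parse_idna_test_line(line)
  data = line.sub(/#.*/, "") # drop trailing comment
  return nil if data.strip.empty?

  source, to_unicode, _u_status, to_ascii_n, _n_status, to_ascii_t, _t_status =
    data.split(";", -1).map(&:strip)

  # Blank columns inherit from the column to their left.
  to_unicode = source     if to_unicode.to_s.empty?
  to_ascii_n = to_unicode if to_ascii_n.to_s.empty?
  to_ascii_t = to_ascii_n if to_ascii_t.to_s.empty?

  IdnaTestCase.new(source: source, to_unicode: to_unicode,
                   to_ascii_n: to_ascii_n, to_ascii_t: to_ascii_t)
end

# Illustrative line written in that layout (not copied from the file).
tc = parse_idna_test_line("faß.de; ; ; xn--fa-hia.de; ; fass.de; # faß.de")
tc.to_ascii_n # => "xn--fa-hia.de" (Non-Transitional)
tc.to_ascii_t # => "fass.de"       (Transitional)
```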

My suggestion here would be to go with an incremental rewrite in order to:
- Remove all the custom Unicode normalization functions [by relying on Ruby's instead](https://idiosyncratic-ruby.com/73-unicode-version-mapping.html).
- Simplify and improve performance by rubyfying the Punycode function which is still in C (similar to the `simpleidn` implementation).
- Slightly update the code to follow the stricter IDNA2008 rules (rejecting invalid chars, etc.).
- Use the [official UTS#46 mapping tables](https://www.unicode.org/Public/idna/15.0.0/IdnaMappingTable.txt) to implement the UTS#46 compatibility layer.
- Use the extensive conformance test files provided by the Unicode team to robustly test this implementation.
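
To give an idea of what a "rubyfied" Punycode function could look like, here is a minimal encoder sketch written from the algorithm in RFC 3492. It is not the code from `simpleidn` or from Addressable, the method names are made up, and it skips the overflow checks from the RFC (Ruby integers don't overflow) as well as any input validation.

```ruby
# Parameters defined by RFC 3492 for Punycode.
PUNY_BASE = 36
PUNY_TMIN = 1
PUNY_TMAX = 26
PUNY_SKEW = 38
PUNY_DAMP = 700
PUNY_INITIAL_BIAS = 72
PUNY_INITIAL_N = 128

# Bias adaptation function (RFC 3492, section 6.1).
def punycode_adapt(delta, numpoints, first_time)
  delta = first_time ? delta / PUNY_DAMP : delta / 2
  delta += delta / numpoints
  k = 0
  while delta > ((PUNY_BASE - PUNY_TMIN) * PUNY_TMAX) / 2
    delta /= PUNY_BASE - PUNY_TMIN
    k += PUNY_BASE
  end
  k + ((PUNY_BASE - PUNY_TMIN + 1) * delta) / (delta + PUNY_SKEW)
end

# Digits 0..25 map to "a".."z", digits 26..35 map to "0".."9".
def punycode_digit(d)
  d < 26 ? (d + 97).chr : (d - 26 + 48).chr
end

# Encodes one label (without the "xn--" prefix).
def punycode_encode(label)
  input = label.codepoints
  basic = input.select { |cp| cp < 0x80 }
  output = basic.pack("U*")
  handled = basic_count = basic.length
  output << "-" if basic_count > 0

  n = PUNY_INITIAL_N
  delta = 0
  bias = PUNY_INITIAL_BIAS

  while handled < input.length
    m = input.select { |cp| cp >= n }.min # next code point to insert
    delta += (m - n) * (handled + 1)
    n = m
    input.each do |cp|
      delta += 1 if cp < n
      next unless cp == n

      # Emit delta as a generalized variable-length integer.
      q = delta
      k = PUNY_BASE
      loop do
        t = (k - bias).clamp(PUNY_TMIN, PUNY_TMAX)
        break if q < t

        output << punycode_digit(t + (q - t) % (PUNY_BASE - t))
        q = (q - t) / (PUNY_BASE - t)
        k += PUNY_BASE
      end
      output << punycode_digit(q)
      bias = punycode_adapt(delta, handled + 1, handled == basic_count)
      delta = 0
      handled += 1
    end
    delta += 1
    n += 1
  end
  output
end

punycode_encode("faß")      # => "fa-hia"
punycode_encode("l\u2665h") # => "lh-t0x"
```

The two outputs match the `xn--fa-hia.de` and `xn--lh-t0x.ws` forms mentioned above once the `xn--` prefix is added.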

@sporkmonger @dentarg what do you think?
