Skip to content

TextDecoder is wrong and very slow #61041

@ChALkeR

Description

@ChALkeR

Correctness

Encodings that return invalid results:

  • Single-byte:
    • ibm866 (fails at even ascii input)
    • koi8-u
    • windows-874
    • windows-1252
    • windows-1253
    • windows-1255
  • Multi-byte (all except gb18030):
    • gbk (should be identical to gb18030 but it is instead broken)
    • big5
    • euc-jp
    • iso-2022-jp
    • shift_jis (fails at even ascii input)
    • euc-kr

Unimplemented encodings that throw:

  • iso-8859-16
  • x-user-defined

If built without icu, utf-16le encoding also returns invalid results:

> new TextDecoder('utf-16le').decode(Uint16Array.of(0xd800))
'�' // correct
'\ud800' // no ICU

Performance

  • utf-8 (aka default) TextDecoder is much slower on ascii input than it can and should be
    1.3x on 4096 bytes, ~3x on 1 MiB input
  • The above applies to buffer.toString() too
    It's much slower on ASCII input than a checked js impl (same 1.3x-3x)
  • windows-1252 aka new TextDecoder('ascii') aka new TextDecoder('latin1')
    is ~2x-4x slower than an optimized impl on ascii input
  • windows-1252 aka new TextDecoder('latin1')
    is ~6x-12x slower than an optimized impl on latin1 input
  • windows-1252 is ~7x-12x slower than an optimized js impl
  • Other single-byte encodings that are significantly slower than js impl even on non-ascii input:
    iso-8859-3, iso-8859-6, iso-8859-7, iso-8859-8, iso-8859-8-i, windows-1253, windows-1255, windows-1257
  • None of the single-byte encodings are faster than the js impl even on non-ascii input
  • All of the single-byte encodings except windows-1252 are >=10x slower than the js impl on ascii input
    (windows-1252 is only ~2-4x slower)

References

Nothing of the above requires any changes on the native side, I compared to a somewhat optimized JS implementation

See https://docs.google.com/spreadsheets/d/1pdEefRG6r9fZy61WHGz0TKSt8cO4ISWqlpBN5KntIvQ/edit

See tests in https://github.com/ExodusOSS/bytes/blob/master/tests/encoding/mistakes.test.js (comment out the import and it can be run on Node.js without deps with only that file)

Suggestions

  1. Add a proper ASCII fast path to buffer.toString()
    src: improve StringBytes::Encode perf on ASCII #61119
  2. Add a proper ASCII fast path to new TextDecoder().decode(arg)
    src: improve StringBytes::Encode perf on ASCII #61119
  3. Perhaps replace single-byte decoders with a js impl, remove native paths and lib usage. They all are just mappers, the implementation for all of them is identical
    Or at least replace the slow, unsupported, or invalid ones.
    lib: implement all 1-byte encodings in js #61093
  4. Remove gbk decoder path and make it do the same as gb18030 as the spec says
    lib: gbk decoder is gb18030 decoder per spec #61099
  5. For utf16 decode optimistically using existing fast apis, then check the string for validity
    lib: add utf16 fast path for TextDecoder #61559
  6. Fix bugs in the non-ICU codepath
    lib: unify ICU and no-ICU TextDecoder #61409
    lib: use utf8 fast path for streaming TextDecoder #61549
    lib: add utf16 fast path for TextDecoder #61559
  7. Fix or replace implementations for big5, euc-jp, iso-2022-jp, shift_jis, euc-kr
    To fix legacy multi-byte decoders, attempt to re-use what Chromium has or import js code from @exodus/bytes

Metadata

Metadata

Assignees

No one assigned

    Labels

    performanceIssues and PRs related to the performance of Node.js.

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions