Type of Change
Summary
Add `\uXXXX` (4-hex-digit Unicode escape) as a sixth escape form in quoted strings and keys, enabling TOON to represent control characters that are currently unrepresentable.
Motivation
Problem
The TOON spec defines 5 escape sequences: `\\`, `\"`, `\n`, `\r`, `\t`. Control characters in the range U+0000–U+001F (excluding U+000A, U+000D, U+0009) cannot appear literally in strings and have no escape syntax — they are unrepresentable in TOON.
This means any Go/TypeScript/Python value containing these characters (e.g. U+0004 EOT, U+0000 NUL, U+001B ESC) cannot be serialized to TOON at all. TOON encoders must reject the input entirely.
Use Case
Real-world strings from databases, search indices, APIs, and binary-adjacent text processing commonly contain control characters as sentinel values or delimiters. A format that claims lossless JSON data model serialization should be able to represent any valid Unicode string that JSON can.
Benefits
- Closes the representability gap between TOON and JSON
- Enables lossless round-tripping of arbitrary string data
- Follows an escape syntax already familiar to developers from JSON, JavaScript, Java, C#, Python, etc.
Detailed Design
Proposed Syntax
In quoted strings and quoted keys, add support for `\uXXXX`, where `XXXX` is exactly 4 hexadecimal digits (0–9, a–f, A–F) representing a Unicode code point.

```abnf
escaped-char   = "\" ( "\" / DQUOTE / "n" / "r" / "t" / unicode-escape )
unicode-escape = "u" 4HEXDIG
```
Encoding Rules
- Encoders MUST use `\uXXXX` for control characters below U+0020 that are not `\n`, `\r`, or `\t` (since these have no other representation in TOON).
- Encoders SHOULD prefer the shorthand escapes `\n`, `\r`, and `\t` for U+000A, U+000D, and U+0009 respectively.
- Encoders MAY use `\uXXXX` for any Unicode code point, but SHOULD prefer literal UTF-8 for printable characters to preserve human readability (a core TOON design goal).
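Taken together, these rules reduce to a small escaping routine. The sketch below uses TypeScript (the reference implementation's language); the function name is hypothetical, not part of any actual implementation:

```typescript
// Hypothetical sketch of encoder-side escaping per the rules above.
function escapeToonString(s: string): string {
  let out = "";
  for (const ch of s) {
    const cp = ch.codePointAt(0)!;
    if (ch === "\\") out += "\\\\";      // MUST: backslash
    else if (ch === '"') out += '\\"';   // MUST: double quote
    else if (ch === "\n") out += "\\n";  // SHOULD: shorthand escapes
    else if (ch === "\r") out += "\\r";
    else if (ch === "\t") out += "\\t";
    else if (cp < 0x20)                  // MUST: remaining control chars
      out += "\\u" + cp.toString(16).toUpperCase().padStart(4, "0");
    else out += ch;                      // printable: literal UTF-8
  }
  return out;
}
```

Note that uppercase hex output here reflects the recommendation under Unresolved Questions below; it is a SHOULD, not a MUST.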
Decoding Rules
- Decoders MUST parse `\uXXXX` in quoted strings and quoted keys, converting the 4 hex digits to the corresponding Unicode code point.
- Decoders MUST reject `\u` followed by fewer than 4 hex digits as an invalid escape sequence.
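The exactly-4-hex-digits rule can be sketched as a small helper (names hypothetical, TypeScript assumed as the illustration language):

```typescript
// Hypothetical sketch: parse the 4 hex digits that follow "\u".
// Accepts either case of hex digit; rejects anything shorter than 4.
function decodeUnicodeEscape(input: string, pos: number): { cp: number; next: number } {
  const hex = input.slice(pos, pos + 4);
  if (!/^[0-9a-fA-F]{4}$/.test(hex)) {
    throw new Error(`invalid \\u escape at position ${pos}: need exactly 4 hex digits`);
  }
  return { cp: parseInt(hex, 16), next: pos + 4 };
}
```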
Grammar Changes (ABNF)
Current:

```abnf
escaped-char = "\" ( "\" / DQUOTE / "n" / "r" / "t" )
```

Proposed:

```abnf
escaped-char   = "\" ( "\" / DQUOTE / "n" / "r" / "t" / unicode-escape )
unicode-escape = "u" 4HEXDIG
```
Examples
Encoding a string with U+0004 (EOT)
Input (JSON):

```json
{"marker": "hello\u0004world"}
```

Output (TOON):

```
marker: "hello\u0004world"
```
Round-trip of various control characters
Input (JSON):

```json
{"controls": "\u0000\u0001\u001f"}
```

Output (TOON):

```
controls: "\u0000\u0001\u001F"
```
Mixed with existing escapes
Input (JSON):

```json
{"mixed": "line1\nline2\u0004end"}
```

Output (TOON):

```
mixed: "line1\nline2\u0004end"
```
Drawbacks
- Slightly increased parser complexity: Decoders must handle a new escape form, including reading exactly 4 hex digits and converting to a rune/code point. This is straightforward but non-trivial.
- Token cost: `\uXXXX` is 6 characters to represent 1 character. However, this only applies to control characters, which are already rare in typical data, so the impact on TOON's token-efficiency goals is negligible.
- Not needed for most documents: The vast majority of TOON documents contain no control characters beyond `\n`, `\r`, `\t`. This feature primarily unblocks edge cases rather than improving the common path.
Alternatives Considered
Alternative 1: Do nothing — leave control characters unrepresentable
Callers must sanitize strings before TOON encoding (strip or replace control chars). Rejected because this makes TOON lossy — decoded output won't match the original input, breaking the "lossless serialization of the JSON data model" guarantee.
Alternative 2: Allow raw control characters in quoted strings
Let encoders emit literal bytes for control characters inside quotes. Rejected because raw control characters cause problems with text editors, terminals, copy-paste, and other tooling. JSON explicitly forbids this for good reason.
Alternative 3: Use a different escape syntax (e.g. `\xHH`)
`\xHH` uses 2 hex digits (byte-level). Rejected because it's ambiguous with multi-byte UTF-8 sequences and isn't widely standardized across languages. `\uXXXX` is the most universally recognized Unicode escape syntax.
Impact on Implementations
- Reference implementation (TypeScript): Requires adding `\uXXXX` emission in the encoder for control characters below 0x20, and `\uXXXX` parsing in the decoder's escape handling.
- Community implementations (Go, Python, Java, Swift, Julia, Ruby): Same scope — encoder and decoder string handling. Typically 10–30 lines of code per implementation.
- Backward compatibility: Existing valid TOON documents remain valid. Documents using `\uXXXX` will only be parseable by updated decoders. Old decoders will correctly reject `\u` as an unknown escape (per the current spec: "decoders MUST reject any other escape sequence").
- Versioning: This is a backward-compatible addition (new documents may use it; old documents are unaffected). Appropriate for a MINOR version bump per VERSIONING.md.
Migration Strategy
For Implementers
- Update the decoder to handle `case 'u':` in escape-sequence parsing — read 4 hex digits, convert to a code point
- Update the encoder to emit `\uXXXX` for runes below U+0020 that aren't `\n`/`\r`/`\t`, instead of rejecting them
- Add round-trip tests for control characters
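The decoder step above might look like the following switch-based sketch. This is a hedged illustration in TypeScript, not the reference implementation's actual code, and the function name is hypothetical:

```typescript
// Hypothetical sketch of a decoder's escape handling with the new case 'u'.
function unescapeToonString(s: string): string {
  let out = "";
  for (let i = 0; i < s.length; i++) {
    if (s[i] !== "\\") { out += s[i]; continue; }
    switch (s[++i]) {
      case "\\": out += "\\"; break;
      case '"': out += '"'; break;
      case "n": out += "\n"; break;
      case "r": out += "\r"; break;
      case "t": out += "\t"; break;
      case "u": {
        // New case: read exactly 4 hex digits, either case accepted.
        const hex = s.slice(i + 1, i + 5);
        if (!/^[0-9a-fA-F]{4}$/.test(hex)) throw new Error("truncated \\u escape");
        out += String.fromCodePoint(parseInt(hex, 16));
        i += 4; // consume the 4 hex digits
        break;
      }
      default: throw new Error(`unknown escape \\${s[i]}`);
    }
  }
  return out;
}
```

A round-trip test then checks that decoding the encoder's output reproduces the original string for every code point in U+0000–U+001F.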
For Users
No migration needed. Existing TOON documents are unaffected. New documents containing control characters will simply work instead of failing.
Test Cases
```json
[
  {
    "name": "encode string with U+0004 EOT",
    "input": {"val": "a\u0004b"},
    "expected": "val: \"a\\u0004b\"",
    "specSection": "7.1"
  },
  {
    "name": "encode string with U+0000 NUL",
    "input": {"val": "a\u0000b"},
    "expected": "val: \"a\\u0000b\"",
    "specSection": "7.1"
  },
  {
    "name": "encode string with U+001F (last control char)",
    "input": {"val": "a\u001fb"},
    "expected": "val: \"a\\u001Fb\"",
    "specSection": "7.1"
  },
  {
    "name": "prefer \\n over \\u000A",
    "input": {"val": "a\nb"},
    "expected": "val: \"a\\nb\"",
    "specSection": "7.1",
    "note": "Encoders SHOULD prefer shorthand escapes"
  },
  {
    "name": "decode \\u0004 in quoted string",
    "category": "decode",
    "input": "val: \"a\\u0004b\"",
    "expected": {"val": "a\u0004b"},
    "specSection": "7.1"
  },
  {
    "name": "reject truncated unicode escape",
    "category": "decode",
    "input": "val: \"a\\u00b\"",
    "shouldError": true,
    "specSection": "7.1",
    "note": "\\u must be followed by exactly 4 hex digits"
  }
]
```
Affected Specification Sections
- Section 7.1 (Escape Sequences): Add `\uXXXX` to the list of valid escapes; update the ABNF grammar
- Section 1 (Conventions): No change needed, but the new escape uses the same RFC 2119 keywords
- Appendix / ABNF Grammar: Update the `escaped-char` production
Unresolved Questions
- Surrogate pairs: Should TOON support `\uD800`–`\uDFFF` surrogate-pair encoding for characters above U+FFFF (as JSON does via two consecutive `\uXXXX` escapes)? Or should TOON keep it simple and support only BMP code points, requiring literal UTF-8 for supplementary characters? The simpler option (no surrogate pairs) seems preferable since TOON is UTF-8 native and supplementary characters can appear literally.
- Case sensitivity: Should `\u001f` and `\u001F` both be accepted? Recommendation: yes, decoders should accept both (case-insensitive hex digits), while encoders SHOULD emit uppercase for consistency.
Additional Context
- JSON (RFC 8259) defines `\uXXXX` with the same semantics proposed here. This is the most widely understood Unicode escape syntax across programming languages.
- The TOON CONTRIBUTING.md already uses "Adding `\u0000` escape sequences" as a canonical example of an RFC-worthy change, suggesting this gap is already on the maintainers' radar.
- This proposal intentionally keeps scope narrow: only `\uXXXX` (4 hex digits, BMP). Extended forms like `\U00XXXXXX` (8 hex digits) or `\u{XXXXX}` (variable length) are out of scope and can be considered separately if needed.
Checklist