
Optimize BinaryWriter with a growable buffer #1108

Open
jlucaso1 wants to merge 2 commits into bufbuild:main from jlucaso1:refactor/binary-writer-buffer

Conversation


@jlucaso1 jlucaso1 commented Apr 17, 2025

This refactor moves away from “chunks + push‐to‐array + concat at the end” toward a single, growable Uint8Array buffer with explicit capacity management and in‐place writes. The main benefits are:

• Amortized O(1) writes → by doubling the buffer when it's full (ensureCapacity), you avoid frequent small allocations and large concat operations.
• Lower GC pressure → you no longer build many tiny Uint8Array slices or intermediate JS arrays.
• Faster varint encoding → the hot path for single-byte values now early-returns, and the multi-byte loop writes directly into the buffer instead of into an intermediate array.
• Simpler fork/join → length-delimited framing is done by shifting bytes in place rather than flushing and collecting chunks.
• More predictable memory layout → everything lives contiguously in one buffer, so slice/subarray calls are just views.

Together these yield better throughput, reduced pauses for garbage collection, and (often) smaller peak working sets at runtime.

It's like #964, but with fewer API changes.
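
For orientation, here is a minimal sketch of the growable-buffer idea described above. The field and method names (buffer, pos, ensureCapacity) follow the PR text; the rest is illustrative, not the actual BinaryWriter code:

```ts
// Minimal sketch of the growable-buffer approach; not the actual BinaryWriter.
class GrowableWriter {
  private buffer = new Uint8Array(128); // initial capacity is an assumption
  private pos = 0;

  // Double the backing buffer until `extra` more bytes fit (amortized O(1)).
  private ensureCapacity(extra: number): void {
    if (this.pos + extra <= this.buffer.length) return;
    let size = this.buffer.length;
    while (size < this.pos + extra) size *= 2;
    const next = new Uint8Array(size);
    next.set(this.buffer);
    this.buffer = next;
  }

  // Write a single byte in place instead of pushing a chunk to an array.
  byte(value: number): void {
    this.ensureCapacity(1);
    this.buffer[this.pos++] = value & 0xff;
  }

  // Copy out the written bytes so reusing the writer cannot mutate the result.
  finish(): Uint8Array {
    return this.buffer.slice(0, this.pos);
  }
}
```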


CLAassistant commented Apr 17, 2025

CLA assistant check
All committers have signed the CLA.

@timostamm (Member)

Thanks for the PR! We'll allocate time to give this a closer look.

@timostamm (Member) left a comment

Left a couple of comments below.

I like this change - it's a bit cleaner, and should also make it easier to move to resizable array buffers in the future.

Looking at perf:

# before
toBinary   perf-payload.bin x 5,680 ops/sec ±0.33% (96 runs sampled)
toBinary   tiny example.User x 1,176,788 ops/sec ±0.19% (100 runs sampled)
toBinary   normal example.User x 203,325 ops/sec ±0.54% (94 runs sampled)
toBinary   scalar values x 292,358 ops/sec ±0.65% (98 runs sampled)
toBinary   repeated scalar values x 101,041 ops/sec ±0.57% (96 runs sampled)
toBinary   map with scalar keys and values x 69,991 ops/sec ±1.12% (99 runs sampled)
toBinary   repeated field with 1000 messages x 3,812 ops/sec ±2.65% (96 runs sampled)
toBinary   map field with 1000 messages x 771 ops/sec ±2.20% (94 runs sampled)

# after
toBinary   perf-payload.bin x 5,162 ops/sec ±0.33% (99 runs sampled)
toBinary   tiny example.User x 1,252,113 ops/sec ±0.50% (94 runs sampled)
toBinary   normal example.User x 244,426 ops/sec ±1.18% (92 runs sampled)
toBinary   scalar values x 353,611 ops/sec ±0.45% (99 runs sampled)
toBinary   repeated scalar values x 129,307 ops/sec ±0.43% (99 runs sampled)
toBinary   map with scalar keys and values x 89,141 ops/sec ±0.46% (96 runs sampled)
toBinary   repeated field with 1000 messages x 7,059 ops/sec ±0.29% (100 runs sampled)
toBinary   map field with 1000 messages x 1,126 ops/sec ±0.22% (98 runs sampled)

# ran with
cd packages/protobuf-test
npx turbo run build
npx tsx src/perf.ts benchmark 'toBinary'

Nice improvement overall, with a 10% regression on perf-payload.bin. We've used this case for performance optimization in the past (for example #836), so it's unfortunate that it gets slower with this change.

I think the payload fields repeated_long_string_field and repeated_long_bytes_field (see perf-payload.txt) are responsible. Would be great to understand why, and whether it can be improved.

Comment on lines -184 to -190
/**
* Writes a tag (field number and wire type).
*
* Equivalent to `uint32( (fieldNo << 3 | type) >>> 0 )`.
*
* Generated code should compute the tag ahead of time and call `uint32()`.
*/
Member

Please restore the doc comment.

Author

Done.

Comment on lines -223 to -225
/**
* Write a `int32` value, a signed 32 bit varint.
*/
Member

Please restore the doc comment.

Author

Done.

Comment on lines -257 to -259
/**
* Write a `float` value, 32-bit floating point number.
*/
Member

Please restore the doc comment.

Author

Done.

Comment on lines +287 to +289
const tmp: number[] = [];
varint32write(value, tmp);
this.raw(Uint8Array.from(tmp));
Member

Not now - we have enough moving parts - but this is worth a closer look later:

Instead of creating an Array and a Uint8Array, we can allocate the max varint size (5 bytes for uint, 10 bytes for int), and encode directly into the buffer. varint32write is not exported from the package and we are free to change the signature.
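
For illustration, encoding directly into the buffer could look roughly like this; the function is a sketch, not the package's varint32write:

```ts
// Sketch only: write an unsigned 32-bit varint straight into `buffer` at `pos`,
// assuming the caller has already reserved the maximum 5 bytes of capacity.
// Returns the position after the last byte written.
function writeVarint32(buffer: Uint8Array, pos: number, value: number): number {
  value = value >>> 0;
  while (value > 0x7f) {
    buffer[pos++] = (value & 0x7f) | 0x80; // 7 payload bits + continuation bit
    value >>>= 7;
  }
  buffer[pos++] = value;
  return pos;
}
```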

Author

Took the suggestion. int32, sint32, int64, sint64, uint64, and join() now encode varints straight into the buffer (no number[] + Uint8Array.from). Added a small varint32Size helper for join() so the length prefix can be spliced via copyWithin.
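
As a rough illustration of the join step described here (varint32Size is the helper named above; spliceLengthPrefix and its signature are assumptions made for the sketch, not the actual code):

```ts
// Sketch only: size of a uint32 encoded as a varint (1-5 bytes).
function varint32Size(value: number): number {
  value = value >>> 0;
  if (value < 1 << 7) return 1;
  if (value < 1 << 14) return 2;
  if (value < 1 << 21) return 3;
  if (value < 1 << 28) return 4;
  return 5;
}

// Sketch only: a payload was written at [start, end); shift it right in place
// with copyWithin to make room for its varint length prefix, then write the
// prefix. Assumes the buffer already has end + prefix bytes of capacity.
function spliceLengthPrefix(buffer: Uint8Array, start: number, end: number): number {
  const len = end - start;
  const prefixSize = varint32Size(len);
  buffer.copyWithin(start + prefixSize, start, end);
  let pos = start;
  let v = len >>> 0;
  while (v > 0x7f) {
    buffer[pos++] = (v & 0x7f) | 0x80;
    v >>>= 7;
  }
  buffer[pos] = v;
  return end + prefixSize;
}
```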

Comment on lines +125 to +127
const out = this.buffer.subarray(0, this.pos);
// Return a copy to avoid mutation if writer is reused
const result = new Uint8Array(out);
Member

Suggested change:
- const out = this.buffer.subarray(0, this.pos);
- // Return a copy to avoid mutation if writer is reused
- const result = new Uint8Array(out);
+ const result = this.buffer.slice(0, this.pos);

See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/TypedArray/slice

Comment on lines +271 to +277
this.ensureCapacity(4);
new DataView(
this.buffer.buffer,
this.buffer.byteOffset,
this.buffer.byteLength,
).setInt32(this.pos, value, true);
this.pos += 4;
Member

Nice. Can you apply the same to sfixed64 and fixed64?

Author

Done. sfixed64 and fixed64 now write directly through DataView on this.buffer, no intermediate 8-byte Uint8Array.
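
For reference, the direct 64-bit write could look roughly like this sketch (the low/high-halves signature is an assumption, not the library's fixed64 API):

```ts
// Sketch only: write a 64-bit value (as unsigned low/high 32-bit halves)
// little-endian through a DataView over the writer's buffer, with no
// intermediate 8-byte Uint8Array.
function writeFixed64(buffer: Uint8Array, pos: number, lo: number, hi: number): number {
  const view = new DataView(buffer.buffer, buffer.byteOffset, buffer.byteLength);
  view.setUint32(pos, lo >>> 0, true);     // low word first
  view.setUint32(pos + 4, hi >>> 0, true); // then high word
  return pos + 8;
}
```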

@jlucaso1 jlucaso1 force-pushed the refactor/binary-writer-buffer branch from 0211c7d to 530ff1f on April 23, 2026 16:40
@jlucaso1 (Author)

Hi @timostamm, thanks for the review. Rebased on main and addressed all comments. Details inline on each thread.

Rebenched on main (3 runs, best):

| benchmark | main | this PR | Δ |
| --- | --- | --- | --- |
| perf-payload.bin | 3,200 | 5,000 | +56% |
| tiny example.User | 930,000 | 530,000 | −43% |
| normal example.User | 50,000 | 49,000 | ≈0% |
| scalar values | 75,000 | 181,000 | +141% |
| repeated scalar values | 35,000 | 90,000 | +157% |
| map scalar | 30,000 | 34,000 | +13% |
| repeated 1000 messages | 2,400 | 6,000 | +150% |
| map 1000 messages | 465 | 570 | +22% |

perf-payload.bin flipped from −10% to +56% (doubling + lazy initial buffer).

tiny User is the one regression. I tried a hybrid (scratch number[] + flush to Uint8Array) to recover it. Passes all 2843 tests, but fork() has to flush for correctness, and that kills the copyWithin-based join() fast path: repeated 1000 messages drops 45% vs this PR. V8 specializes SMI array pushes much better than typed-array stores for 6-byte payloads, so the two optimization axes (SMI array vs contiguous buffer) can't be had together without a two-pass size-then-write refactor.

Happy to go that route in a follow-up if the tiny regression is a blocker. Otherwise this PR should be ready.


intech commented Apr 23, 2026

@jlucaso1 @timostamm — following up on the earlier #333 thread, I tried three small writer-only tweaks on top of this PR's current head (530ff1f7) and the numbers look quite a bit better. Opened them as jlucaso1#1 against refactor/binary-writer-buffer so the commits can be reviewed individually.

Update: P0-b has been revised to construct the DataView lazily (on the first fixed/float/double write) rather than eagerly. This recovers the earlier regression on bool/varint/string-only fixtures while keeping the gain on scalar-heavy ones. Tables below reflect the updated version.

What each change does and why it's better

P0-a — finish() returns a subarray view instead of a slice() copy

ea0c3604

Current finish() does this.buffer.slice(0, this.pos) — allocates a fresh ArrayBuffer and memcpys the written bytes into it, every encode. Switching to this.buffer.subarray(0, this.pos) makes it O(1), just a view over the existing buffer. A dirtyAfterFinish flag swaps in a fresh backing buffer on the next write so a caller that reuses the writer after finish() doesn't see the returned view clobbered.

Why it's better: the slice cost scales with the current capacity, not with pos. A 19-byte message on the default 128-byte initial buffer pays a 128-byte copy (over-copy factor 6.7×). A 21 KB message that grew the buffer to 32 KB pays a 32 KB copy. Both cases disappear.
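
Roughly what this looks like, as an illustrative sketch (dirtyAfterFinish is the flag named above; the class and the remaining field names are assumptions):

```ts
// Sketch only: finish() hands out a zero-copy view; before the next write,
// the writer copies its contents into a fresh backing buffer so that the
// previously returned view is never clobbered.
class ViewFinishWriter {
  private buffer = new Uint8Array(128);
  private pos = 0;
  private dirtyAfterFinish = false;

  finish(): Uint8Array {
    this.dirtyAfterFinish = true;
    return this.buffer.subarray(0, this.pos); // O(1) view, no copy
  }

  byte(value: number): void {
    if (this.dirtyAfterFinish) {
      // A previously returned view still references this.buffer.
      const next = new Uint8Array(this.buffer.length);
      next.set(this.buffer.subarray(0, this.pos));
      this.buffer = next;
      this.dirtyAfterFinish = false;
    }
    // (capacity growth elided for brevity)
    this.buffer[this.pos++] = value & 0xff;
  }
}
```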

P0-b — lazily construct DataView instead of allocating per-call

cb03a152

float / double / fixed32 / sfixed32 / fixed64 / sfixed64 currently do new DataView(this.buffer.buffer, this.buffer.byteOffset, this.buffer.byteLength) before calling setFloat32 / setInt32 / etc. The patch caches one DataView, constructed lazily on first use and invalidated in ensureCapacity when the backing buffer is swapped.

Why it's better: the per-call DataView construction is the hot path on scalar-heavy payloads (fixed64 and double fields); allocating once per grown buffer instead of once per scalar write removes that cost. Lazy construction also means writers that never touch a DataView-backed field (bool/varint/string-only shapes like tiny example.User) pay no DataView allocation at all — an earlier eager-init version produced a V8 inline-cache regression on those shapes that the lazy variant avoids.
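
An illustrative sketch of the cached, lazily constructed DataView (class, field, and method names are assumptions):

```ts
// Sketch only: one DataView, created on first use and dropped whenever the
// backing buffer is replaced, instead of a new DataView per scalar write.
class LazyViewWriter {
  private buffer = new Uint8Array(128);
  private pos = 0;
  private view: DataView | undefined;

  private getView(): DataView {
    if (this.view === undefined) {
      this.view = new DataView(this.buffer.buffer, this.buffer.byteOffset, this.buffer.byteLength);
    }
    return this.view;
  }

  private ensureCapacity(extra: number): void {
    if (this.pos + extra <= this.buffer.length) return;
    const next = new Uint8Array(Math.max(this.buffer.length * 2, this.pos + extra));
    next.set(this.buffer);
    this.buffer = next;
    this.view = undefined; // the old DataView points at the old buffer
  }

  fixed32(value: number): void {
    this.ensureCapacity(4);
    this.getView().setUint32(this.pos, value, true);
    this.pos += 4;
  }
}
```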

P0-c — ASCII fast-path in string()

a9fa1672

Current string() always calls this.encodeUtf8(value) (TextEncoder). The patch does a single-pass probe of the code units: if every charCodeAt(i) <= 0x7F, write the bytes inline without touching TextEncoder; otherwise fall back to the injected encoder. Non-ASCII behaviour is identical to before.

Why it's better: TextEncoder.encode() has a meaningful per-call cost (function-pointer indirection into a C++ binding, result allocation, byte copy) that dominates short-string encoding. An inline ASCII loop skips all of it for the common case. This is the patch that directly addresses the "tiny User −43%" you flagged — a 5-character firstName string goes through 5 inline writes instead of a full TextEncoder round-trip.
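
A sketch of the probe-and-fallback idea (the TextEncoder here stands in for the injected encodeUtf8; the function name is an assumption):

```ts
// Sketch only: if every code unit is <= 0x7F, the UTF-8 bytes equal the code
// units, so they can be written inline; otherwise fall back to TextEncoder.
const fallbackEncoder = new TextEncoder();

function encodeStringAsciiFast(value: string): Uint8Array {
  for (let i = 0; i < value.length; i++) {
    if (value.charCodeAt(i) > 0x7f) {
      return fallbackEncoder.encode(value); // non-ASCII: behaviour unchanged
    }
  }
  const out = new Uint8Array(value.length);
  for (let i = 0; i < value.length; i++) {
    out[i] = value.charCodeAt(i);
  }
  return out;
}
```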

Results

Cumulative toBinary deltas vs this PR's current head, measured on our fork's bench-matrix (median of 5 runs, taskset -c 0, tinybench; fixture mix skewed toward realistic OTLP / K8s / GraphQL / RPC payloads):

| fixture | PR baseline | +P0-a | +P0-a+b(lazy)+c |
| --- | --- | --- | --- |
| SimpleMessage (19 B) | 795k | +17.9% | +79.0% |
| OTLP ExportTrace 100 spans (32 KB) | 506 | +40.4% | +156.4% |
| ExportMetrics 50 series (17 KB) | 967 | +27.2% | +153.1% |
| ExportLogs 100 records (21 KB) | 978 | +24.2% | +150.2% |
| K8sPodList 20 pods (29 KB) | 840 | +27.4% | +217.6% |
| GraphQLRequest (624 B) | 133k | +36.7% | +50.8% |
| GraphQLResponse (1.4 KB) | 149k | +43.6% | +120.0% |
| RpcRequest (501 B) | 99k | +41.2% | +172.3% |
| RpcResponse (602 B) | 178k | +46.0% | +175.8% |
| StressMessage (depth=8, width=200, 13 KB) | 2.6k | +35.0% | +258.0% |

toBinary on packages/protobuf-test/src/perf.ts (best-of-3, taskset -c 0, benchmark harness — same methodology you used in the PR description):

| fixture | upstream | +#1108 (Δ) | +#1108+P0 (Δ vs #1108 / Δ vs upstream) |
| --- | --- | --- | --- |
| perf-payload.bin | 3.9k | 6.2k (+59.7%) | 8.5k (+37.2% / +119.1%) |
| tiny example.User | 993.4k | 887.7k (-10.6%) | 890.8k (+0.4% / -10.3%) |
| normal example.User | 102.8k | 129k (+25.5%) | 384.8k (+198.3% / +274.3%) |
| scalar values | 147.4k | 286.3k (+94.2%) | 494.3k (+72.7% / +235.3%) |
| repeated scalar values | 54.5k | 98.5k (+80.8%) | 141.5k (+43.7% / +159.9%) |
| map with scalar keys and values | 39.4k | 52.8k (+34.1%) | 106.7k (+101.9% / +170.8%) |
| repeated field with 1000 messages | 2.8k | 5.6k (+104.3%) | 5.7k (+0.9% / +106.2%) |
| map field with 1000 messages | 554 | 888 (+60.3%) | 1.6k (+80.6% / +189.5%) |

@bufbuild/protobuf-test passes at each of these stages.

@emcfarlane (Contributor) left a comment

@jlucaso1 thanks for these changes! Excited to see them land. Just a small comment to help reduce the diff and keep formatting consistent. Otherwise looks great!

@intech thanks for the detailed investigation. Would be great to get follow up PRs for the three performance points after this work has landed.

Comment thread: packages/protobuf/src/wire/binary-encoding.ts
@jlucaso1 jlucaso1 force-pushed the refactor/binary-writer-buffer branch from 530ff1f to 6a1f209 on April 24, 2026 18:32

intech commented Apr 24, 2026

@emcfarlane Should I open a separate pull request after this merge, or would it be cleaner to combine the changes into jlucaso1#1 and merge them as part of this PR? What do you think?

@emcfarlane (Contributor)

@intech your changes look good and make sense. I still think breaking them into separate PRs would be the cleanest approach though. This will help us as reviewers and give us a nice commit by commit breakdown of these performance changes when merged to main.

@emcfarlane emcfarlane changed the title from "refactor(protobuf): replace chunk-based BinaryWriter with growable Uint8Array buffer and in-place varint writes" to "Optimize BinaryWriter with a growable buffer" on Apr 24, 2026
@emcfarlane emcfarlane force-pushed the refactor/binary-writer-buffer branch from cbe965c to afdb827 on April 24, 2026 19:44