Optimize BinaryWriter with a growable buffer #1108
jlucaso1 wants to merge 2 commits into bufbuild:main
Conversation
Thanks for the PR! We'll allocate time to give this a closer look.
timostamm left a comment:
Left a couple of comments below.
I like this change - it's a bit cleaner, and should also make it easier to move to resizable array buffers in the future.
Looking at perf:
```
# before
toBinary perf-payload.bin x 5,680 ops/sec ±0.33% (96 runs sampled)
toBinary tiny example.User x 1,176,788 ops/sec ±0.19% (100 runs sampled)
toBinary normal example.User x 203,325 ops/sec ±0.54% (94 runs sampled)
toBinary scalar values x 292,358 ops/sec ±0.65% (98 runs sampled)
toBinary repeated scalar values x 101,041 ops/sec ±0.57% (96 runs sampled)
toBinary map with scalar keys and values x 69,991 ops/sec ±1.12% (99 runs sampled)
toBinary repeated field with 1000 messages x 3,812 ops/sec ±2.65% (96 runs sampled)
toBinary map field with 1000 messages x 771 ops/sec ±2.20% (94 runs sampled)

# after
toBinary perf-payload.bin x 5,162 ops/sec ±0.33% (99 runs sampled)
toBinary tiny example.User x 1,252,113 ops/sec ±0.50% (94 runs sampled)
toBinary normal example.User x 244,426 ops/sec ±1.18% (92 runs sampled)
toBinary scalar values x 353,611 ops/sec ±0.45% (99 runs sampled)
toBinary repeated scalar values x 129,307 ops/sec ±0.43% (99 runs sampled)
toBinary map with scalar keys and values x 89,141 ops/sec ±0.46% (96 runs sampled)
toBinary repeated field with 1000 messages x 7,059 ops/sec ±0.29% (100 runs sampled)
toBinary map field with 1000 messages x 1,126 ops/sec ±0.22% (98 runs sampled)

# ran with
cd packages/protobuf-test
npx turbo run build
npx tsx src/perf.ts benchmark 'toBinary'
```
Nice improvement overall, with a ~10% regression on perf-payload.bin. We've used this case for performance optimization in the past (for example #836), so it's unfortunate that it gets slower with this change.
I think the payload fields repeated_long_string_field and repeated_long_bytes_field (see perf-payload.txt) are responsible. Would be great to understand why, and whether it can be improved.
```ts
/**
 * Writes a tag (field number and wire type).
 *
 * Equivalent to `uint32( (fieldNo << 3 | type) >>> 0 )`.
 *
 * Generated code should compute the tag ahead of time and call `uint32()`.
 */
```
Please restore the doc comment.
```ts
/**
 * Write a `int32` value, a signed 32 bit varint.
 */
```
Please restore the doc comment.
```ts
/**
 * Write a `float` value, 32-bit floating point number.
 */
```
Please restore the doc comment.
```ts
const tmp: number[] = [];
varint32write(value, tmp);
this.raw(Uint8Array.from(tmp));
```
Not now - we have enough moving parts - but this is worth a closer look later:
Instead of creating an Array and a Uint8Array, we can allocate the max varint size (5 bytes for uint, 10 bytes for int), and encode directly into the buffer. varint32write is not exported from the package and we are free to change the signature.
Took the suggestion. int32, sint32, int64, sint64, uint64, and join() now encode varints straight into the buffer (no number[] + Uint8Array.from). Added a small varint32Size helper for join() so the length prefix can be spliced via copyWithin.
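The `varint32Size` plus `copyWithin` idea for `join()` can be sketched as follows. This is an assumed shape based on the description above, not the PR's exact code: once a nested message's byte length is known, the message bytes are shifted right by the size of the varint length prefix, and the prefix is written into the gap.

```typescript
// Number of bytes the varint encoding of an unsigned 32-bit value occupies.
function varint32Size(value: number): number {
  value = value >>> 0;
  if (value < 1 << 7) return 1;
  if (value < 1 << 14) return 2;
  if (value < 1 << 21) return 3;
  if (value < 1 << 28) return 4;
  return 5;
}

// Splice a varint length prefix in front of buffer[start..pos) in place.
// Returns the new write position. Assumes capacity was ensured beforehand.
function spliceLengthPrefix(buffer: Uint8Array, start: number, pos: number): number {
  const len = pos - start;
  const prefix = varint32Size(len);
  // Shift the message bytes right to make room for the prefix.
  buffer.copyWithin(start + prefix, start, pos);
  // Write the varint length into the gap.
  let v = len;
  let i = start;
  while (v > 0x7f) {
    buffer[i++] = (v & 0x7f) | 0x80;
    v = v >>> 7;
  }
  buffer[i] = v;
  return pos + prefix;
}
```

Because `varint32Size` is computed before shifting, only one `copyWithin` per nested message is needed, instead of flushing and re-concatenating chunks.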
```ts
const out = this.buffer.subarray(0, this.pos);
// Return a copy to avoid mutation if writer is reused
const result = new Uint8Array(out);
```

Suggested change:

```ts
const result = this.buffer.slice(0, this.pos);
```
See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/TypedArray/slice
```ts
this.ensureCapacity(4);
new DataView(
  this.buffer.buffer,
  this.buffer.byteOffset,
  this.buffer.byteLength,
).setInt32(this.pos, value, true);
this.pos += 4;
```
Nice. Can you apply the same to sfixed64 and fixed64?
Done. sfixed64 and fixed64 now write directly through DataView on this.buffer, no intermediate 8-byte Uint8Array.
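For reference, a direct 8-byte write along these lines could look like the sketch below. The `buffer`/`pos` field names mirror this PR's writer, but the function shape is an assumption for illustration, not the merged code; capacity is assumed to have been ensured beforehand (as with `this.ensureCapacity(8)` in the PR).

```typescript
// Sketch: write a fixed64 value little-endian through a DataView over the
// writer's backing buffer, with no intermediate 8-byte Uint8Array.
function writeFixed64(w: { buffer: Uint8Array; pos: number }, value: bigint): void {
  new DataView(
    w.buffer.buffer,
    w.buffer.byteOffset,
    w.buffer.byteLength,
  ).setBigUint64(w.pos, value, true); // true = little-endian wire order
  w.pos += 8;
}
```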
Force-pushed from 0211c7d to 530ff1f.
Hi @timostamm, thanks for the review. Rebased and re-benchmarked. Happy to go that route in a follow-up.
@jlucaso1 @timostamm — following up on the earlier #333 thread, I tried three small writer-only tweaks on top of this PR's current head.

Update: P0-b has been revised.

**What each change does and why it's better**

P0-a — …
| fixture | PR baseline (ops/sec) | +P0-a | +P0-a+b(lazy)+c |
|---|---|---|---|
| SimpleMessage (19 B) | 795k | +17.9% | +79.0% |
| OTLP ExportTrace 100 spans (32 KB) | 506 | +40.4% | +156.4% |
| ExportMetrics 50 series (17 KB) | 967 | +27.2% | +153.1% |
| ExportLogs 100 records (21 KB) | 978 | +24.2% | +150.2% |
| K8sPodList 20 pods (29 KB) | 840 | +27.4% | +217.6% |
| GraphQLRequest (624 B) | 133k | +36.7% | +50.8% |
| GraphQLResponse (1.4 KB) | 149k | +43.6% | +120.0% |
| RpcRequest (501 B) | 99k | +41.2% | +172.3% |
| RpcResponse (602 B) | 178k | +46.0% | +175.8% |
| StressMessage (depth=8, width=200, 13 KB) | 2.6k | +35.0% | +258.0% |
toBinary on packages/protobuf-test/src/perf.ts (best-of-3, taskset -c 0, benchmark harness — same methodology you used in the PR description):
| fixture | upstream | +#1108 (Δ) | +#1108+P0 (Δ vs #1108 / Δ vs upstream) |
|---|---|---|---|
| perf-payload.bin | 3.9k | 6.2k (+59.7%) | 8.5k (+37.2% / +119.1%) |
| tiny example.User | 993.4k | 887.7k (-10.6%) | 890.8k (+0.4% / -10.3%) |
| normal example.User | 102.8k | 129k (+25.5%) | 384.8k (+198.3% / +274.3%) |
| scalar values | 147.4k | 286.3k (+94.2%) | 494.3k (+72.7% / +235.3%) |
| repeated scalar values | 54.5k | 98.5k (+80.8%) | 141.5k (+43.7% / +159.9%) |
| map with scalar keys and values | 39.4k | 52.8k (+34.1%) | 106.7k (+101.9% / +170.8%) |
| repeated field with 1000 messages | 2.8k | 5.6k (+104.3%) | 5.7k (+0.9% / +106.2%) |
| map field with 1000 messages | 554 | 888 (+60.3%) | 1.6k (+80.6% / +189.5%) |
The @bufbuild/protobuf-test suite passes at every stage.
emcfarlane left a comment:
@jlucaso1 thanks for these changes! Excited to see them land. Just a small comment to help reduce the diff and keep formatting consistent. Otherwise looks great!
@intech thanks for the detailed investigation. Would be great to get follow up PRs for the three performance points after this work has landed.
Force-pushed from 530ff1f to 6a1f209.
@emcfarlane Should I create a separate pull request after this merge, or would it be cleaner to combine them via jlucaso1#1 and merge them in this PR? What do you think?
@intech your changes look good and make sense. I still think breaking them into separate PRs would be the cleanest approach though. This will help us as reviewers and give us a nice commit-by-commit breakdown of these performance changes when merged to main.
Force-pushed from cbe965c to afdb827.
This refactor moves away from "chunks + push-to-array + concat at the end" toward a single, growable `Uint8Array` buffer with explicit capacity management and in-place writes. The main benefits are:

• Amortized O(1) writes → doubling the buffer when it's full (`ensureCapacity`) avoids frequent small allocations and large concat operations.

• Lower GC pressure → no more building of many tiny `Uint8Array` slices or intermediate JS arrays.

• Faster varint encoding → the hot path for single-byte values now early-returns, and the multi-byte loop writes directly into the buffer instead of into an intermediate array.

• Simpler fork/join → length-delimited framing is done by shifting bytes in place rather than flushing and collecting chunks.

• More predictable memory layout → everything lives contiguously in one buffer, so slice/subarray calls are just views.

Together these yield better throughput, fewer garbage-collection pauses, and (often) a smaller peak working set at runtime.
It's like #964, but with fewer API changes.
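The design described above can be sketched as a minimal growable writer. This is illustrative only: the names and growth policy follow the PR description, but the details are assumptions, not the merged implementation.

```typescript
// Minimal sketch of the "single growable buffer" design: appends are
// amortized O(1) because the backing store doubles when full, and finish()
// returns a copy so reusing the writer cannot mutate a returned result.
class GrowableBuffer {
  private buf = new Uint8Array(8);
  private pos = 0;

  raw(bytes: Uint8Array): void {
    if (this.pos + bytes.length > this.buf.length) {
      let size = this.buf.length * 2;
      while (size < this.pos + bytes.length) size *= 2;
      const next = new Uint8Array(size);
      next.set(this.buf.subarray(0, this.pos));
      this.buf = next;
    }
    this.buf.set(bytes, this.pos);
    this.pos += bytes.length;
  }

  finish(): Uint8Array {
    // slice() copies; a subarray() here would alias the reusable buffer.
    return this.buf.slice(0, this.pos);
  }
}
```

Usage: repeated `raw()` calls append contiguously, and `finish()` yields exactly the written bytes regardless of how much spare capacity the buffer holds.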