Optimize BinaryWriter with a growable buffer #1108
jlucaso1 wants to merge 2 commits into bufbuild:main
Conversation
Thanks for the PR! We'll allocate time to give this a closer look.
timostamm left a comment:
Left a couple of comments below.
I like this change - it's a bit cleaner, and should also make it easier to move to resizable array buffers in the future.
Looking at perf:
```
# before
toBinary perf-payload.bin x 5,680 ops/sec ±0.33% (96 runs sampled)
toBinary tiny example.User x 1,176,788 ops/sec ±0.19% (100 runs sampled)
toBinary normal example.User x 203,325 ops/sec ±0.54% (94 runs sampled)
toBinary scalar values x 292,358 ops/sec ±0.65% (98 runs sampled)
toBinary repeated scalar values x 101,041 ops/sec ±0.57% (96 runs sampled)
toBinary map with scalar keys and values x 69,991 ops/sec ±1.12% (99 runs sampled)
toBinary repeated field with 1000 messages x 3,812 ops/sec ±2.65% (96 runs sampled)
toBinary map field with 1000 messages x 771 ops/sec ±2.20% (94 runs sampled)

# after
toBinary perf-payload.bin x 5,162 ops/sec ±0.33% (99 runs sampled)
toBinary tiny example.User x 1,252,113 ops/sec ±0.50% (94 runs sampled)
toBinary normal example.User x 244,426 ops/sec ±1.18% (92 runs sampled)
toBinary scalar values x 353,611 ops/sec ±0.45% (99 runs sampled)
toBinary repeated scalar values x 129,307 ops/sec ±0.43% (99 runs sampled)
toBinary map with scalar keys and values x 89,141 ops/sec ±0.46% (96 runs sampled)
toBinary repeated field with 1000 messages x 7,059 ops/sec ±0.29% (100 runs sampled)
toBinary map field with 1000 messages x 1,126 ops/sec ±0.22% (98 runs sampled)

# ran with
cd packages/protobuf-test
npx turbo run build
npx tsx src/perf.ts benchmark 'toBinary'
```
Nice improvement overall, with a ~10% regression on perf-payload.bin. We've used this case for performance optimization in the past (for example #836), so it's unfortunate that it gets slower with this change.
I think the payload fields repeated_long_string_field and repeated_long_bytes_field (see perf-payload.txt) are responsible. Would be great to understand why, and whether it can be improved.
```ts
/**
 * Writes a tag (field number and wire type).
 *
 * Equivalent to `uint32( (fieldNo << 3 | type) >>> 0 )`.
 *
 * Generated code should compute the tag ahead of time and call `uint32()`.
 */
```
Please restore the doc comment.
```ts
/**
 * Write a `int32` value, a signed 32 bit varint.
 */
```
Please restore the doc comment.
```ts
/**
 * Write a `float` value, 32-bit floating point number.
 */
```
Please restore the doc comment.
```ts
const tmp: number[] = [];
varint32write(value, tmp);
this.raw(Uint8Array.from(tmp));
```
Not now - we have enough moving parts - but this is worth a closer look later:
Instead of creating an Array and a Uint8Array, we can allocate the max varint size (5 bytes for uint, 10 bytes for int), and encode directly into the buffer. varint32write is not exported from the package and we are free to change the signature.
Took the suggestion. int32, sint32, int64, sint64, uint64, and join() now encode varints straight into the buffer (no number[] + Uint8Array.from). Added a small varint32Size helper for join() so the length prefix can be spliced via copyWithin.
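The `varint32Size` plus `copyWithin` idea for `join()` can be sketched as follows. This is an assumed shape based on the description above, not the PR's exact code: once a nested message's byte length is known, the message bytes are shifted right by the size of the varint length prefix, and the prefix is written into the gap.

```typescript
// Number of bytes the varint encoding of an unsigned 32-bit value occupies.
function varint32Size(value: number): number {
  value = value >>> 0;
  if (value < 1 << 7) return 1;
  if (value < 1 << 14) return 2;
  if (value < 1 << 21) return 3;
  if (value < 1 << 28) return 4;
  return 5;
}

// Splice a varint length prefix in front of buffer[start..pos) in place.
// Returns the new write position. Assumes capacity was ensured beforehand.
function spliceLengthPrefix(buffer: Uint8Array, start: number, pos: number): number {
  const len = pos - start;
  const prefix = varint32Size(len);
  // Shift the message bytes right to make room for the prefix.
  buffer.copyWithin(start + prefix, start, pos);
  // Write the varint length into the gap.
  let v = len;
  let i = start;
  while (v > 0x7f) {
    buffer[i++] = (v & 0x7f) | 0x80;
    v = v >>> 7;
  }
  buffer[i] = v;
  return pos + prefix;
}
```

Because `varint32Size` is computed before shifting, only one `copyWithin` per nested message is needed, instead of flushing and re-concatenating chunks.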
```ts
const out = this.buffer.subarray(0, this.pos);
// Return a copy to avoid mutation if writer is reused
const result = new Uint8Array(out);
```

Suggested change:

```ts
const result = this.buffer.slice(0, this.pos);
```
See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/TypedArray/slice
```ts
this.ensureCapacity(4);
new DataView(
  this.buffer.buffer,
  this.buffer.byteOffset,
  this.buffer.byteLength,
).setInt32(this.pos, value, true);
this.pos += 4;
```
Nice. Can you apply the same to sfixed64 and fixed64?
Done. sfixed64 and fixed64 now write directly through DataView on this.buffer, no intermediate 8-byte Uint8Array.
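For reference, a direct 8-byte write along these lines could look like the sketch below. The `buffer`/`pos` field names mirror this PR's writer, but the function shape is an assumption for illustration, not the merged code; capacity is assumed to have been ensured beforehand (as with `this.ensureCapacity(8)` in the PR).

```typescript
// Sketch: write a fixed64 value little-endian through a DataView over the
// writer's backing buffer, with no intermediate 8-byte Uint8Array.
function writeFixed64(w: { buffer: Uint8Array; pos: number }, value: bigint): void {
  new DataView(
    w.buffer.buffer,
    w.buffer.byteOffset,
    w.buffer.byteLength,
  ).setBigUint64(w.pos, value, true); // true = little-endian wire order
  w.pos += 8;
}
```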
Force-pushed from 0211c7d to 530ff1f.
Hi @timostamm, thanks for the review. Rebased and re-benchmarked. Happy to go that route in a follow-up.
@jlucaso1 @timostamm — following up on the earlier #333 thread, I tried three small writer-only tweaks on top of this PR's current head.

Update: P0-b has been revised.

**What each change does and why it's better**

P0-a — …
| fixture | PR baseline (ops/sec) | +P0-a | +P0-a+b(lazy)+c |
|---|---|---|---|
| SimpleMessage (19 B) | 795k | +17.9% | +79.0% |
| OTLP ExportTrace 100 spans (32 KB) | 506 | +40.4% | +156.4% |
| ExportMetrics 50 series (17 KB) | 967 | +27.2% | +153.1% |
| ExportLogs 100 records (21 KB) | 978 | +24.2% | +150.2% |
| K8sPodList 20 pods (29 KB) | 840 | +27.4% | +217.6% |
| GraphQLRequest (624 B) | 133k | +36.7% | +50.8% |
| GraphQLResponse (1.4 KB) | 149k | +43.6% | +120.0% |
| RpcRequest (501 B) | 99k | +41.2% | +172.3% |
| RpcResponse (602 B) | 178k | +46.0% | +175.8% |
| StressMessage (depth=8, width=200, 13 KB) | 2.6k | +35.0% | +258.0% |
toBinary on packages/protobuf-test/src/perf.ts (best-of-3, taskset -c 0, benchmark harness — same methodology you used in the PR description):
| fixture | upstream | +#1108 (Δ) | +#1108+P0 (Δ vs #1108 / Δ vs upstream) |
|---|---|---|---|
| perf-payload.bin | 3.9k | 6.2k (+59.7%) | 8.5k (+37.2% / +119.1%) |
| tiny example.User | 993.4k | 887.7k (-10.6%) | 890.8k (+0.4% / -10.3%) |
| normal example.User | 102.8k | 129k (+25.5%) | 384.8k (+198.3% / +274.3%) |
| scalar values | 147.4k | 286.3k (+94.2%) | 494.3k (+72.7% / +235.3%) |
| repeated scalar values | 54.5k | 98.5k (+80.8%) | 141.5k (+43.7% / +159.9%) |
| map with scalar keys and values | 39.4k | 52.8k (+34.1%) | 106.7k (+101.9% / +170.8%) |
| repeated field with 1000 messages | 2.8k | 5.6k (+104.3%) | 5.7k (+0.9% / +106.2%) |
| map field with 1000 messages | 554 | 888 (+60.3%) | 1.6k (+80.6% / +189.5%) |
The @bufbuild/protobuf-test suite passes at every stage.
emcfarlane left a comment:
@jlucaso1 thanks for these changes! Excited to see them land. Just a small comment to help reduce the diff and keep formatting consistent. Otherwise looks great!
@intech thanks for the detailed investigation. Would be great to get follow up PRs for the three performance points after this work has landed.
Force-pushed from 530ff1f to 6a1f209.
@emcfarlane Should I create a separate pull request after this merge, or would it be cleaner to combine them via jlucaso1#1 and merge them in this PR? What do you think?
@intech your changes look good and make sense. I still think breaking them into separate PRs would be the cleanest approach though. This will help us as reviewers and give us a nice commit-by-commit breakdown of these performance changes when merged to main.
Force-pushed from cbe965c to afdb827.
This refactor moves away from "chunks + push-to-array + concat at the end" toward a single, growable `Uint8Array` buffer with explicit capacity management and in-place writes. The main benefits are:

• Amortized O(1) writes → doubling the buffer when it's full (`ensureCapacity`) avoids frequent small allocations and large concat operations.

• Lower GC pressure → no more building of many tiny `Uint8Array` slices or intermediate JS arrays.

• Faster varint encoding → the hot path for single-byte values now early-returns, and the multi-byte loop writes directly into the buffer instead of into an intermediate array.

• Simpler fork/join → length-delimited framing is done by shifting bytes in place rather than flushing and collecting chunks.

• More predictable memory layout → everything lives contiguously in one buffer, so slice/subarray calls are just views.

Together these yield better throughput, fewer garbage-collection pauses, and (often) a smaller peak working set at runtime.
It's like #964, but with fewer API changes.
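The design described above can be sketched as a minimal growable writer. This is illustrative only: the names and growth policy follow the PR description, but the details are assumptions, not the merged implementation.

```typescript
// Minimal sketch of the "single growable buffer" design: appends are
// amortized O(1) because the backing store doubles when full, and finish()
// returns a copy so reusing the writer cannot mutate a returned result.
class GrowableBuffer {
  private buf = new Uint8Array(8);
  private pos = 0;

  raw(bytes: Uint8Array): void {
    if (this.pos + bytes.length > this.buf.length) {
      let size = this.buf.length * 2;
      while (size < this.pos + bytes.length) size *= 2;
      const next = new Uint8Array(size);
      next.set(this.buf.subarray(0, this.pos));
      this.buf = next;
    }
    this.buf.set(bytes, this.pos);
    this.pos += bytes.length;
  }

  finish(): Uint8Array {
    // slice() copies; a subarray() here would alias the reusable buffer.
    return this.buf.slice(0, this.pos);
  }
}
```

Usage: repeated `raw()` calls append contiguously, and `finish()` yields exactly the written bytes regardless of how much spare capacity the buffer holds.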