v4.2.3

Latest

github-actions released this 09 Jun 21:49

e6072c3

Changed

README "Local native HTTP/3 vs Rust H3 clients" section and the benchmark README now publish the canonical GET-only ledger repeat gate result: two consecutive fail-closed gates at ba356d7 in which Specter's worst rep beats every comparator's per-metric best rep on p50/p95 TTFB, ledger-paced throughput, and p50/p95 ledger-paced tail (gate 1 worst rep 37.6 us p50 TTFB with 1.0 us p50 / 7.2 us p95 ledger tail; gate 2 confirms at 32.5 us and 2.9 / 7.4 us). This retires the 2026-06-05 same-process matrix that had tokio-quiche leading loopback GET TTFB. Causes: the deferred boundary-ACK send, the GET-burst drain ordering, and the single-copy 1-RTT datagram decode (entries below), plus bench-harness client-process pinning to cores 4-11 for all five clients, which removed scheduler placement luck from both sides of the worst-vs-best comparison. Evidence: docs/benchmarks/native-h3-vs-rust-clients/2026-06-09-direct-get-clientpin-clean-gate/ and -r2/.
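The worst-vs-best comparison in this entry can be sketched as a small predicate. This is a minimal illustration, not the benchmark harness's code: `Better`, `worst`, `best`, and `gate_passes` are hypothetical names, and the sketch assumes each client's reps for one metric are available as a slice of `f64`.

```rust
/// Direction in which a metric improves (hypothetical helper type).
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum Better {
    /// Latency-style metrics such as TTFB or ledger-paced tail.
    Lower,
    /// Throughput-style metrics.
    Higher,
}

/// Worst value across reps for the given direction, or None if there are no reps.
fn worst(reps: &[f64], dir: Better) -> Option<f64> {
    let pick = |a: f64, b: f64| match dir {
        Better::Lower => a.max(b),
        Better::Higher => a.min(b),
    };
    reps.iter().copied().reduce(pick)
}

/// Best value across reps for the given direction, or None if there are no reps.
fn best(reps: &[f64], dir: Better) -> Option<f64> {
    let pick = |a: f64, b: f64| match dir {
        Better::Lower => a.min(b),
        Better::Higher => a.max(b),
    };
    reps.iter().copied().reduce(pick)
}

/// Passes only if the candidate's worst rep strictly beats every comparator's
/// best rep. Missing data (no reps, no comparators) fails the gate.
pub fn gate_passes(candidate: &[f64], comparators: &[&[f64]], dir: Better) -> bool {
    let Some(w) = worst(candidate, dir) else {
        return false;
    };
    if comparators.is_empty() {
        return false;
    }
    comparators.iter().all(|reps| match best(reps, dir) {
        Some(b) => match dir {
            Better::Lower => w < b,
            Better::Higher => w > b,
        },
        None => false,
    })
}
```

A full gate in this style would call `gate_passes` once per metric and require every call to return true. Returning false on missing input is what makes the sketch fail-closed.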
Native HTTP/3 1-RTT receive path now caches one AES-128-GCM AEAD context per key epoch (EVP_AEAD_CTX) and reuses it for every datagram open, instead of constructing a fresh boring::symm::Crypter per packet. The per-packet path redid the AES key schedule and PMULL GHASH H-table precompute for a key constant across the epoch; a microbench put that setup at ~half of the ~0.83us/datagram open on Graviton4. A same-session A/B at n=100 (verified by EVP_AEAD_CTX symbol count 0->11) cut the GET p95 ledger-paced tail from ~16.5us to ~13.7us median across every quantile, moving Specter from losing to beating tokio_quiche (median 13.6us vs 16.2us) on that workload at the gate-relevant n. Receive-side decrypt only: no wire byte, frame cadence, or fingerprint change, and the seal->open round-trip stays byte-identical (full suite 999 passed). Evidence: docs/benchmarks/native-h3-vs-rust-clients/2026-06-09-aead-context-cache-n100/.
Native HTTP/3 1-RTT send path now reuses the same per-epoch AEAD context for sealing that the receive path uses for opening, instead of constructing a fresh boring::symm::Crypter per sent packet. The shared EVP_AEAD_CTX is built once per write-key epoch, and seal_packet_payload_into writes ciphertext||tag in one pass into a split_at_mut output slice; this drops the per-packet AES key schedule + GHASH H-table rebuild from every sealed packet, including the ACKs a client seals during a GET body. A same-session A/B at n=100 cut the GET p95 ledger-paced tail a further ~1.8us (open-only ~14.4us median -> open+seal ~12.6us); against tokio_quiche in the same session, Specter's worst p95 (12.4us) now beats tokio's best (15.1us) outright, where the receive-only cache had overlapped on worst-vs-best. Encrypt output is byte-identical AES-128-GCM: no wire byte, frame cadence, or fingerprint change. Evidence: docs/benchmarks/native-h3-vs-rust-clients/2026-06-09-seal-context-cache-getack/.
Native HTTP/3 benchmark fixture stream responses now use absolute 1ms chunk cadence instead of rescheduling each chunk from the post-send wall clock. This removes fixture scheduler drift from the GET throughput cell and is recorded as a benchmark-truth improvement, not a superiority claim; the strict repeat gate still must pass before README performance claims are upgraded.
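The difference between absolute cadence and rescheduling from the post-send clock can be sketched as follows. This is an illustration under assumptions, not the fixture's code: `absolute_deadline` and `sleep_before_chunk` are hypothetical helpers, and the 1ms interval is passed in by the caller.

```rust
use std::time::{Duration, Instant};

/// Deadline for chunk `index` on an absolute schedule anchored at `start`.
/// Send overhead for earlier chunks never shifts later deadlines.
pub fn absolute_deadline(start: Instant, interval: Duration, index: u32) -> Instant {
    start + interval * index
}

/// How long to sleep before sending chunk `index`, given the current time.
/// Returns zero when the sender is already late, so it catches up instead of
/// pushing the whole schedule back.
pub fn sleep_before_chunk(
    start: Instant,
    interval: Duration,
    index: u32,
    now: Instant,
) -> Duration {
    absolute_deadline(start, interval, index).saturating_duration_since(now)
}
```

By contrast, a loop that sleeps for the full interval after each send completes adds the send and wake-up overhead of every chunk to all later chunks, which is the drift this entry removes.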
Native HTTP/3 GET repeat-gate runs now build the benchmark binary once and reuse it for every repeat child run, tightening same-binary fairness while reducing awsdev iteration time.
Native HTTP/3 selected-row truth gates now require fixture-ledger capture for publishable and GET-only comparisons, so missing ledger provenance cannot fall back to nominal paced-tail metrics and produce a false pass.
Native HTTP/3 DPLPMTUD path-MTU probes are now deferred off the active RFC 9220 tunnel critical path. A full-size probe is a build + AEAD seal + send_to; the prior code emitted it inline on the tunnel recv->send turn, so the two probes a connection sends while binary-searching the path MTU (around echo 13 and 22 at the fixture's ~16KB receive window) landed as ~100us p99 spikes on the proxied echo round-trip. A flow-control hypothesis was falsified by experiment (suppressing MAX_DATA emission left the spikes intact); gating the probe send removed both. send_client_pmtu_probe_if_available now waits until the tunnel has been quiescent for PMTU_TUNNEL_IDLE_GAP (2ms) before probing (RFC 8899 Section 5.2: probes are low priority); GET/streaming connections with no open tunnel probe immediately as before. On awsdev (Graviton4, quiet, 8 reps, n=100) this drops the per-run spike count 2->0, collapses the tunnel echo p99 from ~103us to ~43us, and moves the echo p95 tail from a loss to a non-overlapping win vs tokio_quiche (40.3us vs 51.2us median; Specter worst 43.3 < tokio best 48.3). Echo p50 and throughput stay at parity at the 1KB single-frame payload. Combined with the per-epoch AEAD context cache, Specter now holds the lower p95 tail on the echo, client-DATA+FIN close, and slow-consumer mixed tunnel workloads, reversing the 4.2.1 withdrawal rationale. No wire byte or fingerprint change: probe packets are unchanged and still sent, only their scheduling moves off the interactive path. Full suite 999 passed. Evidence: docs/benchmarks/native-h3-vs-rust-clients/2026-06-09-pmtu-probe-tunnel-defer/.
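The idle-gap gating described in this entry can be sketched as a small state holder. Only the constant name PMTU_TUNNEL_IDLE_GAP and its 2ms value come from the entry; `ProbeGate` and its methods are hypothetical, and how Specter actually tracks tunnel activity is an assumption.

```rust
use std::time::{Duration, Instant};

/// Idle gap named in the changelog entry (2ms).
pub const PMTU_TUNNEL_IDLE_GAP: Duration = Duration::from_millis(2);

/// Hypothetical per-connection gate for path-MTU probes.
pub struct ProbeGate {
    /// Time of the most recent tunnel activity; None when no tunnel is open.
    last_tunnel_activity: Option<Instant>,
}

impl ProbeGate {
    pub fn new() -> Self {
        Self { last_tunnel_activity: None }
    }

    /// Record tunnel traffic (either direction) at `now`.
    pub fn on_tunnel_activity(&mut self, now: Instant) {
        self.last_tunnel_activity = Some(now);
    }

    /// Forget the tunnel once it closes, so probing is immediate again.
    pub fn on_tunnel_closed(&mut self) {
        self.last_tunnel_activity = None;
    }

    /// A probe may be sent when no tunnel is open, or when the tunnel has
    /// been quiet for at least the idle gap.
    pub fn may_probe(&self, now: Instant) -> bool {
        match self.last_tunnel_activity {
            None => true,
            Some(last) => now.saturating_duration_since(last) >= PMTU_TUNNEL_IDLE_GAP,
        }
    }
}
```

The probe itself is unchanged in this design; only the decision about when to emit it moves behind `may_probe`, which matches the entry's statement that scheduling changes while wire bytes do not.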
Native HTTP/3 direct-GET epoch loop now drains the entire ready burst before any wire maintenance and keeps one pinned timer across loop passes. Three coordinated scheduling changes: boundary ACKs are sealed inline at their exact ack-eliciting-threshold crossings during the burst drain (the caller-side threshold clamp that chopped multi-quantum bursts into drain/flush round trips is gone; sealed packets join the existing deferred-send queue, so wire bytes and ACK cadence are unchanged); the parked-wake select arm drains the rest of the burst immediately after the waking datagram instead of running the full maintenance pass between the first datagram and the FIN-bearing remainder; and the two per-iteration tokio sleep futures are replaced by one pinned sleep re-armed only when the min of the delayed-ACK and loss-detection deadlines changes, removing timer-wheel register/deregister churn from every park/wake cycle. Worst-rep tails at the shipping io-epoch profile (n=100 x 4 reps): p50 ledger-paced tail 4.28us -> 2.58us with three of four reps at 0, p95 11.61us -> 7.85us; TTFB and throughput unchanged. Receive-side scheduling only: no wire byte or fingerprint change. Full suite 1004 passed. Evidence: docs/benchmarks/native-h3-vs-rust-clients/2026-06-09-direct-get-burst-ordering-scout/.
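The "re-arm only when the minimum deadline changes" rule can be sketched without any async runtime. `SingleTimer` is a hypothetical type that records the armed deadline and counts re-arms; in a tokio-based loop the re-arm step would typically be a `Sleep::reset` call on a pinned sleep, though whether Specter does exactly that is an assumption.

```rust
use std::time::Instant;

/// Hypothetical bookkeeping for one shared timer covering two deadlines.
pub struct SingleTimer {
    /// Deadline the timer is currently armed for, if any.
    armed: Option<Instant>,
    /// Number of times the timer had to be re-armed.
    rearm_count: u64,
}

impl SingleTimer {
    pub fn new() -> Self {
        Self { armed: None, rearm_count: 0 }
    }

    /// Recompute the earliest of the two deadlines and re-arm only if it
    /// differs from what is already armed. Returns true when a re-arm
    /// happened.
    pub fn update(
        &mut self,
        delayed_ack: Option<Instant>,
        loss_detection: Option<Instant>,
    ) -> bool {
        let next = match (delayed_ack, loss_detection) {
            (Some(a), Some(b)) => Some(a.min(b)),
            (a, b) => a.or(b),
        };
        if next == self.armed {
            return false; // unchanged: no timer-wheel traffic this pass
        }
        self.armed = next;
        self.rearm_count += 1;
        true
    }

    pub fn armed(&self) -> Option<Instant> {
        self.armed
    }

    pub fn rearm_count(&self) -> u64 {
        self.rearm_count
    }
}
```

Calling `update` on every loop pass is cheap because most passes leave the minimum unchanged and return early, which is the churn reduction the entry describes.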
Native HTTP/3 1-RTT receive path now decodes each datagram with one payload copy instead of three. The header-protection unmask copies only the up-to-21-byte protected prefix into the reusable scratch buffer instead of refilling the full ~1.3KB ciphertext (the AEAD reads the payload region from the original datagram buffer in place); QUIC frame decode consumes the already-owned plaintext Bytes directly via a new decode_frames_bytes, so STREAM data becomes refcounted slices of the packet allocation instead of a second full copy plus malloc per datagram; and the post-decode padding filter retains in place instead of collecting a second frame Vec. Scout at the shipping io-epoch profile (n=100 x 4 reps): specter per-rep p50 ledger-paced tail 0/0/803/0 ns (was 0/0/0/2582 before this change, with other same-day gates rolling 2143-3441 ns worst reps) and p95 3804/3252/6746/2638 ns (was worst 7124-9368); worst-rep p50 now sits below the best reqwest_h3 rep ever observed on this host. TTFB p50 31.1-33.4us across reps. Decrypt output and wire bytes are unchanged: receive-side memory layout only, no fingerprint change. Full suite 1004 passed. Evidence: docs/benchmarks/native-h3-vs-rust-clients/2026-06-09-direct-get-copychain-scout/.
Pruned superseded benchmark artifacts and refreshed benchmark documentation to current state: docs/benchmarks/ drops from 98MB to 17MB, keeping the 2026-06-09 evidence chain, the 2026-06-03 combined capture and streaming/websocket reps, and the 2026-05-25 transport-baseline regression fixture. Artifact paths cited by earlier changelog entries may refer to pruned files; current claims and their evidence live in README.md, docs/benchmarks/native-h3-vs-rust-clients/README.md, and docs/specter-native-h3-remaining-seams.md.

Fixed

Native HTTP/3 direct-GET epoch loop now defers only the send syscall of boundary ACKs past the current drain phase. The ACK packet is still built and sealed at its normal ack-eliciting-threshold boundary (packet number consumed, tracker marked, bytes and cadence identical on the wire); its dispatch moves from between the threshold drain-stop and the FIN-bearing remainder of the burst to the next flush, the pre-park drain, or the sample exit, whichever comes first. This removes an AEAD seal + sendmsg from the measured window between a response body's last datagram arriving and completion being observed, which the fixture-ledger gate charges to the client tail. A/B at n=100 x4 reps cut the GET worst-rep p95 ledger-paced tail from 14.5us to 10.6us with TTFB and throughput unchanged. Applies to the shipping io-epoch path; real clients see response bodies complete sooner by the same margin.
Native HTTP/3 stream reassembly now buffers out-of-order STREAM segments (RFC 9000 Section 2.2) instead of discarding everything ahead of the contiguous edge. The old in-order-only guard meant one reordered datagram froze the stream: every later segment including the FIN was dropped while its packet number was still ACKed, so the peer never retransmitted and the native H3 GET hung until idle timeout under as little as 200us of path latency (netem repro: 0/8 completions before, 8/8 after), on both the shipping driver path and the bench epoch path. Masked at RTT~0 because the loopback fixture delivers in order. The fix tracks per-stream pending segments and final size, completes a stream only when contiguous data reaches the FIN's final size, and brings the unidirectional branch into offset tracking (closing an adjacent duplicate/reorder corruption hazard). Receive-side bookkeeping only: no wire byte, ACK cadence, or fingerprint change; RTT0 TTFB/throughput unchanged (n=100 A/B). Full suite 1004 passed. Evidence: docs/benchmarks/native-h3-vs-rust-clients/2026-06-09-rtt-reorder-reassembly-fix/.
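The reassembly rule in this entry (buffer segments ahead of the contiguous edge, complete only when contiguous data reaches the FIN's final size) can be sketched as below. `StreamReassembler` is a hypothetical type, not Specter's implementation; it copies data into `Vec<u8>` for simplicity and does not validate final-size violations or enforce flow-control limits.

```rust
use std::collections::BTreeMap;

/// Hypothetical receive-side reassembly state for one stream.
#[derive(Default)]
pub struct StreamReassembler {
    /// Next contiguous offset not yet delivered.
    next_offset: u64,
    /// Segments that arrived ahead of the contiguous edge, keyed by offset.
    pending: BTreeMap<u64, Vec<u8>>,
    /// Final size of the stream, learned from the segment carrying FIN.
    final_size: Option<u64>,
    /// Contiguous data delivered so far.
    delivered: Vec<u8>,
}

impl StreamReassembler {
    /// Accept one STREAM segment in any order.
    pub fn push(&mut self, offset: u64, data: &[u8], fin: bool) {
        let end = offset + data.len() as u64;
        if fin {
            self.final_size = Some(end);
        }
        if end > self.next_offset {
            // Trim any prefix that was already delivered (duplicate or overlap).
            let skip = self.next_offset.saturating_sub(offset) as usize;
            let start = offset + skip as u64;
            let seg = &data[skip..];
            // Keep the longer segment when two start at the same offset.
            let keep = self.pending.get(&start).map_or(true, |old| old.len() < seg.len());
            if keep {
                self.pending.insert(start, seg.to_vec());
            }
        }
        self.drain_contiguous();
    }

    /// Move every segment that touches the contiguous edge into `delivered`.
    fn drain_contiguous(&mut self) {
        while let Some((&start, _)) = self.pending.first_key_value() {
            if start > self.next_offset {
                break; // gap: wait for the missing segment
            }
            let seg = self.pending.remove(&start).expect("key was just observed");
            let skip = (self.next_offset - start) as usize;
            if skip < seg.len() {
                self.delivered.extend_from_slice(&seg[skip..]);
                self.next_offset = start + seg.len() as u64;
            }
        }
    }

    /// True once contiguous data has reached the FIN's final size.
    pub fn is_complete(&self) -> bool {
        self.final_size == Some(self.next_offset)
    }

    pub fn delivered(&self) -> &[u8] {
        &self.delivered
    }
}
```

With this shape, a FIN-bearing segment that arrives before earlier data is held in `pending` rather than dropped, and the stream completes as soon as the gap is filled, with no retransmission needed.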
