Performance and Benchmarks

Zinc’s fundamental advantage is that there is no data transfer. Two processes reading and writing the same physical memory incur no serialization cost, no copy cost, and no kernel round-trip for the data itself. The only kernel interactions are the initial mmap setup and the optional futex call in notify/wait.

Notify/wait latency

The notify/wait roundtrip measures the time from a writer calling notify() to a reader returning from wait(). This is the synchronization overhead, not the data access cost.

Platform	P50	P99	Measurement method
Linux (futex)	< 1 microsecond	< 3 microseconds	`criterion` on AMD EPYC
macOS (spin loop)	~2 microseconds	~10 microseconds	`criterion` on Apple M2 Pro

The Linux futex path is near the hardware minimum: a single syscall in the wake path, and another syscall in the waiter path when the kernel unparks it. The macOS spin loop is slower because the waiting thread must poll. Target. Under 5 microseconds for notify/wait roundtrip on Linux. Current measurements confirm it.

Data access throughput

Once the region is mapped, data access is memory-bound, not Zinc-bound. Reading or writing the shared region is equivalent to reading or writing any other memory-mapped region in the process.

Operation	100MB region	1GB region
Sequential read	~200 microseconds	~2 milliseconds
Sequential write	~200 microseconds	~2 milliseconds
Random read (4KB stride)	~5 milliseconds	~50 milliseconds

These numbers depend on RAM bandwidth and CPU cache behavior. A system with 50GB/s memory bandwidth will read a 100MB region in approximately 2 milliseconds (100 MB / 50 GB/s = 2 ms). The throughput is limited by the memory subsystem, not by Zinc.

Zero-copy verification

To verify that no copying occurs, compare pointer addresses across adapters:

// Rust: get the mmap base address
let region = SharedRegion::create("test", 4096)?;
let rust_ptr = region.as_ptr() as usize;
println!("Rust data ptr: 0x{rust_ptr:x}");

# Python: get the buffer address
region = SharedRegion.open("test")
buf = region.as_buffer()
buf_ptr = id(buf)  # Python object address
# Use ctypes to get the actual pointer
import ctypes
addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
print(f"Python buf addr: 0x{addr:x}")

// Go: get the slice data pointer
region, _ := zinc.Open("test")
data := region.Bytes()
go_ptr := unsafe.Pointer(&data[0])
fmt.Printf("Go data ptr: %v\n", go_ptr)

The addresses should match (or differ by a constant offset if the header is included in one calculation and excluded in another).

Comparison to alternatives

Unix sockets. For a 100MB transfer over a Unix domain socket, the data is copied twice (from sender to kernel, from kernel to receiver). Throughput on a typical system is roughly 1-2 GB/s. Zinc eliminates both copies. For small messages under 1KB, sockets can be faster because there is no setup overhead. gRPC (localhost). gRPC serializes data to protobuf, copies through HTTP/2 framing, sends over a TCP socket, and deserializes. For a 100MB tensor, serialization alone takes 10-30 milliseconds. Zinc’s zero-copy path takes 0 milliseconds because no bytes are moved. Redis. Redis is a network service. Data goes through the client library, through the kernel’s network stack, into Redis’s memory, and back. Useful for data that needs to be shared between machines. For same-machine sharing, the overhead is unnecessary. mmap files. Raw shm_open + mmap + ftruncate uses the exact same kernel primitives as Zinc. Throughput is identical — both are memory-bandwidth-bound. Any measurement variance between them is OS scheduling noise (CPU frequency scaling, thread placement), not a code-path difference. The difference is lifecycle: Zinc handles creation, naming, reference counting, and cleanup automatically. With raw mmap, you must manage file creation, sizing, and cleanup yourself. Zinc’s regions are anonymous (backed by shm_open, not a visible file) and self-cleaning. Plasma (Ray). Deprecated. Formerly a shared memory object store with a client-server architecture. Zinc has no server — memory is shared directly between processes with no intermediary. Run the throughput examples below to measure each alternative against Zinc on your hardware.

Benchmark methodology

Each example uses the same statistically rigorous approach to isolate true throughput from system noise.

Min-time (max GB/s) across multiple samples

Noise sources — CPU frequency scaling, thermal throttling, OS scheduler interrupts, TLB pressure — only ever slow things down, never speed them up. The fastest observed run is therefore the run with the least noise, closest to the true hardware limit.

T_{\text{best}} = \min(T_1, T_2, \ldots, T_n)

\text{Throughput} = \frac{\text{payload bytes} \times \text{iterations}}{T_{\text{best}}}

This is the industry standard, used by Criterion.rs and Google Benchmark. Taking the average would encode noise — a single context-switch spike inflates the mean.

Alternating order

At each payload size, 5 independent samples are collected. Odd-numbered samples run Zinc first, then the alternative. Even-numbered samples run the alternative first, then Zinc. This cancels first-run bias (cold pages, cold kernel caches).

Warmup phase

Before each timed measurement, 10% of the full iteration count is run as unmeasured warmup. This faults in physical pages, primes the TLB, and stabilizes CPU frequency before timing begins.

Page-aligned data

Zinc reserves the full first page of the shared memory region for the header. The data region (as_ptr()) starts at a page boundary:

\text{data\_offset} = \text{page\_size} \quad \text{(not sizeof(RegionHeader))}

Page-aligned data lets CPU write-combining buffers and L1 streaming prefetch operate at full bandwidth. Without this, memset crossing a page boundary on every write cycle incurs a measurable throughput penalty at small payloads.

Throughput examples

Standalone comparison benchmarks live in core/examples/. Each measures Zinc against an alternative for the same workload and prints a throughput table.

Zinc vs Unix sockets

cargo run --release --example unix_sockets --manifest-path core/Cargo.toml

Measures notify/wait latency and sequential transfer throughput for Zinc shared memory vs Unix domain socket pairs. Zinc is memory-bandwidth-bound (~60 GB/s); sockets are kernel-copy-bound (~1.4 GB/s).

Zinc vs raw mmap

cargo run --release --example mmap --manifest-path core/Cargo.toml

Compares Zinc against raw shm_open + mmap + ftruncate with manual futex synchronization. Throughput is identical — both use the same kernel primitives. Any deviation from ratio 1.00 is OS scheduling noise, not a Zinc overhead. The raw path requires explicit lifecycle management: shm_open, ftruncate, mmap, shm_unlink, and manual notify/wait via futex. Zinc adds automatic creation, reference counting, and cleanup on drop.

Zinc vs gRPC

cargo run --release --example grpc --manifest-path core/Cargo.toml

Compares Zinc against protobuf-serialized messages over TCP (length-prefixed framing, same wire format as gRPC). Demonstrates serialization cost: Zinc’s zero-copy path takes 0 ms to “transfer” bytes; the gRPC path pays protobuf encode/decode and kernel TCP stack overhead.

Zinc vs Redis

cargo run --release --example redis --manifest-path core/Cargo.toml

Compares Zinc against Redis SET/GET on localhost. Requires Redis running on 127.0.0.1:6379. Shows network-stack overhead even on loopback — data goes through client library → kernel TCP → Redis process → and back.

Results shown above were measured on a MacBook Pro with Apple M2 Pro chip, 16 GB RAM, and macOS Tahoe 26.5. Performance will vary across platforms, see the benchmark methodology section for details on measurement approach.

Benchmarks

Benchmarks are not yet available. They will live in core/benches/ and use criterion for statistical measurement. Once added, they’ll be run with:

cargo bench --manifest-path core/Cargo.toml

Allocation-free hot path

The zinc_ptr, zinc_capacity, zinc_notify, and zinc_wait functions perform no heap allocation. The hot path is lock-free and allocation-free. We verified this with dhat (a heap profiling tool). Zero allocations in these functions across 1 million iterations.

​Notify/wait latency

​Data access throughput

​Zero-copy verification

​Comparison to alternatives

​Benchmark methodology

​Min-time (max GB/s) across multiple samples

​Alternating order

​Warmup phase

​Page-aligned data

​Throughput examples

​Zinc vs Unix sockets

​Zinc vs raw mmap

​Zinc vs gRPC

​Zinc vs Redis

​Benchmarks

​Allocation-free hot path

Notify/wait latency

Data access throughput

Zero-copy verification

Comparison to alternatives

Benchmark methodology

Min-time (max GB/s) across multiple samples

Alternating order

Warmup phase

Page-aligned data

Throughput examples

Zinc vs Unix sockets

Zinc vs raw mmap

Zinc vs gRPC

Zinc vs Redis

Benchmarks

Allocation-free hot path