API reference
Core codec
- z4ai.compress(data, *, dtype=None, level=None, threads=0, auto=True, effort=None)
Compress a buffer (or NumPy array) into a self-describing z4ai frame.
- Parameters:
data (bytes | bytearray | memoryview | object) – raw bytes/bytearray/memoryview, or a numpy.ndarray (dtype and shape are then inferred and round-tripped).
dtype (str | None) – element type code when data is a raw buffer, e.g. "bf16", "fp16", "fp32", "fp64". None treats the buffer as opaque bytes (still compressed, no float-plane split). Ignored when data is a NumPy array.
level (int | None) – Zstd level (1..22). None uses the codec default.
threads (int) – 0 = auto (all cores), 1 = single-threaded, n = n workers.
auto (bool) – when True (default), also try plain whole-buffer Zstd, the byte-plane transpose, and the sparse zero-aware path, keeping the smallest frame. Set False to force the dtype’s primary transform only (fastest compress; skips the best-of passes).
effort (str | None) – ratio/speed tier. None (default) uses the fast streaming codec (the balanced default: multi-GB/s, with a ratio that ties or beats ZipNN). "max" selects the high-ratio AUTO codec, which adds a context-modeling backend (brotli, chunk-parallel) on top of the field split for the smallest possible file. It is the right choice for model distribution (compress once, download many): measured 3% to 12% smaller than ZipNN on real transformers, at ~3-12 MB/s offline compress and ~1 GB/s decompress. "balanced" is an intermediate tier. See README.md for the comparison table.
- Returns:
A bytes frame: b"Z4AI" for the default codec, b"ZARA" for effort="max"/"balanced". decompress() accepts either; decompression needs no side information.
- z4ai.decompress(blob, *, out=None)
Decompress a z4ai frame back to the original buffer (byte-identical).
- Parameters:
blob (bytes) – a frame produced by compress().
out (bytearray | memoryview | None) – optional writable buffer to decode into; when given it is filled and returned.
- Returns:
The original bytes (or out if provided).
- z4ai.compress_array(arr, **kwargs)
Compress a NumPy array, recording dtype and shape for exact restore.
- z4ai.decompress_array(blob)
Decompress a frame created by compress_array() into an array.
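The byte-plane transpose that compress() tries for typed buffers can be illustrated with a small standalone sketch. It is not z4ai's implementation: split_planes and join_planes are hypothetical helper names, and the element width of 4 simply matches fp32. For values in a narrow range the sign/exponent byte of every element is identical, so that plane becomes a constant run that any entropy coder handles cheaply.

```python
import struct


def split_planes(buf: bytes, width: int) -> list:
    """Return `width` planes; plane j holds byte j of every element."""
    if len(buf) % width:
        raise ValueError("buffer length is not a multiple of the element width")
    return [buf[j::width] for j in range(width)]


def join_planes(planes: list) -> bytes:
    """Inverse of split_planes: interleave the planes back into elements."""
    width = len(planes)
    out = bytearray(len(planes[0]) * width)
    for j, plane in enumerate(planes):
        out[j::width] = plane
    return bytes(out)


# 4096 little-endian fp32 values in [1.0, 2.0): same sign and exponent everywhere.
values = [1.0 + i * 1e-4 for i in range(4096)]
raw = struct.pack(f"<{len(values)}f", *values)

planes = split_planes(raw, 4)
restored = join_planes(planes)
```

Here planes[3] (the most significant byte in little-endian order) is 4096 copies of 0x3F, while the low mantissa planes stay close to incompressible; separating them keeps the noisy bytes from diluting the regular ones.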
Checkpoint deltas
- z4ai.compress_delta(current, reference, *, dtype=None, level=None, threads=0)
Losslessly compress current as a delta against a reference checkpoint.
Consecutive training/fine-tuning checkpoints differ only slightly; storing the bit-exact XOR delta (current ^ reference) makes most bytes zero and compresses far below the from-scratch float-entropy ceiling. Decompress with decompress_delta() and the same reference. See z4ai.delta.
- z4ai.decompress_delta(blob, reference, *, out=None)
Reconstruct a checkpoint from a delta frame and its reference (see compress_delta()).
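The XOR-delta idea can be sketched with the standard library alone. This is an illustration under stated assumptions, not the z4ai.delta code: zlib stands in for the real backend, and xor_bytes, delta_compress and delta_decompress are hypothetical names. The reference is random (incompressible) data, and the "next checkpoint" flips one bit in 64 positions.

```python
import random
import zlib


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers."""
    if len(a) != len(b):
        raise ValueError("buffers must have the same length")
    n = len(a)
    return (int.from_bytes(a, "little") ^ int.from_bytes(b, "little")).to_bytes(n, "little")


def delta_compress(current: bytes, reference: bytes) -> bytes:
    # Unchanged bytes XOR to zero, so the delta is mostly one long zero run.
    return zlib.compress(xor_bytes(current, reference), 6)


def delta_decompress(blob: bytes, reference: bytes) -> bytes:
    # XOR is its own inverse: (current ^ reference) ^ reference == current.
    return xor_bytes(zlib.decompress(blob), reference)


rng = random.Random(0)
reference = bytes(rng.getrandbits(8) for _ in range(1 << 16))
current = bytearray(reference)
for pos in rng.sample(range(len(current)), 64):
    current[pos] ^= 0x01  # flip one bit in 64 distinct positions
current = bytes(current)

delta_blob = delta_compress(current, reference)
full_blob = zlib.compress(current, 6)  # from-scratch baseline
```

Because the reference is random, full_blob cannot shrink below the input size, while delta_blob only has to encode 64 isolated non-zero bytes. Decoding needs the same reference, exactly as decompress_delta() does.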
Model-aware deltas (.safetensors)
Whole-checkpoint delta compression, aligned tensor-by-tensor by name.
z4ai.delta deltas two byte buffers that are already positionally
aligned. Real .safetensors checkpoints are not: a fine-tune may add, drop,
rename or re-order tensors relative to its base, which shifts every following
byte and makes a raw whole-buffer XOR align nothing (high entropy, no win). This
module deltas at the tensor granularity instead – it matches each tensor in the
target to the base tensor of the same name and shape, so unchanged and
slightly-changed tensors delta cleanly regardless of how the surrounding layout
moved.
This is the lossless lever that matters most in production. The bulk of stored checkpoints are derived: a fine-tune of a base model, the next training step, an RLHF / DPO iteration, an EMA copy, or a plain re-upload. Their weights are mostly bit-identical to a reference, so the redundancy lives between files and is invisible to any single-file codec (z4ai included) or to ZipNN. Measured on synthetic fine-tunes: a checkpoint with ~2 % of weights changed stores ~25x smaller as a name-aligned delta than compressed from scratch; with ~10 % changed, ~6x smaller; a plain re-upload shrinks without bound, because every tensor is a zero-cost COPY. See the low-precision lossless literature (arXiv:2508.19263) for the same effect on real BF16 training-checkpoint sequences.
Each target tensor is stored as one of:
- COPY – byte-identical to the base tensor of the same name; nothing stored.
- DELTA – a z4ai.delta frame of target_tensor XOR base_tensor (kept only when it beats a from-scratch frame; delta.compress itself picks the smaller of XOR-delta vs full, so this never loses to plain compression).
- FULL – a standalone z4ai.compress() frame, for tensors with no shape/name match in the base (new or reshaped tensors).
Reconstruction is bit-exact: the target’s original safetensors JSON header is stored verbatim, every tensor is restored to its exact byte span, and a digest of the base data buffer is recorded so decoding against the wrong base fails loudly instead of silently corrupting weights. The base checkpoint is supplied by the caller at both ends (you already have it on disk when you write the derived one).
Example:
>>> from z4ai import model_delta
>>> blob = model_delta.compress(finetuned_safetensors, base_safetensors)
>>> restored = model_delta.decompress(blob, base_safetensors)
>>> assert restored == finetuned_safetensors # byte-identical
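The COPY / DELTA / FULL decision described above can be sketched per tensor as follows. This is a simplified model rather than the module's code: tensors are plain (shape, raw_bytes) pairs in dicts, zlib stands in for the z4ai frames, and plan_tensor is a hypothetical name.

```python
import random
import zlib


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers."""
    n = len(a)
    return (int.from_bytes(a, "little") ^ int.from_bytes(b, "little")).to_bytes(n, "little")


def plan_tensor(name, target, base):
    """Pick COPY / DELTA / FULL for one target tensor.

    target and base map tensor name -> (shape, raw_bytes).
    """
    shape, data = target[name]
    match = base.get(name)
    # No base tensor with the same name and shape: store standalone.
    if match is None or match[0] != shape or len(match[1]) != len(data):
        return "FULL", zlib.compress(data)
    # Byte-identical to the base tensor: nothing to store.
    if match[1] == data:
        return "COPY", b""
    # Otherwise keep whichever encoding is smaller.
    delta = zlib.compress(xor_bytes(data, match[1]))
    full = zlib.compress(data)
    return ("DELTA", delta) if len(delta) < len(full) else ("FULL", full)


rng = random.Random(0)
weights = bytes(rng.getrandbits(8) for _ in range(4096))
tuned = bytearray(weights)
tuned[100] ^= 0x01
tuned[2000] ^= 0x01

base = {
    "embed": ((32, 32), weights[:1024]),
    "layer.weight": ((64, 64), weights),
    "head": ((16, 8), weights[:128]),
}
target = {
    "embed": ((32, 32), weights[:1024]),       # untouched
    "layer.weight": ((64, 64), bytes(tuned)),  # two bytes changed
    "head": ((8, 16), weights[:128]),          # same bytes, different shape
    "adapter": ((4, 4), weights[:16]),         # not present in the base
}
plan = {name: plan_tensor(name, target, base)[0] for name in target}
```

Matching by name and shape is what makes the result independent of tensor order in the file: adding "adapter" shifts every later byte offset, yet "layer.weight" still deltas cleanly against its base counterpart.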
- z4ai.model_delta.peek_is_model_delta(blob)
True if blob is a z4ai whole-checkpoint delta frame.
- Return type:
bool
- z4ai.model_delta.compress(target, base, *, level=None, threads=0)
Compress safetensors target as a name-aligned delta against base.
Both arguments are raw .safetensors buffers (bytes-like). Returns a self-describing Z4DM frame that decompress() turns back into the byte-identical target given the same base. The result is never larger than compressing target standalone: every tensor independently keeps the smaller of its delta and from-scratch encoding.
- z4ai.model_delta.decompress(blob, base)
Reconstruct the byte-identical safetensors target from a Z4DM frame and the same base buffer used to build it (see compress()).
- z4ai.model_delta.compress_file(target_path, base_path, out_path, *, level=None, threads=0)
Delta-compress the safetensors at target_path against base_path.
Writes a Z4DM frame to out_path and returns a small stats dict (original, stored, ratio). Reconstruct with decompress_file() and the same base file.
safetensors container (random access)
Apply the z4ai codec to .safetensors checkpoints, tensor by tensor.
This is the production use case ZipNN is built around (zipnn_safetensors /
zipnn_hf): take a Hugging Face / vLLM .safetensors checkpoint and shrink
it losslessly. The z4ai codec is applied to each tensor independently with
its per-dtype byte-plane + long-distance-matching pipeline, and the results are
packed into a self-describing ZSTN container - the codec’s on-disk format
for a whole checkpoint. It uses the tensor’s declared dtype so
BF16/FP16/FP32 tensors take the model-weight fast path while everything else is
compressed opaquely. The round-trip is byte-identical to the original file.
Why per-tensor (not whole-file Zstd)? Because z4ai’s exponent-plane separation only helps when applied to homogeneous float data; mixing tensors of different dtypes into one stream defeats it. Per-tensor also means z4ai’s long-distance matching can dedup within a large tensor (tied embeddings, repeated blocks) - exactly where z4ai beats ZipNN’s 256 KiB chunking.
Random access - the container property that matters for distribution
A ZSTN v3 container carries a tensor index and a fixed-size footer,
so a reader can locate and decompress one tensor without touching the rest
of the file (ZstnReader). This preserves the single most valuable
property of safetensors - per-tensor lazy loading / partial reads - which a
plain “compress the whole file” approach destroys. Tied weights are stored once
and referenced, and the reader resolves the reference transparently.
No third-party safetensors package is required: the format is a tiny,
well-specified container that we parse directly.
safetensors v1 source format:
    8 bytes   little-endian u64 = header length N
    N bytes   UTF-8 JSON header: { name: {dtype, shape, data_offsets:[b,e]},
              optional "__metadata__": {str: str} }
    rest      the data buffer; tensor `name` occupies buffer[b:e]
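The source format above is small enough to parse with struct and json. The following is a minimal sketch of that parsing, not the module's own parser; build_safetensors, parse_safetensors and tensor_bytes are hypothetical helper names, and the builder exists only so the example has something to read.

```python
import json
import struct


def build_safetensors(tensors: dict) -> bytes:
    """Pack {name: (dtype, shape, raw_bytes)} into a safetensors-style buffer."""
    header = {}
    data = bytearray()
    for name, (dtype, shape, raw) in tensors.items():
        begin = len(data)
        data += raw
        header[name] = {"dtype": dtype, "shape": shape, "data_offsets": [begin, len(data)]}
    hjson = json.dumps(header, separators=(",", ":")).encode("utf-8")
    return struct.pack("<Q", len(hjson)) + hjson + bytes(data)


def parse_safetensors(blob: bytes):
    """Return (header dict, memoryview over the data buffer)."""
    (hlen,) = struct.unpack_from("<Q", blob, 0)       # 8-byte little-endian length
    header = json.loads(blob[8:8 + hlen].decode("utf-8"))
    return header, memoryview(blob)[8 + hlen:]


def tensor_bytes(header: dict, data, name: str) -> bytes:
    """Slice one tensor out of the data buffer using its data_offsets."""
    begin, end = header[name]["data_offsets"]
    return bytes(data[begin:end])


blob = build_safetensors({
    "a": ("F32", [2], struct.pack("<2f", 1.0, 2.0)),
    "b": ("U8", [3], b"\x01\x02\x03"),
})
header, data = parse_safetensors(blob)
```

A byte-exact round trip additionally has to keep the JSON header verbatim rather than re-serialising it, since key order and whitespace are part of the original file.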
ZSTN v3 container layout:
    "ZSTN"          4-byte magic
    u8              version (= 3)
    u8              contiguous flag (1 = per-tensor index present, 0 = single opaque frame)
    u64 hlen ; <hlen> bytes     original safetensors JSON header, verbatim
    u64 data_len    length of the original data buffer
    -- if not contiguous --
    u64 frame_len ; <frame_len> bytes     one opaque z4ai frame of the buffer
    -- if contiguous --
    <frames section>    unique tensor frames, back to back
    <index section>     starts at index_offset (see footer):
        u32 n_tensors
        per tensor (emission order):
            u16 name_len ; <name>
            u64 dest_begin ; u64 dest_end     destination in the data buffer
            u8 kind (0 = FRAME, 1 = REF)
            if FRAME: u64 frame_offset ; u64 frame_len (absolute in file)
            if REF: u32 ref_index (earlier FRAME entry)
    <footer> (fixed 20 bytes, always last):
        u64 index_offset ; u64 index_len ; "ZSTN"
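Reading the footer first and then the index is what makes single-tensor access possible. The sketch below follows the field order listed above but is not the ZstnReader code: it assumes little-endian integers (the layout does not state the byte order), and pack_index and read_index are hypothetical helpers operating on a synthetic buffer.

```python
import struct

FOOTER = struct.Struct("<QQ4s")  # index_offset, index_len, magic: 8 + 8 + 4 = 20 bytes
KIND_FRAME, KIND_REF = 0, 1


def pack_index(entries) -> bytes:
    """Serialise index entries in the documented field order (demo helper)."""
    out = bytearray(struct.pack("<I", len(entries)))
    for e in entries:
        name = e["name"].encode("utf-8")
        out += struct.pack("<H", len(name)) + name
        out += struct.pack("<QQB", e["dest_begin"], e["dest_end"], e["kind"])
        if e["kind"] == KIND_FRAME:
            out += struct.pack("<QQ", e["frame_offset"], e["frame_len"])
        else:
            out += struct.pack("<I", e["ref_index"])
    return bytes(out)


def read_index(blob: bytes):
    """Locate the index via the fixed-size footer, then decode every entry."""
    index_offset, index_len, magic = FOOTER.unpack_from(blob, len(blob) - FOOTER.size)
    if magic != b"ZSTN":
        raise ValueError("footer magic mismatch")
    pos = index_offset
    (n_tensors,) = struct.unpack_from("<I", blob, pos)
    pos += 4
    entries = []
    for _ in range(n_tensors):
        (name_len,) = struct.unpack_from("<H", blob, pos)
        pos += 2
        name = bytes(blob[pos:pos + name_len]).decode("utf-8")
        pos += name_len
        dest_begin, dest_end, kind = struct.unpack_from("<QQB", blob, pos)
        pos += 17
        entry = {"name": name, "dest_begin": dest_begin, "dest_end": dest_end, "kind": kind}
        if kind == KIND_FRAME:
            entry["frame_offset"], entry["frame_len"] = struct.unpack_from("<QQ", blob, pos)
            pos += 16
        else:
            (entry["ref_index"],) = struct.unpack_from("<I", blob, pos)
            pos += 4
        entries.append(entry)
    if pos - index_offset != index_len:
        raise ValueError("index length mismatch")
    return entries


entries = [
    {"name": "embed.weight", "dest_begin": 0, "dest_end": 64,
     "kind": KIND_FRAME, "frame_offset": 32, "frame_len": 40},
    {"name": "lm_head.weight", "dest_begin": 64, "dest_end": 128,
     "kind": KIND_REF, "ref_index": 0},  # tied weight: points at entry 0
]
prefix = bytes(72)  # placeholder for the header and frames sections
index = pack_index(entries)
blob = prefix + index + FOOTER.pack(len(prefix), len(index), b"ZSTN")
parsed = read_index(blob)
```

With the index in hand, a reader only needs to seek to frame_offset and read frame_len bytes for the tensor it wants; a REF entry is resolved by following ref_index to the earlier FRAME entry.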
Public API
- compress_bytes() / decompress_bytes() - in-memory, whole file.
- compress_file() / decompress_file() - on-disk, returns stats.
- ZstnReader - lazy, random-access single-tensor reads from a file or buffer (read_raw / read_numpy / names / header).
- class z4ai.safetensors.ZstnReader(*, fh=None, buf=None, owns_file=False)
Random-access reader for a ZSTN v3 container.
Reads only the header and tensor index up front; each read_raw() / read_numpy() seeks to a single tensor’s frame and decompresses just that tensor - the rest of the file is never touched. This is the lazy / partial-load path that makes ZSTN a drop-in for safetensors’ per-tensor access rather than a whole-file blob.
Open from a path (recommended for large checkpoints - uses seek so only the needed frames are read) or from an in-memory buffer:
    with ZstnReader.open("model.zstn") as r:
        w = r.read_numpy("model.embed_tokens.weight")
    r = ZstnReader.from_bytes(frame)
- classmethod open(path)
Open a .zstn file for lazy, seek-based random access.
- Parameters:
path (str)
- classmethod from_bytes(frame)
Open an in-memory ZSTN buffer.
- Parameters:
frame (bytes | bytearray | memoryview)
- header()
The original safetensors JSON header (dtype/shape/offsets, metadata).
- read_raw(name)
Decompress and return the raw bytes of a single tensor.
Only that tensor’s frame is read from disk; tied-weight references are resolved to the original frame transparently.
- read_numpy(name)
Return a tensor as a NumPy array with its declared dtype and shape.
Raises TypeError for dtypes NumPy cannot represent natively (bf16, fp8) - use read_raw() for those.
- Parameters:
name (str)
- z4ai.safetensors.compress_bytes(blob, *, level=None, threads=0, workers=None)
Compress a safetensors buffer; returns a self-describing ZSTN frame.
The original JSON header is preserved verbatim so reconstruction is exactly byte-identical (including key order and whitespace). Each tensor’s bytes are compressed with z4ai.compress() using the tensor’s dtype, and a tensor index + footer are appended so ZstnReader can read one tensor without decompressing the others.
If the tensors do not contiguously cover the data buffer (unexpected for valid safetensors), the whole data section is compressed opaquely as a single stream - still byte-exact, just without per-tensor random access.
- z4ai.safetensors.decompress_bytes(frame, *, workers=None)
Invert compress_bytes(); returns the byte-identical safetensors buffer.
High-throughput chunked codec
Chunked, multi-threaded codec - the high-throughput path.
ZipNN reaches several GB/s by fusing a native byte-split with a multi-threaded
Zstd pass over chunks. Profiling (FINDINGS-speed.md) showed our throughput
ceiling is single-threaded Zstd, not the NumPy split - and crucially that
both NumPy and Zstandard release the GIL. So we can parallelize a pure
Python pipeline: cut the buffer into element-aligned chunks and run, on each
core, an independent split -> Zstd pass.
This module keeps a deliberately lean per-chunk path (direct byte-plane transpose + a single Zstd level-1 pass per plane, no per-stream probing) so the fixed overhead per chunk stays tiny and the thread pool stays saturated. It is fully self-contained and lossless by construction.
Warning
Chunking forfeits cross-chunk redundancy. Splitting the buffer caps
Zstd’s match horizon at one chunk, so on structured / duplicated weights
(tied embeddings, repeated layers, EMA/optimizer copies) this path gives up
the order-of-magnitude long-distance-matching win that the default
whole-plane z4ai.compress() keeps - the very thing that beats ZipNN by
tens of x on real checkpoints. Use this chunked path for throughput on
i.i.d.-like weights where the mantissa is the bulk and there is little
long-range structure; use the default codec when ratio is what matters.
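The cost described in this warning can be reproduced at toy scale. The sketch uses zlib purely as a stand-in, with sizes shrunk to fit its 32 KiB window (Zstd's long-distance matching reaches much further); the buffer is one random 8 KiB block followed by an exact copy, a miniature of tied or duplicated weights.

```python
import random
import zlib

rng = random.Random(0)
block = bytes(rng.getrandbits(8) for _ in range(8192))
buf = block + block  # second half duplicates the first

# Whole-buffer pass: the second copy is encoded as back-references to the first.
whole = zlib.compress(buf, 6)

# Chunked pass: each chunk is compressed on its own and cannot see the other,
# so the duplicate has to be stored again in full.
chunk_size = 8192
chunked = [zlib.compress(buf[i:i + chunk_size], 6) for i in range(0, len(buf), chunk_size)]
chunked_total = sum(len(c) for c in chunked)
```

The whole-buffer result is roughly the size of one block, while the chunked total is roughly two. On buffers without such duplication the two results converge, which is the regime the chunked path is meant for.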
Frame layout (all integers little-endian):
    magic       8 bytes   b"Z4AIMF01"
    flags       u8        reserved (0)
    width       u8        element width used for the transpose
    n_chunks    u32
    per chunk:  u32 elems, u8 tail_len, tail bytes, then
                width * (u8 method, u32 comp_len, comp_bytes)
method is 0 = stored raw, 1 = Zstd. Plane order is byte-plane order, i.e.
plane j holds byte j of every element.
- z4ai.chunked.compress(data, *, dtype=None, level=None, threads=0, chunk_size=2097152)
Compress data with the chunked, multi-threaded pipeline.
threads controls concurrent chunk workers (0 = all cores); other args mirror z4ai.compress(). Returns one Z4AIMF01 frame.