API reference

Core codec

z4ai.compress(data, *, dtype=None, level=None, threads=0, auto=True, effort=None)[source]

Compress a buffer (or NumPy array) into a self-describing z4ai frame.

Parameters:
  • data (bytes | bytearray | memoryview | object) – raw bytes/bytearray/memoryview, or a numpy.ndarray (dtype and shape are then inferred and round-tripped).

  • dtype (str | None) – element type code when data is a raw buffer, e.g. "bf16", "fp16", "fp32", "fp64". None treats the buffer as opaque bytes (still compressed, no float-plane split). Ignored when data is a NumPy array.

  • level (int | None) – Zstd level (1..22). None uses the codec default.

  • threads (int) – 0 = auto (all cores), 1 = single-threaded, n = n workers.

  • auto (bool) – when True (default), also try plain whole-buffer Zstd, the byte-plane transpose, and the sparse zero-aware path, keeping the smallest frame. Set False to force the dtype’s primary transform only (fastest compress; skips the best-of passes).

  • effort (str | None) – ratio/speed tier. None (default) uses the fast streaming codec: multi-GB/s throughput, with a ratio that ties or beats ZipNN. "max" selects the high-ratio AUTO codec, which adds a context-modeling backend (brotli, chunk-parallel) on top of the field split to produce the smallest possible file. It is the right choice for model distribution (compress once, download many): measured 3% to 12% smaller than ZipNN on real transformers, at roughly 3-12 MB/s offline compression and about 1 GB/s decompression. "balanced" is an intermediate tier between the two. See README.md for the comparison table.

Returns:

A bytes frame whose magic is b"Z4AI" for the default codec, or b"ZARA" for effort="max"/"balanced". decompress() accepts either; decompression needs no side information.

Return type:

bytes
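
The byte-plane transpose and the best-of selection described under auto can be illustrated with the standard library alone. The following is a minimal sketch, not z4ai's implementation: zlib stands in for Zstd, and the helper names (split_planes, join_planes, best_of, restore) and the one-byte tag are hypothetical.

```python
import struct
import zlib


def split_planes(buf: bytes, width: int) -> bytes:
    """Byte-plane transpose: plane j holds byte j of every element."""
    if len(buf) % width:
        raise ValueError("buffer length must be a multiple of the element width")
    return b"".join(buf[j::width] for j in range(width))


def join_planes(planes: bytes, width: int) -> bytes:
    """Inverse of split_planes."""
    n = len(planes) // width
    out = bytearray(len(planes))
    for j in range(width):
        out[j::width] = planes[j * n:(j + 1) * n]
    return bytes(out)


def best_of(buf: bytes, width: int) -> bytes:
    """Try plain and plane-split compression; keep the smaller, tagged result."""
    plain = b"\x00" + zlib.compress(buf, 6)                       # hypothetical tag 0
    split = b"\x01" + zlib.compress(split_planes(buf, width), 6)  # hypothetical tag 1
    return min(plain, split, key=len)


def restore(frame: bytes, width: int) -> bytes:
    """Undo best_of using the tag byte."""
    body = zlib.decompress(frame[1:])
    return join_planes(body, width) if frame[0] == 1 else body


# Slowly varying fp32 values: sign/exponent bytes repeat, mantissa bytes do not.
values = [1.0 + i * 1e-4 for i in range(4096)]
raw = struct.pack(f"<{len(values)}f", *values)
frame = best_of(raw, width=4)
```

Because the candidate list always includes the plain whole-buffer encoding, the selected frame is never more than the tag byte larger than plain compression, which is the property the auto description relies on.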

z4ai.decompress(blob, *, out=None)[source]

Decompress a z4ai frame back to the original buffer (byte-identical).

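Parameters:
  • blob – a frame produced by compress().

  • out – optional; when provided, it is returned instead of a new bytes object.
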
Returns:

The original bytes (or out if provided).

z4ai.compress_array(arr, **kwargs)

Compress a NumPy array, recording dtype and shape for exact restore.

Parameters:

arr (ndarray)

Return type:

bytes

z4ai.decompress_array(blob)

Decompress a frame created by compress_array() into an array.

Parameters:

blob (bytes)

Return type:

ndarray

z4ai.peek_is_z4ai(blob)[source]

Return True if blob starts with the z4ai magic.

Parameters:

blob (bytes)

Return type:

bool
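
Several frame families are named in this reference, and one of them shares a prefix with another: the chunked magic b"Z4AIMF01" begins with b"Z4AI". The sketch below shows a prefix dispatcher that tests the longer magic first. It is an illustration only: sniff_frame and the labels are hypothetical, it assumes that Z4DM and ZARA frames begin with those four bytes, and it makes no claim about how peek_is_z4ai() itself treats chunked or ZARA frames.

```python
from typing import Optional

# Magic values as named in this reference; the labels are hypothetical.
MAGICS = [
    (b"Z4AIMF01", "chunked"),            # must precede b"Z4AI": it shares the prefix
    (b"Z4AI", "core"),
    (b"ZARA", "core-high-ratio"),
    (b"Z4DM", "model-delta"),            # assumed to be a 4-byte leading magic
    (b"ZSTN", "safetensors-container"),
]


def sniff_frame(blob) -> Optional[str]:
    """Return a label for the frame family, or None if no magic matches."""
    head = bytes(blob[:8])
    for magic, label in MAGICS:
        if head.startswith(magic):
            return label
    return None
```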

z4ai.dtype_width(dtype)[source]

Return the byte-plane width for a dtype code (see DTYPE_WIDTHS).

Parameters:

dtype (str | None)

Return type:

int

Checkpoint deltas

z4ai.compress_delta(current, reference, *, dtype=None, level=None, threads=0)[source]

Losslessly compress current as a delta against a reference checkpoint.

Consecutive training/fine-tuning checkpoints differ only slightly; storing the bit-exact XOR delta (current ^ reference) makes most bytes zero and compresses far below the from-scratch float-entropy ceiling. Decompress with decompress_delta() and the same reference. See z4ai.delta.
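
The XOR-delta idea can be sketched with the standard library. This is an illustration under stated assumptions, not the z4ai.delta implementation: zlib stands in for the real codec, and the function names and the one-byte "D"/"F" tag are hypothetical.

```python
import os
import zlib


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bit-exact XOR of two equal-length buffers."""
    if len(a) != len(b):
        raise ValueError("buffers must have the same length")
    x = int.from_bytes(a, "little") ^ int.from_bytes(b, "little")
    return x.to_bytes(len(a), "little")


def delta_compress(current: bytes, reference: bytes) -> bytes:
    """Keep the smaller of the XOR-delta and the from-scratch encoding."""
    delta = b"D" + zlib.compress(xor_bytes(current, reference), 6)
    full = b"F" + zlib.compress(current, 6)
    return min(delta, full, key=len)


def delta_decompress(frame: bytes, reference: bytes) -> bytes:
    """Invert delta_compress; XOR is its own inverse."""
    body = zlib.decompress(frame[1:])
    return xor_bytes(body, reference) if frame[:1] == b"D" else body


reference = os.urandom(1 << 16)            # incompressible stand-in for weights
patched = bytearray(reference)
patched[100:104] = b"\x00\x01\x02\x03"     # a small update
current = bytes(patched)
frame = delta_compress(current, reference)
```

The random reference is incompressible on its own, yet the delta frame is tiny because the XOR is zero everywhere except the four patched bytes.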

Parameters:
  • dtype (str | None)

  • level (int | None)

  • threads (int)

Return type:

bytes

z4ai.decompress_delta(blob, reference, *, out=None)[source]

Reconstruct a checkpoint from a delta frame and its reference (see compress_delta()).

Model-aware deltas (.safetensors)

Whole-checkpoint delta compression, aligned tensor-by-tensor by name.

z4ai.delta computes a delta between two byte buffers that are already positionally aligned. Real .safetensors checkpoints are not: a fine-tune may add, drop, rename or re-order tensors relative to its base, which shifts every following byte and leaves a raw whole-buffer XOR with nothing aligned (high entropy, no gain). This module instead computes deltas at tensor granularity: it matches each tensor in the target to the base tensor of the same name and shape, so unchanged and slightly-changed tensors delta cleanly regardless of how the surrounding layout moved.

This is the lossless lever that matters most in production. The bulk of stored checkpoints are derived: a fine-tune of a base model, the next training step, an RLHF / DPO iteration, an EMA copy, or a plain re-upload. Their weights are mostly bit-identical to a reference, so the redundancy lives between files and is invisible to any single-file codec (z4ai included) or to ZipNN. Measured on synthetic fine-tunes, a checkpoint with ~2% of its weights changed is stored ~25x smaller as a name-aligned delta than when compressed from scratch; with ~10% changed, ~6x smaller; and a plain re-upload shrinks without bound, because every tensor becomes a zero-cost COPY. See the low-precision lossless literature (arXiv:2508.19263) for the same effect on real BF16 training-checkpoint sequences.

Each target tensor is stored as one of:

  • COPY – byte-identical to the base tensor of the same name; nothing stored.

  • DELTA – a z4ai.delta frame of target_tensor XOR base_tensor (kept only when it beats a from-scratch frame; delta.compress itself picks the smaller of XOR-delta vs full, so this never loses to plain compression).

  • FULL – a standalone z4ai.compress() frame, for tensors with no shape/name match in the base (new or reshaped tensors).

Reconstruction is bit-exact: the target’s original safetensors JSON header is stored verbatim, every tensor is restored to its exact byte span, and a digest of the base data buffer is recorded so decoding against the wrong base fails loudly instead of silently corrupting weights. The base checkpoint is supplied by the caller at both ends (you already have it on disk when you write the derived one).
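
The per-tensor decision between COPY, DELTA and FULL can be sketched as follows. This is a simplified illustration, not the module's code: zlib stands in for the z4ai codecs, the tensors are passed as plain dicts rather than parsed from a safetensors buffer, SHA-256 is an assumed choice of digest, and plan_tensor and base_digest are hypothetical names.

```python
import hashlib
import os
import zlib


def xor_bytes(a: bytes, b: bytes) -> bytes:
    x = int.from_bytes(a, "little") ^ int.from_bytes(b, "little")
    return x.to_bytes(len(a), "little")


def plan_tensor(name, target, base):
    """Return (kind, payload) for one target tensor.

    `target` and `base` map name -> (dtype, shape, raw_bytes).
    """
    dtype, shape, data = target[name]
    match = base.get(name)
    if match is None or match[:2] != (dtype, shape):
        return "FULL", zlib.compress(data)      # new or reshaped tensor
    if match[2] == data:
        return "COPY", b""                      # byte-identical: store nothing
    delta = zlib.compress(xor_bytes(data, match[2]))
    full = zlib.compress(data)
    return ("DELTA", delta) if len(delta) < len(full) else ("FULL", full)


def base_digest(base_buffer: bytes) -> str:
    """Digest recorded at encode time and re-checked before decoding (assumed SHA-256)."""
    return hashlib.sha256(base_buffer).hexdigest()


w = os.urandom(4096)
base = {
    "embed.weight": ("F32", (32, 32), w),
    "head.weight": ("F32", (1024,), w[::-1]),
}
changed = bytearray(w[::-1])
changed[0] ^= 0x01                              # flip one bit of one weight
target = {
    "embed.weight": base["embed.weight"],                   # unchanged
    "head.weight": ("F32", (1024,), bytes(changed)),        # slightly changed
    "lora.weight": ("F32", (1,), b"\x00\x00\x80\x3f"),      # not in the base
}
plan = {name: plan_tensor(name, target, base)[0] for name in target}
```

A decoder following this scheme would compare base_digest() of the supplied base against the recorded value before applying any DELTA or COPY entry.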

Example:

>>> from z4ai import model_delta
>>> blob = model_delta.compress(finetuned_safetensors, base_safetensors)
>>> restored = model_delta.decompress(blob, base_safetensors)
>>> assert restored == finetuned_safetensors   # byte-identical

z4ai.model_delta.peek_is_model_delta(blob)[source]

True if blob is a z4ai whole-checkpoint delta frame.

Return type:

bool

z4ai.model_delta.compress(target, base, *, level=None, threads=0)[source]

Compress safetensors target as a name-aligned delta against base.

Both arguments are raw .safetensors buffers (bytes-like). Returns a self-describing Z4DM frame that decompress() turns back into the byte-identical target given the same base. The result is never larger than compressing target standalone: every tensor independently keeps the smaller of its delta and from-scratch encoding.

Parameters:
  • level (int | None)

  • threads (int)

Return type:

bytes

z4ai.model_delta.decompress(blob, base)[source]

Reconstruct the byte-identical safetensors target from a Z4DM frame and the same base buffer used to build it (see compress()).

Return type:

bytes

z4ai.model_delta.compress_file(target_path, base_path, out_path, *, level=None, threads=0)[source]

Delta-compress the safetensors at target_path against base_path.

Writes a Z4DM frame to out_path and returns a small stats dict (original, stored, ratio). Reconstruct with decompress_file() and the same base file.

Parameters:
  • target_path (str)

  • base_path (str)

  • out_path (str)

  • level (int | None)

  • threads (int)

Return type:

dict

z4ai.model_delta.decompress_file(delta_path, base_path, out_path)[source]

Reconstruct the byte-identical safetensors target from a Z4DM file and its base file (inverse of compress_file()).

Parameters:
  • delta_path (str)

  • base_path (str)

  • out_path (str)

Return type:

dict

safetensors container (random access)

Apply the z4ai codec to .safetensors checkpoints, tensor by tensor.

This is the production use case ZipNN is built around (zipnn_safetensors / zipnn_hf): take a Hugging Face / vLLM .safetensors checkpoint and shrink it losslessly. The z4ai codec is applied to each tensor independently with its per-dtype byte-plane + long-distance-matching pipeline, and the results are packed into a self-describing ZSTN container - the codec’s on-disk format for a whole checkpoint. It uses the tensor’s declared dtype so BF16/FP16/FP32 tensors take the model-weight fast path while everything else is compressed opaquely. The round-trip is byte-identical to the original file.

Why per-tensor (not whole-file Zstd)? Because z4ai’s exponent-plane separation only helps when applied to homogeneous float data; mixing tensors of different dtypes into one stream defeats it. Per-tensor also means z4ai’s long-distance matching can dedup within a large tensor (tied embeddings, repeated blocks) - exactly where z4ai beats ZipNN’s 256 KiB chunking.

Random access - the container property that matters for distribution

A ZSTN v3 container carries a tensor index and a fixed-size footer, so a reader can locate and decompress one tensor without touching the rest of the file (ZstnReader). This preserves the single most valuable property of safetensors - per-tensor lazy loading / partial reads - which a plain “compress the whole file” approach destroys. Tied weights are stored once and referenced, and the reader resolves the reference transparently.

No third-party safetensors package is required: the format is a tiny, well-specified container that we parse directly.

safetensors v1 source format:

8 bytes      little-endian u64 = header length N
N bytes      UTF-8 JSON header: { name: {dtype, shape, data_offsets:[b,e]},
             optional "__metadata__": {str: str} }
rest         the data buffer; tensor `name` occupies buffer[b:e]
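
The source format above is small enough to parse with the standard library. The sketch below is illustrative rather than the module's own parser; parse_safetensors and tensor_bytes are hypothetical helper names, and the one-tensor file is built by hand purely to exercise them.

```python
import json
import struct


def parse_safetensors(buf: bytes):
    """Split a safetensors buffer into (header_dict, data_buffer)."""
    (n,) = struct.unpack_from("<Q", buf, 0)       # little-endian u64 header length
    header = json.loads(buf[8:8 + n].decode("utf-8"))
    return header, memoryview(buf)[8 + n:]


def tensor_bytes(header: dict, data, name: str) -> bytes:
    """Return the raw bytes of one tensor."""
    b, e = header[name]["data_offsets"]           # relative to the data buffer
    return bytes(data[b:e])


# Build a tiny one-tensor file to exercise the parser.
payload = struct.pack("<2f", 1.0, 2.0)
hdr = json.dumps({
    "w": {"dtype": "F32", "shape": [2], "data_offsets": [0, len(payload)]},
    "__metadata__": {"format": "pt"},
}).encode("utf-8")
blob = struct.pack("<Q", len(hdr)) + hdr + payload

header, data = parse_safetensors(blob)
```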

ZSTN v3 container layout:

"ZSTN"                         4-byte magic
u8  version (= 3)
u8  contiguous flag (1 = per-tensor index present, 0 = single opaque frame)
u64 hlen ; <hlen> bytes        original safetensors JSON header, verbatim
u64 data_len                   length of the original data buffer
-- if not contiguous --
    u64 frame_len ; <frame_len> bytes   one opaque z4ai frame of the buffer
-- if contiguous --
    <frames section>           unique tensor frames, back to back
    <index section>            starts at index_offset (see footer):
        u32 n_tensors
        per tensor (emission order):
            u16 name_len ; <name>
            u64 dest_begin ; u64 dest_end      destination in the data buffer
            u8  kind (0 = FRAME, 1 = REF)
            if FRAME: u64 frame_offset ; u64 frame_len   (absolute in file)
            if REF:   u32 ref_index                      (earlier FRAME entry)
    <footer> (fixed 20 bytes, always last):
        u64 index_offset ; u64 index_len ; "ZSTN"
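
Reading the index through the fixed footer might look like the sketch below. It assumes little-endian integers and UTF-8 names, neither of which the layout above states explicitly, and it operates on a synthetic buffer rather than a real ZSTN file. pack_index and read_index are hypothetical helpers, not part of z4ai.safetensors.

```python
import struct

FOOTER = struct.Struct("<QQ4s")   # index_offset, index_len, magic (20 bytes)


def pack_index(entries) -> bytes:
    """Serialize index entries in the per-tensor layout shown above."""
    out = [struct.pack("<I", len(entries))]
    for e in entries:
        name = e["name"].encode("utf-8")
        out.append(struct.pack("<H", len(name)) + name)
        out.append(struct.pack("<QQ", e["dest_begin"], e["dest_end"]))
        if "ref_index" in e:
            out.append(struct.pack("<BI", 1, e["ref_index"]))                      # REF
        else:
            out.append(struct.pack("<BQQ", 0, e["frame_offset"], e["frame_len"]))  # FRAME
    return b"".join(out)


def read_index(buf: bytes):
    """Locate the index via the footer and decode every entry."""
    index_offset, index_len, magic = FOOTER.unpack_from(buf, len(buf) - FOOTER.size)
    if magic != b"ZSTN":
        raise ValueError("bad footer magic")
    pos = index_offset
    (n,) = struct.unpack_from("<I", buf, pos)
    pos += 4
    entries = []
    for _ in range(n):
        (name_len,) = struct.unpack_from("<H", buf, pos)
        pos += 2
        name = buf[pos:pos + name_len].decode("utf-8")
        pos += name_len
        begin, end, kind = struct.unpack_from("<QQB", buf, pos)
        pos += 17
        entry = {"name": name, "dest_begin": begin, "dest_end": end}
        if kind == 0:
            entry["frame_offset"], entry["frame_len"] = struct.unpack_from("<QQ", buf, pos)
            pos += 16
        else:
            (entry["ref_index"],) = struct.unpack_from("<I", buf, pos)
            pos += 4
        entries.append(entry)
    if pos - index_offset != index_len:
        raise ValueError("index length mismatch")
    return entries


entries = [
    {"name": "embed.weight", "dest_begin": 0, "dest_end": 16,
     "frame_offset": 30, "frame_len": 10},
    {"name": "lm_head.weight", "dest_begin": 16, "dest_end": 32, "ref_index": 0},
]
prefix = b"\x00" * 64                      # stands in for the header and frames section
index = pack_index(entries)
buf = prefix + index + FOOTER.pack(len(prefix), len(index), b"ZSTN")
```

Only the last 20 bytes and the index need to be read before a single frame can be located, which is what makes seek-based access to one tensor possible.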

Public API

class z4ai.safetensors.ZstnReader(*, fh=None, buf=None, owns_file=False)[source]

Random-access reader for a ZSTN v3 container.

Reads only the header and tensor index up front; each read_raw() / read_numpy() seeks to a single tensor’s frame and decompresses just that tensor - the rest of the file is never touched. This is the lazy / partial-load path that makes ZSTN a drop-in for safetensors’ per-tensor access rather than a whole-file blob.

Open from a path (recommended for large checkpoints - uses seek so only the needed frames are read) or from an in-memory buffer:

with ZstnReader.open("model.zstn") as r:
    w = r.read_numpy("model.embed_tokens.weight")

r = ZstnReader.from_bytes(frame)

Parameters:
  • fh (Optional[BinaryIO])

  • buf (Optional[bytes])

  • owns_file (bool)

classmethod open(path)[source]

Open a .zstn file for lazy, seek-based random access.

Parameters:

path (str)

Return type:

ZstnReader

classmethod from_bytes(frame)[source]

Open an in-memory ZSTN buffer.

Parameters:

frame (bytes | bytearray | memoryview)

Return type:

ZstnReader

names()[source]

Tensor names, in data-buffer order.

Return type:

List[str]

header()[source]

The original safetensors JSON header (dtype/shape/offsets, metadata).

Return type:

dict

metadata()[source]

The __metadata__ sub-dict, or empty if absent.

Return type:

Dict[str, str]

read_raw(name)[source]

Decompress and return the raw bytes of a single tensor.

Only that tensor’s frame is read from disk; tied-weight references are resolved to the original frame transparently.

Parameters:

name (str)

Return type:

bytes

frame_size(name)[source]

Compressed (on-disk) size of a tensor’s frame, in bytes.

Parameters:

name (str)

Return type:

int

read_numpy(name)[source]

Return a tensor as a NumPy array with its declared dtype and shape.

Raises TypeError for dtypes NumPy cannot represent natively (bf16, fp8) - use read_raw() for those.

Parameters:

name (str)
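
For bf16 tensors, the raw bytes can be widened to float32 without any third-party package, because a bfloat16 value is the upper 16 bits of an IEEE-754 float32. The sketch below assumes the bytes are little-endian, as in the safetensors data buffer; bf16_to_floats is a hypothetical helper, and the input here is hand-written rather than obtained from read_raw().

```python
import struct


def bf16_to_floats(raw: bytes) -> list:
    """Widen little-endian bfloat16 values to Python floats, exactly."""
    if len(raw) % 2:
        raise ValueError("bf16 buffer length must be even")
    n = len(raw) // 2
    halves = struct.unpack(f"<{n}H", raw)
    # Shift each 16-bit pattern into the top half of a 32-bit float pattern.
    widened = struct.pack(f"<{n}I", *(h << 16 for h in halves))
    return list(struct.unpack(f"<{n}f", widened))


raw = bytes.fromhex("803f00c00000")   # bit patterns 0x3F80, 0xC000, 0x0000
values = bf16_to_floats(raw)          # [1.0, -2.0, 0.0]
```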

z4ai.safetensors.compress_bytes(blob, *, level=None, threads=0, workers=None)[source]

Compress a safetensors buffer; returns a self-describing ZSTN frame.

The original JSON header is preserved verbatim so reconstruction is exactly byte-identical (including key order and whitespace). Each tensor’s bytes are compressed with z4ai.compress() using the tensor’s dtype, and a tensor index + footer are appended so ZstnReader can read one tensor without decompressing the others.

If the tensors do not contiguously cover the data buffer (unexpected for valid safetensors), the whole data section is compressed opaquely as a single stream - still byte-exact, just without per-tensor random access.
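
One way to express the contiguity condition is to sort the tensor spans and check that they tile the data buffer exactly. This is a sketch of the idea only; the precise rule z4ai applies is not specified here, and covers_contiguously is a hypothetical name.

```python
def covers_contiguously(header: dict, data_len: int) -> bool:
    """True if the tensor spans tile [0, data_len) with no gaps or overlaps."""
    spans = sorted(
        tuple(info["data_offsets"])
        for name, info in header.items()
        if name != "__metadata__"          # metadata carries no data span
    )
    pos = 0
    for begin, end in spans:
        if begin != pos or end < begin:
            return False
        pos = end
    return pos == data_len


ok = {
    "a": {"dtype": "F32", "shape": [2], "data_offsets": [0, 8]},
    "b": {"dtype": "F32", "shape": [4], "data_offsets": [8, 24]},
    "__metadata__": {"format": "pt"},
}
gap = {
    "a": {"dtype": "F32", "shape": [2], "data_offsets": [0, 8]},
    "b": {"dtype": "F32", "shape": [2], "data_offsets": [16, 24]},
}
```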

Parameters:
  • level (int | None)

  • threads (int)

  • workers (int | None)

Return type:

bytes

z4ai.safetensors.decompress_bytes(frame, *, workers=None)[source]

Invert compress_bytes(); returns the byte-identical safetensors buffer.

Parameters:
  • workers (int | None)

Return type:

bytes

z4ai.safetensors.compress_file(in_path, out_path, *, level=None, threads=0, workers=None)[source]

Compress a .safetensors file to a ZSTN file. Returns stats.

Parameters:
  • in_path (str)

  • out_path (str)

  • level (int | None)

  • threads (int)

  • workers (int | None)

Return type:

dict

z4ai.safetensors.decompress_file(in_path, out_path, *, workers=None)[source]

Decompress a ZSTN file back to a byte-identical .safetensors file.

Parameters:
  • in_path (str)

  • out_path (str)

  • workers (int | None)

Return type:

dict

High-throughput chunked codec

Chunked, multi-threaded codec - the high-throughput path.

ZipNN reaches several GB/s by fusing a native byte-split with a multi-threaded Zstd pass over chunks. Profiling (FINDINGS-speed.md) showed our throughput ceiling is single-threaded Zstd, not the NumPy split - and crucially that both NumPy and Zstandard release the GIL. So we can parallelize a pure Python pipeline: cut the buffer into element-aligned chunks and run, on each core, an independent split -> Zstd pass.

This module keeps a deliberately lean per-chunk path (direct byte-plane transpose + a single Zstd level-1 pass per plane, no per-stream probing) so the fixed overhead per chunk stays tiny and the thread pool stays saturated. It is fully self-contained and lossless by construction.
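
The chunk-and-fan-out structure can be sketched with a thread pool from the standard library. This is not the module's code: zlib at level 1 stands in for Zstd level 1 (CPython's zlib also releases the GIL while compressing), the result is a plain Python list rather than a Z4AIMF01 frame, and all function names are hypothetical.

```python
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor


def element_aligned_chunks(buf, width, chunk_size):
    """Yield views whose length is a multiple of `width`, except possibly the last."""
    step = max(width, chunk_size - chunk_size % width)
    view = memoryview(buf)
    for start in range(0, len(view), step):
        yield view[start:start + step]


def compress_chunk(chunk, width):
    """Independent split -> compress pass: one stream per byte plane."""
    usable = len(chunk) - len(chunk) % width
    planes = [zlib.compress(bytes(chunk[j:usable:width]), 1) for j in range(width)]
    return planes, bytes(chunk[usable:])       # leftover bytes are kept raw


def decompress_chunk(planes, tail, width):
    """Re-interleave the byte planes and append the raw tail."""
    cols = [zlib.decompress(p) for p in planes]
    out = bytearray(len(cols[0]) * width)
    for j, col in enumerate(cols):
        out[j::width] = col
    return bytes(out) + tail


def compress_parallel(buf, width, chunk_size=1 << 16, threads=4):
    """Compress every chunk on a worker thread, preserving chunk order."""
    chunks = list(element_aligned_chunks(buf, width, chunk_size))
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(compress_chunk, chunks, [width] * len(chunks)))


values = [0.5 + i * 1e-3 for i in range(50_000)]
data = struct.pack("<50000f", *values) + b"\x01\x02\x03"   # 3 stray tail bytes
frames = compress_parallel(data, width=4)
restored = b"".join(decompress_chunk(p, t, 4) for p, t in frames)
```

Each chunk is compressed without reference to any other, which keeps the workers independent and is also why no match can span a chunk boundary.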

Warning

Chunking forfeits cross-chunk redundancy. Splitting the buffer caps Zstd’s match horizon at one chunk, so on structured or duplicated weights (tied embeddings, repeated layers, EMA/optimizer copies) this path gives up the order-of-magnitude long-distance-matching win that the default whole-plane z4ai.compress() keeps, which is what beats ZipNN by tens of times on real checkpoints. Use this chunked path for throughput on i.i.d.-like weights, where the mantissa makes up the bulk of the data and there is little long-range structure; use the default codec when ratio is what matters.

Frame layout (all integers little-endian):

magic       8 bytes   b"Z4AIMF01"
flags       u8        reserved (0)
width       u8        element width used for the transpose
n_chunks    u32
per chunk:  u32 elems, u8 tail_len, tail bytes, then
            width * (u8 method, u32 comp_len, comp_bytes)

method is 0 = stored raw, 1 = Zstd. Plane order is byte-plane order, i.e. plane j holds byte j of every element.

z4ai.chunked.compress(data, *, dtype=None, level=None, threads=0, chunk_size=2097152)[source]

Compress data with the chunked, multi-threaded pipeline.

threads controls concurrent chunk workers (0 = all cores); other args mirror z4ai.compress(). Returns one Z4AIMF01 frame.

Parameters:
  • dtype (str | None)

  • level (int | None)

  • threads (int)

  • chunk_size (int)

Return type:

bytes

z4ai.chunked.decompress(blob, *, threads=0)[source]

Inverse of compress(); reconstructs the exact original buffer.

Parameters:
  • threads (int)

Return type:

bytes