Usage
All examples are lossless and round-trip byte-for-byte.
Buffers and dtypes
dtype tells z4ai the element width so that it can split the float planes.
Supported codes include bf16, fp16, fp32, fp64, and integer types (int8/uint8
… int64/uint64). dtype=None treats the input as opaque bytes: it is still
compressed, but without the float-plane split.
import z4ai
blob = z4ai.compress(buffer, dtype="fp32")
data = z4ai.decompress(blob) # byte-identical to `buffer`
# Decode into a pre-allocated buffer (no extra allocation):
out = bytearray(len(data))
z4ai.decompress(blob, out=out)
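The float-plane split mentioned above can be pictured with a small sketch. This is not z4ai's implementation: `split_planes` and `merge_planes` are hypothetical helpers, zlib stands in for the real entropy coder, and the sketch assumes that "plane" means "byte k of every element, stored contiguously". It only shows why the element width matters and that the regrouping is reversible.

```python
import zlib

import numpy as np


def split_planes(raw: bytes, width: int) -> bytes:
    """Regroup bytes so that byte k of every element is stored contiguously."""
    if len(raw) % width:
        raise ValueError("buffer length must be a multiple of the element width")
    elements = np.frombuffer(raw, dtype=np.uint8).reshape(-1, width)
    return elements.T.tobytes()  # plane 0, then plane 1, ...


def merge_planes(planes: bytes, width: int) -> bytes:
    """Inverse of split_planes: interleave the planes back into elements."""
    stacked = np.frombuffer(planes, dtype=np.uint8).reshape(width, -1)
    return stacked.T.tobytes()


rng = np.random.default_rng(0)
raw = rng.standard_normal(4096).astype(np.float32).tobytes()

planes = split_planes(raw, width=4)
assert merge_planes(planes, width=4) == raw  # the regrouping is lossless

# Compare a generic compressor on the original and on the regrouped bytes.
print(len(zlib.compress(raw, 6)), len(zlib.compress(planes, 6)))
```

With the wrong width the planes would mix unrelated bytes, which is one plausible reason the API asks for a dtype instead of guessing it.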
NumPy arrays (dtype + shape preserved)
compress_array records the array’s dtype and shape, so the decode restores an
array identical in value, dtype, and shape — no need to pass dtype yourself.
import numpy as np
import z4ai
w = np.random.randn(1024, 1024).astype(np.float16)
blob = z4ai.compress_array(w)
back = z4ai.decompress_array(blob)
assert np.array_equal(w, back) and back.dtype == w.dtype and back.shape == w.shape
Effort tiers: speed vs size
effort is a single dial trading decode speed against file size. All three tiers
are lossless and decode through the same z4ai.decompress().
# fastest to decode (serve / load-often paths)
blob = z4ai.compress(buffer, dtype="bf16", effort="fast")
# balanced default
blob = z4ai.compress(buffer, dtype="bf16")
# smallest file (offline encode; for the artifact you upload)
blob = z4ai.compress(buffer, dtype="bf16", effort="max")
effort="max" adds a chunk-parallel context-modeling backend (brotli) and keeps
whichever output is smaller, the brotli one or the default one, so the result
is never larger than the default. The backend is optional: install it with
pip install "z4ai[maxratio]". Without brotli, max falls back to the default.
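The "keep the smaller output, fall back if the extra backend is missing" pattern can be sketched as follows. This is an illustration under assumptions, not z4ai's frame format: zlib stands in for the default backend, the one-byte tag and the names `compress_max` and `decompress_any` are hypothetical, and brotli is imported only if it happens to be installed.

```python
import zlib

try:
    import brotli  # optional backend; the sketch still works without it
except ImportError:
    brotli = None

TAG_DEFAULT, TAG_BROTLI = b"\x00", b"\x01"  # hypothetical one-byte backend tags


def compress_max(data: bytes) -> bytes:
    """Encode with both backends when possible and keep the smaller result."""
    best = TAG_DEFAULT + zlib.compress(data, 6)
    if brotli is not None:
        candidate = TAG_BROTLI + brotli.compress(data)
        if len(candidate) < len(best):
            best = candidate
    return best


def decompress_any(blob: bytes) -> bytes:
    """A single decoder handles both cases by dispatching on the tag."""
    tag, body = blob[:1], blob[1:]
    if tag == TAG_DEFAULT:
        return zlib.decompress(body)
    if tag == TAG_BROTLI:
        if brotli is None:
            raise RuntimeError("this blob needs the optional brotli backend")
        return brotli.decompress(body)
    raise ValueError("unknown backend tag")


payload = b"checkpoint bytes " * 10_000
blob = compress_max(payload)
assert decompress_any(blob) == payload
```

Because the default encoding is always one of the candidates, the chosen blob can never be larger than it, which is the guarantee described above.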
Checkpoint deltas (aligned buffers)
When two checkpoints are positionally aligned (same shapes, same order),
compress_delta stores only the bytes that changed.
import z4ai
blob = z4ai.compress_delta(current_ckpt_bytes, reference=prev_ckpt_bytes, dtype="bf16")
restored = z4ai.decompress_delta(blob, reference=prev_ckpt_bytes)
assert bytes(restored) == current_ckpt_bytes # lossless
The reference’s length and a digest are recorded in the frame, so passing the wrong reference is rejected loudly rather than producing a corrupt result.
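A minimal sketch of an aligned delta with a reference check, using only the standard library. It assumes the delta is a byte-wise XOR and that the digest is SHA-256. Neither detail is confirmed for z4ai, and `delta_compress`, `delta_decompress`, and the header layout are hypothetical.

```python
import hashlib
import struct
import zlib

# Hypothetical header: reference length (uint64) + SHA-256 of the reference.
HEADER = struct.Struct("<Q32s")


def delta_compress(current: bytes, reference: bytes) -> bytes:
    if len(current) != len(reference):
        raise ValueError("aligned delta needs buffers of equal length")
    # Unchanged bytes XOR to zero, so the delta is mostly zeros.
    xored = bytes(a ^ b for a, b in zip(current, reference))
    header = HEADER.pack(len(reference), hashlib.sha256(reference).digest())
    return header + zlib.compress(xored, 6)


def delta_decompress(blob: bytes, reference: bytes) -> bytes:
    ref_len, ref_digest = HEADER.unpack_from(blob)
    if ref_len != len(reference) or ref_digest != hashlib.sha256(reference).digest():
        raise ValueError("reference does not match the one used to encode")
    xored = zlib.decompress(blob[HEADER.size:])
    return bytes(a ^ b for a, b in zip(xored, reference))


prev_ckpt = bytes(range(256)) * 64
cur_ckpt = bytearray(prev_ckpt)
cur_ckpt[100] ^= 0xFF  # a single changed byte
cur_ckpt = bytes(cur_ckpt)

blob = delta_compress(cur_ckpt, prev_ckpt)
assert delta_decompress(blob, prev_ckpt) == cur_ckpt
```

Checking the length and digest before decoding is what turns a wrong reference into an immediate error instead of silently corrupted output.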
Model deltas (.safetensors, tensor-aware)
For real .safetensors checkpoints whose layout may have shifted (a fine-tune
that added, dropped, renamed, or re-ordered tensors), z4ai.model_delta aligns
each tensor to the base tensor of the same name and shape before delta-coding
it, so unchanged and slightly changed tensors still delta cleanly regardless of
how the surrounding bytes moved.
from z4ai import model_delta
# in memory (.safetensors bytes)
blob = model_delta.compress(finetuned_bytes, base_bytes)
restored = model_delta.decompress(blob, base_bytes)
assert restored == finetuned_bytes # lossless
# or straight from / to files
stats = model_delta.compress_file("finetuned.safetensors", "base.safetensors",
                                  "finetuned.z4dm")
print(f"{stats['ratio']:.1f}x smaller")
Each tensor is stored as the smallest of three options: a zero-cost COPY
(bit-identical to the base), a per-tensor XOR-delta, or a from-scratch frame.
A delta archive is therefore never larger than compressing the target alone.
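The per-tensor choice described above can be sketched like this. It is a simplified illustration, not the z4dm format: tensors are plain `bytes` keyed by name, zlib stands in for the real codec, and `encode_tensor`, `decode_tensor`, and the mode labels are hypothetical.

```python
import zlib
from typing import Dict, Optional, Tuple


def encode_tensor(target: bytes, base: Optional[bytes]) -> Tuple[str, bytes]:
    """Return (mode, payload), picking the smallest payload available."""
    if base is not None and base == target:
        return "COPY", b""  # bit-identical to the base: nothing to store
    candidates = [("FULL", zlib.compress(target, 6))]  # always available
    if base is not None and len(base) == len(target):
        xored = bytes(a ^ b for a, b in zip(target, base))
        candidates.append(("XOR", zlib.compress(xored, 6)))
    return min(candidates, key=lambda c: len(c[1]))


def decode_tensor(mode: str, payload: bytes, base: Optional[bytes]) -> bytes:
    if mode == "COPY":
        return base
    if mode == "FULL":
        return zlib.decompress(payload)
    if mode == "XOR":
        return bytes(a ^ b for a, b in zip(zlib.decompress(payload), base))
    raise ValueError(f"unknown mode {mode!r}")


base: Dict[str, bytes] = {
    "a": bytes((i * 37 + 11) % 256 for i in range(1000)),
    "b": bytes(range(250)) * 4,
}
# The fine-tune re-orders tensors, tweaks one byte of "a", and adds "c".
tuned: Dict[str, bytes] = {
    "b": base["b"],
    "a": base["a"][:-1] + bytes([base["a"][-1] ^ 1]),
    "c": b"new tensor " * 50,
}

# Tensors are matched by name, so the new ordering does not matter.
encoded = {name: encode_tensor(t, base.get(name)) for name, t in tuned.items()}
for name, (mode, payload) in encoded.items():
    assert decode_tensor(mode, payload, base.get(name)) == tuned[name]
    print(name, mode, len(payload))
```

Since the from-scratch frame is always among the candidates, no tensor costs more than it would without a base.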
safetensors with random access
Compress a .safetensors checkpoint to a smaller .zstn file, and still load a
single tensor lazily without decompressing the rest.
from z4ai.safetensors import compress_file, decompress_file, ZstnReader
stats = compress_file("model.safetensors", "model.zstn")
print(f"{stats['saved_pct']:.0f}% smaller ({stats['ratio']:.2f}x)")
# Restore a byte-identical .safetensors file:
decompress_file("model.zstn", "model.safetensors")
# Or read just the tensors you need, lazily:
with ZstnReader.open("model.zstn") as r:
    print(r.names())  # list tensors, no decode
    w = r.read_numpy("model.embed_tokens.weight")  # decode only this tensor
The ZSTN container carries a per-tensor index and a footer to support
random-access reads, and it deduplicates tied weights so they are stored once.
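How an index plus footer enables single-tensor reads, and how tied weights can share one stored frame, is sketched below. The layout is invented for illustration and is not the ZSTN format: the JSON index, the 8-byte footer, and the helper names are all assumptions, and zlib stands in for the real codec.

```python
import hashlib
import io
import json
import struct
import zlib
from typing import Dict, List

FOOTER = struct.Struct("<Q")  # hypothetical footer: byte offset of the index


def write_container(tensors: Dict[str, bytes]) -> bytes:
    out = io.BytesIO()
    index: Dict[str, List[int]] = {}
    seen: Dict[bytes, List[int]] = {}  # content digest -> [offset, length]
    for name, raw in tensors.items():
        digest = hashlib.sha256(raw).digest()
        if digest not in seen:  # identical tensors are written only once
            frame = zlib.compress(raw, 6)
            seen[digest] = [out.tell(), len(frame)]
            out.write(frame)
        index[name] = seen[digest]
    index_offset = out.tell()
    out.write(json.dumps(index).encode())
    out.write(FOOTER.pack(index_offset))
    return out.getvalue()


def read_index(blob: bytes) -> Dict[str, List[int]]:
    """Locate the index through the footer; no tensor data is decoded."""
    end = len(blob) - FOOTER.size
    (index_offset,) = FOOTER.unpack_from(blob, end)
    return json.loads(blob[index_offset:end])


def read_tensor(blob: bytes, name: str) -> bytes:
    """Decode exactly one frame, found through the index."""
    offset, length = read_index(blob)[name]
    return zlib.decompress(blob[offset:offset + length])


tensors = {
    "embed": b"\x01\x02\x03\x04" * 1000,
    "bias": b"\x00" * 64,
    "lm_head": b"\x01\x02\x03\x04" * 1000,  # tied to "embed"
}
container = write_container(tensors)
print(sorted(read_index(container)))  # names only, nothing decoded
assert read_tensor(container, "lm_head") == tensors["lm_head"]
```

On a real file the reader would seek to the footer and then to the requested frame instead of slicing an in-memory blob, but the lookup steps are the same.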
HuggingFace integration
The optional z4ai.integrations module loads z4ai-compressed weights into
torch / transformers. It needs the hf extra (pip install "z4ai[hf]",
which pulls in torch and safetensors); the core codec needs neither.
Load / save a torch state dict — drop-in for safetensors.torch.load_file /
save_file. A ZSTN file is decompressed in memory; a plain .safetensors file
passes through unchanged.
import z4ai
stats = z4ai.save_file(model.state_dict(), "model.z4ai.safetensors")
print(f"{stats['saved_pct']:.0f}% smaller")
state = z4ai.load_file("model.z4ai.safetensors") # -> {name: torch.Tensor}
model.load_state_dict(state)
Transparent from_pretrained — enable_hf() patches safetensors so
transformers (and vllm) load compressed weights with no other code change.
Compress your weights and name them with a variant suffix
(model.z4ai.safetensors), then load with variant="z4ai". Call enable_hf()
before importing transformers/vllm — they bind safe_open at import time.
from z4ai import enable_hf
enable_hf() # before importing transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("your-org/model", variant="z4ai")
tok = AutoTokenizer.from_pretrained("your-org/model")
This mirrors ZipNN’s zipnn_safetensors(), but z4ai’s strongest cases are the
self-hosted ones — your own object storage or model registry — and the
cross-checkpoint compress_delta mode, which has no from_pretrained equivalent
on either side.
High-throughput native path
For the load-often path, z4ai.chunked is backed by a fused, multithreaded C
codec that runs the whole split -> zstd -> interleave loop in C with the GIL
released. The frame is byte-interoperable with the default path.
import z4ai.chunked as ck
blob = ck.compress(weights, dtype="bf16")
weights_back = ck.decompress(blob) # byte-exact, multithreaded
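The benefit of chunking with the GIL released can be approximated in pure Python, since CPython's zlib also releases the GIL while it works on a buffer. The sketch below is only an analogy for the native path: the chunk size, the length-prefixed framing, and the function names are hypothetical, and zlib replaces the fused split -> zstd -> interleave loop.

```python
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1 << 20  # 1 MiB per chunk, an arbitrary choice for this sketch
U32 = struct.Struct("<I")


def compress_chunked(data: bytes, chunk: int = CHUNK, workers: int = 4) -> bytes:
    pieces = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    # Chunks are independent, so threads can compress them concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        frames = list(pool.map(zlib.compress, pieces))
    out = [U32.pack(len(frames))]
    for frame in frames:
        out.append(U32.pack(len(frame)))  # length prefix, then the frame
        out.append(frame)
    return b"".join(out)


def decompress_chunked(blob: bytes, workers: int = 4) -> bytes:
    (count,) = U32.unpack_from(blob, 0)
    pos = U32.size
    frames = []
    for _ in range(count):
        (length,) = U32.unpack_from(blob, pos)
        pos += U32.size
        frames.append(blob[pos:pos + length])
        pos += length
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so joining restores the original.
        return b"".join(pool.map(zlib.decompress, frames))


weights = bytes(range(256)) * 20_000  # about 5 MB of sample data
packed = compress_chunked(weights)
assert decompress_chunked(packed) == weights
```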