Block Floating Point Formats¶
Block Floating Point (BFP) is a quantization format where a group of numbers shares a common exponent (scale factor), but each number has its own mantissa. This provides a good balance between compression efficiency and hardware simplicity.
Overview¶
What is Block Floating Point?¶
Block Floating Point (BFP) divides data into blocks and applies a shared exponent to all elements within each block. This is simpler than full floating-point but provides better dynamic range than fixed-point quantization.
Key Characteristics:
Shared Exponent: One exponent per block (typically 8 bits)
Individual Mantissas: Each element has its own mantissa (4-16 bits)
Hardware-Efficient: Simpler than full floating-point arithmetic
Good Dynamic Range: Adapts to local data statistics
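The shared-exponent idea can be sketched without any library. The following is a minimal illustration, not Pychop's implementation: it assumes the mantissa width includes the sign, derives a power-of-two step from the largest magnitude in the block, and rounds to nearest. The names `quantize_block` and `dequantize_block` are hypothetical.

```python
import math

def quantize_block(block, mantissa_bits=8):
    """Return (step_exponent, integer mantissas) for one block.

    Assumption: mantissa_bits counts the sign, so mantissas lie in
    [-(2**(mantissa_bits - 1) - 1), 2**(mantissa_bits - 1) - 1].
    """
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return 0, [0] * len(block)
    # frexp writes max_abs as f * 2**exp with 0.5 <= f < 1, so max_abs < 2**exp.
    _, exp = math.frexp(max_abs)
    # One power-of-two step size is shared by every element of the block.
    step_exponent = exp - (mantissa_bits - 1)
    step = math.ldexp(1.0, step_exponent)
    limit = 2 ** (mantissa_bits - 1) - 1
    mantissas = [max(-limit, min(limit, round(v / step))) for v in block]
    return step_exponent, mantissas

def dequantize_block(step_exponent, mantissas):
    """Rebuild floating-point values from the shared exponent and mantissas."""
    step = math.ldexp(1.0, step_exponent)
    return [m * step for m in mantissas]

block = [1.0, 0.5, -0.25, 0.001]
step_exponent, mantissas = quantize_block(block, mantissa_bits=8)
restored = dequantize_block(step_exponent, mantissas)
print(step_exponent, mantissas, restored)
```

In this run the largest element sets the step to 2**-6, so 0.001 rounds to zero: elements much smaller than the block maximum lose precision.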
BFP vs Other Formats:
| Format | Memory | Dynamic Range | Hardware Cost | Best Use Case |
|---|---|---|---|---|
| BFP | Low | Good | Low | Edge devices, Inference |
| FP32 | High | Excellent | High | Research, Training |
| FP16 | Medium | Good | Medium | Training, Inference |
| INT8 | Low | Poor | Low | Inference only |
| MX Formats | Low | Excellent | Medium | Advanced training |
Architecture¶
BFP Structure¶
A BFP block consists of:
┌─────────────────────────────────────────────────┐
│ Block Floating Point Structure │
├─────────────────────────────────────────────────┤
│ Shared Exponent (8 bits) │
├─────────────────────────────────────────────────┤
│ Element 1: Sign (1) + Mantissa (n bits) │
│ Element 2: Sign (1) + Mantissa (n bits) │
│ ... │
│ Element N: Sign (1) + Mantissa (n bits) │
└─────────────────────────────────────────────────┘
Example: BFP8 with block_size=32
1 shared exponent (8 bits)
32 elements × 8 bits each = 256 bits
Total: 264 bits for 32 elements
Compression vs FP16: 512/264 = 1.94x
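The arithmetic in this example generalizes to any format. A small sketch, assuming (as in the example above) that each element occupies exactly `mantissa_bits` bits and that one exponent is stored per block; the function names are illustrative and not part of Pychop:

```python
def bits_per_block(mantissa_bits, block_size, exponent_bits=8):
    """Storage for one block: all element fields plus one shared exponent."""
    return block_size * mantissa_bits + exponent_bits

def compression_ratio(mantissa_bits, block_size, exponent_bits=8, reference_bits=16):
    """Ratio of reference storage (FP16 by default) to BFP storage."""
    reference = block_size * reference_bits
    return reference / bits_per_block(mantissa_bits, block_size, exponent_bits)

# BFP8 with block_size=32, as in the example above
print(bits_per_block(8, 32))               # 264
print(f"{compression_ratio(8, 32):.2f}x")  # 1.94x
```

Passing `reference_bits=32` gives the ratio against FP32 instead.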
Predefined Formats¶
Pychop provides several predefined BFP formats optimized for different use cases:
Standard Formats¶
| Format Name | Mantissa Bits | Block Size | Exponent Bits | Compression vs FP16 | Use Case |
|---|---|---|---|---|---|
| bfp16 | 16 | 16 | 8 | 0.97x | High precision |
| bfp12 | 12 | 16 | 8 | 1.28x | Balanced |
| bfp8 | 8 | 32 | 8 | 1.94x | Recommended default |
| bfp6 | 6 | 32 | 8 | 2.56x | Aggressive compression |
| bfp4 | 4 | 32 | 8 | 3.76x | Ultra-low precision |
Ultra-Low Precision Formats¶
| Format Name | Mantissa Bits | Block Size | Exponent Bits | Compression vs FP16 | Use Case |
|---|---|---|---|---|---|
| bfp3 | 3 | 64 | 8 | 5.12x | Extreme compression |
| bfp2 | 2 | 128 | 8 | 7.76x | Research only |
Intel Flexpoint Compatible¶
| Format Name | Mantissa Bits | Block Size | Exponent Bits | Compression vs FP16 | Notes |
|---|---|---|---|---|---|
| flexpoint16 | 16 | 16 | 5 | 0.98x | Intel compatible |
| flexpoint8 | 8 | 32 | 5 | 1.96x | Intel compatible |
Quick Start¶
Basic Usage¶
import pychop
import numpy as np
# Set backend (auto-detect by default)
pychop.backend('auto')
# Create test data
X = np.random.randn(1024, 768).astype(np.float32)
# Quantize with BFP8
from pychop import bfp_quantize
X_quantized = bfp_quantize(X, format='bfp8')
# Inspect the result: the quantized array keeps the input's shape
print(f"Original: {X.nbytes / 1024:.2f} KB")
print(f"Quantized maintains same shape: {X_quantized.shape}")
Using BFPTensor¶
from pychop import BFPTensor
# Create BFP tensor
bfp = BFPTensor(X, format='bfp8')
# Dequantize
X_reconstructed = bfp.dequantize()
# Get statistics
stats = bfp.statistics()
print(f"Compression: {stats['compression_ratio_fp16']:.2f}x vs FP16")
print(f"Memory saved: {stats['memory_saved_vs_fp16']:.1f}%")
# Compute error
mse = np.mean((X - X_reconstructed) ** 2)
print(f"MSE: {mse:.2e}")
Custom Formats¶
from pychop import create_bfp_spec, bfp_quantize
# Create custom 5-bit BFP format
custom_spec = create_bfp_spec(
    mantissa_bits=5,
    block_size=64,
    exponent_bits=8,
    name="my_bfp5"
)
# Use custom format
X_q = bfp_quantize(X, format=custom_spec)
# Or use tuple shorthand
X_q = bfp_quantize(X, format=(5, 64)) # (mantissa_bits, block_size)
Backend-Specific Usage¶
NumPy Backend¶
Pure NumPy implementation for inference and analysis:
import numpy as np
import pychop
pychop.backend('numpy')
X = np.random.randn(512, 512).astype(np.float32)
X_q = pychop.bfp_quantize(X, format='bfp8')
# Compute reconstruction error
error = np.mean((X - X_q) ** 2)
print(f"MSE: {error:.2e}")
PyTorch Backend (with STE)¶
PyTorch backend with Straight-Through Estimator for Quantization-Aware Training:
import torch
import pychop
pychop.backend('torch')
# Enable gradient tracking
X = torch.randn(128, 768, requires_grad=True)
# Quantize (automatic STE!)
X_q = pychop.bfp_quantize(X, format='bfp8')
# Backward pass - gradients flow through!
loss = X_q.sum()
loss.backward()
print(f"Gradient shape: {X.grad.shape}")
print(f"Gradient norm: {X.grad.norm():.2e}")
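Rounding has zero gradient almost everywhere, so a straight-through estimator replaces the backward pass of the quantizer with the identity. The sketch below shows the idea in plain Python with hand-written forward and backward steps. It is not Pychop's code: `StraightThroughRound` and its fixed `step` are hypothetical simplifications (a BFP quantizer would derive the step per block).

```python
class StraightThroughRound:
    """Round to a fixed grid in forward; pass gradients through unchanged in backward."""

    def __init__(self, step):
        self.step = step

    def forward(self, x):
        # Quantize: snap every value to the nearest multiple of step.
        return [round(v / self.step) * self.step for v in x]

    def backward(self, grad_output):
        # Straight-through: treat the quantizer as the identity function,
        # so the upstream gradient is returned as-is.
        return list(grad_output)

ste = StraightThroughRound(step=0.25)
x = [0.1, 0.4, -0.7]
x_q = ste.forward(x)  # values on the 0.25 grid
# For loss = sum(x_q), d(loss)/d(x_q) is 1 for every element.
grad_x = ste.backward([1.0] * len(x))
print(x_q, grad_x)
```

In PyTorch, the same effect is commonly obtained with the expression `x + (quantize(x) - x).detach()`, where `quantize` stands for any non-differentiable rounding function.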
Using BFP Quantizers in Models:
from pychop.tch.bfp_formats import BFPQuantizerSTE
class QuantizedModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quantizer = BFPQuantizerSTE(format='bfp8')
        self.linear = torch.nn.Linear(768, 3072)
    def forward(self, x):
        x = self.quantizer(x)  # Quantize activations
        return self.linear(x)
model = QuantizedModel()
optimizer = torch.optim.Adam(model.parameters())
# Training loop (dataloader, loss_fn, and target come from your own training setup)
for batch in dataloader:
    optimizer.zero_grad()  # Clear gradients from the previous step
    output = model(batch)
    loss = loss_fn(output, target)
    loss.backward()  # STE handles gradients automatically!
    optimizer.step()
Quantized Layers:
from pychop.tch.bfp_formats import BFPLinear
# Replace standard Linear with BFP quantized version
layer = BFPLinear(
    in_features=768,
    out_features=3072,
    weight_format='bfp8',   # Quantize weights
    quantize_input=True,    # Quantize input activations
    quantize_output=False   # Keep output in FP32
)
x = torch.randn(32, 768)
y = layer(x) # Automatic quantization with STE
Model Conversion:
from pychop.tch.bfp_formats import convert_linear_to_bfp
# Load pretrained model
model = YourModel()
# Convert all Linear layers to BFP
model = convert_linear_to_bfp(
    model,
    format='bfp8',
    quantize_input=True,
    quantize_output=False,
    inplace=True
)
# Fine-tune with quantization (num_epochs and train are placeholders for your own code)
for epoch in range(num_epochs):
    train(model)  # Gradients flow through STE automatically
JAX Backend (with Custom VJP)¶
JAX backend with custom Vector-Jacobian Product for differentiation:
import jax
import jax.numpy as jnp
import pychop
pychop.backend('jax')
# Create data
key = jax.random.PRNGKey(0)
X = jax.random.normal(key, (256, 512))
# Quantize
X_q = pychop.bfp_quantize(X, format='bfp8')
# Test gradient flow
from pychop.jx.bfp_formats import BFPQuantizerSTE
quantizer = BFPQuantizerSTE(format='bfp8')
def loss_fn(x):
    x_q = quantizer(x)
    return jnp.sum(x_q ** 2)
# Compute gradients (custom VJP handles this)
grad_fn = jax.grad(loss_fn)
grads = grad_fn(X)
print(f"Gradient shape: {grads.shape}")
print(f"Gradient norm: {jnp.linalg.norm(grads):.2e}")
Flax Integration:
from flax import linen as nn
from pychop.jx.bfp_formats import BFPDense
class QuantizedMLP(nn.Module):
    features: list
    @nn.compact
    def __call__(self, x):
        for feat in self.features[:-1]:
            x = BFPDense(
                features=feat,
                weight_format='bfp8',
                quantize_input=True
            )(x)
            x = nn.relu(x)
        x = BFPDense(features=self.features[-1])(x)
        return x
model = QuantizedMLP(features=[512, 256, 128, 10])
# Initialize
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (32, 784))
variables = model.init(key, x)
# Forward pass with quantization
output = model.apply(variables, x)
API Reference¶
Core Functions¶
bfp_quantize¶
- bfp_quantize(data, format='bfp8', backend=None)¶
Quantize array to BFP format with automatic backend detection.
- Parameters:
data (array-like) – Input data (numpy.ndarray, torch.Tensor, or jax.Array)
format (str, BFPSpec, or tuple(int, int)) – BFP format specification
backend (str, optional) – Force specific backend (‘numpy’, ‘jax’, or ‘torch’)
- Returns:
Quantized data (same type as input)
- Return type:
array-like
Format Options:
String: 'bfp8', 'bfp6', etc. (predefined formats)
Tuple: (mantissa_bits, block_size) for custom format
BFPSpec: Full specification object
Example:
import numpy as np
from pychop import bfp_quantize
X = np.random.randn(1024, 768)
# Predefined format
X_q = bfp_quantize(X, format='bfp8')
# Custom format
X_q = bfp_quantize(X, format=(6, 32))  # 6-bit mantissa, 32 elem/block
# Force backend
X_q = bfp_quantize(X, format='bfp8', backend='numpy')
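One way to picture how the three format options could be normalized is sketched below. This is an illustration only: `FormatSpec`, `PREDEFINED`, and `resolve_format` are hypothetical names, the registry lists only three of the predefined formats, and the tuple shorthand is assumed to default to an 8-bit exponent.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FormatSpec:
    """Hypothetical stand-in for a BFP format specification."""
    name: str
    mantissa_bits: int
    block_size: int
    exponent_bits: int = 8

# A few predefined formats, using the mantissa widths and block sizes documented above.
PREDEFINED = {
    "bfp8": FormatSpec("bfp8", 8, 32),
    "bfp6": FormatSpec("bfp6", 6, 32),
    "bfp4": FormatSpec("bfp4", 4, 32),
}

def resolve_format(fmt):
    """Normalize a name, a (mantissa_bits, block_size) tuple, or a spec object."""
    if isinstance(fmt, FormatSpec):
        return fmt
    if isinstance(fmt, str):
        try:
            return PREDEFINED[fmt]
        except KeyError:
            raise ValueError(f"unknown BFP format: {fmt!r}") from None
    if isinstance(fmt, tuple) and len(fmt) == 2:
        mantissa_bits, block_size = fmt
        # The generated name is an arbitrary choice made for this sketch.
        return FormatSpec(f"bfp{mantissa_bits}_b{block_size}", mantissa_bits, block_size)
    raise TypeError("format must be a name, a (mantissa_bits, block_size) tuple, or a spec")

print(resolve_format("bfp8"))
print(resolve_format((6, 32)))
```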
Classes¶
BFPTensor¶
- class BFPTensor(data, format='bfp8', backend=None)¶
Backend-agnostic BFP tensor wrapper.
- Parameters:
data (array-like) – Input tensor
format (str, BFPSpec, or tuple) – BFP format specification
backend (str, optional) – Force specific backend
Methods:
- dequantize()¶
Dequantize to original data type.
- Returns:
Reconstructed tensor
- Return type:
array-like
- statistics()¶
Get quantization statistics.
- Returns:
Dictionary with statistics
- Return type:
dict
Statistics Keys:
format: Format name
mantissa_bits: Mantissa bits per element
block_size: Elements per block
num_blocks: Total number of blocks
compression_ratio_fp32: Compression vs FP32
compression_ratio_fp16: Compression vs FP16
bfp_memory_mb: BFP memory usage (MB)
memory_saved_vs_fp16: Memory saved vs FP16 (%)
bits_per_element: Average bits per element
Example:
from pychop import BFPTensor
bfp = BFPTensor(X, format='bfp8')
# Reconstruct
X_reconstructed = bfp.dequantize()
# Get statistics
stats = bfp.statistics()
print(f"Compression: {stats['compression_ratio_fp16']:.2f}x")
print(f"Memory saved: {stats['memory_saved_vs_fp16']:.1f}%")
print(f"Blocks: {stats['num_blocks']}")
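The storage-related entries can be derived from the element count and the format alone. The sketch below is a hypothetical reimplementation for illustration, not the library's code. It assumes the last partial block is padded to a full block and that each element occupies exactly `mantissa_bits` bits.

```python
import math

def storage_statistics(num_elements, mantissa_bits, block_size, exponent_bits=8):
    """Estimate BFP storage figures for a tensor with num_elements values."""
    num_blocks = math.ceil(num_elements / block_size)
    bfp_bits = num_blocks * (block_size * mantissa_bits + exponent_bits)
    fp16_bits = 16 * num_elements
    fp32_bits = 32 * num_elements
    return {
        "num_blocks": num_blocks,
        "bits_per_element": bfp_bits / num_elements,
        "compression_ratio_fp32": fp32_bits / bfp_bits,
        "compression_ratio_fp16": fp16_bits / bfp_bits,
        "bfp_memory_mb": bfp_bits / 8 / 1024**2,
        "memory_saved_vs_fp16": 100.0 * (1.0 - bfp_bits / fp16_bits),
    }

# A 1024 x 768 tensor with 8-bit mantissas and 32-element blocks
stats = storage_statistics(1024 * 768, mantissa_bits=8, block_size=32)
for key, value in stats.items():
    print(f"{key}: {value}")
```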
BFPSpec¶
- class BFPSpec(name, mantissa_bits, block_size, exponent_bits=8, has_sign=True, use_subnormals=False)¶
BFP format specification.
- Parameters:
name (str) – Format name
mantissa_bits (int) – Mantissa bits per element
block_size (int) – Elements per block
exponent_bits (int) – Shared exponent bits
has_sign (bool) – Whether elements have sign bits
use_subnormals (bool) – Whether to support subnormal numbers
Properties:
total_bits_per_block: Total bits for entire block
compression_vs_fp32: Compression ratio vs FP32
compression_vs_fp16: Compression ratio vs FP16
create_bfp_spec¶
- create_bfp_spec(mantissa_bits, block_size, exponent_bits=8, name=None)¶
Create custom BFP format specification.
- Parameters:
mantissa_bits (int) – Number of mantissa bits (1-32)
block_size (int) – Elements per block
exponent_bits (int) – Bits for shared exponent
name (str, optional) – Custom format name
- Returns:
BFP format specification
- Return type:
BFPSpec
Example:
from pychop import create_bfp_spec, bfp_quantize
# Create 5-bit BFP format
spec = create_bfp_spec(
    mantissa_bits=5,
    block_size=64,
    exponent_bits=8,
    name="my_bfp5"
)
# Use custom format
X_q = bfp_quantize(X, format=spec)
Utility Functions¶
print_bfp_format_table¶
- print_bfp_format_table()¶
Print table of all predefined BFP formats.
Example:
from pychop import print_bfp_format_table
print_bfp_format_table()
Output:
==========================================================================================
Predefined BFP Formats
==========================================================================================
Name           Mantissa   Block Size   Exponent   Compress FP16   Total Bits
------------------------------------------------------------------------------------------
bfp16          16         16           8          1.07x           264
bfp12          12         16           8          1.39x           200
bfp8           8          32           8          1.94x           264
bfp6           6          32           8          2.56x           200
bfp4           4          32           8          3.76x           136
bfp3           3          64           8          5.82x           200
bfp2           2          128          8          10.67x          264
flexpoint16    16         16           5          1.10x           261
flexpoint8     8          32           5          1.97x           261
==========================================================================================
PyTorch-Specific API¶
BFPQuantizerSTE¶
- class pychop.tch.bfp_formats.BFPQuantizerSTE(format='bfp8')¶
BFP quantizer with Straight-Through Estimator for QAT.
Automatically uses STE during training (requires_grad=True).
- Parameters:
format (str, BFPSpec, or tuple) – BFP format specification
Example:
import torch
from pychop.tch.bfp_formats import BFPQuantizerSTE
quantizer = BFPQuantizerSTE(format='bfp8')
x = torch.randn(32, 768, requires_grad=True)
x_q = quantizer(x)
loss = x_q.sum()
loss.backward()  # Gradients flow through STE
BFPLinear¶
- class pychop.tch.bfp_formats.BFPLinear(in_features, out_features, bias=True, weight_format='bfp8', act_format=None, quantize_input=True, quantize_output=False)¶
Linear layer with BFP quantization.
- Parameters:
in_features (int) – Input dimension
out_features (int) – Output dimension
bias (bool) – Whether to use bias
weight_format (str, BFPSpec, or tuple) – BFP format for weights
act_format (str, BFPSpec, or tuple, optional) – BFP format for activations (if None, uses weight_format)
quantize_input (bool) – Whether to quantize input
quantize_output (bool) – Whether to quantize output
Example:
import torch
from pychop.tch.bfp_formats import BFPLinear
layer = BFPLinear(
    in_features=768,
    out_features=3072,
    weight_format='bfp8',
    quantize_input=True,
    quantize_output=False
)
x = torch.randn(32, 768)
y = layer(x)  # Automatic quantization with STE
BFPConv2d¶
- class pychop.tch.bfp_formats.BFPConv2d(in_channels, out_channels, kernel_size, weight_format='bfp8', act_format=None, quantize_input=True, quantize_output=False, **kwargs)¶
2D Convolution with BFP quantization.
- Parameters:
in_channels (int) – Input channels
out_channels (int) – Output channels
kernel_size (int or tuple) – Convolution kernel size
weight_format (str, BFPSpec, or tuple) – BFP format for weights
act_format (str, BFPSpec, or tuple, optional) – BFP format for activations
quantize_input (bool) – Whether to quantize input
quantize_output (bool) – Whether to quantize output
kwargs (dict) – Other Conv2d parameters
Example:
import torch
from pychop.tch.bfp_formats import BFPConv2d
conv = BFPConv2d(
    in_channels=3,
    out_channels=64,
    kernel_size=3,
    weight_format='bfp8',
    quantize_input=True,
    padding=1
)
x = torch.randn(16, 3, 224, 224)
y = conv(x)
convert_linear_to_bfp¶
- pychop.tch.bfp_formats.convert_linear_to_bfp(module, format='bfp8', quantize_input=True, quantize_output=False, inplace=True)¶
Convert all Linear layers in a model to BFP quantized versions.
- Parameters:
module (torch.nn.Module) – Model to convert
format (str, BFPSpec, or tuple) – BFP format
quantize_input (bool) – Whether to quantize inputs
quantize_output (bool) – Whether to quantize outputs
inplace (bool) – Whether to modify in place
- Returns:
Converted model
- Return type:
torch.nn.Module
Example:
from pychop.tch.bfp_formats import convert_linear_to_bfp
import transformers
# Load pretrained model
model = transformers.AutoModelForCausalLM.from_pretrained("gpt2")
# Convert to BFP8
model = convert_linear_to_bfp(
    model,
    format='bfp8',
    quantize_input=True,
    quantize_output=False,
    inplace=True
)
# Fine-tune with BFP quantization
for epoch in range(num_epochs):
    train(model)
JAX-Specific API¶
BFPQuantizerSTE (JAX)¶
- class pychop.jx.bfp_formats.BFPQuantizerSTE(format='bfp8')¶
BFP quantizer with custom VJP for JAX.
- Parameters:
format (str, BFPSpec, or tuple) – BFP format specification
Example:
import numpy as np
import jax.numpy as jnp
from pychop.jx.bfp_formats import BFPQuantizerSTE
quantizer = BFPQuantizerSTE(format='bfp8')
x = jnp.array(np.random.randn(256, 512))
x_q = quantizer(x)
BFPDense¶
- class pychop.jx.bfp_formats.BFPDense(features, use_bias=True, weight_format='bfp8', quantize_input=True)¶
Dense layer with BFP quantization for Flax.
- Parameters:
features (int) – Number of output features
use_bias (bool) – Whether to use bias
weight_format (str, BFPSpec, or tuple) – BFP format for weights
quantize_input (bool) – Whether to quantize input
Example:
from flax import linen as nn
from pychop.jx.bfp_formats import BFPDense
class MyModel(nn.Module):
    @nn.compact
    def __call__(self, x):
        x = BFPDense(features=512, weight_format='bfp8')(x)
        x = nn.relu(x)
        x = BFPDense(features=10)(x)
        return x
Advanced Usage¶
Format Comparison¶
Compare different BFP formats on the same data:
import numpy as np
from pychop import BFPTensor
X = np.random.randn(1024, 768).astype(np.float32)
formats = ['bfp16', 'bfp8', 'bfp6', 'bfp4']
print("Format Comparison")
print("="*80)
print(f"{'Format':<10} {'Compression':<15} {'MSE':<12} {'MAE':<12}")
print("-"*80)
for fmt in formats:
    bfp = BFPTensor(X, format=fmt)
    X_reconstructed = bfp.dequantize()
    stats = bfp.statistics()
    mse = np.mean((X - X_reconstructed) ** 2)
    mae = np.mean(np.abs(X - X_reconstructed))
    print(f"{fmt:<10} {stats['compression_ratio_fp16']:.2f}x{'':>11} "
          f"{mse:.2e}{'':>6} {mae:.2e}")
Memory Analysis¶
Analyze memory usage for different formats:
import numpy as np
from pychop import BFPTensor
X = np.random.randn(4096, 4096).astype(np.float32)
print("\nMemory Analysis")
print("="*80)
print(f"Original FP32: {X.nbytes / 1024**2:.2f} MB")
print(f"FP16 equivalent: {X.nbytes / 2 / 1024**2:.2f} MB")
print("-"*80)
for fmt in ['bfp8', 'bfp6', 'bfp4']:
    bfp = BFPTensor(X, format=fmt)
    stats = bfp.statistics()
    # Savings vs FP32, derived from the documented compression_ratio_fp32 key
    saved_vs_fp32 = (1 - 1 / stats['compression_ratio_fp32']) * 100
    print(f"\n{fmt.upper()}:")
    print(f" Memory: {stats['bfp_memory_mb']:.2f} MB")
    print(f" Saved vs FP32: {saved_vs_fp32:.1f}%")
    print(f" Saved vs FP16: {stats['memory_saved_vs_fp16']:.1f}%")
    print(f" Compression: {stats['compression_ratio_fp16']:.2f}x vs FP16")
LLM Fine-Tuning Example¶
Complete example for fine-tuning LLMs with BFP quantization:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from pychop.tch.bfp_formats import convert_linear_to_bfp
# Load model
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Convert to BFP8
model = convert_linear_to_bfp(
    model,
    format='bfp8',
    quantize_input=True,
    quantize_output=False,
    inplace=True
)
# Setup training
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device)
# Training loop (num_epochs and dataloader come from your own training setup)
model.train()
for epoch in range(num_epochs):
    for batch in dataloader:
        input_ids = batch['input_ids'].to(device)
        labels = input_ids.clone()
        # Forward pass (automatic BFP quantization with STE)
        outputs = model(input_ids=input_ids, labels=labels)
        loss = outputs.loss
        # Backward pass (gradients flow through STE)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Report the loss of the last batch in this epoch
    print(f"Loss: {loss.item():.4f}")
# Save quantized model
torch.save(model.state_dict(), 'model_bfp8.pt')
Performance Tips¶
Choosing Block Size¶
Block size affects compression and accuracy:
Small blocks (8-16): Better accuracy, less compression
Medium blocks (32): Recommended default, good balance
Large blocks (64-128): Higher compression, lower accuracy
# Test different block sizes
for block_size in [8, 16, 32, 64, 128]:
    X_q = bfp_quantize(X, format=(8, block_size))
    mse = np.mean((X - X_q) ** 2)
    print(f"Block size {block_size}: MSE = {mse:.2e}")
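Why block size matters can be seen with a single outlier: every element that shares a block with it inherits its coarse step size. The following self-contained sketch uses a simplified quantizer on synthetic data. The quantizer is an assumption rather than Pychop's implementation (mantissa width includes the sign, power-of-two step, round to nearest), and `fake_quantize` and `mse` are hypothetical helper names.

```python
import math

def fake_quantize(values, mantissa_bits, block_size):
    """Quantize and immediately dequantize, one block at a time."""
    out = []
    limit = 2 ** (mantissa_bits - 1) - 1
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        max_abs = max(abs(v) for v in block)
        if max_abs == 0.0:
            out.extend([0.0] * len(block))
            continue
        _, exp = math.frexp(max_abs)  # max_abs < 2**exp
        step = math.ldexp(1.0, exp - (mantissa_bits - 1))
        out.extend(max(-limit, min(limit, round(v / step))) * step for v in block)
    return out

def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# 63 small values and one large outlier at index 0
data = [0.01 * (1 + i % 7) for i in range(64)]
data[0] = 100.0

errors = {bs: mse(data, fake_quantize(data, 8, bs)) for bs in (8, 16, 32, 64)}
for bs, err in errors.items():
    print(f"Block size {bs}: MSE = {err:.2e}")
```

On this data the error grows with the block size, because more small values end up sharing the outlier's step and round to zero.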
Choosing Mantissa Bits¶
Mantissa bits control precision:
16 bits: Near-lossless, minimal compression
8 bits: Recommended for most tasks
6 bits: Aggressive compression, acceptable for inference
4 bits or less: Research/experimental
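A rough rule behind these recommendations is that each additional mantissa bit halves the rounding step. The sketch assumes a power-of-two step derived from the block maximum and a mantissa width that includes the sign. `rounding_step` is a hypothetical helper, and Pychop's exact scaling may differ.

```python
import math

def rounding_step(max_abs, mantissa_bits):
    """Step size shared by a block whose largest magnitude is max_abs."""
    _, exp = math.frexp(max_abs)  # max_abs < 2**exp
    return math.ldexp(1.0, exp - (mantissa_bits - 1))

# Worst-case round-to-nearest error is half a step (ignoring clipping at the range limit).
for bits in (16, 8, 6, 4):
    step = rounding_step(1.0, bits)
    print(f"{bits:>2} bits: step = {step:.3e}, max rounding error = {step / 2:.3e}")
```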
Backend Selection¶
Choose backend based on your needs:
# For inference (fastest)
pychop.backend('numpy')
# For training (STE support)
pychop.backend('torch')
# For JAX/Flax (custom VJP)
pychop.backend('jax')
# Auto-detect (recommended)
pychop.backend('auto')
Troubleshooting¶
Common Issues¶
Import Error:
# Error: cannot import name 'bfp_quantize'
# Solution: Update pychop
pip install --upgrade pychop
Memory Issues:
# For very large tensors, use smaller block sizes
X_q = bfp_quantize(X, format=(8, 16)) # Smaller blocks
Gradient Issues:
# Ensure requires_grad=True for training
X = torch.randn(128, 768, requires_grad=True)
X_q = bfp_quantize(X, format='bfp8')
# Check gradient flow
loss = X_q.sum()
loss.backward()
assert X.grad is not None, "Gradients not flowing!"
Backend Issues:
# Check current backend
import pychop
print(pychop.get_backend())
# Reset backend
pychop.backend('auto')
FAQ¶
Q: What’s the difference between BFP and MX formats?
A: BFP uses one shared exponent per block, while MX formats use both a shared scale and individual exponents per element. BFP is simpler and more hardware-efficient, while MX provides better dynamic range.
Q: Can I use BFP for training?
A: Yes! The PyTorch backend includes Straight-Through Estimator (STE) support, enabling full quantization-aware training. JAX backend uses custom VJP.
Q: Which format should I use?
A: For most cases, BFP8 (8-bit mantissa, 32 elements/block) is recommended. It provides ~2x compression vs FP16 with minimal accuracy loss.
Q: How does BFP compare to INT8?
A: BFP provides better dynamic range than INT8 at similar compression. BFP adapts to local data statistics through its per-block exponent, while INT8 typically uses a single scale per tensor or per channel.
Q: Can I mix different formats in the same model?
A: Yes! You can use different formats for different layers:
import torch.nn as nn
from pychop.tch.bfp_formats import BFPLinear
class MixedPrecisionModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Higher precision for first layer
        self.fc1 = BFPLinear(768, 3072, weight_format='bfp12')
        # Lower precision for middle layers
        self.fc2 = BFPLinear(3072, 3072, weight_format='bfp6')
        # Full precision for output
        self.fc3 = nn.Linear(3072, 768)
Q: Does BFP work with quantized models from PyTorch/TensorFlow?
A: BFP is independent of PyTorch/TensorFlow quantization. You can apply BFP quantization to any model, including already-quantized models.
References¶
Papers:
Intel Flexpoint: “Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks” (2017) https://arxiv.org/abs/1711.02213
Google BFloat16: “BFloat16: The Secret to High Performance on Cloud TPUs” (2019) https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus
Block Floating Point for Neural Networks: “Training Deep Neural Networks with 8-bit Floating Point Numbers” (2018) https://arxiv.org/abs/1812.08011
Related Formats:
Microscaling formats - OCP Microscaling formats with better dynamic range
fixed_point - Fixed-point quantization (Chopf)
integer - Integer quantization (Chopi)
External Links:
Note
For the latest updates and examples, see the Pychop GitHub repository.