Block Floating Point Formats

Block Floating Point (BFP) is a quantization format where a group of numbers shares a common exponent (scale factor), but each number has its own mantissa. This provides a good balance between compression efficiency and hardware simplicity.

Overview

What is Block Floating Point?

Block Floating Point (BFP) divides data into blocks and applies a shared exponent to all elements within each block. This is simpler than full floating-point but provides better dynamic range than fixed-point quantization.

Key Characteristics:

  • Shared Exponent: One exponent per block (typically 8 bits)

  • Individual Mantissas: Each element has its own mantissa (4-16 bits)

  • Hardware-Efficient: Simpler than full floating-point arithmetic

  • Good Dynamic Range: Adapts to local data statistics
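As a rough illustration of the shared-exponent idea, the following pure-Python sketch quantizes a single block. It is not Pychop's implementation: the function names, the exponent choice (the smallest power of two that bounds the block maximum), and the signed mantissa range are assumptions made for this example.

```python
import math

def bfp_quantize_block(block, mantissa_bits=8):
    """Quantize one block; returns (shared_exponent, integer mantissas)."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0] * len(block)
    # frexp gives amax = f * 2**e with 0.5 <= f < 1, so every |v| < 2**e.
    shared_exp = math.frexp(amax)[1]
    # Assumption: mantissa_bits includes the sign bit.
    qmax = 2 ** (mantissa_bits - 1) - 1
    step = 2.0 ** (shared_exp - (mantissa_bits - 1))
    mantissas = [max(-qmax, min(qmax, round(v / step))) for v in block]
    return shared_exp, mantissas

def bfp_dequantize_block(shared_exp, mantissas, mantissa_bits=8):
    """Rebuild floating-point values from the shared exponent and mantissas."""
    step = 2.0 ** (shared_exp - (mantissa_bits - 1))
    return [q * step for q in mantissas]

block = [0.75, -0.5, 0.1, 0.003]
shared_exp, q = bfp_quantize_block(block, mantissa_bits=8)
restored = bfp_dequantize_block(shared_exp, q, mantissa_bits=8)
print(shared_exp, q)
print(restored)
```

In this sketch 0.003 is flushed to zero: the step size is set by the largest value in the block, so much smaller neighbors lose precision.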

BFP vs Other Formats:

Format      Memory  Dynamic Range  Hardware Cost  Best Use Case
----------  ------  -------------  -------------  -----------------------
BFP         Low     Good           Low            Edge devices, Inference
FP32        High    Excellent      High           Research, Training
FP16        Medium  Good           Medium         Training, Inference
INT8        Low     Poor           Low            Inference only
MX Formats  Low     Excellent      Medium         Advanced training

Architecture

BFP Structure

A BFP block consists of:

┌─────────────────────────────────────────────────┐
│         Block Floating Point Structure          │
├─────────────────────────────────────────────────┤
│  Shared Exponent (8 bits)                       │
├─────────────────────────────────────────────────┤
│  Element 1: Sign (1) + Mantissa (n bits)       │
│  Element 2: Sign (1) + Mantissa (n bits)       │
│  ...                                            │
│  Element N: Sign (1) + Mantissa (n bits)       │
└─────────────────────────────────────────────────┘
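One way to picture this layout in memory is the bit-packing sketch below. The details are assumptions for illustration only (an 8-bit exponent stored with a bias of 127, each element stored as a sign bit followed by its magnitude, big-endian byte order); the source does not specify how Pychop stores blocks internally.

```python
def pack_block(shared_exp, mantissas, mantissa_bits=8, exp_bias=127):
    """Pack one block as [8-bit biased exponent][sign + magnitude fields]."""
    mag_bits = mantissa_bits - 1
    word = (shared_exp + exp_bias) & 0xFF
    for q in mantissas:
        sign = 1 if q < 0 else 0
        field = (sign << mag_bits) | (abs(q) & ((1 << mag_bits) - 1))
        word = (word << mantissa_bits) | field
    total_bits = 8 + mantissa_bits * len(mantissas)
    return word.to_bytes((total_bits + 7) // 8, "big"), total_bits

def unpack_block(data, count, mantissa_bits=8, exp_bias=127):
    """Inverse of pack_block; count is the number of elements in the block."""
    mag_bits = mantissa_bits - 1
    word = int.from_bytes(data, "big")
    mantissas = []
    for _ in range(count):
        field = word & ((1 << mantissa_bits) - 1)
        word >>= mantissa_bits
        magnitude = field & ((1 << mag_bits) - 1)
        mantissas.append(-magnitude if field >> mag_bits else magnitude)
    mantissas.reverse()  # fields were read back last-to-first
    return (word & 0xFF) - exp_bias, mantissas

packed, nbits = pack_block(0, [96, -64, 13, 0])
print(nbits, len(packed))        # bits used, bytes used
print(unpack_block(packed, 4))   # round-trips the exponent and mantissas
```

With 32 elements of 8 bits each, the same function yields 264 bits (33 bytes) per block.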

Example: BFP8 with block_size=32

  • 1 shared exponent (8 bits)

  • 32 elements × 8 bits each = 256 bits

  • Total: 264 bits for 32 elements

  • Compression vs FP16: 512/264 = 1.94x
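The arithmetic above generalizes to any format. The helper names below are hypothetical (they are not part of Pychop) and assume the same accounting as the example: one shared exponent per block plus `mantissa_bits` stored bits per element.

```python
def bfp_block_bits(mantissa_bits, block_size, exponent_bits=8):
    """Total bits needed to store one block."""
    return exponent_bits + mantissa_bits * block_size

def compression_ratio(mantissa_bits, block_size, exponent_bits=8, baseline_bits=16):
    """Size of the block in the baseline format divided by its BFP size."""
    baseline = baseline_bits * block_size
    return baseline / bfp_block_bits(mantissa_bits, block_size, exponent_bits)

for name, m, b in [("bfp8", 8, 32), ("bfp6", 6, 32), ("bfp4", 4, 32)]:
    print(f"{name}: {bfp_block_bits(m, b)} bits/block, "
          f"{compression_ratio(m, b):.2f}x vs FP16, "
          f"{compression_ratio(m, b, baseline_bits=32):.2f}x vs FP32")
```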

Predefined Formats

Pychop provides several predefined BFP formats optimized for different use cases:

Standard Formats

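Format Name  Mantissa Bits  Block Size  Exponent Bits  Compression vs FP16  Use Case
-----------  -------------  ----------  -------------  -------------------  ----------------------
bfp16        16             16          8              0.97x                High precision
bfp12        12             16          8              1.28x                Balanced
bfp8         8              32          8              1.94x                Recommended default
bfp6         6              32          8              2.56x                Aggressive compression
bfp4         4              32          8              3.76x                Ultra-low precision

Compression vs FP16 is computed as 16 × block_size / (mantissa_bits × block_size + exponent_bits), the same accounting as the BFP8 example above. A ratio below 1 means the format takes slightly more space than FP16.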

Ultra-Low Precision Formats

Format Name  Mantissa Bits  Block Size  Exponent Bits  Compression vs FP16  Use Case
-----------  -------------  ----------  -------------  -------------------  -------------------
bfp3         3              64          8              5.12x                Extreme compression
bfp2         2              128         8              7.76x                Research only

Intel Flexpoint Compatible

Format Name  Mantissa Bits  Block Size  Exponent Bits  Compression vs FP16  Notes
-----------  -------------  ----------  -------------  -------------------  ----------------
flexpoint16  16             16          5              0.98x                Intel compatible
flexpoint8   8              32          5              1.96x                Intel compatible

Quick Start

Basic Usage

import pychop
import numpy as np

# Set backend (auto-detect by default)
pychop.backend('auto')

# Create test data
X = np.random.randn(1024, 768).astype(np.float32)

# Quantize with BFP8
from pychop import bfp_quantize
X_quantized = bfp_quantize(X, format='bfp8')

# bfp_quantize returns values of the same shape; use BFPTensor.statistics() for compression figures
print(f"Original: {X.nbytes / 1024:.2f} KB")
print(f"Quantized shape (unchanged): {X_quantized.shape}")

Using BFPTensor

from pychop import BFPTensor

# Create BFP tensor
bfp = BFPTensor(X, format='bfp8')

# Dequantize
X_reconstructed = bfp.dequantize()

# Get statistics
stats = bfp.statistics()
print(f"Compression: {stats['compression_ratio_fp16']:.2f}x vs FP16")
print(f"Memory saved: {stats['memory_saved_vs_fp16']:.1f}%")

# Compute error
mse = np.mean((X - X_reconstructed) ** 2)
print(f"MSE: {mse:.2e}")

Custom Formats

from pychop import create_bfp_spec, bfp_quantize

# Create custom 5-bit BFP format
custom_spec = create_bfp_spec(
    mantissa_bits=5,
    block_size=64,
    exponent_bits=8,
    name="my_bfp5"
)

# Use custom format
X_q = bfp_quantize(X, format=custom_spec)

# Or use tuple shorthand
X_q = bfp_quantize(X, format=(5, 64))  # (mantissa_bits, block_size)

Backend-Specific Usage

NumPy Backend

Pure NumPy implementation for inference and analysis:

import numpy as np
import pychop

pychop.backend('numpy')

X = np.random.randn(512, 512).astype(np.float32)
X_q = pychop.bfp_quantize(X, format='bfp8')

# Compute reconstruction error
error = np.mean((X - X_q) ** 2)
print(f"MSE: {error:.2e}")

PyTorch Backend (with STE)

PyTorch backend with Straight-Through Estimator for Quantization-Aware Training:

import torch
import pychop

pychop.backend('torch')

# Enable gradient tracking
X = torch.randn(128, 768, requires_grad=True)

# Quantize (automatic STE!)
X_q = pychop.bfp_quantize(X, format='bfp8')

# Backward pass - gradients flow through!
loss = X_q.sum()
loss.backward()

print(f"Gradient shape: {X.grad.shape}")
print(f"Gradient norm: {X.grad.norm():.2e}")
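To see why a Straight-Through Estimator is needed at all, consider the toy sketch below. It uses plain Python rather than PyTorch, and a scalar quantizer with a fixed step of 0.25 stands in for BFP rounding; all names are hypothetical. The true derivative of a rounding function is zero almost everywhere, so ordinary gradient descent cannot move the weight, while the STE treats the quantizer as the identity in the backward pass.

```python
def quantize(w, step=0.25):
    """Round w to the nearest multiple of step (stand-in for BFP rounding)."""
    return round(w / step) * step

def loss(w, x=2.0, target=1.5):
    return (quantize(w) * x - target) ** 2

def true_grad(w, eps=1e-6):
    """Central finite difference through the real, piecewise-constant forward pass."""
    return (loss(w + eps) - loss(w - eps)) / (2 * eps)

def ste_grad(w, x=2.0, target=1.5):
    """Exact d(loss)/d(q), with d(q)/d(w) replaced by 1 (the straight-through step)."""
    q = quantize(w)
    return 2.0 * (q * x - target) * x

w = 0.1
print("true gradient at w=0.1:", true_grad(w))  # no learning signal

for _ in range(20):
    w -= 0.05 * ste_grad(w)  # plain gradient descent using the STE gradient

print("quantized weight after training:", quantize(w))
print("final loss:", loss(w))
```

In this toy setup the latent weight keeps moving until its quantized value reaches the grid point that matches the target, at which point the STE gradient is zero.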

Using BFP Quantizers in Models:

import torch
from pychop.tch.bfp_formats import BFPQuantizerSTE

class QuantizedModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quantizer = BFPQuantizerSTE(format='bfp8')
        self.linear = torch.nn.Linear(768, 3072)

    def forward(self, x):
        x = self.quantizer(x)  # Quantize activations
        return self.linear(x)

model = QuantizedModel()
optimizer = torch.optim.Adam(model.parameters())

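# Training loop (dataloader and loss_fn are placeholders for your own data and loss)
for batch, target in dataloader:
    optimizer.zero_grad()  # Clear gradients left over from the previous step
    output = model(batch)
    loss = loss_fn(output, target)
    loss.backward()  # STE handles gradients automatically!
    optimizer.step()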

Quantized Layers:

import torch
from pychop.tch.bfp_formats import BFPLinear

# Replace standard Linear with BFP quantized version
layer = BFPLinear(
    in_features=768,
    out_features=3072,
    weight_format='bfp8',      # Quantize weights
    quantize_input=True,        # Quantize input activations
    quantize_output=False       # Keep output in FP32
)

x = torch.randn(32, 768)
y = layer(x)  # Automatic quantization with STE

Model Conversion:

from pychop.tch.bfp_formats import convert_linear_to_bfp

# Load pretrained model
model = YourModel()

# Convert all Linear layers to BFP
model = convert_linear_to_bfp(
    model,
    format='bfp8',
    quantize_input=True,
    quantize_output=False,
    inplace=True
)

# Fine-tune with quantization
for epoch in range(num_epochs):
    train(model)  # Gradients flow through STE automatically

JAX Backend (with Custom VJP)

JAX backend with custom Vector-Jacobian Product for differentiation:

import jax
import jax.numpy as jnp
import pychop

pychop.backend('jax')

# Create data
key = jax.random.PRNGKey(0)
X = jax.random.normal(key, (256, 512))

# Quantize
X_q = pychop.bfp_quantize(X, format='bfp8')

# Test gradient flow
from pychop.jx.bfp_formats import BFPQuantizerSTE

quantizer = BFPQuantizerSTE(format='bfp8')

def loss_fn(x):
    x_q = quantizer(x)
    return jnp.sum(x_q ** 2)

# Compute gradients (custom VJP handles this)
grad_fn = jax.grad(loss_fn)
grads = grad_fn(X)

print(f"Gradient shape: {grads.shape}")
print(f"Gradient norm: {jnp.linalg.norm(grads):.2e}")

Flax Integration:

from flax import linen as nn
from pychop.jx.bfp_formats import BFPDense

class QuantizedMLP(nn.Module):
    features: list

    @nn.compact
    def __call__(self, x):
        for feat in self.features[:-1]:
            x = BFPDense(
                features=feat,
                weight_format='bfp8',
                quantize_input=True
            )(x)
            x = nn.relu(x)

        x = BFPDense(features=self.features[-1])(x)
        return x

model = QuantizedMLP(features=[512, 256, 128, 10])

# Initialize
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (32, 784))
variables = model.init(key, x)

# Forward pass with quantization
output = model.apply(variables, x)

API Reference

Core Functions

bfp_quantize

bfp_quantize(data, format='bfp8', backend=None)

Quantize array to BFP format with automatic backend detection.

Parameters:
  • data (array-like) – Input data (numpy.ndarray, torch.Tensor, or jax.Array)

  • format (str, BFPSpec, or tuple(int, int)) – BFP format specification

  • backend (str, optional) – Force specific backend (‘numpy’, ‘jax’, or ‘torch’)

Returns:

Quantized data (same type as input)

Return type:

array-like

Format Options:

  • String: 'bfp8', 'bfp6', etc. (predefined formats)

  • Tuple: (mantissa_bits, block_size) for custom format

  • BFPSpec: Full specification object
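The three accepted forms can be thought of as being normalized into a single specification object before quantization. The sketch below shows one possible way to do that; `FormatSketch`, `PREDEFINED`, `resolve_format`, and the generated tuple name are hypothetical stand-ins, not Pychop's actual classes or behavior.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FormatSketch:
    """Hypothetical stand-in for a BFP format specification."""
    name: str
    mantissa_bits: int
    block_size: int
    exponent_bits: int = 8

# A few entries taken from the predefined formats table.
PREDEFINED = {
    "bfp8": FormatSketch("bfp8", 8, 32),
    "bfp6": FormatSketch("bfp6", 6, 32),
    "bfp4": FormatSketch("bfp4", 4, 32),
}

def resolve_format(fmt):
    """Turn a string, a (mantissa_bits, block_size) tuple, or a spec into a spec."""
    if isinstance(fmt, FormatSketch):
        return fmt
    if isinstance(fmt, str):
        try:
            return PREDEFINED[fmt.lower()]
        except KeyError:
            raise ValueError(f"unknown BFP format: {fmt!r}") from None
    if isinstance(fmt, tuple) and len(fmt) == 2:
        mantissa_bits, block_size = fmt
        return FormatSketch(f"bfp{mantissa_bits}_b{block_size}", mantissa_bits, block_size)
    raise TypeError(f"unsupported format argument: {fmt!r}")

print(resolve_format("bfp8"))
print(resolve_format((6, 32)))
```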

Example:

import numpy as np
from pychop import bfp_quantize

X = np.random.randn(1024, 768)

# Predefined format
X_q = bfp_quantize(X, format='bfp8')

# Custom format
X_q = bfp_quantize(X, format=(6, 32))  # 6-bit mantissa, 32 elem/block

# Force backend
X_q = bfp_quantize(X, format='bfp8', backend='numpy')

Classes

BFPTensor

class BFPTensor(data, format='bfp8', backend=None)

Backend-agnostic BFP tensor wrapper.

Parameters:
  • data (array-like) – Input tensor

  • format (str, BFPSpec, or tuple) – BFP format specification

  • backend (str, optional) – Force specific backend

Methods:

dequantize()

Dequantize to original data type.

Returns:

Reconstructed tensor

Return type:

array-like

statistics()

Get quantization statistics.

Returns:

Dictionary with statistics

Return type:

dict

Statistics Keys:

  • format: Format name

  • mantissa_bits: Mantissa bits per element

  • block_size: Elements per block

  • num_blocks: Total number of blocks

  • compression_ratio_fp32: Compression vs FP32

  • compression_ratio_fp16: Compression vs FP16

  • bfp_memory_mb: BFP memory usage (MB)

  • memory_saved_vs_fp16: Memory saved vs FP16 (%)

  • bits_per_element: Average bits per element
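For intuition, several of these values can be derived from the format parameters and the element count alone. The function below is a hypothetical sketch, not the library's `statistics()` method; it assumes each element stores `mantissa_bits` bits, each block stores one exponent, and a partial final block is not padded.

```python
import math

def bfp_statistics(num_elements, mantissa_bits=8, block_size=32, exponent_bits=8):
    """Estimate storage statistics for a tensor with num_elements values."""
    num_blocks = math.ceil(num_elements / block_size)
    total_bits = num_elements * mantissa_bits + num_blocks * exponent_bits
    bits_per_element = total_bits / num_elements
    return {
        "num_blocks": num_blocks,
        "bits_per_element": bits_per_element,
        "bfp_memory_mb": total_bits / 8 / 1024**2,
        "compression_ratio_fp32": 32 / bits_per_element,
        "compression_ratio_fp16": 16 / bits_per_element,
        "memory_saved_vs_fp16": (1 - bits_per_element / 16) * 100,
    }

stats = bfp_statistics(1024 * 768)  # same shape as the Quick Start tensor
for key, value in stats.items():
    print(f"{key}: {value}")
```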

Example:

import numpy as np
from pychop import BFPTensor

X = np.random.randn(1024, 768).astype(np.float32)
bfp = BFPTensor(X, format='bfp8')

# Reconstruct
X_reconstructed = bfp.dequantize()

# Get statistics
stats = bfp.statistics()
print(f"Compression: {stats['compression_ratio_fp16']:.2f}x")
print(f"Memory saved: {stats['memory_saved_vs_fp16']:.1f}%")
print(f"Blocks: {stats['num_blocks']}")

BFPSpec

class BFPSpec(name, mantissa_bits, block_size, exponent_bits=8, has_sign=True, use_subnormals=False)

BFP format specification.

Parameters:
  • name (str) – Format name

  • mantissa_bits (int) – Mantissa bits per element

  • block_size (int) – Elements per block

  • exponent_bits (int) – Shared exponent bits

  • has_sign (bool) – Whether elements have sign bits

  • use_subnormals (bool) – Whether to support subnormal numbers

Properties:

  • total_bits_per_block: Total bits for entire block

  • compression_vs_fp32: Compression ratio vs FP32

  • compression_vs_fp16: Compression ratio vs FP16

create_bfp_spec

create_bfp_spec(mantissa_bits, block_size, exponent_bits=8, name=None)

Create custom BFP format specification.

Parameters:
  • mantissa_bits (int) – Number of mantissa bits (1-32)

  • block_size (int) – Elements per block

  • exponent_bits (int) – Bits for shared exponent

  • name (str, optional) – Custom format name

Returns:

BFP format specification

Return type:

BFPSpec

Example:

import numpy as np
from pychop import create_bfp_spec, bfp_quantize

X = np.random.randn(1024, 768).astype(np.float32)

# Create 5-bit BFP format
spec = create_bfp_spec(
    mantissa_bits=5,
    block_size=64,
    exponent_bits=8,
    name="my_bfp5"
)

# Use custom format
X_q = bfp_quantize(X, format=spec)

Utility Functions

PyTorch-Specific API

BFPQuantizerSTE

class pychop.tch.bfp_formats.BFPQuantizerSTE(format='bfp8')

BFP quantizer with Straight-Through Estimator for QAT.

Applies the STE automatically when the input requires gradients (requires_grad=True).

Parameters:

format (str, BFPSpec, or tuple) – BFP format specification

Example:

import torch
from pychop.tch.bfp_formats import BFPQuantizerSTE

quantizer = BFPQuantizerSTE(format='bfp8')

x = torch.randn(32, 768, requires_grad=True)
x_q = quantizer(x)

loss = x_q.sum()
loss.backward()  # Gradients flow through STE

BFPLinear

class pychop.tch.bfp_formats.BFPLinear(in_features, out_features, bias=True, weight_format='bfp8', act_format=None, quantize_input=True, quantize_output=False)

Linear layer with BFP quantization.

Parameters:
  • in_features (int) – Input dimension

  • out_features (int) – Output dimension

  • bias (bool) – Whether to use bias

  • weight_format (str, BFPSpec, or tuple) – BFP format for weights

  • act_format (str, BFPSpec, or tuple, optional) – BFP format for activations (if None, uses weight_format)

  • quantize_input (bool) – Whether to quantize input

  • quantize_output (bool) – Whether to quantize output

Example:

import torch
from pychop.tch.bfp_formats import BFPLinear

layer = BFPLinear(
    in_features=768,
    out_features=3072,
    weight_format='bfp8',
    quantize_input=True,
    quantize_output=False
)

x = torch.randn(32, 768)
y = layer(x)  # Automatic quantization with STE

BFPConv2d

class pychop.tch.bfp_formats.BFPConv2d(in_channels, out_channels, kernel_size, weight_format='bfp8', act_format=None, quantize_input=True, quantize_output=False, **kwargs)

2D Convolution with BFP quantization.

Parameters:
  • in_channels (int) – Input channels

  • out_channels (int) – Output channels

  • kernel_size (int or tuple) – Convolution kernel size

  • weight_format (str, BFPSpec, or tuple) – BFP format for weights

  • act_format (str, BFPSpec, or tuple, optional) – BFP format for activations

  • quantize_input (bool) – Whether to quantize input

  • quantize_output (bool) – Whether to quantize output

  • kwargs (dict) – Other Conv2d parameters

Example:

import torch
from pychop.tch.bfp_formats import BFPConv2d

conv = BFPConv2d(
    in_channels=3,
    out_channels=64,
    kernel_size=3,
    weight_format='bfp8',
    quantize_input=True,
    padding=1
)

x = torch.randn(16, 3, 224, 224)
y = conv(x)

convert_linear_to_bfp

pychop.tch.bfp_formats.convert_linear_to_bfp(module, format='bfp8', quantize_input=True, quantize_output=False, inplace=True)

Convert all Linear layers in a model to BFP quantized versions.

Parameters:
  • module (torch.nn.Module) – Model to convert

  • format (str, BFPSpec, or tuple) – BFP format

  • quantize_input (bool) – Whether to quantize inputs

  • quantize_output (bool) – Whether to quantize outputs

  • inplace (bool) – Whether to modify in place

Returns:

Converted model

Return type:

torch.nn.Module

Example:

from pychop.tch.bfp_formats import convert_linear_to_bfp
import transformers

# Load pretrained model
model = transformers.AutoModelForCausalLM.from_pretrained("gpt2")

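# Convert Linear layers to BFP8. Hugging Face's GPT-2 implements most projections
# as Conv1D modules rather than nn.Linear, so check which layers were converted.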
model = convert_linear_to_bfp(
    model,
    format='bfp8',
    quantize_input=True,
    quantize_output=False,
    inplace=True
)

# Fine-tune with BFP quantization
for epoch in range(num_epochs):
    train(model)

JAX-Specific API

BFPQuantizerSTE (JAX)

class pychop.jx.bfp_formats.BFPQuantizerSTE(format='bfp8')

BFP quantizer with custom VJP for JAX.

Parameters:

format (str, BFPSpec, or tuple) – BFP format specification

Example:

import numpy as np
import jax.numpy as jnp
from pychop.jx.bfp_formats import BFPQuantizerSTE

quantizer = BFPQuantizerSTE(format='bfp8')

x = jnp.array(np.random.randn(256, 512))
x_q = quantizer(x)

BFPDense

class pychop.jx.bfp_formats.BFPDense(features, use_bias=True, weight_format='bfp8', quantize_input=True)

Dense layer with BFP quantization for Flax.

Parameters:
  • features (int) – Number of output features

  • use_bias (bool) – Whether to use bias

  • weight_format (str, BFPSpec, or tuple) – BFP format for weights

  • quantize_input (bool) – Whether to quantize input

Example:

from flax import linen as nn
from pychop.jx.bfp_formats import BFPDense

class MyModel(nn.Module):
    @nn.compact
    def __call__(self, x):
        x = BFPDense(features=512, weight_format='bfp8')(x)
        x = nn.relu(x)
        x = BFPDense(features=10)(x)
        return x

Advanced Usage

Format Comparison

Compare different BFP formats on the same data:

import numpy as np
from pychop import BFPTensor

X = np.random.randn(1024, 768).astype(np.float32)

formats = ['bfp16', 'bfp8', 'bfp6', 'bfp4']

print("Format Comparison")
print("="*80)
print(f"{'Format':<10} {'Compression':<15} {'MSE':<12} {'MAE':<12}")
print("-"*80)

for fmt in formats:
    bfp = BFPTensor(X, format=fmt)
    X_reconstructed = bfp.dequantize()
    stats = bfp.statistics()

    mse = np.mean((X - X_reconstructed) ** 2)
    mae = np.mean(np.abs(X - X_reconstructed))

    print(f"{fmt:<10} {stats['compression_ratio_fp16']:.2f}x{'':>11} "
          f"{mse:.2e}{'':>6} {mae:.2e}")

Memory Analysis

Analyze memory usage for different formats:

import numpy as np
from pychop import BFPTensor

X = np.random.randn(4096, 4096).astype(np.float32)

print("\nMemory Analysis")
print("="*80)
print(f"Original FP32: {X.nbytes / 1024**2:.2f} MB")
print(f"FP16 equivalent: {X.nbytes / 2 / 1024**2:.2f} MB")
print("-"*80)

for fmt in ['bfp8', 'bfp6', 'bfp4']:
    bfp = BFPTensor(X, format=fmt)
    stats = bfp.statistics()

    print(f"\n{fmt.upper()}:")
    print(f"  Memory: {stats['bfp_memory_mb']:.2f} MB")
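    # 'memory_saved_vs_fp32' is not among the documented statistics keys; derive it from the ratio
    print(f"  Saved vs FP32: {(1 - 1 / stats['compression_ratio_fp32']) * 100:.1f}%")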
    print(f"  Saved vs FP16: {stats['memory_saved_vs_fp16']:.1f}%")
    print(f"  Compression: {stats['compression_ratio_fp16']:.2f}x vs FP16")

LLM Fine-Tuning Example

End-to-end sketch for fine-tuning an LLM with BFP quantization (dataloader and num_epochs are assumed to be defined elsewhere):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from pychop.tch.bfp_formats import convert_linear_to_bfp

# Load model
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

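# Convert Linear layers to BFP8. Hugging Face's GPT-2 implements most projections
# as Conv1D modules rather than nn.Linear, so check which layers were converted.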
model = convert_linear_to_bfp(
    model,
    format='bfp8',
    quantize_input=True,
    quantize_output=False,
    inplace=True
)

# Setup training
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device)

# Training loop
model.train()
for epoch in range(num_epochs):
    for batch in dataloader:
        input_ids = batch['input_ids'].to(device)
        labels = input_ids.clone()

        # Forward pass (automatic BFP quantization with STE)
        outputs = model(input_ids=input_ids, labels=labels)
        loss = outputs.loss

        # Backward pass (gradients flow through STE)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        print(f"Loss: {loss.item():.4f}")

# Save quantized model
torch.save(model.state_dict(), 'model_bfp8.pt')

Performance Tips

Choosing Block Size

Block size affects compression and accuracy:

  • Small blocks (8-16): Better accuracy, less compression

  • Medium blocks (32): Recommended default, good balance

  • Large blocks (64-128): Higher compression, lower accuracy

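import numpy as np
from pychop import bfp_quantize

X = np.random.randn(1024, 768).astype(np.float32)

# Test different block sizes
for block_size in [8, 16, 32, 64, 128]:
    X_q = bfp_quantize(X, format=(8, block_size))
    mse = np.mean((X - X_q) ** 2)
    print(f"Block size {block_size}: MSE = {mse:.2e}")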
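The accuracy cost of large blocks shows up most clearly when a block contains an outlier. The pure-Python sketch below uses a simplified quantizer (an assumption for illustration, not Pychop's implementation) on synthetic data with one large value every 128 elements.

```python
import math

def quantize(values, mantissa_bits, block_size):
    """Simplified BFP round trip: quantize and dequantize block by block."""
    qmax = 2 ** (mantissa_bits - 1) - 1
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        if amax == 0.0:
            out.extend(block)
            continue
        # Step size is set by the largest magnitude in the block.
        step = 2.0 ** (math.frexp(amax)[1] - (mantissa_bits - 1))
        out.extend(max(-qmax, min(qmax, round(v / step))) * step for v in block)
    return out

# 256 small values, with a large outlier in each half.
data = [0.01 * math.sin(i) for i in range(256)]
data[5] = data[133] = 8.0

mse_by_block = {}
for block_size in (8, 32, 128):
    restored = quantize(data, 8, block_size)
    mse_by_block[block_size] = sum((a - b) ** 2 for a, b in zip(data, restored)) / len(data)
    print(f"Block size {block_size}: MSE = {mse_by_block[block_size]:.2e}")
```

Every small value that shares a block with an outlier is rounded to zero here, so the error grows with the number of elements each outlier drags along.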

Choosing Mantissa Bits

Mantissa bits control precision:

  • 16 bits: Near-lossless, minimal compression

  • 8 bits: Recommended for most tasks

  • 6 bits: Aggressive compression, acceptable for inference

  • 4 bits or less: Research/experimental
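A quick way to reason about these tiers is the quantization step, which halves with every added mantissa bit. The helper below is a hypothetical sketch that assumes a power-of-two shared scale just above the block maximum and a mantissa width that includes the sign bit.

```python
import math

def quantization_step(block_max, mantissa_bits):
    """Spacing between representable values in a block whose largest magnitude is block_max."""
    shared_exp = math.frexp(block_max)[1]  # block_max < 2**shared_exp
    return 2.0 ** (shared_exp - (mantissa_bits - 1))

for bits in (16, 8, 6, 4):
    step = quantization_step(1.0, bits)
    print(f"{bits:>2} bits: step = {step:.6f}, rounding error <= {step / 2:.6f}")
```

The half-step bound applies to values that do not saturate at the top of the representable range.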

Backend Selection

Choose backend based on your needs:

# For inference and analysis (pure NumPy)
pychop.backend('numpy')

# For training (STE support)
pychop.backend('torch')

# For JAX/Flax (custom VJP)
pychop.backend('jax')

# Auto-detect (recommended)
pychop.backend('auto')

Troubleshooting

Common Issues

Import Error:

# Error: cannot import name 'bfp_quantize'
# Solution: Update pychop
pip install --upgrade pychop

Memory Issues:

# For very large tensors, use smaller block sizes
X_q = bfp_quantize(X, format=(8, 16))  # Smaller blocks

Gradient Issues:

import torch
from pychop import bfp_quantize

# Ensure requires_grad=True for training
X = torch.randn(128, 768, requires_grad=True)
X_q = bfp_quantize(X, format='bfp8')

# Check gradient flow
loss = X_q.sum()
loss.backward()
assert X.grad is not None, "Gradients not flowing!"

Backend Issues:

# Check current backend
import pychop
print(pychop.get_backend())

# Reset backend
pychop.backend('auto')

FAQ

Q: What’s the difference between BFP and MX formats?

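A: BFP uses one shared exponent per block with integer mantissas. MX formats also share a per-block scale, but their floating-point element types (such as FP8 or FP4) additionally give each element its own small exponent. BFP is simpler and more hardware-efficient, while MX provides better dynamic range within a block.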

Q: Can I use BFP for training?

A: Yes! The PyTorch backend includes Straight-Through Estimator (STE) support, enabling full quantization-aware training. The JAX backend achieves the same through a custom VJP.

Q: Which format should I use?

A: For most cases, BFP8 (8-bit mantissa, 32 elements/block) is recommended. It provides ~2x compression vs FP16 with minimal accuracy loss.

Q: How does BFP compare to INT8?

A: BFP offers better dynamic range than INT8 at similar compression. Each BFP block gets its own exponent, so the scale adapts to local data statistics, whereas INT8 typically applies one scale per tensor or per channel.
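The difference is easiest to see on data that mixes small and large magnitudes. The sketch below compares a single global INT8 scale with a simplified per-block BFP8 quantizer; both functions are illustrative assumptions, not Pychop code.

```python
import math

def int8_global(values):
    """Symmetric INT8 round trip with one scale for the whole tensor."""
    scale = max(abs(v) for v in values) / 127
    return [max(-127, min(127, round(v / scale))) * scale for v in values]

def bfp8_blocks(values, block_size=32):
    """Simplified BFP8 round trip with one power-of-two step per block."""
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        step = 2.0 ** (math.frexp(max(abs(v) for v in block))[1] - 7)
        out.extend(max(-127, min(127, round(v / step))) * step for v in block)
    return out

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

small = [0.01 * math.cos(i) for i in range(32)]   # low-magnitude region
large = [10.0 * math.cos(i) for i in range(32)]   # high-magnitude region
data = small + large

int8_restored = int8_global(data)
bfp_restored = bfp8_blocks(data)

print(f"INT8 error on small region: {mse(small, int8_restored[:32]):.2e}")
print(f"BFP8 error on small region: {mse(small, bfp_restored[:32]):.2e}")
```

With the global scale, every value in the small region rounds to zero, while per-block scaling keeps it. On the large region the power-of-two step used here can be coarser than the INT8 scale, so the benefit of BFP is range coverage rather than lower error everywhere.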

Q: Can I mix different formats in the same model?

A: Yes! You can use different formats for different layers:

import torch.nn as nn
from pychop.tch.bfp_formats import BFPLinear

class MixedPrecisionModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Higher precision for first layer
        self.fc1 = BFPLinear(768, 3072, weight_format='bfp12')
        # Lower precision for middle layers
        self.fc2 = BFPLinear(3072, 3072, weight_format='bfp6')
        # Full precision for output
        self.fc3 = nn.Linear(3072, 768)

Q: Does BFP work with quantized models from PyTorch/TensorFlow?

A: BFP is independent of PyTorch/TensorFlow quantization. You can apply BFP quantization to any model, including already-quantized models.

References

Papers:

  1. Intel Flexpoint: “Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks” (2017) https://arxiv.org/abs/1711.02213

  2. Google BFloat16: “BFloat16: The Secret to High Performance on Cloud TPUs” (2019) https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus

  3. 8-bit floating-point training: “Training Deep Neural Networks with 8-bit Floating Point Numbers” (2018) https://arxiv.org/abs/1812.08011

Related Formats:

  • Microscaling formats - OCP Microscaling formats with better dynamic range

  • fixed_point - Fixed-point quantization (Chopf)

  • integer - Integer quantization (Chopi)

Note

For the latest updates and examples, see the Pychop GitHub repository.