This tool demonstrates quantization of neural network weights to INT8 precision.
It implements a custom W8A16LinearLayer that uses 8-bit weights with 16-bit activations.
[Figure: weight visualization - panels "Data Type" and "Weight Pattern"]
8-bit Quantizer Implementation
This implementation includes:
W8A16LinearLayer - A PyTorch module that uses INT8 weights with FP16/BF16/FP32 activations (see the sketch after this list)
Quantization - Converts FP32/FP16/BF16 weights to INT8 using per-output-channel scaling
Visualization - Shows the impact of quantization on weight distributions and errors
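The sketch below shows one way such a layer could look in PyTorch. The class name follows the description above, but the buffer names (int8_weights, scales), constructor signature, and bias handling are assumptions for illustration, not necessarily this repo's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class W8A16LinearLayer(nn.Module):
    """Linear layer storing INT8 weights with per-output-channel FP scales (illustrative sketch)."""

    def __init__(self, in_features, out_features, bias=True, dtype=torch.float32):
        super().__init__()
        # INT8 weights and scales are registered as buffers, not trainable parameters.
        self.register_buffer(
            "int8_weights",
            torch.zeros((out_features, in_features), dtype=torch.int8),
        )
        self.register_buffer("scales", torch.ones(out_features, dtype=dtype))
        if bias:
            self.register_buffer("bias", torch.zeros(out_features, dtype=dtype))
        else:
            self.bias = None

    def forward(self, x):
        # Dequantize on the fly: cast INT8 weights to the activation dtype,
        # rescale each output channel, then run a normal linear op.
        w = self.int8_weights.to(x.dtype) * self.scales.to(x.dtype).unsqueeze(1)
        b = self.bias.to(x.dtype) if self.bias is not None else None
        return F.linear(x, w, b)


# Example usage with BF16 activations:
layer = W8A16LinearLayer(16, 32, dtype=torch.bfloat16)
x = torch.randn(2, 16, dtype=torch.bfloat16)
print(layer(x).shape)  # torch.Size([2, 32])
```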
How It Works:
For each output channel, find the maximum absolute weight value
Scale all weights in that channel so the maximum fits in INT8 range (-128 to 127)
Round scaled weights to integers and store as INT8
During inference, multiply INT8 weights by the scaling factors to recover approximate FP values, as sketched below
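A minimal sketch of these four steps, assuming symmetric per-output-channel absmax scaling; the function name quantize_per_channel is hypothetical.

```python
import torch


def quantize_per_channel(weights: torch.Tensor):
    """Quantize an FP weight matrix of shape (out_features, in_features) to INT8."""
    w_fp32 = weights.to(torch.float32)
    # Step 1: maximum absolute value per output channel (row).
    max_abs = w_fp32.abs().max(dim=-1).values
    # Step 2: scale so that max_abs maps to 127 (symmetric INT8 range).
    scales = max_abs / 127.0
    # Step 3: divide, round, clamp, and store as INT8.
    int8_weights = torch.round(w_fp32 / scales.unsqueeze(1)).clamp(-128, 127).to(torch.int8)
    return int8_weights, scales.to(weights.dtype)


# Step 4: dequantize to recover approximate FP values.
w = torch.randn(4, 8)
q, s = quantize_per_channel(w)
w_hat = q.to(torch.float32) * s.unsqueeze(1)
print((w - w_hat).abs().max())  # small quantization error
```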
The quantization process reduces weight memory by roughly 75% compared to FP32 (1 byte per weight instead of 4), with only a small overhead for the per-channel scales.
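As a quick check of that figure, the arithmetic below uses a hypothetical 4096 x 4096 linear layer; the sizes are illustrative only.

```python
# Rough memory arithmetic for a single hypothetical 4096 x 4096 linear layer.
out_features, in_features = 4096, 4096
fp32_bytes = out_features * in_features * 4   # 64 MiB of FP32 weights
int8_bytes = out_features * in_features * 1   # 16 MiB of INT8 weights
scale_bytes = out_features * 4                # per-channel FP32 scales
savings = 1 - (int8_bytes + scale_bytes) / fp32_bytes
print(f"{savings:.1%} memory reduction")      # ~75.0%
```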
References:
This implementation is based on modern techniques used in LLM quantization
Similar methods are used in libraries like bitsandbytes, AutoGPTQ, and GPTQ-for-LLaMa