This tool demonstrates quantization of neural network weights to INT8 precision.
It implements a custom W8A16LinearLayer that uses 8-bit weights with 16-bit activations.
[Figure: weight visualization - panels "Data Type" and "Weight Pattern"]
8-bit Quantizer Implementation
This implementation includes:
W8A16LinearLayer - A PyTorch module that uses INT8 weights with FP16/BF16/FP32 activations (see the sketch after this list)
Quantization - Converts FP32/FP16/BF16 weights to INT8 using per-output-channel scaling
Visualization - Shows the impact of quantization on weight distributions and errors
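The sketch below shows one way such a layer could look in PyTorch. The class name follows the description above, but the buffer names (int8_weights, scales), constructor signature, and bias handling are assumptions for illustration, not necessarily this repo's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class W8A16LinearLayer(nn.Module):
    """Linear layer storing INT8 weights with per-output-channel FP scales (illustrative sketch)."""

    def __init__(self, in_features, out_features, bias=True, dtype=torch.float32):
        super().__init__()
        # INT8 weights and scales are registered as buffers, not trainable parameters.
        self.register_buffer(
            "int8_weights",
            torch.zeros((out_features, in_features), dtype=torch.int8),
        )
        self.register_buffer("scales", torch.ones(out_features, dtype=dtype))
        if bias:
            self.register_buffer("bias", torch.zeros(out_features, dtype=dtype))
        else:
            self.bias = None

    def forward(self, x):
        # Dequantize on the fly: cast INT8 weights to the activation dtype,
        # rescale each output channel, then run a normal linear op.
        w = self.int8_weights.to(x.dtype) * self.scales.to(x.dtype).unsqueeze(1)
        b = self.bias.to(x.dtype) if self.bias is not None else None
        return F.linear(x, w, b)


# Example usage with BF16 activations:
layer = W8A16LinearLayer(16, 32, dtype=torch.bfloat16)
x = torch.randn(2, 16, dtype=torch.bfloat16)
print(layer(x).shape)  # torch.Size([2, 32])
```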
How It Works:
For each output channel, find the maximum absolute weight value
Scale all weights in that channel so the maximum fits in INT8 range (-128 to 127)
Round scaled weights to integers and store as INT8
During inference, multiply INT8 weights by the scaling factors to recover approximate FP values, as sketched below
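A minimal sketch of these four steps, assuming symmetric per-output-channel absmax scaling; the function name quantize_per_channel is hypothetical.

```python
import torch


def quantize_per_channel(weights: torch.Tensor):
    """Quantize an FP weight matrix of shape (out_features, in_features) to INT8."""
    w_fp32 = weights.to(torch.float32)
    # Step 1: maximum absolute value per output channel (row).
    max_abs = w_fp32.abs().max(dim=-1).values
    # Step 2: scale so that max_abs maps to 127 (symmetric INT8 range).
    scales = max_abs / 127.0
    # Step 3: divide, round, clamp, and store as INT8.
    int8_weights = torch.round(w_fp32 / scales.unsqueeze(1)).clamp(-128, 127).to(torch.int8)
    return int8_weights, scales.to(weights.dtype)


# Step 4: dequantize to recover approximate FP values.
w = torch.randn(4, 8)
q, s = quantize_per_channel(w)
w_hat = q.to(torch.float32) * s.unsqueeze(1)
print((w - w_hat).abs().max())  # small quantization error
```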
The quantization process reduces weight memory by roughly 75% compared to FP32 (1 byte per weight instead of 4), with only a small overhead for the per-channel scales.
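As a quick check of that figure, the arithmetic below uses a hypothetical 4096 x 4096 linear layer; the sizes are illustrative only.

```python
# Rough memory arithmetic for a single hypothetical 4096 x 4096 linear layer.
out_features, in_features = 4096, 4096
fp32_bytes = out_features * in_features * 4   # 64 MiB of FP32 weights
int8_bytes = out_features * in_features * 1   # 16 MiB of INT8 weights
scale_bytes = out_features * 4                # per-channel FP32 scales
savings = 1 - (int8_bytes + scale_bytes) / fp32_bytes
print(f"{savings:.1%} memory reduction")      # ~75.0%
```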
References:
This implementation is based on modern techniques used in LLM quantization
Similar methods are used in libraries like bitsandbytes, AutoGPTQ, and GPTQ-for-LLaMa