PyTorch 8-Bit Weight Quantizer
This tool demonstrates quantization of neural network weights to INT8 precision.
It implements a custom W8A16LinearLayer that uses 8-bit weights with 16-bit activations.
8-bit Quantizer Implementation
This implementation includes:
- W8A16LinearLayer - A PyTorch module that uses INT8 weights and FP16/BF16/FP32 activations (sketched after this list)
- Quantization - Converts FP32/FP16/BF16 weights to INT8 using per-output-channel scaling
- Visualization - Shows the impact of quantization on weight distributions and errors
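The core layer might look roughly like the sketch below. This is a minimal illustration, assuming the attribute names (int8_weights, scales) and the constructor signature; they are not necessarily the tool's exact API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class W8A16LinearLayer(nn.Module):
    """Linear layer that stores INT8 weights and computes in the activation dtype."""

    def __init__(self, in_features, out_features, bias=True, dtype=torch.float16):
        super().__init__()
        # INT8 weights and per-output-channel scales live in buffers, not
        # nn.Parameters, because they are not updated by an optimizer.
        self.register_buffer(
            "int8_weights",
            torch.zeros((out_features, in_features), dtype=torch.int8),
        )
        self.register_buffer("scales", torch.ones(out_features, dtype=dtype))
        if bias:
            self.register_buffer("bias", torch.zeros(out_features, dtype=dtype))
        else:
            self.register_buffer("bias", None)

    def forward(self, x):
        # Dequantize on the fly: cast INT8 -> activation dtype, run the matmul,
        # then rescale each output channel by its stored scale.
        w = self.int8_weights.to(x.dtype)
        out = F.linear(x, w) * self.scales
        if self.bias is not None:
            out = out + self.bias
        return out
```

Multiplying the output (rather than the weight rows) by the per-channel scales is equivalent because each scale is constant across one output channel.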
How It Works (a code sketch follows these steps):
- For each output channel, find the maximum absolute weight value
- Scale all weights in that channel so the maximum fits in INT8 range (-128 to 127)
- Round scaled weights to integers and store as INT8
- During inference, multiply INT8 weights by scaling factors to recover approximate FP values
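A sketch of those steps as a per-output-channel quantization routine is shown below; the function name quantize_per_channel_int8 and the small clamp used to avoid division by zero are illustrative choices, not necessarily what the tool does internally.

```python
import torch

def quantize_per_channel_int8(weight_fp: torch.Tensor):
    """Quantize an (out_features, in_features) weight matrix to INT8.

    Returns the INT8 weights plus one scale per output channel (row).
    """
    # Step 1: per-row maximum absolute weight value.
    max_abs = weight_fp.abs().max(dim=-1).values.float()        # (out_features,)
    # Step 2: pick a scale so that the per-row maximum maps to 127.
    scales = (max_abs / 127.0).clamp(min=1e-8)                  # avoid divide-by-zero
    # Step 3: divide by the scale, round, and store as INT8.
    int8_weights = torch.round(weight_fp.float() / scales.unsqueeze(-1)).to(torch.int8)
    # Step 4 happens at inference: weight_fp is recovered approximately as
    # int8_weights * scales.unsqueeze(-1).
    return int8_weights, scales.to(weight_fp.dtype)
```

For example, an existing FP16 layer's weights could be quantized and loaded into the layer sketched earlier:

```python
linear_fp16 = nn.Linear(1024, 4096, dtype=torch.float16)
q_layer = W8A16LinearLayer(1024, 4096, dtype=torch.float16)
int8_w, scales = quantize_per_channel_int8(linear_fp16.weight.data)
q_layer.int8_weights.copy_(int8_w)
q_layer.scales.copy_(scales)
q_layer.bias.copy_(linear_fp16.bias.data)
```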
Storing weights as INT8 reduces their memory footprint by roughly 75% compared to FP32: each weight shrinks from 4 bytes to 1 byte, with only a small per-channel overhead for the scaling factors.
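A back-of-envelope check of that figure, assuming a hypothetical 4096 x 4096 layer with one FP16 scale per output channel (the sizes are illustrative):

```python
out_features, in_features = 4096, 4096          # illustrative layer shape
fp32_bytes  = out_features * in_features * 4    # 4 bytes per FP32 weight
int8_bytes  = out_features * in_features * 1    # 1 byte per INT8 weight
scale_bytes = out_features * 2                  # one FP16 scale per output channel
reduction = 1 - (int8_bytes + scale_bytes) / fp32_bytes
print(f"memory reduction: {reduction:.1%}")     # ~75.0%
```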
References:
- This implementation is based on modern techniques used in LLM quantization
- Similar methods are used in libraries like bitsandbytes, AutoGPTQ, and GPTQ-for-LLaMa