vllm.model_executor.layers.quantization.compressed_tensors.schemes ¶
Modules:
| Name | Description |
|---|---|
| compressed_tensors_scheme | |
| compressed_tensors_w4a16_mxfp4 | |
CompressedTensorsScheme ¶
Bases: ABC
Abstract class used to describe the weight creation and forward pass of different quantization schemes supported by CompressedTensors.
Source code in vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_scheme.py
apply_weights abstractmethod ¶
Run the forward pass for the particular scheme. This is where scheme-specific dequant/quant steps/kernels should be applied.
:param layer: torch.nn.Module with the registered weights and other parameters relevant to the particular scheme
:param x: input to the layer
:param bias: bias parameter
Source code in vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_scheme.py
create_weights abstractmethod ¶
Weight creation for the particular scheme. Inputs to this function
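To make the contract concrete, here is a minimal sketch of how a scheme subclass fills in the two abstract methods. The `IdentityScheme` class and its pass-through behavior are hypothetical illustrations, not part of vLLM; real schemes register `torch.nn.Parameter` weights in `create_weights` and run quantized kernels in `apply_weights`.

```python
from abc import ABC, abstractmethod


class CompressedTensorsScheme(ABC):
    """Simplified stand-in for vLLM's abstract base class."""

    @abstractmethod
    def create_weights(self, layer, **kwargs):
        """Register the scheme's weights on `layer`."""

    @abstractmethod
    def apply_weights(self, layer, x, bias=None):
        """Run the scheme-specific forward pass."""


class IdentityScheme(CompressedTensorsScheme):
    """Hypothetical no-op scheme: stores the weight unmodified and
    passes the input through. A real scheme would dequantize
    layer.weight and perform a matmul here."""

    def create_weights(self, layer, **kwargs):
        # Real implementations create packed/quantized Parameters;
        # this sketch just attaches whatever it is given.
        layer.weight = kwargs.get("weight")

    def apply_weights(self, layer, x, bias=None):
        out = x  # placeholder for dequant + matmul
        if bias is not None:
            out = out + bias
        return out
```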
CompressedTensorsW4A16Mxfp4 ¶
Bases: CompressedTensorsScheme
Compressed tensors scheme for MXFP4 weight-only quantization.
Supports models quantized with the compressed-tensors mxfp4-pack-quantized format.
MXFP4 format:

- 4-bit float weights (E2M1) packed into uint8
- Per-group E8M0 scales with group_size=32
- No global scale (unlike NVFP4)
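The packing and scaling described above can be sketched in plain Python. This is an illustration of the MXFP4 element and scale encodings (per the OCP Microscaling spec), not vLLM's kernel code; the helper names are invented for this example.

```python
# E2M1 magnitudes for codes 0..7 (top bit of each 4-bit code is the sign).
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_fp4(code: int) -> float:
    """Decode one 4-bit E2M1 code: bit 3 is sign, bits 0-2 index magnitude."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * FP4_VALUES[code & 0x7]


def pack_fp4_pair(lo: int, hi: int) -> int:
    """Pack two 4-bit codes into one uint8 (low nibble first)."""
    return (hi << 4) | lo


def unpack_fp4_pair(byte: int) -> tuple[int, int]:
    """Split one uint8 back into its two 4-bit codes."""
    return byte & 0xF, byte >> 4


def e8m0_scale(exponent_bits: int) -> float:
    """E8M0 group scale: a pure power of two with a bias of 127,
    shared by all 32 elements in a group."""
    return 2.0 ** (exponent_bits - 127)
```

Dequantizing an element is then `decode_fp4(code) * e8m0_scale(scale_bits)` for its group's scale; because E8M0 has no mantissa, there is no global scale to fold in, unlike NVFP4.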