
Quantisation in speech and language models

Apr 22, 2025 | AI Development

Quantisation underpins digital signal processing and now elements of contemporary machine learning. Digital audio and images are ubiquitous, and quantisation represents one of the core transformations from analogue to digital format – it converts continuous signals or values into discrete approximations, like converting sound waves into digital audio or continuous light intensities into pixelated images. Quantisation is now also leveraged for the efficient deployment of large language models (LLMs), and as a representation learning tool in contemporary speech models. Quantisation is not only a technical necessity in digital formats but ties together concepts across information theory, cognitive science, artificial intelligence and linguistics.

Quantisation in audio

Quantisation is perhaps easiest to understand as it applies to audio. It is integral to all forms of digital audio – any music, podcasts or videos encountered in digital format are quantised. As we will see, quantisation also underpins landline telephony in a different form.

Sound in its physical form consists of continuous pressure waves. These waves move through the air and reach our auditory systems with a certain frequency (perceived as pitch) and amplitude (perceived as loudness), like waves rolling into the shore – smooth, uninterrupted signals like the movement of water.

Digital formats are inherently discrete, not continuous. Computers can’t directly process continuous signals. They operate on discrete, finite data – specific, countable numbers that can be stored and manipulated. Transforming real audio into digital format therefore requires discretisation through two core processes – sampling and quantisation. 

Audio in digital form is stored as a sequence of individual numbers, unlike in analogue formats, e.g. where the needle of a vinyl player follows the smooth, continuous grooves etched onto the vinyl. Sampling determines how many measurements are taken per second – typically 44.1kHz for CDs, meaning 44,100 measurements are taken every second. You can imagine that the shape of the smooth etching of a vinyl groove traversed over one second could be well approximated with 44,100 ‘steps’. Quantisation then determines the precision of these steps – it sets how many distinct values each sample (measurement) can take, controlled by the bit depth.

During quantisation, some information is inevitably lost, as each amplitude measurement is rounded to the nearest available level – this introduces a degree of error known as quantisation noise. The precision of the approximation of the signal is determined by the bit depth – 16-bit digital audio (standard for CDs) allows for 2^16 (65,536) possible values for each sample. So the shape of the vinyl groove over one second is approximated by 44,100 steps, where each step can take one of 65,536 values – which is why digital and analogue audio sound perceptually similar. You could use a larger bit depth and therefore lose less information, but this requires more storage.

Figure: Sampling and quantisation – a quantised approximation of a real signal, where each sample of the actual signal (orange) is rounded to the nearest available level (green/red).

The bit depth directly impacts storage requirements and fidelity – 8-bit audio requires half the storage of 16-bit audio, as each sample is represented with 8 bits rather than 16. However, 8-bit audio sacrifices considerable dynamic range and fidelity, resulting in a coarse, arcade-like audio quality – 8-bit audio has only 2^8 = 256 levels, whereas 16-bit has 2^16 = 65,536, so each sample has to be rounded far more coarsely to the nearest level, adding much more quantisation noise. This tradeoff between compression efficiency and signal fidelity is also ubiquitous in machine learning.
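As a minimal sketch of what this looks like numerically (Python with NumPy; the `quantise` helper and the 440 Hz test tone are purely illustrative), the snippet below samples one second of a sine wave at 44.1kHz and rounds each sample to the nearest of 2^bit_depth levels, showing how the worst-case quantisation noise shrinks as the bit depth grows:

```python
import numpy as np

sample_rate = 44_100                      # samples per second (the CD standard)
t = np.arange(sample_rate) / sample_rate  # one second of sample times
signal = np.sin(2 * np.pi * 440.0 * t)    # a 440 Hz sine wave with amplitude in [-1, 1]

def quantise(x, bit_depth):
    """Round each sample to the nearest of 2**bit_depth evenly spaced levels in [-1, 1]."""
    step = 2.0 / (2 ** bit_depth - 1)     # spacing between adjacent levels
    return np.round(x / step) * step

for bits in (8, 16):
    noise = signal - quantise(signal, bits)
    print(bits, np.abs(noise).max())      # worst-case quantisation noise, roughly 256x smaller at 16 bits
```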

Quantisation and large language models (LLMs)

The same quantisation principles find application in modern artificial intelligence, including in large language models. Once trained, an LLM is a vast array of numbers (weights) within a transformer architecture, including feedforward blocks and attention matrices (key, value and query matrices).

These weights are typically stored as high-precision floating-point numbers (FP32), which can represent decimal values to a very high precision, requiring significant memory and compute. An FP32 float is split into three parts – the sign, represented with a single bit, the exponent, represented with 8 bits with a bias of 127, and the mantissa, represented with 23 bits, which stores the precision (the actual digits of the number). While all digital representations have of course been quantised (as they are stored in binary), further quantisation – reducing the number of bits used to store each value – can be applied to improve efficiency. This includes using 16-bit floating-point (FP16 – 1 sign bit, 5 exponent bits and 10 precision bits), or even more aggressive techniques like 8-bit (1 sign bit, 7 value bits) or 4-bit (1 sign bit, 3 value bits) integer quantisation, which trade off precision for significant gains in speed and storage. For example, on HuggingFace, the model Mistral-7B-Instruct-v0.3 had 742,441 downloads in full precision, while one of the several available 4-bit quantised models had around 211,000 downloads – summed together, the downloads of the quantised models exceed the downloads of the full-precision weights.

Figure: Binary representation of FP32, FP16 and INT8 – when you halve the precision, half the number of bits are required to represent each value.
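To make the FP32 layout concrete, the short sketch below (standard library only; `fp32_fields` is just an illustrative helper, not part of any library) unpacks a float into its sign, exponent and mantissa bit fields and reconstructs the original value from them:

```python
import struct

def fp32_fields(x: float):
    """Split a number into its IEEE-754 float32 sign, exponent and mantissa bit fields."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # the raw 32 bits as an unsigned int
    sign     = (bits >> 31) & 0x1       # 1 sign bit
    exponent = (bits >> 23) & 0xFF      # 8 exponent bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF          # 23 mantissa (precision) bits
    return sign, exponent, mantissa

sign, exp, man = fp32_fields(-6.25)
# reconstruct the value: (-1)^sign * 2^(exponent - 127) * (1 + mantissa / 2^23)
print((-1) ** sign * 2 ** (exp - 127) * (1 + man / 2 ** 23))  # -6.25
```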

The benefits for LLMs however extend beyond mere storage efficiency. Quantisation can significantly accelerate inference – the process of generating predictions or text completions from the model. By quantising weights to integers, performing matrix multiplication with these integers, and then dequantising the output, models can run faster while preserving most of their capabilities.

Matrix multiplication

The advantages of quantisation apply to the fundamental operation behind deep learning – matrix multiplication. When quantising, high-precision floating-point numbers are replaced with lower-precision values, such as 8-bit integers, enabling much faster computation on modern hardware. 

Consider the multiplication of a weight matrix (W) by a hidden state matrix (X) – a core operation in both feedforward layers and attention mechanisms. If the hidden state matrix X has dimensions R × h (where R is the batch size or sequence length, and h is the hidden dimension) and the weight matrix W has dimensions h × o (where o is the output dimension), their multiplication produces a matrix of dimensions R × o.

To enable quantisation, floating-point values (typically in the range of -1 to 1) are mapped to integers (such as -128 to 127 for INT8). This conversion allows for INT8 × INT8 matrix multiplication, which is significantly faster on most hardware and accelerated on GPUs – integer arithmetic requires less memory and benefits from more efficient hardware instructions.
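Here is a minimal NumPy sketch of that quantise, integer-multiply, dequantise pipeline, using simple per-tensor absmax scaling (real INT8 kernels rely on fused GPU instructions and finer-grained scaling, so this is illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
R, h, o = 4, 64, 32                                 # sequence length, hidden dim, output dim
X = rng.normal(size=(R, h)).astype(np.float32)      # hidden states
W = rng.normal(size=(h, o)).astype(np.float32)      # weights

def absmax_quantise(t):
    """Scale a tensor into the INT8 range [-127, 127] using its absolute maximum."""
    scale = 127.0 / np.abs(t).max()
    return np.round(t * scale).astype(np.int8), scale

Xq, sx = absmax_quantise(X)
Wq, sw = absmax_quantise(W)

# integer matrix multiplication, accumulated in int32 to avoid overflow, then dequantised
Y_int = Xq.astype(np.int32) @ Wq.astype(np.int32)
Y = Y_int.astype(np.float32) / (sx * sw)

print(np.abs(Y - X @ W).max())                      # small error relative to the FP32 result
```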

Quantisation methods and challenges

In practice, implementing quantisation is non-trivial. Two common methods are:

  1. Absmax quantisation: This method scales inputs to the INT8 range (-127 to 127) by multiplying by 127 divided by the absolute maximum value in the tensor. This approach works well for symmetric, zero-centred distributions (common in weight matrices) and enables efficient INT8 × INT8 hardware operations. However, a common activation function is ReLU, which forces all activations to be non-negative, meaning no values in the hidden state matrix will be negative – half of the representation range (-127 to 0) is therefore wasted.
  2. Zero-point quantisation: This technique applies an affine transformation that shifts the input distribution to utilise the full range of the data type – it’s therefore better for asymmetric distributions (like model activations), but typically combines INT8 and INT16/32 operations, making hardware acceleration more challenging. 

Row-wise quantisation is typically employed, where each row of the hidden state and weight matrices is quantised independently with its own scaling factor. The outcome of both absmax and zero-point quantisation is determined by the characteristics of the weight and activation matrices – absmax uses the maximum absolute value to scale the values into the integer range, and zero-point uses a normalised dynamic range and then shifts by the zero point – both are therefore vulnerable to distortion by outliers in the distribution of the weights and activations.
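To make the contrast concrete, here is a sketch (NumPy, with illustrative helper names and a simplified floating-point zero point) of zero-point quantisation applied to a row of non-negative, post-ReLU activations – unlike absmax, it shifts the values so the full signed INT8 range is used:

```python
import numpy as np

def zeropoint_quantise(row):
    """Affine (zero-point) quantisation of one row into the full INT8 range [-128, 127]."""
    lo, hi = row.min(), row.max()
    scale = 255.0 / (hi - lo)             # spread the observed dynamic range over 256 levels
    zero_point = -128.0 - lo * scale      # offset so that `lo` maps to -128
    q = np.round(row * scale + zero_point).astype(np.int8)
    return q, scale, zero_point

def dequantise(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) / scale

rng = np.random.default_rng(0)
acts = np.maximum(rng.normal(size=8), 0).astype(np.float32)  # ReLU activations: all non-negative

q, s, z = zeropoint_quantise(acts)
print(q)                                              # codes span the full signed range, not just 0..127
print(np.abs(dequantise(q, s, z) - acts).max())       # reconstruction error stays small
```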

A significant challenge in LLM quantisation involves handling outlier activations. As models improve in performance (typically measured as perplexity), they develop specific activation patterns with extreme values that appear crucial for generating good predictions, and this seems to hold across LLMs. While most values may lie in the range of -1 to 1, certain activations in a handful of dimensions are consistently much larger, reaching values up to 20x those in other dimensions (LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, Dettmers et al., 2022). At around 6.7B parameters, Dettmers et al. find that all transformer layers and 75% of all sequence dimensions are affected by these extreme features.

As described above, standard quantisation methods like absmax and zero-point quantisation will both be skewed by these outliers, degrading the fidelity of the quantised output and thus the quality of the model. Dettmers et al. (2022) show that setting these outlier features to zero can decrease top-1 attention softmax probability mass by over 20% and degrade validation perplexity by 600-1000%, despite the outliers comprising only about 0.1% of all input features. In contrast, removing the same number of randomly selected features has minimal effect, decreasing the probability by at most 0.3% and degrading perplexity by only around 0.1%. This highlights the importance of preserving high-fidelity representations of these outliers.

To address this, mixed-precision approaches have emerged, representing outliers with high-precision while using lower precision for the majority of activations, effectively mixing the resolution to preserve fidelity of the input characteristics. This balanced approach preserves the critical information contained in outliers in high fidelity, while maintaining efficiency gains through representing the vast majority of values in lower precision. 
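Below is a conceptual sketch of that mixed-precision decomposition, in the spirit of LLM.int8() but simplified: per-tensor absmax replaces the paper's vector-wise scaling, the high-precision path runs in FP32 rather than FP16, and the planted outlier column and threshold are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
R, h, o = 4, 64, 32
X = rng.normal(size=(R, h)).astype(np.float32)   # hidden states
X[:, 5] *= 20.0                                  # plant an outlier feature dimension
W = rng.normal(size=(h, o)).astype(np.float32)   # weights

threshold = 6.0                                  # columns exceeding this magnitude stay high precision
outlier = np.abs(X).max(axis=0) > threshold      # boolean mask over the hidden dimension

def absmax_quantise(t):
    scale = 127.0 / np.abs(t).max()
    return np.round(t * scale).astype(np.int8), scale

# low-precision path: INT8 matmul over the ordinary columns
Xq, sx = absmax_quantise(X[:, ~outlier])
Wq, sw = absmax_quantise(W[~outlier, :])
Y_int8 = (Xq.astype(np.int32) @ Wq.astype(np.int32)).astype(np.float32) / (sx * sw)

# high-precision path: the few outlier columns are multiplied in floating point
Y_fp = X[:, outlier] @ W[outlier, :]

Y = Y_int8 + Y_fp
print(np.abs(Y - X @ W).max())                   # close to the full-precision result
```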

Quantisation for representation learning

Beyond compression, quantisation serves another important function in machine learning: representation learning. Models like wav2vec 2.0 introduce quantised layers that discretise continuous latent speech representations using codebooks, mapping high-dimensional continuous vectors into discrete units. 

This approach has roots in traditional speech coding and acoustic modelling, particularly in technologies used for telecommunications in the 20th century. Landline telephone networks faced strict bandwidth constraints, as analogue voice signals needed to be transmitted efficiently through limited copper infrastructure. Early systems like Pulse Code Modulation (PCM) performed simple uniform quantisation, but more sophisticated codebook-based approaches soon followed – Vector Quantisation (VQ) and Code-Excited Linear Prediction (CELP) codecs used codebooks of speech patterns to reconstruct voice signals with good fidelity despite heavy compression, allowing landline networks to serve millions of concurrent calls with acceptable voice quality. Rather than focusing solely on compression (as was common for landline and early mobile phone networks), quantisation now also functions as a representation learning technique.

In wav2vec 2.0, an encoder produces continuous feature vectors from raw audio, which are then passed through a quantisation module for discretisation using learned vector codebooks. The model trains with contrastive loss to distinguish correct codebook entries from distractors. This creates an effective bottleneck that structures the learning task, particularly for speech, a highly variable and time-dependent signal. This makes the contrastive objective more stable and improves downstream performance. 
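A toy NumPy sketch of the core idea – mapping continuous feature vectors to their nearest entries in a codebook – is shown below. Note that wav2vec 2.0 itself uses Gumbel-softmax sampling over product-quantised codebooks learned end-to-end, so treat this as a conceptual nearest-neighbour stand-in rather than the model's actual quantisation module.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, K = 50, 16, 8                                     # frames, feature dimension, codebook size
z = rng.normal(size=(T, d)).astype(np.float32)          # continuous encoder outputs
codebook = rng.normal(size=(K, d)).astype(np.float32)   # code vectors (learned in the real model)

# assign each frame to its nearest code vector by squared Euclidean distance
dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (T, K)
codes = dists.argmin(axis=1)         # one discrete unit index per frame
q = codebook[codes]                  # the quantised representations used by the rest of the model

print(codes[:10])                    # e.g. a sequence of discrete speech units
```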

Another model with a conceptually similar layer is HuBERT, which uses k-means clustering to generate discrete pseudo-labels from MFCCs (Mel-Frequency Cepstral Coefficients) or self-supervised features. These labels serve as training targets in a predictive task, where the model learns to predict the cluster assignments of masked portions of the input. By aligning with these unsupervised labels, HuBERT benefits from a similar bottleneck effect, guiding the model to learn meaningful intermediate representations of speech. Over training iterations, better features lead to better clustering, improving representation learning without requiring manual annotation.
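A sketch of how such pseudo-labels can be generated with scikit-learn's k-means is given below; random arrays stand in for real MFCC features, and the cluster count and feature dimension are illustrative rather than HuBERT's exact configuration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
T, d = 2000, 39                             # number of frames, MFCC feature dimension
mfcc = rng.normal(size=(T, d))              # stand-in for real MFCC features

# cluster the frames; each frame's cluster index becomes its discrete pseudo-label
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0)
pseudo_labels = kmeans.fit_predict(mfcc)    # shape (T,), values in 0..99

print(pseudo_labels[:10])                   # training targets for the masked prediction task
```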

Linguistics and machine learning 

This use of quantisation creates an intellectually satisfying convergence between machine learning practice and linguistic theory. Both symbolic generative theories and usage-based theories of phonology recognise that speech involves discrete categories while also being inherently continuous. 

Wav2vec 2.0’s explicit use of discrete intermediate representations suggests that enforcing symbolic structure leads to better performance – models work better when they discretise speech in specific ways. This is a nice example of a rediscovery of established ideas from linguistics and cognitive science, reinterpreted through machine learning. 

Early neural models sought to bypass symbolic structure and learn directly from raw waveforms, but current research points back towards discrete representations – incorporating intermediate structure, neither fully symbolic nor raw audio, may be cognitively and computationally beneficial. Language inherently contains both continuous and discrete elements, and effective models seem to rediscover this duality.

To conclude: quantisation demonstrates how fundamental concepts can bridge seemingly disparate fields. As both a compression technique and a representation learning strategy, discretisation proves essential for capturing useful structure. In LLMs, quantisation can enable faster, more efficient computation while (if done carefully) preserving essential patterns in the activations.

Quantisation also exemplifies the continuous recycling of ideas in machine learning – concepts from information theory, signal processing, and linguistics find new applications in today’s shiny new AI systems, a reminder that progress in machine learning is not linear but dialectical, spiralling as it develops by revisiting older ideas in novel contexts.