One-bit large language models (LLMs) have emerged as a promising approach to making generative AI more accessible and affordable. By representing model weights with a very limited number of bits, 1-bit LLMs dramatically reduce the memory and computational resources required to run them.
Microsoft Research has been pushing the boundaries of 1-bit LLMs with its BitNet architecture. In a new paper, the researchers introduce BitNet a4.8, a technique that further improves the efficiency of 1-bit LLMs without sacrificing their performance.
The rise of 1-bit LLMs
Traditional LLMs use 16-bit floating-point numbers (FP16) to represent their parameters. This requires a lot of memory and compute resources, which limits the accessibility and deployment options for LLMs. One-bit LLMs address this challenge by drastically reducing the precision of model weights while matching the performance of full-precision models.
Previous BitNet models used 1.58-bit values (-1, 0, 1) to represent model weights and 8-bit values for activations. This approach significantly reduced memory and I/O costs, but the computational cost of matrix multiplications remained a bottleneck, and optimizing neural networks with extremely low-bit parameters is challenging.
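To make the ternary idea concrete, here is a minimal sketch of quantizing weights to -1, 0 and 1. It assumes an "absmean" scheme (divide by the mean absolute weight, round, then clip), which is one common way to do this and is not spelled out in the article; the function names and sample values are hypothetical. The "1.58-bit" label comes from the fact that a value with three possible states carries log2(3), or about 1.58, bits of information.

```python
def absmean_ternary(weights, eps=1e-8):
    """Quantize floats to {-1, 0, 1} using one scale for the whole tensor.

    Assumption: the scale is the mean absolute weight ("absmean").
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clip into the ternary range.
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map ternary values back to approximate floating-point weights."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.9, -0.05, 0.4, -1.3, 0.02, -0.6]
q, scale = absmean_ternary(weights)
approx = dequantize(q, scale)
print(q)  # [1, 0, 1, -1, 0, -1]
```

Because every weight is -1, 0 or 1, multiplying by a weight reduces to adding, subtracting or skipping an activation, which is where the memory and I/O savings described above come from.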
Two techniques help to address this problem. Sparsification reduces the number of computations by pruning activations with smaller magnitudes. This is particularly useful in LLMs because activation values tend to have a long-tailed distribution, with a few very large values and many small ones.
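A rough sketch of magnitude-based sparsification is shown below. It keeps a fixed fraction of the largest-magnitude activations and zeroes the rest; the helper name `topk_sparsify`, the keep ratio and the sample values are assumptions for illustration, not details from the paper.

```python
def topk_sparsify(activations, keep_ratio):
    """Zero out all but the largest-magnitude entries of a vector."""
    k = max(1, round(len(activations) * keep_ratio))
    # Indices of the k entries with the largest absolute value.
    order = sorted(range(len(activations)),
                   key=lambda i: abs(activations[i]),
                   reverse=True)
    keep = set(order[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(activations)]


# A long-tailed toy vector: two large values, many small ones.
acts = [0.01, -3.2, 0.05, 0.002, 7.5, -0.03, 0.4, -0.008]
sparse = topk_sparsify(acts, keep_ratio=0.5)
print(sparse)  # [0.0, -3.2, 0.05, 0.0, 7.5, 0.0, 0.4, 0.0]
```

In a long-tailed vector like this one, the dropped entries contribute little to the result, so skipping them saves work at a small cost in accuracy.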
Quantization, on the other hand, uses a smaller number of bits to represent activations, reducing the computational and memory cost of processing them. However, simply lowering the precision of activations can lead to significant quantization errors and performance degradation.
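The trade-off can be illustrated with a small sketch of symmetric "absmax" quantization, a generic scheme assumed here for illustration rather than the exact method used in BitNet. The same toy activations are rounded to 8-bit and 4-bit integers and the average reconstruction error is compared.

```python
def absmax_quantize(activations, bits):
    """Quantize to signed `bits`-bit integers, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    scale = max(abs(a) for a in activations) / qmax
    if scale == 0:
        return list(activations)
    ints = [max(-qmax - 1, min(qmax, round(a / scale))) for a in activations]
    return [q * scale for q in ints]


def mean_abs_error(original, approx):
    """Average absolute difference between two equal-length vectors."""
    return sum(abs(o - a) for o, a in zip(original, approx)) / len(original)


# Hypothetical activations with one large outlier.
acts = [0.01, -3.2, 0.05, 0.002, 7.5, -0.03, 0.4, -0.008]
err8 = mean_abs_error(acts, absmax_quantize(acts, bits=8))
err4 = mean_abs_error(acts, absmax_quantize(acts, bits=4))
print(err4 > err8)  # True
```

The single outlier sets the scale for the whole vector, so with only 4 bits most of the small values collapse to zero. This is the kind of quantization error the paragraph above refers to.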
Furthermore, combining sparsification and quantization is challenging and presents special problems when training 1-bit LLMs.
“Both quantization and sparsification introduce non-differentiable operations, making gradient computation during training particularly challenging,” Furu Wei, Partner Research Manager at Microsoft Research, told VentureBeat.
Gradient computation is essential for calculating errors and updating parameters when training neural networks. The researchers also had to ensure that their techniques could be implemented efficiently on existing hardware while maintaining the benefits of both sparsification and quantization.
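To see why rounding breaks gradient computation, consider the sketch below. The slope of a rounding function is zero almost everywhere, so no learning signal passes through it. One widely used workaround is the straight-through estimator (STE), which treats the rounding step as the identity during the backward pass. The article does not say how the researchers handled this, so the STE here is an assumption, and all names are hypothetical.

```python
def quantize_forward(x, scale):
    """Forward pass: snap x to the nearest multiple of `scale`."""
    return round(x / scale) * scale


def true_gradient(x, scale, h=1e-6):
    """Finite-difference slope of the rounding step at x."""
    hi = quantize_forward(x + h, scale)
    lo = quantize_forward(x - h, scale)
    return (hi - lo) / (2 * h)


def ste_gradient(upstream_grad):
    """Straight-through estimator: pass the gradient through unchanged."""
    return upstream_grad


x, scale = 0.37, 0.25
print(true_gradient(x, scale))  # 0.0, so no signal reaches the weights
print(ste_gradient(1.0))        # 1.0, the signal is passed straight through
```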
BitNet a4.8
BitNet a4.8 addresses the challenges of optimizing 1-bit LLMs through what the researchers describe as “hybrid quantization and sparsification.” They achieved this by designing an architecture that selectively applies quantization or sparsification to different components of the model based on the specific distribution pattern of activations. The architecture uses 4-bit activations for inputs to attention and feed-forward network (FFN) layers. For intermediate states, it applies sparsification combined with 8-bit quantization, so that only 55% of the model's parameters are activated. The architecture is also optimized to take advantage of existing hardware.
“With BitNet b1.58, the inference bottleneck of 1-bit LLMs switches from memory/IO to computation, which is constrained by the activation bits (i.e., 8-bit in BitNet b1.58),” Wei said. “In BitNet a4.8, we push the activation bits to 4-bit so that we can leverage 4-bit kernels (e.g., INT4/FP4) to bring 2x speed up for LLM inference on the GPU devices. The combination of 1-bit model weights from BitNet b1.58 and 4-bit activations from BitNet a4.8 effectively addresses both memory/IO and computational constraints in LLM inference.”
BitNet a4.8 also uses 3-bit values to represent the key (K) and value (V) states in the attention mechanism. The KV cache is a crucial component of transformer models. It stores the representations of previous tokens in the sequence. By lowering the precision of KV cache values, BitNet a4.8 further reduces memory requirements, especially when dealing with long sequences.
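A back-of-the-envelope calculation shows why KV cache precision matters for long sequences. The sketch below assumes a hypothetical model shape (32 layers, 32 KV heads, head dimension 128) and a 32,768-token sequence; none of these numbers come from the paper, and the calculation ignores the small overhead of storing quantization scales.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits):
    """Approximate KV cache size for one sequence, in bytes."""
    # Two tensors (K and V) per layer, one vector per head per token.
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return elements * bits / 8


# Hypothetical model shape, chosen only for illustration.
config = dict(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=32_768)

fp16_gib = kv_cache_bytes(bits=16, **config) / 2**30
int3_gib = kv_cache_bytes(bits=3, **config) / 2**30
print(fp16_gib, int3_gib)  # 16.0 3.0
```

Under these assumptions, the cache shrinks from 16 GiB to 3 GiB for a single sequence, and the saving grows linearly with sequence length.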
The promise of BitNet a4.8
Experimental results show that BitNet a4.8 delivers performance comparable to its predecessor BitNet b1.58 while using less compute and memory.
Compared to full-precision Llama models, BitNet a4.8 reduces memory usage by a factor of 10 and achieves a 4x speedup. Compared to BitNet b1.58, it achieves a 2x speedup through 4-bit activation kernels. But the design can deliver much more.
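The tenfold memory figure is roughly what one would expect from weight storage alone, as the quick calculation below suggests. This is only a plausible reading, since the article does not say how the figure was measured; the 7-billion parameter count is a hypothetical example, and the calculation assumes ideal packing of ternary weights at log2(3) bits each.

```python
import math


def weight_memory_gib(num_params, bits_per_weight):
    """Approximate storage for model weights, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


num_params = 7e9  # hypothetical model size

fp16_gib = weight_memory_gib(num_params, 16)
ternary_gib = weight_memory_gib(num_params, math.log2(3))  # about 1.58 bits
ratio = fp16_gib / ternary_gib
print(round(ratio, 1))  # 10.1
```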
“The estimated computation improvement is based on the existing hardware (GPU),” Wei said. “With hardware specifically optimized for 1-bit LLMs, the computation improvements can be significantly enhanced. BitNet introduces a new computation paradigm that minimizes the need for matrix multiplication, a primary focus in current hardware design optimization.”
The efficiency of BitNet a4.8 makes it particularly suited for deploying LLMs at the edge and on resource-constrained devices. This can have important implications for privacy and security. With LLMs running on-device, users can benefit from the power of these models without needing to send their data to the cloud.
Wei and his team are continuing their work on 1-bit LLMs.
“We continue to advance our research and vision for the era of 1-bit LLMs,” Wei said. “While our current focus is on model architecture and software support (i.e., bitnet.cpp), we aim to explore the co-design and co-evolution of model architecture and hardware to fully unlock the potential of 1-bit LLMs.”