Introduction
With the rise of Large Language Models (LLMs), the quest to optimize performance while minimizing computational resources has led to the development of various quantization approaches. Mistral 7B, a powerful language model, serves as the canvas for this comparative exploration. Three prominent quantization methods, GPTQ, AWQ, and GGUF, stand out as contenders in the pursuit of efficient and streamlined inference on Mistral 7B.
GPTQ, a one-shot weight quantization method, harnesses approximate second-order information to achieve highly accurate and efficient quantization. AWQ takes an activation-aware approach, protecting salient weights identified by observing activations, and has shown excellent quantization performance, particularly for instruction-tuned LMs. GGUF, on the other hand, is a format designed for flexibility: it is tailored to run on CPUs and Apple M-series devices while allowing certain layers to be offloaded to the GPU.
In this comparative analysis, we delve into the nuances of these quantization approaches by evaluating their impact on Mistral 7B. As we explore their respective strengths, trade-offs, and compatibility with Mistral 7B's architecture, the goal is to provide insights that help in choosing the most fitting quantization strategy. Each approach brings its own advantages and considerations, shaping the landscape of possibilities for better performance and resource efficiency in Mistral 7B's deployment.
Let’s get started!
GGUF: GPT-Generated Unified Format
GGUF, the successor of GGML, was introduced by the llama.cpp team. It is a quantization format designed for Large Language Models. It allows users to run LLMs on a CPU while offloading some layers to the GPU for a speed-up. GGUF is particularly useful for those running models on CPUs or Apple devices. Quantization, in the GGUF context, involves scaling down model weights (typically stored as 16-bit floating-point numbers) to lower-precision representations in order to save computational resources. GGUF was introduced as a more efficient and flexible way of storing and using LLMs for inference, tailored to load and save models rapidly, with a user-friendly approach to handling model files.
Comparison with GPTQ and AWQ
- GGUF is focused on CPU and Apple M series devices and offers flexibility with offloading layers to the GPU for speed enhancements.
- It serves as an evolution from GGML, with improvements in efficiency and user-friendliness.
- GGUF has its unique file format and support in llama.cpp, which distinguishes it from GPTQ and AWQ.
GPTQ: Generalized Post-Training Quantization
GPTQ is a one-shot weight quantization method based on approximate second-order information. Developed by Frantar et al. (2023), it is designed to compress Large Language Models and accelerate their inference. GPTQ allows for highly accurate and efficient quantization, even for models with a very large number of parameters (e.g., the 175-billion-parameter GPT-class models). It is primarily focused on GPU inference and performance gains. GPTQ supports quantization to 8, 4, 3, or even 2 bits without a significant drop in accuracy, while delivering faster inference. It has been integrated into various platforms, including NVIDIA TensorRT-LLM, FastChat, vLLM, HuggingFace TGI, and LMDeploy.
Comparison with GGUF and AWQ
- GPTQ focuses on GPU inference and flexibility in quantization levels.
- It supports a wide range of quantization bit levels and is compatible with most GPU hardware.
- GPTQ aims to provide a balance between compression gains and inference speed.
AWQ: Activation-Aware Weight Quantization
AWQ is an activation-aware weight quantization approach developed by the MIT HAN Lab. It protects salient weights by observing activations rather than the weights themselves. AWQ achieves excellent quantization performance, especially for instruction-tuned LMs and multi-modal LMs, while preserving the model's reasoning ability. It can significantly reduce the GPU memory needed for model serving and speed up token generation. It has been integrated into various platforms, including NVIDIA TensorRT-LLM, FastChat, vLLM, HuggingFace TGI, and LMDeploy.
Comparison with GGUF and GPTQ
- AWQ takes an activation-aware approach, observing activations to decide which weights to protect during quantization.
- It excels in quantization performance for instruction-tuned LMs and multi-modal LMs.
- AWQ provides a turn-key solution for efficient deployment on resource-constrained edge platforms.
Leveraging E2E Cloud GPU for Quantization Approaches
E2E Cloud stands as a robust platform, providing an optimal environment for implementing and optimizing quantization approaches such as GPTQ, AWQ, and GGUF. With GPUs such as the NVIDIA L4, E2E Cloud delivers the performance needed to meet the high computational demands of data scientists and tech professionals, and provides a reliable environment for running quantized Large Language Models.
This user-friendly setup streamlines the implementation and optimization of quantization approaches, making it easier to deploy and fine-tune quantized models. The platform's adaptability is further highlighted by its support for various quantization methods: whether implementing GPTQ, AWQ, or GGUF, users can choose the approach that aligns with their specific requirements.
I used an E2E Cloud node with an A100 80 GB GPU and CUDA 11 for efficient performance. To learn more about E2E Cloud GPUs, visit the website.
To get started, add your SSH keys by going into Settings.
After creating SSH keys, create a node by going into ‘Compute’.
Now, open Visual Studio Code and install the 'Remote Explorer' and 'Remote - SSH' extensions. Open a new terminal and connect to the node over SSH from your local system.
You'll now be logged in to the remote node via SSH from your local machine.
Implementing Quantization Approaches
We'll start with Hugging Face Transformers, without using any quantization approach.
Let’s install the dependencies.
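The cell below is a minimal sketch of what this might look like in a notebook; the exact package set is an assumption based on the libraries used later in this walkthrough.

```python
# Baseline dependencies for running Mistral 7B with Hugging Face Transformers
!pip install -q transformers accelerate torch
```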
We’ll create a pipeline simply with the base model from Hugging Face.
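The following sketch assumes the `mistralai/Mistral-7B-Instruct-v0.1` checkpoint and loads it in half precision, without any quantization:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "mistralai/Mistral-7B-Instruct-v0.1"  # assumed base checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision, but no quantization
    device_map="auto",          # place layers on the available GPU(s)
)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
```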
Loading an LLM this way does not apply any compression techniques to conserve VRAM or enhance efficiency.
To generate our prompt, we initially need to construct the required template. Fortunately, this can be automatically accomplished if the chat template is stored in the underlying tokenizer.
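As an illustration, with an assumed question about quantization, the prompt can be built like this:

```python
messages = [
    {"role": "user", "content": "Tell me something about quantization of large language models."}
]

# The Mistral Instruct tokenizer ships with a chat template, so the [INST] ... [/INST]
# formatting is applied automatically instead of being written by hand.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```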
Since the tokenizer is already wired into the pipeline, we can now pass the prompt to the model to generate an answer.
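A minimal generation call might look like the following; the sampling settings are illustrative assumptions:

```python
outputs = pipe(
    prompt,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
)
print(outputs[0]["generated_text"])
```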
The output should be like this:
GGUF
GGUF allows users to run an LLM on the CPU while also offloading some of its layers to the GPU to speed up the process. Although CPU inference is generally slower than GPU inference, GGUF is an excellent format for those running models on CPUs or Apple devices.
Now, we’ll empty the VRAM cache by using the following code:
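A typical cleanup cell looks like this sketch (it assumes the `model` and `pipe` names from the previous step):

```python
import gc
import torch

# Drop references to the previous model and pipeline, then free cached GPU memory.
del model, pipe
gc.collect()
torch.cuda.empty_cache()
```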
After emptying the cache, we’ll install dependencies for GGUF quantization.
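One way to run GGUF files from Python is the ctransformers package; choosing it here is an assumption, as llama.cpp's own Python bindings would work too.

```python
# ctransformers built with CUDA support can offload GGUF layers to the GPU
!pip install -q ctransformers[cuda]
```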
We'll use the GGUF model published by 'TheBloke' on Hugging Face.
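A sketch of loading a GGUF file through ctransformers with its Transformers-compatible wrapper; the repository and the Q4_K_M file name are assumptions, and `gpu_layers` controls how many layers are offloaded to the GPU:

```python
from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer, pipeline

# Load a quantized GGUF file and offload layers to the GPU.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-GGUF",            # assumed repository
    model_file="mistral-7b-instruct-v0.1.Q4_K_M.gguf",   # assumed quantization variant
    model_type="mistral",
    gpu_layers=50,  # number of layers to offload to the GPU
    hf=True,        # expose a Transformers-compatible interface
)

# The tokenizer still comes from the original, unquantized repository.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
```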
We’ll pass the messages and use the tokenizer’s chat template as we did before.
Then, we’ll pass the prompt to the pipeline.
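Putting both steps together, a sketch with the same assumed question and sampling settings as before:

```python
messages = [
    {"role": "user", "content": "Tell me something about quantization of large language models."}
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.95)
print(outputs[0]["generated_text"])
```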
The output will look like this:
To see the time and VRAM consumed by this quantized model, we can measure both with the following code, where we load the model, build the prompt, and run the pipeline as we did before:
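A sketch of one way to do this; `torch.cuda.mem_get_info` reports device-level memory usage, which also covers allocations made by the GGUF backend outside PyTorch's allocator:

```python
import time
import torch
from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer, pipeline

start = time.time()

# Load the GGUF model and rebuild the pipeline (same assumed repo and file name as above).
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
    model_file="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    model_type="mistral",
    gpu_layers=50,
    hf=True,
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Build the prompt and run the pipeline.
messages = [
    {"role": "user", "content": "Tell me something about quantization of large language models."}
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.95)

elapsed = time.time() - start
free, total = torch.cuda.mem_get_info()

print(f"Time taken: {elapsed:.2f} s")
print(f"VRAM used:  {(total - free) / 1024**3:.2f} GB")
```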
The result will be like this:
GPTQ
GPTQ is a post-training quantization (PTQ) technique designed for 4-bit quantization, with a primary emphasis on GPU inference and performance.
The underlying idea of this method is to compress all weights to 4-bit precision by minimizing the mean squared error of each weight. During inference, the model dynamically de-quantizes its weights to float16, aiming to improve performance while keeping memory usage low.
Let’s start by installing dependencies for GPTQ quantization after emptying the cache.
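A sketch of the packages needed so that Transformers can load GPTQ checkpoints (the exact set is an assumption):

```python
# optimum + auto-gptq let Transformers load GPTQ-quantized checkpoints
!pip install -q transformers optimum auto-gptq accelerate
```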
To maintain the balance between compression and accuracy, we'll use the GPTQ version of the base model published by 'TheBloke' on Hugging Face.
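As a sketch, assuming the 'TheBloke' GPTQ repository for Mistral 7B Instruct (its default branch carries the 4-bit weights):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ"  # assumed repository

tokenizer = AutoTokenizer.from_pretrained(model_id)
# The GPTQ quantization config is stored in the repository, so a plain
# from_pretrained call is enough once optimum and auto-gptq are installed.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
```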
Then, we'll format the message with the tokenizer's chat template and, as before, pass the resulting prompt to the pipeline to generate a response.
We’ll use the same prompt as before for the output.
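As a sketch, with the same assumed question and sampling settings as the earlier examples:

```python
messages = [
    {"role": "user", "content": "Tell me something about quantization of large language models."}
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.95)
print(outputs[0]["generated_text"])
```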
The output should look like this.
Similarly, if we see the time and VRAM consumed by this quantized model, it will be:
AWQ
AWQ is a novel quantization method akin to GPTQ. While there are multiple distinctions between AWQ and GPTQ, a crucial divergence lies in AWQ's assumption that not all weights contribute equally to an LLM's performance.
In essence, AWQ skips quantizing a small fraction of salient weights, which mitigates quantization loss. Consequently, the authors report a noteworthy speed-up over GPTQ while maintaining comparable, and sometimes superior, quality.
Let’s install dependencies for AWQ after emptying the cache.
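One convenient way to serve AWQ checkpoints is vLLM, which ships with AWQ support; picking it here is an assumption, as AutoAWQ with Transformers would also work.

```python
# vLLM bundles an AWQ kernel, so a single package is enough for this sketch
!pip install -q vllm
```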
We'll use the AWQ Mistral model from Hugging Face. We'll define the sampling parameters and set the GPU memory utilization. If you're running out of memory, consider decreasing the GPU memory utilization.
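A sketch, assuming the 'TheBloke' AWQ repository for Mistral 7B Instruct; the sampling values, context length, and memory fraction are illustrative:

```python
from vllm import LLM, SamplingParams

# Illustrative sampling settings
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.1-AWQ",  # assumed repository
    quantization="awq",
    dtype="half",
    gpu_memory_utilization=0.95,  # lower this value if you run out of memory
    max_model_len=4096,
)
```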
As we did before, we'll pass the messages and apply the tokenizer's chat template to build the prompt.
Then we'll call the LLM and generate the output using the prompt and the sampling parameters we set earlier.
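A sketch that builds the prompt with the base model's tokenizer (an assumption; the AWQ repository's tokenizer would work as well) and then calls vLLM:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

messages = [
    {"role": "user", "content": "Tell me something about quantization of large language models."}
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# vLLM takes a list of prompts and returns one RequestOutput per prompt.
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```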
The output will look like this:
Similarly, if we see the time and VRAM consumed by this quantized model, it will be:
See the difference between the outputs generated, time taken, and VRAM usage by these quantized models!
Conclusion
Using these quantization approaches, we ran different quantized versions of Mistral 7B and compared the outputs they generated. As the quantized models are heavy, emptying the VRAM cache between runs was important.
You'll also see fluctuations in Disk, GPU RAM, and System RAM while running the quantized models on the E2E Cloud GPU. In our runs, AWQ performed well, and you can work with it without emptying the cache first. The output generated by this quantized model read better than those of the other models, the VRAM usage reported by our measurement was negligible, and the time taken was lower than that of the other quantized models.