Large Language Models (LLMs) have emerged as pivotal tools in natural language processing, revolutionizing our approach to solving complex problems at scale. However, harnessing their potential while ensuring optimal performance has remained a persistent challenge due to their computationally demanding nature. Enter Optimum-NVIDIA: an inference library from Hugging Face designed to dramatically accelerate LLM inference on the NVIDIA platform with minimal code changes.
Revolutionizing Inference Speeds
Optimum-NVIDIA represents a game-changing advancement in the realm of LLMs. Through a remarkably simple API tweak, modifying just one line of code, users can unlock up to 28 times faster inference speeds and achieve an impressive throughput of 1,200 tokens per second on the NVIDIA platform. This monumental leap in performance is made possible by leveraging the innovative float8 format supported on NVIDIA's Ada Lovelace and Hopper architectures, coupled with the formidable compilation capabilities of NVIDIA TensorRT-LLM software.
Seamless Integration and Enhanced Performance
Integrating Optimum-NVIDIA into your workflow is effortless. Using a pipeline from Optimum-NVIDIA, you can spin up Llama with blazingly fast inference in just three lines of code.
In the following tutorial, you’ll go through the step-by-step process to harness the power of Optimum-NVIDIA with Llama-2, showcasing its seamless integration and performance benefits.
Tutorial - Using Optimum-NVIDIA with Llama-2 on E2E Cloud
If you require extra GPU resources for the tutorial ahead, you can explore the offerings on E2E Cloud. E2E provides a diverse selection of GPUs, making it a suitable choice for advanced LLM-based applications.
To get one, head over to MyAccount and sign up. Then launch a GPU node as shown in the screenshot below:
Make sure you add your SSH keys during launch, or through the Security tab after launching.
Once you have launched a node, you can use the VS Code Remote Explorer to SSH into the node and use it as a local development environment.
Installation
There is currently no `pip` support for `optimum-nvidia` as transitive dependencies are missing on PyPI.
To get started with Optimum-NVIDIA, you can:
- Pull the pre-built docker container `huggingface/optimum-nvidia`.
- Build the docker container locally.
Docker Installation
To get started quickly, use the pre-built Docker container available on the Hugging Face Docker Hub:
docker pull huggingface/optimum-nvidia
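Once the image is pulled, you can start an interactive container with GPU access. The flags below are a typical invocation rather than an official recipe, and assume the NVIDIA Container Toolkit is installed on the host:

```bash
# Start an interactive session with all GPUs exposed to the container
docker run -it --rm --gpus all --ipc=host huggingface/optimum-nvidia
```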
Note: An Optimum-NVIDIA package installable via pip is yet to be released.
Building Docker Container Locally
If you want to build your own image and/or customize it, you can do so with the three-step process described below; a command-line sketch covering all three steps follows the list.
1. Clone the `optimum-nvidia` repository.
2. Build the `tensorrt_llm:latest` image from the NVIDIA TensorRT-LLM repository. If you cloned `optimum-nvidia` in the step above, you can run this build from the root of the `optimum-nvidia` repository.
Here, `CUDA_ARCHS` is a comma-separated list of the CUDA architectures you'd like to support. For instance, here are a few example values:
- 90-real : H100/H200
- 89-real : L4/L40/L40s/RTX Ada/RTX 4090
- 86-real : A10/A40/RTX Ax000
- 80-real : A100/A30
- 75-real : T4/RTX Quadro
- 70-real : V100
3. Finally, build the `huggingface/optimum-nvidia` Docker image on top of the `tensorrt_llm` layer.
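The commands below sketch these three steps. They are based on the layout of the `optimum-nvidia` and TensorRT-LLM repositories at the time of writing; the exact Makefile target, submodule path, and Dockerfile location are assumptions you should verify against the repositories' READMEs.

```bash
# 1. Clone the repository together with its TensorRT-LLM submodule
git clone --recursive --depth=1 https://github.com/huggingface/optimum-nvidia.git
cd optimum-nvidia

# 2. Build the tensorrt_llm:latest image, passing the CUDA architectures to support
#    (the third-party/tensorrt-llm submodule path and release_build target are assumptions)
make -C third-party/tensorrt-llm/docker release_build CUDA_ARCHS="<your-archs>"

# 3. Build the huggingface/optimum-nvidia image on top of the tensorrt_llm layer
#    (the docker/Dockerfile path is an assumption)
docker build -t huggingface/optimum-nvidia -f docker/Dockerfile .
```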
Building from Source
Alternatively, you can build Optimum-NVIDIA from source:
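As a minimal sketch, and assuming TensorRT-LLM and its CUDA dependencies are already available in your environment (for example, inside the TensorRT-LLM container built above), a source build looks roughly like this; consult the repository README for the authoritative steps:

```bash
# Clone with submodules and install the Python package in editable mode
git clone --recursive --depth=1 https://github.com/huggingface/optimum-nvidia.git
cd optimum-nvidia
python -m pip install -e .  # assumes TensorRT-LLM is already installed
```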
Note: You may use any one of the installation options above to get started with Optimum-NVIDIA.
Quickstart Guide
Pipelines
Hugging Face pipelines offer a straightforward way to set up inference. Transitioning from existing Transformers code to leverage Optimum-NVIDIA's performance boost is remarkably simple:
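The snippet below is a minimal sketch based on the examples in the Hugging Face announcement blog and repository. The model ID is illustrative (Llama-2 weights are gated and require access approval), and the single changed line is the pipeline import:

```python
# Before: from transformers.pipelines import pipeline
from optimum.nvidia.pipelines import pipeline  # the one-line change

# use_fp8=True enables float8 execution on Ada Lovelace / Hopper GPUs
pipe = pipeline("text-generation", "meta-llama/Llama-2-7b-chat-hf", use_fp8=True)

print(pipe("Describe a real-world application of AI in sustainable energy."))
```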
Generate API
For more control over advanced features such as quantization and token selection strategies, you can use the generate() API:
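The sketch below is adapted from the generate() example in the Hugging Face blog post. Parameter values are illustrative, and the exact return format of generate() may differ between versions, so verify against the repository documentation:

```python
from transformers import AutoTokenizer
from optimum.nvidia import AutoModelForCausalLM  # drop-in replacement for the transformers class

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    use_fp8=True,  # a single flag enables FP8 quantization
)

model_inputs = tokenizer(
    ["How is AI transforming urban planning?"], return_tensors="pt"
).to("cuda")

# Token-selection strategies are controlled through the usual sampling parameters
generated_ids, generated_lengths = model.generate(
    **model_inputs,
    top_k=40,
    top_p=0.7,
    repetition_penalty=1.1,
)

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
```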
Enabling FP8 quantization with a single flag (the `use_fp8=True` argument shown above) allows larger models to run on a single GPU at accelerated speeds without compromising accuracy. This quantization feature comes with a predefined calibration strategy, although users retain the flexibility to customize the tokenization and calibration datasets to tailor the quantization process to their specific use cases.
For power users seeking granular control over sampling parameters, the Model API provides an avenue to fine-tune the settings.
Boosting LLM Inference with Llama-2
Llama-2, an LLM known for its intricate architecture and state-of-the-art language understanding, serves as an excellent example to demonstrate the efficacy of the Optimum-NVIDIA library in enhancing inference speed.
- Parallel Computation: The Optimum-NVIDIA library harnesses the parallel processing capabilities of NVIDIA GPUs, enabling simultaneous computation of multiple operations within the Llama-2 model. This parallelism drastically reduces inference time by executing tasks concurrently.
- Precision Optimization: Through optimized numerical precision handling, the library strikes a balance between computational accuracy and speed. By employing mixed-precision arithmetic, it performs computations using reduced precision where possible, accelerating the overall inference process.
- Memory Optimization: Efficient memory utilization is critical for speeding up inference. The library employs techniques like memory reuse and allocation strategies tailored for NVIDIA GPUs, effectively minimizing memory overhead and maximizing throughput.
- Kernel Fusion and Optimization: The library merges multiple neural network operations into a single optimized kernel, eliminating redundant computations and reducing the number of memory accesses. This fusion improves GPU utilization and further accelerates inference.
Performance Validation
Optimum-NVIDIA delivers top-notch inference performance on the NVIDIA platform through the familiar Hugging Face APIs. By modifying only one line in your existing Transformers code, you can run Llama-2 at 1,200 tokens per second, up to 28 times faster than the stock Transformers library.
Evaluating LLM performance revolves around two critical metrics: First Token Latency and Throughput. Optimum-NVIDIA sets new benchmarks by delivering up to 3.3 times faster First Token Latency than stock Transformers, ensuring a more responsive user experience.
Moreover, when considering throughput, the crucial metric for batched generation, Optimum-NVIDIA shines with up to 28 times better performance than stock Transformers. These results underscore the library's capability to significantly enhance processing speed and efficiency.
Embracing Future Advancements
The impact of Optimum-NVIDIA is poised to elevate further with the advent of NVIDIA's H200 Tensor Core GPU, promising an additional 2x boost in throughput for Llama models. As these GPUs become more accessible, we anticipate sharing performance data showcasing Optimum-NVIDIA's prowess on these cutting-edge platforms.
The Journey Ahead
Currently tailored for the LLaMAForCausalLM architecture and task, Optimum-NVIDIA continues to evolve, extending support to encompass various text generation model architectures and tasks available within Hugging Face. Future iterations are set to introduce groundbreaking optimization techniques like In-Flight Batching to streamline throughput for streaming prompts and INT4 quantization, enabling the execution of even larger models on a single GPU.
Conclusion
Optimum-NVIDIA represents a significant leap in accelerating LLM inference, and integrating it into your workflow is straightforward. By following this tutorial, you've learned how to seamlessly transition from existing Transformers code to harness the performance benefits of Optimum-NVIDIA with Llama-2, whether through pipelines or the generate() API.
Experiment with various models and tasks, explore the additional capabilities provided by Optimum-NVIDIA, and optimize your LLM workflows for speed and efficiency. Remember to share your feedback and experiences with the Optimum-NVIDIA repository – it's a collaborative effort to unlock the full potential of LLMs!
Reference
- Hugging Face Blog: https://huggingface.co/blog/optimum-nvidia
- GitHub Repo: https://github.com/huggingface/optimum-nvidia