Large Language Models (LLMs) have emerged as pivotal tools in natural language processing, revolutionizing our approach to solving complex problems at scale. However, harnessing their potential while ensuring optimal performance has remained a persistent challenge due to their computationally demanding nature. Enter Optimum-NVIDIA: an inference library from Hugging Face designed to dramatically accelerate LLM inference on the NVIDIA platform with minimal code changes.
Revolutionizing Inference Speeds
Optimum-NVIDIA represents a game-changing advancement in the realm of LLMs. Through a remarkably simple API tweak, modifying just one line of code, users can unlock up to 28 times faster inference speeds and achieve an impressive throughput of 1,200 tokens per second on the NVIDIA platform. This monumental leap in performance is made possible by leveraging the innovative float8 format supported on NVIDIA's Ada Lovelace and Hopper architectures, coupled with the formidable compilation capabilities of NVIDIA TensorRT-LLM software.
Seamless Integration and Enhanced Performance
Integrating Optimum-NVIDIA into your workflow is effortless. Using a pipeline from Optimum-NVIDIA, you can spin up Llama with blazingly fast inference in just three lines of code.
In the following tutorial, you’ll go through the step-by-step process to harness the power of Optimum-NVIDIA with Llama-2, showcasing its seamless integration and performance benefits.
Tutorial - Using Optimum-NVIDIA with Llama-2 on E2E Cloud
If you require extra GPU resources for the tutorial ahead, you can explore the offerings on E2E Cloud. E2E provides a diverse selection of GPUs, making it a suitable choice for advanced LLM-based applications.
To get one, head over to MyAccount and sign up. Then launch a GPU node as shown in the screenshot below:
Make sure you add your SSH keys during launch, or through the Security tab after launching.
Once you have launched a node, you can use the VS Code Remote Explorer to SSH into the node and use it as a local development environment.
Installation
There is currently no `pip` support for `optimum-nvidia` as transitive dependencies are missing on PyPI.
To get started with Optimum-NVIDIA, you can:
- Pull the pre-built docker container `huggingface/optimum-nvidia`.
- Build the docker container locally.
Docker Installation
To get started quickly, use the pre-built Docker container available on the Hugging Face Docker Hub:
docker pull huggingface/optimum-nvidia
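Once the image is pulled, you can start an interactive container with GPU access. The flags below are a typical invocation rather than an official recipe, and assume the NVIDIA Container Toolkit is installed on the host:

```bash
# Start an interactive session with all GPUs exposed to the container
docker run -it --rm --gpus all --ipc=host huggingface/optimum-nvidia
```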
Note: An Optimum-NVIDIA package installable via pip is yet to be released.
Building Docker Container Locally
If you want to build your own image and/or customize it, you can do so with the three-step process described below; a command-line sketch covering all three steps follows the list.
1. Clone the `optimum-nvidia` repository.
2. Build the `tensorrt_llm:latest` image from the NVIDIA TensorRT-LLM repository. If you cloned `optimum-nvidia` in the step above, you can run this build from the root of the `optimum-nvidia` repository.
Here, `CUDA_ARCHS` is a comma-separated list of the CUDA architectures you'd like to support. For instance, here are a few example values:
- 90-real : H100/H200
- 89-real : L4/L40/L40s/RTX Ada/RTX 4090
- 86-real : A10/A40/RTX Ax000
- 80-real : A100/A30
- 75-real : T4/RTX Quadro
- 70-real : V100
3. Finally, build the `huggingface/optimum-nvidia` Docker image on top of the `tensorrt_llm` layer.
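The commands below sketch these three steps. They are based on the layout of the `optimum-nvidia` and TensorRT-LLM repositories at the time of writing; the exact Makefile target, submodule path, and Dockerfile location are assumptions you should verify against the repositories' READMEs.

```bash
# 1. Clone the repository together with its TensorRT-LLM submodule
git clone --recursive --depth=1 https://github.com/huggingface/optimum-nvidia.git
cd optimum-nvidia

# 2. Build the tensorrt_llm:latest image, passing the CUDA architectures to support
#    (the third-party/tensorrt-llm submodule path and release_build target are assumptions)
make -C third-party/tensorrt-llm/docker release_build CUDA_ARCHS="<your-archs>"

# 3. Build the huggingface/optimum-nvidia image on top of the tensorrt_llm layer
#    (the docker/Dockerfile path is an assumption)
docker build -t huggingface/optimum-nvidia -f docker/Dockerfile .
```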
Building from Source
Alternatively, you can build Optimum-NVIDIA from source:
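As a minimal sketch, and assuming TensorRT-LLM and its CUDA dependencies are already available in your environment (for example, inside the TensorRT-LLM container built above), a source build looks roughly like this; consult the repository README for the authoritative steps:

```bash
# Clone with submodules and install the Python package in editable mode
git clone --recursive --depth=1 https://github.com/huggingface/optimum-nvidia.git
cd optimum-nvidia
python -m pip install -e .  # assumes TensorRT-LLM is already installed
```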
Note: You may use any one of the installation options above to get started with Optimum-NVIDIA.
Quickstart Guide
Pipelines
Hugging Face pipelines offer a straightforward way to set up inference. Transitioning from existing Transformers code to leverage Optimum-NVIDIA's performance boost is remarkably simple:
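The snippet below is a minimal sketch based on the examples in the Hugging Face announcement blog and repository. The model ID is illustrative (Llama-2 weights are gated and require access approval), and the single changed line is the pipeline import:

```python
# Before: from transformers.pipelines import pipeline
from optimum.nvidia.pipelines import pipeline  # the one-line change

# use_fp8=True enables float8 execution on Ada Lovelace / Hopper GPUs
pipe = pipeline("text-generation", "meta-llama/Llama-2-7b-chat-hf", use_fp8=True)

print(pipe("Describe a real-world application of AI in sustainable energy."))
```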
Generate API
For more control over advanced features such as quantization and token selection strategies, you can use the generate() API:
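The sketch below is adapted from the generate() example in the Hugging Face blog post. Parameter values are illustrative, and the exact return format of generate() may differ between versions, so verify against the repository documentation:

```python
from transformers import AutoTokenizer
from optimum.nvidia import AutoModelForCausalLM  # drop-in replacement for the transformers class

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    use_fp8=True,  # a single flag enables FP8 quantization
)

model_inputs = tokenizer(
    ["How is AI transforming urban planning?"], return_tensors="pt"
).to("cuda")

# Token-selection strategies are controlled through the usual sampling parameters
generated_ids, generated_lengths = model.generate(
    **model_inputs,
    top_k=40,
    top_p=0.7,
    repetition_penalty=1.1,
)

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
```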
Enabling FP8 quantization with a single flag (the `use_fp8=True` argument shown above) allows larger models to run on a single GPU at accelerated speeds without compromising accuracy. This quantization feature comes with a predefined calibration strategy, although users retain the flexibility to customize the tokenization and calibration datasets to tailor the quantization process to their specific use cases.
For power users seeking granular control over sampling parameters, the Model API provides an avenue to fine-tune the settings.
Boosting LLM Inference with Llama-2
Llama-2, an LLM known for its intricate architecture and state-of-the-art language understanding, serves as an excellent example to demonstrate the efficacy of the Optimum-NVIDIA library in enhancing inference speed.
- Parallel Computation: The Optimum-NVIDIA library harnesses the parallel processing capabilities of NVIDIA GPUs, enabling simultaneous computation of multiple operations within the Llama-2 model. This parallelism drastically reduces inference time by executing tasks concurrently.
- Precision Optimization: Through optimized numerical precision handling, the library strikes a balance between computational accuracy and speed. By employing mixed-precision arithmetic, it performs computations using reduced precision where possible, accelerating the overall inference process.
- Memory Optimization: Efficient memory utilization is critical for speeding up inference. The library employs techniques like memory reuse and allocation strategies tailored for NVIDIA GPUs, effectively minimizing memory overhead and maximizing throughput.
- Kernel Fusion and Optimization: The library merges multiple neural network operations into a single optimized kernel, eliminating redundant computations and reducing the number of memory accesses. This fusion improves GPU utilization and further accelerates inference.
Performance Validation
Optimum-NVIDIA delivers top-notch inference performance on the NVIDIA platform through the familiar Hugging Face APIs. By modifying only one line in your existing Transformers code, you can run Llama-2 at 1,200 tokens per second, up to 28 times faster than the stock Transformers library.
Evaluating LLM performance revolves around two critical metrics: First Token Latency and Throughput. Optimum-NVIDIA sets new benchmarks by delivering up to 3.3 times faster First Token Latency than stock Transformers, ensuring a more responsive user experience.
Moreover, when considering throughput, the crucial metric for batched generation, Optimum-NVIDIA shines with up to 28 times better performance than stock Transformers. These results underscore the library's capability to significantly enhance processing speed and efficiency.
Embracing Future Advancements
The impact of Optimum-NVIDIA is poised to elevate further with the advent of NVIDIA's H200 Tensor Core GPU, promising an additional 2x boost in throughput for Llama models. As these GPUs become more accessible, we anticipate sharing performance data showcasing Optimum-NVIDIA's prowess on these cutting-edge platforms.
The Journey Ahead
Currently tailored for the LLaMAForCausalLM architecture and task, Optimum-NVIDIA continues to evolve, extending support to encompass various text generation model architectures and tasks available within Hugging Face. Future iterations are set to introduce groundbreaking optimization techniques like In-Flight Batching to streamline throughput for streaming prompts and INT4 quantization, enabling the execution of even larger models on a single GPU.
Conclusion
Optimum-NVIDIA represents a significant leap in accelerating LLM inference, and integrating it into your workflow is straightforward. By following this tutorial, you've learned how to seamlessly transition from existing Transformers code to harness the performance benefits of Optimum-NVIDIA with Llama-2, whether through pipelines or the generate() API.
Experiment with various models and tasks, explore the additional capabilities provided by Optimum-NVIDIA, and optimize your LLM workflows for speed and efficiency. Remember to share your feedback and experiences with the Optimum-NVIDIA repository – it's a collaborative effort to unlock the full potential of LLMs!
Reference
- Hugging Face Blog: https://huggingface.co/blog/optimum-nvidia
- GitHub Repo: https://github.com/huggingface/optimum-nvidia