Large Language Models have become the talk of the town. Every day brings new progress and innovation: new versions of GPT are being released, Meta is in the limelight with Llama-2, and in parallel a large number of counterparts and variations are being launched, including in multiple languages. Foundation models can now handle both text and images with ease.
Open-source LLMs are democratizing AI and come with a lot of promise. We can count on them for data privacy, transparency, customizability, and low cost. But the fundamental issue with LLMs is their huge size and appetite for computational power. Firms and individuals outside large enterprises may not be able to afford such heavy infrastructure for their AI models, but there are workarounds. With this constraint in mind, let's explore some frameworks and methods for LLM serving and inference that help us use these models seamlessly.
Prerequisites
To conduct the following experiments, first sign up on E2E Cloud. Once registered, go to the 'Compute' tab on the left and spin up a GPU compute resource.
Once you have launched the GPU node, you can add your SSH keys and connect to it. Then follow the steps below to test the various approaches.
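For example, once the node is up, you can connect to it over SSH (the user and IP below are placeholders for your node's details):
$ ssh <user>@<node-public-ip>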
Hugging Face Endpoints
This may be the simplest way to make use of LLMs. It is particularly useful when we are building simple systems or testing models. Hugging Face has made it straightforward by providing the necessary documentation on each model's page, and short code snippets are enough to get a model running. Here is an example inference snippet for the Llama-2 chat model.
Usage:
Install transformers and login to Hugging Face:
$ pip install transformers
$ huggingface-cli login
Import libraries, load and prompt the model.
from transformers import AutoTokenizer
import transformers
import torch

model = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model)

# Build a text-generation pipeline that places the model on available GPUs.
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Generate a completion for the prompt.
sequences = pipeline(
    'Tell me which is the best large language model.\n',
    do_sample=True,
    top_k=5,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=350,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
As simple as that!
Pros:
- Best option for beginners and research purposes.
- Detailed and well-organized documentation
- With its robust community support and widespread popularity, Hugging Face has become the go-to platform for machine learning developers and researchers.
- Big companies support this platform and use it to release their models as open source.
- It can easily be used alongside almost all other frameworks.
Cons:
- While Hugging Face does offer paid API endpoints in the cloud, the usual workflow involves downloading a model to your own machine and running it there. Efficiency is therefore compromised, and a fairly powerful machine is required; a sketch of the hosted Inference API is shown below for comparison.
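For reference, here is a minimal sketch of querying Hugging Face's hosted Inference API over plain HTTP; the token is a placeholder, and the model ID is simply reused from the example above:
import requests

API_URL = "https://api-inference.huggingface.co/models/meta-llama/Llama-2-7b-chat-hf"
headers = {"Authorization": "Bearer <your_hf_token>"}  # placeholder access token

# Send the prompt to the hosted endpoint instead of running the model locally.
response = requests.post(
    API_URL,
    headers=headers,
    json={"inputs": "Tell me which is the best large language model."},
)
print(response.json())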
vLLM
vLLM is a fast and simple framework for LLM inference and serving. It provides high-throughput serving and support for distributed inference, and it offers seamless integration with Hugging Face models as well as an OpenAI-compatible API server. Outstanding features include continuous batching and PagedAttention. As their tagline says, it is indeed easy, fast and cheap LLM serving for everyone.
Usage:
It can be deployed as an API service. Here is the FastAPI server code and client code from the official documentation.
Install CUDA on the system if not installed already.
$ sudo apt install nvidia-cuda-toolkit
To start the server:
$ pip install vllm
$ python -m vllm.entrypoints.api_server
Query the hosted model using:
$ curl http://localhost:8000/generate \
    -d '{
        "prompt": "Joe Biden is the",
        "use_beam_search": true,
        "n": 4,
        "temperature": 0
    }'
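vLLM also exposes the OpenAI-compatible server mentioned above. Here is a minimal sketch of starting and querying it; the model name is only an example:
$ python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-chat-hf
$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": "Joe Biden is the",
        "max_tokens": 64
    }'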
We can also use it for offline batched inference. Here is an example of text completion:
from vllm import LLM, SamplingParams

prompts = [
    "Joe Biden is the",
    "The best large language model is",
]
Define the sampling parameters, load the model and generate:
sampling_params = SamplingParams(temperature=0.7, top_p=0.95)
llm = LLM(model="meta-llama/Llama-2-13b-hf")
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
To utilize multiple GPUs, we can deploy models to any cloud using SkyPilot.
$ pip install skypilot
$ sky check
Create a serving.yaml file as shown below. Here we request an A100 GPU for the Llama 13B model we want to deploy.
resources:
  accelerators: A100

envs:
  MODEL_NAME: decapoda-research/llama-13b-hf
  TOKENIZER: hf-internal-testing/llama-tokenizer

setup: |
  conda create -n vllm python=3.9 -y
  conda activate vllm
  git clone https://github.com/vllm-project/vllm.git
  cd vllm
  pip install .
  pip install gradio

run: |
  conda activate vllm
  echo 'Starting vllm api server...'
  python -u -m vllm.entrypoints.api_server \
      --model $MODEL_NAME \
      --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
      --tokenizer $TOKENIZER 2>&1 | tee api_server.log &
  echo 'Waiting for vllm api server to start...'
  while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done
  echo 'Starting gradio server...'
  python vllm/examples/gradio_webserver.py
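With serving.yaml in place, the job can be launched on the cloud with SkyPilot's launch command, as described in the vLLM documentation referenced below:
$ sky launch serving.yaml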
Pros:
- If inference speed is your priority, vLLM is the best option. Techniques like PagedAttention and continuous batching accelerate inference like few other frameworks.
- High-throughput serving
- It has integrations with Hugging Face Transformers and an OpenAI-compatible API
- For better deployment and scaling on any cloud platform, it has native integration with the SkyPilot framework.
Cons:
- Currently, it does not support quantization, so it is not efficient in terms of memory usage.
- Currently, it does not support LoRA, QLoRA or other adapters for LLMs.
OpenLLM
OpenLLM is a platform for packaging LLMs for production. It integrates seamlessly with leading tools like BentoML, Hugging Face and LangChain. It allows deployment to the cloud or to on-premise machines, and supports Docker containerization when used with platforms like BentoML. It works with state-of-the-art models like Falcon, Flan-T5, StarCoder and many more.
Usage:
For quick inference through a server, install the openllm library:
$ pip install openllm
To start the server:
$ openllm start opt
Alternatively, to specify the model, pass the model ID and parameters:
$ openllm start opt --model-id facebook/opt-2.7b \
    --max-new-tokens 200 \
    --temperature 0.2 \
    --api-workers 1 \
    --workers-per-resource 2
Query the hosted model using curl or the built-in Python client:
import openllm
client = openllm.client.HTTPClient('http://localhost:3000')
client.query('Which is the best large language model?')
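For the curl route, a rough sketch would look like the following; the /v1/generate endpoint and the payload fields are based on OpenLLM's documented examples and may vary between versions:
$ curl -X POST http://localhost:3000/v1/generate \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Which is the best large language model?", "llm_config": {"max_new_tokens": 128}}'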
To get more control over the server, we can write our own BentoML service.
from __future__ import annotations

import bentoml
import openllm

# specify model name here
model = ""
llm_runner = openllm.Runner(model)

# specify service name here
svc = bentoml.Service(name="", runners=[llm_runner])

@svc.on_startup
def download(_: bentoml.Context):
    llm_runner.download_model()

@svc.api(input=bentoml.io.Text(), output=bentoml.io.Text())
async def prompt(input_text: str) -> str:
    answer = await llm_runner.generate.async_run(input_text)
    return answer[0]["generated_text"]
To start the service run:
$ bentoml serve service:svc
If required, we can also containerize and deploy the LLM application using BentoML, as sketched below.
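A rough sketch of that workflow, assuming a bentofile.yaml describing service:svc exists in the project directory (the Bento tag below is illustrative):
$ bentoml build
$ bentoml containerize <bento_name>:latest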
Pros:
- It offers quantization techniques for LLMs with bitsandbytes and GPTQ.
- There are options to modify the models: experimental fine-tuning functionalities are available, and it supports plugins such as adapters for LLMs.
- Integration with BentoML and BentoCloud offers wonderful options for deployment and scaling. BentoML enables us to dockerize the application.
- It has integrations with Hugging Face Agents framework and LangChain.
Cons:
- It does not have built-in support for distributed inference.
Ray Serve
Ray is a complete toolkit that includes libraries for building end-to-end ML pipelines. One of these libraries, Ray Serve, is a scalable model serving tool that can be used to build inference APIs. It supports complex deep learning models built with TensorFlow or PyTorch and has special optimizations for LLMs, such as response streaming, dynamic request batching and batched inference.
Usage:
Here is a sample code snippet for serving a Llama 13B model.
import requests
from starlette.requests import Request
from typing import Dict

from ray import serve
from transformers import AutoModelForCausalLM, AutoTokenizer

# Define a Ray Serve deployment for the Hugging Face Llama model.
@serve.deployment(route_prefix="/llamas")
class LlamaModelDeployment:
    def __init__(self, llama_model, tokenizer):
        self.llama_model = llama_model
        self.tokenizer = tokenizer

    def __call__(self, request: Request) -> Dict:
        # Tokenize the prompt, generate a completion and decode it back to text.
        prompt = request.query_params["prompt"]
        input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids
        output_ids = self.llama_model.generate(input_ids, max_new_tokens=128)
        prediction = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
        return {"prediction": prediction}

# Load the Hugging Face Llama model and its tokenizer.
llama_model = AutoModelForCausalLM.from_pretrained("decapoda-research/llama-13b-hf")
tokenizer = AutoTokenizer.from_pretrained("decapoda-research/llama-13b-hf")

# Bind the deployment to the model and tokenizer.
app = LlamaModelDeployment.bind(llama_model=llama_model, tokenizer=tokenizer)

# Deploy the application.
serve.run(app)
For inference, query the /llamas endpoint using HTTP requests:
response = requests.get(
    "http://localhost:8000/llamas",
    params={"prompt": "Which is the best large language model?"},
)

# The response contains a JSON object with the prediction.
prediction = response.json()["prediction"]
Pros:
- Dynamically scalable to many machines, adjusting the resources required for the model.
- It is optimized for LLMs with features like response streaming, dynamic request batching and multi-GPU serving for compute-intensive LLMs; see the sketch after this list.
- It has native integration with LangChain.
Cons:
- It is not the best option for beginners, as it bundles many advanced features for all kinds of ML models.
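To illustrate the dynamic request batching mentioned above, here is a minimal sketch using Ray Serve's serve.batch decorator. The deployment name, batch sizes and the dummy generation step are illustrative assumptions, not part of the original example:
from typing import List

from starlette.requests import Request
from ray import serve

@serve.deployment
class BatchedLlama:
    # Collect up to 8 concurrent requests (or whatever arrives within 0.1 s)
    # and handle them as a single batch.
    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.1)
    async def handle_batch(self, prompts: List[str]) -> List[str]:
        # Run the underlying model once over the whole batch of prompts here.
        return [f"generated text for: {p}" for p in prompts]

    async def __call__(self, request: Request) -> str:
        # Each individual request is transparently merged into a batch.
        return await self.handle_batch(request.query_params["prompt"])

app = BatchedLlama.bind()
serve.run(app)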
More About E2E Cloud Platform
E2E Networks is a leading provider of advanced cloud GPUs in India, offering a high-value infrastructure cloud for building and launching machine learning applications. Our cloud platform supports a variety of workloads across industry verticals, providing high-performance computing in the cloud. We offer a flagship machine learning platform called Tir, which is built on top of Jupyter Notebook. E2E Networks also provides reliable and cost-effective storage solutions, allowing businesses to store and access their data from anywhere. Our cloud platform is trusted by over 10,000 clients and is designed for real-world use cases in domains such as Data Science, NLP, Computer Vision/Image Processing, HealthTech, ConsumerTech, and more.
Tir: E2E Cloud's Flagship Machine Learning Platform
Tir is built on top of Jupyter Notebook, an advanced web-based interactive development environment offered by E2E Cloud. It provides users with the latest features and functionalities, including JupyterLab, a cutting-edge interface for working with notebooks, code, and data. Tir allows easy integration with other E2E services to build and deploy a complete pipeline, and offers interfaces to test and decide which model to use for your task.
E2E Cloud Models
E2E Networks offers a wide range of options for machine learning applications. There are deployment-ready models for tasks across Data Science, NLP, Computer Vision/Image Processing, HealthTech, ConsumerTech, and more. Other offerings include embedding models that convert text into embeddings, which are widely used for search, personalization and recommendation tasks.
Pros:
- Highly scalable infrastructure
- Reliable for production-level purposes
- We are growing into an end-to-end platform for machine learning models.
Conclusion
The selection of a framework is contingent on several factors, including the specific task you're tackling, the Large Language Model you're using, and the expenses you're prepared to bear. This is not an exhaustive list, but these are some of the currently available frameworks worth trying first for LLM inference and serving. To sum up: Hugging Face is the easiest starting point, vLLM shines when inference speed and throughput matter most, OpenLLM with BentoML is a strong choice for packaging and deploying models, and Ray Serve fits large, scalable production deployments.
References
https://vllm.readthedocs.io/en/latest/serving/run_on_sky.html
https://github.com/bentoml/OpenLLM
https://betterprogramming.pub/frameworks-for-serving-llms-60b7f7b23407
https://docs.ray.io/en/latest/serve/index.html