Large Language Models have become the talk of the town. Every day brings new progress and innovation: new versions of GPT are being released, Meta is in the limelight with Llama-2, and in parallel a large number of counterparts and variations are being launched, including in multiple languages. Foundation models can now handle both text and images with ease.
Open-source LLMs are democratizing AI and come with a lot of promise. We can count on them for data privacy, transparency, customizability, and low cost. But the fundamental issue with LLMs is their huge size and appetite for computational power. Firms and individuals outside large enterprises may not be able to afford such heavy infrastructure for their AI models, but there are workarounds. With this constraint in mind, let's explore some frameworks and methods for LLM serving and inference that help us use these models seamlessly.
Prerequisites
To conduct the following experiments, first sign up on E2E Cloud. Once registered, go to the 'Compute' tab on the left and spin up a GPU compute resource.
Once you have launched the GPU node, you can add your SSH keys and connect to it. Then follow the steps below to test the various approaches.
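For example, once the node is up, you can connect to it over SSH (the user and IP below are placeholders for your node's details):
$ ssh <user>@<node-public-ip>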
Hugging Face Endpoints
This may be the simplest way to make use of LLMs. It is particularly useful when we are building simple systems or testing models. Hugging Face has made it straightforward by providing the necessary documentation on each model's page, and short code snippets are enough to get a model running. Here is an example inference snippet for the Llama-2 chat model.
Usage:
Install transformers and login to Hugging Face:
$ pip install transformers
$ huggingface-cli login
Import libraries, load and prompt the model.
from transformers import AutoTokenizer
import transformers
import torch

model = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model)

# Build a text-generation pipeline that places the model on available GPUs.
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Generate a completion for the prompt.
sequences = pipeline(
    'Tell me which is the best large language model.\n',
    do_sample=True,
    top_k=5,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=350,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
As simple as that!
Pros:
- Best option for beginners and research purposes.
- Detailed and well-organized documentation
- With its robust community support and widespread popularity, Hugging Face has become the go-to platform for machine learning developers and researchers.
- Big companies support this platform and use it to release their models as open source.
- It can easily be used alongside almost all other frameworks.
Cons:
- While Hugging Face does offer paid API endpoints in the cloud, the usual workflow involves downloading a model to your own machine and running it there. Efficiency is therefore compromised, and a fairly powerful machine is required; a sketch of the hosted Inference API is shown below for comparison.
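For reference, here is a minimal sketch of querying Hugging Face's hosted Inference API over plain HTTP; the token is a placeholder, and the model ID is simply reused from the example above:
import requests

API_URL = "https://api-inference.huggingface.co/models/meta-llama/Llama-2-7b-chat-hf"
headers = {"Authorization": "Bearer <your_hf_token>"}  # placeholder access token

# Send the prompt to the hosted endpoint instead of running the model locally.
response = requests.post(
    API_URL,
    headers=headers,
    json={"inputs": "Tell me which is the best large language model."},
)
print(response.json())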
vLLM
vLLM is a fast and simple framework for LLM inference and serving. It provides high-throughput serving and support for distributed inference, and it offers seamless integration with Hugging Face models as well as an OpenAI-compatible API server. Outstanding features include continuous batching and PagedAttention. As their tagline says, it is indeed easy, fast and cheap LLM serving for everyone.
Usage:
It can be deployed as an API service. Here is the FastAPI server code and client code from the official documentation.
Install CUDA on the system if not installed already.
$ sudo apt install nvidia-cuda-toolkit
To start the server:
$ pip install vllm
$ python -m vllm.entrypoints.api_server
Query the hosted model using:
$ curl http://localhost:8000/generate \
    -d '{
        "prompt": "Joe Biden is the",
        "use_beam_search": true,
        "n": 4,
        "temperature": 0
    }'
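vLLM also exposes the OpenAI-compatible server mentioned above. Here is a minimal sketch of starting and querying it; the model name is only an example:
$ python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-chat-hf
$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": "Joe Biden is the",
        "max_tokens": 64
    }'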
We can also use it for offline batched inference. Here is an example of text completion:
from vllm import LLM, SamplingParams

prompts = [
    "Joe Biden is the",
    "The best large language model is",
]
Define the sampling parameters, load the model and generate:
sampling_params = SamplingParams(temperature=0.7, top_p=0.95)
llm = LLM(model="meta-llama/Llama-2-13b-hf")
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
To utilize multiple GPUs, we can deploy models to any cloud using SkyPilot.
$ pip install skypilot
$ sky check
Create a serving.yaml file as shown below. Here we request an A100 GPU for the Llama 13B model we want to deploy.
resources:
  accelerators: A100

envs:
  MODEL_NAME: decapoda-research/llama-13b-hf
  TOKENIZER: hf-internal-testing/llama-tokenizer

setup: |
  conda create -n vllm python=3.9 -y
  conda activate vllm
  git clone https://github.com/vllm-project/vllm.git
  cd vllm
  pip install .
  pip install gradio

run: |
  conda activate vllm
  echo 'Starting vllm api server...'
  python -u -m vllm.entrypoints.api_server \
      --model $MODEL_NAME \
      --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
      --tokenizer $TOKENIZER 2>&1 | tee api_server.log &
  echo 'Waiting for vllm api server to start...'
  while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done
  echo 'Starting gradio server...'
  python vllm/examples/gradio_webserver.py
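With serving.yaml in place, the job can be launched on the cloud with SkyPilot's launch command, as described in the vLLM documentation referenced below:
$ sky launch serving.yaml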
Pros:
- If inference speed is your priority, vLLM is the best option. Techniques like PagedAttention and continuous batching accelerate inference like few other frameworks.
- High-throughput serving
- It has integrations with Hugging Face Transformers and an OpenAI-compatible API
- For better deployment and scaling on any cloud platform, it has native integration with the SkyPilot framework.
Cons:
- Currently, it does not support quantization, so it is not efficient in terms of memory usage.
- Currently, it does not support LoRA, QLoRA or other adapters for LLMs.
OpenLLM
OpenLLM is a platform for packaging LLMs for production. It integrates seamlessly with leading tools like BentoML, Hugging Face and LangChain. It allows deployment to the cloud or to on-premise machines, and supports Docker containerization when used with platforms like BentoML. It works with state-of-the-art models like Falcon, Flan-T5, StarCoder and many more.
Usage:
For quick inference through a server, install the openllm library:
$ pip install openllm
To start the server:
$ openllm start opt
Alternatively, to specify the model, pass the model ID and parameters:
$ openllm start opt --model-id facebook/opt-2.7b \
    --max-new-tokens 200 \
    --temperature 0.2 \
    --api-workers 1 \
    --workers-per-resource 2
Query the hosted model using curl or the built-in Python client:
import openllm
client = openllm.client.HTTPClient('http://localhost:3000')
client.query('Which is the best large language model?')
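For the curl route, a rough sketch would look like the following; the /v1/generate endpoint and the payload fields are based on OpenLLM's documented examples and may vary between versions:
$ curl -X POST http://localhost:3000/v1/generate \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Which is the best large language model?", "llm_config": {"max_new_tokens": 128}}'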
To get more control over the server, we can write our own BentoML service.
from __future__ import annotations

import bentoml
import openllm

# specify model name here
model = ""
llm_runner = openllm.Runner(model)

# specify service name here
svc = bentoml.Service(name="", runners=[llm_runner])

@svc.on_startup
def download(_: bentoml.Context):
    llm_runner.download_model()

@svc.api(input=bentoml.io.Text(), output=bentoml.io.Text())
async def prompt(input_text: str) -> str:
    answer = await llm_runner.generate.async_run(input_text)
    return answer[0]["generated_text"]
To start the service run:
$ bentoml serve service:svc
If required, we can also containerize and deploy the LLM application using BentoML, as sketched below.
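A rough sketch of that workflow, assuming a bentofile.yaml describing service:svc exists in the project directory (the Bento tag below is illustrative):
$ bentoml build
$ bentoml containerize <bento_name>:latest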
Pros:
- It offers quantization techniques for LLMs with bitsandbytes and GPTQ.
- There are options to modify the models: experimental fine-tuning functionalities are available, and it supports plugins such as adapters for LLMs.
- Integration with BentoML and BentoCloud offers wonderful options for deployment and scaling. BentoML enables us to dockerize the application.
- It has integrations with Hugging Face Agents framework and LangChain.
Cons:
- It does not have built-in support for distributed inference.
Ray Serve
Ray is a complete toolkit that includes libraries for building end-to-end ML pipelines. One of these libraries, Ray Serve, is a scalable model serving tool that can be used to build inference APIs. It supports complex deep learning models built with TensorFlow or PyTorch and has special optimizations for LLMs, such as response streaming, dynamic request batching and batched inference.
Usage:
Here is a sample code snippet for serving a Llama 13B model.
import requests
from starlette.requests import Request
from typing import Dict

from ray import serve
from transformers import AutoModelForCausalLM, AutoTokenizer

# Define a Ray Serve deployment for the Hugging Face Llama model.
@serve.deployment(route_prefix="/llamas")
class LlamaModelDeployment:
    def __init__(self, llama_model, tokenizer):
        self.llama_model = llama_model
        self.tokenizer = tokenizer

    def __call__(self, request: Request) -> Dict:
        # Tokenize the prompt, generate a completion and decode it back to text.
        prompt = request.query_params["prompt"]
        input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids
        output_ids = self.llama_model.generate(input_ids, max_new_tokens=128)
        prediction = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
        return {"prediction": prediction}

# Load the Hugging Face Llama model and its tokenizer.
llama_model = AutoModelForCausalLM.from_pretrained("decapoda-research/llama-13b-hf")
tokenizer = AutoTokenizer.from_pretrained("decapoda-research/llama-13b-hf")

# Bind the deployment to the model and tokenizer.
app = LlamaModelDeployment.bind(llama_model=llama_model, tokenizer=tokenizer)

# Deploy the application.
serve.run(app)
For inference, query the /llamas endpoint using HTTP requests:
response = requests.get(
    "http://localhost:8000/llamas",
    params={"prompt": "Which is the best large language model?"},
)

# The response contains a JSON object with the prediction.
prediction = response.json()["prediction"]
Pros:
- Dynamically scalable to many machines, adjusting the resources required for the model.
- It is optimized for LLMs with features like response streaming, dynamic request batching and multi-GPU serving for compute-intensive LLMs; see the sketch after this list.
- It has native integration with LangChain.
Cons:
- It is not the best option for beginners, as it bundles many advanced features for all kinds of ML models.
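To illustrate the dynamic request batching mentioned above, here is a minimal sketch using Ray Serve's serve.batch decorator. The deployment name, batch sizes and the dummy generation step are illustrative assumptions, not part of the original example:
from typing import List

from starlette.requests import Request
from ray import serve

@serve.deployment
class BatchedLlama:
    # Collect up to 8 concurrent requests (or whatever arrives within 0.1 s)
    # and handle them as a single batch.
    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.1)
    async def handle_batch(self, prompts: List[str]) -> List[str]:
        # Run the underlying model once over the whole batch of prompts here.
        return [f"generated text for: {p}" for p in prompts]

    async def __call__(self, request: Request) -> str:
        # Each individual request is transparently merged into a batch.
        return await self.handle_batch(request.query_params["prompt"])

app = BatchedLlama.bind()
serve.run(app)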
More About E2E Cloud Platform
E2E Networks is a leading provider of advanced cloud GPUs in India, offering a high-value infrastructure cloud for building and launching machine learning applications. Our cloud platform supports a variety of workloads across industry verticals, providing high-performance computing in the cloud. We offer a flagship machine learning platform called Tir, which is built on top of Jupyter Notebook. E2E Networks also provides reliable and cost-effective storage solutions, allowing businesses to store and access their data from anywhere. Our cloud platform is trusted by over 10,000 clients and is designed for real-world use cases in domains such as Data Science, NLP, Computer Vision/Image Processing, HealthTech, ConsumerTech, and more.
Tir: E2E Cloud's Flagship Machine Learning Platform
Tir is built on top of Jupyter Notebook, an advanced web-based interactive development environment offered by E2E Cloud. It provides users with the latest features and functionalities, including JupyterLab, a cutting-edge interface for working with notebooks, code, and data. Tir allows easy integration with other E2E services to build and deploy a complete pipeline, and offers interfaces to test and decide which model to use for your task.
E2E Cloud Models
E2E Networks offers a wide range of options for machine learning applications. There are deployment-ready models for tasks across Data Science, NLP, Computer Vision/Image Processing, HealthTech, ConsumerTech, and more. Other offerings include embedding models that convert text into embeddings, which are widely used for search, personalization and recommendation tasks.
Pros:
- Highly scalable infrastructure
- Reliable for production-level purposes
- We are growing into an end-to-end platform for machine learning models.
Conclusion
The selection of a framework is contingent on several factors, including the specific task you're tackling, the Large Language Model you're using, and the expenses you're prepared to bear. This is not an exhaustive list, but these are some of the currently available frameworks worth trying first for LLM inference and serving. To sum up: Hugging Face is the easiest starting point, vLLM shines when inference speed and throughput matter most, OpenLLM with BentoML is a strong choice for packaging and deploying models, and Ray Serve fits large, scalable production deployments.
References
https://vllm.readthedocs.io/en/latest/serving/run_on_sky.html
https://github.com/bentoml/OpenLLM
https://betterprogramming.pub/frameworks-for-serving-llms-60b7f7b23407
https://docs.ray.io/en/latest/serve/index.html