Introduction
This new era of AI has swiftly transformed the global landscape in a matter of months. One of the areas that has undergone a dramatic shift is software engineering. Programming has become a less strenuous and demanding endeavour. The first such innovation, in my personal experience, was GitHub Copilot. I was very excited about it and joined the waiting list as soon as it opened. As expected, it was excellent and changed the way I code: it made development faster and helped me write cleaner code. Unfortunately, like many others, I lost access when it moved to subscription plans.
With the advent of LLMs built exclusively for coding, there is a lot of excitement. Models like CodeLlama and StarCoder can produce exceptionally good code in seconds. In this tutorial, I will show you how to harness StarCoder: we will fine-tune the model and deploy it as your coding assistant in Visual Studio Code.
StarCoder
StarCoder is an open-source code large language model built specifically for code completion and related tasks. It was first released by the BigCode open-source community in 2023. Trained on roughly 1 trillion tokens from The Stack dataset, StarCoder outperforms every open code LLM that supports multiple programming languages and matches or exceeds the OpenAI code-cushman-001 model. It is one of the closest competitors to CodeLlama.
Prerequisites
This tutorial uses models and utilities from the Hugging Face ecosystem. The model will be fetched from the Hugging Face Hub and fine-tuned using the Hugging Face Trainer. Make sure Git and Docker are installed on your system.
I am using the fine-tuning scripts from a GitHub repository by Hugging Face. Clone the repo.
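A minimal sketch of the clone step; the repository URL below is a placeholder, so substitute the Hugging Face fine-tuning repository you are following:

```bash
# Clone the repository that contains the fine-tuning scripts (URL is a placeholder).
git clone https://github.com/<hf-finetuning-repo>.git
cd <hf-finetuning-repo>
```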
Install the required libraries.
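Assuming the cloned repo ships a requirements.txt, the install step typically looks like this; packages such as transformers, datasets, peft, accelerate and bitsandbytes are the usual fine-tuning dependencies it pulls in:

```bash
# Install the dependencies listed by the repo.
pip install -r requirements.txt
```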
Log in using your Hugging Face credentials.
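To authenticate against the Hub from the command line:

```bash
# Log in with a Hugging Face access token (create one under Settings -> Access Tokens).
huggingface-cli login
```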
Fine-Tuning StarCoder
Let us fine-tune the 1B-parameter version of StarCoder. For fine-tuning the model on a code corpus, we will use the hf-stack-peft dataset from Hugging Face Datasets. This dataset consists of 158 rows of code content spanning a variety of programming languages.
I am using an A100 115 GB instance on the E2E TIR AI Platform. Run the fine-tuning script, specifying the model and dataset as follows. Feel free to tweak the parameters as necessary.
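The exact entry point and flags depend on the repository you cloned; as a sketch, an invocation along these lines is typical for such fine-tuning scripts (the script name and argument names here are purely illustrative, and the dataset namespace is a placeholder):

```bash
# Illustrative invocation: check the cloned repo for the real script name and flags.
python finetune.py \
  --model_path "bigcode/starcoderbase-1b" \
  --dataset_name "<namespace>/hf-stack-peft" \
  --seq_length 2048 \
  --max_steps 1000 \
  --batch_size 1 \
  --learning_rate 5e-5 \
  --push_to_hub
```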
A model repository containing the fine-tuned model will be created under your Hugging Face profile. The inference endpoint we build next will load the model from this repository.
Serving the Fine-Tuned LLM
To use the fine-tuned model in real time, we need to deploy it to a server. This section is a detailed guide on how to accomplish that using E2E Cloud. Several methods, including the Triton server, the PyTorch server, and others, can be used for deployment and inference on E2E. We will do it by building a custom Docker container.
Build a Custom Docker Container
A Docker container is a unit of software that packages code and its dependencies, so the application runs reliably across different computing environments. To create the inference API, we must write an API handler. The handler will vary depending on the LLM and the configuration you choose.
First, create a new directory for your model with the following files.
Model
├── model_server.py
├── requirements.txt
└── Dockerfile
In this API handler, model inference is wrapped using the KServe model server. The contents of the files are shown below.
The tokenizer should be loaded from the base model, while the weights come from the fine-tuned version. Load both, define a text-generation pipeline, and implement the inference method predict, which takes an input prompt and generates code sequences.
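Below is a minimal sketch of model_server.py using the KServe Python SDK. The fine-tuned repository id is a placeholder for the repo created under your Hugging Face profile, and the request schema in predict is an assumption you should adapt to your client; if your training run saved only LoRA adapters rather than merged weights, you would load them via peft.PeftModel instead.

```python
# model_server.py - a minimal sketch of a KServe handler for the fine-tuned StarCoder model.
from typing import Dict

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from kserve import Model, ModelServer


class StarCoderModel(Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.name = name
        self.generator = None
        self.ready = False

    def load(self):
        # Tokenizer comes from the base model; weights come from the fine-tuned repo
        # created on your Hugging Face profile (the repo id below is a placeholder).
        base_model = "bigcode/starcoderbase-1b"
        finetuned_model = "<your-hf-username>/starcoderbase-1b-finetuned"
        tokenizer = AutoTokenizer.from_pretrained(base_model)
        model = AutoModelForCausalLM.from_pretrained(
            finetuned_model,
            torch_dtype=torch.float16,
            device_map="auto",
        )
        self.generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
        self.ready = True

    def predict(self, payload: Dict, headers: Dict[str, str] = None) -> Dict:
        # Assumed request schema: {"inputs": "<code prompt>", "parameters": {...}};
        # adapt this to whatever client you point at the endpoint.
        prompt = payload["inputs"]
        params = payload.get("parameters", {})
        outputs = self.generator(prompt, max_new_tokens=params.get("max_new_tokens", 60))
        return {"generated_text": outputs[0]["generated_text"]}


if __name__ == "__main__":
    model = StarCoderModel("starcoder1b")
    model.load()
    ModelServer().start([model])
```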
Define the Dockerfile.
Dockerfile
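A minimal sketch, assuming a CUDA-enabled PyTorch base image and KServe's default HTTP port of 8080; adjust the image tag and Python packaging to match your environment:

```dockerfile
# Sketch of a Dockerfile for the KServe handler; pick a base image matching your CUDA/PyTorch setup.
FROM pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime

WORKDIR /app

# Install Python dependencies first so Docker can cache this layer.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the API handler.
COPY model_server.py .

# KServe's model server listens on port 8080 by default.
EXPOSE 8080

ENTRYPOINT ["python", "model_server.py"]
```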
Define the requirements file.
requirements.txt
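One possible set of requirements for the handler above; pin exact versions as needed for reproducibility:

```text
# torch ships with the base image; add it here only if you switch to a plain Python base image.
kserve
transformers
accelerate
```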
Run this command to build the docker image.
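Assuming the naming used in the push step below (your Docker Hub handle, repository starcoder1b, tag v1):

```bash
# Build the image; replace <your-docker-handle-here> with your Docker Hub username.
docker build -t <your-docker-handle-here>/starcoder1b:v1 .
```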
After building the docker image, make sure it is working using the run command.
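A quick local smoke test, assuming a GPU is available on the machine (drop --gpus all to test loading on CPU); the request body matches the schema assumed in model_server.py:

```bash
# Run the container locally and expose KServe's port.
docker run --rm --gpus all -p 8080:8080 <your-docker-handle-here>/starcoder1b:v1

# In another terminal, send a test request to the model's predict route.
curl -X POST http://localhost:8080/v1/models/starcoder1b:predict \
  -H "Content-Type: application/json" \
  -d '{"inputs": "def fibonacci(n):", "parameters": {"max_new_tokens": 60}}'
```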
After testing, push it to Docker Hub. Here, starcoder1b is the repository name on Docker Hub and the tag is the version you are pushing; use a new tag whenever you push updated content to this repository later.
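Log in to Docker Hub and push the image; the tag follows the v1 example used above:

```bash
# Authenticate with Docker Hub, then push the tagged image.
docker login
docker push <your-docker-handle-here>/starcoder1b:v1
```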
Creating Inference Endpoint
Before moving to endpoints, create an authorization token in the API Tokens section. To create a model endpoint for our fine-tuned model, go to Model Endpoints and click the Create Endpoint button. Select the GPU configuration required for your model and create an inference endpoint.
From the frameworks, choose custom container.
In Container Details, enter the image as <your-docker-handle-here>/starcoder1b and set the other parameters as needed. In Environment Details, enter these key-value pairs:
- HUGGING_FACE_HUB_TOKEN: your access token from the Hugging Face website
- TRANSFORMERS_CACHE: /mnt/models
Proceed without selecting any model in the Models section, since the model is fetched from the Hugging Face repository rather than an EOS bucket. Click Finish to create the inference endpoint.
If everything goes well, you will see logs similar to this.
Connecting to VSCode
You are almost there! Now connect StarCoder to VSCode. We will use the llm-vscode extension by Hugging Face. Install the extension in your VSCode editor. By default, the extension uses a different LLM served from a Hugging Face endpoint.
Login to Hugging Face
You can supply your HF token using these steps:
- Cmd/Ctrl+Shift+P to open the VSCode command palette
- Type: Llm: Login
- Enter the token
Configure the Endpoints
Go to the Settings page by pressing Cmd+, on Mac or Ctrl+, on Windows. In the settings, you should find an option to set your own HTTP endpoint for code generation requests. Change the following configuration in settings.json.
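A sketch of the relevant settings.json entries. The setting names vary across llm-vscode versions (older releases expose llm.modelIdOrEndpoint, newer ones split it into separate backend and URL settings), and the endpoint URL below is a placeholder for the URL shown on your E2E endpoint details page, so check the extension's README for your installed version:

```jsonc
{
  // Point code-generation requests at your own endpoint instead of the default model.
  // Placeholder URL: copy the real one from your E2E inference endpoint details page.
  "llm.modelIdOrEndpoint": "https://<your-e2e-endpoint>/v1/models/starcoder1b:predict",

  // StarCoder's fill-in-the-middle special tokens.
  "llm.fillInTheMiddle.enabled": true,
  "llm.fillInTheMiddle.prefix": "<fim_prefix>",
  "llm.fillInTheMiddle.middle": "<fim_middle>",
  "llm.fillInTheMiddle.suffix": "<fim_suffix>"
}
```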
Save your settings, then open a file and start coding. The extension will fetch completions from StarCoder via the inference endpoint and auto-complete your code, much like GitHub Copilot.
Wrapping Up
You have now learnt how to create and use your own AI code copilot. We encourage you to try larger models like CodeLlama. E2E inference endpoints also offer pre-built containers for CodeLlama and other frameworks with ready-made API handlers. You may need to scale your infrastructure depending on the models you use.