Build Your Own VS Code-Connected AI Coding Companion: A Step-by-Step Guide

April 2, 2025

Introduction

The new era of AI has transformed the global landscape in a matter of months, and the software engineering workspace is one of the areas that has shifted most dramatically: programming has become a far less strenuous and demanding endeavour. The first such innovation, in my personal experience, was GitHub Copilot. I was excited about it and joined the earliest waitlist. It lived up to expectations and changed the way I code, making development faster and helping me write cleaner code. Unfortunately, like many others, I lost access when it moved to subscription plans.

With the advent of LLMs built exclusively for coding, there is a lot of excitement. Models like Code Llama and StarCoder can produce remarkably good code in seconds. In this tutorial, I will show you how to fine-tune StarCoder and deploy it as your own coding assistant in Visual Studio Code.

StarCoder

StarCoder is an open-source code large language model intended for code completion and related tasks. It was released by the BigCode open-source community in 2023. Trained on roughly 1 trillion tokens from The Stack dataset, StarCoder outperformed, at the time of its release, all open code LLMs that support multiple programming languages, and matched or exceeded OpenAI's code-cushman-001 model. It is one of the closest competitors to Code Llama.

Prerequisites

This tutorial uses models and tooling from the Hugging Face ecosystem. The model will be fetched from the Hugging Face Hub and fine-tuned with the Hugging Face Trainer. Make sure git and docker are installed on your system.

I am using the fine-tuning scripts from the DHS-LLM-Workshop GitHub repo. Clone the repo.


!git clone https://github.com/pacman100/DHS-LLM-Workshop.git

Install the required libraries.


!pip install packaging
!pip uninstall -y ninja && pip install ninja
!ninja --version
!echo $?

%cd DHS-LLM-Workshop
!git pull
!pip install -r requirements.txt

Log in using your Hugging Face credentials.


from huggingface_hub import notebook_login


notebook_login()

Fine-Tuning StarCoder

Let us fine-tune the 1B-parameter version of StarCoder (bigcode/starcoderbase-1b). For fine-tuning the model on a code corpus, we will use the smangrul/hf-stack-peft dataset from the Hugging Face Hub. This dataset consists of 158 rows of code content spanning a variety of programming languages.
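
If you want to inspect the corpus before training, you can load it with the datasets library. This is an optional sketch; the data_dir value is assumed to mirror the --subset flag passed to train.py below, so adjust it if the dataset layout differs.


# Optional: preview the fine-tuning corpus (a sketch; assumes the datasets library is installed).
from datasets import load_dataset

# "data" mirrors the --subset flag used in the training command below.
ds = load_dataset("smangrul/hf-stack-peft", data_dir="data", split="train")

print(ds)                       # row count and column names
print(ds[0]["content"][:300])   # preview the first code sample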

I am using an A100 (115 GB) instance on the E2E TIR AI Platform. Run the fine-tuning script, specifying the model and dataset as follows. Feel free to tweak the parameters as necessary.


%cd personal_copilot/training/
!python train.py \
    --model_path "bigcode/starcoderbase-1b" \
    --dataset_name "smangrul/hf-stack-peft" \
    --subset "data" \
    --data_column "content" \
    --split "train" \
    --seq_length 2048 \
    --max_steps 2000 \
    --batch_size 4 \
    --gradient_accumulation_steps 4 \
    --learning_rate 5e-4 \
    --lr_scheduler_type "cosine" \
    --weight_decay 0.01 \
    --num_warmup_steps 30 \
    --eval_freq 100 \
    --save_freq 100 \
    --log_freq 25 \
    --num_workers 4 \
    --bf16 \
    --no_fp16 \
    --output_dir "starcoder1B-personal-copilot" \
    --fim_rate 0.5 \
    --fim_spm_rate 0.5 \
    --use_peft_lora \
    --lora_r 32 \
    --lora_alpha 64 \
    --lora_dropout 0.0 \
    --lora_target_modules "c_proj,c_attn,q_attn,c_fc,c_proj" \
    --use_4bit_qunatization \
    --use_nested_quant \
    --bnb_4bit_compute_dtype "bfloat16" \
    --push_to_hub

A model repository containing the fine-tuned model will be created under your Hugging Face profile. You will point the inference server at this repository in the next section.
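
Before deploying, you can sanity-check the fine-tuned model locally. The following is a minimal sketch; username/starcoder1B-personal-copilot is a placeholder for your own Hub repository, and if the repository contains only a LoRA adapter, loading it this way additionally requires the peft library to be installed.


# Quick local sanity check of the fine-tuned model (a sketch).
# "username/starcoder1B-personal-copilot" is a placeholder for your own Hub repository.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "username/starcoder1B-personal-copilot",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoderbase-1b")

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_k=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))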

Serving the Fine-Tuned LLM

To use the fine-tuned model in real time, we need to deploy it to a server. This section is a detailed guide on how to accomplish that using E2E Cloud. Numerous methods can be used for deployment and inference on E2E, including the Triton server and the PyTorch server; here, we will build a custom Docker container.

Build a Custom Docker Container

A Docker container is a software unit that packages code together with its dependencies, ensuring the application runs seamlessly and dependably across different computing environments. To create the inference API, we must write an API handler, which will vary depending on the LLM and the configuration you choose.

First, create a new directory for your model with the following files.

Model
├── model_server.py
├── requirements.txt
└── Dockerfile

In this API handler, model inference is wrapped with the KServe model server. The contents of the files are shown below.


from kserve import Model, ModelServer
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch
from typing import Dict
from huggingface_hub import login


# Authenticate with the Hugging Face Hub so the fine-tuned model can be downloaded.
login(token='YOUR_TOKEN', add_to_git_credential=False)


class StarCoder(Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.name = name
        self.ready = False
        self.tokenizer = None
        # Replace with your own fine-tuned model repository on the Hugging Face Hub.
        self.model_id = 'username/starcoder1B-personal-copilot'
        self.load()

    def load(self):
        # This step fetches the model directly from the Hugging Face Hub. The download may be
        # slow depending on the upstream link; we recommend using TIR Models instead.
        self.model = AutoModelForCausalLM.from_pretrained(self.model_id,
                                                          trust_remote_code=True,
                                                          device_map='auto')

        # The tokenizer comes from the base model; the weights are the fine-tuned version.
        self.tokenizer = AutoTokenizer.from_pretrained('bigcode/starcoderbase-1b')
        # A text-generation pipeline built from the same model and tokenizer
        # (predict() below calls model.generate directly).
        self.pipeline = transformers.pipeline(
            "text-generation",
            model=self.model,
            torch_dtype=torch.float16,
            tokenizer=self.tokenizer,
            device_map="auto",
        )
        self.ready = True

    def predict(self, payload: Dict, headers: Dict[str, str] = None) -> Dict:
        inputs = payload["instances"]
        source_text = inputs[0]["text"]
        # Encode the source text and move it to the model's device
        input_ids = self.tokenizer.encode(source_text, return_tensors="pt").to(self.model.device)
        # Generate sequences
        sequences = self.model.generate(input_ids,
                                        do_sample=True,
                                        top_k=10,
                                        num_return_sequences=1,
                                        eos_token_id=self.tokenizer.eos_token_id,
                                        max_length=200)
        results = []
        for seq in sequences:
            results.append(self.tokenizer.decode(seq))
        return {"predictions": results}


if __name__ == "__main__":
    model = StarCoder("starcoderbase-1b")
    ModelServer().start([model])

The tokenizer is loaded from the base model, while the model weights are the fine-tuned version. The load method loads both and builds the generation pipeline. The predict method takes the input prompt and generates code sequences.
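
Before containerizing, you can exercise the handler directly from Python. This is a small sketch that assumes the file above is saved as model_server.py in the same directory; the __main__ guard keeps the server from starting on import.


# quick_test.py - a local smoke test (a sketch): import the handler class and call
# predict() directly, without going through HTTP.
from model_server import StarCoder

model = StarCoder("starcoderbase-1b")
payload = {"instances": [{"text": "def quicksort(arr):"}]}
print(model.predict(payload))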

Define the Dockerfile.

Dockerfile


# Use an official Python runtime as a parent image
FROM python:3.10-slim-buster


# Set the working directory in the container to /app
WORKDIR /app


# Add the current directory contents into the container at /app
ADD . /app


# Install any needed packages specified in requirements.txt
RUN pip install --no-cache-dir -r requirements.txt


# Make port 8080 (the KServe model server's default HTTP port) available outside this container
EXPOSE 8080


# Run model_server.py when the container launches
CMD ["python", "model_server.py"]


Define the requirements file.

requirements.txt


kserve
ray[serve]
transformers
torch
huggingface_hub
accelerate
peft
bitsandbytes

Run this command to build the Docker image. Docker image names must be lowercase, so we use starcoder1b to match the tag and push commands below.


docker build -t starcoder1b .

After building the image, make sure it works using the run command. The KServe model server listens on port 8080 inside the container, so map a host port to it.


docker run -p 4000:8080 starcoder1b
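
With the container running, you can send it a test request. The snippet below is a sketch using Python's requests library; KServe serves the v1 predict route at /v1/models/<model-name>:predict, where the model name matches the one passed to StarCoder(...) in model_server.py.


# Send a test completion request to the locally running container (a sketch).
import requests

resp = requests.post(
    "http://localhost:4000/v1/models/starcoderbase-1b:predict",
    json={"instances": [{"text": "def fibonacci(n):"}]},
    timeout=300,
)
print(resp.json())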

After testing, push the image to Docker Hub. Here, starcoder1b is the repository name on Docker Hub and tagname is the version you are pushing; use a different tag whenever you push new content to this repository later.


docker login
docker tag starcoder1b:tagname username/starcoder1b:tagname 
docker push username/starcoder1b:tagname

Creating Inference Endpoint

Before moving to endpoints, create an authorization token in the API Tokens section. To create a model endpoint for the fine-tuned StarCoder model, go to Model Endpoints and click the Create Endpoint button. Select the GPU configuration required for your model and create an inference endpoint.

From the frameworks, choose custom container. 

In Container Details, enter the image as <your-docker-handle-here>/starcoder1b and set the other parameters as needed. In Environment Details, add these key-value pairs:

  1. HUGGING_FACE_HUB_TOKEN: your access token from the Hugging Face website
  2. TRANSFORMERS_CACHE: /mnt/models

Proceed without selecting any model in the Models section, as the model is fetched from the Hugging Face repo rather than an EOS bucket. Click Finish to create the inference endpoint.

If everything goes well, you will see logs showing the KServe model server starting up and the model being loaded.
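
You can also verify the deployed endpoint from Python before wiring it into VSCode. The following is a rough sketch; the exact base URL and the authentication header format come from your E2E endpoint details, so treat both placeholders as assumptions.


# Call the deployed endpoint (a sketch; the URL and auth header depend on your E2E setup).
import requests

ENDPOINT_URL = "YOUR_E2E_INFERENCE_ENDPOINT"   # from the endpoint details page
API_TOKEN = "YOUR_E2E_API_TOKEN"               # from the API Tokens section

resp = requests.post(
    f"{ENDPOINT_URL}/v1/models/starcoderbase-1b:predict",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"instances": [{"text": "def fibonacci(n):"}]},
    timeout=300,
)
print(resp.json())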

Connecting to VSCode

You are almost there! Now connect StarCoder to VSCode. We will use the llm-vscode extension by Hugging Face. Install the extension in VSCode. By default, the extension points to a model hosted on a Hugging Face endpoint, so we will configure it to use our own endpoint instead.

Login to Hugging Face

You can supply your HF token using these steps:

  1. Cmd/Ctrl+Shift+P to open the VSCode command palette
  2. Type: Llm: Login
  3. Enter the token 

Configure the Endpoints

Open the Settings page by pressing Cmd+, on Mac or Ctrl+, on Windows. In the settings, you will find an option to set your own HTTP endpoint for code-generation requests. Change the following configuration in settings.json.


"llm.modelIdOrEndpoint": "YOUR_E2E_INFERENCE_ENDPOINT"
"llm.tokenizer": {
    "repository": "bigcode/starcoderbase-1b"
  }
"llm.attributionEndpoint": "YOUR_E2E_INFERENCE_ENDPOINT"

Save your settings, then open a file and start coding. The extension will fetch completions from the StarCoder inference endpoint and auto-complete your code, much like GitHub Copilot.

Wrapping Up

You have now learnt how to create and use your own AI code copilot. We encourage you to try larger models like Code Llama. E2E inference endpoints offer pre-built containers with ready-made API handlers for Code Llama and other frameworks. You may have to scale your infrastructure based on the models you use.
