Introduction
StarCoder2 is a family of advanced code generation models (available in 3B, 7B, and 15B sizes) trained on 600+ programming languages from The Stack v2, a vast training dataset created by the BigCode project.
The Stack v2 was built from several sources, including the Software Heritage archive of code spanning 619 programming languages, GitHub issues and pull requests, Kaggle notebooks, and code documentation, and is four times bigger than the dataset used for the original StarCoder.
Interestingly, benchmark results show that even the smallest StarCoder2 model (3B) performs better than other models of its size and even beats the previous-generation StarCoder model (15B).
The largest StarCoder2 model (15B) outperforms other similar-sized models and is competitive with, or better than, much larger models such as CodeLlama-34B. And even though DeepSeek-Coder-33B leads in code completion for widely used languages, StarCoder2-15B excels at math, code reasoning, and low-resource languages.
Due to its stellar performance, StarCoder2 has numerous applications for businesses. Here are some:
- Helping scale developer productivity through an AI coding assistant that simplifies code refactoring, debugging, testing, documentation, and more
- Scaling R&D through the ability to generate PoCs faster
- Helping with code migration and code porting
- Explaining and summarizing code to non-programmers
However, to use StarCoder2 effectively as a code generation tool or a coding assistant, the key is to deploy it and integrate it with Visual Studio Code, the IDE (integrated development environment) of choice for most developers. In this blog, we’ll show exactly how to do that.
Let’s get started.
Deploying StarCoder2 Using Ollama
Head over to https://myaccount.e2enetworks.com/ to sign up for E2E Networks. For this blog, we’ll launch a V100 GPU node.
However, when you want to offer this code generation AI to a larger number of developers, you should ideally choose an H100, or better still, an HGX H100, which would let you deploy it in a way that handles a large number of concurrent requests.
Deploying StarCoder2
To start the process, install Ollama by running the following in the bash terminal:
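```bash
# Official Ollama install script (per ollama.com)
curl -fsSL https://ollama.com/install.sh | sh
```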
Then launch the Ollama server.
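```bash
# Bind to 0.0.0.0 so the server is reachable from external IPs
# (it listens on the default port, 11434)
OLLAMA_HOST=0.0.0.0 ollama serve
```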
This creates a server on the default port, 11434. We bind the server to 0.0.0.0 so that it is accessible from external IPs. This will be useful later on, when we integrate StarCoder2 into VSCode for coding assistance.
Make sure you keep this port open so that it can be accessed externally.
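For example, if your node happens to use ufw as its firewall (an assumption; your setup may rely on cloud security-group rules instead), you can open the port like this:

```bash
# Allow inbound TCP traffic on Ollama's default port (11434)
sudo ufw allow 11434/tcp
```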
Now pull the starcoder2:15b model from the Ollama registry.
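```bash
# Download the StarCoder2 15B model weights
ollama pull starcoder2:15b
```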
Once pulled, we can check its availability.
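```bash
# List locally available models; starcoder2:15b should appear in the output
ollama list
```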
Now, to chat with the model, we can either send API requests to the endpoint using cURL commands or write Python code to do so. Let’s spin up a Jupyter notebook and try the latter approach.
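Below is a minimal sketch of such a notebook cell, using the requests library against Ollama’s /api/generate endpoint. The prompt is just an illustrative placeholder, and the host assumes you’re running the notebook on the same machine as the server (swap in your node’s IP otherwise):

```python
import json
import requests

# Ollama's generate endpoint; adjust the host if the server runs elsewhere
OLLAMA_URL = "http://localhost:11434/api/generate"

payload = {
    "model": "starcoder2:15b",
    "prompt": "Write a Python function that checks whether a number is prime.",
    "stream": True,  # stream tokens back as they are generated
}

# stream=True tells requests not to buffer the whole response body
with requests.post(OLLAMA_URL, json=payload, stream=True) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        if not line:
            continue
        # Each line is a JSON object; the generated text is in "response"
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
```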
When the `stream=True` parameter is set, the response from the endpoint is delivered in a streaming fashion, which means that each token is transmitted and made available for processing as soon as it is generated. This streaming mechanism allows each piece of the response to be printed immediately, providing real-time feedback. Consequently, we don't have to wait for the entire response to be fully generated before any output is presented. This can enhance the interactivity of the application, especially when dealing with large responses or when the immediacy of the output is critical.
Integration with VSCode
To integrate the StarCoder2 endpoint into our VSCode environment, we first need to install the Continue extension.
Continue provides a framework for enabling coding co-pilots in the VSCode IDE.
Once installed, you can move the Continue panel to the right side of the screen for a better experience.
Now click on the plus (+) sign and select Ollama as the provider.
After that, open the config.json file and add the following entry to the models list.
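Here is a sample entry, sketched against Continue’s config.json model format; the title is arbitrary, and `<server-ip>` is a placeholder for your GPU node’s public IP:

```json
{
  "title": "StarCoder2 15B",
  "provider": "ollama",
  "model": "starcoder2:15b",
  "apiBase": "http://<server-ip>:11434"
}
```

Pointing apiBase at the remote node is what lets every developer’s local VSCode share the single hosted model.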
That’s it. Now you’re all ready to work with your coding assistant.
Below is a screenshot of our integration:
Press Ctrl+I and write a prompt for Continue to auto-complete the code.
Alternatively, you can chat with the Continue assistant for coding suggestions.
Conclusion
With the approach given in this blog, implementing a centralized coding assistant for all employees within a company can be cost-effective. By hosting the assistant on a single server instance, every team member with access can use the tool without the need for individual subscriptions.
To scale this, you can use an HGX H100 cluster, which can help you serve thousands of developers and give them the benefits of code generation AI. Talk to sales@e2enetworks.com to learn more.