The emergence of Large Language Models (LLMs) has sparked a new era of AI-assisted programming, helping developers streamline their coding processes and tackle complex problems more efficiently. Among the various LLMs available, open-source coding LLMs have gained significant attention due to their accessibility, transparency, and community-driven nature.
Open-source coding LLMs are powerful AI models that have been trained on vast amounts of programming-related data, including source code, documentation, and developer discussions. These models can understand and generate code in multiple programming languages, provide intelligent code suggestions, and even assist in debugging and optimization tasks. By leveraging the collective knowledge and expertise of the open-source community, these LLMs offer developers a valuable tool to enhance their productivity and overcome programming challenges.
Moreover, LLMs for coding provide significant benefits to software organizations. One of the key advantages is cost reduction compared to proprietary coding assistant subscriptions. By hosting open-source LLMs locally, organizations can avoid the recurring expenses associated with subscription-based services.
In addition to cost savings, these LLMs for coding offer organizations greater control, customization, and privacy. By hosting these models within their own infrastructure, companies can ensure data security and compliance with privacy requirements. The open-source nature of these LLMs also allows organizations to customize and fine-tune the models to align with their specific coding practices, standards, and domain-specific requirements.
In this article, we will explore the top open-source coding LLMs that are making waves in the developer community.
1. Mistral 7B & Mixtral 8x7B
Mistral 7B and Mixtral 8x7B are two open-source language models developed by Mistral AI, both released under the Apache 2.0 license.
Mistral 7B is a 7.3B parameter model that outperforms Llama 2 13B on all benchmarks and even surpasses Llama 1 34B on many tasks. It approaches the performance of CodeLlama 7B on coding tasks while maintaining strong performance in English-language tasks. Mistral 7B uses techniques like Grouped Query Attention (GQA) for faster inference and Sliding Window Attention (SWA) to efficiently handle longer sequences.
Mixtral 8x7B is a larger, 46.7B parameter Sparse Mixture-of-Experts (SMoE) model. Despite its high total parameter count, it only uses 12.9B parameters per token, allowing it to process input and generate output at roughly the speed and cost of a 12.9B dense model. Mixtral 8x7B matches or outperforms Llama 2 70B on most benchmarks.
Both models demonstrate strong performance on coding-related tasks:
1. Mistral 7B approaches the performance of CodeLlama 7B on code generation tasks while maintaining its proficiency in English-language tasks.
2. Mixtral 8x7B shows strong performance in code generation.
The models can be easily fine-tuned for various tasks. For example, Mistral 7B was fine-tuned on publicly available instruction datasets to create Mistral 7B Instruct, which outperforms all 7B models on the MT-Bench benchmark.
- Mistralai/Mistral-7B-Instruct-v0.2
- Mistralai/Mixtral-8x7B-Instruct-v0.1
- Mistralai/Mistral-7B-Instruct-v0.1
- Mistralai/Mixtral-8x7B-v0.1
- Mistralai/Mistral-7B-v0.1
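A quick way to try the instruction-tuned Mistral models locally is Ollama, the same tool we use in the hosting section later in this article. The sketch below assumes the `mistral` and `mixtral:8x7b` tags in the Ollama library point to the instruct variants; check the library before relying on the exact tags.

```bash
# Pull and chat with the instruction-tuned Mistral 7B model via Ollama.
# The "mistral" and "mixtral:8x7b" tags are assumptions about the Ollama library.
ollama pull mistral
ollama run mistral "Write a Python function that returns the n-th Fibonacci number."

# The larger Mixtral 8x7B SMoE model works the same way but needs far more GPU memory.
ollama run mixtral:8x7b "Explain Grouped Query Attention in two sentences."
```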
2. CodeLlama
CodeLlama by Meta is a state-of-the-art large language model (LLM) designed for code generation and natural language tasks related to code. It is built on top of Llama 2 and is available in three versions:
1. CodeLlama: The foundational code model.
2. CodeLlama - Python: Specialized for Python programming.
3. CodeLlama - Instruct: Fine-tuned for understanding natural language instructions.
Four sizes of CodeLlama have been released: 7B, 13B, 34B, and 70B parameters. The models are trained on a massive dataset of code and code-related data:
- 7B, 13B, and 34B models are trained on 500B tokens of code and code-related data.
- 70B model is trained on 1T tokens.
The 7B and 13B base and instruct models have also been trained with fill-in-the-middle (FIM) capability, allowing them to insert code into existing code for tasks like code completion.
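To make the FIM capability concrete, the base (code) variants accept a prompt containing `<PRE>`, `<SUF>`, and `<MID>` control tokens and generate the missing middle. Below is a minimal sketch; the `codellama:7b-code` tag and the exact token spacing are assumptions based on the Ollama library and Meta's published prompt format, so treat it as illustrative.

```bash
# Fill-in-the-middle: supply the code before the gap (<PRE>) and after it (<SUF>);
# the model generates the missing middle (<MID>).
# The "codellama:7b-code" tag is an assumption about the Ollama library.
ollama run codellama:7b-code '<PRE> def fibonacci(n):
    """Return the n-th Fibonacci number.""" <SUF>
    return b <MID>'
```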
CodeLlama - Python is further fine-tuned on 100B tokens of Python code, while CodeLlama - Instruct is instruction fine-tuned and aligned to better understand human prompts.
In benchmark tests using HumanEval and Mostly Basic Python Programming (MBPP), CodeLlama outperformed state-of-the-art publicly available LLMs on code tasks. CodeLlama 34B scored 53.7% on HumanEval and 56.2% on MBPP, the highest among open-source solutions.
The models are released under the same community license as Llama 2, and the training recipes and model weights are available on GitHub.
- CodeLlama-34b-Instruct-hf
- CodeLlama-13b-Instruct-hf
- CodeLlama-7b-Instruct-hf
- CodeLlama-70b-Instruct-hf
- CodeLlama-70b-Python-hf
- CodeLlama-70b-hf
- CodeLlama-7b-hf
- CodeLlama-13b-hf
- CodeLlama-34b-hf
- CodeLlama-7b-Python-hf
- CodeLlama-13b-Python-hf
- CodeLlama-34b-Python-hf
3. Phind-CodeLlama
Phind, an AI company, has fine-tuned two models, CodeLlama-34B and CodeLlama-34B-Python, using their internal dataset. The resulting models, named Phind-CodeLlama-34B-v1 and Phind-CodeLlama-34B-Python-v1, have achieved impressive results on the HumanEval benchmark, scoring 67.6% and 69.5% pass@1, respectively.
Phind's dataset consists of approximately 80,000 high-quality programming problems and solutions, structured as instruction-answer pairs rather than code completion examples. The models were trained over two epochs, totaling around 160,000 examples, using native fine-tuning without LoRA. The training process was optimized using DeepSpeed ZeRO 3 and Flash Attention 2, allowing the models to be trained in just three hours using 32 A100-80GB GPUs with a sequence length of 4096 tokens.
To ensure the validity of their results, Phind applied OpenAI's decontamination methodology to their dataset, which involves sampling substrings from each evaluation example and checking for matches in the processed training examples. No contaminated examples were found in Phind's dataset.
Phind-CodeLlama-34B-v2 is a newer version, which was initialized from Phind-CodeLlama-34B-v1 and trained on an additional 1.5 billion tokens. This new model achieved an even higher score of 73.8% pass@1 on the HumanEval benchmark, further demonstrating the effectiveness of Phind's fine-tuning approach.
- Phind-CodeLlama-34B-v2
- Phind-CodeLlama-34B-v1
- Phind-CodeLlama-34B-Python-v1
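Like the other models in this list, the Phind fine-tunes can be served locally. A minimal sketch, assuming the `phind-codellama:34b-v2` tag is available in the Ollama library:

```bash
# Run Phind's fine-tuned CodeLlama locally.
# The "phind-codellama:34b-v2" tag is an assumption about the Ollama library.
ollama run phind-codellama:34b-v2 \
  "Write a Python function that parses an ISO-8601 date string and returns a datetime object."
```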
4. StarCoder & StarCoder2
StarCoder and StarCoder2 are two large language models developed by the BigCode project, an open scientific collaboration focused on the responsible development of Large Language Models for Code (Code LLMs).
StarCoder:
- StarCoder is a 15.5B parameter model with an 8K context length, infilling capabilities, and fast large-batch inference enabled by multi-query attention.
- It is built upon StarCoderBase, which was trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process.
- StarCoder is a fine-tuned version of StarCoderBase, trained on an additional 35B Python tokens.
StarCoder2:
- StarCoder2 is trained on The Stack v2, a dataset built in partnership with Software Heritage (SWH) that is 4× larger than the first StarCoder dataset.
- The Stack v2 contains over 3B files in 600+ programming and markup languages, derived from the Software Heritage archive.
- StarCoder2 models come in three sizes: 3B, 7B, and 15B parameters, trained on 3.3 to 4.3 trillion tokens.
- StarCoder2-3B outperforms other Code LLMs of similar size on most benchmarks and also outperforms StarCoderBase-15B.
- StarCoder2-15b
- StarCoder2-7b
- StarCoder2-3b
- StarCoder
- StarCoderBase
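Since the StarCoder2 checkpoints are base (completion) models rather than chat models, they are usually prompted with a code prefix to complete. A minimal sketch using Ollama's HTTP API, assuming the `starcoder2:15b` tag has already been pulled locally:

```bash
# Code completion through Ollama's HTTP API (default port 11434).
# The "starcoder2:15b" tag is an assumption about what has been pulled locally.
curl http://localhost:11434/api/generate -d '{
  "model": "starcoder2:15b",
  "prompt": "def read_csv_as_dicts(path):\n    \"\"\"Read a CSV file and return a list of row dicts.\"\"\"\n",
  "stream": false
}'
```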
5. WizardCoder
WizardCoder is a code large language model (LLM) that enhances the open-source StarCoder model through complex instruction fine-tuning using the Evol-Instruct method adapted for code.
The Evol-Instruct method, introduced by WizardLM, is a technique for generating more complex and diverse instruction data to improve the fine-tuning of language models. The key idea is to "evolve" an existing dataset of instructions by iteratively applying various transformations to make the instructions more challenging and varied.
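To make the idea concrete, here is a highly simplified sketch of a single evolution step: an existing instruction-tuned model is asked to rewrite a seed coding instruction into a harder variant. The real Evol-Instruct pipeline described in the WizardLM and WizardCoder papers uses several specific evolution operators (adding constraints, deepening, increasing reasoning steps, and so on) plus filtering of failed evolutions; the choice of `mistral` as the rewriting model below is arbitrary.

```bash
# One illustrative Evol-Instruct-style step: ask an LLM to make a seed coding
# instruction more complex. The real method applies several evolution operators
# iteratively and filters out failed evolutions.
SEED="Write a Python function that sorts a list of integers."
ollama run mistral "Rewrite the following programming instruction so that it is harder,
for example by adding an explicit constraint or requiring a specific time complexity.
Return only the rewritten instruction.

Instruction: ${SEED}"
```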
- WizardCoder-Python-34B-V1.0
- WizardCoder-15B-V1.0
- WizardCoder-Python-13B-V1.0
- WizardCoder-Python-7B-V1.0
- WizardCoder-3B-V1.0
- WizardCoder-1B-V1.0
- WizardCoder-33B-V1.1
6. Solar-10.7B
SOLAR 10.7B is a large language model with 10.7 billion parameters that demonstrates strong performance across a range of natural language processing tasks. It was built with a depth up-scaling approach, in which a Llama 2-style architecture is scaled up in depth and initialized from the pretrained weights of Mistral 7B before further pretraining.
For fine-tuning, SOLAR 10.7B underwent a two-stage process: instruction tuning and alignment tuning. The instruction tuning stage utilized mostly open-source datasets such as Alpaca-GPT4, OpenOrca, and a synthetically generated math question-answering dataset called “Synth. Math-Instruct”. In the alignment tuning stage, the model was further fine-tuned using human preference data from datasets like Orca DPO Pairs, Ultrafeedback Cleaned, and a synthesized math alignment dataset called “Synth. Math-Alignment”.
The resulting instruction-tuned and alignment-tuned model, SOLAR 10.7B-Instruct, outperforms larger models like Mixtral 8x7B-Instruct on benchmark tasks, demonstrating the effectiveness of the training approach.
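SOLAR 10.7B Instruct can be tried locally in the same way as the other models in this list. A minimal sketch, assuming the `solar` tag in the Ollama library points to the instruct-tuned variant:

```bash
# Run SOLAR 10.7B locally; the "solar" tag is an assumption about the Ollama library.
ollama run solar "Write a SQL query that returns the top 5 customers by total order value."
```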
The Economics of Hosting an Open-Source Coding LLM on E2E’s Cloud Server
E2E Networks provides a wide range of cloud GPUs for hosting and running inference on these memory-hungry coding LLMs.
To measure the GPU memory requirements, let's spin up a GPU node on E2E and load these models.
We’ll be using a V100 32 GB GPU node for loading the models.
You can install Ollama to run the models. Ollama is a lightweight tool for serving and running LLMs locally, and it delivers fast inference speeds.
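On a Linux GPU node, Ollama can be installed with its official install script:

```bash
# Install Ollama on Linux. The install script registers a systemd service;
# if the server is not already running, start it manually with `ollama serve`.
curl -fsSL https://ollama.com/install.sh | sh
```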
Now let's run WizardCoder 33B using the following command:
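```bash
# Pull (on first use) and run the 33B WizardCoder model interactively.
# If the tag differs in your Ollama library version, check `ollama list` or the library page.
ollama run wizardcoder:33b
```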
To check the GPU usage, open another terminal and run the following command:
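```bash
# Show per-process GPU memory usage; prepend `watch -n 1` to refresh continuously.
nvidia-smi
```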
This is the output we received:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:01:01.0 Off | Off |
| N/A 27C P0 36W / 250W | 19082MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2643 C /usr/local/bin/ollama 19078MiB |
+-----------------------------------------------------------------------------+
This shows that WizardCoder 33B takes about 20 GB of GPU memory when deployed.
Using the same approach, we measured the GPU memory requirements of several other models:
- Mixtral 8x7B: 25 GB
- CodeLlama-70b-Instruct-hf: 30.8 GB
- Phind-CodeLlama-34B-v2: 20 GB
- StarCoder2-15b: 9.51 GB
Now let's assume that an organization has 1,000 developers and that, at any given time, about 1% of them are sending requests to the LLM concurrently. That means we need at least 10 instances of the deployed LLM to keep latency low and avoid requests queuing up. For a team of 2,000 developers we would need 20 instances, and so on.
Based on the GPU requirements measured above, we can take the median value, which is roughly 20 GB per instance.
Each instance consumes around 20 GB, and our team of 1,000 developers needs 10 instances, so the total memory requirement is about 200 GB.
That means we would need 8× V100 32 GB GPUs, giving a total of 256 GB of GPU memory and leaving extra headroom for resource overheads.
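The back-of-the-envelope sizing above can be written down as a small script, so it is easy to re-run for a different team size or per-instance footprint. The numbers below are the assumptions used in this article, not measurements:

```bash
# Back-of-the-envelope capacity estimate using the article's assumptions.
DEVELOPERS=1000
CONCURRENT_PCT=1        # % of developers querying the LLM at the same time
PER_INSTANCE_GB=20      # median GPU footprint measured above
GPU_GB=32               # V100 32 GB

INSTANCES=$(( DEVELOPERS * CONCURRENT_PCT / 100 ))
TOTAL_GB=$(( INSTANCES * PER_INSTANCE_GB ))
GPUS=$(( (TOTAL_GB + GPU_GB - 1) / GPU_GB ))    # ceiling division

echo "Instances needed : ${INSTANCES}"    # 10
echo "Total GPU memory : ${TOTAL_GB} GB"  # 200 GB
echo "Min V100 32GB    : ${GPUS}"         # 7 by memory alone; the article rounds up to 8 (two 4xV100 nodes)
```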
E2E Networks offers a 4xV100 GPU node for 1,80,000 INR per month. Since we would need two of those, the cost comes to roughly 3,60,000 INR per month with V100s.
However, we recommend using H100s instead, due to their lower latency and far higher compute capability. The HGX-powered 8xH100 cloud GPU node has a total GPU memory of 640 GB, so roughly 30 instances of our model could be launched on it, enough to cater to about 3,000 developers.
The cost for this series of cloud GPUs is 20,00,000 INR per month. It comes with 200 CPU cores, 1,800 GB of RAM, 21,000 GB of SSD storage, a combined memory bandwidth of 24 TB/s, and around 32 PetaFLOPS of compute, making it a powerful scale-up platform for demanding AI and high-performance computing workloads.
On the other hand, if you want to reduce costs (and can tolerate higher latency and slower response times), you could host a model with lower GPU requirements, such as StarCoder2-15B, on a smaller cloud GPU node like the 4xL4 on E2E Networks, which costs about 1,27,000 INR per month. It has 96 GB of GPU memory and can host up to 10 instances of StarCoder2-15B at 9.51 GB each.
References
Refer to this table for a comprehensive comparison of all the available open-source coding LLMs.