Every now and then, the open-source community releases an incredible new model, a groundbreaking dataset, or an improved training method. It arguably started with Dolly 2.0 by Databricks, and the journey since has seen the rise of exceptional rivals to GPT such as Llama 2, eventually reaching multimodal LLMs like LLaVA. The open-source community has never failed to amaze us when we needed it most, especially as large vendors moved their most capable AI applications behind paywalls. The journey of open-source LLMs is a testament to the power of collective intelligence and the spirit of sharing knowledge. As we move forward, we can only expect this trend to continue, bringing more advanced and accessible AI tools to the world.
Code LLMs
Alongside general-purpose language models that can tackle almost any task came LLMs specialized in coding. These models are trained on large datasets of code snippets and instructions, and they have opened up new possibilities in automated code generation, becoming invaluable tools for developers around the world. Each model adopts its own training method, which distinguishes it from its counterparts. StarCoder, an open-source language model, was trained on ‘The Stack’, a dataset of source code collected from GitHub. CodeLlama, another open-source model, was trained in a ‘fill-in-the-middle’ fashion, completing missing code based on the surrounding context. Despite their different training methods, both models have proven highly effective, demonstrating the versatility and potential of LLMs for coding. Both CodeLlama and StarCoder can generate high-quality code in a matter of seconds, with performance varying across benchmarks depending on the task.
Instruction Tuning
LLM capabilities can be further enhanced by training methods such as instruction tuning, in which an LLM is trained on instruction-response pairs in a supervised fashion. This makes the model better at following human instructions, as opposed to the transformer’s typical pre-training goal of predicting the next word in a text sequence. There are a handful of instruction tuning methods.
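For a concrete picture, instruction-tuning data is simply a collection of instruction-response pairs. Here is a toy example; the exact schema varies from project to project:

```python
# Toy instruction-tuning records: each pairs a natural-language instruction
# with the desired response. Real datasets hold tens of thousands of these.
instruction_data = [
    {
        "instruction": "Write a Python function that returns the factorial of n.",
        "response": "def factorial(n):\n    return 1 if n <= 1 else n * factorial(n - 1)",
    },
    {
        "instruction": "Explain what a list comprehension is in one sentence.",
        "response": "A list comprehension builds a new list by applying an "
                    "expression to each item of an iterable in a single line.",
    },
]
# During instruction tuning, the model is trained (supervised) to emit each
# response when given the corresponding instruction as its prompt.
```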
Self-Instruct Tuning
This method is aimed largely at reducing dependence on human annotators. Large datasets of instructions and responses can be created with little manual effort: outputs generated by an LLM itself are used to build the instruction data, and the LLM is then fine-tuned on that data. Self-instruct data has proven especially valuable for training automated code-generation models.
Evol-Instruct Tuning
Evol-Instruct is another method for enhancing LLM capability. It gradually develops more complex instructions: starting from an initial instruction set, the data is rewritten at each step into harder, richer variants, called evolved instructions. The resulting dataset is then used to fine-tune the LLM.
Magicoder
Magicoder is the latest development in the code LLM space, contributed by researchers from the University of Illinois at Urbana-Champaign and Tsinghua University. Released in December 2023, it set a new benchmark for open-source code LLMs. Despite its small size of roughly 7B parameters, Magicoder has outperformed much larger leading code LLMs at text-to-code generation, particularly for data science programs. The researchers achieved this milestone with yet another instruction tuning method, called OSS-Instruct, which uses open-source code snippets to produce high-quality, low-bias instruction data for fine-tuning the LLM.
OSS-Instruct ensures more diverse and realistic data for fine-tuning the LLM. Unlike other code LLMs, Magicoder is able to produce high-quality coding problems together with their solutions. Currently, the available configurations are Magicoder-DS and Magicoder-S-DS (6.7B parameters, built on DeepSeek-Coder) and Magicoder-CL and Magicoder-S-CL (7B parameters, built on Code Llama).
OSS-Instruct
OSS-Instruct functions by guiding an LLM to generate a coding problem along with its solution, using a seed code snippet obtained from freely available sources such as GitHub. The seed snippet steers the generation, encouraging the LLM to create diverse coding problems that reflect real-world programming scenarios. Because the seeds are real code, the resulting problems are not only varied but also practical and authentic, which makes the fine-tuning data more relevant to actual programming situations.
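To make this concrete, here is a minimal sketch of how an OSS-Instruct-style prompt could be assembled from a seed snippet. The prompt wording is an approximation; the exact template is shown in the figure from the paper below.

```python
# Illustrative OSS-Instruct-style prompt construction. The seed snippet
# would be sampled from open-source code (e.g. GitHub); the prompt text
# below approximates, but is not, the paper's exact template.
seed_snippet = '''\
def moving_average(values, window):
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]
'''

oss_instruct_prompt = f"""Please gain inspiration from the following random code \
snippet to create a high-quality programming problem.

Code snippet for inspiration:
{seed_snippet}

Present a complete, self-contained problem description, then provide a \
correct solution."""

# oss_instruct_prompt is then sent to a teacher model (the paper used
# gpt-3.5-turbo) to generate a (problem, solution) pair for fine-tuning.
```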
Here is the detailed prompt design specified in the paper:
Prompt design in OSS-Instruct (Image from paper)
A diverse dataset of 75K instruction samples is used to fine-tune Magicoder. This dataset, Magicoder-OSS-Instruct-75K, was generated through OSS-Instruct using gpt-3.5-turbo-1106 and is used to train both the Magicoder and Magicoder-S variants. A second dataset, Magicoder-Evol-Instruct-110K, is used to further train the S variants with Evol-Instruct.
Comparison & Performance
Magicoder outperforms state-of-the-art LLMs of various sizes across a broad spectrum of coding benchmarks. It shines at Python text-to-code problems, coding tasks in multiple other languages, and data-science challenges. Notably, Magicoder-S-DS-6.7B surpasses GPT-3.5-Turbo and Gemini Ultra on HumanEval. Here are some test results from the paper; further details are available on the leaderboard.
Overview of OSS-INSTRUCT and the pass@1 results of different LLMs on HumanEval (+) (Image from paper)
With that in mind, let’s delve into the practical aspects of this technology. In this tutorial, let’s explore how we can bring Magicoder to our table.
Prerequisites
This tutorial uses models and utilities from the Hugging Face ecosystem; the model weights are fetched from the Hugging Face Hub. Launch a Jupyter notebook on the E2E TIR AI Platform and log in with your Hugging Face credentials.
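Inside the notebook, you can authenticate with a Hugging Face access token, for example:

```python
# Log in to the Hugging Face Hub from inside the notebook.
# You will be prompted for an access token
# (created at https://huggingface.co/settings/tokens).
from huggingface_hub import notebook_login

notebook_login()
```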
Inference
We will perform inference using the Magicoder-CL-7B model. Import the required libraries.
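A minimal set for the steps that follow:

```python
# Core libraries: the transformers pipeline API for inference,
# and torch for selecting the tensor dtype.
from transformers import pipeline
import torch
```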
Define the prompt structure for Magicoder as shown.
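Here is a sketch of the template using the @@ Instruction / @@ Response format explained below; the system sentence follows the wording suggested on the model card:

```python
# Prompt template in Magicoder's @@ Instruction / @@ Response format.
MAGICODER_PROMPT = """You are an exceptionally intelligent coding assistant that \
consistently delivers accurate and reliable responses to user instructions.

@@ Instruction
{instruction}

@@ Response
"""
```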
@@ Instruction and @@ Response are special tokens used to structure the prompt for the model. They are not actual programming syntax; they serve as markers within the text prompt to delineate the input instruction from the expected response.
Now create the prompt.
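For example, with an illustrative instruction (substitute any coding task you like):

```python
# An illustrative task; any coding instruction works here.
instruction = "Write a Python function that checks whether a string is a palindrome."

prompt = MAGICODER_PROMPT.format(instruction=instruction)
```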
Define the pipeline with the parameters as shown.
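A sketch assuming the ise-uiuc/Magicoder-CL-7B checkpoint from the Hugging Face Hub:

```python
# Text-generation pipeline for Magicoder-CL-7B.
# device_map="auto" places the weights on the available GPU(s);
# bfloat16 halves memory use relative to float32.
generator = pipeline(
    task="text-generation",
    model="ise-uiuc/Magicoder-CL-7B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```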
task is set to ‘text-generation’ since we need to generate code from the input prompt. Other task options, such as ‘summarization’ and ‘question-answering’, are not required here.
Now generate and fetch the response. The maximum token length is set to 1024, and only a single sequence is returned. The temperature parameter adjusts the randomness of the generated text; feel free to try other values.
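For example:

```python
# Generate one completion; max_length caps prompt + generated tokens.
result = generator(
    prompt,
    max_length=1024,
    num_return_sequences=1,
    do_sample=True,
    temperature=0.2,  # low randomness; raise for more varied output
)
print(result[0]["generated_text"])
```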
That’s great. Since Magicoder is known for its advanced coding capabilities, let’s try problems of gradually increasing difficulty in other programming languages, as sketched below.
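The problems below are illustrative choices for this sketch, not prompts prescribed by the model:

```python
# Step up the difficulty and switch target languages, reusing the template.
tasks = [
    "Write a C program that prints the first n Fibonacci numbers.",
    "Implement a generic stack with push, pop, and peek in C++ using templates.",
    "Write a multithreaded Java program in which several workers safely "
    "increment a shared counter using synchronization.",
]
for task in tasks:
    out = generator(MAGICODER_PROMPT.format(instruction=task), max_length=1024)
    print(out[0]["generated_text"])
```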
Going a level higher, let’s test the model with a hard, tree-based LeetCode problem:
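As an illustrative stand-in, LeetCode 124, ‘Binary Tree Maximum Path Sum’, is a classic hard tree problem:

```python
# A hard, tree-based problem used as a stand-in example.
instruction = (
    "Solve the 'Binary Tree Maximum Path Sum' problem in Python: given the "
    "root of a binary tree, return the maximum sum of any non-empty path, "
    "where a path connects nodes through parent-child edges and may start "
    "and end at any nodes."
)
out = generator(MAGICODER_PROMPT.format(instruction=instruction), max_length=1024)
print(out[0]["generated_text"])
```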
Lastly, let’s evaluate how well this LLM performs when asked about deploying and serving other large language models.
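An illustrative prompt for this test:

```python
# Probe the model's knowledge of LLM deployment and serving.
instruction = (
    "Write Python code that serves a Hugging Face language model behind a "
    "FastAPI endpoint: load the model once at startup and expose a /generate "
    "route that accepts a prompt and returns the completion."
)
out = generator(MAGICODER_PROMPT.format(instruction=instruction), max_length=1024)
print(out[0]["generated_text"])
```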
The responses show that Magicoder is quite savvy with programming languages and is even up to date with the latest developments in the LLM space.
Wrapping Up
Congratulations! You have learned about Magicoder and how to perform inference with it using the Hugging Face ecosystem. Language models dedicated to generating code still have substantial room for improvement in both the efficiency and the quality of the code they produce. Researchers are actively addressing these shortcomings by improving the quality of open-source datasets and employing novel training techniques. Magicoder stands out for achieving remarkable results with just about 7B parameters, signaling a promising trajectory for the development of efficient, low-compute LLMs.
The E2E cloud platform is an ideal place to deploy state-of-the-art models like Magicoder into production. It offers user-friendly functionality for training, refining, and deploying code LLMs. You can create tailored inference endpoints with custom API handlers, or use the pre-built containers available in the Inference Endpoints section, which ship with ready-to-use API handlers. Depending on the complexity and demands of your models, you may need to adjust the scale of the underlying infrastructure. I hope you enjoyed this tutorial and found it useful.