Fine-Tuning Your Own Personal Copilot

April 2, 2025

In the ever-evolving realm of programming and software development, the pursuit of efficiency and productivity has given rise to remarkable innovations. Among these innovations are code generation models like Codex, StarCoder, and Code Llama. These models have showcased their remarkable ability to generate code snippets that resemble human-written code, thereby highlighting their immense potential as valuable coding assistants.

Fine-Tuning Your Own Personal Copilot

The ability to fine-tune and adapt large-scale language models to specific tasks has become a game-changer. It allows us to harness the power of pre-trained models and customize them to our own needs. In this blog post, we'll explore the concept of fine-tuning and how it can be used to create your own personal copilot for various tasks, using examples and references from the field.

The Challenge of Full Fine-Tuning

Full fine-tuning of LLMs is a resource-intensive endeavor. The hardware requirements for full fine-tuning can be daunting, as exemplified by the numbers provided:

Weight: 2 bytes (Mixed-precision training)
Weight gradient: 2 bytes
Optimizer state when using Adam: 4 bytes for original FP32 weight + 8 bytes for first and second moment estimates
Cost per parameter adding all of the above: 16 bytes per parameter
For a massive model like bigcode/starcoder with 15.5 billion parameters, this adds up to 248GB of GPU memory.

Considering the huge memory requirements for storing intermediate activations, full fine-tuning would require a minimum of 4 A100 80GB GPUs. Such a set-up is not only costly but also hard to come by. So, the question arises: how can we fine-tune these models efficiently without such astronomical hardware requirements?

Parameter-Efficient Fine-Tuning (PEFT) with QLoRA

To tackle the resource constraints, a more efficient approach is required. This is where Parameter-Efficient Fine-Tuning (PEFT) comes into play, particularly when coupled with techniques like QLoRA. Here's how it works:

Trainable parameters are significantly reduced, with a much smaller proportion of the original model being updated.
In the case of bigcode/starcoder, the trainable parameters are only about 0.7% of the total.

This reduction in trainable parameters results in significantly lower memory requirements:

Base model weight: 7.755 GB
Adapter weight: 0.22 GB
Weight gradient: 0.12 GB
Optimizer state when using Adam: 1.32 GB
In total, only about 10GB of GPU memory is needed for fine-tuning.

This is a game-changer, as it means you can carry out fine-tuning on a single A100 40GB GPU, which is much more accessible. The reduced memory requirements are made possible by leveraging techniques like Flash Attention V2 and Gradient Checkpointing.

Flash Attention V2 and Gradient Checkpointing

To make the process even more memory-efficient, Flash Attention V2 and Gradient Checkpointing are employed. These techniques significantly reduce the memory required for intermediate activation checkpointing.

For QLoRA with Flash Attention V2 and Gradient Checkpointing, the memory occupied by the model on a single A100 40GB GPU is just 26GB, even with a batch size of 4.

For full fine-tuning using FSDP (Fully Sharded Data Parallel) along with Flash Attention V2 and Gradient Checkpointing, the memory occupied per GPU ranges between 70GB to 77.6GB, with a per_gpu_batch_size of 1.

This combination of techniques makes it possible to fine-tune large language models without the need for an array of high-end GPUs.

Efficient Training of Language Models to Fill-in-the-Middle

Introduction

Language models have evolved significantly in recent years, and they are now more capable than ever before. But what if you could customize your own personal copilot, a model that can assist you in a highly specialized manner?

Fine-Tuning for Specialized Skills

Fine-tuning language models is the process of training pre-existing models to perform specific tasks or possess new skills. Recent research has revealed that this approach can be highly effective, allowing models to acquire skills beyond their original capabilities with minimal additional training. One notable skill that has been explored is Fill-in-the-Middle (FIM) capabilities, where models can generate content in the middle of a document.

Understanding the FIM-for-Free Property

One significant finding is the concept of the ‘FIM-for-Free’ property. It suggests that with the same amount of computational resources, FIM models can achieve similar performance as traditional left-to-right models in terms of standard language modeling tasks while also excelling in FIM tasks. In essence, fine-tuning models with FIM does not compromise their original abilities, making it a cost-effective approach to impart new skills.

Choosing the Right FIM Hyperparameters

When fine-tuning models for FIM, several hyperparameters play a crucial role in achieving optimal performance:

Character-Level Spans: Applying FIM transformation at the character level is recommended. This level of granularity allows the model to generate sensible completions, even when the prefix and suffix end within the middle of a token.
Context-Level vs. Document-Level FIM: Context-level FIM consistently outperforms document-level FIM in a range of scenarios. However, document-level FIM may be preferred for its simplicity in some implementations.
FIM Rate: The research indicates that FIM rates between 50% and 90% are reasonable choices. Higher FIM rates can significantly enhance model capabilities without undermining left-to-right skills.

Enhancing Infilling Capabilities

Infilling is the task of generating content to complete gaps within a document, which can be particularly useful for coding and text generation tasks. While the research focused on single-slot in filling, there are various directions for improving infilling capabilities:

Smarter Span Selection: Selecting spans in a more meaningful way, such as based on syntax or semantics, can enhance infilling performance.
Steerable Generation: Providing more control over the style and content of in-filled text can improve the usability of fine-tuned models.
Multi-Slot Infilling: While the research primarily examined single-slot infilling, extending this to multi-slot infilling presents new challenges and opportunities for customization.

The Future of Fine-Tuning Language Models

The ability to fine-tune language models for specific skills or tasks has the potential to revolutionize how we interact with AI systems. By optimizing hyperparameters and developing smarter selection methods, we can create highly specialized personal copilots that seamlessly assist us in a variety of domains.

In conclusion, fine-tuning language models for specific capabilities, such as FIM, opens up exciting possibilities for customized AI assistance. As the research in this field continues to evolve, we can expect even more powerful and versatile language models that cater to our individual needs.

Tutorial

If you require extra GPU resources for the tutorials ahead, you can explore the offerings on E2E CLOUD. We provide a diverse selection of GPUs. To get one, head over to MyAccount, and sign up. Then launch a GPU node as is shown in the screenshot below:

‍

Make sure you add your ssh keys during launch, or through the security tab after launching.

Once you have launched a node, you can use VSCode Remote Explorer to ssh into the node and use it as a local development environment.

Personal Copilot - Fine-Tuned

In this tutorial, we will guide you through the process of setting up a Personal Copilot project on a GPU using Jupyter Notebook. We'll cover the essential steps, from checking GPU availability to running a deep learning training script. The tutorial assumes that you have a Jupyter Notebook environment with access to a GPU.

Prerequisites

Access to a Jupyter Notebook environment with GPU support.
Basic knowledge of Python and deep learning concepts.
Familiarity with Jupyter Notebook operations.

Step 1: Checking GPU Availability

To check if your Jupyter Notebook environment is connected to a GPU, you can use the following code snippet:


gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

This code runs the nvidia-smi command to fetch GPU information. If it contains the word ‘failed’, it means you're not connected to a GPU.

Step 2: Cloning a GitHub Repository

To get the code and data for your deep learning project, you can clone a GitHub repository. Use the following command to clone a repository from GitHub:


!git clone https://github.com/pacman100/DHS-LLM-Workshop.git

Replace the GitHub repository URL with the one you want to clone.

Step 3: Preparing the Environment

Next, you need to set up the Python environment for your project. Execute the following commands:


!pip install packaging
!pip uninstall -y ninja && pip install ninja
!ninja --version
!echo $?

These commands install necessary Python packages and check the version of Ninja, a build system used for some deep learning libraries.

Step 4: Changing Directory

Navigate to the cloned GitHub repository using the %cd command:


%cd /content/DHS-LLM-Workshop

This command changes the current working directory to your project's directory.

Step 5: Updating the Repository

To ensure you have the latest code and data, run the following commands to pull the latest updates from the GitHub repository:


!git pull
!pip install -r requirements.txt

These commands fetch the latest code changes and install the required Python packages defined in the requirements.txt file.

Step 6: Installing Additional Libraries

Install any additional libraries your project might need. For example, you can install the flash-attn library using the following command:


!pip install flash-attn --no-build-isolation

This command installs the ‘flash-attn’ library without building isolations.

Step 7: Login to External Services

In some deep learning projects, you might need to log in to external services, such as Weights & Biases (WandB) and the Hugging Face Hub. Use the following commands to log in:


!wandb login --relogin


from huggingface_hub import notebook_login


notebook_login()

These commands log you into Weights & Biases and the Hugging Face Hub. Here you have to input your key which can be accessed from here: Weights & Biases (WandB) and Huggingface Hub.

Step 8: Training Your Model

Now, you're ready to train your deep learning model. Execute the following commands to start training.

Before you start training the model, please get access to bigcode/starcoderbase-1b. This is required in order to access the model. Once you get the access, you can resume again.


%cd personal_copilot/training/


!git lfs install
!git pull
!python train.py \
    --model_path "bigcode/starcoderbase-1b" \
    --dataset_name "smangrul/hf-stack-v1" \
    --subset "data" \
    --data_column "content" \
    --split "train" \
    --seq_length 2048 \
    --max_steps 2000 \
    --batch_size 8 \
    --gradient_accumulation_steps 2 \
    --learning_rate 5e-5 \
    --lr_scheduler_type "cosine" \
    --weight_decay 0.01 \
    --num_warmup_steps 30 \
    --eval_freq 100 \
    --save_freq 500 \
    --log_freq 25 \
    --num_workers 4 \
    --bf16 \
    --no_fp16 \
    --output_dir "starcoderbase1b-personal-copilot" \
    --fim_rate 0.5 \
    --fim_spm_rate 0.5 \
    --use_flash_attn

These commands run your training script with various configuration options. Modify the options to suit your project.

Congratulations! You've successfully set up and trained a Personal Copilot project on a GPU in a Jupyter Notebook environment. You can use these steps as a template for your deep learning projects.

Conclusion

Fine-tuning large language models for specific tasks doesn't have to be an overwhelming endeavor. With techniques like Parameter-Efficient Fine-Tuning (PEFT), QLoRA, Flash Attention V2 and Gradient Checkpointing, it's possible to achieve remarkable results with more accessible hardware. These advancements are making the power of large language models available to a broader range of users, opening up exciting possibilities for developing your own personal copilot in various domains.

References

Research Paper: Efficient Training of Language Models to Fill in the Middle

Hugging Face Reference: https://huggingface.co/bigcode/starcoderbase-1b

Hugging Face Blog: https://huggingface.co/blog/personal-copilot

Sign up for Free Trial

Latest Blogs

A vector illustration of a tech city using latest cloud technologies & infrastructure

Fine-Tuning Your Own Personal Copilot

April 2, 2025

Hady Khamis Khan

Sign up for Free Trial

Example H2