Mistral 7B: Pushing the Boundaries of LLM with Custom Datasets on E2E Networks’ TIR-AI Platform

April 15, 2025

Five months ago, Mistral AI published their strategic memo (see here), in which they set out their ambition to become a key player in the generative AI industry, emphasizing innovation, ethical deployment, and rivaling large incumbents like OpenAI. Today, their LLM, Mistral 7B, stands among the top-ranked 7-billion-parameter models available, and it even outperforms Meta's LLaMA 2 13B. So what is all the hype about? Why is it so powerful? We will dive deeper in this post, where I have also illustrated a step-by-step guide on how to implement it on E2E Networks' TIR-AI platform.

E2E Networks is India's largest hyperscaler, equipped with state-of-the-art NVIDIA GPUs, and offers virtual machines for heavy AI/ML workloads.

What Sets Mistral 7B Apart?

Mistral 7B is a lightweight large language model that is surprisingly robust and outperforms many of the best-performing models in its class. It surpasses Meta's LLaMA 2 13B in text generation and LLaMA 1 34B in mathematical and code generation, and it approaches the coding performance of CodeLlama 7B.

A High-Level Overview of Its Architecture

It leverages Grouped Query Attention (GQA) and Sliding Window Attention (SWA). GQA increases inference speed and reduces memory requirements during decoding, allowing for larger batch sizes and hence higher throughput, which helps in real-time applications such as conversational agents. SWA restricts each layer's attention to a fixed window of previous tokens (4,096 in Mistral 7B), so long sequences can be processed at a lower computational cost. In layman's terms, these reduce both computational and time complexity.
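For intuition, here is a minimal, hypothetical sketch (not Mistral's actual implementation) of how a sliding-window causal mask restricts each token to a fixed window of previous positions:

import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
   # Each query position may attend to itself and at most the previous
   # (window - 1) tokens, instead of the entire causal prefix.
   i = torch.arange(seq_len).unsqueeze(1)  # query positions
   j = torch.arange(seq_len).unsqueeze(0)  # key positions
   return (j <= i) & ((i - j) < window)    # True = attention allowed

# Toy example with a window of 3; Mistral 7B uses a 4,096-token window.
print(sliding_window_causal_mask(6, window=3).int())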

Getting Started on E2E Networks’ TIR-AI

The TIR-AI platform on E2E Networks provides a built-in, interactive JupyterLab interface. What sets it apart is that it ships with various Python frameworks pre-installed in the notebooks, matched to your use case. Let's get started:

Log in to My Account and you will be directed to the dashboard.

On clicking the TIR-AI Platform button, we will be directed to the TIR-AI Platform.

Hit CREATE NOTEBOOK, and we will get the option to create a new Jupyter notebook or import a notebook from an existing GitHub repository.

We can name the notebook anything of our choice. In this particular case the default name is ‘tir-notebook-8177’.

Then, most importantly, we can select the framework depending on our use case. For running Mistral 7B, we will be choosing PyTorch 2.

 

These are the various images available on TIR-AI.

Next, we choose the GPU/CPU Plan we wish to go with.

The above are the CPU Plans. The GPU Plans are listed below:

 

I went with the paid version of GDC.A100-16.115GB.

We also have the option to add our local system’s SSH keys.

Once all our options are selected, we can create the notebook. When the notebook is running, click on the three dots and select Launch Notebook.

Now the notebook is in the running state.

Then we will be directed to the JupyterLab interface:

Click on Python 3 under Notebook. We have now finished setting up JupyterLab, so let us get into the code.

Problem Statement

We have already discussed the potential of Mistral 7B. Now let us see what kind of problem it can solve. I took a custom Sales-KRA dataset from here. Sales executives who are new to the field often struggle with questions such as how to approach customers and how to generate leads. In this step-by-step guide, we will walk through how to address this by fine-tuning Mistral 7B on the custom data.

Code Snippets

Step 1: Install All the Dependencies


!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q -U datasets scipy ipywidgets matplotlib

Press Shift + Enter

To get the Sales KRA data, run this command:


!git clone https://huggingface.co/datasets/AdiOO7/SalesKRA

Press Shift + Enter

Step 2: Data Preparation


import json
import os
from sklearn.model_selection import train_test_split

# Path to your .jsonl file
file_path = 'SalesKRA/Dataset.jsonl'

# Directory to save the individual JSON files
train_dir = 'train_data'
val_dir = 'val_data'

# Make sure the directories exist
os.makedirs(train_dir, exist_ok=True)
os.makedirs(val_dir, exist_ok=True)

# Read the .jsonl file and collect the entries
entries = []
with open(file_path, 'r') as file:
    for line in file:
        # Parse the JSON object from each line
        data = json.loads(line.strip())
        entries.append(data)

# Split the entries into train and validation sets
train_entries, val_entries = train_test_split(entries, test_size=0.2, random_state=42)

# Function to save a list of entries to a specified directory
def save_entries(entries, directory):
    for i, entry in enumerate(entries):
        # Construct a file name
        file_name = f"entry_{i+1}.json"
        # Construct the full path
        full_path = os.path.join(directory, file_name)
        # Write the entry to the file in JSON format
        with open(full_path, 'w') as f:
            json.dump(entry, f, indent=4)

# Save the train and validation entries to their respective directories
save_entries(train_entries, train_dir)
save_entries(val_entries, val_dir)

print(f"Saved {len(train_entries)} entries to {train_dir}")
print(f"Saved {len(val_entries)} entries to {val_dir}")

Press Shift + Enter

The above code snippet will prepare the data in the format of a dictionary saved in each file that can be accessed for training.
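As a quick sanity check (the file name below follows the save_entries naming above), you can open one of the generated files and confirm which fields it contains:

import json

# Inspect a single saved training example to confirm its structure
with open('train_data/entry_1.json', 'r') as f:
   sample = json.load(f)

print(list(sample.keys()))
print(sample)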


def formatting_func(example):
    text = f"### Question: {example['input']}\n ### Answer: {example['output']}"
    return text

Press Shift + Enter
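To see what this template produces, you can run it on a hypothetical record (the 'input'/'output' keys mirror the function above; the actual Sales-KRA files use 'text' and 'response', which we account for later):

# Hypothetical example record, purely to illustrate the prompt template
example = {
   "input": "How do I generate more leads?",
   "output": "Define your target segment, then follow up consistently."
}
print(formatting_func(example))
# ### Question: How do I generate more leads?
#  ### Answer: Define your target segment, then follow up consistently.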

Step 3: Load Model and Dataset


import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

base_model_id = "mistralai/Mistral-7B-v0.1"
bnb_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_use_double_quant=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(base_model_id, quantization_config=bnb_config)

Press Shift + Enter


Loading checkpoint shards: 100% 2/2 [00:09<00:00, 4.39s/it]
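Thanks to the 4-bit NF4 quantization, the 7B model should occupy only a few GB of GPU memory. A quick check (transformers reports the footprint in bytes):

# Verify the memory footprint of the quantized model
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.2f} GB")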

tokenizer = AutoTokenizer.from_pretrained(
   base_model_id,
   padding_side="left",
   add_eos_token=True,
   add_bos_token=True,
)
tokenizer.pad_token = tokenizer.eos_token

Press Shift + Enter
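A quick check confirms that padding now reuses the EOS token (id 2 for this tokenizer) and that add_eos_token=True appends an EOS id to every encoded sequence:

# The pad token is now the EOS token
print(tokenizer.pad_token, tokenizer.pad_token_id)

# The encoded sequence starts with BOS (1) and ends with EOS (2)
print(tokenizer("Improve lead generation")["input_ids"])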


from datasets import load_dataset

# Load the datasets from the JSON files in the train/validation directories
train_dataset = load_dataset('json', data_files={'train': f'{train_dir}/*.json'})
val_dataset = load_dataset('json', data_files={'validation': f'{val_dir}/*.json'})

# Tokenize each example with the Mistral tokenizer loaded above
def generate_and_tokenize_prompt(example):
   example['input_ids'] = tokenizer.encode(example['text'], truncation=True, padding='max_length')
   return example

# Tokenize the datasets
tokenized_train_dataset = train_dataset['train'].map(generate_and_tokenize_prompt)
tokenized_val_dataset = val_dataset['validation'].map(generate_and_tokenize_prompt)

Press Shift + Enter

Out:


Resolving data files: 100% 88/88 [00:00<00:00, 10323.86it/s]
Generating train split: 88 examples
Resolving data files: 100% 22/22 [00:00<00:00, 2862.38it/s]
Generating validation split: 22 examples
Map: 100% 88/88 | Map: 100% 22/22
Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.

import matplotlib.pyplot as plt
​
def plot_data_lengths(tokenized_train_dataset, tokenized_val_dataset):
   lengths = [len(x['input_ids']) for x in tokenized_train_dataset]
   lengths += [len(x['input_ids']) for x in tokenized_val_dataset]
   print(len(lengths))
​
   # Plotting the histogram
   plt.figure(figsize=(10, 6))
   plt.hist(lengths, bins=20, alpha=0.7, color='blue')
   plt.xlabel('Length of input_ids')
   plt.ylabel('Frequency')
   plt.title('Distribution of Lengths of input_ids')
   plt.show()

Press Shift + Enter

plot_data_lengths(tokenized_train_dataset, tokenized_val_dataset)

Press Shift + Enter

Out:

110

[Histogram: distribution of input_ids lengths across the 110 tokenized examples]


max_length = 512 # This was an appropriate max length for my dataset

def formatting_func(example):
   # Correct the keys according to this dataset's format ('text' and 'response')
   text = f"### Question: {example['text']}\n### Answer: {example['response']}"
   return text

def generate_and_tokenize_prompt2(example):
   # Tokenize the formatted text, truncating/padding to a fixed max_length
   result = tokenizer(
       formatting_func(example),
       truncation=True,
       max_length=max_length,
       padding="max_length",
   )
   # Copy the input_ids to create labels for the causal language modeling objective
   result["labels"] = result["input_ids"].copy()
   return result

tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt2)
tokenized_val_dataset = val_dataset.map(generate_and_tokenize_prompt2)

Press Shift + Enter

Out:


Map: 100% 88/88 [00:00<00:00, 1620.17 examples/s]
Map: 100% 22/22 [00:00<00:00, 967.05 examples/s]

print(tokenized_train_dataset['train'][1]['input_ids'])

Press Shift + Enter​

Out:


[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 774, 22478, 28747, 315, 11371, 395, 430, 3810, 529, 2235, 304, 506, 264, 524, 5244, 298, 4916, 586, 727, 5411, 6266, 28723, 1824, 1023, 315, 511, 28804, 13, 27332, 26307, 28747, 15102, 1060, 2032, 9796, 778, 7000, 5944, 28725, 2231, 6790, 28748, 14049, 346, 6400, 28725, 808, 8305, 404, 354, 3694, 8063, 28725, 26518, 1255, 520, 1308, 28725, 1149, 395, 2936, 9796, 298, 1813, 14019, 28725, 1388, 15667, 298, 312, 14978, 28723, 2]
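The long run of 2s at the start is left-side padding with the EOS token up to max_length=512. Decoding the same entry (skipping special tokens) should recover the formatted question/answer text:

# Decode the padded example back to text to verify the formatting survived tokenization
print(tokenizer.decode(tokenized_train_dataset['train'][1]['input_ids'], skip_special_tokens=True))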

eval_prompt = " The following is KRA of Sales Executive: # "

tokenizer = AutoTokenizer.from_pretrained(
   base_model_id,
   add_bos_token=True,
)
​
model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")
​
model.eval()
with torch.no_grad():
   print(tokenizer.decode(model.generate(**model_input, max_new_tokens=256, repetition_penalty=1.15)[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

The following is KRA of Sales Executive: # 1023

- To identify and develop new business opportunities in the marketplace.
- To manage existing accounts to ensure that they are satisfied with our services, and to maximize their potential for growth.
- To work closely with other departments within the company to ensure that we provide a high level of service to all customers.
- To maintain accurate records of all sales activity, including customer contact details, orders placed, and payments received.
- To attend trade shows and exhibitions as required, in order to promote our products and services.
- To keep up to date with industry trends and developments, so that we can offer our customers the best possible advice and support.
- To be responsible for your own personal development, by attending training courses and keeping abreast of changes in legislation or technology which may affect your role.

from peft import prepare_model_for_kbit_training
​
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

def print_trainable_parameters(model):
   """
   Prints the number of trainable parameters in the model.
   """
   trainable_params = 0
   all_param = 0
   for _, param in model.named_parameters():
       all_param += param.numel()
       if param.requires_grad:
           trainable_params += param.numel()
   print(
       f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
   )

print(model)

Press Shift + Enter

Out:


MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm()
        (post_attention_layernorm): MistralRMSNorm()
      )
    )
    (norm): MistralRMSNorm()
  )
  (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
)

from peft import LoraConfig, get_peft_model
​
config = LoraConfig(
   r=32,
   lora_alpha=64,
   target_modules=[
       "q_proj",
       "k_proj",
       "v_proj",
       "o_proj",
       "gate_proj",
       "up_proj",
       "down_proj",
       "lm_head",
   ],
   bias="none",
   lora_dropout=0.05,  # Conventional
   task_type="CAUSAL_LM",
)
​
model = get_peft_model(model, config)
print_trainable_parameters(model)

Press Shift + Enter

Out:


trainable params: 85041152 || all params: 3837112320 || trainable%: 2.2162799758751914

print(model)

Press Shift + Enter

Out:


PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MistralForCausalLM(
      (model): MistralModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x MistralDecoderLayer(
            (self_attn): MistralAttention(
              (q_proj): Linear4bit(
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=32, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=32, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
              )
              (k_proj): Linear4bit(
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=32, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=32, out_features=1024, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (base_layer): Linear4bit(in_features=4096, out_features=1024, bias=False)
              )
              (v_proj): Linear4bit(
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=32, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=32, out_features=1024, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (base_layer): Linear4bit(in_features=4096, out_features=1024, bias=False)
              )
              (o_proj): Linear4bit(
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=32, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=32, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
              )
              (rotary_emb): MistralRotaryEmbedding()
            )
            (mlp): MistralMLP(
              (gate_proj): Linear4bit(
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=32, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=32, out_features=14336, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (base_layer): Linear4bit(in_features=4096, out_features=14336, bias=False)
              )
              (up_proj): Linear4bit(
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=32, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=32, out_features=14336, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (base_layer): Linear4bit(in_features=4096, out_features=14336, bias=False)
              )
              (down_proj): Linear4bit(
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=14336, out_features=32, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=32, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (base_layer): Linear4bit(in_features=14336, out_features=4096, bias=False)
              )
              (act_fn): SiLU()
            )
            (input_layernorm): MistralRMSNorm()
            (post_attention_layernorm): MistralRMSNorm()
          )
        )
        (norm): MistralRMSNorm()
      )
      (lm_head): Linear(
        in_features=4096, out_features=32000, bias=False
        (lora_dropout): ModuleDict(
          (default): Dropout(p=0.05, inplace=False)
        )
        (lora_A): ModuleDict(
          (default): Linear(in_features=4096, out_features=32, bias=False)
        )
        (lora_B): ModuleDict(
          (default): Linear(in_features=32, out_features=32000, bias=False)
        )
        (lora_embedding_A): ParameterDict()
        (lora_embedding_B): ParameterDict()
      )
    )
  )
)

from accelerate import FullyShardedDataParallelPlugin, Accelerator
from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig
​
fsdp_plugin = FullyShardedDataParallelPlugin(
   state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=False),
   optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=False),
)
​
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)

Press Shift + Enter

Out:


Detected kernel version 5.4.242, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.

model = accelerator.prepare_model(model)

Press Shift + Enter


!pip install -q wandb -U
​
import wandb, os
wandb.login()
​
wandb_project = "journal-finetune"
if len(wandb_project) > 0:
   os.environ["WANDB_PROJECT"] = wandb_project

Press Shift + Enter

Out:


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Out:


WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
WARNING: You are using pip version 21.2.4; however, version 23.3.1 is available.
You should consider upgrading via the '/usr/bin/python -m pip install --upgrade pip' command.

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
 ········

wandb: Appending key for api.wandb.ai to your netrc file: /home/jovyan/.netrc

if torch.cuda.device_count() > 1: # If more than 1 GPU
   model.is_parallelizable = True
   model.model_parallel = True
   

Press Shift + Enter

Step 4: Training the Model


import transformers
from datetime import datetime
​
project = "kra-finetune"
base_model_name = "mistral"
run_name = base_model_name + "-" + project
output_dir = "./" + run_name
​
if tokenizer.pad_token is None:
   tokenizer.pad_token = tokenizer.eos_token
​
# You might need to resize the token embeddings in your model if you added a new token
model.resize_token_embeddings(len(tokenizer))
​
​
trainer = transformers.Trainer(
   model=model,
   train_dataset=tokenized_train_dataset['train'],
   eval_dataset=tokenized_val_dataset['validation'],
   # Example hyperparameters only; adjust them for your dataset and GPU budget
   args=transformers.TrainingArguments(
       output_dir=output_dir,
       per_device_train_batch_size=2,
       gradient_accumulation_steps=4,
       max_steps=500,
       learning_rate=2.5e-5,
       bf16=True,
       logging_steps=25,
       report_to="wandb",
       run_name=run_name + "-" + datetime.now().strftime("%Y-%m-%d-%H-%M"),
   ),
   data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

Press Shift + Enter
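Once training completes, here is a minimal sketch for persisting just the LoRA adapter weights (a small fraction of the full model) and reloading them later on top of the base model:

# Save only the LoRA adapter weights and the tokenizer
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

# Later, for inference: reload the quantized base model and attach the adapter
# from peft import PeftModel
# base = AutoModelForCausalLM.from_pretrained(base_model_id, quantization_config=bnb_config)
# ft_model = PeftModel.from_pretrained(base, output_dir)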