The TinyLlama model is an interesting development in the space of Large Language Models (LLMs), and it is being pre-trained as I write this article. The model is a slimmed-down derivative of the open-source Llama 2 LLM developed by Meta. TinyLlama has 1.1B parameters and aims to train on 3 trillion tokens within 90 days, where a token is essentially a unit of text used as input to an LLM. The training commenced on 1 September 2023. This experiment, started by Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu, has the potential, if completed successfully, to become a breakthrough in building capable LLMs with less computing power.
What Is the Motivation Behind TinyLlama?
Llama 2, which is the precursor of TinyLlama, is a family of open-source LLMs for text generation. The family includes small models that can be run locally on Windows or Mac without the need for expensive GPUs. This was a major step forward in making LLMs generally available, including on devices with low computing power, such as mobile phones. The smallest version of Llama 2 has 7B parameters and was trained on 2T tokens. The Chinchilla Scaling Law states that the number of tokens required to train a model should be roughly 20 times the number of parameters in the model; for example, a Llama 2 model with 7B parameters would need about 140B tokens for compute-optimal pre-training. It is claimed that adding tokens beyond this Chinchilla-optimal data size will not yield significant improvements in model performance.
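To make the rule of thumb concrete, here is a small back-of-the-envelope calculation in Python. This is only a sketch: the 20x factor is the approximation quoted above, not an exact law.

# Back-of-the-envelope Chinchilla estimate: tokens ~ 20 x parameters.
# The 20x ratio is the rough approximation quoted in the text, not an exact law.
def chinchilla_optimal_tokens(num_params: float, ratio: float = 20.0) -> float:
    """Return the roughly compute-optimal number of training tokens."""
    return ratio * num_params

for name, params in [("Llama 2 7B", 7e9), ("TinyLlama 1.1B", 1.1e9)]:
    print(f"{name}: ~{chinchilla_optimal_tokens(params) / 1e9:.0f}B tokens")
# Prints roughly 140B tokens for Llama 2 7B and 22B tokens for TinyLlama 1.1B.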
TinyLlama is an experiment designed to challenge exactly this claim. The 3T tokens to be used for pre-training vastly exceed the optimal size recommended by the Chinchilla Scaling Law. The idea is to find out whether a model with fewer parameters can become more capable at language tasks if it is trained long enough on large amounts of data. This hypothesis is supported by the fact that training metrics, such as accuracy and perplexity, had not reached saturation for Llama 2 and earlier LLMs when their pre-training was completed. As shown in the comparison plots taken from the Llama 2 paper, the training loss curves do not plateau even after the models have consumed all of their training tokens. There is still room for improvement in these models through pre-training on additional data; the limiting factor is the compute budget, given their huge parameter counts. In contrast, TinyLlama is a scaled-down model with relatively few parameters. It is being trained on 16 A100-40G GPUs, with an approximate budget of 40K USD. Thus, the model can afford to be trained on data of an enormous size while leaving a much smaller carbon footprint.
Model Architecture
The model has the same transformer architecture as Llama 2. Specifically, it applies Root Mean Square Layer Normalization (RMSNorm) [1] as pre-normalization, normalizing the input to each transformer sub-layer rather than its output, which improves training stability. The model further uses the SwiGLU [2] activation function, Rotary Positional Embedding (RoPE) [3], and Grouped Query Attention (GQA) [4] to improve training performance. TinyLlama is trained on a mixture of SlimPajama (with its GitHub subset excluded) and Starcoderdata. To scale down the Llama 2 architecture, the model uses 22 layers, 32 attention heads (grouped into 4 key-value heads for GQA), an embedding dimension of 2048, and an intermediate dimension of 5632, leading to ~1.1B parameters. The implementation applies several optimization techniques, such as fused layer norm, fused SwiGLU, and FlashAttention, for speed-up, achieving a throughput of 24K tokens per second per A100-40G GPU.
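For readers who want to see how these hyperparameters fit together, below is a minimal sketch that builds a Llama-style model with TinyLlama's scaled-down dimensions using the Hugging Face transformers library. This is not the authors' training code, and values not cited above (such as the vocabulary size) are assumptions.

# A minimal sketch of TinyLlama's scaled-down Llama architecture using
# Hugging Face transformers. Not the authors' training code; the vocabulary
# size below is an assumption (Llama 2 tokenizer).
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    hidden_size=2048,          # embedding dimension
    intermediate_size=5632,    # SwiGLU feed-forward dimension
    num_hidden_layers=22,      # transformer blocks
    num_attention_heads=32,    # query heads
    num_key_value_heads=4,     # key-value heads for grouped query attention
    vocab_size=32000,          # Llama 2 tokenizer vocabulary (assumed)
)

# RMSNorm pre-normalization, SwiGLU, and RoPE are built into the Llama architecture.
model = LlamaForCausalLM(config)
num_params = sum(p.numel() for p in model.parameters())
print(f"~{num_params / 1e9:.2f}B parameters")  # comes out to roughly 1.1B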
Current Progress
The training progress of the model can be tracked live here. At the time of writing this article, the model has been trained on over 1T tokens and is in its second epoch. The model hit the 1T-token milestone on 3 October 2023. The cross-entropy training loss as well as the validation loss continue to decrease, with no visible signs of saturation. This shows that the model is training well and is not overfitting. Note that the model has already crossed the Chinchilla-optimal data size, which is 22B tokens (20 * 1.1B = 22B). The latest checkpoints for the base model and the chat model after pre-training on 1T tokens have been released and can be found on the model's GitHub page. The authors also provide a template script, based on the Hugging Face transformers library, to download the model and interact with the chat interface. The model can be run for free on a Google Colab instance with a T4 GPU. I interacted with the chat interface on Google Colab and found the results promising, as shown below:
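The snippet below is a minimal sketch of the kind of Colab script I used, built on the transformers text-generation pipeline. The checkpoint name and the chat prompt format are assumptions based on the intermediate chat releases, not the authors' exact template.

# A minimal sketch of chatting with an intermediate TinyLlama chat checkpoint
# on a free Colab T4 GPU. The checkpoint name and the ChatML-style prompt are
# assumptions, not the authors' exact template script.
import torch
from transformers import pipeline

chat = pipeline(
    "text-generation",
    model="TinyLlama/TinyLlama-1.1B-Chat-v0.3",  # assumed intermediate chat checkpoint
    torch_dtype=torch.float16,                   # fits comfortably on a T4 GPU
    device_map="auto",
)

prompt = (
    "<|im_start|>user\n"
    "Explain the Chinchilla Scaling Law in one sentence.<|im_end|>\n"
    "<|im_start|>assistant\n"
)
result = chat(prompt, max_new_tokens=128, do_sample=True, temperature=0.7)
print(result[0]["generated_text"])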
As can be seen from the above script, an advantage of open-source LLMs is that the user has full control over the chat interface and can tune the prompts as desired.
Potential Use Cases
TinyLlama, being a small model, can be easily deployed on devices with limited memory and computing power. This could enable novel end-user applications on mobile devices, such as chat dialogue generation in video games. An interesting use case of TinyLlama, which was verified by the authors, is speculative decoding for large LLMs at inference time: the small model drafts a sequence of candidate tokens, which the large model then verifies in a single forward pass, reducing inference latency. A sketch of this setup follows below.
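The snippet below illustrates the idea using the assisted-generation feature of the Hugging Face transformers library, with TinyLlama drafting tokens for a larger Llama 2 model. The model names are placeholders, and this is not the authors' benchmark setup.

# A sketch of speculative (assisted) decoding: TinyLlama drafts candidate tokens,
# and the larger Llama 2 model verifies them in a single forward pass.
# Model names are placeholders; this is not the authors' benchmark setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
target = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16, device_map="auto"
)
draft = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-intermediate-step-480k-1T",  # assumed intermediate checkpoint
    torch_dtype=torch.float16,
    device_map="auto",
)

inputs = tokenizer("The Chinchilla Scaling Law states that", return_tensors="pt").to(target.device)
# assistant_model enables assisted generation; both models must share a tokenizer,
# which holds here because TinyLlama reuses the Llama 2 tokenizer.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))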
Parting Thoughts
The outlook for the TinyLlama project is promising. If the model can successfully consume all 3T tokens without reaching saturation, it would give the AI community deeper insight into LLM training. Even if the model saturates early, we can probably gain insights into the phenomenon of saturation itself. However, what remains to be seen is how well TinyLlama compares in performance with Llama 2 and other LLMs. If the gain in model performance is modest, we need to consider whether it is worth investing the time and resources in feeding additional data to a tiny model. To use an LLM effectively in a production chat application, the responses should be concise and accurate, and the inference latency should be low. The current pre-training report does not provide any pointers on how the model compares with previous LLMs on downstream tasks; evaluating this would be an essential next step toward fully comprehending the potential of the model. Nevertheless, the training run in itself provides useful insights into the capabilities of tiny models and the potential of improving the learned patterns by using more training data than the Chinchilla Scaling Law suggests.
References
Here are some references that might be of interest to you as you delve deeper into this model:
[1] Root Mean Square Layer Normalization
[2] GLU Variants Improve Transformer
[3] RoFormer: Enhanced Transformer with Rotary Position Embedding
[4] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints