Scaling Laws for Large Language Models
In this article, we will discuss the scaling laws and various scaling techniques for large language models. Scaling laws allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, so optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
Two distinct eras of Compute Usages in AI:
In the above figure, we can observe two distinct eras of compute usage in AI. Before 2012, the compute used by AI models grew in line with Moore's law. Moore's law states that the number of transistors in a dense integrated circuit (IC) doubles about every two years. But since 2012, the compute used in the largest training runs has doubled roughly every 3.4 months, which is about a 10x increase per year. Between 2012 and 2018, this amounts to more than a 300,000x increase for AI models.
Let's take the example of language modeling, a core Natural Language Processing task: predicting the next word in a sentence. The task has an associated loss function, which tells us how well the model is doing.
It is observed that if we increase the amount of compute provided and keep all other hyperparameters within a reasonable range, then larger models perform better, and their performance follows a power law.
What is Power Law?
A Power Law is an equation of the form F(x) = Cx^k
- When plotted on a log-log plot, it shows up as a straight line.
- k controls the slope while C controls the intercept.
- Innovations that change k have better scaling performance than innovations that affect C.
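The straight-line claim can be checked numerically: taking logs of F(x) = Cx^k gives log F(x) = log C + k·log x, a line with slope k and intercept log C. A minimal sketch (the constants C and k below are illustrative, not from any measured model):

```python
import math

# A power law F(x) = C * x**k becomes a straight line on a log-log plot:
# log F(x) = log C + k * log x, so the slope is k and the intercept is log C.
def power_law(x, C=2.0, k=-0.5):
    return C * x**k

# The slope in log-log space between any two points should equal k exactly.
x1, x2 = 10.0, 1000.0
slope = (math.log(power_law(x2)) - math.log(power_law(x1))) / (math.log(x2) - math.log(x1))
print(round(slope, 6))  # -0.5
```

This is also why changing k matters more than changing C: a different C just shifts the line up or down, while a different k changes how fast the loss falls as we scale.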
Why should you know scaling laws for Large Language Models?
Suppose you don't want to put a lot of compute into training a huge model, either because you are experimenting or because you don't have the required resources. What can you do in this case? You can train many small models with different architectures and training methods and predict how well they will perform after scaling to larger-size models. What we try to achieve by experimenting is a better slope, rather than just a good constant offset. There are scaling laws for compute, dataset size, and number of parameters.
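In practice, this extrapolation is a least-squares line fit in log-log space. A minimal sketch, using made-up parameter counts and loss values (not measurements from any real model):

```python
import math

# Hypothetical losses measured on a sweep of small models (values are invented).
params = [1e6, 3e6, 1e7, 3e7]
losses = [4.2, 3.8, 3.4, 3.1]

# Fit loss ≈ C * N**k by ordinary least squares in log-log space.
xs = [math.log(n) for n in params]
ys = [math.log(l) for l in losses]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
k = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
C = math.exp(my - k * mx)

# Extrapolate the fitted power law to a far larger model (3B parameters).
predicted = C * (3e9) ** k
print(f"k = {k:.3f}, predicted loss at 3B params: {predicted:.2f}")
```

If the small runs really do follow a power law, the fitted k tells you whether an architectural change improved the slope or merely the offset.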
If you are using compute optimally, model size should increase quickly, batch size should increase slowly, and the number of serial training steps should increase even more slowly. This trend is clearly shown in the picture below:
Large language models pose several practical challenges: cost, iteration time, and the engineering effort of setting up infrastructure. Addressing these challenges requires parallelization, and there are various techniques for it. One such technique is data parallelism, which splits the batch among replicas, as discussed below:
Data Parallelism
- Each replica has its own copy of the parameters, its own minibatch of data, and computes its own gradient.
- After gradients are computed, they are summed across all the replicas.
- All the replicas then apply identical gradients in tandem.
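The three steps above can be simulated on a single machine. A toy sketch (the quadratic loss and the three-replica setup are illustrative, not any particular framework's API):

```python
import numpy as np

# Toy data parallelism: every replica holds an identical copy of the parameters,
# computes a gradient on its own minibatch, the gradients are summed across
# replicas (the all-reduce), and each replica applies the same averaged update.
rng = np.random.default_rng(0)
params = np.zeros(4)                                        # shared parameter copy
minibatches = [rng.normal(size=(8, 4)) for _ in range(3)]   # one minibatch per replica

def grad(theta, batch):
    # Gradient of the mean squared error ||theta - x||^2 over the minibatch.
    return 2 * (theta - batch).mean(axis=0)

local_grads = [grad(params, b) for b in minibatches]  # step 1: per-replica gradients
total = np.sum(local_grads, axis=0)                   # step 2: all-reduce (sum)
avg = total / len(minibatches)
params = params - 0.1 * avg                           # step 3: identical update everywhere
print(params)
```

Because every replica's minibatch is the same size, the averaged gradient equals the gradient over the concatenated batch, so the replicas stay in lockstep.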
But data parallelism has a few limitations which are:
- Doesn't split up the parameters, so we will run out of memory for larger models.
- The gradient all-reduce at the end takes a roughly fixed amount of time per step for a given model. Therefore, as we split across more replicas and reduce compute time, the all-reduce becomes a larger fraction of each step.
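The second limitation is easy to see with a back-of-envelope model. The timings below are assumed round numbers, not measurements:

```python
# As we split a fixed batch over more replicas, per-replica compute time shrinks,
# but the all-reduce takes roughly constant time, so communication becomes a
# growing fraction of every training step.
compute_time_total = 100.0  # ms to process the full batch on one device (assumed)
allreduce_time = 5.0        # ms per step, roughly independent of replica count (assumed)

fracs = []
for replicas in [1, 4, 16, 64]:
    step = compute_time_total / replicas + allreduce_time
    frac = allreduce_time / step
    fracs.append(frac)
    print(f"{replicas:3d} replicas: all-reduce is {frac:.0%} of the step")
```

With these numbers, the all-reduce goes from a few percent of the step on one replica to the dominant cost on 64, which is why pure data parallelism eventually stops helping.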
Universal Relationship Between Batch Size and Training Time
Training time can be measured by the number of training steps in an idealized data parallel system that spends little time synchronizing between processors. The relationship between batch size and training time exhibits three distinct scaling regimes under this assumption: a ‘perfect scaling’ regime in which doubling the batch size reduces the number of training steps required to reach a target out-of-sample error, followed by a regime of diminishing returns, and finally a ‘maximal data parallelism’ regime, where further increasing the batch size does not reduce training time, assuming idealized hardware.
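These three regimes can be illustrated with the empirical relation from OpenAI's "An Empirical Model of Large-Batch Training", steps(B) ≈ S_min · (1 + B_noise / B), where B_noise is the critical batch size. The constants below are illustrative, not measured:

```python
# steps(B) = S_min * (1 + B_noise / B): the number of optimization steps needed
# to reach a target error as a function of batch size B.
S_min = 1000.0    # steps needed in the infinite-batch limit (assumed)
B_noise = 4096.0  # critical batch size (assumed)

def steps(B):
    return S_min * (1 + B_noise / B)

for B in [64, 128, 4096, 65536, 131072]:
    print(f"B = {B:6d}: {steps(B):9.0f} steps")
# B << B_noise: doubling B nearly halves the steps (perfect scaling).
# B ~ B_noise: diminishing returns.
# B >> B_noise: steps plateau near S_min (maximal data parallelism).
```

The critical batch size marks where the trade-off between parallelism and efficiency turns: below it, more replicas buy nearly proportional wall-clock speedup; above it, they mostly waste compute.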
Cloud GPUs for Large Language Models:
Cost optimization and a flexible environment are key factors when provisioning compute for Large Language Models. E2E Cloud provides market-leading GPUs and an accelerated computing platform suitable for experimenting with and building Large Language Model applications. So, why wait any longer? Reach us for your free trial at sales@e2enetworks.com