What Are Context Windows?
Context windows in Large Language Models (LLMs) refer to the number of tokens a model can accept as input in a single prompt. This number indicates how much information can be fed into the model at once for it to condition its response on. It is an important metric because it strongly influences the quality of the results a model can produce, or in other words, how much information can be extracted from the model in a single pass. Models like GPT-3 accept around 2,000 tokens (2,048 to be exact), while GPT-4 can go up to 32,000 (32,768).
Advantages of Large Context Windows
The size of the context window determines how much information the model can keep in mind while generating a response. Larger is therefore generally better: the bigger the context window, the more the model can work with at once.
Consider the case of a book. Suppose you want answers to a particular question based on the contents of that book. With a small context window you have three straightforward (essentially brute-force) ways to do this. Option 1 is to summarize the book so that all the required details, along with the query, fit inside the context window. Option 2 is to include the book in the model's original training data, which is rarely feasible. Option 3 is to frame a series of successive queries, structured so that all the book details needed for the task can be fed in as reference. As is evident, all of these methods are complicated and far from ideal. But if the context window were large enough, you could simply feed the entire book into the model along with the query in a single go and not worry about anything else. This is exactly why a larger context window is preferred.
Why Don't All Models Have Large Context Windows?
Larger context windows are always more favorable. However, they are not exactly feasible per se. A large context window requires much more computation during training and inference; in other words, training a model with a large context window is very expensive. And without explicitly training a base model for a large context window, one cannot expect to use one: fine-tuning a model trained on a small context window to handle larger ones does not work. Hence the only way to do it is to train the base model itself with the larger context window. However, in the conventional transformer architecture, the computational cost of self-attention grows quadratically with the context length, because every token attends to every other token. This makes training models with larger and larger context windows, in the conventional sense, very expensive.
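To see why the cost is quadratic, note that self-attention builds an n × n matrix of scores between every pair of tokens. The minimal sketch below (plain PyTorch, illustrative shapes only) shows how the number of entries in that matrix, and hence compute and memory, roughly quadruples each time the context length doubles.

```python
import torch

d = 64  # per-head dimension (illustrative)
for n in (2048, 4096, 8192):      # context lengths
    q = torch.randn(n, d)         # queries
    k = torch.randn(n, d)         # keys
    scores = q @ k.T              # (n, n) attention score matrix
    print(f"context {n}: {scores.numel():,} score entries")
# context 2048:  4,194,304 entries
# context 4096: 16,777,216 entries (4x)
# context 8192: 67,108,864 entries (16x)
```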
Open Source LLMs with Large Context Windows
- MPT-7B (Mosaic Pretrained Transformer - 7 Billion)
MPT-7B is a series of open source LLMs with a notable speciality: a 65K context length! The ability comes from the ALiBi paper, which replaces the positional sinusoidal encoding at the bottom of the conventional transformer architecture with Attention with Linear Biases (ALiBi) applied at the attention heads. This change speeds up training and allows models with larger context windows to be trained, and it is what lets MPT-7B reach a context length of 65K. The base model can then be fine-tuned to create MPT-7B-StoryWriter-65k+, which can fit entire books like ‘The Great Gatsby’ into a single prompt and generate an epilogue for them. ALiBi also allows StoryWriter to work with even longer context lengths than it was trained on (65K), up to 84K in some test cases.
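To make the idea concrete, here is a minimal sketch of how ALiBi-style biases can be constructed. This illustrates the technique from the ALiBi paper, not MPT-7B's actual implementation: each attention head adds a fixed, distance-proportional penalty to its attention scores instead of relying on learned or sinusoidal position embeddings.

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Build a (n_heads, seq_len, seq_len) bias added to attention scores.

    Each head penalizes distant query/key pairs linearly with its own slope,
    so no positional embeddings are needed and longer contexts extrapolate.
    """
    # Slopes form a geometric sequence, e.g. 1/2, 1/4, ..., 1/256 for 8 heads
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = torch.arange(seq_len)
    # distance[i, j] = j - i (<= 0 for past tokens); future positions are clamped
    # to 0 here and removed by the causal mask in practice
    distance = (pos[None, :] - pos[:, None]).clamp(max=0)
    return slopes[:, None, None] * distance[None, :, :]

bias = alibi_bias(n_heads=8, seq_len=16)  # added to q @ k.T / sqrt(d) before softmax
```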
- Claude’s 100K Context Window
Claude by Anthropic has moved from a context window of just 9K tokens to 100K tokens, which corresponds to around 75,000 words in a single go. Claude can accept hundreds of pages of material from businesses and other applications, analyze it, and respond with the right result in mere seconds. For context, an average person might take 5+ hours to read the same amount of text! When ‘The Great Gatsby’ was fed into Claude with just a single line changed, the model was able to spot the change in just 22 seconds. Claude’s larger context window makes it much better suited to answering complex queries than vector search-based approaches.
How Can a Large Context Window Help?
Let's try out MosaicML’s mosaicml/mpt-7b-storywriter model in a Jupyter Notebook.
- Set up your environment and install the libraries.
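A minimal setup might look like this (the package list is indicative, not pinned; einops is needed by MPT's custom modeling code):

```python
# In a notebook cell: install PyTorch, Hugging Face Transformers, and einops
!pip install torch transformers einops
```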
- Import the libraries.
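For example, assuming the Hugging Face transformers stack:

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, pipeline
```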
- Load the model after setting the context window size to 83968.
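Something along these lines, based on how MPT models are typically loaded with a raised max_seq_len (the dtype and other arguments are illustrative choices):

```python
model_name = "mosaicml/mpt-7b-storywriter"

# Load the model's config and raise the maximum sequence length (context window)
config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
config.max_seq_len = 83968  # ALiBi lets the model extrapolate past its 65K training length

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    torch_dtype=torch.bfloat16,  # half precision to fit the 7B weights in GPU memory
    trust_remote_code=True,      # MPT ships custom modeling code on the Hub
)
model.eval()
```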
- Set up the pipeline for text generation.
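A text-generation pipeline can then be built; MPT-7B models reuse the EleutherAI/gpt-neox-20b tokenizer (the device index assumes a single-GPU machine):

```python
# MPT-7B uses the GPT-NeoX tokenizer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device=0,  # first GPU; a 7B model is impractically slow on CPU
)
```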
- Prepare the prompt by loading the book’s text.
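For instance, assuming the book is available as a local plain-text file (the file name is hypothetical):

```python
# Read the full text of the book and append an 'Epilogue' heading so the model
# continues the story from that point
with open("the_great_gatsby.txt", "r", encoding="utf-8") as f:
    book_text = f.read()

prompt = book_text + "\n\nEpilogue\n"
```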
(Note: We’ve added an extra ‘Epilogue’ at the end to facilitate epilogue generation as we push this prompt into the text generation model.)
- Generate the text.
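Finally, generation might look like this (the sampling parameters are illustrative defaults, not the article's exact settings):

```python
# Generate the epilogue, then print only the newly generated continuation
with torch.autocast("cuda", dtype=torch.bfloat16):
    output = generator(
        prompt,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.8,
        use_cache=True,
    )

print(output[0]["generated_text"][len(prompt):])
```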
Run the code, and we can watch the model generate an epilogue for the book.
Conclusion
Large context windows are crucial when working with LLMs. They allow more information to be kept in memory while generating responses, which can ultimately eliminate the need for vector databases and other techniques used to achieve the same results. An infinite context window could well be the holy grail for any LLM: such an architecture would be able to draw on all the available information when generating a response, leading to more accurate and well-rounded results with fewer human-introduced biases.