Large Language Models (LLMs) have revolutionized natural language processing by showcasing impressive capabilities in understanding and generating human-like text. However, the dynamic landscape of information makes it challenging to adapt these models to new knowledge effectively. Two prominent approaches, Retrieval Augmented Generation (RAG) and fine-tuning, have emerged as contenders for addressing this challenge.
Let’s explore the differences between these strategies, including their strengths, weaknesses, and implications for enhancing the knowledge base of LLMs.
Challenges Faced by LLMs
Before turning to the solutions, it's crucial to understand the challenges LLMs face. Factual errors often arise from factors such as domain knowledge deficits, outdated information, disremembering, forgetting, and reasoning failures. The pre-training phase, in which models learn from vast datasets, plays a pivotal role in shaping their knowledge foundation. Despite this extensive training, models may falter when faced with new or specific information that is not present in their training data.
Factual Errors: The Pitfalls of Imperfect Information
Factual errors in language models are inaccuracies or discrepancies between the information the model generates and established facts. These errors can arise from various sources and undermine the model's ability to provide reliable, accurate information. Understanding the nature of these errors is essential for mitigating their impact on model performance.
- Domain Knowledge Deficits: Language models are trained on vast datasets that cover a broad range of topics. However, they may lack in-depth knowledge in specific domains, which leads to inaccuracies when generating content in those areas. For instance, a model trained on general knowledge may struggle with detailed medical or scientific information.
- Outdated Information: Models may not be aware of recent developments or changes, especially if their training data has a cutoff date. This can result in the dissemination of outdated information, which affects the model's reliability in dynamic fields.
- Disremembering: Despite extensive training, models may fail to memorize specific facts or details. This disremembering can manifest as the model generating incorrect information or failing to recall essential details.
- Forgetting: Over time, language models may forget certain information from their training data. This memory decay can lead to factual errors, especially when dealing with infrequently encountered facts.
- Reasoning Failures: Language models may struggle with complex reasoning tasks that require a deep understanding of context and relationships between different pieces of information. This can result in errors when answering questions or providing explanations.
Knowledge Deficits: Addressing the Gaps in Information
Knowledge deficits go hand-in-hand with factual errors and refer to the areas where language models lack the necessary information. While models may exhibit general knowledge prowess, addressing these deficits is crucial for ensuring their adaptability to new and specialized knowledge domains.
- Limited Training Data Coverage: Language models may not have encountered sufficient examples or instances related to specific topics during their training. This limited coverage can lead to knowledge deficits, particularly in niche or emerging fields.
- Inadequate Pre-training: The quality of pre-training data significantly influences the knowledge base of language models. Inadequate pre-training on diverse and representative datasets may result in models with inherent knowledge gaps.
- Lack of Exposure to Varied Contexts: Models trained on monolithic datasets may lack exposure to diverse contexts and perspectives. This narrow exposure can contribute to knowledge deficits, especially when dealing with multifaceted or culturally nuanced information.
- Implicit Bias: Language models may inadvertently carry biases present in their training data. This bias can lead to skewed or incomplete representations of certain topics, which contributes to knowledge deficits and inaccuracies.
Knowledge Injection: Fine-Tuning vs. Retrieval Augmented Generation
Fine-Tuning
Fine-tuning is a process that involves adjusting a pre-trained language model on a more specific dataset or task to improve its performance within that particular domain. This approach is particularly useful when models need to be adapted to new or specialized knowledge areas. Fine-tuning can be classified into several types, each with its unique characteristics:
- Supervised Fine-Tuning: In supervised fine-tuning, the model is trained on labeled input-output pairs. This method often involves presenting the model with task descriptions in natural language and corresponding examples of the desired behavior. While effective for improving overall model quality, supervised fine-tuning may not necessarily impart new knowledge to the model.
- Reinforcement Learning Fine-Tuning: Another form of fine-tuning leverages reinforcement learning or RL-inspired optimization strategies. Techniques such as reinforcement learning from human feedback, direct preference optimization, and proximal policy optimization are examples of RL-based fine-tuning. These methods focus on improving the overall quality and expected behavior of the model but may not specifically address knowledge breadth.
- Unsupervised Fine-Tuning: Unsupervised fine-tuning, also known as continual pre-training or unstructured fine-tuning, involves continuing the pre-training phase of the language model in a causal auto-regressive manner. This method capitalizes on the vast knowledge stored during the initial pre-training. Unsupervised fine-tuning aims to inject new knowledge into the model and is often preferred for its efficacy in learning new information.
Fine-tuning, while a valuable tool, comes with challenges such as instability and potential impacts on the model's broader capabilities. Careful consideration of the fine-tuning strategy and its alignment with specific goals is crucial to achieving optimal results.
Retrieval Augmented Generation (RAG)
Retrieval Augmented Generation (RAG) represents a paradigm shift in enhancing language models by incorporating external knowledge sources. This technique is particularly beneficial for knowledge-intensive tasks and involves the following key steps:
- Knowledge Base Creation: RAG begins with the creation of an auxiliary knowledge base. Relevant information is gathered from the data provided to create a knowledge base. This knowledge base serves as a repository of information that can augment the model's understanding.
- Embedding and Retrieval: Given an input query, RAG employs an embedding model to represent both the query and the documents in the knowledge base. Retrieval involves finding documents within the knowledge base that resemble the input query. This process is typically facilitated by vector representations of documents.
- Context Enrichment: Retrieved documents are added to the input query, which enriches the model's context about the subject. This augmented context enhances the model's ability to generate more informed and contextually relevant responses.
RAG has proven effective in addressing both factual errors and knowledge deficits. It allows language models to go beyond their pre-training knowledge and leverage external information for improved performance in various tasks.
Choosing between Fine-Tuning and RAG: Considerations and Trade-Offs
The decision to employ fine-tuning or RAG depends on the specific goals of a task and the nature of the knowledge required. Here are some considerations and trade-offs:
- Fine-tuning Considerations: Fine-tuning is suitable for tasks where specific, task-oriented improvements are needed. It is effective for refining a model's performance in a particular domain. However, fine-tuning may exhibit instability and might not be the optimal choice for addressing broad knowledge deficits.
- RAG Considerations: RAG excels in knowledge-intensive tasks where the external information supplied to the knowledge base is valuable. It can address both knowledge deficits and factual errors by incorporating diverse knowledge from external sources. RAG's effectiveness relies on the quality and coverage of the knowledge base.
- Trade-offs: Fine-tuning may provide more control over specific task-related improvements, but it might struggle with broader knowledge adaptation. RAG, while powerful in leveraging external knowledge, depends on the availability and reliability of the knowledge base.
Utilizing E2E Cloud GPU for Implementing Knowledge Injection
Knowledge injection is a process that enhances machine learning models by integrating external knowledge, which is advantageous when dealing with limited or unrepresentative training data. Cloud GPUs offer several benefits here. First, their scalability allows a seamless transition from a single GPU for development to multiple GPUs for training larger models or handling extensive datasets. Additionally, their cost-effectiveness offers a more economical alternative to maintaining dedicated hardware, especially for irregular usage patterns, since charges are incurred only for the resources actually used.
Let’s walk through an example of fine-tuning an LLM versus the RAG approach and compare their performance using evaluation metrics. For this, I used an E2E Cloud A100 80 GB GPU with CUDA 11. To learn more about E2E Cloud GPUs, visit the website.
To get started, add your SSH keys by going into Settings.
After creating SSH keys, create a node by going into ‘Compute’.
Now, open Visual Studio Code and install the ‘Remote Explorer’ and ‘Remote SSH’ extensions. Then open a new terminal and log in to the remote node from your local system.
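For example, using the node's public IP from the ‘Compute’ page (the address and user below are placeholders):

```bash
ssh root@<node-public-ip>
```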
You’ll now be connected to the remote node over SSH from your local machine.
Fine-Tuning LLM Example: Code Implementation
Let’s work through the scenario in code, using the same dataset for both approaches.
First, install the dependencies that we are going to use in this implementation.
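The exact dependency list isn't pinned down here, but a plausible set for PEFT-based fine-tuning of a Hugging Face model would be:

```bash
pip install transformers datasets peft accelerate sentencepiece torch
```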
Import the classes and functions that we are going to need in this implementation.
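Assuming the Hugging Face stack above, the imports for the steps that follow might look like this:

```python
import os
import time

import torch
from datasets import load_dataset
from peft import LoraConfig, PeftModel, TaskType, get_peft_model
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)
```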
In this implementation, we will use Flan T5 XXL, which you can find here. We’ll load its tokenizer and use it as our foundation model.
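Loading the model and tokenizer could look like the sketch below; `bfloat16` and `device_map="auto"` are assumptions chosen to fit the 11B-parameter model on an A100 80 GB.

```python
model_name = "google/flan-t5-xxl"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# device_map="auto" spreads the weights across available GPU memory
foundation_model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```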
Load the dataset, which can be found here. We’ll tokenize one of its columns, named system_prompt, and select a subset of the rows to work with.
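The dataset link isn't reproduced here; Open-Orca/OpenOrca is assumed below because it carries a system_prompt column, and having the labels mirror the inputs is a simplification for this demo.

```python
dataset = load_dataset("Open-Orca/OpenOrca", split="train")

def tokenize_fn(batch):
    tokens = tokenizer(batch["system_prompt"], truncation=True, max_length=128)
    # For this demo the model learns to reproduce the prompt text,
    # so the labels simply mirror the input ids.
    tokens["labels"] = [ids.copy() for ids in tokens["input_ids"]]
    return tokens

tokenized = dataset.map(tokenize_fn, batched=True, remove_columns=dataset.column_names)
sample_data = tokenized.select(range(1000))  # a manageable subset for the demo
```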
Let’s set the LoRA configuration for fine-tuning the foundation model.
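A reasonable LoRA configuration for a T5-style encoder-decoder (the hyperparameters are illustrative, not the post's originals):

```python
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,                        # rank of the low-rank update matrices
    lora_alpha=32,               # scaling factor for the updates
    lora_dropout=0.05,
    target_modules=["q", "v"],   # T5's query and value projections
)
```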
Using the LoRA configuration, we’ll wrap the foundation model into a PEFT model and inspect the trainable parameters, the total parameters, and the trainable percentage.
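Wrapping the foundation model with the adapter is a one-liner in PEFT:

```python
peft_model = get_peft_model(foundation_model, lora_config)
peft_model.print_trainable_parameters()  # trainable params, all params, trainable%
```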
Running this prints the trainable parameter count, the total parameter count, and the trainable percentage; with LoRA, only a small fraction of the weights is trainable.
Now, we’ll make a directory to store the PEFT model outputs and set the training arguments for our fine-tuned model.
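Something like the following, where the hyperparameter values are illustrative:

```python
output_dir = "peft_outputs"
os.makedirs(output_dir, exist_ok=True)

training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=8,
    learning_rate=3e-4,      # LoRA typically tolerates a higher learning rate
    num_train_epochs=3,
    logging_steps=10,
    save_strategy="no",      # we save the adapter manually below
    report_to="none",
)
```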
Then, we’ll train the model using the sample data and the Data Collator.
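With a DataCollatorForSeq2Seq handling the padding of inputs and labels, training reduces to:

```python
data_collator = DataCollatorForSeq2Seq(tokenizer, model=peft_model)

trainer = Seq2SeqTrainer(
    model=peft_model,
    args=training_args,
    train_dataset=sample_data,
    data_collator=data_collator,
)
trainer.train()
```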
We’ll save the fine-tuned model using the output directory and the timestamp.
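One way to stamp the checkpoint with a timestamp:

```python
timestamp = time.strftime("%Y%m%d-%H%M%S")
save_path = os.path.join(output_dir, f"peft-model-{timestamp}")
trainer.model.save_pretrained(save_path)  # stores only the small LoRA adapter
```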
Now, we'll load the trained PEFT model and use the foundation model for prompt tuning.
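Reloading attaches the saved adapter back onto the frozen foundation model:

```python
loaded_model = PeftModel.from_pretrained(foundation_model, save_path)
loaded_model.eval()
```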
We’ll take a test set, and set a metric to evaluate the fine-tuned model.
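The post's exact test split and metric setup aren't shown, so the sketch below holds out a few raw rows and scores generations with a token-overlap F1 in the spirit of the SQuAD metric; the question and response columns are part of the OpenOrca assumption above.

```python
def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1, in the spirit of the SQuAD metric."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = set(pred_tokens) & set(ref_tokens)
    if not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

test_set = dataset.select(range(1000, 1020))  # hypothetical held-out slice
f1_scores = []
for row in test_set:
    inputs = tokenizer(row["question"], return_tensors="pt").to(loaded_model.device)
    output_ids = loaded_model.generate(**inputs, max_new_tokens=64)
    prediction = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    f1_scores.append(token_f1(prediction, row["response"]))

print(f"Mean F1: {sum(f1_scores) / len(f1_scores):.3f}")
```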
The evaluation yields an F1 score of 67.7%, which, as we can see, is not particularly strong.
RAG Example: Code Implementation
To implement and evaluate the RAG approach, let’s start by setting the Cohere API key, which we’ll use for the embeddings.
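The key itself is a placeholder; most Cohere clients read it from the environment:

```python
import os

os.environ["COHERE_API_KEY"] = "YOUR_COHERE_API_KEY"  # placeholder, not a real key
```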
Let’s load the dataset again and take a small sample of the data.
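Reusing the same assumed dataset as in the fine-tuning run:

```python
from datasets import load_dataset

dataset = load_dataset("Open-Orca/OpenOrca", split="train")  # same assumption as before
sample = dataset.select(range(100))  # a small sample keeps the demo cheap
```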
We’ll split the data into chunks using the text splitter.
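Using LangChain's recursive splitter (the chunk sizes are illustrative), with the response column assumed as the document text:

```python
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

docs = [Document(page_content=row["response"]) for row in sample]
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)
```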
Then, we’ll create the embeddings using the model “embed-english-light-v3.0”. Cohere has many models for different purposes, but here we are using this one.
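With LangChain's Cohere wrapper, that is:

```python
from langchain.embeddings import CohereEmbeddings

embeddings = CohereEmbeddings(model="embed-english-light-v3.0")
```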
Now, we’ll set the Pinecone API key and the environment, and initialize the Pinecone client.
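This follows the older pinecone-client API, which took an environment name at init time; both values below are placeholders:

```python
import pinecone

os.environ["PINECONE_API_KEY"] = "YOUR_PINECONE_API_KEY"  # placeholder
os.environ["PINECONE_ENV"] = "gcp-starter"                # placeholder environment

pinecone.init(
    api_key=os.environ["PINECONE_API_KEY"],
    environment=os.environ["PINECONE_ENV"],
)
```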
We’ll create an index to store the embeddings.
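The index name is hypothetical; the dimension matches embed-english-light-v3.0, which produces 384-dimensional vectors:

```python
index_name = "knowledge-injection-demo"  # hypothetical name
if index_name not in pinecone.list_indexes():
    pinecone.create_index(index_name, dimension=384, metric="cosine")
```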
Then, we will initiate the Pinecone with docs, embeddings, and index.
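LangChain's Pinecone vector store wires the three together:

```python
from langchain.vectorstores import Pinecone

docsearch = Pinecone.from_documents(chunks, embeddings, index_name=index_name)
```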
We’ll load the language model that will be used to answer the queries.
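The post doesn't specify which model answered the queries; Cohere's generation model via LangChain is one plausible choice:

```python
from langchain.llms import Cohere

llm = Cohere(temperature=0)  # reads COHERE_API_KEY from the environment
```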
Then, we’ll create a retriever object and pass the model into a RetrievalQA chain.
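A standard ‘stuff’ chain that packs the retrieved chunks into the prompt:

```python
from langchain.chains import RetrievalQA

retriever = docsearch.as_retriever(search_kwargs={"k": 3})
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",             # concatenate retrieved chunks into the prompt
    retriever=retriever,
    return_source_documents=True,   # keep the contexts for RAGAS later
)
```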
Now, we’ll evaluate our RAG approach using RAGAS.
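RAGAS APIs have changed across versions; the sketch below follows the older schema that reports a combined ragas_score, and assumes an OPENAI_API_KEY is set because RAGAS itself calls an LLM to judge the outputs.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

questions = [row["question"] for row in sample.select(range(5))]
ground_truths = [[row["response"]] for row in sample.select(range(5))]

# Run each question through the chain, keeping answers and retrieved contexts
answers, contexts = [], []
for q in questions:
    out = qa_chain({"query": q})
    answers.append(out["result"])
    contexts.append([doc.page_content for doc in out["source_documents"]])

eval_data = Dataset.from_dict({
    "question": questions,
    "answer": answers,
    "contexts": contexts,
    "ground_truths": ground_truths,
})

ragas_result = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(ragas_result)
```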
The evaluation reports an overall ragas_score alongside the individual metric scores, as well as a ragas_score for each question. In our run, the ragas_score for each question was higher than the fine-tuned model's score.
Comparative Knowledge Injection Observations
Comparing the evaluation metrics of the fine-tuned LLM and the RAG pipeline on this text classification use case, we saw that the ragas_score outperformed the fine-tuned model's score. It would be interesting to see how the fine-tuned model and RAG perform on other use cases.
However, RAG pipelines can be even more powerful when a fine-tuned model is incorporated into them.
Conclusion
In the quest to enhance LLMs' knowledge bases, understanding the differences between Retrieval Augmented Generation and fine-tuning is pivotal. While fine-tuning shows promise, especially when combined with RAG, the latter emerges as the more reliable choice for knowledge injection, as we saw in the examples.
Fine-tuning and Retrieval Augmented Generation represent powerful tools for refining language models, each with its unique strengths and considerations. The choice between these approaches often involves a balance between task specificity, model stability, and the need for external knowledge integration.
In many cases, a synergistic approach that combines aspects of both fine-tuning and RAG may offer a comprehensive solution as we saw in the comparative observations. Leveraging the strengths of fine-tuned models along with the context enrichment provided by external knowledge through RAG can contribute to the development of more robust and adaptable language models.