Much of an enterprise's information lives in linguistic form: legal contracts, code, invoices, payments, emails, sales follow-ups, and so on. One of the most significant shifts ahead will be the ability of machines to robustly interpret and act on the information contained in these documents.
Large Language Models have recently shown impressive results across a variety of Natural Language Processing tasks such as chatbots, copywriting, code generation and autocomplete, and translation.
Many well-known models accomplish these tasks by generating diverse and compelling text from human input prompts. However, evaluating the output for a given prompt is not trivial, since quality is subjective and context-dependent. A loss function that captures these attributes seems intractable, so most language models are still trained with a simple next-token prediction loss. Reinforcement learning (RL) techniques are generally better than purely supervised methods at aligning LMs with human preferences.
Reinforcement Learning from Human Feedback
This approach involves learning a reward function from human feedback and then optimizing the language model against that reward function with reinforcement learning.
Reinforcement Learning for Large Language Models
There are three main steps involved in Reinforcement Learning for Large Language Models:
- Collect demonstration data and train a supervised policy:
  - A prompt is sampled from the prompt dataset.
  - A labeler demonstrates the desired output behavior.
  - This data is used to fine-tune the LLM with supervised learning.
- Collect comparison data and train a reward model (a sketch of the ranking loss follows this list):
  - A prompt and several model outputs are sampled.
  - Labelers rank the outputs from best to worst.
  - This data is used to train the reward model.
- Optimize a policy against the reward model using the PPO reinforcement learning algorithm:
  - A new prompt is sampled from the dataset.
  - The PPO policy is initialized from the supervised policy.
  - The policy generates an output.
  - The reward model calculates a reward for that output.
  - The reward is used to update the policy via PPO.
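The comparison step trains the reward model with a pairwise ranking loss: the model should score the labeler-preferred response above a rejected one. Below is a minimal PyTorch sketch; `reward_model`, `chosen_ids`, and `rejected_ids` are illustrative placeholders rather than names from any particular library.

```python
import torch.nn.functional as F

# Minimal sketch of the pairwise ranking loss used to train a reward model.
# reward_model, chosen_ids, and rejected_ids are illustrative placeholders:
# the model should score the labeler-preferred response above the rejected one.
def reward_ranking_loss(reward_model, chosen_ids, rejected_ids):
    chosen_score = reward_model(chosen_ids)      # scalar score for the preferred output
    rejected_score = reward_model(rejected_ids)  # scalar score for the rejected output
    # maximize the log-probability that the preferred output outranks the rejected one
    return -F.logsigmoid(chosen_score - rejected_score).mean()
```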
Algorithms
There are three main algorithms used in RLHF:
- PPO (Proximal Policy Optimization), whose clipped objective is sketched after this list
- NLPO (Natural Language Policy Optimization)
- ILQL (Implicit Language Q-Learning)
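For intuition, PPO keeps each policy update conservative by clipping the probability ratio between the new and old policies. The following is a minimal sketch of that clipped surrogate objective; the function and variable names are illustrative and not taken from any of the libraries below.

```python
import torch

# Minimal sketch of PPO's clipped surrogate objective (names are illustrative).
def ppo_clip_loss(logprobs, old_logprobs, advantages, clip_range=0.2):
    ratio = torch.exp(logprobs - old_logprobs)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    # PPO maximizes the minimum of the two terms; negate to obtain a loss to minimize
    return -torch.min(unclipped, clipped).mean()
```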
Open Source Tools for Building RLHF Applications:
- TRL - Transformer Reinforcement Learning
- trlX - Transformer Reinforcement Learning X
- RL4LMs: RL library to fine-tune language models to human preferences
TRL - Transformer Reinforcement Learning
TRL allows you to train transformer language models with Proximal Policy Optimization (PPO). The library is built on top of the Hugging Face transformers library, so pre-trained language models can be loaded directly via transformers. Most decoder and encoder-decoder architectures are supported at this point. TRL's fine-tuning loop consists of three steps: rollout (the policy generates a response to a query), evaluation (the response is scored, for example by a reward model), and optimization (a PPO step updates the policy while keeping it close to a reference model).
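The snippet below sketches one rollout-evaluation-optimization step with TRL's PPOTrainer. It follows the pattern of TRL's quickstart, but exact class and argument names can vary between TRL versions, and the constant reward here stands in for a real reward model.

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

config = PPOConfig(model_name="gpt2", batch_size=1, mini_batch_size=1)

# Policy (with a value head for PPO) and a frozen reference model for the KL penalty
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

# Rollout: the policy generates a response to a query
query_tensor = tokenizer.encode("This morning I went to the", return_tensors="pt")
response_tensor = ppo_trainer.generate(query_tensor[0], max_new_tokens=20)

# Evaluation: score the response (a trained reward model would normally provide this)
reward = [torch.tensor(1.0)]

# Optimization: one PPO step on the (query, response, reward) triple
stats = ppo_trainer.step([query_tensor[0]], [response_tensor[0]], reward)
```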
Installation
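TRL is published on PyPI, so it can typically be installed with pip:

```
pip install trl
```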
trlX - Transformer Reinforcement Learning X
trlX is a distributed training framework that fine-tunes large language models with reinforcement learning using either a provided reward function or a reward-labeled dataset.
Training support for Hugging Face models is provided by Accelerate-backed trainers, allowing users to fine-tune causal and T5-based language models of up to 20B parameters. For models beyond 20B parameters, trlX provides NVIDIA NeMo-backed trainers that leverage efficient parallelism techniques to scale effectively.
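The snippet below is a minimal sketch of trlX's high-level API with a toy reward function; a real application would plug in a trained reward model or a reward-labeled dataset. Argument names follow the trlX README and may change between releases.

```python
import trlx

# Toy reward: count occurrences of a target word in each sampled completion.
# In practice this would be a learned reward model or scores from human labelers.
def reward_fn(samples, **kwargs):
    return [sample.count("beautiful") for sample in samples]

trainer = trlx.train(
    "gpt2",                  # base model to fine-tune
    reward_fn=reward_fn,     # scores each batch of generated samples
    prompts=["The weather today is", "My favourite city is"],
)
```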
Installation
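trlX is typically installed from source; the steps below follow the project repository and may differ slightly between releases:

```
git clone https://github.com/CarperAI/trlx.git
cd trlx
pip install -e .
```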
RL4LMs: RL library to fine-tune language models to human preferences
The library is built on HuggingFace and stable-baselines-3, combining important components from their interfaces. RL4LMs can be used to train any decoder-only or encoder-decoder transformer models from HuggingFace with any on-policy RL algorithm from stable-baselines-3. Furthermore, it provides reliable implementations of popular on-policy RL algorithms that are tailored for LM fine-tuning such as PPO, TRPO, A2C, and NLPO. The library is modular, which enables users to plug in customized environments, reward functions, metrics, and algorithms.
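In practice, RL4LMs training runs are usually driven by YAML configuration files passed to the repository's training script. The commands below are a sketch based on the project's README; the exact script path and config file name are assumptions that may differ between versions.

```
git clone https://github.com/allenai/RL4LMs.git
cd RL4LMs
pip install -e .
python scripts/training/train_text_generation.py \
    --config_path scripts/training/task_configs/summarization/t5_ppo.yml
```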
It supports the following seven NLP tasks, tested on the GRUE benchmark:
- Summarization
- Generative Commonsense Reasoning
- IMDB Sentiment-based Text Continuation
- Table-to-text generation
- Abstractive Question Answering
- Machine Translation
- Dialogue Generation
Cloud GPUs: GPU accelerators play an important role in any deep learning project. Cost-effectiveness and the flexibility to set up your own pipeline are key parameters when selecting a cloud platform for research and application development. We encourage readers to try E2E Cloud GPUs with a free trial for research or application development. To get your free credits, contact: sales@e2enetworks.com