Introduction
Expertise is essential in medicine, a field that relies on data, facts, and information. Clinicians practice Evidence-Based Medicine (EBM) to make judgments about patients' health, yet acquiring such expertise is difficult in many parts of the world, which makes appropriate access to standardized medical information all the more important. Recent advances in large language models (LLMs) have the potential to revolutionize access to medical evidence. These models are trained to comprehend and generate text resembling human language, and during training they ingest massive amounts of medical data, learning the complex linguistic cues and patterns embedded in this diverse information. Despite efforts to improve the medical knowledge and reasoning capabilities of LLMs, limitations persist: the strongest models, such as GPT-4 and PaLM, are not open access, while openly available medical models remain limited in scale (≤ 13B parameters), restricting their capabilities.
In this blog, we explore Meditron, a suite of open-source LLMs with 7B and 70B parameters adapted to the medical domain. Meditron builds on Llama-2 and extends its pretraining with a comprehensively curated medical corpus incorporating selected PubMed articles, abstracts, and internationally recognized medical guidelines. Training is carried out with Megatron-LLM, an extension of NVIDIA's Megatron-LM distributed training library that supports the Llama-2 architecture. After fine-tuning on task-relevant data, Meditron-70B surpasses Llama-2-70B, GPT-3.5, and Flan-PaLM on multiple medical reasoning tasks.
Medical Training Data
GAP-Replay, Meditron's domain-adaptive pretraining corpus, combines 48.1 billion tokens from four datasets:
- Clinical Guidelines: a newly curated dataset of 46,469 clinical practice guidelines drawn from several healthcare-related sources.
- Paper Abstracts: openly available abstracts of 16.1 million closed-access PubMed and PubMed Central articles.
- Medical Papers: full-text articles from 5 million publicly accessible PubMed and PubMed Central publications.
- Replay Dataset: 400 million tokens of general-domain pretraining data sampled from RedPajama-v1.
Training Process
Training LLMs at this scale presents distinct challenges. It requires a framework that can handle massive, distributed training, accommodating both the parameter count and the pretraining token count, and that can harness the combined power of many GPUs spread across different machines. Meditron uses Megatron-LLM, an extension of NVIDIA's Megatron-LM distributed training library, which has been updated to support three well-known open-source LLMs: Llama, Falcon, and Llama-2. All Meditron models are pretrained and fine-tuned with Megatron-LLM, which efficiently distributes the training process across the cluster.
The hardware configuration comprises 16 nodes, each featuring 8 NVIDIA A100 (80 GB) SXM GPUs connected through NVLink and NVSwitch, along with a single NVIDIA ConnectX-6 DX network card. Each node is further equipped with 2 AMD EPYC 7543 32-core processors and 512 GB of RAM. Inter-node connectivity uses RDMA over Converged Ethernet.
The following describes the process used for continued pretraining, supervised fine-tuning, and evaluation of MEDITRON-7B and MEDITRON-70B [1].
The three-fold parallelism scheme involves the following components:
- Tensor Parallelism (TP): it is recommended to set tensor parallelism equal to the number of GPUs per node, i.e., TP = 8 on this cluster.
- Pipeline Parallelism (PP): For the largest training run using a 70 billion parameter model, a pipeline parallelism factor of PP=8 is utilized.
- Data Parallelism (DP): with 128 GPUs in the cluster (16 nodes × 8 GPUs), data parallelism is calculated as DP = 128 / (TP × PP) = 128 / (8 × 8) = 2.
Using tensor parallelism (TP), pipeline parallelism (PP), and data parallelism (DP) degrees greater than one is necessary to train models efficiently at this scale.
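As a quick illustration of this arithmetic (not code from the Meditron repository), the data-parallel degree can be derived directly from the cluster size and the chosen TP and PP factors:

```python
# Illustrative arithmetic only: derive the data-parallel degree from the
# cluster size and the tensor/pipeline parallelism factors described above.
nodes, gpus_per_node = 16, 8
world_size = nodes * gpus_per_node      # 128 GPUs in total

tp = 8                                  # tensor parallelism: one group per node
pp = 8                                  # pipeline parallelism for the 70B run
dp = world_size // (tp * pp)            # the remaining degree becomes data parallelism

print(f"TP={tp}, PP={pp}, DP={dp}")     # prints: TP=8, PP=8, DP=2
```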
Training Hyperparameters
The table below summarizes the key training hyperparameters for Meditron-7B and Meditron-70B, highlighting similarities in learning rates, epochs, Adam betas, gradient clipping, weight decay, and optimizer settings. Notable differences include the DP, TP, and PP sizes, context length, and batch sizes.
Inference Approaches
Several inference approaches are used to elicit responses from the models produced by continued pretraining or instruction tuning.
- Top-Token: For tasks with a single-label answer, such as multiple-choice or Boolean QA, Top Token Selection is used: the text generation engine picks the token with the highest log probability as the model's answer, which is then compared with the ground-truth response to compute accuracy (a minimal sketch of this, together with self-consistency voting, appears after the examples below).
Example
Prompt: ‘What are the common symptoms of COVID-19?’
Inference: Meditron accurately identifies common symptoms such as fever, cough, and shortness of breath.
- Chain-of-Thought (CoT): This approach improves the model's accuracy on complex, multi-step challenges, such as mathematical word problems, by conditioning the model's generation on intermediate reasoning steps. Zero-shot CoT prompting for the medically adapted models is obtained by appending the words ‘Let's think step-by-step’ to the end of the query.
Example
Prompt: ‘Explain the process of insulin production in the human body. Let's think step-by-step.’
Inference: Meditron delivers a detailed, sequential explanation of insulin production, showcasing its advanced reasoning capabilities.
- Self-Consistency CoT (SC-CoT): This method further improves performance on multiple-choice question-answering benchmarks by sampling several reasoning paths, extracting the answer choice from each, and taking a majority vote to make the final prediction.
Example
Prompt: ‘What is an appropriate management option for Type 2 diabetes? A) Medication; B) Lifestyle changes; C) Surgery.’
Inference: Meditron samples several reasoning paths; the answers extracted from them (for example, ‘Lifestyle changes’ and ‘Medication’) are aggregated by majority vote into a single final prediction, showing how varied reasoning is reconciled into one answer.
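As a minimal sketch of how top-token selection and self-consistency voting might be implemented with the Hugging Face Transformers API (this is not the authors' evaluation harness; the checkpoint name, question, and answer-extraction heuristic are illustrative assumptions):

```python
# Minimal sketch, not the official Meditron evaluation code. The checkpoint name,
# question, and answer-extraction heuristic below are illustrative assumptions.
from collections import Counter

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "epfl-llm/meditron-7b"  # assumed checkpoint name for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

question = "Which hormone lowers blood glucose? (A) Glucagon (B) Insulin (C) Cortisol\nAnswer:"
options = [" A", " B", " C"]

# --- Top-token selection: pick the option whose token gets the highest log probability ---
inputs = tokenizer(question, return_tensors="pt").to(model.device)
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]
log_probs = torch.log_softmax(next_token_logits, dim=-1)
option_ids = [tokenizer(o, add_special_tokens=False).input_ids[0] for o in options]
top_token_answer = options[int(torch.argmax(log_probs[option_ids]))].strip()
print("Top-token answer:", top_token_answer)

# --- Self-consistency CoT: sample several reasoning paths and majority-vote the answers ---
cot_prompt = question.replace("Answer:", "Let's think step-by-step.")
votes = []
for _ in range(5):
    generated = model.generate(
        **tokenizer(cot_prompt, return_tensors="pt").to(model.device),
        do_sample=True, temperature=0.8, max_new_tokens=256,
    )
    completion = tokenizer.decode(generated[0], skip_special_tokens=True)
    for label in ("A", "B", "C"):  # naive answer extraction, for the sketch only
        if f"({label})" in completion.split("step-by-step.")[-1]:
            votes.append(label)
            break
print("Self-consistency answer:", Counter(votes).most_common(1)[0][0] if votes else "none")
```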
Applications
Meditron-70B is now available for extensive testing and assessment as an AI assistant, with the goal of enhancing clinical decision-making and democratizing access to LLMs for healthcare applications. Potential use cases include medical exam question-answering, support for differential diagnosis, exploring disease information (symptoms, causes, and treatments), and seeking general health information.
The Meditron model can be downloaded from the Hugging Face model hub as follows [2], [3]:
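A minimal download sketch, assuming the huggingface_hub client and the epfl-llm/meditron-7b repository name from the links in the references (from_pretrained in the next section will also fetch the weights automatically on first use):

```python
# Minimal download sketch (pip install huggingface_hub). The repository name
# is taken from the epfl-llm organization page linked in the references.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="epfl-llm/meditron-7b")
print("Model files downloaded to:", local_dir)
```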
Running the Code
If you are looking for reliable cloud GPU solutions to run your AI/ML code, head over to e2enetworks.com and register to access their suite of NVIDIA GPUs.
First, the code imports the necessary classes from the Hugging Face Transformers library. The variable ‘model_name’ holds the name of the pretrained model to be used, set here to 'TheBloke/meditron-70B-AWQ'. The variable ‘tokenizer’ loads the tokenizer associated with the model, and ‘model’ loads the pretrained language model for text generation. The option low_cpu_mem_usage=True reduces CPU memory usage during loading, while device_map="cuda:0" places the model on the GPU with device ID 0.
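A sketch of this loading step (the AWQ-quantized checkpoint additionally requires a recent transformers release with AWQ support, plus the autoawq package, to be installed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Name of the pretrained (AWQ-quantized) Meditron checkpoint on the Hugging Face Hub
model_name = "TheBloke/meditron-70B-AWQ"

# Load the tokenizer associated with the model
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the model for text generation; low_cpu_mem_usage=True reduces CPU memory
# usage during loading, and device_map="cuda:0" places the weights on GPU 0
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    device_map="cuda:0",
)
```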
This code snippet defines three variables for different components of the conversation. The first variable, ‘system_message’, holds a system message, which is set to ‘This is a system message.’ The second variable, ‘prompt’, contains a user prompt inquiring about the role of artificial intelligence in managing cardiovascular diseases. The third variable, ‘prompt_template’, combines these elements into a structured template. This template is organized with sections for the system message, user prompt, and an ‘assistant’ placeholder, allowing for the generation of conversations or interactions where a user poses a question and the assistant provides a response.
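Continuing the snippet, a sketch of these three variables; the <|im_start|>/<|im_end|> tags assume a ChatML-style chat format for this quantized release, so check the model card if your checkpoint expects a different template:

```python
# System message, user prompt, and a combined template with an assistant placeholder.
# The tags assume a ChatML-style chat format (an assumption; see the model card).
system_message = "This is a system message."
prompt = "What is the role of artificial intelligence in managing cardiovascular diseases?"

prompt_template = f"""<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
"""
```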
The first part of this code converts the prompt template into tokens using the tokenizer. The return_tensors='pt' option returns PyTorch tensors, and .input_ids.cuda() moves the resulting tensor to the GPU. In the second part, generation parameters are defined to control the text generation process: do_sample indicates whether to sample from the distribution, temperature controls the randomness of the sampling, and top_p, top_k, max_new_tokens, and repetition_penalty set further constraints on the generated text.
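Continuing the sketch, with tokenization and generation parameters (the specific parameter values below are illustrative choices, not values taken from the blog):

```python
# Tokenize the prompt template and move the input ids to the GPU
tokens = tokenizer(prompt_template, return_tensors="pt").input_ids.cuda()

# Generation parameters (the specific values below are illustrative)
generation_params = {
    "do_sample": True,          # sample from the distribution instead of greedy decoding
    "temperature": 0.7,         # randomness of the sampling
    "top_p": 0.95,              # nucleus sampling threshold
    "top_k": 40,                # restrict sampling to the 40 most likely tokens
    "max_new_tokens": 512,      # cap on the number of generated tokens
    "repetition_penalty": 1.1,  # discourage repeated text
}
```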
Next, a TextStreamer object is created. TextStreamer is a utility class provided by Hugging Face Transformers that streams generated text one token at a time. The parameters skip_prompt=True and skip_special_tokens=True indicate that the prompt and special tokens should be excluded from the streamed output. The resulting object controls how the generated text is streamed and presented.
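A sketch of the streamer setup:

```python
from transformers import TextStreamer

# Stream generated tokens as they are produced, excluding the prompt
# and special tokens from the streamed output
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
```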
The code then generates text with the pretrained model's generate method. It takes the previously obtained tokens as input, streams the output through the streamer, and applies the generation parameters specified by **generation_params.
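A sketch of the generation call:

```python
# Generate text from the tokenized prompt, streaming tokens to stdout as they arrive
generation_output_streamed = model.generate(
    tokens,
    streamer=streamer,
    **generation_params,
)
```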
Here, ‘generation_output_streamed’ holds the result of the streamed text generation. It is a batch of generated sequences, and [0] accesses the first (and only) sequence. The generated tokens, stored in ‘token_output_streamed’, are then decoded back into human-readable text using the Hugging Face tokenizer's ‘decode’ method.
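A sketch of the decoding step:

```python
# Take the first (and only) sequence in the batch and decode it back into text
token_output_streamed = generation_output_streamed[0]
text_output = tokenizer.decode(token_output_streamed)
print(text_output)
```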
In this section, a text generation pipeline from the Hugging Face Transformers library is configured for the task of ‘text-generation’. The pipeline encapsulates the model, tokenizer, and generation parameters, simplifying the process. The pipeline is then employed to generate text based on the provided ‘prompt_template’, and the resulting generated text is printed, as below.
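A sketch of the pipeline-based variant:

```python
from transformers import pipeline

# Wrap the model, tokenizer, and generation parameters in a text-generation pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    **generation_params,
)

# Generate from the same prompt template and print the result
pipe_output = pipe(prompt_template)[0]["generated_text"]
print(pipe_output)
```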
Conclusion
To conclude, Meditron marks a significant advance in large-scale medical language models. Built on the foundation of Llama-2 and adapted to the complex world of healthcare through careful training techniques and a vast curated medical knowledge base, its 70B variant, refined through supervised fine-tuning, outperforms strong baselines such as Llama-2-70B, GPT-3.5, and Flan-PaLM on a range of challenging medical reasoning tasks. In short, Meditron is leading the way in making AI genuinely useful for understanding and answering questions in the field of medicine.
References
The information in this blog is derived from various resources, and some of them are listed below:
1. Research article: https://arxiv.org/pdf/2311.16079.pdf
2. GitHub link: https://github.com/epfLLM/meditron
3. Hugging Face links: https://huggingface.co/datasets/epfl-llm/guidelines ; https://huggingface.co/epfl-llm/ ; https://huggingface.co/TheBloke/meditron-70B-AWQ