Transformer models are used across a wide range of machine learning tasks. Virtually all popular large language models are built on transformers, and the architecture is now also being applied to vision tasks. First introduced in the paper “Attention Is All You Need” by Vaswani et al. (2017), transformers underpin many state-of-the-art models for natural language processing tasks such as translation, summarization, and sentiment analysis. Some popular transformer models are GPT, BERT, and XLS-R.
T5 (Text-to-Text Transfer Transformer) was presented by Google in “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer” by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. It is an encoder-decoder transformer that frames every NLP problem as a text-to-text task. In this tutorial, we will fine-tune this model to convert natural language to SQL.
Prerequisites
The tutorial uses the t5-small model and tools from the Hugging Face ecosystem. The model will be fetched from the Hugging Face Hub and fine-tuned with the Hugging Face Trainer. Make sure Git is installed on your system, then install the required libraries.
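The exact package list is not spelled out here, so the following is an assumed set that covers this pipeline (notebook-style command; drop the leading “!” when running in a terminal):

```python
# Assumed set of libraries for this tutorial: transformers for the model and trainer,
# datasets for WikiSQL, evaluate + rouge_score for the metric, sentencepiece for the
# T5 tokenizer, and accelerate for the Trainer. Adjust as needed for your setup.
!pip install transformers datasets evaluate rouge_score sentencepiece accelerate
```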
Generate an access token from your Hugging Face profile and use it to authenticate the notebook you are using.
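A minimal sketch of notebook authentication against the Hugging Face Hub:

```python
from huggingface_hub import notebook_login

# Opens a prompt where you paste the access token created in your Hugging Face profile.
notebook_login()
```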
Load & Prepare Data
For fine-tuning T5 to convert natural language to SQL, we will use the WikiSQL dataset. WikiSQL is a large crowd-sourced dataset for developing natural language interfaces for relational databases. It consists of 87,726 hand-annotated pairs of natural language questions and SQL queries, split into training (61,297 examples), development (9,145 examples), and test (17,284 examples) sets, making it well suited to natural-language-to-SQL tasks over relational databases.
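A minimal loading sketch, assuming the `wikisql` dataset on the Hugging Face Hub (depending on your `datasets` version, this script-based dataset may require `trust_remote_code=True`):

```python
from datasets import load_dataset

# Load WikiSQL; it ships with train / validation / test splits.
dataset = load_dataset("wikisql")
print(dataset)
print(dataset["train"][0])  # contains the question, the table, and the target SQL
```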
Format the dataset into the prompt/response format we want. Here, each input should look like “Translate to SQL: {natural language question}”, with the corresponding SQL query as the target.
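A sketch of the formatting step. The field names `question` and `sql["human_readable"]` follow the WikiSQL dataset card; the `input`/`target` column names are our own choice:

```python
def format_example(example):
    # Prefix the question so T5 knows which task to perform,
    # and use the human-readable SQL string as the target.
    return {
        "input": "Translate to SQL: " + example["question"],
        "target": example["sql"]["human_readable"],
    }

dataset = dataset.map(format_example)
```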
Define the tokenizer for our T5 model using the AutoTokenizer class from Hugging Face, then compute the input and target token lengths over the formatted dataset.
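For example, the lengths can be estimated as follows (a rough sketch; the exact statistics you compute may differ):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

# Token-length statistics for inputs and targets, used to choose max_length below.
input_lens = [len(tokenizer(x)["input_ids"]) for x in dataset["train"]["input"]]
target_lens = [len(tokenizer(x)["input_ids"]) for x in dataset["train"]["target"]]
print("inputs :", sum(input_lens) / len(input_lens), max(input_lens))
print("targets:", sum(target_lens) / len(target_lens), max(target_lens))
```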
Model Fine-Tuning
Tokenize Dataset
Examining the inputs and outputs, the average token count is around 20. To preserve all of the information, both the encoder and decoder therefore need a maximum length of at least 20 tokens. To provide a buffer and ensure full coverage, we set the maximum length to 64, which lets the model handle inputs and outputs that are longer than the observed average.
Define a function to tokenize the dataset and map it over the train and test splits.
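A sketch of the tokenization step, using the 64-token limit chosen above and the column names from the formatting step sketched earlier:

```python
MAX_LENGTH = 64  # buffer above the ~20-token average observed earlier

def tokenize_fn(batch):
    # Tokenize inputs and targets, truncating anything beyond MAX_LENGTH tokens.
    model_inputs = tokenizer(batch["input"], max_length=MAX_LENGTH, truncation=True)
    labels = tokenizer(text_target=batch["target"], max_length=MAX_LENGTH, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(
    tokenize_fn, batched=True, remove_columns=dataset["train"].column_names
)
train_data = tokenized["train"]
test_data = tokenized["test"]
```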
Training Arguments
Set the arguments for the Hugging Face Trainer. Since this is a text-to-text task, we use the sequence-to-sequence training arguments and trainer.
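A sketch of typical arguments; the hyperparameters below are illustrative, not the exact values used originally:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="t5-small-wikisql",
    evaluation_strategy="epoch",   # renamed to eval_strategy in newer transformers releases
    learning_rate=2e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=2,
    predict_with_generate=True,    # generate text during evaluation so ROUGE can be computed
    fp16=True,                     # set to False on CPU or GPUs without fp16 support
)
```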
Evaluation Metric
We will use the ROUGE score (Recall-Oriented Understudy for Gisting Evaluation) as the metric. ROUGE is a set of metrics for evaluating automatic summarization and machine translation. It works by comparing a generated summary or translation against a set of reference texts (typically human-produced).
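A sketch of a ROUGE-based `compute_metrics` function using the `evaluate` library:

```python
import numpy as np
import evaluate

rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # -100 marks ignored label positions; replace it before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    scores = rouge.compute(
        predictions=decoded_preds, references=decoded_labels, use_stemmer=True
    )
    return {k: round(v * 100, 2) for k, v in scores.items()}
```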
Training
Define the model using the T5ForConditionalGeneration class from Hugging Face, instantiate the sequence-to-sequence trainer object, and run a quick test evaluation.
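A sketch of the model and trainer setup, reusing the arguments and metric defined above (the pre-training evaluation is only a sanity check):

```python
from transformers import (
    T5ForConditionalGeneration,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
)

model = T5ForConditionalGeneration.from_pretrained("t5-small")
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=test_data,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Evaluate the untrained model once to get a baseline before fine-tuning.
print(trainer.evaluate(max_length=MAX_LENGTH))
```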
Train the model using the Hugging Face trainer. After completing training, save the tokenizer and model.
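Roughly:

```python
trainer.train()

# Persist the fine-tuned weights and tokenizer for later inference.
# The directory name below is our own choice.
model.save_pretrained("t5-small-wikisql-final")
tokenizer.save_pretrained("t5-small-wikisql-final")
```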
Kudos! The model has been successfully fine-tuned for our task.
Inference
Load the model and tokenizer from the directory where we saved them after training.
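Assuming the save directory used above:

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_dir = "t5-small-wikisql-final"  # directory chosen when saving after training
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = T5ForConditionalGeneration.from_pretrained(model_dir)
model.eval()
```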
Load the test dataset and define the inference function.
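A sketch of an inference helper; the function name and generation settings are illustrative:

```python
import torch
from datasets import load_dataset

test_split = load_dataset("wikisql", split="test")

def translate_to_sql(question: str) -> str:
    # Apply the same prompt format that was used during fine-tuning.
    inputs = tokenizer("Translate to SQL: " + question, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_length=64, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```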
Run the loop to see results from each example of the test dataset.
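For example, over the first few test examples:

```python
# Compare the model's prediction with the reference SQL for a handful of questions.
for example in test_split.select(range(5)):
    print("Question :", example["question"])
    print("Predicted:", translate_to_sql(example["question"]))
    print("Reference:", example["sql"]["human_readable"])
    print("-" * 40)
```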
Here are some sample results from the fine-tuned model.
Wrapping Up
We have now learned how to fine-tune Google's T5 language model to convert natural language into SQL queries. You can also experiment with other open-source models from the Hub, such as Llama and Falcon, which may give even better results. We hope you enjoyed this tutorial and found it useful for your projects.
References
https://blog.research.google/2020/02/exploring-transfer-learning-with-t5.html
https://github.com/salesforce/WikiSQL
https://huggingface.co/google/flan-t5-base