Transformer models are used across a wide range of machine learning tasks. Virtually all popular large language models are built on transformers, and the architecture is now also being applied to vision tasks. First introduced in the paper “Attention Is All You Need” by Vaswani et al. (2017), transformers underpin many state-of-the-art models for natural language processing tasks such as translation, summarization, and sentiment analysis. Some popular transformer models are GPT, BERT, and XLS-R.
T5 (Text-to-Text Transfer Transformer) was presented by Google in “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer” by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. It is an encoder-decoder transformer that frames every NLP problem as a text-to-text task. In this tutorial, we will fine-tune this model to convert natural language to SQL.
Prerequisites
The tutorial uses the t5-small model and tools from the Hugging Face ecosystem. The model will be fetched from the Hugging Face Hub and fine-tuned with the Hugging Face Trainer. Make sure Git is installed on your system, then install the required libraries.
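The exact package list is not spelled out here, so the following is an assumed set that covers this pipeline (notebook-style command; drop the leading “!” when running in a terminal):

```python
# Assumed set of libraries for this tutorial: transformers for the model and trainer,
# datasets for WikiSQL, evaluate + rouge_score for the metric, sentencepiece for the
# T5 tokenizer, and accelerate for the Trainer. Adjust as needed for your setup.
!pip install transformers datasets evaluate rouge_score sentencepiece accelerate
```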
Generate an access token from your Hugging Face profile and use it to authenticate the notebook you are using.
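A minimal sketch of notebook authentication against the Hugging Face Hub:

```python
from huggingface_hub import notebook_login

# Opens a prompt where you paste the access token created in your Hugging Face profile.
notebook_login()
```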
Load & Prepare Data
For fine-tuning T5 to convert natural language to SQL, we will use the WikiSQL dataset. WikiSQL is a large crowd-sourced dataset for developing natural language interfaces for relational databases. It consists of 87,726 hand-annotated pairs of natural language questions and SQL queries, split into training (61,297 examples), development (9,145 examples), and test (17,284 examples) sets, making it well suited to natural-language-to-SQL tasks over relational databases.
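A minimal loading sketch, assuming the `wikisql` dataset on the Hugging Face Hub (depending on your `datasets` version, this script-based dataset may require `trust_remote_code=True`):

```python
from datasets import load_dataset

# Load WikiSQL; it ships with train / validation / test splits.
dataset = load_dataset("wikisql")
print(dataset)
print(dataset["train"][0])  # contains the question, the table, and the target SQL
```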
Format the dataset into the prompt/response format we want. Here, each input should look like “Translate to SQL: {natural language question}”, with the corresponding SQL query as the target.
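A sketch of the formatting step. The field names `question` and `sql["human_readable"]` follow the WikiSQL dataset card; the `input`/`target` column names are our own choice:

```python
def format_example(example):
    # Prefix the question so T5 knows which task to perform,
    # and use the human-readable SQL string as the target.
    return {
        "input": "Translate to SQL: " + example["question"],
        "target": example["sql"]["human_readable"],
    }

dataset = dataset.map(format_example)
```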
Define the tokenizer for our T5 model using the AutoTokenizer class from Hugging Face, then compute the input and target token lengths over the formatted dataset.
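For example, the lengths can be estimated as follows (a rough sketch; the exact statistics you compute may differ):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

# Token-length statistics for inputs and targets, used to choose max_length below.
input_lens = [len(tokenizer(x)["input_ids"]) for x in dataset["train"]["input"]]
target_lens = [len(tokenizer(x)["input_ids"]) for x in dataset["train"]["target"]]
print("inputs :", sum(input_lens) / len(input_lens), max(input_lens))
print("targets:", sum(target_lens) / len(target_lens), max(target_lens))
```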
Model Fine-Tuning
Tokenize Dataset
Examining the inputs and outputs, the average token count is around 20. To preserve all of the information, both the encoder and decoder therefore need a maximum length of at least 20 tokens. To provide a buffer and ensure full coverage, we set the maximum length to 64, which lets the model handle inputs and outputs that are longer than the observed average.
Define a function to tokenize the dataset and map it over the train and test splits.
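A sketch of the tokenization step, using the 64-token limit chosen above and the column names from the formatting step sketched earlier:

```python
MAX_LENGTH = 64  # buffer above the ~20-token average observed earlier

def tokenize_fn(batch):
    # Tokenize inputs and targets, truncating anything beyond MAX_LENGTH tokens.
    model_inputs = tokenizer(batch["input"], max_length=MAX_LENGTH, truncation=True)
    labels = tokenizer(text_target=batch["target"], max_length=MAX_LENGTH, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(
    tokenize_fn, batched=True, remove_columns=dataset["train"].column_names
)
train_data = tokenized["train"]
test_data = tokenized["test"]
```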
Training Arguments
Set the arguments for the Hugging Face Trainer. Since this is a text-to-text task, we use the sequence-to-sequence training arguments and trainer.
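A sketch of typical arguments; the hyperparameters below are illustrative, not the exact values used originally:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="t5-small-wikisql",
    evaluation_strategy="epoch",   # renamed to eval_strategy in newer transformers releases
    learning_rate=2e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=2,
    predict_with_generate=True,    # generate text during evaluation so ROUGE can be computed
    fp16=True,                     # set to False on CPU or GPUs without fp16 support
)
```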
Evaluation Metric
We will use the ROUGE score (Recall-Oriented Understudy for Gisting Evaluation) as the metric. ROUGE is a set of metrics for evaluating automatic summarization and machine translation. It works by comparing a generated summary or translation against a set of reference texts (typically human-produced).
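A sketch of a ROUGE-based `compute_metrics` function using the `evaluate` library:

```python
import numpy as np
import evaluate

rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # -100 marks ignored label positions; replace it before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    scores = rouge.compute(
        predictions=decoded_preds, references=decoded_labels, use_stemmer=True
    )
    return {k: round(v * 100, 2) for k, v in scores.items()}
```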
Training
Define the model using the T5ForConditionalGeneration class from Hugging Face, instantiate the sequence-to-sequence trainer object, and run a quick test evaluation.
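A sketch of the model and trainer setup, reusing the arguments and metric defined above (the pre-training evaluation is only a sanity check):

```python
from transformers import (
    T5ForConditionalGeneration,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
)

model = T5ForConditionalGeneration.from_pretrained("t5-small")
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=test_data,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Evaluate the untrained model once to get a baseline before fine-tuning.
print(trainer.evaluate(max_length=MAX_LENGTH))
```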
Train the model using the Hugging Face trainer. After completing training, save the tokenizer and model.
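Roughly:

```python
trainer.train()

# Persist the fine-tuned weights and tokenizer for later inference.
# The directory name below is our own choice.
model.save_pretrained("t5-small-wikisql-final")
tokenizer.save_pretrained("t5-small-wikisql-final")
```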
Kudos! The model has been successfully fine-tuned for our task.
Inference
Load the model and tokenizer from the directory where we saved them after training.
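Assuming the save directory used above:

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_dir = "t5-small-wikisql-final"  # directory chosen when saving after training
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = T5ForConditionalGeneration.from_pretrained(model_dir)
model.eval()
```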
Load the test dataset and define the inference function.
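A sketch of an inference helper; the function name and generation settings are illustrative:

```python
import torch
from datasets import load_dataset

test_split = load_dataset("wikisql", split="test")

def translate_to_sql(question: str) -> str:
    # Apply the same prompt format that was used during fine-tuning.
    inputs = tokenizer("Translate to SQL: " + question, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_length=64, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```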
Run the loop to see results from each example of the test dataset.
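For example, over the first few test examples:

```python
# Compare the model's prediction with the reference SQL for a handful of questions.
for example in test_split.select(range(5)):
    print("Question :", example["question"])
    print("Predicted:", translate_to_sql(example["question"]))
    print("Reference:", example["sql"]["human_readable"])
    print("-" * 40)
```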
Here are some sample results from the fine-tuned model.
Wrapping Up
We have now learned how to fine-tune Google's T5 language model to convert natural language into SQL queries. You can also experiment with other open-source models from the Hub, such as Llama and Falcon, which may give even better results. We hope you enjoyed this tutorial and found it useful for your projects.
References
https://blog.research.google/2020/02/exploring-transfer-learning-with-t5.html
https://github.com/salesforce/WikiSQL
https://huggingface.co/google/flan-t5-base