Feature Extraction with Large Language Models, Hugging Face, and MinIO

April 2, 2025

Introduction

Sentiment analysis is a popular application of natural language processing (NLP) that has many practical uses in business and marketing. It can be used to monitor brand reputation, track customer feedback, and identify emerging trends in customer sentiment.

In this post, we will use feature extraction with a pre-trained LLM from Hugging Face to perform sentiment analysis on a dataset of movie reviews. We will then use MinIO to store the data and the model.

Getting Started

To get started, head over to MyAccount, sign up, and launch a GPU node. We also recommend installing the Remote Explorer extension for VS Code so that you can work on the remote node as if it were your local development environment.

Next, we need to install the following libraries (an install command follows the list):

  • Hugging Face Transformers
  • Hugging Face Datasets
  • MinIO Python SDK
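All three are available on PyPI. Assuming a standard Python environment, they can be installed with a single pip command:

pip install transformers datasets minio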

Load Datasets

Once we have these libraries installed, we can download the movie review dataset from Hugging Face's datasets library using the following code:


from datasets import load_dataset
dataset = load_dataset('imdb')

The Hugging Face Dataset library provides a convenient way to download and work with datasets. However, when dealing with enterprise data, it is not always feasible to upload and download data from the Hugging Face Hub. A better solution is to store the data in MinIO buckets and objects and then load it into the Dataset library's internal structures.

The dataset contains 50,000 movie reviews that are labelled as either positive or negative. We will use this dataset to train our sentiment analysis model.
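Before going further, it is worth peeking at a single example to see the structure of the data. Each record has a text field holding the review and a label field, where 0 means negative and 1 means positive (a quick check, assuming the dataset loaded above):

sample = dataset['train'][0]
print(sample['label'])       # 0 or 1
print(sample['text'][:200])  # first 200 characters of the review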

In this article, we will explore how to use Hugging Face Datasets and MinIO to perform feature extraction and transfer learning. We will use a pre-trained model from Hugging Face to perform sentiment analysis on a dataset of movie reviews. The sections below walk through each step: moving the data through MinIO, tokenizing it, extracting features with the pre-trained model, and training a classifier on those features.

Hugging Face Datasets and MinIO

First, let’s create some helper functions for getting data into and out of MinIO. These functions are below. The get_object() function will retrieve an object from MinIO and save it as a file. The put_file() function will upload a file to a specified bucket within MinIO. If the bucket does not exist, it will be created.


import json

from minio import Minio


def get_object(bucket_name: str, object_name: str, file_path: str):
  '''
  This function will download an object from MinIO to the specified file_path
  and return the object_info.
  '''

  # Load the credentials and connection information.
  with open('credentials.json') as f:
    credentials = json.load(f)

  # Create client with access and secret key.
  client = Minio(credentials['url'],  # host.docker.internal
                 credentials['accessKey'],
                 credentials['secretKey'],
                 secure=False)

  # Download the object to file_path.
  object_info = client.fget_object(bucket_name, object_name, file_path)

  return object_info


def put_file(bucket_name: str, object_name: str, file_path: str):
  '''
  This function will upload a file to MinIO and return the object_info.
  '''

  # Load the credentials and connection information.
  with open('credentials.json') as f:
    credentials = json.load(f)

  # Create client with access and secret key.
  client = Minio(credentials['url'],  # host.docker.internal
                 credentials['accessKey'],
                 credentials['secretKey'],
                 secure=False)

  # Make sure the bucket exists.
  found = client.bucket_exists(bucket_name)
  if not found:
    client.make_bucket(bucket_name)

  # Upload the file.
  object_write_result = client.fput_object(bucket_name, object_name, file_path)

  return object_write_result

To create the files and upload them to MinIO, run the snippet below. You will get one JSON Lines file for each split. Other supported file types are CSV, Arrow, and Parquet.


bucket_name = 'my-dataset'
for split, data in dataset.items():
  data.to_json(f'reviews-{split}.jsonl')
  put_file(bucket_name, f'reviews-{split}.jsonl', f'reviews-{split}.jsonl')
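The same round trip works for the other formats. For example, Parquet files could be written with to_parquet() and uploaded in exactly the same way; this is a sketch and is not used in the rest of the post:

for split, data in dataset.items():
  data.to_parquet(f'reviews-{split}.parquet')
  put_file(bucket_name, f'reviews-{split}.parquet', f'reviews-{split}.parquet')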

Finally, we can reload our data from MinIO using the code below.


data_files = {}
for split in ['train', 'test', 'unsupervised']:
  data_files[split] = f'reviews-{split}.jsonl'
  get_object(bucket_name, f'reviews-{split}.jsonl', f'reviews-{split}.jsonl')

dataset = load_dataset('json', data_files=data_files)

We now have a DatasetDict object loaded with the training, test, and unsupervised splits. We can look at the columns using the column_names property.


dataset['train'].column_names

Load the Model and Tokenizer

To load a pre-trained model from Hugging Face, we can use the from_pretrained method of the appropriate model class. For example, to load a DistilBERT model for sequence classification, we can use the following code:


from transformers import DistilBertForSequenceClassification

model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)

We can also load a tokenizer for the model using the from_pretrained method of the appropriate tokenizer class:


from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
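Before tokenizing the whole dataset, we can check what the tokenizer produces for a single sentence. It returns the token ids and an attention mask:

encoded = tokenizer('This movie was great!')
print(encoded['input_ids'])
print(encoded['attention_mask'])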

Tokenize the Data

Before we can train our model, we need to preprocess the data. This involves tokenizing the text and converting it into a format that can be used by our model.

To tokenize our data using the tokenizer we loaded earlier, we can use the following code:


def tokenize(batch):
    return tokenizer(batch['text'], padding=True, truncation=True)

dataset = dataset.map(tokenize, batched=True)

This code tokenizes each review in the dataset using our tokenizer. Within each batch, shorter reviews are padded to the length of the longest one, and any review longer than the model's maximum sequence length is truncated.
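After the map() call, the tokenizer's outputs are stored as new columns alongside the original ones, which we can confirm with a quick check (the expected columns are shown as a comment, assuming the default DistilBERT tokenizer outputs):

print(dataset['train'].column_names)
# e.g. ['text', 'label', 'input_ids', 'attention_mask']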

Feature Extraction

Feature extraction is a technique that involves using a pre-trained model to extract features from data. In our case, we will use our pre-trained DistilBERT model to extract features from the tokenized movie review dataset: the final hidden state of the [CLS] token serves as a fixed-length feature vector for each review.

To perform feature extraction, we can use the following code:


import torch

features = []
for i in range(len(dataset['train'])):
    input_ids = torch.tensor(dataset['train'][i]['input_ids']).unsqueeze(0)
    with torch.no_grad():
        # Run the underlying DistilBERT encoder (not the classification head)
        # to get the last hidden states for every token.
        last_hidden_states = model.distilbert(input_ids)[0]
    # Keep the hidden state of the [CLS] token as the review's feature vector.
    features.append(last_hidden_states[:, 0, :].squeeze(0).numpy())

len(features), len(dataset['train']['label']), len(dataset)

This code iterates over each review in the training set and uses the pre-trained encoder to extract a 768-dimensional feature vector from it (the [CLS] token's last hidden state). The features are stored in a list.
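Extracting features one review at a time is easy to follow but slow. For larger runs, the same idea can be expressed as a batched map over the dataset. The sketch below makes the same assumptions as the loop above (the model and tokenizer defined earlier, running on CPU) and is not the approach used in the rest of this post:

import torch

def extract_features(batch):
    # Tokenize the whole batch and run it through the DistilBERT encoder in one pass.
    inputs = tokenizer(batch['text'], padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        hidden = model.distilbert(**inputs).last_hidden_state
    # Keep the [CLS] token's hidden state for every review in the batch.
    return {'features': hidden[:, 0, :].numpy()}

train_with_features = dataset['train'].map(extract_features, batched=True, batch_size=32)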

Transfer Learning

Transfer learning is a technique that involves using knowledge learned from one task to improve performance on another related task. In our case, we will use transfer learning to train a new model for sentiment analysis using the features extracted from our pre-trained model.

Install the scikit-learn module using:


pip install scikit-learn

To train our new model using transfer learning, we can use the following code:


from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, dataset['train']['label'], test_size=0.2)

# A larger max_iter gives the solver room to converge on the 768-dimensional features.
clf = LogisticRegression(random_state=0, max_iter=1000).fit(X_train, y_train)

score = clf.score(X_test, y_test)
print(f'Score: {score:.2f}')

This code splits our extracted features into training and test sets and trains a logistic regression model on them. We then evaluate the performance of our new model on the test set.
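To use the resulting classifier on new text, the same two stages are applied at inference time: extract a feature vector with the pre-trained encoder, then pass it to the logistic regression model. Here is a minimal sketch, assuming the model, tokenizer, and clf objects defined above:

def predict_sentiment(text: str) -> int:
    # Tokenize the review and extract its [CLS] feature vector.
    inputs = tokenizer(text, truncation=True, return_tensors='pt')
    with torch.no_grad():
        hidden = model.distilbert(**inputs).last_hidden_state
    feature = hidden[:, 0, :].squeeze(0).numpy()
    # 0 = negative, 1 = positive
    return int(clf.predict([feature])[0])

print(predict_sentiment('A wonderful film with a great cast.'))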

Analysing the Results

Score: 0.87

Our logistic regression model has achieved an accuracy of 0.87 on the test set. This demonstrates that transfer learning can be an effective technique for training models on limited data.
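Finally, the trained classifier can be persisted and pushed to MinIO with the same put_file() helper we used for the dataset. Below is a minimal sketch using joblib; the file and bucket names are just examples:

import joblib

# Serialize the scikit-learn model to disk, then upload it to MinIO.
joblib.dump(clf, 'sentiment-clf.joblib')
put_file('my-models', 'sentiment-clf.joblib', 'sentiment-clf.joblib')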

Conclusion

In this article, we explored how to use Hugging Face Datasets and MinIO to perform feature extraction and transfer learning. We used a pre-trained LLM from Hugging Face to extract features from a dataset of movie reviews, trained a simple classifier on those features for sentiment analysis, and used MinIO to store both the dataset and the trained model.
