Introduction
Natural language processing spans an ocean of different work areas, but one task is fundamental to it: text classification. Put simply, it involves assigning a text to one of several predefined categories; for example, a text about basketball can be filed under the "sports" category. This makes documents easy to manage and sort.
Text classification is essentially a way of imposing structure on unstructured data. It can be used to organize useful information in many forms: emails, documents, web pages, chat conversations, social media messages, and reviews of various issues or trends. The data can be classified either by humans or with NLP. Manual classification is both time-consuming and laborious; with machine learning, the task can be performed efficiently.
Components:
The text classifier is a combination of three components:
Datasets:
The larger the dataset, the better trained the model will be. For example, if you have 800 categories, you should provide at least 100 labeled samples for each category.
Many datasets are freely available as open-source distributions. For example, the IMDB dataset provides movie reviews labeled as positive or negative.
Preprocessing:
A dataset consists of important text along with noise: stopwords, misspellings, and slang. We have to filter this noise from the data. Raw text treats every word as equal, but in preprocessing we assign a weight to each word; the weight can be based on how informative the word is or on the number of times it occurs in the document. There are various techniques to preprocess the data, such as bag-of-words (BOW), TF-IDF, and n-gram models. (K-nearest neighbors and random forests, sometimes listed here, are classification algorithms rather than preprocessing techniques.)
Classification algorithm and strategy:
There are various algorithms available; you can choose one according to your model requirements. Naive Bayes, support vector machines (SVM), decision trees, and deep learning models are some examples.
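As a minimal sketch of these two components working together, here is a TF-IDF plus Naive Bayes pipeline built with scikit-learn; the training texts and category labels below are toy examples invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data; the texts and labels are illustrative only.
texts = [
    "the team won the basketball game",
    "the election results were announced",
    "stocks rallied after the earnings report",
]
labels = ["sports", "politics", "economics"]

# TF-IDF weights each word by how informative it is,
# then Naive Bayes classifies the weighted vectors.
classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
classifier.fit(texts, labels)

print(classifier.predict(["who won the game last night"]))  # likely ['sports']
```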
Examples:
Sentiment Analysis-
Its main purpose is to identify the polarity of a piece of content and the impact it carries. Sentiment analysis determines whether a text is positive or negative. It can also produce binary like/dislike ratings for movies, brands, or opinions on current affairs.
Topic labeling-
It helps one understand what the content is about: what topic it relates to, where it comes from, and what it means. This analysis can be used to gather customer feedback on a particular topic or to organize news articles by subject.
Language detection-
It identifies the language of an incoming text, which is commonly used for routing purposes (for example, directing a support request to a team that speaks that language).
Why is it important?
According to research, around 80% of all data is unstructured. It may contain important information, but noise, slang, and sheer volume make it hard to use directly. Text classification uses text classifiers to remove unwanted items from the text, categorize it, and make the data useful.
- A text classifier improves scalability, handling volumes of text that manual sorting cannot.
- There are many situations where results are needed instantly; text classifiers provide real-time analysis, which makes it possible to act on difficult situations or extract important information immediately.
- Manual classification of data invites mistakes. The reason could be anything: shallow understanding, limited knowledge, distraction, or boredom.
Use Cases:
Text classification can be used for various purposes, some of them are:
- Social media monitoring
- Brand monitoring
- Customer service
- Voice of customer (VoC)
Workflow:
There are six steps involved in the workflow process:
- Gather data:
As mentioned earlier, you can get data via open-source distribution channels. Below we explore the sources from which you can get datasets. Here, we will use the Reuters dataset.
Topic classification:
- Reuters news datasets
It is composed of 11,228 newswires from Reuters, classified into 46 different categories such as politics, sports, and economics. We can import this dataset from Keras. After importing, the data arrives as two tuples, one for training and one for testing; each tuple holds a feature portion and a label portion.
You can import the Reuters dataset from Keras as shown below:
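A minimal sketch using the tf.keras datasets API (the num_words cap is an illustrative choice, not a requirement):

```python
from tensorflow.keras.datasets import reuters

# Load the Reuters newswires; num_words keeps only the
# 10,000 most frequent words in the vocabulary.
(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=10000)
```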
- 20 NewsGroups-
It is another dataset source, consisting of approx. 20,000 documents across 20 different topics.
Sentiment Analysis:
For sentiment analysis, there are various sources of datasets such as-
- Amazon Product Reviews
- IMDB Reviews
- Twitter Airline Sentiment
- Explore the dataset:
You need to load your dataset from its source location; here, we load the Reuters dataset.
Next, you need to load the training and testing data. In this example, we use Python's len() function to get the number of samples and the number of classes present in the dataset.
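A short sketch of this exploration step, assuming the x_train/y_train arrays loaded from Keras above (the variable names are ours):

```python
import numpy as np

# Number of samples in each split.
print(f"Training samples: {len(x_train)}")
print(f"Test samples: {len(x_test)}")

# Labels are integers 0..45, so the class count is max label + 1.
num_classes = int(np.max(y_train)) + 1
print(f"Number of classes: {num_classes}")

# Words per sample, useful for the model-selection heuristic below.
print(f"Median words per sample: {np.median([len(s) for s in x_train])}")
```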
Choose a model:
Now we have to choose a model. The procedure below shows how.
Algorithm for data preparation and data modeling
1. Calculate the number of samples/number of words per sample ratio.
2. If this ratio is less than 1500, tokenize the text as n-grams and use a
simple multi-layer perceptron (MLP) model to classify them (left branch in the
flowchart below):
a. Split the samples into word n-grams; convert the n-grams into vectors.
b. Score the importance of the vectors and then select the top 20K using the scores.
c. Build an MLP model.
3. If the ratio is greater than 1500, tokenize the text as sequences and use a
sepCNN model to classify them (right branch in the flowchart below):
a. Split the samples into words; select the top 20K words based on their frequency.
b. Convert the samples into word sequence vectors.
c. If the original number of samples/number of words per sample ratio is less than 15K, using a fine-tuned pre-trained embedding with the sepCNN model will likely provide the best results.
4. Measure the model performance with different hyperparameter values to find the best model configuration for the dataset.
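Expressed as a small Python helper, this decision rule looks as follows; this is a sketch under the assumption that each sample is already a list of word tokens:

```python
import numpy as np

def choose_model(samples):
    """Pick a model family from the samples / words-per-sample (S/W) ratio.

    `samples` is assumed to be a list of tokenized texts (lists of words).
    """
    median_words = np.median([len(s) for s in samples])
    ratio = len(samples) / median_words
    if ratio < 1500:
        return "n-gram vectors + MLP"   # left branch of the flow chart
    return "word sequences + sepCNN"    # right branch of the flow chart
```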
Let's explain this with a flow chart:
The flow chart above depicts the options available at each step:
- Yellow boxes depict data and model preparation processes.
- Grey boxes indicate the choices that can be considered for each process.
- Green boxes indicate the recommended choice for each process.
Prepare Dataset-
In this step, you need to remove unwanted elements from the text: normalize capitalization, strip extra spaces, clean up slang, and eliminate redundancy.
Tokenization- this means breaking the sentences down into words. You can do so by importing a tokenizer; with the following code you can apply tokenization to both the training and test data.
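A sketch of this step, assuming you start from raw strings (the Keras Reuters data is already tokenized, so the sample texts below are illustrative):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Hypothetical raw-text lists standing in for your corpus.
train_texts = ["oil prices rose sharply", "the team won the final"]
test_texts = ["stocks fell on monday"]

tokenizer = Tokenizer(num_words=20000)  # keep the top 20K words
tokenizer.fit_on_texts(train_texts)     # fit on training data only

# Convert both splits to integer word sequences.
x_train = tokenizer.texts_to_sequences(train_texts)
x_test = tokenizer.texts_to_sequences(test_texts)

# Pad so every sample has the same length.
x_train = pad_sequences(x_train, maxlen=200)
x_test = pad_sequences(x_test, maxlen=200)
```

Note that the tokenizer is fitted on the training data only, so no information from the test set leaks into the vocabulary.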
Build, Train and Evaluate your model
Building machine learning models with Keras is all about assembling layers, the data-processing building blocks, much like we would assemble Lego bricks. These layers allow us to specify the sequence of transformations we want to perform on our input. As our learning algorithm takes in a single text input and outputs a single classification, we can create a linear stack of layers using the Sequential model API.
You can build a model as follows:
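A minimal MLP sketch for the 46-class Reuters task; it assumes the samples have been vectorized into 10,000-dimensional multi-hot vectors (for example with tokenizer.sequences_to_matrix(..., mode="binary")):

```python
from tensorflow.keras import layers, models

# Linear stack of layers via the Sequential API.
model = models.Sequential([
    layers.Input(shape=(10000,)),            # multi-hot input vectors
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),                     # dropout for regularization
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(46, activation="softmax"),  # one output unit per class
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # integer labels
              metrics=["accuracy"])
model.summary()
```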
Train your model and tune hyperparameters
Now we have to train our model and tune its hyperparameters: the number of epochs, the validation split used to hold out data, the dropout rate, and the learning rate.
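A sketch of the training step, assuming x_train and y_train are NumPy arrays prepared as above; epochs, batch size, dropout, and learning rate are the knobs worth tuning:

```python
# Hold out 20% of the training data as a validation set.
history = model.fit(
    x_train, y_train,
    epochs=20,
    batch_size=128,
    validation_split=0.2,
)

# Evaluate on the held-out test split.
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f"Test accuracy: {test_acc:.3f}")
```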
You can plot the accuracy and loss graph using matplotlib.
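For example, using the History object returned by model.fit above:

```python
import matplotlib.pyplot as plt

# Training vs. validation accuracy per epoch.
plt.plot(history.history["accuracy"], label="train accuracy")
plt.plot(history.history["val_accuracy"], label="val accuracy")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
```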
In this way, we can train the model.
Click here to deploy your AI workloads on E2E GPU Cloud.