Introduction
Imbalanced data refers to datasets where the target class has an uneven distribution of observations, i.e. one class label has a very high number of observations and the other has a very low number. Let's say we have a dataset of cancer patients and we are going to use it to build a predictive model that takes a patient's record as input and says whether or not the patient is diagnosed with cancer.
The problem of imbalanced datasets is very common. It arises when one set of classes dominates over another, which biases the machine learning model towards the majority class and leads to poor classification of the minority classes. As a consequence, plain accuracy becomes a misleading measure of performance. This is a very common situation in machine learning: datasets with a disproportionate ratio of observations in each class.
Imbalanced classification is also called rare event modeling. When the target label of a classification dataset is highly imbalanced, we call the minority event a rare event. In this case, the model learns mostly from the majority class, and predicting the minority class can be challenging. For example, if only 0.01% of the dataset is the minority event, the model tends not to do a good job of identifying the pattern of that minority event.
So, let's say you have a thousand records, of which 900 are cancer and 100 are non-cancer. This is an example of an imbalanced dataset because your majority class is about 9 times bigger than the minority class. Imbalance can occur in different ways in your datasets; for instance, you could have a lot of positive examples and only a few negative points thrown in.
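As a quick sanity check before any modeling, it helps to look at the class distribution directly. Here is a minimal sketch with pandas, using a hypothetical diagnosis column that mirrors the 900/100 example above:

```python
import pandas as pd

# Hypothetical dataset: 900 records of one class, 100 of the other
df = pd.DataFrame({"diagnosis": ["cancer"] * 900 + ["non-cancer"] * 100})

# Inspect the class distribution before training anything
counts = df["diagnosis"].value_counts()
print(counts)                         # cancer: 900, non-cancer: 100
print(counts.max() / counts.min())    # imbalance ratio, here 9.0
```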
If you train a model on such data as it is, the minority points will have a relatively small effect on the loss function and your model may tend to just ignore them. If these points are important, you don't want that, so you need to correct this imbalance somehow.
Handling an Imbalanced Dataset:
Problem:
- You have a majority class with many times the number of examples as the minority class
- Or: classes are balanced but the associated costs are not (e.g., false negatives are worse than false positives)
How to Resolve?
- Add class weights to the loss function: give the minority class more weight
- In practice: set class_weight='balanced' (see the code sketch after this list)
- Change the prediction threshold to minimize false negatives or false positives.
- There are also things we can do by preprocessing the data.
- Resample the data to correct the imbalance.
- Random or model-based
- Generate synthetic samples for the minority class.
- Build ensembles over different resampled datasets.
- Combinations of these.
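As a sketch of the first two fixes (class weights and a shifted decision threshold), assuming scikit-learn; the logistic-regression model and the 0.3 threshold are illustrative choices, not prescriptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy imbalanced dataset: roughly 5% positives
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# 1) Give the minority class more weight in the loss function
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

# 2) Move the prediction threshold to trade false positives for false negatives
proba = clf.predict_proba(X_test)[:, 1]
y_pred = (proba >= 0.3).astype(int)   # a lower threshold reduces false negatives
```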
An effective way to handle imbalanced data is to downsample and upweight the majority class. Let's start by defining those two new terms:
- Downsampling (undersampling): It means training on a disproportionately low subset of the majority class examples.
- Upweighting: It means adding an example weight to the downsampled class equal to the factor by which you downsampled.
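A minimal sketch of downsampling plus upweighting with pandas and scikit-learn, assuming a binary label where class 0 is the majority; the factor of 10 is an illustrative choice:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature": rng.normal(size=11000),
    "label": [0] * 10000 + [1] * 1000,   # class 0 is the majority
})

factor = 10
majority = df[df.label == 0].sample(frac=1 / factor, random_state=0)  # downsample
minority = df[df.label == 1]
balanced = pd.concat([majority, minority])

# Upweight the downsampled class by the same factor so the model stays calibrated
weights = np.where(balanced.label == 0, factor, 1)

clf = LogisticRegression()
clf.fit(balanced[["feature"]], balanced.label, sample_weight=weights)
```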
Oversampling and undersampling, in turn, are techniques that change the ratio of the classes in an imbalanced modeling dataset; oversampling is also termed upsampling.
Oversampling: Imbalanced learning is a fundamental problem in machine learning. When the number of samples from different categories in a classification dataset differs significantly, the dataset is called imbalanced; oversampling addresses this by increasing the number of minority-class samples.
Minority categories in such datasets have smaller sample sizes and poorer sample quality, yet they typically carry the more important information. We therefore focus on a model's ability to correctly classify the minority-class samples. In a complex network system, for example, it is more important to accurately diagnose the network fault types and keep the system running than to diagnose the network as normal.
Algorithm-level approaches analyze how the cost of misclassification varies across the different error cases and optimize the classification algorithm accordingly to improve its performance. Data-based approaches, however, are more popular in the existing literature than approaches that adapt a specific classification algorithm to a specific imbalanced dataset.
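As a rough sketch of the algorithm-level, cost-sensitive idea, here is how explicit misclassification costs might be encoded as class weights in scikit-learn; the 10:1 cost ratio and the random-forest choice are illustrative assumptions, not recommendations from the text:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy imbalanced dataset: roughly 5% positives (class 1)
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# Cost-sensitive training: each minority-class error counts ten times
# as much as a majority-class error during fitting
clf = RandomForestClassifier(class_weight={0: 1, 1: 10}, random_state=0)
clf.fit(X, y)
```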
Undersampling: Undersampling is a technique to balance uneven datasets by keeping all of the data in the minority class and decreasing the size of the majority class. It is one of several techniques data scientists can use to extract more accurate information from originally imbalanced datasets.
Advantages:
- Reduce Dataset Size.
- Low Storage Requirement.
- Saves Computation Cost.
- Requires less run time.
Disadvantages:
- Undersampling may remove useful information from the majority class.
- The random subsample may be a biased sample.
- Random oversampling, by contrast, duplicates examples from the minority class in the training dataset and can result in overfitting for some models.
Techniques to handle an imbalanced dataset:
- Undersampling Majority Class: The first technique to handle the imbalance in your dataset is undersampling the majority class. Let's say you have 99,000 samples belonging to one class (e.g., the green class) and 1,000 samples belonging to the other (the red class); think of a fraud-detection scenario where 1,000 transactions are fraudulent and 99,000 are not. To tackle this imbalance, you can randomly pick 1,000 samples from the 99,000 majority samples, discard the remaining ones, combine them with the 1,000 red samples, and then train your machine learning model (this and the other resampling techniques in this list are sketched in code after the SMOTE example below). Obviously this is not the best approach, because you are throwing away so much data.
- CNN (Condensed Nearest Neighbours) Undersampling: This technique keeps only a subset of events. You take the events of the dataset one at a time and add them to a "store" if they cannot be classified correctly based on the current contents of the store. The store ends up containing all of the events from the minority class and only those events from the majority class that cannot be classified correctly.
- Random Oversampling: Imagine a scenario where you have a minority class with 100 samples and a majority class with a thousand samples. We pick a row at random from the minority class, say row number 99, and add a copy of it to the dataset, and we continue doing this iteratively until the minority class matches the total number of points in the majority class. This is how random oversampling works.
- Oversampling Minority Class by Duplication: Here we generate new samples from the current samples by simply duplicating them.
- Oversampling Minority Class using SMOTE (Synthetic Minority Oversampling Technique): For example, say you have two features, X1 and X2, and two classes of samples, black and blue. Suppose there are more black samples than blue samples, so there is a class imbalance. Rather than decreasing the black samples, let's increase the blue samples, the minority class, by using SMOTE.
So, say there are four points in the minority class: P1, P2, P3 and P4.
Now, if SMOTE is run on this minority class with the number of nearest neighbors set to, say, three, SMOTE first finds the nearest neighbors of every point. For P1, the neighbors are P2 and P3, and with a neighbor count of 3, P4 is a neighbor as well. Similarly, for P3 the neighbors are P1, P2 and P4, and for P4 they are P1, P2 and P3. Based on how many synthetic samples we want SMOTE to create, it then draws the lines joining each minority sample to its nearest neighbors and places new synthetic instances somewhere on those lines. We can have multiple synthetic points on each of these lines.
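The resampling techniques in the list above can be tried end to end with the imbalanced-learn package, which exposes random undersampling, Condensed Nearest Neighbours, random oversampling, and SMOTE behind the same fit_resample interface. This is a minimal sketch assuming that package is installed; the toy dataset and parameter values are illustrative:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import CondensedNearestNeighbour, RandomUnderSampler
from sklearn.datasets import make_classification

# Toy dataset with two features (X1, X2) and a roughly 10:1 class ratio
X, y = make_classification(n_samples=1100, n_features=2, n_informative=2,
                           n_redundant=0, weights=[0.91, 0.09], random_state=0)
print("original:", Counter(y))

samplers = {
    # Randomly discard majority samples until the classes match
    "random undersampling": RandomUnderSampler(random_state=0),
    # Keep minority samples plus the majority samples the 1-NN "store" misclassifies
    "CNN undersampling": CondensedNearestNeighbour(n_neighbors=1, random_state=0),
    # Duplicate randomly chosen minority samples until the classes match
    "random oversampling": RandomOverSampler(random_state=0),
    # Interpolate synthetic minority points between each sample and its 3 nearest neighbours
    "SMOTE": SMOTE(k_neighbors=3, random_state=0),
}

for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X, y)
    print(f"{name:>22}: {Counter(y_res)}")
```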
Conclusion: These approaches can be effective, although they can be hit-or-miss and time-consuming. Often the shortest path to a good result on a new classification task is to systematically evaluate a suite of machine learning algorithms to discover what works well, and then double down on it. The same approach can be used for imbalanced classification problems, tailored to the range of data sampling, cost-sensitive, and one-class classification algorithms that one may choose from.
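A minimal sketch of such a systematic spot-check, assuming scikit-learn and an imbalance-aware metric such as F1; the candidate models and the metric are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy imbalanced dataset: roughly 10% positives
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# Spot-check a few candidate models with cross-validated F1 instead of accuracy
for model in (LogisticRegression(max_iter=1000, class_weight="balanced"),
              DecisionTreeClassifier(class_weight="balanced", random_state=0),
              RandomForestClassifier(class_weight="balanced", random_state=0)):
    scores = cross_val_score(model, X, y, scoring="f1", cv=5)
    print(type(model).__name__, scores.mean().round(3))
```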
You can get more such insightful updates here: https://www.e2enetworks.com/blog
Are you looking for GPUs for your Machine Learning tasks? Check us out: https://www.e2enetworks.com/products
Feel free to connect with us for solving any query you may have: https://www.e2enetworks.com/contact-us