A while back, I found myself thinking about various data augmentation techniques for imbalanced datasets, i.e., datasets in which one or more classes are over-represented compared to the others, and wondering how these techniques stack up against each other. So I decided to set up a simple experiment to compare them. This post describes the experiment and its results.
The dataset I chose for this experiment was the SMS Spam Collection Dataset from Kaggle, a collection of almost 5600 text messages, consisting of 4825 (87%) ham and 747 (13%) spam messages. The network is a simple 3-layer fully connected network (FCN), whose input is a 512-element vector generated by running the Google Universal Sentence Encoder (GUSE) against the text message, and which outputs the argmax of a 2-element vector (representing "ham" or "spam"). The text augmentation techniques I considered in my experiment are as follows:
- Baseline — this is the baseline for comparing results. Since the task is binary classification, the metric we chose is accuracy. We train the network for 10 epochs using cross-entropy loss and the AdamW optimizer with a learning rate of 1e-3.
- Class Weights — class weights attempt to address data imbalance by giving more weight to the minority class. Here we assign class weights to our loss function, proportional to the inverse of the class counts in the training data.
- Undersampling the Majority Class — in this scenario, we sample from the majority class a number of records equal to the size of the minority class, and use only this sampled subset of the majority class, plus the minority class, for training.
- Oversampling the Minority Class — this is the opposite scenario, in which we sample (with replacement) from the minority class a number of records equal to the size of the majority class. The sampled set will contain repetitions. We then use the sampled set plus the majority class for training.
- SMOTE — this is a variant of the previous strategy of oversampling the minority class. SMOTE (Synthetic Minority Oversampling TEchnique) ensures more heterogeneity in the oversampled minority class by creating synthetic records that interpolate between real records. SMOTE requires the input data to be vectorized.
- Text Augmentation — like the two previous approaches, this is another oversampling strategy. Heuristics and ontologies are used to make changes to the input text while preserving its meaning as far as possible. I used TextAttack, a Python library for text augmentation (and for generating examples for adversarial attacks).
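As a minimal sketch of the inverse-count class weighting described above (the counts are this dataset's ham/spam counts; the normalization convention is my own choice, not necessarily what the original notebooks do):

```python
import numpy as np

# Class counts from the dataset: ham = 4825, spam = 747.
counts = np.array([4825, 747], dtype=np.float64)

# Weight each class by the inverse of its count, normalized so the
# weights sum to the number of classes (a common convention).
weights = 1.0 / counts
weights = weights / weights.sum() * len(counts)

print(weights)  # the minority class (spam) gets the larger weight
```

In PyTorch, weights like these would typically be handed to the loss, e.g. `torch.nn.CrossEntropyLoss(weight=...)`.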
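The two naive resampling schemes can be sketched with plain NumPy index manipulation (the label counts and seed here are toy values, not the real dataset):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy label array standing in for the real training split:
# 0 = ham (majority), 1 = spam (minority).
y = np.array([0] * 50 + [1] * 8)
maj_idx = np.flatnonzero(y == 0)
min_idx = np.flatnonzero(y == 1)

# Undersampling: keep only as many majority records as there are minority records.
under = np.concatenate([rng.choice(maj_idx, size=len(min_idx), replace=False), min_idx])

# Oversampling: sample the minority class with replacement up to the majority count.
over = np.concatenate([maj_idx, rng.choice(min_idx, size=len(maj_idx), replace=True)])

print(len(under), len(over))  # 16, 100
```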
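To illustrate the interpolation idea behind SMOTE, here is a from-scratch sketch (not the implementation the experiment used; a real run would more likely rely on a library such as imbalanced-learn):

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_like(X_min, n_new, k=3):
    """Generate n_new synthetic minority points by interpolating between
    a sampled minority point and one of its k nearest minority neighbors."""
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Distances from point i to every other minority point.
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        d[i] = np.inf
        j = rng.choice(np.argsort(d)[:k])   # one of the k nearest neighbors
        t = rng.random()                    # interpolation factor in [0, 1)
        out.append(X_min[i] + t * (X_min[j] - X_min[i]))
    return np.array(out)

X_min = rng.normal(size=(8, 512))   # 8 minority vectors, 512-dim like GUSE embeddings
synthetic = smote_like(X_min, n_new=42)
print(synthetic.shape)  # (42, 512)
```

Every synthetic point lies on a line segment between two real minority points, which is why SMOTE only makes sense on vectorized inputs.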
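The following toy sketch illustrates the synonym-replacement idea behind augmenters like TextAttack's; the synonym table and `augment` function are invented for illustration and are not TextAttack's actual API:

```python
import random

# Tiny hand-written synonym table -- a real augmenter would draw
# candidates from WordNet, embeddings, or similar resources.
SYNONYMS = {
    "free": ["complimentary"],
    "win": ["receive"],
    "prize": ["reward"],
}

def augment(text, rng):
    """Replace each word that has a known synonym with probability 0.5."""
    words = text.split()
    out = [rng.choice(SYNONYMS[w]) if w in SYNONYMS and rng.random() < 0.5 else w
           for w in words]
    return " ".join(out)

rng = random.Random(7)
msg = "win a free prize today"
for _ in range(3):
    print(augment(msg, rng))
```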
A few points to note here.
First, all the sampling strategies, i.e., all the techniques listed above except the Baseline and Class Weights, require you to separate your data into training, validation, and test splits before they are applied. Moreover, the sampling should be done only on the training split. Otherwise, you risk data leakage, where the augmented data leaks into the validation and test splits, giving you overly optimistic results during model development that will invariably fail to hold up once you move your model into production.
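A sketch of this leakage-safe ordering: the split is fixed first, and resampling then draws only from training indices (the sizes and seed here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(13)
n = 1000
idx = rng.permutation(n)

# 70/10/20 train/validation/test split -- computed BEFORE any resampling.
n_train, n_val = int(0.7 * n), int(0.1 * n)
train, val, test = idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# Any over/undersampling or augmentation now touches only `train`;
# `val` and `test` keep the natural class distribution.
train_resampled = rng.choice(train, size=2 * len(train), replace=True)

assert set(train_resampled) <= set(train)   # nothing leaks into val/test
print(len(train), len(val), len(test))  # 700 100 200
```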
Second, augmenting your data using SMOTE can only be done on vectorized data, since the idea is to find and use points in feature hyperspace that lie "in between" your existing records. Because of this, I decided to pre-vectorize my text inputs using GUSE. The other augmentation approaches considered here do not need the input to be pre-vectorized.
The code for this experiment is divided into two notebooks.
- blog_text_augment_01.ipynb — in this notebook, I split the dataset into a train/validation/test split of 70/10/20 and generate vector representations for each text message using GUSE. I also oversample the minority class (spam) by generating roughly 5 augmentations for each record, and generate their vector representations as well.
- blog_text_augment_02.ipynb — in this notebook, I define a common network, which I retrain using PyTorch for each of the 6 augmentation scenarios listed above, and compare their accuracies.
The results are shown below, and seem to indicate that the oversampling strategies work best, both the naive one and the one based on SMOTE. The next best choice seems to be class weights. This seems understandable, because oversampling gives the network the most data to train with. That is probably also why undersampling does not work well. I was also a bit surprised that text augmentation did not perform as well as the other oversampling techniques.
However, the differences here are quite small and possibly not really significant (note that the y-axis in the bar chart is exaggerated (0.95 to 1.0) to highlight this difference). I also found that the results varied across runs, probably as a result of different initialization conditions. But overall, the pattern shown above was the most common.
Edit 2021-02-13: @Yorko suggested using confidence intervals to address the concern above (see the comments below), so I collected the results from 10 runs and computed the mean and standard deviation for each technique across all the runs. The updated bar chart above shows the mean value for each technique, with error bars of +/- 2 standard deviations from the mean. Thanks to the error bars, we can now draw a few additional conclusions. First, we observe that SMOTE oversampling does indeed give better results than naive oversampling. The chart also shows that undersampling results can be highly variable.
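The mean and error-bar computation amounts to the following (the accuracy values here are made up for illustration; they are not the actual experimental results):

```python
import numpy as np

# Hypothetical accuracies from 10 runs of one technique, just to show
# the mean +/- 2 sigma error-bar computation used for the chart.
accs = np.array([0.978, 0.981, 0.975, 0.983, 0.979,
                 0.980, 0.977, 0.982, 0.976, 0.984])

mean, std = accs.mean(), accs.std(ddof=1)   # sample standard deviation
lo, hi = mean - 2 * std, mean + 2 * std     # +/- 2 sigma error bars
print(f"{mean:.4f} +/- {2 * std:.4f}")
```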