Imbalanced Data receives the proper treatment in a new, outstanding Machine Learning book

Paulo Cysne
4 min read · Dec 12, 2023

"Machine Learning for Imbalanced Data" is a truly unique book by Kumar Abhishek and Dr. Mounir Abdelaziz covering an important subject comprehensively and authoritatively.

Imbalanced data, where certain classes may have considerably fewer instances than others, is a true issue that must be addressed. The authors give a warning in bold in the preface, “While imbalanced data can present challenges, it’s important to understand that the techniques to address this imbalance are not universally applicable. Their relevance and necessity depend on various factors such as the domain, the data distribution, the performance metrics you’re optimizing, and the business objectives.” But they also point out that “Familiarizing yourself with these techniques will provide you with a comprehensive toolkit, preparing you for scenarios that you may not yet know you’ll encounter.”

The book starts with a very clear introduction to imbalanced datasets, giving brief but easy-to-understand descriptions of the different learning modes (supervised, unsupervised, reinforcement), the types of supervised learning (regression, classification), and the classical supervised machine learning models (logistic regression, SVM, KNN, trees, ensemble models, and neural networks). It then covers the model evaluation metrics for classification (confusion matrix, ROC and precision-recall curves, and AUC-ROC), explaining them very well and showing how they are affected by imbalanced data.
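To see for yourself why these metrics matter, here is a minimal sketch of my own (not code from the book) showing how accuracy can look excellent on a 99:1 dataset while the per-class metrics reveal the real picture:

```python
# Accuracy misleads on imbalanced data: with a 99:1 class ratio, even a
# weak model can score ~99% accuracy while minority-class recall collapses.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))        # looks great
print(classification_report(y_test, y_pred, digits=3))    # minority recall tells the truth
print("ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```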

It then addresses two important questions: “When can we have an imbalance in datasets?” and “Why can imbalanced data be a challenge?” And it moves on to handle an important case: when not to worry about data imbalance.

The imbalanced-learn library (imported as imblearn) is a Python package that offers several techniques for dealing with data imbalance, and it is used heavily in the first half of this book. Clear and illustrative examples using this library get the reader started.
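As a taste of the library (a minimal sketch of my own, not an example from the book), imblearn's resamplers follow the familiar scikit-learn pattern but expose fit_resample instead of fit:

```python
# Basic imbalanced-learn usage: rebalance a 95:5 dataset by duplicating
# minority samples with RandomOverSampler.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)
print("Before:", Counter(y))   # heavily skewed toward class 0

ros = RandomOverSampler(random_state=0)
X_res, y_res = ros.fit_resample(X, y)
print("After:", Counter(y_res))  # classes balanced by duplicating minority samples
```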

The next chapter is on oversampling methods, with examples from the imblearn library. It covers random oversampling, SMOTE, SMOTE variants, and ADASYN, showing when and how to use these techniques and comparing them. Oversampling in multi-class classification is also treated, as are the problems with SMOTE and when to avoid oversampling. Deep learning is treated later in the book.
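Here is a hedged sketch of my own comparing two of these oversamplers; the book's experiments are more thorough:

```python
# SMOTE interpolates between minority-class neighbors to synthesize new
# samples; ADASYN additionally focuses generation on harder minority regions.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, ADASYN

X, y = make_classification(n_samples=5_000, weights=[0.9, 0.1], random_state=0)

for sampler in (SMOTE(random_state=0), ADASYN(random_state=0)):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))
```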

Undersampling methods are covered next: strategies for removing noisy observations, strategies for removing easy observations, and strategies that remove samples uniformly. These include random undersampling, cluster centroids, ENN, RENN, AllKNN, Tomek links, and others.
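A quick sketch of my own contrasting two of these styles, noise removal via Tomek links versus uniform random undersampling:

```python
# TomekLinks removes borderline majority samples that form "links" with
# minority neighbors; RandomUnderSampler discards majority samples uniformly.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks, RandomUnderSampler

X, y = make_classification(n_samples=5_000, weights=[0.9, 0.1], random_state=0)

for sampler in (TomekLinks(), RandomUnderSampler(random_state=0)):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))
```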

Afterwards, ensemble methods are covered, including boosting and bagging techniques for imbalanced data as well as ensembles of ensembles. A model comparison is given at the end of the chapter.
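For a flavor of these estimators (my own minimal sketch, not the book's comparison), imblearn ships ensemble classifiers that build resampling directly into bagging and boosting:

```python
# BalancedRandomForestClassifier undersamples each bootstrap; RUSBoostClassifier
# combines random undersampling with AdaBoost-style boosting.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from imblearn.ensemble import BalancedRandomForestClassifier, RUSBoostClassifier

X, y = make_classification(n_samples=5_000, weights=[0.9, 0.1], random_state=0)

for clf in (BalancedRandomForestClassifier(random_state=0), RUSBoostClassifier(random_state=0)):
    scores = cross_val_score(clf, X, y, scoring="average_precision", cv=5)
    print(type(clf).__name__, round(scores.mean(), 3))
```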

As an alternative to oversampling (which can lead to overfitting of the model) and undersampling (which can lead to the loss of useful information), an entire chapter is then dedicated to cost-sensitive learning.
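In scikit-learn, the simplest entry point to cost-sensitive learning is the class_weight parameter; here is a minimal sketch of my own (the cost ratio is illustrative, not from the book):

```python
# Cost-sensitive learning without resampling: make minority-class errors
# more expensive in the loss via class_weight.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# "balanced" reweights classes inversely to their frequency; an explicit
# dict such as {0: 1, 1: 10} would encode a domain-specific cost ratio.
model = LogisticRegression(class_weight="balanced").fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test), digits=3))
```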

Class imbalance is a common issue for deep learning models, and four entire chapters are dedicated to the subject. The first builds a general understanding; the second covers data-level deep learning methods, the third algorithm-level deep learning techniques, and the fourth hybrid deep learning methods.
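As one illustrative example of an algorithm-level technique (a hedged PyTorch sketch of my own, with made-up weights, not code from the book), a class-weighted loss makes minority-class errors cost more:

```python
# Weight the cross-entropy loss so minority-class mistakes dominate the
# gradient; the 10:1 weight here is an illustrative assumption.
import torch
import torch.nn as nn

class_weights = torch.tensor([1.0, 10.0])       # majority class 0, minority class 1
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(8, 2)                      # a batch of model outputs
targets = torch.randint(0, 2, (8,))             # ground-truth labels
loss = criterion(logits, targets)               # minority errors are weighted 10x
print(loss.item())
```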

The last chapter covers model calibration and the post-processing needed for the prediction scores we get from trained models. It shows how to measure how well calibrated a model is and how imbalanced datasets make model calibration essential. A very useful appendix on the machine learning pipeline in production completes the book.
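A minimal sketch of my own showing the moving parts in scikit-learn, measuring calibration with calibration_curve and correcting it with CalibratedClassifierCV:

```python
# Measure calibration (predicted probability vs. observed frequency) and
# recalibrate with isotonic regression.
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

base = LogisticRegression(class_weight="balanced")
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=5).fit(X_train, y_train)

# A calibrated model's (predicted, observed) pairs track the diagonal.
prob_true, prob_pred = calibration_curve(
    y_test, calibrated.predict_proba(X_test)[:, 1], n_bins=10
)
print(list(zip(prob_pred.round(2), prob_true.round(2))))
```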

The book offers end-of-chapter questions, with thorough answers at the end of the book, which greatly help with learning and practicing the material. Each chapter also has at least one concrete use case from a leading company.

As the authors explain, “Establishing a sound baseline solution is crucial. Implementing various methods, such as those in cost-sensitive learning and algorithm-level deep learning techniques, can offer insights into handling imbalanced datasets effectively. Each method has its pros and cons.”

Sometimes data imbalance may not be a problem at all: on tabular data, tree-based models such as XGBoost can be robust to certain kinds of imbalance. With this lucid and comprehensive book, you are ready to handle all cases with confidence and robust techniques. I highly recommend it.

You can buy this excellent book on Amazon.
