DEVELOPMENT OF ADVANCED DATA SAMPLING SCHEMES TO ALLEVIATE CLASS IMBALANCE PROBLEM IN DATA MINING CLASSIFICATION ALGORITHMS

ABSTRACT

In data mining, classification is the process of finding a set of models that distinguish data classes in order to predict unknown class labels. The class imbalance problem arises when one class is heavily under-represented, so that standard classifiers become biased towards the majority class and ignore the minority class. Existing classifiers tend to maximise overall prediction accuracy and minimise error at the expense of the minority class. However, research has shown that the misclassification cost of the minority class is higher, and this class should not be ignored since it is the class of interest. This work was therefore designed to develop advanced data sampling schemes that improve classification performance on imbalanced datasets, with a view to increasing the recall of the minority class. The Synthetic Minority Oversampling Technique (SMOTE) was extended to SMOTE+300% and combined with existing under-sampling schemes: Random Under-Sampling (RUS), Neighbourhood Cleaning Rule (NCL), Wilson’s Edited Nearest Neighbour (ENN) and Condensed Nearest Neighbour (CNN). Five advanced data sampling schemes (SMOTE300ENN, SMOTE300RUS, SMOTE300NCL, SMOTENCL and SMOTERUS) were coded in Java and implemented through the Application Programming Interface of WEKA, a data mining tool. The existing and developed schemes were applied to three datasets collected in Ilesha and Ibadan, Nigeria: Diabetes Mellitus (DM, 886 records), Senior Secondary School Certificate Result (SSSCR, 1,163 records) and Contraceptive Methods (CM, 786 records). Their performance was evaluated with different classification algorithms using the Receiver Operating Characteristic (ROC), recall of the minority class and performance gain metrics. Friedman’s Test at a significance level of 0.05 was used to analyse these schemes against the classification algorithms.
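
To make the idea of combining SMOTE at 300% with random under-sampling more concrete, the following is a minimal sketch in plain Java. It is not the implementation developed in this work and does not use the WEKA API. The class name Smote300RusSketch, the method names, the choice of k, the use of Euclidean distance on numeric attributes only, and the decision to reduce the majority class to the size of the oversampled minority class are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only: SMOTE at 300% followed by random under-sampling.
final class Smote300RusSketch {

    /** Euclidean distance between two numeric feature vectors of equal length. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * Returns only the synthetic minority instances: percent/100 per original instance.
     * Each synthetic instance lies on the segment between an instance and one of
     * its k nearest minority neighbours.
     */
    static List<double[]> smote(List<double[]> minority, int percent, int k, Random rnd) {
        if (minority.size() < 2) {
            throw new IllegalArgumentException("SMOTE needs at least two minority instances");
        }
        int perInstance = percent / 100; // 300% -> 3 synthetic instances each
        List<double[]> synthetic = new ArrayList<>();
        for (double[] x : minority) {
            // Rank the other minority instances by distance to x.
            List<double[]> others = new ArrayList<>(minority);
            others.remove(x); // arrays compare by reference, so this drops x itself
            others.sort(Comparator.comparingDouble((double[] o) -> distance(x, o)));
            List<double[]> neighbours = others.subList(0, Math.min(k, others.size()));
            for (int n = 0; n < perInstance; n++) {
                double[] nb = neighbours.get(rnd.nextInt(neighbours.size()));
                double gap = rnd.nextDouble(); // interpolation factor in [0, 1)
                double[] s = new double[x.length];
                for (int j = 0; j < x.length; j++) {
                    s[j] = x[j] + gap * (nb[j] - x[j]);
                }
                synthetic.add(s);
            }
        }
        return synthetic;
    }

    /** Keeps a random subset of at most target majority instances. */
    static List<double[]> randomUnderSample(List<double[]> majority, int target, Random rnd) {
        List<double[]> shuffled = new ArrayList<>(majority);
        Collections.shuffle(shuffled, rnd);
        return new ArrayList<>(shuffled.subList(0, Math.min(target, shuffled.size())));
    }

    public static void main(String[] args) {
        Random rnd = new Random(7);
        List<double[]> minority = new ArrayList<>();
        List<double[]> majority = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            minority.add(new double[] {rnd.nextDouble(), rnd.nextDouble()});
        }
        for (int i = 0; i < 20; i++) {
            majority.add(new double[] {5 + rnd.nextDouble(), 5 + rnd.nextDouble()});
        }
        // Step 1: oversample the minority class by 300% (assumed k = 3).
        List<double[]> oversampled = new ArrayList<>(minority);
        oversampled.addAll(smote(minority, 300, 3, rnd));
        // Step 2: under-sample the majority class to the new minority size (assumed target).
        List<double[]> reduced = randomUnderSample(majority, oversampled.size(), rnd);
        System.out.println("minority: " + oversampled.size() + ", majority: " + reduced.size());
    }
}
```

In this sketch the oversampling step runs first, so that the under-sampling target can be derived from the enlarged minority class. The cleaning-based combinations (ENN, NCL) would replace the second step with a neighbourhood-based removal rule instead of random selection.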