Please use this identifier to cite or link to this item:
Keywords: Imbalanced dataset
Receiver operating characteristics
Data reduction techniques
Issue Date: Sep-2015
Abstract: In data mining, classification is the process of finding a set of models that distinguish data classes in order to predict unknown class labels. The class imbalance problem occurs when standard classifiers are biased towards the majority class while the minority class is ignored. Existing classifiers tend to maximise overall prediction accuracy and minimise error at the expense of the minority class. However, research has shown that the misclassification cost of the minority class is higher, and that this class should not be ignored since it is the class of interest. This work was therefore designed to develop advanced data sampling schemes that improve classification performance on imbalanced datasets, with a view to increasing the recall of the minority class. Synthetic Minority Oversampling Technique (SMOTE) was extended to SMOTE+300% and combined with existing under-sampling schemes: Random Under-Sampling (RUS), Neighbourhood Cleaning Rule (NCL), Wilson’s Edited Nearest Neighbour (ENN) and Condensed Nearest Neighbour (CNN). Five advanced data sampling schemes (SMOTE300ENN, SMOTE300RUS, SMOTE300NCL, SMOTENCL and SMOTERUS) were coded in Java and implemented in WEKA, a data mining tool, used as an Application Programming Interface. The existing and developed schemes were applied to 886 Diabetes Mellitus (DM), 1,163 Senior Secondary School Certificate Result (SSSCR) and 786 Contraceptive Methods (CM) datasets. The datasets were collected in Ilesha and Ibadan, Nigeria. The performance of the schemes was determined with different classification algorithms using Receiver Operating Characteristics (ROC), recall of the minority class and performance gain metrics. Friedman’s Test at p = 0.05 was used to analyse these schemes against the classification algorithms.
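As an illustration of the kind of hybrid scheme described above, the following is a minimal Python sketch of SMOTE at 300% followed by Wilson's ENN cleaning. It is not the thesis's Java/WEKA implementation. The function names (`nearest`, `smote`, `enn`), the neighbour counts (k = 5 for SMOTE, k = 3 for ENN), the choice to apply ENN to both classes, and the toy data are all assumptions made for this example.

```python
import math
import random
from collections import Counter


def nearest(point, pool, k):
    """Return the k samples in pool closest to point (Euclidean), excluding point itself."""
    others = [p for p in pool if p is not point]
    return sorted(others, key=lambda p: math.dist(point[0], p[0]))[:k]


def smote(samples, minority, percent=300, k=5, rng=None):
    """Add percent/100 synthetic samples per minority sample (assumed reading of SMOTE+300%)."""
    rng = rng or random.Random(0)
    minority_pts = [s for s in samples if s[1] == minority]
    synthetic = []
    for s in minority_pts:
        neighbours = nearest(s, minority_pts, k)
        for _ in range(percent // 100):
            n = rng.choice(neighbours)
            gap = rng.random()  # position on the segment between s and its neighbour
            feats = tuple(a + gap * (b - a) for a, b in zip(s[0], n[0]))
            synthetic.append((feats, minority))
    return samples + synthetic


def enn(samples, k=3):
    """Keep only samples whose label matches the majority label of their k nearest neighbours."""
    kept = []
    for s in samples:
        votes = Counter(n[1] for n in nearest(s, samples, k))
        if votes.most_common(1)[0][0] == s[1]:
            kept.append(s)
    return kept


# Hypothetical toy data: 40 majority ("neg") and 8 minority ("pos") points in 2-D.
rng = random.Random(42)
majority = [((rng.gauss(0, 1), rng.gauss(0, 1)), "neg") for _ in range(40)]
minority = [((rng.gauss(3, 1), rng.gauss(3, 1)), "pos") for _ in range(8)]
data = majority + minority

oversampled = smote(data, "pos", percent=300, k=5, rng=rng)
cleaned = enn(oversampled, k=3)
print(Counter(label for _, label in oversampled))
print(Counter(label for _, label in cleaned))
```

Oversampling first and cleaning afterwards means ENN can also remove synthetic points that land inside the majority region, which is one plausible reason for pairing the two steps.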
The ROC metric revealed that the mean rank values for the DM, SSSCR and CM datasets treated with the advanced schemes ranged from 6.9-13.8, 3.8-12.8 and 6.6-13.5, respectively, compared with 3.4-7.8, 2.6-12.6 and 2.8-7.9, respectively, for the existing schemes. These results signify improved classification performance. For the recall metric, the values for the DM, SSSCR and CM datasets treated with the advanced schemes ranged from 9.4-13.0, 6.3-14.0 and 7.3-13.6, respectively, compared with 2.0-7.5, 2.5-8.9 and 2.1-7.4, respectively, for the existing schemes. These results show increased detection of the minority class. Performance gains by the advanced schemes over the original datasets (DM, SSSCR and CM) were: SMOTE300ENN (27.1%), SMOTE300RUS (11.6%), SMOTE300NCL (15.5%), SMOTENCL (8.3%) and SMOTERUS (7.3%). A significant difference was observed among all the schemes. The higher the mean rank value and performance gain, the better the scheme. The SMOTE300ENN scheme gave the highest ROC and recall values in the three datasets, which were 13.8, 12.8, 12.3 and 13.0, 14.0, 13.6, respectively. The developed Synthetic Minority Oversampling Technique 300 Wilson’s Edited Nearest Neighbour scheme significantly improved classification performance and increased the recall of the minority class over the existing schemes using the same datasets. It is therefore recommended for classification of imbalanced datasets.
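To make the mean rank values above easier to interpret, here is a small Python sketch of how Friedman mean ranks and the Friedman chi-square statistic can be computed. It assumes that each row is one classification algorithm, each column is one sampling scheme, and that higher scores receive higher ranks (consistent with "the higher the mean rank value, the better"). The scores and scheme labels are hypothetical and are not taken from the thesis. The statistic is computed without a tie correction.

```python
def rank_row(row):
    """Rank values in ascending order (1 = lowest), averaging ranks over ties."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def friedman(scores):
    """Return (mean rank per column, Friedman chi-square) for a blocks x treatments table."""
    n = len(scores)        # blocks: classification algorithms
    k = len(scores[0])     # treatments: sampling schemes
    ranked = [rank_row(r) for r in scores]
    rank_sums = [sum(r[j] for r in ranked) for j in range(k)]
    mean_ranks = [s / n for s in rank_sums]
    chi2 = 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3.0 * n * (k + 1)
    return mean_ranks, chi2


# Hypothetical ROC-area scores: 4 classifiers (rows) x 3 schemes (columns).
schemes = ["original", "scheme A", "scheme B"]
scores = [
    [0.70, 0.78, 0.85],
    [0.65, 0.80, 0.82],
    [0.72, 0.71, 0.88],
    [0.68, 0.75, 0.90],
]

mean_ranks, chi2 = friedman(scores)
for name, mr in zip(schemes, mean_ranks):
    print(f"{name}: mean rank {mr:.2f}")
print(f"Friedman chi-square = {chi2:.2f} (df = {len(schemes) - 1})")
```

The statistic is compared against the chi-square critical value for k - 1 degrees of freedom (5.991 for df = 2 at the 0.05 level); with so few rows the chi-square approximation is only rough.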
Description: A Thesis in the Department of Computer Science, Submitted to the Faculty of Science, In partial fulfilment of the requirements for the degree of DOCTOR OF PHILOSOPHY of the UNIVERSITY OF IBADAN
Appears in Collections: Scholarly works

Items in UISpace are protected by copyright, with all rights reserved, unless otherwise indicated.