Doctor of Philosophy (PhD)
Current classification approaches usually do not try to achieve a balance between fitting and generalization when they infer models from training data. Such approaches ignore the possibility of different penalty costs for the false-positive, false-negative, and unclassifiable types. Thus, their performances may not be optimal or may even be coincidental. This dissertation analyzes the above issues in depth. It also proposes two new approaches called the Homogeneity-Based Algorithm (HBA) and the Convexity-Based Algorithm (CBA) to address these issues. These new approaches aim at optimally balancing the data fitting and generalization behaviors of models when some traditional classification approaches are used. The approaches first define the total misclassification cost (TC) as a weighted function of the three penalty costs and their corresponding error rates. The approaches then partition the training data into regions. In the HBA, the partitioning is done according to some homogeneous properties derivable from the training data. Meanwhile, the CBA employs some convex properties to derive regions. A traditional classification method is then used in conjunction with the HBA and CBA. Finally, the approaches apply a genetic approach to determine the optimal levels of fitting and generalization. The TC serves as the fitness function in this genetic approach. Real-life datasets from a wide spectrum of domains were used to better understand the effectiveness of the HBA and CBA. The computational results have indicated that both the HBA and CBA might potentially fill a critical gap in the implementation of current or future classification approaches. Furthermore, the results have also shown that when the penalty cost of an error type was changed, the corresponding error rate followed stepwise patterns. The finding of stepwise patterns of classification errors can assist researchers in determining applicable penalties for classification errors. Thus, the dissertation also proposes a binary search approach (BSA) to produce those patterns. Real-life datasets were utilized to demonstrate for the BSA.
Document Availability at the Time of Submission
Secure the entire work for patent and/or proprietary purposes for a period of one year. Student has submitted appropriate documentation which states: During this period the copyright owner also agrees not to exercise her/his ownership rights, including public use in works, without prior authorization from LSU. At the end of the one year period, either we or LSU may request an automatic extension for one additional year. At the end of the one year secure period (or its extension, if such is requested), the work will be released for access worldwide.
Pham, Huy Nguyen Anh, "The Impact of Overfitting and Overgeneralization on the Classification Accuracy in Data Mining" (2010). LSU Doctoral Dissertations. 3335.