Doctor of Philosophy (PhD)
In classification, it has often been observed that datasets differ in how accurately they can be classified. Some datasets can be classified very accurately by many classification algorithms, while others cannot be classified with high accuracy by any classifier. Based on this observation, we address the following questions: a) what factors make a dataset easy or difficult to classify accurately? b) how can such factors be used to predict the difficulty of unclassified datasets? and c) how can such factors be used to improve classification? It turns out that the monotonic features of a dataset, along with some closely related structural properties, play an important role in determining how difficult the dataset is to classify accurately. More importantly, datasets composed of highly monotonic data can usually be classified more accurately than datasets whose data have low monotonicity. By further exploring these monotonicity-based properties, we observed that a dataset can always be decomposed into a family of subsets, each of which is highly monotonic locally. Moreover, this dissertation proposes a methodology that uses the classification models inferred from the smaller but highly monotonic subsets to construct a highly accurate classification model for the original dataset. Two groups of experiments were conducted in this dissertation. The first group was performed to discover the relationships between classification difficulty and the monotonic properties of the data, and to represent those relationships as regression models. These models were later used to predict the classification difficulty of unclassified datasets. The results indicate that in more than 95% of the predictions, the deviation between the predicted value and the actual difficulty is smaller than 2.4%.
The second group of experiments focused on the performance of the proposed meta-learning approach. According to the experimental results, the proposed approach consistently achieves significant improvements in classification.
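To make the notion of a "highly monotonic" dataset concrete, the following Python sketch computes one possible monotonicity score for a numeric dataset with ordered class labels. This abstract does not give the dissertation's exact definition, so the measure used here is an assumption: the fraction of comparable pairs of points (one point less than or equal to the other in every feature) whose labels are ordered the same way. The function names `dominates` and `monotonicity_degree` are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def dominates(a, b):
    """Return True if a <= b in every coordinate."""
    return all(x <= y for x, y in zip(a, b))


def monotonicity_degree(points, labels):
    """Fraction of comparable pairs whose labels respect the feature order.

    Assumed measure, not necessarily the one used in the dissertation.
    Returns 1.0 when no two points are comparable.
    """
    comparable = 0
    consistent = 0
    for (p, lp), (q, lq) in combinations(list(zip(points, labels)), 2):
        p_le_q = dominates(p, q)
        q_le_p = dominates(q, p)
        if p_le_q and q_le_p:
            # Identical points: consistent only if the labels agree.
            comparable += 1
            consistent += int(lp == lq)
        elif p_le_q:
            comparable += 1
            consistent += int(lp <= lq)
        elif q_le_p:
            comparable += 1
            consistent += int(lq <= lp)
    return consistent / comparable if comparable else 1.0


# Toy data: the point (2, 1) carries label 1 while the larger point (2, 2)
# carries label 0, which breaks monotonicity for that one pair.
points = [(1, 1), (2, 2), (3, 3), (2, 1)]
labels = [0, 0, 1, 1]
score = monotonicity_degree(points, labels)  # 5 of 6 comparable pairs agree
```

Under this assumed measure, a score near 1 corresponds to the highly monotonic case the abstract describes as easier to classify. A score like this could also serve as one input feature to a regression model that predicts classification difficulty, though the abstract does not state which features the dissertation actually uses.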
Document Availability at the Time of Submission
Secure the entire work for patent and/or proprietary purposes for a period of one year. Student has submitted appropriate documentation which states: During this period the copyright owner also agrees not to exercise her/his ownership rights, including public use in works, without prior authorization from LSU. At the end of the one year period, either we or LSU may request an automatic extension for one additional year. At the end of the one year secure period (or its extension, if such is requested), the work will be released for access worldwide.
Lin, Di, "Exploring the Learnability of Numeric Datasets" (2013). LSU Doctoral Dissertations. 1836.