Date of Award


Document Type


Degree Name

Doctor of Philosophy (PhD)


Computer Science

First Advisor

Doris L. Carver


Software maintenance is both a technical and an economic concern for organizations. Large software systems are difficult to maintain due to their intrinsic complexity, and their maintenance accounts for between 50% and 90% of their total life-cycle cost. An essential step in maintenance is reverse engineering, which focuses on understanding the system. This understanding is critical to avoid introducing undesired side effects during maintenance. The objective of this research is to investigate the potential of applying data mining to reverse engineering. This research was motivated by the following: (1) data mining can process large volumes of information, (2) data mining can elicit meaningful information without previous knowledge of the domain, (3) data mining can extract novel non-trivial relationships from a data set, and (4) data mining is automatable. These data mining features are used to help address the problem of understanding large legacy systems. This research produced a general method to apply data mining to reverse engineering, and a methodology for design recovery, called Identification of Subsystems based on Associations (ISA). ISA uses association rules mined from a database view of the subject system to guide a clustering process that produces a data-cohesive hierarchical subsystem decomposition of the system. ISA promotes object-oriented principles because each identified subsystem consists of a set of data repositories and the code (i.e., programs) that manipulates them. ISA is an automatic multi-step process that takes the source code of the subject system and multiple parameters as its input. ISA includes two representation models (i.e., text-based and graphic-based representation models) to present the resulting subsystem decomposition. The automated environment RE-ISA implements the ISA methodology. RE-ISA was used to produce the subsystem decomposition of real-world software systems.
Results show that ISA can automatically produce data-cohesive subsystem decompositions without previous knowledge of the subject system, and that ISA always generates the same results if the same parameters are utilized. This research provides evidence that data mining is a beneficial tool for reverse engineering and provides the foundation for defining methodologies that combine data mining and software maintenance.
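
The general idea described above (association rules mined from program-to-data usage, then used to group data repositories and the programs that manipulate them) can be illustrated with a small sketch. This is not the ISA algorithm as defined in the dissertation, and it is not RE-ISA: the program and file names, the `min_support` and `min_confidence` thresholds, the pairwise rule form, and the single-level union-find clustering are all assumptions chosen for illustration, and the sketch produces a flat grouping rather than a hierarchical one.

```python
from collections import Counter
from itertools import combinations

# Hypothetical "database view": each program mapped to the data files it uses.
usage = {
    "billing.cbl": {"CUSTOMER", "INVOICE"},
    "invoice_report.cbl": {"CUSTOMER", "INVOICE"},
    "payroll.cbl": {"EMPLOYEE", "TIMESHEET"},
    "timesheet_entry.cbl": {"EMPLOYEE", "TIMESHEET"},
    "audit.cbl": {"INVOICE", "TIMESHEET", "AUDITLOG"},
}


def mine_pair_rules(usage, min_support=2, min_confidence=0.6):
    """Return sorted rules (x, y, confidence) meaning 'programs using x also use y'."""
    item_count = Counter()
    pair_count = Counter()
    for files in usage.values():
        item_count.update(files)
        pair_count.update(combinations(sorted(files), 2))
    rules = []
    for (a, b), support in pair_count.items():
        if support < min_support:  # support = number of programs using both files
            continue
        for x, y in ((a, b), (b, a)):
            confidence = support / item_count[x]
            if confidence >= min_confidence:
                rules.append((x, y, confidence))
    return sorted(rules)


def cluster_files(usage, rules):
    """Merge files linked by any rule (union-find); each cluster is a candidate subsystem."""
    parent = {f: f for files in usage.values() for f in files}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, _ in rules:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # fixed tie-break keeps output repeatable
    clusters = {}
    for f in parent:
        clusters.setdefault(find(f), set()).add(f)
    return sorted(sorted(c) for c in clusters.values())


def assign_programs(usage, clusters):
    """Attach each program to the cluster containing most of the files it uses."""
    subsystems = {tuple(c): [] for c in clusters}
    for program in sorted(usage):
        # max() returns the first best cluster, so ties resolve in sorted order.
        best = max(clusters, key=lambda c: len(usage[program] & set(c)))
        subsystems[tuple(best)].append(program)
    return subsystems


if __name__ == "__main__":
    rules = mine_pair_rules(usage)
    clusters = cluster_files(usage, rules)
    for files, programs in assign_programs(usage, clusters).items():
        print(files, "<-", programs)
```

Because every step sorts its inputs and breaks ties in a fixed order, running the sketch twice with the same thresholds yields the same grouping, which mirrors the repeatability property reported in the results.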