Doctor of Philosophy (PhD)


Education Leadership & Research

Document Type



The United States depends on scientific and technological progress to build its economy and improve the standard of living. However, persistence to degree completion among students who pursue a major in science, technology, engineering, and mathematics (STEM) continues to be an issue for postsecondary institutions. In recent years, new techniques, machine learning specifically, have allowed institutions to analyze student data to improve admission and enrollment projections, identify academic and social patterns among variables and college student populations, and, as a result, more accurately predict retention and completion rates.

Utilizing Astin’s (1965) Input-Environment-Outcome conceptual model and Wang’s (2015) “STEM momentum” framework, the purpose of this study was to investigate the factors that determine degree attainment of students who selected a STEM major during the application process at a public university. In addition, the study compared six supervised machine learning techniques on their accuracy in classifying student completion and on their performance in three other evaluation metrics. The factors investigated included high school academics, family financial background, demographics, and college academic, financial, and social factors. A logistic regression analysis determined that high school GPA, ACT Math score, ACT English score, gender, first-year retention in a STEM major, completion of a Calculus or higher Math course, second academic year GPA, and second-year retention in a STEM major were statistically significant in predicting completion of a STEM degree. The three most important factors that determined degree completion were total credit hours earned after the second academic year, ACT Math score, and total credit hours earned during the first academic year.

Six supervised machine learning models (Logistic Regression, Random Forest, XGBoost, AdaBoost, Support Vector Machine, and Decision Tree) were created to measure and compare their performance on four different metrics in classifying completion of a STEM degree. The dataset was split into a 70% training set and a 30% test set, and 10-fold cross-validation was performed on the training set. The six models’ hyperparameters were optimized with GridSearchCV utilizing 200 iterations on the partitioned test data set, and the average mean squared error of the iterations was calculated. The results from the test set revealed that the AdaBoost model had the highest accuracy (91.37%), the Support Vector Machine model had the highest F1-score (100%), the Logistic Regression model had the highest precision (88.72%), and the Support Vector Machine model had the highest recall (100%). The study concludes with implications for practice and recommendations for future research on STEM majors’ retention and persistence and on the application of artificial intelligence to the higher education field.
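For readers unfamiliar with the evaluation setup described above, the following is a minimal sketch of a 70%/30% split, a 10-fold partition of the training indices, and the four reported metrics (accuracy, precision, recall, F1-score). It uses only the Python standard library and is not the author's code. The function names and the toy labels are hypothetical and chosen purely for illustration, with 1 assumed to mean "completed a STEM degree" and 0 "did not complete".

```python
import random


def train_test_split_indices(n, test_fraction=0.30, seed=0):
    """Shuffle row indices 0..n-1 and return (train_indices, test_indices)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(indices, k=10):
    """Partition a list of indices into k folds of nearly equal size."""
    return [indices[i::k] for i in range(k)]


def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical toy labels, NOT data from the study.
toy_true = [1, 1, 1, 0, 0, 1, 0, 1]
toy_pred = [1, 1, 0, 0, 1, 1, 0, 1]

train_idx, test_idx = train_test_split_indices(100)
folds = k_fold_indices(train_idx, k=10)
print(len(train_idx), len(test_idx), len(folds))
print(classification_metrics(toy_true, toy_pred))
```

In each cross-validation round, one fold would serve as the validation set and the remaining nine as the fitting set. Because F1 is the harmonic mean of precision and recall, it rewards models that balance the two rather than maximizing only one.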



Committee Chair

Kennedy, Eugene

Available for download on Wednesday, July 10, 2030