Semester of Graduation
Spring 2025
Degree
Master of Science (MS)
Department
Division of Computer Science & Engineering
Document Type
Thesis
Abstract
Reverse engineering is a cybersecurity process that focuses on understanding the underlying functionality of software or malware. This is an arduous process that demands large amounts of time and effort from cybersecurity practitioners. Large Language Models (LLMs) offer a potential solution to this problem. LLMs have worked their way into various fields of cybersecurity in recent years, including incident response and malware classification. However, LLMs have historically struggled with low-level code comprehension, a necessary part of reverse engineering. While LLMs can generate code and explain its function at the surface level, they struggle to grasp the wider context. In this thesis, we utilize parameter-efficient fine-tuning to train LLMs to generate contextual comments for x86 assembly code in an effort to expedite reverse engineering. We select LLMs from several parameter classes within each of the Qwen2.5-Coder, CodeLlama, and CodeGemma families and fine-tune them on a dataset of x86 assembly code. We evaluate each model's performance on cross-entropy loss and cosine similarity before and after fine-tuning. We observe promising results, with a significant boost in similarity score for five out of the seven LLMs selected. This is particularly evident in the 0.11 increase in similarity score for Qwen2.5-Coder-7B and the 0.18 increase in similarity score for CodeLlama-7B after fine-tuning.
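To make the workflow concrete, the sketch below illustrates one common way parameter-efficient fine-tuning and the cosine-similarity evaluation described above could be set up, assuming a LoRA-style adapter via Hugging Face's peft library and sentence embeddings for the similarity score. The model identifier, hyperparameters, embedding model, and example comments are illustrative assumptions, not the configuration used in the thesis.

# Hedged sketch: LoRA-style parameter-efficient fine-tuning of a code LLM to
# emit comments for x86 assembly, plus a cosine-similarity check of a generated
# comment against a reference. All names and hyperparameters are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
from sentence_transformers import SentenceTransformer, util

base_id = "Qwen/Qwen2.5-Coder-7B"            # one of the evaluated model families
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# LoRA adapters train a small fraction of the weights instead of the full model.
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],      # attention projections only (assumed)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()            # sanity check: well under 1% trainable

# ... fine-tune with a standard causal-LM (cross-entropy) objective on pairs of
# x86 assembly snippets and their contextual comments, e.g. via transformers.Trainer ...

# Evaluation: embed the generated and reference comments and compare them.
embedder = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative embedding model
generated = "Saves callee registers and sets up the stack frame for the loop."
reference = "Function prologue: preserves registers and allocates stack space."
score = util.cos_sim(embedder.encode(generated), embedder.encode(reference)).item()
print(f"cosine similarity: {score:.2f}")

In practice the base model would be swapped for each of the seven evaluated LLMs, and the similarity score would be averaged over a held-out set of assembly/comment pairs rather than a single example.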
Date
4-3-2025
Recommended Citation
Lea, Darrin Michael, "Optimizing LLM x86 Assembly Code Comprehension through Fine-Tuning" (2025). LSU Master's Theses. 6140.
https://repository.lsu.edu/gradschool_theses/6140
Committee Chair
Dr. James M Ghawaly