Towards better semi-supervised classification of malicious software

Document Type

Conference Proceeding

Publication Date

3-4-2015

Abstract

Because of the sheer volume of malicious software (malware) and its wide variety, automated detection and analysis using machine learning techniques have become increasingly important for network and computer security. A common scenario in these security applications is that training examples are scarce but unlabeled data are abundant. Semi-supervised learning, in which both labeled and unlabeled data are used to learn a good model quickly, is a natural choice under such conditions. We investigate semi-supervised classification for malware categorization. We observe that malware data have specific characteristics and are noisy, so off-the-shelf semi-supervised learning may not work well on them. We propose a semi-supervised approach that addresses these problems with malware data and provides better classification. We conducted a set of experiments to test our method and compare it with others. The experimental results show that semi-supervised classification is a promising direction for malware classification: our method achieved more than 90% accuracy when only a small number of training examples were available. The results also indicate that modifications are needed to make semi-supervised learning work with malware data. Without them, semi-supervised classification may perform worse than classifiers trained on the labeled data alone.

Publication Source (Journal or Book title)

IWSPA 2015 - Proceedings of the 2015 ACM International Workshop on Security and Privacy Analytics, Co-located with CODASPY 2015

First Page

27

Last Page

33

