EfficientAudioNet: Enhancing Environmental Sound Classification through Data Fusion of Multiple Audio Representations

Document Type

Conference Proceeding

Publication Date

1-1-2025

Abstract

Environmental Sound Classification (ESC) is becoming increasingly important in scenarios such as smart cities, autonomous systems, safety, and industrial monitoring. Traditional methods for ESC mainly rely on features extracted from a single representation, usually spectrograms or MFCCs. Although deep learning-based CNN models have demonstrated excellent performance, they remain limited by this reliance on a single feature representation. To address this limitation, this work adopts a multi-representation strategy that fuses five kinds of audio features: spectrograms, phasograms, scalograms, wavelet phasograms, and MFCC-grams. Each representation captures different properties of the audio. The representations are combined in a structured manner through three fusion strategies (early, intermediate, and late fusion) using a novel EfficientNet-based model named EfficientAudioNet. The proposed strategies are evaluated on four benchmark datasets: a Construction Site machinery sounds dataset, the ESC-10 and ESC-50 environmental sound datasets, and the UrbanSound8K dataset. Experimental results demonstrate that multi-representation fusion, especially early fusion, significantly enhances classification performance. Overall, the proposed approach surpasses state-of-the-art accuracy on all tested datasets.
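
The abstract does not say how the three fusion strategies are implemented. As a rough illustration only, the sketch below uses their common textbook meanings: early fusion stacks the representations as input channels, intermediate fusion concatenates per-representation feature vectors, and late fusion combines per-representation class probabilities. The function names, the plain-list data layout, and the choice of averaging for late fusion are all assumptions made for this example. They are not taken from the paper and do not describe EfficientAudioNet itself.

```python
from typing import List, Sequence

Matrix = List[List[float]]  # one 2-D representation, e.g. a spectrogram (H x W)


def early_fusion(representations: Sequence[Matrix]) -> List[Matrix]:
    """Stack equally sized representations as channels: result is (C, H, W).

    Assumption: all representations were already resized to one shape.
    """
    height, width = len(representations[0]), len(representations[0][0])
    for rep in representations:
        if len(rep) != height or any(len(row) != width for row in rep):
            raise ValueError("all representations must share one shape")
    # Copy rows so the fused input does not alias the caller's data.
    return [[list(row) for row in rep] for rep in representations]


def intermediate_fusion(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate feature vectors produced by per-representation encoders."""
    return [value for emb in embeddings for value in emb]


def late_fusion(probabilities: Sequence[Sequence[float]]) -> List[float]:
    """Average class probabilities predicted separately per representation.

    Averaging is an assumption; voting or learned weights are alternatives.
    """
    count = len(probabilities)
    num_classes = len(probabilities[0])
    return [sum(p[k] for p in probabilities) / count for k in range(num_classes)]
```

In this sketch, five 2-D representations of the same clip would give a five-channel input under `early_fusion`, whereas `late_fusion` would need five separate classifier outputs before combining them.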

Publication Source (Journal or Book title)

Proceedings of the International Joint Conference on Neural Networks

