Multimodal action recognition for manufacturing assembly task through spatio-temporal knowledge fusion
Document Type
Article
Publication Date
8-1-2026
Abstract
This paper introduces a novel multimodal action recognition framework designed to address the complexities of human activity recognition in manufacturing assembly tasks through spatio-temporal knowledge fusion. Traditional unimodal and naive multimodal fusion methods often fail to capture the intricate dependencies across space, time, and modality, especially in real-world industrial environments where actions are subtle, repetitive, and context-dependent. To overcome these challenges, we propose a unified architecture that incorporates: (i) a Multi-Stage Hierarchical Reconnection module for robust spatial and temporal feature disentanglement and reintegration; (ii) a spatio-temporal regularization technique, Optimized-MixUp (OMU), that jointly augments data along spatial and temporal axes to improve generalization; and (iii) a Cross-Modal Auxiliary Feature Learning component to enhance late fusion by exploiting modality-specific complementary information. Extensive experiments on four benchmark datasets, NTU RGB+D, NTU RGB+D 120, HA4M, and Northwestern-UCLA, demonstrate that our method outperforms recent state-of-the-art approaches, achieving top-1 accuracies of 98.4%, 93.5%, 92.0%, and 97.3%, respectively. These results confirm the framework’s robustness, scalability, and suitability for high-precision human activity understanding in manufacturing environments. The proposed method advances the field of information fusion by offering a principled approach to integrating heterogeneous spatio-temporal data for real-world action recognition tasks.
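For readers unfamiliar with MixUp-style regularization, the following is a minimal sketch of the standard MixUp idea applied to a clip stored as a list of per-frame feature vectors. It is not the Optimized-MixUp (OMU) technique named in the abstract, whose details are not given here; the function names `mixup_clips` and `sample_lambda`, the frames-by-features layout, and the default `alpha` value are all assumptions made for illustration.

```python
import random


def sample_lambda(alpha=0.2, rng=random):
    """Draw a mixing weight from Beta(alpha, alpha), as in standard MixUp.

    The default alpha is an illustrative assumption, not a value from the paper.
    """
    return rng.betavariate(alpha, alpha)


def mixup_clips(clip_a, clip_b, label_a, label_b, lam):
    """Blend two equally shaped clips and their one-hot labels.

    Each clip is a list of frames; each frame is a list of floats
    (for example, flattened joint coordinates). This hypothetical helper
    applies one weight to every frame and feature, so it is plain MixUp,
    not the paper's OMU.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(clip_a) != len(clip_b):
        raise ValueError("clips must have the same number of frames")
    # Convex combination of every feature in every frame.
    mixed_clip = [
        [lam * a + (1.0 - lam) * b for a, b in zip(frame_a, frame_b)]
        for frame_a, frame_b in zip(clip_a, clip_b)
    ]
    # The labels are blended with the same weight as the inputs.
    mixed_label = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed_clip, mixed_label
```

In a training loop, one would typically call `sample_lambda()` once per pair of clips and train on the blended clip with the blended label as a soft target.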
Publication Source (Journal or Book title)
Information Fusion
Recommended Citation
Bonyani, M., Soleymani, M., & Wang, C. (2026). Multimodal action recognition for manufacturing assembly task through spatio-temporal knowledge fusion. Information Fusion, 132. https://doi.org/10.1016/j.inffus.2026.104225