Multimodal action recognition for manufacturing assembly task through spatio-temporal knowledge fusion

Document Type

Article

Publication Date

8-1-2026

Abstract

This paper introduces a novel multimodal action recognition framework designed to address the complexities of human activity recognition in manufacturing assembly tasks through spatio-temporal knowledge fusion. Traditional unimodal and naive multimodal fusion methods often fail to capture the intricate dependencies across space, time, and modality, especially in real-world industrial environments where actions are subtle, repetitive, and context-dependent. To overcome these challenges, we propose a unified architecture that incorporates: (i) a Multi-Stage Hierarchical Reconnection module for robust spatial and temporal feature disentanglement and reintegration; (ii) a spatio-temporal regularization technique, Optimized-MixUp (OMU), that jointly augments data along spatial and temporal axes to improve generalization; and (iii) a Cross-Modal Auxiliary Feature Learning component to enhance late fusion by exploiting modality-specific complementary information. Extensive experiments conducted on four benchmark datasets, NTU RGB+D, NTU RGB+D 120, HA4M, and Northwestern-UCLA, demonstrate that our method outperforms recent state-of-the-art approaches, achieving top-1 accuracy of 98.4%, 93.5%, 92.0%, and 97.3%, respectively. These results confirm the framework’s robustness, scalability, and suitability for high-precision human activity understanding in manufacturing environments. The proposed method advances the field of information fusion by offering a principled approach to integrating heterogeneous spatio-temporal data for real-world action recognition tasks.
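
For readers unfamiliar with the term, the late fusion mentioned in the abstract can be illustrated with a minimal sketch of plain score-level fusion: each modality produces its own class scores, and the per-modality probabilities are combined by a weighted average. This is a generic baseline, not the paper's Cross-Modal Auxiliary Feature Learning component, whose details the abstract does not give. The function names, the modality names (`skeleton`, `rgb`), the weights, and the example scores are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(logits_by_modality, weights=None):
    """Weighted average of per-modality class probabilities.

    Generic score-level late fusion; an illustrative assumption,
    not the method proposed in the paper.
    """
    names = list(logits_by_modality)
    if weights is None:
        # Default: every modality contributes equally.
        weights = {name: 1.0 for name in names}
    norm = sum(weights[name] for name in names)
    num_classes = len(logits_by_modality[names[0]])
    fused = [0.0] * num_classes
    for name in names:
        probs = softmax(logits_by_modality[name])
        for c, p in enumerate(probs):
            fused[c] += weights[name] / norm * p
    return fused


def predict(logits_by_modality, weights=None):
    """Return the index of the class with the highest fused probability."""
    fused = late_fusion(logits_by_modality, weights)
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical scores for three assembly actions from two modalities.
example = {
    "skeleton": [2.0, 1.0, 0.1],
    "rgb": [0.5, 2.5, 0.3],
}
```

With equal weights, the two modalities in `example` disagree and the more confident RGB stream decides the fused prediction; raising the skeleton weight shifts the decision toward the skeleton stream's preferred class.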

Publication Source (Journal or Book title)

Information Fusion
