Semester of Graduation
Fall 2024
Degree
Master of Science (MS)
Department
Division of Computer Science & Engineering
Document Type
Thesis
Abstract
Wearable exoskeletons offer significant potential for enhancing human mobility in industrial environments. However, their adaptability to dynamic, task-intensive settings presents challenges, especially in accurately predicting locomotion modes such as ladder climbing, stair navigation, low-space movement, and obstacle navigation. This research proposes a multimodal framework that integrates visual data and speech commands to improve locomotion mode prediction in unpredictable environments. Multimodal data were collected using smart glasses, capturing both the user's field-of-view (FOV) and voice during locomotion tasks. State-of-the-art models (CLIP, ImageBind, and GPT-4o) processed these visual and linguistic inputs to predict locomotion activities. The models were evaluated under zero-shot and fine-tuned conditions, with preprocessing steps aligning voice commands to FOV frames. Class imbalances were addressed through data generation and augmentation techniques. Results show that fine-tuned models significantly improve prediction accuracy, especially when integrating visual and textual modalities. The CLIP model achieved an F1-score of 90.05% when fine-tuned on image-text data, while GPT-4o reached 87.87% in zero-shot reasoning tasks using chain-of-thought prompting. ImageBind performed well with image-text fusion, though audio integration produced mixed outcomes. This research demonstrates that multimodal approaches, especially the combination of vision and language, can substantially enhance locomotion prediction in complex industrial environments.
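To illustrate the zero-shot image-text classification setup described in the abstract, the following is a minimal sketch assuming the Hugging Face transformers implementation of CLIP; the checkpoint name, prompt wording, and image path are illustrative assumptions, not the thesis's exact configuration.

# Minimal sketch: zero-shot locomotion-mode prediction from one FOV frame with CLIP.
# Assumes the Hugging Face `transformers` library; checkpoint, prompts, and the
# file path "fov_frame.jpg" are hypothetical placeholders.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate locomotion modes expressed as natural-language prompts.
modes = ["ladder climbing", "stair navigation", "low-space movement", "obstacle navigation"]
prompts = [f"a first-person view of {m}" for m in modes]

# A single field-of-view frame captured by the smart glasses.
frame = Image.open("fov_frame.jpg")

inputs = processor(text=prompts, images=frame, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Image-text similarity scores, converted to a probability over the candidate modes.
probs = outputs.logits_per_image.softmax(dim=-1)
predicted_mode = modes[probs.argmax(dim=-1).item()]
print(predicted_mode, probs.tolist())

In the thesis workflow this zero-shot scoring serves as a baseline; fine-tuning CLIP on aligned image-text pairs, as reported in the abstract, is what yields the 90.05% F1-score.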
Date
11-21-2024
Recommended Citation
Ahmadi, Ehsan, "Vision-Language Integration for Enhanced Locomotion Mode Prediction" (2024). LSU Master's Theses. 6065.
https://repository.lsu.edu/gradschool_theses/6065
Committee Chair
Jasim, Mahmood