Semester of Graduation

Fall 2024

Degree

Master of Science (MS)

Department

Division of Computer Science & Engineering

Document Type

Thesis

Abstract

Wearable exoskeletons offer significant potential for enhancing human mobility in industrial environments. However, their adaptability to dynamic, task-intensive settings remains challenging, particularly in accurately predicting locomotion modes such as ladder climbing, stair navigation, low-space movement, and obstacle navigation. This research proposes a multimodal framework that integrates visual data and speech commands to improve locomotion mode prediction in unpredictable environments. Multimodal data were collected using smart glasses, capturing both the user's perspective (field of view, FOV) and voice during locomotion tasks. State-of-the-art models (CLIP, ImageBind, and GPT-4o) processed these visual and linguistic inputs to predict locomotion activities. The models were evaluated under zero-shot and fine-tuned conditions, with preprocessing steps aligning voice commands to FOV frames. Class imbalances were addressed through data generation and augmentation techniques. Results show that fine-tuned models significantly improve prediction accuracy, especially when integrating visual and textual modalities. The CLIP model achieved an F1-score of 90.05% when fine-tuned on image-text data, while GPT-4o reached 87.87% in zero-shot reasoning tasks using chain-of-thought prompting. ImageBind performed well with image-text fusion, though audio integration produced mixed outcomes. This research demonstrates that multimodal approaches, especially the combination of vision and language, can substantially enhance locomotion prediction in complex industrial environments.
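
As a rough illustration of the zero-shot setting described in the abstract, the sketch below scores a single FOV frame against text prompts for the four named locomotion modes using an open-source CLIP checkpoint. The checkpoint name, label phrasings, and prompt template are illustrative assumptions, not the thesis's actual pipeline; the fine-tuned condition would instead update the encoders on paired FOV frames and transcribed voice commands.

from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Locomotion modes taken from the abstract; exact label wording is assumed.
LOCOMOTION_MODES = [
    "climbing a ladder",
    "navigating stairs",
    "moving through a low space",
    "navigating around an obstacle",
]

# Hypothetical checkpoint choice for a zero-shot baseline.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def predict_mode(fov_frame_path: str) -> str:
    """Zero-shot prediction of the locomotion mode for one FOV frame."""
    image = Image.open(fov_frame_path)
    # Assumed prompt template describing the first-person (FOV) viewpoint.
    prompts = [f"a first-person view of a worker {m}" for m in LOCOMOTION_MODES]
    inputs = processor(text=prompts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        # logits_per_image holds image-text similarity scores, one per prompt.
        logits = model(**inputs).logits_per_image
    probs = logits.softmax(dim=-1).squeeze(0)
    return LOCOMOTION_MODES[int(probs.argmax())]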

Date

11-21-2024

Committee Chair

Jasim, Mahmood
