Degree

Doctor of Philosophy (PhD)

Department

Computer Science and Engineering

Document Type

Dissertation

Abstract

Social media platforms like X (formerly Twitter) serve as rich sources of textual and visual information, making multimodal sentiment analysis (MSA) essential for understanding complex human emotions. This dissertation aims to advance MSA by improving the semantic alignment and fusion of textual and visual features, thereby enabling more accurate and context-aware sentiment interpretation of social media content.

To address challenges in multimodal integration, this work proposes two complementary MSA approaches. The first introduces a similarity-based multi-layer attention neural network (SiMANN) that enhances modality integration through cosine-based similarity fusion and modality-specific attention, emphasizing salient features in text and images. Although effective, this approach revealed limitations in fine-grained alignment, heterogeneous feature embeddings, and cross-modal interaction modeling. To overcome these constraints, the dissertation presents SentiGAT, a graph attention network (GAT)-based framework that employs a unified CLIP-based feature extractor to encode both modalities into a shared semantic space. It further incorporates a GAT-based word–object alignment module that strengthens fine-grained semantic correspondence, and a GAT-based fusion module that learns content-aware inter-modal dependencies for sentiment prediction.
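
As a concrete illustration of the architecture described above, the following is a minimal sketch of a CLIP-plus-GAT multimodal pipeline: CLIP-style embeddings for text tokens and image regions are placed on a shared graph, two GAT layers play the roles of word–object alignment and cross-modal fusion, and a linear head predicts sentiment. The layer sizes, bipartite graph construction, class count, and the name GATFusionClassifier are assumptions made for illustration, not the dissertation's implementation.

    # Hedged sketch of a CLIP + GAT multimodal sentiment pipeline
    # (layer sizes, graph construction, and class count are illustrative assumptions).
    import torch
    import torch.nn as nn
    from torch_geometric.nn import GATConv

    class GATFusionClassifier(nn.Module):
        def __init__(self, dim=512, heads=4, num_classes=3):
            super().__init__()
            # Word-object alignment: attention over a bipartite text-image graph.
            self.align = GATConv(dim, dim // heads, heads=heads)
            # Fusion: learns content-aware inter-modal dependencies.
            self.fuse = GATConv(dim, dim // heads, heads=heads)
            self.classifier = nn.Linear(dim, num_classes)

        def forward(self, node_feats, edge_index):
            # node_feats: (num_text_tokens + num_image_regions, dim) CLIP embeddings
            # edge_index: (2, num_edges) cross-modal edges
            h = torch.relu(self.align(node_feats, edge_index))
            h = torch.relu(self.fuse(h, edge_index))
            # Mean-pool node representations into a single post-level embedding.
            return self.classifier(h.mean(dim=0, keepdim=True))

    # Toy example: 6 text tokens and 4 image regions, connected bipartitely.
    text_nodes, image_nodes, dim = 6, 4, 512
    feats = torch.randn(text_nodes + image_nodes, dim)  # stand-in for CLIP features
    src = torch.arange(text_nodes).repeat_interleave(image_nodes)
    dst = torch.arange(text_nodes, text_nodes + image_nodes).repeat(text_nodes)
    edge_index = torch.cat([torch.stack([src, dst]), torch.stack([dst, src])], dim=1)

    logits = GATFusionClassifier(dim=dim)(feats, edge_index)  # (1, num_classes) sentiment logits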

The proposed methods were evaluated on the MVSA-Single and MVSA-Multiple benchmark datasets. SentiGAT demonstrated superior performance compared to state-of-the-art models and large language model-based approaches, yielding notable improvements in both accuracy and F1 score. Consistent gains across 10-fold cross-validation confirmed the robustness and generalizability of the approach. Further analysis showed that SentiGAT effectively captures nuanced cross-modal relationships, enhances semantic grounding through word–object alignment, and improves interpretability by highlighting modality-level contributions via attention mechanisms.
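
For concreteness, the following is a minimal sketch of the kind of 10-fold cross-validation protocol mentioned above, reporting mean accuracy and macro-F1 across folds. The data, labels, and the evaluate_fold helper are placeholders assumed for illustration; this does not reproduce the dissertation's experiments or results.

    # Hedged sketch of 10-fold cross-validation with accuracy and macro-F1
    # (dataset and model are placeholders, not the dissertation's actual pipeline).
    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import accuracy_score, f1_score

    def evaluate_fold(train_idx, test_idx, X, y):
        # Placeholder: a real run would train the model on train_idx and predict on test_idx.
        # Here we simply predict the majority training class for illustration.
        majority = np.bincount(y[train_idx]).argmax()
        return np.full(len(test_idx), majority)

    X = np.random.randn(200, 512)          # stand-in multimodal features
    y = np.random.randint(0, 3, size=200)  # stand-in sentiment labels

    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    accs, f1s = [], []
    for train_idx, test_idx in skf.split(X, y):
        preds = evaluate_fold(train_idx, test_idx, X, y)
        accs.append(accuracy_score(y[test_idx], preds))
        f1s.append(f1_score(y[test_idx], preds, average="macro"))

    print(f"accuracy: {np.mean(accs):.3f} ± {np.std(accs):.3f}")
    print(f"macro-F1: {np.mean(f1s):.3f} ± {np.std(f1s):.3f}")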

This dissertation presents a progressive advancement in multimodal sentiment analysis, starting from real-world observations on social media and leading to a graph-based framework for improved alignment and fusion. By addressing key limitations in how textual and visual features interact, the proposed methods enable more accurate and context-aware sentiment interpretation. The findings provide a strong basis for future research in multimodal learning, affective computing, and social media analytics.

Date

12-2-2025

Committee Chair

Lee, Kisung

