Semester of Graduation

Fall 2025

Degree

Master of Science in Computer Science (MSCS)

Department

Computer Science

Document Type

Thesis

Abstract

Autonomous robots are increasingly deployed on construction sites for tasks such as progress monitoring, inspection, and safety assessment. For these robots to operate effectively, they must perceive and interpret complex, dynamic environments populated by workers, machinery, and unstructured terrain. Achieving reliable perception depends on high-performing semantic segmentation models trained on large volumes of annotated data, an expensive and logistically challenging requirement in construction due to privacy restrictions, variable site access, and slow digitalization. This research addresses the challenge of limited labeled data by investigating transfer learning as a label-efficient approach to construction-site segmentation. Specifically, it explores whether road construction imagery, which is abundant and publicly available, can serve as a domain-adjacent pretraining source for building-site vision models. Two architectures representing distinct design paradigms were evaluated: the convolutional network DeepLabv3+ (ResNet-50 backbone with atrous spatial pyramid pooling) and the transformer-based SegFormer (MiT-B0 backbone). Each model was pretrained separately on two road-scene datasets, ROADWork (construction-specific) and Cityscapes (urban driving), and subsequently fine-tuned on a 5,550-image building-site dataset collected with a Boston Dynamics Spot robot equipped with RGB and LiDAR sensors. Experiments were conducted under annotation budgets ranging from 20 to 1,000 labeled images to simulate real-world data scarcity. Results show that performance improves steeply up to approximately 420 images and then gradually saturates near 600. Domain-aligned pretraining on ROADWork consistently outperformed Cityscapes across all budgets, with the advantage most pronounced for safety-critical classes such as Workers and Equipment. At full data scale, SegFormer–ROADWork achieved 0.65 mIoU, slightly surpassing DeepLab–ROADWork (0.64 mIoU) and significantly outperforming all Cityscapes-initialized counterparts. The study demonstrates that domain-adjacent pretraining substantially enhances segmentation accuracy under limited supervision, especially for transformer-based models that rely less on inductive biases. These findings provide practical guidance for robotic perception in construction: select pretraining sources that closely resemble the target environment, allocate labeling resources up to roughly 600 images for maximal efficiency, and prefer lightweight transformer architectures when domain alignment is feasible.
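
The evaluation protocol summarized above (limited annotation budgets scored by mean intersection-over-union) can be illustrated with a minimal sketch. The class count, function names, and the random stand-in predictions below are illustrative assumptions for exposition only, not details taken from the thesis.

```python
import numpy as np

NUM_CLASSES = 8  # hypothetical number of building-site classes; not specified in the abstract

def update_confusion(conf, pred, target, ignore_index=255):
    """Accumulate a NUM_CLASSES x NUM_CLASSES confusion matrix (rows = ground truth, cols = prediction)."""
    mask = target != ignore_index
    idx = NUM_CLASSES * target[mask].astype(np.int64) + pred[mask].astype(np.int64)
    conf += np.bincount(idx, minlength=NUM_CLASSES ** 2).reshape(NUM_CLASSES, NUM_CLASSES)
    return conf

def mean_iou(conf):
    """Per-class IoU = TP / (TP + FP + FN); mIoU averages over classes that occur in the data."""
    tp = np.diag(conf).astype(np.float64)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    denom = tp + fp + fn
    iou = np.where(denom > 0, tp / np.maximum(denom, 1.0), np.nan)
    return np.nanmean(iou), iou

def sample_budget(image_ids, budget, seed=0):
    """Simulate a limited annotation budget by drawing `budget` training images without replacement."""
    rng = np.random.default_rng(seed)
    return rng.choice(image_ids, size=budget, replace=False)

if __name__ == "__main__":
    # Stand-in evaluation loop: random masks replace real model predictions and annotations.
    rng = np.random.default_rng(0)
    conf = np.zeros((NUM_CLASSES, NUM_CLASSES), dtype=np.int64)
    for _ in range(5):  # pretend validation set of 5 images
        target = rng.integers(0, NUM_CLASSES, size=(512, 512))
        pred = rng.integers(0, NUM_CLASSES, size=(512, 512))
        conf = update_confusion(conf, pred, target)
    miou, per_class = mean_iou(conf)
    print(f"mIoU = {miou:.3f}")                      # roughly 1/NUM_CLASSES for random predictions
    print("per-class IoU:", np.round(per_class, 3))
    # Drawing a 420-image budget from a 5,550-image pool mirrors the data-scarcity setup described above.
    print("budget sample size:", len(sample_budget(np.arange(5550), 420)))
```

In an actual experiment, the random masks would be replaced by the fine-tuned model's predictions and the corresponding ground-truth annotations, and the confusion matrix would be accumulated over the full validation split before computing mIoU.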

Date

11-2-2025

Committee Chair

Fronchetti, Felipe
