Available at: https://digitalcommons.calpoly.edu/theses/3113
Date of Award
7-2025
Degree Name
MS in Electrical Engineering
Department/Program
Electrical Engineering
College
College of Engineering
Advisor
Jane Zhang
Advisor Department
Electrical Engineering
Advisor College
College of Engineering
Abstract
Vehicular collisions represent a significant public health concern, necessitating research into advanced emergency notification systems. While deep learning has shown promise in accident detection, a research gap persists in applying state-of-the-art transformer architectures to the task of anticipatory, real-time crash prediction from video. This thesis addresses this gap by developing and evaluating a Video Vision Transformer (ViViT) for the binary classification of imminent vehicular collisions. Utilizing a curated dataset of 1,493 unique collision sequences, this study systematically investigates the impact of temporal context by comparing the ViViT against a single-frame Vision Transformer (ViT) baseline and conducting comprehensive experiments on temporal hyperparameters like tubelet depth and frame stride. The results compellingly demonstrate that leveraging temporal context yields substantial performance improvements, with the optimal ViViT model achieving 98.72% accuracy and, most critically, a recall of 98.75%, a 7.63 percentage point improvement over the baseline. The findings validate the efficacy of pure attention-based models for this safety-critical application and establish a strong methodological foundation for developing intelligent transportation systems capable of reducing emergency response times and saving lives.