Available at: https://digitalcommons.calpoly.edu/theses/3303
Date of Award
6-2026
Degree Name
MS in Computer Science
Department/Program
Computer Science
College
College of Engineering
Advisor
Paul Anderson
Advisor Department
Computer Science
Advisor College
College of Engineering
Abstract
Vision Transformers (ViTs) have demonstrated strong performance on image classification benchmarks; however, the quadratic complexity of the self-attention mechanism with respect to sequence length limits their scalability to higher-resolution inputs. The attention score matrix grows as O(N²) in both compute and memory, where N is the number of patch tokens, making ViTs computationally expensive and memory intensive for applications that require real-time inference or operate under resource constraints.
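To make the quadratic growth concrete, the short sketch below counts the entries in one head's attention score matrix for a square input image. The patch size of 16 and the 224-pixel comparison point are illustrative assumptions and are not stated in this abstract; only the 384 × 384 resolution appears in the reported results.

```python
def num_patch_tokens(image_size: int, patch_size: int = 16) -> int:
    """Number of patch tokens N for a square image (class token ignored).

    patch_size=16 is an assumed value for illustration only.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def score_matrix_entries(num_tokens: int) -> int:
    """Entries in one head's N x N attention score matrix."""
    return num_tokens * num_tokens


if __name__ == "__main__":
    for size in (224, 384):  # 224 is an assumed comparison resolution
        n = num_patch_tokens(size)
        print(f"{size}x{size}: N = {n}, score entries = {score_matrix_entries(n)}")
```

Under these assumptions, moving from 224 to 384 pixels per side roughly triples N (196 to 576 tokens) but multiplies the number of score entries by about 8.6.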
This thesis investigates whether the key and value sequences of the self-attention mechanism can be compressed using the local spatial structure of the image, while keeping queries at full sequence length, in order to reduce this quadratic cost without significant degradation in classification accuracy. Two attention variants are designed and evaluated against a standard Multi-Head Attention (MHA) baseline: L-SRA, which groups consecutive patch tokens into non-overlapping blocks and compresses each block into a single token via content-weighted softmax pooling, and C-SRA, which applies a depthwise 3×3 strided convolution to a low-rank latent representation of the 2D patch grid to produce a spatially subsampled K/V sequence. All three models share the same depth, embedding dimension, patch size, and training recipe, isolating the attention mechanism as the sole experimental variable.
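The block-pooling idea behind L-SRA can be illustrated with a minimal NumPy sketch. It is not the thesis implementation: the scoring vector `score_weights`, the function names, and the reuse of the same pooled tokens as both keys and values (with no learned projections or multiple heads) are all assumptions made to keep the example short.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def block_softmax_pool(tokens: np.ndarray, score_weights: np.ndarray,
                       block_size: int) -> np.ndarray:
    """Compress (N, D) tokens to (N // block_size, D).

    Consecutive tokens are grouped into non-overlapping blocks, and each
    block is reduced to one token by a softmax-weighted average.
    score_weights is a hypothetical (D,) scoring vector, assumed here.
    """
    n, d = tokens.shape
    if n % block_size != 0:
        raise ValueError("N must be divisible by block_size")
    blocks = tokens.reshape(n // block_size, block_size, d)
    scores = blocks @ score_weights            # (num_blocks, block_size)
    weights = softmax(scores, axis=1)          # weights sum to 1 per block
    return (weights[..., None] * blocks).sum(axis=1)


def attention_compressed_kv(q: np.ndarray, k: np.ndarray,
                            v: np.ndarray) -> np.ndarray:
    """Single-head attention with full-length queries and shorter K/V."""
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (N, M) rather than (N, N)
    return softmax(scores, axis=-1) @ v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((16, 8))           # N = 16 tokens, D = 8
    w = rng.standard_normal(8)                 # assumed scoring vector
    kv = block_softmax_pool(x, w, block_size=2)
    out = attention_compressed_kv(x, kv, kv)
    print(kv.shape, out.shape)
```

With a block size of 2, the score matrix in this sketch has shape (16, 8) instead of (16, 16), while the output keeps one row per query token.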
Experiments are conducted on CIFAR-100 at R = 2 and on ImageNet-1K at compression rates of R = 2 and R = 4. On ImageNet-1K, C-SRA at R = 2 achieves 76.25% Top-1 accuracy, within 0.17 percentage points of MHA’s 76.42%, while using 3.3 million fewer parameters, improving throughput by 3.3%, and reducing peak memory by 13.8% at 384 × 384 resolution. At R = 4, C-SRA achieves a 22.9% throughput improvement and a 27.2% reduction in peak memory relative to MHA at a cost of 3.42 percentage points in accuracy. L-SRA provides parameter savings at both compression rates but incurs routing overhead that eliminates wall-clock efficiency gains at R = 2, and it is dominated by C-SRA on efficiency metrics at both rates. The results demonstrate that convolutional K/V compression is an effective strategy for improving the efficiency of ViT attention, with C-SRA at R = 2 achieving a near-lossless accuracy-efficiency trade-off on ImageNet-1K.