Master's Theses

High-Resolution Queries, Low-Resolution Context: Scaling Vision Transformers with Asymmetric Spatial Reduction

James M. Dwyer, California Polytechnic State University, San Luis Obispo

Available at: https://digitalcommons.calpoly.edu/theses/3303

Date of Award

6-2026

Degree Name

MS in Computer Science

Department/Program

Computer Science

College

College of Engineering

Advisor

Paul Anderson

Advisor Department

Computer Science

Advisor College

College of Engineering

Abstract

Vision Transformers (ViTs) have demonstrated strong performance on image classification benchmarks; however, the quadratic complexity of the self-attention mechanism with respect to sequence length limits their scalability to higher-resolution inputs. The attention score matrix grows as O(N²) in both compute and memory, where N is the number of patch tokens, making ViTs computationally expensive and memory intensive for applications that require real-time inference or operate under resource constraints.
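To make the quadratic growth concrete, the following minimal sketch counts patch tokens and score-matrix entries for two input resolutions. The patch size of 16 and the 224 × 224 comparison resolution are assumptions chosen for illustration (the abstract does not state them), any class token is ignored, and the function names are hypothetical.

```python
def num_patch_tokens(image_size: int, patch_size: int = 16) -> int:
    """Number of patch tokens N for a square image cut into square patches.

    patch_size=16 is an assumed value, not taken from the thesis.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def score_matrix_entries(num_queries: int, num_keys: int) -> int:
    """Entries in one attention score matrix (per head, per image)."""
    return num_queries * num_keys


for size in (224, 384):
    n = num_patch_tokens(size)
    print(size, n, score_matrix_entries(n, n))
```

Under these assumptions, moving from 224 × 224 to 384 × 384 raises N from 196 to 576 and the score matrix from 38,416 to 331,776 entries, roughly an 8.6-fold increase for a 2.9-fold increase in token count.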

This thesis investigates whether the key and value sequences of the self-attention mechanism can be compressed using the local spatial structure of the image, while keeping queries at full sequence length, in order to reduce this quadratic cost without significant degradation in classification accuracy. Two attention variants are designed and evaluated against a standard Multi-Head Attention (MHA) baseline: L-SRA, which groups consecutive patch tokens into non-overlapping blocks and compresses each block into a single token via content-weighted softmax pooling, and C-SRA, which applies a depthwise 3×3 strided convolution to a low-rank latent representation of the 2D patch grid to produce a spatially subsampled K/V sequence. All three models share the same depth, embedding dimension, patch size, and training recipe, isolating the attention mechanism as the sole experimental variable.
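The idea of full-length queries attending to a compressed K/V sequence can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the thesis implementation: the scoring rule (a dot product with a hypothetical `score_weights` vector), the `block_size` parameter and its relation to R, and the omission of learned Q/K/V projections and multiple heads are all simplifications introduced here.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def block_softmax_pool(tokens, block_size, score_weights):
    """Compress each non-overlapping block of consecutive tokens into one token.

    Each token gets a scalar score (dot product with score_weights, an assumed
    scoring rule); a softmax over the block turns scores into pooling weights.
    """
    if len(tokens) % block_size != 0:
        raise ValueError("token count must be divisible by block_size")
    dim = len(tokens[0])
    pooled = []
    for start in range(0, len(tokens), block_size):
        block = tokens[start:start + block_size]
        scores = [sum(t * w for t, w in zip(tok, score_weights)) for tok in block]
        weights = softmax(scores)
        pooled.append(
            [sum(w * tok[d] for w, tok in zip(weights, block)) for d in range(dim)]
        )
    return pooled


def attend(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
        )
    return out


tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 2.0]]
compressed = block_softmax_pool(tokens, block_size=2, score_weights=[0.0, 0.0])
output = attend(tokens, compressed, compressed)
print(len(tokens), len(compressed), len(output))
```

The output keeps one vector per query token, while each query scores only the pooled tokens, so the score matrix shrinks from N × N to N × (N / block_size).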

Experiments are conducted on CIFAR-100 at R = 2 and on ImageNet-1K at compression rates of R = 2 and R = 4. On ImageNet-1K, C-SRA at R = 2 achieves 76.25% Top-1 accuracy, within 0.17 percentage points of MHA's 76.42%, while using 3.3 million fewer parameters, improving throughput by 3.3%, and reducing peak memory by 13.8% at 384 × 384 resolution. At R = 4, C-SRA achieves a 22.9% throughput improvement and a 27.2% reduction in peak memory relative to MHA at a cost of 3.42 percentage points in accuracy. L-SRA provides parameter savings at both compression rates but incurs routing overhead that eliminates wall-clock efficiency gains at R = 2, and is dominated by C-SRA on efficiency metrics at both rates. The results demonstrate that convolutional K/V compression is an effective strategy for improving the efficiency of ViT attention, with C-SRA at R = 2 achieving a near-lossless accuracy-efficiency trade-off on ImageNet-1K.
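The reported ImageNet-1K figures can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in the abstract; the R = 4 Top-1 value is an implied figure obtained by subtracting the stated 3.42-point drop from the MHA accuracy, and the variable names are hypothetical.

```python
# Figures quoted in the abstract (ImageNet-1K).
mha_top1 = 76.42        # MHA Top-1 accuracy, percent
csra_r2_top1 = 76.25    # C-SRA at R = 2, percent
csra_r4_drop = 3.42     # C-SRA at R = 4, accuracy cost in percentage points

# Accuracy gap at R = 2 and the Top-1 accuracy implied at R = 4.
gap_r2 = round(mha_top1 - csra_r2_top1, 2)
implied_r4_top1 = round(mha_top1 - csra_r4_drop, 2)

# Throughput and peak memory relative to MHA = 1.0, from the stated percentages.
relative = {
    "C-SRA R=2": {"throughput": 1 + 0.033, "peak_memory": 1 - 0.138},
    "C-SRA R=4": {"throughput": 1 + 0.229, "peak_memory": 1 - 0.272},
}

print(gap_r2, implied_r4_top1)
for name, metrics in relative.items():
    print(name, round(metrics["throughput"], 3), round(metrics["peak_memory"], 3))
```

The 0.17-point gap at R = 2 matches the stated figure, and the implied R = 4 accuracy is about 73.00%, assuming the 3.42-point cost is measured against the same MHA baseline.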

Included in

Artificial Intelligence and Robotics Commons
