Date of Award


Degree Name

MS in Computer Science


Department

Computer Science


College

College of Engineering


Advisor

Franz Kurfess

Advisor Department

Computer Science

Advisor College

College of Engineering


Convolutional neural networks (CNNs) have dominated the computer vision field since the early 2010s, when deep learning largely replaced previous approaches like hand-crafted feature engineering and hierarchical image parsing. Meanwhile, transformer architectures have attained preeminence in natural language processing and have even begun to supplant CNNs as the state of the art for some computer vision tasks.

This study proposes a novel transformer-based architecture, the attentional parsing network, that reconciles the deep learning and hierarchical image parsing approaches to computer vision. We recast unsupervised image representation as a sequence-to-sequence translation problem in which image patches are mapped to successive layers of latent variables, and we enforce symmetry and sparsity constraints to encourage these mappings to take the form of a parse tree.
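The patch-to-latent mapping described above can be illustrated with a minimal sketch: learned latent queries attend over a sequence of image patches, and an entropy penalty on the attention weights pushes each latent toward a small subset of patches. All names, dimensions, and the entropy-based penalty are illustrative assumptions, not the thesis's actual architecture.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions, not the thesis's configuration).
patch_dim, n_patches, n_latents, d_model = 48, 64, 16, 32

embed = nn.Linear(patch_dim, d_model)                    # patch embeddings
latents = nn.Parameter(torch.randn(n_latents, d_model))  # learned latent queries
attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

patches = torch.randn(1, n_patches, patch_dim)  # one image as a patch sequence
keys = embed(patches)
queries = latents.unsqueeze(0)                  # (1, n_latents, d_model)

# Each latent attends over all patches; `weights` is (1, n_latents, n_patches).
out, weights = attn(queries, keys, keys)

# Entropy penalty: low attention entropy means each latent concentrates on
# few patches, encouraging sparse, tree-like patch-to-latent assignments.
sparsity_loss = -(weights * weights.clamp_min(1e-9).log()).sum(-1).mean()
```

Stacking several such layers, with latents of one layer serving as the patch sequence of the next, would yield the successive layers of latent variables the abstract describes.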

We measure the quality of learned representations by passing them to a classifier and find high accuracy (> 90%) even for small models. We also demonstrate controllable image generation: first by “back translating” from latent variables to pixels, and then by selecting subsets of those variables with attention masks. Finally, we discuss our design choices and compare them with alternatives, suggesting best practices and possible areas of improvement.
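The masking step in the controllable-generation procedure can be sketched as follows: a binary mask selects a subset of latent variables, and only those survive into a decoder that maps latents back to pixel patches. The decoder here is a stand-in linear layer, and the mask choice is arbitrary; both are assumptions for illustration, not the thesis's model.

```python
import torch

n_latents, d_model, patch_dim = 16, 32, 48
latents = torch.randn(1, n_latents, d_model)   # encoder output (illustrative)

# Binary attention mask: keep only the first 4 latents, zero out the rest.
mask = torch.zeros(1, n_latents, 1)
mask[:, :4] = 1.0
masked = latents * mask

# Stand-in "back translation" from latents to pixel patches.
decoder = torch.nn.Linear(d_model, patch_dim)
patches_out = decoder(masked)                   # (1, n_latents, patch_dim)
```

Varying which latents the mask retains would vary which parts of the learned parse contribute to the generated image.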