DOI: https://doi.org/10.15368/theses.2020.155
Available at: https://digitalcommons.calpoly.edu/theses/2583
Date of Award: 12-2020
Degree Name: MS in Computer Science
Department/Program: Computer Science
College: College of Engineering
Advisor: Franz Kurfess
Advisor Department: Computer Science
Advisor College: College of Engineering
Abstract
Convolutional neural networks (CNNs) have dominated the computer vision field since the early 2010s, when deep learning largely replaced earlier approaches such as hand-crafted feature engineering and hierarchical image parsing. Meanwhile, transformer architectures have attained preeminence in natural language processing and have even begun to supplant CNNs as the state of the art for some computer vision tasks.
This study proposes a novel transformer-based architecture, the attentional parsing network, that reconciles the deep learning and hierarchical image parsing approaches to computer vision. We recast unsupervised image representation as a sequence-to-sequence translation problem in which image patches are mapped to successive layers of latent variables, and we enforce symmetry and sparsity constraints to encourage these mappings to take the form of a parse tree.
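As a rough illustration of this patch-to-latent mapping, the sketch below encodes patch tokens together with a set of learned latent queries using a transformer encoder, and adds an L1 penalty to encourage sparse latent activations. Every name and hyperparameter here (PatchToLatent, d_model, n_latents, the form of the penalty) is an illustrative assumption, not the thesis's actual implementation.

```python
# Minimal sketch, assuming a transformer encoder maps patch tokens to one
# layer of latent variables; names and hyperparameters are hypothetical.
import torch
import torch.nn as nn

class PatchToLatent(nn.Module):
    def __init__(self, patch_dim=48, d_model=128, n_latents=16,
                 n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)   # patch -> token embedding
        # Learned latent queries; their encoder outputs become the next layer.
        self.latents = nn.Parameter(torch.randn(n_latents, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, patches):
        # patches: (batch, n_patches, patch_dim)
        tokens = self.embed(patches)
        queries = self.latents.expand(patches.size(0), -1, -1)
        # Encode latent queries jointly with patch tokens, then keep the
        # latent positions as this layer's latent variables.
        out = self.encoder(torch.cat([queries, tokens], dim=1))
        z = out[:, :queries.size(1)]
        sparsity = z.abs().mean()   # L1 penalty nudging the mapping toward sparsity
        return z, sparsity

# Toy usage: a batch of 2 images, each an 8x8 grid of 4x4x3 patches.
model = PatchToLatent()
patches = torch.randn(2, 64, 48)
z, penalty = model(patches)
```

Stacking several such modules, each consuming the previous layer's latents, would yield the successive layers of latent variables described above.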
We measure the quality of the learned representations by passing them to a classifier and find high accuracy (> 90%) even for small models. We also demonstrate controllable image generation: first by “back-translating” from latent variables to pixels, and then by selecting subsets of those variables with attention masks. Finally, we discuss our design choices and compare them with alternatives, suggesting best practices and possible areas for improvement.
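Both evaluation ideas can be sketched in a few lines under assumed shapes and modules; the linear probe, the toy decoder, and the boolean mask below are hypothetical stand-ins, not the thesis's exact setup. A classifier over frozen latents gauges representation quality, and decoding only a masked subset of latents back to patch pixels illustrates mask-controlled generation.

```python
# Minimal sketch, assuming z holds latent layers like those produced by the
# encoder sketch above; the probe, decoder, and mask are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

z = torch.randn(2, 16, 128)   # (batch, n_latents, d_model), stand-in latents

# (1) Linear probe: classify from frozen latents to measure their quality.
probe = nn.Linear(16 * 128, 10)
logits = probe(z.flatten(1))
loss = F.cross_entropy(logits, torch.tensor([3, 7]))   # toy labels

# (2) Controllable generation: "back-translate" a masked subset of latents.
decoder = nn.Linear(128, 48)                 # latent -> 4x4x3 patch pixels (toy)
mask = torch.zeros(16, dtype=torch.bool)
mask[:4] = True                              # select only the first 4 latents
patches_hat = decoder(z[:, mask])            # (2, 4, 48) generated patches
```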