Available at: https://digitalcommons.calpoly.edu/theses/3356
Date of Award
6-2026
Degree Name
MS in Computer Science
Department/Program
Computer Science
College
College of Engineering
Advisor
Stephen Beard
Advisor Department
Computer Science
Advisor College
College of Engineering
Abstract
Machine learning training and inference workloads run at massive scale, where the efficiency of generated machine code directly affects throughput and energy consumption. The compilers that lower model specifications to hardware instructions rely on optimization passes that can only exploit information visible in the representations they operate on. MLIR, the intermediate representation framework underlying production machine learning compilers such as IREE and Triton, is designed so that each abstraction level can encode semantics that lower levels cannot represent. One example is memref.subview, which partitions a buffer into typed regions with explicit offsets and sizes, making structural non-overlap provable from the IR. Standard MLIR-to-LLVM lowering discards this structure, reducing subviews to pointer arithmetic that LLVM alias analysis cannot distinguish from potentially overlapping accesses. This loss blocks optimizations such as loop-invariant code motion, redundancy elimination, and vectorization, even when accesses are provably disjoint at the MLIR level.
This thesis presents an out-of-tree MLIR pass pipeline that identifies provably disjoint buffer regions before lowering and preserves the structural proof as LLVM !alias.scope and !noalias metadata. Rather than introducing a new alias analysis, the approach communicates existing information in a form LLVM optimization passes already understand. Across six case-study kernels, the pipeline eliminates all 32 alias-related optimization misses. It achieves a 3.76x speedup on Apple M4 under -force-vector-width=4 by enabling NEON SIMD vectorization. On benchmark-derived kernels from PolyBench/C, IMEX, and IREE, speedups reach 1.51x across platforms and exceed 2x on out-of-order ARM cores, while neutral and regressing cases establish the practical limits of the approach.
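As an illustration of the structural non-overlap argument described in the abstract, the following minimal sketch checks whether two subviews of the same buffer are provably disjoint. It is not the thesis's implementation. It assumes static offsets and sizes, unit strides, and a base buffer whose layout maps distinct indices to distinct addresses. The names `Subview` and `provably_disjoint` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Subview:
    """Hypothetical model of a static, unit-stride subview of one base buffer."""
    offsets: Sequence[int]  # start index per dimension
    sizes: Sequence[int]    # extent per dimension


def provably_disjoint(a: Subview, b: Subview) -> bool:
    """Return True only when the two regions share no index of the base buffer.

    Each region is a box of half-open intervals [offset, offset + size).
    Two boxes cannot intersect if their intervals fail to overlap in at
    least one dimension. A False result means "not proven", not "overlapping".
    """
    ranks = {len(a.offsets), len(a.sizes), len(b.offsets), len(b.sizes)}
    if len(ranks) != 1:
        raise ValueError("subviews must have matching rank")
    for a_off, a_size, b_off, b_size in zip(a.offsets, a.sizes, b.offsets, b.sizes):
        if a_off + a_size <= b_off or b_off + b_size <= a_off:
            return True
    return False


# Two halves of a 128-element buffer never overlap.
left = Subview(offsets=(0,), sizes=(64,))
right = Subview(offsets=(64,), sizes=(64,))
# A window starting at index 32 overlaps the left half.
window = Subview(offsets=(32,), sizes=(64,))

print(provably_disjoint(left, right))   # True
print(provably_disjoint(left, window))  # False
```

In a pipeline of the kind the abstract describes, a pair for which such a check succeeds is the kind of pair whose accesses could be tagged with distinct LLVM `!alias.scope` and `!noalias` metadata before lowering, while pairs that fail the check would be left unannotated.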