Available at: https://digitalcommons.calpoly.edu/theses/3291

Date of Award

6-2026

Degree Name

MS in Computer Science

Department/Program

Computer Science

College

College of Engineering

Advisor

Maria Pantoja

Advisor Department

Computer Science

Advisor College

College of Engineering

Abstract

Since the seminal assertion by Vaswani et al. in 2017 that “Attention Is All You Need,” transformer models have risen to ubiquity due to their ability to learn extremely complex patterns from sequence data, culminating in the unprecedented generative capabilities of large language models. These models’ strength lies in their scale: hundreds of millions (e.g., BERT-LARGE) to billions or trillions of learned parameters. Running inference with these models, let alone training them, would be intractable without significant innovations in the hardware and software that support them. This need has driven an enormous demand for GPU compute and associated software ecosystems, like that of NVIDIA. With many different accelerator architectures, attention variants, and application constraints, obtaining a holistic view of what goes on behind the scenes of state-of-the-art AI systems can be elusive, even to the well-initiated.

In this work, we seek to provide exactly that: a unified perspective on what makes modern attention implementations fast, efficient, and flexible enough to handle the tremendous computational demands of transformer applications. To this end, we discuss the machine learning operations that motivate this need, together with the algorithmic and systems-level techniques researchers have developed to satisfy it, in terms understandable to readers with a general computer science background. We present five key attributes of modern attention implementations, with detailed explanations of popular frameworks, such as the FlashAttention series, that put them into practice. We demonstrate the tangible performance improvements these yield by evaluating them against attention workloads of various sizes across four generations of NVIDIA GPUs. Our results consolidate the diverse literature surrounding this topic into an accessible, structured framework, and provide researchers a practical reference for navigating and assessing the rapidly evolving field of attention acceleration.

Included in

Artificial Intelligence and Robotics Commons

Master's Theses

What Makes a Modern Attention Implementation?
