Date of Award

6-2026

Degree Name

MS in Electrical Engineering

Department/Program

Electrical Engineering

College

College of Engineering

Advisor

Xiao-Hua Yu

Advisor Department

Electrical Engineering

Advisor College

College of Engineering

Abstract

Large Language Models (LLMs) have grown increasingly prevalent in both research and application. While LLMs are often thought of as “black boxes,” there has been a major push towards interpreting and controlling them at a mechanistic level. One tool emerging from this research is the Sparse Autoencoder (SAE). An SAE is trained on internal transformer representations to extract human-interpretable features from an LLM. These features can be adjusted within the SAE to manually steer an LLM towards a certain behavior. Some attempts have been made to apply this method of steering to functional control in niche applications. However, research on designing and applying these systems remains limited because the technology is so recent. Additionally, from a control perspective, the tools to date are open-loop, with no feedback mechanism operating over multiple tokens. The purpose of this thesis is to explore how such a feedback mechanism could be implemented, demonstrated through an example application. In this application, the system aims to increase the refusal rate of the Gemma 2 2B LLM against multi-turn jailbreak attacks while maintaining a high compliance rate on multi-turn benign prompts. Gemma 2 2B is chosen for its size and the availability of pretrained SAEs. The feedback control mechanism is shown to improve resistance to Generative Offensive Agent Tester (GOAT) jailbreak attacks from 29.2% for the base model to 48%, while the noncompliance rate for benign prompts changes only from 2.1% at baseline to 3.3%. The system is implemented in real time.

Available for download on Thursday, May 31, 2029
