Available at: https://digitalcommons.calpoly.edu/theses/3270
Date of Award
6-2026
Degree Name
MS in Electrical Engineering
Department/Program
Electrical Engineering
College
College of Engineering
Advisor
Xiao-Hua Yu
Advisor Department
Electrical Engineering
Advisor College
College of Engineering
Abstract
Large Language Models (LLMs) have grown increasingly prevalent in both research and application. While LLMs are often thought of as “black boxes,” there has been a major push towards interpreting and controlling them at a mechanistic level. One tool emerging from this research is the Sparse Autoencoder (SAE). An SAE is trained on internal transformer representations to extract human-interpretable features from an LLM. These features can be tuned within the SAE to manually steer an LLM towards a certain behavior. Some attempts have been made to apply this method of steering to functional control in niche applications. However, research on designing and applying these systems remains limited because the technology is so recent. Additionally, from a control perspective, the tools to date are open-loop, with no feedback mechanism operating over multiple tokens. The purpose of this thesis is to explore how such a feedback mechanism could be implemented, demonstrated through an example application. In this application, the system aims to increase the refusal rate of the Gemma 2 2B LLM against multi-turn jailbreak attacks while maintaining a high compliance rate on multi-turn benign prompts. Gemma 2 2B is chosen for its size and the availability of pretrained SAEs. The feedback control mechanism is shown to improve resistance to Generative Offensive Agent Tester (GOAT) jailbreak attacks from 29.2% for the base model to 48%, while the noncompliance rate for benign prompts rises only from 2.1% to 3.3%. The system runs in real time.
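
As a rough illustration of the difference between open-loop steering and steering with feedback over multiple tokens, the following Python sketch runs a toy control loop on synthetic data. It is not the implementation from the thesis. The SAE here is random with tied encoder and decoder weights, the activations are synthetic noise rather than Gemma 2 2B residuals, and the names `sae_encode`, `steer`, `run_closed_loop`, the feature index, the target, and the gain are all hypothetical. The integral-style update rule is likewise an assumption chosen for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, D_SAE = 16, 64  # toy sizes, not Gemma 2 2B's real dimensions

# Toy SAE: unit-norm decoder directions, encoder tied to the decoder.
# A real SAE would be pretrained on transformer activations.
W_dec = rng.normal(size=(D_SAE, D_MODEL))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)
W_enc = W_dec.T


def sae_encode(x):
    """Map an activation vector to non-negative SAE feature activations."""
    return np.maximum(x @ W_enc, 0.0)


def steer(x, feature, coeff):
    """Add coeff times one feature's decoder direction to the activation."""
    return x + coeff * W_dec[feature]


def run_closed_loop(activations, feature, target, gain):
    """Steer each token, then update the coefficient from the measured error.

    The coefficient used at token t depends on what was measured at
    tokens before t, which is what makes the loop closed rather than open.
    """
    coeff = 0.0
    measured = []
    for x in activations:
        steered = steer(x, feature, coeff)
        m = sae_encode(steered)[feature]   # feedback measurement
        measured.append(m)
        coeff += gain * (target - m)       # integral-style correction
    return np.array(measured), coeff


# Synthetic per-token activations standing in for a model's residual stream.
acts = 0.1 * rng.normal(size=(30, D_MODEL))
measured, coeff = run_closed_loop(acts, feature=5, target=3.0, gain=0.5)
print(measured[:3], measured[-3:])
```

In this toy setting the measured feature activation starts near zero and is pulled towards the target over successive tokens. An open-loop approach would instead apply a fixed coefficient regardless of what the feature reads.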