Date of Award

9-2025

Degree Name

MS in Statistics

Department/Program

Statistics

College

College of Science and Mathematics

Advisor

Hunter Glanz

Advisor Department

Statistics

Advisor College

College of Science and Mathematics

Abstract

As video games increasingly emphasize narrative depth and player immersion, the quality of Non-Player Character (NPC) dialogue has become crucial to creating engaging gaming experiences. This thesis investigates the potential of Large Language Models (LLMs) to generate high-quality NPC dialogue through a comprehensive evaluation of four state-of-the-art models: Gemma 3 27B, Mistral 7B, Qwen 2.5, and Llama 3.1. The study employs a mixed-methods approach, combining human evaluation (N = 50 participants) with AI-based assessment across five key benchmarks: coherence, personality expression, engagement, style/tone appropriateness, and overall quality. Participants evaluated 32 dialogue samples (8 per model) generated for a fantasy game context featuring two distinct characters. Statistical analysis using ordinal mixed-effects models revealed significant differences in performance across models and benchmarks. Mistral 7B consistently outperformed the other models, particularly on the personality expression and engagement metrics. The study also examined the agreement between human and AI evaluation methods, finding only weak positive correlations across benchmarks and suggesting that AI evaluators cannot reliably substitute for human judgment in dialogue quality assessment. These findings have important implications for game developers seeking to leverage LLMs for scalable, dynamic dialogue generation while maintaining narrative quality.
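As a rough illustration of the kind of analysis the abstract describes, the sketch below fits a simplified ordinal logit over hypothetical rating data. The column names (participant, model, benchmark, rating) and the file ratings.csv are assumptions for illustration, not the thesis's actual data or pipeline, and the sketch omits the participant-level random effect that a true ordinal mixed-effects model would include (such models are typically fit with a dedicated package, e.g. R's ordinal::clmm).

```python
# Minimal sketch of an ordinal analysis of 1-5 dialogue ratings.
# Assumes long-format data with hypothetical columns:
#   participant, model, benchmark, rating
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# One row per participant x dialogue-sample rating (hypothetical file).
df = pd.read_csv("ratings.csv")

# Treat the 1-5 rating as an ordered categorical outcome.
df["rating"] = pd.Categorical(df["rating"], categories=[1, 2, 3, 4, 5], ordered=True)

# Fixed-effects-only ordinal logit with model and benchmark indicators.
# NOTE: this is a simplification; the thesis's stated method adds a
# random effect for participant, which OrderedModel does not support.
exog = pd.get_dummies(df[["model", "benchmark"]], drop_first=True, dtype=float)
res = OrderedModel(df["rating"], exog, distr="logit").fit(method="bfgs")
print(res.summary())
```

The same data frame also supports a quick check of human-AI agreement of the sort the abstract reports, e.g. scipy.stats.spearmanr on paired human and AI ratings per benchmark, though the thesis's exact correlation procedure is not shown here.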
