Date of Award


Degree Name

MS in Computer Science


Computer Science


College of Agriculture, Food, and Environmental Sciences


Foaad Khosmood

Advisor Department


Advisor College

College of Agriculture, Food, and Environmental Sciences


Facial recognition is a powerful tool for identifying people visually. Yet problems can arise when the end goal is more specific than merely identifying the person in a picture. Speaker identification is one such task: it demands more predictive power than a facial recognition system can provide on its own. Speaker identification is the task of identifying who is speaking in a video, not simply who is present in it. This extra requirement introduces numerous false positives into the facial recognition system, largely due to one scenario: the person speaking is not on camera. This paper investigates a solution to this problem by incorporating information from a new system that indicates whether or not the person on camera is speaking. This information can then be combined with an existing facial recognition system to boost its predictive capabilities in this setting.

We propose a speaker detection system that visually detects when someone in a given video is speaking. The system relies strictly on visual information and uses no audio. Because it relies strictly on visual information to detect when someone is speaking, the system can be synced with an existing facial recognition system to extend its predictive power. We use a two-stream convolutional neural network to perform the speaker detection. The neural network is trained and tested using data extracted from Digital Democracy’s large database of transcribed political hearings [4]. We show that the system detects when someone on camera is speaking with an accuracy of 87% on a dataset of legislators. Furthermore, we demonstrate how this information can benefit a facial recognition system whose end goal is identifying the speaker. The system increased the precision of an existing facial recognition system by up to 5%, at the cost of a large drop in recall.
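
The precision-for-recall trade-off described above can be illustrated with a minimal sketch. It assumes one simple way of combining the two systems: a face identity is accepted only when a speaking score clears a threshold. The names `FramePrediction`, `gate`, and `precision_recall`, the 0.5 threshold, and the four toy frames are all hypothetical and are not taken from the thesis; the numbers it produces are not the thesis's results.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FramePrediction:
    face_id: Optional[str]  # identity from the face recognizer (None if no match)
    speaking_score: float   # hypothetical speaker-detector score in [0, 1]
    true_speaker: str       # ground-truth speaker, e.g. from a transcript


def gate(pred: FramePrediction, threshold: float) -> Optional[str]:
    """Keep the face identity only if the on-camera person appears to be speaking."""
    if pred.face_id is None or pred.speaking_score < threshold:
        return None
    return pred.face_id


def precision_recall(
    frames: List[FramePrediction], threshold: Optional[float] = None
) -> Tuple[float, float]:
    """Precision = correct / predictions made; recall = correct / all frames.

    With threshold=None the face recognizer's output is used ungated.
    """
    made = correct = 0
    for f in frames:
        guess = f.face_id if threshold is None else gate(f, threshold)
        if guess is None:
            continue
        made += 1
        if guess == f.true_speaker:
            correct += 1
    precision = correct / made if made else 0.0
    recall = correct / len(frames) if frames else 0.0
    return precision, recall


# Toy data (made up for illustration only).
FRAMES = [
    FramePrediction("alice", 0.9, "alice"),  # on camera and speaking
    FramePrediction("alice", 0.4, "alice"),  # speaking, but detector is unsure
    FramePrediction("bob", 0.2, "alice"),    # bob on camera, alice speaks off camera
    FramePrediction("bob", 0.8, "bob"),      # on camera and speaking
]

ungated = precision_recall(FRAMES)
gated = precision_recall(FRAMES, threshold=0.5)
print("ungated precision/recall:", ungated)
print("gated precision/recall:  ", gated)
```

In this toy setting, gating removes the off-camera false positive but also discards a correct identification that the detector scored low, which is the same kind of trade-off the abstract reports.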