DOI: https://doi.org/10.15368/theses.2013.107
Available at: https://digitalcommons.calpoly.edu/theses/936
Date of Award
6-2013
Degree Name
MS in Computer Science
Department/Program
Computer Science
Advisor
Zoë J Wood
Abstract
In the field of speech recognition, an algorithm must learn to tell the difference between "a nice rock" and "a gneiss rock". These identical-sounding phrases are called oronyms. Word frequency dictionaries are often used by speech recognition systems to help resolve phonetic sequences with more than one possible orthographic phrase interpretation, by looking up which oronym of the root phonetic sequence contains the most-common words.
Our paper demonstrates a technique used to validate word frequency dictionary values. We chose to use frequency values from the UNISYN dictionary, which calculates word frequency by tallying each word on a per-occurrence basis over a proprietary text corpus.
In the first phase of our user study, we generated oronym strings for the phrase "a nice cold hour", and had over a dozen people record 62 of the most common oronyms for that phrase. In the second phase, we selected 15 of the phase one recordings, and had 74 different people transcribe each one, for a total of 953 transcriptions overall.
If the frequency dictionary values for our test phrases accurately reflected the real-world expectations of actual listeners, we would expect the most commonly transcribed phrases in our user study to roughly correspond to the oronyms that our metric ranks as the most likely interpretations of the root phrase.
During the course of our study, we found that using per-occurrence frequency values, like those found in the UNISYN dictionary, to compute our overall phrase-frequency metric caused the end result to be thrown off by excessively common words, such as "the", "is", and "a". These super-common words had such high per-occurrence tallies that they overpowered the effect of any regular word on the frequency metric. When we used frequency values from the COCA dictionary, which tallies word frequency on a document-count basis rather than a UNISYN-like per-occurrence basis, we found that this effect was mitigated. As a result, we do not recommend using the UNISYN dictionary for word frequency purposes.
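The effect described above can be illustrated with a minimal Python sketch. It rests on assumptions that do not come from the thesis: the phrase metric is taken to be a plain sum of per-word frequency values (the abstract does not state the actual formula), and every number in both tables is invented for illustration and is not a real UNISYN or COCA value. The function names are likewise hypothetical.

```python
# Toy counts invented for illustration; these are NOT real UNISYN or COCA values.
# PER_OCCURRENCE: how many times each word appears in a hypothetical corpus.
PER_OCCURRENCE = {
    "a": 2_000_000, "an": 300_000,
    "nice": 10_000, "ice": 30_000,
    "cold": 20_000, "hour": 25_000,
}
# DOCUMENT_COUNT: how many of 1,000 hypothetical documents contain each word.
DOCUMENT_COUNT = {
    "a": 1000, "an": 990,
    "nice": 300, "ice": 400,
    "cold": 500, "hour": 600,
}

CANDIDATES = ["a nice cold hour", "an ice cold hour"]


def phrase_score(phrase, freq):
    """Assumed metric: sum of the frequency values of the words in the phrase."""
    return sum(freq.get(word, 0) for word in phrase.split())


def top_word_share(phrase, freq):
    """Fraction of the phrase score contributed by its highest-valued word."""
    values = [freq.get(word, 0) for word in phrase.split()]
    total = sum(values)
    return max(values) / total if total else 0.0


def best_phrase(candidates, freq):
    """Return the candidate with the highest assumed phrase score."""
    return max(candidates, key=lambda phrase: phrase_score(phrase, freq))


if __name__ == "__main__":
    tables = [("per-occurrence", PER_OCCURRENCE), ("document-count", DOCUMENT_COUNT)]
    for name, table in tables:
        winner = best_phrase(CANDIDATES, table)
        share = top_word_share(winner, table)
        print(f"{name}: best = {winner!r}, top-word share = {share:.2f}")
```

With these toy tables, the per-occurrence score of "a nice cold hour" comes almost entirely from "a", while the document-count values are capped by the number of documents, so no single word can dominate the sum to the same degree. The two tables also rank the candidates differently, which is the kind of disagreement the user study was designed to detect.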
Included in
Artificial Intelligence and Robotics Commons, Composition Commons, Computational Linguistics Commons, Graphics and Human Computer Interfaces Commons, Other Computer Sciences Commons, Other Linguistics Commons, Phonetics and Phonology Commons, Software Engineering Commons