Available at: http://digitalcommons.calpoly.edu/theses/1268
MS in Computer Science
Communication Accommodation Theory (CAT) states that individuals adapt to each other’s communicative behaviors. This adaptation is called “convergence.” In this work we explore the convergence of writing styles among users of the online music distribution platform SoundCloud.com. To evaluate our system, we created a corpus of over 38,000 comments retrieved from SoundCloud in April 2014. The corpus represents comments from 8 distinct musical genres: Classical, Electronic, Hip Hop, Jazz, Country, Metal, Folk, and World. The corpus is characterized by short comments, frequent misspellings, little sentence structure, hashtags, emoticons, and URLs. To deal with these problems, we adapt techniques used by researchers analyzing other short web-text corpora. We use a supervised machine learning approach to classify the genre of comments in our corpus, and we examine the effects of different feature sets and supervised machine learning algorithms on classification accuracy. In total we ran 180 experiments in which we varied the number of genres, the feature set composition, and the machine learning algorithm. In experiments with all 8 genres we achieve up to 40% accuracy using either a Naive Bayes classifier or a C4.5-based classifier with a feature set consisting of 1262 token unigrams and bigrams. This represents a roughly threefold improvement over chance levels.
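The classification setup described above can be illustrated with a minimal sketch. It assumes a multinomial Naive Bayes model with add-one smoothing over token unigram and bigram counts, written from scratch in Python. The tokenizer pattern, the names `ngram_features` and `NaiveBayesGenreClassifier`, and the toy comments are all hypothetical and are not taken from the thesis or its corpus. The sketch also omits the selection of the 1262 features used in the reported experiments.

```python
import math
import re
from collections import Counter, defaultdict

# Assumed tokenizer: keeps URLs, hashtags, and a few simple emoticons as
# single tokens, since the corpus is said to contain all three.
TOKEN_RE = re.compile(r"https?://\S+|#\w+|[:;=][-']?[)(dp]|\w+")


def ngram_features(comment):
    """Return the lowercase token unigrams followed by the token bigrams."""
    tokens = TOKEN_RE.findall(comment.lower())
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


class NaiveBayesGenreClassifier:
    """Multinomial Naive Bayes with additive (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()               # comments per genre
        self.feature_counts = defaultdict(Counter)  # n-gram counts per genre
        self.vocabulary = set()

    def fit(self, comments, genres):
        for comment, genre in zip(comments, genres):
            self.class_counts[genre] += 1
            for feat in ngram_features(comment):
                self.feature_counts[genre][feat] += 1
                self.vocabulary.add(feat)
        return self

    def predict(self, comment):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        feats = [f for f in ngram_features(comment) if f in self.vocabulary]
        best_genre, best_score = None, -math.inf
        for genre, n_docs in self.class_counts.items():
            counts = self.feature_counts[genre]
            denom = sum(counts.values()) + self.alpha * vocab_size
            # log prior plus the sum of smoothed log likelihoods
            score = math.log(n_docs / total_docs)
            for feat in feats:
                score += math.log((counts[feat] + self.alpha) / denom)
            if score > best_score:
                best_genre, best_score = genre, score
        return best_genre


# Toy training data, invented for illustration only.
train_comments = [
    "sick drop bro",
    "that bass drop is insane",
    "beautiful violin solo",
    "the violin phrasing is lovely",
]
train_genres = ["Electronic", "Electronic", "Classical", "Classical"]

model = NaiveBayesGenreClassifier().fit(train_comments, train_genres)
print(model.predict("huge drop"))
print(model.predict("lovely violin"))
```

With 8 genres, a classifier of this kind would be compared against the chance-level baseline mentioned in the abstract. A C4.5-style decision tree could be trained on the same unigram and bigram counts for comparison.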