BS in Statistics
Twitter has rapidly become one of the most popular sites on the Internet. It functions not just as a microblogging service, but also as a crowdsourcing tool for listening, promotion, insight, and much more. From the perspective of TV networks, tweets capture the real-time reactions of viewers, making them a promising indicator of a show's ratings. This paper predicts Internet Movie Database (IMDb) television ratings by text mining Twitter data.
Tweets for five television shows were downloaded over a period of several months using a SAS macro. Television show data, such as rating, show title, and episode title, were retrieved through the Python package IMDbPY. Overall, there were four to seven episodes for each show, with approximately 1,000 to 100,000 tweets per episode.
Tweets were cleaned through a series of Perl-style regular expressions in SAS and Python. Once the data were cleaned as much as possible, both SAS and Python were used to assign each tweet a sentiment score based on the AFINN dictionary. PROC SQL was used to join the datasets as the data were transferred between the two programs.
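The cleaning and scoring steps described above can be illustrated with a minimal Python sketch. It is not the paper's actual code: the regular expressions, the function names (`clean_tweet`, `score_tweet`, `mean_score_by_episode`), and the six-word `TOY_LEXICON` are all assumptions made for illustration. The lexicon only mimics the AFINN format (a word mapped to an integer valence), and its values are placeholders rather than entries copied from the real dictionary.

```python
import re
from collections import defaultdict

# Hypothetical miniature lexicon in the AFINN format (word -> integer valence).
# Values are illustrative placeholders, not the real AFINN entries.
TOY_LEXICON = {
    "love": 3, "great": 3, "awesome": 4,
    "boring": -3, "hate": -3, "terrible": -3,
}

# Assumed cleaning patterns; the paper does not list its exact expressions.
RT_RE = re.compile(r"^\s*RT\b:?", re.IGNORECASE)  # leading retweet marker
URL_RE = re.compile(r"https?://\S+")              # links
MENTION_RE = re.compile(r"@\w+")                  # @usernames
NON_ALPHA_RE = re.compile(r"[^a-z\s]")            # digits, punctuation, '#'


def clean_tweet(text):
    """Strip retweet markers, URLs, mentions, and punctuation; lowercase."""
    text = RT_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text.lower())
    return " ".join(text.split())  # collapse runs of whitespace


def score_tweet(text, lexicon=TOY_LEXICON):
    """Sum the valence of every lexicon word found in the cleaned tweet."""
    return sum(lexicon.get(word, 0) for word in clean_tweet(text).split())


def mean_score_by_episode(tweets, lexicon=TOY_LEXICON):
    """Average tweet scores per episode from (episode_id, text) pairs."""
    totals = defaultdict(lambda: [0, 0])  # episode_id -> [score sum, count]
    for episode_id, text in tweets:
        totals[episode_id][0] += score_tweet(text, lexicon)
        totals[episode_id][1] += 1
    return {ep: total / count for ep, (total, count) in totals.items()}
```

In this sketch, a per-episode mean is only one possible way to summarize tweet scores before they are joined to the episode ratings; the join itself, done with PROC SQL in the paper, is not reproduced here.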
Sentiment analysis was used to determine the attitude or emotion of each tweet and thereby capture the audience's natural reactions. Reviews are written by a select minority of reviewers, while tweets can be written by anyone. Tweets might also be more honest than a formal review because users do not write tweets in the same setting in which they would write a review.