Focusing on Maximum Entropy Classification of Lyrics by Tom Waits

Christos Iraklis Tsatsoulis
Department of Informatics, Institute of Technology Blanchardstown, Dublin, Ireland
Email: ctsats@hotmail.com

Markus Hofmann
Department of Informatics, Institute of Technology Blanchardstown, Dublin, Ireland
Email: markus.hofmann@itb.ie

Abstract—This paper reports on the text analysis (visual and computational) carried out on the lyrics of the rock musician Tom Waits. The analysis focuses mainly on the generally agreed transition period marked by his album Swordfishtrombones, which started a new phase in Waits’ 40-year career. A total of ten supervised learners are tested with the aim of separating the high-dimensional space of the word vectors (based on his lyrics) into two phases. After initial tests, particular focus is given to the Maximum Entropy classifier, which is examined further with some additional pre-processing techniques. The classifier sheds further light on the two classes by separating them with an accuracy of 95%.

I. INTRODUCTION

We choose as our dataset the lyrics of rock musician Tom Waits, aggregated at the album level. Tom Waits’ musical career spans almost 40 years, and it is commonly agreed that it underwent some kind of “phase transition” with his 1983 album Swordfishtrombones. While this is rather unambiguous regarding his musical style, here we investigate whether this change is similarly reflected in his lyrics.

We obtained Waits’ lyrics from Lyrics Mania (http://www.lyricsmania.com/). To do so, we built a custom crawler in R using the XML package. Exploiting the XML package and some regularities in the corresponding web pages, we obtained at the crawler’s output a clean .txt file, stripped of all HTML formatting and extra information (advertisements, hyperlinks etc.). We then manually checked each file for typos and words that were not part of the lyrics (e.g.
“Chorus”, “Instrumental”, composers’ names etc.). The titles of the instrumental (i.e. without lyrics) songs were retained in the corresponding documents.

The titles of Tom Waits’ 20 studio albums that form our document collection (compilations and live albums are irrelevant for our analysis) are shown in alphabetical order in Table 1. The last column (Period) indicates whether the album belongs to the pre-1983 period (A) or otherwise (B). In total, we have 8 albums from period A and 12 albums from period B. Our objective is to train classification algorithms on a subset of the albums and to predict the period for the remaining albums (test set). This binary classification problem has a small and slightly imbalanced dataset (20 examples, 60/40 balance).

Mining music and lyrics is not uncommon, and some approaches were published in [1], [2], and [3].

The paper first explores the corpus visually in Section 2. Section 3 focuses on a set of initial experiments using a total of ten classification models. The results obtained in Section 3 justify a more focused analysis of Maximum Entropy classification, presented in Section 4. Finally, Section 5 provides an extensive discussion and conclusion.

II. VISUALIZATION AND EXPLORATION

First, we would like to have a visual impression of the word usage separately for each of our classes, in order to spot possible differences. For this reason, we construct two wordcloud visualizations, one for each class, after tokenizing, converting to lowercase, and removing punctuation and stopwords. We use tm, the de facto R baseline package for text mining, and the wordcloud package. The results, with simple Term Frequency (TF) weighting, are shown in Figs. 1 and 2 respectively.

The first thing that catches our attention in Figs. 1 and 2 is the strong presence of the term cause in class A only.
The term actually comes from ’cause, the abbreviated spoken form of because, and we can immediately see from the visualizations that it constitutes a difference between our two classes. We consider this a useful practical example of the fact that stopwords should always be context-specific: having identified the presence of ’cause in the text, one could certainly be tempted to remove it as a stopword, just like because; however, since this is artistic text where style matters, removing it would undermine the discriminative power of our attempted classification. Similar arguments also hold for the term gonna, which is much more pronounced in class B.

Other things worth noticing in Figs. 1 and 2 are the strong presence of the terms world and train in class B only. Some rather expected terms for rock song lyrics are present in both classes, with similar (baby, love, ain) or different (night, heart, home) emphasis. At first glance, it seems that there is indeed room for classification based on the different terms involved in our two classes.

664 978-1-4799-2572-8/14/$31.00 © 2014 IEEE
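The pre-processing and TF counting described in Section II, together with the point about context-specific stopwords, can be illustrated with a minimal sketch. It is written in Python with the standard library only, as a stand-in for the R packages tm and wordcloud used in the paper, so it should not be read as the authors' implementation. The stopword list, the function name term_frequencies, the rule of dropping single-letter tokens, and the two toy strings are all assumptions introduced for illustration; the strings are invented and are not actual lyrics.

```python
import re
from collections import Counter

# Hypothetical, heavily abbreviated stopword list for illustration only;
# the paper relies on R's tm package, whose list is much longer.
BASE_STOPWORDS = {"the", "is", "a", "and", "i", "you", "to", "of", "because"}


def term_frequencies(text, stopwords):
    """Lowercase, replace non-letters with spaces, tokenize, drop stopwords, count."""
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    # Assumption: single-letter leftovers (e.g. the "t" of "ain't") are discarded.
    tokens = [t for t in cleaned.split() if len(t) > 1 and t not in stopwords]
    return Counter(tokens)


# Invented toy strings standing in for the two classes (NOT actual lyrics).
toy_class_a = "'Cause the night is long, 'cause the heart is home"
toy_class_b = "Gonna ride the train, gonna see the world"

tf_a = term_frequencies(toy_class_a, BASE_STOPWORDS)
tf_b = term_frequencies(toy_class_b, BASE_STOPWORDS)

# Terms that occur in one class only are candidate discriminative features.
only_in_a = set(tf_a) - set(tf_b)
only_in_b = set(tf_b) - set(tf_a)

# Treating "cause" as a stopword (like "because") removes it from the counts.
tf_a_strict = term_frequencies(toy_class_a, BASE_STOPWORDS | {"cause"})

print(tf_a.most_common(3))
print(sorted(only_in_a), sorted(only_in_b))
print("cause" in tf_a_strict)
```

In this toy setting, extending the stopword list with cause silently deletes the most frequent term of the first string, which is the effect the discussion above warns against for stylistically marked text.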