Focusing on Maximum Entropy Classification of Lyrics by Tom Waits

Christos Iraklis Tsatsoulis
Department of Informatics, Institute of Technology Blanchardstown, Dublin, Ireland
Email: ctsats@hotmail.com

Markus Hofmann
Department of Informatics, Institute of Technology Blanchardstown, Dublin, Ireland
Email: markus.hofmann@itb.ie

Abstract—This paper reports on the text analysis (visual and computational) carried out on the lyrics of the rock musician Tom Waits. The analysis focuses mainly on the generally agreed transition marked by his album Swordfishtrombones, which started a new phase in Waits’ 40-year career. A total of ten supervised learners are tested with the aim of separating the high-dimensional space of the word vector (based on his lyrics) into two phases. After initial tests, particular focus is given to the Maximum Entropy classifier, which is examined further with some additional pre-processing techniques. The classifier sheds further light on the two classes, separating them with an accuracy of 95%.

I. INTRODUCTION

We choose as our dataset the lyrics of rock musician Tom Waits, aggregated at the album level. Tom Waits’ musical career spans almost 40 years, and it is commonly agreed that it underwent some kind of “phase transition” with his 1983 album Swordfishtrombones. While this is rather unambiguous regarding his musical style, here we examine whether the change is similarly reflected in his lyrics.

We obtained Waits’ lyrics from Lyrics Mania (http://www.lyricsmania.com/). To do so, we built a custom crawler in R and employed the XML package. Exploiting the XML package and some regularities in the corresponding web pages, we were able to obtain at the crawler’s output a clean .txt file, stripped of all HTML formatting and extra information (advertisements, hyperlinks etc.). Once this was obtained, we manually checked each file for typos and words that were not part of the lyrics (e.g.
“Chorus”, “Instrumental”, composers’ names etc.). The titles of the instrumental (i.e. without lyrics) songs were retained in the corresponding documents.

The titles of Tom Waits’ 20 studio albums that form our document collection (compilations and live albums are not relevant to our analysis) are shown in alphabetical order in Table 1. The last column (Period) indicates whether the particular album belongs to the pre-1983 period (A) or to the later period (B). In total, we have 8 albums from period A and 12 albums from period B.

Our objective is to train classification algorithms on a subset of the albums and use them to predict the period of the remaining albums (test set). This binary classification problem has a small and slightly imbalanced dataset (20 examples, 60/40 balance).

Mining music and lyrics is not uncommon, and some approaches were published in [1], [2], and [3].

The paper first explores the corpus visually in Section 2. Section 3 focuses on a set of initial experiments using a total of ten classification models. The results obtained in Section 3 justify a more focused analysis of Maximum Entropy classification, which is presented in Section 4. Finally, Section 5 provides an extensive discussion and conclusion.

II. VISUALIZATION AND EXPLORATION

First, we would like to have a visual impression of the word usage separately for each of our classes, in order to spot possible differences. For this reason, we construct two wordcloud visualizations, one for each class, after tokenizing, converting to lowercase, and removing punctuation and stopwords. We use tm, the de facto R baseline package for text mining, and the wordcloud package. The results, with simple Term Frequency (TF) weighting, are shown in Figs. 1 and 2 respectively. The first thing that catches our attention in Figs. 1 and 2 is the strong presence of the term cause in class A only.
The term actually comes from ’cause, the abbreviated spoken form of because, and we can immediately see from the visualizations that it constitutes a difference between our two classes. We consider this a useful practical example of the fact that stopwords should always be context-specific: having identified the presence of ’cause in the text, one could certainly be tempted to remove it as a stopword, just like because; however, since this is artistic text where style matters, removing it would undermine the discriminative power of our attempted classification. Similar arguments hold for the term gonna, which is much more pronounced in class B.

Other things worth noticing in Figs. 1 and 2 are the strong presence of the terms world and train in class B only. Some rather expected terms for rock song lyrics are present in both classes, with similar (baby, love, ain) or different (night, heart, home) emphasis. At first glance, it seems that there is indeed room for classification based on the different terms involved in our two classes.

978-1-4799-2572-8/14/$31.00 © 2014 IEEE
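The effect of a context-specific stopword list on Term Frequency counts, as discussed in Section II, can be illustrated with a minimal sketch. It is written in Python rather than with the R tm package used in this paper, so it is not the pipeline that produced Figs. 1 and 2. The function name term_frequencies, the tiny stopword list, and the two example strings are hypothetical placeholders (the strings are invented, not actual lyrics), and dropping single-letter tokens is an added assumption.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (assumption for illustration;
# not the stopword list shipped with the R tm package).
BASE_STOPWORDS = {"the", "and", "you", "to", "is", "it", "in", "because"}


def term_frequencies(text, stopwords):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stopwords,
    and return raw term counts (simple TF weighting)."""
    lowered = text.lower()
    # Replace every character that is not a-z or whitespace with a space,
    # so that "'cause" becomes the token "cause".
    cleaned = re.sub(r"[^a-z\s]", " ", lowered)
    # Assumption: single-letter leftovers (e.g. the "t" of "ain't") are dropped.
    tokens = [t for t in cleaned.split() if len(t) > 1 and t not in stopwords]
    return Counter(tokens)


# Invented placeholder text standing in for the two classes (not real lyrics).
class_a_text = "I stayed home 'cause the night was cold, 'cause the heart is tired"
class_b_text = "The train is gonna leave, the world is gonna turn"

# Context-specific choice: keep "cause" as a term.
tf_a = term_frequencies(class_a_text, BASE_STOPWORDS)
tf_b = term_frequencies(class_b_text, BASE_STOPWORDS)

# Naive choice: treat "cause" like "because" and remove it.
tf_a_naive = term_frequencies(class_a_text, BASE_STOPWORDS | {"cause"})

print(tf_a.most_common(3))
print(tf_b.most_common(3))
print(tf_a_naive["cause"])
```

In this toy setting, cause is counted in the class A text only while it is kept out of the stopword list; adding it to the list removes the very term that distinguishes the two texts.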