
    In part one of this two-part series, we explored basic models and data enrichments for our hit song classifier. In this article, we will try to push the model further and improve its performance through better data exploration, enrichment and feature engineering.

    Before we get started, let’s recall the context.

    The context

    As I explained previously, there’s no shortage of articles and papers trying to explain why a song became a hit, and which features hit songs share. Here, we will go a bit further and enrich the hit song classifier we built in part one.

    In part one we used data from the Billboard Year-End Hot100 Singles Chart between 2010 and 2018. We then enriched the data using Spotify’s API. Our model achieved an accuracy of 93% on the test set.

    In part two we will use data from Genius to enrich our model and do some sentiment analysis. As a reminder, we consider a song a hit only if it reached the top 10 of the most popular songs of the year. Otherwise, it does not count as a hit.

    Feature Generation: The Next Frontier of Data Science

    Data enrichment through Genius

    Genius is a great resource if you are looking for song lyrics. It offers an API that is wrapped in a Python library called lyricsgenius. Start by installing the package (instructions can be found on GitHub).

    You will also need an access token from the Genius developer website.

    Start by importing the package:

    import lyricsgenius as genius
    api = genius.Genius('YOUR_TOKEN_GOES_HERE')

    As before, the API has a powerful search functionality:

    def lookup_lyrics(song):
        # Return the lyrics of the best match, or None if the lookup fails
        try:
            return api.search_song(song).lyrics
        except Exception:
            return None

    You’ll need to create a column “lyrics” that contains the lyrics of each song. This one might take some time.

    df['lyrics'] = df['lookup'].apply(lambda x: lookup_lyrics(x))

    Notice how some of the text is not clean and contains \n to denote a new line or has text between brackets to split sections:

    import re

    def clean_txt(song):
        song = ' '.join(song.split("\n"))    # replace newlines with spaces
        song = re.sub(r"\[.*?\]", "", song)  # drop section markers such as [Chorus]
        return song

    df = df.dropna()  # drop songs without lyrics before cleaning, since clean_txt fails on None
    df['lyrics'] = df['lyrics'].apply(clean_txt)
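
    To see what the cleaning step does, here is a minimal, self-contained sketch that applies the same two transformations to a made-up lyric snippet (the sample text is purely illustrative and not taken from the dataset):

```python
import re

def clean_txt(song):
    # Replace newlines with spaces
    song = ' '.join(song.split("\n"))
    # Drop bracketed section markers such as [Chorus]
    song = re.sub(r"\[.*?\]", "", song)
    return song

# Hypothetical raw lyrics, in the shape the lookup typically returns
raw = "[Verse 1]\nHello there\n[Chorus]\nOh yeah"
cleaned = clean_txt(raw)
print(repr(cleaned))
```

    The result still contains a few extra spaces where the markers used to be. That is harmless here, because the features computed later rely on str.split(), which collapses consecutive whitespace.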

    Some features we could add are:

    • The length of the lyrics
    • The number of unique words used
    • The length of the lyrics without stopwords
    • The number of unique words used without stopwords

    We will use NLTK’s English stop word list. Keep in mind, however, that some of the songs on the Billboard Year-End Hot100 Singles Chart are not in English, so their stop words will not be filtered out.

    from nltk.corpus import stopwords

    # Requires the NLTK stop word corpus: nltk.download('stopwords')
    stop_words = set(stopwords.words('english'))

    def len_lyrics(song):
        # Total number of words
        return len(song.split())

    def len_unique_lyrics(song):
        # Number of distinct words
        return len(set(song.split()))

    def rmv_stop_words(song):
        # Number of words that are not stop words
        return len([w for w in song.split() if w not in stop_words])

    def rmv_set_stop_words(song):
        # Number of distinct words that are not stop words
        return len({w for w in song.split() if w not in stop_words})

    Next, apply this to the dataset:

    df['len_lyrics'] = df['lyrics'].apply(lambda x: len_lyrics(x))
    df['len_unique_lyrics'] = df['lyrics'].apply(lambda x: len_unique_lyrics(x))
    df['without_stop_words'] = df['lyrics'].apply(lambda x: rmv_stop_words(x))
    df['unique_without_stop_words'] = df['lyrics'].apply(lambda x: rmv_set_stop_words(x))
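
    To make the four features concrete, the sketch below computes them on a short made-up lyric. The lyric_features helper is hypothetical, and a tiny hand-written stop word set stands in for NLTK’s English list (an assumption made so the snippet runs without downloading the corpus), so the numbers are illustrative only:

```python
# Tiny stand-in for NLTK's English stop word list (illustrative only)
toy_stop_words = {"i", "you", "the", "and", "oh"}

def lyric_features(song, stop_words):
    """Compute the four lyric-length features for a single song."""
    words = song.split()
    content = [w for w in words if w not in stop_words]
    return {
        "len_lyrics": len(words),
        "len_unique_lyrics": len(set(words)),
        "without_stop_words": len(content),
        "unique_without_stop_words": len(set(content)),
    }

# Made-up lyric, already lowercased
sample = "oh baby baby i love you and you love the night"
features = lyric_features(sample, toy_stop_words)
print(features)
```

    Because the comparison is case-sensitive and NLTK’s list is lowercase, capitalised words such as “The” would not be filtered unless the lyrics are lowercased first.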

    Data exploration

    Just like in the first article, some data exploration might bring us additional insights.

    How many words are used in the lyrics?

    plt.hist(df[df['len_lyrics']<2000]['len_lyrics'], bins=70)  # exclude outliers from the plot
    plt.title("Number of words")


    The histogram above excludes outliers: a few songs contain more than 2,000 words. On average, a song has 467 words, 166 of which are unique.
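
    A minimal sketch of how such averages could be checked with pandas, using a small made-up DataFrame as a stand-in for df (the column names match the ones created earlier; the values are invented for illustration):

```python
import pandas as pd

# Toy stand-in for the real dataset (invented values)
toy = pd.DataFrame({
    "len_lyrics": [400, 500, 600],
    "len_unique_lyrics": [150, 160, 210],
})

# Average number of words and of unique words per song
mean_words = toy["len_lyrics"].mean()
mean_unique = toy["len_unique_lyrics"].mean()

# Average of the per-song ratio of unique words to total words
mean_ratio = (toy["len_unique_lyrics"] / toy["len_lyrics"]).mean()

print(mean_words, mean_unique, round(mean_ratio, 3))
```

    On the real data, the same calls on df['len_lyrics'] and df['len_unique_lyrics'] should give the corresponding averages. Note that the mean of the per-song ratios is generally not equal to the ratio of the two means.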


    The ratio of unique words over total words is 35%. We can also plot the distribution of this ratio:

    plt.hist(df['len_unique_lyrics']/df['len_lyrics'], bins=50)
    plt.title("Ratio Unique Words over total words")


    The vast majority of the songs do not exceed 40% unique words, which reflects the balance hit songs strike between repetitive lyrics and a diverse vocabulary. To illustrate the diversity of the vocabulary used in the songs, we can also compute the ratio of non-stop words to all words.


    Once the stop words are removed, the average ratio is much higher: a large part of the vocabulary used in these songs seems to be made up of stop words. What are the most common words that singers use in their songs?

    from wordcloud import WordCloud, STOPWORDS
    all_lyrics = ' '.join(df['lyrics'].values)
    wc_stopwords = set(STOPWORDS)  # separate name, so the NLTK stopwords import is not shadowed
    wordcloud = WordCloud(stopwords=wc_stopwords, background_color="white").generate(all_lyrics)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")


    We won’t spend too much time commenting on this, but “yeah,” “oh,” and “baby” should definitely be on your hit-song to-do list following your data exploration.

    Lyrics sentiment

    Should a song be positive? Negative? Neutral? To assess the positiveness of a song and its intensity, we will use the Valence Aware Dictionary and sEntiment Reasoner (VADER), a lexicon and rule-based sentiment analysis tool available on GitHub. It relies on a lexicon of over 7,500 entries rated by human annotators. This kind of approach predates modern machine-learning-based NLP, but it is still useful in cases like this one, where we have neither labeled data nor a trained model for song sentiment classification.

    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    analyzer = SentimentIntensityAnalyzer()
    df['sentimentVaderPos'] = df['lyrics'].apply(lambda x: analyzer.polarity_scores(x)['pos'])
    df['sentimentVaderNeg'] = df['lyrics'].apply(lambda x: analyzer.polarity_scores(x)['neg'])
    df['sentimentVaderComp'] = df['lyrics'].apply(lambda x: analyzer.polarity_scores(x)['compound'])
    df['sentimentVaderNeu'] = df['lyrics'].apply(lambda x: analyzer.polarity_scores(x)['neu'])
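
    To give an intuition for how a lexicon-based scorer works, here is a deliberately simplified sketch. The three-word lexicon and the toy_compound helper are hypothetical, and real VADER also handles negation, intensifiers, punctuation and capitalisation, which this sketch ignores. The normalisation x / sqrt(x² + 15) follows, as far as we can tell, the one used in the vaderSentiment source for the compound score:

```python
import math

# Hypothetical mini-lexicon: word -> valence (made-up values)
toy_lexicon = {"love": 3.0, "happy": 2.5, "hate": -3.0}

def toy_compound(text, lexicon, alpha=15):
    # Sum the valence of every known word; unknown words count as 0
    total = sum(lexicon.get(w, 0.0) for w in text.lower().split())
    # Squash the raw sum into the open interval (-1, 1)
    return total / math.sqrt(total * total + alpha)

positive = toy_compound("I love this happy song", toy_lexicon)
negative = toy_compound("I hate this", toy_lexicon)
neutral = toy_compound("just some words", toy_lexicon)
print(positive, negative, neutral)
```

    The longer and more emotionally loaded the text, the closer the score gets to -1 or 1, which is one reason why full lyrics tend to produce more extreme compound scores than single sentences.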

    We can also create a feature that is the difference between the positive and the negative score:

    df['Vader'] = df['sentimentVaderPos'] - df['sentimentVaderNeg']

    What are the sentiments expressed in the songs?

    plt.hist(df['Vader'], bins=50)
    plt.axvline(0, c='r')
    plt.title("Average sentiment")


    On average, the sentiment is slightly positive. Some songs carry strong sentiment (i.e., more than 0.5 in absolute value), but most are more moderate. This approach is limited, however: it derives the sentiment of a song by aggregating word-level scores and does not understand content or context.

    New model

    Now, let’s train a new model and see whether the performance has improved. First, we create the train and test sets and apply oversampling:

    from imblearn.over_sampling import SMOTE
    from sklearn.model_selection import train_test_split

    X = df.drop(["Artist_Feat", "Artist", "Artist_Feat_Num", "Title", "Hit", "lookup", "release_date", "genres", "lyrics"], axis=1)
    y = df["Hit"]
    sm = SMOTE(random_state=42)
    X_res, y_res = sm.fit_resample(X, y)
    X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.2, random_state=42)
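
    One caveat with this ordering: because oversampling happens before the split, synthetic rows derived from a given song can end up in both the training and the test set, which may inflate the test accuracy. A common alternative is to split first and oversample only the training portion. The sketch below illustrates that ordering with plain random duplication standing in for SMOTE (a simplification chosen so the snippet runs with the standard library alone; the helper name and the toy data are hypothetical):

```python
import random

def split_then_oversample(rows, labels, test_size=0.2, seed=42):
    """Hold out a test set first, then balance only the training rows."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    n_test = int(len(idx) * test_size)
    test_idx, train_idx = idx[:n_test], idx[n_test:]

    # Group the training indices by class
    by_class = {}
    for i in train_idx:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(members) for members in by_class.values())

    # Duplicate random rows of each smaller class until all classes match
    for members in by_class.values():
        train_idx += [rng.choice(members) for _ in range(target - len(members))]

    X_train = [rows[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    X_test = [rows[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Toy data: 50 songs, 12 of them hits (label 1)
rows = [(i, i * 2) for i in range(50)]
labels = [1] * 12 + [0] * 38
X_train, X_test, y_train, y_test = split_then_oversample(rows, labels)
print(len(X_train), len(X_test), y_train.count(1), y_train.count(0))
```

    With the libraries used in this article, the equivalent would be to call train_test_split on X and y first and then sm.fit_resample on X_train and y_train only, leaving the test set untouched.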

    Then, we define the random forest classifier and train the model:

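    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    rf = RandomForestClassifier(n_estimators=100)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    accuracy_score(y_test, y_pred)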

    The accuracy improves by about five percentage points and reaches 98.3%.

    What are the most important features of this new model?

    importances = rf.feature_importances_
    indices = np.argsort(importances)
    plt.title('Feature Importances')
    plt.barh(range(len(indices)), importances[indices], align='center')
    plt.yticks(range(len(indices)), [X.columns[i] for i in indices])
    plt.xlabel('Relative Importance')


    The order of the important features remains the same, but the compound sentiment feature is now one of the most important features.

    Making predictions

    Prediction function

    We can build a predictor that takes the name of the song and the singer as an input, creates the features, and outputs the probability of a song being a hit. Since the algorithm has never been trained on songs from 2019, we can feed it with recent songs and observe the outcome.

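    Let’s recall the whole pipeline first: starting from the artist and the title, we derive the featuring and title-length features, look up the Spotify attributes, extract the release-date features, fetch and clean the lyrics from Genius, compute the lyric-length and sentiment features, and finally feed everything to the random forest.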

    Let’s build this pipeline and try it with “Lover” by Taylor Swift, a song that was recently released when we wrote this article:

    def model_prediction(artist, title):
        df_pred = pd.DataFrame.from_dict({"Artist": [artist], "Title": [title]})  # assumed columns, inferred from their use below
        df_pred["Featuring"] = df_pred.apply(lambda row: featuring(row['Artist']), axis=1)
        df_pred["Artist_Feat"] = df_pred.apply(lambda row: featuring_substring(row['Artist']), axis=1)
        df_pred['Title_Length'] = df_pred['Title'].apply(lambda x: num_words(x))
        df_pred['lookup'] = df_pred['Title'] + " " + df_pred["Artist_Feat"]
        (df_pred['available_markets'], df_pred['release_date'], df_pred['total_followers'],
         df_pred['genres'], df_pred['popularity'], df_pred['acousticness'], df_pred['danceability'],
         df_pred['duration_ms'], df_pred['energy'], df_pred['instrumentalness'], df_pred['key'],
         df_pred['liveness'], df_pred['loudness'], df_pred['speechiness'], df_pred['tempo'],
         df_pred['time_signature'], df_pred['valence']) = zip(*df_pred['lookup'].map(artist_info))
        df_pred['release_date'] = pd.to_datetime(df_pred['release_date'])
        df_pred['month_release'] = df_pred['release_date'].apply(lambda x: x.month)
        df_pred['day_release'] = df_pred['release_date'].apply(lambda x: x.day)  # day of the month (assumed)
        df_pred['weekday_release'] = df_pred['release_date'].apply(lambda x: x.weekday())
        df_pred['lookup'] = df_pred['Title'] + " " + df_pred["Artist"]
        df_pred['lyrics'] = df_pred['lookup'].apply(lambda x: lookup_lyrics(x))
        df_pred['lyrics'] = df_pred['lyrics'].apply(lambda x: clean_txt(x))
        df_pred['len_lyrics'] = df_pred['lyrics'].apply(lambda x: len_lyrics(x))
        df_pred['len_unique_lyrics'] = df_pred['lyrics'].apply(lambda x: len_unique_lyrics(x))
        df_pred['without_stop_words'] = df_pred['lyrics'].apply(lambda x: rmv_stop_words(x))
        df_pred['unique_without_stop_words'] = df_pred['lyrics'].apply(lambda x: rmv_set_stop_words(x))
        df_pred['sentimentVaderPos'] = df_pred['lyrics'].apply(lambda x: analyzer.polarity_scores(x)['pos'])
        df_pred['sentimentVaderNeg'] = df_pred['lyrics'].apply(lambda x: analyzer.polarity_scores(x)['neg'])
        df_pred['sentimentVaderComp'] = df_pred['lyrics'].apply(lambda x: analyzer.polarity_scores(x)['compound'])
        df_pred['sentimentVaderNeu'] = df_pred['lyrics'].apply(lambda x: analyzer.polarity_scores(x)['neu'])
        df_pred['Vader'] = df_pred['sentimentVaderPos'] - df_pred['sentimentVaderNeg']
        X = df_pred.drop(["Artist_Feat", "Artist", "Title", "lookup", "release_date", "genres", "lyrics"], axis=1).astype(float)
        y_pred = rf.predict_proba(X)
        print("It's NOT a hit with probability : " + str(y_pred[0][0]))
        print("It's a hit with probability : " + str(y_pred[0][1]))
        return y_pred

    We can create an interactive widget in the notebook that asks the user for the name of the artist and the title of the song, and outputs the prediction.

    from ipywidgets import widgets, interact
    artist = widgets.Text()
    title = widgets.Text()
    ui = widgets.HBox([artist, title])
    def f(artist, title):
        return model_prediction(artist, title)

    And in the next cell, type:

    interact(f, artist='Taylor Swift', title='Lover')


    According to our algorithm, there is only a 22% chance that the song “Lover” by Taylor Swift will make it to the top 10 of the most popular songs of 2019. This seems plausible, since the song “only” peaked at number 10 on the Billboard Hot 100, and only for a few days.


    In this article, we illustrated the importance of external data sources for most data science problems. A good enrichment dataset (even better if the enrichment is automated) can boost the performance of your model, and relevant feature engineering can add further gains.

    Here is a performance summary of the different steps of our model:

    Description                          Model           Performance
    Data from Billboard                  Decision Tree   F1-Score: 6.6%
    Enrich with Spotify and oversample   Random Forest   Accuracy: 93%
    Enrich with Genius                   Random Forest   Accuracy: 98%
