Project 3: Why are Songs Popular?

Will Chiappetta
Oct 21, 2022
Updated: Oct 27, 2022

Introduction to the Problem

People all around the world love listening to music, but what makes a song popular? Wouldn't it be nice to know which features of a song have a big effect on whether it becomes popular? I want to find out how different features of songs affect their popularity. I am going to do this by building several linear regression models, each using different features to predict song popularity.

Introduce the Data

The data that I chose to use came from Kaggle, a website that hosts public data sets. Multiple studies have tried to understand a song's popularity based on certain factors. The columns we are going to look at are song name, song popularity, song duration in milliseconds, acousticness, danceability, energy, instrumentalness, key, liveness, loudness, audio mode, speechiness, tempo, time signature, and audio valence.

What is regression?

Linear regression is used to understand the relationship between numerical input and output variables. It is a linear model: it assumes a linear relationship between the input variable x and the single output variable y. The point of linear regression is to predict an output from a set of input values. For a simple regression model the formula is y = B0 + B1*x, where B0 is the intercept and B1 is the slope. For a higher-dimensional model there is more than one input, so the formula becomes y = B0 + B1*x1 + B2*x2 + ... + Bn*xn.
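To make the formula concrete, here is a minimal sketch that recovers B0 and B1 with ordinary least squares. The data is made up for illustration (generated from y = 2 + 3*x) and has nothing to do with the song data set; the variable names are my own.

```python
import numpy as np

# Made-up, perfectly linear data generated from y = 2 + 3*x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# Design matrix: a column of ones (for B0) next to the x values (for B1).
A = np.column_stack([np.ones_like(x), x])

# Ordinary least squares: find the B0 and B1 that minimize the squared error.
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
b0, b1 = coeffs

# Use the fitted line to predict the output for a new input.
prediction = b0 + b1 * 5.0
print(b0, b1, prediction)
```

Because this toy data has no noise, the fitted line passes through every point. Real data like song popularity is much messier, so the line only captures the overall trend.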

Pre-processing/Data understanding

Now I need to find out if my data needs to be cleaned. The first thing I want to check is if there are any nulls in my data and if all of the data types are usable in the analysis. A linear regression model needs numerical data.

What you can see from this is that there are no nulls in the data because every Non-Null Count is equal to the number of rows. You can also see that we are not going to be able to use the song_name column because its Dtype is object. This means that we will need to drop song_name when doing our analysis.
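A check like this can be sketched with pandas. The tiny DataFrame below is a made-up stand-in for the song data (the values are not real), and the column subset is only for illustration.

```python
import pandas as pd

# Tiny made-up stand-in for the song data; the values are not real.
df = pd.DataFrame({
    "song_name": ["Song A", "Song B", "Song C"],
    "song_popularity": [73, 66, 80],
    "danceability": [0.50, 0.54, 0.74],
})

# Prints one row per column with its Non-Null Count and Dtype.
df.info()

# Count missing values per column; all zeros means there are no nulls.
null_counts = df.isnull().sum()

# Find the columns that are not numeric and so cannot go into the regression.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
non_numeric_cols = [c for c in df.columns if c not in numeric_cols]

print(null_counts)
print(non_numeric_cols)
```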

Now I want to use SelectKBest to find the best features to use for analysis. For SelectKBest and StandardScaler we need to separate our features from the target variable. We will call the features X and drop song_popularity and song_name from them, because song_popularity is our target variable and we cannot use song_name.

Next we need to standardize the data. We need to convert all of the values in our features into z-scores. A z-score tells us how many standard deviations a value is from the mean of its column. This puts every feature on the same scale, which makes the values easier to compare. This will help when we are making our linear regression model.
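The separation and standardization steps can be sketched like this. The DataFrame is again a made-up stand-in with only two feature columns, and the manual calculation is included just to show what StandardScaler is doing.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up stand-in for the song data; the values are not real.
df = pd.DataFrame({
    "song_name": ["Song A", "Song B", "Song C", "Song D"],
    "song_popularity": [73, 66, 80, 41],
    "danceability": [0.50, 0.54, 0.74, 0.45],
    "loudness": [-4.1, -6.4, -7.2, -10.3],
})

# Features: everything except the target and the non-numeric song name.
X = df.drop(columns=["song_popularity", "song_name"])
y = df["song_popularity"]

# StandardScaler turns each column into z-scores: (value - mean) / std.
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

# The same calculation by hand. StandardScaler uses the population
# standard deviation, which is ddof=0 in pandas.
manual = (X - X.mean()) / X.std(ddof=0)

print(X_scaled)
print(np.allclose(X_scaled, manual))
```

After scaling, every column has a mean of 0 and a standard deviation of 1, so a danceability value and a loudness value can be compared directly.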

Now that we have standardized all of the data we can locate our best features.
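Here is a rough sketch of how SelectKBest picks features. I am assuming the f_regression score function, which ranks each feature by a univariate F-test against the target; the data and the column names (feat_a, feat_b, feat_c, noise) are synthetic and hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: three columns influence the target, one is pure noise.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "feat_a": rng.normal(size=n),
    "feat_b": rng.normal(size=n),
    "feat_c": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y = 3.0 * X["feat_a"] - 2.0 * X["feat_b"] + 1.5 * X["feat_c"] + rng.normal(size=n)

# Score every feature against the target and keep the top three.
selector = SelectKBest(score_func=f_regression, k=3)
selector.fit(X, y)

# get_support() returns a boolean mask over the original columns.
selected = X.columns[selector.get_support()].tolist()
print(selected)
```

Note that each feature is scored on its own, so SelectKBest does not account for how features work together.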

The best features, the ones most strongly related to our target variable, are danceability, instrumentalness, and loudness. Next I want to look at some visualizations of the data. I am going to see if each of these features has a linear relationship with the others.

Let's start out with danceability and instrumentalness.

There does not seem to be a correlation between these variables.

Next let's look at loudness with danceability.

There might be a slight negative correlation with these variables.

This next one is between loudness and instrumentalness.

There also might be a slight negative correlation between these variables.

This one is also between loudness and danceability but with danceability as the x this time.

There does not seem to be a correlation between these variables.

This graph is between instrumentalness and danceability but with instrumentalness as the x this time.

There might be a slight negative correlation between these two variables.

The final graph is between instrumentalness and loudness with instrumentalness as the x this time.

There also might be a slight negative correlation between these two.

You can see from all of these graphs that none of these features is strongly correlated with the others. This is most likely because there is not only one thing that makes a song popular.

After seeing these visualizations I have decided that I will do three linear regression models. Each model will use one of the best features to predict our target variable: song_popularity. Even though I do not expect a large correlation, I still think it will be interesting to see.

Time to start modeling!

Experiment 1:

Modeling

First we need to create the X variable for our first experiment, which is danceability. We need to decide how we are going to split our data into training and test sets. I chose to train on 70% and test on 30%. Now we can start to train our linear regression model on the training data.
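The split and fit steps look roughly like this with scikit-learn. The danceability and popularity values below are synthetic, generated to have a weak positive relationship, and random_state is only there to make the sketch repeatable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: a weak positive trend buried in a lot of noise.
rng = np.random.default_rng(42)
n = 1000
danceability = rng.uniform(0.0, 1.0, size=n)
popularity = 45.0 + 10.0 * danceability + rng.normal(scale=20.0, size=n)

# scikit-learn expects a 2D feature array, even for a single feature.
X = danceability.reshape(-1, 1)
y = popularity

# Hold out 30% of the rows for testing and train on the other 70%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit the line y = intercept + slope * danceability on the training rows only.
model = LinearRegression()
model.fit(X_train, y_train)

print(model.coef_[0], model.intercept_)
```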

Our model has now been trained and we can move on to evaluating our model.

Evaluation

Let's take a look at an OLS model to get some statistical insight on our model.

There is a lot to look at in this table, but the main things I want to highlight are the P>|t| column, which is the p-value, and the R-squared. The p-value is very low, which shows that danceability can be considered statistically significant when predicting song popularity. However, the R-squared value is only 1.2%, meaning the model explains just 1.2% of the variation in song popularity. This is extremely low; I would want to see at least 50%. Basically, this means that danceability on its own is not good enough to predict whether a song is popular. I believe this is because there are so many different factors that make a song popular that you cannot really predict why songs become popular.
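A very low p-value together with a very low R-squared can look contradictory, so here is a small sketch of how both can happen at once. It uses scipy.stats.linregress on synthetic data (not the song data); the sample size and noise level are assumptions chosen to mimic a weak relationship.

```python
import numpy as np
from scipy.stats import linregress

# Synthetic data: a real but weak trend, lots of noise, and many rows.
rng = np.random.default_rng(1)
n = 10_000
x = rng.uniform(0.0, 1.0, size=n)
y = 50.0 + 8.0 * x + rng.normal(scale=20.0, size=n)

result = linregress(x, y)

# R-squared: the share of the variation in y that the line explains.
r_squared = result.rvalue ** 2

# p-value: how surprising this slope would be if the true slope were zero.
p_value = result.pvalue

print(r_squared, p_value)
```

With this many rows, even a tiny trend is easy to tell apart from zero, so the p-value is small. The R-squared stays small because the trend explains very little of the spread in y.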

We will continue to look at this data because I would like to see if something will change with different variables or if it will be more of the same.

Now we will look at some stats on the test set.

From this we can tell again that, since our coefficient of determination is low, our model is not good for predicting. We also got our slope and y-intercept, which we can use to make visualizations.
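For reference, the coefficient of determination (R-squared) compares the model's squared error to the squared error of always guessing the mean. This sketch works it out by hand on four made-up values and checks it against scikit-learn's r2_score.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted popularity values.
y_true = np.array([40.0, 55.0, 60.0, 75.0])
y_pred = np.array([50.0, 55.0, 58.0, 62.0])

# Squared error of the model's predictions.
ss_res = np.sum((y_true - y_pred) ** 2)

# Squared error of a baseline that always predicts the mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

# R-squared = 1 - (model error / baseline error).
r2_manual = 1.0 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_pred)

print(r2_manual, r2_sklearn)
```

A value near 0 means the model does barely better than guessing the mean every time, which is the situation in these experiments.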

Now we can see what our regression model actually looks like.

You can see that there is a slight positive correlation in the data between a song being more danceable and it being popular.

I also wanted to look at a lowess smoother to see how the line of best fit rises or falls as it moves through the data, because it is difficult to tell from the scatter plot where most of the data is.

You can see that there is a slight increase in the slope at around 0.7 danceability. This suggests that the songs on that side of the graph pull the slope up.

Finally, for this experiment, let's look at some final statistics on the model.

What we are going to look at here is the root mean square error, or RMSE. The RMSE is the typical size of the model's prediction error, measured in the same units as song popularity. We got around 22, which means that if the model predicts 50 for popularity, the actual value could easily be anywhere from about 28 to 72. This is obviously a bad result, but it is better than I thought it would be after seeing my r-squared score.
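The RMSE calculation itself is short. Here is a sketch on four made-up values, where the model predicts the same popularity for every song, much like a nearly flat regression line would.

```python
import numpy as np

# Made-up actual popularity values and a model that always predicts 52.
y_true = np.array([30.0, 50.0, 70.0, 90.0])
y_pred = np.array([52.0, 52.0, 52.0, 52.0])

# Error for each song, squared so that misses in both directions count.
squared_errors = (y_true - y_pred) ** 2

# Mean of the squared errors, then the square root to get back to
# the original units (popularity points).
mse = np.mean(squared_errors)
rmse = np.sqrt(mse)

print(mse, rmse)
```

Because large misses are squared before averaging, a few badly predicted songs raise the RMSE more than many small misses do.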

Time to move to the next experiment to see if we can do any better.

Experiment 2:

Modeling

We are going to start off the same way as the last model and create our X variable, which is instrumentalness. I have hopes for this model to be better because I think that the instrumentals in a song are very important for it to be popular. We also need to create our training data and test data. I have decided to do a 70/30 train/test split again. Finally, we will create our regression model.

Next is the evaluation of the model.

Evaluation

We will again be looking at an OLS model because it shows many useful statistical values very well.

First, we will look at the p-value again, which is very low, making instrumentalness significant for predicting song popularity. Looking at the r-squared again, we can tell that this model did slightly better than the last one because it got 1.9% and the last one got 1.2%. This is still a terrible score because it is so low. It means that the data is not good enough to accurately predict if a song will be popular based on instrumentalness.

Let's look at some statistics on the test set once again.

We can see from our coefficient of determination that it is still very low for instrumentalness. It is still a little higher than danceability. This means that it is a little better at predicting song popularity, but it is still terrible.

Now let's look at what the model looks like.

From this visualization you can see that there is a slight negative correlation: songs with more instrumentals tend to be a little less popular.

We will again look at a lowess smoother graph to see where the line of best fit changes the most.

From this visualization you can better see that there is a negative correlation because there is a big drop in the beginning, but it is pretty even across the rest of the way.

Now we will look at the RMSE again.

The RMSE is almost exactly the same as in the last experiment. Once again the typical prediction error is about 22. This is still a very bad score and shows that the model does not predict very well.

Let's now move on to our final experiment to see if this one can do better than the last two.

Experiment 3:

Modeling

For this final experiment we are going to start off the same way again by creating our X variable. This time we are going to be looking at how loudness affects song popularity. We need to split our data into training and test sets. I again decided to do a 70/30 train/test split. Then we will need to make our linear regression model for our final experiment.

Now that this model is trained we can start to evaluate the model.

Evaluation

We will be looking at the OLS model again to make our observations.

In this model we can see again that the p-value is still low, which shows that loudness can be considered significant. When we look at the r-squared, this is the lowest value we have seen, at 1%. This is not good, and we cannot predict song popularity based on this model.

Once again we will look at some statistics for the test set.

When looking at this model's coefficient of determination we can see that it is once again very low and will not be good for predicting song popularity. On the test set this one is better than the first model, but worse than the second one.

Now let's look at some visualizations of the regression model.

This model also has a positive correlation just like the first model. You can see that as the loudness of the song increases, the popularity might increase.

Again we will look at a lowess smoother visualization to see what parts of the graph have a higher impact on the model.

From this model you can see that one part of the graph has a huge impact on the line of best fit. When loudness is between -15 and -5, popularity increases tremendously, but anywhere else the popularity decreases. This region was also enough to give the line of best fit a positive slope.

Finally, we will end this model by looking at some more statistical values.

The RMSE for this model is also pretty much the same as for the previous two models. It is still at 22, which leaves a lot of error in the model's predictions. This still means that this model is not very useful and not a strong model.

Experiment 4: (just for fun)

Just for fun I wanted to do one final experiment to see if using all of the data in the data set would help the model be more accurate.

I trained a linear regression model on all of the features in the data set. To see how this model did, I once again created my OLS model.

You can see from this again that the data is not good at predicting song popularity. The r-squared is 4.8%, which is barely higher than before. This shows you cannot predict song popularity accurately with this data.
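One thing worth keeping in mind when comparing these numbers: on the training data, the R-squared of an ordinary least squares model cannot go down when you add more features, so a small increase is expected even from weak features. Here is a sketch on synthetic data (the feature matrix and coefficients are made up) that compares a one-feature model with an all-feature model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: four features, only the first two affect the target.
rng = np.random.default_rng(7)
n = 500
X_all = rng.normal(size=(n, 4))
y = 5.0 * X_all[:, 0] + 2.0 * X_all[:, 1] + rng.normal(scale=10.0, size=n)

# Model 1: a single feature. Model 2: every feature.
X_single = X_all[:, [0]]
single = LinearRegression().fit(X_single, y)
full = LinearRegression().fit(X_all, y)

# score() returns R-squared, here measured on the same rows used for training.
r2_single = single.score(X_single, y)
r2_full = full.score(X_all, y)

print(r2_single, r2_full)
```

This is why it helps to also check a score on held-out test data, or the adjusted R-squared in the OLS table, before deciding that extra features really helped.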

Even though you can see that it still does not have a high prediction accuracy, I still want to see the RMSE score.

Once again the RMSE score is basically a 22 which is the same as all of the models from before. Using all of the data in the data set did not help the model perform any better.

Impact

There could be a few impacts from my analysis. One is that artists could model their new songs after this data to try to make them more popular. This could be both a positive and a negative: there might be more popular songs, which is a positive, but all songs might start to sound the same, which is a negative. Another impact could be that someone takes my results about which features affect song popularity the most too literally and focuses only on them when making a song. While these features are important, there is more to a song than danceability, instrumentalness, and loudness.

Conclusion

My results from my three models have shown me that no single feature has a strong correlation with song popularity. The three features that should have had the highest correlation did not produce high model performance, which means the models do not predict well and are not very useful. I think there was not a strong correlation with any of the variables because I do not think there is a way to predict if a song will be popular or not. This is because so many different types of songs have been popular and they all have different styles.

This project has taught me how to better use Python and Jupyter Notebook. I have learned how to make a linear regression model and that not all models will have a high accuracy. I also learned more about statistical data and how to better analyze data. I am excited to get better at coding and learn more about Data Mining as this class continues.

References/Code

Regression information: https://machinelearningmastery.com/linear-regression-for-machine-learning/

Code help: https://seaborn.pydata.org/tutorial/regression.html

My Code
