Hello! If you are seeing this, it must mean you are interested in my work. I am very thankful for your attention, so let me take you through this analysis in my portfolio.
The short’n sweet (Sabrina Carpenter reference?) explanation is that I love music, so when I wanted to create a portfolio to show off the Data Engineering and Data Science skills I have accumulated over the past few years, analyzing Billboard data was my first thought.
Now, why the Billboard 200 instead of the Billboard 100? I am glad you asked! My favorite artist is Ethel Cain, and at the time I was doing this project her album Preacher’s Daughter had just broken into the top 10 of the Billboard 200 chart with the debut of her first vinyl record release.
Besides that, I didn’t come into this analysis with any particular question in mind, but over the course of Data Wrangling a few questions came to me that I will try to answer: “What trends are associated with an album’s genre?”, “Is there a correlation between album length and success?”, and “Is there a correlation between the number of tracks on an album and the album’s success?”
Now that you know why I embarked on this journey, let me tell you a little bit about what I found. Fortunately, the University of Texas at Austin has a GitHub repository of the Billboard 100 & 200 charts, so the data I have, prior to outlier removal or data transformations, runs from April 15, 1967 to April 19, 2025: 58 years’ worth of data, nearly to the day.
One of the first things I thought was: “It sure would be useful to have a unique identifier for each album.” That would make joining data so much easier. In this Data Wrangling stage I learned a lot about how albums are categorized. One of the most important things I learned about was the MusicBrainz ID, a unique identifier for albums maintained by the MusicBrainz community. Fortunately, MusicBrainz has an API that lets you query for albums with just the album name and artist. There are a lot of unique albums in 58 years of data, though, so I restricted the data to just the Top 10 of each chart week, which alone brought the number of rows down to ~30,000. Not every album had a MusicBrainz ID that worked with the other APIs I used for supplemental data, so in total I am analyzing ~23,000 rows, prior to outlier removal or data transformations. That data still spans the full 58 years and is about 75% of the Top 10 album data available.
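As a rough illustration of the lookup described above, here is a minimal Python sketch (the project itself was done in R with httr2, so this is not the original code). It assumes the MusicBrainz release-group search endpoint and its `releasegroup` and `artist` search fields; the project may well have queried releases instead. The helper names `build_search_url` and `first_mbid` are made up for this example, and the sketch only builds the request URL and reads an already-parsed JSON response rather than calling the API.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumption: release groups are the "album" entity being matched.
MB_SEARCH_URL = "https://musicbrainz.org/ws/2/release-group"


def _quote(term: str) -> str:
    """Wrap a term in double quotes, escaping backslashes and quotes inside it."""
    return '"' + term.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_search_url(album: str, artist: str, limit: int = 1) -> str:
    """Build a search URL that matches on album title and artist name."""
    query = f"releasegroup:{_quote(album)} AND artist:{_quote(artist)}"
    params = {"query": query, "fmt": "json", "limit": limit}
    return MB_SEARCH_URL + "?" + urlencode(params)


def first_mbid(response: dict) -> Optional[str]:
    """Return the ID of the first match, or None when nothing matched."""
    groups = response.get("release-groups", [])
    return groups[0]["id"] if groups else None


if __name__ == "__main__":
    print(build_search_url("Preacher's Daughter", "Ethel Cain"))
```

If you do send these requests, MusicBrainz asks for a descriptive User-Agent header and a request rate of roughly one per second, which is part of why ~30,000 lookups take a while.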
Now with a unique identifier for each album, I wanted to pair this data with more data. I needed the predominant genre of the album (although we know an album can span more than one genre), how many tracks were on the album, the run time of the album, and any listener data I could find. After a quick Google search I decided on using the API associated with LastFM, so this analysis is:
After that I had a DataFrame with the columns: chart week, current_week, MusicBrainz ID, album title, performer, peak position, weeks on chart, genre, listeners, play count, total tracks, duration.
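To make the supplemental columns a bit more concrete, here is a hedged Python sketch (again, not the original R code) of how genre, listeners, play count, total tracks, and duration could be pulled from a Last.fm `album.getInfo` response looked up by MusicBrainz ID. The field names reflect my understanding of that JSON response and should be checked against the Last.fm documentation. Using the first tag as the genre is an assumption, and `build_album_info_url` and `flatten_album` are hypothetical helper names.

```python
from urllib.parse import urlencode

LASTFM_ROOT = "https://ws.audioscrobbler.com/2.0/"


def build_album_info_url(mbid: str, api_key: str) -> str:
    """Build an album.getInfo request URL keyed on a MusicBrainz ID."""
    params = {
        "method": "album.getInfo",
        "mbid": mbid,
        "api_key": api_key,
        "format": "json",
    }
    return LASTFM_ROOT + "?" + urlencode(params)


def _as_list(value):
    """A single item can come back as an object instead of a list."""
    return [value] if isinstance(value, dict) else list(value or [])


def flatten_album(response: dict) -> dict:
    """Reduce a parsed album.getInfo response to one row of supplemental data."""
    album = response["album"]
    # "tags" / "tracks" may be missing or an empty string when there is no data.
    tags = _as_list((album.get("tags") or {}).get("tag"))
    tracks = _as_list((album.get("tracks") or {}).get("track"))
    return {
        "genre": tags[0]["name"] if tags else None,  # assumption: first tag = genre
        "listeners": int(album["listeners"]),
        "playcount": int(album["playcount"]),
        "total_tracks": len(tracks),
        # Missing track durations count as 0, in whatever unit the API returns.
        "duration": sum(int(t.get("duration") or 0) for t in tracks),
    }
```

Each flattened row can then be joined back to the chart data on the MusicBrainz ID.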
Let’s take a look at the data
We can see that with no transformations most of the data is skewed.
With log(x + 1) transformations the data is better, but still not great. Let’s see what happens if we remove outliers using a Tukey Fence.
Duration is looking very good now, but let’s log transform the others and see where they land.
Total Tracks is looking good and the other two are certainly better, so maybe we can try another transformation.
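For readers who have not met a Tukey Fence before, the outlier removal and log(x + 1) steps used so far can be sketched in a few lines of Python (the project itself used R, so this is only an illustration). The multiplier `k = 1.5` is the conventional default and is an assumption here, as is the quartile method; the function names are made up for the example.

```python
import math
from statistics import quantiles


def tukey_fences(values, k=1.5):
    """Return (low, high) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Keep only the values that fall inside the Tukey fences."""
    low, high = tukey_fences(values, k)
    return [v for v in values if low <= v <= high]


def log1p_all(values):
    """Apply the log(x + 1) transformation to every value."""
    return [math.log1p(v) for v in values]


if __name__ == "__main__":
    track_counts = [10, 12, 11, 13, 12, 11, 10, 200]  # made-up numbers
    print(log1p_all(remove_outliers(track_counts)))
```

In practice the fences would be computed per column, so an album dropped for an extreme play count does not have to be extreme in duration.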
Let’s try to customize the transformations by using Yeo-Johnson transformations. We know that all values in the columns we want to transform are greater than 0, and our prior log transformation was the same as the case \(\lambda=0\), so we will use the form of the formula where \(\lambda\) is any real number not equal to 0.
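For non-negative values, the Yeo-Johnson transformation is \(\psi(x,\lambda) = \frac{(x+1)^{\lambda}-1}{\lambda}\) when \(\lambda \neq 0\) and \(\psi(x,0) = \log(x+1)\) when \(\lambda = 0\). A minimal Python sketch of just that branch (the project itself used R, and the function name is made up for this example):

```python
import math


def yeo_johnson(x: float, lam: float) -> float:
    """Yeo-Johnson transform of a non-negative value x with parameter lam."""
    if x < 0:
        raise ValueError("this sketch only covers the x >= 0 branch")
    if lam == 0:
        return math.log1p(x)  # same as the earlier log(x + 1) transformation
    return ((x + 1.0) ** lam - 1.0) / lam


if __name__ == "__main__":
    # Smaller lambdas compress large values more strongly.
    for lam in (0, 1 / 8, 1 / 5):
        print(lam, yeo_johnson(1_000_000, lam))
```

The negative branch of the transformation is left out on purpose, since none of the columns here contain negative values.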
Now that most of our data visually appears to be normally distributed, we can move on to look at correlations between the variables. To recap:
Removing outliers alone relatively normalized the Duration variable
Removing outliers and a natural log transformation relatively normalized the Total Tracks variable
Removing outliers and a Yeo-Johnson transformation with \(\lambda=1/5\) relatively normalized the Listeners variable
Removing outliers and a Yeo-Johnson transformation with \(\lambda=1/8\) relatively normalized the Playcount variable
We usually equate success with a higher position on the charts, e.g. it is better to be #1 than #10.
We can see that when not broken out by genre, there isn’t much of a difference from #1 to #10 in regard to duration, total number of tracks, number of listeners, and number of plays. We can see that from the nearly flat trend line.
When breaking out by genre, we do see a slight correlation between longer album duration and chart success for Hip-Hop and R&B, so if you wanted a hit Hip-Hop album it might be worth pushing the run time to be a little longer, but that is just a slight correlation.
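One way to put a number on "slight" is to compute the slope of a least-squares trend line per genre. The Python sketch below is an illustration under assumptions, not the original R code: the row keys `genre`, `position`, and `duration` are hypothetical column names, and a plain least-squares line stands in for whatever smoother the plots actually used.

```python
from collections import defaultdict


def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs (nan if xs never vary)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return float("nan")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def slope_by_genre(rows, x_key, y_key):
    """Group rows by genre and fit one trend-line slope per group."""
    grouped = defaultdict(lambda: ([], []))
    for row in rows:
        xs, ys = grouped[row["genre"]]
        xs.append(row[x_key])
        ys.append(row[y_key])
    return {genre: ols_slope(xs, ys) for genre, (xs, ys) in grouped.items()}
```

A slope near 0 corresponds to the nearly flat trend line, while a negative slope of duration on position would mean longer albums tend to sit closer to #1.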
Interestingly enough, some Country albums with more listeners and plays still remain at the lower end of the Top 10, whereas other genres seem to show a correlation between higher chart positions and higher listener and play counts. While there is a slight incline to suggest such a pattern, the incline is not significant enough to warrant drawing a conclusion without further investigation. Intuitively, I would imagine this trend could come from albums returning to the Top 10 at lower chart positions, such as Shania Twain’s Come On Over, which first reached the charts in 1997 and reappeared more than once through 2000; the album peaked at #2 but also dipped to #10 for a time. Another example is Taylor Swift’s Fearless, which appeared on the charts from 2008 to 2010, peaking at #1 and also dipping to #10. In cases like these, albums have time to accumulate listeners and play counts, but may not always reach the higher end of the charts.
But I like this part of the analysis because it shows there isn’t much difference between albums at #1 versus #10, which is what you would expect. Being in the top 10 of the Billboard 200 is a remarkable feat no matter what, and this analysis shows that my favorite artist, Ethel Cain, with her #10 peak, is collecting about as many listeners and plays as an album at a higher rank.
Next I want to look at the popularity of genres over time, to see if genre preferences have changed. This is actually my favorite part of the analysis because I find it so cool and interesting. I really like how you can see the peak of rock in the 80s, which is when you had all the rock, and specifically glam rock, bands and performers like David Bowie, Def Leppard, Queen, Poison, etc., which are remarkable bands and performers. I grew up listening to many of them.
But I really like the graph with the trend line because you can see how rock later became less popular as Hip-Hop, R&B, and Pop gained more traction. That is what you would expect: the Spice Girls and Britney Spears dominated the charts in the 2000s, and performers like Tupac and The Notorious B.I.G. became very influential in the 90s and paved the way for the success of artists like Jay-Z, Snoop Dogg, J. Cole, and Kendrick Lamar that can still be felt today.
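The genre-over-time view comes down to counting Top 10 chart rows per genre in each time bucket. A small Python sketch of that aggregation follows (not the original R code). It assumes the chart week is an ISO date string such as "1985-06-01" stored under a hypothetical `chart_week` key, and it buckets by decade, whereas the actual plots may have used a finer grain.

```python
from collections import Counter, defaultdict


def genre_share_by_decade(rows):
    """Return {decade: {genre: share of that decade's Top 10 rows}}."""
    counts = defaultdict(Counter)
    for row in rows:
        year = int(row["chart_week"][:4])  # assumes "YYYY-MM-DD" strings
        decade = year // 10 * 10
        counts[decade][row["genre"]] += 1
    shares = {}
    for decade, genre_counts in sorted(counts.items()):
        total = sum(genre_counts.values())
        shares[decade] = {g: c / total for g, c in genre_counts.items()}
    return shares
```

Using shares instead of raw counts keeps decades comparable even when some have more matched rows than others.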
Closing Paragraph
At my place of work, we usually follow up an analysis with a little section describing what else we could do, future projects and so on. The obvious idea for what to do next would be to look at Billboard 100 songs. A thought I had recently is that I wonder if there are any patterns that can be deduced from “One Hit Wonders.”
But perhaps I am getting a little bit ahead of myself. There is still a lot I could do with analyzing the Billboard 200. All these albums were in the Top 10, and they are really all phenomenal in their own right. I think it would be fun to analyze the patterns we see in albums at ranks 190-200. In an ideal world with bountiful data, it would be fun to look at albums that didn’t place in the top 200 at all.
In a world with better computing resources and more time, I would’ve liked to set up my own MusicBrainz database. It would’ve been easier to get MBIDs and thus easier to get supplemental data, and I probably could’ve looked at the top 200 as a whole.
But my time in the world of analytics has taught me that you will very rarely get every answer you want from a project, and sometimes it is better to strive for progress, not perfection. I certainly cannot say that this project was a wash. I have:
Familiarized myself with 2 APIs
Learned a lot about the music industry and how musical data is processed
Refreshed my knowledge of dplyr, ggplot2, and httr2
Learned a lot about Shiny dashboards, flexdashboard, and R Markdown
And maybe most importantly… I had fun! For some reason the most tedious part, which was compressing different genre labels into less specific ones, was just very fun to me. Not to mention, I got to yap about some trends I have noticed in music over the years, and talk about my favorite artist and other artists I enjoy very much.
Yes, there were times when I was aggravated, times I felt hopeless and wanted to give up, and no, these deliverables are not of the same quality I had in my head. But I do feel a sense of accomplishment, I feel more prepared, and I believe I can use the skills I learned by doing this project in my professional work. So you take the good with the bad, and that’s the facts of life.
THE END