Making Music

EDA & DATA CLEANING

We introduce the sources of our data and predictors of interest. We also show how we processed the data into a dataframe to be used for modeling.


SPOTIFY MILLION PLAYLISTS

The Million Playlist Dataset is a large dataset in which each observation is a song, labelled by a “pid” (Playlist ID; all of the songs that appear in a given playlist share the same ID number). The file structure consists of 1,000 .json slice files, each containing 1,000 playlists. Each playlist object contains the following attributes.

  • 'collaborative': boolean (describes whether or not it is a collaborative playlist)

  • 'duration_ms': int (the duration of the entire playlist in milliseconds)

  • 'modified_at': int (the Unix Epoch Time value of when the playlist was last modified)

  • 'name': str (name of the playlist)

  • 'num_albums': int (number of unique albums in the playlist)

  • 'num_artists': int (number of unique artists in the playlist)

  • 'num_edits': int (number of times the playlist has been edited)

  • 'num_followers': int (number of users that follow the playlist)

  • 'num_tracks': int (number of tracks on the playlist)

  • 'pid': int (the playlist ID number, ranging from 0 to 999,999)

  • 'tracks': list of track objects (contains a list of tracks, where each track is an object containing the following attributes:

    • 'album_name': str (the name of the track’s album)

    • 'album_uri': str (the unique album ID -- uniform resource identifier)

    • 'artist_name': str (the name of the artist)

    • 'artist_uri': str (the unique artist ID -- uniform resource identifier)

    • 'duration_ms': int (the duration of the track in milliseconds)

    • 'pos': int (the track’s position in the playlist)

    • 'track_name' : str (the name of the track))

The ‘tracks’ column contains the list of songs included in each playlist. Our cleanup therefore included creating a second dataframe, this time with each row as a unique track and columns indicating the playlist it came from. These two dataframes let us explore common trends in playlist-level characteristics and in the distributions of songs grouped by playlist. A minimal sketch of this flattening step follows.
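As a rough sketch of this step (assuming the slice files of the dataset are saved locally; the filename below is illustrative), each slice can be split into the two dataframes like so:

import json
import pandas as pd

# Load one slice file of the Million Playlist Dataset
# (each slice holds 1,000 playlists); the path is illustrative.
with open("Songs/mpd.slice.0-999.json") as f:
    playlists = json.load(f)["playlists"]

# Playlist-level dataframe: one row per playlist, nested 'tracks' dropped.
playlist_df = pd.DataFrame(playlists).drop(columns=["tracks"])

# Track-level dataframe: one row per track, tagged with its playlist.
track_rows = []
for playlist in playlists:
    for track in playlist["tracks"]:
        row = dict(track)                    # album, artist, and track fields
        row["pid"] = playlist["pid"]         # playlist ID
        row["playlist_name"] = playlist["name"]
        track_rows.append(row)
track_df = pd.DataFrame(track_rows)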


EDA VISUALIZATIONS

From the Million Playlists data alone, we were able to find the distribution of song durations and some of the most featured artists across our sample of 1,000 playlists; a plotting sketch follows.
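A minimal sketch of how such figures can be produced from the track-level dataframe above (the plotting choices are illustrative, not our notebook’s exact code):

import matplotlib.pyplot as plt

# Distribution of track durations, converted from milliseconds to minutes.
(track_df["duration_ms"] / 60_000).plot.hist(bins=50)
plt.xlabel("Track duration (minutes)")
plt.title("Distribution of song durations")
plt.show()

# Top 20 most featured artists across the sampled playlists.
track_df["artist_name"].value_counts().head(20).plot.barh()
plt.xlabel("Number of appearances")
plt.title("Most featured artists")
plt.show()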

[Gallery: Artists (eda2.png)]

SPOTIFY API

Audio Features Predictors

The Spotify API provides an Audio Features object for each song track, which we would like to use as predictors because we believe they may reveal implicit patterns or determinants of musical similarity and could affect whether certain songs are included within a playlist. According to Spotify’s documentation, the Audio Features object includes the following variables (a short retrieval sketch follows the list).

ACOUSTICNESS

float

A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.

DANCEABILITY

float

Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

ENERGY

float

Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.

INSTRUMENTALNESS

float

Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.

LIVENESS

float

Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.

LOUDNESS

float

The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.

SPEECHINESS

float

Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.

VALENCE

float

A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

TEMPO

float

The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

KEY

int

The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.

MODE

int

Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
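As a retrieval sketch (assuming the spotipy client library and placeholder developer credentials; the audio-features endpoint accepts up to 100 track URIs per call):

import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Client-credentials auth; the ID and secret are placeholders for values
# from a Spotify developer account.
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
))

# Request features in batches of 100, the endpoint's per-call limit.
track_uris = track_df["track_uri"].unique().tolist()
features = []
for i in range(0, len(track_uris), 100):
    features.extend(sp.audio_features(track_uris[i:i + 100]))

# Tracks with no available features come back as None; drop them.
features_df = pd.DataFrame([f for f in features if f is not None])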


EDA VISUALIZATIONS

[Figures: Danceability (eda4.png), drake.jpg]

DATA CLEANING

Here we work with our two data sources, the Spotify Million Playlist Dataset and the Spotify API:

  1. Download and save the 1,000,000 playlists as 1,000 files in the folder 'Songs'.

  2. Randomly select 10 files from the 1,000 files to work with; these 10 files contain 10,000 playlists. We combine these 10,000 playlists into a single dataframe and save it as a CSV file.

  3. Retrieve song data from the Spotify API. We find that the API limits how much data we can request, so we end up randomly selecting an even smaller subset of 100 playlists. About 12,100 songs appear on these 100 playlists, and we retrieve the “audio features” of these songs from the Spotify API.

  4. We merge the two datasets into a single dataframe and save the prepared data as a CSV file for easy downloading.

  5. Before using the data for analysis, we inspect it to confirm that a diverse set of songs has been randomly selected across these 100 playlists. See the following graphs.

  6. Before using any audio features in our models, we normalize the data with MinMaxScaler, as sketched after this list.
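A condensed sketch of steps 2, 4, and 6, under the same naming assumptions as the earlier snippets (the 'uri' column comes from the audio-features response):

import glob
import random
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Step 2: randomly pick 10 of the 1,000 slice files in 'Songs'.
slice_files = glob.glob("Songs/mpd.slice.*.json")
sampled_files = random.sample(slice_files, 10)
# ...flatten each sampled file into track_df as sketched earlier...

# Step 4: merge track rows with their audio features on the track URI.
merged_df = track_df.merge(features_df, left_on="track_uri",
                           right_on="uri", how="inner")

# Step 6: scale the audio-feature columns to the [0, 1] range.
audio_cols = ["acousticness", "danceability", "energy", "instrumentalness",
              "liveness", "loudness", "speechiness", "valence", "tempo"]
merged_df[audio_cols] = MinMaxScaler().fit_transform(merged_df[audio_cols])

# Save the prepared data as a CSV for easy downloading.
merged_df.to_csv("playlists_with_audio_features.csv", index=False)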

To summarize, the final dataframe we work with contains playlist data taken from the Million Playlist Dataset as well as audio features for each song in those playlists from the Spotify API.


EXCERPTS FROM OUR NOTEBOOK

See our GitHub for the full notebook.


EDA VISUALIZATIONS FROM OUR SAMPLE

[Image gallery]