CONCLUSIONS
The best of the models created for the original 2018 Playlist Challenge achieved an R-precision of about 22% and an NDCG of about 0.39. We would have liked to achieve similar results, but automatic playlist generation is a complex problem and, in many ways, a subjective art rather than a science. The table below shows the scores of our models; our highest-scoring model was the content filtering model.
Through this project, we developed and evaluated six different models to recommend songs: a baseline model, collaborative filtering, content filtering, k-means clustering, hierarchical clustering, and ALS. We explored these particular models because of prior work that stood out to us in our research. Though the implementations of these models were not directly taught in CS109a, we were able to follow the entire data science process, performing EDA and validation where possible. We used R-precision and NDCG to evaluate the recommended songs (see Scoring below for more details), but the performance of each model varied greatly depending on the train and test sets that we used. Unsurprisingly, predicting from a longer playlist yielded better results, which is why we randomly sampled only from longer playlists (those above a length threshold) to test our models. Even the magnitudes of R-precision and NDCG were not perfectly correlated across the models, reflecting how challenging it is to quantify the performance of our recommendation systems.
Because it is difficult to judge the models directly against one another, we can't say definitively which of our models performs best, even given the R-precision and NDCG scores. However, we find that content filtering, collaborative filtering, and hierarchical clustering yielded relatively strong results. Qualitatively, we feel that the output of these models comes closest to achieving the goal of automatic playlist generation. In our opinion, content filtering worked well because it made the most effective use of all of the song audio features from the Spotify API to discover similar songs. Furthermore, since it used cosine distances and gave equal importance to standardized features, it did not appear to overfit the data and was able to predict well on test data. In contrast, our baseline model, collaborative filtering model, and ALS model weren't able to use the audio features effectively to find similar songs. The clustering models may have been susceptible to overfitting, since they clustered on features that performed well on training data but were less relevant on test data.
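To make the content filtering idea concrete, the following is a minimal sketch of one way to rank candidate songs by cosine similarity over standardized audio features. It is not our actual implementation: the function names, the toy feature values, and the choice to summarize a playlist by the mean of its songs' standardized feature vectors are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each feature column so every feature carries equal weight."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recommend(features, playlist_ids, k):
    """Return the k songs outside the playlist most similar to its mean profile.

    `features` maps a song id to a list of audio features (hypothetical layout).
    """
    ids = list(features)
    scaled = dict(zip(ids, standardize([features[i] for i in ids])))
    in_playlist = set(playlist_ids)
    # Assumption: the playlist is summarized by the mean of its songs' vectors.
    profile = [mean(col) for col in zip(*(scaled[i] for i in playlist_ids))]
    candidates = [i for i in ids if i not in in_playlist]
    candidates.sort(key=lambda i: cosine_similarity(profile, scaled[i]), reverse=True)
    return candidates[:k]


# Toy data: two made-up features per song (e.g. energy and danceability).
toy_features = {
    "s1": [0.90, 0.80],
    "s2": [0.85, 0.75],
    "s3": [0.10, 0.20],
    "s4": [0.15, 0.10],
}
top_pick = recommend(toy_features, ["s1"], k=1)
```

Standardizing before computing similarities is what gives each feature equal importance; without it, features measured on larger scales would dominate the ranking.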
We believe that work on problems like Spotify playlist generation will only grow more significant in the world of technology. The approaches and models that we explored can be incorporated into, and built upon in, any user recommendation system, not only for music but also for movies, books, products, and more. As consumer technology moves towards an even more data-driven, personalized consumption model, work like ours can help build better products and businesses across industries.
TABLE OF RESULTS

SCORING
To score the performance of each model, we first obscured the test playlist by removing half of its songs. We then compared the songs that our models recommended against the songs we had obscured from that playlist and calculated two metrics. The first was R-precision: the number of recommended songs that appeared in the obscured set, divided by the number of songs we recommended. The second was NDCG (normalized discounted cumulative gain), which measures the ranking quality of the recommended songs; that is, it judges how well we ordered the songs within our recommendations, rewarding correct songs more when they appear earlier in the list.
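The two metrics can be sketched in a few lines of Python. This is an illustrative sketch rather than our scoring code: the function names are hypothetical, relevance is treated as binary (a song is either in the obscured set or not), and the NDCG uses the common 1 / log2(rank + 1) discount, which may differ slightly from the exact variant used in the challenge. The R-precision function truncates the recommendations to the size of the obscured set, so when a model recommends exactly that many songs, dividing by the number of recommendations and dividing by the number of obscured songs give the same value.

```python
import math


def r_precision(recommended, held_out):
    """Share of the top-R recommendations found in the held-out set, R = len(held_out)."""
    held = set(held_out)
    if not held:
        return 0.0
    top_r = recommended[:len(held)]
    hits = sum(1 for track in top_r if track in held)
    return hits / len(held)


def ndcg(recommended, held_out):
    """Normalized discounted cumulative gain with binary relevance.

    Assumes `recommended` is ordered best-first and contains no duplicates.
    """
    held = set(held_out)
    # DCG: each hit is discounted by the log of its (1-based) rank.
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, track in enumerate(recommended, start=1)
        if track in held
    )
    # Ideal DCG: every hit that could fit is placed at the top of the list.
    ideal_hits = min(len(held), len(recommended))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example: four recommendations scored against four obscured songs.
recommended = ["a", "b", "c", "d"]
held_out = ["a", "c", "x", "y"]
rp = r_precision(recommended, held_out)  # 2 hits out of 4
score = ndcg(recommended, held_out)
```

Because NDCG depends on rank while R-precision does not, two models with the same R-precision can receive different NDCG scores, which is one reason the two metrics need not move together.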

FUTURE WORK
We would be interested in applying new methods to our data cleaning and modeling. First, we suspect that using more training data would improve our models significantly; as mentioned in our EDA, we did not have the computing capacity to process all or even most of the Million Playlist Dataset or to retrieve many more songs through the Spotify API. Second, there may be other approaches to the problems of automatic playlist generation and cold start, perhaps restructuring the data so that we take each playlist's average features and convert the task into a classification problem. This would open us up to using traditional data science approaches taught in CS109a, such as logistic regression, tree-based models, and neural networks, rather than strictly recommendation-system algorithms as we did in this project. We also found many sources on singular value decomposition (SVD) for recommender systems, and would be interested in pursuing this approach.
One approach we began, but did not pursue further because of limits on our time and experience, was to scrape song lyrics and use the information in them as predictors in our modeling. We would perhaps use Word2Vec and other text analysis tools to train models that “learn” implicit similarities within lyrics, as we imagine that similar songs would share similar lyrics. This idea has been explored by others, and we would be interested in applying such techniques to our playlist generation models.
