As part of Robert West's Applied Data Analysis class of Autumn 2017, we decided to focus on one of the largest freely available collections of music data online: the Million Song Dataset. The core of this data set is the feature analysis and metadata for one million songs, provided by The Echo Nest. Since the emergence of the compact disc, the capabilities of the music industry have expanded exponentially. Data could be stored efficiently and has been aggregated to this day. With current computing capabilities, the field of big data analysis has emerged and allows us to explore and try to discover the secrets behind what makes music such a magical and essential part of our lives.
This web page is separated into two parts and contains most of the relevant results we obtained through the analysis of this data set. If you are interested in more details, you can have a look at the GitHub page of this project. The first section is about genre classification as well as chronological analysis and geographic representation of our data set. The second section consists of building an interactive 3D plot, where the user can walk through a data cloud, explore the different genres of music and listen to short previews for a more immersive experience.
Now, let's try to put our genre classification into another perspective. More precisely, we want to confront the given labeling (tags) of the songs with some of their content information. To do that, we focused on two features, "tempo" and "loudness", with respect to our genre labeling. However, we can see below that the box plots for "loudness" and "tempo" are not really informative with respect to the defined genres. The correlations between them do not give better results either. We thus decided to have a closer look at the "segment_timbre" feature.
Segments are a set of sound entities (typically under a second), each relatively uniform in timbre and harmony. Segments are characterized by their perceptual onsets and duration in seconds, loudness (dB), pitch and timbral content. The segment_timbre is a 12-dimensional vector which captures the "tone colour" of each segment of a song. To simplify our calculations, we decided to take the mean over all segments, so that each song is represented by a single vector of size 12. To see whether we could get information from it, we started by applying a principal component analysis (PCA) to the vectors of a given genre. PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. We can see below that representing two sets of songs, coming from two different genres, by their two principal component values gives some relevant results. Indeed, we can see a well-defined separation between the two music genres. However, we can also see that the points overlap.
Once trained using the metadata, our model predicts the genre distribution of each song from its timbre information. The training and prediction were performed using a simple neural network (you can have a look at our GitHub notebook for detailed explanations). Based on this prediction, we plot the songs in a 3-dimensional space using t-SNE, and for better visualization we colored each song by the super genre it belongs to.
t-SNE is a machine learning algorithm for dimensionality reduction. It allows us to embed high-dimensional data into a two- or three-dimensional space, which can then be visualized in a scatter plot. While PCA is a linear algorithm that cannot capture complex polynomial relationships between features, t-SNE is based on probability distributions with random walks on neighborhood graphs to find the structure within the data. The probabilities representing the similarities come from a conversion of the high-dimensional Euclidean distances between data points. This allows the algorithm to build a 2- or 3-dimensional plot in which pairs of points are placed close to each other according to this distribution. Close points represent songs that are similar in content, more specifically in their timbre feature. Therefore, it is often the case that they belong to the same super genre group, but two different styles may be "spatially" close nonetheless. This is a nice way to compare metadata classification and content similarities while discovering artists of an unknown or different style that matches your personal tastes!
You can now explore it and verify the accuracy of our results for yourself! (We advise you to use a mouse to enjoy the complete diving experience.)
Million Song Dataset HDF5 to CSV Converter: the code in "msdHDF5toCSV.py" is designed to convert the HDF5 files of the Million Song Dataset.
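Two of the steps described on this page can be illustrated with a small sketch: averaging the per-segment timbre vectors of a song into a single 12-dimensional vector, and converting Euclidean distances between songs into similarity probabilities, which is the first stage of t-SNE. This is not the project's code. The function names `mean_timbre` and `conditional_similarities`, the fixed bandwidth `sigma`, and the toy songs are all assumptions made for illustration. In particular, a real t-SNE implementation tunes the bandwidth separately for each point to match a target perplexity, and then optimizes a low-dimensional embedding, which is omitted here.

```python
import math

def mean_timbre(segments):
    """Average a song's per-segment timbre vectors into one song-level vector."""
    n = len(segments)
    dims = len(segments[0])
    return [sum(seg[d] for seg in segments) / n for d in range(dims)]

def squared_distance(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def conditional_similarities(points, sigma=5.0):
    """Return P with P[i][j] = p(j|i), a Gaussian similarity on distances.

    Assumption: one fixed sigma for every point. Real t-SNE searches for a
    per-point sigma that matches a chosen perplexity.
    """
    n = len(points)
    P = []
    for i in range(n):
        weights = [
            0.0 if i == j
            else math.exp(-squared_distance(points[i], points[j]) / (2 * sigma ** 2))
            for j in range(n)
        ]
        total = sum(weights)
        P.append([w / total for w in weights])
    return P

# Toy data (hypothetical): three songs, each a list of 12-d segment timbre vectors.
songs = {
    "song_a": [[1.0] * 12, [3.0] * 12],
    "song_b": [[2.0] * 12, [2.5] * 12],
    "song_c": [[10.0] * 12],
}

# One 12-d vector per song, then pairwise similarity probabilities.
vectors = [mean_timbre(segments) for segments in songs.values()]
P = conditional_similarities(vectors)

for name, row in zip(songs, P):
    print(name, [round(p, 4) for p in row])
```

In this toy setup, the first two songs have nearly identical mean timbre, so each assigns almost all of its probability mass to the other, while the third song sits far from both. This mirrors the idea that close points in the final plot correspond to songs with similar timbre.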