Search⌘ K

Loading the Data

Explore the process of loading the Million Song Dataset into PostgreSQL. Understand how to import data from SQLite and JSON formats using pgloader and custom scripts. Learn querying techniques and the use of PostgreSQL's COPY protocol for efficient data ingestion. This lesson enables you to manage and analyze large music datasets within PostgreSQL.

We’re going to look at the LastFm dataset, the official song tag, and the song similarity dataset of the Million Song Dataset (MSD). Quoting from the link above:

“The MSD team is proud to partner with Last.fm in order to bring you the largest research collection of song-level tags and precomputed song-level similarity. All the data is associated with MSD tracks, which makes it easy to link it to other MSD resources: audio features, artist data, lyrics, etc.”

Importing the dataset

First, we need to import this dataset into a PostgreSQL database. The dataset is offered both as a SQLite database and a JSON file. Loading the SQLite database is easy thanks to pgloader:

$ curl -L -o /tmp/lastfm_tags.db
       http://labrosa.ee.columbia.edu/millionsong/sites/default/files/lastfm/lastfm_tags.db

$ pgloader /tmp/lastfm_tags.db pgsql://appdev@localhost/appdev

We get the following output, meaning the data is now available in our PostgreSQL database for further indexing:

             table name     errors       read   imported      bytes      total time
-----------------------  ---------  ---------  ---------  ---------  --------------
                  fetch          0          0          0                     0.000s
        fetch meta data          0          8          8                     0.028s
         Create Schemas          0          0          0                     0.000s
       Create SQL Types          0          0          0                     0.006s
          Create tables          0          6          6                     0.031s
         Set Table OIDs          0          3          3                    
...