Loading the Data
Get an introduction to another dataset and load the related data.
We'll cover the following
We’re going to look at the LastFm dataset, the official song tag, and the song similarity dataset of the Million Song Dataset (MSD). Quoting from the link above:
“The MSD team is proud to partner with Last.fm in order to bring you the largest research collection of song-level tags and precomputed song-level similarity. All the data is associated with MSD tracks, which makes it easy to link it to other MSD resources: audio features, artist data, lyrics, etc.”
Importing the dataset
First, we need to import this dataset into a PostgreSQL database. The dataset is offered both as a SQLite database and a JSON file. Loading the SQLite database is easy thanks to pgloader:
$ curl -L -o /tmp/lastfm_tags.db
http://labrosa.ee.columbia.edu/millionsong/sites/default/files/lastfm/lastfm_tags.db
$ pgloader /tmp/lastfm_tags.db pgsql://appdev@localhost/appdev
We get the following output, meaning the data is now available in our PostgreSQL database for further indexing:
table name errors read imported bytes total time
----------------------- --------- --------- --------- --------- --------------
fetch 0 0 0 0.000s
fetch meta data 0 8 8 0.028s
Create Schemas 0 0 0 0.000s
Create SQL Types 0 0 0 0.006s
Create tables 0 6 6 0.031s
Set Table OIDs 0 3 3 0.009s
----------------------- --------- --------- --------- --------- --------------
tids 0 505216 505216 9.2 MB 1.893s
tags 0 522366 522366 8.6 MB 1.781s
tid_tag 0 8598630 8598630 135.7 MB 32.614s
----------------------- --------- --------- --------- --------- --------------
COPY Threads Completion 0 4 4 34.366s
Create Indexes 0 5 5 2m14.346s
Index Build Completion 0 5 5 36.976s
Reset Sequences 0 0 0 0.054s
Primary Keys 0 0 0 0.000s
Create Foreign Keys 0 0 0 0.000s
Create Triggers 0 0 0 0.001s
Install Comments 0 0 0 0.000s
----------------------- --------- --------- --------- --------- --------------
Total import time ✓ 9626212 9626212 153.4 MB 3m25.743s
Here, pgloader extracted the table and index definitions from the SQLite database using the sqlite_master
catalog and the PRAGMA table_info()
commands, and it migrated the data in a streaming fashion to PostgreSQL using the COPY protocol.
Querying the data
Having a look at the demo_tags.py
script from the Last.fm project, we can see how to use relations here, and we realize they are using the 64-bit signed integer ROWID system column. We need something comparable to be able to make sense of the data:
Get hands-on with 1400+ tech skills courses.