Introduction to Datasets
An introduction to datasets and different ways to load them.
Distributed data sources
To build scalable data pipelines, we’ll need to switch from using local files, such as CSVs, to distributed data sources, such as Parquet files on S3. While the tools used to load data vary significantly across cloud platforms, the end result is usually the same: a dataframe. In a single-machine environment, we can use Pandas to load the dataframe, while distributed environments use different implementations, such as Spark dataframes in PySpark.
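As a preview of that difference, the sketch below loads the same (hypothetical) Parquet data on a single machine with Pandas and in a distributed environment with PySpark. This is a minimal sketch rather than part of the lesson's notebooks: the bucket path is a placeholder, and it assumes the s3fs/pyarrow and pyspark packages are available.

import pandas as pd
from pyspark.sql import SparkSession

# Single machine: Pandas reads a Parquet file directly from S3 (via s3fs/pyarrow)
# "s3://example-bucket/events/" is a placeholder path, not a real dataset
df = pd.read_parquet("s3://example-bucket/events/part-0.parquet")

# Distributed: PySpark reads the same path as a Spark dataframe across the cluster
spark = SparkSession.builder.getOrCreate()
sparkDF = spark.read.parquet("s3://example-bucket/events/")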
In this lesson, we will introduce the datasets that we’ll explore throughout the course. In this chapter, we’ll focus on loading the data using a single machine, while later chapters will present distributed approaches. Although most of the datasets presented here can be downloaded as CSV files and can be read into Pandas using read_csv, it’s good practice to develop automated workflows to connect to diverse data sources.
Common datasets
We’ll explore the following datasets throughout this course:
- Boston Housing: records of sale prices of homes in the Boston housing market back in 1980
- Game Purchases: a synthetic dataset representing games purchased by different users on Xbox One
- Natality: one of BigQuery’s open datasets on birth statistics in the US over multiple decades
- Kaggle NHL: play-by-play events from professional hockey games and game statistics over the past decade
The first two datasets can be loaded with only a few lines of code, as long as you have the required libraries installed.
The Natality and Kaggle NHL datasets require setting up authentication files before you can programmatically pull the data into Pandas.
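For example, pulling one of BigQuery’s public natality tables into Pandas typically relies on a service account key file. The snippet below is a hedged sketch rather than the lesson’s exact setup: the key path, project ID, and query are placeholders, and it assumes the pandas-gbq and google-auth packages are installed.

import pandas as pd
from google.oauth2 import service_account

# Placeholder path to a service account key file downloaded from GCP
credentials = service_account.Credentials.from_service_account_file("my_key.json")

# Query a public natality table into a Pandas dataframe (requires pandas-gbq)
natalityDF = pd.read_gbq(
    "SELECT year, plurality, weight_pounds "
    "FROM `bigquery-public-data.samples.natality` LIMIT 10",
    project_id="my-project",      # placeholder GCP project ID
    credentials=credentials
)
natalityDF.head()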
Load data from a library
The first approach we’ll use to load a dataset is retrieving it directly from a library. Multiple libraries have included the Boston housing dataset because it is a small dataset that is useful for testing out regression models. Recent releases of scikit-learn deprecate the bundled copy, so the snippet below fetches the data from its original source and assembles the dataframe manually. We’ll start by running pip from the command line:
In our pre-configured execution environment below, these libraries are already installed.
pip3 install pandas==1.3.5
pip3 install "scikit-learn>=1.0.2"
Once the libraries are installed, we can switch back to the Jupyter notebook to explore the dataset. The code snippet below shows how to load the Pandas and NumPy libraries, fetch the Boston dataset from its original source, assemble it into a Pandas dataframe, and display the first five records:
import pandas as pd
import numpy as np

# Fetch the raw Boston housing data from its original source
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)

# Each record spans two rows in the raw file: recombine the features and the target
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

# Assemble the dataframe and append the label (median home value)
bostonDF = pd.DataFrame(data, columns=['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX',
    'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'])
bostonDF['label'] = target
bostonDF.head()
The result of running these commands is shown in the figure below:
Load data from web
The second approach we’ll use to load a dataset is fetching it from the web. The CSV for the Games dataset is available as a single file on GitHub. We can fetch it into a Pandas dataframe by using the read_csv function and passing the URL of the file as a parameter.
import pandas as pd

# Read the Games dataset directly from GitHub into a dataframe
gamesDF = pd.read_csv("https://github.com/bgweber/Twitch/raw/master/Recommendations/games-expand.csv")
gamesDF.head()
Both of these approaches are similar to downloading CSV files and reading them from a local directory, but they avoid the manual step of downloading files. Removing manual steps like this is essential for building automated workflows in Python.
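As a small illustration of that idea, the loading step can be wrapped in a function that a scheduled job could call; this is a minimal sketch and the load_games helper is hypothetical, not part of the lesson’s notebooks.

import pandas as pd

# Hypothetical helper: a pipeline can call this without any manual download step
GAMES_URL = "https://github.com/bgweber/Twitch/raw/master/Recommendations/games-expand.csv"

def load_games(url=GAMES_URL):
    """Fetch the Games dataset from the web and return it as a dataframe."""
    return pd.read_csv(url)

gamesDF = load_games()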
The result of reading the dataset and printing out the first few records is shown in the figure below:
Try it out!
Let’s try this out in the Jupyter notebook given below: