...

/

Joining Real-World Datasets

Joining Real-World Datasets

See some joining data examples.

Let’s look at a concrete example of joining data.

Most of the data we’ve looked at in the course has been delivered in a single CSV file. Sometimes we have data from multiple sources and we need to combine them. This section will explore joining a real-world dataset.

Dirty Devil flow and weather data

Let’s revisit the Dirty Devil data. Let’s load the flow and gauge height data. In this case, we will leave the datetime column as a column and not use it for the index:

Press + to interact
import pandas as pd
# set the url of the file we want to read
url = 'https://github.com/mattharrison/datasets/raw/master/data/'\
'dirtydevil.txt'
# read the tab-separated file using pandas, skip the first 33 rows and the 35th row
# and store the result as a DataFrame
df = pd.read_csv(url, skiprows=lambda num: num <34 or num == 35,
sep='\t')
# define a function to convert time to US/Mountain time zone
# it takes a DataFrame, a time column name, and a timezone column name as inputs
def to_us_mountain_time(df_, time_col, tz_col):
# replace 'MDT' with 'MST7MDT' in the timezone column
return (df_
.assign(**{tz_col: df_[tz_col].replace('MDT',
'MST7MDT')})
# group the DataFrame by the timezone column
.groupby(tz_col)
# for each group, convert the time column to datetime, localize to the timezone name, and convert to US/Mountain timezone
[time_col]
.transform(lambda s: pd.to_datetime(s)
.dt.tz_localize(s.name, ambiguous=True)
.dt.tz_convert('US/Mountain'))
)
# define a function to rename columns in the DataFrame
def tweak_river(df_):
# convert the datetime column to US/Mountain timezone
return (df_
.assign(datetime=to_us_mountain_time(df_, 'datetime', 'tz_cd'))
# rename the columns '144166_00060' and '144167_00065' to 'cfs' and 'gage_height', respectively
.rename(columns={'144166_00060': 'cfs',
'144167_00065': 'gage_height'})
)
# apply the above functions to the DataFrame and store the result as 'dd'
dd = tweak_river(df)
# print the resulting DataFrame
print(dd)

We’ll also load some meteorological data from Hanksville, Utah, a city near the river. We’ll then join both datasets ...