Joining Real-World Datasets
See some joining data examples.
We'll cover the following...
Let’s look at a concrete example of joining data.
Most of the data we’ve looked at in the course has been delivered in a single CSV file. Sometimes we have data from multiple sources and we need to combine them. This section will explore joining a real-world dataset.
Dirty Devil flow and weather data
Let’s revisit the Dirty Devil data. Let’s load the flow and gauge height data. In this case, we will leave the datetime
column as a column and not use it for the index:
Press + to interact
import pandas as pd# set the url of the file we want to readurl = 'https://github.com/mattharrison/datasets/raw/master/data/'\'dirtydevil.txt'# read the tab-separated file using pandas, skip the first 33 rows and the 35th row# and store the result as a DataFramedf = pd.read_csv(url, skiprows=lambda num: num <34 or num == 35,sep='\t')# define a function to convert time to US/Mountain time zone# it takes a DataFrame, a time column name, and a timezone column name as inputsdef to_us_mountain_time(df_, time_col, tz_col):# replace 'MDT' with 'MST7MDT' in the timezone columnreturn (df_.assign(**{tz_col: df_[tz_col].replace('MDT','MST7MDT')})# group the DataFrame by the timezone column.groupby(tz_col)# for each group, convert the time column to datetime, localize to the timezone name, and convert to US/Mountain timezone[time_col].transform(lambda s: pd.to_datetime(s).dt.tz_localize(s.name, ambiguous=True).dt.tz_convert('US/Mountain')))# define a function to rename columns in the DataFramedef tweak_river(df_):# convert the datetime column to US/Mountain timezonereturn (df_.assign(datetime=to_us_mountain_time(df_, 'datetime', 'tz_cd'))# rename the columns '144166_00060' and '144167_00065' to 'cfs' and 'gage_height', respectively.rename(columns={'144166_00060': 'cfs','144167_00065': 'gage_height'}))# apply the above functions to the DataFrame and store the result as 'dd'dd = tweak_river(df)# print the resulting DataFrameprint(dd)
We’ll also load some meteorological data from Hanksville, Utah, a city near the river. We’ll then join both datasets ...