Creating and Updating Columns
Explore the ”one true way” to create and update columns in pandas.
We'll cover the following...
Loading the data
We’ll be looking at a dataset of Python users from
import pandas as pdurl = 'https://github.com/mattharrison/datasets/raw/master/data/'\'2020-jetbrains-python-survey.csv'jb = pd.read_csv(url)print(jb)
This is a pretty good dataset. It has over 50,000 rows and 264 columns. However, we’ll need to clean it up to perform exploratory analysis.
Some of the columns have a dummy-like encoding, like the columns starting with “database.” end with a database name. In the values for these columns, the database name is included. Since a user might use multiple databases, this is a mechanism to encode this. However, it also creates many columns, one per database. To keep the data manageable, we’re going to filter out columns like the database columns.
Below is code that determines whether a feature can have multiple values (like database) and removes those:
import collections # import the collections modulecounter = collections.defaultdict(list) # create an empty defaultdict with default value as list to store column countfor col in sorted(jb.columns): # loop over sorted list of column names in jb objectperiod_count = col.count('.') # count the number of periods in current column nameif period_count >= 2: # if there are at least 2 periods in column namepart_end = 2 # set the part_end variable to 2else: # otherwise, if there is only 1 period in column namepart_end = 1 # set the part_end variable to 1parts = col.split('.')[:part_end] # split column name by period and keep only the first 1 or 2 partscounter['.'.join(parts)].append(col) # add the current column name to the defaultdict using the first 1 or 2 parts as keyuniq_cols = [] # create an empty list to store unique column namesfor cols in counter.values(): # loop over the values in the defaultdictif len(cols) == 1: # if there is only 1 column with that keyuniq_cols.extend(cols) # add the column name to the unique columns listprint(uniq_cols) # print the list of unique columns
Note that these column names have a period in them. We’re going to replace those with an underscore to allow us to access the names of the columns via attributes (with a period).
Let’s look at the “age” column:
print(jb[uniq_cols].rename(columns=lambda c: c.replace('.', '_')).age.value_counts(dropna=False))
We’re going to pull out the first two characters from the “age” column and convert it to numbers. We’ll have to convert it to float
because there are missing values in the column:
print(jb[uniq_cols].rename(columns=lambda c: c.replace('.', '_')).age.str.slice(0,2).astype(float))
We can also write .str.slice(0,2)
as .str[0:2]
.
Note: Currently, pandas can’t convert strings ...