...

/

Creating and Updating Columns

Creating and Updating Columns

Explore the ”one true way” to create and update columns in pandas.

We'll cover the following...

Loading the data

We’ll be looking at a dataset of Python users from JetBrainsJetBrains is a software development company that creates tools for developers to write high-quality code more efficiently. They offer a wide range of integrated development environments (IDEs) and other productivity tools, such as code editors, code collaboration tools, and issue trackers. JetBrains’ products are used by millions of developers worldwide in various industries, including web development, mobile development, data science, and more.. Let’s load the data:

Press + to interact
import pandas as pd
url = 'https://github.com/mattharrison/datasets/raw/master/data/'\
'2020-jetbrains-python-survey.csv'
jb = pd.read_csv(url)
print(jb)

This is a pretty good dataset. It has over 50,000 rows and 264 columns. However, we’ll need to clean it up to perform exploratory analysis.

Some of the columns have a dummy-like encoding, like the columns starting with “database.” end with a database name. In the values for these columns, the database name is included. Since a user might use multiple databases, this is a mechanism to encode this. However, it also creates many columns, one per database. To keep the data manageable, we’re going to filter out columns like the database columns.

Below is code that determines whether a feature can have multiple values (like database) and removes those:

Press + to interact
import collections # import the collections module
counter = collections.defaultdict(list) # create an empty defaultdict with default value as list to store column count
for col in sorted(jb.columns): # loop over sorted list of column names in jb object
period_count = col.count('.') # count the number of periods in current column name
if period_count >= 2: # if there are at least 2 periods in column name
part_end = 2 # set the part_end variable to 2
else: # otherwise, if there is only 1 period in column name
part_end = 1 # set the part_end variable to 1
parts = col.split('.')[:part_end] # split column name by period and keep only the first 1 or 2 parts
counter['.'.join(parts)].append(col) # add the current column name to the defaultdict using the first 1 or 2 parts as key
uniq_cols = [] # create an empty list to store unique column names
for cols in counter.values(): # loop over the values in the defaultdict
if len(cols) == 1: # if there is only 1 column with that key
uniq_cols.extend(cols) # add the column name to the unique columns list
print(uniq_cols) # print the list of unique columns

Note that these column names have a period in them. We’re going to replace those with an underscore to allow us to access the names of the columns via attributes (with a period).

Let’s look at the “age” column:

Press + to interact
print(jb
[uniq_cols]
.rename(columns=lambda c: c.replace('.', '_'))
.age
.value_counts(dropna=False)
)

We’re going to pull out the first two characters from the “age” column and convert it to numbers. We’ll have to convert it to float because there are missing values in the column:

Press + to interact
print(jb
[uniq_cols]
.rename(columns=lambda c: c.replace('.', '_'))
.age
.str.slice(0,2)
.astype(float)
)

We can also write .str.slice(0,2) as .str[0:2].

Note: Currently, pandas can’t convert strings ...