Loading the data

We’ll be looking at a dataset of Python users from JetBrainsJetBrains is a software development company that creates tools for developers to write high-quality code more efficiently. They offer a wide range of integrated development environments (IDEs) and other productivity tools, such as code editors, code collaboration tools, and issue trackers. JetBrains’ products are used by millions of developers worldwide in various industries, including web development, mobile development, data science, and more.. Let’s load the data:

Press + to interact

This is a pretty good dataset. It has over 50,000 rows and 264 columns. However, we’ll need to clean it up to perform exploratory analysis.

Some of the columns have a dummy-like encoding, like the columns starting with “database.” end with a database name. In the values for these columns, the database name is included. Since a user might use multiple databases, this is a mechanism to encode this. However, it also creates many columns, one per database. To keep the data manageable, we’re going to filter out columns like the database columns.

Below is code that determines whether a feature can have multiple values (like database) and removes those:

Press + to interact

Python 3.8

import collections  # import the collections module
counter = collections.defaultdict(list)  # create an empty defaultdict with default value as list to store column count
for col in sorted(jb.columns):  # loop over sorted list of column names in jb object
    period_count = col.count('.')  # count the number of periods in current column name
    if period_count >= 2:  # if there are at least 2 periods in column name
        part_end = 2  # set the part_end variable to 2
    else:  # otherwise, if there is only 1 period in column name
        part_end = 1  # set the part_end variable to 1
    parts = col.split('.')[:part_end]  # split column name by period and keep only the first 1 or 2 parts
    counter['.'.join(parts)].append(col)  # add the current column name to the defaultdict using the first 1 or 2 parts as key
uniq_cols = []  # create an empty list to store unique column names
for cols in counter.values():  # loop over the values in the defaultdict
    if len(cols) == 1:  # if there is only 1 column with that key
        uniq_cols.extend(cols)  # add the column name to the unique columns list
print(uniq_cols)  # print the list of unique columns

Press + to interact

Introduction

Series Deep Dive

DataFrames

Manipulating Data

Wrapping Up

Appendix

Creating and Updating Columns

Loading the data