pandas is a powerful Python library for data processing, manipulation, and analysis. Its comprehensive functionality makes it a go-to tool for data scientists, analysts, and researchers to handle and process data efficiently. pandas has an impressive tool for data manipulation known as DataFrames.
DataFrames are two-dimensional structures similar to spreadsheets that simplify data presentation, storage, handling, and transformation.
DataFrames provide quick methods for statistical descriptions of data properties and data analysis. Moreover, DataFrames offer unparalleled versatility, enabling users to easily perform a wide range of operations, from basic data manipulation to advanced analytics.
pandas seamlessly integrates with other Python libraries, such as NumPy and Matplotlib, further enhancing its capabilities and making it an essential component of the data science toolkit. In this blog post, we’ll explore how to create a DataFrame and cover some basic operations we can perform on it for data manipulation and analysis.
The pandas DataFrame is a two-dimensional labeled data structure, resembling a spreadsheet or a table. It provides a flexible and efficient way to manipulate and analyze data in Python.
DataFrames offer a huge list of advantages for data manipulation tasks. Their inherent structure allows for the seamless handling of structured data, including tabular data, time series, and more complex datasets. The simplicity of DataFrame operations facilitates basic tasks like indexing and filtering and more advanced operations such as grouping, aggregation, and reshaping. Furthermore, DataFrames provide an intuitive interface for data exploration, enabling users to perform tasks like data cleaning, preprocessing, exploratory data analysis, and statistical analysis effortlessly.
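As a small taste of the grouping and aggregation operations mentioned above, here is a minimal sketch; the column names and data are made up for illustration:

```python
import pandas as pd

# A tiny made-up dataset of book ratings by genre
df = pd.DataFrame({'genre': ['sci-fi', 'sci-fi', 'memoir'],
                   'rating': [4.3, 4.6, 4.6]})

# Group rows by genre, then aggregate each group's ratings
avg_by_genre = df.groupby('genre')['rating'].mean()
print(avg_by_genre)
```

One call chains the grouping and the aggregation, producing the average rating per genre as a Series indexed by genre.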
DataFrames are instrumental in statistical analysis tasks such as calculating summary statistics, visualizing data distributions, and conducting hypothesis testing. Let’s see some real-world examples where DataFrames help shape data:
Financial analysis: DataFrames are invaluable for organizing and analyzing stock market data, performing portfolio management, and calculating financial metrics like returns and volatility.
Marketing analytics: DataFrames are used to analyze customer behavior, segment customers based on demographics or purchasing patterns, and track the performance of marketing campaigns.
Healthcare: DataFrames facilitate the analysis of patient data, clinical trials, and medical research, aiding in disease diagnosis, treatment planning, and epidemiological studies.
E-commerce: DataFrames are extensively utilized in e-commerce, manufacturing, and telecommunications industries for tasks like supply chain management, quality control, and customer relationship management.
DataFrames provide a versatile and intuitive framework for exploring, analyzing, and deriving value from data in diverse domains.
A DataFrame consists of rows and columns, where rows are accessed via index—a zero-based integer index by default. However, it can also be string-based or customized.
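For illustration, a short sketch of the default integer index versus a custom string index (the book data here is made up):

```python
import pandas as pd

# Default zero-based integer index
df = pd.DataFrame({'Title': ['1984', 'Dune'], 'Rating': [4.3, 4.6]})
print(df.index)   # RangeIndex(start=0, stop=2, step=1)

# Custom string-based index
df2 = pd.DataFrame({'Title': ['1984', 'Dune'], 'Rating': [4.3, 4.6]},
                   index=['first', 'second'])
print(df2.loc['first'])   # access a row by its string label
```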
Each column in a DataFrame is similar to a pandas Series, which is a one-dimensional labeled array. A Series is used to model one-dimensional data and can hold data of any data type. A column in a DataFrame can be extracted as a pandas Series. The type of column is a pandas Series instance. Any operation that can be performed on a Series can be applied to a DataFrame column.
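To see this in code, here is a minimal sketch (with made-up data) of extracting a column as a Series and applying a Series operation to it:

```python
import pandas as pd

df = pd.DataFrame({'Title': ['1984', 'Dune'], 'Rating': [4.3, 4.6]})

# Extracting a column yields a pandas Series
ratings = df['Rating']
print(type(ratings))   # <class 'pandas.core.series.Series'>

# Any Series operation can be applied to the column
print(ratings.mean())
```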
In pandas, the notion of an axis is crucial. An axis can be thought of as a direction along which operations are performed. For a DataFrame:
axis=0: The operation is applied down the rows (vertically), producing one result per column. For example, df.sum(axis=0) returns the sum of each column.
axis=1: The operation is applied across the columns (horizontally), producing one result per row. For example, df.sum(axis=1) returns the sum of each row.
It’s important to understand the axis parameter for performing operations in the desired direction within a pandas DataFrame. It helps to avoid unintended errors and ensures accurate data manipulation.
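A quick sketch makes the two directions concrete (the column names and values are made up):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# axis=0: sum down the rows, giving one result per column
print(df.sum(axis=0))   # A -> 6, B -> 60

# axis=1: sum across the columns, giving one result per row
print(df.sum(axis=1))   # 11, 22, 33
```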
Let’s understand the basic structure and axis of the DataFrame from the following illustrations:
DataFrames play a pivotal role in Python data analysis. Their importance lies in their ability to handle and manipulate structured and heterogeneous data efficiently, making it a cornerstone for tasks ranging from data cleaning to complex analytics.
We’ve understood the structure of a DataFrame; now the question arises: how do we create one? To work with DataFrames, you need to install pandas in your virtual environment, as follows:
pip install pandas
DataFrames can be created from many types of input:
Dictionaries of lists: We can create a DataFrame by providing a dictionary where each key represents a column name, and the corresponding value is a list of data for that column.
import pandas as pd

data = {'Column1': [1, 2, 3],
        'Column2': ['A', 'B', 'C']}
df = pd.DataFrame(data)
List of dictionaries: We can create a DataFrame from a list of dictionaries, where each dictionary represents a row, and keys within dictionaries represent column names.
import pandas as pd

data = [{'Column1': 1, 'Column2': 'A'},
        {'Column1': 2, 'Column2': 'B'},
        {'Column1': 3, 'Column2': 'C'}]
df = pd.DataFrame(data)
NumPy ndarrays: We can create a DataFrame from a two-dimensional NumPy ndarray, where each inner array represents a row of data.
import pandas as pd
import numpy as np

data = np.array([[1, 'A'], [2, 'B'], [3, 'C']])
df = pd.DataFrame(data, columns=['Column1', 'Column2'])
From files: pandas provides convenient methods to create a DataFrame from CSV, JSON, HDF5, and many other file types. It also provides functions for reading a SQL database into a DataFrame.
import pandas as pd

# For CSV files
df = pd.read_csv('file.csv')

# For JSON files
df = pd.read_json('example.json')

# For HDF5 files
df = pd.read_hdf('example.h5', key='data')

# For a SQL database
import sqlite3
con = sqlite3.connect('example.db')
query = 'SELECT * FROM table_name'
df = pd.read_sql(query, con)
We’ll start with a simple example of populating an empty DataFrame, and later, we’ll discuss some of the functions that can be applied to the DataFrame.
Let’s check the code snippet below for creating an empty DataFrame:
import pandas as pd
df = pd.DataFrame()
print(df)
Let’s review the code:
Line 1: We import the pandas library as pd.
Line 2: We call the DataFrame() constructor from pd to create an instance of DataFrame, named df here.
Line 3: We print our DataFrame, df. Since it’s an empty DataFrame, the output shows an empty list of columns and an empty index.
Let’s start populating our DataFrame with some cool book collection information. We will add the book title, author’s name, publication year, and review rating.
In the code snippet below, we’ll add values as lists in different columns of df:
# Add series
df['Title'] = ['The Great Gatsby', 'To Kill a Mockingbird', '1984']
df['Author'] = ['F. Scott Fitzgerald', 'Harper Lee', 'George Orwell']
df['Publication Year'] = [1925, 1960, 1949]
df['Review Rating'] = [4.5, 4.8, 4.3]

# Display the DataFrame
print("Initial DataFrame:")
print(df)
Let’s review the code snippet above:
Lines 2–5: We add four columns (Title, Author, Publication Year, and Review Rating) to the DataFrame df, each containing a list of values.
Lines 8–9: We add printing statements.
In the code snippet below, we’ll add a new book to df:
# Add a new book
new_book = {'Title': 'Brave New World', 'Author': 'Aldous Huxley', 'Publication Year': 1932, 'Review Rating': 4.0}
df = pd.concat([df, pd.DataFrame([new_book])], ignore_index=True)

# Display the updated DataFrame
print("\nDataFrame with Added Book:")
print(df)
Let’s review the code snippet above:
Line 2: We create a dictionary, new_book, with key-value pairs representing the details of the new book: Title, Author, Publication Year, and Review Rating.
Line 3: We wrap new_book in a single-row DataFrame and append it to df with pd.concat. (The older df.append method was deprecated and removed in pandas 2.0.) The ignore_index parameter is set to True to reindex the DataFrame after appending the new book, ensuring a continuous and unique index.
Lines 6–7: We add the printing statements.
In the code snippet below, we’ll add some new books to df:
# Add new books with review ratings
new_books = [
    {'Title': 'The Silent Patient', 'Author': 'Alex Michaelides', 'Publication Year': 2019, 'Review Rating': 4.7},
    {'Title': 'Where the Crawdads Sing', 'Author': 'Delia Owens', 'Publication Year': 2018, 'Review Rating': 4.9},
    {'Title': 'Educated', 'Author': 'Tara Westover', 'Publication Year': 2018, 'Review Rating': 4.6},
    {'Title': 'Atomic Habits', 'Author': 'James Clear', 'Publication Year': 2018, 'Review Rating': 4.8},
    {'Title': 'The Four Agreements', 'Author': 'Don Miguel Ruiz', 'Publication Year': 1997, 'Review Rating': 4.7}
]

# Append the new books to the DataFrame
df = pd.concat([df, pd.DataFrame(new_books)], ignore_index=True)

# Add more recent self-help books
recent_self_help_books = [
    {'Title': 'The Power of Habit', 'Author': 'Charles Duhigg', 'Publication Year': 2012, 'Review Rating': 4.6},
    {'Title': 'Think Like a Monk', 'Author': 'Jay Shetty', 'Publication Year': 2020, 'Review Rating': 4.5},
    {'Title': 'Mindset: The New Psychology of Success', 'Author': 'Carol S. Dweck', 'Publication Year': 2006, 'Review Rating': 4.7},
    {'Title': 'The Subtle Art of Not Giving a F*ck', 'Author': 'Mark Manson', 'Publication Year': 2016, 'Review Rating': 4.2},
    {'Title': 'The 5 Second Rule', 'Author': 'Mel Robbins', 'Publication Year': 2017, 'Review Rating': 4.0}
]

# Append the recent self-help books to the DataFrame
df = pd.concat([df, pd.DataFrame(recent_self_help_books)], ignore_index=True)

# Display the updated DataFrame
print("\nDataFrame with Added Books:")
print(df)
Let’s review the code snippet above:
Lines 2–8: We create a list, new_books, containing a dictionary for each new book, with details such as title, author, publication year, and review rating.
Line 11: We convert new_books to a DataFrame and append it to df with pd.concat. The ignore_index=True parameter ensures a continuous and unique index after appending.
Lines 14–20: We create another list, recent_self_help_books, containing dictionaries for more recent self-help books.
Line 23: Similar to before, we use pd.concat to append these recent self-help books to the DataFrame, again with ignore_index=True.
Lines 26–27: We add the printing statements for displaying the updated DataFrame, showcasing the new additions.
Let’s quickly look at different ways to explore a DataFrame.
We print every element of each row in our DataFrame with the iterrows() function, as follows:
# Loop through each row and print information
print("\nPrinting Information for Each Book")
print("************************************")
for index, row in df.iterrows():
    print("\tindex: {}".format(index))
    print("\tBook Name: {}".format(row['Title']))
    print("\tAuthor's Name: {}".format(row['Author']))
    print("\tYear and Rating: {}-{}".format(row['Publication Year'], row['Review Rating']))
    print("************************************")
Let’s review the code snippet above:
Line 2: We print a header indicating that each book’s information will be displayed.
Line 3: We print a separator line for better readability.
Lines 4–9: We use a for loop to iterate through each row in the DataFrame (df) using iterrows().
Line 4: For each iteration, index represents the DataFrame index, and row is a pandas Series containing the data for that specific row.
Lines 5–8: We print information about each book, including:
The index of the book in the DataFrame
The book’s title (the 'Title' column value)
The author’s name (the 'Author' column value)
The publication year and review rating (the 'Publication Year' and 'Review Rating' column values)
Line 9: We print another separator line to separate the information for each book.
We can exploit different methods of a DataFrame to get different information, as shown below:
print("************************************")
print("Print df head: {}".format(df.head(3)))
print("************************************")
print("Print df tail: {}".format(df.tail(3)))
print("************************************")
print("Column Names: {}".format(df.columns))
print("************************************")
print("DataFrame length: {}".format(len(df)))
print("************************************")
print("DataFrame shape: {}".format(df.shape))
print("************************************")
print("Basic Statistics: {}".format(df.describe()))
print("************************************")
print("Single column\n: {}".format(df['Author']))
print("************************************")
print("Single row\n : {}".format(df.loc[2]))
Let’s review the code snippet above:
Line 2: We print the first three rows of the DataFrame (df) with df.head(3). The default value of df.head() is 5.
Line 4: We print the last three rows of the DataFrame (df) with df.tail(3). The default value of df.tail() is 5.
Line 6: The df.columns attribute returns the column labels of the DataFrame as an Index object.
Line 8: The len(df) function returns the number of rows in the DataFrame.
Line 10: The df.shape attribute returns a tuple representing the dimensions of the DataFrame (number of rows, number of columns).
Line 12: The df.describe() function generates basic statistics for the numerical columns: count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum. (For this particular example, these stats might not be meaningful.)
Line 14: The df['Author'] column indexing extracts the Author column as a pandas Series.
Line 16: The df.loc[2] row indexing extracts the row with index 2.
Other than that, various data manipulation functions exist for working with DataFrames. Here are some common examples:
Drop rows: The df.drop(index) method allows for removing specified rows from the DataFrame based on the provided index or a list of indexes. This is useful when certain rows need to be excluded from the analysis.
Drop columns: By using df.drop(columns), specific columns can be removed from the DataFrame. This operation is handy when certain columns are unnecessary or redundant for the analysis.
Filter rows based on condition: Applying df[df['column'] > value] filters rows based on a specified condition, retaining only those rows that meet the given criterion. This is effective for extracting subsets of data.
Drop missing values (NaN): The df.dropna() method eliminates rows containing any missing values (NaN) in the DataFrame. This is valuable when missing data might interfere with analysis.
Fill missing values: The df.fillna(value) method is employed to fill missing values (NaN) in the DataFrame with a specified value. This is useful for handling missing data while maintaining the structure of the DataFrame.
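The operations above can be sketched on a small DataFrame; all the names and values here are made up for illustration:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Title': ['Book A', 'Book B', 'Book C'],
                   'Rating': [4.5, np.nan, 3.0]})

# Filter rows based on a condition
highly_rated = df[df['Rating'] > 4.0]

# Drop rows containing missing values (returns a new DataFrame)
complete = df.dropna()

# Fill missing values with a default instead
filled = df.fillna(0.0)

# Drop a column
titles_only = df.drop(columns=['Rating'])

print(highly_rated)
print(filled)
```

Note that each of these methods returns a new DataFrame by default, leaving the original df unchanged.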
We covered the basics of a powerful pandas tool—DataFrame—and learned how to create and manipulate it, featuring rows accessed via an index, which can be integer- or string-based, and columns behaving like pandas Series. DataFrames serve as a cornerstone in data analysis. DataFrames can be created from diverse sources, including dictionaries, lists, NumPy ndarrays, and various file types. This blog provides practical examples, demonstrating operations such as appending new rows, iterating through rows, and using built-in functions for insightful data exploration. We also looked at common data manipulation tasks, such as dropping rows/columns, filtering, and handling missing values, demonstrating pandas’s versatility in handling structured and heterogeneous data for comprehensive data analysis.
Interested to learn more about the pandas DataFrame? Check out the following links:
Pandas: Python for Data Analysis
Pandas is a very popular Python library that provides powerful, flexible, and high-performance tools to analyze and process data. Moreover, data science is in high demand and is one of the most highly paid professions today. If you’re looking to get into data science, machine learning, or if you simply want to brush up on your analytical skills, then this is the Path for you. It covers topics from basic representation of data to advanced data analysis techniques. You’ll also learn about feature engineering using pandas. By the end of this Path, you’ll be able to perform data analysis on different data sets.
Mastering Data Analysis with Python Pandas
There are several exercises that focus on how to use a particular function and method. The functions are covered in detail by explaining the important parameters and how to use them. By completing this course, you will be able to do data analysis and manipulation with Pandas easily and efficiently.
Effective Data Manipulation with pandas
The course is a comprehensive guide to using pandas for data analysis. It covers a wide range of topics related to data manipulation, including filtering, merging, grouping, pivoting, and reshaping data. In this course, you’ll learn about the best practices and efficient techniques for working with data. You’ll be provided with numerous examples of real-world data analysis problems and gain hands-on experience with the pandas library to solve them effectively. This course will also cover performance considerations and provide insights on common pitfalls to avoid when using pandas. The course is suitable for both beginners and experienced users of pandas. It includes detailed explanations of pandas concepts and functions, as well as tips and tricks for optimizing your code. By the time you’re done with this course, you’ll be able to effectively handle data using pandas.