Introduction to DataFrames
Explore the foundational concepts of pandas DataFrames, starting with their column-oriented structure and relationship to Series. Understand various methods to construct DataFrames, including from dictionaries, CSV files, and NumPy arrays. Learn how to access rows and columns, and use the two axes for data operations. This lesson prepares you to handle tabular data efficiently in pandas.
In pandas, the two-dimensional counterpart to the one-dimensional Series is the DataFrame. If we want to understand this data structure, we should start by looking at how it’s constructed.
Database and spreadsheet analogues
Thinking of a DataFrame as row-oriented makes its interface feel wrong. Many tabular data structures are row-oriented, perhaps because spreadsheets and CSV files are processed row by row, or perhaps because many OLTP databases are row-oriented out of the box. A DataFrame, however, is often used for analytical purposes and is better understood as column-oriented, where each column is a Series.
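The column-oriented view can be checked directly in pandas. The sketch below assumes pandas is installed; the column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; names and values are illustrative only.
df = pd.DataFrame({
    "name": ["Ada", "Grace", "Linus"],
    "age": [36, 45, 28],
})

# Selecting a single column returns a Series.
ages = df["age"]
print(type(ages).__name__)  # Series

# The column shares the DataFrame's index.
print(ages.index.equals(df.index))  # True

# Each column carries its own dtype.
print(df.dtypes)
```

Because every column is a Series, the operations available on a Series apply to each column of the DataFrame.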
Note: In practice, many highly optimized analytical databases (those used for OLAP cubes) are also column-oriented. Laying out the data in a columnar manner can improve performance and reduce resource usage. Columns of a single type can be compressed easily. Analyzing a column requires loading only that column, whereas a row-oriented database would need to read every row of the table to access an entire column.
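To make the access pattern in the note concrete, here is a small pure-Python sketch that stores the same hypothetical records both ways. It only illustrates the idea and does not model how any real database lays out data.

```python
# Hypothetical records, stored two ways.
rows = [
    {"city": "Oslo", "temp": 4.0},
    {"city": "Lima", "temp": 19.5},
    {"city": "Cairo", "temp": 27.0},
]
columns = {
    "city": ["Oslo", "Lima", "Cairo"],
    "temp": [4.0, 19.5, 27.0],
}

# Row-oriented: averaging one field still walks every full record.
row_mean = sum(r["temp"] for r in rows) / len(rows)

# Column-oriented: only the one column is touched.
temps = columns["temp"]
col_mean = sum(temps) / len(temps)

print(row_mean == col_mean)  # True
```

Both layouts give the same answer. The difference is that the columnar version reads one homogeneous list, while the row version has to visit every record.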
A simple Python version
Below is a simple attempt to create a tabular Python data structure that is column-oriented. It has a zero-based integer index, but that’s not required; the index could also be string-based. Each column is similar to the Series-like structure developed previously: ...
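A minimal sketch of such a structure might look like the following. The dictionary layout, the key names (`index`, `columns`, `name`, `data`), and the helper functions `get_col` and `get_row` are all assumptions made for illustration. They are not necessarily the ones used for the Series-like structure earlier.

```python
# Hypothetical column-oriented table; key names are assumptions.
index = [0, 1, 2]

table = {
    "index": index,
    "columns": [
        {"name": "growth", "data": [0.5, 0.7, 1.2]},
        {"name": "label", "data": ["a", "b", "c"]},
    ],
}

def get_col(table, name):
    """Return the column dict with the given name."""
    for col in table["columns"]:
        if col["name"] == name:
            return col
    raise KeyError(name)

def get_row(table, label):
    """Assemble a row by looking up one position in every column."""
    pos = table["index"].index(label)
    return {col["name"]: col["data"][pos] for col in table["columns"]}

print(get_col(table, "growth")["data"])  # [0.5, 0.7, 1.2]
print(get_row(table, 1))  # {'growth': 0.7, 'label': 'b'}
```

In this sketch a column comes back as stored, while a row has to be assembled from every column. That is the trade-off a column-oriented layout makes.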