New features in pandas 2.0

The pandas library version 2.0 was launched in April 2023, amid plenty of fanfare and excitement, after three years of development. Given the library’s popularity, the upgrade from pandas 1.x to 2.0 brings numerous key changes that affect many users. Let’s take a look at some of the key new features introduced in pandas 2.0, which is the version we use in this course.

Improved performance and memory efficiency

The pandas 2.0 update introduced PyArrow (the Python bindings for Apache Arrow) as an optional backing memory format for DataFrames, which were previously backed only by NumPy data structures, an approach that is inefficient for strings and missing data in particular. With Arrow extension arrays and memory structures as the backend, speed and memory utilization improve substantially because operations can leverage Arrow’s C++ implementation.

Previously, the heavy memory usage of the NumPy backend was a common problem that drove many users to explore alternative tools, such as Spark, Ray, etc. With PyArrow as the backend, users can now work with pandas more efficiently and benefit from faster operations on a columnar in-memory data representation.
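As a rough sketch of how opting in looks (the file name sales.csv is illustrative, and PyArrow must be installed), the Arrow backend can be requested when reading data or when constructing individual columns:

import pandas as pd

# Request Arrow-backed columns for all data read from the file.
df = pd.read_csv("sales.csv", dtype_backend="pyarrow")

# Individual columns can also use Arrow-backed dtypes explicitly.
s = pd.Series(["a", "b", None], dtype="string[pyarrow]")
print(s.dtype)  # string[pyarrow]

Note that the NumPy backend remains the default in pandas 2.0, so Arrow-backed dtypes are strictly opt-in.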

Support for non-nanosecond resolution in timestamps

A persistent problem within pandas was the exclusive use of nanosecond resolution for timestamps. This made it impossible to represent dates prior to September 21st, 1677, or beyond April 11th, 2262, which created difficulties for researchers examining time series data spanning multiple millennia.

The version 2.0 update adds support for additional resolutions, including second, millisecond, and microsecond precision.
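As a brief illustration (the dates are arbitrary), a column built from a NumPy datetime64[s] array keeps its second resolution in pandas 2.0 and can therefore hold dates well outside the nanosecond range, and an existing timestamp can be converted with as_unit():

import numpy as np
import pandas as pd

# Second resolution covers dates far beyond the nanosecond limits.
dates = np.array(["1066-10-14", "2500-01-01"], dtype="datetime64[s]")
s = pd.Series(dates)
print(s.dtype)  # datetime64[s]

# Convert an existing timestamp to millisecond resolution.
ts = pd.Timestamp("2023-04-03 12:00:00").as_unit("ms")
print(ts.unit)  # ms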

Enhanced support for nullable dtypes

Previously, handling null values was challenging due to pandas’ reliance on NumPy, which doesn’t support null values for certain data types, such as integer dtypes. As a result, an integer column was automatically converted to a float dtype when a null value was introduced, potentially causing a loss of precision.

The pandas 2.0 update has significantly improved the handling of nullable data types, allowing a dedicated null value (pd.NA) to be assigned instead of type-specific sentinel values, so a column can contain missing data without changing its dtype.
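A short example contrasting the two behaviors (the values are arbitrary): with the default NumPy backend, a null coerces an integer column to float, whereas the nullable Int64 dtype keeps the integers intact and stores pd.NA for the missing entry:

import pandas as pd

# Default backend: the null turns the integers into floats (NaN marks the gap).
s_float = pd.Series([1, 2, None])
print(s_float.dtype)  # float64

# Nullable dtype: the integers are preserved and the gap is pd.NA.
s_int = pd.Series([1, 2, None], dtype="Int64")
print(s_int.dtype)        # Int64
print(s_int[2] is pd.NA)  # True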

This enhancement is facilitated by a new parameter, dtype_backend, available on most I/O functions; when set to numpy_nullable, it returns a DataFrame backed by nullable data types, as shown in the example for CSV files below:
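A minimal sketch, assuming a file named data.csv whose integer columns contain missing values:

import pandas as pd

# Read the CSV using nullable data types instead of the NumPy defaults.
df = pd.read_csv("data.csv", dtype_backend="numpy_nullable")

# Integer columns with missing values now appear as Int64 rather than float64.
print(df.dtypes)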
