Accessors and Operations
Learn the accessors and operations for handling sparse arrays.
Introduction
Having learned how sparse data can be represented as SparseArray objects in pandas, let’s now look at the accessors and operations we can apply to these sparse arrays. We’ll use a sparse dataset of movie ratings, scored between 1 and 5 by different viewers, where NaN means that the viewer hasn’t rated the movie yet:
Movie Ratings By Viewers
Viewer | Movie 1 | Movie 2 | Movie 3 | Movie 4 | Movie 5 | Movie 6 |
Viewer 1 | NaN | 3.0 | NaN | 5.0 | 3.0 | NaN |
Viewer 2 | NaN | NaN | 3.0 | NaN | NaN | 3.0 |
Viewer 3 | 2.0 | 1.0 | 1.0 | NaN | NaN | 1.0 |
Viewer 4 | 5.0 | NaN | NaN | NaN | NaN | 5.0 |
Viewer 5 | NaN | NaN | NaN | 2.0 | NaN | NaN |
Viewer 6 | 2.0 | NaN | NaN | NaN | NaN | NaN |
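The snippets in this lesson refer to a DataFrame named df that holds this table. Its construction isn't shown here, so the following is only a minimal sketch of one way to build it; the nan alias and the ratings list are illustrative names, not part of the original lesson.

```python
import numpy as np
import pandas as pd

nan = np.nan  # short alias, used only to keep the table readable

# Ratings copied from the table above: rows are viewers, columns are movies
ratings = [
    [nan, 3.0, nan, 5.0, 3.0, nan],
    [nan, nan, 3.0, nan, nan, 3.0],
    [2.0, 1.0, 1.0, nan, nan, 1.0],
    [5.0, nan, nan, nan, nan, 5.0],
    [nan, nan, nan, 2.0, nan, nan],
    [2.0, nan, nan, nan, nan, nan],
]

df = pd.DataFrame(
    ratings,
    index=[f"Viewer {i}" for i in range(1, 7)],
    columns=[f"Movie {j}" for j in range(1, 7)],
)

print(df)
# Count how many cells actually hold a rating
print("Rated cells:", int(df.notna().sum().sum()))
```

Most cells are NaN, which is what makes a sparse representation worthwhile for this data.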
Accessors
A pandas Series backed by a SparseArray supports the .sparse accessor for sparse-specific methods and attributes. It’s similar to the other accessors we’ve seen before, such as .str for string data and .dt for datetime data. First, let’s convert the original DataFrame into a fully sparse representation:
# Convert df to sparse representation for all columns
df_sparse = df.copy()
for col in df_sparse.columns:
    df_sparse[col] = pd.arrays.SparseArray(df_sparse[col])

# View dtypes
print(df_sparse.dtypes)
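As an alternative to looping over the columns, the whole DataFrame can be cast in one step with astype and a SparseDtype. This is a sketch rather than the lesson's own approach; the two-column df below is a cut-down stand-in for the ratings table so that the example runs on its own.

```python
import numpy as np
import pandas as pd

# Stand-in for the lesson's df: the first two movie columns only
df = pd.DataFrame({
    "Movie 1": [np.nan, np.nan, 2.0, 5.0, np.nan, 2.0],
    "Movie 2": [3.0, np.nan, 1.0, np.nan, np.nan, np.nan],
})

# Cast every column to a sparse float dtype with NaN as the fill value
df_sparse = df.astype(pd.SparseDtype("float64", np.nan))
print(df_sparse.dtypes)

# Converting back to dense should recover the original values
df_dense = df_sparse.sparse.to_dense()
print(df_dense.equals(df))
```

The loop version is handy when different columns need different fill values, while astype is shorter when one dtype fits all columns.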
We can then use the .sparse accessor to read attributes such as the fill value and non-fill values of a sparse column, and the density of a DataFrame (i.e., the proportion of non-fill values):
# Get fill value of a DataFrame column
print('Fill value of Movie 1 col:', df_sparse['Movie 1'].sparse.fill_value)

# Get non-fill values of a DataFrame column
print('Non-fill values of Movie 1 col:', df_sparse['Movie 1'].sparse.sp_values)

# Get density of Sparse DataFrame
print('Density:', df_sparse.sparse.density)
In the example above, the fill_value and sp_values attributes are read at the column level (i.e., from a column whose underlying array is a SparseArray with SparseDtype). The density attribute, on the other hand, comes from the DataFrame.sparse accessor and applies to the entire sparse DataFrame. This works because pandas provides a .sparse accessor for DataFrames as well.
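To make the column-level versus DataFrame-level distinction concrete, the sketch below compares the accessor values with a density computed by hand from the dense data. It again uses a two-column stand-in for the ratings table, and manual_density is an illustrative name.

```python
import numpy as np
import pandas as pd

# Stand-in for the lesson's df: the first two movie columns only
df = pd.DataFrame({
    "Movie 1": [np.nan, np.nan, 2.0, 5.0, np.nan, 2.0],
    "Movie 2": [3.0, np.nan, 1.0, np.nan, np.nan, np.nan],
})
df_sparse = df.astype(pd.SparseDtype("float64", np.nan))

# Column level: attributes of a single sparse column
col = df_sparse["Movie 1"]
print("Stored values in Movie 1:", col.sparse.npoints)
print("Density of Movie 1:", col.sparse.density)

# DataFrame level: density across all columns
print("Density of DataFrame:", df_sparse.sparse.density)

# Hand-computed check: share of cells that are not NaN
manual_density = df.notna().to_numpy().mean()
print("Manual density:", manual_density)
```

Because every column here has the same number of rows, the DataFrame-level density matches the overall share of rated cells.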
The DataFrame.sparse accessor also lets us perform conversions to other formats. For instance, the following code shows how to convert a sparse DataFrame into a sparse SciPy COO (Coordinate Format) matrix:
# Convert df to sparse representation for all columns
df_sparse = df.copy()

# Ensure every sparse array has fill value of 0 in order to convert to COO
for col in df_sparse.columns:
    df_sparse[col] = pd.arrays.SparseArray(df_sparse[col], fill_value=0)

# Convert to SciPy COO matrix
coo_matrix = df_sparse.sparse.to_coo()
print(f'SciPy COO matrix:\n{coo_matrix}\n')
The COO representation is a sparse matrix format for efficiently storing ...
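As a rough illustration of the coordinate idea, the sketch below stores only the (row, column, value) triplets of the nonzero cells, using plain NumPy rather than SciPy. It assumes unrated cells are written as 0 and uses only the top-left 3×3 corner of the ratings table; the variable names are illustrative.

```python
import numpy as np

# Top-left 3x3 corner of the ratings table, with unrated cells written as 0
dense = np.array([
    [0.0, 3.0, 0.0],
    [0.0, 0.0, 3.0],
    [2.0, 1.0, 1.0],
])

# Coordinate format keeps three parallel arrays: row index, column index, value
rows, cols = np.nonzero(dense)
data = dense[rows, cols]

for r, c, v in zip(rows, cols, data):
    print(f"({r}, {c}) -> {v}")

# The dense matrix can be rebuilt from the triplets alone
rebuilt = np.zeros_like(dense)
rebuilt[rows, cols] = data
print(np.array_equal(rebuilt, dense))
```

Only five triplets are kept for nine cells here; the saving grows as the share of zero cells grows.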