
NumPy vs. pandas: What’s the difference?

Saif Ali
Aug 23, 2024
24 min read

In the realm of data science and scientific computing, Python stands out as a powerful and versatile programming language. Python has a vast range of libraries available for these use cases, but two of the most widely used are NumPy and pandas.

If you’re stuck choosing between NumPy and pandas, that’s understandable. Both libraries have become indispensable tools for data scientists, analysts, and engineers, providing robust functionality for numerical computations and data manipulation. However, the choice becomes easier once you learn where each tool excels and, therefore, which is best for your data.

Let’s dive in!

We’ll cover the following

  • What is NumPy?

    • Strengths of NumPy

  • What is pandas?

    • Strengths of pandas

  • NumPy vs. pandas: The core differences

    • 1. Data structures

      • NumPy arrays

      • pandas Series and DataFrames

    • 2. Indexing and selection

      • NumPy indexing

      • pandas indexing

  • NumPy and pandas functionality

    • 1. Mathematical operations

      • NumPy: Mathematical operations

      • pandas Mathematical operations

    • 2. Loading data from file/dataset

      • NumPy: Loading data

      • pandas: Loading data

    • 3. Data manipulation

      • NumPy: Data manipulation

      • pandas: Data manipulation

  • Integration and ecosystem

    • Interoperability

      • NumPy: Interoperability

      • pandas: Interoperability

  • Use cases of NumPy and pandas

  • Pros and cons of NumPy and pandas

  • Comparison between NumPy and pandas

  • Conclusion

  • Next steps

What is NumPy?#

NumPy, short for Numerical Python, is a library that provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on the arrays. It was created in 2005 by Travis Oliphant, building on the earlier Numeric and Numarray libraries to create a more complete and efficient package for array computing.

NumPy’s core functionality revolves around the ndarray object, a powerful n-dimensional array that allows for efficient storage and manipulation of large datasets. These arrays provide a high-performance alternative to Python’s built-in lists, especially for large-scale numerical data.

Strengths of NumPy#

NumPy is renowned for its efficiency in handling numerical computations and its ability to process large datasets swiftly.

  • Numerical computations: NumPy offers a comprehensive suite of mathematical functions for operations such as linear algebra, random number generation, Fourier transforms, and statistical computations. Its functions are implemented in C, providing a significant speed advantage over pure Python code.

  • Handling of n-dimensional arrays: The ndarray object is designed to handle a variety of data shapes and sizes, from simple 1-dimensional arrays to complex n-dimensional datasets. This flexibility makes NumPy an essential tool for scientific computing, where data often comes in multi-dimensional forms.

  • Broadcasting: NumPy’s broadcasting feature allows arithmetic operations to be performed on arrays of different shapes and sizes without requiring explicit replication of data, making code more efficient and easier to write.
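
As a small illustration of the points above, the following sketch computes the same result with a Python list comprehension and with a single vectorized NumPy expression. The data and variable names are made up for illustration only.

```python
import numpy as np

# Illustrative data: the integers 0..9
values = list(range(10))

# Pure Python: loop over every element explicitly
squares_list = [v * v for v in values]

# NumPy: one vectorized expression; the loop runs in compiled code
arr = np.array(values)
squares_arr = arr * arr

print(squares_list)
print(squares_arr)
print("Same result:", squares_arr.tolist() == squares_list)
```

Both approaches give the same values. The vectorized form tends to pay off as the array grows, because the per-element work no longer happens in the Python interpreter.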

What is pandas?#

pandas is a powerful data manipulation and analysis library for Python, created by Wes McKinney in 2008. It was developed to address the need for a flexible, high-performance tool for working with structured data, which was lacking in the scientific Python ecosystem at the time.

The pandas library introduces two primary data structures:

  • Series

  • DataFrame

A Series is a one-dimensional labeled array capable of holding any data type, while a DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).

Strengths of pandas#

pandas is highly regarded for its versatility in data manipulation and ability to easily handle complex transformations, thanks to its intuitive syntax and robust set of functions.

  • Data manipulation: It provides a wide range of functions for data manipulation, including filtering, merging, reshaping, and aggregation. Its intuitive syntax makes it easy to perform complex data transformations and cleaning tasks.

  • Handling tabular data: The DataFrame structure is particularly well-suited for working with tabular data, similar to the structure of a database table or an Excel spreadsheet. This makes pandas an ideal tool for data analysis tasks in domains such as finance, economics, statistics, and many others.

  • Data alignment: It excels in handling missing data and aligning data from different sources based on their indexes. This capability is crucial for real-world data analysis, where data often comes with gaps or needs to be integrated from multiple sources.

  • Time series analysis: It offers powerful tools for time series analysis, including date range generation, frequency conversion, moving window statistics, and more, making it an excellent choice for analyzing time-based data.

NumPy vs. pandas: The core differences#

Understanding the core differences between NumPy and pandas is crucial for determining which library to use for specific tasks. Here, we will dive into two key aspects:

  1. Data structures

  2. Indexing mechanisms

1. Data structures#

With NumPy, we get arrays, and pandas gives us Series and DataFrames. Depending on the data you're working with, each library's data structures may be your deciding factor.

Let's explore which use cases each data structure excels in.

NumPy arrays#

NumPy’s primary data structure is the ndarray (n-dimensional array). This array object is homogeneous, meaning all elements are of the same type, and provides a range of functionalities for numerical computations.

NumPy data structure: ndarray

Properties:

  • Homogeneity: All elements in an ndarray are of the same type
  • n-dimensional: Can handle multidimensional data (e.g., 1D, 2D, 3D arrays)
  • Efficient: Optimized for performance, making it ideal for numerical computations

Use cases:

  • Scientific computing and simulations
  • Handling large numerical datasets
  • Machine learning algorithms

The following code creates a 2D NumPy array and performs element-wise squaring, demonstrating how an array can be used for efficient numerical operations:

import numpy as np
# Creating a 2D array (matrix)
array_2d = np.array([[1, 2, 3], [4, 5, 6]])
print("2D array (matrix):\n", array_2d)
# Performing element-wise operations
array_squared = array_2d ** 2
print("\n2D array (matrix) after performing element-wise operation:\n", array_squared)

Code explanation:

  • Line 1: We import the NumPy library and assign it the alias np.

  • Line 4: We create a 2D NumPy array (which can be thought of as a matrix) using np.array() function. The array is initialized with the values [[1, 2, 3], [4, 5, 6]]. This means it has 2 rows and 3 columns.

  • Line 5: We use print() to display a message "2D array (matrix):\n" followed by the contents of array_2d. The \n in the string is a newline character.

  • Line 8: We perform an element-wise operation on array_2d. In NumPy, operations like ** 2 on an array mean each element of the array is squared individually. So array_2d ** 2 squares each element of array_2d and stores the result in array_squared.

  • Line 9: We use print() to display a message "2D array (matrix) after performing element-wise operation:\n" followed by array_squared.

pandas Series and DataFrames#

The pandas library introduces two core data structures: Series and DataFrame. These structures are designed to handle labeled data intuitively and efficiently.

Series and DataFrames

Series properties:

  • One-dimensional: Similar to a column in a table
  • Labeled index: Each element is associated with an index
  • Flexible: Can hold different data types

DataFrame properties:

  • Two-dimensional: Similar to a table with rows and columns
  • Labeled axes: Both rows and columns are indexed
  • Heterogeneous: Can hold different data types in different columns

Use cases (Series and DataFrame):

  • Data analysis and manipulation
  • Handling and cleaning tabular data
  • Time series analysis

The following code demonstrates creating a pandas Series with a custom index and a DataFrame from a dictionary, showcasing the flexibility and intuitive handling of labeled data in pandas:

import pandas as pd
# Creating a Series
series = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print("Series:\n", series)
# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)
print("\nDataFrame:\n", df)

Code explanation:

  • Line 1: We import the pandas library and assign it the alias pd.

  • Line 4: We create a pandas Series using the pd.Series() function. Here, we pass a list [10, 20, 30] as data and specify index=['a', 'b', 'c'] to label each element in the Series.

  • Line 5: We use print() to display a message "Series:\n" followed by the contents of series. The \n in the string is a newline character.

  • Lines 8–9: We create a pandas DataFrame using the pd.DataFrame() function. Here, data is a dictionary where keys are column names ('Name' and 'Age') and values are lists representing the data in each column (['Alice', 'Bob', 'Charlie'] and [25, 30, 35], respectively).

  • Line 10: We use print() to display a message "DataFrame:\n" followed by the contents of df.
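
The contrast between a heterogeneous DataFrame and a homogeneous ndarray can be seen by converting one into the other. The sketch below reuses a shortened version of the sample data; the variable names ages and mixed are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# Each DataFrame column keeps its own dtype
print(df.dtypes)

# A single numeric column converts to a homogeneous NumPy array
ages = df["Age"].to_numpy()
print(ages, ages.dtype)

# Converting mixed columns forces one common dtype (here: object)
mixed = df.to_numpy()
print(mixed.dtype)
```

The object dtype in the last step is a generic fallback, so converting a mixed DataFrame to an array usually gives up NumPy's numerical speed advantage.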

2. Indexing and selection#

Indexing and selection are fundamental operations for both NumPy and pandas. Indexing is the method used to access individual elements, slices, or a subset of elements within an array, and selection is the process of retrieving a subset of elements. However, the two libraries offer different methods and flexibilities for accessing and modifying data.

NumPy indexing#

NumPy arrays allow for both basic and advanced indexing techniques. Basic indexing uses integers and slices, while advanced indexing uses boolean or integer arrays to select elements.

The following code shows how to access elements in a NumPy array using integer indexing, slicing, and boolean indexing:

import numpy as np
# Creating a 1D array
array_1d = np.array([10, 20, 30, 40, 50])
# Display 1D array
print("1D array: ", array_1d)
# Basic indexing
print("Accessing the third element: ", array_1d[2])
# Slicing
print("Accessing elements from index 1 to 3: ", array_1d[1:4])
# Boolean indexing
print("Accessing elements greater than 25: ", array_1d[array_1d > 25])

Code explanation:

  • Line 10: We print the message "Accessing the third element: " followed by array_1d[2]. This accesses and displays the third element (index 2) of array_1d.

  • Line 13: We print the message "Accessing elements from index 1 to 3: " followed by array_1d[1:4]. This performs slicing on array_1d, accessing elements from index 1 (inclusive) to index 4 (exclusive) and displaying them.

  • Line 16: We print the message "Accessing elements greater than 25: " followed by array_1d[array_1d > 25]. This uses boolean indexing to filter elements in array_1d that are greater than 25 and display them.
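
The same three indexing forms can also appear on the left-hand side of an assignment to modify an array. The following is a minimal sketch using the same sample values; the name view is illustrative.

```python
import numpy as np

array_1d = np.array([10, 20, 30, 40, 50])

# Assign through an integer index
array_1d[0] = 15

# A basic slice is a view: writing to it changes the original array
view = array_1d[1:3]
view[:] = 0

# Boolean indexing on the left-hand side updates the matching elements
array_1d[array_1d > 35] = 99

print(array_1d)  # [15  0  0 99 99]
```

Because slices are views, copy a slice explicitly (for example with .copy()) when the original array must stay unchanged.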

pandas indexing#

The pandas library provides more flexible and powerful indexing options. It supports both label-based and location-based indexing through .loc and .iloc.

  • Label-based indexing (.loc): Access elements by labels.

  • Location-based indexing (.iloc): Access elements by integer location.

The following code demonstrates accessing elements in a pandas DataFrame using both label-based and location-based indexing:

import pandas as pd
# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data, index=['a', 'b', 'c'])
# Display DataFrame
print("DataFrame:\n", df)
# Label-based indexing
print("\nAccessing row with label 'b': ", df.loc['b'])
# Location-based indexing
print("\nAccessing the second row (index 1): ", df.iloc[1])

Code explanation:

  • Line 11: We print the message "Accessing row with label 'b': " followed by df.loc['b']. This uses label-based indexing (loc) to access and display the row labeled 'b' in the DataFrame df.

  • Line 14: We print the message "Accessing the second row (index 1): " followed by df.iloc[1]. This uses location-based indexing (iloc) to access and display the second row (index 1) in the DataFrame df.
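
A few further selection patterns, sketched on the same sample DataFrame. The names by_label, by_position, and older are illustrative. Note the different slice semantics of .loc and .iloc.

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data, index=['a', 'b', 'c'])

# Select a single cell by row label and column label
print(df.loc['b', 'Age'])  # 30

# Label slices with .loc include the end label
by_label = df.loc['a':'b']

# Position slices with .iloc exclude the end position
by_position = df.iloc[0:1]

print(len(by_label), len(by_position))

# Boolean selection works much like NumPy's boolean indexing
older = df.loc[df['Age'] > 28, 'Name']
print(older.tolist())
```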


NumPy vs. pandas functionality#

NumPy and pandas each provide a rich set of functionalities that cater to different needs in data science and analysis. As such, your specific needs will influence your choice of library.

We’ll dive into specific capabilities of each library, focusing on:

  • Mathematical operations

  • Loading data from file/dataset

  • Data manipulation

Mathematical operations#

Mathematical operations are fundamental in data analysis and scientific computing, enabling tasks like statistical calculations and modeling.

NumPy: Mathematical operations#

NumPy excels in numerical computations, providing a wide array of mathematical functions that are optimized for performance. These functions make it a powerful tool for tasks involving linear algebra, random sampling, and Fourier transforms.

Linear algebra#

NumPy offers comprehensive support for linear algebra operations, including:

  • Matrix multiplication

  • Decomposition

  • Inversion

  • Eigenvalue calculations

These functionalities are essential for solving systems of linear equations and performing various mathematical transformations.

import numpy as np
# Define two matrices
matrix_a = np.array([[1, 2],
                     [3, 4]])
matrix_b = np.array([[5, 6],
                     [7, 8]])
# Print the original matrices
print("\nMatrix A:")
print(matrix_a)
print("\nMatrix B:")
print(matrix_b)
# Perform matrix operations
print("\nMatrix operations:")
print("Transpose of Matrix A:\n", np.transpose(matrix_a))
print("Determinant Matrix A:", np.linalg.det(matrix_a))
print("Inverse Matrix A:\n", np.linalg.inv(matrix_a))
print("Trace Matrix A:", np.trace(matrix_a))
# Perform matrix multiplication
print("\nMatrix Multiplication:")
result_mult = np.dot(matrix_a, matrix_b)
print(result_mult)
# Perform QR decomposition (alternative to LU decomposition)
print("\nQR Decomposition of Matrix A:")
q, r = np.linalg.qr(matrix_a)
print("Q Matrix:")
print(q)
print("R Matrix:")
print(r)
# Perform matrix inversion
print("\nInverse of Matrix A:")
result_inv = np.linalg.inv(matrix_a)
print(result_inv)
# Perform eigenvalue and eigenvector calculation
print("\nEigenvalues and Eigenvectors of Matrix A:")
eigenvalues, eigenvectors = np.linalg.eig(matrix_a)
print("Eigenvalues:")
print(eigenvalues)
print("Eigenvectors:")
print(eigenvectors)

Code explanation:

  • Lines 18–22: Perform various matrix operations on matrix_a:

    • Transpose: np.transpose(matrix_a) calculates and prints the transpose of matrix_a.

    • Determinant: np.linalg.det(matrix_a) computes and prints the determinant of matrix_a.

    • Inverse: np.linalg.inv(matrix_a) computes and prints the inverse of matrix_a.

    • Trace: np.trace(matrix_a) computes and prints the trace (sum of diagonal elements) of matrix_a.

  • Lines 25–27: We perform matrix multiplication using np.dot(matrix_a, matrix_b). Store the result in result_mult and print it.

  • Lines 30–35: We perform QR decomposition of matrix_a using np.linalg.qr(matrix_a). We store the matrices q (orthogonal/unitary matrix) and r (upper triangular matrix) and print them.

  • Lines 38–40: We compute the inverse of matrix_a using np.linalg.inv(matrix_a). We store the result in result_inv and print it.

  • Lines 43–48: We compute the eigenvalues and eigenvectors of matrix_a using np.linalg.eig(matrix_a). We store the eigenvalues in eigenvalues and eigenvectors in eigenvectors, and print them. This computes and prints the eigenvalues and corresponding eigenvectors of matrix_a.
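
Since the section mentions solving systems of linear equations, here is a minimal sketch using np.linalg.solve. The two equations are a made-up example, and the names coefficients and constants are illustrative.

```python
import numpy as np

# Hypothetical system:  x + 2y = 5
#                      3x + 4y = 11
coefficients = np.array([[1.0, 2.0],
                         [3.0, 4.0]])
constants = np.array([5.0, 11.0])

solution = np.linalg.solve(coefficients, constants)
print("Solution [x, y]:", solution)

# Check the result by substituting it back into the system
print("Residual OK:", np.allclose(coefficients @ solution, constants))
```

Calling np.linalg.solve is generally preferable to computing the inverse and multiplying, since it avoids forming the inverse explicitly.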

Random sampling#

NumPy’s random module allows for generating random numbers, creating random samples, and performing random sampling from different distributions.

import numpy as np
# Generating random numbers from a normal distribution
random_numbers = np.random.normal(loc=0, scale=1, size=5)
print("Random numbers from a normal distribution:\n", random_numbers)

Code explanation:

  • Line 4: We use the np.random.normal() function to generate an array random_numbers of 5 random numbers drawn from a normal distribution:

    • loc=0: Mean of the distribution (centered at 0)

    • scale=1: Standard deviation of the distribution

    • size=5: Number of random numbers to generate
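
When results need to be reproducible, a seeded generator can be used instead of the global random state. The sketch below uses np.random.default_rng; the seed value 42 is arbitrary.

```python
import numpy as np

# Two generators created with the same (arbitrary) seed
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

sample_a = rng_a.normal(loc=0, scale=1, size=5)
sample_b = rng_b.normal(loc=0, scale=1, size=5)

print(sample_a)
print("Identical draws:", np.array_equal(sample_a, sample_b))
```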

Fourier transforms#

NumPy provides functions to compute the discrete Fourier transform, which is useful in signal processing.

import numpy as np
# Creating a sample signal
signal = np.array([1, 2, 1, 0, 1, 2, 1, 0])
print("Signal: ", signal)
# Computing the Fourier transform
fourier_transform = np.fft.fft(signal)
print("Fourier transform of the signal:\n", fourier_transform)

Code explanation:

  • Line 8: We compute the Fourier transform of signal using np.fft.fft(signal). The result is stored in fourier_transform.
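
To show how the transform output might be interpreted, the following sketch builds a synthetic 5 Hz sine wave and recovers its frequency. The sample rate and the signal are assumptions chosen for illustration, not data from the example above.

```python
import numpy as np

sample_rate = 100                          # samples per second (assumed)
t = np.arange(sample_rate) / sample_rate   # one second of timestamps
signal = np.sin(2 * np.pi * 5 * t)         # synthetic 5 Hz sine wave

# rfft returns only the non-negative frequencies of a real-valued signal
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)

# The bin with the largest magnitude marks the dominant frequency
dominant = freqs[np.argmax(np.abs(spectrum))]
print("Dominant frequency (Hz):", dominant)
```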

pandas: Mathematical operations#

Unlike NumPy, pandas is not designed for advanced mathematical computations. Instead, it offers powerful tools for data aggregation, merging, reshaping, and handling missing data, which are essential for data analysis.

Data aggregation#

pandas provides functions for summarizing data, such as groupby, sum, mean, and count.

import pandas as pd
# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Alice', 'Bob'], 'Score': [85, 90, 78, 88]}
df = pd.DataFrame(data)
# Display DataFrame
print("DataFrame:\n", df)
# Aggregating data by 'Name'
grouped = df.groupby('Name').mean()
print("\nAggregated data:\n", grouped)

Code explanation:

  • Line 11: We use the groupby() method on df to group data by the 'Name' column, and then calculate the mean using the mean() method. The result is stored in grouped.

Merging#

pandas allows for merging and joining DataFrames using various methods like merge, join, and concat.

import pandas as pd
# Creating two DataFrames
data1 = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
data2 = {'Name': ['Alice', 'Bob'], 'Score': [85, 90]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
# Display DataFrames
print("DataFrame1:\n", df1)
print("\nDataFrame2:\n", df2)
# Merging the DataFrames on 'Name'
merged_df = pd.merge(df1, df2, on='Name')
print("\nMerged DataFrame:\n", merged_df)

Code explanation:

  • Line 12: We use the pd.merge() function to merge df1 and df2 based on the 'Name' column. The result is stored in merged_df.
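
In the example above, every name appears in both DataFrames. When keys only partly overlap, the how parameter of pd.merge() controls which rows are kept. The tables below are hypothetical and exist only to show the difference.

```python
import pandas as pd

# Hypothetical tables whose 'Name' keys only partly overlap
ages = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})
scores = pd.DataFrame({'Name': ['Alice', 'Bob', 'Dana'], 'Score': [85, 90, 70]})

inner = pd.merge(ages, scores, on='Name')               # default: how='inner'
left = pd.merge(ages, scores, on='Name', how='left')    # keep every row of ages
outer = pd.merge(ages, scores, on='Name', how='outer')  # keep rows from both

print(len(inner), len(left), len(outer))
print(outer)
```

Rows without a match on the other side are filled with NaN in the left and outer results.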

Reshaping#

pandas offers functions like pivot, melt, and stack for reshaping DataFrames.

import pandas as pd
# Creating a DataFrame
data = {'Name': ['Alice', 'Bob'], 'Math': [85, 90], 'Science': [88, 92]}
df = pd.DataFrame(data)
# Display DataFrame
print("Original DataFrame:\n", df)
# Melting the DataFrame to unpivot subjects into rows
melted = pd.melt(df, id_vars=['Name'], value_vars=['Math', 'Science'], var_name='Subject', value_name='Score')
print("\nMelted DataFrame (Subjects as rows):\n", melted)
# Using stack to pivot the DataFrame
stacked = df.set_index('Name').stack().reset_index(name='Score').rename(columns={'level_1': 'Subject'})
print("\nStacked DataFrame (Subjects as rows):\n", stacked)
# Pivoting the melted DataFrame back to original form using pivot
unmelted = melted.pivot(index='Name', columns='Subject', values='Score').reset_index()
print("\nPivoted DataFrame (Original form):\n", unmelted)

Code explanation:

  • Line 11: We use the pd.melt() function to melt (unpivot) the DataFrame df:

    • id_vars=['Name']: Specifies the 'Name' column as the identifier variable (unchanged).

    • value_vars=['Math', 'Science']: Specifies the 'Math' and 'Science' columns to melt.

    • var_name='Subject': Renames the variable column to 'Subject'.

    • value_name='Score': Renames the value column to 'Score'.

    • The result is stored in melted.

  • Line 15: We use the stack() method to pivot the DataFrame df by stacking columns into rows:

    • set_index('Name'): Sets the 'Name' column as the index.

    • stack(): Pivots all remaining columns into rows.

    • reset_index(name='Score'): Resets the index and renames the resulting stacked column to 'Score'.

    • rename(columns={'level_1': 'Subject'}): Renames the column previously holding column names to 'Subject'.

    • The result is stored in stacked.

  • Line 19: We use the pivot() method on the melted DataFrame to pivot it back to the original form:

    • index='Name': Sets the 'Name' column as the index.

    • columns='Subject': Specifies the 'Subject' column values to pivot.

    • values='Score': Specifies the 'Score' column values to populate the pivoted DataFrame.

    • reset_index(): Resets the index to convert 'Name' from the index back to a regular column.

    • The result is stored in unmelted.

Handling missing data#

pandas provides functions to detect, remove, or fill missing data in DataFrames.

import pandas as pd
# Creating a DataFrame with missing values
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Score': [85, None, 78]}
df = pd.DataFrame(data)
# Display DataFrame
print("Original DataFrame:\n", df)
# Filling missing values with the mean
df['Score'] = df['Score'].fillna(df['Score'].mean())
print("\nDataFrame after filling missing values:\n", df)

Code explanation:

  • Line 11: We use the fillna() method on the 'Score' column of df to fill missing values (None) with the mean of existing values in the column:

    • df['Score'].mean(): Computes the mean of non-missing values in 'Score'.

    • fillna() returns a new Series rather than modifying the column in place, so we assign the result back to df['Score'].

Loading data from file/dataset#

Loading data from external files or datasets is a fundamental operation in data analysis and scientific computing. Both NumPy and pandas provide capabilities to read data from various file formats, each tailored to different use cases.

NumPy: Loading data#

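NumPy primarily deals with numerical data in the form of arrays. It provides basic functionality to load data from text files, such as CSV files, and stores the result in its own ndarray format, which is homogeneous and optimized for numerical computations. By default, np.loadtxt() expects every value to be numeric, so a file with a header row needs that row skipped (for example, with skiprows=1).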

The following is an example of loading data with NumPy:

import numpy as np
# Load data from a CSV file into a NumPy ndarray
data_np = np.loadtxt('data.csv', delimiter=',')
print("NumPy Array:\n", data_np)

Code explanation:

  • Line 4: We use the np.loadtxt() function to load data from a CSV file 'data.csv' into a NumPy ndarray data_np:

    • 'data.csv': Specifies the path to the CSV file to be loaded.

    • delimiter=',': Specifies that the data in the CSV file is separated by commas.

pandas: Loading data#

pandas excels in handling structured data, including loading data from various file formats such as CSV, Excel, SQL databases, and more. It stores the data in DataFrame objects, which are flexible and capable of handling heterogeneous data types.

The following is an example of loading data with pandas:

import pandas as pd
# Load data from a CSV file into a pandas DataFrame
df = pd.read_csv('data.csv')
print("Pandas DataFrame:\n", df)

Code explanation:

  • Line 4: We use the pd.read_csv() function to load data from a CSV file 'data.csv' into a pandas DataFrame df:

    • 'data.csv': Specifies the path to the CSV file to be loaded.
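
The two loaders can be compared side by side without a file on disk. The sketch below uses io.StringIO as an in-memory stand-in for data.csv; the CSV contents are made up for illustration.

```python
import io
import numpy as np
import pandas as pd

# In-memory stand-in for a small CSV file (made-up contents)
csv_text = "height,weight\n1.70,65.0\n1.82,80.5\n"

# NumPy: skip the header row; every remaining value must be numeric
data_np = np.loadtxt(io.StringIO(csv_text), delimiter=',', skiprows=1)
print(data_np.shape)

# pandas: the header row becomes the column labels
df = pd.read_csv(io.StringIO(csv_text))
print(df.columns.tolist())
```

The NumPy result is a plain 2D array with no column names, while the DataFrame keeps the labels from the header.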

Data manipulation#

Effective data manipulation is crucial in preparing data for analysis and ensuring it meets the requirements of various computational tasks.

NumPy: Data manipulation#

NumPy offers a range of functionalities for basic data manipulation, including slicing, reshaping, and broadcasting.

Slicing#

Slicing in NumPy allows you to extract parts of an array.

import numpy as np
# Creating a 1D array
array = np.array([10, 20, 30, 40, 50])
# Display 1D array
print("1D array: ", array)
# Slicing the array
sliced_array = array[1:4]
print("Sliced array: ", sliced_array)

Code explanation:

  • Line 10: We use slicing to create sliced_array from array:

    • array[1:4]: Retrieves elements starting from index 1 (inclusive) to index 4 (exclusive) from array.

    • The sliced elements [20, 30, 40] are assigned to sliced_array.

Reshaping#

NumPy allows you to change the shape of an array without changing its data.

import numpy as np
# Creating a 1D array
array = np.array([1, 2, 3, 4, 5, 6])
# Display 1D array
print("Original array: ", array)
# Reshaping to a 2x3 array
reshaped_array = array.reshape(2, 3)
print("Reshaped array:\n", reshaped_array)

Code explanation:

  • Line 10: We use the reshape() method to reshape array into a 2x3 NumPy array reshaped_array:

    • reshape(2, 3): Reshapes array into a 2 rows by 3 columns array.

    • The reshaped array reshaped_array will have a shape of (2, 3).

Broadcasting#

Broadcasting allows NumPy to perform arithmetic operations on arrays of different shapes.

Let’s explore how broadcasting works with NumPy arrays in various scenarios:

import numpy as np
# When one operand is N*N and other is 1*1
print("Case 1:")
Z1 = np.arange(9).reshape(3,3)
print("Z1:")
print(Z1)
Z2 = 1
print("Z2:")
print(Z2)
print("Z1 + Z2:")
print(Z1 + Z2)
# When one operand is N*N and other is N*1
print("\nCase 2:")
Z1 = np.arange(9).reshape(3,3)
print("Z1:")
print(Z1)
Z2 = np.arange(3)[::-1].reshape(3,1)
print("Z2:")
print(Z2)
print("Z1 + Z2:")
print(Z1 + Z2)
# When one operand is N*N and other is 1*N
print("\nCase 3:")
Z1 = np.arange(9).reshape(3,3)
print("Z1:")
print(Z1)
Z2 = np.arange(3)[::-1]
print("Z2:")
print(Z2)
print("Z1 + Z2:")
print(Z1 + Z2)

Code explanation:

  • The code demonstrates different scenarios of addition between NumPy arrays (Z1) and other operands (Z2) of different shapes:

    • Lines 4–14 (case 1): Addition of a 3x3 array (Z1) and a scalar (Z2 = 1).

    • Lines 17–27 (case 2): Addition of a 3x3 array (Z1) and a 3x1 array (Z2).

    • Lines 30–40 (case 3): Addition of a 3x3 array (Z1) and a 1D array (Z2).
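
The three cases above follow one rule, which can be checked directly. The sketch below uses np.broadcast_shapes (available in NumPy 1.20 and later) and shows what happens when shapes do not fit; the variable compatible is illustrative.

```python
import numpy as np

# Shapes are compared from the trailing dimension backwards;
# each pair must be equal, or one of them must be 1
print(np.broadcast_shapes((3, 3), (3, 1)))   # (3, 3)
print(np.broadcast_shapes((3, 3), (3,)))     # (3, 3)

# Incompatible shapes raise an error instead of guessing
try:
    np.ones((3, 3)) + np.ones(2)
    compatible = True
except ValueError:
    compatible = False

print("(3, 3) and (2,) compatible:", compatible)
```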

pandas: Data manipulation#

pandas offers advanced tools for data manipulation, including data cleaning, merging, grouping, and time series manipulation.

Data cleaning#

pandas provides functions to clean and preprocess data, such as dropna and fillna.

import pandas as pd
# Creating a DataFrame with the missing values
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Score': [85, None, 78]}
df = pd.DataFrame(data)
# Display DataFrame
print("Original DataFrame:\n", df)
# Dropping rows with the missing values
cleaned_df = df.dropna()
print("\nDataFrame after dropping missing values:\n", cleaned_df)
# Filling the missing values with a specific value (e.g., 0)
filled_df = df.fillna(0)
print("\nDataFrame after filling missing values with 0:\n", filled_df)

Code explanation:

  • Line 11: We use the dropna() method to create a new DataFrame cleaned_df by dropping rows from df that contain the missing values.

  • Line 15: We use the fillna(0) method to create a new DataFrame filled_df by filling the missing values in df with the value 0.

Merging#

pandas allows for complex data merging operations.

import pandas as pd
# Creating two DataFrames
data1 = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
data2 = {'Name': ['Alice', 'Bob'], 'Score': [85, 90]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
# Display DataFrames
print("DataFrame1:\n", df1)
print("\nDataFrame2:\n", df2)
# Merging the DataFrames on 'Name'
merged_df = pd.merge(df1, df2, on='Name')
print("\nMerged DataFrame:\n", merged_df)

Code explanation:

  • Line 14: We use pd.merge(df1, df2, on='Name') to merge df1 and df2 on the column 'Name', resulting in a new DataFrame merged_df containing all columns from both DataFrames for the rows where 'Name' matches.

Grouping#

pandas’ groupby function enables the grouping of data for aggregation.

import pandas as pd
# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Alice', 'Bob'], 'Score': [85, 90, 78, 88]}
df = pd.DataFrame(data)
# Display DataFrame
print("Original DataFrame:\n", df)
# Grouping by 'Name' and calculating mean score
grouped = df.groupby('Name').mean()
print("\nGrouped DataFrame:\n", grouped)

Code explanation:

  • Line 11: We group the DataFrame df by the 'Name' column and calculate the mean of the 'Score' for each group, resulting in a new DataFrame grouped.

Time series manipulation#

pandas excels in handling time series data, providing functions for resampling, shifting, and rolling window operations.

import pandas as pd
# Creating a time series
dates = pd.date_range('20230101', periods=6)
data = {'Sales': [100, 150, 200, 250, 300, 350]}
df = pd.DataFrame(data, index=dates)
# Display DataFrame
print("DataFrame:\n", df)
# Resampling the time series data to monthly (month-end) frequency; pandas 2.2+ prefers the alias 'ME' over 'M'
monthly_sales = df.resample('M').sum()
print("\nMonthly sales data:\n", monthly_sales)

Code explanation:

  • Line 12: We use df.resample('M').sum() to resample the DataFrame df to a monthly frequency and calculate the sum of sales for each month, resulting in a new DataFrame monthly_sales.
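
Shifting and rolling windows, also mentioned above, can be sketched on the same sample data. The column names Previous, Change, and Rolling3 are illustrative.

```python
import pandas as pd

dates = pd.date_range('20230101', periods=6)
df = pd.DataFrame({'Sales': [100, 150, 200, 250, 300, 350]}, index=dates)

# Shift: line up each day with the previous day's value
df['Previous'] = df['Sales'].shift(1)
df['Change'] = df['Sales'] - df['Previous']

# Rolling window: mean of the current day and the two days before it
df['Rolling3'] = df['Sales'].rolling(window=3).mean()

print(df)
```

The first rows of the shifted and rolling columns are NaN because no earlier observations exist to fill the window.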

Integration and ecosystem#

Both NumPy and pandas are integral parts of the Python data science ecosystem. They are designed to seamlessly integrate with other libraries, enhancing their capabilities and providing a comprehensive toolkit for data analysis and scientific computing.

Interoperability#

Effective interoperability ensures that NumPy and pandas can collaborate seamlessly with other libraries, enhancing their utility in diverse analytical and scientific applications.

NumPy: Interoperability#

NumPy is designed to work well with other scientific libraries in Python. Its interoperability allows it to serve as the foundation for a wide range of scientific and analytical tools.

SciPy#

SciPy builds on NumPy to provide additional functionality for scientific and technical computing, including modules for optimization, integration, interpolation, eigenvalue problems, and other advanced mathematical tasks.

import numpy as np
from scipy import optimize
# Define a quadratic function
def f(x):
    return x**2 + 4*x + 4
# Find the minimum of the function using SciPy
result = optimize.minimize(f, x0=0)
print("Optimization result:\n", result)

Code explanation:

  • Lines 5–6: We define a quadratic function f(x) that takes an input x and returns the value of the quadratic expression x**2 + 4*x + 4.

  • Line 9: We use optimize.minimize(f, x0=0) to find the minimum of the function f(x), starting from the initial guess x0=0, and store the result in the variable result.

Matplotlib#

Matplotlib is a plotting library that works closely with NumPy arrays to produce a variety of static, animated, and interactive visualizations.

import numpy as np
import matplotlib.pyplot as plt
# Create a range of values
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)
# Plot the sine wave
plt.plot(x, y)
plt.title('Sine Wave')
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.show()

Code explanation:

  • Line 5: We use np.linspace(0, 2 * np.pi, 100) to create an array x of 100 evenly spaced values ranging from 0 to 2 * np.pi.

  • Line 6: We use np.sin(x) to compute the sine of each value in the array x, resulting in an array y.

  • Line 9: We use plt.plot(x, y) to create a plot of y vs. x.

  • Line 10: We use plt.title('Sine Wave') to set the title of the plot to 'Sine Wave'.

  • Line 11: We use plt.xlabel('x') to label the x-axis as 'x'.

  • Line 12: We use plt.ylabel('sin(x)') to label the y-axis as 'sin(x)'.

  • Line 13: We use plt.show() to display the plot.

pandas: Interoperability#

pandas is also highly interoperable with a variety of other data tools and libraries, making it a versatile choice for data manipulation and analysis.

SQL databases#

pandas can read from and write to SQL databases, allowing for efficient data retrieval and storage. The pd.read_sql function and the DataFrame.to_sql method facilitate this integration.

import pandas as pd
import sqlite3

# Create an in-memory SQLite database and connect to it
conn = sqlite3.connect(':memory:')

# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)

# Write the DataFrame to a SQL table
df.to_sql('people', conn, index=False)

# Read the data back from the SQL table
df_from_sql = pd.read_sql('SELECT * FROM people', conn)
print("DataFrame read from SQL:\n", df_from_sql)

Code explanation:

  • Line 5: We use sqlite3.connect(':memory:') to create an in-memory SQLite database and establish a connection to it, assigned to conn.

  • Line 8: We create a dictionary data with sample data.

  • Line 9: We use pd.DataFrame(data) to create a DataFrame df from the dictionary data.

  • Line 12: We use df.to_sql('people', conn, index=False) to write the DataFrame df to a SQL table named 'people' in the SQLite database connected to by conn.

  • Line 15: We use pd.read_sql('SELECT * FROM people', conn) to read the data back from the SQL table 'people' into a new DataFrame df_from_sql.

Matplotlib#

pandas integrates smoothly with Matplotlib, making it easy to generate plots directly from DataFrames.

import pandas as pd
import matplotlib.pyplot as plt

# Create a sample DataFrame
data = {'Month': ['January', 'February', 'March'], 'Sales': [200, 250, 300]}
df = pd.DataFrame(data)

# Plot the data
df.plot(x='Month', y='Sales', kind='bar')
plt.title('Monthly Sales')
plt.xlabel('Month')
plt.ylabel('Sales')
plt.show()

Code explanation:

  • Line 5: We create a dictionary data with sample data.

  • Line 6: We use pd.DataFrame(data) to create a DataFrame df from the dictionary data.

  • Line 9: We use df.plot(x='Month', y='Sales', kind='bar') to create a bar plot with 'Month' on the x-axis and 'Sales' on the y-axis.

  • Line 10: We use plt.title('Monthly Sales') to set the title of the plot to 'Monthly Sales'.

  • Line 11: We use plt.xlabel('Month') to label the x-axis as 'Month'.

  • Line 12: We use plt.ylabel('Sales') to label the y-axis as 'Sales'.

  • Line 13: We use plt.show() to display the plot.

Seaborn#

Seaborn is a statistical data visualization library built on top of Matplotlib that works well with pandas DataFrames. It provides high-level interfaces for drawing attractive and informative statistical graphics.

import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'], 'Score': [85, 90, 78, 88]}
df = pd.DataFrame(data)

# Set up the plotting context and style
sns.set_context("talk")
sns.set_style("whitegrid")

# Create a bar plot using Seaborn
ax = sns.barplot(x='Name', y='Score', data=df)

# Access the axis object to set labels and title using matplotlib
ax.set_xlabel('Name', fontsize=14)
ax.set_ylabel('Score', fontsize=14)
ax.set_title('Student Scores', fontsize=16)

# Show the plot
sns.despine()
plt.show()

Using the above code, we will get the following output:

Code explanation:

  • Line 6: We create a dictionary data with sample data.

  • Line 7: We use pd.DataFrame(data) to create a DataFrame df from the dictionary data.

  • Line 10: We use sns.set_context("talk") to set the plotting context to "talk", which adjusts the size of the plot elements.

  • Line 11: We use sns.set_style("whitegrid") to set the plot style to "whitegrid", which adds grid lines on a white background for aesthetics.

  • Line 14: We use sns.barplot(x='Name', y='Score', data=df) to create a bar plot (barplot) with 'Name' on the x-axis and 'Score' on the y-axis, using data from the DataFrame df. The resulting plot axis object is stored in ax.

  • Line 17: We use ax.set_xlabel('Name', fontsize=14) to set the x-axis label to 'Name' with a font size of 14.

  • Line 18: We use ax.set_ylabel('Score', fontsize=14) to set the y-axis label to 'Score' with a font size of 14.

  • Line 19: We use ax.set_title('Student Scores', fontsize=16) to set the plot title to 'Student Scores' with a font size of 16.

  • Line 22: We use sns.despine() to remove the top and right spines from the plot for better aesthetics.

  • Line 23: We use plt.show() to display the plot using Matplotlib’s show function.

Use cases of NumPy and pandas#

Understanding the specific use cases for NumPy and pandas helps in selecting the right tool for your data processing tasks. Here, we’ll outline the primary use cases for each library, providing a clear comparison of their strengths and applications.

| Library | Use Cases | Description |
| --- | --- | --- |
| NumPy | Scientific computing | NumPy is the preferred library for performing scientific calculations that require high precision and performance. |
| NumPy | Machine learning | It provides the foundational data structures and mathematical operations essential for machine learning algorithms. |
| NumPy | Numerical simulations | NumPy is used for creating simulations that require handling large amounts of numerical data efficiently. |
| pandas | Data analysis | pandas is particularly effective in handling and analyzing structured data, making it perfect for tasks like exploring data and creating reports. |
| pandas | Data preprocessing for machine learning | It provides tools for cleaning and preparing data, including handling missing values and transforming data formats. |
| pandas | Financial modeling | pandas’ robust data manipulation capabilities are perfect for building and analyzing financial models. |
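
In practice, the two sets of use cases often meet in a single workflow: pandas cleans the data, and NumPy does the numerical work on the result. The following is a minimal sketch of that hand-off. The column names and values are made up for illustration, and mean imputation followed by standardization is just one possible preprocessing choice.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing height value
df = pd.DataFrame({
    "height_cm": [170.0, np.nan, 182.0],
    "weight_kg": [65.0, 72.0, 80.0],
})

# pandas: fill each missing value with the mean of its column
df_clean = df.fillna(df.mean())

# NumPy: convert to a 2-D float array and standardize each column
X = df_clean.to_numpy()
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(df_clean)
print("Array shape:", X_scaled.shape)
```

Here DataFrame.to_numpy() is the bridge between the two libraries: everything before it benefits from labeled columns, and everything after it is plain array arithmetic.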

Pros and cons: NumPy vs. pandas#

When choosing between NumPy and pandas, it’s essential to understand their strengths and limitations. Here, we’ll outline the pros and cons of each library, providing a clear comparison to help you make an informed decision.

| Library | Pros | Cons |
| --- | --- | --- |
| NumPy | Speed: Highly optimized for numerical computations, offering superior performance compared to native Python. | Less intuitive for tabular data: Handling tabular data can be challenging and less straightforward compared to using pandas. |
| NumPy | Numerical operations: Extensive support for a wide range of mathematical functions and operations. | Limited data types: Primarily designed for numerical data, with less flexibility for heterogeneous data types. |
| NumPy | Memory efficiency: Efficiently handles large arrays and matrices, minimizing memory overhead. | Steep learning curve: Might have a steeper learning curve for users unfamiliar with array-based programming. |
| pandas | Data manipulation: Excellent tools for data manipulation, cleaning, and transformation, making it easy to handle complex datasets. | Performance: Can be slower and less efficient with very large datasets compared to NumPy. |
| pandas | Intuitive syntax: User-friendly and intuitive syntax, especially for operations on tabular data. | Memory consumption: Higher memory usage when dealing with large DataFrames due to its rich feature set. |
| pandas | Versatility: Supports various data formats and integrates well with other data analysis libraries. | Complexity for simple tasks: May introduce unnecessary complexity for tasks that are simple in NumPy. |
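
The "limited data types" point is easy to see in a small experiment. The sketch below uses made-up values: when numbers and text are mixed in one NumPy array, the whole array is converted to a single common dtype, whereas a DataFrame keeps a separate dtype for each column.

```python
import numpy as np
import pandas as pd

# NumPy: mixing types forces one common dtype for the whole array,
# so the numbers here end up stored as strings
mixed = np.array([1, 2.5, "three"])
print("NumPy dtype:", mixed.dtype)

# pandas: each column keeps its own dtype
df = pd.DataFrame({
    "id": [1, 2],
    "score": [2.5, 3.5],
    "label": ["a", "b"],
})
print(df.dtypes)
```

The exact dtype names printed can vary between library versions, but the pattern holds: one dtype per array in NumPy, one dtype per column in pandas.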

Comparison between NumPy and pandas#

The table below presents a comparison between NumPy and pandas:

NumPy vs. pandas

| Feature | NumPy | pandas |
| --- | --- | --- |
| Data structures | Homogeneous arrays (single data type) | Heterogeneous DataFrames (mixed data types) |
| Performance (Numerical) | Generally faster | Slower for raw calculations, but convenient functions |
| Memory usage | Memory efficient | Potentially higher memory usage |
| Strengths | Efficient numerical computations, vectorized operations | Data cleaning, manipulation, analysis, time series |
| Common use cases | Scientific computing, machine learning (numerical data), image processing | Data loading, cleaning, EDA, feature engineering, time series analysis |
| Indexing | Basic (integer-based, slices, and boolean indexing) | Advanced indexing (label-based, location-based) |
| Missing value handling | Limited (manual replacement) | Flexible (fillna, interpolation) |
| Data types | Supports various numerical data types (integer, float, complex) and boolean | Supports various numerical data types, strings, categorical data, and custom data types |
| Math functions | Rich collection of element-wise mathematical functions (arithmetic, trigonometric, linear algebra) | Offers functions for common data analysis tasks (e.g., mean, standard deviation, correlation) |
| Time series functionality | Limited | Specialized functionalities (date/time objects, resampling) |
| Multidimensional data | Efficient handling of n-dimensional arrays | Less efficient for high-dimensional data |
| Learning curve | Easier to learn due to simpler data structures | Steeper learning curve due to richer features and functionalities |
| Interoperability | Integrates seamlessly with other scientific Python libraries (SciPy, Matplotlib) | Integrates well with NumPy and other data science libraries (Matplotlib, scikit-learn, and Seaborn) |
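
To make the "Missing value handling" row concrete, here is a minimal sketch using made-up values. With NumPy, missing entries (represented as np.nan) are typically replaced by hand with a boolean mask, while pandas offers the dedicated fillna and interpolate methods named in the table.

```python
import numpy as np
import pandas as pd

values = [1.0, np.nan, 3.0]

# NumPy: replace NaN manually using a boolean mask
arr = np.array(values)
arr_filled = np.where(np.isnan(arr), 0.0, arr)

# pandas: built-in helpers for the same task
s = pd.Series(values)
s_filled = s.fillna(0.0)     # replace NaN with a constant
s_interp = s.interpolate()   # linear interpolation between neighbors

print(arr_filled)
print(s_filled.tolist())
print(s_interp.tolist())
```

Both approaches can produce the same filled result. The difference is mainly in convenience: pandas packages common strategies as one-line methods.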

Conclusion#

Now that you know about both of these Python data manipulation tools, we hope you feel ready to choose the one that fits your data.

NumPy shines in numerical computations and high-performance scientific computing, making it the preferred choice for tasks involving large-scale numerical data and complex mathematical operations.

pandas, on the other hand, is particularly effective in data manipulation and analysis, providing intuitive tools for handling and transforming structured data, which is invaluable for data cleaning, exploration, and preprocessing in machine learning.

Whether you choose to work with one tool, or have decided to learn both, you can get hands-on with NumPy and pandas in our comprehensive Skill Path:

Python Data Analysis and Visualization

With over 400 billion gigabytes of data out there and more every day, companies are paying top dollar to those who can leverage it. Strong data skills are becoming increasingly valuable - even if you choose not to become a professional data scientist. This path will help you master the skills to extract insights from data using a powerful (and easy to use) assortment of popular Python libraries.

You can keep building your data science skills with our Data Science resources. Check them out, and consider exploring advanced tools like SciPy for scientific computing, Matplotlib and Seaborn for visualization, and scikit-learn for machine learning. Diving into databases such as SQL or NoSQL can also broaden your ability to manage diverse datasets effectively.
