In the realm of data science and scientific computing, Python stands out as a powerful and versatile programming language. Python has a wide range of libraries for these use cases, but two of the most widely used are NumPy and pandas.
If you’re stuck choosing between NumPy and pandas, that’s understandable. Both libraries have become indispensable tools for data scientists, analysts, and engineers, providing robust functionality for numerical computations and data manipulation. The choice becomes easier once you learn where each tool excels and, therefore, which is the better fit for your data.
Let’s dive in!
NumPy, short for Numerical Python, is a library that provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on the arrays. It was created in 2005 by Travis Oliphant, building on the earlier Numeric and Numarray libraries to create a more complete and efficient package for array computing.
NumPy’s core functionality revolves around the ndarray object, a powerful n-dimensional array.
NumPy is renowned for its efficiency in handling numerical computations and its ability to process large datasets swiftly. It's implemented in C, which gives NumPy a significant speed advantage over pure Python code.
Numerical computations: NumPy offers a comprehensive suite of mathematical functions for operations such as linear algebra, random number generation, Fourier transforms, and statistical computations.
Handling of n-dimensional arrays: The ndarray object is designed to handle a variety of data shapes and sizes, from simple 1-dimensional arrays to complex multi-dimensional arrays.
Broadcasting: NumPy’s broadcasting feature allows arithmetic operations to be performed on arrays of different shapes and sizes without requiring explicit replication of data, making code more efficient and easier to write.
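As a rough illustration of the vectorization point above, the sketch below computes the same sum of squares twice: once with a pure-Python loop and once with a single NumPy expression. The variable names are illustrative, and no timing figures are claimed here, since actual speedups depend on array size and hardware.

```python
import numpy as np

values = np.arange(1_000, dtype=np.int64)

# Pure-Python loop: one interpreted iteration per element
loop_total = 0
for v in values.tolist():
    loop_total += v * v

# Vectorized NumPy: the per-element loop runs inside compiled code
vectorized_total = int(np.sum(values ** 2))

print(loop_total, vectorized_total)
```

Both approaches give the same result; the vectorized form is shorter and avoids per-element interpreter overhead.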
pandas is a powerful data manipulation and analysis library for Python, created by Wes McKinney in 2008. It was developed to address the need for a flexible, high-performance tool for working with structured data, which was lacking in the existing scientific Python ecosystem at the time.
The pandas library introduces two primary data structures:
Series
DataFrame.
A Series is a one-dimensional labeled array capable of holding any data type, while a DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).
pandas is highly regarded for its versatility in data manipulation and ability to easily handle complex transformations, thanks to its intuitive syntax and robust set of functions.
Data manipulation: It provides a wide range of functions for data manipulation, including filtering, merging, reshaping, and aggregation. Its intuitive syntax makes it easy to perform complex data transformations and cleaning tasks.
Handling tabular data: The DataFrame structure is particularly well-suited for working with tabular data, similar to the structure of a database table or an Excel spreadsheet. This makes pandas an ideal tool for data analysis tasks in domains such as finance, economics, statistics, and many others.
Data alignment: It excels in handling missing data and aligning data from different sources based on their indexes. This capability is crucial for real-world data analysis, where data often comes with gaps or needs to be integrated from multiple sources.
Time series analysis: It offers powerful tools for time series analysis, including date range generation, frequency conversion, moving window statistics, and more, making it an excellent choice for analyzing time-based data.
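To make the data alignment feature more concrete, here is a minimal sketch using two Series whose indexes only partially overlap. The region names and numbers are made up for illustration.

```python
import pandas as pd

# Two Series whose indexes only partially overlap (hypothetical data)
q1 = pd.Series([100, 200, 300], index=["north", "south", "east"])
q2 = pd.Series([10, 20, 30], index=["south", "east", "west"])

# Addition aligns on index labels, not on position
total = q1 + q2
print(total)

# Labels missing from one side become NaN; fill_value treats them as 0 instead
total_filled = q1.add(q2, fill_value=0)
print(total_filled)
```

With plain addition, labels present in only one Series produce NaN, which makes gaps visible instead of silently mismatching rows.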
Understanding the core differences between NumPy and pandas is crucial for determining which library to use for specific tasks. Here, we will dive into two key aspects:
Data structures
Indexing mechanisms
With NumPy, we get arrays, and pandas gives us Series and DataFrames. Depending on the data you're working with, data structures of each library may be your deciding factor.
Let's explore which use cases each data structure excels in.
NumPy’s primary data structure is the array. This array object is homogeneous, meaning all elements are of the same type, and provides a range of functionalities for numerical computations.
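A small sketch of what "homogeneous" means in practice: when values of different types are passed to np.array(), NumPy converts them to one common type. The variable names below are illustrative.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])     # int and float -> every element becomes a float
with_text = np.array([1, "two"])  # number and string -> every element becomes a string

print(ints.dtype)
print(mixed.dtype)
print(with_text.dtype)
```

The exact integer width of the first array depends on the platform and NumPy version, so it is not shown here.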
NumPy Data Structure | Properties | Use Cases |
|
|
|
|
| |
|
|
The following code creates a 2D NumPy array and performs element-wise squaring, demonstrating how an array can be used for efficient numerical operations:
import numpy as np

# Creating a 2D array (matrix)
array_2d = np.array([[1, 2, 3], [4, 5, 6]])
print("2D array (matrix):\n", array_2d)

# Performing element-wise operations
array_squared = array_2d ** 2
print("\n2D array (matrix) after performing element-wise operation:\n", array_squared)
Code explanation:
Line 1: We import the NumPy library and assign it the alias np.
Line 4: We create a 2D NumPy array (which can be thought of as a matrix) using the np.array() function. The array is initialized with the values [[1, 2, 3], [4, 5, 6]], which means it has 2 rows and 3 columns.
Line 5: We use print() to display the message "2D array (matrix):\n" followed by the contents of array_2d. The \n in the string is a newline character.
Line 8: We perform an element-wise operation on array_2d. In NumPy, an operation like ** 2 on an array squares each element individually, so array_2d ** 2 squares each element of array_2d and the result is stored in array_squared.
Line 9: We use print() to display the message "2D array (matrix) after performing element-wise operation:\n" followed by array_squared.
The pandas library introduces two core data structures: Series and DataFrame. These structures are designed to handle labeled data intuitively and efficiently.
pandas Data Structure | Properties | Use Cases |
Series |
|
|
DataFrame |
|
The following code demonstrates creating a pandas Series with a custom index and a DataFrame from a dictionary, showcasing the flexibility and intuitive handling of labeled data in pandas:
import pandas as pd

# Creating a Series
series = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print("Series:\n", series)

# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)
print("\nDataFrame:\n", df)
Code explanation:
Line 1: We import the pandas library and assign it the alias pd.
Line 4: We create a pandas Series using the pd.Series() function. Here, we pass a list [10, 20, 30] as data and specify index=['a', 'b', 'c'] to label each element in the Series.
Line 5: We use print() to display the message "Series:\n" followed by the contents of series. The \n in the string is a newline character.
Lines 8–9: We create a pandas DataFrame using the pd.DataFrame() function. Here, data is a dictionary where keys are column names ('Name' and 'Age') and values are lists representing the data in each column (['Alice', 'Bob', 'Charlie'] and [25, 30, 35], respectively).
Line 10: We use print() to display the message "DataFrame:\n" followed by the contents of df.
NumPy arrays support both basic and advanced indexing. Basic indexing uses integers and slices to access elements, while advanced indexing uses boolean or integer arrays.
The following code shows how to access elements in a NumPy array using an integer index, a slice, and a boolean mask:
import numpy as np

# Creating a 1D array
array_1d = np.array([10, 20, 30, 40, 50])

# Display 1D array
print("1D array: ", array_1d)

# Basic indexing
print("Accessing the third element: ", array_1d[2])

# Slicing
print("Accessing elements from index 1 to 3: ", array_1d[1:4])

# Boolean indexing
print("Accessing elements greater than 25: ", array_1d[array_1d > 25])
Code explanation:
Line 10: We print the message "Accessing the third element: " followed by array_1d[2]. This accesses and displays the third element (index 2) of array_1d.
Line 13: We print the message "Accessing elements from index 1 to 3: " followed by array_1d[1:4]. This performs slicing on array_1d, accessing elements from index 1 (inclusive) to index 4 (exclusive) and displaying them.
Line 16: We print the message "Accessing elements greater than 25: " followed by array_1d[array_1d > 25]. This uses boolean indexing to filter elements in array_1d that are greater than 25 and display them.
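The example above only reads elements. As a small sketch of the other side of indexing, the code below selects arbitrary positions with a list of integers and then modifies elements through an index, a slice, and a boolean mask. The values are chosen only for illustration.

```python
import numpy as np

array_1d = np.array([10, 20, 30, 40, 50])

# Integer-array indexing selects arbitrary positions and returns a copy
picked = array_1d[[0, 2, 4]]
print(picked)

# Assignment through an index, a slice, or a boolean mask modifies the array in place
array_1d[0] = 11
array_1d[1:3] = 0
array_1d[array_1d > 45] = 45
print(array_1d)
```

Because picked is a copy, the later assignments to array_1d do not change it.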
The pandas library provides more flexible and powerful indexing options. It supports both label-based and location-based indexing through .loc and .iloc.
Label-based indexing (.loc): Access elements by labels.
Location-based indexing (.iloc): Access elements by integer location.
The following code demonstrates accessing elements in a pandas DataFrame using both label-based and location-based indexing:
import pandas as pd

# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data, index=['a', 'b', 'c'])

# Display DataFrame
print("DataFrame:\n", df)

# Label-based indexing
print("\nAccessing row with label 'b': ", df.loc['b'])

# Location-based indexing
print("\nAccessing the second row (index 1): ", df.iloc[1])
Code explanation:
Line 11: We print the message "Accessing row with label 'b': " followed by df.loc['b']. This uses label-based indexing (loc) to access and display the row labeled 'b' in the DataFrame df.
Line 14: We print the message "Accessing the second row (index 1): " followed by df.iloc[1]. This uses location-based indexing (iloc) to access and display the second row (index 1) in the DataFrame df.
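One detail that is easy to miss when switching between the two indexers is how slices behave. The sketch below, which reuses the same small DataFrame, shows that a .loc slice includes its end label while an .iloc slice excludes its end position, and that both can address a single cell.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]},
    index=["a", "b", "c"],
)

by_label = df.loc["a":"b"]    # label slice includes the end label 'b'
by_position = df.iloc[0:1]    # position slice excludes the end position 1

# A single cell: (row label, column label) versus (row position, column position)
age_b = df.loc["b", "Age"]
age_second = df.iloc[1, 1]

print(by_label)
print(by_position)
print(age_b, age_second)
```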
NumPy and pandas each provide a rich set of functionalities that cater to different needs in data science and analysis. As such, your specific needs will influence your choice of library.
We’ll dive into specific capabilities of each library, focusing on:
Mathematical operations
Loading data from file/dataset
Data manipulation
Mathematical operations are fundamental in data analysis and scientific computing, enabling tasks like statistical calculations and modeling.
NumPy excels in numerical computations, providing a wide array of mathematical functions that are optimized for performance. These functions make it a powerful tool for tasks involving linear algebra, random sampling, and Fourier transforms.
NumPy offers comprehensive support for linear algebra operations, including:
Matrix multiplication
Decomposition
Inversion
Eigenvalue calculations
These functionalities are essential for solving systems of linear equations and performing various mathematical transformations.
import numpy as np

# Define two matrices
matrix_a = np.array(
    [[1, 2],
     [3, 4]])
matrix_b = np.array(
    [[5, 6],
     [7, 8]])

# Print the original matrices
print("\nMatrix A:")
print(matrix_a)
print("\nMatrix B:")
print(matrix_b)

# Perform matrix operations
print("\nMatrix operations:")
print("Transpose of Matrix A:\n", np.transpose(matrix_a))
print("Determinant Matrix A:", np.linalg.det(matrix_a))
print("Inverse Matrix A:\n", np.linalg.inv(matrix_a))
print("Trace Matrix A:", np.trace(matrix_a))

# Perform matrix multiplication
print("\nMatrix Multiplication:")
result_mult = np.dot(matrix_a, matrix_b)
print(result_mult)

# Perform QR decomposition (alternative to LU decomposition)
print("\nQR Decomposition of Matrix A:")
q, r = np.linalg.qr(matrix_a)
print("Q Matrix:")
print(q)
print("R Matrix:")
print(r)

# Perform matrix inversion
print("\nInverse of Matrix A:")
result_inv = np.linalg.inv(matrix_a)
print(result_inv)

# Perform eigenvalue and eigenvector calculation
print("\nEigenvalues and Eigenvectors of Matrix A:")
eigenvalues, eigenvectors = np.linalg.eig(matrix_a)
print("Eigenvalues:")
print(eigenvalues)
print("Eigenvectors:")
print(eigenvectors)
Code explanation:
Lines 18–22: We perform various matrix operations on matrix_a:
Transpose: np.transpose(matrix_a) calculates and prints the transpose of matrix_a.
Determinant: np.linalg.det(matrix_a) computes and prints the determinant of matrix_a.
Inverse: np.linalg.inv(matrix_a) computes and prints the inverse of matrix_a.
Trace: np.trace(matrix_a) computes and prints the trace (sum of diagonal elements) of matrix_a.
Lines 25–27: We perform matrix multiplication using np.dot(matrix_a, matrix_b), store the result in result_mult, and print it.
Lines 30–35: We perform QR decomposition of matrix_a using np.linalg.qr(matrix_a). We store the matrices q (orthogonal/unitary matrix) and r (upper triangular matrix) and print them.
Lines 38–40: We compute the inverse of matrix_a using np.linalg.inv(matrix_a). We store the result in result_inv and print it.
Lines 43–48: We compute the eigenvalues and eigenvectors of matrix_a using np.linalg.eig(matrix_a). We store the eigenvalues in eigenvalues and the eigenvectors in eigenvectors, and print them.
NumPy’s random module allows for generating random numbers, creating random samples, and performing random sampling from different distributions.
import numpy as np

# Generating random numbers from a normal distribution
random_numbers = np.random.normal(loc=0, scale=1, size=5)
print("Random numbers from a normal distribution:\n", random_numbers)
Code explanation:
Line 4: We use the np.random.normal() function to generate an array random_numbers of five values drawn from a normal distribution:
loc=0: Mean of the distribution (centered at 0)
scale=1: Standard deviation of the distribution
size=5: Number of random numbers to generate
NumPy provides functions to compute the discrete Fourier transform, which is useful in signal processing.
import numpy as np

# Creating a sample signal
signal = np.array([1, 2, 1, 0, 1, 2, 1, 0])
print("Signal: ", signal)

# Computing the Fourier transform
fourier_transform = np.fft.fft(signal)
print("Fourier transform of the signal:\n", fourier_transform)
Code explanation:
Line 8: We compute the Fourier transform of signal using np.fft.fft(signal). The result is stored in fourier_transform.
Unlike NumPy, pandas is not designed for advanced mathematical computations. Instead, it offers powerful tools for data aggregation, merging, reshaping, and handling missing data, which are essential for data analysis.
pandas provides functions for summarizing data, such as groupby, sum, mean, and count.
import pandas as pd

# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Alice', 'Bob'], 'Score': [85, 90, 78, 88]}
df = pd.DataFrame(data)

# Display DataFrame
print("DataFrame:\n", df)

# Aggregating data by 'Name'
grouped = df.groupby('Name').mean()
print("\nAggregated data:\n", grouped)
Code explanation:
Line 11: We use the groupby() method on df to group data by the 'Name' column, and then calculate the mean using the mean() method. The result is stored in grouped.
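The example above applies a single aggregation. As a sketch of how several of the summary functions mentioned earlier can be combined, the code below passes a list of function names to agg() on the same sample data.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Alice", "Bob"],
                   "Score": [85, 90, 78, 88]})

# One row per name, one column per aggregation
summary = df.groupby("Name")["Score"].agg(["sum", "mean", "count"])
print(summary)
```

Selecting the 'Score' column before aggregating keeps the result limited to the column being summarized.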
pandas allows for merging and joining DataFrames using various methods like merge, join, and concat.
import pandas as pd

# Creating two DataFrames
data1 = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
data2 = {'Name': ['Alice', 'Bob'], 'Score': [85, 90]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# Display DataFrames
print("DataFrame1:\n", df1)
print("\nDataFrame2:\n", df2)

# Merging the DataFrames on 'Name'
merged_df = pd.merge(df1, df2, on='Name')
print("\nMerged DataFrame:\n", merged_df)
Code explanation:
Line 14: We use the pd.merge() function to merge df1 and df2 based on the 'Name' column. The result is stored in merged_df.
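The merge above uses the default inner join on frames that share every name. The sketch below, with made-up data, shows two variations: an outer merge that keeps names found in only one frame, and concat, which stacks frames with the same columns.

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})
df2 = pd.DataFrame({"Name": ["Bob", "Charlie"], "Score": [90, 70]})

# Outer merge keeps names found in either frame; gaps become NaN
outer = pd.merge(df1, df2, on="Name", how="outer")
print(outer)

# concat stacks frames with the same columns on top of each other
more_people = pd.DataFrame({"Name": ["Dana"], "Age": [41]})
stacked = pd.concat([df1, more_people], ignore_index=True)
print(stacked)
```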
pandas offers functions like pivot, melt, and stack for reshaping DataFrames.
import pandas as pd

# Creating a DataFrame
data = {'Name': ['Alice', 'Bob'], 'Math': [85, 90], 'Science': [88, 92]}
df = pd.DataFrame(data)

# Display DataFrame
print("Original DataFrame:\n", df)

# Melting the DataFrame to unpivot subjects into rows
melted = pd.melt(df, id_vars=['Name'], value_vars=['Math', 'Science'], var_name='Subject', value_name='Score')
print("\nMelted DataFrame (Subjects as rows):\n", melted)

# Using stack to pivot the DataFrame
stacked = df.set_index('Name').stack().reset_index(name='Score').rename(columns={'level_1': 'Subject'})
print("\nStacked DataFrame (Subjects as rows):\n", stacked)

# Pivoting the melted DataFrame back to original form using pivot
unmelted = melted.pivot(index='Name', columns='Subject', values='Score').reset_index()
print("\nPivoted DataFrame (Original form):\n", unmelted)
Code explanation:
Line 11: We use the pd.melt() function to melt (unpivot) the DataFrame df:
id_vars=['Name']: Specifies the 'Name' column as the identifier variable (unchanged).
value_vars=['Math', 'Science']: Specifies the 'Math' and 'Science' columns to melt.
var_name='Subject': Renames the variable column to 'Subject'.
value_name='Score': Renames the value column to 'Score'.
The result is stored in melted.
Line 15: We use the stack() method to pivot the DataFrame df by stacking columns into rows:
set_index('Name'): Sets the 'Name' column as the index.
stack(): Pivots all remaining columns into rows.
reset_index(name='Score'): Resets the index and renames the resulting stacked column to 'Score'.
rename(columns={'level_1': 'Subject'}): Renames the column previously holding column names to 'Subject'.
The result is stored in stacked.
Line 19: We use the pivot() method on the melted DataFrame to pivot it back to the original form:
index='Name': Sets the 'Name' column as the index.
columns='Subject': Specifies the 'Subject' column values to pivot.
values='Score': Specifies the 'Score' column values to populate the pivoted DataFrame.
reset_index(): Resets the index to convert 'Name' from the index back to a regular column.
The result is stored in unmelted.
pandas provides functions to detect, remove, or fill missing data in DataFrames.
import pandas as pd

# Creating a DataFrame with missing values
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Score': [85, None, 78]}
df = pd.DataFrame(data)

# Display DataFrame
print("Original DataFrame:\n", df)

# Filling missing values with the mean
df['Score'] = df['Score'].fillna(df['Score'].mean())
print("\nDataFrame after filling missing values:\n", df)
Code explanation:
Line 11: We use the fillna() method on the 'Score' column of df to fill missing values (None, stored as NaN) with the mean of existing values in the column:
df['Score'].mean(): Computes the mean of non-missing values in 'Score'.
fillna() returns a new Series, which we assign back to df['Score']. Calling fillna(..., inplace=True) on a selected column is discouraged in recent pandas versions because it may operate on a temporary copy and leave df unchanged.
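Before filling or removing missing values, it is often useful to detect where they are. The sketch below uses isna() on the same sample data to build a per-row mask and a per-column count.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"],
                   "Score": [85, None, 78]})

# True where a score is missing
missing_mask = df["Score"].isna()
print(missing_mask)

# Number of missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)
```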
Loading data from external files or datasets is a fundamental operation in data analysis and scientific computing. Both NumPy and pandas provide capabilities to read data from various file formats, each tailored to different use cases.
NumPy primarily deals with numerical data in the form of arrays. It provides basic functionality to load data from text files, such as CSV files, and stores the data in its own ndarray format, which is homogeneous and optimized for numerical computations.
The following is an example of loading data with NumPy:
import numpy as np

# Load data from a CSV file into a NumPy ndarray
data_np = np.loadtxt('data.csv', delimiter=',')
print("NumPy Array:\n", data_np)
Code explanation:
Line 4: We use the np.loadtxt() function to load data from a CSV file 'data.csv' into a NumPy ndarray data_np:
'data.csv': Specifies the path to the CSV file to be loaded.
delimiter=',': Specifies that the data in the CSV file is separated by commas.
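The call above assumes a file that contains only numbers. As a sketch of a common variation, the code below loads CSV text that starts with a header row. To stay self-contained it reads from an in-memory io.StringIO object with made-up contents instead of a real file.

```python
import io
import numpy as np

# In-memory stand-in for a CSV file (hypothetical contents)
csv_text = "height,weight\n1.70,65\n1.82,80\n"

# skiprows=1 skips the header line, which cannot be parsed as numbers
data_np = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)
print(data_np.shape, data_np.dtype)
print(data_np)
```

Every value ends up as a float, which reflects the homogeneous nature of the ndarray.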
pandas excels in handling structured data, including loading data from various file formats such as CSV, Excel, SQL databases, and more. It stores the data in DataFrame objects, which are flexible and capable of handling heterogeneous data types.
The following is an example of loading data with pandas:
import pandas as pd

# Load data from a CSV file into a pandas DataFrame
df = pd.read_csv('data.csv')
print("Pandas DataFrame:\n", df)
Code explanation:
Line 4: We use the pd.read_csv() function to load data from a CSV file 'data.csv' into a pandas DataFrame df:
'data.csv': Specifies the path to the CSV file to be loaded.
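To illustrate the heterogeneous side of this comparison, the sketch below reads CSV text with a text column, an integer column, and a float column. It uses an in-memory io.StringIO object with made-up contents so that it runs without a file on disk.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file (hypothetical contents)
csv_text = "Name,Age,Score\nAlice,25,85.5\nBob,30,90.0\n"

# The header row becomes the column names, and each column gets its own dtype
df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)
print(df)
```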
Effective data manipulation is crucial in preparing data for analysis and ensuring it meets the requirements of various computational tasks.
NumPy offers a range of functionalities for basic data manipulation, including slicing, reshaping, and broadcasting.
Slicing in NumPy allows you to extract parts of an array.
import numpy as np

# Creating a 1D array
array = np.array([10, 20, 30, 40, 50])

# Display 1D array
print("1D array: ", array)

# Slicing the array
sliced_array = array[1:4]
print("Sliced array: ", sliced_array)
Code explanation:
Line 10: We use slicing to create a new array sliced_array from array:
array[1:4]: Retrieves elements starting from index 1 (inclusive) to index 4 (exclusive) from array.
The sliced elements [20, 30, 40] are assigned to sliced_array.
NumPy allows you to change the shape of an array without changing its data.
import numpy as np

# Creating a 1D array
array = np.array([1, 2, 3, 4, 5, 6])

# Display 1D array
print("Original array: ", array)

# Reshaping to a 2x3 array
reshaped_array = array.reshape(2, 3)
print("Reshaped array:\n", reshaped_array)
Code explanation:
Line 10: We use the reshape() method to reshape array into a 2x3 NumPy array reshaped_array:
reshape(2, 3): Reshapes array into a 2 rows by 3 columns array.
The reshaped array reshaped_array will have a shape of (2, 3).
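Two further details of reshape() are sketched below with the same six-element array: passing -1 lets NumPy infer one dimension, and requesting a shape that does not match the number of elements raises an error.

```python
import numpy as np

array = np.array([1, 2, 3, 4, 5, 6])

# -1 asks NumPy to infer that dimension from the array's size
three_rows = array.reshape(3, -1)
print(three_rows.shape)

# An incompatible shape raises ValueError instead of dropping or padding data
try:
    array.reshape(4, 2)
except ValueError as err:
    print("Cannot reshape:", err)
```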
Broadcasting allows NumPy to perform arithmetic operations on arrays of different shapes.
Let’s explore how broadcasting works with NumPy arrays in various scenarios:
import numpy as np

# When one operand is N*N and other is 1*1
print("Case 1:")
Z1 = np.arange(9).reshape(3, 3)
print("Z1:")
print(Z1)

Z2 = 1
print("Z2:")
print(Z2)

print("Z1 + Z2:")
print(Z1 + Z2)

# When one operand is N*N and other is N*1
print("\nCase 2:")
Z1 = np.arange(9).reshape(3, 3)
print("Z1:")
print(Z1)

Z2 = np.arange(3)[::-1].reshape(3, 1)
print("Z2:")
print(Z2)

print("Z1 + Z2:")
print(Z1 + Z2)

# When one operand is N*N and other is 1*N
print("\nCase 3:")
Z1 = np.arange(9).reshape(3, 3)
print("Z1:")
print(Z1)

Z2 = np.arange(3)[::-1]
print("Z2:")
print(Z2)

print("Z1 + Z2:")
print(Z1 + Z2)
Code explanation:
The code demonstrates different scenarios of addition between NumPy arrays (Z1) and other operands (Z2) of different shapes:
Lines 4–14 (case 1): Addition of a 3x3 array (Z1) and a scalar (Z2 = 1).
Lines 17–27 (case 2): Addition of a 3x3 array (Z1) and a 3x1 array (Z2).
Lines 30–40 (case 3): Addition of a 3x3 array (Z1) and a 1D array (Z2).
pandas offers advanced tools for data manipulation, including data cleaning, merging, grouping, and time series manipulation.
pandas provides functions to clean and preprocess data, such as dropna and fillna.
import pandas as pd

# Creating a DataFrame with the missing values
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Score': [85, None, 78]}
df = pd.DataFrame(data)

# Display DataFrame
print("Original DataFrame:\n", df)

# Dropping rows with the missing values
cleaned_df = df.dropna()
print("\nDataFrame after dropping missing values:\n", cleaned_df)

# Filling the missing values with a specific value (e.g., 0)
filled_df = df.fillna(0)
print("\nDataFrame after filling missing values with 0:\n", filled_df)
Code explanation:
Line 11: We use the dropna() method to create a new DataFrame cleaned_df by dropping rows from df that contain missing values.
Line 15: We use the fillna(0) method to create a new DataFrame filled_df by filling the missing values in df with the value 0.
pandas allows for complex data merging operations.
import pandas as pd

# Creating two DataFrames
data1 = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
data2 = {'Name': ['Alice', 'Bob'], 'Score': [85, 90]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# Display DataFrames
print("DataFrame1:\n", df1)
print("\nDataFrame2:\n", df2)

# Merging the DataFrames on 'Name'
merged_df = pd.merge(df1, df2, on='Name')
print("\nMerged DataFrame:\n", merged_df)
Code explanation:
Line 14: We use pd.merge(df1, df2, on='Name') to merge df1 and df2 on the column 'Name', resulting in a new DataFrame merged_df containing all columns from both DataFrames where 'Name' matches.
pandas’ groupby function enables the grouping of data for aggregation.
import pandas as pd

# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Alice', 'Bob'], 'Score': [85, 90, 78, 88]}
df = pd.DataFrame(data)

# Display DataFrame
print("Original DataFrame:\n", df)

# Grouping by 'Name' and calculating mean score
grouped = df.groupby('Name').mean()
print("\nGrouped DataFrame:\n", grouped)
Code explanation:
Line 11: We group the DataFrame df by the 'Name' column and calculate the mean of the 'Score' for each group, resulting in a new DataFrame grouped.
pandas excels in handling time series data, providing functions for resampling, shifting, and rolling window operations.
import pandas as pd

# Creating a time series
dates = pd.date_range('20230101', periods=6)
data = {'Sales': [100, 150, 200, 250, 300, 350]}
df = pd.DataFrame(data, index=dates)

# Display DataFrame
print("DataFrame:\n", df)

# Resampling the time series data to monthly (month-end) frequency
monthly_sales = df.resample('ME').sum()
print("\nMonthly sales data:\n", monthly_sales)
Code explanation:
Line 12: We use df.resample('ME').sum() to resample the DataFrame df to a monthly (month-end) frequency and calculate the sum of sales for each month, resulting in a new DataFrame monthly_sales. The 'ME' alias requires pandas 2.2 or later; earlier versions use 'M', which newer versions deprecate.
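Shifting and rolling windows, the other two operations mentioned above, can be sketched on the same daily sales data. The column names 'Previous', 'Change', and 'Rolling3' are chosen here only for illustration.

```python
import pandas as pd

dates = pd.date_range("2023-01-01", periods=6)
df = pd.DataFrame({"Sales": [100, 150, 200, 250, 300, 350]}, index=dates)

# shift(1) moves each value down one row, giving the previous day's sales
df["Previous"] = df["Sales"].shift(1)
df["Change"] = df["Sales"] - df["Previous"]

# A 3-day rolling mean; the first two rows lack a full window and are NaN
df["Rolling3"] = df["Sales"].rolling(window=3).mean()
print(df)
```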
Both NumPy and pandas are integral parts of the Python data science ecosystem. They are designed to seamlessly integrate with other libraries, enhancing their capabilities and providing a comprehensive toolkit for data analysis and scientific computing.
Effective interoperability ensures that NumPy and pandas can collaborate seamlessly with other libraries, enhancing their utility in diverse analytical and scientific applications.
NumPy is designed to work well with other scientific libraries in Python. Its interoperability allows it to serve as the foundation for a wide range of scientific and analytical tools.
SciPy builds on NumPy to provide additional functionality for scientific and technical computing, including modules for optimization, integration, interpolation, eigenvalue problems, and other advanced mathematical tasks.
import numpy as np
from scipy import optimize

# Define a quadratic function
def f(x):
    return x**2 + 4*x + 4

# Find the minimum of the function using SciPy
result = optimize.minimize(f, x0=0)
print("Optimization result:\n", result)
Code explanation:
Lines 5–6: We define a quadratic function f(x) that takes an input x and returns the value of the quadratic expression x**2 + 4*x + 4.
Line 9: We use optimize.minimize(f, x0=0) to find the minimum of the function f(x), starting from the initial guess x0=0, and store the result in the variable result.
Matplotlib is a plotting library that works closely with NumPy arrays to produce a variety of static, animated, and interactive visualizations.
import numpy as np
import matplotlib.pyplot as plt

# Create a range of values
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

# Plot the sine wave
plt.plot(x, y)
plt.title('Sine Wave')
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.show()
Code explanation:
Line 5: We use np.linspace(0, 2 * np.pi, 100) to create an array x of 100 evenly spaced values ranging from 0 to 2 * np.pi.
Line 6: We use np.sin(x) to compute the sine of each value in the array x, resulting in an array y.
Line 9: We use plt.plot(x, y) to create a plot of y vs. x.
Line 10: We use plt.title('Sine Wave') to set the title of the plot to 'Sine Wave'.
Line 11: We use plt.xlabel('x') to label the x-axis as 'x'.
Line 12: We use plt.ylabel('sin(x)') to label the y-axis as 'sin(x)'.
Line 13: We use plt.show() to display the plot.
pandas is also highly interoperable with a variety of other data tools and libraries, making it a versatile choice for data manipulation and analysis.
pandas can read from and write to SQL databases, allowing for efficient data retrieval and storage. The read_sql and to_sql functions facilitate this integration.
import pandas as pd
import sqlite3

# Create an in-memory SQLite database and connect to it
conn = sqlite3.connect(':memory:')

# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)

# Write the DataFrame to a SQL table
df.to_sql('people', conn, index=False)

# Read the data back from the SQL table
df_from_sql = pd.read_sql('SELECT * FROM people', conn)
print("DataFrame read from SQL:\n", df_from_sql)
Code explanation:
Line 5: We use sqlite3.connect(':memory:') to create an in-memory SQLite database and establish a connection to it, assigned to conn.
Line 8: We create a dictionary data with sample data.
Line 9: We use pd.DataFrame(data) to create a DataFrame df from the dictionary data.
Line 12: We use df.to_sql('people', conn, index=False) to write the DataFrame df to a SQL table named 'people' in the SQLite database that conn is connected to.
Line 15: We use pd.read_sql('SELECT * FROM people', conn) to read the data back from the SQL table 'people' into a new DataFrame df_from_sql.
pandas integrates smoothly with Matplotlib, making it easy to generate plots directly from DataFrames.
import pandas as pd
import matplotlib.pyplot as plt

# Create a sample DataFrame
data = {'Month': ['January', 'February', 'March'], 'Sales': [200, 250, 300]}
df = pd.DataFrame(data)

# Plot the data
df.plot(x='Month', y='Sales', kind='bar')
plt.title('Monthly Sales')
plt.xlabel('Month')
plt.ylabel('Sales')
plt.show()
Code explanation:
Line 5: We create a dictionary data with sample data.
Line 6: We use pd.DataFrame(data) to create a DataFrame df from the dictionary data.
Line 9: We use df.plot(x='Month', y='Sales', kind='bar') to create a bar plot with 'Month' on the x-axis and 'Sales' on the y-axis.
Line 10: We use plt.title('Monthly Sales') to set the title of the plot to 'Monthly Sales'.
Line 11: We use plt.xlabel('Month') to label the x-axis as 'Month'.
Line 12: We use plt.ylabel('Sales') to label the y-axis as 'Sales'.
Line 13: We use plt.show() to display the plot.
Seaborn is a statistical data visualization library built on top of Matplotlib that works well with pandas DataFrames. It provides high-level interfaces for drawing attractive and informative statistical graphics.
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'], 'Score': [85, 90, 78, 88]}
df = pd.DataFrame(data)

# Set up the plotting context and style
sns.set_context("talk")
sns.set_style("whitegrid")

# Create a bar plot using Seaborn
ax = sns.barplot(x='Name', y='Score', data=df)

# Access the axis object to set labels and title using matplotlib
ax.set_xlabel('Name', fontsize=14)
ax.set_ylabel('Score', fontsize=14)
ax.set_title('Student Scores', fontsize=16)

# Show the plot
sns.despine()
plt.show()
Using the above code, we will get the following output:
Code explanation:
Line 6: We create a dictionary data with sample data.
Line 7: We use pd.DataFrame(data) to create a DataFrame df from the dictionary data.
Line 10: We use sns.set_context("talk") to set the plotting context to "talk", which adjusts the size of the plot elements.
Line 11: We use sns.set_style("whitegrid") to set the plot style to "whitegrid", which adds grid lines on a white background for aesthetics.
Line 14: We use sns.barplot(x='Name', y='Score', data=df) to create a bar plot (barplot) with 'Name' on the x-axis and 'Score' on the y-axis, using data from the DataFrame df. The resulting plot axis object is stored in ax.
Line 17: We use ax.set_xlabel('Name', fontsize=14) to set the x-axis label to 'Name' with a font size of 14.
Line 18: We use ax.set_ylabel('Score', fontsize=14) to set the y-axis label to 'Score' with a font size of 14.
Line 19: We use ax.set_title('Student Scores', fontsize=16) to set the plot title to 'Student Scores' with a font size of 16.
Line 22: We use sns.despine() to remove the top and right spines from the plot for better aesthetics.
Line 23: We use plt.show() to display the plot using Matplotlib’s show function.
Understanding the specific use cases for NumPy and pandas helps in selecting the right tool for your data processing tasks. Here, we’ll outline the primary use cases for each library, providing a clear comparison of their strengths and applications.
Library | Use Cases | Description |
NumPy | Scientific computing | NumPy is the preferred library for performing scientific calculations that require high precision and performance. |
NumPy | Machine learning | It provides the foundational data structures and mathematical operations essential for machine learning algorithms. |
NumPy | Numerical simulations | NumPy is used for creating simulations that require handling large amounts of numerical data efficiently. |
pandas | Data analysis | pandas is particularly effective in handling and analyzing structured data, making it perfect for tasks like exploring data and creating reports. |
pandas | Data preprocessing for machine learning | It provides tools for cleaning and preparing data, including handling missing values and transforming data formats. |
pandas | Financial modeling | pandas’ robust data manipulation capabilities are perfect for building and analyzing financial models. |
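To make two of the use cases above concrete, here is a minimal sketch: a small numerical simulation with NumPy and a short summary report with pandas. The sample size, the seed, and the sales DataFrame (its column names and values) are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# NumPy use case: a small numerical simulation (Monte Carlo estimate of pi).
rng = np.random.default_rng(seed=0)          # seeded so the run is reproducible
points = rng.random((100_000, 2))            # uniform points in the unit square
inside = (points ** 2).sum(axis=1) <= 1.0    # True where a point lies inside the quarter circle
pi_estimate = 4 * inside.mean()

# pandas use case: summarizing structured data (hypothetical sales records).
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "revenue": [120.0, 80.0, 100.0, 60.0],
})
revenue_by_region = sales.groupby("region")["revenue"].sum()

print(round(pi_estimate, 2))
print(revenue_by_region)
```

The NumPy half works on one large homogeneous array with no Python loop, while the pandas half relies on column labels and grouping, which is the split the table describes.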
When choosing between NumPy and pandas, it’s essential to understand their strengths and limitations. Here, we’ll outline the pros and cons of each library, providing a clear comparison to help you make an informed decision.
Library | Pros | Cons |
NumPy | | |
pandas | | |
The table below presents a comparison between NumPy and pandas:
Feature | NumPy | pandas |
Data structures | Homogeneous arrays (single data type) | Heterogeneous DataFrames (mixed data types) |
Performance (numerical) | Generally faster | Slower for raw calculations, but offers convenient high-level functions |
Memory usage | Memory efficient | Potentially higher memory usage |
Strengths | Efficient numerical computations, vectorized operations | Data cleaning, manipulation, analysis, time series |
Common use cases | Scientific computing, machine learning (numerical data), image processing | Data loading, cleaning, EDA, feature engineering, time series analysis |
Indexing | Basic (integer-based, slices, and boolean indexing) | Advanced indexing (label-based, location-based) |
Missing value handling | Limited (manual replacement) | Flexible ( |
Data types | Supports various numerical data types (integer, float, complex) and boolean | Supports various numerical data types, strings, categorical data, and custom data types |
Math functions | Rich collection of element-wise mathematical functions (arithmetic, trigonometric, linear algebra) | Offers functions for common data analysis tasks (e.g., mean, standard deviation, correlation) |
Time series functionality | Limited | Specialized functionalities (date/time objects, resampling) |
Multidimensional data | Efficient handling of n-dimensional arrays | Less efficient for high-dimensional data |
Learning curve | Easier to learn due to simpler data structures | Steeper learning curve due to richer features and functionalities |
Interoperability | Integrates seamlessly with other scientific Python libraries (SciPy, Matplotlib) | Integrates well with NumPy and other data science libraries (Matplotlib, scikit-learn, and Seaborn) |
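A few rows of the table can be checked directly. The sketch below touches data structures, indexing, missing-value handling, and time series resampling; the names, scores, index labels, and dates are made up for illustration, and it is only one way to demonstrate these differences.

```python
import numpy as np
import pandas as pd

# Data structures: a NumPy array holds a single dtype ...
arr = np.array([1, 2, 3.5])   # the integers are upcast, so every element is float64
# ... while each DataFrame column keeps its own dtype.
df = pd.DataFrame(
    {"name": ["Alice", "Bob", "Charlie"], "score": [85.0, np.nan, 78.0]},
    index=["a", "b", "c"],
)

# Indexing: position-based in NumPy, label-based (loc) or position-based (iloc) in pandas.
second_value = arr[1]
bob_by_label = df.loc["b", "name"]
bob_by_position = df.iloc[1, 0]

# Missing values: NumPy needs a NaN-aware function or manual replacement,
# while pandas skips NaN by default and can fill or drop it.
numpy_mean = np.nanmean(df["score"].to_numpy())
pandas_mean = df["score"].mean()
filled = df["score"].fillna(pandas_mean)

# Time series: resample 60 daily values into monthly totals.
days = pd.date_range("2024-01-01", periods=60, freq="D")
daily = pd.Series(np.ones(60), index=days)
monthly = daily.resample("MS").sum()   # "MS" groups by month start

print(arr.dtype, df.dtypes.to_dict())
print(monthly)
```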
Now that you know about both of these Python data manipulation tools, we hope you feel ready to choose the one that fits your data.
NumPy shines in numerical computations and high-performance scientific computing, making it the preferred choice for tasks involving large-scale numerical data and complex mathematical operations.
pandas, on the other hand, is particularly effective in data manipulation and analysis, providing intuitive tools for handling and transforming structured data, which is invaluable for data cleaning, exploration, and preprocessing in machine learning.
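In practice the two libraries are often used together. The following is a minimal sketch, assuming a hypothetical table of sensor readings: pandas handles the cleaning, and NumPy handles the numerical step on the underlying array.

```python
import numpy as np
import pandas as pd

# Hypothetical raw measurements with one duplicate row and one missing value.
raw = pd.DataFrame({
    "sensor": ["s1", "s2", "s2", "s3"],
    "reading": [1.0, 2.0, 2.0, np.nan],
})

# pandas: clean the structured data.
clean = raw.drop_duplicates().dropna(subset=["reading"])

# NumPy: standardize the cleaned values (zero mean, unit population standard deviation).
values = clean["reading"].to_numpy()
standardized = (values - values.mean()) / values.std()

print(clean)
print(standardized)
```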
Whether you choose to work with one tool, or have decided to learn both, you can get hands-on with NumPy and pandas in our comprehensive Skill Path:
Python Data Analysis and Visualization
With over 400 billion gigabytes of data out there and more every day, companies are paying top dollar to those who can leverage it. Strong data skills are becoming increasingly valuable - even if you choose not to become a professional data scientist. This path will help you master the skills to extract insights from data using a powerful (and easy to use) assortment of popular Python libraries.
You can keep building your data science skills with our Data Science resources. Check them out, and consider exploring advanced tools like SciPy for scientific computing, Matplotlib and Seaborn for visualization, and scikit-learn for machine learning. Learning to work with SQL and NoSQL databases can also broaden your ability to manage diverse datasets effectively.