What is the Pandas library in Python?

Pandas is widely used for data manipulation and analysis in Python. It is built on top of NumPy and uses Matplotlib as its default plotting backend. Thus, it offers a variety of functions for both handling data and visualizing it.

Structure of Pandas

Pandas stores data as series and dataframes.

  • A series is a single column in Pandas. It has a 1-dimensional structure.
  • A dataframe is a collection of series (multiple columns) that share the same index, and thus has a 2-dimensional structure.

Both series and dataframes have indices.

  • Indices are used to identify individual records (rows) in Pandas.

The illustration below shows a dataframe, series, and indices:

Series and Dataframe
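
The structures described above can be sketched in a few lines of code. This is a minimal example; the names `ages` and `people` and the sample values are made up for illustration.

```python
import pandas as pd

# A series: one labelled, 1-dimensional column of values.
ages = pd.Series([31, 25, 47], index=["ann", "bob", "cy"], name="age")

# A dataframe: several columns sharing the same index (2-dimensional).
people = pd.DataFrame(
    {"age": [31, 25, 47], "city": ["Oslo", "Lima", "Pune"]},
    index=["ann", "bob", "cy"],
)

print(ages.ndim, people.ndim)   # number of dimensions of each structure
print(people.index.tolist())    # the index labels that identify each row
print(people.loc["bob"])        # look up one record by its index label
print(type(people["age"]))      # selecting a single column gives a series
```

If no index is passed, Pandas assigns a default integer index starting at 0.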

Reading files

Pandas can be used to read a variety of file formats. Each file is converted to a dataframe once it is read.

Some widely used file formats are listed below:

  • .csv
  • .xlsx
  • .json
  • .xml
  • .html
  • SQL (database tables and query results, rather than a file)
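
As a rough sketch of how reading works, the example below loads the same small table from CSV text, JSON text, and an in-memory SQLite database. The sample data and the table name `scores` are assumptions made for illustration; in practice, a file path or URL would usually be passed instead of an in-memory buffer.

```python
import io
import sqlite3

import pandas as pd

csv_text = "name,score\nann,90\nbob,72\n"
json_text = '[{"name": "ann", "score": 90}, {"name": "bob", "score": 72}]'

# Each reader returns a dataframe, whatever the source format is.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

# SQL is read through a database connection rather than from a file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)", [("ann", 90), ("bob", 72)])
from_sql = pd.read_sql("SELECT name, score FROM scores", conn)
conn.close()

for frame in (from_csv, from_json, from_sql):
    print(type(frame).__name__, list(frame.columns), len(frame))
```

The other formats in the list have matching readers (`pd.read_excel`, `pd.read_xml`, and `pd.read_html`), which may need optional packages to be installed.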

Data manipulation

Pandas can be used to apply operations to individual series and to entire dataframes. These operations include computing descriptive statistics (mean, median, and mode), grouping data by the values in one or more columns, filtering rows and columns, merging data, and dealing with missing values.
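
The operations mentioned above might look like the following sketch. The `sales` and `regions` tables are invented for illustration, and filling missing values with 0 is just one possible choice.

```python
import pandas as pd

sales = pd.DataFrame({
    "store": ["A", "A", "B", "B", "C"],
    "units": [10, 14, 7, None, 7],   # one missing value
})
regions = pd.DataFrame({
    "store": ["A", "B", "C"],
    "region": ["north", "south", "south"],
})

# Descriptive statistics on a single series (missing values are skipped).
mean_units = sales["units"].mean()           # 9.5
median_units = sales["units"].median()       # 8.5
mode_units = sales["units"].mode().tolist()  # [7.0]

# Dealing with missing values: here they are replaced with 0.
filled = sales.fillna({"units": 0})

# Filtering rows with a boolean condition.
busy = filled[filled["units"] > 8]

# Grouping rows by store and summing the units in each group.
per_store = filled.groupby("store")["units"].sum()

# Merging two dataframes on a shared column.
merged = filled.merge(regions, on="store", how="left")

print(mean_units, median_units, mode_units)
print(per_store)
print(merged)
```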

Data visualization

Pandas plotting methods use Matplotlib by default, which offers extensive support for visualizations. We can draw a variety of plots, including:

  • Histograms
  • Bar plots
  • Pie charts
  • Box plots
  • Line plots
  • Scatter plots
  • Density (KDE) plots
  • Hexagonal bin plots
  • Area plots
  • Lag plots
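
A few of these plot types can be drawn as in the sketch below. It assumes Matplotlib is installed; the sample data is made up, and the non-interactive `Agg` backend is selected only so that the script runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 5, 4, 6]})

# One row of three plots, each drawn by a Pandas plotting call.
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
df["y"].plot(kind="hist", bins=3, ax=axes[0], title="Histogram")
df.plot(kind="scatter", x="x", y="y", ax=axes[1], title="Scatter plot")
df.plot(kind="line", x="x", y="y", ax=axes[2], title="Line plot")
fig.tight_layout()

# Render the figure to an in-memory PNG; pass a file name to save to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
print(buffer.getbuffer().nbytes, "bytes written")
```

Lag plots are not a `kind` of `DataFrame.plot`; they are available as `pandas.plotting.lag_plot`.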

The illustration below shows some of the plots in Pandas:

Plots in Pandas (image from Python Awesome)

Data science in Pandas

Pandas is widely used throughout the data science workflow. This includes reading large amounts of data from different formats, cleaning the data, performing exploratory data analysis (EDA), plotting visualizations, and preparing data for statistical learning and machine learning.
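
A very small version of such a workflow is sketched below, under the assumption of a made-up table of study hours and test scores. The modelling step itself is left out; the last lines only show how the cleaned data could be handed to another library as NumPy arrays.

```python
import io

import pandas as pd

raw = io.StringIO(
    "hours,score\n"
    "1,52\n"
    "2,58\n"
    "2,58\n"   # duplicate record
    "3,\n"     # missing score
    "4,71\n"
    "5,77\n"
)

# Read: load the raw text into a dataframe.
data = pd.read_csv(raw)

# Clean: drop the duplicate row and the row with a missing value.
clean = data.drop_duplicates().dropna().reset_index(drop=True)

# Explore: summary statistics and the correlation between the columns.
summary = clean.describe()
corr = clean["hours"].corr(clean["score"])

# Prepare: feature matrix and target vector for a later modelling step.
X = clean[["hours"]].to_numpy()
y = clean["score"].to_numpy()

print(summary)
print(round(corr, 3))
print(X.shape, y.shape)
```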

Copyright ©2024 Educative, Inc. All rights reserved