pandas is a Python library that is popular with data scientists and analysts. It enables users to manipulate and analyze data using sophisticated data analysis tools.
pandas provides two data structures that shape data into a readable form:
Series
DataFrame
A pandas series is a one-dimensional data structure made up of key-value pairs, where the keys/labels form the index and the values are the data stored at each label. It is similar to a Python dictionary, except that it provides more freedom to manipulate and edit the data.
We use pandas.Series() to initialize a series object.
The syntax to initialize different series objects is shown below:
    import pandas

    ##### INITIALIZATION #####
    # STRING SERIES
    fruits = pandas.Series(["apples", "oranges", "bananas"])
    print("Fruit series:")
    print(fruits)

    # FLOAT SERIES
    temperature = pandas.Series([32.6, 34.1, 28.0, 35.9])
    print("\nTemperature series:")
    print(temperature)

    # INTEGER SERIES
    factors_of_12 = pandas.Series([1, 2, 4, 6, 12])
    print("\nFactors of 12 series:")
    print(factors_of_12)
    print("Type of this data structure is:", type(factors_of_12))
In the code example above, three different series are initialized by passing a list to pandas.Series(). Every element in the series has a label/index. By default, the indices work like array indices, i.e., they start at 0 and increase by one for each element.
However, we can provide our own indices through the index parameter of pandas.Series().
    import pandas

    # Integer indices
    fruits = pandas.Series(["apples", "oranges", "bananas"], index=[4, 3, 2])
    print("Fruit series:")
    print(fruits)

    # String indices
    temperature = pandas.Series([32.6, 34.1, 28.0, 35.9], index=["one", "two", "three", "four"])
    print("\nTemperature series:")
    print(temperature)

    # Non-unique index values
    factors_of_12 = pandas.Series([1, 2, 4, 6, 12], index=[1, 1, 2, 2, 3])
    print("\nFactors of 12 series:")
    print(factors_of_12)
    print("Type of this data structure is:", type(factors_of_12))
Indices can be of any hashable data type, e.g., integers and strings. Index values don't have to be unique, as the last series in the code example above shows.
Moreover, you can name your series by passing a string to the name argument of pandas.Series():
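To make the effect of non-unique index values concrete, here is a minimal sketch that reuses the factors_of_12 data from the example above. The variable names single and repeated are illustrative only. It shows that looking up a unique label gives back one value, while looking up a repeated label gives back a smaller series holding every match.

```python
import pandas as pd

# Same data as the example above: the labels 1 and 2 each appear twice.
factors_of_12 = pd.Series([1, 2, 4, 6, 12], index=[1, 1, 2, 2, 3])

# A unique label selects a single value.
single = factors_of_12.loc[3]

# A repeated label selects every matching element as a smaller Series.
repeated = factors_of_12.loc[1]

print("Value at label 3:", single)
print("Values at label 1:")
print(repeated)
print("Index is unique:", factors_of_12.index.is_unique)
```

Because a lookup can return either a value or a series, code that relies on duplicate labels usually needs to handle both cases.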
    import pandas

    fruit = pandas.Series(["apples", "oranges", "bananas"], name="fruit_series")
    print("Fruit series:")
    print(fruit)
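One place where the name becomes useful is when a series is turned into a DataFrame. The short sketch below assumes the same fruit_series name as above and uses Series.to_frame(), which carries the series name over as the column label. The variable frame is illustrative.

```python
import pandas as pd

fruit = pd.Series(["apples", "oranges", "bananas"], name="fruit_series")

# The name is stored on the Series itself.
print(fruit.name)

# Converting to a one-column DataFrame reuses the name as the column label.
frame = fruit.to_frame()
print(list(frame.columns))
```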
We can also initialize a series from a Python dictionary, in which case the dictionary keys become the index, using the following syntax:
    import pandas

    data = {'a': 25, 'bb': 30, 'c': 50, 'za': 21, 2: 200}
    fruit = pandas.Series(data)
    print("Series:")
    print(fruit)
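A dictionary can also be combined with the index parameter. The sketch below uses a smaller, illustrative dictionary and a made-up label "x" that is not one of its keys. When an index is given, pandas picks the values in that order and fills labels that are missing from the dictionary with NaN.

```python
import pandas as pd

data = {"a": 25, "bb": 30, "c": 50}

# Without an index, the dictionary keys become the labels, in insertion order.
by_keys = pd.Series(data)
print(by_keys)

# With an index, values are selected and ordered by those labels.
# "x" is not a key in the dictionary, so its value is NaN.
selected = pd.Series(data, index=["c", "a", "x"])
print(selected)
```

Note that the NaN entry forces the second series to hold floating-point values rather than integers.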
To query a series object by integer position (counting from 0), we use the .iloc[] indexer. To query by label, whether default or user-defined, we use the .loc[] indexer or the bracket operator []. Older pandas versions also let [] fall back to positions for integer keys when the labels are not integers, but that behavior is deprecated, so .iloc[] is the safer choice for positional access.

    import pandas

    fruits = pandas.Series(["apples", "oranges", "bananas"], index=['a', 'b', 'c'])
    print("Fruit series:")
    print(fruits)

    ##### ACCESSING DATA #####
    # Using .iloc (integer position)
    print("\n2nd fruit using .iloc[]: ", fruits.iloc[1])
    # Using the bracket operator (label)
    print("\nFruit at key \"b\" using []: ", fruits['b'])
    # Using .loc (label)
    print("\nFruit at key \"b\" using .loc[]: ", fruits.loc['b'])
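The difference between the two indexers is easiest to see when the labels are themselves integers. The following sketch uses an illustrative, deliberately shuffled integer index so that label 0 and position 0 point at different elements.

```python
import pandas as pd

# Integer labels that do not match the positions 0, 1, 2.
fruits = pd.Series(["apples", "oranges", "bananas"], index=[2, 0, 1])

# .loc[] looks up the element whose label is 0.
by_label = fruits.loc[0]

# .iloc[] looks up the element at position 0 (the first one).
by_position = fruits.iloc[0]

print("Label 0:", by_label)
print("Position 0:", by_position)
```

Spelling out .loc[] or .iloc[] avoids any ambiguity about which of the two lookups is intended.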
Note: The pandas series provides a vast range of functionality. To dig deeper into the different series methods, visit the official documentation.
A pandas DataFrame is a two-dimensional data structure that can be thought of as a spreadsheet. It can also be thought of as a collection of one or more series that share a common index.
To initialize a DataFrame, use pandas.DataFrame():
    import pandas as pd

    ##### INITIALIZATION #####
    fruits_jack = ["apples", "oranges", "bananas"]
    fruits_john = ["guavas", "kiwis", "strawberries"]
    index = ["a", "b", "c"]
    all_fruits = {"Jack's": fruits_jack, "John's": fruits_john}

    fruits_default_index = pd.DataFrame(all_fruits)
    print("Dataframe with default indices:\n", fruits_default_index, "\n")

    new_fruits = pd.DataFrame(all_fruits, index=index)
    print("Dataframe with given indices:\n", new_fruits, "\n")
In the code example above, a DataFrame is initialized using a dictionary with two key-value pairs. Every key in this dictionary represents a column in the resulting DataFrame and the value represents all the elements in this column.
The two lists of fruits are used as the values of a Python dictionary, which is then passed to pandas.DataFrame() to make a DataFrame.
For the second DataFrame, we passed a list of labels through the index argument of pandas.DataFrame() to use our custom indices.
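The view of a DataFrame as a collection of series with a common index can be sketched directly by passing series instead of lists. In this hedged example the two series (named jack and john for illustration) do not cover exactly the same labels. pandas aligns them on the union of their labels and fills the missing cell with a missing-value marker.

```python
import pandas as pd

# Two series with overlapping, but not identical, labels.
jack = pd.Series(["apples", "oranges", "bananas"], index=["a", "b", "c"])
john = pd.Series(["guavas", "kiwis"], index=["a", "b"])

# Each series becomes a column; rows are aligned by label.
fruits = pd.DataFrame({"Jack's": jack, "John's": john})
print(fruits)

# John has no entry for label "c", so that cell is missing.
print(pd.isna(fruits.loc["c", "John's"]))
```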
The DataFrame can be queried in multiple ways:
.loc[] queries the DataFrame by label, i.e., by the default or user-defined index values.
.iloc[] queries the DataFrame by integer position, counting from 0.
The bracket operator [] selects one or more columns by name.
We can also chain queries, or pass both a row and a column to .loc[] or .iloc[], to read a specific cell in the DataFrame.
These queries return a series or a single value depending on the type of query. Querying a row or a column returns a series, while querying a cell returns a single value.
    import pandas as pd

    ##### INITIALIZATION #####
    fruits_jack = ["apples", "oranges", "bananas"]
    fruits_john = ["guavas", "kiwis", "strawberries"]
    index = ["a", "b", "c"]
    all_fruits = {"Jack's": fruits_jack, "John's": fruits_john}

    fruits = pd.DataFrame(all_fruits, index=index)
    print(fruits, "\n")
    new_fruits = pd.DataFrame(all_fruits)
    print(new_fruits, "\n")

    ##### QUERY #####
    # Using integer position: returns the first row as a series
    print("1st row:")
    print(fruits.iloc[0], "\n")
    # Using key
    print("Fruits at key \"c\":")
    print(fruits.loc["c"], "\n")
    # Using column name
    print("Jack's fruits: ")
    print(fruits["Jack's"], "\n")
    # Chained query, querying a cell: pick the column, then the third position
    print("John's third fruit: ")
    print(fruits["John's"].iloc[2], "\n")
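The return types described above can be checked directly. This sketch rebuilds the same fruits DataFrame and inspects what a row query, a column query, and a cell query give back. The variable names row, column, and cell are illustrative, and the cell is read with a single .loc[row, column] call rather than a chained query.

```python
import pandas as pd

fruits = pd.DataFrame(
    {
        "Jack's": ["apples", "oranges", "bananas"],
        "John's": ["guavas", "kiwis", "strawberries"],
    },
    index=["a", "b", "c"],
)

row = fruits.loc["a"]             # one row, returned as a Series
column = fruits["John's"]         # one column, returned as a Series
cell = fruits.loc["c", "John's"]  # one cell, returned as a plain value

print(type(row).__name__)
print(type(column).__name__)
print(cell)
```

Reading a cell with one .loc[] call is generally preferable to chaining two lookups, since it states the row and the column in a single step.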
Note: The pandas DataFrame equips you with numerous tools to manipulate and analyze large amounts of data. To dig deeper into the different DataFrame methods, visit the official documentation.