Series Introduction
Learn the basics of pandas' Series data structure in this lesson.
A Series is used to model one-dimensional data. The Series object also has a few more bits of data, including an index and a name. A common idea in pandas is the notion of an axis. Because a series is one-dimensional, it has a single axis—the index.
The table below counts the songs that several artists composed. We’ll use it to explore the series:
Counts of songs artists composed
Artist | Data |
0 | 145 |
1 | 142 |
2 | 38 |
3 | 13 |
Data representation in Python
If we wanted to represent this data in pure Python, we could use a dictionary. The series dictionary has a list of the data points stored under the 'data' key. In addition to this entry for the actual data, there is an explicit entry for the corresponding index values (under the 'index' key) and an entry for the name of the data (under the 'name' key):
series = {'index': [0, 1, 2, 3], 'data': [145, 142, 38, 13], 'name': 'songs'}
print(series)
The get function defined below can pull items out of this data structure based on the index:
series = {'index': [0, 1, 2, 3], 'data': [145, 142, 38, 13], 'name': 'songs'}

def get(series, idx):
    # Find the position of the label in the index, then return the value stored there.
    value_idx = series['index'].index(idx)
    return series['data'][value_idx]

print(get(series, 1))
The index abstraction
This double abstraction of the index seems unnecessary at first glance, since a list already has integer indexes. But pandas has a trick up its sleeve. By allowing non-integer values, the data structure supports other index types such as strings and dates, as well as arbitrarily ordered indices or even duplicate index values.
Below is an example that has string values for the index:
# get is the function defined in the previous example.
songs = {'index': ['Paul', 'John', 'George', 'Ringo'],
         'data': [145, 142, 38, 13],
         'name': 'counts'}
print(get(songs, 'John'))
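The get function above returns only the first match, so on its own it does not cover duplicate labels. As a minimal sketch of how the same dictionary model could handle them, the helper get_all and the sample dictionary plays below are hypothetical (neither comes from the lesson or from pandas), and the numbers are made up. The labels are dates, they are not sorted, and one of them appears twice:

```python
from datetime import date

# Hypothetical sample data: date labels, unsorted, with one duplicate label.
plays = {
    'index': [date(2020, 1, 2), date(2019, 6, 1), date(2020, 1, 2)],
    'data': [10, 20, 30],
    'name': 'plays',
}

def get_all(series, label):
    """Return every value whose index label equals `label`."""
    return [value
            for idx, value in zip(series['index'], series['data'])
            if idx == label]

print(get_all(plays, date(2020, 1, 2)))  # [10, 30]
print(get_all(plays, date(2019, 6, 1)))  # [20]
print(get_all(plays, date(2000, 1, 1)))  # []
```

Returning a list for every lookup is one possible design choice; it keeps the result type the same whether a label occurs zero, one, or many times.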
The index is a core feature of pandas’ data structures given the library’s past in the analysis of financial data or time-series data. Many of the operations performed on a Series operate directly on the index or by index lookup.
The pandas Series
With that background in mind, let’s look at how to create a Series in pandas. It’s easy to create a Series object from a list:
import pandas as pd

songs2 = pd.Series([145, 142, 38, 13], name='counts')
print(songs2)
When the interpreter prints our Series, pandas makes a best effort to format it for the current terminal size. The series is one-dimensional, even though the output looks two-dimensional. The leftmost column is the index, which is not part of the values. The generic name for an index is an axis, and the values of the index (0, 1, 2, 3) are called axis labels. The data (145, 142, 38, and 13) are also called the values of the series. The two-dimensional structure in pandas, the DataFrame, has two axes: one for the rows and another for the columns.
The rightmost column in the output contains the values of the series (145, 142, 38, and 13). In this case, they’re integers (the console representation says dtype: int64, in which dtype means data type and int64 means 64-bit integer), but in general, a Series can hold strings, floats, booleans, or arbitrary Python objects.
To get the best speed (and to leverage vectorized operations), the values should be of the same type, though this is not required. It’s easy to inspect the index of a Series (or DataFrame), since it’s an attribute of the object:
# songs2 is the Series created in the previous example.
x = songs2.index
print(x)
The default values for an index are monotonically increasing integers. songs2 has an integer-based index.
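As a short sketch of these pieces, the snippet below rebuilds the same Series so that it runs on its own and then inspects it. The attributes used (ndim, index, name) and the to_numpy method are standard members of a pandas Series:

```python
import pandas as pd

songs2 = pd.Series([145, 142, 38, 13], name='counts')

print(songs2.ndim)         # 1: a Series has a single axis
print(list(songs2.index))  # [0, 1, 2, 3]: the default integer labels
print(songs2.to_numpy())   # the values only, without the index
print(songs2.name)         # counts
```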
The index can be string-based as well, in which case pandas reports the data type of the index as object (not string):
import pandas as pd

songs3 = pd.Series([145, 142, 38, 13],
                   name='counts',
                   index=['Paul', 'John', 'George', 'Ringo'])
print(songs3)
Note: The dtype that we see when we print a Series is the type of the values, not the index. Even though this looks two-dimensional, remember that the index is not part of the values.
When we inspect the index attribute, we see that the dtype is object:
# songs3 is the Series created in the previous example.
x = songs3.index
print(x)
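To illustrate what a labeled index makes possible, here is a small sketch that rebuilds songs3 and looks up a value with .loc, the standard pandas label-based accessor. The second Series, bonus, and its numbers are made up for this example. Adding the two shows pandas matching values by index label rather than by position:

```python
import pandas as pd

songs3 = pd.Series([145, 142, 38, 13],
                   name='counts',
                   index=['Paul', 'John', 'George', 'Ringo'])

# Look up a value by its index label.
print(songs3.loc['John'])  # 142

# Hypothetical second Series: same labels, different order.
bonus = pd.Series([1, 2, 3, 4],
                  index=['Ringo', 'George', 'John', 'Paul'])

# Arithmetic aligns on labels, not on positions.
total = songs3 + bonus
print(total.loc['Ringo'])  # 14 (13 + 1)
print(total.loc['Paul'])   # 149 (145 + 4)
```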
The actual data (or values) for a series does not have to be numeric or homogeneous. We can insert Python objects into a series:
import pandas as pd

class Foo:
    pass

ringo = pd.Series(['Richard', 'Starkey', 13, Foo()], name='ringo')
print(ringo)
In the above case, the dtype (data type) of the Series is object (meaning a Python object). This can be good or bad.
The object data type is also used for a series with string values, as well as for values that have heterogeneous or mixed types. If we have only numeric data in a Series, we wouldn’t want it stored as Python objects but rather as int64 or float64, which allows us to do vectorized numeric operations.
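A brief sketch of the difference, using two small Series built only for this example (the names numeric and mixed are illustrative). The first holds only integers and gets an integer dtype; the second mixes strings with a number and falls back to object:

```python
import pandas as pd

numeric = pd.Series([145, 142, 38, 13], name='counts')
mixed = pd.Series(['Richard', 'Starkey', 13], name='ringo')

print(numeric.dtype)  # an integer dtype
print(mixed.dtype)    # object

# With a numeric dtype, arithmetic applies to every value at once.
doubled = numeric * 2
print(doubled.tolist())  # [290, 284, 76, 26]
```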
If we have time data and pandas reports the object type, we probably have strings for the dates. Using strings instead of date types is bad because we don’t get the date operations that we would get if the type were datetime64[ns]. For a series of ordinary string data, on the other hand, the object type is expected. Don’t worry; we’ll see how to convert types later in the course.