Concatenation
pandas provides some useful utilities to combine data from multiple files. Let's see how to use them.
append()
with same column
In real life, data comes from different sources, combining them together into a single DataFrame is very useful.
Let’s begin with a very simple scenario.
There are two DataFrame structures with the same column names. Your job is to combine these two DataFrames into one. Before trying it in pandas, do you remember how to do it using the native Python list
? We would use append()
to add items to a Python list
. In pandas, the function is also append()
. It concatenates the two DataFrame objects along axis=0
, namely the index or column.
Notice:
- In the example, the function only accepts one DataFrame. However, multiple DataFrames can be passed as a list.
- The index can be duplicated.
- ignore_index=True could be passed if the index of this DataFrame is meaningless.
import pandas as pdd1 = {"a": [1, 2], "b": [2, 4]}df1 = pd.DataFrame(d1)print("The first DataFrame")print(df1)print("------------------------")d2 = {"a": [3, 4], "b": [6, 8]}df2 = pd.DataFrame(d2)print("The second DataFrame")print(df2)df2 = pd.DataFrame(d2)print("------------------------")df1 = df1.append(df2)print("Append the second DataFrame to the first one")print(df1)print("Append the second DataFrame to the first one and set ignore_index=True")df1 = df1.append(df2, ignore_index=True)print(df1)
Two DataFrame objects are created from line 3
to line 12
.
Line 15
shows the basic usage of append
. From the output of line 16
, you may notice that both DataFrame objects keep the original indexes. You can see the duplicated index.
ignore_index=True
is passed to append
on Line 20
, which would remove all indexex of original DataFrame objects.
append()
with different column
Not all DataFrames have the same column names. If two DataFrames have different column names, what would happen if append()
is called? These DataFrames would be joined where the column names are the same and the remaining positions would be filled by NaN
.
import pandas as pdd1 = {"b": [1, 2], "c": [2, 4]}df1 = pd.DataFrame(d1)print("The first DataFrame")print(df1)print("------------------------")d2 = {"c": [3, 4], "d": [6, 8]}df2 = pd.DataFrame(d2)print("The second DataFrame")print(df2)df2 = pd.DataFrame(d2)print("------------------------")df1 = df1.append(df2)print("Append the second DataFrame to the first one")print(df1)
Two DataFrame objects are created from line 3
to line 12
. Then df2
is appended to df1
on line 15
.
From the output from line 17
, you would find that the columns of these two objects are different.
concat
with outer join
concat()
is a much more powerful function to join DataFrames, which can combine DataFrame objects along both rows and columns, unlike append which does it only along the rows.
Its default behavior is to join DataFrames and columns, just like append()
. axis=1
can be passed to join DataFrames along with rows.
join
is a parameter that specifies how to handle indexes on another axis. "outer"
for union and "inner"
for the intersection.
Below is an example of an outer join. The first DataFrame has an index [1, 2], the second DataFrame has an index [2, 3], they have one overlapping index, 2. The rows with index=2 in both DataFrames would join. Meanwhile, other rows would be kept separate. The empty location would be filled by NaN. After concat, the total row number is 3.
import pandas as pdd1 = {"a": [1, 2], "b": [2, 4]}df1 = pd.DataFrame(d1, index=[1, 2])print("The first DataFrame")print(df1)print("------------------------")d2 = {"c": [3, 4], "d": [6, 8]}df2 = pd.DataFrame(d2, index=[2, 3])print("The second DataFrame")print(df2)print("------------------------")print("The outer join DataFrame")df3 = pd.concat([df1, df2], axis=1)print(df3)
Two DataFrame objects are created from line 3
to line 12
. On line 14
, we see how to concat these two objects using outer
join.
concat
with inner join
inner
join handles the index for intersection. In the example below, these two DataFrames share only one common index, 2. So, the final DataFrame has only one row.
import pandas as pdd1 = {"a": [1, 2], "b": [2, 4]}df1 = pd.DataFrame(d1, index=[1, 2])print("The first DataFrame")print(df1)print("------------------------")d2 = {"c": [3, 4], "d": [6, 8]}df2 = pd.DataFrame(d2, index=[2, 3])print("The second DataFrame")print(df2)print("------------------------")print("The inner join DataFrame")df3 = pd.concat([df1, df2], axis=1, join="inner")print(df3)
Two DataFrame objects are created from line 3
to line 12
. Then line 14
shows how to concat these two objects with inner
join.
Comparison with SQL join
If you are familiar with SQL, you can check the comparison table below.
concat join type | SQL |
---|---|
inner | INNER JOIN |
outer | FULL OUTER JOIN |
right | RIGHT OUTER JOIN |
left | LEFT OUTER JOIN |