Speed up File Loading
In this lesson, we show how to speed up file loading.
You may notice that loading a large CSV file into a DataFrame can be very slow; it is a time-consuming operation. If the file is static, meaning it does not change frequently, and loading it for data analysis is a regular part of your job, then reducing the file load time is well worth the effort.
Export DataFrame object to hdf file format
There is a way to export your static file to a binary format, such as the hdf format. Hierarchical Data Format (HDF) is a set of file formats (HDF4, HDF5) designed to store and organize large amounts of data. Using this format can effectively reduce file load time.
Let’s see an example. Because of a limitation of this site, files larger than 2 MB are not allowed, so the code here is not executable.
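import os
import pandas as pd
import numpy as np
import timeit
# Let's create a matrix, which is 200000 * 20, and create a DataFrame object from it.
d = np.random.randint(-10, 10, size=(200000, 20))
df = pd.DataFrame(d)
# Make sure the output directory exists before writing to it.
os.makedirs("output", exist_ok=True)
# Export the data to two files: one in CSV format, the other in HDF format.
# Note: to_hdf requires the PyTables package ("tables") to be installed.
df.to_csv("output/data.csv")
df.to_hdf("output/data.hdf", key="df")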
# We use timeit to record the running time between the start and stop.
# In this section, we read the file from CSV file, and print the running time.
start = timeit.default_timer()
df1 = pd.read_csv("output/data.csv")
stop = timeit.default_timer()
print('Loading data.csv file time: {}'.format(stop - start))
# In this section, we read the file from HDF file, and print the running time.
start = timeit.default_timer()
df2 = pd.read_hdf("output/data.hdf")
stop = timeit.default_timer()
print('Loading data.hdf file time: {}'.format(stop - start))
Note: File read performance depends on the environment. The following result comes from running the code on my own PC.
Loading data.csv file time: 0.3129s
Loading data.hdf file time: 0.0652s
As you can see, the loading time for the HDF format is about one-fifth that of the CSV.
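Because a single timed run can be noisy, one option is to repeat each load a few times and keep the fastest run. The following is only a minimal sketch under some assumptions: the helper name best_load_time is hypothetical, the matrix is smaller than the one above so that it runs quickly, and it writes to a temporary directory rather than output/.

```python
import os
import tempfile
import timeit

import numpy as np
import pandas as pd


def best_load_time(loader, repeats=3):
    """Call loader() `repeats` times and return the fastest run in seconds.

    Hypothetical helper for illustration; not part of pandas.
    """
    # timeit.repeat returns one total time per repetition; number=1 means
    # each repetition calls loader() exactly once.
    return min(timeit.repeat(loader, number=1, repeat=repeats))


# A smaller matrix than in the lesson (20000 * 20) to keep the sketch fast.
df = pd.DataFrame(np.random.randint(-10, 10, size=(20000, 20)))

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "data.csv")
    df.to_csv(csv_path, index=False)
    # Time only the read; the file was written before the measurement.
    csv_best = best_load_time(lambda: pd.read_csv(csv_path))

print('Best data.csv load time: {}'.format(csv_best))
```

The same helper could wrap pd.read_hdf in the same way, which would make the two formats easier to compare on a busy machine.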
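In addition to the HDF format, pandas offers other export targets to choose from, such as pickle and gbq. Note that gbq refers to Google BigQuery, which is a remote table rather than a local file.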
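As a rough illustration of the pickle option, the sketch below writes the same DataFrame to CSV and to pickle, then times each read in the same way as the HDF example. It makes a few assumptions: the matrix is smaller than the one in the lesson, the files go to a temporary directory, and the file name data.pkl is only an example. Actual timings depend on your environment.

```python
import os
import tempfile
import timeit

import numpy as np
import pandas as pd

# A smaller matrix than in the lesson (20000 * 20) so the sketch runs quickly.
df = pd.DataFrame(np.random.randint(-10, 10, size=(20000, 20)))

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "data.csv")
    pkl_path = os.path.join(tmp_dir, "data.pkl")  # example file name

    # Export the data to two files, one in CSV format, the other in pickle format.
    df.to_csv(csv_path, index=False)
    df.to_pickle(pkl_path)

    # Time reading the CSV file.
    start = timeit.default_timer()
    df_csv = pd.read_csv(csv_path)
    csv_time = timeit.default_timer() - start

    # Time reading the pickle file.
    start = timeit.default_timer()
    df_pkl = pd.read_pickle(pkl_path)
    pkl_time = timeit.default_timer() - start

print('Loading data.csv file time: {}'.format(csv_time))
print('Loading data.pkl file time: {}'.format(pkl_time))

# The pickle round trip should reproduce the original DataFrame exactly.
pickle_round_trip_ok = df_pkl.equals(df)
```

Keep in mind that pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.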