Importing Geospatial Data
Learn how to import geospatial data into a GeoDataFrame.
Overview
The GeoDataFrame is the data structure provided by GeoPandas to store tabular and geographic data in a unified schema. It consists of a tabular data structure (a pandas DataFrame) with one or more columns of geometry data types, typically stored as a GeoSeries. The geometry column is what distinguishes a GeoDataFrame from a standard pandas DataFrame and enables the geospatial capabilities.
In this lesson, we'll learn how to load some data into a GeoDataFrame and see its internal structure.
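Before loading any files, the structure described above can be sketched with a tiny GeoDataFrame built in memory. This is only an illustration: the city names, the coordinates, and the choice of EPSG:4326 as the coordinate reference system are assumptions made for the example and do not come from the lesson's datasets.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

# Hypothetical sample records: two cities with longitude/latitude coordinates.
cities = gpd.GeoDataFrame(
    {
        "city": ["Lisbon", "Nairobi"],
        "geometry": [Point(-9.14, 38.72), Point(36.82, -1.29)],
    },
    crs="EPSG:4326",  # assumption: coordinates are WGS 84 longitude/latitude
)

# A GeoDataFrame is still a pandas DataFrame...
print(isinstance(cities, pd.DataFrame))
# ...and its active geometry column is a GeoSeries.
print(type(cities.geometry).__name__)
# The column types show the special geometry dtype next to regular ones.
print(cities.dtypes)
```

The tabular column (city) behaves like any pandas column, while the geometry column carries the spatial information.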
The datasets
For this first example, we are going to use one of the datasets (Natural Earth) that comes with the geodatasets library. The geodatasets library provides several toy datasets for GIS processing:
Natural Earth lowres: A low-resolution representation of the world's countries (polygons).
Natural Earth cities: A sample with the center points of 243 major cities in the world (points).
New York Borough Boundaries: A high-resolution representation of the 5 boroughs of New York City (Bronx, Manhattan, Queens, Brooklyn, Staten Island).
The full list of datasets and their descriptions can be seen in the output table produced by the following code:
import geodatasets
import pandas as pd

datasets = pd.DataFrame(geodatasets.data.flatten()).T
print(datasets[['name', 'geometry_type', 'description']].to_html())
To retrieve one dataset and save it locally, we can use the get_path function. Since the location will vary depending on the installation, environment, etc., the get_path function returns the full path where the dataset is locally stored. In the following code snippet, we retrieve two sample datasets (naturalearth.land and ny.bb, the New York Borough Boundaries) and store them locally:
import geodatasets

natural_earth = geodatasets.get_path('naturalearth.land')
print(f'Natural Earth location: {natural_earth}\n')

nybb = geodatasets.get_path('ny.bb')
print(f'New York Borough Boundaries location: {nybb}')
As we can see, a dataset path can refer to the main file (e.g., .shp) or to a compressed (ZIP) archive. In the latter case, GeoPandas will uncompress it and load the dataset automatically. If there are multiple datasets within the .zip file or multiple folders, we can specify the folder and the filename by appending !folder/filename to the path, like so:
zipfile = 'zip:///local_path/zippedfile.zip!folder/filename'
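The path convention above can be assembled programmatically. The following is a minimal sketch using only the standard library; build_zip_path is a hypothetical helper name introduced for illustration, and it assumes an absolute POSIX-style archive path like the one in the example.

```python
from pathlib import PurePosixPath


def build_zip_path(archive, member):
    """Return a 'zip://' path pointing at `member` inside `archive`.

    Hypothetical helper: it only formats the string and does not
    check that the archive or the member actually exists.
    """
    archive = PurePosixPath(archive)
    if not archive.is_absolute():
        raise ValueError("archive must be an absolute POSIX-style path")
    # 'zip://' followed by '/local_path/...' yields the triple slash.
    return f"zip://{archive}!{member.lstrip('/')}"


zip_path = build_zip_path("/local_path/zippedfile.zip", "folder/filename")
print(zip_path)  # zip:///local_path/zippedfile.zip!folder/filename
```

The resulting string is what would be passed to GeoPandas when a specific file inside the archive is needed.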
The .read_file() method
In GeoPandas, we primarily use .read_file() to read geospatial data. This method can read various geospatial data file formats, including shapefiles (.shp), GeoJSON files (.geojson), and many others.
The .read_file() method reads the geospatial data file into a GeoDataFrame object, which is a specialized pandas DataFrame object that can store and manipulate geospatial data. The method automatically detects the file format and reads it accordingly.
For example, to read a shapefile into a GeoDataFrame using .read_file(), we can use the following code:
import geopandas as gpd

# read the shapefile
gdf = gpd.read_file('path/to/shapefile.shp')
The GeoDataFrame structure
Let's open the Natural Earth dataset and analyze its structure:
import geopandas as gpd
import geodatasets

# open the Natural Earth dataset
n_earth = geodatasets.get_path('naturalearth.land')
gdf = gpd.read_file(n_earth)

# apply the function to trim the geometry (for display purpose)
gdf['geometry_str'] = gdf.geometry.map(lambda x: str(x)[:50])

# preview the dataframe
print(gdf.head().to_html())
Lines 1–2: We import the geopandas and geodatasets libraries.
Line 5: We retrieve the file path for the Natural Earth dataset using the get_path() function.
Line 6: We read the contents of the file at the specified path using the read_file() function.
Line 9: We add a new column called geometry_str, which contains a truncated string representation of the geometry column for visualization purposes.
Line 12: We preview the dataset as HTML using the .to_html() method.
Here, we can observe that each row represents a feature (islands) with its corresponding attributes. The geometry of each record (row) is stored in a special column called geometry, which stores Shapely geometries and makes it possible for GeoPandas to render and perform spatial operations on them.
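To make the role of the geometry column more concrete, here is a small sketch that runs offline. The two rectangles are hypothetical features in an arbitrary planar coordinate system (no coordinate reference system is set), chosen only because their areas are easy to verify by hand.

```python
import geopandas as gpd
from shapely.geometry import box

# Hypothetical features: two rectangles given as (minx, miny, maxx, maxy).
parcels = gpd.GeoDataFrame(
    {"name": ["small", "large"], "geometry": [box(0, 0, 2, 1), box(0, 0, 4, 3)]}
)

# Each cell of the geometry column is a Shapely object.
first = parcels.geometry.iloc[0]
print(type(first).__name__)

# Spatial attributes are computed for the whole column at once.
parcels["area"] = parcels.geometry.area
print(parcels.geometry.geom_type.tolist())
print(parcels[["name", "area"]])

# Bounding box of all geometries: minx, miny, maxx, maxy.
print(parcels.total_bounds)
```

The same attributes (area, geom_type, total_bounds) are available on the Natural Earth GeoDataFrame loaded earlier.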
Note: The geometry column could have any arbitrary name. In fact, the GeoDataFrame can have multiple columns with geographic information, but only one can be active at a time. To set a column as the geographic data for the GeoDataFrame we can use the command below.
gdf = gdf.set_geometry('column name')
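As a sketch of how switching the active geometry behaves, the example below builds a GeoDataFrame with two geometry columns. The zones data and the centroid column name are hypothetical. Note that set_geometry returns a new GeoDataFrame by default, so its result has to be assigned to a variable.

```python
import geopandas as gpd
from shapely.geometry import box

# Hypothetical data: one polygon column plus a derived point column.
zones = gpd.GeoDataFrame({"name": ["zone_a"], "geometry": [box(0, 0, 2, 2)]})
zones["centroid"] = zones.geometry.centroid

# set_geometry returns a new GeoDataFrame; the original keeps its geometry.
zones_by_centroid = zones.set_geometry("centroid")

print(zones.geometry.name)
print(zones_by_centroid.geometry.name)
print(zones_by_centroid.geometry.geom_type.tolist())
```

Spatial operations and plotting on zones_by_centroid would now use the points, while zones continues to use the polygons.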
Other data formats (FIONA)
Besides traditional shapefiles, one of the most used data formats for vector geometries, GeoPandas is able to open other types of geographic data. For that, it uses FIONA underneath, which is built on top of GDAL. The good news is that we don't need to learn GDAL's cumbersome API bindings. FIONA provides an elegant interface for reading and writing vector data in standard Python IO style.
Therefore, besides shapefiles, GeoPandas can read several vector-based data formats without additional configuration. For a full list of FIONA pre-installed drivers, we can check the supported_drivers dictionary, like so:
import fiona

# Get the supported drivers
print(fiona.supported_drivers)
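The printed dictionary maps each driver name to a mode string built from the letters r (read), a (append), and w (write). A small helper can filter it by capability. In the sketch below, drivers_supporting is a hypothetical function name, and sample_drivers is an illustrative stand-in rather than real output, since the actual dictionary depends on the installation.

```python
def drivers_supporting(drivers, mode):
    """Return the sorted driver names whose mode string contains `mode`."""
    return sorted(name for name, modes in drivers.items() if mode in modes)


# Illustrative sample only; the real dictionary depends on the installation.
sample_drivers = {
    "GeoJSON": "raw",
    "ESRI Shapefile": "raw",
    "ExampleReadOnly": "r",  # made-up entry to show a read-only driver
}

print(drivers_supporting(sample_drivers, "w"))
print(drivers_supporting(sample_drivers, "r"))
```

With FIONA installed, passing fiona.supported_drivers instead of sample_drivers would list the drivers that can actually be written or read on the current system.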
The most important file formats are supported by default, such as:
GeoJSON
GeoPackage
ESRI file geodatabase
MapInfo TAB
DXF
All these file types will be handled automatically by the .read_file() function, as we will see in the following example. Additional formats can be supported, depending on the GDAL/OGR installation.
Reading from HTTP
One great feature of the .read_file() function, besides the ability to read distinct data formats, is its capacity to load data directly from the internet through the HTTP protocol. As we don't have anything downloaded besides the bundled datasets (which come as shapefiles), let's grab something directly from the internet.
It's also possible to download remote assets with wget, for example, but we'll pass the URL directly to GeoPandas. Let's try opening a dataset with US state boundaries. In this example, the geometries are provided as .geojson:
import geopandas as gpd

gdf = gpd.read_file('https://d2ad6b4ur7yvpq.cloudfront.net/naturalearth-3.3.0/ne_110m_admin_1_states_provinces_shp.geojson')

ax = gdf.plot(column='iso_3166_2', figsize=(7, 5))
ax.set_ylabel('Latitude (degrees)')
ax.set_xlabel('Longitude (degrees)')

ax.figure.savefig('output/states.png')
Line 3: We read the .geojson geometries of the US states.
Line 5: We plot the GeoDataFrame, specifying the column iso_3166_2 to automatically create a choropleth map with distinct colors by state.
Line 9: We save the figure to the output folder for visualization.
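The alternative mentioned earlier, downloading the remote asset first and then opening the local copy, can also be sketched in Python instead of wget. The helper name download_once and the destination path are hypothetical, and the sketch uses only the standard library. It skips the download when the file already exists, which avoids repeated requests while experimenting.

```python
import shutil
import urllib.request
from pathlib import Path


def download_once(url, destination):
    """Download `url` to `destination` unless the file already exists.

    Hypothetical helper: no retries, checksums, or timeouts are handled.
    """
    destination = Path(destination)
    if not destination.exists():
        destination.parent.mkdir(parents=True, exist_ok=True)
        with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
            shutil.copyfileobj(response, out)
    return destination


# Example usage (requires network access, so it is left commented out):
# local_copy = download_once(states_url, "data/states.geojson")
# gdf = gpd.read_file(local_copy)
```

Here states_url stands for the URL used in the example above. Reading from the local copy behaves the same as reading from the URL, but works offline after the first run.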