Location Resolution: Geocoding

Use geocoding services to preprocess location attributes in many entity resolution tasks.

OpenStreetMap (OSM) is a free, open geographic database that is updated and maintained by an open community of volunteers. It converts our free-text search to a point on a geographic map: it matches the input with the most similar record from its validated database and responds with geocoordinates, a structured address, and other helpful information. This is location resolution, better known as geocoding.

Addresses are everywhere: home, work, bill-to, and ship-to addresses as part of people, companies, factories, distribution hubs, orders, invoices, and more. Therefore, matching addresses is part of many entity resolution tasks. Consider geocoding as a preprocessing step or partial resolver of entities with location attributes.

We have geocoded the given four search strings with the batch endpoint of Geoapify, a commercial API that uses the OSM ecosystem as its backend. The batch endpoint is a particularly cheap way to process large numbers of records. The first three search strings are location names with street addresses. We added a fourth consisting of just an address part. The response is stored as plain text in GeoJSON, a JSON-based format for geographic shapes and metadata. We read it into a Python dictionary, which can be comfortably parsed with geopandas, an open-source package that extends pandas to work with geographic shapes, as follows:

import json
import geopandas as gpd
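# Load the stored GeoJSON response into a Python dictionary:
with open('geocode-results.json', 'r') as file:
    response_data = json.load(file)
# Parse the features into a GeoDataFrame (EPSG:4326 is longitude/latitude in degrees):
df_geo = gpd.GeoDataFrame.from_features(response_data, crs='EPSG:4326')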
show_columns = ['geometry', 'formatted']
print(df_geo[show_columns])

Only the second row looks wrong, which is not surprising given the poor quality of that input. The df_geo['rank'] column signals the same problem. It consists of dictionaries, which we can expand with a pandas function:

import pandas as pd
show_columns = ['confidence', 'match_type']
print(pd.json_normalize(df_geo['rank'])[show_columns])
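
The confidence values can also drive an automated filter instead of a manual inspection. The following is a minimal sketch in plain Python, assuming each GeoJSON feature carries its rank dictionary under properties, as the rank column above suggests. The function name split_by_confidence, the threshold of 0.9, and the sample values are hypothetical and only serve as illustration.

```python
def split_by_confidence(features, min_confidence=0.9):
    """Split GeoJSON features into accepted and rejected lists by rank confidence."""
    accepted, rejected = [], []
    for feature in features:
        rank = feature.get("properties", {}).get("rank", {})
        # Treat a missing confidence as zero so the record gets reviewed.
        confidence = rank.get("confidence", 0.0)
        if confidence >= min_confidence:
            accepted.append(feature)
        else:
            rejected.append(feature)
    return accepted, rejected


# Hypothetical features with made-up confidence values, not real API output.
sample_features = [
    {"type": "Feature", "properties": {"rank": {"confidence": 1.0}}},
    {"type": "Feature", "properties": {"rank": {"confidence": 0.3}}},
    {"type": "Feature", "properties": {"rank": {"confidence": 1.0}}},
    {"type": "Feature", "properties": {"rank": {"confidence": 0.95}}},
]

accepted, rejected = split_by_confidence(sample_features)
print(len(accepted), "accepted,", len(rejected), "rejected")
```

A suitable threshold depends on the task. Rejected records are candidates for manual review or for a string-based comparison rather than for silent deletion.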

We drop the second record for the rest of this lesson. The other three have the same street address and yet the last one has slightly different geocoordinates:

# Drop the false line:
df_geo = df_geo.loc[df_geo.index != 1]
# Count unique coordinates:
print(df_geo.geometry.value_counts())

It’s unfair to declare one right and the other wrong. Geocodes are point coordinates, whereas locations are areas—in other words, an infinite set of points. So, we should be cautious with exact matching on geocoordinates to resolve location records.

We can address this problem in many ways. First, we can use areas instead of points to represent locations. Plus Codes, developed by Google, are rectangular, and H3 indexes, developed by Uber, are hexagons. Both are open-source encodings of longitude-latitude pairs and deterministic functions of geocoordinates.

from openlocationcode import openlocationcode
import h3
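# Encode each coordinate pair as a Plus Code with the default code length:
df_geo['plus_code'] = df_geo.apply(lambda row: openlocationcode.encode(longitude=row['lon'], latitude=row['lat']), axis=1)
# Encode each coordinate pair as an H3 index at two resolutions.
# geo_to_h3 is the h3-py 3.x name; h3-py 4.x renamed it to latlng_to_cell.
df_geo['h3_index_resolution_8'] = df_geo.apply(lambda row: h3.geo_to_h3(lng=row['lon'], lat=row['lat'], resolution=8), axis=1)
df_geo['h3_index_resolution_9'] = df_geo.apply(lambda row: h3.geo_to_h3(lng=row['lon'], lat=row['lat'], resolution=9), axis=1)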
show_columns = ['plus_code', 'h3_index_resolution_8', 'h3_index_resolution_9']
print(df_geo[show_columns])

“Kruppstr. 4” is the address of a large office building, so we still end up with different Plus Codes and H3 indexes at resolution level 8 for this same location. We can make the cells coarser by dropping some of the last characters of Plus Codes or by explicitly setting a lower resolution level in H3. However, this can quickly result in an area covering more than we want.
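
The trade-off is easier to see on a toy example. The sketch below uses a simple square grid in degrees, which is a stand-in for illustration and neither Plus Codes nor H3. The function grid_cell and both coordinates are hypothetical. Two points a few meters apart fall into different cells at a fine cell size, and only a much coarser grid puts them into the same cell.

```python
import math


def grid_cell(lat, lon, cell_size_deg):
    """Return the (row, column) of the square grid cell that contains the point."""
    return (math.floor(lat / cell_size_deg), math.floor(lon / cell_size_deg))


# Hypothetical points a few meters apart, on either side of a cell edge.
point_a = (51.23998, 6.79000)
point_b = (51.24002, 6.79000)

for cell_size_deg in (0.01, 0.1):
    same_cell = grid_cell(*point_a, cell_size_deg) == grid_cell(*point_b, cell_size_deg)
    print(f"cell size {cell_size_deg} degrees: same cell = {same_cell}")
```

A cell of 0.1 degrees is roughly 11 kilometers tall, so forcing a match this way also merges every other location within that cell.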

There is a more elegant solution to our problem. We don’t insist on exact matching of names and other strings in other resolution tasks. So, why should we make this mistake for locations?

# EPSG:3035 is a metric coordinate system for European locations
first_location = df_geo.geometry.to_crs('EPSG:3035')[0]
other_locations = df_geo.geometry.to_crs('EPSG:3035').iloc[1:]
# Computes distance between the first and the other two locations in meters:
print(other_locations.distance(first_location))

The output of the last cell tells us that the two distinct geocodes are 27.84 meters apart. Deciding if this is close enough to claim a match is up to you, task by task.
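
Such a decision can be written down as a distance threshold. The following sketch does not rely on geopandas. It swaps the projected coordinate system used above for the haversine great-circle formula on a spherical Earth, which is accurate enough at these short distances. The function names, the 50-meter default, and the coordinates are hypothetical and should be adapted to the task at hand.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def is_location_match(point_a, point_b, max_distance_m=50.0):
    """Declare a match if two (lat, lon) points are within max_distance_m."""
    return haversine_m(*point_a, *point_b) <= max_distance_m


# Hypothetical coordinates, not the geocodes from this lesson.
first = (51.24000, 6.79000)
second = (51.24025, 6.79000)
print(round(haversine_m(*first, *second), 1), "meters apart")
print("match at 50 m:", is_location_match(first, second))
print("match at 10 m:", is_location_match(first, second, max_distance_m=10.0))
```

A tight threshold suits dense city addresses, whereas a looser one may be needed for large sites such as factories or campuses.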

Does a geocoding service resolve all our location records?

Geocoding services work well on street addresses. However, no service covers all addresses, and not every location is a street address. Some services might also work for PO boxes, but many more exotic examples exist, like “reactor 2 of power plant XYZ.” No single service will resolve them all.

Having a second similarity function for locations is good if the geocoding service responds with low confidence. We can compare the strings that make up a location description with edit distances, just like we are used to doing for names. We don’t need to choose one approach over the other. We can combine evidence from both to reach a final match or no-match conclusion.
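
One way to combine both kinds of evidence is a weighted score, sketched below under several assumptions. The geocoding confidence decides how much the geographic distance counts, and the remaining weight goes to a string similarity. The weighting scheme, the function names, and the 50-meter scale are hypothetical. The standard library's SequenceMatcher ratio serves as a stand-in for an edit-distance-based similarity.

```python
from difflib import SequenceMatcher


def text_similarity(text_a, text_b):
    """Similarity in [0, 1] between two location descriptions, ignoring case."""
    return SequenceMatcher(None, text_a.lower().strip(), text_b.lower().strip()).ratio()


def combined_location_score(distance_m, confidence, text_a, text_b, max_distance_m=50.0):
    """Blend geographic and textual evidence, weighted by geocoding confidence."""
    # The geographic score falls linearly from 1 (same point) to 0 (max_distance_m or more).
    geo_score = max(0.0, 1.0 - distance_m / max_distance_m)
    # Clamp the confidence to [0, 1] and use it as the weight of the geographic score.
    weight = min(max(confidence, 0.0), 1.0)
    return weight * geo_score + (1.0 - weight) * text_similarity(text_a, text_b)


# Hypothetical inputs: a confident geocode pair and a low-confidence one.
confident = combined_location_score(5.0, 0.95, "Kruppstr. 4", "Kruppstr. 4")
doubtful = combined_location_score(5.0, 0.2, "Kruppstr. 4", "Hauptbahnhof")
print(round(confident, 3), round(doubtful, 3))
```

The final match or no-match decision then reduces to a threshold on the combined score, which is best tuned on labeled examples from the task.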

Key takeaway

We can use geocoding services to resolve location records. Providers have spent decades refining their services and building massive address databases. When choosing one, look beyond the quality of results and the costs. Check out one of the many providers using the OSM ecosystem under the hood. Their license allows us to store and distribute results, which is what we want to do in a preprocessing step of our entity resolution pipeline.