Location Resolution: Geocoding
Use geocoding services to preprocess location attributes in many entity resolution tasks.
Addresses are everywhere: home, work, bill-to, and ship-to addresses as part of people, companies, factories, distribution hubs, orders, invoices, and more. Therefore, matching addresses is part of many entity resolution tasks. Consider geocoding as a preprocessing step or partial resolver of entities with location attributes.
We have geocoded the given four search strings with geopandas and load the stored results below:
import json
import geopandas as gpd

with open('geocode-results.json', 'r') as file:
    response_data = json.load(file)

df_geo = gpd.GeoDataFrame.from_features(response_data, crs='EPSG:4326')
show_columns = ['geometry', 'formatted']
print(df_geo[show_columns])
Only the second result looks wrong, which is no surprise given the poor quality of that input. The df_geo['rank'] column, which consists of dictionaries, indicates this as well. We can expand it with the pandas function json_normalize below:
import pandas as pd

show_columns = ['confidence', 'match_type']
print(pd.json_normalize(df_geo['rank'])[show_columns])
We drop the second record for the rest of this lesson. The other three have the same street address and yet the last one has slightly different geocoordinates:
# Drop the false line:
df_geo = df_geo.loc[df_geo.index != 1]
# Count unique coordinates:
print(df_geo.geometry.value_counts())
It’s unfair to declare one right and the other wrong. Geocodes are point coordinates, whereas locations are areas—in other words, an infinite set of points. So, we should be cautious with exact matching on geocoordinates to resolve location records.
We can address this problem in many ways. First, we can use areas instead of points to represent locations.
from openlocationcode import openlocationcode
import h3

# Plus Code (Open Location Code) for each geocoded point:
df_geo['plus_code'] = df_geo.apply(
    lambda row: openlocationcode.encode(longitude=row['lon'], latitude=row['lat']),
    axis=1)

# H3 cell index at two resolutions.
# Note: geo_to_h3 is the h3-py v3 name; h3-py v4 renamed it to latlng_to_cell.
df_geo['h3_index_resolution_8'] = df_geo.apply(
    lambda row: h3.geo_to_h3(lng=row['lon'], lat=row['lat'], resolution=8),
    axis=1)
df_geo['h3_index_resolution_9'] = df_geo.apply(
    lambda row: h3.geo_to_h3(lng=row['lon'], lat=row['lat'], resolution=9),
    axis=1)

show_columns = ['plus_code', 'h3_index_resolution_8', 'h3_index_resolution_9']
print(df_geo[show_columns])
“Kruppstr. 4” is the address of a large office building, so we still end up with different Plus Codes and H3 indexes at resolution level 8 for this same location. We can change the resolution of each by dropping some of the trailing characters of a Plus Code or by explicitly setting the level in H3. However, a coarser resolution can quickly result in an area that covers more than we want.
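The trade-off between cell size and coverage can be illustrated without any geospatial library. The following is a minimal sketch that assumes a hypothetical square latitude/longitude grid, which is neither Plus Codes nor H3. The function `grid_cell`, the cell sizes, and the two coordinates are made up for illustration and are not the lesson's data. Two points less than 20 meters apart land in different fine cells because a cell edge runs between them, while a coarser grid puts them in the same cell at the price of a much larger area.

```python
import math

def grid_cell(lat, lon, cell_size_deg):
    """Return the (row, col) of the hypothetical square grid cell containing the point."""
    return (math.floor(lat / cell_size_deg), math.floor(lon / cell_size_deg))

# Two made-up points less than 20 m apart (assumption, not the lesson's data)
point_a = (51.45495, 7.0105)
point_b = (51.45505, 7.0107)

fine = 0.001   # cell height of roughly 111 m
coarse = 0.01  # cell height of roughly 1.1 km

# A cell edge separates the points on the fine grid ...
same_fine = grid_cell(*point_a, fine) == grid_cell(*point_b, fine)
# ... but both fall into one cell on the coarse grid.
same_coarse = grid_cell(*point_a, coarse) == grid_cell(*point_b, coarse)

print(same_fine, same_coarse)
```

No single cell size works for every pair of points: wherever a cell edge lies, some nearby points end up on opposite sides of it.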
There is a more elegant solution to our problem. We don’t insist on exact matching of names and other strings in other resolution tasks, so why should we make this mistake for locations? Instead, we can measure how far apart two geocodes are.
# EPSG:3035 is a metric coordinate system for European locations
first_location = df_geo.geometry.to_crs('EPSG:3035')[0]
other_locations = df_geo.geometry.to_crs('EPSG:3035').iloc[1:]
# Computes distance between the first and the other two locations in meters:
print(other_locations.distance(first_location))
The output of the last cell tells us that the two distinct geocodes are 27.84 meters apart. Whether this is close enough to claim a match is up to you and depends on the task.
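A distance only becomes a match decision once we pick a threshold. The sketch below shows one way to do that using the great-circle (haversine) distance on raw latitude/longitude pairs. This is a swapped-in approximation, not the projected distance used in this lesson. The function names, the 50-meter default threshold, and the sample coordinates are illustrative assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def is_location_match(point_a, point_b, max_distance_m=50.0):
    """Treat two (lat, lon) points as the same location if they are close enough.

    The 50 m default is an arbitrary assumption; tune it per task.
    """
    return haversine_m(*point_a, *point_b) <= max_distance_m

# Made-up coordinates for illustration only
print(is_location_match((51.45495, 7.0105), (51.45505, 7.0107)))
print(is_location_match((51.0, 7.0), (52.0, 7.0)))
```

A tight threshold suits dense city blocks, while rural addresses or large industrial sites may need a more generous one.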
Does a geocoding service resolve all our location records?
Geocoding services work well on street addresses. However, no service covers all addresses, and not every location is a street address. Some services also handle PO boxes, but many more exotic examples exist, like “reactor 2 of power plant XYZ.” There is no chance that one service will rule them all.
A second similarity function for locations helps when the geocoding service responds with low confidence. We can compare the strings that make up a location description with edit distances, just as we do for names. We don’t need to choose one approach over the other: we can combine evidence from both to reach a final match or no-match conclusion.
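One possible way to combine the two kinds of evidence is a confidence-weighted score. The sketch below rests on several assumptions: the service reports a confidence between 0 and 1, the standard library's `difflib.SequenceMatcher` stands in for a normalized edit distance, and the 100-meter cutoff is arbitrary. All function names are hypothetical.

```python
from difflib import SequenceMatcher

def string_similarity(a, b):
    """Similarity in [0, 1]; SequenceMatcher stands in for a normalized edit distance."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def distance_similarity(distance_m, max_distance_m=100.0):
    """Map meters to [0, 1]: 1 at zero distance, 0 at or beyond max_distance_m."""
    return max(0.0, 1.0 - distance_m / max_distance_m)

def combined_score(distance_m, geocode_confidence, text_a, text_b):
    """Weight the geocode evidence by the service's confidence (assumed in [0, 1])
    and the string evidence by the remainder."""
    geo = distance_similarity(distance_m)
    text = string_similarity(text_a, text_b)
    return geocode_confidence * geo + (1.0 - geocode_confidence) * text

# Illustrative call: the address strings and the confidence value are made up
score = combined_score(27.84, 0.9, "Kruppstr. 4", "Kruppstrasse 4")
print(round(score, 3))
```

When the service is confident, the distance dominates the score. When it is not, the string comparison takes over. A final match or no-match decision would still need a threshold on the score, chosen per task.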
Key takeaway
We can use geocoding services to resolve location records. Providers have spent decades refining their services and building massive address databases. Choosing a provider is not just about the quality of results and costs, though. Check out one of the many providers using the OpenStreetMap (OSM) ecosystem under the hood: their license allows us to store and distribute results, which is something we want to do as a preprocessing step in our entity resolution pipeline.