Nearest (spatial) join as a new feature to geopandas?
@jorisvandenbossche Sorry it took a while to come back to this. BUT. Would the following nearest-neighbor kind of spatial join be something useful to integrate into geopandas?
I include a working piece of code and a test case to demonstrate the idea.
- So here is the function:
import geopandas as gpd
import osmnx as ox
from shapely.geometry import LineString

def sjoin_nearest(left_df, right_df, op='intersects', search_dist=0.03, report_dist=False,
                  lsuffix='left', rsuffix='right'):
    """
    Perform a spatial join between two input layers.
    If a geometry in left_df falls outside (all) geometries in right_df, the data from the nearest Polygon will be used as the result.
    To make queries faster, the "search_dist" parameter (specified in map units) can be used to limit the search area for geometries around source points.
    If report_dist == True, the distance to the closest geometry will be reported in a column called `dist`. If geometries intersect, the distance will be 0.
    """
    # Explode possible MultiGeometries
    right_df = right_df.explode()
    right_df = right_df.reset_index(drop=True)
    if 'index_left' in left_df.columns:
        left_df = left_df.drop('index_left', axis=1)
    if 'index_right' in left_df.columns:
        left_df = left_df.drop('index_right', axis=1)
    if report_dist:
        if 'dist' in left_df.columns:
            raise ValueError("'dist' column exists in the left DataFrame. Remove it, or set 'report_dist' to False.")
    # Get geometries that intersect or do not intersect polygons
    mask = left_df.intersects(right_df.unary_union)
    geoms_intersecting_polygons = left_df.loc[mask]
    geoms_outside_polygons = left_df.loc[~mask]
    # Make spatial join between points that fall inside the Polygons
    if geoms_intersecting_polygons.shape[0] > 0:
        pip_join = gpd.sjoin(left_df=geoms_intersecting_polygons, right_df=right_df, op=op)
        if report_dist:
            pip_join['dist'] = 0
    else:
        pip_join = gpd.GeoDataFrame()
    # Get nearest geometries
    closest_geometries = gpd.GeoDataFrame()
    # A tiny snap distance buffer is needed in some cases
    snap_dist = 0.00000005
    # Closest points from source-points to polygons
    for idx, geom in geoms_outside_polygons.iterrows():
        # Get geometries within search distance
        candidates = right_df.loc[right_df.intersects(geom[left_df.geometry.name].buffer(search_dist))]
        if len(candidates) == 0:
            continue
        unary = candidates.unary_union
        if unary.geom_type == 'Polygon':
            # Get exterior of the Polygon
            exterior = unary.exterior
            # Find a point from Polygons that is closest to the source point
            closest_geom = exterior.interpolate(exterior.project(geom[left_df.geometry.name]))
            if report_dist:
                distance = closest_geom.distance(geom[left_df.geometry.name])
            # Select the Polygon
            closest_poly = right_df.loc[right_df.intersects(closest_geom.buffer(snap_dist))]
        elif unary.geom_type == 'MultiPolygon':
            # Keep track of distance for closest polygon
            distance = 9999999999
            closest_geom = None
            # Iterate over the member polygons (via .geoms, without shadowing the outer idx)
            for poly in unary.geoms:
                # Get exterior of the Polygon
                exterior = poly.exterior
                # Find a point from Polygons that is closest to the source point
                closest_candidate = exterior.interpolate(exterior.project(geom[left_df.geometry.name]))
                # Calculate distance between origin point and the closest point in Polygon
                dist = geom[left_df.geometry.name].distance(closest_candidate)
                # If the point is closer to given polygon update the info
                if dist < distance:
                    distance = dist
                    closest_geom = closest_candidate
            # Select the Polygon that was closest
            closest_poly = right_df.loc[right_df.intersects(closest_geom.buffer(snap_dist))]
        else:
            print("Incorrect input geometry type. Skipping ..")
            # Actually skip this row, so a stale or undefined closest_poly is never joined
            continue
        # Reset index
        geom = geom.to_frame().T.reset_index(drop=True)
        # Drop geometry from closest polygon
        closest_poly = closest_poly.drop(right_df.geometry.name, axis=1)
        closest_poly = closest_poly.reset_index(drop=True)
        # Join values
        join = geom.join(closest_poly, lsuffix='_%s' % lsuffix, rsuffix='_%s' % rsuffix)
        # Add information about distance to closest geometry if requested
        if report_dist:
            if 'dist' in join.columns:
                raise ValueError("'dist' column exists in the DataFrame. Remove it, or set 'report_dist' to False.")
            join['dist'] = distance
        closest_geometries = closest_geometries.append(join, ignore_index=True, sort=False)
    # Merge everything together
    result = pip_join.append(closest_geometries, ignore_index=True, sort=False)
    return result
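The core geometric step in the function above is finding the point on a polygon's exterior that is closest to a source point (the `exterior.project` / `exterior.interpolate` pair). As a rough, dependency-free illustration of what that step computes, here is a minimal sketch in plain Python. The helper names `closest_point_on_segment` and `closest_point_on_ring` are hypothetical and are not part of shapely or geopandas; the sketch only assumes a ring given as a list of (x, y) tuples.

```python
import math

def closest_point_on_segment(p, a, b):
    """Return the point on segment a-b that is closest to p (hypothetical helper)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        # Degenerate segment: a and b coincide
        return a
    # Parameter of the orthogonal projection of p, clamped to stay on the segment
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return (ax + t * dx, ay + t * dy)

def closest_point_on_ring(p, ring):
    """Return (closest_point, distance) from p to a closed ring of (x, y) vertices."""
    best_point, best_dist = None, math.inf
    # Pair each vertex with the next one; the last vertex pairs with the first
    for a, b in zip(ring, ring[1:] + ring[:1]):
        candidate = closest_point_on_segment(p, a, b)
        dist = math.dist(p, candidate)
        if dist < best_dist:
            best_point, best_dist = candidate, dist
    return best_point, best_dist

square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
point, dist = closest_point_on_ring((3.0, 1.0), square)
print(point, dist)
```

Repeating this per candidate polygon and keeping the smallest distance mirrors the MultiPolygon branch of the function, which is also why the loop over source geometries gets slow on large inputs.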
- And this is how it works:
# Get some data and prepare the data for demonstration
# -------------------------------------------------------------
place = "Kamppi, Helsinki"
polys = ox.footprints_from_place(place, footprint_type='building')
pois = ox.pois_from_place(place, amenities=['school'])
# Take a good sample for demonstration purposes (use mixture of points and polygons)
polys = polys.head(10).copy()
geom_mix = pois.head().copy()
geom_mix = geom_mix.append(pois.tail().copy())
# Add ids for the data for easier visual examination
polys['polyid'] = polys.reset_index().index
geom_mix['pointid'] = [a for a in 'abcdefghij']
# Plot the data and annotate ids
ax = polys.plot()
ax = geom_mix.plot(ax=ax, color='red', alpha=0.5)
polys.apply(lambda x: ax.annotate(s=x.polyid, xy=x.geometry.centroid.coords[0], ha='center'),axis=1);
geom_mix.apply(lambda x: ax.annotate(s=x.pointid, xy=x.geometry.centroid.coords[0], ha='center'),axis=1);
So in the example above, point 'e' should get information from the closest Polygon, number 2, and point 'd' should get information from Polygon 1, whereas the red polygon 'j' should get information from blue polygon 3 (most likely), and so on.
- We can conduct the spatial join in a similar manner as `sjoin`, but in this case the left_df will get information from the closest geometry in right_df if it does not intersect any geometry. With the `report_dist` parameter it is also possible to get the distance to the closest geometry (in map units). If the geometries intersect directly, the distance is 0.
# Let's test the nearest join and confirm if it works, let's also report the distance
nearest_join = sjoin_nearest(left_df=geom_mix, right_df=polys, report_dist=True)
# Let's confirm the join visually
lines = gpd.GeoDataFrame()
for idx, row in nearest_join.iterrows():
    a = polys.loc[polys['polyid']==row['polyid'], 'geometry'].centroid.values[0]
    b = geom_mix.loc[geom_mix['pointid']==row['pointid'], 'geometry'].centroid.values[0]
    line = LineString([a, b])
    lines = lines.append({'geometry': line}, ignore_index=True)
# Plot the data and annotate ids
ax = polys.plot()
ax = geom_mix.plot(ax=ax, color='red', alpha=0.5)
ax = lines.plot(ax=ax, color='orange', alpha=0.5)
polys.apply(lambda x: ax.annotate(s=x.polyid, xy=x.geometry.centroid.coords[0], ha='center'),axis=1);
geom_mix.apply(lambda x: ax.annotate(s=x.pointid, xy=x.geometry.centroid.coords[0], ha='center'),axis=1);
# Check values - Everything from point df should be present (with distance as well in this case)
nearest_join[['polyid', 'pointid', 'dist']].head(10)
   polyid pointid      dist
0       0       a  0.008897
1       0       b  0.009375
2       0       c  0.002689
3       1       d  0.000610
4       2       e  0.000117
5       0       f  0.003870
6       0       g  0.010740
7       0       h  0.010204
8       0       i  0.009624
9       3       j  0.003662
So as we can see, the join works correctly, at least when joining information between Points and Polygons (I haven't tested it extensively yet, so there might still be issues).
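To make the intended join semantics easier to check without geopandas, here is a small, hedged sketch in plain Python. It is not the proposed implementation: polygons are simplified to axis-aligned boxes `(xmin, ymin, xmax, ymax)`, and the names `distance_to_box` and `nearest_box_join` are made up for this illustration. It shows the three rules described above: a distance of 0 when the geometries intersect, the nearest geometry otherwise, and no output row when nothing lies within `search_dist`.

```python
import math

def distance_to_box(point, box):
    """Distance from a point to an axis-aligned box; 0.0 if the point is inside or on the edge."""
    x, y = point
    xmin, ymin, xmax, ymax = box
    dx = max(xmin - x, 0.0, x - xmax)
    dy = max(ymin - y, 0.0, y - ymax)
    return math.hypot(dx, dy)

def nearest_box_join(points, boxes, search_dist=math.inf):
    """For each named point, report the id of the nearest box and the distance to it.

    Points with no box within search_dist produce no row, which matches the
    'continue' taken by the function above when there are no candidates.
    """
    rows = []
    for point_id, point in points.items():
        best_id, best_dist = None, math.inf
        for box_id, box in boxes.items():
            dist = distance_to_box(point, box)
            if dist <= search_dist and dist < best_dist:
                best_id, best_dist = box_id, dist
        if best_id is not None:
            rows.append({'pointid': point_id, 'polyid': best_id, 'dist': best_dist})
    return rows

boxes = {0: (0.0, 0.0, 1.0, 1.0), 1: (3.0, 0.0, 4.0, 1.0)}
points = {'a': (0.5, 0.5), 'b': (2.5, 0.5), 'c': (10.0, 10.0)}
rows = nearest_box_join(points, boxes, search_dist=2.0)
for row in rows:
    print(row)
```

Point 'a' lies inside box 0 and gets a distance of 0, point 'b' is matched to box 1 at distance 0.5, and point 'c' is dropped because no box is within the search distance. Whether unmatched rows should be dropped or kept with missing values is a design choice worth discussing for the real feature.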
Any thoughts about this? The code is now a first iteration so all comments, concerns, ideas would be highly useful. 🙂 If you think this could be useful, I would be happy to work on it.
Cheers, Henrikki
Top GitHub Comments
@shandou we are waiting for support for `nearest` in `pygeos.STRtree`, which is on the way: https://github.com/pygeos/pygeos/pull/272. My assumption is that GeoPandas 0.10 will have this feature; the upcoming 0.9 will not.
This is rad! If this is something we want to support, we have a performant example with points to points that can be used if the inputs are of the right type?
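For readers wondering why an index helps in the points-to-points case, here is a minimal sketch in plain Python. It is neither the STRtree approach referenced above nor the example the commenter mentions; it swaps in a much simpler technique (sort by x once, then prune candidates whose x-gap alone already exceeds the best distance found). The function name `nearest_points` is hypothetical.

```python
import bisect
import math

def nearest_points(left, right):
    """For each (x, y) in left, return (index_in_right, distance) of its nearest point in right."""
    # Sort the right-hand points by x once, remembering their original positions
    order = sorted(range(len(right)), key=lambda i: right[i][0])
    xs = [right[i][0] for i in order]
    results = []
    for px, py in left:
        best_idx, best_dist = None, math.inf
        start = bisect.bisect_left(xs, px)
        # Scan rightwards; stop once the x-gap alone cannot beat the best distance
        for k in range(start, len(xs)):
            if xs[k] - px >= best_dist:
                break
            dist = math.hypot(xs[k] - px, right[order[k]][1] - py)
            if dist < best_dist:
                best_idx, best_dist = order[k], dist
        # Scan leftwards under the same pruning rule
        for k in range(start - 1, -1, -1):
            if px - xs[k] >= best_dist:
                break
            dist = math.hypot(xs[k] - px, right[order[k]][1] - py)
            if dist < best_dist:
                best_idx, best_dist = order[k], dist
        results.append((best_idx, best_dist))
    return results

right = [(0.0, 0.0), (5.0, 5.0), (2.0, 1.0), (9.0, 0.0)]
left = [(2.0, 2.0), (8.0, 1.0)]
results = nearest_points(left, right)
print(results)
```

The pruning is exact because any point whose x-gap is already at least the best distance cannot be strictly closer. A tree-based index applies the same idea in two dimensions and generalizes it to non-point geometries.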