Nearest (spatial) join as a new feature to geopandas?

See original GitHub issue

@jorisvandenbossche Sorry it took a while to come back to this. Would the following nearest-neighbor kind of spatial join be useful to integrate into geopandas?

I include a working piece of code and a test case to demonstrate the idea.

  • So here is the function:
import geopandas as gpd
import osmnx as ox
from shapely.geometry import LineString

def sjoin_nearest(left_df, right_df, op='intersects', search_dist=0.03, report_dist=False,
                  lsuffix='left', rsuffix='right'):
    """
    Perform a spatial join between two input layers.

    If a geometry in left_df does not intersect any geometry in right_df, the attributes of the nearest Polygon are used instead.
    To make queries faster, the `search_dist` parameter (specified in map units) limits the search area around each source geometry; geometries with no candidates within that distance are left out of the result.
    If report_dist is True, the distance to the closest geometry is reported in a column called `dist`. If the geometries intersect, the distance is 0.
    """

    # Explode possible MultiGeometries
    right_df = right_df.explode()
    right_df = right_df.reset_index(drop=True)

    if 'index_left' in left_df.columns:
        left_df = left_df.drop('index_left', axis=1)

    if 'index_right' in left_df.columns:
        left_df = left_df.drop('index_right', axis=1)

    if report_dist:
        if 'dist' in left_df.columns:
            raise ValueError("'dist' column exists in the left DataFrame. Remove it, or set 'report_dist' to False.")

    # Get geometries that intersect or do not intersect polygons
    mask = left_df.intersects(right_df.unary_union)
    geoms_intersecting_polygons = left_df.loc[mask]
    geoms_outside_polygons = left_df.loc[~mask]

    # Make spatial join between points that fall inside the Polygons
    if geoms_intersecting_polygons.shape[0] > 0:
        pip_join = gpd.sjoin(left_df=geoms_intersecting_polygons, right_df=right_df, op=op)

        if report_dist:
            pip_join['dist'] = 0

    else:
        pip_join = gpd.GeoDataFrame()

    # Get nearest geometries
    closest_geometries = gpd.GeoDataFrame()

    # A tiny snap distance buffer is needed in some cases
    snap_dist = 0.00000005

    # Closest points from source-points to polygons
    for idx, geom in geoms_outside_polygons.iterrows():
        # Get geometries within search distance
        candidates = right_df.loc[right_df.intersects(geom[left_df.geometry.name].buffer(search_dist))]

        if len(candidates) == 0:
            continue
        unary = candidates.unary_union

        if unary.geom_type == 'Polygon':

            # Get exterior of the Polygon
            exterior = unary.exterior

            # Find a point from Polygons that is closest to the source point
            closest_geom = exterior.interpolate(exterior.project(geom[left_df.geometry.name]))

            if report_dist:
                distance = closest_geom.distance(geom[left_df.geometry.name])

            # Select the Polygon
            closest_poly = right_df.loc[right_df.intersects(closest_geom.buffer(snap_dist))]

        elif unary.geom_type == 'MultiPolygon':
            # Keep track of distance for closest polygon
            distance = 9999999999
            closest_geom = None

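            # Iterate over the member polygons (.geoms also works on Shapely 2.x,
            # where a MultiPolygon is no longer directly iterable)
            for poly in unary.geoms: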
                # Get exterior of the Polygon
                exterior = poly.exterior

                # Find a point from Polygons that is closest to the source point
                closest_candidate = exterior.interpolate(exterior.project(geom[left_df.geometry.name]))

                # Calculate distance between origin point and the closest point in Polygon
                dist = geom[left_df.geometry.name].distance(closest_candidate)

                # If the point is closer to given polygon update the info
                if dist < distance:
                    distance = dist
                    closest_geom = closest_candidate

            # Select the Polygon that was closest
            closest_poly = right_df.loc[right_df.intersects(closest_geom.buffer(snap_dist))]
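        else:
            # Only Polygon and MultiPolygon candidates are handled; skip this row,
            # otherwise closest_poly below would be undefined or stale
            print("Incorrect input geometry type. Skipping ..")
            continue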

        # Reset index
        geom = geom.to_frame().T.reset_index(drop=True)

        # Drop geometry from closest polygon
        closest_poly = closest_poly.drop(right_df.geometry.name, axis=1)
        closest_poly = closest_poly.reset_index(drop=True)

        # Join values
        join = geom.join(closest_poly, lsuffix='_%s' % lsuffix, rsuffix='_%s' % rsuffix)

        # Add information about distance to closest geometry if requested
        if report_dist:
            if 'dist' in join.columns:
                raise ValueError("'dist' column exists in the DataFrame. Remove it, or set 'report_dist' to False.")
            join['dist'] = distance

        closest_geometries = closest_geometries.append(join, ignore_index=True, sort=False)

    # Merge everything together
    result = pip_join.append(closest_geometries, ignore_index=True, sort=False)
    return result
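
The core of the function above is the `exterior.interpolate(exterior.project(point))` step, which finds the spot on a polygon's boundary that is closest to the source point. As a rough illustration of what that step computes, here is a minimal pure-Python sketch. The helper names (`nearest_point_on_segment`, `nearest_point_on_ring`, `nearest_polygon`) and the toy squares are made up for this example and are not part of shapely or geopandas. Like the boundary projection in the function, it measures distance to the ring only, so geometries inside a polygon would still need the separate intersects check.

```python
import math


def nearest_point_on_segment(p, a, b):
    """Return the point on segment a-b that is closest to p (all (x, y) tuples)."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        # Degenerate segment: both endpoints coincide
        return (float(ax), float(ay))
    # Parameter of the orthogonal projection, clamped to stay on the segment
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return (ax + t * dx, ay + t * dy)


def nearest_point_on_ring(p, ring):
    """Return (closest_point, distance) from p to a closed ring.

    `ring` is a list of vertices without the repeated closing vertex.
    """
    best_point, best_dist = None, math.inf
    # Pair each vertex with the next one, wrapping around to close the ring
    for a, b in zip(ring, ring[1:] + ring[:1]):
        candidate = nearest_point_on_segment(p, a, b)
        dist = math.dist(p, candidate)
        if dist < best_dist:
            best_point, best_dist = candidate, dist
    return best_point, best_dist


def nearest_polygon(p, polygons):
    """Return (polygon_id, distance) for the ring in `polygons` closest to p."""
    distances = ((pid, nearest_point_on_ring(p, ring)[1]) for pid, ring in polygons.items())
    return min(distances, key=lambda item: item[1])


# Hypothetical toy data: two 10 x 10 squares, 90 units apart
squares = {
    0: [(0, 0), (10, 0), (10, 10), (0, 10)],
    1: [(100, 0), (110, 0), (110, 10), (100, 10)],
}
print(nearest_polygon((13, 5), squares))  # (0, 3.0)
```

The brute-force loop over every ring is what the `search_dist` buffer in the function tries to avoid: pre-filtering candidates keeps the number of projections per source geometry small.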
  • And this is how it works:
# Get some data and prepare the data for demonstration
# -------------------------------------------------------------
place = "Kamppi, Helsinki"
polys = ox.footprints_from_place(place, footprint_type='building')
pois = ox.pois_from_place(place, amenities=['school'])

# Take a good sample for demonstration purposes (use mixture of points and polygons)
polys = polys.head(10).copy()
geom_mix = pois.head().copy()
geom_mix = geom_mix.append(pois.tail().copy())

# Add ids for the data for easier visual examination
polys['polyid'] = polys.reset_index().index
geom_mix['pointid'] = [a for a in 'abcdefghij']

# Plot the data and annotate ids
ax = polys.plot()
ax = geom_mix.plot(ax=ax, color='red', alpha=0.5)
polys.apply(lambda x: ax.annotate(s=x.polyid, xy=x.geometry.centroid.coords[0], ha='center'),axis=1);
geom_mix.apply(lambda x: ax.annotate(s=x.pointid, xy=x.geometry.centroid.coords[0], ha='center'),axis=1);

[Figure: before_join]

So in the example above, point 'e' should get information from the closest Polygon, number 2, and point 'd' should get information from Polygon 1, whereas red polygon 'j' should (most likely) get information from blue polygon 3, and so on.

  • We can conduct the spatial join in a similar manner to sjoin, but here each geometry in left_df that does not intersect anything in right_df gets its information from the closest geometry in right_df. With the report_dist parameter it is also possible to get the distance to the closest geometry (in map units). If the geometries intersect directly, the distance is 0.
# Let's test the nearest join and confirm if it works, let's also report the distance
nearest_join = sjoin_nearest(left_df=geom_mix, right_df=polys, report_dist=True)

# Let's confirm the join visually
lines = gpd.GeoDataFrame()
for idx, row in nearest_join.iterrows():
    a = polys.loc[polys['polyid']==row['polyid'], 'geometry'].centroid.values[0]
    b = geom_mix.loc[geom_mix['pointid']==row['pointid'], 'geometry'].centroid.values[0]
    line = LineString([a, b])
    lines = lines.append({'geometry': line}, ignore_index=True)
    
# Plot the data and annotate ids
ax = polys.plot()
ax = geom_mix.plot(ax=ax, color='red', alpha=0.5)
ax = lines.plot(ax=ax, color='orange', alpha=0.5)
polys.apply(lambda x: ax.annotate(s=x.polyid, xy=x.geometry.centroid.coords[0], ha='center'),axis=1);
geom_mix.apply(lambda x: ax.annotate(s=x.pointid, xy=x.geometry.centroid.coords[0], ha='center'),axis=1);

[Figure: correct_join]

# Check values - Everything from point df should be present (with distance as well in this case)
nearest_join[['polyid', 'pointid', 'dist']].head(10)
   polyid pointid      dist
0       0       a  0.008897
1       0       b  0.009375
2       0       c  0.002689
3       1       d  0.000610
4       2       e  0.000117
5       0       f  0.003870
6       0       g  0.010740
7       0       h  0.010204
8       0       i  0.009624
9       3       j  0.003662

So as we can see, the join works correctly, at least when joining information between Points and Polygons (I haven't tested it extensively yet, so there might be issues).

Any thoughts about this? The code is a first iteration, so all comments, concerns, and ideas would be highly useful. 🙂 If you think this could be useful, I would be happy to work on it.

Cheers, Henrikki

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 26
  • Comments: 17 (12 by maintainers)

Top GitHub Comments

6 reactions
martinfleis commented, Jan 5, 2021

@shandou we are waiting for the support of nearest in pygeos.STRtree, which is on the way: https://github.com/pygeos/pygeos/pull/272. My assumption is that GeoPandas 0.10 will have this feature; the upcoming 0.9 will not.
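
For readers on GeoPandas 0.10 or later, where `geopandas.sjoin_nearest` is available, a minimal sketch of the equivalent join might look like the following. The toy squares and points are invented for illustration and use plain planar coordinates with no CRS. Here `max_distance` and `distance_col` play roughly the roles of `search_dist` and `report_dist` in the function from the issue; this is an illustration, not a drop-in replacement for it.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Hypothetical toy data: two 10 x 10 squares and three points
polys = gpd.GeoDataFrame(
    {"polyid": [0, 1]},
    geometry=[
        Polygon([(0, 0), (10, 0), (10, 10), (0, 10)]),
        Polygon([(100, 0), (110, 0), (110, 10), (100, 10)]),
    ],
)
points = gpd.GeoDataFrame(
    {"pointid": ["a", "b", "c"]},
    geometry=[Point(5, 5), Point(13, 5), Point(96, 5)],
)

# Join each point to its nearest polygon within 50 units and
# store the distance in a column named "dist"
nearest = gpd.sjoin_nearest(points, polys, max_distance=50, distance_col="dist")
print(nearest[["pointid", "polyid", "dist"]])
```

Point 'a' lies inside the first square, so its distance is 0, while 'b' and 'c' pick up the attributes of the square they are closest to. Distances are in the units of the coordinates, so a projected CRS is the safer choice for real data.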

2 reactions
ljwolf commented, Aug 12, 2019

This is rad! If this is something we want to support, we have a performant points-to-points example that can be used if the inputs are of the right type?
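
The example mentioned in this comment is not shown in the thread. As a stand-in, here is one common way to do a fast points-to-points nearest lookup, sketched with SciPy's `cKDTree`. The function name `nearest_point_join`, its `max_dist` parameter, and the coordinate arrays are assumptions made for this illustration, not code from the commenter or from geopandas.

```python
import numpy as np
from scipy.spatial import cKDTree


def nearest_point_join(left_xy, right_xy, max_dist=np.inf):
    """For each left point, return (index of nearest right point, distance).

    Left points with no right point within max_dist get index -1 and
    an infinite distance.
    """
    tree = cKDTree(right_xy)
    # query() reports missing neighbours with an infinite distance
    dist, idx = tree.query(left_xy, k=1, distance_upper_bound=max_dist)
    idx = np.where(np.isfinite(dist), idx, -1)
    return idx, dist


# Hypothetical coordinates, e.g. taken from gdf.geometry.x and gdf.geometry.y
left_xy = np.array([[0.0, 0.0], [5.0, 5.0], [9.0, 1.0]])
right_xy = np.array([[1.0, 0.0], [5.0, 6.0], [10.0, 10.0]])

idx, dist = nearest_point_join(left_xy, right_xy)
print(idx)   # index into right_xy for each left point
print(dist)  # distance to that point, in coordinate units
```

The returned indices can then be used to pull attribute rows from the right-hand table, which is the points-only analogue of the join proposed in the issue.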


