Vectorized distance produce skewed result

See original GitHub issue

GeoPandas gives confusing results when calculating the distance between two objects. geo_points and gdf_harbours are both GeoDataFrames with a few thousand rows or more.

geo_points.distance(gdf_harbours)
 0        138.419503
 1        138.464243
 2        138.425727
 ...
 65496           NaN
 65497           NaN
 65498           NaN
 Length: 65499, dtype: float64

while

gdf_harbours.distance(geo_points.loc[0]).min()
Out[47]: 7.344255335164139

and

gdf_harbours.distance(geo_points.loc[65498]).min()
Out[48]: 0.00654932231511796

I was unable to reconstruct this result using binary_vector_float as

gpd.vectorized.binary_vector_float('distance', geo_points.geometry._geometry_array.data, gdf_harbours.geometry._geometry_array.data)

kills the notebook’s kernel immediately. My versions are:

geopandas                 1.0.0.dev                py36_1    conda-forge/label/dev
geos                      3.6.2                h5470d99_2
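
One possible reading of the numbers above (an assumption, not something confirmed in this excerpt) is that the two-frame `distance` call pairs row i of one frame with row i of the other, while the `.min()` calls search every harbour for the closest one. The sketch below uses plain NumPy arrays with made-up coordinates, not the GeoPandas API, just to show how different those two computations are.

```python
import numpy as np

# Hypothetical coordinates, made up for illustration only.
points = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
harbours = np.array([[0.0, 3.0], [4.0, 1.0]])

# Row-by-row pairing: point i is only ever compared with harbour i.
# Points beyond the shorter array have no partner at all.
n = min(len(points), len(harbours))
elementwise = np.linalg.norm(points[:n] - harbours[:n], axis=1)

# All-pairs distances, shape (n_points, n_harbours), then the minimum per point.
diff = points[:, None, :] - harbours[None, :, :]
pairwise = np.linalg.norm(diff, axis=2)
nearest = pairwise.min(axis=1)

print("row-by-row:", elementwise)
print("nearest harbour:", nearest)
```

In this toy data the second point is 3.0 away from "its" harbour but only about 2.24 away from the nearest one, and the third point has no row partner, which is the kind of gap that would show up as NaN in an index-aligned result.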

Issue Analytics

  • State: open
  • Created 6 years ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

jhconning commented, Oct 30, 2018

SciPy’s cKDTree can be used to get a nearest-neighbor(s) solution that operates on geopandas dataframes and is effectively vectorized. It is orders of magnitude faster than the brute-force method @jorisvandenbossche suggested (find all pairwise distances and then take the minimum) and even than the RTree spatial-index nearest method that @avnovikov suggested (which is fast for a single-point lookup but requires looping over rows if we want the nearest neighbor of every point in a geodataframe).

The code below illustrates how to use the cKDTree query method to write a function that operates on two dataframes, finding for each point in dataframe gdA (e.g. ships) the distance to its nearest neighbor in the target gdB (e.g. harbors), as well as requested column information about that neighbor (e.g. ‘harbor_id’). The function returns a two-column dataframe.

I’m not a developer, but given the tremendous speedup and usefulness of the cKDTree methods, could functionality like this be built into a future geopandas release?

import numpy as np
import geopandas as gpd
import pandas as pd
from scipy.spatial import cKDTree
from shapely.geometry import Point
import random

# Build example `ships` and `harbors` GeoDataFrames (small, just to illustrate)
N = 1000
nharbors = np.random.uniform(0., 5000., (N, 2))  
nships = np.random.uniform(0., 5000., (N, 2))  

df = pd.DataFrame(nharbors,columns=['lat', 'lon'])
df['geometry'] = list(zip(df.lat, df.lon))
df['geometry'] = df['geometry'].apply(Point)
df['harbor_id'] = random.sample(range(N), N)
harbors = gpd.GeoDataFrame(df, geometry='geometry')

df = pd.DataFrame(nships,columns=['lat', 'lon'])
df['geometry'] = list(zip(df.lat, df.lon))
df['geometry'] = df['geometry'].apply(Point)
df['ship_id'] = random.sample(range(N), N)
ships = gpd.GeoDataFrame(df, geometry='geometry')

Now the helper function

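def ckdnearest(gdA, gdB, bcol):
    # Collect point coordinates into (n, 2) arrays for the k-d tree
    nA = np.array(list(zip(gdA.geometry.x, gdA.geometry.y)))
    nB = np.array(list(zip(gdB.geometry.x, gdB.geometry.y)))
    btree = cKDTree(nB)
    # For each point in gdA: distance to, and row position of, its nearest point in gdB
    dist, idx = btree.query(nA, k=1)
    # idx holds row positions, so use iloc (loc only matches with a default RangeIndex)
    df = pd.DataFrame.from_dict({'distance': dist,
                'bcol' : gdB[bcol].iloc[idx].values })
    return df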

Let’s test it by searching for the nearest harbor (of N=1000 harbors) to each of N=1000 ships. The function returns a two-column dataframe with the distance and the value of the ‘harbor_id’ column.

In  []: ckdnearest(ships, harbors, 'harbor_id').head(2)
Out []:
     distance  bcol
0  130.539793   491
1  102.736394   932
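
A quick way to gain confidence in results like these is to compare the tree query against an exhaustive search on a small sample. The sketch below does that on plain coordinate arrays; the array names, sizes and seed are assumptions chosen for illustration, not part of the original comment.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical sample data: 200 "ships" and 300 "harbors" in the same planar units.
rng = np.random.default_rng(0)
ships_xy = rng.uniform(0.0, 5000.0, (200, 2))
harbors_xy = rng.uniform(0.0, 5000.0, (300, 2))

# Tree-based lookup: one nearest harbor (k=1) per ship.
tree = cKDTree(harbors_xy)
tree_dist, tree_idx = tree.query(ships_xy, k=1)

# Exhaustive check: full (200, 300) distance matrix, then the minimum per row.
pairwise = np.linalg.norm(ships_xy[:, None, :] - harbors_xy[None, :, :], axis=2)
brute_idx = pairwise.argmin(axis=1)
brute_dist = pairwise.min(axis=1)

# Both approaches should agree on the distance and on which harbor is nearest.
assert np.allclose(tree_dist, brute_dist)
assert np.array_equal(tree_idx, brute_idx)
```

The exhaustive matrix grows with the product of the two sizes, so this check only makes sense on a sample; the tree query is the part meant to scale.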

And time it:

In  []: %%timeit
ckdnearest(ships, harbors, 'harbor_id')
43.6 ms ± 714 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Compare this to the brute-force method (slightly adapting @jorisvandenbossche’s code to also return a two-column dataframe):

def dist_to_nearest(point, gdf):
    howfar = gdf.geometry.distance(point)
    return  howfar.min(), howfar.idxmin()

In  []: %%timeit
mdistances = ships.geometry.apply(lambda p: dist_to_nearest(p, harbors))
mdistances.apply(pd.Series)

5.3 s ± 91.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The cKDTree method is efficient: a single nearest-neighbor query on a k-d tree takes O(log N) time on average in low dimensions. When I increase the dataframe sizes to N=100,000 rows (so the potential pairwise distance comparisons rise to 10 billion), the cKDTree method still finds all 100,000 nearest neighbors in about 4.6 s.
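
To put the 10 billion figure in perspective, here is a rough back-of-the-envelope comparison. It assumes an idealized balanced tree where each query visits about log2(N) levels; real query cost depends on the data and the implementation, so treat the ratio as an order-of-magnitude estimate only.

```python
import math

n = 100_000  # rows in each dataframe, as in the comment above

# Brute force compares every ship with every harbor.
pairwise_comparisons = n * n

# Idealized tree: about log2(n) levels visited per query (an assumption).
tree_depth = math.ceil(math.log2(n))
idealized_tree_steps = n * tree_depth

ratio = pairwise_comparisons / idealized_tree_steps
print(f"pairwise comparisons: {pairwise_comparisons:,}")
print(f"idealized tree steps: {idealized_tree_steps:,}")
print(f"ratio: roughly {ratio:,.0f}x")
```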

martinfleis commented, Jun 4, 2019

Hi @caiodu, what exactly are you trying to do? For clarity, I would recommend either opening a new issue (if it is, or might be, an issue) or posting your question on https://stackoverflow.com/questions/tagged/geopandas (if you need help with a workflow). That is better for future reference.
