Vectorized distance produces skewed results
GeoPandas got completely confused when calculating the distance between two objects. geo_points and gdf_harbours are GeoDataFrames with a few thousand rows.
geo_points.distance(gdf_harbours)
0 138.419503
1 138.464243
2 138.425727
...
65496 NaN
65497 NaN
65498 NaN
Length: 65499, dtype: float64
while
gdf_harbours.distance(geo_points.loc[0]).min()
Out[47]: 7.344255335164139
and
gdf_harbours.distance(geo_points.loc[65498]).min()
Out[48]: 0.00654932231511796
I was unable to reproduce this result using binary_vector_float directly, as
gpd.vectorized.binary_vector_float('distance', geo_points.geometry._geometry_array.data, gdf_harbours.geometry._geometry_array.data)
kills the notebook's kernel immediately. My versions are:
geopandas 1.0.0.dev py36_1 conda-forge/label/dev
geos 3.6.2 h5470d99_2
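For context, in current GeoPandas releases `distance` called with another GeoSeries works element-wise on index-aligned rows rather than searching for the nearest geometry, which is one plausible reading of the output above (unrelated row pairs at the top, NaN where one side has no partner). The sketch below uses hypothetical toy data, not the reporter's, and the `align` keyword only exists in newer releases, so it may not apply to the dev build listed here.

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical toy data in planar units: three points, two harbours.
points = gpd.GeoSeries([Point(0, 0), Point(10, 0), Point(20, 0)])
harbours = gpd.GeoSeries([Point(0, 3), Point(19, 0)])

# Element-wise: row i of `points` is paired with row i of `harbours`.
# Row 2 has no partner after index alignment, so its distance is NaN.
rowwise = points.distance(harbours, align=True)

# Nearest-harbour distance: compare each point against every harbour
# and keep the minimum (the slow, brute-force way).
nearest = points.apply(lambda p: harbours.distance(p).min())

print(rowwise)
print(nearest)
```

The first result pairs rows by position in the index, while the second matches what `gdf_harbours.distance(geo_points.loc[0]).min()` computes for one point at a time.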
Issue Analytics
- State:
- Created 6 years ago
- Comments:9 (5 by maintainers)

Scipy’s cKDTree can be used to get a nearest neighbor(s) solution that operates on geopandas dataframes and is effectively vectorized. It is orders of magnitude faster than the brute force method @jorisvandenbossche suggested (find all pairwise distances and then take the minimum) and even the RTree spatial index nearest method that @avnovikov suggested (which is fast for a single-point lookup but requires looping over rows if we want nearest neighbors for all points in a geodataframe).
The code below illustrates how to use the cKDTree query method to write a function that operates on two dataframes: for each point in dataframe gdfA (e.g. ships) it finds the distance to its nearest neighbor in the target gdfB (e.g. harbors), along with requested column information about that neighbor (e.g. ‘harbor_id’). The function returns a two-column dataframe.
I’m not a developer, but given the tremendous speed-up and usefulness of the cKDTree methods, could functionality like this be built into a future geopandas release?
Now the helper function:
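A minimal sketch of such a helper, not the commenter's original code: the name `ckd_nearest`, its parameters, and the toy ships and harbours are hypothetical. It assumes both frames hold Point geometries in the same planar coordinate system, so that Euclidean distance on x/y is meaningful.

```python
import geopandas as gpd
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree
from shapely.geometry import Point


def ckd_nearest(gdf_a, gdf_b, b_col):
    """For each point in gdf_a, return the distance to its nearest point
    in gdf_b and the value of column `b_col` for that neighbour."""
    # Stack coordinates into (N, 2) arrays; assumes Point geometries only.
    coords_a = np.column_stack([gdf_a.geometry.x, gdf_a.geometry.y])
    coords_b = np.column_stack([gdf_b.geometry.x, gdf_b.geometry.y])
    # Build the tree once on the target points, then query all of gdf_a at once.
    tree = cKDTree(coords_b)
    dist, idx = tree.query(coords_a, k=1)
    return pd.DataFrame(
        {"distance": dist, b_col: gdf_b[b_col].to_numpy()[idx]},
        index=gdf_a.index,
    )


# Hypothetical toy data: three ships, two harbours.
ships = gpd.GeoDataFrame(
    {"ship_id": ["s0", "s1", "s2"]},
    geometry=[Point(0, 0), Point(10, 0), Point(20, 0)],
)
harbours = gpd.GeoDataFrame(
    {"harbor_id": ["h0", "h1"]},
    geometry=[Point(0, 3), Point(19, 0)],
)

result = ckd_nearest(ships, harbours, "harbor_id")
print(result)
```

Because `query` takes the whole coordinate array in one call, no Python-level loop over rows is needed; positions returned by the tree are mapped back to `b_col` values with plain NumPy indexing.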
Let’s test it by searching for the nearest harbor (of N=1000 harbors) to each of N=1000 ships. The function returns a two-column dataframe with the distance and the value of the ‘harbor_id’ column:
And time it:
Compare this to using the brute force method (slightly adapting @jorisvandenbossche’s code to also return a two-column dataframe):
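For comparison, a brute-force version could look roughly like the sketch below. It is not @jorisvandenbossche’s code; the name `brute_nearest` and the toy data are hypothetical. It computes the distance from each point to every target row and keeps the minimum, so the work grows with the product of the two frame sizes.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point


def brute_nearest(gdf_a, gdf_b, b_col):
    """Brute force: for each geometry in gdf_a, measure the distance to
    every row of gdf_b and keep the closest one."""
    rows = []
    for geom in gdf_a.geometry:
        d = gdf_b.distance(geom)   # one distance per row of gdf_b
        nearest = d.idxmin()       # index label of the closest row
        rows.append((d.loc[nearest], gdf_b.loc[nearest, b_col]))
    return pd.DataFrame(rows, columns=["distance", b_col], index=gdf_a.index)


# Hypothetical toy data: three ships, two harbours.
ships = gpd.GeoDataFrame(
    {"ship_id": ["s0", "s1", "s2"]},
    geometry=[Point(0, 0), Point(10, 0), Point(20, 0)],
)
harbours = gpd.GeoDataFrame(
    {"harbor_id": ["h0", "h1"]},
    geometry=[Point(0, 3), Point(19, 0)],
)

brute_result = brute_nearest(ships, harbours, "harbor_id")
print(brute_result)
```

Unlike a coordinate-based tree lookup, this version works for any geometry type, since it relies on the geometric `distance` operation rather than on point coordinates.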
The cKDTree method is efficient (I think I read somewhere that a lookup is O(log N)). When I increase the dataframe sizes to N=100,000 rows (so the potential pairwise distance comparisons rise to 10 billion), the cKDTree method can still find 100,000 nearest neighbor points in about 4.6 s.
Hi @caiodu, what exactly are you trying to do? For clarity, I would recommend either opening a new issue (if it is, or might be, an issue) or posting your question on https://stackoverflow.com/questions/tagged/geopandas (if you need help with a workflow). That is better for future reference.