
NearestNeighbors radius_neighbors memory leaking


Description

NearestNeighbors uses a large chunk of memory at run time and does not release it, even after calling del or assigning an empty array to the object variable. The memory is, however, released once the Python process terminates.

I noticed the bug while trying to fit a DBSCAN model; looking deeper into the issue, I was able to reproduce the same memory leak on random data using the NearestNeighbors.radius_neighbors() method.

On a machine with 12 GB of RAM, about 10% of total memory remained leaked.

Steps/Code to Reproduce

from sklearn.neighbors import NearestNeighbors
import numpy as np

X = np.random.rand(20000, 2)  # random data

neighbors_model = NearestNeighbors()
neighbors_model.fit(X)

neighborhoods = neighbors_model.radius_neighbors(X, 0.5, return_distance=False)

del neighborhoods
del neighbors_model
del X

# Keep the process alive so its memory usage can be inspected (e.g. with top).
while True:
	pass
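To quantify what the repro above shows, here is a hedged sketch of the same steps instrumented with psutil (assumes psutil is installed; it uses a smaller dataset than the original 20000-point report so it runs quickly, and the exact figures will vary by machine):

```python
import numpy as np
import psutil
from sklearn.neighbors import NearestNeighbors

def rss_mb():
    """Resident set size of the current process, in MB."""
    return psutil.Process().memory_info().rss / 2**20

print("before fit:             %.1f MB" % rss_mb())
X = np.random.rand(2000, 2)  # smaller than the 20000-point repro above
model = NearestNeighbors().fit(X)
neighborhoods = model.radius_neighbors(X, 0.5, return_distance=False)
print("after radius_neighbors: %.1f MB" % rss_mb())

del neighborhoods, model, X
print("after del:              %.1f MB" % rss_mb())
```

In the reported behavior, the "after del" figure stays close to the post-call figure instead of dropping back toward the baseline.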

The algorithm modes that produced the leak were:

  • auto
  • ball_tree
  • kd_tree

With algorithm='brute' the memory leak did not happen, but when the data size was increased by a factor of 10, the following error occurred:

Expected Results

MemoryError                               Traceback (most recent call last)
<ipython-input-14-df89794bb3b3> in <module>()
----> 1 neighborhoods = neighbors_model.radius_neighbors(X, 0.5, return_distance=False)

/usr/local/lib/python3.5/dist-packages/sklearn/neighbors/base.py in radius_neighbors(self, X, radius, return_distance)
    588             if self.effective_metric_ == 'euclidean':
    589                 dist = pairwise_distances(X, self._fit_X, 'euclidean',
--> 590                                           n_jobs=self.n_jobs, squared=True)
    591                 radius *= radius
    592             else:

/usr/local/lib/python3.5/dist-packages/sklearn/metrics/pairwise.py in pairwise_distances(X, Y, metric, n_jobs, **kwds)
   1245         func = partial(distance.cdist, metric=metric, **kwds)
   1246 
-> 1247     return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
   1248 
   1249 

/usr/local/lib/python3.5/dist-packages/sklearn/metrics/pairwise.py in _parallel_pairwise(X, Y, func, n_jobs, **kwds)
   1088     if n_jobs == 1:
   1089         # Special case to avoid picklability checks in delayed
-> 1090         return func(X, Y, **kwds)
   1091 
   1092     # TODO: in some cases, backend='threading' may be appropriate

/usr/local/lib/python3.5/dist-packages/sklearn/metrics/pairwise.py in euclidean_distances(X, Y, Y_norm_squared, squared, X_norm_squared)
    244         YY = row_norms(Y, squared=True)[np.newaxis, :]
    245 
--> 246     distances = safe_sparse_dot(X, Y.T, dense_output=True)
    247     distances *= -2
    248     distances += XX

/usr/local/lib/python3.5/dist-packages/sklearn/utils/extmath.py in safe_sparse_dot(a, b, dense_output)
    138         return ret
    139     else:
--> 140         return np.dot(a, b)
    141 
    142 

MemoryError: 
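The traceback shows the brute-force path materializing the full n × n pairwise distance matrix at once (200000 × 200000 for the larger input), which is what exhausts memory. Newer scikit-learn versions (0.20+) expose pairwise_distances_chunked, which reduces each chunk of the matrix before the next one is computed; a hedged sketch of using it to emulate radius_neighbors (shown here on 5000 points for brevity):

```python
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

X = np.random.rand(5000, 2)
radius = 0.5

def reduce_func(dist_chunk, start):
    # For each row in this chunk, keep only the indices of points within
    # the radius, so the full n x n distance matrix is never held in memory.
    return [np.flatnonzero(row <= radius) for row in dist_chunk]

neighborhoods = []
for chunk in pairwise_distances_chunked(X, reduce_func=reduce_func):
    neighborhoods.extend(chunk)
```

This trades a single huge allocation for a stream of chunk-sized ones, at the cost of keeping only the reduced result (the neighbor index lists) in memory.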

Versions

  • Linux-4.13.0-39-generic-x86_64-with-Ubuntu-16.04-xenial
  • Python 3.5.2 (default, Nov 23 2017, 16:37:01)
  • [GCC 5.4.0 20160609]
  • NumPy 1.14.3
  • SciPy 1.0.1
  • Scikit-Learn 0.19.1

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 21 (19 by maintainers)

Top GitHub Comments

1 reaction
jnothman commented, May 3, 2018

Sorry, I shouldn’t have let #11056 close this. I think that solved a separate issue.

0 reactions
rth commented, May 8, 2018

Thanks for investigating @recamshak !

After digging into glibc I found that the memory manager behavior can be changed via environment variables. See the end of man mallopt for a (exhaustive?) list.

Interesting; in particular, this part:

Note: Nowadays, glibc uses a dynamic mmap threshold by default. The initial value of the threshold is 128*1024, but when blocks larger than the current threshold and less than or equal to DEFAULT_MMAP_THRESHOLD_MAX are freed, the threshold is adjusted upward to the size of the freed block. When dynamic mmap thresholding is in effect, the threshold for trimming the heap is also dynamically adjusted to be twice the dynamic mmap threshold. Dynamic adjustment of the mmap threshold is disabled if any of the M_TRIM_THRESHOLD, M_TOP_PAD, M_MMAP_THRESHOLD, or M_MMAP_MAX parameters is set.

also sounds relevant, but I can’t wrap my head around it (and I’m not sure I want to)…
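In line with the mallopt notes above, one process-level workaround (not a scikit-learn fix) is to ask glibc directly to return free heap pages to the OS. The environment-variable equivalents of the mallopt parameters carry a trailing underscore (e.g. MALLOC_TRIM_THRESHOLD_, MALLOC_MMAP_THRESHOLD_) and must be set before the process starts. A hedged sketch of the in-process variant via ctypes, which only takes effect on glibc-based Linux and degrades to a no-op elsewhere:

```python
import ctypes
import ctypes.util

def trim_heap():
    """Call glibc's malloc_trim(0) if available.

    Returns True if the allocator reports it released memory back to
    the OS, False otherwise (including on non-glibc platforms, where
    the symbol does not exist).
    """
    path = ctypes.util.find_library("c")
    if path is None:
        return False
    libc = ctypes.CDLL(path)
    if not hasattr(libc, "malloc_trim"):  # e.g. macOS, musl
        return False
    return bool(libc.malloc_trim(0))

print("heap trimmed:", trim_heap())
```

Calling trim_heap() after deleting the large radius_neighbors result is where this would plausibly help, since the pages glibc kept cached on its free lists are exactly what the benchmark below shows lingering in RSS.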

For the record, I also tried running a more in-depth memory benchmark with all fields from psutil.Process().memory_info(), using the following function,

def get_memory_usage():
    import psutil
    res = psutil.Process().memory_full_info()
    return ' '.join('%s: %.3f MB' % (key, val/2**20)
                    for key, val in res._asdict().items() if val)

(and adapting the print statements in the benchmark script), which produces:

Initial memory usage: rss: 71.637 MB vms: 547.602 MB shared: 31.797 MB text: 2.746 MB data: 41.824 MB uss: 67.371 MB pss: 67.648 MB
Allocated X: rss: 72.410 MB vms: 548.320 MB shared: 31.809 MB text: 2.746 MB data: 42.543 MB uss: 68.195 MB pss: 68.473 MB
X.nbytes == 0.32 MB
Fitted NearestNeighbors: rss: 72.984 MB vms: 548.691 MB shared: 31.871 MB text: 2.746 MB data: 42.914 MB uss: 68.625 MB pss: 68.902 MB
Computed radius_neighbors: rss: 1545.211 MB vms: 2020.824 MB shared: 31.996 MB text: 2.746 MB data: 1515.047 MB uss: 1540.930 MB pss: 1541.207 MB
Computed radius_neighbors: rss: 1545.867 MB vms: 2021.324 MB shared: 31.996 MB text: 2.746 MB data: 1515.547 MB uss: 1541.441 MB pss: 1541.719 MB
Computed radius_neighbors: rss: 1546.266 MB vms: 2021.973 MB shared: 31.996 MB text: 2.746 MB data: 1516.195 MB uss: 1541.965 MB pss: 1542.242 MB
Computed radius_neighbors: rss: 1546.785 MB vms: 2022.359 MB shared: 31.996 MB text: 2.746 MB data: 1516.582 MB uss: 1542.352 MB pss: 1542.629 MB
Computed radius_neighbors: rss: 1546.785 MB vms: 2022.359 MB shared: 31.996 MB text: 2.746 MB data: 1516.582 MB uss: 1542.352 MB pss: 1542.629 MB
Unallocated everything  : rss: 1545.758 MB vms: 2021.109 MB shared: 31.996 MB text: 2.746 MB data: 1515.332 MB uss: 1541.250 MB pss: 1541.527 MB

The original benchmark script only shows the RSS memory, but in the end I don’t think this provides any additional insight.

I agree with @jnothman about closing this because,

  • we were not able to reproduce the original problem on 0.19.0
  • on master this seems to be another issue. The fact that repeated calls don’t increase memory suggests there is no leak, and the rest seems to be a peculiarity of how the Linux OS handles memory allocations. So I’m not sure there is anything we can/should do about it…