LinAlgError: the leading minor of order 11 of 'b' is not positive definite. The factorization of 'b' could not be completed and no eigenvalues or eigenvectors were computed.
See original GitHub issueI’m looking at trying to use umap on whole-genome data from the 1000 genomes project. I’m not doing lots of preprocessing, just filtering out variants with minor allele frequency less than 10% and looking at a single chromosome (chr1); this gives me 509925 variants across 2504 individuals (a 509925x2504 matrix).
Here’s what I’m doing:
import resource, sys
sys.setrecursionlimit(10**6)
resource.setrlimit(resource.RLIMIT_STACK, (2**29,-1))
import umap
import allel
callset = allel.read_vcf('ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz')
gt = allel.GenotypeArray(callset['calldata/GT'])
allel_freqs = gt.count_alleles()
maf_1 = gt[((allel_freqs/float(gt.shape[1]*2))[:,1] > .1)]
reducer = umap.UMAP()
embedding = reducer.fit(maf_1[:,:,0])
(If any biologists are reading, I’m probably doing this filtering wrong but I think UMAP should still work in this case?)
Initially I was running into RecursionLimit issues, but those went away after increasing the recursion limit (first 3 lines above). However, after a couple hours I get the following traceback:
In [42]: embedding = reducer.fit(maf_1[:,:,0])
/home/bowser/.virtualenvs/genetics/local/lib/python2.7/site-packages/umap_learn-0.3.6-py2.7.egg/umap/spectral.py:229: UserWarning: Embedding a total of 2215 separate connected components using meta-embedding (experimental)
---------------------------------------------------------------------------
LinAlgError Traceback (most recent call last)
<ipython-input-42-784bcd135a58> in <module>()
----> 1 embedding = reducer.fit(maf_1[:,:,0])
/home/bowser/.virtualenvs/genetics/local/lib/python2.7/site-packages/umap_learn-0.3.6-py2.7.egg/umap/umap_.pyc in fit(self, X, y)
1534 self.metric,
1535 self._metric_kwds,
-> 1536 self.verbose,
1537 )
1538
/home/bowser/.virtualenvs/genetics/local/lib/python2.7/site-packages/umap_learn-0.3.6-py2.7.egg/umap/umap_.pyc in simplicial_set_embedding(data, graph, n_components, initial_alpha, a, b, gamma, negative_sample_rate, n_epochs, init, random_state, metric, metric_kwds, verbose)
939 random_state,
940 metric=metric,
--> 941 metric_kwds=metric_kwds,
942 )
943 expansion = 10.0 / initialisation.max()
/home/bowser/.virtualenvs/genetics/local/lib/python2.7/site-packages/umap_learn-0.3.6-py2.7.egg/umap/spectral.pyc in spectral_layout(data, graph, dim, random_state, metric, metric_kwds)
238 random_state,
239 metric=metric,
--> 240 metric_kwds=metric_kwds,
241 )
242
/home/bowser/.virtualenvs/genetics/local/lib/python2.7/site-packages/umap_learn-0.3.6-py2.7.egg/umap/spectral.pyc in multi_component_layout(data, graph, n_components, component_labels, dim, random_state, metric, metric_kwds)
120 dim,
121 metric=metric,
--> 122 metric_kwds=metric_kwds,
123 )
124 else:
/home/bowser/.virtualenvs/genetics/local/lib/python2.7/site-packages/umap_learn-0.3.6-py2.7.egg/umap/spectral.pyc in component_layout(data, n_components, component_labels, dim, metric, metric_kwds)
57 component_embedding = SpectralEmbedding(
58 n_components=dim, affinity="precomputed"
---> 59 ).fit_transform(affinity_matrix)
60 component_embedding /= component_embedding.max()
61
/home/bowser/.virtualenvs/genetics/local/lib/python2.7/site-packages/sklearn/manifold/spectral_embedding_.pyc in fit_transform(self, X, y)
546 X_new : array-like, shape (n_samples, n_components)
547 """
--> 548 self.fit(X)
549 return self.embedding_
/home/bowser/.virtualenvs/genetics/local/lib/python2.7/site-packages/sklearn/manifold/spectral_embedding_.pyc in fit(self, X, y)
525 n_components=self.n_components,
526 eigen_solver=self.eigen_solver,
--> 527 random_state=random_state)
528 return self
529
/home/bowser/.virtualenvs/genetics/local/lib/python2.7/site-packages/sklearn/manifold/spectral_embedding_.pyc in spectral_embedding(adjacency, n_components, eigen_solver, random_state, eigen_tol, norm_laplacian, drop_first)
324 X[:, 0] = dd.ravel()
325 lambdas, diffusion_map = lobpcg(laplacian, X, tol=1e-15,
--> 326 largest=False, maxiter=2000)
327 embedding = diffusion_map.T[:n_components]
328 if norm_laplacian:
/home/bowser/.virtualenvs/genetics/local/lib/python2.7/site-packages/scipy/sparse/linalg/eigen/lobpcg/lobpcg.pyc in lobpcg(A, X, B, M, Y, tol, maxiter, largest, verbosityLevel, retLambdaHistory, retResidualNormsHistory)
488
489 # Solve the generalized eigenvalue problem.
--> 490 _lambda, eigBlockVector = eigh(gramA, gramB, check_finite=False)
491 ii = np.argsort(_lambda)[:sizeX]
492 if largest:
/home/bowser/.virtualenvs/genetics/local/lib/python2.7/site-packages/scipy/linalg/decomp.pyc in eigh(a, b, lower, eigvals_only, overwrite_a, overwrite_b, turbo, eigvals, type, check_finite)
491 " factorization of 'b' could not be completed"
492 " and no eigenvalues or eigenvectors were"
--> 493 " computed." % (info-b1.shape[0]))
494
495
LinAlgError: the leading minor of order 11 of 'b' is not positive definite. The factorization of 'b' could not be completed and no eigenvalues or eigenvectors were computed.
Any idea what’s going wrong here? If you want to reproduce, the VCF can be found here:
The only other dependency aside from UMAP is scikit-allel
.
Versions of various things:
In [46]: print umap.__version__
0.3.6
In [47]: print allel.__version__
1.1.10
In [48]: import sklearn
In [49]: print sklearn.__version__
0.20.0
Issue Analytics
- State:
- Created 5 years ago
- Comments:7 (2 by maintainers)
Top GitHub Comments
The optimal tolerance is difficult to predict theoretically, since it is problem dependent. Practically speaking, just try setting tol=1e-15 larger and/or maxiter=2000 smaller and see what happens… You need to get some practical experience yourself for your data finding the best balance keeping accuracy good enough while cutting the compute time, if your code runs for too long with the default values.
I’ve seen examples where even maxiter=2 would be enough for the overall goal. Have fun!
Thanks for the suggestion @lobpcg; this is not my area of expertise, so do you have some suggested default values that should be good?