ENH: stats: implementation of `multivariate_normal.logcdf` suitable for small probability mass
Hello there!
I am trying to use the multivariate normal integration algorithm embedded in SciPy, in the `mvn` package. Specifically, I am trying to use the `mvn.mvnun` function for a very large number of variables (n > 1000).
The function seems to return 0.0 with an INFORM value of 0 for these very large covariance matrices. I suspect there may be a numerical accuracy issue in the simulations that makes the result come out as 0, even though it is probably just a very small number.
If this is the case (I am not an expert in this field), is there a way to perform the simulations on the logarithm of the integral, so the result does not degenerate to 0?
Thanks for your help!
Here is the code for the MVN integral I am trying to compute, which returns a value of 0. It represents a typical points-on-a-grid setup in a spatial analysis, where the covariance matrix is obtained from an isotropic radial kernel:
```python
import numpy as np
from scipy.stats import mvn
import time

def correl(dx, s, w):
    # Isotropic radial (squared-exponential) kernel
    return s * np.exp(-w * dx**2)

def buildCov(nrows, ncols, s, w):
    # Covariance between all pairs of points on an nrows x ncols grid
    N = nrows * ncols
    S = np.zeros([N, N])
    for i in range(N):
        for j in range(i, N):
            rowi = i // ncols
            coli = i - ncols * rowi + 1
            rowj = j // ncols
            colj = j - ncols * rowj + 1
            ijdist = np.sqrt((rowi - rowj)**2 + (coli - colj)**2)
            S[i, j] = correl(ijdist, s, w)
            S[j, i] = S[i, j]
    return S

# Covariance matrix for a 25 x 25 grid (N = 625 variables)
nrows = 25
ncols = 25
N = nrows * ncols
s = 1
w = (1 / (min(nrows, ncols) / 3))**2
S = buildCov(nrows, ncols, s, w)

# Integrate the zero-mean MVN density over the orthant (-inf, 0]^N
M = np.zeros(N)
inf = -np.inf * np.ones(N)
sup = 0 * np.ones(N)
t0 = time.time()
p, i = mvn.mvnun(inf, sup, M, S)
t2 = time.time() - t0
```
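For a rough sense of scale (a hedged aside, not part of the original report): since all the kernel covariances here are nonnegative, Slepian's inequality bounds this orthant probability from below by the product of the marginal probabilities, 0.5**N, so the mass being sought can indeed be astronomically small. A minimal sketch of that bound, using only `scipy.stats.norm`:

```python
# Rough scale check (not from the original report): with nonnegative correlations,
# Slepian's inequality gives P(X <= 0) >= prod_i P(X_i <= 0) = Phi(0)**N = 0.5**N,
# so log P >= N * log(0.5). For N = 625 the bound is about exp(-433) ~ 7e-189.
# A value anywhere near that scale is far below what mvnun's quasi-Monte Carlo
# estimate can distinguish from 0, consistent with the 0.0 / INFORM = 0 result above.
import numpy as np
from scipy.stats import norm

N = 625
log_lower_bound = N * norm.logcdf(0.0)           # 625 * log(0.5) ≈ -433.2
print(log_lower_bound, np.exp(log_lower_bound))  # ≈ -433.22, ≈ 7.3e-189
```

If the true value is anywhere near this bound, returning its logarithm (as the title requests) would be the only way to report it meaningfully.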
And I am not aware of any implementation of the log-CDF that would help here. You might try searching through the papers citing the original, or looking for an implementation in some other system (R, Julia, MATLAB, whatever). I think it would be hard, though.
It looks like there are approaches that exploit the banded structure of your covariance matrix, but they won't be general for use in `multivariate_normal`.
The underlying algorithm only supports dimensionality up to 500. I think the implementation is supposed to explicitly return early with `INFORM=2` to indicate that, but nonetheless, the algorithm won't scale up that high. It does look like there have been some bugs fixed in the upstream package (MVNDST here) since we integrated it that might be relevant to your `n=25` example.
In such high dimensions, I expect that for most practical purposes, the answer is ~0 if the mean coordinate is within the bounds and ~1 outside of it, without a whole lot that's practically reachable in between. High dimensions are weird.
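To make that last point concrete (an added illustration that assumes independent standard-normal coordinates, not the correlated covariance from the report): in the independent case the orthant probability is just a product of marginal CDFs, Phi(t)**N, and as N grows the transition from ~0 to ~1 happens over a very narrow range of thresholds t:

```python
# Hedged illustration (independent standard-normal coordinates only): the probability
# that all N coordinates fall below a common threshold t is Phi(t)**N. As N grows,
# it jumps from essentially 0 to essentially 1 over a narrow range of t.
import numpy as np
from scipy.stats import norm

for N in (10, 100, 625):
    for t in (0.0, 1.0, 2.0, 3.0, 4.0):
        logp = N * norm.logcdf(t)   # log P(all N iid coordinates <= t)
        print(f"N={N:4d}  t={t:.1f}  P={np.exp(logp):.3e}")
```

Working in log space, as the `multivariate_normal.logcdf` enhancement in the title asks for, keeps the left tail of this transition meaningful even when the probability itself is far too small for the integrator to resolve.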