
dealing with nans


For the PPCA demo, I recommend generating two datasets:

1.) First generate a well-structured covariance matrix:

from scipy.linalg import toeplitz
import numpy as np
K = 10 - toeplitz(np.arange(10))

2.) Now generate a first dataset (a random walk with the given covariance matrix)

data1 = np.cumsum(np.random.multivariate_normal(np.zeros(10), K, 250), axis=0)

3.) Now copy the first dataset

from copy import copy
data2 = copy(data1)

4.) Set random entries of data2 to nan (choose some level of sparsity for this, e.g. 10% of the entries)

5.) Now plot data1 (solid line) and data2 (dashed line) and make sure they line up with each other
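Steps 1 to 5 could be put together roughly as follows. This is only a sketch: the fixed seed, the 10% sparsity value, and the helper name plot_overlay are assumptions made for illustration and are not part of the demo itself.

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(0)  # fixed seed (assumption) so the demo is repeatable

# Steps 1-2: structured covariance matrix, then a random walk drawn from it
K = 10 - toeplitz(np.arange(10))
data1 = np.cumsum(rng.multivariate_normal(np.zeros(10), K, 250), axis=0)

# Step 3: independent copy, so data1 itself is never modified
data2 = data1.copy()

# Step 4: set a fixed fraction of randomly chosen entries to NaN
sparsity = 0.1  # assumed level, i.e. 10% of the entries
n_missing = round(sparsity * data2.size)
flat_idx = rng.choice(data2.size, size=n_missing, replace=False)
rows, cols = np.unravel_index(flat_idx, data2.shape)
data2[rows, cols] = np.nan


def plot_overlay(data1, data2):
    """Step 5 (hypothetical helper): data1 solid, data2 dashed on one axis."""
    import matplotlib.pyplot as plt  # imported here so steps 1-4 run without it

    fig, ax = plt.subplots()
    ax.plot(data1, "-", linewidth=1)
    ax.set_prop_cycle(None)  # restart the colour cycle so columns match up
    ax.plot(data2, "--", linewidth=2)  # NaN entries appear as gaps in the line
    plt.show()
```

Calling plot_overlay(data1, data2) should show the dashed lines sitting on top of the solid ones wherever data2 is observed, with gaps at the NaN entries.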


Top GitHub Comments


jeremymanning commented, Dec 19, 2016

It looks like there’s still some interpolation going on in reduce.py…

desired behavior:

1.) if no nans, use PCA to reduce to the specified number of dimensions

2.) if nans, use PPCA (instead of PCA) to reduce to the specified number of dimensions. some observations may *still* be nans after using PPCA. those should show up as breaks in the line (i.e. don’t explicitly remove them from the plot, but they just won’t be visible). not removing nans is important because the user may want the rows to match up across matrices, and we don’t want to mess with that.

in the matlab version the nans are removed before doing PCA, and then they are added back in prior to plotting. what i’m proposing for the python version is to be a little fancier by using PPCA when possible to reconstruct missing data. since we’re already making an assumption that the data covariance matters in applying PCA to the data, we can leverage the same assumption to fill in parts of missing observations. but for skipped observations (i.e. where no feature is observed for that row of the data matrix) we shouldn’t add in any additional assumptions about the timecourse (we can’t even assume that the user is giving us a timecourse).

in other words, we want the reduced data to have the same number of rows as the original data.
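A rough sketch of that row-preserving behavior is below. It is not the code in reduce.py and it is not a real PPCA implementation: the function name reduce_keep_rows and its parameters are hypothetical, and the missing entries are filled by a simple iterative low-rank (SVD) imputation that stands in for PPCA. It also assumes every column has at least one observed value.

```python
import numpy as np


def reduce_keep_rows(x, ndims=3, n_iter=50):
    """Hypothetical helper: reduce x to ndims columns, one output row per input row."""
    x = np.asarray(x, dtype=float)
    missing = np.isnan(x)
    empty_rows = missing.all(axis=1)  # skipped observations: nothing to reconstruct from
    rows = x[~empty_rows]
    miss = missing[~empty_rows]

    # Start from column means, then alternate a rank-ndims fit and re-imputation.
    # This is a stand-in for PPCA, not PPCA itself.
    filled = np.where(miss, np.nanmean(rows, axis=0), rows)
    if miss.any():
        for _ in range(n_iter):
            mu = filled.mean(axis=0)
            u, s, vt = np.linalg.svd(filled - mu, full_matrices=False)
            approx = (u[:, :ndims] * s[:ndims]) @ vt[:ndims] + mu
            filled = np.where(miss, approx, rows)  # observed values are never changed

    # Ordinary PCA (via SVD) on the completed rows; with no nans this is all that runs
    mu = filled.mean(axis=0)
    u, s, vt = np.linalg.svd(filled - mu, full_matrices=False)
    scores = u[:, :ndims] * s[:ndims]

    # Fully missing rows stay nan, so row i of the output still matches row i of x
    out = np.full((x.shape[0], ndims), np.nan)
    out[~empty_rows] = scores
    return out
```

Because the all-nan rows are written back as nan rather than dropped, a line plot of the result breaks at those rows and the output can still be compared row by row with other matrices.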


jeremymanning commented, Dec 18, 2016

(This will help us determine if PPCA is correctly interpolating)