Dealing with NaNs
For the PPCA demo, I recommend generating two datasets:
1.) First generate a well-structured covariance matrix:

```python
from scipy.linalg import toeplitz
import numpy as np

K = 10 - toeplitz(np.arange(10))
```
2.) Now generate the first dataset (a random walk with the given covariance matrix):

```python
data1 = np.cumsum(np.random.multivariate_normal(np.zeros(10), K, 250), axis=0)
```
3.) Now copy the first dataset:

```python
from copy import copy

data2 = copy(data1)
```
4.) Set random entries of data2 to nan (choose some level of sparsity for this, e.g. 10% of the entries)
5.) Now plot data1 (solid line) and data2 (dashed line) and make sure they line up with each other (steps 4 and 5 are sketched just after this list)
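Here is a minimal sketch of steps 4 and 5, assuming matplotlib for the plotting; the fixed seed and the exact 10% sparsity level are illustrative choices, not part of the original recipe:

```python
import matplotlib.pyplot as plt

# Knock out ~10% of data2's entries at random (illustrative sparsity).
rng = np.random.default_rng(0)
mask = rng.random(data2.shape) < 0.10
data2[mask] = np.nan

# Solid lines for the full data, dashed lines for the nan-masked copy;
# matplotlib leaves a gap wherever a value is nan, so the dashed traces
# should sit exactly on top of the solid ones everywhere else.
plt.plot(data1, linestyle='-')
plt.plot(data2, linestyle='--')
plt.show()
```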
Top GitHub Comments
It looks like there's still some interpolation going on in `reduce.py`.

…desired behavior:
1.) if no nans, use PCA to reduce to the specified number of dimensions
2.) if nans, use PPCA (instead of PCA) to reduce to the specified number of dimensions. Some observations may *still* be nans after using PPCA; those should show up as breaks in the line (i.e. don't explicitly remove them from the plot, they just won't be visible). Not removing nans is important because the user may want the rows to match up across matrices, and we don't want to mess with that.
In the MATLAB version, the nans are removed before doing PCA and then added back in prior to plotting. What I'm proposing for the Python version is to be a little fancier by using PPCA, when possible, to reconstruct missing data. Since we're already assuming that the data covariance matters when we apply PCA, we can leverage the same assumption to fill in the missing parts of partially observed rows. But for skipped observations (i.e. rows of the data matrix where no feature is observed at all) we shouldn't add any additional assumptions about the timecourse (we can't even assume that the user is giving us a timecourse).
In other words, we want the reduced data to have the same number of rows as the original data; a sketch of this dispatch follows below.
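A minimal sketch of that dispatch, assuming sklearn's PCA for the complete-data case; the function name `reduce_data` is hypothetical (not hypertools' actual API), and column-mean imputation stands in here for a real PPCA reconstruction:

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_data(x, ndims):
    # 1.) no nans: plain PCA
    if not np.isnan(x).any():
        return PCA(n_components=ndims).fit_transform(x)

    # 2.) nans present: reconstruct what we can before reducing.
    # Column-mean imputation is a placeholder for PPCA here; only the
    # dispatch and row-preservation logic reflect the actual proposal.
    filled = np.where(np.isnan(x), np.nanmean(x, axis=0), x)
    reduced = PCA(n_components=ndims).fit_transform(filled)

    # Rows where nothing was observed stay nan, so they render as
    # breaks in the line and row indices still match the input.
    reduced[np.isnan(x).all(axis=1)] = np.nan
    return reduced
```

On the demo above, `reduce_data(data2, 2)` returns a 250-row array whether or not any rows were fully masked, so rows keep matching up across matrices.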
(This will help us determine if PPCA is correctly interpolating)
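To put a number on the step-5 eyeball test, one can compare the reconstruction against the ground truth at exactly the entries that were masked out (reusing `mask` from the sketch above, again with column-mean imputation standing in for PPCA's reconstruction):

```python
# Small error at the masked entries means the interpolation is working;
# a real PPCA fill should do noticeably better than column means.
filled = np.where(np.isnan(data2), np.nanmean(data2, axis=0), data2)
err = np.abs(filled[mask] - data1[mask]).mean()
print(f"mean abs error at masked entries: {err:.3f}")
```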