Using mean_match_candidates different from zero with categorical variables generates an error
Hi,
Another issue I found when imputing categorical variables (category dtype) is that setting mean_match_candidates != 0 in the kernel definition causes an error during .mice().
Error message:
/opt/conda/lib/python3.7/site-packages/miceforest/utils.py:377: RuntimeWarning: divide by zero encountered in true_divide
odds_ratio = probability / (1 - probability)
/opt/conda/lib/python3.7/site-packages/miceforest/utils.py:378: RuntimeWarning: divide by zero encountered in log
log_odds = np.log(odds_ratio)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-106-6f8a47b02d4b> in <module>
9 mean_match_candidates=2)
10 # Run the MICE algorithm for X iterations - 12:09 - 13:22
---> 11 a.mice(iterations=2, verbose=True, n_jobs=-1)
/opt/conda/lib/python3.7/site-packages/miceforest/ImputationKernel.py in mice(self, iterations, verbose, variable_parameters, compile_candidates, **kwlgb)
1193 random_state=self._random_state,
1194 hashed_seeds=None,
-> 1195 candidate_preds=candidate_preds,
1196 )
1197 )
/opt/conda/lib/python3.7/site-packages/miceforest/mean_match_schemes.py in mean_match_function_kdtree_cat(mmc, model, bachelor_features, candidate_values, random_state, hashed_seeds, candidate_preds)
361 candidate_values,
362 random_state,
--> 363 hashed_seeds,
364 )
365
/opt/conda/lib/python3.7/site-packages/miceforest/mean_match_schemes.py in _mean_match_multiclass_accurate(mmc, bachelor_preds, candidate_preds, candidate_values, random_state, hashed_seeds)
119
120 index_choice = knn_indices[np.arange(knn_indices.shape[0]), ind]
--> 121 imp_values = candidate_values[index_choice]
122
123 return imp_values
/opt/conda/lib/python3.7/site-packages/pandas/core/series.py in __getitem__(self, key)
964 return self._get_values(key)
965
--> 966 return self._get_with(key)
967
968 def _get_with(self, key):
/opt/conda/lib/python3.7/site-packages/pandas/core/series.py in _get_with(self, key)
999 # (i.e. self.iloc) or label-based (i.e. self.loc)
1000 if not self.index._should_fallback_to_positional():
-> 1001 return self.loc[key]
1002 else:
1003 return self.iloc[key]
/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
929
930 maybe_callable = com.apply_if_callable(key, self.obj)
--> 931 return self._getitem_axis(maybe_callable, axis=axis)
932
933 def _is_scalar_access(self, key: tuple):
/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
1151 raise ValueError("Cannot index with multidimensional key")
1152
-> 1153 return self._getitem_iterable(key, axis=axis)
1154
1155 # nested tuple slicing
/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_iterable(self, key, axis)
1091
1092 # A collection of keys
-> 1093 keyarr, indexer = self._get_listlike_indexer(key, axis)
1094 return self.obj._reindex_with_indexers(
1095 {axis: [keyarr, indexer]}, copy=True, allow_dups=True
/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis)
1312 keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
1313
-> 1314 self._validate_read_indexer(keyarr, indexer, axis)
1315
1316 if needs_i8_conversion(ax.dtype) or isinstance(
/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis)
1375
1376 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 1377 raise KeyError(f"{not_found} not in index")
1378
1379
KeyError: '[129846] not in index'
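The last frames of the traceback show candidate_values[index_choice] being resolved through .loc, which means the integers are treated as index labels. A minimal sketch of that failure mode follows, assuming (this is not confirmed by the issue) that candidate_values is a pandas Series that kept the row labels of the original frame while index_choice holds positions from the neighbour search. The data and variable contents are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical candidate values: a categorical Series whose index is NOT 0..n-1,
# as happens when rows are filtered without resetting the index.
candidate_values = pd.Series(["a", "b", "c"], index=[10, 20, 30], dtype="category")

# Hypothetical positions, standing in for what a nearest-neighbour search returns.
index_choice = np.array([0, 2])

# With an integer index, square-bracket indexing is label-based,
# so positions 0 and 2 are looked up as labels and are not found.
try:
    candidate_values[index_choice]
    raised = False
except KeyError:
    raised = True

# Positional access avoids the label lookup entirely.
by_position = candidate_values.iloc[index_choice]
by_array = candidate_values.to_numpy()[index_choice]
```

Either .iloc or a conversion to a NumPy array makes the lookup positional. Whether this matches the fix that the library adopted is not stated in the thread.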
What I found is that the default value of the mean_matching_scheme parameter was causing the issue (because it performs mean matching for categorical features), and that using miceforest.mean_match_schemes.mean_match_scheme_fast_cat works around it.
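The two RuntimeWarning lines at the top of the error output come from the log-odds computation: probability / (1 - probability) divides by zero when a predicted probability is exactly 1, and np.log receives zero when it is exactly 0. The sketch below shows one common way to keep the result finite by clipping probabilities first. The function name log_odds and the eps value are illustrative assumptions, not the library's code.

```python
import numpy as np

def log_odds(probability, eps=1e-12):
    """Return log(p / (1 - p)) after clipping p into [eps, 1 - eps]."""
    # Clipping keeps both the division and the logarithm away from zero.
    p = np.clip(np.asarray(probability, dtype=float), eps, 1 - eps)
    return np.log(p / (1 - p))

# Probabilities of exactly 0 or 1 would otherwise give -inf or +inf.
raw = np.array([0.0, 0.5, 1.0])
safe = log_odds(raw)
```

Clipping slightly distorts extreme predictions, so the choice of eps is a trade-off rather than a universally correct setting.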
Issue Analytics
- State:
- Created a year ago
- Comments:26 (14 by maintainers)
Top GitHub Comments
@KaikeWesleyReis If you wouldn’t mind, can you pull the MeanMatchScheme branch of this repo and build the package locally, and test for your use case?
To do this, the following commands should work in the terminal (Windows):
If you are on Linux, the commands should be similar.
EDIT - I should also note that this version changes how mean matching is controlled. There is now a MeanMatchScheme class, which you probably won't need to mess with. The README for this branch has updated examples of how to work with the new structure. Otherwise, things are pretty much the same as they were.
It should be possible to use the scale_pos_weight parameter. You would need to pass it to variable_parameters specifically for the problem column.
And now that I think about it, I'm not convinced that this would cause much of a problem. If your predictions have been altered for both the bachelors and the candidates, then the distribution of imputations might still be similar to the original distribution. I'll have to experiment with this next week; it could be a good solution to a problem like yours.