Using mean_match_candidates different from zero with categorical variables generates an error

Hi,

Another issue I found when imputing categorical variables (category dtype) is that setting mean_match_candidates != 0 in the kernel definition causes an error during .mice().

Error message:

/opt/conda/lib/python3.7/site-packages/miceforest/utils.py:377: RuntimeWarning: divide by zero encountered in true_divide
  odds_ratio = probability / (1 - probability)
/opt/conda/lib/python3.7/site-packages/miceforest/utils.py:378: RuntimeWarning: divide by zero encountered in log
  log_odds = np.log(odds_ratio)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-106-6f8a47b02d4b> in <module>
      9                                    mean_match_candidates=2)
     10 # Run the MICE algorithm for X iterations - 12:09 - 13:22
---> 11 a.mice(iterations=2, verbose=True, n_jobs=-1)

/opt/conda/lib/python3.7/site-packages/miceforest/ImputationKernel.py in mice(self, iterations, verbose, variable_parameters, compile_candidates, **kwlgb)
   1193                                 random_state=self._random_state,
   1194                                 hashed_seeds=None,
-> 1195                                 candidate_preds=candidate_preds,
   1196                             )
   1197                         )

/opt/conda/lib/python3.7/site-packages/miceforest/mean_match_schemes.py in mean_match_function_kdtree_cat(mmc, model, bachelor_features, candidate_values, random_state, hashed_seeds, candidate_preds)
    361                 candidate_values,
    362                 random_state,
--> 363                 hashed_seeds,
    364             )
    365 

/opt/conda/lib/python3.7/site-packages/miceforest/mean_match_schemes.py in _mean_match_multiclass_accurate(mmc, bachelor_preds, candidate_preds, candidate_values, random_state, hashed_seeds)
    119 
    120     index_choice = knn_indices[np.arange(knn_indices.shape[0]), ind]
--> 121     imp_values = candidate_values[index_choice]
    122 
    123     return imp_values

/opt/conda/lib/python3.7/site-packages/pandas/core/series.py in __getitem__(self, key)
    964             return self._get_values(key)
    965 
--> 966         return self._get_with(key)
    967 
    968     def _get_with(self, key):

/opt/conda/lib/python3.7/site-packages/pandas/core/series.py in _get_with(self, key)
    999             #  (i.e. self.iloc) or label-based (i.e. self.loc)
   1000             if not self.index._should_fallback_to_positional():
-> 1001                 return self.loc[key]
   1002             else:
   1003                 return self.iloc[key]

/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
    929 
    930             maybe_callable = com.apply_if_callable(key, self.obj)
--> 931             return self._getitem_axis(maybe_callable, axis=axis)
    932 
    933     def _is_scalar_access(self, key: tuple):

/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1151                     raise ValueError("Cannot index with multidimensional key")
   1152 
-> 1153                 return self._getitem_iterable(key, axis=axis)
   1154 
   1155             # nested tuple slicing

/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_iterable(self, key, axis)
   1091 
   1092         # A collection of keys
-> 1093         keyarr, indexer = self._get_listlike_indexer(key, axis)
   1094         return self.obj._reindex_with_indexers(
   1095             {axis: [keyarr, indexer]}, copy=True, allow_dups=True

/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis)
   1312             keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
   1313 
-> 1314         self._validate_read_indexer(keyarr, indexer, axis)
   1315 
   1316         if needs_i8_conversion(ax.dtype) or isinstance(

/opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis)
   1375 
   1376             not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 1377             raise KeyError(f"{not_found} not in index")
   1378 
   1379 

KeyError: '[129846] not in index'
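
One plausible reading of the traceback (an assumption, not confirmed in the thread) is that candidate_values is a pandas Series whose index holds the original row labels, while index_choice holds positions. With an integer index, square-bracket indexing looks up labels, so positions that are not also labels raise a KeyError. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical candidate values: the index labels are the original row
# numbers of the observed rows, so they are not 0..n-1.
candidate_values = pd.Series(
    ["a", "b", "c"], index=[10, 20, 129845], dtype="category"
)

# Positions, such as a nearest-neighbour search would return.
index_choice = np.array([2, 0, 1])

try:
    # With an integer index this is a label lookup; 2, 0 and 1 are not labels.
    candidate_values[index_choice]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# Positional access works regardless of the index labels.
by_position = candidate_values.iloc[index_choice].tolist()
by_array = candidate_values.to_numpy()[index_choice].tolist()
```

Both .iloc and a plain NumPy array select by position, which is what a nearest-neighbour index expects.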

What I found is that the default value of the mean_matching_scheme parameter causes the issue (because it applies mean matching to categorical features), and that using miceforest.mean_match_schemes.mean_match_scheme_fast_cat works around it.
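
The two RuntimeWarnings at the top of the error output are separate from the KeyError. They come from the two lines shown in the trace: when a predicted probability is exactly 1 the odds ratio divides by zero, and when it is exactly 0 the logarithm receives zero. The sketch below reproduces that with made-up probabilities and shows clipping as one generic way to keep the log-odds finite; the clipping step and the eps value are assumptions for illustration, not what miceforest does.

```python
import numpy as np

# Made-up class probabilities, including the two problematic extremes.
probability = np.array([0.0, 0.25, 1.0])

# Same two operations as in the warning messages; warnings silenced here.
with np.errstate(divide="ignore"):
    odds_ratio = probability / (1 - probability)  # 1.0 gives inf
    log_odds = np.log(odds_ratio)                 # 0.0 gives -inf

# Generic guard (illustrative only): keep probabilities strictly inside (0, 1).
eps = 1e-12
clipped = np.clip(probability, eps, 1 - eps)
safe_log_odds = np.log(clipped / (1 - clipped))
```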

Issue Analytics

Created: a year ago
Comments: 26 (14 by maintainers)

Top GitHub Comments

AnotherSamWilson commented, Jul 28, 2022

@KaikeWesleyReis If you wouldn’t mind, can you pull the MeanMatchScheme branch of this repo and build the package locally, and test for your use case?

To do this, the following commands should work in the terminal (Windows):

git clone https://github.com/AnotherSamWilson/miceforest.git
cd miceforest
git checkout MeanMatchScheme

python setup.py sdist
pip install dist/miceforest-5.6.0.tar.gz

If you are on Linux, the commands should be similar.

EDIT - I should also note that this version changes how mean matching is controlled. There is now a MeanMatchScheme class, which you probably won’t need to mess with. The README for this branch has updated examples of how to work with the new structure. Otherwise, things are pretty much the same as they were.

AnotherSamWilson commented, Jul 29, 2022

It should be possible to use the scale_pos_weight parameter. You would need to pass it to variable_parameters specifically for the problem column.
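
A rough sketch of what that per-column setting could look like, assuming variable_parameters maps a column name to a dict of LightGBM parameters. The column name, the weight value and the kernel object are hypothetical, and the exact structure should be checked against the miceforest README for your version.

```python
# "problem_column" is a hypothetical column name and 5.0 an arbitrary weight.
variable_parameters = {
    "problem_column": {"scale_pos_weight": 5.0},
}

# Hypothetical usage, with `kernel` being an existing imputation kernel:
# kernel.mice(iterations=2, variable_parameters=variable_parameters)
```

Only the listed column receives the extra parameter; the other columns keep their defaults.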

And now that I think about it, I’m not convinced that this would cause much of a problem. If your predictions have been altered for both the bachelors and the candidates, then the distribution of imputations might still be similar to the original distribution. I’ll have to experiment with this next week. It could be a good solution to a problem like yours.
