IndexError using impute_new_data

I am trying to impute new data using the kernel.

from datetime import datetime

start_t = datetime.now()
new_data_imputed = kernel.impute_new_data(new_data=new_sub)
print(f"New Data imputed in {(datetime.now() - start_t).total_seconds()} seconds")

But I keep getting an IndexError:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/var/folders/36/j_203fcj42q9bvnlt1sl3j640000gp/T/ipykernel_54504/3730750411.py in <module>
      2 
      3 start_t = datetime.now()
----> 4 new_data_imputed = kernel.impute_new_data(new_data=new_sub)
      5 print(f"New Data imputed in {(datetime.now() - start_t).total_seconds()} seconds")

~/.pyenv/versions/3.8.3/lib/python3.8/site-packages/miceforest/ImputationKernel.py in impute_new_data(self, new_data, datasets, iterations, save_all_iterations, copy_data, random_state, verbose)
   1233                         )
   1234                     )
-> 1235                     imputed_data._insert_new_data(
   1236                         dataset=ds, variable_index=var, new_data=imp_values
   1237                     )

~/.pyenv/versions/3.8.3/lib/python3.8/site-packages/miceforest/ImputedData.py in _insert_new_data(self, dataset, variable_index, new_data)
    387         view = _slice(self.working_data, col_slice=variable_index)
    388         if view.dtype.name == "category":
--> 389             new_data = np.array(view.cat.categories)[new_data]
    390 
    391         _assign_col_values_without_copy(

IndexError: index 1 is out of bounds for axis 0 with size 1

Shape of original data: (33008, 71)
Shape of new_sub: (15, 71)

Both datasets have columns that are all of the same data type. What could be causing this issue?

Issue Analytics

Created 2 years ago
Comments: 7 (4 by maintainers)

Top GitHub Comments

2 reactions

AnotherSamWilson commented, Dec 20, 2021

You can avoid this by casting the categorical columns in the new data to the same category dtypes as the original data (data below is the original dataset). Running this should solve it:

for col in data.columns:
  new_sub[col] = new_sub[col].astype(data[col].dtype)

This will ensure the categories are recognized, even if they do not exist in the new data.
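
To see why the cast helps, here is a minimal sketch that reproduces the same IndexError outside of miceforest. The column name color, the toy values, and the predicted_codes array are made up for illustration; only the lookup np.array(view.cat.categories)[new_data] is taken from the traceback above. The assumption is that the model predicts integer category codes based on the original data, which are then looked up in the category list of the new data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the original data and the new data.
data = pd.DataFrame({"color": pd.Categorical(["red", "blue", "red", "blue"])})
new_sub = pd.DataFrame({"color": pd.Categorical(["blue", "blue"])})

# new_sub inferred its categories from its own values, so it only has one,
# while the original data has two.
original_categories = list(data["color"].cat.categories)
new_categories = list(new_sub["color"].cat.categories)

# Assumed scenario: a model trained on `data` predicts category code 1.
predicted_codes = np.array([1])

# Same lookup as in the traceback: code 1 does not exist in a
# category list of size 1, so NumPy raises IndexError.
error_message = None
try:
    np.array(new_sub["color"].cat.categories)[predicted_codes]
except IndexError as exc:
    error_message = str(exc)
print(error_message)

# The suggested fix: cast each column to the dtype of the original data,
# which carries the full category list along with it.
for col in data.columns:
    new_sub[col] = new_sub[col].astype(data[col].dtype)

# The lookup now succeeds because both category lists match.
restored = np.array(new_sub["color"].cat.categories)[predicted_codes]
print(restored)
```

Note that the cast keeps the existing values in new_sub untouched; it only widens the list of categories the column knows about. A value in new_sub that is not among the original categories would become NaN after the cast.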

0 reactions

KaikeWesleyReis commented, Jul 26, 2022

@AnotherSamWilson Just to mention a concern here: if we use mean_matching_candidates != 0 when defining the Kernel, the imputation will fail on category dtype columns. If this is expected, it should be made clear in the Kernel description.