What to do if ref and test data have different categories in chisquare?

I have a question about a case that I think this library should already handle, but I can’t find the solution.

Currently, if a categorical feature has different categories in the reference data than in the test data, calling the predict method of TabularDrift or ChiSquareDrift raises an error. I created categories_per_feature on the whole dataset, but because of the way I split the data, one feature has categories 0 to 11 in the reference data and 0 to 12 in the test data. The error I get is: operands could not be broadcast together with shapes (13,) (12,). It comes from the chisquare function under the hood.

I think this is not a rare incident: it is quite likely that the reference data will lack some of the test data’s categories for one or more features.
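To make the shape mismatch concrete, here is a minimal sketch of one way to avoid it: count both samples over the same category set, so a category that is missing from the reference data gets a count of 0 instead of shortening the count vector. This is only an illustration in plain Python, not the library's implementation and not the fix that was eventually merged; the helper names aligned_counts and chi2_statistic are hypothetical, and the statistic is an ordinary Pearson chi-square on a 2 x K contingency table.

```python
from collections import Counter


def aligned_counts(ref, test):
    """Count ref and test over the union of their categories (hypothetical helper)."""
    categories = sorted(set(ref) | set(test))
    ref_counter, test_counter = Counter(ref), Counter(test)
    # Counter returns 0 for a category it never saw, so both lists have equal length.
    ref_counts = [ref_counter[c] for c in categories]
    test_counts = [test_counter[c] for c in categories]
    return categories, ref_counts, test_counts


def chi2_statistic(ref_counts, test_counts):
    """Pearson chi-square statistic for a 2 x K contingency table (hypothetical helper)."""
    n_ref, n_test = sum(ref_counts), sum(test_counts)
    total = n_ref + n_test
    stat = 0.0
    for r, t in zip(ref_counts, test_counts):
        col = r + t
        if col == 0:
            continue  # category absent from both samples contributes nothing
        exp_r = n_ref * col / total   # expected reference count under no drift
        exp_t = n_test * col / total  # expected test count under no drift
        stat += (r - exp_r) ** 2 / exp_r + (t - exp_t) ** 2 / exp_t
    return stat


# Mirror the situation in the question: categories 0..11 in ref, 0..12 in test.
ref = [i % 12 for i in range(120)]
test = [i % 13 for i in range(130)]

categories, ref_counts, test_counts = aligned_counts(ref, test)
stat = chi2_statistic(ref_counts, test_counts)
print(len(ref_counts), len(test_counts), stat)
```

Because both count vectors are built from the same 13 categories, they can be compared element by element, and the unseen category 12 simply shows up as a zero in the reference counts.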

Top GitHub Comments

arnaudvl commented, Apr 16, 2021

@tjhallum and @AsiehH: addressed by #222

tjhallum commented, Apr 7, 2021

“We obviously agree with that but saw it more in the context of drift detection on inputs for machine learning models. A lot of models will simply break if they are presented with categories that were not seen during training. But we also want to facilitate your use case…”

I forgot to speak specifically to this in my previous reply. In line with your vision, I am in fact using TabularDrift on inputs for machine learning models. It’s just that in my case I’ve specifically set up my models so that they do not break when they encounter categories that were not seen during training.