DataQualityTestPreset error with Categorical features
Hi team, first of all - thank you for creating this amazing package for DS/MLEs all over the world! 😃
Describe the bug: I am using Evidently to assess the data quality of my current and reference datasets, both pandas DataFrames.
However, I see the following error in an AWS SageMaker notebook: ValueError: could not convert string to float: 'Canteen' ('Canteen' is one of the categories in my features).
I was able to successfully run the Education Dataset mentioned here.
Error Message:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~/anaconda3/envs/python3/lib/python3.8/site-packages/IPython/core/formatters.py in __call__(self, obj)
343 method = get_real_method(obj, self.print_method)
344 if method is not None:
--> 345 return method()
346 return None
347 else:
~/anaconda3/envs/python3/lib/python3.8/site-packages/evidently/suite/base_suite.py in _repr_html_(self)
105
106 def _repr_html_(self):
--> 107 dashboard_id, dashboard_info, graphs = self._build_dashboard_info()
108 template_params = TemplateParams(
109 dashboard_id=dashboard_id, dashboard_info=dashboard_info, additional_graphs=graphs
~/anaconda3/envs/python3/lib/python3.8/site-packages/evidently/test_suite/test_suite.py in _build_dashboard_info(self)
126 renderer.color_options = color_options
127 by_status[test_result.status] = by_status.get(test_result.status, 0) + 1
--> 128 test_results.append(renderer.render_html(test))
129
130 summary_widget = BaseWidgetInfo(
~/anaconda3/envs/python3/lib/python3.8/site-packages/evidently/tests/data_quality_tests.py in render_html(self, obj)
822 raise ValueError("column_name should be present")
823
--> 824 counts_data = obj.metric.get_result().plot_data.counts_of_values
825 if counts_data is not None:
826 curr_df = counts_data["current"]
~/anaconda3/envs/python3/lib/python3.8/site-packages/evidently/metrics/base_metric.py in get_result(self)
51 result = self.context.metric_results.get(self, None)
52 if isinstance(result, ErrorResult):
---> 53 raise result.exception
54 if result is None:
55 raise ValueError(f"No result found for metric {self} of type {type(self).__name__}")
~/anaconda3/envs/python3/lib/python3.8/site-packages/evidently/suite/base_suite.py in run_checks(self)
260 try:
261 logging.debug(f"Executing {type(test)}...")
--> 262 test_results[test] = test.check()
263 except BaseException as ex:
264 test_results[test] = TestResult(
~/anaconda3/envs/python3/lib/python3.8/site-packages/evidently/tests/data_quality_tests.py in check(self)
443 # return result
444
--> 445 result = super().check()
446
447 if self.value is None:
~/anaconda3/envs/python3/lib/python3.8/site-packages/evidently/tests/base_test.py in check(self)
302 status=TestResult.SKIPPED,
303 )
--> 304 value = self.calculate_value_for_test()
305 self.value = value
306 result.description = self.get_description(value)
~/anaconda3/envs/python3/lib/python3.8/site-packages/evidently/tests/data_quality_tests.py in calculate_value_for_test(self)
786
787 def calculate_value_for_test(self) -> Optional[Numeric]:
--> 788 features_stats = self.metric.get_result().current_characteristics
789 most_common_percentage = features_stats.most_common_percentage
790
~/anaconda3/envs/python3/lib/python3.8/site-packages/evidently/metrics/base_metric.py in get_result(self)
51 result = self.context.metric_results.get(self, None)
52 if isinstance(result, ErrorResult):
---> 53 raise result.exception
54 if result is None:
55 raise ValueError(f"No result found for metric {self} of type {type(self).__name__}")
~/anaconda3/envs/python3/lib/python3.8/site-packages/evidently/suite/base_suite.py in run_calculate(self, data)
242 logging.debug(f"Executing {type(calculation)}...")
243 try:
--> 244 calculations[calculation] = calculation.calculate(data)
245 except BaseException as ex:
246 calculations[calculation] = ErrorResult(ex)
~/anaconda3/envs/python3/lib/python3.8/site-packages/evidently/metrics/data_integrity/column_summary_metric.py in calculate(self, data)
192 if data.reference_data is not None:
193 reference_data = data.reference_data
--> 194 ref_characteristics = self.map_data(get_features_stats(data.reference_data[self.column_name], column_type))
195 curr_characteristics = self.map_data(get_features_stats(data.current_data[self.column_name], column_type))
196
~/anaconda3/envs/python3/lib/python3.8/site-packages/evidently/calculations/data_quality.py in get_features_stats(feature, feature_type)
186 # round most common feature value for numeric features to 1e-5
187 if not np.issubdtype(feature, np.number):
--> 188 feature = feature.astype(float)
189 result.most_common_value = np.round(result.most_common_value, 5)
190 result.infinite_count = int(np.sum(np.isinf(feature)))
~/anaconda3/envs/python3/lib/python3.8/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors)
5813 else:
5814 # else, only a single dtype is given
-> 5815 new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
5816 return self._constructor(new_data).__finalize__(self, method="astype")
5817
~/anaconda3/envs/python3/lib/python3.8/site-packages/pandas/core/internals/managers.py in astype(self, dtype, copy, errors)
416
417 def astype(self: T, dtype, copy: bool = False, errors: str = "raise") -> T:
--> 418 return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
419
420 def convert(
~/anaconda3/envs/python3/lib/python3.8/site-packages/pandas/core/internals/managers.py in apply(self, f, align_keys, ignore_failures, **kwargs)
325 applied = b.apply(f, **kwargs)
326 else:
--> 327 applied = getattr(b, f)(**kwargs)
328 except (TypeError, NotImplementedError):
329 if not ignore_failures:
~/anaconda3/envs/python3/lib/python3.8/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors)
589 values = self.values
590
--> 591 new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
592
593 new_values = maybe_coerce_values(new_values)
~/anaconda3/envs/python3/lib/python3.8/site-packages/pandas/core/dtypes/cast.py in astype_array_safe(values, dtype, copy, errors)
1307
1308 try:
-> 1309 new_values = astype_array(values, dtype, copy=copy)
1310 except (ValueError, TypeError):
1311 # e.g. astype_nansafe can fail on object-dtype of strings
~/anaconda3/envs/python3/lib/python3.8/site-packages/pandas/core/dtypes/cast.py in astype_array(values, dtype, copy)
1255
1256 else:
-> 1257 values = astype_nansafe(values, dtype, copy=copy)
1258
1259 # in pandas we don't store numpy str dtypes, so convert to object
~/anaconda3/envs/python3/lib/python3.8/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
1199 if copy or is_object_dtype(arr.dtype) or is_object_dtype(dtype):
1200 # Explicit copy, or required since NumPy can't view from / to object.
-> 1201 return arr.astype(dtype, copy=True)
1202
1203 return arr.astype(dtype, copy=copy)
ValueError: could not convert string to float: 'Canteen'
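The last Evidently frame in the traceback (get_features_stats in data_quality.py) calls feature.astype(float), which suggests the column containing 'Canteen' was treated as numerical. Below is a minimal sketch that reproduces the same kind of pandas error outside Evidently. The column name and every value except 'Canteen' are made up for illustration, and the object dtype is set explicitly as an assumption about how the column was stored.

```python
import pandas as pd

# Hypothetical column: only 'Canteen' comes from the report, the rest is invented.
feature = pd.Series(["Canteen", "Library", "Canteen"], dtype="object", name="location")

try:
    # Same conversion that the traceback shows inside get_features_stats.
    feature.astype(float)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

If this prints a "could not convert string to float" message, the failure is in the string-to-float cast itself, not in anything specific to SageMaker.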
Expected behavior: To see the Data Quality report
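To Reproduce: After the Evidently imports,
from evidently.test_preset import DataQualityTestPreset
from evidently.test_suite import TestSuite

# reference_df and current_df are the pandas DataFrames described above
data_quality = TestSuite(tests=[
    DataQualityTestPreset(),
])
data_quality.run(reference_data=reference_df, current_data=current_df)
data_quality  # rendering the suite in the notebook is where the error surfaces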
System: AWS SageMaker with Python 3.8.12 and Evidently 0.2.0
Issue Analytics
- State:
- Created 10 months ago
- Comments:7 (2 by maintainers)
Top GitHub Comments
@AbhiPawar5 Cool, good to know! We continue to improve the automatic input data parsing process. I hope we will do better soon 😃
Hi @emeli-dral and @elenasamuylova, using ColumnMapping solved this issue 😃 Please feel free to close this issue.
Thanks for your help.
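
The thread does not show the mapping that was actually used, so the following is only a sketch of how ColumnMapping might be passed to the test suite with the 0.2.x API reported in the issue. The helper names split_feature_columns and run_data_quality are hypothetical, and treating every non-numeric column as categorical is an assumption that may not suit datetime or free-text columns.

```python
import pandas as pd


def split_feature_columns(df: pd.DataFrame):
    """Return (numerical, categorical) column name lists based on pandas dtypes."""
    numerical = df.select_dtypes(include="number").columns.tolist()
    # Assumption: everything that is not numeric is treated as categorical.
    categorical = df.select_dtypes(exclude="number").columns.tolist()
    return numerical, categorical


def run_data_quality(reference_df: pd.DataFrame, current_df: pd.DataFrame):
    """Run DataQualityTestPreset with an explicit column mapping."""
    # Imported here so the helper above stays usable without Evidently installed.
    from evidently import ColumnMapping
    from evidently.test_preset import DataQualityTestPreset
    from evidently.test_suite import TestSuite

    numerical, categorical = split_feature_columns(reference_df)
    column_mapping = ColumnMapping(
        numerical_features=numerical,
        categorical_features=categorical,
    )
    suite = TestSuite(tests=[DataQualityTestPreset()])
    suite.run(
        reference_data=reference_df,
        current_data=current_df,
        column_mapping=column_mapping,
    )
    return suite
```

Calling run_data_quality(reference_df, current_df) in a notebook cell would then return the suite for rendering. Listing the categorical columns by hand instead of inferring them from dtypes is the safer choice when a column holds numbers stored as strings or category codes stored as integers.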