DataQualityTestPreset error with Categorical features

See original GitHub issue

Hi team, first of all - thank you for creating this amazing package for DS/MLEs all over the world! 😃

Describe the bug: I am using Evidently to assess the data quality of my current and reference datasets, both of which are pandas DataFrames.

However, the AWS SageMaker notebook shows the following message: ValueError: could not convert string to float: 'Canteen' ('Canteen' is one of the categories in my features).

I was able to successfully run the Education Dataset mentioned here.

Error Message:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/anaconda3/envs/python3/lib/python3.8/site-packages/IPython/core/formatters.py in __call__(self, obj)
    343             method = get_real_method(obj, self.print_method)
    344             if method is not None:
--> 345                 return method()
    346             return None
    347         else:

~/anaconda3/envs/python3/lib/python3.8/site-packages/evidently/suite/base_suite.py in _repr_html_(self)
    105 
    106     def _repr_html_(self):
--> 107         dashboard_id, dashboard_info, graphs = self._build_dashboard_info()
    108         template_params = TemplateParams(
    109             dashboard_id=dashboard_id, dashboard_info=dashboard_info, additional_graphs=graphs

~/anaconda3/envs/python3/lib/python3.8/site-packages/evidently/test_suite/test_suite.py in _build_dashboard_info(self)
    126             renderer.color_options = color_options
    127             by_status[test_result.status] = by_status.get(test_result.status, 0) + 1

--> 128             test_results.append(renderer.render_html(test))
    129 
    130         summary_widget = BaseWidgetInfo(

~/anaconda3/envs/python3/lib/python3.8/site-packages/evidently/tests/data_quality_tests.py in render_html(self, obj)
    822             raise ValueError("column_name should be present")
    823 
--> 824         counts_data = obj.metric.get_result().plot_data.counts_of_values
    825         if counts_data is not None:
    826             curr_df = counts_data["current"]

~/anaconda3/envs/python3/lib/python3.8/site-packages/evidently/metrics/base_metric.py in get_result(self)
     51         result = self.context.metric_results.get(self, None)
     52         if isinstance(result, ErrorResult):
---> 53             raise result.exception
     54         if result is None:
     55             raise ValueError(f"No result found for metric {self} of type {type(self).__name__}")

~/anaconda3/envs/python3/lib/python3.8/site-packages/evidently/suite/base_suite.py in run_checks(self)
    260             try:
    261                 logging.debug(f"Executing {type(test)}...")
--> 262                 test_results[test] = test.check()
    263             except BaseException as ex:
    264                 test_results[test] = TestResult(

~/anaconda3/envs/python3/lib/python3.8/site-packages/evidently/tests/data_quality_tests.py in check(self)
    443         #     return result
    444 
--> 445         result = super().check()
    446 
    447         if self.value is None:

~/anaconda3/envs/python3/lib/python3.8/site-packages/evidently/tests/base_test.py in check(self)
    302             status=TestResult.SKIPPED,
    303         )
--> 304         value = self.calculate_value_for_test()
    305         self.value = value
    306         result.description = self.get_description(value)

~/anaconda3/envs/python3/lib/python3.8/site-packages/evidently/tests/data_quality_tests.py in calculate_value_for_test(self)
    786 
    787     def calculate_value_for_test(self) -> Optional[Numeric]:
--> 788         features_stats = self.metric.get_result().current_characteristics
    789         most_common_percentage = features_stats.most_common_percentage
    790 

~/anaconda3/envs/python3/lib/python3.8/site-packages/evidently/metrics/base_metric.py in get_result(self)
     51         result = self.context.metric_results.get(self, None)
     52         if isinstance(result, ErrorResult):
---> 53             raise result.exception
     54         if result is None:
     55             raise ValueError(f"No result found for metric {self} of type {type(self).__name__}")

~/anaconda3/envs/python3/lib/python3.8/site-packages/evidently/suite/base_suite.py in run_calculate(self, data)
    242                     logging.debug(f"Executing {type(calculation)}...")
    243                     try:
--> 244                         calculations[calculation] = calculation.calculate(data)
    245                     except BaseException as ex:
    246                         calculations[calculation] = ErrorResult(ex)

~/anaconda3/envs/python3/lib/python3.8/site-packages/evidently/metrics/data_integrity/column_summary_metric.py in calculate(self, data)
    192         if data.reference_data is not None:
    193             reference_data = data.reference_data
--> 194             ref_characteristics = self.map_data(get_features_stats(data.reference_data[self.column_name], column_type))
    195         curr_characteristics = self.map_data(get_features_stats(data.current_data[self.column_name], column_type))
    196 

~/anaconda3/envs/python3/lib/python3.8/site-packages/evidently/calculations/data_quality.py in get_features_stats(feature, feature_type)
    186         # round most common feature value for numeric features to 1e-5
    187         if not np.issubdtype(feature, np.number):
--> 188             feature = feature.astype(float)
    189         result.most_common_value = np.round(result.most_common_value, 5)
    190         result.infinite_count = int(np.sum(np.isinf(feature)))

~/anaconda3/envs/python3/lib/python3.8/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors)
   5813         else:
   5814             # else, only a single dtype is given
-> 5815             new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
   5816             return self._constructor(new_data).__finalize__(self, method="astype")
   5817 

~/anaconda3/envs/python3/lib/python3.8/site-packages/pandas/core/internals/managers.py in astype(self, dtype, copy, errors)
    416 
    417     def astype(self: T, dtype, copy: bool = False, errors: str = "raise") -> T:
--> 418         return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
    419 
    420     def convert(

~/anaconda3/envs/python3/lib/python3.8/site-packages/pandas/core/internals/managers.py in apply(self, f, align_keys, ignore_failures, **kwargs)
    325                     applied = b.apply(f, **kwargs)
    326                 else:
--> 327                     applied = getattr(b, f)(**kwargs)
    328             except (TypeError, NotImplementedError):
    329                 if not ignore_failures:

~/anaconda3/envs/python3/lib/python3.8/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors)
    589         values = self.values
    590 
--> 591         new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
    592 
    593         new_values = maybe_coerce_values(new_values)

~/anaconda3/envs/python3/lib/python3.8/site-packages/pandas/core/dtypes/cast.py in astype_array_safe(values, dtype, copy, errors)
   1307 
   1308     try:
-> 1309         new_values = astype_array(values, dtype, copy=copy)
   1310     except (ValueError, TypeError):
   1311         # e.g. astype_nansafe can fail on object-dtype of strings

~/anaconda3/envs/python3/lib/python3.8/site-packages/pandas/core/dtypes/cast.py in astype_array(values, dtype, copy)
   1255 
   1256     else:
-> 1257         values = astype_nansafe(values, dtype, copy=copy)
   1258 
   1259     # in pandas we don't store numpy str dtypes, so convert to object

~/anaconda3/envs/python3/lib/python3.8/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
   1199     if copy or is_object_dtype(arr.dtype) or is_object_dtype(dtype):
   1200         # Explicit copy, or required since NumPy can't view from / to object.
-> 1201         return arr.astype(dtype, copy=True)
   1202 
   1203     return arr.astype(dtype, copy=copy)

ValueError: could not convert string to float: 'Canteen'
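
The final frames of the traceback point to the immediate cause: get_features_stats calls feature.astype(float) on the column, and that conversion fails when the values are strings. This suggests the column was being treated as numerical. A minimal sketch in plain pandas, assuming a hypothetical object-dtype column (only 'Canteen' comes from the report; the other values are invented for illustration), reproduces the same kind of failure without Evidently:

```python
import pandas as pd

# Hypothetical categorical column; only "Canteen" appears in the original report.
feature = pd.Series(["Canteen", "Office", "Canteen"], dtype=object)

try:
    # Same conversion the traceback shows inside get_features_stats.
    feature.astype(float)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

If the message mentions 'Canteen', the column holds strings and cannot be handled as a numerical feature.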

Expected behavior: To see the Data Quality report

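To Reproduce: after the Evidently imports (the module paths below are assumed for version 0.2.0; the original post omitted them),

from evidently.test_suite import TestSuite
from evidently.test_preset import DataQualityTestPreset

data_quality = TestSuite(tests=[
    DataQualityTestPreset(),
])

# reference_df and current_df are pandas DataFrames
data_quality.run(reference_data=reference_df, current_data=current_df)
data_quality  # rendering the suite in the notebook raises the error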

System: AWS SageMaker with Python 3.8.12 and Evidently 0.2.0

Issue Analytics

  • State: closed
  • Created 10 months ago
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

1 reaction
emeli-dral commented, Dec 6, 2022

@AbhiPawar5 Cool, good to know! We continue to improve the automatic input data parsing process. I hope we will do better soon 😃

0 reactions
AbhiPawar5 commented, Dec 6, 2022

Hi @emeli-dral and @elenasamuylova, using ColumnMapping solved this issue 😃 Please feel free to close this issue.

Thanks for your help.
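
The thread does not show the exact mapping that was used, so the following is only a sketch of how the ColumnMapping fix might look. It assumes the Evidently 0.2.x import paths and a hypothetical helper, split_columns_by_dtype, that derives the feature lists from pandas dtypes; the example column names and values are invented for illustration.

```python
import pandas as pd


def split_columns_by_dtype(df):
    """Return (numerical, categorical) column name lists based on pandas dtypes."""
    numerical = df.select_dtypes(include="number").columns.tolist()
    categorical = [col for col in df.columns if col not in numerical]
    return numerical, categorical


def run_data_quality(reference_df, current_df):
    """Run DataQualityTestPreset with explicit column types (requires evidently)."""
    # Import paths assumed for Evidently 0.2.x; other releases may differ.
    from evidently.pipeline.column_mapping import ColumnMapping
    from evidently.test_preset import DataQualityTestPreset
    from evidently.test_suite import TestSuite

    numerical, categorical = split_columns_by_dtype(reference_df)
    column_mapping = ColumnMapping(
        numerical_features=numerical,
        categorical_features=categorical,
    )
    suite = TestSuite(tests=[DataQualityTestPreset()])
    suite.run(
        reference_data=reference_df,
        current_data=current_df,
        column_mapping=column_mapping,
    )
    return suite


# Hypothetical data for illustration only.
example_df = pd.DataFrame(
    {"location": ["Canteen", "Office", "Canteen"], "visits": [3, 5, 2]}
)
numerical, categorical = split_columns_by_dtype(example_df)
print(numerical, categorical)
```

In a notebook, run_data_quality(reference_df, current_df) would return the suite for display. Listing the categorical columns by hand is equally valid, and is the safer choice when a category is stored with a numeric dtype, since the dtype-based helper would then classify it as numerical.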
