
[BUG] package_general_metadata() can't find the categorical columns


Describe the bug

I am running into an issue where fitting on the training dataset and then transforming and saving it works fine, but transforming and saving the validation and test datasets fails. Strangely, it works when run step by step in a Jupyter notebook environment, but when run as a Job it errors at the to_parquet call with the following traceback.

Traceback (most recent call last):
 File "/usr/local/path/to/my/preprocess.py", line 77, in <module>
  main(sys.argv)
 File "/usr/local/path/to/my/preprocess.py", line 73, in main
  cars_preprocessor.preprocess()
 File "/usr/local/path/to/my/preprocessor.py", line 101, in preprocess
  self.fit_transform_features()
 File "/usr/local/path/to/my/preprocessor.py", line 132, in fit_transform_features
  self._transform_and_save(
 File "/usr/local/path/to/my/preprocessor.py", line 249, in _transform_and_save
  self.nvt_workflow.proc.transform(
 File "/nvtabular/nvtabular/io/dataset.py", line 897, in to_parquet
  _ddf_to_dataset(
 File "/nvtabular/nvtabular/io/dask.py", line 369, in _ddf_to_dataset
  out = client.compute(out).result()
 File "/root/.local/lib/python3.8/site-packages/distributed-2021.7.1-py3.8.egg/distributed/client.py", line 228, in result
  raise exc.with_traceback(tb)
 File "/usr/lib/python3.8/contextlib.py", line 75, in inner
  return func(*args, **kwds)
 File "/nvtabular/nvtabular/io/dask.py", line 210, in _write_subgraph
  return writer.close()
 File "/nvtabular/nvtabular/io/writer.py", line 313, in close
  _general_meta = self.package_general_metadata()
 File "/nvtabular/nvtabular/io/writer.py", line 252, in package_general_metadata
  data["cats"].append({"col_name": c, "index": self.col_idx[c]})
KeyError: 'my_categorical_column1'

Line 249 of _transform_and_save, referenced in the traceback above, looks like:

self.nvt_workflow.proc.transform(dataset).to_parquet(
    output_path=output_path,
    out_files_per_proc=out_files_per_proc,
    shuffle=nvt.io.Shuffle.PER_PARTITION,
    dtypes=self.nvt_workflow.dict_dtypes,
    cats=self.nvt_workflow.final_cat_columns,
    conts=self.nvt_workflow.final_cont_columns,
    labels=self.nvt_workflow.label_cols,
)
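One way to narrow this down (a sketch, not NVTabular API — the helper and column names here are hypothetical) is to diff the declared cats/conts/labels lists against the columns actually present in the transformed output before calling to_parquet:

```python
# Hypothetical debugging helper: compare the column lists passed to
# to_parquet() against the columns actually present in the transformed
# data. Any name that is declared but absent from the output would later
# trigger a KeyError when the writer builds its metadata index.
def find_missing_columns(declared, actual):
    """Return declared column names absent from the actual columns."""
    actual_set = set(actual)
    return [c for c in declared if c not in actual_set]

# Example: two categorical columns declared, one missing from the output.
declared_cats = ["my_categorical_column1", "my_categorical_column2"]
output_columns = ["my_categorical_column2", "price", "label"]
print(find_missing_columns(declared_cats, output_columns))
# → ['my_categorical_column1']
```

If this reports a missing column only for the validation/test runs, the problem is likely in how the declared column lists are shared across the three datasets rather than in the writer itself.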

I have checked that the transformed output looks correct. I also tried setting an out_path in Categorify(), but that had no effect. It is especially strange because the training set transforms and saves correctly. In fact, the data itself is written out for the validation set (which I transform and call to_parquet on right after the training set); the error is raised only after the data is written, when NVT tries to generate the metadata file for the validation set.

I have also checked that the variables passed to dtypes, cats, conts, and labels are correct. my_categorical_column1 is the first item in cats.
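For context on where the KeyError comes from: in the traceback, package_general_metadata indexes self.col_idx[c] for each declared categorical column. A minimal sketch of that pattern (simplified and not the real ThreadedWriter code; the assumption here is that col_idx is built from columns the writer actually received) shows how a declared-but-unseen column produces exactly this error:

```python
# Simplified sketch of the failing pattern from nvtabular/io/writer.py:
# col_idx maps each column the writer has actually seen to its position.
# If a declared categorical column never appears in the written data,
# the col_idx[c] lookup raises KeyError, as in the traceback above.
def package_general_metadata(cats, col_idx):
    data = {"cats": []}
    for c in cats:
        # KeyError raised here if `c` was never registered in col_idx
        data["cats"].append({"col_name": c, "index": col_idx[c]})
    return data

# Writer saw only these columns while writing the validation partitions:
col_idx = {"my_categorical_column2": 0, "price": 1}
try:
    package_general_metadata(["my_categorical_column1"], col_idx)
except KeyError as exc:
    print(exc)  # → 'my_categorical_column1'
```

This suggests the writer's view of the columns diverged from the cats list at metadata time, even though the data itself was written, which matches the observed behavior.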

Could you provide guidance on how to debug this issue? I appreciate it a lot - thank you so much!

Environment details (please complete the following information): Docker image built from the latest released merlin-training:21.11 image

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 11 (5 by maintainers)

Top GitHub Comments

1 reaction
EvenOldridge commented, Nov 22, 2021

@rnyak This sounds very similar to a problem you were running into. Any guidance for Shoya?

0 reactions
shoyasaxa commented, Feb 2, 2022

My apologies, I missed this - and yes, please feel free to close. Thank you!


Top Results From Across the Web

Check if dataframe column is Categorical - Stack Overflow
  Just putting this here because pandas.DataFrame.select_dtypes() is what I was actually looking for.

Categorical data - pandas 1.5.2 documentation
  Categorical Series or columns in a DataFrame can be created in several ways: ... Be aware that Categorical.set_categories() cannot know whether some ...

[BUG] make_feature_column_workflow doesn't work with ...
  Since the API overhaul in 0.4 - the make_feature_column_workflow function doesn't work with categorical data. The problem is that it's trying to directly...

Using pandas categories properly is tricky... here's why
  unstack() (which for the uninitiated, flips the index into the columns much like a pivot) moves the categorical index into the column index...

Getting continuous or categorical columns with Pandas
  Are you wondering how to get all of the continuous columns in your Pandas DataFrame? Or maybe you are interested in selecting all...
