[BUG] package_general_metadata() can't find the categorical columns
Describe the bug
I am running into an issue where fitting on the training dataset and then transforming and saving it works fine, but transforming and saving the validation and test datasets fails. Oddly, everything works when run step by step in a Jupyter notebook environment, but when run as a Job it errors as soon as it hits the to_parquet call, with the following traceback:
Traceback (most recent call last):
File "/usr/local/path/to/my/preprocess.py", line 77, in <module>
main(sys.argv)
File "/usr/local/path/to/my/preprocess.py", line 73, in main
cars_preprocessor.preprocess()
File "/usr/localpath/to/my/preprocessor.py", line 101, in preprocess
self.fit_transform_features()
File "/usr/local/path/to/my/preprocessor.py", line 132, in fit_transform_features
self._transform_and_save(
File "/usr/local/path/to/my/preprocessor.py", line 249, in _transform_and_save
self.nvt_workflow.proc.transform(
File "/nvtabular/nvtabular/io/dataset.py", line 897, in to_parquet
_ddf_to_dataset(
File "/nvtabular/nvtabular/io/dask.py", line 369, in _ddf_to_dataset
out = client.compute(out).result()
File "/root/.local/lib/python3.8/site-packages/distributed-2021.7.1-py3.8.egg/distributed/client.py", line 228, in result
raise exc.with_traceback(tb)
File "/usr/lib/python3.8/contextlib.py", line 75, in inner
return func(*args, **kwds)
File "/nvtabular/nvtabular/io/dask.py", line 210, in _write_subgraph
return writer.close()
File "/nvtabular/nvtabular/io/writer.py", line 313, in close
_general_meta = self.package_general_metadata()
File "/nvtabular/nvtabular/io/writer.py", line 252, in package_general_metadata
data["cats"].append({"col_name": c, "index": self.col_idx[c]})
KeyError: 'my_categorical_column1'
Line 249, referenced in the traceback's _transform_and_save frame, looks like this:
self.nvt_workflow.proc.transform(dataset).to_parquet(
output_path=output_path,
out_files_per_proc=out_files_per_proc,
shuffle=nvt.io.Shuffle.PER_PARTITION,
dtypes=self.nvt_workflow.dict_dtypes,
cats=self.nvt_workflow.final_cat_columns,
conts=self.nvt_workflow.final_cont_columns,
labels=self.nvt_workflow.label_cols,
)
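For context, the failing line in writer.py looks up each name in cats against an index of the columns that were actually written. Below is a minimal sketch of that pattern, not the actual NVTabular source; the column lists are made up, and only the lookup mirrors the traceback:

# Sketch of the failure mode (hypothetical column lists). col_idx maps
# the columns present in the written data to their positions; a name in
# `cats` that is absent from the written data raises the KeyError above.
written_columns = ["my_categorical_column2", "my_cont_column1"]  # hypothetical
cats = ["my_categorical_column1", "my_categorical_column2"]
col_idx = {name: i for i, name in enumerate(written_columns)}
data = {"cats": []}
for c in cats:
    # Raises KeyError: 'my_categorical_column1' because it was never written
    data["cats"].append({"col_name": c, "index": col_idx[c]})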
I have checked that the transformed output looks correct. I also tried setting an out_path in Categorify(), but that had no effect. What makes it especially strange is that the training set transforms and saves correctly, and the data itself is written out for the validation set (which I transform and to_parquet right after the training set); the error appears only after the data has been written, when NVT tries to generate the metadata file for the validation set.
I have also verified that the variables passed in as dtypes, cats, conts, and labels are correct. my_categorical_column1 is the first item in cats.
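For anyone debugging something similar, a check along these lines inside _transform_and_save (a sketch reusing the attribute names from my snippet above; it assumes only that proc.transform() returns an nvt.Dataset with a to_ddf() method) confirms whether those lists match the columns of the transformed output:

# Compare the column lists passed to to_parquet against the columns
# actually present in the transformed output before writing.
transformed = self.nvt_workflow.proc.transform(dataset)
written_cols = set(transformed.to_ddf().columns)
for group, cols in [
    ("cats", self.nvt_workflow.final_cat_columns),
    ("conts", self.nvt_workflow.final_cont_columns),
    ("labels", self.nvt_workflow.label_cols),
]:
    missing = [c for c in cols if c not in written_cols]
    if missing:
        print(f"{group} columns missing from transformed output: {missing}")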
Could you provide guidance on how to debug this issue? I appreciate it a lot - thank you so much!
Environment details (please complete the following information):
Docker image built from the latest released merlin-training:21.11 image
Top GitHub Comments
@rnyak This sounds very similar to a problem you were running into. Any guidance for Shoya?
My apologies, I missed this - and yes, please feel free to close. Thank you!