[QST] What data types should we use to save multi-hot columns?
I see that HugeCTR recently added support for multi-hot categorical columns read from parquet files.
In order to use that, I was wondering what data type we should use to save a multi-hot column when calling to_parquet
from the NVT side?
More specifically, I see in the examples code:
```python
dict_dtypes = {}
for col in cat_feats.columns:
    dict_dtypes[col] = np.int64
for col in cont_feats.columns:
    dict_dtypes[col] = np.float32
dict_dtypes['target'] = np.float32
```
If one of the categorical columns is multi-hot, what data type should we pass for it? I've tried np.int64 and List[np.int64], but neither worked. I could also omit the data type for the multi-hot columns entirely, but I am afraid HugeCTR will complain if the ints within the lists are not np.int64.
Could you please advise? Thank you so much for your help!
Update: I wrote my transformed dataset with multi-hot columns to S3, but the multi-hot columns don't appear in the written dataset. When I saved to the local PVC instead, everything saved fine. (We are also using 0.6.0.)
Issue Analytics
- Created: 2 years ago
- Comments: 7 (3 by maintainers)
Top GitHub Comments
I also confirmed that when I call to_parquet against an EBS volume (rather than S3), the metadata files are generated correctly.
Hi @rnyak, I just checked, and now no metadata files (_file_list.txt, _metadata, _metadata.json) are generated at all when I save to S3 with the newest merlin-training:21.11.