[QST] What data types should we use to save multi-hot columns?
I see that HugeCTR recently added support for multi-hot categorical columns read from parquet files.
In order to use that, I was wondering what data type we should use to save a multi-hot column when calling to_parquet
from the NVT side?
More specifically, I see in the examples code:
```python
dict_dtypes = {}
for col in cat_feats.columns:
    dict_dtypes[col] = np.int64
for col in cont_feats.columns:
    dict_dtypes[col] = np.float32
dict_dtypes['target'] = np.float32
```
If one of the categorical columns is multi-hot, what data type should we pass for it? I've tried np.int64 and List[np.int64], but neither worked. I could also omit the data type for the multi-hot columns entirely, but I am afraid HugeCTR will complain if the ints within the lists are not np.int64.
Could you please advise? Thank you so much for your help!
Update: I wrote my transformed dataset with multi-hot columns to S3, but the multi-hot columns don't appear in the written dataset. When I saved to the local PVC instead, everything saved fine. (We are also using 0.6.0.)
Issue Analytics
- Created: 2 years ago
- Comments: 7 (3 by maintainers)
Top GitHub Comments
I also confirmed that when I call to_parquet against an EBS volume (rather than S3), the metadata files are generated correctly.
Hi @rnyak, I just checked, and now no metadata files (_file_list.txt, _metadata, _metadata.json) are generated at all when I save to S3 with the newest merlin-training:21.11.