[QST] What data types should we use to save multi-hot columns?

I see that HugeCTR recently added support for multi-hot categorical columns read from parquet files. To use that, what data type should we use to save a multi-hot column when calling to_parquet from the NVT side?

More specifically, I see this in the example code:

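import numpy as np

dict_dtypes = {}
# Categorical features are written as 64-bit integers
for col in cat_feats.columns:
    dict_dtypes[col] = np.int64
# Continuous features and the label are written as 32-bit floats
for col in cont_feats.columns:
    dict_dtypes[col] = np.float32
dict_dtypes['target'] = np.float32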

If one of the categorical columns is multi-hot, what data type should we pass in for it? I’ve tried np.int64 and List[np.int64], but neither worked. I could also just omit the data type for the multi-hot columns, but I am afraid HugeCTR will complain if the ints within the lists are not np.int64.

Could you please advise? Thank you so much for your help!
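
For illustration only, here is a minimal sketch of the "omit the multi-hot columns" option mentioned above. It does not confirm what NVT or HugeCTR actually require. The helper names build_dict_dtypes and as_int64_lists, the column names, and the sample values are all hypothetical; only the np.int64 / np.float32 mapping comes from the example code. The idea is to leave list columns out of dict_dtypes and, separately, make sure the ids inside each list are int64.

```python
import numpy as np


def build_dict_dtypes(cat_cols, cont_cols, multi_hot_cols, label="target"):
    """Map single-hot categorical, continuous, and label columns to dtypes.

    Multi-hot (list) columns are deliberately left out of the mapping.
    Hypothetical helper, not part of NVT.
    """
    multi_hot = set(multi_hot_cols)
    dict_dtypes = {}
    for col in cat_cols:
        if col not in multi_hot:
            dict_dtypes[col] = np.int64
    for col in cont_cols:
        dict_dtypes[col] = np.float32
    dict_dtypes[label] = np.float32
    return dict_dtypes


def as_int64_lists(rows):
    """Cast each per-row list of category ids to an int64 NumPy array."""
    return [np.asarray(row, dtype=np.int64) for row in rows]


# Hypothetical column names and values, for illustration only.
dict_dtypes = build_dict_dtypes(
    cat_cols=["C1", "C2", "genres"],
    cont_cols=["I1"],
    multi_hot_cols=["genres"],
)
genres = as_int64_lists([[1, 4], [2], [3, 5, 7]])
```

Whether the resulting dict can be passed to to_parquet unchanged, and whether HugeCTR accepts the written list column, would still need to be checked against the versions in use.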

Update: I wrote my transformed dataset with multi-hot columns to S3, but the multi-hot columns are missing from the written dataset. When I saved it to the local PVC, everything saved fine. (We are also using 0.6.0.)

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
shoyasaxa commented, Nov 17, 2021

I also confirmed that when I call to_parquet to an EBS volume (and not to S3), the metadata are generated correctly.

1 reaction
shoyasaxa commented, Nov 17, 2021

Hi @rnyak, I just checked, and there is now no metadata at all (_file_list.txt, _metadata, _metadata.json) being generated when I save to S3 with the newest merlin-training:21.11.
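
A quick way to compare outputs is to check a written dataset directory for the three metadata files named above. The sketch below only covers a local path (such as the PVC or EBS volume); the helper name missing_metadata is hypothetical, and checking an S3 prefix would need a filesystem layer that is not shown here.

```python
import tempfile
from pathlib import Path

# File names taken from the comment above.
EXPECTED_METADATA = ("_file_list.txt", "_metadata", "_metadata.json")


def missing_metadata(output_dir):
    """Return the expected metadata file names that are absent from output_dir."""
    out = Path(output_dir)
    return [name for name in EXPECTED_METADATA if not (out / name).is_file()]


# Demo on a temporary directory that lacks _metadata.json.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "_file_list.txt").touch()
    (Path(tmp) / "_metadata").touch()
    missing = missing_metadata(tmp)

print(missing)
```

An empty list means all three files are present; anything else names the files that were not written.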
