[BUG] categorical columns automatically sorted. #814
Describe the bug
I passed the following cats mapping
cats={'a_is_verified': (2, 16),
'b_follows_a': (2, 16),
'tw_first_word': (1411026, 16),
'tw_second_word': (1111029, 16),
'tw_last_word': (140149, 16),
'tw_llast_word': (232494, 16),
'language': (65, 16),
'tweet_type': (3, 16),
'media': (16, 16),
'a_user_id_index': (41693, 16),
'b_user_id_index': (10495, 16),
'mfi_hashtags': (63165, 16),
'mfg_hashtags': (77144, 16),
'mfi_links': (29328, 16),
'mfg_links': (29467, 16),
'mfi_domains': (18090, 16),
'mfg_domains': (18329, 16)}
to a TorchAsyncItr dataset, but when I fetch a batch,
batch = next(iter(train_loader))
cat_feats, cont_feats, labels = batch
the categorical features are returned ordered alphabetically by column name.
cat_feats.max(axis=0).values
This command yields a somewhat weird order:
tensor([ 1, 19490, 1, 3080, 63, 8, 3886, 13118, 3410,
3839, 11250, 3406, 123392, 24677, 38871, 112914, 2],
device='cuda:0')
I don't think this is the intended behavior.
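The reported maxima can be checked against the cats mapping to confirm the ordering. The sketch below is only an illustration: it copies the values from the report and assumes that encoded category indices are always smaller than the column's cardinality. The helper name fits is hypothetical.

```python
# Mapping from the report: column name -> (cardinality, embedding dim).
cats = {
    'a_is_verified': (2, 16), 'b_follows_a': (2, 16),
    'tw_first_word': (1411026, 16), 'tw_second_word': (1111029, 16),
    'tw_last_word': (140149, 16), 'tw_llast_word': (232494, 16),
    'language': (65, 16), 'tweet_type': (3, 16), 'media': (16, 16),
    'a_user_id_index': (41693, 16), 'b_user_id_index': (10495, 16),
    'mfi_hashtags': (63165, 16), 'mfg_hashtags': (77144, 16),
    'mfi_links': (29328, 16), 'mfg_links': (29467, 16),
    'mfi_domains': (18090, 16), 'mfg_domains': (18329, 16),
}

# Per-column maxima copied from the tensor printed in the report.
observed_max = [1, 19490, 1, 3080, 63, 8, 3886, 13118, 3410,
                3839, 11250, 3406, 123392, 24677, 38871, 112914, 2]


def fits(columns):
    """Return True if every observed maximum is a valid index for its column.

    Assumption: an encoded category index is always < the column cardinality.
    """
    return all(m < cats[name][0] for name, m in zip(columns, observed_max))


insertion_order_ok = fits(list(cats))   # order in which cats was written
sorted_order_ok = fits(sorted(cats))    # alphabetical order of column names
print(insertion_order_ok, sorted_order_ok)  # False True
```

Only the alphabetical ordering is consistent with the cardinalities, which matches the behavior described in this report.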
Steps/Code to reproduce bug
# External dependencies
import os
import cudf # cuDF is an implementation of Pandas-like Dataframe on GPU
import shutil
import numpy as np
import nvtabular as nvt
from os import path
import pandas as pd
from nvtabular.loader.torch import TorchAsyncItr, DLDataLoader
x = pd.concat([pd.Series([0, 0, 0, 0, 0]), pd.Series([0,1,2,3,4]), pd.Series([1,1,1,1,1])], axis=1)
x.columns = ['z', 'a', 'c']
cat_cols = ['a', 'z']
x.to_csv("./ret.csv")
train_dataset = TorchAsyncItr(
    nvt.Dataset("./ret.csv"),
    batch_size=2,
    cats=cat_cols,
    conts=[],
    labels=["c"],
)
train_loader = DLDataLoader(
    train_dataset, batch_size=None, collate_fn=lambda x: x, pin_memory=False, num_workers=0
)
batch = next(iter(train_loader))
cat_feats, cont_feats, labels = batch
print(cat_feats)
tensor([[0, 0],
[1, 0]], device='cuda:0')
Expected behavior
The first column is “a” and its maximum value will be 0.
Environment details
I’m using Docker. I installed NVTabular via the conda commands in README.md (using CUDA 10.1):
conda env create -f=conda/environments/nvtabular_dev_cuda10.1.yml
conda activate nvtabular_dev_10.1
python -m ipykernel install --user --name=nvt
pip install -e .
jupyter notebook
Top GitHub Comments
Sorry, I thought you were asking why the max is 1 instead of 0. If you are asking whether we sort columns alphabetically, you are correct: we do. That code can be found here (https://github.com/NVIDIA/NVTabular/blob/739c6ef92915ef2493ec33ae5020e0978ab2d170/nvtabular/loader/backend.py#L455-L462). This is done to guarantee the same ordering in both the tensor representation and the table of embedding sizes. We use the same ordering function to order the embedding sizes (https://github.com/NVIDIA/NVTabular/blob/e02549a785dc3297b2065b44266226e0fe7af330/nvtabular/ops/categorify.py#L451). Let me know if this helps.
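Until the ordering changes, one possible workaround is to permute the returned columns back into the order you asked for. This is a minimal sketch, not NVTabular code: restore_order is a hypothetical helper, the rows are plain Python lists standing in for the categorical tensor, and it assumes the loader orders columns exactly as Python's sorted() would.

```python
def restore_order(rows, requested):
    """Reorder columns from sorted-name order back to the requested order.

    rows:      list of rows whose columns follow sorted(requested)
    requested: column names in the order the caller wants
    """
    loader_order = sorted(requested)  # assumed order of the returned columns
    # For each requested name, find its position in the loader's order.
    perm = [loader_order.index(name) for name in requested]
    return perm, [[row[i] for i in perm] for row in rows]


# Rows as printed in the reproduction above (columns are "a", then "z").
perm, reordered = restore_order([[0, 0], [1, 0]], ["z", "a"])
print(perm)       # [1, 0]
print(reordered)  # [[0, 0], [0, 1]]
```

With a real batch the same permutation could be applied as cat_feats[:, perm]. Any embedding size table built with the sorted ordering would need the same permutation to stay aligned.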
We are changing this for our upcoming release; the work is currently in a PR (https://github.com/NVIDIA/NVTabular/pull/793). The new behavior mimics the TensorFlow dataloader output, which provides a dictionary-based structure with the following possible key/value signatures:
- “column name”: tensor
- “column name”: (tensor, offsets)
- “column name”: sparse tensor
This will allow users to impose whatever ordering they want.
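To illustrate why a name-keyed structure removes the ordering problem, here is a small sketch that converts sorted-order columns into a dictionary. It is not the API from the PR: to_named_columns is a hypothetical helper, and plain lists stand in for tensors.

```python
def to_named_columns(rows, cat_names):
    """Map each column name to its values, assuming columns follow sorted(cat_names)."""
    loader_order = sorted(cat_names)
    return {
        name: [row[i] for row in rows]
        for i, name in enumerate(loader_order)
    }


# Same rows as in the reproduction: columns are "a", then "z".
named = to_named_columns([[0, 0], [1, 0]], ["z", "a"])
print(named["a"])  # [0, 1]
print(named["z"])  # [0, 0]

# The caller now chooses the order explicitly instead of relying on position.
ordered = [named[name] for name in ["z", "a"]]
print(ordered)  # [[0, 0], [0, 1]]
```

Because each column is looked up by name, the position it occupied in the original tensor no longer matters.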