[BUG] categorical columns automatically sorted.

Describe the bug

I passed the following cats

cats={'a_is_verified': (2, 16),
 'b_follows_a': (2, 16),
 'tw_first_word': (1411026, 16),
 'tw_second_word': (1111029, 16),
 'tw_last_word': (140149, 16),
 'tw_llast_word': (232494, 16),
 'language': (65, 16),
 'tweet_type': (3, 16),
 'media': (16, 16),
 'a_user_id_index': (41693, 16),
 'b_user_id_index': (10495, 16),
 'mfi_hashtags': (63165, 16),
 'mfg_hashtags': (77144, 16),
 'mfi_links': (29328, 16),
 'mfg_links': (29467, 16),
 'mfi_domains': (18090, 16),
 'mfg_domains': (18329, 16)}

to a TorchAsyncItr dataset, but when I fetch a batch,

batch = next(iter(train_loader))
cat_feats, cont_feats, labels = batch

the categorical features come back ordered alphabetically by column name rather than in the order I specified.

    cat_feats.max(axis=0).values

This command yields the per-column maxima in an unexpected order:

tensor([     1,  19490,      1,   3080,     63,      8,   3886,  13118,   3410,
          3839,  11250,   3406, 123392,  24677,  38871, 112914,      2],
       device='cuda:0')

I think this is not intended behavior.
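As a sanity check on the report above, the observed maxima can be compared against the declared cardinalities. The sketch below assumes that each value in cats is a (cardinality, embedding dimension) pair and that encoded ids fall in the range [0, cardinality); neither is stated in the issue. The helper name is_consistent is hypothetical, and Python's built-in sorted stands in for whatever ordering function the loader actually uses.

```python
# Values copied from the report; the (cardinality, embedding_dim) reading
# of each tuple is an assumption, not something the issue confirms.
cats = {
    "a_is_verified": (2, 16), "b_follows_a": (2, 16),
    "tw_first_word": (1411026, 16), "tw_second_word": (1111029, 16),
    "tw_last_word": (140149, 16), "tw_llast_word": (232494, 16),
    "language": (65, 16), "tweet_type": (3, 16), "media": (16, 16),
    "a_user_id_index": (41693, 16), "b_user_id_index": (10495, 16),
    "mfi_hashtags": (63165, 16), "mfg_hashtags": (77144, 16),
    "mfi_links": (29328, 16), "mfg_links": (29467, 16),
    "mfi_domains": (18090, 16), "mfg_domains": (18329, 16),
}

# Output of cat_feats.max(axis=0).values from the report, as a plain list.
observed_max = [1, 19490, 1, 3080, 63, 8, 3886, 13118, 3410,
                3839, 11250, 3406, 123392, 24677, 38871, 112914, 2]


def is_consistent(column_order):
    """True if every observed maximum is below the cardinality of the
    column assumed to sit at that position (hypothetical helper)."""
    return all(
        max_id < cats[name][0]
        for name, max_id in zip(column_order, observed_max)
    )


alphabetical = is_consistent(sorted(cats))  # columns sorted by name
declared = is_consistent(list(cats))        # columns in the order passed in
print("alphabetical order fits:", alphabetical)
print("declared order fits:", declared)
```

Only the alphabetical ordering keeps every maximum below its column's cardinality, which matches the behavior described in the report.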

Steps/Code to reproduce bug

# External dependencies
import os
import cudf  # cuDF is an implementation of Pandas-like Dataframe on GPU
import shutil
import numpy as np

import nvtabular as nvt

from os import path
import pandas as pd
from nvtabular.loader.torch import TorchAsyncItr, DLDataLoader


x = pd.concat([pd.Series([0, 0, 0, 0, 0]), pd.Series([0,1,2,3,4]), pd.Series([1,1,1,1,1])], axis=1)
x.columns = ['z', 'a', 'c']
cat_cols = ['a', 'z']
x.to_csv("./ret.csv")


train_dataset = TorchAsyncItr(
    nvt.Dataset("./ret.csv"),
    batch_size=2,
    cats=cat_cols,
    conts=[],
    labels=["c"])
train_loader = DLDataLoader(
    train_dataset, batch_size=None, collate_fn=lambda x: x, pin_memory=False, num_workers=0
)

batch = next(iter(train_loader))
a, b, c = batch
print(a)


tensor([[0, 0],
        [1, 0]], device='cuda:0')

Expected behavior

The first column is “a” and its maximum value will be 0.

Environment details

I’m using Docker. I installed NVTabular via the conda commands in README.md (using CUDA 10.1):

conda env create -f=conda/environments/nvtabular_dev_cuda10.1.yml
conda activate nvtabular_dev_10.1
python -m ipykernel install --user --name=nvt
pip install -e .
jupyter notebook

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

jperez999 commented, May 18, 2021

Sorry, I thought you were asking why the max is 1 instead of 0. If you are asking whether we sort columns alphabetically, you are correct: we do. That code can be found here (https://github.com/NVIDIA/NVTabular/blob/739c6ef92915ef2493ec33ae5020e0978ab2d170/nvtabular/loader/backend.py#L455-L462). We sort so that the column order is guaranteed to be the same in both the tensor representation and the embedding size table. We use the same ordering function to order the tensor embedding sizes (https://github.com/NVIDIA/NVTabular/blob/e02549a785dc3297b2065b44266226e0fe7af330/nvtabular/ops/categorify.py#L451). Let me know if this helps.
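To illustrate the point about keeping tensor columns and embedding sizes aligned, here is a minimal sketch in plain Python. It is not the NVTabular code linked in this comment: the helpers embedding_table and column_index are hypothetical, Python's built-in sorted is assumed to approximate the loader's ordering function, and plain lists stand in for GPU tensors.

```python
def embedding_table(cats):
    """List (name, cardinality, dim) entries in sorted column order, so that
    entry i describes column i of the categorical tensor (hypothetical)."""
    return [(name, *cats[name]) for name in sorted(cats)]


def column_index(cat_names):
    """Map each column name to its position after alphabetical sorting."""
    return {name: i for i, name in enumerate(sorted(cat_names))}


# A three-column subset of the cats mapping from the report.
table = embedding_table(
    {"tweet_type": (3, 16), "language": (65, 16), "media": (16, 16)}
)

# Column names from the reproduction script; the order in which they are
# passed does not matter once the columns are sorted.
idx = column_index(["z", "a"])

# The printed batch from the reproduction, as nested lists.
batch = [[0, 0], [1, 0]]
a_values = [row[idx["a"]] for row in batch]
z_values = [row[idx["z"]] for row in batch]
print(table)
print(a_values, z_values)
```

Looking columns up by name through such an index avoids depending on the positional order at all. With real tensors the equivalent lookup would be a column slice such as cat_feats[:, idx["a"]].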

jperez999 commented, May 18, 2021

We are modifying this for our upcoming release, currently in a PR (https://github.com/NVIDIA/NVTabular/pull/793). We are changing the loader to mimic the TensorFlow dataloader output, which provides a dictionary-based structure with the following possible key/value signatures:

  • “column name”: tensor
  • “column name”: (tensor, offsets)
  • “column_name”: sparse-tensor

This will let users impose whatever ordering they want.
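A rough sketch of how a dictionary-based batch lets the caller choose the column order follows. It only illustrates the idea in this comment and is not the API introduced by the PR: select_in_order is a hypothetical helper, and plain lists stand in for tensors.

```python
from typing import Dict, List, Sequence


def select_in_order(
    batch: Dict[str, List[int]], order: Sequence[str]
) -> List[List[int]]:
    """Pull columns out of a name-keyed batch in the caller's order
    (hypothetical helper, not part of NVTabular)."""
    return [batch[name] for name in order]


# A name-keyed batch holding the two rows from the reproduction script.
named_batch = {"a": [0, 1], "z": [0, 0]}

# The caller, not the loader, now decides which column comes first.
z_first = select_in_order(named_batch, ["z", "a"])
a_first = select_in_order(named_batch, ["a", "z"])
print(z_first)
print(a_first)
```

Because each column is addressed by name, no implicit sort order can change which feature ends up at which position.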

