[BUG] categorical columns automatically sorted.

Describe the bug

I passed the following cats dictionary

cats={'a_is_verified': (2, 16),
 'b_follows_a': (2, 16),
 'tw_first_word': (1411026, 16),
 'tw_second_word': (1111029, 16),
 'tw_last_word': (140149, 16),
 'tw_llast_word': (232494, 16),
 'language': (65, 16),
 'tweet_type': (3, 16),
 'media': (16, 16),
 'a_user_id_index': (41693, 16),
 'b_user_id_index': (10495, 16),
 'mfi_hashtags': (63165, 16),
 'mfg_hashtags': (77144, 16),
 'mfi_links': (29328, 16),
 'mfg_links': (29467, 16),
 'mfi_domains': (18090, 16),
 'mfg_domains': (18329, 16)}

to a TorchAsyncItr dataset, but when I fetch a batch,

batch = next(iter(train_loader))
cat_feats, cont_feats, labels = batch

the categorical features are returned ordered alphabetically by column name.

    cat_feats.max(axis=0).values

This command yields the per-column maxima in an unexpected order:

tensor([     1,  19490,      1,   3080,     63,      8,   3886,  13118,   3410,
          3839,  11250,   3406, 123392,  24677,  38871, 112914,      2],
       device='cuda:0')

I think this is not intended behavior.

Steps/Code to reproduce bug

# External dependencies
import os
import cudf  # cuDF is an implementation of Pandas-like Dataframe on GPU
import shutil
import numpy as np

import nvtabular as nvt

from os import path
import pandas as pd
from nvtabular.loader.torch import TorchAsyncItr, DLDataLoader


x = pd.concat([pd.Series([0, 0, 0, 0, 0]), pd.Series([0,1,2,3,4]), pd.Series([1,1,1,1,1])], axis=1)
x.columns = ['z', 'a', 'c']
cat_cols = ['a', 'z']
x.to_csv("./ret.csv")


train_dataset = TorchAsyncItr(
    nvt.Dataset("./ret.csv"),
    batch_size=2,
    cats=cat_cols,
    conts=[],
    labels=["c"])
train_loader = DLDataLoader(
    train_dataset, batch_size=None, collate_fn=lambda x: x, pin_memory=False, num_workers=0
)

batch = next(iter(train_loader))
cat_feats, cont_feats, labels = batch
print(cat_feats)


tensor([[0, 0],
        [1, 0]], device='cuda:0')

Expected behavior

The first column is “a” and its maximum value will be 0.

Environment details

I’m using Docker. I installed NVTabular via the conda commands in README.md (using CUDA 10.1):

conda env create -f=conda/environments/nvtabular_dev_cuda10.1.yml
conda activate nvtabular_dev_10.1
python -m ipykernel install --user --name=nvt
pip install -e .
jupyter notebook

Top GitHub Comments

jperez999 commented, May 18, 2021

Sorry, I thought your question was why the max is 1 instead of zero. If you are asking whether we sort columns alphabetically, you are correct: we do. That code can be found here (https://github.com/NVIDIA/NVTabular/blob/739c6ef92915ef2493ec33ae5020e0978ab2d170/nvtabular/loader/backend.py#L455-L462). This is done to guarantee the same ordering in both the tensor representation and the embedding size table. We use the same ordering function to order the embedding sizes (https://github.com/NVIDIA/NVTabular/blob/e02549a785dc3297b2065b44266226e0fe7af330/nvtabular/ops/categorify.py#L451). Let me know if this helps.
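To make the reply above concrete, here is a minimal sketch that lines up the maxima reported in the issue with the alphabetically sorted column names. It assumes the loader's ordering is equivalent to Python's plain `sorted()` on the column names, which is consistent with the numbers in the issue but is not taken from the linked source. The names `observed_max` and `restore_index` are introduced here for illustration only.

```python
# Column name -> (cardinality, embedding dim), copied from the issue.
cats = {
    'a_is_verified': (2, 16), 'b_follows_a': (2, 16),
    'tw_first_word': (1411026, 16), 'tw_second_word': (1111029, 16),
    'tw_last_word': (140149, 16), 'tw_llast_word': (232494, 16),
    'language': (65, 16), 'tweet_type': (3, 16), 'media': (16, 16),
    'a_user_id_index': (41693, 16), 'b_user_id_index': (10495, 16),
    'mfi_hashtags': (63165, 16), 'mfg_hashtags': (77144, 16),
    'mfi_links': (29328, 16), 'mfg_links': (29467, 16),
    'mfi_domains': (18090, 16), 'mfg_domains': (18329, 16),
}

# Values printed by cat_feats.max(axis=0).values in the issue.
observed_max = [1, 19490, 1, 3080, 63, 8, 3886, 13118, 3410,
                3839, 11250, 3406, 123392, 24677, 38871, 112914, 2]

# Assumption: the loader orders columns lexicographically by name.
sorted_names = sorted(cats)

# Each maximum category id should stay below its column's cardinality.
for name, max_id in zip(sorted_names, observed_max):
    cardinality, _ = cats[name]
    status = "ok" if max_id < cardinality else "MISMATCH"
    print(f"{name:16s} max={max_id:7d} cardinality={cardinality:8d} {status}")

# Column indices that restore the order in which `cats` was written.
# On the real tensor this would be used as cat_feats[:, restore_index].
restore_index = [sorted_names.index(name) for name in cats]
```

Under that assumption every maximum falls below the cardinality of the column it is paired with, which suggests the "weird order" is simply the sorted order. Indexing the tensor with `restore_index` is one possible way to get the columns back in the order they were requested.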

jperez999 commented, May 18, 2021

We are modifying this for our upcoming release, currently in a PR (https://github.com/NVIDIA/NVTabular/pull/793). We are changing the loader to mimic the TensorFlow dataloader output, which provides a dictionary-based structure with the following possible key/value signatures:

    “column name”: tensor
    “column name”: (tensor, offsets)
    “column name”: sparse-tensor

This will allow the user to dictate any ordering they may want to specify.
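As a rough illustration of why a dictionary-keyed batch removes the ordering problem, the sketch below selects columns by name in whatever order the caller asks for. It does not use the NVTabular API: `batch` is a hypothetical stand-in that maps column names to plain Python lists instead of tensors, and `select_columns` is a helper invented for this example.

```python
# Hypothetical dictionary-style batch: column name -> values for that column.
# Plain lists stand in for tensors so the sketch runs without a GPU.
batch = {"a": [0, 1], "z": [0, 0]}


def select_columns(batch, order):
    """Return the batch as rows, with columns arranged as listed in `order`."""
    columns = [batch[name] for name in order]
    # zip(*columns) turns the per-column lists into per-row tuples.
    return [list(row) for row in zip(*columns)]


# The caller, not the loader, decides that "z" comes before "a".
rows = select_columns(batch, ["z", "a"])
print(rows)  # [[0, 0], [0, 1]]
```

With real tensors the same idea would apply to stacking the per-column tensors in the chosen order, but the exact calls depend on the framework and on the released loader's output format.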