[BUG] No schema.pbtxt File is being generated from NVTabular workflow
See original GitHub issue.

Describe the bug
I am trying to generate a schema.pbtxt file from an NVTabular workflow, using the exact same script posted here: https://github.com/NVIDIA-Merlin/NVTabular/issues/1156
import numpy as np
import pandas as pd
import nvtabular as nvt
from pathlib import Path

NUM_ROWS = 1000
long_tailed_item_distribution = np.clip(np.random.lognormal(3., 1., NUM_ROWS).astype(np.int32), 1, 50000)
# generate random item interaction features
df = pd.DataFrame(np.random.randint(70000, 80000, NUM_ROWS), columns=['session_id'])
df['item_id'] = long_tailed_item_distribution
# generate category mapping for each item-id
df['category'] = pd.cut(df['item_id'], bins=334, labels=np.arange(1, 335)).astype(np.int32)
df['timestamp/age_days'] = np.random.uniform(0, 1, NUM_ROWS)
df['timestamp/weekday/sin'] = np.random.uniform(0, 1, NUM_ROWS)
# generate day mapping for each session
map_day = dict(zip(df.session_id.unique(), np.random.randint(1, 10, size=(df.session_id.nunique()))))
df['day'] = df.session_id.map(map_day)
# Categorify categorical features
categ_feats = ['session_id', 'item_id', 'category'] >> nvt.ops.Categorify(start_index=1)
# Define Groupby Workflow
groupby_feats = categ_feats + ['day', 'timestamp/age_days', 'timestamp/weekday/sin']
# Group interaction features by session
groupby_features = groupby_feats >> nvt.ops.Groupby(
groupby_cols=["session_id"],
aggs={
"item_id": ["list", "count"],
"category": ["list"],
"day": ["first"],
"timestamp/age_days": ["list"],
'timestamp/weekday/sin': ["list"],
},
name_sep="-")
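# With name_sep="-", the Groupby output columns are expected to be:
# session_id, item_id-list, item_id-count, category-list, day-first,
# timestamp/age_days-list, timestamp/weekday/sin-list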
# Select and truncate the sequential features
sequence_features_truncated = (
    groupby_features['category-list', 'item_id-list',
                     'timestamp/age_days-list', 'timestamp/weekday/sin-list']
    >> nvt.ops.ListSlice(0, 20)
    >> nvt.ops.Rename(postfix='_trim')
)
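# After ListSlice and Rename, these become category-list_trim, item_id-list_trim,
# timestamp/age_days-list_trim and timestamp/weekday/sin-list_trim,
# each list truncated to its first 20 items.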
# Filter out sessions with length 1 (not valid for next-item prediction training and evaluation)
MINIMUM_SESSION_LENGTH = 2
selected_features = groupby_features['item_id-count', 'day-first', 'session_id'] + sequence_features_truncated
filtered_sessions = selected_features >> nvt.ops.Filter(f=lambda df: df["item_id-count"] >= MINIMUM_SESSION_LENGTH)
workflow = nvt.Workflow(filtered_sessions)
dataset = nvt.Dataset(df, cpu=False)
# Generating statistics for the features
workflow.fit(dataset)
workflow.transform(dataset).to_parquet(
'./schema',
out_files_per_proc=1,
)
schema_path = Path('./schema')
proto_schema = Schema.read_protobuf(schema_path / "schema.pbtxt")  # the Schema import is version-dependent in the Merlin stack
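A minimal way to confirm what actually gets written, using only the paths from the script above:

# List the files produced by to_parquet; schema.pbtxt never shows up here.
print(sorted(p.name for p in Path('./schema').iterdir()))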
Expected behavior
A schema.pbtxt file should be generated in the output directory. Instead, when I check the contents of the ./schema folder, the only files placed there are:
_file_list.txt  _metadata  _metadata.json  part_0.parquet
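A possible workaround, assuming a Merlin-era NVTabular where the fitted workflow exposes output_schema and merlin-core's TensorflowMetadata converter is available (both are assumptions; the module layout differs between releases), is a sketch that serializes the schema to protobuf text manually:

# Sketch: write schema.pbtxt ourselves from the fitted workflow's output schema.
# Assumes merlin-core's schema IO helpers; names may differ across versions.
from merlin.schema.io.tensorflow_metadata import TensorflowMetadata

tf_metadata = TensorflowMetadata.from_merlin_schema(workflow.output_schema)
tf_metadata.to_proto_text_file('./schema')  # should produce ./schema/schema.pbtxt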
Issue Analytics
- Created 2 years ago
- Comments: 19 (10 by maintainers)
Top GitHub Comments
Hey @karlhigley, sure. I was getting
no module named nvtabular.io.dataset
when running the training cells in the 2nd getting-started notebook, until I downgraded to nvtabular==0.10.0.
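For anyone hitting the same error, a quick way to check which package versions are active in the environment (the package names below are the usual PyPI distribution names; adjust as needed):

# Print installed versions of the packages involved, if present.
import importlib.metadata as md

for pkg in ("nvtabular", "transformers4rec", "merlin-core"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")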
Hey @jperez999 and @rnyak, sorry for the late response. I’m happy to report that I got it working. The most recent issue was running out of memory, which was solved by increasing the size of the cluster. We also had an issue with NVTabular/T4R that was solved by downgrading NVTabular. I’m able to run the tutorial cells now.
Let me know if you’d like me to give more info on anything.