
[BUG] No schema.pbtxt File is being generated from NVTabular workflow

See original GitHub issue

Describe the bug
I am trying to generate a schema.pbtxt file from an NVTabular workflow, using the exact same script posted here: https://github.com/NVIDIA-Merlin/NVTabular/issues/1156

import numpy as np
import pandas as pd
import nvtabular as nvt
from pathlib import Path
# Assumption: the exact home of the Schema class varies by release; in
# Transformers4Rec setups of this era it was commonly merlin_standard_lib.
from merlin_standard_lib import Schema

NUM_ROWS = 1000
# Simulate a long-tailed item popularity distribution
long_tailed_item_distribution = np.clip(np.random.lognormal(3., 1., NUM_ROWS).astype(np.int32), 1, 50000)

# generate random item interaction features 
df = pd.DataFrame(np.random.randint(70000, 80000, NUM_ROWS), columns=['session_id'])
df['item_id'] = long_tailed_item_distribution

# generate category mapping for each item-id
df['category'] = pd.cut(df['item_id'], bins=334, labels=np.arange(1, 335)).astype(np.int32)
df['timestamp/age_days'] = np.random.uniform(0, 1, NUM_ROWS)
df['timestamp/weekday/sin']= np.random.uniform(0, 1, NUM_ROWS)

# generate day mapping for each session 
map_day = dict(zip(df.session_id.unique(), np.random.randint(1, 10, size=(df.session_id.nunique()))))
df['day'] = df.session_id.map(map_day)

# Categorify categorical features
categ_feats = ['session_id', 'item_id', 'category'] >> nvt.ops.Categorify(start_index=1)

# Define Groupby Workflow
groupby_feats = categ_feats + ['day', 'timestamp/age_days', 'timestamp/weekday/sin']

# Groups interaction features by session and sorted by timestamp
groupby_features = groupby_feats >> nvt.ops.Groupby(
    groupby_cols=["session_id"],
    aggs={
        "item_id": ["list", "count"],
        "category": ["list"],
        "day": ["first"],
        "timestamp/age_days": ["list"],
        "timestamp/weekday/sin": ["list"],
    },
    name_sep="-")

# Select and truncate the sequential features
sequence_features_truncated = (
    groupby_features['category-list', 'item_id-list',
                     'timestamp/age_days-list', 'timestamp/weekday/sin-list']
    >> nvt.ops.ListSlice(0, 20)
    >> nvt.ops.Rename(postfix='_trim')
)

# Filter out sessions with length 1 (not valid for next-item prediction training and evaluation)
MINIMUM_SESSION_LENGTH = 2
selected_features = groupby_features['item_id-count', 'day-first', 'session_id'] + sequence_features_truncated
filtered_sessions = selected_features >> nvt.ops.Filter(f=lambda df: df["item_id-count"] >= MINIMUM_SESSION_LENGTH)


workflow = nvt.Workflow(filtered_sessions)
dataset = nvt.Dataset(df, cpu=False)
# Generating statistics for the features
workflow.fit(dataset)
workflow.transform(dataset).to_parquet(
    './schema',
    out_files_per_proc=1,
)

schema_path = Path('./schema')
# This read fails, because no schema.pbtxt file is ever written
proto_schema = Schema.read_protobuf(schema_path / "schema.pbtxt")

Expected behavior
A schema.pbtxt file should be written to the ./schema output folder. Instead, the only files placed in there are:

  • _file_list.txt
  • _metadata
  • _metadata.json
  • part_0.parquet
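For what it's worth, on newer Merlin-era releases the fitted workflow exposes its output schema directly, so the file can be written by hand. A minimal sketch, assuming merlin-core's TensorflowMetadata converter is available (older 0.x releases exposed the schema differently):

from merlin.schema.io.tensorflow_metadata import TensorflowMetadata

# The fitted workflow knows the schema of its transformed output columns
schema = workflow.output_schema

# Convert the Merlin schema to TensorFlow Metadata and dump it as proto text
pbtxt = TensorflowMetadata.from_merlin_schema(schema).to_proto_text()
with open('./schema/schema.pbtxt', 'w') as f:
    f.write(pbtxt)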

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 19 (10 by maintainers)

Top GitHub Comments

1 reaction
alexanderDoria commented, Mar 15, 2022

Hey @karlhigley, sure. I got "No module named 'nvtabular.io.dataset'" when running the training cells in the 2nd getting-started notebook, until I downgraded to nvtabular==0.10.0.
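If anyone wants to reproduce the version mismatch, a quick sanity check of the installed stack helps (a minimal sketch; it assumes transformers4rec is installed alongside NVTabular):

import nvtabular
import transformers4rec

# Confirm which versions are actually importable in this environment
print("nvtabular:", nvtabular.__version__)
print("transformers4rec:", transformers4rec.__version__)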

1 reaction
alexanderDoria commented, Mar 15, 2022

Hey @jperez999 and @rnyak, sorry for the late response. I'm happy to report that I got it working. The most recent issue was running out of memory, which was solved by increasing the size of the cluster. We also had a compatibility issue between NVTabular and T4R, which was solved by downgrading NVTabular. I'm able to run the tutorial cells now.

Let me know if you’d like me to give more info on anything.
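For anyone who hits the same out-of-memory failure without the option of a bigger cluster, two Dataset knobs can reduce memory pressure. A minimal sketch (the paths and sizes below are illustrative, not from this issue):

import nvtabular as nvt

# Read the input in smaller partitions instead of one large in-memory frame
dataset = nvt.Dataset("./data/*.parquet", engine="parquet", part_size="256MB")

# Or fall back to host (CPU) memory when GPU memory is the bottleneck
dataset_cpu = nvt.Dataset("./data/*.parquet", engine="parquet", cpu=True)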


