[BUG] No schema.pbtxt File is being generated from NVTabular workflow
See original GitHub issue.

Describe the bug
I am trying to generate a schema.pbtxt file from an NVTabular workflow, using the exact same script posted here: https://github.com/NVIDIA-Merlin/NVTabular/issues/1156
import numpy as np
import pandas as pd
import nvtabular as nvt
from pathlib import Path

NUM_ROWS = 1000
long_tailed_item_distribution = np.clip(np.random.lognormal(3., 1., NUM_ROWS).astype(np.int32), 1, 50000)
# generate random item interaction features
df = pd.DataFrame(np.random.randint(70000, 80000, NUM_ROWS), columns=['session_id'])
df['item_id'] = long_tailed_item_distribution
# generate category mapping for each item-id
df['category'] = pd.cut(df['item_id'], bins=334, labels=np.arange(1, 335)).astype(np.int32)
df['timestamp/age_days'] = np.random.uniform(0, 1, NUM_ROWS)
df['timestamp/weekday/sin'] = np.random.uniform(0, 1, NUM_ROWS)
# generate day mapping for each session
map_day = dict(zip(df.session_id.unique(), np.random.randint(1, 10, size=(df.session_id.nunique()))))
df['day'] = df.session_id.map(map_day)
# Categorify categorical features
categ_feats = ['session_id', 'item_id', 'category'] >> nvt.ops.Categorify(start_index=1)
# Define Groupby Workflow
groupby_feats = categ_feats + ['day', 'timestamp/age_days', 'timestamp/weekday/sin']
# Group interaction features by session
groupby_features = groupby_feats >> nvt.ops.Groupby(
groupby_cols=["session_id"],
aggs={
"item_id": ["list", "count"],
"category": ["list"],
"day": ["first"],
"timestamp/age_days": ["list"],
'timestamp/weekday/sin': ["list"],
},
name_sep="-")
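# With name_sep="-", the Groupby output columns are expected to be:
# session_id, item_id-list, item_id-count, category-list, day-first,
# timestamp/age_days-list, timestamp/weekday/sin-list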
# Select and truncate the sequential features
sequence_features_truncated = (
    groupby_features['category-list', 'item_id-list',
                     'timestamp/age_days-list', 'timestamp/weekday/sin-list']
    >> nvt.ops.ListSlice(0, 20)
    >> nvt.ops.Rename(postfix='_trim')
)
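# After ListSlice and Rename, these become category-list_trim, item_id-list_trim,
# timestamp/age_days-list_trim and timestamp/weekday/sin-list_trim,
# each list truncated to its first 20 items.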
# Filter out sessions with length 1 (not valid for next-item prediction training and evaluation)
MINIMUM_SESSION_LENGTH = 2
selected_features = groupby_features['item_id-count', 'day-first', 'session_id'] + sequence_features_truncated
filtered_sessions = selected_features >> nvt.ops.Filter(f=lambda df: df["item_id-count"] >= MINIMUM_SESSION_LENGTH)
workflow = nvt.Workflow(filtered_sessions)
dataset = nvt.Dataset(df, cpu=False)
# Generating statistics for the features
workflow.fit(dataset)
workflow.transform(dataset).to_parquet(
'./schema',
out_files_per_proc=1,
)
schema_path = Path('./schema')
proto_schema = Schema.read_protobuf(schema_path / "schema.pbtxt")  # the Schema import is version-dependent in the Merlin stack
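A minimal way to confirm what actually gets written, using only the paths from the script above:

# List the files produced by to_parquet; schema.pbtxt never shows up here.
print(sorted(p.name for p in Path('./schema').iterdir()))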
Expected behavior
A schema.pbtxt file should be generated in the output directory. Instead, when I check the contents of the ./schema folder, the only files placed there are:
_file_list.txt  _metadata  _metadata.json  part_0.parquet
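A possible workaround, assuming a Merlin-era NVTabular where the fitted workflow exposes output_schema and merlin-core's TensorflowMetadata converter is available (both are assumptions; the module layout differs between releases), is a sketch that serializes the schema to protobuf text manually:

# Sketch: write schema.pbtxt ourselves from the fitted workflow's output schema.
# Assumes merlin-core's schema IO helpers; names may differ across versions.
from merlin.schema.io.tensorflow_metadata import TensorflowMetadata

tf_metadata = TensorflowMetadata.from_merlin_schema(workflow.output_schema)
tf_metadata.to_proto_text_file('./schema')  # should produce ./schema/schema.pbtxt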
Issue Analytics
- Created 2 years ago
- Comments: 19 (10 by maintainers)
Top GitHub Comments
Hey @karlhigley, sure. I was getting
no module named nvtabular.io.dataset
when running the training cells in the 2nd getting-started notebook, until I downgraded to nvtabular==0.10.0.
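For anyone hitting the same error, a quick way to check which package versions are active in the environment (the package names below are the usual PyPI distribution names; adjust as needed):

# Print installed versions of the packages involved, if present.
import importlib.metadata as md

for pkg in ("nvtabular", "transformers4rec", "merlin-core"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")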
Hey @jperez999 and @rnyak, sorry for the late response. I’m happy to report that I got it working. The most recent issue was running out of memory, which was solved by increasing the size of the cluster. We also had an issue with NVTabular/T4R that was solved by downgrading NVTabular. I’m able to run the tutorial cells now.
Let me know if you’d like me to give more info on anything.