[BUG] Cannot execute Workflow.fit on full Criteo Dataset located on a remote storage (Google Cloud Storage)
See original GitHub issue
Describe the bug
I cannot run Workflow.fit on the full Criteo dataset located on remote storage (Google Cloud Storage). Here is a gist with the error message: https://gist.github.com/leiterenato/678636807fe168f5709a35b1771fe5d7
Dataset location: gs://workshop-datasets/criteo-parque/ (24 files, ~12GB each).
Steps/Code to reproduce bug
- Create a cluster:
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from nvtabular.utils import device_mem_size  # import path assumed for the 21.11 merlin-training image

device_size = device_mem_size()
device_limit = int(device_limit_frac * device_size)
device_pool_size = int(device_pool_frac * device_size)
rmm_pool_size = (device_pool_size // 256) * 256
cluster = LocalCUDACluster(
    n_workers=n_workers,
    device_memory_limit=device_limit,
    rmm_pool_size=rmm_pool_size
)
client = Client(cluster)
I set n_workers equal to the number of GPUs, in this case 4.
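As an aside, n_workers can be derived from the number of visible GPUs instead of being hard-coded; the short sketch below uses pynvml for that and is my own illustration, not code from the original report.
import pynvml

pynvml.nvmlInit()
n_workers = pynvml.nvmlDeviceGetCount()  # 4 on this machine (4 x T4)
pynvml.nvmlShutdown()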
- Create the dataset definition:
import os
import fsspec
import nvtabular as nvt

fs = fsspec.filesystem('gs')
data_path = 'gs://workshop-datasets/criteo-parque/'
file_list = fs.glob(
    os.path.join(data_path, '*.parquet')
)
# fs.glob drops the protocol prefix, so add it back
file_list = [os.path.join('gs://', i) for i in file_list]
# part_mem_frac (value not shown in the report) sets the partition size as a fraction of GPU memory
dataset = nvt.Dataset(
    file_list,
    engine="parquet",
    part_size=int(part_mem_frac * device_mem_size()),
    client=client
)
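Not part of the original report, but a quick way to sanity-check how the dataset was partitioned before fitting (to_ddf() is part of the nvtabular Dataset API):
# Inspect the lazily-built dask dataframe backing the Dataset
ddf = dataset.to_ddf()
print("partitions:", ddf.npartitions)
print("columns:", list(ddf.columns))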
- Create the workflow:
from nvtabular.ops import Categorify, Clip, FillMissing, Normalize

cont_names = ["I" + str(x) for x in range(1, 14)]
cat_names = ["C" + str(x) for x in range(1, 27)]
num_buckets = 10000000
categorify_op = Categorify(max_size=num_buckets)
cat_features = cat_names >> categorify_op
cont_features = cont_names >> FillMissing() >> Clip(min_value=0) >> Normalize()
features = cat_features + cont_features + ['label']
# Create and save workflow
workflow = nvt.Workflow(features, client)
- Execute FIT:
workflow.fit(dataset)
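For completeness, a hedged sketch of the fit step followed by persisting the fitted workflow, which the "Create and save workflow" comment above alludes to; the output path is a placeholder of mine, not from the report.
# Compute Categorify/Normalize statistics across all partitions on the cluster
workflow.fit(dataset)
# Persist the fitted workflow so it can be reloaded later for transform/serving
workflow.save('criteo_workflow')  # placeholder path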
Expected behavior
I expect the workflow to fit the dataset without errors.
Environment details:
- Google Cloud
- Using the NGC merlin-training image 21.11
- 1 x N1 instance with 96 vCPUs and 360GB RAM
- 4 x T4s
Top GitHub Comments
Update: I believe I now have a pretty good understanding of the issue and will write up more information tonight or tomorrow (either here or in a separate PR). It seems that a 16GB GPU is indeed “tight” for high-cardinality columns like “C20”. In the current main branch, we need almost 6GB of memory to store the categories and counts for this column, and this can easily blow past a ~14GB limit once cudf adds temporary memory copies during computation. After a bit of experimenting, it seems we can reduce this memory footprint by more than 50%. These optimizations may do the trick for us.

I was able to reproduce this same error with the nightly cudf-21.12 and the latest NVTabular main branch. I checked whether the failure was related to device-memory constraints by running the same script on a DGX-1 (8 x V100s) but with the host and device memory limits set to 30GB and 15GB, respectively. There was no error when reading from NVMe, so we should be able to do Criteo processing on 16GB T4s. I still haven’t figured out what is causing the error on GCP, but I intend to keep digging.
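To make the memory argument above concrete, here is a rough back-of-envelope sketch; the cardinality and per-entry byte counts are my own illustrative assumptions, not numbers from the maintainer.
# Hypothetical estimate of device memory Categorify needs for one
# high-cardinality column: the unique values plus their counts must fit on the GPU.
n_uniques = 200_000_000        # assumed cardinality, for illustration only
bytes_per_unique = 8           # e.g. int64 category values
bytes_per_count = 8            # int64 frequency counts
approx_gb = n_uniques * (bytes_per_unique + bytes_per_count) / 1e9
print(f"~{approx_gb:.1f} GB before any temporary copies")  # ~3.2 GB with these assumptions
Numbers of that order, multiplied by the temporary copies cudf makes during the groupby, are consistent with the observation that a single column can push a ~14GB device-memory limit.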