
[BUG] Cannot execute Workflow.fit on full Criteo Dataset located on a remote storage (Google Cloud Storage)


Describe the bug
I cannot run Workflow.fit on the full Criteo dataset located in remote storage (Google Cloud Storage). Here is a gist with the full error message: https://gist.github.com/leiterenato/678636807fe168f5709a35b1771fe5d7

Dataset location: ‘gs://workshop-datasets/criteo-parque/’ (24 files, ~12 GB each).

Steps/Code to reproduce bug

  1. Create a cluster:
from dask_cuda import LocalCUDACluster
from nvtabular.utils import device_mem_size

# device_limit_frac, device_pool_frac and n_workers are defined earlier (not shown here)
device_size = device_mem_size()                        # total memory of a single GPU
device_limit = int(device_limit_frac * device_size)    # per-worker device-memory limit
device_pool_size = int(device_pool_frac * device_size)
rmm_pool_size = (device_pool_size // 256) * 256        # round down to a multiple of 256 bytes

cluster = LocalCUDACluster(
    n_workers=n_workers,
    device_memory_limit=device_limit,
    rmm_pool_size=rmm_pool_size
)

I am setting n_workers equal to the number of GPUs, in this case 4. The client object and the fraction parameters used in these snippets are not shown; a minimal sketch of how they can be defined follows the steps below.

  2. Create the dataset definition:
import os
import fsspec
import nvtabular as nvt

# list every parquet file in the bucket and rebuild full gs:// URIs
fs = fsspec.filesystem('gs')
data_path = 'gs://workshop-datasets/criteo-parque/'
file_list = fs.glob(
    os.path.join(data_path, '*.parquet')
)
file_list = [os.path.join('gs://', i) for i in file_list]

# part_mem_frac is defined earlier (not shown here)
dataset = nvt.Dataset(
    file_list,
    engine="parquet",
    part_size=int(part_mem_frac * device_mem_size()),
    client=client
)

  3. Create the workflow:
from nvtabular.ops import Categorify, Clip, FillMissing, Normalize

cont_names = ["I" + str(x) for x in range(1, 14)]   # 13 continuous columns
cat_names = ["C" + str(x) for x in range(1, 27)]    # 26 categorical columns

num_buckets = 10000000
categorify_op = Categorify(max_size=num_buckets)
cat_features = cat_names >> categorify_op
cont_features = cont_names >> FillMissing() >> Clip(min_value=0) >> Normalize()
features = cat_features + cont_features + ['label']

# Create and save workflow
workflow = nvt.Workflow(features, client)

  4. Execute fit:
workflow.fit(dataset)
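
For completeness: the snippets above use a client object and the fraction parameters device_limit_frac, device_pool_frac and part_mem_frac without showing how they are defined. Below is a minimal sketch of those missing pieces, assuming a standard dask.distributed Client; the concrete fraction values are assumptions in line with the NVTabular Criteo examples, not numbers taken from the issue.

from dask.distributed import Client

# assumed values -- the issue does not show the actual numbers used
device_limit_frac = 0.7    # fraction of GPU memory a worker may fill before spilling
device_pool_frac = 0.8     # fraction of GPU memory reserved for the RMM pool
part_mem_frac = 0.15       # target dataset partition size as a fraction of GPU memory
n_workers = 4              # one worker per GPU

# attach a distributed client to the LocalCUDACluster created in step 1
client = Client(cluster)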

Expected behavior
I expect workflow.fit to complete successfully on the full dataset.

Environment details:

  • Google Cloud
  • Using the NGC merlin-training image 21.11
  • 1 x N1 instance with 96 vCPUs and 360 GB RAM
  • 4 x NVIDIA T4 GPUs (16 GB each)

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 13 (1 by maintainers)

Top GitHub Comments

2 reactions
rjzamora commented, Nov 19, 2021

Update: I believe I now have a pretty good understanding of the issue and will write up more information tonight or tomorrow (either here or in a separate PR). It seems that a 16 GB GPU is indeed “tight” for high-cardinality columns like “C20”. In the current main branch, we need almost 6 GB of memory to store the categories and counts for this column. This can easily blow past a ~14 GB limit when cudf adds on temporary memory copies for computation. After a bit of experimenting, it seems that we can reduce this memory footprint by >50%. These optimizations may do the trick for us.
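
To make the memory pressure concrete, here is a rough, self-contained sketch (not from the issue; the column name, row count and cardinality below are assumptions) of measuring how much device memory the unique categories and their counts of one high-cardinality string column occupy in cudf:

import cudf
import cupy as cp

# synthetic stand-in for a high-cardinality column such as "C20"
n_rows, n_uniques = 5_000_000, 1_000_000
codes = cp.random.randint(0, n_uniques, size=n_rows)
col = cudf.Series(codes).astype("str")     # string-typed categorical column

counts = col.value_counts()                # index = unique categories, values = counts
total_bytes = counts.memory_usage()        # includes the index of unique categories, as in pandas
print(f"~{total_bytes / 1e6:.1f} MB for {len(counts):,} uniques")

At the much larger cardinalities of the full Criteo dataset, with longer (hashed) string keys, this grows into the multi-GB range, in line with the ~6 GB figure quoted for “C20” above.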

2 reactions
rjzamora commented, Nov 16, 2021

I was able to reproduce this same error with the nightly cudf-21.12 and the latest NVTabular main branch. I checked whether the failure was related to device-memory constraints by running the same script on a DGX-1 (8x V100) but with the host and device memory limits set to 30 GB and 15 GB, respectively. There was no error when reading from NVMe, so we should be able to do Criteo processing on 16 GB T4s.

I still haven’t figured out what is causing the error on GCP, but I intend to keep digging.
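
For reference, a minimal sketch of how such per-worker limits can be set with dask_cuda; the worker count and limit values mirror the comment above, and everything else is an assumption rather than code from the issue:

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# emulate T4-like memory constraints on larger GPUs by capping each worker
cluster = LocalCUDACluster(
    n_workers=8,                   # one worker per V100 on the DGX-1
    memory_limit="30GB",           # host-memory limit per worker
    device_memory_limit="15GB",    # device-memory limit per worker (spill threshold)
)
client = Client(cluster)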
