[BUG] Cannot execute Workflow.fit on full Criteo Dataset located on a remote storage (Google Cloud Storage)
See original GitHub issue
Describe the bug
I cannot run Workflow.fit on the full Criteo dataset located on remote storage (Google Cloud Storage). Here is a gist with the error message: https://gist.github.com/leiterenato/678636807fe168f5709a35b1771fe5d7
Dataset location: gs://workshop-datasets/criteo-parque/ (24 files, ~12GB each).
Steps/Code to reproduce bug
- Create a cluster:
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from nvtabular.utils import device_mem_size  # import path assumed for the 21.11 merlin-training image

device_size = device_mem_size()
device_limit = int(device_limit_frac * device_size)
device_pool_size = int(device_pool_frac * device_size)
rmm_pool_size = (device_pool_size // 256) * 256
cluster = LocalCUDACluster(
    n_workers=n_workers,
    device_memory_limit=device_limit,
    rmm_pool_size=rmm_pool_size
)
client = Client(cluster)
I set n_workers equal to the number of GPUs, in this case 4.
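As an aside, n_workers can be derived from the number of visible GPUs instead of being hard-coded; the short sketch below uses pynvml for that and is my own illustration, not code from the original report.
import pynvml

pynvml.nvmlInit()
n_workers = pynvml.nvmlDeviceGetCount()  # 4 on this machine (4 x T4)
pynvml.nvmlShutdown()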
- Create the dataset definition:
import os
import fsspec
import nvtabular as nvt

fs = fsspec.filesystem('gs')
data_path = 'gs://workshop-datasets/criteo-parque/'
file_list = fs.glob(
    os.path.join(data_path, '*.parquet')
)
# fs.glob drops the protocol prefix, so add it back
file_list = [os.path.join('gs://', i) for i in file_list]
# part_mem_frac (value not shown in the report) sets the partition size as a fraction of GPU memory
dataset = nvt.Dataset(
    file_list,
    engine="parquet",
    part_size=int(part_mem_frac * device_mem_size()),
    client=client
)
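Not part of the original report, but a quick way to sanity-check how the dataset was partitioned before fitting (to_ddf() is part of the nvtabular Dataset API):
# Inspect the lazily-built dask dataframe backing the Dataset
ddf = dataset.to_ddf()
print("partitions:", ddf.npartitions)
print("columns:", list(ddf.columns))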
- Create the workflow:
from nvtabular.ops import Categorify, Clip, FillMissing, Normalize

cont_names = ["I" + str(x) for x in range(1, 14)]
cat_names = ["C" + str(x) for x in range(1, 27)]
num_buckets = 10000000
categorify_op = Categorify(max_size=num_buckets)
cat_features = cat_names >> categorify_op
cont_features = cont_names >> FillMissing() >> Clip(min_value=0) >> Normalize()
features = cat_features + cont_features + ['label']
# Create and save workflow
workflow = nvt.Workflow(features, client)
- Execute FIT:
workflow.fit(dataset)
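For completeness, a hedged sketch of the fit step followed by persisting the fitted workflow, which the "Create and save workflow" comment above alludes to; the output path is a placeholder of mine, not from the report.
# Compute Categorify/Normalize statistics across all partitions on the cluster
workflow.fit(dataset)
# Persist the fitted workflow so it can be reloaded later for transform/serving
workflow.save('criteo_workflow')  # placeholder path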
Expected behavior
I expect the workflow to fit the dataset without errors.
Environment details:
- Google Cloud
- Using the NGC merlin-training image 21.11
- 1 x N1 instance with 96 vCPUs and 360GB RAM
- 4 x T4s
Top GitHub Comments
Update: I believe I now have a pretty good understanding of the issue and will write up more information tonight or tomorrow (either here or in a separate PR). It seems that a 16GB GPU is indeed “tight” for high-cardinality columns like “C20”. In the current main branch, we need almost 6GB of memory to store the categories and counts for this column, and this can easily blow past a ~14GB limit once cudf adds temporary memory copies during computation. After a bit of experimenting, it seems we can reduce this memory footprint by more than 50%. These optimizations may do the trick for us.

I was able to reproduce this same error with the nightly cudf-21.12 and the latest NVTabular main branch. I checked whether the failure was related to device-memory constraints by running the same script on a DGX-1 (8 x V100s) but with the host and device memory limits set to 30GB and 15GB, respectively. There was no error when reading from NVMe, so we should be able to do Criteo processing on 16GB T4s. I still haven’t figured out what is causing the error on GCP, but I intend to keep digging.
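To make the memory argument above concrete, here is a rough back-of-envelope sketch; the cardinality and per-entry byte counts are my own illustrative assumptions, not numbers from the maintainer.
# Hypothetical estimate of device memory Categorify needs for one
# high-cardinality column: the unique values plus their counts must fit on the GPU.
n_uniques = 200_000_000        # assumed cardinality, for illustration only
bytes_per_unique = 8           # e.g. int64 category values
bytes_per_count = 8            # int64 frequency counts
approx_gb = n_uniques * (bytes_per_unique + bytes_per_count) / 1e9
print(f"~{approx_gb:.1f} GB before any temporary copies")  # ~3.2 GB with these assumptions
Numbers of that order, multiplied by the temporary copies cudf makes during the groupby, are consistent with the observation that a single column can push a ~14GB device-memory limit.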