OOMs on seemingly simple shuffle job: mem usage greatly exceeds --memory-limit
Summary
- I’m struggling to figure out how to avoid OOMs in a seemingly simple shuffle on a ~6gb parquet.snappy dataset using 16 workers, each with 8gb mem, ~4gb memory limit, 1 proc, and 1 thread. I’m not persisting anything, and I’m ok with shuffle tasks spilling to disk as necessary (a spill-threshold config sketch follows this list).
- The OOMs cause the job to either fail after a while or complete after a really long while, nondeterministically.
- I decreased task size by increasing task count (128 -> 512), but I still observed OOMs with similar frequency.
- Plotting mem usage over time shows a tight distribution around `--memory-limit` for the first ~1/2 of the job and then large variance for the second ~1/2 of the job, during which time OOMs start happening (plots below).
- I created more headroom for this large variance by decreasing `--memory-limit` (4gb/8gb -> 2gb/8gb) and I did observe many fewer OOMs, but still 1 OOM. Moreover, 2gb/8gb impedes our ability to persist data later in this pipeline for an iterative ML algo, so this isn’t a feasible solution.
- Maybe there’s something fishy happening on the dask side here, in particular in the high variance of mem usage above `--memory-limit`? Or maybe I’m just making a dumb user error somewhere that’s easy to fix?
- Lmk if I can clarify or distill anything better!
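If spilling earlier would help here, recent dask/distributed releases expose the worker spill/pause thresholds through the config system. The keys below come from current documentation and may not exist in the distributed==1.16.2 used in this issue; the `terminate` threshold also requires a nanny, which this setup disables with `--no-nanny`. A minimal sketch:

```python
# Minimal sketch, assuming a recent dask/distributed: lower the fractions of
# --memory-limit at which a worker spills to disk, pauses, or is terminated.
# These config keys are from current releases and likely don't exist in 1.16.2.
import dask

dask.config.set({
    "distributed.worker.memory.target": 0.50,     # start spilling least-recently-used data at 50%
    "distributed.worker.memory.spill": 0.60,      # spill based on process memory at 60%
    "distributed.worker.memory.pause": 0.75,      # stop accepting new tasks at 75%
    "distributed.worker.memory.terminate": 0.90,  # nanny restarts the worker at 90% (needs a nanny)
})
```

These values are read when a worker starts, so they would need to be set where the workers run (e.g. via a dask config file or environment variables), not just in the client process.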
Setup
- 16 workers (on k8s on ec2), each running in its own docker container with 8gb mem and 1 cpu
- Workers running with ~4gb mem limit, 1 proc, and 1 thread: `DASK_COMPRESSION=zlib dask-worker --nprocs 1 --nthreads 1 --memory-limit=4e9 --no-nanny <scheduler-url>`
- Code looks like:

```python
import dask.dataframe as dd

# Read from parquet (s3)
# - 238 parts in
# - ~6.5gb total
# - Part file sizes vary 10-50mb (see plot below)
ddf_no_index = dd.read_parquet(in_path)

# Pick task/part count for output
num_parts_out = ...  # 128 or 512

# Reindex to a column of uniformly distributed uuid5 values with fixed, uniform divisions
# - npartitions=num_parts_out, via divisions=uniform_divisions[num_parts_out]
#   (one way such divisions could be built is sketched at the end of this Setup section)
ddf_indexed = ddf_no_index.set_index(
    uniformly_distributed_uuid5_column,
    drop=False,
    divisions=uniform_divisions[num_parts_out],
)

# Write to parquet (s3)
# - 128 or 512 parts out
# - ~6.6gb total (based on a successful 128-part output)
# - When 128 parts, output part files vary 54-58mb (see plot below)
# - When 512 parts, output part files should vary ~10-15mb, but I didn't let the job finish
(ddf_indexed
    .astype(...)
    .drop(ddf_indexed.index.name, axis=1)
    .to_parquet(
        out_path,
        compression='snappy',
        object_encoding=...,
        write_index=True,
    )
)
```
- Data skew looks like (plots not reproduced here): input parquet.snappy part file sizes across the 238 input parts, and output parquet.snappy part file sizes across the 128 output parts.
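The `uniform_divisions` lookup referenced in the code above isn’t shown in the issue, so the following is only a guess at how uniformly spaced divisions over the uuid space might be built; `make_uniform_uuid_divisions` is a hypothetical helper, not something from the original pipeline.

```python
# Hypothetical helper: evenly spaced division boundaries across the 128-bit uuid
# space, rendered as canonical uuid strings. set_index wants npartitions + 1 sorted
# boundaries; canonical uuid strings sort lexicographically in the same order as
# their underlying 128-bit integers, so these give roughly uniform partitions for
# uniformly distributed uuid5 keys.
import uuid

def make_uniform_uuid_divisions(npartitions):
    step = (1 << 128) // npartitions
    bounds = [i * step for i in range(npartitions)] + [(1 << 128) - 1]
    return [str(uuid.UUID(int=b)) for b in bounds]

# e.g. the lookup used in the snippet above might be built like:
uniform_divisions = {n: make_uniform_uuid_divisions(n) for n in (128, 512)}
```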
Trials
- Rows 1–2: my starting point was `num_parts_out=128` with `--memory-limit=4e9`, which fails a lot of the time but actually succeeded twice, with many OOMs and long runtimes
- Row 3: I increased the task count to `num_parts_out=512`, but saw a similar frequency of OOMs and killed the job
- Row 4: I decreased the mem limit to `--memory-limit=2e9` but still saw 1 OOM (and thus some amount of repeated work)
- Col “sys metrics”: check out the change in variance in mem usage partway through the job, after which OOMs start happening (a sketch of collecting per-worker memory samples follows the table)
- Col “task aftermath”: you can see the lost workers, all due to OOMs
- Col “task counts”: shows the number of shuffle tasks, for reference (~6–8k)
| params | outcome | task counts | task aftermath | sys metrics |
|---|---|---|---|---|
| 238 parts in, 128 parts out, 4g mem limit | 27 OOMs, 111m, success | (screenshot) | (screenshot) | (screenshot) |
| 238 parts in, 128 parts out, 4g mem limit | 10 OOMs, 47m, success | (screenshot) | (screenshot) | (screenshot) |
| 238 parts in, 512 parts out, 4g mem limit | >4 OOMs, gave up early | (screenshot) | (screenshot) | (screenshot) |
| 238 parts in, 128 parts out, 2g mem limit | 1 OOM, 56m, success | (screenshot) | (screenshot) | (screenshot) |
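For reference, the per-worker memory series behind the “sys metrics” plots can be sampled from the client side; the sketch below uses `Client.run` and psutil, which are real APIs, but the polling loop, interval, and scheduler address are placeholders rather than what was actually used for the plots.

```python
# Sketch: sample each worker's resident memory once a second from the client side.
# <scheduler-url> and the 60-sample loop are placeholders.
import time
import psutil
from distributed import Client

def worker_rss_mb():
    # Runs on each worker: resident set size of the worker process, in MB.
    return psutil.Process().memory_info().rss / 1e6

client = Client("<scheduler-url>")
samples = []
for _ in range(60):
    samples.append((time.time(), client.run(worker_rss_mb)))  # {worker address: MB}
    time.sleep(1)

# e.g. show the most memory-hungry worker in the last sample
ts, by_worker = samples[-1]
print(max(by_worker.items(), key=lambda kv: kv[1]))
```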
Versions
```
$ python --version
Python 3.6.0
$ cat requirements.txt | egrep 'dask|distributed|fastparquet'
git+https://github.com/dask/dask.git@a883f44
git+https://github.com/dask/fastparquet.git@d07d662
distributed==1.16.2
```
Top GitHub Comments
Another unsuccessful approach at solving the “4g mem limit, 8g mem, 16 workers” case:
- `psutil` reports host metrics, not container metrics
- the `/workers` page shows “memory: 64 GiB”, which is our k8s host ram, instead of the pod’s container mem (one of 4/8/16 GiB in the various trials above)
- Even though each worker respects `--memory-limit`, other parts of dask are seeing `psutil.virtual_memory().total` = 64GiB and allocating more than they would have otherwise, which is causing OOMs, e.g. `max_buffer_size` in `distributed.comm.tcp`
- I hacked `psutil` so that `virtual_memory()` returns metrics from `/sys/fs/cgroup/memory` instead of `/proc`, and verified that I saw “memory: 8 GiB” on the `/workers` page (a rough sketch of that kind of lookup follows this comment)
- But I still hit OOMs with the `psutil` hack in place
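For illustration, a cgroup-aware lookup of the kind described above might look like the sketch below; the actual patch isn’t shown in the issue, and the cgroup v1 path and fallback behavior here are assumptions.

```python
# Rough sketch (not the actual patch): prefer the container's cgroup v1 memory
# limit over psutil's host-wide total, falling back to psutil when no limit is set.
import psutil

CGROUP_V1_LIMIT = "/sys/fs/cgroup/memory/memory.limit_in_bytes"  # assumed cgroup v1 layout

def container_memory_total():
    host_total = psutil.virtual_memory().total
    try:
        with open(CGROUP_V1_LIMIT) as f:
            limit = int(f.read().strip())
        # cgroups report an enormous sentinel value when no limit is set
        if 0 < limit < host_total:
            return limit
    except (OSError, ValueError):
        pass
    return host_total

print(round(container_memory_total() / 1e9, 1), "GB")
```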
@odovad Can you please provide a minimal reproducer for your example? Thank you!
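A minimal reproducer along the lines of the original code might look like the sketch below; the column names, row counts, divisions, and local output path are all made up for illustration and may not reproduce the OOMs at this scale.

```python
# Hedged sketch of a minimal reproducer: random data keyed by uuid5-like strings,
# reindexed against fixed uniform divisions and written back out to parquet.
import uuid
import numpy as np
import pandas as pd
import dask.dataframe as dd

def make_part(n):
    keys = [str(uuid.uuid5(uuid.NAMESPACE_URL, str(i)))
            for i in np.random.randint(0, 1 << 62, n)]
    return pd.DataFrame({"key": keys, "value": np.random.randn(n)})

ddf = dd.from_pandas(
    pd.concat([make_part(50_000) for _ in range(8)], ignore_index=True),
    npartitions=8,
)

npartitions_out = 16
step = (1 << 128) // npartitions_out
divisions = ([str(uuid.UUID(int=i * step)) for i in range(npartitions_out)]
             + [str(uuid.UUID(int=(1 << 128) - 1))])

(ddf.set_index("key", drop=False, divisions=divisions)
    .to_parquet("out.parquet", compression="snappy", write_index=True))
```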