
OOMs on seemingly simple shuffle job: mem usage greatly exceeds --memory-limit


Summary

  • I’m struggling to figure out how to avoid OOMs in a seemingly simple shuffle on a ~6gb parquet.snappy dataset using 16 workers, each with 8gb mem, ~4gb memory limit, 1 proc, and 1 thread. I’m not persisting anything, and I’m ok with shuffle tasks spilling to disk as necessary (see the spill-threshold sketch after this list).
  • The OOMs cause the job to either fail after a while or complete after a really long while, nondeterministically.
  • I decreased task size by increasing task count (128 -> 512), but I still observed OOMs with similar frequency.
  • Plotting mem usage over time shows a tight distribution around --memory-limit for the first ~1/2 of the job and then large variance for the second ~1/2 of the job, during which time OOMs start happening (plots below).
  • I created more headroom for this variance by decreasing --memory-limit (4gb/8gb -> 2gb/8gb) and saw many fewer OOMs, but still 1 OOM. Moreover, 2gb/8gb impedes our ability to persist data later in this pipeline for an iterative ML algo, so this isn’t a feasible solution.
  • Maybe something fishy is happening on the dask side here, in particular the high variance of mem usage above --memory-limit? Or maybe I’m just making a dumb user error somewhere that’s easy to fix?
  • Lmk if I can clarify or distill anything better!
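
For the spilling behavior mentioned above, a minimal sketch of how worker memory thresholds can be tuned with a current dask/distributed release is below; the fractions are relative to --memory-limit. This config layout postdates the 1.16.x stack used here, so treat it as illustrative rather than the exact knobs available in this setup.

# Sketch: worker memory thresholds as fractions of --memory-limit.
# Key names are from current distributed docs and postdate the 1.16.x
# release used in this issue; illustrative only.
import dask

dask.config.set({
    'distributed.worker.memory.target': 0.50,     # start spilling to disk
    'distributed.worker.memory.spill': 0.60,      # spill under more pressure
    'distributed.worker.memory.pause': 0.80,      # pause new task execution
    'distributed.worker.memory.terminate': 0.95,  # enforced by the nanny
})
# These need to be set in each worker's environment (config file or env vars),
# not just on the client, and the terminate threshold only applies when a
# nanny is running (the setup below uses --no-nanny).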

Setup

  • 16 workers (on k8s on ec2), each running in its own docker container with 8gb mem and 1 cpu
  • Workers running with ~4gb mem limit, 1 proc, and 1 thread:
    • DASK_COMPRESSION=zlib dask-worker --nprocs 1 --nthreads 1 --memory-limit=4e9 --no-nanny <scheduler-url>
  • Code looks like:
import dask.dataframe as dd

# Read from parquet (s3)
#   - 238 parts in
#   - ~6.5gb total
#   - Part file sizes vary 10-50mb (see plot below)
ddf_no_index = dd.read_parquet(in_path)

# Pick task/part count for output
num_parts_out = ... # 128 or 512

# Reindex to a column of uniformly distributed uuid5 values with fixed, uniform divisions
#   - npartitions=num_parts_out, via divisions=uniform_divisions[num_parts_out] (see the divisions sketch after this setup section)
ddf_indexed = ddf_no_index.set_index(
    uniformly_distributed_uuid5_column,
    drop=False,
    divisions=uniform_divisions[num_parts_out],
)

# Write to parquet (s3)
#   - 128 or 512 parts out
#   - ~6.6gb total (based on a successful 128-part output)
#   - When 128 parts, output part files vary 54-58mb (see plot below)
#   - When 512 parts, output part files should vary ~10-15mb, but I didn't let the job finish
(ddf_indexed
    .astype(...)
    .drop(ddf_indexed.index.name, axis=1)
    .to_parquet(
        out_path,
        compression='snappy',
        object_encoding=...,
        write_index=True,
    )
)
  • Data skew looks like (plots): input parquet.snappy part file sizes, 238 parts; output parquet.snappy part file sizes, 128 parts.
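
For context on the uniform_divisions lookup used in set_index above: since the index is a uniformly distributed uuid5 value, the divisions are just evenly spaced points in hex space. A rough sketch of how such a lookup might be built is below; the helper name and the 32-char hex representation are assumptions, not the actual pipeline code.

# Sketch: fixed, uniform divisions over 32-char lowercase hex uuid strings.
# Helper name and hex form are assumptions, not the actual pipeline code.
def make_uniform_hex_divisions(num_parts, digits=32):
    step = 16**digits // num_parts
    # npartitions boundaries => num_parts + 1 division points
    points = [format(i * step, f'0{digits}x') for i in range(num_parts)]
    points.append('f' * digits)  # inclusive upper bound
    return points

uniform_divisions = {n: make_uniform_hex_divisions(n) for n in (128, 512)}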

Trials

  • Rows 1–2: my starting point was num_parts_out=128 with --memory-limit=4e9, which fails a lot of the time but actually succeeded twice with many OOMs and long runtimes
  • Row 3: I increased task count to num_parts_out=512, but saw a similar frequency of OOMs and killed the job
  • Row 4: I decreased mem limit to --memory-limit=2e9 but still saw 1 OOM (and thus some amount of repeated work)
  • Col “sys metrics”: check out the change in variance in mem usage partway through the job, after which OOMs start happening (see the monitoring sketch after the table)
  • Col “task aftermath”: you can see the lost workers, all due to OOMs
  • Col “task counts”: shows the number of shuffle tasks, for reference (~6–8k)
| params | outcome | task counts | task aftermath | sys metrics |
|---|---|---|---|---|
| 238 parts in, 128 parts out, 4g mem limit | 27 OOMs, 111m, success | — | — | screenshot: datadog 4g 128 2 |
| 238 parts in, 128 parts out, 4g mem limit | 10 OOMs, 47m, success | screenshot: dask 4g 128 | screenshot: tasks 4g 128 | screenshot: datadog 4g 128 |
| 238 parts in, 512 parts out, 4g mem limit | >4 OOMs, gave up early | screenshot: dask 4g 512 | (blank) | screenshot: datadog 4g 512 |
| 238 parts in, 128 parts out, 2g mem limit | 1 OOM, 56m, success | screenshot: dask 2g 128 | screenshot: tasks 2g 128 | screenshot: datadog 2g 128 |
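
The “sys metrics” screenshots above appear to be Datadog memory plots. A rough sketch of watching per-worker RSS directly from the dask client is below; it assumes psutil is importable on the workers, and the scheduler address is a placeholder.

# Sketch: poll per-worker RSS from the client to compare against --memory-limit.
# Assumes psutil is importable on the workers; scheduler address is a placeholder.
import time
import psutil
from dask.distributed import Client

def worker_rss():
    return psutil.Process().memory_info().rss

client = Client('tcp://<scheduler-url>:8786')
for _ in range(60):                       # ~10 minutes at 10s intervals
    rss = client.run(worker_rss)          # {worker_address: rss_in_bytes}
    print({addr: round(v / 1e9, 2) for addr, v in rss.items()})  # GB
    time.sleep(10)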

Versions

$ python --version
Python 3.6.0

$ cat requirements.txt | egrep 'dask|distributed|fastparquet'
git+https://github.com/dask/dask.git@a883f44
git+https://github.com/dask/fastparquet.git@d07d662
distributed==1.16.2


Top GitHub Comments

jdanbrown commented on Jun 26, 2017 (2 reactions)

Another unsuccessful approach to solving the “4g mem limit, 8g mem, 16 workers” case:

bryanwweber commented on Mar 28, 2022 (1 reaction)

@odovad Can you please provide a minimal reproducer for your example? Thank you!
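
For anyone putting together such a reproducer for the original shuffle pattern, a rough sketch is below: synthetic uuid5-keyed data, a small local cluster standing in for the k8s workers, and the same set_index-with-divisions + to_parquet flow. The 238-in/128-out partition counts come from the issue; everything else (sizes, paths, worker count) is a placeholder.

# Sketch of a minimal reproducer for the set_index shuffle in this issue.
# Synthetic data and a local cluster; sizes and paths are placeholders.
import uuid
import numpy as np
import pandas as pd
import dask
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit='4GB')
client = Client(cluster)

def make_part(i, rows=100_000):
    # One synthetic input partition with uniformly distributed uuid5 hex keys
    return pd.DataFrame({
        'key': [uuid.uuid5(uuid.NAMESPACE_DNS, f'{i}-{j}').hex for j in range(rows)],
        'value': np.random.random(rows),
    })

ddf = dd.from_delayed([dask.delayed(make_part)(i) for i in range(238)],
                      meta=make_part(0, rows=0))

num_parts_out = 128
step = 16**32 // num_parts_out
divisions = [format(i * step, '032x') for i in range(num_parts_out)] + ['f' * 32]

ddf_indexed = ddf.set_index('key', drop=False, divisions=divisions)
ddf_indexed.drop('key', axis=1).to_parquet('repro-out.parquet',
                                           compression='snappy',
                                           write_index=True)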
