
Out of memory error on workers while running Beam+Dataflow

See original GitHub issue

Describe the bug

While running the preprocessing of the natural_questions dataset (see PR #4368), we hit an issue with the “default” config (train + dev files).

Previously, we ran the preprocessing for the “dev” config (dev files only) successfully.

The train data files are larger than the dev ones, and apparently the workers run out of memory while processing them.

Any help/hint is welcome!
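For context, the stack trace below shows the job being launched through datasets-cli and the datasets Beam integration. A rough sketch of the equivalent programmatic call is shown here; the GCP project, region and bucket names are placeholders, and the exact keyword arguments may differ between datasets versions:

from apache_beam.options.pipeline_options import PipelineOptions
from datasets import load_dataset

# Placeholder GCP settings; the actual run used `datasets-cli run_beam`,
# but the same Beam pipeline options apply either way.
beam_options = PipelineOptions(
    project="my-gcp-project",
    region="us-east1",
    job_name="natural-questions-preprocessing",
    temp_location="gs://my-bucket/temp",
    staging_location="gs://my-bucket/binaries",
)

ds = load_dataset(
    "natural_questions",
    "default",
    beam_runner="DataflowRunner",
    beam_options=beam_options,
)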

Error message:

Data channel closed, unable to receive additional data from SDK sdk-0-0

Info from the Diagnostics tab:

Out of memory: Killed process 1882 (python) total-vm:6041764kB, anon-rss:3290928kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:9520kB oom_score_adj:900
The worker VM had to shut down one or more processes due to lack of memory.
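The diagnostics show a Python worker process being killed at around 3.3 GB of resident memory. A possible mitigation, sketched below with assumed values (these are standard Dataflow worker options rather than anything from this issue), is to give each worker VM more memory and to keep Dataflow from running several SDK processes per VM:

from apache_beam.options.pipeline_options import PipelineOptions

# Assumed values: a high-memory machine type raises the RAM available to each
# worker, and the no_use_multiple_sdk_containers experiment stops Dataflow from
# starting one SDK process per vCPU, which lowers total memory use per VM.
beam_options = PipelineOptions(
    runner="DataflowRunner",
    machine_type="n1-highmem-8",
    disk_size_gb=100,
    max_num_workers=16,
    experiments=["no_use_multiple_sdk_containers"],
)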

Additional information

Stack trace

Traceback (most recent call last):
  File "/home/albert_huggingface_co/natural_questions/venv/bin/datasets-cli", line 8, in <module>
    sys.exit(main())
  File "/home/albert_huggingface_co/natural_questions/venv/lib/python3.9/site-packages/datasets/commands/datasets_cli.py", line 39, in main
    service.run()
  File "/home/albert_huggingface_co/natural_questions/venv/lib/python3.9/site-packages/datasets/commands/run_beam.py", line 127, in run
    builder.download_and_prepare(
  File "/home/albert_huggingface_co/natural_questions/venv/lib/python3.9/site-packages/datasets/builder.py", line 704, in download_and_prepare
    self._download_and_prepare(
  File "/home/albert_huggingface_co/natural_questions/venv/lib/python3.9/site-packages/datasets/builder.py", line 1389, in _download_and_prepare
    pipeline_results.wait_until_finish()
  File "/home/albert_huggingface_co/natural_questions/venv/lib/python3.9/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 1667, in wait_until_finish
    raise DataflowRuntimeException(
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Data channel closed, unable to receive additional data from SDK sdk-0-0

Logs

Error message from worker: Data channel closed, unable to receive additional data from SDK sdk-0-0

Workflow failed. Causes: S30:train/ReadAllFromText/ReadAllFiles/Reshard/ReshufflePerKey/GroupByKey/Read+train/ReadAllFromText/ReadAllFiles/Reshard/ReshufflePerKey/GroupByKey/GroupByWindow+train/ReadAllFromText/ReadAllFiles/Reshard/ReshufflePerKey/FlatMap(restore_timestamps)+train/ReadAllFromText/ReadAllFiles/Reshard/RemoveRandomKeys+train/ReadAllFromText/ReadAllFiles/ReadRange+train/Map(_parse_example)+train/Encode+train/Count N. Examples+train/Get values/Values+train/Save to parquet/Write/WriteImpl/WindowInto(WindowIntoFn)+train/Save to parquet/Write/WriteImpl/WriteBundles+train/Save to parquet/Write/WriteImpl/Pair+train/Save to parquet/Write/WriteImpl/GroupByKey/Write failed., The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures. For more information, see https://cloud.google.com/dataflow/docs/guides/common-errors. The work item was attempted on these workers: beamapp-alberthuggingface-06170554-5p23-harness-t4v9 Root cause: Data channel closed, unable to receive additional data from SDK sdk-0-0, beamapp-alberthuggingface-06170554-5p23-harness-t4v9 Root cause: The worker lost contact with the service., beamapp-alberthuggingface-06170554-5p23-harness-bwsj Root cause: The worker lost contact with the service., beamapp-alberthuggingface-06170554-5p23-harness-5052 Root cause: The worker lost contact with the service.

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

1 reaction
seirasto commented, Jun 30, 2022

I asked my colleague who ran the code and he said apache beam.

He looked into it further and he just used DirectRunner. @albertvillanova
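
For reference, running the same preprocessing locally with the DirectRunner (as mentioned above) would look roughly like this; it keeps everything in local processes, so it avoids Dataflow worker OOMs at the cost of needing enough memory on the local machine:

from datasets import load_dataset

# Local Beam execution with the DirectRunner; "default" is the config from
# this issue, and beam_runner is the datasets keyword for Beam-based builders.
ds = load_dataset("natural_questions", "default", beam_runner="DirectRunner")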

1 reaction
seirasto commented, Jun 28, 2022

I asked my colleague who ran the code and he said apache beam.

Read more comments on GitHub >

Top Results From Across the Web

  • Troubleshoot Dataflow out of memory errors - Google Cloud: Several pipeline operations can cause out of memory errors. This section provides options for reducing your pipeline's memory usage. To identify the pipeline …
  • Out of memory exception in dataflow with windowing on ...: The windowed output is written to a partition in BigQuery table using Apache Beam's BigQueryIO. The job fails with OOM error. Dataflow Job …
  • Troubleshoot Slow or Stuck Jobs in Google Cloud Dataflow: Are you experiencing slowness with your jobs or your jobs getting stuck in Cloud Dataflow? Slow/stuck Dataflow jobs can be caused by a …
  • Dataflow Observability, Monitoring, and Troubleshooting: OOMs (out of memory) · Enable Vertical Autoscaling, a feature in Dataflow Prime that dynamically scales the memory available to workers …
  • FAQ · Scio - Spotify Open Source Projects: …built on top of Beam 2.x. Many users run Scio on the Dataflow runner today. How does Scio compare to Scalding…
