Out of memory error on workers while running Beam+Dataflow
Describe the bug
While running the Beam preprocessing of the natural_questions dataset on Dataflow (see PR #4368), the job fails for the “default” config (train + dev files).
We previously ran the preprocessing for the “dev” config (dev files only) successfully.
The train data files are larger than the dev ones, and the workers apparently run out of memory while processing them.
Any help/hint is welcome!
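For reference, this kind of Beam-based build is typically launched along the following lines; this mirrors the datasets-cli run_beam path visible in the stack trace below, the GCP project/bucket values are placeholders, and the exact keyword arguments may differ between datasets versions:

import datasets
from apache_beam.options.pipeline_options import PipelineOptions

# Dataflow options; all values below are placeholders, not the ones used in this run.
beam_options = PipelineOptions([
    "--project=my-gcp-project",
    "--region=us-central1",
    "--temp_location=gs://my-bucket/tmp",
    "--staging_location=gs://my-bucket/staging",
])

# Assumption: the Beam-based builder accepts beam_runner/beam_options kwargs.
builder = datasets.load_dataset_builder(
    "natural_questions",
    "default",
    beam_runner="DataflowRunner",
    beam_options=beam_options,
)
builder.download_and_prepare()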
Error message:
Data channel closed, unable to receive additional data from SDK sdk-0-0
Info from the Diagnostics tab:
Out of memory: Killed process 1882 (python) total-vm:6041764kB, anon-rss:3290928kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:9520kB oom_score_adj:900
The worker VM had to shut down one or more processes due to lack of memory.
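For scale, the numbers in that OOM line put the killed SDK process at roughly 3.1 GiB resident, which is already close to the ~3.75 GB of RAM on an n1-standard-1 worker (the default machine type for batch jobs, assuming no machine type was set explicitly):

# Values taken from the OOM-killer line above (the kernel's "kB" is KiB).
anon_rss_kb = 3290928   # resident memory of the killed python process
total_vm_kb = 6041764   # its virtual address space

print(f"resident ≈ {anon_rss_kb / 1024**2:.2f} GiB")  # ≈ 3.14 GiB
print(f"virtual ≈ {total_vm_kb / 1024**2:.2f} GiB")   # ≈ 5.76 GiB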
Additional information
Stack trace
Traceback (most recent call last):
File "/home/albert_huggingface_co/natural_questions/venv/bin/datasets-cli", line 8, in <module>
sys.exit(main())
File "/home/albert_huggingface_co/natural_questions/venv/lib/python3.9/site-packages/datasets/commands/datasets_cli.py", line 39, in main
service.run()
File "/home/albert_huggingface_co/natural_questions/venv/lib/python3.9/site-packages/datasets/commands/run_beam.py", line 127, in run
builder.download_and_prepare(
File "/home/albert_huggingface_co/natural_questions/venv/lib/python3.9/site-packages/datasets/builder.py", line 704, in download_and_prepare
self._download_and_prepare(
File "/home/albert_huggingface_co/natural_questions/venv/lib/python3.9/site-packages/datasets/builder.py", line 1389, in _download_and_prepare
pipeline_results.wait_until_finish()
File "/home/albert_huggingface_co/natural_questions/venv/lib/python3.9/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 1667, in wait_until_finish
raise DataflowRuntimeException(
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Data channel closed, unable to receive additional data from SDK sdk-0-0
Logs
Error message from worker: Data channel closed, unable to receive additional data from SDK sdk-0-0
Workflow failed. Causes: S30:train/ReadAllFromText/ReadAllFiles/Reshard/ReshufflePerKey/GroupByKey/Read+train/ReadAllFromText/ReadAllFiles/Reshard/ReshufflePerKey/GroupByKey/GroupByWindow+train/ReadAllFromText/ReadAllFiles/Reshard/ReshufflePerKey/FlatMap(restore_timestamps)+train/ReadAllFromText/ReadAllFiles/Reshard/RemoveRandomKeys+train/ReadAllFromText/ReadAllFiles/ReadRange+train/Map(_parse_example)+train/Encode+train/Count N. Examples+train/Get values/Values+train/Save to parquet/Write/WriteImpl/WindowInto(WindowIntoFn)+train/Save to parquet/Write/WriteImpl/WriteBundles+train/Save to parquet/Write/WriteImpl/Pair+train/Save to parquet/Write/WriteImpl/GroupByKey/Write failed., The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures. For more information, see https://cloud.google.com/dataflow/docs/guides/common-errors. The work item was attempted on these workers: beamapp-alberthuggingface-06170554-5p23-harness-t4v9 Root cause: Data channel closed, unable to receive additional data from SDK sdk-0-0, beamapp-alberthuggingface-06170554-5p23-harness-t4v9 Root cause: The worker lost contact with the service., beamapp-alberthuggingface-06170554-5p23-harness-bwsj Root cause: The worker lost contact with the service., beamapp-alberthuggingface-06170554-5p23-harness-5052 Root cause: The worker lost contact with the service.
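The usual knobs for this kind of worker OOM on Dataflow are to give each SDK process more memory and less concurrent work. A sketch of how those worker options could be added to the pipeline options, untested for this particular job and with illustrative values:

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    # A high-memory machine type instead of the default (value is illustrative).
    "--machine_type=n1-highmem-8",
    # Fewer threads per SDK process, so fewer examples are held in memory at once.
    "--number_of_worker_harness_threads=4",
    # Dataflow experiment that runs a single SDK process per VM, giving that
    # process the whole VM's memory instead of a per-core share.
    "--experiments=no_use_multiple_sdk_containers",
    # ...plus the usual project/region/temp_location options.
])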
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
He looked into it further and he just used DirectRunner. @albertvillanova
I asked my colleague who ran the code, and he said Apache Beam.
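For completeness, the DirectRunner variant mentioned above looks roughly like this (assuming a datasets version where load_dataset accepts a beam_runner argument; note that the large train files mean a local run still needs a machine with plenty of memory and disk):

from datasets import load_dataset

# Runs the whole Beam preprocessing locally instead of on Dataflow.
nq = load_dataset("natural_questions", "default", beam_runner="DirectRunner")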