Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

[Bug] Dask on Ray 1TB sort failed by S3 read failure

See original GitHub issue

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Core

What happened + What you expected to happen

The run failed within 12 seconds with this error:

(dask:generate_s3_file-b0ed8484-e106-492b-95ff-7889df04cc94 pid=1208) S3 partition 6 exists
(dask:generate_s3_file-d0d38f68-747e-4536-b6c9-28fac980a525 pid=1209) S3 partition 7 exists
(dask:generate_s3_file-ccc17522-e547-4a1a-b5a2-f15a20a691f1 pid=189, ip=10.0.3.102) S3 partition 9 exists
(dask:generate_s3_file-dd77b428-0729-499c-9f57-4016e25a3480 pid=194, ip=10.0.3.93) S3 partition 8 exists
(dask:generate_s3_file-3d9fe851-2279-4d04-ba28-7a1ab4b7a390 pid=1210) S3 partition 4 exists
(dask:generate_s3_file-69546713-ce11-4a3f-8a17-4f4e8ee38ad6 pid=192, ip=10.0.3.102) S3 partition 2 exists
(dask:generate_s3_file-75019ef3-37c7-4675-b5d4-32a03e1eea40 pid=379, ip=10.0.3.102) S3 partition 3 exists
(dask:generate_s3_file-0f4164ae-65cb-46a8-a100-d901e069efd9 pid=371, ip=10.0.3.93) S3 partition 0 exists
(dask:generate_s3_file-4a30bc6b-677c-4df6-b86c-df85ec55858b pid=376, ip=10.0.3.102) S3 partition 1 exists
Traceback (most recent call last):
  File "dask_on_ray/dask_on_ray_sort.py", line 200, in <module>
    file_path=args.file_path,
  File "dask_on_ray/dask_on_ray_sort.py", line 112, in trial
    df = load_dataset(client, data_dir, s3_bucket, nbytes, n_partitions)
  File "dask_on_ray/dask_on_ray_sort.py", line 56, in load_dataset
    df = dd.read_parquet(filenames)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/dask/dataframe/io/parquet/core.py", line 342, in read_parquet
    **kwargs,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/dask/dataframe/io/parquet/arrow.py", line 383, in read_metadata
    kwargs,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/dask/dataframe/io/parquet/arrow.py", line 917, in _collect_dataset_info
    **_dataset_kwargs,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pyarrow/dataset.py", line 670, in dataset
    return _filesystem_dataset(source, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pyarrow/dataset.py", line 422, in _filesystem_dataset
    return factory.finish(schema)
  File "pyarrow/_dataset.pyx", line 1680, in pyarrow._dataset.DatasetFactory.finish
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/_fs.pyx", line 1179, in pyarrow._fs._cb_open_input_file
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pyarrow/fs.py", line 394, in open_input_file
    raise FileNotFoundError(path)
FileNotFoundError: core-nightly-test/df-100-0.parquet.gzip

https://console.anyscale.com/o/anyscale-internal/projects/prj_2xR6uT6t7jJuu1aCwWMsle/clusters/ses_9vwUB3udZpenx7fNeNL2GmmQ?command-history-section=command_history&user=usr_76g56tcf24ftP4qglftTqO
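
For anyone triaging this locally, here is a minimal diagnostic sketch, not part of the original benchmark script, that checks whether the failing path actually resolves through pyarrow's S3 filesystem. It assumes s3fs and AWS credentials are configured; the bucket and key are copied from the traceback, and the idea that the scheme-less path is the culprit is an assumption, not a confirmed root cause.

import dask.dataframe as dd
from pyarrow import fs

# Resolve the filesystem and path from a fully qualified URI; the bucket
# and key below come straight from the traceback above.
s3, path = fs.FileSystem.from_uri("s3://core-nightly-test/df-100-0.parquet.gzip")

# FileType.File if the object is visible, FileType.NotFound otherwise.
print(s3.get_file_info(path).type)

# Passing explicit s3:// URIs removes any ambiguity about how the bare path
# "core-nightly-test/df-100-0.parquet.gzip" is interpreted downstream.
df = dd.read_parquet("s3://core-nightly-test/df-100-0.parquet.gzip")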

Versions / Dependencies

master

Reproduction script

N/A

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

rkooo567 commented on Feb 9, 2022 (2 reactions)

Looks like it fails 5 times in a row, so there must be something here.

cc @mwtian would you have time to take a look at this issue?

rkooo567 commented on Feb 9, 2022 (0 reactions)

If you are busy, I can just take it over. I will take a look at it today!

Read more comments on GitHub.

Top Results From Across the Web

[Bug] Dask on Ray 1TB sort failing due to input file not found error
The files still seem to be in the bucket. At a high level, it seems that somehow Dataset is now interpreting the URI...
Troubleshoot Amazon S3 Batch Operations issues
Here are some common reasons that Amazon S3 Batch Operations fails or returns an error: manifest file format (CSV or JSON); manifest file ...
dask distributed memory error - Stack Overflow
The most common cause of this error is trying to collect too much data, such as occurs in the following example using dask.dataframe:...
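
The answer's own example is truncated above; a hypothetical sketch of the anti-pattern it describes, collecting an entire distributed dataframe onto the client, might look like this (the bucket path is a placeholder):

import dask.dataframe as dd

df = dd.read_parquet("s3://example-bucket/*.parquet")  # placeholder path

full = df.compute()  # materializes every partition as one pandas DataFrame on the client
peek = df.head()     # safer: computes only enough partitions for the first few rows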
Analyzing memory management and performance in Dask-on ...
The goal of this blog is to compare the memory management and performance of "Dask-on-Ray" versus Dask with its built-in scheduler.
Connect to remote data - Dask documentation
Dask can read data from a variety of data stores including local file systems, network file systems, cloud object...
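
As a minimal sketch of that interface, assuming the usual fsspec/s3fs backend (the bucket name and options are placeholders):

import dask.dataframe as dd

# s3:// URIs are dispatched to the s3fs backend through fsspec;
# storage_options is forwarded to the backend filesystem.
df = dd.read_parquet(
    "s3://example-bucket/data/*.parquet",
    storage_options={"anon": False},  # use configured AWS credentials
)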
