[Bug] Ray Datasets schema() function should not trigger pipeline
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Ray Core
What happened + What you expected to happen
Calling DatasetPipeline.schema() triggers a read of the pipeline.
- I don’t understand why schema() needs to trigger the whole pipeline.
- This makes the following Ray Dataset use case fail:
1. Create a DatasetPipeline
2. Call DatasetPipeline.schema() to get useful schema information
3. Call DatasetPipeline.iter_datasets() to start iterating over the data # Fails here
Error message:
  File "/Users/chongxiaoc/git/ml-code/env/py369/lib/python3.6/site-packages/ray/data/dataset_pipeline.py", line 543, in iter_datasets
    raise RuntimeError("Pipeline cannot be read multiple times.")
RuntimeError: Pipeline cannot be read multiple times.
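The failure mode can be illustrated without Ray. The following is a toy model only, not Ray's implementation: `OneShotPipeline` is a hypothetical class whose `schema()` naively reads the pipeline to inspect the first window, so the single permitted read is already used up when the caller later asks for the data.

```python
class OneShotPipeline:
    """Toy stand-in for a pipeline that may be read only once (not Ray code)."""

    def __init__(self, windows):
        self._windows = list(windows)
        self._executed = False

    def iter_datasets(self):
        # Mirror the guard seen in the traceback: a second read is refused.
        if self._executed:
            raise RuntimeError("Pipeline cannot be read multiple times.")
        self._executed = True
        return iter(self._windows)

    def schema(self):
        # Naive approach: start reading the pipeline just to look at one window.
        first_window = next(self.iter_datasets())
        return {name: type(value) for name, value in first_window[0].items()}


pipe = OneShotPipeline([[{"id": 0}, {"id": 1}], [{"id": 2}, {"id": 3}]])
pipe.schema()              # consumes the one allowed read as a side effect
try:
    pipe.iter_datasets()   # the read the caller actually wanted
except RuntimeError as err:
    print(err)             # Pipeline cannot be read multiple times.
```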
Versions / Dependencies
Ray version: ray-2.0.0.dev0-cp36-cp36m-manylinux2014_x86_64.whl
Reproduction script
The code is heavily integrated into our infrastructure platform, so I provide pseudocode below to reproduce the issue. I think it is a common case.
1. Create a DatasetPipeline
2. Call DatasetPipeline.schema() to get useful schema information
3. Call DatasetPipeline.iter_datasets() to start iterating over the data # Fails here
Anything else
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Issue Analytics
- State:
- Created 2 years ago
- Reactions:1
- Comments:7 (6 by maintainers)
Top GitHub Comments
Yep, it should be cached until read. I think this is generally fine, since the pipeline is going to get read anyway, or deleted.
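One way to picture the "cached until read" behaviour is a peek-and-cache pattern. This is a sketch under assumptions, not the code Ray uses: `PeekablePipeline` and its attributes are hypothetical, and refusing `schema()` after the read is a choice made for this toy model. The first window is pulled once for `schema()`, held in memory, and handed back as the first item when the pipeline is finally read.

```python
import itertools


class PeekablePipeline:
    """Toy model (not Ray code): schema() peeks one window and caches it."""

    def __init__(self, windows):
        self._source = iter(windows)
        self._peeked = None      # first window, held until the pipeline is read
        self._executed = False

    def schema(self):
        if self._executed:
            raise RuntimeError("Schema is unavailable after the pipeline was read.")
        if self._peeked is None:
            self._peeked = next(self._source)  # pull exactly one window
        return {name: type(value) for name, value in self._peeked[0].items()}

    def iter_datasets(self):
        if self._executed:
            raise RuntimeError("Pipeline cannot be read multiple times.")
        self._executed = True
        head = [] if self._peeked is None else [self._peeked]
        self._peeked = None  # release the cached window once it is handed over
        return itertools.chain(head, self._source)


pipe = PeekablePipeline([[{"id": 0}, {"id": 1}], [{"id": 2}, {"id": 3}]])
pipe.schema()                      # peeks and caches the first window
windows = list(pipe.iter_datasets())
print(len(windows))                # 2: the peeked window is not lost
```

The cost is that one window stays in memory between the `schema()` call and the read, which is the trade-off the comment above considers acceptable.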
@chongxiaoc This would be the non-pipelined Dataset object. E.g. if you’re pipelining on an existing Dataset object, you’d get the schema from that existing Dataset object:
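A minimal sketch of that suggestion follows. The helper name `schema_then_window` is hypothetical, and it assumes a Ray release from the period of this issue in which `Dataset.window(blocks_per_window=...)` is how the pipeline gets created; the thread itself does not say how the pipeline was built. The helper only relies on the object exposing `schema()` and `window()`.

```python
def schema_then_window(dataset, blocks_per_window):
    """Read the schema from the plain dataset, then turn it into a pipeline.

    `dataset` is expected to behave like a non-pipelined Dataset, i.e. to
    provide schema() and window(blocks_per_window=...). (Assumption: the
    pipeline is created with window(); adapt if repeat() is used instead.)
    """
    schema = dataset.schema()  # inspected on the non-pipelined object
    pipeline = dataset.window(blocks_per_window=blocks_per_window)
    return schema, pipeline


# Usage with Ray (assumes a Ray version that still provides Dataset.window):
#
#   import ray
#   ds = ray.data.range(1000)
#   schema, pipe = schema_then_window(ds, blocks_per_window=10)
#   for window_ds in pipe.iter_datasets():   # first and only read of the pipeline
#       ...
```

Because the schema is taken before the pipeline exists, the pipeline's single read is left untouched for `iter_datasets()`.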