[Bug] Ray Datasets schema() function should not trigger pipeline

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Core

What happened + What you expected to happen

Calling schema() on a Ray DatasetPipeline triggers the pipeline to be read.

  • I don’t understand why schema() needs to trigger the whole pipeline.
  • This makes the following Ray Dataset use case fail:
Create DatasetPipeline
Call DatasetPipeline.schema() to get useful schema information
Call DatasetPipeline.iter_datasets() to start iterating over the data        # Fail here

Error message:

  File "/Users/chongxiaoc/git/ml-code/env/py369/lib/python3.6/site-packages/ray/data/dataset_pipeline.py", line 543, in iter_datasets
    raise RuntimeError("Pipeline cannot be read multiple times.")
RuntimeError: Pipeline cannot be read multiple times.

Versions / Dependencies

Ray version: ray-2.0.0.dev0-cp36-cp36m-manylinux2014_x86_64.whl

Reproduction script

The code is tightly integrated with our infrastructure platform, so I am providing pseudocode to reproduce the problem. I think this is a common use case.

Create DatasetPipeline
Call DatasetPipeline.schema() to get useful schema information
Call DatasetPipeline.iter_datasets() to start iterating over the data        # Fail here
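
The three steps above can be modelled without Ray. The following is a toy sketch, not Ray's code: `ReadOncePipeline` and `reproduce` are hypothetical names, and it assumes that schema() is implemented by reading the pipeline's single-use stream, which is what the reported behavior suggests.

```python
class ReadOncePipeline:
    """Toy stand-in for a pipeline that may be consumed only once (not Ray code)."""

    def __init__(self, windows):
        self._windows = list(windows)
        self._executed = False

    def iter_datasets(self):
        # Same guard as in the reported traceback: a second read is refused.
        if self._executed:
            raise RuntimeError("Pipeline cannot be read multiple times.")
        self._executed = True
        return iter(self._windows)

    def schema(self):
        # Assumption: schema() reads the pipeline to inspect the first window.
        first_window = next(self.iter_datasets())
        return sorted(first_window[0].keys())


def reproduce():
    # Step 1: create the pipeline (two windows of one row each).
    pipe = ReadOncePipeline([[{"id": 1, "value": 0.5}], [{"id": 2, "value": 1.5}]])
    # Step 2: fetch the schema; this marks the pipeline as already read.
    schema = pipe.schema()
    # Step 3: start iterating; the read-once guard now fires.
    try:
        pipe.iter_datasets()
    except RuntimeError as err:
        return schema, str(err)
    return schema, None
```

In this model the schema call succeeds, but it uses up the only permitted read, so the later iter_datasets() call raises the same RuntimeError as in the report.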

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 7 (6 by maintainers)

Top GitHub Comments

1 reaction
ericl commented, Feb 12, 2022

It should be cached until read, yep. I think this is generally fine, since the pipeline is going to get read anyway, or deleted.
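
One way to picture "cached until read" is a peek-and-hold pattern: schema() pulls the first window, keeps it, and the eventual read yields that held window before the rest. This is only an illustrative sketch under that assumption, not Ray's implementation; `PeekablePipeline` and its attributes are hypothetical names.

```python
class PeekablePipeline:
    """Toy read-once pipeline whose schema() peeks without consuming the read."""

    def __init__(self, windows):
        self._source = iter(windows)
        self._peeked = None      # first window, held back after schema() peeks at it
        self._executed = False

    def schema(self):
        if self._executed:
            raise RuntimeError("Pipeline has already been read.")
        if self._peeked is None:
            # Pull exactly one window and cache it until the pipeline is read.
            self._peeked = next(self._source)
        return sorted(self._peeked[0].keys())

    def iter_datasets(self):
        if self._executed:
            raise RuntimeError("Pipeline cannot be read multiple times.")
        self._executed = True
        # Hand the cached window over to the iterator and drop our reference.
        peeked, self._peeked = self._peeked, None
        return self._iterate(peeked)

    def _iterate(self, peeked):
        # Yield the cached window first so no data is lost, then the rest.
        if peeked is not None:
            yield peeked
        yield from self._source
```

The cost of this design is that one window stays in memory between the schema() call and the read. The sketch does not handle an empty pipeline, where peeking would have nothing to return.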

1 reaction
clarkzinzow commented, Feb 12, 2022

@chongxiaoc This would be the non-pipelined Dataset object. E.g. if you’re pipelining on an existing Dataset object, you’d get the schema from that existing Dataset object:

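import ray
from ray.data import Dataset, DatasetPipeline

ds: Dataset = ray.data.read_parquet(...)
pipe: DatasetPipeline = ds.repeat(N)
schema = ds.schema()  # read the schema from the Dataset, not from the pipeline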