[Bug] Ray Datasets schema() function should not trigger pipeline

Search before asking

I searched the issues and found no similar issues.

Ray Component

Ray Core

What happened + What you expected to happen

Calling schema() on a Ray DatasetPipeline triggers a read of the pipeline.

I don’t understand why schema() needs to trigger the whole pipeline.
This makes the following Ray Datasets use case fail:

1. Create a DatasetPipeline.
2. Call DatasetPipeline.schema() to get useful schema information.
3. Call DatasetPipeline.iter_datasets() to start iterating over the data. (Fails here.)

Error message:

  File "/Users/chongxiaoc/git/ml-code/env/py369/lib/python3.6/site-packages/ray/data/dataset_pipeline.py", line 543, in iter_datasets
    raise RuntimeError("Pipeline cannot be read multiple times.")
RuntimeError: Pipeline cannot be read multiple times.
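
The failure mode can be sketched in plain Python without Ray. This is only an illustration under stated assumptions: OneShotPipeline is a hypothetical stand-in, not Ray's DatasetPipeline, and its schema() is deliberately written the naive way, consuming the single permitted read in order to inspect the first window.

```python
class OneShotPipeline:
    """Hypothetical stand-in for a pipeline that may be consumed only once.

    This is NOT Ray's implementation; it only mimics the read-once rule.
    """

    def __init__(self, windows):
        self._windows = iter(windows)
        self._read = False

    def iter_datasets(self):
        # The pipeline refuses a second read, as in the traceback above.
        if self._read:
            raise RuntimeError("Pipeline cannot be read multiple times.")
        self._read = True
        return self._windows

    def schema(self):
        # Naive approach: use up the single read to look at the first window.
        first_window = next(self.iter_datasets())
        return sorted(first_window[0].keys())


pipe = OneShotPipeline([[{"a": 1, "b": 2}], [{"a": 3, "b": 4}]])
columns = pipe.schema()  # succeeds, but marks the pipeline as read

try:
    pipe.iter_datasets()
    error = None
except RuntimeError as exc:
    error = str(exc)
```

Here schema() returns the column names, and the later iter_datasets() call raises the same RuntimeError message shown in the report.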

Versions / Dependencies

Ray version: ray-2.0.0.dev0-cp36-cp36m-manylinux2014_x86_64.whl

Reproduction script

The code is heavily integrated into our infrastructure platform, so I am providing pseudocode to reproduce the issue. I think this is a common use case.

1. Create a DatasetPipeline.
2. Call DatasetPipeline.schema() to get useful schema information.
3. Call DatasetPipeline.iter_datasets() to start iterating over the data. (Fails here.)

Anything else

No response

Are you willing to submit a PR?

Yes I am willing to submit a PR!

Top GitHub Comments

ericl commented, Feb 12, 2022

It should be cached until read, yep. I think this is generally fine since the pipeline is going to get read anyways, or deleted.
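
One way to picture "cached until read" is a peek-and-replay pattern, sketched below in plain Python. It is an assumption about the general idea, not Ray's actual code: PeekablePipeline and its attributes are hypothetical names. schema() pulls the first window once and caches it, and iter_datasets() hands that cached window back ahead of the remaining ones.

```python
import itertools


class PeekablePipeline:
    """Hypothetical read-once pipeline whose schema() does not consume it."""

    def __init__(self, windows):
        self._windows = iter(windows)
        self._first = None      # first window, cached by schema() until read
        self._peeked = False
        self._read = False

    def schema(self):
        if self._read:
            raise RuntimeError("Pipeline has already been read.")
        if not self._peeked:
            # Pull exactly one window and keep it for the later read.
            self._first = next(self._windows)
            self._peeked = True
        return sorted(self._first[0].keys())

    def iter_datasets(self):
        if self._read:
            raise RuntimeError("Pipeline cannot be read multiple times.")
        self._read = True
        if self._peeked:
            # Replay the cached window first, then release the cache.
            first, self._first = self._first, None
            return itertools.chain([first], self._windows)
        return self._windows


pipe = PeekablePipeline([[{"a": 1, "b": 2}], [{"a": 3, "b": 4}]])
columns = pipe.schema()               # peeks at the first window only
windows = list(pipe.iter_datasets())  # still yields every window, in order
```

The cost of this design is that the first window stays in memory between the schema() call and the read, which matches the comment's point that the pipeline will be read or deleted anyway.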

clarkzinzow commented, Feb 12, 2022

@chongxiaoc This would be the non-pipelined Dataset object. E.g. if you’re pipelining on an existing Dataset object, you’d get the schema from that existing Dataset object:

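import ray
from ray.data import Dataset, DatasetPipeline

ds: Dataset = ray.data.read_parquet(...)
pipe: DatasetPipeline = ds.repeat(N)  # N: number of repeats, chosen by the caller
schema = ds.schema()  # read the schema from the source Dataset, not the pipeline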
