[BUG] [flytekit] StructuredDataset handling fails when using Azure Blob Storage
Describe the bug
The StructuredDataset implementation enabled by default in https://github.com/flyteorg/flytekit/pull/885 lacks support for Azure Blob Storage, resulting in an error when handling e.g. a pd.DataFrame while using Azure-backed storage.
[3/3] currentAttempt done. Last Error: SYSTEM::Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/flytekit/exceptions/scopes.py", line 165, in system_entry_point
return wrapped(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/flytekit/core/base_task.py", line 474, in dispatch_execute
native_inputs = TypeEngine.literal_map_to_kwargs(exec_ctx, input_literal_map, self.python_interface.inputs)
File "/usr/local/lib/python3.8/dist-packages/flytekit/core/type_engine.py", line 756, in literal_map_to_kwargs
return {k: TypeEngine.to_python_value(ctx, lm.literals[k], v) for k, v in python_types.items()}
File "/usr/local/lib/python3.8/dist-packages/flytekit/core/type_engine.py", line 756, in <dictcomp>
return {k: TypeEngine.to_python_value(ctx, lm.literals[k], v) for k, v in python_types.items()}
File "/usr/local/lib/python3.8/dist-packages/flytekit/core/type_engine.py", line 720, in to_python_value
return transformer.to_python_value(ctx, lv, expected_python_type)
File "/usr/local/lib/python3.8/dist-packages/flytekit/types/structured/structured_dataset.py", line 589, in to_python_value
return self.open_as(ctx, sd_literal, expected_python_type, metad)
File "/usr/local/lib/python3.8/dist-packages/flytekit/types/structured/structured_dataset.py", line 668, in open_as
decoder = self.get_decoder(df_type, protocol, sd.metadata.structured_dataset_type.format)
File "/usr/local/lib/python3.8/dist-packages/flytekit/types/structured/structured_dataset.py", line 363, in get_decoder
return cls._finder(StructuredDatasetTransformerEngine.DECODERS, df_type, protocol, format)
File "/usr/local/lib/python3.8/dist-packages/flytekit/types/structured/structured_dataset.py", line 355, in _finder
raise ValueError(f"Failed to find a handler for {df_type}, protocol {protocol}, fmt {format}")
Message:
Failed to find a handler for <class 'pandas.core.frame.DataFrame'>, protocol abfs, fmt parquet
The workflows in question use flytekit v1.0.5.
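For context, the failing `_finder` lookup is essentially a registry keyed by (dataframe type, protocol, format), and no entry exists for the abfs protocol. The following is a simplified, hypothetical sketch of that behavior (the names and structure are illustrative, not flytekit's actual implementation):

```python
# Hypothetical sketch of a decoder registry keyed by
# (dataframe type, protocol, format), mirroring the lookup that
# raises the ValueError in the traceback above.
DECODERS = {}


def register_decoder(df_type, protocol, fmt, decoder):
    DECODERS[(df_type, protocol, fmt)] = decoder


def find_decoder(df_type, protocol, fmt):
    try:
        return DECODERS[(df_type, protocol, fmt)]
    except KeyError:
        raise ValueError(
            f"Failed to find a handler for {df_type}, protocol {protocol}, fmt {fmt}"
        )


# Handlers exist for protocols such as s3/gs, but none was registered
# for abfs, so the lookup fails for Azure-backed storage:
register_decoder("pandas.DataFrame", "s3", "parquet", lambda uri: ...)
```

Looking up `("pandas.DataFrame", "abfs", "parquet")` in this sketch raises the same "Failed to find a handler" error shown in the traceback.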
Expected behavior
StructuredDatasets/pd.DataFrames are handled correctly on Azure, i.e. using the abfs protocol/adlfs (via stow).
Looking at #2684, users should also not have to supply the protocol themselves when using ABS; however, I assume that is covered by the changes suggested in the other issue as well.
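The protocol can in principle be derived from the dataset URI's scheme rather than supplied by the user. A minimal sketch of that idea (`protocol_from_uri` is a hypothetical helper, not a flytekit API):

```python
from urllib.parse import urlparse


def protocol_from_uri(uri: str) -> str:
    """Derive the storage protocol from a dataset URI's scheme.

    Local paths have no scheme, so fall back to "file".
    """
    scheme = urlparse(uri).scheme
    return scheme or "file"
```

With this, an Azure URI such as `abfs://container/key.parquet` yields `"abfs"` without any user-supplied protocol.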
Additional context to reproduce
- Set up a Flyte instance using Azure Blob Storage as its storage backend
- Register a workflow handling pd.DataFrames using flytekit v1.0.5
- Try to execute the workflow
Example regression test we’ve added to our Flyte test suite, covering our common use cases:
import pandas as pd
from flytekit import task, workflow

_START_RANGE_VALUE = 0
_END_RANGE_VALUE = 100

_DUMMY_DF = pd.DataFrame(
    {
        "quadkeys": list(range(_START_RANGE_VALUE, _END_RANGE_VALUE)),
    }
)


@task()
def get_pandas_dataframe() -> pd.DataFrame:
    return _DUMMY_DF


@task()
def consume_pandas_dataframe(input_df: pd.DataFrame) -> None:
    for quadkey in input_df["quadkeys"]:
        assert quadkey > -1  # quadkeys are ints, so compare values directly


@task()
def consume_pandas_dataframe_return_str(input_df: pd.DataFrame) -> str:
    return_val = ""
    for quadkey in input_df["quadkeys"]:
        return_val += str(quadkey)
    return return_val


@task()
def consume_pandas_dataframe_return_pandas_dataframe(input_df: pd.DataFrame) -> pd.DataFrame:
    values = []
    for quadkey in list(reversed(input_df["quadkeys"])):
        values.append(str(quadkey))
    return pd.DataFrame({"quadkeys": values})


@workflow
def test_pandas_dataframe_consumption_and_returning() -> None:
    df1 = get_pandas_dataframe()
    consume_pandas_dataframe(input_df=df1)
    consume_pandas_dataframe_return_str(input_df=df1)
    consume_pandas_dataframe_return_pandas_dataframe(input_df=df1)
All three consuming tasks above fail with the error message quoted above.
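Since the error is raised during type resolution, before the task bodies ever run, the task logic itself can be sanity-checked locally without Flyte. A hypothetical plain-Python equivalent of the three consumers, confirming the failure lies in the storage layer rather than the tasks:

```python
# Plain-Python equivalent of the task bodies above (no Flyte, no pandas),
# to show the logic itself is sound.
quadkeys = list(range(0, 100))

# consume_pandas_dataframe: every quadkey is non-negative
assert all(q > -1 for q in quadkeys)

# consume_pandas_dataframe_return_str: concatenate as strings
concatenated = "".join(str(q) for q in quadkeys)

# consume_pandas_dataframe_return_pandas_dataframe: reverse and stringify
reversed_vals = [str(q) for q in reversed(quadkeys)]
```

These checks pass locally; the same operations only fail when routed through the StructuredDataset machinery against abfs-backed storage.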
Screenshots
No response
Are you sure this issue hasn’t been raised already?
- Yes
Have you read the Code of Conduct?
- Yes
Issue Analytics
- Created a year ago
- Comments: 8 (8 by maintainers)
@MorpheusXAUT I’m trying to fix it ASAP. Merging the other PR soon.