
[BUG] [flytekit] StructuredDataset handling fails when using Azure Blob Storage


Describe the bug

The StructuredDataset implementation enabled by default in https://github.com/flyteorg/flytekit/pull/885 lacks support for Azure Blob Storage, resulting in an error when handling e.g. a pd.DataFrame with Azure-backed storage.

[3/3] currentAttempt done. Last Error: SYSTEM::Traceback (most recent call last):

      File "/usr/local/lib/python3.8/dist-packages/flytekit/exceptions/scopes.py", line 165, in system_entry_point
        return wrapped(*args, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/flytekit/core/base_task.py", line 474, in dispatch_execute
        native_inputs = TypeEngine.literal_map_to_kwargs(exec_ctx, input_literal_map, self.python_interface.inputs)
      File "/usr/local/lib/python3.8/dist-packages/flytekit/core/type_engine.py", line 756, in literal_map_to_kwargs
        return {k: TypeEngine.to_python_value(ctx, lm.literals[k], v) for k, v in python_types.items()}
      File "/usr/local/lib/python3.8/dist-packages/flytekit/core/type_engine.py", line 756, in <dictcomp>
        return {k: TypeEngine.to_python_value(ctx, lm.literals[k], v) for k, v in python_types.items()}
      File "/usr/local/lib/python3.8/dist-packages/flytekit/core/type_engine.py", line 720, in to_python_value
        return transformer.to_python_value(ctx, lv, expected_python_type)
      File "/usr/local/lib/python3.8/dist-packages/flytekit/types/structured/structured_dataset.py", line 589, in to_python_value
        return self.open_as(ctx, sd_literal, expected_python_type, metad)
      File "/usr/local/lib/python3.8/dist-packages/flytekit/types/structured/structured_dataset.py", line 668, in open_as
        decoder = self.get_decoder(df_type, protocol, sd.metadata.structured_dataset_type.format)
      File "/usr/local/lib/python3.8/dist-packages/flytekit/types/structured/structured_dataset.py", line 363, in get_decoder
        return cls._finder(StructuredDatasetTransformerEngine.DECODERS, df_type, protocol, format)
      File "/usr/local/lib/python3.8/dist-packages/flytekit/types/structured/structured_dataset.py", line 355, in _finder
        raise ValueError(f"Failed to find a handler for {df_type}, protocol {protocol}, fmt {format}")

Message:

    Failed to find a handler for <class 'pandas.core.frame.DataFrame'>, protocol abfs, fmt parquet

The workflows in question use flytekit v1.0.5.
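The error comes from a handler lookup keyed on dataframe type, storage protocol, and format. The following is an illustrative sketch (the names `DECODERS`, `register_decoder`, and `get_decoder` are simplifications, not flytekit's actual internals) of why the lookup fails: flytekit v1.0.5 only registers pandas/parquet decoders for its built-in protocols (e.g. s3, gs, file), so an abfs lookup finds nothing:

```python
# Hypothetical sketch of the (df_type, protocol, fmt) decoder lookup;
# illustrative only, not flytekit's real implementation.
DECODERS = {}

def register_decoder(df_type, protocol, fmt, decoder):
    DECODERS[(df_type, protocol, fmt)] = decoder

def get_decoder(df_type, protocol, fmt):
    try:
        return DECODERS[(df_type, protocol, fmt)]
    except KeyError:
        raise ValueError(
            f"Failed to find a handler for {df_type}, protocol {protocol}, fmt {fmt}"
        )

# Only built-in protocols are registered by default:
register_decoder("DataFrame", "s3", "parquet", lambda blob: blob)

get_decoder("DataFrame", "s3", "parquet")  # resolves fine
try:
    get_decoder("DataFrame", "abfs", "parquet")
except ValueError as err:
    print(err)  # mirrors the error message in the traceback above
```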

Expected behavior

StructuredDatasets/pd.DataFrames are handled correctly on Azure, i.e. using the abfs protocol/adlfs (via stow). Per #2684, users should also not have to supply the protocol themselves when using Azure Blob Storage; I assume that is covered by the changes suggested in that issue as well.
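The expectation that users need not supply the protocol themselves is reasonable, since the protocol can be derived from the raw output URI's scheme. A minimal sketch (the helper name and the example paths are made up for illustration):

```python
from urllib.parse import urlparse

def protocol_from_uri(uri: str) -> str:
    """Derive a storage protocol from a URI scheme, defaulting to local files."""
    scheme = urlparse(uri).scheme
    return scheme or "file"

protocol_from_uri("abfs://container/nested/out.parquet")  # -> "abfs"
protocol_from_uri("s3://bucket/out.parquet")              # -> "s3"
protocol_from_uri("/tmp/out.parquet")                     # -> "file"
```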

Additional context to reproduce

  1. Set up Flyte instance using Azure Blob Storage as its storage backend
  2. Register workflow handling pd.DataFrames using flytekit v1.0.5
  3. Try to execute workflow

Example regression test we’ve added to our Flyte test suite, covering our common use cases:

import pandas as pd
from flytekit import task, workflow


_START_RANGE_VALUE = 0
_END_RANGE_VALUE = 100
_DUMMY_DF = pd.DataFrame(
    {
        "quadkeys": list(range(_START_RANGE_VALUE, _END_RANGE_VALUE)),
    }
)


@task()
def get_pandas_dataframe() -> pd.DataFrame:
    return _DUMMY_DF


@task()
def consume_pandas_dataframe(input_df: pd.DataFrame) -> None:
    for quadkey in input_df["quadkeys"]:
        assert quadkey > -1  # quadkeys are ints here; len() would raise TypeError


@task()
def consume_pandas_dataframe_return_str(input_df: pd.DataFrame) -> str:
    return_val = ""
    for quadkey in input_df["quadkeys"]:
        return_val += str(quadkey)
    return return_val


@task()
def consume_pandas_dataframe_return_pandas_dataframe(input_df: pd.DataFrame) -> pd.DataFrame:
    values = []
    for quadkey in list(reversed(input_df["quadkeys"])):
        values.append(str(quadkey))
    return pd.DataFrame({"quadkeys": values})


@workflow
def test_pandas_dataframe_consumption_and_returning() -> None:
    df1 = get_pandas_dataframe()
    consume_pandas_dataframe(input_df=df1)
    consume_pandas_dataframe_return_str(input_df=df1)
    consume_pandas_dataframe_return_pandas_dataframe(input_df=df1)

All three tasks above will fail with the mentioned error message.

Screenshots

No response

Are you sure this issue hasn’t been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

2 reactions
pingsutw commented, Jul 21, 2022

@MorpheusXAUT I’m trying to fix it ASAP.

1 reaction
wild-endeavor commented, Jul 21, 2022

merging the other pr soon.
