
[BUG] [flytekit] StructuredDataset handling fails when using Azure Blob Storage


Describe the bug

The StructuredDataset implementation enabled by default in https://github.com/flyteorg/flytekit/pull/885 lacks support for Azure Blob Storage, resulting in an error when handling e.g. a pd.DataFrame with Azure-backed storage.

[3/3] currentAttempt done. Last Error: SYSTEM::Traceback (most recent call last):

      File "/usr/local/lib/python3.8/dist-packages/flytekit/exceptions/scopes.py", line 165, in system_entry_point
        return wrapped(*args, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/flytekit/core/base_task.py", line 474, in dispatch_execute
        native_inputs = TypeEngine.literal_map_to_kwargs(exec_ctx, input_literal_map, self.python_interface.inputs)
      File "/usr/local/lib/python3.8/dist-packages/flytekit/core/type_engine.py", line 756, in literal_map_to_kwargs
        return {k: TypeEngine.to_python_value(ctx, lm.literals[k], v) for k, v in python_types.items()}
      File "/usr/local/lib/python3.8/dist-packages/flytekit/core/type_engine.py", line 756, in <dictcomp>
        return {k: TypeEngine.to_python_value(ctx, lm.literals[k], v) for k, v in python_types.items()}
      File "/usr/local/lib/python3.8/dist-packages/flytekit/core/type_engine.py", line 720, in to_python_value
        return transformer.to_python_value(ctx, lv, expected_python_type)
      File "/usr/local/lib/python3.8/dist-packages/flytekit/types/structured/structured_dataset.py", line 589, in to_python_value
        return self.open_as(ctx, sd_literal, expected_python_type, metad)
      File "/usr/local/lib/python3.8/dist-packages/flytekit/types/structured/structured_dataset.py", line 668, in open_as
        decoder = self.get_decoder(df_type, protocol, sd.metadata.structured_dataset_type.format)
      File "/usr/local/lib/python3.8/dist-packages/flytekit/types/structured/structured_dataset.py", line 363, in get_decoder
        return cls._finder(StructuredDatasetTransformerEngine.DECODERS, df_type, protocol, format)
      File "/usr/local/lib/python3.8/dist-packages/flytekit/types/structured/structured_dataset.py", line 355, in _finder
        raise ValueError(f"Failed to find a handler for {df_type}, protocol {protocol}, fmt {format}")

Message:

    Failed to find a handler for <class 'pandas.core.frame.DataFrame'>, protocol abfs, fmt parquet

The workflows in question use flytekit v1.0.5.
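The error comes from a handler lookup keyed on dataframe type, storage protocol, and format. The following is an illustrative sketch (the names `DECODERS`, `register_decoder`, and `get_decoder` are simplifications, not flytekit's actual internals) of why the lookup fails: flytekit v1.0.5 only registers pandas/parquet decoders for its built-in protocols (e.g. s3, gs, file), so an abfs lookup finds nothing:

```python
# Hypothetical sketch of the (df_type, protocol, fmt) decoder lookup;
# illustrative only, not flytekit's real implementation.
DECODERS = {}

def register_decoder(df_type, protocol, fmt, decoder):
    DECODERS[(df_type, protocol, fmt)] = decoder

def get_decoder(df_type, protocol, fmt):
    try:
        return DECODERS[(df_type, protocol, fmt)]
    except KeyError:
        raise ValueError(
            f"Failed to find a handler for {df_type}, protocol {protocol}, fmt {fmt}"
        )

# Only built-in protocols are registered by default:
register_decoder("DataFrame", "s3", "parquet", lambda blob: blob)

get_decoder("DataFrame", "s3", "parquet")  # resolves fine
try:
    get_decoder("DataFrame", "abfs", "parquet")
except ValueError as err:
    print(err)  # mirrors the error message in the traceback above
```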

Expected behavior

StructuredDatasets/pd.DataFrames are handled correctly on Azure, i.e. using the abfs protocol/adlfs (via stow). Per #2684, users should also not have to supply the protocol themselves when using Azure Blob Storage; I assume that is covered by the changes suggested in that issue as well.
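The expectation that users need not supply the protocol themselves is reasonable, since the protocol can be derived from the raw output URI's scheme. A minimal sketch (the helper name and the example paths are made up for illustration):

```python
from urllib.parse import urlparse

def protocol_from_uri(uri: str) -> str:
    """Derive a storage protocol from a URI scheme, defaulting to local files."""
    scheme = urlparse(uri).scheme
    return scheme or "file"

protocol_from_uri("abfs://container/nested/out.parquet")  # -> "abfs"
protocol_from_uri("s3://bucket/out.parquet")              # -> "s3"
protocol_from_uri("/tmp/out.parquet")                     # -> "file"
```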

Additional context to reproduce

  1. Set up Flyte instance using Azure Blob Storage as its storage backend
  2. Register workflow handling pd.DataFrames using flytekit v1.0.5
  3. Try to execute workflow

Example regression test we’ve added to our Flyte test suite, covering our common use cases:

import pandas as pd
from flytekit import task, workflow


_START_RANGE_VALUE = 0
_END_RANGE_VALUE = 100
_DUMMY_DF = pd.DataFrame(
    {
        "quadkeys": list(range(_START_RANGE_VALUE, _END_RANGE_VALUE)),
    }
)


@task()
def get_pandas_dataframe() -> pd.DataFrame:
    return _DUMMY_DF


@task()
def consume_pandas_dataframe(input_df: pd.DataFrame) -> None:
    for quadkey in input_df["quadkeys"]:
        assert quadkey > -1  # quadkeys are ints here; len() would raise TypeError


@task()
def consume_pandas_dataframe_return_str(input_df: pd.DataFrame) -> str:
    return_val = ""
    for quadkey in input_df["quadkeys"]:
        return_val += str(quadkey)
    return return_val


@task()
def consume_pandas_dataframe_return_pandas_dataframe(input_df: pd.DataFrame) -> pd.DataFrame:
    values = []
    for quadkey in list(reversed(input_df["quadkeys"])):
        values.append(str(quadkey))
    return pd.DataFrame({"quadkeys": values})


@workflow
def test_pandas_dataframe_consumption_and_returning() -> None:
    df1 = get_pandas_dataframe()
    consume_pandas_dataframe(input_df=df1)
    consume_pandas_dataframe_return_str(input_df=df1)
    consume_pandas_dataframe_return_pandas_dataframe(input_df=df1)

All three tasks above will fail with the mentioned error message.

Screenshots

No response

Are you sure this issue hasn’t been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

2 reactions
pingsutw commented, Jul 21, 2022

@MorpheusXAUT I’m trying to fix it ASAP.

1 reaction
wild-endeavor commented, Jul 21, 2022

merging the other pr soon.
