
errors working with parquet files


fyi… @hayesgb

Hi, I'm working on adding adlfs support to our open source framework mlrun and ran into a few issues:

  1. For some Parquet files I get the error below when trying to read. I can read the exact same file with a local pd.read_parquet(), and I managed to do the same from a different environment with the same versions of pandas, pyarrow, fsspec, and adlfs (see the diagnostic sketch after the environment list below).

I do:

import fsspec
import pandas as pd

fs = fsspec.filesystem('az', **storage_options)
with fs.open('az://<blob-container>/data/labels.parquet', "rb") as f:
    df = pd.read_parquet(f)

and get the error:

OSError                                   Traceback (most recent call last)
<ipython-input-160-7a509f1c55af> in <module>
      2 fs = fsspec.filesystem('az', **storage_options)
      3 with fs.open('az://XXX/data/labels.parquet', "rb") as f:
----> 4     df = pd.read_parquet(f)
      5 df

/conda/lib/python3.7/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, **kwargs)
    315     """
    316     impl = get_engine(engine)
--> 317     return impl.read(path, columns=columns, **kwargs)

/conda/lib/python3.7/site-packages/pandas/io/parquet.py in read(self, path, columns, **kwargs)
    140         kwargs["use_pandas_metadata"] = True
    141         result = self.api.parquet.read_table(
--> 142             path, columns=columns, filesystem=fs, **kwargs
    143         ).to_pandas()
    144         if should_close:

/conda/lib/python3.7/site-packages/pyarrow/parquet.py in read_table(source, columns, use_threads, metadata, use_pandas_metadata, memory_map, read_dictionary, filesystem, filters, buffer_size, partitioning, use_legacy_dataset, ignore_prefixes)
   1571                 buffer_size=buffer_size,
   1572                 filters=filters,
-> 1573                 ignore_prefixes=ignore_prefixes,
   1574             )
   1575         except ImportError:

/conda/lib/python3.7/site-packages/pyarrow/parquet.py in __init__(self, path_or_paths, filesystem, filters, partitioning, read_dictionary, buffer_size, memory_map, ignore_prefixes, **kwargs)
   1409                 fragment = parquet_format.make_fragment(path_or_paths)
   1410                 self._dataset = ds.FileSystemDataset(
-> 1411                     [fragment], schema=fragment.physical_schema,
   1412                     format=parquet_format
   1413                 )

/conda/lib/python3.7/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Fragment.physical_schema.__get__()

/conda/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

/conda/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

OSError: Could not open parquet input source '<Buffer>': Invalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
  2. I try to write a DataFrame directly to a blob URL:

df.to_parquet(blob_path, storage_options={'account_name': "XXXX", 'account_key': "XXX"})

and this fails with:

/conda/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    197                 else:
    198                     kwargs[new_arg_name] = new_arg_value
--> 199             return func(*args, **kwargs)
    200 
    201         return cast(F, wrapper)

/conda/lib/python3.7/site-packages/pandas/core/frame.py in to_parquet(self, path, engine, compression, index, partition_cols, **kwargs)
   2370             index=index,
   2371             partition_cols=partition_cols,
-> 2372             **kwargs,
   2373         )
   2374 

/conda/lib/python3.7/site-packages/pandas/io/parquet.py in to_parquet(df, path, engine, compression, index, partition_cols, **kwargs)
    274         index=index,
    275         partition_cols=partition_cols,
--> 276         **kwargs,
    277     )
    278 

/conda/lib/python3.7/site-packages/pandas/io/parquet.py in write(self, df, path, compression, index, partition_cols, **kwargs)
    106             import fsspec.core
    107 
--> 108             fs, path = fsspec.core.url_to_fs(path)
    109             kwargs["filesystem"] = fs
    110         else:

/conda/lib/python3.7/site-packages/fsspec/core.py in url_to_fs(url, **kwargs)
    374     else:
    375         protocol, urlpath = split_protocol(url)
--> 376         fs = filesystem(protocol, **kwargs)
    377         urlpath = fs._strip_protocol(url)
    378     return fs, urlpath

/conda/lib/python3.7/site-packages/fsspec/registry.py in filesystem(protocol, **storage_options)
    225     """
    226     cls = get_filesystem_class(protocol)
--> 227     return cls(**storage_options)

/conda/lib/python3.7/site-packages/fsspec/spec.py in __call__(cls, *args, **kwargs)
     56             return cls._cache[token]
     57         else:
---> 58             obj = super().__call__(*args, **kwargs)
     59             # Setting _fs_token here causes some static linters to complain.
     60             obj._fs_token_ = token

TypeError: __init__() missing 1 required positional argument: 'account_name'

Note that I did specify the account_name, yet it still failed; the traceback shows pandas calling fsspec.core.url_to_fs(path) without forwarding the credentials, so the filesystem is constructed with no account_name. In the same environment, the fs.open() approach above worked for many files (other than the one in the first problem).
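If the cause is indeed that this pandas version predates storage_options forwarding (the keyword was added to DataFrame.to_parquet in pandas 1.2.0), one workaround is to create the filesystem yourself and write through pyarrow. A minimal sketch, assuming df and storage_options as above and a hypothetical destination path:

import fsspec
import pyarrow as pa
import pyarrow.parquet as pq

# Build the filesystem with explicit credentials so pandas never has to
# construct one from the URL on its own.
fs = fsspec.filesystem("az", **storage_options)
with fs.open("az://<blob-container>/data/out.parquet", "wb") as f:
    pq.write_table(pa.Table.from_pandas(df), f)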

As a side note, it would be very useful if adlfs could optionally pick up those credentials, or AZURE_STORAGE_CONNECTION_STRING, from environment variables the way s3fs/boto do.
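A manual stopgap in the meantime (not built-in adlfs behavior) is to read the variable yourself and pass it through; a sketch, assuming adlfs accepts a connection_string keyword:

import os
import fsspec

# Emulate the s3fs behavior by hand: pull the connection string from the
# environment and hand it to the filesystem explicitly.
fs = fsspec.filesystem(
    "az",
    connection_string=os.environ["AZURE_STORAGE_CONNECTION_STRING"],
)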

My environment includes:

  • fsspec 0.8.3
  • pandas 1.1.3
  • adlfs 0.6.0
  • pyarrow 1.0.1
  • python 3.7

All installed with pip.
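For the first problem, a quick way to tell a corrupted blob apart from a broken read path: a valid Parquet file both starts and ends with the magic bytes b"PAR1", so the footer can be checked directly through adlfs and compared against the known-good local copy. A diagnostic sketch, using the same hypothetical path and credentials as above:

import fsspec

fs = fsspec.filesystem("az", **storage_options)
with fs.open("az://<blob-container>/data/labels.parquet", "rb") as f:
    head = f.read(4)   # a real Parquet file begins with b"PAR1"
    f.seek(-4, 2)      # jump to four bytes before the end of the file
    tail = f.read(4)   # ... and ends with b"PAR1" as well
print(head, tail)

If the local file has the magic bytes but the remote read does not, adlfs is returning the wrong bytes, which would match the symptom of the same file reading fine from another environment.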

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 5

Top GitHub Comments

1 reaction
yaronha commented, Feb 14, 2021

@hayesgb added fsspec to mlrun datastores/dataitems, will make your life easier w dask and other apps: https://github.com/mlrun/mlrun/pull/724

0 reactions
hayesgb commented, Feb 15, 2021

@yaronha — Very nice! This should enable direct use of Azure Blob for mlrun artifacts, right? Did updating Pandas fix the parquet issue?
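For context on that last question: pandas only started forwarding storage_options to fsspec in later releases (1.2.0 for DataFrame.to_parquet, 1.3.0 for read_parquet), so on those versions the original one-liner should work as written. A sketch, assuming df as above and placeholder credentials:

import pandas as pd

# Requires pandas >= 1.2: storage_options is passed through to fsspec, so
# the filesystem is constructed with credentials instead of failing.
df.to_parquet(
    "az://<blob-container>/data/labels.parquet",
    storage_options={"account_name": "XXXX", "account_key": "XXX"},
)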

