Errors working with Parquet files
fyi… @hayesgb
Hi, I'm working on adding adlfs support to our open-source framework mlrun and have run into a few issues:
- For some Parquet files I get the error below when trying to read them. I can read the exact same file locally with `pd.read_parquet()`, and I also managed to read it from a different environment with the same versions of pandas, pyarrow, fsspec, and adlfs.
I do:

```python
import fsspec
import pandas as pd

fs = fsspec.filesystem('az', **storage_options)
with fs.open('az://<blob-container>/data/labels.parquet', "rb") as f:
    df = pd.read_parquet(f)
```
and get the error:
```
OSError Traceback (most recent call last)
<ipython-input-160-7a509f1c55af> in <module>
2 fs = fsspec.filesystem('az', **storage_options)
3 with fs.open('az://XXX/data/labels.parquet', "rb") as f:
----> 4 df = pd.read_parquet(f)
5 df
/conda/lib/python3.7/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, **kwargs)
315 """
316 impl = get_engine(engine)
--> 317 return impl.read(path, columns=columns, **kwargs)
/conda/lib/python3.7/site-packages/pandas/io/parquet.py in read(self, path, columns, **kwargs)
140 kwargs["use_pandas_metadata"] = True
141 result = self.api.parquet.read_table(
--> 142 path, columns=columns, filesystem=fs, **kwargs
143 ).to_pandas()
144 if should_close:
/conda/lib/python3.7/site-packages/pyarrow/parquet.py in read_table(source, columns, use_threads, metadata, use_pandas_metadata, memory_map, read_dictionary, filesystem, filters, buffer_size, partitioning, use_legacy_dataset, ignore_prefixes)
1571 buffer_size=buffer_size,
1572 filters=filters,
-> 1573 ignore_prefixes=ignore_prefixes,
1574 )
1575 except ImportError:
/conda/lib/python3.7/site-packages/pyarrow/parquet.py in __init__(self, path_or_paths, filesystem, filters, partitioning, read_dictionary, buffer_size, memory_map, ignore_prefixes, **kwargs)
1409 fragment = parquet_format.make_fragment(path_or_paths)
1410 self._dataset = ds.FileSystemDataset(
-> 1411 [fragment], schema=fragment.physical_schema,
1412 format=parquet_format
1413 )
/conda/lib/python3.7/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Fragment.physical_schema.__get__()
/conda/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
/conda/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
OSError: Could not open parquet input source '<Buffer>': Invalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
```
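The "Parquet magic bytes not found in footer" error means pyarrow received a buffer that does not end with the `PAR1` marker, so either the bytes were truncated in transit or the blob is not actually a Parquet file. A minimal diagnostic sketch (reusing the same `storage_options` as above; the path is a placeholder) that compares the bytes delivered through adlfs against the size the blob listing reports:

```python
import fsspec

# Read the raw bytes through adlfs and inspect the footer directly, to tell
# a transfer/short-read problem apart from a genuinely corrupt file.
fs = fsspec.filesystem('az', **storage_options)
path = 'az://<blob-container>/data/labels.parquet'

info = fs.info(path)                  # size according to the blob listing
with fs.open(path, 'rb') as f:
    data = f.read()                   # bytes actually delivered by adlfs

print('size reported by listing:', info['size'])
print('bytes actually read:     ', len(data))
print('last 4 bytes:', data[-4:])     # a valid Parquet file ends with b'PAR1'
```

If the two sizes disagree, or the `PAR1` footer is missing only when reading through adlfs, the problem is in the transport layer rather than in the file itself.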
- I try to do:
```python
df.to_parquet(blob_path, storage_options={'account_name': "XXXX", 'account_key': "XXX"})
```
and this fails with:
```
/conda/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
197 else:
198 kwargs[new_arg_name] = new_arg_value
--> 199 return func(*args, **kwargs)
200
201 return cast(F, wrapper)
/conda/lib/python3.7/site-packages/pandas/core/frame.py in to_parquet(self, path, engine, compression, index, partition_cols, **kwargs)
2370 index=index,
2371 partition_cols=partition_cols,
-> 2372 **kwargs,
2373 )
2374
/conda/lib/python3.7/site-packages/pandas/io/parquet.py in to_parquet(df, path, engine, compression, index, partition_cols, **kwargs)
274 index=index,
275 partition_cols=partition_cols,
--> 276 **kwargs,
277 )
278
/conda/lib/python3.7/site-packages/pandas/io/parquet.py in write(self, df, path, compression, index, partition_cols, **kwargs)
106 import fsspec.core
107
--> 108 fs, path = fsspec.core.url_to_fs(path)
109 kwargs["filesystem"] = fs
110 else:
/conda/lib/python3.7/site-packages/fsspec/core.py in url_to_fs(url, **kwargs)
374 else:
375 protocol, urlpath = split_protocol(url)
--> 376 fs = filesystem(protocol, **kwargs)
377 urlpath = fs._strip_protocol(url)
378 return fs, urlpath
/conda/lib/python3.7/site-packages/fsspec/registry.py in filesystem(protocol, **storage_options)
225 """
226 cls = get_filesystem_class(protocol)
--> 227 return cls(**storage_options)
/conda/lib/python3.7/site-packages/fsspec/spec.py in __call__(cls, *args, **kwargs)
56 return cls._cache[token]
57 else:
---> 58 obj = super().__call__(*args, **kwargs)
59 # Setting _fs_token here causes some static linters to complain.
60 obj._fs_token_ = token
TypeError: __init__() missing 1 required positional argument: 'account_name'
```
Note that I did specify `account_name`, yet it still failed; in the same environment the `fs.open()` approach above worked for many files (other than the one in the first problem).
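The traceback explains why: this pandas version calls `fsspec.core.url_to_fs(path)` without forwarding any keyword arguments, so `storage_options` never reaches `AzureBlobFileSystem.__init__`, which then fails for lack of an `account_name` (pandas only started forwarding `storage_options` in 1.2.0). A workaround sketch under that assumption: build the filesystem explicitly and hand pandas an open file object instead of a URL (the path and credentials below are placeholders):

```python
import fsspec
import pandas as pd

df = pd.DataFrame({'a': [1, 2]})  # sample frame for illustration

# Create the filesystem with explicit credentials, then write through an
# open file handle so pandas never has to construct the filesystem itself.
fs = fsspec.filesystem('az', account_name="XXXX", account_key="XXX")
with fs.open('az://<blob-container>/data/out.parquet', 'wb') as f:
    df.to_parquet(f, engine='pyarrow')
```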
As a side note, it would be very useful if adlfs could optionally pick up these credentials, or `AZURE_STORAGE_CONNECTION_STRING`, from environment variables, the way s3fs/boto do.
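For illustration, a hypothetical sketch of that fallback. Nothing like this exists in adlfs 0.6.0 as far as I know, but since `AzureBlobFileSystem` accepts a `connection_string` argument, the caller (or the library itself) could do:

```python
import os
import fsspec

# Hypothetical fallback: use AZURE_STORAGE_CONNECTION_STRING from the
# environment when no explicit credentials are given, mirroring s3fs/boto.
storage_options = {}
conn_str = os.environ.get('AZURE_STORAGE_CONNECTION_STRING')
if conn_str:
    storage_options['connection_string'] = conn_str

fs = fsspec.filesystem('az', **storage_options)
```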
My environment includes:
- fsspec 0.8.3
- pandas 1.1.3
- adlfs 0.6.0
- pyarrow 1.0.1
- python 3.7
all installed using pip.
Top GitHub Comments
@hayesgb added fsspec to mlrun datastores/dataitems; it will make your life easier with dask and other apps: https://github.com/mlrun/mlrun/pull/724
@yaronha — Very nice! This should enable direct use of Azure Blob for mlrun artifacts, right? Did updating Pandas fix the parquet issue?