
Globs don't work when reading multiple hive-format parquet files

See original GitHub issue

Hi,

We have a directory full of hive-format parquet files (generated by fastparquet, though I don’t think that’s relevant here), one per month, which together make up a full year.
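
For reference, the on-disk layout is roughly the following (the _metadata files show up in the traceback below; the _common_metadata and part file names are illustrative):

../data/processed/
    201601.parq/
        _metadata
        _common_metadata
        part.0.parquet
    201602.parq/
    ...
    201612.parq/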

I’d like to load all of these files into one dataframe using the glob functionality:

df = ddf.read_parquet("../data/processed/2016*.parq")

While this works for CSVs and the like, it doesn’t seem to work for parquet files. I think the machinery that does the glob matching doesn’t know how to deal with the fact that hive-format parquet files are already directories.
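
For what it’s worth, the glob pattern itself does expand to the month directories just fine (output illustrative):

from glob import glob

sorted(glob("../data/processed/2016*.parq"))
# ['../data/processed/201601.parq', '../data/processed/201602.parq', ...]

so the failure must be happening downstream, where the pattern is handed to the reader unexpanded.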

Loading the files individually works OK:


In [12]: df  = ddf.read_parquet("../data/processed/201601.parq")

The error we get:


In [13]: df  = ddf.read_parquet("../data/processed/2016*.parq")
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
/nfs/projects_nobackup/c/cidgrowlab/Mali/MBTA/mbtaenv/lib/python3.6/site-packages/fastparquet/api.py in __init__(self, fn, verify, open_with, sep)
     46             self.fn = fn2
---> 47             with open_with(fn2, 'rb') as f:
     48                 self._parse_header(f, verify)

/nfs/projects_nobackup/c/cidgrowlab/Mali/MBTA/mbtaenv/lib/python3.6/site-packages/dask/bytes/core.py in __enter__(self)
    313         mode = self.mode.replace('t', '').replace('b', '') + 'b'
--> 314         f = f2 = self.myopen(self.path, mode=mode)
    315         CompressFile = merge(seekable_files, compress_files)[self.compression]

/nfs/projects_nobackup/c/cidgrowlab/Mali/MBTA/mbtaenv/lib/python3.6/site-packages/dask/bytes/local.py in open(self, path, mode, **kwargs)
     62             path = path[len('file://'):]
---> 63         return open(path, mode=mode)
     64 

FileNotFoundError: [Errno 2] No such file or directory: '../data/processed/2016*.parq/_metadata/_metadata'

During handling of the above exception, another exception occurred:

FileNotFoundError                         Traceback (most recent call last)
/nfs/projects_nobackup/c/cidgrowlab/Mali/MBTA/mbtaenv/lib/python3.6/site-packages/dask/dataframe/io/parquet.py in read_parquet(path, columns, filters, categories, index, storage_options)
     74                                      open_with=myopen,
---> 75                                      sep=myopen.fs.sep)
     76     except:

/nfs/projects_nobackup/c/cidgrowlab/Mali/MBTA/mbtaenv/lib/python3.6/site-packages/fastparquet/api.py in __init__(self, fn, verify, open_with, sep)
     51             self.fn = fn
---> 52             with open_with(fn, 'rb') as f:
     53                 self._parse_header(f, verify)

/nfs/projects_nobackup/c/cidgrowlab/Mali/MBTA/mbtaenv/lib/python3.6/site-packages/dask/bytes/core.py in __enter__(self)
    313         mode = self.mode.replace('t', '').replace('b', '') + 'b'
--> 314         f = f2 = self.myopen(self.path, mode=mode)
    315         CompressFile = merge(seekable_files, compress_files)[self.compression]

/nfs/projects_nobackup/c/cidgrowlab/Mali/MBTA/mbtaenv/lib/python3.6/site-packages/dask/bytes/local.py in open(self, path, mode, **kwargs)
     62             path = path[len('file://'):]
---> 63         return open(path, mode=mode)
     64 

FileNotFoundError: [Errno 2] No such file or directory: '../data/processed/2016*.parq/_metadata'

During handling of the above exception, another exception occurred:

FileNotFoundError                         Traceback (most recent call last)
/nfs/projects_nobackup/c/cidgrowlab/Mali/MBTA/mbtaenv/lib/python3.6/site-packages/fastparquet/api.py in __init__(self, fn, verify, open_with, sep)
     46             self.fn = fn2
---> 47             with open_with(fn2, 'rb') as f:
     48                 self._parse_header(f, verify)

/nfs/projects_nobackup/c/cidgrowlab/Mali/MBTA/mbtaenv/lib/python3.6/site-packages/dask/bytes/core.py in __enter__(self)
    313         mode = self.mode.replace('t', '').replace('b', '') + 'b'
--> 314         f = f2 = self.myopen(self.path, mode=mode)
    315         CompressFile = merge(seekable_files, compress_files)[self.compression]

/nfs/projects_nobackup/c/cidgrowlab/Mali/MBTA/mbtaenv/lib/python3.6/site-packages/dask/bytes/local.py in open(self, path, mode, **kwargs)
     62             path = path[len('file://'):]
---> 63         return open(path, mode=mode)
     64 

FileNotFoundError: [Errno 2] No such file or directory: '../data/processed/2016*.parq/_metadata'

During handling of the above exception, another exception occurred:

FileNotFoundError                         Traceback (most recent call last)
<ipython-input-13-2d2b23a42644> in <module>()
----> 1 df  = ddf.read_parquet("../data/processed/2016*.parq")

/nfs/projects_nobackup/c/cidgrowlab/Mali/MBTA/mbtaenv/lib/python3.6/site-packages/dask/dataframe/io/parquet.py in read_parquet(path, columns, filters, categories, index, storage_options)
     75                                      sep=myopen.fs.sep)
     76     except:
---> 77         pf = fastparquet.ParquetFile(path, open_with=myopen, sep=myopen.fs.sep)
     78 
     79     check_column_names(pf.columns, categories)

/nfs/projects_nobackup/c/cidgrowlab/Mali/MBTA/mbtaenv/lib/python3.6/site-packages/fastparquet/api.py in __init__(self, fn, verify, open_with, sep)
     50         except (IOError, OSError):
     51             self.fn = fn
---> 52             with open_with(fn, 'rb') as f:
     53                 self._parse_header(f, verify)
     54         if all(rg.columns[0].file_path is None for rg in self.row_groups):

/nfs/projects_nobackup/c/cidgrowlab/Mali/MBTA/mbtaenv/lib/python3.6/site-packages/dask/bytes/core.py in __enter__(self)
    312     def __enter__(self):
    313         mode = self.mode.replace('t', '').replace('b', '') + 'b'
--> 314         f = f2 = self.myopen(self.path, mode=mode)
    315         CompressFile = merge(seekable_files, compress_files)[self.compression]
    316         if PY2:

/nfs/projects_nobackup/c/cidgrowlab/Mali/MBTA/mbtaenv/lib/python3.6/site-packages/dask/bytes/local.py in open(self, path, mode, **kwargs)
     61         if path.startswith('file://'):
     62             path = path[len('file://'):]
---> 63         return open(path, mode=mode)
     64 
     65     def ukey(self, path):

FileNotFoundError: [Errno 2] No such file or directory: '../data/processed/2016*.parq'

The fact that it’s looking for ../data/processed/2016*.parq/_metadata/_metadata, with the unexpanded wildcard still in the path, suggests that something kinda wonky is going on: it looks like the glob is never expanded, dask first tries path + '/_metadata', and fastparquet then appends another '_metadata' on top of that.
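
From the frames above, fastparquet's ParquetFile.__init__ appears to do roughly this (paraphrased from the api.py lines visible in the traceback; the fn2 construction is inferred from the error paths, not shown verbatim):

try:
    # First attempt: treat fn as a dataset directory and open its
    # _metadata file, e.g. '../data/processed/2016*.parq/_metadata'.
    fn2 = sep.join([fn, '_metadata'])
    self.fn = fn2
    with open_with(fn2, 'rb') as f:
        self._parse_header(f, verify)
except (IOError, OSError):
    # Fallback: open fn directly as a single parquet file.
    self.fn = fn
    with open_with(fn, 'rb') as f:
        self._parse_header(f, verify)

Dask's read_parquet wrapper (parquet.py lines 74-77 above) seems to make a similar path + '/_metadata' attempt first, which would explain the doubled _metadata in the very first error. Since the wildcard is never expanded, every attempt raises FileNotFoundError.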

Specifying the file scheme as hive to the storage engine doesn’t seem to resolve the issue, and neither does trying different combos of wildcards and slashes. As a last-ditch effort, I tried to pass a list of parquet files into read_parquet, and that doesn’t seem to be supported (even just this would be super handy).

My workaround is probably going to be a delayed for-loop read followed by a concat (a sketch is below), but I’m not sure how that would perform compared to a native glob read. Regardless, it would be super nice to have this. Thank you for reading!
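
A minimal sketch of that workaround, assuming fastparquet is available (load_month is a hypothetical helper, one partition per month):

from glob import glob

import dask.dataframe as dd
from dask import delayed
from fastparquet import ParquetFile

def load_month(path):
    # Read one hive-format month directory into a pandas DataFrame.
    return ParquetFile(path).to_pandas()

files = sorted(glob("../data/processed/2016*.parq"))
parts = [delayed(load_month)(f) for f in files]
df = dd.from_delayed(parts)

The trade-off is that dask can’t see the parquet metadata through from_delayed, so column selections won’t be pushed down into the reads automatically.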

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 15 (8 by maintainers)

Top GitHub Comments

2 reactions
martindurant commented, May 22, 2018

Also, we now support globs in dataframe.read_parquet, so I’m going to close this issue.
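
(For anyone landing here later: on a sufficiently recent dask, the original call should now work as written, e.g.

import dask.dataframe as dd

df = dd.read_parquet("../data/processed/2016*.parq")

matching each month directory under the glob.)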

0 reactions
makmanalp commented, May 23, 2018

Thanks @martindurant!
