
Globs don't work when reading multiple hive-format parquet files

See original GitHub issue

Hi,

We have a directory full of hive-format parquet files (generated by fastparquet, though I don’t think that’s relevant here), one per month, which together make up a full year.
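
For reference, the on-disk layout is roughly the following (the _metadata files show up in the traceback below; the _common_metadata and part file names are illustrative):

../data/processed/
    201601.parq/
        _metadata
        _common_metadata
        part.0.parquet
    201602.parq/
    ...
    201612.parq/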

I’d like to load all of these files into one dataframe using the glob functionality:

df = ddf.read_parquet("../data/processed/2016*.parq")

While this works for CSVs and the like, it doesn’t seem to work for parquet files. I think the machinery that does the glob matching doesn’t know how to deal with the fact that hive-format parquet files are already directories.
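
For what it’s worth, the glob pattern itself does expand to the month directories just fine (output illustrative):

from glob import glob

sorted(glob("../data/processed/2016*.parq"))
# ['../data/processed/201601.parq', '../data/processed/201602.parq', ...]

so the failure must be happening downstream, where the pattern is handed to the reader unexpanded.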

Loading the files individually works OK:


In [12]: df  = ddf.read_parquet("../data/processed/201601.parq")

The error we get:


In [13]: df  = ddf.read_parquet("../data/processed/2016*.parq")
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
/nfs/projects_nobackup/c/cidgrowlab/Mali/MBTA/mbtaenv/lib/python3.6/site-packages/fastparquet/api.py in __init__(self, fn, verify, open_with, sep)
     46             self.fn = fn2
---> 47             with open_with(fn2, 'rb') as f:
     48                 self._parse_header(f, verify)

/nfs/projects_nobackup/c/cidgrowlab/Mali/MBTA/mbtaenv/lib/python3.6/site-packages/dask/bytes/core.py in __enter__(self)
    313         mode = self.mode.replace('t', '').replace('b', '') + 'b'
--> 314         f = f2 = self.myopen(self.path, mode=mode)
    315         CompressFile = merge(seekable_files, compress_files)[self.compression]

/nfs/projects_nobackup/c/cidgrowlab/Mali/MBTA/mbtaenv/lib/python3.6/site-packages/dask/bytes/local.py in open(self, path, mode, **kwargs)
     62             path = path[len('file://'):]
---> 63         return open(path, mode=mode)
     64 

FileNotFoundError: [Errno 2] No such file or directory: '../data/processed/2016*.parq/_metadata/_metadata'

During handling of the above exception, another exception occurred:

FileNotFoundError                         Traceback (most recent call last)
/nfs/projects_nobackup/c/cidgrowlab/Mali/MBTA/mbtaenv/lib/python3.6/site-packages/dask/dataframe/io/parquet.py in read_parquet(path, columns, filters, categories, index, storage_options)
     74                                      open_with=myopen,
---> 75                                      sep=myopen.fs.sep)
     76     except:

/nfs/projects_nobackup/c/cidgrowlab/Mali/MBTA/mbtaenv/lib/python3.6/site-packages/fastparquet/api.py in __init__(self, fn, verify, open_with, sep)
     51             self.fn = fn
---> 52             with open_with(fn, 'rb') as f:
     53                 self._parse_header(f, verify)

/nfs/projects_nobackup/c/cidgrowlab/Mali/MBTA/mbtaenv/lib/python3.6/site-packages/dask/bytes/core.py in __enter__(self)
    313         mode = self.mode.replace('t', '').replace('b', '') + 'b'
--> 314         f = f2 = self.myopen(self.path, mode=mode)
    315         CompressFile = merge(seekable_files, compress_files)[self.compression]

/nfs/projects_nobackup/c/cidgrowlab/Mali/MBTA/mbtaenv/lib/python3.6/site-packages/dask/bytes/local.py in open(self, path, mode, **kwargs)
     62             path = path[len('file://'):]
---> 63         return open(path, mode=mode)
     64 

FileNotFoundError: [Errno 2] No such file or directory: '../data/processed/2016*.parq/_metadata'

During handling of the above exception, another exception occurred:

FileNotFoundError                         Traceback (most recent call last)
/nfs/projects_nobackup/c/cidgrowlab/Mali/MBTA/mbtaenv/lib/python3.6/site-packages/fastparquet/api.py in __init__(self, fn, verify, open_with, sep)
     46             self.fn = fn2
---> 47             with open_with(fn2, 'rb') as f:
     48                 self._parse_header(f, verify)

/nfs/projects_nobackup/c/cidgrowlab/Mali/MBTA/mbtaenv/lib/python3.6/site-packages/dask/bytes/core.py in __enter__(self)
    313         mode = self.mode.replace('t', '').replace('b', '') + 'b'
--> 314         f = f2 = self.myopen(self.path, mode=mode)
    315         CompressFile = merge(seekable_files, compress_files)[self.compression]

/nfs/projects_nobackup/c/cidgrowlab/Mali/MBTA/mbtaenv/lib/python3.6/site-packages/dask/bytes/local.py in open(self, path, mode, **kwargs)
     62             path = path[len('file://'):]
---> 63         return open(path, mode=mode)
     64 

FileNotFoundError: [Errno 2] No such file or directory: '../data/processed/2016*.parq/_metadata'

During handling of the above exception, another exception occurred:

FileNotFoundError                         Traceback (most recent call last)
<ipython-input-13-2d2b23a42644> in <module>()
----> 1 df  = ddf.read_parquet("../data/processed/2016*.parq")

/nfs/projects_nobackup/c/cidgrowlab/Mali/MBTA/mbtaenv/lib/python3.6/site-packages/dask/dataframe/io/parquet.py in read_parquet(path, columns, filters, categories, index, storage_options)
     75                                      sep=myopen.fs.sep)
     76     except:
---> 77         pf = fastparquet.ParquetFile(path, open_with=myopen, sep=myopen.fs.sep)
     78 
     79     check_column_names(pf.columns, categories)

/nfs/projects_nobackup/c/cidgrowlab/Mali/MBTA/mbtaenv/lib/python3.6/site-packages/fastparquet/api.py in __init__(self, fn, verify, open_with, sep)
     50         except (IOError, OSError):
     51             self.fn = fn
---> 52             with open_with(fn, 'rb') as f:
     53                 self._parse_header(f, verify)
     54         if all(rg.columns[0].file_path is None for rg in self.row_groups):

/nfs/projects_nobackup/c/cidgrowlab/Mali/MBTA/mbtaenv/lib/python3.6/site-packages/dask/bytes/core.py in __enter__(self)
    312     def __enter__(self):
    313         mode = self.mode.replace('t', '').replace('b', '') + 'b'
--> 314         f = f2 = self.myopen(self.path, mode=mode)
    315         CompressFile = merge(seekable_files, compress_files)[self.compression]
    316         if PY2:

/nfs/projects_nobackup/c/cidgrowlab/Mali/MBTA/mbtaenv/lib/python3.6/site-packages/dask/bytes/local.py in open(self, path, mode, **kwargs)
     61         if path.startswith('file://'):
     62             path = path[len('file://'):]
---> 63         return open(path, mode=mode)
     64 
     65     def ukey(self, path):

FileNotFoundError: [Errno 2] No such file or directory: '../data/processed/2016*.parq'

The fact that it’s looking for ../data/processed/2016*.parq/_metadata/_metadata, with the unexpanded wildcard still in the path, suggests that something kinda wonky is going on: it looks like the glob is never expanded, dask first tries path + '/_metadata', and fastparquet then appends another '_metadata' on top of that.
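
From the frames above, fastparquet's ParquetFile.__init__ appears to do roughly this (paraphrased from the api.py lines visible in the traceback; the fn2 construction is inferred from the error paths, not shown verbatim):

try:
    # First attempt: treat fn as a dataset directory and open its
    # _metadata file, e.g. '../data/processed/2016*.parq/_metadata'.
    fn2 = sep.join([fn, '_metadata'])
    self.fn = fn2
    with open_with(fn2, 'rb') as f:
        self._parse_header(f, verify)
except (IOError, OSError):
    # Fallback: open fn directly as a single parquet file.
    self.fn = fn
    with open_with(fn, 'rb') as f:
        self._parse_header(f, verify)

Dask's read_parquet wrapper (parquet.py lines 74-77 above) seems to make a similar path + '/_metadata' attempt first, which would explain the doubled _metadata in the very first error. Since the wildcard is never expanded, every attempt raises FileNotFoundError.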

Specifying the file scheme as hive to the storage engine doesn’t seem to resolve the issue, and neither does trying different combos of wildcards and slashes. As a last-ditch effort, I tried to pass a list of parquet files into read_parquet, and that doesn’t seem to be supported (even just this would be super handy).

My workaround is probably going to be a delayed for-loop read followed by a concat (a sketch is below), but I’m not sure how that would perform compared to a native glob read. Regardless, it would be super nice to have this. Thank you for reading!
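
A minimal sketch of that workaround, assuming fastparquet is available (load_month is a hypothetical helper, one partition per month):

from glob import glob

import dask.dataframe as dd
from dask import delayed
from fastparquet import ParquetFile

def load_month(path):
    # Read one hive-format month directory into a pandas DataFrame.
    return ParquetFile(path).to_pandas()

files = sorted(glob("../data/processed/2016*.parq"))
parts = [delayed(load_month)(f) for f in files]
df = dd.from_delayed(parts)

The trade-off is that dask can’t see the parquet metadata through from_delayed, so column selections won’t be pushed down into the reads automatically.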

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 15 (8 by maintainers)

Top GitHub Comments

2 reactions
martindurant commented, May 22, 2018

Also, we now support globs in dataframe.read_parquet, so I’m going to close this issue.
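
(For anyone landing here later: on a sufficiently recent dask, the original call should now work as written, e.g.

import dask.dataframe as dd

df = dd.read_parquet("../data/processed/2016*.parq")

matching each month directory under the glob.)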

0 reactions
makmanalp commented, May 23, 2018

Thanks @martindurant!
