
SNAPPY compression option in ParquetFile??

See original GitHub issue

I tried reading in a folder of parquet files, but SNAPPY is not allowed and the error tells me to choose another compression option. Where do I pass in the compression option for the read step? I see it for the write step, but not for ParquetFile.

from glob import glob
from fastparquet import ParquetFile, writer

filelist = glob(data_path + "/*parquet")
filelist
['.../data/deleted_data/part-00000-ad202203-7b33-4afd-9702-21ec5edc91ea.snappy.parquet',
 '.../data/deleted_data/part-00001-ad202203-7b33-4afd-9702-21ec5edc91ea.snappy.parquet',
 '.../data/deleted_data/part-00002-ad202203-7b33-4afd-9702-21ec5edc91ea.snappy.parquet',
....]

I had to write a metadata file first:

writer.merge(filelist[1:])

So again I tried to read it in, but SNAPPY is not allowed and the error tells me to choose another compression option. Where do I pass in the compression option for the read step? I see it for the write step, but not for ParquetFile.

df = ParquetFile(data_path+"/").to_pandas()


RuntimeError                              Traceback (most recent call last)
<ipython-input-71-0cf81b88c4e1> in <module>()
----> 1 df = ParquetFile(data_path+"/").to_pandas()

/Users/steve/anaconda/lib/python2.7/site-packages/fastparquet/api.pyc in to_pandas(self, columns, categories, filters, index, timestamp96)
    308                          for (name, v) in views.items()}
    309                 self.read_row_group_file(rg, columns, categories, index,
--> 310                                          assign=parts, timestamp96=timestamp96)
    311                 start += rg.num_rows
    312         return df

/Users/steve/anaconda/lib/python2.7/site-packages/fastparquet/api.pyc in read_row_group_file(self, rg, columns, categories, index, assign, timestamp96)
    136                 fn, rg, columns, categories, self.helper, self.cats,
    137                 open=self.open, selfmade=self.selfmade, index=index,
--> 138                 assign=assign, timestamp96=timestamp96)
    139         if ret:
    140             return df

/Users/steve/anaconda/lib/python2.7/site-packages/fastparquet/core.pyc in read_row_group_file(fn, rg, columns, categories, schema_helper, cats, open, selfmade, index, assign, timestamp96)
    266         return read_row_group(f, rg, columns, categories, schema_helper, cats,
    267                               selfmade=selfmade, index=index, assign=assign,
--> 268                               timestamp96=timestamp96)
    269 
    270 

/Users/steve/anaconda/lib/python2.7/site-packages/fastparquet/core.pyc in read_row_group(file, rg, columns, categories, schema_helper, cats, selfmade, index, assign, timestamp96)
    301         raise RuntimeError('Going with pre-allocation!')
    302     read_row_group_arrays(file, rg, columns, categories, schema_helper,
--> 303                           cats, selfmade, assign=assign, timestamp96=timestamp96)
    304 
    305     for cat in cats:

/Users/steve/anaconda/lib/python2.7/site-packages/fastparquet/core.pyc in read_row_group_arrays(file, rg, columns, categories, schema_helper, cats, selfmade, assign, timestamp96)
    290                  selfmade=selfmade, assign=out[name],
    291                  catdef=out[name+'-catdef'] if use else None,
--> 292                  timestamp96=mr)
    293 
    294 

/Users/steve/anaconda/lib/python2.7/site-packages/fastparquet/core.pyc in read_col(column, schema_helper, infile, use_cat, grab_dict, selfmade, assign, catdef, timestamp96)
    229             skip_nulls = False
    230         defi, rep, val = read_data_page(infile, schema_helper, ph, cmd,
--> 231                                         skip_nulls, selfmade=selfmade)
    232         d = ph.data_page_header.encoding == parquet_thrift.Encoding.PLAIN_DICTIONARY
    233         if use_cat and not d:

/Users/steve/anaconda/lib/python2.7/site-packages/fastparquet/core.pyc in read_data_page(f, helper, header, metadata, skip_nulls, selfmade)
     97     """
     98     daph = header.data_page_header
---> 99     raw_bytes = _read_page(f, header, metadata)
    100     io_obj = encoding.Numpy8(np.frombuffer(byte_buffer(raw_bytes),
    101                                            dtype=np.uint8))

/Users/steve/anaconda/lib/python2.7/site-packages/fastparquet/core.pyc in _read_page(file_obj, page_header, column_metadata)
     29     """Read the data page from the given file-object and convert it to raw, uncompressed bytes (if necessary)."""
     30     raw_bytes = file_obj.read(page_header.compressed_page_size)
---> 31     raw_bytes = decompress_data(raw_bytes, column_metadata.codec)
     32 
     33     assert len(raw_bytes) == page_header.uncompressed_page_size, \

/Users/steve/anaconda/lib/python2.7/site-packages/fastparquet/compression.pyc in decompress_data(data, algorithm)
     80     if algorithm.upper() not in decompressions:
     81         raise RuntimeError("Decompression '%s' not available.  Options: %s" %
---> 82                 (algorithm.upper(), sorted(decompressions)))
     83     return decompressions[algorithm.upper()](data)

RuntimeError: Decompression 'SNAPPY' not available.  Options: ['GZIP', 'UNCOMPRESSED']
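
The error comes from fastparquet's compression module: decompress_data only knows about codecs whose Python bindings were importable, so there is no compression option to pass to ParquetFile at read time. A quick check (a minimal sketch, assuming the decompressions dict from the traceback is accessible as fastparquet.compression.decompressions) lists the codecs this install can actually decode:

from fastparquet import compression

# Keys of this dict are the codec names fastparquet can currently decompress;
# 'SNAPPY' only appears here if the snappy bindings imported successfully.
print(sorted(compression.decompressions))

If 'SNAPPY' is missing from that list, the fix is to install the snappy bindings rather than to pass anything to ParquetFile.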

Conda environment

Current conda install:

               platform : osx-64
          conda version : 4.2.13
       conda is private : False
      conda-env version : 4.2.13
    conda-build version : not installed
         python version : 2.7.13.final.0
       requests version : 2.12.4
       root environment : /Users/steve/anaconda  (writable)
    default environment : /Users/steve/anaconda
       envs directories : /Users/steve/anaconda/envs
          package cache : /Users/steve/anaconda/pkgs
           channel URLs : https://repo.continuum.io/pkgs/free/osx-64
                          https://repo.continuum.io/pkgs/free/noarch
                          https://repo.continuum.io/pkgs/pro/osx-64
                          https://repo.continuum.io/pkgs/pro/noarch
            config file : None
           offline mode : False

conda list | grep fastparquet
fastparquet               0.0.5                    py27_1    conda-forge
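
The listing shows fastparquet from conda-forge but no python-snappy, which matches the error. A minimal sketch to confirm this from Python (assuming the conda-forge package name python-snappy and its import name snappy, per the maintainer's suggestion quoted below):

# python-snappy exposes the module 'snappy'; an ImportError here means
# fastparquet has no way to decode SNAPPY-compressed pages.
try:
    import snappy
    print("snappy bindings available")
except ImportError:
    print("python-snappy missing; try: conda install -c conda-forge python-snappy")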

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 14 (4 by maintainers)

Top GitHub Comments

1 reaction
michcio1234 commented, Apr 4, 2019

You just restart the Python session, or click the restart-kernel icon in your notebook. But you probably figured it out by now. ^^
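
In other words, after installing python-snappy the interpreter (or notebook kernel) has to be restarted so that fastparquet re-imports its compression backends. A rough sanity check after restarting, reusing the data_path from the original snippet, might look like this:

from fastparquet import ParquetFile, compression

# After a restart, SNAPPY should be listed among the available codecs...
assert 'SNAPPY' in compression.decompressions

# ...and the original read should succeed without any compression argument.
df = ParquetFile(data_path + "/").to_pandas()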

0 reactions
BullzeyeBG commented, Mar 13, 2019

Whether I do "from dask.dataframe import read_parquet" or even "from dask.dataframe import read_csv", I still get the same "cannot import name demo" import error.

Yes, when I got this error ("snappy 1.1.4 1 conda-forge") I installed it because I saw it in your travis.yml. It doesn't install as a dependency with fastparquet from conda-forge. However, now that you mention it, I didn't restart my Python interpreter. And now that I do, it works. I suck. ~ Steve, sent via telepathy. On May 3, 2017, at 6:59 PM, Martin Durant @.***> wrote: conda install python-snappy

Hey, lame question - how do you restart the interpreter? I'm running into the same issue as you do…
