SNAPPY compression option in ParquetFile??
Tried reading in a folder of parquet files, but SNAPPY is not allowed and it tells me to choose another compression option. Where do I pass in the compression option for the read step? I see it for the write step, but not for ParquetFile.
from glob import glob
from fastparquet import ParquetFile, writer
filelist = glob(data_path+"/*parquet")
filelist
['.../data/deleted_data/part-00000-ad202203-7b33-4afd-9702-21ec5edc91ea.snappy.parquet',
'.../data/deleted_data/part-00001-ad202203-7b33-4afd-9702-21ec5edc91ea.snappy.parquet',
'.../data/deleted_data/part-00002-ad202203-7b33-4afd-9702-21ec5edc91ea.snappy.parquet',
....]
Had to write a metadata file first:
writer.merge(filelist[1:])
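For context: writer.merge consolidates the footers of the listed data files into a _metadata summary file in their common directory, which is what lets ParquetFile open the directory as a single dataset. A minimal sketch of the round trip, assuming data_path and filelist as defined above:

from fastparquet import ParquetFile, writer

# merge() writes a _metadata file next to the data files, after which
# the directory itself can be opened as one logical dataset.
writer.merge(filelist)
pf = ParquetFile(data_path + "/")
print(pf.columns)  # column names recovered from the merged footer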
So again, I tried to read it in, but SNAPPY is still not allowed and I am told to choose another compression option:
df = ParquetFile(data_path+"/").to_pandas()
RuntimeError Traceback (most recent call last)
<ipython-input-71-0cf81b88c4e1> in <module>()
----> 1 df = ParquetFile(data_path+"/").to_pandas()
/Users/steve/anaconda/lib/python2.7/site-packages/fastparquet/api.pyc in to_pandas(self, columns, categories, filters, index, timestamp96)
308 for (name, v) in views.items()}
309 self.read_row_group_file(rg, columns, categories, index,
--> 310 assign=parts, timestamp96=timestamp96)
311 start += rg.num_rows
312 return df
/Users/steve/anaconda/lib/python2.7/site-packages/fastparquet/api.pyc in read_row_group_file(self, rg, columns, categories, index, assign, timestamp96)
136 fn, rg, columns, categories, self.helper, self.cats,
137 open=self.open, selfmade=self.selfmade, index=index,
--> 138 assign=assign, timestamp96=timestamp96)
139 if ret:
140 return df
/Users/steve/anaconda/lib/python2.7/site-packages/fastparquet/core.pyc in read_row_group_file(fn, rg, columns, categories, schema_helper, cats, open, selfmade, index, assign, timestamp96)
266 return read_row_group(f, rg, columns, categories, schema_helper, cats,
267 selfmade=selfmade, index=index, assign=assign,
--> 268 timestamp96=timestamp96)
269
270
/Users/steve/anaconda/lib/python2.7/site-packages/fastparquet/core.pyc in read_row_group(file, rg, columns, categories, schema_helper, cats, selfmade, index, assign, timestamp96)
301 raise RuntimeError('Going with pre-allocation!')
302 read_row_group_arrays(file, rg, columns, categories, schema_helper,
--> 303 cats, selfmade, assign=assign, timestamp96=timestamp96)
304
305 for cat in cats:
/Users/steve/anaconda/lib/python2.7/site-packages/fastparquet/core.pyc in read_row_group_arrays(file, rg, columns, categories, schema_helper, cats, selfmade, assign, timestamp96)
290 selfmade=selfmade, assign=out[name],
291 catdef=out[name+'-catdef'] if use else None,
--> 292 timestamp96=mr)
293
294
/Users/steve/anaconda/lib/python2.7/site-packages/fastparquet/core.pyc in read_col(column, schema_helper, infile, use_cat, grab_dict, selfmade, assign, catdef, timestamp96)
229 skip_nulls = False
230 defi, rep, val = read_data_page(infile, schema_helper, ph, cmd,
--> 231 skip_nulls, selfmade=selfmade)
232 d = ph.data_page_header.encoding == parquet_thrift.Encoding.PLAIN_DICTIONARY
233 if use_cat and not d:
/Users/steve/anaconda/lib/python2.7/site-packages/fastparquet/core.pyc in read_data_page(f, helper, header, metadata, skip_nulls, selfmade)
97 """
98 daph = header.data_page_header
---> 99 raw_bytes = _read_page(f, header, metadata)
100 io_obj = encoding.Numpy8(np.frombuffer(byte_buffer(raw_bytes),
101 dtype=np.uint8))
/Users/steve/anaconda/lib/python2.7/site-packages/fastparquet/core.pyc in _read_page(file_obj, page_header, column_metadata)
29 """Read the data page from the given file-object and convert it to raw, uncompressed bytes (if necessary)."""
30 raw_bytes = file_obj.read(page_header.compressed_page_size)
---> 31 raw_bytes = decompress_data(raw_bytes, column_metadata.codec)
32
33 assert len(raw_bytes) == page_header.uncompressed_page_size, \
/Users/steve/anaconda/lib/python2.7/site-packages/fastparquet/compression.pyc in decompress_data(data, algorithm)
80 if algorithm.upper() not in decompressions:
81 raise RuntimeError("Decompression '%s' not available. Options: %s" %
---> 82 (algorithm.upper(), sorted(decompressions)))
83 return decompressions[algorithm.upper()](data)
RuntimeError: Decompression 'SNAPPY' not available. Options: ['GZIP', 'UNCOMPRESSED']
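The last frame shows what is actually happening: fastparquet dispatches decompression through the decompressions dict in fastparquet/compression.py, and a codec is only registered there if its optional Python package imports successfully. There is no compression argument on the read path because the codec for each column chunk is recorded in the file metadata; the reader just needs the matching library installed. A quick check, a sketch based only on the module visible in the traceback above:

from fastparquet import compression

# SNAPPY appears here only if the python-snappy package imported cleanly;
# a bare fastparquet install registers just the built-in codecs.
print(sorted(compression.decompressions))
# ['GZIP', 'UNCOMPRESSED']            -> snappy bindings are missing
# ['GZIP', 'SNAPPY', 'UNCOMPRESSED']  -> .snappy.parquet files will read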
Conda environment
Current conda install:
platform : osx-64
conda version : 4.2.13
conda is private : False
conda-env version : 4.2.13
conda-build version : not installed
python version : 2.7.13.final.0
requests version : 2.12.4
root environment : /Users/steve/anaconda (writable)
default environment : /Users/steve/anaconda
envs directories : /Users/steve/anaconda/envs
package cache : /Users/steve/anaconda/pkgs
channel URLs : https://repo.continuum.io/pkgs/free/osx-64
https://repo.continuum.io/pkgs/free/noarch
https://repo.continuum.io/pkgs/pro/osx-64
https://repo.continuum.io/pkgs/pro/noarch
config file : None
offline mode : False
conda list | grep fastparquet
fastparquet 0.0.5 py27_1 conda-forge
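The listing shows fastparquet 0.0.5 but no snappy bindings, which matches the error. The usual fix (my assumption from the traceback, not something fastparquet's message spells out) is to install python-snappy from the same channel and then restart the interpreter so fastparquet re-imports its codec table:

conda install -c conda-forge python-snappy

After restarting the session, ParquetFile(...).to_pandas() should decompress the .snappy.parquet files without any extra read-time option.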
Top GitHub Comments
Hey, lamer's question - how do you restart the interpreter? I'm running into the same problem as you do…

You just re-start the Python session or click a little icon in your notebook. But you probably figured it out by now. ^^