to_pandas() doesn't work with parquet file - AttributeError
Hi all, I’m loading some parquet files generated by a Spark ETL job. I get this error when calling parquet_file.to_pandas().
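The call that fails is roughly the following (a minimal sketch; 'profiles.parquet' is a hypothetical stand-in for the actual ETL output path):

```python
import fastparquet

# Hypothetical path standing in for the Spark job's output.
profiles = fastparquet.ParquetFile('profiles.parquet')
df = profiles.to_pandas()  # raises the AttributeError below
```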
```
AttributeError Traceback (most recent call last)
<ipython-input-9-7098f6946da6> in <module>()
----> 1 profiles.to_pandas()
/home/springcoil/miniconda3/envs/py35/lib/python3.5/site-packages/fastparquet/api.py in to_pandas(self, columns, categories, filters, index, timestamp96)
332 self.read_row_group(rg, columns, categories, infile=f,
333 index=index, assign=parts,
--> 334 timestamp96=timestamp96)
335 start += rg.num_rows
336 else:
/home/springcoil/miniconda3/envs/py35/lib/python3.5/site-packages/fastparquet/api.py in read_row_group(self, rg, columns, categories, infile, index, assign, timestamp96)
184 infile, rg, columns, categories, self.schema, self.cats,
185 self.selfmade, index=index, assign=assign,
--> 186 timestamp96=timestamp96, sep=self.sep)
187 if ret:
188 return df
/home/springcoil/miniconda3/envs/py35/lib/python3.5/site-packages/fastparquet/core.py in read_row_group(file, rg, columns, categories, schema_helper, cats, selfmade, index, assign, timestamp96, sep)
336 raise RuntimeError('Going with pre-allocation!')
337 read_row_group_arrays(file, rg, columns, categories, schema_helper,
--> 338 cats, selfmade, assign=assign, timestamp96=timestamp96)
339
340 for cat in cats:
/home/springcoil/miniconda3/envs/py35/lib/python3.5/site-packages/fastparquet/core.py in read_row_group_arrays(file, rg, columns, categories, schema_helper, cats, selfmade, assign, timestamp96)
313 selfmade=selfmade, assign=out[name],
314 catdef=out[name+'-catdef'] if use else None,
--> 315 timestamp96=mr)
316
317 if _is_map_like(schema_helper, column):
/home/springcoil/miniconda3/envs/py35/lib/python3.5/site-packages/fastparquet/core.py in read_col(column, schema_helper, infile, use_cat, grab_dict, selfmade, assign, catdef, timestamp96)
237 skip_nulls = False
238 defi, rep, val = read_data_page(infile, schema_helper, ph, cmd,
--> 239 skip_nulls, selfmade=selfmade)
240 if rep is not None and assign.dtype.kind != 'O': # pragma: no cover
241 # this should never get called
/home/springcoil/miniconda3/envs/py35/lib/python3.5/site-packages/fastparquet/core.py in read_data_page(f, helper, header, metadata, skip_nulls, selfmade)
103 dtype=np.uint8))
104
--> 105 repetition_levels = read_rep(io_obj, daph, helper, metadata)
106
107 if skip_nulls and not helper.is_required(metadata.path_in_schema):
/home/springcoil/miniconda3/envs/py35/lib/python3.5/site-packages/fastparquet/core.py in read_rep(io_obj, daph, helper, metadata)
83 metadata.path_in_schema)
84 bit_width = encoding.width_from_max_int(max_repetition_level)
---> 85 repetition_levels = read_data(io_obj, daph.repetition_level_encoding,
86 daph.num_values,
87 bit_width)[:daph.num_values]
AttributeError: 'NoneType' object has no attribute 'repetition_level_encoding'
```
Has anyone seen anything like this before?
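For anyone decoding the trace: `daph` appears to be the `data_page_header` field taken off the Thrift page header, and that field is only populated for data pages; for a dictionary page it is None, which reproduces the exact error. A toy illustration with a stand-in class (not fastparquet's actual structs):

```python
# Stand-in for the Thrift PageHeader struct, for illustration only.
class PageHeader:
    def __init__(self, page_type, data_page_header=None):
        self.type = page_type                      # e.g. 'DATA_PAGE' or 'DICTIONARY_PAGE'
        self.data_page_header = data_page_header   # None for dictionary pages

header = PageHeader('DICTIONARY_PAGE')
daph = header.data_page_header                     # -> None

try:
    daph.repetition_level_encoding
except AttributeError as e:
    print(e)  # 'NoneType' object has no attribute 'repetition_level_encoding'
```

This is consistent with the maintainer's finding below that the file contains unexpected extra dictionary pages.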
Top GitHub Comments
I have the same problem here.
OK, so: there appear to be multiple dictionary pages, which is not supposed to happen, but which I can deal with. Also, the encoding is “bit-packed (deprecated)”, which, as the name suggests, is not supposed to be around anymore. I can maybe code it up, since the spec is well stated, and I can compare the result against the ground truth that Spark gives. I’ll get back to you.
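For reference, the deprecated BIT_PACKED encoding in the Parquet spec packs values back-to-back at a fixed bit width, filling each byte from its most significant bit; the bit width would come from `width_from_max_int(max_repetition_level)` as in the trace above. A minimal sketch of a decoder under that reading of the spec (the function name and signature are illustrative, not fastparquet internals):

```python
def decode_bit_packed(raw, bit_width, count):
    """Unpack `count` integers of `bit_width` bits each, MSB-first,
    per the (deprecated) BIT_PACKED encoding in the Parquet spec."""
    values = []
    bit_pos = 0  # absolute bit offset into `raw`
    for _ in range(count):
        value = 0
        for _ in range(bit_width):
            byte = raw[bit_pos // 8]
            # consume bits from the most significant end of each byte
            value = (value << 1) | ((byte >> (7 - bit_pos % 8)) & 1)
            bit_pos += 1
        values.append(value)
    return values

# 3-bit values 0..4 packed MSB-first into two bytes (final bit is padding):
assert decode_bit_packed(b'\x05\x38', 3, 5) == [0, 1, 2, 3, 4]
```

A vectorised numpy version would be the practical choice inside fastparquet, but a pure-Python loop like this is enough to check results against Spark's output as ground truth.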