
to_pandas() doesn't work with parquet file - Type Error

See original GitHub issue

Hi all, I’m loading some parquet files generated by a Spark ETL job.

I get this error when calling parquet_file.to_pandas().

```python
AttributeError                            Traceback (most recent call last)
<ipython-input-9-7098f6946da6> in <module>()
----> 1 profiles.to_pandas()

/home/springcoil/miniconda3/envs/py35/lib/python3.5/site-packages/fastparquet/api.py in to_pandas(self, columns, categories, filters, index, timestamp96)
    332                     self.read_row_group(rg, columns, categories, infile=f,
    333                                         index=index, assign=parts,
--> 334                                         timestamp96=timestamp96)
    335                     start += rg.num_rows
    336         else:

/home/springcoil/miniconda3/envs/py35/lib/python3.5/site-packages/fastparquet/api.py in read_row_group(self, rg, columns, categories, infile, index, assign, timestamp96)
    184                 infile, rg, columns, categories, self.schema, self.cats,
    185                 self.selfmade, index=index, assign=assign,
--> 186                 timestamp96=timestamp96, sep=self.sep)
    187         if ret:
    188             return df

/home/springcoil/miniconda3/envs/py35/lib/python3.5/site-packages/fastparquet/core.py in read_row_group(file, rg, columns, categories, schema_helper, cats, selfmade, index, assign, timestamp96, sep)
    336         raise RuntimeError('Going with pre-allocation!')
    337     read_row_group_arrays(file, rg, columns, categories, schema_helper,
--> 338                           cats, selfmade, assign=assign, timestamp96=timestamp96)
    339 
    340     for cat in cats:

/home/springcoil/miniconda3/envs/py35/lib/python3.5/site-packages/fastparquet/core.py in read_row_group_arrays(file, rg, columns, categories, schema_helper, cats, selfmade, assign, timestamp96)
    313                  selfmade=selfmade, assign=out[name],
    314                  catdef=out[name+'-catdef'] if use else None,
--> 315                  timestamp96=mr)
    316 
    317         if _is_map_like(schema_helper, column):

/home/springcoil/miniconda3/envs/py35/lib/python3.5/site-packages/fastparquet/core.py in read_col(column, schema_helper, infile, use_cat, grab_dict, selfmade, assign, catdef, timestamp96)
    237             skip_nulls = False
    238         defi, rep, val = read_data_page(infile, schema_helper, ph, cmd,
--> 239                                         skip_nulls, selfmade=selfmade)
    240         if rep is not None and assign.dtype.kind != 'O':  # pragma: no cover
    241             # this should never get called

/home/springcoil/miniconda3/envs/py35/lib/python3.5/site-packages/fastparquet/core.py in read_data_page(f, helper, header, metadata, skip_nulls, selfmade)
    103                                            dtype=np.uint8))
    104 
--> 105     repetition_levels = read_rep(io_obj, daph, helper, metadata)
    106 
    107     if skip_nulls and not helper.is_required(metadata.path_in_schema):

/home/springcoil/miniconda3/envs/py35/lib/python3.5/site-packages/fastparquet/core.py in read_rep(io_obj, daph, helper, metadata)
     83             metadata.path_in_schema)
     84         bit_width = encoding.width_from_max_int(max_repetition_level)
---> 85         repetition_levels = read_data(io_obj, daph.repetition_level_encoding,
     86                                       daph.num_values,
     87                                       bit_width)[:daph.num_values]

AttributeError: 'NoneType' object has no attribute 'repetition_level_encoding'
```


Has anyone seen anything like this before?
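For anyone reading the traceback: the final frame accesses `daph.repetition_level_encoding`, where `daph` is the data-page header pulled off a Parquet page header. If the page being read is not a plain data page (for example, a dictionary page), that field is `None`, and the attribute access fails exactly as shown. A minimal sketch of the failure mode, using an illustrative stand-in class rather than fastparquet's real structures:

```python
class PageHeader:
    """Illustrative stand-in for a Parquet page header (not fastparquet's class)."""

    def __init__(self, data_page_header=None):
        # A dictionary page carries no data-page header, so this stays None.
        self.data_page_header = data_page_header


daph = PageHeader().data_page_header  # None for a non-data page

try:
    daph.repetition_level_encoding  # the access made by the last traceback frame
except AttributeError as exc:
    print(exc)  # 'NoneType' object has no attribute 'repetition_level_encoding'
```

A common workaround while a file is unreadable by fastparquet is to try a different engine, e.g. `pd.read_parquet(path, engine="pyarrow")`, assuming pyarrow is installed and supports the file's encodings.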

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Reactions: 5
  • Comments: 41 (18 by maintainers)

Top GitHub Comments

2 reactions · anderl80 commented, Apr 9, 2018

I have the same problem here.

1 reaction · martindurant commented, Oct 17, 2017

OK, so: there appear to be multiple dictionary pages, which is not supposed to happen, but I can deal with. Also, the encoding is “bit-packed (deprecated)”, which, as the name suggests, is not supposed to be around. I can maybe code it up, since the spec is well-stated, and I can compare the result against ground-truth as given by spark. I’ll get back to you.
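The `bit_width` computed in the traceback's last frame is simply the number of bits needed to represent the column's maximum repetition level. fastparquet's real helper is `fastparquet.encoding.width_from_max_int`; the pure-Python sketch below is illustrative only:

```python
def width_from_max_int(max_int: int) -> int:
    """Bits needed to encode any value in the range [0, max_int]."""
    return max_int.bit_length()


# A flat column has max_repetition_level == 0: no repetition levels are stored.
print(width_from_max_int(0))  # 0
# A column nested one level deep needs a single bit per value.
print(width_from_max_int(1))  # 1
print(width_from_max_int(7))  # 3
```

The deprecated bit-packed encoding Martin mentions packs these fixed-width levels back to back with no padding, which is why a reader must know the exact bit width up front to decode them.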


Top Results From Across the Web

  • converting parquet file to pandas and then querying gives error
  • Overcoming Parquet Schema Issues - Medium
  • pandas.DataFrame.to_parquet
  • Troubleshoot the Parquet format connector - Azure Data ...
  • Solved: Columns not displayed correctly from parquet file
