
to_pandas() doesn't work with parquet file - Type Error

See original GitHub issue

Hi all, I’m loading some parquet files generated by a Spark ETL job.

I get this error when calling parquet_file.to_pandas().

```python
AttributeError                            Traceback (most recent call last)
<ipython-input-9-7098f6946da6> in <module>()
----> 1 profiles.to_pandas()

/home/springcoil/miniconda3/envs/py35/lib/python3.5/site-packages/fastparquet/api.py in to_pandas(self, columns, categories, filters, index, timestamp96)
    332                     self.read_row_group(rg, columns, categories, infile=f,
    333                                         index=index, assign=parts,
--> 334                                         timestamp96=timestamp96)
    335                     start += rg.num_rows
    336         else:

/home/springcoil/miniconda3/envs/py35/lib/python3.5/site-packages/fastparquet/api.py in read_row_group(self, rg, columns, categories, infile, index, assign, timestamp96)
    184                 infile, rg, columns, categories, self.schema, self.cats,
    185                 self.selfmade, index=index, assign=assign,
--> 186                 timestamp96=timestamp96, sep=self.sep)
    187         if ret:
    188             return df

/home/springcoil/miniconda3/envs/py35/lib/python3.5/site-packages/fastparquet/core.py in read_row_group(file, rg, columns, categories, schema_helper, cats, selfmade, index, assign, timestamp96, sep)
    336         raise RuntimeError('Going with pre-allocation!')
    337     read_row_group_arrays(file, rg, columns, categories, schema_helper,
--> 338                           cats, selfmade, assign=assign, timestamp96=timestamp96)
    339 
    340     for cat in cats:

/home/springcoil/miniconda3/envs/py35/lib/python3.5/site-packages/fastparquet/core.py in read_row_group_arrays(file, rg, columns, categories, schema_helper, cats, selfmade, assign, timestamp96)
    313                  selfmade=selfmade, assign=out[name],
    314                  catdef=out[name+'-catdef'] if use else None,
--> 315                  timestamp96=mr)
    316 
    317         if _is_map_like(schema_helper, column):

/home/springcoil/miniconda3/envs/py35/lib/python3.5/site-packages/fastparquet/core.py in read_col(column, schema_helper, infile, use_cat, grab_dict, selfmade, assign, catdef, timestamp96)
    237             skip_nulls = False
    238         defi, rep, val = read_data_page(infile, schema_helper, ph, cmd,
--> 239                                         skip_nulls, selfmade=selfmade)
    240         if rep is not None and assign.dtype.kind != 'O':  # pragma: no cover
    241             # this should never get called

/home/springcoil/miniconda3/envs/py35/lib/python3.5/site-packages/fastparquet/core.py in read_data_page(f, helper, header, metadata, skip_nulls, selfmade)
    103                                            dtype=np.uint8))
    104 
--> 105     repetition_levels = read_rep(io_obj, daph, helper, metadata)
    106 
    107     if skip_nulls and not helper.is_required(metadata.path_in_schema):

/home/springcoil/miniconda3/envs/py35/lib/python3.5/site-packages/fastparquet/core.py in read_rep(io_obj, daph, helper, metadata)
     83             metadata.path_in_schema)
     84         bit_width = encoding.width_from_max_int(max_repetition_level)
---> 85         repetition_levels = read_data(io_obj, daph.repetition_level_encoding,
     86                                       daph.num_values,
     87                                       bit_width)[:daph.num_values]

AttributeError: 'NoneType' object has no attribute 'repetition_level_encoding'
```


Has anyone seen anything like this before?
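For anyone reading the traceback: the final frame accesses `daph.repetition_level_encoding`, where `daph` is the data-page header pulled off a Parquet page header. If the page being read is not a plain data page (for example, a dictionary page), that field is `None`, and the attribute access fails exactly as shown. A minimal sketch of the failure mode, using an illustrative stand-in class rather than fastparquet's real structures:

```python
class PageHeader:
    """Illustrative stand-in for a Parquet page header (not fastparquet's class)."""

    def __init__(self, data_page_header=None):
        # A dictionary page carries no data-page header, so this stays None.
        self.data_page_header = data_page_header


daph = PageHeader().data_page_header  # None for a non-data page

try:
    daph.repetition_level_encoding  # the access made by the last traceback frame
except AttributeError as exc:
    print(exc)  # 'NoneType' object has no attribute 'repetition_level_encoding'
```

A common workaround while a file is unreadable by fastparquet is to try a different engine, e.g. `pd.read_parquet(path, engine="pyarrow")`, assuming pyarrow is installed and supports the file's encodings.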

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Reactions: 5
  • Comments: 41 (18 by maintainers)

Top GitHub Comments

2 reactions · anderl80 commented, Apr 9, 2018

I have the same problem here.

1 reaction · martindurant commented, Oct 17, 2017

OK, so: there appear to be multiple dictionary pages, which is not supposed to happen, but I can deal with. Also, the encoding is “bit-packed (deprecated)”, which, as the name suggests, is not supposed to be around. I can maybe code it up, since the spec is well-stated, and I can compare the result against ground-truth as given by spark. I’ll get back to you.
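The `bit_width` computed in the traceback's last frame is simply the number of bits needed to represent the column's maximum repetition level. fastparquet's real helper is `fastparquet.encoding.width_from_max_int`; the pure-Python sketch below is illustrative only:

```python
def width_from_max_int(max_int: int) -> int:
    """Bits needed to encode any value in the range [0, max_int]."""
    return max_int.bit_length()


# A flat column has max_repetition_level == 0: no repetition levels are stored.
print(width_from_max_int(0))  # 0
# A column nested one level deep needs a single bit per value.
print(width_from_max_int(1))  # 1
print(width_from_max_int(7))  # 3
```

The deprecated bit-packed encoding Martin mentions packs these fixed-width levels back to back with no padding, which is why a reader must know the exact bit width up front to decode them.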


Top Results From Across the Web

  • converting parquet file to pandas and then querying gives error
  • Overcoming Parquet Schema Issues - Medium
  • pandas.DataFrame.to_parquet
  • Troubleshoot the Parquet format connector - Azure Data ...
  • Solved: Columns not displayed correctly from parquet file
