
Boolean columns promoted to object


Hi,

First and foremost, thanks for the great work with fastparquet.

I’ve struggled a bit to get pandas’ bool column type into parquet. The reason is the automatic promotion to ‘object’ whenever there are NaNs in the df. I tried to open a pull request about it, but I think pull requests are restricted to members of the repo? Could you please take a look at my approach for handling this?
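To make the promotion concrete, here is a minimal pandas sketch (not from the original thread): an all-bool column stays `bool`, but a single None flips the whole column to `object`.

```python
import pandas as pd

# An all-bool column keeps the 'bool' dtype.
clean = pd.Series([True, False, True])
print(clean.dtype)       # bool

# One None and pandas promotes the whole column to 'object'.
with_none = pd.Series([True, False, None])
print(with_none.dtype)   # object
```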

Both changes are in writer.py:

201     if all(isinstance(i, bool) for i in head):
202         return 'bool'

and:

107             type, converted_type, width = typemap[object_encoding]
108
109         else:
110             raise ValueError('Object encoding (%s) not one of '
111                              'infer|utf8|bytes|json|bson|bool' % object_encoding)

Also, some extra verbosity for whenever we’re writing a column and it breaks:

576             try:
577                 chunk = write_column(f, data[column.name], column,
578                                      compression=comp)
579             except TypeError as type_exception:
580                 # In append mode, if the chunk of data already in parquet
581                 # has a different type from the one we're writing, this breaks.
582                 # Tell the user which column failed.
583                 msg = str(type_exception)
584                 msg += "\n Failed column details: '%s'" % str(column)
585                 raise TypeError(msg) from type_exception

This one has saved me some debugging time; maybe it will also save others? 😉

Plus, something that would be good to add to the documentation: when you’re creating DataFrames from other data, use `from numpy import nan` to flag the missing cells. I’m transforming MongoDB collections into parquet and this was biting me big time (I was naively using “NaN” strings…)
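A quick illustration of the difference (my own example, not from the thread): the string `"NaN"` is just text to pandas, whereas `np.nan` is an actual missing-value marker.

```python
import numpy as np
import pandas as pd

# The string "NaN" is ordinary text: pandas does NOT treat it as missing.
s_str = pd.Series(["NaN", "a", "b"])
print(s_str.isna().sum())   # 0

# np.nan is a real missing-value marker.
s_nan = pd.Series([np.nan, "a", "b"])
print(s_nan.isna().sum())   # 1
```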

I’ll also mention that the infer_object_encoding function, great as it is, has hiccups when the data batches for append are small (~10 rows): the likelihood that 10 rows are inferred as one type and the next 10 as another is high, especially with NaNs or Nones. A bit of documentation there would, I think, save future users some time.
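To show why per-batch inference can disagree, here is a hypothetical mini-sniffer (the real infer_object_encoding in writer.py differs): each batch is inferred on its own, so two batches of the same logical column can land on incompatible encodings.

```python
# Hypothetical head-based type sniffing, just for illustration.
def sniff(values, head=10):
    head_vals = [v for v in values[:head] if v is not None]
    if head_vals and all(isinstance(v, bool) for v in head_vals):
        return 'bool'
    if head_vals and all(isinstance(v, str) for v in head_vals):
        return 'utf8'
    return 'json'

batch1 = [None, None, None]   # no typed values in the head -> falls through
batch2 = [None, None, True]   # non-null head values are bool
print(sniff(batch1), sniff(batch2))   # two different answers for one column
```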

It would be great to have your opinion and comments.

Thanks!

Luis

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Reactions: 1
  • Comments: 19 (13 by maintainers)

Top GitHub Comments

1 reaction
martindurant commented, Jan 18, 2019

No you shouldn’t have categorical, it only adds work. I suggest not putting any effort into tracing down the error message and whether it can be clearer for this case.

1 reaction
Kieleth commented, Mar 16, 2017

Let’s take a look.

For bools, the first write is no problem, but then on the next write for append the new df has a None, so it promotes to ‘object’ and breaks:

In [434]: bools_df = df([True, False, True], columns=['bool_col'])

In [435]: bools_df.dtypes
Out[435]:
bool_col    bool
dtype: object

In [436]: fastparquet.write('test_append_1', bools_df, compression='SNAPPY', append=False, has_nulls=True)

In [437]: bools_df = df([True, False, None], columns=['bool_col'])

In [438]: bools_df.dtypes
Out[438]:
bool_col    object
dtype: object

In [439]: fastparquet.write('test_append_1', bools_df, compression='SNAPPY', append=True, has_nulls=True)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-439-c72a81cf2126> in <module>()
----> 1 fastparquet.write('test_append_1', bools_df, compression='SNAPPY', append=True, has_nulls=True)

/home/lguzman/miniconda2/envs/datawarehouse/lib/python3.6/site-packages/fastparquet/writer.py in write(filename, data, row_group_offsets, compression, file_scheme, open_with, mkdirs, has_nulls, write_index, partition_on, fixed_text, append, object_encoding, times)
    754     fmd = make_metadata(data, has_nulls=has_nulls, ignore_columns=ignore,
    755                         fixed_text=fixed_text, object_encoding=object_encoding,
--> 756                         times=times)
    757
    758     if file_scheme == 'simple':

/home/lguzman/miniconda2/envs/datawarehouse/lib/python3.6/site-packages/fastparquet/writer.py in make_metadata(data, has_nulls, ignore_columns, fixed_text, object_encoding, times)
    621         else:
    622             se, type = find_type(data[column], fixed_text=fixed,
--> 623                                  object_encoding=oencoding, times=times)
    624         col_has_nulls = has_nulls
    625         if has_nulls is None:

/home/lguzman/miniconda2/envs/datawarehouse/lib/python3.6/site-packages/fastparquet/writer.py in find_type(data, fixed_text, object_encoding, times)
     90     elif dtype == "O":
     91         if object_encoding == 'infer':
---> 92             object_encoding = infer_object_encoding(data)
     93
     94         if object_encoding == 'utf8':

/home/lguzman/miniconda2/envs/datawarehouse/lib/python3.6/site-packages/fastparquet/writer.py in infer_object_encoding(data)
    193         return 'json'
    194     else:
--> 195         raise ValueError("Can't infer object conversion type: %s" % head)
    196
    197

ValueError: Can't infer object conversion type: 0     True
1    False
Name: bool_col, dtype: object

Right, that was kinda expected.

Let’s say, OK, let’s make the column with a None from the beginning:

In [440]: bools_df = df([True, False, None], columns=['bool_col'])

In [441]: bools_df.dtypes
Out[441]:
bool_col    object
dtype: object

In [442]: fastparquet.write('test_append_1', bools_df, compression='SNAPPY', append=False, has_nulls=True)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-442-7980fc74c830> in <module>()
----> 1 fastparquet.write('test_append_1', bools_df, compression='SNAPPY', append=False, has_nulls=True)

/home/lguzman/miniconda2/envs/datawarehouse/lib/python3.6/site-packages/fastparquet/writer.py in write(filename, data, row_group_offsets, compression, file_scheme, open_with, mkdirs, has_nulls, write_index, partition_on, fixed_text, append, object_encoding, times)
    754     fmd = make_metadata(data, has_nulls=has_nulls, ignore_columns=ignore,
    755                         fixed_text=fixed_text, object_encoding=object_encoding,
--> 756                         times=times)
    757
    758     if file_scheme == 'simple':

/home/lguzman/miniconda2/envs/datawarehouse/lib/python3.6/site-packages/fastparquet/writer.py in make_metadata(data, has_nulls, ignore_columns, fixed_text, object_encoding, times)
    621         else:
    622             se, type = find_type(data[column], fixed_text=fixed,
--> 623                                  object_encoding=oencoding, times=times)
    624         col_has_nulls = has_nulls
    625         if has_nulls is None:

/home/lguzman/miniconda2/envs/datawarehouse/lib/python3.6/site-packages/fastparquet/writer.py in find_type(data, fixed_text, object_encoding, times)
     90     elif dtype == "O":
     91         if object_encoding == 'infer':
---> 92             object_encoding = infer_object_encoding(data)
     93
     94         if object_encoding == 'utf8':

/home/lguzman/miniconda2/envs/datawarehouse/lib/python3.6/site-packages/fastparquet/writer.py in infer_object_encoding(data)
    193         return 'json'
    194     else:
--> 195         raise ValueError("Can't infer object conversion type: %s" % head)
    196
    197

ValueError: Can't infer object conversion type: 0     True
1    False
Name: bool_col, dtype: object

Now, trying your suggestion of object_encoding=‘infer’:

In [443]: bools_df = df([True, False, None], columns=['bool_col'])

In [444]: fastparquet.write('test_append_1', bools_df, compression='SNAPPY', append=False, has_nulls=True, object_encoding={'bool_col':'infer'})
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-444-85c573304843> in <module>()
----> 1 fastparquet.write('test_append_1', bools_df, compression='SNAPPY', append=False, has_nulls=True, object_encoding={'bool_col':'infer'})

/home/lguzman/miniconda2/envs/datawarehouse/lib/python3.6/site-packages/fastparquet/writer.py in write(filename, data, row_group_offsets, compression, file_scheme, open_with, mkdirs, has_nulls, write_index, partition_on, fixed_text, append, object_encoding, times)
    754     fmd = make_metadata(data, has_nulls=has_nulls, ignore_columns=ignore,
    755                         fixed_text=fixed_text, object_encoding=object_encoding,
--> 756                         times=times)
    757
    758     if file_scheme == 'simple':

/home/lguzman/miniconda2/envs/datawarehouse/lib/python3.6/site-packages/fastparquet/writer.py in make_metadata(data, has_nulls, ignore_columns, fixed_text, object_encoding, times)
    621         else:
    622             se, type = find_type(data[column], fixed_text=fixed,
--> 623                                  object_encoding=oencoding, times=times)
    624         col_has_nulls = has_nulls
    625         if has_nulls is None:

/home/lguzman/miniconda2/envs/datawarehouse/lib/python3.6/site-packages/fastparquet/writer.py in find_type(data, fixed_text, object_encoding, times)
     90     elif dtype == "O":
     91         if object_encoding == 'infer':
---> 92             object_encoding = infer_object_encoding(data)
     93
     94         if object_encoding == 'utf8':

/home/lguzman/miniconda2/envs/datawarehouse/lib/python3.6/site-packages/fastparquet/writer.py in infer_object_encoding(data)
    193         return 'json'
    194     else:
--> 195         raise ValueError("Can't infer object conversion type: %s" % head)
    196
    197

ValueError: Can't infer object conversion type: 0     True
1    False
Name: bool_col, dtype: object

(This also fails with has_nulls=False.)

Let’s try with different object_encodings:

fastparquet.write('test_append_1', bools_df, compression='SNAPPY', append=False, has_nulls=False, object_encoding={'bool_col':'bytes'})
...
TypeError: expected list of bytes

Now trying with utf8

fastparquet.write('test_append_1', bools_df, compression='SNAPPY', append=False, has_nulls=False, object_encoding={'bool_col':'utf8'})
...
TypeError: bad argument type for built-in operation

OK, and if we force conversion to ‘bool’ whenever we have Nones, then we lose the None information:

In [451]: bools_df = df([None, None, None], columns=['bool_col'])

In [452]: bools_df.dtypes
Out[452]:
bool_col    object
dtype: object

In [453]: bools_df['bool_col'] = bools_df['bool_col'].astype('bool')

In [454]: bools_df.dtypes
Out[454]:
bool_col    bool
dtype: object

In [455]: bools_df
Out[455]:
  bool_col
0    False
1    False
2    False

And since parquet allows for Nulls, I think fastparquet should honor that capability, and not follow pandas’ principle here.
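(An aside from a later pandas vantage point, added for context: pandas 1.0+ ships a nullable “boolean” extension dtype that keeps True/False/NA without promoting to object. It did not exist when this issue was filed, so none of the transcripts above could use it.)

```python
import pandas as pd

# Nullable boolean extension dtype: None survives as <NA>, dtype stays boolean.
s = pd.Series([True, False, None], dtype="boolean")
print(s.dtype)          # boolean
print(s.isna().sum())   # 1
```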

On the append issue, let’s imagine we have a bunch of Nones in our bool_col in the first batch, and then we start to have bools in the next batches:

In [487]: bools_df = df([None, None, None], columns=['bool_col'])

In [488]: fastparquet.write('test_append_1', bools_df, compression='SNAPPY', append=False, has_nulls=True, object_encoding={'bool_col':'infer'})

In [489]: bools_df = df([None, None, True], columns=['bool_col'])

In [490]: bools_df.dtypes
Out[490]:
bool_col    object
dtype: object

In [491]: fastparquet.write('test_append_1', bools_df, compression='SNAPPY', append=True, has_nulls=True, object_encoding={'bool_col':'infer'})
...
ValueError: Can't infer object conversion type: 2    True
Name: bool_col, dtype: object

So, whenever we have some Nones in an object column and we try ‘infer’ for the object_encoding, we might run into problems, especially for bools.

Now, while writing this, I think I’ve bumped into a workaround: json!

In [495]: bools_df = df([None, None, None], columns=['bool_col'])

In [496]: fastparquet.write('test_append_1', bools_df, compression='SNAPPY', append=False, has_nulls=True, object_encoding={'bool_col':'json'})

In [497]: bools_df = df([None, None, False], columns=['bool_col'])

In [498]: fastparquet.write('test_append_1', bools_df, compression='SNAPPY', append=True, has_nulls=True, object_encoding={'bool_col':'json'})

In [499]: pf = fastparquet.ParquetFile('test_append_1')

In [500]: out_df = pf.to_pandas()

In [501]: out_df
Out[501]:
  bool_col
0     None
1     None
2     None
3     None
4     None
5    False

In [502]: out_df.dtypes
Out[502]:
bool_col    object
dtype: object

So I think I’m going to do this in my actual code.
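One caveat with the json workaround, sketched below with a hypothetical post-read cleanup (not from the thread): the column comes back as object (None/True/False), so if downstream code needs a strict bool dtype you have to pick a default for the Nones yourself — which is exactly the information-loss trade-off discussed above.

```python
import pandas as pd

# What the json-encoded column looks like after to_pandas(): object dtype.
out = pd.Series([None, None, False], name='bool_col')

# Coercing to strict bool forces a choice for the Nones (here: False).
strict = out.fillna(False).astype(bool)
print(strict.tolist())   # [False, False, False]
```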

But I still think fastparquet should support writing native pandas ‘object’ columns that mix booleans and None. Or just document this stuff?

Hope it makes sense, and let me know if I’m making some horrible basic assumption or something.
