Boolean columns promoted to object
Hi,
First and foremost, thanks for the great work on fastparquet.
I've struggled a bit to get pandas' bool column type into parquet. The reason is the automatic promotion to 'object' whenever there are NaNs in the df. I tried to open a pull request about it, but I think pull requests are restricted to members of the repo? Could you please take a look at my approach for handling this?
Both changes are in writer.py:
201     if all(isinstance(i, bool) for i in head):
202         return 'bool'
and:
107     type, converted_type, width = typemap[object_encoding]
108
109 else:
110     raise ValueError('Object encoding (%s) not one of '
111                      'infer|utf8|bytes|json|bson' % object_encoding)
Also, a bit more verbosity whenever writing a column fails:
576 try:
577     chunk = write_column(f, data[column.name], column,
578                          compression=comp)
579 except TypeError as type_exception:
580     # in append mode, if the chunk of data already in the parquet file
581     # has a different type from the one we're writing, this will break.
582     # Give the user extra info about which column failed.
583     msg = str(type_exception)
584     msg += "\n Failed column details: '%s'" % str(column)
585     raise TypeError(msg)
This one has saved me some debugging time; maybe it will save others too? 😉
Plus, something that would be good to add to the documentation: when you're creating dataframes from other data, it's better to use from numpy import nan to flag the missing cells. I'm transforming MongoDB collections into parquet and this was biting me big time (I was naively using "NaN" strings…).
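To make it concrete, a sketch of what I mean (made-up data):

from numpy import nan
import pandas as pd

# Good: numpy's nan marks the cell as genuinely missing.
df = pd.DataFrame({'price': [1.5, nan, 2.0]})

# Bad (what I was doing): the string "NaN" is just text, so the column
# becomes 'object' and nothing is actually null.
df_bad = pd.DataFrame({'price': [1.5, 'NaN', 2.0]})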
Mentioning also that the infer_object_encoding function, great as it is, has hiccups when the data batches for append are small (~10 rows): the likelihood that 10 rows get inferred as one type and the next 10 as another is high, especially with NaNs, Nones, etc. A bit of documentation there would, I think, save future users some time.
It would be great to have your opinion and comments.
Thanks!
Luis
No, you shouldn't have categorical; it only adds work. I suggest not putting any effort into tracking down the error message and whether it can be clearer for this case.
Let’s take a look.
For bools there's no problem on the first write, but then on your next write for append the new df has a None, so pandas promotes the column to 'object' and the write breaks.
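Roughly this (a sketch from memory; 'test.parq' is just a placeholder path):

import pandas as pd
from fastparquet import write

df1 = pd.DataFrame({'bool_col': [True, False]})  # dtype: bool
write('test.parq', df1)                          # first write: fine

df2 = pd.DataFrame({'bool_col': [True, None]})   # pandas promotes to object
write('test.parq', df2, append=True)             # -> breaks here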
Right, that was kinda expected.
Let's say, OK, let's make the column with a None in it from the beginning.
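Something like this (sketch, same placeholder path):

import pandas as pd
from fastparquet import write

df = pd.DataFrame({'bool_col': [True, False, None]})
df['bool_col'].dtype     # object -- pandas won't give us bool with a None
write('test.parq', df)   # breaks as well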
Now, trying your suggestion of object_encoding='infer'. (This fails with has_nulls=False too.)
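That is, both of these blow up for me (sketch):

import pandas as pd
from fastparquet import write

df = pd.DataFrame({'bool_col': [True, False, None]})  # same frame as above
write('test.parq', df, object_encoding='infer')                   # fails
write('test.parq', df, object_encoding='infer', has_nulls=False)  # fails too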
Let's try different object_encodings. Now trying with utf8.
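Again a sketch:

import pandas as pd
from fastparquet import write

df = pd.DataFrame({'bool_col': [True, False, None]})  # same frame as above
write('test.parq', df, object_encoding='utf8')  # no luck either: True/False
                                                # aren't utf8 strings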
OK, if we force conversion to 'bool' whenever we have a None, then we lose the None information.
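For example (sketch):

import pandas as pd

df = pd.DataFrame({'bool_col': [True, False, None]})
df['bool_col'].astype(bool)
# 0     True
# 1    False
# 2    False   <- the None silently turned into False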
And since parquet allows for Nulls, I think fastparquet should honor that capability, and not follow pandas’ principle here.
On the append issue, let's imagine we have a bunch of Nones in our bool_col in the first batch, and then we start to have bools in the next batches.
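A sketch of that scenario:

import pandas as pd
from fastparquet import write

batch1 = pd.DataFrame({'bool_col': [None, None, None]})
batch2 = pd.DataFrame({'bool_col': [True, False, True]})

# infer only sees Nones in the first batch, so it guesses one encoding...
write('test.parq', batch1, object_encoding='infer')
# ...and a different one for the second batch, so the append clashes
write('test.parq', batch2, append=True, object_encoding='infer')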
So, whenever we have some Nones in an object column and we try 'infer' for the object_encoding, we might run into problems, especially for bools.
Now, while writing this, I think I've bumped into a workaround: json!!!
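That is, something like this (sketch; reading it back to double-check):

import pandas as pd
from fastparquet import write, ParquetFile

df = pd.DataFrame({'bool_col': [True, False, None]})
write('test.parq', df, object_encoding='json')   # works for me

ParquetFile('test.parq').to_pandas()['bool_col']
# expecting the Trues/Falses back, with the None preserved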
So I think I’m going to do this in my actual code.
But I still think fastparquet should support writing native pandas 'object' columns containing a mix of booleans and Nones. Or just document this stuff?
Hope it makes sense, and let me know if I'm making some horribly basic assumption or something.