fastparquet -> Hive interop trouble
Scenario:
I used `dask.dataframe.to_parquet` to write a Parquet file, which I then expose to Hive as an external table. Hive is not able to read the int/float columns.
Example queries from Hive:

```sql
select mystringfield from myparquettable limit 10;
-- successfully retrieves records

select myIntfield from myparquettable limit 10;
-- claims myIntfield does not exist

select myFloatfield from myparquettable limit 10;
-- claims myFloatfield does not exist
```
So basically all object fields work just fine in queries, but Hive is unable to locate the numeric fields.
My thought is that it is a type mismatch issue: `float64` fields are type `DOUBLE` in Hive, and `int64` fields are type `BIGINT`.
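The expected correspondence can be sketched as a small lookup (a hypothetical helper; the `float64 -> DOUBLE` and `int64 -> BIGINT` pairs are the ones stated above, the function name is made up):

```python
# Hypothetical mapping from pandas/numpy dtype names to the Hive column
# types they should correspond to, per the description above.
DTYPE_TO_HIVE = {
    "int64": "BIGINT",
    "float64": "DOUBLE",
    "object": "STRING",  # string/object fields, which do work in queries
}

def hive_type(dtype_name: str) -> str:
    """Return the Hive type expected for a given pandas dtype name."""
    return DTYPE_TO_HIVE[dtype_name]

print(hive_type("int64"))    # BIGINT
print(hive_type("float64"))  # DOUBLE
```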
When I look at `_common_metadata`, I see the metadata values are null, while the numpy/pandas types look normal.
Have folks used fastparquet with Hive? Any tips on how to troubleshoot?
Thanks
Issue Analytics
- Created: 6 years ago
- Comments: 11 (4 by maintainers)
@brendancol Could you please explain how you created the parquet file with pyarrow? I tried the following way, but Hive cannot read it either.
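For reference, a minimal sketch of how a Hive-friendly file might be written with pyarrow (assumptions: `flavor="spark"` is pyarrow's compatibility option for Spark/Hive-style readers, the function name is made up, and pyarrow may not be installed, so the import is guarded):

```python
try:
    import pyarrow as pa
    import pyarrow.parquet as pq
    HAVE_PYARROW = True
except ImportError:  # pyarrow is optional in this sketch
    HAVE_PYARROW = False

def write_for_hive(df, path):
    """Write a pandas DataFrame to Parquet with Spark/Hive-compatible
    settings: drop the pandas index column and use flavor='spark'."""
    if not HAVE_PYARROW:
        raise RuntimeError("pyarrow is required for this sketch")
    table = pa.Table.from_pandas(df, preserve_index=False)
    pq.write_table(table, path, flavor="spark")
```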
Sure, the string could in theory reflect the library version. However, it's not used anywhere, so it is only for the eyes of users.