fastparquet -> Hive interop trouble

Scenario:

I used dask.dataframe.to_parquet to write a parquet file. I then exposed this parquet file to Hive as an external table, but Hive is not able to read int / float columns.

Example queries from Hive:
select mystringfield from myparquettable limit 10; --> successfully retrieve records
select myIntfield from myparquettable limit 10; --> claims myIntfield does not exist
select myFloatfield from myparquettable limit 10; --> claims myFloatfield does not exist

So basically all object fields work just fine for queries, but Hive is unable to locate the numeric fields.
My thought is that it is a type mismatch issue. float64 fields are type DOUBLE in Hive, int64 fields are type BIGINT.
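
One way to test the type-mismatch idea is to derive the Hive column definitions from the pandas dtypes and compare them with the DDL of the external table. The following is only a sketch: `HIVE_TYPES` and `hive_columns` are hypothetical names, the float64 and int64 entries come from the description above, and the object-to-STRING entry is an assumption that object columns hold strings.

```python
# Hypothetical helper, not part of fastparquet, dask or Hive.
# Maps pandas/numpy dtype names to Hive column types.
HIVE_TYPES = {
    "float64": "DOUBLE",  # as stated in the issue
    "int64": "BIGINT",    # as stated in the issue
    "object": "STRING",   # assumption: object columns contain strings
}


def hive_columns(dtypes):
    """Return 'name TYPE' strings for a {column name: dtype} mapping."""
    columns = []
    for name, dtype in dtypes.items():
        key = str(dtype)  # works for plain strings and numpy dtype objects
        if key not in HIVE_TYPES:
            raise ValueError(f"no Hive type known for {name!r} ({key})")
        columns.append(f"{name} {HIVE_TYPES[key]}")
    return columns


# Column names taken from the example queries above.
example = {
    "mystringfield": "object",
    "myIntfield": "int64",
    "myFloatfield": "float64",
}
print(",\n".join(hive_columns(example)))
```

With a real DataFrame, the mapping could be built from `df.dtypes.to_dict()`; the printed lines can then be compared with the column list in the table's CREATE EXTERNAL TABLE statement.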

When I look at _common_metadata, I see the metadata values are null while the numpy/pandas types look normal.

Have folks used fastparquet with Hive? Any tips on how to troubleshoot?

Thanks

Issue Analytics

Created 6 years ago
Comments: 11 (4 by maintainers)

Top GitHub Comments

antonellyb commented, Aug 31, 2021

@martindurant I haven’t closed this yet because I’m creating a boiled down example to reproduce the issue.

In the short term, I ended up using pyarrow to write the parquet file and that fixed the issue. I’m going to be digging in a bit more though and will add the reproduction case to this issue. Thanks a million for your help; it turned out pyspark was able to read in the parquet file and then re-write a version which was readable by Hive.

@brendancol Could you please explain how you created the parquet file with pyarrow? I tried the following, but Hive cannot read it either:

import pyarrow.parquet as pq
import pyarrow as pa
table = pa.Table.from_pandas(df)
pq.write_table(table, 'example.parquet')

martindurant commented, Jan 24, 2022

Sure, the string could in theory reflect the library version. However, it’s not used anywhere, so it is only for the eyes of users.
