fastparquet -> Hive interop trouble

Scenario:

I used dask.dataframe.to_parquet to write a parquet file. I then exposed this parquet file to Hive as an external table, but Hive is not able to read the int / float columns.

Example queries from Hive:

select mystringfield from myparquettable limit 10; --> successfully retrieve records
select myIntfield from myparquettable limit 10; --> claims myIntfield does not exist
select myFloatfield from myparquettable limit 10; --> claims myFloatfield does not exist

So basically all object fields work just fine for queries, but Hive is unable to locate the numeric fields.
My thought is that it is a type mismatch issue. float64 fields are type DOUBLE in Hive, and int64 fields are type BIGINT.
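
One way to rule out a type mismatch is to generate the Hive table definition from the dataframe dtypes instead of typing it by hand. The following is a minimal sketch, not something taken from the issue: the helper name `hive_external_table_ddl`, the `object` to `STRING` mapping, and the location path are assumptions for illustration. Only the float64/DOUBLE and int64/BIGINT pairs come from the text above.

```python
# Hypothetical helper: build a Hive DDL statement from pandas-style dtype names.
# float64 -> DOUBLE and int64 -> BIGINT are the pairs named in the issue;
# object -> STRING is an assumption for columns that hold Python strings.
HIVE_TYPES = {
    "int64": "BIGINT",
    "float64": "DOUBLE",
    "object": "STRING",
}


def hive_external_table_ddl(table, dtypes, location):
    """Return a CREATE EXTERNAL TABLE statement for a Parquet directory."""
    columns = []
    for name, dtype in dtypes.items():
        if dtype not in HIVE_TYPES:
            # Fail loudly rather than guess a Hive type for an unmapped dtype.
            raise ValueError(f"no Hive type mapping for {name!r} ({dtype})")
        columns.append(f"  `{name}` {HIVE_TYPES[dtype]}")
    column_block = ",\n".join(columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n"
        f"{column_block}\n"
        f")\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


# Column names are taken from the example queries; the path is a placeholder.
ddl = hive_external_table_ddl(
    "myparquettable",
    {"mystringfield": "object", "myIntfield": "int64", "myFloatfield": "float64"},
    "/path/to/parquet_dir",
)
print(ddl)
```

With a real dataframe, the dtype dictionary could be built as `{c: str(t) for c, t in df.dtypes.items()}`. If the generated definition matches the existing table and the numeric columns are still not found, the declared types are probably not the cause.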

When I look at _common_metadata, I see the metadata values are null while the numpy/pandas types look normal.

Have folks used fastparquet with Hive? Any tips on how to troubleshoot?

Thanks

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 11 (4 by maintainers)

Top GitHub Comments

1 reaction
antonellyb commented, Aug 31, 2021

@martindurant I haven’t closed this yet because I’m creating a boiled down example to reproduce the issue.

In the short term, I ended up using pyarrow to write the parquet file, and that fixed the issue. I’m going to dig in a bit more, though, and will add the reproduction case to this issue. Thanks a million for your help. It turned out pyspark was able to read in the parquet file and then rewrite a version that was readable by Hive.

@brendancol Could you please explain how you created the parquet file with pyarrow? I tried the following, but Hive cannot read it either:

import pyarrow.parquet as pq
import pyarrow as pa
table = pa.Table.from_pandas(df)
pq.write_table(table, 'example.parquet')

0 reactions
martindurant commented, Jan 24, 2022

Sure, the string could in theory reflect the library version. However, it’s not used anywhere, so it is only for the eyes of users.
