Hive Connector: Ignore empty files when scanning files

Currently, if an external table is created with a format such as Parquet and its directory contains an empty (zero-length) file, the query fails with an error:

oss://xxxxxx/xxxx/empty_file is not a valid Parquet File

For ease of use, we could safely skip all empty files to avoid this error. What do you think? (Empty files will not occur if the files are managed by Hive, but they can occur when files are uploaded by a user or another program, for example when using an object storage service.)
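The proposal above can be illustrated with a minimal sketch. It is not the Hive connector's code: it uses a local directory in place of an object store, and `list_non_empty_files` and `demo` are hypothetical names introduced only for this example. The idea is simply to drop zero-length files from a directory listing before any of them is handed to a Parquet reader.

```python
import os
import tempfile


def list_non_empty_files(directory):
    """Return the sorted paths of regular files in `directory` with length > 0."""
    kept = []
    for entry in os.scandir(directory):
        # Skip subdirectories and zero-length files. A zero-length file
        # cannot contain the Parquet magic bytes, so a reader would reject it.
        if entry.is_file() and entry.stat().st_size > 0:
            kept.append(entry.path)
    return sorted(kept)


def demo():
    """Build a directory with one empty and one non-empty file, then list it."""
    with tempfile.TemporaryDirectory() as d:
        # Hypothetical stray empty file, like the one in the error message.
        open(os.path.join(d, "empty_file"), "wb").close()
        with open(os.path.join(d, "part-0.parquet"), "wb") as f:
            f.write(b"PAR1")  # placeholder bytes, not a real Parquet file
        return [os.path.basename(p) for p in list_non_empty_files(d)]


if __name__ == "__main__":
    print(demo())
```

Filtering on size alone only addresses empty files; a non-empty file that is not valid Parquet would still fail when read.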

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 9 (7 by maintainers)

Top GitHub Comments

1 reaction
zhenxiao commented, Jul 2, 2020

Hi @xumingming @mbasmanova, I feel that if a Hive connector table's metadata specifies Parquet format, all of its files should be in Parquet format. Empty files seem to be an invalid case. I am not convinced that ParquetReader should skip empty, non-Parquet files. What do you think?

0 reactions
xumingming commented, Jul 2, 2020

OK, sounds good to me 😃

Top Results From Across the Web

Hive connector — Trino 403 Documentation
Ignore partitions when the file system location does not exist rather than failing the query. This skips data that may be expected to...
Hive Connector — Presto 0.278 Documentation
The Hive connector allows querying data stored in a Hive data warehouse. Hive is a combination of three components: Data files in varying...
How to ignore empty parquet files when reading using Hive
Try to use the property $file_size. If it is more than 0 then process the data load. It would be better if you...
E-MapReduce:Hive connector - Alibaba Cloud
Specifies whether to ignore a partition rather than report a query failure if the system file path specified for the partition does not...
Reading and writing Hive tables in R | CDP Public Cloud
The Hive Warehouse Connector (HWC) supports reads and writes to Apache Hive managed ACID tables in R. Cloudera provides an R package SparklyrHWC...
