[SUPPORT] Trouble getting yyyy/mm partitioning to work with Hive sync

Describe the problem you faced

Hi, everyone! We ingest data with options:

hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING
hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM
hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyy-MM-dd'T'HH:mm:ssZ,yyyy-MM-dd'T'HH:mm:ss.SSSZ
hoodie.deltastreamer.keygen.timebased.input.dateformat.list.delimiter.regex=
hoodie.deltastreamer.keygen.timebased.input.timezone='
hoodie.datasource.write.partitionpath.field=time:TIMESTAMP

The time field has the format 2021-05-16T21:36:39Z. We want some tables to be partitioned by yyyy/MM, because they are small and there is no need for deeper partitioning. However, we have a problem with run_sync_tool.sh. Here is what we tried:

  1. --partitioned-by time obviously didn't help.
  2. --partition-value-extractor org.apache.hudi.hive.MultiPartKeysValueExtractor --partitioned-by _hoodie_partition_path didn't help much either: we get the error shown in the screenshot (in the Parquet file, _hoodie_partition_path=2021/05). Any ideas how to fix it?

[image: screenshot of the error]

https://apache-hudi.slack.com/archives/C4D716NPQ/p1625675498061500
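
To make the relationship between the time field and the partition path concrete, here is a minimal Python sketch of what the key generator settings above amount to for a value such as 2021-05-16T21:36:39Z. It is only an illustration, not Hudi's implementation: the strptime patterns are assumed Python stand-ins for the two input.dateformat entries, the function name partition_path is hypothetical, and no time zone conversion is modeled.

```python
from datetime import datetime

# Assumed Python equivalents of the two input.dateformat patterns
# (yyyy-MM-dd'T'HH:mm:ssZ and yyyy-MM-dd'T'HH:mm:ss.SSSZ).
INPUT_FORMATS = ["%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%dT%H:%M:%S.%f%z"]

# Assumed Python equivalent of output.dateformat=yyyy/MM.
OUTPUT_FORMAT = "%Y/%m"


def partition_path(value: str) -> str:
    """Hypothetical helper: try each input format, then format as yyyy/MM."""
    for fmt in INPUT_FORMATS:
        try:
            parsed = datetime.strptime(value, fmt)
        except ValueError:
            continue  # this pattern did not match, try the next one
        return parsed.strftime(OUTPUT_FORMAT)
    raise ValueError(f"no input format matches {value!r}")


print(partition_path("2021-05-16T21:36:39Z"))  # 2021/05
```

The result contains a slash, so a single record's partition path is made of two directory levels (2021 and 05), which matters for how Hive sync interprets it.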

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

affei commented, Jul 29, 2021 (1 reaction)

Solved by using --partitioned-by 'year,month'. Thanks, everybody!
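
A likely reason this works: a multi-part value extractor splits the partition path 2021/05 on the slash, producing two values, so Hive sync needs two partition columns to match them. The Python sketch below is a simplified model of that matching, not Hudi's actual code; the function names and the error message are hypothetical.

```python
def extract_partition_values(partition_path: str) -> list:
    """Split a path such as '2021/05' into one value per directory level."""
    values = []
    for part in partition_path.split("/"):
        # Accept Hive-style 'key=value' segments as well as bare values.
        values.append(part.split("=", 1)[1] if "=" in part else part)
    return values


def map_partition(partitioned_by: str, partition_path: str) -> dict:
    """Pair each --partitioned-by field with one extracted value."""
    fields = [field.strip() for field in partitioned_by.split(",")]
    values = extract_partition_values(partition_path)
    if len(fields) != len(values):
        raise ValueError(
            f"partition fields {fields} do not match partition values {values}"
        )
    return dict(zip(fields, values))


# Two fields for a two-level path: the counts line up.
print(map_partition("year,month", "2021/05"))  # {'year': '2021', 'month': '05'}

# A single field for the same path fails in this model, which mirrors
# the attempt with --partitioned-by _hoodie_partition_path.
try:
    map_partition("_hoodie_partition_path", "2021/05")
except ValueError as err:
    print(err)
```

In this model the names year and month are arbitrary labels for the two path levels; what matters is that the number of fields equals the number of path segments.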

nsivabalan commented, Jul 29, 2021 (0 reactions)

cool, thanks.
