
[SUPPORT] 0.10 COW table insert mode cannot merge small files

See original GitHub issue

Describe the problem you faced

In Hudi 0.9, using the Flink 1.12.2 SQL client to sink logs to a Hudi COW table in insert mode, the small files were merged into a few Parquet files.

With Hudi 0.10, the same code produces lots of small Parquet files, and turning on the write.insert.cluster option mentioned in the docs has no effect.

To Reproduce

Steps to reproduce the behavior:

  1. hudi table option:
...
WITH (
  'connector' = 'hudi',
  'path' = 'hdfs://xxx:8020/hudi/xxx',
  'write.precombine.field' = 'time',
  'write.operation' = 'insert',
  'write.insert.cluster' = 'true',
  'write.insert.deduplicate' = 'true',
  'hoodie.datasource.write.recordkey.field' = 'distinct_id,time,event,_track_id',
  'read.streaming.enabled' = 'true',  -- this option enables streaming read
  'read.streaming.start-commit' = '20211001000000', -- specifies the start commit instant time
  'read.streaming.check-interval' = '60', -- specifies the check interval for finding new source commits, default 60s.
  'table.type' = 'COPY_ON_WRITE', -- COPY_ON_WRITE is the default; set to MERGE_ON_READ for a MOR table
  'hive_sync.enable' = 'true',     -- Required. To enable hive synchronization
  'hive_sync.mode' = 'hms',         -- Required. Setting hive sync mode to hms, default jdbc
  'hive_sync.table'='xxx',                          -- required, hive table name
  'hive_sync.db'='xxx',
  'hive_sync.metastore.uris' = 'thrift://xxx:9083' -- Required. The port needs to match hive-site.xml
);
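For context, the WITH clause above sits inside a Flink SQL CREATE TABLE statement. A minimal sketch of the surrounding DDL follows; the table name and column list are invented placeholders inferred from the record-key fields in the options above, not taken from the issue:

```sql
-- Hypothetical DDL sketch; table and column names are placeholders inferred
-- from 'hoodie.datasource.write.recordkey.field', not from the original issue.
CREATE TABLE events_hudi (
  distinct_id STRING,
  `time`      STRING,   -- also the precombine field; backticked because TIME is reserved
  event       STRING,
  _track_id   STRING
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs://xxx:8020/hudi/xxx',
  'write.operation' = 'insert',
  'write.insert.cluster' = 'true',  -- the option expected to merge small files on COW insert
  'table.type' = 'COPY_ON_WRITE'
);
```

With this sketch the issue can be reproduced by streaming inserts into the table and then listing the Parquet files under the table path after a few commits.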

Expected behavior

Small files get merged.

Environment Description

  • Hudi version : 0.10-snapshot / 0.11-snapshot

  • Spark version : NA

  • Hive version : 2.1.1

  • Hadoop version : 3.0

  • Storage (HDFS/S3/GCS…) : HDFS

  • Running on Docker? (yes/no) : no

  • Flink version : 1.13.2

Additional context

NA

Stacktrace

NA

Issue Analytics

  • State:closed
  • Created 2 years ago
  • Comments:6 (4 by maintainers)

Top GitHub Comments

danny0405 commented on Dec 2, 2021:

Okay, let me have a test on a local cluster.

