[SUPPORT] 0.10 COW table insert mode cannot merge small files
Describe the problem you faced
In Hudi 0.9, using the Flink 1.12.2 SQL client to sink logs into a Hudi COW table in insert mode, the small files were merged into a few Parquet files.
But with Hudi 0.10, the same code produces lots of small Parquet files, and turning on the write.insert.cluster option mentioned in the docs has no effect.
To Reproduce
Steps to reproduce the behavior:
- hudi table option:
...
WITH (
'connector' = 'hudi',
'path' = 'hdfs://xxx:8020/hudi/xxx',
'write.precombine.field' = 'time',
'write.operation' = 'insert',
'write.insert.cluster' = 'true',
'write.insert.deduplicate' = 'true',
'hoodie.datasource.write.recordkey.field' = 'distinct_id,time,event,_track_id',
'read.streaming.enabled' = 'true', -- this option enable the streaming read
'read.streaming.start-commit' = '20211001000000', -- specifies the start commit instant time
'read.streaming.check-interval' = '60', -- specifies the check interval for finding new source commits, default 60s.
'table.type' = 'COPY_ON_WRITE', -- creates a COPY_ON_WRITE table (the default); set to MERGE_ON_READ for a MOR table
'hive_sync.enable' = 'true', -- Required. To enable hive synchronization
'hive_sync.mode' = 'hms', -- Required. Setting hive sync mode to hms, default jdbc
'hive_sync.table'='xxx', -- required, hive table name
'hive_sync.db'='xxx',
'hive_sync.metastore.uris' = 'thrift://xxx9083' -- Required. The port needs to be set in hive-site.xml
);
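For context, the insert against such a table could look like the sketch below. This is a hypothetical example, not taken from the issue: the sink and source table names (hudi_sink, source_logs) are placeholders, and it assumes Flink's dynamic table options hint, which lets you override table options such as write.insert.cluster for a single statement.

```sql
-- Hypothetical sketch: table names are placeholders; columns follow the
-- recordkey fields from the table definition above.
-- The OPTIONS hint overrides the table's options for this statement only.
INSERT INTO hudi_sink /*+ OPTIONS(
  'write.operation' = 'insert',
  'write.insert.cluster' = 'true'  -- ask Hudi to merge small files on insert (COW)
) */
SELECT distinct_id, `time`, event, _track_id
FROM source_logs;
```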
Expected behavior
Small files get merged.
Environment Description

- Hudi version : 0.10-snapshot / 0.11-snapshot
- Spark version : NA
- Hive version : 2.1.1
- Hadoop version : 3.0
- Storage (HDFS/S3/GCS…) : HDFS
- Running on Docker? (yes/no) : no
- Flink version : 1.13.2
Additional context
NA
Stacktrace
NA
Issue Analytics

- Created 2 years ago
- Comments: 6 (4 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Fixed in https://github.com/apache/hudi/commit/934fe54cc57b508875383cd807735b1323fef754
Okay, let me test it on a local cluster.