[SUPPORT] 0.10 COW table insert mode cannot merge small files
Describe the problem you faced
In Hudi 0.9, using the Flink 1.12.2 SQL client to sink logs into a Hudi COW table in insert mode, the small files were merged into a few Parquet files.
But with Hudi 0.10, the same code produces lots of small Parquet files, and turning on the write.insert.cluster option mentioned in the docs has no effect.
To Reproduce
Steps to reproduce the behavior:
- hudi table option:
...
WITH (
'connector' = 'hudi',
'path' = 'hdfs://xxx:8020/hudi/xxx',
'write.precombine.field' = 'time',
'write.operation' = 'insert',
'write.insert.cluster' = 'true',
'write.insert.deduplicate' = 'true',
'hoodie.datasource.write.recordkey.field' = 'distinct_id,time,event,_track_id',
'read.streaming.enabled' = 'true', -- this option enable the streaming read
'read.streaming.start-commit' = '20211001000000', -- specifies the start commit instant time
'read.streaming.check-interval' = '60', -- specifies the check interval for finding new source commits, default 60s.
'table.type' = 'COPY_ON_WRITE', -- creates a COPY_ON_WRITE table (the default); set to MERGE_ON_READ for a MOR table
'hive_sync.enable' = 'true', -- Required. To enable hive synchronization
'hive_sync.mode' = 'hms', -- Required. Setting hive sync mode to hms, default jdbc
'hive_sync.table'='xxx', -- required, hive table name
'hive_sync.db'='xxx',
'hive_sync.metastore.uris' = 'thrift://xxx9083' -- Required. The port needs to be set in hive-site.xml
);
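For context, the insert against such a table could look like the sketch below. This is a hypothetical example, not taken from the issue: the sink and source table names (hudi_sink, source_logs) are placeholders, and it assumes Flink's dynamic table options hint, which lets you override table options such as write.insert.cluster for a single statement.

```sql
-- Hypothetical sketch: table names are placeholders; columns follow the
-- recordkey fields from the table definition above.
-- The OPTIONS hint overrides the table's options for this statement only.
INSERT INTO hudi_sink /*+ OPTIONS(
  'write.operation' = 'insert',
  'write.insert.cluster' = 'true'  -- ask Hudi to merge small files on insert (COW)
) */
SELECT distinct_id, `time`, event, _track_id
FROM source_logs;
```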
Expected behavior
Small files get merged.
Environment Description

- Hudi version : 0.10-snapshot / 0.11-snapshot
- Spark version : NA
- Hive version : 2.1.1
- Hadoop version : 3.0
- Storage (HDFS/S3/GCS…) : HDFS
- Running on Docker? (yes/no) : no
- Flink version : 1.13.2
Additional context
NA
Stacktrace
NA
Issue Analytics

- Created 2 years ago
- Comments: 6 (4 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Fixed in https://github.com/apache/hudi/commit/934fe54cc57b508875383cd807735b1323fef754
Okay, let me test it on a local cluster.