
[SUPPORT] Upgrade from 0.8.0 to 0.9.0 removes functionality and decreases performance


Describe the problem you faced

One of the applications where I am using Hudi is a Spark Structured Streaming job from Kafka to Hudi. It is triggered once every hour with about 3,600 JSON messages per run, which are then exploded to about 7 million rows and upserted into a COW table partitioned by year=/month=/day=.
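
For context, a minimal sketch of how such a job is typically wired up. This is illustrative only: the topic name, message schema, paths and trigger interval are placeholders, not my actual code; the Hudi options mirror the config listed at the end of this ticket.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types._

// Placeholder schema: each Kafka message carries an id, a precomputed partition
// string and an array that gets exploded into many rows before the upsert.
val msgSchema = new StructType()
  .add("_uuid", StringType)
  .add("_partition", StringType)              // e.g. "year=2021/month=11/day=12"
  .add("ts", LongType)
  .add("records", ArrayType(StringType))

val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "input-topic")
  .load()

val rows = kafkaDf
  .select(from_json(col("value").cast("string"), msgSchema).as("msg"))
  .select("msg.*")
  .withColumn("record", explode(col("records")))

rows.writeStream
  .format("hudi")
  .option("hoodie.table.name", "table_name")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.recordkey.field", "_uuid")
  .option("hoodie.datasource.write.partitionpath.field", "_partition")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("checkpointLocation", "/tmp/checkpoints/table_name")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime("1 hour"))
  .start("/path/to/table")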

After upgrading to 0.9.0 I am facing two problems:

First:

With no change other than the upgrade to 0.9.0, my Spark job shows decreased performance compared to 0.8.0 and has even run OOM (seen 3-4 times in the last 24 runs; the 2nd or 3rd retry of the task succeeded), even though it had run with the same configuration for multiple months without any issues and there have been no changes to the data or the amount of data. Previously the job took about 2-4 minutes to ingest the data; now it takes 4-8 minutes.

The decreased performance and the OutOfMemory error can be seen in the UpsertPartitioner stage "Getting small files from partitions", even with 5 GB more memory than the job ran with on 0.8.0:

ExecutorLostFailure (executor 8 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 15.0 GB of 15 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.

The log for the task that ran OOM is at the end of this ticket.
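
As an aside, the YARN message above suggests raising the executor memory overhead. A minimal, hypothetical way to do that when the application builds its own session (the value is a placeholder; on Spark 2.4 the non-deprecated key is spark.executor.memoryOverhead, and it must be set before the SparkContext starts):

import org.apache.spark.sql.SparkSession

// Hypothetical placeholder value in MiB; spark.yarn.executor.memoryOverhead is
// the deprecated alias that the YARN error message refers to.
val spark = SparkSession.builder()
  .appName("kafka-to-hudi")
  .config("spark.executor.memoryOverhead", "3072")
  .getOrCreate()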

Second:

Due to other limitations, all my tables are stored with DataSourceWriteOptions.PARTITIONPATH_FIELD -> "_partition", where the _partition column contains strings like "year=2021/month=11/day=12". I have used this across multiple Hudi versions, but 0.9.0 seems to cause issues: while the table is still partitioned correctly, spark.read.format("hudi").load(basePath) no longer returns the columns year, month or day, so the only way to do partition pruning is to filter on _partition.
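
To make that concrete, a small sketch of the pruning difference (paths and values are placeholders):

// On 0.8.0 the year=/month=/day= folder structure was auto-discovered,
// so a query could prune on the discovered columns:
spark.read.format("hudi").load(basePath)
  .where("year = 2021 AND month = 11 AND day = 12")

// On 0.9.0 those columns are no longer part of the returned schema, so the
// only remaining handle for pruning is the _partition value itself:
spark.read.format("hudi").load(basePath)
  .where("_partition = 'year=2021/month=11/day=12'")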

To Reproduce

Steps to reproduce the behavior:

  1. Run the Setup step from https://hudi.apache.org/docs/quick-start-guide
  2. Create a dataframe and replace partitionpath with continent=…/country=…/city=…, e.g. continent=asia/country=india/city=chennai
val inserts = convertToStringList(dataGen.generateInserts(10))
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
val newDf = df.withColumn("partitionpath", regexp_replace($"partitionpath", "(.*)(\\/){1}(.*)(\\/){1}", "continent=$1$2country=$3$4city="))
  3. Write the table
newDf.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD.key(), "ts").
  option(RECORDKEY_FIELD.key(), "uuid").
  option(PARTITIONPATH_FIELD.key(), "partitionpath").
  option(TBL_NAME.key(), tableName).
  mode(Overwrite).
  save(basePath)
  4. Query the table
val tripsSnapshotDF = spark.
  read.
  format("hudi").
  load(basePath)
  5. Columns continent, country and city are missing from tripsSnapshotDF
scala> tripsSnapshotDF.printSchema
root
 |-- _hoodie_commit_time: string (nullable = true)
 |-- _hoodie_commit_seqno: string (nullable = true)
 |-- _hoodie_record_key: string (nullable = true)
 |-- _hoodie_partition_path: string (nullable = true)
 |-- _hoodie_file_name: string (nullable = true)
 |-- begin_lat: double (nullable = true)
 |-- begin_lon: double (nullable = true)
 |-- driver: string (nullable = true)
 |-- end_lat: double (nullable = true)
 |-- end_lon: double (nullable = true)
 |-- fare: double (nullable = true)
 |-- rider: string (nullable = true)
 |-- ts: long (nullable = true)
 |-- uuid: string (nullable = true)
 |-- partitionpath: string (nullable = true)


scala> spark.read.parquet(basePath)
res6: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 16 more fields]

scala> res6.printSchema
root
 |-- _hoodie_commit_time: string (nullable = true)
 |-- _hoodie_commit_seqno: string (nullable = true)
 |-- _hoodie_record_key: string (nullable = true)
 |-- _hoodie_partition_path: string (nullable = true)
 |-- _hoodie_file_name: string (nullable = true)
 |-- begin_lat: double (nullable = true)
 |-- begin_lon: double (nullable = true)
 |-- driver: string (nullable = true)
 |-- end_lat: double (nullable = true)
 |-- end_lon: double (nullable = true)
 |-- fare: double (nullable = true)
 |-- partitionpath: string (nullable = true)
 |-- rider: string (nullable = true)
 |-- ts: long (nullable = true)
 |-- uuid: string (nullable = true)
 |-- continent: string (nullable = true)
 |-- country: string (nullable = true)
 |-- city: string (nullable = true)

Expected behavior

As documented in https://hudi.apache.org/docs/quick-start-guide#query-data:

//load(basePath) use “/partitionKey=partitionValue” folder structure for Spark auto partition discovery

Reading a table laid out as partitionKey=partitionValue should make partitionKey available as a column for partition pruning, the same as spark.read.parquet(basePath) does.
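
A quick, hedged way to check the difference on the quick-start table (plan output details vary by Spark version):

// Direct parquet read: continent/country/city exist, so this filter can be
// pushed down as a partition filter (visible in the physical plan).
spark.read.parquet(basePath)
  .where("continent = 'asia' AND country = 'india' AND city = 'chennai'")
  .explain()

// Hudi datasource on 0.9.0: the same filter fails with an AnalysisException,
// because continent/country/city are not in the returned schema.
spark.read.format("hudi").load(basePath)
  .where("continent = 'asia' AND country = 'india' AND city = 'chennai'")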

In a spark-shell with Hudi 0.8.0, the columns are available:

scala> val basePath = "file:///tmp/hudi_trips_cow"
basePath: String = file:///tmp/hudi_trips_cow

scala> val tripsSnapshotDF = spark.
     |   read.
     |   format("hudi").
     |   load(basePath)
21/11/12 18:23:34 WARN DefaultSource: Loading Base File Only View.
21/11/12 18:23:34 WARN SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
tripsSnapshotDF: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 16 more fields]

scala> tripsSnapshotDF.printSchema
root
 |-- _hoodie_commit_time: string (nullable = true)
 |-- _hoodie_commit_seqno: string (nullable = true)
 |-- _hoodie_record_key: string (nullable = true)
 |-- _hoodie_partition_path: string (nullable = true)
 |-- _hoodie_file_name: string (nullable = true)
 |-- begin_lat: double (nullable = true)
 |-- begin_lon: double (nullable = true)
 |-- driver: string (nullable = true)
 |-- end_lat: double (nullable = true)
 |-- end_lon: double (nullable = true)
 |-- fare: double (nullable = true)
 |-- partitionpath: string (nullable = true)
 |-- rider: string (nullable = true)
 |-- ts: long (nullable = true)
 |-- uuid: string (nullable = true)
 |-- continent: string (nullable = true)
 |-- country: string (nullable = true)
 |-- city: string (nullable = true)

Environment Description

  • Hudi version : 0.9.0

  • Spark version : 2.4.0

  • Hive version :

  • Hadoop version : 3.0.0

  • Storage (HDFS/S3/GCS…) : HDFS

  • Running on Docker? (yes/no) : no


Stacktrace

21/11/12 06:05:50 INFO view.AbstractTableFileSystemView: Building file system view for partition (year=2021/month=11/day=12)
21/11/12 06:05:50 INFO view.AbstractTableFileSystemView: addFilesToView: NumFiles=15, NumFileGroups=9, FileGroupsCreationTime=12, StoreTimeTaken=1
21/11/12 06:05:50 INFO io.HoodieMergeHandle: MaxMemoryPerPartitionMerge => 1073741824
21/11/12 06:05:50 INFO collection.ExternalSpillableMap: Estimated Payload size => 768
21/11/12 06:05:50 INFO collection.ExternalSpillableMap: New Estimated Payload size => 550
21/11/12 06:05:57 INFO collection.BitCaskDiskMap: Spilling to file location /opt/cloudera/var/tmp/hudi-BITCASK-7980c51b-f1bb-46b2-806d-58bd72d7ec8f/09b0c846-6255-4407-acae-857c28ef1f82 in host (...) with hostname (...)
21/11/12 06:07:27 INFO io.HoodieMergeHandle: Number of entries in MemoryBasedMap => 1561807Total size in bytes of MemoryBasedMap => 858993874Number of entries in BitCaskDiskMap => 2114256Size of file spilled to disk => 747196908
21/11/12 06:07:27 INFO io.HoodieMergeHandle: partitionPath:year=2021/month=11/day=12, fileId to be merged:0f1e66e7-3bb6-4f33-84d6-ca221f4d3f7b-0
21/11/12 06:07:27 INFO io.HoodieMergeHandle: Merging new data into oldPath /.../year=2021/month=11/day=12/0f1e66e7-3bb6-4f33-84d6-ca221f4d3f7b-0_0-20-52_20211112050651.parquet, as newPath /.../year=2021/month=11/day=12/0f1e66e7-3bb6-4f33-84d6-ca221f4d3f7b-0_0-20-89_20211112060041.parquet
21/11/12 06:07:27 INFO marker.DirectWriteMarkers: Creating Marker Path=/.../.hoodie/.temp/20211112060041/year=2021/month=11/day=12/0f1e66e7-3bb6-4f33-84d6-ca221f4d3f7b-0_0-20-89_20211112060041.parquet.marker.MERGE
21/11/12 06:07:27 INFO marker.DirectWriteMarkers: [direct] Created marker file /.../.hoodie/.temp/20211112060041/year=2021/month=11/day=12/0f1e66e7-3bb6-4f33-84d6-ca221f4d3f7b-0_0-20-89_20211112060041.parquet.marker.MERGE in 51 ms
21/11/12 06:07:27 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
21/11/12 06:07:27 INFO compress.CodecPool: Got brand-new compressor [.gz]
21/11/12 06:07:28 INFO queue.IteratorBasedQueueProducer: starting to buffer records
21/11/12 06:07:28 INFO queue.BoundedInMemoryExecutor: starting consumer thread
21/11/12 06:07:28 INFO compress.CodecPool: Got brand-new decompressor [.gz]
21/11/12 06:07:32 INFO queue.IteratorBasedQueueProducer: finished buffering records
21/11/12 06:07:33 INFO queue.BoundedInMemoryExecutor: Queue Consumption is done; notifying producer threads
21/11/12 06:09:01 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
21/11/12 06:09:01 ERROR commit.BaseSparkCommitActionExecutor: Error upserting bucketType UPDATE for partition :0
java.lang.NullPointerException
	at org.apache.hudi.common.util.SpillableMapUtils.readInternal(SpillableMapUtils.java:58)
	at org.apache.hudi.common.util.SpillableMapUtils.readBytesFromDisk(SpillableMapUtils.java:49)
	at org.apache.hudi.common.util.collection.BitCaskDiskMap.get(BitCaskDiskMap.java:202)
	at org.apache.hudi.common.util.collection.LazyFileIterable$LazyFileIterator.next(LazyFileIterable.java:101)
	at org.apache.hudi.common.util.collection.ExternalSpillableMap$IteratorWrapper.next(ExternalSpillableMap.java:331)
	at org.apache.hudi.io.HoodieMergeHandle.close(HoodieMergeHandle.java:350)
	at org.apache.hudi.table.action.commit.SparkMergeHelper.runMerge(SparkMergeHelper.java:107)
	at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdateInternal(BaseSparkCommitActionExecutor.java:334)
	at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:325)
	at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:298)
	at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$execute$ecf5068c$1(BaseSparkCommitActionExecutor.java:156)

Hudi config for the Kafka->Hudi job

hoodie.datasource.write.insert.drop.duplicates -> false
hoodie.datasource.write.row.writer.enable -> false
hoodie.memory.spillable.map.path -> /opt/cloudera/var/tmp/
hoodie.embed.timeline.server -> false
hoodie.parquet.small.file.limit -> 104857600
hoodie.copyonwrite.insert.split.size -> 4000000
hoodie.insert.shuffle.parallelism -> 5
hoodie.datasource.write.precombine.field -> ts
hoodie.cleaner.fileversions.retained -> 1
hoodie.parquet.max.file.size -> 134217728
hoodie.parquet.block.size -> 134217728
hoodie.clean.automatic -> false
hoodie.datasource.write.streaming.ignore.failed.batch -> false
hoodie.datasource.write.operation -> upsert
hoodie.datasource.write.recordkey.field -> _uuid
hoodie.table.name -> table_name
hoodie.bulkinsert.sort.mode -> GLOBAL_SORT
hoodie.cleaner.policy -> KEEP_LATEST_COMMITS
hoodie.datasource.write.streaming.retry.count -> 0
hoodie.fail.on.timeline.archiving -> false
hoodie.keep.max.commits -> 60
hoodie.upsert.shuffle.parallelism -> 5
hoodie.cleaner.commits.retained -> 48
hoodie.keep.min.commits -> 50
hoodie.datasource.write.partitionpath.field -> _partition
hoodie.clean.async -> false
hoodie.bloom.index.filter.type -> SIMPLE
hoodie.bulkinsert.shuffle.parallelism -> 5
hoodie.index.bloom.num_entries -> 350000
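
For reference, these options are plain key/value strings; a minimal sketch of how such a map is handed to the writer (df stands for the prepared batch of rows, the path is a placeholder, and only a subset of the keys is repeated here):

// Hypothetical illustration only: pass the config map in one go via .options(...).
val hudiConf = Map(
  "hoodie.table.name"                           -> "table_name",
  "hoodie.datasource.write.operation"           -> "upsert",
  "hoodie.datasource.write.recordkey.field"     -> "_uuid",
  "hoodie.datasource.write.partitionpath.field" -> "_partition",
  "hoodie.datasource.write.precombine.field"    -> "ts",
  "hoodie.upsert.shuffle.parallelism"           -> "5"
)

df.write.format("hudi")
  .options(hudiConf)
  .mode("append")
  .save("/path/to/table")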


Top GitHub Comments

1 reaction
nsivabalan commented, Dec 18, 2021

Spark auto partition discovery seems like a real blocker bug. I have filed a tracking JIRA.

1 reaction
cb149 commented, Nov 12, 2021

@xushiyan I don’t think this needs severity critical; my pipelines are still running, just with worse performance.
