[SUPPORT] OCC locks with data on S3 and DynamoDB fails to acquire

Describe the problem you faced

I’m running several workloads in production, and one is a parallel add of partitions to a COW Hudi table. I’m managing OCC with DynamoDB, and the partition key in DynamoDB is the table name. I’m finding that each parallel instance waits for a lock and is blocked even though the partitions being updated are different. This compounds as the number of parallel writes/jobs increases, and you see things like the screenshot, where each subsequent job takes 1 minute more because it is blocked on a lock. (The Spark job takes about 1-2 min to run and then waits on the lock until the previous job completes, so the majority of the 1 hr duration is just waiting for the lock.)

First question: is this the designed/intended behaviour? Second question: should I be using the table partition key as the lock partition key? Currently, as per the docs, we use only the table name, not the table partition, for the lock.

Env: Hudi v0.11, EMR 6.6.0, Spark 3.2.0

To Reproduce

Steps to reproduce the behavior:

Write a Spark job with OCC enabled that writes data to a table on S3.
Run multiple instances of the job, each ingesting different data into different partitions (this could be append-only to new partitions).
The more concurrent jobs and the more data you have, the longer each job holds the lock and the longer newer jobs wait for it.
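The writer setup in the first step can be sketched roughly as follows. This is a minimal PySpark-style sketch, not the reporter's actual job: the helper `occ_dynamodb_lock_options`, the table name `events`, the lock table `hudi_locks`, and the region are all hypothetical. The lock partition key is set to the Hudi table name, which is the setup described above.

```python
def occ_dynamodb_lock_options(hudi_table: str, lock_table: str, region: str) -> dict:
    """Build Hudi writer options for OCC backed by a DynamoDB lock table.

    All argument values are placeholders chosen for illustration.
    """
    return {
        "hoodie.table.name": hudi_table,
        # Enable optimistic concurrency control for multi-writer setups.
        "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
        "hoodie.cleaner.policy.failed.writes": "LAZY",
        "hoodie.write.lock.provider":
            "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
        "hoodie.write.lock.dynamodb.table": lock_table,
        # One lock per Hudi table: every writer to this table contends on the
        # same key, regardless of which data partitions it touches.
        "hoodie.write.lock.dynamodb.partition_key": hudi_table,
        "hoodie.write.lock.dynamodb.region": region,
    }


options = occ_dynamodb_lock_options("events", "hudi_locks", "us-east-1")

# With a SparkSession and a DataFrame `df` available, each job instance
# would then write along these lines (the S3 path is a placeholder):
# df.write.format("hudi").options(**options).mode("append").save("s3://<bucket>/<path>")
```

Because every instance passes the same partition key value, all concurrent jobs compete for a single lock entry, whichever table partitions they write to.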

Expected behavior

The lock should be held for only a short time if ingestion affects unrelated partitions.

Environment Description

Hudi version: 0.11.0
Spark version: 3.2.0
Hive version: 3.1.2
Hadoop version: 3.2.1
Storage (HDFS/S3/GCS…): S3
Running on Docker? (yes/no): no (AWS EMR 6.6.0)

Additional context

Why would the lock object be null? The default timeout is 60s, but this sometimes happens after 20 min, and sometimes after 1 hr.
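One possible reason the failure shows up much later than 60s is that the per-attempt timeout may be applied several times before the writer gives up. The arithmetic below is only a sketch under that assumption: the retry model, the function name `worst_case_lock_wait_ms`, and the sample values for retries and pause are hypothetical, and nothing in this thread confirms them. The mapping to Hudi settings such as `hoodie.write.lock.wait_time_ms` and the client retry options should be checked against the Hudi configuration reference.

```python
def worst_case_lock_wait_ms(wait_time_ms: int, client_retries: int,
                            pause_between_retries_ms: int) -> int:
    """Upper bound on time spent trying to acquire a lock.

    Assumed model (not confirmed by the issue): the writer makes
    `client_retries + 1` attempts, each attempt blocks for up to
    `wait_time_ms`, and each failed attempt is followed by a pause.
    """
    attempts = client_retries + 1
    return attempts * (wait_time_ms + pause_between_retries_ms)


# 60s per attempt comes from the issue; the other two values are made up.
total_ms = worst_case_lock_wait_ms(wait_time_ms=60_000,
                                   client_retries=2,
                                   pause_between_retries_ms=5_000)
print(f"worst case: {total_ms / 60_000:.2f} min")
```

Under this model, the total wait grows linearly with the retry count, so a 60s per-attempt timeout does not by itself bound the time until `HoodieLockException` is thrown.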

Stacktrace

ERROR Client: Application diagnostics message: User class threw exception: org.apache.hudi.exception.HoodieLockException: Unable to acquire lock, lock object null
	at org.apache.hudi.client.transaction.lock.LockManager.lock(LockManager.java:82)
	at org.apache.hudi.client.transaction.TransactionManager.beginTransaction(TransactionManager.java:53)
	at org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:230)
	at org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:122)
	at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:650)
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:313)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:163)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:115)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
	at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)

Issue Analytics

Created a year ago
Comments: 11 (9 by maintainers)

Top GitHub Comments

zhedoubushishi commented, Aug 4, 2022

Got it. Can you provide more information on how to reproduce this issue? For example, what is the size of the Hudi table? Do you see this slowdown even with a small number of concurrent jobs? Also, it would be good to open a ticket with the AWS EMR team if you have concerns about sharing logs publicly.

nsivabalan commented, Nov 7, 2022

@atharvai: do you have any updates for us? Or, if you got the issue resolved, let us know how you went about resolving it, so that it can help others in the community.