[SUPPORT] OCC lock with data on S3 and DynamoDB fails to acquire
Describe the problem you faced
I’m running several workloads in production, and one of them is a parallel add of partitions to a COW Hudi table. I’m managing OCC with DynamoDB, and the partition key in DynamoDB is the table name. I’m finding that each parallel instance waits for a lock and is blocked even though the partitions being updated are different. This compounds as the number of parallel writes/jobs increases, and you see things like the screenshot where each subsequent job takes one minute longer because it is blocked on the lock. (The Spark job takes about 1–2 minutes to run and then waits on the lock until the previous job completes, so the majority of the 1 hr duration is just waiting for the lock.)
First question: is this designed/intended behaviour? Second question: should I be using the table partition key as the lock partition key? Currently, as per the docs, we use the table name only, not the table partition, for the lock.
Env: Hudi 0.11, EMR 6.6.0, Spark 3.2.0
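For context, below is roughly how the lock provider is configured on each writer, following the Hudi DynamoDB lock docs. This is only a sketch: the DynamoDB lock table name, region, and Hudi table name are placeholders, not the real values; the option keys themselves are the standard Hudi 0.11 DynamoDB lock provider settings.

```scala
// Sketch of the OCC / DynamoDB lock configuration (placeholder values).
val lockOptions: Map[String, String] = Map(
  "hoodie.write.concurrency.mode"       -> "optimistic_concurrency_control",
  "hoodie.cleaner.policy.failed.writes" -> "LAZY",
  "hoodie.write.lock.provider" ->
    "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
  "hoodie.write.lock.dynamodb.table"  -> "hudi-locks",
  "hoodie.write.lock.dynamodb.region" -> "us-east-1",
  // Per the docs, the lock partition key is the Hudi table name,
  // not a data partition -- hence the second question above.
  "hoodie.write.lock.dynamodb.partition_key" -> "my_cow_table"
)
```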
To Reproduce
Steps to reproduce the behavior:
- Write a Spark job with OCC enabled that writes data to a table on S3.
- Run multiple instances of the job with different data being ingested into different partitions; this could be append-only writes to new partitions (see the sketch after this list).
- The more concurrent jobs and the more data you have, the longer each job holds the lock and the longer newer jobs wait for it.
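A minimal sketch of such a writer job is shown below, assuming a COW table on S3 and the DynamoDB lock configuration from the sketch above; the bucket, table, record-key, and partition-field names (my-bucket, my_cow_table, id, dt) are hypothetical.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Each concurrent instance is launched with a different input path, so it only
// appends to its own data partitions; all instances still contend on one lock.
object ConcurrentHudiWriter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hudi-occ-writer").getOrCreate()
    val df = spark.read.parquet(args(0)) // e.g. s3://my-bucket/incoming/dt=2022-06-01/

    df.write.format("hudi")
      .option("hoodie.table.name", "my_cow_table")
      .option("hoodie.datasource.write.operation", "insert")
      .option("hoodie.datasource.write.recordkey.field", "id")
      .option("hoodie.datasource.write.partitionpath.field", "dt")
      // OCC + DynamoDB lock provider, same settings as the lockOptions sketch above
      .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
      .option("hoodie.cleaner.policy.failed.writes", "LAZY")
      .option("hoodie.write.lock.provider",
        "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider")
      .option("hoodie.write.lock.dynamodb.table", "hudi-locks")
      .option("hoodie.write.lock.dynamodb.partition_key", "my_cow_table")
      .option("hoodie.write.lock.dynamodb.region", "us-east-1")
      .mode(SaveMode.Append)
      .save("s3://my-bucket/tables/my_cow_table/")
  }
}
```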
Expected behavior
The lock should only be held for a short time when concurrent ingestion affects unrelated partitions.
Environment Description
- Hudi version : 0.11.0
- Spark version : 3.2.0
- Hive version : 3.1.2
- Hadoop version : 3.2.1
- Storage (HDFS/S3/GCS…) : S3
- Running on Docker? (yes/no) : no (AWS EMR 6.6.0)
Additional context
Why would the lock object be null? The default timeout is 60 s, but this sometimes happens only after 20 minutes, or even after 1 hour.
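For reference, below is a sketch of the standard Hudi lock acquire/retry settings that determine how long LockManager keeps retrying before giving up with the exception below. The values shown are illustrative only, not our actual settings; the per-attempt wait combined with the retry count is presumably what lets a job block for far longer than the 60 s single-attempt timeout.

```scala
// Illustrative values only -- standard Hudi lock acquire/retry knobs.
// Total blocking time is roughly (per-attempt wait + backoff) * retries,
// which may explain failures long after the 60 s per-attempt timeout.
val lockRetryOptions: Map[String, String] = Map(
  "hoodie.write.lock.wait_time_ms"                     -> "60000", // per-attempt wait
  "hoodie.write.lock.num_retries"                      -> "15",    // attempts before giving up
  "hoodie.write.lock.wait_time_ms_between_retry"       -> "5000",  // backoff between attempts
  "hoodie.write.lock.client.num_retries"               -> "10",
  "hoodie.write.lock.client.wait_time_ms_between_retry" -> "5000"
)
```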
Stacktrace
ERROR Client: Application diagnostics message: User class threw exception: org.apache.hudi.exception.HoodieLockException: Unable to acquire lock, lock object null
at org.apache.hudi.client.transaction.lock.LockManager.lock(LockManager.java:82)
at org.apache.hudi.client.transaction.TransactionManager.beginTransaction(TransactionManager.java:53)
at org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:230)
at org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:122)
at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:650)
at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:313)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:163)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:115)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
Top GitHub Comments
Got it. Can you provide more information on how to reproduce this issue? For example, what is the size of the Hudi table? Are you seeing this slowdown even with a small number of concurrent jobs? It would also be good to open a ticket with the AWS EMR team if you have concerns about sharing logs publicly.
@atharvai: do you have any updates for us? Or, if you got the issue resolved, let us know how you went about resolving it, so that it can help others in the community.