[SUPPORT] OCC locks with data on S3 and DynamoDB fails to acquire

Describe the problem you faced

I’m running several workloads in production, and one is a parallel add of partitions to a COW Hudi table. I’m managing OCC with DynamoDB, and the lock partition key in DynamoDB is the table name. I’m finding that each parallel instance waits for a lock and is blocked even though the partitions being updated are different. This compounds as the number of parallel writes/jobs increases, and you see things like the screenshot, where each subsequent job takes 1 minute more because it is blocked on a lock. (The Spark job takes about 1-2 min to run and then waits on the lock until the previous job completes, so the majority of the 1 hr duration is just waiting for the lock.)
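The pattern described above (each subsequent job finishing about a minute later) is what a single shared lock would produce if every writer holds it for roughly the same time. The following is only a toy model of that queueing effect, not Hudi's implementation; `finish_times`, `compute_min`, and `lock_hold_min` are hypothetical names chosen for illustration.

```python
def finish_times(num_jobs, compute_min, lock_hold_min):
    """Toy model: all jobs start at t=0 and compute in parallel,
    then commit one at a time while holding a single shared lock.
    Returns the finish time (in minutes) of each job."""
    lock_free_at = 0.0
    finished = []
    for _ in range(num_jobs):
        ready = compute_min                 # Spark work done, job now wants the lock
        start = max(ready, lock_free_at)    # wait if another job still holds the lock
        lock_free_at = start + lock_hold_min
        finished.append(lock_free_at)
    return finished


# Five concurrent jobs, 2 min of compute each, lock held for 1 min per commit.
print(finish_times(num_jobs=5, compute_min=2.0, lock_hold_min=1.0))
```

Under these assumed numbers the jobs finish at 3, 4, 5, 6, and 7 minutes, so the delay grows linearly with the number of writers even though the compute phase is fully parallel.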

First question: is this designed/intended behaviour? Second question: should I be using the table partition key as the lock partition key? Currently, as per the docs, we use the table name only, not the table partition, for the lock.

Env: Hudi v0.11, EMR 6.6.0, Spark 3.2.0

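[Screenshot: job run durations, with each subsequent job taking about 1 minute longer than the previous one]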

To Reproduce

Steps to reproduce the behavior:

  1. Write a Spark job with OCC enabled that writes data to a table on S3.
  2. Run multiple instances of the job, each ingesting different data into different partitions (this could be append-only to new partitions).
  3. The more concurrent jobs and the more data you have, the longer each job holds the lock and the longer newer jobs wait for it.
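For readers trying to reproduce the setup in the steps above, the sketch below shows one way the writer options might look when OCC is enabled with the DynamoDB-based lock provider. The option keys are Hudi configuration names; the helper `hudi_occ_options` and the table name, lock table, and region values are placeholders, not the reporter's actual configuration.

```python
def hudi_occ_options(table_name, lock_table, region, lock_partition_key=None):
    """Build Hudi writer options for OCC with a DynamoDB lock.

    By default the lock partition key is the Hudi table name, which
    matches the setup described in the issue (one lock per table).
    """
    return {
        "hoodie.table.name": table_name,
        "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
        "hoodie.cleaner.policy.failed.writes": "LAZY",
        "hoodie.write.lock.provider":
            "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
        "hoodie.write.lock.dynamodb.table": lock_table,
        "hoodie.write.lock.dynamodb.partition_key": lock_partition_key or table_name,
        "hoodie.write.lock.dynamodb.region": region,
    }


# Placeholder values for illustration only.
opts = hudi_occ_options("my_table", "hudi_locks", "us-east-1")

# In a PySpark job these would typically be passed to the writer, e.g.:
#   df.write.format("hudi").options(**opts).mode("append").save(base_path)
```

Because every writer in this sketch shares the same lock partition key, all commits to the table contend for the same lock regardless of which data partitions they touch.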

Expected behavior

The lock should be held only for a short time if the ingestion affects unrelated partitions.

Environment Description

  • Hudi version: 0.11.0
  • Spark version: 3.2.0
  • Hive version: 3.1.2
  • Hadoop version: 3.2.1
  • Storage (HDFS/S3/GCS…): S3
  • Running on Docker? (yes/no): no (AWS EMR 6.6.0)

Additional context

Why would the lock object be null? The default timeout is 60s, but this seems to happen after 20 min sometimes, or sometimes after 1 hr.
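One possible reason a failure can surface long after a 60s timeout is that the timeout applies per acquisition attempt and attempts are retried. The arithmetic below is a rough sketch under that assumption, using a simple "try for the timeout, sleep, try again" loop; it is not taken from Hudi's source. The function and parameter names are hypothetical, and the attempt count is illustrative rather than a documented default. The related Hudi settings, to check against the docs for your version, include `hoodie.write.lock.wait_time_ms`, `hoodie.write.lock.num_retries`, and `hoodie.write.lock.wait_time_ms_between_retry`.

```python
def worst_case_wait_s(acquire_timeout_s, attempts, sleep_between_s=0.0):
    """Upper bound on time spent before giving up, assuming each attempt
    blocks for the full timeout and attempts are separated by a sleep."""
    return attempts * acquire_timeout_s + max(attempts - 1, 0) * sleep_between_s


# A 60 s per-attempt timeout with 20 attempts (illustrative count) and no sleep:
print(worst_case_wait_s(60, 20) / 60)  # minutes
```

With these assumed values the writer would give up after 20 minutes, which is in the range the issue reports.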

Stacktrace

ERROR Client: Application diagnostics message: User class threw exception: org.apache.hudi.exception.HoodieLockException: Unable to acquire lock, lock object null
	at org.apache.hudi.client.transaction.lock.LockManager.lock(LockManager.java:82)
	at org.apache.hudi.client.transaction.TransactionManager.beginTransaction(TransactionManager.java:53)
	at org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:230)
	at org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:122)
	at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:650)
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:313)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:163)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:115)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
	at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 11 (9 by maintainers)

Top GitHub Comments

1 reaction
zhedoubushishi commented, Aug 4, 2022

Got it. Can you provide more information on how to reproduce this issue? For example, what is the size of the Hudi table? Do you see this slowdown even with a small number of concurrent jobs? It would also be good to open a ticket with the AWS EMR team if you have concerns about sharing logs publicly.

0 reactions
nsivabalan commented, Nov 7, 2022

@atharvai: do you have any updates for us? If you got the issue resolved, let us know how you went about resolving it so that it can help others in the community.
