Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Vacuum fails with Delta version 1.0.0 on AWS Glue 3.0 (Spark 3.1.1)

See original GitHub issue

Steps to reproduce

Set up an AWS Glue 3.0 job (which runs Spark 3.1.1) and configure the following JARs / Python libraries (a session-setup sketch follows the list):

  • delta-core_2.12-1.0.0.jar (should be compatible with Spark 3.1.x)
  • delta-spark == 1.0.0
  • pyspark == 3.1.1
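For context, a minimal sketch of how such a session might be configured for Delta 1.0.0. The two settings are the standard ones from the Delta Lake 1.x docs; the sketch assumes delta-core_2.12-1.0.0.jar is already on the job's classpath (e.g. via the Glue job's dependent JARs):

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Standard Delta Lake 1.x session settings; the Delta JAR itself must
# already be on the classpath (not shown here).
conf = (
    SparkConf()
    .set("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .set("spark.sql.catalog.spark_catalog",
         "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark_session = SparkSession.builder.config(conf=conf).getOrCreate()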

The following lines lead to an error (Stream is corrupted) when executed in the AWS Glue 3.0 environment:

# Open the existing Delta table and vacuum with a 0-hour retention,
# i.e. delete every file no longer referenced by the current snapshot.
delta_table = DeltaTable.forPath(spark_session, data_path)
delta_table.vacuum(0)

We also tried different retention periods, with the same results.
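(Incidentally, a 0-hour retention is normally rejected by Delta's retention-duration safety check, so a repro like the one above presumably also disables that check first. A sketch, assuming the spark_session from above:)

# Delta refuses vacuum retention periods under the default 168 hours
# unless this safety check is disabled, so vacuum(0) implies this.
spark_session.conf.set(
    "spark.databricks.delta.retentionDurationCheck.enabled", "false"
)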

Everything works fine with Glue 2.0 (Spark 2.4.3) and delta-core_2.11-0.6.1.jar, so our setup should be good. Is Glue 3.0 compatible with Delta 1.0.0?

Stack Trace:

2021-11-22 15:42:53,873 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(73)): Error from Python:
Traceback (most recent call last):
  File "/tmp/vacuum.py", line 23, in <module>
    main()
  File "/tmp/vacuum.py", line 19, in main
    delta_table.vacuum(0)
  File "/tmp/delta_spark-1.0.0-py3-none-any.whl/delta/tables.py", line 211, in vacuum
    return DataFrame(jdt.vacuum(float(retentionHours)), self._spark._wrapped)
  File "/tmp/py4j-0.10.9-py2.py3-none-any.whl/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/tmp/py4j-0.10.9-py2.py3-none-any.whl/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o403.vacuum.
: org.apache.spark.SparkException: Job aborted due to stage failure: ResultStage 17 ($anonfun$withThreadLocalCaptured$1 at FutureTask.java:266) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: Stream is corrupted
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:772)
    at org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:845)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
    at java.io.DataInputStream.readInt(DataInputStream.java:387)
    at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.readSize(UnsafeRowSerializer.scala:113)
    at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:129)
    at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:110)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:494)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
    at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
    at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:351)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Stream is corrupted
    at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:200)
    at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:226)
    at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157)
    at org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:841)
    ... 25 more
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2465)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2414)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2413)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2413)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1871)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2676)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2621)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2610)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:914)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2238)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2259)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2278)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2303)
    at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
    at org.apache.spark.sql.execution.SparkPlan.executeCollectIterator(SparkPlan.scala:402)
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.org$apache$spark$sql$execution$exchange$BroadcastExchangeExec$$doComputeRelation(BroadcastExchangeExec.scala:178)
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anon$1.doCompute(BroadcastExchangeExec.scala:171)
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anon$1.doCompute(BroadcastExchangeExec.scala:167)
    at org.apache.spark.sql.execution.AsyncDriverOperation.$anonfun$compute$1(AsyncDriverOperation.scala:73)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:224)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withExecutionId$1(SQLExecution.scala:207)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
    at org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:204)
    at org.apache.spark.sql.execution.AsyncDriverOperation.compute(AsyncDriverOperation.scala:67)
    at org.apache.spark.sql.execution.AsyncDriverOperation.$anonfun$computeFuture$1(AsyncDriverOperation.scala:53)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:275)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
    at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.checkNoFailures(AdaptiveExecutor.scala:147)
    at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.doRun(AdaptiveExecutor.scala:88)
    at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.tryRunningAndGetFuture(AdaptiveExecutor.scala:66)
    at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.execute(AdaptiveExecutor.scala:57)
    at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:184)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
    at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:183)
    at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeToIterator(AdaptiveSparkPlanExec.scala:416)
    at org.apache.spark.sql.Dataset.$anonfun$toLocalIterator$1(Dataset.scala:3036)
    at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3724)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
    at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
    at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3722)
    at org.apache.spark.sql.Dataset.toLocalIterator(Dataset.scala:3034)
    at org.apache.spark.sql.delta.commands.VacuumCommandImpl.delete(VacuumCommand.scala:325)
    at org.apache.spark.sql.delta.commands.VacuumCommandImpl.delete$(VacuumCommand.scala:309)
    at org.apache.spark.sql.delta.commands.VacuumCommand$.delete(VacuumCommand.scala:49)
    at org.apache.spark.sql.delta.commands.VacuumCommand$.$anonfun$gc$1(VacuumCommand.scala:239)
    at com.databricks.spark.util.DatabricksLogging.recordOperation(DatabricksLogging.scala:77)
    at com.databricks.spark.util.DatabricksLogging.recordOperation$(DatabricksLogging.scala:67)
    at org.apache.spark.sql.delta.commands.VacuumCommand$.recordOperation(VacuumCommand.scala:49)
    at org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperation(DeltaLogging.scala:106)
    at org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperation$(DeltaLogging.scala:91)
    at org.apache.spark.sql.delta.commands.VacuumCommand$.recordDeltaOperation(VacuumCommand.scala:49)
    at org.apache.spark.sql.delta.commands.VacuumCommand$.gc(VacuumCommand.scala:101)
    at io.delta.tables.execution.DeltaTableOperations.executeVacuum(DeltaTableOperations.scala:74)
    at io.delta.tables.execution.DeltaTableOperations.executeVacuum$(DeltaTableOperations.scala:70)
    at io.delta.tables.DeltaTable.executeVacuum(DeltaTable.scala:42)
    at io.delta.tables.DeltaTable.vacuum(DeltaTable.scala:99)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 10 (3 by maintainers)

Top GitHub Comments

1 reaction
fibona44i commented, Feb 9, 2022

I think this is related to the following issue: https://issues.apache.org/jira/browse/SPARK-34790 ("Fail in fetch shuffle blocks in batch when i/o encryption is enabled"). And Glue 3.0 is using Spark 3.1.1 😦

As a workaround we disabled fetching shuffle blocks in batch (https://github.com/hezuojiao/spark/blob/d824a9b36d41154d15c54925be440ba92759f599/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L495):

sparkConf.set("spark.sql.adaptive.fetchShuffleBlocksInBatch", "false")
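In a PySpark Glue job, applying that workaround would look something like the sketch below; it assumes the conf is set before the session is created, using the key named in the comment above:

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Workaround for SPARK-34790: batched shuffle-block fetches can report
# "Stream is corrupted" when I/O encryption is enabled, so disable them.
conf = SparkConf().set("spark.sql.adaptive.fetchShuffleBlocksInBatch", "false")
spark_session = SparkSession.builder.config(conf=conf).getOrCreate()

Setting the same key through the Glue job's Spark configuration parameters should presumably work equally well.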

0 reactions
fibona44i commented, Mar 31, 2022

I already explained the solution in the following comment: https://github.com/delta-io/delta/issues/841#issuecomment-1034159558

Read more comments on GitHub

Top Results From Across the Web

  • Migrating AWS Glue jobs to AWS Glue version 4.0
    This topic describes the changes between AWS Glue versions 0.9, 1.0, 2.0, and 3.0 to allow you to migrate your Spark applications and...

  • Facing issue in AWS glue, While trying Delta operations
    I am attempting to use the update/Delete/Upsert operation in Pyspark with AWS Glue. I have instantiated spark with below configs:

  • Databricks Runtime 8.2 (Unsupported)
    Group ID / Artifact ID / Version: antlr antlr 2.7.7, com.amazonaws amazon-kinesis-client 1.12.0, com.amazonaws aws-java-sdk-autoscaling 1.11.655

  • Migrating Amazon Glue jobs to Amazon Glue version 3.0
    This topic describes the changes between Amazon Glue versions 0.9, 1.0, 2.0 and 3.0 to allow you to migrate your Spark applications and...

  • Databricks Runtime 9.1 LTS - Azure - Microsoft Learn
    Release notes about Databricks Runtime 9.1 LTS, powered by Apache Spark. ... evolution is enabled by setting spark.databricks.delta.schema.


