Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

NullPointerException in Spark rewrite_manifests procedure

See original GitHub issue

Using Iceberg snapshot version 0.13.0-20220307.001124-2 with Spark 3.2.

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 (TID 308) (10.103.53.2 executor 1): java.lang.NullPointerException
	at java.base/java.util.Objects.requireNonNull(Objects.java:221)
	at org.apache.iceberg.hadoop.HadoopMetricsContext.counter(HadoopMetricsContext.java:80)
	at org.apache.iceberg.aws.s3.S3OutputStream.<init>(S3OutputStream.java:135)
	at org.apache.iceberg.aws.s3.S3OutputFile.createOrOverwrite(S3OutputFile.java:60)
	at org.apache.iceberg.avro.AvroFileAppender.<init>(AvroFileAppender.java:51)
	at org.apache.iceberg.avro.Avro$WriteBuilder.build(Avro.java:191)
	at org.apache.iceberg.ManifestWriter$V1Writer.newAppender(ManifestWriter.java:281)
	at org.apache.iceberg.ManifestWriter.<init>(ManifestWriter.java:58)
	at org.apache.iceberg.ManifestWriter.<init>(ManifestWriter.java:34)
	at org.apache.iceberg.ManifestWriter$V1Writer.<init>(ManifestWriter.java:260)
	at org.apache.iceberg.ManifestFiles.write(ManifestFiles.java:117)
	at org.apache.iceberg.spark.actions.BaseRewriteManifestsSparkAction.writeManifest(BaseRewriteManifestsSparkAction.java:324)
	at org.apache.iceberg.spark.actions.BaseRewriteManifestsSparkAction.lambda$toManifests$afb7bc39$1(BaseRewriteManifestsSparkAction.java:354)
	at org.apache.spark.sql.Dataset.$anonfun$mapPartitions$1(Dataset.scala:2826)
	at org.apache.spark.sql.execution.MapPartitionsExec.$anonfun$doExecute$3(objects.scala:201)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

I think this may have been introduced in https://github.com/apache/iceberg/pull/4050. From a quick glance at the code, it seems the root cause is that HadoopMetricsContext#statistics is a transient field. If I understood things correctly, HadoopMetricsContext is initialized in S3FileIO:

  @Override
  public void initialize(Map<String, String> properties) {
    this.awsProperties = new AwsProperties(properties);

    // Do not override s3 client if it was provided
    if (s3 == null) {
      this.s3 = AwsClientFactories.from(properties)::s3;
    }

    // Report Hadoop metrics if Hadoop is available
    try {
      DynConstructors.Ctor<MetricsContext> ctor =
          DynConstructors.builder(MetricsContext.class).hiddenImpl(DEFAULT_METRICS_IMPL, String.class).buildChecked();
      this.metrics = ctor.newInstance("s3");

      metrics.initialize(properties);
    } catch (NoSuchMethodException | ClassCastException e) {
      LOG.warn("Unable to load metrics class: '{}', falling back to null metrics", DEFAULT_METRICS_IMPL, e);
    }
  }

Since FileIO is a broadcast variable in org.apache.iceberg.spark.actions.BaseRewriteManifestsSparkAction:

  private static ManifestFile writeManifest(
      List<Row> rows, int startIndex, int endIndex, Broadcast<FileIO> io,
      String location, int format, PartitionSpec spec, StructType sparkType) throws IOException {
...

It seems that HadoopMetricsContext is serialized along with it, and when it is deserialized on the executor, the statistics field is null because it is marked transient. I'm not entirely sure that's exactly what's happening, but the only way I could reproduce this outside of Spark was by calling HadoopMetricsContext#counter with the statistics field set to null, and the most obvious way for that to happen is after deserialization.
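
A minimal, self-contained sketch of the suspected mechanism (illustrative Java, not Iceberg code): a transient field is simply dropped during serialization, so the deserialized copy sees null and a requireNonNull-style check blows up, just like HadoopMetricsContext#counter does on the executor.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Objects;

// Illustrative only: mimics a class that initializes a transient field up front,
// the way HadoopMetricsContext initializes `statistics`.
public class TransientFieldDemo implements Serializable {
  private transient Object statistics = new Object();

  Object counter() {
    // Mirrors the Objects.requireNonNull call in HadoopMetricsContext#counter
    return Objects.requireNonNull(statistics, "statistics must not be null");
  }

  public static void main(String[] args) throws Exception {
    TransientFieldDemo original = new TransientFieldDemo();
    original.counter(); // fine: the field was set when the object was constructed

    // Round-trip through Java serialization, as happens when the object is shipped to executors
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(original);
    }
    TransientFieldDemo copy;
    try (ObjectInputStream in =
        new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
      copy = (TransientFieldDemo) in.readObject();
    }
    copy.counter(); // throws NullPointerException: transient fields are not serialized
  }
}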

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
jackye1995 commented, Mar 10, 2022

At first glance, my solution would be similar to the one we use to handle S3 client serialization in S3FileIO, which is to store a function like:

SerializableSupplier<FileSystem.Statistics> statisticsSupplier = () -> FileSystem.getStatistics(scheme, null);

and initialize the transient variable lazily on the first invocation.
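
A minimal sketch of that lazy-initialization pattern, assuming a serializable supplier type along the lines of the one S3FileIO already uses for its client (the class and field names here are illustrative, not the actual patch):

import java.io.Serializable;
import java.util.function.Supplier;

// Illustrative sketch of the proposed fix, not the actual Iceberg change:
// keep a serializable supplier and rebuild the transient field lazily.
public class LazyMetricsHolder implements Serializable {

  // Stand-in for a SerializableSupplier: a Supplier that is also Serializable
  interface SerializableSupplier<T> extends Supplier<T>, Serializable {}

  private final SerializableSupplier<Object> statisticsSupplier;
  private transient volatile Object statistics;

  LazyMetricsHolder(SerializableSupplier<Object> statisticsSupplier) {
    this.statisticsSupplier = statisticsSupplier;
  }

  // Works before and after (de)serialization: if the transient field was lost
  // on the executor, it is recreated from the supplier on first access.
  Object statistics() {
    if (statistics == null) {
      statistics = statisticsSupplier.get();
    }
    return statistics;
  }
}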

@singhpk234 let me know if you would like to take this issue

1 reaction
igorcalabria commented, Mar 9, 2022

I don’t think readObject is called when using KryoSerializer.

You can test this using a local standalone cluster: ./bin/spark-shell --master spark://localhost:7077 --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --repositories https://repository.apache.org/content/repositories/snapshots --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.0-SNAPSHOT

import org.apache.iceberg.hadoop.HadoopMetricsContext
import org.apache.iceberg.io.FileIOMetricsContext
import org.apache.iceberg.metrics.MetricsContext

val metricsContext = new HadoopMetricsContext("s3")

metricsContext.initialize(new java.util.HashMap[String, String]())
metricsContext.counter(FileIOMetricsContext.WRITE_BYTES, classOf[java.lang.Long], MetricsContext.Unit.BYTES) // WORKS
val broadcastContext =  spark.sparkContext.broadcast(metricsContext)
val data = Seq(("Thing"))
val rdd = spark.sparkContext.parallelize(data)
rdd.map { row => 
    broadcastContext.value.counter(FileIOMetricsContext.WRITE_BYTES, classOf[java.lang.Long], MetricsContext.Unit.BYTES).increment() // NULLPOINTER
}.collect()

Without Kryo, the previous snippet works fine.
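
To make the Kryo point concrete, here is a small sketch (illustrative, not Iceberg code): re-initializing a transient field inside readObject() only helps under Java serialization, because Kryo's default field serializer never invokes that hook.

import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.Serializable;

// Illustrative only: a readObject() hook restores the transient field when the
// object goes through java.io serialization, but Kryo's default FieldSerializer
// bypasses readObject(), so the field would still come back null on executors
// when spark.serializer is set to KryoSerializer.
public class ReadObjectHookDemo implements Serializable {
  private transient Object statistics = new Object();

  private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
    in.defaultReadObject();
    this.statistics = new Object(); // only runs under Java serialization
  }
}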

Read more comments on GitHub >

Top Results From Across the Web

  • NullPointerException in Spark RDD map when submitted as a ...
    I think that you get a NullPointerException thrown by the worker when it tries to access a SparkContext object that's only present on...
  • Spark Null Pointer Exception | Edureka Community
    I used Spark 1.5.2 with Hadoop 2.6 and had similar problems. Solved by doing the following steps: Download winutils.exe from the repository to ...
  • Intermittent NullPointerException when AQE is enabled
    When adaptive query execution (AQE) is enabled, and cluster scales down and loses shuffle data, you can get a `NullPointerException` error.
  • Resolving Spark 1.6.0 “java.lang.NullPointerException, not ...
    This issue is often caused by a missing winutils.exe file that Spark needs in order to initialize the Hive context, which in turn...
  • [#SPARK-38528] NullPointerException when selecting a ...
    NullPointerException when selecting a generator in a Stream of aggregate expressions ... flatMap(List.scala:366) at org.apache.spark.sql.catalyst.analysis.
