Spark Job marked as success when data is still being written to GCS


I am using Spark on Kubernetes with the latest GCS connector jar:

https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar (I don't know which version this corresponds to)

I have a Spark job that writes about 10 GB of data to GCS using a DataFrame write:

 df.write.json(path_to_gcs_bucket)

The job and its stage are shown as complete in the Spark UI, but I can still see part files being written in the background:

gs://mybucket/output/ZGM0YTg3Nzk2NDEwY2ViY2FhNTYwZTZi/part-00124-e86f3a48-72f7-4bf7-bdc4-328e97cdc7b1-c000.json

The job is marked as successful while GCS writes are still going on in the background. The job and stage status should reflect the ongoing writes, and the job should not be marked as successful until they finish.

Only once the writes have completed does the application reach the Spark context stop() call, and the job terminates.
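A downstream consumer can guard against reading partial output by waiting for a completion marker rather than relying on the job status in the UI. The sketch below assumes the default Hadoop FileOutputCommitter behavior of writing an empty `_SUCCESS` object into the output directory when the job commit finishes; whether that applies to this setup is not confirmed in the issue. The function name `is_output_committed` and the idea of passing in a pre-fetched object listing are hypothetical, chosen so the example runs without a cluster or a storage client.

```python
def is_output_committed(object_names, output_prefix):
    """Return True if the listing contains a `_SUCCESS` marker
    directly under output_prefix.

    object_names: iterable of object names relative to the bucket,
                  obtained from whatever listing call you use (assumption).
    output_prefix: the output directory, e.g. "output/ZGM0...".
    """
    marker = output_prefix.rstrip("/") + "/_SUCCESS"
    return marker in set(object_names)


# Hypothetical listing taken while part files are still appearing.
prefix = "output/ZGM0YTg3Nzk2NDEwY2ViY2FhNTYwZTZi"
in_progress = [
    prefix + "/part-00124-e86f3a48-72f7-4bf7-bdc4-328e97cdc7b1-c000.json",
]
# The same listing once the marker has been written.
committed = in_progress + [prefix + "/_SUCCESS"]

print(is_output_committed(in_progress, prefix))
print(is_output_committed(committed, prefix))
```

The check is deliberately a pure function over a list of names, so it can be paired with any listing mechanism and polled until it returns True.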

Environment: Spark 2.4.0 on Kubernetes, GKE 1.11.5-gke.5.

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

medb commented, Jan 24, 2019

Interesting. Could you share the job itself, or a simplified version of it that reproduces the issue? I will try to reproduce it and debug what's going on.

gridcellcoder commented, Jan 24, 2019

The final write job was submitted at 11:23:08 and finished 3.6 minutes later, at 11:26:44. It is shown under the completed section of the Spark UI.

The bucket shows files still being generated much later, up to 11:35:38.
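To make the size of the gap concrete, the timestamps quoted above can be subtracted directly. This is only arithmetic on the three times reported in the comment; the variable names are illustrative.

```python
from datetime import datetime

FMT = "%H:%M:%S"

# Times as reported in the comment above.
submitted = datetime.strptime("11:23:08", FMT)
finished = datetime.strptime("11:26:44", FMT)   # job marked complete in the UI
last_file = datetime.strptime("11:35:38", FMT)  # last object seen in the bucket

job_duration = finished - submitted
write_lag = last_file - finished

print(job_duration.total_seconds() / 60)  # 3.6 (minutes)
print(write_lag)                          # 0:08:54
```

Files kept appearing for almost nine minutes after the job was reported as complete, more than twice the reported job duration.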
