Spark Job marked as success when data is still being written to GCS
When using Spark on Kubernetes and the latest jar
https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar
(I don't know what version this corresponds to),
I have a Spark job that writes about 10GB of data to GCS using a DataFrame write:
df.write.json(path_to_gcs_bucket)
The job and its stages complete, but I can still see part files being written in the background:
gs://mybucket/output/ZGM0YTg3Nzk2NDEwY2ViY2FhNTYwZTZi/part-00124-e86f3a48-72f7-4bf7-bdc4-328e97cdc7b1-c000.json
The job is marked as a success while GCS writes are still going on in the background. The job stage should reflect these ongoing writes and not be marked as a success until they have finished.
Once the writes have completed, the Spark context stop() is reached and the job terminates.
Using Spark on Kubernetes 2.4.0 on GKE 1.11.5-gke.5.
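For reference, a minimal sketch of the job structure described above, assuming PySpark; the input path, output bucket, and app name are placeholders, not values from the original report:

```python
from pyspark.sql import SparkSession

# Minimal reproduction sketch, assuming PySpark 2.4 on Kubernetes.
# Input path, output bucket, and app name are placeholders.
spark = SparkSession.builder.appName("gcs-write-repro").getOrCreate()

# Read roughly 10GB of source data (placeholder path).
df = spark.read.parquet("gs://my-input-bucket/source/")

# The write job/stage is reported as complete here, even though part files
# keep appearing in the bucket afterwards.
df.write.json("gs://mybucket/output/")

# stop() is only reached once the background GCS writes actually finish.
spark.stop()
```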
Interesting, could you share the job itself or a simplified version of it that reproduces the issue? I will try to reproduce it and debug what's going on.
The final write job is shown below: the job was submitted at 11:23:08 and finished 3.6 minutes later at 11:26:44, shown under the completed section of the UI. The bucket shows files being generated much later, up to 11:35:38.
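One way to compare the objects' creation times against the Spark UI's reported completion time is to list the output prefix with its timestamps; a sketch assuming the google-cloud-storage Python client and the bucket/prefix shown above:

```python
from google.cloud import storage

# List objects under the output prefix with their creation times,
# to check for writes that land after the job was marked complete.
client = storage.Client()
for blob in client.list_blobs("mybucket", prefix="output/"):
    print(blob.time_created, blob.name)
```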