
add option to do a spark-submit with a SparkListener to gather events from Spark


I was at Emily Curtin’s Spark Summit Europe presentation today (which was very interesting). An attendee asked if Spark Bench gathered Spark executor metrics. A SparkListener (https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/scheduler/SparkListener.html) can be used to get benchmark data about how long was spent running tasks and how much data was shuffled; basically, any data that can be seen in the Spark UI could be picked up and summarised. A listener is registered at submit time:

    spark-submit --conf spark.extraListeners=com.mycompany.MetricsListener ...

https://github.com/LucaCanali/sparkMeasure has a Spark listener that gathers metrics, and https://github.com/groupon/sparklint also has one.
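
A minimal sketch of what such a listener might look like, counting scheduler events and printing a summary at shutdown (com.mycompany.MetricsListener above is a placeholder name, not an existing class):

    import java.util.concurrent.atomic.AtomicInteger
    import org.apache.spark.scheduler._

    // Registered via --conf spark.extraListeners=...; classes listed
    // there need a zero-arg (or SparkConf) constructor.
    class MetricsListener extends SparkListener {
      private val stageCount        = new AtomicInteger(0)
      private val taskCount         = new AtomicInteger(0)
      private val jobCount          = new AtomicInteger(0)
      private val executorAddCount  = new AtomicInteger(0)

      override def onStageCompleted(e: SparkListenerStageCompleted): Unit = stageCount.incrementAndGet()
      override def onTaskEnd(e: SparkListenerTaskEnd): Unit = taskCount.incrementAndGet()
      override def onJobEnd(e: SparkListenerJobEnd): Unit = jobCount.incrementAndGet()
      override def onExecutorAdded(e: SparkListenerExecutorAdded): Unit = executorAddCount.incrementAndGet()

      // Print a summary when the application finishes.
      override def onApplicationEnd(e: SparkListenerApplicationEnd): Unit =
        println(s"stageCount=${stageCount.get} taskCount=${taskCount.get} " +
          s"jobCount=${jobCount.get} executorAddCount=${executorAddCount.get}")
    }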

One possible design would be to

  • run spark-submit with a SparkListener that outputs the event data (e.g. as CSV)
  • run another Spark job to summarise the event data and include the summary metrics with the other benchmark data (both steps are sketched below)
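
A rough sketch of both steps, assuming an illustrative output path and CSV layout (neither is part of spark-bench):

    import java.io.{FileWriter, PrintWriter}
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // Step 1: a listener that appends one CSV row per finished task.
    class CsvTaskListener extends SparkListener {
      private val out = new PrintWriter(new FileWriter("/tmp/spark-bench-events.csv", true))

      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val info = taskEnd.taskInfo
        // taskMetrics can be null for failed tasks, so guard it.
        val resultSize = Option(taskEnd.taskMetrics).map(_.resultSize).getOrElse(0L)
        out.println(s"${taskEnd.stageId},${info.taskId},${info.duration},$resultSize")
        out.flush()
      }
    }

    // Step 2: a follow-up Spark job (e.g. in spark-shell) that summarises the event data.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder.appName("summarise-events").getOrCreate()
    val events = spark.read
      .option("inferSchema", "true")
      .csv("/tmp/spark-bench-events.csv")
      .toDF("stageId", "taskId", "durationMs", "resultSize")
    events.groupBy("stageId")
      .agg(count("*").as("tasks"), sum("durationMs").as("totalTaskTimeMs"))
      .show()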

Another approach would be to run Spark with spark.eventLog.enabled=true (and spark.eventLog.dir set) and parse the JSON-lines output. https://github.com/groupon/sparklint also has code that summarises event logs to create metrics.
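
For example (the directory and the app id below are illustrative):

    spark-submit \
      --conf spark.eventLog.enabled=true \
      --conf spark.eventLog.dir=file:/tmp/spark-events \
      ...

and then summarise the resulting log with Spark itself:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("summarise-eventlog").getOrCreate()
    // Event logs are JSON-lines: one JSON object per line, each with an "Event" field.
    val events = spark.read.json("file:/tmp/spark-events/local-1509741482934")
    events.groupBy("Event").count().show()
    // Per-task timings can be derived from the "Task Info" struct of
    // SparkListenerTaskEnd events (Finish Time minus Launch Time).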

Issue Analytics

  • State: open
  • Created: 6 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
pjfanning commented, Nov 3, 2017

I have a very early prototype at https://github.com/pjfanning/spark-bench/pull/2/files

Running bin/spark-bench.sh examples/minimal-example.conf on a distro with my change outputs:

+-------+-------------+-------------+------------------+-----+------+------+---+-----------------+-----------------+--------------------+----------------------------+--------------------+--------------------+-----------------+-----------------------+------------+------------------+-------------------------+-------------------+--------------------+
|   name|    timestamp|total_runtime|    pi_approximate|input|output|slices|run|spark.driver.host|spark.driver.port|spark.extraListeners|hive.metastore.warehouse.dir|          spark.jars|      spark.app.name|spark.executor.id|spark.submit.deployMode|spark.master|spark.authenticate|spark.authenticate.secret|       spark.app.id|         description|
+-------+-------------+-------------+------------------+-----+------+------+---+-----------------+-----------------+--------------------+----------------------------+--------------------+--------------------+-----------------+-----------------------+------------+------------------+-------------------------+-------------------+--------------------+
|sparkpi|1509741483169|   1468030834|3.1425311425311424|     |      |    10|  0|    192.168.1.100|            64309|com.ibm.sparktc.s...|        file:/Users/pj.fa...|file:/Users/pj.fa...|com.ibm.sparktc.s...|           driver|                 client|    local[*]|              true|            not.so.secret|local-1509741482934|One run of SparkP...|
+-------+-------------+-------------+------------------+-----+------+------+---+-----------------+-----------------+--------------------+----------------------------+--------------------+--------------------+-----------------+-----------------------+------------+------------------+-------------------------+-------------------+--------------------+

**** MetricsSparkListener ****
stageCount=2
taskCount=11
jobCount=2
executorAddCount=1
executorRemoveCount=0

The aim is to gather more metrics with the listener and to include them with the other benchmark data. This would involve writing the metric data to a file, having spark-bench read that data back, and extending the benchmark data with these additional metrics.
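
That last step could look roughly like this (the file names and the appId join key are purely illustrative, not part of the prototype):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("extend-results").getOrCreate()
    // Benchmark output and listener metrics, each keyed by an application id.
    val results = spark.read.option("header", "true").csv("/tmp/spark-bench-results.csv")
    val metrics = spark.read.option("header", "true").csv("/tmp/listener-metrics.csv")
    // A left join keeps every benchmark row even if no metrics were captured.
    val extended = results.join(metrics, Seq("appId"), "left")
    extended.write.option("header", "true").csv("/tmp/spark-bench-extended")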

0 reactions
xiandong79 commented, Nov 28, 2017

A CSV file recording the task durations of all tasks would be better.
