
add option to do a spark-submit with a SparkListener to gather events from Spark


I was at Emily Curtin’s Spark Summit Europe presentation today (which was very interesting). An attendee asked if Spark Bench gathered Spark executor metrics. A SparkListener (https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/scheduler/SparkListener.html) can be used to get benchmark data about how long was spent running tasks and how much data was shuffled; basically, any data that can be seen in the Spark UI could be picked up and summarised. A listener is registered at submit time:

    spark-submit --conf spark.extraListeners=com.mycompany.MetricsListener ...

https://github.com/LucaCanali/sparkMeasure has a Spark listener that gathers metrics, and https://github.com/groupon/sparklint also has one.
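
A minimal sketch of what such a listener might look like, counting scheduler events and printing a summary at shutdown (com.mycompany.MetricsListener above is a placeholder name, not an existing class):

    import java.util.concurrent.atomic.AtomicInteger
    import org.apache.spark.scheduler._

    // Registered via --conf spark.extraListeners=...; classes listed
    // there need a zero-arg (or SparkConf) constructor.
    class MetricsListener extends SparkListener {
      private val stageCount        = new AtomicInteger(0)
      private val taskCount         = new AtomicInteger(0)
      private val jobCount          = new AtomicInteger(0)
      private val executorAddCount  = new AtomicInteger(0)

      override def onStageCompleted(e: SparkListenerStageCompleted): Unit = stageCount.incrementAndGet()
      override def onTaskEnd(e: SparkListenerTaskEnd): Unit = taskCount.incrementAndGet()
      override def onJobEnd(e: SparkListenerJobEnd): Unit = jobCount.incrementAndGet()
      override def onExecutorAdded(e: SparkListenerExecutorAdded): Unit = executorAddCount.incrementAndGet()

      // Print a summary when the application finishes.
      override def onApplicationEnd(e: SparkListenerApplicationEnd): Unit =
        println(s"stageCount=${stageCount.get} taskCount=${taskCount.get} " +
          s"jobCount=${jobCount.get} executorAddCount=${executorAddCount.get}")
    }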

One possible design would be to

  • run spark-submit with a SparkListener that outputs the event data (e.g. as CSV)
  • run another Spark job to summarise the event data and include the summary metrics with the other benchmark data (both steps are sketched below)
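
A rough sketch of both steps, assuming an illustrative output path and CSV layout (neither is part of spark-bench):

    import java.io.{FileWriter, PrintWriter}
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // Step 1: a listener that appends one CSV row per finished task.
    class CsvTaskListener extends SparkListener {
      private val out = new PrintWriter(new FileWriter("/tmp/spark-bench-events.csv", true))

      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val info = taskEnd.taskInfo
        // taskMetrics can be null for failed tasks, so guard it.
        val resultSize = Option(taskEnd.taskMetrics).map(_.resultSize).getOrElse(0L)
        out.println(s"${taskEnd.stageId},${info.taskId},${info.duration},$resultSize")
        out.flush()
      }
    }

    // Step 2: a follow-up Spark job (e.g. in spark-shell) that summarises the event data.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder.appName("summarise-events").getOrCreate()
    val events = spark.read
      .option("inferSchema", "true")
      .csv("/tmp/spark-bench-events.csv")
      .toDF("stageId", "taskId", "durationMs", "resultSize")
    events.groupBy("stageId")
      .agg(count("*").as("tasks"), sum("durationMs").as("totalTaskTimeMs"))
      .show()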

Another approach would be to run Spark with spark.eventLog.enabled=true (and spark.eventLog.dir set) and parse the JSON-lines output. https://github.com/groupon/sparklint also has code that summarises event logs to create metrics.
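
For example (the directory and the app id below are illustrative):

    spark-submit \
      --conf spark.eventLog.enabled=true \
      --conf spark.eventLog.dir=file:/tmp/spark-events \
      ...

and then summarise the resulting log with Spark itself:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("summarise-eventlog").getOrCreate()
    // Event logs are JSON-lines: one JSON object per line, each with an "Event" field.
    val events = spark.read.json("file:/tmp/spark-events/local-1509741482934")
    events.groupBy("Event").count().show()
    // Per-task timings can be derived from the "Task Info" struct of
    // SparkListenerTaskEnd events (Finish Time minus Launch Time).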

Issue Analytics

  • State: open
  • Created: 6 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
pjfanning commented, Nov 3, 2017

I have a very early prototype at https://github.com/pjfanning/spark-bench/pull/2/files

Running bin/spark-bench.sh examples/minimal-example.conf on a distro with my change outputs:

+-------+-------------+-------------+------------------+-----+------+------+---+-----------------+-----------------+--------------------+----------------------------+--------------------+--------------------+-----------------+-----------------------+------------+------------------+-------------------------+-------------------+--------------------+
|   name|    timestamp|total_runtime|    pi_approximate|input|output|slices|run|spark.driver.host|spark.driver.port|spark.extraListeners|hive.metastore.warehouse.dir|          spark.jars|      spark.app.name|spark.executor.id|spark.submit.deployMode|spark.master|spark.authenticate|spark.authenticate.secret|       spark.app.id|         description|
+-------+-------------+-------------+------------------+-----+------+------+---+-----------------+-----------------+--------------------+----------------------------+--------------------+--------------------+-----------------+-----------------------+------------+------------------+-------------------------+-------------------+--------------------+
|sparkpi|1509741483169|   1468030834|3.1425311425311424|     |      |    10|  0|    192.168.1.100|            64309|com.ibm.sparktc.s...|        file:/Users/pj.fa...|file:/Users/pj.fa...|com.ibm.sparktc.s...|           driver|                 client|    local[*]|              true|            not.so.secret|local-1509741482934|One run of SparkP...|
+-------+-------------+-------------+------------------+-----+------+------+---+-----------------+-----------------+--------------------+----------------------------+--------------------+--------------------+-----------------+-----------------------+------------+------------------+-------------------------+-------------------+--------------------+

**** MetricsSparkListener ****
stageCount=2
taskCount=11
jobCount=2
executorAddCount=1
executorRemoveCount=0

The aim is to gather more metrics with the listener and to include them with the other benchmark data. This would involve writing the metric data to a file, having spark-bench read that data back, and extending the benchmark data with these additional metrics.
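
That last step could look roughly like this (the file names and the appId join key are purely illustrative, not part of the prototype):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("extend-results").getOrCreate()
    // Benchmark output and listener metrics, each keyed by an application id.
    val results = spark.read.option("header", "true").csv("/tmp/spark-bench-results.csv")
    val metrics = spark.read.option("header", "true").csv("/tmp/listener-metrics.csv")
    // A left join keeps every benchmark row even if no metrics were captured.
    val extended = results.join(metrics, Seq("appId"), "left")
    extended.write.option("header", "true").csv("/tmp/spark-bench-extended")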

0 reactions
xiandong79 commented, Nov 28, 2017

A CSV file recording the task durations of all tasks would be better.
