Improving Jupyter for Spark
Exploring where Jupyter protocols, formats, and UI need to adapt to make "integration with Spark" seamless. The word "Spark" could likely be replaced with "TensorFlow" or "Dask" throughout this issue.
This is a culmination of feedback, in-progress initiatives, and further R&D that needs to be done to improve how Jupyter provides a wonderful ecosystem for Spark, Hadoop, and other cluster tooling.
Asynchronous/Background Outputs
Environments like Spark and Dask show a progress bar when one or more jobs are submitted, along with a way to cancel the overall job. They also provide links to an external UI (the Spark UI or Dask UI). When the job finishes, you're able to see the resulting table view.
There’s a similar use case for Hive Jobs, Presto Jobs, etc.
As far as Jupyter is concerned, there are things we can do to make updating a display like this easier without having to use widgets (a sketch follows the list below):
- Routing background outputs: https://github.com/ipython/ipython/issues/9969
- Ability to update prior displays: https://github.com/jupyter/jupyter_client/issues/209
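For the display-update half, IPython's `display_id`/`update_display` API (backed by the `update_display_data` message) already shows the shape this could take. A minimal sketch, assuming an IPython kernel and a frontend that handles display updates:

```python
import time
from IPython.display import display

# Publish an initial display; display_id=True asks IPython to mint an id so
# this output area can be targeted by later update_display_data messages.
handle = display("Job submitted...", display_id=True)

# A job callback can then rewrite that same output area in place instead of
# appending new outputs below the cell.
for pct in (25, 50, 75, 100):
    time.sleep(0.5)  # stand-in for real job progress
    handle.update(f"Job running: {pct}% complete")
```

Routing outputs produced on background threads to the right output area (the first issue above) is the harder half and still needs messaging-level work.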
Progress Bar Primitives
As discussed above in the Async Outputs section, we should seek to provide simple, clean-cut library tooling for creating progress bars in the frontend. This is not meant to replace e.g. tqdm, but to provide the necessary additions to the messaging spec (whether an actual message or a specific versioned mimetype).
The nested structure for Spark Jobs is something we can likely represent with a clean format for nested content with progress bars and external links.
Approaches:
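One possibility, purely a sketch rather than an agreed-upon spec, is a small versioned mimetype that progress-aware frontends render natively and others fall back to plain text for. The mimetype name and payload fields below are hypothetical:

```python
from IPython.display import display

# Hypothetical versioned mimetype for nested job progress -- nothing in the
# messaging spec defines this today; it only illustrates the shape.
PROGRESS_MIME = "application/vnd.jupyter.progress.v1+json"

payload = {
    "label": "Spark job 3",
    "value": 40,
    "max": 200,                          # 40 of 200 tasks finished
    "link": "http://driver-host:4040",   # external UI (Spark UI / Dask dashboard)
    "children": [                        # nested stages, same shape recursively
        {"label": "Stage 7", "value": 40, "max": 120},
        {"label": "Stage 8", "value": 0, "max": 80},
    ],
}

# Ship a text/plain fallback so frontends without a renderer still show something.
display({PROGRESS_MIME: payload, "text/plain": "Spark job 3: 40/200 tasks"}, raw=True)
```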
Cluster/computational Context
Users need a way to get context about their running Spark setup, because they are dealing with an extra remote resource beyond the kernel. They tend to want to know (a PySpark sketch of pulling this context follows the list):
- What cluster am I attached to?
- What resources do I have (memory, CPU, Spark version, Hadoop version)?
- How do I access the UI/logs for that cluster?
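Much of this context is already queryable from the driver; a sketch of what a kernel or helper library could surface automatically, assuming PySpark and that attributes like `uiWebUrl` are available (Spark 2.1+):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Answers to "what am I attached to?" that could be surfaced automatically
# instead of every user memorizing these attributes.
context = {
    "master": sc.master,                          # e.g. yarn, spark://..., local[*]
    "spark_version": sc.version,
    "application_id": sc.applicationId,
    "default_parallelism": sc.defaultParallelism,
    "executor_memory": sc.getConf().get("spark.executor.memory", "(default)"),
    "spark_ui": sc.uiWebUrl,                      # link to the Spark UI, when available
}
for key, value in context.items():
    print(f"{key}: {value}")
```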
Many users want to think in terms of "clusters" and a notebook that is attached to that cluster. The real relationship is
notebook ← → kernel ← → cluster (or other background resources)
This can be a cognitive burden for the user. One way of solving this is to have “kernels” that are assumed to be attached to clusters.
Kernel Startup and Background Resources
Stages a Spark user cares about when opening a notebook and running the first cell:
- Runtime started (kernel)
- Preparing driver/Attaching to cluster $NAME/preparing Spark Context
- Libraries uploading
- Attached to cluster $NAME/Spark ready
This seems to me like additional startup messages from the kernel (yes, a new message spec addition) about the driver context.
We do have a banner for kernel_info; perhaps we can do banner updates with links to the following (a strawman message sketch follows the list):
- Cluster/driver logs
- Spark UI/Dask UI
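As a strawman only: the message type and field names below are invented for illustration, not part of any spec, but they show the kind of startup/banner-update payload a kernel could publish during these stages:

```python
# Hypothetical startup-status message -- not part of the current messaging
# spec; the msg_type and field names are invented to illustrate the idea.
startup_status = {
    "msg_type": "startup_status",          # would require a spec addition
    "content": {
        "stage": "attaching_to_cluster",   # e.g. runtime_started, uploading_libraries, ready
        "detail": "Attaching to cluster analytics-prod, preparing SparkContext",
        "links": {                         # the banner-style links mentioned above
            "cluster_logs": "https://cluster.example.com/logs/application_1234",
            "spark_ui": "http://driver-host:4040",
        },
    },
}
```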
Uploading libraries
Spark users need a way to upload JARs to the current Spark cluster (and, many times, they then need to restart the Spark cluster).
This is out of scope for the Jupyter protocols and would need to be implemented outside the notebook. We do need to be able to provide more context, likely through the kernel, about the operating environment and how the cluster is initialized for each kernel.
It's unclear what or how much we can handle here; this needs exploration.
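For reference, what users do today from inside the kernel is wire libraries in through standard Spark configuration before the session starts (the paths below are placeholders), which is part of why "upload a JAR" so often implies "restart":

```python
from pyspark.sql import SparkSession

# Standard Spark configuration for shipping extra JARs / Python files to the
# cluster. It has to be set before the SparkSession/SparkContext exists,
# which is exactly why "upload a JAR" usually implies "restart the session".
spark = (
    SparkSession.builder
    .appName("notebook-session")
    .config("spark.jars", "/path/to/my-udfs.jar")             # placeholder path
    .config("spark.submit.pyFiles", "/path/to/helpers.zip")   # placeholder path
    .getOrCreate()
)
```

Jupyter currently has no visibility into any of this, which is the gap.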
Timing Data
The transient information about how long a command took (as well as which user ran it) is available in the message spec, but not persisted to the notebook document. We can certainly show this in the UI, but there is another reason for persisting this information in the serialized state:
- Lifecycle management of notebooks: we need to be able to know whether execution results are stale relative to data, APIs, etc.
- In multi-user collaboration, this must be shared amongst all clients
This is super useful, and part of why it’s in the message spec. An earlier implementation of IPython notebooks (2006) included and displayed all of this information. We found the information useful enough to keep it in the current protocol implementation, but not enough to put it in the UI or persist it to disk.
Downsides of storing it in the document:
- 100% chance of every re-run causing git changes, merge conflicts
One option: an opt-in 'provenance mode', where we persist a whole bunch of extra metadata. Super unpleasant for version control, but useful in notebooks that run as jobs, especially for analysts.
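As a starting point for such a mode: nbclient can already stamp per-cell timing into `cell.metadata.execution` when its `record_timing` option is on. A sketch of using that metadata for staleness checks; the key names follow nbclient's convention and are not part of the nbformat schema:

```python
import nbformat
from dateutil.parser import isoparse

nb = nbformat.read("analysis.ipynb", as_version=4)  # placeholder filename

# nbclient (with record_timing on) copies message-header timestamps into
# cell.metadata.execution; the key names below follow that convention.
for index, cell in enumerate(nb.cells):
    if cell.cell_type != "code":
        continue
    timing = cell.get("metadata", {}).get("execution", {})
    started = timing.get("iopub.execute_input")
    finished = timing.get("shell.execute_reply")
    if started and finished:
        print(f"cell {index}: ran at {started}, took {isoparse(finished) - isoparse(started)}")
    else:
        print(f"cell {index}: no recorded timing (never run, or timing stripped)")
```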
Tables and Simple Charts
For many people, being able to plot simple results before getting to deeper visualization frameworks is pretty important: e.g. in pandas we can use the df.plot(), .hist(), .bar(), etc. methods for quick and easy vis.
There are a couple of approaches to this right now, including mime types for plotly and vega being available and easy to use in both nteract and JupyterLab. For tables, there is an open discussion on pandas: https://github.com/pandas-dev/pandas/issues/14386
There's also the sparkmagic incubation project.
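As an example of the mimetype route, a library can emit a Vega-Lite spec directly and let the frontend render it. The mimetype version below is one JupyterLab and nteract have shipped renderers for; adjust it to whatever your frontend supports:

```python
from IPython.display import display

# A tiny bar chart as a raw Vega-Lite spec; frontends with a vega-lite
# renderer draw it, everything else falls back to the text/plain entry.
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
    "data": {"values": [
        {"label": "a", "count": 28},
        {"label": "b", "count": 55},
        {"label": "c", "count": 43},
    ]},
    "mark": "bar",
    "encoding": {
        "x": {"field": "label", "type": "nominal"},
        "y": {"field": "count", "type": "quantitative"},
    },
}

display({"application/vnd.vegalite.v4+json": spec, "text/plain": "<bar chart>"}, raw=True)
```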
Scala Kernel
Massive effort needs to be put into making the Scala kernel(s) for Jupyter first-class citizens. Competitive angle: it should be so good that Zeppelin or Beaker would adopt it (or create it) and contribute to it. Current contenders:
- https://github.com/alexarchambault/jupyter-scala
- https://toree.incubator.apache.org/documentation/user/quick-start
Style of error reporting
If you've ever worked with PySpark, you know the joys of having Python tracebacks embedded in JVM tracebacks. To make development easier, we need to draw attention to the right lines to jump to in a traceback. Spark-centric notebooks do this, showing exactly the error text the user needs plus a collapsible area for the full traceback. Jupyter has some open issues and discussion about improving error semantics; it's really important for kernels, and the libraries building on them, to be able to expose more useful context to users.
At a minimum, we should expose links and/or popups to the Spark driver's stdout/stderr, the Spark UI, and the YARN application master where relevant. Without these essentials, a notebook user can't debug issues that aren't returned in the notebook output cell.
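One thing a kernel-side library can do today, sketched here with IPython's custom exception hook and py4j's `Py4JJavaError` wrapper (the handler behavior is an assumption about what "useful" looks like, not an agreed design):

```python
from IPython import get_ipython
from py4j.protocol import Py4JJavaError

def show_spark_error(shell, etype, evalue, tb, tb_offset=None):
    # Print the first line of the JVM exception -- usually the part the user
    # actually needs -- before the full Python-wrapped-JVM traceback.
    summary = str(evalue.java_exception).splitlines()[0]
    print(f"Spark error: {summary}")
    print("See the Spark UI / driver logs for the failing stage and executor output.")
    # Fall back to IPython's normal traceback rendering for the full details.
    shell.showtraceback((etype, evalue, tb), tb_offset=tb_offset)

# Register the handler for Py4J-wrapped JVM exceptions only.
get_ipython().set_custom_exc((Py4JJavaError,), show_spark_error)
```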
POC ➡️ Production
This part is more JVM-centric and not language-agnostic: people need a way to turn code segments into a Scala class, compile it, and generate a JAR. My knowledge of how this can be done is fairly old/outdated.
Top GitHub Comments
Just want to give my personal experience on this:
I tried all of the Spark kernels for Jupyter out there, and only sparkmagic works out of the box. The only downside is missing tab completion.
I spent days just trying to make the other projects work, but they're still nowhere near enough.
Here are the raw data and analysis notebooks from the late-2015 UX survey about Jupyter Notebook: https://github.com/jupyter/design/tree/master/surveys/2015-notebook-ux
A live, dashboard version is available here: http://jupyter.cloudet.xyz/files/2015-notebook-ux-survey/analysis/deployed_dashboard/index.html