Adding new characteristics to the HLG visualizations

Task 1 - #7869 Status: Merged
Task 2A - #7886 Status: Review

~~- [ ] Task 2B - (SVG implementation in Graphviz) #XXXX~~ Status: Graphviz acting weird and not allowing us to downscale the images

(I will keep updating this comment)

Hi 👋

I am posting this Feature Request issue here to start a discussion regarding enhancing the Graphviz output of the dask.visualize() method.

While I was adding color to show whether an HLG layer is materialized or not using a light gray fill color, @mrocklin pointed out that it would be better to make this discussion public to gather opinions from the Dask community as a whole.

In the future I think that users would be very interested in attributes like layer size and type, as well as collection attributes like chunking structure in the case of dask arrays. Personally I would encourage you to focus your efforts there. You might also want to raise issues with proposed changes in the future. That will help you to get feedback on ideas from a broad set of people in the community, rather than just the one or two that are engaging in the gsoc slack channel.

Source: PR #7843

@GenevieveBuckley worked on #7309 and added a new dictionary collection_annotations which has crucial information about the High Level Graphs which I believe can be shown in some way on the Graphviz output.

@martindurant mentioned this over at https://github.com/dask/dask/issues/7301#issuecomment-860686336

… based on the task naming conventions and e.g., for high-level graphs the number of sub-tasks and for arrays, the size of the operands. Such information might be added into the nodes as text, colour or edge-style (all probably optional).

If anyone else has any ideas, please leave them in the comments section. How should I proceed?

Let’s make the output of dask.visualize() more interesting and appealing to the eye! 🙌
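As a rough illustration of the ideas above (a light gray fill for materialized layers, plus extra attributes shown as node text), here is a minimal sketch that builds Graphviz DOT source by hand. The `layers` dictionary, its field names (`n_tasks`, `materialized`, `depends_on`), and the `hlg_to_dot` helper are all hypothetical and invented for this example; they are not how dask stores or renders high-level graphs.

```python
# Hypothetical layer metadata. The field names are assumptions made for
# illustration only, not the structure dask uses internally.
layers = {
    "read-csv": {"n_tasks": 10, "materialized": False, "depends_on": []},
    "shuffle": {"n_tasks": 120, "materialized": True, "depends_on": ["read-csv"]},
}


def hlg_to_dot(layers):
    """Return DOT source with one box per layer and one edge per dependency."""
    lines = ["digraph hlg {", "    rankdir=BT;", "    node [shape=box];"]
    for name, info in layers.items():
        # "\n" inside a DOT label starts a new line in the rendered node.
        label = f"{name}\\n{info['n_tasks']} tasks"
        attrs = [f'label="{label}"']
        if info["materialized"]:
            # Light gray fill marks a materialized layer.
            attrs.append("style=filled")
            attrs.append("fillcolor=lightgray")
        lines.append(f'    "{name}" [{", ".join(attrs)}];')
    for name, info in layers.items():
        for dep in info["depends_on"]:
            lines.append(f'    "{dep}" -> "{name}";')
    lines.append("}")
    return "\n".join(lines)


dot_source = hlg_to_dot(layers)
print(dot_source)
```

The resulting text can be passed to any Graphviz renderer. Keeping the attribute-to-style mapping in one small function makes it easy to try other encodings, such as edge style or font size, without touching the graph data.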

Issue Analytics

State:
Created 2 years ago
Comments:24 (24 by maintainers)

Top GitHub Comments

tomwhite commented, Aug 10, 2021

Now, you will be able to visualize the intermediate state of the Dask arrays within the HTML Representation and it looks amazing 😄

That’s great - thanks for working on this @freyam!

GenevieveBuckley commented, Jul 6, 2021

I have a question: What is the general range of an HLG layer’s task count?

This could vary a lot. The dataframe shuffle example is probably the smallest size reasonable example we have (instead of the tiny toy examples we’ve also looked at). But depending on what users are doing, it could be very, very large indeed. I don’t think we can choose a fixed upper value for this.

I would like to know the following:

a reasonable minimum value for n_tasks

I’d say 1 task is the minimum value possible.

a reasonable maximum value for n_tasks

I don’t think there is a single maximum value we can pick. It will vary wildly depending on what kind of computation the user is doing. (Your suggestion to normalize to the biggest layer in each HLG structure might be a good way to handle this)

a reasonable math function to account for all kinds of values for n_tasks (I am inclining towards numpy.log and numpy.clip)

A log scale is a good option, yes.

Alternate Idea: I can calculate the minimum and the maximum n_tasks for each HLG and use the local minima and local maxima instead. Pro: This would normalize the overall graph structure. The graphs wouldn’t look very disproportionate. Con: We can no longer compare two different task graphs on the basis of the size as the minimums and the maximums can differ. (CounterPoint to the Con: We are actually also mentioning the actual number of n_tasks on every node itself. So, maybe this wouldn’t cause much trouble)

I think this is a good idea.
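The scaling discussed above (a log scale, clipping, and normalizing to the smallest and largest layer of each graph) could look roughly like the sketch below. It uses the standard library's `math.log` and a manual clip as stand-ins for the `numpy.log` and `numpy.clip` mentioned in the question. The function name `scale_sizes`, its parameters, and the size range of 10 to 30 are assumptions chosen for illustration, not part of dask.

```python
import math


def scale_sizes(n_tasks, min_size=10.0, max_size=30.0, bounds=None):
    """Map each layer's task count to a size in [min_size, max_size].

    n_tasks: dict of layer name -> task count (hypothetical input format).
    bounds:  optional fixed (low, high) task counts. When omitted, the
             smallest and largest layer in this graph are used instead.
    """
    # Treat anything below 1 task as 1 so the logarithm is defined.
    logs = {name: math.log(max(n, 1)) for name, n in n_tasks.items()}
    if bounds is None:
        lo, hi = min(logs.values()), max(logs.values())
    else:
        lo, hi = math.log(bounds[0]), math.log(bounds[1])
    span = hi - lo

    sizes = {}
    for name, value in logs.items():
        # If every layer has the same size, fall back to the minimum size.
        fraction = (value - lo) / span if span > 0 else 0.0
        # Clip to [0, 1]; this only has an effect when fixed bounds are given.
        fraction = min(max(fraction, 0.0), 1.0)
        sizes[name] = min_size + fraction * (max_size - min_size)
    return sizes


sizes = scale_sizes({"a": 1, "b": 10, "c": 100})
print(sizes)
```

With per-graph bounds the smallest layer always gets `min_size` and the largest gets `max_size`, which keeps each drawing proportionate but, as noted in the thread, means sizes are not comparable across different graphs. Passing fixed `bounds` restores comparability at the cost of clipping very large layers.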

@GenevieveBuckley This seems like a great idea. It also seems fine to implement. All that’s left is deciding how we want to represent it on the screen. I believe the implementation of the “adding to the graphviz” part is quite trivial. The really tricky part is how we want to show it. Which traits of the node/edge would you wanna tweak to show the new information? I do have some amazing sample diagrams ready which I can draft up by tomorrow. But if anyone here has any plans or suggestions, I would love to hear them.

I had assumed Martin’s suggestion was to show a different color for each of these categories. Scrolling back up, it looks like he didn’t actually say that & I just imagined it. Nevertheless, perhaps color is a good place to start. (Also, as I said early on in this project, you will probably have ideas about the best way to represent certain characteristics visually. Definitely add your own suggestions or ideas for discussion here too, if you have them.)

I also agree with Martin’s comment: “I think that getting the attributes into the plotting code is the primary thing, and deciding how to represent them secondary.”