High Level Expressions
Summary
We should make space for high level query optimization. There are a couple of ways to do this. This issue includes motivation, a description of two approaches, and some thoughts on trade-offs.
Motivation
There are a variety of situations where we might want to rewrite a user’s code.
Dataframes
- Column projection, as in dd.read_parquet(...)["x"] -> dd.read_parquet(..., columns=["x"])
- Predicate pushdown (same as above)
- High level expression fusion (what we do today with blockwise)
- Pushing length calls down through elementwise calls
- Pushing filters earlier in a computation
- …
Arrays
- Automatic rechunking at the beginning of a computation based on the end of the computation
- Slicing
History
Today there is no real place where we capture a user’s intent or their lines of code. We immediately create a task graph for the requested operation, dump it into a mapping, and create a new dask.dataframe.DataFrame or dask.array.Array instance. That instance has no knowledge of what created it, or what created the other input dataframes on which it depends.
This was a good choice early on. It made it easy for us to quickly implement lots of complex operations without thinking about a class hierarchy for them. This choice followed on from our experience with Blaze, where we started with high level expressions but got a bit stuck because they constrained our thinking (and no one really cares about high level query optimization for a system that they don’t use).
However today we have maybe reached a point where our “keep everything low-level and simple” strategy has hit a limit, and now we’re curious about how best to bolt on a high level expression system. Doing this smoothly given the current system is hard. I see two ways out.
High Level Graph layers
We do have a record of what operations came before us in the high level graph layers. Currently the API of layers is very generic: they must be a mapping that adheres to the Dask graph spec, and they must be serializable in a certain way. There are some specialized subclasses, like blockwise, that enable high level optimizations which have proven useful.
There isn’t really much structure here though, and as a result it’s hard to do interesting optimizations. For example it would be nice if we could change a layer at the very bottom of the graph, and then replay all of the operations on that input over again to see how they would change. High level layers today don’t have enough shared structure that we know how to do this.
I like High Level Graph layers because they give us a space to hijack and add in all sorts of complex machinery without affecting the user-facing DataFrame class. We would have to add a lot more structure here though, and we’d always be working around the collection class, which is a drawback.
Collection subclasses
I’m going to focus on an alternative that is a bit more radical. We could also have every user call generate a DataFrame subclass. There would still be a DataFrame instance that took in a generic graph/divisions/meta, but that would be mostly for backwards compatibility. Instead most dataframe operations would produce subclasses that had a well-defined common structure, as well as more custom attributes for their specific operation. Let’s look at a couple of examples.
# API calls just create instances. All logic happens there.
def read_parquet(file, columns, filters):
    return ReadParquet(file, columns, filters)


class ReadParquet(DataFrame):
    args = ["file", "columns", "filters"]  # List of arguments to use when reconstructing
    inputs = []  # List of arguments that are DataFrame objects

    def __init__(self, file, columns, filters):
        self.file = file
        self.columns = columns
        self.filters = filters
        self.divisions, self.meta = ...  # do a bit of work on metadata

    def _generate_dask_layer(self) -> dict:
        ...


class ColumnProjection(DataFrame):
    args = ["dataframe", "columns"]
    inputs = ["dataframe"]

    def __init__(self, dataframe, columns):
        self.dataframe = dataframe
        self.columns = columns
        self._meta = self.dataframe._meta[columns]

    def _generate_dask_layer(self) -> dict:
        ...


class Add(DataFrame):
    args = ["left", "right"]

    def __init__(self, left, right):
        self.left = left
        self.right = right
        self.inputs = []
        if is_dask_collection(left):
            self.inputs.append("left")
        if is_dask_collection(right):
            self.inputs.append("right")
        self._meta = ...
        self._divisions = ...

    def _generate_dask_layer(self) -> dict:
        ...
As folks familiar with SymPy will recall, having attributes like args/inputs around makes it possible to re-generate a DataFrame automatically. So if we do something like the following:
df = dd.read_parquet(...)
z = df.x + df.y
Then this turns into an expression tree like the following:
Add(
    ColumnProjection(
        ReadParquet(..., columns=None),
        "x",
    ),
    ColumnProjection(
        ReadParquet(..., columns=None),
        "y",
    ),
)
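To make the shape concrete, here is a tiny runnable sketch of that tree. The classes here are hypothetical stand-ins (a bare Expr base instead of the real DataFrame subclasses), but they show how the inputs attribute makes traversal trivial:

```python
# Hypothetical stand-in classes; the real versions would subclass
# dask.dataframe.DataFrame.

class Expr:
    args = []    # attribute names needed to reconstruct the node
    inputs = []  # the subset of args that are themselves expressions

    def walk(self):
        # Depth-first traversal, made trivial by the inputs attribute.
        yield self
        for name in self.inputs:
            yield from getattr(self, name).walk()

class ReadParquet(Expr):
    args = ["file", "columns"]

    def __init__(self, file, columns=None):
        self.file = file
        self.columns = columns

class ColumnProjection(Expr):
    args = ["dataframe", "columns"]
    inputs = ["dataframe"]

    def __init__(self, dataframe, columns):
        self.dataframe = dataframe
        self.columns = columns

class Add(Expr):
    args = ["left", "right"]
    inputs = ["left", "right"]

    def __init__(self, left, right):
        self.left = left
        self.right = right

# The tree for z = df.x + df.y
z = Add(
    ColumnProjection(ReadParquet("myfile.parquet", columns=None), "x"),
    ColumnProjection(ReadParquet("myfile.parquet", columns=None), "y"),
)

names = [type(node).__name__ for node in z.walk()]
assert names == ["Add", "ColumnProjection", "ReadParquet",
                 "ColumnProjection", "ReadParquet"]
```

Nothing here builds a task graph; the tree is pure metadata that we are free to rewrite.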
We can then traverse this tree (which is easy, because the inputs attribute lists all attributes that are dask collections) and apply optimizations (which is also easy, because the args attribute lets us reconstruct any node).
For example the ColumnProjection class may have an optimization method like the following:
class ColumnProjection(DataFrame):
    ...  # continued from above

    def _optimize(self) -> DataFrame:
        if isinstance(self.dataframe, ReadParquet):
            args = {arg: getattr(self.dataframe, arg) for arg in self.dataframe.args}
            args["columns"] = self.columns
            return ReadParquet(**args)._optimize()

        # Here is another optimization, just to show variety
        if isinstance(self.dataframe, ColumnProjection):  # like df[["x", "y"]]["x"]
            return ColumnProjection(self.dataframe.dataframe, self.columns)._optimize()

        # No known optimizations: optimize all inputs and then reconstruct
        # (this would live in a superclass)
        args = []
        for arg in self.args:
            if arg in self.inputs:
                arg = getattr(self, arg)._optimize()
            else:
                arg = getattr(self, arg)
            args.append(arg)
        return type(self)(*args)
This is just one way of doing a traversal, using a method on a class. We can do fancier things. Mostly what I wanted to show here was that because we have args/inputs and class types it’s fairly easy to encode optimizations and rewrite things.
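Filled out as a runnable toy (with a stripped-down, hypothetical Expr base in place of the real DataFrame class), the traversal plus the pushdown rules look like this:

```python
# Runnable toy version of the traversal above. Expr, ReadParquet, and
# ColumnProjection are hypothetical stand-ins for real DataFrame subclasses.

class Expr:
    args = []
    inputs = []

    def _optimize(self):
        # Default: no known rewrite; optimize inputs and reconstruct.
        new_args = []
        for name in self.args:
            value = getattr(self, name)
            if name in self.inputs:
                value = value._optimize()
            new_args.append(value)
        return type(self)(*new_args)

class ReadParquet(Expr):
    args = ["file", "columns"]

    def __init__(self, file, columns=None):
        self.file = file
        self.columns = columns

class ColumnProjection(Expr):
    args = ["dataframe", "columns"]
    inputs = ["dataframe"]

    def __init__(self, dataframe, columns):
        self.dataframe = dataframe
        self.columns = columns

    def _optimize(self):
        # Push the projection into the read
        if isinstance(self.dataframe, ReadParquet):
            kwargs = {a: getattr(self.dataframe, a) for a in self.dataframe.args}
            kwargs["columns"] = self.columns
            return ReadParquet(**kwargs)._optimize()
        # Collapse nested projections, like df[["x", "y"]]["x"]
        if isinstance(self.dataframe, ColumnProjection):
            return ColumnProjection(self.dataframe.dataframe, self.columns)._optimize()
        return super()._optimize()

expr = ColumnProjection(
    ColumnProjection(ReadParquet("myfile.parquet"), ["x", "y"]), "x"
)
optimized = expr._optimize()
assert isinstance(optimized, ReadParquet)
assert optimized.columns == "x"
```

Both projection layers disappear, and the read itself ends up fetching only the one column that was ultimately requested.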
What’s notable here is that we aren’t generating the graph, or even any semblance of the graph, ahead of time. At any point where we run code that requires something like _meta or _divisions from an input, we cut off any opportunity to change the graph beneath that stage. This is OK for our internal code. We can defer graph generation, I think.
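A sketch of what deferred graph generation could look like: a hypothetical __dask_graph__ that only materializes layers after the tree has been optimized. The names and task shapes below are made up for illustration, not the real Dask protocol implementation:

```python
# Hypothetical sketch: layers materialize only when the (already
# optimized) expression tree is walked at compute time.

class Expr:
    args = []
    inputs = []

    def _optimize(self):
        return self  # rewrite rules would go here, as shown earlier

    def _generate_dask_layer(self):
        raise NotImplementedError

    def __dask_graph__(self):
        # Optimize first, then materialize one layer per node.
        expr = self._optimize()
        graph = {}
        stack = [expr]
        while stack:
            node = stack.pop()
            graph.update(node._generate_dask_layer())
            stack.extend(getattr(node, name) for name in node.inputs)
        return graph

class ReadParquet(Expr):
    def __init__(self, file, columns=None):
        self.file = file
        self.columns = columns

    def _generate_dask_layer(self):
        # Real code would emit one task per row-group/partition.
        return {("read-parquet", 0): ("read", self.file, self.columns)}

graph = ReadParquet("myfile.parquet").__dask_graph__()
assert ("read-parquet", 0) in graph
```

Because nothing is materialized until this point, every rewrite up to the final compute call is free to reshape the plan.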
I think that the main advantage to this approach is that we can easily reconstruct expressions given newly modified inputs. I think that this is fundamentally what is lacking with our current HLG layer approach.
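As a toy illustration of that property, a generic substitute helper (hypothetical, built only on the args/inputs contract) can swap out a leaf and mechanically rebuild everything above it:

```python
# Hypothetical sketch: rebuilding an expression tree with one input
# replaced, using only the args/inputs contract.

class Expr:
    args = []
    inputs = []

def substitute(expr, old, new):
    """Return a copy of the tree with `old` replaced by `new`."""
    if expr is old:
        return new
    values = []
    for name in expr.args:
        value = getattr(expr, name)
        if name in expr.inputs:
            value = substitute(value, old, new)
        values.append(value)
    return type(expr)(*values)

class ReadParquet(Expr):
    args = ["file", "columns"]

    def __init__(self, file, columns=None):
        self.file = file
        self.columns = columns

class ColumnProjection(Expr):
    args = ["dataframe", "columns"]
    inputs = ["dataframe"]

    def __init__(self, dataframe, columns):
        self.dataframe = dataframe
        self.columns = columns

old = ReadParquet("2021.parquet")
expr = ColumnProjection(old, "x")
new_expr = substitute(expr, old, ReadParquet("2022.parquet"))
assert new_expr.dataframe.file == "2022.parquet"
assert new_expr.columns == "x"
```

This is exactly the "change a layer at the very bottom and replay everything above it" operation that generic HLG layers can't express today.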
However, this would also be a significant deviation from how Dask works today. It’s likely that this would affect all downstream projects that subclass Dask arrays/dataframes today (RAPIDS, Pint, yt, …). I think that that’s probably ok. Those groups will, I think, understand.
Personal Thoughts
I’ve historically been against doing collection subclasses (the second approach), but after thinking about this for a while I think I’m now more in favor. It seems like maybe we’ve arrived at the time when the benefits to doing this outweigh the complexity costs. I think that this is motivated also by the amount of complexity that we’re encountering walking down the HLG path.
However, I do think that this will be hard for us to implement incrementally. If we want to go down this path we probably need an approach that lets the current DataFrame API persist for backwards compatibility with downstream projects (they wouldn’t get any benefits of this system though), plus a way to move over to collection subclasses incrementally.
Issue Analytics
- State:
- Created: 2 years ago
- Reactions: 9
- Comments: 65 (45 by maintainers)

Selected Comments
Hopefully this isn’t too OT, but I wanted to make Dask folks aware that we have an adjacent effort/discussion under way in Apache Arrow to develop a language-independent serialized expression format connecting user SDKs (e.g. things like Ibis or even dask.dataframe potentially) with Arrow-based/Arrow-compatible computing engines:
https://lists.apache.org/thread.html/r90c1e6ba1c7b27a960df3f27dc2c1aeb542f07bb7af72c17138667f0%40<dev.arrow.apache.org>
The idea is to provide a common ground consisting of standard lower-level operators, rather than each engine having its own slightly idiosyncratic front end (e.g. using a certain SQL dialect, or something else). There would be the presumption of using Arrow’s schemas / type metadata to describe the input/output types of operations.
In the same way that Dask uses pandas as a dataframe computing engine, it would be interesting to see if this High Level Expression effort could lead to a world of pluggable single-node engines for Dask to orchestrate, e.g. using one of the various Arrow-based engines in development (Rust, C++), or even a small-footprint embeddable analytic SQL engine like DuckDB, which can now input/output Arrow easily. One might imagine that dask.dataframe in the future might not use pandas at all, permitting alternative (or even multiple) backends to execute various tasks (and, if useful, Arrow could be used as serialization middleware). EDIT: I would also add that, given that polars exists now, what would it take for Dask to be able to target it with dask.dataframe instead of only pandas?
I’ve been looking at HLEs again recently, so let me share at a very high level how I’d like to tackle it in a way that helps us experiment and learn. I’m going to be somewhat vague and hand-wavy, because there is so much good information and details above that don’t need repeating, and because I don’t have a detailed, low level plan that I want to advocate for as the way forward (I actually have several in mind, including adapting prior efforts).
In principle, I believe some form (or perhaps many forms) of HLEs is achievable. But I expect any approach to be experimental for a while, and we’re likely to want to iterate. As others have pointed out, some of the difficulties in implementing HLEs in Dask are that the problem, design, and solution spaces are high-dimensional, and that Dask has a large, complicated API surface area to cover, which may require some invasive changes. I looked into HLEs briefly a few years ago, and I think it is daunting to try to experiment with it directly in the Dask codebase (but this may just be me).
My first goal is to create a sort of playground that makes it easier to experiment. I envision this to be new Dask collections that mimic the Array, DataFrame, and Bag collections (Delayed and user-defined collections can follow, but I’m not thinking about them yet). By adding a shim layer, we can easily capture the function arguments and call the methods on the original Dask collections when necessary. Hopefully this will give us much of what we need, but I’m certain there will be shortcomings. So, I’m also making it easy to modify the code of the original function. There may be a little magic *cough cough*, but the goal is to be convenient, flexible, and maintainable-ish (i.e., it should complain loudly when code it changes is changed).

I plan to package this effort into a, uh, Python package. Maybe multiple packages for different experiments. This should make it easier for other people to try it out, and to experiment themselves. If invasive changes become too extensive, or if maintenance becomes too much of a burden, I expect it would be straightforward to convert such a package into a Dask PR.

Adding a layer around Dask isn’t a new idea; mrocklin said above, “The efforts I’ve played with / proposed in the past have been for layers surrounding Dask. That adds more complexity though.” But I think the complexity is worth it short term, and such a layer may be worth it long term. It may also make it easier to adapt for use with other packages, not just Dask.
Using MatchPy seems like a great idea. I think we should. There are many things we could match against and behaviors we could expose to the user.
My near term goal is to get a playground package that performs HLE pattern matching and rewriting for some DaskSQL patterns, including more robust predicate pushdown. This should be like bread and butter for MatchPy. From there, I think there will be many directions we could go or new experiments to try.
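To sketch what such a rewrite could look like without depending on MatchPy itself, here is a plain-Python stand-in where each rule is just a function applied to a fixpoint; a real version would express the same rules as MatchPy Pattern/ReplacementRule objects. All names here (ReadTable, Filter, push_filter_into_read) are hypothetical:

```python
# Plain-Python stand-in for MatchPy-style pattern rewriting, shown on a
# predicate-pushdown rule over a tiny (hypothetical) relational algebra.

from collections import namedtuple

# Filter(predicate, child) over ReadTable(name, filters)
ReadTable = namedtuple("ReadTable", ["name", "filters"])
Filter = namedtuple("Filter", ["predicate", "child"])

def push_filter_into_read(expr):
    # Rule: Filter(p, ReadTable(t, fs)) -> ReadTable(t, fs + [p])
    if isinstance(expr, Filter) and isinstance(expr.child, ReadTable):
        child = expr.child
        return ReadTable(child.name, child.filters + [expr.predicate])
    return expr

def rewrite(expr, rules):
    # Rewrite children first, then apply rules here until a fixpoint.
    if isinstance(expr, Filter):
        expr = Filter(expr.predicate, rewrite(expr.child, rules))
    changed = True
    while changed:
        changed = False
        for rule in rules:
            new = rule(expr)
            if new is not expr:
                expr, changed = new, True
    return expr

query = Filter("y < 0", Filter("x > 1", ReadTable("t", [])))
plan = rewrite(query, [push_filter_into_read])
assert plan == ReadTable("t", ["x > 1", "y < 0"])
```

Both filters get folded into the table scan, which is the kind of pushdown DaskSQL would benefit from; MatchPy would let us state the left-hand side declaratively instead of with isinstance checks.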
As a former boss of mine used to say, “for developers, doing is learning.” There is a lot for me to learn (which is to say, I have a whole bunch of questions and concerns), and I’ll get there by doing (hopefully, or I may crash and burn!). I’ll provide an update once I have a working playground. I would also like to share other experiments I’d like to try. Again, my main goal here is to learn, and to make it easier for other people to experiment and learn as well. There’s a lot of interest, and potential, in HLEs, so I say let’s push forward however we can.
If anybody has a violent reaction to the way I’m thinking here, please reply (non-violently 😉 ).