[Discussion] Proposed layer reorganization
High level graph `Layer` implementations have a number of constraints on them which make their design and location within the project somewhat subtle. For security reasons they cannot deserialize or otherwise run arbitrary Python code, and in general they cannot depend on numpy, pandas, or other such complex dependencies. As a result of the latter requirement, many `Layer` implementations have been placed in `dask.layers`, where they can be safely referenced without triggering unwanted imports (cf. #7381).
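The "no heavy dependencies" constraint is checkable: import a module in a fresh interpreter and verify that numpy/pandas were not pulled in transitively. Below is a rough sketch of such a guard, using the stdlib module `json` as a stand-in for a hypothetical layers module (the helper name `imports_heavy_deps` is mine, not from dask):

```python
import subprocess
import sys

def imports_heavy_deps(module_name, heavy=("numpy", "pandas")):
    """Import module_name in a fresh interpreter and return which of
    the 'heavy' dependencies ended up in sys.modules as a side effect."""
    code = (
        f"import {module_name}, sys; "
        f"print(' '.join(m for m in {heavy!r} if m in sys.modules))"
    )
    out = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    return set(out.stdout.split())

# A pure-stdlib module loads neither numpy nor pandas:
print(imports_heavy_deps("json"))  # -> set()
```

A test like this, pointed at the real layers modules, would keep the import constraint from regressing silently.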
During some discussions, @gjoseph92, @GenevieveBuckley and I have become concerned about the size and maintainability of layers.py. As we look towards implementing more layers, this file will get more unwieldy: each layer will typically have an implementation and possibly a set of utility functions, and the module will eventually become a significant fraction of dask.
Since the natural location for these layers (near the userland code that instantiates them) isn't workable due to the unwanted imports, we have sketched out a proposed structure for making this easier to organize and maintain going forward. The basic idea is to keep shadow submodules of `dask.array` and `dask.dataframe` (and possibly `dask.bag`) as siblings of those submodules, each containing the layer implementations for its counterpart. So `dask.array_layers` would have `Layer` classes and all the supporting utilities for `dask.array`. This could look something like (conceptually, not literally):
```
├── core.py
├── base.py
├── array
│   ├── overlap.py    (user code for overlap)
│   └── slicing.py    (user code for slicing)
├── dataframe
│   ├── shuffle.py    (user code for shuffles)
│   └── io.py         (user code for read_csv, read_parquet, etc.)
├── array_layers
│   ├── overlap.py    (layers for overlap)
│   └── slicing.py    (layers for slicing)
└── dataframe_layers
    ├── shuffle.py    (layers for shuffles)
    └── io.py         (layers for read_csv, read_parquet, etc.)
```
Code from `dask.array` could freely import code from `dask.array_layers`, but not the reverse. I don't think the layers submodules would exactly mirror their counterparts, but a rough correspondence for collection-producing operations would make it much easier to keep things organized and avoid having `layers.py` become an enormous dumping ground.
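The one-way dependency can be sketched conceptually: the layers side defines a graph-layer class using only the stdlib (layers in dask are mappings from task keys to task specs), and the user-facing side imports it alongside the heavy dependencies. Everything below is illustrative, not actual dask code; `SliceLayer` and its task spec are hypothetical:

```python
from collections.abc import Mapping

# --- would live in e.g. dask/array_layers/slicing.py (hypothetical) ---
class SliceLayer(Mapping):
    """A lazy mapping of task keys to task specs; no numpy needed."""

    def __init__(self, name, n_chunks):
        self.name = name
        self.n_chunks = n_chunks

    def __getitem__(self, key):
        name, i = key
        if name != self.name or not (0 <= i < self.n_chunks):
            raise KeyError(key)
        return ("getitem-task", name, i)  # placeholder task spec

    def __iter__(self):
        # Generate keys lazily rather than materializing the graph
        return ((self.name, i) for i in range(self.n_chunks))

    def __len__(self):
        return self.n_chunks

# --- user code in dask/array/slicing.py would do the reverse import,
# --- e.g. `from dask.array_layers.slicing import SliceLayer`
layer = SliceLayer("x", 3)
print(len(layer), list(layer)[0])  # -> 3 ('x', 0)
```

Because only the user-facing module imports the layers module, the layers package stays importable in contexts (like the scheduler) where numpy and pandas must not load.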
Thoughts? especially cc @rjzamora and @jrbourbeau
Issue Analytics
- State:
- Created: 2 years ago
- Reactions: 2
- Comments: 19 (12 by maintainers)

@jsignell the only issue with that, versus having `array`, `dataframe`, etc. subdirectories, is that we'd have to collapse a lot of code into one file. For example, `array/overlap.py` and `array/slicing.py` are currently two files (1k and 2k lines respectively); bringing all of that (and more) into one `array.py` file would get really long.

Based on my experience building sheds for bicycles, I'd also prefer this structure. I guess it feels a little more equivalent to the `array`/`dataframe` submodules for them to have exactly the same names. But mostly, I think it will be nice to have a `dask/layers` base module for exactly the sort of thing we'll inevitably end up with. I'd probably vote for moving `blockwise.py`, and maybe even `highlevel.py` (though maybe there are issues with that one I haven't thought of), into `dask/layers` as well.