[Discussion] Proposed layer reorganization
High level graph `Layer` implementations have a number of constraints on them which make their design and location within the project somewhat subtle. For security reasons they cannot deserialize or otherwise run arbitrary Python code, and in general they cannot depend on numpy, pandas, or other such complex dependencies. As a result of the latter requirement, many `Layer` implementations have been placed in `dask.layers`, where they can be safely referenced without triggering unwanted imports (cf. #7381).
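The "no heavy dependencies" constraint is checkable: import a module in a fresh interpreter and verify that numpy/pandas were not pulled in transitively. Below is a rough sketch of such a guard, using the stdlib module `json` as a stand-in for a hypothetical layers module (the helper name `imports_heavy_deps` is mine, not from dask):

```python
import subprocess
import sys

def imports_heavy_deps(module_name, heavy=("numpy", "pandas")):
    """Import module_name in a fresh interpreter and return which of
    the 'heavy' dependencies ended up in sys.modules as a side effect."""
    code = (
        f"import {module_name}, sys; "
        f"print(' '.join(m for m in {heavy!r} if m in sys.modules))"
    )
    out = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    return set(out.stdout.split())

# A pure-stdlib module loads neither numpy nor pandas:
print(imports_heavy_deps("json"))  # -> set()
```

A test like this, pointed at the real layers modules, would keep the import constraint from regressing silently.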
During some discussions, @gjoseph92, @GenevieveBuckley and I have become concerned about the size and maintainability of layers.py. As we look towards implementing more layers, this file will get more unwieldy: each layer will typically have an implementation and possibly a set of utility functions, and the module will eventually become a significant fraction of dask.
Since the natural location for these layers (near the userland code that instantiates them) isn't workable due to the unwanted imports, we have sketched out a proposed structure for making this easier to organize and maintain going forward. The basic idea is to keep shadow submodules of `dask.array` and `dask.dataframe` (and possibly `dask.bag`) as siblings of those submodules, each containing the layer implementations for its counterpart. So `dask.array_layers` would have `Layer` classes and all the supporting utilities for `dask.array`. This could look something like (conceptually, not literally):
```
├── core.py
├── base.py
├── array
│   ├── overlap.py    (user code for overlap)
│   └── slicing.py    (user code for slicing)
├── dataframe
│   ├── shuffle.py    (user code for shuffles)
│   └── io.py         (user code for read_csv, read_parquet, etc.)
├── array_layers
│   ├── overlap.py    (layers for overlap)
│   └── slicing.py    (layers for slicing)
└── dataframe_layers
    ├── shuffle.py    (layers for shuffles)
    └── io.py         (layers for read_csv, read_parquet, etc.)
```
Code from `dask.array` could freely import code from `dask.array_layers`, but not the reverse. I don't think the layers submodules would exactly mirror their counterparts, but a rough correspondence for collection-producing operations would make it much easier to keep things organized and avoid having `layers.py` become an enormous dumping ground.
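The one-way dependency can be sketched conceptually: the layers side defines a graph-layer class using only the stdlib (layers in dask are mappings from task keys to task specs), and the user-facing side imports it alongside the heavy dependencies. Everything below is illustrative, not actual dask code; `SliceLayer` and its task spec are hypothetical:

```python
from collections.abc import Mapping

# --- would live in e.g. dask/array_layers/slicing.py (hypothetical) ---
class SliceLayer(Mapping):
    """A lazy mapping of task keys to task specs; no numpy needed."""

    def __init__(self, name, n_chunks):
        self.name = name
        self.n_chunks = n_chunks

    def __getitem__(self, key):
        name, i = key
        if name != self.name or not (0 <= i < self.n_chunks):
            raise KeyError(key)
        return ("getitem-task", name, i)  # placeholder task spec

    def __iter__(self):
        # Generate keys lazily rather than materializing the graph
        return ((self.name, i) for i in range(self.n_chunks))

    def __len__(self):
        return self.n_chunks

# --- user code in dask/array/slicing.py would do the reverse import,
# --- e.g. `from dask.array_layers.slicing import SliceLayer`
layer = SliceLayer("x", 3)
print(len(layer), list(layer)[0])  # -> 3 ('x', 0)
```

Because only the user-facing module imports the layers module, the layers package stays importable in contexts (like the scheduler) where numpy and pandas must not load.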
Thoughts? especially cc @rjzamora and @jrbourbeau
Issue Analytics
- State:
- Created: 2 years ago
- Reactions: 2
- Comments: 19 (12 by maintainers)

@jsignell the only issue with that, versus having `array`, `dataframe`, etc. subdirectories, is that we'd have to collapse a lot of code into one file. For example, `array/overlap.py` and `array/slicing.py` are currently two files (1k and 2k lines respectively); bringing all of that (and more) into one `array.py` file would get really long.

Based on my experience building sheds for bicycles, I'd also prefer this structure. I guess it feels a little more equivalent to the `array`/`dataframe` submodules for them to have exactly the same names. But mostly, I think it will be nice to have a `dask/layers` base module for exactly the sort of thing we'll inevitably end up with. I'd probably vote for moving `blockwise.py`, and maybe even `highlevel.py` (though maybe there are issues with that one I haven't thought of), into `dask/layers` as well.