[Discussion] Proposed layer reorganization


High level graph Layer implementations have a number of constraints on them which make their design and location within the project somewhat subtle. For security reasons they cannot deserialize or otherwise run arbitrary Python code, and in general they cannot depend on numpy, pandas, or other such complex dependencies. As a result of the latter requirement, many Layer implementations have been placed in dask.layers, where they can be safely referenced without triggering unwanted imports (cf. #7381).
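One way to check the "no unwanted imports" constraint is to import a module in a fresh interpreter and see which heavy dependencies ended up in sys.modules. The following is a minimal sketch under that assumption, not dask's own test: the helper name modules_loaded_by is hypothetical, and the result for any real dask module depends on the installed version.

```python
import subprocess
import sys


def modules_loaded_by(module_name, watched):
    """Import `module_name` in a fresh interpreter and return the subset of
    `watched` module names that were loaded as a side effect.

    Hypothetical helper for illustration; not part of dask.
    """
    script = (
        "import importlib, sys\n"
        f"importlib.import_module({module_name!r})\n"
        f"loaded = [m for m in {sorted(watched)!r} if m in sys.modules]\n"
        # A marker prefix lets us ignore anything the module itself prints.
        "print('LOADED:' + ','.join(loaded))\n"
    )
    result = subprocess.run(
        [sys.executable, "-c", script],
        capture_output=True,
        text=True,
        check=True,
    )
    for line in result.stdout.splitlines():
        if line.startswith("LOADED:"):
            names = line[len("LOADED:"):]
            return set(names.split(",")) if names else set()
    raise RuntimeError("marker line not found in subprocess output")
```

Assuming dask is installed, a call such as modules_loaded_by("dask.layers", {"numpy", "pandas"}) would be expected to return an empty set if the constraint described above holds. A fresh subprocess is used because modules already imported in the current process would otherwise mask the result.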

During some discussions, @gjoseph92, @GenevieveBuckley and I have become concerned about the size and maintainability of layers.py. As we look towards implementing more layers, this file will get more unwieldy: each layer will typically have an implementation and possibly a set of utility functions, and the module will eventually become a significant fraction of dask.

Since the natural location for these layers (near userland code that instantiates them) isn’t workable due to the unwanted imports, we have sketched out a proposed structure for making this easier to organize and maintain going forward. The basic idea is to keep shadow submodules of dask.array and dask.dataframe (and possibly dask.bag) as siblings of those submodules, each containing layer implementations for their counterpart. So dask.array_layers would have Layer classes and all the supporting utilities for dask.array. This could look something like (conceptually, not literally):

├── core.py
├── base.py
├── array
│   ├── overlap.py (user code for overlap)
│   └── slicing.py (user code for slicing)
├── dataframe
│   ├── shuffle.py (user code for shuffles)
│   └── io.py (user code for read_csv, read_parquet, etc)
├── array_layers
│   ├── overlap.py (layers for overlap)
│   └── slicing.py (layers for slicing)
└── dataframe_layers
    ├── shuffle.py (layers for shuffles)
    └── io.py (layers for read_csv, read_parquet, etc)

Code from dask.array could freely import code from dask.array_layers, but not the reverse. I don’t think the layers submodules would exactly mirror their counterparts, but a rough correspondence for collection-producing operations would make it much easier to keep things organized and avoid having layers.py be an enormous dumping ground.
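The one-way import rule (dask.array may import from dask.array_layers, but not the reverse) could be enforced with a small static check over a layer module's source. This is only a sketch: forbidden_imports is a hypothetical helper, dask.array_layers is the proposed package name rather than an existing one, and relative imports are deliberately skipped.

```python
import ast


def _is_under(name, prefix):
    # "dask.array_layers" must NOT count as being under "dask.array",
    # so compare on whole dotted components rather than raw string prefixes.
    return name == prefix or name.startswith(prefix + ".")


def forbidden_imports(source, forbidden_prefixes):
    """Return the sorted module names imported in `source` that fall under
    any of `forbidden_prefixes`. Relative imports are ignored in this sketch.
    """
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            candidates = [[alias.name] for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # `from dask import array` names the module "dask.array" only
            # through its alias, so check both spellings.
            candidates = [
                [node.module, f"{node.module}.{alias.name}"]
                for alias in node.names
            ]
        else:
            continue
        for options in candidates:
            for name in options:
                if any(_is_under(name, p) for p in forbidden_prefixes):
                    found.add(name)
                    break
    return sorted(found)


# Example: source text of a hypothetical layer module that breaks the rule.
bad_layer_source = "import numpy as np\nfrom dask.array.core import Array\n"
violations = forbidden_imports(bad_layer_source, ["numpy", "pandas", "dask.array"])
```

A check like this could run over every file in a layers package as part of the test suite. Being static, it does not catch imports performed indirectly by other modules, which a runtime check in a fresh interpreter would.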

Thoughts? especially cc @rjzamora and @jrbourbeau

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 2
  • Comments: 19 (12 by maintainers)

Top GitHub Comments

1 reaction
gjoseph92 commented, Jun 2, 2021

@jsignell the only issue with that, versus having array, dataframe, etc. subdirectories like:

.
└── layers
    ├── array
    │   ├── core.py
    │   ├── map_overlap.py
    │   ├── optimization.py
    │   ├── rechunk.py
    │   └── slicing.py
    ├── dataframe
    │   ├── core.py
    │   ├── io.py
    │   ├── optimization.py
    │   └── shuffle.py
    ├── blockwise.py
    ├── core.py
    └── optimization.py

is that we’d have to collapse a lot of code into one file. For example, array/overlap.py and array/slicing.py are currently two files (1k and 2k lines respectively); bringing all of that (and more) into one array.py file would get really long.

1 reaction
gjoseph92 commented, May 28, 2021

"a dedicated layers/ directory with array and dataframe submodules?"

Based on my experience building sheds for bicycles, I’d also prefer this structure. I guess it feels a little more equivalent to the array/dataframe submodules for them to have exactly the same names. But mostly, I think it will be nice to have a dask/layers base module for exactly the sort of "significant layer logic that can/should be shared" that we’ll inevitably end up with. I’d probably vote for moving blockwise.py, and maybe even highlevel.py (though maybe there are issues with that one I haven’t thought of), into dask/layers as well.
