Create MutableMapping for automatic compression

In some workloads with highly compressible data we would like to automatically trade some computation time for more in-memory storage. Dask workers store data in a MutableMapping (the abstract interface that dict implements). So in principle all we would need to do is make a MutableMapping subclass whose __setitem__ and __getitem__ methods compress and decompress data on demand.
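As a rough illustration of the idea, here is a minimal sketch. It assumes pickle and zlib as stand-ins for whatever serializer and codec would actually be chosen, and CompressedDict is a hypothetical name, not something that exists in Dask.

```python
import pickle
import zlib
from collections.abc import MutableMapping


class CompressedDict(MutableMapping):
    """Hypothetical mapping that keeps every value pickled and zlib-compressed."""

    def __init__(self):
        self._data = {}  # key -> compressed bytes

    def __setitem__(self, key, value):
        # Compress on the way in
        self._data[key] = zlib.compress(pickle.dumps(value))

    def __getitem__(self, key):
        # Decompress on the way out; a missing key raises KeyError as usual
        return pickle.loads(zlib.decompress(self._data[key]))

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


d = CompressedDict()
d["x"] = [0] * 100_000          # highly compressible
assert d["x"] == [0] * 100_000  # round-trips through compress/decompress
```

Every read pays the decompression cost again, which is exactly the computation-for-memory trade described above.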

This would be an interesting and useful task for someone who wants to help Dask and learn some internals but doesn’t know a lot just yet, since it doesn’t require deep incidental Dask knowledge. I’m marking this as a good first issue.

Here is a conceptual prototype of such a MutableMapping. This is completely untested, but maybe gives a sense of how I think about this problem. It’s probably not ideal though so I would encourage others to come up with their own design.

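import collections
from collections.abc import MutableMapping
from typing import Callable, Dict, Tuple

class TypeCompression(MutableMapping):
    def __init__(
        self,
        types: Dict[type, Tuple[Callable, Callable]],
        storage=dict
    ):
        # Maps a value type to its (compress, decompress) pair
        self.types = types
        # One underlying mapping per value type
        self.storage = collections.defaultdict(storage)

    def __setitem__(self, key, value):
        typ = type(value)
        # Drop any older copy of this key stored under another type
        for other, d in self.storage.items():
            if other is not typ:
                d.pop(key, None)
        if typ in self.types:
            compress, decompress = self.types[typ]
            value = compress(value)
        self.storage[typ][key] = value

    def __getitem__(self, key):
        for typ, d in self.storage.items():
            if key in d:
                value = d[key]
                break
        else:
            raise KeyError(key)

        if typ in self.types:
            compress, decompress = self.types[typ]
            value = decompress(value)

        return value

    # MutableMapping also requires __delitem__, __iter__ and __len__
    def __delitem__(self, key):
        for d in self.storage.values():
            if key in d:
                del d[key]
                return
        raise KeyError(key)

    def __iter__(self):
        for d in self.storage.values():
            yield from d

    def __len__(self):
        return sum(len(d) for d in self.storage.values())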

This came up in https://github.com/dask/distributed/pull/3624 . cc @madsbk and @jakirkham from that PR. cc also @eric-czech who was maybe curious about automatic compression/decompression.

People looking at compression might want to look at and use Dask’s serialization and compression machinery in distributed.protocol (maybe start by looking at the dumps, serialize and maybe_compress functions).
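A name like maybe_compress hints that compressing is not always worth it. The following is a stdlib-only sketch of that idea and is not Dask’s implementation: the helper names, the 10 kB size cutoff and the 0.9 ratio are all assumptions made for illustration.

```python
import os
import zlib


def compress_if_worthwhile(payload: bytes, min_size: int = 10_000, min_ratio: float = 0.9):
    """Hypothetical helper: return (codec_name, data); codec_name is None if left as-is."""
    if len(payload) < min_size:
        return None, payload      # too small to bother
    compressed = zlib.compress(payload)
    if len(compressed) > min_ratio * len(payload):
        return None, payload      # barely shrank, keep the original
    return "zlib", compressed


def decompress(codec_name, data: bytes) -> bytes:
    # Undo compress_if_worthwhile using the codec name it returned
    return zlib.decompress(data) if codec_name == "zlib" else data


codec, data = compress_if_worthwhile(b"a" * 100_000)
assert codec == "zlib" and decompress(codec, data) == b"a" * 100_000

codec, data = compress_if_worthwhile(os.urandom(100_000))
assert codec is None  # random bytes do not compress
```

Storing the codec name next to the data lets the mapping's __getitem__ know whether a given value needs decompressing at all.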

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

3 reactions
mrocklin commented, Apr 9, 2020

Does types in TypeCompress refer to int, double, etc. or to snappy, blosc, lz4 etc. ?

You don’t have to use the structure I started with. I encourage you to think about this on your own and how you would design it. If you blindly follow my design you probably won’t develop a high level understanding of the problem. What I put up there was just an idea, not a fully formed one. Whoever solves this task will need to think a lot more about the problem than I did.

0 reactions
jakirkham commented, Jun 24, 2020

PR ( https://github.com/dask/distributed/pull/3702 ) seems to be going in the right direction. Probably the best place to move this forward atm.
