Pre-binned distribution/histogram

🚀 Feature

A user should be able to create an aim.Distribution using histogram data that the user computed. Perhaps by specifying some flag that disables the automatic internal numpy.histogram.

Motivation

Sometimes the exact histogram is already known, so there is no need to sample it from some random source.

Pitch

counts = [4, 2, 1]
bin_edges = [0, 9, 18, 99]  # or bin_midpoints?
aim.Distribution(counts=counts, bin_edges=bin_edges)
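
For illustration, the consistency checks such an interface would probably need on user-supplied data can be sketched as below. `check_prebinned` is a hypothetical helper, not part of Aim; it only assumes the usual histogram convention that N bins are delimited by N + 1 edges.

```python
import numpy as np


def check_prebinned(counts, bin_edges):
    """Hypothetical helper: sanity-check a user-supplied histogram."""
    counts = np.asanyarray(counts)
    bin_edges = np.asanyarray(bin_edges)
    if counts.ndim != 1 or bin_edges.ndim != 1:
        raise ValueError("`counts` and `bin_edges` must be one-dimensional.")
    # N bins are delimited by N + 1 edges.
    if len(bin_edges) != len(counts) + 1:
        raise ValueError("Expected len(bin_edges) == len(counts) + 1.")
    # Edges must increase strictly so that every bin has positive width.
    if np.any(np.diff(bin_edges) <= 0):
        raise ValueError("`bin_edges` must be strictly increasing.")
    return counts, bin_edges


# The values from the pitch pass the checks.
counts, bin_edges = check_prebinned([4, 2, 1], [0, 9, 18, 99])
total = int(counts.sum())  # total number of observations
```

If bin midpoints were accepted instead of edges, the edges would have to be inferred, which is ambiguous for unequal bin widths; that is one argument for taking edges.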

Alternatives

Plotting a typical plot.ly figure instead.

Additional context

N/A

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 1
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

YodaEmbedding commented, Sep 22, 2022 (1 reaction)

Attempt 2 (much cleaner):

class Distribution(CustomObject):
    """Distribution object used to store distribution objects in Aim repository."""

    def __init__(self, hist, bin_edges):
        super().__init__()
        hist = np.asanyarray(hist)
        bin_edges = np.asanyarray(bin_edges)
        self._from_np_histogram(hist, bin_edges)

    @classmethod
    def from_histogram(cls, hist, bin_edges):
        """Create Distribution object from histogram.

        Args:
            hist (:obj:): Array-like object representing bin frequency counts.
                Must be specified alongside `bin_edges`.
            bin_edges (:obj:): Array-like object representing bin edges.
                Must be specified alongside `hist`.
                Max 512 bins allowed.
        """
        return cls(hist, bin_edges)

    @classmethod
    def from_samples(cls, samples, bin_count=64):
        """Create Distribution object from data samples.

        Args:
            samples (:obj:): Array-like object of data sampled from a distribution.
            bin_count (:obj:`int`, optional): Optional distribution bin count for
                binning `samples`. 64 by default, max 512.
        """

        # These checks can perhaps be handled by np.histogram.
        # if not isinstance(bin_count, int):
        #     raise TypeError("`bin_count` must be an integer.")
        # try:
        #     hist, bin_edges = np.histogram(samples, bins=bin_count)
        # except TypeError:
        #     raise TypeError(f"Cannot create histogram from type {type(samples)}.")

        hist, bin_edges = np.histogram(samples, bins=bin_count)
        return cls(hist, bin_edges)

    def _from_np_histogram(self, hist, bin_edges):
        bin_count = len(bin_edges) - 1
        if not 1 <= bin_count <= 512:
            raise ValueError("Supported range for `bin_count` is [1, 512].")

        # Checks unnecessary due to asanyarray.
        # assert isinstance(hist, np.ndarray)
        # assert isinstance(bin_edges, np.ndarray)

        self.storage["data"] = BLOB(data=hist.tobytes())
        self.storage["dtype"] = str(hist.dtype)
        self.storage["bin_count"] = bin_count
        self.storage["range"] = [bin_edges[0].item(), bin_edges[-1].item()]
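
One thing worth noting about the storage layout above: only the counts, their dtype, the bin count, and the first and last edge are kept. The sketch below mimics those four fields with a plain dict (no Aim dependency) and rebuilds the edges with `np.linspace`. How Aim actually reconstructs edges on read is an assumption here; `pack` and `unpack` are hypothetical names used only for illustration.

```python
import numpy as np


def pack(hist, bin_edges):
    """Mimic the four fields written to `self.storage` in the snippet above."""
    hist = np.asanyarray(hist)
    bin_edges = np.asanyarray(bin_edges)
    return {
        "data": hist.tobytes(),
        "dtype": str(hist.dtype),
        "bin_count": len(bin_edges) - 1,
        "range": [bin_edges[0].item(), bin_edges[-1].item()],
    }


def unpack(stored):
    """Rebuild (hist, bin_edges), ASSUMING equal-width bins."""
    hist = np.frombuffer(stored["data"], dtype=stored["dtype"])
    low, high = stored["range"]
    bin_edges = np.linspace(low, high, stored["bin_count"] + 1)
    return hist, bin_edges


# Equal-width bins produced by np.histogram survive the round trip.
hist, edges = np.histogram(np.arange(100.0), bins=4)
hist2, edges2 = unpack(pack(hist, edges))
uniform_ok = bool(np.array_equal(hist, hist2) and np.allclose(edges, edges2))

# The unequal-width edges from the pitch cannot be recovered this way.
_, edges3 = unpack(pack([4, 2, 1], [0, 9, 18, 99]))
nonuniform_ok = bool(np.allclose(edges3, [0, 9, 18, 99]))
```

Under that assumption, a pre-binned histogram with unequal bin widths would need the full `bin_edges` array stored as well, or a check that rejects non-uniform edges.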

YodaEmbedding commented, Sep 20, 2022 (1 reaction)

In terms of the API, does this new __init__ interface seem reasonable?

class Distribution(CustomObject):
    """Distribution object used to store distribution objects in Aim repository.

    Args:
        data (:obj:): Optional array-like object of data sampled from a distribution.
        hist (:obj:): Optional array-like object representing bin frequency counts.
            Must be specified alongside `bin_edges`. `data` must not be specified.
        bin_edges (:obj:): Optional array-like object representing bin edges.
            Must be specified alongside `hist`. `data` must not be specified.
        bin_count (:obj:`int`, optional): Optional distribution bin count for
            binning `data`. 64 by default, max 512.
    """

    def __init__(self, data=None, *, hist=None, bin_edges=None, bin_count=64):
        super().__init__()

        if not isinstance(bin_count, int):
            raise TypeError('`bin_count` must be an integer.')
        if not 1 <= bin_count <= 512:
            raise ValueError('Supported range for `bin_count` is [1, 512].')
        self.storage['bin_count'] = bin_count

        np_histogram = self._to_np_histogram(data, hist, bin_edges, bin_count)
        self._from_np_histogram(np_histogram)

    def _to_np_histogram(self, data, hist, bin_edges, bin_count):
        if data is None:
            if hist is None or bin_edges is None:
                raise ValueError('Both `hist` and `bin_edges` must be specified.')
            return np.asanyarray(hist), np.asanyarray(bin_edges)
        if hist is not None or bin_edges is not None:
            raise ValueError(
                '`hist` and `bin_edges` may not be specified if `data` is.'
            )
        # convert to np.histogram
        try:
            return np.histogram(data, bins=bin_count)
        except TypeError:
            raise TypeError(
                f'Cannot convert to aim.Distribution. Unsupported type {type(data)}.'
            )

Usage:

# Compatible with old interface:
aim.Distribution(sampled_data)

# Supports new usage:
hist, bin_edges = np.histogram(sampled_data)
aim.Distribution(hist=hist, bin_edges=bin_edges)

Supporting both data and (hist, bin_edges) makes it look a bit more complicated than, e.g. deprecating data and forcing the user to do np.histogram themselves whenever they need it.

Also, is there a reason behind setting bin_count=512 as the max?

