
by() reduction doesn't combine with mean(), std() etc.

See original GitHub issue

ALL software version info

Datashader master in a fresh virtualenv.

Description of expected behavior and the observed behavior

by() doesn’t seem to be able to combine itself with the more complex reductions like mean(). I think it ignores _build_bases when constructing its append method, but I’m not adept enough with the code yet to fix it myself.
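For context on why mean() has no _append of its own: compound reductions like mean are built from simpler base reductions (a sum and a count) that are accumulated first and only combined into the final value in a finalize step. The _build_bases/_build_append names above are from the datashader source; the sketch below is plain numpy with illustrative names, not datashader's actual API, and just shows the two-stage pattern involved.

```python
import numpy as np

def append_bases(bins, vals, n_bins):
    # Stage 1: accumulate the base aggregates (sum and count per bin).
    # A compound reduction like mean has no append step of its own --
    # only its bases do -- which is consistent with the AttributeError
    # when _build_append is called on mean directly.
    sums = np.zeros(n_bins)
    counts = np.zeros(n_bins)
    for b, v in zip(bins, vals):
        sums[b] += v
        counts[b] += 1
    return sums, counts

def finalize_mean(sums, counts):
    # Stage 2: combine the bases into the final reduction.
    with np.errstate(invalid="ignore"):
        return np.where(counts > 0, sums / counts, np.nan)

bins = np.array([0, 0, 1, 2, 2, 2])
vals = np.array([1.0, 3.0, 5.0, 2.0, 4.0, 6.0])
sums, counts = append_bases(bins, vals, n_bins=3)
print(finalize_mean(sums, counts))  # [2. 5. 4.]
```

Under this reading, by() would need to wrap the append methods of the wrapped reduction's bases, not the compound reduction itself.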

Complete, minimal, self-contained example code that reproduces the issue

import numpy as np
import dask.dataframe
import datashader
import pandas as pd

if __name__ == '__main__':
    pf = pd.DataFrame(dict(a=np.arange(10), b=np.arange(10), c=np.arange(-5,5), cat=[0,0,0,1,1,1,2,2,2,3]))
    ddf = dask.dataframe.from_pandas(pf, npartitions=1)
    ddf = ddf.categorize('cat')
    print(ddf)

    canvas = datashader.Canvas(10, 10)

    raster = canvas.points(ddf, 'a', 'b', datashader.count_cat('cat'))
    print("count_cat ok")

    raster = canvas.points(ddf, 'a', 'b', datashader.mean('c'))
    print("mean ok")

    raster = canvas.points(ddf, 'a', 'b', datashader.by('cat', datashader.mean('c')))
    print("by(cat, mean(c)) ok")

Stack traceback and/or browser JavaScript console output

(sms) oms@tshikovski:~/projects/shadeMS$ python ./test-ds.py 
Dask DataFrame Structure:
                   a      b      c              cat
npartitions=1                                      
0              int64  int64  int64  category[known]
9                ...    ...    ...              ...
Dask Name: categorize_block, 2 tasks
count_cat ok
mean ok
Traceback (most recent call last):
  File "/home/oms/.venv/sms/lib/python3.6/site-packages/toolz/functoolz.py", line 456, in memof
    return cache[k]
KeyError: ((<datashader.reductions.by object at 0x7fcc800a9978>, dshape("""{
  a: int64,
  b: int64,
  c: int64,
  cat: categorical[[0, 1, 2, 3], type=int64, ordered=False]
  }"""), <datashader.glyphs.points.Point object at 0x7fcc800a9dd8>), frozenset({('cuda', False)}))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./test-ds.py", line 27, in <module>
    raster = canvas.points(ddf, 'a', 'b', datashader.by('cat', datashader.mean('c')))
  File "/scratch/oms/projects/datashader/datashader/core.py", line 224, in points
    return bypixel(source, self, glyph, agg)
  File "/scratch/oms/projects/datashader/datashader/core.py", line 1192, in bypixel
    return bypixel.pipeline(source, schema, canvas, glyph, agg)
  File "/scratch/oms/projects/datashader/datashader/utils.py", line 94, in __call__
    return lk[typ](head, *rest, **kwargs)
  File "/scratch/oms/projects/datashader/datashader/data_libraries/dask.py", line 19, in dask_pipeline
    dsk, name = glyph_dispatch(glyph, df, schema, canvas, summary, cuda=cuda)
  File "/scratch/oms/projects/datashader/datashader/utils.py", line 97, in __call__
    return lk[cls](head, *rest, **kwargs)
  File "/scratch/oms/projects/datashader/datashader/data_libraries/dask.py", line 68, in default
    compile_components(summary, schema, glyph, cuda=cuda)
  File "/home/oms/.venv/sms/lib/python3.6/site-packages/toolz/functoolz.py", line 460, in memof
    cache[k] = result = func(*args, **kwargs)
  File "/scratch/oms/projects/datashader/datashader/compiler.py", line 57, in compile_components
    calls = [_get_call_tuples(b, d, schema, cuda) for (b, d) in zip(bases, dshapes)]
  File "/scratch/oms/projects/datashader/datashader/compiler.py", line 57, in <listcomp>
    calls = [_get_call_tuples(b, d, schema, cuda) for (b, d) in zip(bases, dshapes)]
  File "/scratch/oms/projects/datashader/datashader/compiler.py", line 83, in _get_call_tuples
    return (base._build_append(dshape, schema, cuda),
  File "/scratch/oms/projects/datashader/datashader/reductions.py", line 200, in _build_append
    f = self.reduction._build_append(dshape, schema, cuda)
  File "/scratch/oms/projects/datashader/datashader/reductions.py", line 115, in _build_append
    return self._append
AttributeError: 'mean' object has no attribute '_append'
(sms) oms@tshikovski:~/projects/shadeMS$ 
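As a cross-check on what by('cat', mean('c')) should return for this frame, the expected per-pixel, per-category means can be computed with a plain pandas groupby. This is a sketch of the expected semantics only, not datashader code; the bin columns are illustrative names.

```python
import numpy as np
import pandas as pd

# Same toy frame as the repro above.
pf = pd.DataFrame(dict(a=np.arange(10), b=np.arange(10),
                       c=np.arange(-5, 5), cat=[0, 0, 0, 1, 1, 1, 2, 2, 2, 3]))

# Bin a and b into a 10x10 grid (here each value lands in its own bin),
# then take the mean of c per (pixel, category) group -- the values a
# working by('cat', mean('c')) aggregation should agree with.
pf["xbin"] = pd.cut(pf["a"], bins=10, labels=False)
pf["ybin"] = pd.cut(pf["b"], bins=10, labels=False)
expected = pf.groupby(["ybin", "xbin", "cat"])["c"].mean()
print(expected)
```

Since every point here falls in its own pixel, each group's mean is just that row's c value, which makes the raster easy to verify by eye.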


Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

1 reaction
philippjfr commented, Apr 25, 2020

Nevermind, I can reproduce.

0 reactions
o-smirnov commented, Apr 26, 2020

I’m afraid by(..., std()) is still subtly broken. It’s producing incorrect numbers. See this example:

import numpy as np
import dask.dataframe
import pandas as pd
import datashader

if __name__ == '__main__':

    pf = pd.DataFrame(dict(a=np.random.randint(0, 10, 1000),
                           b=np.random.randint(0, 10, 1000),
                           c=np.random.randint(0, 10, 1000),
                           cat=np.random.randint(0, 3, 1000)))

    print(np.std(pf['a']))

    ddf = dask.dataframe.from_pandas(pf, npartitions=1)
    ddf = ddf.categorize('cat')
    print(ddf)

    canvas = datashader.Canvas(10, 10)

    raster = canvas.points(ddf, 'a', 'b', datashader.count_cat('cat'))
    print("count_cat ok")

    raster = canvas.points(ddf, 'a', 'b', datashader.mean('c'))
    print("mean ok")

    raster = canvas.points(ddf, 'a', 'b', datashader.by('cat', datashader.mean('c')))
    print("by(cat, mean(c)) ok")
    print(raster)

    raster = canvas.points(ddf, 'a', 'b', datashader.by('cat', datashader.std('c')))
    print("by(cat, std(c)) ok")
    print(raster)
    print(np.nanmin(raster), np.nanmax(raster))

    raster = canvas.points(ddf, 'a', 'b', datashader.std('c'))
    print("std(c) ok")
    print(raster)
    print(np.nanmin(raster), np.nanmax(raster))

The second-last raster above, the one made via a by('cat', std('c')) aggregation, consistently reports a max value greater than 10. The “c” column contains only random integers from 0 to 9, so a standard deviation that large is impossible.

The last raster, obtained by a plain std('c') reduction, looks to have sensible values.
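One plausible way such inflated values can arise (an assumption about the failure mode, not a confirmed diagnosis of the datashader bug) is the moments-based std formula, var = E[x²] − E[x]², being evaluated with mismatched counts: if per-category sums and sums of squares are divided by the wrong count when the bases are combined, the variance comes out too large. For comparison, the correct moments computation on data like the repro's:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 10, 1000).astype(float)

# std from accumulated moments: var = E[x^2] - E[x]^2.
n = len(x)
mean = x.sum() / n
mean_sq = (x * x).sum() / n
std = np.sqrt(mean_sq - mean * mean)

# Matches numpy's population std (ddof=0).
print(np.isclose(std, np.std(x)))  # True

# Values drawn from 0..9 bound the standard deviation well below 10
# (the theoretical maximum for this range is 4.5), so a reported max
# above 10 points at the aggregates being combined inconsistently.
assert std < 4.5
```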
