
by() reduction doesn't combine with mean(), std() etc.

See original GitHub issue

ALL software version info

Datashader master in a fresh virtualenv.

Description of expected behavior and the observed behavior

by() doesn’t seem to be able to combine itself with the more complex reductions like mean(). I think it ignores _build_bases when constructing its append method, but I’m not adept enough with the code yet to fix it myself.
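For context on why mean() has no _append of its own: compound reductions like mean are built from simpler base reductions (a sum and a count) that are accumulated first and only combined into the final value in a finalize step. The _build_bases/_build_append names above are from the datashader source; the sketch below is plain numpy with illustrative names, not datashader's actual API, and just shows the two-stage pattern involved.

```python
import numpy as np

def append_bases(bins, vals, n_bins):
    # Stage 1: accumulate the base aggregates (sum and count per bin).
    # A compound reduction like mean has no append step of its own --
    # only its bases do -- which is consistent with the AttributeError
    # when _build_append is called on mean directly.
    sums = np.zeros(n_bins)
    counts = np.zeros(n_bins)
    for b, v in zip(bins, vals):
        sums[b] += v
        counts[b] += 1
    return sums, counts

def finalize_mean(sums, counts):
    # Stage 2: combine the bases into the final reduction.
    with np.errstate(invalid="ignore"):
        return np.where(counts > 0, sums / counts, np.nan)

bins = np.array([0, 0, 1, 2, 2, 2])
vals = np.array([1.0, 3.0, 5.0, 2.0, 4.0, 6.0])
sums, counts = append_bases(bins, vals, n_bins=3)
print(finalize_mean(sums, counts))  # [2. 5. 4.]
```

Under this reading, by() would need to wrap the append methods of the wrapped reduction's bases, not the compound reduction itself.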

Complete, minimal, self-contained example code that reproduces the issue

import numpy as np
import dask.dataframe
import datashader
import pandas as pd

if __name__ == '__main__':
    pf = pd.DataFrame(dict(a=np.arange(10), b=np.arange(10), c=np.arange(-5,5), cat=[0,0,0,1,1,1,2,2,2,3]))
    ddf = dask.dataframe.from_pandas(pf, npartitions=1)
    ddf = ddf.categorize('cat')
    print(ddf)

    canvas = datashader.Canvas(10, 10)

    raster = canvas.points(ddf, 'a', 'b', datashader.count_cat('cat'))
    print("count_cat ok")

    raster = canvas.points(ddf, 'a', 'b', datashader.mean('c'))
    print("mean ok")

    raster = canvas.points(ddf, 'a', 'b', datashader.by('cat', datashader.mean('c')))
    print("by(cat, mean(c)) ok")

Stack traceback and/or browser JavaScript console output

(sms) oms@tshikovski:~/projects/shadeMS$ python ./test-ds.py 
Dask DataFrame Structure:
                   a      b      c              cat
npartitions=1                                      
0              int64  int64  int64  category[known]
9                ...    ...    ...              ...
Dask Name: categorize_block, 2 tasks
count_cat ok
mean ok
Traceback (most recent call last):
  File "/home/oms/.venv/sms/lib/python3.6/site-packages/toolz/functoolz.py", line 456, in memof
    return cache[k]
KeyError: ((<datashader.reductions.by object at 0x7fcc800a9978>, dshape("""{
  a: int64,
  b: int64,
  c: int64,
  cat: categorical[[0, 1, 2, 3], type=int64, ordered=False]
  }"""), <datashader.glyphs.points.Point object at 0x7fcc800a9dd8>), frozenset({('cuda', False)}))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./test-ds.py", line 27, in <module>
    raster = canvas.points(ddf, 'a', 'b', datashader.by('cat', datashader.mean('c')))
  File "/scratch/oms/projects/datashader/datashader/core.py", line 224, in points
    return bypixel(source, self, glyph, agg)
  File "/scratch/oms/projects/datashader/datashader/core.py", line 1192, in bypixel
    return bypixel.pipeline(source, schema, canvas, glyph, agg)
  File "/scratch/oms/projects/datashader/datashader/utils.py", line 94, in __call__
    return lk[typ](head, *rest, **kwargs)
  File "/scratch/oms/projects/datashader/datashader/data_libraries/dask.py", line 19, in dask_pipeline
    dsk, name = glyph_dispatch(glyph, df, schema, canvas, summary, cuda=cuda)
  File "/scratch/oms/projects/datashader/datashader/utils.py", line 97, in __call__
    return lk[cls](head, *rest, **kwargs)
  File "/scratch/oms/projects/datashader/datashader/data_libraries/dask.py", line 68, in default
    compile_components(summary, schema, glyph, cuda=cuda)
  File "/home/oms/.venv/sms/lib/python3.6/site-packages/toolz/functoolz.py", line 460, in memof
    cache[k] = result = func(*args, **kwargs)
  File "/scratch/oms/projects/datashader/datashader/compiler.py", line 57, in compile_components
    calls = [_get_call_tuples(b, d, schema, cuda) for (b, d) in zip(bases, dshapes)]
  File "/scratch/oms/projects/datashader/datashader/compiler.py", line 57, in <listcomp>
    calls = [_get_call_tuples(b, d, schema, cuda) for (b, d) in zip(bases, dshapes)]
  File "/scratch/oms/projects/datashader/datashader/compiler.py", line 83, in _get_call_tuples
    return (base._build_append(dshape, schema, cuda),
  File "/scratch/oms/projects/datashader/datashader/reductions.py", line 200, in _build_append
    f = self.reduction._build_append(dshape, schema, cuda)
  File "/scratch/oms/projects/datashader/datashader/reductions.py", line 115, in _build_append
    return self._append
AttributeError: 'mean' object has no attribute '_append'
(sms) oms@tshikovski:~/projects/shadeMS$ 
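As a cross-check on what by('cat', mean('c')) should return for this frame, the expected per-pixel, per-category means can be computed with a plain pandas groupby. This is a sketch of the expected semantics only, not datashader code; the bin columns are illustrative names.

```python
import numpy as np
import pandas as pd

# Same toy frame as the repro above.
pf = pd.DataFrame(dict(a=np.arange(10), b=np.arange(10),
                       c=np.arange(-5, 5), cat=[0, 0, 0, 1, 1, 1, 2, 2, 2, 3]))

# Bin a and b into a 10x10 grid (here each value lands in its own bin),
# then take the mean of c per (pixel, category) group -- the values a
# working by('cat', mean('c')) aggregation should agree with.
pf["xbin"] = pd.cut(pf["a"], bins=10, labels=False)
pf["ybin"] = pd.cut(pf["b"], bins=10, labels=False)
expected = pf.groupby(["ybin", "xbin", "cat"])["c"].mean()
print(expected)
```

Since every point here falls in its own pixel, each group's mean is just that row's c value, which makes the raster easy to verify by eye.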


Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

1 reaction
philippjfr commented, Apr 25, 2020

Nevermind, I can reproduce.

0 reactions
o-smirnov commented, Apr 26, 2020

I’m afraid by(..., std()) is still subtly broken. It’s producing incorrect numbers. See this example:

import numpy as np
import dask.dataframe
import pandas as pd
import datashader

if __name__ == '__main__':

    pf = pd.DataFrame(dict(a=np.random.randint(0, 10, 1000),
                           b=np.random.randint(0, 10, 1000),
                           c=np.random.randint(0, 10, 1000),
                           cat=np.random.randint(0, 3, 1000)))

    print(np.std(pf['a']))

    ddf = dask.dataframe.from_pandas(pf, npartitions=1)
    ddf = ddf.categorize('cat')
    print(ddf)

    canvas = datashader.Canvas(10, 10)

    raster = canvas.points(ddf, 'a', 'b', datashader.count_cat('cat'))
    print("count_cat ok")

    raster = canvas.points(ddf, 'a', 'b', datashader.mean('c'))
    print("mean ok")

    raster = canvas.points(ddf, 'a', 'b', datashader.by('cat', datashader.mean('c')))
    print("by(cat, mean(c)) ok")
    print(raster)

    raster = canvas.points(ddf, 'a', 'b', datashader.by('cat', datashader.std('c')))
    print("by(cat, std(c)) ok")
    print(raster)
    print(np.nanmin(raster), np.nanmax(raster))

    raster = canvas.points(ddf, 'a', 'b', datashader.std('c'))
    print("std(c) ok")
    print(raster)
    print(np.nanmin(raster), np.nanmax(raster))

The second-last raster above, the one made via a by('cat', std('c')) aggregation, consistently reports a max value greater than 10. The “c” column contains only random integers from 0 to 9, so a standard deviation that large is impossible.

The last raster, obtained by a plain std('c') reduction, looks to have sensible values.
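One plausible way such inflated values can arise (an assumption about the failure mode, not a confirmed diagnosis of the datashader bug) is the moments-based std formula, var = E[x²] − E[x]², being evaluated with mismatched counts: if per-category sums and sums of squares are divided by the wrong count when the bases are combined, the variance comes out too large. For comparison, the correct moments computation on data like the repro's:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 10, 1000).astype(float)

# std from accumulated moments: var = E[x^2] - E[x]^2.
n = len(x)
mean = x.sum() / n
mean_sq = (x * x).sum() / n
std = np.sqrt(mean_sq - mean * mean)

# Matches numpy's population std (ddof=0).
print(np.isclose(std, np.std(x)))  # True

# Values drawn from 0..9 bound the standard deviation well below 10
# (the theoretical maximum for this range is 4.5), so a reported max
# above 10 points at the aggregates being combined inconsistently.
assert std < 4.5
```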
