Erratic histogram behavior

I’ve been struggling with cupy.histogram() for more than 8 hours at this point. My data resides in a dask_cudf dataframe. When I use the following code, the results are as expected.

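import cupy

# ddf is the dask_cudf dataframe; nbins is the number of equal-width bins
h, bins = cupy.histogram(ddf.t_int.values.compute(),
                         bins=nbins,
                         range=[ddf.t_int.min().compute(), ddf.t_int.max().compute()])
counts = cupy.asnumpy(h)  # copy the counts from the GPU to host memory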

However, what I really want to do is prescribe the bin edges myself, and that’s where the erratic behavior appears. When I modify my code as follows, it sometimes returns the expected results and other times returns counts as an array full of zeros. The outcome depends on which preprocessing pipeline my dask_cudf dataframe goes through. Regardless of the pipeline steps, a cupy ndarray full of integers is always passed to the histogram function. I cannot figure out why the histogram output varies when I am passing in the same kind of data in all cases. To reiterate, the code above always returns good results with the same data.

# Same data, but with explicit bin edges taken from x_divs_ls
h, bins = cupy.histogram(ddf.t_int.values.compute(),
                         bins=cupy.asarray(x_divs_ls))
counts = cupy.asnumpy(h)

Here’s my environment:

$ python -c 'import cupy; cupy.show_config()'
CuPy Version          : 8.0.0
CUDA Build Version    : 11000
CUDA Driver Version   : 11020
CUDA Runtime Version  : 11000
cuBLAS Version        : 11200
cuFFT Version         : 10201
cuRAND Version        : 10201
cuSOLVER Version      : (10, 6, 0)
cuSPARSE Version      : 11101
NVRTC Version         : (11, 0)
Thrust Version        : 100909
CUB Build Version     : 100909
cuDNN Build Version   : 8000
cuDNN Version         : 8000
NCCL Build Version    : 2708
NCCL Runtime Version  : 2708
cuTENSOR Version      : None

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

2 reactions
buckeye17 commented, Mar 26, 2021

@emcastillo @leofang Thanks for your consideration. I’ve discovered the erratic behavior is actually caused by my various preprocessing pipelines. When I convert the datetime column to integers (required for histogram()), the integers are sometimes on the order of 1x10^15 and sometimes on the order of 1x10^18. My predefined bin edges were only valid for one of these cases, so in the other case every value fell outside the bins and the counts came back as zeros. The erratic behavior had nothing to do with the cupy.histogram() function.
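
The mismatch described in this comment can be reproduced without a GPU. The following is a minimal sketch in plain Python, not the poster's code. It assumes the two magnitudes correspond to microseconds (about 10^15) and nanoseconds (about 10^18) since the Unix epoch, which the comment does not state. The helpers epoch_int, histogram, and edges_overlap_data are hypothetical names introduced for illustration; histogram only mimics the binning rule of numpy.histogram and cupy.histogram (half-open bins, last bin closed, out-of-range values ignored).

```python
from bisect import bisect_right
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
NS, US = 1_000_000_000, 1_000_000  # ticks per second (assumed units)


def epoch_int(ts, units_per_second):
    """Convert an aware datetime to integer ticks since the Unix epoch."""
    delta = ts - EPOCH
    # Pure integer arithmetic avoids float rounding at nanosecond magnitudes.
    micros = (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds
    return micros * units_per_second // 1_000_000


def histogram(values, edges):
    """Count values per bin; values outside [edges[0], edges[-1]] are ignored."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue
        # The last bin is closed on the right, so clamp the index.
        i = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[i] += 1
    return counts


def edges_overlap_data(values, edges):
    """Cheap sanity check: do the bin edges cover any of the data range?"""
    return min(values) <= edges[-1] and max(values) >= edges[0]


def utc(day, hour=0):
    return datetime(2021, 3, day, hour, tzinfo=timezone.utc)


stamps = [utc(26, h) for h in (0, 6, 12, 18)]
values_ns = [epoch_int(t, NS) for t in stamps]  # one pipeline: nanoseconds
values_us = [epoch_int(t, US) for t in stamps]  # another pipeline: microseconds

# Bin edges prepared in nanoseconds: midnight, noon, next midnight.
edges_ns = [epoch_int(utc(26), NS), epoch_int(utc(26, 12), NS), epoch_int(utc(27), NS)]

counts_ns = histogram(values_ns, edges_ns)  # data and edges share a unit
counts_us = histogram(values_us, edges_ns)  # data is 1000x smaller than the edges

print(counts_ns, edges_overlap_data(values_ns, edges_ns))
print(counts_us, edges_overlap_data(values_us, edges_ns))
```

With matching units each half-day bin receives two timestamps. With microsecond data and nanosecond edges every value lies below the first edge, so the counts are all zero. Comparing the data's min and max against the first and last edge before calling the histogram function is an inexpensive way to catch this.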

1 reaction
jakirkham commented, Mar 29, 2021

Instead of computing the results from Dask-cuDF, it may be worth using to_dask_array to convert to a Dask Array (backed by CuPy) and then using da.histogram to perform the computation with Dask.
