Erratic histogram behavior
I’ve been struggling with cupy.histogram() for more than 8 hours at this point. My data resides in a dask_cudf dataframe. When I use the following code, the results are as expected.
h, bins = cupy.histogram(ddf.t_int.values.compute(),
                         bins=nbins,
                         range=[ddf.t_int.min().compute(), ddf.t_int.max().compute()])
counts = cupy.asnumpy(h)
However, what I really want to do is prescribe the bin edges myself, and that’s where the erratic behavior is seen. When I modify my code as follows, it sometimes returns the expected results and other times returns counts as an array full of zeros. These different outcomes depend on which pre-processing pipeline my dask_cudf dataframe goes through. Regardless of the pipeline steps, a cupy ndarray full of integers is always given to the histogram function. I cannot figure out why the histogram output varies when I am passing in the same kind of data in all cases. And to reiterate, using the code above with the same data always returns good results.
h, bins = cupy.histogram(ddf.t_int.values.compute(),
                         bins=cupy.asarray(x_divs_ls))
counts = cupy.asnumpy(h)
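An all-zero result with prescribed edges usually means that no value falls between the first and last edge. The sketch below is a minimal illustration of that situation, not the code from this issue: it uses numpy.histogram as a CPU stand-in for cupy.histogram (which mirrors its interface), the timestamps are made up, and coverage() is a hypothetical helper introduced only for this example.

```python
import numpy as np

# Made-up timestamps: the same three instants stored in two different units.
ts = np.array(
    ["2021-01-01T00:00:00", "2021-01-01T12:00:00", "2021-01-02T00:00:00"],
    dtype="datetime64[ns]",
)
t_ns = ts.astype("int64")                           # nanoseconds, ~1.6e18
t_us = ts.astype("datetime64[us]").astype("int64")  # microseconds, ~1.6e15

# Bin edges prepared for microsecond data only.
edges_us = np.array([t_us[0], t_us[1], t_us[2]], dtype="int64")


def coverage(values, edges):
    """Hypothetical helper: fraction of values inside [edges[0], edges[-1]]."""
    inside = (values >= edges[0]) & (values <= edges[-1])
    return float(inside.mean())


counts_us, _ = np.histogram(t_us, bins=edges_us)  # matching units
counts_ns, _ = np.histogram(t_ns, bins=edges_us)  # mismatched units

print(counts_us, coverage(t_us, edges_us))
print(counts_ns, coverage(t_ns, edges_us))
```

Checking the data's min and max against the first and last bin edge before calling the histogram function is a cheap way to catch this kind of mismatch.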
Here’s my environment:
$ python -c 'import cupy; cupy.show_config()'
CuPy Version : 8.0.0
CUDA Build Version : 11000
CUDA Driver Version : 11020
CUDA Runtime Version : 11000
cuBLAS Version : 11200
cuFFT Version : 10201
cuRAND Version : 10201
cuSOLVER Version : (10, 6, 0)
cuSPARSE Version : 11101
NVRTC Version : (11, 0)
Thrust Version : 100909
CUB Build Version : 100909
cuDNN Build Version : 8000
cuDNN Version : 8000
NCCL Build Version : 2708
NCCL Runtime Version : 2708
cuTENSOR Version : None
Issue Analytics
- Created 2 years ago
- Comments:5 (4 by maintainers)
Top GitHub Comments
@emcastillo @leofang Thanks for your consideration. I’ve discovered the erratic behavior is actually caused by my various preprocessing pipelines. When I convert the datetime column to integers (required for histogram()), sometimes the integers are on the order of 1x10^15 and sometimes on the order of 1x10^18. My predefined bin edges were only valid for one of these cases. So the erratic behavior had nothing to do with the cupy.histogram() function.

Instead of computing the results from Dask-cuDF, it may be worth using to_dask_array to convert to a Dask Array (backed by CuPy) and then using da.histogram to perform the computation with Dask.
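A minimal sketch of the da.histogram route follows. It runs on a NumPy-backed Dask array with made-up values so that it works without a GPU. In the workflow from this issue, the array would presumably come from something like ddf.t_int.to_dask_array(lengths=True) and be CuPy-backed, which is an assumption and is not tested here.

```python
import numpy as np
import dask.array as da

# Made-up integer data split into chunks, standing in for the column
# that would be obtained from the dataframe via to_dask_array.
values = da.from_array(np.array([1, 5, 12, 25, 29]), chunks=2)

# Prescribed bin edges; when explicit edges are given, no range is needed.
edges = np.array([0, 10, 20, 30])

# da.histogram builds a lazy result; compute() triggers the work.
h, bins = da.histogram(values, bins=edges)
counts = h.compute()
print(counts)
```

As with cupy.histogram, the edges have to be in the same units as the data, so the check on the integer magnitudes described above still applies.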