Unreasonable default fill_values

Firstly thanks for the impressive package. We’re considering using it in https://github.com/pydata/xarray to provide faster groupby operations.

It looks like some forms of stack / unstacking are supported too, if I’m looking at “Form 4” in the readme. Is it currently possible to supply a subset of the indices as part of that?


In [7]: import numpy_groupies as npg
In [1]: import numpy as np
In [30]: from numpy_groupies.aggregate_numpy import aggregate


In [26]: flat = np.arange(12).astype(float)
    ...: data = values = flat.reshape(3, -1)

In [4]: import itertools

In [5]: group_idx = np.array(list(itertools.product(*[range(x) for x in values.shape]))).T
   ...: group_idx
Out[5]:
array([[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
       [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]])

This works well:


In [32]: aggregate(group_idx, flat, "array", size=(3, 4))
Out[32]:
array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])

But this doesn’t:


In [33]: aggregate(group_idx[:, :-1], flat[:-1].astype(float), "array", size=(3, 4))
/usr/local/lib/python3.8/site-packages/numpy/core/_asarray.py:136: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  return array(a, dtype, copy=False, order=order, subok=True)
Out[33]:
array([[array([0.]), array([1.]), array([2.]), array([3.])],
       [array([4.]), array([5.]), array([6.]), array([7.])],
       [array([8.]), array([9.]), array([10.]), 0]], dtype=object)

Notably, supplying sum does compute, though the result has 0 rather than nan:


In [35]: aggregate(group_idx[:, :-1], flat[:-1].astype(float), "sum", size=(3, 4))
Out[35]:
array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10.,  0.]])

Ideally the “array” case above would return:

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., nan]])
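
For comparison, the desired result above can be reproduced with plain numpy indexing. This is only a sketch of the expected behaviour, not how numpy_groupies implements "array"; the helper name `scatter_with_fill` is hypothetical, and it assumes each index column appears at most once (with duplicates, the last value written would win).

```python
import numpy as np

def scatter_with_fill(group_idx, values, size, fill_value=np.nan):
    """Place each value at its index; cells with no index keep fill_value.

    Hypothetical helper for illustration, not part of numpy_groupies.
    """
    out = np.full(size, fill_value, dtype=float)
    # group_idx has shape (ndim, n); tuple() turns it into one index array per axis
    out[tuple(group_idx)] = values
    return out

flat = np.arange(12).astype(float)
# Same (2, 12) index layout as the itertools.product example above
group_idx = np.array(list(np.ndindex(3, 4))).T

# Drop the last index/value pair, as in the failing call
result = scatter_with_fill(group_idx[:, :-1], flat[:-1], size=(3, 4))
print(result)
```

The last cell, which has no index pointing at it, is left as nan rather than 0.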

Of course, if this library — as per the name — is more focused on groupby than stacking, totally reasonable to close this as wontfix.

Thanks!

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

ml31415 commented, Jul 21, 2022 (1 reaction)

We might have different default fill_values for sum and nansum as well.

ml31415 commented, Feb 1, 2021 (1 reaction)

Of course PRs are welcome any time. I’d also be happy to add you to the committers, if you want to take over some long term responsibility.

About the robustness of the current features, please check whether the unit tests are sufficient for your use case. The major part of the test suite compares the results of the optimized implementations against a generic implementation using only numpy functions. I have used the library in production for years, and errors there would have been quite costly to me. But as we just saw with the examples you came up with, that only holds true for the more frequently used features of this library, and the tests might need some extensions for your use cases.

About stacking, not sure if I got your question right. Internally npg loops naively over two 1D arrays. All else is handled with plain numpy functions to prepare everything in 1D shape before the action starts, and restore the original shape afterwards. So there are probably no speed gains hiding, see https://github.com/ml31415/numpy-groupies/blob/786a78b2da9d8df94ac373cdf11eef93ac723a8c/numpy_groupies/utils_numpy.py#L192 .
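
The flatten, aggregate in 1D, then reshape pattern described above can be sketched with plain numpy. This is an illustration under assumptions, not the library's actual code: the function name `grouped_sum_nan_empty` is hypothetical, and `np.bincount` stands in for the optimized 1D loop. It also marks empty groups with nan instead of 0, which is the behaviour asked for in the issue.

```python
import numpy as np

def grouped_sum_nan_empty(group_idx, values, size):
    """Sum values per N-d group via a flat 1D index; empty groups become nan.

    Hypothetical helper for illustration, not part of numpy_groupies.
    """
    # Collapse the (ndim, n) index array into one flat index per value
    flat_idx = np.ravel_multi_index(tuple(group_idx), size)
    n_groups = int(np.prod(size))
    # 1D aggregation: weighted bincount gives the per-group sums
    sums = np.bincount(flat_idx, weights=values, minlength=n_groups)
    # Count the members of each group so empty groups can be identified
    counts = np.bincount(flat_idx, minlength=n_groups)
    sums[counts == 0] = np.nan
    # Restore the requested output shape
    return sums.reshape(size)

group_idx = np.array([[0, 0, 1, 1],
                      [0, 0, 1, 2]])
values = np.array([1.0, 2.0, 3.0, 4.0])
out = grouped_sum_nan_empty(group_idx, values, size=(2, 3))
print(out)
```

Counting group members separately is what lets a genuine sum of 0 be told apart from a group that received no values.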

The code itself is a bit “grown”, and some parts might need some refactoring. The main reason why the weave part is still there is to have a speed reference for the numba implementation, so it didn’t see new group functions for a while.
