Stateful user-defined accessors
See original GitHub issueIf anybody decorates a stateful class with @register_dataarray_accessor
or @register_dataset_accessor
, the instance will lose its state on any method that invokes _to_temp_dataset
, as well as on a shallow copy.
In [1]: @xarray.register_dataarray_accessor('foo')
...: class Foo:
...: def __init__(self, obj):
...: self.obj = obj
...: self.x = 1
...:
...:
In [2]: a = xarray.DataArray()
In [3]: a.foo.x
Out[3]: 1
In [4]: a.foo.x = 2
In [5]: a.foo.x
Out[5]: 2
In [6]: a.roll().foo.x
Out[6]: 1
In [7]: a.copy(deep=False).foo.x
Out[7]: 1
While in the case of _to_temp_dataset
it could be possible to spend (substantial) effort to retain the state, on the case of copy() it’s impossible without modifying the accessor duck API, as one would need to tamper with the accessor instance in place and modify the pointer back to the DataArray/Dataset.
This issue is so glaring that it makes me strongly suspect that nobody saves any state in accessor classes. This kind of use would also be problematic in practical terms, as the accessor object would have a hard time realising when its own state is no longer coherent with the referenced DataArray/Dataset.
This design also carries the problem that it introduces a circular reference in the DataArray/Dataset. This means that, after someone invokes an accessor method on his DataArray/Dataset, then the whole object - including the numpy buffers! - won’t be instantly collected when it’s dereferenced by the user, and it will have to instead wait for the next gc
pass. This could cause huge increases in RAM usage overnight in a user application, which would be very hard to logically link to a change that just added a custom method.
Finally, with https://github.com/pydata/xarray/pull/3250/, this statefulness forces us to increase the RAM usage of all datasets and dataarrays by an extra slot, for all users, even if this feature is quite niche.
Proposed solution
Get rid of accessor caching altogether, and just recreate the accessor object from scratch every time it is invoked. In the documentation, clarify that the __init__
method should not perform anything computationally intensive.
Issue Analytics
- State:
- Created 4 years ago
- Reactions:4
- Comments:15 (15 by maintainers)
Top GitHub Comments
Interesting, thanks!
As an accessor maintainer, I can ensure that at least one accessor implementation is storing state 😉. But this state is based on the xarray object itself: for example, we derive georeferencing information and store the names of the coordinate variabless we know are going to be useful to us later. That is, every new call to
__init__
based on a modified object will trigger a new parsing, and we don’t come into the situation you describe above.Getting rid of the caching logic would mean some performance loss to us, yes, but I don’t know if it’s “worse” than the circular reference issue you describe or not.
Alright, I think I get it, thanks for the clarification @crusaderky