Time dtype encoding defaulting to `int64` when writing netcdf or zarr
Time dtype encoding defaults to `"int64"` for datasets containing only zero-hour (midnight) times when writing to netcdf or zarr. This gives those datasets a precision constrained by how the time units are defined (daily precision in the example below, since the units are defined as `'days since ...'`). If we, for instance, create a zarr dataset with this default encoding and subsequently append some non-midnight times onto it, we lose the hour/minute/second information from the appended data.
#### MCVE Code Sample
In [1]: import datetime, xarray as xr

In [2]: ds = xr.DataArray(
   ...:     data=[0.5],
   ...:     coords={"time": [datetime.datetime(2012, 1, 1)]},
   ...:     dims=("time",),
   ...:     name="x",
   ...: ).to_dataset()
In [3]: ds
Out[3]:
<xarray.Dataset>
Dimensions:  (time: 1)
Coordinates:
  * time     (time) datetime64[ns] 2012-01-01
Data variables:
    x        (time) float64 0.5
In [4]: ds.to_zarr("/tmp/x.zarr")
In [5]: ds1 = xr.open_zarr("/tmp/x.zarr")
In [6]: ds1.time.encoding
Out[6]:
{'chunks': (1,),
 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0),
 'filters': None,
 'units': 'days since 2012-01-01 00:00:00',
 'calendar': 'proleptic_gregorian',
 'dtype': dtype('int64')}
In [7]: dsnew = xr.DataArray(
   ...:     data=[1.5],
   ...:     coords={"time": [datetime.datetime(2012, 1, 1, 3, 0, 0)]},
   ...:     dims=("time",),
   ...:     name="x",
   ...: ).to_dataset()
In [8]: dsnew.to_zarr("/tmp/x.zarr", append_dim="time")
In [9]: ds1 = xr.open_zarr("/tmp/x.zarr")
In [10]: ds1.time.values
Out[10]:
array(['2012-01-01T00:00:00.000000000', '2012-01-01T00:00:00.000000000'],
      dtype='datetime64[ns]')
#### Expected Output
In [10]: ds1.time.values
Out[10]:
array(['2012-01-01T00:00:00.000000000', '2012-01-01T03:00:00.000000000'],
      dtype='datetime64[ns]')
#### Problem Description
Perhaps it would be useful to default the time dtype to `"float64"`. Another option could be to use a finer time resolution by default than the one xarray automatically infers from the dataset's times (for instance, if the units would be inferred as "days since …", use "seconds since …" instead).
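One way to sidestep the data loss today is to set the time encoding explicitly when the store is first written; a minimal sketch, assuming the first write is under your control (the store path and the choice of "seconds since …" are illustrative):

```python
import datetime

import xarray as xr

ds = xr.DataArray(
    data=[0.5],
    coords={"time": [datetime.datetime(2012, 1, 1)]},
    dims=("time",),
    name="x",
).to_dataset()

# Force a finer unit than the "days since ..." xarray would infer from
# midnight-only times; appended sub-daily times then stay representable.
ds.to_zarr(
    "/tmp/x_workaround.zarr",
    encoding={"time": {"units": "seconds since 2012-01-01", "dtype": "int64"}},
)

dsnew = xr.DataArray(
    data=[1.5],
    coords={"time": [datetime.datetime(2012, 1, 1, 3, 0, 0)]},
    dims=("time",),
    name="x",
).to_dataset()
dsnew.to_zarr("/tmp/x_workaround.zarr", append_dim="time")

print(xr.open_zarr("/tmp/x_workaround.zarr").time.values)
# ['2012-01-01T00:00:00.000000000' '2012-01-01T03:00:00.000000000']
```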
#### Versions
<details><summary>Output of `xr.show_versions()`</summary>
In [11]: xr.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.5 (default, Nov 20 2019, 09:21:52)
[GCC 9.2.1 20191008]
python-bits: 64
OS: Linux
OS-release: 5.3.0-45-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_NZ.UTF-8
LOCALE: en_NZ.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.3
xarray: 0.15.0
pandas: 1.0.1
numpy: 1.18.1
scipy: 1.4.1
netCDF4: 1.5.3
pydap: None
h5netcdf: 0.8.0
h5py: 2.10.0
Nio: None
zarr: 2.4.0
cftime: 1.1.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.1.3
cfgrib: None
iris: None
bottleneck: None
dask: 2.14.0
distributed: 2.12.0
matplotlib: 3.2.0
cartopy: 0.17.0
seaborn: None
numbagg: None
setuptools: 45.3.0
pip: 20.0.2
conda: None
pytest: 5.3.5
IPython: 7.13.0
sphinx: None
</details>
---
This logic has been around in xarray for a long time (I think it dates back to https://github.com/pydata/xarray/pull/12!), so it predates me. If I had to guess though, it would have to do with the fact that back then, a form of `cftime.date2num` was used to encode all times, even those that started as `np.datetime64` values. I think that's significant for two reasons:

1. `date2num` would only return floating point values, even if the times could in principle be encoded with integers. For accuracy reasons, it was best to keep the encoded values as small as possible to avoid roundoff error (see the sketch after this list).
2. `date2num` did not (and still does not) support nanosecond units, because it relies on microsecond-precision datetimes.
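A minimal sketch of the roundoff concern in point 1 (plain NumPy with illustrative reference dates, not xarray or cftime code): with float64 encoding, a distant reference date makes the encoded numbers large enough that nanosecond detail is lost, while a proximate reference preserves it.

```python
import numpy as np

t = np.datetime64("2012-01-01T00:00:00.000000001", "ns")  # midnight plus 1 ns

# Compare a distant reference date against one proximate to the data.
for ref in (np.datetime64("1800-01-01", "ns"), np.datetime64("2012-01-01", "ns")):
    # Encode as float64 "seconds since ref", as date2num-style float encoding would.
    seconds = (t - ref).astype(np.int64) / 1e9
    # Decode back to datetime64[ns].
    decoded = ref + np.timedelta64(int(round(seconds * 1e9)), "ns")
    print(ref, "round-trips exactly:", decoded == t)

# 1800-01-01 -> False: at ~6.7e9 seconds, float64 resolves only ~1e-6 s,
#                      so the 1 ns offset is rounded away.
# 2012-01-01 -> True:  the offset is tiny, so it survives the float encoding.
```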
This of course is not true anymore. We no longer use `date2num` to encode `np.datetime64` values, and we no longer encode dates with floating point values by default (#4045); we use integers for optimal round-tripping accuracy, and we are capable of encoding dates with nanosecond units.

To be honest, currently it seems the only remaining advantage of choosing a larger time encoding unit and a proximate reference date is that it makes the raw encoded values a little more human-readable. However, encoding dates with units of `"nanoseconds since 1970-01-01"` is objectively optimal for `np.datetime64[ns]` values, as it guarantees the maximum range of possible encoded times and maximum round-trip accuracy, so it could be worth revisiting our approach in light of the fact that it makes appending somewhat dangerous.
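A quick illustration of the losslessness half of that claim (plain NumPy; `datetime64[ns]` values are themselves int64 nanosecond counts since 1970-01-01):

```python
import numpy as np

# Near the two ends of the representable datetime64[ns] range (~1678 to ~2262):
t = np.array(
    ["1677-09-21T00:12:43.145224193", "2262-04-11T23:47:16.854775807"],
    dtype="datetime64[ns]",
)
encoded = t.astype("int64")             # "nanoseconds since 1970-01-01"
decoded = encoded.astype("datetime64[ns]")
print((decoded == t).all())             # True: exact over the full range
```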
---

It’s choosing the highest resolution that matches the data, which has the benefit of allowing the maximum possible time range given the data’s frequency: https://github.com/pydata/xarray/blob/5871637873cd83c3a656ee6f4df86ea6628cf68a/xarray/coding/times.py#L317-L319

I’m not sure if this is why it was originally chosen, but that is one advantage. Perhaps @spencerkclark has some insight here.
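For intuition, here is a rough sketch of that resolution-matching idea (a simplified illustration, not the code at the linked lines; `infer_time_units` is a hypothetical name): try candidate units from coarsest to finest and return the first one that represents every time exactly.

```python
import numpy as np

def infer_time_units(times: np.ndarray) -> str:
    """Return CF-style units using the coarsest unit that encodes `times` exactly."""
    reference = times.min()
    # Offsets from the reference date, as integer nanoseconds.
    deltas = (times - reference).astype(np.int64)
    candidates = [            # coarsest first; values are nanoseconds per unit
        ("days", 86_400_000_000_000),
        ("hours", 3_600_000_000_000),
        ("minutes", 60_000_000_000),
        ("seconds", 1_000_000_000),
    ]
    for unit, ns_per_unit in candidates:
        if (deltas % ns_per_unit == 0).all():
            return f"{unit} since {reference}"
    return f"nanoseconds since {reference}"

midnights = np.array(["2012-01-01", "2012-01-02"], dtype="datetime64[ns]")
print(infer_time_units(midnights))  # days since 2012-01-01T00:00:00.000000000

mixed = np.array(["2012-01-01", "2012-01-01T03:00"], dtype="datetime64[ns]")
print(infer_time_units(mixed))      # hours since 2012-01-01T00:00:00.000000000
```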