Time dtype encoding defaulting to `int64` when writing netcdf or zarr
Time dtype encoding defaults to `"int64"` for datasets containing only zero-hour (midnight) times when writing to netcdf or zarr. This gives those datasets a precision constrained by how the time units are defined (daily precision in the example below, since the units are defined as `'days since ...'`). If we, for instance, create a zarr dataset with this default encoding and subsequently append some non-midnight times onto it, we lose the hour/minute/second information from the appended data.
#### MCVE Code Sample
In [1]: import datetime, xarray as xr

In [2]: ds = xr.DataArray(
   ...:     data=[0.5],
   ...:     coords={"time": [datetime.datetime(2012, 1, 1)]},
   ...:     dims=("time",),
   ...:     name="x",
   ...: ).to_dataset()
In [3]: ds
Out[3]:
<xarray.Dataset>
Dimensions:  (time: 1)
Coordinates:
  * time     (time) datetime64[ns] 2012-01-01
Data variables:
    x        (time) float64 0.5
In [4]: ds.to_zarr("/tmp/x.zarr")
In [5]: ds1 = xr.open_zarr("/tmp/x.zarr")
In [6]: ds1.time.encoding
Out[6]:
{'chunks': (1,),
 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0),
 'filters': None,
 'units': 'days since 2012-01-01 00:00:00',
 'calendar': 'proleptic_gregorian',
 'dtype': dtype('int64')}
In [7]: dsnew = xr.DataArray(
   ...:     data=[1.5],
   ...:     coords={"time": [datetime.datetime(2012, 1, 1, 3, 0, 0)]},
   ...:     dims=("time",),
   ...:     name="x",
   ...: ).to_dataset()
In [8]: dsnew.to_zarr("/tmp/x.zarr", append_dim="time")
In [9]: ds1 = xr.open_zarr("/tmp/x.zarr")
In [10]: ds1.time.values
Out[10]:
array(['2012-01-01T00:00:00.000000000', '2012-01-01T00:00:00.000000000'],
      dtype='datetime64[ns]')
#### Expected Output
In [10]: ds1.time.values
Out[10]:
array(['2012-01-01T00:00:00.000000000', '2012-01-01T03:00:00.000000000'],
      dtype='datetime64[ns]')
#### Problem Description
Perhaps it would be useful to default the time dtype to `"float64"`. Another option could be to use a finer time resolution by default than the one xarray automatically infers from the dataset's times (for instance, if the units would be inferred as "days since …", use "seconds since …" instead).
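One way to sidestep the data loss today is to set the time encoding explicitly when the store is first written; a minimal sketch, assuming the first write is under your control (the store path and the choice of "seconds since …" are illustrative):

```python
import datetime

import xarray as xr

ds = xr.DataArray(
    data=[0.5],
    coords={"time": [datetime.datetime(2012, 1, 1)]},
    dims=("time",),
    name="x",
).to_dataset()

# Force a finer unit than the "days since ..." xarray would infer from
# midnight-only times; appended sub-daily times then stay representable.
ds.to_zarr(
    "/tmp/x_workaround.zarr",
    encoding={"time": {"units": "seconds since 2012-01-01", "dtype": "int64"}},
)

dsnew = xr.DataArray(
    data=[1.5],
    coords={"time": [datetime.datetime(2012, 1, 1, 3, 0, 0)]},
    dims=("time",),
    name="x",
).to_dataset()
dsnew.to_zarr("/tmp/x_workaround.zarr", append_dim="time")

print(xr.open_zarr("/tmp/x_workaround.zarr").time.values)
# ['2012-01-01T00:00:00.000000000' '2012-01-01T03:00:00.000000000']
```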
#### Versions
<details><summary>Output of `xr.show_versions()`</summary>
In [11]: xr.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.5 (default, Nov 20 2019, 09:21:52)
[GCC 9.2.1 20191008]
python-bits: 64
OS: Linux
OS-release: 5.3.0-45-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_NZ.UTF-8
LOCALE: en_NZ.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.3
xarray: 0.15.0
pandas: 1.0.1
numpy: 1.18.1
scipy: 1.4.1
netCDF4: 1.5.3
pydap: None
h5netcdf: 0.8.0
h5py: 2.10.0
Nio: None
zarr: 2.4.0
cftime: 1.1.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.1.3
cfgrib: None
iris: None
bottleneck: None
dask: 2.14.0
distributed: 2.12.0
matplotlib: 3.2.0
cartopy: 0.17.0
seaborn: None
numbagg: None
setuptools: 45.3.0
pip: 20.0.2
conda: None
pytest: 5.3.5
IPython: 7.13.0
sphinx: None
</details>
---
This logic has been around in xarray for a long time (I think it dates back to https://github.com/pydata/xarray/pull/12!), so it predates me. If I had to guess though, it would have to do with the fact that back then, a form of `cftime.date2num` was used to encode all times, even those that started as `np.datetime64` values. I think that's significant for two reasons:

1. `date2num` would only return floating point values, even if the times could in principle be encoded with integers. For accuracy reasons, it was best to keep the encoded values as small as possible to avoid roundoff error (see the sketch after this list).
2. `date2num` did not (and still does not) support nanosecond units, because it relies on microsecond-precision datetimes.
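A minimal sketch of the roundoff concern in point 1 (plain NumPy with illustrative reference dates, not xarray or cftime code): with float64 encoding, a distant reference date makes the encoded numbers large enough that nanosecond detail is lost, while a proximate reference preserves it.

```python
import numpy as np

t = np.datetime64("2012-01-01T00:00:00.000000001", "ns")  # midnight plus 1 ns

# Compare a distant reference date against one proximate to the data.
for ref in (np.datetime64("1800-01-01", "ns"), np.datetime64("2012-01-01", "ns")):
    # Encode as float64 "seconds since ref", as date2num-style float encoding would.
    seconds = (t - ref).astype(np.int64) / 1e9
    # Decode back to datetime64[ns].
    decoded = ref + np.timedelta64(int(round(seconds * 1e9)), "ns")
    print(ref, "round-trips exactly:", decoded == t)

# 1800-01-01 -> False: at ~6.7e9 seconds, float64 resolves only ~1e-6 s,
#                      so the 1 ns offset is rounded away.
# 2012-01-01 -> True:  the offset is tiny, so it survives the float encoding.
```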
This of course is not true anymore. We no longer use `date2num` to encode `np.datetime64` values, and we no longer encode dates with floating point values by default (#4045); we use integers for optimal round-tripping accuracy, and we are capable of encoding dates with nanosecond units.

To be honest, currently it seems the only remaining advantage of choosing a larger time encoding unit and a proximate reference date is that it makes the raw encoded values a little more human-readable. However, encoding dates with units of `"nanoseconds since 1970-01-01"` is objectively optimal for `np.datetime64[ns]` values, as it guarantees the maximum range of possible encoded times and maximum round-trip accuracy, so it could be worth revisiting our approach in light of the fact that it makes appending somewhat dangerous.
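A quick illustration of the losslessness half of that claim (plain NumPy; `datetime64[ns]` values are themselves int64 nanosecond counts since 1970-01-01):

```python
import numpy as np

# Near the two ends of the representable datetime64[ns] range (~1678 to ~2262):
t = np.array(
    ["1677-09-21T00:12:43.145224193", "2262-04-11T23:47:16.854775807"],
    dtype="datetime64[ns]",
)
encoded = t.astype("int64")             # "nanoseconds since 1970-01-01"
decoded = encoded.astype("datetime64[ns]")
print((decoded == t).all())             # True: exact over the full range
```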
---

It’s choosing the highest resolution that matches the data, which has the benefit of allowing the maximum possible time range given the data’s frequency: https://github.com/pydata/xarray/blob/5871637873cd83c3a656ee6f4df86ea6628cf68a/xarray/coding/times.py#L317-L319

I’m not sure if this is why it was originally chosen, but that is one advantage. Perhaps @spencerkclark has some insight here.
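For intuition, here is a rough sketch of that resolution-matching idea (a simplified illustration, not the code at the linked lines; `infer_time_units` is a hypothetical name): try candidate units from coarsest to finest and return the first one that represents every time exactly.

```python
import numpy as np

def infer_time_units(times: np.ndarray) -> str:
    """Return CF-style units using the coarsest unit that encodes `times` exactly."""
    reference = times.min()
    # Offsets from the reference date, as integer nanoseconds.
    deltas = (times - reference).astype(np.int64)
    candidates = [            # coarsest first; values are nanoseconds per unit
        ("days", 86_400_000_000_000),
        ("hours", 3_600_000_000_000),
        ("minutes", 60_000_000_000),
        ("seconds", 1_000_000_000),
    ]
    for unit, ns_per_unit in candidates:
        if (deltas % ns_per_unit == 0).all():
            return f"{unit} since {reference}"
    return f"nanoseconds since {reference}"

midnights = np.array(["2012-01-01", "2012-01-02"], dtype="datetime64[ns]")
print(infer_time_units(midnights))  # days since 2012-01-01T00:00:00.000000000

mixed = np.array(["2012-01-01", "2012-01-01T03:00"], dtype="datetime64[ns]")
print(infer_time_units(mixed))      # hours since 2012-01-01T00:00:00.000000000
```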