
Time dtype encoding defaulting to `int64` when writing netcdf or zarr


Time dtype encoding defaults to "int64" for datasets whose times all fall exactly at midnight when writing to netCDF or Zarr.

As a result, these datasets have their precision constrained by how the time units are defined (daily precision in the example below, since the units are defined as 'days since ...'). If we create a Zarr dataset with this default encoding and subsequently append some non-zero times onto it, we lose the hour/minute/second information from the appended values.
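The mechanism is easy to reproduce outside xarray: with an integer dtype and coarse units, any sub-unit offset is floored away on encoding. A minimal stdlib sketch of the arithmetic the "days since" case amounts to (not xarray's actual implementation):

```python
import datetime

# Reference date and units fixed by the store: "days since 2012-01-01", int64
epoch = datetime.datetime(2012, 1, 1)
t_appended = datetime.datetime(2012, 1, 1, 3, 0, 0)  # 03:00, sub-daily

# Integer encoding floors 0.125 days down to 0 whole days
encoded = int((t_appended - epoch) / datetime.timedelta(days=1))

# Decoding recovers midnight, not 03:00 -- the hours are gone
decoded = epoch + encoded * datetime.timedelta(days=1)
print(encoded, decoded)  # 0 2012-01-01 00:00:00
```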

MCVE Code Sample

(The example assumes `import datetime` and `import xarray as xr`.)

In [1]: ds = xr.DataArray(
    ...:     data=[0.5],
    ...:     coords={"time": [datetime.datetime(2012, 1, 1)]},
    ...:     dims=("time",),
    ...:     name="x",
    ...: ).to_dataset()

In [2]: ds                                                                                                                                                            
Out[2]: 
<xarray.Dataset>
Dimensions:  (time: 1)
Coordinates:
  * time     (time) datetime64[ns] 2012-01-01
Data variables:
    x        (time) float64 0.5

In [3]: ds.to_zarr("/tmp/x.zarr")

In [4]: ds1 = xr.open_zarr("/tmp/x.zarr")

In [5]: ds1.time.encoding                                                                                                                                             
Out[5]: 
{'chunks': (1,),
 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0),
 'filters': None,
 'units': 'days since 2012-01-01 00:00:00',
 'calendar': 'proleptic_gregorian',
 'dtype': dtype('int64')}

In [6]: dsnew = xr.DataArray(
    ...:     data=[1.5],
    ...:     coords={"time": [datetime.datetime(2012, 1, 1, 3, 0, 0)]},
    ...:     dims=("time",),
    ...:     name="x",
    ...: ).to_dataset()

In [7]: dsnew.to_zarr("/tmp/x.zarr", append_dim="time")                                                                                                               

In [8]: ds1 = xr.open_zarr("/tmp/x.zarr")                                                                                                                             

In [9]: ds1.time.values                                                                                                                                               
Out[9]: 
array(['2012-01-01T00:00:00.000000000', '2012-01-01T00:00:00.000000000'],
      dtype='datetime64[ns]')

Expected Output

In [9]: ds1.time.values                                                                                                                                               
Out[9]: 
array(['2012-01-01T00:00:00.000000000', '2012-01-01T03:00:00.000000000'],
      dtype='datetime64[ns]')

Problem Description

Perhaps it would be useful to default the time dtype to "float64". Another option would be to use a finer time resolution by default than the one xarray automatically infers from the dataset times (for instance, if the units would automatically be defined as "days since …", use "seconds since …" instead).


Versions

<details><summary>Output of `xr.show_versions()`</summary>

In [10]: xr.show_versions()                                                                                                                                            

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.5 (default, Nov 20 2019, 09:21:52) 
[GCC 9.2.1 20191008]
python-bits: 64
OS: Linux
OS-release: 5.3.0-45-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_NZ.UTF-8
LOCALE: en_NZ.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.3

xarray: 0.15.0
pandas: 1.0.1
numpy: 1.18.1
scipy: 1.4.1
netCDF4: 1.5.3
pydap: None
h5netcdf: 0.8.0
h5py: 2.10.0
Nio: None
zarr: 2.4.0
cftime: 1.1.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.1.3
cfgrib: None
iris: None
bottleneck: None
dask: 2.14.0
distributed: 2.12.0
matplotlib: 3.2.0
cartopy: 0.17.0
seaborn: None
numbagg: None
setuptools: 45.3.0
pip: 20.0.2
conda: None
pytest: 5.3.5
IPython: 7.13.0
sphinx: None

</details>

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 8 (6 by maintainers)

Top GitHub Comments

2 reactions
spencerkclark commented, Nov 11, 2021

This logic has been around in xarray for a long time (I think it dates back to https://github.com/pydata/xarray/pull/12!), so it predates me. If I had to guess though, it would have to do with the fact that back then, a form of cftime.date2num was used to encode all times, even those that started as np.datetime64 values. I think that’s significant for two reasons:

  1. In the old days, date2num would only return floating-point values, even if the times could in principle be encoded with integers. For accuracy, it was therefore best to keep the encoded values as small as possible to avoid roundoff error.
  2. Even if (1) was not the case back then, date2num did not – and still does not – support nanosecond units, because it relies on microsecond-precision datetimes.

This of course is not true anymore. We no longer use date2num to encode np.datetime64 values, and we no longer encode dates with floating point values by default (#4045); we use integers for optimal round-tripping accuracy, and are capable of encoding dates with nanosecond units.

To be honest, the only remaining advantage to choosing a larger time encoding unit and a proximate reference date currently seems to be that it makes the raw encoded values a little more human-readable. However, encoding dates with units of "nanoseconds since 1970-01-01" is objectively optimal for np.datetime64[ns] values, since it guarantees the maximum range of possible encoded times and maximum round-trip accuracy, so it could be worth revisiting our approach in light of the fact that it makes appending somewhat dangerous.
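For what it's worth, the "maximum range" part of this is easy to check: an int64 count of nanoseconds spans roughly ±292 years around the reference date, which is exactly the datetime64[ns] range (~1678 to ~2262 around 1970). A quick back-of-the-envelope check in plain Python:

```python
# int64 holds values up to 2**63 - 1; count nanoseconds per (average) year
NS_PER_YEAR = 365.25 * 24 * 60 * 60 * 1_000_000_000

span_years = (2**63 - 1) / NS_PER_YEAR
print(round(span_years, 1))  # ~292.3 years either side of the epoch
```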

2 reactions
dcherian commented, Nov 10, 2021

Please may I ask: Why not default to xarray encoding time as ‘units’: ‘nanoseconds since 1970-01-01’ to be consistent with np.datetime64[ns]?

xarray chooses the highest resolution that matches the data, which has the benefit of allowing the maximum possible time range given the data’s frequency: https://github.com/pydata/xarray/blob/5871637873cd83c3a656ee6f4df86ea6628cf68a/xarray/coding/times.py#L317-L319

I’m not sure if this is why it was originally chosen; but that is one advantage. Perhaps @spencerkclark has some insight here.
