Problem with encoding dates in serialization.py
I think the accepted standard in OrientDB is to write and read dates as POSIX/Unix time, which means they should be encoded and decoded as if they occurred at midnight UTC. If you use the OrientDB console:
CREATE CLASS Event EXTENDS V
CREATE VERTEX Event SET name='Example', date=Date('1970-01-01 00:00:00')
SELECT @RID, name, date.asLong() FROM Event
As expected, you get 0L for the stored long value.
In serializations.py, dates are encoded and decoded using local time. This means that if the client reading a date field with pyorient is in a time zone further west than the client that wrote the value with pyorient, it gets a different date. For example, if you encode 1970-01-01 with the current serializations.py in New York, it actually writes 18000000 to the DB. If you then decode that as a date with pyorient in California, you get 1969-12-31. Note that in the process pyorient discards the time portion of the timestamp, so the dates are truly different.
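The New York to California example can be reproduced on a Unix system by switching the process time zone between encoding and decoding. This is a minimal sketch, not pyorient code: encode_local and decode_local are hypothetical helpers that mirror the two expressions currently in serializations.py, and the fixed-offset POSIX zone strings EST5 and PST8 are stand-ins for New York and California in winter.

```python
import os
import time
from datetime import date


def set_tz(tz):
    """Switch the process time zone (Unix only, relies on time.tzset)."""
    os.environ["TZ"] = tz
    time.tzset()


def encode_local(value):
    # Mirrors the current encoder: midnight is interpreted in local time.
    return int(time.mktime(value.timetuple())) * 1000


def decode_local(millis):
    # Mirrors the current decoder: the timestamp is rendered in local time.
    return date.fromtimestamp(millis / 1000)


original_tz = os.environ.get("TZ")

set_tz("EST5")                      # writer at UTC-5
stored = encode_local(date(1970, 1, 1))

set_tz("PST8")                      # reader at UTC-8
read_back = decode_local(stored)

print(stored, read_back)            # 18000000 1969-12-31

# Restore the time zone the process started with.
if original_tz is None:
    os.environ.pop("TZ", None)
else:
    os.environ["TZ"] = original_tz
time.tzset()
```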
I can see why this may evade tests: if the test process writing the data and the test process reading it are in the same time zone, they will return the same value (even though the value stored in the DB is still technically wrong). Any tests on dates should be updated to check the long value stored in the DB, i.e. writing 1970-01-01 should store 0L.
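A test that does not depend on the machine's time zone could look like the following sketch. The encode_date helper is hypothetical and stands in for whatever pyorient call produces the serialized value. The test repeats the same assertion under several process time zones (Unix only, via time.tzset), so a local-time encoder cannot pass by accident.

```python
import calendar
import os
import time
import unittest
from datetime import date


def encode_date(value):
    """Hypothetical stand-in for the encoder under test (UTC-based variant)."""
    return str(int(calendar.timegm(value.timetuple())) * 1000) + "a"


class DateEncodingTest(unittest.TestCase):
    # Fixed-offset POSIX zone strings, chosen so no tz database is needed.
    ZONES = ["UTC0", "EST5", "PST8", "CET-1"]

    def setUp(self):
        self._saved_tz = os.environ.get("TZ")

    def tearDown(self):
        # Put the process time zone back the way it was.
        if self._saved_tz is None:
            os.environ.pop("TZ", None)
        else:
            os.environ["TZ"] = self._saved_tz
        time.tzset()

    def test_epoch_date_is_stored_as_zero(self):
        for zone in self.ZONES:
            os.environ["TZ"] = zone
            time.tzset()
            with self.subTest(zone=zone):
                # Writing 1970-01-01 should always serialize as 0.
                self.assertEqual(encode_date(date(1970, 1, 1)), "0a")
```

Saved to a file, this can be run with python -m unittest. Swapping calendar.timegm for time.mktime in the helper should make the EST5, PST8 and CET-1 cases fail.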
Both the encoding and the decoding need to be fixed:
Decoding (lines 322-324 of serializations.py) should change from:
    if c == 'a':
        collected = date.fromtimestamp(float(collected) / 1000)
        content = content[1:]
to:
    if c == 'a':
        collected = datetime.utcfromtimestamp(float(collected) / 1000).date()
        content = content[1:]
and encoding (lines 131-132 of serializations.py) should change from:
    elif isinstance(value, date):
        ret = str(int(time.mktime(value.timetuple())) * 1000) + 'a'
to:
    elif isinstance(value, date):
        ret = str(int(calendar.timegm(value.timetuple())) * 1000) + 'a'
For the latter change, you’ll also need to import calendar. calendar.timegm() assumes the date is UTC; time.mktime() assumes it’s local time. For 1970-01-01 encoded in New York, the resulting values are 0 (correct) and 18000000 (incorrect for OrientDB) respectively.
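Taken together, the two proposed changes behave as in the sketch below. encode_date and decode_date are hypothetical standalone helpers, not pyorient functions; they wrap the same expressions so the round trip can be checked in isolation. The decoder here uses datetime.fromtimestamp with tz=timezone.utc, which yields the same date as datetime.utcfromtimestamp but avoids the deprecation warning that Python 3.12 and later emit for the latter.

```python
import calendar
from datetime import date, datetime, timezone


def encode_date(value):
    # Midnight of the given date is interpreted as UTC, not local time.
    return str(int(calendar.timegm(value.timetuple())) * 1000) + "a"


def decode_date(collected):
    # The millisecond timestamp is rendered in UTC before the time is dropped.
    return datetime.fromtimestamp(float(collected) / 1000, tz=timezone.utc).date()


encoded = encode_date(date(1970, 1, 1))
print(encoded)                      # 0a
print(decode_date(encoded[:-1]))    # 1970-01-01
```

Because neither helper consults the local time zone, the result is the same regardless of where the writer and the reader run.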
Issue Analytics
- Created 8 years ago
- Comments:13 (4 by maintainers)
Top GitHub Comments
“i think that the responsibility of writing and decoding the dates in the right Timezone and format, must be delegated to the applications.” I 100% agree, but pyorient doesn’t let you do that. It forces both the reading and the writing client to encode and decode using the local time zone (by using time.mktime and date.fromtimestamp). Currently that means it is impossible to use pyorient to store dates if you might have clients reading them from further west than you: all dates will be wrong. Or do you have a suggestion on how to use pyorient to do this? The only way I can think of is very hacky: give every date a separate time zone property recording where it was written, which also has to be read, and if you are reading it further west, increment the date returned by pyorient by one.
I just went through some code on the OrientDB side to see how they handle dates and realized that the serialization/deserialization behavior for Binary and CSV is different on the server side. The server behavior for date objects when using binary serialization essentially resolves this issue (lines 426, 717 and 998-1012 from https://github.com/orientechnologies/orientdb/blob/2.2.x/core/src/main/java/com/orientechnologies/orient/core/serialization/serializer/record/binary/ORecordSerializerBinaryV0.java#L426).
For CSV serialization/deserialization, the OrientDB server does not perform the conversion from/to <date> 00:00:00 local time to/from <date> 00:00:00 UTC (lines 638-645 and 474 from https://github.com/orientechnologies/orientdb/blob/7850712aafb3cb7c61a5c2865710019df0a7e8c9/core/src/main/java/com/orientechnologies/orient/core/serialization/serializer/record/string/ORecordSerializerStringAbstract.java#L638). The latter is what’s causing the issue reported by OP and reflected in the results I posted. Moreover, the server sets the time to 00:00:00 during deserialization after converting the UNIX time sent over to the database time zone. This behavior is honestly baffling to me and very problematic. This essentially can result in +/- 1 or 2 days difference in encoded/decoded values when the server is not in the same time zone as the client. If clients (applications) do know the database timezone, the best way to avoid problems is to send over the date as UNIX time corresponding to <date> 00:00:00 <database time zone> and decode it using the same scheme when using CSV serialization.
Given the above, to mitigate this issue, I agree with the proposal to have the option of specifying a non-default encoding and decoding time zone for Dates when using CSV serialization. For DateTime, I cannot foresee a situation where it would be useful, but having the option can’t hurt!
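The option discussed above could look roughly like the following sketch. Nothing here is pyorient API: encode_date, decode_date and the tz parameter are hypothetical names, and the fixed UTC+01:00 offset is only a placeholder for a database time zone that the application is assumed to know. A zoneinfo.ZoneInfo instance could be passed in the same way.

```python
from datetime import date, datetime, time, timedelta, timezone


def encode_date(value, tz=timezone.utc):
    """Encode a date as the epoch milliseconds of <date> 00:00:00 in tz."""
    midnight = datetime.combine(value, time.min, tzinfo=tz)
    return str(int(midnight.timestamp()) * 1000) + "a"


def decode_date(collected, tz=timezone.utc):
    """Decode epoch milliseconds back to the calendar date as seen in tz."""
    return datetime.fromtimestamp(float(collected) / 1000, tz=tz).date()


db_tz = timezone(timedelta(hours=1))    # placeholder database time zone
encoded = encode_date(date(2017, 3, 15), db_tz)
decoded = decode_date(encoded[:-1], db_tz)
print(decoded)                          # 2017-03-15
```

Defaulting tz to UTC keeps the behavior proposed in the original report, while clients that know the database time zone can pass it explicitly on both the write and the read path.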
Finally, on a related but separate note, it seems that OrientDB stores dates as UNIX time corresponding to <date> 00:00:00 <DB timezone>. I am not sure if they process all dates in the database when the database timezone is changed with ALTER database <timezone>, but I would venture that they do not. This means that the date values stored in the DB (or at least their interpretation) will change if the database timezone is changed after a record was created. As such, I would suggest caution when using the ALTER database <timezone> command if you have records with date fields already in OrientDB.
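The arithmetic behind this concern can be illustrated with a small sketch. It assumes, as speculated above, that stored values are not rewritten when the database time zone changes; it does not reproduce OrientDB's actual behavior, and the two fixed offsets are arbitrary placeholders.

```python
from datetime import date, datetime, time, timedelta, timezone

old_tz = timezone(timedelta(hours=2))   # placeholder: zone when the record was written
new_tz = timezone.utc                   # placeholder: zone after the change

# Value stored for 2020-06-10, i.e. midnight of that date in the old zone.
midnight = datetime.combine(date(2020, 6, 10), time.min, tzinfo=old_tz)
stored_millis = int(midnight.timestamp()) * 1000

# The same stored number read under the old and the new zone.
date_before = datetime.fromtimestamp(stored_millis / 1000, tz=old_tz).date()
date_after = datetime.fromtimestamp(stored_millis / 1000, tz=new_tz).date()
print(date_before, date_after)          # 2020-06-10 2020-06-09
```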