tricky timestamp conversion
Hello there, it's me, the bug hunter again 😃
I have this massive 200-million-row dataset, and I ran into some very annoying behavior. I wonder if this is a bug.
I load my CSV using
mylog = pd.read_csv('/mydata.csv',
names = ['mydatetime', 'var2', 'var3', 'var4'],
dtype = {'mydatetime' : str},
skiprows = 1)
and the datetime column really looks like regular timestamps (tz-aware):
mylog.mydatetime.head()
Out[22]:
0 2019-03-03T20:58:38.000-0500
1 2019-03-03T20:58:38.000-0500
2 2019-03-03T20:58:38.000-0500
3 2019-03-03T20:58:38.000-0500
4 2019-03-03T20:58:38.000-0500
Name: mydatetime, dtype: object
Now, I take extra care in converting these strings into proper timestamps:
mylog['mydatetime'] = pd.to_datetime(mylog['mydatetime'] ,errors = 'coerce', format = '%Y-%m-%dT%H:%M:%S.%f%z', infer_datetime_format = True, cache = True)
That takes a looong time to process, but seems OK. The output is
mylog.mydatetime.head()
Out[23]:
0 2019-03-03 20:58:38-05:00
1 2019-03-03 20:58:38-05:00
2 2019-03-03 20:58:38-05:00
3 2019-03-03 20:58:38-05:00
4 2019-03-03 20:58:38-05:00
Name: mydatetime, dtype: object
What is puzzling is that, so far, I thought I had full control of my dtypes. However, running this simple line raises an error:
mylog['myday'] = pd.to_datetime(mylog['mydatetime'].dt.date, errors = 'coerce')
File "pandas/_libs/tslib.pyx", line 537, in pandas._libs.tslib.array_to_datetime
ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True
The only way I was able to get past this error was by running
mylog['myday'] = pd.to_datetime(mylog['mydatetime'].apply(lambda x: x.date()))
Is this a bug? Before upgrading to 0.24.1 I was not getting the tz error above. What do you think? I can't share the data, but I am happy to try some things to help you out!
Thanks!
Issue Analytics
- State:
- Created 5 years ago
- Reactions:1
- Comments:20 (10 by maintainers)
When I addressed this timezone parsing, my rationale was that if %z or %Z were passed, the user would want to preserve these timezones, so this error was intentional. For your use case, if you leave out the format argument and keep utc=True, you should get your dates in UTC with datetime64[ns, UTC] dtype.
This means that the object dtype is expected. Since your string data contained more than one timezone offset, it's not possible to cast this data to one datetime64[ns, tz] dtype, since there are multiple tzs in your data.
See https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.24.0.html#parsing-datetime-strings-with-timezone-offsets
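As a rough sketch of the suggestion above (not the reporter's actual data): the two sample strings below are made up to mimic the format in the issue, and the America/New_York zone is only an assumption about where the log was recorded. It checks how many distinct offsets the strings carry, parses them with utc=True and no format argument, and then derives the local calendar day without a row-wise apply.

```python
import pandas as pd

# Hypothetical sample mimicking the issue's strings, with two different offsets.
raw = pd.Series([
    "2019-03-03T20:58:38.000-0500",
    "2019-03-10T03:00:00.000-0400",
])

# Distinct trailing offsets; more than one is what leads to the object dtype
# when the offsets are preserved instead of being normalized.
offsets = raw.str[-5:].unique()

# utc=True converts every offset to UTC, so the result gets a single
# tz-aware datetime64 dtype instead of object.
parsed = pd.to_datetime(raw, utc=True, errors="coerce")

# Assumption: the log's local zone is America/New_York.
local = parsed.dt.tz_convert("America/New_York")

# Calendar day in local time, as naive midnight timestamps.
myday = local.dt.tz_localize(None).dt.normalize()

print(offsets)
print(parsed.dtype)
print(myday)
```

If the UTC day is good enough, the tz_convert step can be dropped and parsed.dt.normalize() used directly.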