BUG or DOC: pd.read_csv with parse_dates does not recognize timezone
When parsing a timezone-aware datetime in a CSV file with pd.read_csv + parse_dates, it returns naive timestamps converted to UTC, which was a surprise to me.
Example
Suppose we are reading the following data from a file named pandas_read_csv_bug.csv.
It is a simple time series with a timezone (UTC+09:00) specified.
dt,val
2018-01-04 09:01:00+09:00,23350
2018-01-04 09:02:00+09:00,23400
2018-01-04 09:03:00+09:00,23400
2018-01-04 09:04:00+09:00,23400
2018-01-04 09:05:00+09:00,23400
I want to read it with pd.read_csv using the parse_dates keyword argument.
If it worked properly, this would seem to be the most elegant solution.
import pandas as pd
df = pd.read_csv('pandas_read_csv_bug.csv', parse_dates=['dt'])
However, the result is a data frame df with strange timestamps.
| | dt | val |
| - | - | - |
| 0 | 2018-01-04 00:01:00 | 23350 |
| 1 | 2018-01-04 00:02:00 | 23400 |
| 2 | 2018-01-04 00:03:00 | 23400 |
| 3 | 2018-01-04 00:04:00 | 23400 |
| 4 | 2018-01-04 00:05:00 | 23400 |
Problem description
What surprised me:
- The parsed datetimes are timezone-naive: df['dt'].iloc[0].tz is None evaluates to True.
- The timestamps are automatically converted to UTC.
My first impression was that this is not the best possible behavior. However, since a UTC offset does not uniquely correspond to a single timezone, this could be the safest and most reasonable behavior. In that case, the documentation should mention it.
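As a possible workaround for the behavior described above, the column can be read as plain text and converted explicitly. The following is a minimal sketch, not the fix discussed in this issue: it assumes the data is inlined through io.StringIO instead of read from pandas_read_csv_bug.csv, and the names CSV_TEXT, jst and dt_local are made up for illustration. A fixed +09:00 offset is used rather than a named timezone, since the offset alone does not identify one.

```python
import io
from datetime import timedelta, timezone

import pandas as pd

# Inline copy of the first rows of the sample file (assumption: same layout).
CSV_TEXT = """dt,val
2018-01-04 09:01:00+09:00,23350
2018-01-04 09:02:00+09:00,23400
"""

# Read without parse_dates so the 'dt' column stays as plain strings.
df = pd.read_csv(io.StringIO(CSV_TEXT))

# utc=True asks for a timezone-aware result normalized to UTC.
df["dt"] = pd.to_datetime(df["dt"], utc=True)

# Convert back to a fixed +09:00 offset (hypothetical name: jst).
jst = timezone(timedelta(hours=9))
df["dt_local"] = df["dt"].dt.tz_convert(jst)

print(df["dt"].iloc[0].tz)        # timezone attached to the UTC column
print(df["dt_local"].iloc[0])     # same instant, shown with the +09:00 offset
```

With this approach the timestamps keep their timezone information, and the conversion to UTC is explicit rather than implicit.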
Output of pd.show_versions()
pandas: 0.23.4
pytest: 3.3.1
pip: 9.0.3
setuptools: 38.5.1
Cython: None
numpy: 1.15.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Issue Analytics
- State:
- Created 5 years ago
- Reactions:1
- Comments:7 (7 by maintainers)
Top GitHub Comments
@mroeschke Thanks for pointing box out. Well, this breaks quite a lot of unit tests. My first attempt was to keep box=False, and update pandas/core/tools/datetime.py:_convert_listlike_datetimes. However, I realized that we can't fix this issue with box=False, because what is returned is a NumPy array of datetime64, and it cannot contain the timezone information. https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.datetime.html#changes-with-numpy-1-11 So, I will try to fix the errors caused by setting box=True.
@gfyoung There is a Pandas sprint at PYCON KR in Seoul, the Republic of Korea on Aug 15th, and I am participating. (It is organized by @scari) I will continue to work on this issue at the sprint.

@swyoon I look forward to seeing you! 😉
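The remark in the first comment about NumPy datetime64 arrays can be illustrated with a short sketch. This is only an illustration and does not use the box argument or any pandas internals; it assumes a pandas version that provides DatetimeIndex.to_numpy, and the names jst, idx and raw are hypothetical.

```python
from datetime import timedelta, timezone

import numpy as np
import pandas as pd

jst = timezone(timedelta(hours=9))

# A timezone-aware pandas index: the +09:00 offset is kept in the pandas dtype.
idx = pd.to_datetime(["2018-01-04 09:01:00+09:00"], utc=True).tz_convert(jst)
print(idx.dtype)

# Converting to a plain NumPy datetime64 array drops the timezone:
# only the UTC-normalized instant is left.
raw = idx.to_numpy(dtype="datetime64[ns]")
print(raw.dtype, raw[0])
```

The pandas index remembers the offset, while the NumPy array holds only the UTC instant, which is why returning a raw datetime64 array cannot preserve the timezone.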