Issue parsing '%z' in timestamps via pd.to_datetime
Code Sample, a copy-pastable example if possible
First, thanks a lot for this great library. It helps a lot in our day-to-day activities. I came across odd behaviour when processing timestamps from a set of my CSV files.
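import datetime
import pandas as pd

s = "2018-09-10 09:30:00.000894-04:00"
# Standard-library parse using the %z directive
t1 = datetime.datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f%z")
print("T1", t1)
# pandas parse using %z; with errors='ignore', the input is returned unchanged if parsing fails
t2 = pd.to_datetime(s, format="%Y-%m-%d %H:%M:%S.%f%z", errors='ignore')
print("T2", t2)
# pandas parse with %z left out of the format string
t3 = pd.to_datetime(s, format="%Y-%m-%d %H:%M:%S.%f")
print("T3", t3)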
Output:
T1 2018-09-10 09:30:00.000894-04:00
T2 2018-09-10 09:30:00.000894-04:00
T3 2018-09-10 13:30:00.000894
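Since a timezone-aware timestamp and the raw input string print identically here, it can help to inspect the type and UTC offset of a result rather than its printed form. The sketch below uses only the standard library and the sample string above; it assumes Python 3.7 or later, where %z accepts an offset written with a colon.

```python
import datetime

s = "2018-09-10 09:30:00.000894-04:00"
t1 = datetime.datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f%z")

# A successful %z parse yields an aware datetime object, not a str.
print(type(t1).__name__, t1.utcoffset())

# Converting to UTC shifts the wall-clock time by the parsed offset.
t1_utc = t1.astimezone(datetime.timezone.utc)
print(t1_utc)
```

The same isinstance and utcoffset() check can be applied to whatever pd.to_datetime returns, since a pandas Timestamp is a datetime subclass while an unparsed input stays a plain string.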
Problem description
As we can see, pd.to_datetime CAN properly parse the timezone directive '%z'. If I omit the "errors" parameter, I get an exception: ValueError: 'z' is a bad directive in format '%Y-%m-%d %H:%M:%S.%f%z'
I have a 6.6M-row CSV file with a column that holds information in the above format (up to microsecond precision, with time zone information like -04:00). This column is my index for the dataframe. First, I load it via a from_csv(filename) call [no extra parameters except filename]. When I use pd.to_datetime, I get the following performance results:
- df2['timestamp'] = pd.to_datetime(df2['datetime'], format="%Y-%m-%d %H:%M:%S.%f") takes 5 min to process
- df2['datetime'] = df2['datetime'].str.replace('-04:00', '') followed by df2['timestamp'] = pd.to_datetime(df2['datetime'], format="%Y-%m-%d %H:%M:%S.%f") takes 19 sec to complete
- df2['timestamp'] = pd.to_datetime(df2['datetime'], format="%Y-%m-%d %H:%M:%S.%f%z", errors='ignore') takes 3 sec to complete
All of the above calls produce the expected result in the dataframe.
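The second approach above (strip the offset, then parse) leaves naive timestamps, so the offset has to be re-attached if timezone-aware values are wanted. A minimal sketch of that workaround follows. The two-row frame is a hypothetical stand-in for the real data, and the code assumes every row carries the same fixed offset, -04:00, which may not hold for files that span a daylight-saving change.

```python
import datetime
import pandas as pd

# Hypothetical sample standing in for the 6.6M-row 'datetime' column.
df2 = pd.DataFrame({"datetime": [
    "2018-09-10 09:30:00.000894-04:00",
    "2018-09-10 09:30:01.250000-04:00",
]})

# Assumption: every row ends with the same fixed offset.
offset = "-04:00"
if not df2["datetime"].str.endswith(offset).all():
    raise ValueError("mixed offsets; this shortcut does not apply")

# Drop the trailing offset and parse with an explicit format.
naive = pd.to_datetime(
    df2["datetime"].str.slice(stop=-len(offset)),
    format="%Y-%m-%d %H:%M:%S.%f",
)

# Re-attach the offset as a fixed-offset timezone.
fixed_tz = datetime.timezone(datetime.timedelta(hours=-4))
df2["timestamp"] = naive.dt.tz_localize(fixed_tz)
print(df2["timestamp"].dt.tz_convert("UTC"))
```

Slicing by length instead of str.replace avoids touching any other occurrence of the offset text in the string.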
I see that there was recently a related issue about processing '%z': https://github.com/pandas-dev/pandas/issues/13486
Expected Output
No exceptions and no need to ignore errors
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8
LOCALE: en_CA.UTF-8
pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 40.5.0
Cython: 0.29
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.1.1
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.13
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Thanks again and best regards, Boris
Issue Analytics
- Created 5 years ago
- Comments:11 (4 by maintainers)
Top GitHub Comments
%z was introduced recently and will be a part of the v0.24.0 release. The performance hit you're experiencing is probably because we fall back on dateutil if the initial parsing attempts fail, which tends to be slow.
If you're still experiencing a performance hit when parsing with %z on master, feel free to reopen.

@mroeschke Sorry, I just noticed your reply about setting up a virtual environment. Is there any chance you could run the above tests under your setup and spare me from going through the installation of a virtual environment?
Thanks and best regards, Boris
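
For anyone wanting to try the %z parse discussed in this thread, here is a minimal sketch. It assumes pandas 0.24.0 or later is installed, and the two sample strings are hypothetical stand-ins for the real column. With a single uniform offset, the result should be a timezone-aware Series.

```python
import datetime
import pandas as pd

# Hypothetical sample values in the format described in the issue.
raw = pd.Series([
    "2018-09-10 09:30:00.000894-04:00",
    "2018-09-10 09:30:01.250000-04:00",
])

# No errors= argument: a failed parse raises instead of being hidden.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S.%f%z")

# Inspect the dtype and the attached timezone rather than the printed values.
print(parsed.dtype)
print(parsed.dt.tz)

# Normalise to UTC if downstream code expects a single timezone.
parsed_utc = parsed.dt.tz_convert("UTC")
print(parsed_utc)
```

Wrapping the to_datetime call in a timing harness such as time.perf_counter would be one way to compare it against the slower variants on real data.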