Issue parsing '%z' in timestamps via pd.to_datetime

Code Sample, a copy-pastable example if possible

First, thanks a lot for this great library. It helps a lot in our day-to-day activities. I came across odd behaviour when processing timestamps from a set of my CSV files.

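import datetime

import pandas as pd

s = "2018-09-10 09:30:00.000894-04:00"
# Standard library parser, with the %z directive
t1 = datetime.datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f%z")
print("T1", t1)
# pandas with %z, errors suppressed
t2 = pd.to_datetime(s, format="%Y-%m-%d %H:%M:%S.%f%z", errors='ignore')
print("T2", t2)
# pandas without %z in the format
t3 = pd.to_datetime(s, format="%Y-%m-%d %H:%M:%S.%f")
print("T3", t3)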

Output:

T1 2018-09-10 09:30:00.000894-04:00
T2 2018-09-10 09:30:00.000894-04:00
T3 2018-09-10 13:30:00.000894

Problem description

As we can see, pd.to_datetime CAN properly parse the timezone directive '%z'. If I omit the "errors" parameter, I get an exception: ValueError: 'z' is a bad directive in format '%Y-%m-%d %H:%M:%S.%f%z'
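The T3 value above (13:30 with no offset) looks like the UTC equivalent of 09:30 at -04:00. That is a reading of the printed output, not something the thread confirms. A minimal standard-library sketch of the arithmetic, assuming Python 3.7 or later (where %z accepts an offset written with a colon):

```python
import datetime

s = "2018-09-10 09:30:00.000894-04:00"

# Python 3.7+ accepts a colon inside the %z offset when parsing.
aware = datetime.datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f%z")

# Converting to UTC adds the four hours back: 09:30 at -04:00 is 13:30 UTC.
as_utc = aware.astimezone(datetime.timezone.utc)

# Dropping tzinfo gives a naive value that matches the T3 line.
naive_utc = as_utc.replace(tzinfo=None)
print(naive_utc)  # 2018-09-10 13:30:00.000894
```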

I have a CSV file with 6.6M rows and a column that holds timestamps in the above format (up to microsecond precision, with time zone information like -04:00). This column is the index of my dataframe. First, I load the file via a from_csv(filename) call [no extra parameters except filename]. When I use pd.to_datetime, I get the following performance results:

df2['timestamp'] = pd.to_datetime(df2['datetime'], format="%Y-%m-%d %H:%M:%S.%f") takes 5 min to process
df2['datetime'] = df2['datetime'].str.replace('-04:00', ''); df2['timestamp'] = pd.to_datetime(df2['datetime'], format="%Y-%m-%d %H:%M:%S.%f") takes 19 sec to complete
df2['timestamp'] = pd.to_datetime(df2['datetime'], format="%Y-%m-%d %H:%M:%S.%f%z", errors='ignore') takes 3 sec to complete

All of the above calls produce the expected result in the dataframe.
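The second approach (strip the offset, then parse) leaves naive timestamps, so the offset has to be re-attached if timezone-aware values are needed. A minimal sketch, assuming every row carries the same fixed -04:00 offset and a reasonably recent pandas; the sample values and the names `raw`, `OFFSET_TEXT` and `OFFSET_TZ` are hypothetical:

```python
import datetime

import pandas as pd

# Hypothetical sample data in the format described in the issue.
raw = pd.Series([
    "2018-09-10 09:30:00.000894-04:00",
    "2018-09-10 09:30:01.250000-04:00",
])

# Assumption: every row carries the same fixed offset, -04:00.
OFFSET_TEXT = "-04:00"
OFFSET_TZ = datetime.timezone(datetime.timedelta(hours=-4))

# Strip the offset as a literal string (regex=False), then parse naive local times.
naive = pd.to_datetime(
    raw.str.replace(OFFSET_TEXT, "", regex=False),
    format="%Y-%m-%d %H:%M:%S.%f",
)

# Re-attach the known offset so the values are timezone-aware again.
aware = naive.dt.tz_localize(OFFSET_TZ)

# Optionally normalise to UTC.
utc = aware.dt.tz_convert("UTC")
```

This only holds while the offset is constant. A file that spans a daylight-saving change would mix -04:00 and -05:00, and the literal replace would silently leave the second form in place.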

I see that there was recently a related issue about processing '%z': https://github.com/pandas-dev/pandas/issues/13486

Expected Output

No exceptions and no need to ignore errors

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8
LOCALE: en_CA.UTF-8

pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 40.5.0
Cython: 0.29
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.1.1
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.13
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Thanks again and best regards, Boris

Issue Analytics

Created 5 years ago
Comments: 11 (4 by maintainers)

Top GitHub Comments

mroeschke commented, Nov 9, 2018

Support for %z was introduced recently and will be a part of the v0.24.0 release.

In [6]: pd.__version__
Out[6]: '0.24.0.dev0+961.gefd1844da'

In [7]: t2=pd.to_datetime(s, format="%Y-%m-%d %H:%M:%S.%f%z", errors='ignore')
   ...: print ("T2", t2)
   ...: t3=pd.to_datetime(s, format="%Y-%m-%d %H:%M:%S.%f%z")
   ...: print ("T3", t3)
T2 2018-09-10 09:30:00.000894-04:00
T3 2018-09-10 09:30:00.000894-04:00

The performance hit you're experiencing is probably because we fall back on dateutil when the initial parsing attempts fail, and dateutil tends to be slow.

If you’re still experiencing a performance hit when parsing with %z on master, feel free to reopen.
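For a whole column rather than a single string, parsing with %z on pandas 0.24 or later might look like the sketch below. The sample values and variable names are hypothetical, and no timing is implied; `utc=True` is an optional choice that normalises the result to UTC.

```python
import datetime

import pandas as pd

# Hypothetical sample data in the format from the issue.
raw = pd.Series([
    "2018-09-10 09:30:00.000894-04:00",
    "2018-09-10 09:30:01.250000-04:00",
])

FMT = "%Y-%m-%d %H:%M:%S.%f%z"

# Keeps the -04:00 offset on every value.
parsed = pd.to_datetime(raw, format=FMT)

# Same parse, converted to UTC in one step.
as_utc = pd.to_datetime(raw, format=FMT, utc=True)
```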

bmironov commented, Nov 9, 2018

@mroeschke Sorry, I just noticed your reply about setting up a virtual environment. Is there any chance you could run the above tests under your setup and spare me from going through the installation of a virtual environment?

Thanks and best regards, Boris
