Issue parsing '%z' in timestamps via pd.to_datetime


Code Sample, a copy-pastable example if possible

First, thanks a lot for this great library. It helps a lot in our day-to-day activities. I came across odd behaviour when processing timestamps from a set of my CSV files.

import datetime
import pandas as pd

s = "2018-09-10 09:30:00.000894-04:00"
t1 = datetime.datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f%z")
print("T1", t1)
# errors='ignore' returns the input unchanged if parsing fails
t2 = pd.to_datetime(s, format="%Y-%m-%d %H:%M:%S.%f%z", errors='ignore')
print("T2", t2)
t3 = pd.to_datetime(s, format="%Y-%m-%d %H:%M:%S.%f")
print("T3", t3)

Output:

T1 2018-09-10 09:30:00.000894-04:00
T2 2018-09-10 09:30:00.000894-04:00
T3 2018-09-10 13:30:00.000894
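
The T3 value matches what you get by converting the input to UTC and dropping the offset. That suggests, though the issue does not confirm it, that pandas 0.23.4 normalised the value to UTC in that call. A minimal standard-library sketch of that conversion, assuming Python 3.7+ (where %z in strptime accepts a colon in the offset); the variable names are illustrative only:

```python
import datetime

s = "2018-09-10 09:30:00.000894-04:00"
fmt = "%Y-%m-%d %H:%M:%S.%f%z"

# Parse into a timezone-aware datetime that keeps the -04:00 offset.
aware = datetime.datetime.strptime(s, fmt)

# Convert to UTC, then drop the tzinfo to get a naive UTC value.
as_utc = aware.astimezone(datetime.timezone.utc)
naive_utc = as_utc.replace(tzinfo=None)

print("aware    :", aware)
print("naive UTC:", naive_utc)
```

The naive UTC value has the same wall-clock time as T3, four hours ahead of the local time printed for T1.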

Problem description

As the output shows, pd.to_datetime appears to parse the timezone directive '%z' when errors='ignore' is passed. If I omit the errors parameter, I get an exception: ValueError: 'z' is a bad directive in format '%Y-%m-%d %H:%M:%S.%f%z'
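
One caveat when reading the T2 line: errors='ignore' is documented to return the input unchanged when parsing fails, so a printed value that looks identical to the input string may just be that string. A minimal sketch for checking whether a value was really parsed, using an explicit try/except instead of errors='ignore'; the helper name was_parsed is hypothetical and not part of pandas:

```python
import pandas as pd

s = "2018-09-10 09:30:00.000894-04:00"
fmt = "%Y-%m-%d %H:%M:%S.%f%z"


def was_parsed(result) -> bool:
    """Hypothetical helper: True only for a pandas Timestamp, not the raw input."""
    return isinstance(result, pd.Timestamp)


try:
    # On pandas versions without %z support this raises ValueError.
    result = pd.to_datetime(s, format=fmt)
except ValueError as exc:
    # Mimic what errors='ignore' would hand back: the original string.
    result = s
    print("parse failed:", exc)

print(type(result).__name__, was_parsed(result))
```

If the second printed field is False, the call did not parse anything and the fast timing for that call says nothing about parsing speed.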

I have a 6.6M-row CSV file with a column in the above format (up to microsecond precision, with time zone information such as -04:00). This column is the index of my dataframe. First, I load the file with from_csv(filename), passing no parameters other than the filename. When I use pd.to_datetime, I get the following performance results:

  1. df2['timestamp'] = pd.to_datetime(df2['datetime'], format="%Y-%m-%d %H:%M:%S.%f") takes 5 min to process
  2. df2['datetime'] = df2['datetime'].str.replace('-04:00', '') followed by df2['timestamp'] = pd.to_datetime(df2['datetime'], format="%Y-%m-%d %H:%M:%S.%f") takes 19 sec to complete
  3. df2['timestamp'] = pd.to_datetime(df2['datetime'], format="%Y-%m-%d %H:%M:%S.%f%z", errors='ignore') takes 3 sec to complete

All of the above calls produce the expected result in the dataframe.
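
The second approach (strip the offset, then parse with an explicit format) can be sketched as below. This is an illustration under assumptions, not the reporter's exact code: it uses a two-row stand-in for the real 6.6M-row frame, and it assumes every row carries the same fixed -04:00 offset. Note that stripping the offset yields naive local times, so the sketch adds the four hours back to get naive UTC values.

```python
import pandas as pd

# Small stand-in for the real data (assumption: every row ends in "-04:00").
df2 = pd.DataFrame({
    "datetime": [
        "2018-09-10 09:30:00.000894-04:00",
        "2018-09-10 09:30:01.000123-04:00",
    ]
})

# Drop the 6-character offset suffix and parse with an explicit format,
# which lets pandas avoid the slower fallback parser.
naive_local = pd.to_datetime(
    df2["datetime"].str[:-6],
    format="%Y-%m-%d %H:%M:%S.%f",
)

# The stripped offset was -04:00, so UTC is local time plus four hours.
df2["timestamp"] = naive_local + pd.Timedelta(hours=4)

print(df2["timestamp"])
```

If the offsets vary between rows, this shortcut would silently produce wrong values, so it only fits data with a single known offset.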

I see that there was recently a related issue about processing '%z': https://github.com/pandas-dev/pandas/issues/13486

Expected Output

No exceptions and no need to ignore errors

Output of pd.show_versions()


INSTALLED VERSIONS

commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8
LOCALE: en_CA.UTF-8

pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 40.5.0
Cython: 0.29
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.1.1
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.13
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Thanks again and best regards, Boris

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 11 (4 by maintainers)

Top GitHub Comments

1 reaction
mroeschke commented, Nov 9, 2018

Support for %z was introduced recently and will be a part of the v0.24.0 release.

In [6]: pd.__version__
Out[6]: '0.24.0.dev0+961.gefd1844da'

In [7]: t2=pd.to_datetime(s, format="%Y-%m-%d %H:%M:%S.%f%z", errors='ignore')
   ...: print ("T2", t2)
   ...: t3=pd.to_datetime(s, format="%Y-%m-%d %H:%M:%S.%f%z")
   ...: print ("T3", t3)
T2 2018-09-10 09:30:00.000894-04:00
T3 2018-09-10 09:30:00.000894-04:00

The performance hit you’re experiencing is probably because we fall back on using dateutil if the initial parsing attempts fail, which tends to be slow.

If you’re still experiencing a performance hit when parsing with %z on master, feel free to reopen.
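
For readers on pandas 0.24.0 or later, parsing a whole column with %z might look like the sketch below. It assumes a pandas version with %z support and uses a small illustrative Series in place of the reporter's data. Passing utc=True is one option for getting a single UTC-based dtype; it is a choice made for this example, not something the maintainer prescribed.

```python
import pandas as pd

# Illustrative stand-in for the 'datetime' column from the issue.
raw = pd.Series([
    "2018-09-10 09:30:00.000894-04:00",
    "2018-09-10 09:30:01.000123-04:00",
])

fmt = "%Y-%m-%d %H:%M:%S.%f%z"

# No errors= argument: a failure raises instead of passing silently.
# utc=True converts every value to UTC and gives a tz-aware dtype.
parsed_utc = pd.to_datetime(raw, format=fmt, utc=True)

print(parsed_utc.dtype)
print(parsed_utc.iloc[0])
```

Because an explicit format is supplied and matches the data, this path should not need the slower dateutil fallback described above.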

0 reactions
bmironov commented, Nov 9, 2018

@mroeschke Sorry, I just noticed your reply about setting up a virtual environment. Is there any chance you could run the above tests under your setup and spare me from installing a virtual environment?

Thanks and best regards, Boris

