
PERF: add datetime caching kw in to_datetime


I’ll propose a cache_datetime=False keyword as an addition to read_csv and pd.to_datetime.

This would use a lookup cache (a dict will probably work) to map datetime strings to Timestamp objects. For repeated dates this will lead to some dramatic speedups.

Care must be taken if a format kw is provided (in to_datetime, as the cache will have to be exposed). This would be optional (and default False), as I think if you have all-unique dates this could modestly slow things down (but that can be revisited if needed).

This might also need to accept a list of column names (like parse_dates) to enable per-column caching (e.g. you might want to apply it to a column, but not to the index, for example).

Possibly we could overload parse_dates='cache' to mean this as well.
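
For illustration only, here is a minimal sketch of the dict-based cache described above (the helper name is made up, not part of the proposal): each distinct string is parsed exactly once, and repeats become plain dict lookups.

import pandas as pd

def to_datetime_cached(values):
    # hypothetical sketch of the proposed cache: parse each distinct
    # string once, look the rest up in a dict
    cache = {v: pd.Timestamp(v) for v in set(values)}
    return pd.DatetimeIndex([cache[v] for v in values])

idx = to_datetime_cached(['20130101 00:00:00'] * 10000)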

trivial example

In [1]: pd.DataFrame({'A' : ['20130101 00:00:00']*10000}).to_csv('test.csv',index=True)

In [14]: def parser(x):
   ....:         # parse each unique string once, then map the full column
   ....:         uniques = pd.Series(pd.unique(x))
   ....:         d = pd.to_datetime(uniques)
   ....:         d.index = uniques
   ....:         return pd.Series(x).map(d).values
   ....: 
In [3]: df1 = pd.read_csv('test.csv',index_col=0, parse_dates=['A'])

In [4]: df2 = pd.read_csv('test.csv',index_col=0, parse_dates=['A'], date_parser=parser)

In [17]: %timeit pd.read_csv('test.csv',index_col=0, parse_dates=['A'])
1 loops, best of 3: 969 ms per loop

In [18]: %timeit pd.read_csv('test.csv',index_col=0, parse_dates=['A'], date_parser=parser)
100 loops, best of 3: 5.31 ms per loop

In [7]: df1.equals(df2)
Out[7]: True

Issue Analytics

  • State: closed
  • Created: 8 years ago
  • Comments: 18 (11 by maintainers)

Top GitHub Comments

1 reaction
DGrady commented, May 23, 2017

Okay; will dive into it tomorrow.

0 reactions
jreback commented, May 22, 2017

Oh, I think we should always do this; we could add an option to turn it off, I suppose.

Basically, if you have, say, 1000 elements or fewer you can skip it; otherwise it’s always worth it (yes, there is a degenerate case with a really long all-unique series, but detecting that is often not worth the cost).

You can implement it, then we can test a couple of cases and see.

FYI, we do something similar in pandas.core.util.hashing with the categorize kw (it’s simpler there to just hash the uniques), so a similar approach would be good.
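
For reference, a minimal sketch of that categorize-style idea as I read it (my own sketch, not the actual pandas implementation): factorize the input so only the unique strings go through pd.to_datetime, then expand the result back via the integer codes, skipping the cache below a small size threshold.

import numpy as np
import pandas as pd

def to_datetime_via_uniques(values, threshold=1000):
    # sketch only: mirrors the categorize idea from pandas.core.util.hashing;
    # missing-value handling (codes == -1) is omitted here
    values = np.asarray(values, dtype=object)
    if len(values) <= threshold:
        # small inputs: caching is not worth the factorize overhead
        return pd.to_datetime(values)
    codes, uniques = pd.factorize(values)
    converted = pd.to_datetime(uniques)   # each unique string parsed once
    return converted.take(codes)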
