
PERF: add datetime caching kw in to_datetime


I’ll propose a cache_datetime=False keyword as an addition to read_csv and pd.to_datetime.

This would use a lookup cache (a dict will probably work) to map datetime strings to Timestamp objects. For repeated dates this will lead to some dramatic speedups.

Care must be taken if a format kw is provided (in to_datetime, as the cache will have to be exposed). This would be optional (and default False), as I think if you have all-unique dates this could modestly slow things down (but that can be revisited if needed).

This might also need to accept a list of column names (like parse_dates) to enable per-column caching (e.g. you might want to apply it to a column, but not to the index, for example).

Possibly we could overload parse_dates='cache' to mean this as well.
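
For illustration only, here is a minimal sketch of the dict-based cache described above (the helper name is made up, not part of the proposal): each distinct string is parsed exactly once, and repeats become plain dict lookups.

import pandas as pd

def to_datetime_cached(values):
    # hypothetical sketch of the proposed cache: parse each distinct
    # string once, look the rest up in a dict
    cache = {v: pd.Timestamp(v) for v in set(values)}
    return pd.DatetimeIndex([cache[v] for v in values])

idx = to_datetime_cached(['20130101 00:00:00'] * 10000)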

trivial example

In [1]: pd.DataFrame({'A' : ['20130101 00:00:00']*10000}).to_csv('test.csv',index=True)

In [14]: def parser(x):
   ....:         # parse each unique string once, then map the full column
   ....:         uniques = pd.Series(pd.unique(x))
   ....:         d = pd.to_datetime(uniques)
   ....:         d.index = uniques
   ....:         return pd.Series(x).map(d).values
   ....: 
In [3]: df1 = pd.read_csv('test.csv',index_col=0, parse_dates=['A'])

In [4]: df2 = pd.read_csv('test.csv',index_col=0, parse_dates=['A'], date_parser=parser)

In [17]: %timeit pd.read_csv('test.csv',index_col=0, parse_dates=['A'])
1 loops, best of 3: 969 ms per loop

In [18]: %timeit pd.read_csv('test.csv',index_col=0, parse_dates=['A'], date_parser=parser)
100 loops, best of 3: 5.31 ms per loop

In [7]: df1.equals(df2)
Out[7]: True

Issue Analytics

  • State: closed
  • Created: 8 years ago
  • Comments: 18 (11 by maintainers)

Top GitHub Comments

1 reaction
DGrady commented, May 23, 2017

Okay; will dive into it tomorrow.

0 reactions
jreback commented, May 22, 2017

Oh, I think we should always do this; we could add an option to turn it off, I suppose.

Basically, if you have, say, 1000 elements or fewer you can skip it; otherwise it’s always worth it (yes, there is a degenerate case with a really long all-unique series, but detecting that is often not worth the cost).

You can implement it, then we can test a couple of cases and see.

FYI, we do something similar in pandas.core.util.hashing with the categorize kw (it’s simpler there to just hash the uniques), so a similar approach would be good.
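
For reference, a minimal sketch of that categorize-style idea as I read it (my own sketch, not the actual pandas implementation): factorize the input so only the unique strings go through pd.to_datetime, then expand the result back via the integer codes, skipping the cache below a small size threshold.

import numpy as np
import pandas as pd

def to_datetime_via_uniques(values, threshold=1000):
    # sketch only: mirrors the categorize idea from pandas.core.util.hashing;
    # missing-value handling (codes == -1) is omitted here
    values = np.asarray(values, dtype=object)
    if len(values) <= threshold:
        # small inputs: caching is not worth the factorize overhead
        return pd.to_datetime(values)
    codes, uniques = pd.factorize(values)
    converted = pd.to_datetime(uniques)   # each unique string parsed once
    return converted.take(codes)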
