Pandas .astype('category') dramatically decreases performance

I have a DataFrame with 2,000,000 rows and the columns client, driver, pickup_latitude, pickup_longitude. I do

table['client'] = table.client.astype('category')
table['driver'] = table.driver.astype('category')

to save some memory. The aggregation

aggregate = canvas.points(table, 'pickup_longitude', 'pickup_latitude', ds.count())

runs for 4 seconds.

If I do not cast client and driver to category, the aggregation runs in 20 ms.

It is not obvious to me why .astype('category') decreases performance so much, and it was hard to debug. Maybe add a notice or warning?
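For context on why the cast is tempting in the first place, here is a minimal sketch that measures the memory saved by converting string columns to category. The table is a synthetic stand-in: the row count, value ranges, and coordinate bounds are invented for illustration and are not the data from the report.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000  # far fewer than the 2,000,000 rows in the report, to keep the sketch quick

# Synthetic stand-in for the reported table; all values are made up.
table = pd.DataFrame({
    "client": rng.integers(0, 1000, n).astype(str),
    "driver": rng.integers(0, 500, n).astype(str),
    "pickup_latitude": rng.uniform(55.5, 56.0, n),
    "pickup_longitude": rng.uniform(37.3, 37.9, n),
})

# deep=True counts the string payloads, not just the pointers to them.
before = table.memory_usage(deep=True).sum()

categorical = table.copy()
categorical["client"] = categorical["client"].astype("category")
categorical["driver"] = categorical["driver"].astype("category")

after = categorical.memory_usage(deep=True).sum()
print(f"string columns: {before / 1e6:.1f} MB, category columns: {after / 1e6:.1f} MB")
```

The saving comes from storing each distinct string once and keeping only small integer codes per row, so it grows with the number of repeated values.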

Issue Analytics

Created 6 years ago
Comments: 15 (12 by maintainers)

Top GitHub Comments

3 reactions

jcrist commented, Aug 1, 2017

I think we should just remove odo entirely. We initially used it to provide potential support for alternate data sources. Since we seem to have settled on pandas/dask.dataframe only, it doesn’t make sense to have a generic dtype system as a dependency. I’d remove odo support and just use pandas dtypes directly.

1 reaction

kuk commented, Aug 10, 2017

Thank you! Before posting this issue I just removed .astype('category') from my code, so this is not a problem for me anymore. Datashader was very useful in my work. By the way, you can check out our project that uses Datashader: http://lab.alexkuk.ru/taxi/
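An alternative to dropping the cast, which would keep the memory savings, might be to hand the aggregation only the two coordinate columns it reads. The sketch below is untested against the Datashader version in the report. It assumes the slowdown comes from inspecting the dtypes of every column in the table, which the maintainer comment about the generic dtype system suggests but does not confirm. The helper name coordinate_columns and the sample values are hypothetical.

```python
import pandas as pd


def coordinate_columns(table, x, y):
    """Return only the two columns the aggregation reads (hypothetical helper)."""
    return table[[x, y]]


# Tiny made-up table with the same column names as in the report.
table = pd.DataFrame({
    "client": ["a", "b", "a"],
    "driver": ["x", "x", "y"],
    "pickup_latitude": [55.75, 55.76, 55.74],
    "pickup_longitude": [37.61, 37.62, 37.60],
})
table["client"] = table["client"].astype("category")
table["driver"] = table["driver"].astype("category")

# The categorical columns stay in `table`; only the coordinates are passed on.
points = coordinate_columns(table, "pickup_longitude", "pickup_latitude")

# With Datashader installed and `canvas` created as in the report, the call would be:
# aggregate = canvas.points(points, 'pickup_longitude', 'pickup_latitude', ds.count())
```

Whether this actually restores the 20 ms timing would need to be measured on the real data.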