Pandas .astype('category') dramatically decreases performance
I have a DataFrame with 2 000 000 rows. It has the columns client, driver, pickup_latitude, pickup_longitude. I do
table['client'] = table.client.astype('category')
table['driver'] = table.driver.astype('category')
to save some memory space. Aggregation
aggregate = canvas.points(table, 'pickup_longitude', 'pickup_latitude', ds.count())
runs for 4 seconds.
If I do not cast client and driver to category, aggregation runs in 20ms.
It is not obvious to me why .astype('category') decreases performance so much, and it was hard to debug. Could you add a notice or warning?
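For context, here is a minimal sketch of the memory saving that motivates the cast. It uses synthetic data whose column names match the question; the row count, value ranges, and the `coords` name are assumptions made for illustration. The last step selects only the two coordinate columns, which might be one way to keep categorical columns away from the aggregation. The thread does not confirm that workaround, and it is not timed here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200_000  # assumption: smaller than the 2 000 000 rows in the question

# Synthetic stand-in for the real table (values are made up).
table = pd.DataFrame({
    "client": rng.integers(0, 1_000, n).astype(str),
    "driver": rng.integers(0, 500, n).astype(str),
    "pickup_latitude": rng.uniform(55.5, 56.0, n),
    "pickup_longitude": rng.uniform(37.3, 37.9, n),
})

# Memory footprint with plain string columns.
before = table.memory_usage(deep=True).sum()

# Same cast as in the question.
for col in ("client", "driver"):
    table[col] = table[col].astype("category")

# Memory footprint with categorical columns.
after = table.memory_usage(deep=True).sum()
print(f"string columns: {before / 1e6:.1f} MB, categorical: {after / 1e6:.1f} MB")

# Assumption: only the coordinates are needed for ds.count(), so a
# two-column view could be passed to canvas.points instead of the full table.
coords = table[["pickup_longitude", "pickup_latitude"]]
```

The categorical version stores each distinct string once plus small integer codes per row, which is where the saving comes from when values repeat often.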
Issue Analytics
- Created 6 years ago
- Comments: 15 (12 by maintainers)
Top GitHub Comments
I think we should just remove odo entirely. We initially used it to provide potential support for alternate data sources. Since we seem to have settled on pandas/dask.dataframe only, it doesn't make sense to have a generic dtype system as a dependency. I'd remove odo support and just use pandas dtypes directly.

Thank you! Before posting this issue I just removed .astype('category') from my code, so this is not a problem for me any more. Datashader was very useful in my work. By the way, you can check out our project that uses Datashader: http://lab.alexkuk.ru/taxi/
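As a rough illustration of the suggestion above to use pandas dtypes directly, the sketch below classifies columns using only pandas' own dtype checks. The helper name `split_columns_by_dtype` and the sample frame are hypothetical; this is not Datashader's actual code.

```python
import pandas as pd


def split_columns_by_dtype(df):
    """Hypothetical helper: group column names by their pandas dtype."""
    numeric, categorical, other = [], [], []
    for name, dtype in df.dtypes.items():
        if isinstance(dtype, pd.CategoricalDtype):
            categorical.append(name)
        elif pd.api.types.is_numeric_dtype(dtype):
            numeric.append(name)
        else:
            other.append(name)
    return numeric, categorical, other


# Made-up sample data using the column names from the question.
df = pd.DataFrame({
    "client": pd.Series(["a", "b", "a"], dtype="category"),
    "pickup_latitude": [55.7, 55.8, 55.9],
    "pickup_longitude": [37.5, 37.6, 37.7],
    "note": ["x", "y", "z"],
})

numeric, categorical, other = split_columns_by_dtype(df)
print(numeric, categorical, other)
```

Reading `df.dtypes` only inspects column metadata, so a check like this does not scan the data itself.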