Pandas .astype('category') dramatically decreases performance

I have a DataFrame with 2,000,000 rows and the columns client, driver, pickup_latitude, and pickup_longitude. I run

table['client'] = table.client.astype('category')
table['driver'] = table.driver.astype('category')

to save some memory. The aggregation

aggregate = canvas.points(table, 'pickup_longitude', 'pickup_latitude', ds.count())

runs for 4 seconds.

If I do not cast client and driver to category, the aggregation runs in 20 ms.

It is not obvious to me why .astype('category') decreases performance so much, and it was hard to debug. Could a notice or warning be added?
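The memory side of this trade-off can be checked with a minimal sketch on synthetic data. The column names come from the issue; the row count, value ranges, and the `make_table` helper are assumptions made for illustration. The sketch only measures pandas memory use and does not reproduce the Datashader timing.

```python
import numpy as np
import pandas as pd


def make_table(n_rows=100_000, n_clients=1_000, n_drivers=500, seed=0):
    """Build a synthetic table with the four columns named in the issue."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        # Repeated string identifiers: the case where categories save memory.
        "client": rng.integers(0, n_clients, n_rows).astype(str),
        "driver": rng.integers(0, n_drivers, n_rows).astype(str),
        "pickup_latitude": rng.uniform(55.5, 56.0, n_rows),
        "pickup_longitude": rng.uniform(37.3, 37.9, n_rows),
    })


table = make_table()
bytes_before = table.memory_usage(deep=True).sum()

# Same casts as in the issue, applied to a copy so both versions can be compared.
categorical = table.copy()
categorical["client"] = categorical["client"].astype("category")
categorical["driver"] = categorical["driver"].astype("category")
bytes_after = categorical.memory_usage(deep=True).sum()

print(f"string columns:   {bytes_before / 1e6:.1f} MB")
print(f"category columns: {bytes_after / 1e6:.1f} MB")
```

The aggregation time could be compared the same way by wrapping the `canvas.points` call in `time.perf_counter()` for both `table` and `categorical`.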

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 15 (12 by maintainers)

Top GitHub Comments

3 reactions
jcrist commented, Aug 1, 2017

I think we should just remove odo entirely. We initially used it to provide potential support for alternate data sources. Since we seem to have settled on pandas/dask.dataframe only, it doesn’t make sense to have a generic dtype system as a dependency. I’d remove odo support and just use pandas dtypes directly.
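As a rough illustration of what "use pandas dtypes directly" can look like, the sketch below classifies columns by inspecting each column's pandas dtype, with no generic dtype layer in between. `column_kind` is a hypothetical helper written for this example; it is not Datashader code and not necessarily how the maintainers implemented the change.

```python
import pandas as pd


def column_kind(series):
    """Classify a column using only its pandas dtype (hypothetical helper)."""
    dtype = series.dtype
    # Check categorical first: it is identified by its dtype object alone,
    # so the values and categories never need to be scanned.
    if isinstance(dtype, pd.CategoricalDtype):
        return "categorical"
    if pd.api.types.is_numeric_dtype(dtype):
        return "numeric"
    return "other"


df = pd.DataFrame({
    "client": pd.Series(["a", "b", "a"], dtype="category"),
    "pickup_latitude": [55.7, 55.8, 55.9],
    "note": ["x", "y", "z"],
})

kinds = {name: column_kind(df[name]) for name in df.columns}
print(kinds)
```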

1 reaction
kuk commented, Aug 10, 2017

Thank you! Before posting this issue I just removed .astype('category') from my code, so this is not a problem for me any more. Datashader was very useful in my work. By the way, you can check out our project that uses Datashader: http://lab.alexkuk.ru/taxi/
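Besides dropping the cast, a possible workaround that keeps the memory saving is to hand the aggregation only the columns it reads. The sketch below shows the column selection on a tiny made-up table; `coordinate_columns` is a hypothetical helper, and whether this avoids the slowdown in the affected Datashader version is an assumption that the thread does not confirm.

```python
import numpy as np
import pandas as pd


def coordinate_columns(table, x, y):
    """Return a frame holding only the two columns the aggregation reads."""
    return table[[x, y]]


# Tiny illustrative table; the values are made up.
table = pd.DataFrame({
    "client": pd.Series(["a", "b", "a", "c"], dtype="category"),
    "driver": pd.Series(["x", "x", "y", "y"], dtype="category"),
    "pickup_latitude": np.array([55.70, 55.75, 55.80, 55.85]),
    "pickup_longitude": np.array([37.50, 37.55, 37.60, 37.65]),
})

# The categorical columns stay in `table`; only the coordinates are passed on.
coords = coordinate_columns(table, "pickup_longitude", "pickup_latitude")
print(coords.dtypes)

# With Datashader installed, the call from the issue would then become:
# aggregate = canvas.points(coords, 'pickup_longitude', 'pickup_latitude', ds.count())
```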
