Plotnine performance issue

Firstly, thanks for creating this project - I’ve been looking for a python grammar of graphics implementation for a while.

I’m having a performance issue plotting a large Pandas data frame.

The data is found in prob_w_d_df, which is a (600000, 3) data frame; prob_w_d_df.describe() gives:

         doc_id       prob
         ------       ----
count    600000       6.000000e+05
mean     29           1.000000e-04
std      17.318117    4.862052e-04
min      0.000000     2.285988e-07
25%      14.750000    1.205222e-06
50%      29.500000    5.755733e-06
75%      44.250000    3.625872e-05
max      59.000000    9.012307e-02

The plot code is:

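from plotnine import (ggplot, aes, geom_line, xlab, ylab, facet_wrap,
                      theme_minimal, theme, element_blank)

# The whole expression is wrapped in parentheses so it can span several lines.
(ggplot(prob_w_d_df, aes(x='feature', y='prob', group=1)) +
 geom_line() +
 xlab("Word") +
 ylab("P(word)") +
 facet_wrap('~doc_id', scales="free") +
 theme_minimal(base_family="Arial") +
 theme(figure_size=(15, 15), axis_text_x=element_blank())
)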

This takes 1,543 seconds (~27 minutes) to generate the plot.

Now, I understand it’s a large data frame, so it will be slow; however, when I replicate the exact code above with ggplot2 in R, it takes 65 seconds. I understand R may be faster, but the difference seems too great.

Is there anything I’m doing wrong, or anything I should do to improve performance?

(I’ve shared a CSV dump of prob_w_d_df if it helps: prob_w_d_df.csv.)

Thanks.

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

1 reaction
bevankoopman commented, Apr 19, 2018

If I remove the facet_wrap(...) (i.e., plot a single panel), it still takes 20 minutes, so it seems multiple axes are not the real issue.

0 reactions
grgmiller commented, Apr 15, 2020

I was having a similar performance issue when plotting data from a Pandas dataframe. What I found was that plotnine had trouble when one of my variables had the object datatype. My x-axis was a time series of minutes of the day, which I had converted to strings, like “12:05”, “12:10”, etc.

I found that when I converted these timestamps to float values, like 12.0833, 12.1667, etc., performance improved considerably.
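
For anyone wanting to try the same workaround, here is a minimal sketch of the conversion described above, using only pandas. The data frame, the column names (time, feature, value) and the helper hhmm_to_hours are made up for illustration and are not from the original issue. The second step, mapping an arbitrary string column to integer codes with pd.factorize, is an assumption about how the idea might carry over to a column like 'feature'; the thread does not confirm that it resolves the original slowdown.

```python
import pandas as pd


def hhmm_to_hours(times: pd.Series) -> pd.Series:
    """Convert "HH:MM" strings to fractional hours, e.g. "12:05" -> 12.0833..."""
    parts = times.str.split(":", expand=True).astype(int)
    return parts[0] + parts[1] / 60.0


# Hypothetical data: both 'time' and 'feature' are object (string) columns.
df = pd.DataFrame({
    "time": ["12:05", "12:10", "12:15"],
    "feature": ["apple", "pear", "apple"],
    "value": [1.0, 2.0, 3.0],
})

# Numeric x values instead of strings, as described in the comment above.
df["hours"] = hhmm_to_hours(df["time"])

# Assumption: for a non-time string column, integer codes give a numeric
# stand-in; 'labels' keeps the original strings for relabelling if needed.
df["feature_code"], labels = pd.factorize(df["feature"])

print(df.dtypes)
```

The plot would then map x to the numeric column (for example aes(x='hours', ...)) rather than to the string column. Checking df.dtypes first is a quick way to see whether any mapped variable is of object type.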


Top Results From Across the Web

plotnine geom_density memory and performance issues
I am running into memory and performance issues when trying to implement a density plot using python's plotnine.

plotnine: Make great-looking correlation plots in Python
Correlation visualizations are very powerful for business reporting as they can highlight key relationships for management. The problem is that ...

Plotnine plot deconstruction: regularised logistic regression ...
A major issue in statistics is underdetermination: a lack of data prevents ... models as they share the same performance based on that...

Who's Behind the Numbers? A Conversation with Hassan ...
We are excited to highlight Hassan Kibirige, creator of Plotnine, to learn ... trial solve a small problem they had run into with...

A Grammar of Graphics for Python — plotnine 0.10.1 ...
plotnine is an implementation of a grammar of graphics in Python, it is based on ggplot2. The grammar allows users to compose plots...
