Plotnine performance issue
Firstly, thanks for creating this project - I’ve been looking for a Python grammar of graphics implementation for a while.
I’m having a performance issue plotting a large Pandas data frame.
The data is found in prob_w_d_df, which is a (600000, 3) data frame; prob_w_d_df.describe() gives:
        doc_id       prob
        ------       ----
count   600000       6.000000e+05
mean    29           1.000000e-04
std     17.318117    4.862052e-04
min     0.000000     2.285988e-07
25%     14.750000    1.205222e-06
50%     29.500000    5.755733e-06
75%     44.250000    3.625872e-05
max     59.000000    9.012307e-02
The plot code is:
from plotnine import (aes, element_blank, facet_wrap, geom_line, ggplot,
                      theme, theme_minimal, xlab, ylab)

(
    ggplot(prob_w_d_df, aes(x='feature', y='prob', group=1)) +
    geom_line() +
    xlab("Word") +
    ylab("P(word)") +
    facet_wrap('~doc_id', scales="free") +
    theme_minimal(base_family="Arial") +
    theme(figure_size=(15, 15), axis_text_x=element_blank())
)
This takes 1,543 seconds (~26 minutes) to generate the plot.
Now I understand it’s a large data frame so it will be slow; however, when I replicate the exact code above with ggplot2 in R, it takes 65 seconds. I understand R may be faster, but the difference seems too great.
Is there anything I’m doing wrong, or anything I should do to improve performance?
(I’ve shared a CSV dump of prob_w_d_df if it helps: prob_w_d_df.csv.)
Thanks.
Issue Analytics
- State:
- Created 5 years ago
- Comments: 6 (5 by maintainers)
Top GitHub Comments
If I remove the facet_wrap(...) (i.e., plot a single panel) it still takes 20 minutes, so it seems multiple axes are not the real issue.
I was having a similar performance issue when plotting data from a Pandas dataframe. What I found was that plotnine was having trouble if one of my variables was an object datatype. My x-axis was a time series of minutes of the day, which I had converted to strings, like “12:05”, “12:10”, etc.
I found that when I converted these timestamps to float datatypes, like 12.0833, 12.1667, etc., it improved the performance considerably.
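The conversion described in the last comment could look roughly like the sketch below. It is only an illustration: the data frame, its column names, and the helper hhmm_to_hours are made up for the example and are not from the thread. The second part applies the same idea to a column of words using pandas.factorize; whether that would speed up the plot in the original question is an assumption, not something the thread confirms.

```python
import pandas as pd


def hhmm_to_hours(labels: pd.Series) -> pd.Series:
    """Convert "HH:MM" strings to fractional hours (hypothetical helper)."""
    parts = labels.str.split(":", expand=True).astype(float)
    return parts[0] + parts[1] / 60.0


# Made-up data standing in for the commenter's time series.
df = pd.DataFrame({"time": ["12:05", "12:10", "12:15"],
                   "value": [1.0, 2.0, 3.0]})

# The string column is replaced by a numeric one before plotting.
df["time_hours"] = hhmm_to_hours(df["time"])

# Assumption: a column of words (like 'feature' in the question) could be
# mapped to integer codes in a similar way.
words = pd.Series(["apple", "pear", "apple"])
codes, uniques = pd.factorize(words)
```

The numeric column (time_hours here) would then be mapped to x in aes() in place of the string column.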