Sending Large DataFrames is Slow
Problem
This code is slow to re-execute:
import streamlit as st
import pandas as pd
DATA_BUCKET = "http://s3-us-west-2.amazonaws.com/streamlit-demo-data/"
DATA_URL = DATA_BUCKET + "uber-raw-data-sep14.csv.gz"
read_and_cache_csv = st.cache(pd.read_csv)
data = read_and_cache_csv(DATA_URL, nrows=100000)
st.write('Data', data)
Try it:
streamlit run https://gist.githubusercontent.com/treuille/fdc5ff1e68788086a568479c6ad3b954/raw/3840b108e5ccafcad8f670d3f98ccf0dd6573b27/answer_100000_rows.py
Solution
One possible solution I discussed with @tconkling would be to send only a portion of the DataFrame initially, then schedule a series of add_rows calls to send the rest (see the sketch below).
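The proposal is for Streamlit to do this chunking internally, but the same effect can be sketched from user code, assuming the element handle returned by st.dataframe and its add_rows method (the chunk size below is illustrative):

import pandas as pd
import streamlit as st

DATA_URL = "http://s3-us-west-2.amazonaws.com/streamlit-demo-data/uber-raw-data-sep14.csv.gz"
CHUNK = 10_000  # illustrative chunk size

read_and_cache_csv = st.cache(pd.read_csv)
data = read_and_cache_csv(DATA_URL, nrows=100000)

# Send a small first slice so the user sees the table immediately...
table = st.dataframe(data.iloc[:CHUNK])

# ...then append the remaining rows in a series of smaller messages.
for start in range(CHUNK, len(data), CHUNK):
    table.add_rows(data.iloc[start:start + CHUNK])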
Additional context
We should also run tests to determine whether @tconkling’s deduplication code may be slowing Streamlit down because of the cost of hashing the entire DataFrame. Could we use a probabilistic hash, as @domoritz does in his code?
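For illustration, a probabilistic hash could mean hashing only a fixed-size sample of rows plus the shape instead of every value, so the cost no longer grows with the size of the DataFrame. A rough sketch, not a description of @domoritz’s actual code; the function name and sample size are made up:

import hashlib
import pandas as pd

def sample_hash(df, n=1000, seed=0):
    # Hash a deterministic sample of rows plus the shape rather than the whole frame.
    # Two different DataFrames can collide, but the cost is bounded by n, not len(df).
    sample = df.sample(min(n, len(df)), random_state=seed)
    digest = hashlib.md5(pd.util.hash_pandas_object(sample, index=True).values.tobytes())
    digest.update(str(df.shape).encode())
    return digest.hexdigest()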
Issue Analytics
- Created 4 years ago
- Comments: 7 (2 by maintainers)
Top Results From Across the Web

python code with large pandas DataFrame is to slow
my df is a large pd.DataFrame. len(df) returns 1342058 and has 25 columns. df contains timestamps with certain events and locations.

Why are pandas slow for large datasets? - Quora
There are many reasons why pandas is slow on large datasets. Be it inherent issues with large datasets, or some that are more...

How to Speed up Pandas by 4x with one line of code
But there is one drawback: Pandas is slow for larger datasets. ... your DataFrame into different parts such that each part can be...

How to Speed Up Your Pandas Code by 10x | Built In
Converting a DataFrame from Pandas to NumPy is relatively straightforward. You can use the dataframe's .to_numpy() function to automatically ...

How to handle large datasets in Python with Pandas and Dask
The issue often originates in an unforeseen expansion of a dataframe during an overly-complex transformation or a blind import of a table from...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
This is not an issue anymore. Since 0.85.0 (Arrow serialization), the performance boost is eye-catching!
I have a similar problem with caching the results of a Torch model (the model is a function argument). Hashing takes 5 seconds.
Is there any other way to avoid recomputing an object? A simple key-value registry would be helpful: if the key is in the registry, return the value; otherwise compute and store it.
There are other hashing functions, like MurmurHash (example list: https://cyan4973.github.io/xxHash/), which would need to be tested for this use case.
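For illustration, a minimal version of the key-value registry asked for above, assuming the caller supplies a cheap key (for example a checkpoint path or version string) so the large model object never has to be hashed; in recent Streamlit versions the dict can live in st.session_state so it survives reruns. The helper name and usage are hypothetical:

import streamlit as st

# Keep the registry in session_state so it persists across script reruns
# (a plain module-level dict would be rebuilt on every rerun).
if "registry" not in st.session_state:
    st.session_state.registry = {}

def get_or_compute(key, compute_fn):
    # Return the value cached under `key`; compute and store it on a miss.
    registry = st.session_state.registry
    if key not in registry:
        registry[key] = compute_fn()
    return registry[key]

# Hypothetical usage: key by a cheap identifier instead of the model object itself.
# result = get_or_compute(("resnet50-v1", input_id), lambda: run_model(model, batch))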