
Sending Large DataFrames is Slow

See original GitHub issue

Problem

This code is slow to re-execute:

import streamlit as st
import pandas as pd

DATA_BUCKET = "http://s3-us-west-2.amazonaws.com/streamlit-demo-data/"
DATA_URL = DATA_BUCKET + "uber-raw-data-sep14.csv.gz"

# Cache pd.read_csv so reruns reuse the downloaded data instead of re-fetching it.
read_and_cache_csv = st.cache(pd.read_csv)

# Load 100,000 rows and render them; sending this DataFrame to the browser is the slow part.
data = read_and_cache_csv(DATA_URL, nrows=100000)
st.write('Data', data)

Try it:

streamlit run https://gist.githubusercontent.com/treuille/fdc5ff1e68788086a568479c6ad3b954/raw/3840b108e5ccafcad8f670d3f98ccf0dd6573b27/answer_100000_rows.py

Solution

One possible solution I discussed with @tconkling would be to send only a portion of the DataFrame initially, then schedule a series of add_rows calls to send the rest.
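A minimal sketch of that approach, assuming the chunking is driven from the script itself (the helper name, chunk size, and use of st.dataframe below are illustrative, not from the issue): render the first slice right away, then append the remaining rows to the same element with add_rows.

import streamlit as st
import pandas as pd

CHUNK_SIZE = 10_000  # assumed chunk size, purely illustrative

def send_in_chunks(df: pd.DataFrame, chunk_size: int = CHUNK_SIZE):
    # Show the first chunk immediately so the user sees data quickly.
    table = st.dataframe(df.iloc[:chunk_size])
    # Append the remaining rows to the same element in pieces.
    for start in range(chunk_size, len(df), chunk_size):
        table.add_rows(df.iloc[start:start + chunk_size])

The chunk size would need tuning: too small and the per-message overhead dominates, too large and the first paint is slow again.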

Additional context

We should also run tests to determine whether @tconkling’s deduplication code may be slowing Streamlit down because of the cost of hashing the entire DataFrame. Could we use a probabilistic hash like @domoritz does in his code?
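As a rough illustration of the "hash less than the whole frame" idea (this is not Streamlit's deduplication code; the function name and sample size are assumptions), one could hash a fixed-size row sample plus the shape instead of every row:

import pandas as pd

def sample_hash(df: pd.DataFrame, sample_size: int = 1000, seed: int = 0) -> int:
    # Hash a deterministic row sample plus the shape, trading exactness for speed.
    sample = df.sample(n=min(sample_size, len(df)), random_state=seed)
    row_hashes = pd.util.hash_pandas_object(sample, index=True)
    return hash((df.shape, tuple(row_hashes)))

The trade-off is false cache hits when two frames differ only outside the sampled rows, which is why this would need testing rather than blind adoption.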

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

1 reaction
kantuni commented, Oct 6, 2021

This is not an issue anymore. Since 0.85.0 (Arrow serialization), the performance boost is eye-catching!

1 reaction
djstrong commented, Oct 18, 2019

I have a similar problem with caching the results of a Torch model (the model is a function argument). Hashing takes 5 seconds.

Is there any other way to avoid recomputing an object? A simple key-value registry would be helpful: if the key is in the registry, return the value; otherwise compute and store it.

There are other hashing functions, like MurmurHash (see the comparison list at https://cyan4973.github.io/xxHash/), which would need to be tested for this use case.
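A minimal sketch of the key-value registry idea described above (purely illustrative; get_or_compute and load_torch_model are hypothetical names, not a Streamlit or Torch API):

_registry = {}

def get_or_compute(key, compute_fn):
    # Return the stored value if the key is known; otherwise compute, store, and return it.
    if key not in _registry:
        _registry[key] = compute_fn()
    return _registry[key]

# Key by something cheap (a path or version string) instead of hashing the model itself:
# model = get_or_compute("model-v1", lambda: load_torch_model("weights.pt"))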

Read more comments on GitHub >

Top Results From Across the Web

python code with large pandas DataFrame is to slow
my df is a large pd.DataFrame. len(df) returns 1342058 and it has 25 columns. df contains timestamps with certain events and locations.
Read more >
Why are pandas slow for large datasets? - Quora
There are many reasons why pandas is slow on large datasets. Be it inherent issues with large datasets, or some that are more...
Read more >
How to Speed up Pandas by 4x with one line of code
But there is one drawback: Pandas is slow for larger datasets. ... your DataFrame into different parts such that each part can be...
Read more >
How to Speed Up Your Pandas Code by 10x | Built In
Converting a DataFrame from Pandas to NumPy is relatively straightforward. You can use the DataFrame's .to_numpy() function to automatically ...
Read more >
How to handle large datasets in Python with Pandas and Dask
The issue often originates in an unforeseen expansion of a dataframe during an overly-complex transformation or a blind import of a table from...
Read more >
