Sending Large DataFrames is Slow
Problem
This code is slow to re-execute:
import streamlit as st
import pandas as pd
DATA_BUCKET = "http://s3-us-west-2.amazonaws.com/streamlit-demo-data/"
DATA_URL = DATA_BUCKET + "uber-raw-data-sep14.csv.gz"
read_and_cache_csv = st.cache(pd.read_csv)
data = read_and_cache_csv(DATA_URL, nrows=100000)
st.write('Data', data)
Try it:
streamlit run https://gist.githubusercontent.com/treuille/fdc5ff1e68788086a568479c6ad3b954/raw/3840b108e5ccafcad8f670d3f98ccf0dd6573b27/answer_100000_rows.py
Solution
One possible solution I discussed with @tconkling would be to send only a portion of the DataFrame initially, then schedule a series of add_rows calls to send the rest (see the sketch below).
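The proposal is for Streamlit to do this chunking internally, but the same effect can be sketched from user code, assuming the element handle returned by st.dataframe and its add_rows method (the chunk size below is illustrative):

import pandas as pd
import streamlit as st

DATA_URL = "http://s3-us-west-2.amazonaws.com/streamlit-demo-data/uber-raw-data-sep14.csv.gz"
CHUNK = 10_000  # illustrative chunk size

read_and_cache_csv = st.cache(pd.read_csv)
data = read_and_cache_csv(DATA_URL, nrows=100000)

# Send a small first slice so the user sees the table immediately...
table = st.dataframe(data.iloc[:CHUNK])

# ...then append the remaining rows in a series of smaller messages.
for start in range(CHUNK, len(data), CHUNK):
    table.add_rows(data.iloc[start:start + CHUNK])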
Additional context
We should also run tests to determine whether @tconkling’s deduplication code may be slowing Streamlit down because of the cost of hashing the entire DataFrame. Could we use a probabilistic hash, as @domoritz does in his code?
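For illustration, a probabilistic hash could mean hashing only a fixed-size sample of rows plus the shape instead of every value, so the cost no longer grows with the size of the DataFrame. A rough sketch, not a description of @domoritz’s actual code; the function name and sample size are made up:

import hashlib
import pandas as pd

def sample_hash(df, n=1000, seed=0):
    # Hash a deterministic sample of rows plus the shape rather than the whole frame.
    # Two different DataFrames can collide, but the cost is bounded by n, not len(df).
    sample = df.sample(min(n, len(df)), random_state=seed)
    digest = hashlib.md5(pd.util.hash_pandas_object(sample, index=True).values.tobytes())
    digest.update(str(df.shape).encode())
    return digest.hexdigest()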
Issue Analytics
- Created 4 years ago
- Comments: 7 (2 by maintainers)
Top Results From Across the Web

python code with large pandas DataFrame is to slow
my df is a large pd.DataFrame. len(df) returns 1342058 and has 25 columns. df contains timestamps with certain events and locations.

Why are pandas slow for large datasets? - Quora
There are many reasons why pandas is slow on large datasets. Be it inherent issues with large datasets, or some that are more...

How to Speed up Pandas by 4x with one line of code
But there is one drawback: Pandas is slow for larger datasets. ... your DataFrame into different parts such that each part can be...

How to Speed Up Your Pandas Code by 10x | Built In
Converting a DataFrame from Pandas to NumPy is relatively straightforward. You can use the dataframe's .to_numpy() function to automatically ...

How to handle large datasets in Python with Pandas and Dask
The issue often originates in an unforeseen expansion of a dataframe during an overly-complex transformation or a blind import of a table from...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
This is not an issue anymore. Since 0.85.0 (Arrow serialization), the performance boost is eye-catching!
I have a similar problem with caching the results of a Torch model (the model is a function argument). Hashing takes 5 seconds.
Is there any other way to avoid recomputing an object? A simple key-value registry would be helpful: if the key is in the registry, return the value; otherwise compute and store it.
There are other hashing functions, like MurmurHash (example list: https://cyan4973.github.io/xxHash/), which would need to be tested for this use case.
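For illustration, a minimal version of the key-value registry asked for above, assuming the caller supplies a cheap key (for example a checkpoint path or version string) so the large model object never has to be hashed; in recent Streamlit versions the dict can live in st.session_state so it survives reruns. The helper name and usage are hypothetical:

import streamlit as st

# Keep the registry in session_state so it persists across script reruns
# (a plain module-level dict would be rebuilt on every rerun).
if "registry" not in st.session_state:
    st.session_state.registry = {}

def get_or_compute(key, compute_fn):
    # Return the value cached under `key`; compute and store it on a miss.
    registry = st.session_state.registry
    if key not in registry:
        registry[key] = compute_fn()
    return registry[key]

# Hypothetical usage: key by a cheap identifier instead of the model object itself.
# result = get_or_compute(("resnet50-v1", input_id), lambda: run_model(model, batch))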