st.file_uploader produces undesired results for some pdfs.
Hi everyone!
Something undesired happens to some pdfs when they are held in memory by st.file_uploader.
I have a streamlit app that lets users upload documents (docx, pdf, txt) and automatically processes and cleans them.
The upload and processing work fine. However, when I run the EXACT SAME functions line by line with the pdf read from a local file path instead, the outcome is very different (with the streamlit upload, some words are usually not processed correctly).
Again, the only difference is that in scenario A the pdf is uploaded via st.file_uploader and in scenario B the pdf is given by a local path. I therefore assume that st.file_uploader somehow stores the pdf differently, and I am not sure how to fix this.
Please note that the correct output comes from the local file path. The streamlit output is faulty.
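One way to test the assumption that the upload is stored differently is to compare the raw bytes from both scenarios before any pdf parsing happens. The sketch below is only an illustration: `same_bytes` is a hypothetical helper that does not appear in the app, and `io.BytesIO` stands in for the object returned by st.file_uploader (which exposes `getvalue()` in the same way). If the digests match, the difference would more likely come from how the parser handles an in-memory object versus a path than from the bytes themselves.

```python
import hashlib
import io


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def same_bytes(uploaded_file, local_bytes: bytes) -> bool:
    """Hypothetical helper: True if the upload and the local file are byte-identical."""
    # getvalue() returns the whole buffer regardless of the current read position.
    return sha256_of(uploaded_file.getvalue()) == sha256_of(local_bytes)


# Placeholder bytes, not a real pdf; in the app these would come from
# open(local_path, "rb").read() and from the st.file_uploader result.
local_bytes = b"%PDF-1.4\n% placeholder content\n"
fake_upload = io.BytesIO(local_bytes)

identical = same_bytes(fake_upload, local_bytes)
different = same_bytes(io.BytesIO(b"other bytes"), local_bytes)
```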
Expected behavior:
Pdf loaded from a local file path, 2nd paragraph (words correctly processed):
Actual behavior:
Pdf uploaded via st.file_uploader, 2nd paragraph (words not correctly processed):
Additional information:
So far this behaviour has only shown up for pdfs, not for docx or txt files.
Top GitHub Comments
@kajarenc, would you happen to know what is wrong with this off the top of your head? I looked at some of the code but couldn’t find anything that would help point me in the right direction.
Dear @kajarenc and @willhuang1997
After some further digging, I found that this problem has already been reported quite a lot by the community. Please see here: https://github.com/streamlit/streamlit/issues/904
I am closing this now. Thank you very much for your engagement, and please excuse the mistake on my side. I hope we will be able to find a solution to the issue above soon.
A temporary fix is to store the file in a temporary folder, as outlined here: https://github.com/deepset-ai/haystack/issues/2824
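For anyone landing here, the temporary-folder workaround could look roughly like the sketch below. This is not the code from the linked issue: `save_upload_to_temp` is a hypothetical helper, and `io.BytesIO` stands in for the object returned by st.file_uploader so the snippet runs without streamlit. The idea is simply to write the in-memory bytes to disk and hand the resulting path to the same path-based processing that already works in scenario B.

```python
import io
import os
import tempfile


def save_upload_to_temp(uploaded_file, suffix=".pdf"):
    """Hypothetical helper: write an in-memory upload to a temp file, return its path."""
    # delete=False keeps the file on disk after the handle is closed,
    # so a path-based pdf parser can open it afterwards.
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(uploaded_file.getvalue())
        return tmp.name


# Stand-in for the st.file_uploader result; placeholder bytes, not a real pdf.
fake_upload = io.BytesIO(b"%PDF-1.4\n% placeholder content\n")

path = save_upload_to_temp(fake_upload)
with open(path, "rb") as f:
    round_trip = f.read()  # in the app, pass `path` to the pdf processing instead

# Clean up manually, since delete=False leaves the file in place.
os.remove(path)
```

In the app itself, the call would be made on the non-None value returned by st.file_uploader, and the temp file should be removed once processing has finished.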