Random seeking from S3 is very slow
Problem description
- What are you trying to achieve?
- I am trying to merge PDFs hosted on S3 using pikepdf.
- What is the expected result?
- The S3 files are read and the result is saved as a combined PDF.
- What are you seeing instead?
- The code gets stuck and I have to kill the process.
Steps/code to reproduce the problem
Add two small PDFs (200-300 KB) to an S3 bucket. Reference them in the script below and run it. See that the script gets stuck. Now reference the same files from the local file system and see that it works fine.
from contextlib import ExitStack

from pikepdf import Pdf
from smart_open import open

input_pdfs = [
    "s3://somebucket/somefile1.pdf",
    "s3://somebucket/somefile2.pdf",
]

with open("merged.pdf", "wb") as fout:
    out_pdf = Pdf.new()
    version = out_pdf.pdf_version
    with ExitStack() as stack:
        pdf_fps = [stack.enter_context(open(path, "rb")) for path in input_pdfs]
        for fp in pdf_fps:
            src = Pdf.open(fp)
            version = max(version, src.pdf_version)
            out_pdf.pages.extend(src.pages)
        out_pdf.remove_unreferenced_resources()
        out_pdf.save(fout, min_version=version)
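As a point of comparison, here is a workaround sketch of my own (not part of the original report): buffering each object fully into memory with a single sequential read sidesteps the seeking entirely, because QPDF then seeks against a local io.BytesIO instead of going back to S3. The bucket paths are the same placeholders as above, and the obvious cost is memory proportional to the input file sizes.

```python
import io

from pikepdf import Pdf
from smart_open import open

input_pdfs = [
    "s3://somebucket/somefile1.pdf",
    "s3://somebucket/somefile2.pdf",
]

with open("merged.pdf", "wb") as fout:
    out_pdf = Pdf.new()
    version = out_pdf.pdf_version
    srcs = []
    for path in input_pdfs:
        # One sequential read per object; every later seek stays in memory.
        with open(path, "rb") as fp:
            buffered = io.BytesIO(fp.read())
        src = Pdf.open(buffered)
        srcs.append(src)  # keep the sources alive until save()
        version = max(version, src.pdf_version)
        out_pdf.pages.extend(src.pages)
    out_pdf.remove_unreferenced_resources()
    out_pdf.save(fout, min_version=version)
```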
Versions
>>> import platform, sys, smart_open
>>> print(platform.platform())
Linux-5.3.18-lp152.75-default-x86_64-with-glibc2.3.4
>>> print("Python", sys.version)
Python 3.6.12 (default, Dec 02 2020, 09:44:23) [GCC]
>>> print("smart_open", smart_open.__version__)
smart_open 5.0.0
In any case, the platform details seem irrelevant: I can reproduce this on my own machine as well as on AWS Lambda with Python 3.8.
Checklist
Before you create the issue, please make sure you have:
- Described the problem clearly
- Provided a minimal reproducible example, including any required data
- Provided the version numbers of the relevant software
smart_open dumps the read buffer after every seek: https://github.com/RaRe-Technologies/smart_open/blob/f8e60da5a53e1e7ead9d8ca4d3f09cbea04fc337/smart_open/s3.py#L668
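To make the cost concrete, here is a minimal illustration of that pattern (my own sketch, not smart_open's actual code; fetch_from is a hypothetical stand-in for a ranged S3 GET):

```python
class NaiveSeekableReader:
    """Sketch of the failure mode, not smart_open's real class: the read
    buffer is discarded on every seek(), so each seek costs a fresh fetch."""

    def __init__(self, fetch_from):
        # fetch_from(start) -> bytes from `start` to EOF; a hypothetical
        # stand-in for a ranged S3 GET, not a smart_open API.
        self._fetch_from = fetch_from
        self._position = 0
        self._buffer = b""
        self.fetch_count = 0

    def seek(self, offset):
        self._position = offset
        self._buffer = b""  # dumped even when `offset` lies inside the buffer
        return self._position

    def read(self, size):
        if not self._buffer:
            self._buffer = self._fetch_from(self._position)  # network round trip
            self.fetch_count += 1
        data, self._buffer = self._buffer[:size], self._buffer[size:]
        self._position += len(data)
        return data


blob = bytes(range(256)) * 1000
reader = NaiveSeekableReader(lambda start: blob[start:])
for offset in (100, 130, 120):  # three small seeks, all near each other
    reader.seek(offset)
    reader.read(16)
print(reader.fetch_count)  # 3: every seek triggered another fetch
```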
pikepdf (actually, its C++ library QPDF) seeks very often. When combined with smart_open, the entire file is downloaded hundreds of times. I got a very simple test to pass with smart_open, but I expect the test above would take several minutes and ring up a decent-sized bill from AWS.

IMHO this needs to be resolved in smart_open. Many programs assume seeking is usually fast, especially small seeks that land within the active read buffer; that is why performance was fine for FUSE-mounted remotes. If I added a workaround in pikepdf, many other applications would likely still be affected. I'm sure you'd see similar problems if someone used smart_open with any other library that relies heavily on seeking, such as sqlite.

The easiest win would be to retain the read buffer when a seek lands within it. Ideally, you'd maintain a read cache.
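A sketch of that easiest win (my assumption of the approach, not an actual smart_open patch; fetch_from is the same hypothetical ranged-GET stand-in as above): keep the buffer whenever the seek target falls inside the window it already covers.

```python
class BufferRetainingReader:
    """Sketch of the suggested fix, not actual smart_open code: the buffer
    survives any seek whose target falls inside the window it covers."""

    def __init__(self, fetch_from):
        # fetch_from(start) -> bytes from `start` to EOF (hypothetical).
        self._fetch_from = fetch_from
        self._buffer = b""
        self._buffer_start = 0  # absolute offset of self._buffer[0]
        self._position = 0
        self.fetch_count = 0

    def seek(self, offset):
        buffer_end = self._buffer_start + len(self._buffer)
        if not (self._buffer_start <= offset < buffer_end):
            self._buffer = b""  # drop the buffer only on a genuine miss
            self._buffer_start = offset
        self._position = offset
        return self._position

    def read(self, size):
        if self._position >= self._buffer_start + len(self._buffer):
            self._buffer = self._fetch_from(self._position)  # network round trip
            self._buffer_start = self._position
            self.fetch_count += 1
        skip = self._position - self._buffer_start
        data = self._buffer[skip:skip + size]
        self._position += len(data)
        return data


blob = bytes(range(256)) * 1000
reader = BufferRetainingReader(lambda start: blob[start:])
for offset in (100, 130, 120):  # the same three nearby seeks as above
    reader.seek(offset)
    reader.read(16)
print(reader.fetch_count)  # 1: the two later seeks are served from the buffer
```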
I think that my PR #748 should fix some of this if the seek() calls are to the current position in the file. There is still outstanding work to do if the seek() is to a position contained in the read buffer.
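For illustration, that no-op case could look like this when grafted onto the NaiveSeekableReader sketch above (my reading of the idea, not the PR's actual diff):

```python
class CurrentPositionSeekReader(NaiveSeekableReader):  # sketch class from above
    """Sketch of the PR's idea, not its actual diff: a seek to the current
    position becomes a no-op, so patterns like fp.seek(fp.tell()) stay cheap."""

    def seek(self, offset):
        if offset == self._position:
            return self._position  # no-op seek: the buffer is kept
        return super().seek(offset)  # any real move still drops the buffer
```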