Random seeking from S3 is very slow
Problem description
- What are you trying to achieve?
- I am trying to merge PDFs hosted on S3 using pikepdf.
- What is the expected result?
- The S3 files are read and the result is saved as a combined PDF.
- What are you seeing instead?
- The code gets stuck and I have to kill the process.
Steps/code to reproduce the problem
Add two small PDFs (200-300 KB) to an S3 bucket. Reference them in the script below and run it. See that the script gets stuck. Now reference the same files from the local file system and see that it works fine.
from contextlib import ExitStack

from pikepdf import Pdf
from smart_open import open

input_pdfs = [
    "s3://somebucket/somefile1.pdf",
    "s3://somebucket/somefile2.pdf",
]

with open("merged.pdf", "wb") as fout:
    out_pdf = Pdf.new()
    version = out_pdf.pdf_version
    with ExitStack() as stack:
        pdf_fps = [stack.enter_context(open(path, "rb")) for path in input_pdfs]
        for fp in pdf_fps:
            src = Pdf.open(fp)
            version = max(version, src.pdf_version)
            out_pdf.pages.extend(src.pages)
        out_pdf.remove_unreferenced_resources()
        out_pdf.save(fout, min_version=version)
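As a point of comparison, here is a workaround sketch of my own (not part of the original report): buffering each object fully into memory with a single sequential read sidesteps the seeking entirely, because QPDF then seeks against a local io.BytesIO instead of going back to S3. The bucket paths are the same placeholders as above, and the obvious cost is memory proportional to the input file sizes.

```python
import io

from pikepdf import Pdf
from smart_open import open

input_pdfs = [
    "s3://somebucket/somefile1.pdf",
    "s3://somebucket/somefile2.pdf",
]

with open("merged.pdf", "wb") as fout:
    out_pdf = Pdf.new()
    version = out_pdf.pdf_version
    srcs = []
    for path in input_pdfs:
        # One sequential read per object; every later seek stays in memory.
        with open(path, "rb") as fp:
            buffered = io.BytesIO(fp.read())
        src = Pdf.open(buffered)
        srcs.append(src)  # keep the sources alive until save()
        version = max(version, src.pdf_version)
        out_pdf.pages.extend(src.pages)
    out_pdf.remove_unreferenced_resources()
    out_pdf.save(fout, min_version=version)
```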
Versions
>>> import platform, sys, smart_open
>>> print(platform.platform())
Linux-5.3.18-lp152.75-default-x86_64-with-glibc2.3.4
>>> print("Python", sys.version)
Python 3.6.12 (default, Dec 02 2020, 09:44:23) [GCC]
>>> print("smart_open", smart_open.__version__)
smart_open 5.0.0
In any case, the platform details seem irrelevant: I can reproduce this on my own machine as well as on AWS Lambda with Python 3.8.
Checklist
Before you create the issue, please make sure you have:
- Described the problem clearly
- Provided a minimal reproducible example, including any required data
- Provided the version numbers of the relevant software
smart_open dumps the read buffer after every seek: https://github.com/RaRe-Technologies/smart_open/blob/f8e60da5a53e1e7ead9d8ca4d3f09cbea04fc337/smart_open/s3.py#L668
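To make the cost concrete, here is a minimal illustration of that pattern (my own sketch, not smart_open's actual code; fetch_from is a hypothetical stand-in for a ranged S3 GET):

```python
class NaiveSeekableReader:
    """Sketch of the failure mode, not smart_open's real class: the read
    buffer is discarded on every seek(), so each seek costs a fresh fetch."""

    def __init__(self, fetch_from):
        # fetch_from(start) -> bytes from `start` to EOF; a hypothetical
        # stand-in for a ranged S3 GET, not a smart_open API.
        self._fetch_from = fetch_from
        self._position = 0
        self._buffer = b""
        self.fetch_count = 0

    def seek(self, offset):
        self._position = offset
        self._buffer = b""  # dumped even when `offset` lies inside the buffer
        return self._position

    def read(self, size):
        if not self._buffer:
            self._buffer = self._fetch_from(self._position)  # network round trip
            self.fetch_count += 1
        data, self._buffer = self._buffer[:size], self._buffer[size:]
        self._position += len(data)
        return data


blob = bytes(range(256)) * 1000
reader = NaiveSeekableReader(lambda start: blob[start:])
for offset in (100, 130, 120):  # three small seeks, all near each other
    reader.seek(offset)
    reader.read(16)
print(reader.fetch_count)  # 3: every seek triggered another fetch
```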
pikepdf (actually, its C++ library QPDF) seeks very often. When combined with smart_open, the entire file is downloaded hundreds of times. I got a very simple test to pass with smart_open, but I expect the test above would take several minutes and ring up a decent-sized bill from AWS.

IMHO this needs to be resolved in smart_open. Many programs assume seeking is usually fast, especially small seeks that land within the active read buffer; that is why performance was fine for FUSE-mounted remotes. If I added a workaround in pikepdf, many other applications would likely still be affected. I'm sure you'd see similar problems if someone used smart_open with any other library that relies heavily on seeking, such as sqlite.

The easiest win would be to retain the read buffer when a seek lands within it. Ideally, you'd maintain a read cache.
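A sketch of that easiest win (my assumption of the approach, not an actual smart_open patch; fetch_from is the same hypothetical ranged-GET stand-in as above): keep the buffer whenever the seek target falls inside the window it already covers.

```python
class BufferRetainingReader:
    """Sketch of the suggested fix, not actual smart_open code: the buffer
    survives any seek whose target falls inside the window it covers."""

    def __init__(self, fetch_from):
        # fetch_from(start) -> bytes from `start` to EOF (hypothetical).
        self._fetch_from = fetch_from
        self._buffer = b""
        self._buffer_start = 0  # absolute offset of self._buffer[0]
        self._position = 0
        self.fetch_count = 0

    def seek(self, offset):
        buffer_end = self._buffer_start + len(self._buffer)
        if not (self._buffer_start <= offset < buffer_end):
            self._buffer = b""  # drop the buffer only on a genuine miss
            self._buffer_start = offset
        self._position = offset
        return self._position

    def read(self, size):
        if self._position >= self._buffer_start + len(self._buffer):
            self._buffer = self._fetch_from(self._position)  # network round trip
            self._buffer_start = self._position
            self.fetch_count += 1
        skip = self._position - self._buffer_start
        data = self._buffer[skip:skip + size]
        self._position += len(data)
        return data


blob = bytes(range(256)) * 1000
reader = BufferRetainingReader(lambda start: blob[start:])
for offset in (100, 130, 120):  # the same three nearby seeks as above
    reader.seek(offset)
    reader.read(16)
print(reader.fetch_count)  # 1: the two later seeks are served from the buffer
```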
I think that my PR #748 should fix some of this if the seek() calls are to the current position in the file. There is still outstanding work to do if the seek() is to a position contained in the read buffer.
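For illustration, that no-op case could look like this when grafted onto the NaiveSeekableReader sketch above (my reading of the idea, not the PR's actual diff):

```python
class CurrentPositionSeekReader(NaiveSeekableReader):  # sketch class from above
    """Sketch of the PR's idea, not its actual diff: a seek to the current
    position becomes a no-op, so patterns like fp.seek(fp.tell()) stay cheap."""

    def seek(self, offset):
        if offset == self._position:
            return self._position  # no-op seek: the buffer is kept
        return super().seek(offset)  # any real move still drops the buffer
```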