Random seeking from S3 is very slow

Problem description

  • What are you trying to achieve?
    • I am trying to merge PDFs hosted on S3 using pikepdf.
  • What is the expected result?
    • The S3 files are read and the result is saved as a combined PDF.
  • What are you seeing instead?
    • The script hangs and I have to kill the process.

Steps/code to reproduce the problem

Add two small PDFs (200-300 KB) to an S3 bucket. Reference them in the script below and run it. The script hangs. Now reference the same files from the local file system, and it works fine.

from contextlib import ExitStack

from pikepdf import Pdf
from smart_open import open

input_pdfs = [
    "s3://somebucket/somefile1.pdf",
    "s3://somebucket/somefile2.pdf",
]

with open("merged.pdf", "wb") as fout:
    out_pdf = Pdf.new()
    version = out_pdf.pdf_version
    with ExitStack() as stack:
        pdf_fps = [stack.enter_context(open(path, "rb")) for path in input_pdfs]
        for fp in pdf_fps:
            src = Pdf.open(fp)
            version = max(version, src.pdf_version)
            out_pdf.pages.extend(src.pages)

    out_pdf.remove_unreferenced_resources()
    out_pdf.save(fout, min_version=version)

Versions

>>> import platform, sys, smart_open
>>> print(platform.platform())
Linux-5.3.18-lp152.75-default-x86_64-with-glibc2.3.4
>>> print("Python", sys.version)
Python 3.6.12 (default, Dec 02 2020, 09:44:23) [GCC]
>>> print("smart_open", smart_open.__version__)
smart_open 5.0.0

In any case, the platform details are irrelevant: I can reproduce this on my own machine as well as on AWS Lambda with Python 3.8.

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 10

Top GitHub Comments

1 reaction
jbarlow83 commented, May 25, 2021

smart_open dumps the read buffer after every seek: https://github.com/RaRe-Technologies/smart_open/blob/f8e60da5a53e1e7ead9d8ca4d3f09cbea04fc337/smart_open/s3.py#L668

pikepdf (actually, its C++ library QPDF) seeks very often. When combined with smart_open, the entire file will be downloaded hundreds of times. I got a very simple test to pass with smart_open, but I expect the test above would take several minutes and ring up a decent-sized bill from AWS.

IMHO this would need to be resolved in smart_open. Many programs assume seeking is usually fast, especially small seeks that are already in the active read buffer. That is why performance was fine for fuse-mounted remotes. If I did a workaround in pikepdf, there are likely many other applications that would still be affected. I’m sure you’d have similar problems if someone tried to use smart_open with another library that relies heavily on seek, like sqlite.

The easiest win would be to retain the read buffer if a seek lands within it. Ideally, you’d maintain a read cache.
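
The buffer-retention idea can be sketched as a thin wrapper around any seekable binary stream. This is only an illustration, not smart_open's implementation: the class name `BufferRetainingReader`, its `chunk_size` parameter, and the `raw_reads` counter are hypothetical, and the sketch assumes the wrapped stream supports `seek()` and `read()`.

```python
import io


class BufferRetainingReader:
    """Hypothetical wrapper that keeps the last chunk it read.

    seek() only records the new logical position. The underlying stream
    is touched again only when a read starts outside the retained chunk.
    """

    def __init__(self, raw, chunk_size=128 * 1024):
        self._raw = raw
        self._chunk_size = chunk_size
        self._buf = b""
        self._buf_start = 0  # absolute offset of self._buf[0]
        self._pos = 0        # logical position seen by the caller
        self.raw_reads = 0   # how often the underlying stream was read

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            target = offset
        elif whence == io.SEEK_CUR:
            target = self._pos + offset
        elif whence == io.SEEK_END:
            target = self._raw.seek(0, io.SEEK_END) + offset
        else:
            raise ValueError("invalid whence: %r" % (whence,))
        if target < 0:
            raise ValueError("negative seek position")
        self._pos = target  # the retained chunk is NOT discarded here
        return self._pos

    def read(self, size):
        out = bytearray()
        while len(out) < size:
            offset = self._pos - self._buf_start
            if not 0 <= offset < len(self._buf):
                # Miss: fetch a fresh chunk starting at the current position.
                self._raw.seek(self._pos)
                self._buf = self._raw.read(self._chunk_size)
                self._buf_start = self._pos
                self.raw_reads += 1
                if not self._buf:
                    break  # end of stream
                offset = 0
            piece = self._buf[offset:offset + size - len(out)]
            out += piece
            self._pos += len(piece)
        return bytes(out)
```

With this shape, repeated short seeks inside the same chunk cost nothing beyond the first fetch. A fuller read cache would keep several chunks rather than just the most recent one.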

0 reactions
rustyconover commented, Dec 18, 2022

I think that my PR #748 should fix some of this if the seek() calls are to the current position in the file.

There is still outstanding work to do if the seek() is to a position contained in the read buffer.
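
Until in-buffer seeks are handled, one possible workaround for small files such as the 200-300 KB PDFs in the report is to read each object once, sequentially, into memory and hand pikepdf the in-memory copy. This is a minimal sketch: the helper name `buffer_in_memory` is hypothetical, and it assumes the files fit comfortably in RAM.

```python
import io
import shutil


def buffer_in_memory(stream, chunk_size=1024 * 1024):
    """Copy a readable binary stream into a seekable io.BytesIO.

    The source is read front to back exactly once, so every later
    seek hits memory instead of the remote object.
    """
    buffer = io.BytesIO()
    shutil.copyfileobj(stream, buffer, chunk_size)
    buffer.seek(0)
    return buffer


# In the reproduction script, this would replace `src = Pdf.open(fp)`:
#     src = Pdf.open(buffer_in_memory(fp))
```

The trade-off is memory use proportional to file size, so this approach suits small documents rather than multi-gigabyte objects.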
