s3 seek

If I try to use smart_open to seek to and read parts of an S3 file, I get NotImplementedError: seek other than offset=0 not implemented yet.

Arbitrary seeking, especially when the seek is specified relative to the beginning of the file (seek(..., whence=0)), should be possible through the HTTP Range header:

>>> import boto
>>> s3 = boto.connect_s3()
>>> bucket = s3.lookup('bucket')
>>> key = bucket.lookup('key')
>>> parts = key.get_contents_as_string(headers={'Range' : 'bytes=12-24'})

seek could establish a pointer to the starting byte and subsequent reads would define the end.

Are there any technical limitations or design restrictions that would prevent this?
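
As a rough illustration of the idea above (seek sets the starting byte, each read supplies the end), here is a minimal sketch that only computes the Range header value and makes no network calls. The RangeCursor class and its range_for_read method are hypothetical names invented for this example; they are not part of boto or smart_open. It also assumes the server returns the full requested range, which will not hold near the end of an object.

```python
import io


class RangeCursor:
    """Hypothetical helper: tracks a byte position and turns each read
    into an HTTP Range header value. Not part of boto or smart_open."""

    def __init__(self):
        self.pos = 0

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            new_pos = offset
        elif whence == io.SEEK_CUR:
            new_pos = self.pos + offset
        else:
            # SEEK_END needs the object size, which this sketch does not know.
            raise ValueError("only SEEK_SET and SEEK_CUR are handled here")
        if new_pos < 0:
            raise ValueError("negative seek position")
        self.pos = new_pos
        return self.pos

    def range_for_read(self, size):
        """Return the Range value for reading `size` bytes from the current
        position, then advance the position past them."""
        if size <= 0:
            raise ValueError("size must be positive")
        # HTTP byte ranges are inclusive on both ends.
        header = "bytes=%d-%d" % (self.pos, self.pos + size - 1)
        # Assumption: the full range is returned by the server.
        self.pos += size
        return header
```

With this sketch, seeking to byte 12 and asking for 13 bytes yields the same header as the boto snippet above, bytes=12-24, and leaves the position at byte 25 for the next read.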

Issue Analytics

State:
Created 8 years ago
Comments:6 (4 by maintainers)

Top GitHub Comments

mpenkov commented, Dec 10, 2017

@menshikh-iv I think this is done. We can seek S3 files now.

perrygeo commented, Nov 27, 2015

get_contents_as_string should respect the HTTP Range header but it doesn’t always behave that way. In particular, I found that the first call read the entire contents while subsequent calls (with the exact same args and kwargs) pulled in only the requested bytes. I believe this to be a bug in boto. Unfortunately I couldn’t find another way to implement this in boto2.

However, after switching to boto3 I was able to put together a working S3 reader using the object abstractions. I wrapped it in a file handle interface that does arbitrary seeks and reads: https://gist.github.com/perrygeo/9239b9ab64731cacbb35#file-s3reader-py. It’s very effective and allowed me to read TIF tags off 2000 files of 1.1 GB each stored on S3 in just a few minutes.
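
For readers who want a feel for what such a wrapper might look like, here is a minimal sketch of a seekable reader built on ranged GET requests. It is not the code from the gist above and not smart_open's implementation; the S3RangeFile class and the open_s3 helper are hypothetical names. It assumes the wrapped object behaves like a boto3 s3.Object, that is, it exposes content_length and a get(Range=...) call whose response carries a readable "Body". Every read issues one request, with no buffering or retries.

```python
import io


class S3RangeFile:
    """Hypothetical read-only, seekable wrapper over an object that looks
    like a boto3 s3.Object (has content_length and get(Range=...))."""

    def __init__(self, s3_object):
        self._obj = s3_object
        self._size = s3_object.content_length
        self._pos = 0

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            base = 0
        elif whence == io.SEEK_CUR:
            base = self._pos
        elif whence == io.SEEK_END:
            base = self._size
        else:
            raise ValueError("invalid whence: %r" % (whence,))
        if base + offset < 0:
            raise ValueError("negative seek position")
        self._pos = base + offset
        return self._pos

    def read(self, size=-1):
        # Nothing to fetch at or past the end, or for a zero-byte read.
        if self._pos >= self._size or size == 0:
            return b""
        if size is None or size < 0:
            end = self._size - 1  # read to the end of the object
        else:
            end = min(self._pos + size, self._size) - 1
        # HTTP byte ranges are inclusive on both ends.
        response = self._obj.get(Range="bytes=%d-%d" % (self._pos, end))
        data = response["Body"].read()
        self._pos += len(data)
        return data


def open_s3(bucket, key):
    """Hypothetical convenience helper; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the class itself has no hard dependency
    return S3RangeFile(boto3.resource("s3").Object(bucket, key))
```

Something like open_s3('bucket', 'key') followed by seek(12) and read(13) would then fetch only bytes 12 through 24, which is the access pattern that makes reading small headers out of very large files cheap.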

I haven’t yet considered how something like this could integrate with smart_open but I figured it might be useful.