s3 seek
If I try to use smart_open to seek/read parts of an S3 file, I get NotImplementedError: seek other than offset=0 not implemented yet.
Arbitrary seeking, especially when the seek is specified relative to the beginning of the file (seek(..., whence=0)), should be possible through the Range HTTP header:
>>> import boto
>>> s3 = boto.connect_s3()
>>> bucket = s3.lookup('bucket')
>>> key = bucket.lookup('key')
>>> parts = key.get_contents_as_string(headers={'Range' : 'bytes=12-24'})
seek() could establish a pointer to the starting byte, and subsequent read() calls would define the end.
Are there any technical limitations or design restrictions that would prevent this?
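To make the idea concrete, here is a minimal sketch of how a seek pointer plus a read size could be turned into a Range header. RangeCursor and range_for_read are hypothetical names used only for illustration; they are not part of smart_open or boto. Seeking relative to the end of the file is left out because it would need the object's size.

```python
import io


class RangeCursor:
    """Hypothetical helper: tracks a byte offset and builds Range headers."""

    def __init__(self):
        self.pos = 0

    def seek(self, offset, whence=io.SEEK_SET):
        # Only absolute and relative-to-current seeks are sketched here.
        if whence == io.SEEK_SET:
            new_pos = offset
        elif whence == io.SEEK_CUR:
            new_pos = self.pos + offset
        else:
            raise NotImplementedError("this whence needs the object size")
        if new_pos < 0:
            raise ValueError("negative seek position")
        self.pos = new_pos
        return self.pos

    def range_for_read(self, size):
        """Return the headers for reading `size` bytes, then advance the pointer."""
        if size <= 0:
            raise ValueError("size must be positive")
        start = self.pos
        end = start + size - 1  # HTTP byte ranges are inclusive on both ends
        self.pos = end + 1
        return {"Range": "bytes=%d-%d" % (start, end)}


cursor = RangeCursor()
cursor.seek(12)
headers = cursor.range_for_read(13)  # same range as the boto example above
```

The resulting dict could then be passed as the headers argument in the boto call shown earlier, assuming the server honours the range.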
Issue Analytics
- State:
- Created 8 years ago
- Comments:6 (4 by maintainers)
@menshikh-iv I think this is done. We can seek S3 files now.
get_contents_as_string should respect the HTTP Range header, but it doesn’t always behave that way. In particular, I found that the first call read the entire contents, while subsequent calls (with the exact same args and kwargs) pulled in only the requested bytes. I believe this to be a bug in boto. Unfortunately, I couldn’t find another way to implement this in boto2.
However, after switching to boto3, I was able to put together a working S3 reader using the object abstractions. I wrapped it in a file handle interface that does arbitrary seeks and reads: https://gist.github.com/perrygeo/9239b9ab64731cacbb35#file-s3reader-py . It’s very effective and allowed me to read TIF tags off 2000 x 1.1 GB files stored on S3 in just a few minutes.
I haven’t yet considered how something like this could integrate with smart_open but I figured it might be useful.
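For readers who want the general shape of such a wrapper, here is a rough sketch. It is not the code from the gist. S3RangeReader is a hypothetical name, and the class assumes it is handed a boto3 s3.Object (or anything exposing the same content_length attribute and get(Range=...) method). Each read issues one ranged GET, with no buffering or retries.

```python
import io


class S3RangeReader:
    """Hypothetical read-only, seekable view over a boto3-style S3 object."""

    def __init__(self, obj):
        # obj is assumed to behave like boto3.resource("s3").Object(bucket, key)
        self._obj = obj
        self._size = obj.content_length  # total object size in bytes
        self._pos = 0

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            pos = offset
        elif whence == io.SEEK_CUR:
            pos = self._pos + offset
        elif whence == io.SEEK_END:
            pos = self._size + offset
        else:
            raise ValueError("invalid whence: %r" % (whence,))
        if pos < 0:
            raise ValueError("negative seek position")
        self._pos = pos
        return pos

    def read(self, size=-1):
        # Nothing to fetch at or past the end, or for a zero-length read.
        if size == 0 or self._pos >= self._size:
            return b""
        if size is None or size < 0:
            end = self._size - 1  # read to the end of the object
        else:
            end = min(self._pos + size, self._size) - 1
        # HTTP byte ranges are inclusive on both ends.
        response = self._obj.get(Range="bytes=%d-%d" % (self._pos, end))
        data = response["Body"].read()
        self._pos += len(data)
        return data
```

With real credentials this would be used along the lines of S3RangeReader(boto3.resource("s3").Object("bucket", "key")), followed by seek() and read() calls as on a local file.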