TarArchiveReader is not functioning with HTTPReader or GDriveReader
This issue was discovered as part of #40. The TarArchiveReader implementation is likely wrong:
- An error is raised when we attempt to use TarArchiveReader immediately after HTTPReader because the HTTP stream does not support the operation seek:
from torchdata.datapipes.iter import HttpReader, IterableWrapper

file_url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
http_reader_dp = HttpReader(IterableWrapper([file_url]))
tar_dp = http_reader_dp.read_from_tar()
for fname, stream in tar_dp:
    print(f"{fname}: {stream.read()}")
It returns an error that looks something like this:
Traceback (most recent call last):
  File "/Users/ktse/data/test/test_stream.py", line 66, in <module>
    for fname, stream in tar_dp:
  File "/Users/.../data/torchdata/datapipes/iter/util/tararchivereader.py", line 62, in __iter__
    raise e
  File "/Users/.../data/torchdata/datapipes/iter/util/tararchivereader.py", line 48, in __iter__
    tar = tarfile.open(fileobj=cast(Optional[IO[bytes]], data_stream), mode=self.mode)
  File "/Users/.../miniconda3/envs/pytorch/lib/python3.9/tarfile.py", line 1609, in open
    saved_pos = fileobj.tell()
io.UnsupportedOperation: seek
Currently, you can work around this by downloading the file in advance (or caching it with OnDiskCacheHolderIterDataPipe). In those cases, TarArchiveReader works as intended.
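The failure and the workaround above can be reproduced with the standard library alone. The sketch below is not torchdata code: NonSeekable is a hypothetical stand-in for an HTTP response body, and make_tar_gz, read_via_temp_file and read_in_stream_mode are illustrative helper names. It shows tarfile.open failing on a non-seekable object in its default mode, succeeding once the bytes are copied to a temporary file (the "download in advance" route), and, as a possible alternative, succeeding in tarfile's stream mode ("r|*"), which reads sequentially and does not need seek.

```python
import io
import shutil
import tarfile
import tempfile


class NonSeekable(io.BytesIO):
    """Hypothetical stand-in for an HTTP body: readable, but seek/tell fail."""

    def seekable(self) -> bool:
        return False

    def seek(self, *args):
        raise io.UnsupportedOperation("seek")

    def tell(self):
        raise io.UnsupportedOperation("seek")


def make_tar_gz(files: dict) -> bytes:
    """Build a small .tar.gz archive in memory from {name: bytes}."""
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w:gz") as tar:
        for name, content in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return out.getvalue()


def read_via_temp_file(stream) -> list:
    """Copy the stream to a temporary file first, then open it as a tar archive."""
    with tempfile.TemporaryFile() as tmp:
        shutil.copyfileobj(stream, tmp)
        tmp.seek(0)
        with tarfile.open(fileobj=tmp, mode="r:*") as tar:
            return [(m.name, tar.extractfile(m).read()) for m in tar if m.isfile()]


def read_in_stream_mode(stream) -> list:
    """Open the archive in stream mode; members must be read in order."""
    with tarfile.open(fileobj=stream, mode="r|*") as tar:
        return [(m.name, tar.extractfile(m).read()) for m in tar if m.isfile()]


payload = make_tar_gz({"a.txt": b"alpha", "b.txt": b"beta"})

# The default mode probes the stream position, which a non-seekable object rejects.
try:
    tarfile.open(fileobj=NonSeekable(payload))
    direct_open_failed = False
except io.UnsupportedOperation:
    direct_open_failed = True

from_temp_file = read_via_temp_file(NonSeekable(payload))
from_stream_mode = read_in_stream_mode(NonSeekable(payload))
```

Stream mode avoids holding the whole archive on disk, but each member has to be consumed before moving on to the next one, so it only suits strictly sequential reads.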
- TarArchiveReader also doesn’t work with GDriveReader because of the return type:
from torchdata.datapipes.iter import IterableWrapper, OnlineReader

amazon_review_url = "https://drive.google.com/uc?export=download&id=0Bz8a_Dbh9QhbaW12WVVZS2drcnM"
gdrive_reader_dp = OnlineReader(IterableWrapper([amazon_review_url]))
tar_dp = gdrive_reader_dp.read_from_tar()
This is because validate_pathname_binary_tuple requires BufferedIOBase. Perhaps it should accept an HTTP response as well?
test/test_stream.py:None (test/test_stream.py)
test_stream.py:79: in <module>
for fname, stream in tar_dp:
../torchdata/datapipes/iter/util/tararchivereader.py:43: in __iter__
validate_pathname_binary_tuple(data)
../torchdata/datapipes/utils/common.py:74: in validate_pathname_binary_tuple
raise TypeError(
E TypeError: pathname binary tuple should have BufferedIOBase based binary type, but got <class 'urllib3.response.HTTPResponse'>
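To illustrate the question raised above, here is a minimal sketch of the two kinds of check. It is not the torchdata implementation: validate_strict, validate_relaxed and FakeHTTPBody are hypothetical names, and FakeHTTPBody only imitates an object that is an io.IOBase without being a BufferedIOBase, which is the situation the error message describes.

```python
import io
from typing import Any, Tuple


class FakeHTTPBody(io.IOBase):
    """Hypothetical stand-in for an HTTP response: IOBase, not BufferedIOBase."""

    def __init__(self, payload: bytes) -> None:
        super().__init__()
        self._inner = io.BytesIO(payload)

    def readable(self) -> bool:
        return True

    def read(self, size: int = -1) -> bytes:
        return self._inner.read(size)


def _check_shape(data: Any) -> None:
    """Checks shared by both variants: a (str, stream) tuple of length 2."""
    if not isinstance(data, tuple) or len(data) != 2:
        raise TypeError(f"expected a (pathname, stream) tuple, but got {data!r}")
    if not isinstance(data[0], str):
        raise TypeError(f"pathname should be str, but got {type(data[0])}")


def validate_strict(data: Tuple[str, Any]) -> None:
    """Strict variant, as described in the issue: only BufferedIOBase passes."""
    _check_shape(data)
    if not isinstance(data[1], io.BufferedIOBase):
        raise TypeError(
            "pathname binary tuple should have BufferedIOBase based binary type, "
            f"but got {type(data[1])}"
        )


def validate_relaxed(data: Tuple[str, Any]) -> None:
    """Assumed relaxed variant: any io.IOBase exposing a callable read() passes."""
    _check_shape(data)
    stream = data[1]
    if not (isinstance(stream, io.IOBase) and callable(getattr(stream, "read", None))):
        raise TypeError(f"stream should be a readable io.IOBase, but got {type(stream)}")


def passes(validator, data) -> bool:
    """Return True if the validator accepts the tuple, False on TypeError."""
    try:
        validator(data)
        return True
    except TypeError:
        return False


http_like = ("archive.tar.gz", FakeHTTPBody(b"bytes"))
buffered = ("archive.tar.gz", io.BytesIO(b"bytes"))
```

Relaxing the type check would only move the failure further down for non-seekable streams, since the tar reader still has to cope with a stream that cannot seek.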
So, if it’s working for GDriveReader, then the problem probably comes from def _get_response_from_http. Based on the Python docs, do you want to try requests rather than urllib to get the response?

I have concluded the issue is that the stream parameter within requests.get must be set to True in order for TarArchiveReader to be able to seek. See #51 for the changes.

session allows information such as cookies to persist between requests. I don’t think it is relevant to our error. Nonetheless, it may be good to use it anyway.
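As a rough illustration of the two points above, the sketch below sends a request with stream=True through a session and copies the body into an in-memory buffer that tarfile can seek on. It is not the change made in #51, whose contents are not shown here; fetch_to_buffer and list_tar_members are hypothetical helper names, and the session argument is assumed to behave like requests.Session (get, raise_for_status, iter_content and close are the only calls used).

```python
import io
import tarfile


def fetch_to_buffer(session, url: str, chunk_size: int = 64 * 1024,
                    timeout: float = 30.0) -> io.BytesIO:
    """Download url via session with stream=True and return a seekable buffer.

    `session` is assumed to be a requests.Session or an object with the same
    get() interface; cookies set by earlier responses persist on the session.
    """
    response = session.get(url, stream=True, timeout=timeout)
    try:
        response.raise_for_status()
        buffer = io.BytesIO()
        # With stream=True the body is fetched lazily, chunk by chunk.
        for chunk in response.iter_content(chunk_size=chunk_size):
            buffer.write(chunk)
    finally:
        response.close()
    buffer.seek(0)
    return buffer


def list_tar_members(buffer: io.BytesIO) -> list:
    """Open the buffered archive and return its member names."""
    with tarfile.open(fileobj=buffer, mode="r:*") as tar:
        return tar.getnames()


# Usage (needs the third-party requests package and network access):
#     import requests
#     with requests.Session() as session:
#         buf = fetch_to_buffer(session, file_url)
#         print(list_tar_members(buf))
```

Buffering the whole archive in memory is only reasonable for small files; for larger downloads the same loop could write to a temporary file instead.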