TarArchiveReader is not functioning with HTTPReader or GDriveReader

This issue was discovered as part of #40. The TarArchiveReader implementation is likely wrong:

An error is raised when TarArchiveReader is used immediately after HttpReader, because the HTTP stream does not support the seek operation:

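from torchdata.datapipes.iter import HttpReader, IterableWrapper

file_url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
http_reader_dp = HttpReader(IterableWrapper([file_url]))
# Chain the tar reader directly onto the HTTP stream (this is the step that fails)
tar_dp = http_reader_dp.read_from_tar()
for fname, stream in tar_dp:
    print(f"{fname}: {stream.read()}")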

It returns an error that looks something like this:

Traceback (most recent call last):
  File "/Users/ktse/data/test/test_stream.py", line 66, in <module>
    for fname, stream in tar_dp:
  File "/Users/.../data/torchdata/datapipes/iter/util/tararchivereader.py", line 62, in __iter__
    raise e
  File "/Users/.../data/torchdata/datapipes/iter/util/tararchivereader.py", line 48, in __iter__
    tar = tarfile.open(fileobj=cast(Optional[IO[bytes]], data_stream), mode=self.mode)
  File "/Users/.../miniconda3/envs/pytorch/lib/python3.9/tarfile.py", line 1609, in open
    saved_pos = fileobj.tell()
io.UnsupportedOperation: seek

Currently, you can work around this by downloading the file in advance (or by caching it with OnDiskCacheHolderIterDataPipe). In those cases the data is read from a seekable local file, and TarArchiveReader works as intended.
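
The failure and two possible ways around it can be reproduced with the standard library alone. The sketch below is only an illustration, not the torchdata implementation: NonSeekableStream, make_tar_gz, read_buffered and read_streaming are hypothetical names, and the archive is built in memory so that no network access is needed. It assumes the HTTP body behaves like a readable stream without seek support.

```python
import io
import tarfile


class NonSeekableStream(io.RawIOBase):
    """Hypothetical stand-in for an HTTP body: readable, but not seekable."""

    def __init__(self, payload: bytes) -> None:
        self._inner = io.BytesIO(payload)

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return False

    def readinto(self, buffer) -> int:
        chunk = self._inner.read(len(buffer))
        buffer[: len(chunk)] = chunk
        return len(chunk)


def make_tar_gz(files: dict) -> bytes:
    """Build a small in-memory .tar.gz so the example needs no network."""
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w:gz") as tar:
        for name, content in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return out.getvalue()


def read_buffered(stream) -> dict:
    """Copy the whole stream into memory, then open it in random-access mode."""
    seekable = io.BytesIO(stream.read())
    with tarfile.open(fileobj=seekable, mode="r:*") as tar:
        return {m.name: tar.extractfile(m).read() for m in tar if m.isfile()}


def read_streaming(stream) -> dict:
    """Open in tarfile's stream mode ("r|*"), which only ever reads forward."""
    result = {}
    with tarfile.open(fileobj=stream, mode="r|*") as tar:
        for member in tar:
            if member.isfile():
                result[member.name] = tar.extractfile(member).read()
    return result


archive = make_tar_gz({"a.txt": b"hello", "b.txt": b"world"})

# Random-access mode needs tell()/seek(), so it fails on the raw stream.
try:
    tarfile.open(fileobj=NonSeekableStream(archive), mode="r:*")
    failed = False
except io.UnsupportedOperation:
    failed = True

buffered = read_buffered(NonSeekableStream(archive))
streamed = read_streaming(NonSeekableStream(archive))
```

Buffering is the in-memory equivalent of downloading the file first and costs memory proportional to the archive size. Stream mode avoids that, but members can only be read in order, once.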

TarArchiveReader also doesn’t work with GDriveReader, because of the type of the object it returns:

from torchdata.datapipes.iter import IterableWrapper, OnlineReader

amazon_review_url = "https://drive.google.com/uc?export=download&id=0Bz8a_Dbh9QhbaW12WVVZS2drcnM"
gdrive_reader_dp = OnlineReader(IterableWrapper([amazon_review_url]))
tar_dp = gdrive_reader_dp.read_from_tar()

This is because validate_pathname_binary_tuple requires a BufferedIOBase instance. Perhaps it should accept an HTTP response as well?

https://github.com/pytorch/data/blob/85d8bbe235cd58f270c17367a5577de107b0095f/torchdata/datapipes/utils/common.py#L66-L76

test/test_stream.py:None (test/test_stream.py)
test_stream.py:79: in <module>
    for fname, stream in tar_dp:
../torchdata/datapipes/iter/util/tararchivereader.py:43: in __iter__
    validate_pathname_binary_tuple(data)
../torchdata/datapipes/utils/common.py:74: in validate_pathname_binary_tuple
    raise TypeError(
E   TypeError: pathname binary tuple should have BufferedIOBase based binary type, but got <class 'urllib3.response.HTTPResponse'>
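
The type check behind this error can be illustrated without torchdata. The sketch below is a hypothetical restatement of the check, not the library's code: RawBody and is_buffered_binary are made-up names, and RawBody only stands in for a raw response object that is not derived from BufferedIOBase. It shows that wrapping such an object in io.BufferedReader satisfies an isinstance check against BufferedIOBase, but does not make the stream seekable.

```python
import io


class RawBody(io.RawIOBase):
    """Hypothetical raw, non-seekable response body (not a BufferedIOBase)."""

    def __init__(self, payload: bytes) -> None:
        self._inner = io.BytesIO(payload)

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return False

    def readinto(self, buffer) -> int:
        chunk = self._inner.read(len(buffer))
        buffer[: len(chunk)] = chunk
        return len(chunk)


def is_buffered_binary(obj) -> bool:
    """Hypothetical version of the check: accept only BufferedIOBase objects."""
    return isinstance(obj, io.BufferedIOBase)


raw = RawBody(b"payload")
wrapped = io.BufferedReader(RawBody(b"payload"))

raw_ok = is_buffered_binary(raw)          # the raw object is rejected
wrapped_ok = is_buffered_binary(wrapped)  # the wrapper is accepted
wrapped_seekable = wrapped.seekable()     # still reports the raw stream's answer
first_bytes = wrapped.read(3)
```

So relaxing or satisfying the type check would likely only move the failure to the seek error shown earlier, unless the stream is also buffered or read in stream mode.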

cc @VitalyFedyunin @ejguan

Top GitHub Comments

ejguan commented, Oct 5, 2021

So, if it’s working for GDriveReader, then the problem probably comes from _get_response_from_http.

The Python docs note: "See also: The Requests package is recommended for a higher-level HTTP client interface."

Based on that, do you want to try requests rather than urllib to get the response?

NivekT commented, Oct 8, 2021

I have concluded the issue is that the stream parameter within requests.get must be set to True in order for TarArchiveReader to be able to seek. See #51 for the changes.

A session allows information such as cookies to persist between requests. I don’t think it is relevant to our error, but it may be good to use one anyway.