TarArchiveReader is not functioning with HTTPReader or GDriveReader


This issue was discovered as part of #40. The TarArchiveReader implementation is likely wrong:

  1. An error is raised when we attempt to use TarArchiveReader immediately after HTTPReader because the HTTP stream does not support the operation seek:
from torchdata.datapipes.iter import HttpReader, IterableWrapper
file_url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
http_reader_dp = HttpReader(IterableWrapper([file_url]))
tar_dp = http_reader_dp.read_from_tar()
for fname, stream in tar_dp:
    print(f"{fname}: {stream.read()}")

It returns an error that looks something like this:

Traceback (most recent call last):
  File "/Users/ktse/data/test/test_stream.py", line 66, in <module>
    for fname, stream in tar_dp:
  File "/Users/.../data/torchdata/datapipes/iter/util/tararchivereader.py", line 62, in __iter__
    raise e
  File "/Users/.../data/torchdata/datapipes/iter/util/tararchivereader.py", line 48, in __iter__
    tar = tarfile.open(fileobj=cast(Optional[IO[bytes]], data_stream), mode=self.mode)
  File "/Users/.../miniconda3/envs/pytorch/lib/python3.9/tarfile.py", line 1609, in open
    saved_pos = fileobj.tell()
io.UnsupportedOperation: seek

Currently, you can work around this by downloading the file in advance (or by caching it with OnDiskCacheHolderIterDataPipe). In both cases, TarArchiveReader works as intended.
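
The failure and the workaround can be sketched with the standard library alone. This is a minimal illustration, not torchdata's implementation: NonSeekableStream, make_tar_gz, fails_without_seek, read_buffered and read_streaming are hypothetical names, and the in-memory archive stands in for the real download. read_buffered mirrors "download first, then open"; read_streaming uses tarfile's stream mode ("r|*"), an alternative that is not proposed in the issue itself.

```python
import io
import tarfile
from typing import Dict


class NonSeekableStream(io.RawIOBase):
    """Hypothetical stand-in for an HTTP body: readable, but not seekable."""

    def __init__(self, payload: bytes) -> None:
        super().__init__()
        self._inner = io.BytesIO(payload)

    def readable(self) -> bool:
        return True

    def readinto(self, buffer) -> int:
        return self._inner.readinto(buffer)


def make_tar_gz() -> bytes:
    """Build a tiny in-memory .tar.gz so the example needs no network."""
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w:gz") as tar:
        data = b"hello"
        info = tarfile.TarInfo(name="a.txt")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return out.getvalue()


def fails_without_seek(payload: bytes) -> bool:
    """Return True if mode "r:*" rejects a stream that cannot tell()/seek()."""
    try:
        tarfile.open(fileobj=NonSeekableStream(payload), mode="r:*")
    except io.UnsupportedOperation:
        return True
    return False


def read_buffered(stream) -> Dict[str, bytes]:
    """Copy the whole archive into memory first, so tarfile can seek freely."""
    seekable = io.BytesIO(stream.read())
    with tarfile.open(fileobj=seekable, mode="r:*") as tar:
        return {m.name: tar.extractfile(m).read() for m in tar if m.isfile()}


def read_streaming(stream) -> Dict[str, bytes]:
    """Use stream mode, which reads members strictly in order without seek()."""
    with tarfile.open(fileobj=stream, mode="r|*") as tar:
        return {m.name: tar.extractfile(m).read() for m in tar if m.isfile()}


if __name__ == "__main__":
    payload = make_tar_gz()
    print(fails_without_seek(payload))
    print(read_buffered(NonSeekableStream(payload)))
    print(read_streaming(NonSeekableStream(payload)))
```

Buffering costs memory proportional to the archive size, while stream mode only allows each member to be read once, in order, as it is encountered.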

  2. TarArchiveReader also doesn’t work with GDriveReader because of the return type:
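from torchdata.datapipes.iter import IterableWrapper, OnlineReader
amazon_review_url = "https://drive.google.com/uc?export=download&id=0Bz8a_Dbh9QhbaW12WVVZS2drcnM"
gdrive_reader_dp = OnlineReader(IterableWrapper([amazon_review_url]))
tar_dp = gdrive_reader_dp.read_from_tar()
for fname, stream in tar_dp:
    print(f"{fname}: {stream.read()}")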

This is because validate_pathname_binary_tuple requires a BufferedIOBase. Perhaps it should accept an HTTP response as well?

https://github.com/pytorch/data/blob/85d8bbe235cd58f270c17367a5577de107b0095f/torchdata/datapipes/utils/common.py#L66-L76

test/test_stream.py:None (test/test_stream.py)
test_stream.py:79: in <module>
    for fname, stream in tar_dp:
../torchdata/datapipes/iter/util/tararchivereader.py:43: in __iter__
    validate_pathname_binary_tuple(data)
../torchdata/datapipes/utils/common.py:74: in validate_pathname_binary_tuple
    raise TypeError(
E   TypeError: pathname binary tuple should have BufferedIOBase based binary type, but got <class 'urllib3.response.HTTPResponse'>
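
The kind of check that produces this TypeError can be approximated as follows. This is a hedged sketch rather than torchdata's code: check_pathname_binary_tuple and RawBody are hypothetical names, and RawBody is only a stand-in for an unbuffered response object. It shows that a raw stream fails an isinstance test against io.BufferedIOBase, and that wrapping it in io.BufferedReader is one way to satisfy such a test. Whether that wrapping suits the actual HTTP response class is an assumption not verified here.

```python
import io
from typing import Any


def check_pathname_binary_tuple(data: Any) -> None:
    """Hypothetical re-creation of the described check, not the library's code."""
    if not isinstance(data, tuple) or len(data) != 2:
        raise TypeError("expected a (pathname, binary stream) tuple")
    if not isinstance(data[0], str):
        raise TypeError("pathname should be a str")
    if not isinstance(data[1], io.BufferedIOBase):
        raise TypeError(
            f"binary stream should be BufferedIOBase based, but got {type(data[1])}"
        )


class RawBody(io.RawIOBase):
    """Hypothetical unbuffered, non-seekable stream standing in for a response."""

    def __init__(self, payload: bytes) -> None:
        super().__init__()
        self._inner = io.BytesIO(payload)

    def readable(self) -> bool:
        return True

    def readinto(self, buffer) -> int:
        return self._inner.readinto(buffer)


if __name__ == "__main__":
    try:
        check_pathname_binary_tuple(("archive.tar", RawBody(b"abc")))
    except TypeError as err:
        print(f"rejected: {err}")

    # BufferedReader is a BufferedIOBase, so the wrapped stream passes the check.
    wrapped = io.BufferedReader(RawBody(b"abc"))
    check_pathname_binary_tuple(("archive.tar", wrapped))
    print(wrapped.read())
```

Wrapping only changes the type that the check sees; the underlying stream is still not seekable, so the seek problem from the first item would need to be handled separately.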

cc @VitalyFedyunin @ejguan

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
ejguan commented, Oct 5, 2021

So, if it’s working for GDriveReader, then the problem probably comes from _get_response_from_http.

See also: "The Requests package is recommended for a higher-level HTTP client interface."

Based on the Python docs, do you want to try requests rather than urllib to get the response?

0 reactions
NivekT commented, Oct 8, 2021

I have concluded the issue is that the stream parameter within requests.get must be set to True in order for TarArchiveReader to be able to seek. See #51 for the changes.

session allows information such as cookies to persist between requests. I don’t think it is relevant to our error. Nonetheless, it may be good to use it anyway.
