HTTPResponse.read() closes the response causing problem with io.BufferedReader
See original GitHub issueIn the following text I mostly refer to python2 for clarity, but the problem also happens on py3.
From some library I get an active streaming urllib3.HTTPResponse
with preload_content=False
. That response refers to a very large CSV file, so I want to read it line-by-line passing it through csv.reader
.
csv.reader
requires unicode
objects. HTTPResponse
returns bytes
objects, so I use io.TextIOWrapper
. (Actually I use backports.csv
to achieve compatibility with io
).
Now, here is the problem:
test.csv
- notice the last line is missing a trailing line separator:
abc<LF>
def
It is served with python -m SimpleHTTPServer
.
And the test code, even without CSV:
import io
import urllib3
http = urllib3.PoolManager()
resp = http.request('GET', 'http://localhost:8000/test.csv', preload_content=False)
for line in iter(
io.TextIOWrapper(
# py2's implementation of TextIOWrapper requires `read1` method which is provided by `BufferedReader` wrapper
io.BufferedReader(
resp
)
)
):
print(repr(line))
Here is the error:
u'abc\n'
Traceback (most recent call last):
File "test.py", line 11, in <module>
resp
ValueError: I/O operation on closed file.
If I remove TextIOWrapper
then the error is slightly different but with the same meaning:
ValueError: readline of closed file
From what I could find out, C
implementation of readline
checks if the file is closed. And if (and only if) the last line is not terminated then it tries to read “the rest” which results in this error, because HTTPResponse.read()
closes the “file” after it is finished.
This behaviour is contrary to ordinary file
behaviour (and io.open()
or py3
file
behaviour) which will not close the file on its will but instead will keep yielding empty string when trying to read past the end of file.
Which would be the best way to make streaming HTTPResponse
work properly with io.BufferedReader
?
Issue Analytics
- State:
- Created 6 years ago
- Comments:15 (12 by maintainers)
Okay, so what we have here is an API mismatch. An HTTP response body can generally be treated as a file which…
io
module via the.closed
property)Most of these attributes are perfectly normal and expected for a file-like object. However, self-closing is not.
Generally, a file is open when it can be interacted with. When it’s closed, a file pointer raises exceptions when it’s interacted with. Thus, we have a bit of a consistency problem. We can (and should) be able to interact with an HTTPResponse after its body has been completely read, but we’re currently acting as though the file is closed immediately upon consumption.
Essentially, we’re using
.closed
to telegraph that a file has been consumed, but the standard behavior for a (read-only, non-seekable) file which has been completely consumed (but which remains open) is to return an empty set of bytes. A file being closed has a semantically different meaning - it means that the file can no longer be interacted with in any way.I propose that we change the behavior of
.close()
and.closed
onHTTPResponse
to more closely match the semantic meaning intended byIOBase
. Calling.close()
will result in a call to the underlyinghttplib
response object’s.close()
method, followed by a call tosuper().close()
, while.closed
will simply inherit fromIOBase
; if the response has not been specifically closed, it will appear open - which is the more conservative move, as it will never appear closed in a case where the httplib fp remains open (and therefore needs to be explicitly closed).This table summarizes behavior before and after this change:
close()
closed
@nateprewitt, you’ve investigated this in the past in #977. What are your thoughts? Also CC @sigmavirus24 and @Lukasa.
Lots of good information and investigation here, nice work @haikuginger. You know when tables are used in a GitHub issue comment that shit’s getting serious. 😃
I’m in favor of the change as long as it goes in as a breaking change.