Unable to seek HTTPS from some servers
Problem description
We are working with airbyte, which uses this great lib.
We are trying to download an Excel file shared via ownCloud or transfer.sh (HTTPS). I think it does not work when the response header is “Content-Disposition: attachment”. This header is very common, and supporting it would be very nice: it would give access to public shared links from different platforms such as Google Drive.
Steps/code to reproduce the problem
When we use the File connector over HTTPS, it works well if the file is uploaded to a web server that serves it directly. But when the same file is uploaded to an ownCloud environment or to transfer.sh and we try to download it, airbyte says it is not a valid Excel file.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/source_file/source.py", line 122, in discover
    streams = list(client.streams)
  File "/usr/local/lib/python3.7/site-packages/source_file/client.py", line 384, in streams
    "properties": self._stream_properties(),
  File "/usr/local/lib/python3.7/site-packages/source_file/client.py", line 372, in _stream_properties
    for df in df_list:
  File "/usr/local/lib/python3.7/site-packages/source_file/client.py", line 327, in load_dataframes
    yield reader(fp, **reader_options)
  File "/usr/local/lib/python3.7/site-packages/pandas/util/_decorators.py", line 299, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 336, in read_excel
    io = ExcelFile(io, storage_options=storage_options, engine=engine)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 1056, in __init__
    content=path_or_buffer, storage_options=storage_options
  File "/usr/local/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 942, in inspect_excel_format
    stream.seek(0)
  File "/usr/local/lib/python3.7/site-packages/smart_open/http.py", line 263, in seek
    raise OSError
OSError
The URLs look like this (neither of them works, but it is easy to upload something to transfer.sh):
https://transfer.sh/get/1HpZ3cN/test.xlsx (/get/ is the key to direct download)
https://www.xxxxxxcloud.com/drive/index.php/s/xxxxxxx/download?path=%2F&files=test.xlsx
The only difference we notice between a direct HTTP server and these other options is that the response header contains “Content-Disposition: attachment …”.
Versions
Python 3.7.10, smart-open[all]==4.1.2
I think we can solve this problem easily on the smart_open side. That range header is optional, so we should be less strict about it. I’ll push a fix and make a new release in the next couple of days.
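To illustrate what "less strict" could mean here, the sketch below treats a response as a candidate for range requests unless the server explicitly answers `Accept-Ranges: none`. This is only one possible reading, not the fix that was actually released; the function name `may_try_range_requests` is hypothetical and not part of smart_open.

```python
from typing import Mapping


def may_try_range_requests(headers: Mapping[str, str]) -> bool:
    """Lenient policy (an assumption, not smart_open's code).

    Accept-Ranges is an optional header, so its absence is not taken as
    proof that range requests will fail. Only an explicit "none" rules
    them out.
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    value = normalised.get("accept-ranges", "").strip().lower()
    return value != "none"
```

With this policy, a response that carries only `Content-Disposition: attachment` would still be treated as potentially seekable, and the client would find out whether ranges work when it actually sends a `Range` request.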
If the files are the same, then there’s no problem with the actual content.
If your application seeks around the stream, then it looks like there’s a problem with transfer.sh. The OSError suggests the stream is not seekable for some reason. Looking further, here are the headers returned by transfer.sh:
smart_open expects the Accept-Ranges header to be there, but it’s not. So, it judges the stream to be non-seekable. I’m not sure if this will solve your problem, but try commenting out these two lines:
https://github.com/RaRe-Technologies/smart_open/blob/5321ef2439eb8b32b19c310cd4039f377748b9e8/smart_open/http.py#L249-L250
Let me know the result.
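For anyone hitting the same OSError, here is a minimal sketch of the kind of header check described above, plus a workaround that does not require patching the library. It is not smart_open's actual code: `advertises_byte_ranges` and `to_seekable` are hypothetical helper names, and the workaround assumes the file is small enough to hold in memory.

```python
import io
from typing import BinaryIO, Mapping


def advertises_byte_ranges(headers: Mapping[str, str]) -> bool:
    """Return True only if the headers explicitly say 'Accept-Ranges: bytes'."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    return normalised.get("accept-ranges", "none").strip().lower() == "bytes"


def to_seekable(stream: BinaryIO) -> io.BytesIO:
    """Read the whole stream into memory and return a seekable buffer.

    Reading sequentially needs no seek support from the server, and the
    resulting BytesIO can be rewound as often as the consumer likes.
    """
    return io.BytesIO(stream.read())
```

A possible use, assuming the download itself succeeds: open the URL with `smart_open.open(url, "rb")`, pass the file object through `to_seekable`, and hand the returned buffer to `pandas.read_excel`, which can then call `seek(0)` freely.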