Unable to seek HTTPS from some servers

Problem description

We are working with Airbyte, which uses this great library.

We are trying to download an Excel file shared through ownCloud or transfer.sh over HTTPS. I think it does not work when the response header includes “Content-Disposition: attachment”. This header is very common, and supporting it would be very useful, since it would give access to publicly shared links from platforms such as Google Drive.

Steps/code to reproduce the problem

When we use the File connector over HTTPS, it works well if the file is uploaded to a web server that serves it directly. When the same file is uploaded to an ownCloud environment or to transfer.sh and we try to download it, Airbyte reports that it is not a valid Excel file.

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/source_file/source.py", line 122, in discover
    streams = list(client.streams)
  File "/usr/local/lib/python3.7/site-packages/source_file/client.py", line 384, in streams
    "properties": self._stream_properties(),
  File "/usr/local/lib/python3.7/site-packages/source_file/client.py", line 372, in _stream_properties
    for df in df_list:
  File "/usr/local/lib/python3.7/site-packages/source_file/client.py", line 327, in load_dataframes
    yield reader(fp, **reader_options)
  File "/usr/local/lib/python3.7/site-packages/pandas/util/_decorators.py", line 299, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 336, in read_excel
    io = ExcelFile(io, storage_options=storage_options, engine=engine)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 1056, in __init__
    content=path_or_buffer, storage_options=storage_options
  File "/usr/local/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 942, in inspect_excel_format
    stream.seek(0)
  File "/usr/local/lib/python3.7/site-packages/smart_open/http.py", line 263, in seek
    raise OSError
OSError

The URLs look like this (neither of them works, but it is easy to upload something to transfer.sh):

https://transfer.sh/get/1HpZ3cN/test.xlsx (/get/ is the key to direct download)
https://www.xxxxxxcloud.com/drive/index.php/s/xxxxxxx/download?path=%2F&files=test.xlsx

The only difference we notice between a direct HTTP server and these other options is that the response header includes “Content-Disposition: attachment …”.

Versions

Python 3.7.10
smart-open[all]==4.1.2

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 8

Top GitHub Comments

mpenkov commented, Aug 27, 2021

I think we can solve this problem easily on the smart_open side. That range header is optional, so we should be less strict about it. I’ll push a fix and make a new release in the next couple of days.
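
Until a release with that fix is available, one possible workaround on the caller's side is to copy the HTTP response into memory, so that pandas gets a buffer it can seek freely. The sketch below is only an illustration: `to_seekable` and `ForwardOnlyStream` are hypothetical names that do not come from smart_open or Airbyte, and the approach assumes the file is small enough to fit in memory.

```python
import io


def to_seekable(stream, chunk_size=64 * 1024):
    """Copy a readable binary stream into an in-memory, seekable buffer.

    Hypothetical helper: it only calls stream.read(), never stream.seek().
    """
    buffer = io.BytesIO()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer.write(chunk)
    buffer.seek(0)  # rewind so the consumer starts at the first byte
    return buffer


class ForwardOnlyStream:
    """Hypothetical stand-in for an HTTP stream that refuses to seek."""

    def __init__(self, payload):
        self._inner = io.BytesIO(payload)

    def read(self, size=-1):
        return self._inner.read(size)

    def seekable(self):
        return False

    def seek(self, offset, whence=io.SEEK_SET):
        # Same failure mode as the traceback in the issue.
        raise OSError


if __name__ == "__main__":
    source = ForwardOnlyStream(b"example bytes standing in for an .xlsx file")
    buffered = to_seekable(source)
    buffered.seek(8)
    print(buffered.read(5))
```

With a real URL, the same helper could wrap the stream returned by `smart_open.open(url, "rb")` before it is handed to `pandas.read_excel`. The trade-off is memory use, since the whole file is held in the buffer.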

mpenkov commented, Aug 26, 2021

If the files are the same, then there’s no problem with the actual content.

If your application seeks around the stream, then it looks like there’s a problem with transfer.sh. The OSError suggests the stream is not seekable for some reason. Looking further, here are the headers returned by transfer.sh:

{'Connection': 'keep-alive',
 'Content-Disposition': 'attachment; filename="Libro2prueba.xlsx"',
 'Content-Length': '11165',
 'Content-Type': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
 'Retry-After': 'Thu, 26 Aug 2021 23:02:27 GMT',
 'Server': 'Transfer.sh HTTP Server 1.0',
 'X-Made-With': '<3 by DutchCoders',
 'X-Ratelimit-Key': '222.6.124.90',
 'X-Ratelimit-Limit': '10',
 'X-Ratelimit-Rate': '600',
 'X-Ratelimit-Remaining': '9',
 'X-Ratelimit-Reset': '1630011747',
 'X-Remaining-Days': 'n/a',
 'X-Remaining-Downloads': 'n/a',
 'X-Served-By': 'Proudly served by DutchCoders',
 'Date': 'Thu, 26 Aug 2021 21:02:22 GMT'}

smart_open expects the Accept-Ranges header to be there, but it’s not. So, it judges the stream to be non-seekable. I’m not sure if this will solve your problem, but try commenting out these two lines:

https://github.com/RaRe-Technologies/smart_open/blob/5321ef2439eb8b32b19c310cd4039f377748b9e8/smart_open/http.py#L249-L250

Let me know the result.
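
To make the diagnosis concrete, here is a minimal sketch of the kind of strict check described in this comment. It is not smart_open's actual code: `looks_seekable` is a hypothetical function, and the rule it applies (a valid Content-Length plus `Accept-Ranges: bytes`) is an assumption based on the explanation above. The header values are a subset of those returned by transfer.sh.

```python
def looks_seekable(headers):
    """Return True only if the headers advertise byte-range support.

    Hypothetical strict rule: Content-Length must be a non-negative integer
    and Accept-Ranges must be "bytes". A missing Accept-Ranges header is
    treated as "none".
    """
    normalized = {key.lower(): value for key, value in headers.items()}
    try:
        length = int(normalized.get("content-length", -1))
    except ValueError:
        length = -1
    accept_ranges = normalized.get("accept-ranges", "none").strip().lower()
    return length >= 0 and accept_ranges == "bytes"


# Subset of the headers quoted in this comment.
transfer_sh_headers = {
    "Connection": "keep-alive",
    "Content-Disposition": 'attachment; filename="Libro2prueba.xlsx"',
    "Content-Length": "11165",
    "Server": "Transfer.sh HTTP Server 1.0",
}

if __name__ == "__main__":
    # No Accept-Ranges header, so the strict rule rejects the stream.
    print(looks_seekable(transfer_sh_headers))
    # Adding Accept-Ranges flips the result; Content-Disposition plays no part.
    print(looks_seekable({**transfer_sh_headers, "Accept-Ranges": "bytes"}))
```

Under this rule the Content-Disposition header is irrelevant. What matters is the absence of Accept-Ranges, which matches the suggestion to relax that check.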
