readlines on files from hdfs using webhdfs hangs
Problem description
Calling readlines() on a file opened from HDFS via webhdfs hangs.
Steps/code to reproduce the problem
from smart_open import open

with open("webhdfs://namenode/myfile.txt", "r") as f:
    lines = f.readlines()
The issue seems to be in the following line: https://github.com/RaRe-Technologies/smart_open/blob/develop/smart_open/webhdfs.py#L147
Sometimes self._response.raw.read(io.DEFAULT_BUFFER_SIZE) returns 0 bytes and StopIteration is never raised, so the call hangs forever. I fixed it locally with the following change, which seems to work:
while len(self._buf) < size:
    raw_data = self._response.raw.read(io.DEFAULT_BUFFER_SIZE)
    if len(raw_data) == 0:
        break
    self._buf += raw_data
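For illustration, here is a minimal standalone sketch of the same loop run against a stub response object. FakeRaw and fill_buffer are hypothetical names used only for this example, not part of smart_open; the point is that once the underlying stream is exhausted, read() keeps returning empty bytes, so breaking on an empty read is what terminates the loop.

import io

class FakeRaw:
    """Stub for self._response.raw: returns empty bytes once the stream is exhausted."""
    def __init__(self, data):
        self._stream = io.BytesIO(data)

    def read(self, size=-1):
        return self._stream.read(size)

def fill_buffer(raw, size):
    buf = b""
    while len(buf) < size:
        raw_data = raw.read(io.DEFAULT_BUFFER_SIZE)
        if len(raw_data) == 0:
            break  # stream exhausted; without this the loop spins forever
        buf += raw_data
    return buf

print(fill_buffer(FakeRaw(b"hello world"), 1024))  # b'hello world'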
Versions
Please provide the output of:
import platform, sys, smart_open
print(platform.platform())
print("Python", sys.version)
print("smart_open", smart_open.__version__)
Linux-5.8.0-44-generic-x86_64-with-debian-bullseye-sid
Python 3.7.3 (default, Sep 3 2020, 16:41:28) [GCC 9.3.0]
smart_open 4.2.0
Checklist
Before you create the issue, please make sure you have:
- Described the problem clearly
- Provided a minimal reproducible example, including any required data
- Provided the version numbers of the relevant software
OK @mpenkov @piskvorky, makes sense. Will send a PR.
Yes, you’re right, string concatenation is awful.
The approach you suggest is definitely better. Another common way to do it is to write into an io.BytesIO buffer.
@traboukos Please go ahead with the PR, taking the above into account.
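For reference, a rough sketch of what the io.BytesIO variant suggested above could look like; fill_buffer is an illustrative name, not the actual smart_open method, and this is just one possible shape for the PR.

import io

def fill_buffer(raw, size):
    """Accumulate chunks in an io.BytesIO buffer instead of concatenating bytes objects."""
    buf = io.BytesIO()
    while buf.tell() < size:
        raw_data = raw.read(io.DEFAULT_BUFFER_SIZE)
        if len(raw_data) == 0:
            break  # end of stream
        buf.write(raw_data)
    return buf.getvalue()

Writing into io.BytesIO avoids repeatedly reallocating and copying a growing bytes object, which is what makes plain += concatenation slow for large reads.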