ChunkedEncodingError & ConnectionResetError
Here's the log from running `nohup python -m cc_net mine --dump 2019-13 > 2019-13.log 2> 2019-13.err &`:
2019-11-12 00:26 INFO 22835:HashesCollector - Processed 519_187 documents in 1e+01h ( 14.4 doc/s).
2019-11-12 00:26 INFO 22835:HashesCollector - Found 25_229k unique hashes over 90_967 lines. Using 3.6GB of RAM.
2019-11-12 00:27 INFO 22835:cc_net.process_wet_file - Kept 43_340 documents over 45_437 (95.4%).
2019-11-12 00:27 INFO 22835:cc_net.process_wet_file - Parsed 13 / 35 files. Estimated remaining time: 9.2h
2019-11-12 00:27 INFO 22835:root - Starting download of https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-13/segments/1552912201329.40/wet/CC-MAIN-20190318132220-20190318153624-00039.warc.wet.gz
/data/myusername/projects/cc_net/cc_net/jsonql.py:1138: UserWarning: Swallowed error ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer')) while downloading https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-13/segments/1552912201329.40/wet/CC-MAIN-20190318132220-20190318153624-00039.warc.wet.gz (1 out of 3)
f"Swallowed error {e} while downloading {url} ({i} out of {n_retry})"
2019-11-12 01:16 INFO 22835:root - Downloaded https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-13/segments/1552912201329.40/wet/CC-MAIN-20190318132220-20190318153624-00039.warc.wet.gz [200]
2019-11-12 01:16 INFO 22835:HashesCollector - Processed 562_527 documents in 1.1e+01h ( 14.4 doc/s).
2019-11-12 01:16 INFO 22835:HashesCollector - Found 26_687k unique hashes over 98_562 lines. Using 3.7GB of RAM.
2019-11-12 01:17 INFO 22835:cc_net.process_wet_file - Kept 43_268 documents over 45_427 (95.2%).
2019-11-12 01:17 INFO 22835:cc_net.process_wet_file - Parsed 14 / 35 files. Estimated remaining time: 17.7h
2019-11-12 01:17 INFO 22835:root - Starting download of https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-13/segments/1552912201329.40/wet/CC-MAIN-20190318132220-20190318153625-00008.warc.wet.gz
/data/myusername/projects/cc_net/cc_net/jsonql.py:1138: UserWarning: Swallowed error ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer')) while downloading https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-13/segments/1552912201329.40/wet/CC-MAIN-20190318132220-20190318153625-00008.warc.wet.gz (1 out of 3)
f"Swallowed error {e} while downloading {url} ({i} out of {n_retry})"
/data/myusername/projects/cc_net/cc_net/jsonql.py:1138: UserWarning: Swallowed error ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer')) while downloading https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-13/segments/1552912201329.40/wet/CC-MAIN-20190318132220-20190318153625-00008.warc.wet.gz (2 out of 3)
f"Swallowed error {e} while downloading {url} ({i} out of {n_retry})"
2019-11-12 02:11 INFO 22835:HashesCollector - Processed 605_794 documents in 1.2e+01h ( 14.3 doc/s).
2019-11-12 02:11 INFO 22835:HashesCollector - Found 0k unique hashes over 106_217 lines. Using 3.7GB of RAM.
Traceback (most recent call last):
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/urllib3/response.py", line 425, in _error_catcher
    yield
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/urllib3/response.py", line 507, in read
    data = self._fp.read(amt) if not fp_closed else b""
  File "/usr/lib/python3.7/http/client.py", line 457, in read
    n = self.readinto(b)
  File "/usr/lib/python3.7/http/client.py", line 501, in readinto
    n = self.fp.readinto(b)
  File "/usr/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.7/ssl.py", line 1071, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.7/ssl.py", line 929, in read
    return self._sslobj.read(len, buffer)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/requests/models.py", line 750, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/urllib3/response.py", line 564, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/urllib3/response.py", line 529, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/usr/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/urllib3/response.py", line 443, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/data/myusername/projects/cc_net/cc_net/__main__.py", line 31, in <module>
    main()
  File "/data/myusername/projects/cc_net/cc_net/__main__.py", line 27, in main
    command(**parsed_args)
  File "/data/myusername/projects/cc_net/cc_net/mine.py", line 512, in main
    regroup(conf)
  File "/data/myusername/projects/cc_net/cc_net/mine.py", line 364, in regroup
    mine(conf)
  File "/data/myusername/projects/cc_net/cc_net/mine.py", line 257, in mine
    hashes_groups = list(jsonql.grouper(hashes(conf), conf.hash_in_mem))
  File "/data/myusername/projects/cc_net/cc_net/mine.py", line 206, in hashes
    ex(_hashes_shard, repeat(conf), *_transpose(missing_outputs))
  File "/data/myusername/projects/cc_net/cc_net/execution.py", line 128, in debug_executor
    message = function(*x)
  File "/data/myusername/projects/cc_net/cc_net/mine.py", line 218, in _hashes_shard
    file=conf.get_cc_shard(shard),
  File "/data/myusername/projects/cc_net/cc_net/jsonql.py", line 448, in run_pipes
    for res in results:
  File "/data/myusername/projects/cc_net/cc_net/jsonql.py", line 295, in map
    for x in source:
  File "/data/myusername/projects/cc_net/cc_net/process_wet_file.py", line 198, in __iter__
    with jsonql.open_remote_file(self.segment_url(segment)) as f:
  File "/data/myusername/projects/cc_net/cc_net/jsonql.py", line 1151, in open_remote_file
    content = io.BytesIO(request_get_content(url))
  File "/data/myusername/projects/cc_net/cc_net/jsonql.py", line 1136, in request_get_content
    raise e
  File "/data/myusername/projects/cc_net/cc_net/jsonql.py", line 1129, in request_get_content
    r = requests.get(url)
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/requests/api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/requests/sessions.py", line 686, in send
    r.content
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/requests/models.py", line 828, in content
    self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/requests/models.py", line 753, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
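As the warnings above show, `request_get_content` in `cc_net/jsonql.py` retries each download up to `n_retry` times and re-raises once the attempts are exhausted; the retries appear to happen back-to-back, so a brief network disruption can burn through all of them. A sketch of a more tolerant downloader (hypothetical helper, not part of cc_net) adds exponential backoff between attempts and a per-read socket timeout:

```python
import time

import requests


def get_content_with_backoff(url: str, n_retry: int = 5, timeout: int = 60) -> bytes:
    """Download `url`, retrying transient network errors with exponential backoff.

    Hypothetical variant of cc_net's `request_get_content`; assumes the same
    "swallow and retry" behavior but waits 1s, 2s, 4s, ... between attempts.
    """
    for attempt in range(n_retry):
        try:
            # `timeout` bounds each socket operation; without it a stalled
            # connection can hang indefinitely.
            r = requests.get(url, timeout=timeout)
            r.raise_for_status()
            return r.content
        except (
            requests.exceptions.ChunkedEncodingError,
            requests.exceptions.ConnectionError,
            requests.exceptions.Timeout,
        ) as e:
            if attempt == n_retry - 1:
                raise  # out of retries, propagate the last error
            delay = 2 ** attempt
            print(
                f"Swallowed error {e} while downloading {url} "
                f"({attempt + 1} out of {n_retry}), retrying in {delay}s"
            )
            time.sleep(delay)
    raise AssertionError("unreachable")
```

This doesn't fix a lossy route to S3, but it makes each shard download far more likely to eventually succeed instead of killing the whole pipeline mid-run.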
Is this just due to a poor network connection between me and the Amazon server (I'm in China)? If so, is it recommended to run the code from an AWS server located in the US? Also, if I don't have a C++17 compiler, how much memory do I need? Thanks a lot.
Issue Analytics
- Created: 4 years ago
- Comments: 13 (4 by maintainers)
Top GitHub Comments
@theblackcat102 Hi, I didn't encounter any memory issues. This might be because my runs were short: I have to restart the process every 4-12 hours due to network-connection, file-opening, and other errors.
I agree with you that the best practice is to run the program (both computation and storage) within the US.
Reproducing this work is just a side project for me (I'm helping a data analyst colleague with it), and I have something more important to do now, so I won't spend more time on it for at least a month. I'll probably come back to it then.
@gwenzek I wrote a post with tips on how to recreate this on GCP. Basically, using an S3 or Google Cloud Storage bucket and mounting it as a disk will save you a lot in storage fees.