ChunkedEncodingError & ConnectionResetError

Here’s the log produced by running the command nohup python -m cc_net mine --dump 2019-13 > 2019-13.log 2>2019-13.err &:

2019-11-12 00:26 INFO 22835:HashesCollector - Processed 519_187 documents in 1e+01h ( 14.4 doc/s).
2019-11-12 00:26 INFO 22835:HashesCollector - Found 25_229k unique hashes over 90_967 lines. Using 3.6GB of RAM.
2019-11-12 00:27 INFO 22835:cc_net.process_wet_file - Kept 43_340 documents over 45_437 (95.4%).
2019-11-12 00:27 INFO 22835:cc_net.process_wet_file - Parsed 13 / 35 files. Estimated remaining time: 9.2h
2019-11-12 00:27 INFO 22835:root - Starting download of https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-13/segments/1552912201329.40/wet/CC-MAIN-20190318132220-20190318153624-00039.warc.wet.gz
/data/myusername/projects/cc_net/cc_net/jsonql.py:1138: UserWarning: Swallowed error ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer')) while downloading https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-13/segments/1552912201329.40/wet/CC-MAIN-20190318132220-20190318153624-00039.warc.wet.gz (1 out of 3)
  f"Swallowed error {e} while downloading {url} ({i} out of {n_retry})"
2019-11-12 01:16 INFO 22835:root - Downloaded https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-13/segments/1552912201329.40/wet/CC-MAIN-20190318132220-20190318153624-00039.warc.wet.gz [200]
2019-11-12 01:16 INFO 22835:HashesCollector - Processed 562_527 documents in 1.1e+01h ( 14.4 doc/s).
2019-11-12 01:16 INFO 22835:HashesCollector - Found 26_687k unique hashes over 98_562 lines. Using 3.7GB of RAM.
2019-11-12 01:17 INFO 22835:cc_net.process_wet_file - Kept 43_268 documents over 45_427 (95.2%).
2019-11-12 01:17 INFO 22835:cc_net.process_wet_file - Parsed 14 / 35 files. Estimated remaining time: 17.7h
2019-11-12 01:17 INFO 22835:root - Starting download of https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-13/segments/1552912201329.40/wet/CC-MAIN-20190318132220-20190318153625-00008.warc.wet.gz
/data/myusername/projects/cc_net/cc_net/jsonql.py:1138: UserWarning: Swallowed error ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer')) while downloading https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-13/segments/1552912201329.40/wet/CC-MAIN-20190318132220-20190318153625-00008.warc.wet.gz (1 out of 3)
  f"Swallowed error {e} while downloading {url} ({i} out of {n_retry})"
/data/myusername/projects/cc_net/cc_net/jsonql.py:1138: UserWarning: Swallowed error ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer')) while downloading https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-13/segments/1552912201329.40/wet/CC-MAIN-20190318132220-20190318153625-00008.warc.wet.gz (2 out of 3)
  f"Swallowed error {e} while downloading {url} ({i} out of {n_retry})"
2019-11-12 02:11 INFO 22835:HashesCollector - Processed 605_794 documents in 1.2e+01h ( 14.3 doc/s).
2019-11-12 02:11 INFO 22835:HashesCollector - Found 0k unique hashes over 106_217 lines. Using 3.7GB of RAM.
Traceback (most recent call last):
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/urllib3/response.py", line 425, in _error_catcher
    yield
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/urllib3/response.py", line 507, in read
    data = self._fp.read(amt) if not fp_closed else b""
  File "/usr/lib/python3.7/http/client.py", line 457, in read
    n = self.readinto(b)
  File "/usr/lib/python3.7/http/client.py", line 501, in readinto
    n = self.fp.readinto(b)
  File "/usr/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.7/ssl.py", line 1071, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.7/ssl.py", line 929, in read
    return self._sslobj.read(len, buffer)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/requests/models.py", line 750, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/urllib3/response.py", line 564, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/urllib3/response.py", line 529, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/usr/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/urllib3/response.py", line 443, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/data/myusername/projects/cc_net/cc_net/__main__.py", line 31, in <module>
    main()
  File "/data/myusername/projects/cc_net/cc_net/__main__.py", line 27, in main
    command(**parsed_args)
  File "/data/myusername/projects/cc_net/cc_net/mine.py", line 512, in main
    regroup(conf)
  File "/data/myusername/projects/cc_net/cc_net/mine.py", line 364, in regroup
    mine(conf)
  File "/data/myusername/projects/cc_net/cc_net/mine.py", line 257, in mine
    hashes_groups = list(jsonql.grouper(hashes(conf), conf.hash_in_mem))
  File "/data/myusername/projects/cc_net/cc_net/mine.py", line 206, in hashes
    ex(_hashes_shard, repeat(conf), *_transpose(missing_outputs))
  File "/data/myusername/projects/cc_net/cc_net/execution.py", line 128, in debug_executor
    message = function(*x)
  File "/data/myusername/projects/cc_net/cc_net/mine.py", line 218, in _hashes_shard
    file=conf.get_cc_shard(shard),
  File "/data/myusername/projects/cc_net/cc_net/jsonql.py", line 448, in run_pipes
    for res in results:
  File "/data/myusername/projects/cc_net/cc_net/jsonql.py", line 295, in map
    for x in source:
  File "/data/myusername/projects/cc_net/cc_net/process_wet_file.py", line 198, in __iter__
    with jsonql.open_remote_file(self.segment_url(segment)) as f:
  File "/data/myusername/projects/cc_net/cc_net/jsonql.py", line 1151, in open_remote_file
    content = io.BytesIO(request_get_content(url))
  File "/data/myusername/projects/cc_net/cc_net/jsonql.py", line 1136, in request_get_content
    raise e
  File "/data/myusername/projects/cc_net/cc_net/jsonql.py", line 1129, in request_get_content
    r = requests.get(url)
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/requests/api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/requests/sessions.py", line 686, in send
    r.content
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/requests/models.py", line 828, in content
    self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''
  File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/requests/models.py", line 753, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

Is this just due to a poor network connection between me and the Amazon servers (I’m in China)? If so, would you recommend running the code from an AWS server located in the US? Also, if I don’t have a C++17 compiler, how much memory do I need? Thanks a lot.
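
The log above shows request_get_content swallowing the error and retrying ("1 out of 3", "2 out of 3") before re-raising on the last attempt. A minimal sketch of that retry pattern, with an exponential backoff added between attempts, could look like the following. This is not cc_net's actual implementation: fetch_with_retry, its parameters, and the delay values are hypothetical names chosen for illustration, and whether waiting between attempts helps on a connection that is being reset is an assumption.

```python
import logging
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def fetch_with_retry(
    fetch: Callable[[], T],
    n_retry: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() up to n_retry times, sleeping longer after each failure.

    Hypothetical helper for illustration only. ConnectionResetError is a
    subclass of OSError, so the default retry_on covers the error in the log.
    """
    if n_retry < 1:
        raise ValueError("n_retry must be at least 1")
    for attempt in range(1, n_retry + 1):
        try:
            return fetch()
        except retry_on as err:
            if attempt == n_retry:
                # Out of attempts: let the caller see the original error.
                raise
            # Delay doubles on every failed attempt: 1s, 2s, 4s, ...
            delay = base_delay * 2 ** (attempt - 1)
            logging.warning(
                "Swallowed error %r (%d out of %d), retrying in %.1fs",
                err, attempt, n_retry, delay,
            )
            sleep(delay)
    raise AssertionError("unreachable")
```

With the requests library installed, the callable passed in could be something like lambda: requests.get(url).content. The exceptions raised by requests, including ChunkedEncodingError, derive from OSError, so the default retry_on should catch them as well. The sleep parameter is injectable only so the helper can be exercised without actually waiting.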

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 13 (4 by maintainers)

Top GitHub Comments

2 reactions
soloice commented, Nov 19, 2019

@theblackcat102 Hi, I didn’t encounter any memory issues. This might be because my process never runs for long: I have to restart it every 4 to 12 hours because of network connection errors, file opening errors, and so on.

I agree with you that the best practice is to run the program (both computation and storage) within the US.

Reproducing this work is just a side project for me (I’m helping a data analyst colleague with it), and I have more important things to do right now, so I’m not going to spend more time on it for the next month. I’ll probably come back to it after that.
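
When a process has to be restarted every few hours, as described above, one way to avoid downloading the same files again is to keep each completed download on disk. The sketch below is an illustration under stated assumptions, not part of cc_net: cached_fetch, cache_dir, and the fetch callable are hypothetical names, and it assumes each file fits in memory, as it does with the requests.get(url) call seen in the traceback.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import Callable


def cached_fetch(url: str, cache_dir: Path, fetch: Callable[[str], bytes]) -> bytes:
    """Return the content for url, downloading it only if it is not cached.

    Hypothetical helper for illustration only.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Use a hash of the URL as the file name to avoid path-unsafe characters.
    key = hashlib.sha1(url.encode("utf-8")).hexdigest()
    target = cache_dir / key
    if target.exists():
        return target.read_bytes()

    data = fetch(url)
    # Write to a temporary file first, then rename it into place, so that a
    # crash in the middle of the write never leaves a truncated cache entry.
    fd, tmp_name = tempfile.mkstemp(dir=str(cache_dir))
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.replace(tmp_name, str(target))
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return data
```

Because the cache entry only appears after the download has fully succeeded, a connection reset halfway through leaves nothing behind, and the next run simply tries that file again.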

0 reactions
theblackcat102 commented, Mar 15, 2020

@gwenzek I have written a post with tips on how to recreate this in GCP. Basically, using an S3 or Google Cloud bucket and mounting it as a disk will save you a lot in storage fees.
