Error: Job not requeued because: timed-out and not checkpointable.
When I execute:
python -m cc_net -l fa
It throws the following exception:
File "/usr/local/Cellar/python@3.8/3.8.6/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py", line 502, in readinto
n = self.fp.readinto(b)
File "/usr/local/Cellar/python@3.8/3.8.6/Frameworks/Python.framework/Versions/3.8/lib/python3.8/socket.py", line 669, in readinto
return self._sock.recv_into(b)
File "/usr/local/Cellar/python@3.8/3.8.6/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ssl.py", line 1241, in recv_into
return self.read(nbytes, buffer)
File "/usr/local/Cellar/python@3.8/3.8.6/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ssl.py", line 1099, in read
return self._sslobj.read(len, buffer)
File "/usr/local/lib/python3.8/site-packages/submitit/core/job_environment.py", line 185, in checkpoint_and_try_requeue
raise utils.UncompletedJobError(message)
submitit.core.utils.UncompletedJobError: Job not requeued because: timed-out and not checkpointable.
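The error means submitit hit the SLURM time limit and could not requeue the job, because the submitted callable does not implement submitit's checkpoint protocol (a `checkpoint()` method that returns a resumable copy of the work, which submitit wraps in a `DelayedSubmission`). A minimal self-contained sketch of that resumable-callable pattern, without submitit itself; `ShardProcessor` and its fields are illustrative, not part of cc_net:

```python
# Sketch of the "checkpointable callable" pattern submitit looks for when a
# job times out. With submitit installed, checkpoint() would return
# submitit.helpers.DelayedSubmission(...) so the scheduler requeues the job;
# here we model only the state-carrying part so the idea runs standalone.

class ShardProcessor:
    """Processes a list of segments and remembers how far it got."""

    def __init__(self, segments, start=0):
        self.segments = segments
        self.done = start  # index of the next unprocessed segment

    def __call__(self, budget=None):
        # Process segments until finished, or until an optional per-call
        # budget runs out (standing in for a SLURM time limit).
        results = []
        for i in range(self.done, len(self.segments)):
            if budget is not None and len(results) >= budget:
                break
            results.append(self.segments[i].upper())  # placeholder "work"
            self.done = i + 1
        return results

    def checkpoint(self, *args, **kwargs):
        # With submitit this would be:
        #   return submitit.helpers.DelayedSubmission(
        #       ShardProcessor(self.segments, start=self.done), *args, **kwargs)
        # i.e. resubmit a copy that resumes from where we stopped.
        return ShardProcessor(self.segments, start=self.done)
```

If the function cc_net submits exposed such a `checkpoint()` method, submitit would requeue it on the timeout signal instead of raising `UncompletedJobError`.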
Here is the full log.err:
2021-01-20 22:08 INFO 20945:cc_net.process_wet_file - Parsed 1 / 16000 files. Estimated remaining time: 177.9h
2021-01-20 22:08 INFO 20945:root - Starting download of https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/155024747>
2021-01-20 22:08 INFO 20945:root - Downloaded https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/1550247479101.30/we>
2021-01-20 22:08 INFO 20945:cc_net.process_wet_file - Kept 41_939 documents over 44_039 (95.2%).
2021-01-20 22:08 INFO 20945:cc_net.process_wet_file - Parsed 2 / 16000 files. Estimated remaining time: 147.7h
2021-01-20 22:08 INFO 20945:root - Starting download of https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/155024747>
submitit WARNING (2021-01-20 22:08:54,313) - Caught signal 10 on 8095a4502934: this job is timed-out.
2021-01-20 22:08 WARNING 20945:submitit - Caught signal 10 on 8095a4502934: this job is timed-out.
2021-01-20 22:08 INFO 20945:submitit - Job not requeued because: timed-out and not checkpointable.
2021-01-20 22:08 INFO 20945:root - Downloaded https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/1550247479101.30/we>
submitit WARNING (2021-01-20 22:08:54,522) - Bypassing signal 15
submitit WARNING (2021-01-20 22:08:54,522) - Bypassing signal 15
2021-01-20 22:08 WARNING 20956:submitit - Bypassing signal 15
2021-01-20 22:08 WARNING 20957:submitit - Bypassing signal 15
2021-01-20 22:08 INFO 20945:Classifier - Processed 0 documents in 0.025h ( 0.0 doc/s).
2021-01-20 22:08 INFO 20945:Classifier - Kept 0 docs over 0 (0.0%)
2021-01-20 22:08 INFO 20945:Classifier - Found 0 language labels: {}
2021-01-20 22:08 INFO 20945:where - Selected 0 documents out of 0 ( 0.0%)
submitit ERROR (2021-01-20 22:08:54,541) - Submitted job triggered an exception
2021-01-20 22:08 ERROR 20945:submitit - Submitted job triggered an exception
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 851, in next
item = self._items.popleft()
IndexError: pop from an empty deque
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.8/site-packages/submitit/core/_submit.py", line 11, in <module>
submitit_main()
File "/opt/conda/lib/python3.8/site-packages/submitit/core/submission.py", line 65, in submitit_main
process_job(args.folder)
File "/opt/conda/lib/python3.8/site-packages/submitit/core/submission.py", line 58, in process_job
raise error
File "/opt/conda/lib/python3.8/site-packages/submitit/core/submission.py", line 47, in process_job
Issue Analytics
- State:
- Created: 3 years ago
- Reactions: 1
- Comments: 11 (1 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@sidsvash26: set num_segments_per_shard to some value in the above config (e.g., 10 instead of -1). This works for me as well, thanks!

Is it possible to download and run on only a small sample of the full data, say one million documents in one language? If yes, what would the config for that look like?
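For the small-sample question, a hedged sketch of what such a config might look like, assuming cc_net's JSON config format and the Config fields visible in cc_net/mine.py (num_shards, num_segments_per_shard, lang_whitelist); the exact field names, the dump label, and sensible values should be verified against the version of cc_net in use:

```json
{
  "config_name": "small_sample_fa",
  "dump": "2019-09",
  "num_shards": 1,
  "num_segments_per_shard": 10,
  "lang_whitelist": ["fa"],
  "mine_num_processes": 1,
  "execution": "local"
}
```

Saved to a file, this would presumably be passed with something like `python -m cc_net --config path/to/small_sample_fa.json`; limiting both the shard count and the segments per shard is what keeps the download small.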