
Error: Job not requeued because: timed-out and not checkpointable.

See original GitHub issue

When I execute:

python -m cc_net -l fa

It throws the following exception:

  File "/usr/local/Cellar/python@3.8/3.8.6/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py", line 502, in readinto
    n = self.fp.readinto(b)
  File "/usr/local/Cellar/python@3.8/3.8.6/Frameworks/Python.framework/Versions/3.8/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
  File "/usr/local/Cellar/python@3.8/3.8.6/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/local/Cellar/python@3.8/3.8.6/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
  File "/usr/local/lib/python3.8/site-packages/submitit/core/job_environment.py", line 185, in checkpoint_and_try_requeue
    raise utils.UncompletedJobError(message)
submitit.core.utils.UncompletedJobError: Job not requeued because: timed-out and not checkpointable.
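For context, submitit only requeues a timed-out job when the submitted callable is checkpointable, i.e. it exposes a checkpoint() method that returns a DelayedSubmission. A minimal sketch of that contract, with DelayedSubmission stubbed out so it runs without SLURM (MineTask is a hypothetical name, not part of cc_net):

```python
# Sketch of submitit's checkpointing contract. DelayedSubmission is
# stubbed here so the pattern is runnable outside a cluster; the real
# class lives in submitit.helpers.

class DelayedSubmission:  # stand-in for submitit.helpers.DelayedSubmission
    def __init__(self, fn, *args, **kwargs):
        self.fn, self.args, self.kwargs = fn, args, kwargs


class MineTask:
    """Hypothetical callable wrapper for a mining step."""

    def __call__(self, conf):
        ...  # run the long-running work here

    def checkpoint(self, conf):
        # Called by submitit on timeout/preemption. Returning a
        # DelayedSubmission makes the job requeueable instead of
        # failing with UncompletedJobError as in the log above.
        return DelayedSubmission(self, conf)
```

A job submitted without such a checkpoint() method is exactly the "not checkpointable" case reported in the error.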

Here is the full log.err:

2021-01-20 22:08 INFO 20945:cc_net.process_wet_file - Parsed 1 / 16000 files. Estimated remaining time: 177.9h
2021-01-20 22:08 INFO 20945:root - Starting download of https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/155024747>
2021-01-20 22:08 INFO 20945:root - Downloaded https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/1550247479101.30/we>
2021-01-20 22:08 INFO 20945:cc_net.process_wet_file - Kept 41_939 documents over 44_039 (95.2%).
2021-01-20 22:08 INFO 20945:cc_net.process_wet_file - Parsed 2 / 16000 files. Estimated remaining time: 147.7h
2021-01-20 22:08 INFO 20945:root - Starting download of https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/155024747>
submitit WARNING (2021-01-20 22:08:54,313) - Caught signal 10 on 8095a4502934: this job is timed-out.
2021-01-20 22:08 WARNING 20945:submitit - Caught signal 10 on 8095a4502934: this job is timed-out.
2021-01-20 22:08 INFO 20945:submitit - Job not requeued because: timed-out and not checkpointable.
2021-01-20 22:08 INFO 20945:root - Downloaded https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/1550247479101.30/we>
submitit WARNING (2021-01-20 22:08:54,522) - Bypassing signal 15
submitit WARNING (2021-01-20 22:08:54,522) - Bypassing signal 15
2021-01-20 22:08 WARNING 20956:submitit - Bypassing signal 15
2021-01-20 22:08 WARNING 20957:submitit - Bypassing signal 15
2021-01-20 22:08 INFO 20945:Classifier - Processed 0 documents in 0.025h (  0.0 doc/s).
2021-01-20 22:08 INFO 20945:Classifier - Kept 0 docs over 0 (0.0%)
2021-01-20 22:08 INFO 20945:Classifier - Found 0 language labels: {}
2021-01-20 22:08 INFO 20945:where - Selected 0 documents out of 0 ( 0.0%)
submitit ERROR (2021-01-20 22:08:54,541) - Submitted job triggered an exception
2021-01-20 22:08 ERROR 20945:submitit - Submitted job triggered an exception
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 851, in next
    item = self._items.popleft()
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/submitit/core/_submit.py", line 11, in <module>
    submitit_main()
  File "/opt/conda/lib/python3.8/site-packages/submitit/core/submission.py", line 65, in submitit_main
    process_job(args.folder)
  File "/opt/conda/lib/python3.8/site-packages/submitit/core/submission.py", line 58, in process_job
    raise error
  File "/opt/conda/lib/python3.8/site-packages/submitit/core/submission.py", line 47, in process_job
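The inner IndexError is not itself the failure: CPython's multiprocessing pool iterator probes its result deque with popleft() and uses the IndexError to detect "no item yet", so that exception is ordinary control flow; the chained traceback appears only because the real submitit error was raised while it was being handled. A minimal illustration of the probe:

```python
from collections import deque

items = deque()
try:
    items.popleft()  # multiprocessing's IMapIterator does exactly this probe
except IndexError as exc:
    print(exc)  # -> pop from an empty deque
```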

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 11 (1 by maintainers)

Top GitHub Comments

hadifar commented on Feb 4, 2021 (2 reactions)

@sidsvash26 set num_segments_per_shard to some finite value in the above config (e.g., 10 instead of -1).
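Assuming the key behaves as in sidsvash26's config below, the suggested edit amounts to this single entry (a fragment, not a full config):

```json
{
    "num_segments_per_shard": 10
}
```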

sidsvash26 commented on Feb 4, 2021 (2 reactions)

I’m not sure what the problem was, but it works with the following config file:

python -m cc_net --config config/myconfig.json

Here is myconfig.json:

{
    "hash_in_mem": 2,
    "dump": "2019-09",
    "num_shards": 8,
    "lang_whitelist": ["fa"],
    "num_segments_per_shard": -1,
    "mine_num_processes": 1,
    "pipeline": [
        "dedup",
        "lid",
        "keep_lang",
        "split_by_segment"
    ],
    "execution": "debug",
    "target_size": "1GB",
    "output_dir": "fa_data2",
    "mined_dir": "fa_mined_by_segment2",
    "cache_dir": "fa_data2/wet_cache"
}

This works for me as well, thanks! Is it possible to download and run on only a small sample of the full data, say 1 million documents in one language? If yes, what would the config for that look like?
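No answer to that question is shown in this thread, but one hedged approach, assuming the keys behave as in the config above, is to keep a single shard and cap the number of WET segments it pulls, so only a small slice of the dump is downloaded and mined (a fragment to merge into myconfig.json; exact document counts per segment vary, so 1 million documents can only be approximated):

```json
{
    "num_shards": 1,
    "num_segments_per_shard": 4
}
```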

Read more comments on GitHub
