NanoSim - KeyError: sequence_id not found in mapping


Hi,

Thanks for developing CAMISIM! I am currently trying to simulate data with Illumina and Nanopore reads using the de novo community design. I am using the CAMISIM master branch. With the provided test data (CAMISIM/defaults/genomes/) and the provided mapping files, I got it running using ART and NanoSim (from the https://github.com/abremges fork).

Then I tried to use the genomes/, metadata.tsv, and genome_to_id.tsv data from the 2nd CAMI Toy Mouse Gut Dataset as a basis to generate new data. For Illumina data this worked smoothly. For Nanopore data, however, I get the following errors, first after simulating the reads and then in the final anonymization step:

...
2021-07-09 16:17:40 DEBUG: [GenomePreparation 89018136530] 270448.0     22
2021-07-09 16:17:40 DEBUG: [GenomePreparation 89018136530] SysCmd: '/home-link/qeakr01/development/NanoSim/src/simulator.py linear -n 22 -r /sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/output_nanopore_test2/source_genomes/GCF_000403395.2_Anae_bact_G3_V1.fa -o /sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmp4bpxOM/2021.07.09_15.59.47_sample_0/reads/270448.0 -c tools/nanosim_profile/ecoli --seed 2998104995'
2021-07-09 16:17:40 INFO: [GenomePreparation 89018136530] Simulating reads from 270448.0: '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/output_nanopore_test2/source_genomes/GCF_000403395.2_Anae_bact_G3_V1.fa'
2021-07-09 16:31:15 INFO: [GenomePreparation 89018136530] Simulating reads finished
[W::sam_parse1] urecognized reference name; treated as unmapped
[W::sam_parse1] urecognized reference name; treated as unmapped
[W::sam_parse1] urecognized reference name; treated as unmapped
...

and

...
2021-07-09 16:44:30 INFO: [MetagenomeSimulationPipeline] Anonymize Data
2021-07-09 16:44:30 DEBUG: [MetagenomeSimulationPipeline] /sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmp4bpxOM/tmpgY5Fcq
2021-07-09 16:44:30 INFO: [FastaAnonymizer] Shuffle and anonymize '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmp4bpxOM/2021.07.09_15.59.47_sample_0/reads'
2021-07-09 16:44:30 DEBUG: [FastaAnonymizer] get_seeded_random() { seed="$1"; openssl enc -aes-256-ctr -pass pass:"$seed" -nosalt < /dev/zero 2>/dev/null; }; python '/nfsmounts/home/qeakr01/development/CAMISIM/fastastreamer.py' -input '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmp4bpxOM/2021.07.09_15.59.47_sample_0/reads' -format 'fastq' -ext 'fq' -s | shuf -z --random-source=<(get_seeded_random 2944938622045856594) | tr -d '\000' | python '/nfsmounts/home/qeakr01/development/CAMISIM/anonymizer.py' -prefix 'S0R' -format 'fastq' -map '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmp4bpxOM/tmpoArF7B' -out '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmp4bpxOM/tmpLJDFBS' -s
2021-07-09 16:48:06 INFO: [MetadataReader 1434768039] Reading file: '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/output_nanopore_test2/internal/genome_locations.tsv'
2021-07-09 16:48:08 INFO: [MetadataReader 31538633047] Reading file: '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/output_nanopore_test2/internal/meta_data.tsv'
2021-07-09 16:48:08 INFO: [MetadataReader 14979527976] Reading file: '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmp4bpxOM/tmpoArF7B'
2021-07-09 16:48:08 ERROR: [Validator 31115876351] sequence_id 'NZ-JH590862.1' not found in mapping

2021-07-09 16:48:08 DEBUG: [MetagenomeSimulationPipeline]
Traceback (most recent call last):
  File "/home-link/qeakr01/development/CAMISIM/metagenomesimulation.py", line 117, in run_pipeline
    self._anonymize_data(list_of_output_gsa, file_path_output_gsa_pooled)
  File "/home-link/qeakr01/development/CAMISIM/metagenomesimulation.py", line 639, in _anonymize_data
    file_path_genome_locations, file_path_metadata, file_path_anonymous_mapping_tmp, stream_output
  File "/nfsmounts/home/qeakr01/development/CAMISIM/scripts/GoldStandardFileFormat/goldstandardfileformat.py", line 370, in gs_read_mapping
    stream_output, dict_anonymous_to_read_id, dict_sequence_to_genome_id, dict_genome_id_to_tax_id)
  File "/nfsmounts/home/qeakr01/development/CAMISIM/scripts/GoldStandardFileFormat/goldstandardfileformat.py", line 244, in write_gs_read_mapping
    raise KeyError(msg)
KeyError: "sequence_id 'NZ-JH590862.1' not found in mapping\n"


2021-07-09 16:48:08 ERROR: [MetagenomeSimulationPipeline] "sequence_id 'NZ-JH590862.1' not found in mapping\n" in line 117
2021-07-09 16:48:08 INFO: [MetagenomeSimulationPipeline] Metagenome simulation aborted
2021-07-09 16:48:08 INFO: [MetagenomeSimulationPipeline] Temporary data stored at:
/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmp4bpxOM

Do you have any idea what could cause this issue or how I could proceed to fix this?

sim_nanosim.test2.log sim_config.nanosim.test2.ini.txt
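
One thing that may be worth checking for the KeyError above: the ID in the message, NZ-JH590862.1, contains a hyphen, whereas RefSeq accessions are normally written with an underscore after the prefix (NZ_...). The sketch below looks for FASTA header IDs that match a missing ID once hyphens and underscores are treated as the same character. It is only a diagnostic aid under stated assumptions: the helper names are hypothetical and not part of CAMISIM or NanoSim, and the header lines in the example are made up rather than taken from the actual genome file.

```python
def fasta_ids(lines):
    """Yield the sequence ID (first token after '>') of each FASTA header line."""
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                yield fields[0]


def normalize(seq_id):
    """Treat '-' and '_' as the same character for comparison purposes."""
    return seq_id.replace("-", "_")


def find_near_matches(missing_id, known_ids):
    """Return known IDs that equal missing_id once '-' and '_' are unified."""
    target = normalize(missing_id)
    return [known for known in known_ids if normalize(known) == target]


# Hypothetical header lines for illustration only (not from the real file).
example_fasta = [
    ">NZ_JH590862.1 example description",
    "ACGTACGT",
    ">NZ_JH590863.1 example description",
    "TTGACCAA",
]

matches = find_near_matches("NZ-JH590862.1", list(fasta_ids(example_fasta)))
print(matches)
```

In practice the lines would come from the source genome file named in the log (for example by iterating over the opened file). If a near match turns up, a renaming step somewhere between the source genome and the simulated read names would be a plausible suspect for the failed lookup.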

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 16

Top GitHub Comments

1 reaction
AlphaSquad commented, Jul 27, 2021

Ah, yeah, the size is a problem. Since NanoSim requires the number of reads as input and CAMISIM the dataset size, there has to be a conversion from size to number of reads. But the number of reads needed for a certain size depends on the average read length, which is specific to the trained models. I updated the model used but did not update the average read size. That this happened suggests the calculation should be automatic, depending on the chosen model.
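
The size-to-reads conversion described here can be sketched roughly as follows. This is a minimal illustration, not CAMISIM's actual code: the function name and all numbers are assumptions chosen to show why a stale average read length gives the wrong read count after switching models.

```python
import math


def reads_for_size(target_size_bp, mean_read_length_bp):
    """Number of reads needed to cover roughly target_size_bp bases."""
    if mean_read_length_bp <= 0:
        raise ValueError("mean read length must be positive")
    return math.ceil(target_size_bp / mean_read_length_bp)


# Hypothetical values: the same target size with two different model averages.
target_bp = 100_000
reads_old_model = reads_for_size(target_bp, 4_000)  # assumed older average
reads_new_model = reads_for_size(target_bp, 8_000)  # assumed newer average

print(reads_old_model, reads_new_model)
```

With these made-up averages, keeping the old value while simulating with the new model would request about twice as many reads as needed, so the resulting dataset would be about twice the intended size. Deriving the average from the chosen model's profile, as suggested, would avoid that mismatch.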

Also, thank you for the log (and the information about the non-anonymous gold standards). I hope to find the problems soon, but I will be on vacation from this Friday until the 16th of August.

1 reaction
AlphaSquad commented, Jul 16, 2021

Even though I think your results are probably usable if 2.5.0 finished without errors, I would use the latest NanoSim 3.0 if it works. The model used in 1.2.0 is very old, so it probably does not reflect recent chemistry well.
