NanoSim - KeyError: sequence_id not found in mapping
Hi,
Thanks for developing CAMISIM! I am currently trying to simulate data with Illumina and Nanopore reads using the de novo community design. I am using the CAMISIM master branch. With the provided test data (CAMISIM/defaults/genomes/) and the provided mapping files I got it running using art and nanosim (from the https://github.com/abremges fork).
Then I tried to use the 2nd CAMI Toy Mouse Gut Dataset genomes/, metadata.tsv and genome_to_id.tsv data as a basis to generate new data. For Illumina data this worked smoothly. However, for Nanopore data I get the following errors after simulating the reads and in the final anonymization step:
...
2021-07-09 16:17:40 DEBUG: [GenomePreparation 89018136530] 270448.0 22
2021-07-09 16:17:40 DEBUG: [GenomePreparation 89018136530] SysCmd: '/home-link/qeakr01/development/NanoSim/src/simulator.py linear -n 22 -r /sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/output_nanopore_test2/source_genomes/GCF_000403395.2_Anae_bact_G3_V1.fa -o /sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmp4bpxOM/2021.07.09_15.59.47_sample_0/reads/270448.0 -c tools/nanosim_profile/ecoli --seed 2998104995'
2021-07-09 16:17:40 INFO: [GenomePreparation 89018136530] Simulating reads from 270448.0: '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/output_nanopore_test2/source_genomes/GCF_000403395.2_Anae_bact_G3_V1.fa'
2021-07-09 16:31:15 INFO: [GenomePreparation 89018136530] Simulating reads finished
[W::sam_parse1] urecognized reference name; treated as unmapped
[W::sam_parse1] urecognized reference name; treated as unmapped
[W::sam_parse1] urecognized reference name; treated as unmapped
...
and
...
2021-07-09 16:44:30 INFO: [MetagenomeSimulationPipeline] Anonymize Data
2021-07-09 16:44:30 DEBUG: [MetagenomeSimulationPipeline] /sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmp4bpxOM/tmpgY5Fcq
2021-07-09 16:44:30 INFO: [FastaAnonymizer] Shuffle and anonymize '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmp4bpxOM/2021.07.09_15.59.47_sample_0/reads'
2021-07-09 16:44:30 DEBUG: [FastaAnonymizer] get_seeded_random() { seed="$1"; openssl enc -aes-256-ctr -pass pass:"$seed" -nosalt < /dev/zero 2>/dev/null; }; python '/nfsmounts/home/qeakr01/development/CAMISIM/fastastreamer.py' -input '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmp4bpxOM/2021.07.09_15.59.47_sample_0/reads' -format 'fastq' -ext 'fq' -s | shuf -z --random-source=<(get_seeded_random 2944938622045856594) | tr -d '\000' | python '/nfsmounts/home/qeakr01/development/CAMISIM/anonymizer.py' -prefix 'S0R' -format 'fastq' -map '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmp4bpxOM/tmpoArF7B' -out '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmp4bpxOM/tmpLJDFBS' -s
2021-07-09 16:48:06 INFO: [MetadataReader 1434768039] Reading file: '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/output_nanopore_test2/internal/genome_locations.tsv'
2021-07-09 16:48:08 INFO: [MetadataReader 31538633047] Reading file: '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/output_nanopore_test2/internal/meta_data.tsv'
2021-07-09 16:48:08 INFO: [MetadataReader 14979527976] Reading file: '/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmp4bpxOM/tmpoArF7B'
2021-07-09 16:48:08 ERROR: [Validator 31115876351] sequence_id 'NZ-JH590862.1' not found in mapping
2021-07-09 16:48:08 DEBUG: [MetagenomeSimulationPipeline]
Traceback (most recent call last):
  File "/home-link/qeakr01/development/CAMISIM/metagenomesimulation.py", line 117, in run_pipeline
    self._anonymize_data(list_of_output_gsa, file_path_output_gsa_pooled)
  File "/home-link/qeakr01/development/CAMISIM/metagenomesimulation.py", line 639, in _anonymize_data
    file_path_genome_locations, file_path_metadata, file_path_anonymous_mapping_tmp, stream_output
  File "/nfsmounts/home/qeakr01/development/CAMISIM/scripts/GoldStandardFileFormat/goldstandardfileformat.py", line 370, in gs_read_mapping
    stream_output, dict_anonymous_to_read_id, dict_sequence_to_genome_id, dict_genome_id_to_tax_id)
  File "/nfsmounts/home/qeakr01/development/CAMISIM/scripts/GoldStandardFileFormat/goldstandardfileformat.py", line 244, in write_gs_read_mapping
    raise KeyError(msg)
KeyError: "sequence_id 'NZ-JH590862.1' not found in mapping\n"
2021-07-09 16:48:08 ERROR: [MetagenomeSimulationPipeline] "sequence_id 'NZ-JH590862.1' not found in mapping\n" in line 117
2021-07-09 16:48:08 INFO: [MetagenomeSimulationPipeline] Metagenome simulation aborted
2021-07-09 16:48:08 INFO: [MetagenomeSimulationPipeline] Temporary data stored at:
/sfs/7/workspace/ws/qeakr01-camisim-0/toy_mouse_gut_time_series/tmp/tmp4bpxOM
Do you have any idea what could cause this issue or how I could proceed to fix this?
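One check that may help narrow this down: the ID in the error, NZ-JH590862.1, contains a hyphen, whereas RefSeq accessions are normally written with an underscore (NZ_JH590862.1). The sketch below is a diagnostic built on that observation, not part of CAMISIM or NanoSim. The function names are hypothetical, the second example header is made up, and whether the read simulator actually rewrote the underscore is an assumption to verify against your own files. It collects sequence IDs from FASTA header lines and reports which missing IDs would match a known ID if '-' and '_' were treated as equivalent.

```python
def fasta_ids(lines):
    """Return the set of sequence IDs (first token of each '>' header line)."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def explain_missing(missing_ids, known_ids):
    """Map each missing ID to a known ID differing only by '-' vs '_', else None."""
    normalized = {known.replace("_", "-"): known for known in known_ids}
    return {m: normalized.get(m.replace("_", "-")) for m in missing_ids}


# Hypothetical example data; in practice, read the header lines from the
# source genome FASTA files listed in genome_to_id.tsv.
example_fasta = [">NZ_JH590862.1 example description", "ACGT",
                 ">NZ_JH590863.1", "TTGA"]
known = fasta_ids(example_fasta)
print(explain_missing(["NZ-JH590862.1", "unknown_id"], known))
```

If every missing ID resolves to a known one this way, the mismatch is probably a naming issue between the simulated read names and the FASTA headers and not a missing genome.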
Top GitHub Comments
Ah, yeah, the size is a problem. Since NanoSim requires the number of reads as input and CAMISIM the dataset size, there has to be a conversion from size to number of reads. But the number of reads needed for a certain size depends on the average read length, which is specific to the trained models. I updated the model used but did not update the average read length. That this happened suggests the calculation should be automatic, depending on the chosen model.
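The size-to-read-count conversion described here can be sketched as follows. This is a minimal illustration, not CAMISIM's actual code: the function names and the example numbers are hypothetical, and it assumes the mean read length is measured from a sample of reads produced by the chosen model instead of being hard-coded.

```python
import math


def mean_read_length(read_lengths):
    """Average length (bp) of a sample of reads from the chosen model."""
    if not read_lengths:
        raise ValueError("need at least one read length")
    return sum(read_lengths) / len(read_lengths)


def reads_for_size(target_bp, mean_length_bp):
    """Number of reads needed to reach roughly target_bp bases in total."""
    if mean_length_bp <= 0:
        raise ValueError("mean read length must be positive")
    return math.ceil(target_bp / mean_length_bp)


# Hypothetical numbers for illustration only.
lengths = [4000, 6000, 8000]
print(reads_for_size(1_000_000, mean_read_length(lengths)))  # prints 167
```

If the mean length is stale, for example left over from an older model, the resulting dataset size is off by roughly the ratio of the true mean to the assumed mean.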
Also, thank you for the log (and the information about the non-anonymous gold standards). I hope to find the problems soon, but I will be on vacation from this Friday until the 16th of August.
Even though I think that if 2.5.0 finished without errors your results are probably usable, I would use the latest NanoSim 3.0 if it works. The model used in 1.2.0 is very old, so it probably does not reflect recent chemistry well.