Script to analyze BLAST output from host genome
This is marked as a good first bug because, although it sounds complicated, I think it’s computationally not as bad as it sounds. There might be some tricky sections, but if a newcomer is attempting this, I’m available for questions anytime! Just comment in this issue.
Motivation
Research has suggested that there may be cases when CRISPR systems are used for something besides immunity to foreign DNA - perhaps they could be regulating the host genome, or they might simply be inactive. A clue that one of these things might be happening is if there are spacers that come from their own host genome. To this end, we need functions to (A) BLAST spacers against the host genome and (B) analyze the BLAST output. The first function is described in issue #60 and the second function is described here.
The function
Input:
- XML file produced by the BLAST function in issue #60 (example here).
- File containing the start and end positions of CRISPR loci in the sequence, which can be found here. The format of each row in this file is: Accession, CRISPR_ID, start, end.
- Optional expect value cutoff, default should be 1.
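A minimal sketch of reading the loci file, assuming it is comma-separated plain text with integer coordinates, as the row format above suggests. The function name read_crispr_loci is hypothetical and not part of the existing scripts.

```python
import csv
from collections import defaultdict


def read_crispr_loci(lines):
    """Return {accession: [(start, end), ...]} from rows of
    'Accession, CRISPR_ID, start, end'.

    `lines` is any iterable of strings, e.g. an open file handle.
    """
    loci = defaultdict(list)
    for row in csv.reader(lines, skipinitialspace=True):
        if len(row) < 4:
            continue  # blank or malformed row
        accession = row[0].strip()
        try:
            start, end = int(row[2]), int(row[3])
        except ValueError:
            continue  # non-numeric coordinates, e.g. a header row
        # Store coordinates low-to-high so later comparisons are simple.
        loci[accession].append((min(start, end), max(start, end)))
    return dict(loci)
```

Keying the result by accession means a genome with several CRISPR loci ends up with one list holding all of their boundaries.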
The function should do the following:
- Use BioPython’s built-in BLAST parser to parse the XML file produced by BLAST (example in Input section). This is also done in filterByExpect_all_v2.py, although I’ve noticed a few mistakes just now so be careful!
- This is the tricky part: differentiate between a match with itself (i.e. the source spacer array) and a match somewhere else in the genome. I think the easiest way might be to run extract_CRISPRdb.py once at the beginning of the analysis, and then use the start and stop locations of the CRISPR loci as boundaries to exclude from. If there is more than one CRISPR locus in a genome, matches to any of the loci should be considered “CRISPR” matches. This will have to be slightly modified if people are contributing genomes not from CRISPRdb - but this could be solved by first running CRISPRfinder on submitted genomes. If a match is suspected to be from the original CRISPR array, the output should mark it as being “CRISPR”. If it is suspected to be from a different region of the genome, the output should mark it as being “non-CRISPR”.
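The two steps above could be sketched roughly as follows. This is only one possible reading of the description: it assumes a hit counts as “CRISPR” if it overlaps a locus at all (rather than lying fully inside it), that hits with an expect value equal to the cutoff are kept, and that Biopython's NCBIXML parser is available. The names classify_hit and iter_hits are hypothetical.

```python
def classify_hit(sbjct_start, sbjct_end, loci):
    """Return 'CRISPR' if the hit overlaps any locus, else 'non-CRISPR'.

    `loci` is a list of (start, end) pairs with start <= end.
    """
    # Minus-strand hits can be reported with start > end, so normalise first.
    lo, hi = sorted((sbjct_start, sbjct_end))
    for locus_start, locus_end in loci:
        if lo <= locus_end and hi >= locus_start:  # intervals overlap
            return "CRISPR"
    return "non-CRISPR"


def iter_hits(xml_path, loci, expect_cutoff=1.0):
    """Yield one dict per HSP whose expect value is at or below the cutoff."""
    # Imported here so classify_hit stays usable without Biopython installed.
    from Bio.Blast import NCBIXML

    with open(xml_path) as handle:
        for record in NCBIXML.parse(handle):
            for alignment in record.alignments:
                for hsp in alignment.hsps:
                    if hsp.expect > expect_cutoff:
                        continue
                    yield {
                        "Query": record.query,
                        "Score": hsp.score,
                        "Expect": hsp.expect,
                        "QueryStart": hsp.query_start,
                        "QueryEnd": hsp.query_end,
                        "SubjectStart": hsp.sbjct_start,
                        "SubjectEnd": hsp.sbjct_end,
                        "Source": classify_hit(
                            hsp.sbjct_start, hsp.sbjct_end, loci
                        ),
                    }
```

Because every locus in the genome is checked, a spacer that matches a second CRISPR array in the same genome is still flagged “CRISPR”, as the issue asks.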
Output:
- A csv file called accession_self-spacers.csv (where “accession” is the NCBI accession number, e.g. NC_000853) with the following column headings:
  - Query
  - Score
  - Expect
  - QueryStart
  - QueryEnd
  - SubjectStart
  - SubjectEnd
  - Source
The “Source” column contains the “CRISPR” or “non-CRISPR” flag. All other column headings are fields in the object that results from the BLAST-parser module.
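Writing the output could look something like the sketch below. It assumes each row is a dict keyed by the column headings listed above; write_self_spacers and the out_dir parameter are hypothetical names chosen for illustration.

```python
import csv
import os

COLUMNS = ["Query", "Score", "Expect", "QueryStart", "QueryEnd",
           "SubjectStart", "SubjectEnd", "Source"]


def write_self_spacers(accession, rows, out_dir="."):
    """Write rows (dicts keyed by COLUMNS) to <accession>_self-spacers.csv."""
    path = os.path.join(out_dir, f"{accession}_self-spacers.csv")
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
    return path
```

For example, calling write_self_spacers("NC_000853", rows) would create NC_000853_self-spacers.csv in the current directory.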
Issue Analytics
- Created 8 years ago
- Comments:21 (16 by maintainers)
Hello! Just wondering what the status for this issue is. I would love to assist!
I’m going to close this and redirect to #223 which is more current.