Script to analyze BLAST output from host genome

This is marked as a good first bug because, although it sounds complicated, I think it is computationally simpler than it sounds. There might be some tricky sections, but if a newcomer is attempting this, I’m available for questions anytime! Just comment in this issue.

Motivation

Research has suggested that there may be cases when CRISPR systems are used for something besides immunity to foreign DNA - perhaps they could be regulating the host genome, or they might simply be inactive. A clue that one of these things might be happening is if there are spacers that come from their own host genome. To this end, we need functions to (A) BLAST spacers against the host genome and (B) analyze the BLAST output. The first function is described in issue #60 and the second function is described here.

The function

Input:

- XML file produced by the BLAST function in issue #60 (example here).
- File containing the start and end positions of CRISPR loci in the sequence, which can be found here. The format of each row in this file is Accession, CRISPR_ID, start, end.
- Optional expect value cutoff, default should be 1.
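
As a starting point, reading the loci file could look something like the sketch below. It assumes the file is comma-separated and holds one locus per row in the order Accession, CRISPR_ID, start, end. The function name `read_crispr_loci` is made up for illustration and is not part of the existing scripts.

```python
import csv
from collections import defaultdict


def read_crispr_loci(handle):
    """Map each accession to a list of (start, end) CRISPR locus boundaries.

    Assumes comma-separated rows: Accession, CRISPR_ID, start, end.
    """
    loci = defaultdict(list)
    for row in csv.reader(handle, skipinitialspace=True):
        if len(row) < 4:
            continue  # skip blank or incomplete rows
        accession = row[0].strip()
        try:
            start, end = int(row[2]), int(row[3])
        except ValueError:
            continue  # skip a header row or non-numeric positions
        # Store the boundaries in ascending order regardless of input order.
        loci[accession].append((min(start, end), max(start, end)))
    return dict(loci)
```

Grouping by accession means a genome with more than one CRISPR locus ends up with all of its loci in one list.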

The function should do the following:

- Use BioPython’s built-in BLAST parser to parse the XML file produced by BLAST (example in the Input section). This is also done in filterByExpect_all_v2.py, although I’ve just noticed a few mistakes in that script, so be careful!
- This is the tricky part: differentiate between a spacer matching itself (i.e. its source spacer array) and a match somewhere else in the genome. I think the easiest way might be to run extract_CRISPRdb.py once at the beginning of the analysis, and then use the start and end positions of the CRISPR loci as the boundaries of the regions to exclude. If there is more than one CRISPR locus in a genome, matches to any of the loci should be considered “CRISPR” matches. This will have to be slightly modified if people are contributing genomes not from CRISPRdb, but this could be solved by first running CRISPRfinder on submitted genomes. If a match is suspected to be from the original CRISPR array, the output should mark it as being “CRISPR”. If it is suspected to be from a different region of the genome, the output should mark it as being “non-CRISPR”.
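
The two steps above could be sketched roughly as follows. This is only one possible approach, not a finished implementation. The names `classify_hit` and `iter_hits` are hypothetical. Treating any overlap with a locus as a “CRISPR” match is an assumption, and so is keeping hits whose expect value is less than or equal to the cutoff. The parsing uses `NCBIXML.parse` from Biopython and the HSP attributes it exposes.

```python
def classify_hit(sbjct_start, sbjct_end, loci):
    """Return "CRISPR" if the hit overlaps any CRISPR locus, else "non-CRISPR".

    `loci` is a list of (start, end) tuples for one genome.
    Assumption: any overlap with a locus counts as a CRISPR match.
    """
    # Minus-strand hits are reported with start > end, so normalize first.
    low, high = sorted((sbjct_start, sbjct_end))
    for locus_start, locus_end in loci:
        if low <= locus_end and high >= locus_start:
            return "CRISPR"
    return "non-CRISPR"


def iter_hits(xml_handle, loci, expect_cutoff=1.0):
    """Yield one dict per HSP that passes the expect cutoff."""
    # Imported here so classify_hit can be used without Biopython installed.
    from Bio.Blast import NCBIXML

    for record in NCBIXML.parse(xml_handle):
        for alignment in record.alignments:
            for hsp in alignment.hsps:
                if hsp.expect > expect_cutoff:
                    continue  # assumption: the cutoff is inclusive
                yield {
                    "Query": record.query,
                    "Score": hsp.score,
                    "Expect": hsp.expect,
                    "QueryStart": hsp.query_start,
                    "QueryEnd": hsp.query_end,
                    "SubjectStart": hsp.sbjct_start,
                    "SubjectEnd": hsp.sbjct_end,
                    "Source": classify_hit(hsp.sbjct_start, hsp.sbjct_end, loci),
                }
```

Whether a partial overlap should really count as “CRISPR” is open for discussion; requiring the hit to fall entirely inside a locus would be a one-line change to the comparison.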

Output:

A CSV file called accession_self-spacers.csv (where “accession” is the NCBI accession number, e.g. NC_000853) with the following column headings:
- Query
- Score
- Expect
- QueryStart
- QueryEnd
- SubjectStart
- SubjectEnd
- Source

The “Source” column contains the “CRISPR” or “non-CRISPR” flag. All other column headings are fields in the object that results from the BLAST-parser module.
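
Writing the output could be as simple as the sketch below, which uses the standard library's `csv.DictWriter`. It assumes each row is a dict keyed by the column headings listed above. The names `COLUMNS`, `output_filename` and `write_self_spacers` are hypothetical.

```python
import csv

# Column headings from the Output section, in order.
COLUMNS = [
    "Query", "Score", "Expect", "QueryStart", "QueryEnd",
    "SubjectStart", "SubjectEnd", "Source",
]


def output_filename(accession):
    """Build the output file name, e.g. NC_000853_self-spacers.csv."""
    return f"{accession}_self-spacers.csv"


def write_self_spacers(rows, handle):
    """Write a header line followed by one CSV line per row dict."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
```

When writing to a real file, open it with `newline=""` as the `csv` module documentation recommends, for example `open(output_filename(accession), "w", newline="")`.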