Script to analyze BLAST output from host genome

This is marked as a good first bug because, although it sounds complicated, I don’t think it is computationally difficult. There might be some tricky sections, but if a newcomer is attempting this, I’m available for questions anytime! Just comment in this issue.

Motivation

Research has suggested that CRISPR systems are sometimes used for something other than immunity to foreign DNA: they might regulate the host genome, or they might simply be inactive. One clue that either of these is happening is the presence of spacers that match the host’s own genome. To this end, we need functions to (A) BLAST spacers against the host genome and (B) analyze the BLAST output. The first function is described in issue #60 and the second function is described here.


The function

Input:

  • XML file produced by the BLAST function in issue #60 (example here).
  • File containing the start and end positions of CRISPR loci in the sequence, which can be found here. The format of each row in this file is Accession, CRISPR_ID, start, end.
  • Optional expect value (E-value) cutoff; the default should be 1.
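
A minimal sketch of reading the loci file described above. It assumes the file is comma-delimited with no header row, which the issue does not confirm; `read_loci` is a hypothetical helper name, not something that exists in the repository.

```python
import csv
from collections import defaultdict


def read_loci(lines):
    """Map each accession to a list of (start, end) CRISPR locus boundaries.

    Assumes comma-delimited rows of the form Accession, CRISPR_ID, start, end
    with no header row (an assumption; adjust if the real file differs).
    """
    loci = defaultdict(list)
    for row in csv.reader(lines):
        if len(row) < 4:
            continue  # skip blank or malformed rows
        accession, _crispr_id, start, end = (field.strip() for field in row[:4])
        start, end = int(start), int(end)
        # Store boundaries as (low, high) regardless of the order in the file.
        loci[accession].append((min(start, end), max(start, end)))
    return dict(loci)
```

`read_loci` accepts any iterable of lines, so an open file handle can be passed directly.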

The function should do the following:

  • Use BioPython’s built-in BLAST parser to parse the XML file produced by BLAST (example in Input section). This is also done in filterByExpect_all_v2.py, although I’ve noticed a few mistakes just now so be careful!
  • This is the tricky part: differentiate between a match to the spacer’s own source array and a match somewhere else in the genome. I think the easiest way might be to run extract_CRISPRdb.py once at the beginning of the analysis, and then use the start and end positions of the CRISPR loci as boundaries: a match that falls within them is treated as a match to the array itself. If a genome has more than one CRISPR locus, a match to any of the loci should be considered a “CRISPR” match. This will have to be slightly modified if people contribute genomes that are not from CRISPRdb, but that could be solved by first running CRISPRfinder on submitted genomes. If a match is suspected to be from a CRISPR array, the output should mark it as “CRISPR”. If it is suspected to be from a different region of the genome, the output should mark it as “non-CRISPR”.
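
The boundary check in the tricky part could be sketched roughly as follows. `classify_hit` is a hypothetical name, and treating any overlap with a locus (rather than full containment) as a “CRISPR” match is an assumption the issue leaves open.

```python
def classify_hit(sbjct_start, sbjct_end, loci):
    """Return "CRISPR" if the hit overlaps any CRISPR locus, else "non-CRISPR".

    loci is a list of (start, end) boundaries for one genome, with
    start <= end. Any overlap counts as a CRISPR match (an assumption).
    """
    # BLAST reports minus-strand hits with start > end, so normalize first.
    lo, hi = sorted((sbjct_start, sbjct_end))
    for start, end in loci:
        if lo <= end and hi >= start:
            return "CRISPR"
    return "non-CRISPR"
```

For example, with loci at (100, 500) and (9000, 9400), a hit at 120 to 150 is flagged “CRISPR” and a hit at 600 to 630 is flagged “non-CRISPR”.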

Output:

  • A CSV file called accession_self-spacers.csv (where “accession” is the NCBI accession number, e.g. NC_000853) with the following column headings:
    • Query
    • Score
    • Expect
    • QueryStart
    • QueryEnd
    • SubjectStart
    • SubjectEnd
    • Source

The “Source” column contains the “CRISPR” or “non-CRISPR” flag. All other column headings are fields in the object that results from the BLAST-parser module.
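
A rough sketch of how the parsing and output steps might fit together, using Biopython’s `Bio.Blast.NCBIXML` parser. The function names, the caller-supplied `classify` callback, and the choice to keep hits with an expect value equal to the cutoff are all assumptions, not part of the original specification. Newer Biopython releases also offer `Bio.Blast.parse`, so check which parser your installed version prefers.

```python
import csv

COLUMNS = ["Query", "Score", "Expect", "QueryStart", "QueryEnd",
           "SubjectStart", "SubjectEnd", "Source"]


def blast_rows(xml_path, classify, expect_cutoff=1.0):
    """Yield one output row per HSP whose expect value is at or below the cutoff.

    classify is a caller-supplied function (hypothetical) that takes the
    subject start and end and returns "CRISPR" or "non-CRISPR".
    """
    # Imported here so the CSV helper below works without Biopython installed.
    from Bio.Blast import NCBIXML

    with open(xml_path) as handle:
        for record in NCBIXML.parse(handle):
            for alignment in record.alignments:
                for hsp in alignment.hsps:
                    if hsp.expect > expect_cutoff:
                        continue
                    yield [record.query, hsp.score, hsp.expect,
                           hsp.query_start, hsp.query_end,
                           hsp.sbjct_start, hsp.sbjct_end,
                           classify(hsp.sbjct_start, hsp.sbjct_end)]


def write_self_spacers(rows, handle):
    """Write the header row followed by the data rows to an open text handle."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)
```

To produce the file, open something like NC_000853_self-spacers.csv for writing with `newline=""` and pass the handle to `write_self_spacers` along with the rows from `blast_rows`.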

Issue Analytics

  • State: closed
  • Created 8 years ago
  • Comments: 21 (16 by maintainers)

Top GitHub Comments

ayangromano commented, Jun 2, 2017 (2 reactions)

Hello! Just wondering what the status for this issue is. I would love to assist!

mbonsma commented, Jun 12, 2017 (0 reactions)

I’m going to close this and redirect to #223, which is more current.
