Bioinformatics use case (RNA-Seq analysis)

Hi @jbenet and @maxogden! Thank you so much for the time you took to meet with @mlovci and me this weekend. Here’s an overview of our current data management situation and what our ideal case would be.

What we have now

Currently, we host a datapackage.json file which contains resources with the names "experiment_design" (metadata on the samples, e.g. celltype and colors to plot them with), "expression" (gene expression data), and "splicing" (scores of which version of a gene was used). At the end of the file, we have an attribute called "species" (e.g. "hg19" for the human genome build 19). It only works with hg19 and mm10 (Mus musculus, aka house mouse, genome build 10) because it points to the URL "http://sauron.ucsd.edu/flotilla_projects/<SPECIES>/datapackage.json", which we hand-curated. So we can only grab the species-level data if the study comes from one of these two species.
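To make the layout described above concrete, here is a minimal sketch of how the "species" attribute could be resolved to the hand-curated URL. The URL pattern, the resource names, and the two supported builds come from the description above; the "name" and "path" fields, the example values, and both helper functions are assumptions for illustration, not flotilla's actual code.

```python
import json

# URL pattern quoted in the issue; only these two builds were hand-curated.
SPECIES_URL_TEMPLATE = (
    "http://sauron.ucsd.edu/flotilla_projects/{species}/datapackage.json"
)
SUPPORTED_SPECIES = {"hg19", "mm10"}

# Hypothetical, abbreviated datapackage.json. The "name" and "path" fields
# and their values are assumptions made for this sketch.
EXAMPLE_DATAPACKAGE = json.loads("""
{
  "name": "example_study",
  "resources": [
    {"name": "experiment_design", "path": "experiment_design.csv"},
    {"name": "expression", "path": "expression.csv"},
    {"name": "splicing", "path": "splicing.csv"}
  ],
  "species": "hg19"
}
""")


def resource_names(datapackage):
    """List the resource names declared in a parsed datapackage."""
    return [resource["name"] for resource in datapackage["resources"]]


def species_datapackage_url(datapackage):
    """Build the species-level URL, failing loudly for uncurated builds."""
    species = datapackage["species"]
    if species not in SUPPORTED_SPECIES:
        raise ValueError(f"no hand-curated species data for {species!r}")
    return SPECIES_URL_TEMPLATE.format(species=species)


print(resource_names(EXAMPLE_DATAPACKAGE))
print(species_datapackage_url(EXAMPLE_DATAPACKAGE))
```

The explicit ValueError makes the limitation visible: any build other than hg19 or mm10 has nothing to point at.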

Try this:

On a command line:

git clone git@github.com:YeoLab/flotilla
cd flotilla
pip install -e .

In Python:

import flotilla
study = flotilla.embark("http://sauron.ucsd.edu/flotilla_projects/neural_diff_chr22/datapackage.json")

This will load the data from our server at sauron.ucsd.edu. Since you haven't downloaded anything with that filename yet, it will be downloaded first. This is a test dataset with only information from human chromosome 22, so it is loadable on a regular laptop. Feel free to look through the code and JSON files. flotilla.data_model.Study.from_data_package does most of the heavy lifting in loading the data. Keep in mind that the project is in a pre-alpha stage and has a long way to go 😃

What we would like

Two major issues are:

1. Get the data in the neural_diff_chr22 datapackage into a pandas.DataFrame object which can then be imported into flotilla.
   - Currently this is managed by the URL in the datapackage.json file for that file, but it should first check locally for the data, so that it can be loaded offline if you have already downloaded it.
2. Grab related data, e.g. descriptions of genes and their functions. Given an ID like ENSG00000100320, get the "gene symbol" (i.e. the familiar name that we know it by), RBFOX2, and the fact that this gene is an RNA-binding protein involved in alternative splicing and neural development.
   - Currently this is managed by the "species" attribute, but ideally it would be something like ENSEMBL_v75_homo_sapiens, which would link to the human data here: http://uswest.ensembl.org/info/data/ftp/index.html and then grab gene annotation (gtf files) or sequence information (fasta files) as the analysis requires.
   - Relatedly, there is apparently an "eHive" system on ENSEMBL for data processing. I haven't explored it yet, but it may be good to be aware of.
   - Another major issue is how to merge analyses of different species' data. For example, the ENSEMBL website has mappings of human and mouse versions of genes that we could use to compare gene expression. Plus there's the HAVANA project, which categorizes orthologous (evolutionarily related) genes between different vertebrates. But what if I want to compare across non-traditional species? And many of them, not just two? I would like to be able to easily grab these data and submit a job (either to our local supercomputer or to Amazon AWS) which runs a script that outputs a mapping with some unique keys that you could merge all your different data on.
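The "check locally first" behaviour asked for in the first issue can be sketched roughly as follows. This is only an illustration under stated assumptions, not flotilla's implementation: the function names, the cache layout (host and URL path mirrored under a cache directory), and the injectable `download` parameter are all hypothetical. The only real library call is `urllib.request.urlretrieve`.

```python
import os
import urllib.request
from urllib.parse import urlparse


def local_path_for(url, cache_dir):
    """Mirror the URL's host and path under cache_dir (assumed layout).

    Using the full path rather than just the filename avoids collisions,
    since every project's file is called datapackage.json.
    """
    parsed = urlparse(url)
    parts = parsed.path.strip("/").split("/")
    return os.path.join(cache_dir, parsed.netloc, *parts)


def fetch_if_missing(url, cache_dir, download=urllib.request.urlretrieve):
    """Return a local file path for url, downloading only when absent.

    `download` takes (url, destination_path); it is injectable so the
    logic can be exercised offline.
    """
    path = local_path_for(url, cache_dir)
    if not os.path.exists(path):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        download(url, path)
    return path
```

Once a resource has been fetched, the returned path can be handed to something like `pandas.read_csv(path, index_col=0)` to get the DataFrame, and later calls work offline because the file already exists locally. A real version would also need to handle stale or partially downloaded files, which this sketch ignores.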

Ideally, we could do something like this:

study = flotilla.embark('neurons')

This would fetch our mouse and human neuron data, link it to ENSEMBL, attach all the relevant metadata about mouse and human genes, and give common keys where possible.
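As a rough illustration of what "common keys" could mean, the sketch below re-keys per-species expression values by a shared ortholog key and keeps only genes present in both species. Everything here is a toy assumption: the function names, the key format, the expression values, and the mouse identifier (a placeholder, not a real ENSEMBL ID). Only the human ID ENSG00000100320 comes from the issue. It also assumes one-to-one orthologs, which real mappings do not guarantee.

```python
# Toy ortholog table of (human gene ID, mouse gene ID) pairs.
# The mouse ID is a placeholder, not a real ENSEMBL identifier.
ORTHOLOGS = [("ENSG00000100320", "MOUSE_GENE_PLACEHOLDER")]


def add_common_key(expression, id_to_key):
    """Re-key {gene_id: value} by the shared key, dropping unmapped genes."""
    return {
        id_to_key[gene]: value
        for gene, value in expression.items()
        if gene in id_to_key
    }


def merge_by_ortholog(human_expr, mouse_expr, orthologs):
    """Join two species' expression values on a generated ortholog key."""
    human_keys = {h: f"ortholog_{i}" for i, (h, _) in enumerate(orthologs)}
    mouse_keys = {m: f"ortholog_{i}" for i, (_, m) in enumerate(orthologs)}
    human = add_common_key(human_expr, human_keys)
    mouse = add_common_key(mouse_expr, mouse_keys)
    shared = human.keys() & mouse.keys()
    return {key: {"human": human[key], "mouse": mouse[key]} for key in shared}


# Made-up expression values, for illustration only.
human_expr = {"ENSG00000100320": 5.0, "UNMAPPED_HUMAN_GENE": 1.0}
mouse_expr = {"MOUSE_GENE_PLACEHOLDER": 3.0}
merged = merge_by_ortholog(human_expr, mouse_expr, ORTHOLOGS)
```

With DataFrames the same idea would be an inner `merge` on the key column; extending it to many species would need one ID column per species in the ortholog table.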

@mlovci - please add on with anything I missed.

Issue Analytics

Created 9 years ago
Comments: 25 (4 by maintainers)

Top GitHub Comments

webmaven commented, Jul 3, 2016

@joehand, it looks like you missed copying over the last comment by olgabot.

joehand commented, Jun 17, 2016

Moved to https://github.com/datproject/discussions/issues/46
