Bioinformatics use case (RNA-Seq analysis)
Hi @jbenet and @maxogden! Thank you so much for the time you took to meet with @mlovci and me this weekend. Here’s an overview of our current data management situation and what our ideal case would be.
What we have now
Currently, we host a datapackage.json file which contains resources with the names "experiment_design" (metadata on the samples, e.g. celltype and colors to plot them with), "expression" (gene expression data), and "splicing" (scores of which version of a gene was used). Then, at the end of the file, we have an attribute called "species" (e.g. "hg19" for the human genome build 19). It only works with hg19 and mm10 (Mus musculus, aka house mouse, genome build 10) because it points to the URL “http://sauron.ucsd.edu/flotilla_projects/<SPECIES>/datapackage.json”, which we hand-curated. So we can only grab this data if the data we use is from one of these two species.
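To make the layout above concrete, here is a minimal sketch of how the "species" attribute resolves to a URL. The dictionary is only a hypothetical outline of the descriptor (the real file has more fields that are not shown here), and the names SPECIES_URL_TEMPLATE, CURATED_SPECIES and species_datapackage_url are made up for illustration; they are not part of flotilla.

```python
# Hypothetical outline of the descriptor: only the package name, the resource
# names and the "species" attribute are taken from the description above.
datapackage = {
    "name": "neural_diff_chr22",
    "resources": [
        {"name": "experiment_design"},
        {"name": "expression"},
        {"name": "splicing"},
    ],
    "species": "hg19",
}

# URL pattern quoted above, with <SPECIES> turned into a format field.
SPECIES_URL_TEMPLATE = (
    "http://sauron.ucsd.edu/flotilla_projects/{species}/datapackage.json"
)

# Only these two genome builds have a hand-curated datapackage.
CURATED_SPECIES = frozenset({"hg19", "mm10"})


def species_datapackage_url(species):
    """Return the species-level datapackage URL, or fail for uncurated builds."""
    if species not in CURATED_SPECIES:
        raise ValueError(f"No hand-curated datapackage for species {species!r}")
    return SPECIES_URL_TEMPLATE.format(species=species)


resource_names = [resource["name"] for resource in datapackage["resources"]]
species_url = species_datapackage_url(datapackage["species"])
```

Any species outside the two curated builds raises an error here, which is exactly the limitation described above.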
Try this:
On a command line:
git clone git@github.com:YeoLab/flotilla
cd flotilla
pip install -e .
In Python:
import flotilla
study = flotilla.embark("http://sauron.ucsd.edu/flotilla_projects/neural_diff_chr22/datapackage.json")
This will load the data from our server at sauron.ucsd.edu. Since you haven’t downloaded anything with that filename yet, it will download the data first. This is a test dataset with only information from human chromosome 22, so it is loadable on a regular laptop. Feel free to look through the code and JSON files. flotilla.data_model.Study.from_data_package does most of the heavy lifting in loading the data. Keep in mind that the project is in a pre-alpha stage and has a long way to go 😃
What we would like
Two major issues are:
- Get the data in the neural_diff_chr22 datapackage into a pandas.DataFrame object which can then be imported into flotilla.
  - Currently this is managed by the URL in the datapackage.json file, but the loader should first check locally for the data, so that it can be loaded offline if you have already downloaded it.
- Grab related data, e.g. descriptions of genes and their functions. Given an ID like ENSG00000100320, get the “gene symbol” (i.e. the familiar name that we know it by), RBFOX2, and the fact that this gene is an RNA-binding protein involved in alternative splicing and neural development.
  - Currently this is managed by the "species" attribute, but ideally it would be something like ENSEMBL_v75_homo_sapiens, which would link to the human data here: http://uswest.ensembl.org/info/data/ftp/index.html and then grab gene annotation (gtf files) and sequence information (fasta files) as needed by the analysis.
  - Relatedly, there is apparently an “eHive” system on ENSEMBL for data processing. I haven’t explored it yet, but it may be good to be aware of.
  - Another major issue is how to merge analyses of different species’ data. For example, the ENSEMBL website has mappings of human and mouse versions of genes that we could use to compare gene expression. Plus there’s the HAVANA project which categorizes orthologous (evolutionarily related) genes between different vertebrates. But what if I want to compare across non-traditional species? And many of them, not just between two? I would like to be able to easily grab these data, submit a job (either to our local supercomputer or to Amazon AWS) which runs a script that outputs a mapping with some unique keys that you could merge all your different data on.
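The "check locally first" behavior from the first issue could look roughly like the sketch below. It is an assumption about one possible design, not how flotilla works today: fetch_resource and its cache layout are hypothetical, and the download step is injectable so the logic can be exercised without network access.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve


def fetch_resource(url, cache_dir, download=urlretrieve):
    """Return a local path for `url`, downloading only if it is not cached.

    `download` is any callable taking (url, filename); it defaults to
    urllib.request.urlretrieve and can be swapped out for testing.
    """
    parsed = urlparse(url)
    # Mirror host and path under the cache directory so that identically
    # named files from different projects do not overwrite each other.
    local_path = Path(cache_dir).expanduser() / parsed.netloc / parsed.path.lstrip("/")
    if local_path.exists():
        # Already downloaded: usable offline, no network call is made.
        return local_path
    local_path.parent.mkdir(parents=True, exist_ok=True)
    download(url, str(local_path))
    return local_path
```

The returned path can then be handed to whatever reader fits the resource, for example pandas.read_csv for a CSV table. Keying the cache on the full URL rather than the bare filename avoids collisions between projects that all ship a file called datapackage.json.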
Ideally, we could do something like this:
study = flotilla.embark('neurons')
This would fetch our mouse and human neuron data, link it to ENSEMBL, attach all the relevant metadata about mouse and human genes, and give common keys where possible.
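As a rough illustration of "common keys where possible", the sketch below joins per-species values on a shared ortholog key and keeps only the groups present in every species. Everything here is hypothetical: the function name, the shape of the mapping, the placeholder mouse ID and the toy expression values are invented for the example. Only ENSG00000100320 (RBFOX2) comes from the text above.

```python
def merge_on_ortholog_keys(ortholog_groups, expression_by_species):
    """Join per-species values on a shared ortholog key (inner join).

    ortholog_groups: {ortholog_key: {species: gene_id}}
    expression_by_species: {species: {gene_id: value}}
    Returns one row per ortholog key that has a value in every species.
    """
    rows = []
    for key, gene_ids in ortholog_groups.items():
        row = {"ortholog_key": key}
        for species, table in expression_by_species.items():
            gene_id = gene_ids.get(species)
            if gene_id is None or gene_id not in table:
                break  # this group is missing in at least one species
            row[species] = table[gene_id]
        else:
            # No break: every species contributed a value for this group.
            rows.append(row)
    return rows


# Toy inputs. The mouse ID and all numbers are placeholders, not real data.
ortholog_groups = {
    "RBFOX2": {"human": "ENSG00000100320", "mouse": "MOUSE_GENE_PLACEHOLDER"},
}
expression_by_species = {
    "human": {"ENSG00000100320": 5.0},
    "mouse": {"MOUSE_GENE_PLACEHOLDER": 4.0},
}
merged = merge_on_ortholog_keys(ortholog_groups, expression_by_species)
```

Because the mapping is keyed by ortholog group rather than by species pair, the same function works for more than two species. With pandas tables, the equivalent step would be a merge on the shared key column.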
@mlovci - please add on with anything I missed.
Top GitHub Comments
@joehand, it looks like you missed copying over the last comment by olgabot.
Moved https://github.com/datproject/discussions/issues/46