Improve data formats for network download

I wasn’t able to find much documentation of the contents of HQ_DIR.zip (or the other three benchmark datasets) in the readme or in the zip archive.

The head of graph_files/nodes.csv is:

DRUG_53326634	DRUG
DRUG_11538251	DRUG
DRUG_174597	DRUG

The head of graph_files/edges.csv is:

DRUG_774        DRUG_REACTION_GENE      GENE_150        899
DRUG_5815       DRUG_REACTION_GENE      GENE_5739       899
DRUG_6022       DRUG_REACTION_GENE      GENE_101929876  989
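
For illustration, here is a minimal sketch of how a user currently has to load these headerless files. The column names are hypothetical, since nothing in the archive states what each column means, and "value" is only a placeholder for the undocumented fourth column of the edge file. The sample rows are copied from the heads above, assuming tab separators.

```python
import csv
import io

# Hypothetical column names: the files ship without a header row, so these
# labels are guesses. "value" is a placeholder for the undocumented fourth
# column of the edge file.
NODE_COLUMNS = ["node_id", "node_type"]
EDGE_COLUMNS = ["source_id", "edge_type", "target_id", "value"]


def read_headerless_tsv(handle, columns):
    """Return one dict per non-empty row, keyed by the supplied column names."""
    reader = csv.reader(handle, delimiter="\t")
    return [dict(zip(columns, row)) for row in reader if row]


# Rows copied from the heads shown above, assuming tab separators.
nodes_sample = "DRUG_53326634\tDRUG\nDRUG_11538251\tDRUG\n"
edges_sample = (
    "DRUG_774\tDRUG_REACTION_GENE\tGENE_150\t899\n"
    "DRUG_5815\tDRUG_REACTION_GENE\tGENE_5739\t899\n"
)

nodes = read_headerless_tsv(io.StringIO(nodes_sample), NODE_COLUMNS)
edges = read_headerless_tsv(io.StringIO(edges_sample), EDGE_COLUMNS)
print(edges[0])
```

Every consumer has to supply the column names by hand, which is the gap that documentation or headers would close.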

Given that you want this network to become a benchmark, it’s important to document every file that users may need to interact with. There are some ways to improve the situation regarding the TSVs:

add column headers for self-documentation
add a readme that includes the head of each file and explains what the rows correspond to
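
A rough sketch of the header suggestion, using only the Python standard library. The header names are assumptions chosen for illustration; the maintainers would pick the real ones.

```python
import csv
import io


def add_header(src, dst, header):
    """Copy a headerless TSV from src to dst, writing a header row first."""
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(reader)


# Hypothetical header names for the node file; rows taken from its head.
headerless = io.StringIO("DRUG_53326634\tDRUG\nDRUG_11538251\tDRUG\n")
with_header = io.StringIO()
add_header(headerless, with_header, ["node_id", "node_type"])
print(with_header.getvalue())
```

With a header row in place, the result can be read with `csv.DictReader` without the caller supplying any column names.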

Furthermore, I think the TSV format could be improved:

an extra column could indicate the source database, important to comply with upstream licenses
additional formats could be added that are more expressive. For Hetionet, we release the data in four formats: JSON, Neo4j, TSV (similar to what OpenBioLink does now), and matrix. The JSON was created using the hetnetpy package and the matrices using hetmatpy. I think at least the JSON format would be a valuable addition for OpenBioLink, as it's more self-documenting, allows everything to be stored in a single file, and allows for storing node/edge properties like source and confidence scores.
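
To make the JSON idea concrete, here is a minimal sketch of a single-file layout that carries per-edge properties. The layout is hypothetical and is not the hetnetpy format; the key names, the "EXAMPLE_DB" source label, the GENE node type inferred from the ID prefix, and the reading of the fourth TSV column as a confidence score are all assumptions.

```python
import json


def to_graph_json(nodes, edges):
    """Serialize nodes and edges, with edge properties, into one JSON document.

    nodes: iterable of (node_id, node_type)
    edges: iterable of (source_id, edge_type, target_id, confidence, source_db)
    """
    doc = {
        "nodes": [{"id": node_id, "type": node_type} for node_id, node_type in nodes],
        "edges": [
            {
                "source": source_id,
                "type": edge_type,
                "target": target_id,
                "properties": {"source_db": source_db, "confidence": confidence},
            }
            for source_id, edge_type, target_id, confidence, source_db in edges
        ],
    }
    return json.dumps(doc, indent=2)


# IDs come from the edge file head; "EXAMPLE_DB" is a placeholder, and
# treating 899 as a confidence score is an assumption.
nodes = [("DRUG_774", "DRUG"), ("GENE_150", "GENE")]
edges = [("DRUG_774", "DRUG_REACTION_GENE", "GENE_150", 899, "EXAMPLE_DB")]
graph_json = to_graph_json(nodes, edges)
print(graph_json)
```

Named keys make the file self-describing, and the "properties" object gives a place for the source database and scores without adding TSV columns.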

Issue Analytics

Created 4 years ago
Comments: 7 (5 by maintainers)

Top GitHub Comments

matthias-samwald commented, Jan 28, 2020

The choice of CSV/TSV follows the input formats that link prediction packages usually expect (JSON seems very uncommon there). I’m a bit reluctant to add more large files to the distribution. But if you think this adds a lot of value, we can add another, more explicitly structured data format. We’ll also discuss internally.

dhimmel commented, Mar 6, 2020

We have updated the data documentation

Confirming that I see improved documentation in the README under TSV Writer.

Switch from internal prefixes to CURIEs for main TSV file

Great to see this migration to CURIEs! I’m excited to start using them for my work. Finally a system for interoperability without too much overhead.
