Improve data formats for network download
I wasn’t able to find much documentation of the contents of HQ_DIR.zip (or the other three benchmark datasets) in the readme or in the zip archive.
The head of graph_files/nodes.csv is:
DRUG_53326634 DRUG
DRUG_11538251 DRUG
DRUG_174597 DRUG
The head of graph_files/edges.csv is:
DRUG_774 DRUG_REACTION_GENE GENE_150 899
DRUG_5815 DRUG_REACTION_GENE GENE_5739 899
DRUG_6022 DRUG_REACTION_GENE GENE_101929876 989
Given that you want this network to become a benchmark, it’s important to document every file that users may need to interact with. There are some ways to improve the situation regarding the TSVs:
- add column headers for self-documentation
- add a readme that includes the head of each file and explains what the rows correspond to
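As a rough sketch of the first suggestion, the snippet below prepends a header row to a headerless edge file. It assumes the files are tab-delimited, and the column names in `EDGE_COLUMNS` are hypothetical: the archive does not document them, and in particular the meaning of the fourth column (`score` here) is a guess.

```python
import csv
import io
from typing import Sequence

# Hypothetical column names; the archive does not document them.
# The meaning of the fourth column is a guess.
EDGE_COLUMNS = ["source_node", "edge_type", "target_node", "score"]


def add_header(headerless_tsv: str, columns: Sequence[str]) -> str:
    """Return the TSV text with a header row prepended."""
    reader = csv.reader(io.StringIO(headerless_tsv), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != len(columns):
            raise ValueError(f"expected {len(columns)} fields, got {len(row)}: {row}")
        writer.writerow(row)
    return out.getvalue()


# Sample rows copied from the head of graph_files/edges.csv (tabs assumed).
sample = (
    "DRUG_774\tDRUG_REACTION_GENE\tGENE_150\t899\n"
    "DRUG_5815\tDRUG_REACTION_GENE\tGENE_5739\t899\n"
)
with_header = add_header(sample, EDGE_COLUMNS)
print(with_header)
```

The field-count check is there so that a wrong guess about the delimiter or the number of columns fails loudly instead of producing a mislabeled file.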
Furthermore, I think the TSV format could be improved:
- an extra column could indicate the source database, which is important for complying with upstream licenses
- additional, more expressive formats could be added. For Hetionet, we release the data in four formats: JSON, Neo4j, TSV (similar to what OpenBioLink does now), and matrix. The JSON was created using the hetnetpy package and the matrices using hetmatpy. I think at least the JSON format would be a valuable addition for OpenBioLink, as it’s more self-documenting, allows everything to be stored in a single file, and allows for storing node/edge properties like source and confidence scores.
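To illustrate what a single-file JSON export with node/edge properties could look like, here is a minimal sketch. The layout is hypothetical and is not the hetnetpy JSON schema; the function name `build_graph_document`, the `score` and `source_db` property names, the `"EXAMPLE_DB"` value, and the `GENE` node type (inferred from the ID prefix) are all assumptions made for illustration.

```python
import json


def build_graph_document(node_rows, edge_rows, source_db):
    """Combine node and edge rows into one JSON-serializable dict.

    Hypothetical layout for illustration only, not the hetnetpy schema.
    """
    nodes = [{"id": node_id, "type": node_type} for node_id, node_type in node_rows]
    edges = [
        {
            "source": source,
            "type": edge_type,
            "target": target,
            # Property names are assumptions; the fourth column is undocumented.
            "properties": {"score": int(score), "source_db": source_db},
        }
        for source, edge_type, target, score in edge_rows
    ]
    return {"nodes": nodes, "edges": edges}


node_rows = [("DRUG_774", "DRUG"), ("GENE_150", "GENE")]
edge_rows = [("DRUG_774", "DRUG_REACTION_GENE", "GENE_150", "899")]

# "EXAMPLE_DB" is a placeholder, not the real upstream source of this edge.
document = build_graph_document(node_rows, edge_rows, source_db="EXAMPLE_DB")
print(json.dumps(document, indent=2))
```

Because each edge carries its own `properties` object, per-edge provenance and scores travel with the data rather than living in a separate readme.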
Issue Analytics
- State:
- Created 4 years ago
- Comments: 7 (5 by maintainers)

Comment: The choice of CSV/TSV follows the file formats that link prediction packages usually accept as input (JSON seems very uncommon here). I’m a bit reluctant to add more large files to the distribution. But if you think this adds a lot of value, we can add another, more explicitly structured data format. We’ll also discuss this internally.
Confirming that I see improved documentation in the README under TSV Writer.
Great to see this migration to CURIEs! I’m excited to start using them for my work. Finally a system for interoperability without too much overhead.