Improve data formats for network download

I wasn’t able to find much documentation of the contents of HQ_DIR.zip (or the other three benchmark datasets) in the readme or in the zip archive.

The head of graph_files/nodes.csv is:

DRUG_53326634	DRUG
DRUG_11538251	DRUG
DRUG_174597	DRUG

The head of graph_files/edges.csv is:

DRUG_774        DRUG_REACTION_GENE      GENE_150        899
DRUG_5815       DRUG_REACTION_GENE      GENE_5739       899
DRUG_6022       DRUG_REACTION_GENE      GENE_101929876  989

Given that you want this network to become a benchmark, it’s important to document every file that users may need to interact with. There are some ways to improve the situation regarding the TSVs:

  • add column headers for self-documentation
  • add a readme that includes the head of each file and explains what the rows correspond to
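
As a rough illustration of the first suggestion, here is a minimal Python sketch that prepends a header row to a headerless TSV. The column names (`node_id`, `node_type`, `source_id`, `edge_type`, `target_id`, `value`) are hypothetical: the archive does not document them, and the meaning of the fourth edge column in particular is only a guess.

```python
import csv
import io

# Hypothetical column names; the archive itself does not document them.
NODE_COLUMNS = ["node_id", "node_type"]
EDGE_COLUMNS = ["source_id", "edge_type", "target_id", "value"]


def add_header(headerless_tsv: str, columns: list[str]) -> str:
    """Return the TSV text with a header row prepended.

    Raises ValueError if a row does not have one field per column.
    """
    reader = csv.reader(io.StringIO(headerless_tsv), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    for row in reader:
        if len(row) != len(columns):
            raise ValueError(
                f"expected {len(columns)} fields, got {len(row)}: {row}"
            )
        writer.writerow(row)
    return out.getvalue()


# The head of graph_files/nodes.csv as quoted in this issue.
nodes_head = "DRUG_53326634\tDRUG\nDRUG_11538251\tDRUG\nDRUG_174597\tDRUG\n"
print(add_header(nodes_head, NODE_COLUMNS))
```

The column-count check is there so that a mismatch between the assumed header and the real file fails loudly rather than producing a mislabeled table.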

Furthermore, I think the TSV format could be improved:

  • an extra column could indicate the source database, which is important for complying with upstream licenses
  • additional, more expressive formats could be added. For Hetionet, we release the data in four formats: JSON, Neo4j, TSV (similar to what OpenBioLink does now), and matrix. The JSON was created using the hetnetpy package and the matrices using hetmatpy. I think at least the JSON format would be a valuable addition for OpenBioLink, as it's more self-documenting, allows everything to be stored in a single file, and allows for storing node/edge properties like source and confidence scores.
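
To make the single-file idea concrete, here is a minimal Python sketch that bundles node and edge rows into one JSON document with a per-edge source property. The layout and key names (`nodes`, `edges`, `source_db`, `value`) are illustrative assumptions, not the hetnetpy schema, and the `GENE` node type and `example_db` source are made up for the example.

```python
import json


def to_graph_json(nodes, edges, source_db="unknown"):
    """Bundle node and edge rows into one JSON document.

    The layout and key names are illustrative assumptions,
    not the format produced by hetnetpy.
    """
    graph = {
        "nodes": [
            {"id": node_id, "type": node_type}
            for node_id, node_type in nodes
        ],
        "edges": [
            {
                "source": source,
                "type": edge_type,
                "target": target,
                "value": value,          # meaning of this column is assumed
                "source_db": source_db,  # hypothetical provenance property
            }
            for source, edge_type, target, value in edges
        ],
    }
    return json.dumps(graph, indent=2)


# One edge taken from the head of graph_files/edges.csv quoted above.
nodes = [("DRUG_774", "DRUG"), ("GENE_150", "GENE")]
edges = [("DRUG_774", "DRUG_REACTION_GENE", "GENE_150", 899)]
print(to_graph_json(nodes, edges, source_db="example_db"))
```

Because every value sits under a named key, a reader can tell what each field means without consulting separate documentation, which is the self-documenting property mentioned above.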

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

1 reaction
matthias-samwald commented, Jan 28, 2020

The choice of CSV/TSV follows the file formats that link prediction packages usually take as input (JSON seems very uncommon here). I’m a bit reluctant to add more large files to the distribution. But if you think this adds a lot of value, we can add another, more explicitly structured data format. We’ll also discuss this internally.

0 reactions
dhimmel commented, Mar 6, 2020

We have updated the data documentation

Confirming that I see improved documentation in the README under TSV Writer.

Switch from internal prefixes to CURIEs for main TSV file

Great to see this migration to CURIEs! I’m excited to start using them for my work. Finally a system for interoperability without too much overhead.
