Feature Request: Support TSV and JSON file formats as input data


CSV works well for numerical data, but when text data may contain commas (,) and double quotes ("), escaping the column values gets tricky and CSV parsers have a harder time identifying the delimiter comma.

Can you add support for the TSV and JSON data file formats, which suffer from these problems much less?
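
The escaping problem described above can be illustrated with a minimal sketch using only the Python standard library. The sample text and label are made up for illustration, and the TSV part assumes the text itself contains no tab characters.

```python
import csv
import io

# Hypothetical sample row: free text containing both commas and double quotes.
text = 'He said, "hello, world"'
label = "greeting"

# Naive CSV: joining and splitting on commas breaks the text into pieces.
naive_csv_line = f"{text},{label}"
naive_fields = naive_csv_line.split(",")  # four fields instead of two

# Proper CSV: the writer must quote the field and double the inner quotes.
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerow([text, label])
csv_line = buf.getvalue()
parsed = next(csv.reader(io.StringIO(csv_line)))  # round-trips correctly

# TSV: as long as the text has no tabs, a plain join/split is enough.
tsv_line = "\t".join([text, label])
tsv_fields = tsv_line.split("\t")
```

The CSV round trip only works when both the writer and the reader follow the same quoting rules, which is the fragility the request is pointing at.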

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Reactions: 6
  • Comments: 10 (1 by maintainers)

Top GitHub Comments

2 reactions
w4nderlust commented, Feb 13, 2019

Thank you for your suggestion. TSV is a no-brainer and will come really soon! For JSON, the structure is debatable (column-wise? row-wise?). I will have to put some thought into the best solution, but I am definitely considering it.
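
To make the row-wise versus column-wise question concrete, here is a small sketch of the two JSON layouts for the same data. The field names and the helper functions rows_to_columns and columns_to_rows are hypothetical, not part of Ludwig, and the sketch assumes every row has the same keys.

```python
import json

# Row-wise layout: a list of records, one dict per example.
row_wise = [
    {"text": "a, b", "label": "x"},
    {"text": 'say "hi"', "label": "y"},
]


def rows_to_columns(rows):
    """Convert a list of records into a dict of column lists."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def columns_to_rows(columns):
    """Convert a dict of equal-length column lists back into records."""
    keys = list(columns)
    return [dict(zip(keys, values)) for values in zip(*columns.values())]


# Column-wise layout: one list per feature.
column_wise = rows_to_columns(row_wise)

# Both layouts serialize to JSON without any manual escaping of , or ".
row_json = json.dumps(row_wise)
column_json = json.dumps(column_wise)
```

Either layout carries the same information, so the choice mostly affects how naturally the file maps onto per-feature preprocessing versus per-example streaming.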

1 reaction
w4nderlust commented, Apr 24, 2020

Sure, definitely. What I suggest is to take a dataset and train a model with Ludwig first. Take one of the examples on the website, maybe text classification. I suggest putting breakpoints everywhere in the preprocessing.py script to see what actually happens during preprocessing: for instance, how metadata is obtained, how preprocessing parameters are used, and how the final data transformation is performed.

You’ll notice that each feature type has its own features that implement those things. To begin with, I would suggest looking at just a couple of them, for instance sequence (which is medium complex) and category (which is medium easy). Numerical and binary are the easiest, while images and audio are the most complex at the moment. Another thing you will notice is the use of caching, with HDF5 files for processed data and JSON files for metadata.

Once you understand how the whole process works, you’ll realize what makes it kind of tricky to extend the current design with additional data formats. Finally, if you look at the branch that I pointed you to, you’ll see a sketch of the design that I would like to follow to make preprocessing more flexible, with pluggable data formats and preprocessing strategies. After you get to that point, you’ll surely have a lot of questions. Feel free to reach out to me privately and I can answer all of them (that’s just to spare the GitHub issue from extra posts). Once you have a clear picture, we can define specific tasks together.
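
As a rough illustration of what "pluggable data formats" could mean, here is a minimal registry sketch. It is not Ludwig's code and not the design from the branch mentioned above; READERS, register_reader, and load_rows are hypothetical names, and the readers take in-memory text to keep the example self-contained.

```python
import csv
import io
import json

# Hypothetical registry mapping a format name to a reader function.
READERS = {}


def register_reader(name):
    """Decorator that registers a reader under the given format name."""
    def decorator(func):
        READERS[name] = func
        return func
    return decorator


@register_reader("csv")
def read_csv(text):
    return list(csv.DictReader(io.StringIO(text)))


@register_reader("tsv")
def read_tsv(text):
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


@register_reader("json")
def read_json(text):
    # Assumes a row-wise layout: a JSON list of records.
    return json.loads(text)


def load_rows(text, data_format):
    """Dispatch to the registered reader; every reader returns row dicts."""
    try:
        reader = READERS[data_format]
    except KeyError:
        raise ValueError(f"unsupported data format: {data_format!r}") from None
    return reader(text)
```

With this shape, supporting another format means registering one more reader, while the code that consumes the rows stays unchanged.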
