Feature Request: Support TSV and JSON file formats as input data

CSV is good for numerical data, but when you have text data that may contain commas (,) and double quotes ("), escaping the values in the columns can be tricky, and it is harder for CSV parsers to identify which commas are delimiters.

Can you add support for the TSV and JSON data file formats, which suffer from these problems much less?
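To make the problem concrete, here is a minimal sketch that uses only Python's standard `csv` and `json` modules. The sample row and the helper names (`write_delimited`, `read_delimited`) are made up for illustration and are not part of Ludwig. It writes the same text row as CSV, TSV and JSON, and shows that a naive split on commas breaks the CSV line while a proper parser round-trips all three formats.

```python
import csv
import io
import json

# Hypothetical sample row: the text column contains both commas and double quotes.
row = {"text": 'She said, "hi, there"', "label": "greeting"}


def write_delimited(rows, delimiter):
    """Serialize a list of dicts as delimited text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=list(rows[0]), delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def read_delimited(text, delimiter):
    """Parse delimited text with a header line back into a list of dicts."""
    return [dict(r) for r in csv.DictReader(io.StringIO(text), delimiter=delimiter)]


csv_text = write_delimited([row], ",")
tsv_text = write_delimited([row], "\t")
json_text = json.dumps([row])

# Splitting the CSV data line on commas yields more pieces than the two real columns.
naive_fields = csv_text.splitlines()[1].split(",")

# A real parser recovers the original row from every format.
csv_rows = read_delimited(csv_text, ",")
tsv_rows = read_delimited(tsv_text, "\t")
json_rows = json.loads(json_text)
```

In the TSV file the commas no longer collide with the delimiter, and in JSON the column boundaries are explicit, so only the double quotes inside the string need escaping.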

Issue Analytics

State:
Created 5 years ago
Reactions: 6
Comments: 10 (1 by maintainers)

Top GitHub Comments

2 reactions

w4nderlust commented, Feb 13, 2019

Thank you for your suggestion. TSV is a no-brainer and will come really soon! For JSON, the structure is debatable (column-wise? row-wise?). I will have to put some thought into the best solution, but I am definitely considering it.
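The column-wise versus row-wise question can be illustrated with a short sketch. The sample data and the helper names (`rows_to_columns`, `columns_to_rows`) are hypothetical and only show the two candidate layouts; they do not describe how Ludwig reads JSON. The sketch assumes every row has the same keys.

```python
import json

# Row-wise layout: a list with one object per example.
rows = [
    {"text": "good movie", "label": "pos"},
    {"text": "bad, really bad", "label": "neg"},
]


def rows_to_columns(rows):
    """Convert a row-wise list of dicts into a column-wise dict of lists."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def columns_to_rows(columns):
    """Convert a column-wise dict of lists back into a row-wise list of dicts."""
    keys = list(columns)
    return [dict(zip(keys, values)) for values in zip(*columns.values())]


# Column-wise layout: one object with one list per column.
columns = rows_to_columns(rows)

row_wise_json = json.dumps(rows)
column_wise_json = json.dumps(columns)
```

Both layouts carry the same information, so a loader could accept either one as long as it knows which it is looking at.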

1 reaction

w4nderlust commented, Apr 24, 2020

Sure, definitely. What I suggest is to take a dataset and train a model with Ludwig first. Take one of the examples on the website, maybe text classification. I suggest you put breakpoints everywhere in the preprocessing.py script to see what actually happens during preprocessing: for instance, how metadata is obtained, how preprocessing parameters are used, and how the final data transformation is performed.

You’ll notice that each feature type has its own features that implement those things. To begin with, I would suggest looking at just a couple of them, for instance sequence (which is medium complex) and category (which is medium easy). Numerical and binary are the easiest, while images and audio are the most complex at the moment. Another thing you will notice is the use of caching, with HDF5 files for processed data and JSON files for metadata.

After you have an understanding of how the whole process works, you’ll realize what makes it kind of tricky to extend the current design with additional data formats. Finally, if you look at the branch that I pointed you to, you’ll see a sketch of the design that I would like to follow to make preprocessing more flexible, with pluggable data formats and preprocessing strategies.

Once you get to that point, you’ll surely have a lot of questions. Feel free to reach out to me privately and I can answer all of them (that’s just to spare the GitHub issue from extra posts). After you have a clear picture, we can define specific tasks together.
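As a rough illustration of what "pluggable data formats" could mean, here is a generic registry sketch in plain Python. It is not the design from the branch mentioned above and does not use any Ludwig API; the names `READERS`, `register_reader` and `load_rows` are hypothetical.

```python
import csv
import io
import json

# Hypothetical registry mapping a file extension to a reader function.
READERS = {}


def register_reader(extension):
    """Decorator that registers a reader function for a file extension."""
    def decorator(func):
        READERS[extension] = func
        return func
    return decorator


@register_reader(".csv")
def read_csv(text):
    return [dict(r) for r in csv.DictReader(io.StringIO(text))]


@register_reader(".tsv")
def read_tsv(text):
    return [dict(r) for r in csv.DictReader(io.StringIO(text), delimiter="\t")]


@register_reader(".json")
def read_json(text):
    # Assumes a row-wise JSON document: a list with one object per example.
    return json.loads(text)


def load_rows(filename, text):
    """Pick a reader by file extension and return the rows as a list of dicts."""
    for extension, reader in READERS.items():
        if filename.endswith(extension):
            return reader(text)
    raise ValueError(f"unsupported data format: {filename}")
```

With a registry like this, supporting a new input format means adding one reader function, while the rest of the preprocessing only ever sees a list of rows.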
