Feature Request: Support TSV and JSON file formats as input data
CSV works well for numerical data, but when you have text data that may contain `,` and `"`, escaping the values in the columns can be tricky, and identifying the comma delimiter is harder for CSV parsers.
Could you add support for data file formats such as TSV and JSON, which largely avoid these problems?
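To make the escaping problem concrete, here is a small illustrative sketch using only Python's standard library. It shows the same record serialized as CSV (where the embedded comma and quote force quoting and quote-doubling), as TSV (where no escaping is needed because the text contains no tabs), and as JSON (where escaping rules are fully specified by the format):

```python
import csv
import io
import json

# A text field containing both the CSV delimiter (,) and the quote char (").
row = ["positive", 'He said, "hello, world"']

# CSV: the writer must quote the field and double the embedded quotes,
# and every reader must implement exactly the same escaping rules.
csv_buf = io.StringIO()
csv.writer(csv_buf).writerow(row)
print(csv_buf.getvalue().strip())
# -> positive,"He said, ""hello, world"""

# TSV: tabs rarely occur in natural text, so the values can be joined as-is.
print("\t".join(row))
# -> positive	He said, "hello, world"

# JSON: escaping is defined by the format itself, not by a dialect.
print(json.dumps(row))
# -> ["positive", "He said, \"hello, world\""]
```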
Issue Analytics
- State:
- Created: 5 years ago
- Reactions: 6
- Comments: 10 (1 by maintainers)
Top GitHub Comments
Thank you for your suggestion. TSV is a no-brainer and will come really soon! For JSON, the structure may be arguable (column-wise? row-wise?). We will have to put some thought into the best solution, but we are definitely considering it.
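The column-wise vs. row-wise question can be illustrated with a short sketch (the layouts and field names below are made up for illustration, not taken from Ludwig):

```python
import json  # noqa: F401  (the structures below are what json.load would yield)

# Row-wise ("records"): one object per example; easy to stream or append.
rowwise = [
    {"text": "great movie", "label": "positive"},
    {"text": "waste of time", "label": "negative"},
]

# Column-wise: one array per feature; compact, but all rows load together.
columnwise = {
    "text": ["great movie", "waste of time"],
    "label": ["positive", "negative"],
}

# The two layouts are interconvertible when every column has equal length.
def rows_to_columns(rows):
    return {key: [row[key] for row in rows] for key in rows[0]}

assert rows_to_columns(rowwise) == columnwise
```

Either layout carries the same information; the choice mainly affects streaming, memory use, and how naturally users can produce the file.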
Sure, definitely. What I suggest you do is take a dataset and train a model with Ludwig first. Take one of the examples on the website, maybe text classification. I suggest you put breakpoints everywhere in the preprocessing.py script to see what actually happens during preprocessing: for instance, how metadata is obtained, how preprocessing parameters are used, and how the final data transformation is performed. You'll notice that each feature type has its own code that implements those things. To begin with, I would suggest looking at just a couple of them, for instance sequence (which is medium complex) and category (which is medium easy). Numerical and binary are the easiest, while images and audio are the most complex at the moment. Another thing you will notice is the use of caching, with HDF5 files for processed data and JSON files for metadata.

After you have an understanding of how the whole process works, you'll realize what makes it kind of tricky to extend the current design with additional data formats. Finally, if you look at that branch that I pointed you to, you'll see a sketch of the design that I would like to follow to make preprocessing more flexible, with pluggable data formats and preprocessing strategies. After you get to that point you'll surely have a lot of questions. Feel free to reach out to me privately and I can answer all of them (that's just to avoid cluttering the GitHub issue with posts). Once you have a clear picture, we can define specific tasks together.
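The "pluggable data formats" idea mentioned above could look something like the following minimal sketch: each format registers a loader that returns column-wise data, so the rest of the preprocessing pipeline never needs to know the on-disk format. All names here (`register_format`, `load_dataset`, and so on) are hypothetical and do not come from the Ludwig codebase or the branch referenced in the comment:

```python
import csv
import json

# Hypothetical registry of pluggable data-format loaders (illustrative only).
LOADERS = {}

def register_format(name):
    """Register a loader under a format name, e.g. 'csv', 'tsv', 'json'."""
    def wrap(fn):
        LOADERS[name] = fn
        return fn
    return wrap

@register_format("csv")
def load_csv(path, delimiter=","):
    # Parse a delimited file into a column-wise dict of lists.
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f, delimiter=delimiter))
    return {k: [r[k] for r in rows] for k in rows[0]} if rows else {}

@register_format("tsv")
def load_tsv(path):
    return load_csv(path, delimiter="\t")

@register_format("json")
def load_json(path):
    with open(path) as f:
        data = json.load(f)
    # Accept either row-wise (list of objects) or column-wise (dict of lists).
    if isinstance(data, list):
        return {k: [r[k] for r in data] for k in data[0]} if data else {}
    return data

def load_dataset(path, fmt):
    """Dispatch to the registered loader; downstream code sees one layout."""
    return LOADERS[fmt](path)
```

With this shape, adding a new format is a single registered function rather than a change to the preprocessing pipeline itself.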