
Extend the CSVLoader class to read from different datasources/targets and different kinds of formats

See original GitHub issue

Is your feature request related to a problem? Please describe. At the moment the CSVLoader appears to only load .csv files from disk or a local file system, which is a limitation both from a functionality point of view and from a provenance (metadata) recording point of view.

In the provenance data we see the path of the file given during the training process; this path could be invalid if the process was run in a Docker container or by other ephemeral means.

Other (non-Java) libraries allow loading .tgz, .zip, and similar formats; although decompression may be just a single step, handling it automatically is a boon when managing multiple datasets.

Describe the solution you’d like CSVLoader, through sub-class implementations, should allow loading:

  • files not just from the local file system but also over the web (secure and public sources, e.g. an S3 bucket or GitHub)
  • files stored in compressed formats, e.g. .tgz or .zip
  • data stored in datastores / databases (via login/password or other connection strings)
  • additional metadata about the dataset itself, e.g. field definitions, background on the dataset, or links to related resources
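As a rough sketch of the first two bullets, plain Java can already layer decompression over any input stream before handing it to a CSV parser. The class and method names below are invented for illustration and are not part of Tribuo's API:

```java
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: CsvStreamOpener, wrap and firstLine are invented names,
// not Tribuo classes or methods.
public class CsvStreamOpener {

    /** Layers a GZIPInputStream over the raw bytes when the resource name ends in ".gz". */
    public static InputStream wrap(String name, InputStream raw) throws IOException {
        return name.endsWith(".gz") ? new GZIPInputStream(raw) : raw;
    }

    /** Reads the header line of a (possibly gzipped) CSV resource. */
    public static String firstLine(String name, InputStream raw) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(wrap(name, raw), StandardCharsets.UTF_8));
        return reader.readLine();
    }

    /** Gzips a byte array in memory, handy for exercising wrap() without touching disk. */
    public static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }
}
```

The same wrap step could sit in front of `URI.create(location).toURL().openStream()` to cover web sources, since the decompression decision only depends on the resource name and the byte stream.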

Additional context Perhaps demonstrate these and other CSVLoader features via notebook tutorials.

This request is actually twofold:

  • file format
  • data source (or target) location

Once any or all of these are established, the provenance information can carry a more self-contained description of how to replicate the data-loading process.

For example:


TrainTestSplitter(
	class-name = org.tribuo.evaluation.TrainTestSplitter
	source = CSVLoader(
			class-name = org.tribuo.data.csv.CSVLoader
			outputFactory = LabelFactory(
					class-name = org.tribuo.classification.LabelFactory
				)
			response-name = species
			separator = ,
			quote = "
			path = file:/Users/apocock/Development/Tribuo/tutorials/bezdekIris.data
			file-modified-time = 2020-07-06T10:52:01.938-04:00
			resource-hash = 36F668D1CBC29A8C2C1128C5D2F0D400FA04ED4DC62D12246F44CE9360360CC0
		)
	train-proportion = 0.7
	seed = 1
	size = 150
	is-train = true
)

From the above I could not easily recreate the model-building process, or even just the data loading, because path = file:/Users/apocock/Development/Tribuo/tutorials/bezdekIris.data is local to an individual computer. A path like path = https://path/to/bezdekIris.data would make the whole process far more machine-independent, and would also add value to the provenance metadata, since we would know the original source of the data.
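The resource-hash field in the provenance above is what would make a URL-based path verifiable: anyone re-downloading the data could recompute the digest and compare it against the recorded value. A minimal sketch of that check in plain Java follows; the ResourceHasher name is invented, but the digest format (uppercase hex SHA-256) matches the provenance printout above:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: recompute the uppercase hex SHA-256 digest in the same
// form as the resource-hash provenance field, so a re-download can be verified.
public class ResourceHasher {

    public static String sha256Hex(byte[] data) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }
}
```

Comparing sha256Hex(downloadedBytes) against the stored resource-hash would confirm the remote file is byte-identical to the one used in training.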

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
Craigacp commented, Oct 20, 2020

I’m finishing off a tutorial on RowProcessor which uses CSVDataSource and JsonDataSource to load more complex columnar data from CSV and JSON files respectively.

0 reactions
Craigacp commented, Oct 20, 2020

So concretely there would be:

  • optional loading of gzip or zip compressed files through the data sources
  • loading files over the web (most libraries that do this provide a caching mechanism, which would require some design work, especially as the provenance hash currently reads the file a second time and it would be bad to download it twice)
  • mechanism for adding additional metadata to a datasource (e.g. additional provenance information on construction? or something else)
  • support for other data formats
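The caching concern in the second bullet can be handled in a single pass: copy the remote stream to a local file while feeding it through a DigestInputStream, so the provenance hash is computed without reading, let alone downloading, the data twice. A hypothetical sketch (CachingFetcher is not a Tribuo class; the hash format mirrors the resource-hash field shown earlier):

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.security.DigestInputStream;
import java.security.MessageDigest;

// Hypothetical sketch: download once, hashing as we go, so neither the
// provenance hash nor the parser needs to re-fetch the remote resource.
public class CachingFetcher {

    /** Copies the remote stream to cacheFile and returns its uppercase hex SHA-256. */
    public static String fetchToCache(InputStream remote, Path cacheFile) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        try (DigestInputStream din = new DigestInputStream(remote, md)) {
            // Files.copy drains the stream to disk; the digest updates as bytes pass through.
            Files.copy(din, cacheFile, StandardCopyOption.REPLACE_EXISTING);
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }
}
```

The parser would then read from cacheFile, and the returned hash would go straight into the provenance record.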

For the last point I’m not clear what’s required. Tribuo can already connect to things via JDBC, and read delimited and JSON format inputs. Are there other major formats we should support?

We use ColumnarDataSource as the base class for CSV, JSON and SQL format data, so there could be other subclasses of that for other columnar inputs.


