Extend the CSVLoader class to read from different datasources/targets and different kinds of formats
Is your feature request related to a problem? Please describe.
At the moment it appears the CSVLoader can only load .csv files from disk or a file system, which could be a limitation both from a functionality point of view and from a provenance (metadata) recording point of view.
In the provenance data we see the path of the file given during the training process; this path could be invalid if the process was run in a Docker container or by other ephemeral means.
Other (non-Java) libraries allow loading .tgz, .zip, etc. formats, and although this may only save a single step, it can be a boon when trying to manage multiple datasets.
Describe the solution you’d like
CSVLoader, through sub-class implementations, should allow loading:
- files not just from the local file system but also via the web (secure and public sources, e.g. an S3 bucket or GitHub) (see the rough sketch after this list)
- files stored in different formats, e.g. .tgz, .zip (mainly compressed formats)
- data stored in datastores / databases (via login/password or other connection strings)
- additional metadata about the dataset itself, e.g. field definitions and the background of the dataset, or links/resources to them
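To illustrate the first two bullets, something like the following is possible today as a workaround, a minimal sketch only: the class name, method name, and the assumption that the first zip entry is the CSV are all hypothetical, not Tribuo API. Note that the provenance would still record the temporary local path rather than the original URL, which is exactly the gap this request is about.

```java
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipInputStream;

import org.tribuo.DataSource;
import org.tribuo.Output;
import org.tribuo.data.csv.CSVLoader;

public final class RemoteCSVLoading {

    /**
     * Hypothetical helper: downloads a remote CSV (optionally zip-compressed) to a
     * temporary file and delegates to the existing CSVLoader.
     */
    public static <T extends Output<T>> DataSource<T> loadFromUrl(
            CSVLoader<T> loader, URL url, String responseName, boolean zipped) throws Exception {
        Path tmp = Files.createTempFile("tribuo-remote", ".csv");
        try (InputStream raw = url.openStream()) {
            if (zipped) {
                try (ZipInputStream zis = new ZipInputStream(raw)) {
                    zis.getNextEntry(); // assume the first entry is the CSV
                    Files.copy(zis, tmp, StandardCopyOption.REPLACE_EXISTING);
                }
            } else {
                Files.copy(raw, tmp, StandardCopyOption.REPLACE_EXISTING);
            }
        }
        // Delegates to the existing local-file loading, so the recorded provenance
        // points at the temp file, not the original URL.
        return loader.loadDataSource(tmp, responseName);
    }
}
```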
Additional context
Maybe show these (and other) functionalities or features of CSVLoader via notebook tutorials.
This request is actually two-fold:
- file format
- data source (or target) location
Once any or all of these are established, the provenance information would contain a more self-contained description of how to replicate the data loading process.
For example:
TrainTestSplitter(
    class-name = org.tribuo.evaluation.TrainTestSplitter
    source = CSVLoader(
        class-name = org.tribuo.data.csv.CSVLoader
        outputFactory = LabelFactory(
            class-name = org.tribuo.classification.LabelFactory
        )
        response-name = species
        separator = ,
        quote = "
        path = file:/Users/apocock/Development/Tribuo/tutorials/bezdekIris.data
        file-modified-time = 2020-07-06T10:52:01.938-04:00
        resource-hash = 36F668D1CBC29A8C2C1128C5D2F0D400FA04ED4DC62D12246F44CE9360360CC0
    )
    train-proportion = 0.7
    seed = 1
    size = 150
    is-train = true
)
From the above I could not easily recreate the model building process, or even just the data loading process, because path = file:/Users/apocock/Development/Tribuo/tutorials/bezdekIris.data is local to an individual computer system. With paths like path = https://path/to/bezdekIris.data the whole process would be a lot more independent. It would also add value to the provenance metadata, as we would know the original source of the data.
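For reference, provenance like the block above comes from data loading along these lines, shown here as a sketch in the style of the Tribuo classification tutorial (the header names and file name follow that tutorial):

```java
import java.nio.file.Paths;
import org.tribuo.MutableDataset;
import org.tribuo.classification.LabelFactory;
import org.tribuo.data.csv.CSVLoader;
import org.tribuo.evaluation.TrainTestSplitter;

var csvLoader = new CSVLoader<>(new LabelFactory());
// bezdekIris.data has no header row, so the column names are supplied explicitly.
var irisHeaders = new String[]{"sepalLength","sepalWidth","petalLength","petalWidth","species"};
var irisSource = csvLoader.loadDataSource(Paths.get("bezdekIris.data"), "species", irisHeaders);
var splitter = new TrainTestSplitter<>(irisSource, 0.7, 1L);
var trainData = new MutableDataset<>(splitter.getTrain());
// trainData.getProvenance() now embeds the CSVLoader provenance shown above,
// including the absolute local path, which is what makes it hard to replay elsewhere.
```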
I’m finishing off a tutorial on RowProcessor which uses CSVDataSource and JsonDataSource to load more complex columnar data from csv and json files respectively. So concretely there would be:
For the last point I’m not clear what’s required. Tribuo can already connect to things via JDBC, and read delimited and json format inputs. Are there other major formats we should support?
We use ColumnarDataSource as the base class for CSV, Json and SQL format data, so there could be other subclasses of that for other columnar inputs.
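For readers who haven't used the columnar classes mentioned above, wiring them together looks roughly like this. This is a minimal sketch: the file name and the columns "height", "colour" and "label" are made up for illustration, and the forthcoming RowProcessor tutorial is the authoritative reference for the API.

```java
import java.nio.file.Paths;
import java.util.HashMap;
import org.tribuo.MutableDataset;
import org.tribuo.classification.LabelFactory;
import org.tribuo.data.columnar.FieldProcessor;
import org.tribuo.data.columnar.RowProcessor;
import org.tribuo.data.columnar.processors.field.DoubleFieldProcessor;
import org.tribuo.data.columnar.processors.field.IdentityProcessor;
import org.tribuo.data.columnar.processors.response.FieldResponseProcessor;
import org.tribuo.data.csv.CSVDataSource;

var labelFactory = new LabelFactory();
// Map each column to a feature-extraction strategy.
var fieldProcessors = new HashMap<String,FieldProcessor>();
fieldProcessors.put("height", new DoubleFieldProcessor("height"));   // numeric column
fieldProcessors.put("colour", new IdentityProcessor("colour"));      // categorical column
// The "label" column is the response; "UNK" is the placeholder for missing values.
var responseProcessor = new FieldResponseProcessor<>("label", "UNK", labelFactory);
var rowProcessor = new RowProcessor<>(responseProcessor, fieldProcessors);
// true => every row must have a valid response.
var source = new CSVDataSource<>(Paths.get("example.csv"), rowProcessor, true);
var dataset = new MutableDataset<>(source);
```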