Extend the CSVLoader class to read from different datasources/targets and different kinds of formats
Is your feature request related to a problem? Please describe.
At the moment it appears the CSVLoader can only load .csv files from disk or a file system, which could be a limitation both from a functionality point of view and from a provenance (metadata) recording point of view.
In the provenance data we see the path of the file given during the training process; this path could be invalid if the process was run in a Docker container or by other ephemeral means.
Other (non-Java) libraries allow loading .tgz, .zip, etc. formats, and although this may only save a single step, it can be a boon when trying to manage multiple datasets.
Describe the solution you’d like
CSVLoader, through sub-class implementations, should allow loading:
- files not just from the local file system but also via the web (secure and public sources, e.g. an S3 bucket or GitHub) (see the rough sketch after this list)
- files stored in different formats, e.g. .tgz, .zip (mainly compressed formats)
- data stored in datastores / databases (via login/password or other connection strings)
- additional metadata about the dataset itself, e.g. field definitions and the background of the dataset, or links/resources to them
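To illustrate the first two bullets, something like the following is possible today as a workaround, a minimal sketch only: the class name, method name, and the assumption that the first zip entry is the CSV are all hypothetical, not Tribuo API. Note that the provenance would still record the temporary local path rather than the original URL, which is exactly the gap this request is about.

```java
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipInputStream;

import org.tribuo.DataSource;
import org.tribuo.Output;
import org.tribuo.data.csv.CSVLoader;

public final class RemoteCSVLoading {

    /**
     * Hypothetical helper: downloads a remote CSV (optionally zip-compressed) to a
     * temporary file and delegates to the existing CSVLoader.
     */
    public static <T extends Output<T>> DataSource<T> loadFromUrl(
            CSVLoader<T> loader, URL url, String responseName, boolean zipped) throws Exception {
        Path tmp = Files.createTempFile("tribuo-remote", ".csv");
        try (InputStream raw = url.openStream()) {
            if (zipped) {
                try (ZipInputStream zis = new ZipInputStream(raw)) {
                    zis.getNextEntry(); // assume the first entry is the CSV
                    Files.copy(zis, tmp, StandardCopyOption.REPLACE_EXISTING);
                }
            } else {
                Files.copy(raw, tmp, StandardCopyOption.REPLACE_EXISTING);
            }
        }
        // Delegates to the existing local-file loading, so the recorded provenance
        // points at the temp file, not the original URL.
        return loader.loadDataSource(tmp, responseName);
    }
}
```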
Additional context
Maybe show these (and other) functionalities or features of CSVLoader via notebook tutorials.
This request is actually two-fold:
- file format
- data source (or target) location
Once any or all of these are established, the provenance information would contain a more self-contained description of how to replicate the data loading process.
For example:
TrainTestSplitter(
    class-name = org.tribuo.evaluation.TrainTestSplitter
    source = CSVLoader(
        class-name = org.tribuo.data.csv.CSVLoader
        outputFactory = LabelFactory(
            class-name = org.tribuo.classification.LabelFactory
        )
        response-name = species
        separator = ,
        quote = "
        path = file:/Users/apocock/Development/Tribuo/tutorials/bezdekIris.data
        file-modified-time = 2020-07-06T10:52:01.938-04:00
        resource-hash = 36F668D1CBC29A8C2C1128C5D2F0D400FA04ED4DC62D12246F44CE9360360CC0
    )
    train-proportion = 0.7
    seed = 1
    size = 150
    is-train = true
)
From the above I could not easily recreate the model building process, or even just the data loading process, because path = file:/Users/apocock/Development/Tribuo/tutorials/bezdekIris.data is local to an individual computer system. With paths like path = https://path/to/bezdekIris.data the whole process would be a lot more independent. It would also add value to the provenance metadata, as we would know the original source of the data.
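For reference, provenance like the block above comes from data loading along these lines, shown here as a sketch in the style of the Tribuo classification tutorial (the header names and file name follow that tutorial):

```java
import java.nio.file.Paths;
import org.tribuo.MutableDataset;
import org.tribuo.classification.LabelFactory;
import org.tribuo.data.csv.CSVLoader;
import org.tribuo.evaluation.TrainTestSplitter;

var csvLoader = new CSVLoader<>(new LabelFactory());
// bezdekIris.data has no header row, so the column names are supplied explicitly.
var irisHeaders = new String[]{"sepalLength","sepalWidth","petalLength","petalWidth","species"};
var irisSource = csvLoader.loadDataSource(Paths.get("bezdekIris.data"), "species", irisHeaders);
var splitter = new TrainTestSplitter<>(irisSource, 0.7, 1L);
var trainData = new MutableDataset<>(splitter.getTrain());
// trainData.getProvenance() now embeds the CSVLoader provenance shown above,
// including the absolute local path, which is what makes it hard to replay elsewhere.
```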
I’m finishing off a tutorial on RowProcessor which uses CSVDataSource and JsonDataSource to load more complex columnar data from csv and json files respectively. So concretely there would be:
For the last point I’m not clear what’s required. Tribuo can already connect to things via JDBC, and read delimited and json format inputs. Are there other major formats we should support?
We use ColumnarDataSource as the base class for CSV, Json and SQL format data, so there could be other subclasses of that for other columnar inputs.
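For readers who haven't used the columnar classes mentioned above, wiring them together looks roughly like this. This is a minimal sketch: the file name and the columns "height", "colour" and "label" are made up for illustration, and the forthcoming RowProcessor tutorial is the authoritative reference for the API.

```java
import java.nio.file.Paths;
import java.util.HashMap;
import org.tribuo.MutableDataset;
import org.tribuo.classification.LabelFactory;
import org.tribuo.data.columnar.FieldProcessor;
import org.tribuo.data.columnar.RowProcessor;
import org.tribuo.data.columnar.processors.field.DoubleFieldProcessor;
import org.tribuo.data.columnar.processors.field.IdentityProcessor;
import org.tribuo.data.columnar.processors.response.FieldResponseProcessor;
import org.tribuo.data.csv.CSVDataSource;

var labelFactory = new LabelFactory();
// Map each column to a feature-extraction strategy.
var fieldProcessors = new HashMap<String,FieldProcessor>();
fieldProcessors.put("height", new DoubleFieldProcessor("height"));   // numeric column
fieldProcessors.put("colour", new IdentityProcessor("colour"));      // categorical column
// The "label" column is the response; "UNK" is the placeholder for missing values.
var responseProcessor = new FieldResponseProcessor<>("label", "UNK", labelFactory);
var rowProcessor = new RowProcessor<>(responseProcessor, fieldProcessors);
// true => every row must have a valid response.
var source = new CSVDataSource<>(Paths.get("example.csv"), rowProcessor, true);
var dataset = new MutableDataset<>(source);
```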