CSVDataSource ignoring first column
The first column in my CSV file is always ignored, i.e. it doesn't get loaded into the feature map.
Example: if Role is the first column in the file, it doesn't appear when the feature map is logged. If I insert a "dummy" column first, then Role appears.
Map<String, FieldProcessor> fieldProcessorMap = new HashMap<>();
fieldProcessorMap.put("YearsExperience", new DoubleFieldProcessor("YearsExperience"));
fieldProcessorMap.put("Location", new IdentityProcessor("Location"));
fieldProcessorMap.put("Role", new IdentityProcessor("Role"));
var responseProcessor = new FieldResponseProcessor<>("CurrentSalary", "0", new LabelFactory());
var id = new IntExtractor("ID");
var rowProcessor = new RowProcessor<>(Collections.singletonList(id), null, responseProcessor, fieldProcessorMap, Collections.emptySet());
var csvDataSource = new CSVDataSource<>(Paths.get("data/dummy/sample1.csv"), rowProcessor, true);
var dataSplitter = new TrainTestSplitter<>(csvDataSource, 0.7, 1L);
var trainingDataset = new MutableDataset<>(dataSplitter.getTrain());
var testingDataset = new MutableDataset<>(dataSplitter.getTest());
for (var i : trainingDataset.getFeatureMap()) {
    log.info(i);
}
Issue Analytics
- Created 3 years ago
- Comments: 6 (4 by maintainers)
Thanks Craig for a very quick diagnosis! Excel 365 on Windows also has this behaviour, it seems. I had opened the file in Notepad++ to check the formatting and it looked fine, but I hadn't checked the encoding.
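For anyone hitting the same symptom, a quick way to confirm the diagnosis programmatically is to look at the file's first bytes: a UTF-8 BOM shows up as the three bytes EF BB BF ahead of the first header character, so CSVDataSource sees a header name like "\uFEFFRole" instead of "Role". A minimal sketch (the BomCheck class name and the sample byte arrays are illustrative; in practice you'd pass in the result of Files.readAllBytes on your CSV):

```java
import java.util.Arrays;

public class BomCheck {
    // Returns true if the byte array starts with the UTF-8 BOM (EF BB BF).
    public static boolean hasUtf8Bom(byte[] bytes) {
        return bytes.length >= 3
                && bytes[0] == (byte) 0xEF
                && bytes[1] == (byte) 0xBB
                && bytes[2] == (byte) 0xBF;
    }

    public static void main(String[] args) {
        // Simulates the first bytes of a header row saved as "CSV UTF-8" by Excel...
        byte[] excelUtf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'R', 'o', 'l', 'e'};
        // ...versus one saved as plain "CSV (Comma delimited)".
        byte[] plain = {'R', 'o', 'l', 'e'};
        System.out.println(hasUtf8Bom(excelUtf8)); // true
        System.out.println(hasUtf8Bom(plain));     // false
    }
}
```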
I've experimented, and unhelpfully the "default" (i.e. top of the file-type list) CSV UTF-8 option in Excel adds a BOM (even though in theory UTF-8 shouldn't have one… thanks, Microsoft?). Ten items further down the list is CSV (Comma delimited), which doesn't add the BOM and so works fine with Tribuo, so I'll use that going forward.
I’m surprised I’ve never encountered this issue before as I’ve been using CSV files with Java for a while. Interestingly, looking at Jackson, they gave in and implemented auto-detection of a BOM:
https://github.com/FasterXML/jackson-dataformat-csv/blob/4e2b89f8c904b1dcb2df3e33bc5b3990314ab095/src/main/java/com/fasterxml/jackson/dataformat/csv/impl/CsvParserBootstrapper.java#L294
Apache Commons CSV doesn't auto-detect it for you, but the companion Apache Commons IO library offers BOMInputStream, which you can use to transparently wrap an InputStream and discard the BOM if one is found.
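For completeness, what BOMInputStream does can be sketched in plain JDK code with a PushbackInputStream: peek at the first three bytes and push them back unless they are the UTF-8 BOM. (BomStripper is a hypothetical name for illustration, not a Tribuo or Commons API; this sketch only handles the UTF-8 BOM, whereas BOMInputStream can also handle UTF-16/UTF-32 marks.)

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

public class BomStripper {
    // The UTF-8 byte order mark is the three bytes EF BB BF.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Wraps a stream, silently consuming a leading UTF-8 BOM if present.
    public static InputStream stripBom(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, UTF8_BOM.length);
        byte[] head = new byte[UTF8_BOM.length];
        int read = pb.read(head, 0, head.length);
        boolean isBom = read == UTF8_BOM.length
                && head[0] == UTF8_BOM[0]
                && head[1] == UTF8_BOM[1]
                && head[2] == UTF8_BOM[2];
        if (!isBom && read > 0) {
            pb.unread(head, 0, read); // not a BOM: push the bytes back for the reader
        }
        return pb;
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'R', 'o', 'l', 'e'};
        InputStream clean = stripBom(new ByteArrayInputStream(withBom));
        // The BOM has been consumed, so the header name reads cleanly.
        System.out.println(new String(clean.readAllBytes(), StandardCharsets.UTF_8)); // Role
    }
}
```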
It seems OpenCSV are slightly less generous in solutions to the problem.
This will come in the 4.0.2 bug fix release which will happen sometime in the next week or so. I’d hoped for it to happen this week, but I’d like to get the external models tutorial into this release too, and we’re also waiting for the OLCUT 5.1.5 dependency to arrive in Maven Central as that has a necessary bug fix for the provenance system.
In general, the frequency of bug fix releases will be driven by the severity of the bugs. 4.0.1 has some serious bugs in JsonDataSource and the RegressionInfo object, and we'd like to get those fixed very soon. We'll pull in the docs-updates branch we're working on and the CSV BOM fix, among other smaller things. Feature releases are likely to happen on a longer cadence than monthly, as we're aiming for feature-based releases rather than time-based ones.