CSVDataSource ignoring first column

The first column in my CSV file is always ignored, i.e. it doesn’t get loaded into the feature map.

Example: if Role is the first column in the file, it doesn’t appear when the feature map is logged. If I insert a “dummy” column before it, then Role appears.

        Map<String, FieldProcessor> fieldProcessorMap = new HashMap<>();

        fieldProcessorMap.put("YearsExperience", new DoubleFieldProcessor("YearsExperience") );
        fieldProcessorMap.put("Location", new IdentityProcessor("Location") );
        fieldProcessorMap.put("Role", new IdentityProcessor("Role"));

        var responseProcessor = new FieldResponseProcessor<>("CurrentSalary", "0", new LabelFactory());

        var id = new IntExtractor("ID");

        var rowProcessor = new RowProcessor<>(Collections.singletonList(id), null, responseProcessor, fieldProcessorMap, Collections.emptySet());

        var csvDataSource = new CSVDataSource<>(Paths.get("data/dummy/sample1.csv"), rowProcessor, true);

        var dataSplitter = new TrainTestSplitter<>(csvDataSource, 0.7, 1L);

        var trainingDataset = new MutableDataset<>(dataSplitter.getTrain());
        var testingDataset = new MutableDataset<>(dataSplitter.getTest());

        for (var i : trainingDataset.getFeatureMap())
            log.info(i);

leccelecce commented, Oct 16, 2020

Thanks Craig for a very quick diagnosis! Excel 365 on Windows also has this behaviour, it seems. I had opened the file in Notepad++ to check the formatting and it looked fine, but I hadn’t checked the encoding.

I’ve experimented and, unhelpfully, the “default” (i.e. top of the file type list) CSV UTF-8 option in Excel adds a BOM (even though in theory UTF-8 shouldn’t need one…thanks Microsoft?). Ten items lower down the list is CSV (Comma delimited), which doesn’t add the BOM and so works fine with Tribuo, so I’ll use that going forward.
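
For anyone wondering why a BOM makes the first column vanish, here is a minimal sketch using only the JDK. It is not Tribuo's parsing code: the sample data, the class name `BomHeaderDemo`, and the naive comma split are all assumptions for illustration. It shows that Java's UTF-8 decoder keeps the BOM as an invisible U+FEFF character, so the first header no longer equals the name used as the field processor key.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

final class BomHeaderDemo {
    // U+FEFF is the byte order mark; UTF-8 encodes it as the bytes EF BB BF.
    static final char BOM = '\uFEFF';

    /** Prepends the three UTF-8 BOM bytes, mimicking a "CSV UTF-8" export. */
    static byte[] withUtf8Bom(byte[] body) {
        byte[] out = new byte[body.length + 3];
        out[0] = (byte) 0xEF;
        out[1] = (byte) 0xBB;
        out[2] = (byte) 0xBF;
        System.arraycopy(body, 0, out, 3, body.length);
        return out;
    }

    /** Decodes the bytes as UTF-8 and splits the first line on commas (no quoting support). */
    static List<String> headers(byte[] csvBytes) {
        String text = new String(csvBytes, StandardCharsets.UTF_8);
        String firstLine = text.split("\\R", 2)[0];
        return Arrays.asList(firstLine.split(","));
    }

    public static void main(String[] args) {
        // Hypothetical file contents, not the data from this issue.
        byte[] plain = "Role,Location,YearsExperience\nEngineer,London,5\n"
                .getBytes(StandardCharsets.UTF_8);

        List<String> clean = headers(plain);
        List<String> bommed = headers(withUtf8Bom(plain));

        System.out.println(clean.contains("Role"));          // true
        System.out.println(bommed.contains("Role"));         // false: header is "\uFEFFRole"
        System.out.println(bommed.get(0).charAt(0) == BOM);  // true
        System.out.println(bommed.get(0).length());          // 5, one more than "Role"
    }
}
```

Printing the header itself would still look like `Role` in most terminals and editors, which is presumably why the file looked fine at a glance.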

I’m surprised I’ve never encountered this issue before, as I’ve been using CSV files with Java for a while. Interestingly, Jackson gave in and implemented auto-detection of a BOM:

https://github.com/FasterXML/jackson-dataformat-csv/blob/4e2b89f8c904b1dcb2df3e33bc5b3990314ab095/src/main/java/com/fasterxml/jackson/dataformat/csv/impl/CsvParserBootstrapper.java#L294

Apache Commons CSV doesn’t seem to auto-detect it for you, but Apache Commons IO offers BOMInputStream, which you can use to transparently wrap an InputStream and discard the BOM if found.
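
If pulling in another dependency isn't an option, the same wrap-and-discard idea can be sketched with the JDK alone. This is only an illustration of the approach, not the BOMInputStream implementation and not the fix that went into Tribuo; the class and method names (`BomSkipper`, `skipUtf8Bom`) are made up, it handles only the UTF-8 BOM, and it needs Java 9+ for `readNBytes`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

final class BomSkipper {
    /** Returns a stream positioned just after a leading UTF-8 BOM (EF BB BF), if one is present. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        // readNBytes returns fewer than 3 only when the stream ends early.
        int n = pushback.readNBytes(head, 0, 3);
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            // Not a BOM: put the bytes back so the caller sees the stream unchanged.
            pushback.unread(head, 0, n);
        }
        return pushback;
    }

    /** Convenience helper for the demo: reads the whole stream as UTF-8 text, minus any BOM. */
    static String readUtf8(InputStream in) throws IOException {
        return new String(skipUtf8Bom(in).readAllBytes(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // "\uFEFF" encodes to EF BB BF in UTF-8, so this simulates a BOM-prefixed file.
        byte[] bommed = "\uFEFFRole,Location\n".getBytes(StandardCharsets.UTF_8);
        String text = readUtf8(new ByteArrayInputStream(bommed));
        System.out.println(text.startsWith("Role")); // true
    }
}
```

The wrapped stream could then be handed to whichever CSV reader accepts an InputStream or Reader; whether a given data source exposes such a constructor is something to check in its own documentation.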

OpenCSV seems slightly less generous in its solutions to the problem.

Craigacp commented, Oct 22, 2020

This will come in the 4.0.2 bug fix release, which will happen sometime in the next week or so. I’d hoped for it to happen this week, but I’d like to get the external models tutorial into this release too, and we’re also waiting for the OLCUT 5.1.5 dependency to arrive in Maven Central, as that has a necessary bug fix for the provenance system.

In general, the frequency of bug fix releases will be controlled by the severity of the bugs. In 4.0.1 there are some serious bugs in JsonDataSource and the RegressionInfo object, and we’d like to get those fixed very soon. We’ll pull in the docs updates branch that we’re working on and the CSV BOM fix, among other smaller things. Feature releases are likely to happen on a longer cadence than monthly, as we’re aiming to have feature-based releases rather than time-based ones.
