CSVDataSource ignoring first column
The first column in my CSV file is always ignored, i.e. it doesn't get loaded into the feature map.
Example: if Role is the first column in the file, it doesn't appear when the feature map is logged. If I insert a "dummy" column first, then Role appears.
Map<String, FieldProcessor> fieldProcessorMap = new HashMap<>();
fieldProcessorMap.put("YearsExperience", new DoubleFieldProcessor("YearsExperience"));
fieldProcessorMap.put("Location", new IdentityProcessor("Location"));
fieldProcessorMap.put("Role", new IdentityProcessor("Role"));
var responseProcessor = new FieldResponseProcessor<>("CurrentSalary", "0", new LabelFactory());
var id = new IntExtractor("ID");
var rowProcessor = new RowProcessor<>(Collections.singletonList(id), null, responseProcessor, fieldProcessorMap, Collections.emptySet());
var csvDataSource = new CSVDataSource<>(Paths.get("data/dummy/sample1.csv"), rowProcessor, true);
var dataSplitter = new TrainTestSplitter<>(csvDataSource, 0.7, 1L);
var trainingDataset = new MutableDataset<>(dataSplitter.getTrain());
var testingDataset = new MutableDataset<>(dataSplitter.getTest());
for (var i : trainingDataset.getFeatureMap()) {
    log.info(i);
}
Issue Analytics
- Created 3 years ago
- Comments: 6 (4 by maintainers)
Thanks Craig for a very quick diagnosis! Excel 365 on Windows also has this behaviour, it seems. I had opened the file in Notepad++ to check the formatting and it looked fine, but I hadn't checked the encoding.
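For anyone hitting the same symptom, a quick way to confirm the diagnosis programmatically is to look at the file's first bytes: a UTF-8 BOM shows up as the three bytes EF BB BF ahead of the first header character, so CSVDataSource sees a header name like "\uFEFFRole" instead of "Role". A minimal sketch (the BomCheck class name and the sample byte arrays are illustrative; in practice you'd pass in the result of Files.readAllBytes on your CSV):

```java
import java.util.Arrays;

public class BomCheck {
    // Returns true if the byte array starts with the UTF-8 BOM (EF BB BF).
    public static boolean hasUtf8Bom(byte[] bytes) {
        return bytes.length >= 3
                && bytes[0] == (byte) 0xEF
                && bytes[1] == (byte) 0xBB
                && bytes[2] == (byte) 0xBF;
    }

    public static void main(String[] args) {
        // Simulates the first bytes of a header row saved as "CSV UTF-8" by Excel...
        byte[] excelUtf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'R', 'o', 'l', 'e'};
        // ...versus one saved as plain "CSV (Comma delimited)".
        byte[] plain = {'R', 'o', 'l', 'e'};
        System.out.println(hasUtf8Bom(excelUtf8)); // true
        System.out.println(hasUtf8Bom(plain));     // false
    }
}
```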
I've experimented, and unhelpfully the "default" (i.e. top of the file-type list) CSV UTF-8 option in Excel adds a BOM (even though in theory UTF-8 shouldn't have one… thanks, Microsoft?). Ten items further down the list is CSV (Comma delimited), which doesn't add the BOM and so works fine with Tribuo, so I'll use that going forward.
I’m surprised I’ve never encountered this issue before as I’ve been using CSV files with Java for a while. Interestingly, looking at Jackson, they gave in and implemented auto-detection of a BOM:
https://github.com/FasterXML/jackson-dataformat-csv/blob/4e2b89f8c904b1dcb2df3e33bc5b3990314ab095/src/main/java/com/fasterxml/jackson/dataformat/csv/impl/CsvParserBootstrapper.java#L294
Apache Commons CSV doesn't auto-detect it for you, but the companion Apache Commons IO library offers BOMInputStream, which you can use to transparently wrap an InputStream and discard the BOM if one is found.
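For completeness, what BOMInputStream does can be sketched in plain JDK code with a PushbackInputStream: peek at the first three bytes and push them back unless they are the UTF-8 BOM. (BomStripper is a hypothetical name for illustration, not a Tribuo or Commons API; this sketch only handles the UTF-8 BOM, whereas BOMInputStream can also handle UTF-16/UTF-32 marks.)

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

public class BomStripper {
    // The UTF-8 byte order mark is the three bytes EF BB BF.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Wraps a stream, silently consuming a leading UTF-8 BOM if present.
    public static InputStream stripBom(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, UTF8_BOM.length);
        byte[] head = new byte[UTF8_BOM.length];
        int read = pb.read(head, 0, head.length);
        boolean isBom = read == UTF8_BOM.length
                && head[0] == UTF8_BOM[0]
                && head[1] == UTF8_BOM[1]
                && head[2] == UTF8_BOM[2];
        if (!isBom && read > 0) {
            pb.unread(head, 0, read); // not a BOM: push the bytes back for the reader
        }
        return pb;
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'R', 'o', 'l', 'e'};
        InputStream clean = stripBom(new ByteArrayInputStream(withBom));
        // The BOM has been consumed, so the header name reads cleanly.
        System.out.println(new String(clean.readAllBytes(), StandardCharsets.UTF_8)); // Role
    }
}
```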
It seems OpenCSV are slightly less generous in solutions to the problem.
This will come in the 4.0.2 bug fix release which will happen sometime in the next week or so. I’d hoped for it to happen this week, but I’d like to get the external models tutorial into this release too, and we’re also waiting for the OLCUT 5.1.5 dependency to arrive in Maven Central as that has a necessary bug fix for the provenance system.
In general, the frequency of bug fix releases will be driven by the severity of the bugs. 4.0.1 has some serious bugs in JsonDataSource and the RegressionInfo object, and we'd like to get those fixed very soon. We'll pull in the docs-updates branch we're working on and the CSV BOM fix, among other smaller things. Feature releases are likely to happen on a longer cadence than monthly, as we're aiming for feature-based releases rather than time-based ones.