
Possible regression in loading CSVs when in Node.js environment



Hi folks (cc @jwoLondon) 👋

You probably know that Vega and Vega Lite are used in litvis; I’m opening this issue as a result of investigating https://github.com/gicentre/litvis/issues/27.

It seems that loading CSV files referenced in Vega specs does not work in the Node.js environment. This was first observed after upgrading to Vega 5.0 and is possibly related to the new data loading approach. Here is an MWE (not litvis-specific):

mkdir /tmp/vega-csv-fetching-mwe
cd /tmp/vega-csv-fetching-mwe

yarn add vega vega-lite

cat <<"EOF" > mwe.js
const { parse, Spec, View } = require("vega");
const { compile } = require("vega-lite");

const generateSpec = (dataFormat) =>
  compile({
    $schema: "https://vega.github.io/schema/vega-lite/v3.json",
    data: {
      url: `https://gicentre.github.io/data/bicycleHiresLondon.${dataFormat}`,
    },
    encoding: {
      x: { field: "Month", type: "temporal" },
      y: { field: "NumberOfHires", type: "quantitative" },
    },
    mark: "circle",
  }).spec;

(async () => {
  for (const dataFormat of ["csv", "json"]) {
    const spec = generateSpec(dataFormat);
    const view = new View(parse(spec), {
      renderer: "none",
    }).initialize();
    console.log(`\nVega spec for ${dataFormat}\n=====`);
    console.log(JSON.stringify(spec));
    console.log(`\nResult`);
    console.log(await view.toSVG());
  }
})();
EOF

node mwe.js

Output:

Vega spec for csv
=====
{...}

Result
<Rather small SVG with no data shown>

Vega spec for json
=====
{...}

Result
<An SVG with data points, much longer than the previous one>

In the expected output, both SVGs would contain data points and be of similar length.

Versions used in the MWE:

vega@5.3.5
vega-lite@3.2.0

Copying the Vega specs from standard output into the Vega Editor produces the same correct chart for both CSV and JSON. This suggests that the problem may have to do with CSV fetching or parsing outside the browser environment. Setting logLevel: vega.Info did not help; no issues were revealed.

What are your thoughts?

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

2 reactions
jheer commented, Apr 19, 2019

@domoritz @kanitw @arvind: You might find this one particularly amusing and/or infuriating! 😅

2 reactions
jheer commented, Apr 19, 2019

This is possibly the strangest bug I’ve seen in a while. I first tried to replicate everything on my own using the vega-loader. No problems. However, I only tried csv loading with either no parsing or with complete auto parsing. Both worked. Or, I should say, they appeared to work…

I then tried the specification above (as compiled from Vega-Lite), and indeed it failed. In particular, the data parsing specified by Vega-Lite {"Month": "date"} was not being properly applied. The dates were not being parsed, so the subsequent null/NaN filter inserted by Vega-Lite suppressed all the values, leaving an empty data set.

So now things start to get weird. Why is the parsing step failing? The input appears correct (a loaded, but not yet parsed, CSV string). However, upon closer inspection, the output has not one but two keys for the ‘Month’ field. If you take the first object in the parsed data, here’s what you get:

> keys = Object.keys(data[0])
[ 'Month', 'NumberOfHires', 'AvHireTime', 'Month' ]

> keys[0] == keys[3]
false

> keys[0].length
6

> keys[3].length
5

> keys[0][0]
''

Wat?! It’s as if the first key value has a hidden empty string inside of it… what could this be?

> keys[0][0] === ''
false

> keys[0].charCodeAt(0)
65279

What is Unicode character 65279 (U+FEFF)? ZERO WIDTH NO-BREAK SPACE, better known as the byte order mark (BOM).

What in the world is that doing here? Well, I did notice that the input CSV file has not just line feeds but carriage returns (CRLF line endings). That by itself should not be an issue (one would hope), but it got me thinking that maybe something fishy is going on within the CSV file itself…

Test 1: If I download the file through my browser, open and re-save it in my text editor (VS Code), and then load it locally, it works fine. OK.

Test 2: If I instead download the file directly via curl https://gicentre.github.io/data/bicycleHiresLondon.csv > bicycleHiresLondon.csv and load it locally, it breaks as before. Not OK! But we learn something important: the problem is not in our node-based fetch polyfill, because Node's fs module shows the same behavior when loading directly from the local file system.

My conclusion? The file format is probably not acceptable to Node. So let's get a hexdump. Here's what we see:

$ hexdump -C -n128 bicycleHiresLondon.csv
00000000  ef bb bf 4d 6f 6e 74 68  2c 4e 75 6d 62 65 72 4f  |...Month,NumberO|
00000010  66 48 69 72 65 73 2c 41  76 48 69 72 65 54 69 6d  |fHires,AvHireTim|
00000020  65 0d 0a 32 30 31 30 2d  30 37 2c 31 32 34 36 31  |e..2010-07,12461|
00000030  2c 31 37 0d 0a 32 30 31  30 2d 30 38 2c 33 34 31  |,17..2010-08,341|
00000040  32 30 33 2c 31 37 0d 0a  32 30 31 30 2d 30 39 2c  |203,17..2010-09,|
00000050  35 34 30 38 35 39 2c 31  35 0d 0a 32 30 31 30 2d  |540859,15..2010-|
00000060  31 30 2c 35 34 34 34 31  32 2c 31 35 0d 0a 32 30  |10,544412,15..20|
00000070  31 30 2d 31 31 2c 34 35  36 33 30 34 2c 31 34 0d  |10-11,456304,14.|

Hmm, what is that ef bb bf at the beginning of the byte stream? A little Googling and Wikipedia comes to the rescue: https://en.wikipedia.org/wiki/Byte_order_mark

Here is a particularly telling passage from the Wikipedia article:

BOM use is optional. Its presence interferes with the use of UTF-8 by software that does not expect non-ASCII bytes at the start of a file but that could otherwise handle the text stream.

So it appears that the software being used to generate this CSV is producing output that breaks other tools. The solution is simple: generate CSV files through some other means.

But why does this work online? Perhaps the browser's loading mechanism handles (or strips) the BOM for us, whereas Node.js does not.
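That guess can be checked directly in Node: Buffer#toString("utf8") leaves a leading BOM in place, while the WHATWG TextDecoder — which is what browsers use to decode fetched text, and which Node also ships — strips it by default. A minimal sketch, with a hand-built byte array standing in for the downloaded file:

```javascript
// BOM (EF BB BF) followed by the UTF-8 bytes for "Month".
const bytes = Buffer.from([0xef, 0xbb, 0xbf, 0x4d, 0x6f, 0x6e, 0x74, 0x68]);

// Buffer#toString keeps the BOM as U+FEFF …
const raw = bytes.toString("utf8");
console.log(raw.charCodeAt(0)); // 65279

// … while TextDecoder drops it (its ignoreBOM option defaults to false,
// i.e. "consume the BOM"), as does a manual strip.
const decoded = new TextDecoder("utf-8").decode(bytes);
const stripped = raw.replace(/^\uFEFF/, "");
console.log(decoded);  // "Month"
console.log(stripped); // "Month"
```

A one-line replace(/^\uFEFF/, "") on the loaded text before handing it to the CSV parser is therefore a workable guard until the file itself is regenerated without a BOM.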
