Make graphy.js parsers the default in RDF.JS / rdf-ext?

Now that graphy.js is fully spec-compliant (and has been for some time), I think we should strongly consider making it the default parser of RDF/JS. There was definitely a point to an arms race many years ago, when I created N3.js, and it was able to pull off a performance difference of two orders of magnitude compared to the state of the art. Not only has that gap disappeared nowadays (thanks to a much faster V8, it seems), but graphy.js is also simply much faster. In fact, I think the best I could achieve with N3.js is to be as fast as graphy.js; I see few options to get much better. And frankly, I wouldn’t have the time anymore either.

That said, a couple of questions perhaps to understand the implications of switching to graphy.js.

- I noticed that spec-compatible parsing is not the default, in the sense that one would have to pass the options { validate: true, maxStringLength: Infinity }. Are there severe performance consequences of doing so? (I haven’t seen any.) Is maxStringLength nothing more than a safety guard?
  - Could it be made the default?
  - The reason I insist on spec-compatibility is that we want to avoid downstream code having to deal with invalid RDF; e.g., writers should be able to trust that a NamedNode has a valid value rather than having to check it.
- One of the reasons for the performance difference seems to be the use of sticky regular expressions. Back in the day (🦕), I remember having to make the decision with N3.js as well, but there was insufficient support for it (only Firefox, if I remember correctly). Now we have it, and it seems to be faster than always chopping off the beginning of the string, as N3.js does. However, what has also stopped me in the past is the fact that sticky regexes cannot be set to start at a certain point, whereas with chopped-off strings, you can force a match to start at the beginning and fail fast.
  - Are there any bad cases for graphy.js? For instance, if I had a Turtle file like "a", "here come a million characters"@en, would it search very far to find the language tag after the first string? I understand such examples are probably artificial; I just want to understand if there are any downsides to the sticky bit.
  - Could you imagine any other downsides to sticky?
- Are there in general any disadvantages of the graphy.js parsers that you know about?
  - Any worst cases? Any undesirable properties?
- Has graphy.js been well-tested for arbitrary stream buffer boundaries? Basically all tests in https://github.com/rdfjs/N3.js/blob/b2ff96d35ed586fce1a02c567fb3ba9c10272598/test/N3Lexer-test.js#L155 that match /streamOf/. I would adapt them to graphy.js, but it doesn’t have a separate lexer (likely another source of performance gains). Some of those tests are only the result of crazy usage like with LOD Laundromat files, i.e., it is very hard to imagine all special cases and possible splits. This one stands out in particular.
- Graphy is not an ES6 module and thus does not support tree shaking, so code would still be unnecessarily large in several cases, which would especially hurt browsers. Would you consider implementing that? (Or otherwise partitioning the code, for example through specific include paths?)
  - Related to this question: how does code size compare to the state of the art?
  - I understand there is some replication across different parser implementations. That should be gzipped away, but might not be if the minifier assigns different names (very likely). However, even if gzipped away, it is actual code size that determines performance these days.
  - Is it meaningful to reuse more (I can’t imagine it hurting performance badly in most cases, and let’s not reuse where it does)?
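
To make the sticky-versus-chopping question above concrete, here is a minimal sketch of the two matching strategies in plain JavaScript. It is not taken from N3.js or graphy.js: the language-tag pattern is a simplification, and the names matchSticky and matchChopped are hypothetical. It only shows how the standard y flag behaves on the kind of input described above.

```javascript
// Illustrative pattern for a language tag such as @en or @en-GB.
// It is a simplification and is not copied from N3.js or graphy.js.
const LANGTAG_STICKY = /@[a-zA-Z]+(?:-[a-zA-Z0-9]+)*/y;
const LANGTAG_ANCHORED = /^@[a-zA-Z]+(?:-[a-zA-Z0-9]+)*/;

// Sticky approach: keep the whole input and move lastIndex to the cursor.
function matchSticky(input, position) {
  LANGTAG_STICKY.lastIndex = position;
  const match = LANGTAG_STICKY.exec(input);
  return match ? match[0] : null;
}

// Chopping approach: cut off the consumed prefix and anchor with ^.
function matchChopped(input, position) {
  const match = LANGTAG_ANCHORED.exec(input.slice(position));
  return match ? match[0] : null;
}

// "a", followed by a very long string that carries a language tag.
const input = '"a", "' + 'x'.repeat(1000000) + '"@en .';
const afterFirstString = 3;            // index of the comma following "a"
const atLanguageTag = input.indexOf('@');

console.log(matchSticky(input, afterFirstString));  // null
console.log(matchChopped(input, afterFirstString)); // null
console.log(matchSticky(input, atLanguageTag));     // @en
console.log(matchChopped(input, atLanguageTag));    // @en
```

With the y flag, the engine attempts a match only at lastIndex, so the first call returns null at the comma without scanning ahead to the later @en. Whether a real parser has other bad cases depends on its own patterns, which this sketch does not cover.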

I think that’s all for now, some more questions might pop up. Thanks in advance for your insights.

Top GitHub Comments

blake-regalia commented, Nov 25, 2019

do you guard against "bla \n\r

Exactly. Yes, this is currently handled in the master branch, but 3.2.2 did not actually check for this. It corresponds to this step in the example from above:

check unparsed text for invalid characters (e.g., newlines not allowed in STRING_LITERAL_QUOTE – throw parsing error if invalid start of string token)
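
The step quoted above can be sketched with a toy incremental reader. This is not graphy's code: createStringReader and both regular expressions are hypothetical, the input is assumed to be nothing but whitespace-separated quoted strings, and escapes are reduced to a backslash followed by any character that is not a line break. When no complete token matches, the reader inspects the unparsed text and either waits for the next chunk or throws.

```javascript
// Simplified STRING_LITERAL_QUOTE: no raw line breaks inside the quotes;
// an escape is reduced to "backslash plus any non-line-break character".
const WHITESPACE = /[ \t\r\n]*/y;
const STRING = /"((?:[^"\\\r\n]|\\[^\r\n])*)"/y;

function createStringReader() {
  let pending = '';    // unparsed text carried over between chunks
  const strings = [];  // raw contents of every completed string token

  return {
    strings,
    write(chunk) {
      pending += chunk;
      let pos = 0;
      for (;;) {
        WHITESPACE.lastIndex = pos;
        WHITESPACE.exec(pending); // always succeeds, possibly with an empty match
        pos = WHITESPACE.lastIndex;
        if (pos >= pending.length) break;

        STRING.lastIndex = pos;
        const match = STRING.exec(pending);
        if (match) {
          strings.push(match[1]);
          pos = STRING.lastIndex;
          continue;
        }

        // No complete token: decide between "need more data" and "invalid input".
        const rest = pending.slice(pos);
        if (rest[0] !== '"' || /[\r\n]/.test(rest)) {
          throw new Error(`Invalid string token near ${JSON.stringify(rest.slice(0, 20))}`);
        }
        break; // plausible start of a string: wait for the next chunk
      }
      pending = pending.slice(pos);
    },
    end() {
      if (pending.length > 0) throw new Error('Unterminated string at end of input');
    },
  };
}

// Feed the same input split at every possible position and collect the results.
const input = '"abc" "d\\"ef"\n"g"';
const results = new Set();
for (let i = 0; i <= input.length; i++) {
  const reader = createStringReader();
  reader.write(input.slice(0, i));
  reader.write(input.slice(i));
  reader.end();
  results.add(JSON.stringify(reader.strings));
}
console.log([...results]); // a single entry means every split produced the same tokens
```

Writing the chunk '"bla \n\r' to a fresh reader throws immediately instead of buffering in the hope of a closing quote, which is the guard discussed in this comment. Splitting the input at every index is a cheap way to exercise arbitrary buffer boundaries in a test.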

blake-regalia commented, Nov 25, 2019

I should probably retire N3.js

Maybe someday in the distant future but for now I hope it remains maintained. Graphy has not seen a whole lot of mainstream usage yet.

Notation3 has not been planned yet. As for Store, DatasetTree is the graphy alternative, although I’d be curious what the differences in capabilities and performance are.