Regression in parsing big DOM structures

On revisiting the old issue, I’ve found that the latest jsdom (9.6.0) does not handle big DOM structures as well as previous versions did. Please try this modified code from that issue (I’ve tested it in Node 7 beta):

/******************************************************************************/
'use strict';
/******************************************************************************/
const fs = require('fs');
const jsdom = require('jsdom');
/******************************************************************************/
const html =  fs.openSync('test.html', 'w');

fs.writeSync(html,
  '\uFEFF<!doctype html><html><head><meta charset="UTF-8"><title></title></head><body>\n\n',
null, 'utf8');

const elementsNumber = 1000;
let   counter = elementsNumber;

while (counter--) {
  fs.writeSync(html, `<a href='${counter}.html'>${counter}</a>\n`, null, 'utf8');
}

fs.writeSync(html, '\n</body></html>\n', null, 'utf8');
fs.closeSync(html); // release the file descriptor before jsdom reads the file
/******************************************************************************/
const hrStart = process.hrtime();

jsdom.env({
  file: 'test.html',
  done: () => {
    const hrEnd = process.hrtime(hrStart);
    console.log(`Parsed ${elementsNumber}: ${(hrEnd[0] * 1e9 + hrEnd[1]) / 1e9} s\n`);
  },
});
/******************************************************************************/

Some results with different elementsNumber:

Parsed   1000:  0.7295047  s
Parsed  10000: 88.95351464 s
Parsed 100000: did not finish; I aborted it after waiting more than 30 minutes.
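
As a rough reading of the two completed runs: if parse time grows like c * n^k, the exponent k can be estimated from the ratio of the two timings. The sketch below does that arithmetic with the numbers reported above. The power-law model and the `growthExponent` helper are assumptions made for illustration, and two data points give only a crude estimate.

```javascript
'use strict';

// Timings copied from the results above (seconds).
const timings = [
  { n: 1000, seconds: 0.7295047 },
  { n: 10000, seconds: 88.95351464 },
];

// Assumption: time ~ c * n^k, so k = log(t2 / t1) / log(n2 / n1).
// growthExponent is a hypothetical helper, not part of jsdom or Node.
function growthExponent(a, b) {
  return Math.log(b.seconds / a.seconds) / Math.log(b.n / a.n);
}

const exponent = growthExponent(timings[0], timings[1]);
console.log(`Estimated growth exponent: ${exponent.toFixed(2)}`);
```

The estimate comes out slightly above 2, which would be consistent with roughly quadratic behavior in the number of elements.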

Top GitHub Comments

domenic commented, Oct 13, 2016

Fascinating! I guess it’s probably related to the HTMLCollection changes, but it’s hard to understand why, as HTMLCollection should be irrelevant during parsing, and in general created lazily. Anyway, this is very helpful, and should be enough to let me debug next time I get a jsdom hack day. Further investigation is of course appreciated, but this will get me started.

domenic commented, Oct 15, 2016

Further info:

Prior to https://github.com/tmpvar/jsdom/commit/4be8634f4121286454f6f559551ca06b0b967b68, pretty much every HTMLCollection was rooted at the document. Thus, when things changed, it sufficed to increment the document version; that would correctly tell the HTMLCollection that it needs an update.

But in https://github.com/tmpvar/jsdom/commit/4be8634f4121286454f6f559551ca06b0b967b68, I said “that’s silly; why would a HTMLCollection for select.options be rooted at the document? It should clearly be rooted at the select.” So I switched that. That broke some stuff, since we weren’t properly versioning non-document nodes. So in https://github.com/tmpvar/jsdom/commit/a7cd5a160fd6c295fa80e953fdf051a1c5dc5cf5 I fixed that by doing the ancestor/descendant iteration and incrementing those versions. But, as described above, that makes things slow.
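
To illustrate why ancestor/descendant version incrementing can get slow while parsing, here is a toy model. It is not jsdom's code: `ToyNode`, `modified()`, `bumpCount` and `countBumps` are hypothetical names. It also assumes that the node marked as modified is the parent receiving the new child, so its descendants include every sibling inserted so far.

```javascript
'use strict';

// Toy model only: NOT jsdom's implementation. All names are hypothetical.
let bumpCount = 0;

class ToyNode {
  constructor() {
    this.parent = null;
    this.children = [];
    this.version = 0;
  }

  bump() {
    this.version++;
    bumpCount++;
  }

  // Increment the version of this node, all its ancestors,
  // and all its descendants.
  modified() {
    this.bump();
    for (let a = this.parent; a !== null; a = a.parent) a.bump();
    const stack = [];
    for (const child of this.children) stack.push(child);
    while (stack.length > 0) {
      const node = stack.pop();
      node.bump();
      for (const child of node.children) stack.push(child);
    }
  }

  // Assumption: inserting a child marks the parent as modified.
  appendChild(child) {
    child.parent = this;
    this.children.push(child);
    this.modified();
    return child;
  }
}

// Count version increments caused by appending n children to a body node.
function countBumps(n) {
  const doc = new ToyNode();
  const html = doc.appendChild(new ToyNode());
  const body = html.appendChild(new ToyNode());
  bumpCount = 0; // ignore the setup work above
  for (let i = 0; i < n; i++) body.appendChild(new ToyNode());
  return bumpCount;
}

for (const n of [1000, 10000]) {
  console.log(`${n} elements: ${countBumps(n)} version increments`);
}
```

In this model the i-th insertion touches i existing children plus three fixed nodes, so the total is 3n + n(n + 1) / 2 increments. Ten times more elements means roughly a hundred times more work.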

I’m still not sure what the best fix is, but now I see at least a few options:

Introduce a “batch mode” for changes, used mostly for parsing (maybe also innerHTML/outerHTML/textContent?). It delays all version-incrementing to one single pass at the end.
Go back to the previous setup and root all HTMLCollections at the document. Add a giant warning to _version explaining that it isn’t as useful as you think it is. Maybe even remove _version from non-document nodes entirely. This means HTMLCollections will be invalidated unnecessarily quite often, but that didn’t bother anyone prior to 9.5.0, so maybe it’s fine.
Try just not incrementing the descendant versions. When I comment out the descendant-version-incrementing code, all tests still pass! Not so for the ancestor-version-incrementing code. So maybe only incrementing the ancestor version is important!? I need to think harder to see if that could possibly be true… but it seems plausible.
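
A minimal sketch comparing the amount of version-incrementing work under the first and third options, using a toy tree rather than jsdom's code. Every name here (`makeNode`, `bumpSubtree`, `bumpAncestors`, `simulate`, the strategy strings) is hypothetical. The "batch" strategy is assumed to mean one pass over the whole tree after all insertions. The sketch only counts increments; it says nothing about whether HTMLCollection invalidation stays correct.

```javascript
'use strict';

// Toy model only: NOT jsdom's implementation. All names are hypothetical.
function makeNode(parent) {
  const node = { parent, children: [], version: 0 };
  if (parent !== null) parent.children.push(node);
  return node;
}

// Increment the version of `root` and every descendant; return the count.
function bumpSubtree(root) {
  let bumps = 0;
  const stack = [root];
  while (stack.length > 0) {
    const node = stack.pop();
    node.version++;
    bumps++;
    for (const child of node.children) stack.push(child);
  }
  return bumps;
}

// Increment the version of every ancestor of `node`; return the count.
function bumpAncestors(node) {
  let bumps = 0;
  for (let a = node.parent; a !== null; a = a.parent) {
    a.version++;
    bumps++;
  }
  return bumps;
}

// Append n children to a body node and count version increments.
function simulate(n, strategy) {
  const doc = makeNode(null);
  const html = makeNode(doc);
  const body = makeNode(html);
  let bumps = 0;
  for (let i = 0; i < n; i++) {
    makeNode(body);
    if (strategy === 'full') {
      // Self, descendants and ancestors on every insertion.
      bumps += bumpSubtree(body) + bumpAncestors(body);
    } else if (strategy === 'ancestorsOnly') {
      // Self and ancestors only; descendants are left alone.
      body.version++;
      bumps += 1 + bumpAncestors(body);
    }
    // 'batch': nothing is incremented per insertion.
  }
  // Assumed batch mode: a single pass over the whole tree at the end.
  if (strategy === 'batch') bumps += bumpSubtree(doc);
  return bumps;
}

for (const strategy of ['full', 'ancestorsOnly', 'batch']) {
  console.log(`${strategy}: ${simulate(1000, strategy)} version increments`);
}
```

In this model both alternatives grow linearly with the number of elements, while the full ancestor/descendant pass grows quadratically. The second option (rooting every HTMLCollection at the document) is not modeled here.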
