Regression in parsing big DOM structures

While revisiting the old issue, I found that the latest jsdom (9.6.0) does not handle big DOM structures as well as previous versions did. Please try this modified code from that issue (tested on Node 7 beta):

/******************************************************************************/
'use strict';
/******************************************************************************/
const fs = require('fs');
const jsdom = require('jsdom');
/******************************************************************************/
const html =  fs.openSync('test.html', 'w');

fs.writeSync(html,
  '\uFEFF<!doctype html><html><head><meta charset="UTF-8"><title></title></head><body>\n\n',
null, 'utf8');

const elementsNumber = 1000;
let   counter = elementsNumber;

while (counter--) {
  fs.writeSync(html, `<a href='${counter}.html'>${counter}</a>\n`, null, 'utf8');
}

fs.writeSync(html, '\n</body></html>\n', null, 'utf8');
fs.closeSync(html); // close the file descriptor before jsdom reads the file
/******************************************************************************/
const hrStart = process.hrtime();

jsdom.env({
  file: 'test.html',
  done: () => {
    const hrEnd = process.hrtime(hrStart);
    console.log(`Parsed ${elementsNumber}: ${(hrEnd[0] * 1e9 + hrEnd[1]) / 1e9} s\n`);
  },
});
/******************************************************************************/

Some results with different elementsNumber:

Parsed   1000:  0.7295047  s
Parsed  10000: 88.95351464 s
Parsed 100000: I aborted the run after waiting more than 30 minutes.

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Reactions: 2
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

2 reactions
domenic commented, Oct 13, 2016

Fascinating! I guess it’s probably related to the HTMLCollection changes, but it’s hard to understand why, as HTMLCollection should be irrelevant during parsing, and in general created lazily. Anyway, this is very helpful, and should be enough to let me debug next time I get a jsdom hack day. Further investigation is of course appreciated, but this will get me started.

1 reaction
domenic commented, Oct 15, 2016

Further info:

Prior to https://github.com/tmpvar/jsdom/commit/4be8634f4121286454f6f559551ca06b0b967b68, pretty much every HTMLCollection was rooted at the document. Thus, when things changed, it sufficed to increment the document version; that would correctly tell the HTMLCollection that it needs an update.
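To make the document-rooted scheme concrete, here is a minimal sketch of a lazily updated collection that compares a cached version number against its root's version. `ToyDocument` and `LazyCollection` are hypothetical names made up for illustration; they are not jsdom's internals, and the real implementation may differ considerably.

```javascript
'use strict';

// Hypothetical stand-in for a document: one version counter for everything.
class ToyDocument {
  constructor() {
    this.version = 0;
    this.elements = [];
  }

  add(tagName) {
    this.elements.push({ tagName });
    this.version++; // any change anywhere bumps the single document version
  }
}

// Hypothetical lazy collection: recomputes only when the root's version moved.
class LazyCollection {
  constructor(root, query) {
    this.root = root;
    this.query = query;
    this.cachedVersion = -1; // forces a computation on first access
    this.cache = [];
    this.recomputations = 0;
  }

  get items() {
    if (this.cachedVersion !== this.root.version) {
      this.cache = this.query(this.root);
      this.cachedVersion = this.root.version;
      this.recomputations++;
    }
    return this.cache;
  }
}

const doc = new ToyDocument();
const links = new LazyCollection(doc, (d) => d.elements.filter((e) => e.tagName === 'a'));

doc.add('a');
doc.add('p');
console.log(links.items.length); // first access computes the list
console.log(links.items.length); // version unchanged, so the cache is reused
doc.add('a');
console.log(links.items.length); // version changed, so the list is recomputed
```

In this toy model a mutation costs a single increment, at the price of invalidating every collection whenever anything in the document changes.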

But in https://github.com/tmpvar/jsdom/commit/4be8634f4121286454f6f559551ca06b0b967b68, I said “that’s silly; why would a HTMLCollection for select.options be rooted at the document? It should clearly be rooted at the select.” So I switched that. That broke some stuff, since we weren’t properly versioning non-document nodes. So in https://github.com/tmpvar/jsdom/commit/a7cd5a160fd6c295fa80e953fdf051a1c5dc5cf5 I fixed that by doing the ancestor/descendant iteration and incrementing those versions. But, as described above, that makes things slow.
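As a rough illustration of why that iteration can hurt during parsing, the toy model below counts version increments while appending elements to a flat body. It assumes the modification is signaled on the parent, which then bumps itself, its ancestors and all of its descendants; that detail is an assumption, not something stated in this thread, and `ToyNode`, `bumpVersions` and `countBumpsForFlatParse` are made-up names.

```javascript
'use strict';

// Toy tree node, not jsdom's real Node class.
class ToyNode {
  constructor() {
    this.parent = null;
    this.children = [];
    this.version = 0;
  }
}

// Bump `node`, every ancestor and every descendant; return how many bumps ran.
function bumpVersions(node) {
  let bumps = 0;
  for (let a = node; a; a = a.parent) {
    a.version++;
    bumps++;
  }
  const stack = node.children.slice();
  while (stack.length > 0) {
    const n = stack.pop();
    n.version++;
    bumps++;
    for (const child of n.children) stack.push(child);
  }
  return bumps;
}

// Simulate parsing <body> followed by `elementCount` sibling elements.
function countBumpsForFlatParse(elementCount) {
  const doc = new ToyNode();
  const body = new ToyNode();
  let total = 0;
  const append = (parent, child) => {
    child.parent = parent;
    parent.children.push(child);
    total += bumpVersions(parent); // assumed: the parent is the modified node
  };
  append(doc, body);
  for (let i = 0; i < elementCount; i++) append(body, new ToyNode());
  return total;
}

for (const n of [1000, 10000]) {
  console.log(`${n} elements: ${countBumpsForFlatParse(n)} version bumps`);
}
```

Under these assumptions each append walks every sibling inserted so far, so ten times as many elements cost roughly a hundred times as many bumps, which is in the same ballpark as the jump in the timings reported above.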

I’m still not sure what the best fix is, but now I see at least a few options:

  • Introduce a “batch mode” for changes, used mostly for parsing (maybe also innerHTML/outerHTML/textContent?). It delays all version-incrementing to one single pass at the end.
  • Go back to the previous setup and root all HTMLCollections at the document. Add a giant warning to _version explaining that it isn’t as useful as you think it is. Maybe even remove _version from non-document nodes entirely. This means HTMLCollections will be invalidated unnecessarily quite often, but that didn’t bother anyone prior to 9.5.0, so maybe it’s fine.
  • Try just not incrementing the descendant versions. When I comment out the descendant-version-incrementing code, all tests still pass! Not so for the ancestor-version-incrementing code. So maybe only incrementing the ancestor version is important!? I need to think harder to see if that could possibly be true… but it seems plausible.
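
The first option could look something like the sketch below: mutations inside a batch only record which nodes changed, and a single pass at the end bumps each affected node once. `BatchNode`, `VersionTracker` and `parseFlat` are hypothetical names, and the set of affected nodes (the changed node, its ancestors and its descendants) is an assumption carried over from the description above, not jsdom's actual code.

```javascript
'use strict';

// Toy tree node, not jsdom's real Node class.
class BatchNode {
  constructor() {
    this.parent = null;
    this.children = [];
    this.version = 0;
  }
}

class VersionTracker {
  constructor() {
    this.batchDepth = 0;
    this.dirty = new Set();
    this.bumps = 0; // total version increments performed
  }

  // Call after mutating `node`; outside a batch this flushes immediately.
  modified(node) {
    this.dirty.add(node);
    if (this.batchDepth === 0) this.flush();
  }

  // Run `fn` with version bumping deferred until the outermost batch ends.
  batch(fn) {
    this.batchDepth++;
    try {
      fn();
    } finally {
      this.batchDepth--;
      if (this.batchDepth === 0) this.flush();
    }
  }

  // Bump every dirty node, its ancestors and its descendants exactly once.
  flush() {
    const affected = new Set();
    for (const node of this.dirty) {
      for (let a = node; a; a = a.parent) affected.add(a);
      const stack = [node];
      while (stack.length > 0) {
        const n = stack.pop();
        affected.add(n);
        for (const child of n.children) stack.push(child);
      }
    }
    this.dirty.clear();
    for (const n of affected) {
      n.version++;
      this.bumps++;
    }
  }
}

// Build a document with a body and `count` children; return the bump total.
function parseFlat(count, useBatch) {
  const tracker = new VersionTracker();
  const doc = new BatchNode();
  const body = new BatchNode();
  const append = (parent, child) => {
    child.parent = parent;
    parent.children.push(child);
    tracker.modified(parent);
  };
  const build = () => {
    append(doc, body);
    for (let i = 0; i < count; i++) append(body, new BatchNode());
  };
  if (useBatch) tracker.batch(build);
  else build();
  return tracker.bumps;
}

console.log(`unbatched: ${parseFlat(1000, false)} bumps`);
console.log(`batched:   ${parseFlat(1000, true)} bumps`);
```

In this model the batched parse touches each node once, so the cost grows linearly with the number of elements. The trade-off is that versions are stale while the batch is open, so anything reading a collection mid-parse would need to flush first.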