Regression in parsing big DOM structures
On revisiting the old issue, I’ve found that the latest jsdom (9.6.0) does not handle big DOM structures as well as previous versions did. Please try this modified code from that issue (I tested it in Node 7 beta):
/******************************************************************************/
'use strict';
/******************************************************************************/
const fs = require('fs');
const jsdom = require('jsdom');
/******************************************************************************/
// Generate a test page containing elementsNumber links.
const html = fs.openSync('test.html', 'w');
fs.writeSync(html,
  '\uFEFF<!doctype html><html><head><meta charset="UTF-8"><title></title></head><body>\n\n',
  null, 'utf8');
const elementsNumber = 1000;
let counter = elementsNumber;
while (counter--) {
  fs.writeSync(html, `<a href='${counter}.html'>${counter}</a>\n`, null, 'utf8');
}
fs.writeSync(html, '\n</body></html>\n', null, 'utf8');
// Close the file descriptor before jsdom reads the file.
fs.closeSync(html);
/******************************************************************************/
const hrStart = process.hrtime();
jsdom.env({
  file: 'test.html',
  done: () => {
    const hrEnd = process.hrtime(hrStart);
    console.log(`Parsed ${elementsNumber}: ${(hrEnd[0] * 1e9 + hrEnd[1]) / 1e9} s\n`);
  },
});
/******************************************************************************/
Some results with different elementsNumber values:
Parsed 1000: 0.7295047 s
Parsed 10000: 88.95351464 s
Parsed 100000: I waited more than 30 minutes and then aborted it.
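To put a rough number on the slowdown, here is a small sketch that estimates the scaling exponent from the two completed runs above. It assumes the time grows like a power of the element count and ignores fixed startup cost, so treat the result as a ballpark only. The helper name scalingExponent is made up for this example.

```javascript
'use strict';

// Timings reported above, in seconds, keyed by element count.
const timings = [
  { elements: 1000, seconds: 0.7295047 },
  { elements: 10000, seconds: 88.95351464 },
];

// If time grows like elements^k, then k = log(t2 / t1) / log(n2 / n1).
// (Hypothetical helper, not part of jsdom or Node.)
function scalingExponent(a, b) {
  return Math.log(b.seconds / a.seconds) / Math.log(b.elements / a.elements);
}

const slowdown = timings[1].seconds / timings[0].seconds;
const k = scalingExponent(timings[0], timings[1]);

console.log(`Slowdown for 10x more elements: ${slowdown.toFixed(1)}x`);
console.log(`Estimated exponent: ${k.toFixed(2)}`);
```

Ten times more elements took about 122 times longer, so the estimated exponent comes out slightly above 2. That points to roughly quadratic behavior rather than the linear growth one would hope for from a parser.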
Top GitHub Comments
Fascinating! I guess it’s probably related to the HTMLCollection changes, but it’s hard to understand why, as HTMLCollection should be irrelevant during parsing, and in general created lazily. Anyway, this is very helpful, and should be enough to let me debug next time I get a jsdom hack day. Further investigation is of course appreciated, but this will get me started.
Further info:
Prior to https://github.com/tmpvar/jsdom/commit/4be8634f4121286454f6f559551ca06b0b967b68, pretty much every HTMLCollection was rooted at the document. Thus, when things changed, it sufficed to increment the document version; that would correctly tell the HTMLCollection that it needs an update.
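For readers unfamiliar with the idea, the document-rooted versioning described above can be sketched roughly like this. This is not jsdom's actual code: SketchDocument, SketchCollection and their fields are hypothetical stand-ins that only illustrate a cache invalidated by a single version counter.

```javascript
'use strict';

// Hypothetical stand-in for a document: one counter covers the whole tree.
class SketchDocument {
  constructor() {
    this.version = 0;
    this.elements = [];
  }

  append(element) {
    this.elements.push(element);
    this.version++; // a single increment invalidates every collection
  }
}

// Hypothetical stand-in for a lazily updated, document-rooted collection.
class SketchCollection {
  constructor(doc, filter) {
    this.doc = doc;
    this.filter = filter;
    this.cachedVersion = -1; // forces a build on first access
    this.cache = [];
    this.rebuilds = 0;
  }

  items() {
    // Rebuild only when the document has changed since the last access.
    if (this.cachedVersion !== this.doc.version) {
      this.cache = this.doc.elements.filter(this.filter);
      this.cachedVersion = this.doc.version;
      this.rebuilds++;
    }
    return this.cache;
  }
}

const doc = new SketchDocument();
const links = new SketchCollection(doc, (el) => el.tag === 'a');

doc.append({ tag: 'a' });
doc.append({ tag: 'p' });
links.items();            // first access builds the cache
links.items();            // unchanged document, cache is reused
doc.append({ tag: 'a' }); // version bump marks the cache as stale
console.log(links.items().length, links.rebuilds);
```

In this sketch a mutation costs one increment no matter how deep the tree is, and a collection pays for a rebuild only when it is actually read after a change.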
But in https://github.com/tmpvar/jsdom/commit/4be8634f4121286454f6f559551ca06b0b967b68, I said “that’s silly; why would a HTMLCollection for select.options be rooted at the document? It should clearly be rooted at the select.” So I switched that. That broke some stuff, since we weren’t properly versioning non-document nodes. So in https://github.com/tmpvar/jsdom/commit/a7cd5a160fd6c295fa80e953fdf051a1c5dc5cf5 I fixed that by doing the ancestor/descendant iteration and incrementing those versions. But, as described above, that makes things slow.
I’m still not sure what the best fix is, but now I see at least a few options: