WordExtractor get_body() doesn't appear to retrieve all text content from .doc file
Love this package; it’s a super neat and clean interface for extracting text.
However, in testing I found some cases where not all text was extracted. You can find the examples here.
The sample 100kB DOC file, for instance, has lots of text in it. Here’s the TypeScript code I executed:
import request from 'request';
import WordExtractor from 'word-extractor';

const fileUrl = 'https://file-examples-com.github.io/uploads/2017/02/file-sample_100kB.doc';

request.get({ url: fileUrl, encoding: null }, (err, res, body) => {
  const extractor = new WordExtractor();
  const extracted = extractor.extract(body);
  extracted.then(function (doc) {
    console.log(doc.getBody());
  });
});
This seems to only extract the following, even though there’s much, much more content in that DOC file.
Lorem ipsum Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ac faucibus odio.
Cras fringilla ipsum magna, in fringilla dui commodo a. Lorem ipsum Lorem ipsum
1 Lorem
2 Ipsum
3 Lorem
4 Lorem
5 Ipsum
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Maecenas mauris lectus, lobortis et purus mattis, blandit dictum tellus.
In eleifend velit vitae libero sollicitudin euismod.
Similarly, I’ve tried getFootnotes, getEndnotes, getHeaders, getFooters, and getAnnotations. All of those return empty content for the document I linked above. getTextboxes returns the same content as getBody.
Is this useful information, or is there something I might be doing wrong with configuration options? It seems like a bug, but I don’t have much experience with extracting text from .doc files.
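For anyone trying to reproduce this, a quick way to compare the sections is to print a character count for each getter instead of eyeballing the raw text. The sketch below is only an illustration: `ExtractedDoc`, `summarizeSections`, and `emptySections` are hypothetical names that are not part of word-extractor, and the interface assumes each getter mentioned above can be called without arguments and returns a string.

```typescript
// Structural type covering only the getters named in this issue.
// Assumption: each getter returns a string, as getBody() does in the snippet above.
interface ExtractedDoc {
  getBody(): string;
  getFootnotes(): string;
  getEndnotes(): string;
  getHeaders(): string;
  getFooters(): string;
  getAnnotations(): string;
  getTextboxes(): string;
}

type SectionReport = Record<string, number>;

// Hypothetical helper (not a word-extractor API): character count per section.
function summarizeSections(doc: ExtractedDoc): SectionReport {
  return {
    body: doc.getBody().length,
    footnotes: doc.getFootnotes().length,
    endnotes: doc.getEndnotes().length,
    headers: doc.getHeaders().length,
    footers: doc.getFooters().length,
    annotations: doc.getAnnotations().length,
    textboxes: doc.getTextboxes().length,
  };
}

// Hypothetical helper: names of the sections that came back with no text at all.
function emptySections(report: SectionReport): string[] {
  return Object.keys(report).filter((name) => report[name] === 0);
}
```

Inside the `then` callback, `console.log(summarizeSections(doc))` would then show at a glance which sections are empty and whether the body count looks plausible for a 100 kB file.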
Issue Analytics
- Created 2 years ago
- Comments:8 (5 by maintainers)
Nice! @morungos I’m testing this now and I’ll post back here with results. Thanks again for your quick response and work here.
💥 much better.
Results after the update on the same doc: