WordExtractor get_body() doesn't appear to retrieve all text content from .doc file


Love this package; it’s a super neat and clean interface for extracting text.

However, in testing I found some cases where not all text was extracted. You can find the examples here.

The sample 100kB DOC file, for instance, has lots of text in it. Here’s the TypeScript code I executed:

import request from 'request';
import WordExtractor from 'word-extractor';

const fileUrl = 'https://file-examples-com.github.io/uploads/2017/02/file-sample_100kB.doc';

// encoding: null makes request hand back the response body as a raw Buffer
request.get({ url: fileUrl, encoding: null }, (err, res, body) => {
  if (err) {
    console.error(err);
    return;
  }
  const extractor = new WordExtractor();
  // body is a Buffer here; extract() returns a Promise for the parsed document
  extractor
    .extract(body)
    .then((doc) => console.log(doc.getBody()))
    .catch(console.error);
});

This seems to only extract the following, even though there’s much, much more content in that DOC file.

Lorem ipsum Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ac faucibus odio. 

Cras fringilla ipsum magna, in fringilla dui commodo a. Lorem ipsum     Lorem ipsum
1       Lorem
2       Ipsum
3       Lorem
4       Lorem
5       Ipsum
Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
Maecenas mauris lectus, lobortis et purus mattis, blandit dictum tellus. 
In eleifend velit vitae libero sollicitudin euismod. 

Similarly, I’ve tried getFootnotes, getEndnotes, getHeaders, getFooters, and getAnnotations. All of those return empty content for the document I linked above.

getTextboxes returns the same content as getBody.
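
To compare the getters listed above side by side, a small helper along these lines may be useful. This is only a sketch: the `ExtractedDocument` interface and the `sectionLengths` function are hypothetical names introduced for illustration, and the interface merely assumes each getter mentioned in this issue can be called without arguments and returns a string.

```typescript
// Assumed shape of the extracted document, limited to the getters named in this issue.
// This interface is written for illustration; it is not the package's own typings.
interface ExtractedDocument {
  getBody(): string;
  getFootnotes(): string;
  getEndnotes(): string;
  getHeaders(): string;
  getFooters(): string;
  getAnnotations(): string;
  getTextboxes(): string;
}

// Hypothetical helper: character count of each section after trimming whitespace,
// so empty sections show up as 0 and duplicated sections show up as equal counts.
function sectionLengths(doc: ExtractedDocument): Record<string, number> {
  const sections: Record<string, string> = {
    body: doc.getBody(),
    footnotes: doc.getFootnotes(),
    endnotes: doc.getEndnotes(),
    headers: doc.getHeaders(),
    footers: doc.getFooters(),
    annotations: doc.getAnnotations(),
    textboxes: doc.getTextboxes(),
  };
  const lengths: Record<string, number> = {};
  for (const name of Object.keys(sections)) {
    lengths[name] = sections[name].trim().length;
  }
  return lengths;
}
```

Passing the resolved document to `sectionLengths` and logging the result would make it easy to see at a glance which sections come back empty and whether the textboxes really match the body.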

Is this useful information, or is there something I might be doing wrong with the configuration options? It seems like a bug, but I don’t have much experience extracting text from .doc files.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

1 reaction
cakemountain commented, Jul 30, 2021

Nice! @morungos I’m testing this now and I’ll post back here with results. Thanks again for your quick response and work here.

0 reactions
cakemountain commented, Jul 30, 2021

💥 much better.

Results after the update on the same doc:

Lorem ipsum 

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ac faucibus odio. 

Vestibulum neque massa, scelerisque sit amet ligula eu, congue molestie mi. Praesent ut varius sem. Nullam at porttitor arcu, nec lacinia nisi. Ut ac dolor vitae odio interdum condimentum. Vivamus dapibus sodales ex, vitae malesuada ipsum cursus convallis. Maecenas sed egestas nulla, ac condimentum orci. Mauris diam felis, vulputate ac suscipit et, iaculis non est. Curabitur semper arcu ac ligula semper, nec luctus nisl blandit. Integer lacinia ante ac libero lobortis imperdiet. Nullam mollis convallis ipsum, ac accumsan nunc vehicula vitae. Nulla eget justo in felis tristique fringilla. Morbi sit amet tortor quis risus auctor condimentum. Morbi in ullamcorper elit. Nulla iaculis tellus sit amet mauris tempus fringilla.
Maecenas mauris lectus, lobortis et purus mattis, blandit dictum tellus.
Maecenas non lorem quis tellus placerat varius. 
Nulla facilisi. 
Aenean congue fringilla justo ut aliquam. 
Mauris id ex erat. Nunc vulputate neque vitae justo facilisis, non condimentum ante sagittis. 
Morbi viverra semper lorem nec molestie. 
Maecenas tincidunt est efficitur ligula euismod, sit amet ornare est vulputate.









In non mauris justo. Duis vehicula mi vel mi pretium, a viverra erat efficitur. Cras aliquam est ac eros varius, id iaculis dui auctor. Duis pretium neque ligula, et pulvinar mi placerat et. Nulla nec nunc sit amet nunc posuere vestibulum. Ut id neque eget tortor mattis tristique. Donec ante est, blandit sit amet tristique vel, lacinia pulvinar arcu. Pellentesque scelerisque fermentum erat, id posuere justo pulvinar ut. Cras id eros sed enim aliquam lobortis. Sed lobortis nisl ut eros efficitur tincidunt. Cras justo mi, porttitor quis mattis vel, ultricies ut purus. Ut facilisis et lacus eu cursus.
In eleifend velit vitae libero sollicitudin euismod. Fusce vitae vestibulum velit. Pellentesque vulputate lectus quis pellentesque commodo. Aliquam erat volutpat. Vestibulum in egestas velit. Pellentesque fermentum nisl vitae fringilla venenatis. Etiam id mauris vitae orci maximus ultricies. 

Cras fringilla ipsum magna, in fringilla dui commodo a.

        Lorem ipsum     Lorem ipsum     Lorem ipsum
1       In eleifend velit vitae libero sollicitudin euismod.    Lorem
2       Cras fringilla ipsum magna, in fringilla dui commodo a. Ipsum
3       Aliquam erat volutpat.  Lorem
4       Fusce vitae vestibulum velit.   Lorem
5       Etiam vehicula luctus fermentum.        Ipsum

Etiam vehicula luctus fermentum. In vel metus congue, pulvinar lectus vel, fermentum dui. Maecenas ante orci, egestas ut aliquet sit amet, sagittis a magna. Aliquam ante quam, pellentesque ut dignissim quis, laoreet eget est. Aliquam erat volutpat. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Ut ullamcorper justo sapien, in cursus libero viverra eget. Vivamus auctor imperdiet urna, at pulvinar leo posuere laoreet. Suspendisse neque nisl, fringilla at iaculis scelerisque, ornare vel dolor. Ut et pulvinar nunc. Pellentesque fringilla mollis efficitur. Nullam venenatis commodo imperdiet. Morbi velit neque, semper quis lorem quis, efficitur dignissim ipsum. Ut ac lorem sed turpis imperdiet eleifend sit amet id sapien.


Lorem ipsum dolor sit amet, consectetur adipiscing elit. 

Nunc ac faucibus odio. Vestibulum neque massa, scelerisque sit amet ligula eu, congue molestie mi. Praesent ut varius sem. Nullam at porttitor arcu, nec lacinia nisi. Ut ac dolor vitae odio interdum condimentum. Vivamus dapibus sodales ex, vitae malesuada ipsum cursus convallis. Maecenas sed egestas nulla, ac condimentum orci. Mauris diam felis, vulputate ac suscipit et, iaculis non est. Curabitur semper arcu ac ligula semper, nec luctus nisl blandit. Integer lacinia ante ac libero lobortis imperdiet. Nullam mollis convallis ipsum, ac accumsan nunc vehicula vitae. Nulla eget justo in felis tristique fringilla. Morbi sit amet tortor quis risus auctor condimentum. Morbi in ullamcorper elit. Nulla iaculis tellus sit amet mauris tempus fringilla.
Maecenas mauris lectus, lobortis et purus mattis, blandit dictum tellus. 
Maecenas non lorem quis tellus placerat varius. Nulla facilisi. Aenean congue fringilla justo ut aliquam. Mauris id ex erat. Nunc vulputate neque vitae justo facilisis, non condimentum ante sagittis. Morbi viverra semper lorem nec molestie. Maecenas tincidunt est efficitur ligula euismod, sit amet ornare est vulputate.
In non mauris justo. Duis vehicula mi vel mi pretium, a viverra erat efficitur. Cras aliquam est ac eros varius, id iaculis dui auctor. Duis pretium neque ligula, et pulvinar mi placerat et. Nulla nec nunc sit amet nunc posuere vestibulum. Ut id neque eget tortor mattis tristique. Donec ante est, blandit sit amet tristique vel, lacinia pulvinar arcu. Pellentesque scelerisque fermentum erat, id posuere justo pulvinar ut. Cras id eros sed enim aliquam lobortis. Sed lobortis nisl ut eros efficitur tincidunt. Cras justo mi, porttitor quis mattis vel, ultricies ut purus. Ut facilisis et lacus eu cursus.
In eleifend velit vitae libero sollicitudin euismod. 
Fusce vitae vestibulum velit. Pellentesque vulputate lectus quis pellentesque commodo. Aliquam erat volutpat. Vestibulum in egestas velit. Pellentesque fermentum nisl vitae fringilla venenatis. Etiam id mauris vitae orci maximus ultricies. Cras fringilla ipsum magna, in fringilla dui commodo a.
Etiam vehicula luctus fermentum. In vel metus congue, pulvinar lectus vel, fermentum dui. Maecenas ante orci, egestas ut aliquet sit amet, sagittis a magna. Aliquam ante quam, pellentesque ut dignissim quis, laoreet eget est. Aliquam erat volutpat. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Ut ullamcorper justo sapien, in cursus libero viverra eget. Vivamus auctor imperdiet urna, at pulvinar leo posuere laoreet. Suspendisse neque nisl, fringilla at iaculis scelerisque, ornare vel dolor. Ut et pulvinar nunc. Pellentesque fringilla mollis efficitur. Nullam venenatis commodo imperdiet. Morbi velit neque, semper quis lorem quis, efficitur dignissim ipsum. Ut ac lorem sed turpis imperdiet eleifend sit amet id sapien.
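
One way to catch this kind of truncation in future is to check the extracted text for phrases known to be in the document. The sketch below is an illustration only: `findMissingPhrases` is a hypothetical helper, and the sample strings are copied from the outputs quoted in this issue.

```typescript
// Hypothetical helper: return the expected phrases that do not occur in the extracted text.
// Whitespace is collapsed first, so tabs and line breaks in the output do not cause false misses.
function findMissingPhrases(extracted: string, expected: string[]): string[] {
  const normalize = (s: string): string => s.replace(/\s+/g, " ").trim();
  const haystack = normalize(extracted);
  return expected.filter((phrase) => haystack.indexOf(normalize(phrase)) === -1);
}

// First lines of the truncated output reported at the top of this issue.
const truncatedBody =
  "Lorem ipsum Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ac faucibus odio.";

// The second phrase only appears in the output produced after the update.
const missing = findMissingPhrases(truncatedBody, [
  "Nunc ac faucibus odio.",
  "Vestibulum neque massa",
]);

console.log(missing); // ["Vestibulum neque massa"]
```

Running the same check against the output of `getBody()` after the update should return an empty list for these two phrases.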
