Extract from a buffer

I am using Node.js and downloading .doc files using superagent. This gives me a buffer object that I would like to parse and extract text from. However, word-extractor only seems to support files.

How do I extract the text from a .doc in memory, not in a file?
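While the extractor only accepts file paths, one possible workaround is to write the buffer to a temporary file and pass that path along. This is a minimal sketch using only Node's built-in modules; `withTempFile` is a hypothetical helper name and not part of word-extractor.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper (not part of word-extractor): writes `buffer` to a
// temporary file, passes the file path to `callback`, and removes the
// temporary directory afterwards, even if the callback throws.
async function withTempFile(buffer, callback) {
  const dir = await fs.promises.mkdtemp(path.join(os.tmpdir(), 'doc-'));
  const filePath = path.join(dir, 'download.doc');
  try {
    await fs.promises.writeFile(filePath, buffer);
    return await callback(filePath);
  } finally {
    await fs.promises.rm(dir, { recursive: true, force: true });
  }
}
```

It could be called as `withTempFile(buf, p => extractor.extract(p))`, assuming `extractor` is a word-extractor instance whose `extract` method takes a file path and returns a promise.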

Issue Analytics

State:
Created 6 years ago
Reactions: 1
Comments: 10 (5 by maintainers)

Top GitHub Comments

3 reactions

morungos commented, Jan 15, 2018

That’s mainly an issue of the underlying OLE implementation, which is very much wired to use files. All the logic that depends on fs is local to OleCompoundDoc, so one solution would be to build an alternative implementation of that class that is backed by a buffer rather than a file. Or, perhaps better, to refactor the file system access into a separate set of methods that could be overridden more easily.

It’s a nice and important addition. If I can get the time for this, I will.
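The refactoring idea above could look roughly like the following sketch. All class and function names here (`ByteSource`, `FileSource`, `BufferSource`, `readSignature`) are hypothetical and are not taken from OleCompoundDoc; the point is only that the parsing code talks to one read method, which a file-backed or a buffer-backed class can each implement.

```javascript
const fs = require('fs');

// Hypothetical base class: parsing logic calls readAt() and never touches fs.
class ByteSource {
  // Returns up to `length` bytes starting at `position` as a Buffer.
  readAt(position, length) {
    throw new Error('readAt() must be implemented by a subclass');
  }
}

// File-backed source: the only place that depends on fs.
class FileSource extends ByteSource {
  constructor(filename) {
    super();
    this.fd = fs.openSync(filename, 'r');
  }
  readAt(position, length) {
    const out = Buffer.alloc(length);
    const bytesRead = fs.readSync(this.fd, out, 0, length, position);
    return out.subarray(0, bytesRead);
  }
  close() {
    fs.closeSync(this.fd);
  }
}

// Buffer-backed source: same interface, no file system access.
class BufferSource extends ByteSource {
  constructor(buffer) {
    super();
    this.buffer = buffer;
  }
  readAt(position, length) {
    return this.buffer.subarray(position, position + length);
  }
}

// Example consumer: reads the 8-byte signature at the start of the data
// and returns it as a hex string. It works with either kind of source.
function readSignature(source) {
  return source.readAt(0, 8).toString('hex');
}
```

With this shape, supporting in-memory documents would mean passing a `BufferSource` where a `FileSource` is used today, leaving the parsing code unchanged.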

2 reactions

olsonpm commented, Oct 31, 2018

I implemented buffer support at gmr-fms/node-word-extractor if you guys are willing to switch to the npm package @gmr-fms/word-extractor. I didn’t want to work with CoffeeScript, hence the JS source and slightly modified API.

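const fs = require('fs')
const extract = require('@gmr-fms/word-extractor')

// Read the whole .doc into memory; any Buffer works here,
// for example the body of an HTTP download.
const buf = fs.readFileSync('path/to/file.doc')

extract.fromBuffer(buf).then(doc => {
  // do stuff with doc here
}).catch(err => {
  // reading or parsing failed
  console.error(err)
})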

I really appreciate this library, though; the code was very clean and easy to follow.