FSCrawler can't index .doc or .docx elements

See original GitHub issue

Describe the bug

Whenever I try to index .doc or .docx files, I get a warning and the files don’t get indexed.

07:57:41,337 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [C:\ELK\temp\es\LBR\2016\test.docx] -> org/apache/poi/hemf/extractor/HemfExtractor

It all works fine with .pdf documents, so I expected the same with Word documents.
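
A quick way to see the underlying exception outside of FSCrawler is to run Tika directly against the same file. Below is a minimal standalone sketch (the class name is made up; it assumes tika-core and tika-parsers are on the classpath and uses the path from the warning above):

import java.io.File;
import org.apache.tika.Tika;

public class TikaCheck {
    public static void main(String[] args) throws Exception {
        // Parse the document FSCrawler failed on and print the first few characters.
        // If this succeeds here but fails inside FSCrawler, the problem is in
        // FSCrawler's bundled POI/Tika jars rather than in the document itself.
        Tika tika = new Tika();
        String text = tika.parseToString(new File("C:\\ELK\\temp\\es\\LBR\\2016\\test.docx"));
        System.out.println(text.substring(0, Math.min(200, text.length())));
    }
}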

Versions:

  • OS: Windows 10
  • Elasticsearch Version 7.5.2
  • FSCrawler Version 2.7
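
For context, a minimal FSCrawler 2.7 job settings file (~/.fscrawler/<job_name>/_settings.yaml) matching this environment might look roughly like the sketch below; the job name and update rate are placeholders, and only the crawled directory is taken from the warning above:

name: "lbr_docs"
fs:
  # directory being crawled (from the path in the warning)
  url: "C:\\ELK\\temp\\es"
  update_rate: "15m"
elasticsearch:
  nodes:
  - url: "http://127.0.0.1:9200"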

EDIT:

So I recreated a .docx file with a few sentences and it worked. So what does the above error mean?

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 27 (11 by maintainers)

Top GitHub Comments

1 reaction
dadoonet commented, Feb 19, 2020

I guess you would need to change some libs in the FSCrawler lib dir, or revert https://github.com/dadoonet/fscrawler/pull/855 and compile the project again.
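
A rough sketch of that second option, assuming a standard local Maven build (the merge commit hash for #855 is not shown here and has to be looked up on GitHub):

git clone https://github.com/dadoonet/fscrawler.git
cd fscrawler
# revert the merge commit of PR #855 (replace <merge-commit> with the real hash)
git revert -m 1 <merge-commit>
# rebuild the project without running the test suite
mvn clean package -DskipTests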

1 reaction
dadoonet commented, Feb 12, 2020

Thank you for the file. That’s definitely a bug in FSCrawler, introduced by #855.

To fix it, I “just” need to pull in PR #865, but there’s still “a blocker” in that one, as I have seen a regression. I need to revisit it at some point.

Read more comments on GitHub

Top Results From Across the Web

  • Fscrawler doesn't seem to index against Includes - Elasticsearch
    Basically I'm trying to create an index that includes only files of the types on the includes line. After it finishes, if I...
  • FSCrawler - Read the Docs
    This crawler helps to index binary documents such as PDF, Open Office, MS Office. Main features: Local file system (or a mounted drive)...
  • Local FS settings — FSCrawler 2.10-SNAPSHOT documentation
    Let's say you want to index only docs like *.doc and *.pdf but resume* . ... and xml documents directly onto the _source...
  • FSCrawler 2.8 documentation - Read the Docs
    This crawler helps to index binary documents such as PDF, Open Office, MS Office. Main features: Local file system (or a mounted drive)...
  • Local FS settings — FSCrawler 2.7 documentation
    Let's say you want to index only docs like *.doc and *.pdf but resume* . ... and xml documents directly onto the _source...
