FSCrawler can't index .doc or .docx elements

See original GitHub issue

Describe the bug

Whenever I try to index .doc or .docx files, I get a warning and the files don’t get indexed.

07:57:41,337 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [C:\ELK\temp\es\LBR\2016\test.docx] -> org/apache/poi/hemf/extractor/HemfExtractor

It all works fine with .pdf documents, so I expected the same with Word documents.
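
A quick way to see the underlying exception outside of FSCrawler is to run Tika directly against the same file. Below is a minimal standalone sketch (the class name is made up; it assumes tika-core and tika-parsers are on the classpath and uses the path from the warning above):

import java.io.File;
import org.apache.tika.Tika;

public class TikaCheck {
    public static void main(String[] args) throws Exception {
        // Parse the document FSCrawler failed on and print the first few characters.
        // If this succeeds here but fails inside FSCrawler, the problem is in
        // FSCrawler's bundled POI/Tika jars rather than in the document itself.
        Tika tika = new Tika();
        String text = tika.parseToString(new File("C:\\ELK\\temp\\es\\LBR\\2016\\test.docx"));
        System.out.println(text.substring(0, Math.min(200, text.length())));
    }
}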

Versions:

  • OS: Windows 10
  • Elasticsearch Version 7.5.2
  • FSCrawler Version 2.7
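
For context, a minimal FSCrawler 2.7 job settings file (~/.fscrawler/<job_name>/_settings.yaml) matching this environment might look roughly like the sketch below; the job name and update rate are placeholders, and only the crawled directory is taken from the warning above:

name: "lbr_docs"
fs:
  # directory being crawled (from the path in the warning)
  url: "C:\\ELK\\temp\\es"
  update_rate: "15m"
elasticsearch:
  nodes:
  - url: "http://127.0.0.1:9200"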

EDIT:

So I recreated a .docx file with a few sentences and it worked. So what does the above error mean?

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 27 (11 by maintainers)

Top GitHub Comments

1 reaction
dadoonet commented, Feb 19, 2020

I guess you would need to change some libs in the FSCrawler lib dir, or revert https://github.com/dadoonet/fscrawler/pull/855 and compile the project again.
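
A rough sketch of that second option, assuming a standard local Maven build (the merge commit hash for #855 is not shown here and has to be looked up on GitHub):

git clone https://github.com/dadoonet/fscrawler.git
cd fscrawler
# revert the merge commit of PR #855 (replace <merge-commit> with the real hash)
git revert -m 1 <merge-commit>
# rebuild the project without running the test suite
mvn clean package -DskipTests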

1 reaction
dadoonet commented, Feb 12, 2020

Thank you for the file. That’s definitely a bug in FSCrawler, introduced by #855.

To fix it, I “just” need to pull in PR #865, but there’s still “a blocker” in that one, as I have seen a regression. I need to revisit it at some point.

Read more comments on GitHub

Top Results From Across the Web

  • Fscrawler doesn't seem to index against Includes - Elasticsearch
    Basically I'm trying to create an index that includes only files of the types on the includes line. After it finishes, if I...
  • FSCrawler - Read the Docs
    This crawler helps to index binary documents such as PDF, Open Office, MS Office. Main features: Local file system (or a mounted drive)...
  • Local FS settings — FSCrawler 2.10-SNAPSHOT documentation
    Let's say you want to index only docs like *.doc and *.pdf but resume* . ... and xml documents directly onto the _source...
  • FSCrawler 2.8 documentation - Read the Docs
    This crawler helps to index binary documents such as PDF, Open Office, MS Office. Main features: Local file system (or a mounted drive)...
  • Local FS settings — FSCrawler 2.7 documentation
    Let's say you want to index only docs like *.doc and *.pdf but resume* . ... and xml documents directly onto the _source...
