FSCrawler can't index .doc or .docx elements
Describe the bug
Whenever I try to index .doc or .docx files I get a warning and the files don’t get indexed.
07:57:41,337 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [C:\ELK\temp\es\LBR\2016\test.docx] -> org/apache/poi/hemf/extractor/HemfExtractor
Everything works fine with .pdf documents, so I expected the same with Word documents.
Versions:
- OS: Windows 10
- Elasticsearch Version 7.5.2
- FSCrawler Version 2.7
EDIT:
So I recreated a .docx file with a few sentences and it worked. What does the above error mean, then?
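One way to narrow this down (a minimal sketch, not FSCrawler's own code) is to run Apache Tika against the same file outside of FSCrawler, since FSCrawler's TikaDocParser delegates extraction to Tika. The path below is just the one from the warning, and the sketch assumes the tika-parsers jar and its POI dependencies are on the classpath:

```java
import java.io.File;
import org.apache.tika.Tika;

// Illustrative only: extract text from the same .docx with the Tika facade
// to check whether the document itself or the bundled Tika/POI jars are the problem.
public class TikaExtractTest {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        // Tika detects the MIME type and, for .docx, hands the file to its POI-based OOXML parser.
        String text = tika.parseToString(new File("C:\\ELK\\temp\\es\\LBR\\2016\\test.docx"));
        // Print the first few hundred characters as a sanity check.
        System.out.println(text.substring(0, Math.min(500, text.length())));
    }
}
```

If this standalone test succeeds with matching Tika/POI versions but FSCrawler still logs the warning, that points at the jars shipped in FSCrawler rather than at the document.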
I guess that you would need to change some libs in the FSCrawler lib dir, or revert this https://github.com/dadoonet/fscrawler/pull/855 and compile the project again.
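For context, the org/apache/poi/hemf/extractor/HemfExtractor at the end of the warning is the typical message of a class that cannot be found at runtime, which usually means the POI jars in the lib dir are out of sync with what Tika expects. As an illustrative sketch (not part of FSCrawler), you can check whether that class is visible when the jars from the lib dir are on the classpath:

```java
// Illustrative only: verify that the class named in the warning is actually loadable.
// Run with the jars from FSCrawler's lib directory on the classpath.
public class PoiClasspathCheck {
    public static void main(String[] args) {
        try {
            Class.forName("org.apache.poi.hemf.extractor.HemfExtractor");
            System.out.println("HemfExtractor is on the classpath");
        } catch (ClassNotFoundException e) {
            System.out.println("HemfExtractor is missing -> the POI jars are likely mismatched");
        }
    }
}
```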
Thank you for the file. That's definitely a bug in FSCrawler which was introduced by #855.
To fix it, I “just” need to pull in this PR: #865, but there’s still “a blocker” in that one, as I have seen a regression. I need to revisit it at some point.