content is null when creating the most simple job
Hello. Thank you for this package. I’m trying it, but I keep getting null in the content, even for a plain text file containing the plain text shadi. Could you give me some pointers on how I can get the content to show up? Other than the plain text file, I’d like to index .xlsx, .xls, and .pdf formats.
Here is my job settings file:
{
  "name" : "sic_list",
  "fs" : {
    "url" : "/data/fscrawler/files",
    "update_rate": "1m",
    "indexed_chars": "100%"
  },
  "elasticsearch" : {
    "index" : "sic_list",
    "type": "doc",
    "nodes" : [
      { "host" : "myhost.com", "port" : 9200 }
    ]
  }
}
and here is an excerpt from my --trace output:
10:18:31,678 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [test.txt], includes = [null], excludes = [null]
10:18:31,679 TRACE [f.p.e.c.f.u.FsCrawlerUtil] no rules
10:18:31,679 DEBUG [f.p.e.c.f.FsCrawlerImpl] [test.txt] can be indexed: [true]
10:18:31,679 DEBUG [f.p.e.c.f.FsCrawlerImpl] - file: test.txt
10:18:31,680 DEBUG [f.p.e.c.f.FsCrawlerImpl] fetching content from [/data/fscrawler/files],[test.txt]
10:18:31,680 DEBUG [f.p.e.c.f.FsCrawlerImpl] Indexing in ES sic_list, doc, 57e81419ed4fa6aa86d668bb9e28674
10:18:31,681 TRACE [f.p.e.c.f.FsCrawlerImpl] JSon indexed : {
"content" : null,
"attachment" : null,
"meta" : {
"author" : null,
"title" : null,
"date" : null,
"keywords" : null,
"raw" : null
},
"file" : {
"content_type" : null,
"last_modified" : "2017-01-21T10:10:03Z",
"indexing_date" : "2017-01-21T10:18:31.680Z",
"filesize" : null,
"filename" : "test.txt",
"url" : "file:///data/fscrawler/files/test.txt",
"indexed_chars" : null,
"checksum" : null
},
"path" : {
"encoded" : "6113a2c108ffc50c1fd761817d96ca7",
"root" : "6113a2c108ffc50c1fd761817d96ca7",
"virtual" : "",
"real" : "/data/fscrawler/files/test.txt"
},
"attributes" : null
}
I’m running fscrawler from a Dockerfile:
FROM openjdk:alpine
RUN apk add --update openssl
RUN wget https://repo1.maven.org/maven2/fr/pilato/elasticsearch/crawler/fscrawler/2.1/fscrawler-2.1.zip \
&& unzip fscrawler-2.1.zip
RUN mkdir ~/.fscrawler
WORKDIR ./fscrawler-2.1
ENTRYPOINT cp /data/fscrawler/home/* ~/.fscrawler -r \
&& bin/fscrawler --trace sic_list
with the following docker command
docker build -t fscrawler build/fscrawler/
docker run -it --rm --name fscrawler-siclist \
-v /home/shadi/sic_lists/:/data/fscrawler/files/:ro \
-v "${PWD}"/home/:/data/fscrawler/home/:ro \
fscrawler
All the files are readable by the same user that launches fscrawler.
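As a quick way to summarise a trace like the one above, here is a minimal Python sketch that lists every field that came back as null in an indexed document. It is not part of fscrawler: the helper name `null_paths` is hypothetical, and the JSON is an abbreviated copy of the traced document, not the full output.

```python
import json


def null_paths(node, prefix=""):
    """Return the dotted paths of every object field whose value is JSON null."""
    paths = []
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            if value is None:
                paths.append(path)
            else:
                # Recurse into nested objects such as "meta" and "file".
                paths.extend(null_paths(value, path))
    return paths


# Abbreviated copy of the document printed by --trace (assumption: trimmed
# by hand for the example; field names are taken from the trace above).
traced = json.loads("""
{
  "content": null,
  "meta": {"author": null, "title": null},
  "file": {"content_type": null, "filename": "test.txt", "filesize": null},
  "path": {"virtual": "", "real": "/data/fscrawler/files/test.txt"}
}
""")

missing = null_paths(traced)
print(missing)
```

In the trace above, both `content` and the extracted metadata (`content_type`, `filesize`, `meta.*`) are null while the file system fields are filled in. That pattern suggests the file was found and read from disk but its content was never extracted.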
Issue Analytics
- State:
- Created 7 years ago
- Comments:15 (15 by maintainers)
Top GitHub Comments
Ha! Indeed it’s actually true by default but only if you generate the job with FS Crawler. If you do it manually, it’s actually false.
Thanks a lot for finding this nasty bug.
I know there are some others which I’m going to fix now.
Btw, I took the liberty of opening a couple of separate issues for things I noticed while testing fscrawler. I hope you don’t mind 😃