[SSH] Error while indexing content from /home/administrateur : Auth fail
(I hope you don't mind me writing this post in French; it is easier for me to explain myself that way.)

Hello,

First of all, thank you, Mr. Pilato, for the work you have put into this tool.

To replace the solution currently in place at my company (Solr + Manifold), we would like to use the Elastic Stack + FSCrawler, since we are working with an indexing volume of more than 10M documents, and more than twice that in emails.

I am therefore currently running tests on a Debian machine. The problem I am running into is the following:
administrateur@srv-elastic-pack-test:~/.fscrawler/test$ fscrawler test --trace
11:17:13,162 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [1/doc.json] already exists
11:17:13,164 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [1/folder.json] already exists
11:17:13,164 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [1/_settings.json] already exists
11:17:13,164 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [2/doc.json] already exists
11:17:13,165 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [2/folder.json] already exists
11:17:13,165 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [2/_settings.json] already exists
11:17:13,165 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [5/doc.json] already exists
11:17:13,165 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [5/folder.json] already exists
11:17:13,165 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [5/_settings.json] already exists
11:17:13,171 DEBUG [f.p.e.c.f.FsCrawler] Starting job [test]...
11:17:13,296 TRACE [f.p.e.c.f.FsCrawler] settings used for this crawler: [{
"name" : "test",
"fs" : {
"url" : "/home/administrateur",
"update_rate" : "1m",
"includes" : [ "*.doc" ],
"json_support" : false,
"filename_as_id" : false,
"add_filesize" : true,
"remove_deleted" : true,
"add_as_inner_object" : false,
"store_source" : false,
"index_content" : true,
"attributes_support" : false,
"raw_metadata" : true,
"xml_support" : false,
"index_folders" : true,
"lang_detect" : false
},
"server" : {
"hostname" : "192.168.37.41",
"port" : 22,
"username" : "administrateur",
"protocol" : "ssh"
},
"elasticsearch" : {
"nodes" : [ {
"host" : "127.0.0.1",
"port" : 9200,
"scheme" : "HTTP"
} ],
"type" : "doc",
"bulk_size" : 100,
"flush_interval" : "5s"
},
"rest" : {
"scheme" : "HTTP",
"host" : "127.0.0.1",
"port" : 8080,
"endpoint" : "fscrawler"
}
}]
11:17:13,300 INFO [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
11:17:13,300 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
11:17:13,699 DEBUG [f.p.e.c.f.c.ElasticsearchClient] findVersion()
11:17:13,774 TRACE [f.p.e.c.f.c.ElasticsearchClient] get server response: {name=node-0, cluster_name=es-test, cluster_uuid=E_iblWbUTU6xjk3eqgC1hA, version={number=5.1.2, build_hash=c8c4c16, build_date=2017-01-11T20:18:39.146Z, build_snapshot=false, lucene_version=6.3.0}, tagline=You Know, for Search}
11:17:13,775 DEBUG [f.p.e.c.f.c.ElasticsearchClient] findVersion() -> [5.1.2]
11:17:13,775 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Using elasticsearch >= 5, so we use [stored_fields] as fields option
11:17:13,775 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Using elasticsearch >= 5, so we can use ingest node feature
11:17:13,776 DEBUG [f.p.e.c.f.c.BulkProcessor] Creating a bulk processor with size [100], flush [5s], pipeline [null]
11:17:13,779 DEBUG [f.p.e.c.f.c.ElasticsearchClient] findVersion()
11:17:13,781 TRACE [f.p.e.c.f.c.ElasticsearchClient] get server response: {name=node-0, cluster_name=es-test, cluster_uuid=E_iblWbUTU6xjk3eqgC1hA, version={number=5.1.2, build_hash=c8c4c16, build_date=2017-01-11T20:18:39.146Z, build_snapshot=false, lucene_version=6.3.0}, tagline=You Know, for Search}
11:17:13,781 DEBUG [f.p.e.c.f.c.ElasticsearchClient] findVersion() -> [5.1.2]
11:17:13,782 DEBUG [f.p.e.c.f.c.ElasticsearchClientManager] FS crawler connected to an elasticsearch [5.1.2] node.
11:17:13,782 DEBUG [f.p.e.c.f.c.ElasticsearchClient] create index [test]
11:17:13,782 TRACE [f.p.e.c.f.c.ElasticsearchClient] index settings: [{
"settings": {
"analysis": {
"analyzer": {
"fscrawler_path": {
"tokenizer": "fscrawler_path"
}
},
"tokenizer": {
"fscrawler_path": {
"type": "path_hierarchy"
}
}
}
}
}
]
11:17:13,859 TRACE [f.p.e.c.f.c.ElasticsearchClient] index already exists. Ignoring error...
11:17:13,860 DEBUG [f.p.e.c.f.c.ElasticsearchClient] is existing type [test]/[doc]
11:17:13,864 TRACE [f.p.e.c.f.c.ElasticsearchClient] get index metadata response: {test={aliases={}, mappings={folder={properties={encoded={type=keyword, store=true}, name={type=keyword, store=true}, real={type=keyword, store=true}, root={type=keyword, store=true}, virtual={type=keyword, store=true}}}, doc={properties={attachment={type=binary}, attributes={properties={group={type=keyword}, owner={type=keyword}}}, content={type=text}, file={properties={checksum={type=keyword}, content_type={type=keyword}, extension={type=keyword}, filename={type=keyword}, filesize={type=long}, indexed_chars={type=long}, indexing_date={type=date, format=dateOptionalTime}, last_modified={type=date, format=dateOptionalTime}, url={type=keyword, index=false}}}, meta={properties={author={type=text}, date={type=date, format=dateOptionalTime}, keywords={type=text}, language={type=keyword}, title={type=text}}}, object={type=object}, path={properties={encoded={type=keyword}, real={type=keyword, fields={tree={type=text, analyzer=fscrawler_path, fielddata=true}}}, root={type=keyword}, virtual={type=keyword, fields={tree={type=text, analyzer=fscrawler_path, fielddata=true}}}}}}}}, settings={index={number_of_shards=5, provided_name=test, creation_date=1486462496211, analysis={analyzer={fscrawler_path={tokenizer=fscrawler_path}}, tokenizer={fscrawler_path={type=path_hierarchy}}}, number_of_replicas=1, uuid=9uoBSiB0TbGTjREyIWTFdw, version={created=5010299}}}}}
11:17:13,865 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Mapping [test]/[doc] already exists.
11:17:13,865 DEBUG [f.p.e.c.f.c.ElasticsearchClient] is existing type [test]/[folder]
11:17:13,869 TRACE [f.p.e.c.f.c.ElasticsearchClient] get index metadata response: {test={aliases={}, mappings={folder={properties={encoded={type=keyword, store=true}, name={type=keyword, store=true}, real={type=keyword, store=true}, root={type=keyword, store=true}, virtual={type=keyword, store=true}}}, doc={properties={attachment={type=binary}, attributes={properties={group={type=keyword}, owner={type=keyword}}}, content={type=text}, file={properties={checksum={type=keyword}, content_type={type=keyword}, extension={type=keyword}, filename={type=keyword}, filesize={type=long}, indexed_chars={type=long}, indexing_date={type=date, format=dateOptionalTime}, last_modified={type=date, format=dateOptionalTime}, url={type=keyword, index=false}}}, meta={properties={author={type=text}, date={type=date, format=dateOptionalTime}, keywords={type=text}, language={type=keyword}, title={type=text}}}, object={type=object}, path={properties={encoded={type=keyword}, real={type=keyword, fields={tree={type=text, analyzer=fscrawler_path, fielddata=true}}}, root={type=keyword}, virtual={type=keyword, fields={tree={type=text, analyzer=fscrawler_path, fielddata=true}}}}}}}}, settings={index={number_of_shards=5, provided_name=test, creation_date=1486462496211, analysis={analyzer={fscrawler_path={tokenizer=fscrawler_path}}, tokenizer={fscrawler_path={type=path_hierarchy}}}, number_of_replicas=1, uuid=9uoBSiB0TbGTjREyIWTFdw, version={created=5010299}}}}}
11:17:13,869 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Mapping [test]/[folder] already exists.
11:17:13,871 DEBUG [f.p.e.c.f.FsCrawlerImpl] creating fs crawler thread [test] for [/home/administrateur] every [1m]
11:17:13,879 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler started for [test] for [/home/administrateur] every [1m]
11:17:13,880 DEBUG [f.p.e.c.f.FsCrawlerImpl] Fs crawler thread [test] is now running. Run #1...
11:17:13,881 DEBUG [f.p.e.c.f.f.FileAbstractor] Opening SSH connection to administrateur@192.168.37.41
11:17:19,162 WARN [f.p.e.c.f.FsCrawlerImpl] Error while indexing content from /home/administrateur: Auth fail
11:17:19,162 WARN [f.p.e.c.f.FsCrawlerImpl] Error while closing the connection: java.lang.NullPointerException
11:17:19,163 DEBUG [f.p.e.c.f.FsCrawlerImpl] Fs crawler is going to sleep for 1m
^C11:18:09,811 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [test]
11:18:09,813 DEBUG [f.p.e.c.f.FsCrawlerImpl] Fs crawler is now waking up again...
11:18:09,813 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread [test] is now marked as closed...
11:18:09,813 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
11:18:09,814 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler Rest service stopped
11:18:09,814 DEBUG [f.p.e.c.f.c.ElasticsearchClientManager] Closing Elasticsearch client manager
11:18:09,814 DEBUG [f.p.e.c.f.c.BulkProcessor] Closing BulkProcessor
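One thing stands out in the trace above: the `server` section contains no `password` (and no PEM key), which is a common cause of a JSch `Auth fail` when `protocol` is `ssh`. As a point of comparison only, a minimal sketch of the job file (`~/.fscrawler/test/_settings.json`) with password authentication could look like the fragment below; the `password` placeholder value is of course hypothetical, and the FSCrawler SSH documentation also describes a `pem_path` option for key-based authentication:

```json
{
  "name" : "test",
  "fs" : {
    "url" : "/home/administrateur",
    "update_rate" : "1m",
    "includes" : [ "*.doc" ]
  },
  "server" : {
    "hostname" : "192.168.37.41",
    "port" : 22,
    "username" : "administrateur",
    "password" : "<ssh-password-here>",
    "protocol" : "ssh"
  }
}
```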
I have made several different attempts:
- Using the hostname instead of the IP address
- Tests against both the production and test servers
- A test against a local machine with no firewall (the test above)
- Tests with different paths
- Tests with FSCrawler 2.3-SNAPSHOT and with 2.2
My versions are as follows:
- Java SDK 1.8.0_121
- Elastic Stack 5.1.2
- fscrawler 2.2 (failed), then 2.3-SNAPSHOT (failed)
Questions:
1. Is it possible to specify several URLs for a single job?
2. Does the crawler traverse the entire directory tree under the specified path, or only the files directly inside the specified folder?
/principal_folder
    /folder
        files
    files
or
/principal_folder
    files
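On question 2: by default, FSCrawler walks the directory tree under `fs.url` recursively rather than indexing only the top-level folder. To picture what that means for the two layouts sketched above, here is a small illustrative recursive walk in Python (this is not FSCrawler code, just an analogy for the traversal behavior):

```python
import os
import tempfile

# Build a tiny tree like the first layout above:
# principal_folder/folder/a.doc and principal_folder/b.doc
root = tempfile.mkdtemp(prefix="principal_folder_")
os.makedirs(os.path.join(root, "folder"))
open(os.path.join(root, "folder", "a.doc"), "w").close()
open(os.path.join(root, "b.doc"), "w").close()

# A recursive walk visits files at every depth, which is
# analogous to how FSCrawler traverses fs.url by default.
found = []
for dirpath, dirnames, filenames in os.walk(root):
    for name in filenames:
        found.append(os.path.relpath(os.path.join(dirpath, name), root))

print(sorted(found))  # both b.doc and folder/a.doc are picked up
```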
Thank you in advance for your help! Best regards, MS
Issue Analytics
- State:
- Created 7 years ago
- Comments: 7 (4 by maintainers)
Top GitHub Comments
I confirm that it works. Thank you!
Normally this is fixed by #329.
You would need to try the latest SNAPSHOT version to confirm. Thanks!