Is it possible to write to Amazon Elasticsearch Service using elasticsearch-hadoop?
I have been trying to write to AWS’s new Amazon Elasticsearch Service from a Scalding job using elasticsearch-hadoop (via scalding-taps).
This job has previously worked using Elasticsearch manually installed on an EC2 instance, implying that the problem is specific to Amazon Elasticsearch Service.
The first time I tried I got this error:
Caused by: org.elasticsearch.hadoop.EsHadoopIllegalStateException: Cluster state volatile; cannot find node backing shards - please check whether your cluster is stable
at org.elasticsearch.hadoop.rest.RestRepository.getWriteTargetPrimaryShards(RestRepository.java:370)
at org.elasticsearch.hadoop.rest.RestService.initSingleIndex(RestService.java:425)
at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:393)
at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.init(EsOutputFormat.java:173)
at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.write(EsOutputFormat.java:149)
at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:844)
at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:596)
at cascading.tap.hadoop.util.MeasuredOutputCollector.collect(MeasuredOutputCollector.java:69)
at org.elasticsearch.hadoop.cascading.EsHadoopScheme.sink(EsHadoopScheme.java:212)
at cascading.tuple.TupleEntrySchemeCollector.collect(TupleEntrySchemeCollector.java:153)
... 21 more
I then tried setting es.nodes.client.only to true and instead got this error:
Caused by: org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Client-only routing specified but no client nodes with HTTP-enabled were found in the cluster...
at org.elasticsearch.hadoop.rest.InitializationUtils.filterNonClientNodesIfNeeded(InitializationUtils.java:82)
at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:373)
at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.init(EsOutputFormat.java:173)
at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.write(EsOutputFormat.java:149)
at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:844)
at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:596)
at cascading.tap.hadoop.util.MeasuredOutputCollector.collect(MeasuredOutputCollector.java:69)
at org.elasticsearch.hadoop.cascading.EsHadoopScheme.sink(EsHadoopScheme.java:212)
at cascading.tuple.TupleEntrySchemeCollector.collect(TupleEntrySchemeCollector.java:153)
... 21 more
I haven’t made any changes to Amazon’s default Elasticsearch configuration. Does anybody have any idea how to make this work?
Issue Analytics
- State:
- Created 8 years ago
- Comments:18 (5 by maintainers)
Top GitHub Comments
@fblundun @malpani I’ve added a new feature in master (and published the new builds, namely 2.2.0.BUILD-SNAPSHOT) which allows the connector to work with restricted cloud-like ES environments. By setting es.nodes.wan.only to true, the connector will disable discovery and its typical peer-to-peer connections and use only the nodes indicated in es.nodes. For Amazon or Found, for example, this would be the publicly accessible gateway. All connections to and from ES will be made through this node; clearly it won’t be as efficient as connecting to each shard/node directly, but it’s also the only possible way.
Can you please try it out and report back how it works for you?
Cheers,
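For anyone trying the suggestion above, here is a minimal sketch of the settings involved, written as a plain Java map so it runs without Hadoop on the classpath. The endpoint host, the index/type resource, and the helper name wanOnlySettings are hypothetical placeholders. The es.port and es.net.ssl values are assumptions about how the managed endpoint is exposed and should be checked against your own domain. The sketch also assumes the domain's access policy already permits requests from the cluster, since nothing here signs requests.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class WanOnlySettings {

    // Hypothetical helper: collects the elasticsearch-hadoop settings for a
    // gateway-only ("WAN only") connection. The keys are real es-hadoop
    // setting names; the values passed in are placeholders.
    static Map<String, String> wanOnlySettings(String endpointHost, int port,
                                               boolean useSsl, String resource) {
        Map<String, String> settings = new LinkedHashMap<>();
        // The single publicly accessible gateway; no other nodes are contacted.
        settings.put("es.nodes", endpointHost);
        settings.put("es.port", Integer.toString(port));
        // Disable node discovery and route everything through es.nodes.
        settings.put("es.nodes.wan.only", "true");
        // Assumption: enable SSL only if the endpoint is reached over HTTPS.
        settings.put("es.net.ssl", Boolean.toString(useSsl));
        // Target index/type to write to.
        settings.put("es.resource", resource);
        return settings;
    }

    public static void main(String[] args) {
        Map<String, String> settings = wanOnlySettings(
                "search-example.us-east-1.es.amazonaws.com", // placeholder host
                443, true, "logs/entry");                    // placeholder values
        // In a real job each entry would be copied into the Hadoop JobConf,
        // e.g. conf.set(key, value), before the sink is created.
        settings.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Note that es.nodes.client.only should be left at its default here: the second stack trace in the question comes from the connector looking for client nodes, and that lookup depends on the discovery that WAN-only mode switches off.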
Another option that may come about (hopefully soon) for AWS Elasticsearch deployments is to use the OpenDistro SQL JDBC drivers.
From my early investigation there are some blockers on this, particularly around the SQL dialect that is supported, but at the very least today you can use the OpenDistro SQL JDBC driver with request signing and print the schema of an index using Spark (the original example used AWS Glue, which is Spark under the hood).
At the moment you would need to create a custom dialect for anything useful to work.