Is it possible to write to Amazon Elasticsearch Service using elasticsearch-hadoop?

I have been trying to write to AWS’s new Amazon Elasticsearch Service from a Scalding job using elasticsearch-hadoop (via scalding-taps).

This job has previously worked using Elasticsearch manually installed on an EC2 instance, implying that the problem is specific to Amazon Elasticsearch Service.

The first time I tried I got this error:

Caused by: org.elasticsearch.hadoop.EsHadoopIllegalStateException: Cluster state volatile; cannot find node backing shards - please check whether your cluster is stable
        at org.elasticsearch.hadoop.rest.RestRepository.getWriteTargetPrimaryShards(RestRepository.java:370)
        at org.elasticsearch.hadoop.rest.RestService.initSingleIndex(RestService.java:425)
        at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:393)
        at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.init(EsOutputFormat.java:173)
        at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.write(EsOutputFormat.java:149)
        at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:844)
        at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:596)
        at cascading.tap.hadoop.util.MeasuredOutputCollector.collect(MeasuredOutputCollector.java:69)
        at org.elasticsearch.hadoop.cascading.EsHadoopScheme.sink(EsHadoopScheme.java:212)
        at cascading.tuple.TupleEntrySchemeCollector.collect(TupleEntrySchemeCollector.java:153)
        ... 21 more

I then tried setting es.nodes.client.only to true and instead got this error:

Caused by: org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Client-only routing specified but no client nodes with HTTP-enabled were found in the cluster...
        at org.elasticsearch.hadoop.rest.InitializationUtils.filterNonClientNodesIfNeeded(InitializationUtils.java:82)
        at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:373)
        at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.init(EsOutputFormat.java:173)
        at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.write(EsOutputFormat.java:149)
        at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:844)
        at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:596)
        at cascading.tap.hadoop.util.MeasuredOutputCollector.collect(MeasuredOutputCollector.java:69)
        at org.elasticsearch.hadoop.cascading.EsHadoopScheme.sink(EsHadoopScheme.java:212)
        at cascading.tuple.TupleEntrySchemeCollector.collect(TupleEntrySchemeCollector.java:153)
        ... 21 more

I haven’t made any changes to Amazon’s default Elasticsearch configuration. Does anybody have any idea how to make this work?

Issue Analytics

Created 8 years ago
Comments: 18 (5 by maintainers)

Top GitHub Comments

costin commented, Oct 21, 2015

@fblundun @malpani I’ve added a new feature in master (and published new builds, namely 2.2.0.BUILD-SNAPSHOT) that allows the connector to work with restricted, cloud-like ES environments. Setting es.nodes.wan.only to true makes the connector disable discovery and its typical peer-to-peer connections and use only the nodes indicated in es.nodes. For Amazon or Found, for example, this would be the publicly accessible gateway. All connections to and from ES will be made through this node. Clearly it won’t be as efficient as connecting to each shard/node directly, but it’s also the only possible way.

Can you please try it out and report back how it works for you?

Cheers,
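To make the suggestion above concrete, here is a minimal sketch of the settings involved, written in Python because that is the language of the other example in this thread. The setting names (es.nodes, es.port, es.nodes.wan.only, es.net.ssl) are elasticsearch-hadoop configuration keys; the helper names build_es_options and write_to_es, the port and SSL defaults, and the endpoint in the usage note are assumptions for illustration only. The original question used Scalding, where the same keys would go into the job configuration instead.

```python
def build_es_options(endpoint_host, port=443, use_ssl=True):
    """Return elasticsearch-hadoop settings for a gateway-only (WAN) connection.

    endpoint_host is the publicly accessible endpoint of the cluster.
    The port and SSL defaults are assumptions; adjust them to your domain.
    """
    return {
        # The only address the connector is allowed to talk to.
        "es.nodes": endpoint_host,
        "es.port": str(port),
        # Disable node discovery and route every request through es.nodes.
        "es.nodes.wan.only": "true",
        "es.net.ssl": "true" if use_ssl else "false",
    }


def write_to_es(df, resource, options):
    """Append a Spark DataFrame to an index through elasticsearch-hadoop.

    df is expected to be a pyspark.sql.DataFrame; resource is the target
    index (or index/type on older clusters).
    """
    (df.write
       .format("org.elasticsearch.spark.sql")
       .options(**options)
       .mode("append")
       .save(resource))
```

A call might look like write_to_es(df, "my-index", build_es_options("search-mydomain.us-east-1.es.amazonaws.com")), where both the index and the host are placeholders. Note that es.nodes.client.only is deliberately left unset here: as far as I know, the connector rejects it in combination with es.nodes.wan.only.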

mixja commented, Mar 22, 2020

Another option that may come about (hopefully soon) for AWS Elasticsearch deployments is to use the OpenDistro SQL JDBC drivers.

From my early investigation there are some blockers, particularly around the supported SQL dialect, but at the very least you can today use the OpenDistro SQL JDBC driver with request signing to print the schema of an index using Spark (this example uses AWS Glue, which runs Spark under the hood):

>>> import sys
>>> from awsglue.transforms import *
>>> from awsglue.utils import getResolvedOptions
>>> from pyspark.context import SparkContext
>>> from awsglue.context import GlueContext
>>> from awsglue.dynamicframe import DynamicFrame
>>> 
>>> glueContext = GlueContext(SparkContext.getOrCreate())
>>> jdbc_driver_name = "com.amazon.opendistroforelasticsearch.jdbc.Driver"
>>> db_url = "jdbc:elasticsearch://https://xxxxxx.ap-southeast-2.es.amazonaws.com?auth=aws_sigv4"
>>> table_name = "location"
>>> df = glueContext.read.format("jdbc").option("driver", jdbc_driver_name).option("url", db_url).option("dbtable", table_name).load()
>>> df.printSchema()
root
 |-- parent: integer (nullable = true)
 |-- aliases: string (nullable = true)
 |-- suggest_text: string (nullable = true)
 |-- tie_breaker: string (nullable = true)
 |-- type: integer (nullable = true)
 |-- modified: timestamp (nullable = true)
 |-- id: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- ontology_autocomplete: string (nullable = true)
 |-- lat: double (nullable = true)
 |-- slug: string (nullable = true)
 |-- lng: double (nullable = true)
 |-- created: timestamp (nullable = true)
 |-- neighboring_locations: integer (nullable = true)
 |-- location_type: integer (nullable = true)
 |-- location_autocomplete: string (nullable = true)
 |-- name: string (nullable = true)
 |-- suggestions: string (nullable = true)

At the moment you would need to create a custom dialect for anything useful to work.
