Is it possible to write to Amazon Elasticsearch Service using elasticsearch-hadoop?

I have been trying to write to AWS’s new Amazon Elasticsearch Service from a Scalding job using elasticsearch-hadoop (via scalding-taps).

This job has previously worked using Elasticsearch manually installed on an EC2 instance, implying that the problem is specific to Amazon Elasticsearch Service.

The first time I tried I got this error:

Caused by: org.elasticsearch.hadoop.EsHadoopIllegalStateException: Cluster state volatile; cannot find node backing shards - please check whether your cluster is stable
        at org.elasticsearch.hadoop.rest.RestRepository.getWriteTargetPrimaryShards(RestRepository.java:370)
        at org.elasticsearch.hadoop.rest.RestService.initSingleIndex(RestService.java:425)
        at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:393)
        at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.init(EsOutputFormat.java:173)
        at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.write(EsOutputFormat.java:149)
        at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:844)
        at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:596)
        at cascading.tap.hadoop.util.MeasuredOutputCollector.collect(MeasuredOutputCollector.java:69)
        at org.elasticsearch.hadoop.cascading.EsHadoopScheme.sink(EsHadoopScheme.java:212)
        at cascading.tuple.TupleEntrySchemeCollector.collect(TupleEntrySchemeCollector.java:153)
        ... 21 more

I then tried setting es.nodes.client.only to true and instead got this error:

Caused by: org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Client-only routing specified but no client nodes with HTTP-enabled were found in the cluster...
        at org.elasticsearch.hadoop.rest.InitializationUtils.filterNonClientNodesIfNeeded(InitializationUtils.java:82)
        at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:373)
        at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.init(EsOutputFormat.java:173)
        at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.write(EsOutputFormat.java:149)
        at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:844)
        at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:596)
        at cascading.tap.hadoop.util.MeasuredOutputCollector.collect(MeasuredOutputCollector.java:69)
        at org.elasticsearch.hadoop.cascading.EsHadoopScheme.sink(EsHadoopScheme.java:212)
        at cascading.tuple.TupleEntrySchemeCollector.collect(TupleEntrySchemeCollector.java:153)
        ... 21 more

I haven’t made any changes to Amazon’s default Elasticsearch configuration. Does anybody have any idea how to make this work?

Issue Analytics

  • State: closed
  • Created 8 years ago
  • Comments: 18 (5 by maintainers)

Top GitHub Comments

costin commented, Oct 21, 2015 (1 reaction)

@fblundun @malpani I’ve added a new feature in master (and published the new builds, namely 2.2.0.BUILD-SNAPSHOT) that allows the connector to work with restricted cloud-like ES environments. Setting es.nodes.wan.only to true makes the connector disable discovery and its typical peer-to-peer connections and use only the nodes indicated in es.nodes. For Amazon or Found, for example, this would be the publicly accessible gateway. All connections to and from ES will be made through this node. Clearly this won’t be as efficient as connecting to each shard/node directly, but it’s also the only possible way.

Can you please try it out and report back how it works for you?

Cheers,
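
For readers who want to try the setting described above, here is a minimal sketch of the connector properties involved. It is not taken from the thread: the helper names `wan_only_es_options` and `write_dataframe`, the example hostname, and the choice of port 443 with SSL are all assumptions for illustration, and it presumes the domain's access policy accepts requests from the job (the connector settings shown here do not sign requests). The same property keys should also be usable in a Hadoop job configuration, which is how a Scalding job would normally pass them.

```python
def wan_only_es_options(endpoint_host, port=443, use_ssl=True):
    """Build elasticsearch-hadoop settings for a cluster that is reachable
    only through a single gateway host (hypothetical helper)."""
    return {
        # The only address the connector is allowed to talk to.
        "es.nodes": endpoint_host,
        # Assumption: the gateway listens on HTTPS/443; adjust if yours differs.
        "es.port": str(port),
        # Disable node discovery and route everything through es.nodes.
        "es.nodes.wan.only": "true",
        "es.net.ssl": "true" if use_ssl else "false",
    }


def write_dataframe(df, resource, options):
    """Write a Spark DataFrame through the connector's Spark SQL data source.
    Requires the elasticsearch-hadoop/spark jar on the classpath
    (hypothetical helper, not called here)."""
    writer = df.write.format("org.elasticsearch.spark.sql")
    for key, value in options.items():
        writer = writer.option(key, value)
    writer.mode("append").save(resource)


# Placeholder hostname; substitute your own domain endpoint.
options = wan_only_es_options("search-example.us-east-1.es.amazonaws.com")
```

Because every request goes through the one gateway host, throughput will likely be lower than with direct shard connections, as the comment above notes.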

mixja commented, Mar 22, 2020 (0 reactions)

Another option that may come about (hopefully soon) for AWS Elasticsearch deployments is to use the OpenDistro SQL JDBC drivers.

From my early investigation there are some blockers on this, particularly around the SQL dialect that is supported, but at the very least today you can use the OpenDistro SQL JDBC driver with request signing and print the schema of an index using Spark (this example uses AWS Glue, which runs Spark under the hood):

>>> import sys
>>> from awsglue.transforms import *
>>> from awsglue.utils import getResolvedOptions
>>> from pyspark.context import SparkContext
>>> from awsglue.context import GlueContext
>>> from awsglue.dynamicframe import DynamicFrame
>>> 
>>> glueContext = GlueContext(SparkContext.getOrCreate())
>>> jdbc_driver_name = "com.amazon.opendistroforelasticsearch.jdbc.Driver"
>>> db_url = "jdbc:elasticsearch://https://xxxxxx.ap-southeast-2.es.amazonaws.com?auth=aws_sigv4"
>>> table_name = "location"
>>> df = glueContext.read.format("jdbc").option("driver", jdbc_driver_name).option("url", db_url).option("dbtable", table_name).load()
>>> df.printSchema()
root
 |-- parent: integer (nullable = true)
 |-- aliases: string (nullable = true)
 |-- suggest_text: string (nullable = true)
 |-- tie_breaker: string (nullable = true)
 |-- type: integer (nullable = true)
 |-- modified: timestamp (nullable = true)
 |-- id: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- ontology_autocomplete: string (nullable = true)
 |-- lat: double (nullable = true)
 |-- slug: string (nullable = true)
 |-- lng: double (nullable = true)
 |-- created: timestamp (nullable = true)
 |-- neighboring_locations: integer (nullable = true)
 |-- location_type: integer (nullable = true)
 |-- location_autocomplete: string (nullable = true)
 |-- name: string (nullable = true)
 |-- suggestions: string (nullable = true)

At the moment you would need to create a custom dialect for anything useful to work.
