Unable to index JSON from HDFS using SchemaRDD.saveToEs()
Elasticsearch-hadoop: elasticsearch-hadoop-2.1.0.BUILD-20150224.023403-309.zip
Spark: 1.2.0
I’m attempting to take a very basic JSON document on HDFS and index it in ES using SchemaRDD.saveToEs(). According to the documentation, under “writing existing JSON to Elasticsearch”, it should be as easy as creating a SchemaRDD via SQLContext.jsonFile() and then indexing with .saveToEs(), but I’m getting an error.
Replicating the problem:
- Create JSON file on hdfs with the content:
{"key":"value"}
- Execute code in spark-shell
import org.apache.spark.SparkContext._
import org.elasticsearch.spark._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
val input = sqlContext.jsonFile("hdfs://nameservice1/user/mshirley/test.json")
input.saveToEs("mshirley_spark_test/test")
Error:
org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: Found unrecoverable error [Bad Request(400) - Invalid JSON fragment received[["value"]][MapperParsingException[failed to parse]; nested: ElasticsearchParseException[Failed to derive xcontent from (offset=13, length=9): [123, 34, 105, 110, 100, 101, 120, 34, 58, 123, 125, 125, 10, 91, 34, 118, 97, 108, 117, 101, 34, 93, 10]]; ]];
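The byte array in the exception is the raw payload the connector sent to Elasticsearch, and decoding it as UTF-8 shows exactly what went over the wire. A minimal sketch (plain Scala, no Spark or ES dependencies needed):

```scala
// The byte values are copied verbatim from the ElasticsearchParseException above.
val payload = Array(123, 34, 105, 110, 100, 101, 120, 34, 58, 123, 125, 125, 10,
                    91, 34, 118, 97, 108, 117, 101, 34, 93, 10).map(_.toByte)

// Decode the bulk-request body the connector actually sent.
val decoded = new String(payload, "UTF-8")
println(decoded)
// {"index":{}}
// ["value"]
```

The bulk action line (`{"index":{}}`) is fine, but the document line is `["value"]`: the row was serialized as a bare array of values, with the field names from the schema dropped, which is why ES rejects it as an invalid JSON fragment.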
input object:
res1: org.apache.spark.sql.SchemaRDD =
SchemaRDD[6] at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==
PhysicalRDD [key#0], MappedRDD[5] at map at JsonRDD.scala:47
input.printSchema():
root
|-- key: string (nullable = true)
Expected result: I expect to be able to read a file from HDFS that contains a JSON document per line and submit that data to ES for indexing.
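For contrast, a sketch (plain Scala, illustrative names only) of what a correct serialization would look like: pairing each field name from the schema with the corresponding row value, rather than emitting the values alone:

```scala
// From input.printSchema(): the schema carries the field names...
val schemaFields = Seq("key")
// ...while the Row itself carries only the values.
val rowValues = Seq("value")

// Zip names with values to rebuild the document ES should receive.
val doc = schemaFields.zip(rowValues)
  .map { case (k, v) => "\"" + k + "\":\"" + v + "\"" }
  .mkString("{", ",", "}")
println(doc)
// {"key":"value"}
```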
Issue Analytics
- Created: 9 years ago
- Comments: 26 (12 by maintainers)
Top GitHub Comments
@costin @yanakad @mshirley
In my case the test was even simpler. To summarize:
Dependencies: elasticsearch-spark_2.10, version 2.1.0.Beta3; spark-streaming_2.10, version 1.2.1
Running on: YARN HDP cluster
Test #1: Save RDD[Map[String,Int]]. Result: pass — this works.
Test #2: Save RDD[String] where the String is a JSON string. This fails.
Test #3: Save RDD[Case Class]. According to the documentation this should work; it fails as well. Here’s the documentation I’m referring to: http://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html
Using the exact same example does not work. (The documentation also has a typo, naming the variable +RDD+.) Here’s the example. Again, this did not work and threw an exception.
And the exception was:
The example couldn’t really get any simpler than that. If I manually create a Map out of the values in the case class, it works. This example even takes the SQL context schema out of the equation.
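The Map workaround described above can be sketched in plain Scala: convert the case class to a Map of field name to value before saving, so the connector receives explicit keys. The class and field names here are illustrative, not taken from the issue:

```scala
// Hypothetical case class standing in for the failing Test #3 payload.
case class Trip(departure: String, arrival: String)

// The workaround: hand the connector a Map (which passes, per Test #1)
// instead of the case class (which fails).
def asMap(t: Trip): Map[String, Any] =
  Map("departure" -> t.departure, "arrival" -> t.arrival)
```

An RDD of such Maps would then take the code path that Test #1 showed to work.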
Thoughts?
@costin
Thanks for all your help!