
Cannot Save Data Frame to Elasticsearch during Spark streaming

See original GitHub issue

What kind of issue is this?

  • Bug report. If you’ve found a bug, please provide a code snippet or test to reproduce it below.
    The easier it is to track down the bug, the faster it is solved.
  • Feature Request. Start by telling us what problem you’re trying to solve.
    Often a solution already exists! Don’t send pull requests to implement new features without first getting our support. Sometimes we leave features out on purpose to keep the project small.
Test/code snippet

```scala
package org.elasticsearch.spark

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.{Seconds, StreamingContext}

import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions.udf
import org.elasticsearch.spark.sql._
import org.elasticsearch.spark._


object ElasticSpark {
  case class users(uId: String, birth_dt: String, gender_cd: String)

  case class products(url: String, category: String)

  def main(args: Array[String]) {

    // Logging
//    import org.apache.log4j.{Level, Logger}
//    Logger.getLogger("org").setLevel(Level.OFF)
//    Logger.getLogger("akka").setLevel(Level.OFF)

    // initialise spark context
    val conf = new SparkConf()
      .setAppName("ElasticSearch")
      .setMaster("local")
      .set("es.index.auto.create", "true")


    val spark = SparkSession.builder().config(conf).getOrCreate()
    import spark.sqlContext.implicits._

    // Elastic connection parameters
    val elasticConf: Map[String, String] = Map("es.nodes" -> "localhost",
      "es.clustername" -> "elasticsearch","es.mapping.id" -> "id")


    //    val indexName = "custAct_index"
    //    val mappingName = "custAct_index_type"

    // DataFrame

    val usersFile = spark.sparkContext.textFile("E:\\POC\\Datasets\\regusers.tsv")
    val userHeader = usersFile.first()
    val userRecords = usersFile.filter(x => x != userHeader)
    val usersDF = userRecords.map(x => x.split("\t", -1)).map(u => users(u(0), u(1), u(2))).toDF("uId", "birth_dt", "gender_cd")

    val usersDF1 = usersDF.filter("uId != 'null'").filter("uId != 'NULL'")
    usersDF1.createOrReplaceTempView("userData")

    val usersDF2 = spark.sql("SELECT uId,birth_dt,gender_cd,CAST(datediff(from_unixtime( unix_timestamp() )," +
      "from_unixtime( unix_timestamp(birth_dt, 'dd-MMM-yy'))) / 365  AS INT) age from userData")

    usersDF2.createOrReplaceTempView("usersData")
    // Write elasticsearch


    val productFile = spark.sparkContext.textFile("E:\\POC\\Datasets\\urlmap.tsv")
    val productHeader = productFile.first()
    val productRecords = productFile.filter(x => x != productHeader)
    val productDF = productRecords.map(x => x.split("\t")).map(p => products(p(0), p(1))).toDF("url", "category")

    productDF.createOrReplaceTempView("productCategory")

    //omniture dataframe
    val omnitureFile = spark.sparkContext.textFile("E:\\POC\\Datasets\\Clicklogs\\1.tsv")
    val omnitureBatchDF1 = omnitureFile.map(x => x.split("\t", -1)).map(o => {
      org.dfz.elasticsearch.spark.OmnitureSchema(o(0), o(1), o(2), o(3), o(4), o(5), o(6), o(7), o(8), o(9), o(10), o(11), o(12),
        o(13), o(14), o(15), o(16), o(17), o(18), o(19), o(20), o(21), o(22), o(23), o(24), o(25), o(26), o(27), o(28), o(29),
        o(30), o(31), o(32), o(33), o(34), o(35), o(36), o(37), o(38), o(39), o(40), o(41), o(42), o(43), o(44), o(45), o(46),
        o(47), o(48), o(49), o(50), o(51), o(52), o(53), o(54), o(55), o(56), o(57), o(58), o(59), o(60), o(61), o(62), o(63),
        o(64), o(65), o(66), o(67), o(68), o(69), o(70), o(71), o(72), o(73), o(74), o(75), o(76), o(77), o(78), o(79), o(80),
        o(81), o(82), o(83), o(84), o(85), o(86), o(87), o(88), o(89), o(90), o(91), o(92), o(93), o(94), o(95), o(96), o(97),
        o(98), o(99), o(100), o(101), o(102), o(103), o(104), o(105), o(106), o(107), o(108), o(109), o(110), o(111), o(112),
        o(113), o(114), o(115), o(116), o(117), o(118), o(119), o(120), o(121), o(122), o(123), o(124), o(125), o(126), o(127),
        o(128), o(129), o(130), o(131), o(132), o(133), o(134), o(135), o(136), o(137), o(138), o(139), o(140), o(141), o(142),
        o(143), o(144), o(145), o(146), o(147), o(148), o(149), o(150), o(151), o(152), o(153), o(154), o(155), o(156), o(157),
        o(158), o(159), o(160), o(161), o(162), o(163), o(164), o(165), o(166), o(167), o(168), o(169), o(170), o(171), o(172),
        o(173), o(174), o(175), o(176), o(177))
    })
      .toDF("sessionId", "clickTime", "col_3", "col_4", "col_5", "col_6", "col_7", "ipAddress", "col_9", "col_10",
        "col_11", "col_12", "productUrl", "swId", "col_15", "col_16", "col_17", "col_18", "col_19", "col_20",
        "col_21", "col_22", "col_23", "col_24", "col_25", "col_26", "col_27", "language", "col_29", "col_30",
        "col_31", "col_32", "col_33", "col_34", "col_35", "col_36", "col_37", "col_38", "domain", "regTime",
        "col_41", "col_42", "col_43", "sysSpec", "col_45", "col_46", "col_47", "col_48", "col_49", "city",
        "country", "areaCode", "state", "col_54", "col_55", "col_56", "col_57", "col_58", "col_59", "col_60",
        "col_61", "col_62", "col_63", "col_64", "col_65", "col_66", "col_67", "col_68", "col_69", "col_70",
        "col_71", "col_72", "col_73", "col_74", "col_75", "col_76", "col_77", "col_78", "col_79", "col_80",
        "col_81", "col_82", "col_83", "col_84", "col_85", "col_86", "col_87", "col_88", "col_89", "col_90",
        "col_91", "col_92", "col_93", "col_94", "col_95", "col_96", "col_97", "col_98", "col_99", "col_100",
        "col_101", "col_102", "col_103", "col_104", "col_105", "col_106", "col_107", "col_108", "col_109",
        "col_110", "col_111", "col_112", "col_113", "col_114", "col_115", "col_116", "col_117", "col_118",
        "col_119", "col_120", "col_121", "col_122", "col_123", "col_124", "col_125", "col_126", "col_127",
        "col_128", "col_129", "col_130", "col_131", "col_132", "col_133", "col_134", "col_135", "col_136",
        "col_137", "col_138", "col_139", "col_140", "col_141", "col_142", "col_143", "col_144", "col_145",
        "col_146", "col_147", "col_148", "col_149", "col_150", "col_151", "col_152", "col_153", "col_154",
        "col_155", "col_156", "col_157", "col_158", "col_159", "col_160", "col_161", "col_162", "col_163",
        "col_164", "col_165", "col_166", "col_167", "col_168", "col_169", "col_170", "col_171", "col_172",
        "col_173", "col_174", "col_175", "col_176", "col_177", "col_178").na.fill("e", Seq("blank"))

    //Replacing the special character
    def remove_string: String => String = _.replaceAll("[{}]", "")

    def remove_string_udf = udf(remove_string)

    val omnitureBatchDF = omnitureBatchDF1.withColumn("swId", remove_string_udf($"swId"))
    omnitureBatchDF.createOrReplaceTempView("omnitureBatchLog")

    //Streaming
    val ssc = new StreamingContext(spark.sparkContext, Seconds(5))

    val kafkat = Set("cs_poc")
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:6667",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "test-consumer-group",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    //Kafka Spark Streaming Consumer
    val kafkaStream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](kafkat, kafkaParams)
    )
    kafkaStream.foreachRDD {
      msg =>
        if (!msg.isEmpty) {
          val omnitureStreamDF = msg.map { v => v.value().split("\t", -1) }
            .map(o => {
              org.elasticsearch.spark.OmnitureSchema(o(0).trim, o(1).trim, o(2).trim, o(3).trim, o(4).trim, o(5).trim, o(6).trim, o(7).trim, o(8).trim, o(9).trim, o(10).trim, o(11).trim, o(12).trim, o(13).trim,
                o(14).trim, o(15).trim, o(16).trim, o(17).trim, o(18).trim, o(19).trim, o(20).trim, o(21).trim, o(22).trim, o(23).trim, o(24).trim, o(25).trim, o(26).trim, o(27).trim, o(28).trim, o(29).trim, o(30).trim, o(31).trim, o(32).trim,
                o(33).trim, o(34).trim, o(35).trim, o(36).trim, o(37).trim, o(38).trim, o(39).trim, o(40).trim, o(41).trim, o(42).trim, o(43).trim, o(44).trim, o(45).trim, o(46).trim, o(47).trim, o(48).trim, o(49).trim, o(50).trim, o(51).trim,
                o(52).trim, o(53).trim, o(54).trim, o(55).trim, o(56).trim, o(57).trim, o(58).trim, o(59).trim, o(60).trim, o(61).trim, o(62).trim, o(63).trim, o(64).trim, o(65).trim, o(66).trim, o(67).trim, o(68).trim,
                o(69).trim, o(70).trim, o(71).trim, o(72).trim, o(73).trim, o(74).trim, o(75).trim, o(76).trim, o(77).trim, o(78).trim, o(79).trim, o(80).trim, o(81).trim, o(82).trim, o(83).trim, o(84).trim, o(85).trim,
                o(86).trim, o(87).trim, o(88).trim, o(89).trim, o(90).trim, o(91).trim, o(92).trim, o(93).trim, o(94).trim, o(95).trim, o(96).trim, o(97).trim, o(98).trim, o(99).trim, o(100).trim, o(101).trim, o(102).trim,
                o(103).trim, o(104).trim, o(105).trim, o(106).trim, o(107).trim, o(108).trim, o(109).trim, o(110).trim, o(111).trim, o(112).trim, o(113).trim, o(114).trim, o(115).trim, o(116).trim, o(117).trim,
                o(118).trim, o(119).trim, o(120).trim, o(121).trim, o(122).trim, o(123).trim, o(124).trim, o(125).trim, o(126).trim, o(127).trim, o(128).trim, o(129).trim, o(130).trim, o(131).trim, o(132).trim,
                o(133).trim, o(134).trim, o(135).trim, o(136).trim, o(137).trim, o(138).trim, o(139).trim, o(140).trim, o(141).trim, o(142).trim, o(143).trim, o(144).trim, o(145).trim, o(146).trim, o(147).trim,
                o(148).trim, o(149).trim, o(150).trim, o(151).trim, o(152).trim, o(153).trim, o(154).trim, o(155).trim, o(156).trim, o(157).trim, o(158).trim, o(159).trim, o(160).trim, o(161).trim, o(162).trim,
                o(163).trim, o(164).trim, o(165).trim, o(166).trim, o(167).trim, o(168).trim, o(169).trim, o(170).trim, o(171).trim, o(172).trim, o(173).trim, o(174).trim, o(175).trim, o(176).trim, o(177))
            })
            .toDF("sessionId", "clickTime", "col_3", "col_4", "col_5", "col_6", "col_7", "ipAddress", "col_9", "col_10",
              "col_11", "col_12", "productUrl", "swId", "col_15", "col_16", "col_17", "col_18", "col_19", "col_20",
              "col_21", "col_22", "col_23", "col_24", "col_25", "col_26", "col_27", "language", "col_29", "col_30",
              "col_31", "col_32", "col_33", "col_34", "col_35", "col_36", "col_37", "col_38", "domain", "regTime",
              "col_41", "col_42", "col_43", "sysSpec", "col_45", "col_46", "col_47", "col_48", "col_49", "city",
              "country", "areaCode", "state", "col_54", "col_55", "col_56", "col_57", "col_58", "col_59", "col_60",
              "col_61", "col_62", "col_63", "col_64", "col_65", "col_66", "col_67", "col_68", "col_69", "col_70",
              "col_71", "col_72", "col_73", "col_74", "col_75", "col_76", "col_77", "col_78", "col_79", "col_80",
              "col_81", "col_82", "col_83", "col_84", "col_85", "col_86", "col_87", "col_88", "col_89", "col_90",
              "col_91", "col_92", "col_93", "col_94", "col_95", "col_96", "col_97", "col_98", "col_99", "col_100",
              "col_101", "col_102", "col_103", "col_104", "col_105", "col_106", "col_107", "col_108", "col_109",
              "col_110", "col_111", "col_112", "col_113", "col_114", "col_115", "col_116", "col_117", "col_118",
              "col_119", "col_120", "col_121", "col_122", "col_123", "col_124", "col_125", "col_126", "col_127",
              "col_128", "col_129", "col_130", "col_131", "col_132", "col_133", "col_134", "col_135", "col_136",
              "col_137", "col_138", "col_139", "col_140", "col_141", "col_142", "col_143", "col_144", "col_145",
              "col_146", "col_147", "col_148", "col_149", "col_150", "col_151", "col_152", "col_153", "col_154",
              "col_155", "col_156", "col_157", "col_158", "col_159", "col_160", "col_161", "col_162", "col_163",
              "col_164", "col_165", "col_166", "col_167", "col_168", "col_169", "col_170", "col_171", "col_172",
              "col_173", "col_174", "col_175", "col_176", "col_177", "col_178").na.fill("e", Seq("blank"))


          omnitureStreamDF.createOrReplaceTempView("omnitureStreamLog")

          val omnitureDF = spark.sql("select * from omnitureBatchLog union select * from omnitureStreamLog").toDF()

          omnitureDF.createOrReplaceTempView("omnitureLog")

          val omniDF = spark.sql("SELECT  * FROM omnitureLog o join productCategory p on o.productUrl = p.url " +
            "  join usersData u WHERE o.swId=u.uId")

          val indexName = "custAct_index"
          val mappingName = "custAct_index_type"

          omniDF.saveToEs(s"${indexName}/${mappingName}", elasticConf)

        }

    }
    ssc.start()
    ssc.awaitTermination()

  }
}
```
Stack trace:

Stack trace goes here


### Version Info

OS          :  Windows
JVM         :  1.8
Hadoop/Spark:  Spark 2.0
ES-Hadoop   :  
ES          :  5.5.2


Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
jbaiera commented, Oct 10, 2017

Can you please follow the template provided for reporting issues? Please provide the full stack trace, the versions of your OS, JVM, Hadoop/Spark, ES-Hadoop, and ES deployments, and your settings and code that led to this issue.

0 reactions
vimalathi commented, Sep 17, 2018

org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot determine write shards for [custAct_index/custAct_index_type]; likely its format is incorrect (maybe it contains illegal characters?)

The index naming is what produces this error: Elasticsearch index names must be lowercase, so changing the capitalized characters to lower case solves the issue.

custAct_index/custAct_index_type -> cust_act_index/cust_act_index_type

This fix worked for me.
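For reference, a minimal sketch of what that rename looks like in the reporter's snippet, assuming `elasticConf` and `omniDF` stay as defined above (`saveToEs` on a DataFrame comes from the `org.elasticsearch.spark.sql._` import already present in the code):

```scala
// Lowercase index and type names; Elasticsearch rejects uppercase characters in index names.
val indexName = "cust_act_index"
val mappingName = "cust_act_index_type"

// Same write call as before, now targeting a valid index/type path.
omniDF.saveToEs(s"${indexName}/${mappingName}", elasticConf)
```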

