Cannot Save DataFrame to Elasticsearch during Spark Streaming
See original GitHub issue

What kind of an issue is this?

- Bug report. If you've found a bug, please provide a code snippet or test to reproduce it below. The easier it is to track down the bug, the faster it is solved.
- Feature Request. Start by telling us what problem you're trying to solve. Often a solution already exists! Don't send pull requests to implement new features without first getting our support. Sometimes we leave features out on purpose to keep the project small.
### Test/code snippet
```scala
package org.elasticsearch.spark
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions.udf
import org.elasticsearch.spark.sql._
import org.elasticsearch.spark._
object ElasticSpark {
case class users(uId: String, birth_dt: String, gender_cd: String)
case class products(url: String, category: String)
def main(args: Array[String]) {
// Logging
// import org.apache.log4j.{Level, Logger}
// Logger.getLogger("org").setLevel(Level.OFF)
// Logger.getLogger("akka").setLevel(Level.OFF)
// initialise spark context
val conf = new SparkConf()
.setAppName("ElasticSearch")
.setMaster("local")
.set("es.index.auto.create", "true")
val spark = SparkSession.builder().config(conf).getOrCreate()
import spark.sqlContext.implicits._
// Elastic connection parameters
val elasticConf: Map[String, String] = Map("es.nodes" -> "localhost",
"es.clustername" -> "elasticsearch","es.mapping.id" -> "id")
// val indexName = "custAct_index"
// val mappingName = "custAct_index_type"
// DataFrame
val usersFile = spark.sparkContext.textFile("E:\\POC\\Datasets\\regusers.tsv")
val userHeader = usersFile.first()
val userRecords = usersFile.filter(x => x != userHeader)
val usersDF = userRecords.map(x => x.split("\t", -1)).map(u => users(u(0), u(1), u(2))).toDF("uId", "birth_dt", "gender_cd")
val usersDF1 = usersDF.filter("uId != 'null'").filter("uId != 'NULL'")
usersDF1.createOrReplaceTempView("userData")
val usersDF2 = spark.sql("SELECT uId,birth_dt,gender_cd,CAST(datediff(from_unixtime( unix_timestamp() )," +
"from_unixtime( unix_timestamp(birth_dt, 'dd-MMM-yy'))) / 365 AS INT) age from userData")
usersDF2.createOrReplaceTempView("usersData")
// Write elasticsearch
val productFile = spark.sparkContext.textFile("E:\\POC\\Datasets\\urlmap.tsv")
val productHeader = productFile.first()
val productRecords = productFile.filter(x => x != productHeader)
val productDF = productRecords.map(x => x.split("\t")).map(p => products(p(0), p(1))).toDF("url", "category")
productDF.createOrReplaceTempView("productCategory")
//omniture dataframe
val omnitureFile = spark.sparkContext.textFile("E:\\POC\\Datasets\\Clicklogs\\1.tsv")
val omnitureBatchDF1 = omnitureFile.map(x => x.split("\t", -1)).map(o => {
org.dfz.elasticsearch.spark.OmnitureSchema(o(0), o(1), o(2), o(3), o(4), o(5), o(6), o(7), o(8), o(9), o(10), o(11), o(12),
o(13), o(14), o(15), o(16), o(17), o(18), o(19), o(20), o(21), o(22), o(23), o(24), o(25), o(26), o(27), o(28), o(29),
o(30), o(31), o(32), o(33), o(34), o(35), o(36), o(37), o(38), o(39), o(40), o(41), o(42), o(43), o(44), o(45), o(46),
o(47), o(48), o(49), o(50), o(51), o(52), o(53), o(54), o(55), o(56), o(57), o(58), o(59), o(60), o(61), o(62), o(63),
o(64), o(65), o(66), o(67), o(68), o(69), o(70), o(71), o(72), o(73), o(74), o(75), o(76), o(77), o(78), o(79), o(80),
o(81), o(82), o(83), o(84), o(85), o(86), o(87), o(88), o(89), o(90), o(91), o(92), o(93), o(94), o(95), o(96), o(97),
o(98), o(99), o(100), o(101), o(102), o(103), o(104), o(105), o(106), o(107), o(108), o(109), o(110), o(111), o(112),
o(113), o(114), o(115), o(116), o(117), o(118), o(119), o(120), o(121), o(122), o(123), o(124), o(125), o(126), o(127),
o(128), o(129), o(130), o(131), o(132), o(133), o(134), o(135), o(136), o(137), o(138), o(139), o(140), o(141), o(142),
o(143), o(144), o(145), o(146), o(147), o(148), o(149), o(150), o(151), o(152), o(153), o(154), o(155), o(156), o(157),
o(158), o(159), o(160), o(161), o(162), o(163), o(164), o(165), o(166), o(167), o(168), o(169), o(170), o(171), o(172),
o(173), o(174), o(175), o(176), o(177))
})
.toDF("sessionId", "clickTime", "col_3", "col_4", "col_5", "col_6", "col_7", "ipAddress", "col_9", "col_10",
"col_11", "col_12", "productUrl", "swId", "col_15", "col_16", "col_17", "col_18", "col_19", "col_20",
"col_21", "col_22", "col_23", "col_24", "col_25", "col_26", "col_27", "language", "col_29", "col_30",
"col_31", "col_32", "col_33", "col_34", "col_35", "col_36", "col_37", "col_38", "domain", "regTime",
"col_41", "col_42", "col_43", "sysSpec", "col_45", "col_46", "col_47", "col_48", "col_49", "city",
"country", "areaCode", "state", "col_54", "col_55", "col_56", "col_57", "col_58", "col_59", "col_60",
"col_61", "col_62", "col_63", "col_64", "col_65", "col_66", "col_67", "col_68", "col_69", "col_70",
"col_71", "col_72", "col_73", "col_74", "col_75", "col_76", "col_77", "col_78", "col_79", "col_80",
"col_81", "col_82", "col_83", "col_84", "col_85", "col_86", "col_87", "col_88", "col_89", "col_90",
"col_91", "col_92", "col_93", "col_94", "col_95", "col_96", "col_97", "col_98", "col_99", "col_100",
"col_101", "col_102", "col_103", "col_104", "col_105", "col_106", "col_107", "col_108", "col_109",
"col_110", "col_111", "col_112", "col_113", "col_114", "col_115", "col_116", "col_117", "col_118",
"col_119", "col_120", "col_121", "col_122", "col_123", "col_124", "col_125", "col_126", "col_127",
"col_128", "col_129", "col_130", "col_131", "col_132", "col_133", "col_134", "col_135", "col_136",
"col_137", "col_138", "col_139", "col_140", "col_141", "col_142", "col_143", "col_144", "col_145",
"col_146", "col_147", "col_148", "col_149", "col_150", "col_151", "col_152", "col_153", "col_154",
"col_155", "col_156", "col_157", "col_158", "col_159", "col_160", "col_161", "col_162", "col_163",
"col_164", "col_165", "col_166", "col_167", "col_168", "col_169", "col_170", "col_171", "col_172",
"col_173", "col_174", "col_175", "col_176", "col_177", "col_178").na.fill("e", Seq("blank"))
//Replacing the special character
def remove_string: String => String = _.replaceAll("[{}]", "")
def remove_string_udf = udf(remove_string)
val omnitureBatchDF = omnitureBatchDF1.withColumn("swId", remove_string_udf($"swId"))
omnitureBatchDF.createOrReplaceTempView("omnitureBatchLog")
//Streaming
val ssc = new StreamingContext(spark.sparkContext, Seconds(5))
val kafkat = Set("cs_poc")
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:6667",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "test-consumer-group",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
//Kafka Spark Streaming Consumer
val kafkaStream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](kafkat, kafkaParams)
)
kafkaStream.foreachRDD {
msg =>
// if (!msg.isEmpty) {
val omnitureStreamDF = msg.map { v => v.value().split("\t", -1) }
.map(o => {
org.elasticsearch.spark.OmnitureSchema(o(0).trim, o(1).trim, o(2).trim, o(3).trim, o(4).trim, o(5).trim, o(6).trim, o(7).trim, o(8).trim, o(9).trim, o(10).trim, o(11).trim, o(12).trim, o(13).trim,
o(14).trim, o(15).trim, o(16).trim, o(17).trim, o(18).trim, o(19).trim, o(20).trim, o(21).trim, o(22).trim, o(23).trim, o(24).trim, o(25).trim, o(26).trim, o(27).trim, o(28).trim, o(29).trim, o(30).trim, o(31).trim, o(32).trim,
o(33).trim, o(34).trim, o(35).trim, o(36).trim, o(37).trim, o(38).trim, o(39).trim, o(40).trim, o(41).trim, o(42).trim, o(43).trim, o(44).trim, o(45).trim, o(46).trim, o(47).trim, o(48).trim, o(49).trim, o(50).trim, o(51).trim,
o(52).trim, o(53).trim, o(54).trim, o(55).trim, o(56).trim, o(57).trim, o(58).trim, o(59).trim, o(60).trim, o(61).trim, o(62).trim, o(63).trim, o(64).trim, o(65).trim, o(66).trim, o(67).trim, o(68).trim,
o(69).trim, o(70).trim, o(71).trim, o(72).trim, o(73).trim, o(74).trim, o(75).trim, o(76).trim, o(77).trim, o(78).trim, o(79).trim, o(80).trim, o(81).trim, o(82).trim, o(83).trim, o(84).trim, o(85).trim,
o(86).trim, o(87).trim, o(88).trim, o(89).trim, o(90).trim, o(91).trim, o(92).trim, o(93).trim, o(94).trim, o(95).trim, o(96).trim, o(97).trim, o(98).trim, o(99).trim, o(100).trim, o(101).trim, o(102).trim,
o(103).trim, o(104).trim, o(105).trim, o(106).trim, o(107).trim, o(108).trim, o(109).trim, o(110).trim, o(111).trim, o(112).trim, o(113).trim, o(114).trim, o(115).trim, o(116).trim, o(117).trim,
o(118).trim, o(119).trim, o(120).trim, o(121).trim, o(122).trim, o(123).trim, o(124).trim, o(125).trim, o(126).trim, o(127).trim, o(128).trim, o(129).trim, o(130).trim, o(131).trim, o(132).trim,
o(133).trim, o(134).trim, o(135).trim, o(136).trim, o(137).trim, o(138).trim, o(139).trim, o(140).trim, o(141).trim, o(142).trim, o(143).trim, o(144).trim, o(145).trim, o(146).trim, o(147).trim,
o(148).trim, o(149).trim, o(150).trim, o(151).trim, o(152).trim, o(153).trim, o(154).trim, o(155).trim, o(156).trim, o(157).trim, o(158).trim, o(159).trim, o(160).trim, o(161).trim, o(162).trim,
o(163).trim, o(164).trim, o(165).trim, o(166).trim, o(167).trim, o(168).trim, o(169).trim, o(170).trim, o(171).trim, o(172).trim, o(173).trim, o(174).trim, o(175).trim, o(176).trim, o(177))
})
.toDF("sessionId", "clickTime", "col_3", "col_4", "col_5", "col_6", "col_7", "ipAddress", "col_9", "col_10",
"col_11", "col_12", "productUrl", "swId", "col_15", "col_16", "col_17", "col_18", "col_19", "col_20",
"col_21", "col_22", "col_23", "col_24", "col_25", "col_26", "col_27", "language", "col_29", "col_30",
"col_31", "col_32", "col_33", "col_34", "col_35", "col_36", "col_37", "col_38", "domain", "regTime",
"col_41", "col_42", "col_43", "sysSpec", "col_45", "col_46", "col_47", "col_48", "col_49", "city",
"country", "areaCode", "state", "col_54", "col_55", "col_56", "col_57", "col_58", "col_59", "col_60",
"col_61", "col_62", "col_63", "col_64", "col_65", "col_66", "col_67", "col_68", "col_69", "col_70",
"col_71", "col_72", "col_73", "col_74", "col_75", "col_76", "col_77", "col_78", "col_79", "col_80",
"col_81", "col_82", "col_83", "col_84", "col_85", "col_86", "col_87", "col_88", "col_89", "col_90",
"col_91", "col_92", "col_93", "col_94", "col_95", "col_96", "col_97", "col_98", "col_99", "col_100",
"col_101", "col_102", "col_103", "col_104", "col_105", "col_106", "col_107", "col_108", "col_109",
"col_110", "col_111", "col_112", "col_113", "col_114", "col_115", "col_116", "col_117", "col_118",
"col_119", "col_120", "col_121", "col_122", "col_123", "col_124", "col_125", "col_126", "col_127",
"col_128", "col_129", "col_130", "col_131", "col_132", "col_133", "col_134", "col_135", "col_136",
"col_137", "col_138", "col_139", "col_140", "col_141", "col_142", "col_143", "col_144", "col_145",
"col_146", "col_147", "col_148", "col_149", "col_150", "col_151", "col_152", "col_153", "col_154",
"col_155", "col_156", "col_157", "col_158", "col_159", "col_160", "col_161", "col_162", "col_163",
"col_164", "col_165", "col_166", "col_167", "col_168", "col_169", "col_170", "col_171", "col_172",
"col_173", "col_174", "col_175", "col_176", "col_177", "col_178").na.fill("e", Seq("blank"))
omnitureStreamDF.createOrReplaceTempView("omnitureStreamLog")
val omnitureDF = spark.sql("select * from omnitureBatchLog union select * from omnitureStreamLog").toDF()
omnitureDF.createOrReplaceTempView("omnitureLog")
val omniDF = spark.sql("SELECT * FROM omnitureLog o join productCategory p on o.productUrl = p.url " +
" join usersData u WHERE o.swId=u.uId")
val indexName = "custAct_index"
val mappingName = "custAct_index_type"
omniDF.saveToEs(s"${indexName}/${mappingName}", elasticConf)
}
}
ssc.start()
ssc.awaitTermination()
}
}
```
Stack trace:
Stack trace goes here
### Version Info
OS: Windows
JVM: 1.8
Hadoop/Spark: Spark 2.0
ES-Hadoop:
ES: 5.5.2
### Top GitHub Comments
Can you please follow the template provided for reporting issues? Please provide the full stack trace, the versions of your OS, JVM, Hadoop/Spark, ES-Hadoop, and ES deployments, and your settings and code that led to this issue.
`org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot determine write shards for [custAct_index/custAct_index_type]; likely its format is incorrect (maybe it contains illegal characters?)`
The index name violates Elasticsearch's naming convention, which is what produces this error. Changing the capitalized characters to lower case solves the issue:
`custAct_index/custAct_index_type` → `cust_act_index/cust_act_index_type`
This fix worked for me…
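For reference, a minimal sketch of the corrected write call, assuming the `omniDF` DataFrame and `elasticConf` map from the snippet above (both defined in the reporter's code); only the index and type names change:

```scala
// Minimal sketch of the corrected write, assuming omniDF and elasticConf
// from the reporter's snippet are in scope.
import org.elasticsearch.spark.sql._ // brings saveToEs into scope for DataFrames

// Elasticsearch index names must be all lower-case, so avoid camelCase here.
val indexName = "cust_act_index"
val mappingName = "cust_act_index_type"

// Write the joined DataFrame to the lower-cased index/type resource.
omniDF.saveToEs(s"${indexName}/${mappingName}", elasticConf)
```

Index names in Elasticsearch must be lower-case, which is why the camel-cased `custAct_index` is rejected while resolving write shards.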