JSON deserialization fails "sometimes" without notice
Bug Report
When querying our ES indices through es-hadoop, we sometimes get empty result documents.
Issue description
We tracked the problem down to the deserialization of the documents. It fails intermittently for the very same document, leaving us with an empty document.
Steps to reproduce
A simple test program queries the same document a hundred times with deserialization and a hundred times without, and checks each result for a specific field that we know the document contains.
Code:
import org.elasticsearch.spark._
import org.apache.log4j.Level
import org.apache.log4j.LogManager
val options = com.freiheit.ev.ElasticSearchOptions.getOptions() // nodes, credentials
val optionsWithJson = scala.collection.mutable.Map[String, String](options.toSeq: _*)
LogManager.getRootLogger.setLevel(Level.ERROR)
// Pass 1: let es-hadoop deserialize each hit into a Map and look for the field.
var broken = 0
optionsWithJson("es.output.json") = "false"
for (i <- 1 to 100) {
  val result = sc.esRDD("euv_landingpages*/acquisitionLead", "?q=_id:6947a956c1f2f675a9e3acd0e714dc97c81beb5c8bc2ed6ce151c8598b2fea30", optionsWithJson).collect()
  if (!result(0)._2.contains("sessionid"))
    broken += 1
}
LogManager.getRootLogger().error("broken with parsing: " + broken)
// Pass 2: request the raw JSON string instead and search it for the field name.
optionsWithJson("es.output.json") = "true"
broken = 0
for (i <- 1 to 100) {
  val result = sc.esRDD("euv_landingpages*/acquisitionLead", "?q=_id:6947a956c1f2f675a9e3acd0e714dc97c81beb5c8bc2ed6ce151c8598b2fea30", optionsWithJson).collect()
  if (!result(0)._2.asInstanceOf[String].contains("sessionid"))
    broken += 1
}
LogManager.getRootLogger().error("broken without parsing: " + broken)
Output:
16/09/22 12:29:20 ERROR root: broken with parsing: 44
16/09/22 12:29:41 ERROR root: broken without parsing: 0
Example document:
curl -k "https://xyz:9200/euv_landingpages*/acquisitionLead/_search?q=_id:6947a956c1f2f675a9e3acd0e714dc97c81beb5c8bc2ed6ce151c8598b2fea30&pretty"
{
"took" : 8,
"timed_out" : false,
"_shards" : {
"total" : 60,
"successful" : 60,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [ {
"_index" : "euv_landingpages-2016-03",
"_type" : "acquisitionLead",
"_id" : "6947a956c1f2f675a9e3acd0e714dc97c81beb5c8bc2ed6ce151c8598b2fea30",
"_score" : 1.0,
"_source" : {
"logFilename" : "tracking.2016-03-18-16.log",
"useragent" : "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36",
"rawTimestamp" : 1458316388123,
"sessionid" : "15b423df-49be-4e51-b4a6-d5886e40969d",
"timestamp" : "2016-03-18T16:53:08.123+0100",
"index" : "2016-03",
"payload" : {
"platform" : "webcms",
"submitStatus" : "SUCCESS",
"pagePath" : "bogota/offer-a-property/",
"country" : "CO",
"landingpagePath" : "bogota/",
"hash" : "s6HWSMhrVSyNMuiw3O91/g==",
"pageType" : "aquisitionpage",
"contactFieldsSet" : "FN|LN|ST|SN|CI|PH|EM|ME|",
"language" : "EN"
},
"eventType" : "acquisitionLead",
"id" : "6947a956c1f2f675a9e3acd0e714dc97c81beb5c8bc2ed6ce151c8598b2fea30"
}
} ]
}
}
Corresponding mapping:
curl -k "https://xyz:9200/euv_landingpages*/acquisitionLead/_mapping?pretty"
{
"euv_landingpages-2016-06" : {
"mappings" : {
"acquisitionLead" : {
"properties" : {
"eventType" : {
"type" : "string"
},
"id" : {
"type" : "string"
},
"index" : {
"type" : "string"
},
"logFilename" : {
"type" : "string"
},
"payload" : {
"properties" : {
"condition" : {
"type" : "string"
},
"constructionYear" : {
"type" : "string"
},
"contactFieldsSet" : {
"type" : "string"
},
"country" : {
"type" : "string"
},
"hash" : {
"type" : "string"
},
"imagesUploaded" : {
"type" : "boolean"
},
"landingpagePath" : {
"type" : "string"
},
"language" : {
"type" : "string"
},
"livingArea" : {
"type" : "string"
},
"offertype" : {
"type" : "string"
},
"pagePath" : {
"type" : "string"
},
"pageType" : {
"type" : "string"
},
"platform" : {
"type" : "string"
},
"plotSize" : {
"type" : "string"
},
"referer" : {
"type" : "string"
},
"selectedPropertyType" : {
"type" : "string"
},
"submitStatus" : {
"type" : "string"
}
}
},
"rawTimestamp" : {
"type" : "long"
},
"sessionid" : {
"type" : "string"
},
"timestamp" : {
"type" : "date",
"format" : "strict_date_optional_time||epoch_millis"
},
"useragent" : {
"type" : "string"
}
}
}
}
}, ...other indices...
Version Info
OS : Debian GNU/Linux 8.4 (jessie)
JVM : Scala version 2.10.5 (OpenJDK 64-Bit Server VM, Java 1.8.0_91)
Spark : 1.6.2
ES-Hadoop : 2.3.4
ES : 2.3.4
Issue Analytics
- State:
- Created 7 years ago
- Comments:5 (3 by maintainers)
Top GitHub Comments
@sebastianmueller and I analyzed this issue further and came to the conclusion that #648 is responsible for this behavior, i.e. fields that are not mapped are ignored by default. Interestingly, we were pretty confident that our mappings were (at least partly) correct.
TL;DR: Setting .set("es.read.unmapped.fields.ignore", "false") on the SparkConf reads all fields, even those not defined in your indexes' mappings.
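For reference, a minimal sketch of where that setting could go, assuming a standalone Spark application with the elasticsearch-spark connector on the classpath. The application name and node address are placeholders and not values from this report:

```scala
import org.apache.spark.SparkConf

// Placeholder values: adjust the app name and node address to your setup.
val conf = new SparkConf()
  .setAppName("es-unmapped-fields-check")
  .set("es.nodes", "localhost:9200")
  // Do not drop fields that are missing from the mapping es-hadoop resolved.
  .set("es.read.unmapped.fields.ignore", "false")
```

In a spark-shell session like the test program above, where sc already exists, the same key could presumably be added to the options map passed to esRDD instead.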
I went down the rabbit hole to see where exactly the mapping is read from. This seems to happen here: https://github.com/elastic/elasticsearch-hadoop/blob/dd1d6036fa9ee21d9a3082226a152d40cce9ede7/mr/src/main/java/org/elasticsearch/hadoop/rest/RestService.java#L266
Which in turn calls https://github.com/elastic/elasticsearch-hadoop/blob/dd1d6036fa9ee21d9a3082226a152d40cce9ede7/mr/src/main/java/org/elasticsearch/hadoop/rest/RestRepository.java#L436
IMHO the bug lies within the function Field.parseField(). The return value of client.getMapping(resourceR.mapping()) seems fine: we get a map from index name to the respective mappings. (We access the resource through an alias: euv_landingpages fans out to multiple indexes.)
Field.parseField(), however, returns ONLY the first type (!!!) of the returned mappings, as seen here: https://github.com/elastic/elasticsearch-hadoop/blob/dd1d6036fa9ee21d9a3082226a152d40cce9ede7/mr/src/main/java/org/elasticsearch/hadoop/serialization/dto/mapping/Field.java#L77
The extraction of the first type is nondeterministic since Elasticsearch returns the mappings in no specific order, resulting in the observed behavior (sometimes a valid map is produced and sometimes not).
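To illustrate the effect described here, a small self-contained sketch follows. It does not use es-hadoop's real types: IndexMapping, firstOnly, merged, and the index names index-a and index-b are all hypothetical stand-ins, and merging is only one conceivable alternative, not the fix the maintainers chose.

```scala
object FirstMappingSketch {
  // Hypothetical stand-in for one entry of a mapping response:
  // (index name, names of the fields mapped in that index).
  type IndexMapping = (String, Set[String])

  // Mimics the described behavior: only the first entry is used.
  def firstOnly(response: Seq[IndexMapping]): Set[String] =
    response.headOption.map(_._2).getOrElse(Set.empty[String])

  // One conceivable alternative: the union of the fields of all entries.
  def merged(response: Seq[IndexMapping]): Set[String] =
    response.flatMap(_._2).toSet

  def main(args: Array[String]): Unit = {
    val withField: IndexMapping = ("index-a", Set("sessionid", "useragent"))
    val withoutField: IndexMapping = ("index-b", Set("useragent"))

    // The same two mappings, delivered in the two possible orders.
    val orderings = Seq(Seq(withField, withoutField), Seq(withoutField, withField))

    for (response <- orderings) {
      val seenByFirst = firstOnly(response).contains("sessionid")
      val seenByMerged = merged(response).contains("sessionid")
      println(s"first entry ${response.head._1}: firstOnly=$seenByFirst merged=$seenByMerged")
    }
  }
}
```

When index-a comes first, sessionid is visible through firstOnly; when index-b comes first, it is not. The merged view contains it in both orders, which matches the "sometimes valid, sometimes not" pattern from the test output.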
I hope that my analysis can further help you with fixing this bug.
Thanks to everyone who has posted super helpful analyses on this to help narrow the issue down. I’m going to close this ticket in favor of #938. If there are any more developments or feedback for this, please post it there to keep it consolidated. Cheers!