Connector incorrectly determines the number of partitions
What kind of an issue is this?
- Bug report. If you’ve found a bug, please provide a code snippet or test to reproduce it below.
The easier it is to track down the bug, the faster it is solved.
Issue description
It seems that there is an issue in how Elasticsearch Hadoop’s RestService calculates the slice partitions for a specified index/shard. In my opinion there are two issues:
- Number of produced tasks is not correct: There seems to be an issue with the division being executed here. We can see that the code calculating the number of partitions casts the result of the division to an integer. This results in one partition fewer than expected. E.g.:
If we have 3 shards with 184 items each, and we want 30 documents per partition (setting es.input.max.docs.per.partition equal to 30), then for each shard the following number of partitions will be calculated. For shard 0:
(int)Math.max(1, 184 / 30) => (int)Math.max(1, 6.1) => (int)(6.1)
so 6. We should expect 7, because 6 * 30 = 180 and there are 184 documents in that shard. The same holds for shards 1 and 2. As a result, we should be expecting 21 partitions, but ES-Hadoop will give us 18 (see the first sketch after this list).
- Slices create empty tasks: From the same file we can see here that, for each shard, it creates sliced partitions numbered from zero up to the total number of partitions per shard. It also uses the preference query parameter to target each query at the specific shard of the partition. In the above example this leads to the following partitions (slices):
For shard 0: (0, 6), (1, 6), (2, 6), (3, 6), (4, 6), (5, 6)
For shard 1: (0, 6), (1, 6), (2, 6), (3, 6), (4, 6), (5, 6)
For shard 2: (0, 6), (1, 6), (2, 6), (3, 6), (4, 6), (5, 6)
According to the scroll documentation here, sliced scroll is split on the shards first, meaning that for the first shard, shard 0, only slices (0, 6) and (3, 6) are going to fetch data. All the other slices will just create empty tasks. The same goes for the other shards, giving 6 tasks with data and 12 empty tasks. Furthermore, instead of 30 documents per task, we are now actually fetching about 90 docs per partition (see the second sketch after this list).
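To make the truncation in the first point concrete, here is a minimal standalone sketch (not the actual RestService code) contrasting the cast-to-int division with a ceiling-based one, using the 184-documents / 30-per-partition numbers from the example:

```java
// Minimal sketch contrasting truncating division with ceiling-based division.
public class PartitionMath {
    public static void main(String[] args) {
        long docsInShard = 184;
        long maxDocsPerPartition = 30;

        // Behaviour as described in the issue: the fractional result is truncated,
        // so 184 / 30 = 6.13... becomes 6 partitions (covering only 180 documents).
        int truncated = (int) Math.max(1, (double) docsInShard / maxDocsPerPartition);

        // Expected behaviour: round up so every document is covered,
        // 184 / 30 rounds up to 7 partitions.
        int roundedUp = (int) Math.max(1, (long) Math.ceil((double) docsInShard / maxDocsPerPartition));

        System.out.println("truncated = " + truncated);   // prints 6
        System.out.println("roundedUp = " + roundedUp);   // prints 7
    }
}
```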
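The empty-task behaviour in the second point can be modeled with a simple rule. Assuming the shard-first splitting described in the sliced scroll documentation behaves as the example above suggests (slice i of 6 effectively lands on shard i % numShards), only one third of the (shard, slice) combinations return data. This is an illustrative sketch of the example, not the actual Elasticsearch routing code:

```java
// Illustrative model of which (shard, slice) tasks return data in the example above,
// assuming slice i maps to shard i % numShards under shard-first splitting.
public class SliceAssignment {
    public static void main(String[] args) {
        int numShards = 3;
        int slicesPerShard = 6;  // partitions computed per shard in the example

        for (int shard = 0; shard < numShards; shard++) {
            for (int slice = 0; slice < slicesPerShard; slice++) {
                // With the preference parameter pinning the request to `shard`,
                // the slice only returns data if its assignment matches that shard.
                boolean returnsData = (slice % numShards) == shard;
                System.out.printf("shard %d, slice (%d, %d): %s%n",
                        shard, slice, slicesPerShard, returnsData ? "data" : "empty task");
            }
        }
        // Output matches the example: 6 tasks with data, 12 empty tasks.
    }
}
```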
Version Info
Elasticsearch cluster OS: Ubuntu
JVM: 1.8.0
Hadoop/Spark: 2.2.1
ES-Hadoop: elasticsearch-spark-20_2.11-6.3.2
ES: 6.3.2
Top GitHub Comments
After posting this I realized that the Spark hard limit is at 2GB of data, not based on the limitations of an integer, but rather based on the maximum size of byte buffers. In this case, I think it makes more sense to simply have the user set the max docs per partition if they run into issues with Spark.
Thanks @jbaiera!
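For reference, a hedged sketch of the workaround mentioned in the first comment: setting es.input.max.docs.per.partition from a Spark job could look roughly like this (index name, node address, and master are placeholders, not values from the issue):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

public class ReadWithDocsPerPartitionCap {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("es-read-example")
                .setMaster("local[*]")                 // placeholder master
                .set("es.nodes", "localhost:9200")     // placeholder cluster address
                // Cap how many documents a single input partition should cover,
                // as suggested in the comment above.
                .set("es.input.max.docs.per.partition", "30");

        JavaSparkContext jsc = new JavaSparkContext(conf);

        // "my-index/doc" is a placeholder resource (index/type, as used with ES 6.x).
        long docCount = JavaEsSpark.esRDD(jsc, "my-index/doc").count();
        System.out.println("documents read: " + docCount);

        jsc.stop();
    }
}
```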