Connector incorrectly determines the number of partitions
What kind of an issue is this?
- Bug report. If you’ve found a bug, please provide a code snippet or test to reproduce it below.
The easier it is to track down the bug, the faster it is solved.
Issue description
It seems that there is an issue in how Elasticsearch Hadoop’s RestService calculates the slice partitions for a specified index/shard. In my opinion there are two issues:
- Number of produced tasks is not correct: There seems to be an issue with the division being executed here. We can see that the code calculating the number of partitions casts the result of the division to an integer. This results in one partition fewer than expected. E.g.:
If we have 3 shards with 184 items each, and we want 30 documents per partition (setting es.input.max.docs.per.partition equal to 30), then for each shard the following number of partitions will be calculated. For shard 0:
(int)Math.max(1, 184 / 30) => (int)Math.max(1, 6.1) => (int)(6.1)
so 6. We should expect 7, because 6 * 30 = 180 and there are 184 documents in that shard. The same holds for shards 1 and 2. As a result, we should be expecting 21 partitions, but ES-Hadoop will give us 18 (see the first sketch after this list).
- Slices create empty tasks: From the same file we can see here that, for each shard, it creates sliced partitions numbered from zero up to the total number of partitions per shard. It also uses the preference query parameter to target each query at the specific shard of the partition. In the above example this leads to the following partitions (slices):
For shard 0: (0, 6), (1, 6), (2, 6), (3, 6), (4, 6), (5, 6)
For shard 1: (0, 6), (1, 6), (2, 6), (3, 6), (4, 6), (5, 6)
For shard 2: (0, 6), (1, 6), (2, 6), (3, 6), (4, 6), (5, 6)
According to the scroll documentation here, sliced scroll is split on the shards first, meaning that for the first shard, shard 0, only slices (0, 6) and (3, 6) are going to fetch data. All the other slices will just create empty tasks. The same goes for the other shards, giving 6 tasks with data and 12 empty tasks. Furthermore, instead of 30 documents per task, we are now actually fetching about 90 docs per partition (see the second sketch after this list).
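To make the truncation in the first point concrete, here is a minimal standalone sketch (not the actual RestService code) contrasting the cast-to-int division with a ceiling-based one, using the 184-documents / 30-per-partition numbers from the example:

```java
// Minimal sketch contrasting truncating division with ceiling-based division.
public class PartitionMath {
    public static void main(String[] args) {
        long docsInShard = 184;
        long maxDocsPerPartition = 30;

        // Behaviour as described in the issue: the fractional result is truncated,
        // so 184 / 30 = 6.13... becomes 6 partitions (covering only 180 documents).
        int truncated = (int) Math.max(1, (double) docsInShard / maxDocsPerPartition);

        // Expected behaviour: round up so every document is covered,
        // 184 / 30 rounds up to 7 partitions.
        int roundedUp = (int) Math.max(1, (long) Math.ceil((double) docsInShard / maxDocsPerPartition));

        System.out.println("truncated = " + truncated);   // prints 6
        System.out.println("roundedUp = " + roundedUp);   // prints 7
    }
}
```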
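The empty-task behaviour in the second point can be modeled with a simple rule. Assuming the shard-first splitting described in the sliced scroll documentation behaves as the example above suggests (slice i of 6 effectively lands on shard i % numShards), only one third of the (shard, slice) combinations return data. This is an illustrative sketch of the example, not the actual Elasticsearch routing code:

```java
// Illustrative model of which (shard, slice) tasks return data in the example above,
// assuming slice i maps to shard i % numShards under shard-first splitting.
public class SliceAssignment {
    public static void main(String[] args) {
        int numShards = 3;
        int slicesPerShard = 6;  // partitions computed per shard in the example

        for (int shard = 0; shard < numShards; shard++) {
            for (int slice = 0; slice < slicesPerShard; slice++) {
                // With the preference parameter pinning the request to `shard`,
                // the slice only returns data if its assignment matches that shard.
                boolean returnsData = (slice % numShards) == shard;
                System.out.printf("shard %d, slice (%d, %d): %s%n",
                        shard, slice, slicesPerShard, returnsData ? "data" : "empty task");
            }
        }
        // Output matches the example: 6 tasks with data, 12 empty tasks.
    }
}
```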
Version Info
Elasticsearch cluster OS: Ubuntu
JVM: 1.8.0
Hadoop/Spark: 2.2.1
ES-Hadoop: elasticsearch-spark-20_2.11-6.3.2
ES: 6.3.2
Top GitHub Comments
After posting this I realized that the Spark hard limit is at 2GB of data, not based on the limitations of an integer, but rather based on the maximum size of byte buffers. In this case, I think it makes more sense to simply have the user set the max docs per partition if they run into issues with Spark.
Thanks @jbaiera!
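For reference, a hedged sketch of the workaround mentioned in the first comment: setting es.input.max.docs.per.partition from a Spark job could look roughly like this (index name, node address, and master are placeholders, not values from the issue):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

public class ReadWithDocsPerPartitionCap {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("es-read-example")
                .setMaster("local[*]")                 // placeholder master
                .set("es.nodes", "localhost:9200")     // placeholder cluster address
                // Cap how many documents a single input partition should cover,
                // as suggested in the comment above.
                .set("es.input.max.docs.per.partition", "30");

        JavaSparkContext jsc = new JavaSparkContext(conf);

        // "my-index/doc" is a placeholder resource (index/type, as used with ES 6.x).
        long docCount = JavaEsSpark.esRDD(jsc, "my-index/doc").count();
        System.out.println("documents read: " + docCount);

        jsc.stop();
    }
}
```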