
Connector incorrectly determines the number of partitions

See original GitHub issue

What kind of issue is this?

  • Bug report. If you’ve found a bug, please provide a code snippet or test to reproduce it below.
    The easier it is to track down the bug, the faster it is solved.

Issue description

It seems that there is an issue in how Elasticsearch Hadoop’s RestService calculates the slice partitions for a specified index/shard. In my opinion there are two issues:

  • Number of produced tasks is not correct: There seems to be an issue with the division executed here. The code calculating the number of partitions casts the result of the division to an integer, which truncates it and yields one partition fewer than expected whenever the division leaves a remainder. E.g.: if we have 3 shards with 184 documents each and we want 30 documents per partition (setting es.input.max.docs.per.partition to 30), then for each shard the following number of partitions will be calculated. For shard 0:

(int) Math.max(1, 184 / 30)  =>  (int) Math.max(1, 6.13)  =>  6

so 6. We should expect 7, because 6 * 30 = 180 and there are 184 documents in that shard. The same holds for shards 1 and 2. As a result, we should be expecting 21 partitions, but ES-Hadoop will give us 18. (See the sketch after this bullet.)
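
A minimal, hypothetical sketch of the difference between the two calculations (this is not the actual RestService code; the variable names are made up for illustration):

    // Illustrative only: truncating vs. ceiling division when splitting
    // one shard's documents into partitions.
    public class PartitionMath {
        public static void main(String[] args) {
            int docsInShard = 184;
            int maxDocsPerPartition = 30;

            // Truncating division, as described above:
            // 184 / 30 -> 6, covering only 6 * 30 = 180 of the 184 docs.
            int truncated = Math.max(1, docsInShard / maxDocsPerPartition);

            // Ceiling division covers the remainder with a 7th partition.
            int roundedUp = (docsInShard + maxDocsPerPartition - 1) / maxDocsPerPartition;

            System.out.println(truncated);  // 6
            System.out.println(roundedUp);  // 7
        }
    }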

  • Slices create empty tasks: From the same file we can see here that, for each shard, it creates sliced partitions numbered from zero up to the total number of partitions per shard. It also uses the preference query parameter to target each query at the specific shard of the partition. In the above example this leads to the following partitions (slices):

For shard 0: (0, 6), (1, 6), (2, 6), (3, 6), (4, 6), (5, 6)

for shard 1: (0, 6), (1, 6), (2, 6), (3, 6), (4, 6), (5, 6)

and for shard 2: (0, 6), (1, 6), (2, 6), (3, 6), (4, 6), (5, 6)

According to the scroll documentation here, sliced scrolls are distributed across the shards first, meaning that for the first shard, shard 0, only the queries (0, 6) and (3, 6) will fetch data. All the other slices just create empty tasks. The same goes for the other shards, giving 6 tasks with data and 12 empty tasks. Furthermore, instead of 30 documents per task we are now actually fetching roughly 90 docs per partition. The sketch below illustrates the slice-to-shard assignment.
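
As a rough sketch of the interaction described above (assuming, per the sliced-scroll documentation, that a slice targets the shard with id sliceId % numberOfShards, while ES-Hadoop pins each request to a single shard via the preference parameter):

    // Illustrative only: which (slice, max) queries return data when each
    // query is pinned to one shard and slices map to shards by modulo.
    public class SliceOverlap {
        public static void main(String[] args) {
            int numShards = 3;        // shards in the example index
            int slicesPerShard = 6;   // partitions computed per shard

            for (int shard = 0; shard < numShards; shard++) {
                for (int sliceId = 0; sliceId < slicesPerShard; sliceId++) {
                    // A slice only matches documents on shard sliceId % numShards;
                    // if that disagrees with the pinned shard, the task is empty.
                    boolean hasData = sliceId % numShards == shard;
                    System.out.printf("shard %d, slice (%d, %d): %s%n",
                            shard, sliceId, slicesPerShard,
                            hasData ? "fetches data" : "empty task");
                }
            }
            // Result: 6 of the 18 tasks fetch data, 12 come back empty.
        }
    }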

Version Info

  • Elasticsearch cluster OS: Ubuntu
  • JVM: 1.8.0
  • Hadoop/Spark: 2.2.1
  • ES-Hadoop: elasticsearch-spark-20_2.11-6.3.2
  • ES: 6.3.2

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

1 reaction
jbaiera commented, Oct 2, 2018

“but in the case of using Spark it might make sense to set the default number to the hard limit of records per Spark partition.”

After posting this I realized that the Spark hard limit is 2 GB of data, based not on the limitations of an integer but on the maximum size of byte buffers. In this case, I think it makes more sense to simply have the user set the max docs per partition if they run into issues with Spark.
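
For reference, a minimal sketch of setting that option from a Spark job using the Java API (the node address and index name are placeholders; es.input.max.docs.per.partition is the documented ES-Hadoop setting):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

    public class EsReadExample {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("es-read")
                    .set("es.nodes", "localhost:9200")  // placeholder address
                    // Cap documents per input partition explicitly, as suggested above.
                    .set("es.input.max.docs.per.partition", "30");

            JavaSparkContext jsc = new JavaSparkContext(conf);
            // Partition count now follows from the shard count and the explicit cap.
            long count = JavaEsSpark.esRDD(jsc, "my-index/doc").count();  // placeholder resource
            System.out.println(count);
            jsc.stop();
        }
    }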

0 reactions
jimczi commented, Oct 4, 2018

Thanks @jbaiera !

Read more comments on GitHub

