Elasticsearch sink UPSERT performance
Hi guys,
When benchmarking the Elasticsearch sink we’ve seen a huge difference in performance between UPSERT and INSERT, about 10x. I understand that UPSERT is inherently much slower than a regular INSERT, but I was surprised to find that it’s a 10x difference. So I’m kind of curious whether this bottleneck is in the connector or in Elasticsearch.
I have 5 million JSON messages in Kafka that, when UPSERTed, should total 1 million documents (5 messages form one complete Elasticsearch document). Doing a regular INSERT I was able to average 5k documents per second, but doing UPSERT I could only get 500 documents per second.
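To make the workload concrete, here is a minimal sketch of the upsert semantics described above, assuming each message carries the id plus one part of the document's fields. The field names (`id`, `f0` to `f4`) and the in-memory `store` are made up for illustration; this is not the connector's code.

```python
def upsert(store, message, pk="id"):
    """Merge a partial message into the document that shares its primary key."""
    doc = store.setdefault(message[pk], {})  # create the document on first sight
    doc.update(message)                      # later messages add or overwrite fields
    return doc


# Hypothetical data: 3 ids, 5 partial messages per id.
messages = [{"id": i, f"f{part}": part} for i in range(3) for part in range(5)]

store = {}
for m in messages:
    upsert(store, m)

print(len(messages), "messages ->", len(store), "documents")
```

At the scale in this issue the same 5:1 ratio turns 5 million messages into 1 million documents.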
Versions: Confluent Kafka 2.11.0-0.11.0.1, Stream Reactor 0.30, Elasticsearch 5.6.2
Using regular connect-distributed with the schema turned off. My connector configuration:
{
  "name": "elastic-sink-ztest",
  "config": {
    "connector.class": "com.datamountaineer.streamreactor.connect.elastic5.ElasticSinkConnector",
    "tasks.max": "1",
    "topics": "ztest",
    "connect.elastic.kcql": "UPSERT INTO ztest SELECT * from ztest PK id WITHDOCTYPE=event",
    "connect.elastic.cluster.name": "elastic",
    "connect.elastic.url": "10.10.10.1:9300",
    "connect.progress.enabled": true
  }
}
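For comparison, this is roughly what an upsert looks like on the Elasticsearch side if it is expressed as a bulk `update` action with `doc_as_upsert`. Whether the connector issues its requests this way is an assumption; the sketch only builds the request body for the `_bulk` endpoint and sends nothing. The helper name `bulk_upsert_body` and the sample messages are hypothetical; the index and type come from the KCQL above.

```python
import json


def bulk_upsert_body(index, doc_type, messages, pk="id"):
    """Build an NDJSON _bulk body with one update/doc_as_upsert action per message."""
    lines = []
    for msg in messages:
        # Action line: which document to update (ES 5.x still uses _type).
        action = {"update": {"_index": index, "_type": doc_type, "_id": str(msg[pk])}}
        lines.append(json.dumps(action))
        # Source line: merge msg into the document, or create it if missing.
        lines.append(json.dumps({"doc": msg, "doc_as_upsert": True}))
    return "\n".join(lines) + "\n"  # the bulk API expects a trailing newline


body = bulk_upsert_body("ztest", "event", [{"id": 1, "a": "x"}, {"id": 1, "b": "y"}])
print(body)
```

An update has to fetch the current document, merge the partial one into it, and reindex the result, which is one reason an upsert is expected to cost more than a plain index operation.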
Issue Analytics
- Created 6 years ago
- Comments:5
Top GitHub Comments
Hi @Antwnis, sadly I couldn’t get esrally to do doc_as_upsert, and I don’t have enough time to figure it out or write my own script. It’s definitely a counting error: 500 documents/second means indexing 2500 messages per second.
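The counting point can be checked with a few lines of arithmetic. This sketch assumes the INSERT figure counts one message as one document, while the UPSERT figure counts complete documents made of 5 messages each; the variable names are only illustrative.

```python
MESSAGES_PER_DOC = 5           # 5 messages form one complete document

insert_msgs_per_sec = 5000     # INSERT: assumed one message per document
upsert_docs_per_sec = 500      # UPSERT: reported in complete documents

# Convert the UPSERT rate to messages so both figures use the same unit.
upsert_msgs_per_sec = upsert_docs_per_sec * MESSAGES_PER_DOC
slowdown = insert_msgs_per_sec / upsert_msgs_per_sec

print(upsert_msgs_per_sec, slowdown)
```

Under that assumption UPSERT handles 2500 messages per second, so the gap is about 2x rather than 10x.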
Right now my priorities have changed, but I’ll be revisiting this in the near future, and if I have more interesting findings that might be connector related I’ll reopen this issue or make a new one.
Happy hacking!
Hi @a3ammar, did you manage to get to the bottom of this, to better understand whether this is a real bottleneck or is due to the way the count is done?