Elasticsearch sink UPSERT performance


Hi guys,

When benchmarking the Elasticsearch sink we’ve seen a huge difference in performance between UPSERT and INSERT: a 10x difference. I understand that UPSERT is inherently much slower than a regular INSERT, but I was surprised to find that it’s a 10x difference. So I’m kind of curious whether this bottleneck is in the connector or in Elasticsearch.

I have 5 million JSON messages in Kafka that, when UPSERTed, should total 1 million documents (5 messages form one complete Elasticsearch document). Doing regular INSERT I was able to average 5k documents per second, but doing UPSERT I could only get 500 documents a second.

Versions: Confluent Kafka 2.11.0-0.11.0.1, Stream Reactor 0.30, Elasticsearch 5.6.2

Using regular connect-distributed with the schema turned off. My connector configuration:

{
  "name": "elastic-sink-ztest",
  "config": {
    "connector.class": "com.datamountaineer.streamreactor.connect.elastic5.ElasticSinkConnector",
    "tasks.max": "1",
    "topics": "ztest",
    "connect.elastic.kcql": "UPSERT INTO ztest SELECT * from ztest PK id WITHDOCTYPE=event",
    "connect.elastic.cluster.name": "elastic",
    "connect.elastic.url": "10.10.10.1:9300",
    "connect.progress.enabled": true
  }
}

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 5

Top GitHub Comments

1 reaction
a3ammar commented, Dec 20, 2017

Hi @Antwnis, sadly I couldn’t get esrally to do doc_as_upsert, and I don’t have enough time to figure it out or write my own script.
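For reference, a doc_as_upsert request in the Elasticsearch bulk API pairs an `update` action line with a body line containing `doc` and `"doc_as_upsert": true`. Below is a minimal sketch of building such a body by hand. The index `ztest`, type `event` and key field `id` are taken from the connector config above; the helper name and the sample messages are hypothetical, and nothing here is claimed to be what the connector itself sends. No request is made, the body is only printed.

```python
import json


def build_upsert_bulk_body(messages, index="ztest", doc_type="event", pk="id"):
    """Build an NDJSON body for the _bulk endpoint using doc_as_upsert.

    Hypothetical helper for illustration only; it does not talk to a cluster.
    """
    lines = []
    for msg in messages:
        # Action line: update the document whose _id is the message's key.
        action = {"update": {"_index": index, "_type": doc_type, "_id": str(msg[pk])}}
        # Body line: merge the partial doc, creating the document if it is missing.
        payload = {"doc": msg, "doc_as_upsert": True}
        lines.append(json.dumps(action))
        lines.append(json.dumps(payload))
    # The bulk API expects newline-delimited JSON terminated by a newline.
    return "\n".join(lines) + "\n"


# Two made-up partial messages that belong to the same document (id 1).
sample = [{"id": 1, "a": "x"}, {"id": 1, "b": "y"}]
body = build_upsert_bulk_body(sample)
print(body)
```

Each message produces two lines, so several partial messages with the same key become several update operations against one document.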

It’s definitely a counting error: 500 documents/second means indexing 2,500 messages per second.

Right now my priorities have changed, but I’ll be revisiting this in the near future, and if I have more interesting findings that might be connector related I’ll reopen this issue or make a new one.
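The arithmetic behind that can be sketched as follows. It assumes, as stated in the original post, that 5 messages make up one document, and it additionally assumes that the 5k documents/second INSERT figure corresponds to 5k messages/second (one message per indexed document); the variable names are illustrative.

```python
MESSAGES_PER_DOCUMENT = 5   # from the post: 5 messages form one complete document

insert_rate_msgs = 5_000    # assumption: INSERT writes one document per message
upsert_rate_docs = 500      # reported UPSERT rate, counted in complete documents

# Every complete document took 5 upsert operations, one per message.
upsert_rate_msgs = upsert_rate_docs * MESSAGES_PER_DOCUMENT

# Compare both modes in the same unit (messages per second).
slowdown = insert_rate_msgs / upsert_rate_msgs

print(upsert_rate_msgs, slowdown)  # prints: 2500 2.0
```

Under those assumptions the gap per message is about 2x instead of 10x.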

Happy hacking!

0 reactions
Antwnis commented, Dec 19, 2017

Hi @a3ammar, did you manage to get to the bottom of this, to better understand whether this is a bottleneck or is due to the way the count is done?


