
Ingestion performance suddenly declines after a few hours

See original GitHub issue

CrateDB version

4.7.1

CrateDB setup information

  • Number of nodes: 4
  • CRATE_HEAP_SIZE: 10g
  • CRATE_JAVA_OPTS: -javaagent:/var/lib/prometheus/crate-jmx-exporter-1.0.0.jar=8080
  • Memory of the nodes: 25g
  • Disk setup: AWS EBS (20g per node)

Observed behavior

We are developing a fast ingestion pipeline on AWS EKS. CrateDB acts as an intermediate database, optimized for insert performance, with a single table partitioned into 15-minute partitions, 12 shards, and no replicas.
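
For reference, a table along these lines might look as sketched below. This is a minimal sketch, not the poster's actual schema: the table and column names are hypothetical, and the 15-minute partition key is derived in a generated column from the epoch milliseconds of the ingest timestamp.

    -- Hypothetical buffer table: 12 shards per partition, no replicas,
    -- partitioned by a generated 15-minute bucket of the ingest timestamp.
    CREATE TABLE IF NOT EXISTS doc_buffer (
        id        TEXT,
        ts        TIMESTAMP WITH TIME ZONE NOT NULL,
        payload   OBJECT(DYNAMIC),
        ts_bucket TIMESTAMP WITH TIME ZONE GENERATED ALWAYS AS
                  CAST((ts::BIGINT / 900000) * 900000 AS TIMESTAMP WITH TIME ZONE)
    )
    CLUSTERED INTO 12 SHARDS
    PARTITIONED BY (ts_bucket)
    WITH (number_of_replicas = 0);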

Documents arrive in batches of 500, at a rate of ~120 batches/s at peak times.
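
Batches of that size would typically be sent as multi-row inserts; a minimal sketch against the hypothetical table above (rows are illustrative):

    -- Two illustrative rows; the real pipeline bundles 500 per statement.
    INSERT INTO doc_buffer (id, ts, payload)
    VALUES
        ('a-0001', '2022-03-21T09:00:00Z', {sensor = 1, reading = 0.42}),
        ('a-0002', '2022-03-21T09:00:01Z', {sensor = 2, reading = 0.17});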

Every 15 minutes, a job selects all documents in the last complete partition, sends them to a long-term storage database, then deletes the oldest partition. That way, we never keep more than two partitions at any time. Our goal with this setup is to minimize table size in CrateDB in order to maintain good ingestion performance.
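
A sketch of such a job, again against the hypothetical table above (the bucket timestamps are only illustrative):

    -- 1) Export: read everything from the last complete 15-minute partition.
    SELECT id, ts, payload
    FROM doc_buffer
    WHERE ts_bucket = '2022-03-21T08:45:00Z';

    -- 2) Cleanup: a DELETE that matches an entire partition removes that
    --    partition as a whole rather than deleting row by row.
    DELETE FROM doc_buffer
    WHERE ts_bucket = '2022-03-21T08:30:00Z';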

During several test runs, we observed a sudden, sharp decline in performance after about 3 hours during morning runs, and after only ~1.5 hours during peak times in the early afternoon. crate_query_sum_of_durations_millis for INSERT hovers at about 10 ms during normal operation, peaking at ~20 ms every 15 minutes (due to the aforementioned deletion job). This value suddenly jumps to 100 ms, then 200 ms, peaking at about 400 ms and staying there.
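
One way to cross-check that metric from inside CrateDB is to look at recent INSERT durations in sys.jobs_log; a sketch, assuming the stats collector is enabled (stats.enabled = true):

    -- Most recent INSERT jobs with their duration in milliseconds.
    SELECT started,
           ended::BIGINT - started::BIGINT AS duration_ms,
           node['name'] AS node
    FROM sys.jobs_log
    WHERE stmt LIKE 'INSERT%'
    ORDER BY ended DESC
    LIMIT 20;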

CrateDB doesn’t show any unusual log entries when this happens, nor are the pods under CPU or memory pressure. Everything looks fine, except it isn’t. The dramatic drop in ingestion performance seems to persist indefinitely (until our message broker has to block producers because the queues have run full). We only get good performance again if we stop consumers and producers and then drop the entire table.

Steps to Reproduce

This issue is admittedly difficult to reproduce, because you will have to match our setup, have something constantly delivering a large number of documents (we are talking ~50 million for each 15-minute partition), plus all the extra stuff mentioned above.

If you need more information, I’ll gladly provide it. I realize this sounds complex, but being new to working with CrateDB might mean we are simply missing something crucial in our setup.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
Tardog commented, Mar 21, 2022

Yes, your assumption seems to be correct. CloudWatch metrics show the volumes hitting 0% remaining burst bucket credits at the time when we experienced the sudden drops in write performance. So we’re likely hitting a bottleneck here.

We will look into it and run more tests with volume types that support higher throughput. Since the problem is likely not caused by CrateDB after all, this issue can be closed.

0 reactions
proddata commented, Mar 21, 2022

If possible, you might want to switch to gp3 volumes, which can have IOPS/throughput provisioned.

I will close this for now 🙂
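
As a follow-up to the burst-credit finding above, the disk counters that CrateDB itself exposes can be correlated with the CloudWatch burst balance. A hedged sketch, assuming the usual sys.nodes fs layout:

    -- Cumulative per-node disk write counters, useful for spotting when
    -- throughput flattens out as EBS burst credits run dry.
    SELECT name,
           fs['total']['writes']        AS write_ops,
           fs['total']['bytes_written'] AS bytes_written
    FROM sys.nodes
    ORDER BY name;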


Top Results From Across the Web

  • Logstash ingestion slows rapidly after 1 hour - Elastic Discuss
    I have a logstash 6.7 on kubernetes which ingests .gz log files from s3 input at a rate of about 20k/minute for roughly...
  • How to solve 5 Elasticsearch performance and scaling ...
    This article will walk through five common Elasticsearch performance issues, and how to deal with them.
