Ingestion performance suddenly declines after a few hours
CrateDB version
4.7.1
CrateDB setup information
Number of nodes: 4
CRATE_HEAP_SIZE: 10g
CRATE_JAVA_OPTS: -javaagent:/var/lib/prometheus/crate-jmx-exporter-1.0.0.jar=8080
Memory per node: 25 GB
Disk setup: AWS EBS (20 GB per node)
Observed behavior
We are developing a fast ingestion pipeline on AWS EKS. CrateDB acts as an intermediate database, optimized for insert performance, with a single table partitioned into 15-minute windows, 12 shards, no replicas.
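For context, a minimal sketch of a comparable table definition, using the crate Python client; the table and column names are placeholders, and we assume the producer computes the 15-minute bucket that the table is partitioned on:

```python
from crate import client

# Placeholder connection; point this at any node's HTTP endpoint.
conn = client.connect("http://localhost:4200")
cursor = conn.cursor()

# Hypothetical schema: "part" holds the 15-minute bucket computed by
# the producer; the table is clustered into 12 shards, no replicas.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS doc.ingest (
        ts TIMESTAMP WITH TIME ZONE NOT NULL,
        part TIMESTAMP WITH TIME ZONE NOT NULL,
        payload OBJECT(DYNAMIC)
    ) CLUSTERED INTO 12 SHARDS
    PARTITIONED BY (part)
    WITH (number_of_replicas = 0)
""")
```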
Documents arrive in batches of 500, at a rate of ~120 batches/s at peak times (roughly 60,000 documents/s).
Every 15 minutes, a job selects all documents in the last complete partition, sends them to a long-term storage database, then deletes the oldest partition. That way, we never keep more than 2 partitions at any time. Our goal with this setup is to minimize table size in CrateDB in order to maintain good ingestion performance; a sketch of the job follows below.
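Illustratively, the job boils down to the following (names and the forwarding step are placeholders). A useful CrateDB property here: a DELETE whose filter matches an entire partition drops that partition outright instead of deleting row by row.

```python
from crate import client

conn = client.connect("http://localhost:4200")  # placeholder host
cursor = conn.cursor()

def rotate(last_complete_bucket, oldest_bucket):
    # Read the last complete 15-minute partition and ship it off.
    cursor.execute("SELECT * FROM doc.ingest WHERE part = ?",
                   (last_complete_bucket,))
    ship_to_long_term_storage(cursor.fetchall())  # placeholder function

    # Filtering on the partition column alone lets CrateDB drop the
    # whole partition cheaply.
    cursor.execute("DELETE FROM doc.ingest WHERE part = ?",
                   (oldest_bucket,))
```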
During several test runs, we observed a sudden, sharp decline in performance after about 3 hours in the morning, and after ~1.5 hours during peak times in the early afternoon. crate_query_sum_of_durations_millis for INSERT hovers at about 10 ms during normal operation, peaking at ~20 ms every 15 minutes (due to the aforementioned deletion job). This value suddenly jumps to 100 ms, then 200 ms, peaking at about 400 ms and staying there.
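For anyone watching for the same regression: the jump is easy to catch by polling the rate of that counter. A sketch against the Prometheus HTTP API; the Prometheus address and the absence of label filters are assumptions about your scrape setup:

```python
import requests

PROM = "http://prometheus:9090"  # placeholder Prometheus address

# Per-second growth of the summed query durations over 5-minute windows;
# a sustained climb from ~10 ms into the hundreds marks the regression.
resp = requests.get(
    f"{PROM}/api/v1/query",
    params={"query": "rate(crate_query_sum_of_durations_millis[5m])"},
)
for series in resp.json()["data"]["result"]:
    print(series["metric"], series["value"])
```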
CrateDB doesn’t show any unusual log entries when this happens, nor are the pods under CPU or memory pressure. Everything looks fine, except it isn’t. The dramatic drop in ingestion performance persists indefinitely (until our message broker has to block producers because the queues have run full). We only get good performance back if we stop consumers and producers and then drop the entire table.
Steps to Reproduce
This issue is admittedly difficult to reproduce: you would have to match our setup and have something constantly delivering a large volume of documents (~50 million per 15-minute partition), plus all the extra pieces mentioned above.
If you need more information, I’ll gladly provide it. I realize this sounds complex, but since we are new to CrateDB, we may simply be missing something crucial in our setup.
Top GitHub Comments
Yes, your assumption seems to be correct. CloudWatch metrics show the volumes hitting 0% remaining burst bucket credits at the time when we experienced the sudden drops in write performance. So we’re likely hitting a bottleneck here.
We will look into it and run more tests with volume types that support higher throughput. Since the problem is likely not caused by CrateDB after all, this issue can be closed.
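For anyone hitting the same wall: on gp2 volumes, credit exhaustion shows up as the BurstBalance CloudWatch metric dropping to 0. A quick check with boto3 might look like this (region and volume ID are placeholders):

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")  # placeholder region
now = datetime.now(timezone.utc)

# BurstBalance is the percentage of I/O burst credits left on a gp2
# volume; a sustained 0% lines up with the sudden write slowdown above.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EBS",
    MetricName="BurstBalance",
    Dimensions=[{"Name": "VolumeId", "Value": "vol-0123456789abcdef0"}],  # placeholder
    StartTime=now - timedelta(hours=6),
    EndTime=now,
    Period=300,
    Statistics=["Average"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])
```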
If possible, you might want to switch to gp3 volumes, which allow IOPS and throughput to be provisioned independently.
I will close this for now 🙂