Ingestion performance suddenly declines after a few hours
CrateDB version
4.7.1
CrateDB setup information
Number of nodes: 4
CRATE_HEAP_SIZE: 10g
CRATE_JAVA_OPTS: -javaagent:/var/lib/prometheus/crate-jmx-exporter-1.0.0.jar=8080
Memory per node: 25 GB
Disk setup: AWS EBS (20 GB per node)
Observed behavior
We are developing a fast ingestion pipeline on AWS EKS. CrateDB acts as an intermediate database, optimized for insert performance, with a single table partitioned into 15-minute windows, 12 shards, no replicas.
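For context, a minimal sketch of a comparable table definition, using the crate Python client; the table and column names are placeholders, and we assume the producer computes the 15-minute bucket that the table is partitioned on:

```python
from crate import client

# Placeholder connection; point this at any node's HTTP endpoint.
conn = client.connect("http://localhost:4200")
cursor = conn.cursor()

# Hypothetical schema: "part" holds the 15-minute bucket computed by
# the producer; the table is clustered into 12 shards, no replicas.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS doc.ingest (
        ts TIMESTAMP WITH TIME ZONE NOT NULL,
        part TIMESTAMP WITH TIME ZONE NOT NULL,
        payload OBJECT(DYNAMIC)
    ) CLUSTERED INTO 12 SHARDS
    PARTITIONED BY (part)
    WITH (number_of_replicas = 0)
""")
```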
Documents arrive in batches of 500, at a rate of ~120 batches/s at peak times (roughly 60,000 documents/s).
Every 15 minutes, a job selects all documents in the last complete partition, sends them to a long-term storage database, then deletes the oldest partition. That way, we never keep more than 2 partitions at any time. Our goal with this setup is to minimize table size in CrateDB in order to maintain good ingestion performance; a sketch of the job follows below.
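Illustratively, the job boils down to the following (names and the forwarding step are placeholders). A useful CrateDB property here: a DELETE whose filter matches an entire partition drops that partition outright instead of deleting row by row.

```python
from crate import client

conn = client.connect("http://localhost:4200")  # placeholder host
cursor = conn.cursor()

def rotate(last_complete_bucket, oldest_bucket):
    # Read the last complete 15-minute partition and ship it off.
    cursor.execute("SELECT * FROM doc.ingest WHERE part = ?",
                   (last_complete_bucket,))
    ship_to_long_term_storage(cursor.fetchall())  # placeholder function

    # Filtering on the partition column alone lets CrateDB drop the
    # whole partition cheaply.
    cursor.execute("DELETE FROM doc.ingest WHERE part = ?",
                   (oldest_bucket,))
```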
During several test runs, we observed a sudden, sharp decline in performance after about 3 hours in the morning, and after ~1.5 hours during peak times in the early afternoon. crate_query_sum_of_durations_millis for INSERT hovers at about 10 ms during normal operation, peaking at ~20 ms every 15 minutes (due to the aforementioned deletion job). This value suddenly jumps to 100 ms, then 200 ms, peaking at about 400 ms and staying there.
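For anyone watching for the same regression: the jump is easy to catch by polling the rate of that counter. A sketch against the Prometheus HTTP API; the Prometheus address and the absence of label filters are assumptions about your scrape setup:

```python
import requests

PROM = "http://prometheus:9090"  # placeholder Prometheus address

# Per-second growth of the summed query durations over 5-minute windows;
# a sustained climb from ~10 ms into the hundreds marks the regression.
resp = requests.get(
    f"{PROM}/api/v1/query",
    params={"query": "rate(crate_query_sum_of_durations_millis[5m])"},
)
for series in resp.json()["data"]["result"]:
    print(series["metric"], series["value"])
```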
CrateDB doesn’t show any unusual log entries when this happens, nor are the pods under CPU or memory pressure. Everything looks fine, except it isn’t. The dramatic drop in ingestion performance persists indefinitely (until our message broker has to block producers because the queues have run full). We only get good performance back if we stop consumers and producers and then drop the entire table.
Steps to Reproduce
This issue is admittedly difficult to reproduce: you would have to match our setup and have something constantly delivering a large volume of documents (~50 million per 15-minute partition), plus all the extra pieces mentioned above.
If you need more information, I’ll gladly provide it. I realize this sounds complex, but since we are new to CrateDB, we may simply be missing something crucial in our setup.
Top GitHub Comments
Yes, your assumption seems to be correct. CloudWatch metrics show the volumes hitting 0% remaining burst bucket credits at the time when we experienced the sudden drops in write performance. So we’re likely hitting a bottleneck here.
We will look into it and run more tests with volume types that support higher throughput. Since the problem is likely not caused by CrateDB after all, this issue can be closed.
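For anyone hitting the same wall: on gp2 volumes, credit exhaustion shows up as the BurstBalance CloudWatch metric dropping to 0. A quick check with boto3 might look like this (region and volume ID are placeholders):

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")  # placeholder region
now = datetime.now(timezone.utc)

# BurstBalance is the percentage of I/O burst credits left on a gp2
# volume; a sustained 0% lines up with the sudden write slowdown above.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EBS",
    MetricName="BurstBalance",
    Dimensions=[{"Name": "VolumeId", "Value": "vol-0123456789abcdef0"}],  # placeholder
    StartTime=now - timedelta(hours=6),
    EndTime=now,
    Period=300,
    Statistics=["Average"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])
```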
If possible, you might want to switch to gp3 volumes, which allow IOPS and throughput to be provisioned independently.
I will close this for now 🙂