Scaling issues in karton - thread

Hey guys, we have been using the karton system for a while now, and we have reached a point where scaling the analysis throughput became a real issue. Each time the total number of tasks in redis climbs above roughly 500k (the specific number isn’t important, it depends on the environment), the entire system eventually collapses: it starts with karton-system constantly crashing and karton-dashboard not responding, and from there the road to chaos is pretty short.

Just to clarify our situation - the entire infrastructure is hosted as containers on AWS ECS (all kartons, including karton-system), and we use a pretty strong managed redis server (I don’t think its compute power, storage size, or connection capacity is the bottleneck).

I investigated the causes of these crashes and issues, and learned a few things I would like to share and discuss:

  1. The karton-system component is a crucial part of the karton framework, and currently it is also a single point of failure for the entire system. This is an extremely important point that I will come back to later.
  2. Currently, karton-system can only be partially scaled out - several instances will consume karton.tasks and karton.operations properly, but garbage collection does not scale efficiently. This is because each garbage-collection pass runs tasks = self.backend.get_all_tasks() and performs collection over that data, which is shared with the other karton-system instances. I think this component must be made properly scalable.
  3. The root cause of the crashes under high workload is tasks = self.backend.get_all_tasks() during garbage collection, or more precisely for task_data in self.redis.mget(tasks) inside get_all_tasks(). As the task queue grows, each redis.mget(<all_tasks>) becomes more computation-intensive and slows down the handling of karton operations. This makes the queues grow even further, and eventually karton-system goes from one GC pass straight into the next. In the end, in our case, karton-system is killed by the OS due to out-of-memory, but even when it survives, it isn’t operational.
  4. In the short term, I would reimplement get_all_tasks() to return a generator instead of a list, querying tasks in chunks (10k tasks at a time, for example). This would solve some of the stability issues (including the out-of-memory ones) - it is quite trivial, but I can explain if needed; a sketch is included right after this list.
  5. In the long term, we can still run into situations where karton operations are starved by constant garbage collection. I think we need to find a way to prioritize them, or to make GC a more “lightweight” operation (maybe a redis list of GC actions? or part of the karton.operations list?).
  6. karton-dashboard - we find this tool extremely valuable for monitoring the workloads. Unfortunately, it becomes unresponsive with large queues due to the same get_all_tasks() call on every GET request. Our usage of karton-dashboard is mostly about checking queue sizes and cancelling/reproducing crashed tasks, so we rarely look at task contents. I would be glad to propose improvements to its responsiveness in a PR if you don’t have any plans to do so.
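
For the short-term fix in point 4, here is a minimal sketch of what a chunked, generator-based iteration could look like. It assumes the backend stores tasks under a karton.task:<uid> key pattern and serializes them as JSON; the key pattern and helper names are illustrative, not karton's actual internals.

```python
import json
from typing import Iterator

import redis


def iter_all_tasks(r: redis.Redis, chunk_size: int = 10_000) -> Iterator[dict]:
    """Yield task documents chunk by chunk instead of one giant MGET."""
    cursor = 0
    while True:
        # SCAN walks the keyspace incrementally, so redis never has to build
        # a single huge reply covering every task at once.
        cursor, keys = r.scan(cursor, match="karton.task:*", count=chunk_size)
        if keys:
            for task_data in r.mget(keys):
                if task_data is not None:  # key may vanish between SCAN and MGET
                    yield json.loads(task_data)
        if cursor == 0:
            break


# Garbage collection can then stream over tasks with bounded memory:
# for task in iter_all_tasks(redis.Redis(decode_responses=True)):
#     maybe_collect(task)   # maybe_collect() is a hypothetical GC helper
```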

Thanks again for all the effort you put into this system, we have managed to achieve great things with it so far. @psrok1 @nazywam

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 6
  • Comments: 15 (10 by maintainers)

Top GitHub Comments

psrok1 commented on Dec 30, 2021 · 3 reactions

> Unfortunately it’s not possible yet, but @psrok1 is thinking about removing karton.operations queue to solve this

After some analysis I think that karton.operations is not needed. Task ownership is already guaranteed, and the deferred task status change only complicates things. But we can still handle it in karton-system to keep things backwards compatible.

I have tried to make routing as lightweight as possible and to introduce pipelined operations into routing in https://github.com/CERT-Polska/karton/pull/146. It hasn’t been merged with @msm-code’s optimizations yet, though.
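
For illustration, pipelining with redis-py looks roughly like the sketch below. This is not the code from the linked PR; the key layout, queue names, and serialization are placeholders chosen only to show how the routing fan-out can be collapsed into a single round trip.

```python
import json

import redis


def route_task(r: redis.Redis, task_uid: str, task_body: dict, consumer_queues: list) -> None:
    """Fan a routed task out to its consumer queues in one round trip."""
    pipe = r.pipeline(transaction=False)
    # Persist the task body once...
    pipe.set(f"karton.task:{task_uid}", json.dumps(task_body))
    # ...then enqueue its uid for every matching consumer.
    for queue in consumer_queues:
        pipe.rpush(queue, task_uid)
    # A single network round trip instead of 1 + len(consumer_queues).
    pipe.execute()
```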

alex-ilgayev commented on Dec 28, 2021 · 3 reactions

Hey guys

I ran a stress test on our staging environment over the last few hours and gathered a few insights regarding the scaling issue.

The setup:

  • Three kartons - karton-classifier, karton-config-extractor, and karton-mwdb-reporter.
  • A small managed redis server on AWS (t2.small).
  • AWS S3 as the MinIO-compatible object storage.
  • The same file was produced repeatedly and forced through all three kartons; each produced task creates a separate resource on S3 (a producer sketch follows this list).
  • A production rate of ~40-70 tasks per second with a few pauses (to simulate a real-world scenario).
  • karton-system on a strong server (4 vCPU / 16 GB RAM).
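
A minimal sketch of a producer loop along these lines, using karton.core. The headers, file name, rate, and config path are assumptions; this is not the actual test harness.

```python
import time

from karton.core import Config, Producer, Resource, Task

producer = Producer(Config("karton.ini"))  # config path is a placeholder

with open("sample.exe", "rb") as f:        # placeholder sample file
    data = f.read()

for _ in range(10_000):
    # A fresh Resource per task means a separate object uploaded to S3/MinIO.
    task = Task(
        headers={"type": "sample", "kind": "raw"},
        payload={"sample": Resource("sample.exe", content=data)},
    )
    producer.send_task(task)
    time.sleep(1 / 50)  # ~50 tasks/s, within the 40-70/s band used in the test
```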

My criterion for “not handling the load” is the size of two queues: karton.tasks and karton.operations. To explain: even if the individual kartons accumulate enormous queues, I expect karton.tasks and karton.operations to stay small or empty and remain stable. Of course, those enormous per-karton queues will eventually backfire through long GC times or OOM issues for redis.
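
The check itself can be as simple as polling the two list lengths (a sketch; the 10k threshold below is arbitrary):

```python
import time

import redis

r = redis.Redis()
while True:
    tasks_backlog = r.llen("karton.tasks")
    ops_backlog = r.llen("karton.operations")
    print(f"karton.tasks={tasks_backlog}  karton.operations={ops_backlog}")
    if ops_backlog > 10_000:  # arbitrary threshold for "routing is falling behind"
        print("karton.operations is piling up - routing is not keeping pace")
    time.sleep(10)
```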

Observations:

  • Most of the time, until we reached a really high number of tasks, the karton.tasks queue stayed quite small (maybe <1% of all tasks in redis).
  • karton.operations, however, started to pile up very fast, even in the early stages of the stress test. Many tasks keep arriving, and karton-system only starts draining karton.operations once karton.tasks is empty.
  • On the one hand, I confirmed (through the numbers below) that with a high number of tasks GC processing becomes problematic; on the other hand, I discovered that scaling the processing of tasks (by consuming karton.operations) is more crucial than I thought, and even a pretty strong server running only karton-system can’t withstand the kind of load we aim for in our production environment.
  • The time it takes to GC tasks and the time it takes to GC resources are almost equal.

Numbers:

No. of tasks | GC pass length | Size of karton.operations | Redis size (after GC)
100k         | 63 s           | 124 MB                    | 270 MB
140k         | 68 s           | 141 MB                    | 327 MB
196k         | 85 s           | 160 MB                    | 432 MB
236k         | 103 s          | 223 MB                    | 542 MB
290k         | 111 s          | 276 MB                    | 630 MB

Thoughts:

I had several thoughts from this experiment:

  • I guess you had reasons for prioritizing karton.tasks over karton.operations. In my reasoning, karton.tasks holds our new, unrouted tasks, and I would prefer the current operations to finish before introducing any new tasks into the system. Deferring karton.tasks throttles the introduction of new work, reducing potential queue growth and potential GC.
  • If we keep GC and routing as the same piece of code, we definitely need to be able to control the GC interval and set it to more than 3 minutes.
  • Optional - I wondered whether karton.operations entries could be more minimal, without the whole task. In the end we only update the status and timestamp, yet we store the complete task for it. That could save some redis memory (see the sketch after this list).
  • This experiment has shown that we must have some method for scalable routing. Splitting karton-system into GC and routing is just one way to both scale it and stop the starvation caused by GC, but I’m sure there are more options.
  • In the end, there is no escape from redesigning it, as we mentioned previously.
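
To make the “minimal karton.operations entry” idea from the list above concrete, here is a rough sketch. The field names and key layout are illustrative, not karton’s actual wire format; the point is only that a status update needs the uid, the new status, and a timestamp rather than the whole serialized task.

```python
import json
import time

import redis

r = redis.Redis()


def record_status_change(task_uid: str, status: str) -> None:
    """Producer side: push a tiny status-change record, not the whole task."""
    entry = {"uid": task_uid, "status": status, "last_update": time.time()}
    r.rpush("karton.operations", json.dumps(entry))  # ~100 bytes per entry


def apply_status_changes(batch_size: int = 1000) -> None:
    """karton-system side: drain entries and patch only status/last_update."""
    for _ in range(batch_size):
        raw = r.lpop("karton.operations")
        if raw is None:
            break
        entry = json.loads(raw)
        key = f"karton.task:{entry['uid']}"
        task_data = r.get(key)
        if task_data is None:
            continue  # the task was already garbage-collected
        task = json.loads(task_data)
        task["status"] = entry["status"]
        task["last_update"] = entry["last_update"]
        r.set(key, json.dumps(task))
```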

Hope it was clear; would love comments from you guys @msm-code @psrok1 @nazywam
