Scaling issues in karton - thread
Hey guys,
we have been using the karton system for a while now, and we have reached a point where scaling the analysis throughput has become a real issue.
Each time the total number of tasks in redis goes above ~500k (the specific number isn't important, it depends on the environment), the entire system eventually collapses: it starts with `karton-system` constantly crashing and `karton-dashboard` not responding, and from there the road to chaos is pretty short.
Just to clarify our situation - the entire infrastructure is hosted as containers on AWS ECS (all kartons, including `karton-system`), and we use a pretty strong managed redis server (I don't think its computation power, storage size, or connection capacity is the bottleneck).
I investigated the causes of these crashes and issues, and learned a few things I would like to share and discuss:
- The `karton-system` component is a crucial part of the karton framework, and it is currently also a single point of failure for the entire system. This is an extremely important point which I will come back to later.
- Currently, the `karton-system` functionality can only be partially scaled out - several instances of it would consume `karton.tasks` and `karton.operations` properly, but garbage collection wouldn't scale efficiently. This is because every garbage collection run executes `tasks = self.backend.get_all_tasks()` and performs collection over this data, which is shared with the other instances of `karton-system`. I think this component must be made properly scalable.
- The root cause of the system collapsing under high workload is `tasks = self.backend.get_all_tasks()` during garbage collection, or more precisely `for task_data in self.redis.mget(tasks)` inside `get_all_tasks()`. As the task queue grows, each `redis.mget(<all_tasks>)` becomes more computation-intensive and slows down the handling of karton operations. This makes the queues grow even more, and eventually `karton-system` goes from one GC call straight into the next. In the end, in our case, `karton-system` is killed by the OS due to out-of-memory, but even when it survives, it isn't operational.
- In the short term, I would reimplement `get_all_tasks()` to return a generator instead of a list, querying one chunk of tasks at a time (10k tasks, for example). This would solve some of the stability issues (including the out-of-memory ones). It is quite trivial, but I can explain if needed - see the sketch after this list.
- In the long term, we can still hit situations where karton operations are starved by constant garbage collection. I think we need to find a way to prioritize them somehow, or make GC a more "lightweight" operation (maybe a redis list of GC actions? or part of the `karton.operations` list?).
- `karton-dashboard` - we find this tool extremely valuable for monitoring the workloads. Unfortunately, it becomes non-responsive with high queues due to the same `get_all_tasks()` call on every GET request. Our typical usage of `karton-dashboard` is checking the sizes of the queues and cancelling/reproducing crashed tasks, so we rarely look at task contents. I would be glad to propose improvements to its responsiveness through a PR if you don't have any plans to do so.
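For reference, here is a minimal sketch of the short-term idea above: a generator-based `get_all_tasks()` that scans task keys and fetches them in bounded chunks instead of one huge `mget`. It assumes a plain redis-py client; the `karton.task:` key prefix and the helper names are assumptions for illustration, not necessarily karton's actual internals.

```python
import json
from typing import Iterator, List

import redis

CHUNK_SIZE = 10_000  # fetch tasks in bounded chunks instead of one giant MGET

def iter_all_tasks(r: redis.Redis, key_prefix: str = "karton.task:") -> Iterator[dict]:
    """
    Yield task bodies chunk by chunk so the caller never holds every task
    in memory at once. `key_prefix` is an assumption about the key layout.
    """
    chunk: List[bytes] = []
    # SCAN keeps the key listing incremental instead of materializing all keys.
    for key in r.scan_iter(match=key_prefix + "*", count=CHUNK_SIZE):
        chunk.append(key)
        if len(chunk) >= CHUNK_SIZE:
            yield from _fetch_chunk(r, chunk)
            chunk = []
    if chunk:
        yield from _fetch_chunk(r, chunk)

def _fetch_chunk(r: redis.Redis, keys: List[bytes]) -> Iterator[dict]:
    for task_data in r.mget(keys):
        # A key may expire between SCAN and MGET, so tolerate missing values.
        if task_data is not None:
            yield json.loads(task_data)
```

Garbage collection could then iterate over `iter_all_tasks()` without ever holding more than one chunk in memory; the same chunking would also help `karton-dashboard` stay responsive, since it would no longer need to load every task body just to display queue sizes.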
Thanks again for all the effort you put into this system; we have managed to achieve great things with it so far. @psrok1 @nazywam
Top GitHub Comments
After an analysis I think that `karton.operations` is not needed. Task ownership is already guaranteed, and that deferred task status change only complicates things. But we can still handle it in `karton-system` to keep things backwards compatible.

I have tried to make routing as lightweight as possible and to introduce pipelined operations in routing in https://github.com/CERT-Polska/karton/pull/146. It's not merged yet with @msm-code's optimizations though.
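To make the "pipelined operations in routing" idea concrete (the linked PR may do this differently), here is a hedged sketch using a redis-py pipeline: the status update and the push onto the consumer queue for a whole batch of routing decisions go out in a single round-trip. The key names, entry structure, and status value are placeholders, not karton's actual schema.

```python
import json
import time
from typing import Any, Dict, List

import redis

def route_tasks_pipelined(r: redis.Redis, routed: List[Dict[str, Any]]) -> None:
    """
    Apply a whole batch of routing decisions in a single round-trip.
    Each entry is assumed to look like (placeholder structure):
      {"uid": <task uid>, "queue": <consumer queue key>, "task": <task body as dict>}
    """
    pipe = r.pipeline(transaction=False)
    now = time.time()
    for entry in routed:
        task_key = f"karton.task:{entry['uid']}"                       # assumed key layout
        task = dict(entry["task"], status="routed", last_update=now)   # placeholder status value
        pipe.set(task_key, json.dumps(task))                           # persist the updated task
        pipe.rpush(entry["queue"], entry["uid"])                       # enqueue for the consumer
    # One network round-trip for the whole batch instead of two per task.
    pipe.execute()
```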
Hey guys
I ran a stress test on our staging environment over the last few hours and gathered a few insights regarding the scaling issue.
The setup:
- `karton-system` on a strong server (4 vCPU / 16 GB RAM)

My criterion for not handling the load is the size of two queues, `karton.tasks` and `karton.operations`. To explain - even if the individual kartons accumulate enormous queues, I still expect `karton.tasks` and `karton.operations` to stay small or empty and remain stable. Of course, those enormous queues will eventually backfire through long GC times or OOM issues for redis.

Observations:
- The `karton.tasks` queue stayed quite small (maybe <1% of all tasks in redis).
- `karton.operations`, however, started to pile up very fast, even in the early stages of the stress test. We are receiving many tasks, and `karton-system` only starts to clean `karton.operations` once `karton.tasks` is empty - see the illustration right after this list.
- The operations handling (`karton.operations`) is more crucial than I thought, and even a pretty strong server running only `karton-system` can't withstand the load we aim to achieve in the production environment.
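For context, the behavior described above is what you get from a consume loop that always drains the new-task queue before touching the operations queue. The sketch below only illustrates that pattern under sustained load; it is not `karton-system`'s actual code, and the handlers are hypothetical stubs.

```python
import time

import redis

def handle_new_task(raw: bytes) -> None:
    ...  # routing of a new task would happen here (hypothetical stub)

def handle_operation(raw: bytes) -> None:
    ...  # status/timestamp update would happen here (hypothetical stub)

def consume_loop(r: redis.Redis) -> None:
    """Illustration only: new tasks always win over deferred operations."""
    while True:
        raw = r.lpop("karton.tasks")        # new, unrouted tasks are drained first
        if raw is not None:
            handle_new_task(raw)
            continue
        raw = r.lpop("karton.operations")   # reached only when karton.tasks is empty
        if raw is not None:
            handle_operation(raw)
            continue
        time.sleep(1)                       # both queues empty - back off briefly
```

Under constant incoming traffic `karton.tasks` is rarely empty, so the operations queue keeps growing - which matches the pile-up observed in the stress test.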
Thoughts:
I had several thoughts from this experiment:
- Prioritization of `karton.tasks` over `karton.operations`: in my reasoning, `karton.tasks` holds our new, unrouted tasks, and I would prefer the current operations to be finished before introducing any new tasks into the system. Preferring `karton.tasks` is blocking future processing, reducing potential queues and potential GC.
- Putting only the status update in the `karton.operations` list, without the whole task. In the end we are updating only the status and the timestamp, yet we save the complete task for it. This could spare some redis memory - see the sketch at the end of this comment.
- Splitting `karton-system` into GC and routing is just one way to both scale it out and stop the starvation caused by GC, but I'm sure there are more options to solve it.

Hope it was clear, and I would love comments from you guys @msm-code @psrok1 @nazywam
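A hedged sketch of the "operations entry without the whole task" idea above: producers push only the fields that change (uid, status, timestamp), and the router applies them to the stored task. The key layout and field names are assumptions for illustration, not karton's real format.

```python
import json
import time

import redis

def push_operation(r: redis.Redis, task_uid: str, status: str) -> None:
    # Push only the fields that change - uid, new status, timestamp -
    # instead of serializing the complete task into karton.operations.
    op = {"uid": task_uid, "status": status, "time": time.time()}
    r.rpush("karton.operations", json.dumps(op))

def apply_operation(r: redis.Redis, raw_op: bytes) -> None:
    # Applied by the router: load the stored task, patch status/timestamp, save.
    op = json.loads(raw_op)
    task_key = f"karton.task:{op['uid']}"   # assumed key layout, for illustration
    raw_task = r.get(task_key)
    if raw_task is None:
        return                              # task already gone (e.g. garbage-collected)
    task = json.loads(raw_task)
    task["status"] = op["status"]
    task["last_update"] = op["time"]
    r.set(task_key, json.dumps(task))
```

Besides the memory savings, the compact entries would also make draining a large `karton.operations` backlog much cheaper, since each item is a few dozen bytes instead of a full serialized task.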