
Simplify scale API // Cluster + Adaptive class hierarchy change


We currently have a heterogeneous set of implementations involving clusters and their scaling capabilities. Taking LocalCluster with adaptive scaling as an example, this involves many different but subtly similar classes:

In [1]: from distributed import LocalCluster

In [2]: LocalCluster.__mro__
Out[2]:
(distributed.deploy.local.LocalCluster,
 distributed.deploy.spec.SpecCluster,
 distributed.deploy.cluster.Cluster,
 object)

and on top of that there is the adaptive implementation in distributed.deploy.adaptive.Adaptive and distributed.deploy.adaptive_core.AdaptiveCore.

The following methods are capable of scaling:

  • SpecCluster.scale
  • Cluster.scale (NotImplemented)
  • async SpecCluster.scale_down
  • async Adaptive.scale_up
  • async Adaptive.scale_down
  • async AdaptiveCore.scale_up (NotImplementedError)
  • async AdaptiveCore.scale_down (NotImplementedError)
  • async AdaptiveCore.adapt

The only obvious and complete implementation of scale is in SpecCluster, but it has the shortcoming that it is not smart, i.e. upon downscaling it will remove random workers. Clusters which do not inherit from SpecCluster cannot benefit from manual scaling by default, but they can use adaptive scaling to a certain extent since scale_down is implemented in Adaptive. By implementing Cluster.scale_up, a cluster would be enabled for fully adaptive, smart scaling, but would still have no manual Cluster.scale.
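
For context, these are the two user-facing entry points today (LocalCluster is used purely as an illustration and the adaptive bounds are arbitrary):

    from distributed import LocalCluster

    cluster = LocalCluster(n_workers=2)

    # Manual scaling: handled by SpecCluster.scale, which does not consult
    # the scheduler when choosing which workers to remove
    cluster.scale(4)

    # Adaptive scaling: driven by Adaptive/AdaptiveCore, which asks the
    # scheduler (workers_to_close) before retiring workers
    cluster.adapt(minimum=1, maximum=10)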

There is a less obvious but smart implementation available as part of AdaptiveCore.adapt, see https://github.com/dask/distributed/blob/bd5367b925d25dc9c6ef0700294036b447a1839d/distributed/deploy/adaptive_core.py#L208-L220

which uses the method AdaptiveCore.recommendations to determine a smart scaling decision, taking into account scheduler information as well as planned but not yet observed workers. This allows for an elegant and smart scaling implementation with the only requirement that the attributes plan, requested and observed and the method workers_to_close (a default is available on the scheduler) are implemented. The attributes are already implemented on Cluster.

    async def scale(self, target):
        recommendations = await self.recommendations(target)

        if recommendations["status"] != "same":
            self.log.append((time(), dict(recommendations)))

        status = recommendations.pop("status")
        if status == "same":
            return
        if status == "up":
            await self.scale_up(**recommendations)
        if status == "down":
            await self.scale_down(**recommendations)
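
For reference, the recommendation logic this relies on boils down to roughly the following. This is a simplified sketch of AdaptiveCore.recommendations (the real method additionally tracks close counts to avoid flapping), assuming the plan/requested/observed attributes and workers_to_close behave as described above:

    async def recommendations(self, target: int) -> dict:
        plan = self.plan            # workers we have asked for
        requested = self.requested  # workers the resource manager has acknowledged
        observed = self.observed    # workers the scheduler actually sees

        if target == len(plan):
            return {"status": "same"}

        if target > len(plan):
            return {"status": "up", "n": target}

        # Scaling down: prefer closing workers that were requested but never
        # arrived, then ask the scheduler which of the remaining ones to retire
        not_yet_arrived = requested - observed
        to_close = set(list(not_yet_arrived)[: len(plan) - target])
        if target < len(plan) - len(to_close):
            to_close.update(await self.workers_to_close(target=target))
        return {"status": "down", "workers": list(to_close)}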

Moving the recommendations logic up the class hierarchy into Cluster would allow us to implement one Cluster.scale which should serve all/most use cases. This would either require us to redefine the Adaptive interface or to copy code. Since I’m a bit confused about the class hierarchies, the above-mentioned scale* methods and various minor implementations, I would like to propose a slightly new interface.

  • I would like to merge the AdaptiveCore and Adaptive classes since the core class is not functional on its own and I imagine subclassing is typically done using Adaptive.
  • All scale* methods will be implemented as part of Cluster. These methods are not supposed to implement smart logic about picking the “right” workers but are merely there to communicate with the scheduler and resource manager to do the actual scaling. How the scaling is performed should use the same logic whether adaptivity is enabled or not.
  • Cluster.scale will be implemented using the current Adaptive.recommendations + scale_{up/down}. Therefore, the recommendations method will also be part of Cluster, together with workers_to_close, which will simply query the scheduler (a rough sketch of the resulting interface follows this list).
  • adaptive = True / Cluster.adapt will simply start a PeriodicCallback (PC) which calls Cluster.scale(await Adaptive.safe_target)
  • The only user-facing API for manual scaling will be the generic Cluster.scale(target: int). Subclasses are required to at least implement scale_up(workers: List[str]).
  • That all reduces the entire Adaptive class to a single function defining the adaptive target to be fed into a PC. This is user-customisable but will default to scheduler.adaptive_target.
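
To make the proposal more tangible, here is a rough sketch of how the resulting Cluster interface could look. The PeriodicCallback wiring, the interval_ms argument and the _adaptive_target / _adaptive_callback names are illustrative assumptions, not a final design; recommendations, scale_up and scale_down would move from Adaptive into Cluster as described above:

    from tornado.ioloop import PeriodicCallback

    class Cluster:
        # plan, requested and observed are already available on Cluster;
        # recommendations, scale_up and scale_down would move here from Adaptive

        async def scale(self, target: int) -> None:
            # One code path for both manual and adaptive scaling
            recommendations = await self.recommendations(target)
            status = recommendations.pop("status")
            if status == "up":
                await self.scale_up(**recommendations)
            elif status == "down":
                await self.scale_down(**recommendations)

        async def workers_to_close(self, target: int) -> list:
            # Default: simply ask the scheduler which workers to retire
            return await self.scheduler_comm.workers_to_close(target=target)

        def adapt(self, interval_ms: int = 1000) -> None:
            # Adaptivity reduces to a periodic callback that feeds a target
            # (by default the scheduler's adaptive_target) into scale()
            async def _adapt() -> None:
                await self.scale(await self._adaptive_target())

            self._adaptive_callback = PeriodicCallback(_adapt, interval_ms)
            self._adaptive_callback.start()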

The only thing I’m not entirely sure about is the workers_to_close method since I don’t have a good feeling about whether this is actually something people want to override.

To summarise, all components to provide a generic and smart Cluster.scale implementation are there but the pieces are scattered all over the place. I believe by implementing the above suggestions, the class hierarchy would become simpler and scaling and adaptivity would become more accessible.

Thoughts?

cc @jacobtomlinson @marcosmoyano @mrocklin

Top GitHub Comments

jacobtomlinson commented, Aug 9, 2021

I think I agree that customising the scaling logic could require subclassing the cluster. That makes everything simpler. I am not aware of use cases where folks want to switch adaptive schemes on an existing cluster object.

fjetter commented, Aug 6, 2021

@abergou I think this all depends on whether or not you need to change the way you adapt at runtime. Apart from testing and such things, the most significant feature of the architecture as it is right now is that you can spin up any cluster and call Cluster.adapt(MyVeryOwnAdaptiveInstance) at runtime. This supports the sentiment of keeping dask “hackable”, but I’m not entirely sure this is a feature we actually need to be hot-swappable. After all, if you call adapt twice with different implementations, the behaviour is very likely unstable (unless you properly stop the previous instance, ensure all futures are cancelled, etc.)

My goal is to have one scale implementation which does not depend on whether I am scaling via adapt or via manual scale, and I believe there are two ways to move forward:

A) Do not allow modification of adaptive behaviour at runtime. Any modification to the scaling logic should then be implemented by subclassing Cluster instead of Adaptive. The current logic of recommendation + scale would stay the same; only the place where we define it would differ.
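
As a concrete illustration of option A under the proposed interface (the class name and closing policy below are made up), customisation would amount to subclassing the cluster manager and overriding e.g. workers_to_close, while recommendations and scale_up/scale_down remain inherited:

    from distributed import LocalCluster

    class CustomScalingCluster(LocalCluster):
        # Hypothetical policy: deterministically retire the last workers in
        # sorted-name order instead of asking the scheduler which to close
        async def workers_to_close(self, target: int) -> list:
            surplus = max(len(self.plan) - target, 0)
            return sorted(self.plan)[-surplus:] if surplus else []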

B) We allow hot-swapping of adaptive classes the way it is done right now, with one exception: the default call of Cluster.scale will merely be a proxy to an underlying Adaptive instance, for which a default will be provided. If a user registers a NewAdaptive instance, the old one will be properly stopped and replaced. All subsequent calls of Cluster.scale will then forward to NewAdaptive.

I’m favouring A) since scaling is functionality I closely associate with what the Cluster abstraction is supposed to be doing. The Cluster objects are “Dask Cluster manager classes” and maybe the most important management functionality is to start and stop servers. For instance, I would expect a Cluster to be the source of truth for the knowledge of how to scale_up a server node.

Anyhow, nothing is set in stone and I would appreciate some feedback since I am actually not the user of these things 😃
