Generalizing Dask-XGBoost
In many ML workloads we want to do pre-processing with Dask, load all of the data into memory, and then hand it off to some other system:
- XGBoost
- LightGBM
- Horovod/Tensorflow
- Various cuML projects
- Dask itself in some future Actor-filled world
The Dask-XGBoost relationship does this in a few ways today:
- https://github.com/dask/dask-xgboost/ : we wait until all data is ready, then we query the scheduler for the location of each partition, and submit a function on each worker that grabs the local data and sends it to XGBoost (a rough sketch of this pattern follows this list)
- https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.dask : we wait until all the data is ready, then we run a function on every worker that grabs partitions of the data without regard for locality. Training tends to take much longer than a full data transfer, so this is less error prone without being much slower.
- https://github.com/dmlc/xgboost/pull/4819 : a proposed rewrite in XGBoost that is similar to option 1 above
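As a rough illustration of the first approach, here is a minimal sketch of the wait / `who_has` / submit-per-worker pattern. This is not the actual dask-xgboost code: `train_on_local_parts` and `submit_per_worker` are hypothetical names, and the Rabit tracker setup that real training needs is omitted.

```python
# A minimal sketch of approach 1, assuming a running distributed Client.
# `train_on_local_parts` is a hypothetical stand-in for the real hand-off
# (building a DMatrix and calling xgb.train inside a Rabit session).
from collections import defaultdict

from distributed import futures_of, wait


def train_on_local_parts(parts):
    # Runs on one worker with the partitions that live there
    return len(parts)  # placeholder for actual XGBoost training


def submit_per_worker(client, df):
    df = client.persist(df)
    futures = futures_of(df)   # one future per partition
    wait(futures)              # wait until all data is in memory

    # Ask the scheduler where each partition lives (a real implementation
    # would batch this into a single who_has call)
    by_worker = defaultdict(list)
    for fut in futures:
        (addresses,) = client.who_has([fut]).values()
        by_worker[addresses[0]].append(fut)

    # Run one task per worker, pinned to that worker's address
    return [
        client.submit(train_on_local_parts, parts, workers=[addr], pure=False)
        for addr, parts in by_worker.items()
    ]
```

This is exactly the fragile window described below: if a partition moves or a worker dies between the `who_has` query and the pinned `submit`, the hand-off breaks.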
The processes above work today, but there are some problems:
- The code within dask-xgboost to figure out where data is and then run a function on every worker uses internal APIs that ML researchers don’t understand. There is a lot of blind copy-pasting going on. If we go this route then we should maybe give them something higher level.
- This approach is error prone.
- If data moves in between querying its location and running the function, things break.
- If a worker dies anywhere in this process, things break.
So, here are some things that we could do:
- We could encode either of the approaches above into some higher level API that others could use in the future. This might make things easier to use, and would also let us improve the internals behind the scenes later. It would be good to figure out what this contract would look like regardless (a hypothetical signature is sketched after this list).
- We could implement a few coordination primitives that would make writing code like this at a lower level more approachable. This would probably enable more creative solutions. For example, operations like `barrier` or `collect_local_data_that_looks_like_X` might be useful.
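To make the first bullet concrete, here is a hypothetical sketch of what such a higher level contract could look like. None of these names exist in dask or distributed; it is only a shape for discussion.

```python
# Hypothetical contract only -- nothing here exists in dask or distributed today.
from typing import Callable, Dict, List

import dask.dataframe as dd


def run_on_data_holders(
    df: dd.DataFrame,
    func: Callable[[List], object],
) -> Dict[str, object]:
    """Persist ``df``, wait for it (a barrier), then call ``func(local_partitions)``
    once on every worker that holds at least one partition.

    Returns a mapping from worker address to the result of ``func``.  A real
    implementation would also have to decide what happens when partitions move
    or workers die mid-call (retry, re-collect, or raise).
    """
    raise NotImplementedError("contract sketch only; see the pseudocode below")
```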
I was doodling some pseudocode on a plane about what a solution for XGBoost might look like with some higher level primitives and came up with the following (although I don’t think that people should read too much into it).
Disorganized XGBoost ravings
```python
# Pseudocode only: `some_semaphore`, `result`, and `worker.group_action_data`
# are placeholders for primitives that do not exist yet.
import dask
import toolz
from dask.base import tokenize
from distributed import get_worker, rejoin, secede


def train_xgboost(df):
    tasks = [
        dask.delayed(train_xgboost_task)(
            part,
            n=df.npartitions,
            name="xgboost-" + tokenize(df),
        )
        for part in df.to_delayed()
    ]

    @dask.delayed
    def first_nonempty(L):
        return toolz.first(filter(None, L))

    return first_nonempty(tasks)


def train_xgboost_task(partition, n=None, name=None):
    group_data = group_action(data=partition, n=n, name=name)
    if not group_data:  # Someone else collected all of the data
        return None

    partitions = group_data
    # Do XGBoost training on `partitions` here
    return result


def group_action(data=None, n=None, name=None):
    worker = get_worker()

    # Send a message to the scheduler that a new task has checked in.
    # This increments some counter that we'll check later.
    # This is kind of like a Semaphore, but in reverse?
    # It would have to happen in a threadsafe way, maybe on the event loop.
    some_semaphore.release()

    if name in worker.group_action_data:  # someone beat us here
        worker.group_action_data[name].append(data)
        return []  # we're not the first one here, return an empty list
    else:
        group_data = [data]
        worker.group_action_data[name] = group_data

        secede()  # leave the thread pool so that we don't block progress
        some_semaphore.acquire()  # block until n tasks have checked in
        rejoin()  # rejoin the thread pool

        return group_data  # this has now collected lots of partitions
```
I think that focusing on what a good contract would look like for XGBoost, and then copying over one of the solutions for that, might be a helpful start.
cc @TomAugspurger @trivialfis @ogrisel @RAMitchell
Top GitHub Comments
That `client.who_has` output looks a bit strange for your input, but I think things are balanced across the multiple workers. Notice that the output of each `who_has` call is the same: it's giving the mapping for all of the futures backing `ds`, not just that one partition.

If you want the `who_has` for a single partition, you may have to do a bit more.

See https://github.com/dask/distributed/pull/3236
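For reference, here is a small sketch of the "a bit more" mentioned above, assuming a local `Client` and a throwaway DataFrame (not the `ds` from the discussion): it asks `who_has` for each partition's future individually rather than for the whole collection.

```python
# Per-partition who_has, assuming a local cluster; the DataFrame is a stand-in.
import pandas as pd
import dask.dataframe as dd
from distributed import Client, futures_of, wait

client = Client()
ddf = dd.from_pandas(pd.DataFrame({"x": range(100)}), npartitions=4).persist()

futures = futures_of(ddf)   # one future per partition
wait(futures)

for i, fut in enumerate(futures):
    (addresses,) = client.who_has([fut]).values()  # workers holding just this partition
    print(f"partition {i}: {addresses}")
```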