
Generalizing Dask-XGBoost


In many ML workloads we want to do pre-processing with Dask, load all of the data into memory, and then hand off to some other system:

  1. XGBoost
  2. LightGBM
  3. Horovod/Tensorflow
  4. Various cuML projects
  5. Dask itself in some future Actor-filled world
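
For context, the user-facing side of this pattern looks roughly like the sketch below, using the xgboost.dask interface linked further down in this issue; the input path, column names, and preprocessing steps are illustrative only:

import dask.dataframe as dd
import xgboost as xgb
from distributed import Client

client = Client()

# Arbitrary Dask preprocessing, then load everything into distributed memory
df = dd.read_parquet("data/*.parquet")  # illustrative input
df = df.dropna().persist()

# Hand the in-memory partitions off to XGBoost for training
dtrain = xgb.dask.DaskDMatrix(client, df[["x1", "x2"]], df["y"])
result = xgb.dask.train(
    client,
    {"objective": "reg:squarederror"},
    dtrain,
    num_boost_round=100,
)
booster = result["booster"]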

The Dask-XGBoost relationship does this in a few ways today:

  1. https://github.com/dask/dask-xgboost/ : we wait until all data is ready, then we query the scheduler for the location of each partition, and submit a function on each worker that grabs the local data and sends it to XGBoost (roughly the pattern sketched after this list)
  2. https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.dask : we wait until all the data is ready, then we run a function on all the workers where they just grab partitions from the data without thinking about locality. Training tends to take much longer than a full data transfer, so this is less error prone without being much slower
  3. https://github.com/dmlc/xgboost/pull/4819 : a proposed rewrite in XGBoost that is similar to option 1 above
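
To make approach 1 concrete, here is a rough sketch of the locality-aware hand-off (not the exact dask-xgboost code; train_on_parts is a made-up helper, and df is assumed to be a Dask DataFrame with a distributed Client available):

from collections import defaultdict

from distributed import futures_of, wait


def locality_aware_hand_off(client, df):
    df = df.persist()
    parts = futures_of(df)                   # one future per partition
    wait(parts)                              # block until everything is in memory

    who_has = client.who_has(parts)          # stringified key -> worker addresses
    key_to_part = {str(part.key): part for part in parts}

    worker_parts = defaultdict(list)
    for key, workers in who_has.items():
        worker_parts[workers[0]].append(key_to_part[key])

    # One task per worker, pinned to that worker and handed only the futures it
    # already holds, so no partition data moves before training starts
    futures = [
        client.submit(train_on_parts, local, workers=[addr], pure=False)
        for addr, local in worker_parts.items()
    ]
    return client.gather(futures)


def train_on_parts(parts):
    # ``parts`` arrive here as concrete pandas objects local to this worker;
    # this is where they would be handed to XGBoost
    ...

Because each per-worker task only receives futures that already live on that worker, the submit step moves no partition data; only the trained model travels back.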

The processes above work today, but there are some problems:

  1. The code within dask-xgboost to figure out where data is and then run a function on every worker uses internal APIs that ML researchers don’t understand. There is a lot of blind copy-pasting going on. If we go this route then we should maybe give them something higher level.
  2. This approach is error prone.
    1. If data moves in between querying its location and running the function, things break.
    2. If a worker dies anywhere in this process, things break.

So, here are some things that we could do:

  1. We could encode either of the approaches above into some higher level API that others could use in the future. This might make things easier to use, and also allow us to improve the internals behind the scenes in the future. It would be good to figure out what this contract would look like regardless.
  2. We could implement a few coordination primitives that would make writing code like this at a lower level more approachable. This would probably help enable more creative solutions. For example, operations like barrier or collect_local_data_that_looks_like_X might be useful (a rough barrier sketch follows below).
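
As a strawman for what such a primitive could look like, here is a rough barrier sketch built only from coordination objects that distributed already ships (Lock, Variable, Event); the object names are made up and this is not an existing API:

from distributed import Event, Lock, Variable, rejoin, secede


def barrier(name, n):
    """Block until ``n`` tasks have called barrier(name, n)."""
    done = Event(name + "-done")
    with Lock(name + "-lock"):               # serialize arrival counting
        counter = Variable(name + "-count")
        try:
            arrived = counter.get(timeout=0.1)
        except Exception:                    # no value yet: we are the first arrival
            arrived = 0
        counter.set(arrived + 1)
        if arrived + 1 == n:                 # the last arrival releases everyone
            done.set()
    secede()                                 # leave the thread pool while we wait
    done.wait()
    rejoin()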

I was doodling some pseudocode on a plane about what a solution for XGBoost might look like with some higher level primitives and came up with the following (although I don’t think that people should read too much into it).

Disorganized XGBoost ravings
import dask
import toolz
from dask.base import tokenize
from distributed import get_worker, rejoin, secede


def train_xgboost(df):
    tasks = [
        dask.delayed(train_xgboost_task)(
            part,
            n=df.npartitions,
            name="xgboost-" + tokenize(df),
        )
        for part in df.to_delayed()
    ]

    @dask.delayed
    def first_nonempty(L):
        # only the task that collected the group's data returns a result
        return toolz.first(filter(None, L))

    return first_nonempty(tasks)


def train_xgboost_task(partition, n=None, name=None):
    group_data = group_action(data=partition, n=n, name=name)

    if not group_data:  # someone else collected all of the data
        return None

    partitions = group_data

    # ... do XGBoost training on ``partitions`` here ...
    result = None  # placeholder for the trained booster

    return result


def group_action(data=None, n=None, name=None):
    worker = get_worker()

    # Send message to scheduler that a new task has checked in
    # This increments some counter that we'll check later
    # This is kind of like a Semaphore, but in reverse?
    some_semaphore.release()

    # This will have to happen in a threadsafe way, maybe on the event loop
    if name in worker.group_action_data:  # someone beat us here
        worker.group_action_data[name].append(data)
        return []  # we're not the first one here, return an empty list
    else:
        group_data = [data]
        worker.group_action_data[name] = group_data

    secede()  # leave the thread pool so that we don't block progress
    # Block until n tasks have checked in
    some_semaphore.acquire()
    rejoin()  # rejoin the thread pool

    return group_data  # this has now collected lots of partitions
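
If the pieces above existed, driving this from the client side would just be an ordinary compute call (hypothetical usage of the pseudocode, with an illustrative input path):

import dask.dataframe as dd

df = dd.read_parquet("data/*.parquet").persist()
booster = train_xgboost(df).compute()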

I think that focusing on what a good contract would look like for XGBoost, and then porting one of the existing solutions over to it, might be a helpful start.

cc @TomAugspurger @trivialfis @ogrisel @RAMitchell

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Reactions: 4
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
TomAugspurger commented, Nov 14, 2019

That client.who_has output looks a bit strange for your input, but I think things are balanced across the multiple workers. Notice that the output of each who_has call is the same: it's giving the mapping for all the futures backing ds, not just that one partition.

If you want the who_has for a single partition, you may have to do a bit more

>>> client.who_has(client.futures_of(ds)[0])
{"('from_pandas-02ae7c2929339175d14a7c1c3e7c60b2', 0)": ('tcp://127.0.0.1:59229',)}
0 reactions
mrocklin commented, Nov 14, 2019

