Example using dask.array.learn on non-trivial data

We built dask.array.learn a long while ago (https://github.com/blaze/dask/pull/138).

It’s a fairly trivial pairing of sklearn and dask.array (and, honestly, something that sklearn handles just fine with a for loop). Still, more established libraries show off this kind of thing, so we might as well have a clear user example of how to combine a large HDF5 file with an sklearn classifier using dask.array, probably something like the following:

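# MyFavoriteEstimator is a placeholder for an sklearn estimator that supports partial fitting
from sklearn import MyFavoriteEstimator
e = MyFavoriteEstimator()

import h5py
X = h5py.File(...)['/X']  # features
y = h5py.File(...)['/y']  # labels

import dask.array as da
X = da.from_array(X, chunks=(...))
y = da.from_array(y, chunks=(...))

da.learn.fit(e, X, y)  # relies on the estimator supporting partial fitting
e.predict(new_X)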
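
For comparison, the plain for loop mentioned above might look roughly like the sketch below. This is not how dask.array.learn is implemented; it only illustrates feeding row blocks to an estimator's partial_fit. The helper name fit_in_blocks, the synthetic data, and the choice of SGDClassifier are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


def fit_in_blocks(estimator, X, y, block_rows, classes):
    """Hypothetical helper: call partial_fit on consecutive row blocks."""
    for start in range(0, X.shape[0], block_rows):
        stop = start + block_rows
        # With an h5py dataset, this slice reads only the requested rows.
        estimator.partial_fit(X[start:stop], y[start:stop], classes=classes)
    return estimator


# Synthetic stand-in for a large on-disk dataset (assumption, not real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = fit_in_blocks(
    SGDClassifier(random_state=0),
    X,
    y,
    block_rows=1_000,
    classes=np.array([0, 1]),
)
accuracy = model.score(X, y)
print(f"training accuracy: {accuracy:.3f}")
```

Passing classes up front matters because no single block is guaranteed to contain every label.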

The hard part here is finding a large, meaningful dataset and an sklearn estimator that supports partial fitting. The result should then be publicized in some way, perhaps both in the documentation and in a blog post or notebook.

Issue Analytics

  • State: closed
  • Created 8 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
shoyer commented, Oct 30, 2015

Dictionary learning with a large stack of images would be an interesting application.

0 reactions
mrocklin commented, Apr 26, 2016
