Avoid compute in datasets

The compute call at https://github.com/dask/dask-ml/blob/d5801584d092d8f13f1b38aaf4da5dc3caa6a213/dask_ml/datasets.py#L332 isn’t great, especially in settings like Hyperband (#221) that use the distributed scheduler.

We could probably replace

    rng = dask_ml.utils.check_random_state(random_state)

with

    rng = sklearn.utils.check_random_state(random_state)

and use that NumPy-backed generator to draw

  1. informative_idx
  2. random data to seed the dask.array.random.RandomState that eventually generates the large random data.
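
A minimal sketch of the idea above, not the actual dask-ml code: the small draws happen eagerly with a NumPy generator, and only the large array goes through dask, where it stays lazy. The function name `draw_local_then_lazy` and its parameters and defaults are hypothetical and chosen for illustration only.

```python
import dask.array as da
import numpy as np
from sklearn.utils import check_random_state


def draw_local_then_lazy(n_samples=1000, n_features=20, n_informative=5,
                         chunks=250, random_state=None):
    """Hypothetical helper: cheap draws in NumPy, large data as a lazy dask array."""
    # sklearn's check_random_state returns a numpy.random.RandomState,
    # so drawing from it never touches the dask scheduler.
    rng = check_random_state(random_state)

    # 1. Pick the informative columns locally (small, in-memory).
    informative_idx = rng.choice(n_features, n_informative, replace=False)

    # 2. Draw an integer locally and use it to seed the dask RandomState.
    seed = rng.randint(0, np.iinfo(np.int32).max)
    dask_rng = da.random.RandomState(seed)

    # The large array is only a task graph here; nothing is computed yet.
    X = dask_rng.normal(0, 1, size=(n_samples, n_features),
                        chunks=(chunks, n_features))
    return X, informative_idx


X, informative_idx = draw_local_then_lazy(random_state=0)
```

Because the seed for the dask generator comes from the NumPy generator, passing the same `random_state` should still give reproducible data, and no `compute()` is needed while the dataset is being built.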

Issue Analytics

  • State: open
  • Created 5 years ago
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

1 reaction
TomAugspurger commented, Oct 14, 2019

I meant the entire test suite, since other tests use it.

0 reactions
dma092 commented, Oct 14, 2019

If all the tests pass then that should be fine.

Are you talking about only the tests in test_datasets.py?
