Dask Client Computing Prior to Calling .compute()

Hi,

I’m having a problem that is almost surely a mistake on my part, but I haven’t been able to sort it out so I’m posting it as a bug here. I’m running the following code:

from dask import dataframe as dd
from dask.distributed import Client
client = Client()
import webbrowser
webbrowser.open("http://localhost:8787/status")

Next, I run data = dd.read_csv('data.csv')  # 12GB file. The call returns immediately, but a computation that takes a few minutes then begins on the client. Then I run data = data[data['X'] <= 180]. This second command triggers two computations on the client, and the first looks identical to the one that ran after the read_csv line, so that computation appears to be happening twice. Am I doing something obviously wrong here? I have many more commands to follow, but I don’t want any of them executed until the very end, when I call .compute() on a result that returns a Pandas DataFrame small enough to fit in my RAM.

Thanks and sorry if this is a dumb question!

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 10 (4 by maintainers)

Top GitHub Comments

1 reaction
timhdesilva commented, May 25, 2021

Will do, thanks @jrbourbeau!

0 reactions
jrbourbeau commented, May 25, 2021

Thanks for following up @timhdesilva. Could you open up a GitHub issue over with the Spyder folks since this appears to be Spyder-related and not an issue with Dask itself?
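
For reference, here is a minimal sketch of the lazy behavior the question expects, which is consistent with the thread's conclusion that the early computation came from the Spyder environment rather than from Dask itself. It uses dask.delayed with two hypothetical stand-in functions (load_values and keep_at_most are illustrative names, not part of the original code) in place of read_csv and the filter, and a plain list to record when each step actually runs. The synchronous scheduler is an assumption made to keep the example small and deterministic; the original post used a distributed Client.

```python
import dask

# Records every time a task body actually runs, so we can see when work happens.
calls = []


@dask.delayed
def load_values():
    # Hypothetical stand-in for dd.read_csv('data.csv').
    calls.append("load")
    return list(range(0, 400, 20))


@dask.delayed
def keep_at_most(values, limit):
    # Hypothetical stand-in for data[data['X'] <= 180].
    calls.append("filter")
    return [v for v in values if v <= limit]


data = load_values()                 # builds a task, runs nothing
filtered = keep_at_most(data, 180)   # extends the task graph, still runs nothing
calls_before_compute = list(calls)   # snapshot taken before any compute call

# Work happens only here. The synchronous scheduler runs the graph in-process.
result = filtered.compute(scheduler="synchronous")
```

In a plain Python session, calls_before_compute should be empty and each step should run exactly once inside compute(). If work shows up earlier in an interactive environment, it may be worth checking whether the environment itself is evaluating the objects.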
