Erroring on workers applying method over large Dask df
Summary of issue
As a Dask dataframe scales in size, the apply operation appears to encounter errors that cause the compute() call to fail.
Caveats
I have yet to devise a method of reliably reproducing these errors. That said, I have attempted to document, in detail, the conditions under which they occur, and I have identified orderings of operations that make the errors occur more frequently.
I still suspect that this may be user error (my fault), but having exhausted all options I can readily think of, I wanted to share on GH Issues rather than SO on the off chance that there are issues outside of my/the user's control at play.
The documentation exists here, in this GDoc.
The example method exists here.
The example method can be run with the following command: time python test_dask.py {integer_desired_row_count_for_dfs}
What we are doing
If you look at the reference script, this is what is happening: given two dataframes, each holding a geometry on each row, we want to find the distance from each geometry in the left df to each geometry in the right. This is an n*n problem, in which the number of possible pairs scales quadratically with the number of rows in a given dataframe.
In our example, both dataframes have the same height. For the purposes of this example, we convert a WKT geometry into WKB format and store it on each row of a geometry column in each dataframe, one for the left and one for the right. Each row in each dataframe thus represents a unique geometry.
Next, we convert the left dataframe to a Dask df, and merge the right (regular Pandas) df onto it (this line), as per item 1 in the Joins Performance Tips (here). We perform this merge by introducing a throwaway column on both dataframes set to the same value; merging on it creates all possible relations between each row on each side, producing a new table whose height is the product of the heights of the left and right tables.
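The throwaway-column cross join can be sketched in plain Pandas (the real script then converts the left side to a Dask df and merges the Pandas df onto it; all column names below are illustrative, not from the actual script):

```python
import pandas as pd

# Small stand-in dataframes; in the real script each row carries a WKB geometry.
left = pd.DataFrame({"left_id": [1, 2, 3], "geometry_from": ["g1", "g2", "g3"]})
right = pd.DataFrame({"right_id": [10, 20], "geometry_to": ["h1", "h2"]})

# A throwaway key set to the same constant on both sides: merging on it
# produces every (left, right) pairing, i.e. len(left) * len(right) rows.
left["key"] = 0
right["key"] = 0
pairs = left.merge(right, on="key").drop(columns="key")

print(len(pairs))  # 3 * 2 = 6
```

The same pattern works when the left frame is a Dask dataframe and the right is a small Pandas frame, which is the single-partition broadcast join the Joins Performance Tips describe.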
Now that we have the tall table of unique IDs for each geometry and their WKB representations, we can apply a method over each row. We do so by first grouping by the left df’s id. This is so that the result will include distances to all other geometries from the right df, grouped by each row from the left.
Resources being used
20 workers, spread across four machines, each an m4.xlarge EC2 instance. The scheduler is provisioned on the same instance type as well.
The errors we are seeing
If you look at the reference Google spreadsheet, I've essentially started with smaller initial dataframes (e.g. 2,000 rows) and worked my way up in size. Under methodology changes, I switched between creating the Shapely objects (interpreted geometry strings) in the Pandas dataframe prior to the apply() operation or, later, within the applied method (so it takes the WKB string and converts it into a Shapely geometry object for each geometry in each row of each possible combination from the two dataframes).
Some interesting things we have noted about the errors: they are always 12 rows long in the worker logs.
Example 1 features a column (geometry_from) that should hold a WKT or WKB value (depending on the version of the script) instead holding an integer that matches the id column. It would appear that maybe, somehow, the column was swapped or misapplied (?).
Example 2 is very curious because the 11th row is only partly printed before being truncated (and this happens consistently), and yet the logs continue normally after that.
Example 3 is the one that I've anecdotally noted happens most often (though not today with this example script). This one is particularly confusing because the key error is thrown and passed through to the user in the traceback, but nothing is logged by the scheduler or any of the workers.
Example 5 occurred later in the day and only started after we switched from WKT to WKB format. I found this one extra interesting because there is no way the geometry_from column could be in WKT format in this operation: we load the geometry in and convert it to WKB format before populating the Pandas dataframe with it. So this makes me think that old jobs are somehow being confused with, or preserved alongside, current jobs? client.restart() was used to purge old processes from the cluster, so I'd be interested to learn what was happening here.
Final notes
Dask: 0.15.0, Distributed: 1.17.1, OS: Ubuntu 16.04.2 LTS. Versioning across workers/scheduler/etc. should not be an issue - we deploy Docker containers, so the environment is standardized.
Issue Analytics
- Created 6 years ago
- Comments: 19 (13 by maintainers)
Top GitHub Comments
Also, to be clear, this is a guess and not definitive. Long-GIL-holding functions have been known to cause behavior like this. That does not mean that something else is not also going on.
I gave this a quick run on my local machine. Here are some side-observations and thoughts:
The calc computation seems to take a very long time. In your shoes I might look to see if there are ways to speed it up before resorting to distributed computing.
Most relevant to your problem at hand, though, the geos operations are taking a very long time and are holding onto the GIL. This combination is hard to deal with from a concurrency perspective. It means that even though Dask runs them in separate threads, they still stop the rest of Dask's communication machinery from listening in, handling requests, etc. on other threads. While running your tasks it's as though the Dask worker completely disappears from the network, only to come back after the computation finishes.
You can observe this if you install the crick library (on conda-forge or PyPI), navigate to the diagnostic server for one of the Dask workers, and look at the Counters page. Here is a screenshot from one of my local workers.
If you look towards the lower left you'll see the "tick duration" plot. This should be a histogram tightly centered around 20ms, which is an internal heartbeat that the Dask server keeps to test for exactly this sort of situation. You'll see that the x-range goes all the way up to 6s, meaning that there were some periods of up to six seconds when Dask wasn't able to check in with itself. This has been known to cause issues in the past because various timeouts from other peers will fail out. See the following issues for possible solutions in the general case (although your case is harder):
For GIL-holding functions, though, just about the only thing we can really do on the Dask side is to massively increase timeouts. The real solution here is to get the underlying library (it looks like shapely/geos in this case) to release the GIL while calling into C.
cc'ing @pitrou, who has looked at similar issues. I don't think there is anything for him to do here explicitly, but I thought he might find seeing this problem in the wild of interest.
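As a rough sketch of the "massively increase timeouts" workaround: newer dask/distributed releases read comm timeouts from configuration, which can be set through environment variables. The exact keys depend on your distributed version, and the values below are illustrative, not recommendations:

```shell
# Hypothetical example: raise connection/TCP timeouts so workers that go
# silent while a GIL-holding task runs are not marked as failed.
# Keys follow the DASK_<section>__<key> convention of newer dask releases;
# check your version's configuration docs for the exact names.
export DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT="60s"
export DASK_DISTRIBUTED__COMM__TIMEOUTS__TCP="120s"

dask-worker tcp://scheduler:8786
```

This only papers over the symptom; as noted above, the real fix is for the C extension to release the GIL.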