Erroring on workers applying method over large Dask df
Summary of issue
As a Dask dataframe scales in size, the apply operation appears to encounter errors that cause the compute() call to fail.
Caveats
I have yet to devise a method of reliably reproducing these errors. That said, I have attempted to document, in detail, the conditions under which they occur, and I have identified orderings of operations that make the errors occur more frequently.
I still suspect that this may be user error (my fault), but having exhausted all options I can readily think of, I wanted to share on GH Issues rather than SO on the off chance that there are issues outside of my/the user's control at play.
The documentation exists here, in this GDoc.
The example method exists here.
The example method can be run with the following command: time python test_dask.py {integer_desired_row_count_for_dfs}
What we are doing
If you look at the reference script, this is what is happening: given two dataframes, each holding a geometry on each row, we want to find the distance from each geometry in the left df to each geometry in the right. This is an n*n problem, in which the number of possible pairs scales quadratically with the number of rows in a given dataframe.
In our example, both dataframes have the same height. For the purposes of this example, we convert a WKT geometry into WKB format and store it on each row of a geometry column in each dataframe, one for the left and one for the right. Each row in each dataframe thus represents a unique geometry.
Next, we convert the left dataframe to a Dask df, and merge the right (regular Pandas) df onto it (this line), as per item 1 in the Joins Performance Tips (here). We perform this merge by introducing a throwaway column on both dataframes set to the same value; merging on it creates all possible relations between each row on each side, producing a new table whose height is the product of the heights of the left and right tables.
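The throwaway-column cross join can be sketched in plain Pandas (the real script then converts the left side to a Dask df and merges the Pandas df onto it; all column names below are illustrative, not from the actual script):

```python
import pandas as pd

# Small stand-in dataframes; in the real script each row carries a WKB geometry.
left = pd.DataFrame({"left_id": [1, 2, 3], "geometry_from": ["g1", "g2", "g3"]})
right = pd.DataFrame({"right_id": [10, 20], "geometry_to": ["h1", "h2"]})

# A throwaway key set to the same constant on both sides: merging on it
# produces every (left, right) pairing, i.e. len(left) * len(right) rows.
left["key"] = 0
right["key"] = 0
pairs = left.merge(right, on="key").drop(columns="key")

print(len(pairs))  # 3 * 2 = 6
```

The same pattern works when the left frame is a Dask dataframe and the right is a small Pandas frame, which is the single-partition broadcast join the Joins Performance Tips describe.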
Now that we have the tall table of unique IDs for each geometry and their WKB representations, we can apply a method over each row. We do so by first grouping by the left df’s id. This is so that the result will include distances to all other geometries from the right df, grouped by each row from the left.
Resources being used
20 workers, spread across four machines, each an m4.xlarge EC2 instance. The scheduler is provisioned on the same instance type as well.
The errors we are seeing
If you look at the reference Google spreadsheet, I've essentially started with smaller initial dataframes (e.g. 2,000 rows) and worked my way up in size. Under methodology changes, I switched between creating the Shapely objects (interpreted geometry strings) in the Pandas dataframe prior to the apply() operation or, later, within the applied method (so it takes the WKB string and converts it into a Shapely geometry object for each geometry in each row of each possible combination from the two dataframes).
Some interesting things we have noted about the errors: they are always 12 rows long in the worker logs.
Example 1 features a column (geometry_from) that should hold a WKT or WKB value (depending on the version of the script) instead holding an integer that matches the id column. It would appear that maybe, somehow, the column was swapped or misapplied (?).
Example 2 is very curious because the 11th row is only partly printed before being truncated (and this happens consistently), and yet the logs continue normally after that.
Example 3 is the one that I've anecdotally noted happens most often (though not today with this example script). This one is particularly confusing because the key error is thrown and passed through to the user in the traceback, but nothing is logged by the scheduler or any of the workers.
Example 5 occurred later in the day and only started after we switched from WKT to WKB format. I found this one extra interesting because there is no way the geometry_from column could be in WKT format in this operation: we load the geometry in and convert it to WKB format before populating the Pandas dataframe with it. So this makes me think that old jobs are somehow being confused with, or preserved alongside, current jobs? client.restart() was used to purge old processes from the cluster, so I'd be interested to learn what was happening here.
Final notes
Dask: 0.15.0, Distributed: 1.17.1, OS: Ubuntu 16.04.2 LTS. Versioning across workers/scheduler/etc. should not be an issue - we deploy Docker containers, so the environment is standardized.
Issue Analytics
- Created 6 years ago
- Comments: 19 (13 by maintainers)
Top GitHub Comments
Also, to be clear, this is a guess and not definitive. Long-GIL-holding functions have been known to cause behavior like this. That does not mean that something else is not also going on.
I gave this a quick run on my local machine. Here are some side-observations and thoughts:
The calc computation seems to take a very long time. In your shoes I might look to see if there are ways to speed it up before resorting to distributed computing.
Most relevant to your problem at hand, though, the geos operations are taking a very long time and are holding onto the GIL. This combination is hard to deal with from a concurrency perspective. It means that even though Dask runs them in separate threads, they still stop the rest of Dask's communication machinery from listening in, handling requests, etc. on other threads. While running your tasks it's as though the Dask worker completely disappears from the network, only to come back after the computation finishes.
You can observe this if you install the crick library (on conda-forge or PyPI), navigate to the diagnostic server for one of the Dask workers, and look at the Counters page. Here is a screenshot from one of my local workers.
If you look towards the lower left you'll see the "tick duration" plot. This should be a histogram tightly centered around 20ms, which is an internal heartbeat that the Dask server keeps to test for exactly this sort of situation. You'll see that the x-range goes all the way up to 6s, meaning that there were some periods of up to six seconds when Dask wasn't able to check in with itself. This has been known to cause issues in the past because various timeouts from other peers will fail out. See the following issues for possible solutions in the general case (although your case is harder):
For GIL-holding functions, though, just about the only thing we can really do on the Dask side is to massively increase timeouts. The real solution here is to get the underlying library (it looks like shapely/geos in this case) to release the GIL while calling into C.
cc'ing @pitrou, who has looked at similar issues. I don't think there is anything for him to do here explicitly, but I thought he might find seeing this problem in the wild of interest.
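As a rough sketch of the "massively increase timeouts" workaround: newer dask/distributed releases read comm timeouts from configuration, which can be set through environment variables. The exact keys depend on your distributed version, and the values below are illustrative, not recommendations:

```shell
# Hypothetical example: raise connection/TCP timeouts so workers that go
# silent while a GIL-holding task runs are not marked as failed.
# Keys follow the DASK_<section>__<key> convention of newer dask releases;
# check your version's configuration docs for the exact names.
export DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT="60s"
export DASK_DISTRIBUTED__COMM__TIMEOUTS__TCP="120s"

dask-worker tcp://scheduler:8786
```

This only papers over the symptom; as noted above, the real fix is for the C extension to release the GIL.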