Adding asynchronous execution to TaskSchedule
Currently the TaskSchedule API contains only blocking versions of the execute methods:
void execute();
void execute(GridTask gridTask);
void executeWithProfiler(Policy policy);
void executeWithProfilerSequential(Policy policy);
void executeWithProfilerSequentialGlobal(Policy policy);
All these methods block the currently executing Java thread until all computations are done. However, the computations are typically off-loaded to the GPU, so the CPU just keeps waiting in the meantime.
I think it should be both beneficial and possible to add asynchronous versions of the same methods with the following signatures:
CompletableFuture executeAsync();
CompletableFuture executeAsync(GridTask gridTask);
CompletableFuture executeWithProfilerAsync(Policy policy);
CompletableFuture executeWithProfilerSequentialAsync(Policy policy); // Not sure about "sequential async"
CompletableFuture executeWithProfilerSequentialGlobalAsync(Policy policy); // Not sure about "sequential async"
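To show how such signatures might look from the caller's side, here is a minimal sketch. `BlockingSchedule` is a hypothetical stand-in, not the TornadoVM `TaskSchedule` class, and wrapping the blocking call in `CompletableFuture.runAsync` is only a naive stopgap that moves the wait to a pool thread; it is not the event-driven approach discussed under "Thoughts about implementation".

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

public class AsyncWrapperSketch {

    /** Hypothetical stand-in for a schedule that only offers a blocking execute(). */
    static class BlockingSchedule {
        final AtomicBoolean done = new AtomicBoolean(false);

        /** Blocks the calling thread until the (simulated) device work finishes. */
        void execute() {
            try {
                Thread.sleep(50); // pretend the device is busy
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            done.set(true);
        }

        /** Naive async variant: the blocking wait simply moves to a pool thread. */
        CompletableFuture<Void> executeAsync() {
            return CompletableFuture.runAsync(this::execute);
        }
    }

    public static void main(String[] args) {
        BlockingSchedule schedule = new BlockingSchedule();
        CompletableFuture<Void> future = schedule.executeAsync();
        // The calling thread is free to do other work here.
        future.thenRun(() -> System.out.println("computation finished")).join();
    }
}
```

This still parks one Java thread per in-flight schedule, which is the cost that a callback-based implementation would avoid.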
Thoughts about implementation
Per my understanding, TaskSchedule delegates back to TornadoTaskSchedule, and this object waits on a driver-specific Event event object. There are specific classes in each driver: CLEvent and PTXEvent.
OpenCL provides clSetEventCallback, CUDA has cudaLaunchHostFunc, so it's possible to get async notifications from both OpenCL and PTX drivers.
So it should be possible to extend CLEvent and PTXEvent + PTXStream to add some form of listeners, where a concrete listener inside TornadoTaskSchedule can settle the CompletableFuture returned from the proposed TaskSchedule.executeAsync().
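The listener idea can be sketched roughly as follows. Everything here is hypothetical: `DeviceEvent` stands in for a driver event class, and `complete()` stands in for whatever a native completion callback (for example one registered through clSetEventCallback) would invoke on the Java side. No TornadoVM classes are used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class EventListenerSketch {

    /** Hypothetical driver event that notifies listeners once it completes. */
    static class DeviceEvent {
        private final List<Runnable> listeners = new ArrayList<>();
        private boolean completed;

        void addListener(Runnable listener) {
            synchronized (this) {
                if (!completed) {
                    listeners.add(listener);
                    return;
                }
            }
            listener.run(); // already finished: notify immediately, outside the lock
        }

        /** In a real driver this would be triggered by the native callback. */
        void complete() {
            List<Runnable> toNotify;
            synchronized (this) {
                if (completed) {
                    return;
                }
                completed = true;
                toNotify = new ArrayList<>(listeners);
                listeners.clear();
            }
            toNotify.forEach(Runnable::run);
        }
    }

    /** The listener settles the future, so no Java thread blocks while waiting. */
    static CompletableFuture<Void> executeAsync(DeviceEvent event) {
        CompletableFuture<Void> future = new CompletableFuture<>();
        event.addListener(() -> future.complete(null));
        return future;
    }

    public static void main(String[] args) throws InterruptedException {
        DeviceEvent event = new DeviceEvent();
        CompletableFuture<Void> future = executeAsync(event);
        System.out.println("done before callback: " + future.isDone());

        Thread driver = new Thread(event::complete); // simulates the driver's callback thread
        driver.start();
        future.join();
        driver.join();
        System.out.println("done after callback: " + future.isDone());
    }
}
```

A real implementation would also need to propagate device errors, for instance through `completeExceptionally`, which this sketch leaves out.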
Thoughts?
Issue Analytics
- Created 3 years ago
- Comments:24 (24 by maintainers)
Top GitHub Comments
Yep, found it. There were many other issues that I had to fix when trying to add true out-of-order execution (even inside the JNI code: blocking waits for read/write). Attached is a patch with the fixed runtime + OpenCL, and an API for executeAsync(...).
Currently with OpenCL all tests run OK with both -Dtornado.ooo-execution.enable=true (ooo) and -Dtornado.ooo-execution.enable=false (partly blocking) on NVIDIA OpenCL.
On Intel OpenCL everything is OK for -Dtornado.ooo-execution.enable=false, but with -Dtornado.ooo-execution.enable=true I get a Segmentation Fault for just about everything. I need someone who can debug the code and find out the reason.
It would be great if someone could apply this patch to the current develop branch and run the tests (for both settings of tornado.ooo-execution.enable) on AMD or another device.
0001-Enable-full-Out-of-order-execution-Adding-executeAsy.zip
Status update.
OCLCommandQueue to use DirectByteBuffer for non-blocking calls and left existing “array-copying” code only for blocked read/write.
TornadoVM and OCLTornadoDevice
OCLEvent with static buffer usage
OCLCommandQueue and OCLDeviceContext).
CompletableFuture TaskSchedule.executeAsync(...)
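The status update mentions using a DirectByteBuffer for non-blocking calls. One plausible reason (an assumption, not stated in the thread) is that a direct buffer lives outside the Java heap at a stable address, so native code can keep reading or writing it after the JNI call has returned, whereas a Java array generally has to be pinned or copied for the duration of the call. The sketch below uses only `java.nio`; the helper names are hypothetical and no OpenCL or TornadoVM calls are involved.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import java.util.Arrays;

public class DirectBufferSketch {

    /** Copies a float[] into an off-heap buffer in the platform's native byte order. */
    static ByteBuffer toDirect(float[] data) {
        ByteBuffer buffer = ByteBuffer.allocateDirect(data.length * Float.BYTES)
                                      .order(ByteOrder.nativeOrder());
        // The float view has its own position, so the byte buffer's position stays at 0.
        buffer.asFloatBuffer().put(data);
        return buffer;
    }

    /** Copies the buffer contents back into a fresh float[]. */
    static float[] fromDirect(ByteBuffer buffer) {
        FloatBuffer view = buffer.asFloatBuffer();
        float[] out = new float[view.remaining()];
        view.get(out);
        return out;
    }

    public static void main(String[] args) {
        float[] input = {1.5f, -2.0f, 3.25f};
        ByteBuffer staged = toDirect(input);
        // A non-blocking transfer would hand `staged` to native code at this point.
        System.out.println("direct: " + staged.isDirect());
        System.out.println("round trip: " + Arrays.toString(fromDirect(staged)));
    }
}
```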
Tested on both NVIDIA and AMD. Almost all of the tests (~325 out of 331) run OK in blocking, default (mixed) and non-blocking modes. The only exceptions are:
Sporadically happens on both AMD and NVIDIA in any mode (default, blocking, non-blocking).
I guess this is something related to the handling of 2-byte types. From the tests I saw that you implemented special handling for single-byte types. Probably two-byte types should be addressed as well, because there is always some garbage in one byte and a correctly set second byte.
Always happens on both AMD and NVIDIA in non-blocking mode. Blocking and default modes are OK.
Always happens on AMD (any mode). I need to check on Tornado 0.8; probably this is not my regression at all.
And the asynchronous invocation itself (i.e. Tornado.executeAsync(...)) works as expected.
Waiting for your approval of my previous PR, so I’ll share these results.