MonteCarloDynamic kernel failing on the Xilinx FPGA

Describe the bug

I noticed a problem with the MonteCarlo kernel in the dynamic package, for all input sizes, when executing on the Xilinx KCU1500 FPGA. Compilation reports no errors, but the kernel never finishes and it causes DMA failures at the driver level. The driver log looks like this:

[  815.440478] xocl:engine_status_dump: SG engine 0-H2C1-MM status: 0x00000000:
[  815.440480] xocl:engine_status_dump: SG engine 0-H2C0-MM status: 0x00000001: BUSY
[  815.440483] xocl:transfer_abort: abort transfer 0x000000009584ae00, desc 11, engine desc queued 0.
[  815.440487] xocl:transfer_abort: abort transfer 0x00000000d2360335, desc 1, engine desc queued 0.
[  815.440505] xocl_mm_xdma mm_dma.v5.u.256: xdma_migrate_bo: DMA failed, Dumping SG Page Table
[  815.440508] xocl_mm_xdma mm_dma.v5.u.256: xdma_migrate_bo: DMA failed, Dumping SG Page Table
[  815.440516] xocl_mm_xdma mm_dma.v5.u.256: xdma_migrate_bo: 0, 0xf3ce7c000
[  815.440521] xocl_mm_xdma mm_dma.v5.u.256: xdma_migrate_bo: 1, 0xf3d800000
[  815.440526] xocl_mm_xdma mm_dma.v5.u.256: xdma_migrate_bo: 2, 0xf3d400000
[  815.440531] xocl_mm_xdma mm_dma.v5.u.256: xdma_migrate_bo: 3, 0xf3f000000
[  815.440536] xocl_mm_xdma mm_dma.v5.u.256: xdma_migrate_bo: 4, 0xf7d000000
[  815.440540] xocl_mm_xdma mm_dma.v5.u.256: xdma_migrate_bo: 5, 0xf4f800000
[  815.440545] xocl_mm_xdma mm_dma.v5.u.256: xdma_migrate_bo: 6, 0xf54800000
[  815.440550] xocl_mm_xdma mm_dma.v5.u.256: xdma_migrate_bo: 7, 0xf60400000
[  815.440554] xocl_mm_xdma mm_dma.v5.u.256: xdma_migrate_bo: 8, 0xf61c00000
[  815.440559] xocl_mm_xdma mm_dma.v5.u.256: xdma_migrate_bo: 9, 0xf3b800000
[  815.440568] xocl_mm_xdma mm_dma.v5.u.256: xdma_migrate_bo: 0, 0xf3821f000

This problem occurs only on the Xilinx KCU1500 FPGA. The Intel Nallatech Arria 10 FPGA works in emulation mode as well as in the other two modes (Full JIT and AoT).

To narrow this down, I compared the previous kernel that was working (about 2 months old) with the current one. I took the body of the old kernel and applied two changes that we introduced in the latest version: a) changed the frame number from 6 to 0; b) removed the private region parameter.

The modified kernel seems to be working. So, the main difference between the two kernels is shown in the figure (the left kernel is the old one that works, the right kernel is the new one that causes the problem):

[Figure: montecarlo_kernels diff]

How To Reproduce

tornado -Ds0.t0.device=0:1 -Xmx20g -Xms20g --printKernel --debug uk.ac.manchester.tornado.examples.dynamic.MontecarloDynamic 65536 default 1

Note that device 0:1 is the xilinx_kcu1500_dynamic_5_0 CL_DEVICE_TYPE_ACCELERATOR

Computing system setup:

OS: Ubuntu 18.04.02 LTS
OpenCL Version: 1.0
TornadoVM commit id: ed243aa

Any ideas? I am not familiar with this change about the fma.

Top GitHub Comments

jjfumero commented, Jul 1, 2020

Thanks, Thanos. You can report this issue to the Xilinx OpenCL runtime.

jjfumero commented, Jul 1, 2020

Thanks, @stratika. Do you think the issue is the FMA instruction? It has been supported since OpenCL 1.0:

https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/fma.html

Can you replace the fma with separate multiply and add instructions? Just to double-check that it is the problem.
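For illustration only, here is a minimal Java sketch of what swapping a fused multiply-add for separate instructions means numerically. It is not the MonteCarlo kernel from the issue (that code is not shown here); the class name FmaSubstitutionSketch and the methods fused and separate are hypothetical. It uses Math.fma, which is available from Java 9. How TornadoVM maps either form to OpenCL on a given device is not something this sketch shows.

```java
public class FmaSubstitutionSketch {

    // Variant A: fused multiply-add. a * b + c is computed exactly
    // and rounded once (Math.fma, Java 9+).
    static float fused(float a, float b, float c) {
        return Math.fma(a, b, c);
    }

    // Variant B: separate multiply and add. The product is rounded
    // to float first, then the sum is rounded again.
    static float separate(float a, float b, float c) {
        float product = a * b;
        return product + c;
    }

    public static void main(String[] args) {
        // Inputs chosen so that the intermediate rounding is visible:
        // a is the next float above 1.0, c cancels the rounded product.
        float a = 1.0f + Math.ulp(1.0f);
        float c = -(1.0f + 2.0f * Math.ulp(1.0f));

        System.out.println("fused    = " + fused(a, a, c));
        System.out.println("separate = " + separate(a, a, c));
    }
}
```

For these inputs the two variants return different values, because the separate form loses the low-order bits of the product before the addition. So when testing the substitution, small differences in the kernel output are expected and do not by themselves indicate a new bug.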

Apart from that, regarding the two changes: a) OpenCL frame: this should not affect anything. b) Private memory allocation for arrays: this might cause a problem if we run out of resources. But IMO, we should get an error after the kernel launch.