Long-running shaders result in webgl context lost
TensorFlow.js version
0.10.0
Browser version
Chrome 64.0.3282.167 (Linux)
Describe the problem or feature request
Consider the following toy example of matrix exponentiation:
const size = 10000;
const iters = 10;
let tensor, tmp;
tensor = tf.zeros([size, size]);
for (let i = 0; i < iters; i++) {
  // Square the matrix, then free the previous tensor before rebinding the name.
  tmp = tensor.matMul(tensor);
  tensor.dispose();
  tensor = tmp;
}
tensor.max().print();
Expected:
The program should always print 0, and neither the behavior nor memory usage should vary as iters
increases, since it results in multiplying the exact same matrix, and the tensor is disposed after being used.
Actual:
With iters = 5, the program behaves as expected. With iters = 10, I have seen the following three behaviors:
- Sometimes it prints NaN, and Chrome reports that “WebGL hit a snag”
- Sometimes it errors with “Couldn’t compile vertex shader”
- Sometimes it errors with the long failed script compilation found here
Reproduction
Navigate to https://js.tensorflow.org/, open the console, and paste the example script there. If it works as expected, try increasing the number of iterations and see if it breaks.
Issue Analytics
- State:
- Created 5 years ago
- Comments:7 (3 by maintainers)
Top GitHub Comments
Hey Dandelion, thanks for the detailed report! And hi again!
This is consistent with our observation of executing long-running shaders. A matmul of two 10K x 10K matrices takes ~2 TFLOPs (about 2 × 10^12 floating-point operations). The hypothesis is that Chrome is trying to protect itself (and other tabs) by prematurely killing the WebGL context if a single program takes more than x seconds, especially after several consecutive programs. This results in either a NaN or a “WebGL hit a snag” error.
The solution is non-trivial, but the idea is to use divide and conquer: internally split the matmul job into several smaller matmul shader calls, each on the order of GFLOPs instead of TFLOPs, which will keep Chrome and other browsers happy.
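For a sense of scale, here is a rough back-of-the-envelope sketch of the operation count and of how many smaller calls a per-call budget would imply. The helper names and the 1 GFLOP budget are assumptions made for illustration; they are not TensorFlow.js APIs or measured browser limits.

```javascript
// An (m x k) times (k x n) matmul does one multiply and one add per term.
function matMulFlops(m, k, n) {
  return 2 * m * k * n;
}

// Largest square output tile edge t such that a (t x k) by (k x t)
// product stays within the given operation budget (hypothetical helper).
function maxTileEdge(k, flopBudget) {
  return Math.max(1, Math.floor(Math.sqrt(flopBudget / (2 * k))));
}

const size = 10000;
const totalFlops = matMulFlops(size, size, size);
const tileEdge = maxTileEdge(size, 1e9); // assumed budget: 1 GFLOP per call
const tileCalls = Math.ceil(size / tileEdge) ** 2;

console.log(totalFlops, tileEdge, tileCalls); // 2000000000000 223 2025
```

Under these assumptions, one 10K x 10K matmul becomes roughly two thousand short calls instead of a single long one.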
We won’t be prioritizing this yet, since it hasn’t come up in real use-cases (existing ML models), but happy to revisit this later, or take contributions!
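The divide-and-conquer idea can be sketched in plain JavaScript, outside of WebGL. This is not how TensorFlow.js implements matmul; `tiledMatMul` and its `tile` parameter are hypothetical names, and each output tile simply stands in for one short shader call.

```javascript
// Multiply row-major A (m x k) by B (k x n), one output tile at a time.
// Each tile is a bounded unit of work, standing in for one shader call.
function tiledMatMul(a, b, m, k, n, tile) {
  const out = new Float32Array(m * n);
  let calls = 0;
  for (let i0 = 0; i0 < m; i0 += tile) {
    const i1 = Math.min(i0 + tile, m);
    for (let j0 = 0; j0 < n; j0 += tile) {
      const j1 = Math.min(j0 + tile, n);
      // Rows [i0, i1) of A times columns [j0, j1) of B.
      for (let i = i0; i < i1; i++) {
        for (let j = j0; j < j1; j++) {
          let sum = 0;
          for (let p = 0; p < k; p++) {
            sum += a[i * k + p] * b[p * n + j];
          }
          out[i * n + j] = sum;
        }
      }
      calls++;
    }
  }
  return { out, calls };
}

const matA = Float32Array.from([1, 2, 3, 4, 5, 6]);    // 2 x 3
const matB = Float32Array.from([7, 8, 9, 10, 11, 12]); // 3 x 2
const result = tiledMatMul(matA, matB, 2, 3, 2, 1);
console.log(Array.from(result.out), result.calls); // [ 58, 64, 139, 154 ] 4
```

In TensorFlow.js the same shape of loop could presumably be built from slices of the inputs multiplied with matMul and reassembled, with the tile size chosen so that no single call runs long enough to trip the browser's watchdog.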
Thanks! We won’t be able to fix this for several reasons:
tf.matmul().dataSync()
will break).