WebGPU Performance Issues
I just tried the new tfjs-backend-webgpu 0.0.1-alpha.8 on tfjs 3.9.0.
Environment: Chrome 96 Canary on Windows 11.
First, great job on adding tons of new ops: from the perspective of supported kernel ops, webgpu is becoming usable!

However, the switch to WGSL is anything but useful so far. It comes as a major performance degradation: overall, webgpu has gotten slower than webgl (and webgl itself has become significantly slower since tfjs 3.4.0; this is discussed separately in several open issues). Not to mention that the new work that has gone into webgl to make it manageable (enabling uniforms) has no effect on webgpu.
Comparing warmup times (fyi, my app by default uses 8 simple models running in parallel; total model size is actually tiny, below 30mb):

- webgl (default settings): 14 sec (double that value with uniforms enabled)
- webgl with `WEBGL_PACK_DEPTHWISECONV=false` and `WEBGL_USE_SHAPES_UNIFORMS=true`: 7 sec (pretty good)
- webgpu (default settings): 25 sec (this is incredibly slow)
- webgpu with `WEBGPU_USE_GLSL=true`: 15 sec (already slower than webgl)
- wasm (no real warmup, included for reference only): 2 sec
Imo, when developing a new backend, the goal should be that it's better than the previous one, not just that it passes unit tests. If webgpu is not significantly improved, it will be d.o.a. once released.

cc @qjia7 and @xhcao due to work on webgpu
cc @pyu10055 as assignee on the webgl performance degradation issue
Issue Analytics
- State:
- Created 2 years ago
- Comments: 14 (7 by maintainers)
Top GitHub Comments
Thank you for the notes, here are full details.
I've created an automated test so it's easy to check all scenarios…

Performance Testing

Environment: tfjs 3.9.0 and tfjs-backend-webgpu 0.0.1-alpha.8
Hardware: Notebook with Intel Coffee Lake i7-8750 and nVidia GTX 1050 Ti

Notes
- WebGPU GLSL code has been recently removed and cannot be compared with the new WGSL
- WebGL warmup has a massive benefit of ~80% from browser shader caching
- WebGPU warmup has a little benefit of ~12% from browser shader caching
- WebGPU is much faster on inference compared to WebGL
- WebGPU is faster to warm up than WebGL in most cases
  - Except when WebGL shaders are cached in the browser cross-session and uniforms are enabled: WebGL is 2x faster than WebGPU in that scenario, showing the necessity of caching support
- WebGL performance benefit of uniforms is massive at 2x and I don't see any side-effects. Will this be enabled by default in the future?
- WebGL packing caused a massive performance regression in TFJS 3.4.0 (3.3.0 is the last unaffected version). There are several open issues, but no progress?
- `tf.zeros` as input is convenient, but does not produce realistic results. Test using a real input image to exercise the real-world model execution path
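The last note, that an all-zeros input skews benchmarks, can be worked around by synthesizing pseudo-random pixel data when no real image is at hand. A hedged sketch (the helper name and the 224x224x3 shape are illustrative, not from the original test); the `Float32Array` here is a plain stand-in for a `tf.zeros([1, h, w, 3])`-shaped tensor:

```javascript
// Build a fake RGB "image" as a Float32Array of random values in [0, 1),
// instead of an all-zeros buffer. Random data forces real arithmetic
// through the model rather than trivially-zero activations.
function randomImageData(height, width, channels = 3) {
  const data = new Float32Array(height * width * channels);
  for (let i = 0; i < data.length; i++) data[i] = Math.random();
  return data;
}

const input = randomImageData(224, 224);
console.log(input.length);               // 224 * 224 * 3 = 150528
console.log(input.some((v) => v !== 0)); // not an all-zeros tensor
```

A real captured image is still better, since random noise does not exercise the same activation distributions as natural images, but it avoids the zeros short-circuit.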
Test Results

Issues

Using the WebGPU backend causes a lot of warnings, although execution seems to work:

Reproduction

Fully automated test in NodeJS using puppeteer, reproducible anytime. Code available at https://gist.github.com/vladmandic/fbdcaf7fe2e2add5c33b98936d4d5740
@vladmandic Thanks for the good comments and data, as always! Chrome 94 was released on Sep 21 with WebGPU Origin Trial support. This means that in addition to Chrome Canary, we may now use Chrome Stable (still needing the option --enable-unsafe-webgpu) for WebGPU experiments. Unfortunately, Chrome decided not to support GLSL anymore for WebGPU (the changes happened in master, so all release channels are impacted, including Canary and Stable), so WGSL is the only language that can be consumed now.

We always align closely with WebGPU development (my team also heavily contributes to the WebGPU spec, CTS, and Chrome implementation) and started the TFJS GLSL-to-WGSL transition in June. After fixing many critical perf issues in Chrome (e.g., the workgroup memory init perf regression) together with Google, and working around perf issues in TFJS (e.g., hardware limits), we finished the transition after 3+ months of work. Internally we track performance daily against almost all the workloads defined in the TFJS e2e benchmarks. Before switching to WGSL, we double-checked that there was no performance regression in warmup time or run time. Of course, due to limited resources, we could only cover very limited platforms (actually only Intel Coffee Lake and Tiger Lake are under daily test) and very limited workloads. We'd like to hear more details from your side (e.g., hardware configuration) to understand the regression. We'll investigate right after our holidays (we are off from Oct 1 to 7 for the National Day holidays). BTW,

Thanks again for your valuable feedback. We hope to hear more details from your side about the warmup regression (e.g., hardware configuration), and look forward to more collaboration in the future!