Some questions about the source code
I'm pretty interested in the implementation of the ZeRO optimizer, because the work is really impressive. Yet I found some points I cannot understand while reading the source code, so I'm here to ask for some help to understand DeepSpeed better.
- Throughout the project I can find the implementations of Gradient Partitioning and Parameter Partitioning, since partitioning `param_group['params']` involves partitioning both the weights (`p.data`) and the gradients (`p.grad`), but I cannot find where Optimizer State Partitioning happens.
- `shard_list` seems to be a temporary variable in the for-loop, which means it is discarded on every iteration, so what is the purpose of the `dist.all_gather` here? It seems to have something to do with pipelining. Would you please elaborate? (One possible reading is sketched after this list.)
- What is the meaning of the `allgather_size` parameter of `FP16_DeepSpeedZeroOptimizer`?

Would you please also share the presentation slides on the design and implementation of ZeRO? Thanks a million!
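For readers following along, here is a minimal toy sketch of the idea behind the first two questions. It is not DeepSpeed's actual code; the class name `ShardedMomentumSGD` and the even-divisibility assumption are invented for illustration. The key points it shows: optimizer state (momentum here) is only ever allocated for the shard a rank owns, which is what "Optimizer State Partitioning" means in practice, and a seemingly temporary `shard_list` can still be meaningful if it holds views into a persistent buffer, because `all_gather` writes through those views before the list is discarded.

```python
# Hypothetical sketch, not DeepSpeed's implementation.
import torch
import torch.distributed as dist

class ShardedMomentumSGD:
    """Each rank owns one contiguous shard of the flattened parameters
    and allocates optimizer state (momentum) only for that shard."""

    def __init__(self, flat_params: torch.Tensor, lr: float = 1e-3,
                 beta: float = 0.9):
        self.flat_params, self.lr, self.beta = flat_params, lr, beta
        rank, world = dist.get_rank(), dist.get_world_size()
        shard = flat_params.numel() // world   # assume evenly divisible
        self.sl = slice(rank * shard, (rank + 1) * shard)
        # Optimizer State Partitioning: state exists for this rank's
        # shard only, so its memory cost shrinks with the world size.
        self.momentum = torch.zeros(shard, device=flat_params.device)

    def step(self, flat_grads: torch.Tensor):
        # Update only the shard this rank owns.
        grad_shard = flat_grads[self.sl]
        self.momentum.mul_(self.beta).add_(grad_shard)
        self.flat_params[self.sl].add_(self.momentum, alpha=-self.lr)

        # shard_list holds *views* into flat_params, so even though the
        # Python list is discarded after this call, all_gather has
        # already written every rank's updated shard through the views,
        # leaving flat_params fully up to date on every rank.
        world = dist.get_world_size()
        shard_list = list(self.flat_params.chunk(world))
        dist.all_gather(shard_list, self.flat_params[self.sl].clone())
```

Under this reading, the `all_gather` is not pipeline-related: it is simply how each rank reassembles the full updated parameter vector after updating only its own shard. Whether this matches DeepSpeed's actual code path is for the maintainers to confirm.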
Top GitHub Comments
@samyam I've come to understand the reason for using all_gather in CSRTensor. In data parallelism, different processes hold totally different parts of the weight that should be updated. So we just all_gather them into the list `indices_device_list` and the list `values_device_list`, and then average these gradients from the different devices. The `csr.to_dense()` function is responsible for calculating the sum of different values whose indices in `indices_device_list` happen to be the same. (A toy sketch of this pattern follows.)
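To make that concrete, here is a toy sketch using PyTorch's COO sparse tensors as a stand-in for DeepSpeed's CSRTensor. The function name, the shapes, and the assumption that every rank contributes the same number of nonzero rows are all mine, not DeepSpeed's; the point is only to show the gather-then-densify pattern, where `to_dense()` sums values with coinciding indices.

```python
# Toy sketch; torch COO sparse tensors stand in for CSRTensor.
import torch
import torch.distributed as dist

def average_sparse_grad(indices: torch.Tensor, values: torch.Tensor,
                        num_rows: int, dim: int) -> torch.Tensor:
    """Average an embedding-style sparse gradient across ranks: each
    rank touches different rows, so we all_gather (indices, values)
    pairs instead of all_reducing a huge dense tensor."""
    world = dist.get_world_size()
    # Assumes every rank contributes the same number of nonzero rows;
    # real code would first exchange sizes and pad.
    indices_device_list = [torch.empty_like(indices) for _ in range(world)]
    values_device_list = [torch.empty_like(values) for _ in range(world)]
    dist.all_gather(indices_device_list, indices)
    dist.all_gather(values_device_list, values)

    all_idx = torch.cat(indices_device_list)   # shape: (total_nnz,)
    all_val = torch.cat(values_device_list)    # shape: (total_nnz, dim)
    # to_dense() sums values whose row indices coincide -- the same
    # role csr.to_dense() plays in the explanation above.
    dense = torch.sparse_coo_tensor(all_idx.unsqueeze(0), all_val,
                                    size=(num_rows, dim)).to_dense()
    return dense / world
```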
Seems like all the questions have been answered here. Closing the issue. @sjtusmartboy, please feel free to reopen if you have any additional questions, comments, or feedback.