AssertionError: num_elems 513024 > buffer 513000
My system: Ubuntu 20.10 (Groovy), NVIDIA driver 460.73.01, CUDA 11.2, torch 1.8.1, deepspeed 0.3.16.
For some combinations it works, while for others it doesn't; the failures seem random.
code:

```python
import deepspeed
import torch

model = torch.nn.Sequential(torch.nn.Linear(512, 1000))

deepspeed_args = {
    "train_batch_size": 1,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True, "loss_scale": 0, "initial_scale_power": 3,
             "loss_scale_window": 1000, "hysteresis": 1, "min_loss_scale": 1},
    "gradient_clipping": 1.0,
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "nvme", "nvme_path": "/mnt/nvme0n1p3/"},
        "offload_optimizer": {"device": "nvme", "nvme_path": "/mnt/nvme0n1p3/"},
    },
}

optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=[0.8, 0.99],
                             eps=1e-8, weight_decay=3e-7)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.95)

model, optimizer, _, scheduler = deepspeed.initialize(
    model=model, optimizer=optimizer, lr_scheduler=scheduler,
    config_params=deepspeed_args)
```
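For reference (my addition, not part of the original report), the model above has exactly the 513000 parameters that later show up as the swap-buffer size in the log:

```python
# Count the repro model's parameters: 512 * 1000 weights + 1000 biases = 513000,
# matching SwapBufferManager's "num_elems 513000" in the log below.
import torch

model = torch.nn.Sequential(torch.nn.Linear(512, 1000))
print(sum(p.numel() for p in model.parameters()))  # 513000
```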
error:
```
[2021-05-03 12:50:20,837] [INFO] [logging.py:60:log_dist] [Rank -1] DeepSpeed info: version=0.3.16, git-hash=unknown, git-branch=unknown
[2021-05-03 12:50:20,838] [INFO] [distributed.py:36:init_distributed] Not using the DeepSpeed or torch.distributed launchers, attempting to detect MPI environment...
[2021-05-03 12:50:21,196] [INFO] [distributed.py:83:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=192.168.29.88, master_port=29500
[2021-05-03 12:50:21,197] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
[2021-05-03 12:50:23,337] [INFO] [utils.py:11:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
[2021-05-03 12:50:23,369] [INFO] [engine.py:601:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer
[2021-05-03 12:50:23,369] [INFO] [engine.py:606:_configure_optimizer] Using client Optimizer as basic optimizer
[2021-05-03 12:50:23,369] [INFO] [engine.py:615:_configure_optimizer] DeepSpeed Basic Optimizer = Adam
Checking ZeRO support for optimizer=Adam type=<class 'torch.optim.adam.Adam'>
[2021-05-03 12:50:23,369] [INFO] [logging.py:60:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer
Initializing ZeRO Stage 3
[2021-05-03 12:50:23,391] [INFO] [utils.py:583:see_memory_usage] Stage 3 initialize beginning
/home/vbansal21/.local/lib/python3.8/site-packages/torch/cuda/memory.py:373: FutureWarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved
  warnings.warn(
/home/vbansal21/.local/lib/python3.8/site-packages/torch/cuda/memory.py:381: FutureWarning: torch.cuda.max_memory_cached has been renamed to torch.cuda.max_memory_reserved
  warnings.warn(
[2021-05-03 12:50:23,392] [INFO] [utils.py:584:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2021-05-03 12:50:23,392] [INFO] [utils.py:592:see_memory_usage] CPU Virtual Memory: used = 6.57 GB, percent = 28.2%
[2021-05-03 12:50:23,392] [INFO] [stage3.py:624:__init__] Reduce bucket size 500000000
[2021-05-03 12:50:23,392] [INFO] [stage3.py:625:__init__] Allgather bucket size 50000000
Using /home/vbansal21/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /home/vbansal21/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.19048452377319336 seconds
[2021-05-03 12:50:23,816] [INFO] [stage3.py:39:print_rank_0] FP16 params swapping is True, Max params in CPU is 1000000000.0
[2021-05-03 12:50:23,840] [INFO] [utils.py:583:see_memory_usage] Before creating fp16 partitions
/home/vbansal21/.local/lib/python3.8/site-packages/torch/cuda/memory.py:373: FutureWarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved
  warnings.warn(
/home/vbansal21/.local/lib/python3.8/site-packages/torch/cuda/memory.py:381: FutureWarning: torch.cuda.max_memory_cached has been renamed to torch.cuda.max_memory_reserved
  warnings.warn(
[2021-05-03 12:50:23,841] [INFO] [utils.py:584:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2021-05-03 12:50:23,841] [INFO] [utils.py:592:see_memory_usage] CPU Virtual Memory: used = 6.59 GB, percent = 28.2%
[2021-05-03 12:50:23,861] [INFO] [stage3.py:39:print_rank_0] fp16 group 0 has 1 subgroups
[2021-05-03 12:50:23,862] [INFO] [stage3.py:924:_configure_tensor_swapping] Tensor Swapping: Adding optimizer tensors
[2021-05-03 12:50:23,866] [INFO] [utils.py:30:print_object] SwapBufferManager:
[2021-05-03 12:50:23,866] [INFO] [utils.py:34:print_object] count ........................ 4
[2021-05-03 12:50:23,866] [INFO] [utils.py:34:print_object] dtype ........................ torch.float32
[2021-05-03 12:50:23,866] [INFO] [utils.py:34:print_object] free_buffer_index ............ [0, 1, 2, 3]
[2021-05-03 12:50:23,866] [INFO] [utils.py:34:print_object] gigabytes .................... 0.007644295692443848
[2021-05-03 12:50:23,866] [INFO] [utils.py:34:print_object] num_elems .................... 513000
[2021-05-03 12:50:23,866] [INFO] [utils.py:34:print_object] used_buffer_index ............ {}
Using /home/vbansal21/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /home/vbansal21/.cache/torch_extensions/async_io/build.ninja...
Building extension module async_io...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module async_io...
Time to load async_io op: 0.28775739669799805 seconds
[2021-05-03 12:50:24,215] [INFO] [utils.py:30:print_object] PartitionedOptimizerSwapper:
[2021-05-03 12:50:24,215] [INFO] [utils.py:34:print_object] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2021-05-03 12:50:24,215] [INFO] [utils.py:34:print_object] aligned_bytes ................ 1024
[2021-05-03 12:50:24,215] [INFO] [utils.py:34:print_object] dtype ........................ torch.float32
[2021-05-03 12:50:24,215] [INFO] [utils.py:34:print_object] largest_numel ................ 513000
[2021-05-03 12:50:24,215] [INFO] [utils.py:34:print_object] min_aio_bytes ................ 1048576
[2021-05-03 12:50:24,215] [INFO] [utils.py:34:print_object] numel_alignment .............. 256
[2021-05-03 12:50:24,215] [INFO] [utils.py:34:print_object] swap_config .................. {'device': 'nvme', 'nvme_path': '/mnt/nvme0n1p3/', 'buffer_count': 4, 'pin_memory': False, 'pipeline_read': False, 'pipeline_write': False, 'fast_init': False, 'pipeline': False}
[2021-05-03 12:50:24,215] [INFO] [utils.py:34:print_object] swap_element_size ............ 4
[2021-05-03 12:50:24,215] [INFO] [utils.py:34:print_object] swap_folder .................. /mnt/nvme0n1p3/zero_stage_3/optimizer/rank0
Traceback (most recent call last):
  File "/home/vbansal21/Documents/test_run.py", line 43, in <module>
    model, optimizer,_,scheduler = deepspeed.initialize(model=model,optimizer=optimizer,lr_scheduler=scheduler,config_params=deepspeed_args)
  File "/home/vbansal21/.local/lib/python3.8/site-packages/deepspeed/__init__.py", line 120, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/vbansal21/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 172, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/vbansal21/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 628, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/home/vbansal21/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 793, in _configure_zero_optimizer
    optimizer = FP16_DeepSpeedZeroOptimizer_Stage3(
  File "/home/vbansal21/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 806, in __init__
    self._create_fp32_partitions()
  File "/home/vbansal21/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1298, in _create_fp32_partitions
    self.optimizer_swapper.initialize_parameters(
  File "/home/vbansal21/.local/lib/python3.8/site-packages/deepspeed/runtime/swap_tensor/partitioned_optimizer_swapper.py", line 71, in initialize_parameters
    self._initialize_parameters(parameters=parameters,
  File "/home/vbansal21/.local/lib/python3.8/site-packages/deepspeed/runtime/swap_tensor/optimizer_utils.py", line 374, in _initialize_parameters
    self._swap_out_unpinned_tensors(aio_handle=aio_handle,
  File "/home/vbansal21/.local/lib/python3.8/site-packages/deepspeed/runtime/swap_tensor/optimizer_utils.py", line 421, in _swap_out_unpinned_tensors
    swap_buffers = get_sized_buffers(pinned_buffers, swap_lengths)
  File "/home/vbansal21/.local/lib/python3.8/site-packages/deepspeed/runtime/swap_tensor/utils.py", line 237, in get_sized_buffers
    swap_buffers = [
  File "/home/vbansal21/.local/lib/python3.8/site-packages/deepspeed/runtime/swap_tensor/utils.py", line 238, in <listcomp>
    get_sized_buffer(buffer, num_elems) \
  File "/home/vbansal21/.local/lib/python3.8/site-packages/deepspeed/runtime/swap_tensor/utils.py", line 231, in get_sized_buffer
    assert num_elems <= buffer.numel(), \
AssertionError: num_elems 513024 > buffer 513000
```
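For what it's worth, the numbers in the log look consistent with an alignment mismatch: the swapper reports `numel_alignment 256` and pinned buffers of `num_elems 513000`, and rounding 513000 up to a multiple of 256 gives exactly the 513024 in the assertion. A minimal sketch of that arithmetic (my reading of the log, not a confirmed root cause):

```python
import math

num_params = 513000   # largest_numel / SwapBufferManager num_elems from the log
alignment = 256       # numel_alignment from the PartitionedOptimizerSwapper log

aligned_len = math.ceil(num_params / alignment) * alignment
print(aligned_len)                 # 513024 -- the num_elems in the AssertionError
print(aligned_len > num_params)    # True: aligned swap length exceeds the 513000-element buffer
```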
Top GitHub Comments
@Vbansal21, @stas00, @thies1006, I reopened this because I have found the original bug and will link a PR shortly. Thanks for all your contributions to locating this bug.
@tjruwase, great that you found it! I tried it out with the Huggingface integration and it looks good. I'm now getting another error, but it appears later (`AttributeError: 'NoneType' object has no attribute 'available_swap_in_buffers'`); I filed it as issue #1035.