
AssertionError: num_elems 513024> buffer 513000.

See original GitHub issue

My system:

  • Ubuntu 20.10 (groovy)
  • NVIDIA-SMI 460.73.01
  • Driver Version: 460.73.01
  • CUDA Version: 11.2
  • torch 1.8.1
  • deepspeed 0.3.16

It works for some combinations but fails for others; the behavior appears to be random.

code:

import deepspeed, torch

# Toy model: a single 512 -> 1000 linear layer.
model = torch.nn.Sequential(torch.nn.Linear(512, 1000))

# ZeRO stage 3 with both parameters and optimizer state offloaded to NVMe.
deepspeed_args = {
    "train_batch_size": 1,
    "gradient_accumulation_steps": 1,
    "fp16": {
        "enabled": True,
        "loss_scale": 0,
        "initial_scale_power": 3,
        "loss_scale_window": 1000,
        "hysteresis": 1,
        "min_loss_scale": 1
    },
    "gradient_clipping": 1.0,
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/mnt/nvme0n1p3/"
        },
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/mnt/nvme0n1p3/"
        },
    },
}

optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=[0.8, 0.99], eps=1e-8, weight_decay=3e-7)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.95)

model, optimizer, _, scheduler = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    lr_scheduler=scheduler,
    config_params=deepspeed_args,
)
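For context (my own check, not part of the original report): the model above has 512 × 1000 weight elements plus 1000 bias elements, i.e. 513,000 parameters in total, which matches the num_elems 513000 / largest_numel 513000 values that appear in the swap-buffer log below. A quick way to confirm the count with plain PyTorch:

import torch

model = torch.nn.Sequential(torch.nn.Linear(512, 1000))
print(sum(p.numel() for p in model.parameters()))  # 512*1000 + 1000 = 513000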

error:

[2021-05-03 12:50:20,837] [INFO] [logging.py:60:log_dist] [Rank -1] DeepSpeed info: version=0.3.16, git-hash=unknown, git-branch=unknown
[2021-05-03 12:50:20,838] [INFO] [distributed.py:36:init_distributed] Not using the DeepSpeed or torch.distributed launchers, attempting to detect MPI environment...
[2021-05-03 12:50:21,196] [INFO] [distributed.py:83:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=192.168.29.88, master_port=29500
[2021-05-03 12:50:21,197] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
[2021-05-03 12:50:23,337] [INFO] [utils.py:11:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
[2021-05-03 12:50:23,369] [INFO] [engine.py:601:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer
[2021-05-03 12:50:23,369] [INFO] [engine.py:606:_configure_optimizer] Using client Optimizer as basic optimizer
[2021-05-03 12:50:23,369] [INFO] [engine.py:615:_configure_optimizer] DeepSpeed Basic Optimizer = Adam
Checking ZeRO support for optimizer=Adam type=<class 'torch.optim.adam.Adam'>
[2021-05-03 12:50:23,369] [INFO] [logging.py:60:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer
Initializing ZeRO Stage 3
[2021-05-03 12:50:23,391] [INFO] [utils.py:583:see_memory_usage] Stage 3 initialize beginning
/home/vbansal21/.local/lib/python3.8/site-packages/torch/cuda/memory.py:373: FutureWarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved
  warnings.warn(
/home/vbansal21/.local/lib/python3.8/site-packages/torch/cuda/memory.py:381: FutureWarning: torch.cuda.max_memory_cached has been renamed to torch.cuda.max_memory_reserved
  warnings.warn(
[2021-05-03 12:50:23,392] [INFO] [utils.py:584:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2021-05-03 12:50:23,392] [INFO] [utils.py:592:see_memory_usage] CPU Virtual Memory: used = 6.57 GB, percent = 28.2%
[2021-05-03 12:50:23,392] [INFO] [stage3.py:624:__init__] Reduce bucket size 500000000
[2021-05-03 12:50:23,392] [INFO] [stage3.py:625:__init__] Allgather bucket size 50000000
Using /home/vbansal21/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /home/vbansal21/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.19048452377319336 seconds
[2021-05-03 12:50:23,816] [INFO] [stage3.py:39:print_rank_0] FP16 params swapping is True, Max params in CPU is 1000000000.0
[2021-05-03 12:50:23,840] [INFO] [utils.py:583:see_memory_usage] Before creating fp16 partitions
/home/vbansal21/.local/lib/python3.8/site-packages/torch/cuda/memory.py:373: FutureWarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved
  warnings.warn(
/home/vbansal21/.local/lib/python3.8/site-packages/torch/cuda/memory.py:381: FutureWarning: torch.cuda.max_memory_cached has been renamed to torch.cuda.max_memory_reserved
  warnings.warn(
[2021-05-03 12:50:23,841] [INFO] [utils.py:584:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2021-05-03 12:50:23,841] [INFO] [utils.py:592:see_memory_usage] CPU Virtual Memory: used = 6.59 GB, percent = 28.2%
[2021-05-03 12:50:23,861] [INFO] [stage3.py:39:print_rank_0] fp16 group 0 has 1 subgroups
[2021-05-03 12:50:23,862] [INFO] [stage3.py:924:_configure_tensor_swapping] Tensor Swapping: Adding optimizer tensors
[2021-05-03 12:50:23,866] [INFO] [utils.py:30:print_object] SwapBufferManager:
[2021-05-03 12:50:23,866] [INFO] [utils.py:34:print_object] count ........................ 4
[2021-05-03 12:50:23,866] [INFO] [utils.py:34:print_object] dtype ........................ torch.float32
[2021-05-03 12:50:23,866] [INFO] [utils.py:34:print_object] free_buffer_index ............ [0, 1, 2, 3]
[2021-05-03 12:50:23,866] [INFO] [utils.py:34:print_object] gigabytes .................... 0.007644295692443848
[2021-05-03 12:50:23,866] [INFO] [utils.py:34:print_object] num_elems .................... 513000
[2021-05-03 12:50:23,866] [INFO] [utils.py:34:print_object] used_buffer_index ............ {}
Using /home/vbansal21/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /home/vbansal21/.cache/torch_extensions/async_io/build.ninja...
Building extension module async_io...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module async_io...
Time to load async_io op: 0.28775739669799805 seconds
[2021-05-03 12:50:24,215] [INFO] [utils.py:30:print_object] PartitionedOptimizerSwapper:
[2021-05-03 12:50:24,215] [INFO] [utils.py:34:print_object] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2021-05-03 12:50:24,215] [INFO] [utils.py:34:print_object] aligned_bytes ................ 1024
[2021-05-03 12:50:24,215] [INFO] [utils.py:34:print_object] dtype ........................ torch.float32
[2021-05-03 12:50:24,215] [INFO] [utils.py:34:print_object] largest_numel ................ 513000
[2021-05-03 12:50:24,215] [INFO] [utils.py:34:print_object] min_aio_bytes ................ 1048576
[2021-05-03 12:50:24,215] [INFO] [utils.py:34:print_object] numel_alignment .............. 256
[2021-05-03 12:50:24,215] [INFO] [utils.py:34:print_object] swap_config .................. {'device': 'nvme', 'nvme_path': '/mnt/nvme0n1p3/', 'buffer_count': 4, 'pin_memory': False, 'pipeline_read': False, 'pipeline_write': False, 'fast_init': False, 'pipeline': False}
[2021-05-03 12:50:24,215] [INFO] [utils.py:34:print_object] swap_element_size ............ 4
[2021-05-03 12:50:24,215] [INFO] [utils.py:34:print_object] swap_folder .................. /mnt/nvme0n1p3/zero_stage_3/optimizer/rank0
Traceback (most recent call last):
  File "/home/vbansal21/Documents/test_run.py", line 43, in <module>
    model, optimizer,_,scheduler = deepspeed.initialize(model=model,optimizer=optimizer,lr_scheduler=scheduler,config_params=deepspeed_args)
  File "/home/vbansal21/.local/lib/python3.8/site-packages/deepspeed/__init__.py", line 120, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/vbansal21/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 172, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/vbansal21/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 628, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/home/vbansal21/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 793, in _configure_zero_optimizer
    optimizer = FP16_DeepSpeedZeroOptimizer_Stage3(
  File "/home/vbansal21/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 806, in __init__
    self._create_fp32_partitions()
  File "/home/vbansal21/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1298, in _create_fp32_partitions
    self.optimizer_swapper.initialize_parameters(
  File "/home/vbansal21/.local/lib/python3.8/site-packages/deepspeed/runtime/swap_tensor/partitioned_optimizer_swapper.py", line 71, in initialize_parameters
    self._initialize_parameters(parameters=parameters,
  File "/home/vbansal21/.local/lib/python3.8/site-packages/deepspeed/runtime/swap_tensor/optimizer_utils.py", line 374, in _initialize_parameters
    self._swap_out_unpinned_tensors(aio_handle=aio_handle,
  File "/home/vbansal21/.local/lib/python3.8/site-packages/deepspeed/runtime/swap_tensor/optimizer_utils.py", line 421, in _swap_out_unpinned_tensors
    swap_buffers = get_sized_buffers(pinned_buffers, swap_lengths)
  File "/home/vbansal21/.local/lib/python3.8/site-packages/deepspeed/runtime/swap_tensor/utils.py", line 237, in get_sized_buffers
    swap_buffers = [
  File "/home/vbansal21/.local/lib/python3.8/site-packages/deepspeed/runtime/swap_tensor/utils.py", line 238, in <listcomp>
    get_sized_buffer(buffer, num_elems) \
  File "/home/vbansal21/.local/lib/python3.8/site-packages/deepspeed/runtime/swap_tensor/utils.py", line 231, in get_sized_buffer
    assert num_elems <= buffer.numel(), \
AssertionError: num_elems 513024> buffer 513000
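One plausible reading of the numbers in the log, offered as an interpretation rather than a confirmed root cause: the SwapBufferManager allocates its pinned buffers at exactly largest_numel = 513000 elements, while the NVMe swap-out path rounds the requested tensor length up to the reported numel_alignment of 256 elements. Since 513000 is not a multiple of 256, the request becomes 513024 elements and no longer fits in the 513000-element buffer, which is exactly the "num_elems 513024> buffer 513000" assertion above. A small sketch of that arithmetic (illustrative only, not DeepSpeed's actual code):

largest_numel = 513000   # buffer size reported by SwapBufferManager
numel_alignment = 256    # alignment reported by PartitionedOptimizerSwapper

def align_up(numel, alignment):
    # Round numel up to the next multiple of alignment.
    remainder = numel % alignment
    return numel if remainder == 0 else numel + alignment - remainder

print(align_up(largest_numel, numel_alignment))  # 513024, which exceeds the 513000-element buffer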

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

3 reactions
tjruwase commented, May 3, 2021

@Vbansal21, @stas00, @thies1006, I reopened this because I have found the original bug and will link a PR shortly. Thanks for all your contributions to locating this bug.

2 reactions
thies1006 commented, May 3, 2021

@tjruwase, great that you found it! I tried it out with the Huggingface integration and it looks good. I'm now getting another error, but it appears later (AttributeError: 'NoneType' object has no attribute 'available_swap_in_buffers'); I filed another issue for it (#1035).

