[BUG] 'NoneType' object has no attribute 'reserve_partitioned_swap_space' with params offloading to nvme enabled
Describe the bug
Exception 'NoneType' object has no attribute 'reserve_partitioned_swap_space' when parameter offloading to NVMe is enabled.
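For context on what this error means mechanically: the reserve call was made on an object that was still None, i.e. the swapper was never constructed before being used. A minimal, purely illustrative sketch of that failure mode (all class and attribute names here are hypothetical, not DeepSpeed's actual internals):

# Sketch of the failure mode; names are hypothetical, not DeepSpeed's code.
class ParamSwapper:
    """Stand-in for an NVMe tensor swapper."""

    def reserve_partitioned_swap_space(self, numel: int) -> None:
        print(f"reserving swap space for {numel} elements")


class Stage3OptimizerSketch:
    def __init__(self) -> None:
        # Expected to be replaced with a real swapper while wiring up NVMe
        # parameter offload; if that wiring is skipped, it stays None.
        self.param_swapper = None

    def reserve_swap_space(self, numel: int) -> None:
        # Fails whenever param_swapper was never initialized.
        self.param_swapper.reserve_partitioned_swap_space(numel)


try:
    Stage3OptimizerSketch().reserve_swap_space(10**8)
except AttributeError as e:
    print(e)  # 'NoneType' object has no attribute 'reserve_partitioned_swap_space'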
To Reproduce
Run training with this configuration:
{
  "train_batch_size": 15,
  "fp16": {
    "enabled": true,
    "min_loss_scale": 1,
    "opt_level": "O3"
  },
  "zero_optimization": {
    "stage": 3,
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/home/deepschneider/deepspeed",
      "buffer_count": 5,
      "buffer_size": 1e8,
      "max_in_cpu": 1e9
    },
    "offload_optimizer": {
      "device": "nvme",
      "nvme_path": "/home/deepschneider/deepspeed",
      "buffer_count": 4,
      "pipeline_read": false,
      "pipeline_write": false,
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "contiguous_gradients": true,
    "overlap_comm": true,
    "aio": {
      "block_size": 1048576,
      "queue_depth": 8,
      "thread_count": 1,
      "single_submit": false,
      "overlap_events": true
    }
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 5e-05,
      "betas": [0.9, 0.999],
      "eps": 1e-08
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 5e-05,
      "warmup_num_steps": 100
    }
  }
}
I'm getting the 'NoneType' object has no attribute 'reserve_partitioned_swap_space' exception.
This configuration, identical except that the offload_param block is removed, works fine:
{
  "train_batch_size": 15,
  "fp16": {
    "enabled": true,
    "min_loss_scale": 1,
    "opt_level": "O3"
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "nvme",
      "nvme_path": "/home/deepschneider/deepspeed",
      "buffer_count": 4,
      "pipeline_read": false,
      "pipeline_write": false,
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "contiguous_gradients": true,
    "overlap_comm": true,
    "aio": {
      "block_size": 1048576,
      "queue_depth": 8,
      "thread_count": 1,
      "single_submit": false,
      "overlap_events": true
    }
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 5e-05,
      "betas": [0.9, 0.999],
      "eps": 1e-08
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 5e-05,
      "warmup_num_steps": 100
    }
  }
}
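The two configs above differ only in the offload_param block; a quick sanity check of that claim (the file names below are placeholders for the failing and working configs):

# Confirm the failing config differs from the working one only by
# "offload_param"; file names are placeholders, not from the original issue.
import json

with open("ds_config_failing.json") as f:
    failing = json.load(f)
with open("ds_config_working.json") as f:
    working = json.load(f)

extra = set(failing["zero_optimization"]) - set(working["zero_optimization"])
print(extra)  # expected: {'offload_param'}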
Expected behavior
Both configurations should work.
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
async_io ............... [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/deepschneider/PycharmProjects/gpt-neo-fine-tuning-example/venv/lib/python3.8/site-packages/torch']
torch version .................... 1.10.0+cu113
torch cuda version ............... 11.3
nvcc version ..................... 11.3
deepspeed install path ........... ['/home/deepschneider/PycharmProjects/gpt-neo-fine-tuning-example/venv/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.5.6+43e6432, 43e6432, cpu-adam/fix-scalar-compile
deepspeed wheel compiled w. ...... torch 1.10, cuda 11.3
System info:
- OS: Ubuntu 20.04
- GPU: 1x A6000
- Interconnects: none (single machine)
- Python version: 3.8.10
Launcher context
Hugging Face Trainer (not the deepspeed launcher or MPI).
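For completeness, a minimal sketch of how such a DeepSpeed JSON config is typically handed to the Hugging Face Trainer (the output directory and config path below are placeholders):

# Sketch only: paths are placeholders. Passing the JSON path via `deepspeed`
# makes the Trainer initialize the DeepSpeed engine with that config.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=15,  # single GPU, consistent with train_batch_size=15
    fp16=True,
    deepspeed="ds_config.json",      # the failing config shown above
)
# Trainer(model=..., args=args, train_dataset=...).train() then runs the
# fine-tuning under ZeRO stage 3 with NVMe offload.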
Top GitHub Comments
@dredwardhyde, thanks for confirming your experience. Actually, this 6B model should run easily with the Hugging Face Trainer using the parameter and optimizer offload configurations. I think a few things could help:
We can help with getting this fine-tuning working, so can you please open a new issue for that purpose? Thanks.
@dredwardhyde, can you please test the PR?
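For reference, an unmerged PR can usually be tried by installing straight from its GitHub ref, e.g. pip install git+https://github.com/microsoft/DeepSpeed.git@refs/pull/<PR_NUMBER>/head (the PR number is not shown in this thread, so it is left as a placeholder).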