RFC: split checkpoint load/save for huge models


🚀 Feature request

While discussing with the PyTorch devs the ability to load/save a state_dict at a finer granularity, so that the whole state_dict never needs to be materialized in memory, we ran into an additional issue: the model file itself is simply too large. I’d like to propose that transformers support multi-part checkpoints.

Reasons for the need:

  • the Hub limitation: CloudFront does not support >20GB files, so downloads via S3 can’t be fast for such large files
  • the current PyTorch behavior of loading the whole state_dict into memory, which requires 2x the model size in RAM; checkpoint conversion is memory-hungry for the same reason
  • in general, it’s a problem for users with unreliable internet connections: uploading or downloading 25GB files is still not easy for everyone

Possible solutions:

  1. as mentioned here, SplitCheckpoint already implements a possible solution, which saves each state_dict key separately
  2. as in solution 1, but saving groups of keys together, e.g. one pickled state_dict per layer. I looked at some large models and they have a huge number of keys; even t5-small has ~150. But this approach is more complicated, since we’d need to define the container block and it would differ from model to model. Maybe group by sub-module (see the sketch after this list)? So perhaps the first solution is much simpler.
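
For solution 2, a minimal sketch of what grouping by sub-module could look like (the `group_keys` helper and its `depth` parameter are hypothetical, not part of any existing API):

```python
from collections import defaultdict


def group_keys(state_dict, depth=3):
    """Group state_dict keys by their leading sub-module path.

    With depth=3, "encoder.block.3.layer.0.SelfAttention.v.weight"
    lands in the group "encoder.block.3", i.e. one shard per layer;
    shorter keys like "shared.weight" form their own group.
    """
    groups = defaultdict(dict)
    for key, tensor in state_dict.items():
        prefix = ".".join(key.split(".")[:depth])
        groups[prefix][key] = tensor
    return groups


# hypothetical usage: one pickled state_dict per group
# for prefix, shard in group_keys(model.state_dict()).items():
#     torch.save(shard, f"pytorch_model/{prefix}.pt")
```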

The only addition I’d propose is to name the files with the full key name, rather than obscure names like m18.pt as implemented by SplitCheckpoint, which require an extra file for lookups.

So my proposal is:

config.json
merges.txt
README.md
tokenizer.json
vocab.json
pytorch_model/map.pt
pytorch_model/shared.weight.pt
pytorch_model/encoder.embed_tokens.weight.pt
[...]
pytorch_model/encoder.block.3.layer.0.SelfAttention.v.weight.pt
[...]
pytorch_model/decoder.block.5.layer.1.EncDecAttention.q.weight.pt
[...]
pytorch_model/lm_head.weight.pt

These are all raw files, not members of any archive, and map.pt just holds the list of keys in their original order, for when OrderedDict ordering matters.
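
A minimal sketch of what save/load could look like under this layout (`save_multipart` and `load_multipart` are hypothetical names, not a proposed API):

```python
import os
from collections import OrderedDict

import torch


def save_multipart(state_dict, save_dir):
    """Save one raw .pt file per key, plus map.pt preserving key order."""
    os.makedirs(save_dir, exist_ok=True)
    torch.save(list(state_dict.keys()), os.path.join(save_dir, "map.pt"))
    for key, tensor in state_dict.items():
        torch.save(tensor, os.path.join(save_dir, f"{key}.pt"))


def load_multipart(save_dir):
    """Rebuild the state_dict one tensor at a time: only a single
    tensor is ever held in memory on top of the growing dict."""
    state_dict = OrderedDict()
    for key in torch.load(os.path.join(save_dir, "map.pt")):
        state_dict[key] = torch.load(os.path.join(save_dir, f"{key}.pt"))
    return state_dict
```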

The cost of the first solution is somewhat slower save/load. I haven’t benchmarked it, but IO will be the bottleneck here, and the current ZIP-based format gets unravelled one tensor at a time anyway, so the difference is likely negligible.
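
A quick micro-benchmark along these lines could check the negligible-difference claim (this reuses the hypothetical `save_multipart` sketch above; t5-small is just a convenient small model):

```python
import time

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("t5-small")
state_dict = model.state_dict()

start = time.perf_counter()
torch.save(state_dict, "pytorch_model.bin")  # current single-file format
print(f"single file:   {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
save_multipart(state_dict, "pytorch_model")  # sketch from above
print(f"per-key files: {time.perf_counter() - start:.2f}s")
```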

Other solutions are welcome.

Other examples of split checkpoints:

  • DeepSpeed’s pipeline parallelism (PP) saves each layer as a separate checkpoint, which makes it possible to quickly change the PP degree at runtime.

Threshold:

  • we need to define the threshold at which we automatically switch to this multi-part format, unless the user overrides the default. The size of the model is probably the right measurement. I think it should be 3B parameters or even less; at 3B parameters the resulting file sizes are (see the arithmetic below):
  1. 6GB in fp16/bf16
  2. 12GB in fp32
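
The arithmetic behind those numbers is just the parameter count times bytes per element (2 in fp16/bf16, 4 in fp32); `checkpoint_size_gb` is an illustrative helper:

```python
def checkpoint_size_gb(num_params, bytes_per_param):
    """Rough checkpoint size: parameter count times bytes per element."""
    return num_params * bytes_per_param / 1e9


for dtype, nbytes in [("fp16/bf16", 2), ("fp32", 4)]:
    print(f"3B params in {dtype}: {checkpoint_size_gb(3e9, nbytes):.0f}GB")
# 3B params in fp16/bf16: 6GB
# 3B params in fp32: 12GB
```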

@patrickvonplaten, @patil-suraj, @LysandreJik, @sgugger, @julien-c

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 5
  • Comments: 32 (32 by maintainers)

Top GitHub Comments

2 reactions
sgugger commented, Mar 21, 2022

The index mapping file pytorch_model.bin.index could also be a human-readable file (JSON, for instance), by the way. Not sure we gain much by having it stored in binary.

Plus, we’re trying to stay away from pickle these days 😃
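
For illustration, a JSON index along these lines would be both human-readable and pickle-free (the schema and filename here are assumptions for the sketch, though transformers later shipped a similar pytorch_model.bin.index.json format):

```python
import json

# A human-readable JSON index instead of a pickled map; the exact
# schema is illustrative, not a spec from this thread.
index = {
    "weight_map": {
        "shared.weight": "pytorch_model/shared.weight.pt",
        "encoder.embed_tokens.weight": "pytorch_model/encoder.embed_tokens.weight.pt",
        "lm_head.weight": "pytorch_model/lm_head.weight.pt",
    }
}
with open("pytorch_model.bin.index.json", "w") as f:
    json.dump(index, f, indent=2)
```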

1 reaction
younesbelkada commented, Aug 11, 2022

Just posting a small script I used to shard any model: https://gist.github.com/younesbelkada/382016361580b939a87edcddc94c6593; people may want to use it in the future to push sharded models!
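
The gist itself isn’t reproduced here, but since sharding support landed in transformers (around v4.18), the built-in equivalent is a one-liner; the model and shard size below are just for illustration:

```python
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Any checkpoint larger than max_shard_size is split into numbered
# shard files plus a pytorch_model.bin.index.json mapping keys to shards.
model.save_pretrained("t5-small-sharded", max_shard_size="100MB")
```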
