RFC: split checkpoint load/save for huge models
🚀 Feature request
While discussing with the PyTorch devs the ability to load/save a state_dict at a finer granularity, without materializing the whole state_dict in memory, we ran into an additional issue: the model file itself is simply too large. I'd like to propose that transformers support multi-part checkpoints.
Reasons for the need:
- the hub limitation: CloudFront does not support >20GB files, so downloads via S3 can't be fast for those large files
- the current PyTorch issue of loading the whole state_dict into memory, which requires 2x the model size in memory; checkpoint conversion is quite demanding on memory as well, for the same reason
- in general it's a potential issue for users with an imperfect up/down internet connection; uploading/downloading 25GB files is still not easy for everyone
Possible solutions:
- as mentioned here, SplitCheckpoint already implements a possible solution, which saves each state_dict key separately
- as solution 1, but saving groups of keys, e.g. each layer's keys together in one pickled state_dict per layer. I looked at some large models and they have a huge number of keys; even t5-small has ~150. But this approach would be more complicated, since we would now need to define the container block, and it will differ from model to model. Maybe group by sub-module? So the first solution is probably much simpler.
The only addition I'd propose is to name the files with the full key name, rather than obscure names like m18.pt as implemented by SplitCheckpoint, which require an extra file to do look-ups.
So my proposal is:
```
config.json
merges.txt
README.md
tokenizer.json
vocab.json
pytorch_model/map.pt
pytorch_model/shared.weight.pt
pytorch_model/encoder.embed_tokens.weight.pt
[...]
pytorch_model/encoder.block.3.layer.0.SelfAttention.v.weight.pt
[...]
pytorch_model/decoder.block.5.layer.1.EncDecAttention.q.weight.pt
[...]
pytorch_model/lm_head.weight.pt
```
These are all raw files, not members of any archive, and map.pt just holds the list of keys in their original order, for when the OrderedDict ordering matters.
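To make the proposed layout concrete, here is a minimal sketch of what per-key save/load could look like; the names `save_split_checkpoint` and `load_split_checkpoint` are placeholders, not an existing transformers API:

```python
import os
from collections import OrderedDict

import torch


def save_split_checkpoint(state_dict, save_dir):
    """Save each state_dict entry as its own file, named after the full key."""
    os.makedirs(save_dir, exist_ok=True)
    # map.pt only records the key order, so the OrderedDict can be rebuilt on load
    torch.save(list(state_dict.keys()), os.path.join(save_dir, "map.pt"))
    for key, tensor in state_dict.items():
        torch.save(tensor, os.path.join(save_dir, f"{key}.pt"))


def load_split_checkpoint(save_dir):
    """Rebuild the state_dict one tensor at a time; the tensors could also be
    fed into the model incrementally instead of building the full dict."""
    keys = torch.load(os.path.join(save_dir, "map.pt"))
    state_dict = OrderedDict()
    for key in keys:
        state_dict[key] = torch.load(os.path.join(save_dir, f"{key}.pt"))
    return state_dict
```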
The cost of the first solution is somewhat slower save/load. I haven't benchmarked it, but IO will be the bottleneck here, and the ZIP archive currently gets unravelled one tensor at a time anyway, so the difference is likely to be negligible.
Other solutions are welcome.
Other examples of split checkpoints:
- DeepSpeed's pipeline parallelism (PP) saves each layer as a separate checkpoint, which makes it possible to quickly change the PP degree at run time.
Threshold:
- we need to define the threshold at which we automatically switch to this multi-part format, unless the user overrides the default. We can probably use the size of the model as the measure; I think it should be 3B parameters or even less (a rough sketch of such a check follows this list). At 3B parameters the resulting file sizes are:
  - 6GB in fp16/bf16
  - 12GB in fp32
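A rough sketch of such a check, assuming we go by parameter count (the helper name and the 3B default are illustrative only):

```python
def should_use_multipart(model, param_threshold=3_000_000_000):
    """Decide whether a model is large enough to warrant the multi-part format."""
    num_params = sum(p.numel() for p in model.parameters())
    # e.g. 3B params -> ~6GB in fp16/bf16 (2 bytes/param) or ~12GB in fp32 (4 bytes/param)
    return num_params >= param_threshold
```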
@patrickvonplaten, @patil-suraj, @LysandreJik, @sgugger, @julien-c
Plus, we're trying to stay away from pickle these days 😃
Just posting a small script I used to shard any model: https://gist.github.com/younesbelkada/382016361580b939a87edcddc94c6593. People may want to use it in the future to push sharded models!
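For illustration only (this is not the gist above), a minimal size-based sharding sketch could look like the following; the function name and the 5GB default are assumptions:

```python
import torch


def shard_state_dict(state_dict, max_shard_size_bytes=5 * 1024**3):
    """Greedily split a state_dict into shards whose tensor bytes stay under the limit."""
    shards, current, current_size = [], {}, 0
    for key, tensor in state_dict.items():
        tensor_size = tensor.numel() * tensor.element_size()
        if current and current_size + tensor_size > max_shard_size_bytes:
            shards.append(current)
            current, current_size = {}, 0
        current[key] = tensor
        current_size += tensor_size
    if current:
        shards.append(current)
    return shards


# Example usage: save each shard as its own file
# shards = shard_state_dict(model.state_dict())
# for i, shard in enumerate(shards, start=1):
#     torch.save(shard, f"pytorch_model-{i:05d}-of-{len(shards):05d}.bin")
```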