Harmonize warnings/errors/documentation related to file size limit
In general, the file size limit is not made very clear to the user, especially when uploading a file to the Hub via the HTTP endpoint. The discussion started as part of https://github.com/huggingface/huggingface_hub/pull/847, on whether we should throw an explicit error and provide guidance when a file is too big to be uploaded (see comments https://github.com/huggingface/huggingface_hub/pull/847#discussion_r861660844, https://github.com/huggingface/huggingface_hub/pull/847#discussion_r861999232, https://github.com/huggingface/huggingface_hub/pull/847#discussion_r861999702, https://github.com/huggingface/huggingface_hub/pull/847#discussion_r862002526 and https://github.com/huggingface/huggingface_hub/pull/847#discussion_r947626430).
In the documentation, we also mention a 50GB limit in `upload_file` and, in general, a 5GB limit before using LFS.
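For context, here is what a single-file upload looks like today (a minimal sketch assuming the public `upload_file` helper from `huggingface_hub`; the repo id and file name are placeholders):

```python
from huggingface_hub import upload_file

# A single HTTP upload; the documented limits mentioned above apply to
# this call. "username/my-model" is a placeholder repo id.
upload_file(
    path_or_fileobj="pytorch_model.bin",
    path_in_repo="pytorch_model.bin",
    repo_id="username/my-model",
)
```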
To be discussed:

1. What is the actual limit for a single file to be uploaded via HTTP? Is there even a limit?
2. What is the limit we want to set for a single file to be uploaded? In particular, @LysandreJik mentioned that big files are not served through the CDN.
   a. A possibility I see here is to raise a `ValueError` if the file size is above 30GB (hard limit) and raise a warning if it is above 10GB (soft limit); see the sketch after this list.
3. How do we document that consistently?
   a. I would propose a dedicated page/section in the documentation that each method (`create_commit`, `upload_file`, `upload_folder`, `push_to_hub`, …) could refer to.
4. (extra) Do we want to propose a utility helper to cut a big file into shards, either before uploading or on the fly? (Note: this is not exactly the same as uploading a big LFS file in chunks; see the second sketch below.)
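As a concrete illustration of point 2.a, a minimal sketch of such a pre-upload check (the function name, thresholds, and message wording are taken from the proposal above and are illustrative, not actual `huggingface_hub` behavior):

```python
import os
import warnings

SOFT_LIMIT = 10 * 1024**3  # proposed soft limit: warn above 10GB
HARD_LIMIT = 30 * 1024**3  # proposed hard limit: error above 30GB

def check_upload_size(path: str) -> None:
    """Hypothetical pre-upload check implementing the proposal in 2.a."""
    size = os.path.getsize(path)
    size_gb = size / 1024**3
    if size > HARD_LIMIT:
        raise ValueError(
            f"{path} is {size_gb:.1f}GB, above the 30GB hard limit. "
            "Consider splitting it into smaller shards before uploading."
        )
    if size > SOFT_LIMIT:
        warnings.warn(
            f"{path} is {size_gb:.1f}GB; files above 10GB are not served "
            "through the CDN and may be slow to download."
        )
```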
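And for point 4, a rough byte-level sketch of what such a shard helper could look like (the function name, shard naming scheme, and default shard size are all hypothetical):

```python
from pathlib import Path

BUFFER_SIZE = 64 * 1024**2  # copy in 64MB blocks to keep memory bounded

def shard_file(path: str, shard_size: int = 5 * 1024**3) -> list[Path]:
    """Hypothetical helper: split `path` into `<name>.part-NNNNN` files."""
    src = Path(path)
    shards: list[Path] = []
    with src.open("rb") as f:
        index = 0
        # Each iteration of the outer loop produces one shard.
        while block := f.read(min(BUFFER_SIZE, shard_size)):
            shard = src.with_name(f"{src.name}.part-{index:05d}")
            with shard.open("wb") as out:
                written = 0
                while block:
                    out.write(block)
                    written += len(block)
                    if written >= shard_size:
                        break
                    block = f.read(min(BUFFER_SIZE, shard_size - written))
            shards.append(shard)
            index += 1
    return shards
```

Reassembly would be a simple concatenation of the parts in order, which is one way this differs from uploading a big LFS file in chunks: the Hub would see the shards as independent files.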
Top GitHub Comments
To clarify a bit all the different numbers: […] (the `/preupload` endpoint, and also the deprecated `/upload` one). Encouraging smaller files is a good idea though, so why not a warning for when > 10GB.

I would say yes for 3.; for 4. I would say why not if it's gonna be useful to downstream libs, but I will let others chime in 😃
Also cc @Kakulukian @allendorf for information
Also yes from me for 3. (inside `hub-docs` probably?)

For 4. I think it's on the downstream libraries to do it, because they have more context to do it in a better way. For instance, `transformers` has utilities to split super large checkpoint files into multiple files, where each file is a valid weight file (containing certain layers, for instance).

cc @LysandreJik @sgugger too
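For reference, the `transformers` utility mentioned here is the sharded checkpoint support in `save_pretrained`; a minimal usage sketch (the model name and target directory are placeholders, assuming a `transformers` version that supports `max_shard_size`):

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")

# Splits the weights into multiple shards, each below 5GB and each a valid
# checkpoint file, plus an index file mapping layers to shards.
model.save_pretrained("sharded-checkpoint", max_shard_size="5GB")
```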