Harmonize warnings/errors/documentation related to file size limit


In general, the file size limit is not made clear to users, especially when uploading a file to the Hub via the HTTP endpoint. A discussion started in https://github.com/huggingface/huggingface_hub/pull/847 on whether we should raise an explicit error and provide guidance when a file is too big to be uploaded (see comments https://github.com/huggingface/huggingface_hub/pull/847#discussion_r861660844, https://github.com/huggingface/huggingface_hub/pull/847#discussion_r861999232, https://github.com/huggingface/huggingface_hub/pull/847#discussion_r861999702, https://github.com/huggingface/huggingface_hub/pull/847#discussion_r862002526 and https://github.com/huggingface/huggingface_hub/pull/847#discussion_r947626430).

In the documentation, we also mention a limit of 50GB in upload_file and a limit of 5GB before using LFS in general.

To be discussed:

  1. What is the actual limit for a single file uploaded via HTTP? Is there even a limit?
  2. What limit do we want to set for a single uploaded file? In particular, @LysandreJik mentioned that big files are not served through the CDN.
     a. One possibility is to raise a ValueError if the file size is above 30GB (hard limit) and a warning if it is above 10GB (soft limit).
  3. How do we document this consistently?
     a. I would propose a dedicated page/section in the documentation that each method (create_commit, upload_file, upload_folder, push_to_hub, …) could refer to.
  4. (extra) Do we want to provide a utility helper to cut a big file into shards, either before uploading or on the fly? (Note: this is not exactly the same as uploading a big LFS file in chunks.)
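
As a rough illustration of the soft/hard limit idea in item 2, a size check could look like the sketch below. The helper name `check_upload_size`, the constants, and the choice of binary gigabytes are assumptions made for this example; none of them are part of huggingface_hub.

```python
import os
import warnings

# Assumption: 1 GB = 1024**3 bytes. The issue does not say which unit is meant.
GB = 1024 ** 3
SOFT_LIMIT = 10 * GB  # proposed threshold for a warning
HARD_LIMIT = 30 * GB  # proposed threshold for an error


def check_upload_size(size_in_bytes: int) -> None:
    """Hypothetical helper: warn above the soft limit, raise above the hard limit."""
    if size_in_bytes > HARD_LIMIT:
        raise ValueError(
            f"File is {size_in_bytes / GB:.1f}GB, above the {HARD_LIMIT // GB}GB hard limit. "
            "Consider splitting it into smaller files."
        )
    if size_in_bytes > SOFT_LIMIT:
        warnings.warn(
            f"File is {size_in_bytes / GB:.1f}GB, above the {SOFT_LIMIT // GB}GB soft limit. "
            "Smaller files are recommended.",
            UserWarning,
        )


def check_file(path: str) -> None:
    """Run the same check on a file on disk."""
    check_upload_size(os.path.getsize(path))
```

Keeping the check on a plain byte count means it could be called from any of the upload methods before any network request is made.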

Issue Analytics

  • State: open
  • Created: a year ago
  • Reactions: 2
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

8 reactions
Pierrci commented, Aug 17, 2022

To clarify a bit all the different numbers:

  • A file must be uploaded through LFS if:
    • It’s a binary file and its size is > 1MB
    • It’s not a binary file and its size is > 10MB
  • 5GB is the threshold after which an LFS file needs to be uploaded through our multipart custom transfer agent
  • 30GB is the size limit for cached files w/ CloudFront
  • 50GB is the size limit for served files w/ CloudFront. This means files between 30 and 50GB can be served through CloudFront, though they’re not cached by it (there is no doc about this upper limit, this is from our own observations)
  1. As a result of those limits, the maximum file size that we allow to be uploaded is also 50GB.
  2. An error is already returned by the server if > 50GB (by the /preupload endpoint, and also the deprecated /upload one). Encouraging smaller files is a good idea though, so why not a warning for when > 10GB.
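
To make the numbers above easier to compare, here is a minimal sketch that encodes them as a decision function. The function names, the route labels, the binary units, and the behavior at exactly the boundary values are all assumptions for illustration; the actual checks happen server-side and may differ.

```python
# Assumption: binary units (1 MB = 1024**2 bytes, 1 GB = 1024**3 bytes).
MB = 1024 ** 2
GB = 1024 ** 3


def needs_lfs(size: int, is_binary: bool) -> bool:
    """LFS is required above 1MB for binary files and above 10MB otherwise."""
    threshold = 1 * MB if is_binary else 10 * MB
    return size > threshold


def upload_route(size: int, is_binary: bool) -> str:
    """Return a hypothetical label describing how a file of this size would be handled."""
    if size > 50 * GB:
        return "rejected"  # above the maximum allowed upload size
    if not needs_lfs(size, is_binary):
        return "regular"  # small enough to be a regular (non-LFS) file
    if size > 5 * GB:
        return "lfs-multipart"  # goes through the multipart custom transfer agent
    return "lfs"


def cached_by_cdn(size: int) -> bool:
    """Files up to 30GB are cached by CloudFront; larger ones are served but not cached."""
    return size <= 30 * GB


print(upload_route(6 * GB, is_binary=True))
```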

I would say yes for 3., for 4. I would say why not if it’s gonna be useful to downstream libs, but I will let others chime in 😃

Also cc @Kakulukian @allendorf for information

3 reactions
julien-c commented, Aug 23, 2022

Also yes from me for 3. (inside hub-docs probably?)

For 4., I think it’s on the downstream libraries to do it, because they have more context to do it in a better way. For instance, transformers has utilities to split very large checkpoint files into multiple files, where each file is still a valid weight file (containing certain layers, for instance).

cc @LysandreJik @sgugger too
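
For reference, the "cut a big file into shards before uploading" option from item 4 could be sketched as a plain byte-level split like the one below. The function name `split_file` and the `.partNNNNN` naming scheme are hypothetical. Unlike the transformers approach described above, the resulting shards are not individually valid files and must be concatenated in order to recover the original.

```python
from pathlib import Path
from typing import List


def split_file(path: str, shard_size: int, chunk_size: int = 1 << 20) -> List[Path]:
    """Split a file into consecutive shards of at most `shard_size` bytes.

    Data is copied in chunks of at most `chunk_size` bytes so that a whole
    shard never has to be held in memory.
    """
    if shard_size <= 0 or chunk_size <= 0:
        raise ValueError("shard_size and chunk_size must be positive")
    src = Path(path)
    shards: List[Path] = []
    with src.open("rb") as f:
        while True:
            # Read the first chunk before creating a shard, so no empty shard is written.
            data = f.read(min(chunk_size, shard_size))
            if not data:
                break
            shard_path = src.with_name(f"{src.name}.part{len(shards):05d}")
            with shard_path.open("wb") as out:
                written = 0
                while data:
                    out.write(data)
                    written += len(data)
                    remaining = shard_size - written
                    if remaining <= 0:
                        break
                    data = f.read(min(chunk_size, remaining))
            shards.append(shard_path)
    return shards
```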
