
Enable upload_folder to upload content in chunks

See original GitHub issue

Describe the bug

When trying to convert and upload this dataset using the dataset converter tool, I get the following error in upload_folder (see logs).

Most datasets on Kaggle are quite large and weirdly structured, so if we want more datasets uploaded with the tool, the library should handle them (maybe by uploading in chunks). A rough sketch of such a chunked workaround is below.
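
For reference, one workaround on the converter side could be to split the folder into several smaller commits instead of a single one. This is only a sketch (the helper name, the chunk size, and the choice to use `create_commit`/`CommitOperationAdd` are mine, not part of the tool or a proposed API), but it shows the shape of a chunked upload:

```python
# Not an official fix -- just a sketch of the kind of chunking the converter could
# do on its side, given that upload_folder currently sends everything in one commit.
# Assumes huggingface_hub >= 0.9 (CommitOperationAdd / create_commit) and that
# splitting the upload across several commits is acceptable for the dataset.
from pathlib import Path

from huggingface_hub import CommitOperationAdd, HfApi


def upload_folder_in_chunks(folder_path, repo_id, token=None, chunk_size=50):
    """Upload a local folder as several commits of at most `chunk_size` files each."""
    api = HfApi()
    files = [p for p in Path(folder_path).rglob("*") if p.is_file()]

    for start in range(0, len(files), chunk_size):
        batch = files[start : start + chunk_size]
        operations = [
            CommitOperationAdd(
                path_in_repo=str(p.relative_to(folder_path)),
                path_or_fileobj=str(p),
            )
            for p in batch
        ]
        api.create_commit(
            repo_id=repo_id,
            repo_type="dataset",
            operations=operations,
            commit_message=f"Upload files {start}-{start + len(batch)}",
            token=token,
        )
```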

Reproduction

See this notebook and try to convert the above dataset if you decide to run it again (it already contains the logs as of now).
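
If the notebook isn't handy, a minimal stand-in reproduction (hypothetical local path; the repo id is taken from the traceback below) is simply to call upload_folder on a directory with a very large amount of content:

```python
# Hypothetical stand-in for the converter notebook: upload_folder sends the whole
# folder as a single commit, so a sufficiently large folder triggers the
# 413 / "request entity too large" error shown in the logs below.
from huggingface_hub import upload_folder

upload_folder(
    folder_path="/tmp/bird-species",   # large, deeply nested dataset dump (placeholder path)
    path_in_repo="",
    repo_id="merve/bird-species",      # repo id as seen in the traceback
    repo_type="dataset",
)
```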

Logs

/usr/local/lib/python3.7/dist-packages/requests/models.py in raise_for_status(self)
    940         if http_error_msg:
--> 941             raise HTTPError(http_error_msg, response=self)
    942 

HTTPError: 413 Client Error: Payload Too Large for url: https://huggingface.co/api/datasets/merve/bird-species/preupload/main

The above exception was the direct cause of the following exception:

HfHubHTTPError                            Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/ipywidgets/widgets/widget_output.py in inner(*args, **kwargs)
    101                     self.clear_output(*clear_args, **clear_kwargs)
    102                 with self:
--> 103                     return func(*args, **kwargs)
    104             return inner
    105         return capture_decorator

/content/huggingface-datasets-converter/huggingface_datasets_converter/convert.py in login_token_event(t)
    279         print(f"\t- Kaggle ID: {kaggle_id}")
    280         print(f"\t- Repo ID: {repo_id}")
--> 281         url = kaggle_to_hf(kaggle_id, repo_id)
    282         output.clear_output()
    283         print(f"You can view your dataset here: {url}")

/content/huggingface-datasets-converter/huggingface_datasets_converter/convert.py in kaggle_to_hf(kaggle_id, repo_id, token, unzip, path_in_repo)
    215         upload_file(path_or_fileobj=gitattributes_file.as_posix(), path_in_repo=".gitattributes", repo_id=repo_id, token=token, repo_type='dataset')
    216 
--> 217         upload_folder(folder_path=temp_dir, path_in_repo="", repo_id=repo_id, token=None, repo_type='dataset')
    218     # Try to make dataset card as well!
    219     card = DatasetCard.from_template(

/usr/local/lib/python3.7/dist-packages/huggingface_hub/utils/_validators.py in _inner_fn(*args, **kwargs)
     92                 validate_repo_id(arg_value)
     93 
---> 94         return fn(*args, **kwargs)
     95 
     96     return _inner_fn

/usr/local/lib/python3.7/dist-packages/huggingface_hub/hf_api.py in upload_folder(self, repo_id, folder_path, path_in_repo, commit_message, commit_description, token, repo_type, revision, create_pr, parent_commit, allow_patterns, ignore_patterns)
   2391             revision=revision,
   2392             create_pr=create_pr,
-> 2393             parent_commit=parent_commit,
   2394         )
   2395 

/usr/local/lib/python3.7/dist-packages/huggingface_hub/utils/_validators.py in _inner_fn(*args, **kwargs)
     92                 validate_repo_id(arg_value)
     93 
---> 94         return fn(*args, **kwargs)
     95 
     96     return _inner_fn

/usr/local/lib/python3.7/dist-packages/huggingface_hub/hf_api.py in create_commit(self, repo_id, operations, commit_message, commit_description, token, repo_type, revision, create_pr, num_threads, parent_commit)
   2035                 revision=revision,
   2036                 endpoint=self.endpoint,
-> 2037                 create_pr=create_pr,
   2038             )
   2039         except RepositoryNotFoundError as e:

/usr/local/lib/python3.7/dist-packages/huggingface_hub/utils/_validators.py in _inner_fn(*args, **kwargs)
     92                 validate_repo_id(arg_value)
     93 
---> 94         return fn(*args, **kwargs)
     95 
     96     return _inner_fn

/usr/local/lib/python3.7/dist-packages/huggingface_hub/_commit_api.py in fetch_upload_modes(additions, repo_type, repo_id, token, revision, endpoint, create_pr)
    375         params={"create_pr": "1"} if create_pr else None,
    376     )
--> 377     hf_raise_for_status(resp, endpoint_name="preupload")
    378 
    379     preupload_info = validate_preupload_info(resp.json())

/usr/local/lib/python3.7/dist-packages/huggingface_hub/utils/_errors.py in hf_raise_for_status(response, endpoint_name)
    252         # Convert `HTTPError` into a `HfHubHTTPError` to display request information
    253         # as well (request id and/or server error message)
--> 254         raise HfHubHTTPError(str(HTTPError), response=response) from e
    255 
    256 

HfHubHTTPError: <class 'requests.exceptions.HTTPError'> (Request ID: IdC2Rq6MbaM7tuOR-Q0Kr)

request entity too large

System Info

Using this branch of the converter tool: https://github.com/merveenoyan/huggingface-datasets-converter; the only change is the Hub version: https://github.com/huggingface/huggingface_hub.git@fix-auth-in-lfs-upload

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 11 (11 by maintainers)

Top GitHub Comments

coyotte508 commented, Oct 3, 2022 (3 reactions)

https://github.com/huggingface/hub-docs/pull/348 - the docs for the commit endpoints, with the “application/x-ndjson” content type.

Basically, the payload under that content type looks like this (one JSON object per line):

{key: "header", value: {"summary": string, "description"?: string, parentCommit?: string}}
{key: "file", value: { content: string; path: string; encoding?: "utf-8" | "base64"; }}
{key: "deletedFile", value: { path: string }}
{key: "lfsFile", value: { path: string; algo: "sha256"; oid: string; size?: number; }}

There can be multiple files, LFS files, and deleted files, one line for each. Each line is a JSON object. If we add other features to the commit API (e.g. to rename files or delete folders), they will follow the same pattern: a plural word for the application/json content type with an array of objects, and a singular word in the key field for the application/x-ndjson content type, with an object in the value field.

There’s a maximum of 25k LFS files and a 1 GB payload.
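
(For illustration only, not part of the original comment: a minimal sketch of what such an x-ndjson body could look like when built by hand. The field names follow the schema above; the helper, the endpoint URL, and the requests-based call are a hand-rolled approximation, not the huggingface_hub implementation.)

```python
# Illustrative only: builds an application/x-ndjson commit payload following the
# schema described above. huggingface_hub does this internally; the commit
# endpoint URL and the raw requests call are approximations for illustration.
import base64
import json

import requests


def build_ndjson_payload(summary, files):
    """files: list of (path_in_repo, raw_bytes) tuples for small, non-LFS files."""
    lines = [{"key": "header", "value": {"summary": summary}}]
    for path, content in files:
        lines.append({
            "key": "file",
            "value": {
                "path": path,
                "content": base64.b64encode(content).decode(),
                "encoding": "base64",
            },
        })
    # One JSON object per line -- the x-ndjson convention.
    return "\n".join(json.dumps(line) for line in lines)


payload = build_ndjson_payload("Add data files", [("data/train.csv", b"a,b\n1,2\n")])
resp = requests.post(
    "https://huggingface.co/api/datasets/merve/bird-species/commit/main",
    headers={
        "Authorization": "Bearer hf_xxx",          # placeholder token
        "Content-Type": "application/x-ndjson",
    },
    data=payload,
)
```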

julien-c commented, Sep 27, 2022 (3 reactions)

pinging @coyotte508 on this (@SBrandeis currently has low bandwidth as he’s working on Spaces and billing)
