Enable upload_folder to upload content in chunks
Describe the bug
When trying to convert and upload this dataset with the dataset converter tool, I get the following error from upload_folder
(see logs).
Most datasets on Kaggle are quite large and oddly structured, so if we want more datasets uploaded with the tool, the library should handle this case, for example by uploading in chunks (a rough workaround sketch follows).
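As a possible workaround until upload_folder handles this natively, the files could be split across several smaller commits so that no single preupload request exceeds the payload limit. This is only a sketch using huggingface_hub's create_commit and CommitOperationAdd; the repo ID, folder path, and batch size are placeholders and would need tuning.

```python
# Hypothetical workaround: split the folder into several smaller commits so
# that no single preupload request exceeds the payload limit.
# The repo ID, folder path, and batch size below are placeholders.
from pathlib import Path

from huggingface_hub import CommitOperationAdd, HfApi

api = HfApi()
repo_id = "merve/bird-species"        # example repo from the logs
folder = Path("/tmp/kaggle_dataset")  # local Kaggle download
batch_size = 500                      # files per commit; tune to stay under the limit

files = sorted(p for p in folder.rglob("*") if p.is_file())
for start in range(0, len(files), batch_size):
    batch = files[start : start + batch_size]
    operations = [
        CommitOperationAdd(
            path_in_repo=str(p.relative_to(folder)),
            path_or_fileobj=str(p),
        )
        for p in batch
    ]
    api.create_commit(
        repo_id=repo_id,
        repo_type="dataset",
        operations=operations,
        commit_message=f"Upload files {start}-{start + len(batch)}",
        token=None,  # or an explicit token
    )
```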
Reproduction
See this notebook and try to convert the above dataset if you decide to run it again (it already contains the logs as of now). A minimal sketch of the failing call is shown below.
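The following is an assumed minimal reproduction (the original notebook code is not reproduced here); the folder path is a placeholder for a local copy of the Kaggle download:

```python
# Assumed minimal reproduction: uploading a large, many-file Kaggle download
# in a single call fails with 413 Payload Too Large on the preupload endpoint.
from huggingface_hub import upload_folder

upload_folder(
    folder_path="/tmp/kaggle_dataset",  # local copy of the Kaggle dataset
    path_in_repo="",
    repo_id="merve/bird-species",
    repo_type="dataset",
    token=None,
)
```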
Logs
/usr/local/lib/python3.7/dist-packages/requests/models.py in raise_for_status(self)
940 if http_error_msg:
--> 941 raise HTTPError(http_error_msg, response=self)
942
HTTPError: 413 Client Error: Payload Too Large for url: https://huggingface.co/api/datasets/merve/bird-species/preupload/main
The above exception was the direct cause of the following exception:
HfHubHTTPError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/ipywidgets/widgets/widget_output.py in inner(*args, **kwargs)
101 self.clear_output(*clear_args, **clear_kwargs)
102 with self:
--> 103 return func(*args, **kwargs)
104 return inner
105 return capture_decorator
/content/huggingface-datasets-converter/huggingface_datasets_converter/convert.py in login_token_event(t)
279 print(f"\t- Kaggle ID: {kaggle_id}")
280 print(f"\t- Repo ID: {repo_id}")
--> 281 url = kaggle_to_hf(kaggle_id, repo_id)
282 output.clear_output()
283 print(f"You can view your dataset here: {url}")
/content/huggingface-datasets-converter/huggingface_datasets_converter/convert.py in kaggle_to_hf(kaggle_id, repo_id, token, unzip, path_in_repo)
215 upload_file(path_or_fileobj=gitattributes_file.as_posix(), path_in_repo=".gitattributes", repo_id=repo_id, token=token, repo_type='dataset')
216
--> 217 upload_folder(folder_path=temp_dir, path_in_repo="", repo_id=repo_id, token=None, repo_type='dataset')
218 # Try to make dataset card as well!
219 card = DatasetCard.from_template(
/usr/local/lib/python3.7/dist-packages/huggingface_hub/utils/_validators.py in _inner_fn(*args, **kwargs)
92 validate_repo_id(arg_value)
93
---> 94 return fn(*args, **kwargs)
95
96 return _inner_fn
/usr/local/lib/python3.7/dist-packages/huggingface_hub/hf_api.py in upload_folder(self, repo_id, folder_path, path_in_repo, commit_message, commit_description, token, repo_type, revision, create_pr, parent_commit, allow_patterns, ignore_patterns)
2391 revision=revision,
2392 create_pr=create_pr,
-> 2393 parent_commit=parent_commit,
2394 )
2395
/usr/local/lib/python3.7/dist-packages/huggingface_hub/utils/_validators.py in _inner_fn(*args, **kwargs)
92 validate_repo_id(arg_value)
93
---> 94 return fn(*args, **kwargs)
95
96 return _inner_fn
/usr/local/lib/python3.7/dist-packages/huggingface_hub/hf_api.py in create_commit(self, repo_id, operations, commit_message, commit_description, token, repo_type, revision, create_pr, num_threads, parent_commit)
2035 revision=revision,
2036 endpoint=self.endpoint,
-> 2037 create_pr=create_pr,
2038 )
2039 except RepositoryNotFoundError as e:
/usr/local/lib/python3.7/dist-packages/huggingface_hub/utils/_validators.py in _inner_fn(*args, **kwargs)
92 validate_repo_id(arg_value)
93
---> 94 return fn(*args, **kwargs)
95
96 return _inner_fn
/usr/local/lib/python3.7/dist-packages/huggingface_hub/_commit_api.py in fetch_upload_modes(additions, repo_type, repo_id, token, revision, endpoint, create_pr)
375 params={"create_pr": "1"} if create_pr else None,
376 )
--> 377 hf_raise_for_status(resp, endpoint_name="preupload")
378
379 preupload_info = validate_preupload_info(resp.json())
/usr/local/lib/python3.7/dist-packages/huggingface_hub/utils/_errors.py in hf_raise_for_status(response, endpoint_name)
252 # Convert `HTTPError` into a `HfHubHTTPError` to display request information
253 # as well (request id and/or server error message)
--> 254 raise HfHubHTTPError(str(HTTPError), response=response) from e
255
256
HfHubHTTPError: <class 'requests.exceptions.HTTPError'> (Request ID: IdC2Rq6MbaM7tuOR-Q0Kr)
request entity too large
System Info
Using this branch of the converter tool: https://github.com/merveenoyan/huggingface-datasets-converter; the only change is the Hub version: https://github.com/huggingface/huggingface_hub.git@fix-auth-in-lfs-upload
Top GitHub Comments
https://github.com/huggingface/hub-docs/pull/348 - the docs for the commit endpoints, with the “application/x-ndjson” content type.
Basically, the content type works as follows:
There can be multiple files, LFS files, and deleted files, one line for each. Each line is a JSON object. If we add other features to the commit API (e.g. to rename files or delete folders), they will follow the same pattern: a plural word for the application/json content type with an array of objects, and a singular word for the application/x-ndjson content type in the key field, with an object in the value field. There's a maximum of 25k LFS files, and a 1 GB payload limit. A rough sketch of such a payload is shown below.
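For illustration only, this is roughly how an application/x-ndjson commit payload could be assembled. The exact keys and fields are defined in the hub-docs PR linked above; the names used here ("header", "file", "lfsFile", "deletedFile") and their fields are assumptions.

```python
# Illustrative only: building an application/x-ndjson commit payload, one JSON
# object per line, each with a singular "key" and a "value" object.
# The key names and value fields below are assumptions; see the hub-docs PR
# referenced above for the authoritative format.
import base64
import json

lines = [
    {"key": "header", "value": {"summary": "Upload dataset", "description": ""}},
    {
        "key": "file",
        "value": {
            "path": "README.md",
            "content": base64.b64encode(b"# My dataset").decode(),
            "encoding": "base64",
        },
    },
    {
        "key": "lfsFile",
        "value": {"path": "data/images.zip", "oid": "<sha256>", "size": 123456789},
    },
    {"key": "deletedFile", "value": {"path": "old_file.csv"}},
]

payload = "\n".join(json.dumps(obj) for obj in lines)
# POSTed to the repo's commit endpoint with Content-Type: application/x-ndjson
```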
pinging @coyotte508 on this (@SBrandeis currently has low bandwidth as he’s working on Spaces and billing)