Opt-in for downloading without symlinks
This is possibly a niche use case.
I recently found that some libraries (coremltools, in this case) don't play nice with symlinks, even on Unix platforms 😲. This led me to replace this one-liner, which was intended for user communication:
from huggingface_hub import snapshot_download
repo_id = "apple/coreml-stable-diffusion-v1-4"
variant = "original/packages"
downloaded = snapshot_download(repo_id, allow_patterns=f"{variant}/*")
With this one (taken from the blog post):
from huggingface_hub import snapshot_download
from huggingface_hub.file_download import repo_folder_name
from pathlib import Path
import shutil
repo_id = "apple/coreml-stable-diffusion-v1-4"
variant = "original/packages"
def download_model(repo_id, variant, output_dir):
    destination = Path(output_dir) / (repo_id.split("/")[-1] + "_" + variant.replace("/", "_"))
    if destination.exists():
        raise Exception(f"Model already exists at {destination}")
    # Download and copy without symlinks
    downloaded = snapshot_download(repo_id, allow_patterns=f"{variant}/*", cache_dir=output_dir)
    downloaded_bundle = Path(downloaded) / variant
    shutil.copytree(downloaded_bundle, destination)
    # Remove all downloaded files
    cache_folder = Path(output_dir) / repo_folder_name(repo_id=repo_id, repo_type="model")
    shutil.rmtree(cache_folder)
    return destination
model_path = download_model(repo_id, variant, output_dir="./models")
print(f"Model downloaded at {model_path}")
It's not the end of the world, but in this case I really wanted to stress how easy it was to download Core ML checkpoints from the Hub and use them downstream for whatever purpose.
If this is something that only affects coremltools, then it's not worthwhile doing anything (I'll open a PR there when I look into the problem in more depth). I'm raising the issue in case somebody else has observed other use cases that could benefit from a flag to unconditionally use #1067 even if symlinks are supported by the underlying OS.
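As a rough illustration of why the copy step in download_model yields plain files: shutil.copytree follows symlinks by default (symlinks=False), so each link in the snapshot folder is replaced by the content it points to. The sketch below uses only the standard library and a toy blobs/snapshots layout. The folder names, file names, and helper functions are made up for the example and are not part of huggingface_hub.

```python
import shutil
import tempfile
from pathlib import Path


def build_fake_snapshot(root: Path) -> Path:
    """Create a toy blobs/ + snapshots/ layout whose snapshot files are symlinks."""
    blobs = root / "blobs"
    snapshot = root / "snapshots" / "abc123" / "original" / "packages"
    blobs.mkdir(parents=True)
    snapshot.mkdir(parents=True)
    blob = blobs / "deadbeef"
    blob.write_text("model weights")
    # The snapshot entry only points at the blob, as in a symlink-based cache.
    (snapshot / "model.bin").symlink_to(blob)
    return snapshot


def copy_without_symlinks(src: Path, dst: Path) -> Path:
    """Copy src to dst, replacing every symlink with the file it points to."""
    # symlinks=False (the default) makes copytree copy link targets, not links.
    shutil.copytree(src, dst, symlinks=False)
    return dst


def contains_symlinks(folder: Path) -> bool:
    """Return True if any entry below folder is a symlink."""
    return any(p.is_symlink() for p in folder.rglob("*"))


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    snapshot = build_fake_snapshot(tmp / "cache")
    exported = copy_without_symlinks(snapshot, tmp / "export")
    source_has_links = contains_symlinks(snapshot)
    export_has_links = contains_symlinks(exported)
    exported_text = (exported / "model.bin").read_text()
    print(source_has_links, export_has_links, exported_text)
```

The exported folder can then be handed to a tool that refuses symlinks, at the cost of storing the files twice until the cache is cleaned up.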
Issue Analytics
- Created 10 months ago
- Reactions: 1
- Comments: 5 (3 by maintainers)
Top GitHub Comments
Yeah, I supposed there might be other scenarios where users need an exact copy of the file structure. Building a Docker container sounds like one of them (or deployment tasks, in general).
snapshot_download is much better than git clone because you can specify branches or patterns (as in my example above) and don't have to download stuff you don't need or keep a humongous .git directory with all the LFS blobs. Admittedly, we tend to create heavily overloaded repos with multiple variants for different frameworks, floating-point precision, etc.
Happy to propose a PR if you do decide to go this way.
Hi @pcuenca, thanks for opening the issue. As you said, this is indeed quite a niche situation. It reminds me of a discussion (internal link) triggered by @philschmid when he wanted to download a model from the Hub without the cache structure (e.g. the blobs and symlinks) in order to build Docker containers (cc @julien-c as well).
The solution you proposed is quite good. It just requires making sure the cache directory is not already populated before calling download_model, as it is completely erased by shutil.rmtree (i.e. users need to know what they are doing).
About having a flag to disable symlinks (and activate https://github.com/huggingface/huggingface_hub/pull/1067), I'm not against it. I would just wait for more requests before making it a feature of hfh.
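One way to sidestep the rmtree caveat is to download into a throwaway cache directory and delete only that. This is a sketch under assumptions, not a feature of hfh: download_without_symlinks and its download parameter are hypothetical names introduced here. By default it calls snapshot_download with the same arguments as the example in the issue.

```python
import shutil
import tempfile
from pathlib import Path


def download_without_symlinks(repo_id, variant, output_dir, download=None):
    """Download one variant of a repo into output_dir as plain files.

    `download` is a hypothetical injection point. It defaults to
    huggingface_hub.snapshot_download and must accept the same
    repo_id, allow_patterns and cache_dir arguments.
    """
    if download is None:
        # Imported lazily so the helper can be exercised without the library.
        from huggingface_hub import snapshot_download as download

    name = repo_id.split("/")[-1] + "_" + variant.replace("/", "_")
    destination = Path(output_dir) / name
    if destination.exists():
        raise FileExistsError(f"Model already exists at {destination}")
    destination.parent.mkdir(parents=True, exist_ok=True)

    # Use a throwaway cache so cleanup never touches a pre-existing cache.
    with tempfile.TemporaryDirectory(dir=output_dir) as scratch:
        snapshot = download(repo_id, allow_patterns=f"{variant}/*", cache_dir=scratch)
        # copytree follows symlinks by default, so the copy holds real files.
        shutil.copytree(Path(snapshot) / variant, destination)
    # Leaving the with-block removes the scratch cache and nothing else.
    return destination
```

Calling download_without_symlinks("apple/coreml-stable-diffusion-v1-4", "original/packages", "./models") should behave like download_model above, except that anything already present in ./models is left alone.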