Dataset downloading expected behavior pt. 2
When I instantiate a dataset with download=False and checksum=False, I expect it to assume everything is in place. However, our current setup usually checks that the archive file exists. If the archive file is 100+ GB, it is entirely reasonable for users to delete it but keep the extracted data.
I think _check_integrity should be something like this:
def _check_integrity(self) -> bool:
    """Check integrity of dataset.

    Returns:
        True if dataset MD5s match, else False
    """
    return check_integrity(
        os.path.join(self.root, self.filename),
        self.md5,
    )
and we only call it if self.checksum is true.
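To make the proposal concrete, here is a minimal sketch of how the constructor could gate the check. Everything beyond _check_integrity, self.root, self.filename, self.md5 and self.checksum is an assumption for illustration: ExampleDataset, its _download stub, the archive name and the local check_integrity helper are hypothetical stand-ins, not the existing code of any dataset listed below.

```python
import hashlib
import os
from typing import Optional


def check_integrity(path: str, md5: Optional[str] = None) -> bool:
    """Stand-in helper: True if the file exists and, if given, its MD5 matches."""
    if not os.path.isfile(path):
        return False
    if md5 is None:
        return True
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so very large archives are not loaded into memory.
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest() == md5


class ExampleDataset:
    """Hypothetical dataset, used only to illustrate the proposed convention."""

    filename = "archive.zip"  # hypothetical archive name
    md5 = "d41d8cd98f00b204e9800998ecf8427e"  # MD5 of an empty file, for illustration

    def __init__(
        self, root: str, download: bool = False, checksum: bool = False
    ) -> None:
        self.root = root
        self.checksum = checksum

        if download:
            self._download()

        # The archive is only inspected when the user explicitly asks for it.
        # With download=False and checksum=False nothing is checked, so a
        # deleted archive does not prevent the dataset from being created.
        if self.checksum and not self._check_integrity():
            raise RuntimeError("Dataset archive is missing or corrupted")

    def _check_integrity(self) -> bool:
        """Return True if the archive's MD5 matches, else False."""
        return check_integrity(os.path.join(self.root, self.filename), self.md5)

    def _download(self) -> None:
        # Downloading is out of scope for this sketch.
        raise NotImplementedError
```

With this layout, ExampleDataset(root) succeeds even if archive.zip has been deleted, while ExampleDataset(root, checksum=True) raises a RuntimeError when the archive is absent or its MD5 differs.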
Datasets that still use _check_integrity (and presumably follow the old convention):
- Advance
- Benin Cashews
- CBF
- COWC
- CV4A Kenya
- Cyclone
- ETCI 2021
- Eurosat
- GID15
- LEVIRCD
- Loveda
- VHR10
- SEN12MS
- So2Sat
- SpaceNet
- UCMerced
For a good example to copy, I think CDL is the first dataset I updated to use the new download style.
Issue Analytics
- State:
- Created 2 years ago
- Comments: 5 (2 by maintainers)
Top GitHub Comments
I think we need a list here of all datasets that do not follow these conventions / a good example dataset to copy.
@estherrolf (one of our first users!) just hit this problem with a manually downloaded version of the ChesapeakeCVPR dataset. I think it is worth making the change as this will happen especially with the larger datasets.