Dataset downloading expected behavior pt. 2

When I instantiate a dataset with download=False and checksum=False, I expect it to assume everything is in place. However, our current setup usually checks that the archive file exists. If the archive file is 100+ GB, it is entirely reasonable for users to delete the archive but keep the downloaded data.

I think _check_integrity should be something like this:

    def _check_integrity(self) -> bool:
        """Check integrity of dataset.

        Returns:
            True if dataset MD5s match, else False
        """
        return check_integrity(
            os.path.join(self.root, self.filename),
            self.md5,
        )

and we would only call it if self.checksum is True.
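
As a rough illustration of that idea, here is a minimal, self-contained sketch. Everything in it beyond `_check_integrity`, `check_integrity`, `self.root`, `self.filename`, `self.md5`, `download`, and `checksum` is an assumption made for the example: `ExampleDataset`, its `directory` attribute, the `_verify` method name, and the local `check_integrity` stand-in are hypothetical and not taken from the library. The download step is deliberately left out.

```python
import hashlib
import os
from typing import Optional


def check_integrity(path: str, md5: Optional[str] = None) -> bool:
    """Stand-in helper (assumption): file exists and, if md5 is given, matches it."""
    if not os.path.isfile(path):
        return False
    if md5 is None:
        return True
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so very large archives are not loaded into memory.
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest() == md5


class ExampleDataset:
    """Hypothetical dataset, used only to illustrate the proposed behavior."""

    filename = "archive.zip"  # hypothetical archive name
    directory = "data"  # hypothetical folder holding the extracted files
    md5 = "0" * 32  # placeholder checksum

    def __init__(
        self, root: str, download: bool = False, checksum: bool = False
    ) -> None:
        self.root = root
        self.download = download
        self.checksum = checksum
        self._verify()

    def _check_integrity(self) -> bool:
        return check_integrity(os.path.join(self.root, self.filename), self.md5)

    def _verify(self) -> None:
        data_dir = os.path.join(self.root, self.directory)
        if os.path.isdir(data_dir):
            # Data is in place: only look at the archive if a checksum was requested.
            if self.checksum and not self._check_integrity():
                raise RuntimeError("Archive is missing or its MD5 does not match.")
            return
        if not self.download:
            raise RuntimeError("Dataset not found; pass download=True to fetch it.")
        # Downloading and extracting are outside the scope of this sketch.
        raise NotImplementedError("Download step omitted from this sketch.")
```

With this layout, `ExampleDataset(root, download=False, checksum=False)` succeeds as long as the data folder exists, even if the archive has been deleted, while `checksum=True` still requires a matching archive.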

Datasets that still use _check_integrity (and presumably follow the old convention):

  • Advance
  • Benin Cashews
  • CBF
  • COWC
  • CV4A Kenya
  • Cyclone
  • ETCI 2021
  • Eurosat
  • GID15
  • LEVIRCD
  • Loveda
  • VHR10
  • SEN12MS
  • So2Sat
  • SpaceNet
  • UCMerced

For a good example to copy, see CDL; I think it is the first dataset I updated to use the new download style.

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

calebrob6 commented, Feb 15, 2022

I think we need a list here of all datasets that do not follow these conventions / a good example dataset to copy.

calebrob6 commented, Sep 4, 2021

@estherrolf (one of our first users!) just hit this problem with a manually downloaded version of the ChesapeakeCVPR dataset. I think the change is worth making, since this is especially likely to happen with the larger datasets.
