Is there a way to see the dataset size before starting the download
Hi,
Is there a way to get information about the dataset size (number of sentences, etc.) before downloading it? Is this available through the API somehow? Is this what `cols` is for: https://github.com/thammegowda/mtdata/blob/master/mtdata/entry.py#L101 ?
Thanks, Nick

There are two ways to do this: a HEAD request to the URLs on the fly, or a cached set of statistics about each corpus, including the number of segments, etc. To have the cached version without downloading, effectively what you are asking for is a continuous release system that downloads the data and then puts the metadata in the release. This continuous release system could also cache things and be branded OPUS…
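A minimal sketch of the on-the-fly approach, assuming the raw download URL is known and the server reports a `Content-Length` header (the function name and the example URL are illustrative, not part of mtdata):

```python
import requests
from typing import Optional

def remote_size(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the remote file size in bytes via a HEAD request,
    or None if the server does not report Content-Length."""
    resp = requests.head(url, allow_redirects=True, timeout=timeout)
    resp.raise_for_status()
    length = resp.headers.get("Content-Length")
    return int(length) if length is not None else None

# Illustrative usage with a placeholder URL:
print(remote_size("https://example.org/corpus.tgz"))
```

Some servers answer HEAD without a Content-Length (or reject HEAD entirely), so a fallback such as a ranged GET, or simply reporting the size as unknown, would still be needed.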
I like the HEAD request approach and we can easily do that. One edge case I am concerned about: we have indexed many zip/tarball files that get mapped to multiple datasets. I could show the overall tarball file size; though it is inaccurate, it would be a good start.

Also, as shown in my previous comment, `mtdata stats` outputs character counts. I will revise it to output byte counts, and also human-readable sizes (kB, MB, GB, etc.). This will be more accurate but costly (we have to download each dataset once).

Also, +1 for caching these stats and distributing them as part of a release. OPUS is the largest source, and fortunately precomputed stats are available from their API. For the remaining datasets, I will have to run a job on one of our servers and collect the stats overnight. For any new additions, we can rerun a script to update the cached stats and make them available in the next release.
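A small sketch of the byte-count to human-readable conversion mentioned above (decimal units; the function name is illustrative and not part of mtdata):

```python
def human_size(n_bytes: int) -> str:
    """Format a byte count as a human-readable size using decimal units."""
    size = float(n_bytes)
    for unit in ("B", "kB", "MB", "GB", "TB"):
        if size < 1000 or unit == "TB":
            return f"{size:.1f} {unit}"
        size /= 1000

print(human_size(1_234_567_890))  # -> "1.2 GB"
```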
So, to summarize, the action items (for me, for the next release):
- `mtdata stats <DataID>`
- `mtdata stats --quick <DataID>`: option to perform a HEAD request and show the Content-Length header
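For illustration only, what invoking the proposed quick mode might look like; the `--quick` flag exists only as a proposal in this thread, the DataID is a placeholder, and the output line is invented to show the intent rather than actual mtdata output:

```
$ mtdata stats --quick <DataID>       # hypothetical: flag proposed above, not yet implemented
Content-Length: 1234567890 (1.2 GB)   # size of the (possibly shared) tarball, from a HEAD request
```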