Docs: Getting started redownloads dataset instead of reading downloaded data set

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
  • Modin version (modin.__version__):
  • Python version:
  • Code we can use to reproduce:

Describe the problem

The Getting Started docs need a correction: the pd.read_csv call should read the dataset directly from disk so that pandas and modin.pandas are compared correctly.

Currently, each run re-downloads the dataset as part of the timed read, so the difference in read_csv speed is not reflected: most of the time is spent downloading the file.

Source code / logs

https://github.com/modin-project/modin/blame/master/docs/getting_started/quickstart.rst#L83

Lines 83 and 96 pass the s3_path variable, which holds the URL, as the file argument instead of the path to the downloaded taxi.csv file.
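To illustrate the proposed fix, here is a minimal sketch of timing a read against the local copy rather than the URL. The helper name time_read_csv is hypothetical and is not part of the quickstart; the commented usage assumes taxi.csv was already written to the working directory by the download cell.

```python
import time


def time_read_csv(read_csv, path, **kwargs):
    """Call read_csv(path, **kwargs) and return (result, elapsed_seconds).

    read_csv is any callable with a pandas-like signature, for example
    pandas.read_csv or modin.pandas.read_csv. This helper is hypothetical
    and only illustrates timing a read from local disk.
    """
    start = time.perf_counter()
    result = read_csv(path, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Usage, assuming taxi.csv already exists locally:
#
#   import pandas
#   import modin.pandas as pd
#
#   _, pandas_seconds = time_read_csv(pandas.read_csv, "taxi.csv")
#   _, modin_seconds = time_read_csv(pd.read_csv, "taxi.csv")
#   print(f"pandas: {pandas_seconds:.2f}s, modin: {modin_seconds:.2f}s")
```

Passing the local path means neither timing includes network transfer, so the two numbers reflect parsing cost only.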

I’m willing to create a PR for this.

Extra note: the dataset download cell could also be improved by checking whether the file is already downloaded and skipping the download if so.
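One possible shape for that check, as a sketch only: the function name download_if_missing and the fetch parameter are hypothetical, and the URL in the usage comment is a placeholder rather than the dataset location used by the quickstart.

```python
import os
import urllib.request


def download_if_missing(url, dest, fetch=urllib.request.urlretrieve):
    """Download url to dest unless a non-empty file already exists there.

    Returns True if a download was performed and False if it was skipped.
    The fetch parameter is a hypothetical hook so the download function
    can be swapped out, for example in tests.
    """
    if os.path.exists(dest) and os.path.getsize(dest) > 0:
        return False
    fetch(url, dest)
    return True


# Usage (placeholder URL, not the real dataset location):
#
#   s3_path = "https://example.com/path/to/dataset.csv"
#   download_if_missing(s3_path, "taxi.csv")
```

The size check guards against treating an empty file left by an interrupted run as a valid download; it does not detect a partially written file.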

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

1 reaction
rraihansaputra commented, Aug 20, 2022

@pyrito the page is supposed to make Modin look faster 😃 We could download the dataset first as in the quickstart. We could explain that 1) we are doing that to show the perf improvement 2) it’s still possible to read directly from s3, and normally that’s how one would read the data.

Yeah, I’m referring to the same page. The bug is that the file is actually downloaded, but the next cells refer to s3_path, which is the object path, rather than reading the downloaded file.

I believe reading from local disk emphasizes the improvements from using Modin. Reading from an object store is a valid use case, but I think it can be demonstrated somewhere else, not on this page.

1 reaction
mvashishtha commented, Aug 18, 2022

however, I’m not sure what the documentation was trying to show here

@pyrito the page is supposed to make Modin look faster 😃 We could download the dataset first as in the quickstart. We could explain that 1) we are doing that to show the perf improvement 2) it’s still possible to read directly from s3, and normally that’s how one would read the data.
