Docs: Getting started redownloads dataset instead of reading downloaded dataset
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
- Modin version (modin.__version__):
- Python version:
- Code we can use to reproduce:
Describe the problem
The Getting Started docs need a correction: the path passed to pd.read_csv should point to the dataset on disk, so that pandas and modin.pandas are compared correctly.
Currently, each run re-downloads the dataset as part of the test, so the difference in read_csv speeds is not reflected: the majority of the time is spent downloading the file.
Source code / logs
https://github.com/modin-project/modin/blame/master/docs/getting_started/quickstart.rst#L83
Lines 83 and 96 use the s3_path parameter for the file, which is the URL, instead of the path to the downloaded taxi.csv file.
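To illustrate the comparison the docs are aiming for, here is a minimal sketch of timing only the read step. The helper name time_read is hypothetical and not part of pandas or Modin; it assumes taxi.csv is already on disk, so no download time is included in the measurement.

```python
import time
from typing import Any, Callable, Tuple


def time_read(read_fn: Callable[[str], Any], path: str) -> Tuple[Any, float]:
    """Call read_fn(path) and return (result, elapsed seconds).

    Hypothetical helper for illustration: read_fn is expected to be
    something like pandas.read_csv or modin.pandas.read_csv, and path
    a local file such as "taxi.csv" rather than a remote URL.
    """
    start = time.perf_counter()
    result = read_fn(path)
    elapsed = time.perf_counter() - start
    return result, elapsed
```

With both libraries installed, time_read(pandas.read_csv, "taxi.csv") and time_read(modin.pandas.read_csv, "taxi.csv") would then measure the two reads against the same local file.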
I’m willing to create a PR for this.
Extra note: the dataset download cell could also be improved by checking whether the file has already been downloaded and skipping the download if so.
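A rough sketch of that check, assuming the download cell uses something like urllib.request.urlretrieve (the function name ensure_local_copy and the fetch parameter are made up for illustration, and this is not the code currently in the docs):

```python
import os
import urllib.request
from typing import Callable


def ensure_local_copy(
    url: str,
    local_path: str,
    fetch: Callable[[str, str], object] = urllib.request.urlretrieve,
) -> str:
    """Download url to local_path only if local_path does not exist yet.

    Hypothetical helper: fetch defaults to urllib.request.urlretrieve
    and is a parameter only so the download step can be swapped out.
    Returns local_path so it can be passed straight to read_csv.
    """
    if not os.path.exists(local_path):
        fetch(url, local_path)
    return local_path
```

Calling ensure_local_copy(s3_path, "taxi.csv") once and passing the returned path to read_csv would keep the download out of the timed cells. Note that this simple existence check does not detect a partially downloaded file.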
Issue Analytics
- State:
- Created a year ago
- Comments: 6 (5 by maintainers)
Yeah, I’m referring to the same page. The bug is that the file is actually downloaded, but the next cells refer to s3_path, which is the object path, rather than reading the downloaded file. I believe reading from local disk emphasizes the improvements from using Modin. Reading from an object store is a valid use case, but I think it can be demonstrated somewhere else, not on this page.
@pyrito the page is supposed to make Modin look faster 😃 We could download the dataset first, as in the quickstart, and explain that 1) we are doing that to show the perf improvement and 2) it’s still possible to read directly from S3, which is normally how one would read the data.