Docs: Getting started redownloads dataset instead of reading downloaded data set

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
Modin version (modin.__version__):
Python version:
Code we can use to reproduce:

Describe the problem

The Getting Started docs need a correction to the pd.read_csv argument so that the dataset is read directly from disk. Only then do they compare pandas and modin.pandas correctly.

Currently, each run re-downloads the dataset before running the test. The difference in read_csv speed is not reflected, because most of the time is spent downloading the file.

Source code / logs

https://github.com/modin-project/modin/blame/master/docs/getting_started/quickstart.rst#L83

Lines 83 and 96 pass s3_path, which is the URL, as the file argument instead of the path to the downloaded taxi.csv file.
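To illustrate the intended comparison, here is a minimal sketch of timing a read against the local file rather than the URL. The helper name `timed_read` is hypothetical and not part of the quickstart; the commented usage assumes pandas and Modin are installed and that taxi.csv already sits in the working directory.

```python
import time
from typing import Any, Callable, Tuple


def timed_read(read_csv: Callable[[str], Any], path: str) -> Tuple[Any, float]:
    """Call read_csv(path) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    frame = read_csv(path)
    return frame, time.perf_counter() - start


# Intended use (assumption: taxi.csv was downloaded in an earlier cell):
#
#   import pandas
#   import modin.pandas as pd
#
#   _, pandas_seconds = timed_read(pandas.read_csv, "taxi.csv")
#   _, modin_seconds = timed_read(pd.read_csv, "taxi.csv")
```

Passing the local path to both readers keeps network time out of the measurement, so the two timings reflect only the parsing work.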

I’m willing to create a PR for this.

Extra note: the dataset download cell could also be improved by checking whether the file has already been downloaded and skipping the download if so.
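One way the download check could look, as a sketch only: the function name `ensure_local_copy` and its `download` parameter are hypothetical, and the existence-plus-size check is an assumption about what counts as "already downloaded".

```python
import os
import urllib.request
from typing import Callable


def ensure_local_copy(
    url: str,
    local_path: str,
    download: Callable[[str, str], object] = urllib.request.urlretrieve,
) -> bool:
    """Download url to local_path unless a non-empty file is already there.

    Returns True if a download was performed, False if it was skipped.
    """
    if os.path.exists(local_path) and os.path.getsize(local_path) > 0:
        return False  # reuse the existing file
    download(url, local_path)
    return True


# Intended use in the download cell (s3_path is the URL variable the
# quickstart already defines):
#
#   ensure_local_copy(s3_path, "taxi.csv")
```

Note that this check treats a partially downloaded file as present; a stricter version would compare the file size or a checksum against a known value.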

Issue Analytics

State:
Created a year ago
Comments: 6 (5 by maintainers)

Top GitHub Comments

rraihansaputra commented, Aug 20, 2022

@pyrito the page is supposed to make Modin look faster 😃 We could download the dataset first as in the quickstart. We could explain that 1) we are doing that to show the perf improvement 2) it’s still possible to read directly from s3, and normally that’s how one would read the data.

Yeah, I’m referring to the same page. The bug is that the file is actually downloaded, but the next cells refer to s3_path, which is the object path, rather than reading the downloaded file.

I believe reading from local disk emphasizes the improvement from using Modin. Reading from an object store is a valid use case, but I think it can be demonstrated somewhere else, not on this page.

mvashishtha commented, Aug 18, 2022

however, I’m not sure what the documentation was trying to show here

@pyrito the page is supposed to make Modin look faster 😃 We could download the dataset first as in the quickstart. We could explain that 1) we are doing that to show the perf improvement 2) it’s still possible to read directly from s3, and normally that’s how one would read the data.
