[PERF] Why is the first `read_csv` call slower than subsequent `read_csv` calls?

import time
import pandas
import numpy as np
import modin.pandas as pd
import modin.config as cfg
import ray

print(f"\tPandas version: {pandas.__version__}")
print(f"\tModin version: {pd.__version__}")
print(f"\tCpuCount: {cfg.CpuCount.get()}")
print(f"\tEngine: {cfg.Engine.get()}")
print(f"\tNPartitions: {cfg.NPartitions.get()}")

ray.init(runtime_env={'env_vars': {'__MODIN_AUTOIMPORT_PANDAS__': '1'}})

pandas_df = pandas.DataFrame(
    np.random.randint(0, 100, size=(1000000, 13))
)
pandas_df.to_csv("foo.csv", index=False)

def read_csv_with_pandas():
    start_time = time.time()

    pandas_df = pandas.read_csv("foo.csv", index_col=0)

    end_time = time.time()
    pandas_duration = end_time - start_time
    print("Time to read_csv with Pandas: {} seconds".format(round(pandas_duration, 3)))
    return pandas_df


def read_csv_with_modin():
    start_time = time.time()

    modin_df = pd.read_csv("foo.csv", index_col=0)

    end_time = time.time()
    modin_duration = end_time - start_time
    print("Time to read_csv with Modin: {} seconds".format(round(modin_duration, 3))) 
    return modin_df

for i in range(5):
    pandas_df = read_csv_with_pandas()
    modin_df = read_csv_with_modin()

        Pandas version: 1.5.1
        Modin version: 0.16.0+24.g11ba4811
        CpuCount: 8
        Engine: Ray
        NPartitions: 8

Time to read_csv with Pandas: 0.708 seconds
Time to read_csv with Modin: 4.132 seconds
Time to read_csv with Pandas: 0.735 seconds
Time to read_csv with Modin: 0.37 seconds
Time to read_csv with Pandas: 0.646 seconds
Time to read_csv with Modin: 0.377 seconds
Time to read_csv with Pandas: 0.673 seconds
Time to read_csv with Modin: 0.371 seconds
Time to read_csv with Pandas: 0.672 seconds
Time to read_csv with Modin: 0.379 seconds
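
The output above shows one slow first Modin call (4.132 seconds) followed by calls that settle near 0.37 seconds. When reporting numbers like these, it can help to separate the first ("cold") call from the later ("warm") ones. Below is a minimal sketch of that idea using only the standard library. The helper names `time_calls` and `summarize` are hypothetical and not part of Modin, pandas, or Ray; the timings are copied from the output above.

```python
import time
from statistics import mean


def time_calls(func, repeats=5):
    """Call func `repeats` times and return the wall-clock duration of each call."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Split timings into the first (cold) call and the mean of the remaining (warm) calls."""
    cold, warm = durations[0], durations[1:]
    return {"cold": cold, "warm_mean": mean(warm) if warm else None}


# Modin timings copied from the output above, in seconds.
modin_durations = [4.132, 0.37, 0.377, 0.371, 0.379]
summary = summarize(modin_durations)
print(summary)
```

With the script from the issue, the same helper could be pointed at the real call, for example `time_calls(lambda: pd.read_csv("foo.csv", index_col=0))`, assuming `pd` is `modin.pandas` as imported there.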

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 13 (10 by maintainers)

Top GitHub Comments

YarShev commented, Oct 27, 2022 (1 reaction)

Could you please notify us here when your change is merged into master, so that we stay in sync?

YarShev commented, Oct 27, 2022 (1 reaction)

@rkooo567, are there any notes about this in the Ray documentation? This behavior can really confuse users.


