
[PERF] Why is the first `read_csv` call slower than subsequent `read_csv` calls?


import time
import pandas
import numpy as np
import modin.pandas as pd
import modin.config as cfg
import ray

print(f"\tPandas version: {pandas.__version__}")
print(f"\tModin version: {pd.__version__}")
print(f"\tCpuCount: {cfg.CpuCount.get()}")
print(f"\tEngine: {cfg.Engine.get()}")
print(f"\tNPartitions: {cfg.NPartitions.get()}")

ray.init(runtime_env={'env_vars': {'__MODIN_AUTOIMPORT_PANDAS__': '1'}})

pandas_df = pandas.DataFrame(
    np.random.randint(0, 100, size=(1000000, 13))
)
pandas_df.to_csv("foo.csv", index=False)

def read_csv_with_pandas():
    start_time = time.time()

    pandas_df = pandas.read_csv("foo.csv", index_col=0)

    end_time = time.time()
    pandas_duration = end_time - start_time
    print("Time to read_csv with Pandas: {} seconds".format(round(pandas_duration, 3)))
    return pandas_df


def read_csv_with_modin():
    start_time = time.time()

    modin_df = pd.read_csv("foo.csv", index_col=0)

    end_time = time.time()
    modin_duration = end_time - start_time
    print("Time to read_csv with Modin: {} seconds".format(round(modin_duration, 3))) 
    return modin_df

for i in range(5):
    pandas_df = read_csv_with_pandas()
    modin_df = read_csv_with_modin()

        Pandas version: 1.5.1
        Modin version: 0.16.0+24.g11ba4811
        CpuCount: 8
        Engine: Ray
        NPartitions: 8

Time to read_csv with Pandas: 0.708 seconds
Time to read_csv with Modin: 4.132 seconds
Time to read_csv with Pandas: 0.735 seconds
Time to read_csv with Modin: 0.37 seconds
Time to read_csv with Pandas: 0.646 seconds
Time to read_csv with Modin: 0.377 seconds
Time to read_csv with Pandas: 0.673 seconds
Time to read_csv with Modin: 0.371 seconds
Time to read_csv with Pandas: 0.672 seconds
Time to read_csv with Modin: 0.379 seconds
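The timings above show a one-time cost on the first Modin call, after which the calls settle at roughly 0.37 seconds. As a rough sketch of how to report that first-call cost separately from the steady-state time, the helper below uses only the standard library. The names `time_calls` and `lazy_reader` are hypothetical, and the `time.sleep` call only simulates a one-time setup cost; it is not a claim about what Modin or Ray do internally.

```python
import time
from statistics import mean


def time_calls(fn, repeats=5):
    """Return (first_call_seconds, mean_seconds_of_remaining_calls) for fn()."""
    if repeats < 2:
        raise ValueError("repeats must be at least 2")
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations[0], mean(durations[1:])


_initialized = False


def lazy_reader():
    """Hypothetical stand-in for a call that pays a setup cost only once."""
    global _initialized
    if not _initialized:
        time.sleep(0.2)  # simulated one-time initialization (assumption)
        _initialized = True
    return sum(range(1000))


first, steady = time_calls(lazy_reader)
print(f"first call: {first:.3f}s, steady state: {steady:.6f}s")
```

To apply this to the benchmark in the issue, one could pass something like `lambda: pd.read_csv("foo.csv", index_col=0)` as `fn`, so that the warm-up cost no longer skews the comparison with pandas.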

Issue Analytics

Created a year ago
Comments: 13 (10 by maintainers)

Top GitHub Comments

1 reaction

YarShev commented, Oct 27, 2022

Could you please let us know here when your change is merged into master, so that we stay in sync?

1 reaction

YarShev commented, Oct 27, 2022

@rkooo567, is this behavior noted anywhere in the Ray documentation? It can really confuse users.
