[PERF] Why is the first `read_csv` call slower than subsequent `read_csv` calls?
import time
import pandas
import numpy as np
import modin.pandas as pd
import modin.config as cfg
import ray

print(f"\tPandas version: {pandas.__version__}")
print(f"\tModin version: {pd.__version__}")
print(f"\tCpuCount: {cfg.CpuCount.get()}")
print(f"\tEngine: {cfg.Engine.get()}")
print(f"\tNPartitions: {cfg.NPartitions.get()}")

ray.init(runtime_env={'env_vars': {'__MODIN_AUTOIMPORT_PANDAS__': '1'}})

pandas_df = pandas.DataFrame(
    np.random.randint(0, 100, size=(1000000, 13))
)
pandas_df.to_csv("foo.csv", index=False)

def read_csv_with_pandas():
    start_time = time.time()
    pandas_df = pandas.read_csv("foo.csv", index_col=0)
    end_time = time.time()
    pandas_duration = end_time - start_time
    print("Time to read_csv with Pandas: {} seconds".format(round(pandas_duration, 3)))
    return pandas_df

def read_csv_with_modin():
    start_time = time.time()
    modin_df = pd.read_csv("foo.csv", index_col=0)
    end_time = time.time()
    modin_duration = end_time - start_time
    print("Time to read_csv with Modin: {} seconds".format(round(modin_duration, 3)))
    return modin_df

for i in range(5):
    pandas_df = read_csv_with_pandas()
    modin_df = read_csv_with_modin()

Pandas version: 1.5.1
Modin version: 0.16.0+24.g11ba4811
CpuCount: 8
Engine: Ray
NPartitions: 8
Time to read_csv with Pandas: 0.708 seconds
Time to read_csv with Modin: 4.132 seconds
Time to read_csv with Pandas: 0.735 seconds
Time to read_csv with Modin: 0.37 seconds
Time to read_csv with Pandas: 0.646 seconds
Time to read_csv with Modin: 0.377 seconds
Time to read_csv with Pandas: 0.673 seconds
Time to read_csv with Modin: 0.371 seconds
Time to read_csv with Pandas: 0.672 seconds
Time to read_csv with Modin: 0.379 seconds
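
When comparing numbers like these, it can help to report the first call separately from the later ones, so that any one-time cost does not get averaged into the steady-state figure. The following is a minimal sketch of that idea, not part of the original report: `time_calls` and `split_warmup` are hypothetical helper names, and the sample timings are simply the Modin numbers copied from the output above. It does not try to explain where the extra time on the first call goes.

```python
import statistics
import time


def time_calls(fn, repeats=5):
    """Call fn() `repeats` times; return each call's wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def split_warmup(durations):
    """Return (first_call_duration, median_of_the_remaining_calls)."""
    if len(durations) < 2:
        raise ValueError("need at least two timings to separate the first call")
    return durations[0], statistics.median(durations[1:])


if __name__ == "__main__":
    # Modin timings copied from the output reported in this issue.
    modin_timings = [4.132, 0.37, 0.377, 0.371, 0.379]
    first, steady = split_warmup(modin_timings)
    print(f"first call: {first:.3f} s, steady state (median): {steady:.3f} s")
    print(f"first call / steady state: {first / steady:.1f}x")
```

To collect fresh timings instead of reusing the reported ones, any zero-argument callable can be passed to `time_calls`, for example `lambda: pd.read_csv("foo.csv", index_col=0)` with the imports from the script above.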
Issue Analytics
- State:
- Created a year ago
- Comments:13 (10 by maintainers)
Could you please notify us here once your change is merged into master, so that we stay in sync?
@rkooo567, are there any notes about this in the Ray documentation? This behavior can really confuse users.