Memory leak when reading CSV files in a loop
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04.3 LTS
- Modin version (modin.__version__): 0.11.0
- Python version: Python 3.9.7
- Code we can use to reproduce:
This is a reproducer made from a Modin example:
import sys
import time

import modin.pandas as pd
from modin.experimental.engines.omnisci_on_native.frame.omnisci_worker import OmnisciServer


def read(filename):
    columns_names = [
        "trip_id",
        "vendor_id",
        "pickup_datetime",
        "dropoff_datetime",
        "store_and_fwd_flag",
        "rate_code_id",
        "pickup_longitude",
        "pickup_latitude",
        "dropoff_longitude",
        "dropoff_latitude",
        "passenger_count",
        "trip_distance",
        "fare_amount",
        "extra",
        "mta_tax",
        "tip_amount",
        "tolls_amount",
        "ehail_fee",
        "improvement_surcharge",
        "total_amount",
        "payment_type",
        "trip_type",
        "pickup",
        "dropoff",
        "cab_type",
        "precipitation",
        "snow_depth",
        "snowfall",
        "max_temperature",
        "min_temperature",
        "average_wind_speed",
        "pickup_nyct2010_gid",
        "pickup_ctlabel",
        "pickup_borocode",
        "pickup_boroname",
        "pickup_ct2010",
        "pickup_boroct2010",
        "pickup_cdeligibil",
        "pickup_ntacode",
        "pickup_ntaname",
        "pickup_puma",
        "dropoff_nyct2010_gid",
        "dropoff_ctlabel",
        "dropoff_borocode",
        "dropoff_boroname",
        "dropoff_ct2010",
        "dropoff_boroct2010",
        "dropoff_cdeligibil",
        "dropoff_ntacode",
        "dropoff_ntaname",
        "dropoff_puma",
    ]
    # use string instead of category
    columns_types = [
        "int64",
        "string",
        "timestamp",
        "timestamp",
        "string",
        "int64",
        "float64",
        "float64",
        "float64",
        "float64",
        "int64",
        "float64",
        "float64",
        "float64",
        "float64",
        "float64",
        "float64",
        "float64",
        "float64",
        "float64",
        "string",
        "float64",
        "string",
        "string",
        "string",
        "float64",
        "int64",
        "float64",
        "int64",
        "int64",
        "float64",
        "float64",
        "float64",
        "float64",
        "string",
        "float64",
        "float64",
        "string",
        "string",
        "string",
        "float64",
        "float64",
        "float64",
        "float64",
        "string",
        "float64",
        "float64",
        "string",
        "string",
        "string",
        "float64",
    ]
    dtypes = {columns_names[i]: columns_types[i] for i in range(len(columns_names))}
    all_but_dates = {
        col: valtype
        for (col, valtype) in dtypes.items()
        if valtype not in ["timestamp"]
    }
    dates_only = [col for (col, valtype) in dtypes.items() if valtype in ["timestamp"]]

    df = pd.read_csv(
        filename,
        names=columns_names,
        dtype=all_but_dates,
        parse_dates=dates_only,
    )

    df.shape  # to trigger real execution
    df._query_compiler._modin_frame._partitions[0][
        0
    ].frame_id = OmnisciServer().put_arrow_to_omnisci(
        df._query_compiler._modin_frame._partitions[0][0].get()
    )  # to trigger real execution
    return df


def q1_omnisci(df):
    q1_pandas_output = df.groupby("cab_type").size()
    q1_pandas_output.shape  # to trigger real execution
    return q1_pandas_output


def q2_omnisci(df):
    q2_pandas_output = df.groupby("passenger_count").agg({"total_amount": "mean"})
    q2_pandas_output.shape  # to trigger real execution
    return q2_pandas_output


def q3_omnisci(df):
    df["pickup_datetime"] = df["pickup_datetime"].dt.year
    q3_pandas_output = df.groupby(["passenger_count", "pickup_datetime"]).size()
    q3_pandas_output.shape  # to trigger real execution
    return q3_pandas_output


def q4_omnisci(df):
    df["pickup_datetime"] = df["pickup_datetime"].dt.year
    df["trip_distance"] = df["trip_distance"].astype("int64")
    q4_pandas_output = (
        df.groupby(["passenger_count", "pickup_datetime", "trip_distance"], sort=False)
        .size()
        .reset_index()
        .sort_values(
            by=["pickup_datetime", 0], ignore_index=True, ascending=[True, False]
        )
    )
    q4_pandas_output.shape  # to trigger real execution
    return q4_pandas_output


def measure(name, func, *args, **kw):
    t0 = time.time()
    res = func(*args, **kw)
    t1 = time.time()
    print(f"{name}: {t1 - t0} sec")
    return res


def main():
    if len(sys.argv) != 2:
        print(f"USAGE: python nyc-taxi-omnisci.py <data file name>")
        return

    for i in range(0, 20):
        print("Iteration #", i + 1)
        df = measure("Reading", read, sys.argv[1])
        # measure("Q1", q1_omnisci, df)
        # measure("Q2", q2_omnisci, df)
        # measure("Q3", q3_omnisci, df.copy())
        # measure("Q4", q4_omnisci, df.copy())


if __name__ == "__main__":
    main()
Describe the problem
Download the data file from https://modin-datasets.s3.amazonaws.com/taxi/trips_xaa.csv
After a CSV file has been loaded and the resulting dataframe is no longer needed (Python should garbage-collect it), the native resources occupied by the CSV data are not released. This leads to a memory leak. On a system with tight memory, even the second iteration of the benchmark may fail because the resources from the first iteration are still held in memory.
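For reference, the growth can be observed with a small driver like the one below. This is only a sketch and not part of the original report: psutil is an assumed extra dependency, and read() is the function from the reproducer above. Every reference to the dataframe is dropped and a garbage collection pass is forced; if the leak described above is present, the reported RSS keeps growing from iteration to iteration.

import gc
import sys

import psutil


def rss_mib():
    # resident set size of the current process, in MiB
    return psutil.Process().memory_info().rss / 2**20


for i in range(5):
    df = read(sys.argv[1])  # read() from the reproducer above
    del df                  # drop the only reference to the dataframe
    gc.collect()            # force a collection pass
    print(f"iteration {i + 1}: RSS = {rss_mib():.1f} MiB")  # expected to stay flat, but keeps growing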
Source code / logs
Currently, it looks like an OmniSci issue. We need to check that DROP TABLE correctly clears temporary table data.

Great! Closing the issue and waiting for the release to try it out.
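To verify the DROP TABLE behaviour mentioned above against a standalone omniscidb server, a check along the following lines could be used. This is only a sketch: the connection parameters and the tmp_leak_check table name are placeholders, pymapd is an assumed extra dependency, and the embedded server that Modin starts through OmnisciServer may not expose a network endpoint at all.

import pandas
from pymapd import connect

# connection parameters are placeholders for a locally running omniscidb instance
con = connect(user="admin", password="HyperInteractive",
              host="localhost", dbname="omnisci")

# materialize a throwaway table, standing in for the temporary tables Modin creates
data = pandas.DataFrame({"a": range(1_000_000), "b": range(1_000_000)})
con.load_table("tmp_leak_check", data)
print("tables before drop:", con.get_tables())

# the statement under suspicion: dropping the table should also release its data
con.execute("DROP TABLE tmp_leak_check")
print("tables after drop:", con.get_tables())

con.close()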