Memory Keeps Increasing
Is there anything wrong with the following code? My memory keeps increasing and the machine keeps rebooting because of auto-healing. I have a 16 GB machine and I'm only reading 500 MB at a time. Something is terribly wrong, I guess.
#!/usr/bin/python
import os
import sys
import uuid
import time
import base64
import requests
import s3fs
import fastparquet
import glob
import vtap_cols
import datetime
import pandas as pd
import numpy as np
import ujson as json

start_time = time.time()

key = 'AKIAJ'
secret = 'IX9sRERZ'
fs = s3fs.S3FileSystem(key=key, secret=secret)
s3open = fs.open

filelist = glob.glob('/vol/vtap-sync/vnk/*/*/*/*/events.json')


# get a UUID - URL safe, Base64
def get_a_uuid():
    r_uuid = base64.urlsafe_b64encode(uuid.uuid4().bytes)
    return r_uuid.replace('=', '')


def flatten(event_data):
    return pd.DataFrame(event_data).set_index('key').squeeze()


def flat(x):
    return dict([(list(d.values())[1], list(d.values())[0]) for d in x])


def get_cols_map(cols):
    default_cols = vtap_cols.columns_to_replace.keys()
    correct_cols_map = {}
    for col in cols:
        if col in default_cols:
            correct_cols_map[col] = vtap_cols.columns_to_replace[col]
    return correct_cols_map


for file in filelist:
    filedate = datetime.datetime.utcfromtimestamp(os.stat(file).st_mtime)
    curdate = datetime.datetime.now()
    print filedate, curdate, curdate - filedate
    if (curdate - filedate).total_seconds() > 5400:
        parts = file.split('/')
        parts[4] = 'yr=' + parts[4]
        parts[5] = 'mn=' + parts[5]
        parts[6] = 'dt=' + parts[6]
        idx = parts.index('vtap-sync')
        newfile = ('/'.join(parts[idx:idx + 5]) + '/proc.json').strip('/')
        moved_file = file.replace('events.json', 'proc.json')
        print "[+] Processing File %s" % file
        try:
            os.rename(file, moved_file)
            left_over_events = ''
            has_next = True
            with open(moved_file, 'r') as events:
                while has_next:
                    buf = events.read(536870912)
                    if buf:
                        left_over_events += buf
                        left_parts = left_over_events.split('\n')
                        events_data = left_parts[:-1]
                        left_over_events = left_parts[-1]
                        jevents_data = '[' + ','.join(events_data) + ']'
                        df = pd.read_json(jevents_data)
                        event_data = pd.io.json.json_normalize(df['event_data'].apply(flat).tolist())
                        df = df.drop('event_data', axis=1)
                        df = df.join(event_data)
                        df.replace('[Rs. ]', '', regex=True, inplace=True)
                        cls_to_replace = [column for column in df.columns if 'total' in column]
                        df[cls_to_replace] = df[cls_to_replace].apply(pd.to_numeric, errors='coerce')
                        # df.rename(columns=get_cols_map(df.columns), inplace=True)
                        parquet_filename = newfile.replace('proc.json', 'events_' + get_a_uuid() + '.parquet')
                        fastparquet.write(parquet_filename, df, row_group_offsets=5000000, compression='GZIP',
                                          file_scheme='simple', append=False, write_index=True, has_nulls=True,
                                          open_with=s3open)
                    else:
                        has_next = False
        except Exception as e:
            if os.path.isfile(moved_file):
                os.rename(moved_file, file)
            print e
            requests.post('https://api.flock.com/hooks/sendMessage/ab430c0d-2ed9-4226-84ff-f803245a0e05',
                          json={'text': "Error while processing file %s and error is %s" % (file, e)})
        print "[+] Processing Done %s" % file

end_time = time.time()
requests.post('https://api.flock.com/hooks/sendMessage/3dcf9001-9dc8a2d57a1b',
              json={'text': "Proccessed %s. Time Taken: %s" % (len(filelist), end_time - start_time)})
requests.post('https://api.flock.com/hooks/sendMessage/ab430c0d-f803245a0e05',
              json={'text': "Proccessed %s. Time Taken: %s" % (len(filelist), end_time - start_time)})
Top GitHub Comments
First, I would certainly move the processing within the try block into a function, which will do a better job of releasing memory each time you write some parquet. In your particular example, you are also capturing `e` and variables derived from it, which may well hold on to references to `e` and any variables in the stack. These are all global variables, so you would be best to have that in a function too. This is good general policy: don't have any function do too much.
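As a rough sketch of that refactor (not the original script: the names process_chunk and process_file, the 64 MB chunk size, and the output naming are illustrative only), the per-file work can live inside functions so that every DataFrame and intermediate string goes out of scope as soon as each call returns:

import glob
import pandas as pd
import fastparquet

CHUNK_SIZE = 64 * 1024 * 1024  # e.g. 64 MB per read instead of 512 MB

def process_chunk(json_lines, out_path):
    # All intermediates (the joined string, the DataFrame, ...) are locals,
    # so they become garbage-collectable as soon as this function returns.
    df = pd.read_json('[' + ','.join(json_lines) + ']')
    # ... column clean-up / renaming would go here ...
    fastparquet.write(out_path, df, compression='GZIP')

def process_file(path, out_template):
    # Stream one file in fixed-size chunks, handing complete lines to
    # process_chunk and carrying the trailing partial line forward.
    left_over = ''
    part = 0
    with open(path, 'r') as events:
        while True:
            buf = events.read(CHUNK_SIZE)
            if not buf:
                break
            lines = (left_over + buf).split('\n')
            left_over = lines.pop()          # possibly incomplete last record
            if lines:
                process_chunk(lines, out_template % part)
                part += 1

for path in glob.glob('/vol/vtap-sync/vnk/*/*/*/*/events.json'):
    try:
        process_file(path, path.replace('events.json', 'events_%d.parquet'))
    except Exception as exc:                 # exc stays local to this iteration
        print("Error while processing %s: %s" % (path, exc))

Because the exception and every DataFrame live only inside a function call, CPython can free them at the end of each iteration instead of keeping module-level references alive for the whole run.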
Furthermore, if `buf` is 500 MB, then when you do `left_over_events += buf` you will, at least temporarily, duplicate that amount, and you then split, amend and join again before passing to pandas. Each of the variables `buf`, `left_parts`, `events_data`, `jevents_data`, `df` and `event_data` is at least that size, order of magnitude (do you need them all?), and there will be temporary copies too, especially when swapping around pandas columns. I recommend you try `memory_profiler`.
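For that last suggestion, a minimal (hypothetical) way to use memory_profiler is to decorate the suspect function and run the script through the profiler module; the function and file names below are placeholders:

from memory_profiler import profile

@profile                        # prints per-line memory increments when run
def load_file(path):
    data = open(path).read()    # the big allocation should show up on this line
    lines = data.split('\n')
    return len(lines)

if __name__ == '__main__':
    load_file('events.json')    # placeholder path

# Run with:
#   pip install memory_profiler
#   python -m memory_profiler this_script.py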
You are welcome.