
Memory Keep Increasing

See original GitHub issue

Is there anything wrong with the following code? My memory keeps increasing and the machine keeps rebooting because of auto-healing. I have a 16 GB machine and I'm only reading 500 MB at a time. Something is terribly wrong, I guess.

#!/usr/bin/python

import os
import sys
import uuid
import time
import base64
import requests
import s3fs
import fastparquet
import glob
import vtap_cols
import datetime

import pandas as pd
import numpy as np
import ujson as json

start_time = time.time()

key = 'AKIAJ'
secret = 'IX9sRERZ'

fs = s3fs.S3FileSystem(key=key, secret=secret)
s3open = fs.open

filelist = glob.glob('/vol/vtap-sync/vnk/*/*/*/*/events.json')

# get a UUID - URL safe, Base64
def get_a_uuid():
    r_uuid = base64.urlsafe_b64encode(uuid.uuid4().bytes)
    return r_uuid.replace('=', '')

def flatten(event_data):
    return pd.DataFrame(event_data).set_index('key').squeeze()

def flat(x):
    return dict([(list(d.values())[1], list(d.values())[0]) for d in x])

def get_cols_map(cols):
    default_cols = vtap_cols.columns_to_replace.keys()
    correct_cols_map = {}
    for col in cols:
        if col in default_cols:
            correct_cols_map[col] = vtap_cols.columns_to_replace[col]
    return correct_cols_map

for file in filelist:
    filedate = datetime.datetime.utcfromtimestamp(os.stat(file).st_mtime)
    curdate = datetime.datetime.now()
    print filedate, curdate, curdate-filedate
    if (curdate - filedate).total_seconds() > 5400:
	
        parts = file.split('/')
        parts[4] = 'yr='+ parts[4]
        parts[5] = 'mn='+ parts[5]
        parts[6] = 'dt='+ parts[6]
        idx = parts.index('vtap-sync')
        newfile = ('/'.join(parts[idx:idx+5] ) + '/proc.json').strip('/')
        moved_file = file.replace('events.json', 'proc.json')
        print "[+] Processing File %s"%file
        try:
            os.rename(file, moved_file)
            left_over_events = ''
            has_next = True
            with open(moved_file, 'r') as events:
                while has_next:
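                    # read the next 512 MB (536870912-byte) chunk of the JSON-lines file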
                    buf = events.read(536870912)
                    if buf:
                        left_over_events += buf
                        left_parts = left_over_events.split('\n')
                        events_data = left_parts[:-1]
                        left_over_events = left_parts[-1]
                        jevents_data = '[' + ','.join(events_data) + ']'
                        df = pd.read_json(jevents_data)
                        event_data = pd.io.json.json_normalize(df['event_data'].apply(flat).tolist())
                        df = df.drop('event_data', axis=1)
                        df = df.join(event_data)
                        df.replace('[Rs. ]','',regex=True, inplace=True)
                        cls_to_replace = [column for column in df.columns if 'total' in column]
                        df[cls_to_replace] = df[cls_to_replace].apply(pd.to_numeric, errors='coerce')
                        # df.rename(columns=get_cols_map(df.columns), inplace=True)
                        parquet_filename = newfile.replace('proc.json', 'events_'+get_a_uuid()+'.parquet')
                        fastparquet.write(parquet_filename, df, row_group_offsets=5000000, compression='GZIP',file_scheme='simple', append=False, write_index=True, has_nulls=True, open_with=s3open)
                    else: has_next = False
        except Exception as e:
            if os.path.isfile(moved_file):
                os.rename(moved_file, file)
            print e
            requests.post('https://api.flock.com/hooks/sendMessage/ab430c0d-2ed9-4226-84ff-f803245a0e05', json={'text': "Error while processing file %s and error is %s"%(file,e)})
        print "[+] Processing Done %s"%file
	
end_time = time.time()

requests.post('https://api.flock.com/hooks/sendMessage/3dcf9001-9dc8a2d57a1b', json={'text': "Proccessed %s. Time Taken: %s"%(len(filelist), end_time-start_time)})
requests.post('https://api.flock.com/hooks/sendMessage/ab430c0d-f803245a0e05', json={'text': "Proccessed %s. Time Taken: %s"%(len(filelist), end_time-start_time)})


Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 11 (5 by maintainers)

Top GitHub Comments

1 reaction
martindurant commented, Aug 30, 2017

First, I would certainly move the processing within the try block into a function, which will do a better job of releasing memory each time you write some parquet. In your particular example, you are also capturing e, and variables derived from it, which may well hold on to references to e and any variables in the stack. These are all global variables, so you would be best to have that in a function too. This is good general policy: don't have any function do too much:

def process_all(filelist):
    for fn in filelist:
        make_file_names
        process_file(fn)

def process_file(fn):
    with open(fn) as f:
        while ...:
            buf = f.read
            left_over
            to_parquet(events_data, outfile)

def to_parquet(events_data, outfile):
    try:
        df = do_stuff
        fastparquet.write(outfile, df)
    except Exception as e:
        print(...)
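
A slightly more concrete version of that outline could look like the sketch below. The names (process_all, process_file, to_parquet, CHUNK_SIZE), the per-chunk UUID file naming and the 512 MB chunk size are illustrative assumptions layered on top of the original script, not something taken from the thread; the point is only that every large object lives inside a short-lived function scope:

import uuid
import pandas as pd
import fastparquet

CHUNK_SIZE = 512 * 1024 * 1024  # matches the 536870912-byte read in the original script

def process_all(filelist, open_with):
    # The top-level loop only builds names and delegates; no large objects live here.
    for fn in filelist:
        out_prefix = fn.replace('events.json', 'events_')  # illustrative naming
        process_file(fn, out_prefix, open_with)

def process_file(fn, out_prefix, open_with):
    left_over = ''
    with open(fn) as f:
        while True:
            buf = f.read(CHUNK_SIZE)
            if not buf:
                break
            lines = (left_over + buf).split('\n')
            left_over = lines.pop()
            # Everything derived from this chunk lives only inside to_parquet,
            # so it becomes collectable as soon as the parquet file is written.
            to_parquet(lines, out_prefix + uuid.uuid4().hex + '.parquet', open_with)

def to_parquet(lines, outfile, open_with):
    try:
        df = pd.read_json('[' + ','.join(lines) + ']')
        fastparquet.write(outfile, df, compression='GZIP', open_with=open_with)
    except Exception as e:
        # Report and let e fall out of scope; holding on to it keeps the whole
        # traceback, and every frame's locals, alive with it.
        print("failed to write %s: %s" % (outfile, e))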

Furthermore, if buf is 500 MB, then when you do left_over_events += buf you will, at least temporarily, duplicate that amount, and you then split, amend and join again before passing it to pandas. Each of the variables buf, left_parts, events_data, jevents_data, df and event_data is at least that size, order of magnitude (do you need them all?), and there will be temporary copies too, especially when rearranging pandas columns. I recommend you try memory_profiler.
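
memory_profiler reports per-line memory deltas, which is usually enough to see which of those intermediate copies dominates. A minimal sketch of how it could be applied here, assuming the package is installed and with process_chunk standing in for whichever function you want to inspect:

# pip install memory_profiler
from memory_profiler import profile

@profile  # prints a per-line memory report for this function
def process_chunk(path):
    with open(path) as f:
        buf = f.read(512 * 1024 * 1024)   # the suspect allocation
    parts = buf.split('\n')               # roughly doubles the footprint
    return len(parts)

if __name__ == '__main__':
    process_chunk('events.json')

# Run with:  python -m memory_profiler this_script.py
# Each line of process_chunk is then annotated with how much memory it added
# and whether that memory was released afterwards.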

0 reactions
martindurant commented, Aug 30, 2017

You are welcome.

Read more comments on GitHub >

