OOM killed
Hi,
I ran dqn_workflow.py with 7.9 GB of training data, but the process was OOM-killed. My environment and the OOM logs are below.
workflow: dqn_workflow.py
training_data: 8 features, 20,249,257 rows, 7.9 GB
training_eval_data: 8 features, 2,028,916 rows, 0.8 GB
RAM: 80 GB
INFO:ml.rl.evaluation.evaluation_data_page:EvaluationDataPage minibatch size: 2028912
WARNING:ml.rl.evaluation.doubly_robust_estimator:Can't normalize DR-CPE because of small or negative logged_policy_score
Killed
[Tue May 7 22:05:38 2019] python invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
[Tue May 7 22:05:38 2019] python cpuset=42ee6ef8b84594988960735ef211ac05221059efc2d524f2afc1e2b49eb46d0c mems_allowed=0-1
[Tue May 7 22:05:38 2019] CPU: 1 PID: 51997 Comm: python Tainted: P O 4.20.13-1.el7.elrepo.x86_64 #1
[Tue May 7 22:05:38 2019] Hardware name: Dell Inc. PowerEdge C4140/013M88, BIOS 1.6.11 11/21/2018
[Tue May 7 22:05:38 2019] Call Trace:
[Tue May 7 22:05:38 2019] dump_stack+0x63/0x88
[Tue May 7 22:05:38 2019] dump_header+0x78/0x2a4
[Tue May 7 22:05:38 2019] ? mem_cgroup_scan_tasks+0x9c/0xf0
[Tue May 7 22:05:38 2019] oom_kill_process+0x26b/0x290
[Tue May 7 22:05:38 2019] out_of_memory+0x140/0x4b0
[Tue May 7 22:05:38 2019] mem_cgroup_out_of_memory+0x4b/0x80
[Tue May 7 22:05:38 2019] try_charge+0x6e2/0x750
[Tue May 7 22:05:38 2019] mem_cgroup_try_charge+0x8c/0x1e0
[Tue May 7 22:05:38 2019] __add_to_page_cache_locked+0x1a0/0x300
[Tue May 7 22:05:38 2019] ? scan_shadow_nodes+0x30/0x30
[Tue May 7 22:05:38 2019] add_to_page_cache_lru+0x4e/0xd0
[Tue May 7 22:05:38 2019] filemap_fault+0x428/0x7c0
[Tue May 7 22:05:38 2019] ? xas_find+0x138/0x1a0
[Tue May 7 22:05:38 2019] ? filemap_map_pages+0x153/0x3c0
[Tue May 7 22:05:38 2019] __do_fault+0x3e/0xc0
[Tue May 7 22:05:38 2019] __handle_mm_fault+0xbd6/0xe80
[Tue May 7 22:05:38 2019] handle_mm_fault+0x102/0x220
[Tue May 7 22:05:38 2019] __do_page_fault+0x21c/0x4c0
[Tue May 7 22:05:38 2019] do_page_fault+0x37/0x140
[Tue May 7 22:05:38 2019] ? page_fault+0x8/0x30
[Tue May 7 22:05:38 2019] page_fault+0x1e/0x30
...
[Tue May 7 22:05:38 2019] Memory cgroup out of memory: Kill process 51997 (python) score 997 or sacrifice child
[Tue May 7 22:05:38 2019] Killed process 51997 (python) total-vm:102757536kB, anon-rss:83335008kB, file-rss:132692kB, shmem-rss:8192kB
[Tue May 7 22:05:42 2019] oom_reaper: reaped process 51997 (python), now anon-rss:0kB, file-rss:127188kB, shmem-rss:8192kB
(Resource-usage screenshot legend: green = CPU, yellow = RAM.)
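Since the log shows the EvaluationDataPage being built from all ~2 million evaluation rows in a single pass, one low-effort workaround is to subsample training_eval_data before running dqn_workflow.py. The sketch below is only an illustration: the file names, the one-JSON-object-per-line layout, and the 10% sample rate are assumptions on my part, not something taken from the workflow itself.

```python
# Hypothetical workaround sketch: randomly keep a fraction of the
# evaluation rows so the EvaluationDataPage is built from far fewer
# than the ~2M rows shown in the log above. File names, the
# line-per-record layout, and the sample rate are assumptions.
import random

SAMPLE_RATE = 0.10  # keep roughly 10% of evaluation rows

with open("training_eval_data.json") as src, \
        open("training_eval_data_sampled.json", "w") as dst:
    for line in src:
        if random.random() < SAMPLE_RATE:
            dst.write(line)
```

Metrics computed on the sampled evaluation set will be noisier, so this is only a way to get the run to finish within 80 GB of RAM, not a fix for the underlying memory usage.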
Great. Let’s close this issue.
@pjy953 No problem. Let us know how it goes.