* Kernel 2.6.29 runs out of memory and hangs.
@ 2009-04-24 10:09 Zeno Davatz
  2009-04-24 17:17 ` David Rientjes
  0 siblings, 1 reply; 17+ messages in thread
From: Zeno Davatz @ 2009-04-24 10:09 UTC (permalink / raw)
  To: linux-kernel; +Cc: Hannes Wyss

Dear All

Our kernel version:

Linux thinpower 2.6.29 #5 SMP Sun Apr 5 20:04:08 UTC 2009 x86_64
Intel(R) Xeon(R) CPU E5450 @ 3.00GHz GenuineIntel GNU/Linux

This morning our server ran out of memory and had to be hard rebooted.
The system has 32 GB of memory and 4 quad-core Xeon CPUs.
/var/log/kernel/current tells us the following:

Apr 24 09:01:06 thinpower [1349923.491914] postgres invoked
oom-killer: gfp_mask=0x1201d2, order=0, oomkilladj=0
Apr 24 09:01:06 thinpower [1349923.491919] postgres cpuset=/ mems_allowed=0
Apr 24 09:01:06 thinpower [1349923.491922] Pid: 2393, comm: postgres
Not tainted 2.6.29 #5
Apr 24 09:01:06 thinpower [1349923.491924] Call Trace:
Apr 24 09:01:06 thinpower [1349923.491934]  [<ffffffff8106eda9>] ?
cpuset_print_task_mems_allowed+0x99/0x9f
Apr 24 09:01:06 thinpower [1349923.491939]  [<ffffffff81083d28>]
oom_kill_process+0x96/0x246
Apr 24 09:01:06 thinpower [1349923.491942]  [<ffffffff8106dcf0>] ?
cpuset_mems_allowed_intersects+0x1c/0x1e
Apr 24 09:01:06 thinpower [1349923.491944]  [<ffffffff810840a5>] ?
badness+0x1a3/0x1e6
Apr 24 09:01:06 thinpower [1349923.491946]  [<ffffffff8108421c>]
__out_of_memory+0x134/0x14b
Apr 24 09:01:06 thinpower [1349923.491949]  [<ffffffff8108438b>]
out_of_memory+0x158/0x18a
Apr 24 09:01:06 thinpower [1349923.491951]  [<ffffffff81087906>]
__alloc_pages_internal+0x372/0x434
Apr 24 09:01:06 thinpower [1349923.491954]  [<ffffffff810aa490>]
alloc_pages_current+0xb9/0xc2
Apr 24 09:01:06 thinpower [1349923.491957]  [<ffffffff81081a59>]
__page_cache_alloc+0x67/0x6b
Apr 24 09:01:06 thinpower [1349923.491959]  [<ffffffff81089426>]
__do_page_cache_readahead+0x96/0x192
Apr 24 09:01:06 thinpower [1349923.491961]  [<ffffffff81089575>]
do_page_cache_readahead+0x53/0x60
Apr 24 09:01:06 thinpower [1349923.491963]  [<ffffffff81081ec7>]
filemap_fault+0x15e/0x313
Apr 24 09:01:06 thinpower [1349923.491967]  [<ffffffff81094451>]
__do_fault+0x53/0x393
Apr 24 09:01:06 thinpower [1349923.491969]  [<ffffffff81030531>] ?
__wake_up_sync+0x45/0x4e
Apr 24 09:01:06 thinpower [1349923.491972]  [<ffffffff81096702>]
handle_mm_fault+0x36b/0x854
Apr 24 09:01:06 thinpower [1349923.491976]  [<ffffffff813a1acb>] ?
release_sock+0xb0/0xbb
Apr 24 09:01:06 thinpower [1349923.491979]  [<ffffffff81022c8f>] ?
default_spin_lock_flags+0x9/0xe
Apr 24 09:01:06 thinpower [1349923.491984]  [<ffffffff814ac6fe>]
do_page_fault+0x662/0xa5c
Apr 24 09:01:06 thinpower [1349923.491988]  [<ffffffff81400b8e>] ?
inet_sendmsg+0x46/0x53
Apr 24 09:01:06 thinpower [1349923.491992]  [<ffffffff8139ed97>] ?
__sock_sendmsg+0x59/0x62
Apr 24 09:01:06 thinpower [1349923.491994]  [<ffffffff8139f6d8>] ?
sock_sendmsg+0xc7/0xe0
Apr 24 09:01:06 thinpower [1349923.491996]  [<ffffffff81086b5e>] ?
free_pages+0x32/0x36
Apr 24 09:01:06 thinpower [1349923.492000]  [<ffffffff8104e7ec>] ?
autoremove_wake_function+0x0/0x38
Apr 24 09:01:06 thinpower [1349923.492004]  [<ffffffff810c1d76>] ?
core_sys_select+0x1df/0x213
Apr 24 09:01:06 thinpower [1349923.492007]  [<ffffffff810b2cc2>] ?
nameidata_to_filp+0x41/0x52
Apr 24 09:01:06 thinpower [1349923.492009]  [<ffffffff8139f028>] ?
sockfd_lookup_light+0x1b/0x54
Apr 24 09:01:06 thinpower [1349923.492013]  [<ffffffff81011ac8>] ?
read_tsc+0xe/0x24
Apr 24 09:01:06 thinpower [1349923.492017]  [<ffffffff81055990>] ?
getnstimeofday+0x58/0xb4
Apr 24 09:01:06 thinpower [1349923.492019]  [<ffffffff810516ed>] ?
ktime_get_ts+0x49/0x4e
Apr 24 09:01:06 thinpower [1349923.492022]  [<ffffffff810c0d35>] ?
poll_select_copy_remaining+0xc5/0xea
Apr 24 09:01:06 thinpower [1349923.492023]  [<ffffffff810c1ff3>] ?
sys_select+0xa7/0xbc
Apr 24 09:01:06 thinpower [1349923.492026]  [<ffffffff814aa305>]
page_fault+0x25/0x30

Before that we had this:

Apr 24 08:59:24 thinpower [1349806.982419] 6954 total pagecache pages
Apr 24 08:59:24 thinpower [1349806.982420] 0 pages in swap cache
Apr 24 08:59:24 thinpower [1349806.982422] Swap cache stats: add 0,
delete 0, find 0/0
Apr 24 08:59:24 thinpower [1349806.982423] Free swap  = 0kB
Apr 24 08:59:24 thinpower [1349806.982423] Total swap = 0kB
Apr 24 08:59:24 thinpower [1349807.182725] 8388592 pages RAM
Apr 24 08:59:24 thinpower [1349807.182728] 136791 pages reserved
Apr 24 08:59:24 thinpower [1349807.182729] 35053 pages shared
Apr 24 08:59:24 thinpower [1349807.182729] 8207346 pages non-shared
Apr 24 08:59:24 thinpower [1349807.182732] Out of memory: kill process
21592 (ruby) score 109489 or a child
Apr 24 08:59:24 thinpower [1349807.182795] Killed process 21592 (ruby)
Apr 24 09:00:59 thinpower [1349916.949909] apache2 invoked oom-killer:
gfp_mask=0x1201d2, order=0, oomkilladj=0
Apr 24 09:01:00 thinpower [1349916.949915] apache2 cpuset=/ mems_allowed=0
Apr 24 09:01:00 thinpower [1349916.949918] Pid: 21599, comm: apache2
Not tainted 2.6.29 #5
Apr 24 09:01:00 thinpower [1349916.949920] Call Trace:
Apr 24 09:01:00 thinpower [1349916.949930]  [<ffffffff8106eda9>] ?
cpuset_print_task_mems_allowed+0x99/0x9f
Apr 24 09:01:00 thinpower [1349916.949936]  [<ffffffff81083d28>]
oom_kill_process+0x96/0x246
Apr 24 09:01:00 thinpower [1349916.949938]  [<ffffffff8106dcf0>] ?
cpuset_mems_allowed_intersects+0x1c/0x1e
Apr 24 09:01:00 thinpower [1349916.949940]  [<ffffffff810840a5>] ?
badness+0x1a3/0x1e6
Apr 24 09:01:00 thinpower [1349916.949942]  [<ffffffff8108421c>]
__out_of_memory+0x134/0x14b
Apr 24 09:01:00 thinpower [1349916.949945]  [<ffffffff8108438b>]
out_of_memory+0x158/0x18a
Apr 24 09:01:00 thinpower [1349916.949947]  [<ffffffff81087906>]
__alloc_pages_internal+0x372/0x434
Apr 24 09:01:00 thinpower [1349916.949950]  [<ffffffff810aa490>]
alloc_pages_current+0xb9/0xc2
Apr 24 09:01:00 thinpower [1349916.949953]  [<ffffffff81081a59>]
__page_cache_alloc+0x67/0x6b
Apr 24 09:01:00 thinpower [1349916.949954]  [<ffffffff81081841>] ?
sync_page+0x0/0x40
Apr 24 09:01:00 thinpower [1349916.949957]  [<ffffffff81089426>]
__do_page_cache_readahead+0x96/0x192
Apr 24 09:01:00 thinpower [1349916.949959]  [<ffffffff81089575>]
do_page_cache_readahead+0x53/0x60
Apr 24 09:01:00 thinpower [1349916.949961]  [<ffffffff81081ec7>]
filemap_fault+0x15e/0x313
Apr 24 09:01:00 thinpower [1349916.949964]  [<ffffffff81094451>]
__do_fault+0x53/0x393
Apr 24 09:01:00 thinpower [1349916.949967]  [<ffffffff81096702>]
handle_mm_fault+0x36b/0x854
Apr 24 09:01:00 thinpower [1349916.949971]  [<ffffffff8100cf6e>] ?
apic_timer_interrupt+0xe/0x20
Apr 24 09:01:00 thinpower [1349916.949974]  [<ffffffff81022c8f>] ?
default_spin_lock_flags+0x9/0xe
Apr 24 09:01:00 thinpower [1349916.949976]  [<ffffffff81022c8f>] ?
default_spin_lock_flags+0x9/0xe
Apr 24 09:01:00 thinpower [1349916.949981]  [<ffffffff814ac6fe>]
do_page_fault+0x662/0xa5c
Apr 24 09:01:00 thinpower [1349916.949985]  [<ffffffff813a1acb>] ?
release_sock+0xb0/0xbb
Apr 24 09:01:00 thinpower [1349916.949989]  [<ffffffff813e6628>] ?
tcp_recvmsg+0x720/0x833
Apr 24 09:01:00 thinpower [1349916.949991]  [<ffffffff813a0eb6>] ?
sock_common_recvmsg+0x32/0x47
Apr 24 09:01:00 thinpower [1349916.949995]  [<ffffffff81198a02>] ?
selinux_socket_recvmsg+0x1d/0x1f
Apr 24 09:01:00 thinpower [1349916.949999]  [<ffffffff8139ef03>] ?
__sock_recvmsg+0x6d/0x79
Apr 24 09:01:00 thinpower [1349916.950001]  [<ffffffff8139effd>] ?
sock_aio_read+0xee/0xfe
Apr 24 09:01:00 thinpower [1349916.950005]  [<ffffffff810b42b3>] ?
do_sync_read+0xe7/0x12d
Apr 24 09:01:00 thinpower [1349916.950008]  [<ffffffff813f22bd>] ?
tcp_write_xmit+0x453/0x913

Thank you for your feedback.

Best regards
Zeno Davatz


* Re: Kernel 2.6.29 runs out of memory and hangs.
  2009-04-24 10:09 Kernel 2.6.29 runs out of memory and hangs Zeno Davatz
@ 2009-04-24 17:17 ` David Rientjes
  2009-04-24 18:13   ` Zeno Davatz
  2009-04-24 18:24   ` Zeno Davatz
  0 siblings, 2 replies; 17+ messages in thread
From: David Rientjes @ 2009-04-24 17:17 UTC (permalink / raw)
  To: Zeno Davatz; +Cc: linux-kernel, Hannes Wyss

On Fri, 24 Apr 2009, Zeno Davatz wrote:

> Dear All
> 
> Our kernel version:
> 
> Linux thinpower 2.6.29 #5 SMP Sun Apr 5 20:04:08 UTC 2009 x86_64
> Intel(R) Xeon(R) CPU E5450 @ 3.00GHz GenuineIntel GNU/Linux
> 
> This morning our server ran out of memory and had to be hard rebooted.
> The system has 32 GB of memory and 4 quad-core Xeon CPUs.
> /var/log/kernel/current tells us the following:
> 
> Apr 24 09:01:06 thinpower [1349923.491914] postgres invoked
> oom-killer: gfp_mask=0x1201d2, order=0, oomkilladj=0
> Apr 24 09:01:06 thinpower [1349923.491919] postgres cpuset=/ mems_allowed=0
> Apr 24 09:01:06 thinpower [1349923.491922] Pid: 2393, comm: postgres
> Not tainted 2.6.29 #5
> Apr 24 09:01:06 thinpower [1349923.491924] Call Trace:
> Apr 24 09:01:06 thinpower [1349923.491934]  [<ffffffff8106eda9>] ?
> cpuset_print_task_mems_allowed+0x99/0x9f
> Apr 24 09:01:06 thinpower [1349923.491939]  [<ffffffff81083d28>]
> oom_kill_process+0x96/0x246
> Apr 24 09:01:06 thinpower [1349923.491942]  [<ffffffff8106dcf0>] ?
> cpuset_mems_allowed_intersects+0x1c/0x1e
> Apr 24 09:01:06 thinpower [1349923.491944]  [<ffffffff810840a5>] ?
> badness+0x1a3/0x1e6
> Apr 24 09:01:06 thinpower [1349923.491946]  [<ffffffff8108421c>]
> __out_of_memory+0x134/0x14b
> Apr 24 09:01:06 thinpower [1349923.491949]  [<ffffffff8108438b>]
> out_of_memory+0x158/0x18a
> Apr 24 09:01:06 thinpower [1349923.491951]  [<ffffffff81087906>]
> __alloc_pages_internal+0x372/0x434
> Apr 24 09:01:06 thinpower [1349923.491954]  [<ffffffff810aa490>]
> alloc_pages_current+0xb9/0xc2
> Apr 24 09:01:06 thinpower [1349923.491957]  [<ffffffff81081a59>]
> __page_cache_alloc+0x67/0x6b
> Apr 24 09:01:06 thinpower [1349923.491959]  [<ffffffff81089426>]
> __do_page_cache_readahead+0x96/0x192
> Apr 24 09:01:06 thinpower [1349923.491961]  [<ffffffff81089575>]
> do_page_cache_readahead+0x53/0x60
> Apr 24 09:01:06 thinpower [1349923.491963]  [<ffffffff81081ec7>]
> filemap_fault+0x15e/0x313
> Apr 24 09:01:06 thinpower [1349923.491967]  [<ffffffff81094451>]
> __do_fault+0x53/0x393
> Apr 24 09:01:06 thinpower [1349923.491969]  [<ffffffff81030531>] ?
> __wake_up_sync+0x45/0x4e
> Apr 24 09:01:06 thinpower [1349923.491972]  [<ffffffff81096702>]
> handle_mm_fault+0x36b/0x854
> Apr 24 09:01:06 thinpower [1349923.491976]  [<ffffffff813a1acb>] ?
> release_sock+0xb0/0xbb
> Apr 24 09:01:06 thinpower [1349923.491979]  [<ffffffff81022c8f>] ?
> default_spin_lock_flags+0x9/0xe
> Apr 24 09:01:06 thinpower [1349923.491984]  [<ffffffff814ac6fe>]
> do_page_fault+0x662/0xa5c
> Apr 24 09:01:06 thinpower [1349923.491988]  [<ffffffff81400b8e>] ?
> inet_sendmsg+0x46/0x53
> Apr 24 09:01:06 thinpower [1349923.491992]  [<ffffffff8139ed97>] ?
> __sock_sendmsg+0x59/0x62
> Apr 24 09:01:06 thinpower [1349923.491994]  [<ffffffff8139f6d8>] ?
> sock_sendmsg+0xc7/0xe0
> Apr 24 09:01:06 thinpower [1349923.491996]  [<ffffffff81086b5e>] ?
> free_pages+0x32/0x36
> Apr 24 09:01:06 thinpower [1349923.492000]  [<ffffffff8104e7ec>] ?
> autoremove_wake_function+0x0/0x38
> Apr 24 09:01:06 thinpower [1349923.492004]  [<ffffffff810c1d76>] ?
> core_sys_select+0x1df/0x213
> Apr 24 09:01:06 thinpower [1349923.492007]  [<ffffffff810b2cc2>] ?
> nameidata_to_filp+0x41/0x52
> Apr 24 09:01:06 thinpower [1349923.492009]  [<ffffffff8139f028>] ?
> sockfd_lookup_light+0x1b/0x54
> Apr 24 09:01:06 thinpower [1349923.492013]  [<ffffffff81011ac8>] ?
> read_tsc+0xe/0x24
> Apr 24 09:01:06 thinpower [1349923.492017]  [<ffffffff81055990>] ?
> getnstimeofday+0x58/0xb4
> Apr 24 09:01:06 thinpower [1349923.492019]  [<ffffffff810516ed>] ?
> ktime_get_ts+0x49/0x4e
> Apr 24 09:01:06 thinpower [1349923.492022]  [<ffffffff810c0d35>] ?
> poll_select_copy_remaining+0xc5/0xea
> Apr 24 09:01:06 thinpower [1349923.492023]  [<ffffffff810c1ff3>] ?
> sys_select+0xa7/0xbc
> Apr 24 09:01:06 thinpower [1349923.492026]  [<ffffffff814aa305>]
> page_fault+0x25/0x30
> 

Are these the last messages in the dmesg?  There should be subsequent 
information that describes the current state of memory following the stack 
trace.

If this is reproducible, I'd recommend enabling 
/proc/sys/vm/oom_dump_tasks so that the oom killer will dump the tasklist 
and show us what may be causing the livelock.

> Before that we had this:
> 
> Apr 24 08:59:24 thinpower [1349806.982419] 6954 total pagecache pages
> Apr 24 08:59:24 thinpower [1349806.982420] 0 pages in swap cache
> Apr 24 08:59:24 thinpower [1349806.982422] Swap cache stats: add 0,
> delete 0, find 0/0
> Apr 24 08:59:24 thinpower [1349806.982423] Free swap  = 0kB
> Apr 24 08:59:24 thinpower [1349806.982423] Total swap = 0kB
> Apr 24 08:59:24 thinpower [1349807.182725] 8388592 pages RAM
> Apr 24 08:59:24 thinpower [1349807.182728] 136791 pages reserved
> Apr 24 08:59:24 thinpower [1349807.182729] 35053 pages shared
> Apr 24 08:59:24 thinpower [1349807.182729] 8207346 pages non-shared
> Apr 24 08:59:24 thinpower [1349807.182732] Out of memory: kill process
> 21592 (ruby) score 109489 or a child
> Apr 24 08:59:24 thinpower [1349807.182795] Killed process 21592 (ruby)
> Apr 24 09:00:59 thinpower [1349916.949909] apache2 invoked oom-killer:
> gfp_mask=0x1201d2, order=0, oomkilladj=0

apache2 in this case (and postgres later) are valid oom killer targets
because their /proc/pid/oom_adj score is 0.  The only reason they wouldn't
be getting killed subsequently is that the already killed task, ruby,
is PF_EXITING and has failed to exit.  It has access to memory reserves
and if it hangs before dying, the machine will livelock.
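
You can check what the oom killer currently sees for any given task from
userspace; for example (substitute a real pid):

	cat /proc/<pid>/oom_adj     # the user adjustment, 0 by default
	cat /proc/<pid>/oom_score   # the badness score computed from it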

> Apr 24 09:01:00 thinpower [1349916.949915] apache2 cpuset=/ mems_allowed=0
> Apr 24 09:01:00 thinpower [1349916.949918] Pid: 21599, comm: apache2
> Not tainted 2.6.29 #5
> Apr 24 09:01:00 thinpower [1349916.949920] Call Trace:
> Apr 24 09:01:00 thinpower [1349916.949930]  [<ffffffff8106eda9>] ?
> cpuset_print_task_mems_allowed+0x99/0x9f
> Apr 24 09:01:00 thinpower [1349916.949936]  [<ffffffff81083d28>]
> oom_kill_process+0x96/0x246
> Apr 24 09:01:00 thinpower [1349916.949938]  [<ffffffff8106dcf0>] ?
> cpuset_mems_allowed_intersects+0x1c/0x1e
> Apr 24 09:01:00 thinpower [1349916.949940]  [<ffffffff810840a5>] ?
> badness+0x1a3/0x1e6
> Apr 24 09:01:00 thinpower [1349916.949942]  [<ffffffff8108421c>]
> __out_of_memory+0x134/0x14b
> Apr 24 09:01:00 thinpower [1349916.949945]  [<ffffffff8108438b>]
> out_of_memory+0x158/0x18a
> Apr 24 09:01:00 thinpower [1349916.949947]  [<ffffffff81087906>]
> __alloc_pages_internal+0x372/0x434
> Apr 24 09:01:00 thinpower [1349916.949950]  [<ffffffff810aa490>]
> alloc_pages_current+0xb9/0xc2
> Apr 24 09:01:00 thinpower [1349916.949953]  [<ffffffff81081a59>]
> __page_cache_alloc+0x67/0x6b
> Apr 24 09:01:00 thinpower [1349916.949954]  [<ffffffff81081841>] ?
> sync_page+0x0/0x40
> Apr 24 09:01:00 thinpower [1349916.949957]  [<ffffffff81089426>]
> __do_page_cache_readahead+0x96/0x192
> Apr 24 09:01:00 thinpower [1349916.949959]  [<ffffffff81089575>]
> do_page_cache_readahead+0x53/0x60
> Apr 24 09:01:00 thinpower [1349916.949961]  [<ffffffff81081ec7>]
> filemap_fault+0x15e/0x313
> Apr 24 09:01:00 thinpower [1349916.949964]  [<ffffffff81094451>]
> __do_fault+0x53/0x393
> Apr 24 09:01:00 thinpower [1349916.949967]  [<ffffffff81096702>]
> handle_mm_fault+0x36b/0x854
> Apr 24 09:01:00 thinpower [1349916.949971]  [<ffffffff8100cf6e>] ?
> apic_timer_interrupt+0xe/0x20
> Apr 24 09:01:00 thinpower [1349916.949974]  [<ffffffff81022c8f>] ?
> default_spin_lock_flags+0x9/0xe
> Apr 24 09:01:00 thinpower [1349916.949976]  [<ffffffff81022c8f>] ?
> default_spin_lock_flags+0x9/0xe
> Apr 24 09:01:00 thinpower [1349916.949981]  [<ffffffff814ac6fe>]
> do_page_fault+0x662/0xa5c
> Apr 24 09:01:00 thinpower [1349916.949985]  [<ffffffff813a1acb>] ?
> release_sock+0xb0/0xbb
> Apr 24 09:01:00 thinpower [1349916.949989]  [<ffffffff813e6628>] ?
> tcp_recvmsg+0x720/0x833
> Apr 24 09:01:00 thinpower [1349916.949991]  [<ffffffff813a0eb6>] ?
> sock_common_recvmsg+0x32/0x47
> Apr 24 09:01:00 thinpower [1349916.949995]  [<ffffffff81198a02>] ?
> selinux_socket_recvmsg+0x1d/0x1f
> Apr 24 09:01:00 thinpower [1349916.949999]  [<ffffffff8139ef03>] ?
> __sock_recvmsg+0x6d/0x79
> Apr 24 09:01:00 thinpower [1349916.950001]  [<ffffffff8139effd>] ?
> sock_aio_read+0xee/0xfe
> Apr 24 09:01:00 thinpower [1349916.950005]  [<ffffffff810b42b3>] ?
> do_sync_read+0xe7/0x12d
> Apr 24 09:01:00 thinpower [1349916.950008]  [<ffffffff813f22bd>] ?
> tcp_write_xmit+0x453/0x913
> 


* Re: Kernel 2.6.29 runs out of memory and hangs.
  2009-04-24 17:17 ` David Rientjes
@ 2009-04-24 18:13   ` Zeno Davatz
  2009-04-24 19:00     ` David Rientjes
  2009-04-24 18:24   ` Zeno Davatz
  1 sibling, 1 reply; 17+ messages in thread
From: Zeno Davatz @ 2009-04-24 18:13 UTC (permalink / raw)
  To: David Rientjes; +Cc: linux-kernel, Hannes Wyss

Dear David

Thank you for your reply, as follows:

On Fri, Apr 24, 2009 at 7:17 PM, David Rientjes <rientjes@google.com> wrote:
> On Fri, 24 Apr 2009, Zeno Davatz wrote:
>> Linux thinpower 2.6.29 #5 SMP Sun Apr 5 20:04:08 UTC 2009 x86_64
>> Intel(R) Xeon(R) CPU E5450 @ 3.00GHz GenuineIntel GNU/Linux
>>
>> This morning our server ran out of memory and had to be hard rebooted.
>> The system has 32 GB of memory and 4 quad-core Xeon CPUs.
>> /var/log/kernel/current tells us the following:
>>
>> Apr 24 09:01:06 thinpower [1349923.491914] postgres invoked
>> oom-killer: gfp_mask=0x1201d2, order=0, oomkilladj=0
>> Apr 24 09:01:06 thinpower [1349923.491919] postgres cpuset=/ mems_allowed=0
>> Apr 24 09:01:06 thinpower [1349923.491922] Pid: 2393, comm: postgres
>> Not tainted 2.6.29 #5
>> Apr 24 09:01:06 thinpower [1349923.491924] Call Trace:
>> Apr 24 09:01:06 thinpower [1349923.491934]  [<ffffffff8106eda9>] ?
>> cpuset_print_task_mems_allowed+0x99/0x9f
>> Apr 24 09:01:06 thinpower [1349923.491939]  [<ffffffff81083d28>]
>> oom_kill_process+0x96/0x246
>> Apr 24 09:01:06 thinpower [1349923.491942]  [<ffffffff8106dcf0>] ?
>> cpuset_mems_allowed_intersects+0x1c/0x1e
>> Apr 24 09:01:06 thinpower [1349923.491944]  [<ffffffff810840a5>] ?
>> badness+0x1a3/0x1e6
>> Apr 24 09:01:06 thinpower [1349923.491946]  [<ffffffff8108421c>]
>> __out_of_memory+0x134/0x14b
>> Apr 24 09:01:06 thinpower [1349923.491949]  [<ffffffff8108438b>]
>> out_of_memory+0x158/0x18a
>> Apr 24 09:01:06 thinpower [1349923.491951]  [<ffffffff81087906>]
>> __alloc_pages_internal+0x372/0x434
>> Apr 24 09:01:06 thinpower [1349923.491954]  [<ffffffff810aa490>]
>> alloc_pages_current+0xb9/0xc2
>> Apr 24 09:01:06 thinpower [1349923.491957]  [<ffffffff81081a59>]
>> __page_cache_alloc+0x67/0x6b
>> Apr 24 09:01:06 thinpower [1349923.491959]  [<ffffffff81089426>]
>> __do_page_cache_readahead+0x96/0x192
>> Apr 24 09:01:06 thinpower [1349923.491961]  [<ffffffff81089575>]
>> do_page_cache_readahead+0x53/0x60
>> Apr 24 09:01:06 thinpower [1349923.491963]  [<ffffffff81081ec7>]
>> filemap_fault+0x15e/0x313
>> Apr 24 09:01:06 thinpower [1349923.491967]  [<ffffffff81094451>]
>> __do_fault+0x53/0x393
>> Apr 24 09:01:06 thinpower [1349923.491969]  [<ffffffff81030531>] ?
>> __wake_up_sync+0x45/0x4e
>> Apr 24 09:01:06 thinpower [1349923.491972]  [<ffffffff81096702>]
>> handle_mm_fault+0x36b/0x854
>> Apr 24 09:01:06 thinpower [1349923.491976]  [<ffffffff813a1acb>] ?
>> release_sock+0xb0/0xbb
>> Apr 24 09:01:06 thinpower [1349923.491979]  [<ffffffff81022c8f>] ?
>> default_spin_lock_flags+0x9/0xe
>> Apr 24 09:01:06 thinpower [1349923.491984]  [<ffffffff814ac6fe>]
>> do_page_fault+0x662/0xa5c
>> Apr 24 09:01:06 thinpower [1349923.491988]  [<ffffffff81400b8e>] ?
>> inet_sendmsg+0x46/0x53
>> Apr 24 09:01:06 thinpower [1349923.491992]  [<ffffffff8139ed97>] ?
>> __sock_sendmsg+0x59/0x62
>> Apr 24 09:01:06 thinpower [1349923.491994]  [<ffffffff8139f6d8>] ?
>> sock_sendmsg+0xc7/0xe0
>> Apr 24 09:01:06 thinpower [1349923.491996]  [<ffffffff81086b5e>] ?
>> free_pages+0x32/0x36
>> Apr 24 09:01:06 thinpower [1349923.492000]  [<ffffffff8104e7ec>] ?
>> autoremove_wake_function+0x0/0x38
>> Apr 24 09:01:06 thinpower [1349923.492004]  [<ffffffff810c1d76>] ?
>> core_sys_select+0x1df/0x213
>> Apr 24 09:01:06 thinpower [1349923.492007]  [<ffffffff810b2cc2>] ?
>> nameidata_to_filp+0x41/0x52
>> Apr 24 09:01:06 thinpower [1349923.492009]  [<ffffffff8139f028>] ?
>> sockfd_lookup_light+0x1b/0x54
>> Apr 24 09:01:06 thinpower [1349923.492013]  [<ffffffff81011ac8>] ?
>> read_tsc+0xe/0x24
>> Apr 24 09:01:06 thinpower [1349923.492017]  [<ffffffff81055990>] ?
>> getnstimeofday+0x58/0xb4
>> Apr 24 09:01:06 thinpower [1349923.492019]  [<ffffffff810516ed>] ?
>> ktime_get_ts+0x49/0x4e
>> Apr 24 09:01:06 thinpower [1349923.492022]  [<ffffffff810c0d35>] ?
>> poll_select_copy_remaining+0xc5/0xea
>> Apr 24 09:01:06 thinpower [1349923.492023]  [<ffffffff810c1ff3>] ?
>> sys_select+0xa7/0xbc
>> Apr 24 09:01:06 thinpower [1349923.492026]  [<ffffffff814aa305>]
>> page_fault+0x25/0x30
>>
>
> Are these the last messages in the dmesg?  There should be subsequent
> information that describes the current state of memory following the stack
> trace.

These are the last messages from /var/log/kernel/current (let me know
if you need more):

Apr 24 09:01:00 thinpower [1349916.950049] CPU    0: hi:  186, btch:
31 usd: 110
Apr 24 09:01:00 thinpower [1349916.950051] CPU    1: hi:  186, btch:
31 usd: 111
Apr 24 09:01:00 thinpower [1349916.950052] CPU    2: hi:  186, btch:
31 usd: 162
Apr 24 09:01:00 thinpower [1349916.950053] CPU    3: hi:  186, btch:
31 usd: 157
Apr 24 09:01:00 thinpower [1349916.950055] CPU    4: hi:  186, btch:
31 usd: 109
Apr 24 09:01:00 thinpower [1349916.950056] CPU    5: hi:  186, btch:
31 usd: 102
Apr 24 09:01:00 thinpower [1349916.950057] CPU    6: hi:  186, btch:
31 usd: 179
Apr 24 09:01:00 thinpower [1349916.950058] CPU    7: hi:  186, btch:
31 usd: 123
Apr 24 09:01:00 thinpower [1349916.950061] Active_anon:7600155
active_file:0 inactive_anon:552800
Apr 24 09:01:00 thinpower [1349916.950061]  inactive_file:5
unevictable:0 dirty:0 writeback:0 unstable:0
Apr 24 09:01:00 thinpower [1349916.950062]  free:38559 slab:12539
mapped:2 pagetables:22733 bounce:0
Apr 24 09:01:00 thinpower [1349916.950064] Node 0 DMA free:15700kB
min:8kB low:8kB high:12kB active_anon:0kB inactive_anon:0kB
active_file:0kB inactive_file:0kB unevictable:0kB present:15156kB
pages_scanned:0 all_unreclaimable? yes
Apr 24 09:01:00 thinpower [1349916.950067] lowmem_reserve[]: 0 3255 32292 32292
Apr 24 09:01:00 thinpower [1349916.950070] Node 0 DMA32 free:117732kB
min:2316kB low:2892kB high:3472kB active_anon:2266592kB
inactive_anon:453072kB active_file:0kB inactive_file:4kB
unevictable:0kB present:3333280kB pages_scanned:1587467
all_unreclaimable? no
Apr 24 09:01:00 thinpower [1349916.950074] lowmem_reserve[]: 0 0 29037 29037
Apr 24 09:01:00 thinpower [1349916.950076] Node 0 Normal free:20804kB
min:20672kB low:25840kB high:31008kB active_anon:28134028kB
inactive_anon:1758128kB active_file:0kB inactive_file:16kB
unevictable:0kB present:29734400kB pages_scanned:896
all_unreclaimable? no
Apr 24 09:01:00 thinpower [1349916.950080] lowmem_reserve[]: 0 0 0 0
Apr 24 09:01:00 thinpower [1349916.950082] Node 0 DMA: 1*4kB 0*8kB
1*16kB 2*32kB 2*64kB 1*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB
3*4096kB = 15700kB
Apr 24 09:01:00 thinpower [1349916.950089] Node 0 DMA32: 159*4kB
127*8kB 59*16kB 14*32kB 12*64kB 6*128kB 20*256kB 15*512kB 6*1024kB
2*2048kB 22*4096kB = 117732kB
Apr 24 09:01:00 thinpower [1349916.950095] Node 0 Normal: 93*4kB 8*8kB
54*16kB 11*32kB 2*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 1*2048kB
4*4096kB = 21108kB
Apr 24 09:01:00 thinpower [1349916.950101] 6946 total pagecache pages
Apr 24 09:01:00 thinpower [1349916.950102] 0 pages in swap cache
Apr 24 09:01:00 thinpower [1349916.950104] Swap cache stats: add 0,
delete 0, find 0/0
Apr 24 09:01:00 thinpower [1349916.950105] Free swap  = 0kB
Apr 24 09:01:00 thinpower [1349916.950106] Total swap = 0kB
Apr 24 09:01:00 thinpower [1349917.149549] 8388592 pages RAM
Apr 24 09:01:00 thinpower [1349917.149551] 136791 pages reserved
Apr 24 09:01:00 thinpower [1349917.149552] 66429 pages shared
Apr 24 09:01:00 thinpower [1349917.149553] 8206765 pages non-shared
Apr 24 09:01:00 thinpower [1349917.149556] Out of memory: kill process
21692 (ruby) score 58368 or a child
Apr 24 09:01:00 thinpower [1349917.149621] Killed process 21692 (ruby)
Apr 24 09:01:06 thinpower [1349923.491914] postgres invoked
oom-killer: gfp_mask=0x1201d2, order=0, oomkilladj=0
Apr 24 09:01:06 thinpower [1349923.491919] postgres cpuset=/ mems_allowed=0
Apr 24 09:01:06 thinpower [1349923.491922] Pid: 2393, comm: postgres
Not tainted 2.6.29 #5
Apr 24 09:01:06 thinpower [1349923.491924] Call Trace:
Apr 24 09:01:06 thinpower [1349923.491934]  [<ffffffff8106eda9>] ?
cpuset_print_task_mems_allowed+0x99/0x9f
Apr 24 09:01:06 thinpower [1349923.491939]  [<ffffffff81083d28>]
oom_kill_process+0x96/0x246
Apr 24 09:01:06 thinpower [1349923.491942]  [<ffffffff8106dcf0>] ?
cpuset_mems_allowed_intersects+0x1c/0x1e
Apr 24 09:01:06 thinpower [1349923.491944]  [<ffffffff810840a5>] ?
badness+0x1a3/0x1e6
Apr 24 09:01:06 thinpower [1349923.491946]  [<ffffffff8108421c>]
__out_of_memory+0x134/0x14b
Apr 24 09:01:06 thinpower [1349923.491949]  [<ffffffff8108438b>]
out_of_memory+0x158/0x18a
Apr 24 09:01:06 thinpower [1349923.491951]  [<ffffffff81087906>]
__alloc_pages_internal+0x372/0x434
Apr 24 09:01:06 thinpower [1349923.491954]  [<ffffffff810aa490>]
alloc_pages_current+0xb9/0xc2
Apr 24 09:01:06 thinpower [1349923.491957]  [<ffffffff81081a59>]
__page_cache_alloc+0x67/0x6b
Apr 24 09:01:06 thinpower [1349923.491959]  [<ffffffff81089426>]
__do_page_cache_readahead+0x96/0x192
Apr 24 09:01:06 thinpower [1349923.491961]  [<ffffffff81089575>]
do_page_cache_readahead+0x53/0x60
Apr 24 09:01:06 thinpower [1349923.491963]  [<ffffffff81081ec7>]
filemap_fault+0x15e/0x313
Apr 24 09:01:06 thinpower [1349923.491967]  [<ffffffff81094451>]
__do_fault+0x53/0x393
Apr 24 09:01:06 thinpower [1349923.491969]  [<ffffffff81030531>] ?
__wake_up_sync+0x45/0x4e
Apr 24 09:01:06 thinpower [1349923.491972]  [<ffffffff81096702>]
handle_mm_fault+0x36b/0x854
Apr 24 09:01:06 thinpower [1349923.491976]  [<ffffffff813a1acb>] ?
release_sock+0xb0/0xbb
Apr 24 09:01:06 thinpower [1349923.491979]  [<ffffffff81022c8f>] ?
default_spin_lock_flags+0x9/0xe
Apr 24 09:01:06 thinpower [1349923.491984]  [<ffffffff814ac6fe>]
do_page_fault+0x662/0xa5c
Apr 24 09:01:06 thinpower [1349923.491988]  [<ffffffff81400b8e>] ?
inet_sendmsg+0x46/0x53
Apr 24 09:01:06 thinpower [1349923.491992]  [<ffffffff8139ed97>] ?
__sock_sendmsg+0x59/0x62
Apr 24 09:01:06 thinpower [1349923.491994]  [<ffffffff8139f6d8>] ?
sock_sendmsg+0xc7/0xe0
Apr 24 09:01:06 thinpower [1349923.491996]  [<ffffffff81086b5e>] ?
free_pages+0x32/0x36
Apr 24 09:01:06 thinpower [1349923.492000]  [<ffffffff8104e7ec>] ?
autoremove_wake_function+0x0/0x38
Apr 24 09:01:06 thinpower [1349923.492004]  [<ffffffff810c1d76>] ?
core_sys_select+0x1df/0x213
Apr 24 09:01:06 thinpower [1349923.492007]  [<ffffffff810b2cc2>] ?
nameidata_to_filp+0x41/0x52
Apr 24 09:01:06 thinpower [1349923.492009]  [<ffffffff8139f028>] ?
sockfd_lookup_light+0x1b/0x54
Apr 24 09:01:06 thinpower [1349923.492013]  [<ffffffff81011ac8>] ?
read_tsc+0xe/0x24
Apr 24 09:01:06 thinpower [1349923.492017]  [<ffffffff81055990>] ?
getnstimeofday+0x58/0xb4
Apr 24 09:01:06 thinpower [1349923.492019]  [<ffffffff810516ed>] ?
ktime_get_ts+0x49/0x4e
Apr 24 09:01:06 thinpower [1349923.492022]  [<ffffffff810c0d35>] ?
poll_select_copy_remaining+0xc5/0xea
Apr 24 09:01:06 thinpower [1349923.492023]  [<ffffffff810c1ff3>] ?
sys_select+0xa7/0xbc
Apr 24 09:01:06 thinpower [1349923.492026]  [<ffffffff814aa305>]
page_fault+0x25/0x30
Apr 24 09:01:06 thinpower [1349923.492028] Mem-Info:
Apr 24 09:01:06 thinpower [1349923.492029] Node 0 DMA per-cpu:
Apr 24 09:01:06 thinpower [1349923.492031] CPU    0: hi:    0, btch:
1 usd:   0
Apr 24 09:01:06 thinpower [1349923.492032] CPU    1: hi:    0, btch:
1 usd:   0
Apr 24 09:01:06 thinpower [1349923.492034] CPU    2: hi:    0, btch:
1 usd:   0
Apr 24 09:01:06 thinpower [1349923.492035] CPU    3: hi:    0, btch:
1 usd:   0
Apr 24 09:01:06 thinpower [1349923.492036] CPU    4: hi:    0, btch:
1 usd:   0
Apr 24 09:01:06 thinpower [1349923.492037] CPU    5: hi:    0, btch:
1 usd:   0
Apr 24 09:01:06 thinpower [1349923.492039] CPU    6: hi:    0, btch:
1 usd:   0
Apr 24 09:01:06 thinpower [1349923.492040] CPU    7: hi:    0, btch:
1 usd:   0
Apr 24 09:01:06 thinpower [1349923.492041] Node 0 DMA32 per-cpu:
Apr 24 09:01:06 thinpower [1349923.492043] CPU    0: hi:  186, btch:
31 usd:  29
Apr 24 09:01:06 thinpower [1349923.492044] CPU    1: hi:  186, btch:
31 usd:  17
Apr 24 09:01:06 thinpower [1349923.492046] CPU    2: hi:  186, btch:
31 usd: 176
Apr 24 09:01:06 thinpower [1349923.492047] CPU    3: hi:  186, btch:
31 usd:  19
Apr 24 09:01:06 thinpower [1349923.492048] CPU    4: hi:  186, btch:
31 usd:  29
Apr 24 09:01:06 thinpower [1349923.492050] CPU    5: hi:  186, btch:
31 usd:   0
Apr 24 09:01:06 thinpower [1349923.492051] CPU    6: hi:  186, btch:
31 usd:   0
Apr 24 09:01:06 thinpower [1349923.492052] CPU    7: hi:  186, btch:
31 usd:  31
Apr 24 09:01:06 thinpower [1349923.492053] Node 0 Normal per-cpu:
Apr 24 09:01:06 thinpower [1349923.492055] CPU    0: hi:  186, btch:
31 usd:  50
Apr 24 09:01:06 thinpower [1349923.492056] CPU    1: hi:  186, btch:
31 usd:  20
Apr 24 09:01:06 thinpower [1349923.492057] CPU    2: hi:  186, btch:
31 usd:  42
Apr 24 09:01:06 thinpower [1349923.492059] CPU    3: hi:  186, btch:
31 usd: 124
Apr 24 09:01:06 thinpower [1349923.492060] CPU    4: hi:  186, btch:
31 usd: 104
Apr 24 09:01:06 thinpower [1349923.492061] CPU    5: hi:  186, btch:
31 usd:  95
Apr 24 09:01:06 thinpower [1349923.492062] CPU    6: hi:  186, btch:
31 usd: 178
Apr 24 09:01:06 thinpower [1349923.492064] CPU    7: hi:  186, btch:
31 usd:  60
Apr 24 09:01:06 thinpower [1349923.492066] Active_anon:7595417
active_file:641 inactive_anon:551929
Apr 24 09:01:06 thinpower [1349923.492067]  inactive_file:5122
unevictable:0 dirty:0 writeback:0 unstable:0
Apr 24 09:01:06 thinpower [1349923.492068]  free:39401 slab:12443
mapped:1801 pagetables:22566 bounce:0
Apr 24 09:01:06 thinpower [1349923.492069] Node 0 DMA free:15700kB
min:8kB low:8kB high:12kB active_anon:0kB inactive_anon:0kB
active_file:0kB inactive_file:0kB unevictable:0kB present:15156kB
pages_scanned:0 all_unreclaimable? yes
Apr 24 09:01:06 thinpower [1349923.492073] lowmem_reserve[]: 0 3255 32292 32292
Apr 24 09:01:06 thinpower [1349923.492076] Node 0 DMA32 free:118848kB
min:2316kB low:2892kB high:3472kB active_anon:2265276kB
inactive_anon:452944kB active_file:4kB inactive_file:0kB
unevictable:0kB present:3333280kB pages_scanned:2692832
all_unreclaimable? no
Apr 24 09:01:06 thinpower [1349923.492080] lowmem_reserve[]: 0 0 29037 29037
Apr 24 09:01:06 thinpower [1349923.492082] Node 0 Normal free:23056kB
min:20672kB low:25840kB high:31008kB active_anon:28116392kB
inactive_anon:1754772kB active_file:2560kB inactive_file:20488kB
unevictable:0kB present:29734400kB pages_scanned:77696
all_unreclaimable? no
Apr 24 09:01:06 thinpower [1349923.492086] lowmem_reserve[]: 0 0 0 0
Apr 24 09:01:06 thinpower [1349923.492088] Node 0 DMA: 1*4kB 0*8kB
1*16kB 2*32kB 2*64kB 1*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB
3*4096kB = 15700kB
Apr 24 09:01:06 thinpower [1349923.492095] Node 0 DMA32: 172*4kB
134*8kB 66*16kB 18*32kB 18*64kB 7*128kB 21*256kB 15*512kB 6*1024kB
2*2048kB 22*4096kB = 118848kB
Apr 24 09:01:06 thinpower [1349923.492101] Node 0 Normal: 251*4kB
24*8kB 144*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB
1*2048kB 4*4096kB = 22924kB
Apr 24 09:01:06 thinpower [1349923.492107] 12705 total pagecache pages
Apr 24 09:01:06 thinpower [1349923.492108] 0 pages in swap cache
Apr 24 09:01:06 thinpower [1349923.492109] Swap cache stats: add 0,
delete 0, find 0/0
Apr 24 09:01:06 thinpower [1349923.492110] Free swap  = 0kB
Apr 24 09:01:06 thinpower [1349923.492111] Total swap = 0kB
Apr 24 09:01:06 thinpower [1349923.693323] 8388592 pages RAM
Apr 24 09:01:06 thinpower [1349923.693326] 136791 pages reserved
Apr 24 09:01:06 thinpower [1349923.693327] 69633 pages shared
Apr 24 09:01:06 thinpower [1349923.693328] 8205493 pages non-shared
Apr 24 09:01:06 thinpower [1349923.693331] Out of memory: kill process
21490 (apache2) score 53801 or a child
Apr 24 09:01:06 thinpower [1349923.693410] Killed process 21490 (apache2)

> If this is reproducible, I'd recommend enabling
> /proc/sys/vm/oom_dump_tasks so that the oom killer will dump the tasklist
> and show us what may be causing the livelock.

Ok, how do I enable that? I will google for it.

If we can reproduce it, I will let you know.

>> Before that we had this:
>>
>> Apr 24 08:59:24 thinpower [1349806.982419] 6954 total pagecache pages
>> Apr 24 08:59:24 thinpower [1349806.982420] 0 pages in swap cache
>> Apr 24 08:59:24 thinpower [1349806.982422] Swap cache stats: add 0,
>> delete 0, find 0/0
>> Apr 24 08:59:24 thinpower [1349806.982423] Free swap  = 0kB
>> Apr 24 08:59:24 thinpower [1349806.982423] Total swap = 0kB
>> Apr 24 08:59:24 thinpower [1349807.182725] 8388592 pages RAM
>> Apr 24 08:59:24 thinpower [1349807.182728] 136791 pages reserved
>> Apr 24 08:59:24 thinpower [1349807.182729] 35053 pages shared
>> Apr 24 08:59:24 thinpower [1349807.182729] 8207346 pages non-shared
>> Apr 24 08:59:24 thinpower [1349807.182732] Out of memory: kill process
>> 21592 (ruby) score 109489 or a child
>> Apr 24 08:59:24 thinpower [1349807.182795] Killed process 21592 (ruby)
>> Apr 24 09:00:59 thinpower [1349916.949909] apache2 invoked oom-killer:
>> gfp_mask=0x1201d2, order=0, oomkilladj=0
>
> apache2 in this case (and postgres later) are valid oom killer targets
> because their /proc/pid/oom_adj score is 0.  The only reason they wouldn't
> be getting killed subsequently is that the already killed task, ruby,
> is PF_EXITING and has failed to exit.  It has access to memory reserves
> and if it hangs before dying, the machine will livelock.

Thanks for the info. We will try to chew on this.

Best
Zeno


* Re: Kernel 2.6.29 runs out of memory and hangs.
  2009-04-24 17:17 ` David Rientjes
  2009-04-24 18:13   ` Zeno Davatz
@ 2009-04-24 18:24   ` Zeno Davatz
  1 sibling, 0 replies; 17+ messages in thread
From: Zeno Davatz @ 2009-04-24 18:24 UTC (permalink / raw)
  To: David Rientjes, linux-kernel; +Cc: Hannes Wyss

On Fri, Apr 24, 2009 at 7:17 PM, David Rientjes <rientjes@google.com> wrote:
> On Fri, 24 Apr 2009, Zeno Davatz wrote:
>
>> Dear All
>>
>> Our kernel version:
>>
>> Linux thinpower 2.6.29 #5 SMP Sun Apr 5 20:04:08 UTC 2009 x86_64
>> Intel(R) Xeon(R) CPU E5450 @ 3.00GHz GenuineIntel GNU/Linux
>>
>> This morning our server ran out of memory and had to be hard rebooted.
>> The system has 32 GB of memory and 4 quad-core Xeon CPUs.
>> /var/log/kernel/current tells us the following:
>>
>> Apr 24 09:01:06 thinpower [1349923.491914] postgres invoked
>> oom-killer: gfp_mask=0x1201d2, order=0, oomkilladj=0
>> Apr 24 09:01:06 thinpower [1349923.491919] postgres cpuset=/ mems_allowed=0
>> Apr 24 09:01:06 thinpower [1349923.491922] Pid: 2393, comm: postgres
>> Not tainted 2.6.29 #5
>> Apr 24 09:01:06 thinpower [1349923.491924] Call Trace:
>> Apr 24 09:01:06 thinpower [1349923.491934]  [<ffffffff8106eda9>] ?
>> cpuset_print_task_mems_allowed+0x99/0x9f
>> Apr 24 09:01:06 thinpower [1349923.491939]  [<ffffffff81083d28>]
>> oom_kill_process+0x96/0x246
>> Apr 24 09:01:06 thinpower [1349923.491942]  [<ffffffff8106dcf0>] ?
>> cpuset_mems_allowed_intersects+0x1c/0x1e
>> Apr 24 09:01:06 thinpower [1349923.491944]  [<ffffffff810840a5>] ?
>> badness+0x1a3/0x1e6
>> Apr 24 09:01:06 thinpower [1349923.491946]  [<ffffffff8108421c>]
>> __out_of_memory+0x134/0x14b
>> Apr 24 09:01:06 thinpower [1349923.491949]  [<ffffffff8108438b>]
>> out_of_memory+0x158/0x18a
>> Apr 24 09:01:06 thinpower [1349923.491951]  [<ffffffff81087906>]
>> __alloc_pages_internal+0x372/0x434
>> Apr 24 09:01:06 thinpower [1349923.491954]  [<ffffffff810aa490>]
>> alloc_pages_current+0xb9/0xc2
>> Apr 24 09:01:06 thinpower [1349923.491957]  [<ffffffff81081a59>]
>> __page_cache_alloc+0x67/0x6b
>> Apr 24 09:01:06 thinpower [1349923.491959]  [<ffffffff81089426>]
>> __do_page_cache_readahead+0x96/0x192
>> Apr 24 09:01:06 thinpower [1349923.491961]  [<ffffffff81089575>]
>> do_page_cache_readahead+0x53/0x60
>> Apr 24 09:01:06 thinpower [1349923.491963]  [<ffffffff81081ec7>]
>> filemap_fault+0x15e/0x313
>> Apr 24 09:01:06 thinpower [1349923.491967]  [<ffffffff81094451>]
>> __do_fault+0x53/0x393
>> Apr 24 09:01:06 thinpower [1349923.491969]  [<ffffffff81030531>] ?
>> __wake_up_sync+0x45/0x4e
>> Apr 24 09:01:06 thinpower [1349923.491972]  [<ffffffff81096702>]
>> handle_mm_fault+0x36b/0x854
>> Apr 24 09:01:06 thinpower [1349923.491976]  [<ffffffff813a1acb>] ?
>> release_sock+0xb0/0xbb
>> Apr 24 09:01:06 thinpower [1349923.491979]  [<ffffffff81022c8f>] ?
>> default_spin_lock_flags+0x9/0xe
>> Apr 24 09:01:06 thinpower [1349923.491984]  [<ffffffff814ac6fe>]
>> do_page_fault+0x662/0xa5c
>> Apr 24 09:01:06 thinpower [1349923.491988]  [<ffffffff81400b8e>] ?
>> inet_sendmsg+0x46/0x53
>> Apr 24 09:01:06 thinpower [1349923.491992]  [<ffffffff8139ed97>] ?
>> __sock_sendmsg+0x59/0x62
>> Apr 24 09:01:06 thinpower [1349923.491994]  [<ffffffff8139f6d8>] ?
>> sock_sendmsg+0xc7/0xe0
>> Apr 24 09:01:06 thinpower [1349923.491996]  [<ffffffff81086b5e>] ?
>> free_pages+0x32/0x36
>> Apr 24 09:01:06 thinpower [1349923.492000]  [<ffffffff8104e7ec>] ?
>> autoremove_wake_function+0x0/0x38
>> Apr 24 09:01:06 thinpower [1349923.492004]  [<ffffffff810c1d76>] ?
>> core_sys_select+0x1df/0x213
>> Apr 24 09:01:06 thinpower [1349923.492007]  [<ffffffff810b2cc2>] ?
>> nameidata_to_filp+0x41/0x52
>> Apr 24 09:01:06 thinpower [1349923.492009]  [<ffffffff8139f028>] ?
>> sockfd_lookup_light+0x1b/0x54
>> Apr 24 09:01:06 thinpower [1349923.492013]  [<ffffffff81011ac8>] ?
>> read_tsc+0xe/0x24
>> Apr 24 09:01:06 thinpower [1349923.492017]  [<ffffffff81055990>] ?
>> getnstimeofday+0x58/0xb4
>> Apr 24 09:01:06 thinpower [1349923.492019]  [<ffffffff810516ed>] ?
>> ktime_get_ts+0x49/0x4e
>> Apr 24 09:01:06 thinpower [1349923.492022]  [<ffffffff810c0d35>] ?
>> poll_select_copy_remaining+0xc5/0xea
>> Apr 24 09:01:06 thinpower [1349923.492023]  [<ffffffff810c1ff3>] ?
>> sys_select+0xa7/0xbc
>> Apr 24 09:01:06 thinpower [1349923.492026]  [<ffffffff814aa305>]
>> page_fault+0x25/0x30
>>
>
> Are these the last messages in the dmesg?  There should be subsequent
> information that describes the current state of memory following the stack
> trace.
>
> If this is reproducible, I'd recommend enabling
> /proc/sys/vm/oom_dump_tasks so that the oom killer will dump the tasklist
> and show us what may be causing the livelock.

Ok, I believe I enabled it with

echo 1 > /proc/sys/vm/oom_dump_tasks

Where would I find the output if it livelocks again?

Thank you for your feedback.

Best
Zeno


* Re: Kernel 2.6.29 runs out of memory and hangs.
  2009-04-24 18:13   ` Zeno Davatz
@ 2009-04-24 19:00     ` David Rientjes
  2009-04-24 19:16       ` Zeno Davatz
  0 siblings, 1 reply; 17+ messages in thread
From: David Rientjes @ 2009-04-24 19:00 UTC (permalink / raw)
  To: Zeno Davatz; +Cc: linux-kernel, Hannes Wyss

On Fri, 24 Apr 2009, Zeno Davatz wrote:

> Apr 24 09:01:06 thinpower [1349923.693331] Out of memory: kill process
> 21490 (apache2) score 53801 or a child
> Apr 24 09:01:06 thinpower [1349923.693410] Killed process 21490 (apache2)
> 

If your machine hangs here, then it's most likely because apache2 is 
getting stuck in D state and cannot exit (and it has access to memory 
reserves because of TIF_MEMDIE since it has been oom killed, so it may 
deplete all memory).

I'm assuming that you're describing a machine hang as the inability to 
ping it or ssh into it, not simply your apache server dying.

These types of livelocks are possible with the oom killer when a task
fails to exit.  One possible way to fix that is to introduce an oom killer
timeout such that if a task fails to exit for a pre-defined period of
time, the oom killer will choose to kill another task in the hopes of
future memory freeing.  The problem with that approach, however, is that
the hung task can consume an enormous amount of memory that will never be
freed.

> > If this is reproducible, I'd recommend enabling
> > /proc/sys/vm/oom_dump_tasks so that the oom killer will dump the tasklist
> > and show us what may be causing the livelock.
> 
> Ok, how do I enable that? I will google for it.
> 

You're right in your reply; you can enable it with

	echo 1 > /proc/sys/vm/oom_dump_tasks

This will print the tasklist and some pertinent information alongside the 
oom killer output you've already posted.  It will give a better idea of 
the memory usage on the machine and if killing a subsequent task would 
actually help in this case.
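
If you want the setting to persist across reboots, you could also add
something like the following to /etc/sysctl.conf (assuming your
distribution applies it at boot):

	vm.oom_dump_tasks = 1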


* Re: Kernel 2.6.29 runs out of memory and hangs.
  2009-04-24 19:00     ` David Rientjes
@ 2009-04-24 19:16       ` Zeno Davatz
  2009-04-24 19:26         ` David Rientjes
  0 siblings, 1 reply; 17+ messages in thread
From: Zeno Davatz @ 2009-04-24 19:16 UTC (permalink / raw)
  To: David Rientjes; +Cc: linux-kernel, Hannes Wyss

Dear David

On Fri, Apr 24, 2009 at 9:00 PM, David Rientjes <rientjes@google.com> wrote:
> On Fri, 24 Apr 2009, Zeno Davatz wrote:
>
>> Apr 24 09:01:06 thinpower [1349923.693331] Out of memory: kill process
>> 21490 (apache2) score 53801 or a child
>> Apr 24 09:01:06 thinpower [1349923.693410] Killed process 21490 (apache2)
>>
>
> If your machine hangs here, then it's most likely because apache2 is
> getting stuck in D state and cannot exit (and it has access to memory
> reserves because of TIF_MEMDIE since it has been oom killed, so it may
> deplete all memory).
>
> I'm assuming that you're describing a machine hang as the inability to
> ping it or ssh into it, not simply your apache server dying.

Yes, correct. I could neither SSH into the machine nor could I type
anything on the screen after I connected the monitor and the keyboard
directly to the machine.

> These types of livelocks are possible with the oom killer when a task
> fails to exit.  One possible way to fix that is to introduce an oom killer
> timeout such that if a task fails to exit for a pre-defined period of
> time, the oom killer will choose to kill another task in the hopes of
> future memory freeing.  The problem with that approach, however, is that
> the hung task can consume an enormous amount of memory that will never be
> freed.

Thanks for the hint! Is there another solution as well? Any
kernel upgrades in the pipeline? What does Linus think about this?

>> > If this is reproducible, I'd recommend enabling
>> > /proc/sys/vm/oom_dump_tasks so that the oom killer will dump the tasklist
>> > and show us what may be causing the livelock.
>>
>> Ok, how do I enable that? I will google for it.
>>
>
> You're right in your reply, you can enable it with
>
>        echo 1 > /proc/sys/vm/oom_dump_tasks
>
> This will print the tasklist and some pertinent information alongside the
> oom killer output you've already posted.  It will give a better idea of
> the memory usage on the machine and if killing a subsequent task would
> actually help in this case.

Ok, done that. I'm not exactly looking forward to it hanging again, but
if it does we should catch some fish. ;)

Best
Zeno


* Re: Kernel 2.6.29 runs out of memory and hangs.
  2009-04-24 19:16       ` Zeno Davatz
@ 2009-04-24 19:26         ` David Rientjes
  2009-04-25  8:32           ` Zeno Davatz
  0 siblings, 1 reply; 17+ messages in thread
From: David Rientjes @ 2009-04-24 19:26 UTC (permalink / raw)
  To: Zeno Davatz; +Cc: linux-kernel, Hannes Wyss

On Fri, 24 Apr 2009, Zeno Davatz wrote:

> > These types of livelocks are possible with the oom killer when a task
> > fails to exit.  One possible way to fix that is to introduce an oom killer
> > timeout such that if a task fails to exit for a pre-defined period of
> > time, the oom killer will choose to kill another task in the hopes of
> > future memory freeing.  The problem with that approach, however, is that
> > the hung task can consume an enormous amount of memory that will never be
> > freed.
> 
> Thanks for the hint! Is there another solution as well? Any
> kernel upgrades in the pipeline? What does Linus think about this?
> 

I had a patchset that added a timeout for oom killed tasks now that the 
entire oom killer is serialized, but it only really works well if your 
memory is so partitioned (such as with cpusets or the memory controller) 
that it actually makes sense to continue.  With global system-wide ooms 
such as yours, it would only result in a long chain of kills with no 
positive outcome.


* Re: Kernel 2.6.29 runs out of memory and hangs.
  2009-04-24 19:26         ` David Rientjes
@ 2009-04-25  8:32           ` Zeno Davatz
  2009-04-26 19:55             ` David Rientjes
  0 siblings, 1 reply; 17+ messages in thread
From: Zeno Davatz @ 2009-04-25  8:32 UTC (permalink / raw)
  To: David Rientjes; +Cc: linux-kernel, Hannes Wyss

Dear David

On Fri, Apr 24, 2009 at 9:26 PM, David Rientjes <rientjes@google.com> wrote:
> On Fri, 24 Apr 2009, Zeno Davatz wrote:
>
>> > These types of livelocks are possible with the oom killer when a task
>> > fails to exit.  One possible way to fix that is to introduce an oom killer
>> > timeout such that if a task fails to exit for a pre-defined period of
>> > time, the oom killer will choose to kill another task in the hopes of
>> > future memory freeing.  The problem with that approach, however, is that
>> > the hung task can consume an enormous amount of memory that will never be
>> > freed.
>>
>> Thanks for the hint! Is there another solution as well? Any
>> kernel upgrades in the pipeline? What does Linus think about this?
>>
>
> I had a patchset that added a timeout for oom killed tasks now that the
> entire oom killer is serialized, but it only really works well if your
> memory is so partitioned (such as with cpusets or the memory controller)
> that it actually makes sense to continue.  With global system-wide ooms
> such as yours, it would only result in a long chain of kills with no
> positive outcome.

Couldn't you start killing processes at an earlier stage? It seems to
me that the point at which processes get killed by the oom killer is a
bit late in the game, when the machine is already very close to
freezing. We saw these oom_kill messages in our /var/log/kernel/current
several times, spread out over quite some time.

Of course we have now also added a roadblock to our main application
(http://ch.oddb.org) that kills it if it uses more than 10 GB of
memory.
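
(The roadblock itself lives inside the application; just to illustrate
the idea, an external watchdog could look something like the following,
with the process name and the limit only placeholders:)

	# check every minute; kill the app if its total RSS exceeds 10 GB
	while sleep 60; do
	    rss_kb=$(ps -o rss= -C ruby | awk '{s+=$1} END {print s+0}')
	    [ "$rss_kb" -gt 10485760 ] && pkill -9 ruby
	done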

Best
Zeno


* Re: Kernel 2.6.29 runs out of memory and hangs.
  2009-04-25  8:32           ` Zeno Davatz
@ 2009-04-26 19:55             ` David Rientjes
  2009-04-27  6:33               ` Zeno Davatz
  2009-04-27 13:35               ` Greg Stark
  0 siblings, 2 replies; 17+ messages in thread
From: David Rientjes @ 2009-04-26 19:55 UTC (permalink / raw)
  To: Zeno Davatz; +Cc: linux-kernel, Hannes Wyss

On Sat, 25 Apr 2009, Zeno Davatz wrote:

> Couldn't you start killing processes at an earlier stage? It seems to
> me that the point at which processes get killed by the oom killer is a
> bit late in the game, when the machine is already very close to
> freezing. We saw these oom_kill messages in our /var/log/kernel/current
> several times, spread out over quite some time.
> 

The oom killer serves a very well-defined purpose: to kill a 
memory-hogging task when your system (or cpuset, memory controller) is out 
of memory.  When it kills a task, it also allows access to memory reserves 
that allow it to allocate beyond each allowable zone's min watermark.  
This is intended to allow the task to quickly exit.

There is work being done on a per-cgroup lowmem notifier which would allow 
userspace to poll() on a device file and be notified of low memory 
situations so that it can react on its own (like killing a low priority 
task, expanding a cpuset, increasing a memcontroller limit, etc.)


* Re: Kernel 2.6.29 runs out of memory and hangs.
  2009-04-26 19:55             ` David Rientjes
@ 2009-04-27  6:33               ` Zeno Davatz
  2009-04-27  8:32                 ` David Rientjes
  2009-04-27 13:35               ` Greg Stark
  1 sibling, 1 reply; 17+ messages in thread
From: Zeno Davatz @ 2009-04-27  6:33 UTC (permalink / raw)
  To: David Rientjes; +Cc: linux-kernel, Hannes Wyss

Dear David

On Sun, Apr 26, 2009 at 9:55 PM, David Rientjes <rientjes@google.com> wrote:
> On Sat, 25 Apr 2009, Zeno Davatz wrote:
>
>> Couldn't you start killing processes at an earlier stage? It seems to
>> me that the point at which processes get killed by the oom killer is a
>> bit late in the game, when the machine is already very close to
>> freezing. We saw these oom_kill messages in our /var/log/kernel/current
>> several times, spread out over quite some time.
>>
>
> The oom killer serves a very well-defined purpose: to kill a
> memory-hogging task when your system (or cpuset, memory controller) is out
> of memory.

Maybe sometimes too late in the process, when the system _is_ already
out of memory ;) - because one task just uses 99% of the memory.

> When it kills a task, it also allows access to memory reserves
> that allow it to allocate beyond each allowable zone's min watermark.
> This is intended to allow the task to quickly exit.

Can you recommend a link where I can read up on this issue? It seems
to me that when the system is already out of memory it no longer has
enough "memory" left to free up fresh memory, especially if there is
one application that uses about 99% of the memory. I would suggest that
in such a case it should just kill the task that uses the most memory.

> There is work being done on a per-cgroup lowmem notifier which would allow
> userspace to poll() on a device file and be notified of low memory
> situations so that it can react on its own (like killing a low priority
> task, expanding a cpuset, increasing a memcontroller limit, etc.)

This sounds very interesting. Would this work together with udev? How
would it go about killing a task like "Apache" or "Ruby" or any other
application that is running on the Linux machine?

Best
Zeno


* Re: Kernel 2.6.29 runs out of memory and hangs.
  2009-04-27  6:33               ` Zeno Davatz
@ 2009-04-27  8:32                 ` David Rientjes
  2009-04-27  8:53                   ` Zeno Davatz
  0 siblings, 1 reply; 17+ messages in thread
From: David Rientjes @ 2009-04-27  8:32 UTC (permalink / raw)
  To: Zeno Davatz; +Cc: linux-kernel, Hannes Wyss

On Mon, 27 Apr 2009, Zeno Davatz wrote:

> Can you recommend a link where I can read up on this issue? It seems
> to me that when the system is already out of memory it no longer has
> enough "memory" left to free up fresh memory, especially if there is
> one application that uses about 99% of the memory. I would suggest that
> in such a case it should just kill the task that uses the most memory.
> 

All allowable zones have minimum watermarks that define how much memory 
an oom killed task may use so that it can quickly exit.  This is the 
purpose of the TIF_MEMDIE flag in the kernel.

> This sounds very interesting. Would this work together with udev? How
> would it go about killing a task like "Apache" or "Ruby" or any other
> application that is running on the Linux machine?
> 

The policy is better defined in userspace to determine what memory-hogging 
task should be killed in low memory situations.  Users can already tune 
the /proc/pid/oom_adj score of any task so that the oom killer can prefer 
it over other tasks (see Documentation/filesystems/proc.txt).  For low 
memory situations, however, userspace can act in any number of ways to 
prevent reaching an oom condition.
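
For example, with the oom_adj interface you can make a low priority task
the preferred victim, or exempt a critical one entirely (substitute a
real pid; on this kernel -17 means OOM_DISABLE):

	echo 15 > /proc/<pid>/oom_adj    # strongly prefer killing this task
	echo -17 > /proc/<pid>/oom_adj   # never oom kill this task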

Once the oom condition has been reached, however, the oom killer must act 
to free some memory, otherwise the kernel usually livelocks.


* Re: Kernel 2.6.29 runs out of memory and hangs.
  2009-04-27  8:32                 ` David Rientjes
@ 2009-04-27  8:53                   ` Zeno Davatz
  2009-04-27 12:03                     ` Zeno Davatz
  0 siblings, 1 reply; 17+ messages in thread
From: Zeno Davatz @ 2009-04-27  8:53 UTC (permalink / raw)
  To: David Rientjes; +Cc: linux-kernel, Hannes Wyss

Dear David

On Mon, Apr 27, 2009 at 10:32 AM, David Rientjes <rientjes@google.com> wrote:

> All allowable zones have minimum watermarks that define how much memory
> an oom killed task may use so that it can quickly exit.  This is the
> purpose of the TIF_MEMDIE flag in the kernel.

Thanks for the info.

> The policy is better defined in userspace to determine what memory-hogging
> task should be killed in low memory situations.  Users can already tune
> the /proc/pid/oom_adj score of any task so that the oom killer can prefer
> it over other tasks (see Documentation/filesystems/proc.txt).  For low
> memory situations, however, userspace can act in any number of ways to
> prevent reaching an oom condition.

This may be off-topic, but: can I assign a PID to a process at
startup? Say Apache should always have PID 2755, or Ruby should always
have PID 1003. Otherwise how would I set the score for a PID at kernel
boot-up?

> Once the oom condition has been reached, however, the oom killer must act
> to free some memory, otherwise the kernel usually livelocks.

Interesting. In our case we had not set a score for Apache in

/proc/pid/oom_adj

and that must have caused the livelock, because oom_kill could not kill
the biggest task (Apache) in full.

Best
Zeno


* Re: Kernel 2.6.29 runs out of memory and hangs.
  2009-04-27  8:53                   ` Zeno Davatz
@ 2009-04-27 12:03                     ` Zeno Davatz
  0 siblings, 0 replies; 17+ messages in thread
From: Zeno Davatz @ 2009-04-27 12:03 UTC (permalink / raw)
  To: David Rientjes; +Cc: linux-kernel, Hannes Wyss

Dear David

Thanks for all your help and input. It was very helpful and useful.

On Mon, Apr 27, 2009 at 10:53 AM, Zeno Davatz <zdavatz@gmail.com> wrote:
> On Mon, Apr 27, 2009 at 10:32 AM, David Rientjes <rientjes@google.com> wrote:
>
>> All allowable zones have minimum watermarks that define how much memory
>> an oom killed task may use so that it can quickly exit.  This is the
>> purpose of the TIF_MEMDIE flag in the kernel.
>
> Thanks for the info.
>
>> The policy is better defined in userspace to determine what memory-hogging
>> task should be killed in low memory situations.  Users can already tune
>> the /proc/pid/oom_adj score of any task so that the oom killer can prefer
>> it over other tasks (see Documentation/filesystems/proc.txt).  For low
>> memory situations, however, userspace can act in any number of ways to
>> prevent reaching an oom condition.
>
> This may be off-topic, but: can I assign a PID to a process at
> startup? Say Apache should always have PID 2755, or Ruby should always
> have PID 1003. Otherwise how would I set the score for a PID at kernel
> boot-up?

Ok, we solved this issue with the following commit:

http://scm.ywesee.com/?p=oddb.org;a=blobdiff;f=bin/oddbd;h=655b4c235594159d3fe4290696ecc54580188c7c;hp=9cb744b4affe30beee935a495223015d9ce090ce;hb=6d38f6839f746bcf2dc701528e343170af8f47be;hpb=a8b4f8c43200a0035df9f5002a5ea9ec8133e99e
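
The key point is that oom_adj is inherited across fork and exec, so a
start script can set the score for whatever it launches without needing
a fixed PID. A minimal sketch of that pattern (the path is hypothetical
and this is not necessarily what the commit above does; lowering
oom_adj requires root):

	#!/bin/sh
	# set our own adjustment; it survives the exec below
	echo -17 > /proc/self/oom_adj
	exec /path/to/oddbd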

Best
Zeno


* Re: Kernel 2.6.29 runs out of memory and hangs.
  2009-04-26 19:55             ` David Rientjes
  2009-04-27  6:33               ` Zeno Davatz
@ 2009-04-27 13:35               ` Greg Stark
  2009-04-27 13:39                 ` Zeno Davatz
  2009-04-27 14:48                 ` Zeno Davatz
  1 sibling, 2 replies; 17+ messages in thread
From: Greg Stark @ 2009-04-27 13:35 UTC (permalink / raw)
  To: David Rientjes; +Cc: Zeno Davatz, linux-kernel, Hannes Wyss

On Sun, Apr 26, 2009 at 8:55 PM, David Rientjes <rientjes@google.com> wrote:
> There is work being done on a per-cgroup lowmem notifier which would allow
> userspace to poll() on a device file and be notified of low memory
> situations so that it can react on its own (like killing a low priority
> task, expanding a cpuset, increasing a memcontroller limit, etc.)

Surely anything that important ought to be running with overcommit
disabled anyways.


* Re: Kernel 2.6.29 runs out of memory and hangs.
  2009-04-27 13:35               ` Greg Stark
@ 2009-04-27 13:39                 ` Zeno Davatz
  2009-04-27 14:07                   ` Zeno Davatz
  2009-04-27 14:48                 ` Zeno Davatz
  1 sibling, 1 reply; 17+ messages in thread
From: Zeno Davatz @ 2009-04-27 13:39 UTC (permalink / raw)
  To: Greg Stark; +Cc: David Rientjes, linux-kernel, Hannes Wyss

Dear Greg

On Mon, Apr 27, 2009 at 3:35 PM, Greg Stark <gsstark@mit.edu> wrote:
> On Sun, Apr 26, 2009 at 8:55 PM, David Rientjes <rientjes@google.com> wrote:
>> There is work being done on a per-cgroup lowmem notifier which would allow
>> userspace to poll() on a device file and be notified of low memory
>> situations so that it can react on its own (like killing a low priority
>> task, expanding a cpuset, increasing a memcontroller limit, etc.)
>
> Surely anything that important ought to be running with overcommit
> disabled anyways.

Where do I enable or disable overcommit?

Best
Zeno


* Re: Kernel 2.6.29 runs out of memory and hangs.
  2009-04-27 13:39                 ` Zeno Davatz
@ 2009-04-27 14:07                   ` Zeno Davatz
  0 siblings, 0 replies; 17+ messages in thread
From: Zeno Davatz @ 2009-04-27 14:07 UTC (permalink / raw)
  To: Greg Stark; +Cc: David Rientjes, linux-kernel, Hannes Wyss

On Mon, Apr 27, 2009 at 3:39 PM, Zeno Davatz <zdavatz@gmail.com> wrote:
> On Mon, Apr 27, 2009 at 3:35 PM, Greg Stark <gsstark@mit.edu> wrote:
>> On Sun, Apr 26, 2009 at 8:55 PM, David Rientjes <rientjes@google.com> wrote:
>>> There is work being done on a per-cgroup lowmem notifier which would allow
>>> userspace to poll() on a device file and be notified of low memory
>>> situations so that it can react on its own (like killing a low priority
>>> task, expanding a cpuset, increasing a memcontroller limit, etc.)
>>
>> Surely anything that important ought to be running with overcommit
>> disabled anyways.
>
> Where do I enable or disable overcommit?

Ok, I found this one as well:

Documentation/vm/overcommit-accounting

and

http://www.linuxdevcenter.com/pub/a/linux/2006/11/30/linux-out-of-memory.html?page=2

is a useful article for the beginner.
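
For reference, the three modes described there are: 0 = heuristic
overcommit (the default), 1 = always overcommit, and 2 = strict
accounting, where the commit limit is swap plus overcommit_ratio
percent of RAM. The current mode can be checked with:

	cat /proc/sys/vm/overcommit_memory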

Best
Zeno


* Re: Kernel 2.6.29 runs out of memory and hangs.
  2009-04-27 13:35               ` Greg Stark
  2009-04-27 13:39                 ` Zeno Davatz
@ 2009-04-27 14:48                 ` Zeno Davatz
  1 sibling, 0 replies; 17+ messages in thread
From: Zeno Davatz @ 2009-04-27 14:48 UTC (permalink / raw)
  To: Greg Stark; +Cc: David Rientjes, linux-kernel, Hannes Wyss

On Mon, Apr 27, 2009 at 3:35 PM, Greg Stark <gsstark@mit.edu> wrote:
> On Sun, Apr 26, 2009 at 8:55 PM, David Rientjes <rientjes@google.com> wrote:
>> There is work being done on a per-cgroup lowmem notifier which would allow
>> userspace to poll() on a device file and be notified of low memory
>> situations so that it can react on its own (like killing a low priority
>> task, expanding a cpuset, increasing a memcontroller limit, etc.)
>
> Surely anything that important ought to be running with overcommit
> disabled anyways.

Ok, I tried this suggestion as well. I did:

echo 2 > /proc/sys/vm/overcommit_memory

echo 0 > /proc/sys/vm/overcommit_ratio (Note: our system does not have
any swap, just 32 GB of memory).

Then I tried to run "htop" and got

bash: fork: Cannot allocate memory

I also tried it with

echo 1 > /proc/sys/vm/overcommit_ratio

Same result.
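
That is expected with these settings: in mode 2 the commit limit is
swap plus overcommit_ratio percent of RAM, so with no swap and a ratio
of 0 (or 1) almost nothing can be committed and fork fails immediately.
The effective limit shows up in /proc/meminfo:

	grep -E 'CommitLimit|Committed_AS' /proc/meminfo

On a swapless 32 GB box a ratio much closer to 100 would presumably be
needed.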

Best
Zeno

