* Machine lockups on extreme memory pressure
From: Shakeel Butt @ 2020-09-21 18:35 UTC
  To: Johannes Weiner, Michal Hocko, Linux MM, Andrew Morton,
	Roman Gushchin, LKML, Greg Thelen

Hi all,

We are seeing machine lockups due to extreme memory pressure where the
free pages in all zones are way below the min watermarks. The stack
of a stuck CPU looks like the following (I had to crash the machine to
get the info).

 #0 [ ] crash_nmi_callback
 #1 [ ] nmi_handle
 #2 [ ] default_do_nmi
 #3 [ ] do_nmi
 #4 [ ] end_repeat_nmi
--- <NMI exception stack> ---
 #5 [ ] queued_spin_lock_slowpath
 #6 [ ] _raw_spin_lock
 #7 [ ] ____cache_alloc_node
 #8 [ ] fallback_alloc
 #9 [ ] __kmalloc_node_track_caller
#10 [ ] __alloc_skb
#11 [ ] tcp_send_ack
#12 [ ] tcp_delack_timer
#13 [ ] run_timer_softirq
#14 [ ] irq_exit
#15 [ ] smp_apic_timer_interrupt
#16 [ ] apic_timer_interrupt
--- <IRQ stack> ---
#17 [ ] apic_timer_interrupt
#18 [ ] _raw_spin_lock
#19 [ ] vmpressure
#20 [ ] shrink_node
#21 [ ] do_try_to_free_pages
#22 [ ] try_to_free_pages
#23 [ ] __alloc_pages_direct_reclaim
#24 [ ] __alloc_pages_nodemask
#25 [ ] cache_grow_begin
#26 [ ] fallback_alloc
#27 [ ] __kmalloc_node_track_caller
#28 [ ] __alloc_skb
#29 [ ] tcp_sendmsg_locked
#30 [ ] tcp_sendmsg
#31 [ ] inet6_sendmsg
#32 [ ] ___sys_sendmsg
#33 [ ] sys_sendmsg
#34 [ ] do_syscall_64

These are high traffic machines. Almost all the CPUs are stuck on the
root memcg's vmpressure sr_lock and almost half of the CPUs are stuck
on the kmem cache node's list_lock in IRQ context. Note that the
vmpressure sr_lock is irq-unsafe. A couple of months back, we observed
a similar situation with the swap locks, which forced us to disable swap
under global pressure. Since we do proactive reclaim, disabling swap on
global reclaim was not an issue. However, we have now started seeing the
same situation with other irq-unsafe locks like the vmpressure sr_lock,
and almost all the slab shrinkers have irq-unsafe spinlocks. One way to
mitigate this is to convert all such locks (which can be taken in the
reclaim path) to be irq-safe, but that does not seem like a maintainable
solution.
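
To make that last point concrete, by "irq-safe" I mean something like the
following minimal sketch (made-up struct and function names, not the
actual mm/vmpressure.c code):

#include <linux/spinlock.h>

/* Hypothetical reclaim-path state protected by a spinlock. */
struct pressure_state {
	spinlock_t lock;
	unsigned long scanned;
	unsigned long reclaimed;
};

/* Today: irq-unsafe, so an interrupt can arrive while the lock is held. */
static void pressure_account(struct pressure_state *ps,
			     unsigned long scanned, unsigned long reclaimed)
{
	spin_lock(&ps->lock);
	ps->scanned += scanned;
	ps->reclaimed += reclaimed;
	spin_unlock(&ps->lock);
}

/* Converted: local interrupts stay disabled while the lock is held. */
static void pressure_account_irqsafe(struct pressure_state *ps,
				     unsigned long scanned,
				     unsigned long reclaimed)
{
	unsigned long flags;

	spin_lock_irqsave(&ps->lock, flags);
	ps->scanned += scanned;
	ps->reclaimed += reclaimed;
	spin_unlock_irqrestore(&ps->lock, flags);
}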

Please note that we are running a user space oom-killer which is more
aggressive than oomd/PSI, but even that got stuck under this much memory
pressure.

I am wondering if anyone else has seen a similar situation in production
and if there is a recommended way to resolve this situation.

thanks,
Shakeel


* Re: Machine lockups on extreme memory pressure
From: Michal Hocko @ 2020-09-22 11:12 UTC
  To: Shakeel Butt
  Cc: Johannes Weiner, Linux MM, Andrew Morton, Roman Gushchin, LKML,
	Greg Thelen

On Mon 21-09-20 11:35:35, Shakeel Butt wrote:
> Hi all,
> 
> We are seeing machine lockups due to extreme memory pressure where the
> free pages in all zones are way below the min watermarks. The stack
> of a stuck CPU looks like the following (I had to crash the machine to
> get the info).

sysrq+l didn't report anything?
 
>  #0 [ ] crash_nmi_callback
>  #1 [ ] nmi_handle
>  #2 [ ] default_do_nmi
>  #3 [ ] do_nmi
>  #4 [ ] end_repeat_nmi
> --- <NMI exception stack> ---
>  #5 [ ] queued_spin_lock_slowpath
>  #6 [ ] _raw_spin_lock
>  #7 [ ] ____cache_alloc_node
>  #8 [ ] fallback_alloc
>  #9 [ ] __kmalloc_node_track_caller
> #10 [ ] __alloc_skb
> #11 [ ] tcp_send_ack
> #12 [ ] tcp_delack_timer
> #13 [ ] run_timer_softirq
> #14 [ ] irq_exit
> #15 [ ] smp_apic_timer_interrupt
> #16 [ ] apic_timer_interrupt
> --- <IRQ stack> ---
> #17 [ ] apic_timer_interrupt
> #18 [ ] _raw_spin_lock
> #19 [ ] vmpressure
> #20 [ ] shrink_node
> #21 [ ] do_try_to_free_pages
> #22 [ ] try_to_free_pages
> #23 [ ] __alloc_pages_direct_reclaim
> #24 [ ] __alloc_pages_nodemask
> #25 [ ] cache_grow_begin
> #26 [ ] fallback_alloc
> #27 [ ] __kmalloc_node_track_caller
> #28 [ ] __alloc_skb
> #29 [ ] tcp_sendmsg_locked
> #30 [ ] tcp_sendmsg
> #31 [ ] inet6_sendmsg
> #32 [ ] ___sys_sendmsg
> #33 [ ] sys_sendmsg
> #34 [ ] do_syscall_64
> 
> These are high traffic machines. Almost all the CPUs are stuck on the
> root memcg's vmpressure sr_lock and almost half of the CPUs are stuck
> on the kmem cache node's list_lock in IRQ context.

Are you able to track down the lock holder?

> Note that the vmpressure sr_lock is irq-unsafe.

Which is OK because this is only triggered from memory reclaim and
that can never happen from interrupt context, for obvious reasons.

> A couple of months back, we observed
> a similar situation with the swap locks, which forced us to disable swap
> under global pressure. Since we do proactive reclaim, disabling swap on
> global reclaim was not an issue. However, we have now started seeing the
> same situation with other irq-unsafe locks like the vmpressure sr_lock,
> and almost all the slab shrinkers have irq-unsafe spinlocks. One way to
> mitigate this is to convert all such locks (which can be taken in the
> reclaim path) to be irq-safe, but that does not seem like a maintainable
> solution.

This doesn't make much sense to be honest. We are not disabling IRQs
unless it is absolutely necessary.

> Please note that we are running a user space oom-killer which is more
> aggressive than oomd/PSI, but even that got stuck under this much memory
> pressure.
> 
> I am wondering if anyone else has seen a similar situation in production
> and if there is a recommended way to resolve this situation.

I would recommend focusing on tracking down who is blocking
further progress.
-- 
Michal Hocko
SUSE Labs


* Re: Machine lockups on extreme memory pressure
From: Shakeel Butt @ 2020-09-22 13:37 UTC
  To: Michal Hocko
  Cc: Johannes Weiner, Linux MM, Andrew Morton, Roman Gushchin, LKML,
	Greg Thelen

On Tue, Sep 22, 2020 at 4:12 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 21-09-20 11:35:35, Shakeel Butt wrote:
> > Hi all,
> >
> > We are seeing machine lockups due to extreme memory pressure where the
> > free pages in all zones are way below the min watermarks. The stack
> > of a stuck CPU looks like the following (I had to crash the machine to
> > get the info).
>
> sysrq+l didn't report anything?
>

Sorry, I misspoke earlier when I said that I personally crashed the
machine. I got the state of the machine from the crash dump. We have a
crash timer on our machines which needs to be reset from user space
every couple of hours. If the user space daemon responsible for
resetting it does not get a chance to run, the machine is crashed, so
these crashes are cases where that daemon could not run for a couple of
hours.
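
(Our mechanism is custom, but for illustration the daemon boils down to
something like the following sketch against the generic /dev/watchdog
interface; this is not our actual code.)

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	/* Open the watchdog device; from this point on, the kernel resets
	 * the machine unless user space keeps petting it in time. */
	int fd = open("/dev/watchdog", O_WRONLY);

	if (fd < 0)
		return 1;

	for (;;) {
		/* Any write pets the watchdog and rearms its timeout. */
		if (write(fd, "k", 1) != 1)
			return 1;
		sleep(60);
	}
}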

> >  #0 [ ] crash_nmi_callback
> >  #1 [ ] nmi_handle
> >  #2 [ ] default_do_nmi
> >  #3 [ ] do_nmi
> >  #4 [ ] end_repeat_nmi
> > --- <NMI exception stack> ---
> >  #5 [ ] queued_spin_lock_slowpath
> >  #6 [ ] _raw_spin_lock
> >  #7 [ ] ____cache_alloc_node
> >  #8 [ ] fallback_alloc
> >  #9 [ ] __kmalloc_node_track_caller
> > #10 [ ] __alloc_skb
> > #11 [ ] tcp_send_ack
> > #12 [ ] tcp_delack_timer
> > #13 [ ] run_timer_softirq
> > #14 [ ] irq_exit
> > #15 [ ] smp_apic_timer_interrupt
> > #16 [ ] apic_timer_interrupt
> > --- <IRQ stack> ---
> > #17 [ ] apic_timer_interrupt
> > #18 [ ] _raw_spin_lock
> > #19 [ ] vmpressure
> > #20 [ ] shrink_node
> > #21 [ ] do_try_to_free_pages
> > #22 [ ] try_to_free_pages
> > #23 [ ] __alloc_pages_direct_reclaim
> > #24 [ ] __alloc_pages_nodemask
> > #25 [ ] cache_grow_begin
> > #26 [ ] fallback_alloc
> > #27 [ ] __kmalloc_node_track_caller
> > #28 [ ] __alloc_skb
> > #29 [ ] tcp_sendmsg_locked
> > #30 [ ] tcp_sendmsg
> > #31 [ ] inet6_sendmsg
> > #32 [ ] ___sys_sendmsg
> > #33 [ ] sys_sendmsg
> > #34 [ ] do_syscall_64
> >
> > These are high traffic machines. Almost all the CPUs are stuck on the
> > root memcg's vmpressure sr_lock and almost half of the CPUs are stuck
> > on the kmem cache node's list_lock in IRQ context.
>
> Are you able to track down the lock holder?
>
> > Note that the vmpressure sr_lock is irq-unsafe.
>
> Which is OK because this is only triggered from memory reclaim and
> that can never happen from interrupt context, for obvious reasons.
>
> > A couple of months back, we observed
> > a similar situation with the swap locks, which forced us to disable swap
> > under global pressure. Since we do proactive reclaim, disabling swap on
> > global reclaim was not an issue. However, we have now started seeing the
> > same situation with other irq-unsafe locks like the vmpressure sr_lock,
> > and almost all the slab shrinkers have irq-unsafe spinlocks. One way to
> > mitigate this is to convert all such locks (which can be taken in the
> > reclaim path) to be irq-safe, but that does not seem like a maintainable
> > solution.
>
> This doesn't make much sense to be honest. We are not disabling IRQs
> unless it is absolutely necessary.
>
> > Please note that we are running a user space oom-killer which is more
> > aggressive than oomd/PSI, but even that got stuck under this much memory
> > pressure.
> >
> > I am wondering if anyone else has seen a similar situation in production
> > and if there is a recommended way to resolve this situation.
>
> I would recommend focusing on tracking down who is blocking
> further progress.

I was able to find the CPU next in line for the list_lock from the
dump. I don't think anyone is blocking progress as such; it is more
that the spinlock in IRQ context is starving the spinlock in process
context. This is a high traffic machine and there are tens of
thousands of potential network ACKs on the queue.

I talked about this problem with Johannes at LPC 2019 and I think we
talked about two potential solutions. First was to somehow give memory
reserves to oomd and second was in-kernel PSI based oom-killer. I am
not sure the first one will work in this situation but the second one
might help.
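
(For the first option, short of real kernel-side reserves, about the
best the killer daemon can do today is pin and pre-fault its memory up
front so that it never has to allocate, and hence enter reclaim, once
the system is already under pressure. A rough sketch, with arbitrary
sizes, not our actual daemon:)

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

static char *scratch;

static int harden_oom_killer(void)
{
	const size_t scratch_sz = 8 << 20;	/* 8 MiB, arbitrary */

	/* Pin current and future mappings so they are never reclaimed. */
	if (mlockall(MCL_CURRENT | MCL_FUTURE))
		return -1;

	/* Pre-allocate and pre-fault a scratch buffer for later use. */
	scratch = malloc(scratch_sz);
	if (!scratch)
		return -1;
	memset(scratch, 0, scratch_sz);

	/* Run at high priority so reclaim-bound tasks do not starve us. */
	setpriority(PRIO_PROCESS, 0, -20);
	return 0;
}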

Shakeel


* Re: Machine lockups on extreme memory pressure
From: Michal Hocko @ 2020-09-22 15:16 UTC
  To: Shakeel Butt
  Cc: Johannes Weiner, Linux MM, Andrew Morton, Roman Gushchin, LKML,
	Greg Thelen

On Tue 22-09-20 06:37:02, Shakeel Butt wrote:
[...]
> > I would recommend focusing on tracking down who is blocking
> > further progress.
> 
> I was able to find the CPU next in line for the list_lock from the
> dump. I don't think anyone is blocking progress as such; it is more
> that the spinlock in IRQ context is starving the spinlock in process
> context. This is a high traffic machine and there are tens of
> thousands of potential network ACKs on the queue.

So there is forward progress, but it is too slow to allow any reasonable
progress in userspace?

> I talked about this problem with Johannes at LPC 2019 and I think we
> talked about two potential solutions. First was to somehow give memory
> reserves to oomd and second was in-kernel PSI based oom-killer. I am
> not sure the first one will work in this situation but the second one
> might help.

Why does your oomd depend on memory allocation?
-- 
Michal Hocko
SUSE Labs


* Re: Machine lockups on extreme memory pressure
From: Shakeel Butt @ 2020-09-22 16:29 UTC
  To: Michal Hocko
  Cc: Johannes Weiner, Linux MM, Andrew Morton, Roman Gushchin, LKML,
	Greg Thelen

On Tue, Sep 22, 2020 at 8:16 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 22-09-20 06:37:02, Shakeel Butt wrote:
> [...]
> > > I would recommend focusing on tracking down who is blocking
> > > further progress.
> >
> > I was able to find the CPU next in line for the list_lock from the
> > dump. I don't think anyone is blocking progress as such; it is more
> > that the spinlock in IRQ context is starving the spinlock in process
> > context. This is a high traffic machine and there are tens of
> > thousands of potential network ACKs on the queue.
>
> So there is forward progress, but it is too slow to allow any reasonable
> progress in userspace?

Yes.

>
> > I talked about this problem with Johannes at LPC 2019 and I think we
> > talked about two potential solutions. First was to somehow give memory
> > reserves to oomd and second was in-kernel PSI based oom-killer. I am
> > not sure the first one will work in this situation but the second one
> > might help.
>
> Why does your oomd depend on memory allocation?
>

It does not, but I think my concern was about potential allocations
during syscalls. Anyways, what do you think of the in-kernel PSI-based
oom-kill trigger? I think Johannes had a prototype as well.


* Re: Machine lockups on extreme memory pressure
From: Michal Hocko @ 2020-09-22 16:34 UTC
  To: Shakeel Butt
  Cc: Johannes Weiner, Linux MM, Andrew Morton, Roman Gushchin, LKML,
	Greg Thelen

On Tue 22-09-20 09:29:48, Shakeel Butt wrote:
> On Tue, Sep 22, 2020 at 8:16 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 22-09-20 06:37:02, Shakeel Butt wrote:
[...]
> > > I talked about this problem with Johannes at LPC 2019 and I think we
> > > talked about two potential solutions. First was to somehow give memory
> > > reserves to oomd and second was in-kernel PSI based oom-killer. I am
> > > not sure the first one will work in this situation but the second one
> > > might help.
> >
> > Why does your oomd depend on memory allocation?
> >
> 
> > It does not, but I think my concern was about potential allocations
> > during syscalls.

So what is the problem then? Why can your oomd not kill anything?

> Anyways, what do you think of the in-kernel PSI-based
> oom-kill trigger? I think Johannes had a prototype as well.

We have talked about something like that in the past and established
that auto-tuning an oom killer based on PSI is almost impossible to get
right for all potential workloads, and so this belongs in userspace.
The kernel's oom killer is there as a last resort when the system gets
close to meltdown.
-- 
Michal Hocko
SUSE Labs


* Re: Machine lockups on extreme memory pressure
From: Shakeel Butt @ 2020-09-22 16:51 UTC
  To: Michal Hocko
  Cc: Johannes Weiner, Linux MM, Andrew Morton, Roman Gushchin, LKML,
	Greg Thelen

On Tue, Sep 22, 2020 at 9:34 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 22-09-20 09:29:48, Shakeel Butt wrote:
> > On Tue, Sep 22, 2020 at 8:16 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Tue 22-09-20 06:37:02, Shakeel Butt wrote:
> [...]
> > > > I talked about this problem with Johannes at LPC 2019 and I think we
> > > > talked about two potential solutions. First was to somehow give memory
> > > > reserves to oomd and second was in-kernel PSI based oom-killer. I am
> > > > not sure the first one will work in this situation but the second one
> > > > might help.
> > >
> > > Why does your oomd depend on memory allocation?
> > >
> >
> > It does not, but I think my concern was about potential allocations
> > during syscalls.
>
> So what is the problem then? Why can your oomd not kill anything?
>

From the dump, it seems like it is not able to get a CPU. I am still
trying to extract the reason though.

> > Anyways, what do you think of the in-kernel PSI-based
> > oom-kill trigger? I think Johannes had a prototype as well.
>
> We have talked about something like that in the past and established
> that auto-tuning an oom killer based on PSI is almost impossible to get
> right for all potential workloads, and so this belongs in userspace.
> The kernel's oom killer is there as a last resort when the system gets
> close to meltdown.

The system is already in a meltdown state from the user's perspective. I
still think allowing users to optionally set an oom-kill trigger
based on PSI makes sense. Something like 'if all processes on the
system are stuck for 60 sec, trigger the oom-killer'.
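
(For completeness, user space can already express something close to
that with PSI triggers on /proc/pressure/memory; a minimal polling
sketch with an example threshold is below. The problem, of course, is
that under a lockup like the one above even this poller may never get
to run.)

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Example threshold: 500ms of full memory stall per 1s window. */
	const char trig[] = "full 500000 1000000";
	struct pollfd pfd;

	pfd.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
	if (pfd.fd < 0)
		return 1;
	if (write(pfd.fd, trig, strlen(trig) + 1) < 0)
		return 1;

	pfd.events = POLLPRI;
	for (;;) {
		if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLPRI)) {
			/* Threshold crossed: pick and kill a victim here. */
			printf("memory pressure trigger fired\n");
		}
	}
}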


* Re: Machine lockups on extreme memory pressure
From: Michal Hocko @ 2020-09-22 17:01 UTC
  To: Shakeel Butt
  Cc: Johannes Weiner, Linux MM, Andrew Morton, Roman Gushchin, LKML,
	Greg Thelen

On Tue 22-09-20 09:51:30, Shakeel Butt wrote:
> On Tue, Sep 22, 2020 at 9:34 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 22-09-20 09:29:48, Shakeel Butt wrote:
[...]
> > > Anyways, what do you think of the in-kernel PSI-based
> > > oom-kill trigger? I think Johannes had a prototype as well.
> >
> > We have talked about something like that in the past and established
> > that auto-tuning an oom killer based on PSI is almost impossible to get
> > right for all potential workloads, and so this belongs in userspace.
> > The kernel's oom killer is there as a last resort when the system gets
> > close to meltdown.
> 
> The system is already in a meltdown state from the user's perspective. I
> still think allowing users to optionally set an oom-kill trigger
> based on PSI makes sense. Something like 'if all processes on the
> system are stuck for 60 sec, trigger the oom-killer'.

We already do have watchdogs for that, no? If you cannot really schedule
anything then the soft lockup detector should fire. In a meltdown state
like that, a reboot is likely the best way forward anyway.
-- 
Michal Hocko
SUSE Labs


* Re: Machine lockups on extreme memory pressure
From: Shakeel Butt @ 2020-10-30 17:01 UTC
  To: Michal Hocko
  Cc: Johannes Weiner, Linux MM, Andrew Morton, Roman Gushchin, LKML,
	Greg Thelen

On Tue, Sep 22, 2020 at 10:01 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 22-09-20 09:51:30, Shakeel Butt wrote:
> > On Tue, Sep 22, 2020 at 9:34 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Tue 22-09-20 09:29:48, Shakeel Butt wrote:
> [...]
> > > > Anyways, what do you think of the in-kernel PSI-based
> > > > oom-kill trigger? I think Johannes had a prototype as well.
> > >
> > > We have talked about something like that in the past and established
> > > that auto-tuning an oom killer based on PSI is almost impossible to get
> > > right for all potential workloads, and so this belongs in userspace.
> > > The kernel's oom killer is there as a last resort when the system gets
> > > close to meltdown.
> >
> > The system is already in a meltdown state from the user's perspective. I
> > still think allowing users to optionally set an oom-kill trigger
> > based on PSI makes sense. Something like 'if all processes on the
> > system are stuck for 60 sec, trigger the oom-killer'.
>
> We already do have watchdogs for that, no? If you cannot really schedule
> anything then the soft lockup detector should fire. In a meltdown state
> like that, a reboot is likely the best way forward anyway.

Yes, the soft lockup detector can catch this situation, but I still
think we can do better than panic/reboot.

Anyways, I think we now know the reason for this extreme pressure, and
I just wanted to share it in case someone else is facing a similar
situation.

There were several thousand TCP delayed ACKs queued on the system. The
system was under memory pressure, and the alloc_skb(GFP_ATOMIC) calls
for delayed ACKs were either stealing from reclaimers or failing. For
the delayed ACKs whose allocation failed, the kernel rescheduled them
indefinitely, so these failing allocations kept the system in this
lockup state for hours. Commit a37c2134bed6 ("tcp: add exponential
backoff in __tcp_send_ack()") recently added the fix for this
situation.
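
(Roughly, the idea of that fix is that each consecutive allocation
failure doubles the delay before the delayed ACK is retried, up to a
cap, instead of retrying at a fixed short interval forever. A
self-contained illustration of the backoff schedule, not the upstream
code:)

#include <stdio.h>

#define BASE_DELAY_MS	200	/* roughly TCP_DELACK_MAX */
#define MAX_SHIFT	6	/* arbitrary cap for the example */

/* Returns the delay before the next retry and advances the backoff. */
static unsigned int next_retry_delay_ms(unsigned int *shift)
{
	unsigned int delay = BASE_DELAY_MS << *shift;

	if (*shift < MAX_SHIFT)
		(*shift)++;
	return delay;
}

int main(void)
{
	unsigned int shift = 0;

	/* Simulate eight consecutive GFP_ATOMIC allocation failures. */
	for (int i = 0; i < 8; i++)
		printf("retry %d after %u ms\n", i, next_retry_delay_ms(&shift));
	return 0;
}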
