All of lore.kernel.org
 help / color / mirror / Atom feed
* RCU nocb list not reclaiming causing OOM
@ 2018-07-20 23:05 David Chen
  2018-07-20 23:32 ` Paul E. McKenney
  0 siblings, 1 reply; 12+ messages in thread
From: David Chen @ 2018-07-20 23:05 UTC (permalink / raw)
  To: linux-kernel, Paul E. McKenney; +Cc: David Chen

Hi Paul,

We hit an RCU issue on 4.9.37 kernel. One of the nocb_follower list grows too
large, and not getting reclaimed, causing the system to OOM.

Printing the culprit rcu_sched_data:

  nocb_q_count = {
    counter = 32369635
  },
  nocb_follower_head = 0xffff88ae901c0a00,
  nocb_follower_tail = 0xffff88af1538b8d8,
  nocb_kthread = 0xffff88b06d290000,

As you can see here, the nocb_follower_head is not empty, so in theory, the
nocb_kthread shouldn't go to sleep. However, if dump the stack of the kthread:

crash> bt 0xffff88b06d290000
PID: 21     TASK: ffff88b06d290000  CPU: 3   COMMAND: "rcuos/1"
 #0 [ffffafc9020b7dc0] __schedule at ffffffff8d8789dc
 #1 [ffffafc9020b7e38] schedule at ffffffff8d878e76
 #2 [ffffafc9020b7e50] rcu_nocb_kthread at ffffffff8d112337
 #3 [ffffafc9020b7ec8] kthread at ffffffff8d0c6ce7
 #4 [ffffafc9020b7f50] ret_from_fork at ffffffff8d87d755

And if we dis the address at ffffffff8d112337:

/usr/src/debug/kernel-4.9.37/linux-4.9.37-29.nutanix.07142017.el7.centos.x86_64/kernel/rcu/tree_plugin.h: 2106
0xffffffff8d11232d <rcu_nocb_kthread+381>:      test   %rax,%rax
0xffffffff8d112330 <rcu_nocb_kthread+384>:      jne    0xffffffff8d112355 <rcu_nocb_kthread+421>
0xffffffff8d112332 <rcu_nocb_kthread+386>:      callq  0xffffffff8d878e40 <schedule>
0xffffffff8d112337 <rcu_nocb_kthread+391>:      lea    -0x40(%rbp),%rsi

So the kthread is blocked at swait_event_interruptible in the nocb_follower_wait.
This contradict with the fact that nocb_follower_head was not empty. So I
wonder if this is caused by the lack of memory barrier in the place shown below.
If the head is set to NULL after doing xchg, it will overwrite the head set
by leader. This caused the kthread to sleep the next iteration, and the leader
won't wake him up as the tail doesn't point to head.

Please tell me what do you think.

Thanks,
David

diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h linux-4.9.37/kernel/rcu/tree_plugin.h
--- linux-4.9.37.orig/kernel/rcu/tree_plugin.h	2017-07-12 06:42:41.000000000 -0700
+++ linux-4.9.37/kernel/rcu/tree_plugin.h	2018-07-20 15:25:57.311206343 -0700
@@ -2149,6 +2149,7 @@
 		BUG_ON(!list);
 		trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, "WokeNonEmpty");
 		WRITE_ONCE(rdp->nocb_follower_head, NULL);
+		smp_mb();
 		tail = xchg(&rdp->nocb_follower_tail, &rdp->nocb_follower_head);
 
 		/* Each pass through the following loop invokes a callback. */

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: RCU nocb list not reclaiming causing OOM
  2018-07-20 23:05 RCU nocb list not reclaiming causing OOM David Chen
@ 2018-07-20 23:32 ` Paul E. McKenney
       [not found]   ` <BYAPR02MB450146DF9F67788ED1F0801D94500@BYAPR02MB4501.namprd02.prod.outlook.com>
  0 siblings, 1 reply; 12+ messages in thread
From: Paul E. McKenney @ 2018-07-20 23:32 UTC (permalink / raw)
  To: David Chen; +Cc: linux-kernel

On Fri, Jul 20, 2018 at 11:05:52PM +0000, David Chen wrote:
> Hi Paul,
> 
> We hit an RCU issue on 4.9.37 kernel. One of the nocb_follower list grows too
> large, and not getting reclaimed, causing the system to OOM.
> 
> Printing the culprit rcu_sched_data:
> 
>   nocb_q_count = {
>     counter = 32369635
>   },
>   nocb_follower_head = 0xffff88ae901c0a00,
>   nocb_follower_tail = 0xffff88af1538b8d8,
>   nocb_kthread = 0xffff88b06d290000,
> 
> As you can see here, the nocb_follower_head is not empty, so in theory, the
> nocb_kthread shouldn't go to sleep. However, if dump the stack of the kthread:
> 
> crash> bt 0xffff88b06d290000
> PID: 21     TASK: ffff88b06d290000  CPU: 3   COMMAND: "rcuos/1"
>  #0 [ffffafc9020b7dc0] __schedule at ffffffff8d8789dc
>  #1 [ffffafc9020b7e38] schedule at ffffffff8d878e76
>  #2 [ffffafc9020b7e50] rcu_nocb_kthread at ffffffff8d112337
>  #3 [ffffafc9020b7ec8] kthread at ffffffff8d0c6ce7
>  #4 [ffffafc9020b7f50] ret_from_fork at ffffffff8d87d755
> 
> And if we dis the address at ffffffff8d112337:
> 
> /usr/src/debug/kernel-4.9.37/linux-4.9.37-29.nutanix.07142017.el7.centos.x86_64/kernel/rcu/tree_plugin.h: 2106
> 0xffffffff8d11232d <rcu_nocb_kthread+381>:      test   %rax,%rax
> 0xffffffff8d112330 <rcu_nocb_kthread+384>:      jne    0xffffffff8d112355 <rcu_nocb_kthread+421>
> 0xffffffff8d112332 <rcu_nocb_kthread+386>:      callq  0xffffffff8d878e40 <schedule>
> 0xffffffff8d112337 <rcu_nocb_kthread+391>:      lea    -0x40(%rbp),%rsi
> 
> So the kthread is blocked at swait_event_interruptible in the nocb_follower_wait.
> This contradict with the fact that nocb_follower_head was not empty. So I
> wonder if this is caused by the lack of memory barrier in the place shown below.
> If the head is set to NULL after doing xchg, it will overwrite the head set
> by leader. This caused the kthread to sleep the next iteration, and the leader
> won't wake him up as the tail doesn't point to head.
> 
> Please tell me what do you think.
> 
> Thanks,
> David
> 
> diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h linux-4.9.37/kernel/rcu/tree_plugin.h
> --- linux-4.9.37.orig/kernel/rcu/tree_plugin.h	2017-07-12 06:42:41.000000000 -0700
> +++ linux-4.9.37/kernel/rcu/tree_plugin.h	2018-07-20 15:25:57.311206343 -0700
> @@ -2149,6 +2149,7 @@
>  		BUG_ON(!list);
>  		trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, "WokeNonEmpty");
>  		WRITE_ONCE(rdp->nocb_follower_head, NULL);
> +		smp_mb();
>  		tail = xchg(&rdp->nocb_follower_tail, &rdp->nocb_follower_head);

The xchg() operation implies full memory barriers both before and after,
so adding the smp_mb() before would have no effect.

But let me take a look at post-4.9 changes to this code...

I suggest trying out the following commit:

6b5fc3a13318 ("rcu: Add memory barriers for NOCB leader wakeup")

If that one doesn't help, the following might be worth trying, but probably
a lot harder to backport:

8be6e1b15c54 ("rcu: Use timer as backstop for NOCB deferred wakeups")

Please let me know how it goes!

							Thanx, Paul

------------------------------------------------------------------------

commit 6b5fc3a1331810db407c9e0e673dc1837afdc9d0
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Fri Apr 28 20:11:09 2017 -0700

    rcu: Add memory barriers for NOCB leader wakeup
    
    Wait/wakeup operations do not guarantee ordering on their own.  Instead,
    either locking or memory barriers are required.  This commit therefore
    adds memory barriers to wake_nocb_leader() and nocb_leader_wait().
    
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Tested-by: Krister Johansen <kjlx@templeofstupid.com>
    Cc: <stable@vger.kernel.org> # 4.6.x

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 0b1042545116..573fbe9640a0 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1810,6 +1810,7 @@ static void wake_nocb_leader(struct rcu_data *rdp, bool force)
 	if (READ_ONCE(rdp_leader->nocb_leader_sleep) || force) {
 		/* Prior smp_mb__after_atomic() orders against prior enqueue. */
 		WRITE_ONCE(rdp_leader->nocb_leader_sleep, false);
+		smp_mb(); /* ->nocb_leader_sleep before swake_up(). */
 		swake_up(&rdp_leader->nocb_wq);
 	}
 }
@@ -2064,6 +2065,7 @@ static void nocb_leader_wait(struct rcu_data *my_rdp)
 	 * nocb_gp_head, where they await a grace period.
 	 */
 	gotcbs = false;
+	smp_mb(); /* wakeup before ->nocb_head reads. */
 	for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_follower) {
 		rdp->nocb_gp_head = READ_ONCE(rdp->nocb_head);
 		if (!rdp->nocb_gp_head)


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: RCU nocb list not reclaiming causing OOM
       [not found]   ` <BYAPR02MB450146DF9F67788ED1F0801D94500@BYAPR02MB4501.namprd02.prod.outlook.com>
@ 2018-07-27 19:07     ` David Chen
  2018-07-27 22:31       ` Paul E. McKenney
  0 siblings, 1 reply; 12+ messages in thread
From: David Chen @ 2018-07-27 19:07 UTC (permalink / raw)
  To: paulmck; +Cc: linux-kernel

Hi Paul,

I'd like to opinion again on this subject.

So we are going to backport this patch:
6b5fc3a13318 ("rcu: Add memory barriers for NOCB leader wakeup")

But the other one:
8be6e1b15c54 ("rcu: Use timer as backstop for NOCB deferred wakeups")
It doesn't apply cleanly, and I'm not too comfortable porting it myself.

So I'm wondering if I use the following change to always wake up follower thread
regardless if the list was empty or not, just to be on the safe side. Do you think
this change is reasonable? Do you see any problem it might cause?

Thanks,
David

diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h linux-4.9.37/kernel/rcu/tree_plugin.h
--- linux-4.9.37.orig/kernel/rcu/tree_plugin.h	2017-07-12 06:42:41.000000000 -0700
+++ linux-4.9.37/kernel/rcu/tree_plugin.h	2018-07-27 11:57:03.582134519 -0700
@@ -2077,7 +2077,7 @@
 		tail = xchg(&rdp->nocb_follower_tail, rdp->nocb_gp_tail);
 		*tail = rdp->nocb_gp_head;
 		smp_mb__after_atomic(); /* Store *tail before wakeup. */
-		if (rdp != my_rdp && tail == &rdp->nocb_follower_head) {
+		if (rdp != my_rdp) {
 			/*
 			 * List was empty, wake up the follower.
 			 * Memory barriers supplied by atomic_long_add().


From: David Chen
Sent: Friday, July 20, 2018 5:12 PM
To: paulmck@linux.vnet.ibm.com
Cc: linux-kernel@vger.kernel.org
Subject: Re: RCU nocb list not reclaiming causing OOM
   

Hi Paul,

Ok, I'll try those patches.

Thanks,
David
  
From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Sent: Friday, July 20, 2018 4:32:12 PM
To: David Chen
Cc: linux-kernel@vger.kernel.org
Subject: Re: RCU nocb list not reclaiming causing OOM
  

On Fri, Jul 20, 2018 at 11:05:52PM +0000, David Chen wrote:
> Hi Paul,
> 
> We hit an RCU issue on 4.9.37 kernel. One of the nocb_follower list grows too
> large, and not getting reclaimed, causing the system to OOM.
> 
> Printing the culprit rcu_sched_data:
> 
>   nocb_q_count = {
>     counter = 32369635
>   },
>   nocb_follower_head = 0xffff88ae901c0a00,
>   nocb_follower_tail = 0xffff88af1538b8d8,
>   nocb_kthread = 0xffff88b06d290000,
> 
> As you can see here, the nocb_follower_head is not empty, so in theory, the
> nocb_kthread shouldn't go to sleep. However, if dump the stack of the kthread:
> 
> crash> bt 0xffff88b06d290000
> PID: 21     TASK: ffff88b06d290000  CPU: 3   COMMAND: "rcuos/1"
>  #0 [ffffafc9020b7dc0] __schedule at ffffffff8d8789dc
>  #1 [ffffafc9020b7e38] schedule at ffffffff8d878e76
>  #2 [ffffafc9020b7e50] rcu_nocb_kthread at ffffffff8d112337
>  #3 [ffffafc9020b7ec8] kthread at ffffffff8d0c6ce7
>  #4 [ffffafc9020b7f50] ret_from_fork at ffffffff8d87d755
> 
> And if we dis the address at ffffffff8d112337:
> 
> /usr/src/debug/kernel-4.9.37/linux-4.9.37-29.nutanix.07142017.el7.centos.x86_64/kernel/rcu/tree_plugin.h: 2106
> 0xffffffff8d11232d <rcu_nocb_kthread+381>:      test   %rax,%rax
> 0xffffffff8d112330 <rcu_nocb_kthread+384>:      jne    0xffffffff8d112355 <rcu_nocb_kthread+421>
> 0xffffffff8d112332 <rcu_nocb_kthread+386>:      callq  0xffffffff8d878e40 <schedule>
> 0xffffffff8d112337 <rcu_nocb_kthread+391>:      lea    -0x40(%rbp),%rsi
> 
> So the kthread is blocked at swait_event_interruptible in the nocb_follower_wait.
> This contradict with the fact that nocb_follower_head was not empty. So I
> wonder if this is caused by the lack of memory barrier in the place shown below.
> If the head is set to NULL after doing xchg, it will overwrite the head set
> by leader. This caused the kthread to sleep the next iteration, and the leader
> won't wake him up as the tail doesn't point to head.
> 
> Please tell me what do you think.
> 
> Thanks,
> David
> 
> diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h linux-4.9.37/kernel/rcu/tree_plugin.h
> --- linux-4.9.37.orig/kernel/rcu/tree_plugin.h        2017-07-12 06:42:41.000000000 -0700
> +++ linux-4.9.37/kernel/rcu/tree_plugin.h     2018-07-20 15:25:57.311206343 -0700
> @@ -2149,6 +2149,7 @@
>                BUG_ON(!list);
>                trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, "WokeNonEmpty");
>                WRITE_ONCE(rdp->nocb_follower_head, NULL);
> +             smp_mb();
>                tail = xchg(&rdp->nocb_follower_tail, &rdp->nocb_follower_head);

The xchg() operation implies full memory barriers both before and after,
so adding the smp_mb() before would have no effect.

But let me take a look at post-4.9 changes to this code...

I suggest trying out the following commit:

6b5fc3a13318 ("rcu: Add memory barriers for NOCB leader wakeup")

If that one doesn't help, the following might be worth trying, but probably
a lot harder to backport:

8be6e1b15c54 ("rcu: Use timer as backstop for NOCB deferred wakeups")

Please let me know how it goes!

                                                        Thanx, Paul

------------------------------------------------------------------------

commit 6b5fc3a1331810db407c9e0e673dc1837afdc9d0
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Fri Apr 28 20:11:09 2017 -0700

    rcu: Add memory barriers for NOCB leader wakeup
    
    Wait/wakeup operations do not guarantee ordering on their own.  Instead,
    either locking or memory barriers are required.  This commit therefore
    adds memory barriers to wake_nocb_leader() and nocb_leader_wait().
    
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Tested-by: Krister Johansen <kjlx@templeofstupid.com>
    Cc: <stable@vger.kernel.org> # 4.6.x

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 0b1042545116..573fbe9640a0 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1810,6 +1810,7 @@ static void wake_nocb_leader(struct rcu_data *rdp, bool force)
         if (READ_ONCE(rdp_leader->nocb_leader_sleep) || force) {
                 /* Prior smp_mb__after_atomic() orders against prior enqueue. */
                 WRITE_ONCE(rdp_leader->nocb_leader_sleep, false);
+               smp_mb(); /* ->nocb_leader_sleep before swake_up(). */
                 swake_up(&rdp_leader->nocb_wq);
         }
 }
@@ -2064,6 +2065,7 @@ static void nocb_leader_wait(struct rcu_data *my_rdp)
          * nocb_gp_head, where they await a grace period.
          */
         gotcbs = false;
+       smp_mb(); /* wakeup before ->nocb_head reads. */
         for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_follower) {
                 rdp->nocb_gp_head = READ_ONCE(rdp->nocb_head);
                 if (!rdp->nocb_gp_head)

     

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: RCU nocb list not reclaiming causing OOM
  2018-07-27 19:07     ` David Chen
@ 2018-07-27 22:31       ` Paul E. McKenney
  2018-07-27 23:16         ` David Chen
  0 siblings, 1 reply; 12+ messages in thread
From: Paul E. McKenney @ 2018-07-27 22:31 UTC (permalink / raw)
  To: David Chen; +Cc: linux-kernel

On Fri, Jul 27, 2018 at 07:07:46PM +0000, David Chen wrote:
> Hi Paul,
> 
> I'd like to opinion again on this subject.
> 
> So we are going to backport this patch:
> 6b5fc3a13318 ("rcu: Add memory barriers for NOCB leader wakeup")

Does this one solve the problem, or are you still seeing hangs?
If you are no longer seeing hangs, my advice is "hands off keyboard",
though some would no doubt point out that I should follow that advice
more myself.  ;-)

> But the other one:
> 8be6e1b15c54 ("rcu: Use timer as backstop for NOCB deferred wakeups")
> It doesn't apply cleanly, and I'm not too comfortable porting it myself.

Yeah, that one is a bit on the non-trivial side, no two ways about it.

> So I'm wondering if I use the following change to always wake up follower thread
> regardless if the list was empty or not, just to be on the safe side. Do you think
> this change is reasonable? Do you see any problem it might cause?
> 
> Thanks,
> David
> 
> diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h linux-4.9.37/kernel/rcu/tree_plugin.h
> --- linux-4.9.37.orig/kernel/rcu/tree_plugin.h	2017-07-12 06:42:41.000000000 -0700
> +++ linux-4.9.37/kernel/rcu/tree_plugin.h	2018-07-27 11:57:03.582134519 -0700
> @@ -2077,7 +2077,7 @@
>  		tail = xchg(&rdp->nocb_follower_tail, rdp->nocb_gp_tail);
>  		*tail = rdp->nocb_gp_head;
>  		smp_mb__after_atomic(); /* Store *tail before wakeup. */
> -		if (rdp != my_rdp && tail == &rdp->nocb_follower_head) {
> +		if (rdp != my_rdp) {

This will burn a bit of extra CPU time, but it should be fine other than
that.

							Thanx, Paul

>  			/*
>  			 * List was empty, wake up the follower.
>  			 * Memory barriers supplied by atomic_long_add().
> 
> 
> From: David Chen
> Sent: Friday, July 20, 2018 5:12 PM
> To: paulmck@linux.vnet.ibm.com
> Cc: linux-kernel@vger.kernel.org
> Subject: Re: RCU nocb list not reclaiming causing OOM
>    
> 
> Hi Paul,
> 
> Ok, I'll try those patches.
> 
> Thanks,
> David
>   
> From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Sent: Friday, July 20, 2018 4:32:12 PM
> To: David Chen
> Cc: linux-kernel@vger.kernel.org
> Subject: Re: RCU nocb list not reclaiming causing OOM
>   
> 
> On Fri, Jul 20, 2018 at 11:05:52PM +0000, David Chen wrote:
> > Hi Paul,
> > 
> > We hit an RCU issue on 4.9.37 kernel. One of the nocb_follower list grows too
> > large, and not getting reclaimed, causing the system to OOM.
> > 
> > Printing the culprit rcu_sched_data:
> > 
> >   nocb_q_count = {
> >     counter = 32369635
> >   },
> >   nocb_follower_head = 0xffff88ae901c0a00,
> >   nocb_follower_tail = 0xffff88af1538b8d8,
> >   nocb_kthread = 0xffff88b06d290000,
> > 
> > As you can see here, the nocb_follower_head is not empty, so in theory, the
> > nocb_kthread shouldn't go to sleep. However, if dump the stack of the kthread:
> > 
> > crash> bt 0xffff88b06d290000
> > PID: 21     TASK: ffff88b06d290000  CPU: 3   COMMAND: "rcuos/1"
> >  #0 [ffffafc9020b7dc0] __schedule at ffffffff8d8789dc
> >  #1 [ffffafc9020b7e38] schedule at ffffffff8d878e76
> >  #2 [ffffafc9020b7e50] rcu_nocb_kthread at ffffffff8d112337
> >  #3 [ffffafc9020b7ec8] kthread at ffffffff8d0c6ce7
> >  #4 [ffffafc9020b7f50] ret_from_fork at ffffffff8d87d755
> > 
> > And if we dis the address at ffffffff8d112337:
> > 
> > /usr/src/debug/kernel-4.9.37/linux-4.9.37-29.nutanix.07142017.el7.centos.x86_64/kernel/rcu/tree_plugin.h: 2106
> > 0xffffffff8d11232d <rcu_nocb_kthread+381>:      test   %rax,%rax
> > 0xffffffff8d112330 <rcu_nocb_kthread+384>:      jne    0xffffffff8d112355 <rcu_nocb_kthread+421>
> > 0xffffffff8d112332 <rcu_nocb_kthread+386>:      callq  0xffffffff8d878e40 <schedule>
> > 0xffffffff8d112337 <rcu_nocb_kthread+391>:      lea    -0x40(%rbp),%rsi
> > 
> > So the kthread is blocked at swait_event_interruptible in the nocb_follower_wait.
> > This contradict with the fact that nocb_follower_head was not empty. So I
> > wonder if this is caused by the lack of memory barrier in the place shown below.
> > If the head is set to NULL after doing xchg, it will overwrite the head set
> > by leader. This caused the kthread to sleep the next iteration, and the leader
> > won't wake him up as the tail doesn't point to head.
> > 
> > Please tell me what do you think.
> > 
> > Thanks,
> > David
> > 
> > diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h linux-4.9.37/kernel/rcu/tree_plugin.h
> > --- linux-4.9.37.orig/kernel/rcu/tree_plugin.h        2017-07-12 06:42:41.000000000 -0700
> > +++ linux-4.9.37/kernel/rcu/tree_plugin.h     2018-07-20 15:25:57.311206343 -0700
> > @@ -2149,6 +2149,7 @@
> >                BUG_ON(!list);
> >                trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, "WokeNonEmpty");
> >                WRITE_ONCE(rdp->nocb_follower_head, NULL);
> > +             smp_mb();
> >                tail = xchg(&rdp->nocb_follower_tail, &rdp->nocb_follower_head);
> 
> The xchg() operation implies full memory barriers both before and after,
> so adding the smp_mb() before would have no effect.
> 
> But let me take a look at post-4.9 changes to this code...
> 
> I suggest trying out the following commit:
> 
> 6b5fc3a13318 ("rcu: Add memory barriers for NOCB leader wakeup")
> 
> If that one doesn't help, the following might be worth trying, but probably
> a lot harder to backport:
> 
> 8be6e1b15c54 ("rcu: Use timer as backstop for NOCB deferred wakeups")
> 
> Please let me know how it goes!
> 
>                                                         Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> commit 6b5fc3a1331810db407c9e0e673dc1837afdc9d0
> Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Date:   Fri Apr 28 20:11:09 2017 -0700
> 
>     rcu: Add memory barriers for NOCB leader wakeup
>     
>     Wait/wakeup operations do not guarantee ordering on their own.  Instead,
>     either locking or memory barriers are required.  This commit therefore
>     adds memory barriers to wake_nocb_leader() and nocb_leader_wait().
>     
>     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>     Tested-by: Krister Johansen <kjlx@templeofstupid.com>
>     Cc: <stable@vger.kernel.org> # 4.6.x
> 
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 0b1042545116..573fbe9640a0 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -1810,6 +1810,7 @@ static void wake_nocb_leader(struct rcu_data *rdp, bool force)
>          if (READ_ONCE(rdp_leader->nocb_leader_sleep) || force) {
>                  /* Prior smp_mb__after_atomic() orders against prior enqueue. */
>                  WRITE_ONCE(rdp_leader->nocb_leader_sleep, false);
> +               smp_mb(); /* ->nocb_leader_sleep before swake_up(). */
>                  swake_up(&rdp_leader->nocb_wq);
>          }
>  }
> @@ -2064,6 +2065,7 @@ static void nocb_leader_wait(struct rcu_data *my_rdp)
>           * nocb_gp_head, where they await a grace period.
>           */
>          gotcbs = false;
> +       smp_mb(); /* wakeup before ->nocb_head reads. */
>          for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_follower) {
>                  rdp->nocb_gp_head = READ_ONCE(rdp->nocb_head);
>                  if (!rdp->nocb_gp_head)
> 
>      


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: RCU nocb list not reclaiming causing OOM
  2018-07-27 22:31       ` Paul E. McKenney
@ 2018-07-27 23:16         ` David Chen
  2018-07-27 23:47           ` Paul E. McKenney
  0 siblings, 1 reply; 12+ messages in thread
From: David Chen @ 2018-07-27 23:16 UTC (permalink / raw)
  To: paulmck; +Cc: linux-kernel

Hi Paul,

Thanks for the advice.
The bug is kind of hard to hit, so I can't say for certain yet.

Though after another look at the code, I found out the `smp_mb__after_atomic();`
seems to be only a compiler barrier on x86.

		tail = xchg(&rdp->nocb_follower_tail, rdp->nocb_gp_tail);
		*tail = rdp->nocb_gp_head;
		smp_mb__after_atomic(); /* Store *tail before wakeup. */
		if (rdp != my_rdp && tail == &rdp->nocb_follower_head) {
			swake_up(&rdp->nocb_wq);

But that wouldn't be enough right? Because from 6b5fc3a13318, it stated that
wakeup operation don't guarantee ordering. And when the follower wakes up, it checks
for nocb_follower_head, which is assigned by `*tail = rdp->nocb_gp_head;` which doesn't
have LOCK prefix. So it's possible for follower to wake up and see the list is empty and go
back to sleep.

Thanks,
David

From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Sent: Friday, July 27, 2018 3:31 PM
To: David Chen
Cc: linux-kernel@vger.kernel.org
Subject: Re: RCU nocb list not reclaiming causing OOM
  

On Fri, Jul 27, 2018 at 07:07:46PM +0000, David Chen wrote:
> Hi Paul,
> 
> I'd like to opinion again on this subject.
> 
> So we are going to backport this patch:
> 6b5fc3a13318 ("rcu: Add memory barriers for NOCB leader wakeup")

Does this one solve the problem, or are you still seeing hangs?
If you are no longer seeing hangs, my advice is "hands off keyboard",
though some would no doubt point out that I should follow that advice
more myself.  ;-)

> But the other one:
> 8be6e1b15c54 ("rcu: Use timer as backstop for NOCB deferred wakeups")
> It doesn't apply cleanly, and I'm not too comfortable porting it myself.

Yeah, that one is a bit on the non-trivial side, no two ways about it.

> So I'm wondering if I use the following change to always wake up follower thread
> regardless if the list was empty or not, just to be on the safe side. Do you think
> this change is reasonable? Do you see any problem it might cause?
> 
> Thanks,
> David
> 
> diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h linux-4.9.37/kernel/rcu/tree_plugin.h
> --- linux-4.9.37.orig/kernel/rcu/tree_plugin.h        2017-07-12 06:42:41.000000000 -0700
> +++ linux-4.9.37/kernel/rcu/tree_plugin.h     2018-07-27 11:57:03.582134519 -0700
> @@ -2077,7 +2077,7 @@
>                tail = xchg(&rdp->nocb_follower_tail, rdp->nocb_gp_tail);
>                *tail = rdp->nocb_gp_head;
>                smp_mb__after_atomic(); /* Store *tail before wakeup. */
> -             if (rdp != my_rdp && tail == &rdp->nocb_follower_head) {
> +             if (rdp != my_rdp) {

This will burn a bit of extra CPU time, but it should be fine other than
that.

                                                        Thanx, Paul

>                        /*
>                         * List was empty, wake up the follower.
>                         * Memory barriers supplied by atomic_long_add().
> 
> 
> From: David Chen
> Sent: Friday, July 20, 2018 5:12 PM
> To: paulmck@linux.vnet.ibm.com
> Cc: linux-kernel@vger.kernel.org
> Subject: Re: RCU nocb list not reclaiming causing OOM
>    
> 
> Hi Paul,
> 
> Ok, I'll try those patches.
> 
> Thanks,
> David
>   
> From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Sent: Friday, July 20, 2018 4:32:12 PM
> To: David Chen
> Cc: linux-kernel@vger.kernel.org
> Subject: Re: RCU nocb list not reclaiming causing OOM
>   
> 
> On Fri, Jul 20, 2018 at 11:05:52PM +0000, David Chen wrote:
> > Hi Paul,
> > 
> > We hit an RCU issue on 4.9.37 kernel. One of the nocb_follower list grows too
> > large, and not getting reclaimed, causing the system to OOM.
> > 
> > Printing the culprit rcu_sched_data:
> > 
> >   nocb_q_count = {
> >     counter = 32369635
> >   },
> >   nocb_follower_head = 0xffff88ae901c0a00,
> >   nocb_follower_tail = 0xffff88af1538b8d8,
> >   nocb_kthread = 0xffff88b06d290000,
> > 
> > As you can see here, the nocb_follower_head is not empty, so in theory, the
> > nocb_kthread shouldn't go to sleep. However, if dump the stack of the kthread:
> > 
> > crash> bt 0xffff88b06d290000
> > PID: 21     TASK: ffff88b06d290000  CPU: 3   COMMAND: "rcuos/1"
> >  #0 [ffffafc9020b7dc0] __schedule at ffffffff8d8789dc
> >  #1 [ffffafc9020b7e38] schedule at ffffffff8d878e76
> >  #2 [ffffafc9020b7e50] rcu_nocb_kthread at ffffffff8d112337
> >  #3 [ffffafc9020b7ec8] kthread at ffffffff8d0c6ce7
> >  #4 [ffffafc9020b7f50] ret_from_fork at ffffffff8d87d755
> > 
> > And if we dis the address at ffffffff8d112337:
> > 
> > /usr/src/debug/kernel-4.9.37/linux-4.9.37-29.nutanix.07142017.el7.centos.x86_64/kernel/rcu/tree_plugin.h: 2106
> > 0xffffffff8d11232d <rcu_nocb_kthread+381>:      test   %rax,%rax
> > 0xffffffff8d112330 <rcu_nocb_kthread+384>:      jne    0xffffffff8d112355 <rcu_nocb_kthread+421>
> > 0xffffffff8d112332 <rcu_nocb_kthread+386>:      callq  0xffffffff8d878e40 <schedule>
> > 0xffffffff8d112337 <rcu_nocb_kthread+391>:      lea    -0x40(%rbp),%rsi
> > 
> > So the kthread is blocked at swait_event_interruptible in the nocb_follower_wait.
> > This contradict with the fact that nocb_follower_head was not empty. So I
> > wonder if this is caused by the lack of memory barrier in the place shown below.
> > If the head is set to NULL after doing xchg, it will overwrite the head set
> > by leader. This caused the kthread to sleep the next iteration, and the leader
> > won't wake him up as the tail doesn't point to head.
> > 
> > Please tell me what do you think.
> > 
> > Thanks,
> > David
> > 
> > diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h linux-4.9.37/kernel/rcu/tree_plugin.h
> > --- linux-4.9.37.orig/kernel/rcu/tree_plugin.h        2017-07-12 06:42:41.000000000 -0700
> > +++ linux-4.9.37/kernel/rcu/tree_plugin.h     2018-07-20 15:25:57.311206343 -0700
> > @@ -2149,6 +2149,7 @@
> >                BUG_ON(!list);
> >                trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, "WokeNonEmpty");
> >                WRITE_ONCE(rdp->nocb_follower_head, NULL);
> > +             smp_mb();
> >                tail = xchg(&rdp->nocb_follower_tail, &rdp->nocb_follower_head);
> 
> The xchg() operation implies full memory barriers both before and after,
> so adding the smp_mb() before would have no effect.
> 
> But let me take a look at post-4.9 changes to this code...
> 
> I suggest trying out the following commit:
> 
> 6b5fc3a13318 ("rcu: Add memory barriers for NOCB leader wakeup")
> 
> If that one doesn't help, the following might be worth trying, but probably
> a lot harder to backport:
> 
> 8be6e1b15c54 ("rcu: Use timer as backstop for NOCB deferred wakeups")
> 
> Please let me know how it goes!
> 
>                                                         Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> commit 6b5fc3a1331810db407c9e0e673dc1837afdc9d0
> Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Date:   Fri Apr 28 20:11:09 2017 -0700
> 
>     rcu: Add memory barriers for NOCB leader wakeup
>     
>     Wait/wakeup operations do not guarantee ordering on their own.  Instead,
>     either locking or memory barriers are required.  This commit therefore
>     adds memory barriers to wake_nocb_leader() and nocb_leader_wait().
>     
>     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>     Tested-by: Krister Johansen <kjlx@templeofstupid.com>
>     Cc: <stable@vger.kernel.org> # 4.6.x
> 
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 0b1042545116..573fbe9640a0 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -1810,6 +1810,7 @@ static void wake_nocb_leader(struct rcu_data *rdp, bool force)
>          if (READ_ONCE(rdp_leader->nocb_leader_sleep) || force) {
>                  /* Prior smp_mb__after_atomic() orders against prior enqueue. */
>                  WRITE_ONCE(rdp_leader->nocb_leader_sleep, false);
> +               smp_mb(); /* ->nocb_leader_sleep before swake_up(). */
>                  swake_up(&rdp_leader->nocb_wq);
>          }
>  }
> @@ -2064,6 +2065,7 @@ static void nocb_leader_wait(struct rcu_data *my_rdp)
>           * nocb_gp_head, where they await a grace period.
>           */
>          gotcbs = false;
> +       smp_mb(); /* wakeup before ->nocb_head reads. */
>          for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_follower) {
>                  rdp->nocb_gp_head = READ_ONCE(rdp->nocb_head);
>                  if (!rdp->nocb_gp_head)
> 
>      

    

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: RCU nocb list not reclaiming causing OOM
  2018-07-27 23:16         ` David Chen
@ 2018-07-27 23:47           ` Paul E. McKenney
  2018-07-28  0:07             ` David Chen
  0 siblings, 1 reply; 12+ messages in thread
From: Paul E. McKenney @ 2018-07-27 23:47 UTC (permalink / raw)
  To: David Chen; +Cc: linux-kernel

On Fri, Jul 27, 2018 at 11:16:39PM +0000, David Chen wrote:
> Hi Paul,
> 
> Thanks for the advice.
> The bug is kind of hard to hit, so I can't say for certain yet.

Well, you can always remove the "tail == &rdp->nocb_follower_head" as an
extra belt-and-suspenders safety net.  I am not putting that in mainline,
but in the privacy of your own copy of the kernel, I don't see any really
serious problem with it.  (As long as you aren't going for absolute maximum
performance, but even then there are other more important tuning actions
and code changes you could make.)

> Though after another look at the code, I found out the `smp_mb__after_atomic();`
> seems to be only a compiler barrier on x86.

Yes, and that is because the locked xchg instruction used on x86 to
implement xchg() already provides full ordering.  ;-)

> 		tail = xchg(&rdp->nocb_follower_tail, rdp->nocb_gp_tail);
> 		*tail = rdp->nocb_gp_head;
> 		smp_mb__after_atomic(); /* Store *tail before wakeup. */
> 		if (rdp != my_rdp && tail == &rdp->nocb_follower_head) {
> 			swake_up(&rdp->nocb_wq);
> 
> But that wouldn't be enough right? Because from 6b5fc3a13318, it stated that
> wakeup operation don't guarantee ordering. And when the follower wakes up, it checks
> for nocb_follower_head, which is assigned by `*tail = rdp->nocb_gp_head;` which doesn't
> have LOCK prefix. So it's possible for follower to wake up and see the list is empty and go
> back to sleep.

Again, xchg() is defined to provide full ordering against all operations
before and after it.  Each architecture is required to do whatever is
necessary to implement that full ordering, and x86 need only provide
its "lock xchg" instruction.

The smp_mb__after_atomic() has effect only after atomic read-modify-write
operations that do not return a value, for example, atomic_inc().
If you use it after a value-returning atomic read-modify-write operation
like xchg(), all you do is needlessly slow things down on platforms
that provide non-empty smp_mb__after_atomic() definitions.  So again,
smp_mb__after_atomic() after xchg() is pointless.

Please take a look at Documentation/core-api/atomic_ops.rst in the
Linux-kernel source tree for more information.  Or get a v4.17 kernel
source tree and check this using the memory model (tools/memory-model
in that version).

							Thanx, Paul

> Thanks,
> David
> 
> From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Sent: Friday, July 27, 2018 3:31 PM
> To: David Chen
> Cc: linux-kernel@vger.kernel.org
> Subject: Re: RCU nocb list not reclaiming causing OOM
>   
> 
> On Fri, Jul 27, 2018 at 07:07:46PM +0000, David Chen wrote:
> > Hi Paul,
> > 
> > I'd like to opinion again on this subject.
> > 
> > So we are going to backport this patch:
> > 6b5fc3a13318 ("rcu: Add memory barriers for NOCB leader wakeup")
> 
> Does this one solve the problem, or are you still seeing hangs?
> If you are no longer seeing hangs, my advice is "hands off keyboard",
> though some would no doubt point out that I should follow that advice
> more myself.  ;-)
> 
> > But the other one:
> > 8be6e1b15c54 ("rcu: Use timer as backstop for NOCB deferred wakeups")
> > It doesn't apply cleanly, and I'm not too comfortable porting it myself.
> 
> Yeah, that one is a bit on the non-trivial side, no two ways about it.
> 
> > So I'm wondering if I use the following change to always wake up follower thread
> > regardless if the list was empty or not, just to be on the safe side. Do you think
> > this change is reasonable? Do you see any problem it might cause?
> > 
> > Thanks,
> > David
> > 
> > diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h linux-4.9.37/kernel/rcu/tree_plugin.h
> > --- linux-4.9.37.orig/kernel/rcu/tree_plugin.h        2017-07-12 06:42:41.000000000 -0700
> > +++ linux-4.9.37/kernel/rcu/tree_plugin.h     2018-07-27 11:57:03.582134519 -0700
> > @@ -2077,7 +2077,7 @@
> >                tail = xchg(&rdp->nocb_follower_tail, rdp->nocb_gp_tail);
> >                *tail = rdp->nocb_gp_head;
> >                smp_mb__after_atomic(); /* Store *tail before wakeup. */
> > -             if (rdp != my_rdp && tail == &rdp->nocb_follower_head) {
> > +             if (rdp != my_rdp) {
> 
> This will burn a bit of extra CPU time, but it should be fine other than
> that.
> 
>                                                         Thanx, Paul
> 
> >                        /*
> >                         * List was empty, wake up the follower.
> >                         * Memory barriers supplied by atomic_long_add().
> > 
> > 
> > From: David Chen
> > Sent: Friday, July 20, 2018 5:12 PM
> > To: paulmck@linux.vnet.ibm.com
> > Cc: linux-kernel@vger.kernel.org
> > Subject: Re: RCU nocb list not reclaiming causing OOM
> >    
> > 
> > Hi Paul,
> > 
> > Ok, I'll try those patches.
> > 
> > Thanks,
> > David
> >   
> > From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Sent: Friday, July 20, 2018 4:32:12 PM
> > To: David Chen
> > Cc: linux-kernel@vger.kernel.org
> > Subject: Re: RCU nocb list not reclaiming causing OOM
> >   
> > 
> > On Fri, Jul 20, 2018 at 11:05:52PM +0000, David Chen wrote:
> > > Hi Paul,
> > > 
> > > We hit an RCU issue on 4.9.37 kernel. One of the nocb_follower list grows too
> > > large, and not getting reclaimed, causing the system to OOM.
> > > 
> > > Printing the culprit rcu_sched_data:
> > > 
> > >   nocb_q_count = {
> > >     counter = 32369635
> > >   },
> > >   nocb_follower_head = 0xffff88ae901c0a00,
> > >   nocb_follower_tail = 0xffff88af1538b8d8,
> > >   nocb_kthread = 0xffff88b06d290000,
> > > 
> > > As you can see here, the nocb_follower_head is not empty, so in theory, the
> > > nocb_kthread shouldn't go to sleep. However, if dump the stack of the kthread:
> > > 
> > > crash> bt 0xffff88b06d290000
> > > PID: 21     TASK: ffff88b06d290000  CPU: 3   COMMAND: "rcuos/1"
> > >  #0 [ffffafc9020b7dc0] __schedule at ffffffff8d8789dc
> > >  #1 [ffffafc9020b7e38] schedule at ffffffff8d878e76
> > >  #2 [ffffafc9020b7e50] rcu_nocb_kthread at ffffffff8d112337
> > >  #3 [ffffafc9020b7ec8] kthread at ffffffff8d0c6ce7
> > >  #4 [ffffafc9020b7f50] ret_from_fork at ffffffff8d87d755
> > > 
> > > And if we dis the address at ffffffff8d112337:
> > > 
> > > /usr/src/debug/kernel-4.9.37/linux-4.9.37-29.nutanix.07142017.el7.centos.x86_64/kernel/rcu/tree_plugin.h: 2106
> > > 0xffffffff8d11232d <rcu_nocb_kthread+381>:      test   %rax,%rax
> > > 0xffffffff8d112330 <rcu_nocb_kthread+384>:      jne    0xffffffff8d112355 <rcu_nocb_kthread+421>
> > > 0xffffffff8d112332 <rcu_nocb_kthread+386>:      callq  0xffffffff8d878e40 <schedule>
> > > 0xffffffff8d112337 <rcu_nocb_kthread+391>:      lea    -0x40(%rbp),%rsi
> > > 
> > > So the kthread is blocked at swait_event_interruptible in the nocb_follower_wait.
> > > This contradict with the fact that nocb_follower_head was not empty. So I
> > > wonder if this is caused by the lack of memory barrier in the place shown below.
> > > If the head is set to NULL after doing xchg, it will overwrite the head set
> > > by leader. This caused the kthread to sleep the next iteration, and the leader
> > > won't wake him up as the tail doesn't point to head.
> > > 
> > > Please tell me what do you think.
> > > 
> > > Thanks,
> > > David
> > > 
> > > diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h linux-4.9.37/kernel/rcu/tree_plugin.h
> > > --- linux-4.9.37.orig/kernel/rcu/tree_plugin.h        2017-07-12 06:42:41.000000000 -0700
> > > +++ linux-4.9.37/kernel/rcu/tree_plugin.h     2018-07-20 15:25:57.311206343 -0700
> > > @@ -2149,6 +2149,7 @@
> > >                BUG_ON(!list);
> > >                trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, "WokeNonEmpty");
> > >                WRITE_ONCE(rdp->nocb_follower_head, NULL);
> > > +             smp_mb();
> > >                tail = xchg(&rdp->nocb_follower_tail, &rdp->nocb_follower_head);
> > 
> > The xchg() operation implies full memory barriers both before and after,
> > so adding the smp_mb() before would have no effect.
> > 
> > But let me take a look at post-4.9 changes to this code...
> > 
> > I suggest trying out the following commit:
> > 
> > 6b5fc3a13318 ("rcu: Add memory barriers for NOCB leader wakeup")
> > 
> > If that one doesn't help, the following might be worth trying, but probably
> > a lot harder to backport:
> > 
> > 8be6e1b15c54 ("rcu: Use timer as backstop for NOCB deferred wakeups")
> > 
> > Please let me know how it goes!
> > 
> >                                                         Thanx, Paul
> > 
> > ------------------------------------------------------------------------
> > 
> > commit 6b5fc3a1331810db407c9e0e673dc1837afdc9d0
> > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Date:   Fri Apr 28 20:11:09 2017 -0700
> > 
> >     rcu: Add memory barriers for NOCB leader wakeup
> >     
> >     Wait/wakeup operations do not guarantee ordering on their own.  Instead,
> >     either locking or memory barriers are required.  This commit therefore
> >     adds memory barriers to wake_nocb_leader() and nocb_leader_wait().
> >     
> >     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >     Tested-by: Krister Johansen <kjlx@templeofstupid.com>
> >     Cc: <stable@vger.kernel.org> # 4.6.x
> > 
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index 0b1042545116..573fbe9640a0 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -1810,6 +1810,7 @@ static void wake_nocb_leader(struct rcu_data *rdp, bool force)
> >          if (READ_ONCE(rdp_leader->nocb_leader_sleep) || force) {
> >                  /* Prior smp_mb__after_atomic() orders against prior enqueue. */
> >                  WRITE_ONCE(rdp_leader->nocb_leader_sleep, false);
> > +               smp_mb(); /* ->nocb_leader_sleep before swake_up(). */
> >                  swake_up(&rdp_leader->nocb_wq);
> >          }
> >  }
> > @@ -2064,6 +2065,7 @@ static void nocb_leader_wait(struct rcu_data *my_rdp)
> >           * nocb_gp_head, where they await a grace period.
> >           */
> >          gotcbs = false;
> > +       smp_mb(); /* wakeup before ->nocb_head reads. */
> >          for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_follower) {
> >                  rdp->nocb_gp_head = READ_ONCE(rdp->nocb_head);
> >                  if (!rdp->nocb_gp_head)
> > 
> >      
> 
>     


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: RCU nocb list not reclaiming causing OOM
  2018-07-27 23:47           ` Paul E. McKenney
@ 2018-07-28  0:07             ` David Chen
  2018-07-28  1:47               ` Paul E. McKenney
  0 siblings, 1 reply; 12+ messages in thread
From: David Chen @ 2018-07-28  0:07 UTC (permalink / raw)
  To: paulmck; +Cc: linux-kernel

Hi Paul,

I wasn't talking about the xchg() though.

The smp_mb__after_atomic() is not for xchg(), it's for `*tail = rdp->nocb_gp_head;`
it's stated so in the comment. And I do think we need ordering between
`*tail = rdp->nocb_gp_head;` and wake up, because the waiter is checking on head not tail.

			swait_event_interruptible(rdp->nocb_wq,
						 READ_ONCE(rdp->nocb_follower_head));

So what I'm saying is that since we need to maintain ordering between `*tail = rdp->nocb_gp_head;`
and wake up, we need to change the smp_mb__after_atomic() to smp_mb(). Because
smp_mb__after_atomic() wouldn't guarantee 

So this is what I'm proposing.

diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h linux-4.9.37/kernel/rcu/tree_plugin.h
--- linux-4.9.37.orig/kernel/rcu/tree_plugin.h	2017-07-12 06:42:41.000000000 -0700
+++ linux-4.9.37/kernel/rcu/tree_plugin.h	2018-07-27 16:23:41.349044259 -0700
@@ -2076,7 +2076,7 @@
 		/* Append callbacks to follower's "done" list. */
 		tail = xchg(&rdp->nocb_follower_tail, rdp->nocb_gp_tail);
 		*tail = rdp->nocb_gp_head;
-		smp_mb__after_atomic(); /* Store *tail before wakeup. */
+		smp_mb(); /* Store *tail before wakeup. */
 		if (rdp != my_rdp && tail == &rdp->nocb_follower_head) {
 			/*
 			 * List was empty, wake up the follower.

Thanks,
David

From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Sent: Friday, July 27, 2018 4:47 PM
To: David Chen
Cc: linux-kernel@vger.kernel.org
Subject: Re: RCU nocb list not reclaiming causing OOM
  

On Fri, Jul 27, 2018 at 11:16:39PM +0000, David Chen wrote:
> Hi Paul,
> 
> Thanks for the advice.
> The bug is kind of hard to hit, so I can't say for certain yet.

Well, you can always remove the "tail == &rdp->nocb_follower_head" as an
extra belt-and-suspenders safety net.  I am not putting that in mainline,
but in the privacy of your own copy of the kernel, I don't see any really
serious problem with it.  (As long as you aren't going for absolute maximum
performance, but even then there are other more important tuning actions
and code changes you could make.)

> Though after another look at the code, I found out the `smp_mb__after_atomic();`
> seems to be only a compiler barrier on x86.

Yes, and that is because the locked xchg instruction used on x86 to
implement xchg() already provides full ordering.  ;-)

>                tail = xchg(&rdp->nocb_follower_tail, rdp->nocb_gp_tail);
>                *tail = rdp->nocb_gp_head;
>                smp_mb__after_atomic(); /* Store *tail before wakeup. */
>                if (rdp != my_rdp && tail == &rdp->nocb_follower_head) {
>                        swake_up(&rdp->nocb_wq);
> 
> But that wouldn't be enough right? Because from 6b5fc3a13318, it stated that
> wakeup operation don't guarantee ordering. And when the follower wakes up, it checks
> for nocb_follower_head, which is assigned by `*tail = rdp->nocb_gp_head;` which doesn't
> have LOCK prefix. So it's possible for follower to wake up and see the list is empty and go
> back to sleep.

Again, xchg() is defined to provide full ordering against all operations
before and after it.  Each architecture is required to do whatever is
necessary to implement that full ordering, and x86 need only provide
its "lock xchg" instruction.

The smp_mb__after_atomic() has effect only after atomic read-modify-write
operations that do not return a value, for example, atomic_inc().
If you use it after a value-returning atomic read-modify-write operation
like xchg(), all you do is needlessly slow things down on platforms
that provide non-empty smp_mb__after_atomic() definitions.  So again,
smp_mb__after_atomic() after xchg() is pointless.

Please take a look at Documentation/core-api/atomic_ops.rst in the
Linux-kernel source tree for more information.  Or get a v4.17 kernel
source tree and check this using the memory model (tools/memory-model
in that version).

                                                        Thanx, Paul

> Thanks,
> David
> 
> From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Sent: Friday, July 27, 2018 3:31 PM
> To: David Chen
> Cc: linux-kernel@vger.kernel.org
> Subject: Re: RCU nocb list not reclaiming causing OOM
>   
> 
> On Fri, Jul 27, 2018 at 07:07:46PM +0000, David Chen wrote:
> > Hi Paul,
> > 
> > I'd like to opinion again on this subject.
> > 
> > So we are going to backport this patch:
> > 6b5fc3a13318 ("rcu: Add memory barriers for NOCB leader wakeup")
> 
> Does this one solve the problem, or are you still seeing hangs?
> If you are no longer seeing hangs, my advice is "hands off keyboard",
> though some would no doubt point out that I should follow that advice
> more myself.  ;-)
> 
> > But the other one:
> > 8be6e1b15c54 ("rcu: Use timer as backstop for NOCB deferred wakeups")
> > It doesn't apply cleanly, and I'm not too comfortable porting it myself.
> 
> Yeah, that one is a bit on the non-trivial side, no two ways about it.
> 
> > So I'm wondering if I use the following change to always wake up follower thread
> > regardless if the list was empty or not, just to be on the safe side. Do you think
> > this change is reasonable? Do you see any problem it might cause?
> > 
> > Thanks,
> > David
> > 
> > diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h linux-4.9.37/kernel/rcu/tree_plugin.h
> > --- linux-4.9.37.orig/kernel/rcu/tree_plugin.h        2017-07-12 06:42:41.000000000 -0700
> > +++ linux-4.9.37/kernel/rcu/tree_plugin.h     2018-07-27 11:57:03.582134519 -0700
> > @@ -2077,7 +2077,7 @@
> >                tail = xchg(&rdp->nocb_follower_tail, rdp->nocb_gp_tail);
> >                *tail = rdp->nocb_gp_head;
> >                smp_mb__after_atomic(); /* Store *tail before wakeup. */
> > -             if (rdp != my_rdp && tail == &rdp->nocb_follower_head) {
> > +             if (rdp != my_rdp) {
> 
> This will burn a bit of extra CPU time, but it should be fine other than
> that.
> 
>                                                         Thanx, Paul
> 
> >                        /*
> >                         * List was empty, wake up the follower.
> >                         * Memory barriers supplied by atomic_long_add().
> > 
> > 
> > From: David Chen
> > Sent: Friday, July 20, 2018 5:12 PM
> > To: paulmck@linux.vnet.ibm.com
> > Cc: linux-kernel@vger.kernel.org
> > Subject: Re: RCU nocb list not reclaiming causing OOM
> >    
> > 
> > Hi Paul,
> > 
> > Ok, I'll try those patches.
> > 
> > Thanks,
> > David
> >   
> > From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Sent: Friday, July 20, 2018 4:32:12 PM
> > To: David Chen
> > Cc: linux-kernel@vger.kernel.org
> > Subject: Re: RCU nocb list not reclaiming causing OOM
> >   
> > 
> > On Fri, Jul 20, 2018 at 11:05:52PM +0000, David Chen wrote:
> > > Hi Paul,
> > > 
> > > We hit an RCU issue on 4.9.37 kernel. One of the nocb_follower list grows too
> > > large, and not getting reclaimed, causing the system to OOM.
> > > 
> > > Printing the culprit rcu_sched_data:
> > > 
> > >   nocb_q_count = {
> > >     counter = 32369635
> > >   },
> > >   nocb_follower_head = 0xffff88ae901c0a00,
> > >   nocb_follower_tail = 0xffff88af1538b8d8,
> > >   nocb_kthread = 0xffff88b06d290000,
> > > 
> > > As you can see here, the nocb_follower_head is not empty, so in theory, the
> > > nocb_kthread shouldn't go to sleep. However, if dump the stack of the kthread:
> > > 
> > > crash> bt 0xffff88b06d290000
> > > PID: 21     TASK: ffff88b06d290000  CPU: 3   COMMAND: "rcuos/1"
> > >  #0 [ffffafc9020b7dc0] __schedule at ffffffff8d8789dc
> > >  #1 [ffffafc9020b7e38] schedule at ffffffff8d878e76
> > >  #2 [ffffafc9020b7e50] rcu_nocb_kthread at ffffffff8d112337
> > >  #3 [ffffafc9020b7ec8] kthread at ffffffff8d0c6ce7
> > >  #4 [ffffafc9020b7f50] ret_from_fork at ffffffff8d87d755
> > > 
> > > And if we dis the address at ffffffff8d112337:
> > > 
> > > /usr/src/debug/kernel-4.9.37/linux-4.9.37-29.nutanix.07142017.el7.centos.x86_64/kernel/rcu/tree_plugin.h: 2106
> > > 0xffffffff8d11232d <rcu_nocb_kthread+381>:      test   %rax,%rax
> > > 0xffffffff8d112330 <rcu_nocb_kthread+384>:      jne    0xffffffff8d112355 <rcu_nocb_kthread+421>
> > > 0xffffffff8d112332 <rcu_nocb_kthread+386>:      callq  0xffffffff8d878e40 <schedule>
> > > 0xffffffff8d112337 <rcu_nocb_kthread+391>:      lea    -0x40(%rbp),%rsi
> > > 
> > > So the kthread is blocked at swait_event_interruptible in the nocb_follower_wait.
> > > This contradict with the fact that nocb_follower_head was not empty. So I
> > > wonder if this is caused by the lack of memory barrier in the place shown below.
> > > If the head is set to NULL after doing xchg, it will overwrite the head set
> > > by leader. This caused the kthread to sleep the next iteration, and the leader
> > > won't wake him up as the tail doesn't point to head.
> > > 
> > > Please tell me what do you think.
> > > 
> > > Thanks,
> > > David
> > > 
> > > diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h linux-4.9.37/kernel/rcu/tree_plugin.h
> > > --- linux-4.9.37.orig/kernel/rcu/tree_plugin.h        2017-07-12 06:42:41.000000000 -0700
> > > +++ linux-4.9.37/kernel/rcu/tree_plugin.h     2018-07-20 15:25:57.311206343 -0700
> > > @@ -2149,6 +2149,7 @@
> > >                BUG_ON(!list);
> > >                trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, "WokeNonEmpty");
> > >                WRITE_ONCE(rdp->nocb_follower_head, NULL);
> > > +             smp_mb();
> > >                tail = xchg(&rdp->nocb_follower_tail, &rdp->nocb_follower_head);
> > 
> > The xchg() operation implies full memory barriers both before and after,
> > so adding the smp_mb() before would have no effect.
> > 
> > But let me take a look at post-4.9 changes to this code...
> > 
> > I suggest trying out the following commit:
> > 
> > 6b5fc3a13318 ("rcu: Add memory barriers for NOCB leader wakeup")
> > 
> > If that one doesn't help, the following might be worth trying, but probably
> > a lot harder to backport:
> > 
> > 8be6e1b15c54 ("rcu: Use timer as backstop for NOCB deferred wakeups")
> > 
> > Please let me know how it goes!
> > 
> >                                                         Thanx, Paul
> > 
> > ------------------------------------------------------------------------
> > 
> > commit 6b5fc3a1331810db407c9e0e673dc1837afdc9d0
> > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Date:   Fri Apr 28 20:11:09 2017 -0700
> > 
> >     rcu: Add memory barriers for NOCB leader wakeup
> >     
> >     Wait/wakeup operations do not guarantee ordering on their own.  Instead,
> >     either locking or memory barriers are required.  This commit therefore
> >     adds memory barriers to wake_nocb_leader() and nocb_leader_wait().
> >     
> >     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >     Tested-by: Krister Johansen <kjlx@templeofstupid.com>
> >     Cc: <stable@vger.kernel.org> # 4.6.x
> > 
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index 0b1042545116..573fbe9640a0 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -1810,6 +1810,7 @@ static void wake_nocb_leader(struct rcu_data *rdp, bool force)
> >          if (READ_ONCE(rdp_leader->nocb_leader_sleep) || force) {
> >                  /* Prior smp_mb__after_atomic() orders against prior enqueue. */
> >                  WRITE_ONCE(rdp_leader->nocb_leader_sleep, false);
> > +               smp_mb(); /* ->nocb_leader_sleep before swake_up(). */
> >                  swake_up(&rdp_leader->nocb_wq);
> >          }
> >  }
> > @@ -2064,6 +2065,7 @@ static void nocb_leader_wait(struct rcu_data *my_rdp)
> >           * nocb_gp_head, where they await a grace period.
> >           */
> >          gotcbs = false;
> > +       smp_mb(); /* wakeup before ->nocb_head reads. */
> >          for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_follower) {
> >                  rdp->nocb_gp_head = READ_ONCE(rdp->nocb_head);
> >                  if (!rdp->nocb_gp_head)
> > 
> >      
> 
>     

    

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: RCU nocb list not reclaiming causing OOM
  2018-07-28  0:07             ` David Chen
@ 2018-07-28  1:47               ` Paul E. McKenney
  2018-07-28  8:29                 ` Paul E. McKenney
  0 siblings, 1 reply; 12+ messages in thread
From: Paul E. McKenney @ 2018-07-28  1:47 UTC (permalink / raw)
  To: David Chen; +Cc: linux-kernel

On Sat, Jul 28, 2018 at 12:07:19AM +0000, David Chen wrote:
> Hi Paul,
> 
> I wasn't talking about the xchg() though.
> 
> The smp_mb__after_atomic() is not for xchg(), it's for `*tail = rdp->nocb_gp_head;`
> it's stated so in the comment. And I do think we need ordering between
> `*tail = rdp->nocb_gp_head;` and wake up, because the waiter is checking on head not tail.
> 
> 			swait_event_interruptible(rdp->nocb_wq,
> 						 READ_ONCE(rdp->nocb_follower_head));
> 
> So what I'm saying is that since we need to maintain ordering between `*tail = rdp->nocb_gp_head;`
> and wake up, we need to change the smp_mb__after_atomic() to smp_mb(). Because
> smp_mb__after_atomic() wouldn't guarantee 
> 
> So this is what I'm proposing.

Good eyes!

Hmmm...  What do I do about this in mainline?  Ah, I introduced a
->nocb_lock in mainline to prevent this from happening.  In that same
commit that you didn't want to use because it is hard to backport.  ;-)

So yes, but there might well be other misorderings fixed by the
hard-to-backport commit that your change below does not cover.  Still I
do agree that you need full ordering at that point.

							Thanx, Paul

> diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h linux-4.9.37/kernel/rcu/tree_plugin.h
> --- linux-4.9.37.orig/kernel/rcu/tree_plugin.h	2017-07-12 06:42:41.000000000 -0700
> +++ linux-4.9.37/kernel/rcu/tree_plugin.h	2018-07-27 16:23:41.349044259 -0700
> @@ -2076,7 +2076,7 @@
>  		/* Append callbacks to follower's "done" list. */
>  		tail = xchg(&rdp->nocb_follower_tail, rdp->nocb_gp_tail);
>  		*tail = rdp->nocb_gp_head;
> -		smp_mb__after_atomic(); /* Store *tail before wakeup. */
> +		smp_mb(); /* Store *tail before wakeup. */
>  		if (rdp != my_rdp && tail == &rdp->nocb_follower_head) {
>  			/*
>  			 * List was empty, wake up the follower.
> 
> Thanks,
> David
> 
> From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Sent: Friday, July 27, 2018 4:47 PM
> To: David Chen
> Cc: linux-kernel@vger.kernel.org
> Subject: Re: RCU nocb list not reclaiming causing OOM
>   
> 
> On Fri, Jul 27, 2018 at 11:16:39PM +0000, David Chen wrote:
> > Hi Paul,
> > 
> > Thanks for the advice.
> > The bug is kind of hard to hit, so I can't say for certain yet.
> 
> Well, you can always remove the "tail == &rdp->nocb_follower_head" as an
> extra belt-and-suspenders safety net.  I am not putting that in mainline,
> but in the privacy of your own copy of the kernel, I don't see any really
> serious problem with it.  (As long as you aren't going for absolute maximum
> performance, but even then there are other more important tuning actions
> and code changes you could make.)
> 
> > Though after another look at the code, I found out the `smp_mb__after_atomic();`
> > seems to be only a compiler barrier on x86.
> 
> Yes, and that is because the locked xchg instruction used on x86 to
> implement xchg() already provides full ordering.  ;-)
> 
> >                tail = xchg(&rdp->nocb_follower_tail, rdp->nocb_gp_tail);
> >                *tail = rdp->nocb_gp_head;
> >                smp_mb__after_atomic(); /* Store *tail before wakeup. */
> >                if (rdp != my_rdp && tail == &rdp->nocb_follower_head) {
> >                        swake_up(&rdp->nocb_wq);
> > 
> > But that wouldn't be enough right? Because from 6b5fc3a13318, it stated that
> > wakeup operation don't guarantee ordering. And when the follower wakes up, it checks
> > for nocb_follower_head, which is assigned by `*tail = rdp->nocb_gp_head;` which doesn't
> > have LOCK prefix. So it's possible for follower to wake up and see the list is empty and go
> > back to sleep.
> 
> Again, xchg() is defined to provide full ordering against all operations
> before and after it.  Each architecture is required to do whatever is
> necessary to implement that full ordering, and x86 need only provide
> its "lock xchg" instruction.
> 
> The smp_mb__after_atomic() has effect only after atomic read-modify-write
> operations that do not return a value, for example, atomic_inc().
> If you use it after a value-returning atomic read-modify-write operation
> like xchg(), all you do is needlessly slow things down on platforms
> that provide non-empty smp_mb__after_atomic() definitions.  So again,
> smp_mb__after_atomic() after xchg() is pointless.
> 
> Please take a look at Documentation/core-api/atomic_ops.rst in the
> Linux-kernel source tree for more information.  Or get a v4.17 kernel
> source tree and check this using the memory model (tools/memory-model
> in that version).
> 
>                                                         Thanx, Paul
> 
> > Thanks,
> > David
> > 
> > From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Sent: Friday, July 27, 2018 3:31 PM
> > To: David Chen
> > Cc: linux-kernel@vger.kernel.org
> > Subject: Re: RCU nocb list not reclaiming causing OOM
> >   
> > 
> > On Fri, Jul 27, 2018 at 07:07:46PM +0000, David Chen wrote:
> > > Hi Paul,
> > > 
> > > I'd like to opinion again on this subject.
> > > 
> > > So we are going to backport this patch:
> > > 6b5fc3a13318 ("rcu: Add memory barriers for NOCB leader wakeup")
> > 
> > Does this one solve the problem, or are you still seeing hangs?
> > If you are no longer seeing hangs, my advice is "hands off keyboard",
> > though some would no doubt point out that I should follow that advice
> > more myself.  ;-)
> > 
> > > But the other one:
> > > 8be6e1b15c54 ("rcu: Use timer as backstop for NOCB deferred wakeups")
> > > It doesn't apply cleanly, and I'm not too comfortable porting it myself.
> > 
> > Yeah, that one is a bit on the non-trivial side, no two ways about it.
> > 
> > > So I'm wondering if I use the following change to always wake up follower thread
> > > regardless if the list was empty or not, just to be on the safe side. Do you think
> > > this change is reasonable? Do you see any problem it might cause?
> > > 
> > > Thanks,
> > > David
> > > 
> > > diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h linux-4.9.37/kernel/rcu/tree_plugin.h
> > > --- linux-4.9.37.orig/kernel/rcu/tree_plugin.h        2017-07-12 06:42:41.000000000 -0700
> > > +++ linux-4.9.37/kernel/rcu/tree_plugin.h     2018-07-27 11:57:03.582134519 -0700
> > > @@ -2077,7 +2077,7 @@
> > >                tail = xchg(&rdp->nocb_follower_tail, rdp->nocb_gp_tail);
> > >                *tail = rdp->nocb_gp_head;
> > >                smp_mb__after_atomic(); /* Store *tail before wakeup. */
> > > -             if (rdp != my_rdp && tail == &rdp->nocb_follower_head) {
> > > +             if (rdp != my_rdp) {
> > 
> > This will burn a bit of extra CPU time, but it should be fine other than
> > that.
> > 
> >                                                         Thanx, Paul
> > 
> > >                        /*
> > >                         * List was empty, wake up the follower.
> > >                         * Memory barriers supplied by atomic_long_add().
> > > 
> > > 
> > > From: David Chen
> > > Sent: Friday, July 20, 2018 5:12 PM
> > > To: paulmck@linux.vnet.ibm.com
> > > Cc: linux-kernel@vger.kernel.org
> > > Subject: Re: RCU nocb list not reclaiming causing OOM
> > >    
> > > 
> > > Hi Paul,
> > > 
> > > Ok, I'll try those patches.
> > > 
> > > Thanks,
> > > David
> > >   
> > > From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > Sent: Friday, July 20, 2018 4:32:12 PM
> > > To: David Chen
> > > Cc: linux-kernel@vger.kernel.org
> > > Subject: Re: RCU nocb list not reclaiming causing OOM
> > >   
> > > 
> > > On Fri, Jul 20, 2018 at 11:05:52PM +0000, David Chen wrote:
> > > > Hi Paul,
> > > > 
> > > > We hit an RCU issue on 4.9.37 kernel. One of the nocb_follower list grows too
> > > > large, and not getting reclaimed, causing the system to OOM.
> > > > 
> > > > Printing the culprit rcu_sched_data:
> > > > 
> > > >   nocb_q_count = {
> > > >     counter = 32369635
> > > >   },
> > > >   nocb_follower_head = 0xffff88ae901c0a00,
> > > >   nocb_follower_tail = 0xffff88af1538b8d8,
> > > >   nocb_kthread = 0xffff88b06d290000,
> > > > 
> > > > As you can see here, the nocb_follower_head is not empty, so in theory, the
> > > > nocb_kthread shouldn't go to sleep. However, if dump the stack of the kthread:
> > > > 
> > > > crash> bt 0xffff88b06d290000
> > > > PID: 21     TASK: ffff88b06d290000  CPU: 3   COMMAND: "rcuos/1"
> > > >  #0 [ffffafc9020b7dc0] __schedule at ffffffff8d8789dc
> > > >  #1 [ffffafc9020b7e38] schedule at ffffffff8d878e76
> > > >  #2 [ffffafc9020b7e50] rcu_nocb_kthread at ffffffff8d112337
> > > >  #3 [ffffafc9020b7ec8] kthread at ffffffff8d0c6ce7
> > > >  #4 [ffffafc9020b7f50] ret_from_fork at ffffffff8d87d755
> > > > 
> > > > And if we dis the address at ffffffff8d112337:
> > > > 
> > > > /usr/src/debug/kernel-4.9.37/linux-4.9.37-29.nutanix.07142017.el7.centos.x86_64/kernel/rcu/tree_plugin.h: 2106
> > > > 0xffffffff8d11232d <rcu_nocb_kthread+381>:      test   %rax,%rax
> > > > 0xffffffff8d112330 <rcu_nocb_kthread+384>:      jne    0xffffffff8d112355 <rcu_nocb_kthread+421>
> > > > 0xffffffff8d112332 <rcu_nocb_kthread+386>:      callq  0xffffffff8d878e40 <schedule>
> > > > 0xffffffff8d112337 <rcu_nocb_kthread+391>:      lea    -0x40(%rbp),%rsi
> > > > 
> > > > So the kthread is blocked at swait_event_interruptible in the nocb_follower_wait.
> > > > This contradict with the fact that nocb_follower_head was not empty. So I
> > > > wonder if this is caused by the lack of memory barrier in the place shown below.
> > > > If the head is set to NULL after doing xchg, it will overwrite the head set
> > > > by leader. This caused the kthread to sleep the next iteration, and the leader
> > > > won't wake him up as the tail doesn't point to head.
> > > > 
> > > > Please tell me what do you think.
> > > > 
> > > > Thanks,
> > > > David
> > > > 
> > > > diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h linux-4.9.37/kernel/rcu/tree_plugin.h
> > > > --- linux-4.9.37.orig/kernel/rcu/tree_plugin.h        2017-07-12 06:42:41.000000000 -0700
> > > > +++ linux-4.9.37/kernel/rcu/tree_plugin.h     2018-07-20 15:25:57.311206343 -0700
> > > > @@ -2149,6 +2149,7 @@
> > > >                BUG_ON(!list);
> > > >                trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, "WokeNonEmpty");
> > > >                WRITE_ONCE(rdp->nocb_follower_head, NULL);
> > > > +             smp_mb();
> > > >                tail = xchg(&rdp->nocb_follower_tail, &rdp->nocb_follower_head);
> > > 
> > > The xchg() operation implies full memory barriers both before and after,
> > > so adding the smp_mb() before would have no effect.
> > > 
> > > But let me take a look at post-4.9 changes to this code...
> > > 
> > > I suggest trying out the following commit:
> > > 
> > > 6b5fc3a13318 ("rcu: Add memory barriers for NOCB leader wakeup")
> > > 
> > > If that one doesn't help, the following might be worth trying, but probably
> > > a lot harder to backport:
> > > 
> > > 8be6e1b15c54 ("rcu: Use timer as backstop for NOCB deferred wakeups")
> > > 
> > > Please let me know how it goes!
> > > 
> > >                                                         Thanx, Paul
> > > 
> > > ------------------------------------------------------------------------
> > > 
> > > commit 6b5fc3a1331810db407c9e0e673dc1837afdc9d0
> > > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > Date:   Fri Apr 28 20:11:09 2017 -0700
> > > 
> > >     rcu: Add memory barriers for NOCB leader wakeup
> > >     
> > >     Wait/wakeup operations do not guarantee ordering on their own.  Instead,
> > >     either locking or memory barriers are required.  This commit therefore
> > >     adds memory barriers to wake_nocb_leader() and nocb_leader_wait().
> > >     
> > >     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > >     Tested-by: Krister Johansen <kjlx@templeofstupid.com>
> > >     Cc: <stable@vger.kernel.org> # 4.6.x
> > > 
> > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > index 0b1042545116..573fbe9640a0 100644
> > > --- a/kernel/rcu/tree_plugin.h
> > > +++ b/kernel/rcu/tree_plugin.h
> > > @@ -1810,6 +1810,7 @@ static void wake_nocb_leader(struct rcu_data *rdp, bool force)
> > >          if (READ_ONCE(rdp_leader->nocb_leader_sleep) || force) {
> > >                  /* Prior smp_mb__after_atomic() orders against prior enqueue. */
> > >                  WRITE_ONCE(rdp_leader->nocb_leader_sleep, false);
> > > +               smp_mb(); /* ->nocb_leader_sleep before swake_up(). */
> > >                  swake_up(&rdp_leader->nocb_wq);
> > >          }
> > >  }
> > > @@ -2064,6 +2065,7 @@ static void nocb_leader_wait(struct rcu_data *my_rdp)
> > >           * nocb_gp_head, where they await a grace period.
> > >           */
> > >          gotcbs = false;
> > > +       smp_mb(); /* wakeup before ->nocb_head reads. */
> > >          for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_follower) {
> > >                  rdp->nocb_gp_head = READ_ONCE(rdp->nocb_head);
> > >                  if (!rdp->nocb_gp_head)
> > > 
> > >      
> > 
> >     
> 
>     


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: RCU nocb list not reclaiming causing OOM
  2018-07-28  1:47               ` Paul E. McKenney
@ 2018-07-28  8:29                 ` Paul E. McKenney
  2018-07-30 19:08                   ` David Chen
  0 siblings, 1 reply; 12+ messages in thread
From: Paul E. McKenney @ 2018-07-28  8:29 UTC (permalink / raw)
  To: David Chen; +Cc: linux-kernel

On Fri, Jul 27, 2018 at 06:47:00PM -0700, Paul E. McKenney wrote:
> On Sat, Jul 28, 2018 at 12:07:19AM +0000, David Chen wrote:
> > Hi Paul,
> > 
> > I wasn't talking about the xchg() though.
> > 
> > The smp_mb__after_atomic() is not for xchg(), it's for `*tail = rdp->nocb_gp_head;`
> > it's stated so in the comment. And I do think we need ordering between
> > `*tail = rdp->nocb_gp_head;` and wake up, because the waiter is checking on head not tail.
> > 
> > 			swait_event_interruptible(rdp->nocb_wq,
> > 						 READ_ONCE(rdp->nocb_follower_head));
> > 
> > So what I'm saying is that since we need to maintain ordering between `*tail = rdp->nocb_gp_head;`
> > and wake up, we need to change the smp_mb__after_atomic() to smp_mb(). Because
> > smp_mb__after_atomic() wouldn't guarantee 
> > 
> > So this is what I'm proposing.
> 
> Good eyes!
> 
> Hmmm...  What do I do about this in mainline?  Ah, I introduced a
> ->nocb_lock in mainline to prevent this from happening.  In that same
> commit that you didn't want to use because it is hard to backport.  ;-)
> 
> So yes, but there might well be other misorderings fixed by the
> hard-to-backport commit that your change below does not cover.  Still I
> do agree that you need full ordering at that point.

Another approach would be to backport only the ->nocb_lock portions
of that patch.  This would still potentially leave failure-to-wake
(as opposed to misordering-on-wake) issues, but it should cover all
of the misordering-on-wake issues.

							Thanx, Paul

> > diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h linux-4.9.37/kernel/rcu/tree_plugin.h
> > --- linux-4.9.37.orig/kernel/rcu/tree_plugin.h	2017-07-12 06:42:41.000000000 -0700
> > +++ linux-4.9.37/kernel/rcu/tree_plugin.h	2018-07-27 16:23:41.349044259 -0700
> > @@ -2076,7 +2076,7 @@
> >  		/* Append callbacks to follower's "done" list. */
> >  		tail = xchg(&rdp->nocb_follower_tail, rdp->nocb_gp_tail);
> >  		*tail = rdp->nocb_gp_head;
> > -		smp_mb__after_atomic(); /* Store *tail before wakeup. */
> > +		smp_mb(); /* Store *tail before wakeup. */
> >  		if (rdp != my_rdp && tail == &rdp->nocb_follower_head) {
> >  			/*
> >  			 * List was empty, wake up the follower.
> > 
> > Thanks,
> > David
> > 
> > From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Sent: Friday, July 27, 2018 4:47 PM
> > To: David Chen
> > Cc: linux-kernel@vger.kernel.org
> > Subject: Re: RCU nocb list not reclaiming causing OOM
> >   
> > 
> > On Fri, Jul 27, 2018 at 11:16:39PM +0000, David Chen wrote:
> > > Hi Paul,
> > > 
> > > Thanks for the advice.
> > > The bug is kind of hard to hit, so I can't say for certain yet.
> > 
> > Well, you can always remove the "tail == &rdp->nocb_follower_head" as an
> > extra belt-and-suspenders safety net.  I am not putting that in mainline,
> > but in the privacy of your own copy of the kernel, I don't see any really
> > serious problem with it.  (As long as you aren't going for absolute maximum
> > performance, but even then there are other more important tuning actions
> > and code changes you could make.)
> > 
> > > Though after another look at the code, I found out the `smp_mb__after_atomic();`
> > > seems to be only a compiler barrier on x86.
> > 
> > Yes, and that is because the locked xchg instruction used on x86 to
> > implement xchg() already provides full ordering.  ;-)
> > 
> > >                tail = xchg(&rdp->nocb_follower_tail, rdp->nocb_gp_tail);
> > >                *tail = rdp->nocb_gp_head;
> > >                smp_mb__after_atomic(); /* Store *tail before wakeup. */
> > >                if (rdp != my_rdp && tail == &rdp->nocb_follower_head) {
> > >                        swake_up(&rdp->nocb_wq);
> > > 
> > > But that wouldn't be enough right? Because from 6b5fc3a13318, it stated that
> > > wakeup operation don't guarantee ordering. And when the follower wakes up, it checks
> > > for nocb_follower_head, which is assigned by `*tail = rdp->nocb_gp_head;` which doesn't
> > > have LOCK prefix. So it's possible for follower to wake up and see the list is empty and go
> > > back to sleep.
> > 
> > Again, xchg() is defined to provide full ordering against all operations
> > before and after it.  Each architecture is required to do whatever is
> > necessary to implement that full ordering, and x86 need only provide
> > its "lock xchg" instruction.
> > 
> > The smp_mb__after_atomic() has effect only after atomic read-modify-write
> > operations that do not return a value, for example, atomic_inc().
> > If you use it after a value-returning atomic read-modify-write operation
> > like xchg(), all you do is needlessly slow things down on platforms
> > that provide non-empty smp_mb__after_atomic() definitions.  So again,
> > smp_mb__after_atomic() after xchg() is pointless.
> > 
> > Please take a look at Documentation/core-api/atomic_ops.rst in the
> > Linux-kernel source tree for more information.  Or get a v4.17 kernel
> > source tree and check this using the memory model (tools/memory-model
> > in that version).
> > 
> >                                                         Thanx, Paul
> > 
> > > Thanks,
> > > David
> > > 
> > > From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > Sent: Friday, July 27, 2018 3:31 PM
> > > To: David Chen
> > > Cc: linux-kernel@vger.kernel.org
> > > Subject: Re: RCU nocb list not reclaiming causing OOM
> > >   
> > > 
> > > On Fri, Jul 27, 2018 at 07:07:46PM +0000, David Chen wrote:
> > > > Hi Paul,
> > > > 
> > > > I'd like to opinion again on this subject.
> > > > 
> > > > So we are going to backport this patch:
> > > > 6b5fc3a13318 ("rcu: Add memory barriers for NOCB leader wakeup")
> > > 
> > > Does this one solve the problem, or are you still seeing hangs?
> > > If you are no longer seeing hangs, my advice is "hands off keyboard",
> > > though some would no doubt point out that I should follow that advice
> > > more myself.  ;-)
> > > 
> > > > But the other one:
> > > > 8be6e1b15c54 ("rcu: Use timer as backstop for NOCB deferred wakeups")
> > > > It doesn't apply cleanly, and I'm not too comfortable porting it myself.
> > > 
> > > Yeah, that one is a bit on the non-trivial side, no two ways about it.
> > > 
> > > > So I'm wondering if I use the following change to always wake up follower thread
> > > > regardless if the list was empty or not, just to be on the safe side. Do you think
> > > > this change is reasonable? Do you see any problem it might cause?
> > > > 
> > > > Thanks,
> > > > David
> > > > 
> > > > diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h linux-4.9.37/kernel/rcu/tree_plugin.h
> > > > --- linux-4.9.37.orig/kernel/rcu/tree_plugin.h        2017-07-12 06:42:41.000000000 -0700
> > > > +++ linux-4.9.37/kernel/rcu/tree_plugin.h     2018-07-27 11:57:03.582134519 -0700
> > > > @@ -2077,7 +2077,7 @@
> > > >                tail = xchg(&rdp->nocb_follower_tail, rdp->nocb_gp_tail);
> > > >                *tail = rdp->nocb_gp_head;
> > > >                smp_mb__after_atomic(); /* Store *tail before wakeup. */
> > > > -             if (rdp != my_rdp && tail == &rdp->nocb_follower_head) {
> > > > +             if (rdp != my_rdp) {
> > > 
> > > This will burn a bit of extra CPU time, but it should be fine other than
> > > that.
> > > 
> > >                                                         Thanx, Paul
> > > 
> > > >                        /*
> > > >                         * List was empty, wake up the follower.
> > > >                         * Memory barriers supplied by atomic_long_add().
> > > > 
> > > > 
> > > > From: David Chen
> > > > Sent: Friday, July 20, 2018 5:12 PM
> > > > To: paulmck@linux.vnet.ibm.com
> > > > Cc: linux-kernel@vger.kernel.org
> > > > Subject: Re: RCU nocb list not reclaiming causing OOM
> > > >    
> > > > 
> > > > Hi Paul,
> > > > 
> > > > Ok, I'll try those patches.
> > > > 
> > > > Thanks,
> > > > David
> > > >   
> > > > From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > > Sent: Friday, July 20, 2018 4:32:12 PM
> > > > To: David Chen
> > > > Cc: linux-kernel@vger.kernel.org
> > > > Subject: Re: RCU nocb list not reclaiming causing OOM
> > > >   
> > > > 
> > > > On Fri, Jul 20, 2018 at 11:05:52PM +0000, David Chen wrote:
> > > > > Hi Paul,
> > > > > 
> > > > > We hit an RCU issue on 4.9.37 kernel. One of the nocb_follower list grows too
> > > > > large, and not getting reclaimed, causing the system to OOM.
> > > > > 
> > > > > Printing the culprit rcu_sched_data:
> > > > > 
> > > > >   nocb_q_count = {
> > > > >     counter = 32369635
> > > > >   },
> > > > >   nocb_follower_head = 0xffff88ae901c0a00,
> > > > >   nocb_follower_tail = 0xffff88af1538b8d8,
> > > > >   nocb_kthread = 0xffff88b06d290000,
> > > > > 
> > > > > As you can see here, the nocb_follower_head is not empty, so in theory, the
> > > > > nocb_kthread shouldn't go to sleep. However, if dump the stack of the kthread:
> > > > > 
> > > > > crash> bt 0xffff88b06d290000
> > > > > PID: 21     TASK: ffff88b06d290000  CPU: 3   COMMAND: "rcuos/1"
> > > > >  #0 [ffffafc9020b7dc0] __schedule at ffffffff8d8789dc
> > > > >  #1 [ffffafc9020b7e38] schedule at ffffffff8d878e76
> > > > >  #2 [ffffafc9020b7e50] rcu_nocb_kthread at ffffffff8d112337
> > > > >  #3 [ffffafc9020b7ec8] kthread at ffffffff8d0c6ce7
> > > > >  #4 [ffffafc9020b7f50] ret_from_fork at ffffffff8d87d755
> > > > > 
> > > > > And if we dis the address at ffffffff8d112337:
> > > > > 
> > > > > /usr/src/debug/kernel-4.9.37/linux-4.9.37-29.nutanix.07142017.el7.centos.x86_64/kernel/rcu/tree_plugin.h: 2106
> > > > > 0xffffffff8d11232d <rcu_nocb_kthread+381>:      test   %rax,%rax
> > > > > 0xffffffff8d112330 <rcu_nocb_kthread+384>:      jne    0xffffffff8d112355 <rcu_nocb_kthread+421>
> > > > > 0xffffffff8d112332 <rcu_nocb_kthread+386>:      callq  0xffffffff8d878e40 <schedule>
> > > > > 0xffffffff8d112337 <rcu_nocb_kthread+391>:      lea    -0x40(%rbp),%rsi
> > > > > 
> > > > > So the kthread is blocked at swait_event_interruptible in the nocb_follower_wait.
> > > > > This contradict with the fact that nocb_follower_head was not empty. So I
> > > > > wonder if this is caused by the lack of memory barrier in the place shown below.
> > > > > If the head is set to NULL after doing xchg, it will overwrite the head set
> > > > > by leader. This caused the kthread to sleep the next iteration, and the leader
> > > > > won't wake him up as the tail doesn't point to head.
> > > > > 
> > > > > Please tell me what do you think.
> > > > > 
> > > > > Thanks,
> > > > > David
> > > > > 
> > > > > diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h linux-4.9.37/kernel/rcu/tree_plugin.h
> > > > > --- linux-4.9.37.orig/kernel/rcu/tree_plugin.h        2017-07-12 06:42:41.000000000 -0700
> > > > > +++ linux-4.9.37/kernel/rcu/tree_plugin.h     2018-07-20 15:25:57.311206343 -0700
> > > > > @@ -2149,6 +2149,7 @@
> > > > >                BUG_ON(!list);
> > > > >                trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, "WokeNonEmpty");
> > > > >                WRITE_ONCE(rdp->nocb_follower_head, NULL);
> > > > > +             smp_mb();
> > > > >                tail = xchg(&rdp->nocb_follower_tail, &rdp->nocb_follower_head);
> > > > 
> > > > The xchg() operation implies full memory barriers both before and after,
> > > > so adding the smp_mb() before would have no effect.
> > > > 
> > > > But let me take a look at post-4.9 changes to this code...
> > > > 
> > > > I suggest trying out the following commit:
> > > > 
> > > > 6b5fc3a13318 ("rcu: Add memory barriers for NOCB leader wakeup")
> > > > 
> > > > If that one doesn't help, the following might be worth trying, but probably
> > > > a lot harder to backport:
> > > > 
> > > > 8be6e1b15c54 ("rcu: Use timer as backstop for NOCB deferred wakeups")
> > > > 
> > > > Please let me know how it goes!
> > > > 
> > > >                                                         Thanx, Paul
> > > > 
> > > > ------------------------------------------------------------------------
> > > > 
> > > > commit 6b5fc3a1331810db407c9e0e673dc1837afdc9d0
> > > > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > > Date:   Fri Apr 28 20:11:09 2017 -0700
> > > > 
> > > >     rcu: Add memory barriers for NOCB leader wakeup
> > > >     
> > > >     Wait/wakeup operations do not guarantee ordering on their own.  Instead,
> > > >     either locking or memory barriers are required.  This commit therefore
> > > >     adds memory barriers to wake_nocb_leader() and nocb_leader_wait().
> > > >     
> > > >     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > >     Tested-by: Krister Johansen <kjlx@templeofstupid.com>
> > > >     Cc: <stable@vger.kernel.org> # 4.6.x
> > > > 
> > > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > index 0b1042545116..573fbe9640a0 100644
> > > > --- a/kernel/rcu/tree_plugin.h
> > > > +++ b/kernel/rcu/tree_plugin.h
> > > > @@ -1810,6 +1810,7 @@ static void wake_nocb_leader(struct rcu_data *rdp, bool force)
> > > >          if (READ_ONCE(rdp_leader->nocb_leader_sleep) || force) {
> > > >                  /* Prior smp_mb__after_atomic() orders against prior enqueue. */
> > > >                  WRITE_ONCE(rdp_leader->nocb_leader_sleep, false);
> > > > +               smp_mb(); /* ->nocb_leader_sleep before swake_up(). */
> > > >                  swake_up(&rdp_leader->nocb_wq);
> > > >          }
> > > >  }
> > > > @@ -2064,6 +2065,7 @@ static void nocb_leader_wait(struct rcu_data *my_rdp)
> > > >           * nocb_gp_head, where they await a grace period.
> > > >           */
> > > >          gotcbs = false;
> > > > +       smp_mb(); /* wakeup before ->nocb_head reads. */
> > > >          for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_follower) {
> > > >                  rdp->nocb_gp_head = READ_ONCE(rdp->nocb_head);
> > > >                  if (!rdp->nocb_gp_head)
> > > > 
> > > >      
> > > 
> > >     
> > 
> >     


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: RCU nocb list not reclaiming causing OOM
  2018-07-28  8:29                 ` Paul E. McKenney
@ 2018-07-30 19:08                   ` David Chen
  2018-07-30 19:32                     ` Paul E. McKenney
  0 siblings, 1 reply; 12+ messages in thread
From: David Chen @ 2018-07-30 19:08 UTC (permalink / raw)
  To: paulmck; +Cc: linux-kernel

Hi Paul,

Just want to know what's your plan on stable branches regarding this issue.
Do you intend to backport the ->nocb_lock? Or are you going with just the
memory barrier change?

Thanks,
David


From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Sent: Saturday, July 28, 2018 1:29 AM
To: David Chen
Cc: linux-kernel@vger.kernel.org
Subject: Re: RCU nocb list not reclaiming causing OOM
  

On Fri, Jul 27, 2018 at 06:47:00PM -0700, Paul E. McKenney wrote:
> On Sat, Jul 28, 2018 at 12:07:19AM +0000, David Chen wrote:
> > Hi Paul,
> > 
> > I wasn't talking about the xchg() though.
> > 
> > The smp_mb__after_atomic() is not for xchg(), it's for `*tail = rdp->nocb_gp_head;`
> > it's stated so in the comment. And I do think we need ordering between
> > `*tail = rdp->nocb_gp_head;` and wake up, because the waiter is checking on head not tail.
> > 
> >                      swait_event_interruptible(rdp->nocb_wq,
> >                                               READ_ONCE(rdp->nocb_follower_head));
> > 
> > So what I'm saying is that since we need to maintain ordering between `*tail = rdp->nocb_gp_head;`
> > and wake up, we need to change the smp_mb__after_atomic() to smp_mb(). Because
> > smp_mb__after_atomic() wouldn't guarantee 
> > 
> > So this is what I'm proposing.
> 
> Good eyes!
> 
> Hmmm...  What do I do about this in mainline?  Ah, I introduced a
> ->nocb_lock in mainline to prevent this from happening.  In that same
> commit that you didn't want to use because it is hard to backport.  ;-)
> 
> So yes, but there might well be other misorderings fixed by the
> hard-to-backport commit that your change below does not cover.  Still I
> do agree that you need full ordering at that point.

Another approach would be to backport only the ->nocb_lock portions
of that patch.  This would still potentially leave failure-to-wake
(as opposed to misordering-on-wake) issues, but it should cover all
of the misordering-on-wake issues.

                                                        Thanx, Paul

> > diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h linux-4.9.37/kernel/rcu/tree_plugin.h
> > --- linux-4.9.37.orig/kernel/rcu/tree_plugin.h      2017-07-12 06:42:41.000000000 -0700
> > +++ linux-4.9.37/kernel/rcu/tree_plugin.h   2018-07-27 16:23:41.349044259 -0700
> > @@ -2076,7 +2076,7 @@
> >              /* Append callbacks to follower's "done" list. */
> >              tail = xchg(&rdp->nocb_follower_tail, rdp->nocb_gp_tail);
> >              *tail = rdp->nocb_gp_head;
> > -           smp_mb__after_atomic(); /* Store *tail before wakeup. */
> > +           smp_mb(); /* Store *tail before wakeup. */
> >              if (rdp != my_rdp && tail == &rdp->nocb_follower_head) {
> >                      /*
> >                       * List was empty, wake up the follower.
> > 
> > Thanks,
> > David
> > 
> > From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Sent: Friday, July 27, 2018 4:47 PM
> > To: David Chen
> > Cc: linux-kernel@vger.kernel.org
> > Subject: Re: RCU nocb list not reclaiming causing OOM
> >   
> > 
> > On Fri, Jul 27, 2018 at 11:16:39PM +0000, David Chen wrote:
> > > Hi Paul,
> > > 
> > > Thanks for the advice.
> > > The bug is kind of hard to hit, so I can't say for certain yet.
> > 
> > Well, you can always remove the "tail == &rdp->nocb_follower_head" as an
> > extra belt-and-suspenders safety net.  I am not putting that in mainline,
> > but in the privacy of your own copy of the kernel, I don't see any really
> > serious problem with it.  (As long as you aren't going for absolute maximum
> > performance, but even then there are other more important tuning actions
> > and code changes you could make.)
> > 
> > > Though after another look at the code, I found out the `smp_mb__after_atomic();`
> > > seems to be only a compiler barrier on x86.
> > 
> > Yes, and that is because the locked xchg instruction used on x86 to
> > implement xchg() already provides full ordering.  ;-)
> > 
> > >                tail = xchg(&rdp->nocb_follower_tail, rdp->nocb_gp_tail);
> > >                *tail = rdp->nocb_gp_head;
> > >                smp_mb__after_atomic(); /* Store *tail before wakeup. */
> > >                if (rdp != my_rdp && tail == &rdp->nocb_follower_head) {
> > >                        swake_up(&rdp->nocb_wq);
> > > 
> > > But that wouldn't be enough right? Because from 6b5fc3a13318, it stated that
> > > wakeup operation don't guarantee ordering. And when the follower wakes up, it checks
> > > for nocb_follower_head, which is assigned by `*tail = rdp->nocb_gp_head;` which doesn't
> > > have LOCK prefix. So it's possible for follower to wake up and see the list is empty and go
> > > back to sleep.
> > 
> > Again, xchg() is defined to provide full ordering against all operations
> > before and after it.  Each architecture is required to do whatever is
> > necessary to implement that full ordering, and x86 need only provide
> > its "lock xchg" instruction.
> > 
> > The smp_mb__after_atomic() has effect only after atomic read-modify-write
> > operations that do not return a value, for example, atomic_inc().
> > If you use it after a value-returning atomic read-modify-write operation
> > like xchg(), all you do is needlessly slow things down on platforms
> > that provide non-empty smp_mb__after_atomic() definitions.  So again,
> > smp_mb__after_atomic() after xchg() is pointless.
> > 
> > Please take a look at Documentation/core-api/atomic_ops.rst in the
> > Linux-kernel source tree for more information.  Or get a v4.17 kernel
> > source tree and check this using the memory model (tools/memory-model
> > in that version).
> > 
> >                                                         Thanx, Paul
> > 
> > > Thanks,
> > > David
> > > 
> > > From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > Sent: Friday, July 27, 2018 3:31 PM
> > > To: David Chen
> > > Cc: linux-kernel@vger.kernel.org
> > > Subject: Re: RCU nocb list not reclaiming causing OOM
> > >   
> > > 
> > > On Fri, Jul 27, 2018 at 07:07:46PM +0000, David Chen wrote:
> > > > Hi Paul,
> > > > 
> > > > I'd like to opinion again on this subject.
> > > > 
> > > > So we are going to backport this patch:
> > > > 6b5fc3a13318 ("rcu: Add memory barriers for NOCB leader wakeup")
> > > 
> > > Does this one solve the problem, or are you still seeing hangs?
> > > If you are no longer seeing hangs, my advice is "hands off keyboard",
> > > though some would no doubt point out that I should follow that advice
> > > more myself.  ;-)
> > > 
> > > > But the other one:
> > > > 8be6e1b15c54 ("rcu: Use timer as backstop for NOCB deferred wakeups")
> > > > It doesn't apply cleanly, and I'm not too comfortable porting it myself.
> > > 
> > > Yeah, that one is a bit on the non-trivial side, no two ways about it.
> > > 
> > > > So I'm wondering if I use the following change to always wake up follower thread
> > > > regardless if the list was empty or not, just to be on the safe side. Do you think
> > > > this change is reasonable? Do you see any problem it might cause?
> > > > 
> > > > Thanks,
> > > > David
> > > > 
> > > > diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h linux-4.9.37/kernel/rcu/tree_plugin.h
> > > > --- linux-4.9.37.orig/kernel/rcu/tree_plugin.h        2017-07-12 06:42:41.000000000 -0700
> > > > +++ linux-4.9.37/kernel/rcu/tree_plugin.h     2018-07-27 11:57:03.582134519 -0700
> > > > @@ -2077,7 +2077,7 @@
> > > >                tail = xchg(&rdp->nocb_follower_tail, rdp->nocb_gp_tail);
> > > >                *tail = rdp->nocb_gp_head;
> > > >                smp_mb__after_atomic(); /* Store *tail before wakeup. */
> > > > -             if (rdp != my_rdp && tail == &rdp->nocb_follower_head) {
> > > > +             if (rdp != my_rdp) {
> > > 
> > > This will burn a bit of extra CPU time, but it should be fine other than
> > > that.
> > > 
> > >                                                         Thanx, Paul
> > > 
> > > >                        /*
> > > >                         * List was empty, wake up the follower.
> > > >                         * Memory barriers supplied by atomic_long_add().
> > > > 
> > > > 
> > > > From: David Chen
> > > > Sent: Friday, July 20, 2018 5:12 PM
> > > > To: paulmck@linux.vnet.ibm.com
> > > > Cc: linux-kernel@vger.kernel.org
> > > > Subject: Re: RCU nocb list not reclaiming causing OOM
> > > >    
> > > > 
> > > > Hi Paul,
> > > > 
> > > > Ok, I'll try those patches.
> > > > 
> > > > Thanks,
> > > > David
> > > >   
> > > > From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > > Sent: Friday, July 20, 2018 4:32:12 PM
> > > > To: David Chen
> > > > Cc: linux-kernel@vger.kernel.org
> > > > Subject: Re: RCU nocb list not reclaiming causing OOM
> > > >   
> > > > 
> > > > On Fri, Jul 20, 2018 at 11:05:52PM +0000, David Chen wrote:
> > > > > Hi Paul,
> > > > > 
> > > > > We hit an RCU issue on 4.9.37 kernel. One of the nocb_follower list grows too
> > > > > large, and not getting reclaimed, causing the system to OOM.
> > > > > 
> > > > > Printing the culprit rcu_sched_data:
> > > > > 
> > > > >   nocb_q_count = {
> > > > >     counter = 32369635
> > > > >   },
> > > > >   nocb_follower_head = 0xffff88ae901c0a00,
> > > > >   nocb_follower_tail = 0xffff88af1538b8d8,
> > > > >   nocb_kthread = 0xffff88b06d290000,
> > > > > 
> > > > > As you can see here, the nocb_follower_head is not empty, so in theory, the
> > > > > nocb_kthread shouldn't go to sleep. However, if dump the stack of the kthread:
> > > > > 
> > > > > crash> bt 0xffff88b06d290000
> > > > > PID: 21     TASK: ffff88b06d290000  CPU: 3   COMMAND: "rcuos/1"
> > > > >  #0 [ffffafc9020b7dc0] __schedule at ffffffff8d8789dc
> > > > >  #1 [ffffafc9020b7e38] schedule at ffffffff8d878e76
> > > > >  #2 [ffffafc9020b7e50] rcu_nocb_kthread at ffffffff8d112337
> > > > >  #3 [ffffafc9020b7ec8] kthread at ffffffff8d0c6ce7
> > > > >  #4 [ffffafc9020b7f50] ret_from_fork at ffffffff8d87d755
> > > > > 
> > > > > And if we dis the address at ffffffff8d112337:
> > > > > 
> > > > > /usr/src/debug/kernel-4.9.37/linux-4.9.37-29.nutanix.07142017.el7.centos.x86_64/kernel/rcu/tree_plugin.h: 2106
> > > > > 0xffffffff8d11232d <rcu_nocb_kthread+381>:      test   %rax,%rax
> > > > > 0xffffffff8d112330 <rcu_nocb_kthread+384>:      jne    0xffffffff8d112355 <rcu_nocb_kthread+421>
> > > > > 0xffffffff8d112332 <rcu_nocb_kthread+386>:      callq  0xffffffff8d878e40 <schedule>
> > > > > 0xffffffff8d112337 <rcu_nocb_kthread+391>:      lea    -0x40(%rbp),%rsi
> > > > > 
> > > > > So the kthread is blocked at swait_event_interruptible in the nocb_follower_wait.
> > > > > This contradict with the fact that nocb_follower_head was not empty. So I
> > > > > wonder if this is caused by the lack of memory barrier in the place shown below.
> > > > > If the head is set to NULL after doing xchg, it will overwrite the head set
> > > > > by leader. This caused the kthread to sleep the next iteration, and the leader
> > > > > won't wake him up as the tail doesn't point to head.
> > > > > 
> > > > > Please tell me what do you think.
> > > > > 
> > > > > Thanks,
> > > > > David
> > > > > 
> > > > > diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h linux-4.9.37/kernel/rcu/tree_plugin.h
> > > > > --- linux-4.9.37.orig/kernel/rcu/tree_plugin.h        2017-07-12 06:42:41.000000000 -0700
> > > > > +++ linux-4.9.37/kernel/rcu/tree_plugin.h     2018-07-20 15:25:57.311206343 -0700
> > > > > @@ -2149,6 +2149,7 @@
> > > > >                BUG_ON(!list);
> > > > >                trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, "WokeNonEmpty");
> > > > >                WRITE_ONCE(rdp->nocb_follower_head, NULL);
> > > > > +             smp_mb();
> > > > >                tail = xchg(&rdp->nocb_follower_tail, &rdp->nocb_follower_head);
> > > > 
> > > > The xchg() operation implies full memory barriers both before and after,
> > > > so adding the smp_mb() before would have no effect.
> > > > 
> > > > But let me take a look at post-4.9 changes to this code...
> > > > 
> > > > I suggest trying out the following commit:
> > > > 
> > > > 6b5fc3a13318 ("rcu: Add memory barriers for NOCB leader wakeup")
> > > > 
> > > > If that one doesn't help, the following might be worth trying, but probably
> > > > a lot harder to backport:
> > > > 
> > > > 8be6e1b15c54 ("rcu: Use timer as backstop for NOCB deferred wakeups")
> > > > 
> > > > Please let me know how it goes!
> > > > 
> > > >                                                         Thanx, Paul
> > > > 
> > > > ------------------------------------------------------------------------
> > > > 
> > > > commit 6b5fc3a1331810db407c9e0e673dc1837afdc9d0
> > > > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > > Date:   Fri Apr 28 20:11:09 2017 -0700
> > > > 
> > > >     rcu: Add memory barriers for NOCB leader wakeup
> > > >     
> > > >     Wait/wakeup operations do not guarantee ordering on their own.  Instead,
> > > >     either locking or memory barriers are required.  This commit therefore
> > > >     adds memory barriers to wake_nocb_leader() and nocb_leader_wait().
> > > >     
> > > >     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > >     Tested-by: Krister Johansen <kjlx@templeofstupid.com>
> > > >     Cc: <stable@vger.kernel.org> # 4.6.x
> > > > 
> > > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > index 0b1042545116..573fbe9640a0 100644
> > > > --- a/kernel/rcu/tree_plugin.h
> > > > +++ b/kernel/rcu/tree_plugin.h
> > > > @@ -1810,6 +1810,7 @@ static void wake_nocb_leader(struct rcu_data *rdp, bool force)
> > > >          if (READ_ONCE(rdp_leader->nocb_leader_sleep) || force) {
> > > >                  /* Prior smp_mb__after_atomic() orders against prior enqueue. */
> > > >                  WRITE_ONCE(rdp_leader->nocb_leader_sleep, false);
> > > > +               smp_mb(); /* ->nocb_leader_sleep before swake_up(). */
> > > >                  swake_up(&rdp_leader->nocb_wq);
> > > >          }
> > > >  }
> > > > @@ -2064,6 +2065,7 @@ static void nocb_leader_wait(struct rcu_data *my_rdp)
> > > >           * nocb_gp_head, where they await a grace period.
> > > >           */
> > > >          gotcbs = false;
> > > > +       smp_mb(); /* wakeup before ->nocb_head reads. */
> > > >          for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_follower) {
> > > >                  rdp->nocb_gp_head = READ_ONCE(rdp->nocb_head);
> > > >                  if (!rdp->nocb_gp_head)
> > > > 
> > > >      
> > > 
> > >     
> > 
> >     

    

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: RCU nocb list not reclaiming causing OOM
  2018-07-30 19:08                   ` David Chen
@ 2018-07-30 19:32                     ` Paul E. McKenney
  2018-07-30 22:23                       ` [PATCH] rcu: Fix NOCB follower not waking up " David Chen
  0 siblings, 1 reply; 12+ messages in thread
From: Paul E. McKenney @ 2018-07-30 19:32 UTC (permalink / raw)
  To: David Chen; +Cc: linux-kernel

On Mon, Jul 30, 2018 at 07:08:55PM +0000, David Chen wrote:
> Hi Paul,
> 
> Just want to know what's your plan on stable branches regarding this issue.
> Do you intend to backport the ->nocb_lock? Or are you going with just the
> memory barrier change?

That depends on you.  I am not going to submit to -stable unless someone
(in this case you) have tested it and pronounced it good.  So if you
backport and test the ->nocb_lock changes and test them, I will submit
the patch for -stable.  (Or you can submit to -stable, it does not have
to be me.  Either way works.)

I should have asked this earlier, and perhaps could have saved you quite
a bit of work had I thought to do so (please accept my apologies if so).
Have you filed a bug report with your Linux distribution?

							Thanx, Paul

> Thanks,
> David
> 
> 
> From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Sent: Saturday, July 28, 2018 1:29 AM
> To: David Chen
> Cc: linux-kernel@vger.kernel.org
> Subject: Re: RCU nocb list not reclaiming causing OOM
>   
> 
> On Fri, Jul 27, 2018 at 06:47:00PM -0700, Paul E. McKenney wrote:
> > On Sat, Jul 28, 2018 at 12:07:19AM +0000, David Chen wrote:
> > > Hi Paul,
> > > 
> > > I wasn't talking about the xchg() though.
> > > 
> > > The smp_mb__after_atomic() is not for xchg(), it's for `*tail = rdp->nocb_gp_head;`
> > > it's stated so in the comment. And I do think we need ordering between
> > > `*tail = rdp->nocb_gp_head;` and wake up, because the waiter is checking on head not tail.
> > > 
> > >                      swait_event_interruptible(rdp->nocb_wq,
> > >                                               READ_ONCE(rdp->nocb_follower_head));
> > > 
> > > So what I'm saying is that since we need to maintain ordering between `*tail = rdp->nocb_gp_head;`
> > > and wake up, we need to change the smp_mb__after_atomic() to smp_mb(). Because
> > > smp_mb__after_atomic() wouldn't guarantee 
> > > 
> > > So this is what I'm proposing.
> > 
> > Good eyes!
> > 
> > Hmmm...  What do I do about this in mainline?  Ah, I introduced a
> > ->nocb_lock in mainline to prevent this from happening.  In that same
> > commit that you didn't want to use because it is hard to backport.  ;-)
> > 
> > So yes, but there might well be other misorderings fixed by the
> > hard-to-backport commit that your change below does not cover.  Still I
> > do agree that you need full ordering at that point.
> 
> Another approach would be to backport only the ->nocb_lock portions
> of that patch.  This would still potentially leave failure-to-wake
> (as opposed to misordering-on-wake) issues, but it should cover all
> of the misordering-on-wake issues.
> 
>                                                         Thanx, Paul
> 
> > > diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h linux-4.9.37/kernel/rcu/tree_plugin.h
> > > --- linux-4.9.37.orig/kernel/rcu/tree_plugin.h      2017-07-12 06:42:41.000000000 -0700
> > > +++ linux-4.9.37/kernel/rcu/tree_plugin.h   2018-07-27 16:23:41.349044259 -0700
> > > @@ -2076,7 +2076,7 @@
> > >              /* Append callbacks to follower's "done" list. */
> > >              tail = xchg(&rdp->nocb_follower_tail, rdp->nocb_gp_tail);
> > >              *tail = rdp->nocb_gp_head;
> > > -           smp_mb__after_atomic(); /* Store *tail before wakeup. */
> > > +           smp_mb(); /* Store *tail before wakeup. */
> > >              if (rdp != my_rdp && tail == &rdp->nocb_follower_head) {
> > >                      /*
> > >                       * List was empty, wake up the follower.
> > > 
> > > Thanks,
> > > David
> > > 
> > > From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > Sent: Friday, July 27, 2018 4:47 PM
> > > To: David Chen
> > > Cc: linux-kernel@vger.kernel.org
> > > Subject: Re: RCU nocb list not reclaiming causing OOM
> > >   
> > > 
> > > On Fri, Jul 27, 2018 at 11:16:39PM +0000, David Chen wrote:
> > > > Hi Paul,
> > > > 
> > > > Thanks for the advice.
> > > > The bug is kind of hard to hit, so I can't say for certain yet.
> > > 
> > > Well, you can always remove the "tail == &rdp->nocb_follower_head" as an
> > > extra belt-and-suspenders safety net.  I am not putting that in mainline,
> > > but in the privacy of your own copy of the kernel, I don't see any really
> > > serious problem with it.  (As long as you aren't going for absolute maximum
> > > performance, but even then there are other more important tuning actions
> > > and code changes you could make.)
> > > 
> > > > Though after another look at the code, I found out the `smp_mb__after_atomic();`
> > > > seems to be only a compiler barrier on x86.
> > > 
> > > Yes, and that is because the locked xchg instruction used on x86 to
> > > implement xchg() already provides full ordering.  ;-)
> > > 
> > > >                tail = xchg(&rdp->nocb_follower_tail, rdp->nocb_gp_tail);
> > > >                *tail = rdp->nocb_gp_head;
> > > >                smp_mb__after_atomic(); /* Store *tail before wakeup. */
> > > >                if (rdp != my_rdp && tail == &rdp->nocb_follower_head) {
> > > >                        swake_up(&rdp->nocb_wq);
> > > > 
> > > > But that wouldn't be enough right? Because from 6b5fc3a13318, it stated that
> > > > wakeup operation don't guarantee ordering. And when the follower wakes up, it checks
> > > > for nocb_follower_head, which is assigned by `*tail = rdp->nocb_gp_head;` which doesn't
> > > > have LOCK prefix. So it's possible for follower to wake up and see the list is empty and go
> > > > back to sleep.
> > > 
> > > Again, xchg() is defined to provide full ordering against all operations
> > > before and after it.  Each architecture is required to do whatever is
> > > necessary to implement that full ordering, and x86 need only provide
> > > its "lock xchg" instruction.
> > > 
> > > The smp_mb__after_atomic() has effect only after atomic read-modify-write
> > > operations that do not return a value, for example, atomic_inc().
> > > If you use it after a value-returning atomic read-modify-write operation
> > > like xchg(), all you do is needlessly slow things down on platforms
> > > that provide non-empty smp_mb__after_atomic() definitions.  So again,
> > > smp_mb__after_atomic() after xchg() is pointless.
> > > 
> > > Please take a look at Documentation/core-api/atomic_ops.rst in the
> > > Linux-kernel source tree for more information.  Or get a v4.17 kernel
> > > source tree and check this using the memory model (tools/memory-model
> > > in that version).
> > > 
> > >                                                         Thanx, Paul
> > > 
> > > > Thanks,
> > > > David
> > > > 
> > > > From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > > Sent: Friday, July 27, 2018 3:31 PM
> > > > To: David Chen
> > > > Cc: linux-kernel@vger.kernel.org
> > > > Subject: Re: RCU nocb list not reclaiming causing OOM
> > > >   
> > > > 
> > > > On Fri, Jul 27, 2018 at 07:07:46PM +0000, David Chen wrote:
> > > > > Hi Paul,
> > > > > 
> > > > > I'd like to opinion again on this subject.
> > > > > 
> > > > > So we are going to backport this patch:
> > > > > 6b5fc3a13318 ("rcu: Add memory barriers for NOCB leader wakeup")
> > > > 
> > > > Does this one solve the problem, or are you still seeing hangs?
> > > > If you are no longer seeing hangs, my advice is "hands off keyboard",
> > > > though some would no doubt point out that I should follow that advice
> > > > more myself.  ;-)
> > > > 
> > > > > But the other one:
> > > > > 8be6e1b15c54 ("rcu: Use timer as backstop for NOCB deferred wakeups")
> > > > > It doesn't apply cleanly, and I'm not too comfortable porting it myself.
> > > > 
> > > > Yeah, that one is a bit on the non-trivial side, no two ways about it.
> > > > 
> > > > > So I'm wondering if I use the following change to always wake up follower thread
> > > > > regardless if the list was empty or not, just to be on the safe side. Do you think
> > > > > this change is reasonable? Do you see any problem it might cause?
> > > > > 
> > > > > Thanks,
> > > > > David
> > > > > 
> > > > > diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h linux-4.9.37/kernel/rcu/tree_plugin.h
> > > > > --- linux-4.9.37.orig/kernel/rcu/tree_plugin.h        2017-07-12 06:42:41.000000000 -0700
> > > > > +++ linux-4.9.37/kernel/rcu/tree_plugin.h     2018-07-27 11:57:03.582134519 -0700
> > > > > @@ -2077,7 +2077,7 @@
> > > > >                tail = xchg(&rdp->nocb_follower_tail, rdp->nocb_gp_tail);
> > > > >                *tail = rdp->nocb_gp_head;
> > > > >                smp_mb__after_atomic(); /* Store *tail before wakeup. */
> > > > > -             if (rdp != my_rdp && tail == &rdp->nocb_follower_head) {
> > > > > +             if (rdp != my_rdp) {
> > > > 
> > > > This will burn a bit of extra CPU time, but it should be fine other than
> > > > that.
> > > > 
> > > >                                                         Thanx, Paul
> > > > 
> > > > >                        /*
> > > > >                         * List was empty, wake up the follower.
> > > > >                         * Memory barriers supplied by atomic_long_add().
> > > > > 
> > > > > 
> > > > > From: David Chen
> > > > > Sent: Friday, July 20, 2018 5:12 PM
> > > > > To: paulmck@linux.vnet.ibm.com
> > > > > Cc: linux-kernel@vger.kernel.org
> > > > > Subject: Re: RCU nocb list not reclaiming causing OOM
> > > > >    
> > > > > 
> > > > > Hi Paul,
> > > > > 
> > > > > Ok, I'll try those patches.
> > > > > 
> > > > > Thanks,
> > > > > David
> > > > >   
> > > > > From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > > > Sent: Friday, July 20, 2018 4:32:12 PM
> > > > > To: David Chen
> > > > > Cc: linux-kernel@vger.kernel.org
> > > > > Subject: Re: RCU nocb list not reclaiming causing OOM
> > > > >   
> > > > > 
> > > > > On Fri, Jul 20, 2018 at 11:05:52PM +0000, David Chen wrote:
> > > > > > Hi Paul,
> > > > > > 
> > > > > > We hit an RCU issue on 4.9.37 kernel. One of the nocb_follower list grows too
> > > > > > large, and not getting reclaimed, causing the system to OOM.
> > > > > > 
> > > > > > Printing the culprit rcu_sched_data:
> > > > > > 
> > > > > >   nocb_q_count = {
> > > > > >     counter = 32369635
> > > > > >   },
> > > > > >   nocb_follower_head = 0xffff88ae901c0a00,
> > > > > >   nocb_follower_tail = 0xffff88af1538b8d8,
> > > > > >   nocb_kthread = 0xffff88b06d290000,
> > > > > > 
> > > > > > As you can see here, the nocb_follower_head is not empty, so in theory, the
> > > > > > nocb_kthread shouldn't go to sleep. However, if dump the stack of the kthread:
> > > > > > 
> > > > > > crash> bt 0xffff88b06d290000
> > > > > > PID: 21     TASK: ffff88b06d290000  CPU: 3   COMMAND: "rcuos/1"
> > > > > >  #0 [ffffafc9020b7dc0] __schedule at ffffffff8d8789dc
> > > > > >  #1 [ffffafc9020b7e38] schedule at ffffffff8d878e76
> > > > > >  #2 [ffffafc9020b7e50] rcu_nocb_kthread at ffffffff8d112337
> > > > > >  #3 [ffffafc9020b7ec8] kthread at ffffffff8d0c6ce7
> > > > > >  #4 [ffffafc9020b7f50] ret_from_fork at ffffffff8d87d755
> > > > > > 
> > > > > > And if we dis the address at ffffffff8d112337:
> > > > > > 
> > > > > > /usr/src/debug/kernel-4.9.37/linux-4.9.37-29.nutanix.07142017.el7.centos.x86_64/kernel/rcu/tree_plugin.h: 2106
> > > > > > 0xffffffff8d11232d <rcu_nocb_kthread+381>:      test   %rax,%rax
> > > > > > 0xffffffff8d112330 <rcu_nocb_kthread+384>:      jne    0xffffffff8d112355 <rcu_nocb_kthread+421>
> > > > > > 0xffffffff8d112332 <rcu_nocb_kthread+386>:      callq  0xffffffff8d878e40 <schedule>
> > > > > > 0xffffffff8d112337 <rcu_nocb_kthread+391>:      lea    -0x40(%rbp),%rsi
> > > > > > 
> > > > > > So the kthread is blocked at swait_event_interruptible in the nocb_follower_wait.
> > > > > > This contradict with the fact that nocb_follower_head was not empty. So I
> > > > > > wonder if this is caused by the lack of memory barrier in the place shown below.
> > > > > > If the head is set to NULL after doing xchg, it will overwrite the head set
> > > > > > by leader. This caused the kthread to sleep the next iteration, and the leader
> > > > > > won't wake him up as the tail doesn't point to head.
> > > > > > 
> > > > > > Please tell me what do you think.
> > > > > > 
> > > > > > Thanks,
> > > > > > David
> > > > > > 
> > > > > > diff -ru linux-4.9.37.orig/kernel/rcu/tree_plugin.h linux-4.9.37/kernel/rcu/tree_plugin.h
> > > > > > --- linux-4.9.37.orig/kernel/rcu/tree_plugin.h        2017-07-12 06:42:41.000000000 -0700
> > > > > > +++ linux-4.9.37/kernel/rcu/tree_plugin.h     2018-07-20 15:25:57.311206343 -0700
> > > > > > @@ -2149,6 +2149,7 @@
> > > > > >                BUG_ON(!list);
> > > > > >                trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, "WokeNonEmpty");
> > > > > >                WRITE_ONCE(rdp->nocb_follower_head, NULL);
> > > > > > +             smp_mb();
> > > > > >                tail = xchg(&rdp->nocb_follower_tail, &rdp->nocb_follower_head);
> > > > > 
> > > > > The xchg() operation implies full memory barriers both before and after,
> > > > > so adding the smp_mb() before would have no effect.
> > > > > 
> > > > > But let me take a look at post-4.9 changes to this code...
> > > > > 
> > > > > I suggest trying out the following commit:
> > > > > 
> > > > > 6b5fc3a13318 ("rcu: Add memory barriers for NOCB leader wakeup")
> > > > > 
> > > > > If that one doesn't help, the following might be worth trying, but probably
> > > > > a lot harder to backport:
> > > > > 
> > > > > 8be6e1b15c54 ("rcu: Use timer as backstop for NOCB deferred wakeups")
> > > > > 
> > > > > Please let me know how it goes!
> > > > > 
> > > > >                                                         Thanx, Paul
> > > > > 
> > > > > ------------------------------------------------------------------------
> > > > > 
> > > > > commit 6b5fc3a1331810db407c9e0e673dc1837afdc9d0
> > > > > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > > > Date:   Fri Apr 28 20:11:09 2017 -0700
> > > > > 
> > > > >     rcu: Add memory barriers for NOCB leader wakeup
> > > > >     
> > > > >     Wait/wakeup operations do not guarantee ordering on their own.  Instead,
> > > > >     either locking or memory barriers are required.  This commit therefore
> > > > >     adds memory barriers to wake_nocb_leader() and nocb_leader_wait().
> > > > >     
> > > > >     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > > >     Tested-by: Krister Johansen <kjlx@templeofstupid.com>
> > > > >     Cc: <stable@vger.kernel.org> # 4.6.x
> > > > > 
> > > > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > > index 0b1042545116..573fbe9640a0 100644
> > > > > --- a/kernel/rcu/tree_plugin.h
> > > > > +++ b/kernel/rcu/tree_plugin.h
> > > > > @@ -1810,6 +1810,7 @@ static void wake_nocb_leader(struct rcu_data *rdp, bool force)
> > > > >          if (READ_ONCE(rdp_leader->nocb_leader_sleep) || force) {
> > > > >                  /* Prior smp_mb__after_atomic() orders against prior enqueue. */
> > > > >                  WRITE_ONCE(rdp_leader->nocb_leader_sleep, false);
> > > > > +               smp_mb(); /* ->nocb_leader_sleep before swake_up(). */
> > > > >                  swake_up(&rdp_leader->nocb_wq);
> > > > >          }
> > > > >  }
> > > > > @@ -2064,6 +2065,7 @@ static void nocb_leader_wait(struct rcu_data *my_rdp)
> > > > >           * nocb_gp_head, where they await a grace period.
> > > > >           */
> > > > >          gotcbs = false;
> > > > > +       smp_mb(); /* wakeup before ->nocb_head reads. */
> > > > >          for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_follower) {
> > > > >                  rdp->nocb_gp_head = READ_ONCE(rdp->nocb_head);
> > > > >                  if (!rdp->nocb_gp_head)
> > > > > 
> > > > >      
> > > > 
> > > >     
> > > 
> > >     
> 
>     


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH] rcu: Fix NOCB follower not waking up causing OOM
  2018-07-30 19:32                     ` Paul E. McKenney
@ 2018-07-30 22:23                       ` David Chen
  0 siblings, 0 replies; 12+ messages in thread
From: David Chen @ 2018-07-30 22:23 UTC (permalink / raw)
  To: linux-kernel; +Cc: Chunwei Chen, Paul E . McKenney

From: Chunwei Chen <david.chen@nutanix.com>

In 4.9 stable branch, we hit an issue where one of the NOCB follower
thread wasn't waking up, even though the follower list is not empty.
The follower list just kept on growing and never got reclaimed, and
finally caused the system to run out of memory.

This issue is similar to the issue fixed by 6b5fc3a13318, both are
caused by lacking proper memory barrier before swake_up, and causing
wake up to be missed. While we do have smp_mb__after_atomic here, but
smp_mb__after_atomic is only a compiler barrier on x86, so it doesn't
prevent this at all. So we fix this issue by changing it to smp_mb.

Note, this patch doesn't apply to master, because in master the follower
list is changed to spinlock protected list. So there's no such issue
there.

Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree_plugin.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 554ea54..b2d663d 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2090,7 +2090,7 @@ static void nocb_leader_wait(struct rcu_data *my_rdp)
 		/* Append callbacks to follower's "done" list. */
 		tail = xchg(&rdp->nocb_follower_tail, rdp->nocb_gp_tail);
 		*tail = rdp->nocb_gp_head;
-		smp_mb__after_atomic(); /* Store *tail before wakeup. */
+		smp_mb(); /* Store *tail before wakeup. */
 		if (rdp != my_rdp && tail == &rdp->nocb_follower_head) {
 			/*
 			 * List was empty, wake up the follower.
-- 
1.9.4


^ permalink raw reply related	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2018-07-30 22:25 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-20 23:05 RCU nocb list not reclaiming causing OOM David Chen
2018-07-20 23:32 ` Paul E. McKenney
     [not found]   ` <BYAPR02MB450146DF9F67788ED1F0801D94500@BYAPR02MB4501.namprd02.prod.outlook.com>
2018-07-27 19:07     ` David Chen
2018-07-27 22:31       ` Paul E. McKenney
2018-07-27 23:16         ` David Chen
2018-07-27 23:47           ` Paul E. McKenney
2018-07-28  0:07             ` David Chen
2018-07-28  1:47               ` Paul E. McKenney
2018-07-28  8:29                 ` Paul E. McKenney
2018-07-30 19:08                   ` David Chen
2018-07-30 19:32                     ` Paul E. McKenney
2018-07-30 22:23                       ` [PATCH] rcu: Fix NOCB follower not waking up " David Chen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.