linux-kernel.vger.kernel.org archive mirror
* Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set
       [not found] <4fafad133b074f279dbab1aa3642e23f@xiaomi.com>
@ 2021-11-07  3:25 ` Waiman Long
  2021-11-07  3:28   ` Waiman Long
       [not found] ` <20211107090131.1535-1-hdanton@sina.com>
  1 sibling, 1 reply; 26+ messages in thread
From: Waiman Long @ 2021-11-07  3:25 UTC (permalink / raw)
  To: 马振华, peterz, mingo, will, boqun.feng, linux-kernel

On 11/6/21 08:39, 马振华 wrote:
> Dear longman,
>
> recently, I found an issue where the rwsem count becomes a negative
> value; it always happens when a task tries to get the lock
> with __down_write_killable and is then killed
>
> the issue happened like this
>
>             CPU2         CPU4
>     task A[reader]     task B[writer]
>     down_read_killable[locked]
>     sem->count=0x100
>             down_write_killable
>             sem->count=0x102[wlist not empty]
>     up_read
>     count=0x2
>             sig kill received
>     down_read_killable
>     sem->count=0x102[wlist not empty]
>             goto branch out_nolock:
> list_del(&waiter.list);
> wait list is empty
> sem->count-RWSEM_FLAG_HANDOFF
> sem->count=0xFE
>     list_empty(&sem->wait_list) is TRUE
>      sem->count andnot RWSEM_FLAG_WAITERS
>       sem->count=0xFC
>     up_read
>     sem->count -= 0x100
>     sem->count=0xFFFFFFFFFFFFFFFC
>     DEBUG_RWSEMS_WARN_ON(tmp < 0, sem);
>
> so sem->count will be negative after the writer is killed
> I think that if the RWSEM_FLAG_HANDOFF flag is not set, we shouldn't clear it

Thanks for reporting this possible race condition.

However, I am still trying to figure out how it is possible to set
wstate to WRITER_HANDOFF without actually setting the handoff bit as
well. The statement sequence should be as follows:

wstate = WRITER_HANDOFF;
raw_spin_lock_irq(&sem->wait_lock);
if (rwsem_try_write_lock(sem, wstate))
raw_spin_unlock_irq(&sem->wait_lock);
   :
if (signal_pending_state(state, current))
     goto out_nolock

The rwsem_try_write_lock() function will make sure that we either 
acquire the lock and clear handoff or set the handoff bit. This should 
be done before we actually check for signal. I do think that it is 
probably safer to use atomic_long_andnot to clear the handoff bit 
instead of using atomic_long_add().
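
To make the arithmetic concrete, here is a minimal sketch using the bit
values implied by the trace above (RWSEM_FLAG_WAITERS == 0x2,
RWSEM_FLAG_HANDOFF == 0x4, one reader == 0x100):

	long count = 0x102;	/* one reader held + waiters, handoff NOT set */

	count += -0x4;		/* atomic_long_add(-RWSEM_FLAG_HANDOFF):
				 * count becomes 0xFE, corrupting the reader field */

	count = 0x102;
	count &= ~0x4UL;	/* atomic_long_andnot(RWSEM_FLAG_HANDOFF):
				 * count stays 0x102, clearing an unset bit is a no-op */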

Cheers,
Longman


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set
  2021-11-07  3:25 ` [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set Waiman Long
@ 2021-11-07  3:28   ` Waiman Long
  0 siblings, 0 replies; 26+ messages in thread
From: Waiman Long @ 2021-11-07  3:28 UTC (permalink / raw)
  To: 马振华, peterz, mingo, will, boqun.feng, linux-kernel


On 11/6/21 23:25, Waiman Long wrote:
>
>> so sem->count will be negative after writer is killed
>> i think if flag RWSEM_FLAG_HANDOFF is not set, we shouldn't clean it
>
> Thanks for reporting this possible race condition.
>
> However, I am still trying to figure how it is possible to set the 
> wstate to WRITER_HANDOFF without actually setting the handoff bit as 
> well. The statement sequence should be as follows:
>
> wstate = WRITER_HANDOFF;
> raw_spin_lock_irq(&sem->wait_lock);
> if (rwsem_try_write_lock(sem, wstate))
> raw_spin_unlock_irq(&sem->wait_lock);
>   :
> if (signal_pending_state(state, current))
>     goto out_nolock
>
> The rwsem_try_write_lock() function will make sure that we either 
> acquire the lock and clear handoff or set the handoff bit. This should 
> be done before we actually check for signal. I do think that it is 
> probably safer to use atomic_long_andnot to clear the handoff bit 
> instead of using atomic_long_add().
BTW, do you have a reproducer that can trigger this race condition?

Thanks,
Longman


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set
       [not found] ` <20211107090131.1535-1-hdanton@sina.com>
@ 2021-11-07 15:24   ` Waiman Long
  2021-11-07 19:52     ` Waiman Long
  0 siblings, 1 reply; 26+ messages in thread
From: Waiman Long @ 2021-11-07 15:24 UTC (permalink / raw)
  To: Hillf Danton
  Cc: 马振华, peterz, mingo, will, boqun.feng, linux-kernel

On 11/7/21 04:01, Hillf Danton wrote:
> On Sat, 6 Nov 2021 23:25:38 -0400 Waiman Long wrote:
>> On 11/6/21 08:39, 马振华 wrote:
>>> Dear longman,
>>>
>>> recently , i find a issue which rwsem count is negative value, it
>>> happened always when a task try to get the lock
>>> with __down_write_killable , then it is killed
>>>
>>> this issue happened like this
>>>
>>>              CPU2         CPU4
>>>      task A[reader]     task B[writer]
>>>      down_read_killable[locked]
>>>      sem->count=0x100
>>>              down_write_killable
>>>              sem->count=0x102[wlist not empty]
>>>      up_read
>>>      count=0x2
>>>              sig kill received
>>>      down_read_killable
>>>      sem->count=0x102[wlist not empty]
>>>              goto branch out_nolock:
>>> list_del(&waiter.list);
>>> wait list is empty
>>> sem->count-RWSEM_FLAG_HANDOFF
>>> sem->count=0xFE
>>>      list_empty(&sem->wait_list) is TRUE
>>>       sem->count andnot RWSEM_FLAG_WAITERS
>>>        sem->count=0xFC
>>>      up_read
>>>      sem->count -= 0x100
>>>      sem->count=0xFFFFFFFFFFFFFFFC
>>>      DEBUG_RWSEMS_WARN_ON(tmp < 0, sem);
>>>
>>> so sem->count will be negative after writer is killed
>>> i think if flag RWSEM_FLAG_HANDOFF is not set, we shouldn't clean it
>> Thanks for reporting this possible race condition.
>>
>> However, I am still trying to figure how it is possible to set the
>> wstate to WRITER_HANDOFF without actually setting the handoff bit as
>> well. The statement sequence should be as follows:
>>
>> wstate = WRITER_HANDOFF;
>> raw_spin_lock_irq(&sem->wait_lock);
>> if (rwsem_try_write_lock(sem, wstate))
>> raw_spin_unlock_irq(&sem->wait_lock);
>>    :
>> if (signal_pending_state(state, current))
>>      goto out_nolock
>>
>> The rwsem_try_write_lock() function will make sure that we either
>> acquire the lock and clear handoff or set the handoff bit. This should
>> be done before we actually check for signal. I do think that it is
> Given that WRITER_HANDOFF does not guarantee that RWSEM_FLAG_HANDOFF is set in
> rwsem_try_write_lock(), the flag should be cleared only by the setter to
> avoid count underflow.
>
> Hillf
>
>> probably safer to use atomic_long_andnot to clear the handoff bit
>> instead of using atomic_long_add().

I did have a tentative patch to address this issue which is somewhat 
similar to your approach. However, I would like to further investigate 
the exact mechanics of the race condition to make sure that I won't miss 
a latent bug somewhere else in the rwsem code.

-Longman

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index c51387a43265..20be967620c0 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -347,7 +347,8 @@ enum rwsem_wake_type {
  enum writer_wait_state {
      WRITER_NOT_FIRST,    /* Writer is not first in wait list */
      WRITER_FIRST,        /* Writer is first in wait list     */
-    WRITER_HANDOFF        /* Writer is first & handoff needed */
+    WRITER_NEED_HANDOFF,    /* Writer is first & handoff needed */
+    WRITER_HANDOFF        /* Writer is first & handoff set */
  };

  /*
@@ -532,11 +533,11 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
   * race conditions between checking the rwsem wait list and setting the
   * sem->count accordingly.
   *
- * If wstate is WRITER_HANDOFF, it will make sure that either the handoff
+ * If wstate is WRITER_NEED_HANDOFF, it will make sure that either the handoff
   * bit is set or the lock is acquired with handoff bit cleared.
   */
  static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
-                    enum writer_wait_state wstate)
+                    enum writer_wait_state *wstate)
  {
      long count, new;

@@ -546,13 +547,13 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
      do {
          bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);

-        if (has_handoff && wstate == WRITER_NOT_FIRST)
+        if (has_handoff && *wstate == WRITER_NOT_FIRST)
              return false;

          new = count;

          if (count & RWSEM_LOCK_MASK) {
-            if (has_handoff || (wstate != WRITER_HANDOFF))
+            if (has_handoff || (*wstate != WRITER_NEED_HANDOFF))
                  return false;

              new |= RWSEM_FLAG_HANDOFF;
@@ -569,8 +570,10 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
       * We have either acquired the lock with handoff bit cleared or
       * set the handoff bit.
       */
-    if (new & RWSEM_FLAG_HANDOFF)
+    if (new & RWSEM_FLAG_HANDOFF) {
+        *wstate = WRITER_HANDOFF;
          return false;
+    }

      rwsem_set_owner(sem);
      return true;
@@ -1083,7 +1086,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
      /* wait until we successfully acquire the lock */
      set_current_state(state);
      for (;;) {
-        if (rwsem_try_write_lock(sem, wstate)) {
+        if (rwsem_try_write_lock(sem, &wstate)) {
              /* rwsem_try_write_lock() implies ACQUIRE on success */
              break;
          }
@@ -1138,7 +1141,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
               */
              if ((wstate == WRITER_FIRST) && (rt_task(current) ||
                  time_after(jiffies, waiter.timeout))) {
-                wstate = WRITER_HANDOFF;
+                wstate = WRITER_NEED_HANDOFF;
                  lockevent_inc(rwsem_wlock_handoff);
                  break;
              }
@@ -1159,7 +1162,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
      list_del(&waiter.list);

      if (unlikely(wstate == WRITER_HANDOFF))
-        atomic_long_add(-RWSEM_FLAG_HANDOFF, &sem->count);
+        atomic_long_andnot(RWSEM_FLAG_HANDOFF, &sem->count);

      if (list_empty(&sem->wait_list))
          atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
-- 



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set
  2021-11-07 15:24   ` Waiman Long
@ 2021-11-07 19:52     ` Waiman Long
  2021-11-10 21:38       ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Waiman Long @ 2021-11-07 19:52 UTC (permalink / raw)
  To: Hillf Danton
  Cc: 马振华, peterz, mingo, will, boqun.feng, linux-kernel

On 11/7/21 10:24, Waiman Long wrote:
> On 11/7/21 04:01, Hillf Danton wrote:
>> On Sat, 6 Nov 2021 23:25:38 -0400 Waiman Long wrote:
>>> On 11/6/21 08:39, 马振华 wrote:
>>>> Dear longman,
>>>>
>>>> recently , i find a issue which rwsem count is negative value, it
>>>> happened always when a task try to get the lock
>>>> with __down_write_killable , then it is killed
>>>>
>>>> this issue happened like this
>>>>
>>>>              CPU2         CPU4
>>>>      task A[reader]     task B[writer]
>>>>      down_read_killable[locked]
>>>>      sem->count=0x100
>>>>              down_write_killable
>>>>              sem->count=0x102[wlist not empty]
>>>>      up_read
>>>>      count=0x2
>>>>              sig kill received
>>>>      down_read_killable
>>>>      sem->count=0x102[wlist not empty]
>>>>              goto branch out_nolock:
>>>> list_del(&waiter.list);
>>>> wait list is empty
>>>> sem->count-RWSEM_FLAG_HANDOFF
>>>> sem->count=0xFE
>>>>      list_empty(&sem->wait_list) is TRUE
>>>>       sem->count andnot RWSEM_FLAG_WAITERS
>>>>        sem->count=0xFC
>>>>      up_read
>>>>      sem->count -= 0x100
>>>>      sem->count=0xFFFFFFFFFFFFFFFC
>>>>      DEBUG_RWSEMS_WARN_ON(tmp < 0, sem);
>>>>
>>>> so sem->count will be negative after writer is killed
>>>> i think if flag RWSEM_FLAG_HANDOFF is not set, we shouldn't clean it
>>> Thanks for reporting this possible race condition.
>>>
>>> However, I am still trying to figure how it is possible to set the
>>> wstate to WRITER_HANDOFF without actually setting the handoff bit as
>>> well. The statement sequence should be as follows:
>>>
>>> wstate = WRITER_HANDOFF;
>>> raw_spin_lock_irq(&sem->wait_lock);
>>> if (rwsem_try_write_lock(sem, wstate))
>>> raw_spin_unlock_irq(&sem->wait_lock);
>>>    :
>>> if (signal_pending_state(state, current))
>>>      goto out_nolock
>>>
>>> The rwsem_try_write_lock() function will make sure that we either
>>> acquire the lock and clear handoff or set the handoff bit. This should
>>> be done before we actually check for signal. I do think that it is
>> Given that WRITER_HANDOFF makes no sure that RWSEM_FLAG_HANDOFF is 
>> set in
>> wsem_try_write_lock(), the flag should be cleared only by the setter to
>> avoid count underflow.
>>
>> Hillf
>>
>>> probably safer to use atomic_long_andnot to clear the handoff bit
>>> instead of using atomic_long_add().
>
> I did have a tentative patch to address this issue which is somewhat 
> similar to your approach. However, I would like to further investigate 
> the exact mechanics of the race condition to make sure that I won't 
> miss a latent bug somewhere else in the rwsem code.

I still couldn't figure out how this race condition can happen. However, I
did discover that it is possible to leave the rwsem with no waiter but the
handoff bit set if we kill or interrupt all the waiters in the wait
queue. I have just sent out a patch to address that concern, but it
should be able to handle this race condition as well if it really happens.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set
  2021-11-07 19:52     ` Waiman Long
@ 2021-11-10 21:38       ` Peter Zijlstra
  2021-11-11  2:42         ` Maria Yu
  2021-11-11 15:08         ` Peter Zijlstra
  0 siblings, 2 replies; 26+ messages in thread
From: Peter Zijlstra @ 2021-11-10 21:38 UTC (permalink / raw)
  To: Waiman Long
  Cc: Hillf Danton, 马振华,
	mingo, will, boqun.feng, linux-kernel

On Sun, Nov 07, 2021 at 02:52:36PM -0500, Waiman Long wrote:
> > 
> > I did have a tentative patch to address this issue which is somewhat
> > similar to your approach. However, I would like to further investigate
> > the exact mechanics of the race condition to make sure that I won't miss
> > a latent bug somewhere else in the rwsem code.
> 
> I still couldn't figure how this race condition can happen. However, I do
> discover that it is possible to leave rwsem with no waiter but handoff bit
> set if we kill or interrupt all the waiters in the wait queue. I have just
> sent out a patch to address that concern, but it should be able to handle
> this race condition as well if it really happens.

The comment above RWSEM_WRITER_LOCKED seems wrong/out-dated in that
there's a 4th place that modifies the HANDOFF bit namely
rwsem_down_read_slowpath() in the out_nolock: case.
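
For reference, that reader-side out_nolock path (as it appears in the
patch context further down this thread) is:

out_nolock:
	list_del(&waiter.list);
	if (list_empty(&sem->wait_list)) {
		atomic_long_andnot(RWSEM_FLAG_WAITERS|RWSEM_FLAG_HANDOFF,
				   &sem->count);
	}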

Now the thing I'm most worried about is that rwsem_down_write_slowpath()
modifies the HANDOFF bit depending on wstate, and wstate itself is not
determined under the same ->wait_lock section, so there could be a race
there.

Another thing is that once wstate==HANDOFF, we rely on spin_on_owner()
to return OWNER_NULL such that it goes to trylock_again, however if it
returns anything else then we're at signal_pending_state() and the
observed race can happen.

Now, spin_on_owner() *can* in fact return something else, consider
need_resched() being set for instance.

Combined I think the observed race is valid.

Now before we go make things more complicated, I think we should see if
we can make things simpler. Also I think perhaps the HANDOFF name here
is a misnomer.

I agree that using _andnot() will fix this issue; I also agree with
folding it with the existing _andnot() already there. But let me stare a
little more at this code, something isn't making sense...

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set
  2021-11-10 21:38       ` Peter Zijlstra
@ 2021-11-11  2:42         ` Maria Yu
  2021-11-11 15:08         ` Peter Zijlstra
  1 sibling, 0 replies; 26+ messages in thread
From: Maria Yu @ 2021-11-11  2:42 UTC (permalink / raw)
  To: peterz
  Cc: boqun.feng, hdanton, linux-kernel, longman, mazhenhua, mingo,
	will, quic_aiquny

From: quic_aiquny@quicinc.com

An important piece of information: when this issue was reported, the first
waiter could be changed. That comes from other hook changes which do not
always add waiters to the tail of the wait list, which is not the same as
the current upstream code.

In the original Xiaomi issue, checking the log shows the following scenario:
a write waiter set the RWSEM_FLAG_HANDOFF bit in sem->count, and then another
reader jumped in and became the first waiter.
So when that reader goes through rwsem_down_read_slowpath() and sees the
RWSEM_FLAG_HANDOFF bit set, it clears the RWSEM_FLAG_HANDOFF bit.

So the writer's local wstate variable is left as WRITER_HANDOFF, but the
reader treated the bit as a reader handoff and cleared it.
The writer, however, only checks its local wstate variable for WRITER_HANDOFF
and then does the add(-RWSEM_FLAG_HANDOFF) action. If it did an
andnot(RWSEM_FLAG_HANDOFF) instead, the writer would not make sem->count
underflow.

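In code terms, the one-line difference under discussion in the writer's
out_nolock path is (sketch):

	/* current: underflows sem->count if the bit was already cleared */
	atomic_long_add(-RWSEM_FLAG_HANDOFF, &sem->count);

	/* proposed: clearing an already-clear bit leaves sem->count unchanged */
	atomic_long_andnot(RWSEM_FLAG_HANDOFF, &sem->count);
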
The current scenario looked like this:
--------------------------------
	wstate = WRITER_HANDOFF;
	raw_spin_lock_irq(&sem->wait_lock);
	if (rwsem_try_write_lock(sem, wstate))
	raw_spin_unlock_irq(&sem->wait_lock);
	   :
	   if (signal_pending_state(state, current))       // meanwhile another reader in rwsem_down_read_slowpath() became the first waiter and cleared RWSEM_FLAG_HANDOFF
	     goto out_nolock
	out_nolock:
	     add(-RWSEM_FLAG_HANDOFF, sem->count)                     // makes the count negative
--------------------------------

So doing an andnot would be much safer, and it works in this scenario:
--------------------------------
	wstate = WRITER_HANDOFF;
	raw_spin_lock_irq(&sem->wait_lock);
	if (rwsem_try_write_lock(sem, wstate))
	raw_spin_unlock_irq(&sem->wait_lock);
	   :
	   if (signal_pending_state(state, current))       // meanwhile another reader in rwsem_down_read_slowpath() became the first waiter and cleared RWSEM_FLAG_HANDOFF
	     goto out_nolock
	out_nolock:
	     andnot(RWSEM_FLAG_HANDOFF, sem->count)                // leaves the count unchanged
--------------------------------


Doing an andnot is also safe in the normal scenario where the writer is not killed:
--------------------------------
	wstate = WRITER_HANDOFF;
	raw_spin_lock_irq(&sem->wait_lock);
	if (rwsem_try_write_lock(sem, wstate))
	raw_spin_unlock_irq(&sem->wait_lock);
	   :
	   if (signal_pending_state(state, current))       // meanwhile another reader in rwsem_down_read_slowpath() became the first waiter and cleared RWSEM_FLAG_HANDOFF
		if (wstate == WRITER_HANDOFF)
			trylock_again:
			raw_spin_lock_irq(&sem->wait_lock);
			if (rwsem_try_write_lock(sem, wstate))		// the writer will set the RWSEM_FLAG_HANDOFF flag again
	raw_spin_unlock_irq(&sem->wait_lock);
--------------------------------



Thx and BRs,
Aiqun Yu (Maria)



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set
  2021-11-10 21:38       ` Peter Zijlstra
  2021-11-11  2:42         ` Maria Yu
@ 2021-11-11 15:08         ` Peter Zijlstra
  2021-11-11 19:14           ` Waiman Long
  1 sibling, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2021-11-11 15:08 UTC (permalink / raw)
  To: Waiman Long
  Cc: Hillf Danton, 马振华,
	mingo, will, boqun.feng, linux-kernel

On Wed, Nov 10, 2021 at 10:38:55PM +0100, Peter Zijlstra wrote:
> On Sun, Nov 07, 2021 at 02:52:36PM -0500, Waiman Long wrote:
> > > 
> > > I did have a tentative patch to address this issue which is somewhat
> > > similar to your approach. However, I would like to further investigate
> > > the exact mechanics of the race condition to make sure that I won't miss
> > > a latent bug somewhere else in the rwsem code.
> > 
> > I still couldn't figure how this race condition can happen. However, I do
> > discover that it is possible to leave rwsem with no waiter but handoff bit
> > set if we kill or interrupt all the waiters in the wait queue. I have just
> > sent out a patch to address that concern, but it should be able to handle
> > this race condition as well if it really happens.
> 
> The comment above RWSEM_WRITER_LOCKED seems wrong/out-dated in that
> there's a 4th place that modifies the HANDOFF bit namely
> rwsem_down_read_slowpath() in the out_nolock: case.
> 
> Now the thing I'm most worried about is that rwsem_down_write_slowpath()
> modifies the HANDOFF bit depending on wstate, and wstate itself it not
> determined under the same ->wait_lock section, so there could be a race
> there.
> 
> Another thing is that once wstate==HANDOFF, we rely on spin_on_owner()
> to return OWNER_NULL such that it goes to trylock_again, however if it
> returns anything else then we're at signal_pending_state() and the
> observed race can happen.
> 
> Now, spin_on_owner() *can* in fact return something else, consider
> need_resched() being set for instance.
> 
> Combined I think the observed race is valid.
> 
> Now before we go make things more complicated, I think we should see if
> we can make things simpler. Also I think perhaps the HANDOFF name here
> is a misnomer.
> 
> I agree that using _andnot() will fix this issue; I also agree with
> folding it with the existing _andnot() already there. But let me stare a
> little more at this code, something isn't making sense...

I think I want to see WRITER_HANDOFF go away. And preferably all of
wstate.

Something like the *completely* untested below, might set fire to your
pet, eat your granny, etc..

Also, perhaps s/HANDOFF/PHASE_CHANGE/ ?

Waiman, did I overlook something fundamental here?

---
 kernel/locking/rwsem.c | 85 +++++++++++++++++++-------------------------------
 1 file changed, 32 insertions(+), 53 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index c51387a43265..bc5da05346e2 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -104,14 +104,17 @@
  * atomic_long_fetch_add() is used to obtain reader lock, whereas
  * atomic_long_cmpxchg() will be used to obtain writer lock.
  *
- * There are three places where the lock handoff bit may be set or cleared.
- * 1) rwsem_mark_wake() for readers.
- * 2) rwsem_try_write_lock() for writers.
- * 3) Error path of rwsem_down_write_slowpath().
+ * There are four places where the lock handoff bit may be set or cleared.
+ * 1) rwsem_mark_wake() for readers		-- set, clear
+ * 2) rwsem_try_write_lock() for writers	-- set, clear
+ * 3) Error path of rwsem_down_write_slowpath() -- clear
+ * 4) Error path of rwsem_down_read_slowpath()  -- clear
  *
  * For all the above cases, wait_lock will be held. A writer must also
  * be the first one in the wait_list to be eligible for setting the handoff
  * bit. So concurrent setting/clearing of handoff bit is not possible.
+ *
+ * XXX handoff is a misnomer here, all it does is force a phase change
  */
 #define RWSEM_WRITER_LOCKED	(1UL << 0)
 #define RWSEM_FLAG_WAITERS	(1UL << 1)
@@ -344,12 +347,6 @@ enum rwsem_wake_type {
 	RWSEM_WAKE_READ_OWNED	/* Waker thread holds the read lock */
 };
 
-enum writer_wait_state {
-	WRITER_NOT_FIRST,	/* Writer is not first in wait list */
-	WRITER_FIRST,		/* Writer is first in wait list     */
-	WRITER_HANDOFF		/* Writer is first & handoff needed */
-};
-
 /*
  * The typical HZ value is either 250 or 1000. So set the minimum waiting
  * time to at least 4ms or 1 jiffy (if it is higher than 4ms) in the wait
@@ -531,13 +528,12 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
  * This function must be called with the sem->wait_lock held to prevent
  * race conditions between checking the rwsem wait list and setting the
  * sem->count accordingly.
- *
- * If wstate is WRITER_HANDOFF, it will make sure that either the handoff
- * bit is set or the lock is acquired with handoff bit cleared.
  */
 static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
-					enum writer_wait_state wstate)
+					struct rwsem_waiter *waiter)
 {
+	bool first = rwsem_first_waiter(sem) == waiter;
+	bool timo = time_after(jiffies, waiter->timeout);
 	long count, new;
 
 	lockdep_assert_held(&sem->wait_lock);
@@ -546,13 +542,13 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
 	do {
 		bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
 
-		if (has_handoff && wstate == WRITER_NOT_FIRST)
+		if (has_handoff && !first)
 			return false;
 
 		new = count;
 
 		if (count & RWSEM_LOCK_MASK) {
-			if (has_handoff || (wstate != WRITER_HANDOFF))
+			if (has_handoff || !(first && timo))
 				return false;
 
 			new |= RWSEM_FLAG_HANDOFF;
@@ -707,6 +703,8 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
 			break;
 		}
 
+		// XXX also terminate on signal_pending_state(current->__state, current) ?
+
 		cpu_relax();
 	}
 
@@ -1019,11 +1017,10 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, unsigned int stat
 static struct rw_semaphore *
 rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 {
-	long count;
-	enum writer_wait_state wstate;
 	struct rwsem_waiter waiter;
 	struct rw_semaphore *ret = sem;
 	DEFINE_WAKE_Q(wake_q);
+	long count, flags = 0;
 
 	/* do optimistic spinning and steal lock if possible */
 	if (rwsem_can_spin_on_owner(sem) && rwsem_optimistic_spin(sem)) {
@@ -1041,13 +1038,10 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 
 	raw_spin_lock_irq(&sem->wait_lock);
 
-	/* account for this before adding a new element to the list */
-	wstate = list_empty(&sem->wait_list) ? WRITER_FIRST : WRITER_NOT_FIRST;
-
 	list_add_tail(&waiter.list, &sem->wait_list);
 
 	/* we're now waiting on the lock */
-	if (wstate == WRITER_NOT_FIRST) {
+	if (rwsem_first_waiter(sem) != &waiter) {
 		count = atomic_long_read(&sem->count);
 
 		/*
@@ -1083,22 +1077,14 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 	/* wait until we successfully acquire the lock */
 	set_current_state(state);
 	for (;;) {
-		if (rwsem_try_write_lock(sem, wstate)) {
+		if (rwsem_try_write_lock(sem, &waiter)) {
 			/* rwsem_try_write_lock() implies ACQUIRE on success */
 			break;
 		}
 
 		raw_spin_unlock_irq(&sem->wait_lock);
 
-		/*
-		 * After setting the handoff bit and failing to acquire
-		 * the lock, attempt to spin on owner to accelerate lock
-		 * transfer. If the previous owner is a on-cpu writer and it
-		 * has just released the lock, OWNER_NULL will be returned.
-		 * In this case, we attempt to acquire the lock again
-		 * without sleeping.
-		 */
-		if (wstate == WRITER_HANDOFF) {
+		if (rwsem_first_waiter(sem) == &waiter) {
 			enum owner_state owner_state;
 
 			preempt_disable();
@@ -1117,28 +1103,15 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 			schedule();
 			lockevent_inc(rwsem_sleep_writer);
 			set_current_state(state);
-			/*
-			 * If HANDOFF bit is set, unconditionally do
-			 * a trylock.
-			 */
-			if (wstate == WRITER_HANDOFF)
-				break;
 
-			if ((wstate == WRITER_NOT_FIRST) &&
-			    (rwsem_first_waiter(sem) == &waiter))
-				wstate = WRITER_FIRST;
+			if (rwsem_first_waiter(sem) != &waiter)
+				continue;
 
 			count = atomic_long_read(&sem->count);
 			if (!(count & RWSEM_LOCK_MASK))
 				break;
 
-			/*
-			 * The setting of the handoff bit is deferred
-			 * until rwsem_try_write_lock() is called.
-			 */
-			if ((wstate == WRITER_FIRST) && (rt_task(current) ||
-			    time_after(jiffies, waiter.timeout))) {
-				wstate = WRITER_HANDOFF;
+			if (rt_task(current) || time_after(jiffies, waiter.timeout)) {
 				lockevent_inc(rwsem_wlock_handoff);
 				break;
 			}
@@ -1156,15 +1129,21 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 out_nolock:
 	__set_current_state(TASK_RUNNING);
 	raw_spin_lock_irq(&sem->wait_lock);
-	list_del(&waiter.list);
 
-	if (unlikely(wstate == WRITER_HANDOFF))
-		atomic_long_add(-RWSEM_FLAG_HANDOFF,  &sem->count);
+	if (rwsem_first_waiter(sem) == &waiter)
+		flags |= RWSEM_FLAG_HANDOFF;
+
+	list_del(&waiter.list);
 
 	if (list_empty(&sem->wait_list))
-		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
-	else
+		flags |= RWSEM_FLAG_WAITERS | RWSEM_FLAG_HANDOFF;
+
+	if (flags)
+		atomic_long_andnot(flags, &sem->count);
+
+	if (!list_empty(&sem->wait_list))
 		rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+
 	raw_spin_unlock_irq(&sem->wait_lock);
 	wake_up_q(&wake_q);
 	lockevent_inc(rwsem_wlock_fail);

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set
  2021-11-11 15:08         ` Peter Zijlstra
@ 2021-11-11 19:14           ` Waiman Long
  2021-11-11 19:20             ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Waiman Long @ 2021-11-11 19:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hillf Danton, 马振华,
	mingo, will, boqun.feng, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1964 bytes --]


On 11/11/21 10:08, Peter Zijlstra wrote:
> On Wed, Nov 10, 2021 at 10:38:55PM +0100, Peter Zijlstra wrote:
>>
>> The comment above RWSEM_WRITER_LOCKED seems wrong/out-dated in that
>> there's a 4th place that modifies the HANDOFF bit namely
>> rwsem_down_read_slowpath() in the out_nolock: case.
>>
>> Now the thing I'm most worried about is that rwsem_down_write_slowpath()
>> modifies the HANDOFF bit depending on wstate, and wstate itself it not
>> determined under the same ->wait_lock section, so there could be a race
>> there.
>>
>> Another thing is that once wstate==HANDOFF, we rely on spin_on_owner()
>> to return OWNER_NULL such that it goes to trylock_again, however if it
>> returns anything else then we're at signal_pending_state() and the
>> observed race can happen.
>>
>> Now, spin_on_owner() *can* in fact return something else, consider
>> need_resched() being set for instance.
>>
>> Combined I think the observed race is valid.
>>
>> Now before we go make things more complicated, I think we should see if
>> we can make things simpler. Also I think perhaps the HANDOFF name here
>> is a misnomer.
>>
>> I agree that using _andnot() will fix this issue; I also agree with
>> folding it with the existing _andnot() already there. But let me stare a
>> little more at this code, something isn't making sense...
> I think I want to see WRITER_HANDOFF go away. And preferably all of
> wstate.
>
> Something like the *completely* untested below, might set fire to your
> pet, eat your granny, etc..
>
> Also, perhaps s/HANDOFF/PHASE_CHANGE/ ?
>
> Waiman, did I overlook something fundamental here?

The handoff bit is also set when the current writer is an RT task. You
missed that in your patch. The attached patch is my version of your
change. What do you think about that?

As for the PHASE_CHANGE name, we have to be consistent in both rwsem and 
mutex. Maybe a follow up patch if you think we should change the 
terminology.

Cheers,
Longman

[-- Attachment #2: 0001-locking-rwsem-Make-handoff-bit-handling-more-consist.patch --]
[-- Type: text/x-patch, Size: 10168 bytes --]

From 1c76a9c1b9d16d0ceb07f643803035177b4042a5 Mon Sep 17 00:00:00 2001
From: Waiman Long <longman@redhat.com>
Date: Thu, 11 Nov 2021 13:49:35 -0500
Subject: [PATCH] locking/rwsem: Make handoff bit handling more consistent

There are some inconsistencies in the way that the handoff bit is being
handled in readers and writers.

Firstly, when a queue head writer sets the handoff bit, it will clear it
when the writer is killed or interrupted on its way out without
acquiring the lock. That is not the case for a queue head reader. The
handoff bit will simply be inherited by the next waiter.

Secondly, in the out_nolock path of rwsem_down_read_slowpath(), both
the waiter and handoff bits are cleared if the wait queue becomes empty.
For rwsem_down_write_slowpath(), however, the handoff bit is not checked
and cleared if the wait queue is empty. This can potentially leave the
handoff bit set with an empty wait queue.

To make the handoff bit handling more consistent and robust, extract
the rwsem flags handling code into a common rwsem_out_nolock_clear_flags()
function and call it from both the reader's and writer's out_nolock paths.
The common function only uses atomic_long_andnot() to clear bits,
to avoid a possible race condition.

This eliminates the case of the handoff bit being set with an empty wait
queue, as well as the possible race condition that may screw up the count
value.

More state is stored in the rwsem_waiter structure, and the setting of the
writer handoff bit is pushed into rwsem_try_write_lock(). This simplifies the
trylock loop in rwsem_down_write_slowpath().

Fixes: 4f23dbc1e657 ("locking/rwsem: Implement lock handoff to prevent lock starvation")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.c | 109 ++++++++++++++++++-----------------------
 1 file changed, 49 insertions(+), 60 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index c51387a43265..b5fe21d5916d 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -104,10 +104,11 @@
  * atomic_long_fetch_add() is used to obtain reader lock, whereas
  * atomic_long_cmpxchg() will be used to obtain writer lock.
  *
- * There are three places where the lock handoff bit may be set or cleared.
- * 1) rwsem_mark_wake() for readers.
- * 2) rwsem_try_write_lock() for writers.
- * 3) Error path of rwsem_down_write_slowpath().
+ * There are four places where the lock handoff bit may be set or cleared.
+ * 1) rwsem_mark_wake() for readers            -- set, clear
+ * 2) rwsem_try_write_lock() for writers       -- set, clear
+ * 3) Error path of rwsem_down_write_slowpath() -- clear
+ * 4) Error path of rwsem_down_read_slowpath()  -- clear
  *
  * For all the above cases, wait_lock will be held. A writer must also
  * be the first one in the wait_list to be eligible for setting the handoff
@@ -334,6 +335,7 @@ struct rwsem_waiter {
 	struct task_struct *task;
 	enum rwsem_waiter_type type;
 	unsigned long timeout;
+	bool handoff_set, rt_task;
 };
 #define rwsem_first_waiter(sem) \
 	list_first_entry(&sem->wait_list, struct rwsem_waiter, list)
@@ -344,12 +346,6 @@ enum rwsem_wake_type {
 	RWSEM_WAKE_READ_OWNED	/* Waker thread holds the read lock */
 };
 
-enum writer_wait_state {
-	WRITER_NOT_FIRST,	/* Writer is not first in wait list */
-	WRITER_FIRST,		/* Writer is first in wait list     */
-	WRITER_HANDOFF		/* Writer is first & handoff needed */
-};
-
 /*
  * The typical HZ value is either 250 or 1000. So set the minimum waiting
  * time to at least 4ms or 1 jiffy (if it is higher than 4ms) in the wait
@@ -434,6 +430,7 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 			if (!(oldcount & RWSEM_FLAG_HANDOFF) &&
 			    time_after(jiffies, waiter->timeout)) {
 				adjustment -= RWSEM_FLAG_HANDOFF;
+				waiter->handoff_set = true;
 				lockevent_inc(rwsem_rlock_handoff);
 			}
 
@@ -531,14 +528,12 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
  * This function must be called with the sem->wait_lock held to prevent
  * race conditions between checking the rwsem wait list and setting the
  * sem->count accordingly.
- *
- * If wstate is WRITER_HANDOFF, it will make sure that either the handoff
- * bit is set or the lock is acquired with handoff bit cleared.
  */
 static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
-					enum writer_wait_state wstate)
+					struct rwsem_waiter *waiter)
 {
 	long count, new;
+	bool first = rwsem_first_waiter(sem) == waiter;
 
 	lockdep_assert_held(&sem->wait_lock);
 
@@ -546,13 +541,14 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
 	do {
 		bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
 
-		if (has_handoff && wstate == WRITER_NOT_FIRST)
+		if (has_handoff && !first)
 			return false;
 
 		new = count;
 
 		if (count & RWSEM_LOCK_MASK) {
-			if (has_handoff || (wstate != WRITER_HANDOFF))
+			if (has_handoff || (!waiter->rt_task &&
+					    !time_after(jiffies, waiter->timeout)))
 				return false;
 
 			new |= RWSEM_FLAG_HANDOFF;
@@ -569,8 +565,11 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
 	 * We have either acquired the lock with handoff bit cleared or
 	 * set the handoff bit.
 	 */
-	if (new & RWSEM_FLAG_HANDOFF)
+	if (new & RWSEM_FLAG_HANDOFF) {
+		waiter->handoff_set = true;
+		lockevent_inc(rwsem_wlock_handoff);
 		return false;
+	}
 
 	rwsem_set_owner(sem);
 	return true;
@@ -889,6 +888,24 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
 }
 #endif
 
+/*
+ * Common code to handle rwsem flags in out_nolock path with wait_lock held.
+ */
+static inline void rwsem_out_nolock_clear_flags(struct rw_semaphore *sem,
+						struct rwsem_waiter *waiter)
+{
+	long flags = 0;
+
+	list_del(&waiter->list);
+	if (list_empty(&sem->wait_list))
+		flags = RWSEM_FLAG_HANDOFF | RWSEM_FLAG_WAITERS;
+	else if (waiter->handoff_set)
+		flags = RWSEM_FLAG_HANDOFF;
+
+	if (flags)
+		atomic_long_andnot(flags,  &sem->count);
+}
+
 /*
  * Wait for the read lock to be granted
  */
@@ -936,6 +953,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, unsigned int stat
 	waiter.task = current;
 	waiter.type = RWSEM_WAITING_FOR_READ;
 	waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
+	waiter.handoff_set = false;
 
 	raw_spin_lock_irq(&sem->wait_lock);
 	if (list_empty(&sem->wait_list)) {
@@ -1002,11 +1020,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, unsigned int stat
 	return sem;
 
 out_nolock:
-	list_del(&waiter.list);
-	if (list_empty(&sem->wait_list)) {
-		atomic_long_andnot(RWSEM_FLAG_WAITERS|RWSEM_FLAG_HANDOFF,
-				   &sem->count);
-	}
+	rwsem_out_nolock_clear_flags(sem, &waiter);
 	raw_spin_unlock_irq(&sem->wait_lock);
 	__set_current_state(TASK_RUNNING);
 	lockevent_inc(rwsem_rlock_fail);
@@ -1020,7 +1034,6 @@ static struct rw_semaphore *
 rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 {
 	long count;
-	enum writer_wait_state wstate;
 	struct rwsem_waiter waiter;
 	struct rw_semaphore *ret = sem;
 	DEFINE_WAKE_Q(wake_q);
@@ -1038,16 +1051,13 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 	waiter.task = current;
 	waiter.type = RWSEM_WAITING_FOR_WRITE;
 	waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
+	waiter.rt_task = rt_task(current);
 
 	raw_spin_lock_irq(&sem->wait_lock);
-
-	/* account for this before adding a new element to the list */
-	wstate = list_empty(&sem->wait_list) ? WRITER_FIRST : WRITER_NOT_FIRST;
-
 	list_add_tail(&waiter.list, &sem->wait_list);
 
 	/* we're now waiting on the lock */
-	if (wstate == WRITER_NOT_FIRST) {
+	if (rwsem_first_waiter(sem) != &waiter) {
 		count = atomic_long_read(&sem->count);
 
 		/*
@@ -1083,7 +1093,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 	/* wait until we successfully acquire the lock */
 	set_current_state(state);
 	for (;;) {
-		if (rwsem_try_write_lock(sem, wstate)) {
+		if (rwsem_try_write_lock(sem, &waiter)) {
 			/* rwsem_try_write_lock() implies ACQUIRE on success */
 			break;
 		}
@@ -1098,9 +1108,12 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 		 * In this case, we attempt to acquire the lock again
 		 * without sleeping.
 		 */
-		if (wstate == WRITER_HANDOFF) {
+		if (waiter.handoff_set) {
 			enum owner_state owner_state;
 
+			if (signal_pending_state(state, current))
+				goto out_nolock;
+
 			preempt_disable();
 			owner_state = rwsem_spin_on_owner(sem);
 			preempt_enable();
@@ -1117,31 +1130,14 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 			schedule();
 			lockevent_inc(rwsem_sleep_writer);
 			set_current_state(state);
-			/*
-			 * If HANDOFF bit is set, unconditionally do
-			 * a trylock.
-			 */
-			if (wstate == WRITER_HANDOFF)
-				break;
-
-			if ((wstate == WRITER_NOT_FIRST) &&
-			    (rwsem_first_waiter(sem) == &waiter))
-				wstate = WRITER_FIRST;
-
-			count = atomic_long_read(&sem->count);
-			if (!(count & RWSEM_LOCK_MASK))
-				break;
 
 			/*
-			 * The setting of the handoff bit is deferred
-			 * until rwsem_try_write_lock() is called.
+			 * Unconditionally do a trylock and spinning if
+			 * HANDOFF bit is set.
 			 */
-			if ((wstate == WRITER_FIRST) && (rt_task(current) ||
-			    time_after(jiffies, waiter.timeout))) {
-				wstate = WRITER_HANDOFF;
-				lockevent_inc(rwsem_wlock_handoff);
+			if (waiter.handoff_set ||
+			   !(atomic_long_read(&sem->count) & RWSEM_LOCK_MASK))
 				break;
-			}
 		}
 trylock_again:
 		raw_spin_lock_irq(&sem->wait_lock);
@@ -1156,19 +1152,12 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 out_nolock:
 	__set_current_state(TASK_RUNNING);
 	raw_spin_lock_irq(&sem->wait_lock);
-	list_del(&waiter.list);
-
-	if (unlikely(wstate == WRITER_HANDOFF))
-		atomic_long_add(-RWSEM_FLAG_HANDOFF,  &sem->count);
-
-	if (list_empty(&sem->wait_list))
-		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
-	else
+	rwsem_out_nolock_clear_flags(sem, &waiter);
+	if (!list_empty(&sem->wait_list))
 		rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
 	raw_spin_unlock_irq(&sem->wait_lock);
 	wake_up_q(&wake_q);
 	lockevent_inc(rwsem_wlock_fail);
-
 	return ERR_PTR(-EINTR);
 }
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set
  2021-11-11 19:14           ` Waiman Long
@ 2021-11-11 19:20             ` Peter Zijlstra
  2021-11-11 19:36               ` Waiman Long
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2021-11-11 19:20 UTC (permalink / raw)
  To: Waiman Long
  Cc: Hillf Danton, 马振华,
	mingo, will, boqun.feng, linux-kernel

On Thu, Nov 11, 2021 at 02:14:48PM -0500, Waiman Long wrote:
> As for the PHASE_CHANGE name, we have to be consistent in both rwsem and
> mutex. Maybe a follow up patch if you think we should change the
> terminology.

Well, that's exactly the point, they do radically different things.
Having the same name for two different things is confusing.

Anyway, let me go read that patch you sent.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set
  2021-11-11 19:20             ` Peter Zijlstra
@ 2021-11-11 19:36               ` Waiman Long
  2021-11-11 19:52                 ` Waiman Long
                                   ` (3 more replies)
  0 siblings, 4 replies; 26+ messages in thread
From: Waiman Long @ 2021-11-11 19:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hillf Danton, 马振华,
	mingo, will, boqun.feng, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 896 bytes --]


On 11/11/21 14:20, Peter Zijlstra wrote:
> On Thu, Nov 11, 2021 at 02:14:48PM -0500, Waiman Long wrote:
>> As for the PHASE_CHANGE name, we have to be consistent in both rwsem and
>> mutex. Maybe a follow up patch if you think we should change the
>> terminology.
> Well, that's exactly the point, they do radically different things.
> Having the same name for two different things is confusing.
>
> Anyway, let me go read that patch you sent.

My understanding of handoff is to disable optimistic spinning to let
waiters in the wait queue have an opportunity to acquire the lock. There
are differences in the details of how to do that in mutex and rwsem, though.

BTW, I have decided that we should further simplify the trylock for loop
in rwsem_down_write_slowpath() to make it easier to read. That is the
only difference in the attached v2 patch compared with the previous one.

Cheers,
LongmaN

[-- Attachment #2: v2-0001-locking-rwsem-Make-handoff-bit-handling-more-cons.patch --]
[-- Type: text/x-patch, Size: 10361 bytes --]

From 956fb4bf54e61735c0df1fedf97e59a1972efbca Mon Sep 17 00:00:00 2001
From: Waiman Long <longman@redhat.com>
Date: Thu, 11 Nov 2021 14:27:35 -0500
Subject: [PATCH v2] locking/rwsem: Make handoff bit handling more consistent

There are some inconsistencies in the way that the handoff bit is being
handled in readers and writers.

Firstly, when a queue head writer sets the handoff bit, it will clear it
when the writer is killed or interrupted on its way out without
acquiring the lock. That is not the case for a queue head reader. The
handoff bit will simply be inherited by the next waiter.

Secondly, in the out_nolock path of rwsem_down_read_slowpath(), both
the waiter and handoff bits are cleared if the wait queue becomes empty.
For rwsem_down_write_slowpath(), however, the handoff bit is not checked
and cleared if the wait queue is empty. This can potentially leave the
handoff bit set with an empty wait queue.

To make the handoff bit handling more consistent and robust, extract
the rwsem flags handling code into a common rwsem_out_nolock_clear_flags()
function and call it from both the reader's and writer's out_nolock paths.
The common function only uses atomic_long_andnot() to clear bits,
to avoid a possible race condition.

This eliminates the case of the handoff bit being set with an empty wait
queue, as well as the possible race condition that may screw up the count
value.

More state is stored in the rwsem_waiter structure, and the setting of the
writer handoff bit is pushed into rwsem_try_write_lock(). This simplifies the
trylock for-loop in rwsem_down_write_slowpath().

Fixes: 4f23dbc1e657 ("locking/rwsem: Implement lock handoff to prevent lock starvation")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.c | 121 ++++++++++++++++-------------------------
 1 file changed, 48 insertions(+), 73 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index c51387a43265..a876a3acc383 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -104,10 +104,11 @@
  * atomic_long_fetch_add() is used to obtain reader lock, whereas
  * atomic_long_cmpxchg() will be used to obtain writer lock.
  *
- * There are three places where the lock handoff bit may be set or cleared.
- * 1) rwsem_mark_wake() for readers.
- * 2) rwsem_try_write_lock() for writers.
- * 3) Error path of rwsem_down_write_slowpath().
+ * There are four places where the lock handoff bit may be set or cleared.
+ * 1) rwsem_mark_wake() for readers            -- set, clear
+ * 2) rwsem_try_write_lock() for writers       -- set, clear
+ * 3) Error path of rwsem_down_write_slowpath() -- clear
+ * 4) Error path of rwsem_down_read_slowpath()  -- clear
  *
  * For all the above cases, wait_lock will be held. A writer must also
  * be the first one in the wait_list to be eligible for setting the handoff
@@ -334,6 +335,7 @@ struct rwsem_waiter {
 	struct task_struct *task;
 	enum rwsem_waiter_type type;
 	unsigned long timeout;
+	bool handoff_set, rt_task;
 };
 #define rwsem_first_waiter(sem) \
 	list_first_entry(&sem->wait_list, struct rwsem_waiter, list)
@@ -344,12 +346,6 @@ enum rwsem_wake_type {
 	RWSEM_WAKE_READ_OWNED	/* Waker thread holds the read lock */
 };
 
-enum writer_wait_state {
-	WRITER_NOT_FIRST,	/* Writer is not first in wait list */
-	WRITER_FIRST,		/* Writer is first in wait list     */
-	WRITER_HANDOFF		/* Writer is first & handoff needed */
-};
-
 /*
  * The typical HZ value is either 250 or 1000. So set the minimum waiting
  * time to at least 4ms or 1 jiffy (if it is higher than 4ms) in the wait
@@ -434,6 +430,7 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
 			if (!(oldcount & RWSEM_FLAG_HANDOFF) &&
 			    time_after(jiffies, waiter->timeout)) {
 				adjustment -= RWSEM_FLAG_HANDOFF;
+				waiter->handoff_set = true;
 				lockevent_inc(rwsem_rlock_handoff);
 			}
 
@@ -531,14 +528,12 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
  * This function must be called with the sem->wait_lock held to prevent
  * race conditions between checking the rwsem wait list and setting the
  * sem->count accordingly.
- *
- * If wstate is WRITER_HANDOFF, it will make sure that either the handoff
- * bit is set or the lock is acquired with handoff bit cleared.
  */
 static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
-					enum writer_wait_state wstate)
+					struct rwsem_waiter *waiter)
 {
 	long count, new;
+	bool first = rwsem_first_waiter(sem) == waiter;
 
 	lockdep_assert_held(&sem->wait_lock);
 
@@ -546,13 +541,14 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
 	do {
 		bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
 
-		if (has_handoff && wstate == WRITER_NOT_FIRST)
+		if (has_handoff && !first)
 			return false;
 
 		new = count;
 
 		if (count & RWSEM_LOCK_MASK) {
-			if (has_handoff || (wstate != WRITER_HANDOFF))
+			if (has_handoff || (!waiter->rt_task &&
+					    !time_after(jiffies, waiter->timeout)))
 				return false;
 
 			new |= RWSEM_FLAG_HANDOFF;
@@ -569,8 +565,11 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
 	 * We have either acquired the lock with handoff bit cleared or
 	 * set the handoff bit.
 	 */
-	if (new & RWSEM_FLAG_HANDOFF)
+	if (new & RWSEM_FLAG_HANDOFF) {
+		waiter->handoff_set = true;
+		lockevent_inc(rwsem_wlock_handoff);
 		return false;
+	}
 
 	rwsem_set_owner(sem);
 	return true;
@@ -889,6 +888,24 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
 }
 #endif
 
+/*
+ * Common code to handle rwsem flags in out_nolock path with wait_lock held.
+ */
+static inline void rwsem_out_nolock_clear_flags(struct rw_semaphore *sem,
+						struct rwsem_waiter *waiter)
+{
+	long flags = 0;
+
+	list_del(&waiter->list);
+	if (list_empty(&sem->wait_list))
+		flags = RWSEM_FLAG_HANDOFF | RWSEM_FLAG_WAITERS;
+	else if (waiter->handoff_set)
+		flags = RWSEM_FLAG_HANDOFF;
+
+	if (flags)
+		atomic_long_andnot(flags,  &sem->count);
+}
+
 /*
  * Wait for the read lock to be granted
  */
@@ -936,6 +953,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, unsigned int stat
 	waiter.task = current;
 	waiter.type = RWSEM_WAITING_FOR_READ;
 	waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
+	waiter.handoff_set = false;
 
 	raw_spin_lock_irq(&sem->wait_lock);
 	if (list_empty(&sem->wait_list)) {
@@ -1002,11 +1020,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, unsigned int stat
 	return sem;
 
 out_nolock:
-	list_del(&waiter.list);
-	if (list_empty(&sem->wait_list)) {
-		atomic_long_andnot(RWSEM_FLAG_WAITERS|RWSEM_FLAG_HANDOFF,
-				   &sem->count);
-	}
+	rwsem_out_nolock_clear_flags(sem, &waiter);
 	raw_spin_unlock_irq(&sem->wait_lock);
 	__set_current_state(TASK_RUNNING);
 	lockevent_inc(rwsem_rlock_fail);
@@ -1020,7 +1034,6 @@ static struct rw_semaphore *
 rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 {
 	long count;
-	enum writer_wait_state wstate;
 	struct rwsem_waiter waiter;
 	struct rw_semaphore *ret = sem;
 	DEFINE_WAKE_Q(wake_q);
@@ -1038,16 +1051,13 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 	waiter.task = current;
 	waiter.type = RWSEM_WAITING_FOR_WRITE;
 	waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
+	waiter.rt_task = rt_task(current);
 
 	raw_spin_lock_irq(&sem->wait_lock);
-
-	/* account for this before adding a new element to the list */
-	wstate = list_empty(&sem->wait_list) ? WRITER_FIRST : WRITER_NOT_FIRST;
-
 	list_add_tail(&waiter.list, &sem->wait_list);
 
 	/* we're now waiting on the lock */
-	if (wstate == WRITER_NOT_FIRST) {
+	if (rwsem_first_waiter(sem) != &waiter) {
 		count = atomic_long_read(&sem->count);
 
 		/*
@@ -1083,13 +1093,16 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 	/* wait until we successfully acquire the lock */
 	set_current_state(state);
 	for (;;) {
-		if (rwsem_try_write_lock(sem, wstate)) {
+		if (rwsem_try_write_lock(sem, &waiter)) {
 			/* rwsem_try_write_lock() implies ACQUIRE on success */
 			break;
 		}
 
 		raw_spin_unlock_irq(&sem->wait_lock);
 
+		if (signal_pending_state(state, current))
+			goto out_nolock;
+
 		/*
 		 * After setting the handoff bit and failing to acquire
 		 * the lock, attempt to spin on owner to accelerate lock
@@ -1098,7 +1111,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 		 * In this case, we attempt to acquire the lock again
 		 * without sleeping.
 		 */
-		if (wstate == WRITER_HANDOFF) {
+		if (waiter.handoff_set) {
 			enum owner_state owner_state;
 
 			preempt_disable();
@@ -1109,40 +1122,9 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 				goto trylock_again;
 		}
 
-		/* Block until there are no active lockers. */
-		for (;;) {
-			if (signal_pending_state(state, current))
-				goto out_nolock;
-
-			schedule();
-			lockevent_inc(rwsem_sleep_writer);
-			set_current_state(state);
-			/*
-			 * If HANDOFF bit is set, unconditionally do
-			 * a trylock.
-			 */
-			if (wstate == WRITER_HANDOFF)
-				break;
-
-			if ((wstate == WRITER_NOT_FIRST) &&
-			    (rwsem_first_waiter(sem) == &waiter))
-				wstate = WRITER_FIRST;
-
-			count = atomic_long_read(&sem->count);
-			if (!(count & RWSEM_LOCK_MASK))
-				break;
-
-			/*
-			 * The setting of the handoff bit is deferred
-			 * until rwsem_try_write_lock() is called.
-			 */
-			if ((wstate == WRITER_FIRST) && (rt_task(current) ||
-			    time_after(jiffies, waiter.timeout))) {
-				wstate = WRITER_HANDOFF;
-				lockevent_inc(rwsem_wlock_handoff);
-				break;
-			}
-		}
+		schedule();
+		lockevent_inc(rwsem_sleep_writer);
+		set_current_state(state);
 trylock_again:
 		raw_spin_lock_irq(&sem->wait_lock);
 	}
@@ -1156,19 +1138,12 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 out_nolock:
 	__set_current_state(TASK_RUNNING);
 	raw_spin_lock_irq(&sem->wait_lock);
-	list_del(&waiter.list);
-
-	if (unlikely(wstate == WRITER_HANDOFF))
-		atomic_long_add(-RWSEM_FLAG_HANDOFF,  &sem->count);
-
-	if (list_empty(&sem->wait_list))
-		atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
-	else
+	rwsem_out_nolock_clear_flags(sem, &waiter);
+	if (!list_empty(&sem->wait_list))
 		rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
 	raw_spin_unlock_irq(&sem->wait_lock);
 	wake_up_q(&wake_q);
 	lockevent_inc(rwsem_wlock_fail);
-
 	return ERR_PTR(-EINTR);
 }
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set
  2021-11-11 19:36               ` Waiman Long
@ 2021-11-11 19:52                 ` Waiman Long
  2021-11-11 20:26                 ` Peter Zijlstra
                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 26+ messages in thread
From: Waiman Long @ 2021-11-11 19:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hillf Danton, 马振华,
	mingo, will, boqun.feng, linux-kernel


On 11/11/21 14:36, Waiman Long wrote:
>
> On 11/11/21 14:20, Peter Zijlstra wrote:
>> On Thu, Nov 11, 2021 at 02:14:48PM -0500, Waiman Long wrote:
>>> As for the PHASE_CHANGE name, we have to be consistent in both rwsem 
>>> and
>>> mutex. Maybe a follow up patch if you think we should change the
>>> terminology.
>> Well, that's exactly the point, they do radically different things.
>> Having the same name for two different things is confusing.
>>
>> Anyway, let me go read that patch you sent.
>
> My understanding of handoff is to disable optimistic spinning to let 
> waiters in the wait queue have an opportunity to acquire the lock. 
> There are difference in details on how to do that in mutex and rwsem, 
> though.
>
> BTW, I have decided that we should further simply the trylock for loop 
> in rwsem_down_write_slowpath() to make it easier to read. That is the 
> only difference in the attached v2 patch compared with the previous one.

My bad, I forgot to initialize waiter.handoff_set in 
rwsem_down_write_slowpath(). I will send out an updated version once you 
have finished your review.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set
  2021-11-11 19:36               ` Waiman Long
  2021-11-11 19:52                 ` Waiman Long
@ 2021-11-11 20:26                 ` Peter Zijlstra
  2021-11-11 21:01                   ` Waiman Long
  2021-11-11 20:35                 ` Peter Zijlstra
  2021-11-11 20:50                 ` Peter Zijlstra
  3 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2021-11-11 20:26 UTC (permalink / raw)
  To: Waiman Long
  Cc: Hillf Danton, 马振华,
	mingo, will, boqun.feng, linux-kernel

On Thu, Nov 11, 2021 at 02:36:52PM -0500, Waiman Long wrote:

> @@ -434,6 +430,7 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
>  			if (!(oldcount & RWSEM_FLAG_HANDOFF) &&
>  			    time_after(jiffies, waiter->timeout)) {
>  				adjustment -= RWSEM_FLAG_HANDOFF;
> +				waiter->handoff_set = true;
>  				lockevent_inc(rwsem_rlock_handoff);
>  			}
>  

Do we really need this flag? Wouldn't it be the same as waiter-is-first
AND sem-has-handoff ?

>  static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
> +					struct rwsem_waiter *waiter)
>  {
>  	long count, new;
> +	bool first = rwsem_first_waiter(sem) == waiter;

flip those lines for reverse xmas tree order, please

>  
>  	lockdep_assert_held(&sem->wait_lock);
>  
> @@ -546,13 +541,14 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
>  	do {
>  		bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
>  
> -		if (has_handoff && wstate == WRITER_NOT_FIRST)
> +		if (has_handoff && !first)
>  			return false;
>  
>  		new = count;
>  
>  		if (count & RWSEM_LOCK_MASK) {
> -			if (has_handoff || (wstate != WRITER_HANDOFF))
> +			if (has_handoff || (!waiter->rt_task &&
> +					    !time_after(jiffies, waiter->timeout)))


Does ->rt_task really help over rt_task(current) ? I suppose there's an
argument for locality, but that should be pretty much it, no?

>  				return false;
>  
>  			new |= RWSEM_FLAG_HANDOFF;
> @@ -889,6 +888,24 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
>  }
>  #endif
>  
> +/*
> + * Common code to handle rwsem flags in out_nolock path with wait_lock held.
> + */
> +static inline void rwsem_out_nolock_clear_flags(struct rw_semaphore *sem,
> +						struct rwsem_waiter *waiter)
> +{
> +	long flags = 0;
> +
> +	list_del(&waiter->list);
> +	if (list_empty(&sem->wait_list))
> +		flags = RWSEM_FLAG_HANDOFF | RWSEM_FLAG_WAITERS;
> +	else if (waiter->handoff_set)
> +		flags = RWSEM_FLAG_HANDOFF;
> +
> +	if (flags)
> +		atomic_long_andnot(flags,  &sem->count);
> +}

Right, so I like sharing this between the two _slowpath functions, that
makes sense.

The difference between this and my approach is that I unconditionally
clear HANDOFF when @waiter was the first. Because if it was set, it
must've been ours, and if it wasn't set, clearing it doesn't really hurt
much. This is an unlikely path; I don't think the potential extra
atomic is an issue here.
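
A rough sketch of that variant, reusing the helper name from the quoted
patch (illustration only, not the posted code; assumes wait_lock is held
as it is in the out_nolock path):

static inline void rwsem_out_nolock_clear_flags(struct rw_semaphore *sem,
						struct rwsem_waiter *waiter)
{
	bool first = rwsem_first_waiter(sem) == waiter;

	list_del(&waiter->list);
	if (list_empty(&sem->wait_list)) {
		atomic_long_andnot(RWSEM_FLAG_HANDOFF | RWSEM_FLAG_WAITERS,
				   &sem->count);
	} else if (first) {
		/* If HANDOFF is set it can only have been ours; clearing an
		 * already-clear bit in this unlikely path is harmless. */
		atomic_long_andnot(RWSEM_FLAG_HANDOFF, &sem->count);
	}
}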

> +
>  /*
>   * Wait for the read lock to be granted
>   */
> @@ -936,6 +953,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, unsigned int stat
>  	waiter.task = current;
>  	waiter.type = RWSEM_WAITING_FOR_READ;
>  	waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
> +	waiter.handoff_set = false;

Forgot to set rt_task

>  
>  	raw_spin_lock_irq(&sem->wait_lock);
>  	if (list_empty(&sem->wait_list)) {

> @@ -1038,16 +1051,13 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
>  	waiter.task = current;
>  	waiter.type = RWSEM_WAITING_FOR_WRITE;
>  	waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;

Forgot to set handoff_set

> +	waiter.rt_task = rt_task(current);
>  
>  	raw_spin_lock_irq(&sem->wait_lock);

Again, I'm not convinced we need these variables.

> @@ -1083,13 +1093,16 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
>  	/* wait until we successfully acquire the lock */
>  	set_current_state(state);
>  	for (;;) {
> -		if (rwsem_try_write_lock(sem, wstate)) {
> +		if (rwsem_try_write_lock(sem, &waiter)) {
>  			/* rwsem_try_write_lock() implies ACQUIRE on success */
>  			break;
>  		}
>  
>  		raw_spin_unlock_irq(&sem->wait_lock);
>  
> +		if (signal_pending_state(state, current))
> +			goto out_nolock;
> +
>  		/*
>  		 * After setting the handoff bit and failing to acquire
>  		 * the lock, attempt to spin on owner to accelerate lock
> @@ -1098,7 +1111,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
>  		 * In this case, we attempt to acquire the lock again
>  		 * without sleeping.
>  		 */
> -		if (wstate == WRITER_HANDOFF) {
> +		if (waiter.handoff_set) {
>  			enum owner_state owner_state;
>  
>  			preempt_disable();

Does it matter much if we spin-wait for every first waiter or only for handoff?

Either way around, I think the spin-wait ought to terminate on sigpending
(same for mutex, I suppose).
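
As a rough illustration of where such a check could sit in the writer
slow path quoted above (sketch only; making the spin loop itself bail on
sigpending would need a change inside rwsem_spin_on_owner(), which is
exactly the follow-up being discussed here):

		if (waiter.handoff_set) {
			enum owner_state owner_state;

			preempt_disable();
			owner_state = rwsem_spin_on_owner(sem);
			preempt_enable();

			/* stop the opportunistic spin once a signal is pending */
			if (signal_pending_state(state, current))
				goto out_nolock;

			if (owner_state == OWNER_NULL)
				goto trylock_again;
		}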

> @@ -1109,40 +1122,9 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
>  				goto trylock_again;
>  		}
>  
> -		/* Block until there are no active lockers. */
> -		for (;;) {
> -			if (signal_pending_state(state, current))
> -				goto out_nolock;
> -
> -			schedule();
> -			lockevent_inc(rwsem_sleep_writer);
> -			set_current_state(state);
> -			/*
> -			 * If HANDOFF bit is set, unconditionally do
> -			 * a trylock.
> -			 */
> -			if (wstate == WRITER_HANDOFF)
> -				break;
> -
> -			if ((wstate == WRITER_NOT_FIRST) &&
> -			    (rwsem_first_waiter(sem) == &waiter))
> -				wstate = WRITER_FIRST;
> -
> -			count = atomic_long_read(&sem->count);
> -			if (!(count & RWSEM_LOCK_MASK))
> -				break;
> -
> -			/*
> -			 * The setting of the handoff bit is deferred
> -			 * until rwsem_try_write_lock() is called.
> -			 */
> -			if ((wstate == WRITER_FIRST) && (rt_task(current) ||
> -			    time_after(jiffies, waiter.timeout))) {
> -				wstate = WRITER_HANDOFF;
> -				lockevent_inc(rwsem_wlock_handoff);
> -				break;
> -			}
> -		}
> +		schedule();
> +		lockevent_inc(rwsem_sleep_writer);
> +		set_current_state(state);
>  trylock_again:
>  		raw_spin_lock_irq(&sem->wait_lock);
>  	}

Nice.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set
  2021-11-11 19:36               ` Waiman Long
  2021-11-11 19:52                 ` Waiman Long
  2021-11-11 20:26                 ` Peter Zijlstra
@ 2021-11-11 20:35                 ` Peter Zijlstra
  2021-11-11 20:39                   ` Peter Zijlstra
  2021-11-11 20:45                   ` Waiman Long
  2021-11-11 20:50                 ` Peter Zijlstra
  3 siblings, 2 replies; 26+ messages in thread
From: Peter Zijlstra @ 2021-11-11 20:35 UTC (permalink / raw)
  To: Waiman Long
  Cc: Hillf Danton, 马振华,
	mingo, will, boqun.feng, linux-kernel

On Thu, Nov 11, 2021 at 02:36:52PM -0500, Waiman Long wrote:
> 
> On 11/11/21 14:20, Peter Zijlstra wrote:
> > On Thu, Nov 11, 2021 at 02:14:48PM -0500, Waiman Long wrote:
> > > As for the PHASE_CHANGE name, we have to be consistent in both rwsem and
> > > mutex. Maybe a follow up patch if you think we should change the
> > > terminology.
> > Well, that's exactly the point, they do radically different things.
> > Having the same name for two different things is confusing.
> > 
> > Anyway, let me go read that patch you sent.
> 
> My understanding of handoff is to disable optimistic spinning to let waiters
> in the wait queue have an opportunity to acquire the lock. There are
> difference in details on how to do that in mutex and rwsem, though.

Ah, but the mutex does an actual hand-off: it hands the lock to a
specific waiting task. That is, unlock() sets owner, as opposed to
trylock().

The rwsem code doesn't; it just forces a phase change. Once a waiter has
been blocked too long, the handoff bit is set, causing new readers to be
blocked. Then we wait for existing readers to complete. At that point,
any next waiter (most likely a writer) should really get the lock (and
in that regard the rwsem code is a bit funny).

So while both ensure fairness, the means of doing so is quite different.
One hands the lock ownership to a specific waiter, the other arranges
for a quiescent state such that the next waiter can proceed.
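
The "new readers are blocked" part falls out of the read fast path's
failure mask; roughly (simplified sketch with a hypothetical wrapper
name, not the exact kernel code):

static inline void down_read_sketch(struct rw_semaphore *sem)
{
	long cnt = atomic_long_add_return_acquire(RWSEM_READER_BIAS,
						  &sem->count);

	/* RWSEM_READ_FAILED_MASK includes RWSEM_FLAG_HANDOFF, so once the
	 * bit is set every new reader drops into the slow path and queues
	 * behind the existing waiters. */
	if (cnt & RWSEM_READ_FAILED_MASK)
		rwsem_down_read_slowpath(sem, cnt, TASK_UNINTERRUPTIBLE);
}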


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set
  2021-11-11 20:35                 ` Peter Zijlstra
@ 2021-11-11 20:39                   ` Peter Zijlstra
  2021-11-11 20:45                   ` Waiman Long
  1 sibling, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2021-11-11 20:39 UTC (permalink / raw)
  To: Waiman Long
  Cc: Hillf Danton, 马振华,
	mingo, will, boqun.feng, linux-kernel

On Thu, Nov 11, 2021 at 09:35:00PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 11, 2021 at 02:36:52PM -0500, Waiman Long wrote:
> > 
> > On 11/11/21 14:20, Peter Zijlstra wrote:
> > > On Thu, Nov 11, 2021 at 02:14:48PM -0500, Waiman Long wrote:
> > > > As for the PHASE_CHANGE name, we have to be consistent in both rwsem and
> > > > mutex. Maybe a follow up patch if you think we should change the
> > > > terminology.
> > > Well, that's exactly the point, they do radically different things.
> > > Having the same name for two different things is confusing.
> > > 
> > > Anyway, let me go read that patch you sent.
> > 
> > My understanding of handoff is to disable optimistic spinning to let waiters
> > in the wait queue have an opportunity to acquire the lock. There are
> > difference in details on how to do that in mutex and rwsem, though.
> 
> Ah, but the mutex does an actual hand-off, it hands the lock to a
> specific waiting task. That is, unlock() sets owner, as opposed to
> trylock().
> 
> The rwsem code doesn't, it just forces a phase change. Once a waiter has
> been blocked too long, the handoff bit is set, causing new readers to be
> blocked. Then we wait for existing readers to complete. At that point,
> any next waiter (most likely a writer) should really get the lock (and
> in that regards the rwsem code is a bit funny).

And this is, I think, the thing you tried in your earlier inherit patch.
Keep the quiescent state and simply let whatever next waiter is in line
have a go.

I suspect that change is easier now. But I've not tried.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set
  2021-11-11 20:35                 ` Peter Zijlstra
  2021-11-11 20:39                   ` Peter Zijlstra
@ 2021-11-11 20:45                   ` Waiman Long
  2021-11-11 21:27                     ` Peter Zijlstra
  1 sibling, 1 reply; 26+ messages in thread
From: Waiman Long @ 2021-11-11 20:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hillf Danton, 马振华,
	mingo, will, boqun.feng, linux-kernel


On 11/11/21 15:35, Peter Zijlstra wrote:
> On Thu, Nov 11, 2021 at 02:36:52PM -0500, Waiman Long wrote:
>> On 11/11/21 14:20, Peter Zijlstra wrote:
>>> On Thu, Nov 11, 2021 at 02:14:48PM -0500, Waiman Long wrote:
>>>> As for the PHASE_CHANGE name, we have to be consistent in both rwsem and
>>>> mutex. Maybe a follow up patch if you think we should change the
>>>> terminology.
>>> Well, that's exactly the point, they do radically different things.
>>> Having the same name for two different things is confusing.
>>>
>>> Anyway, let me go read that patch you sent.
>> My understanding of handoff is to disable optimistic spinning to let waiters
>> in the wait queue have an opportunity to acquire the lock. There are
>> difference in details on how to do that in mutex and rwsem, though.
> Ah, but the mutex does an actual hand-off, it hands the lock to a
> specific waiting task. That is, unlock() sets owner, as opposed to
> trylock().
>
> The rwsem code doesn't, it just forces a phase change. Once a waiter has
> been blocked too long, the handoff bit is set, causing new readers to be
> blocked. Then we wait for existing readers to complete. At that point,
> any next waiter (most likely a writer) should really get the lock (and
> in that regards the rwsem code is a bit funny).
>
> So while both ensure fairness, the means of doing so is quite different.
> One hands the lock ownership to a specific waiter, the other arranges
> for a quiescent state such that the next waiter can proceed.

That is a valid argument. However, the name PHASE_CHANGE sounds weird to 
me. I am not objecting to changing the term, but perhaps a better name 
like NO_OPTSPIN or NO_LOCKSTEALING would emphasize the fact that 
optimistic spinning or lock stealing should not be allowed.

Anyway, it will be a follow-up patch.

Cheers,
Longman



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set
  2021-11-11 19:36               ` Waiman Long
                                   ` (2 preceding siblings ...)
  2021-11-11 20:35                 ` Peter Zijlstra
@ 2021-11-11 20:50                 ` Peter Zijlstra
  2021-11-11 21:09                   ` Waiman Long
  3 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2021-11-11 20:50 UTC (permalink / raw)
  To: Waiman Long
  Cc: Hillf Danton, 马振华,
	mingo, will, boqun.feng, linux-kernel


So I suspect that if..

On Thu, Nov 11, 2021 at 02:36:52PM -0500, Waiman Long wrote:
>  static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
> -					enum writer_wait_state wstate)
> +					struct rwsem_waiter *waiter)
>  {
>  	long count, new;
> +	bool first = rwsem_first_waiter(sem) == waiter;
>  
>  	lockdep_assert_held(&sem->wait_lock);
>  
> @@ -546,13 +541,14 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
>  	do {
>  		bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
>  
> -		if (has_handoff && wstate == WRITER_NOT_FIRST)
> +		if (has_handoff && !first)
>  			return false;
>  
>  		new = count;
>  
>  		if (count & RWSEM_LOCK_MASK) {
> -			if (has_handoff || (wstate != WRITER_HANDOFF))
> +			if (has_handoff || (!waiter->rt_task &&
> +					    !time_after(jiffies, waiter->timeout)))
>  				return false;

we delete this whole condition, and..

>  
>  			new |= RWSEM_FLAG_HANDOFF;

> @@ -889,6 +888,24 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
>  }
>  #endif
>  
> +/*
> + * Common code to handle rwsem flags in out_nolock path with wait_lock held.
> + */
> +static inline void rwsem_out_nolock_clear_flags(struct rw_semaphore *sem,
> +						struct rwsem_waiter *waiter)
> +{
> +	long flags = 0;
> +
> +	list_del(&waiter->list);
> +	if (list_empty(&sem->wait_list))
> +		flags = RWSEM_FLAG_HANDOFF | RWSEM_FLAG_WAITERS;
> +	else if (waiter->handoff_set)
> +		flags = RWSEM_FLAG_HANDOFF;

take out this else,

> +
> +	if (flags)
> +		atomic_long_andnot(flags,  &sem->count);
> +}

We get the inherit thing for free, no?

Once HANDOFF is set, new readers are blocked. And then any first waiter
is allowed to acquire the lock; who cares whether it was the one
responsible for having set the HANDOFF bit.
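
In other words, a rough sketch of that suggestion (not the posted patch;
helper name taken from the quoted hunk) would only drop HANDOFF together
with WAITERS once the queue drains:

static inline void rwsem_out_nolock_clear_flags(struct rw_semaphore *sem,
						struct rwsem_waiter *waiter)
{
	list_del(&waiter->list);
	/* Leave HANDOFF set while anybody is still queued; the next first
	 * waiter simply inherits the quiescent phase. */
	if (list_empty(&sem->wait_list))
		atomic_long_andnot(RWSEM_FLAG_HANDOFF | RWSEM_FLAG_WAITERS,
				   &sem->count);
}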

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set
  2021-11-11 20:26                 ` Peter Zijlstra
@ 2021-11-11 21:01                   ` Waiman Long
  2021-11-11 21:25                     ` Waiman Long
  2021-11-11 21:38                     ` Peter Zijlstra
  0 siblings, 2 replies; 26+ messages in thread
From: Waiman Long @ 2021-11-11 21:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hillf Danton, 马振华,
	mingo, will, boqun.feng, linux-kernel


On 11/11/21 15:26, Peter Zijlstra wrote:
> On Thu, Nov 11, 2021 at 02:36:52PM -0500, Waiman Long wrote:
>
>> @@ -434,6 +430,7 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
>>   			if (!(oldcount & RWSEM_FLAG_HANDOFF) &&
>>   			    time_after(jiffies, waiter->timeout)) {
>>   				adjustment -= RWSEM_FLAG_HANDOFF;
>> +				waiter->handoff_set = true;
>>   				lockevent_inc(rwsem_rlock_handoff);
>>   			}
>>   
> Do we really need this flag? Wouldn't it be the same as waiter-is-first
> AND sem-has-handoff ?
That is true. The only downside is that we have to read the count first 
in rwsem_out_nolock_clear_flags(). Since this is not a fast path, it 
should be OK to do that.
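
A minimal sketch of that equivalent test (illustration only; assumes
wait_lock is held so the wait list is stable):

	bool handoff_set = (rwsem_first_waiter(sem) == waiter) &&
			   (atomic_long_read(&sem->count) & RWSEM_FLAG_HANDOFF);
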
>>   static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
>> +					struct rwsem_waiter *waiter)
>>   {
>>   	long count, new;
>> +	bool first = rwsem_first_waiter(sem) == waiter;
> flip those lines for reverse xmas please
Sure, will do.
>
>>   
>>   	lockdep_assert_held(&sem->wait_lock);
>>   
>> @@ -546,13 +541,14 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
>>   	do {
>>   		bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
>>   
>> -		if (has_handoff && wstate == WRITER_NOT_FIRST)
>> +		if (has_handoff && !first)
>>   			return false;
>>   
>>   		new = count;
>>   
>>   		if (count & RWSEM_LOCK_MASK) {
>> -			if (has_handoff || (wstate != WRITER_HANDOFF))
>> +			if (has_handoff || (!waiter->rt_task &&
>> +					    !time_after(jiffies, waiter->timeout)))
>
> Does ->rt_task really help over rt_task(current) ? I suppose there's an
> argument for locality, but that should be pretty much it, no?
Waiting for the timeout may introduce too much latency for an RT task. That 
is the only reason I am doing it. I can take it out if you think it is 
not necessary.
>
>>   				return false;
>>   
>>   			new |= RWSEM_FLAG_HANDOFF;
>> @@ -889,6 +888,24 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
>>   }
>>   #endif
>>   
>> +/*
>> + * Common code to handle rwsem flags in out_nolock path with wait_lock held.
>> + */
>> +static inline void rwsem_out_nolock_clear_flags(struct rw_semaphore *sem,
>> +						struct rwsem_waiter *waiter)
>> +{
>> +	long flags = 0;
>> +
>> +	list_del(&waiter->list);
>> +	if (list_empty(&sem->wait_list))
>> +		flags = RWSEM_FLAG_HANDOFF | RWSEM_FLAG_WAITERS;
>> +	else if (waiter->handoff_set)
>> +		flags = RWSEM_FLAG_HANDOFF;
>> +
>> +	if (flags)
>> +		atomic_long_andnot(flags,  &sem->count);
>> +}
> Right, so I like sharing this between the two _slowpath functions, that
> makes sense.
>
> The difference between this and my approach is that I unconditionally
> clear HANDOFF when @waiter was the first. Because if it was set, it
> must've been ours, and if it wasn't set, clearing it doesn't really hurt
> much. This is an unlikely path, I don't think the potentially extra
> atomic is an issue here.
That is true; we shouldn't worry too much about performance in this 
unlikely path. Will make the change.
>
>> +
>>   /*
>>    * Wait for the read lock to be granted
>>    */
>> @@ -936,6 +953,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, unsigned int stat
>>   	waiter.task = current;
>>   	waiter.type = RWSEM_WAITING_FOR_READ;
>>   	waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
>> +	waiter.handoff_set = false;
> Forgot to set rt_task

We don't use rt_task for readers; it is writer-only. I will document that.

>
>>   
>>   	raw_spin_lock_irq(&sem->wait_lock);
>>   	if (list_empty(&sem->wait_list)) {
>> @@ -1038,16 +1051,13 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
>>   	waiter.task = current;
>>   	waiter.type = RWSEM_WAITING_FOR_WRITE;
>>   	waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
> Forget to set handoff_set
Yes, I was aware of that.
>
>> +	waiter.rt_task = rt_task(current);
>>   
>>   	raw_spin_lock_irq(&sem->wait_lock);
> Again, I'm not convinced we need these variables.
I will take out handoff_set as suggested. I can also take out 
rt_task if you don't think we need to test it.
>
>> @@ -1083,13 +1093,16 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
>>   	/* wait until we successfully acquire the lock */
>>   	set_current_state(state);
>>   	for (;;) {
>> -		if (rwsem_try_write_lock(sem, wstate)) {
>> +		if (rwsem_try_write_lock(sem, &waiter)) {
>>   			/* rwsem_try_write_lock() implies ACQUIRE on success */
>>   			break;
>>   		}
>>   
>>   		raw_spin_unlock_irq(&sem->wait_lock);
>>   
>> +		if (signal_pending_state(state, current))
>> +			goto out_nolock;
>> +
>>   		/*
>>   		 * After setting the handoff bit and failing to acquire
>>   		 * the lock, attempt to spin on owner to accelerate lock
>> @@ -1098,7 +1111,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
>>   		 * In this case, we attempt to acquire the lock again
>>   		 * without sleeping.
>>   		 */
>> -		if (wstate == WRITER_HANDOFF) {
>> +		if (waiter.handoff_set) {
>>   			enum owner_state owner_state;
>>   
>>   			preempt_disable();
> Does it matter much if we spin-wait for every first or only for handoff?
Only for handoff, as no other task will be spinning for the lock.
>
> Either way around, I think spin-wait ought to terminate on sigpending
> (same for mutex I suppose).

I am thinking about that too. Time for another followup patch, I think.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set
  2021-11-11 20:50                 ` Peter Zijlstra
@ 2021-11-11 21:09                   ` Waiman Long
  0 siblings, 0 replies; 26+ messages in thread
From: Waiman Long @ 2021-11-11 21:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hillf Danton, 马振华,
	mingo, will, boqun.feng, linux-kernel

On 11/11/21 15:50, Peter Zijlstra wrote:
> So I suspect that if..
>
> On Thu, Nov 11, 2021 at 02:36:52PM -0500, Waiman Long wrote:
>>   static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
>> -					enum writer_wait_state wstate)
>> +					struct rwsem_waiter *waiter)
>>   {
>>   	long count, new;
>> +	bool first = rwsem_first_waiter(sem) == waiter;
>>   
>>   	lockdep_assert_held(&sem->wait_lock);
>>   
>> @@ -546,13 +541,14 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
>>   	do {
>>   		bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
>>   
>> -		if (has_handoff && wstate == WRITER_NOT_FIRST)
>> +		if (has_handoff && !first)
>>   			return false;
>>   
>>   		new = count;
>>   
>>   		if (count & RWSEM_LOCK_MASK) {
>> -			if (has_handoff || (wstate != WRITER_HANDOFF))
>> +			if (has_handoff || (!waiter->rt_task &&
>> +					    !time_after(jiffies, waiter->timeout)))
>>   				return false;
> we delete this whole condition, and..
I don't think we can take out this if test.
>
>>   
>>   			new |= RWSEM_FLAG_HANDOFF;
>> @@ -889,6 +888,24 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
>>   }
>>   #endif
>>   
>> +/*
>> + * Common code to handle rwsem flags in out_nolock path with wait_lock held.
>> + */
>> +static inline void rwsem_out_nolock_clear_flags(struct rw_semaphore *sem,
>> +						struct rwsem_waiter *waiter)
>> +{
>> +	long flags = 0;
>> +
>> +	list_del(&waiter->list);
>> +	if (list_empty(&sem->wait_list))
>> +		flags = RWSEM_FLAG_HANDOFF | RWSEM_FLAG_WAITERS;
>> +	else if (waiter->handoff_set)
>> +		flags = RWSEM_FLAG_HANDOFF;
> take out this else,
>
>> +
>> +	if (flags)
>> +		atomic_long_andnot(flags,  &sem->count);
>> +}
> We get the inherit thing for free, no?
>
> Once HANDOFF is set, new readers are blocked. And then allow any first
> waiter to acquire the lock, who cares if it was the one responsible for
> having set the HANDOFF bit.

Yes, we can have the policy of inheriting the HANDOFF bit as long as it 
is consistent, which will be the case here with a common out_nolock 
function. I can go with that. I just have to document this fact in the 
comment.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set
  2021-11-11 21:01                   ` Waiman Long
@ 2021-11-11 21:25                     ` Waiman Long
  2021-11-11 21:53                       ` Peter Zijlstra
  2021-11-11 21:38                     ` Peter Zijlstra
  1 sibling, 1 reply; 26+ messages in thread
From: Waiman Long @ 2021-11-11 21:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hillf Danton, 马振华,
	mingo, will, boqun.feng, linux-kernel


On 11/11/21 16:01, Waiman Long wrote:
>
> On 11/11/21 15:26, Peter Zijlstra wrote:
>> On Thu, Nov 11, 2021 at 02:36:52PM -0500, Waiman Long wrote:
>>
>>> @@ -434,6 +430,7 @@ static void rwsem_mark_wake(struct rw_semaphore 
>>> *sem,
>>>               if (!(oldcount & RWSEM_FLAG_HANDOFF) &&
>>>                   time_after(jiffies, waiter->timeout)) {
>>>                   adjustment -= RWSEM_FLAG_HANDOFF;
>>> +                waiter->handoff_set = true;
>>>                   lockevent_inc(rwsem_rlock_handoff);
>>>               }
>> Do we really need this flag? Wouldn't it be the same as waiter-is-first
>> AND sem-has-handoff ?
> That is true. The only downside is that we have to read the count 
> first in rwsem_out_nolock_clear_flags(). Since this is not a fast 
> path, it should be OK to do that.

I just realized that I may still need this flag for the writer to determine 
if it should spin after failing to acquire the lock. Otherwise I will have to 
do an extra read of the count value in the loop. I don't need to use it for 
the writer now.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set
  2021-11-11 20:45                   ` Waiman Long
@ 2021-11-11 21:27                     ` Peter Zijlstra
  2021-11-11 21:54                       ` Waiman Long
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2021-11-11 21:27 UTC (permalink / raw)
  To: Waiman Long
  Cc: Hillf Danton, 马振华,
	mingo, will, boqun.feng, linux-kernel

On Thu, Nov 11, 2021 at 03:45:30PM -0500, Waiman Long wrote:
> 
> On 11/11/21 15:35, Peter Zijlstra wrote:
> > On Thu, Nov 11, 2021 at 02:36:52PM -0500, Waiman Long wrote:
> > > On 11/11/21 14:20, Peter Zijlstra wrote:
> > > > On Thu, Nov 11, 2021 at 02:14:48PM -0500, Waiman Long wrote:
> > > > > As for the PHASE_CHANGE name, we have to be consistent in both rwsem and
> > > > > mutex. Maybe a follow up patch if you think we should change the
> > > > > terminology.
> > > > Well, that's exactly the point, they do radically different things.
> > > > Having the same name for two different things is confusing.
> > > > 
> > > > Anyway, let me go read that patch you sent.
> > > My understanding of handoff is to disable optimistic spinning to let waiters
> > > in the wait queue have an opportunity to acquire the lock. There are
> > > difference in details on how to do that in mutex and rwsem, though.
> > Ah, but the mutex does an actual hand-off, it hands the lock to a
> > specific waiting task. That is, unlock() sets owner, as opposed to
> > trylock().
> > 
> > The rwsem code doesn't, it just forces a phase change. Once a waiter has
> > been blocked too long, the handoff bit is set, causing new readers to be
> > blocked. Then we wait for existing readers to complete. At that point,
> > any next waiter (most likely a writer) should really get the lock (and
> > in that regards the rwsem code is a bit funny).
> > 
> > So while both ensure fairness, the means of doing so is quite different.
> > One hands the lock ownership to a specific waiter, the other arranges
> > for a quiescent state such that the next waiter can proceed.
> 
> That is a valid argument. However, the name PHASE_CHANGE sounds weird to me.
> I am not objecting to changing the term, but probably with a better name
> NO_OPTSPIN, NO_LOCKSTEALING or something like that to emphasize that fact
> that optimistic spinning or lock stealing should not be allowed.

RWSEM_FLAG_QUIESCE ?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set
  2021-11-11 21:01                   ` Waiman Long
  2021-11-11 21:25                     ` Waiman Long
@ 2021-11-11 21:38                     ` Peter Zijlstra
  2021-11-11 21:46                       ` Waiman Long
  1 sibling, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2021-11-11 21:38 UTC (permalink / raw)
  To: Waiman Long
  Cc: Hillf Danton, 马振华,
	mingo, will, boqun.feng, linux-kernel

On Thu, Nov 11, 2021 at 04:01:16PM -0500, Waiman Long wrote:

> > > +			if (has_handoff || (!waiter->rt_task &&
> > > +					    !time_after(jiffies, waiter->timeout)))
> > 
> > Does ->rt_task really help over rt_task(current) ? I suppose there's an
> > argument for locality, but that should be pretty much it, no?
> Waiting for the timeout may introduce too much latency for RT task. That is
> the only reason I am doing it. I can take it out if you think it is not
> necessary.

I meant simply calling rt_task(waiter->task) here, instead of mucking about
with the extra variable.
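
I.e. the quoted test would simply read (sketch):

			if (has_handoff || (!rt_task(waiter->task) &&
					    !time_after(jiffies, waiter->timeout)))
				return false;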

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set
  2021-11-11 21:38                     ` Peter Zijlstra
@ 2021-11-11 21:46                       ` Waiman Long
  0 siblings, 0 replies; 26+ messages in thread
From: Waiman Long @ 2021-11-11 21:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hillf Danton, 马振华,
	mingo, will, boqun.feng, linux-kernel


On 11/11/21 16:38, Peter Zijlstra wrote:
> On Thu, Nov 11, 2021 at 04:01:16PM -0500, Waiman Long wrote:
>
>>>> +			if (has_handoff || (!waiter->rt_task &&
>>>> +					    !time_after(jiffies, waiter->timeout)))
>>> Does ->rt_task really help over rt_task(current) ? I suppose there's an
>>> argument for locality, but that should be pretty much it, no?
>> Waiting for the timeout may introduce too much latency for RT task. That is
>> the only reason I am doing it. I can take it out if you think it is not
>> necessary.
> I meant simply calling rt_task(waiter->task) here, instead of mucking about
> with the extra variable.

OK, that makes sense.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set
  2021-11-11 21:25                     ` Waiman Long
@ 2021-11-11 21:53                       ` Peter Zijlstra
  2021-11-11 21:55                         ` Waiman Long
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2021-11-11 21:53 UTC (permalink / raw)
  To: Waiman Long
  Cc: Hillf Danton, 马振华,
	mingo, will, boqun.feng, linux-kernel

On Thu, Nov 11, 2021 at 04:25:56PM -0500, Waiman Long wrote:
> 
> On 11/11/21 16:01, Waiman Long wrote:
> > 
> > On 11/11/21 15:26, Peter Zijlstra wrote:
> > > On Thu, Nov 11, 2021 at 02:36:52PM -0500, Waiman Long wrote:
> > > 
> > > > @@ -434,6 +430,7 @@ static void rwsem_mark_wake(struct
> > > > rw_semaphore *sem,
> > > >               if (!(oldcount & RWSEM_FLAG_HANDOFF) &&
> > > >                   time_after(jiffies, waiter->timeout)) {
> > > >                   adjustment -= RWSEM_FLAG_HANDOFF;
> > > > +                waiter->handoff_set = true;
> > > >                   lockevent_inc(rwsem_rlock_handoff);
> > > >               }
> > > Do we really need this flag? Wouldn't it be the same as waiter-is-first
> > > AND sem-has-handoff ?
> > That is true. The only downside is that we have to read the count first
> > in rwsem_out_nolock_clear_flags(). Since this is not a fast path, it
> > should be OK to do that.
> 
> I just realize that I may still need this flag for writer to determine if it
> should spin after failing to acquire the lock. Or I will have to do extra
> read of count value in the loop. I don't need to use it for writer now.

Maybe it's too late here, but afaict this is right after failing
try_write_lock(), which will have done at least that load you're
interested in, no?

Simply have try_write_lock() update &count or something.
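
A rough sketch of that shape (the extra argument is hypothetical, purely
illustrative; the real cmpxchg loop is elided):

static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
					struct rwsem_waiter *waiter,
					long *countp)
{
	long count = atomic_long_read(&sem->count);
	bool acquired = false;

	/* ... existing trylock/cmpxchg loop, refreshing 'count' on every
	 * failed atomic_long_try_cmpxchg_acquire() ... */

	*countp = count;	/* last observed value, for the caller */
	return acquired;
}

The caller could then test count & RWSEM_FLAG_HANDOFF after a failed
trylock without another read of sem->count.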

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set
  2021-11-11 21:27                     ` Peter Zijlstra
@ 2021-11-11 21:54                       ` Waiman Long
  0 siblings, 0 replies; 26+ messages in thread
From: Waiman Long @ 2021-11-11 21:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hillf Danton, 马振华,
	mingo, will, boqun.feng, linux-kernel


On 11/11/21 16:27, Peter Zijlstra wrote:
> On Thu, Nov 11, 2021 at 03:45:30PM -0500, Waiman Long wrote:
>> On 11/11/21 15:35, Peter Zijlstra wrote:
>>> On Thu, Nov 11, 2021 at 02:36:52PM -0500, Waiman Long wrote:
>>>> On 11/11/21 14:20, Peter Zijlstra wrote:
>>>>> On Thu, Nov 11, 2021 at 02:14:48PM -0500, Waiman Long wrote:
>>>>>> As for the PHASE_CHANGE name, we have to be consistent in both rwsem and
>>>>>> mutex. Maybe a follow up patch if you think we should change the
>>>>>> terminology.
>>>>> Well, that's exactly the point, they do radically different things.
>>>>> Having the same name for two different things is confusing.
>>>>>
>>>>> Anyway, let me go read that patch you sent.
>>>> My understanding of handoff is to disable optimistic spinning to let waiters
>>>> in the wait queue have an opportunity to acquire the lock. There are
>>>> difference in details on how to do that in mutex and rwsem, though.
>>> Ah, but the mutex does an actual hand-off, it hands the lock to a
>>> specific waiting task. That is, unlock() sets owner, as opposed to
>>> trylock().
>>>
>>> The rwsem code doesn't, it just forces a phase change. Once a waiter has
>>> been blocked too long, the handoff bit is set, causing new readers to be
>>> blocked. Then we wait for existing readers to complete. At that point,
>>> any next waiter (most likely a writer) should really get the lock (and
>>> in that regards the rwsem code is a bit funny).
>>>
>>> So while both ensure fairness, the means of doing so is quite different.
>>> One hands the lock ownership to a specific waiter, the other arranges
>>> for a quiescent state such that the next waiter can proceed.
>> That is a valid argument. However, the name PHASE_CHANGE sounds weird to me.
>> I am not objecting to changing the term, but probably with a better name
>> NO_OPTSPIN, NO_LOCKSTEALING or something like that to emphasize that fact
>> that optimistic spinning or lock stealing should not be allowed.
> RWSEM_FLAG_QUIESCE ?

I think that is a more acceptable term than PHASE_CHANGE. Will have a 
follow-up patch later on. This one is more urgent and I want to get it 
done first.

Cheers,
Longman.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set
  2021-11-11 21:53                       ` Peter Zijlstra
@ 2021-11-11 21:55                         ` Waiman Long
  2021-11-11 22:00                           ` Waiman Long
  0 siblings, 1 reply; 26+ messages in thread
From: Waiman Long @ 2021-11-11 21:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hillf Danton, 马振华,
	mingo, will, boqun.feng, linux-kernel


On 11/11/21 16:53, Peter Zijlstra wrote:
> On Thu, Nov 11, 2021 at 04:25:56PM -0500, Waiman Long wrote:
>> On 11/11/21 16:01, Waiman Long wrote:
>>> On 11/11/21 15:26, Peter Zijlstra wrote:
>>>> On Thu, Nov 11, 2021 at 02:36:52PM -0500, Waiman Long wrote:
>>>>
>>>>> @@ -434,6 +430,7 @@ static void rwsem_mark_wake(struct
>>>>> rw_semaphore *sem,
>>>>>                if (!(oldcount & RWSEM_FLAG_HANDOFF) &&
>>>>>                    time_after(jiffies, waiter->timeout)) {
>>>>>                    adjustment -= RWSEM_FLAG_HANDOFF;
>>>>> +                waiter->handoff_set = true;
>>>>>                    lockevent_inc(rwsem_rlock_handoff);
>>>>>                }
>>>> Do we really need this flag? Wouldn't it be the same as waiter-is-first
>>>> AND sem-has-handoff ?
>>> That is true. The only downside is that we have to read the count first
>>> in rwsem_out_nolock_clear_flags(). Since this is not a fast path, it
>>> should be OK to do that.
>> I just realize that I may still need this flag for writer to determine if it
>> should spin after failing to acquire the lock. Or I will have to do extra
>> read of count value in the loop. I don't need to use it for writer now.
> Maybe it's too late here, but afaict this is right after failing
> try_write_lock(), which will have done at least that load you're
> interested in, no?
>
> Simply have try_write_lock() update &count or something.

You are right. On second thought, I have actually decided to do an extra 
read.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set
  2021-11-11 21:55                         ` Waiman Long
@ 2021-11-11 22:00                           ` Waiman Long
  0 siblings, 0 replies; 26+ messages in thread
From: Waiman Long @ 2021-11-11 22:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hillf Danton, 马振华,
	mingo, will, boqun.feng, linux-kernel


On 11/11/21 16:55, Waiman Long wrote:
>
> On 11/11/21 16:53, Peter Zijlstra wrote:
>> On Thu, Nov 11, 2021 at 04:25:56PM -0500, Waiman Long wrote:
>>> On 11/11/21 16:01, Waiman Long wrote:
>>>> On 11/11/21 15:26, Peter Zijlstra wrote:
>>>>> On Thu, Nov 11, 2021 at 02:36:52PM -0500, Waiman Long wrote:
>>>>>
>>>>>> @@ -434,6 +430,7 @@ static void rwsem_mark_wake(struct
>>>>>> rw_semaphore *sem,
>>>>>>                if (!(oldcount & RWSEM_FLAG_HANDOFF) &&
>>>>>>                    time_after(jiffies, waiter->timeout)) {
>>>>>>                    adjustment -= RWSEM_FLAG_HANDOFF;
>>>>>> +                waiter->handoff_set = true;
>>>>>>                    lockevent_inc(rwsem_rlock_handoff);
>>>>>>                }
>>>>> Do we really need this flag? Wouldn't it be the same as 
>>>>> waiter-is-first
>>>>> AND sem-has-handoff ?
>>>> That is true. The only downside is that we have to read the count 
>>>> first
>>>> in rwsem_out_nolock_clear_flags(). Since this is not a fast path, it
>>>> should be OK to do that.
>>> I just realize that I may still need this flag for writer to 
>>> determine if it
>>> should spin after failing to acquire the lock. Or I will have to do 
>>> extra
>>> read of count value in the loop. I don't need to use it for writer now.
>> Maybe it's too late here, but afaict this is right after failing
>> try_write_lock(), which will have done at least that load you're
>> interested in, no?
>>
>> Simply have try_write_lock() update &count or something.
>
> You are right. I have actually decided to do an extra read after 
> second thought.

Oh, I would like to take it back. The condition to do spinning is when 
the handoff bit is set and the waiter is the first one in the queue. It 
would be easier to do that with an extra internal state variable.
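
Spelled out, the internal state variable stands for this test (sketch
only; wait_lock is held where the slow path evaluates it):

	waiter.handoff_set = (atomic_long_read(&sem->count) & RWSEM_FLAG_HANDOFF) &&
			     (rwsem_first_waiter(sem) == &waiter);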

Cheers,
Longman


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread

Thread overview: 26+ messages
     [not found] <4fafad133b074f279dbab1aa3642e23f@xiaomi.com>
2021-11-07  3:25 ` [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set Waiman Long
2021-11-07  3:28   ` Waiman Long
     [not found] ` <20211107090131.1535-1-hdanton@sina.com>
2021-11-07 15:24   ` Waiman Long
2021-11-07 19:52     ` Waiman Long
2021-11-10 21:38       ` Peter Zijlstra
2021-11-11  2:42         ` Maria Yu
2021-11-11 15:08         ` Peter Zijlstra
2021-11-11 19:14           ` Waiman Long
2021-11-11 19:20             ` Peter Zijlstra
2021-11-11 19:36               ` Waiman Long
2021-11-11 19:52                 ` Waiman Long
2021-11-11 20:26                 ` Peter Zijlstra
2021-11-11 21:01                   ` Waiman Long
2021-11-11 21:25                     ` Waiman Long
2021-11-11 21:53                       ` Peter Zijlstra
2021-11-11 21:55                         ` Waiman Long
2021-11-11 22:00                           ` Waiman Long
2021-11-11 21:38                     ` Peter Zijlstra
2021-11-11 21:46                       ` Waiman Long
2021-11-11 20:35                 ` Peter Zijlstra
2021-11-11 20:39                   ` Peter Zijlstra
2021-11-11 20:45                   ` Waiman Long
2021-11-11 21:27                     ` Peter Zijlstra
2021-11-11 21:54                       ` Waiman Long
2021-11-11 20:50                 ` Peter Zijlstra
2021-11-11 21:09                   ` Waiman Long
