* [PATCH] locking/rwsem: use read_acquire in read_slowpath exit when queue is empty
@ 2019-07-16 16:04 Jan Stancek
  2019-07-16 16:53 ` Waiman Long
  0 siblings, 1 reply; 22+ messages in thread
From: Jan Stancek @ 2019-07-16 16:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: longman, dbueso, will, peterz, mingo, jstancek

LTP mtest06 has been observed to rarely hit "still mapped when deleted"
and the following BUG_ON on arm64:
  page:ffff7e02fa37e480 refcount:3 mapcount:1 mapping:ffff80be3d678ab0 index:0x0
  xfs_address_space_operations [xfs]
  flags: 0xbfffe000000037(locked|referenced|uptodate|lru|active)
  page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
  ------------[ cut here ]------------
  kernel BUG at mm/filemap.c:171!
  Internal error: Oops - BUG: 0 [#1] SMP
  CPU: 220 PID: 154292 Comm: mmap1 Not tainted 5.2.0-0ecfebd.cki #1
  Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.10 05/17/2019
  pstate: 40400089 (nZcv daIf +PAN -UAO)
  pc : unaccount_page_cache_page+0x17c/0x1a0
  lr : unaccount_page_cache_page+0x17c/0x1a0
  Call trace:
  unaccount_page_cache_page+0x17c/0x1a0
  delete_from_page_cache_batch+0xa0/0x300
  truncate_inode_pages_range+0x1b8/0x640
  truncate_inode_pages_final+0x88/0xa8
  evict+0x1a0/0x1d8
  iput+0x150/0x240
  dentry_unlink_inode+0x120/0x130
  __dentry_kill+0xd8/0x1d0
  dentry_kill+0x88/0x248
  dput+0x168/0x1b8
  __fput+0xe8/0x208
  ____fput+0x20/0x30
  task_work_run+0xc0/0xf0
  do_notify_resume+0x2b0/0x328
  work_pending+0x8/0x10

The extra mapcount originated from the pagefault handler, which handled
a pagefault for a vma that had already been detached. The vma is detached
under the mmap_sem write lock by detach_vmas_to_be_unmapped(), which
also invalidates the vmacache.

When the pagefault handler (under the mmap_sem read lock) called find_vma(),
vmacache_valid() wrongly reported the vmacache as valid.
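
For context, the check that goes wrong is essentially this (a simplified
sketch of vmacache_valid() from mm/vmacache.c of that era, not the
verbatim source):

  static bool vmacache_valid(struct mm_struct *mm)
  {
          struct task_struct *curr = current;

          if (mm != curr->mm)
                  return false;

          /*
           * If this read observes a stale mm->vmacache_seqnum, the stale
           * per-task cache survives and find_vma() can hand back a vma
           * that detach_vmas_to_be_unmapped() has already removed.
           */
          if (mm->vmacache_seqnum != curr->vmacache.seqnum) {
                  curr->vmacache.seqnum = mm->vmacache_seqnum;
                  vmacache_flush(curr);
                  return false;
          }
          return true;
  }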

When rwsem down_read() returns via the 'queue empty' path (as of v5.2),
it does so without issuing a read_acquire on sem->count:
  down_read
    __down_read
      rwsem_down_read_failed
        __rwsem_down_read_failed_common
          raw_spin_lock_irq(&sem->wait_lock);
          if (list_empty(&sem->wait_list)) {
            if (atomic_long_read(&sem->count) >= 0) {
              raw_spin_unlock_irq(&sem->wait_lock);
              return sem;

The suspected problem is that the last *_acquire on the down_read() side
happens before the write side issues its *_release:
  1. writer: has the lock
  2. reader: down_read() issues *read_acquire on entry
  3. writer: mm->vmacache_seqnum++; downgrades lock (*fetch_add_release)
  4. reader: __rwsem_down_read_failed_common() finds it can take lock and returns
  5. reader: observes stale mm->vmacache_seqnum
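
The writer side in step 3 is the downgrade path, which roughly looks like
this (an abridged sketch of __downgrade_write() in the rewritten rwsem
code this patch targets, not the verbatim source):

  static inline void __downgrade_write(struct rw_semaphore *sem)
  {
          long tmp;

          /* the only RELEASE the writer issues on sem->count */
          tmp = atomic_long_fetch_add_release(
                  -RWSEM_WRITER_LOCKED + RWSEM_READER_BIAS, &sem->count);
          rwsem_set_reader_owned(sem);
          if (tmp & RWSEM_FLAG_WAITERS)
                  rwsem_downgrade_wake(sem);
  }

Nothing on the reader's 'queue empty' exit currently pairs with that
RELEASE, which is what the change below addresses.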

I can reproduce the problem by running LTP mtest06 in a loop and building
a kernel (-j $NCPUS) in parallel. It reproduces from v4.20 up to v5.2
on an arm64 HPE Apollo 70 (224 CPUs, 256GB RAM, 2 nodes). It triggers reliably
within ~1 hour. The patched kernel ran fine for 5+ hours with a clean dmesg.
Tests were done against v5.2, since commit cf69482d62d9 ("locking/rwsem:
Enable readers spinning on writer") makes it much harder to reproduce.

Related: https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c
Related: commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
Fixes: 4b486b535c33 ("locking/rwsem: Exit read lock slowpath if queue empty & no writer")

Signed-off-by: Jan Stancek <jstancek@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
---
 kernel/locking/rwsem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 37524a47f002..757b198d7a5b 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1030,7 +1030,7 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
 		 * exit the slowpath and return immediately as its
 		 * RWSEM_READER_BIAS has already been set in the count.
 		 */
-		if (adjustment && !(atomic_long_read(&sem->count) &
+		if (adjustment && !(atomic_long_read_acquire(&sem->count) &
 		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
 			raw_spin_unlock_irq(&sem->wait_lock);
 			rwsem_set_reader_owned(sem);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] locking/rwsem: use read_acquire in read_slowpath exit when queue is empty
  2019-07-16 16:04 [PATCH] locking/rwsem: use read_acquire in read_slowpath exit when queue is empty Jan Stancek
@ 2019-07-16 16:53 ` Waiman Long
  2019-07-16 18:34   ` Jan Stancek
  2019-07-16 18:58   ` Peter Zijlstra
  0 siblings, 2 replies; 22+ messages in thread
From: Waiman Long @ 2019-07-16 16:53 UTC (permalink / raw)
  To: Jan Stancek, linux-kernel; +Cc: dbueso, will, peterz, mingo

On 7/16/19 12:04 PM, Jan Stancek wrote:
> LTP mtest06 has been observed to rarely hit "still mapped when deleted"
> and following BUG_ON on arm64:
>   page:ffff7e02fa37e480 refcount:3 mapcount:1 mapping:ffff80be3d678ab0 index:0x0
>   xfs_address_space_operations [xfs]
>   flags: 0xbfffe000000037(locked|referenced|uptodate|lru|active)
>   page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
>   ------------[ cut here ]------------
>   kernel BUG at mm/filemap.c:171!
>   Internal error: Oops - BUG: 0 [#1] SMP
>   CPU: 220 PID: 154292 Comm: mmap1 Not tainted 5.2.0-0ecfebd.cki #1
>   Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.10 05/17/2019
>   pstate: 40400089 (nZcv daIf +PAN -UAO)
>   pc : unaccount_page_cache_page+0x17c/0x1a0
>   lr : unaccount_page_cache_page+0x17c/0x1a0
>   Call trace:
>   unaccount_page_cache_page+0x17c/0x1a0
>   delete_from_page_cache_batch+0xa0/0x300
>   truncate_inode_pages_range+0x1b8/0x640
>   truncate_inode_pages_final+0x88/0xa8
>   evict+0x1a0/0x1d8
>   iput+0x150/0x240
>   dentry_unlink_inode+0x120/0x130
>   __dentry_kill+0xd8/0x1d0
>   dentry_kill+0x88/0x248
>   dput+0x168/0x1b8
>   __fput+0xe8/0x208
>   ____fput+0x20/0x30
>   task_work_run+0xc0/0xf0
>   do_notify_resume+0x2b0/0x328
>   work_pending+0x8/0x10
>
> The extra mapcount originated from pagefault handler, which handled
> pagefault for vma that has already been detached. vma is detached
> under mmap_sem write lock by detach_vmas_to_be_unmapped(), which
> also invalidates vmacache.
>
> When pagefault handler (under mmap_sem read lock) called find_vma(),
> vmacache_valid() wrongly reported vmacache as valid.
>
> After rwsem down_read() returns via 'queue empty' path (as of v5.2),
> it does so without issuing read_acquire on sem->count:
>   down_read
>     __down_read
>       rwsem_down_read_failed
>         __rwsem_down_read_failed_common
>           raw_spin_lock_irq(&sem->wait_lock);
>           if (list_empty(&sem->wait_list)) {
>             if (atomic_long_read(&sem->count) >= 0) {
>               raw_spin_unlock_irq(&sem->wait_lock);
>               return sem;
>
> Suspected problem here is that last *_acquire on down_read() side
> happens before write side issues *_release:
>   1. writer: has the lock
>   2. reader: down_read() issues *read_acquire on entry
>   3. writer: mm->vmacache_seqnum++; downgrades lock (*fetch_add_release)
>   4. reader: __rwsem_down_read_failed_common() finds it can take lock and returns
>   5. reader: observes stale mm->vmacache_seqnum
>
> I can reproduce the problem by running LTP mtest06 in a loop and building
> kernel (-j $NCPUS) in parallel. It does reproduce since v4.20 up to v5.2
> on arm64 HPE Apollo 70 (224 CPUs, 256GB RAM, 2 nodes). It triggers reliably
> within ~hour. Patched kernel ran fine for 5+ hours with clean dmesg.
> Tests were done against v5.2, since commit cf69482d62d9 ("locking/rwsem:
> Enable readers spinning on writer") makes it much harder to reproduce.
The explanation makes sense to me. Thanks for the investigation.
>
> Related: https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c
> Related: commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
> Fixes: 4b486b535c33 ("locking/rwsem: Exit read lock slowpath if queue empty & no writer")
>
> Signed-off-by: Jan Stancek <jstancek@redhat.com>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Davidlohr Bueso <dbueso@suse.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> ---
>  kernel/locking/rwsem.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index 37524a47f002..757b198d7a5b 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -1030,7 +1030,7 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
>  		 * exit the slowpath and return immediately as its
>  		 * RWSEM_READER_BIAS has already been set in the count.
>  		 */
> -		if (adjustment && !(atomic_long_read(&sem->count) &
> +		if (adjustment && !(atomic_long_read_acquire(&sem->count) &
>  		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
>  			raw_spin_unlock_irq(&sem->wait_lock);
>  			rwsem_set_reader_owned(sem);

The chance of taking this path is not that high. So instead of
increasing the cost of the test by adding an acquire barrier, how about
just adding smp_mb__after_spinlock() before the spin_unlock_irq()? That
should have the same effect of making sure that no stale data will be
used in the read-lock critical section.
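
Roughly, the idea would be this (an untested sketch, not a patch):

  if (adjustment && !(atomic_long_read(&sem->count) &
       (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
          /*
           * Full barrier; pairs with the RELEASE store done by an
           * unlocking/downgrading writer. Only paid on this rare path.
           */
          smp_mb__after_spinlock();
          raw_spin_unlock_irq(&sem->wait_lock);
          rwsem_set_reader_owned(sem);
          lockevent_inc(rwsem_rlock_fast);
          /* ... rest of the early-exit path unchanged ... */
  }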

-Longman


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] locking/rwsem: use read_acquire in read_slowpath exit when queue is empty
  2019-07-16 16:53 ` Waiman Long
@ 2019-07-16 18:34   ` Jan Stancek
  2019-07-16 18:58   ` Peter Zijlstra
  1 sibling, 0 replies; 22+ messages in thread
From: Jan Stancek @ 2019-07-16 18:34 UTC (permalink / raw)
  To: Waiman Long; +Cc: linux-kernel, dbueso, will, peterz, mingo



----- Original Message -----
> On 7/16/19 12:04 PM, Jan Stancek wrote:
> > LTP mtest06 has been observed to rarely hit "still mapped when deleted"
> > and following BUG_ON on arm64:
> >   page:ffff7e02fa37e480 refcount:3 mapcount:1 mapping:ffff80be3d678ab0
> >   index:0x0
> >   xfs_address_space_operations [xfs]
> >   flags: 0xbfffe000000037(locked|referenced|uptodate|lru|active)
> >   page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
> >   ------------[ cut here ]------------
> >   kernel BUG at mm/filemap.c:171!
> >   Internal error: Oops - BUG: 0 [#1] SMP
> >   CPU: 220 PID: 154292 Comm: mmap1 Not tainted 5.2.0-0ecfebd.cki #1
> >   Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.10
> >   05/17/2019
> >   pstate: 40400089 (nZcv daIf +PAN -UAO)
> >   pc : unaccount_page_cache_page+0x17c/0x1a0
> >   lr : unaccount_page_cache_page+0x17c/0x1a0
> >   Call trace:
> >   unaccount_page_cache_page+0x17c/0x1a0
> >   delete_from_page_cache_batch+0xa0/0x300
> >   truncate_inode_pages_range+0x1b8/0x640
> >   truncate_inode_pages_final+0x88/0xa8
> >   evict+0x1a0/0x1d8
> >   iput+0x150/0x240
> >   dentry_unlink_inode+0x120/0x130
> >   __dentry_kill+0xd8/0x1d0
> >   dentry_kill+0x88/0x248
> >   dput+0x168/0x1b8
> >   __fput+0xe8/0x208
> >   ____fput+0x20/0x30
> >   task_work_run+0xc0/0xf0
> >   do_notify_resume+0x2b0/0x328
> >   work_pending+0x8/0x10
> >
> > The extra mapcount originated from pagefault handler, which handled
> > pagefault for vma that has already been detached. vma is detached
> > under mmap_sem write lock by detach_vmas_to_be_unmapped(), which
> > also invalidates vmacache.
> >
> > When pagefault handler (under mmap_sem read lock) called find_vma(),
> > vmacache_valid() wrongly reported vmacache as valid.
> >
> > After rwsem down_read() returns via 'queue empty' path (as of v5.2),
> > it does so without issuing read_acquire on sem->count:
> >   down_read
> >     __down_read
> >       rwsem_down_read_failed
> >         __rwsem_down_read_failed_common
> >           raw_spin_lock_irq(&sem->wait_lock);
> >           if (list_empty(&sem->wait_list)) {
> >             if (atomic_long_read(&sem->count) >= 0) {
> >               raw_spin_unlock_irq(&sem->wait_lock);
> >               return sem;
> >
> > Suspected problem here is that last *_acquire on down_read() side
> > happens before write side issues *_release:
> >   1. writer: has the lock
> >   2. reader: down_read() issues *read_acquire on entry
> >   3. writer: mm->vmacache_seqnum++; downgrades lock (*fetch_add_release)
> >   4. reader: __rwsem_down_read_failed_common() finds it can take lock and
> >   returns
> >   5. reader: observes stale mm->vmacache_seqnum
> >
> > I can reproduce the problem by running LTP mtest06 in a loop and building
> > kernel (-j $NCPUS) in parallel. It does reproduce since v4.20 up to v5.2
> > on arm64 HPE Apollo 70 (224 CPUs, 256GB RAM, 2 nodes). It triggers reliably
> > within ~hour. Patched kernel ran fine for 5+ hours with clean dmesg.
> > Tests were done against v5.2, since commit cf69482d62d9 ("locking/rwsem:
> > Enable readers spinning on writer") makes it much harder to reproduce.
> The explanation makes sense to me. Thanks for the investigation.
> >
> > Related:
> > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c
> > Related: commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in
> > munmap")
> > Fixes: 4b486b535c33 ("locking/rwsem: Exit read lock slowpath if queue empty
> > & no writer")
> >
> > Signed-off-by: Jan Stancek <jstancek@redhat.com>
> > Cc: Waiman Long <longman@redhat.com>
> > Cc: Davidlohr Bueso <dbueso@suse.de>
> > Cc: Will Deacon <will@kernel.org>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > ---
> >  kernel/locking/rwsem.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> > index 37524a47f002..757b198d7a5b 100644
> > --- a/kernel/locking/rwsem.c
> > +++ b/kernel/locking/rwsem.c
> > @@ -1030,7 +1030,7 @@ static inline bool rwsem_reader_phase_trylock(struct
> > rw_semaphore *sem,
> >  		 * exit the slowpath and return immediately as its
> >  		 * RWSEM_READER_BIAS has already been set in the count.
> >  		 */
> > -		if (adjustment && !(atomic_long_read(&sem->count) &
> > +		if (adjustment && !(atomic_long_read_acquire(&sem->count) &
> >  		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
> >  			raw_spin_unlock_irq(&sem->wait_lock);
> >  			rwsem_set_reader_owned(sem);
> 
> The chance of taking this path is not that high. So instead of
> increasing the cost of the test by adding an acquire barrier,

Yes, that is a good point.

> how about
> just adding smp_mb__after_spinlock() before spin_unlock_irq(). This
> should have the same effect of making sure that no stale data will be
> used in the read-lock critical section.

I'll redo the tests with smp_mb__after_spinlock().

Thanks,
Jan

> 
> -Longman
> 
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] locking/rwsem: use read_acquire in read_slowpath exit when queue is empty
  2019-07-16 16:53 ` Waiman Long
  2019-07-16 18:34   ` Jan Stancek
@ 2019-07-16 18:58   ` Peter Zijlstra
  2019-07-16 19:09     ` Waiman Long
  2019-07-17 12:02     ` [PATCH v2] locking/rwsem: add acquire barrier to " Jan Stancek
  1 sibling, 2 replies; 22+ messages in thread
From: Peter Zijlstra @ 2019-07-16 18:58 UTC (permalink / raw)
  To: Waiman Long; +Cc: Jan Stancek, linux-kernel, dbueso, will, mingo

On Tue, Jul 16, 2019 at 12:53:14PM -0400, Waiman Long wrote:
> On 7/16/19 12:04 PM, Jan Stancek wrote:

> > Suspected problem here is that last *_acquire on down_read() side
> > happens before write side issues *_release:
> >   1. writer: has the lock
> >   2. reader: down_read() issues *read_acquire on entry
> >   3. writer: mm->vmacache_seqnum++; downgrades lock (*fetch_add_release)
> >   4. reader: __rwsem_down_read_failed_common() finds it can take lock and returns
> >   5. reader: observes stale mm->vmacache_seqnum
> >
> > I can reproduce the problem by running LTP mtest06 in a loop and building
> > kernel (-j $NCPUS) in parallel. It does reproduce since v4.20 up to v5.2
> > on arm64 HPE Apollo 70 (224 CPUs, 256GB RAM, 2 nodes). It triggers reliably
> > within ~hour. Patched kernel ran fine for 5+ hours with clean dmesg.
> > Tests were done against v5.2, since commit cf69482d62d9 ("locking/rwsem:
> > Enable readers spinning on writer") makes it much harder to reproduce.

> > Fixes: 4b486b535c33 ("locking/rwsem: Exit read lock slowpath if queue empty & no writer")
> > Signed-off-by: Jan Stancek <jstancek@redhat.com>
> > Cc: Waiman Long <longman@redhat.com>
> > Cc: Davidlohr Bueso <dbueso@suse.de>
> > Cc: Will Deacon <will@kernel.org>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > ---
> >  kernel/locking/rwsem.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> > index 37524a47f002..757b198d7a5b 100644
> > --- a/kernel/locking/rwsem.c
> > +++ b/kernel/locking/rwsem.c
> > @@ -1030,7 +1030,7 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
> >  		 * exit the slowpath and return immediately as its
> >  		 * RWSEM_READER_BIAS has already been set in the count.
> >  		 */
> > -		if (adjustment && !(atomic_long_read(&sem->count) &
> > +		if (adjustment && !(atomic_long_read_acquire(&sem->count) &
> >  		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
> >  			raw_spin_unlock_irq(&sem->wait_lock);
> >  			rwsem_set_reader_owned(sem);
> 
> The chance of taking this path is not that high. So instead of
> increasing the cost of the test by adding an acquire barrier, how about
> just adding smp_mb__after_spinlock() before spin_unlock_irq(). This
> should have the same effect of making sure that no stale data will be
> used in the read-lock critical section.

That's actually more expensive on something like ARM64, I expect.

The far cheaper alternative is smp_acquire__after_ctrl_dep(); however, in
general Will seems to prefer using load-acquire over separate barriers,
and for x86 it doesn't matter anyway. For PowerPC these two are a wash,
both end up with LWSYNC (versus SYNC for your alternative).
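
For reference, the generic definition is just a read barrier which, combined
with the control dependency on the preceding load of sem->count, provides
ACQUIRE ordering for the loads that follow (roughly, from the generic
barrier headers of that era):

  #define smp_acquire__after_ctrl_dep()	smp_rmb()

so x86 gets a compiler barrier only, while arm64 gets a DMB ISHLD, which is
still cheaper than the full barrier suggested above.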



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] locking/rwsem: use read_acquire in read_slowpath exit when queue is empty
  2019-07-16 18:58   ` Peter Zijlstra
@ 2019-07-16 19:09     ` Waiman Long
  2019-07-17 12:02     ` [PATCH v2] locking/rwsem: add acquire barrier to " Jan Stancek
  1 sibling, 0 replies; 22+ messages in thread
From: Waiman Long @ 2019-07-16 19:09 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Jan Stancek, linux-kernel, dbueso, will, mingo

On 7/16/19 2:58 PM, Peter Zijlstra wrote:
> On Tue, Jul 16, 2019 at 12:53:14PM -0400, Waiman Long wrote:
>> On 7/16/19 12:04 PM, Jan Stancek wrote:
> Fixes: 4b486b535c33 ("locking/rwsem: Exit read lock slowpath if queue empty & no writer")
> Signed-off-by: Jan Stancek <jstancek@redhat.com>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Davidlohr Bueso <dbueso@suse.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> ---
>  kernel/locking/rwsem.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index 37524a47f002..757b198d7a5b 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -1030,7 +1030,7 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
>  		 * exit the slowpath and return immediately as its
>  		 * RWSEM_READER_BIAS has already been set in the count.
>  		 */
> -		if (adjustment && !(atomic_long_read(&sem->count) &
> +		if (adjustment && !(atomic_long_read_acquire(&sem->count) &
>  		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
>  			raw_spin_unlock_irq(&sem->wait_lock);
>  			rwsem_set_reader_owned(sem);
>> The chance of taking this path is not that high. So instead of
>> increasing the cost of the test by adding an acquire barrier, how about
>> just adding smp_mb__after_spinlock() before spin_unlock_irq(). This
>> should have the same effect of making sure that no stale data will be
>> used in the read-lock critical section.
> That's actually more expensive on something like ARM64 I expect.
>
> The far cheaper alternative is smp_acquire__after_ctrl_dep(), however in
> general Will seems to prefer using load-acquire over separate barriers,
> and for x86 it doesn't matter anyway. For PowerPC these two are a wash,
> both end up with LWSYNC (over SYNC for your alternative).

With lock event counting turned on, my experience with this code path
was that it got hit very infrequently. It is even less frequent with the
latest reader optimistic spinning patch. That is why I prefer making it
a bit more costly when the condition is true without incurring any cost
when the condition is false, which is the majority of cases.
Anyway, this additional cost is for arm64 only, but it is still more
than compensated for by skipping all the wait-list manipulation and
self-wakeup code.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty
  2019-07-16 18:58   ` Peter Zijlstra
  2019-07-16 19:09     ` Waiman Long
@ 2019-07-17 12:02     ` Jan Stancek
  2019-07-17 13:13       ` Will Deacon
  2019-07-17 15:33       ` Waiman Long
  1 sibling, 2 replies; 22+ messages in thread
From: Jan Stancek @ 2019-07-17 12:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: longman, dbueso, will, peterz, mingo, jstancek

LTP mtest06 has been observed to rarely hit "still mapped when deleted"
and the following BUG_ON on arm64:
  page:ffff7e02fa37e480 refcount:3 mapcount:1 mapping:ffff80be3d678ab0 index:0x0
  xfs_address_space_operations [xfs]
  flags: 0xbfffe000000037(locked|referenced|uptodate|lru|active)
  page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
  ------------[ cut here ]------------
  kernel BUG at mm/filemap.c:171!
  Internal error: Oops - BUG: 0 [#1] SMP
  CPU: 220 PID: 154292 Comm: mmap1 Not tainted 5.2.0-0ecfebd.cki #1
  Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.10 05/17/2019
  pstate: 40400089 (nZcv daIf +PAN -UAO)
  pc : unaccount_page_cache_page+0x17c/0x1a0
  lr : unaccount_page_cache_page+0x17c/0x1a0
  Call trace:
  unaccount_page_cache_page+0x17c/0x1a0
  delete_from_page_cache_batch+0xa0/0x300
  truncate_inode_pages_range+0x1b8/0x640
  truncate_inode_pages_final+0x88/0xa8
  evict+0x1a0/0x1d8
  iput+0x150/0x240
  dentry_unlink_inode+0x120/0x130
  __dentry_kill+0xd8/0x1d0
  dentry_kill+0x88/0x248
  dput+0x168/0x1b8
  __fput+0xe8/0x208
  ____fput+0x20/0x30
  task_work_run+0xc0/0xf0
  do_notify_resume+0x2b0/0x328
  work_pending+0x8/0x10

The extra mapcount originated from the pagefault handler, which handled
a pagefault for a vma that had already been detached. The vma is detached
under the mmap_sem write lock by detach_vmas_to_be_unmapped(), which
also invalidates the vmacache.

When the pagefault handler (under the mmap_sem read lock) called find_vma(),
vmacache_valid() wrongly reported the vmacache as valid.

When rwsem down_read() returns via the 'queue empty' path (as of v5.2),
it does so without issuing a read_acquire on sem->count:
  down_read
    __down_read
      rwsem_down_read_failed
        __rwsem_down_read_failed_common
          raw_spin_lock_irq(&sem->wait_lock);
          if (list_empty(&sem->wait_list)) {
            if (atomic_long_read(&sem->count) >= 0) {
              raw_spin_unlock_irq(&sem->wait_lock);
              return sem;

The suspected problem is that the last *_acquire on the down_read() side
happens before the write side issues its *_release:
  1. writer: has the lock
  2. reader: down_read() issues *read_acquire on entry
  3. writer: mm->vmacache_seqnum++; downgrades lock (*fetch_add_release)
  4. reader: __rwsem_down_read_failed_common() finds it can take lock and returns
  5. reader: observes stale mm->vmacache_seqnum

I can reproduce the problem by running LTP mtest06 in a loop and building
a kernel (-j $NCPUS) in parallel. It reproduces from v4.20 up to v5.2
on an arm64 HPE Apollo 70 (224 CPUs, 256GB RAM, 2 nodes). It triggers reliably
within ~1 hour. The patched kernel ran fine for 10+ hours with a clean dmesg.
Tests were done against v5.2, since commit cf69482d62d9 ("locking/rwsem:
Enable readers spinning on writer") makes it much harder to reproduce.

v2: Move barrier after test (Waiman Long)
    Use smp_acquire__after_ctrl_dep() (Peter Zijlstra)

Related: https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c
Related: commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
Fixes: 4b486b535c33 ("locking/rwsem: Exit read lock slowpath if queue empty & no writer")

Signed-off-by: Jan Stancek <jstancek@redhat.com>
Cc: stable@vger.kernel.org # v4.20+
Cc: Waiman Long <longman@redhat.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
---
 kernel/locking/rwsem.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 37524a47f002..5ac72b60608b 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1032,6 +1032,7 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
 		 */
 		if (adjustment && !(atomic_long_read(&sem->count) &
 		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
+			smp_acquire__after_ctrl_dep();
 			raw_spin_unlock_irq(&sem->wait_lock);
 			rwsem_set_reader_owned(sem);
 			lockevent_inc(rwsem_rlock_fast);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty
  2019-07-17 12:02     ` [PATCH v2] locking/rwsem: add acquire barrier to " Jan Stancek
@ 2019-07-17 13:13       ` Will Deacon
  2019-07-17 14:19         ` Waiman Long
  2019-07-17 15:33       ` Waiman Long
  1 sibling, 1 reply; 22+ messages in thread
From: Will Deacon @ 2019-07-17 13:13 UTC (permalink / raw)
  To: Jan Stancek; +Cc: linux-kernel, longman, dbueso, peterz, mingo

On Wed, Jul 17, 2019 at 02:02:20PM +0200, Jan Stancek wrote:
> LTP mtest06 has been observed to rarely hit "still mapped when deleted"
> and following BUG_ON on arm64:
>   page:ffff7e02fa37e480 refcount:3 mapcount:1 mapping:ffff80be3d678ab0 index:0x0
>   xfs_address_space_operations [xfs]
>   flags: 0xbfffe000000037(locked|referenced|uptodate|lru|active)
>   page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
>   ------------[ cut here ]------------
>   kernel BUG at mm/filemap.c:171!
>   Internal error: Oops - BUG: 0 [#1] SMP
>   CPU: 220 PID: 154292 Comm: mmap1 Not tainted 5.2.0-0ecfebd.cki #1
>   Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.10 05/17/2019
>   pstate: 40400089 (nZcv daIf +PAN -UAO)
>   pc : unaccount_page_cache_page+0x17c/0x1a0
>   lr : unaccount_page_cache_page+0x17c/0x1a0
>   Call trace:
>   unaccount_page_cache_page+0x17c/0x1a0
>   delete_from_page_cache_batch+0xa0/0x300
>   truncate_inode_pages_range+0x1b8/0x640
>   truncate_inode_pages_final+0x88/0xa8
>   evict+0x1a0/0x1d8
>   iput+0x150/0x240
>   dentry_unlink_inode+0x120/0x130
>   __dentry_kill+0xd8/0x1d0
>   dentry_kill+0x88/0x248
>   dput+0x168/0x1b8
>   __fput+0xe8/0x208
>   ____fput+0x20/0x30
>   task_work_run+0xc0/0xf0
>   do_notify_resume+0x2b0/0x328
>   work_pending+0x8/0x10
> 
> The extra mapcount originated from pagefault handler, which handled
> pagefault for vma that has already been detached. vma is detached
> under mmap_sem write lock by detach_vmas_to_be_unmapped(), which
> also invalidates vmacache.
> 
> When pagefault handler (under mmap_sem read lock) called find_vma(),
> vmacache_valid() wrongly reported vmacache as valid.
> 
> After rwsem down_read() returns via 'queue empty' path (as of v5.2),
> it does so without issuing read_acquire on sem->count:
>   down_read
>     __down_read
>       rwsem_down_read_failed
>         __rwsem_down_read_failed_common
>           raw_spin_lock_irq(&sem->wait_lock);
>           if (list_empty(&sem->wait_list)) {
>             if (atomic_long_read(&sem->count) >= 0) {
>               raw_spin_unlock_irq(&sem->wait_lock);
>               return sem;
> 
> Suspected problem here is that last *_acquire on down_read() side
> happens before write side issues *_release:
>   1. writer: has the lock
>   2. reader: down_read() issues *read_acquire on entry
>   3. writer: mm->vmacache_seqnum++; downgrades lock (*fetch_add_release)
>   4. reader: __rwsem_down_read_failed_common() finds it can take lock and returns
>   5. reader: observes stale mm->vmacache_seqnum
> 
> I can reproduce the problem by running LTP mtest06 in a loop and building
> kernel (-j $NCPUS) in parallel. It does reproduce since v4.20 up to v5.2
> on arm64 HPE Apollo 70 (224 CPUs, 256GB RAM, 2 nodes). It triggers reliably
> within ~hour. Patched kernel ran fine for 10+ hours with clean dmesg.
> Tests were done against v5.2, since commit cf69482d62d9 ("locking/rwsem:
> Enable readers spinning on writer") makes it much harder to reproduce.
> 
> v2: Move barrier after test (Waiman Long)
>     Use smp_acquire__after_ctrl_dep() (Peter Zijlstra)
> 
> Related: https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c
> Related: commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
> Fixes: 4b486b535c33 ("locking/rwsem: Exit read lock slowpath if queue empty & no writer")
> 
> Signed-off-by: Jan Stancek <jstancek@redhat.com>
> Cc: stable@vger.kernel.org # v4.20+
> Cc: Waiman Long <longman@redhat.com>
> Cc: Davidlohr Bueso <dbueso@suse.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> ---
>  kernel/locking/rwsem.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index 37524a47f002..5ac72b60608b 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -1032,6 +1032,7 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
>  		 */
>  		if (adjustment && !(atomic_long_read(&sem->count) &
>  		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
> +			smp_acquire__after_ctrl_dep();
>  			raw_spin_unlock_irq(&sem->wait_lock);
>  			rwsem_set_reader_owned(sem);
>  			lockevent_inc(rwsem_rlock_fast);

If you add a comment to the code outlining the issue (preferably as a litmus
test involving sem->count and some shared data, which happens to be
vmacache_seqnum in your test), then:

Reviewed-by: Will Deacon <will@kernel.org>

Thanks,

Will

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty
  2019-07-17 13:13       ` Will Deacon
@ 2019-07-17 14:19         ` Waiman Long
  2019-07-17 19:22           ` Jan Stancek
  0 siblings, 1 reply; 22+ messages in thread
From: Waiman Long @ 2019-07-17 14:19 UTC (permalink / raw)
  To: Will Deacon, Jan Stancek; +Cc: linux-kernel, dbueso, peterz, mingo

On 7/17/19 9:13 AM, Will Deacon wrote:
> On Wed, Jul 17, 2019 at 02:02:20PM +0200, Jan Stancek wrote:
>> LTP mtest06 has been observed to rarely hit "still mapped when deleted"
>> and following BUG_ON on arm64:
>>   page:ffff7e02fa37e480 refcount:3 mapcount:1 mapping:ffff80be3d678ab0 index:0x0
>>   xfs_address_space_operations [xfs]
>>   flags: 0xbfffe000000037(locked|referenced|uptodate|lru|active)
>>   page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
>>   ------------[ cut here ]------------
>>   kernel BUG at mm/filemap.c:171!
>>   Internal error: Oops - BUG: 0 [#1] SMP
>>   CPU: 220 PID: 154292 Comm: mmap1 Not tainted 5.2.0-0ecfebd.cki #1
>>   Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.10 05/17/2019
>>   pstate: 40400089 (nZcv daIf +PAN -UAO)
>>   pc : unaccount_page_cache_page+0x17c/0x1a0
>>   lr : unaccount_page_cache_page+0x17c/0x1a0
>>   Call trace:
>>   unaccount_page_cache_page+0x17c/0x1a0
>>   delete_from_page_cache_batch+0xa0/0x300
>>   truncate_inode_pages_range+0x1b8/0x640
>>   truncate_inode_pages_final+0x88/0xa8
>>   evict+0x1a0/0x1d8
>>   iput+0x150/0x240
>>   dentry_unlink_inode+0x120/0x130
>>   __dentry_kill+0xd8/0x1d0
>>   dentry_kill+0x88/0x248
>>   dput+0x168/0x1b8
>>   __fput+0xe8/0x208
>>   ____fput+0x20/0x30
>>   task_work_run+0xc0/0xf0
>>   do_notify_resume+0x2b0/0x328
>>   work_pending+0x8/0x10
>>
>> The extra mapcount originated from pagefault handler, which handled
>> pagefault for vma that has already been detached. vma is detached
>> under mmap_sem write lock by detach_vmas_to_be_unmapped(), which
>> also invalidates vmacache.
>>
>> When pagefault handler (under mmap_sem read lock) called find_vma(),
>> vmacache_valid() wrongly reported vmacache as valid.
>>
>> After rwsem down_read() returns via 'queue empty' path (as of v5.2),
>> it does so without issuing read_acquire on sem->count:
>>   down_read
>>     __down_read
>>       rwsem_down_read_failed
>>         __rwsem_down_read_failed_common
>>           raw_spin_lock_irq(&sem->wait_lock);
>>           if (list_empty(&sem->wait_list)) {
>>             if (atomic_long_read(&sem->count) >= 0) {
>>               raw_spin_unlock_irq(&sem->wait_lock);
>>               return sem;
>>
>> Suspected problem here is that last *_acquire on down_read() side
>> happens before write side issues *_release:
>>   1. writer: has the lock
>>   2. reader: down_read() issues *read_acquire on entry
>>   3. writer: mm->vmacache_seqnum++; downgrades lock (*fetch_add_release)
>>   4. reader: __rwsem_down_read_failed_common() finds it can take lock and returns
>>   5. reader: observes stale mm->vmacache_seqnum
>>
>> I can reproduce the problem by running LTP mtest06 in a loop and building
>> kernel (-j $NCPUS) in parallel. It does reproduce since v4.20 up to v5.2
>> on arm64 HPE Apollo 70 (224 CPUs, 256GB RAM, 2 nodes). It triggers reliably
>> within ~hour. Patched kernel ran fine for 10+ hours with clean dmesg.
>> Tests were done against v5.2, since commit cf69482d62d9 ("locking/rwsem:
>> Enable readers spinning on writer") makes it much harder to reproduce.
>>
>> v2: Move barrier after test (Waiman Long)
>>     Use smp_acquire__after_ctrl_dep() (Peter Zijlstra)
>>
>> Related: https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c
>> Related: commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
>> Fixes: 4b486b535c33 ("locking/rwsem: Exit read lock slowpath if queue empty & no writer")
>>
>> Signed-off-by: Jan Stancek <jstancek@redhat.com>
>> Cc: stable@vger.kernel.org # v4.20+
>> Cc: Waiman Long <longman@redhat.com>
>> Cc: Davidlohr Bueso <dbueso@suse.de>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> ---
>>  kernel/locking/rwsem.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
>> index 37524a47f002..5ac72b60608b 100644
>> --- a/kernel/locking/rwsem.c
>> +++ b/kernel/locking/rwsem.c
>> @@ -1032,6 +1032,7 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
>>  		 */
>>  		if (adjustment && !(atomic_long_read(&sem->count) &
>>  		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
>> +			smp_acquire__after_ctrl_dep();
>>  			raw_spin_unlock_irq(&sem->wait_lock);
>>  			rwsem_set_reader_owned(sem);
>>  			lockevent_inc(rwsem_rlock_fast);
> If you add a comment to the code outlining the issue (preferably as a litmus
> test involving sem->count and some shared data which happens to be
> vmacache_seqnum in your test)), then:
>
> Reviewed-by: Will Deacon <will@kernel.org>
>
> Thanks,
>
> Will

Agreed. A comment just above smp_acquire__after_ctrl_dep() on why this
is needed would be great.

Other than that,

Acked-by: Waiman Long <longman@redhat.com>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty
  2019-07-17 12:02     ` [PATCH v2] locking/rwsem: add acquire barrier to " Jan Stancek
  2019-07-17 13:13       ` Will Deacon
@ 2019-07-17 15:33       ` Waiman Long
  1 sibling, 0 replies; 22+ messages in thread
From: Waiman Long @ 2019-07-17 15:33 UTC (permalink / raw)
  To: Jan Stancek, linux-kernel; +Cc: dbueso, will, peterz, mingo

On 7/17/19 8:02 AM, Jan Stancek wrote:
> LTP mtest06 has been observed to rarely hit "still mapped when deleted"
> and following BUG_ON on arm64:
>   page:ffff7e02fa37e480 refcount:3 mapcount:1 mapping:ffff80be3d678ab0 index:0x0
>   xfs_address_space_operations [xfs]
>   flags: 0xbfffe000000037(locked|referenced|uptodate|lru|active)
>   page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
>   ------------[ cut here ]------------
>   kernel BUG at mm/filemap.c:171!
>   Internal error: Oops - BUG: 0 [#1] SMP
>   CPU: 220 PID: 154292 Comm: mmap1 Not tainted 5.2.0-0ecfebd.cki #1
>   Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.10 05/17/2019
>   pstate: 40400089 (nZcv daIf +PAN -UAO)
>   pc : unaccount_page_cache_page+0x17c/0x1a0
>   lr : unaccount_page_cache_page+0x17c/0x1a0
>   Call trace:
>   unaccount_page_cache_page+0x17c/0x1a0
>   delete_from_page_cache_batch+0xa0/0x300
>   truncate_inode_pages_range+0x1b8/0x640
>   truncate_inode_pages_final+0x88/0xa8
>   evict+0x1a0/0x1d8
>   iput+0x150/0x240
>   dentry_unlink_inode+0x120/0x130
>   __dentry_kill+0xd8/0x1d0
>   dentry_kill+0x88/0x248
>   dput+0x168/0x1b8
>   __fput+0xe8/0x208
>   ____fput+0x20/0x30
>   task_work_run+0xc0/0xf0
>   do_notify_resume+0x2b0/0x328
>   work_pending+0x8/0x10
>
> The extra mapcount originated from pagefault handler, which handled
> pagefault for vma that has already been detached. vma is detached
> under mmap_sem write lock by detach_vmas_to_be_unmapped(), which
> also invalidates vmacache.
>
> When pagefault handler (under mmap_sem read lock) called find_vma(),
> vmacache_valid() wrongly reported vmacache as valid.
>
> After rwsem down_read() returns via 'queue empty' path (as of v5.2),
> it does so without issuing read_acquire on sem->count:
>   down_read
>     __down_read
>       rwsem_down_read_failed
>         __rwsem_down_read_failed_common
>           raw_spin_lock_irq(&sem->wait_lock);
>           if (list_empty(&sem->wait_list)) {
>             if (atomic_long_read(&sem->count) >= 0) {
>               raw_spin_unlock_irq(&sem->wait_lock);
>               return sem;
>
> Suspected problem here is that last *_acquire on down_read() side
> happens before write side issues *_release:
>   1. writer: has the lock
>   2. reader: down_read() issues *read_acquire on entry
>   3. writer: mm->vmacache_seqnum++; downgrades lock (*fetch_add_release)
>   4. reader: __rwsem_down_read_failed_common() finds it can take lock and returns
>   5. reader: observes stale mm->vmacache_seqnum
>
> I can reproduce the problem by running LTP mtest06 in a loop and building
> kernel (-j $NCPUS) in parallel. It does reproduce since v4.20 up to v5.2
> on arm64 HPE Apollo 70 (224 CPUs, 256GB RAM, 2 nodes). It triggers reliably
> within ~hour. Patched kernel ran fine for 10+ hours with clean dmesg.
> Tests were done against v5.2, since commit cf69482d62d9 ("locking/rwsem:
> Enable readers spinning on writer") makes it much harder to reproduce.
>
> v2: Move barrier after test (Waiman Long)
>     Use smp_acquire__after_ctrl_dep() (Peter Zijlstra)
>
> Related: https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c
> Related: commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
> Fixes: 4b486b535c33 ("locking/rwsem: Exit read lock slowpath if queue empty & no writer")
>
> Signed-off-by: Jan Stancek <jstancek@redhat.com>
> Cc: stable@vger.kernel.org # v4.20+
> Cc: Waiman Long <longman@redhat.com>
> Cc: Davidlohr Bueso <dbueso@suse.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> ---
>  kernel/locking/rwsem.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index 37524a47f002..5ac72b60608b 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -1032,6 +1032,7 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
>  		 */
>  		if (adjustment && !(atomic_long_read(&sem->count) &
>  		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
> +			smp_acquire__after_ctrl_dep();
>  			raw_spin_unlock_irq(&sem->wait_lock);
>  			rwsem_set_reader_owned(sem);
>  			lockevent_inc(rwsem_rlock_fast);

The corresponding change for v5.2 or earlier kernels is:

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index fbe96341beee..2fbbb2d46396 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -246,6 +246,7 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, in
                 * been set in the count.
                 */
                if (atomic_long_read(&sem->count) >= 0) {
+                       smp_acquire__after_ctrl_dep();
                        raw_spin_unlock_irq(&sem->wait_lock);
                        return sem;
                }

-Longman


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty
  2019-07-17 14:19         ` Waiman Long
@ 2019-07-17 19:22           ` Jan Stancek
  2019-07-17 19:39             ` Waiman Long
  2019-07-18  9:26             ` [PATCH v2] locking/rwsem: add acquire barrier " Will Deacon
  0 siblings, 2 replies; 22+ messages in thread
From: Jan Stancek @ 2019-07-17 19:22 UTC (permalink / raw)
  To: Waiman Long; +Cc: Will Deacon, linux-kernel, dbueso, peterz, mingo, jstancek

On Wed, Jul 17, 2019 at 10:19:04AM -0400, Waiman Long wrote:
>> If you add a comment to the code outlining the issue (preferably as a litmus
>> test involving sem->count and some shared data which happens to be
>> vmacache_seqnum in your test)), then:
>>
>> Reviewed-by: Will Deacon <will@kernel.org>
>>
>> Thanks,
>>
>> Will
>
>Agreed. A comment just above smp_acquire__after_ctrl_dep() on why this
>is needed will be great.
>
>Other than that,
>
>Acked-by: Waiman Long <longman@redhat.com>
>

The litmus test looks a bit long; would the following be acceptable?

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 37524a47f002..d9c96651bfc7 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1032,6 +1032,13 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
  		 */
  		if (adjustment && !(atomic_long_read(&sem->count) &
  		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
+			/*
+			 * down_read() issued ACQUIRE on enter, but we can race
+			 * with writer who did RELEASE only after us.
+			 * ACQUIRE here makes sure reader operations happen only
+			 * after all writer ones.
+			 */
+			smp_acquire__after_ctrl_dep();
  			raw_spin_unlock_irq(&sem->wait_lock);
  			rwsem_set_reader_owned(sem);
  			lockevent_inc(rwsem_rlock_fast);


with litmus test in commit log:
----------------------------------- 8< ------------------------------------
C rwsem

{
	atomic_t rwsem_count = ATOMIC_INIT(1);
	int vmacache_seqnum = 10;
}

P0(int *vmacache_seqnum, atomic_t *rwsem_count)
{
	r0 = READ_ONCE(*vmacache_seqnum);
	WRITE_ONCE(*vmacache_seqnum, r0 + 1);
	/* downgrade_write */
	r1 = atomic_fetch_add_release(-1+256, rwsem_count);
}

P1(int *vmacache_seqnum, atomic_t *rwsem_count, spinlock_t *sem_wait_lock)
{
	/* rwsem_read_trylock */
	r0 = atomic_add_return_acquire(256, rwsem_count);
	/* rwsem_down_read_slowpath */
	spin_lock(sem_wait_lock);
	r0 = atomic_read(rwsem_count);
	if ((r0 & 1) == 0) {
		// BUG: needs barrier
		spin_unlock(sem_wait_lock);
		r1 = READ_ONCE(*vmacache_seqnum);
	}
}
exists (1:r1=10)
----------------------------------- 8< ------------------------------------

Thanks,
Jan


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty
  2019-07-17 19:22           ` Jan Stancek
@ 2019-07-17 19:39             ` Waiman Long
  2019-07-18  8:51               ` [PATCH v3] " Jan Stancek
  2019-07-18  9:26             ` [PATCH v2] locking/rwsem: add acquire barrier " Will Deacon
  1 sibling, 1 reply; 22+ messages in thread
From: Waiman Long @ 2019-07-17 19:39 UTC (permalink / raw)
  To: Jan Stancek; +Cc: Will Deacon, linux-kernel, dbueso, peterz, mingo

On 7/17/19 3:22 PM, Jan Stancek wrote:
> On Wed, Jul 17, 2019 at 10:19:04AM -0400, Waiman Long wrote:
>>> If you add a comment to the code outlining the issue (preferably as
>>> a litmus
>>> test involving sem->count and some shared data which happens to be
>>> vmacache_seqnum in your test)), then:
>>>
>>> Reviewed-by: Will Deacon <will@kernel.org>
>>>
>>> Thanks,
>>>
>>> Will
>>
>> Agreed. A comment just above smp_acquire__after_ctrl_dep() on why this
>> is needed will be great.
>>
>> Other than that,
>>
>> Acked-by: Waiman Long <longman@redhat.com>
>>
>
> litmus test looks a bit long, would following be acceptable?
>
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index 37524a47f002..d9c96651bfc7 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -1032,6 +1032,13 @@ static inline bool
> rwsem_reader_phase_trylock(struct rw_semaphore *sem,
>           */
>          if (adjustment && !(atomic_long_read(&sem->count) &
>               (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
> +            /*
> +             * down_read() issued ACQUIRE on enter, but we can race
> +             * with writer who did RELEASE only after us.
> +             * ACQUIRE here makes sure reader operations happen only
> +             * after all writer ones.
> +             */


How about this?

                /*
                 * Add an acquire barrier here to make sure no stale data
                 * acquired before the above test where the writer may still
                 * be holding the lock will be reused in the reader critical
                 * section.
                 */

Thanks,
Longman



^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v3] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty
  2019-07-17 19:39             ` Waiman Long
@ 2019-07-18  8:51               ` Jan Stancek
  2019-07-25 16:00                 ` [tip:locking/core] locking/rwsem: Add missing ACQUIRE " tip-bot for Jan Stancek
  0 siblings, 1 reply; 22+ messages in thread
From: Jan Stancek @ 2019-07-18  8:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: longman, dbueso, will, peterz, mingo, jstancek

LTP mtest06 has been observed to rarely hit "still mapped when deleted"
and the following BUG_ON on arm64:
  page:ffff7e02fa37e480 refcount:3 mapcount:1 mapping:ffff80be3d678ab0 index:0x0
  xfs_address_space_operations [xfs]
  flags: 0xbfffe000000037(locked|referenced|uptodate|lru|active)
  page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
  ------------[ cut here ]------------
  kernel BUG at mm/filemap.c:171!
  Internal error: Oops - BUG: 0 [#1] SMP
  CPU: 220 PID: 154292 Comm: mmap1 Not tainted 5.2.0-0ecfebd.cki #1
  Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.10 05/17/2019
  pstate: 40400089 (nZcv daIf +PAN -UAO)
  pc : unaccount_page_cache_page+0x17c/0x1a0
  lr : unaccount_page_cache_page+0x17c/0x1a0
  Call trace:
  unaccount_page_cache_page+0x17c/0x1a0
  delete_from_page_cache_batch+0xa0/0x300
  truncate_inode_pages_range+0x1b8/0x640
  truncate_inode_pages_final+0x88/0xa8
  evict+0x1a0/0x1d8
  iput+0x150/0x240
  dentry_unlink_inode+0x120/0x130
  __dentry_kill+0xd8/0x1d0
  dentry_kill+0x88/0x248
  dput+0x168/0x1b8
  __fput+0xe8/0x208
  ____fput+0x20/0x30
  task_work_run+0xc0/0xf0
  do_notify_resume+0x2b0/0x328
  work_pending+0x8/0x10

The extra mapcount originated from the pagefault handler, which handled
a pagefault for a vma that had already been detached. The vma is detached
under the mmap_sem write lock by detach_vmas_to_be_unmapped(), which
also invalidates the vmacache.

When the pagefault handler (under the mmap_sem read lock) called find_vma(),
vmacache_valid() wrongly reported the vmacache as valid.

When rwsem down_read() returns via the 'queue empty' path (as of v5.2),
it does so without issuing a read_acquire on sem->count:
  down_read
    __down_read
      rwsem_down_read_failed
        __rwsem_down_read_failed_common
          raw_spin_lock_irq(&sem->wait_lock);
          if (list_empty(&sem->wait_list)) {
            if (atomic_long_read(&sem->count) >= 0) {
              raw_spin_unlock_irq(&sem->wait_lock);
              return sem;

The suspected problem is that the last *_acquire on the down_read() side
happens before the write side issues its *_release:
  1. writer: has the lock
  2. reader: down_read() issues *read_acquire on entry
  3. writer: mm->vmacache_seqnum++; downgrades lock (*fetch_add_release)
  4. reader: __rwsem_down_read_failed_common() finds it can take lock and returns
  5. reader: observes stale mm->vmacache_seqnum

----------------------------------- 8< ------------------------------------
C rwsem

{
	atomic_t rwsem_count = ATOMIC_INIT(1);
	int vmacache_seqnum = 10;
}

P0(int *vmacache_seqnum, atomic_t *rwsem_count)
{
	r0 = READ_ONCE(*vmacache_seqnum);
	WRITE_ONCE(*vmacache_seqnum, r0 + 1);
	/* downgrade_write */
	r1 = atomic_fetch_add_release(-1+256, rwsem_count);
}

P1(int *vmacache_seqnum, atomic_t *rwsem_count, spinlock_t *sem_wait_lock)
{
	/* rwsem_read_trylock */
	r0 = atomic_add_return_acquire(256, rwsem_count);
	/* rwsem_down_read_slowpath */
	spin_lock(sem_wait_lock);
	r0 = atomic_read(rwsem_count);
	if ((r0 & 1) == 0) {
		// BUG: needs barrier
		spin_unlock(sem_wait_lock);
		r1 = READ_ONCE(*vmacache_seqnum);
	}
}
exists (1:r1=10)
----------------------------------- >8 ------------------------------------

I can reproduce the problem by running LTP mtest06 in a loop and building
a kernel (-j $NCPUS) in parallel. It reproduces from v4.20 up to v5.2
on an arm64 HPE Apollo 70 (224 CPUs, 256GB RAM, 2 nodes). It triggers reliably
within ~1 hour. The patched kernel ran fine for 10+ hours with a clean dmesg.
Tests were done against v5.2, since commit cf69482d62d9 ("locking/rwsem:
Enable readers spinning on writer") makes it much harder to reproduce.

v2: Move barrier after test (Waiman Long)
    Use smp_acquire__after_ctrl_dep() (Peter Zijlstra)
v3: Add comment to barrier (Waiman Long, Will Deacon)
    Add litmus test

Related: https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c
Related: commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
Fixes: 4b486b535c33 ("locking/rwsem: Exit read lock slowpath if queue empty & no writer")

Signed-off-by: Jan Stancek <jstancek@redhat.com>
Reviewed-by: Will Deacon <will@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: stable@vger.kernel.org # v4.20+
Cc: Waiman Long <longman@redhat.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
---
 kernel/locking/rwsem.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 37524a47f002..fe02aef39e9d 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1032,6 +1032,13 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
 		 */
 		if (adjustment && !(atomic_long_read(&sem->count) &
 		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
+			/*
+			 * Add an acquire barrier here to make sure no stale
+			 * data acquired before the above test, where the writer
+			 * may still be holding the lock, will be reused in the
+			 * reader critical section.
+			 */
+			smp_acquire__after_ctrl_dep();
 			raw_spin_unlock_irq(&sem->wait_lock);
 			rwsem_set_reader_owned(sem);
 			lockevent_inc(rwsem_rlock_fast);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty
  2019-07-17 19:22           ` Jan Stancek
  2019-07-17 19:39             ` Waiman Long
@ 2019-07-18  9:26             ` Will Deacon
  2019-07-18 10:50               ` Jan Stancek
  2019-07-18 10:58               ` Peter Zijlstra
  1 sibling, 2 replies; 22+ messages in thread
From: Will Deacon @ 2019-07-18  9:26 UTC (permalink / raw)
  To: Jan Stancek
  Cc: Waiman Long, linux-kernel, dbueso, peterz, mingo, jade.alglave, paulmck

Hi Jan, Waiman, [+Jade and Paul for the litmus test at the end]

On Wed, Jul 17, 2019 at 09:22:00PM +0200, Jan Stancek wrote:
> On Wed, Jul 17, 2019 at 10:19:04AM -0400, Waiman Long wrote:
> > > If you add a comment to the code outlining the issue (preferably as a litmus
> > > test involving sem->count and some shared data which happens to be
> > > vmacache_seqnum in your test)), then:
> > > 
> > > Reviewed-by: Will Deacon <will@kernel.org>
> > > 
> > > Thanks,
> > > 
> > > Will
> > 
> > Agreed. A comment just above smp_acquire__after_ctrl_dep() on why this
> > is needed will be great.
> > 
> > Other than that,
> > 
> > Acked-by: Waiman Long <longman@redhat.com>
> > 
> 
> litmus test looks a bit long, would following be acceptable?
> 
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index 37524a47f002..d9c96651bfc7 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -1032,6 +1032,13 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
>  		 */
>  		if (adjustment && !(atomic_long_read(&sem->count) &
>  		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
> +			/*
> +			 * down_read() issued ACQUIRE on enter, but we can race
> +			 * with writer who did RELEASE only after us.
> +			 * ACQUIRE here makes sure reader operations happen only
> +			 * after all writer ones.
> +			 */

How about an abridged form of the litmus test here, just to show the code
flow? e.g.:

/*
 * We need to ensure ACQUIRE semantics when reading sem->count so that
 * we pair with the RELEASE store performed by an unlocking/downgrading
 * writer.
 *
 * P0 (writer)			P1 (reader)
 *
 * down_write(sem);
 * <write shared data>
 * downgrade_write(sem);
 * -> fetch_add_release(&sem->count)
 *
 *				down_read_slowpath(sem);
 *				-> atomic_read(&sem->count)
 *				   <ctrl dep>
 *				   smp_acquire__after_ctrl_dep()
 *				<read shared data>
 */

In writing this, I also noticed that we don't have any explicit ordering
at the end of the reader slowpath when we wait on the queue but get woken
immediately:

	if (!waiter.task)
		break;

Am I missing something?

> +			smp_acquire__after_ctrl_dep();
>  			raw_spin_unlock_irq(&sem->wait_lock);
>  			rwsem_set_reader_owned(sem);
>  			lockevent_inc(rwsem_rlock_fast);
> 
> 
> with litmus test in commit log:
> ----------------------------------- 8< ------------------------------------
> C rwsem
> 
> {
> 	atomic_t rwsem_count = ATOMIC_INIT(1);
> 	int vmacache_seqnum = 10;
> }
> 
> P0(int *vmacache_seqnum, atomic_t *rwsem_count)
> {
> 	r0 = READ_ONCE(*vmacache_seqnum);
> 	WRITE_ONCE(*vmacache_seqnum, r0 + 1);
> 	/* downgrade_write */
> 	r1 = atomic_fetch_add_release(-1+256, rwsem_count);
> }
> 
> P1(int *vmacache_seqnum, atomic_t *rwsem_count, spinlock_t *sem_wait_lock)
> {
> 	/* rwsem_read_trylock */
> 	r0 = atomic_add_return_acquire(256, rwsem_count);
> 	/* rwsem_down_read_slowpath */
> 	spin_lock(sem_wait_lock);
> 	r0 = atomic_read(rwsem_count);
> 	if ((r0 & 1) == 0) {
> 		// BUG: needs barrier
> 		spin_unlock(sem_wait_lock);
> 		r1 = READ_ONCE(*vmacache_seqnum);
> 	}
> }
> exists (1:r1=10)
> ----------------------------------- 8< ------------------------------------

Thanks for writing this! It's definitely worth sticking in the commit
log, and Paul and Jade might also like to include it as part of their
litmus test repository.

Will

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty
  2019-07-18  9:26             ` [PATCH v2] locking/rwsem: add acquire barrier " Will Deacon
@ 2019-07-18 10:50               ` Jan Stancek
  2019-07-18 11:04                 ` Peter Zijlstra
  2019-07-18 12:12                 ` Paul E. McKenney
  2019-07-18 10:58               ` Peter Zijlstra
  1 sibling, 2 replies; 22+ messages in thread
From: Jan Stancek @ 2019-07-18 10:50 UTC (permalink / raw)
  To: Will Deacon
  Cc: Waiman Long, linux-kernel, dbueso, peterz, mingo, jade alglave,
	paulmck, Jan Stancek


----- Original Message -----
> Hi Jan, Waiman, [+Jade and Paul for the litmus test at the end]
> 
> On Wed, Jul 17, 2019 at 09:22:00PM +0200, Jan Stancek wrote:
> > On Wed, Jul 17, 2019 at 10:19:04AM -0400, Waiman Long wrote:
> > > > If you add a comment to the code outlining the issue (preferably as a
> > > > litmus
> > > > test involving sem->count and some shared data which happens to be
> > > > vmacache_seqnum in your test), then:
> > > > 
> > > > Reviewed-by: Will Deacon <will@kernel.org>
> > > > 
> > > > Thanks,
> > > > 
> > > > Will
> > > 
> > > Agreed. A comment just above smp_acquire__after_ctrl_dep() on why this
> > > is needed will be great.
> > > 
> > > Other than that,
> > > 
> > > Acked-by: Waiman Long <longman@redhat.com>
> > > 
> > 
> > The litmus test looks a bit long; would the following be acceptable?
> > 
> > diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> > index 37524a47f002..d9c96651bfc7 100644
> > --- a/kernel/locking/rwsem.c
> > +++ b/kernel/locking/rwsem.c
> > @@ -1032,6 +1032,13 @@ static inline bool rwsem_reader_phase_trylock(struct
> > rw_semaphore *sem,
> >  		 */
> >  		if (adjustment && !(atomic_long_read(&sem->count) &
> >  		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
> > +			/*
> > +			 * down_read() issued ACQUIRE on enter, but we can race
> > +			 * with a writer who did RELEASE only after us.
> > +			 * ACQUIRE here makes sure reader operations happen only
> > +			 * after all writer ones.
> > +			 */
> 
> How about an abridged form of the litmus test here, just to show the code
> flow? e.g.:
> 
> /*
>  * We need to ensure ACQUIRE semantics when reading sem->count so that
>  * we pair with the RELEASE store performed by an unlocking/downgrading
>  * writer.
>  *
>  * P0 (writer)			P1 (reader)
>  *
>  * down_write(sem);
>  * <write shared data>
>  * downgrade_write(sem);
>  * -> fetch_add_release(&sem->count)
>  *
>  *				down_read_slowpath(sem);
>  *				-> atomic_read(&sem->count)
>  *				   <ctrl dep>
>  *				   smp_acquire__after_ctrl_dep()
>  *				<read shared data>
>  */

Works for me. The code is at 3 levels of indentation, but I can try
to squeeze it in for v4.

> 
> In writing this, I also noticed that we don't have any explicit ordering
> at the end of the reader slowpath when we wait on the queue but get woken
> immediately:
> 
> 	if (!waiter.task)
> 		break;
> 
> Am I missing something?

I'm assuming this isn't a problem, because set_current_state() on the line above
is using smp_store_mb().
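
For reference, a simplified sketch (not verbatim from the tree) of that wait
loop: set_current_state() expands to smp_store_mb(), i.e. a store plus a full
barrier, but the waiter.task check that follows is still a plain load:

	for (;;) {
		/* set_current_state(state) -> smp_store_mb() */
		smp_store_mb(current->state, state);

		/* plain load of waiter.task, no ACQUIRE */
		if (!waiter.task)
			break;

		...
		schedule();
	}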

> 
> > +			smp_acquire__after_ctrl_dep();
> >  			raw_spin_unlock_irq(&sem->wait_lock);
> >  			rwsem_set_reader_owned(sem);
> >  			lockevent_inc(rwsem_rlock_fast);
> > 
> > 
> > with litmus test in commit log:
> > ----------------------------------- 8< ------------------------------------
> > C rwsem
> > 
> > {
> > 	atomic_t rwsem_count = ATOMIC_INIT(1);
> > 	int vmacache_seqnum = 10;
> > }
> > 
> > P0(int *vmacache_seqnum, atomic_t *rwsem_count)
> > {
> > 	r0 = READ_ONCE(*vmacache_seqnum);
> > 	WRITE_ONCE(*vmacache_seqnum, r0 + 1);
> > 	/* downgrade_write */
> > 	r1 = atomic_fetch_add_release(-1+256, rwsem_count);
> > }
> > 
> > P1(int *vmacache_seqnum, atomic_t *rwsem_count, spinlock_t *sem_wait_lock)
> > {
> > 	/* rwsem_read_trylock */
> > 	r0 = atomic_add_return_acquire(256, rwsem_count);
> > 	/* rwsem_down_read_slowpath */
> > 	spin_lock(sem_wait_lock);
> > 	r0 = atomic_read(rwsem_count);
> > 	if ((r0 & 1) == 0) {
> > 		// BUG: needs barrier
> > 		spin_unlock(sem_wait_lock);
> > 		r1 = READ_ONCE(*vmacache_seqnum);
> > 	}
> > }
> > exists (1:r1=10)
> > ----------------------------------- 8< ------------------------------------
> 
> Thanks for writing this! It's definitely worth sticking it in the commit
> log, but Paul and Jade might also like to include it as part of their litmus
> test repository too.
> 
> Will
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty
  2019-07-18  9:26             ` [PATCH v2] locking/rwsem: add acquire barrier " Will Deacon
  2019-07-18 10:50               ` Jan Stancek
@ 2019-07-18 10:58               ` Peter Zijlstra
  2019-07-18 11:45                 ` Will Deacon
  1 sibling, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2019-07-18 10:58 UTC (permalink / raw)
  To: Will Deacon
  Cc: Jan Stancek, Waiman Long, linux-kernel, dbueso, mingo,
	jade.alglave, paulmck

On Thu, Jul 18, 2019 at 10:26:41AM +0100, Will Deacon wrote:

> /*
>  * We need to ensure ACQUIRE semantics when reading sem->count so that
>  * we pair with the RELEASE store performed by an unlocking/downgrading
>  * writer.
>  *
>  * P0 (writer)			P1 (reader)
>  *
>  * down_write(sem);
>  * <write shared data>
>  * downgrade_write(sem);
>  * -> fetch_add_release(&sem->count)
>  *
>  *				down_read_slowpath(sem);
>  *				-> atomic_read(&sem->count)
>  *				   <ctrl dep>
>  *				   smp_acquire__after_ctrl_dep()
>  *				<read shared data>
>  */

So I'm thinking all this is excessive; the simple rule is: lock acquire
should imply ACQUIRE, we all know why.

> In writing this, I also noticed that we don't have any explicit ordering
> at the end of the reader slowpath when we wait on the queue but get woken
> immediately:
> 
> 	if (!waiter.task)
> 		break;
> 
> Am I missing something?

Ha! I ran into the very same one. I keep confusing myself, but I think
you're right and that needs to be smp_load_acquire() to match the
smp_store_release() in rwsem_mark_wake().

(the actual race there is _tiny_ due to the smp_mb() right before it,
but I cannot convince myself that is indeed sufficient)

The signal_pending_state() case is also fun, but I think wait_lock there
is sufficient (even under RCpc).

I've ended up with this..

---
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 37524a47f002..9eb630904a17 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1000,6 +1000,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 	atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
 	adjustment = 0;
 	if (rwsem_optimistic_spin(sem, false)) {
+		/* rwsem_optimistic_spin() implies ACQUIRE through rwsem_*trylock() */
 		/*
 		 * Wake up other readers in the wait list if the front
 		 * waiter is a reader.
@@ -1014,6 +1015,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 		}
 		return sem;
 	} else if (rwsem_reader_phase_trylock(sem, waiter.last_rowner)) {
+		/* rwsem_reader_phase_trylock() implies ACQUIRE */
 		return sem;
 	}
 
@@ -1032,6 +1034,8 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 		 */
 		if (adjustment && !(atomic_long_read(&sem->count) &
 		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
+			/* Provide lock ACQUIRE */
+			smp_acquire__after_ctrl_dep();
 			raw_spin_unlock_irq(&sem->wait_lock);
 			rwsem_set_reader_owned(sem);
 			lockevent_inc(rwsem_rlock_fast);
@@ -1065,15 +1069,25 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 	wake_up_q(&wake_q);
 
 	/* wait to be given the lock */
-	while (true) {
+	for (;;) {
 		set_current_state(state);
-		if (!waiter.task)
+		if (!smp_load_acquire(&waiter.task)) {
+			/*
+			 * Matches rwsem_mark_wake()'s smp_store_release() and ensures
+			 * we're ordered against its sem->count operations.
+			 */
 			break;
+		}
 		if (signal_pending_state(state, current)) {
 			raw_spin_lock_irq(&sem->wait_lock);
 			if (waiter.task)
 				goto out_nolock;
 			raw_spin_unlock_irq(&sem->wait_lock);
+			/*
+			 * Ordered by sem->wait_lock against rwsem_mark_wake(), if we
+			 * see its waiter.task store, we must also see its sem->count
+			 * changes.
+			 */
 			break;
 		}
 		schedule();
@@ -1083,6 +1097,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 	__set_current_state(TASK_RUNNING);
 	lockevent_inc(rwsem_rlock);
 	return sem;
+
 out_nolock:
 	list_del(&waiter.list);
 	if (list_empty(&sem->wait_list)) {


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty
  2019-07-18 10:50               ` Jan Stancek
@ 2019-07-18 11:04                 ` Peter Zijlstra
  2019-07-18 11:09                   ` Peter Zijlstra
  2019-07-18 12:12                 ` Paul E. McKenney
  1 sibling, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2019-07-18 11:04 UTC (permalink / raw)
  To: Jan Stancek
  Cc: Will Deacon, Waiman Long, linux-kernel, dbueso, mingo,
	jade.alglave, paulmck

On Thu, Jul 18, 2019 at 06:50:52AM -0400, Jan Stancek wrote:
> > In writing this, I also noticed that we don't have any explicit ordering
> > at the end of the reader slowpath when we wait on the queue but get woken
> > immediately:
> > 
> > 	if (!waiter.task)
> > 		break;
> > 
> > Am I missing something?
> 
> I'm assuming this isn't problem, because set_current_state() on line above
> is using smp_store_mb().


X = 0;

								X = 1;
	rwsem_down_read()					rwsem_up_write();

	  for (;;) {
	    set_current_state(TASK_UNINTERRUPTIBLE);

								  rwsem_mark_wake()
								    atomic_long_add(adjustment, &sem->count);
								    smp_store_release(&waiter->task, NULL);

	    if (!waiter.task)
	      break;

	    ...
	  }


	r = X;


can, I think, result in r==0 just fine, because there's nothing ordering
the load of waiter.task with the store of X.

It is exceedingly unlikely, but not impossible afaict.
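
An untested litmus-test sketch of the same interleaving, in the style of the
one from the commit log (set_current_state()'s smp_store_mb() and the
sem->count update are elided; names are illustrative):

C rwsem-waiter-task-sketch

{
	int waiter_task = 1;	/* non-zero: reader still queued */
	int X = 0;		/* shared data written under the lock */
}

P0(int *waiter_task, int *X)
{
	/* writer: store shared data, then wake the reader */
	WRITE_ONCE(*X, 1);
	smp_store_release(waiter_task, 0);
}

P1(int *waiter_task, int *X)
{
	/* reader slowpath: plain load of waiter.task, then read shared data */
	r0 = READ_ONCE(*waiter_task);
	if (r0 == 0) {
		r1 = READ_ONCE(*X);
	}
}
exists (1:r0=0 /\ 1:r1=0)

The control dependency only orders the load against later stores, so the
r1=0 outcome stays allowed without an acquire.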

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty
  2019-07-18 11:04                 ` Peter Zijlstra
@ 2019-07-18 11:09                   ` Peter Zijlstra
  2019-07-18 11:36                     ` Jan Stancek
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2019-07-18 11:09 UTC (permalink / raw)
  To: Jan Stancek
  Cc: Will Deacon, Waiman Long, linux-kernel, dbueso, mingo,
	jade.alglave, paulmck


It's simpler like so:

On Thu, Jul 18, 2019 at 01:04:46PM +0200, Peter Zijlstra wrote:
> X = 0;
> 
> 	rwsem_down_read()
> 	  for (;;) {
> 	    set_current_state(TASK_UNINTERRUPTIBLE);
>
>								X = 1;
>                                                               rwsem_up_write();
> 								  rwsem_mark_wake()
> 								    atomic_long_add(adjustment, &sem->count);
> 								    smp_store_release(&waiter->task, NULL);
> 
> 	    if (!waiter.task)
> 	      break;
> 
> 	    ...
> 	  }
> 
> 	r = X;
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty
  2019-07-18 11:09                   ` Peter Zijlstra
@ 2019-07-18 11:36                     ` Jan Stancek
  0 siblings, 0 replies; 22+ messages in thread
From: Jan Stancek @ 2019-07-18 11:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Will Deacon, Waiman Long, linux-kernel, dbueso, mingo,
	jade.alglave, paulmck


----- Original Message -----
> 
> It's simpler like so:
> 
> On Thu, Jul 18, 2019 at 01:04:46PM +0200, Peter Zijlstra wrote:
> > X = 0;
> > 
> > 	rwsem_down_read()
> > 	  for (;;) {
> > 	    set_current_state(TASK_UNINTERRUPTIBLE);
> >
> >								X = 1;
> >                                                               rwsem_up_write();
> > 								  rwsem_mark_wake()
> > 								    atomic_long_add(adjustment, &sem->count);
> > 								    smp_store_release(&waiter->task, NULL);
> > 
> > 	    if (!waiter.task)
> > 	      break;
> > 
> > 	    ...
> > 	  }
> > 
> > 	r = X;
> > 

I see - it looks possible. Thank you for this example.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty
  2019-07-18 10:58               ` Peter Zijlstra
@ 2019-07-18 11:45                 ` Will Deacon
  2019-07-18 12:23                   ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Will Deacon @ 2019-07-18 11:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jan Stancek, Waiman Long, linux-kernel, dbueso, mingo,
	jade.alglave, paulmck

On Thu, Jul 18, 2019 at 12:58:12PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 18, 2019 at 10:26:41AM +0100, Will Deacon wrote:
> 
> > /*
> >  * We need to ensure ACQUIRE semantics when reading sem->count so that
> >  * we pair with the RELEASE store performed by an unlocking/downgrading
> >  * writer.
> >  *
> >  * P0 (writer)			P1 (reader)
> >  *
> >  * down_write(sem);
> >  * <write shared data>
> >  * downgrade_write(sem);
> >  * -> fetch_add_release(&sem->count)
> >  *
> >  *				down_read_slowpath(sem);
> >  *				-> atomic_read(&sem->count)
> >  *				   <ctrl dep>
> >  *				   smp_acquire__after_ctrl_dep()
> >  *				<read shared data>
> >  */
> 
> So I'm thinking all this is excessive; the simple rule is: lock acquire
> should imply ACQUIRE, we all know why.

Fair enough, I just thought this was worth highlighting because you can't
rely on the wait_lock to give you ACQUIRE ordering.

> > In writing this, I also noticed that we don't have any explicit ordering
> > at the end of the reader slowpath when we wait on the queue but get woken
> > immediately:
> > 
> > 	if (!waiter.task)
> > 		break;
> > 
> > Am I missing something?
> 
> Ha!, I ran into the very same one. I keep confusing myself, but I think
> you're right and that needs to be smp_load_acquire() to match the
> smp_store_release() in rwsem_mark_wake().
> 
> (the actual race there is _tiny_ due to the smp_mb() right before it,
> but I cannot convince myself that is indeed sufficient)
> 
> The signal_pending_state() case is also fun, but I think wait_lock there
> is sufficient (even under RCpc).
> 
> I've ended up with this..
> 
> ---
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index 37524a47f002..9eb630904a17 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -1000,6 +1000,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
>  	atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
>  	adjustment = 0;
>  	if (rwsem_optimistic_spin(sem, false)) {
> +		/* rwsem_optimistic_spin() implies ACQUIRE through rwsem_*trylock() */

I couldn't figure out if this was dependent on the return value or not,
and looking at osq_lock() I also couldn't see the ACQUIRE barrier when we're
spinning on node->locked. Hmm.

>  		/*
>  		 * Wake up other readers in the wait list if the front
>  		 * waiter is a reader.
> @@ -1014,6 +1015,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
>  		}
>  		return sem;
>  	} else if (rwsem_reader_phase_trylock(sem, waiter.last_rowner)) {
> +		/* rwsem_reader_phase_trylock() implies ACQUIRE */

Can we add "on success" to the end of this, please?

>  		return sem;
>  	}
>  
> @@ -1032,6 +1034,8 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
>  		 */
>  		if (adjustment && !(atomic_long_read(&sem->count) &
>  		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
> +			/* Provide lock ACQUIRE */
> +			smp_acquire__after_ctrl_dep();
>  			raw_spin_unlock_irq(&sem->wait_lock);
>  			rwsem_set_reader_owned(sem);
>  			lockevent_inc(rwsem_rlock_fast);
> @@ -1065,15 +1069,25 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
>  	wake_up_q(&wake_q);
>  
>  	/* wait to be given the lock */
> -	while (true) {
> +	for (;;) {
>  		set_current_state(state);
> -		if (!waiter.task)
> +		if (!smp_load_acquire(&waiter.task)) {
> +			/*
> +			 * Matches rwsem_mark_wake()'s smp_store_release() and ensures
> +			 * we're ordered against its sem->count operations.
> +			 */
>  			break;
> +		}

Ack. Also, grepping for 'waiter.task' reveals a similar usage in
drivers/tty/tty_ldsem.c if you're feeling brave enough.
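
The ldsem code follows the same idiom, roughly (illustrative sketch only, not
a verbatim excerpt from drivers/tty/tty_ldsem.c):

	/* wait to be given the lock */
	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (!waiter.task)	/* plain load here as well */
			break;
		if (!timeout)
			break;
		timeout = schedule_timeout(timeout);
	}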

Will

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty
  2019-07-18 10:50               ` Jan Stancek
  2019-07-18 11:04                 ` Peter Zijlstra
@ 2019-07-18 12:12                 ` Paul E. McKenney
  1 sibling, 0 replies; 22+ messages in thread
From: Paul E. McKenney @ 2019-07-18 12:12 UTC (permalink / raw)
  To: Jan Stancek
  Cc: Will Deacon, Waiman Long, linux-kernel, dbueso, peterz, mingo,
	jade.alglave

On Thu, Jul 18, 2019 at 06:50:52AM -0400, Jan Stancek wrote:
> 
> ----- Original Message -----
> > Hi Jan, Waiman, [+Jade and Paul for the litmus test at the end]
> > 
> > On Wed, Jul 17, 2019 at 09:22:00PM +0200, Jan Stancek wrote:
> > > On Wed, Jul 17, 2019 at 10:19:04AM -0400, Waiman Long wrote:
> > > > > If you add a comment to the code outlining the issue (preferably as a
> > > > > litmus
> > > > > test involving sem->count and some shared data which happens to be
> > > > > vmacache_seqnum in your test), then:
> > > > > 
> > > > > Reviewed-by: Will Deacon <will@kernel.org>
> > > > > 
> > > > > Thanks,
> > > > > 
> > > > > Will
> > > > 
> > > > Agreed. A comment just above smp_acquire__after_ctrl_dep() on why this
> > > > is needed will be great.
> > > > 
> > > > Other than that,
> > > > 
> > > > Acked-by: Waiman Long <longman@redhat.com>
> > > > 
> > > 
> > > The litmus test looks a bit long; would the following be acceptable?
> > > 
> > > diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> > > index 37524a47f002..d9c96651bfc7 100644
> > > --- a/kernel/locking/rwsem.c
> > > +++ b/kernel/locking/rwsem.c
> > > @@ -1032,6 +1032,13 @@ static inline bool rwsem_reader_phase_trylock(struct
> > > rw_semaphore *sem,
> > >  		 */
> > >  		if (adjustment && !(atomic_long_read(&sem->count) &
> > >  		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
> > > +			/*
> > > +			 * down_read() issued ACQUIRE on enter, but we can race
> > > +			 * with a writer who did RELEASE only after us.
> > > +			 * ACQUIRE here makes sure reader operations happen only
> > > +			 * after all writer ones.
> > > +			 */
> > 
> > How about an abridged form of the litmus test here, just to show the code
> > flow? e.g.:
> > 
> > /*
> >  * We need to ensure ACQUIRE semantics when reading sem->count so that
> >  * we pair with the RELEASE store performed by an unlocking/downgrading
> >  * writer.
> >  *
> >  * P0 (writer)			P1 (reader)
> >  *
> >  * down_write(sem);
> >  * <write shared data>
> >  * downgrade_write(sem);
> >  * -> fetch_add_release(&sem->count)
> >  *
> >  *				down_read_slowpath(sem);
> >  *				-> atomic_read(&sem->count)
> >  *				   <ctrl dep>
> >  *				   smp_acquire__after_ctrl_dep()
> >  *				<read shared data>
> >  */
> 
> Works for me. The code is at 3 levels of indentation, but I can try
> to squeeze it in for v4.
> 
> > 
> > In writing this, I also noticed that we don't have any explicit ordering
> > at the end of the reader slowpath when we wait on the queue but get woken
> > immediately:
> > 
> > 	if (!waiter.task)
> > 		break;
> > 
> > Am I missing something?
> 
> I'm assuming this isn't a problem, because set_current_state() on the line above
> is using smp_store_mb().
> 
> > 
> > > +			smp_acquire__after_ctrl_dep();
> > >  			raw_spin_unlock_irq(&sem->wait_lock);
> > >  			rwsem_set_reader_owned(sem);
> > >  			lockevent_inc(rwsem_rlock_fast);
> > > 
> > > 
> > > with litmus test in commit log:
> > > ----------------------------------- 8< ------------------------------------
> > > C rwsem
> > > 
> > > {
> > > 	atomic_t rwsem_count = ATOMIC_INIT(1);
> > > 	int vmacache_seqnum = 10;
> > > }
> > > 
> > > P0(int *vmacache_seqnum, atomic_t *rwsem_count)
> > > {
> > > 	r0 = READ_ONCE(*vmacache_seqnum);
> > > 	WRITE_ONCE(*vmacache_seqnum, r0 + 1);
> > > 	/* downgrade_write */
> > > 	r1 = atomic_fetch_add_release(-1+256, rwsem_count);
> > > }
> > > 
> > > P1(int *vmacache_seqnum, atomic_t *rwsem_count, spinlock_t *sem_wait_lock)
> > > {
> > > 	/* rwsem_read_trylock */
> > > 	r0 = atomic_add_return_acquire(256, rwsem_count);
> > > 	/* rwsem_down_read_slowpath */
> > > 	spin_lock(sem_wait_lock);
> > > 	r0 = atomic_read(rwsem_count);
> > > 	if ((r0 & 1) == 0) {
> > > 		// BUG: needs barrier
> > > 		spin_unlock(sem_wait_lock);
> > > 		r1 = READ_ONCE(*vmacache_seqnum);
> > > 	}
> > > }
> > > exists (1:r1=10)
> > > ----------------------------------- 8< ------------------------------------
> > 
> > Thanks for writing this! It's definitely worth sticking it in the commit
> > log, but Paul and Jade might also like to include it as part of their litmus
> > test repository too.

Thank you for forwarding this!  It may now be found at:

https://github.com/paulmckrcu/litmus/blob/master/manual/kernel/C-JanStancek-rwsem.litmus

							Thanx, Paul

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty
  2019-07-18 11:45                 ` Will Deacon
@ 2019-07-18 12:23                   ` Peter Zijlstra
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2019-07-18 12:23 UTC (permalink / raw)
  To: Will Deacon
  Cc: Jan Stancek, Waiman Long, linux-kernel, dbueso, mingo,
	jade.alglave, paulmck

On Thu, Jul 18, 2019 at 12:45:47PM +0100, Will Deacon wrote:
> On Thu, Jul 18, 2019 at 12:58:12PM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 18, 2019 at 10:26:41AM +0100, Will Deacon wrote:
> > 
> > > /*
> > >  * We need to ensure ACQUIRE semantics when reading sem->count so that
> > >  * we pair with the RELEASE store performed by an unlocking/downgrading
> > >  * writer.
> > >  *
> > >  * P0 (writer)			P1 (reader)
> > >  *
> > >  * down_write(sem);
> > >  * <write shared data>
> > >  * downgrade_write(sem);
> > >  * -> fetch_add_release(&sem->count)
> > >  *
> > >  *				down_read_slowpath(sem);
> > >  *				-> atomic_read(&sem->count)
> > >  *				   <ctrl dep>
> > >  *				   smp_acquire__after_ctrl_dep()
> > >  *				<read shared data>
> > >  */
> > 
> > So I'm thinking all this is excessive; the simple rule is: lock acquire
> > should imply ACQUIRE, we all know why.
> 
> Fair enough, I just thought this was worth highlighting because you can't
> rely on the wait_lock to give you ACQUIRE ordering.

Right, not in this case, because sem->count is not fully serialized by
it, whereas below the wait-queue is.

> > ---
> > diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> > index 37524a47f002..9eb630904a17 100644
> > --- a/kernel/locking/rwsem.c
> > +++ b/kernel/locking/rwsem.c
> > @@ -1000,6 +1000,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
> >  	atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
> >  	adjustment = 0;
> >  	if (rwsem_optimistic_spin(sem, false)) {
> > +		/* rwsem_optimistic_spin() implies ACQUIRE through rwsem_*trylock() */
> 
> I couldn't figure out if this was dependent on the return value or not,

I went with the fact that the only way to return true is if taken
becomes true; and that only happens through
rwsem_try_{read,write}_lock_unqueued(), and both imply ACQUIRE on
success.
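
Roughly, the success path looks like this (simplified sketch, not verbatim):

	/* rwsem_try_read_lock_unqueued(), approximate shape */
	long count = atomic_long_read(&sem->count);

	if (count & (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))
		return false;

	/* the ACQUIRE comes from the atomic that actually takes the lock */
	count = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS, &sem->count);
	if (!(count & (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
		rwsem_set_reader_owned(sem);
		return true;		/* success implies ACQUIRE */
	}

	/* back the reader bias out again on failure */
	atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
	return false;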

> and looking at osq_lock() I also couldn't see the ACQUIRE barrier when we're
> spinning on node->locked. Hmm.

Yes, osq is not a full lock and does not imply these barriers. This came
up somewhere, did we forget to write a comment on that? Lemme go look.

> >  		/*
> >  		 * Wake up other readers in the wait list if the front
> >  		 * waiter is a reader.
> > @@ -1014,6 +1015,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
> >  		}
> >  		return sem;
> >  	} else if (rwsem_reader_phase_trylock(sem, waiter.last_rowner)) {
> > +		/* rwsem_reader_phase_trylock() implies ACQUIRE */
> 
> Can we add "on success" to the end of this, please?

Good point.

> >  		return sem;
> >  	}
> >  
> > @@ -1032,6 +1034,8 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
> >  		 */
> >  		if (adjustment && !(atomic_long_read(&sem->count) &
> >  		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
> > +			/* Provide lock ACQUIRE */
> > +			smp_acquire__after_ctrl_dep();
> >  			raw_spin_unlock_irq(&sem->wait_lock);
> >  			rwsem_set_reader_owned(sem);
> >  			lockevent_inc(rwsem_rlock_fast);
> > @@ -1065,15 +1069,25 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
> >  	wake_up_q(&wake_q);
> >  
> >  	/* wait to be given the lock */
> > -	while (true) {
> > +	for (;;) {
> >  		set_current_state(state);
> > -		if (!waiter.task)
> > +		if (!smp_load_acquire(&waiter.task)) {
> > +			/*
> > +			 * Matches rwsem_mark_wake()'s smp_store_release() and ensures
> > +			 * we're ordered against its sem->count operations.
> > +			 */
> >  			break;
> > +		}
> 
> Ack. Also, grepping for 'waiter.task' reveals a similar usage in
> drivers/tty/tty_ldsem.c if you're feeling brave enough.

*sigh* of course, for every bug there needs to be a second copy
somewhere.

I'll go look there too. Thanks!



^ permalink raw reply	[flat|nested] 22+ messages in thread

* [tip:locking/core] locking/rwsem: Add missing ACQUIRE to read_slowpath exit when queue is empty
  2019-07-18  8:51               ` [PATCH v3] " Jan Stancek
@ 2019-07-25 16:00                 ` tip-bot for Jan Stancek
  0 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Jan Stancek @ 2019-07-25 16:00 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: longman, linux-kernel, mingo, hpa, will, peterz, jstancek, tglx,
	torvalds

Commit-ID:  e1b98fa316648420d0434d9ff5b92ad6609ba6c3
Gitweb:     https://git.kernel.org/tip/e1b98fa316648420d0434d9ff5b92ad6609ba6c3
Author:     Jan Stancek <jstancek@redhat.com>
AuthorDate: Thu, 18 Jul 2019 10:51:25 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 25 Jul 2019 15:39:23 +0200

locking/rwsem: Add missing ACQUIRE to read_slowpath exit when queue is empty

LTP mtest06 has been observed to occasionally hit "still mapped when
deleted" and following BUG_ON on arm64.

The extra mapcount originated from pagefault handler, which handled
pagefault for vma that has already been detached. vma is detached
under mmap_sem write lock by detach_vmas_to_be_unmapped(), which
also invalidates vmacache.

When the pagefault handler (under mmap_sem read lock) calls
find_vma(), vmacache_valid() wrongly reports vmacache as valid.

After rwsem down_read() returns via 'queue empty' path (as of v5.2),
it does so without an ACQUIRE on sem->count:

  down_read()
    __down_read()
      rwsem_down_read_failed()
        __rwsem_down_read_failed_common()
          raw_spin_lock_irq(&sem->wait_lock);
          if (list_empty(&sem->wait_list)) {
            if (atomic_long_read(&sem->count) >= 0) {
              raw_spin_unlock_irq(&sem->wait_lock);
              return sem;

The problem can be reproduced by running LTP mtest06 in a loop and
building the kernel (-j $NCPUS) in parallel. It reproduces since
v4.20 on arm64 HPE Apollo 70 (224 CPUs, 256GB RAM, 2 nodes). It
triggers reliably in about an hour.

The patched kernel ran fine for 10+ hours.

Signed-off-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Will Deacon <will@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dbueso@suse.de
Fixes: 4b486b535c33 ("locking/rwsem: Exit read lock slowpath if queue empty & no writer")
Link: https://lkml.kernel.org/r/50b8914e20d1d62bb2dee42d342836c2c16ebee7.1563438048.git.jstancek@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/rwsem.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index bc91aacaab58..d3ce7c6c42a6 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1036,6 +1036,8 @@ queue:
 		 */
 		if (adjustment && !(atomic_long_read(&sem->count) &
 		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
+			/* Provide lock ACQUIRE */
+			smp_acquire__after_ctrl_dep();
 			raw_spin_unlock_irq(&sem->wait_lock);
 			rwsem_set_reader_owned(sem);
 			lockevent_inc(rwsem_rlock_fast);

^ permalink raw reply related	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2019-07-25 16:00 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-16 16:04 [PATCH] locking/rwsem: use read_acquire in read_slowpath exit when queue is empty Jan Stancek
2019-07-16 16:53 ` Waiman Long
2019-07-16 18:34   ` Jan Stancek
2019-07-16 18:58   ` Peter Zijlstra
2019-07-16 19:09     ` Waiman Long
2019-07-17 12:02     ` [PATCH v2] locking/rwsem: add acquire barrier to " Jan Stancek
2019-07-17 13:13       ` Will Deacon
2019-07-17 14:19         ` Waiman Long
2019-07-17 19:22           ` Jan Stancek
2019-07-17 19:39             ` Waiman Long
2019-07-18  8:51               ` [PATCH v3] " Jan Stancek
2019-07-25 16:00                 ` [tip:locking/core] locking/rwsem: Add missing ACQUIRE " tip-bot for Jan Stancek
2019-07-18  9:26             ` [PATCH v2] locking/rwsem: add acquire barrier " Will Deacon
2019-07-18 10:50               ` Jan Stancek
2019-07-18 11:04                 ` Peter Zijlstra
2019-07-18 11:09                   ` Peter Zijlstra
2019-07-18 11:36                     ` Jan Stancek
2019-07-18 12:12                 ` Paul E. McKenney
2019-07-18 10:58               ` Peter Zijlstra
2019-07-18 11:45                 ` Will Deacon
2019-07-18 12:23                   ` Peter Zijlstra
2019-07-17 15:33       ` Waiman Long
