[v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty

Message ID a524cf95ab0dbdd1eb65e9decb9283e73d416b1d.1563352912.git.jstancek@redhat.com
State Superseded
Series
  • [v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty

Commit Message

Jan Stancek July 17, 2019, 12:02 p.m. UTC
LTP mtest06 has been observed to rarely hit "still mapped when deleted"
and the following BUG_ON on arm64:
  page:ffff7e02fa37e480 refcount:3 mapcount:1 mapping:ffff80be3d678ab0 index:0x0
  xfs_address_space_operations [xfs]
  flags: 0xbfffe000000037(locked|referenced|uptodate|lru|active)
  page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
  ------------[ cut here ]------------
  kernel BUG at mm/filemap.c:171!
  Internal error: Oops - BUG: 0 [#1] SMP
  CPU: 220 PID: 154292 Comm: mmap1 Not tainted 5.2.0-0ecfebd.cki #1
  Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.10 05/17/2019
  pstate: 40400089 (nZcv daIf +PAN -UAO)
  pc : unaccount_page_cache_page+0x17c/0x1a0
  lr : unaccount_page_cache_page+0x17c/0x1a0
  Call trace:
  unaccount_page_cache_page+0x17c/0x1a0
  delete_from_page_cache_batch+0xa0/0x300
  truncate_inode_pages_range+0x1b8/0x640
  truncate_inode_pages_final+0x88/0xa8
  evict+0x1a0/0x1d8
  iput+0x150/0x240
  dentry_unlink_inode+0x120/0x130
  __dentry_kill+0xd8/0x1d0
  dentry_kill+0x88/0x248
  dput+0x168/0x1b8
  __fput+0xe8/0x208
  ____fput+0x20/0x30
  task_work_run+0xc0/0xf0
  do_notify_resume+0x2b0/0x328
  work_pending+0x8/0x10

The extra mapcount originated from the page fault handler, which handled
a fault for a vma that had already been detached. The vma is detached
under the mmap_sem write lock by detach_vmas_to_be_unmapped(), which
also invalidates the vmacache.

When the page fault handler (under the mmap_sem read lock) called find_vma(),
vmacache_valid() wrongly reported the vmacache as valid.

When rwsem down_read() returns via the 'queue empty' path (as of v5.2),
it does so without issuing a read_acquire on sem->count:
  down_read
    __down_read
      rwsem_down_read_failed
        __rwsem_down_read_failed_common
          raw_spin_lock_irq(&sem->wait_lock);
          if (list_empty(&sem->wait_list)) {
            if (atomic_long_read(&sem->count) >= 0) {
              raw_spin_unlock_irq(&sem->wait_lock);
              return sem;

The suspected problem is that the last *_acquire on the down_read() side
happens before the write side issues its *_release:
  1. writer: has the lock
  2. reader: down_read() issues *read_acquire on entry
  3. writer: mm->vmacache_seqnum++; downgrades lock (*fetch_add_release)
  4. reader: __rwsem_down_read_failed_common() finds it can take lock and returns
  5. reader: observes stale mm->vmacache_seqnum

I can reproduce the problem by running LTP mtest06 in a loop while building
a kernel (-j $NCPUS) in parallel. It reproduces on v4.20 through v5.2
on an arm64 HPE Apollo 70 (224 CPUs, 256GB RAM, 2 nodes), triggering reliably
within about an hour. A patched kernel ran fine for 10+ hours with a clean
dmesg. Tests were done against v5.2, since commit cf69482d62d9 ("locking/rwsem:
Enable readers spinning on writer") makes the problem much harder to reproduce.

v2: Move barrier after test (Waiman Long)
    Use smp_acquire__after_ctrl_dep() (Peter Zijlstra)

Related: https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c
Related: commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
Fixes: 4b486b535c33 ("locking/rwsem: Exit read lock slowpath if queue empty & no writer")

Signed-off-by: Jan Stancek <jstancek@redhat.com>
Cc: stable@vger.kernel.org # v4.20+
Cc: Waiman Long <longman@redhat.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
---
 kernel/locking/rwsem.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Will Deacon July 17, 2019, 1:13 p.m. UTC | #1
On Wed, Jul 17, 2019 at 02:02:20PM +0200, Jan Stancek wrote:
> [... commit message and tags trimmed, quoted in full above ...]
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index 37524a47f002..5ac72b60608b 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -1032,6 +1032,7 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
>  		 */
>  		if (adjustment && !(atomic_long_read(&sem->count) &
>  		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
> +			smp_acquire__after_ctrl_dep();
>  			raw_spin_unlock_irq(&sem->wait_lock);
>  			rwsem_set_reader_owned(sem);
>  			lockevent_inc(rwsem_rlock_fast);

If you add a comment to the code outlining the issue (preferably as a litmus
test involving sem->count and some shared data, which happens to be
vmacache_seqnum in your test), then:

Reviewed-by: Will Deacon <will@kernel.org>

Thanks,

Will
Waiman Long July 17, 2019, 2:19 p.m. UTC | #2
On 7/17/19 9:13 AM, Will Deacon wrote:
> On Wed, Jul 17, 2019 at 02:02:20PM +0200, Jan Stancek wrote:
>> [... full patch trimmed, quoted above ...]
> If you add a comment to the code outlining the issue (preferably as a litmus
> test involving sem->count and some shared data which happens to be
> vmacache_seqnum in your test)), then:
>
> Reviewed-by: Will Deacon <will@kernel.org>
>
> Thanks,
>
> Will

Agreed. A comment just above smp_acquire__after_ctrl_dep() on why this
is needed will be great.

Other than that,

Acked-by: Waiman Long <longman@redhat.com>
Waiman Long July 17, 2019, 3:33 p.m. UTC | #3
On 7/17/19 8:02 AM, Jan Stancek wrote:
> [... full patch trimmed, quoted above ...]

The corresponding change for 5.2 or earlier kernels is:

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index fbe96341beee..2fbbb2d46396 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -246,6 +246,7 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
                 * been set in the count.
                 */
                if (atomic_long_read(&sem->count) >= 0) {
+                       smp_acquire__after_ctrl_dep();
                        raw_spin_unlock_irq(&sem->wait_lock);
                        return sem;
                }

-Longman
Jan Stancek July 17, 2019, 7:22 p.m. UTC | #4
On Wed, Jul 17, 2019 at 10:19:04AM -0400, Waiman Long wrote:
>> If you add a comment to the code outlining the issue (preferably as a litmus
>> test involving sem->count and some shared data which happens to be
>> vmacache_seqnum in your test)), then:
>>
>> Reviewed-by: Will Deacon <will@kernel.org>
>>
>> Thanks,
>>
>> Will
>
>Agreed. A comment just above smp_acquire__after_ctrl_dep() on why this
>is needed will be great.
>
>Other than that,
>
>Acked-by: Waiman Long <longman@redhat.com>
>

litmus test looks a bit long, would following be acceptable?

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 37524a47f002..d9c96651bfc7 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1032,6 +1032,13 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
  		 */
  		if (adjustment && !(atomic_long_read(&sem->count) &
  		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
+			/*
+			 * down_read() issued ACQUIRE on enter, but we can race
+			 * with writer who did RELEASE only after us.
+			 * ACQUIRE here makes sure reader operations happen only
+			 * after all writer ones.
+			 */
+			smp_acquire__after_ctrl_dep();
  			raw_spin_unlock_irq(&sem->wait_lock);
  			rwsem_set_reader_owned(sem);
  			lockevent_inc(rwsem_rlock_fast);


with litmus test in commit log:
----------------------------------- 8< ------------------------------------
C rwsem

{
	atomic_t rwsem_count = ATOMIC_INIT(1);
	int vmacache_seqnum = 10;
}

P0(int *vmacache_seqnum, atomic_t *rwsem_count)
{
	r0 = READ_ONCE(*vmacache_seqnum);
	WRITE_ONCE(*vmacache_seqnum, r0 + 1);
	/* downgrade_write */
	r1 = atomic_fetch_add_release(-1+256, rwsem_count);
}

P1(int *vmacache_seqnum, atomic_t *rwsem_count, spinlock_t *sem_wait_lock)
{
	/* rwsem_read_trylock */
	r0 = atomic_add_return_acquire(256, rwsem_count);
	/* rwsem_down_read_slowpath */
	spin_lock(sem_wait_lock);
	r0 = atomic_read(rwsem_count);
	if ((r0 & 1) == 0) {
		// BUG: needs barrier
		spin_unlock(sem_wait_lock);
		r1 = READ_ONCE(*vmacache_seqnum);
	}
}
exists (1:r1=10)
----------------------------------- 8< ------------------------------------

Thanks,
Jan
Waiman Long July 17, 2019, 7:39 p.m. UTC | #5
On 7/17/19 3:22 PM, Jan Stancek wrote:
> On Wed, Jul 17, 2019 at 10:19:04AM -0400, Waiman Long wrote:
>>> [...]
>
> litmus test looks a bit long, would following be acceptable?
>
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index 37524a47f002..d9c96651bfc7 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -1032,6 +1032,13 @@ static inline bool
> rwsem_reader_phase_trylock(struct rw_semaphore *sem,
>           */
>          if (adjustment && !(atomic_long_read(&sem->count) &
>               (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
> +            /*
> +             * down_read() issued ACQUIRE on enter, but we can race
> +             * with writer who did RELEASE only after us.
> +             * ACQUIRE here makes sure reader operations happen only
> +             * after all writer ones.
> +             */


How about this instead?

                /*
                 * Add an acquire barrier here to make sure no stale data
                 * acquired before the above test where the writer may still
                 * be holding the lock will be reused in the reader critical
                 * section.
                 */

Thanks,
Longman
Will Deacon July 18, 2019, 9:26 a.m. UTC | #6
Hi Jan, Waiman, [+Jade and Paul for the litmus test at the end]

On Wed, Jul 17, 2019 at 09:22:00PM +0200, Jan Stancek wrote:
> > > [...]
> 
> litmus test looks a bit long, would following be acceptable?
> 
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index 37524a47f002..d9c96651bfc7 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -1032,6 +1032,13 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
>  		 */
>  		if (adjustment && !(atomic_long_read(&sem->count) &
>  		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
> +			/*
> +			 * down_read() issued ACQUIRE on enter, but we can race
> +			 * with writer who did RELEASE only after us.
> +			 * ACQUIRE here makes sure reader operations happen only
> +			 * after all writer ones.
> +			 */

How about an abridged form of the litmus test here, just to show the code
flow? e.g.:

/*
 * We need to ensure ACQUIRE semantics when reading sem->count so that
 * we pair with the RELEASE store performed by an unlocking/downgrading
 * writer.
 *
 * P0 (writer)			P1 (reader)
 *
 * down_write(sem);
 * <write shared data>
 * downgrade_write(sem);
 * -> fetch_add_release(&sem->count)
 *
 *				down_read_slowpath(sem);
 *				-> atomic_read(&sem->count)
 *				   <ctrl dep>
 *				   smp_acquire__after_ctrl_dep()
 *				<read shared data>
 */

In writing this, I also noticed that we don't have any explicit ordering
at the end of the reader slowpath when we wait on the queue but get woken
immediately:

	if (!waiter.task)
		break;

Am I missing something?

> +			smp_acquire__after_ctrl_dep();
>  			raw_spin_unlock_irq(&sem->wait_lock);
>  			rwsem_set_reader_owned(sem);
>  			lockevent_inc(rwsem_rlock_fast);
> 
> 
> with litmus test in commit log:
> [... litmus test trimmed, quoted in full above ...]

Thanks for writing this! It's definitely worth sticking it in the commit
log, but Paul and Jade might also like to include it as part of their litmus
test repository too.

Will
Jan Stancek July 18, 2019, 10:50 a.m. UTC | #7
----- Original Message -----
> Hi Jan, Waiman, [+Jade and Paul for the litmus test at the end]
> 
> On Wed, Jul 17, 2019 at 09:22:00PM +0200, Jan Stancek wrote:
> > [...]
> 
> How about an abridged form of the litmus test here, just to show the cod
> flow? e.g.:
> 
> /*
>  * We need to ensure ACQUIRE semantics when reading sem->count so that
>  * we pair with the RELEASE store performed by an unlocking/downgrading
>  * writer.
>  *
>  * P0 (writer)			P1 (reader)
>  *
>  * down_write(sem);
>  * <write shared data>
>  * downgrade_write(sem);
>  * -> fetch_add_release(&sem->count)
>  *
>  *				down_read_slowpath(sem);
>  *				-> atomic_read(&sem->count)
>  *				   <ctrl dep>
>  *				   smp_acquire__after_ctrl_dep()
>  *				<read shared data>
>  */

Works for me. The code is at 3 levels of indentation, but I can try
to squeeze it in for v4.

> 
> In writing this, I also noticed that we don't have any explicit ordering
> at the end of the reader slowpath when we wait on the queue but get woken
> immediately:
> 
> 	if (!waiter.task)
> 		break;
> 
> Am I missing something?

I'm assuming this isn't a problem, because the set_current_state() on the
line above uses smp_store_mb().

> [...]
Peter Zijlstra July 18, 2019, 10:58 a.m. UTC | #8
On Thu, Jul 18, 2019 at 10:26:41AM +0100, Will Deacon wrote:

> /*
>  * We need to ensure ACQUIRE semantics when reading sem->count so that
>  * we pair with the RELEASE store performed by an unlocking/downgrading
>  * writer.
>  *
>  * P0 (writer)			P1 (reader)
>  *
>  * down_write(sem);
>  * <write shared data>
>  * downgrade_write(sem);
>  * -> fetch_add_release(&sem->count)
>  *
>  *				down_read_slowpath(sem);
>  *				-> atomic_read(&sem->count)
>  *				   <ctrl dep>
>  *				   smp_acquire__after_ctrl_dep()
>  *				<read shared data>
>  */

So I'm thinking all this is excessive; the simple rule is: lock acquire
should imply ACQUIRE, we all know why.

> In writing this, I also noticed that we don't have any explicit ordering
> at the end of the reader slowpath when we wait on the queue but get woken
> immediately:
> 
> 	if (!waiter.task)
> 		break;
> 
> Am I missing something?

Ha! I ran into the very same one. I keep confusing myself, but I think
you're right and that needs to be smp_load_acquire() to match the
smp_store_release() in rwsem_mark_wake().

(the actual race there is _tiny_ due to the smp_mb() right before it,
but I cannot convince myself that is indeed sufficient)

The signal_pending_state() case is also fun, but I think wait_lock there
is sufficient (even under RCpc).

I've ended up with this..

---
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 37524a47f002..9eb630904a17 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1000,6 +1000,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 	atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
 	adjustment = 0;
 	if (rwsem_optimistic_spin(sem, false)) {
+		/* rwsem_optimistic_spin() implies ACQUIRE through rwsem_*trylock() */
 		/*
 		 * Wake up other readers in the wait list if the front
 		 * waiter is a reader.
@@ -1014,6 +1015,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 		}
 		return sem;
 	} else if (rwsem_reader_phase_trylock(sem, waiter.last_rowner)) {
+		/* rwsem_reader_phase_trylock() implies ACQUIRE */
 		return sem;
 	}
 
@@ -1032,6 +1034,8 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 		 */
 		if (adjustment && !(atomic_long_read(&sem->count) &
 		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
+			/* Provide lock ACQUIRE */
+			smp_acquire__after_ctrl_dep();
 			raw_spin_unlock_irq(&sem->wait_lock);
 			rwsem_set_reader_owned(sem);
 			lockevent_inc(rwsem_rlock_fast);
@@ -1065,15 +1069,25 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 	wake_up_q(&wake_q);
 
 	/* wait to be given the lock */
-	while (true) {
+	for (;;) {
 		set_current_state(state);
-		if (!waiter.task)
+		if (!smp_load_acquire(&waiter.task)) {
+			/*
+			 * Matches rwsem_mark_wake()'s smp_store_release() and ensures
+			 * we're ordered against its sem->count operations.
+			 */
 			break;
+		}
 		if (signal_pending_state(state, current)) {
 			raw_spin_lock_irq(&sem->wait_lock);
 			if (waiter.task)
 				goto out_nolock;
 			raw_spin_unlock_irq(&sem->wait_lock);
+			/*
+			 * Ordered by sem->wait_lock against rwsem_mark_wake(), if we
+			 * see its waiter.task store, we must also see its sem->count
+			 * changes.
+			 */
 			break;
 		}
 		schedule();
@@ -1083,6 +1097,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 	__set_current_state(TASK_RUNNING);
 	lockevent_inc(rwsem_rlock);
 	return sem;
+
 out_nolock:
 	list_del(&waiter.list);
 	if (list_empty(&sem->wait_list)) {
Peter Zijlstra July 18, 2019, 11:04 a.m. UTC | #9
On Thu, Jul 18, 2019 at 06:50:52AM -0400, Jan Stancek wrote:
> > In writing this, I also noticed that we don't have any explicit ordering
> > at the end of the reader slowpath when we wait on the queue but get woken
> > immediately:
> > 
> > 	if (!waiter.task)
> > 		break;
> > 
> > Am I missing something?
> 
> I'm assuming this isn't a problem, because set_current_state() on the line
> above is using smp_store_mb().


X = 0;

								X = 1;
	rwsem_down_read()					rwsem_up_write();

	  for (;;) {
	    set_current_state(TASK_UNINTERRUPTIBLE);

								  rwsem_mark_wake()
								    atomic_long_add(adjustment, &sem->count);
								    smp_store_release(&waiter->task, NULL);

	    if (!waiter.task)
	      break;

	    ...
	  }


	r = X;


This can, I think, result in r==0 just fine, because there's nothing
ordering the load of waiter.task with the store of X.

It is exceedingly unlikely, but not impossible afaict.
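The interleaving above can also be written as a litmus test in the style Jan
uses later in this thread (a sketch; `waiter_task` stands in for waiter.task,
with set_current_state() elided):

----------------------------------- 8< ------------------------------------
C rwsem-waiter-task

{
	int waiter_task = 1;
	int X = 0;
}

P0(int *X, int *waiter_task)
{
	/* rwsem_up_write() -> rwsem_mark_wake() */
	WRITE_ONCE(*X, 1);
	smp_store_release(waiter_task, 0);
}

P1(int *X, int *waiter_task)
{
	/* reader slowpath: if (!waiter.task) break; followed by r = X */
	r0 = READ_ONCE(*waiter_task);
	if (r0 == 0) {
		r1 = READ_ONCE(*X);
	}
}

exists (1:r0=0 /\ 1:r1=0)
----------------------------------- 8< ------------------------------------

As written, the exists clause should be reachable under the LKMM, since a
control dependency orders a load only against later stores, not later loads;
replacing the READ_ONCE() in P1 with smp_load_acquire() forbids it, which is
the fix applied above.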
Peter Zijlstra July 18, 2019, 11:09 a.m. UTC | #10
It's simpler like so:

On Thu, Jul 18, 2019 at 01:04:46PM +0200, Peter Zijlstra wrote:
> X = 0;
> 
> 	rwsem_down_read()
> 	  for (;;) {
> 	    set_current_state(TASK_UNINTERRUPTIBLE);
>
>								X = 1;
>                                                               rwsem_up_write();
> 								  rwsem_mark_wake()
> 								    atomic_long_add(adjustment, &sem->count);
> 								    smp_store_release(&waiter->task, NULL);
> 
> 	    if (!waiter.task)
> 	      break;
> 
> 	    ...
> 	  }
> 
> 	r = X;
>
Jan Stancek July 18, 2019, 11:36 a.m. UTC | #11
----- Original Message -----
> 
> It's simpler like so:
> 
> On Thu, Jul 18, 2019 at 01:04:46PM +0200, Peter Zijlstra wrote:
> > X = 0;
> > 
> > 	rwsem_down_read()
> > 	  for (;;) {
> > 	    set_current_state(TASK_UNINTERRUPTIBLE);
> >
> >								X = 1;
> >                                                               rwsem_up_write();
> > 								  rwsem_mark_wake()
> > 								    atomic_long_add(adjustment, &sem->count);
> > 								    smp_store_release(&waiter->task, NULL);
> > 
> > 	    if (!waiter.task)
> > 	      break;
> > 
> > 	    ...
> > 	  }
> > 
> > 	r = X;
> > 

I see - it looks possible. Thank you for this example.
Will Deacon July 18, 2019, 11:45 a.m. UTC | #12
On Thu, Jul 18, 2019 at 12:58:12PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 18, 2019 at 10:26:41AM +0100, Will Deacon wrote:
> 
> > /*
> >  * We need to ensure ACQUIRE semantics when reading sem->count so that
> >  * we pair with the RELEASE store performed by an unlocking/downgrading
> >  * writer.
> >  *
> >  * P0 (writer)			P1 (reader)
> >  *
> >  * down_write(sem);
> >  * <write shared data>
> >  * downgrade_write(sem);
> >  * -> fetch_add_release(&sem->count)
> >  *
> >  *				down_read_slowpath(sem);
> >  *				-> atomic_read(&sem->count)
> >  *				   <ctrl dep>
> >  *				   smp_acquire__after_ctrl_dep()
> >  *				<read shared data>
> >  */
> 
> So I'm thinking all this is excessive; the simple rule is: lock acquire
> should imply ACQUIRE, we all know why.

Fair enough, I just thought this was worth highlighting because you can't
rely on the wait_lock to give you ACQUIRE ordering.

> > In writing this, I also noticed that we don't have any explicit ordering
> > at the end of the reader slowpath when we wait on the queue but get woken
> > immediately:
> > 
> > 	if (!waiter.task)
> > 		break;
> > 
> > Am I missing something?
> 
> Ha!, I ran into the very same one. I keep confusing myself, but I think
> you're right and that needs to be smp_load_acquire() to match the
> smp_store_release() in rwsem_mark_wake().
> 
> (the actual race there is _tiny_ due to the smp_mb() right before it,
> but I cannot convince myself that is indeed sufficient)
> 
> The signal_pending_state() case is also fun, but I think wait_lock there
> is sufficient (even under RCpc).
> 
> I've ended up with this..
> 
> ---
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index 37524a47f002..9eb630904a17 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -1000,6 +1000,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
>  	atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
>  	adjustment = 0;
>  	if (rwsem_optimistic_spin(sem, false)) {
> +		/* rwsem_optimistic_spin() implies ACQUIRE through rwsem_*trylock() */

I couldn't figure out if this was dependent on the return value or not,
and looking at osq_lock() I also couldn't see the ACQUIRE barrier when we're
spinning on node->locked. Hmm.

>  		/*
>  		 * Wake up other readers in the wait list if the front
>  		 * waiter is a reader.
> @@ -1014,6 +1015,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
>  		}
>  		return sem;
>  	} else if (rwsem_reader_phase_trylock(sem, waiter.last_rowner)) {
> +		/* rwsem_reader_phase_trylock() implies ACQUIRE */

Can we add "on success" to the end of this, please?

>  		return sem;
>  	}
>  
> @@ -1032,6 +1034,8 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
>  		 */
>  		if (adjustment && !(atomic_long_read(&sem->count) &
>  		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
> +			/* Provide lock ACQUIRE */
> +			smp_acquire__after_ctrl_dep();
>  			raw_spin_unlock_irq(&sem->wait_lock);
>  			rwsem_set_reader_owned(sem);
>  			lockevent_inc(rwsem_rlock_fast);
> @@ -1065,15 +1069,25 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
>  	wake_up_q(&wake_q);
>  
>  	/* wait to be given the lock */
> -	while (true) {
> +	for (;;) {
>  		set_current_state(state);
> -		if (!waiter.task)
> +		if (!smp_load_acquire(&waiter.task)) {
> +			/*
> +			 * Matches rwsem_mark_wake()'s smp_store_release() and ensures
> +			 * we're ordered against its sem->count operations.
> +			 */
>  			break;
> +		}

Ack. Also, grepping for 'waiter.task' reveals a similar usage in
drivers/tty/tty_ldsem.c if you're feeling brave enough.

Will
Paul E. McKenney July 18, 2019, 12:12 p.m. UTC | #13
On Thu, Jul 18, 2019 at 06:50:52AM -0400, Jan Stancek wrote:
> 
> ----- Original Message -----
> > Hi Jan, Waiman, [+Jade and Paul for the litmus test at the end]
> > 
> > On Wed, Jul 17, 2019 at 09:22:00PM +0200, Jan Stancek wrote:
> > > On Wed, Jul 17, 2019 at 10:19:04AM -0400, Waiman Long wrote:
> > > > > If you add a comment to the code outlining the issue (preferably as a
> > > > > litmus
> > > > > test involving sem->count and some shared data which happens to be
> > > > > vmacache_seqnum in your test)), then:
> > > > > 
> > > > > Reviewed-by: Will Deacon <will@kernel.org>
> > > > > 
> > > > > Thanks,
> > > > > 
> > > > > Will
> > > > 
> > > > Agreed. A comment just above smp_acquire__after_ctrl_dep() on why this
> > > > is needed will be great.
> > > > 
> > > > Other than that,
> > > > 
> > > > Acked-by: Waiman Long <longman@redhat.com>
> > > > 
> > > 
> > > litmus test looks a bit long, would following be acceptable?
> > > 
> > > diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> > > index 37524a47f002..d9c96651bfc7 100644
> > > --- a/kernel/locking/rwsem.c
> > > +++ b/kernel/locking/rwsem.c
> > > @@ -1032,6 +1032,13 @@ static inline bool rwsem_reader_phase_trylock(struct
> > > rw_semaphore *sem,
> > >  		 */
> > >  		if (adjustment && !(atomic_long_read(&sem->count) &
> > >  		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
> > > +			/*
> > > +			 * down_read() issued ACQUIRE on enter, but we can race
> > > +			 * with writer who did RELEASE only after us.
> > > +			 * ACQUIRE here makes sure reader operations happen only
> > > +			 * after all writer ones.
> > > +			 */
> > 
> > How about an abridged form of the litmus test here, just to show the code
> > flow? e.g.:
> > 
> > /*
> >  * We need to ensure ACQUIRE semantics when reading sem->count so that
> >  * we pair with the RELEASE store performed by an unlocking/downgrading
> >  * writer.
> >  *
> >  * P0 (writer)			P1 (reader)
> >  *
> >  * down_write(sem);
> >  * <write shared data>
> >  * downgrade_write(sem);
> >  * -> fetch_add_release(&sem->count)
> >  *
> >  *				down_read_slowpath(sem);
> >  *				-> atomic_read(&sem->count)
> >  *				   <ctrl dep>
> >  *				   smp_acquire__after_ctrl_dep()
> >  *				<read shared data>
> >  */
> 
> Works for me. The code is at 3 level of indentation, but I can try
> to squeeze it in for v4.
> 
> > 
> > In writing this, I also noticed that we don't have any explicit ordering
> > at the end of the reader slowpath when we wait on the queue but get woken
> > immediately:
> > 
> > 	if (!waiter.task)
> > 		break;
> > 
> > Am I missing something?
> 
> I'm assuming this isn't a problem, because set_current_state() on the line
> above is using smp_store_mb().
> 
> > 
> > > +			smp_acquire__after_ctrl_dep();
> > >  			raw_spin_unlock_irq(&sem->wait_lock);
> > >  			rwsem_set_reader_owned(sem);
> > >  			lockevent_inc(rwsem_rlock_fast);
> > > 
> > > 
> > > with litmus test in commit log:
> > > ----------------------------------- 8< ------------------------------------
> > > C rwsem
> > > 
> > > {
> > > 	atomic_t rwsem_count = ATOMIC_INIT(1);
> > > 	int vmacache_seqnum = 10;
> > > }
> > > 
> > > P0(int *vmacache_seqnum, atomic_t *rwsem_count)
> > > {
> > > 	r0 = READ_ONCE(*vmacache_seqnum);
> > > 	WRITE_ONCE(*vmacache_seqnum, r0 + 1);
> > > 	/* downgrade_write */
> > > 	r1 = atomic_fetch_add_release(-1+256, rwsem_count);
> > > }
> > > 
> > > P1(int *vmacache_seqnum, atomic_t *rwsem_count, spinlock_t *sem_wait_lock)
> > > {
> > > 	/* rwsem_read_trylock */
> > > 	r0 = atomic_add_return_acquire(256, rwsem_count);
> > > 	/* rwsem_down_read_slowpath */
> > > 	spin_lock(sem_wait_lock);
> > > 	r0 = atomic_read(rwsem_count);
> > > 	if ((r0 & 1) == 0) {
> > > 		// BUG: needs barrier
> > > 		spin_unlock(sem_wait_lock);
> > > 		r1 = READ_ONCE(*vmacache_seqnum);
> > > 	}
> > > }
> > > exists (1:r1=10)
> > > ----------------------------------- 8< ------------------------------------
> > 
> > Thanks for writing this! It's definitely worth sticking it in the commit
> > log, but Paul and Jade might also like to include it as part of their litmus
> > test repository too.

Thank you for forwarding this!  It may now be found at:

https://github.com/paulmckrcu/litmus/blob/master/manual/kernel/C-JanStancek-rwsem.litmus

							Thanx, Paul
Peter Zijlstra July 18, 2019, 12:23 p.m. UTC | #14
On Thu, Jul 18, 2019 at 12:45:47PM +0100, Will Deacon wrote:
> On Thu, Jul 18, 2019 at 12:58:12PM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 18, 2019 at 10:26:41AM +0100, Will Deacon wrote:
> > 
> > > /*
> > >  * We need to ensure ACQUIRE semantics when reading sem->count so that
> > >  * we pair with the RELEASE store performed by an unlocking/downgrading
> > >  * writer.
> > >  *
> > >  * P0 (writer)			P1 (reader)
> > >  *
> > >  * down_write(sem);
> > >  * <write shared data>
> > >  * downgrade_write(sem);
> > >  * -> fetch_add_release(&sem->count)
> > >  *
> > >  *				down_read_slowpath(sem);
> > >  *				-> atomic_read(&sem->count)
> > >  *				   <ctrl dep>
> > >  *				   smp_acquire__after_ctrl_dep()
> > >  *				<read shared data>
> > >  */
> > 
> > So I'm thinking all this is excessive; the simple rule is: lock acquire
> > should imply ACQUIRE, we all know why.
> 
> Fair enough, I just thought this was worth highlighting because you can't
> rely on the wait_lock to give you ACQUIRE ordering.

Right, not in this case, because sem->count is not fully serialized by
it, whereas below the wait-queue is.

> > ---
> > diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> > index 37524a47f002..9eb630904a17 100644
> > --- a/kernel/locking/rwsem.c
> > +++ b/kernel/locking/rwsem.c
> > @@ -1000,6 +1000,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
> >  	atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
> >  	adjustment = 0;
> >  	if (rwsem_optimistic_spin(sem, false)) {
> > +		/* rwsem_optimistic_spin() implies ACQUIRE through rwsem_*trylock() */
> 
> I couldn't figure out if this was dependent on the return value or not,

I went with the fact that the only way to return true is if taken
becomes true; and that only happens through
rwsem_try_{read,write}_lock_unqueued(), and both imply ACQUIRE on
success.

> and looking at osq_lock() I also couldn't see the ACQUIRE barrier when we're
> spinning on node->locked. Hmm.

Yes, osq is not a full lock and does not imply these barriers. This came
up somewhere, did we forget to write a comment on that? Lemme go look.

> >  		/*
> >  		 * Wake up other readers in the wait list if the front
> >  		 * waiter is a reader.
> > @@ -1014,6 +1015,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
> >  		}
> >  		return sem;
> >  	} else if (rwsem_reader_phase_trylock(sem, waiter.last_rowner)) {
> > +		/* rwsem_reader_phase_trylock() implies ACQUIRE */
> 
> Can we add "on success" to the end of this, please?

Good point.

> >  		return sem;
> >  	}
> >  
> > @@ -1032,6 +1034,8 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
> >  		 */
> >  		if (adjustment && !(atomic_long_read(&sem->count) &
> >  		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
> > +			/* Provide lock ACQUIRE */
> > +			smp_acquire__after_ctrl_dep();
> >  			raw_spin_unlock_irq(&sem->wait_lock);
> >  			rwsem_set_reader_owned(sem);
> >  			lockevent_inc(rwsem_rlock_fast);
> > @@ -1065,15 +1069,25 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
> >  	wake_up_q(&wake_q);
> >  
> >  	/* wait to be given the lock */
> > -	while (true) {
> > +	for (;;) {
> >  		set_current_state(state);
> > -		if (!waiter.task)
> > +		if (!smp_load_acquire(&waiter.task)) {
> > +			/*
> > +			 * Matches rwsem_mark_wake()'s smp_store_release() and ensures
> > +			 * we're ordered against its sem->count operations.
> > +			 */
> >  			break;
> > +		}
> 
> Ack. Also, grepping for 'waiter.task' reveals a similar usage in
> drivers/tty/tty_ldsem.c if you're feeling brave enough.

*sigh* of course, for every bug there needs to be a second copy
somewhere.

I'll go look there too. Thanks!

Patch

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 37524a47f002..5ac72b60608b 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1032,6 +1032,7 @@  static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
 		 */
 		if (adjustment && !(atomic_long_read(&sem->count) &
 		     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
+			smp_acquire__after_ctrl_dep();
 			raw_spin_unlock_irq(&sem->wait_lock);
 			rwsem_set_reader_owned(sem);
 			lockevent_inc(rwsem_rlock_fast);