* [PATCH] locking/rwsem: use read_acquire in read_slowpath exit when queue is empty

From: Jan Stancek @ 2019-07-16 16:04 UTC
To: linux-kernel; +Cc: longman, dbueso, will, peterz, mingo, jstancek

LTP mtest06 has been observed to rarely hit "still mapped when deleted"
and the following BUG_ON on arm64:

  page:ffff7e02fa37e480 refcount:3 mapcount:1 mapping:ffff80be3d678ab0 index:0x0
  xfs_address_space_operations [xfs]
  flags: 0xbfffe000000037(locked|referenced|uptodate|lru|active)
  page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
  ------------[ cut here ]------------
  kernel BUG at mm/filemap.c:171!
  Internal error: Oops - BUG: 0 [#1] SMP
  CPU: 220 PID: 154292 Comm: mmap1 Not tainted 5.2.0-0ecfebd.cki #1
  Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.10 05/17/2019
  pstate: 40400089 (nZcv daIf +PAN -UAO)
  pc : unaccount_page_cache_page+0x17c/0x1a0
  lr : unaccount_page_cache_page+0x17c/0x1a0
  Call trace:
   unaccount_page_cache_page+0x17c/0x1a0
   delete_from_page_cache_batch+0xa0/0x300
   truncate_inode_pages_range+0x1b8/0x640
   truncate_inode_pages_final+0x88/0xa8
   evict+0x1a0/0x1d8
   iput+0x150/0x240
   dentry_unlink_inode+0x120/0x130
   __dentry_kill+0xd8/0x1d0
   dentry_kill+0x88/0x248
   dput+0x168/0x1b8
   __fput+0xe8/0x208
   ____fput+0x20/0x30
   task_work_run+0xc0/0xf0
   do_notify_resume+0x2b0/0x328
   work_pending+0x8/0x10

The extra mapcount originated in the pagefault handler, which handled a
pagefault for a vma that had already been detached. The vma is detached
under the mmap_sem write lock by detach_vmas_to_be_unmapped(), which
also invalidates the vmacache.

When the pagefault handler (under mmap_sem read lock) called find_vma(),
vmacache_valid() wrongly reported the vmacache as valid.

After rwsem down_read() returns via the 'queue empty' path (as of v5.2),
it does so without issuing a read_acquire on sem->count:

  down_read
    __down_read
      rwsem_down_read_failed
        __rwsem_down_read_failed_common
          raw_spin_lock_irq(&sem->wait_lock);
          if (list_empty(&sem->wait_list)) {
                  if (atomic_long_read(&sem->count) >= 0) {
                          raw_spin_unlock_irq(&sem->wait_lock);
                          return sem;

The suspected problem is that the last *_acquire on the down_read() side
happens before the write side issues its *_release:

  1. writer: has the lock
  2. reader: down_read() issues *read_acquire on entry
  3. writer: mm->vmacache_seqnum++; downgrades lock (*fetch_add_release)
  4. reader: __rwsem_down_read_failed_common() finds it can take the lock and returns
  5. reader: observes stale mm->vmacache_seqnum

I can reproduce the problem by running LTP mtest06 in a loop while
building a kernel (-j $NCPUS) in parallel. It reproduces from v4.20 up
to v5.2 on an arm64 HPE Apollo 70 (224 CPUs, 256GB RAM, 2 nodes) and
triggers reliably within about an hour. A patched kernel ran fine for
5+ hours with clean dmesg. Tests were done against v5.2, since commit
cf69482d62d9 ("locking/rwsem: Enable readers spinning on writer") makes
it much harder to reproduce.
Related: https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c
Related: commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
Fixes: 4b486b535c33 ("locking/rwsem: Exit read lock slowpath if queue empty & no writer")

Signed-off-by: Jan Stancek <jstancek@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
---
 kernel/locking/rwsem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 37524a47f002..757b198d7a5b 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1030,7 +1030,7 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
 	 * exit the slowpath and return immediately as its
 	 * RWSEM_READER_BIAS has already been set in the count.
 	 */
-	if (adjustment && !(atomic_long_read(&sem->count) &
+	if (adjustment && !(atomic_long_read_acquire(&sem->count) &
 	    (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
 		raw_spin_unlock_irq(&sem->wait_lock);
 		rwsem_set_reader_owned(sem);
--
1.8.3.1
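For context on why a stale mm->vmacache_seqnum is fatal here: find_vma()
consults the per-task vmacache first, and only a seqnum mismatch forces a
flush. A rough sketch of that check (paraphrased from mm/vmacache.c of
the era in question -- field names may not match the tree exactly):

	static bool vmacache_valid(struct mm_struct *mm)
	{
		struct task_struct *curr = current;

		if (mm != curr->mm)
			return false;

		if (mm->vmacache_seqnum != curr->vmacache.seqnum) {
			/* Stale cache: resync the seqnum, drop entries. */
			curr->vmacache.seqnum = mm->vmacache_seqnum;
			vmacache_flush(curr);
			return false;
		}
		return true;
	}

If the reader observes the pre-increment seqnum, the comparison passes
and find_vma() can hand back a cached pointer to the already-detached vma.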
* Re: [PATCH] locking/rwsem: use read_acquire in read_slowpath exit when queue is empty

From: Waiman Long @ 2019-07-16 16:53 UTC
To: Jan Stancek, linux-kernel; +Cc: dbueso, will, peterz, mingo

On 7/16/19 12:04 PM, Jan Stancek wrote:
> LTP mtest06 has been observed to rarely hit "still mapped when deleted"
> and the following BUG_ON on arm64:
[...]
> Tests were done against v5.2, since commit cf69482d62d9 ("locking/rwsem:
> Enable readers spinning on writer") makes it much harder to reproduce.

The explanation makes sense to me. Thanks for the investigation.

[...]
>  	 */
> -	if (adjustment && !(atomic_long_read(&sem->count) &
> +	if (adjustment && !(atomic_long_read_acquire(&sem->count) &
>  	    (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
>  		raw_spin_unlock_irq(&sem->wait_lock);
>  		rwsem_set_reader_owned(sem);

The chance of taking this path is not that high. So instead of
increasing the cost of the test by adding an acquire barrier, how about
just adding smp_mb__after_spinlock() before spin_unlock_irq()? This
should have the same effect of making sure that no stale data will be
used in the read-lock critical section.

-Longman
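Concretely, the alternative Waiman describes would keep the plain
atomic_long_read() and put a full barrier inside the critical section
instead; a sketch of what that would look like (not a posted patch,
placement assumed from his description):

	if (adjustment && !(atomic_long_read(&sem->count) &
	    (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
		/* Hypothetical: full barrier while still holding wait_lock. */
		smp_mb__after_spinlock();
		raw_spin_unlock_irq(&sem->wait_lock);
		rwsem_set_reader_owned(sem);
		...
	}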
* Re: [PATCH] locking/rwsem: use read_acquire in read_slowpath exit when queue is empty

From: Jan Stancek @ 2019-07-16 18:34 UTC
To: Waiman Long; +Cc: linux-kernel, dbueso, will, peterz, mingo

----- Original Message -----
> On 7/16/19 12:04 PM, Jan Stancek wrote:
[...]
> The explanation makes sense to me. Thanks for the investigation.
[...]
> The chance of taking this path is not that high. So instead of
> increasing the cost of the test by adding an acquire barrier,

Yes, that is a good point.

> how about just adding smp_mb__after_spinlock() before spin_unlock_irq()?
> This should have the same effect of making sure that no stale data will
> be used in the read-lock critical section.

I'll redo the tests with smp_mb__after_spinlock().

Thanks,
Jan
* Re: [PATCH] locking/rwsem: use read_acquire in read_slowpath exit when queue is empty

From: Peter Zijlstra @ 2019-07-16 18:58 UTC
To: Waiman Long; +Cc: Jan Stancek, linux-kernel, dbueso, will, mingo

On Tue, Jul 16, 2019 at 12:53:14PM -0400, Waiman Long wrote:
> On 7/16/19 12:04 PM, Jan Stancek wrote:
[...]
> The chance of taking this path is not that high. So instead of
> increasing the cost of the test by adding an acquire barrier, how about
> just adding smp_mb__after_spinlock() before spin_unlock_irq()? This
> should have the same effect of making sure that no stale data will be
> used in the read-lock critical section.

That's actually more expensive on something like ARM64, I expect.

The far cheaper alternative is smp_acquire__after_ctrl_dep(); however,
in general Will seems to prefer using load-acquire over separate
barriers, and for x86 it doesn't matter anyway. For PowerPC these two
are a wash: both end up with LWSYNC (over SYNC for your alternative).
* Re: [PATCH] locking/rwsem: use read_acquire in read_slowpath exit when queue is empty

From: Waiman Long @ 2019-07-16 19:09 UTC
To: Peter Zijlstra; +Cc: Jan Stancek, linux-kernel, dbueso, will, mingo

On 7/16/19 2:58 PM, Peter Zijlstra wrote:
> On Tue, Jul 16, 2019 at 12:53:14PM -0400, Waiman Long wrote:
[...]
>> The chance of taking this path is not that high. So instead of
>> increasing the cost of the test by adding an acquire barrier, how about
>> just adding smp_mb__after_spinlock() before spin_unlock_irq()? This
>> should have the same effect of making sure that no stale data will be
>> used in the read-lock critical section.
>
> That's actually more expensive on something like ARM64, I expect.
>
> The far cheaper alternative is smp_acquire__after_ctrl_dep(); however,
> in general Will seems to prefer using load-acquire over separate
> barriers, and for x86 it doesn't matter anyway. For PowerPC these two
> are a wash: both end up with LWSYNC (over SYNC for your alternative).

With lock event counting turned on, my experience with this code path
was that it got hit very infrequently. It is even less frequent with the
latest reader optimistic spinning patch. That is why I prefer making it
a bit more costly when the condition is true, without incurring any cost
at all when the condition is false, which is the majority of the cases.

Anyway, this additional cost is for arm64 only, and it is still more
than compensated for by skipping all the wait-list manipulation and
self-wakeup code.

Cheers,
Longman
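For reference, the relative costs being discussed come down to what the
two primitives expand to. Roughly, on a v5.2-era tree (paraphrased from
include/asm-generic/barrier.h and the arm64 headers -- worth
double-checking against the exact tree before relying on this):

	/* Waiman's alternative: full barrier after taking the lock. */
	#define smp_mb__after_spinlock()	smp_mb()   /* arm64: dmb ish (full) */

	/* Peter's suggestion: promote the existing control dependency. */
	#define smp_acquire__after_ctrl_dep()	smp_rmb()  /* arm64: dmb ishld (loads) */

On arm64 the read barrier orders only against subsequent loads and
stores sourced from loads, which is why Peter expects the full barrier
to be the more expensive of the two there.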
* [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty

From: Jan Stancek @ 2019-07-17 12:02 UTC
To: linux-kernel; +Cc: longman, dbueso, will, peterz, mingo, jstancek

LTP mtest06 has been observed to rarely hit "still mapped when deleted"
and the following BUG_ON on arm64:

[... changelog unchanged from v1, except that the patched kernel now
ran fine for 10+ hours with clean dmesg ...]

v2: Move barrier after test (Waiman Long)
    Use smp_acquire__after_ctrl_dep() (Peter Zijlstra)

Related: https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c
Related: commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
Fixes: 4b486b535c33 ("locking/rwsem: Exit read lock slowpath if queue empty & no writer")

Signed-off-by: Jan Stancek <jstancek@redhat.com>
Cc: stable@vger.kernel.org # v4.20+
Cc: Waiman Long <longman@redhat.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
---
 kernel/locking/rwsem.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 37524a47f002..5ac72b60608b 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1032,6 +1032,7 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
 	 */
 	if (adjustment && !(atomic_long_read(&sem->count) &
 	    (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
+		smp_acquire__after_ctrl_dep();
 		raw_spin_unlock_irq(&sem->wait_lock);
 		rwsem_set_reader_owned(sem);
 		lockevent_inc(rwsem_rlock_fast);
--
1.8.3.1
* Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty

From: Will Deacon @ 2019-07-17 13:13 UTC
To: Jan Stancek; +Cc: linux-kernel, longman, dbueso, peterz, mingo

On Wed, Jul 17, 2019 at 02:02:20PM +0200, Jan Stancek wrote:
> LTP mtest06 has been observed to rarely hit "still mapped when deleted"
> and the following BUG_ON on arm64:
[...]
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index 37524a47f002..5ac72b60608b 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -1032,6 +1032,7 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
>  	 */
>  	if (adjustment && !(atomic_long_read(&sem->count) &
>  	    (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
> +		smp_acquire__after_ctrl_dep();
>  		raw_spin_unlock_irq(&sem->wait_lock);
>  		rwsem_set_reader_owned(sem);
>  		lockevent_inc(rwsem_rlock_fast);

If you add a comment to the code outlining the issue (preferably as a
litmus test involving sem->count and some shared data, which happens to
be vmacache_seqnum in your test), then:

Reviewed-by: Will Deacon <will@kernel.org>

Thanks,

Will
* Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty

From: Waiman Long @ 2019-07-17 14:19 UTC
To: Will Deacon, Jan Stancek; +Cc: linux-kernel, dbueso, peterz, mingo

On 7/17/19 9:13 AM, Will Deacon wrote:
> On Wed, Jul 17, 2019 at 02:02:20PM +0200, Jan Stancek wrote:
[...]
> If you add a comment to the code outlining the issue (preferably as a
> litmus test involving sem->count and some shared data, which happens to
> be vmacache_seqnum in your test), then:
>
> Reviewed-by: Will Deacon <will@kernel.org>

Agreed. A comment just above smp_acquire__after_ctrl_dep() on why this
is needed would be great.

Other than that,

Acked-by: Waiman Long <longman@redhat.com>
* Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty

From: Jan Stancek @ 2019-07-17 19:22 UTC
To: Waiman Long; +Cc: Will Deacon, linux-kernel, dbueso, peterz, mingo, jstancek

On Wed, Jul 17, 2019 at 10:19:04AM -0400, Waiman Long wrote:
>> If you add a comment to the code outlining the issue (preferably as a
>> litmus test involving sem->count and some shared data, which happens to
>> be vmacache_seqnum in your test), then:
>>
>> Reviewed-by: Will Deacon <will@kernel.org>
>
> Agreed. A comment just above smp_acquire__after_ctrl_dep() on why this
> is needed would be great.
>
> Other than that,
>
> Acked-by: Waiman Long <longman@redhat.com>

The litmus test looks a bit long; would the following be acceptable?

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 37524a47f002..d9c96651bfc7 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1032,6 +1032,13 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
 	 */
 	if (adjustment && !(atomic_long_read(&sem->count) &
 	    (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
+		/*
+		 * down_read() issued ACQUIRE on enter, but we can race
+		 * with a writer who did RELEASE only after us.
+		 * ACQUIRE here makes sure reader operations happen only
+		 * after all writer ones.
+		 */
+		smp_acquire__after_ctrl_dep();
 		raw_spin_unlock_irq(&sem->wait_lock);
 		rwsem_set_reader_owned(sem);
 		lockevent_inc(rwsem_rlock_fast);

with this litmus test in the commit log:

----------------------------------- 8< ------------------------------------
C rwsem

{
	atomic_t rwsem_count = ATOMIC_INIT(1);
	int vmacache_seqnum = 10;
}

P0(int *vmacache_seqnum, atomic_t *rwsem_count)
{
	r0 = READ_ONCE(*vmacache_seqnum);
	WRITE_ONCE(*vmacache_seqnum, r0 + 1);
	/* downgrade_write */
	r1 = atomic_fetch_add_release(-1+256, rwsem_count);
}

P1(int *vmacache_seqnum, atomic_t *rwsem_count, spinlock_t *sem_wait_lock)
{
	/* rwsem_read_trylock */
	r0 = atomic_add_return_acquire(256, rwsem_count);
	/* rwsem_down_read_slowpath */
	spin_lock(sem_wait_lock);
	r0 = atomic_read(rwsem_count);
	if ((r0 & 1) == 0) {
		// BUG: needs barrier
		spin_unlock(sem_wait_lock);
		r1 = READ_ONCE(*vmacache_seqnum);
	}
}
exists (1:r1=10)
----------------------------------- 8< ------------------------------------

Thanks,
Jan
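A test in this form can be checked against the Linux kernel memory model
shipped in tools/memory-model. Assuming herd7 is installed and the test
above is saved as rwsem.litmus (filename chosen here for illustration),
the invocation would be roughly:

	$ cd tools/memory-model
	$ herd7 -conf linux-kernel.cfg rwsem.litmus

Without the added barrier, the model should report the exists clause as
satisfiable ("Sometimes"), i.e. the stale vmacache_seqnum read is
allowed.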
* Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty

From: Waiman Long @ 2019-07-17 19:39 UTC
To: Jan Stancek; +Cc: Will Deacon, linux-kernel, dbueso, peterz, mingo

On 7/17/19 3:22 PM, Jan Stancek wrote:
> The litmus test looks a bit long; would the following be acceptable?
[...]
> +		/*
> +		 * down_read() issued ACQUIRE on enter, but we can race
> +		 * with a writer who did RELEASE only after us.
> +		 * ACQUIRE here makes sure reader operations happen only
> +		 * after all writer ones.
> +		 */

How about this?

	/*
	 * Add an acquire barrier here to make sure no stale data
	 * acquired before the above test, where the writer may still
	 * be holding the lock, will be reused in the reader critical
	 * section.
	 */

Thanks,
Longman
* [PATCH v3] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty

From: Jan Stancek @ 2019-07-18 8:51 UTC
To: linux-kernel; +Cc: longman, dbueso, will, peterz, mingo, jstancek

LTP mtest06 has been observed to rarely hit "still mapped when deleted"
and the following BUG_ON on arm64:

[... changelog unchanged from v2, now including the litmus test from the
previous mail between the scissor markers ...]

It reproduces from v4.20 up to v5.2 on arm64 HPE Apollo 70 (224 CPUs,
256GB RAM, 2 nodes) and triggers reliably within about an hour. The
patched kernel ran fine for 10+ hours with clean dmesg. Tests were done
against v5.2, since commit cf69482d62d9 ("locking/rwsem: Enable readers
spinning on writer") makes it much harder to reproduce.

v2: Move barrier after test (Waiman Long)
    Use smp_acquire__after_ctrl_dep() (Peter Zijlstra)

v3: Add comment to barrier (Waiman Long, Will Deacon)
    Add litmus test

Related: https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c
Related: commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
Fixes: 4b486b535c33 ("locking/rwsem: Exit read lock slowpath if queue empty & no writer")

Signed-off-by: Jan Stancek <jstancek@redhat.com>
Reviewed-by: Will Deacon <will@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: stable@vger.kernel.org # v4.20+
Cc: Waiman Long <longman@redhat.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
---
 kernel/locking/rwsem.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 37524a47f002..fe02aef39e9d 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1032,6 +1032,13 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
 	 */
 	if (adjustment && !(atomic_long_read(&sem->count) &
 	    (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
+		/*
+		 * Add an acquire barrier here to make sure no stale
+		 * data acquired before the above test, where the writer
+		 * may still be holding the lock, will be reused in the
+		 * reader critical section.
+		 */
+		smp_acquire__after_ctrl_dep();
 		raw_spin_unlock_irq(&sem->wait_lock);
 		rwsem_set_reader_owned(sem);
 		lockevent_inc(rwsem_rlock_fast);
--
1.8.3.1
* [tip:locking/core] locking/rwsem: Add missing ACQUIRE to read_slowpath exit when queue is empty

From: tip-bot for Jan Stancek @ 2019-07-25 16:00 UTC
To: linux-tip-commits; +Cc: longman, linux-kernel, mingo, hpa, will, peterz, jstancek, tglx, torvalds

Commit-ID:  e1b98fa316648420d0434d9ff5b92ad6609ba6c3
Gitweb:     https://git.kernel.org/tip/e1b98fa316648420d0434d9ff5b92ad6609ba6c3
Author:     Jan Stancek <jstancek@redhat.com>
AuthorDate: Thu, 18 Jul 2019 10:51:25 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 25 Jul 2019 15:39:23 +0200

locking/rwsem: Add missing ACQUIRE to read_slowpath exit when queue is empty

LTP mtest06 has been observed to occasionally hit "still mapped when
deleted" and a following BUG_ON on arm64.

The extra mapcount originated from the pagefault handler, which handled
a pagefault for a vma that had already been detached. The vma is
detached under the mmap_sem write lock by detach_vmas_to_be_unmapped(),
which also invalidates the vmacache.

When the pagefault handler (under mmap_sem read lock) calls find_vma(),
vmacache_valid() wrongly reports the vmacache as valid.

After rwsem down_read() returns via the 'queue empty' path (as of v5.2),
it does so without an ACQUIRE on sem->count:

  down_read()
    __down_read()
      rwsem_down_read_failed()
        __rwsem_down_read_failed_common()
          raw_spin_lock_irq(&sem->wait_lock);
          if (list_empty(&sem->wait_list)) {
                  if (atomic_long_read(&sem->count) >= 0) {
                          raw_spin_unlock_irq(&sem->wait_lock);
                          return sem;

The problem can be reproduced by running LTP mtest06 in a loop and
building the kernel (-j $NCPUS) in parallel. It reproduces since v4.20
on arm64 HPE Apollo 70 (224 CPUs, 256GB RAM, 2 nodes). It triggers
reliably in about an hour. The patched kernel ran fine for 10+ hours.

Signed-off-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Will Deacon <will@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dbueso@suse.de
Fixes: 4b486b535c33 ("locking/rwsem: Exit read lock slowpath if queue empty & no writer")
Link: https://lkml.kernel.org/r/50b8914e20d1d62bb2dee42d342836c2c16ebee7.1563438048.git.jstancek@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/rwsem.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index bc91aacaab58..d3ce7c6c42a6 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1036,6 +1036,8 @@ queue:
 	 */
 	if (adjustment && !(atomic_long_read(&sem->count) &
 	    (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
+		/* Provide lock ACQUIRE */
+		smp_acquire__after_ctrl_dep();
 		raw_spin_unlock_irq(&sem->wait_lock);
 		rwsem_set_reader_owned(sem);
 		lockevent_inc(rwsem_rlock_fast);
* Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty

From: Will Deacon @ 2019-07-18 9:26 UTC
To: Jan Stancek; +Cc: Waiman Long, linux-kernel, dbueso, peterz, mingo, jade.alglave, paulmck

Hi Jan, Waiman, [+Jade and Paul for the litmus test at the end]

On Wed, Jul 17, 2019 at 09:22:00PM +0200, Jan Stancek wrote:
[...]
> +		/*
> +		 * down_read() issued ACQUIRE on enter, but we can race
> +		 * with a writer who did RELEASE only after us.
> +		 * ACQUIRE here makes sure reader operations happen only
> +		 * after all writer ones.
> +		 */

How about an abridged form of the litmus test here, just to show the
code flow? e.g.:

	/*
	 * We need to ensure ACQUIRE semantics when reading sem->count so
	 * that we pair with the RELEASE store performed by an
	 * unlocking/downgrading writer.
	 *
	 *  P0 (writer)				P1 (reader)
	 *
	 *  down_write(sem);
	 *  <write shared data>
	 *  downgrade_write(sem);
	 *    -> fetch_add_release(&sem->count)
	 *
	 *					down_read_slowpath(sem);
	 *					  -> atomic_read(&sem->count)
	 *					     <ctrl dep>
	 *					     smp_acquire__after_ctrl_dep()
	 *					<read shared data>
	 */

In writing this, I also noticed that we don't have any explicit ordering
at the end of the reader slowpath when we wait on the queue but get woken
immediately:

	if (!waiter.task)
		break;

Am I missing something?

[... quoted patch and litmus test trimmed ...]

Thanks for writing this! It's definitely worth sticking it in the commit
log, but Paul and Jade might also like to include it as part of their
litmus test repository too.

Will
* Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty

From: Jan Stancek @ 2019-07-18 10:50 UTC
To: Will Deacon; +Cc: Waiman Long, linux-kernel, dbueso, peterz, mingo, jade alglave, paulmck, Jan Stancek

----- Original Message -----
> Hi Jan, Waiman, [+Jade and Paul for the litmus test at the end]
>
> On Wed, Jul 17, 2019 at 09:22:00PM +0200, Jan Stancek wrote:
[...]
> How about an abridged form of the litmus test here, just to show the
> code flow? e.g.:
[...]

Works for me. The code is at 3 levels of indentation, but I can try to
squeeze it in for v4.

> In writing this, I also noticed that we don't have any explicit ordering
> at the end of the reader slowpath when we wait on the queue but get woken
> immediately:
>
> 	if (!waiter.task)
> 		break;
>
> Am I missing something?

I'm assuming this isn't a problem, because set_current_state() on the
line above is using smp_store_mb().

[... rest of quoted patch and litmus test trimmed ...]
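Jan's reasoning rests on the ordering that set_current_state() already
provides. Roughly, from the v5.2-era headers (paraphrased, not verbatim):

	/* include/linux/sched.h */
	#define set_current_state(state_value)			\
		smp_store_mb(current->state, (state_value))

	/*
	 * include/asm-generic/barrier.h: a store followed by a full
	 * barrier (x86 implements this as a single XCHG).
	 */
	#define smp_store_mb(var, value)			\
		do { WRITE_ONCE(var, value); smp_mb(); } while (0)

The catch, as Peter's reply below shows, is that in the problematic
interleaving this full barrier executes before the writer's stores, so
it cannot order the reader's later loads against them.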
* Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty

From: Peter Zijlstra @ 2019-07-18 11:04 UTC
To: Jan Stancek; +Cc: Will Deacon, Waiman Long, linux-kernel, dbueso, mingo, jade alglave, paulmck

On Thu, Jul 18, 2019 at 06:50:52AM -0400, Jan Stancek wrote:
> > In writing this, I also noticed that we don't have any explicit ordering
> > at the end of the reader slowpath when we wait on the queue but get woken
> > immediately:
> >
> > 	if (!waiter.task)
> > 		break;
> >
> > Am I missing something?
>
> I'm assuming this isn't a problem, because set_current_state() on the
> line above is using smp_store_mb().

	X = 0;					X = 1;
	rwsem_down_read()			rwsem_up_write();
	  for (;;) {
	    set_current_state(TASK_UNINTERRUPTIBLE);

						  rwsem_mark_wake()
						    atomic_long_add(adjustment, &sem->count);
						    smp_store_release(&waiter->task, NULL);

	    if (!waiter.task)
	      break;

	    ...
	  }

	r = X;

can, I think, result in r == 0 just fine, because there's nothing
ordering the load of waiter.task with the store of X. It is exceedingly
unlikely, but not impossible afaict.
* Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty 2019-07-18 11:04 ` Peter Zijlstra @ 2019-07-18 11:09 ` Peter Zijlstra 2019-07-18 11:36 ` Jan Stancek 0 siblings, 1 reply; 22+ messages in thread From: Peter Zijlstra @ 2019-07-18 11:09 UTC (permalink / raw) To: Jan Stancek Cc: Will Deacon, Waiman Long, linux-kernel, dbueso, mingo, jade alglave, paulmck It's simpler like so: On Thu, Jul 18, 2019 at 01:04:46PM +0200, Peter Zijlstra wrote: > X = 0; > > rwsem_down_read() > for (;;) { > set_current_state(TASK_UNINTERRUPTIBLE); > > X = 1; > rwsem_up_write(); > rwsem_mark_wake() > atomic_long_add(adjustment, &sem->count); > smp_store_release(&waiter->task, NULL); > > if (!waiter.task) > break; > > ... > } > > r = X; > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty 2019-07-18 11:09 ` Peter Zijlstra @ 2019-07-18 11:36 ` Jan Stancek 0 siblings, 0 replies; 22+ messages in thread From: Jan Stancek @ 2019-07-18 11:36 UTC (permalink / raw) To: Peter Zijlstra Cc: Will Deacon, Waiman Long, linux-kernel, dbueso, mingo, jade alglave, paulmck ----- Original Message ----- > > It's simpler like so: > > On Thu, Jul 18, 2019 at 01:04:46PM +0200, Peter Zijlstra wrote: > > X = 0; > > > > rwsem_down_read() > > for (;;) { > > set_current_state(TASK_UNINTERRUPTIBLE); > > > > X = 1; > > rwsem_up_write(); > > rwsem_mark_wake() > > atomic_long_add(adjustment, &sem->count); > > smp_store_release(&waiter->task, NULL); > > > > if (!waiter.task) > > break; > > > > ... > > } > > > > r = X; > > I see - it looks possible. Thank you for this example. ^ permalink raw reply [flat|nested] 22+ messages in thread
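Peter's interleaving maps onto a small self-contained C11 program. This is a sketch with invented names (it is not the kernel code): the waker plays the rwsem_up_write()/rwsem_mark_wake() role, storing X and then clearing waiter->task with a RELEASE store; the sleeper's ACQUIRE load is the pairing that the patch later in this thread adds via smp_load_acquire(). Weakening the sleeper's load to memory_order_relaxed models the unfixed code, where r == 0 becomes a permitted outcome.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

struct waiter {
        _Atomic(void *) task;           /* non-NULL while still queued */
};

static int dummy_task;
static struct waiter w = { &dummy_task };
static int x;                           /* shared data written before wakeup */

static void *waker(void *arg)
{
        (void)arg;
        x = 1;                          /* "X = 1; rwsem_up_write()" */
        /* rwsem_mark_wake(): publish x, then release waiter->task */
        atomic_store_explicit(&w.task, NULL, memory_order_release);
        return NULL;
}

static void *sleeper(void *arg)
{
        (void)arg;
        /*
         * ACQUIRE pairs with the release store above; with a relaxed
         * load here, the later read of x is unordered and r == 0 is
         * a permitted outcome on weakly ordered hardware.
         */
        while (atomic_load_explicit(&w.task, memory_order_acquire))
                ;                       /* "woken immediately" fast path */
        printf("r = %d\n", x);          /* must print r = 1 */
        return NULL;
}

int main(void)
{
        pthread_t t0, t1;

        pthread_create(&t0, NULL, waker, NULL);
        pthread_create(&t1, NULL, sleeper, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        return 0;
}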
* Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty 2019-07-18 10:50 ` Jan Stancek 2019-07-18 11:04 ` Peter Zijlstra @ 2019-07-18 12:12 ` Paul E. McKenney 1 sibling, 0 replies; 22+ messages in thread From: Paul E. McKenney @ 2019-07-18 12:12 UTC (permalink / raw) To: Jan Stancek Cc: Will Deacon, Waiman Long, linux-kernel, dbueso, peterz, mingo, jade alglave On Thu, Jul 18, 2019 at 06:50:52AM -0400, Jan Stancek wrote: > > ----- Original Message ----- > > Hi Jan, Waiman, [+Jade and Paul for the litmus test at the end] > > > > On Wed, Jul 17, 2019 at 09:22:00PM +0200, Jan Stancek wrote: > > > On Wed, Jul 17, 2019 at 10:19:04AM -0400, Waiman Long wrote: > > > > > If you add a comment to the code outlining the issue (preferably as a > > > > > litmus > > > > > test involving sem->count and some shared data which happens to be > > > > > vmacache_seqnum in your test), then: > > > > > > > > > > Reviewed-by: Will Deacon <will@kernel.org> > > > > > > > > > > Thanks, > > > > > > > > > > Will > > > > > > > > Agreed. A comment just above smp_acquire__after_ctrl_dep() on why this > > > > is needed will be great. > > > > > > > > Other than that, > > > > > > > > Acked-by: Waiman Long <longman@redhat.com> > > > > > > > > > > litmus test looks a bit long, would following be acceptable? > > > > > > diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c > > > index 37524a47f002..d9c96651bfc7 100644 > > > --- a/kernel/locking/rwsem.c > > > +++ b/kernel/locking/rwsem.c > > > @@ -1032,6 +1032,13 @@ static inline bool rwsem_reader_phase_trylock(struct > > > rw_semaphore *sem, > > > */ > > > if (adjustment && !(atomic_long_read(&sem->count) & > > > (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) { > > > + /* > > > + * down_read() issued ACQUIRE on enter, but we can race > > > + * with writer who did RELEASE only after us. > > > + * ACQUIRE here makes sure reader operations happen only > > > + * after all writer ones. > > > + */ > > > > How about an abridged form of the litmus test here, just to show the code > > flow? e.g.: > > > > /* > > * We need to ensure ACQUIRE semantics when reading sem->count so that > > * we pair with the RELEASE store performed by an unlocking/downgrading > > * writer. > > * > > * P0 (writer) P1 (reader) > > * > > * down_write(sem); > > * <write shared data> > > * downgrade_write(sem); > > * -> fetch_add_release(&sem->count) > > * > > * down_read_slowpath(sem); > > * -> atomic_read(&sem->count) > > * <ctrl dep> > > * smp_acquire__after_ctrl_dep() > > * <read shared data> > > */ > > Works for me. The code is at 3 levels of indentation, but I can try > to squeeze it in for v4. > > > > > > In writing this, I also noticed that we don't have any explicit ordering > > at the end of the reader slowpath when we wait on the queue but get woken > > immediately: > > > > if (!waiter.task) > > break; > > > > Am I missing something? > > I'm assuming this isn't a problem, because set_current_state() on the line above > is using smp_store_mb(). 
> > > > > > + smp_acquire__after_ctrl_dep(); > > > raw_spin_unlock_irq(&sem->wait_lock); > > > rwsem_set_reader_owned(sem); > > > lockevent_inc(rwsem_rlock_fast); > > > > > > > > > with litmus test in commit log: > > > ----------------------------------- 8< ------------------------------------ > > > C rwsem > > > > > > { > > > atomic_t rwsem_count = ATOMIC_INIT(1); > > > int vmacache_seqnum = 10; > > > } > > > > > > P0(int *vmacache_seqnum, atomic_t *rwsem_count) > > > { > > > r0 = READ_ONCE(*vmacache_seqnum); > > > WRITE_ONCE(*vmacache_seqnum, r0 + 1); > > > /* downgrade_write */ > > > r1 = atomic_fetch_add_release(-1+256, rwsem_count); > > > } > > > > > > P1(int *vmacache_seqnum, atomic_t *rwsem_count, spinlock_t *sem_wait_lock) > > > { > > > /* rwsem_read_trylock */ > > > r0 = atomic_add_return_acquire(256, rwsem_count); > > > /* rwsem_down_read_slowpath */ > > > spin_lock(sem_wait_lock); > > > r0 = atomic_read(rwsem_count); > > > if ((r0 & 1) == 0) { > > > // BUG: needs barrier > > > spin_unlock(sem_wait_lock); > > > r1 = READ_ONCE(*vmacache_seqnum); > > > } > > > } > > > exists (1:r1=10) > > > ----------------------------------- 8< ------------------------------------ > > > > Thanks for writing this! It's definitely worth sticking it in the commit > > log, but Paul and Jade might also like to include it as part of their litmus > > test repository too. Thank you for forwarding this! It may now be found at: https://github.com/paulmckrcu/litmus/blob/master/manual/kernel/C-JanStancek-rwsem.litmus Thanx, Paul ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty 2019-07-18 9:26 ` [PATCH v2] locking/rwsem: add acquire barrier " Will Deacon 2019-07-18 10:50 ` Jan Stancek @ 2019-07-18 10:58 ` Peter Zijlstra 2019-07-18 11:45 ` Will Deacon 1 sibling, 1 reply; 22+ messages in thread From: Peter Zijlstra @ 2019-07-18 10:58 UTC (permalink / raw) To: Will Deacon Cc: Jan Stancek, Waiman Long, linux-kernel, dbueso, mingo, jade.alglave, paulmck On Thu, Jul 18, 2019 at 10:26:41AM +0100, Will Deacon wrote: > /* > * We need to ensure ACQUIRE semantics when reading sem->count so that > * we pair with the RELEASE store performed by an unlocking/downgrading > * writer. > * > * P0 (writer) P1 (reader) > * > * down_write(sem); > * <write shared data> > * downgrade_write(sem); > * -> fetch_add_release(&sem->count) > * > * down_read_slowpath(sem); > * -> atomic_read(&sem->count) > * <ctrl dep> > * smp_acquire__after_ctrl_dep() > * <read shared data> > */ So I'm thinking all this is excessive; the simple rule is: lock acquire should imply ACQUIRE, we all know why. > In writing this, I also noticed that we don't have any explicit ordering > at the end of the reader slowpath when we wait on the queue but get woken > immediately: > > if (!waiter.task) > break; > > Am I missing something? Ha!, I ran into the very same one. I keep confusing myself, but I think you're right and that needs to be smp_load_acquire() to match the smp_store_release() in rwsem_mark_wake(). (the actual race there is _tiny_ due to the smp_mb() right before it, but I cannot convince myself that is indeed sufficient) The signal_pending_state() case is also fun, but I think wait_lock there is sufficient (even under RCpc). I've ended up with this..

---
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 37524a47f002..9eb630904a17 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1000,6 +1000,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 		atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
 		adjustment = 0;
 		if (rwsem_optimistic_spin(sem, false)) {
+			/* rwsem_optimistic_spin() implies ACQUIRE through rwsem_*trylock() */
 			/*
 			 * Wake up other readers in the wait list if the front
 			 * waiter is a reader.
@@ -1014,6 +1015,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 			}
 			return sem;
 		} else if (rwsem_reader_phase_trylock(sem, waiter.last_rowner)) {
+			/* rwsem_reader_phase_trylock() implies ACQUIRE */
 			return sem;
 		}
 
@@ -1032,6 +1034,8 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 	 */
 	if (adjustment && !(atomic_long_read(&sem->count) &
 	     (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
+		/* Provide lock ACQUIRE */
+		smp_acquire__after_ctrl_dep();
 		raw_spin_unlock_irq(&sem->wait_lock);
 		rwsem_set_reader_owned(sem);
 		lockevent_inc(rwsem_rlock_fast);
@@ -1065,15 +1069,25 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 	wake_up_q(&wake_q);
 
 	/* wait to be given the lock */
-	while (true) {
+	for (;;) {
 		set_current_state(state);
-		if (!waiter.task)
+		if (!smp_load_acquire(&waiter.task)) {
+			/*
+			 * Matches rwsem_mark_wake()'s smp_store_release() and ensures
+			 * we're ordered against its sem->count operations.
+			 */
 			break;
+		}
 		if (signal_pending_state(state, current)) {
 			raw_spin_lock_irq(&sem->wait_lock);
 			if (waiter.task)
 				goto out_nolock;
 			raw_spin_unlock_irq(&sem->wait_lock);
+			/*
+			 * Ordered by sem->wait_lock against rwsem_mark_wake(), if we
+			 * see its waiter.task store, we must also see its sem->count
+			 * changes.
+			 */
 			break;
 		}
 		schedule();
@@ -1083,6 +1097,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 	__set_current_state(TASK_RUNNING);
 	lockevent_inc(rwsem_rlock);
 	return sem;
+
 out_nolock:
 	list_del(&waiter.list);
 	if (list_empty(&sem->wait_list)) {

^ permalink raw reply related [flat|nested] 22+ messages in thread
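A side note on the smp_acquire__after_ctrl_dep() hunk above, for readers outside the kernel memory model: the closest portable C11 analogue of "relaxed load, control dependency, then upgrade to ACQUIRE" is an acquire fence placed after the relaxed load. The sketch below uses invented stand-in names for the kernel flags; the fence is, if anything, slightly stronger than the kernel primitive (C11 has no notion of control dependencies), but it gives the guarantee the hunk relies on: reads executed inside the branch cannot be satisfied before the sem->count load.

#include <stdatomic.h>
#include <stdbool.h>

#define WRITER_MASK	0x1UL	/* stand-in for RWSEM_WRITER_MASK ... */
#define HANDOFF_FLAG	0x2UL	/* ... and for RWSEM_FLAG_HANDOFF */

static atomic_ulong sem_count;	/* stands in for sem->count */

static bool reader_fast_exit(void)
{
	/* plain load, as in the hunk: no ordering by itself */
	unsigned long c = atomic_load_explicit(&sem_count,
					       memory_order_relaxed);

	if (!(c & (WRITER_MASK | HANDOFF_FLAG))) {
		/*
		 * Analogue of smp_acquire__after_ctrl_dep(): if the load
		 * above observed a RELEASE store by the writer, this
		 * fence synchronizes with it, so everything the writer
		 * did before unlocking is visible below.
		 */
		atomic_thread_fence(memory_order_acquire);
		return true;	/* safe to read writer-published data */
	}
	return false;
}

int main(void)
{
	return reader_fast_exit() ? 0 : 1;
}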
* Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty 2019-07-18 10:58 ` Peter Zijlstra @ 2019-07-18 11:45 ` Will Deacon 2019-07-18 12:23 ` Peter Zijlstra 0 siblings, 1 reply; 22+ messages in thread From: Will Deacon @ 2019-07-18 11:45 UTC (permalink / raw) To: Peter Zijlstra Cc: Jan Stancek, Waiman Long, linux-kernel, dbueso, mingo, jade.alglave, paulmck On Thu, Jul 18, 2019 at 12:58:12PM +0200, Peter Zijlstra wrote: > On Thu, Jul 18, 2019 at 10:26:41AM +0100, Will Deacon wrote: > > > > > /* > > * We need to ensure ACQUIRE semantics when reading sem->count so that > > * we pair with the RELEASE store performed by an unlocking/downgrading > > * writer. > > * > > * P0 (writer) P1 (reader) > > * > > * down_write(sem); > > * <write shared data> > > * downgrade_write(sem); > > * -> fetch_add_release(&sem->count) > > * > > * down_read_slowpath(sem); > > * -> atomic_read(&sem->count) > > * <ctrl dep> > > * smp_acquire__after_ctrl_dep() > > * <read shared data> > > */ > > So I'm thinking all this is excessive; the simple rule is: lock acquire > should imply ACQUIRE, we all know why. Fair enough, I just thought this was worth highlighting because you can't rely on the wait_lock to give you ACQUIRE ordering. > > In writing this, I also noticed that we don't have any explicit ordering > > at the end of the reader slowpath when we wait on the queue but get woken > > immediately: > > > > if (!waiter.task) > > break; > > > > Am I missing something? > > Ha!, I ran into the very same one. I keep confusing myself, but I think > you're right and that needs to be smp_load_acquire() to match the > smp_store_release() in rwsem_mark_wake(). > > (the actual race there is _tiny_ due to the smp_mb() right before it, > but I cannot convince myself that is indeed sufficient) > > The signal_pending_state() case is also fun, but I think wait_lock there > is sufficient (even under RCpc). > > I've ended up with this.. > > --- > diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c > index 37524a47f002..9eb630904a17 100644 > --- a/kernel/locking/rwsem.c > +++ b/kernel/locking/rwsem.c > @@ -1000,6 +1000,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state) > atomic_long_add(-RWSEM_READER_BIAS, &sem->count); > adjustment = 0; > if (rwsem_optimistic_spin(sem, false)) { > + /* rwsem_optimistic_spin() implies ACQUIRE through rwsem_*trylock() */ I couldn't figure out if this was dependent on the return value or not, and looking at osq_lock() I also couldn't see the ACQUIRE barrier when we're spinning on node->locked. Hmm. > /* > * Wake up other readers in the wait list if the front > * waiter is a reader. > @@ -1014,6 +1015,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state) > } > return sem; > } else if (rwsem_reader_phase_trylock(sem, waiter.last_rowner)) { > + /* rwsem_reader_phase_trylock() implies ACQUIRE */ Can we add "on success" to the end of this, please? 
> return sem; > } > > @@ -1032,6 +1034,8 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state) > */ > if (adjustment && !(atomic_long_read(&sem->count) & > (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) { > + /* Provide lock ACQUIRE */ > + smp_acquire__after_ctrl_dep(); > raw_spin_unlock_irq(&sem->wait_lock); > rwsem_set_reader_owned(sem); > lockevent_inc(rwsem_rlock_fast); > @@ -1065,15 +1069,25 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state) > wake_up_q(&wake_q); > > /* wait to be given the lock */ > - while (true) { > + for (;;) { > set_current_state(state); > - if (!waiter.task) > + if (!smp_load_acquire(&waiter.task)) { > + /* > + * Matches rwsem_mark_wake()'s smp_store_release() and ensures > + * we're ordered against its sem->count operations. > + */ > break; > + } Ack. Also, grepping for 'waiter.task' reveals a similar usage in drivers/tty/tty_ldsem.c if you're feeling brave enough. Will ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty 2019-07-18 11:45 ` Will Deacon @ 2019-07-18 12:23 ` Peter Zijlstra 0 siblings, 0 replies; 22+ messages in thread From: Peter Zijlstra @ 2019-07-18 12:23 UTC (permalink / raw) To: Will Deacon Cc: Jan Stancek, Waiman Long, linux-kernel, dbueso, mingo, jade.alglave, paulmck On Thu, Jul 18, 2019 at 12:45:47PM +0100, Will Deacon wrote: > On Thu, Jul 18, 2019 at 12:58:12PM +0200, Peter Zijlstra wrote: > > On Thu, Jul 18, 2019 at 10:26:41AM +0100, Will Deacon wrote: > > > > > > > /* > > > * We need to ensure ACQUIRE semantics when reading sem->count so that > > > * we pair with the RELEASE store performed by an unlocking/downgrading > > > * writer. > > > * > > > * P0 (writer) P1 (reader) > > > * > > > * down_write(sem); > > > * <write shared data> > > > * downgrade_write(sem); > > > * -> fetch_add_release(&sem->count) > > > * > > > * down_read_slowpath(sem); > > > * -> atomic_read(&sem->count) > > > * <ctrl dep> > > > * smp_acquire__after_ctrl_dep() > > > * <read shared data> > > > */ > > > > So I'm thinking all this is excessive; the simple rule is: lock acquire > > should imply ACQUIRE, we all know why. > > Fair enough, I just thought this was worth highlighting because you can't > rely on the wait_lock to give you ACQUIRE ordering. Right, not in this case, because sem->count is not fully serialized by it, whereas below the wait-queue is. > > --- > > diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c > > index 37524a47f002..9eb630904a17 100644 > > --- a/kernel/locking/rwsem.c > > +++ b/kernel/locking/rwsem.c > > @@ -1000,6 +1000,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state) > > atomic_long_add(-RWSEM_READER_BIAS, &sem->count); > > adjustment = 0; > > if (rwsem_optimistic_spin(sem, false)) { > > + /* rwsem_optimistic_spin() implies ACQUIRE through rwsem_*trylock() */ > > I couldn't figure out if this was dependent on the return value or not, I went with the fact that the only way to return true is if taken becomes true; and that only happens through rwsem_try_{read,write}_lock_unqueued(), and both imply ACQUIRE on success. > and looking at osq_lock() I also couldn't see the ACQUIRE barrier when we're > spinning on node->locked. Hmm. Yes, osq is not a full lock and does not imply these barriers. This came up somewhere, did we forget to write a comment on that? Lemme go look. > > /* > > * Wake up other readers in the wait list if the front > > * waiter is a reader. > > @@ -1014,6 +1015,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state) > > } > > return sem; > > } else if (rwsem_reader_phase_trylock(sem, waiter.last_rowner)) { > > + /* rwsem_reader_phase_trylock() implies ACQUIRE */ > > Can we add "on success" to the end of this, please? Good point. 
> > return sem; > > } > > > > @@ -1032,6 +1034,8 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state) > > */ > > if (adjustment && !(atomic_long_read(&sem->count) & > > (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) { > > + /* Provide lock ACQUIRE */ > > + smp_acquire__after_ctrl_dep(); > > raw_spin_unlock_irq(&sem->wait_lock); > > rwsem_set_reader_owned(sem); > > lockevent_inc(rwsem_rlock_fast); > > @@ -1065,15 +1069,25 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state) > > wake_up_q(&wake_q); > > > > /* wait to be given the lock */ > > - while (true) { > > + for (;;) { > > set_current_state(state); > > - if (!waiter.task) > > + if (!smp_load_acquire(&waiter.task)) { > > + /* > > + * Matches rwsem_mark_wake()'s smp_store_release() and ensures > > + * we're ordered against its sem->count operations. > > + */ > > break; > > + } > > Ack. Also, grepping for 'waiter.task' reveals a similar usage in > drivers/tty/tty_ldsem.c if you're feeling brave enough. *sigh* of course, for every bug there needs to be a second copy somewhere. I'll go look there too. Thanks! ^ permalink raw reply [flat|nested] 22+ messages in thread
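For readers puzzling over "implies ACQUIRE on success": a trylock is conventionally built from a compare-and-swap whose success ordering is acquire and whose failure ordering is relaxed, so a failed attempt promises no barrier at all. A toy C11 illustration of that convention (not the rwsem, osq, or tty_ldsem code):

#include <stdatomic.h>
#include <stdbool.h>

static atomic_int lock_word;	/* 0 = unlocked, 1 = locked */

static bool toy_trylock(void)
{
	int expected = 0;

	/*
	 * ACQUIRE only on success: a failed CAS uses relaxed ordering,
	 * so callers must not assume any barrier from it -- hence
	 * "implies ACQUIRE on success" in the comments above.
	 */
	return atomic_compare_exchange_strong_explicit(
			&lock_word, &expected, 1,
			memory_order_acquire, memory_order_relaxed);
}

static void toy_unlock(void)
{
	atomic_store_explicit(&lock_word, 0, memory_order_release);
}

int main(void)
{
	if (toy_trylock())
		toy_unlock();
	return 0;
}

Usage note: only a true return may be treated as a lock acquisition; after a false return the caller must not assume any ordering, which is the gap Will's "on success" qualifier guards against.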
* Re: [PATCH v2] locking/rwsem: add acquire barrier to read_slowpath exit when queue is empty 2019-07-17 12:02 ` [PATCH v2] locking/rwsem: add acquire barrier to " Jan Stancek 2019-07-17 13:13 ` Will Deacon @ 2019-07-17 15:33 ` Waiman Long 1 sibling, 0 replies; 22+ messages in thread From: Waiman Long @ 2019-07-17 15:33 UTC (permalink / raw) To: Jan Stancek, linux-kernel; +Cc: dbueso, will, peterz, mingo On 7/17/19 8:02 AM, Jan Stancek wrote: > LTP mtest06 has been observed to rarely hit "still mapped when deleted" > and following BUG_ON on arm64: > page:ffff7e02fa37e480 refcount:3 mapcount:1 mapping:ffff80be3d678ab0 index:0x0 > xfs_address_space_operations [xfs] > flags: 0xbfffe000000037(locked|referenced|uptodate|lru|active) > page dumped because: VM_BUG_ON_PAGE(page_mapped(page)) > ------------[ cut here ]------------ > kernel BUG at mm/filemap.c:171! > Internal error: Oops - BUG: 0 [#1] SMP > CPU: 220 PID: 154292 Comm: mmap1 Not tainted 5.2.0-0ecfebd.cki #1 > Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.10 05/17/2019 > pstate: 40400089 (nZcv daIf +PAN -UAO) > pc : unaccount_page_cache_page+0x17c/0x1a0 > lr : unaccount_page_cache_page+0x17c/0x1a0 > Call trace: > unaccount_page_cache_page+0x17c/0x1a0 > delete_from_page_cache_batch+0xa0/0x300 > truncate_inode_pages_range+0x1b8/0x640 > truncate_inode_pages_final+0x88/0xa8 > evict+0x1a0/0x1d8 > iput+0x150/0x240 > dentry_unlink_inode+0x120/0x130 > __dentry_kill+0xd8/0x1d0 > dentry_kill+0x88/0x248 > dput+0x168/0x1b8 > __fput+0xe8/0x208 > ____fput+0x20/0x30 > task_work_run+0xc0/0xf0 > do_notify_resume+0x2b0/0x328 > work_pending+0x8/0x10 > > The extra mapcount originated from pagefault handler, which handled > pagefault for vma that has already been detached. vma is detached > under mmap_sem write lock by detach_vmas_to_be_unmapped(), which > also invalidates vmacache. > > When pagefault handler (under mmap_sem read lock) called find_vma(), > vmacache_valid() wrongly reported vmacache as valid. > > After rwsem down_read() returns via 'queue empty' path (as of v5.2), > it does so without issuing read_acquire on sem->count: > down_read > __down_read > rwsem_down_read_failed > __rwsem_down_read_failed_common > raw_spin_lock_irq(&sem->wait_lock); > if (list_empty(&sem->wait_list)) { > if (atomic_long_read(&sem->count) >= 0) { > raw_spin_unlock_irq(&sem->wait_lock); > return sem; > > Suspected problem here is that last *_acquire on down_read() side > happens before write side issues *_release: > 1. writer: has the lock > 2. reader: down_read() issues *read_acquire on entry > 3. writer: mm->vmacache_seqnum++; downgrades lock (*fetch_add_release) > 4. reader: __rwsem_down_read_failed_common() finds it can take lock and returns > 5. reader: observes stale mm->vmacache_seqnum > > I can reproduce the problem by running LTP mtest06 in a loop and building > kernel (-j $NCPUS) in parallel. It does reproduce since v4.20 up to v5.2 > on arm64 HPE Apollo 70 (224 CPUs, 256GB RAM, 2 nodes). It triggers reliably > within ~hour. Patched kernel ran fine for 10+ hours with clean dmesg. > Tests were done against v5.2, since commit cf69482d62d9 ("locking/rwsem: > Enable readers spinning on writer") makes it much harder to reproduce. 
> > v2: Move barrier after test (Waiman Long) > Use smp_acquire__after_ctrl_dep() (Peter Zijlstra) > > Related: https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c > Related: commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap") > Fixes: 4b486b535c33 ("locking/rwsem: Exit read lock slowpath if queue empty & no writer") > > Signed-off-by: Jan Stancek <jstancek@redhat.com> > Cc: stable@vger.kernel.org # v4.20+ > Cc: Waiman Long <longman@redhat.com> > Cc: Davidlohr Bueso <dbueso@suse.de> > Cc: Will Deacon <will@kernel.org> > Cc: Peter Zijlstra <peterz@infradead.org> > Cc: Ingo Molnar <mingo@redhat.com> > --- > kernel/locking/rwsem.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c > index 37524a47f002..5ac72b60608b 100644 > --- a/kernel/locking/rwsem.c > +++ b/kernel/locking/rwsem.c > @@ -1032,6 +1032,7 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem, > */ > if (adjustment && !(atomic_long_read(&sem->count) & > (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) { > + smp_acquire__after_ctrl_dep(); > raw_spin_unlock_irq(&sem->wait_lock); > rwsem_set_reader_owned(sem); > lockevent_inc(rwsem_rlock_fast); The corresponding change for 5.2 or earlier kernels is:

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index fbe96341beee..2fbbb2d46396 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -246,6 +246,7 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, in
 	 * been set in the count.
 	 */
 	if (atomic_long_read(&sem->count) >= 0) {
+		smp_acquire__after_ctrl_dep();
 		raw_spin_unlock_irq(&sem->wait_lock);
 		return sem;
 	}

-Longman ^ permalink raw reply related [flat|nested] 22+ messages in thread
end of thread, other threads:[~2019-07-25 16:00 UTC | newest] Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-07-16 16:04 [PATCH] locking/rwsem: use read_acquire in read_slowpath exit when queue is empty Jan Stancek 2019-07-16 16:53 ` Waiman Long 2019-07-16 18:34 ` Jan Stancek 2019-07-16 18:58 ` Peter Zijlstra 2019-07-16 19:09 ` Waiman Long 2019-07-17 12:02 ` [PATCH v2] locking/rwsem: add acquire barrier to " Jan Stancek 2019-07-17 13:13 ` Will Deacon 2019-07-17 14:19 ` Waiman Long 2019-07-17 19:22 ` Jan Stancek 2019-07-17 19:39 ` Waiman Long 2019-07-18 8:51 ` [PATCH v3] " Jan Stancek 2019-07-25 16:00 ` [tip:locking/core] locking/rwsem: Add missing ACQUIRE " tip-bot for Jan Stancek 2019-07-18 9:26 ` [PATCH v2] locking/rwsem: add acquire barrier " Will Deacon 2019-07-18 10:50 ` Jan Stancek 2019-07-18 11:04 ` Peter Zijlstra 2019-07-18 11:09 ` Peter Zijlstra 2019-07-18 11:36 ` Jan Stancek 2019-07-18 12:12 ` Paul E. McKenney 2019-07-18 10:58 ` Peter Zijlstra 2019-07-18 11:45 ` Will Deacon 2019-07-18 12:23 ` Peter Zijlstra 2019-07-17 15:33 ` Waiman Long