* [PATCH 0/2] ipc/sem.c: sem_lock fixes @ 2016-07-13 5:06 Manfred Spraul 2016-07-13 5:06 ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Manfred Spraul 2016-07-13 22:05 ` [PATCH 0/2] ipc/sem.c: sem_lock fixes Andrew Morton 0 siblings, 2 replies; 17+ messages in thread From: Manfred Spraul @ 2016-07-13 5:06 UTC (permalink / raw) To: H. Peter Anvin, Peter Zijlstra, Andrew Morton, Davidlohr Bueso Cc: LKML, Thomas Gleixner, Ingo Molnar, 1vier1, felixh, Manfred Spraul Hi Andrew, Hi Peter, next version of the sem_lock() fixes: The patches are again vs. tip. Patch 1 is ready for merging, Patch 2 is for review. - Patch 1 is the patch as in -next since January. It fixes the race that was found by Felix. - Patch 2 removes the memory barriers that are duplicated by the qspinlock code. - (The hysteresis patch would be patch 3. The risk of regressions can't be ruled out, thus it must wait for benchmarks from real workload tests.) Patches 1 and 2 were originally one patch; I've split them so that review and backporting are simpler. -- Manfred ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race 2016-07-13 5:06 [PATCH 0/2] ipc/sem.c: sem_lock fixes Manfred Spraul @ 2016-07-13 5:06 ` Manfred Spraul 2016-07-13 5:06 ` [PATCH 2/2] ipc/sem.c: Remove duplicated memory barriers Manfred Spraul 2016-07-16 1:27 ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Davidlohr Bueso 2016-07-13 22:05 ` [PATCH 0/2] ipc/sem.c: sem_lock fixes Andrew Morton 1 sibling, 2 replies; 17+ messages in thread From: Manfred Spraul @ 2016-07-13 5:06 UTC (permalink / raw) To: H. Peter Anvin, Peter Zijlstra, Andrew Morton, Davidlohr Bueso Cc: LKML, Thomas Gleixner, Ingo Molnar, 1vier1, felixh, Manfred Spraul, stable Commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") introduced a race: sem_lock has a fast path that allows parallel simple operations. There are two reasons why a simple operation cannot run in parallel: - a non-simple operation is ongoing (sma->sem_perm.lock held) - a complex operation is sleeping (sma->complex_count != 0) As both facts are stored independently, a thread can bypass the current checks by sleeping in the right positions. See below for more details (or kernel bugzilla 105651). The patch fixes that by creating one variable (complex_mode) that tracks both reasons why parallel operations are not possible. The patch also updates stale documentation regarding the locking. With regards to stable kernels: The patch is required for all kernels that include the commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") (3.10?) The alternative is to revert the patch that introduced the race. The patch is safe for backporting, i.e. it makes no assumptions about memory barriers in spin_unlock_wait() or that the acquire within spin_lock() is after setting the lock variable. Background: Here is the race of the current implementation: Thread A: (simple op) - does the first "sma->complex_count == 0" test Thread B: (complex op) - does sem_lock(): This includes an array scan. But the scan can't find Thread A, because Thread A does not own sem->lock yet. - the thread does the operation, increases complex_count, drops sem_lock, sleeps Thread A: - spin_lock(&sem->lock), spin_is_locked(sma->sem_perm.lock) - sleeps before the complex_count test Thread C: (complex op) - does sem_lock (no array scan, complex_count==1) - wakes up Thread B. - decrements complex_count Thread A: - does the complex_count test Bug: Now both thread A and thread C operate on the same array, without any synchronization. Full memory barriers are required to synchronize changes of complex_mode with the lock operations. Fixes: 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") Reported-by: felixh@informatik.uni-bremen.de Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: <stable@vger.kernel.org> --- include/linux/sem.h | 1 + ipc/sem.c | 130 +++++++++++++++++++++++++++++++--------------------- 2 files changed, 79 insertions(+), 52 deletions(-) diff --git a/include/linux/sem.h b/include/linux/sem.h index 976ce3a..d0efd6e 100644 --- a/include/linux/sem.h +++ b/include/linux/sem.h @@ -21,6 +21,7 @@ struct sem_array { struct list_head list_id; /* undo requests on this array */ int sem_nsems; /* no.
of semaphores in array */ int complex_count; /* pending complex operations */ + bool complex_mode; /* no parallel simple ops */ }; #ifdef CONFIG_SYSVIPC diff --git a/ipc/sem.c b/ipc/sem.c index ae72b3c..0da63c8 100644 --- a/ipc/sem.c +++ b/ipc/sem.c @@ -162,14 +162,21 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it); /* * Locking: + * a) global sem_lock() for read/write * sem_undo.id_next, * sem_array.complex_count, - * sem_array.pending{_alter,_cont}, - * sem_array.sem_undo: global sem_lock() for read/write - * sem_undo.proc_next: only "current" is allowed to read/write that field. + * sem_array.complex_mode + * sem_array.pending{_alter,_const}, + * sem_array.sem_undo * + * b) global or semaphore sem_lock() for read/write: * sem_array.sem_base[i].pending_{const,alter}: - * global or semaphore sem_lock() for read/write + * sem_array.complex_mode (for read) + * + * c) special: + * sem_undo_list.list_proc: + * * undo_list->lock for write + * * rcu for read */ #define sc_semmsl sem_ctls[0] @@ -260,28 +267,59 @@ static void sem_rcu_free(struct rcu_head *head) } /* - * Wait until all currently ongoing simple ops have completed. + * Enter the mode suitable for non-simple operations: * Caller must own sem_perm.lock. - * New simple ops cannot start, because simple ops first check - * that sem_perm.lock is free. - * that a) sem_perm.lock is free and b) complex_count is 0. */ -static void sem_wait_array(struct sem_array *sma) +static void complexmode_enter(struct sem_array *sma) { int i; struct sem *sem; - if (sma->complex_count) { - /* The thread that increased sma->complex_count waited on - * all sem->lock locks. Thus we don't need to wait again. - */ + if (sma->complex_mode) { - /* We are already in complex_mode. Nothing to do */ + /* We are already in complex_mode. Nothing to do */ return; } + WRITE_ONCE(sma->complex_mode, true); + + /* We need a full barrier: + * The write to complex_mode must be visible + * before we read the first sem->lock spinlock state. + */ + smp_mb(); for (i = 0; i < sma->sem_nsems; i++) { sem = sma->sem_base + i; spin_unlock_wait(&sem->lock); } + /* + * spin_unlock_wait() is not a memory barrier, it is only a + * control barrier. The code must pair with spin_unlock(&sem->lock), + * thus just the control barrier is insufficient. + * + * smp_rmb() is sufficient, as writes cannot pass the control barrier. + */ + smp_rmb(); } /* + * Try to leave the mode that disallows simple operations: + * Caller must own sem_perm.lock. + */ +static void complexmode_tryleave(struct sem_array *sma) +{ + if (sma->complex_count) { + /* Complex ops are sleeping. + * We must stay in complex mode + */ + return; + } + /* + * Immediately after setting complex_mode to false, + * a simple op can start. Thus: all memory writes + * performed by the current operation must be visible + * before we set complex_mode to false. + */ + smp_store_release(&sma->complex_mode, false); } /* @@ -300,56 +338,40 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops, /* Complex operation - acquire a full lock */ ipc_lock_object(&sma->sem_perm); - /* And wait until all simple ops that are processed - * right now have dropped their locks. - */ - sem_wait_array(sma); + /* Prevent parallel simple ops */ + complexmode_enter(sma); return -1; } /* * Only one semaphore affected - try to optimize locking. - * The rules are: - * - optimized locking is possible if no complex operation - * is either enqueued or processed right now.
- * - The test for enqueued complex ops is simple: - * sma->complex_count != 0 - * - Testing for complex ops that are processed right now is - * a bit more difficult. Complex ops acquire the full lock - * and first wait that the running simple ops have completed. - * (see above) - * Thus: If we own a simple lock and the global lock is free - * and complex_count is now 0, then it will stay 0 and - * thus just locking sem->lock is sufficient. + * Optimized locking is possible if no complex operation + * is either enqueued or processed right now. + * + * Both facts are tracked by complex_mode. */ sem = sma->sem_base + sops->sem_num; - if (sma->complex_count == 0) { + /* + * Initial check for complex_mode. Just an optimization, + * no locking, no memory barrier. + */ + if (!READ_ONCE(sma->complex_mode)) { /* * It appears that no complex operation is around. * Acquire the per-semaphore lock. */ spin_lock(&sem->lock); - /* Then check that the global lock is free */ - if (!spin_is_locked(&sma->sem_perm.lock)) { - /* - * We need a memory barrier with acquire semantics, - * otherwise we can race with another thread that does: - * complex_count++; - * spin_unlock(sem_perm.lock); - */ - smp_acquire__after_ctrl_dep(); + /* + * A full barrier is required: the write of sem->lock + * must be visible before the read is executed + */ + smp_mb(); - /* - * Now repeat the test of complex_count: - * It can't change anymore until we drop sem->lock. - * Thus: if is now 0, then it will stay 0. - */ - if (sma->complex_count == 0) { - /* fast path successful! */ - return sops->sem_num; - } + if (!smp_load_acquire(&sma->complex_mode)) { + /* fast path successful! */ + return sops->sem_num; } spin_unlock(&sem->lock); } @@ -369,7 +391,7 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops, /* Not a false alarm, thus complete the sequence for a * full lock. */ - sem_wait_array(sma); + complexmode_enter(sma); return -1; } } @@ -378,6 +400,7 @@ static inline void sem_unlock(struct sem_array *sma, int locknum) { if (locknum == -1) { unmerge_queues(sma); + complexmode_tryleave(sma); ipc_unlock_object(&sma->sem_perm); } else { struct sem *sem = sma->sem_base + locknum; @@ -529,6 +552,7 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params) } sma->complex_count = 0; + sma->complex_mode = true; /* dropped by sem_unlock below */ INIT_LIST_HEAD(&sma->pending_alter); INIT_LIST_HEAD(&sma->pending_const); INIT_LIST_HEAD(&sma->list_id); @@ -2184,10 +2208,10 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it) /* * The proc interface isn't aware of sem_lock(), it calls * ipc_lock_object() directly (in sysvipc_find_ipc). - * In order to stay compatible with sem_lock(), we must wait until - * all simple semop() calls have left their critical regions. + * In order to stay compatible with sem_lock(), we must + * enter / leave complex_mode. */ - sem_wait_array(sma); + complexmode_enter(sma); sem_otime = get_semotime(sma); @@ -2204,6 +2228,8 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it) sem_otime, sma->sem_ctime); + complexmode_tryleave(sma); + return 0; } #endif -- 2.5.5 ^ permalink raw reply related [flat|nested] 17+ messages in thread
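For readers outside ipc/sem.c: in sem_lock(), an operation is "simple" when the semop() call passes exactly one sembuf, and "complex" otherwise; the fast path that the race above subverts exists so that simple ops on different semaphores of one array can run in parallel. A minimal userspace sketch of the two kinds of operation (illustrative only, not part of the patch):

    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/sem.h>

    int main(void)
    {
        /* An array with two semaphores. */
        int semid = semget(IPC_PRIVATE, 2, IPC_CREAT | 0600);
        if (semid < 0) {
            perror("semget");
            return 1;
        }

        /* Simple op: one sembuf, so sem_lock() can take only the
         * per-semaphore spinlock (the fast path discussed above). */
        struct sembuf simple = { .sem_num = 0, .sem_op = 1, .sem_flg = 0 };
        if (semop(semid, &simple, 1) < 0)
            perror("semop (simple)");

        /* Complex op: two sembufs in one atomic call, so sem_lock()
         * takes sma->sem_perm.lock and, with this patch, enters
         * complex_mode first. */
        struct sembuf pair[2] = {
            { .sem_num = 0, .sem_op = -1, .sem_flg = 0 },
            { .sem_num = 1, .sem_op =  1, .sem_flg = 0 },
        };
        if (semop(semid, pair, 2) < 0)
            perror("semop (complex)");

        semctl(semid, 0, IPC_RMID);
        return 0;
    }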
* [PATCH 2/2] ipc/sem.c: Remove duplicated memory barriers. 2016-07-13 5:06 ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Manfred Spraul @ 2016-07-13 5:06 ` Manfred Spraul 2016-07-13 16:16 ` Davidlohr Bueso 2016-07-16 1:27 ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Davidlohr Bueso 1 sibling, 1 reply; 17+ messages in thread From: Manfred Spraul @ 2016-07-13 5:06 UTC (permalink / raw) To: H. Peter Anvin, Peter Zijlstra, Andrew Morton, Davidlohr Bueso Cc: LKML, Thomas Gleixner, Ingo Molnar, 1vier1, felixh, Manfred Spraul With 2c610022711 (locking/qspinlock: Fix spin_unlock_wait() some more), memory barriers were added into spin_unlock_wait(). Thus another barrier is not required. And as explained in 055ce0fd1b8 (locking/qspinlock: Add comments), spin_lock() provides a barrier so that reads within the critical section cannot happen before the write for the lock is visible. I.e., spin_lock() provides an acquire barrier after the write of the lock variable; this barrier pairs with the smp_mb() in complexmode_enter(). Please review! For x86, the patch is safe. But I don't know enough about all archs that support SMP. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> --- ipc/sem.c | 14 -------------- 1 file changed, 14 deletions(-) diff --git a/ipc/sem.c b/ipc/sem.c index 0da63c8..d7b4212 100644 --- a/ipc/sem.c +++ b/ipc/sem.c @@ -291,14 +291,6 @@ static void complexmode_enter(struct sem_array *sma) sem = sma->sem_base + i; spin_unlock_wait(&sem->lock); } - /* - * spin_unlock_wait() is not a memory barrier, it is only a - * control barrier. The code must pair with spin_unlock(&sem->lock), - * thus just the control barrier is insufficient. - * - * smp_rmb() is sufficient, as writes cannot pass the control barrier. - */ - smp_rmb(); } /* @@ -363,12 +355,6 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops, */ spin_lock(&sem->lock); - /* - * A full barrier is required: the write of sem->lock - * must be visible before the read is executed - */ - smp_mb(); - if (!smp_load_acquire(&sma->complex_mode)) { /* fast path successful! */ return sops->sem_num; -- 2.5.5 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH 2/2] ipc/sem.c: Remove duplicated memory barriers. 2016-07-13 5:06 ` [PATCH 2/2] ipc/sem.c: Remove duplicated memory barriers Manfred Spraul @ 2016-07-13 16:16 ` Davidlohr Bueso 2016-07-13 18:37 ` Manfred Spraul 0 siblings, 1 reply; 17+ messages in thread From: Davidlohr Bueso @ 2016-07-13 16:16 UTC (permalink / raw) To: Manfred Spraul Cc: H. Peter Anvin, Peter Zijlstra, Andrew Morton, LKML, Thomas Gleixner, Ingo Molnar, 1vier1, felixh Manfred, shouldn't this patch be part of patch 1 (as you add the unnecessary barriers there)? IOW, can we have a single patch for all this? Thanks, Davidlohr ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/2] ipc/sem.c: Remove duplicated memory barriers. 2016-07-13 16:16 ` Davidlohr Bueso @ 2016-07-13 18:37 ` Manfred Spraul 0 siblings, 0 replies; 17+ messages in thread From: Manfred Spraul @ 2016-07-13 18:37 UTC (permalink / raw) To: Davidlohr Bueso, Michael Ellerman Cc: H. Peter Anvin, Peter Zijlstra, Andrew Morton, LKML, Thomas Gleixner, Ingo Molnar, 1vier1, felixh Hi Davidlohr, On 07/13/2016 06:16 PM, Davidlohr Bueso wrote: > Manfred, shouldn't this patch be part of patch 1 (as you add the > unnecessary barriers there)? IOW, can we have a single patch for all this? > Two reasons: - patch 1 is safe for backporting, patch 2 is not. - patch 1 is safe on all architectures; for patch 2 I would like to get some review feedback. E.g. I just found/read 51d7d5205d33 ("powerpc: Add smp_mb() to arch_spin_is_locked()"): For powerpc, an smp_mb() was added into spin_is_locked(), more or less because of ipc/sem.c. Patch 1 replaces the spin_is_locked() with smp_load_acquire(). Isn't that the proof that smp_mb() is required? -- Manfred ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race 2016-07-13 5:06 ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Manfred Spraul 2016-07-13 5:06 ` [PATCH 2/2] ipc/sem.c: Remove duplicated memory barriers Manfred Spraul @ 2016-07-16 1:27 ` Davidlohr Bueso 1 sibling, 0 replies; 17+ messages in thread From: Davidlohr Bueso @ 2016-07-16 1:27 UTC (permalink / raw) To: Manfred Spraul Cc: H. Peter Anvin, Peter Zijlstra, Andrew Morton, LKML, Thomas Gleixner, Ingo Molnar, 1vier1, felixh, stable On Wed, 13 Jul 2016, Manfred Spraul wrote: >-static void sem_wait_array(struct sem_array *sma) >+static void complexmode_enter(struct sem_array *sma) > { > int i; > struct sem *sem; > >- if (sma->complex_count) { >- /* The thread that increased sma->complex_count waited on >- * all sem->lock locks. Thus we don't need to wait again. >- */ >+ if (sma->complex_mode) { >+ /* We are already in complex_mode. Nothing to do */ > return; > } >+ WRITE_ONCE(sma->complex_mode, true); So we can actually save those READ/WRITE_ONCE calls for complex_mode as it's a bool and therefore tearing is not an issue. >+ >+ /* We need a full barrier: >+ * The write to complex_mode must be visible >+ * before we read the first sem->lock spinlock state. >+ */ >+ smp_mb(); smp_store_mb()? > /* >@@ -300,56 +338,40 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops, > /* Complex operation - acquire a full lock */ > ipc_lock_object(&sma->sem_perm); > >- /* And wait until all simple ops that are processed >- * right now have dropped their locks. >- */ >- sem_wait_array(sma); >+ /* Prevent parallel simple ops */ >+ complexmode_enter(sma); > return -1; nit and unrelated: we should probably use some better label here than a raw -1 (although I don't see it changing, just for nicer reading), ie: SEM_OBJECT_LOCKED Thanks, Davidlohr ^ permalink raw reply [flat|nested] 17+ messages in thread
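On the smp_store_mb() suggestion above: if I read include/asm-generic/barrier.h correctly, the generic fallback is just the plain store followed by a full barrier, so it would indeed fold the WRITE_ONCE()/smp_mb() pair into one statement (architectures may override it with something cheaper, e.g. an xchg()-based version on x86). A rough sketch:

    /* Generic fallback, roughly: */
    #define smp_store_mb(var, value) \
        do { WRITE_ONCE(var, value); smp_mb(); } while (0)

    /* complexmode_enter() could then begin with: */
    smp_store_mb(sma->complex_mode, true);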
* Re: [PATCH 0/2] ipc/sem.c: sem_lock fixes 2016-07-13 5:06 [PATCH 0/2] ipc/sem.c: sem_lock fixes Manfred Spraul 2016-07-13 5:06 ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Manfred Spraul @ 2016-07-13 22:05 ` Andrew Morton 2016-07-14 16:40 ` Manfred Spraul 1 sibling, 1 reply; 17+ messages in thread From: Andrew Morton @ 2016-07-13 22:05 UTC (permalink / raw) To: Manfred Spraul Cc: H. Peter Anvin, Peter Zijlstra, Davidlohr Bueso, LKML, Thomas Gleixner, Ingo Molnar, 1vier1, felixh On Wed, 13 Jul 2016 07:06:50 +0200 Manfred Spraul <manfred@colorfullife.com> wrote: > Hi Andrew, Hi Peter, > > next version of the sem_lock() fixes: > The patches are again vs. tip. > > Patch 1 is ready for merging, Patch 2 is for review. > > - Patch 1 is the patch as in -next since January. > It fixes the race that was found by Felix. > - Patch 2 removes the memory barriers that are duplicated by the qspinlock > code. > - (The hysteresis patch would be patch 3. The risk of regressions > can't be ruled out, thus it must wait for benchmarks from real > workload tests.) I think you're saying that if these two patches cause performance regressions, we will need ipc-sem-sem_lock-with-hysteresis.patch? Is that even necessary? If your testing shows that ipc-sem-sem_lock-with-hysteresis.patch makes things faster, then in it goes, surely? ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 0/2] ipc/sem.c: sem_lock fixes 2016-07-13 22:05 ` [PATCH 0/2] ipc/sem.c: sem_lock fixes Andrew Morton @ 2016-07-14 16:40 ` Manfred Spraul 0 siblings, 0 replies; 17+ messages in thread From: Manfred Spraul @ 2016-07-14 16:40 UTC (permalink / raw) To: Andrew Morton Cc: H. Peter Anvin, Peter Zijlstra, Davidlohr Bueso, LKML, Thomas Gleixner, Ingo Molnar, 1vier1, felixh Hi Andrew, On 07/14/2016 12:05 AM, Andrew Morton wrote: > On Wed, 13 Jul 2016 07:06:50 +0200 Manfred Spraul <manfred@colorfullife.com> wrote: > >> Hi Andrew, Hi Peter, >> >> next version of the sem_lock() fixes: >> The patches are again vs. tip. >> >> Patch 1 is ready for merging, Patch 2 is for review. >> >> - Patch 1 is the patch as in -next since January. >> It fixes the race that was found by Felix. >> - Patch 2 removes the memory barriers that are duplicated by the qspinlock >> code. >> - (The hysteresis patch would be patch 3. The risk of regressions >> can't be ruled out, thus it must wait for benchmarks from real >> workload tests.) > I think you're saying that if these two patches cause performance > regressions, we will need ipc-sem-sem_lock-with-hysteresis.patch? No, the two patches will not cause any regressions. Commit 6062a8dc051 ("ipc,sem: fine grained locking for semtimedop") was a huge scalability improvement for 99% of the users, but introduced a regression for one workload. No one complained about it, so the workload must be rare. Patch 3 now fixes the regression. But we have lived with the introduced regression for 3 years, so give me (and Davidlohr, and anyone else who wants to help) some more time. Also on my TODO list: The queue merge/unmerge logic, i.e. commit f269f40ad5ae ("ipc/sem.c: always use only one queue for alter operations") might also benefit from some hysteresis. My proposal: - patch 1: merge towards main tree. - patch 2: needs update. The barrier in sem_lock() is required. I'm not yet sure about complexmode_enter(), perhaps with #ifndef CONFIG_QUEUED_SPINLOCKS - patch 3 should wait. -- Manfred ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 0/2] ipc/sem.c: sem_lock fixes @ 2016-06-25 17:37 Manfred Spraul 2016-06-25 17:37 ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Manfred Spraul 0 siblings, 1 reply; 17+ messages in thread From: Manfred Spraul @ 2016-06-25 17:37 UTC (permalink / raw) To: H. Peter Anvin, Peter Zijlstra, Andrew Morton, Davidlohr Bueso Cc: LKML, Thomas Gleixner, Ingo Molnar, 1vier1, felixh, Manfred Spraul Hi Andrew, Hi Peter, next version of the sem_lock() fixes / improvement: The patches are now vs. tip. Patch 1 is ready for merging, patch 2 is new and for discussion. Patch 1 fixes the race that was found by Felix. It also adds smp_mb() to fully synchronize

    WRITE_ONCE(status, 1);
    <<< smp_mb();
    spin_unlock_wait();

vs.

    spin_lock();
    <<< smp_mb();
    READ_ONCE(status);

Patch 2 tries to close a performance regression: sem_wait_array() must perform a spin_unlock_wait() for every semaphore in the array. If the array is large, then this is slow. Two open points: - Do we need it? A factor-20 improvement is nice, but is the test realistic? - COMPLEX_MODE_ENTER=10 is a magic number, without any rationale. (2+sem_nsems/128) would be another option, but also arbitrary. Test: https://github.com/manfred-colorfu/ipcscale

    # ./sem-scalebench -o 4 -f -m 0 -h 4 -i 1 -p 2 -c 8 -t 5 -d 500
    Before: 214732
    After: 4008635

-- Manfred ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race 2016-06-25 17:37 Manfred Spraul @ 2016-06-25 17:37 ` Manfred Spraul 0 siblings, 0 replies; 17+ messages in thread From: Manfred Spraul @ 2016-06-25 17:37 UTC (permalink / raw) To: H. Peter Anvin, Peter Zijlstra, Andrew Morton, Davidlohr Bueso Cc: LKML, Thomas Gleixner, Ingo Molnar, 1vier1, felixh, Manfred Spraul, stable Commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") introduced a race: sem_lock has a fast path that allows parallel simple operations. There are two reasons why a simple operation cannot run in parallel: - a non-simple operation is ongoing (sma->sem_perm.lock held) - a complex operation is sleeping (sma->complex_count != 0) As both facts are stored independently, a thread can bypass the current checks by sleeping in the right positions. See below for more details (or kernel bugzilla 105651). The patch fixes that by creating one variable (complex_mode) that tracks both reasons why parallel operations are not possible. The patch also updates stale documentation regarding the locking. With regards to stable kernels: The patch is required for all kernels that include the commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") (3.10?) The alternative is to revert the patch that introduced the race. Background: Here is the race of the current implementation: Thread A: (simple op) - does the first "sma->complex_count == 0" test Thread B: (complex op) - does sem_lock(): This includes an array scan. But the scan can't find Thread A, because Thread A does not own sem->lock yet. - the thread does the operation, increases complex_count, drops sem_lock, sleeps Thread A: - spin_lock(&sem->lock), spin_is_locked(sma->sem_perm.lock) - sleeps before the complex_count test Thread C: (complex op) - does sem_lock (no array scan, complex_count==1) - wakes up Thread B. - decrements complex_count Thread A: - does the complex_count test Bug: Now both thread A and thread C operate on the same array, without any synchronization. Full memory barriers are required to synchronize changes of complex_mode with the lock operations. Fixes: 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") Reported-by: felixh@informatik.uni-bremen.de Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: <stable@vger.kernel.org> --- include/linux/sem.h | 1 + ipc/sem.c | 122 ++++++++++++++++++++++++++++++---------------------- 2 files changed, 71 insertions(+), 52 deletions(-) diff --git a/include/linux/sem.h b/include/linux/sem.h index 976ce3a..d0efd6e 100644 --- a/include/linux/sem.h +++ b/include/linux/sem.h @@ -21,6 +21,7 @@ struct sem_array { struct list_head list_id; /* undo requests on this array */ int sem_nsems; /* no. of semaphores in array */ int complex_count; /* pending complex operations */ + bool complex_mode; /* no parallel simple ops */ }; #ifdef CONFIG_SYSVIPC diff --git a/ipc/sem.c b/ipc/sem.c index ae72b3c..538f43a 100644 --- a/ipc/sem.c +++ b/ipc/sem.c @@ -162,14 +162,21 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it); /* * Locking: + * a) global sem_lock() for read/write * sem_undo.id_next, * sem_array.complex_count, - * sem_array.pending{_alter,_cont}, - * sem_array.sem_undo: global sem_lock() for read/write - * sem_undo.proc_next: only "current" is allowed to read/write that field.
+ * sem_array.complex_mode + * sem_array.pending{_alter,_const}, + * sem_array.sem_undo * + * b) global or semaphore sem_lock() for read/write: * sem_array.sem_base[i].pending_{const,alter}: - * global or semaphore sem_lock() for read/write + * sem_array.complex_mode (for read) + * + * c) special: + * sem_undo_list.list_proc: + * * undo_list->lock for write + * * rcu for read */ #define sc_semmsl sem_ctls[0] @@ -260,23 +267,25 @@ static void sem_rcu_free(struct rcu_head *head) } /* - * Wait until all currently ongoing simple ops have completed. + * Enter the mode suitable for non-simple operations: * Caller must own sem_perm.lock. - * New simple ops cannot start, because simple ops first check - * that sem_perm.lock is free. - * that a) sem_perm.lock is free and b) complex_count is 0. */ -static void sem_wait_array(struct sem_array *sma) +static void complexmode_enter(struct sem_array *sma) { int i; struct sem *sem; - if (sma->complex_count) { - /* The thread that increased sma->complex_count waited on - * all sem->lock locks. Thus we don't need to wait again. - */ + if (sma->complex_mode) { + /* We are already in complex_mode. Nothing to do */ return; } + WRITE_ONCE(sma->complex_mode, true); + + /* We need a full barrier: + * The write to complex_mode must be visible + * before we read the first sem->lock spinlock state. + */ + smp_mb(); for (i = 0; i < sma->sem_nsems; i++) { sem = sma->sem_base + i; @@ -285,6 +294,27 @@ static void sem_wait_array(struct sem_array *sma) } /* + * Try to leave the mode that disallows simple operations: + * Caller must own sem_perm.lock. + */ +static void complexmode_tryleave(struct sem_array *sma) +{ + if (sma->complex_count) { + /* Complex ops are sleeping. + * We must stay in complex mode + */ + return; + } + /* + * Immediately after setting complex_mode to false, + * a simple op can start. Thus: all memory writes + * performed by the current operation must be visible + * before we set complex_mode to false. + */ + smp_store_release(&sma->complex_mode, false); +} + +/* * If the request contains only one semaphore operation, and there are * no complex transactions pending, lock only the semaphore involved. * Otherwise, lock the entire semaphore array, since we either have @@ -300,56 +330,40 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops, /* Complex operation - acquire a full lock */ ipc_lock_object(&sma->sem_perm); - /* And wait until all simple ops that are processed - * right now have dropped their locks. - */ - sem_wait_array(sma); + /* Prevent parallel simple ops */ + complexmode_enter(sma); return -1; } /* * Only one semaphore affected - try to optimize locking. - * The rules are: - * - optimized locking is possible if no complex operation - * is either enqueued or processed right now. - * - The test for enqueued complex ops is simple: - * sma->complex_count != 0 - * - Testing for complex ops that are processed right now is - * a bit more difficult. Complex ops acquire the full lock - * and first wait that the running simple ops have completed. - * (see above) - * Thus: If we own a simple lock and the global lock is free - * and complex_count is now 0, then it will stay 0 and - * thus just locking sem->lock is sufficient. + * Optimized locking is possible if no complex operation + * is either enqueued or processed right now. + * + * Both facts are tracked by complex_mode. */ sem = sma->sem_base + sops->sem_num; - if (sma->complex_count == 0) { + /* + * Initial check for complex_mode. Just an optimization, + * no locking. 
+ */ + if (!READ_ONCE(sma->complex_mode)) { /* * It appears that no complex operation is around. * Acquire the per-semaphore lock. */ spin_lock(&sem->lock); - /* Then check that the global lock is free */ - if (!spin_is_locked(&sma->sem_perm.lock)) { - /* - * We need a memory barrier with acquire semantics, - * otherwise we can race with another thread that does: - * complex_count++; - * spin_unlock(sem_perm.lock); - */ - smp_acquire__after_ctrl_dep(); + /* + * A full barrier is required: the write of sem->lock + * must be visible before the read is executed + */ + smp_mb(); - /* - * Now repeat the test of complex_count: - * It can't change anymore until we drop sem->lock. - * Thus: if is now 0, then it will stay 0. - */ - if (sma->complex_count == 0) { - /* fast path successful! */ - return sops->sem_num; - } + if (!READ_ONCE(sma->complex_mode)) { + /* fast path successful! */ + return sops->sem_num; } spin_unlock(&sem->lock); } @@ -369,7 +383,7 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops, /* Not a false alarm, thus complete the sequence for a * full lock. */ - sem_wait_array(sma); + complexmode_enter(sma); return -1; } } @@ -378,6 +392,7 @@ static inline void sem_unlock(struct sem_array *sma, int locknum) { if (locknum == -1) { unmerge_queues(sma); + complexmode_tryleave(sma); ipc_unlock_object(&sma->sem_perm); } else { struct sem *sem = sma->sem_base + locknum; @@ -529,6 +544,7 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params) } sma->complex_count = 0; + sma->complex_mode = true; /* dropped by sem_unlock below */ INIT_LIST_HEAD(&sma->pending_alter); INIT_LIST_HEAD(&sma->pending_const); INIT_LIST_HEAD(&sma->list_id); @@ -2184,10 +2200,10 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it) /* * The proc interface isn't aware of sem_lock(), it calls * ipc_lock_object() directly (in sysvipc_find_ipc). - * In order to stay compatible with sem_lock(), we must wait until - * all simple semop() calls have left their critical regions. + * In order to stay compatible with sem_lock(), we must + * enter / leave complex_mode. */ - sem_wait_array(sma); + complexmode_enter(sma); sem_otime = get_semotime(sma); @@ -2204,6 +2220,8 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it) sem_otime, sma->sem_ctime); + complexmode_tryleave(sma); + return 0; } #endif -- 2.5.5 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* linux-next: manual merge of the akpm-current tree with the tip tree @ 2016-06-15 5:23 Stephen Rothwell 2016-06-18 20:02 ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Manfred Spraul 0 siblings, 1 reply; 17+ messages in thread From: Stephen Rothwell @ 2016-06-15 5:23 UTC (permalink / raw) To: Andrew Morton, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Peter Zijlstra Cc: linux-next, linux-kernel, Manfred Spraul Hi Andrew, Today's linux-next merge of the akpm-current tree got a conflict in: ipc/sem.c between commit: 33ac279677dc ("locking/barriers: Introduce smp_acquire__after_ctrl_dep()") from the tip tree and commit: a1c58ea067cb ("ipc/sem.c: Fix complex_count vs. simple op race") from the akpm-current tree. I fixed it up (see below) and can carry the fix as necessary. This is now fixed as far as linux-next is concerned, but any non trivial conflicts should be mentioned to your upstream maintainer when your tree is submitted for merging. You may also want to consider cooperating with the maintainer of the conflicting tree to minimise any particularly complex conflicts. -- Cheers, Stephen Rothwell diff --cc ipc/sem.c index ae72b3cddc8d,11d9e605a619..000000000000 --- a/ipc/sem.c +++ b/ipc/sem.c @@@ -260,13 -267,20 +267,10 @@@ static void sem_rcu_free(struct rcu_hea } /* - * Wait until all currently ongoing simple ops have completed. - * spin_unlock_wait() and !spin_is_locked() are not memory barriers, they - * are only control barriers. - * The code must pair with spin_unlock(&sem->lock) or - * spin_unlock(&sem_perm.lock), thus just the control barrier is insufficient. - * - * smp_rmb() is sufficient, as writes cannot pass the control barrier. - */ -#define ipc_smp_acquire__after_spin_is_unlocked() smp_rmb() - -/* + * Enter the mode suitable for non-simple operations: * Caller must own sem_perm.lock. - * New simple ops cannot start, because simple ops first check - * that sem_perm.lock is free. - * that a) sem_perm.lock is free and b) complex_count is 0. */ - static void sem_wait_array(struct sem_array *sma) + static void complexmode_enter(struct sem_array *sma) { int i; struct sem *sem; ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race 2016-06-15 5:23 linux-next: manual merge of the akpm-current tree with the tip tree Stephen Rothwell @ 2016-06-18 20:02 ` Manfred Spraul 2016-06-20 23:04 ` Andrew Morton 2016-06-21 0:30 ` Davidlohr Bueso 0 siblings, 2 replies; 17+ messages in thread From: Manfred Spraul @ 2016-06-18 20:02 UTC (permalink / raw) To: Stephen Rothwell, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Peter Zijlstra, Andrew Morton Cc: LKML, linux-next, 1vier1, Davidlohr Bueso, felixh, Manfred Spraul, stable Commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") introduced a race: sem_lock has a fast path that allows parallel simple operations. There are two reasons why a simple operation cannot run in parallel: - a non-simple operation is ongoing (sma->sem_perm.lock held) - a complex operation is sleeping (sma->complex_count != 0) As both facts are stored independently, a thread can bypass the current checks by sleeping in the right positions. See below for more details (or kernel bugzilla 105651). The patch fixes that by creating one variable (complex_mode) that tracks both reasons why parallel operations are not possible. The patch also updates stale documentation regarding the locking. With regards to stable kernels: The patch is required for all kernels that include the commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") (3.10?) The alternative is to revert the patch that introduced the race. Background: Here is the race of the current implementation: Thread A: (simple op) - does the first "sma->complex_count == 0" test Thread B: (complex op) - does sem_lock(): This includes an array scan. But the scan can't find Thread A, because Thread A does not own sem->lock yet. - the thread does the operation, increases complex_count, drops sem_lock, sleeps Thread A: - spin_lock(&sem->lock), spin_is_locked(sma->sem_perm.lock) - sleeps before the complex_count test Thread C: (complex op) - does sem_lock (no array scan, complex_count==1) - wakes up Thread B. - decrements complex_count Thread A: - does the complex_count test Bug: Now both thread A and thread C operate on the same array, without any synchronization. Fixes: 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") Reported-by: felixh@informatik.uni-bremen.de Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: <stable@vger.kernel.org> --- diff --git a/include/linux/sem.h b/include/linux/sem.h index 976ce3a..d0efd6e 100644 --- a/include/linux/sem.h +++ b/include/linux/sem.h @@ -21,6 +21,7 @@ struct sem_array { struct list_head list_id; /* undo requests on this array */ int sem_nsems; /* no. of semaphores in array */ int complex_count; /* pending complex operations */ + bool complex_mode; /* no parallel simple ops */ }; #ifdef CONFIG_SYSVIPC diff --git a/ipc/sem.c b/ipc/sem.c index ae72b3c..db2e6fc 100644 --- a/ipc/sem.c +++ b/ipc/sem.c @@ -162,14 +162,21 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it); /* * Locking: + * a) global sem_lock() for read/write * sem_undo.id_next, * sem_array.complex_count, - * sem_array.pending{_alter,_cont}, - * sem_array.sem_undo: global sem_lock() for read/write - * sem_undo.proc_next: only "current" is allowed to read/write that field.
+ * sem_array.complex_mode + * sem_array.pending{_alter,_const}, + * sem_array.sem_undo * + * b) global or semaphore sem_lock() for read/write: * sem_array.sem_base[i].pending_{const,alter}: - * global or semaphore sem_lock() for read/write + * sem_array.complex_mode (for read) + * + * c) special: + * sem_undo_list.list_proc: + * * undo_list->lock for write + * * rcu for read */ #define sc_semmsl sem_ctls[0] @@ -260,23 +267,25 @@ static void sem_rcu_free(struct rcu_head *head) } /* - * Wait until all currently ongoing simple ops have completed. + * Enter the mode suitable for non-simple operations: * Caller must own sem_perm.lock. - * New simple ops cannot start, because simple ops first check - * that sem_perm.lock is free. - * that a) sem_perm.lock is free and b) complex_count is 0. */ -static void sem_wait_array(struct sem_array *sma) +static void complexmode_enter(struct sem_array *sma) { int i; struct sem *sem; - if (sma->complex_count) { - /* The thread that increased sma->complex_count waited on - * all sem->lock locks. Thus we don't need to wait again. - */ + if (sma->complex_mode) { + /* We are already in complex_mode. Nothing to do */ return; } + WRITE_ONCE(sma->complex_mode, true); + + /* We need a full barrier: + * The write to complex_mode must be visible + * before we read the first sem->lock spinlock state. + */ + smp_mb(); for (i = 0; i < sma->sem_nsems; i++) { sem = sma->sem_base + i; @@ -285,6 +294,29 @@ static void sem_wait_array(struct sem_array *sma) } /* + * Try to leave the mode that disallows simple operations: + * Caller must own sem_perm.lock. + */ +static void complexmode_tryleave(struct sem_array *sma) +{ + if (sma->complex_count) { + /* Complex ops are sleeping. + * We must stay in complex mode + */ + return; + } + /* + * Immediately after setting complex_mode to false, + * a simple op can start. Thus: all memory writes + * performed by the current operation must be visible + * before we set complex_mode to false. + */ + smp_wmb(); + + WRITE_ONCE(sma->complex_mode, false); +} + +/* * If the request contains only one semaphore operation, and there are * no complex transactions pending, lock only the semaphore involved. * Otherwise, lock the entire semaphore array, since we either have @@ -300,56 +332,38 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops, /* Complex operation - acquire a full lock */ ipc_lock_object(&sma->sem_perm); - /* And wait until all simple ops that are processed - * right now have dropped their locks. - */ - sem_wait_array(sma); + /* Prevent parallel simple ops */ + complexmode_enter(sma); return -1; } /* * Only one semaphore affected - try to optimize locking. - * The rules are: - * - optimized locking is possible if no complex operation - * is either enqueued or processed right now. - * - The test for enqueued complex ops is simple: - * sma->complex_count != 0 - * - Testing for complex ops that are processed right now is - * a bit more difficult. Complex ops acquire the full lock - * and first wait that the running simple ops have completed. - * (see above) - * Thus: If we own a simple lock and the global lock is free - * and complex_count is now 0, then it will stay 0 and - * thus just locking sem->lock is sufficient. + * Optimized locking is possible if no complex operation + * is either enqueued or processed right now. + * + * Both facts are tracked by complex_mode. */ sem = sma->sem_base + sops->sem_num; - if (sma->complex_count == 0) { + /* + * Initial check for complex_mode. 
Just an optimization, + * no locking. + */ + if (!READ_ONCE(sma->complex_mode)) { /* * It appears that no complex operation is around. * Acquire the per-semaphore lock. */ spin_lock(&sem->lock); - /* Then check that the global lock is free */ - if (!spin_is_locked(&sma->sem_perm.lock)) { - /* - * We need a memory barrier with acquire semantics, - * otherwise we can race with another thread that does: - * complex_count++; - * spin_unlock(sem_perm.lock); - */ - smp_acquire__after_ctrl_dep(); - - /* - * Now repeat the test of complex_count: - * It can't change anymore until we drop sem->lock. - * Thus: if is now 0, then it will stay 0. - */ - if (sma->complex_count == 0) { - /* fast path successful! */ - return sops->sem_num; - } + /* Now repeat the test for complex_mode. + * A memory barrier is provided by the spin_lock() + * above. + */ + if (!READ_ONCE(sma->complex_mode)) { + /* fast path successful! */ + return sops->sem_num; } spin_unlock(&sem->lock); } @@ -369,7 +383,7 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops, /* Not a false alarm, thus complete the sequence for a * full lock. */ - sem_wait_array(sma); + complexmode_enter(sma); return -1; } } @@ -378,6 +392,7 @@ static inline void sem_unlock(struct sem_array *sma, int locknum) { if (locknum == -1) { unmerge_queues(sma); + complexmode_tryleave(sma); ipc_unlock_object(&sma->sem_perm); } else { struct sem *sem = sma->sem_base + locknum; @@ -529,6 +544,7 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params) } sma->complex_count = 0; + sma->complex_mode = true; /* dropped by sem_unlock below */ INIT_LIST_HEAD(&sma->pending_alter); INIT_LIST_HEAD(&sma->pending_const); INIT_LIST_HEAD(&sma->list_id); @@ -2184,10 +2200,10 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it) /* * The proc interface isn't aware of sem_lock(), it calls * ipc_lock_object() directly (in sysvipc_find_ipc). - * In order to stay compatible with sem_lock(), we must wait until - * all simple semop() calls have left their critical regions. + * In order to stay compatible with sem_lock(), we must + * enter / leave complex_mode. */ - sem_wait_array(sma); + complexmode_enter(sma); sem_otime = get_semotime(sma); @@ -2204,6 +2220,8 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it) sem_otime, sma->sem_ctime); + complexmode_tryleave(sma); + return 0; } #endif ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race 2016-06-18 20:02 ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Manfred Spraul @ 2016-06-20 23:04 ` Andrew Morton 2016-06-23 18:55 ` Manfred Spraul 2016-06-21 0:30 ` Davidlohr Bueso 1 sibling, 1 reply; 17+ messages in thread From: Andrew Morton @ 2016-06-20 23:04 UTC (permalink / raw) To: Manfred Spraul Cc: Stephen Rothwell, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Peter Zijlstra, LKML, linux-next, 1vier1, Davidlohr Bueso, felixh, stable On Sat, 18 Jun 2016 22:02:21 +0200 Manfred Spraul <manfred@colorfullife.com> wrote: > Commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") introduced a race: > > sem_lock has a fast path that allows parallel simple operations. > There are two reasons why a simple operation cannot run in parallel: > - a non-simple operation is ongoing (sma->sem_perm.lock held) > - a complex operation is sleeping (sma->complex_count != 0) > > As both facts are stored independently, a thread can bypass the current > checks by sleeping in the right positions. See below for more details > (or kernel bugzilla 105651). > > The patch fixes that by creating one variable (complex_mode) > that tracks both reasons why parallel operations are not possible. > > The patch also updates stale documentation regarding the locking. > > With regards to stable kernels: > The patch is required for all kernels that include the commit 6d07b68ce16a > ("ipc/sem.c: optimize sem_lock()") (3.10?) I've had this in -mm (and -next) since January 4, without issues. I put it on hold because Davidlohr expressed concern about performance regressions. Your [2/2] should prevent those regressions (yes?) so I assume that any kernel which has [1/2] really should have [2/2] as well. But without any quantitative information, this is all mad guesswork. What to do? (The [2/2] changelog should explain that it is the cure to [1/2]'s regressions, btw). ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race 2016-06-20 23:04 ` Andrew Morton @ 2016-06-23 18:55 ` Manfred Spraul 0 siblings, 0 replies; 17+ messages in thread From: Manfred Spraul @ 2016-06-23 18:55 UTC (permalink / raw) To: Andrew Morton Cc: Stephen Rothwell, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Peter Zijlstra, LKML, linux-next, 1vier1, Davidlohr Bueso, felixh, stable On 06/21/2016 01:04 AM, Andrew Morton wrote: > On Sat, 18 Jun 2016 22:02:21 +0200 Manfred Spraul <manfred@colorfullife.com> wrote: > >> Commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") introduced a race: >> >> sem_lock has a fast path that allows parallel simple operations. >> There are two reasons why a simple operation cannot run in parallel: >> - a non-simple operation is ongoing (sma->sem_perm.lock held) >> - a complex operation is sleeping (sma->complex_count != 0) >> >> As both facts are stored independently, a thread can bypass the current >> checks by sleeping in the right positions. See below for more details >> (or kernel bugzilla 105651). >> >> The patch fixes that by creating one variable (complex_mode) >> that tracks both reasons why parallel operations are not possible. >> >> The patch also updates stale documentation regarding the locking. >> >> With regards to stable kernels: >> The patch is required for all kernels that include the commit 6d07b68ce16a >> ("ipc/sem.c: optimize sem_lock()") (3.10?) > I've had this in -mm (and -next) since January 4, without issues. I > put it on hold because Davidlohr expressed concern about performance > regressions. I had several ideas how to fix it. The initial ideas probably had performance issues. The current one doesn't have any issues. It just took longer than expected to test it. > Your [2/2] should prevent those regressions (yes?) so I assume that any > kernel which has [1/2] really should have [2/2] as well. But without > any quantitative information, this is all mad guesswork. > > What to do? [2/2] is an improvement; it handles one case better than the current code. Put another way: 3.10 improved scalability, but it introduced a performance regression for one use case. [With 3.10, simple ops became parallel, but complex ops had to perform a "for_each_sem() { spin_unlock_wait() }".] The patch fixes that. -- Manfred ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race 2016-06-18 20:02 ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Manfred Spraul 2016-06-20 23:04 ` Andrew Morton @ 2016-06-21 0:30 ` Davidlohr Bueso 2016-06-23 19:22 ` Manfred Spraul 1 sibling, 1 reply; 17+ messages in thread From: Davidlohr Bueso @ 2016-06-21 0:30 UTC (permalink / raw) To: Manfred Spraul Cc: Stephen Rothwell, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Peter Zijlstra, Andrew Morton, LKML, linux-next, 1vier1, felixh, stable On Sat, 18 Jun 2016, Manfred Spraul wrote: >diff --git a/include/linux/sem.h b/include/linux/sem.h >index 976ce3a..d0efd6e 100644 >--- a/include/linux/sem.h >+++ b/include/linux/sem.h >@@ -21,6 +21,7 @@ struct sem_array { > struct list_head list_id; /* undo requests on this array */ > int sem_nsems; /* no. of semaphores in array */ > int complex_count; /* pending complex operations */ I see this patch also takes care of complex_count needing READ/WRITE_ONCE as you change that to always be done with the big sem_perm lock. >+ bool complex_mode; /* no parallel simple ops */ But I'm wondering about races with this one. Doesn't complex_mode need acquire/release semantics? > }; > [...] > /* >- * Wait until all currently ongoing simple ops have completed. >+ * Enter the mode suitable for non-simple operations: > * Caller must own sem_perm.lock. >- * New simple ops cannot start, because simple ops first check >- * that sem_perm.lock is free. >- * that a) sem_perm.lock is free and b) complex_count is 0. > */ >-static void sem_wait_array(struct sem_array *sma) >+static void complexmode_enter(struct sem_array *sma) > { > int i; > struct sem *sem; > >- if (sma->complex_count) { >- /* The thread that increased sma->complex_count waited on >- * all sem->lock locks. Thus we don't need to wait again. >- */ >+ if (sma->complex_mode) { >+ /* We are already in complex_mode. Nothing to do */ This complex_mode load is serialized because both complexmode_enter() and _tryleave(), which are the main calls that modify the variable, are serialized by sem_perm lock, right? > return; > } Btw, I like that this logic is much simpler, just by reading the comments :) >+ WRITE_ONCE(sma->complex_mode, true); >+ >+ /* We need a full barrier: >+ * The write to complex_mode must be visible >+ * before we read the first sem->lock spinlock state. >+ */ >+ smp_mb(); > > for (i = 0; i < sma->sem_nsems; i++) { > sem = sma->sem_base + i; >@@ -285,6 +294,29 @@ static void sem_wait_array(struct sem_array *sma) > } > > /* >+ * Try to leave the mode that disallows simple operations: >+ * Caller must own sem_perm.lock. >+ */ >+static void complexmode_tryleave(struct sem_array *sma) >+{ >+ if (sma->complex_count) { >+ /* Complex ops are sleeping. >+ * We must stay in complex mode >+ */ >+ return; >+ } >+ /* >+ * Immediately after setting complex_mode to false, >+ * a simple op can start. Thus: all memory writes >+ * performed by the current operation must be visible >+ * before we set complex_mode to false. >+ */ >+ smp_wmb(); >+ >+ WRITE_ONCE(sma->complex_mode, false); smp_store_release()? See below. >+} >+ >+/* > * If the request contains only one semaphore operation, and there are > * no complex transactions pending, lock only the semaphore involved. 
> * Otherwise, lock the entire semaphore array, since we either have >@@ -300,56 +332,38 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops, > /* Complex operation - acquire a full lock */ > ipc_lock_object(&sma->sem_perm); > >- /* And wait until all simple ops that are processed >- * right now have dropped their locks. >- */ >- sem_wait_array(sma); >+ /* Prevent parallel simple ops */ >+ complexmode_enter(sma); > return -1; > } > > /* > * Only one semaphore affected - try to optimize locking. >- * The rules are: >- * - optimized locking is possible if no complex operation >- * is either enqueued or processed right now. >- * - The test for enqueued complex ops is simple: >- * sma->complex_count != 0 >- * - Testing for complex ops that are processed right now is >- * a bit more difficult. Complex ops acquire the full lock >- * and first wait that the running simple ops have completed. >- * (see above) >- * Thus: If we own a simple lock and the global lock is free >- * and complex_count is now 0, then it will stay 0 and >- * thus just locking sem->lock is sufficient. >+ * Optimized locking is possible if no complex operation >+ * is either enqueued or processed right now. >+ * >+ * Both facts are tracked by complex_mode. > */ > sem = sma->sem_base + sops->sem_num; > >- if (sma->complex_count == 0) { >+ /* >+ * Initial check for complex_mode. Just an optimization, >+ * no locking. >+ */ >+ if (!READ_ONCE(sma->complex_mode)) { We have no lock (which is the whole point), I think we need stronger guarantees here to avoid racing with another thread that holds sem_perm lock and is entering complexmode for the first time or vice versa with tryleave(). An smp_load_acquire here would pair with the suggested smp_store_release calls. Thanks, Davidlohr ^ permalink raw reply [flat|nested] 17+ messages in thread
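The pairing Davidlohr suggests is the classic message-passing pattern. A self-contained C11 sketch of the guarantee it buys (all names are invented for the example; "payload" stands for whatever state the complex op wrote under sem_perm.lock):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static int payload;                      /* semaphore state, in the kernel */
    static atomic_bool complex_mode = true;  /* models sma->complex_mode */

    static void *complex_op(void *unused)
    {
        payload = 42;  /* writes done while holding sem_perm.lock */
        /* complexmode_tryleave(): the release guarantees the write above
         * is visible before anyone can observe complex_mode == false. */
        atomic_store_explicit(&complex_mode, false, memory_order_release);
        return NULL;
    }

    static void *simple_op(void *unused)
    {
        /* sem_lock() fast path: if the acquire load observes false,
         * it is also guaranteed to observe payload == 42. */
        if (!atomic_load_explicit(&complex_mode, memory_order_acquire))
            printf("payload = %d\n", payload);  /* prints 42, never 0 */
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, complex_op, NULL);
        pthread_create(&b, NULL, simple_op, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }

Note what acquire/release alone does not give you: it does not order the store to sem->lock against the later load of complex_mode, which is the store-buffering problem the thread turns to next.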
* Re: [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race 2016-06-21 0:30 ` Davidlohr Bueso @ 2016-06-23 19:22 ` Manfred Spraul 2016-06-28 5:27 ` Davidlohr Bueso 0 siblings, 1 reply; 17+ messages in thread From: Manfred Spraul @ 2016-06-23 19:22 UTC (permalink / raw) To: Davidlohr Bueso Cc: Stephen Rothwell, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Peter Zijlstra, Andrew Morton, LKML, linux-next, 1vier1, felixh, stable On 06/21/2016 02:30 AM, Davidlohr Bueso wrote: > On Sat, 18 Jun 2016, Manfred Spraul wrote: > >> diff --git a/include/linux/sem.h b/include/linux/sem.h >> index 976ce3a..d0efd6e 100644 >> --- a/include/linux/sem.h >> +++ b/include/linux/sem.h >> @@ -21,6 +21,7 @@ struct sem_array { >> struct list_head list_id; /* undo requests on this array */ >> int sem_nsems; /* no. of semaphores in array */ >> int complex_count; /* pending complex operations */ > > I see this patch also takes care of complex_count needing READ/WRITE_ONCE > as you change that to always be done with the big sem_perm lock. > >> + bool complex_mode; /* no parallel simple ops */ > > But I'm wondering about races with this one. Doesn't complex_mode need > acquire/release semantics? > Yes. complex_mode needs acquire/release. >> }; >> > > [...] > >> /* >> - * Wait until all currently ongoing simple ops have completed. >> + * Enter the mode suitable for non-simple operations: >> * Caller must own sem_perm.lock. >> - * New simple ops cannot start, because simple ops first check >> - * that sem_perm.lock is free. >> - * that a) sem_perm.lock is free and b) complex_count is 0. >> */ >> -static void sem_wait_array(struct sem_array *sma) >> +static void complexmode_enter(struct sem_array *sma) >> { >> int i; >> struct sem *sem; >> >> - if (sma->complex_count) { >> - /* The thread that increased sma->complex_count waited on >> - * all sem->lock locks. Thus we don't need to wait again. >> - */ >> + if (sma->complex_mode) { >> + /* We are already in complex_mode. Nothing to do */ > > This complex_mode load is serialized because both complexmode_enter() and > _tryleave(), which are the main calls that modify the variable, are > serialized > by sem_perm lock, right? > Exactly. Changes to complex_mode are protected by perm.lock. >> return; >> } > > Btw, I like that this logic is much simpler, just by reading the > comments :) > >> + WRITE_ONCE(sma->complex_mode, true); >> + >> + /* We need a full barrier: >> + * The write to complex_mode must be visible >> + * before we read the first sem->lock spinlock state. >> + */ >> + smp_mb(); >> Theoretically smp_store_acquire(), but this doesn't exist, so smp_mb() >> for (i = 0; i < sma->sem_nsems; i++) { >> sem = sma->sem_base + i; >> @@ -285,6 +294,29 @@ static void sem_wait_array(struct sem_array *sma) >> } >> >> /* >> + * Try to leave the mode that disallows simple operations: >> + * Caller must own sem_perm.lock. >> + */ >> +static void complexmode_tryleave(struct sem_array *sma) >> +{ >> + if (sma->complex_count) { >> + /* Complex ops are sleeping. >> + * We must stay in complex mode >> + */ >> + return; >> + } >> + /* >> + * Immediately after setting complex_mode to false, >> + * a simple op can start. Thus: all memory writes >> + * performed by the current operation must be visible >> + * before we set complex_mode to false. >> + */ >> + smp_wmb(); >> + >> + WRITE_ONCE(sma->complex_mode, false); > > smp_store_release()? See below.
> Yes >> +} >> + >> +/* >> * If the request contains only one semaphore operation, and there are >> * no complex transactions pending, lock only the semaphore involved. >> * Otherwise, lock the entire semaphore array, since we either have >> @@ -300,56 +332,38 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops, >> /* Complex operation - acquire a full lock */ >> ipc_lock_object(&sma->sem_perm); >> >> - /* And wait until all simple ops that are processed >> - * right now have dropped their locks. >> - */ >> - sem_wait_array(sma); >> + /* Prevent parallel simple ops */ >> + complexmode_enter(sma); >> return -1; >> } >> >> /* >> * Only one semaphore affected - try to optimize locking. >> - * The rules are: >> - * - optimized locking is possible if no complex operation >> - * is either enqueued or processed right now. >> - * - The test for enqueued complex ops is simple: >> - * sma->complex_count != 0 >> - * - Testing for complex ops that are processed right now is >> - * a bit more difficult. Complex ops acquire the full lock >> - * and first wait that the running simple ops have completed. >> - * (see above) >> - * Thus: If we own a simple lock and the global lock is free >> - * and complex_count is now 0, then it will stay 0 and >> - * thus just locking sem->lock is sufficient. >> + * Optimized locking is possible if no complex operation >> + * is either enqueued or processed right now. >> + * >> + * Both facts are tracked by complex_mode. >> */ >> sem = sma->sem_base + sops->sem_num; >> >> - if (sma->complex_count == 0) { >> + /* >> + * Initial check for complex_mode. Just an optimization, >> + * no locking. >> + */ >> + if (!READ_ONCE(sma->complex_mode)) { > > We have no lock (which is the whole point), I think we need stronger > guarantees here to avoid racing with another thread that holds sem_perm > lock and is entering complexmode for the first time or vice versa with > tryleave(). An smp_load_acquire here would pair with the suggested > smp_store_release calls. > Yes, you are right. What I'm not sure yet is if smp_load_acquire() is sufficient: Thread A: > if (!READ_ONCE(sma->complex_mode)) { The code is test-and-test; no barrier requirements for the first test > /* > * It appears that no complex operation is around. > * Acquire the per-semaphore lock. > */ > spin_lock(&sem->lock); > > if (!smp_load_acquire(&sma->complex_mode)) { > /* fast path successful! */ > return sops->sem_num; > } > spin_unlock(&sem->lock); > } Thread B: > WRITE_ONCE(sma->complex_mode, true); > > /* We need a full barrier: > * The write to complex_mode must be visible > * before we read the first sem->lock spinlock state. > */ > smp_mb(); > > for (i = 0; i < sma->sem_nsems; i++) { > sem = sma->sem_base + i; > spin_unlock_wait(&sem->lock); > } If thread A is allowed to issue read_spinlock; read complex_mode; write spinlock, then thread B would not notice that thread A is in the critical section. > Thanks, > Davidlohr I'll update the patch. -- Manfred ^ permalink raw reply [flat|nested] 17+ messages in thread
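The scenario Manfred describes is the store-buffering (SB) litmus pattern: each CPU stores to its own flag and then loads the other's. A hedged C11 model (invented names; the kernel-side equivalents are noted in the comments):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int sem_lock_word;  /* models sem->lock: 1 = held */
    static atomic_int complex_mode;   /* models sma->complex_mode */
    static int r_simple, r_complex;

    static void *simple_path(void *unused)   /* thread A */
    {
        /* spin_lock(&sem->lock): a store to the lock word... */
        atomic_store_explicit(&sem_lock_word, 1, memory_order_relaxed);
        /* ...without a full fence here, the load below may be satisfied
         * while the store still sits in the store buffer. */
        atomic_thread_fence(memory_order_seq_cst);  /* the disputed smp_mb() */
        r_simple = atomic_load_explicit(&complex_mode, memory_order_relaxed);
        return NULL;
    }

    static void *complex_path(void *unused)  /* thread B */
    {
        atomic_store_explicit(&complex_mode, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);  /* smp_mb() in complexmode_enter() */
        /* spin_unlock_wait(&sem->lock): a load of the lock word. */
        r_complex = atomic_load_explicit(&sem_lock_word, memory_order_relaxed);
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, simple_path, NULL);
        pthread_create(&b, NULL, complex_path, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* With both fences, r_simple == 0 && r_complex == 0 is forbidden:
         * at least one thread must see the other. Drop either fence and
         * 0/0 becomes a permitted outcome - exactly the race above. */
        printf("r_simple=%d r_complex=%d\n", r_simple, r_complex);
        return 0;
    }

A single run will rarely exhibit the reordering; a litmus tool such as herd7 enumerates the outcomes exhaustively. The question the thread settles below is whether spin_lock()/spin_unlock_wait() already imply the thread-A fence.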
* Re: [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race
  2016-06-23 19:22   ` Manfred Spraul
@ 2016-06-28  5:27     ` Davidlohr Bueso
  2016-06-30 19:28       ` Manfred Spraul
  0 siblings, 1 reply; 17+ messages in thread
From: Davidlohr Bueso @ 2016-06-28  5:27 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Stephen Rothwell, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
      Peter Zijlstra, Andrew Morton, LKML, linux-next, 1vier1, felixh, stable

On Thu, 23 Jun 2016, Manfred Spraul wrote:

> What I'm not sure yet is whether smp_load_acquire() is sufficient:
>
> Thread A:
>> 	if (!READ_ONCE(sma->complex_mode)) {
> The code is test-and-test, so there are no barrier requirements for
> the first test.

Yeah, it would just make us take the big lock unnecessarily, nothing
fatal, and I agree it's probably worth the optimization. It still might
be worth commenting.

>> 		/*
>> 		 * It appears that no complex operation is around.
>> 		 * Acquire the per-semaphore lock.
>> 		 */
>> 		spin_lock(&sem->lock);
>>
>> 		if (!smp_load_acquire(&sma->complex_mode)) {
>> 			/* fast path successful! */
>> 			return sops->sem_num;
>> 		}
>> 		spin_unlock(&sem->lock);
>> 	}
>
> Thread B:
>> 	WRITE_ONCE(sma->complex_mode, true);
>>
>> 	/* We need a full barrier:
>> 	 * The write to complex_mode must be visible
>> 	 * before we read the first sem->lock spinlock state.
>> 	 */
>> 	smp_mb();
>>
>> 	for (i = 0; i < sma->sem_nsems; i++) {
>> 		sem = sma->sem_base + i;
>> 		spin_unlock_wait(&sem->lock);
>> 	}
>
> If thread A is allowed to issue "read spinlock; read complex_mode;
> write spinlock", then thread B would not notice that thread A is in
> the critical section.

Are you referring to the sem->lock word not being visibly locked before
we read complex_mode (for the second time)? This issue was fixed in
2c610022711 (locking/qspinlock: Fix spin_unlock_wait() some more). So
smp_load_acquire should be enough afaict, or are you referring to
something else?

Thanks,
Davidlohr
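For readers without the kernel sources at hand, the two-thread pattern
being debated can be modeled in userspace C11. In the sketch below, all
names are illustrative: a CAS loop stands in for spin_lock(), a polling
loop for spin_unlock_wait(), a seq_cst fence for smp_mb(), and an acquire
load for smp_load_acquire(). Note that a model like this cannot
reproduce the qspinlock-internal ordering question under discussion,
since that lives inside the real spin_lock() implementation.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <pthread.h>

static atomic_bool complex_mode;
static atomic_int lock_word;		/* 0 = free, 1 = held */

static void *simple_op(void *arg)	/* thread A */
{
	int expected = 0;
	(void)arg;

	/* stand-in for spin_lock(&sem->lock) */
	while (!atomic_compare_exchange_weak(&lock_word, &expected, 1))
		expected = 0;

	/* stand-in for smp_load_acquire(&sma->complex_mode) */
	if (!atomic_load_explicit(&complex_mode, memory_order_acquire))
		puts("A: fast path taken");

	/* stand-in for spin_unlock(&sem->lock) */
	atomic_store_explicit(&lock_word, 0, memory_order_release);
	return NULL;
}

static void *complex_op(void *arg)	/* thread B */
{
	(void)arg;
	atomic_store(&complex_mode, true);	   /* ~WRITE_ONCE() */
	atomic_thread_fence(memory_order_seq_cst); /* ~smp_mb() */

	/* stand-in for spin_unlock_wait(&sem->lock) */
	while (atomic_load(&lock_word))
		;
	puts("B: simple ops drained, complex mode entered");
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, simple_op, NULL);
	pthread_create(&b, NULL, complex_op, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}

Build with e.g. "cc -std=c11 -pthread model.c". In this model the bad
outcome (A taking the fast path while B believes all simple ops have
drained) is forbidden by construction; on real hardware it is forbidden
only if spin_lock() publishes the lock word before the critical
section's loads.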
* Re: [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race
  2016-06-28  5:27     ` Davidlohr Bueso
@ 2016-06-30 19:28       ` Manfred Spraul
  2016-07-01 16:52         ` Davidlohr Bueso
  0 siblings, 1 reply; 17+ messages in thread
From: Manfred Spraul @ 2016-06-30 19:28 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Stephen Rothwell, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
      Peter Zijlstra, Andrew Morton, LKML, linux-next, 1vier1, felixh, stable

On 06/28/2016 07:27 AM, Davidlohr Bueso wrote:
> On Thu, 23 Jun 2016, Manfred Spraul wrote:
>
>> What I'm not sure yet is whether smp_load_acquire() is sufficient:
>>
>> Thread A:
>>> 	if (!READ_ONCE(sma->complex_mode)) {
>> The code is test-and-test, so there are no barrier requirements for
>> the first test.
>
> Yeah, it would just make us take the big lock unnecessarily, nothing
> fatal, and I agree it's probably worth the optimization. It still
> might be worth commenting.
>
I'll extend the comment: "no locking and no memory barrier".

>>> 		/*
>>> 		 * It appears that no complex operation is around.
>>> 		 * Acquire the per-semaphore lock.
>>> 		 */
>>> 		spin_lock(&sem->lock);
>>>
>>> 		if (!smp_load_acquire(&sma->complex_mode)) {
>>> 			/* fast path successful! */
>>> 			return sops->sem_num;
>>> 		}
>>> 		spin_unlock(&sem->lock);
>>> 	}
>>
>> Thread B:
>>> 	WRITE_ONCE(sma->complex_mode, true);
>>>
>>> 	/* We need a full barrier:
>>> 	 * The write to complex_mode must be visible
>>> 	 * before we read the first sem->lock spinlock state.
>>> 	 */
>>> 	smp_mb();
>>>
>>> 	for (i = 0; i < sma->sem_nsems; i++) {
>>> 		sem = sma->sem_base + i;
>>> 		spin_unlock_wait(&sem->lock);
>>> 	}
>>
>> If thread A is allowed to issue "read spinlock; read complex_mode;
>> write spinlock", then thread B would not notice that thread A is in
>> the critical section.
>
> Are you referring to the sem->lock word not being visibly locked
> before we read complex_mode (for the second time)? This issue was
> fixed in 2c610022711 (locking/qspinlock: Fix spin_unlock_wait() some
> more). So smp_load_acquire should be enough afaict, or are you
> referring to something else?
>
You are right, I hadn't read this patch fully.
If I understand it right, it means that spin_lock() is both an acquire
and a release - for qspinlocks.

Is this valid for all spinlock implementations, for all architectures?
Otherwise: How can I detect in generic code whether I can rely on a
release inside spin_lock()?

--
    Manfred
* Re: [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race
  2016-06-30 19:28       ` Manfred Spraul
@ 2016-07-01 16:52         ` Davidlohr Bueso
  0 siblings, 0 replies; 17+ messages in thread
From: Davidlohr Bueso @ 2016-07-01 16:52 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Stephen Rothwell, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
      Peter Zijlstra, Andrew Morton, LKML, linux-next, 1vier1, felixh, stable

On Thu, 30 Jun 2016, Manfred Spraul wrote:

> On 06/28/2016 07:27 AM, Davidlohr Bueso wrote:
> If I understand it right, it means that spin_lock() is both an acquire
> and a release - for qspinlocks.

I wouldn't say that: the _Q_LOCKED_VAL stores are still unordered inside
the acquire region, but no further than the unlock store-release, which
is fine for regular mutual exclusion.

> Is this valid for all spinlock implementations, for all architectures?
> Otherwise: How can I detect in generic code whether I can rely on a
> release inside spin_lock()?

With ticket locks this was never an issue, as special lock readers
(spin_unlock_wait(), spin_is_locked()) will always see the lock taken,
as they observe waiters by adding themselves to the tail -- something
the above commit mimics and piggybacks on to save an expensive
smp_rmb(). In addition, see 726328d92a4 (locking/spinlock, arch: Update
and fix spin_unlock_wait() implementations).

Thanks,
Davidlohr
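The ticket-lock property Davidlohr describes can be illustrated with a
toy sketch. Field and function names here are illustrative, not the
actual arch code, and the atomic snapshot of the whole lock word that
real implementations perform is elided: the lock reads as taken as soon
as any contender has bumped the tail, even before that contender owns
the lock.

/* Toy model of a ticket lock, to show why special lock readers always
 * see the lock "taken" once a waiter exists. Sketch only. */
struct ticket_lock {
	unsigned short head;	/* ticket currently being served */
	unsigned short tail;	/* next ticket to hand out */
};

static bool ticket_is_locked(const struct ticket_lock *l)
{
	/* Held or contended whenever a ticket is outstanding: a
	 * contender increments tail before it starts spinning, so it
	 * becomes visible to readers before it enters the critical
	 * section. */
	return l->head != l->tail;
}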
End of thread, other threads: [~2016-07-16  1:27 UTC | newest]

Thread overview: 17+ messages
2016-07-13  5:06 [PATCH 0/2] ipc/sem.c: sem_lock fixes Manfred Spraul
2016-07-13  5:06 ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Manfred Spraul
2016-07-13  5:06   ` [PATCH 2/2] ipc/sem.c: Remove duplicated memory barriers Manfred Spraul
2016-07-13 16:16     ` Davidlohr Bueso
2016-07-13 18:37       ` Manfred Spraul
2016-07-16  1:27   ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Davidlohr Bueso
2016-07-13 22:05 ` [PATCH 0/2] ipc/sem.c: sem_lock fixes Andrew Morton
2016-07-14 16:40   ` Manfred Spraul
  -- strict thread matches above, loose matches on Subject: below --
2016-06-25 17:37 Manfred Spraul
2016-06-25 17:37 ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Manfred Spraul
2016-06-15  5:23 linux-next: manual merge of the akpm-current tree with the tip tree Stephen Rothwell
2016-06-18 20:02 ` [PATCH 1/2] ipc/sem.c: Fix complex_count vs. simple op race Manfred Spraul
2016-06-20 23:04   ` Andrew Morton
2016-06-23 18:55     ` Manfred Spraul
2016-06-21  0:30   ` Davidlohr Bueso
2016-06-23 19:22     ` Manfred Spraul
2016-06-28  5:27       ` Davidlohr Bueso
2016-06-30 19:28         ` Manfred Spraul
2016-07-01 16:52           ` Davidlohr Bueso