* Re: [PATCH V2] MIPS: implement smp_cond_load_acquire() for Loongson-3 [not found] <1531103198-16764-1-git-send-email-chenhc@lemote.com> @ 2018-07-09 16:49 ` Paul Burton 2018-07-10 4:26 ` Huacai Chen 0 siblings, 1 reply; 15+ messages in thread From: Paul Burton @ 2018-07-09 16:49 UTC (permalink / raw) To: Huacai Chen Cc: Ralf Baechle, James Hogan, linux-mips, Fuxin Zhang, Zhangjin Wu, Huacai Chen, stable, Alan Stern, Andrea Parri, Will Deacon, Peter Zijlstra, Boqun Feng, Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget, Paul E. McKenney, Akira Yokosawa, linux-kernel Hi Huacai, On Mon, Jul 09, 2018 at 10:26:38AM +0800, Huacai Chen wrote: > After commit 7f56b58a92aaf2c ("locking/mcs: Use smp_cond_load_acquire() > in MCS spin loop") Loongson-3 fails to boot. This is because Loongson-3 > has SFB (Store Fill Buffer) and the weak-ordering may cause READ_ONCE() > to get an old value in a tight loop. So in smp_cond_load_acquire() we > need a __smp_rmb() before the READ_ONCE() loop. > > This patch introduce a Loongson-specific smp_cond_load_acquire(). And > it should be backported to as early as linux-4.5, in which release the > smp_cond_acquire() is introduced. > > There may be other cases where memory barriers is needed, we will fix > them one by one. 
> > Cc: stable@vger.kernel.org > Signed-off-by: Huacai Chen <chenhc@lemote.com> > --- > arch/mips/include/asm/barrier.h | 18 ++++++++++++++++++ > 1 file changed, 18 insertions(+) > > diff --git a/arch/mips/include/asm/barrier.h b/arch/mips/include/asm/barrier.h > index a5eb1bb..e8c4c63 100644 > --- a/arch/mips/include/asm/barrier.h > +++ b/arch/mips/include/asm/barrier.h > @@ -222,6 +222,24 @@ > #define __smp_mb__before_atomic() __smp_mb__before_llsc() > #define __smp_mb__after_atomic() smp_llsc_mb() > > +#ifdef CONFIG_CPU_LOONGSON3 > +/* Loongson-3 need a __smp_rmb() before READ_ONCE() loop */ > +#define smp_cond_load_acquire(ptr, cond_expr) \ > +({ \ > + typeof(ptr) __PTR = (ptr); \ > + typeof(*ptr) VAL; \ > + __smp_rmb(); \ > + for (;;) { \ > + VAL = READ_ONCE(*__PTR); \ > + if (cond_expr) \ > + break; \ > + cpu_relax(); \ > + } \ > + __smp_rmb(); \ > + VAL; \ > +}) > +#endif /* CONFIG_CPU_LOONGSON3 */ The discussion on v1 of this patch [1] seemed to converge on the view that Loongson suffers from the same problem as ARM platforms which enable the CONFIG_ARM_ERRATA_754327 workaround, and that we might require a similar workaround. Is there a reason you've not done that, and instead tweaked your patch that's specific to the smp_cond_load_acquire() case? I'm not comfortable with fixing just this one case when there could be many similar problematic pieces of code you just haven't hit yet. Please also keep the LKMM maintainers on copy for this - their feedback will be valuable & I'll be much more comfortable applying a workaround for Loongson's behavior here if it's one that they're OK with. Thanks, Paul [1] https://www.linux-mips.org/archives/linux-mips/2018-06/msg00139.html ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH V2] MIPS: implement smp_cond_load_acquire() for Loongson-3 2018-07-09 16:49 ` [PATCH V2] MIPS: implement smp_cond_load_acquire() for Loongson-3 Paul Burton @ 2018-07-10 4:26 ` Huacai Chen 2018-07-10 9:36 ` Peter Zijlstra 0 siblings, 1 reply; 15+ messages in thread From: Huacai Chen @ 2018-07-10 4:26 UTC (permalink / raw) To: Paul Burton Cc: Ralf Baechle, James Hogan, Linux MIPS Mailing List, Fuxin Zhang, Zhangjin Wu, stable, Alan Stern, Andrea Parri, Will Deacon, Peter Zijlstra, Boqun Feng, Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget, Paul E. McKenney, Akira Yokosawa, LKML Hi, Paul and Peter, I think we find the real root cause, READ_ONCE() doesn't need any barriers, the problematic code is queued_spin_lock_slowpath() in kernel/locking/qspinlock.c: if (old & _Q_TAIL_MASK) { prev = decode_tail(old); /* Link @node into the waitqueue. */ WRITE_ONCE(prev->next, node); pv_wait_node(node, prev); arch_mcs_spin_lock_contended(&node->locked); /* * While waiting for the MCS lock, the next pointer may have * been set by another lock waiter. We optimistically load * the next pointer & prefetch the cacheline for writing * to reduce latency in the upcoming MCS unlock operation. */ next = READ_ONCE(node->next); if (next) prefetchw(next); } After WRITE_ONCE(prev->next, node); arch_mcs_spin_lock_contended() enter a READ_ONCE() loop, so the effect of WRITE_ONCE() is invisible by other cores because of the write buffer. As a result, arch_mcs_spin_lock_contended() will wait for ever because the waiters of prev->next will wait for ever. 
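The failure mode above can be reduced to a small userspace sketch (my reduction, not the kernel code: C11 atomics plus POSIX threads stand in for WRITE_ONCE()/READ_ONCE() and the two CPUs, and names such as run_mcs_handover() are invented for illustration). On hardware whose store buffer drains eventually, both loops terminate; per the description above, on Loongson-3 the waiter's store to prev->next can sit in the SFB for ever because its spin loop never generates an access that flushes it:

```c
#include <stdatomic.h>
#include <stddef.h>
#include <pthread.h>

struct mcs_node {
    _Atomic(struct mcs_node *) next;
    _Atomic int locked;
};

static struct mcs_node prev_node, node;

/* CPU1: link in behind prev, then spin waiting to be granted the lock. */
static void *waiter(void *unused)
{
    /* Plays the role of WRITE_ONCE(prev->next, node). */
    atomic_store_explicit(&prev_node.next, &node, memory_order_relaxed);

    /* Plays the role of arch_mcs_spin_lock_contended(&node->locked):
     * a pure read loop that never touches the pending store's line. */
    while (!atomic_load_explicit(&node.locked, memory_order_acquire))
        ; /* cpu_relax() would go here */
    return NULL;
}

/* CPU0 (current owner): wait for the successor link, then pass the lock. */
static void *owner(void *unused)
{
    struct mcs_node *next;

    while (!(next = atomic_load_explicit(&prev_node.next,
                                         memory_order_acquire)))
        ;
    atomic_store_explicit(&next->locked, 1, memory_order_release);
    return NULL;
}

int run_mcs_handover(void)
{
    pthread_t a, b;

    pthread_create(&a, NULL, waiter, NULL);
    pthread_create(&b, NULL, owner, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return atomic_load(&node.locked);
}
```

If the waiter's relaxed store never becomes visible, owner() spins on prev_node.next for ever and neither thread makes progress, which is the hang described above.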
I think the right way to fix this is flush SFB after this WRITE_ONCE(), but I don't have a good solution: 1, MIPS has wbflush() which can be used to flush SFB, but other archs don't have; 2, Every arch has mb(), and add mb() after WRITE_ONCE() can actually solve Loongson's problem, but in syntax, mb() is different from wbflush(); 3, Maybe we can define a Loongson-specific WRITE_ONCE(), but not every WRITE_ONCE() need wbflush(), we only need wbflush() between WRITE_ONCE() and a READ_ONCE() loop. Any ideas? Huacai On Tue, Jul 10, 2018 at 12:49 AM, Paul Burton <paul.burton@mips.com> wrote: > Hi Huacai, > > On Mon, Jul 09, 2018 at 10:26:38AM +0800, Huacai Chen wrote: >> After commit 7f56b58a92aaf2c ("locking/mcs: Use smp_cond_load_acquire() >> in MCS spin loop") Loongson-3 fails to boot. This is because Loongson-3 >> has SFB (Store Fill Buffer) and the weak-ordering may cause READ_ONCE() >> to get an old value in a tight loop. So in smp_cond_load_acquire() we >> need a __smp_rmb() before the READ_ONCE() loop. >> >> This patch introduce a Loongson-specific smp_cond_load_acquire(). And >> it should be backported to as early as linux-4.5, in which release the >> smp_cond_acquire() is introduced. >> >> There may be other cases where memory barriers is needed, we will fix >> them one by one. 
>> >> Cc: stable@vger.kernel.org >> Signed-off-by: Huacai Chen <chenhc@lemote.com> >> --- >> arch/mips/include/asm/barrier.h | 18 ++++++++++++++++++ >> 1 file changed, 18 insertions(+) >> >> diff --git a/arch/mips/include/asm/barrier.h b/arch/mips/include/asm/barrier.h >> index a5eb1bb..e8c4c63 100644 >> --- a/arch/mips/include/asm/barrier.h >> +++ b/arch/mips/include/asm/barrier.h >> @@ -222,6 +222,24 @@ >> #define __smp_mb__before_atomic() __smp_mb__before_llsc() >> #define __smp_mb__after_atomic() smp_llsc_mb() >> >> +#ifdef CONFIG_CPU_LOONGSON3 >> +/* Loongson-3 need a __smp_rmb() before READ_ONCE() loop */ >> +#define smp_cond_load_acquire(ptr, cond_expr) \ >> +({ \ >> + typeof(ptr) __PTR = (ptr); \ >> + typeof(*ptr) VAL; \ >> + __smp_rmb(); \ >> + for (;;) { \ >> + VAL = READ_ONCE(*__PTR); \ >> + if (cond_expr) \ >> + break; \ >> + cpu_relax(); \ >> + } \ >> + __smp_rmb(); \ >> + VAL; \ >> +}) >> +#endif /* CONFIG_CPU_LOONGSON3 */ > > The discussion on v1 of this patch [1] seemed to converge on the view > that Loongson suffers from the same problem as ARM platforms which > enable the CONFIG_ARM_ERRATA_754327 workaround, and that we might > require a similar workaround. > > Is there a reason you've not done that, and instead tweaked your patch > that's specific to the smp_cond_load_acquire() case? I'm not comfortable > with fixing just this one case when there could be many similar > problematic pieces of code you just haven't hit yet. > > Please also keep the LKMM maintainers on copy for this - their feedback > will be valuable & I'll be much more comfortable applying a workaround > for Loongson's behavior here if it's one that they're OK with. > > Thanks, > Paul > > [1] https://www.linux-mips.org/archives/linux-mips/2018-06/msg00139.html ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH V2] MIPS: implement smp_cond_load_acquire() for Loongson-3 2018-07-10 4:26 ` Huacai Chen @ 2018-07-10 9:36 ` Peter Zijlstra 2018-07-10 10:54 ` Peter Zijlstra 0 siblings, 1 reply; 15+ messages in thread From: Peter Zijlstra @ 2018-07-10 9:36 UTC (permalink / raw) To: Huacai Chen Cc: Paul Burton, Ralf Baechle, James Hogan, Linux MIPS Mailing List, Fuxin Zhang, Zhangjin Wu, stable, Alan Stern, Andrea Parri, Will Deacon, Boqun Feng, Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget, Paul E. McKenney, Akira Yokosawa, LKML On Tue, Jul 10, 2018 at 12:26:34PM +0800, Huacai Chen wrote: > Hi, Paul and Peter, > > I think we find the real root cause, READ_ONCE() doesn't need any > barriers, the problematic code is queued_spin_lock_slowpath() in > kernel/locking/qspinlock.c: > > if (old & _Q_TAIL_MASK) { > prev = decode_tail(old); > > /* Link @node into the waitqueue. */ > WRITE_ONCE(prev->next, node); > > pv_wait_node(node, prev); > arch_mcs_spin_lock_contended(&node->locked); > > /* > * While waiting for the MCS lock, the next pointer may have > * been set by another lock waiter. We optimistically load > * the next pointer & prefetch the cacheline for writing > * to reduce latency in the upcoming MCS unlock operation. > */ > next = READ_ONCE(node->next); > if (next) > prefetchw(next); > } > > After WRITE_ONCE(prev->next, node); arch_mcs_spin_lock_contended() > enter a READ_ONCE() loop, so the effect of WRITE_ONCE() is invisible > by other cores because of the write buffer. And _that_ is a hardware bug. Also please explain how that is different from the ARM bug mentioned elsewhere. > As a result, > arch_mcs_spin_lock_contended() will wait for ever because the waiters > of prev->next will wait for ever. I think the right way to fix this is > flush SFB after this WRITE_ONCE(), but I don't have a good solution: > 1, MIPS has wbflush() which can be used to flush SFB, but other archs > don't have; Sane archs don't need this. 
> 2, Every arch has mb(), and add mb() after WRITE_ONCE() can actually > solve Loongson's problem, but in syntax, mb() is different from > wbflush(); Still wrong, because any non-broken arch doesn't need that flush to begin with. > 3, Maybe we can define a Loongson-specific WRITE_ONCE(), but not every > WRITE_ONCE() need wbflush(), we only need wbflush() between > WRITE_ONCE() and a READ_ONCE() loop. No no no no ... So now explain why the cpu_relax() hack that arm did doesn't work for you?
* Re: [PATCH V2] MIPS: implement smp_cond_load_acquire() for Loongson-3 2018-07-10 9:36 ` Peter Zijlstra @ 2018-07-10 10:54 ` Peter Zijlstra 2018-07-10 11:45 ` 陈华才 0 siblings, 1 reply; 15+ messages in thread From: Peter Zijlstra @ 2018-07-10 10:54 UTC (permalink / raw) To: Huacai Chen Cc: Paul Burton, Ralf Baechle, James Hogan, Linux MIPS Mailing List, Fuxin Zhang, Zhangjin Wu, stable, Alan Stern, Andrea Parri, Will Deacon, Boqun Feng, Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget, Paul E. McKenney, Akira Yokosawa, LKML On Tue, Jul 10, 2018 at 11:36:37AM +0200, Peter Zijlstra wrote: > So now explain why the cpu_relax() hack that arm did doesn't work for > you? So below is the patch I think you want; if not explain in detail how this is wrong.

diff --git a/arch/mips/include/asm/processor.h b/arch/mips/include/asm/processor.h
index af34afbc32d9..e59773de6528 100644
--- a/arch/mips/include/asm/processor.h
+++ b/arch/mips/include/asm/processor.h
@@ -386,7 +386,17 @@ unsigned long get_wchan(struct task_struct *p);
 #define KSTK_ESP(tsk) (task_pt_regs(tsk)->regs[29])
 #define KSTK_STATUS(tsk) (task_pt_regs(tsk)->cp0_status)
 
+#ifdef CONFIG_CPU_LOONGSON3
+/*
+ * Loongson-3 has a CPU bug where the store buffer gets starved when stuck in a
+ * read loop. Since spin loops of any kind should have a cpu_relax() in them,
+ * force a store-buffer flush from cpu_relax() such that any pending writes
+ * will become available as expected.
+ */
+#define cpu_relax() smp_mb()
+#else
 #define cpu_relax() barrier()
+#endif
 
 /*
  * Return_address is a replacement for __builtin_return_address(count)
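As a rough userspace analogue of this workaround (hypothetical helper names; the hunk in the mail above is the authoritative form), making the relax step a full fence means every spin iteration pushes the waiter's own pending stores out, which on Loongson-3 is exactly the 'sync' the SFB needs:

```c
#include <stdatomic.h>

/* Analogue of the patched cpu_relax(): a full fence, which on Loongson-3
 * would compile down to 'sync' and drain the store fill buffer. */
static inline void relax_with_flush(void)
{
    atomic_thread_fence(memory_order_seq_cst);
}

/* Spin until *p is nonzero, flushing pending stores on every iteration. */
static inline int spin_wait_acquire(_Atomic int *p)
{
    int v;

    while (!(v = atomic_load_explicit(p, memory_order_acquire)))
        relax_with_flush();
    return v;
}
```

The point of putting the fence in the relax step rather than next to any particular store is that it covers every spin-wait loop in the kernel at once, not just the one smp_cond_load_acquire() case the original patch touched.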
* Re: [PATCH V2] MIPS: implement smp_cond_load_acquire() for Loongson-3 2018-07-10 10:54 ` Peter Zijlstra @ 2018-07-10 11:45 ` 陈华才 2018-07-10 12:17 ` Peter Zijlstra 2018-07-11 9:43 ` Will Deacon 0 siblings, 2 replies; 15+ messages in thread From: 陈华才 @ 2018-07-10 11:45 UTC (permalink / raw) To: Peter Zijlstra Cc: Paul Burton, Ralf Baechle, James Hogan, linux-mips, Fuxin Zhang, wuzhangjin, stable, Alan Stern, Andrea Parri, Will Deacon, Boqun Feng, Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget, Paul E. McKenney, Akira Yokosawa, LKML Hi, Peter, I'm afraid that you have missing something...... Firstly, our previous conclusion (READ_ONCE need a barrier to avoid 'reads prioritised over writes') is totally wrong. So define cpu_relax() to smp_mb() like ARM11MPCore is incorrect, even if it can 'solve' Loongson's problem. Secondly, I think the real problem is like this: 1, CPU0 set the lock to 0, then do something; 2, While CPU0 is doing something, CPU1 set the flag to 1 with WRITE_ONCE(), and then wait the lock become to 1 with a READ_ONCE() loop; 3, After CPU0 complete its work, it wait the flag become to 1, and if so then set the lock to 1; 4, If the lock becomes to 1, CPU1 will leave the READ_ONCE() loop. If without SFB, everything is OK. But with SFB in step 2, a READ_ONCE() loop is right after WRITE_ONCE(), which makes the flag cached in SFB (so be invisible by other CPUs) for ever, then both CPU0 and CPU1 wait for ever. I don't think this is a hardware bug, in design, SFB will flushed to L1 cache in three cases: 1, data in SFB is full (be a complete cache line); 2, there is a subsequent read access in the same cache line; 3, a 'sync' instruction is executed. In this case, there is no other memory access (read or write) between WRITE_ONCE() and READ_ONCE() loop. So Case 1 and Case 2 will not happen, and the only way to make the flag be visible is wbflush (wbflush is sync in Loongson's case). 
I think this problem is not only happens on Loongson, but will happen on other CPUs which have write buffer (unless the write buffer has a 4th case to be flushed). Huacai ------------------ Original ------------------ From: "Peter Zijlstra"<peterz@infradead.org>; Date: Tue, Jul 10, 2018 06:54 PM To: "Huacai Chen"<chenhc@lemote.com>; Cc: "Paul Burton"<paul.burton@mips.com>; "Ralf Baechle"<ralf@linux-mips.org>; "James Hogan"<jhogan@kernel.org>; "linux-mips"<linux-mips@linux-mips.org>; "Fuxin Zhang"<zhangfx@lemote.com>; "wuzhangjin"<wuzhangjin@gmail.com>; "stable"<stable@vger.kernel.org>; "Alan Stern"<stern@rowland.harvard.edu>; "Andrea Parri"<andrea.parri@amarulasolutions.com>; "Will Deacon"<will.deacon@arm.com>; "Boqun Feng"<boqun.feng@gmail.com>; "Nicholas Piggin"<npiggin@gmail.com>; "David Howells"<dhowells@redhat.com>; "Jade Alglave"<j.alglave@ucl.ac.uk>; "Luc Maranget"<luc.maranget@inria.fr>; "Paul E. McKenney"<paulmck@linux.vnet.ibm.com>; "Akira Yokosawa"<akiyks@gmail.com>; "LKML"<linux-kernel@vger.kernel.org>; Subject: Re: [PATCH V2] MIPS: implement smp_cond_load_acquire() for Loongson-3 On Tue, Jul 10, 2018 at 11:36:37AM +0200, Peter Zijlstra wrote: > So now explain why the cpu_relax() hack that arm did doesn't work for > you? So below is the patch I think you want; if not explain in detail how this is wrong. diff --git a/arch/mips/include/asm/processor.h b/arch/mips/include/asm/processor.h index af34afbc32d9..e59773de6528 100644 --- a/arch/mips/include/asm/processor.h +++ b/arch/mips/include/asm/processor.h @@ -386,7 +386,17 @@ unsigned long get_wchan(struct task_struct *p); #define KSTK_ESP(tsk) (task_pt_regs(tsk)->regs[29]) #define KSTK_STATUS(tsk) (task_pt_regs(tsk)->cp0_status) +#ifdef CONFIG_CPU_LOONGSON3 +/* + * Loongson-3 has a CPU bug where the store buffer gets starved when stuck in a + * read loop. 
Since spin loops of any kind should have a cpu_relax() in them, + * force a store-buffer flush from cpu_relax() such that any pending writes + * will become available as expected. + */ +#define cpu_relax() smp_mb() +#else #define cpu_relax() barrier() +#endif /* * Return_address is a replacement for __builtin_return_address(count) ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH V2] MIPS: implement smp_cond_load_acquire() for Loongson-3 2018-07-10 11:45 ` 陈华才 @ 2018-07-10 12:17 ` Peter Zijlstra 2018-07-10 16:14 ` Paul E. McKenney ` (2 more replies) 2018-07-11 9:43 ` Will Deacon 1 sibling, 3 replies; 15+ messages in thread From: Peter Zijlstra @ 2018-07-10 12:17 UTC (permalink / raw) To: 陈华才 Cc: Paul Burton, Ralf Baechle, James Hogan, linux-mips, Fuxin Zhang, wuzhangjin, stable, Alan Stern, Andrea Parri, Will Deacon, Boqun Feng, Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget, Paul E. McKenney, Akira Yokosawa, LKML Please!! Learn to use email. A: Because it messes up the order in which people normally read text. Q: Why is top-posting such a bad thing? A: Top-posting. Q: What is the most annoying thing in e-mail? Also, wrap non-quoted lines to 78 characters. On Tue, Jul 10, 2018 at 07:45:22PM +0800, 陈华才 wrote: > I'm afraid that you have missing something...... > > Firstly, our previous conclusion (READ_ONCE need a barrier to avoid > 'reads prioritised over writes') is totally wrong. So define > cpu_relax() to smp_mb() like ARM11MPCore is incorrect, even if it can > 'solve' Loongson's problem. Secondly, I think the real problem is > like this: > 1, CPU0 set the lock to 0, then do something; > 2, While CPU0 is doing something, CPU1 set the flag to 1 with > WRITE_ONCE(), and then wait the lock become to 1 with a READ_ONCE() > loop; > 3, After CPU0 complete its work, it wait the flag become to 1, and if > so then set the lock to 1; > 4, If the lock becomes to 1, CPU1 will leave the READ_ONCE() loop. > If without SFB, everything is OK. But with SFB in step 2, a > READ_ONCE() loop is right after WRITE_ONCE(), which makes the flag > cached in SFB (so be invisible by other CPUs) for ever, then both CPU0 > and CPU1 wait for ever. Sure.. we all got that far. And no, this isn't the _real_ problem. This is a manifestation of the problem. The problem is that your SFB is broken (per the Linux requirements). 
We require that stores will become visible. That is, they must not indefinitely (for whatever reason) stay in the store buffer. > I don't think this is a hardware bug, in design, SFB will flushed to > L1 cache in three cases: > 1, data in SFB is full (be a complete cache line); > 2, there is a subsequent read access in the same cache line; > 3, a 'sync' instruction is executed. And I think this _is_ a hardware bug. You just designed the bug instead of it being by accident. > In this case, there is no other memory access (read or write) between > WRITE_ONCE() and READ_ONCE() loop. So Case 1 and Case 2 will not > happen, and the only way to make the flag be visible is wbflush > (wbflush is sync in Loongson's case). > > I think this problem is not only happens on Loongson, but will happen > on other CPUs which have write buffer (unless the write buffer has a > 4th case to be flushed). It doesn't happen on _any_ other architecture except that dodgy ARM11MPCore part. Linux hard relies on stores to become available _eventually_. Still, even with the rules above, the best work-around is still the very same cpu_relax() hack.
* Re: [PATCH V2] MIPS: implement smp_cond_load_acquire() for Loongson-3 2018-07-10 12:17 ` Peter Zijlstra @ 2018-07-10 16:14 ` Paul E. McKenney 2018-07-10 17:10 ` Paul Burton 2018-07-11 10:05 ` Jiaxun Yang 2 siblings, 0 replies; 15+ messages in thread From: Paul E. McKenney @ 2018-07-10 16:14 UTC (permalink / raw) To: Peter Zijlstra Cc: 陈华才, Paul Burton, Ralf Baechle, James Hogan, linux-mips, Fuxin Zhang, wuzhangjin, stable, Alan Stern, Andrea Parri, Will Deacon, Boqun Feng, Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget, Akira Yokosawa, LKML On Tue, Jul 10, 2018 at 02:17:27PM +0200, Peter Zijlstra wrote: > > Please!! Learn to use email. > > A: Because it messes up the order in which people normally read text. > Q: Why is top-posting such a bad thing? > A: Top-posting. > Q: What is the most annoying thing in e-mail? > > Also, wrap non-quoted lines to 78 characters. > > On Tue, Jul 10, 2018 at 07:45:22PM +0800, 陈华才 wrote: > > I'm afraid that you have missing something...... > > > > Firstly, our previous conclusion (READ_ONCE need a barrier to avoid > > 'reads prioritised over writes') is totally wrong. So define > > cpu_relax() to smp_mb() like ARM11MPCore is incorrect, even if it can > > 'solve' Loongson's problem. Secondly, I think the real problem is > > like this: > > > 1, CPU0 set the lock to 0, then do something; > > 2, While CPU0 is doing something, CPU1 set the flag to 1 with > > WRITE_ONCE(), and then wait the lock become to 1 with a READ_ONCE() > > loop; > > 3, After CPU0 complete its work, it wait the flag become to 1, and if > > so then set the lock to 1; > > 4, If the lock becomes to 1, CPU1 will leave the READ_ONCE() loop. Are there specific loops in the kernel whose conditions are controlled by READ_ONCE() that don't contain cpu_relax(), smp_mb(), etc.? One way to find them given your description of your hardware is to make cpu_relax() be smp_mb() as Peter suggests, and then run tests to find the problems. Or have you already done this? 
Thanx, Paul > > If without SFB, everything is OK. But with SFB in step 2, a > > READ_ONCE() loop is right after WRITE_ONCE(), which makes the flag > > cached in SFB (so be invisible by other CPUs) for ever, then both CPU0 > > and CPU1 wait for ever. > > Sure.. we all got that far. And no, this isn't the _real_ problem. This > is a manifestation of the problem. > > The problem is that your SFB is broken (per the Linux requirements). We > require that stores will become visible. That is, they must not > indefinitely (for whatever reason) stay in the store buffer. > > > I don't think this is a hardware bug, in design, SFB will flushed to > > L1 cache in three cases: > > > 1, data in SFB is full (be a complete cache line); > > 2, there is a subsequent read access in the same cache line; > > 3, a 'sync' instruction is executed. > > And I think this _is_ a hardware bug. You just designed the bug instead > of it being by accident. > > > In this case, there is no other memory access (read or write) between > > WRITE_ONCE() and READ_ONCE() loop. So Case 1 and Case 2 will not > > happen, and the only way to make the flag be visible is wbflush > > (wbflush is sync in Loongson's case). > > > > I think this problem is not only happens on Loongson, but will happen > > on other CPUs which have write buffer (unless the write buffer has a > > 4th case to be flushed). > > It doesn't happen an _any_ other architecture except that dodgy > ARM11MPCore part. Linux hard relies on stores to become available > _eventually_. > > Still, even with the rules above, the best work-around is still the very > same cpu_relax() hack. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH V2] MIPS: implement smp_cond_load_acquire() for Loongson-3 2018-07-10 12:17 ` Peter Zijlstra 2018-07-10 16:14 ` Paul E. McKenney @ 2018-07-10 17:10 ` Paul Burton 2018-07-11 10:04 ` David Laight 2018-07-11 10:05 ` Jiaxun Yang 2 siblings, 1 reply; 15+ messages in thread From: Paul Burton @ 2018-07-10 17:10 UTC (permalink / raw) To: Peter Zijlstra, 陈华才 Cc: Ralf Baechle, James Hogan, linux-mips, Fuxin Zhang, wuzhangjin, stable, Alan Stern, Andrea Parri, Will Deacon, Boqun Feng, Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget, Paul E. McKenney, Akira Yokosawa, LKML Hello, On Tue, Jul 10, 2018 at 02:17:27PM +0200, Peter Zijlstra wrote: > > 1, CPU0 set the lock to 0, then do something; > > 2, While CPU0 is doing something, CPU1 set the flag to 1 with > > WRITE_ONCE(), and then wait the lock become to 1 with a READ_ONCE() > > loop; > > 3, After CPU0 complete its work, it wait the flag become to 1, and if > > so then set the lock to 1; > > 4, If the lock becomes to 1, CPU1 will leave the READ_ONCE() loop. > > > If without SFB, everything is OK. But with SFB in step 2, a > > READ_ONCE() loop is right after WRITE_ONCE(), which makes the flag > > cached in SFB (so be invisible by other CPUs) for ever, then both CPU0 > > and CPU1 wait for ever. > > Sure.. we all got that far. And no, this isn't the _real_ problem. This > is a manifestation of the problem. > > The problem is that your SFB is broken (per the Linux requirements). We > require that stores will become visible. That is, they must not > indefinitely (for whatever reason) stay in the store buffer. For the record, my understanding is that this doesn't really comply with the MIPS architecture either. It doesn't cover store buffers explicitly but does cover the more general notion of caches. 
From section 2.8.8.2 "Cached Memory Access" of the Introduction to the MIPS64 Architecture document, revision 5.04 [1]: > In a cached access, physical memory and all caches in the system > containing a copy of the physical location are used to resolve the > access. A copy of a location is coherent if the copy was placed in the > cache by a cached coherent access; a copy of a location is noncoherent > if the copy was placed in the cache by a cached noncoherent access. > (Coherency is dictated by the system architecture, not the processor > implementation.) > > Caches containing a coherent copy of the location are examined and/or > modified to keep the contents of the location coherent. It is not > possible to predict whether caches holding a noncoherent copy of the > location will be examined and/or modified during a cached coherent > access. All of the accesses Linux is performing are cached coherent. I'm not sure which is the intent (I can ask if someone's interested), but you could either: 1) Consider the store buffer a cache, in which case loads need to check all store buffers from all CPUs because of the "all caches" part of the first quoted sentence. or 2) Decide store buffers aren't covered by the MIPS architecture documentation at all in which case the only sane thing to do would be to make it transparent to software (and here Loongson's isn't). > > I don't think this is a hardware bug, in design, SFB will flushed to > > L1 cache in three cases: > > > 1, data in SFB is full (be a complete cache line); > > 2, there is a subsequent read access in the same cache line; > > 3, a 'sync' instruction is executed. > > And I think this _is_ a hardware bug. You just designed the bug instead > of it being by accident. Presuming that the data in the SFB isn't visible to other cores, and presuming that case 2 is only covering loads from the same core that contains the SFB, I agree that this doesn't seem like valid behavior. 
Thanks, Paul [1] https://www.mips.com/?do-download=introduction-to-the-mips64-architecture-v5-04
* RE: [PATCH V2] MIPS: implement smp_cond_load_acquire() for Loongson-3 2018-07-10 17:10 ` Paul Burton @ 2018-07-11 10:04 ` David Laight 2018-07-11 10:55 ` Peter Zijlstra 0 siblings, 1 reply; 15+ messages in thread From: David Laight @ 2018-07-11 10:04 UTC (permalink / raw) To: 'Paul Burton', Peter Zijlstra, 陈华才 Cc: Ralf Baechle, James Hogan, linux-mips, Fuxin Zhang, wuzhangjin, stable, Alan Stern, Andrea Parri, Will Deacon, Boqun Feng, Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget, Paul E. McKenney, Akira Yokosawa, LKML From: Paul Burton > Sent: 10 July 2018 18:11 ... > I'm not sure which is the intent (I can ask if someone's interested), > but you could either: > > 1) Consider the store buffer a cache, in which case loads need to > check all store buffers from all CPUs because of the "all caches" > part of the first quoted sentence. > > or > > 2) Decide store buffers aren't covered by the MIPS architecture > documentation at all in which case the only sane thing to do would > be to make it transparent to software (and here Loongson's isn't) ... Store buffers are common and are never transparent to multi-threaded code. They are largely why you need locks. At least on (early) sparc systems they were between the execution unit and the data cache. I also suspect that 'write starvation' is also common - after all the purpose of the store buffer is to do reads in preference to writes in order to reduce the cpu stalls waiting for the memory bus (probably the cpu to cache interface). I think your example is just: *(volatile int *)xxx = 1; while (!*(volatile int *)yyy) continue; running on two cpu with xxx and yyy swapped? You need a stronger bus cycle in there somewhere. David ^ permalink raw reply [flat|nested] 15+ messages in thread
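David's two-liner, written out as a runnable C11 sketch (names and fences are mine): each thread publishes its own flag and then spins on the other's, the classic store-buffering shape. The seq_cst fence after each store plays the role of the "stronger bus cycle"; without some guarantee that pending stores drain, both loops could spin for ever:

```c
#include <stdatomic.h>
#include <pthread.h>

static _Atomic int xxx, yyy;

static void *t0(void *unused)
{
    atomic_store_explicit(&xxx, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst); /* the "stronger bus cycle" */
    while (!atomic_load_explicit(&yyy, memory_order_relaxed))
        ;
    return NULL;
}

static void *t1(void *unused)
{
    atomic_store_explicit(&yyy, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    while (!atomic_load_explicit(&xxx, memory_order_relaxed))
        ;
    return NULL;
}

int run_sb(void)
{
    pthread_t a, b;

    pthread_create(&a, NULL, t0, NULL);
    pthread_create(&b, NULL, t1, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return atomic_load(&xxx) + atomic_load(&yyy);
}
```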
* Re: [PATCH V2] MIPS: implement smp_cond_load_acquire() for Loongson-3 2018-07-11 10:04 ` David Laight @ 2018-07-11 10:55 ` Peter Zijlstra 0 siblings, 0 replies; 15+ messages in thread From: Peter Zijlstra @ 2018-07-11 10:55 UTC (permalink / raw) To: David Laight Cc: 'Paul Burton', 陈华才, Ralf Baechle, James Hogan, linux-mips, Fuxin Zhang, wuzhangjin, stable, Alan Stern, Andrea Parri, Will Deacon, Boqun Feng, Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget, Paul E. McKenney, Akira Yokosawa, LKML On Wed, Jul 11, 2018 at 10:04:52AM +0000, David Laight wrote: > I also suspect that 'write starvation' is also common - after all the > purpose of the store buffer is to do reads in preference to writes in > order to reduce the cpu stalls waiting for the memory bus (probably > the cpu to cache interface). > > I think your example is just: > *(volatile int *)xxx = 1; > while (!*(volatile int *)yyy) continue; > running on two cpu with xxx and yyy swapped? Yep. And Linux has been relying on that working for (afaict) basically forever. > You need a stronger bus cycle in there somewhere. Since all spin-wait loops _should_ have cpu_relax() that is the natural place to put it. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH V2] MIPS: implement smp_cond_load_acquire() for Loongson-3 2018-07-10 12:17 ` Peter Zijlstra 2018-07-10 16:14 ` Paul E. McKenney 2018-07-10 17:10 ` Paul Burton @ 2018-07-11 10:05 ` Jiaxun Yang 2018-07-11 10:21 ` Will Deacon 2 siblings, 1 reply; 15+ messages in thread From: Jiaxun Yang @ 2018-07-11 10:05 UTC (permalink / raw) To: linux-mips Cc: Peter Zijlstra, 陈华才, Paul Burton, Ralf Baechle, James Hogan, Fuxin Zhang, wuzhangjin, stable, Alan Stern, Andrea Parri, Will Deacon, Boqun Feng, Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget, Paul E. McKenney, Akira Yokosawa, LKML On 2018-7-10 Tue at 20:17:27,Peter Zijlstra Wrote: Hi Peter Since Huacai unable to send email via client, I'm going to reply for him > Sure.. we all got that far. And no, this isn't the _real_ problem. This > is a manifestation of the problem. > > The problem is that your SFB is broken (per the Linux requirements). We > require that stores will become visible. That is, they must not > indefinitely (for whatever reason) stay in the store buffer. > > > I don't think this is a hardware bug, in design, SFB will flushed to > > L1 cache in three cases: > > > > 1, data in SFB is full (be a complete cache line); > > 2, there is a subsequent read access in the same cache line; > > 3, a 'sync' instruction is executed. > > And I think this _is_ a hardware bug. You just designed the bug instead > of it being by accident. Yes, we understood that this hardware feature is not supported by LKML, so it should be a hardware bug for LKML. > > It doesn't happen an _any_ other architecture except that dodgy > ARM11MPCore part. Linux hard relies on stores to become available > _eventually_. > > Still, even with the rules above, the best work-around is still the very > same cpu_relax() hack. As you say, SFB makes Loongson not fully SMP-coherent. However, modify cpu_relax can solve the current problem, but not so straight forward. 
On the other hand, providing a Loongson-specific WRITE_ONCE() looks more reasonable, because it eliminates the non-coherency, so we can solve the bug at the root. Thanks. -- Jiaxun Yang
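For concreteness, a sketch of what such a Loongson-specific WRITE_ONCE() might look like (userspace stand-ins only: the seq_cst fence models wbflush()/SYNC, and WRITE_ONCE_FLUSH is an invented name, not a kernel API). As Will's reply points out, this approach would still miss atomic_set(), atomic64_set() and every store that should be using WRITE_ONCE() but isn't:

```c
#include <stdatomic.h>

/* Stand-in for Loongson's wbflush()/SYNC: force pending stores out. */
#define wbflush_stub() atomic_thread_fence(memory_order_seq_cst)

/* Hypothetical flushing store: a volatile write followed by a drain of
 * the store buffer, so the new value becomes visible promptly. */
#define WRITE_ONCE_FLUSH(x, val)                          \
    do {                                                  \
        *(volatile __typeof__(x) *)&(x) = (val);          \
        wbflush_stub();                                   \
    } while (0)
```

The cost of this shape is that every flushed store pays for a full sync, whereas the cpu_relax() approach only pays in spin-wait loops, which is where the problem actually bites.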
* Re: [PATCH V2] MIPS: implement smp_cond_load_acquire() for Loongson-3
  2018-07-11 10:05 ` Jiaxun Yang
@ 2018-07-11 10:21   ` Will Deacon
  2018-07-11 11:09     ` Peter Zijlstra
  0 siblings, 1 reply; 15+ messages in thread
From: Will Deacon @ 2018-07-11 10:21 UTC (permalink / raw)
  To: Jiaxun Yang
  Cc: linux-mips, Peter Zijlstra, 陈华才, Paul Burton, Ralf Baechle,
      James Hogan, Fuxin Zhang, wuzhangjin, stable, Alan Stern,
      Andrea Parri, Boqun Feng, Nicholas Piggin, David Howells,
      Jade Alglave, Luc Maranget, Paul E. McKenney, Akira Yokosawa, LKML

On Wed, Jul 11, 2018 at 06:05:51PM +0800, Jiaxun Yang wrote:
> On 2018-7-10 Tue at 20:17:27, Peter Zijlstra wrote:
>
> Hi Peter,
>
> Since Huacai is unable to send email via his client, I'm going to reply
> for him.
>
> > Sure.. we all got that far. And no, this isn't the _real_ problem. This
> > is a manifestation of the problem.
> >
> > The problem is that your SFB is broken (per the Linux requirements). We
> > require that stores will become visible. That is, they must not
> > indefinitely (for whatever reason) stay in the store buffer.
> >
> > > I don't think this is a hardware bug; by design, the SFB will be
> > > flushed to L1 cache in three cases:
> > >
> > > 1, data in the SFB is full (forms a complete cache line);
> > > 2, there is a subsequent read access in the same cache line;
> > > 3, a 'sync' instruction is executed.
> >
> > And I think this _is_ a hardware bug. You just designed the bug instead
> > of it being by accident.
>
> Yes, we understand that this hardware behaviour is not supported by
> Linux, so it should be treated as a hardware bug as far as Linux is
> concerned.
>
> > It doesn't happen on _any_ other architecture except that dodgy
> > ARM11MPCore part. Linux hard relies on stores to become available
> > _eventually_.
> >
> > Still, even with the rules above, the best work-around is still the very
> > same cpu_relax() hack.
>
> As you say, the SFB makes Loongson not fully SMP-coherent. Modifying
> cpu_relax() can solve the current problem, but it is not so
> straightforward.
> On the other hand, providing a Loongson-specific WRITE_ONCE() looks
> more reasonable, because it eliminates the "non-coherency", so we can
> solve the bug at the root.

Curious, but why is it not straightforward to hack cpu_relax()? If you try
to hack WRITE_ONCE, you also need to hack atomic_set, atomic64_set and all
the places that should be using WRITE_ONCE but aren't ;~)

Will

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [PATCH V2] MIPS: implement smp_cond_load_acquire() for Loongson-3
  2018-07-11 10:21 ` Will Deacon
@ 2018-07-11 11:09   ` Peter Zijlstra
  2018-07-11 11:46     ` David Laight
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2018-07-11 11:09 UTC (permalink / raw)
  To: Will Deacon
  Cc: Jiaxun Yang, linux-mips, 陈华才, Paul Burton, Ralf Baechle,
      James Hogan, Fuxin Zhang, wuzhangjin, stable, Alan Stern,
      Andrea Parri, Boqun Feng, Nicholas Piggin, David Howells,
      Jade Alglave, Luc Maranget, Paul E. McKenney, Akira Yokosawa, LKML

On Wed, Jul 11, 2018 at 11:21:06AM +0100, Will Deacon wrote:
> On Wed, Jul 11, 2018 at 06:05:51PM +0800, Jiaxun Yang wrote:
> > On 2018-7-10 Tue at 20:17:27, Peter Zijlstra wrote:
> > > Still, even with the rules above, the best work-around is still the very
> > > same cpu_relax() hack.
> >
> > As you say, the SFB makes Loongson not fully SMP-coherent. Modifying
> > cpu_relax() can solve the current problem, but it is not so
> > straightforward. On the other hand, providing a Loongson-specific
> > WRITE_ONCE() looks more reasonable, because it eliminates the
> > "non-coherency", so we can solve the bug at the root.
>
> Curious, but why is it not straightforward to hack cpu_relax()? If you try
> to hack WRITE_ONCE, you also need to hack atomic_set, atomic64_set and all
> the places that should be using WRITE_ONCE but aren't ;~)

Right. The problem isn't stores per se; normal progress should contain
enough stores to flush out 'old' bits in the natural order of things. But
the problem is spin-wait loops that inhibit normal progress (and thereby
store-buffer flushing). And all spin-wait loops should have cpu_relax()
in them. So cpu_relax() is the natural place to fix this.

Adding SYNC to WRITE_ONCE()/atomic* will hurt performance lots and will
ultimately not guarantee anything more; and as Will said, it keeps you
chasing dragons where people forgot to use WRITE_ONCE() where they maybe
should've.
^ permalink raw reply [flat|nested] 15+ messages in thread
* RE: [PATCH V2] MIPS: implement smp_cond_load_acquire() for Loongson-3
  2018-07-11 11:09 ` Peter Zijlstra
@ 2018-07-11 11:46   ` David Laight
  0 siblings, 0 replies; 15+ messages in thread
From: David Laight @ 2018-07-11 11:46 UTC (permalink / raw)
  To: 'Peter Zijlstra', Will Deacon
  Cc: Jiaxun Yang, linux-mips, 陈华才, Paul Burton, Ralf Baechle,
      James Hogan, Fuxin Zhang, wuzhangjin, stable, Alan Stern,
      Andrea Parri, Boqun Feng, Nicholas Piggin, David Howells,
      Jade Alglave, Luc Maranget, Paul E. McKenney, Akira Yokosawa, LKML

From: Peter Zijlstra
> Sent: 11 July 2018 12:10
...
> Adding SYNC to WRITE_ONCE()/atomic* will hurt performance lots and will
> ultimately not guarantee anything more; and as Will said, it keeps you
> chasing dragons where people forgot to use WRITE_ONCE() where they maybe
> should've.

Also WRITE_ONCE() is there to solve an entirely different problem.

If you have a function that does:

	<lots of code without any function calls>
	foo->bar = 1;

the compiler is allowed to write other (unrelated) values to foo->bar
in the generated code for <lots of code>.

A long time ago I used a compiler that managed to convert:

	if (foo->bar == 2)
		foo->bar = 3;

into:

	if (foo->bar == 2) {
		foo->bar = 0;
		foo->bar = 3;
	}

When an interrupt read the value 0, a lot of linked lists got screwed up.
WRITE_ONCE() is there to ensure that doesn't happen.

(In my case 'foo->bar' was a 2-bit wide bitfield, and I suspect you can't
use WRITE_ONCE() on bitfields.)

	David

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [PATCH V2] MIPS: implement smp_cond_load_acquire() for Loongson-3
  2018-07-10 11:45 ` 陈华才
  2018-07-10 12:17   ` Peter Zijlstra
@ 2018-07-11  9:43   ` Will Deacon
  1 sibling, 0 replies; 15+ messages in thread
From: Will Deacon @ 2018-07-11 9:43 UTC (permalink / raw)
  To: 陈华才
  Cc: Peter Zijlstra, Paul Burton, Ralf Baechle, James Hogan, linux-mips,
      Fuxin Zhang, wuzhangjin, stable, Alan Stern, Andrea Parri,
      Boqun Feng, Nicholas Piggin, David Howells, Jade Alglave,
      Luc Maranget, Paul E. McKenney, Akira Yokosawa, LKML

Hi Huacai,

On Tue, Jul 10, 2018 at 07:45:22PM +0800, 陈华才 wrote:
> I don't think this is a hardware bug; by design, the SFB will be
> flushed to L1 cache in three cases:
> 1, data in the SFB is full (forms a complete cache line);
> 2, there is a subsequent read access in the same cache line;
> 3, a 'sync' instruction is executed.

I'd expect successful LL/SC, cache maintenance (and potentially TLB)
operations to flush your SFB as well, not that I think that provides a
better workaround than throwing a 'sync' into cpu_relax(). I assume the
SFB is all physically addressed?

Generally, CPU architectures guarantee that store buffers drain "in
finite time", which is a pretty crappy guarantee, but one which tends to
be sufficient in practice and therefore relied upon by software.

Will

^ permalink raw reply	[flat|nested] 15+ messages in thread
end of thread, other threads:[~2018-07-11 11:44 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1531103198-16764-1-git-send-email-chenhc@lemote.com>
2018-07-09 16:49 ` [PATCH V2] MIPS: implement smp_cond_load_acquire() for Loongson-3 Paul Burton
2018-07-10  4:26   ` Huacai Chen
2018-07-10  9:36     ` Peter Zijlstra
2018-07-10 10:54       ` Peter Zijlstra
2018-07-10 11:45         ` 陈华才
2018-07-10 12:17           ` Peter Zijlstra
2018-07-10 16:14             ` Paul E. McKenney
2018-07-10 17:10             ` Paul Burton
2018-07-11 10:04               ` David Laight
2018-07-11 10:55                 ` Peter Zijlstra
2018-07-11 10:05           ` Jiaxun Yang
2018-07-11 10:21             ` Will Deacon
2018-07-11 11:09               ` Peter Zijlstra
2018-07-11 11:46                 ` David Laight
2018-07-11  9:43           ` Will Deacon