Re: smp_mb__after_spinlock requirement too strong?

From: Boqun Feng <boqun.feng@gmail.com>
To: 焦晓冬 <milestonejxd@gmail.com>
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
	Alan Stern <stern@rowland.harvard.edu>,
	will.deacon@arm.com, torvalds@linux-foundation.org,
	npiggin@gmail.com, mingo@kernel.org, mpe@ellerman.id.au,
	oleg@redhat.com, benh@kernel.crashing.org,
	Paul McKenney <paulmck@linux.vnet.ibm.com>
Subject: Re: smp_mb__after_spinlock requirement too strong?
Date: Mon, 12 Mar 2018 16:56:00 +0800	[thread overview]
Message-ID: <20180312085600.aczjkpn73axzs2sb@tardis> (raw)
In-Reply-To: <CAJDTihxxhy7zmhTJ-ky4wvMby_o9y8UOcs9R1dABN7NccekAiQ@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 6641 bytes --]

On Mon, Mar 12, 2018 at 04:18:00PM +0800, 焦晓冬 wrote:
> >> Peter pointed out in this patch https://patchwork.kernel.org/patch/9771921/
> >> that the spinning-lock used at __schedule() should be RCsc to ensure
> >> visibility of writes prior to __schedule when the task is to be migrated to
> >> another CPU.
> >>
> >> And this is emphasized at the comment of the newly introduced
> >> smp_mb__after_spinlock(),
> >>
> >>  * This barrier must provide two things:
> >>  *
> >>  *   - it must guarantee a STORE before the spin_lock() is ordered against a
> >>  *     LOAD after it, see the comments at its two usage sites.
> >>  *
> >>  *   - it must ensure the critical section is RCsc.
> >>  *
> >>  * The latter is important for cases where we observe values written by other
> >>  * CPUs in spin-loops, without barriers, while being subject to scheduling.
> >>  *
> >>  * CPU0         CPU1            CPU2
> >>  *
> >>  *          for (;;) {
> >>  *            if (READ_ONCE(X))
> >>  *              break;
> >>  *          }
> >>  * X=1
> >>  *          <sched-out>
> >>  *                      <sched-in>
> >>  *                      r = X;
> >>  *
> >>  * without transitivity it could be that CPU1 observes X!=0 breaks the loop,
> >>  * we get migrated and CPU2 sees X==0.
> >>
> >> which is used at,
> >>
> >> __schedule(bool preempt) {
> >>     ...
> >>     rq_lock(rq, &rf);
> >>     smp_mb__after_spinlock();
> >>     ...
> >> }
> >> .
> >>
> >> If I didn't miss something, I found this kind of visibility is __not__
> >> necessarily
> >> depends on the spinning-lock at __schedule being RCsc.
> >>
> >> In fact, as for runnable task A, the migration would be,
> >>
> >>  CPU0         CPU1            CPU2
> >>
> >> <ACCESS before schedule out A>
> >>
> >> lock(rq0)
> >> schedule out A
> >> unock(rq0)
> >>
> >>               lock(rq0)
> >>               remove A from rq0
> >>               unlock(rq0)
> >>
> >>               lock(rq2)
> >>               add A into rq2
> >>               unlock(rq2)
> >>                                         lock(rq2)
> >>                                         schedule in A
> >>                                         unlock(rq2)
> >>
> >>                                         <ACCESS after schedule in A>
> >>
> >> <ACCESS before schedule out A> happens-before
> >> unlock(rq0) happends-before
> >> lock(rq0) happends-before
> >> unlock(rq2) happens-before
> >> lock(rq2) happens-before
> >> <ACCESS after schedule in A>
> >>
> >
> > But without RCsc lock, you cannot guarantee that a write propagates to
> > CPU 0 and CPU 2 at the same time, so the same write may propagate to
> > CPU0 before <ACCESS before schedule out A> but propagate to CPU 2 after
> > <ACCESS after scheduler in A>. So..
> >
> > Regards,
> > Boqun
> 
> Thank you for pointing out this case, Boqun.
> But this is just one special case that acquire-release chains promise us.
> 

Ah.. right, because of A-Cumulative.

> A=B=0 as initial
> 
>   CPU0                CPU1                CPU2                CPU3
>  write A=1
>                            read A=1
>                            write B=1
>                            release X
>                                                  acquire X
>                                                  read A=?
>                                                  release Y
> 
>     acquire Y
> 
>     read B=?
> 
> assurance 1: CPU3 will surely see B=1 writing by CPU1, and
> assurance 2: CPU2 will also see A=1 writing by CPU0 as a special case
> 
> The second assurance is both in theory and implemented by real hardware.
> 
> As for theory, the C++11 memory model, which is a potential formal model
> for kernel memory model as
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0124r4.html
> descripes, states that:
> 
> If a value computation A of an atomic object M happens before a value
> computation B of M, and A takes its value from a side effect X on M, then
> the value computed by B shall either be the value stored by X or the value
> stored by a side effect Y on M, where Y follows X in the modification
> order of M.
> 
> at
> $1.10 rule 18, on page 14
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4296.pdf
> 
> As for real hardware, Luc provided detailed test and explanation on
> ARM and POWER in 5.1 Cumulative Barriers for WRC  on page 19
> in this paper:
> 
> A Tutorial Introduction to the ARM and POWER Relaxed Memory Models
> https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
> 
> So, I think we may remove RCsc from smp_mb__after_spinlock which is
> really confusing.
> 

So I think the purpose of smp_mb__after_spinlock() is to provide RCsc
locks, it's just the comments before that may be misleading. We want
RCsc locks in schedule code because we want writes in different critical
section are ordered even outside the critical sections, for case like:

	CPU 0		CPU 1		CPU 2

	{A =0 , B = 0}
	lock(rq0);
	write A=1;
	unlock(rq0);

			lock(rq0);
			read A=1;
			write B=2;
			unlock(rq0);

					read B=2;
					smp_rmb();
					read A=1;

I think we need to fix the comments rather than loose the requirement.
Peter?

Regards,
Boqun

> Best Regards,
> Trol
> 
> >
> >> And for stopped tasks,
> >>
> >>  CPU0         CPU1            CPU2
> >>
> >> <ACCESS before schedule out A>
> >>
> >> lock(rq0)
> >> schedule out A
> >> remove A from rq0
> >> store-release(A->on_cpu)
> >> unock(rq0)
> >>
> >>               load_acquire(A->on_cpu)
> >>               set_task_cpu(A, 2)
> >>
> >>               lock(rq2)
> >>               add A into rq2
> >>               unlock(rq2)
> >>
> >>                                         lock(rq2)
> >>                                         schedule in A
> >>                                         unlock(rq2)
> >>
> >>                                         <ACCESS after schedule in A>
> >>
> >> <ACCESS before schedule out A> happens-before
> >> store-release(A->on_cpu)  happens-before
> >> load_acquire(A->on_cpu)  happens-before
> >> unlock(rq2) happens-before
> >> lock(rq2) happens-before
> >> <ACCESS after schedule in A>
> >>
> >> So, I think the only requirement to smp_mb__after_spinlock is
> >> to guarantee a STORE before the spin_lock() is ordered
> >> against a LOAD after it. So we could remove the RCsc requirement
> >> to allow more efficient implementation.
> >>
> >> Did I miss something or this RCsc requirement does not really matter?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]