Re: Memory ordering question in the shutdown deferral code

From: Stefano Stabellini <sstabellini@kernel.org>
To: Julien Grall <julien@xen.org>
Cc: "Durrant, Paul" <pdurrant@amazon.co.uk>,
	Jan Beulich <jbeulich@suse.com>,
	 Stefano Stabellini <sstabellini@kernel.org>,
	 "andrew.cooper3@citrix.com" <andrew.cooper3@citrix.com>,
	 George Dunlap <george.dunlap@citrix.com>,
	 "Xia, Hongyan" <hongyxia@amazon.com>,
	 "xen-devel@lists.xenproject.org"
	<xen-devel@lists.xenproject.org>
Subject: Re: Memory ordering question in the shutdown deferral code
Date: Wed, 23 Sep 2020 15:57:21 -0700 (PDT)	[thread overview]
Message-ID: <alpine.DEB.2.21.2009231541410.1495@sstabellini-ThinkPad-T480s> (raw)
In-Reply-To: <ad81f6ac-6127-bea8-a503-d16d3dc175df@xen.org>

On Mon, 21 Sep 2020, Julien Grall wrote:
> On 21/09/2020 13:55, Durrant, Paul wrote:
> > > (+ Xen-devel)
> > > 
> > > Sorry I forgot to CC xen-devel.
> > > 
> > > On 21/09/2020 12:38, Julien Grall wrote:
> > > > Hi all,
> > > > 
> > > > I have started to look at the deferral code (see
> > > > vcpu_start_shutdown_deferral()) because we need it for LiveUpdate and
> > > > Arm will soon use it.
> > > > 
> > > > The current implementation is using an smp_mb() to ensure ordering
> > > > between a write then a read. The code looks roughly (I have slightly
> > > > adapted it to make my question more obvious):
> > > > 
> > > > domain_shutdown()
> > > >       d->is_shutting_down = 1;
> > > >       smp_mb();
> > > >       if ( !vcpu0->defer_shutdown )
> > > >       {
> > > >         vcpu_pause_nosync(v);
> > > >         v->paused_for_shutdown = 1;
> > > >       }
> > > > 
> > > > vcpu_start_shutdown_deferral()
> > > >       vcpu0->defer_shutdown = 1;
> > > >       smp_mb();
> > > >       if ( unlikely(d->is_shutting_down) )
> > > >         vcpu_check_shutdown(v);
> > > > 
> > > >       return vcpu0->defer_shutdown;
> > > > 
> > > > smp_mb() should only guarantee ordering (this may be stronger on some
> > > > arch), so I think there is a race between the two functions.
> > > > 
> > > > It would be possible to pause the vCPU in domain_shutdown() because
> > > > vcpu0->defer_shutdown wasn't yet seen.
> > > > 
> > > > Equally, vcpu_start_shutdown_deferral() may not see d->is_shutting_down
> > > > and therefore Xen may continue to send the I/O. Yet the vCPU will be
> > > > paused so the I/O will never complete.
> > > > 
> > 
> > The barrier enforces global order, right?
> 
> It is not clear to me what you mean by "global ordering". This seems to
> suggest a very expensive synchronization barrier between all the processors.
> 
> From an arch-agnostic PoV, smp_mb() will enforce an ordering between
> loads/stores but it doesn't guarantee *when* they will be observed.
> 
> > So, if domain_shutdown() pauses the vcpu then is_shutting_down must
> > necessarily be visible all cpus.
> 
> That's not the guarantee provided by smp_mb() (see above).

I simplified the code further to help us reason about it:

   thread#1 |  thread#2
            |
1) WRITE A  |  WRITE B
2) BARRIER  |  BARRIER
3) READ B   |  READ A

I think it is (theoretically) possible for thread#1 to be at 1) and
about to do 2), while thread#2 goes ahead and does 1) 2) 3). By the time
thread#1 does 2), thread#2 has already completed the entire sequence.

If thread#2 has already done 2), and thread#1 is about to do 3), then I
think we are guaranteed that thread#1 will see the new value of B. Or
is this the core of the issue we are discussing?

If it works the way I wrote, then it would confirm Paul's view.

For your information I went to check what the Linux memory model has to
say about this. It says:

"smp_mb() guarantees to restore sequential consistency among accesses that use READ_ONCE, WRITE_ONCE(), or stronger. For example, the following Linux-kernel code would forbid non-SC outcomes"

It is interesting that they chose the words "restore sequential
consistency". It would be difficult to come up with an example that has
"sequential consistency" but doesn't work the way described earlier.