xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed
* Re: Memory ordering question in the shutdown deferral code
       [not found] <468576ba-8d3f-98e9-e65e-1128b5220d40@xen.org>
@ 2020-09-21 11:40 ` Julien Grall
  2020-09-21 12:55   ` Durrant, Paul
  2020-09-21 13:11   ` Jan Beulich
  0 siblings, 2 replies; 11+ messages in thread
From: Julien Grall @ 2020-09-21 11:40 UTC (permalink / raw)
  To: Jan Beulich, Stefano Stabellini, andrew.cooper3, George Dunlap,
	Paul Durrant
  Cc: Xia, Hongyan, xen-devel

(+ Xen-devel)

Sorry I forgot to CC xen-devel.

On 21/09/2020 12:38, Julien Grall wrote:
> Hi all,
> 
> I have started to look at the deferral code (see 
> vcpu_start_shutdown_deferral()) because we need it for LiveUpdate and 
> Arm will soon use it.
> 
> The current implementation is using an smp_mb() to ensure ordering 
> between a write then a read. The code looks roughly (I have slightly 
> adapted it to make my question more obvious):
> 
> domain_shutdown()
>      d->is_shutting_down = 1;
>      smp_mb();
>      if ( !vcpu0->defer_shutdown )
>      {
>        vcpu_pause_nosync(v);
>        v->paused_for_shutdown = 1;
>      }
> 
> vcpu_start_shutdown_deferral()
>      vcpu0->defer_shutdown = 1;
>      smp_mb();
>      if ( unlikely(d->is_shutting_down) )
>        vcpu_check_shutdown(v);
> 
>      return vcpu0->defer_shutdown;
> 
> smp_mb() should only guarantee ordering (this may be stronger on some 
> arch), so I think there is a race between the two functions.
> 
> It would be possible to pause the vCPU in domain_shutdown() because 
> vcpu0->defer_shutdown wasn't yet seen.
> 
> Equally, vcpu_start_shutdown_deferral() may not see d->is_shutting_down 
> and therefore Xen may continue to send the I/O. Yet the vCPU will be 
> paused so the I/O will never complete.
> 
> I am not fully familiar with the IOREQ code, but it sounds to me this is 
> not the behavior that was intended. Can someone more familiar with the 
> code confirm it?
> 
> Cheers,
> 

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: Memory ordering question in the shutdown deferral code
  2020-09-21 11:40 ` Memory ordering question in the shutdown deferral code Julien Grall
@ 2020-09-21 12:55   ` Durrant, Paul
  2020-09-21 13:25     ` Xia, Hongyan
  2020-09-21 13:27     ` Julien Grall
  2020-09-21 13:11   ` Jan Beulich
  1 sibling, 2 replies; 11+ messages in thread
From: Durrant, Paul @ 2020-09-21 12:55 UTC (permalink / raw)
  To: Julien Grall, Jan Beulich, Stefano Stabellini, andrew.cooper3,
	George Dunlap
  Cc: Xia, Hongyan, xen-devel

> -----Original Message-----
> From: Julien Grall <julien@xen.org>
> Sent: 21 September 2020 12:41
> To: Jan Beulich <jbeulich@suse.com>; Stefano Stabellini <sstabellini@kernel.org>;
> andrew.cooper3@citrix.com; George Dunlap <george.dunlap@citrix.com>; Durrant, Paul
> <pdurrant@amazon.co.uk>
> Cc: Xia, Hongyan <hongyxia@amazon.com>; xen-devel@lists.xenproject.org
> Subject: RE: [EXTERNAL] Memory ordering question in the shutdown deferral code
> 
> CAUTION: This email originated from outside of the organization. Do not click links or open
> attachments unless you can confirm the sender and know the content is safe.
> 
> 
> 
> (+ Xen-devel)
> 
> Sorry I forgot to CC xen-devel.
> 
> On 21/09/2020 12:38, Julien Grall wrote:
> > Hi all,
> >
> > I have started to look at the deferral code (see
> > vcpu_start_shutdown_deferral()) because we need it for LiveUpdate and
> > Arm will soon use it.
> >
> > The current implementation is using an smp_mb() to ensure ordering
> > between a write then a read. The code looks roughly (I have slightly
> > adapted it to make my question more obvious):
> >
> > domain_shutdown()
> >      d->is_shutting_down = 1;
> >      smp_mb();
> >      if ( !vcpu0->defer_shutdown )
> >      {
> >        vcpu_pause_nosync(v);
> >        v->paused_for_shutdown = 1;
> >      }
> >
> > vcpu_start_shutdown_deferral()
> >      vcpu0->defer_shutdown = 1;
> >      smp_mb();
> >      if ( unlikely(d->is_shutting_down) )
> >        vcpu_check_shutdown(v);
> >
> >      return vcpu0->defer_shutdown;
> >
> > smp_mb() should only guarantee ordering (this may be stronger on some
> > arch), so I think there is a race between the two functions.
> >
> > It would be possible to pause the vCPU in domain_shutdown() because
> > vcpu0->defer_shutdown wasn't yet seen.
> >
> > Equally, vcpu_start_shutdown_deferral() may not see d->is_shutting_down
> > and therefore Xen may continue to send the I/O. Yet the vCPU will be
> > paused so the I/O will never complete.
> >

The barrier enforces global order, right? So, if domain_shutdown() pauses the vcpu then is_shutting_down must necessarily be visible all cpus. Thus vcpu_start_shutdown referral will execute vcpu_check_shutdown(), so I'm having a hard time seeing the race.

> > I am not fully familiar with the IOREQ code, but it sounds to me this is
> > not the behavior that was intended. Can someone more familiar with the
> > code confirm it?
> >

No indeed. I think emulation should complete before the vcpu pauses.

  Paul

> > Cheers,
> >
> 
> --
> Julien Grall

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Memory ordering question in the shutdown deferral code
  2020-09-21 11:40 ` Memory ordering question in the shutdown deferral code Julien Grall
  2020-09-21 12:55   ` Durrant, Paul
@ 2020-09-21 13:11   ` Jan Beulich
  2020-09-21 14:22     ` Julien Grall
  1 sibling, 1 reply; 11+ messages in thread
From: Jan Beulich @ 2020-09-21 13:11 UTC (permalink / raw)
  To: Julien Grall
  Cc: Stefano Stabellini, andrew.cooper3, George Dunlap, Paul Durrant,
	Xia, Hongyan, xen-devel

On 21.09.2020 13:40, Julien Grall wrote:
> (+ Xen-devel)
> 
> Sorry I forgot to CC xen-devel.
> 
> On 21/09/2020 12:38, Julien Grall wrote:
>> Hi all,
>>
>> I have started to look at the deferral code (see 
>> vcpu_start_shutdown_deferral()) because we need it for LiveUpdate and 
>> Arm will soon use it.
>>
>> The current implementation is using an smp_mb() to ensure ordering 
>> between a write then a read. The code looks roughly (I have slightly 
>> adapted it to make my question more obvious):
>>
>> domain_shutdown()
>>      d->is_shutting_down = 1;
>>      smp_mb();
>>      if ( !vcpu0->defer_shutdown )
>>      {
>>        vcpu_pause_nosync(v);
>>        v->paused_for_shutdown = 1;
>>      }
>>
>> vcpu_start_shutdown_deferral()
>>      vcpu0->defer_shutdown = 1;
>>      smp_mb();
>>      if ( unlikely(d->is_shutting_down) )
>>        vcpu_check_shutdown(v);
>>
>>      return vcpu0->defer_shutdown;
>>
>> smp_mb() should only guarantee ordering (this may be stronger on some 
>> arch), so I think there is a race between the two functions.
>>
>> It would be possible to pause the vCPU in domain_shutdown() because 
>> vcpu0->defer_shutdown wasn't yet seen.
>>
>> Equally, vcpu_start_shutdown_deferral() may not see d->is_shutting_down 
>> and therefore Xen may continue to send the I/O. Yet the vCPU will be 
>> paused so the I/O will never complete.

Individually for each of these I agree. But isn't the goal merely
to prevent both to enter their if()-s' bodies at the same time?
And isn't the combined effect of the two barriers preventing just
this?

>> I am not fully familiar with the IOREQ code, but it sounds to me this is 
>> not the behavior that was intended. Can someone more familiar with the 
>> code confirm it?

As to original intentions, I'm afraid among the people still
listed as maintainers for any part of Xen it may only be Tim to
possibly have been involved in the original installation of
this model, and hence who may know of the precise intentions
and considerations back at the time.

As far as I'm concerned, to be honest I don't think I've ever
managed to fully convince myself of the correctness of the
model in the general case. But since it did look good enough
for x86 ...

Jan


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Memory ordering question in the shutdown deferral code
  2020-09-21 12:55   ` Durrant, Paul
@ 2020-09-21 13:25     ` Xia, Hongyan
  2020-09-21 13:27     ` Julien Grall
  1 sibling, 0 replies; 11+ messages in thread
From: Xia, Hongyan @ 2020-09-21 13:25 UTC (permalink / raw)
  To: jbeulich, sstabellini, julien, Durrant, Paul, andrew.cooper3,
	george.dunlap
  Cc: xen-devel

On Mon, 2020-09-21 at 12:55 +0000, Durrant, Paul wrote:
> > -----Original Message-----
> > From: Julien Grall <julien@xen.org>
> > Sent: 21 September 2020 12:41
> > To: Jan Beulich <jbeulich@suse.com>; Stefano Stabellini <
> > sstabellini@kernel.org>;
> > andrew.cooper3@citrix.com; George Dunlap <george.dunlap@citrix.com>
> > ; Durrant, Paul
> > <pdurrant@amazon.co.uk>
> > Cc: Xia, Hongyan <hongyxia@amazon.com>; 
> > xen-devel@lists.xenproject.org
> > Subject: RE: [EXTERNAL] Memory ordering question in the shutdown
> > deferral code
> > 
> > CAUTION: This email originated from outside of the organization. Do
> > not click links or open
> > attachments unless you can confirm the sender and know the content
> > is safe.
> > 
> > 
> > 
> > (+ Xen-devel)
> > 
> > Sorry I forgot to CC xen-devel.
> > 
> > On 21/09/2020 12:38, Julien Grall wrote:
> > > Hi all,
> > > 
> > > I have started to look at the deferral code (see
> > > vcpu_start_shutdown_deferral()) because we need it for LiveUpdate
> > > and
> > > Arm will soon use it.
> > > 
> > > The current implementation is using an smp_mb() to ensure
> > > ordering
> > > between a write then a read. The code looks roughly (I have
> > > slightly
> > > adapted it to make my question more obvious):
> > > 
> > > domain_shutdown()
> > >      d->is_shutting_down = 1;
> > >      smp_mb();
> > >      if ( !vcpu0->defer_shutdown )
> > >      {
> > >        vcpu_pause_nosync(v);
> > >        v->paused_for_shutdown = 1;
> > >      }
> > > 
> > > vcpu_start_shutdown_deferral()
> > >      vcpu0->defer_shutdown = 1;
> > >      smp_mb();
> > >      if ( unlikely(d->is_shutting_down) )
> > >        vcpu_check_shutdown(v);
> > > 
> > >      return vcpu0->defer_shutdown;
> > > 
> > > smp_mb() should only guarantee ordering (this may be stronger on
> > > some
> > > arch), so I think there is a race between the two functions.
> > > 
> > > It would be possible to pause the vCPU in domain_shutdown()
> > > because
> > > vcpu0->defer_shutdown wasn't yet seen.
> > > 
> > > Equally, vcpu_start_shutdown_deferral() may not see d-
> > > >is_shutting_down
> > > and therefore Xen may continue to send the I/O. Yet the vCPU will
> > > be
> > > paused so the I/O will never complete.
> > > 
> 
> The barrier enforces global order, right? So, if domain_shutdown()
> pauses the vcpu then is_shutting_down must necessarily be visible all
> cpus. Thus vcpu_start_shutdown referral will execute
> vcpu_check_shutdown(), so I'm having a hard time seeing the race.

I think the question is whether smp_mb() can enforce immediate
visibility. It at least enforces ordering, so when other CPUs
eventually see the memory accesses they will be in the right order. On
the x86 side I think an mfence enforces that when the later load is
executed, the previous store is globally visible. But, if it is only
about the ordering seen by other cores, it could be possible that the
write is seen much later.

Hongyan

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Memory ordering question in the shutdown deferral code
  2020-09-21 12:55   ` Durrant, Paul
  2020-09-21 13:25     ` Xia, Hongyan
@ 2020-09-21 13:27     ` Julien Grall
  2020-09-21 13:32       ` Jan Beulich
  2020-09-23 22:57       ` Stefano Stabellini
  1 sibling, 2 replies; 11+ messages in thread
From: Julien Grall @ 2020-09-21 13:27 UTC (permalink / raw)
  To: Durrant, Paul, Jan Beulich, Stefano Stabellini, andrew.cooper3,
	George Dunlap
  Cc: Xia, Hongyan, xen-devel



On 21/09/2020 13:55, Durrant, Paul wrote:
>> -----Original Message-----
>> From: Julien Grall <julien@xen.org>
>> Sent: 21 September 2020 12:41
>> To: Jan Beulich <jbeulich@suse.com>; Stefano Stabellini <sstabellini@kernel.org>;
>> andrew.cooper3@citrix.com; George Dunlap <george.dunlap@citrix.com>; Durrant, Paul
>> <pdurrant@amazon.co.uk>
>> Cc: Xia, Hongyan <hongyxia@amazon.com>; xen-devel@lists.xenproject.org
>> Subject: RE: [EXTERNAL] Memory ordering question in the shutdown deferral code
>>
>> CAUTION: This email originated from outside of the organization. Do not click links or open
>> attachments unless you can confirm the sender and know the content is safe.
>>
>>
>>
>> (+ Xen-devel)
>>
>> Sorry I forgot to CC xen-devel.
>>
>> On 21/09/2020 12:38, Julien Grall wrote:
>>> Hi all,
>>>
>>> I have started to look at the deferral code (see
>>> vcpu_start_shutdown_deferral()) because we need it for LiveUpdate and
>>> Arm will soon use it.
>>>
>>> The current implementation is using an smp_mb() to ensure ordering
>>> between a write then a read. The code looks roughly (I have slightly
>>> adapted it to make my question more obvious):
>>>
>>> domain_shutdown()
>>>       d->is_shutting_down = 1;
>>>       smp_mb();
>>>       if ( !vcpu0->defer_shutdown )
>>>       {
>>>         vcpu_pause_nosync(v);
>>>         v->paused_for_shutdown = 1;
>>>       }
>>>
>>> vcpu_start_shutdown_deferral()
>>>       vcpu0->defer_shutdown = 1;
>>>       smp_mb();
>>>       if ( unlikely(d->is_shutting_down) )
>>>         vcpu_check_shutdown(v);
>>>
>>>       return vcpu0->defer_shutdown;
>>>
>>> smp_mb() should only guarantee ordering (this may be stronger on some
>>> arch), so I think there is a race between the two functions.
>>>
>>> It would be possible to pause the vCPU in domain_shutdown() because
>>> vcpu0->defer_shutdown wasn't yet seen.
>>>
>>> Equally, vcpu_start_shutdown_deferral() may not see d->is_shutting_down
>>> and therefore Xen may continue to send the I/O. Yet the vCPU will be
>>> paused so the I/O will never complete.
>>>
> 
> The barrier enforces global order, right?

It is not clear to me what you mean by "global ordering". This seems to 
suggest a very expensive synchronization barrier between all the processors.

 From an arch-agnostic PoV, smp_mb() will enforce an ordering between 
loads/stores but it doesn't guarantee *when* they will be observed.

> So, if domain_shutdown() pauses the vcpu then is_shutting_down must necessarily be visible all cpus.

That's not the guarantee provided by smp_mb() (see above).

> Thus vcpu_start_shutdown referral will execute vcpu_check_shutdown(), so I'm having a hard time seeing the race.
> 
>>> I am not fully familiar with the IOREQ code, but it sounds to me this is
>>> not the behavior that was intended. Can someone more familiar with the
>>> code confirm it?
>>>
> 
> No indeed. I think emulation should complete before the vcpu pauses.

I think this part is racy at least on non-x86 platform as x86 seems to 
implement smp_mb() with a strong memory barrier (mfence).

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Memory ordering question in the shutdown deferral code
  2020-09-21 13:27     ` Julien Grall
@ 2020-09-21 13:32       ` Jan Beulich
  2020-09-21 13:35         ` Durrant, Paul
  2020-09-23 22:57       ` Stefano Stabellini
  1 sibling, 1 reply; 11+ messages in thread
From: Jan Beulich @ 2020-09-21 13:32 UTC (permalink / raw)
  To: Julien Grall
  Cc: Durrant, Paul, Stefano Stabellini, andrew.cooper3, George Dunlap,
	Xia, Hongyan, xen-devel

On 21.09.2020 15:27, Julien Grall wrote:
> I think this part is racy at least on non-x86 platform as x86 seems to 
> implement smp_mb() with a strong memory barrier (mfence).

The "strength" of the memory barrier doesn't matter here imo. It's
the fully coherent memory model (for WB type memory) which makes
this be fine on x86. The barrier still only guarantees ordering,
not visibility.

Jan


^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: Memory ordering question in the shutdown deferral code
  2020-09-21 13:32       ` Jan Beulich
@ 2020-09-21 13:35         ` Durrant, Paul
  2020-09-21 14:02           ` Jan Beulich
  0 siblings, 1 reply; 11+ messages in thread
From: Durrant, Paul @ 2020-09-21 13:35 UTC (permalink / raw)
  To: Jan Beulich, Julien Grall
  Cc: Stefano Stabellini, andrew.cooper3, George Dunlap, Xia, Hongyan,
	xen-devel

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 21 September 2020 14:32
> To: Julien Grall <julien@xen.org>
> Cc: Durrant, Paul <pdurrant@amazon.co.uk>; Stefano Stabellini <sstabellini@kernel.org>;
> andrew.cooper3@citrix.com; George Dunlap <george.dunlap@citrix.com>; Xia, Hongyan
> <hongyxia@amazon.com>; xen-devel@lists.xenproject.org
> Subject: RE: [EXTERNAL] Memory ordering question in the shutdown deferral code
> 
> CAUTION: This email originated from outside of the organization. Do not click links or open
> attachments unless you can confirm the sender and know the content is safe.
> 
> 
> 
> On 21.09.2020 15:27, Julien Grall wrote:
> > I think this part is racy at least on non-x86 platform as x86 seems to
> > implement smp_mb() with a strong memory barrier (mfence).
> 
> The "strength" of the memory barrier doesn't matter here imo. It's
> the fully coherent memory model (for WB type memory) which makes
> this be fine on x86. The barrier still only guarantees ordering,
> not visibility.
> 

In which case I misunderstood what the 'smp' means in this context then.

  Paul

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Memory ordering question in the shutdown deferral code
  2020-09-21 13:35         ` Durrant, Paul
@ 2020-09-21 14:02           ` Jan Beulich
  0 siblings, 0 replies; 11+ messages in thread
From: Jan Beulich @ 2020-09-21 14:02 UTC (permalink / raw)
  To: Durrant, Paul
  Cc: Julien Grall, Stefano Stabellini, andrew.cooper3, George Dunlap,
	Xia, Hongyan, xen-devel

On 21.09.2020 15:35, Durrant, Paul wrote:
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: 21 September 2020 14:32
>>
>> On 21.09.2020 15:27, Julien Grall wrote:
>>> I think this part is racy at least on non-x86 platform as x86 seems to
>>> implement smp_mb() with a strong memory barrier (mfence).
>>
>> The "strength" of the memory barrier doesn't matter here imo. It's
>> the fully coherent memory model (for WB type memory) which makes
>> this be fine on x86. The barrier still only guarantees ordering,
>> not visibility.
>>
> 
> In which case I misunderstood what the 'smp' means in this context then.

I find this confusing too, at times. But according to my reading
of the doc the "smp" in there really only means "simple compiler
barrier when UP".

Jan


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Memory ordering question in the shutdown deferral code
  2020-09-21 13:11   ` Jan Beulich
@ 2020-09-21 14:22     ` Julien Grall
  0 siblings, 0 replies; 11+ messages in thread
From: Julien Grall @ 2020-09-21 14:22 UTC (permalink / raw)
  To: Jan Beulich, Tim Deegan
  Cc: Stefano Stabellini, andrew.cooper3, George Dunlap, Paul Durrant,
	Xia, Hongyan, xen-devel

Hi Jan,

On 21/09/2020 14:11, Jan Beulich wrote:
> On 21.09.2020 13:40, Julien Grall wrote:
>> (+ Xen-devel)
>>
>> Sorry I forgot to CC xen-devel.
>>
>> On 21/09/2020 12:38, Julien Grall wrote:
>>> Hi all,
>>>
>>> I have started to look at the deferral code (see
>>> vcpu_start_shutdown_deferral()) because we need it for LiveUpdate and
>>> Arm will soon use it.
>>>
>>> The current implementation is using an smp_mb() to ensure ordering
>>> between a write then a read. The code looks roughly (I have slightly
>>> adapted it to make my question more obvious):
>>>
>>> domain_shutdown()
>>>       d->is_shutting_down = 1;
>>>       smp_mb();
>>>       if ( !vcpu0->defer_shutdown )
>>>       {
>>>         vcpu_pause_nosync(v);
>>>         v->paused_for_shutdown = 1;
>>>       }
>>>
>>> vcpu_start_shutdown_deferral()
>>>       vcpu0->defer_shutdown = 1;
>>>       smp_mb();
>>>       if ( unlikely(d->is_shutting_down) )
>>>         vcpu_check_shutdown(v);
>>>
>>>       return vcpu0->defer_shutdown;
>>>
>>> smp_mb() should only guarantee ordering (this may be stronger on some
>>> arch), so I think there is a race between the two functions.
>>>
>>> It would be possible to pause the vCPU in domain_shutdown() because
>>> vcpu0->defer_shutdown wasn't yet seen.
>>>
>>> Equally, vcpu_start_shutdown_deferral() may not see d->is_shutting_down
>>> and therefore Xen may continue to send the I/O. Yet the vCPU will be
>>> paused so the I/O will never complete.
> 
> Individually for each of these I agree. But isn't the goal merely
> to prevent both to enter their if()-s' bodies at the same time?
> And isn't the combined effect of the two barriers preventing just
> this?

The code should already be able to deal with that as 
vcpu_check_shutdown() will request to hold d->shutdown_lock and then 
check v->paused_for_shutdown.

So I am not sure why the barriers would matter here.

> 
>>> I am not fully familiar with the IOREQ code, but it sounds to me this is
>>> not the behavior that was intended. Can someone more familiar with the
>>> code confirm it?
> 
> As to original intentions, I'm afraid among the people still
> listed as maintainers for any part of Xen it may only be Tim to
> possibly have been involved in the original installation of
> this model, and hence who may know of the precise intentions
> and considerations back at the time.

It would be useful to know the original intentions, so I have CCed Tim.

However, I think it is more important to agree on what we want to 
achieve so we can decide whether the existing code is suitable.

Do you agree that we only want to shutdown (or pause it at an 
architecturally restartable bounday) a domain with no I/Os inflights?

> 
> As far as I'm concerned, to be honest I don't think I've ever
> managed to fully convince myself of the correctness of the
> model in the general case. But since it did look good enough
> for x86 ...

Right, the memory model on x86 is quite simple compare to Arm :). I am 
pretty sure we need some sort of ordering, but I am not convinced we 
have the correct one in place if we want to cater architecture with more 
relaxed memory model.

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Memory ordering question in the shutdown deferral code
  2020-09-21 13:27     ` Julien Grall
  2020-09-21 13:32       ` Jan Beulich
@ 2020-09-23 22:57       ` Stefano Stabellini
  2020-09-24 11:10         ` Julien Grall
  1 sibling, 1 reply; 11+ messages in thread
From: Stefano Stabellini @ 2020-09-23 22:57 UTC (permalink / raw)
  To: Julien Grall
  Cc: Durrant, Paul, Jan Beulich, Stefano Stabellini, andrew.cooper3,
	George Dunlap, Xia, Hongyan, xen-devel

On Mon, 21 Sep 2020, Julien Grall wrote:
> On 21/09/2020 13:55, Durrant, Paul wrote:
> > > (+ Xen-devel)
> > > 
> > > Sorry I forgot to CC xen-devel.
> > > 
> > > On 21/09/2020 12:38, Julien Grall wrote:
> > > > Hi all,
> > > > 
> > > > I have started to look at the deferral code (see
> > > > vcpu_start_shutdown_deferral()) because we need it for LiveUpdate and
> > > > Arm will soon use it.
> > > > 
> > > > The current implementation is using an smp_mb() to ensure ordering
> > > > between a write then a read. The code looks roughly (I have slightly
> > > > adapted it to make my question more obvious):
> > > > 
> > > > domain_shutdown()
> > > >       d->is_shutting_down = 1;
> > > >       smp_mb();
> > > >       if ( !vcpu0->defer_shutdown )
> > > >       {
> > > >         vcpu_pause_nosync(v);
> > > >         v->paused_for_shutdown = 1;
> > > >       }
> > > > 
> > > > vcpu_start_shutdown_deferral()
> > > >       vcpu0->defer_shutdown = 1;
> > > >       smp_mb();
> > > >       if ( unlikely(d->is_shutting_down) )
> > > >         vcpu_check_shutdown(v);
> > > > 
> > > >       return vcpu0->defer_shutdown;
> > > > 
> > > > smp_mb() should only guarantee ordering (this may be stronger on some
> > > > arch), so I think there is a race between the two functions.
> > > > 
> > > > It would be possible to pause the vCPU in domain_shutdown() because
> > > > vcpu0->defer_shutdown wasn't yet seen.
> > > > 
> > > > Equally, vcpu_start_shutdown_deferral() may not see d->is_shutting_down
> > > > and therefore Xen may continue to send the I/O. Yet the vCPU will be
> > > > paused so the I/O will never complete.
> > > > 
> > 
> > The barrier enforces global order, right?
> 
> It is not clear to me what you mean by "global ordering". This seems to
> suggest a very expensive synchronization barrier between all the processors.
> 
> From an arch-agnostic PoV, smp_mb() will enforce an ordering between
> loads/stores but it doesn't guarantee *when* they will be observed.
> 
> > So, if domain_shutdown() pauses the vcpu then is_shutting_down must
> > necessarily be visible all cpus.
> 
> That's not the guarantee provided by smp_mb() (see above).


I simplified the code further to help us reason about it:


   thread#1 |  thread#2
            |
1) WRITE A  |  WRITE B
2) BARRIER  |  BARRIER
3) READ B   |  READ A


I think it is (theoretically) possible for thread#1 to be at 1) and
about to do 2), while thread#2 goes ahead and does 1) 2) 3). By the time
thread#1 does 2), thread#2 has already completed the entire sequence.

If thread#2 has already done 2), and thread#1 is about to do 3), then I
think we are guaranteed that thread#1 will see the new value of B. Or
is this the core of the issue we are discussing?

If it works the way I wrote, then it would confirm Paul's view.


For your information I went to check what the Linux memory model has to
say about this. It says:

"smp_mb() guarantees to restore sequential consistency among accesses that use READ_ONCE, WRITE_ONCE(), or stronger. For example, the following Linux-kernel code would forbid non-SC outcomes"

It is interesting that they chose the words "restore sequential
consistency". It would be difficult to come up with an example that has
"sequential consistency" but doesn't work the way described earlier.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Memory ordering question in the shutdown deferral code
  2020-09-23 22:57       ` Stefano Stabellini
@ 2020-09-24 11:10         ` Julien Grall
  0 siblings, 0 replies; 11+ messages in thread
From: Julien Grall @ 2020-09-24 11:10 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Durrant, Paul, Jan Beulich, andrew.cooper3, George Dunlap, Xia,
	Hongyan, xen-devel

Hi,

On 23/09/2020 23:57, Stefano Stabellini wrote:
> On Mon, 21 Sep 2020, Julien Grall wrote:
>> On 21/09/2020 13:55, Durrant, Paul wrote:
>>>> (+ Xen-devel)
>>>>
>>>> Sorry I forgot to CC xen-devel.
>>>>
>>>> On 21/09/2020 12:38, Julien Grall wrote:
>>>>> Hi all,
>>>>>
>>>>> I have started to look at the deferral code (see
>>>>> vcpu_start_shutdown_deferral()) because we need it for LiveUpdate and
>>>>> Arm will soon use it.
>>>>>
>>>>> The current implementation is using an smp_mb() to ensure ordering
>>>>> between a write then a read. The code looks roughly (I have slightly
>>>>> adapted it to make my question more obvious):
>>>>>
>>>>> domain_shutdown()
>>>>>        d->is_shutting_down = 1;
>>>>>        smp_mb();
>>>>>        if ( !vcpu0->defer_shutdown )
>>>>>        {
>>>>>          vcpu_pause_nosync(v);
>>>>>          v->paused_for_shutdown = 1;
>>>>>        }
>>>>>
>>>>> vcpu_start_shutdown_deferral()
>>>>>        vcpu0->defer_shutdown = 1;
>>>>>        smp_mb();
>>>>>        if ( unlikely(d->is_shutting_down) )
>>>>>          vcpu_check_shutdown(v);
>>>>>
>>>>>        return vcpu0->defer_shutdown;
>>>>>
>>>>> smp_mb() should only guarantee ordering (this may be stronger on some
>>>>> arch), so I think there is a race between the two functions.
>>>>>
>>>>> It would be possible to pause the vCPU in domain_shutdown() because
>>>>> vcpu0->defer_shutdown wasn't yet seen.
>>>>>
>>>>> Equally, vcpu_start_shutdown_deferral() may not see d->is_shutting_down
>>>>> and therefore Xen may continue to send the I/O. Yet the vCPU will be
>>>>> paused so the I/O will never complete.
>>>>>
>>>
>>> The barrier enforces global order, right?
>>
>> It is not clear to me what you mean by "global ordering". This seems to
>> suggest a very expensive synchronization barrier between all the processors.
>>
>>  From an arch-agnostic PoV, smp_mb() will enforce an ordering between
>> loads/stores but it doesn't guarantee *when* they will be observed.
>>
>>> So, if domain_shutdown() pauses the vcpu then is_shutting_down must
>>> necessarily be visible all cpus.
>>
>> That's not the guarantee provided by smp_mb() (see above).
> 
> 
> I simplified the code further to help us reason about it:
> 
> 
>     thread#1 |  thread#2
>              |
> 1) WRITE A  |  WRITE B
> 2) BARRIER  |  BARRIER
> 3) READ B   |  READ A

Thank you for writing a simpler example. It allowed me to find a litmus 
(see [1]) and now I understood why it works. See more below.

> 
> 
> I think it is (theoretically) possible for thread#1 to be at 1) and
> about to do 2), while thread#2 goes ahead and does 1) 2) 3).

Well it is not that theoritical :). There are many reasons where this 
situation can happen. To only cite a few:
     - Threads may run on the same pCPUs
     - The pCPU running the threads may get interrupted
     - The data modified is not in the L1 cache, there will be delay to 
access it.

> By the time
> thread#1 does 2), thread#2 has already completed the entire sequence.
> 
> If thread#2 has already done 2), and thread#1 is about to do 3), then I
> think we are guaranteed that thread#1 will see the new value of B. > Or
> is this the core of the issue we are discussing?

No you are right. I got confused because smp_mb() doesn't guarantee when 
the write/read is completed.

So I blindly assumed that the ordering would be done just at the 
processor level. Instead, the ordering is done at the innershareable 
level (e.g between all the processors) as we are using dmb ish.

Assuming A and B are initialized to 0 and we are writing 1, then there 
is no way for both thread to read 0. In which case, the existing 
shutdown code is fine.

Cheers,

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/memory-model/litmus-tests/SB+fencembonceonces.litmus

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2020-09-24 11:10 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <468576ba-8d3f-98e9-e65e-1128b5220d40@xen.org>
2020-09-21 11:40 ` Memory ordering question in the shutdown deferral code Julien Grall
2020-09-21 12:55   ` Durrant, Paul
2020-09-21 13:25     ` Xia, Hongyan
2020-09-21 13:27     ` Julien Grall
2020-09-21 13:32       ` Jan Beulich
2020-09-21 13:35         ` Durrant, Paul
2020-09-21 14:02           ` Jan Beulich
2020-09-23 22:57       ` Stefano Stabellini
2020-09-24 11:10         ` Julien Grall
2020-09-21 13:11   ` Jan Beulich
2020-09-21 14:22     ` Julien Grall

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).