From: Julien Grall <julien@xen.org>
To: Stefano Stabellini <sstabellini@kernel.org>
Cc: "Durrant, Paul" <pdurrant@amazon.co.uk>,
	Jan Beulich <jbeulich@suse.com>,
	 "andrew.cooper3@citrix.com" <andrew.cooper3@citrix.com>,
	George Dunlap <george.dunlap@citrix.com>,
	"Xia, Hongyan" <hongyxia@amazon.com>,
	"xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>
Subject: Re: Memory ordering question in the shutdown deferral code
Date: Thu, 24 Sep 2020 12:10:22 +0100	[thread overview]
Message-ID: <b35509a0-42dd-5442-1a79-d57fc1e9170b@xen.org> (raw)
In-Reply-To: <alpine.DEB.2.21.2009231541410.1495@sstabellini-ThinkPad-T480s>

Hi,

On 23/09/2020 23:57, Stefano Stabellini wrote:
> On Mon, 21 Sep 2020, Julien Grall wrote:
>> On 21/09/2020 13:55, Durrant, Paul wrote:
>>>> (+ Xen-devel)
>>>>
>>>> Sorry I forgot to CC xen-devel.
>>>>
>>>> On 21/09/2020 12:38, Julien Grall wrote:
>>>>> Hi all,
>>>>>
>>>>> I have started to look at the deferral code (see
>>>>> vcpu_start_shutdown_deferral()) because we need it for LiveUpdate and
>>>>> Arm will soon use it.
>>>>>
>>>>> The current implementation is using an smp_mb() to ensure ordering
>>>>> between a write then a read. The code looks roughly (I have slightly
>>>>> adapted it to make my question more obvious):
>>>>>
>>>>> domain_shutdown()
>>>>>        d->is_shutting_down = 1;
>>>>>        smp_mb();
>>>>>        if ( !vcpu0->defer_shutdown )
>>>>>        {
>>>>>          vcpu_pause_nosync(v);
>>>>>          v->paused_for_shutdown = 1;
>>>>>        }
>>>>>
>>>>> vcpu_start_shutdown_deferral()
>>>>>        vcpu0->defer_shutdown = 1;
>>>>>        smp_mb();
>>>>>        if ( unlikely(d->is_shutting_down) )
>>>>>          vcpu_check_shutdown(v);
>>>>>
>>>>>        return vcpu0->defer_shutdown;
>>>>>
>>>>> smp_mb() should only guarantee ordering (this may be stronger on some
>>>>> arch), so I think there is a race between the two functions.
>>>>>
>>>>> It would be possible to pause the vCPU in domain_shutdown() because
>>>>> vcpu0->defer_shutdown wasn't yet seen.
>>>>>
>>>>> Equally, vcpu_start_shutdown_deferral() may not see d->is_shutting_down
>>>>> and therefore Xen may continue to send the I/O. Yet the vCPU will be
>>>>> paused so the I/O will never complete.
>>>>>
>>>
>>> The barrier enforces global order, right?
>>
>> It is not clear to me what you mean by "global ordering". This seems to
>> suggest a very expensive synchronization barrier between all the processors.
>>
>> From an arch-agnostic PoV, smp_mb() will enforce an ordering between
>> loads/stores but it doesn't guarantee *when* they will be observed.
>>
>>> So, if domain_shutdown() pauses the vcpu then is_shutting_down must
>>> necessarily be visible all cpus.
>>
>> That's not the guarantee provided by smp_mb() (see above).
> 
> 
> I simplified the code further to help us reason about it:
> 
> 
>     thread#1 |  thread#2
>              |
> 1) WRITE A  |  WRITE B
> 2) BARRIER  |  BARRIER
> 3) READ B   |  READ A

Thank you for writing a simpler example. It allowed me to find a litmus 
test (see [1]) and now I understand why it works. See more below.
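For reference, the store-buffering (SB) pattern from the litmus test can 
be written as a small C11 sketch (the names sb_both_read_zero, writer_a, 
writer_b, r1, r2 are mine for illustration, not from the Xen code; 
atomic_thread_fence(memory_order_seq_cst) stands in for smp_mb()):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Store-buffering (SB) litmus pattern with full fences between
 * the write and the read on each thread. */
static atomic_int A, B;
static int r1, r2;

static void *writer_a(void *arg)
{
    (void)arg;
    atomic_store_explicit(&A, 1, memory_order_relaxed);   /* 1) WRITE A */
    atomic_thread_fence(memory_order_seq_cst);            /* 2) BARRIER */
    r1 = atomic_load_explicit(&B, memory_order_relaxed);  /* 3) READ B */
    return NULL;
}

static void *writer_b(void *arg)
{
    (void)arg;
    atomic_store_explicit(&B, 1, memory_order_relaxed);   /* 1) WRITE B */
    atomic_thread_fence(memory_order_seq_cst);            /* 2) BARRIER */
    r2 = atomic_load_explicit(&A, memory_order_relaxed);  /* 3) READ A */
    return NULL;
}

/* Run the pattern once from a fresh state; returns true iff both
 * threads read 0, the outcome the seq_cst fences forbid. */
bool sb_both_read_zero(void)
{
    pthread_t t1, t2;

    atomic_store(&A, 0);
    atomic_store(&B, 0);
    pthread_create(&t1, NULL, writer_a, NULL);
    pthread_create(&t2, NULL, writer_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return r1 == 0 && r2 == 0;
}
```

With the two full fences, at least one thread is guaranteed to observe 
the other's write, so r1 == 0 && r2 == 0 is not a permitted outcome.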

> 
> 
> I think it is (theoretically) possible for thread#1 to be at 1) and
> about to do 2), while thread#2 goes ahead and does 1) 2) 3).

Well, it is not that theoretical :). There are many reasons why this 
situation can happen. To cite only a few:
     - The threads may run on the same pCPU
     - The pCPU running a thread may get interrupted
     - If the data modified is not in the L1 cache, there will be a 
delay to access it.

> By the time
> thread#1 does 2), thread#2 has already completed the entire sequence.
> 
> If thread#2 has already done 2), and thread#1 is about to do 3), then I
> think we are guaranteed that thread#1 will see the new value of B. Or
> is this the core of the issue we are discussing?

No, you are right. I got confused because smp_mb() doesn't guarantee 
when the write/read is completed.

So I blindly assumed that the ordering would be enforced just at the 
processor level. Instead, the ordering is enforced at the 
inner-shareable level (i.e. between all the processors) as we are using 
dmb ish.

Assuming A and B are initialized to 0 and each thread writes 1, there 
is no way for both threads to read 0. In which case, the existing 
shutdown code is fine.
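Mapping this back to the shutdown code, the handshake can be modelled as 
follows (a simplified sketch with C11 atomics and hypothetical 
model_* names, not the actual Xen implementation; the fence stands in 
for smp_mb()):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Simplified model of the deferral handshake. The two flags mirror
 * d->is_shutting_down and vcpu0->defer_shutdown from the snippet
 * quoted above; this is illustrative, not Xen code. */
static atomic_int is_shutting_down;   /* d->is_shutting_down */
static atomic_int defer_shutdown;     /* vcpu0->defer_shutdown */

/* Returns true if domain_shutdown() observed the deferral request
 * and therefore must not pause the vCPU. */
bool model_domain_shutdown(void)
{
    atomic_store_explicit(&is_shutting_down, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);             /* smp_mb() */
    return atomic_load_explicit(&defer_shutdown,
                                memory_order_relaxed) != 0;
}

/* Returns true if vcpu_start_shutdown_deferral() observed the
 * shutdown and therefore must run vcpu_check_shutdown(). */
bool model_start_deferral(void)
{
    atomic_store_explicit(&defer_shutdown, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);             /* smp_mb() */
    return atomic_load_explicit(&is_shutting_down,
                                memory_order_relaxed) != 0;
}
```

By the SB result, at least one of the two sides is guaranteed to see 
the other's write, so the lost-I/O scenario (the vCPU paused while Xen 
keeps sending I/O that can never complete) cannot occur.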

Cheers,

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/memory-model/litmus-tests/SB+fencembonceonces.litmus

-- 
Julien Grall


