On 05/04/2017 09:09, Christoffer Dall wrote:
>>>  - In the explanation you wrote, you use the term 'we' a lot, but when
>>>    talking about SMP barriers, I think it only makes sense to talk about
>>>    actions and observations between multiple CPUs and we have to be
>>>    specific about which CPU observes or does what with respect to the
>>>    other.  Maybe I'm being a stickler here, but there something here
>>>    which is making me uneasy.
>> The write1-mb-if(read2) / write2-mb-if(read1) pattern is pretty common,
>> so I think it is justified to cut the ordering on the reasoning and just
>> focus on what the two memory locations and conditions mean.
> ok, but the pattern above was not common to me (and I'm pretty sure I'm
> not the only fool in the bunch here), so if we can reference something
> that explains that this is a known pattern which has been tried and
> proven, that would be even better.

I found https://lwn.net/Articles/573436/ which shows this example:

  CPU 0					CPU 1
  ---------------------			----------------------
  WRITE_ONCE(x, 1);			WRITE_ONCE(y, 1);
  smp_mb();				smp_mb();
  r2 = READ_ONCE(y);			r4 = READ_ONCE(x);

And says that it is a bug if r2 == 0 && r4 == 0.  This is exactly what 
happens in KVM:

  CPU 0					CPU 1
  ---------------------			----------------------
  vcpu->mode = IN_GUEST_MODE;		kvm_make_request(REQ, vcpu);
  smp_mb();				smp_mb();
  r2 = kvm_request_pending(vcpu)	r4 = (vcpu->mode == IN_GUEST_MODE)
  if (r2)				if (r4)
	abort entry				kick();

If r2 sees no request and r4 doesn't kick there would be a bug.
But why can't this happen?

- if no request is pending at the time of the read to r2, CPU 1 must
not have executed kvm_make_request yet.  In CPU 0, kvm_request_pending
must happen after vcpu->mode is set to IN_GUEST_MODE, therefore CPU 1
will read IN_GUEST_MODE and kick.

- if no kick happens in CPU 1, CPU 0 must not have set vcpu->mode yet.
In CPU 1, vcpu->mode is read after setting the request bit, therefore
CPU 0 will see the request bit and abort the guest entry.

>>>  - Finally, it feels very hard to prove the correctness of this, and
>>>    equally hard to test it (given how long we've been running with
>>>    apparently racy code).  I would hope that we could abstract some of
>>>    this into architecture generic things, that someone who eat memory
>>>    barriers for breakfast could help us verify, but again, maybe this is
>>>    Radim's series I'm asking for here.
>>
>> What I can do here is to suggest copying the paradigms from x86, which
>> is quite battle tested (Windows hammers it really hard).
>
> That sounds reasonable, but I think part of the problem was that we
> simply didn't understand what the paradigms were (see the
> kvm_make_all_cpus_request above as an example), so Drew's action about
> documenting what this all is and the constraints of using it is really
> important for me to do that.

Yes, totally agreed on that.

>> For QEMU I did use model checking in the past for some similarly hairy
>> synchronization code, but that is really just "executable documentation"
>> because the model is not written in C.
>>
> I played with using blast on some of the KVM/ARM code a long time ago,
> and while I was able to find a bug with it, it was sort of an obvious
> bug, and the things I was able to do with it was pretty limited to the
> problems I could imagine myself anyhow.  Perhaps this is what you mean
> with executable documentation.

I prepared three examples of a spin model for KVM vCPU kicking, and
the outcome was actually pretty surprising: the mode check seems not
to be necessary.

I haven't covered all x86 cases so I'm not going to remove it right
ahead, but for ARM it really seems like EXITING_GUEST_MODE is nothing
but an optimization of consecutive kvm_vcpu_kicks.

All three models can use C preprocessor #defines to inject bugs:

- kvm-arm-pause.promela: the "paused" mechanism; the model proves that
  the "paused" test in the interrupt-disabled region is necessary

- kvm-req.promela: the requests mechanism; the model proves that
  the requests check in the interrupt-disabled region is necessary

- kvm-x86-pi.promela: the x86 posted interrupt mechanism (simplified
  a bit); the model proves that KVM must disable interrupts before
  checking for interrupts injected while outside guest mode
  (commit b95234c84004, "kvm: x86: do not use KVM_REQ_EVENT for APICv
  interrupt injection", 2017-02-15)

So it seems like there are no races after all in KVM/ARM code, though
the code can still be cleaned up.  And I have been convinced of the wrong
thing all the time. :)

But why is KVM/ARM using KVM_REQ_VCPU_EXIT
just fine without checking for requests (kvm-req.promela)?  Because,
as mentioned earlier in the thread, KVM/ARM is using kvm_make_all_vcpus_request
simply to kick all VCPUs.  The paused variable _is_ checked after disabling
interrupts, and that is fine.

After this experiment, I think I like Drew's KVM_REQ_PAUSE more than I did
yesterday.  However, yet another alternative is to leave pause/power_off as
they are, while taking some inspiration from his patch to do some cleanups:

1) change the "if"

                if (ret <= 0 || need_new_vmid_gen(vcpu->kvm) ||
                        vcpu->arch.power_off || vcpu->arch.pause) {

to test kvm_requests_pending instead of pause/power_off

2) clear KVM_REQ_VCPU_EXIT before the other "if":

                if (vcpu->arch.power_off || vcpu->arch.pause)
                        vcpu_sleep(vcpu);


In any case, the no-wakeup behavior of kvm_make_all_vcpus_request suits
either use of requests (KVM_REQ_PAUSE or "fixed" KVM_REQ_VCPU_EXIT).

Paolo