kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* guest/host mem out of sync on core2duo?
@ 2021-06-12 22:49 stsp
  2021-06-13 12:36 ` stsp
  2021-06-14 17:06 ` Sean Christopherson
  0 siblings, 2 replies; 36+ messages in thread
From: stsp @ 2021-06-12 22:49 UTC (permalink / raw)
  To: kvm; +Cc: Sean Christopherson

Hi kvm developers.

I am having the strange problem
that can only be reproduced on a
core2duo CPU but not AMD FX or
Intel Core I7.

My code has 2 ways of setting the
guest registers: one is the guest's
ring0 stub that just pops all regs
from stack and does iret to ring3.
That works fine.
But sometimes I use KVM_SET_SREGS
and resume the VM directly to ring3.
That randomly results in either a
good run or invalid guest state
return, or a page fault in guest.

I tried to analyze when either of
the above happens exactly, and
I have a very strong suspection
that the problem is in a way I
update LDT. LDT is shared between
guest and host with KVM_SET_USER_MEMORY_REGION,
and I modify it on host.
So it seems like if I just allocated
the new LDT entry, there is a risk
of invalid guest state, as if the
guest's LDT still doesn't have it.
If I modified some LDT entry, there
can be a page fault in guest, as if
the entry is still old.

I've found that the one needs to
check KVM_CAP_SYNC_MMU to
safely write to the guest memory,
but it doesn't seem to be documented
well. Of course maybe my problem
has nothing to do with that, but I
think it does.
So can it be that even though I
check for the KVM_CAP_SYNC_MMU,
writing to the guest memory from
host is still unsafe? What is this
KVM_CAP_SYNC_MMU actually all
about?


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: guest/host mem out of sync on core2duo?
  2021-06-12 22:49 guest/host mem out of sync on core2duo? stsp
@ 2021-06-13 12:36 ` stsp
  2021-06-14 17:06 ` Sean Christopherson
  1 sibling, 0 replies; 36+ messages in thread
From: stsp @ 2021-06-13 12:36 UTC (permalink / raw)
  To: kvm

13.06.2021 01:49, stsp пишет:
> I've found that the one needs to
> check KVM_CAP_SYNC_MMU to
> safely write to the guest memory,
> but it doesn't seem to be documented
> well.
In fact, I wonder if its description in
the kernel doc is even correct:
---

When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
the memory region are automatically reflected into the guest.
---

But, after looking into the patches
that introduce that capability, I've
got an impression that it is only needed
if you mmap() something else to the
guest-shared region.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: guest/host mem out of sync on core2duo?
  2021-06-12 22:49 guest/host mem out of sync on core2duo? stsp
  2021-06-13 12:36 ` stsp
@ 2021-06-14 17:06 ` Sean Christopherson
  2021-06-14 17:32   ` stsp
  1 sibling, 1 reply; 36+ messages in thread
From: Sean Christopherson @ 2021-06-14 17:06 UTC (permalink / raw)
  To: stsp; +Cc: kvm, Sean Christopherson

On Sun, Jun 13, 2021, stsp wrote:
> Hi kvm developers.
> 
> I am having the strange problem that can only be reproduced on a core2duo CPU
> but not AMD FX or Intel Core I7.
> 
> My code has 2 ways of setting the guest registers: one is the guest's ring0
> stub that just pops all regs from stack and does iret to ring3.  That works
> fine.  But sometimes I use KVM_SET_SREGS and resume the VM directly to ring3.
> That randomly results in either a good run or invalid guest state return, or
> a page fault in guest.

Hmm, a core2duo failure is more than likely due to lack of unrestricted guest.
You verify this by loading kvm_intel on the Core i7 with unrestricted_guest=0.

> I tried to analyze when either of the above happens exactly, and I have a
> very strong suspection that the problem is in a way I update LDT. LDT is
> shared between guest and host with KVM_SET_USER_MEMORY_REGION, and I modify
> it on host.  So it seems like if I just allocated the new LDT entry, there is
> a risk of invalid guest state, as if the guest's LDT still doesn't have it.
> If I modified some LDT entry, there can be a page fault in guest, as if the
> entry is still old.

IIUC, you are updating the LDT itself, e.g. an FS/GS descriptor in the LDT, as
opposed to updating the LDT descriptor in the GDT?

Either way, do you also update all relevant segments via KVM_SET_SREGS after
modifying memory?   Best guess is that KVM doesn't detect that the VM has state
that needs to be emulated, or that KVM's internal register state and what's in
memory are not consistent.

> I've found that the one needs to check KVM_CAP_SYNC_MMU to safely write to
> the guest memory, but it doesn't seem to be documented well. Of course maybe
> my problem has nothing to do with that, but I think it does.  So can it be
> that even though I check for the KVM_CAP_SYNC_MMU, writing to the guest
> memory from host is still unsafe? What is this KVM_CAP_SYNC_MMU actually all
> about?

On x86, KVM_CAP_SYNC_MMU means that the primary MMU will notify KVM of any
relevant changes.  As you surmised, this is needed if userspace is making changes
via to mmap()/mprotect()/etc..., but is also used to react to PFN migration, NUMA
balancing, etc...

Anyways, I highly doubt this is a memory synchronization issue, a corner case
related to lack of unrestricted guest is much more likely.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: guest/host mem out of sync on core2duo?
  2021-06-14 17:06 ` Sean Christopherson
@ 2021-06-14 17:32   ` stsp
  2021-06-17 14:42     ` Sean Christopherson
  0 siblings, 1 reply; 36+ messages in thread
From: stsp @ 2021-06-14 17:32 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvm, Sean Christopherson

14.06.2021 20:06, Sean Christopherson пишет:
> On Sun, Jun 13, 2021, stsp wrote:
>> Hi kvm developers.
>>
>> I am having the strange problem that can only be reproduced on a core2duo CPU
>> but not AMD FX or Intel Core I7.
>>
>> My code has 2 ways of setting the guest registers: one is the guest's ring0
>> stub that just pops all regs from stack and does iret to ring3.  That works
>> fine.  But sometimes I use KVM_SET_SREGS and resume the VM directly to ring3.
>> That randomly results in either a good run or invalid guest state return, or
>> a page fault in guest.
> Hmm, a core2duo failure is more than likely due to lack of unrestricted guest.
> You verify this by loading kvm_intel on the Core i7 with unrestricted_guest=0.

Wow, excellent shot!
Indeed, the problem then starts
reproducing also there!
So at least I now have a problematic
setup myself, rather than needing
to ask for ssh from everyone involved. :)

What does this mean to us, though?
That its completely unrelated to any
memory synchronization?


>> I tried to analyze when either of the above happens exactly, and I have a
>> very strong suspection that the problem is in a way I update LDT. LDT is
>> shared between guest and host with KVM_SET_USER_MEMORY_REGION, and I modify
>> it on host.  So it seems like if I just allocated the new LDT entry, there is
>> a risk of invalid guest state, as if the guest's LDT still doesn't have it.
>> If I modified some LDT entry, there can be a page fault in guest, as if the
>> entry is still old.
> IIUC, you are updating the LDT itself, e.g. an FS/GS descriptor in the LDT, as
> opposed to updating the LDT descriptor in the GDT?

I am updating the LDT itself,
not modifying its descriptor
in gdt. And with the same
KVM_SET_SREGS call I also
update the segregs to the new
values, if needed.


> Either way, do you also update all relevant segments via KVM_SET_SREGS after
> modifying memory?

Yes, if this is needed.
Sometimes its not needed, and
when not - it seems page fault is
more likely. If I also update segregs -
then invalid guest state.
But these are just the statistical
guesses so far.


>     Best guess is that KVM doesn't detect that the VM has state
> that needs to be emulated, or that KVM's internal register state and what's in
> memory are not consistent.

Hope you know what parts are
emulated w/o unrestricted guest,
in which case we can advance. :)


> Anyways, I highly doubt this is a memory synchronization issue, a corner case
> related to lack of unrestricted guest is much more likely.

Just to be sure I tried the CD bit
in CR0 to rule out the caching
issues, and that changes nothing.
So...
What to do next?


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: guest/host mem out of sync on core2duo?
  2021-06-14 17:32   ` stsp
@ 2021-06-17 14:42     ` Sean Christopherson
  2021-06-18 15:59       ` stsp
  0 siblings, 1 reply; 36+ messages in thread
From: Sean Christopherson @ 2021-06-17 14:42 UTC (permalink / raw)
  To: stsp; +Cc: kvm

Dropped my old @intel email to stop getting bounces.

On Mon, Jun 14, 2021, stsp wrote:
> 14.06.2021 20:06, Sean Christopherson пишет:
> > On Sun, Jun 13, 2021, stsp wrote:
> > > Hi kvm developers.
> > > 
> > > I am having the strange problem that can only be reproduced on a core2duo CPU
> > > but not AMD FX or Intel Core I7.
> > > 
> > > My code has 2 ways of setting the guest registers: one is the guest's ring0
> > > stub that just pops all regs from stack and does iret to ring3.  That works
> > > fine.  But sometimes I use KVM_SET_SREGS and resume the VM directly to ring3.
> > > That randomly results in either a good run or invalid guest state return, or
> > > a page fault in guest.
> > Hmm, a core2duo failure is more than likely due to lack of unrestricted guest.
> > You verify this by loading kvm_intel on the Core i7 with unrestricted_guest=0.
> 
> Wow, excellent shot!  Indeed, the problem then starts reproducing also there!
> So at least I now have a problematic setup myself, rather than needing to ask
> for ssh from everyone involved. :)
> 
> What does this mean to us, though?  That its completely unrelated to any
> memory synchronization?

Yes, more than likely this has nothing to do with memory synchronization.

> > > I tried to analyze when either of the above happens exactly, and I have a
> > > very strong suspection that the problem is in a way I update LDT. LDT is
> > > shared between guest and host with KVM_SET_USER_MEMORY_REGION, and I modify
> > > it on host.  So it seems like if I just allocated the new LDT entry, there is
> > > a risk of invalid guest state, as if the guest's LDT still doesn't have it.
> > > If I modified some LDT entry, there can be a page fault in guest, as if the
> > > entry is still old.
> > IIUC, you are updating the LDT itself, e.g. an FS/GS descriptor in the LDT, as
> > opposed to updating the LDT descriptor in the GDT?
> 
> I am updating the LDT itself, not modifying its descriptor in gdt. And with
> the same KVM_SET_SREGS call I also update the segregs to the new values, if
> needed.

Hmm, unconditionally calling KVM_SET_SREGS if you modify anything in the LDT
would be worth trying.  Or did I misunderstand the "if needed" part?

> > Either way, do you also update all relevant segments via KVM_SET_SREGS after
> > modifying memory?
> 
> Yes, if this is needed.  Sometimes its not needed, and when not - it seems
> page fault is more likely. If I also update segregs - then invalid guest
> state.  But these are just the statistical guesses so far.

Ah.  Hrm.  It would still be worth doing KVM_SET_SREGS unconditionally, e.g. it
would narrow the search if the page faults go away and the failures are always
invalid guest state.

> >     Best guess is that KVM doesn't detect that the VM has state
> > that needs to be emulated, or that KVM's internal register state and what's in
> > memory are not consistent.
> 
> Hope you know what parts are emulated w/o unrestricted guest, in which case
> we can advance. :)

It's not parts per se.  KVM needs to emulate "everything", one instruction at a
time, until guest state is no longer invalid with respec to the !unrestricted
rules.

> > Anyways, I highly doubt this is a memory synchronization issue, a corner case
> > related to lack of unrestricted guest is much more likely.
> 
> Just to be sure I tried the CD bit in CR0 to rule out the caching issues, and
> that changes nothing.  So...
>
> What to do next?

In addition to the above experiment, can you get a state dump for the invalid
guest state failure?  I.e. load kvm_intel with dump_invalid_vmcs=1.  And on that
failure, also provide the input to KVM_SET_SREGS.  The LDT in memory might also
be interesting, but it's hopefully unnecessary, especially if unconditionally
doing kVM_SET_SREGS makes the page faults go away.

Best case scenario is that KVM_SET_SREGS stuffs invalid guest state that KVM
doesn't correct detect.  That would be easy to debug and fix, and would give us
a regression test as well.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: guest/host mem out of sync on core2duo?
  2021-06-17 14:42     ` Sean Christopherson
@ 2021-06-18 15:59       ` stsp
  2021-06-18 21:07         ` Jim Mattson
  0 siblings, 1 reply; 36+ messages in thread
From: stsp @ 2021-06-18 15:59 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvm

Hi,

17.06.2021 17:42, Sean Christopherson пишет:
> On Mon, Jun 14, 2021, stsp wrote:
>
>> Wow, excellent shot!  Indeed, the problem then starts reproducing also there!
>> So at least I now have a problematic setup myself, rather than needing to ask
>> for ssh from everyone involved. :)
>>
>> What does this mean to us, though?  That its completely unrelated to any
>> memory synchronization?
> Yes, more than likely this has nothing to do with memory synchronization.

But this still has something to
do with LDT I suppose. Because
before the guest app switches
to prot mode, nothing bad happens.


>>>> I tried to analyze when either of the above happens exactly, and I have a
>>>> very strong suspection that the problem is in a way I update LDT. LDT is
>>>> shared between guest and host with KVM_SET_USER_MEMORY_REGION, and I modify
>>>> it on host.  So it seems like if I just allocated the new LDT entry, there is
>>>> a risk of invalid guest state, as if the guest's LDT still doesn't have it.
>>>> If I modified some LDT entry, there can be a page fault in guest, as if the
>>>> entry is still old.
>>> IIUC, you are updating the LDT itself, e.g. an FS/GS descriptor in the LDT, as
>>> opposed to updating the LDT descriptor in the GDT?
>> I am updating the LDT itself, not modifying its descriptor in gdt. And with
>> the same KVM_SET_SREGS call I also update the segregs to the new values, if
>> needed.
> Hmm, unconditionally calling KVM_SET_SREGS if you modify anything in the LDT
> would be worth trying.  Or did I misunderstand the "if needed" part?

I mean, the registers are modified
only when there is a reason to, for
example I need to inject the interrupt
and call the guest's interrupt handler
by hands. But I did the "unconditional"
test already (and repeated it now) and
indeed it makes the crash reproducible
100%, with only invalid guest state.
Which is very good to debug, indeed.
I am not sure if you want to reproduce
the bug locally or not. Of course it
would be great if you do. :)

For that I pushed this change:
https://github.com/dosemu2/dosemu2/commit/3d83a866c5f04e8902cd653474e03c60e5bdc108
which makes KVM_SET_SREGS to
be called with different regs always.
This is achieved by ignoring the fact
that we stopped in a ring0 monitor,
and always setting up the ring3 context.
It still works on the AMD cpu or with
unrestricted guest on I7.
So... if you really like to replicate the
setup, you will need to first install
dosemu2 from a binary package, to
get the dependencies right. Dealing
with its deps by hands, usually leads
nowhere, so here is the list of distros
for which there are packages:
https://github.com/dosemu2/dosemu2/blob/devel/README

So after installing it from package
and getting deps in, you would need
to uninstall it and build from git, to
get the reproducer patch in (branch
kvm_bug). "make deb" or "make rpm"
can be useful for finding the build-time
deps, but usually

./default-configure -d

is enough (-d means debug build).


>>> Either way, do you also update all relevant segments via KVM_SET_SREGS after
>>> modifying memory?
>> Yes, if this is needed.  Sometimes its not needed, and when not - it seems
>> page fault is more likely. If I also update segregs - then invalid guest
>> state.  But these are just the statistical guesses so far.
> Ah.  Hrm.  It would still be worth doing KVM_SET_SREGS unconditionally, e.g. it
> would narrow the search if the page faults go away and the failures are always
> invalid guest state.

This is what happened indeed, yes.


>>> Anyways, I highly doubt this is a memory synchronization issue, a corner case
>>> related to lack of unrestricted guest is much more likely.
>> Just to be sure I tried the CD bit in CR0 to rule out the caching issues, and
>> that changes nothing.  So...
>>
>> What to do next?
> In addition to the above experiment, can you get a state dump for the invalid
> guest state failure?  I.e. load kvm_intel with dump_invalid_vmcs=1.

It appears I can dynamically modify
the module params via sysfs, so I
didn't even need to reload the kvm_intel.

Dynamic modification seems to work
well with dump_invalid_vmcs but not
with unrestricted_guest, where I still
need to reload the module to change
the param.


>    And on that
> failure, also provide the input to KVM_SET_SREGS.

Here it goes.
But I studied it quite thoroughly
and can't see anything obviously
wrong.


[7011807.029737] *** Guest State ***
[7011807.029742] CR0: actual=0x0000000080000031, 
shadow=0x00000000e0000031, gh_mask=fffffffffffffff7
[7011807.029743] CR4: actual=0x0000000000002041, 
shadow=0x0000000000000001, gh_mask=ffffffffffffe871
[7011807.029744] CR3 = 0x000000000a709000
[7011807.029745] RSP = 0x000000000000eff0  RIP = 0x000000000000017c
[7011807.029746] RFLAGS=0x00080202         DR7 = 0x0000000000000400
[7011807.029747] Sysenter RSP=0000000000000000 CS:RIP=0000:0000000000000000
[7011807.029749] CS:   sel=0x0097, attr=0x040fb, limit=0x000001a0, 
base=0x0000000002110000
[7011807.029751] DS:   sel=0x00f7, attr=0x0c0f2, limit=0xffffffff, 
base=0x0000000000000000
[7011807.029752] SS:   sel=0x009f, attr=0x040f3, limit=0x0000efff, 
base=0x0000000002111000
[7011807.029753] ES:   sel=0x00f7, attr=0x0c0f2, limit=0xffffffff, 
base=0x0000000000000000
[7011807.029764] FS:   sel=0x0000, attr=0x10000, limit=0x00000000, 
base=0x0000000000000000
[7011807.029765] GS:   sel=0x0000, attr=0x10000, limit=0x00000000, 
base=0x0000000000000000
[7011807.029767] GDTR:                           limit=0x00000017, 
base=0x000000000a708100
[7011807.029768] LDTR: sel=0x0010, attr=0x00082, limit=0x0000ffff, 
base=0x000000000ab0a000
[7011807.029769] IDTR:                           limit=0x000007ff, 
base=0x000000000a708200
[7011807.029770] TR:   sel=0x0010, attr=0x0008b, limit=0x00002088, 
base=0x000000000a706000
[7011807.029771] EFER =     0x0000000000000000  PAT = 0x0007040600070406
[7011807.029772] DebugCtl = 0x0000000000000000  DebugExceptions = 
0x0000000000000000
[7011807.029783] Interruptibility = 00000000  ActivityState = 00000000
[7011807.029784] *** Host State ***
[7011807.029785] RIP = 0xffffffffc12525b0  RSP = 0xffffbc831306fc68
[7011807.029787] CS=0010 SS=0018 DS=0000 ES=0000 FS=0000 GS=0000 TR=0040
[7011807.029788] FSBase=00007ffff70a7780 GSBase=ffff968085b40000 
TRBase=fffffe0000102000
[7011807.029789] GDTBase=fffffe0000100000 IDTBase=fffffe0000000000
[7011807.029790] CR0=0000000080050033 CR3=00000004c0642001 
CR4=00000000001626e0
[7011807.029791] Sysenter RSP=fffffe0000102000 CS:RIP=0010:ffffffffbd001720
[7011807.029792] EFER = 0x0000000000000d01  PAT = 0x0407050600070106
[7011807.029793] *** Control State ***
[7011807.029794] PinBased=0000007f CPUBased=b5986dfa SecondaryExec=00002c6a
[7011807.029795] EntryControls=0000d1ff ExitControls=002befff
[7011807.029796] ExceptionBitmap=00060042 PFECmask=00000000 
PFECmatch=00000000
[7011807.029797] VMEntry: intr_info=000004e6 errcode=00000000 ilen=00000002
[7011807.029798] VMExit: intr_info=00000000 errcode=00000000 ilen=00000001
[7011807.029799]         reason=80000021 qualification=0000000000000000
[7011807.029800] IDTVectoring: info=00000000 errcode=00000000
[7011807.029801] TSC Offset = 0xffa686ee1681acf4
[7011807.029802] EPT pointer = 0x00000005fcec605e
[7011807.029803] PLE Gap=00000080 Window=00001000
[7011807.029803] Virtual processor ID = 0x0001


gp regs dump:

(gdb) p /x *regs
$2 = {ebx = 0xff, ecx = 0x1, edx = 0x7bf, esi = 0x772, edi = 0x6e,
   ebp = 0x91e, eax = 0x10001, __null_ds = 0xf7, __null_es = 0xf7,
   __null_fs = 0x0, __null_gs = 0x0, orig_eax = 0xd0000, eip = 0x17c,
   cs = 0x97, __csh = 0x0, eflags = 0x80202, esp = 0xeff0, ss = 0x9f,
   __ssh = 0x0, es = 0xc93d, __esh = 0x0, ds = 0xc53c, __dsh = 0x0, fs = 
0x0,
   __fsh = 0x0, gs = 0x0, __gsh = 0x0}


sregs dump:

(gdb) p /x sregs
$3 = {cs = {base = 0x2110000, limit = 0x1a0, selector = 0x97, type = 0xb,
     present = 0x1, dpl = 0x3, db = 0x1, s = 0x1, l = 0x0, g = 0x0, avl 
= 0x0,
     unusable = 0x0, padding = 0x0}, ds = {base = 0x0, limit = 0xffffffff,
     selector = 0xf7, type = 0x2, present = 0x1, dpl = 0x3, db = 0x1, s 
= 0x1,
     l = 0x0, g = 0x1, avl = 0x0, unusable = 0x0, padding = 0x0}, es = {
     base = 0x0, limit = 0xffffffff, selector = 0xf7, type = 0x2,
     present = 0x1, dpl = 0x3, db = 0x1, s = 0x1, l = 0x0, g = 0x1, avl 
= 0x0,
     unusable = 0x0, padding = 0x0}, fs = {base = 0x0, limit = 0x0,
     selector = 0x0, type = 0x0, present = 0x0, dpl = 0x0, db = 0x0, s = 
0x0,
     l = 0x0, g = 0x0, avl = 0x0, unusable = 0x1, padding = 0x0}, gs = {
     base = 0x0, limit = 0x0, selector = 0x0, type = 0x0, present = 0x0,
     dpl = 0x0, db = 0x0, s = 0x0, l = 0x0, g = 0x0, avl = 0x0, unusable 
= 0x1,
     padding = 0x0}, ss = {base = 0x2111000, limit = 0xefff, selector = 
0x9f,
     type = 0x3, present = 0x1, dpl = 0x3, db = 0x1, s = 0x1, l = 0x0, g 
= 0x0,
     avl = 0x0, unusable = 0x0, padding = 0x0}, tr = {base = 0xa706000,
     limit = 0x2088, selector = 0x10, type = 0xb, present = 0x1, dpl = 0x0,
     db = 0x0, s = 0x0, l = 0x0, g = 0x0, avl = 0x0, unusable = 0x0,
     padding = 0x0}, ldt = {base = 0xab0a000, limit = 0xffff, selector = 
0x10,
     type = 0x2, present = 0x1, dpl = 0x0, db = 0x0, s = 0x0, l = 0x0, g 
= 0x0,
     avl = 0x0, unusable = 0x0, padding = 0x0}, gdt = {base = 0xa708100,
     limit = 0x17, padding = {0x0, 0x0, 0x0}}, idt = {base = 0xa708200,
     limit = 0x7ff, padding = {0x0, 0x0, 0x0}}, cr0 = 0xe0000031, cr2 = 
0x0,
   cr3 = 0xa709000, cr4 = 0x1, cr8 = 0x0, efer = 0x0, apic_base = 
0xfee00900,
   interrupt_bitmap = {0x0, 0x0, 0x0, 0x0}}

> Best case scenario is that KVM_SET_SREGS stuffs invalid guest state that KVM
> doesn't correct detect.  That would be easy to debug and fix, and would give us
> a regression test as well.

OK, let me know how can we
progress towards that point. :)


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: guest/host mem out of sync on core2duo?
  2021-06-18 15:59       ` stsp
@ 2021-06-18 21:07         ` Jim Mattson
  2021-06-18 21:55           ` stsp
  2021-06-21  2:34           ` exception vs SIGALRM race (was: Re: guest/host mem out of sync on core2duo?) stsp
  0 siblings, 2 replies; 36+ messages in thread
From: Jim Mattson @ 2021-06-18 21:07 UTC (permalink / raw)
  To: stsp; +Cc: Sean Christopherson, kvm

On Fri, Jun 18, 2021 at 9:02 AM stsp <stsp2@yandex.ru> wrote:

> Here it goes.
> But I studied it quite thoroughly
> and can't see anything obviously
> wrong.
>
>
> [7011807.029737] *** Guest State ***
> [7011807.029742] CR0: actual=0x0000000080000031,
> shadow=0x00000000e0000031, gh_mask=fffffffffffffff7
> [7011807.029743] CR4: actual=0x0000000000002041,
> shadow=0x0000000000000001, gh_mask=ffffffffffffe871
> [7011807.029744] CR3 = 0x000000000a709000
> [7011807.029745] RSP = 0x000000000000eff0  RIP = 0x000000000000017c
> [7011807.029746] RFLAGS=0x00080202         DR7 = 0x0000000000000400
> [7011807.029747] Sysenter RSP=0000000000000000 CS:RIP=0000:0000000000000000
> [7011807.029749] CS:   sel=0x0097, attr=0x040fb, limit=0x000001a0,
> base=0x0000000002110000
> [7011807.029751] DS:   sel=0x00f7, attr=0x0c0f2, limit=0xffffffff,
> base=0x0000000000000000

I believe DS is illegal. Per the SDM, Checks on Guest Segment Registers:

* If the guest will not be virtual-8086, the different sub-fields are
considered separately:
  - Bits 3:0 (Type).
    * DS, ES, FS, GS. The following checks apply if the register is usable:
      - Bit 0 of the Type must be 1 (accessed).

> [7011807.029752] SS:   sel=0x009f, attr=0x040f3, limit=0x0000efff,
> base=0x0000000002111000
> [7011807.029753] ES:   sel=0x00f7, attr=0x0c0f2, limit=0xffffffff,
> base=0x0000000000000000

And I believe ES is also illegal, for the same reason.

> [7011807.029764] FS:   sel=0x0000, attr=0x10000, limit=0x00000000,
> base=0x0000000000000000
> [7011807.029765] GS:   sel=0x0000, attr=0x10000, limit=0x00000000,
> base=0x0000000000000000
> [7011807.029767] GDTR:                           limit=0x00000017,
> base=0x000000000a708100
> [7011807.029768] LDTR: sel=0x0010, attr=0x00082, limit=0x0000ffff,
> base=0x000000000ab0a000
> [7011807.029769] IDTR:                           limit=0x000007ff,
> base=0x000000000a708200
> [7011807.029770] TR:   sel=0x0010, attr=0x0008b, limit=0x00002088,
> base=0x000000000a706000

It seems a bit odd that TR and LDTR are both 0x10,  but that's perfectly legal.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: guest/host mem out of sync on core2duo?
  2021-06-18 21:07         ` Jim Mattson
@ 2021-06-18 21:55           ` stsp
  2021-06-18 22:06             ` Jim Mattson
  2021-06-21  2:34           ` exception vs SIGALRM race (was: Re: guest/host mem out of sync on core2duo?) stsp
  1 sibling, 1 reply; 36+ messages in thread
From: stsp @ 2021-06-18 21:55 UTC (permalink / raw)
  To: Jim Mattson; +Cc: Sean Christopherson, kvm

19.06.2021 00:07, Jim Mattson пишет:
> On Fri, Jun 18, 2021 at 9:02 AM stsp <stsp2@yandex.ru> wrote:
>
>> Here it goes.
>> But I studied it quite thoroughly
>> and can't see anything obviously
>> wrong.
>>
>>
>> [7011807.029737] *** Guest State ***
>> [7011807.029742] CR0: actual=0x0000000080000031,
>> shadow=0x00000000e0000031, gh_mask=fffffffffffffff7
>> [7011807.029743] CR4: actual=0x0000000000002041,
>> shadow=0x0000000000000001, gh_mask=ffffffffffffe871
>> [7011807.029744] CR3 = 0x000000000a709000
>> [7011807.029745] RSP = 0x000000000000eff0  RIP = 0x000000000000017c
>> [7011807.029746] RFLAGS=0x00080202         DR7 = 0x0000000000000400
>> [7011807.029747] Sysenter RSP=0000000000000000 CS:RIP=0000:0000000000000000
>> [7011807.029749] CS:   sel=0x0097, attr=0x040fb, limit=0x000001a0,
>> base=0x0000000002110000
>> [7011807.029751] DS:   sel=0x00f7, attr=0x0c0f2, limit=0xffffffff,
>> base=0x0000000000000000
> I believe DS is illegal. Per the SDM, Checks on Guest Segment Registers:
>
> * If the guest will not be virtual-8086, the different sub-fields are
> considered separately:
>    - Bits 3:0 (Type).
>      * DS, ES, FS, GS. The following checks apply if the register is usable:
>        - Bit 0 of the Type must be 1 (accessed).

That seems to be it, thank you!
At least for the minimal reproducer
I've done.

So only with unrestricted guest its
possible to ignore that field?


> [7011807.029764] FS:   sel=0x0000, attr=0x10000, limit=0x00000000,
> base=0x0000000000000000
> [7011807.029765] GS:   sel=0x0000, attr=0x10000, limit=0x00000000,
> base=0x0000000000000000
> [7011807.029767] GDTR:                           limit=0x00000017,
> base=0x000000000a708100
> [7011807.029768] LDTR: sel=0x0010, attr=0x00082, limit=0x0000ffff,
> base=0x000000000ab0a000
> [7011807.029769] IDTR:                           limit=0x000007ff,
> base=0x000000000a708200
> [7011807.029770] TR:   sel=0x0010, attr=0x0008b, limit=0x00002088,
> base=0x000000000a706000
> It seems a bit odd that TR and LDTR are both 0x10,  but that's perfectly legal.

This selector is fake.
Our guest doesn't do LLDT or LTR,
so we didn't care to even reserve
the GDT entries for those.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: guest/host mem out of sync on core2duo?
  2021-06-18 21:55           ` stsp
@ 2021-06-18 22:06             ` Jim Mattson
  2021-06-18 22:26               ` stsp
  2021-06-18 22:32               ` Sean Christopherson
  0 siblings, 2 replies; 36+ messages in thread
From: Jim Mattson @ 2021-06-18 22:06 UTC (permalink / raw)
  To: stsp; +Cc: Sean Christopherson, kvm list

On Fri, Jun 18, 2021 at 2:55 PM stsp <stsp2@yandex.ru> wrote:
>
> 19.06.2021 00:07, Jim Mattson пишет:
> > On Fri, Jun 18, 2021 at 9:02 AM stsp <stsp2@yandex.ru> wrote:
> >
> >> Here it goes.
> >> But I studied it quite thoroughly
> >> and can't see anything obviously
> >> wrong.
> >>
> >>
> >> [7011807.029737] *** Guest State ***
> >> [7011807.029742] CR0: actual=0x0000000080000031,
> >> shadow=0x00000000e0000031, gh_mask=fffffffffffffff7
> >> [7011807.029743] CR4: actual=0x0000000000002041,
> >> shadow=0x0000000000000001, gh_mask=ffffffffffffe871
> >> [7011807.029744] CR3 = 0x000000000a709000
> >> [7011807.029745] RSP = 0x000000000000eff0  RIP = 0x000000000000017c
> >> [7011807.029746] RFLAGS=0x00080202         DR7 = 0x0000000000000400
> >> [7011807.029747] Sysenter RSP=0000000000000000 CS:RIP=0000:0000000000000000
> >> [7011807.029749] CS:   sel=0x0097, attr=0x040fb, limit=0x000001a0,
> >> base=0x0000000002110000
> >> [7011807.029751] DS:   sel=0x00f7, attr=0x0c0f2, limit=0xffffffff,
> >> base=0x0000000000000000
> > I believe DS is illegal. Per the SDM, Checks on Guest Segment Registers:
> >
> > * If the guest will not be virtual-8086, the different sub-fields are
> > considered separately:
> >    - Bits 3:0 (Type).
> >      * DS, ES, FS, GS. The following checks apply if the register is usable:
> >        - Bit 0 of the Type must be 1 (accessed).
>
> That seems to be it, thank you!
> At least for the minimal reproducer
> I've done.
>
> So only with unrestricted guest its
> possible to ignore that field?

The VM-entry constraints are the same with unrestricted guest.

Note that *without* unrestricted guest, kvm will generally have to
emulate the early guest protected mode code--until the last vestiges
of real-address mode are purged from the descriptor cache. Maybe it
fails to set the accessed bits in the LDT on emulated segment register
loads?

> > [7011807.029764] FS:   sel=0x0000, attr=0x10000, limit=0x00000000,
> > base=0x0000000000000000
> > [7011807.029765] GS:   sel=0x0000, attr=0x10000, limit=0x00000000,
> > base=0x0000000000000000
> > [7011807.029767] GDTR:                           limit=0x00000017,
> > base=0x000000000a708100
> > [7011807.029768] LDTR: sel=0x0010, attr=0x00082, limit=0x0000ffff,
> > base=0x000000000ab0a000
> > [7011807.029769] IDTR:                           limit=0x000007ff,
> > base=0x000000000a708200
> > [7011807.029770] TR:   sel=0x0010, attr=0x0008b, limit=0x00002088,
> > base=0x000000000a706000
> > It seems a bit odd that TR and LDTR are both 0x10,  but that's perfectly legal.
>
> This selector is fake.
> Our guest doesn't do LLDT or LTR,
> so we didn't care to even reserve
> the GDT entries for those.
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: guest/host mem out of sync on core2duo?
  2021-06-18 22:06             ` Jim Mattson
@ 2021-06-18 22:26               ` stsp
  2021-06-18 22:32               ` Sean Christopherson
  1 sibling, 0 replies; 36+ messages in thread
From: stsp @ 2021-06-18 22:26 UTC (permalink / raw)
  To: Jim Mattson; +Cc: Sean Christopherson, kvm list

19.06.2021 01:06, Jim Mattson пишет:
> On Fri, Jun 18, 2021 at 2:55 PM stsp <stsp2@yandex.ru> wrote:
>> 19.06.2021 00:07, Jim Mattson пишет:
>>> On Fri, Jun 18, 2021 at 9:02 AM stsp <stsp2@yandex.ru> wrote:
>>>
>>>> Here it goes.
>>>> But I studied it quite thoroughly
>>>> and can't see anything obviously
>>>> wrong.
>>>>
>>>>
>>>> [7011807.029737] *** Guest State ***
>>>> [7011807.029742] CR0: actual=0x0000000080000031,
>>>> shadow=0x00000000e0000031, gh_mask=fffffffffffffff7
>>>> [7011807.029743] CR4: actual=0x0000000000002041,
>>>> shadow=0x0000000000000001, gh_mask=ffffffffffffe871
>>>> [7011807.029744] CR3 = 0x000000000a709000
>>>> [7011807.029745] RSP = 0x000000000000eff0  RIP = 0x000000000000017c
>>>> [7011807.029746] RFLAGS=0x00080202         DR7 = 0x0000000000000400
>>>> [7011807.029747] Sysenter RSP=0000000000000000 CS:RIP=0000:0000000000000000
>>>> [7011807.029749] CS:   sel=0x0097, attr=0x040fb, limit=0x000001a0,
>>>> base=0x0000000002110000
>>>> [7011807.029751] DS:   sel=0x00f7, attr=0x0c0f2, limit=0xffffffff,
>>>> base=0x0000000000000000
>>> I believe DS is illegal. Per the SDM, Checks on Guest Segment Registers:
>>>
>>> * If the guest will not be virtual-8086, the different sub-fields are
>>> considered separately:
>>>     - Bits 3:0 (Type).
>>>       * DS, ES, FS, GS. The following checks apply if the register is usable:
>>>         - Bit 0 of the Type must be 1 (accessed).
>> That seems to be it, thank you!
>> At least for the minimal reproducer
>> I've done.
>>
>> So only with unrestricted guest its
>> possible to ignore that field?
> The VM-entry constraints are the same with unrestricted guest.
>
> Note that *without* unrestricted guest, kvm will generally have to
> emulate the early guest protected mode code--until the last vestiges
> of real-address mode are purged from the descriptor cache. Maybe it
> fails to set the accessed bits in the LDT on emulated segment register
> loads?
I believe this is where the KVM_SET_SREGS
difference kicks in. When the segregs are
loaded in guest's ring0, there is no problem.
Likely in this case the accessed bit is properly
set.
But if we bypass the ring0 trampoline, then
the just created new LDT entry doesn't yet
have the accessed bit, and that propagates
to KVM_SET_SREGS. I believe I should just
force the accessed bit for KVM_SET_SREGS?

But there is no such problem if unrestricted
guest is available, so not everything is yet
clear.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: guest/host mem out of sync on core2duo?
  2021-06-18 22:06             ` Jim Mattson
  2021-06-18 22:26               ` stsp
@ 2021-06-18 22:32               ` Sean Christopherson
  2021-06-19  0:11                 ` stsp
  1 sibling, 1 reply; 36+ messages in thread
From: Sean Christopherson @ 2021-06-18 22:32 UTC (permalink / raw)
  To: Jim Mattson; +Cc: stsp, kvm list

On Fri, Jun 18, 2021, Jim Mattson wrote:
> On Fri, Jun 18, 2021 at 2:55 PM stsp <stsp2@yandex.ru> wrote:
> >
> > 19.06.2021 00:07, Jim Mattson пишет:
> > > On Fri, Jun 18, 2021 at 9:02 AM stsp <stsp2@yandex.ru> wrote:
> > >
> > >> Here it goes.
> > >> But I studied it quite thoroughly
> > >> and can't see anything obviously
> > >> wrong.
> > >>
> > >>
> > >> [7011807.029737] *** Guest State ***
> > >> [7011807.029742] CR0: actual=0x0000000080000031,
> > >> shadow=0x00000000e0000031, gh_mask=fffffffffffffff7
> > >> [7011807.029743] CR4: actual=0x0000000000002041,
> > >> shadow=0x0000000000000001, gh_mask=ffffffffffffe871
> > >> [7011807.029744] CR3 = 0x000000000a709000
> > >> [7011807.029745] RSP = 0x000000000000eff0  RIP = 0x000000000000017c
> > >> [7011807.029746] RFLAGS=0x00080202         DR7 = 0x0000000000000400
> > >> [7011807.029747] Sysenter RSP=0000000000000000 CS:RIP=0000:0000000000000000
> > >> [7011807.029749] CS:   sel=0x0097, attr=0x040fb, limit=0x000001a0,
> > >> base=0x0000000002110000
> > >> [7011807.029751] DS:   sel=0x00f7, attr=0x0c0f2, limit=0xffffffff,
> > >> base=0x0000000000000000
> > > I believe DS is illegal. Per the SDM, Checks on Guest Segment Registers:
> > >
> > > * If the guest will not be virtual-8086, the different sub-fields are
> > > considered separately:
> > >    - Bits 3:0 (Type).
> > >      * DS, ES, FS, GS. The following checks apply if the register is usable:
> > >        - Bit 0 of the Type must be 1 (accessed).
> >
> > That seems to be it, thank you!
> > At least for the minimal reproducer
> > I've done.
> >
> > So only with unrestricted guest its
> > possible to ignore that field?
> 
> The VM-entry constraints are the same with unrestricted guest.
> 
> Note that *without* unrestricted guest, kvm will generally have to
> emulate the early guest protected mode code--until the last vestiges
> of real-address mode are purged from the descriptor cache. Maybe it
> fails to set the accessed bits in the LDT on emulated segment register
> loads?

Argh!  Check out this gem:

	/*
	 *   Fix the "Accessed" bit in AR field of segment registers for older
	 * qemu binaries.
	 *   IA32 arch specifies that at the time of processor reset the
	 * "Accessed" bit in the AR field of segment registers is 1. And qemu
	 * is setting it to 0 in the userland code. This causes invalid guest
	 * state vmexit when "unrestricted guest" mode is turned on.
	 *    Fix for this setup issue in cpu_reset is being pushed in the qemu
	 * tree. Newer qemu binaries with that qemu fix would not need this
	 * kvm hack.
	 */
	if (is_unrestricted_guest(vcpu) && (seg != VCPU_SREG_LDTR))
		var->type |= 0x1; /* Accessed */


KVM fixes up segs when unrestricted guest is enabled, but otherwise leaves 'em
be, presumably because it has the emulator to fall back on for invalid state.
Guess what's missing in the invalid state check...

I think this should do it:

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 68a72c80bd3f..a753b9859826 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3427,6 +3427,8 @@ static bool data_segment_valid(struct kvm_vcpu *vcpu, int seg)
                if (var.dpl < rpl) /* DPL < RPL */
                        return false;
        }
+       if (!(var.type & VMX_AR_TYPE_ACCESSES_MASK))
+               return false;

        /* TODO: Add other members to kvm_segment_field to allow checking for other access
         * rights flags

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: guest/host mem out of sync on core2duo?
  2021-06-18 22:32               ` Sean Christopherson
@ 2021-06-19  0:11                 ` stsp
  2021-06-19  0:54                   ` Sean Christopherson
  0 siblings, 1 reply; 36+ messages in thread
From: stsp @ 2021-06-19  0:11 UTC (permalink / raw)
  To: Sean Christopherson, Jim Mattson; +Cc: kvm list

19.06.2021 01:32, Sean Christopherson пишет:
> Argh!  Check out this gem:
>
> 	/*
> 	 *   Fix the "Accessed" bit in AR field of segment registers for older
> 	 * qemu binaries.
> 	 *   IA32 arch specifies that at the time of processor reset the
> 	 * "Accessed" bit in the AR field of segment registers is 1. And qemu
> 	 * is setting it to 0 in the userland code. This causes invalid guest
> 	 * state vmexit when "unrestricted guest" mode is turned on.
> 	 *    Fix for this setup issue in cpu_reset is being pushed in the qemu
> 	 * tree. Newer qemu binaries with that qemu fix would not need this
> 	 * kvm hack.
> 	 */
> 	if (is_unrestricted_guest(vcpu) && (seg != VCPU_SREG_LDTR))
> 		var->type |= 0x1; /* Accessed */
>
>
> KVM fixes up segs when unrestricted guest is enabled, but otherwise leaves 'em
> be, presumably because it has the emulator to fall back on for invalid state.
> Guess what's missing in the invalid state check...
>
> I think this should do it:
Until when will it run on an
emulator in this case?
Will it be too slow without a
slightest hint to the user?

If it is indeed the performance
penalty for no good reason,
then my preference would be
to get an error right from
KVM_SET_SREGS instead,
or maybe from KVM_RUN,
but not run everything on
an emulator.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: guest/host mem out of sync on core2duo?
  2021-06-19  0:11                 ` stsp
@ 2021-06-19  0:54                   ` Sean Christopherson
  2021-06-19  9:18                     ` stsp
  0 siblings, 1 reply; 36+ messages in thread
From: Sean Christopherson @ 2021-06-19  0:54 UTC (permalink / raw)
  To: stsp; +Cc: Jim Mattson, kvm list

On Sat, Jun 19, 2021, stsp wrote:
> 19.06.2021 01:32, Sean Christopherson пишет:
> > Argh!  Check out this gem:
> > 
> > 	/*
> > 	 *   Fix the "Accessed" bit in AR field of segment registers for older
> > 	 * qemu binaries.
> > 	 *   IA32 arch specifies that at the time of processor reset the
> > 	 * "Accessed" bit in the AR field of segment registers is 1. And qemu
> > 	 * is setting it to 0 in the userland code. This causes invalid guest
> > 	 * state vmexit when "unrestricted guest" mode is turned on.
> > 	 *    Fix for this setup issue in cpu_reset is being pushed in the qemu
> > 	 * tree. Newer qemu binaries with that qemu fix would not need this
> > 	 * kvm hack.
> > 	 */
> > 	if (is_unrestricted_guest(vcpu) && (seg != VCPU_SREG_LDTR))
> > 		var->type |= 0x1; /* Accessed */
> > 
> > 
> > KVM fixes up segs when unrestricted guest is enabled, but otherwise leaves 'em
> > be, presumably because it has the emulator to fall back on for invalid state.
> > Guess what's missing in the invalid state check...
> > 
> > I think this should do it:
> Until when will it run on an emulator in this case?  Will it be too slow
> without a slightest hint to the user?

KVM would emulate until the invalid state went away, i.e. until the offending
register was loaded with a new segment that set the Accessed bit.

> If it is indeed the performance penalty for no good reason, then my
> preference would be to get an error right from KVM_SET_SREGS instead, or
> maybe from KVM_RUN, but not run everything on an emulator.

Sadly, to be consistent with other segments (SS and CS), I believe detecting and
emulating is the right "fix".  Ideally, KVM would differentiate between "invalid
for !unrestricted_guest" and "always invalid", with the latter being rejected and
punted to userspace.  E.g. I don't think it's possible for a physical CPU to have
a loaded segment with the Accessed bit set.  Unfortunately that ship sailed long,
long ago.

One possibility would be to try disabling emulate_invalid_guest_state via module
param.  That should cause failure instead of emulating.  But I suspect that that
appraoch will cause explosions for your core2duo users as KVM is probably
emulating at other points for them.  :-/

The other thing you could do would be to add a bit instrumention to query the
number of instructions KVM has emulated and alert the user if it exceeds some
arbitrary threshold.  The hiccup there is that KVM's stats are currently on
debugfs, which is usually root-only.

  $ tail /sys/kernel/debug/kvm/insn_emulation
  0

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: guest/host mem out of sync on core2duo?
  2021-06-19  0:54                   ` Sean Christopherson
@ 2021-06-19  9:18                     ` stsp
  0 siblings, 0 replies; 36+ messages in thread
From: stsp @ 2021-06-19  9:18 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Jim Mattson, kvm list

19.06.2021 03:54, Sean Christopherson пишет:
> On Sat, Jun 19, 2021, stsp wrote:
>> 19.06.2021 01:32, Sean Christopherson пишет:
>>> Argh!  Check out this gem:
>>>
>>> 	/*
>>> 	 *   Fix the "Accessed" bit in AR field of segment registers for older
>>> 	 * qemu binaries.
>>> 	 *   IA32 arch specifies that at the time of processor reset the
>>> 	 * "Accessed" bit in the AR field of segment registers is 1. And qemu
>>> 	 * is setting it to 0 in the userland code. This causes invalid guest
>>> 	 * state vmexit when "unrestricted guest" mode is turned on.
>>> 	 *    Fix for this setup issue in cpu_reset is being pushed in the qemu
>>> 	 * tree. Newer qemu binaries with that qemu fix would not need this
>>> 	 * kvm hack.
>>> 	 */
>>> 	if (is_unrestricted_guest(vcpu) && (seg != VCPU_SREG_LDTR))
>>> 		var->type |= 0x1; /* Accessed */
>>>
>>>
>>> KVM fixes up segs when unrestricted guest is enabled, but otherwise leaves 'em
>>> be, presumably because it has the emulator to fall back on for invalid state.
>>> Guess what's missing in the invalid state check...
>>>
>>> I think this should do it:
>> Until when will it run on an emulator in this case?  Will it be too slow
>> without a slightest hint to the user?
> KVM would emulate until the invalid state went away, i.e. until the offending
> register was loaded with a new segment that set the Accessed bit.
Such condition will happen
pretty quickly if the emulator
sets the accessed bit also in LDT.
Does it do that?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* exception vs SIGALRM race (was: Re: guest/host mem out of sync on core2duo?)
  2021-06-18 21:07         ` Jim Mattson
  2021-06-18 21:55           ` stsp
@ 2021-06-21  2:34           ` stsp
  2021-06-21 22:33             ` Jim Mattson
  1 sibling, 1 reply; 36+ messages in thread
From: stsp @ 2021-06-21  2:34 UTC (permalink / raw)
  To: Jim Mattson; +Cc: Sean Christopherson, kvm

19.06.2021 00:07, Jim Mattson пишет:
> I believe DS is illegal. Per the SDM, Checks on Guest Segment Registers:
OK, so this indeed have solved
the biggest part of the problem,
thanks again.

Now back to the original problem,
where I was getting a page fault
on some CPUs sometimes.
I digged a bit more.
It seems I am getting a race of
this kind: exception in guest happens
at the same time when the host's
SIGALRM arrives. KVM returns to
host with the exception somehow
"pending", but its still on ring3, not
switched to the ring0 handler.

Then from host I inject the interrupt
(which is what SIGALRM asks for),
and when I enter the guest, it throws
the pending exception instead of
executing the interrupt handler.
I suspect the bug is again on my side,
but I am not sure how to handle that
kind of race. I suppose I need to look
at some interruptibility state to find
out that the interrupt cannot be injected
at that time. But I can't find if KVM
exports the interruptibility state, other
than guest's IF/VIF flag, which is not
enough in this case.
Also I am a bit puzzled why I can't
see such race on an I7 CPU even
after disabling the unrestricted_guest.

Any ideas? :)

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: exception vs SIGALRM race (was: Re: guest/host mem out of sync on core2duo?)
  2021-06-21  2:34           ` exception vs SIGALRM race (was: Re: guest/host mem out of sync on core2duo?) stsp
@ 2021-06-21 22:33             ` Jim Mattson
  2021-06-21 23:32               ` stsp
                                 ` (3 more replies)
  0 siblings, 4 replies; 36+ messages in thread
From: Jim Mattson @ 2021-06-21 22:33 UTC (permalink / raw)
  To: stsp; +Cc: Sean Christopherson, kvm

On Sun, Jun 20, 2021 at 7:34 PM stsp <stsp2@yandex.ru> wrote:
>
> 19.06.2021 00:07, Jim Mattson пишет:
> > I believe DS is illegal. Per the SDM, Checks on Guest Segment Registers:
> OK, so this indeed have solved
> the biggest part of the problem,
> thanks again.
>
> Now back to the original problem,
> where I was getting a page fault
> on some CPUs sometimes.
> I digged a bit more.
> It seems I am getting a race of
> this kind: exception in guest happens
> at the same time when the host's
> SIGALRM arrives. KVM returns to
> host with the exception somehow
> "pending", but its still on ring3, not
> switched to the ring0 handler.
>
> Then from host I inject the interrupt
> (which is what SIGALRM asks for),
> and when I enter the guest, it throws
> the pending exception instead of
> executing the interrupt handler.
> I suspect the bug is again on my side,
> but I am not sure how to handle that
> kind of race. I suppose I need to look
> at some interruptibility state to find
> out that the interrupt cannot be injected
> at that time. But I can't find if KVM
> exports the interruptibility state, other
> than guest's IF/VIF flag, which is not
> enough in this case.

Maybe what you want is run->ready_for_interrupt_injection? And, if
that's not set, try KVM_RUN with run->request_interrupt_window set?

> Also I am a bit puzzled why I can't
> see such race on an I7 CPU even
> after disabling the unrestricted_guest.
>
> Any ideas? :)

I'm guessing that your core2duo doesn't have a VMX preemption timer,
and this has some subtle effect on how the alarm interrupts VMX
non-root operation. On the i7, try setting the module parameter
preemption_timer to 0.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: exception vs SIGALRM race (was: Re: guest/host mem out of sync on core2duo?)
  2021-06-21 22:33             ` Jim Mattson
@ 2021-06-21 23:32               ` stsp
  2021-06-22  0:27               ` stsp
                                 ` (2 subsequent siblings)
  3 siblings, 0 replies; 36+ messages in thread
From: stsp @ 2021-06-21 23:32 UTC (permalink / raw)
  To: Jim Mattson; +Cc: Sean Christopherson, kvm

22.06.2021 01:33, Jim Mattson пишет:
> On Sun, Jun 20, 2021 at 7:34 PM stsp <stsp2@yandex.ru> wrote:
>> 19.06.2021 00:07, Jim Mattson пишет:
>>> I believe DS is illegal. Per the SDM, Checks on Guest Segment Registers:
>> OK, so this indeed have solved
>> the biggest part of the problem,
>> thanks again.
>>
>> Now back to the original problem,
>> where I was getting a page fault
>> on some CPUs sometimes.
>> I digged a bit more.
>> It seems I am getting a race of
>> this kind: exception in guest happens
>> at the same time when the host's
>> SIGALRM arrives. KVM returns to
>> host with the exception somehow
>> "pending", but its still on ring3, not
>> switched to the ring0 handler.
>>
>> Then from host I inject the interrupt
>> (which is what SIGALRM asks for),
>> and when I enter the guest, it throws
>> the pending exception instead of
>> executing the interrupt handler.
>> I suspect the bug is again on my side,
>> but I am not sure how to handle that
>> kind of race. I suppose I need to look
>> at some interruptibility state to find
>> out that the interrupt cannot be injected
>> at that time. But I can't find if KVM
>> exports the interruptibility state, other
>> than guest's IF/VIF flag, which is not
>> enough in this case.
> Maybe what you want is run->ready_for_interrupt_injection? And, if
> that's not set, try KVM_RUN with run->request_interrupt_window set?

Good idea, I coded the patch to
check that. It will take some time
to find out the result.


>> Also I am a bit puzzled why I can't
>> see such race on an I7 CPU even
>> after disabling the unrestricted_guest.
>>
>> Any ideas? :)
> I'm guessing that your core2duo doesn't have a VMX preemption timer,
> and this has some subtle effect on how the alarm interrupts VMX
> non-root operation. On the i7, try setting the module parameter
> preemption_timer to 0.

OK, will try that tomorrow.
But why don't you consider the
simpler scenario?

kvm code is full of ctxt->have_exception
vcpu->arch.exception.pending
kvm_queue_exception() and all that.
This all is set up by the emulator
that handles "invalid guest state".
I would much rather believe that emulator
encountered PF and exited to user-space
after "queueing" it.
Does this sound realistic?


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: exception vs SIGALRM race (was: Re: guest/host mem out of sync on core2duo?)
  2021-06-21 22:33             ` Jim Mattson
  2021-06-21 23:32               ` stsp
@ 2021-06-22  0:27               ` stsp
  2021-06-28 21:47                 ` Jim Mattson
  2021-06-23 23:38               ` exception vs SIGALRM race (with test-case now!) stsp
  2021-06-26 14:03               ` exception vs SIGALRM race (another patch) stsp
  3 siblings, 1 reply; 36+ messages in thread
From: stsp @ 2021-06-22  0:27 UTC (permalink / raw)
  To: Jim Mattson; +Cc: Sean Christopherson, kvm

22.06.2021 01:33, Jim Mattson пишет:
> Maybe what you want is run->ready_for_interrupt_injection? And, if
> that's not set, try KVM_RUN with run->request_interrupt_window set?
static int kvm_vcpu_ready_for_interrupt_injection(struct kvm_vcpu *vcpu)
{
         return kvm_arch_interrupt_allowed(vcpu) &&
                 !kvm_cpu_has_interrupt(vcpu) &&
                 !kvm_event_needs_reinjection(vcpu) &&
                 kvm_cpu_accept_dm_intr(vcpu);

}


So judging from this snippet,
I wouldn't bet on the right indication
from run->ready_for_interrupt_injection

in our situation.
It doesn't check for vcpu->arch.exception.pending
or anything like that.
I believe, the exit to user-space
with pending synchronous exception
was not supposed to happen (but it does).

Also x86_emulate_instruction() seems
to be doing kvm_clear_exception_queue(vcpu)
before anything else, so obviously
such scenario is not trivial...
Possibly the non-emulate path
forgets to clear the queue on entry?


^ permalink raw reply	[flat|nested] 36+ messages in thread

* exception vs SIGALRM race (with test-case now!)
  2021-06-21 22:33             ` Jim Mattson
  2021-06-21 23:32               ` stsp
  2021-06-22  0:27               ` stsp
@ 2021-06-23 23:38               ` stsp
  2021-06-24  0:11                 ` stsp
  2021-06-26 14:03               ` exception vs SIGALRM race (another patch) stsp
  3 siblings, 1 reply; 36+ messages in thread
From: stsp @ 2021-06-23 23:38 UTC (permalink / raw)
  To: Jim Mattson; +Cc: Sean Christopherson, kvm

22.06.2021 01:33, Jim Mattson пишет:
> I'm guessing that your core2duo doesn't have a VMX preemption timer,
> and this has some subtle effect on how the alarm interrupts VMX
> non-root operation. On the i7, try setting the module parameter
> preemption_timer to 0.

Hi again.

I wrote a reliable test-case to
check your suggestion.
Unfortunately, no luck on I7,
even with

---

$ cat /sys/module/kvm_intel/parameters/preemption_timer
N

$ cat /sys/module/kvm_intel/parameters/unrestricted_guest
N

---


But the Core2 owners reproduce
the problem immediately using my
test-case:
https://github.com/dosemu2/dosemu2/issues/1500#issuecomment-867221736

The test-case:
https://github.com/dosemu2/dosemu2/issues/1500#issuecomment-867215291

The source of it:
https://github.com/dosemu2/dosemu2/issues/1500#issuecomment-867215913

Now the question is: can anyone
from that list please replicate our
setup and reproduce the problem
on Core2 CPU, and then fix it?
I myself do not have such CPU, so
I probably can't do that on my own.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: exception vs SIGALRM race (with test-case now!)
  2021-06-23 23:38               ` exception vs SIGALRM race (with test-case now!) stsp
@ 2021-06-24  0:11                 ` stsp
  2021-06-24  0:25                   ` stsp
  0 siblings, 1 reply; 36+ messages in thread
From: stsp @ 2021-06-24  0:11 UTC (permalink / raw)
  To: Jim Mattson; +Cc: Sean Christopherson, kvm

24.06.2021 02:38, stsp пишет:
> The test-case:
> https://github.com/dosemu2/dosemu2/issues/1500#issuecomment-867215291 
URL was off 1 comment.
The right one is:
https://github.com/dosemu2/dosemu2/issues/1500#issuecomment-867214782

Direct link to the test-case:
https://github.com/dosemu2/dosemu2/files/6705274/a.exe.gz

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: exception vs SIGALRM race (with test-case now!)
  2021-06-24  0:11                 ` stsp
@ 2021-06-24  0:25                   ` stsp
  2021-06-24 18:05                     ` exception vs SIGALRM race on core2 CPUs (with qemu-based test-case this time!) stsp
  2021-06-24 18:07                     ` stsp
  0 siblings, 2 replies; 36+ messages in thread
From: stsp @ 2021-06-24  0:25 UTC (permalink / raw)
  To: Jim Mattson; +Cc: Sean Christopherson, kvm

24.06.2021 03:11, stsp пишет:
> 24.06.2021 02:38, stsp пишет:
>> The test-case:
>> https://github.com/dosemu2/dosemu2/issues/1500#issuecomment-867215291 
> URL was off 1 comment.
> The right one is:
> https://github.com/dosemu2/dosemu2/issues/1500#issuecomment-867214782
>
> Direct link to the test-case:
> https://github.com/dosemu2/dosemu2/files/6705274/a.exe.gz

What does this test-case do?
It provokes the PF by writing to
the NULL pointer. The PF handler
checks if PF is coming from the
right place, or from the nearby
IRQ8 timer handler. If PF is coming
from the very first instruction of
the timer handler, then we got
that nasty SIGALRM race and
KVM exited to user-space with
the pending PF exception.

How to replicate the buggy setup?
Just install dosemu2 on your
favourite distro using the pre-built packages:
https://github.com/dosemu2/dosemu2/blob/devel/README

How to run the test-case?
On the PC with Intel Core2 CPU,
run this command:
|dosemu -K ./a.exe -T -I 'ignore_djgpp_null_derefs off'|

|After a few seconds it will say
"Race DETECTED" and exit.
If it just keeps printing dots
forever, then your setup is not buggy,
press Ctrl-c to finish test.
|


^ permalink raw reply	[flat|nested] 36+ messages in thread

* exception vs SIGALRM race on core2 CPUs (with qemu-based test-case this time!)
  2021-06-24  0:25                   ` stsp
@ 2021-06-24 18:05                     ` stsp
  2021-06-24 18:07                     ` stsp
  1 sibling, 0 replies; 36+ messages in thread
From: stsp @ 2021-06-24 18:05 UTC (permalink / raw)
  To: Jim Mattson; +Cc: Sean Christopherson, kvm

24.06.2021 03:25, stsp пишет:
> What does this test-case do?
> It provokes the PF by writing to
> the NULL pointer. The PF handler
> checks if PF is coming from the
> right place, or from the nearby
> IRQ8 timer handler. If PF is coming
> from the very first instruction of
> the timer handler, then we got
> that nasty SIGALRM race and
> KVM exited to user-space with
> the pending PF exception.
>
> How to replicate the buggy setup? 
Since I don't think anyone wanted
to install dosemu2, I now created
the qemu-based reproducer.
Unfortunately, even with qemu
you still need the real core2 CPU
to reproduce.
At least I don't know how to enable
the emulated VTx under qemu, but
maybe someone else knows.

So you need a disk image that I
uploaded here:

https://www.filemail.com/d/fvkwgqgcmrhsxmk 
<https://www.filemail.com/d/fvkwgqgcmrhsxmk>

And you can run it like so:

|qemu-system-x86_64 -hda disk.img -enable-kvm \ -cpu host -bios 
/usr/share/OVMF/OVMF_CODE.fd \ -device intel-hda -device hda-duplex -m 
2G The result on a core2 CPU will be like here: 
https://github.com/dosemu2/dosemu2/issues/1500#issuecomment-867838848 
"Race DETECTED" means that the test-case detected the page-fault coming 
from an interrupt handler, when it shouldn't. |


^ permalink raw reply	[flat|nested] 36+ messages in thread

* exception vs SIGALRM race on core2 CPUs (with qemu-based test-case this time!)
  2021-06-24  0:25                   ` stsp
  2021-06-24 18:05                     ` exception vs SIGALRM race on core2 CPUs (with qemu-based test-case this time!) stsp
@ 2021-06-24 18:07                     ` stsp
  2021-06-25 23:35                       ` exception vs SIGALRM race on core2 CPUs (with fix!) stsp
  1 sibling, 1 reply; 36+ messages in thread
From: stsp @ 2021-06-24 18:07 UTC (permalink / raw)
  To: Jim Mattson; +Cc: Sean Christopherson, kvm

24.06.2021 03:25, stsp пишет:
> What does this test-case do?
> It provokes the PF by writing to
> the NULL pointer. The PF handler
> checks if PF is coming from the
> right place, or from the nearby
> IRQ8 timer handler. If PF is coming
> from the very first instruction of
> the timer handler, then we got
> that nasty SIGALRM race and
> KVM exited to user-space with
> the pending PF exception.
>
> How to replicate the buggy setup? 
Since I don't think anyone wanted
to install dosemu2, I now created
the qemu-based reproducer.
Unfortunately, even with qemu
you still need the real core2 CPU
to reproduce.
At least I don't know how to enable
the emulated VTx under qemu, but
maybe someone else knows.

So you need a disk image that I
uploaded here:

https://www.filemail.com/d/fvkwgqgcmrhsxmk

And you can run it like so:

qemu-system-x86_64 -hda disk.img -enable-kvm \
     -cpu host -bios /usr/share/OVMF/OVMF_CODE.fd \
     -device intel-hda -device hda-duplex -m 2G

The result on a core2 CPU will be like here:
https://github.com/dosemu2/dosemu2/issues/1500#issuecomment-867838848

"Race DETECTED" means that the test-case
detected the page-fault coming from an
interrupt handler, when it shouldn't.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* exception vs SIGALRM race on core2 CPUs (with fix!)
  2021-06-24 18:07                     ` stsp
@ 2021-06-25 23:35                       ` stsp
  2021-06-26  0:15                         ` Jim Mattson
  0 siblings, 1 reply; 36+ messages in thread
From: stsp @ 2021-06-25 23:35 UTC (permalink / raw)
  To: Jim Mattson; +Cc: Sean Christopherson, kvm

OK, I've finally found that this
fixes the race:

--- x86.c.old   2021-03-20 12:51:14.000000000 +0300
+++ x86.c       2021-06-26 02:28:37.082919492 +0300
@@ -9176,8 +9176,10 @@
                 if (__xfer_to_guest_mode_work_pending()) {
                         srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
                         r = xfer_to_guest_mode_handle_work(vcpu);
-                       if (r)
+                       if (r) {
+kvm_clear_exception_queue(vcpu);
                                 return r;
+}
                         vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
                 }
         }



This is where it returns to user
with the PF exception still pending.
So... any ideas?


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: exception vs SIGALRM race on core2 CPUs (with fix!)
  2021-06-25 23:35                       ` exception vs SIGALRM race on core2 CPUs (with fix!) stsp
@ 2021-06-26  0:15                         ` Jim Mattson
  2021-06-26  0:35                           ` stsp
                                             ` (2 more replies)
  0 siblings, 3 replies; 36+ messages in thread
From: Jim Mattson @ 2021-06-26  0:15 UTC (permalink / raw)
  To: stsp; +Cc: Sean Christopherson, kvm

On Fri, Jun 25, 2021 at 4:35 PM stsp <stsp2@yandex.ru> wrote:
>
> OK, I've finally found that this
> fixes the race:
>
> --- x86.c.old   2021-03-20 12:51:14.000000000 +0300
> +++ x86.c       2021-06-26 02:28:37.082919492 +0300
> @@ -9176,8 +9176,10 @@
>                  if (__xfer_to_guest_mode_work_pending()) {
>                          srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
>                          r = xfer_to_guest_mode_handle_work(vcpu);
> -                       if (r)
> +                       if (r) {
> +kvm_clear_exception_queue(vcpu);
>                                  return r;
> +}
>                          vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
>                  }
>          }
>
>
>
> This is where it returns to user
> with the PF exception still pending.
> So... any ideas?

If the squashed exception was a trap, it's now lost.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: exception vs SIGALRM race on core2 CPUs (with fix!)
  2021-06-26  0:15                         ` Jim Mattson
@ 2021-06-26  0:35                           ` stsp
  2021-06-26 21:50                           ` stsp
  2021-06-27 12:13                           ` stsp
  2 siblings, 0 replies; 36+ messages in thread
From: stsp @ 2021-06-26  0:35 UTC (permalink / raw)
  To: Jim Mattson; +Cc: Sean Christopherson, kvm

26.06.2021 03:15, Jim Mattson пишет:
> On Fri, Jun 25, 2021 at 4:35 PM stsp <stsp2@yandex.ru> wrote:
>> OK, I've finally found that this
>> fixes the race:
>>
>> --- x86.c.old   2021-03-20 12:51:14.000000000 +0300
>> +++ x86.c       2021-06-26 02:28:37.082919492 +0300
>> @@ -9176,8 +9176,10 @@
>>                   if (__xfer_to_guest_mode_work_pending()) {
>>                           srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
>>                           r = xfer_to_guest_mode_handle_work(vcpu);
>> -                       if (r)
>> +                       if (r) {
>> +kvm_clear_exception_queue(vcpu);
>>                                   return r;
>> +}
>>                           vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
>>                   }
>>           }
>>
>>
>>
>> This is where it returns to user
>> with the PF exception still pending.
>> So... any ideas?
> If the squashed exception was a trap, it's now lost.

I am not saying this patch is
correct or should be applied.

The more interesting question is:
why KVM doesn't _inject_ the PF,
but is rather setting it pending
and then exits to user-space?

#0  kvm_multiple_exception (vcpu=vcpu@entry=0xffff888005934000, 
nr=nr@entry=14,
     has_error=has_error@entry=true, error_code=6, 
has_payload=has_payload@entry=true,
     payload=35979264, reinject=false) at ./include/linux/kvm_host.h:1280
#1  0xffffffff8103a13c in kvm_queue_exception_e_p (payload=<optimized out>,
     error_code=<optimized out>, nr=14, vcpu=0xffff888005934000) at 
arch/x86/kvm/x86.c:641
#2  kvm_inject_page_fault (vcpu=0xffff888005934000, fault=<optimized 
out>) at arch/x86/kvm/x86.c:641
#3  0xffffffff81031454 in kvm_inject_emulated_page_fault 
(vcpu=vcpu@entry=0xffff888005934000,
     fault=fault@entry=0xffffc9000031fc60) at arch/x86/kvm/x86.c:665
#4  0xffffffff8106df86 in paging32_page_fault (vcpu=0xffff888005934000, 
addr=35979264, error_code=6,
     prefault=<optimized out>) at arch/x86/kvm/mmu/paging_tmpl.h:816
#5  0xffffffff8106cdb4 in kvm_mmu_do_page_fault (prefault=false, err=6, 
cr2_or_gpa=35979264,
     vcpu=0xffff888005934000) at arch/x86/kvm/mmu.h:119
#6  kvm_mmu_page_fault (vcpu=vcpu@entry=0xffff888005934000, 
cr2_or_gpa=cr2_or_gpa@entry=35979264,
     error_code=error_code@entry=6, insn=0x0 <fixed_percpu_data>, 
insn_len=0)
     at arch/x86/kvm/mmu/mmu.c:5076
#7  0xffffffff8106d090 in kvm_handle_page_fault (insn_len=<optimized 
out>, insn=<optimized out>,
     fault_address=35979264, error_code=6, vcpu=0xffff888005934000) at 
arch/x86/kvm/mmu/mmu.c:3775
#8  kvm_handle_page_fault (vcpu=0xffff888005934000, error_code=6, 
fault_address=35979264,
     insn=<optimized out>, insn_len=<optimized out>) at 
arch/x86/kvm/mmu/mmu.c:3757
#9  0xffffffff810443e0 in vcpu_enter_guest (vcpu=0xffff888005934000) at 
arch/x86/kvm/x86.c:9090
#10 vcpu_run (vcpu=0xffff888005934000) at arch/x86/kvm/x86.c:9156
#11 kvm_arch_vcpu_ioctl_run (vcpu=vcpu@entry=0xffff888005934000) at 
arch/x86/kvm/x86.c:9385
#12 0xffffffff81020fec in kvm_vcpu_ioctl (filp=<optimized out>, 
ioctl=44672, arg=0)
     at arch/x86/kvm/../../../virt/kvm/kvm_main.c:3292


This is the stack trace when it
is set pending.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: exception vs SIGALRM race (another patch)
  2021-06-21 22:33             ` Jim Mattson
                                 ` (2 preceding siblings ...)
  2021-06-23 23:38               ` exception vs SIGALRM race (with test-case now!) stsp
@ 2021-06-26 14:03               ` stsp
  3 siblings, 0 replies; 36+ messages in thread
From: stsp @ 2021-06-26 14:03 UTC (permalink / raw)
  To: Jim Mattson; +Cc: Sean Christopherson, kvm

22.06.2021 01:33, Jim Mattson пишет:
> Maybe what you want is run->ready_for_interrupt_injection? And, if

I implemented this suggestion with
the patch below:

---

--- x86.c.old   2021-03-20 12:51:14.000000000 +0300
+++ x86.c       2021-06-26 16:51:17.366592310 +0300
@@ -4109,7 +4109,9 @@
  static int kvm_vcpu_ready_for_interrupt_injection(struct kvm_vcpu *vcpu)
  {
         return kvm_arch_interrupt_allowed(vcpu) &&
-               kvm_cpu_accept_dm_intr(vcpu);
+               kvm_cpu_accept_dm_intr(vcpu) &&
+               !vcpu->arch.exception.pending &&
+               !vcpu->arch.exception.injected;
  }

  static int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu,

---


With that change I indeed can
look into run->ready_for_interrupt_injection
and avoid the race.
That means a cpu-specific work-around
in my code, but at least that works.
But without this change,
run->ready_for_interrupt_injection
just lies.

Does this bring us any closer to the
understanding of what's going on?
If not, what should I find out next?


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: exception vs SIGALRM race on core2 CPUs (with fix!)
  2021-06-26  0:15                         ` Jim Mattson
  2021-06-26  0:35                           ` stsp
@ 2021-06-26 21:50                           ` stsp
  2021-06-27 12:13                           ` stsp
  2 siblings, 0 replies; 36+ messages in thread
From: stsp @ 2021-06-26 21:50 UTC (permalink / raw)
  To: Jim Mattson; +Cc: Sean Christopherson, kvm

26.06.2021 03:15, Jim Mattson пишет:
> If the squashed exception was a trap, it's now lost.

OK, below are 2 more patches, each of
them alone is fixing the problem.

---

--- x86.c.old   2021-03-20 12:51:14.000000000 +0300
+++ x86.c       2021-06-27 00:38:45.547355116 +0300
@@ -9094,6 +9094,7 @@
         if (req_immediate_exit)
                 kvm_make_request(KVM_REQ_EVENT, vcpu);
         kvm_x86_ops.cancel_injection(vcpu);
+       kvm_clear_exception_queue(vcpu);
         if (unlikely(vcpu->arch.apic_attention))
                 kvm_lapic_sync_from_vapic(vcpu);
  out:
---


Or:

---

--- x86.c.old   2021-03-20 12:51:14.000000000 +0300
+++ x86.c       2021-06-27 00:47:06.958618185 +0300
@@ -1783,8 +1783,7 @@
  bool kvm_vcpu_exit_request(struct kvm_vcpu *vcpu)
  {
         xfer_to_guest_mode_prepare();
-       return vcpu->mode == EXITING_GUEST_MODE || 
kvm_request_pending(vcpu) ||
-               xfer_to_guest_mode_work_pending();
+       return vcpu->mode == EXITING_GUEST_MODE || 
kvm_request_pending(vcpu);
  }
  EXPORT_SYMBOL_GPL(kvm_vcpu_exit_request);

---


Still not a clue?


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: exception vs SIGALRM race on core2 CPUs (with fix!)
  2021-06-26  0:15                         ` Jim Mattson
  2021-06-26  0:35                           ` stsp
  2021-06-26 21:50                           ` stsp
@ 2021-06-27 12:13                           ` stsp
  2 siblings, 0 replies; 36+ messages in thread
From: stsp @ 2021-06-27 12:13 UTC (permalink / raw)
  To: Jim Mattson; +Cc: Sean Christopherson, kvm

26.06.2021 03:15, Jim Mattson пишет:
> If the squashed exception was a trap, it's now lost.

I am pretty sure this will do it:

---

--- x86.c.old   2021-03-20 12:51:14.000000000 +0300
+++ x86.c       2021-06-27 15:02:45.126161812 +0300
@@ -9093,7 +9093,11 @@
  cancel_injection:
         if (req_immediate_exit)
                 kvm_make_request(KVM_REQ_EVENT, vcpu);
-       kvm_x86_ops.cancel_injection(vcpu);
+       if (vcpu->arch.exception.injected) {
+               kvm_x86_ops.cancel_injection(vcpu);
+               vcpu->arch.exception.injected = false;
+               vcpu->arch.exception.pending = true;
+       }
         if (unlikely(vcpu->arch.apic_attention))
                 kvm_lapic_sync_from_vapic(vcpu);
  out:
@@ -9464,6 +9468,7 @@
         kvm_rip_write(vcpu, regs->rip);
         kvm_set_rflags(vcpu, regs->rflags | X86_EFLAGS_FIXED);

+       WARN_ON_ONCE(vcpu->arch.exception.injected);
         vcpu->arch.exception.pending = false;

         kvm_make_request(KVM_REQ_EVENT, vcpu);
---


In cancel_injection, the injected/pending
members were getting out of sync with
vmcs.
We need to move it back to pending,
and if user-space does SET_REGS, then
it is cleared (not sure why SET_SREGS
doesn't clear it also).
But if the .injected member is stuck,
then its not cleared by SET_REGS, and
I added WARN_ON_ONCE() for that case.

Does this make sense?


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: exception vs SIGALRM race (was: Re: guest/host mem out of sync on core2duo?)
  2021-06-22  0:27               ` stsp
@ 2021-06-28 21:47                 ` Jim Mattson
  2021-06-28 21:50                   ` stsp
  2021-06-28 22:00                   ` stsp
  0 siblings, 2 replies; 36+ messages in thread
From: Jim Mattson @ 2021-06-28 21:47 UTC (permalink / raw)
  To: stsp; +Cc: Sean Christopherson, kvm

On Mon, Jun 21, 2021 at 5:27 PM stsp <stsp2@yandex.ru> wrote:
>
> 22.06.2021 01:33, Jim Mattson пишет:
> > Maybe what you want is run->ready_for_interrupt_injection? And, if
> > that's not set, try KVM_RUN with run->request_interrupt_window set?
> static int kvm_vcpu_ready_for_interrupt_injection(struct kvm_vcpu *vcpu)
> {
>          return kvm_arch_interrupt_allowed(vcpu) &&
>                  !kvm_cpu_has_interrupt(vcpu) &&
>                  !kvm_event_needs_reinjection(vcpu) &&
>                  kvm_cpu_accept_dm_intr(vcpu);
>
> }
>
>
> So judging from this snippet,
> I wouldn't bet on the right indication
> from run->ready_for_interrupt_injection

In your case, vcpu->arch.exception.injected is true, so
kvm_event_needs_reinjection() returns true. Hence,
kvm_vcpu_ready_for_interrupt_injection() returns false.

Are you seeing that run->ready_for_interrupt_injection is true, or are
you just speculating?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: exception vs SIGALRM race (was: Re: guest/host mem out of sync on core2duo?)
  2021-06-28 21:47                 ` Jim Mattson
@ 2021-06-28 21:50                   ` stsp
  2021-06-28 22:00                   ` stsp
  1 sibling, 0 replies; 36+ messages in thread
From: stsp @ 2021-06-28 21:50 UTC (permalink / raw)
  To: Jim Mattson; +Cc: Sean Christopherson, kvm

29.06.2021 00:47, Jim Mattson пишет:
> On Mon, Jun 21, 2021 at 5:27 PM stsp <stsp2@yandex.ru> wrote:
>> 22.06.2021 01:33, Jim Mattson пишет:
>>> Maybe what you want is run->ready_for_interrupt_injection? And, if
>>> that's not set, try KVM_RUN with run->request_interrupt_window set?
>> static int kvm_vcpu_ready_for_interrupt_injection(struct kvm_vcpu *vcpu)
>> {
>>           return kvm_arch_interrupt_allowed(vcpu) &&
>>                   !kvm_cpu_has_interrupt(vcpu) &&
>>                   !kvm_event_needs_reinjection(vcpu) &&
>>                   kvm_cpu_accept_dm_intr(vcpu);
>>
>> }
>>
>>
>> So judging from this snippet,
>> I wouldn't bet on the right indication
>> from run->ready_for_interrupt_injection
> In your case, vcpu->arch.exception.injected is true, so
> kvm_event_needs_reinjection() returns true. Hence,
> kvm_vcpu_ready_for_interrupt_injection() returns false.
>
> Are you seeing that run->ready_for_interrupt_injection is true, or are
> you just speculating?

I have checked everything I said,
BUT: on a kernel 5.11.8.
Was this fixed on 5.12 or what?


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: exception vs SIGALRM race (was: Re: guest/host mem out of sync on core2duo?)
  2021-06-28 21:47                 ` Jim Mattson
  2021-06-28 21:50                   ` stsp
@ 2021-06-28 22:00                   ` stsp
  2021-06-28 22:27                     ` Jim Mattson
  1 sibling, 1 reply; 36+ messages in thread
From: stsp @ 2021-06-28 22:00 UTC (permalink / raw)
  To: Jim Mattson; +Cc: Sean Christopherson, kvm

29.06.2021 00:47, Jim Mattson пишет:
> On Mon, Jun 21, 2021 at 5:27 PM stsp <stsp2@yandex.ru> wrote:
>> 22.06.2021 01:33, Jim Mattson пишет:
>>> Maybe what you want is run->ready_for_interrupt_injection? And, if
>>> that's not set, try KVM_RUN with run->request_interrupt_window set?
>> static int kvm_vcpu_ready_for_interrupt_injection(struct kvm_vcpu *vcpu)
>> {
>>           return kvm_arch_interrupt_allowed(vcpu) &&
>>                   !kvm_cpu_has_interrupt(vcpu) &&
>>                   !kvm_event_needs_reinjection(vcpu) &&
>>                   kvm_cpu_accept_dm_intr(vcpu);
>>
>> }
>>
>>
>> So judging from this snippet,
>> I wouldn't bet on the right indication
>> from run->ready_for_interrupt_injection
> In your case, vcpu->arch.exception.injected is true, so
> kvm_event_needs_reinjection() returns true. Hence,
> kvm_vcpu_ready_for_interrupt_injection() returns false.
>
> Are you seeing that run->ready_for_interrupt_injection is true, or are
> you just speculating?

OK, please see this commit:
https://www.lkml.org/lkml/2020/12/1/324

There is simply no such code
any longer. I don't know where
I got the above snippet, but its
not valid. The code is currently:

---

static int kvm_vcpu_ready_for_interrupt_injection(struct kvm_vcpu *vcpu)
{
         return kvm_arch_interrupt_allowed(vcpu) &&
                 kvm_cpu_accept_dm_intr(vcpu);
}


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: exception vs SIGALRM race (was: Re: guest/host mem out of sync on core2duo?)
  2021-06-28 22:00                   ` stsp
@ 2021-06-28 22:27                     ` Jim Mattson
  2021-07-06 16:28                       ` Paolo Bonzini
  0 siblings, 1 reply; 36+ messages in thread
From: Jim Mattson @ 2021-06-28 22:27 UTC (permalink / raw)
  To: stsp, Paolo Bonzini; +Cc: Sean Christopherson, kvm, David Woodhouse, ntsironis

On Mon, Jun 28, 2021 at 3:00 PM stsp <stsp2@yandex.ru> wrote:
>
> 29.06.2021 00:47, Jim Mattson пишет:
> > On Mon, Jun 21, 2021 at 5:27 PM stsp <stsp2@yandex.ru> wrote:
> >> 22.06.2021 01:33, Jim Mattson пишет:
> >>> Maybe what you want is run->ready_for_interrupt_injection? And, if
> >>> that's not set, try KVM_RUN with run->request_interrupt_window set?
> >> static int kvm_vcpu_ready_for_interrupt_injection(struct kvm_vcpu *vcpu)
> >> {
> >>           return kvm_arch_interrupt_allowed(vcpu) &&
> >>                   !kvm_cpu_has_interrupt(vcpu) &&
> >>                   !kvm_event_needs_reinjection(vcpu) &&
> >>                   kvm_cpu_accept_dm_intr(vcpu);
> >>
> >> }
> >>
> >>
> >> So judging from this snippet,
> >> I wouldn't bet on the right indication
> >> from run->ready_for_interrupt_injection
> > In your case, vcpu->arch.exception.injected is true, so
> > kvm_event_needs_reinjection() returns true. Hence,
> > kvm_vcpu_ready_for_interrupt_injection() returns false.
> >
> > Are you seeing that run->ready_for_interrupt_injection is true, or are
> > you just speculating?
>
> OK, please see this commit:
> https://www.lkml.org/lkml/2020/12/1/324
>
> There is simply no such code
> any longer. I don't know where
> I got the above snippet, but its
> not valid. The code is currently:
>
> ---
>
> static int kvm_vcpu_ready_for_interrupt_injection(struct kvm_vcpu *vcpu)
> {
>          return kvm_arch_interrupt_allowed(vcpu) &&
>                  kvm_cpu_accept_dm_intr(vcpu);
> }

 It looks like Paolo may have broken this in commit 71cc849b7093
("KVM: x86: Fix split-irqchip vs interrupt injection window request").
The commit message seems focused only on
vcpu->arch.interrupt.injected. Perhaps he overlooked
vcpu->arch.exception.injected.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: exception vs SIGALRM race (was: Re: guest/host mem out of sync on core2duo?)
  2021-06-28 22:27                     ` Jim Mattson
@ 2021-07-06 16:28                       ` Paolo Bonzini
  2021-07-06 22:22                         ` stsp
  0 siblings, 1 reply; 36+ messages in thread
From: Paolo Bonzini @ 2021-07-06 16:28 UTC (permalink / raw)
  To: Jim Mattson, stsp; +Cc: Sean Christopherson, kvm, David Woodhouse, ntsironis

On 29/06/21 00:27, Jim Mattson wrote:
>> static int kvm_vcpu_ready_for_interrupt_injection(struct kvm_vcpu *vcpu)
>> {
>>           return kvm_arch_interrupt_allowed(vcpu) &&
>>                   kvm_cpu_accept_dm_intr(vcpu);
>> }
>   It looks like Paolo may have broken this in commit 71cc849b7093
> ("KVM: x86: Fix split-irqchip vs interrupt injection window request").
> The commit message seems focused only on
> vcpu->arch.interrupt.injected. Perhaps he overlooked
> vcpu->arch.exception.injected.

I was expecting the exception to be injected first and the interrupt second.
But something like this should fix it:

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 21877ad2214e..dddff682c9c2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4277,6 +4277,9 @@ static int kvm_vcpu_ioctl_set_lapic(struct kvm_vcpu *vcpu,
  
  static int kvm_cpu_accept_dm_intr(struct kvm_vcpu *vcpu)
  {
+	if (kvm_event_needs_reinjection(vcpu))
+		return false;
+
  	/*
  	 * We can accept userspace's request for interrupt injection
  	 * as long as we have a place to store the interrupt number.

I'll figure out a selftest to better understand what's going on.  In the meanwhile
Stas can test it!

Thanks,

Paolo


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: exception vs SIGALRM race (was: Re: guest/host mem out of sync on core2duo?)
  2021-07-06 16:28                       ` Paolo Bonzini
@ 2021-07-06 22:22                         ` stsp
  2021-07-06 23:41                           ` Paolo Bonzini
  0 siblings, 1 reply; 36+ messages in thread
From: stsp @ 2021-07-06 22:22 UTC (permalink / raw)
  To: Paolo Bonzini, Jim Mattson
  Cc: Sean Christopherson, kvm, David Woodhouse, ntsironis

06.07.2021 19:28, Paolo Bonzini пишет:
> On 29/06/21 00:27, Jim Mattson wrote:
>>> static int kvm_vcpu_ready_for_interrupt_injection(struct kvm_vcpu 
>>> *vcpu)
>>> {
>>>           return kvm_arch_interrupt_allowed(vcpu) &&
>>>                   kvm_cpu_accept_dm_intr(vcpu);
>>> }
>>   It looks like Paolo may have broken this in commit 71cc849b7093
>> ("KVM: x86: Fix split-irqchip vs interrupt injection window request").
>> The commit message seems focused only on
>> vcpu->arch.interrupt.injected. Perhaps he overlooked
>> vcpu->arch.exception.injected.
>
> I was expecting the exception to be injected first and the interrupt 
> second.
> But something like this should fix it:
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 21877ad2214e..dddff682c9c2 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4277,6 +4277,9 @@ static int kvm_vcpu_ioctl_set_lapic(struct 
> kvm_vcpu *vcpu,
>
>  static int kvm_cpu_accept_dm_intr(struct kvm_vcpu *vcpu)
>  {
> +    if (kvm_event_needs_reinjection(vcpu))
> +        return false;
> +
>      /*
>       * We can accept userspace's request for interrupt injection
>       * as long as we have a place to store the interrupt number.
>
> I'll figure out a selftest to better understand what's going on. In 
> the meanwhile
> Stas can test it!
I confirm that this works, thanks.
Sadly the problematic patch was
CCed to -stable, and is now present
in all kernels, like ubuntu's 5.8.0-55-generic.
Since AFAICT it didn't contain the
important/security fix, I think it
shouldn't have been CCed to -stable.

Can we revert it from -stable?
That will mean a relatively quick
fix for most of current users.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: exception vs SIGALRM race (was: Re: guest/host mem out of sync on core2duo?)
  2021-07-06 22:22                         ` stsp
@ 2021-07-06 23:41                           ` Paolo Bonzini
  0 siblings, 0 replies; 36+ messages in thread
From: Paolo Bonzini @ 2021-07-06 23:41 UTC (permalink / raw)
  To: stsp, Jim Mattson; +Cc: Sean Christopherson, kvm, David Woodhouse, ntsironis

On 07/07/21 00:22, stsp wrote:
> I confirm that this works, thanks.
> Sadly the problematic patch was
> CCed to -stable, and is now present
> in all kernels, like ubuntu's 5.8.0-55-generic.
> Since AFAICT it didn't contain the
> important/security fix, I think it
> shouldn't have been CCed to -stable.
> 
> Can we revert it from -stable?
> That will mean a relatively quick
> fix for most of current users.

It was a bugfix, see the commit message:

     when userspace requests an IRQ window vmexit, an interrupt in the
     local APIC can cause kvm_cpu_has_interrupt() to be true and thus
     kvm_vcpu_ready_for_interrupt_injection() to return false.  When this
     happens, vcpu_run does not exit to userspace but the interrupt window
     vmexits keep occurring.  The VM loops without any hope of making progress.

Thanks for the testing!

Paolo


^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2021-07-06 23:41 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-12 22:49 guest/host mem out of sync on core2duo? stsp
2021-06-13 12:36 ` stsp
2021-06-14 17:06 ` Sean Christopherson
2021-06-14 17:32   ` stsp
2021-06-17 14:42     ` Sean Christopherson
2021-06-18 15:59       ` stsp
2021-06-18 21:07         ` Jim Mattson
2021-06-18 21:55           ` stsp
2021-06-18 22:06             ` Jim Mattson
2021-06-18 22:26               ` stsp
2021-06-18 22:32               ` Sean Christopherson
2021-06-19  0:11                 ` stsp
2021-06-19  0:54                   ` Sean Christopherson
2021-06-19  9:18                     ` stsp
2021-06-21  2:34           ` exception vs SIGALRM race (was: Re: guest/host mem out of sync on core2duo?) stsp
2021-06-21 22:33             ` Jim Mattson
2021-06-21 23:32               ` stsp
2021-06-22  0:27               ` stsp
2021-06-28 21:47                 ` Jim Mattson
2021-06-28 21:50                   ` stsp
2021-06-28 22:00                   ` stsp
2021-06-28 22:27                     ` Jim Mattson
2021-07-06 16:28                       ` Paolo Bonzini
2021-07-06 22:22                         ` stsp
2021-07-06 23:41                           ` Paolo Bonzini
2021-06-23 23:38               ` exception vs SIGALRM race (with test-case now!) stsp
2021-06-24  0:11                 ` stsp
2021-06-24  0:25                   ` stsp
2021-06-24 18:05                     ` exception vs SIGALRM race on core2 CPUs (with qemu-based test-case this time!) stsp
2021-06-24 18:07                     ` stsp
2021-06-25 23:35                       ` exception vs SIGALRM race on core2 CPUs (with fix!) stsp
2021-06-26  0:15                         ` Jim Mattson
2021-06-26  0:35                           ` stsp
2021-06-26 21:50                           ` stsp
2021-06-27 12:13                           ` stsp
2021-06-26 14:03               ` exception vs SIGALRM race (another patch) stsp

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).