All of lore.kernel.org
 help / color / mirror / Atom feed
* Reset problem vs. MMIO emulation, hypercalls, etc...
@ 2012-08-01  3:17 ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 62+ messages in thread
From: Benjamin Herrenschmidt @ 2012-08-01  3:17 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, Alexander Graf, Paul Mackerras, kvm-ppc

Hi Avi !

We identified a problem on powerpc which seems to actually be a generic
issue, and Alex suggested we propose a generic fix. I want to make sure
we are on the right track first before proposing an actual patch as we
would like the patch to go in ASAP (ie not waiting the next merge
window) as it will fix an actual nasty bug with reset in KVM.

So the basic issue has to do with doing a machine reset as a result of a
hypervisor call, but the same problem should happen with MMIO/PIO
emulation.

After we do an exit as a result of such an operation, at the next
KVM_RUN, KVM will fetch the "results" of the operation (in the hypercall
case that's a bunch of register values, in the MMIO read emulation case
it's a single register value usually, x86 might have more subtle cases)
and we update the VCPU state (ie. registers) with that data.

However, what happens is that if a reset happens in between, we end up
clobbering the reset state.

IE. What happens in qemu is roughtly:

 - The hcall or MMIO that triggers the reset happens, goes to qemu,
which eventually calls qemu_system_reset_request()

 - This sets the global reset pending flag and wakes up the main loop.
It also does a stop of the current vcpu, so we do not return to the
kernel at this stage.

 - The main loop gets the flag, starts the reset process, which begins
with stopping all the VCPUs.

 - The reset handlers are called, which includes resetting the CPU
state, which in our case (powerpc) results in a SET_REGS ioctl to
establish a new fresh state for booting.

 - The generic code then restarts all VCPUs, which then return into
VCPU_RUN.

 - The VCPU(s) that did an exit as a result of MMIO emulation,
hypercall, or similiar (typically the one that triggered the reset but
possibly others) then gets some of their register state "updated" by the
result of the operation (in the hcall case, it's a field in the mmap'ed
run structure that clobbers GPR3 among others).

Now this is generally not a big issue as -usually- machines don't care
much about the state of registers on reset.

However it is for us, at least in a couple of cases because we build a
device-tree in qemu and we pass a pointer to it in GPR3 before we start
the firmware or the guest kernel.

This pointer is set in env->gpr[3] which then gets written by a reset
handler to KVM with SET_REGS, but that ends up being clobbered
eventually by the return from the hypercall.

My original idea to fix that was a ppc specific "trick" where we would
make SET_REGS clear the flags indicating that we had a pending hypercall
(or similar) and thus have SET_REGS have the side-effect of aborting a
current operation.

However, Alex reckons we should instead have a generic ioctl
(VCPU_ABORT_OPERATION) which qemu could use when there's a reset, that
we call to generically abort all such operations (MMIO emu, hcalls,
etc...) on all archs when qemu decides to reset the processors.

Alex and I have looked at various possible ways to try to fix that from
inside qemu but they are all extremely horrible & gross and might not
even work in the general case.

Avi, before I actually do a patch (that adds the generic definition,
doc, and powerpc implementation), I'd like to confirm with you that this
is the right approach and that this can benefit from expedited
upstreaming :-)

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Reset problem vs. MMIO emulation, hypercalls, etc...
@ 2012-08-01  3:17 ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 62+ messages in thread
From: Benjamin Herrenschmidt @ 2012-08-01  3:17 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, Alexander Graf, Paul Mackerras, kvm-ppc

Hi Avi !

We identified a problem on powerpc which seems to actually be a generic
issue, and Alex suggested we propose a generic fix. I want to make sure
we are on the right track first before proposing an actual patch as we
would like the patch to go in ASAP (ie not waiting the next merge
window) as it will fix an actual nasty bug with reset in KVM.

So the basic issue has to do with doing a machine reset as a result of a
hypervisor call, but the same problem should happen with MMIO/PIO
emulation.

After we do an exit as a result of such an operation, at the next
KVM_RUN, KVM will fetch the "results" of the operation (in the hypercall
case that's a bunch of register values, in the MMIO read emulation case
it's a single register value usually, x86 might have more subtle cases)
and we update the VCPU state (ie. registers) with that data.

However, what happens is that if a reset happens in between, we end up
clobbering the reset state.

IE. What happens in qemu is roughtly:

 - The hcall or MMIO that triggers the reset happens, goes to qemu,
which eventually calls qemu_system_reset_request()

 - This sets the global reset pending flag and wakes up the main loop.
It also does a stop of the current vcpu, so we do not return to the
kernel at this stage.

 - The main loop gets the flag, starts the reset process, which begins
with stopping all the VCPUs.

 - The reset handlers are called, which includes resetting the CPU
state, which in our case (powerpc) results in a SET_REGS ioctl to
establish a new fresh state for booting.

 - The generic code then restarts all VCPUs, which then return into
VCPU_RUN.

 - The VCPU(s) that did an exit as a result of MMIO emulation,
hypercall, or similiar (typically the one that triggered the reset but
possibly others) then gets some of their register state "updated" by the
result of the operation (in the hcall case, it's a field in the mmap'ed
run structure that clobbers GPR3 among others).

Now this is generally not a big issue as -usually- machines don't care
much about the state of registers on reset.

However it is for us, at least in a couple of cases because we build a
device-tree in qemu and we pass a pointer to it in GPR3 before we start
the firmware or the guest kernel.

This pointer is set in env->gpr[3] which then gets written by a reset
handler to KVM with SET_REGS, but that ends up being clobbered
eventually by the return from the hypercall.

My original idea to fix that was a ppc specific "trick" where we would
make SET_REGS clear the flags indicating that we had a pending hypercall
(or similar) and thus have SET_REGS have the side-effect of aborting a
current operation.

However, Alex reckons we should instead have a generic ioctl
(VCPU_ABORT_OPERATION) which qemu could use when there's a reset, that
we call to generically abort all such operations (MMIO emu, hcalls,
etc...) on all archs when qemu decides to reset the processors.

Alex and I have looked at various possible ways to try to fix that from
inside qemu but they are all extremely horrible & gross and might not
even work in the general case.

Avi, before I actually do a patch (that adds the generic definition,
doc, and powerpc implementation), I'd like to confirm with you that this
is the right approach and that this can benefit from expedited
upstreaming :-)

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
  2012-08-01  3:17 ` Benjamin Herrenschmidt
@ 2012-08-02 10:46   ` Alexander Graf
  -1 siblings, 0 replies; 62+ messages in thread
From: Alexander Graf @ 2012-08-02 10:46 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Avi Kivity, KVM list, Paul Mackerras, kvm-ppc, Marcelo Tosatti,
	Jan Kiszka

Avi, Marcelo, Jan? Any thoughts on this?

Alex

On 01.08.2012, at 05:17, Benjamin Herrenschmidt wrote:

> Hi Avi !
> 
> We identified a problem on powerpc which seems to actually be a generic
> issue, and Alex suggested we propose a generic fix. I want to make sure
> we are on the right track first before proposing an actual patch as we
> would like the patch to go in ASAP (ie not waiting the next merge
> window) as it will fix an actual nasty bug with reset in KVM.
> 
> So the basic issue has to do with doing a machine reset as a result of a
> hypervisor call, but the same problem should happen with MMIO/PIO
> emulation.
> 
> After we do an exit as a result of such an operation, at the next
> KVM_RUN, KVM will fetch the "results" of the operation (in the hypercall
> case that's a bunch of register values, in the MMIO read emulation case
> it's a single register value usually, x86 might have more subtle cases)
> and we update the VCPU state (ie. registers) with that data.
> 
> However, what happens is that if a reset happens in between, we end up
> clobbering the reset state.
> 
> IE. What happens in qemu is roughtly:
> 
> - The hcall or MMIO that triggers the reset happens, goes to qemu,
> which eventually calls qemu_system_reset_request()
> 
> - This sets the global reset pending flag and wakes up the main loop.
> It also does a stop of the current vcpu, so we do not return to the
> kernel at this stage.
> 
> - The main loop gets the flag, starts the reset process, which begins
> with stopping all the VCPUs.
> 
> - The reset handlers are called, which includes resetting the CPU
> state, which in our case (powerpc) results in a SET_REGS ioctl to
> establish a new fresh state for booting.
> 
> - The generic code then restarts all VCPUs, which then return into
> VCPU_RUN.
> 
> - The VCPU(s) that did an exit as a result of MMIO emulation,
> hypercall, or similiar (typically the one that triggered the reset but
> possibly others) then gets some of their register state "updated" by the
> result of the operation (in the hcall case, it's a field in the mmap'ed
> run structure that clobbers GPR3 among others).
> 
> Now this is generally not a big issue as -usually- machines don't care
> much about the state of registers on reset.
> 
> However it is for us, at least in a couple of cases because we build a
> device-tree in qemu and we pass a pointer to it in GPR3 before we start
> the firmware or the guest kernel.
> 
> This pointer is set in env->gpr[3] which then gets written by a reset
> handler to KVM with SET_REGS, but that ends up being clobbered
> eventually by the return from the hypercall.
> 
> My original idea to fix that was a ppc specific "trick" where we would
> make SET_REGS clear the flags indicating that we had a pending hypercall
> (or similar) and thus have SET_REGS have the side-effect of aborting a
> current operation.
> 
> However, Alex reckons we should instead have a generic ioctl
> (VCPU_ABORT_OPERATION) which qemu could use when there's a reset, that
> we call to generically abort all such operations (MMIO emu, hcalls,
> etc...) on all archs when qemu decides to reset the processors.
> 
> Alex and I have looked at various possible ways to try to fix that from
> inside qemu but they are all extremely horrible & gross and might not
> even work in the general case.
> 
> Avi, before I actually do a patch (that adds the generic definition,
> doc, and powerpc implementation), I'd like to confirm with you that this
> is the right approach and that this can benefit from expedited
> upstreaming :-)
> 
> Cheers,
> Ben.
> 
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
@ 2012-08-02 10:46   ` Alexander Graf
  0 siblings, 0 replies; 62+ messages in thread
From: Alexander Graf @ 2012-08-02 10:46 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Avi Kivity, KVM list, Paul Mackerras, kvm-ppc, Marcelo Tosatti,
	Jan Kiszka

Avi, Marcelo, Jan? Any thoughts on this?

Alex

On 01.08.2012, at 05:17, Benjamin Herrenschmidt wrote:

> Hi Avi !
> 
> We identified a problem on powerpc which seems to actually be a generic
> issue, and Alex suggested we propose a generic fix. I want to make sure
> we are on the right track first before proposing an actual patch as we
> would like the patch to go in ASAP (ie not waiting the next merge
> window) as it will fix an actual nasty bug with reset in KVM.
> 
> So the basic issue has to do with doing a machine reset as a result of a
> hypervisor call, but the same problem should happen with MMIO/PIO
> emulation.
> 
> After we do an exit as a result of such an operation, at the next
> KVM_RUN, KVM will fetch the "results" of the operation (in the hypercall
> case that's a bunch of register values, in the MMIO read emulation case
> it's a single register value usually, x86 might have more subtle cases)
> and we update the VCPU state (ie. registers) with that data.
> 
> However, what happens is that if a reset happens in between, we end up
> clobbering the reset state.
> 
> IE. What happens in qemu is roughtly:
> 
> - The hcall or MMIO that triggers the reset happens, goes to qemu,
> which eventually calls qemu_system_reset_request()
> 
> - This sets the global reset pending flag and wakes up the main loop.
> It also does a stop of the current vcpu, so we do not return to the
> kernel at this stage.
> 
> - The main loop gets the flag, starts the reset process, which begins
> with stopping all the VCPUs.
> 
> - The reset handlers are called, which includes resetting the CPU
> state, which in our case (powerpc) results in a SET_REGS ioctl to
> establish a new fresh state for booting.
> 
> - The generic code then restarts all VCPUs, which then return into
> VCPU_RUN.
> 
> - The VCPU(s) that did an exit as a result of MMIO emulation,
> hypercall, or similiar (typically the one that triggered the reset but
> possibly others) then gets some of their register state "updated" by the
> result of the operation (in the hcall case, it's a field in the mmap'ed
> run structure that clobbers GPR3 among others).
> 
> Now this is generally not a big issue as -usually- machines don't care
> much about the state of registers on reset.
> 
> However it is for us, at least in a couple of cases because we build a
> device-tree in qemu and we pass a pointer to it in GPR3 before we start
> the firmware or the guest kernel.
> 
> This pointer is set in env->gpr[3] which then gets written by a reset
> handler to KVM with SET_REGS, but that ends up being clobbered
> eventually by the return from the hypercall.
> 
> My original idea to fix that was a ppc specific "trick" where we would
> make SET_REGS clear the flags indicating that we had a pending hypercall
> (or similar) and thus have SET_REGS have the side-effect of aborting a
> current operation.
> 
> However, Alex reckons we should instead have a generic ioctl
> (VCPU_ABORT_OPERATION) which qemu could use when there's a reset, that
> we call to generically abort all such operations (MMIO emu, hcalls,
> etc...) on all archs when qemu decides to reset the processors.
> 
> Alex and I have looked at various possible ways to try to fix that from
> inside qemu but they are all extremely horrible & gross and might not
> even work in the general case.
> 
> Avi, before I actually do a patch (that adds the generic definition,
> doc, and powerpc implementation), I'd like to confirm with you that this
> is the right approach and that this can benefit from expedited
> upstreaming :-)
> 
> Cheers,
> Ben.
> 
> 


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
  2012-08-02 10:46   ` Alexander Graf
@ 2012-08-02 12:22     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 62+ messages in thread
From: Benjamin Herrenschmidt @ 2012-08-02 12:22 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Avi Kivity, KVM list, Paul Mackerras, kvm-ppc, Marcelo Tosatti,
	Jan Kiszka

On Thu, 2012-08-02 at 12:46 +0200, Alexander Graf wrote:
> Avi, Marcelo, Jan? Any thoughts on this?

Had a chat with Anthony and it looks like we can do it using the MP
state stuff which we don't currently use on powerpc, at least it seems
that's how x86 breaks IO emulation on reset.

Cheers,
Ben.

> Alex
> 
> On 01.08.2012, at 05:17, Benjamin Herrenschmidt wrote:
> 
> > Hi Avi !
> > 
> > We identified a problem on powerpc which seems to actually be a generic
> > issue, and Alex suggested we propose a generic fix. I want to make sure
> > we are on the right track first before proposing an actual patch as we
> > would like the patch to go in ASAP (ie not waiting the next merge
> > window) as it will fix an actual nasty bug with reset in KVM.
> > 
> > So the basic issue has to do with doing a machine reset as a result of a
> > hypervisor call, but the same problem should happen with MMIO/PIO
> > emulation.
> > 
> > After we do an exit as a result of such an operation, at the next
> > KVM_RUN, KVM will fetch the "results" of the operation (in the hypercall
> > case that's a bunch of register values, in the MMIO read emulation case
> > it's a single register value usually, x86 might have more subtle cases)
> > and we update the VCPU state (ie. registers) with that data.
> > 
> > However, what happens is that if a reset happens in between, we end up
> > clobbering the reset state.
> > 
> > IE. What happens in qemu is roughtly:
> > 
> > - The hcall or MMIO that triggers the reset happens, goes to qemu,
> > which eventually calls qemu_system_reset_request()
> > 
> > - This sets the global reset pending flag and wakes up the main loop.
> > It also does a stop of the current vcpu, so we do not return to the
> > kernel at this stage.
> > 
> > - The main loop gets the flag, starts the reset process, which begins
> > with stopping all the VCPUs.
> > 
> > - The reset handlers are called, which includes resetting the CPU
> > state, which in our case (powerpc) results in a SET_REGS ioctl to
> > establish a new fresh state for booting.
> > 
> > - The generic code then restarts all VCPUs, which then return into
> > VCPU_RUN.
> > 
> > - The VCPU(s) that did an exit as a result of MMIO emulation,
> > hypercall, or similiar (typically the one that triggered the reset but
> > possibly others) then gets some of their register state "updated" by the
> > result of the operation (in the hcall case, it's a field in the mmap'ed
> > run structure that clobbers GPR3 among others).
> > 
> > Now this is generally not a big issue as -usually- machines don't care
> > much about the state of registers on reset.
> > 
> > However it is for us, at least in a couple of cases because we build a
> > device-tree in qemu and we pass a pointer to it in GPR3 before we start
> > the firmware or the guest kernel.
> > 
> > This pointer is set in env->gpr[3] which then gets written by a reset
> > handler to KVM with SET_REGS, but that ends up being clobbered
> > eventually by the return from the hypercall.
> > 
> > My original idea to fix that was a ppc specific "trick" where we would
> > make SET_REGS clear the flags indicating that we had a pending hypercall
> > (or similar) and thus have SET_REGS have the side-effect of aborting a
> > current operation.
> > 
> > However, Alex reckons we should instead have a generic ioctl
> > (VCPU_ABORT_OPERATION) which qemu could use when there's a reset, that
> > we call to generically abort all such operations (MMIO emu, hcalls,
> > etc...) on all archs when qemu decides to reset the processors.
> > 
> > Alex and I have looked at various possible ways to try to fix that from
> > inside qemu but they are all extremely horrible & gross and might not
> > even work in the general case.
> > 
> > Avi, before I actually do a patch (that adds the generic definition,
> > doc, and powerpc implementation), I'd like to confirm with you that this
> > is the right approach and that this can benefit from expedited
> > upstreaming :-)
> > 
> > Cheers,
> > Ben.
> > 
> > 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
@ 2012-08-02 12:22     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 62+ messages in thread
From: Benjamin Herrenschmidt @ 2012-08-02 12:22 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Avi Kivity, KVM list, Paul Mackerras, kvm-ppc, Marcelo Tosatti,
	Jan Kiszka

On Thu, 2012-08-02 at 12:46 +0200, Alexander Graf wrote:
> Avi, Marcelo, Jan? Any thoughts on this?

Had a chat with Anthony and it looks like we can do it using the MP
state stuff which we don't currently use on powerpc, at least it seems
that's how x86 breaks IO emulation on reset.

Cheers,
Ben.

> Alex
> 
> On 01.08.2012, at 05:17, Benjamin Herrenschmidt wrote:
> 
> > Hi Avi !
> > 
> > We identified a problem on powerpc which seems to actually be a generic
> > issue, and Alex suggested we propose a generic fix. I want to make sure
> > we are on the right track first before proposing an actual patch as we
> > would like the patch to go in ASAP (ie not waiting the next merge
> > window) as it will fix an actual nasty bug with reset in KVM.
> > 
> > So the basic issue has to do with doing a machine reset as a result of a
> > hypervisor call, but the same problem should happen with MMIO/PIO
> > emulation.
> > 
> > After we do an exit as a result of such an operation, at the next
> > KVM_RUN, KVM will fetch the "results" of the operation (in the hypercall
> > case that's a bunch of register values, in the MMIO read emulation case
> > it's a single register value usually, x86 might have more subtle cases)
> > and we update the VCPU state (ie. registers) with that data.
> > 
> > However, what happens is that if a reset happens in between, we end up
> > clobbering the reset state.
> > 
> > IE. What happens in qemu is roughtly:
> > 
> > - The hcall or MMIO that triggers the reset happens, goes to qemu,
> > which eventually calls qemu_system_reset_request()
> > 
> > - This sets the global reset pending flag and wakes up the main loop.
> > It also does a stop of the current vcpu, so we do not return to the
> > kernel at this stage.
> > 
> > - The main loop gets the flag, starts the reset process, which begins
> > with stopping all the VCPUs.
> > 
> > - The reset handlers are called, which includes resetting the CPU
> > state, which in our case (powerpc) results in a SET_REGS ioctl to
> > establish a new fresh state for booting.
> > 
> > - The generic code then restarts all VCPUs, which then return into
> > VCPU_RUN.
> > 
> > - The VCPU(s) that did an exit as a result of MMIO emulation,
> > hypercall, or similiar (typically the one that triggered the reset but
> > possibly others) then gets some of their register state "updated" by the
> > result of the operation (in the hcall case, it's a field in the mmap'ed
> > run structure that clobbers GPR3 among others).
> > 
> > Now this is generally not a big issue as -usually- machines don't care
> > much about the state of registers on reset.
> > 
> > However it is for us, at least in a couple of cases because we build a
> > device-tree in qemu and we pass a pointer to it in GPR3 before we start
> > the firmware or the guest kernel.
> > 
> > This pointer is set in env->gpr[3] which then gets written by a reset
> > handler to KVM with SET_REGS, but that ends up being clobbered
> > eventually by the return from the hypercall.
> > 
> > My original idea to fix that was a ppc specific "trick" where we would
> > make SET_REGS clear the flags indicating that we had a pending hypercall
> > (or similar) and thus have SET_REGS have the side-effect of aborting a
> > current operation.
> > 
> > However, Alex reckons we should instead have a generic ioctl
> > (VCPU_ABORT_OPERATION) which qemu could use when there's a reset, that
> > we call to generically abort all such operations (MMIO emu, hcalls,
> > etc...) on all archs when qemu decides to reset the processors.
> > 
> > Alex and I have looked at various possible ways to try to fix that from
> > inside qemu but they are all extremely horrible & gross and might not
> > even work in the general case.
> > 
> > Avi, before I actually do a patch (that adds the generic definition,
> > doc, and powerpc implementation), I'd like to confirm with you that this
> > is the right approach and that this can benefit from expedited
> > upstreaming :-)
> > 
> > Cheers,
> > Ben.
> > 
> > 



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
  2012-08-01  3:17 ` Benjamin Herrenschmidt
@ 2012-08-02 12:35   ` Avi Kivity
  -1 siblings, 0 replies; 62+ messages in thread
From: Avi Kivity @ 2012-08-02 12:35 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: kvm, Alexander Graf, Paul Mackerras, kvm-ppc

On 08/01/2012 06:17 AM, Benjamin Herrenschmidt wrote:
> Hi Avi !
> 
> We identified a problem on powerpc which seems to actually be a generic
> issue, and Alex suggested we propose a generic fix. I want to make sure
> we are on the right track first before proposing an actual patch as we
> would like the patch to go in ASAP (ie not waiting the next merge
> window) as it will fix an actual nasty bug with reset in KVM.
> 
> So the basic issue has to do with doing a machine reset as a result of a
> hypervisor call, but the same problem should happen with MMIO/PIO
> emulation.
> 
> After we do an exit as a result of such an operation, at the next
> KVM_RUN, KVM will fetch the "results" of the operation (in the hypercall
> case that's a bunch of register values, in the MMIO read emulation case
> it's a single register value usually, x86 might have more subtle cases)
> and we update the VCPU state (ie. registers) with that data.
> 
> However, what happens is that if a reset happens in between, we end up
> clobbering the reset state.
> 
> IE. What happens in qemu is roughtly:
> 
>  - The hcall or MMIO that triggers the reset happens, goes to qemu,
> which eventually calls qemu_system_reset_request()
> 
>  - This sets the global reset pending flag and wakes up the main loop.
> It also does a stop of the current vcpu, so we do not return to the
> kernel at this stage.
> 
>  - The main loop gets the flag, starts the reset process, which begins
> with stopping all the VCPUs.
> 
>  - The reset handlers are called, which includes resetting the CPU
> state, which in our case (powerpc) results in a SET_REGS ioctl to
> establish a new fresh state for booting.
> 
>  - The generic code then restarts all VCPUs, which then return into
> VCPU_RUN.
> 
>  - The VCPU(s) that did an exit as a result of MMIO emulation,
> hypercall, or similiar (typically the one that triggered the reset but
> possibly others) then gets some of their register state "updated" by the
> result of the operation (in the hcall case, it's a field in the mmap'ed
> run structure that clobbers GPR3 among others).
> 
> Now this is generally not a big issue as -usually- machines don't care
> much about the state of registers on reset.
> 

This is actually documented in api.txt, though not in relation to reset:

  NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI, the
  corresponding operations are complete (and guest state is consistent)
  only after userspace has re-entered the kernel with KVM_RUN.  The
  kernel side will first finish incomplete operations and then check
  for pending signals.  Userspace can re-enter the guest with an
  unmasked signal pending to complete pending operations.

For x86 the issue was with live migration - you can't copy guest
register state in the middle of an I/O operation.  Reset is actually
similar, but it involves writing state (which can then be overwritten)
instead of reading it.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
@ 2012-08-02 12:35   ` Avi Kivity
  0 siblings, 0 replies; 62+ messages in thread
From: Avi Kivity @ 2012-08-02 12:35 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: kvm, Alexander Graf, Paul Mackerras, kvm-ppc

On 08/01/2012 06:17 AM, Benjamin Herrenschmidt wrote:
> Hi Avi !
> 
> We identified a problem on powerpc which seems to actually be a generic
> issue, and Alex suggested we propose a generic fix. I want to make sure
> we are on the right track first before proposing an actual patch as we
> would like the patch to go in ASAP (ie not waiting the next merge
> window) as it will fix an actual nasty bug with reset in KVM.
> 
> So the basic issue has to do with doing a machine reset as a result of a
> hypervisor call, but the same problem should happen with MMIO/PIO
> emulation.
> 
> After we do an exit as a result of such an operation, at the next
> KVM_RUN, KVM will fetch the "results" of the operation (in the hypercall
> case that's a bunch of register values, in the MMIO read emulation case
> it's a single register value usually, x86 might have more subtle cases)
> and we update the VCPU state (ie. registers) with that data.
> 
> However, what happens is that if a reset happens in between, we end up
> clobbering the reset state.
> 
> IE. What happens in qemu is roughtly:
> 
>  - The hcall or MMIO that triggers the reset happens, goes to qemu,
> which eventually calls qemu_system_reset_request()
> 
>  - This sets the global reset pending flag and wakes up the main loop.
> It also does a stop of the current vcpu, so we do not return to the
> kernel at this stage.
> 
>  - The main loop gets the flag, starts the reset process, which begins
> with stopping all the VCPUs.
> 
>  - The reset handlers are called, which includes resetting the CPU
> state, which in our case (powerpc) results in a SET_REGS ioctl to
> establish a new fresh state for booting.
> 
>  - The generic code then restarts all VCPUs, which then return into
> VCPU_RUN.
> 
>  - The VCPU(s) that did an exit as a result of MMIO emulation,
> hypercall, or similiar (typically the one that triggered the reset but
> possibly others) then gets some of their register state "updated" by the
> result of the operation (in the hcall case, it's a field in the mmap'ed
> run structure that clobbers GPR3 among others).
> 
> Now this is generally not a big issue as -usually- machines don't care
> much about the state of registers on reset.
> 

This is actually documented in api.txt, though not in relation to reset:

  NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI, the
  corresponding operations are complete (and guest state is consistent)
  only after userspace has re-entered the kernel with KVM_RUN.  The
  kernel side will first finish incomplete operations and then check
  for pending signals.  Userspace can re-enter the guest with an
  unmasked signal pending to complete pending operations.

For x86 the issue was with live migration - you can't copy guest
register state in the middle of an I/O operation.  Reset is actually
similar, but it involves writing state (which can then be overwritten)
instead of reading it.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
  2012-08-02 12:35   ` Avi Kivity
@ 2012-08-02 12:59     ` Alexander Graf
  -1 siblings, 0 replies; 62+ messages in thread
From: Alexander Graf @ 2012-08-02 12:59 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Benjamin Herrenschmidt, kvm, Paul Mackerras, kvm-ppc


On 02.08.2012, at 14:35, Avi Kivity wrote:

> On 08/01/2012 06:17 AM, Benjamin Herrenschmidt wrote:
>> Hi Avi !
>> 
>> We identified a problem on powerpc which seems to actually be a generic
>> issue, and Alex suggested we propose a generic fix. I want to make sure
>> we are on the right track first before proposing an actual patch as we
>> would like the patch to go in ASAP (ie not waiting the next merge
>> window) as it will fix an actual nasty bug with reset in KVM.
>> 
>> So the basic issue has to do with doing a machine reset as a result of a
>> hypervisor call, but the same problem should happen with MMIO/PIO
>> emulation.
>> 
>> After we do an exit as a result of such an operation, at the next
>> KVM_RUN, KVM will fetch the "results" of the operation (in the hypercall
>> case that's a bunch of register values, in the MMIO read emulation case
>> it's a single register value usually, x86 might have more subtle cases)
>> and we update the VCPU state (ie. registers) with that data.
>> 
>> However, what happens is that if a reset happens in between, we end up
>> clobbering the reset state.
>> 
>> IE. What happens in qemu is roughtly:
>> 
>> - The hcall or MMIO that triggers the reset happens, goes to qemu,
>> which eventually calls qemu_system_reset_request()
>> 
>> - This sets the global reset pending flag and wakes up the main loop.
>> It also does a stop of the current vcpu, so we do not return to the
>> kernel at this stage.
>> 
>> - The main loop gets the flag, starts the reset process, which begins
>> with stopping all the VCPUs.
>> 
>> - The reset handlers are called, which includes resetting the CPU
>> state, which in our case (powerpc) results in a SET_REGS ioctl to
>> establish a new fresh state for booting.
>> 
>> - The generic code then restarts all VCPUs, which then return into
>> VCPU_RUN.
>> 
>> - The VCPU(s) that did an exit as a result of MMIO emulation,
>> hypercall, or similiar (typically the one that triggered the reset but
>> possibly others) then gets some of their register state "updated" by the
>> result of the operation (in the hcall case, it's a field in the mmap'ed
>> run structure that clobbers GPR3 among others).
>> 
>> Now this is generally not a big issue as -usually- machines don't care
>> much about the state of registers on reset.
>> 
> 
> This is actually documented in api.txt, though not in relation to reset:
> 
>  NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI, the
>  corresponding operations are complete (and guest state is consistent)
>  only after userspace has re-entered the kernel with KVM_RUN.  The
>  kernel side will first finish incomplete operations and then check
>  for pending signals.  Userspace can re-enter the guest with an
>  unmasked signal pending to complete pending operations.
> 
> For x86 the issue was with live migration - you can't copy guest
> register state in the middle of an I/O operation.  Reset is actually
> similar, but it involves writing state (which can then be overwritten)
> instead of reading it.

Yeah, we stumbled over this chunk as well. So you're saying we should delay the reset by invoking a self-signal if we're in such an operation?


Alex


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
@ 2012-08-02 12:59     ` Alexander Graf
  0 siblings, 0 replies; 62+ messages in thread
From: Alexander Graf @ 2012-08-02 12:59 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Benjamin Herrenschmidt, kvm, Paul Mackerras, kvm-ppc


On 02.08.2012, at 14:35, Avi Kivity wrote:

> On 08/01/2012 06:17 AM, Benjamin Herrenschmidt wrote:
>> Hi Avi !
>> 
>> We identified a problem on powerpc which seems to actually be a generic
>> issue, and Alex suggested we propose a generic fix. I want to make sure
>> we are on the right track first before proposing an actual patch as we
>> would like the patch to go in ASAP (ie not waiting the next merge
>> window) as it will fix an actual nasty bug with reset in KVM.
>> 
>> So the basic issue has to do with doing a machine reset as a result of a
>> hypervisor call, but the same problem should happen with MMIO/PIO
>> emulation.
>> 
>> After we do an exit as a result of such an operation, at the next
>> KVM_RUN, KVM will fetch the "results" of the operation (in the hypercall
>> case that's a bunch of register values, in the MMIO read emulation case
>> it's a single register value usually, x86 might have more subtle cases)
>> and we update the VCPU state (ie. registers) with that data.
>> 
>> However, what happens is that if a reset happens in between, we end up
>> clobbering the reset state.
>> 
>> IE. What happens in qemu is roughtly:
>> 
>> - The hcall or MMIO that triggers the reset happens, goes to qemu,
>> which eventually calls qemu_system_reset_request()
>> 
>> - This sets the global reset pending flag and wakes up the main loop.
>> It also does a stop of the current vcpu, so we do not return to the
>> kernel at this stage.
>> 
>> - The main loop gets the flag, starts the reset process, which begins
>> with stopping all the VCPUs.
>> 
>> - The reset handlers are called, which includes resetting the CPU
>> state, which in our case (powerpc) results in a SET_REGS ioctl to
>> establish a new fresh state for booting.
>> 
>> - The generic code then restarts all VCPUs, which then return into
>> VCPU_RUN.
>> 
>> - The VCPU(s) that did an exit as a result of MMIO emulation,
>> hypercall, or similiar (typically the one that triggered the reset but
>> possibly others) then gets some of their register state "updated" by the
>> result of the operation (in the hcall case, it's a field in the mmap'ed
>> run structure that clobbers GPR3 among others).
>> 
>> Now this is generally not a big issue as -usually- machines don't care
>> much about the state of registers on reset.
>> 
> 
> This is actually documented in api.txt, though not in relation to reset:
> 
>  NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI, the
>  corresponding operations are complete (and guest state is consistent)
>  only after userspace has re-entered the kernel with KVM_RUN.  The
>  kernel side will first finish incomplete operations and then check
>  for pending signals.  Userspace can re-enter the guest with an
>  unmasked signal pending to complete pending operations.
> 
> For x86 the issue was with live migration - you can't copy guest
> register state in the middle of an I/O operation.  Reset is actually
> similar, but it involves writing state (which can then be overwritten)
> instead of reading it.

Yeah, we stumbled over this chunk as well. So you're saying we should delay the reset by invoking a self-signal if we're in such an operation?


Alex


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
  2012-08-02 12:59     ` Alexander Graf
@ 2012-08-02 13:05       ` Avi Kivity
  -1 siblings, 0 replies; 62+ messages in thread
From: Avi Kivity @ 2012-08-02 13:05 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Benjamin Herrenschmidt, kvm, Paul Mackerras, kvm-ppc

On 08/02/2012 03:59 PM, Alexander Graf wrote:
> 
> On 02.08.2012, at 14:35, Avi Kivity wrote:
> 
>> On 08/01/2012 06:17 AM, Benjamin Herrenschmidt wrote:
>>> Hi Avi !
>>> 
>>> We identified a problem on powerpc which seems to actually be a generic
>>> issue, and Alex suggested we propose a generic fix. I want to make sure
>>> we are on the right track first before proposing an actual patch as we
>>> would like the patch to go in ASAP (ie not waiting the next merge
>>> window) as it will fix an actual nasty bug with reset in KVM.
>>> 
>>> So the basic issue has to do with doing a machine reset as a result of a
>>> hypervisor call, but the same problem should happen with MMIO/PIO
>>> emulation.
>>> 
>>> After we do an exit as a result of such an operation, at the next
>>> KVM_RUN, KVM will fetch the "results" of the operation (in the hypercall
>>> case that's a bunch of register values, in the MMIO read emulation case
>>> it's a single register value usually, x86 might have more subtle cases)
>>> and we update the VCPU state (ie. registers) with that data.
>>> 
>>> However, what happens is that if a reset happens in between, we end up
>>> clobbering the reset state.
>>> 
>>> IE. What happens in qemu is roughtly:
>>> 
>>> - The hcall or MMIO that triggers the reset happens, goes to qemu,
>>> which eventually calls qemu_system_reset_request()
>>> 
>>> - This sets the global reset pending flag and wakes up the main loop.
>>> It also does a stop of the current vcpu, so we do not return to the
>>> kernel at this stage.
>>> 
>>> - The main loop gets the flag, starts the reset process, which begins
>>> with stopping all the VCPUs.
>>> 
>>> - The reset handlers are called, which includes resetting the CPU
>>> state, which in our case (powerpc) results in a SET_REGS ioctl to
>>> establish a new fresh state for booting.
>>> 
>>> - The generic code then restarts all VCPUs, which then return into
>>> VCPU_RUN.
>>> 
>>> - The VCPU(s) that did an exit as a result of MMIO emulation,
>>> hypercall, or similiar (typically the one that triggered the reset but
>>> possibly others) then gets some of their register state "updated" by the
>>> result of the operation (in the hcall case, it's a field in the mmap'ed
>>> run structure that clobbers GPR3 among others).
>>> 
>>> Now this is generally not a big issue as -usually- machines don't care
>>> much about the state of registers on reset.
>>> 
>> 
>> This is actually documented in api.txt, though not in relation to reset:
>> 
>>  NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI, the
>>  corresponding operations are complete (and guest state is consistent)
>>  only after userspace has re-entered the kernel with KVM_RUN.  The
>>  kernel side will first finish incomplete operations and then check
>>  for pending signals.  Userspace can re-enter the guest with an
>>  unmasked signal pending to complete pending operations.
>> 
>> For x86 the issue was with live migration - you can't copy guest
>> register state in the middle of an I/O operation.  Reset is actually
>> similar, but it involves writing state (which can then be overwritten)
>> instead of reading it.
> 
> Yeah, we stumbled over this chunk as well. So you're saying we should delay the reset by invoking a self-signal if we're in such an operation?

Yes.  Qemu of course already supports this for migration, so it should
be easy to add.


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
@ 2012-08-02 13:05       ` Avi Kivity
  0 siblings, 0 replies; 62+ messages in thread
From: Avi Kivity @ 2012-08-02 13:05 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Benjamin Herrenschmidt, kvm, Paul Mackerras, kvm-ppc

On 08/02/2012 03:59 PM, Alexander Graf wrote:
> 
> On 02.08.2012, at 14:35, Avi Kivity wrote:
> 
>> On 08/01/2012 06:17 AM, Benjamin Herrenschmidt wrote:
>>> Hi Avi !
>>> 
>>> We identified a problem on powerpc which seems to actually be a generic
>>> issue, and Alex suggested we propose a generic fix. I want to make sure
>>> we are on the right track first before proposing an actual patch as we
>>> would like the patch to go in ASAP (ie not waiting the next merge
>>> window) as it will fix an actual nasty bug with reset in KVM.
>>> 
>>> So the basic issue has to do with doing a machine reset as a result of a
>>> hypervisor call, but the same problem should happen with MMIO/PIO
>>> emulation.
>>> 
>>> After we do an exit as a result of such an operation, at the next
>>> KVM_RUN, KVM will fetch the "results" of the operation (in the hypercall
>>> case that's a bunch of register values, in the MMIO read emulation case
>>> it's a single register value usually, x86 might have more subtle cases)
>>> and we update the VCPU state (ie. registers) with that data.
>>> 
>>> However, what happens is that if a reset happens in between, we end up
>>> clobbering the reset state.
>>> 
>>> IE. What happens in qemu is roughtly:
>>> 
>>> - The hcall or MMIO that triggers the reset happens, goes to qemu,
>>> which eventually calls qemu_system_reset_request()
>>> 
>>> - This sets the global reset pending flag and wakes up the main loop.
>>> It also does a stop of the current vcpu, so we do not return to the
>>> kernel at this stage.
>>> 
>>> - The main loop gets the flag, starts the reset process, which begins
>>> with stopping all the VCPUs.
>>> 
>>> - The reset handlers are called, which includes resetting the CPU
>>> state, which in our case (powerpc) results in a SET_REGS ioctl to
>>> establish a new fresh state for booting.
>>> 
>>> - The generic code then restarts all VCPUs, which then return into
>>> VCPU_RUN.
>>> 
>>> - The VCPU(s) that did an exit as a result of MMIO emulation,
>>> hypercall, or similiar (typically the one that triggered the reset but
>>> possibly others) then gets some of their register state "updated" by the
>>> result of the operation (in the hcall case, it's a field in the mmap'ed
>>> run structure that clobbers GPR3 among others).
>>> 
>>> Now this is generally not a big issue as -usually- machines don't care
>>> much about the state of registers on reset.
>>> 
>> 
>> This is actually documented in api.txt, though not in relation to reset:
>> 
>>  NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI, the
>>  corresponding operations are complete (and guest state is consistent)
>>  only after userspace has re-entered the kernel with KVM_RUN.  The
>>  kernel side will first finish incomplete operations and then check
>>  for pending signals.  Userspace can re-enter the guest with an
>>  unmasked signal pending to complete pending operations.
>> 
>> For x86 the issue was with live migration - you can't copy guest
>> register state in the middle of an I/O operation.  Reset is actually
>> similar, but it involves writing state (which can then be overwritten)
>> instead of reading it.
> 
> Yeah, we stumbled over this chunk as well. So you're saying we should delay the reset by invoking a self-signal if we're in such an operation?

Yes.  Qemu of course already supports this for migration, so it should
be easy to add.


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
  2012-08-02 12:35   ` Avi Kivity
@ 2012-08-02 20:20     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 62+ messages in thread
From: Benjamin Herrenschmidt @ 2012-08-02 20:20 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, Alexander Graf, Paul Mackerras, kvm-ppc

On Thu, 2012-08-02 at 15:35 +0300, Avi Kivity wrote:
> This is actually documented in api.txt, though not in relation to
> reset:
> 
>   NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI, the
>   corresponding operations are complete (and guest state is
> consistent)
>   only after userspace has re-entered the kernel with KVM_RUN.  The
>   kernel side will first finish incomplete operations and then check
>   for pending signals.  Userspace can re-enter the guest with an
>   unmasked signal pending to complete pending operations.
> 
> For x86 the issue was with live migration - you can't copy guest
> register state in the middle of an I/O operation.  Reset is actually
> similar, but it involves writing state (which can then be overwritten)
> instead of reading it.

Hrm, except that doing KVM_RUN with a signal is very cumbersome to do
and I couldn't quite find the logic in qemu to do it ... but I might
just have missed it. I can see indeed that in the migration case you
want to actually complete the operation rather than just "abort it".

Any chance you can point me to the code that performs that trick qemu
side for migration ?

Anthony seems to think that for reset we can just abort the operation
state in the kernel when the MP state changes.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
@ 2012-08-02 20:20     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 62+ messages in thread
From: Benjamin Herrenschmidt @ 2012-08-02 20:20 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, Alexander Graf, Paul Mackerras, kvm-ppc

On Thu, 2012-08-02 at 15:35 +0300, Avi Kivity wrote:
> This is actually documented in api.txt, though not in relation to
> reset:
> 
>   NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI, the
>   corresponding operations are complete (and guest state is
> consistent)
>   only after userspace has re-entered the kernel with KVM_RUN.  The
>   kernel side will first finish incomplete operations and then check
>   for pending signals.  Userspace can re-enter the guest with an
>   unmasked signal pending to complete pending operations.
> 
> For x86 the issue was with live migration - you can't copy guest
> register state in the middle of an I/O operation.  Reset is actually
> similar, but it involves writing state (which can then be overwritten)
> instead of reading it.

Hrm, except that doing KVM_RUN with a signal is very cumbersome to do
and I couldn't quite find the logic in qemu to do it ... but I might
just have missed it. I can see indeed that in the migration case you
want to actually complete the operation rather than just "abort it".

Any chance you can point me to the code that performs that trick qemu
side for migration ?

Anthony seems to think that for reset we can just abort the operation
state in the kernel when the MP state changes.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
  2012-08-02 13:05       ` Avi Kivity
@ 2012-08-02 20:29         ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 62+ messages in thread
From: Benjamin Herrenschmidt @ 2012-08-02 20:29 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Alexander Graf, kvm, Paul Mackerras, kvm-ppc

On Thu, 2012-08-02 at 16:05 +0300, Avi Kivity wrote:
> > Yeah, we stumbled over this chunk as well. So you're saying we
> should delay the reset by invoking a self-signal if we're in such an
> operation?
> 
> Yes.  Qemu of course already supports this for migration, so it should
> be easy to add.

Adding a self signal for the CPU initiating the reset is not enough,
other VCPUs might also be in an hcall or MMIO emulation when that
happens.

It must be done for all VCPUs, so best is to look at the migration
situation.

For reset, there are two code path at play:

 - The VCPU initiating the request: It calls qemu_system_reset_request()
which calls cpu_stop_current() directly after signaling the main loop

 - The other VCPUs are then marked with the "stop" flag by the maintloop
which will then wait for them to set "stopped" to 1, which is done by
qemu_wait_io_event_common() when it sees "stop".

Now, it seems like suspend also uses that same technique. I don't
totally grasp where migration fits in that picture and where it does the
KVM_RUN with a signal pending trick to complete pending operations, any
chance you can enlighten me ?

Thanks !

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
@ 2012-08-02 20:29         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 62+ messages in thread
From: Benjamin Herrenschmidt @ 2012-08-02 20:29 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Alexander Graf, kvm, Paul Mackerras, kvm-ppc

On Thu, 2012-08-02 at 16:05 +0300, Avi Kivity wrote:
> > Yeah, we stumbled over this chunk as well. So you're saying we
> should delay the reset by invoking a self-signal if we're in such an
> operation?
> 
> Yes.  Qemu of course already supports this for migration, so it should
> be easy to add.

Adding a self signal for the CPU initiating the reset is not enough,
other VCPUs might also be in an hcall or MMIO emulation when that
happens.

It must be done for all VCPUs, so best is to look at the migration
situation.

For reset, there are two code path at play:

 - The VCPU initiating the request: It calls qemu_system_reset_request()
which calls cpu_stop_current() directly after signaling the main loop

 - The other VCPUs are then marked with the "stop" flag by the maintloop
which will then wait for them to set "stopped" to 1, which is done by
qemu_wait_io_event_common() when it sees "stop".

Now, it seems like suspend also uses that same technique. I don't
totally grasp where migration fits in that picture and where it does the
KVM_RUN with a signal pending trick to complete pending operations, any
chance you can enlighten me ?

Thanks !

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
  2012-08-02 20:20     ` Benjamin Herrenschmidt
@ 2012-08-03 17:41       ` Marcelo Tosatti
  -1 siblings, 0 replies; 62+ messages in thread
From: Marcelo Tosatti @ 2012-08-03 17:41 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Avi Kivity, kvm, Alexander Graf, Paul Mackerras, kvm-ppc

On Fri, Aug 03, 2012 at 06:20:18AM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2012-08-02 at 15:35 +0300, Avi Kivity wrote:
> > This is actually documented in api.txt, though not in relation to
> > reset:
> > 
> >   NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI, the
> >   corresponding operations are complete (and guest state is
> > consistent)
> >   only after userspace has re-entered the kernel with KVM_RUN.  The
> >   kernel side will first finish incomplete operations and then check
> >   for pending signals.  Userspace can re-enter the guest with an
> >   unmasked signal pending to complete pending operations.
> > 
> > For x86 the issue was with live migration - you can't copy guest
> > register state in the middle of an I/O operation.  Reset is actually
> > similar, but it involves writing state (which can then be overwritten)
> > instead of reading it.
> 
> Hrm, except that doing KVM_RUN with a signal is very cumbersome to do
> and I couldn't quite find the logic in qemu to do it ... but I might
> just have missed it. I can see indeed that in the migration case you
> want to actually complete the operation rather than just "abort it".
> 
> Any chance you can point me to the code that performs that trick qemu
> side for migration ?

kvm-all.c:

        kvm_arch_pre_run(env, run);
        if (env->exit_request) {
            DPRINTF("interrupt exit requested\n");
            /*
             * KVM requires us to reenter the kernel after IO exits to
             * complete
             * instruction emulation. This self-signal will ensure that
             * we
             * leave ASAP again.
             */
            qemu_cpu_kick_self();
        }


> Anthony seems to think that for reset we can just abort the operation
> state in the kernel when the MP state changes.
> 
> Cheers,
> Ben.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
@ 2012-08-03 17:41       ` Marcelo Tosatti
  0 siblings, 0 replies; 62+ messages in thread
From: Marcelo Tosatti @ 2012-08-03 17:41 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Avi Kivity, kvm, Alexander Graf, Paul Mackerras, kvm-ppc

On Fri, Aug 03, 2012 at 06:20:18AM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2012-08-02 at 15:35 +0300, Avi Kivity wrote:
> > This is actually documented in api.txt, though not in relation to
> > reset:
> > 
> >   NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI, the
> >   corresponding operations are complete (and guest state is
> > consistent)
> >   only after userspace has re-entered the kernel with KVM_RUN.  The
> >   kernel side will first finish incomplete operations and then check
> >   for pending signals.  Userspace can re-enter the guest with an
> >   unmasked signal pending to complete pending operations.
> > 
> > For x86 the issue was with live migration - you can't copy guest
> > register state in the middle of an I/O operation.  Reset is actually
> > similar, but it involves writing state (which can then be overwritten)
> > instead of reading it.
> 
> Hrm, except that doing KVM_RUN with a signal is very cumbersome to do
> and I couldn't quite find the logic in qemu to do it ... but I might
> just have missed it. I can see indeed that in the migration case you
> want to actually complete the operation rather than just "abort it".
> 
> Any chance you can point me to the code that performs that trick qemu
> side for migration ?

kvm-all.c:

        kvm_arch_pre_run(env, run);
        if (env->exit_request) {
            DPRINTF("interrupt exit requested\n");
            /*
             * KVM requires us to reenter the kernel after IO exits to
             * complete
             * instruction emulation. This self-signal will ensure that
             * we
             * leave ASAP again.
             */
            qemu_cpu_kick_self();
        }


> Anthony seems to think that for reset we can just abort the operation
> state in the kernel when the MP state changes.
> 
> Cheers,
> Ben.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
  2012-08-03 17:41       ` Marcelo Tosatti
@ 2012-08-03 18:05         ` Marcelo Tosatti
  -1 siblings, 0 replies; 62+ messages in thread
From: Marcelo Tosatti @ 2012-08-03 18:05 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Avi Kivity, kvm, Alexander Graf, Paul Mackerras, kvm-ppc

On Fri, Aug 03, 2012 at 02:41:13PM -0300, Marcelo Tosatti wrote:
> On Fri, Aug 03, 2012 at 06:20:18AM +1000, Benjamin Herrenschmidt wrote:
> > On Thu, 2012-08-02 at 15:35 +0300, Avi Kivity wrote:
> > > This is actually documented in api.txt, though not in relation to
> > > reset:
> > > 
> > >   NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI, the
> > >   corresponding operations are complete (and guest state is
> > > consistent)
> > >   only after userspace has re-entered the kernel with KVM_RUN.  The
> > >   kernel side will first finish incomplete operations and then check
> > >   for pending signals.  Userspace can re-enter the guest with an
> > >   unmasked signal pending to complete pending operations.
> > > 
> > > For x86 the issue was with live migration - you can't copy guest
> > > register state in the middle of an I/O operation.  Reset is actually
> > > similar, but it involves writing state (which can then be overwritten)
> > > instead of reading it.
> > 
> > Hrm, except that doing KVM_RUN with a signal is very cumbersome to do
> > and I couldn't quite find the logic in qemu to do it ... but I might
> > just have missed it. I can see indeed that in the migration case you
> > want to actually complete the operation rather than just "abort it".
> > 
> > Any chance you can point me to the code that performs that trick qemu
> > side for migration ?
> 
> kvm-all.c:
> 
>         kvm_arch_pre_run(env, run);
>         if (env->exit_request) {
>             DPRINTF("interrupt exit requested\n");
>             /*
>              * KVM requires us to reenter the kernel after IO exits to
>              * complete
>              * instruction emulation. This self-signal will ensure that
>              * we
>              * leave ASAP again.
>              */
>             qemu_cpu_kick_self();
>         }


See kvm_arch_process_async_events() call to qemu_system_reset_request()
in target-i386/kvm.c.

The whole thing is fragile, though: we rely on the order events
are processed inside KVM_RUN, in x86:

1) If there is pending MMIO, process it.
2) If not, return with -EINTR (and KVM_EXIT_INTR) in case
there is a signal pending.

That way, the vcpu will not process the stop event from the main loop
(ie not exit from the kvm_cpu_exec() loop), until MMIO is finished.

> > Anthony seems to think that for reset we can just abort the operation
> > state in the kernel when the MP state changes.
> > 
> > Cheers,
> > Ben.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
@ 2012-08-03 18:05         ` Marcelo Tosatti
  0 siblings, 0 replies; 62+ messages in thread
From: Marcelo Tosatti @ 2012-08-03 18:05 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Avi Kivity, kvm, Alexander Graf, Paul Mackerras, kvm-ppc

On Fri, Aug 03, 2012 at 02:41:13PM -0300, Marcelo Tosatti wrote:
> On Fri, Aug 03, 2012 at 06:20:18AM +1000, Benjamin Herrenschmidt wrote:
> > On Thu, 2012-08-02 at 15:35 +0300, Avi Kivity wrote:
> > > This is actually documented in api.txt, though not in relation to
> > > reset:
> > > 
> > >   NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI, the
> > >   corresponding operations are complete (and guest state is
> > > consistent)
> > >   only after userspace has re-entered the kernel with KVM_RUN.  The
> > >   kernel side will first finish incomplete operations and then check
> > >   for pending signals.  Userspace can re-enter the guest with an
> > >   unmasked signal pending to complete pending operations.
> > > 
> > > For x86 the issue was with live migration - you can't copy guest
> > > register state in the middle of an I/O operation.  Reset is actually
> > > similar, but it involves writing state (which can then be overwritten)
> > > instead of reading it.
> > 
> > Hrm, except that doing KVM_RUN with a signal is very cumbersome to do
> > and I couldn't quite find the logic in qemu to do it ... but I might
> > just have missed it. I can see indeed that in the migration case you
> > want to actually complete the operation rather than just "abort it".
> > 
> > Any chance you can point me to the code that performs that trick qemu
> > side for migration ?
> 
> kvm-all.c:
> 
>         kvm_arch_pre_run(env, run);
>         if (env->exit_request) {
>             DPRINTF("interrupt exit requested\n");
>             /*
>              * KVM requires us to reenter the kernel after IO exits to
>              * complete
>              * instruction emulation. This self-signal will ensure that
>              * we
>              * leave ASAP again.
>              */
>             qemu_cpu_kick_self();
>         }


See kvm_arch_process_async_events() call to qemu_system_reset_request()
in target-i386/kvm.c.

The whole thing is fragile, though: we rely on the order events
are processed inside KVM_RUN, in x86:

1) If there is pending MMIO, process it.
2) If not, return with -EINTR (and KVM_EXIT_INTR) in case
there is a signal pending.

That way, the vcpu will not process the stop event from the main loop
(ie not exit from the kvm_cpu_exec() loop), until MMIO is finished.

> > Anthony seems to think that for reset we can just abort the operation
> > state in the kernel when the MP state changes.
> > 
> > Cheers,
> > Ben.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
  2012-08-03 17:41       ` Marcelo Tosatti
@ 2012-08-03 22:30         ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 62+ messages in thread
From: Benjamin Herrenschmidt @ 2012-08-03 22:30 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Avi Kivity, kvm, Alexander Graf, Paul Mackerras, kvm-ppc, David Gibson

On Fri, 2012-08-03 at 14:41 -0300, Marcelo Tosatti wrote:

> > Hrm, except that doing KVM_RUN with a signal is very cumbersome to do
> > and I couldn't quite find the logic in qemu to do it ... but I might
> > just have missed it. I can see indeed that in the migration case you
> > want to actually complete the operation rather than just "abort it".
> > 
> > Any chance you can point me to the code that performs that trick qemu
> > side for migration ?
> 
> kvm-all.c:
> 
>         kvm_arch_pre_run(env, run);
>         if (env->exit_request) {
>             DPRINTF("interrupt exit requested\n");
>             /*
>              * KVM requires us to reenter the kernel after IO exits to
>              * complete
>              * instruction emulation. This self-signal will ensure that
>              * we
>              * leave ASAP again.
>              */
>             qemu_cpu_kick_self();
>         }
> 
> 
> > Anthony seems to think that for reset we can just abort the operation
> > state in the kernel when the MP state changes.

Ok, I see now, this is scary really... set a flag somewhere, pray that
we are in the right part of the loop on elsewhere... oh well.

In fact there's some fun (through harmless) bits to be found, look at
x86 for example:

        if (env->exception_injected == EXCP08_DBLE) {
            /* this means triple fault */
            qemu_system_reset_request();
            env->exit_request = 1;
            return 0;
        }

Well, qemu_system_reset_request() calls cpu_stop_current() which calls
cpu_exit() which also does:

   env->exit_request = 1;
 
So obviously it will be well set by that time :-)

Now I can see how having it set during kvm_arch_process_async_events()
will work as this is called right before the run loop. Similarily,
EXIT_MMIO and EXIT_IO would work so they are a non issue.

Our problem I suspect with PAPR doing resets via hcalls is that
our kvm_arch_handle_exit() does:

    case KVM_EXIT_PAPR_HCALL:
        dprintf("handle PAPR hypercall\n");
        run->papr_hcall.ret = spapr_hypercall(env, run->papr_hcall.nr,
                                              run->papr_hcall.args);
        ret = 1;
        break;

See the ret = 1 here ? That means that the caller (kvm_cpu_exec.c main
loop) will exit immediately upon return. If we had instead returned 0,
it would have looped, seen the "exit_requested" set by
qemu_system_reset_request(), which would have then done the signal + run
trick, completed the hypercall, and returned this time in a clean state.

So it seems (I don't have the machine at hand to test right now) that by
just changing that ret = 1 to ret = 0, we might be fixing our problem,
including the case where another vcpu is taking a hypercall at the time
of the reset (this other CPU will also need to complete the operation
before stopping).

David, is there any reason why you did ret = 1 to begin with ? For
things like reset etc... the exit will happen as a result of
env->exit_requested being set by cpu_exit().

Are there other cases where you wish the hcall to go back to the main
loop ? In that case, shouldn't it set env->exit_requested rather than
returning 1 at that point ? (H_CEDE for example ?)

Paul, David, I don't have time to test that before Tuesday (I'm away on
monday) but if you have a chance, revert the kernel change we did and
try this qemu patch for reset:

--- a/target-ppc/kvm.c
+++ b/target-ppc/kvm.c
@@ -766,7 +766,7 @@ int kvm_arch_handle_exit(CPUPPCState *env, struct
kvm_run *r
         dprintf("handle PAPR hypercall\n");
         run->papr_hcall.ret = spapr_hypercall(env, run->papr_hcall.nr,
                                               run->papr_hcall.args);
-        ret = 1;
+        ret = 0;
         break;
 #endif
     default:

It might also need something like:

diff --git a/hw/spapr_hcall.c b/hw/spapr_hcall.c
index a5990a9..d4d7fb0 100644
--- a/hw/spapr_hcall.c
+++ b/hw/spapr_hcall.c
@@ -545,6 +545,7 @@ static target_ulong h_cede(CPUPPCState *env,
sPAPREnvironmen
     hreg_compute_hflags(env);
     if (!cpu_has_work(env)) {
         env->halted = 1;
+        env->exit_request = 1;
     }
     return H_SUCCESS;
 }

Though I don't think H_CEDE ever goes down to qemu, does it ?

Cheers,
Ben.

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
@ 2012-08-03 22:30         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 62+ messages in thread
From: Benjamin Herrenschmidt @ 2012-08-03 22:30 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Avi Kivity, kvm, Alexander Graf, Paul Mackerras, kvm-ppc, David Gibson

On Fri, 2012-08-03 at 14:41 -0300, Marcelo Tosatti wrote:

> > Hrm, except that doing KVM_RUN with a signal is very cumbersome to do
> > and I couldn't quite find the logic in qemu to do it ... but I might
> > just have missed it. I can see indeed that in the migration case you
> > want to actually complete the operation rather than just "abort it".
> > 
> > Any chance you can point me to the code that performs that trick qemu
> > side for migration ?
> 
> kvm-all.c:
> 
>         kvm_arch_pre_run(env, run);
>         if (env->exit_request) {
>             DPRINTF("interrupt exit requested\n");
>             /*
>              * KVM requires us to reenter the kernel after IO exits to
>              * complete
>              * instruction emulation. This self-signal will ensure that
>              * we
>              * leave ASAP again.
>              */
>             qemu_cpu_kick_self();
>         }
> 
> 
> > Anthony seems to think that for reset we can just abort the operation
> > state in the kernel when the MP state changes.

Ok, I see now, this is scary really... set a flag somewhere, pray that
we are in the right part of the loop on elsewhere... oh well.

In fact there's some fun (through harmless) bits to be found, look at
x86 for example:

        if (env->exception_injected = EXCP08_DBLE) {
            /* this means triple fault */
            qemu_system_reset_request();
            env->exit_request = 1;
            return 0;
        }

Well, qemu_system_reset_request() calls cpu_stop_current() which calls
cpu_exit() which also does:

   env->exit_request = 1;
 
So obviously it will be well set by that time :-)

Now I can see how having it set during kvm_arch_process_async_events()
will work as this is called right before the run loop. Similarily,
EXIT_MMIO and EXIT_IO would work so they are a non issue.

Our problem I suspect with PAPR doing resets via hcalls is that
our kvm_arch_handle_exit() does:

    case KVM_EXIT_PAPR_HCALL:
        dprintf("handle PAPR hypercall\n");
        run->papr_hcall.ret = spapr_hypercall(env, run->papr_hcall.nr,
                                              run->papr_hcall.args);
        ret = 1;
        break;

See the ret = 1 here ? That means that the caller (kvm_cpu_exec.c main
loop) will exit immediately upon return. If we had instead returned 0,
it would have looped, seen the "exit_requested" set by
qemu_system_reset_request(), which would have then done the signal + run
trick, completed the hypercall, and returned this time in a clean state.

So it seems (I don't have the machine at hand to test right now) that by
just changing that ret = 1 to ret = 0, we might be fixing our problem,
including the case where another vcpu is taking a hypercall at the time
of the reset (this other CPU will also need to complete the operation
before stopping).

David, is there any reason why you did ret = 1 to begin with ? For
things like reset etc... the exit will happen as a result of
env->exit_requested being set by cpu_exit().

Are there other cases where you wish the hcall to go back to the main
loop ? In that case, shouldn't it set env->exit_requested rather than
returning 1 at that point ? (H_CEDE for example ?)

Paul, David, I don't have time to test that before Tuesday (I'm away on
monday) but if you have a chance, revert the kernel change we did and
try this qemu patch for reset:

--- a/target-ppc/kvm.c
+++ b/target-ppc/kvm.c
@@ -766,7 +766,7 @@ int kvm_arch_handle_exit(CPUPPCState *env, struct
kvm_run *r
         dprintf("handle PAPR hypercall\n");
         run->papr_hcall.ret = spapr_hypercall(env, run->papr_hcall.nr,
                                               run->papr_hcall.args);
-        ret = 1;
+        ret = 0;
         break;
 #endif
     default:

It might also need something like:

diff --git a/hw/spapr_hcall.c b/hw/spapr_hcall.c
index a5990a9..d4d7fb0 100644
--- a/hw/spapr_hcall.c
+++ b/hw/spapr_hcall.c
@@ -545,6 +545,7 @@ static target_ulong h_cede(CPUPPCState *env,
sPAPREnvironmen
     hreg_compute_hflags(env);
     if (!cpu_has_work(env)) {
         env->halted = 1;
+        env->exit_request = 1;
     }
     return H_SUCCESS;
 }

Though I don't think H_CEDE ever goes down to qemu, does it ?

Cheers,
Ben.



^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
  2012-08-03 18:05         ` Marcelo Tosatti
@ 2012-08-03 22:32           ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 62+ messages in thread
From: Benjamin Herrenschmidt @ 2012-08-03 22:32 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, kvm, Alexander Graf, Paul Mackerras, kvm-ppc

On Fri, 2012-08-03 at 15:05 -0300, Marcelo Tosatti wrote:

> See kvm_arch_process_async_events() call to qemu_system_reset_request()
> in target-i386/kvm.c.
> 
> The whole thing is fragile, though: we rely on the order events
> are processed inside KVM_RUN, in x86:
> 
> 1) If there is pending MMIO, process it.
> 2) If not, return with -EINTR (and KVM_EXIT_INTR) in case
> there is a signal pending.
> 
> That way, the vcpu will not process the stop event from the main loop
> (ie not exit from the kvm_cpu_exec() loop), until MMIO is finished.

Right, it is fragile, thankfully we appear to adhere to the same
ordering on powerpc so far :-)

So we'll need to test but it looks like we might be able to fix our
problem without a kernel or API change, just by changing qemu to
do the same exit_request trick for our reboot hypercall.

Long run however, I wonder whether we should consider an explicit ioctl
to complete those pending operations instead...

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
@ 2012-08-03 22:32           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 62+ messages in thread
From: Benjamin Herrenschmidt @ 2012-08-03 22:32 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, kvm, Alexander Graf, Paul Mackerras, kvm-ppc

On Fri, 2012-08-03 at 15:05 -0300, Marcelo Tosatti wrote:

> See kvm_arch_process_async_events() call to qemu_system_reset_request()
> in target-i386/kvm.c.
> 
> The whole thing is fragile, though: we rely on the order events
> are processed inside KVM_RUN, in x86:
> 
> 1) If there is pending MMIO, process it.
> 2) If not, return with -EINTR (and KVM_EXIT_INTR) in case
> there is a signal pending.
> 
> That way, the vcpu will not process the stop event from the main loop
> (ie not exit from the kvm_cpu_exec() loop), until MMIO is finished.

Right, it is fragile, thankfully we appear to adhere to the same
ordering on powerpc so far :-)

So we'll need to test but it looks like we might be able to fix our
problem without a kernel or API change, just by changing qemu to
do the same exit_request trick for our reboot hypercall.

Long run however, I wonder whether we should consider an explicit ioctl
to complete those pending operations instead...

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
  2012-08-02 20:29         ` Benjamin Herrenschmidt
@ 2012-08-05  8:55           ` Avi Kivity
  -1 siblings, 0 replies; 62+ messages in thread
From: Avi Kivity @ 2012-08-05  8:55 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Alexander Graf, kvm, Paul Mackerras, kvm-ppc

On 08/02/2012 11:29 PM, Benjamin Herrenschmidt wrote:
> On Thu, 2012-08-02 at 16:05 +0300, Avi Kivity wrote:
>> > Yeah, we stumbled over this chunk as well. So you're saying we
>> should delay the reset by invoking a self-signal if we're in such an
>> operation?
>> 
>> Yes.  Qemu of course already supports this for migration, so it should
>> be easy to add.
> 
> Adding a self signal for the CPU initiating the reset is not enough,
> other VCPUs might also be in an hcall or MMIO emulation when that
> happens.

That happens naturally if you update (or read) registers through a
run_on_cpu() call.  run_on_cpu() should never happen within an mmio
sequence.

> 
> It must be done for all VCPUs, so best is to look at the migration
> situation.
> 
> For reset, there are two code path at play:
> 
>  - The VCPU initiating the request: It calls qemu_system_reset_request()
> which calls cpu_stop_current() directly after signaling the main loop
> 
>  - The other VCPUs are then marked with the "stop" flag by the maintloop
> which will then wait for them to set "stopped" to 1, which is done by
> qemu_wait_io_event_common() when it sees "stop".
> 
> Now, it seems like suspend also uses that same technique. I don't
> totally grasp where migration fits in that picture and where it does the
> KVM_RUN with a signal pending trick to complete pending operations, any
> chance you can enlighten me ?

I'm afraid I no longer know the details so closely, the code has changed
quite a lot.  But the self-signal happens in kvm_cpu_exec(), see also
env->exit_request.


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
@ 2012-08-05  8:55           ` Avi Kivity
  0 siblings, 0 replies; 62+ messages in thread
From: Avi Kivity @ 2012-08-05  8:55 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Alexander Graf, kvm, Paul Mackerras, kvm-ppc

On 08/02/2012 11:29 PM, Benjamin Herrenschmidt wrote:
> On Thu, 2012-08-02 at 16:05 +0300, Avi Kivity wrote:
>> > Yeah, we stumbled over this chunk as well. So you're saying we
>> should delay the reset by invoking a self-signal if we're in such an
>> operation?
>> 
>> Yes.  Qemu of course already supports this for migration, so it should
>> be easy to add.
> 
> Adding a self signal for the CPU initiating the reset is not enough,
> other VCPUs might also be in an hcall or MMIO emulation when that
> happens.

That happens naturally if you update (or read) registers through a
run_on_cpu() call.  run_on_cpu() should never happen within an mmio
sequence.

> 
> It must be done for all VCPUs, so best is to look at the migration
> situation.
> 
> For reset, there are two code path at play:
> 
>  - The VCPU initiating the request: It calls qemu_system_reset_request()
> which calls cpu_stop_current() directly after signaling the main loop
> 
>  - The other VCPUs are then marked with the "stop" flag by the maintloop
> which will then wait for them to set "stopped" to 1, which is done by
> qemu_wait_io_event_common() when it sees "stop".
> 
> Now, it seems like suspend also uses that same technique. I don't
> totally grasp where migration fits in that picture and where it does the
> KVM_RUN with a signal pending trick to complete pending operations, any
> chance you can enlighten me ?

I'm afraid I no longer know the details so closely, the code has changed
quite a lot.  But the self-signal happens in kvm_cpu_exec(), see also
env->exit_request.


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
  2012-08-03 22:32           ` Benjamin Herrenschmidt
@ 2012-08-05  9:00             ` Avi Kivity
  -1 siblings, 0 replies; 62+ messages in thread
From: Avi Kivity @ 2012-08-05  9:00 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Marcelo Tosatti, kvm, Alexander Graf, Paul Mackerras, kvm-ppc

On 08/04/2012 01:32 AM, Benjamin Herrenschmidt wrote:
> On Fri, 2012-08-03 at 15:05 -0300, Marcelo Tosatti wrote:
> 
>> See kvm_arch_process_async_events() call to qemu_system_reset_request()
>> in target-i386/kvm.c.
>> 
>> The whole thing is fragile, though: we rely on the order events
>> are processed inside KVM_RUN, in x86:
>> 
>> 1) If there is pending MMIO, process it.
>> 2) If not, return with -EINTR (and KVM_EXIT_INTR) in case
>> there is a signal pending.
>> 
>> That way, the vcpu will not process the stop event from the main loop
>> (ie not exit from the kvm_cpu_exec() loop), until MMIO is finished.
> 
> Right, it is fragile, thankfully we appear to adhere to the same
> ordering on powerpc so far :-)
> 
> So we'll need to test but it looks like we might be able to fix our
> problem without a kernel or API change, just by changing qemu to
> do the same exit_request trick for our reboot hypercall.
> 
> Long run however, I wonder whether we should consider an explicit ioctl
> to complete those pending operations instead...

It's pointless.  We have to support the old method forever.  There's no
material different between sigqueue() + KVM_RUN and KVM_COMPLETE, or a
KVM_RUN with a flag that tells it to exit immediately.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
@ 2012-08-05  9:00             ` Avi Kivity
  0 siblings, 0 replies; 62+ messages in thread
From: Avi Kivity @ 2012-08-05  9:00 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Marcelo Tosatti, kvm, Alexander Graf, Paul Mackerras, kvm-ppc

On 08/04/2012 01:32 AM, Benjamin Herrenschmidt wrote:
> On Fri, 2012-08-03 at 15:05 -0300, Marcelo Tosatti wrote:
> 
>> See kvm_arch_process_async_events() call to qemu_system_reset_request()
>> in target-i386/kvm.c.
>> 
>> The whole thing is fragile, though: we rely on the order events
>> are processed inside KVM_RUN, in x86:
>> 
>> 1) If there is pending MMIO, process it.
>> 2) If not, return with -EINTR (and KVM_EXIT_INTR) in case
>> there is a signal pending.
>> 
>> That way, the vcpu will not process the stop event from the main loop
>> (ie not exit from the kvm_cpu_exec() loop), until MMIO is finished.
> 
> Right, it is fragile, thankfully we appear to adhere to the same
> ordering on powerpc so far :-)
> 
> So we'll need to test but it looks like we might be able to fix our
> problem without a kernel or API change, just by changing qemu to
> do the same exit_request trick for our reboot hypercall.
> 
> Long run however, I wonder whether we should consider an explicit ioctl
> to complete those pending operations instead...

It's pointless.  We have to support the old method forever.  There's no
material different between sigqueue() + KVM_RUN and KVM_COMPLETE, or a
KVM_RUN with a flag that tells it to exit immediately.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
  2012-08-05  8:55           ` Avi Kivity
@ 2012-08-05 20:45             ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 62+ messages in thread
From: Benjamin Herrenschmidt @ 2012-08-05 20:45 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Alexander Graf, kvm, Paul Mackerras, kvm-ppc

On Sun, 2012-08-05 at 11:55 +0300, Avi Kivity wrote:
> 
> I'm afraid I no longer know the details so closely, the code has
> changed
> quite a lot.  But the self-signal happens in kvm_cpu_exec(), see also
> env->exit_request.

Right, I think I eventually grasped it :-) It is fairly fragile however,
it basically relies that none of those things that leave the kernel in
an "incomplete" state (hcalls, mmio emulation, ...) return a non-zero
value, but instead only ever request an exit via exit_request, so that
we are guaranteed that the exec loop -will- go back, send that signal
and finally exit as a result of EINTR.

It also requires that the kernel tests & handles all those "completion"
early in VCPU_RUN before it does anything else really including testing
for signals.

The latter seems fine for us, the former was what we got wrong on ppc:
our hypercalls always cause exits via a non-zero return value for some
reason (I didn't write that code, not sure exactly why it was written
like that). Working on fixing that on qemu side now.

Thanks !

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
@ 2012-08-05 20:45             ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 62+ messages in thread
From: Benjamin Herrenschmidt @ 2012-08-05 20:45 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Alexander Graf, kvm, Paul Mackerras, kvm-ppc

On Sun, 2012-08-05 at 11:55 +0300, Avi Kivity wrote:
> 
> I'm afraid I no longer know the details so closely, the code has
> changed
> quite a lot.  But the self-signal happens in kvm_cpu_exec(), see also
> env->exit_request.

Right, I think I eventually grasped it :-) It is fairly fragile however,
it basically relies that none of those things that leave the kernel in
an "incomplete" state (hcalls, mmio emulation, ...) return a non-zero
value, but instead only ever request an exit via exit_request, so that
we are guaranteed that the exec loop -will- go back, send that signal
and finally exit as a result of EINTR.

It also requires that the kernel tests & handles all those "completion"
early in VCPU_RUN before it does anything else really including testing
for signals.

The latter seems fine for us, the former was what we got wrong on ppc:
our hypercalls always cause exits via a non-zero return value for some
reason (I didn't write that code, not sure exactly why it was written
like that). Working on fixing that on qemu side now.

Thanks !

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
  2012-08-03 22:30         ` Benjamin Herrenschmidt
@ 2012-08-06  3:13           ` David Gibson
  -1 siblings, 0 replies; 62+ messages in thread
From: David Gibson @ 2012-08-06  3:13 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Marcelo Tosatti, Avi Kivity, kvm, Alexander Graf, Paul Mackerras,
	kvm-ppc

On Sat, Aug 04, 2012 at 08:30:08AM +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2012-08-03 at 14:41 -0300, Marcelo Tosatti wrote:
> 
> > > Hrm, except that doing KVM_RUN with a signal is very cumbersome to do
> > > and I couldn't quite find the logic in qemu to do it ... but I might
> > > just have missed it. I can see indeed that in the migration case you
> > > want to actually complete the operation rather than just "abort it".
> > > 
> > > Any chance you can point me to the code that performs that trick qemu
> > > side for migration ?
> > 
> > kvm-all.c:
> > 
> >         kvm_arch_pre_run(env, run);
> >         if (env->exit_request) {
> >             DPRINTF("interrupt exit requested\n");
> >             /*
> >              * KVM requires us to reenter the kernel after IO exits to
> >              * complete
> >              * instruction emulation. This self-signal will ensure that
> >              * we
> >              * leave ASAP again.
> >              */
> >             qemu_cpu_kick_self();
> >         }
> > 
> > 
> > > Anthony seems to think that for reset we can just abort the operation
> > > state in the kernel when the MP state changes.
> 
> Ok, I see now, this is scary really... set a flag somewhere, pray that
> we are in the right part of the loop on elsewhere... oh well.
> 
> In fact there's some fun (through harmless) bits to be found, look at
> x86 for example:
> 
>         if (env->exception_injected == EXCP08_DBLE) {
>             /* this means triple fault */
>             qemu_system_reset_request();
>             env->exit_request = 1;
>             return 0;
>         }
> 
> Well, qemu_system_reset_request() calls cpu_stop_current() which calls
> cpu_exit() which also does:
> 
>    env->exit_request = 1;
>  
> So obviously it will be well set by that time :-)
> 
> Now I can see how having it set during kvm_arch_process_async_events()
> will work as this is called right before the run loop. Similarily,
> EXIT_MMIO and EXIT_IO would work so they are a non issue.
> 
> Our problem I suspect with PAPR doing resets via hcalls is that
> our kvm_arch_handle_exit() does:
> 
>     case KVM_EXIT_PAPR_HCALL:
>         dprintf("handle PAPR hypercall\n");
>         run->papr_hcall.ret = spapr_hypercall(env, run->papr_hcall.nr,
>                                               run->papr_hcall.args);
>         ret = 1;
>         break;
> 
> See the ret = 1 here ? That means that the caller (kvm_cpu_exec.c main
> loop) will exit immediately upon return. If we had instead returned 0,
> it would have looped, seen the "exit_requested" set by
> qemu_system_reset_request(), which would have then done the signal + run
> trick, completed the hypercall, and returned this time in a clean state.
> 
> So it seems (I don't have the machine at hand to test right now) that by
> just changing that ret = 1 to ret = 0, we might be fixing our problem,
> including the case where another vcpu is taking a hypercall at the time
> of the reset (this other CPU will also need to complete the operation
> before stopping).
> 
> David, is there any reason why you did ret = 1 to begin with ? For
> things like reset etc... the exit will happen as a result of
> env->exit_requested being set by cpu_exit().

Erm, there might have been, but I don't remember what it was.

> Are there other cases where you wish the hcall to go back to the main
> loop ? In that case, shouldn't it set env->exit_requested rather than
> returning 1 at that point ? (H_CEDE for example ?)

So, I'm still trying to nut out the implications for H_CEDE, and think
if there are any other hypercalls that might want to block the guest
for a time.  We were considering blocking H_PUT_TCE if qemu devices
had active dma maps on the previously mapped iovas.  I'm not sure if
the discussions that led to the inclusion of the qemu IOMMU code
decided that was wholly unnnecessary or just not necessary for the
time being.

> Paul, David, I don't have time to test that before Tuesday (I'm away on
> monday) but if you have a chance, revert the kernel change we did and
> try this qemu patch for reset:
> 
> --- a/target-ppc/kvm.c
> +++ b/target-ppc/kvm.c
> @@ -766,7 +766,7 @@ int kvm_arch_handle_exit(CPUPPCState *env, struct
> kvm_run *r
>          dprintf("handle PAPR hypercall\n");
>          run->papr_hcall.ret = spapr_hypercall(env, run->papr_hcall.nr,
>                                                run->papr_hcall.args);
> -        ret = 1;
> +        ret = 0;
>          break;
>  #endif
>      default:

So, usually I think ret = 0 is correct, it lets us handle the
hypercall and get back to the guest without further fuss.  I think
right now, ret = 0 is always correct, since H_CEDE is always handled
in kernel.  But if we ever have qemu implemented hypercalls which want
to "block" then we might need a facility to do something else.  Maybe,
like I say, I haven't convinved myself I've thought of all the
implications yet.

> It might also need something like:
> 
> diff --git a/hw/spapr_hcall.c b/hw/spapr_hcall.c
> index a5990a9..d4d7fb0 100644
> --- a/hw/spapr_hcall.c
> +++ b/hw/spapr_hcall.c
> @@ -545,6 +545,7 @@ static target_ulong h_cede(CPUPPCState *env,
> sPAPREnvironmen
>      hreg_compute_hflags(env);
>      if (!cpu_has_work(env)) {
>          env->halted = 1;
> +        env->exit_request = 1;
>      }
>      return H_SUCCESS;
>  }
> 
> Though I don't think H_CEDE ever goes down to qemu, does it ?

No, though it might be worth doing that anyway, just in case we ever
have some whacky case where the H_CEDE does get handed on from the
kernel.

Ok, I'll send a patch and we can worry about the blocking qemu
hypercall case later, since it doesn't exist now.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
@ 2012-08-06  3:13           ` David Gibson
  0 siblings, 0 replies; 62+ messages in thread
From: David Gibson @ 2012-08-06  3:13 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Marcelo Tosatti, Avi Kivity, kvm, Alexander Graf, Paul Mackerras,
	kvm-ppc

On Sat, Aug 04, 2012 at 08:30:08AM +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2012-08-03 at 14:41 -0300, Marcelo Tosatti wrote:
> 
> > > Hrm, except that doing KVM_RUN with a signal is very cumbersome to do
> > > and I couldn't quite find the logic in qemu to do it ... but I might
> > > just have missed it. I can see indeed that in the migration case you
> > > want to actually complete the operation rather than just "abort it".
> > > 
> > > Any chance you can point me to the code that performs that trick qemu
> > > side for migration ?
> > 
> > kvm-all.c:
> > 
> >         kvm_arch_pre_run(env, run);
> >         if (env->exit_request) {
> >             DPRINTF("interrupt exit requested\n");
> >             /*
> >              * KVM requires us to reenter the kernel after IO exits to
> >              * complete
> >              * instruction emulation. This self-signal will ensure that
> >              * we
> >              * leave ASAP again.
> >              */
> >             qemu_cpu_kick_self();
> >         }
> > 
> > 
> > > Anthony seems to think that for reset we can just abort the operation
> > > state in the kernel when the MP state changes.
> 
> Ok, I see now, this is scary really... set a flag somewhere, pray that
> we are in the right part of the loop on elsewhere... oh well.
> 
> In fact there's some fun (through harmless) bits to be found, look at
> x86 for example:
> 
>         if (env->exception_injected = EXCP08_DBLE) {
>             /* this means triple fault */
>             qemu_system_reset_request();
>             env->exit_request = 1;
>             return 0;
>         }
> 
> Well, qemu_system_reset_request() calls cpu_stop_current() which calls
> cpu_exit() which also does:
> 
>    env->exit_request = 1;
>  
> So obviously it will be well set by that time :-)
> 
> Now I can see how having it set during kvm_arch_process_async_events()
> will work as this is called right before the run loop. Similarily,
> EXIT_MMIO and EXIT_IO would work so they are a non issue.
> 
> Our problem I suspect with PAPR doing resets via hcalls is that
> our kvm_arch_handle_exit() does:
> 
>     case KVM_EXIT_PAPR_HCALL:
>         dprintf("handle PAPR hypercall\n");
>         run->papr_hcall.ret = spapr_hypercall(env, run->papr_hcall.nr,
>                                               run->papr_hcall.args);
>         ret = 1;
>         break;
> 
> See the ret = 1 here ? That means that the caller (kvm_cpu_exec.c main
> loop) will exit immediately upon return. If we had instead returned 0,
> it would have looped, seen the "exit_requested" set by
> qemu_system_reset_request(), which would have then done the signal + run
> trick, completed the hypercall, and returned this time in a clean state.
> 
> So it seems (I don't have the machine at hand to test right now) that by
> just changing that ret = 1 to ret = 0, we might be fixing our problem,
> including the case where another vcpu is taking a hypercall at the time
> of the reset (this other CPU will also need to complete the operation
> before stopping).
> 
> David, is there any reason why you did ret = 1 to begin with ? For
> things like reset etc... the exit will happen as a result of
> env->exit_requested being set by cpu_exit().

Erm, there might have been, but I don't remember what it was.

> Are there other cases where you wish the hcall to go back to the main
> loop ? In that case, shouldn't it set env->exit_requested rather than
> returning 1 at that point ? (H_CEDE for example ?)

So, I'm still trying to nut out the implications for H_CEDE, and think
if there are any other hypercalls that might want to block the guest
for a time.  We were considering blocking H_PUT_TCE if qemu devices
had active dma maps on the previously mapped iovas.  I'm not sure if
the discussions that led to the inclusion of the qemu IOMMU code
decided that was wholly unnnecessary or just not necessary for the
time being.

> Paul, David, I don't have time to test that before Tuesday (I'm away on
> monday) but if you have a chance, revert the kernel change we did and
> try this qemu patch for reset:
> 
> --- a/target-ppc/kvm.c
> +++ b/target-ppc/kvm.c
> @@ -766,7 +766,7 @@ int kvm_arch_handle_exit(CPUPPCState *env, struct
> kvm_run *r
>          dprintf("handle PAPR hypercall\n");
>          run->papr_hcall.ret = spapr_hypercall(env, run->papr_hcall.nr,
>                                                run->papr_hcall.args);
> -        ret = 1;
> +        ret = 0;
>          break;
>  #endif
>      default:

So, usually I think ret = 0 is correct, it lets us handle the
hypercall and get back to the guest without further fuss.  I think
right now, ret = 0 is always correct, since H_CEDE is always handled
in kernel.  But if we ever have qemu implemented hypercalls which want
to "block" then we might need a facility to do something else.  Maybe,
like I say, I haven't convinved myself I've thought of all the
implications yet.

> It might also need something like:
> 
> diff --git a/hw/spapr_hcall.c b/hw/spapr_hcall.c
> index a5990a9..d4d7fb0 100644
> --- a/hw/spapr_hcall.c
> +++ b/hw/spapr_hcall.c
> @@ -545,6 +545,7 @@ static target_ulong h_cede(CPUPPCState *env,
> sPAPREnvironmen
>      hreg_compute_hflags(env);
>      if (!cpu_has_work(env)) {
>          env->halted = 1;
> +        env->exit_request = 1;
>      }
>      return H_SUCCESS;
>  }
> 
> Though I don't think H_CEDE ever goes down to qemu, does it ?

No, though it might be worth doing that anyway, just in case we ever
have some whacky case where the H_CEDE does get handed on from the
kernel.

Ok, I'll send a patch and we can worry about the blocking qemu
hypercall case later, since it doesn't exist now.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
  2012-08-05  9:00             ` Avi Kivity
@ 2012-08-06 20:25               ` Scott Wood
  -1 siblings, 0 replies; 62+ messages in thread
From: Scott Wood @ 2012-08-06 20:25 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Benjamin Herrenschmidt, Marcelo Tosatti, kvm, Alexander Graf,
	Paul Mackerras, kvm-ppc

On 08/05/2012 04:00 AM, Avi Kivity wrote:
> On 08/04/2012 01:32 AM, Benjamin Herrenschmidt wrote:
>> On Fri, 2012-08-03 at 15:05 -0300, Marcelo Tosatti wrote:
>>
>>> See kvm_arch_process_async_events() call to qemu_system_reset_request()
>>> in target-i386/kvm.c.
>>>
>>> The whole thing is fragile, though: we rely on the order events
>>> are processed inside KVM_RUN, in x86:
>>>
>>> 1) If there is pending MMIO, process it.
>>> 2) If not, return with -EINTR (and KVM_EXIT_INTR) in case
>>> there is a signal pending.
>>>
>>> That way, the vcpu will not process the stop event from the main loop
>>> (ie not exit from the kvm_cpu_exec() loop), until MMIO is finished.
>>
>> Right, it is fragile, thankfully we appear to adhere to the same
>> ordering on powerpc so far :-)
>>
>> So we'll need to test but it looks like we might be able to fix our
>> problem without a kernel or API change, just by changing qemu to
>> do the same exit_request trick for our reboot hypercall.
>>
>> Long run however, I wonder whether we should consider an explicit ioctl
>> to complete those pending operations instead...
> 
> It's pointless.  We have to support the old method forever.

Not in new architectures (even PPC has yet to start using this) or new
userspaces -- and forever is a long time.  People down the road may very
well decide that it's time to clean out the deprecated stuff that hasn't
been used in over a decade.  IMHO this shouldn't be a reason to
not improve the API, as long as compatibility is possible for as long as
it is deemed worthwhile.

> There's no
> material different between sigqueue() + KVM_RUN and KVM_COMPLETE, or a
> KVM_RUN with a flag that tells it to exit immediately.

The latter is less fragile and easier to use.

-Scott

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
@ 2012-08-06 20:25               ` Scott Wood
  0 siblings, 0 replies; 62+ messages in thread
From: Scott Wood @ 2012-08-06 20:25 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Benjamin Herrenschmidt, Marcelo Tosatti, kvm, Alexander Graf,
	Paul Mackerras, kvm-ppc

On 08/05/2012 04:00 AM, Avi Kivity wrote:
> On 08/04/2012 01:32 AM, Benjamin Herrenschmidt wrote:
>> On Fri, 2012-08-03 at 15:05 -0300, Marcelo Tosatti wrote:
>>
>>> See kvm_arch_process_async_events() call to qemu_system_reset_request()
>>> in target-i386/kvm.c.
>>>
>>> The whole thing is fragile, though: we rely on the order events
>>> are processed inside KVM_RUN, in x86:
>>>
>>> 1) If there is pending MMIO, process it.
>>> 2) If not, return with -EINTR (and KVM_EXIT_INTR) in case
>>> there is a signal pending.
>>>
>>> That way, the vcpu will not process the stop event from the main loop
>>> (ie not exit from the kvm_cpu_exec() loop), until MMIO is finished.
>>
>> Right, it is fragile, thankfully we appear to adhere to the same
>> ordering on powerpc so far :-)
>>
>> So we'll need to test but it looks like we might be able to fix our
>> problem without a kernel or API change, just by changing qemu to
>> do the same exit_request trick for our reboot hypercall.
>>
>> Long run however, I wonder whether we should consider an explicit ioctl
>> to complete those pending operations instead...
> 
> It's pointless.  We have to support the old method forever.

Not in new architectures (even PPC has yet to start using this) or new
userspaces -- and forever is a long time.  People down the road may very
well decide that it's time to clean out the deprecated stuff that hasn't
been used in over a decade.  IMHO this shouldn't be a reason to
not improve the API, as long as compatibility is possible for as long as
it is deemed worthwhile.

> There's no
> material different between sigqueue() + KVM_RUN and KVM_COMPLETE, or a
> KVM_RUN with a flag that tells it to exit immediately.

The latter is less fragile and easier to use.

-Scott



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
  2012-08-05  9:00             ` Avi Kivity
@ 2012-08-06 20:54               ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 62+ messages in thread
From: Benjamin Herrenschmidt @ 2012-08-06 20:54 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, kvm, Alexander Graf, Paul Mackerras, kvm-ppc

On Sun, 2012-08-05 at 12:00 +0300, Avi Kivity wrote:
> > So we'll need to test but it looks like we might be able to fix our
> > problem without a kernel or API change, just by changing qemu to
> > do the same exit_request trick for our reboot hypercall.
> > 
> > Long run however, I wonder whether we should consider an explicit ioctl
> > to complete those pending operations instead...
> 
> It's pointless.  We have to support the old method forever.  There's no
> material different between sigqueue() + KVM_RUN and KVM_COMPLETE, or a
> KVM_RUN with a flag that tells it to exit immediately. 

Sort-of... in fact, I would say after more thoughts that the real reason
to leave things as-is is that we actually need a clean restartable state
for migration, and an ioctl for "aborting" transactions wouldn't
complete them properly, which is fine for reset but not for migration or
snapshots.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
@ 2012-08-06 20:54               ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 62+ messages in thread
From: Benjamin Herrenschmidt @ 2012-08-06 20:54 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, kvm, Alexander Graf, Paul Mackerras, kvm-ppc

On Sun, 2012-08-05 at 12:00 +0300, Avi Kivity wrote:
> > So we'll need to test but it looks like we might be able to fix our
> > problem without a kernel or API change, just by changing qemu to
> > do the same exit_request trick for our reboot hypercall.
> > 
> > Long run however, I wonder whether we should consider an explicit ioctl
> > to complete those pending operations instead...
> 
> It's pointless.  We have to support the old method forever.  There's no
> material different between sigqueue() + KVM_RUN and KVM_COMPLETE, or a
> KVM_RUN with a flag that tells it to exit immediately. 

Sort-of... in fact, I would say after more thoughts that the real reason
to leave things as-is is that we actually need a clean restartable state
for migration, and an ioctl for "aborting" transactions wouldn't
complete them properly, which is fine for reset but not for migration or
snapshots.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
  2012-08-06  3:13           ` David Gibson
@ 2012-08-06 20:57             ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 62+ messages in thread
From: Benjamin Herrenschmidt @ 2012-08-06 20:57 UTC (permalink / raw)
  To: David Gibson
  Cc: Marcelo Tosatti, Avi Kivity, kvm, Alexander Graf, Paul Mackerras,
	kvm-ppc

On Mon, 2012-08-06 at 13:13 +1000, David Gibson wrote:
> So, I'm still trying to nut out the implications for H_CEDE, and think
> if there are any other hypercalls that might want to block the guest
> for a time.  We were considering blocking H_PUT_TCE if qemu devices
> had active dma maps on the previously mapped iovas.  I'm not sure if
> the discussions that led to the inclusion of the qemu IOMMU code
> decided that was wholly unnnecessary or just not necessary for the
> time being.

For "sleeping hcalls" they will simply have to set exit_request to
complete the hcall from the kernel perspective, leaving us in a state
where the kernel is about to restart at srr0 + 4, along with some other
flag (stop or halt) to actually freeze the vcpu.

If such an "async" hcall decides to return an error, it can then set
gpr3 directly using ioctls before restarting the vcpu.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
@ 2012-08-06 20:57             ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 62+ messages in thread
From: Benjamin Herrenschmidt @ 2012-08-06 20:57 UTC (permalink / raw)
  To: David Gibson
  Cc: Marcelo Tosatti, Avi Kivity, kvm, Alexander Graf, Paul Mackerras,
	kvm-ppc

On Mon, 2012-08-06 at 13:13 +1000, David Gibson wrote:
> So, I'm still trying to nut out the implications for H_CEDE, and think
> if there are any other hypercalls that might want to block the guest
> for a time.  We were considering blocking H_PUT_TCE if qemu devices
> had active dma maps on the previously mapped iovas.  I'm not sure if
> the discussions that led to the inclusion of the qemu IOMMU code
> decided that was wholly unnnecessary or just not necessary for the
> time being.

For "sleeping hcalls" they will simply have to set exit_request to
complete the hcall from the kernel perspective, leaving us in a state
where the kernel is about to restart at srr0 + 4, along with some other
flag (stop or halt) to actually freeze the vcpu.

If such an "async" hcall decides to return an error, it can then set
gpr3 directly using ioctls before restarting the vcpu.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
  2012-08-06 20:57             ` Benjamin Herrenschmidt
@ 2012-08-07  1:32               ` David Gibson
  -1 siblings, 0 replies; 62+ messages in thread
From: David Gibson @ 2012-08-07  1:32 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Marcelo Tosatti, Avi Kivity, kvm, Alexander Graf, Paul Mackerras,
	kvm-ppc

On Tue, Aug 07, 2012 at 06:57:57AM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2012-08-06 at 13:13 +1000, David Gibson wrote:
> > So, I'm still trying to nut out the implications for H_CEDE, and think
> > if there are any other hypercalls that might want to block the guest
> > for a time.  We were considering blocking H_PUT_TCE if qemu devices
> > had active dma maps on the previously mapped iovas.  I'm not sure if
> > the discussions that led to the inclusion of the qemu IOMMU code
> > decided that was wholly unnnecessary or just not necessary for the
> > time being.
> 
> For "sleeping hcalls" they will simply have to set exit_request to
> complete the hcall from the kernel perspective, leaving us in a state
> where the kernel is about to restart at srr0 + 4, along with some other
> flag (stop or halt) to actually freeze the vcpu.
> 
> If such an "async" hcall decides to return an error, it can then set
> gpr3 directly using ioctls before restarting the vcpu.

Yeah, I'd pretty much convinced myself of that by the end of
yesterday.  I hope to send patches implementing these fixes today.

There are also some questions about why our in-kernel H_CEDE works
kind of differently from x86's hlt instruction implementation (which
comes out to qemu unless the irqchip is in-kernel as well).  I don't
think we have an urgent problem there though.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
@ 2012-08-07  1:32               ` David Gibson
  0 siblings, 0 replies; 62+ messages in thread
From: David Gibson @ 2012-08-07  1:32 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Marcelo Tosatti, Avi Kivity, kvm, Alexander Graf, Paul Mackerras,
	kvm-ppc

On Tue, Aug 07, 2012 at 06:57:57AM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2012-08-06 at 13:13 +1000, David Gibson wrote:
> > So, I'm still trying to nut out the implications for H_CEDE, and think
> > if there are any other hypercalls that might want to block the guest
> > for a time.  We were considering blocking H_PUT_TCE if qemu devices
> > had active dma maps on the previously mapped iovas.  I'm not sure if
> > the discussions that led to the inclusion of the qemu IOMMU code
> > decided that was wholly unnnecessary or just not necessary for the
> > time being.
> 
> For "sleeping hcalls" they will simply have to set exit_request to
> complete the hcall from the kernel perspective, leaving us in a state
> where the kernel is about to restart at srr0 + 4, along with some other
> flag (stop or halt) to actually freeze the vcpu.
> 
> If such an "async" hcall decides to return an error, it can then set
> gpr3 directly using ioctls before restarting the vcpu.

Yeah, I'd pretty much convinced myself of that by the end of
yesterday.  I hope to send patches implementing these fixes today.

There are also some questions about why our in-kernel H_CEDE works
kind of differently from x86's hlt instruction implementation (which
comes out to qemu unless the irqchip is in-kernel as well).  I don't
think we have an urgent problem there though.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
  2012-08-06 20:25               ` Scott Wood
@ 2012-08-07  8:44                 ` Avi Kivity
  -1 siblings, 0 replies; 62+ messages in thread
From: Avi Kivity @ 2012-08-07  8:44 UTC (permalink / raw)
  To: Scott Wood
  Cc: Benjamin Herrenschmidt, Marcelo Tosatti, kvm, Alexander Graf,
	Paul Mackerras, kvm-ppc

On 08/06/2012 11:25 PM, Scott Wood wrote:
> On 08/05/2012 04:00 AM, Avi Kivity wrote:
>> On 08/04/2012 01:32 AM, Benjamin Herrenschmidt wrote:
>>> On Fri, 2012-08-03 at 15:05 -0300, Marcelo Tosatti wrote:
>>>
>>>> See kvm_arch_process_async_events() call to qemu_system_reset_request()
>>>> in target-i386/kvm.c.
>>>>
>>>> The whole thing is fragile, though: we rely on the order events
>>>> are processed inside KVM_RUN, in x86:
>>>>
>>>> 1) If there is pending MMIO, process it.
>>>> 2) If not, return with -EINTR (and KVM_EXIT_INTR) in case
>>>> there is a signal pending.
>>>>
>>>> That way, the vcpu will not process the stop event from the main loop
>>>> (ie not exit from the kvm_cpu_exec() loop), until MMIO is finished.
>>>
>>> Right, it is fragile, thankfully we appear to adhere to the same
>>> ordering on powerpc so far :-)
>>>
>>> So we'll need to test but it looks like we might be able to fix our
>>> problem without a kernel or API change, just by changing qemu to
>>> do the same exit_request trick for our reboot hypercall.
>>>
>>> Long run however, I wonder whether we should consider an explicit ioctl
>>> to complete those pending operations instead...
>> 
>> It's pointless.  We have to support the old method forever.
> 
> Not in new architectures (even PPC has yet to start using this) or new
> userspaces -- and forever is a long time.  People down the road may very
> well decide that it's time to clean out the deprecated stuff that hasn't
> been used in over a decade.  IMHO this shouldn't be a reason to
> not improve the API, as long as compatibility is possible for as long as
> it is deemed worthwhile.

For qemu this is in common code (kvm-all.c); some architectures might
need wiring up, but the code is there.

For the kernel we need to handle signals, and we need to check for
signals in the atomic guest entry path.  That leads naturally to the
completion-before-signal order.

In fact there's no other way to do it.  If we check for signals before
handling completions, there's no way for userspace to know whether the
completion was completely handled, and whether more completions are
necessary (mmio completions are limited to 8 bytes but some x86
instructions generate many more accesses).

> 
>> There's no
>> material different between sigqueue() + KVM_RUN and KVM_COMPLETE, or a
>> KVM_RUN with a flag that tells it to exit immediately.
> 
> The latter is less fragile and easier to use.


   if (need_exit)
        queue signal

 vs

   run->need_exit = need_exit


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
@ 2012-08-07  8:44                 ` Avi Kivity
  0 siblings, 0 replies; 62+ messages in thread
From: Avi Kivity @ 2012-08-07  8:44 UTC (permalink / raw)
  To: Scott Wood
  Cc: Benjamin Herrenschmidt, Marcelo Tosatti, kvm, Alexander Graf,
	Paul Mackerras, kvm-ppc

On 08/06/2012 11:25 PM, Scott Wood wrote:
> On 08/05/2012 04:00 AM, Avi Kivity wrote:
>> On 08/04/2012 01:32 AM, Benjamin Herrenschmidt wrote:
>>> On Fri, 2012-08-03 at 15:05 -0300, Marcelo Tosatti wrote:
>>>
>>>> See kvm_arch_process_async_events() call to qemu_system_reset_request()
>>>> in target-i386/kvm.c.
>>>>
>>>> The whole thing is fragile, though: we rely on the order events
>>>> are processed inside KVM_RUN, in x86:
>>>>
>>>> 1) If there is pending MMIO, process it.
>>>> 2) If not, return with -EINTR (and KVM_EXIT_INTR) in case
>>>> there is a signal pending.
>>>>
>>>> That way, the vcpu will not process the stop event from the main loop
>>>> (ie not exit from the kvm_cpu_exec() loop), until MMIO is finished.
>>>
>>> Right, it is fragile, thankfully we appear to adhere to the same
>>> ordering on powerpc so far :-)
>>>
>>> So we'll need to test but it looks like we might be able to fix our
>>> problem without a kernel or API change, just by changing qemu to
>>> do the same exit_request trick for our reboot hypercall.
>>>
>>> Long run however, I wonder whether we should consider an explicit ioctl
>>> to complete those pending operations instead...
>> 
>> It's pointless.  We have to support the old method forever.
> 
> Not in new architectures (even PPC has yet to start using this) or new
> userspaces -- and forever is a long time.  People down the road may very
> well decide that it's time to clean out the deprecated stuff that hasn't
> been used in over a decade.  IMHO this shouldn't be a reason to
> not improve the API, as long as compatibility is possible for as long as
> it is deemed worthwhile.

For qemu this is in common code (kvm-all.c); some architectures might
need wiring up, but the code is there.

For the kernel we need to handle signals, and we need to check for
signals in the atomic guest entry path.  That leads naturally to the
completion-before-signal order.

In fact there's no other way to do it.  If we check for signals before
handling completions, there's no way for userspace to know whether the
completion was completely handled, and whether more completions are
necessary (mmio completions are limited to 8 bytes but some x86
instructions generate many more accesses).

> 
>> There's no
>> material different between sigqueue() + KVM_RUN and KVM_COMPLETE, or a
>> KVM_RUN with a flag that tells it to exit immediately.
> 
> The latter is less fragile and easier to use.


   if (need_exit)
        queue signal

 vs

   run->need_exit = need_exit


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
  2012-08-07  1:32               ` David Gibson
@ 2012-08-07  8:46                 ` Avi Kivity
  -1 siblings, 0 replies; 62+ messages in thread
From: Avi Kivity @ 2012-08-07  8:46 UTC (permalink / raw)
  To: David Gibson
  Cc: Benjamin Herrenschmidt, Marcelo Tosatti, kvm, Alexander Graf,
	Paul Mackerras, kvm-ppc

On 08/07/2012 04:32 AM, David Gibson wrote:
> On Tue, Aug 07, 2012 at 06:57:57AM +1000, Benjamin Herrenschmidt wrote:
>> On Mon, 2012-08-06 at 13:13 +1000, David Gibson wrote:
>> > So, I'm still trying to nut out the implications for H_CEDE, and think
>> > if there are any other hypercalls that might want to block the guest
>> > for a time.  We were considering blocking H_PUT_TCE if qemu devices
>> > had active dma maps on the previously mapped iovas.  I'm not sure if
>> > the discussions that led to the inclusion of the qemu IOMMU code
>> > decided that was wholly unnnecessary or just not necessary for the
>> > time being.
>> 
>> For "sleeping hcalls" they will simply have to set exit_request to
>> complete the hcall from the kernel perspective, leaving us in a state
>> where the kernel is about to restart at srr0 + 4, along with some other
>> flag (stop or halt) to actually freeze the vcpu.
>> 
>> If such an "async" hcall decides to return an error, it can then set
>> gpr3 directly using ioctls before restarting the vcpu.
> 
> Yeah, I'd pretty much convinced myself of that by the end of
> yesterday.  I hope to send patches implementing these fixes today.
> 
> There are also some questions about why our in-kernel H_CEDE works
> kind of differently from x86's hlt instruction implementation (which
> comes out to qemu unless the irqchip is in-kernel as well).  I don't
> think we have an urgent problem there though.

It's the other way round, hlt sleeps in the kernel unless the irqchip is
not in the kernel.  Meaning the normal state of things is to sleep in
the kernel (whether or not you have an emulated interrupt controller in
the kernel -- the term irqchip in kernel is overloaded for x86).


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
@ 2012-08-07  8:46                 ` Avi Kivity
  0 siblings, 0 replies; 62+ messages in thread
From: Avi Kivity @ 2012-08-07  8:46 UTC (permalink / raw)
  To: David Gibson
  Cc: Benjamin Herrenschmidt, Marcelo Tosatti, kvm, Alexander Graf,
	Paul Mackerras, kvm-ppc

On 08/07/2012 04:32 AM, David Gibson wrote:
> On Tue, Aug 07, 2012 at 06:57:57AM +1000, Benjamin Herrenschmidt wrote:
>> On Mon, 2012-08-06 at 13:13 +1000, David Gibson wrote:
>> > So, I'm still trying to nut out the implications for H_CEDE, and think
>> > if there are any other hypercalls that might want to block the guest
>> > for a time.  We were considering blocking H_PUT_TCE if qemu devices
>> > had active dma maps on the previously mapped iovas.  I'm not sure if
>> > the discussions that led to the inclusion of the qemu IOMMU code
>> > decided that was wholly unnnecessary or just not necessary for the
>> > time being.
>> 
>> For "sleeping hcalls" they will simply have to set exit_request to
>> complete the hcall from the kernel perspective, leaving us in a state
>> where the kernel is about to restart at srr0 + 4, along with some other
>> flag (stop or halt) to actually freeze the vcpu.
>> 
>> If such an "async" hcall decides to return an error, it can then set
>> gpr3 directly using ioctls before restarting the vcpu.
> 
> Yeah, I'd pretty much convinced myself of that by the end of
> yesterday.  I hope to send patches implementing these fixes today.
> 
> There are also some questions about why our in-kernel H_CEDE works
> kind of differently from x86's hlt instruction implementation (which
> comes out to qemu unless the irqchip is in-kernel as well).  I don't
> think we have an urgent problem there though.

It's the other way round, hlt sleeps in the kernel unless the irqchip is
not in the kernel.  Meaning the normal state of things is to sleep in
the kernel (whether or not you have an emulated interrupt controller in
the kernel -- the term irqchip in kernel is overloaded for x86).


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
  2012-08-07  8:46                 ` Avi Kivity
@ 2012-08-07 12:14                   ` David Gibson
  -1 siblings, 0 replies; 62+ messages in thread
From: David Gibson @ 2012-08-07 12:14 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Benjamin Herrenschmidt, Marcelo Tosatti, kvm, Alexander Graf,
	Paul Mackerras, kvm-ppc

On Tue, Aug 07, 2012 at 11:46:35AM +0300, Avi Kivity wrote:
> On 08/07/2012 04:32 AM, David Gibson wrote:
> > On Tue, Aug 07, 2012 at 06:57:57AM +1000, Benjamin Herrenschmidt wrote:
> >> On Mon, 2012-08-06 at 13:13 +1000, David Gibson wrote:
> >> > So, I'm still trying to nut out the implications for H_CEDE, and think
> >> > if there are any other hypercalls that might want to block the guest
> >> > for a time.  We were considering blocking H_PUT_TCE if qemu devices
> >> > had active dma maps on the previously mapped iovas.  I'm not sure if
> >> > the discussions that led to the inclusion of the qemu IOMMU code
> >> > decided that was wholly unnnecessary or just not necessary for the
> >> > time being.
> >> 
> >> For "sleeping hcalls" they will simply have to set exit_request to
> >> complete the hcall from the kernel perspective, leaving us in a state
> >> where the kernel is about to restart at srr0 + 4, along with some other
> >> flag (stop or halt) to actually freeze the vcpu.
> >> 
> >> If such an "async" hcall decides to return an error, it can then set
> >> gpr3 directly using ioctls before restarting the vcpu.
> > 
> > Yeah, I'd pretty much convinced myself of that by the end of
> > yesterday.  I hope to send patches implementing these fixes today.
> > 
> > There are also some questions about why our in-kernel H_CEDE works
> > kind of differently from x86's hlt instruction implementation (which
> > comes out to qemu unless the irqchip is in-kernel as well).  I don't
> > think we have an urgent problem there though.
> 
> It's the other way round, hlt sleeps in the kernel unless the irqchip is
> not in the kernel.

That's the same as what I said.

We never have irqchip in kernel (because we haven't written that yet)
but we still sleep in-kernel for CEDE.  I haven't spotted any problem
with that, but now I'm wondering if there is one, since x86 don't do
it in what seems like the analogous situation.

It's possible this works because our decrementer (timer) interrupts
are different at the core level from external interrupts coming from
the PIC, and *are* handled in kernel, but I haven't actually followed
the logic to work out if this is the case.

>  Meaning the normal state of things is to sleep in
> the kernel (whether or not you have an emulated interrupt controller in
> the kernel -- the term irqchip in kernel is overloaded for x86).

Uh.. overloaded in what way.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
@ 2012-08-07 12:14                   ` David Gibson
  0 siblings, 0 replies; 62+ messages in thread
From: David Gibson @ 2012-08-07 12:14 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Benjamin Herrenschmidt, Marcelo Tosatti, kvm, Alexander Graf,
	Paul Mackerras, kvm-ppc

On Tue, Aug 07, 2012 at 11:46:35AM +0300, Avi Kivity wrote:
> On 08/07/2012 04:32 AM, David Gibson wrote:
> > On Tue, Aug 07, 2012 at 06:57:57AM +1000, Benjamin Herrenschmidt wrote:
> >> On Mon, 2012-08-06 at 13:13 +1000, David Gibson wrote:
> >> > So, I'm still trying to nut out the implications for H_CEDE, and think
> >> > if there are any other hypercalls that might want to block the guest
> >> > for a time.  We were considering blocking H_PUT_TCE if qemu devices
> >> > had active dma maps on the previously mapped iovas.  I'm not sure if
> >> > the discussions that led to the inclusion of the qemu IOMMU code
> >> > decided that was wholly unnnecessary or just not necessary for the
> >> > time being.
> >> 
> >> For "sleeping hcalls" they will simply have to set exit_request to
> >> complete the hcall from the kernel perspective, leaving us in a state
> >> where the kernel is about to restart at srr0 + 4, along with some other
> >> flag (stop or halt) to actually freeze the vcpu.
> >> 
> >> If such an "async" hcall decides to return an error, it can then set
> >> gpr3 directly using ioctls before restarting the vcpu.
> > 
> > Yeah, I'd pretty much convinced myself of that by the end of
> > yesterday.  I hope to send patches implementing these fixes today.
> > 
> > There are also some questions about why our in-kernel H_CEDE works
> > kind of differently from x86's hlt instruction implementation (which
> > comes out to qemu unless the irqchip is in-kernel as well).  I don't
> > think we have an urgent problem there though.
> 
> It's the other way round, hlt sleeps in the kernel unless the irqchip is
> not in the kernel.

That's the same as what I said.

We never have irqchip in kernel (because we haven't written that yet)
but we still sleep in-kernel for CEDE.  I haven't spotted any problem
with that, but now I'm wondering if there is one, since x86 don't do
it in what seems like the analogous situation.

It's possible this works because our decrementer (timer) interrupts
are different at the core level from external interrupts coming from
the PIC, and *are* handled in kernel, but I haven't actually followed
the logic to work out if this is the case.

>  Meaning the normal state of things is to sleep in
> the kernel (whether or not you have an emulated interrupt controller in
> the kernel -- the term irqchip in kernel is overloaded for x86).

Uh.. overloaded in what way.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
  2012-08-07 12:14                   ` David Gibson
@ 2012-08-07 13:13                     ` Avi Kivity
  -1 siblings, 0 replies; 62+ messages in thread
From: Avi Kivity @ 2012-08-07 13:13 UTC (permalink / raw)
  To: David Gibson
  Cc: Benjamin Herrenschmidt, Marcelo Tosatti, kvm, Alexander Graf,
	Paul Mackerras, kvm-ppc

On 08/07/2012 03:14 PM, David Gibson wrote:
> On Tue, Aug 07, 2012 at 11:46:35AM +0300, Avi Kivity wrote:
>> On 08/07/2012 04:32 AM, David Gibson wrote:
>> > On Tue, Aug 07, 2012 at 06:57:57AM +1000, Benjamin Herrenschmidt wrote:
>> >> On Mon, 2012-08-06 at 13:13 +1000, David Gibson wrote:
>> >> > So, I'm still trying to nut out the implications for H_CEDE, and think
>> >> > if there are any other hypercalls that might want to block the guest
>> >> > for a time.  We were considering blocking H_PUT_TCE if qemu devices
>> >> > had active dma maps on the previously mapped iovas.  I'm not sure if
>> >> > the discussions that led to the inclusion of the qemu IOMMU code
>> >> > decided that was wholly unnnecessary or just not necessary for the
>> >> > time being.
>> >> 
>> >> For "sleeping hcalls" they will simply have to set exit_request to
>> >> complete the hcall from the kernel perspective, leaving us in a state
>> >> where the kernel is about to restart at srr0 + 4, along with some other
>> >> flag (stop or halt) to actually freeze the vcpu.
>> >> 
>> >> If such an "async" hcall decides to return an error, it can then set
>> >> gpr3 directly using ioctls before restarting the vcpu.
>> > 
>> > Yeah, I'd pretty much convinced myself of that by the end of
>> > yesterday.  I hope to send patches implementing these fixes today.
>> > 
>> > There are also some questions about why our in-kernel H_CEDE works
>> > kind of differently from x86's hlt instruction implementation (which
>> > comes out to qemu unless the irqchip is in-kernel as well).  I don't
>> > think we have an urgent problem there though.
>> 
>> It's the other way round, hlt sleeps in the kernel unless the irqchip is
>> not in the kernel.
> 
> That's the same as what I said.

I meant to stress that the normal way which other archs should emulate
is sleep-in-kernel.

> 
> We never have irqchip in kernel (because we haven't written that yet)
> but we still sleep in-kernel for CEDE.  I haven't spotted any problem
> with that, but now I'm wondering if there is one, since x86 don't do
> it in what seems like the analogous situation.
> 
> It's possible this works because our decrementer (timer) interrupts
> are different at the core level from external interrupts coming from
> the PIC, and *are* handled in kernel, but I haven't actually followed
> the logic to work out if this is the case.
> 
>>  Meaning the normal state of things is to sleep in
>> the kernel (whether or not you have an emulated interrupt controller in
>> the kernel -- the term irqchip in kernel is overloaded for x86).
> 
> Uh.. overloaded in what way.

On x86, irqchip-in-kernel means that the local APICs, the IOAPIC, and
the two PICs are emulated in the kernel.  Now the IOAPIC and the PICs
correspond to non-x86 interrupt controllers, but the local APIC is more
tightly coupled to the core.  Interrupt acceptance by the core is an
operation that involved synchronous communication with the local APIC:
the APIC presents the interrupt, the core accepts it based on the value
of the interrupt enable flag and possible a register (CR8), then the
APIC updates the ISR and IRR.

The upshot is that if the local APIC is in userspace, interrupts must be
synchronous with vcpu exection, so that KVM_INTERRUPT is a vcpu ioctl
and HLT is emulated in userspace (so that local APIC emulation can check
if an interrupt wakes it up or not).  As soon as the local APIC is
emulated in the kernel, HLT can be emulated there as well, and
interrupts become asynchronous (KVM_IRQ_LINE, a vm ioctl).

So irqchip_in_kernel, for most discussions, really means whether
interrupt queuing is synchronous or asynchronous.  It has nothing to do
with the interrupt controllers per se.  All non-x86 archs always have
irqchip_in_kernel() in this sense.

Peter has started to fix up this naming mess in qemu.  I guess we should
do the same for the kernel (except for ABIs) and document it, because it
keeps generating confusion.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
@ 2012-08-07 13:13                     ` Avi Kivity
  0 siblings, 0 replies; 62+ messages in thread
From: Avi Kivity @ 2012-08-07 13:13 UTC (permalink / raw)
  To: David Gibson
  Cc: Benjamin Herrenschmidt, Marcelo Tosatti, kvm, Alexander Graf,
	Paul Mackerras, kvm-ppc

On 08/07/2012 03:14 PM, David Gibson wrote:
> On Tue, Aug 07, 2012 at 11:46:35AM +0300, Avi Kivity wrote:
>> On 08/07/2012 04:32 AM, David Gibson wrote:
>> > On Tue, Aug 07, 2012 at 06:57:57AM +1000, Benjamin Herrenschmidt wrote:
>> >> On Mon, 2012-08-06 at 13:13 +1000, David Gibson wrote:
>> >> > So, I'm still trying to nut out the implications for H_CEDE, and think
>> >> > if there are any other hypercalls that might want to block the guest
>> >> > for a time.  We were considering blocking H_PUT_TCE if qemu devices
>> >> > had active dma maps on the previously mapped iovas.  I'm not sure if
>> >> > the discussions that led to the inclusion of the qemu IOMMU code
>> >> > decided that was wholly unnnecessary or just not necessary for the
>> >> > time being.
>> >> 
>> >> For "sleeping hcalls" they will simply have to set exit_request to
>> >> complete the hcall from the kernel perspective, leaving us in a state
>> >> where the kernel is about to restart at srr0 + 4, along with some other
>> >> flag (stop or halt) to actually freeze the vcpu.
>> >> 
>> >> If such an "async" hcall decides to return an error, it can then set
>> >> gpr3 directly using ioctls before restarting the vcpu.
>> > 
>> > Yeah, I'd pretty much convinced myself of that by the end of
>> > yesterday.  I hope to send patches implementing these fixes today.
>> > 
>> > There are also some questions about why our in-kernel H_CEDE works
>> > kind of differently from x86's hlt instruction implementation (which
>> > comes out to qemu unless the irqchip is in-kernel as well).  I don't
>> > think we have an urgent problem there though.
>> 
>> It's the other way round, hlt sleeps in the kernel unless the irqchip is
>> not in the kernel.
> 
> That's the same as what I said.

I meant to stress that the normal way which other archs should emulate
is sleep-in-kernel.

> 
> We never have irqchip in kernel (because we haven't written that yet)
> but we still sleep in-kernel for CEDE.  I haven't spotted any problem
> with that, but now I'm wondering if there is one, since x86 don't do
> it in what seems like the analogous situation.
> 
> It's possible this works because our decrementer (timer) interrupts
> are different at the core level from external interrupts coming from
> the PIC, and *are* handled in kernel, but I haven't actually followed
> the logic to work out if this is the case.
> 
>>  Meaning the normal state of things is to sleep in
>> the kernel (whether or not you have an emulated interrupt controller in
>> the kernel -- the term irqchip in kernel is overloaded for x86).
> 
> Uh.. overloaded in what way.

On x86, irqchip-in-kernel means that the local APICs, the IOAPIC, and
the two PICs are emulated in the kernel.  Now the IOAPIC and the PICs
correspond to non-x86 interrupt controllers, but the local APIC is more
tightly coupled to the core.  Interrupt acceptance by the core is an
operation that involved synchronous communication with the local APIC:
the APIC presents the interrupt, the core accepts it based on the value
of the interrupt enable flag and possible a register (CR8), then the
APIC updates the ISR and IRR.

The upshot is that if the local APIC is in userspace, interrupts must be
synchronous with vcpu exection, so that KVM_INTERRUPT is a vcpu ioctl
and HLT is emulated in userspace (so that local APIC emulation can check
if an interrupt wakes it up or not).  As soon as the local APIC is
emulated in the kernel, HLT can be emulated there as well, and
interrupts become asynchronous (KVM_IRQ_LINE, a vm ioctl).

So irqchip_in_kernel, for most discussions, really means whether
interrupt queuing is synchronous or asynchronous.  It has nothing to do
with the interrupt controllers per se.  All non-x86 archs always have
irqchip_in_kernel() in this sense.

Peter has started to fix up this naming mess in qemu.  I guess we should
do the same for the kernel (except for ABIs) and document it, because it
keeps generating confusion.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
  2012-08-07 13:13                     ` Avi Kivity
@ 2012-08-07 21:09                       ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 62+ messages in thread
From: Benjamin Herrenschmidt @ 2012-08-07 21:09 UTC (permalink / raw)
  To: Avi Kivity
  Cc: David Gibson, Marcelo Tosatti, kvm, Alexander Graf,
	Paul Mackerras, kvm-ppc

On Tue, 2012-08-07 at 16:13 +0300, Avi Kivity wrote:
> 

> Peter has started to fix up this naming mess in qemu.  I guess we should
> do the same for the kernel (except for ABIs) and document it, because it
> keeps generating confusion. 

Ok so our current situation is that the XICS and MPIC are emulated in
userspace entirely but the link between them and the VCPU is the
asynchronous EE line so we are fine.

I'm currently working on moving the XICS into the kernel for performance
reasons, however, just like ARM VGIC, I can't seem to find a way to
"fit" in the "generic" irqchip code in there. It's just not generic at
all and quite x86 centric :-)

So for now I'm just doing my own version of CREATE_IRQCHIP to create it
and KVM_INTERRUPT to trigger the various interrupts. None of the mapping
stuff (which we really don't need).

That's a bit of a problem vs. some of the code qemu-side such as in
virtio-pci which does seem to be written around the model exposed by the
x86 stuff and relies on doing such mappings so I think we'll have to
butcher some of that.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
@ 2012-08-07 21:09                       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 62+ messages in thread
From: Benjamin Herrenschmidt @ 2012-08-07 21:09 UTC (permalink / raw)
  To: Avi Kivity
  Cc: David Gibson, Marcelo Tosatti, kvm, Alexander Graf,
	Paul Mackerras, kvm-ppc

On Tue, 2012-08-07 at 16:13 +0300, Avi Kivity wrote:
> 

> Peter has started to fix up this naming mess in qemu.  I guess we should
> do the same for the kernel (except for ABIs) and document it, because it
> keeps generating confusion. 

Ok so our current situation is that the XICS and MPIC are emulated in
userspace entirely but the link between them and the VCPU is the
asynchronous EE line so we are fine.

I'm currently working on moving the XICS into the kernel for performance
reasons, however, just like ARM VGIC, I can't seem to find a way to
"fit" in the "generic" irqchip code in there. It's just not generic at
all and quite x86 centric :-)

So for now I'm just doing my own version of CREATE_IRQCHIP to create it
and KVM_INTERRUPT to trigger the various interrupts. None of the mapping
stuff (which we really don't need).

That's a bit of a problem vs. some of the code qemu-side such as in
virtio-pci which does seem to be written around the model exposed by the
x86 stuff and relies on doing such mappings so I think we'll have to
butcher some of that.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
  2012-08-07 13:13                     ` Avi Kivity
@ 2012-08-08  0:49                       ` David Gibson
  -1 siblings, 0 replies; 62+ messages in thread
From: David Gibson @ 2012-08-08  0:49 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Benjamin Herrenschmidt, Marcelo Tosatti, kvm, Alexander Graf,
	Paul Mackerras, kvm-ppc

On Tue, Aug 07, 2012 at 04:13:49PM +0300, Avi Kivity wrote:
> On 08/07/2012 03:14 PM, David Gibson wrote:
> > On Tue, Aug 07, 2012 at 11:46:35AM +0300, Avi Kivity wrote:
> >> On 08/07/2012 04:32 AM, David Gibson wrote:
> >> > On Tue, Aug 07, 2012 at 06:57:57AM +1000, Benjamin Herrenschmidt wrote:
> >> >> On Mon, 2012-08-06 at 13:13 +1000, David Gibson wrote:
> >> >> > So, I'm still trying to nut out the implications for H_CEDE, and think
> >> >> > if there are any other hypercalls that might want to block the guest
> >> >> > for a time.  We were considering blocking H_PUT_TCE if qemu devices
> >> >> > had active dma maps on the previously mapped iovas.  I'm not sure if
> >> >> > the discussions that led to the inclusion of the qemu IOMMU code
> >> >> > decided that was wholly unnnecessary or just not necessary for the
> >> >> > time being.
> >> >> 
> >> >> For "sleeping hcalls" they will simply have to set exit_request to
> >> >> complete the hcall from the kernel perspective, leaving us in a state
> >> >> where the kernel is about to restart at srr0 + 4, along with some other
> >> >> flag (stop or halt) to actually freeze the vcpu.
> >> >> 
> >> >> If such an "async" hcall decides to return an error, it can then set
> >> >> gpr3 directly using ioctls before restarting the vcpu.
> >> > 
> >> > Yeah, I'd pretty much convinced myself of that by the end of
> >> > yesterday.  I hope to send patches implementing these fixes today.
> >> > 
> >> > There are also some questions about why our in-kernel H_CEDE works
> >> > kind of differently from x86's hlt instruction implementation (which
> >> > comes out to qemu unless the irqchip is in-kernel as well).  I don't
> >> > think we have an urgent problem there though.
> >> 
> >> It's the other way round, hlt sleeps in the kernel unless the irqchip is
> >> not in the kernel.
> > 
> > That's the same as what I said.
> 
> I meant to stress that the normal way which other archs should emulate
> is sleep-in-kernel.

Ok.

> > We never have irqchip in kernel (because we haven't written that yet)
> > but we still sleep in-kernel for CEDE.  I haven't spotted any problem
> > with that, but now I'm wondering if there is one, since x86 don't do
> > it in what seems like the analogous situation.
> > 
> > It's possible this works because our decrementer (timer) interrupts
> > are different at the core level from external interrupts coming from
> > the PIC, and *are* handled in kernel, but I haven't actually followed
> > the logic to work out if this is the case.
> > 
> >>  Meaning the normal state of things is to sleep in
> >> the kernel (whether or not you have an emulated interrupt controller in
> >> the kernel -- the term irqchip in kernel is overloaded for x86).
> > 
> > Uh.. overloaded in what way.
> 
> On x86, irqchip-in-kernel means that the local APICs, the IOAPIC, and
> the two PICs are emulated in the kernel.  Now the IOAPIC and the PICs
> correspond to non-x86 interrupt controllers, but the local APIC is more
> tightly coupled to the core.  Interrupt acceptance by the core is an
> operation that involved synchronous communication with the local APIC:
> the APIC presents the interrupt, the core accepts it based on the value
> of the interrupt enable flag and possible a register (CR8), then the
> APIC updates the ISR and IRR.
> 
> The upshot is that if the local APIC is in userspace, interrupts must be
> synchronous with vcpu exection, so that KVM_INTERRUPT is a vcpu ioctl
> and HLT is emulated in userspace (so that local APIC emulation can check
> if an interrupt wakes it up or not).

Sorry, still not 100% getting it.  When the vcpu is actually running
code, that synchronous communication must still be accomplished via
the KVM_INTERRUPT ioctl, yes?  So what makes HLT different, that the
communication can't be accomplished in that case.

> As soon as the local APIC is
> emulated in the kernel, HLT can be emulated there as well, and
> interrupts become asynchronous (KVM_IRQ_LINE, a vm ioctl).
> 
> So irqchip_in_kernel, for most discussions, really means whether
> interrupt queuing is synchronous or asynchronous.  It has nothing to do
> with the interrupt controllers per se.  All non-x86 archs always have
> irqchip_in_kernel() in this sense.
> 
> Peter has started to fix up this naming mess in qemu.  I guess we should
> do the same for the kernel (except for ABIs) and document it, because it
> keeps generating confusion.

Ok.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
@ 2012-08-08  0:49                       ` David Gibson
  0 siblings, 0 replies; 62+ messages in thread
From: David Gibson @ 2012-08-08  0:49 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Benjamin Herrenschmidt, Marcelo Tosatti, kvm, Alexander Graf,
	Paul Mackerras, kvm-ppc

On Tue, Aug 07, 2012 at 04:13:49PM +0300, Avi Kivity wrote:
> On 08/07/2012 03:14 PM, David Gibson wrote:
> > On Tue, Aug 07, 2012 at 11:46:35AM +0300, Avi Kivity wrote:
> >> On 08/07/2012 04:32 AM, David Gibson wrote:
> >> > On Tue, Aug 07, 2012 at 06:57:57AM +1000, Benjamin Herrenschmidt wrote:
> >> >> On Mon, 2012-08-06 at 13:13 +1000, David Gibson wrote:
> >> >> > So, I'm still trying to nut out the implications for H_CEDE, and think
> >> >> > if there are any other hypercalls that might want to block the guest
> >> >> > for a time.  We were considering blocking H_PUT_TCE if qemu devices
> >> >> > had active dma maps on the previously mapped iovas.  I'm not sure if
> >> >> > the discussions that led to the inclusion of the qemu IOMMU code
> >> >> > decided that was wholly unnnecessary or just not necessary for the
> >> >> > time being.
> >> >> 
> >> >> For "sleeping hcalls" they will simply have to set exit_request to
> >> >> complete the hcall from the kernel perspective, leaving us in a state
> >> >> where the kernel is about to restart at srr0 + 4, along with some other
> >> >> flag (stop or halt) to actually freeze the vcpu.
> >> >> 
> >> >> If such an "async" hcall decides to return an error, it can then set
> >> >> gpr3 directly using ioctls before restarting the vcpu.
> >> > 
> >> > Yeah, I'd pretty much convinced myself of that by the end of
> >> > yesterday.  I hope to send patches implementing these fixes today.
> >> > 
> >> > There are also some questions about why our in-kernel H_CEDE works
> >> > kind of differently from x86's hlt instruction implementation (which
> >> > comes out to qemu unless the irqchip is in-kernel as well).  I don't
> >> > think we have an urgent problem there though.
> >> 
> >> It's the other way round, hlt sleeps in the kernel unless the irqchip is
> >> not in the kernel.
> > 
> > That's the same as what I said.
> 
> I meant to stress that the normal way which other archs should emulate
> is sleep-in-kernel.

Ok.

> > We never have irqchip in kernel (because we haven't written that yet)
> > but we still sleep in-kernel for CEDE.  I haven't spotted any problem
> > with that, but now I'm wondering if there is one, since x86 don't do
> > it in what seems like the analogous situation.
> > 
> > It's possible this works because our decrementer (timer) interrupts
> > are different at the core level from external interrupts coming from
> > the PIC, and *are* handled in kernel, but I haven't actually followed
> > the logic to work out if this is the case.
> > 
> >>  Meaning the normal state of things is to sleep in
> >> the kernel (whether or not you have an emulated interrupt controller in
> >> the kernel -- the term irqchip in kernel is overloaded for x86).
> > 
> > Uh.. overloaded in what way.
> 
> On x86, irqchip-in-kernel means that the local APICs, the IOAPIC, and
> the two PICs are emulated in the kernel.  Now the IOAPIC and the PICs
> correspond to non-x86 interrupt controllers, but the local APIC is more
> tightly coupled to the core.  Interrupt acceptance by the core is an
> operation that involved synchronous communication with the local APIC:
> the APIC presents the interrupt, the core accepts it based on the value
> of the interrupt enable flag and possible a register (CR8), then the
> APIC updates the ISR and IRR.
> 
> The upshot is that if the local APIC is in userspace, interrupts must be
> synchronous with vcpu exection, so that KVM_INTERRUPT is a vcpu ioctl
> and HLT is emulated in userspace (so that local APIC emulation can check
> if an interrupt wakes it up or not).

Sorry, still not 100% getting it.  When the vcpu is actually running
code, that synchronous communication must still be accomplished via
the KVM_INTERRUPT ioctl, yes?  So what makes HLT different, that the
communication can't be accomplished in that case.

> As soon as the local APIC is
> emulated in the kernel, HLT can be emulated there as well, and
> interrupts become asynchronous (KVM_IRQ_LINE, a vm ioctl).
> 
> So irqchip_in_kernel, for most discussions, really means whether
> interrupt queuing is synchronous or asynchronous.  It has nothing to do
> with the interrupt controllers per se.  All non-x86 archs always have
> irqchip_in_kernel() in this sense.
> 
> Peter has started to fix up this naming mess in qemu.  I guess we should
> do the same for the kernel (except for ABIs) and document it, because it
> keeps generating confusion.

Ok.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
  2012-08-07 21:09                       ` Benjamin Herrenschmidt
@ 2012-08-08  8:52                         ` Avi Kivity
  -1 siblings, 0 replies; 62+ messages in thread
From: Avi Kivity @ 2012-08-08  8:52 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: David Gibson, Marcelo Tosatti, kvm, Alexander Graf,
	Paul Mackerras, kvm-ppc

On 08/08/2012 12:09 AM, Benjamin Herrenschmidt wrote:
> On Tue, 2012-08-07 at 16:13 +0300, Avi Kivity wrote:
>> 
> 
>> Peter has started to fix up this naming mess in qemu.  I guess we should
>> do the same for the kernel (except for ABIs) and document it, because it
>> keeps generating confusion. 
> 
> Ok so our current situation is that the XICS and MPIC are emulated in
> userspace entirely but the link between them and the VCPU is the
> asynchronous EE line so we are fine.
> 
> I'm currently working on moving the XICS into the kernel for performance
> reasons, however, just like ARM VGIC, I can't seem to find a way to
> "fit" in the "generic" irqchip code in there. It's just not generic at
> all and quite x86 centric :-)

The generic code is for the two apic architectures: x64 and ia64.  Don't
try to shoehorn it in, you'll damage both the shoe and your foot.

> 
> So for now I'm just doing my own version of CREATE_IRQCHIP to create it
> and KVM_INTERRUPT to trigger the various interrupts. None of the mapping
> stuff (which we really don't need).

You mean KVM_IRQ_LINE.  KVM_INTERRUPT is a synchronous vcpu ioctl.

> 
> That's a bit of a problem vs. some of the code qemu-side such as in
> virtio-pci which does seem to be written around the model exposed by the
> x86 stuff and relies on doing such mappings so I think we'll have to
> butcher some of that.

Can you elaborate? virtio-pci is pci-centric, there should be nothing
x86 specific there.


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
@ 2012-08-08  8:52                         ` Avi Kivity
  0 siblings, 0 replies; 62+ messages in thread
From: Avi Kivity @ 2012-08-08  8:52 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: David Gibson, Marcelo Tosatti, kvm, Alexander Graf,
	Paul Mackerras, kvm-ppc

On 08/08/2012 12:09 AM, Benjamin Herrenschmidt wrote:
> On Tue, 2012-08-07 at 16:13 +0300, Avi Kivity wrote:
>> 
> 
>> Peter has started to fix up this naming mess in qemu.  I guess we should
>> do the same for the kernel (except for ABIs) and document it, because it
>> keeps generating confusion. 
> 
> Ok so our current situation is that the XICS and MPIC are emulated in
> userspace entirely but the link between them and the VCPU is the
> asynchronous EE line so we are fine.
> 
> I'm currently working on moving the XICS into the kernel for performance
> reasons, however, just like ARM VGIC, I can't seem to find a way to
> "fit" in the "generic" irqchip code in there. It's just not generic at
> all and quite x86 centric :-)

The generic code is for the two apic architectures: x64 and ia64.  Don't
try to shoehorn it in, you'll damage both the shoe and your foot.

> 
> So for now I'm just doing my own version of CREATE_IRQCHIP to create it
> and KVM_INTERRUPT to trigger the various interrupts. None of the mapping
> stuff (which we really don't need).

You mean KVM_IRQ_LINE.  KVM_INTERRUPT is a synchronous vcpu ioctl.

> 
> That's a bit of a problem vs. some of the code qemu-side such as in
> virtio-pci which does seem to be written around the model exposed by the
> x86 stuff and relies on doing such mappings so I think we'll have to
> butcher some of that.

Can you elaborate? virtio-pci is pci-centric, there should be nothing
x86 specific there.


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
  2012-08-08  0:49                       ` David Gibson
@ 2012-08-08  8:58                         ` Avi Kivity
  -1 siblings, 0 replies; 62+ messages in thread
From: Avi Kivity @ 2012-08-08  8:58 UTC (permalink / raw)
  To: David Gibson
  Cc: Benjamin Herrenschmidt, Marcelo Tosatti, kvm, Alexander Graf,
	Paul Mackerras, kvm-ppc

On 08/08/2012 03:49 AM, David Gibson wrote:
>> > We never have irqchip in kernel (because we haven't written that yet)
>> > but we still sleep in-kernel for CEDE.  I haven't spotted any problem
>> > with that, but now I'm wondering if there is one, since x86 don't do
>> > it in what seems like the analogous situation.
>> > 
>> > It's possible this works because our decrementer (timer) interrupts
>> > are different at the core level from external interrupts coming from
>> > the PIC, and *are* handled in kernel, but I haven't actually followed
>> > the logic to work out if this is the case.
>> > 
>> >>  Meaning the normal state of things is to sleep in
>> >> the kernel (whether or not you have an emulated interrupt controller in
>> >> the kernel -- the term irqchip in kernel is overloaded for x86).
>> > 
>> > Uh.. overloaded in what way.
>> 
>> On x86, irqchip-in-kernel means that the local APICs, the IOAPIC, and
>> the two PICs are emulated in the kernel.  Now the IOAPIC and the PICs
>> correspond to non-x86 interrupt controllers, but the local APIC is more
>> tightly coupled to the core.  Interrupt acceptance by the core is an
>> operation that involved synchronous communication with the local APIC:
>> the APIC presents the interrupt, the core accepts it based on the value
>> of the interrupt enable flag and possible a register (CR8), then the
>> APIC updates the ISR and IRR.
>> 
>> The upshot is that if the local APIC is in userspace, interrupts must be
>> synchronous with vcpu exection, so that KVM_INTERRUPT is a vcpu ioctl
>> and HLT is emulated in userspace (so that local APIC emulation can check
>> if an interrupt wakes it up or not).
> 
> Sorry, still not 100% getting it.  When the vcpu is actually running
> code, that synchronous communication must still be accomplished via
> the KVM_INTERRUPT ioctl, yes?  So what makes HLT different, that the
> communication can't be accomplished in that case.

No, you're correct.  HLT could have been emulated in userspace, it just
wasn't.  The correct statement is that HLT was arbitrarily chosen to be
emulated in userspace with the synchronous model, but the asynchronous
model forced it into the kernel.


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
@ 2012-08-08  8:58                         ` Avi Kivity
  0 siblings, 0 replies; 62+ messages in thread
From: Avi Kivity @ 2012-08-08  8:58 UTC (permalink / raw)
  To: David Gibson
  Cc: Benjamin Herrenschmidt, Marcelo Tosatti, kvm, Alexander Graf,
	Paul Mackerras, kvm-ppc

On 08/08/2012 03:49 AM, David Gibson wrote:
>> > We never have irqchip in kernel (because we haven't written that yet)
>> > but we still sleep in-kernel for CEDE.  I haven't spotted any problem
>> > with that, but now I'm wondering if there is one, since x86 don't do
>> > it in what seems like the analogous situation.
>> > 
>> > It's possible this works because our decrementer (timer) interrupts
>> > are different at the core level from external interrupts coming from
>> > the PIC, and *are* handled in kernel, but I haven't actually followed
>> > the logic to work out if this is the case.
>> > 
>> >>  Meaning the normal state of things is to sleep in
>> >> the kernel (whether or not you have an emulated interrupt controller in
>> >> the kernel -- the term irqchip in kernel is overloaded for x86).
>> > 
>> > Uh.. overloaded in what way.
>> 
>> On x86, irqchip-in-kernel means that the local APICs, the IOAPIC, and
>> the two PICs are emulated in the kernel.  Now the IOAPIC and the PICs
>> correspond to non-x86 interrupt controllers, but the local APIC is more
>> tightly coupled to the core.  Interrupt acceptance by the core is an
>> operation that involved synchronous communication with the local APIC:
>> the APIC presents the interrupt, the core accepts it based on the value
>> of the interrupt enable flag and possible a register (CR8), then the
>> APIC updates the ISR and IRR.
>> 
>> The upshot is that if the local APIC is in userspace, interrupts must be
>> synchronous with vcpu exection, so that KVM_INTERRUPT is a vcpu ioctl
>> and HLT is emulated in userspace (so that local APIC emulation can check
>> if an interrupt wakes it up or not).
> 
> Sorry, still not 100% getting it.  When the vcpu is actually running
> code, that synchronous communication must still be accomplished via
> the KVM_INTERRUPT ioctl, yes?  So what makes HLT different, that the
> communication can't be accomplished in that case.

No, you're correct.  HLT could have been emulated in userspace, it just
wasn't.  The correct statement is that HLT was arbitrarily chosen to be
emulated in userspace with the synchronous model, but the asynchronous
model forced it into the kernel.


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
  2012-08-08  8:52                         ` Avi Kivity
@ 2012-08-08  9:27                           ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 62+ messages in thread
From: Benjamin Herrenschmidt @ 2012-08-08  9:27 UTC (permalink / raw)
  To: Avi Kivity
  Cc: David Gibson, Marcelo Tosatti, kvm, Alexander Graf,
	Paul Mackerras, kvm-ppc

On Wed, 2012-08-08 at 11:52 +0300, Avi Kivity wrote:
> > So for now I'm just doing my own version of CREATE_IRQCHIP to create it
> > and KVM_INTERRUPT to trigger the various interrupts. None of the mapping
> > stuff (which we really don't need).
> 
> You mean KVM_IRQ_LINE.  KVM_INTERRUPT is a synchronous vcpu ioctl.

Yes, sorry, brain and fingers not agreeing :-)

> > That's a bit of a problem vs. some of the code qemu-side such as in
> > virtio-pci which does seem to be written around the model exposed by the
> > x86 stuff and relies on doing such mappings so I think we'll have to
> > butcher some of that.
> 
> Can you elaborate? virtio-pci is pci-centric, there should be nothing
> x86 specific there.

The kvm_irqchip_add_msi_route & co is a problem, it more/less dives into
the x86/ia64 specific routing stuff at a generic level, I'll have to
hook things up differently in qemu I suspect.

Not a huge deal, we can sort it out.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
@ 2012-08-08  9:27                           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 62+ messages in thread
From: Benjamin Herrenschmidt @ 2012-08-08  9:27 UTC (permalink / raw)
  To: Avi Kivity
  Cc: David Gibson, Marcelo Tosatti, kvm, Alexander Graf,
	Paul Mackerras, kvm-ppc

On Wed, 2012-08-08 at 11:52 +0300, Avi Kivity wrote:
> > So for now I'm just doing my own version of CREATE_IRQCHIP to create it
> > and KVM_INTERRUPT to trigger the various interrupts. None of the mapping
> > stuff (which we really don't need).
> 
> You mean KVM_IRQ_LINE.  KVM_INTERRUPT is a synchronous vcpu ioctl.

Yes, sorry, brain and fingers not agreeing :-)

> > That's a bit of a problem vs. some of the code qemu-side such as in
> > virtio-pci which does seem to be written around the model exposed by the
> > x86 stuff and relies on doing such mappings so I think we'll have to
> > butcher some of that.
> 
> Can you elaborate? virtio-pci is pci-centric, there should be nothing
> x86 specific there.

The kvm_irqchip_add_msi_route & co is a problem, it more/less dives into
the x86/ia64 specific routing stuff at a generic level, I'll have to
hook things up differently in qemu I suspect.

Not a huge deal, we can sort it out.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
  2012-08-08  8:58                         ` Avi Kivity
@ 2012-08-08 11:59                           ` David Gibson
  -1 siblings, 0 replies; 62+ messages in thread
From: David Gibson @ 2012-08-08 11:59 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Benjamin Herrenschmidt, Marcelo Tosatti, kvm, Alexander Graf,
	Paul Mackerras, kvm-ppc

On Wed, Aug 08, 2012 at 11:58:58AM +0300, Avi Kivity wrote:
> On 08/08/2012 03:49 AM, David Gibson wrote:
> >> > We never have irqchip in kernel (because we haven't written that yet)
> >> > but we still sleep in-kernel for CEDE.  I haven't spotted any problem
> >> > with that, but now I'm wondering if there is one, since x86 don't do
> >> > it in what seems like the analogous situation.
> >> > 
> >> > It's possible this works because our decrementer (timer) interrupts
> >> > are different at the core level from external interrupts coming from
> >> > the PIC, and *are* handled in kernel, but I haven't actually followed
> >> > the logic to work out if this is the case.
> >> > 
> >> >>  Meaning the normal state of things is to sleep in
> >> >> the kernel (whether or not you have an emulated interrupt controller in
> >> >> the kernel -- the term irqchip in kernel is overloaded for x86).
> >> > 
> >> > Uh.. overloaded in what way.
> >> 
> >> On x86, irqchip-in-kernel means that the local APICs, the IOAPIC, and
> >> the two PICs are emulated in the kernel.  Now the IOAPIC and the PICs
> >> correspond to non-x86 interrupt controllers, but the local APIC is more
> >> tightly coupled to the core.  Interrupt acceptance by the core is an
> >> operation that involved synchronous communication with the local APIC:
> >> the APIC presents the interrupt, the core accepts it based on the value
> >> of the interrupt enable flag and possible a register (CR8), then the
> >> APIC updates the ISR and IRR.
> >> 
> >> The upshot is that if the local APIC is in userspace, interrupts must be
> >> synchronous with vcpu exection, so that KVM_INTERRUPT is a vcpu ioctl
> >> and HLT is emulated in userspace (so that local APIC emulation can check
> >> if an interrupt wakes it up or not).
> > 
> > Sorry, still not 100% getting it.  When the vcpu is actually running
> > code, that synchronous communication must still be accomplished via
> > the KVM_INTERRUPT ioctl, yes?  So what makes HLT different, that the
> > communication can't be accomplished in that case.
> 
> No, you're correct.  HLT could have been emulated in userspace, it just
> wasn't.  The correct statement is that HLT was arbitrarily chosen to be
> emulated in userspace with the synchronous model, but the asynchronous
> model forced it into the kernel.

Aha!  Ok, understood.  Uh, assuming you meant kernelspace, not
userspace in the first line there, anyway.  Ok, so I am now reassured
that our current handling of CEDE in kernelspace does not cause
problems.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
@ 2012-08-08 11:59                           ` David Gibson
  0 siblings, 0 replies; 62+ messages in thread
From: David Gibson @ 2012-08-08 11:59 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Benjamin Herrenschmidt, Marcelo Tosatti, kvm, Alexander Graf,
	Paul Mackerras, kvm-ppc

On Wed, Aug 08, 2012 at 11:58:58AM +0300, Avi Kivity wrote:
> On 08/08/2012 03:49 AM, David Gibson wrote:
> >> > We never have irqchip in kernel (because we haven't written that yet)
> >> > but we still sleep in-kernel for CEDE.  I haven't spotted any problem
> >> > with that, but now I'm wondering if there is one, since x86 don't do
> >> > it in what seems like the analogous situation.
> >> > 
> >> > It's possible this works because our decrementer (timer) interrupts
> >> > are different at the core level from external interrupts coming from
> >> > the PIC, and *are* handled in kernel, but I haven't actually followed
> >> > the logic to work out if this is the case.
> >> > 
> >> >>  Meaning the normal state of things is to sleep in
> >> >> the kernel (whether or not you have an emulated interrupt controller in
> >> >> the kernel -- the term irqchip in kernel is overloaded for x86).
> >> > 
> >> > Uh.. overloaded in what way.
> >> 
> >> On x86, irqchip-in-kernel means that the local APICs, the IOAPIC, and
> >> the two PICs are emulated in the kernel.  Now the IOAPIC and the PICs
> >> correspond to non-x86 interrupt controllers, but the local APIC is more
> >> tightly coupled to the core.  Interrupt acceptance by the core is an
> >> operation that involved synchronous communication with the local APIC:
> >> the APIC presents the interrupt, the core accepts it based on the value
> >> of the interrupt enable flag and possible a register (CR8), then the
> >> APIC updates the ISR and IRR.
> >> 
> >> The upshot is that if the local APIC is in userspace, interrupts must be
> >> synchronous with vcpu exection, so that KVM_INTERRUPT is a vcpu ioctl
> >> and HLT is emulated in userspace (so that local APIC emulation can check
> >> if an interrupt wakes it up or not).
> > 
> > Sorry, still not 100% getting it.  When the vcpu is actually running
> > code, that synchronous communication must still be accomplished via
> > the KVM_INTERRUPT ioctl, yes?  So what makes HLT different, that the
> > communication can't be accomplished in that case.
> 
> No, you're correct.  HLT could have been emulated in userspace, it just
> wasn't.  The correct statement is that HLT was arbitrarily chosen to be
> emulated in userspace with the synchronous model, but the asynchronous
> model forced it into the kernel.

Aha!  Ok, understood.  Uh, assuming you meant kernelspace, not
userspace in the first line there, anyway.  Ok, so I am now reassured
that our current handling of CEDE in kernelspace does not cause
problems.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
  2012-08-08 11:59                           ` David Gibson
@ 2012-08-08 12:42                             ` Avi Kivity
  -1 siblings, 0 replies; 62+ messages in thread
From: Avi Kivity @ 2012-08-08 12:42 UTC (permalink / raw)
  To: David Gibson
  Cc: Benjamin Herrenschmidt, Marcelo Tosatti, kvm, Alexander Graf,
	Paul Mackerras, kvm-ppc

On 08/08/2012 02:59 PM, David Gibson wrote:
>> 
>> No, you're correct.  HLT could have been emulated in userspace, it just
>> wasn't.  The correct statement is that HLT was arbitrarily chosen to be
>> emulated in userspace with the synchronous model, but the asynchronous
>> model forced it into the kernel.
> 
> Aha!  Ok, understood.  Uh, assuming you meant kernelspace, not
> userspace in the first line there, anyway.  

I did.

> Ok, so I am now reassured
> that our current handling of CEDE in kernelspace does not cause
> problems.

Great.  It's a real pity the original local APIC implementation was in
userspace, it causes no end of confusion, and in fact was broken for smp
until recently even though it is 7 years old.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Reset problem vs. MMIO emulation, hypercalls, etc...
@ 2012-08-08 12:42                             ` Avi Kivity
  0 siblings, 0 replies; 62+ messages in thread
From: Avi Kivity @ 2012-08-08 12:42 UTC (permalink / raw)
  To: David Gibson
  Cc: Benjamin Herrenschmidt, Marcelo Tosatti, kvm, Alexander Graf,
	Paul Mackerras, kvm-ppc

On 08/08/2012 02:59 PM, David Gibson wrote:
>> 
>> No, you're correct.  HLT could have been emulated in userspace, it just
>> wasn't.  The correct statement is that HLT was arbitrarily chosen to be
>> emulated in userspace with the synchronous model, but the asynchronous
>> model forced it into the kernel.
> 
> Aha!  Ok, understood.  Uh, assuming you meant kernelspace, not
> userspace in the first line there, anyway.  

I did.

> Ok, so I am now reassured
> that our current handling of CEDE in kernelspace does not cause
> problems.

Great.  It's a real pity the original local APIC implementation was in
userspace, it causes no end of confusion, and in fact was broken for smp
until recently even though it is 7 years old.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2012-08-08 12:42 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-08-01  3:17 Reset problem vs. MMIO emulation, hypercalls, etc Benjamin Herrenschmidt
2012-08-01  3:17 ` Benjamin Herrenschmidt
2012-08-02 10:46 ` Alexander Graf
2012-08-02 10:46   ` Alexander Graf
2012-08-02 12:22   ` Benjamin Herrenschmidt
2012-08-02 12:22     ` Benjamin Herrenschmidt
2012-08-02 12:35 ` Avi Kivity
2012-08-02 12:35   ` Avi Kivity
2012-08-02 12:59   ` Alexander Graf
2012-08-02 12:59     ` Alexander Graf
2012-08-02 13:05     ` Avi Kivity
2012-08-02 13:05       ` Avi Kivity
2012-08-02 20:29       ` Benjamin Herrenschmidt
2012-08-02 20:29         ` Benjamin Herrenschmidt
2012-08-05  8:55         ` Avi Kivity
2012-08-05  8:55           ` Avi Kivity
2012-08-05 20:45           ` Benjamin Herrenschmidt
2012-08-05 20:45             ` Benjamin Herrenschmidt
2012-08-02 20:20   ` Benjamin Herrenschmidt
2012-08-02 20:20     ` Benjamin Herrenschmidt
2012-08-03 17:41     ` Marcelo Tosatti
2012-08-03 17:41       ` Marcelo Tosatti
2012-08-03 18:05       ` Marcelo Tosatti
2012-08-03 18:05         ` Marcelo Tosatti
2012-08-03 22:32         ` Benjamin Herrenschmidt
2012-08-03 22:32           ` Benjamin Herrenschmidt
2012-08-05  9:00           ` Avi Kivity
2012-08-05  9:00             ` Avi Kivity
2012-08-06 20:25             ` Scott Wood
2012-08-06 20:25               ` Scott Wood
2012-08-07  8:44               ` Avi Kivity
2012-08-07  8:44                 ` Avi Kivity
2012-08-06 20:54             ` Benjamin Herrenschmidt
2012-08-06 20:54               ` Benjamin Herrenschmidt
2012-08-03 22:30       ` Benjamin Herrenschmidt
2012-08-03 22:30         ` Benjamin Herrenschmidt
2012-08-06  3:13         ` David Gibson
2012-08-06  3:13           ` David Gibson
2012-08-06 20:57           ` Benjamin Herrenschmidt
2012-08-06 20:57             ` Benjamin Herrenschmidt
2012-08-07  1:32             ` David Gibson
2012-08-07  1:32               ` David Gibson
2012-08-07  8:46               ` Avi Kivity
2012-08-07  8:46                 ` Avi Kivity
2012-08-07 12:14                 ` David Gibson
2012-08-07 12:14                   ` David Gibson
2012-08-07 13:13                   ` Avi Kivity
2012-08-07 13:13                     ` Avi Kivity
2012-08-07 21:09                     ` Benjamin Herrenschmidt
2012-08-07 21:09                       ` Benjamin Herrenschmidt
2012-08-08  8:52                       ` Avi Kivity
2012-08-08  8:52                         ` Avi Kivity
2012-08-08  9:27                         ` Benjamin Herrenschmidt
2012-08-08  9:27                           ` Benjamin Herrenschmidt
2012-08-08  0:49                     ` David Gibson
2012-08-08  0:49                       ` David Gibson
2012-08-08  8:58                       ` Avi Kivity
2012-08-08  8:58                         ` Avi Kivity
2012-08-08 11:59                         ` David Gibson
2012-08-08 11:59                           ` David Gibson
2012-08-08 12:42                           ` Avi Kivity
2012-08-08 12:42                             ` Avi Kivity

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.