* Question on dirty sync before kvm memslot removal
@ 2020-03-27 15:04 Peter Xu
  2020-03-30 13:11 ` Paolo Bonzini
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Xu @ 2020-03-27 15:04 UTC (permalink / raw)
  To: Paolo Bonzini, QEMU Devel Mailing List; +Cc: Dr. David Alan Gilbert

Hi, Paolo & all,

I noticed something tricky in dirty logging when we want to remove a
kvm memslot: kvm_set_phys_mem() will sync the dirty log before the
slot is removed to make sure dirty bits won't get lost.  IIUC that's
mainly for system resets, because that's where this path triggers most
frequently.  The original commit does mention the system reset too.

That makes perfect sense to me, however... What if the vcpu generates
dirty bits _after_ we do KVM_GET_DIRTY_LOG but _before_ we send
KVM_SET_USER_MEMORY_REGION to remove the memslot?  If the vcpu is in
userspace I think it's fine, because the BQL is needed so it won't be
able to dirty memory; however, vcpus running inside KVM are not
restricted by that.  I think those dirty bits will still get lost, but
it should be extremely hard to trigger.

I'm not sure whether I missed something above, but if I'm correct, I
think the solution should be a flag for KVM_SET_USER_MEMORY_REGION to
set the memslot as invalid (KVM_MEM_INVALID), then when removing the
memslot which has KVM_MEM_LOG_DIRTY_PAGES enabled, we should:

  - send KVM_SET_USER_MEMORY_REGION with KVM_MEM_INVALID to invalidate
    the memslot, but keep the slot and bitmap in KVM

  - send KVM_GET_DIRTY_LOG to fetch the bitmap for the slot

  - send KVM_SET_USER_MEMORY_REGION with size==0 to remove the slot

However I don't know whether that'll be worth it.

(Side question which is irrelevant to this: for the kvm dirty ring we
 now need to do a similar thing to flush dirty bits before removing a
 memslot, however that's even trickier because flushing the dirty ring
 needs to kick all vcpus out.  Currently the RFC series is using
 run_on_cpu(), which will release the BQL and wait for all vcpus to
 quit into userspace; however, that cannot be done inside
 kvm_set_phys_mem because it needs the BQL.  I'm still thinking about
 a good way to fix this, but any idea is greatly welcomed :))

Thanks,

-- 
Peter Xu




* Re: Question on dirty sync before kvm memslot removal
  2020-03-27 15:04 Question on dirty sync before kvm memslot removal Peter Xu
@ 2020-03-30 13:11 ` Paolo Bonzini
  2020-03-31 15:23   ` Peter Xu
  0 siblings, 1 reply; 9+ messages in thread
From: Paolo Bonzini @ 2020-03-30 13:11 UTC (permalink / raw)
  To: Peter Xu, QEMU Devel Mailing List; +Cc: Dr. David Alan Gilbert

On 27/03/20 16:04, Peter Xu wrote:
> That makes perfect sense to me, however... What if the vcpu generates
> dirty bits _after_ we do KVM_GET_DIRTY_LOG but _before_ we send
> KVM_SET_USER_MEMORY_REGION to remove the memslot?  If the vcpu is in
> userspace I think it's fine, because the BQL is needed so it won't be
> able to dirty memory; however, vcpus running inside KVM are not
> restricted by that.  I think those dirty bits will still get lost, but
> it should be extremely hard to trigger.

Yes, you've found a bug.

> I'm not sure whether I missed something above, but if I'm correct, I
> think the solution should be a flag for KVM_SET_USER_MEMORY_REGION to
> set the memslot as invalid (KVM_MEM_INVALID), then when removing the
> memslot which has KVM_MEM_LOG_DIRTY_PAGES enabled, we should:
> 
>   - send KVM_SET_USER_MEMORY_REGION with KVM_MEM_INVALID to invalidate
>     the memslot, but keep the slot and bitmap in KVM
> 
>   - send KVM_GET_DIRTY_LOG to fetch the bitmap for the slot
> 
>   - send KVM_SET_USER_MEMORY_REGION with size==0 to remove the slot

Or KVM_MEM_READONLY.

> However I don't know whether that'll be worth it.

Yes, especially in the light of the dirty ring issue below.

> (Side question which is irrelevant to this: for the kvm dirty ring we
>  now need to do a similar thing to flush dirty bits before removing a
>  memslot, however that's even trickier because flushing the dirty ring
>  needs to kick all vcpus out.  Currently the RFC series is using
>  run_on_cpu(), which will release the BQL and wait for all vcpus to
>  quit into userspace; however, that cannot be done inside
>  kvm_set_phys_mem because it needs the BQL.  I'm still thinking about
>  a good way to fix this, but any idea is greatly welcomed :))

The problem here is also that the GFN is not a unique identifier of the
QEMU ram_addr_t.  However you don't really need to kick all vCPUs out,
do you?  You can protect the dirty ring with its own per-vCPU mutex and
harvest the pages from the main thread.

Paolo




* Re: Question on dirty sync before kvm memslot removal
  2020-03-30 13:11 ` Paolo Bonzini
@ 2020-03-31 15:23   ` Peter Xu
  2020-03-31 15:34     ` Paolo Bonzini
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Xu @ 2020-03-31 15:23 UTC (permalink / raw)
  To: Paolo Bonzini, QEMU Devel Mailing List; +Cc: Dr. David Alan Gilbert

On Mon, Mar 30, 2020 at 03:11:53PM +0200, Paolo Bonzini wrote:
> On 27/03/20 16:04, Peter Xu wrote:
> > That makes perfect sense to me, however... What if the vcpu generates
> > dirty bits _after_ we do KVM_GET_DIRTY_LOG but _before_ we send
> > KVM_SET_USER_MEMORY_REGION to remove the memslot?  If the vcpu is in
> > userspace I think it's fine, because the BQL is needed so it won't be
> > able to dirty memory; however, vcpus running inside KVM are not
> > restricted by that.  I think those dirty bits will still get lost, but
> > it should be extremely hard to trigger.
> 
> Yes, you've found a bug.
> 
> > I'm not sure whether I missed something above, but if I'm correct, I
> > think the solution should be a flag for KVM_SET_USER_MEMORY_REGION to
> > set the memslot as invalid (KVM_MEM_INVALID), then when removing the
> > memslot which has KVM_MEM_LOG_DIRTY_PAGES enabled, we should:
> > 
> >   - send KVM_SET_USER_MEMORY_REGION with KVM_MEM_INVALID to invalidate
> >     the memslot, but keep the slot and bitmap in KVM
> > 
> >   - send KVM_GET_DIRTY_LOG to fetch the bitmap for the slot
> > 
> >   - send KVM_SET_USER_MEMORY_REGION with size==0 to remove the slot
> 
> Or KVM_MEM_READONLY.

Yeah, I used a new flag because I thought READONLY was a bit tricky to
use directly here.  The thing is, IIUC if the guest writes to a
READONLY slot then KVM should either ignore the write or trigger an
error (which I didn't check); however, here what we want is to let the
write fall back to userspace, so it's neither dropped (we still want
the written data to land gracefully in RAM) nor treated as an error
(because the slot is actually writable).

> 
> > However I don't know whether that'll be worth it.
> 
> Yes, especially in the light of the dirty ring issue below.
> 
> > (Side question which is irrelevant to this: for the kvm dirty ring we
> >  now need to do a similar thing to flush dirty bits before removing a
> >  memslot, however that's even trickier because flushing the dirty ring
> >  needs to kick all vcpus out.  Currently the RFC series is using
> >  run_on_cpu(), which will release the BQL and wait for all vcpus to
> >  quit into userspace; however, that cannot be done inside
> >  kvm_set_phys_mem because it needs the BQL.  I'm still thinking about
> >  a good way to fix this, but any idea is greatly welcomed :))
> 
> The problem here is also that the GFN is not a unique identifier of the
> QEMU ram_addr_t.  However you don't really need to kick all vCPUs out,
> do you?  You can protect the dirty ring with its own per-vCPU mutex and
> harvest the pages from the main thread.

I'm not sure I get the point, but just to mention that currently the
dirty GFNs are collected in a standalone thread (in the QEMU series
it's called the reaper thread) rather than in the per vcpu thread
because the KVM_RESET_DIRTY_RINGS is per-vm after all.  One major
reason to kick the vcpus is to make sure the hardware cached dirty
GFNs (i.e. PML) are flushed synchronously.

I think the whole kick operation is indeed too heavy for this with
the run_on_cpu() trick, because the thing we want to know (pml
flushing) is actually per-vcpu and has no BQL interaction.  Do we
have/need a lightweight way to kick one vcpu in a synchronous way?  I
was wondering about something like responding to a "sync kick"
request in the vcpu thread right after KVM_RUN ends (when we don't
have the BQL yet).  Would that make sense?

Thanks,

-- 
Peter Xu




* Re: Question on dirty sync before kvm memslot removal
  2020-03-31 15:23   ` Peter Xu
@ 2020-03-31 15:34     ` Paolo Bonzini
  2020-03-31 16:51       ` Peter Xu
  0 siblings, 1 reply; 9+ messages in thread
From: Paolo Bonzini @ 2020-03-31 15:34 UTC (permalink / raw)
  To: Peter Xu, QEMU Devel Mailing List; +Cc: Dr. David Alan Gilbert

On 31/03/20 17:23, Peter Xu wrote:
>> Or KVM_MEM_READONLY.
> Yeah, I used a new flag because I thought READONLY was a bit tricky to
> use directly here.  The thing is, IIUC if the guest writes to a
> READONLY slot then KVM should either ignore the write or trigger an
> error (which I didn't check); however, here what we want is to let the
> write fall back to userspace, so it's neither dropped (we still want
> the written data to land gracefully in RAM) nor treated as an error
> (because the slot is actually writable).

No, writes fall back to userspace with KVM_MEM_READONLY.

>> The problem here is also that the GFN is not a unique identifier of the
>> QEMU ram_addr_t.  However you don't really need to kick all vCPUs out,
>> do you?  You can protect the dirty ring with its own per-vCPU mutex and
>> harvest the pages from the main thread.
> I'm not sure I get the point, but just to mention that currently the
> dirty GFNs are collected in a standalone thread (in the QEMU series
> it's called the reaper thread) rather than in the per vcpu thread
> because the KVM_RESET_DIRTY_RINGS is per-vm after all.  One major
> reason to kick the vcpus is to make sure the hardware cached dirty
> GFNs (i.e. PML) are flushed synchronously.

But you're referring to KVM kicking vCPUs not qemu_vcpu_kick.  Can you
just do an iteration of reaping after setting KVM_MEM_READONLY?

> I think the whole kick operation is indeed too heavy for this with
> the run_on_cpu() trick, because the thing we want to know (pml
> flushing) is actually per-vcpu and has no BQL interaction.  Do we
> have/need a lightweight way to kick one vcpu in a synchronous way?  I
> was wondering about something like responding to a "sync kick"
> request in the vcpu thread right after KVM_RUN ends (when we don't
> have the BQL yet).  Would that make sense?

Not synchronously, because anything synchronous is very susceptible to
deadlocks.

Thanks,

Paolo




* Re: Question on dirty sync before kvm memslot removal
  2020-03-31 15:34     ` Paolo Bonzini
@ 2020-03-31 16:51       ` Peter Xu
  2020-03-31 23:12         ` Paolo Bonzini
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Xu @ 2020-03-31 16:51 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: QEMU Devel Mailing List, Dr. David Alan Gilbert

On Tue, Mar 31, 2020 at 05:34:43PM +0200, Paolo Bonzini wrote:
> On 31/03/20 17:23, Peter Xu wrote:
> >> Or KVM_MEM_READONLY.
> > Yeah, I used a new flag because I thought READONLY was a bit tricky to
> > use directly here.  The thing is, IIUC if the guest writes to a
> > READONLY slot then KVM should either ignore the write or trigger an
> > error (which I didn't check); however, here what we want is to let the
> > write fall back to userspace, so it's neither dropped (we still want
> > the written data to land gracefully in RAM) nor treated as an error
> > (because the slot is actually writable).
> 
> No, writes fall back to userspace with KVM_MEM_READONLY.

I read that __kvm_write_guest_page() will return -EFAULT when writing
to a read-only memslot, and e.g. kvm_write_guest_virt_helper() will
return with X86EMUL_IO_NEEDED, which will be translated into an
EMULATION_OK in x86_emulate_insn().  Then in x86_emulate_instruction()
it seems to get a "1" returned (note that I think it does not set
either vcpu->arch.pio.count or vcpu->mmio_needed).  Does that mean
it'll retry the write forever instead of exiting to userspace?  I
may possibly have misread somewhere, though..

> 
> >> The problem here is also that the GFN is not a unique identifier of the
> >> QEMU ram_addr_t.  However you don't really need to kick all vCPUs out,
> >> do you?  You can protect the dirty ring with its own per-vCPU mutex and
> >> harvest the pages from the main thread.
> > I'm not sure I get the point, but just to mention that currently the
> > dirty GFNs are collected in a standalone thread (in the QEMU series
> > it's called the reaper thread) rather than in the per vcpu thread
> > because the KVM_RESET_DIRTY_RINGS is per-vm after all.  One major
> > reason to kick the vcpus is to make sure the hardware cached dirty
> > GFNs (i.e. PML) are flushed synchronously.
> 
> But you're referring to KVM kicking vCPUs not qemu_vcpu_kick.  Can you
> just do an iteration of reaping after setting KVM_MEM_READONLY?

Oh I think you're right: deleting & re-adding a memslot is special
here in that we can keep the data in PML untouched, as long as it can
be flushed again to the new bitmap we're going to create.

However... I think I might find another race with this:

          main thread                       vcpu thread
          -----------                       -----------
                                            dirty GFN1, cached in PML
                                            ...
          remove memslot1 of GFN1
            set slot READONLY (whatever, or INVALID)
            sync log (NOTE: no GFN1 yet)
                                            vmexit, flush PML with RCU
                                            (will flush to old bitmap) <------- [1]
            delete memslot1 (old bitmap freed)                         <------- [2]
          add memslot2 of GFN1 (memslot2 could be smaller)
            add memslot2

I'm not 100% sure, but I think GFN1's dirty bit will be lost: it is
correctly applied at [1], but the bitmap holding it is freed right
after at [2].

> 
> > I think the whole kick operation is indeed too heavy for this with
> > the run_on_cpu() trick, because the thing we want to know (pml
> > flushing) is actually per-vcpu and has no BQL interaction.  Do we
> > have/need a lightweight way to kick one vcpu in a synchronous way?  I
> > was wondering about something like responding to a "sync kick"
> > request in the vcpu thread right after KVM_RUN ends (when we don't
> > have the BQL yet).  Would that make sense?
> 
> Not synchronously, because anything synchronous is very susceptible to
> deadlocks.

Yeah it's easy to deadlock (I suffer from it...), but besides the
above case (which I really think is special) I still think we
unluckily need a synchronous way.  For example, the VGA code will
need the latest dirty bit information to decide whether to update the
screen (or it could stall), or the migration code where we need to
calculate the downtime with the current dirty bit information, etc.

-- 
Peter Xu




* Re: Question on dirty sync before kvm memslot removal
  2020-03-31 16:51       ` Peter Xu
@ 2020-03-31 23:12         ` Paolo Bonzini
  2020-04-01 23:09           ` Peter Xu
  0 siblings, 1 reply; 9+ messages in thread
From: Paolo Bonzini @ 2020-03-31 23:12 UTC (permalink / raw)
  To: Peter Xu; +Cc: QEMU Devel Mailing List, Dr. David Alan Gilbert

On 31/03/20 18:51, Peter Xu wrote:
> On Tue, Mar 31, 2020 at 05:34:43PM +0200, Paolo Bonzini wrote:
>> On 31/03/20 17:23, Peter Xu wrote:
>>>> Or KVM_MEM_READONLY.
>>> Yeah, I used a new flag because I thought READONLY was a bit tricky to
>>> use directly here.  The thing is, IIUC if the guest writes to a
>>> READONLY slot then KVM should either ignore the write or trigger an
>>> error (which I didn't check); however, here what we want is to let the
>>> write fall back to userspace, so it's neither dropped (we still want
>>> the written data to land gracefully in RAM) nor treated as an error
>>> (because the slot is actually writable).
>>
>> No, writes fall back to userspace with KVM_MEM_READONLY.
> 
> I read that __kvm_write_guest_page() will return -EFAULT when writing
> to a read-only memslot, and e.g. kvm_write_guest_virt_helper() will
> return with X86EMUL_IO_NEEDED, which will be translated into an
> EMULATION_OK in x86_emulate_insn().  Then in x86_emulate_instruction()
> it seems to get a "1" returned (note that I think it does not set
> either vcpu->arch.pio.count or vcpu->mmio_needed).  Does that mean
> it'll retry the write forever instead of exiting to userspace?  I
> may possibly have misread somewhere, though..

We are definitely relying on KVM_MEM_READONLY to exit to userspace, in
order to emulate flash memory.

> However... I think I might find another race with this:
> 
>           main thread                       vcpu thread
>           -----------                       -----------
>                                             dirty GFN1, cached in PML
>                                             ...
>           remove memslot1 of GFN1
>             set slot READONLY (whatever, or INVALID)
>             sync log (NOTE: no GFN1 yet)
>                                             vmexit, flush PML with RCU
>                                             (will flush to old bitmap) <------- [1]
>             delete memslot1 (old bitmap freed)                         <------- [2]
>           add memslot2 of GFN1 (memslot2 could be smaller)
>             add memslot2
> 
> I'm not 100% sure, but I think GFN1's dirty bit will be lost: it is
> correctly applied at [1], but the bitmap holding it is freed right
> after at [2].

Yes, we probably need to do a mass vCPU kick when a slot is made
READONLY, before KVM_SET_USER_MEMORY_REGION returns (and after releasing
slots_lock).  It makes sense to guarantee that you can't get any more
dirtying after KVM_SET_USER_MEMORY_REGION returns.

Paolo

>>> I think the whole kick operation is indeed too heavy for this with
>>> the run_on_cpu() trick, because the thing we want to know (pml
>>> flushing) is actually per-vcpu and has no BQL interaction.  Do we
>>> have/need a lightweight way to kick one vcpu in a synchronous way?  I
>>> was wondering about something like responding to a "sync kick"
>>> request in the vcpu thread right after KVM_RUN ends (when we don't
>>> have the BQL yet).  Would that make sense?
>>
>> Not synchronously, because anything synchronous is very susceptible to
>> deadlocks.
> 
> Yeah it's easy to deadlock (I suffer from it...), but besides the
> above case (which I really think is special) I still think we
> unluckily need a synchronous way.  For example, the VGA code will
> need the latest dirty bit information to decide whether to update the
> screen (or it could stall), or the migration code where we need to
> calculate the downtime with the current dirty bit information, etc.
> 




* Re: Question on dirty sync before kvm memslot removal
  2020-03-31 23:12         ` Paolo Bonzini
@ 2020-04-01 23:09           ` Peter Xu
  2020-04-02 20:47             ` Peter Xu
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Xu @ 2020-04-01 23:09 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: QEMU Devel Mailing List, Dr. David Alan Gilbert

On Wed, Apr 01, 2020 at 01:12:04AM +0200, Paolo Bonzini wrote:
> On 31/03/20 18:51, Peter Xu wrote:
> > On Tue, Mar 31, 2020 at 05:34:43PM +0200, Paolo Bonzini wrote:
> >> On 31/03/20 17:23, Peter Xu wrote:
> >>>> Or KVM_MEM_READONLY.
> >>> Yeah, I used a new flag because I thought READONLY was a bit tricky to
> >>> use directly here.  The thing is, IIUC if the guest writes to a
> >>> READONLY slot then KVM should either ignore the write or trigger an
> >>> error (which I didn't check); however, here what we want is to let the
> >>> write fall back to userspace, so it's neither dropped (we still want
> >>> the written data to land gracefully in RAM) nor treated as an error
> >>> (because the slot is actually writable).
> >>
> >> No, writes fall back to userspace with KVM_MEM_READONLY.
> > 
> > I read that __kvm_write_guest_page() will return -EFAULT when writing
> > to a read-only memslot, and e.g. kvm_write_guest_virt_helper() will
> > return with X86EMUL_IO_NEEDED, which will be translated into an
> > EMULATION_OK in x86_emulate_insn().  Then in x86_emulate_instruction()
> > it seems to get a "1" returned (note that I think it does not set
> > either vcpu->arch.pio.count or vcpu->mmio_needed).  Does that mean
> > it'll retry the write forever instead of exiting to userspace?  I
> > may possibly have misread somewhere, though..
> 
> We are definitely relying on KVM_MEM_READONLY to exit to userspace, in
> order to emulate flash memory.
> 
> > However... I think I might find another race with this:
> > 
> >           main thread                       vcpu thread
> >           -----------                       -----------
> >                                             dirty GFN1, cached in PML
> >                                             ...
> >           remove memslot1 of GFN1
> >             set slot READONLY (whatever, or INVALID)
> >             sync log (NOTE: no GFN1 yet)
> >                                             vmexit, flush PML with RCU
> >                                             (will flush to old bitmap) <------- [1]
> >             delete memslot1 (old bitmap freed)                         <------- [2]
> >           add memslot2 of GFN1 (memslot2 could be smaller)
> >             add memslot2
> > 
> > I'm not 100% sure, but I think GFN1's dirty bit will be lost: it is
> > correctly applied at [1], but the bitmap holding it is freed right
> > after at [2].
> 
> Yes, we probably need to do a mass vCPU kick when a slot is made
> READONLY, before KVM_SET_USER_MEMORY_REGION returns (and after releasing
> slots_lock).  It makes sense to guarantee that you can't get any more
> dirtying after KVM_SET_USER_MEMORY_REGION returns.

Sounds doable.  Though we still need a synchronous way to kick vcpus
in KVM to make sure the PML is flushed before KVM_SET_MEMORY_REGION
returns, am I right?  Is there an existing good way to do this?

Thanks,

-- 
Peter Xu




* Re: Question on dirty sync before kvm memslot removal
  2020-04-01 23:09           ` Peter Xu
@ 2020-04-02 20:47             ` Peter Xu
  2020-04-02 22:32               ` Peter Xu
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Xu @ 2020-04-02 20:47 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: QEMU Devel Mailing List, Dr. David Alan Gilbert

On Wed, Apr 01, 2020 at 07:09:28PM -0400, Peter Xu wrote:
> On Wed, Apr 01, 2020 at 01:12:04AM +0200, Paolo Bonzini wrote:
> > On 31/03/20 18:51, Peter Xu wrote:
> > > On Tue, Mar 31, 2020 at 05:34:43PM +0200, Paolo Bonzini wrote:
> > >> On 31/03/20 17:23, Peter Xu wrote:
> > >>>> Or KVM_MEM_READONLY.
> > >>> Yeah, I used a new flag because I thought READONLY was a bit tricky to
> > >>> use directly here.  The thing is, IIUC if the guest writes to a
> > >>> READONLY slot then KVM should either ignore the write or trigger an
> > >>> error (which I didn't check); however, here what we want is to let the
> > >>> write fall back to userspace, so it's neither dropped (we still want
> > >>> the written data to land gracefully in RAM) nor treated as an error
> > >>> (because the slot is actually writable).
> > >>
> > >> No, writes fall back to userspace with KVM_MEM_READONLY.
> > > 
> > > I read that __kvm_write_guest_page() will return -EFAULT when writing
> > > to a read-only memslot, and e.g. kvm_write_guest_virt_helper() will
> > > return with X86EMUL_IO_NEEDED, which will be translated into an
> > > EMULATION_OK in x86_emulate_insn().  Then in x86_emulate_instruction()
> > > it seems to get a "1" returned (note that I think it does not set
> > > either vcpu->arch.pio.count or vcpu->mmio_needed).  Does that mean
> > > it'll retry the write forever instead of exiting to userspace?  I
> > > may possibly have misread somewhere, though..
> > 
> > We are definitely relying on KVM_MEM_READONLY to exit to userspace, in
> > order to emulate flash memory.
> > 
> > > However... I think I might find another race with this:
> > > 
> > >           main thread                       vcpu thread
> > >           -----------                       -----------
> > >                                             dirty GFN1, cached in PML
> > >                                             ...
> > >           remove memslot1 of GFN1
> > >             set slot READONLY (whatever, or INVALID)
> > >             sync log (NOTE: no GFN1 yet)
> > >                                             vmexit, flush PML with RCU
> > >                                             (will flush to old bitmap) <------- [1]
> > >             delete memslot1 (old bitmap freed)                         <------- [2]
> > >           add memslot2 of GFN1 (memslot2 could be smaller)
> > >             add memslot2
> > > 
> > > I'm not 100% sure, but I think GFN1's dirty bit will be lost: it is
> > > correctly applied at [1], but the bitmap holding it is freed right
> > > after at [2].
> > 
> > Yes, we probably need to do a mass vCPU kick when a slot is made
> > READONLY, before KVM_SET_USER_MEMORY_REGION returns (and after releasing
> > slots_lock).  It makes sense to guarantee that you can't get any more
> > dirtying after KVM_SET_USER_MEMORY_REGION returns.
> 
> Sounds doable.  Though we still need a synchronous way to kick vcpus
> in KVM to make sure the PML is flushed before KVM_SET_MEMORY_REGION
> returns, am I right?  Is there an existing good way to do this?

Paolo,

I'm not sure whether it's anything close to acceptable, but this is
something I was thinking about below (pseudo code).  Do you think it
makes any sense?  Thanks,

8<-------------------------------------------
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1b6d9ac9533c..437d669dde42 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8161,6 +8161,22 @@ void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(__kvm_request_immediate_exit);

+void kvm_vcpu_sync(struct kvm_vcpu *vcpu)
+{
+       DECLARE_WAITQUEUE(wait, current);
+
+       add_wait_queue(&vcpu->sync_wq, &wait);
+       set_current_state(TASK_UNINTERRUPTIBLE);
+       kvm_make_request(KVM_REQ_SYNC_VCPU, vcpu);
+       schedule();
+       remove_wait_queue(&vcpu->sync_wq, &wait);
+}
+
+void kvm_vcpu_sync_ack(struct kvm_vcpu *vcpu)
+{
+       wake_up(&vcpu->sync_wq);
+}
+
 /*
  * Returns 1 to let vcpu_run() continue the guest execution loop without
  * exiting to the userspace.  Otherwise, the value will be returned to the
@@ -8274,6 +8290,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
                        kvm_hv_process_stimers(vcpu);
                if (kvm_check_request(KVM_REQ_APICV_UPDATE, vcpu))
                        kvm_vcpu_update_apicv(vcpu);
+               if (kvm_check_request(KVM_REQ_SYNC_VCPU, vcpu))
+                       kvm_vcpu_sync_ack(vcpu);
        }

        if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f6a1905da9bf..e825d2e0a221 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -146,6 +146,7 @@ static inline bool is_error_page(struct page *page)
 #define KVM_REQ_MMU_RELOAD        (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_PENDING_TIMER     2
 #define KVM_REQ_UNHALT            3
+#define KVM_REQ_SYNC_VCPU         4
 #define KVM_REQUEST_ARCH_BASE     8

 #define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
@@ -278,6 +279,7 @@ struct kvm_vcpu {
        struct kvm_run *run;

        struct swait_queue_head wq;
+       struct wait_queue_head sync_wq;
        struct pid __rcu *pid;
        int sigset_active;
        sigset_t sigset;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f744bc603c53..35216aeb0365 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -342,6 +342,7 @@ static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
        vcpu->vcpu_id = id;
        vcpu->pid = NULL;
        init_swait_queue_head(&vcpu->wq);
+       init_waitqueue_head(&vcpu->sync_wq);
        kvm_async_pf_vcpu_init(vcpu);

        vcpu->pre_pcpu = -1;
@@ -1316,9 +1317,20 @@ int kvm_set_memory_region(struct kvm *kvm,
                          const struct kvm_userspace_memory_region *mem)
 {
        int r;
+       unsigned int i;
+       struct kvm_vcpu *vcpu;

        mutex_lock(&kvm->slots_lock);
+
        r = __kvm_set_memory_region(kvm, mem);
+
+       /* TBD: use arch-specific hooks; this won't compile on non-x86 */
+       if ((mem->flags & KVM_MEM_LOG_DIRTY_PAGES) &&
+           (mem->flags & KVM_MEM_READONLY)) {
+               kvm_for_each_vcpu(i, vcpu, kvm)
+                   kvm_vcpu_sync(vcpu);
+       }
+
        mutex_unlock(&kvm->slots_lock);
        return r;
 }
@@ -2658,6 +2670,8 @@ static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
                goto out;
        if (signal_pending(current))
                goto out;
+       if (kvm_check_request(KVM_REQ_SYNC_VCPU, vcpu))
+               goto out;

        ret = 0;
 out:
8<-------------------------------------------

-- 
Peter Xu




* Re: Question on dirty sync before kvm memslot removal
  2020-04-02 20:47             ` Peter Xu
@ 2020-04-02 22:32               ` Peter Xu
  0 siblings, 0 replies; 9+ messages in thread
From: Peter Xu @ 2020-04-02 22:32 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: QEMU Devel Mailing List, Dr. David Alan Gilbert

On Thu, Apr 02, 2020 at 04:47:58PM -0400, Peter Xu wrote:
> On Wed, Apr 01, 2020 at 07:09:28PM -0400, Peter Xu wrote:
> > On Wed, Apr 01, 2020 at 01:12:04AM +0200, Paolo Bonzini wrote:
> > > On 31/03/20 18:51, Peter Xu wrote:
> > > > On Tue, Mar 31, 2020 at 05:34:43PM +0200, Paolo Bonzini wrote:
> > > >> On 31/03/20 17:23, Peter Xu wrote:
> > > >>>> Or KVM_MEM_READONLY.
> > > >>> Yeah, I used a new flag because I thought READONLY was a bit tricky to
> > > >>> use directly here.  The thing is, IIUC if the guest writes to a
> > > >>> READONLY slot then KVM should either ignore the write or trigger an
> > > >>> error (which I didn't check); however, here what we want is to let the
> > > >>> write fall back to userspace, so it's neither dropped (we still want
> > > >>> the written data to land gracefully in RAM) nor treated as an error
> > > >>> (because the slot is actually writable).
> > > >>
> > > >> No, writes fall back to userspace with KVM_MEM_READONLY.
> > > > 
> > > > I read that __kvm_write_guest_page() will return -EFAULT when writing
> > > > to a read-only memslot, and e.g. kvm_write_guest_virt_helper() will
> > > > return with X86EMUL_IO_NEEDED, which will be translated into an
> > > > EMULATION_OK in x86_emulate_insn().  Then in x86_emulate_instruction()
> > > > it seems to get a "1" returned (note that I think it does not set
> > > > either vcpu->arch.pio.count or vcpu->mmio_needed).  Does that mean
> > > > it'll retry the write forever instead of exiting to userspace?  I
> > > > may possibly have misread somewhere, though..
> > > 
> > > We are definitely relying on KVM_MEM_READONLY to exit to userspace, in
> > > order to emulate flash memory.
> > > 
> > > > However... I think I might find another race with this:
> > > > 
> > > >           main thread                       vcpu thread
> > > >           -----------                       -----------
> > > >                                             dirty GFN1, cached in PML
> > > >                                             ...
> > > >           remove memslot1 of GFN1
> > > >             set slot READONLY (whatever, or INVALID)
> > > >             sync log (NOTE: no GFN1 yet)
> > > >                                             vmexit, flush PML with RCU
> > > >                                             (will flush to old bitmap) <------- [1]
> > > >             delete memslot1 (old bitmap freed)                         <------- [2]
> > > >           add memslot2 of GFN1 (memslot2 could be smaller)
> > > >             add memslot2
> > > > 
> > > > I'm not 100% sure, but I think GFN1's dirty bit will be lost: it is
> > > > correctly set at [1], but the bitmap holding it is freed at [2].
> > > 
> > > Yes, we probably need to do a mass vCPU kick when a slot is made
> > > READONLY, before KVM_SET_USER_MEMORY_REGION returns (and after releasing
> > > slots_lock).  It makes sense to guarantee that you can't get any more
> > > dirtying after KVM_SET_USER_MEMORY_REGION returns.
> > 
> > Sounds doable.  Though we still need a synchronous way to kick vcpus
> > in KVM to make sure the PML is flushed before KVM_SET_USER_MEMORY_REGION
> > returns, am I right?  Is there an existing good way to do this?
> 
> Paolo,
> 
> I'm not sure whether it's anything close to acceptable, but this is
> something I was thinking about below (pseudo code).  Do you think it
> makes any sense?  Thanks,
> 
> 8<-------------------------------------------
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1b6d9ac9533c..437d669dde42 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8161,6 +8161,22 @@ void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu)
>  }
>  EXPORT_SYMBOL_GPL(__kvm_request_immediate_exit);
> 
> +void kvm_vcpu_sync(struct kvm_vcpu *vcpu)
> +{
> +       DECLARE_WAITQUEUE(wait, current);
> +
> +       add_wait_queue(&vcpu->sync_wq, &wait);
> +       set_current_state(TASK_UNINTERRUPTIBLE);
> +       kvm_make_request(KVM_REQ_SYNC_VCPU, vcpu);
> +       schedule();
> +       remove_wait_queue(&vcpu->sync_wq, &wait);
> +}
> +
> +void kvm_vcpu_sync_ack(struct kvm_vcpu *vcpu)
> +{
> +       wake_up(&vcpu->sync_wq);
> +}
> +
>  /*
>   * Returns 1 to let vcpu_run() continue the guest execution loop without
>   * exiting to the userspace.  Otherwise, the value will be returned to the
> @@ -8274,6 +8290,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>                         kvm_hv_process_stimers(vcpu);
>                 if (kvm_check_request(KVM_REQ_APICV_UPDATE, vcpu))
>                         kvm_vcpu_update_apicv(vcpu);
> +               if (kvm_check_request(KVM_REQ_SYNC_VCPU, vcpu))
> +                       kvm_vcpu_sync_ack(vcpu);
>         }
> 
>         if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index f6a1905da9bf..e825d2e0a221 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -146,6 +146,7 @@ static inline bool is_error_page(struct page *page)
>  #define KVM_REQ_MMU_RELOAD        (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>  #define KVM_REQ_PENDING_TIMER     2
>  #define KVM_REQ_UNHALT            3
> +#define KVM_REQ_SYNC_VCPU         4
>  #define KVM_REQUEST_ARCH_BASE     8
> 
>  #define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
> @@ -278,6 +279,7 @@ struct kvm_vcpu {
>         struct kvm_run *run;
> 
>         struct swait_queue_head wq;
> +       struct wait_queue_head sync_wq;
>         struct pid __rcu *pid;
>         int sigset_active;
>         sigset_t sigset;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index f744bc603c53..35216aeb0365 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -342,6 +342,7 @@ static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
>         vcpu->vcpu_id = id;
>         vcpu->pid = NULL;
>         init_swait_queue_head(&vcpu->wq);
> +       init_waitqueue_head(&vcpu->sync_wq);
>         kvm_async_pf_vcpu_init(vcpu);
> 
>         vcpu->pre_pcpu = -1;
> @@ -1316,9 +1317,20 @@ int kvm_set_memory_region(struct kvm *kvm,
>                           const struct kvm_userspace_memory_region *mem)
>  {
>         int r;
> +       unsigned int i;
> +       struct kvm_vcpu *vcpu;
> 
>         mutex_lock(&kvm->slots_lock);
> +
>         r = __kvm_set_memory_region(kvm, mem);
> +
> +       /* TBD: use arch-specific hooks; this won't compile on non-x86 */
> +       if ((mem->flags & KVM_MEM_LOG_DIRTY_PAGES) &&
> +           (mem->flags & KVM_MEM_READONLY)) {
> +               kvm_for_each_vcpu(i, vcpu, kvm)
> +                   kvm_vcpu_sync(vcpu);
> +       }
> +

Oops, this block should definitely be after the unlock as you
suggested...

>         mutex_unlock(&kvm->slots_lock);
>         return r;
>  }
> @@ -2658,6 +2670,8 @@ static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
>                 goto out;
>         if (signal_pending(current))
>                 goto out;
> +       if (kvm_check_request(KVM_REQ_SYNC_VCPU, vcpu))
> +               goto out;
> 
>         ret = 0;
>  out:
> 8<-------------------------------------------
> 
> -- 
> Peter Xu

-- 
Peter Xu




Thread overview: 9+ messages
-- links below jump to the message on this page --
2020-03-27 15:04 Question on dirty sync before kvm memslot removal Peter Xu
2020-03-30 13:11 ` Paolo Bonzini
2020-03-31 15:23   ` Peter Xu
2020-03-31 15:34     ` Paolo Bonzini
2020-03-31 16:51       ` Peter Xu
2020-03-31 23:12         ` Paolo Bonzini
2020-04-01 23:09           ` Peter Xu
2020-04-02 20:47             ` Peter Xu
2020-04-02 22:32               ` Peter Xu
