kvmarm.lists.cs.columbia.edu archive mirror
 help / color / mirror / Atom feed
* Reference count on pages held in secondary MMUs
@ 2019-06-09  8:18 Christoffer Dall
  2019-06-09  9:37 ` Paolo Bonzini
  0 siblings, 1 reply; 8+ messages in thread
From: Christoffer Dall @ 2019-06-09  8:18 UTC (permalink / raw)
  To: kvm; +Cc: Paolo Bonzini, kvmarm

Hi,

I have been looking at how we deal with page_count(page) on pages held
in stage 2 page tables on KVM/arm64.

What we do currently is to drop the reference on the page we get from
get_user_pages() once the page is inserted into our stage 2 page table,
typically leaving page_count(page) == page_mapcount(page) == 1 which
represents the userspace stage 1 mapping of the page,
and we rely on MMU notifiers to remove the stage 2 mapping if
corresponding stage 1 mapping is being unmapped.

I believe this is analogous to what other architectures do?

In some sense, we are thus maintaining a 'hidden', or internal,
reference to the page, which is not counted anywhere.

I am wondering if it would be equally valid to take a reference on the
page, and remove that reference when unmapping via MMU notifiers, and if
so, if there would be any advantages/drawbacks in doing so?


Thanks,

    Christoffer
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Reference count on pages held in secondary MMUs
  2019-06-09  8:18 Reference count on pages held in secondary MMUs Christoffer Dall
@ 2019-06-09  9:37 ` Paolo Bonzini
  2019-06-09 17:40   ` Andrea Arcangeli
  2019-06-11 11:21   ` Christoffer Dall
  0 siblings, 2 replies; 8+ messages in thread
From: Paolo Bonzini @ 2019-06-09  9:37 UTC (permalink / raw)
  To: Christoffer Dall, kvm; +Cc: kvmarm

On 09/06/19 10:18, Christoffer Dall wrote:
> In some sense, we are thus maintaining a 'hidden', or internal,
> reference to the page, which is not counted anywhere.
> 
> I am wondering if it would be equally valid to take a reference on the
> page, and remove that reference when unmapping via MMU notifiers, and if
> so, if there would be any advantages/drawbacks in doing so?

If I understand correctly, I think the MMU notifier would not fire if
you took an actual reference; the page would be pinned in memory and
could not be swapped out.

Paolo
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Reference count on pages held in secondary MMUs
  2019-06-09  9:37 ` Paolo Bonzini
@ 2019-06-09 17:40   ` Andrea Arcangeli
  2019-06-11 11:51     ` Christoffer Dall
  2019-06-11 11:21   ` Christoffer Dall
  1 sibling, 1 reply; 8+ messages in thread
From: Andrea Arcangeli @ 2019-06-09 17:40 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Jerome Glisse, kvm, kvmarm

Hello,

On Sun, Jun 09, 2019 at 11:37:19AM +0200, Paolo Bonzini wrote:
> On 09/06/19 10:18, Christoffer Dall wrote:
> > In some sense, we are thus maintaining a 'hidden', or internal,
> > reference to the page, which is not counted anywhere.
> > 
> > I am wondering if it would be equally valid to take a reference on the
> > page, and remove that reference when unmapping via MMU notifiers, and if
> > so, if there would be any advantages/drawbacks in doing so?
> 
> If I understand correctly, I think the MMU notifier would not fire if
> you took an actual reference; the page would be pinned in memory and
> could not be swapped out.

MMU notifiers still fires, the refcount is simple and can be dropped
also in the mmu notifier invalidate and in fact Jerome also thinks
like me that we should eventually optimize away the FOLL_GET and not
take the refcount in the first place, but a whole different chapter is
dedicated on the set_page_dirty_lock crash on MAP_SHARED mappings
after long term GUP pins. So since you're looking into how to handle
the page struct in the MMU notifier it's worth mentioning the issues
related to set_page_dirty too.

To achieve the cleanest writeback fix to avoid crashes in
set_page_dirty_lock on long term secondary MMU mappings that supports
MMU notifier like KVM shadow MMU, the ideal is to mark the page dirty
before establishing a writable the mapping in the secondary MMU like
in the model below.

The below solution works also for those secondary MMU that are like a
TLB and if there are two concurrent invalidates on the same page
invoked at the same time (a potential problem Jerome noticed), you
don't know which come out first and you would risk to call
set_page_dirty twice, which would be still potentially kernel crashing
(even if only a theoretical issue like O_DIRECT). So the below model
will solve that and it's also valid for KVM/vhost accelleration,
despite KVM can figure out how to issue a single set_page_dirty call
for each spte that gets invalidated by concurrent invalidates on the
same page because it has shadow pagetables and it's not just a TLB.

  access = FOLL_WRITE|FOLL_GET

repeat:
  page = gup(access)
  put_page(page)

  spin_lock(mmu_notifier_lock);
  if (race with invalidate) {
    spin_unlock..
    goto repeat;
  }
  if (access == FOLL_WRITE)
    set_page_dirty(page)
  establish writable mapping in secondary MMU on page
  spin_unlock

The above solves the crash in set_page_dirty_lock without having to
modify any filesystem, it should work theoretically safer than the
O_DIRECT short term GUP pin.

With regard to KVM this should be enough, but we also look for a crash
avoidance solution for those devices that cannot support the MMU
notifier for short and long term GUP pins.

There's lots of work going on on linux-mm, to try to let those devices
support writeback in a safe way (also with stable pages so all fs
integrity checks will pass) using bounce buffer if a long term GUP pin
is detected by the filesystem. In addition there's other work to make
the short term GUP pin theoretically safe by delaying the writeback
for the short window the GUP pin is taken by O_DIRECT, so it becomes
theoretically safe too (currently it's only practically safe).

However I'm not sure if the long term GUP pins really needs to support
writeback.

To do a coherent snapshot without talking to the device so that it
stops writing to the whole mapping, one should write protect the
memory, but it can't be write protected without MMU notifier
support.. The VM already wrprotect the MAP_SHARED pages before writing
them out to provide stable pages, but that's just not going to work
with a long term GUP pin that mapped the page as writable in the
device.

To make a practical example: if the memory under long term GUP pin
would be a KVM guest physical memory mapped in MAP_SHARED with
vfio/iommu device assignment and if the device or iommu can't support
the MMU notifier, there's no value in the filesystem being able to
flush the guest physical memory to disk periodically, because if the
write-able GUP pin can't be dropped first, it's impossible to take a
coherent snapshot of the guest physical memory while the device can
still write anywhere in the KVM guest physical memory. Whatever gets
written by the complex logic that will do the bounce buffers would be
just useless for this use case. If the system crashed, you couldn't
possibly start a new guest and pretend that whatever got written was a
coherent snapshot of the guest physical memory. To take a coherent
snapshot KVM needs to tell the device to stop writing, and only then
flush the dirty data in the MAP_SHARED to disk, just calling mprotect
won't be enough because that won't get rid of the device writable
mapping associated with the long term GUP pin. But if it has to do
that, it can also tell the device to temporarily drop the iommu
mapping, drop all the GUP pins and to re-take the GUP pins and remap
the guest in the iommu only after the data already hit the disk, to
mark all pages dirty again. So I suppose lots of other use cases would
work like this.

If the data written by the device through the long term GUP pin,
doesn't need to be coherent (perhaps if the data is structured like a
rotating debug logfile) or if it's coherent down to the smallest size
the device can write with its DMA, then there would be some value in
being able to flush the data without stopping the device from writing
to it.

Overall it might be just enough to keep things simpler in the kernel
filesystems and define that long term GUP pins without MMU notifier,
don't ever get written to disk until the GUP pin is dropped, so all
GUP pin will work the same and the O_DIRECT solution can be applied to
long term GUP pins too. Any device requiring a long term GUP pin to
operate, requires some special setup with root capability so among the
other special things it does, its userland could also orchestrate
periodic unpinning, to flush the MAP_SHARED dirty data to disk, if the
data written by the device through the long term GUP pin cannot be
lost after a power loss or a kernel crash.

Either that or it'd be interesting to know exactly what are the uses
cases that requires long term GUP pinned MAP_SHARED pages to remain
writeback capable, to justify the additional kernel complexity that
such filesystem solution requires. Currently those same use cases would
tend to be kernel crashing, so I suppose they're not very common use
cases to begin with.

Thanks,
Andrea
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Reference count on pages held in secondary MMUs
  2019-06-09  9:37 ` Paolo Bonzini
  2019-06-09 17:40   ` Andrea Arcangeli
@ 2019-06-11 11:21   ` Christoffer Dall
  2019-06-11 11:29     ` Paolo Bonzini
  1 sibling, 1 reply; 8+ messages in thread
From: Christoffer Dall @ 2019-06-11 11:21 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvmarm, kvm

On Sun, Jun 09, 2019 at 11:37:19AM +0200, Paolo Bonzini wrote:
> On 09/06/19 10:18, Christoffer Dall wrote:
> > In some sense, we are thus maintaining a 'hidden', or internal,
> > reference to the page, which is not counted anywhere.
> > 
> > I am wondering if it would be equally valid to take a reference on the
> > page, and remove that reference when unmapping via MMU notifiers, and if
> > so, if there would be any advantages/drawbacks in doing so?
> 
> If I understand correctly, I think the MMU notifier would not fire if
> you took an actual reference; the page would be pinned in memory and
> could not be swapped out.
> 

That was my understanding too, but I can't find the code path that would
support this theory.

The closest thing I could find was is_page_cache_freeable(), and as far
as I'm able to understand that code, that is called (via pageout()) later in
shrink_page_list() than try_to_unmap() which fires the MMU notifiers
through the rmap code.

It is entirely possible that I'm looking at the wrong place and missing
something overall though?


Thanks,

    Christoffer
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Reference count on pages held in secondary MMUs
  2019-06-11 11:21   ` Christoffer Dall
@ 2019-06-11 11:29     ` Paolo Bonzini
  0 siblings, 0 replies; 8+ messages in thread
From: Paolo Bonzini @ 2019-06-11 11:29 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: kvmarm, kvm

On 11/06/19 13:21, Christoffer Dall wrote:
>> If I understand correctly, I think the MMU notifier would not fire if
>> you took an actual reference; the page would be pinned in memory and
>> could not be swapped out.
>>
> That was my understanding too, but I can't find the code path that would
> support this theory.

Yeah, as Andrea said you could drop the reference in the invalidate
callback too.

Paolo
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Reference count on pages held in secondary MMUs
  2019-06-09 17:40   ` Andrea Arcangeli
@ 2019-06-11 11:51     ` Christoffer Dall
  2019-06-22 19:11       ` Andrea Arcangeli
  0 siblings, 1 reply; 8+ messages in thread
From: Christoffer Dall @ 2019-06-11 11:51 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Paolo Bonzini, Jerome Glisse, kvmarm, kvm

On Sun, Jun 09, 2019 at 01:40:24PM -0400, Andrea Arcangeli wrote:
> Hello,
> 
> On Sun, Jun 09, 2019 at 11:37:19AM +0200, Paolo Bonzini wrote:
> > On 09/06/19 10:18, Christoffer Dall wrote:
> > > In some sense, we are thus maintaining a 'hidden', or internal,
> > > reference to the page, which is not counted anywhere.
> > > 
> > > I am wondering if it would be equally valid to take a reference on the
> > > page, and remove that reference when unmapping via MMU notifiers, and if
> > > so, if there would be any advantages/drawbacks in doing so?
> > 
> > If I understand correctly, I think the MMU notifier would not fire if
> > you took an actual reference; the page would be pinned in memory and
> > could not be swapped out.
> 
> MMU notifiers still fires, the refcount is simple and can be dropped
> also in the mmu notifier invalidate 

Sorry, what does this mean?  Do you mean that we can either do:

  on_vm_s2_fault() {
      page = gup();
      map_page_in_s2_mmu();
      put_page();
  }

  mmu_notifier_invalidate() {
      unmap_page_in_s2_mmu();
  }

or

  on_vm_s2_fault() {
      page = gup();
      map_page_in_s2_mmu();
  }

  mmu_notifier_invalidate() {
      unmap_page_in_s2_mmu();
      put_page();
  }


> and in fact Jerome also thinks
> like me that we should eventually optimize away the FOLL_GET and not
> take the refcount in the first place, 

So if I understood the above correct, the next point is that there are
advantages to avoiding keeping the extra reference on that page, because
we have problematic race conditions related to set_page_dirty(), and we
can reduce the problem of race conditions further by not getting a
reference on the page at all when going GUP as part of a KVM fault?

Can you explain, or provide a pointer to, the root cause of the
problem with holding a reference on the page and setting it dirty?

> but a whole different chapter is
> dedicated on the set_page_dirty_lock crash on MAP_SHARED mappings
> after long term GUP pins. So since you're looking into how to handle
> the page struct in the MMU notifier it's worth mentioning the issues
> related to set_page_dirty too.

Is there some background info on the "set_page_dirty_lock crash on
MAP_SHARED" ?  I'm having trouble following this without the background.

> 
> To achieve the cleanest writeback fix to avoid crashes in
> set_page_dirty_lock on long term secondary MMU mappings that supports
> MMU notifier like KVM shadow MMU, the ideal is to mark the page dirty
> before establishing a writable the mapping in the secondary MMU like
> in the model below.
> 
> The below solution works also for those secondary MMU that are like a
> TLB and if there are two concurrent invalidates on the same page
> invoked at the same time (a potential problem Jerome noticed), you
> don't know which come out first and you would risk to call
> set_page_dirty twice, which would be still potentially kernel crashing
> (even if only a theoretical issue like O_DIRECT).

Why is it problematic to call set_page_dirty() twice?  I thought that at
worst it would only lead to writing out data to disk unnecessarily ?

I am also not familiar with a problem related to KVM and O_DIRECT, so
I'm having trouble keeping up here as well :(


> So the below model
> will solve that and it's also valid for KVM/vhost accelleration,
> despite KVM can figure out how to issue a single set_page_dirty call
> for each spte that gets invalidated by concurrent invalidates on the
> same page because it has shadow pagetables and it's not just a TLB.
> 
>   access = FOLL_WRITE|FOLL_GET
> 
> repeat:
>   page = gup(access)
>   put_page(page)
> 
>   spin_lock(mmu_notifier_lock);
>   if (race with invalidate) {
>     spin_unlock..
>     goto repeat;
>   }
>   if (access == FOLL_WRITE)
>     set_page_dirty(page)
>   establish writable mapping in secondary MMU on page
>   spin_unlock
> 
> The above solves the crash in set_page_dirty_lock without having to
> modify any filesystem, it should work theoretically safer than the
> O_DIRECT short term GUP pin.

That is not exactly how we do things today on the arm64 side.  We do
something that looks like:

  /*
   * user_mem_abort is our function for a secondary MMU fault that
   * resolves to a memslot.
   */
  user_mem_abort() {
      page = gup(access, &writable);
      spin_lock(&kvm->mmu_lock);
      if (mmu_notifier_retry(kvm, mmu_seq))
          goto out; /* run the VM again and see what happens */

      if (writable)
          kvm_set_pfn_dirty(page_to_pfn(page));
      stage2_set_pte(); /* establish_writable mapping in secondary MMU on page */

  out:
      spin_unlock(&kvm_mmu_lock);
      put_page(page);
  }

Should we rework this to address the race you are refering to, and are
other architectures already safe against this?

> 
> With regard to KVM this should be enough, but we also look for a crash
> avoidance solution for those devices that cannot support the MMU
> notifier for short and long term GUP pins.

Sorry, can you define short and long term GUP pins, and do we have
current examples of both?


> 
> There's lots of work going on on linux-mm, to try to let those devices
> support writeback in a safe way (also with stable pages so all fs
> integrity checks will pass) using bounce buffer if a long term GUP pin
> is detected by the filesystem. In addition there's other work to make
> the short term GUP pin theoretically safe by delaying the writeback
> for the short window the GUP pin is taken by O_DIRECT, so it becomes
> theoretically safe too (currently it's only practically safe).
> 
> However I'm not sure if the long term GUP pins really needs to support
> writeback.
> 

I don't think I understand the distinction between a long term GUP pin
that supports writeback vs. a short term GUP pin.  My question was
whether or not the pin could be dropped at the time the mapping was torn
down, or if it has to be done at the same time the mapping is
established, for things to work properly wrt. the semantics of memory
behavior of the rest of the kernel.

I'm not sure if we're talking about the same thing here, or if you're
explaining a different scenario?


Thanks,

    Christoffer
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Reference count on pages held in secondary MMUs
  2019-06-11 11:51     ` Christoffer Dall
@ 2019-06-22 19:11       ` Andrea Arcangeli
  2019-06-26 12:16         ` Christoffer Dall
  0 siblings, 1 reply; 8+ messages in thread
From: Andrea Arcangeli @ 2019-06-22 19:11 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: Paolo Bonzini, Jerome Glisse, kvmarm, kvm

Hello Christoffer,

On Tue, Jun 11, 2019 at 01:51:32PM +0200, Christoffer Dall wrote:
> Sorry, what does this mean?  Do you mean that we can either do:
> 
>   on_vm_s2_fault() {
>       page = gup();
>       map_page_in_s2_mmu();
>       put_page();
>   }
> 
>   mmu_notifier_invalidate() {
>       unmap_page_in_s2_mmu();
>   }
> 
> or
> 
>   on_vm_s2_fault() {
>       page = gup();
>       map_page_in_s2_mmu();
>   }
> 
>   mmu_notifier_invalidate() {
>       unmap_page_in_s2_mmu();
>       put_page();
>   }

Yes both work, refcounting always works.

> > and in fact Jerome also thinks
> > like me that we should eventually optimize away the FOLL_GET and not
> > take the refcount in the first place, 
> 
> So if I understood the above correct, the next point is that there are
> advantages to avoiding keeping the extra reference on that page, because
> we have problematic race conditions related to set_page_dirty(), and we
> can reduce the problem of race conditions further by not getting a
> reference on the page at all when going GUP as part of a KVM fault?

You could still keep the extra reference until the
invalidate.

The set_page_dirty however if you do in the context of the secondary
MMU fault (i.e. atomically with the mapping of the page in the
secondary MMU, with respect of MMU notifier invalidates), it solves
the whole problem with the ->mkwrite/mkclean and then you can keep a
GUP long term pin fully safely already. That is a solution that always
works and becomes guaranteed by design by the MMU notifier not to
interfere with the _current_ writeback code in the filesystem. It also
already provides stable pages.

> Can you explain, or provide a pointer to, the root cause of the
> problem with holding a reference on the page and setting it dirty?

The filesystem/VM doesn't possibly expect set_page_dirty to be called
again after it called page_mkclean. Supposedly a wrprotect fault
should have been generated if somebody tried to write to the page
under writeback, so page_mkwrite should have run again before you
could have called set_page_dirty.

Instead page_mkclean failed to get rid of the long term GUP obtained
with FOLL_WRITE because it simply can't ask the device to release it
without MMU notifier, so the device can later still call
set_page_dirty despite page_mkclean already run.

> > but a whole different chapter is
> > dedicated on the set_page_dirty_lock crash on MAP_SHARED mappings
> > after long term GUP pins. So since you're looking into how to handle
> > the page struct in the MMU notifier it's worth mentioning the issues
> > related to set_page_dirty too.
> 
> Is there some background info on the "set_page_dirty_lock crash on
> MAP_SHARED" ?  I'm having trouble following this without the background.

Jan Kara leaded the topic explained all the details on this filesystem
issue at the LSF-MM and also last year.

Which is what makes me think there can't be too many uses cases that
require writback to work while long term GUP pin allow some device to
write to the pages at any given time, if nobody requires this to be
urgently fixed.

You can find coverage on lwn and on linux-mm.

> 
> > 
> > To achieve the cleanest writeback fix to avoid crashes in
> > set_page_dirty_lock on long term secondary MMU mappings that supports
> > MMU notifier like KVM shadow MMU, the ideal is to mark the page dirty
> > before establishing a writable the mapping in the secondary MMU like
> > in the model below.
> > 
> > The below solution works also for those secondary MMU that are like a
> > TLB and if there are two concurrent invalidates on the same page
> > invoked at the same time (a potential problem Jerome noticed), you
> > don't know which come out first and you would risk to call
> > set_page_dirty twice, which would be still potentially kernel crashing
> > (even if only a theoretical issue like O_DIRECT).
> 
> Why is it problematic to call set_page_dirty() twice?  I thought that at
> worst it would only lead to writing out data to disk unnecessarily ?

According to Jerome, after the first set_page_dirty returns, writeback
could start before the second set_page_dirty has been called. So if
there are additional random later invalidates the next ones shouldn't
call set_page_dirty again.

The problem is if you call set_page_dirty in the invalidate, you've
also to make sure set_page_dirty is being called only once.

There can be concurrent invalidates for the same page running at the
same time, while the page fault there is only one that runs atomically
with respect to the mmu notifier invalidates (under whatever lock that
serializes the MMU notifier invalidates vs the secondary MMU page fault).

If you call set_page_dirty twice in a row, you again open the window
for the writeback to have already called ->page_mkclean on the page
after the first set_page_dirty, so the second set_page_dirty will
then crash.

You can enforce to call it only once if you have sptes (shadow
pagetables) like in KVM has, so this is not an issue for KVM.

> I am also not familiar with a problem related to KVM and O_DIRECT, so
> I'm having trouble keeping up here as well :(

There's no problem in KVM and O_DIRECT.

There's a problem in O_DIRECT itself regardless if it's qemu or any
other app using it: just the time window is too low to be
noticeable. It's still a corollary of why we can't run two
set_page_dirty per page, if there are concurrent MMU notifier
invalidates.

> > So the below model
> > will solve that and it's also valid for KVM/vhost accelleration,
> > despite KVM can figure out how to issue a single set_page_dirty call
> > for each spte that gets invalidated by concurrent invalidates on the
> > same page because it has shadow pagetables and it's not just a TLB.
> > 
> >   access = FOLL_WRITE|FOLL_GET
> > 
> > repeat:
> >   page = gup(access)
> >   put_page(page)
> > 
> >   spin_lock(mmu_notifier_lock);
> >   if (race with invalidate) {
> >     spin_unlock..
> >     goto repeat;
> >   }
> >   if (access == FOLL_WRITE)
> >     set_page_dirty(page)
> >   establish writable mapping in secondary MMU on page
> >   spin_unlock
> > 
> > The above solves the crash in set_page_dirty_lock without having to
> > modify any filesystem, it should work theoretically safer than the
> > O_DIRECT short term GUP pin.
> 
> That is not exactly how we do things today on the arm64 side.  We do
> something that looks like:

The above is the model that solves all problems with writeback
page_mkclean/mkwrite, provides stable pages to current filesystems,
regardless of lowlevel implementation details of the mmu notifier
methods.

For KVM all models works not only the above one because we have sptes
to disambiguate which is the first invalidate that has to run
set_page_dirty.

> 
>   /*
>    * user_mem_abort is our function for a secondary MMU fault that
>    * resolves to a memslot.
>    */
>   user_mem_abort() {
>       page = gup(access, &writable);
>       spin_lock(&kvm->mmu_lock);
>       if (mmu_notifier_retry(kvm, mmu_seq))
>           goto out; /* run the VM again and see what happens */
> 
>       if (writable)
>           kvm_set_pfn_dirty(page_to_pfn(page));
>       stage2_set_pte(); /* establish_writable mapping in secondary MMU on page */
> 
>   out:
>       spin_unlock(&kvm_mmu_lock);
>       put_page(page);
>   }
> 
> Should we rework this to address the race you are refering to, and are
> other architectures already safe against this?

Actually it seems you mark the page dirty exactly where I suggested
above: i.e. atomically with the secondary MMU mapping establishment
with respect to the mmu notifier invalidates.

I don't see any problem with the above (well you need to have a way to
track if you run stage2_set_pte or if you taken "goto out" but the
above is pseudocode).

There's a problem however in kvm_set_pfn_dirty common code, it should
call set_page_dirty not SetPageDirty or it won't do anything in the
MAP_SHARED filebacked case. The current code is perfectly ok for anon and
MAP_PRIVATE write=1 cases.

However FOLL_TOUCH in gup already either calls set_page_dirty or it
marks the linux pte as dirty, so that's working around the lack of
set_page_dirty... I wonder if we could just rely on the set_page_dirty
in gup with FOLL_TOUCH and drop SetPageDirty as a whole in KVM in fact.

> > With regard to KVM this should be enough, but we also look for a crash
> > avoidance solution for those devices that cannot support the MMU
> > notifier for short and long term GUP pins.
> 
> Sorry, can you define short and long term GUP pins, and do we have
> current examples of both?

Long term as in mm/gup.c:FOLL_LONGTERM, means you expect to call some
get_user_pages with FOLL_GET and not release the refcount immediately
after I/O completion, the page could remain indefinitely pinned,
virt example: vfio device assignment.

The most common example of short term GUP (i.e. default behavior of
GPU) is O_DIRECT. I/O completion takes short time (depends on the
buffer size of course.. could be 1TB of buffer I/O) but it's still
going to be released ASAP.

> > There's lots of work going on on linux-mm, to try to let those devices
> > support writeback in a safe way (also with stable pages so all fs
> > integrity checks will pass) using bounce buffer if a long term GUP pin
> > is detected by the filesystem. In addition there's other work to make
> > the short term GUP pin theoretically safe by delaying the writeback
> > for the short window the GUP pin is taken by O_DIRECT, so it becomes
> > theoretically safe too (currently it's only practically safe).
> > 
> > However I'm not sure if the long term GUP pins really needs to support
> > writeback.
> > 
> 
> I don't think I understand the distinction between a long term GUP pin
> that supports writeback vs. a short term GUP pin.  My question was
> whether or not the pin could be dropped at the time the mapping was torn
> down, or if it has to be done at the same time the mapping is
> established, for things to work properly wrt. the semantics of memory
> behavior of the rest of the kernel.

Yes sorry, the question about the refcounting was just trivial to
answer: it always works, you can drop the refcount anywhere.

I just thought if there was any doubt on the refcounting issue which
had an immediate black and white answer, it was safer to raise
awareness about the much more troubling and subtle issues with
set_page_dirty caused by GUP refcounts.

> I'm not sure if we're talking about the same thing here, or if you're
> explaining a different scenario?

Simply KVM secondary MMU fault has to mark the page dirty somehow
(either in gup itself or in the fault or in the invalidates) in
addition to dealing the refcount. That's the connection.

However this set_page_dirty issue needs solution that works both for
short term GPU pins that can't use mmu notifier (O_DIRECT), long term
GUP pins that can't use mmu notifier (vfio) and the MMU notifier
mappings (KVM page fault, whose refcount can be implicitly hold on by
the MMU notifier itself and in turn the put_page can go anywhere).

The solution to the O_DIRECT gup pin is also highly connected with the
GUP refcounting, because the solution is to alter the refcount so that
the filesystem learns that there's a special refcounting and
->page_mkclean can be then deferred as long as the special refcount is
hold by O_DIRECT. I argue the same special refcount can be also hold
by long term GUP pins and the MMU notifier KVM page fault case (which
will either drop FOLL_GET ideally) so we can defer the page_mkwrite
indefinitely for long term GUP pins too (there will be no deferred
write in MMU Notifier case because ->page_mkclean invokes the MMU
notifier invalidates).

If one wants to flush to disk the dirty MAP_SHARED periodically the
device needs to be told to drop the refcounts and stop writing to the
memory. If the device isn't told to stop writing to the memory what
comes out is only coherent at 512 bytes units or maybe less anyway.

Thanks,
Andrea
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Reference count on pages held in secondary MMUs
  2019-06-22 19:11       ` Andrea Arcangeli
@ 2019-06-26 12:16         ` Christoffer Dall
  0 siblings, 0 replies; 8+ messages in thread
From: Christoffer Dall @ 2019-06-26 12:16 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Paolo Bonzini, Jerome Glisse, kvmarm, kvm

On Sat, Jun 22, 2019 at 03:11:36PM -0400, Andrea Arcangeli wrote:
> Hello Christoffer,
> 
> On Tue, Jun 11, 2019 at 01:51:32PM +0200, Christoffer Dall wrote:
> > Sorry, what does this mean?  Do you mean that we can either do:
> > 
> >   on_vm_s2_fault() {
> >       page = gup();
> >       map_page_in_s2_mmu();
> >       put_page();
> >   }
> > 
> >   mmu_notifier_invalidate() {
> >       unmap_page_in_s2_mmu();
> >   }
> > 
> > or
> > 
> >   on_vm_s2_fault() {
> >       page = gup();
> >       map_page_in_s2_mmu();
> >   }
> > 
> >   mmu_notifier_invalidate() {
> >       unmap_page_in_s2_mmu();
> >       put_page();
> >   }
> 
> Yes both work, refcounting always works.
> 
> > > and in fact Jerome also thinks
> > > like me that we should eventually optimize away the FOLL_GET and not
> > > take the refcount in the first place, 
> > 
> > So if I understood the above correct, the next point is that there are
> > advantages to avoiding keeping the extra reference on that page, because
> > we have problematic race conditions related to set_page_dirty(), and we
> > can reduce the problem of race conditions further by not getting a
> > reference on the page at all when going GUP as part of a KVM fault?
> 
> You could still keep the extra reference until the
> invalidate.
> 
> The set_page_dirty however if you do in the context of the secondary
> MMU fault (i.e. atomically with the mapping of the page in the
> secondary MMU, with respect of MMU notifier invalidates), it solves
> the whole problem with the ->mkwrite/mkclean and then you can keep a
> GUP long term pin fully safely already. That is a solution that always
> works and becomes guaranteed by design by the MMU notifier not to
> interfere with the _current_ writeback code in the filesystem. It also
> already provides stable pages.
> 

Ok.

> > Can you explain, or provide a pointer to, the root cause of the
> > problem with holding a reference on the page and setting it dirty?
> 
> The filesystem/VM doesn't possibly expect set_page_dirty to be called
> again after it called page_mkclean. Supposedly a wrprotect fault
> should have been generated if somebody tried to write to the page
> under writeback, so page_mkwrite should have run again before you
> could have called set_page_dirty.
> 
> Instead page_mkclean failed to get rid of the long term GUP obtained
> with FOLL_WRITE because it simply can't ask the device to release it
> without MMU notifier, so the device can later still call
> set_page_dirty despite page_mkclean already run.
> 

I see, I'm now able to link this to recent articles on LWN.

> > > but a whole different chapter is
> > > dedicated on the set_page_dirty_lock crash on MAP_SHARED mappings
> > > after long term GUP pins. So since you're looking into how to handle
> > > the page struct in the MMU notifier it's worth mentioning the issues
> > > related to set_page_dirty too.
> > 
> > Is there some background info on the "set_page_dirty_lock crash on
> > MAP_SHARED" ?  I'm having trouble following this without the background.
> 
> Jan Kara leaded the topic explained all the details on this filesystem
> issue at the LSF-MM and also last year.
> 
> Which is what makes me think there can't be too many uses cases that
> require writback to work while long term GUP pin allow some device to
> write to the pages at any given time, if nobody requires this to be
> urgently fixed.
> 
> You can find coverage on lwn and on linux-mm.
> 
> > 
> > > 
> > > To achieve the cleanest writeback fix to avoid crashes in
> > > set_page_dirty_lock on long term secondary MMU mappings that supports
> > > MMU notifier like KVM shadow MMU, the ideal is to mark the page dirty
> > > before establishing a writable the mapping in the secondary MMU like
> > > in the model below.
> > > 
> > > The below solution works also for those secondary MMU that are like a
> > > TLB and if there are two concurrent invalidates on the same page
> > > invoked at the same time (a potential problem Jerome noticed), you
> > > don't know which come out first and you would risk to call
> > > set_page_dirty twice, which would be still potentially kernel crashing
> > > (even if only a theoretical issue like O_DIRECT).
> > 
> > Why is it problematic to call set_page_dirty() twice?  I thought that at
> > worst it would only lead to writing out data to disk unnecessarily ?
> 
> According to Jerome, after the first set_page_dirty returns, writeback
> could start before the second set_page_dirty has been called. So if
> there are additional random later invalidates the next ones shouldn't
> call set_page_dirty again.
> 
> The problem is if you call set_page_dirty in the invalidate, you've
> also to make sure set_page_dirty is being called only once.
> 
> There can be concurrent invalidates for the same page running at the
> same time, while the page fault there is only one that runs atomically
> with respect to the mmu notifier invalidates (under whatever lock that
> serializes the MMU notifier invalidates vs the secondary MMU page fault).
> 
> If you call set_page_dirty twice in a row, you again open the window
> for the writeback to have already called ->page_mkclean on the page
> after the first set_page_dirty, so the second set_page_dirty will
> then crash.
> 
> You can enforce to call it only once if you have sptes (shadow
> pagetables) like in KVM has, so this is not an issue for KVM.
> 

Makes sense.  The key for my understanding was that an atomic
relationship between the page fault and the mmu notifier has to be
enforced.

> > I am also not familiar with a problem related to KVM and O_DIRECT, so
> > I'm having trouble keeping up here as well :(
> 
> There's no problem in KVM and O_DIRECT.
> 
> There's a problem in O_DIRECT itself regardless if it's qemu or any
> other app using it: just the time window is too low to be
> noticeable. It's still a corollary of why we can't run two
> set_page_dirty per page, if there are concurrent MMU notifier
> invalidates.
> 
> > > So the below model
> > > will solve that and it's also valid for KVM/vhost accelleration,
> > > despite KVM can figure out how to issue a single set_page_dirty call
> > > for each spte that gets invalidated by concurrent invalidates on the
> > > same page because it has shadow pagetables and it's not just a TLB.
> > > 
> > >   access = FOLL_WRITE|FOLL_GET
> > > 
> > > repeat:
> > >   page = gup(access)
> > >   put_page(page)
> > > 
> > >   spin_lock(mmu_notifier_lock);
> > >   if (race with invalidate) {
> > >     spin_unlock..
> > >     goto repeat;
> > >   }
> > >   if (access == FOLL_WRITE)
> > >     set_page_dirty(page)
> > >   establish writable mapping in secondary MMU on page
> > >   spin_unlock
> > > 
> > > The above solves the crash in set_page_dirty_lock without having to
> > > modify any filesystem, it should work theoretically safer than the
> > > O_DIRECT short term GUP pin.
> > 
> > That is not exactly how we do things today on the arm64 side.  We do
> > something that looks like:
> 
> The above is the model that solves all problems with writeback
> page_mkclean/mkwrite, provides stable pages to current filesystems,
> regardless of lowlevel implementation details of the mmu notifier
> methods.
> 
> For KVM all models works not only the above one because we have sptes
> to disambiguate which is the first invalidate that has to run
> set_page_dirty.
> 
> > 
> >   /*
> >    * user_mem_abort is our function for a secondary MMU fault that
> >    * resolves to a memslot.
> >    */
> >   user_mem_abort() {
> >       page = gup(access, &writable);
> >       spin_lock(&kvm->mmu_lock);
> >       if (mmu_notifier_retry(kvm, mmu_seq))
> >           goto out; /* run the VM again and see what happens */
> > 
> >       if (writable)
> >           kvm_set_pfn_dirty(page_to_pfn(page));
> >       stage2_set_pte(); /* establish_writable mapping in secondary MMU on page */
> > 
> >   out:
> >       spin_unlock(&kvm_mmu_lock);
> >       put_page(page);
> >   }
> > 
> > Should we rework this to address the race you are refering to, and are
> > other architectures already safe against this?
> 
> Actually it seems you mark the page dirty exactly where I suggested
> above: i.e. atomically with the secondary MMU mapping establishment
> with respect to the mmu notifier invalidates.
> 
> I don't see any problem with the above (well you need to have a way to
> track if you run stage2_set_pte or if you taken "goto out" but the
> above is pseudocode).
> 
> There's a problem however in kvm_set_pfn_dirty common code, it should
> call set_page_dirty not SetPageDirty or it won't do anything in the
> MAP_SHARED filebacked case. The current code is perfectly ok for anon and
> MAP_PRIVATE write=1 cases.
> 
> However FOLL_TOUCH in gup already either calls set_page_dirty or it
> marks the linux pte as dirty, so that's working around the lack of
> set_page_dirty... I wonder if we could just rely on the set_page_dirty
> in gup with FOLL_TOUCH and drop SetPageDirty as a whole in KVM in fact.
> 

I may have misled you with my use of 'gup()' as a function above.  In
reality we use the gfn_to_pfn_prot() wrapper, which means there are
several things going on:

First, it appears we only do pte_mkdirty with FOLL_TOUCH if we also set
FOLL_WRITE.  This seems to be confimed by the commentary on
get_user_pages, which says that set_page_dirty() must be called if a
page is written to, and the page is found without FOLL_WRITE.

Second, KVM first attempts a __get_user_pages_fast(), and if that fails
does get_user_pages_unlocked(), but only sets FOLL_WRITE as part of a
write fauult.  If the page is nevertheless mapped writable, I think we
still need the subsequent set_page_dirty() when a write fault happens on
the secondary mmu.

So I take it you'll send a patch addressing the SetPageDirty to
set_page_dirty problem?


> > > With regard to KVM this should be enough, but we also look for a crash
> > > avoidance solution for those devices that cannot support the MMU
> > > notifier for short and long term GUP pins.
> > 
> > Sorry, can you define short and long term GUP pins, and do we have
> > current examples of both?
> 
> Long term as in mm/gup.c:FOLL_LONGTERM, means you expect to call some
> get_user_pages with FOLL_GET and not release the refcount immediately
> after I/O completion, the page could remain indefinitely pinned,
> virt example: vfio device assignment.
> 
> The most common example of short term GUP (i.e. default behavior of
> GPU) is O_DIRECT. I/O completion takes short time (depends on the
> buffer size of course.. could be 1TB of buffer I/O) but it's still
> going to be released ASAP.
> 
> > > There's lots of work going on on linux-mm, to try to let those devices
> > > support writeback in a safe way (also with stable pages so all fs
> > > integrity checks will pass) using bounce buffer if a long term GUP pin
> > > is detected by the filesystem. In addition there's other work to make
> > > the short term GUP pin theoretically safe by delaying the writeback
> > > for the short window the GUP pin is taken by O_DIRECT, so it becomes
> > > theoretically safe too (currently it's only practically safe).
> > > 
> > > However I'm not sure if the long term GUP pins really needs to support
> > > writeback.
> > > 
> > 
> > I don't think I understand the distinction between a long term GUP pin
> > that supports writeback vs. a short term GUP pin.  My question was
> > whether or not the pin could be dropped at the time the mapping was torn
> > down, or if it has to be done at the same time the mapping is
> > established, for things to work properly wrt. the semantics of memory
> > behavior of the rest of the kernel.
> 
> Yes sorry, the question about the refcounting was just trivial to
> answer: it always works, you can drop the refcount anywhere.
> 
> I just thought if there was any doubt on the refcounting issue which
> had an immediate black and white answer, it was safer to raise
> awareness about the much more troubling and subtle issues with
> set_page_dirty caused by GUP refcounts.
> 

I understand, and thanks for doing that.  I just had to put the pieces
together on my end.

> > I'm not sure if we're talking about the same thing here, or if you're
> > explaining a different scenario?
> 
> Simply KVM secondary MMU fault has to mark the page dirty somehow
> (either in gup itself or in the fault or in the invalidates) in
> addition to dealing the refcount. That's the connection.
> 
> However this set_page_dirty issue needs solution that works both for
> short term GPU pins that can't use mmu notifier (O_DIRECT), long term
> GUP pins that can't use mmu notifier (vfio) and the MMU notifier
> mappings (KVM page fault, whose refcount can be implicitly hold on by
> the MMU notifier itself and in turn the put_page can go anywhere).
> 
> The solution to the O_DIRECT gup pin is also highly connected with the
> GUP refcounting, because the solution is to alter the refcount so that
> the filesystem learns that there's a special refcounting and
> ->page_mkclean can be then deferred as long as the special refcount is
> hold by O_DIRECT. I argue the same special refcount can be also hold
> by long term GUP pins and the MMU notifier KVM page fault case (which
> will either drop FOLL_GET ideally) so we can defer the page_mkwrite
> indefinitely for long term GUP pins too (there will be no deferred
> write in MMU Notifier case because ->page_mkclean invokes the MMU
> notifier invalidates).
> 
> If one wants to flush to disk the dirty MAP_SHARED periodically the
> device needs to be told to drop the refcounts and stop writing to the
> memory. If the device isn't told to stop writing to the memory what
> comes out is only coherent at 512 bytes units or maybe less anyway.
> 

Thanks for the clarification.

    Christoffer
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2019-06-26 12:16 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-09  8:18 Reference count on pages held in secondary MMUs Christoffer Dall
2019-06-09  9:37 ` Paolo Bonzini
2019-06-09 17:40   ` Andrea Arcangeli
2019-06-11 11:51     ` Christoffer Dall
2019-06-22 19:11       ` Andrea Arcangeli
2019-06-26 12:16         ` Christoffer Dall
2019-06-11 11:21   ` Christoffer Dall
2019-06-11 11:29     ` Paolo Bonzini

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).