linux-mm.kvack.org archive mirror
* [PATCH 0/1] soft_dirty: fix soft_dirty during THP split
@ 2016-08-19 12:41 Andrea Arcangeli
  2016-08-19 12:41 ` [PATCH 1/1] " Andrea Arcangeli
  2016-08-19 13:20 ` [PATCH 0/1] " Pavel Emelyanov
  0 siblings, 2 replies; 9+ messages in thread
From: Andrea Arcangeli @ 2016-08-19 12:41 UTC (permalink / raw)
  To: Kirill A. Shutemov, Pavel Emelyanov, Andrew Morton; +Cc: linux-mm

Hello,

while adding proper userfaultfd_wp support, with bits in the pagetable
and swap entry to avoid false positive WP userfaults through
swap/fork/KSM/etc., I've been adding a framework that mostly mirrors
soft dirty.

In doing so I noticed one place where I had to add uffd_wp support to
the pagetables that wasn't covered by soft_dirty, and I think it should
have been.

Example: in the THP migration code migrate_misplaced_transhuge_page()
pmd_mkdirty is called unconditionally after mk_huge_pmd.

	entry = mk_huge_pmd(new_page, vma->vm_page_prot);
	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);

That sets soft dirty too (it's a false positive for soft dirty; the
soft dirty bit could be more fine-grained and transfer the bit like
uffd_wp will do. pmd/pte_uffd_wp() enforces the invariant that when
it's set, pmd/pte_write is not set).

However, in the THP split there's no unconditional pmd_mkdirty after
mk_huge_pmd, and pte_swp_mksoft_dirty isn't called after the migration
entry is created. The code sets the dirty bit in the struct page
instead of setting it in the pagetable. That is fully equivalent as
far as the real dirty bit is concerned, as the whole point of the
pagetable bits is to be eventually flushed out to the page, but it
is not equivalent for the soft-dirty bit, which gets lost in
translation.
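
To illustrate why the bit gets lost, here is a minimal userspace sketch
(not kernel code) of the split. The names toy_pmd, toy_pte, toy_page and
toy_split are hypothetical stand-ins for the real pagetable types, and
the transfer_soft_dirty switch exists only to compare the behavior
without and with the fix. The point is that a flag living only in the
mapping has to be copied into each of the 512 small entries, while the
real dirty bit can equally be recorded in the per-page state.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_PTES_PER_PMD 512	/* one 2MB huge page = 512 4KB pages */

/* Hypothetical stand-ins for pmd/pte flag bits and struct page state. */
struct toy_pmd { bool write, young, dirty, soft_dirty; };
struct toy_pte { bool write, young, soft_dirty; };
struct toy_page { bool dirty; };

/*
 * Split one huge mapping into TOY_PTES_PER_PMD small mappings.
 * The real dirty bit is moved to the page, which is equivalent.
 * Soft dirty has no per-page home, so unless it is copied into
 * every pte it is simply dropped.
 */
static void toy_split(const struct toy_pmd *pmd, struct toy_pte *ptes,
		      struct toy_page *pages, bool transfer_soft_dirty)
{
	size_t i;

	for (i = 0; i < TOY_PTES_PER_PMD; i++) {
		ptes[i].write = pmd->write;
		ptes[i].young = pmd->young;
		ptes[i].soft_dirty = transfer_soft_dirty && pmd->soft_dirty;
		if (pmd->dirty)
			pages[i].dirty = true;
	}
}
```

Calling toy_split() with transfer_soft_dirty set to false models the
current behavior: every page ends up dirty, but no pte is soft dirty.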

This was found by code review only and totally untested as I'm working
to actually replace soft dirty and I don't have time to test potential
soft dirty bugfixes as well :).


---
Changing topic slightly: some considerations about soft dirty vs
userfaultfd_wp follow.

I'm optimistic once userfaultfd WP support is fully accurate with
pmd/pte_uffd_wp tracking enabled, we can then remove soft dirty some
time after that.

Not even the qemu precopy code uses soft dirty, instead it prefers to
set the dirty bitmap in software, after every guest memory
modification, even if it means setting the same dirty bit over and
over again.

We considered soft dirty, but the cost of scanning all pagetables like
soft dirty has to do is excessive and it doesn't scale: it's O(N)
where N is the number of pages in the program/virtual machine. With
a terabyte of RAM the cost would be excessive (especially considering
we're tracking faults at 4k granularity and soft dirty wouldn't even
provide such granularity). Instead we use the new x86 virt hardware
feature that notifies KVM of a list of virtual addresses that are
dirty, in an array that sends a notification and blocks in a vmexit
when it gets full. With that feature we don't have to scan all shadow
pagetables, and the memory can be left read-write, avoiding write
faults during precopy dirty logging.
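
To put a rough number on the O(N) cost, here is a back-of-the-envelope
sketch. It assumes the scan would go through /proc/PID/pagemap, which
exposes one 64-bit entry per virtual page; the helper names are made up
for this example and are not part of any existing tool.

```c
#include <stdint.h>

/* Number of base pages covering mem_bytes (assumed page aligned). */
static uint64_t nr_pages(uint64_t mem_bytes, uint64_t page_size)
{
	return mem_bytes / page_size;
}

/* Bytes of pagemap to read per full pass: one 64-bit entry per page. */
static uint64_t pagemap_bytes_per_pass(uint64_t mem_bytes, uint64_t page_size)
{
	return nr_pages(mem_bytes, page_size) * sizeof(uint64_t);
}
```

For 1TB (2^40 bytes) tracked at 4k granularity this gives 268435456
entries, i.e. 2GB of pagemap data to walk at every precopy pass,
regardless of how few pages were actually written.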

userfaultfd WP tracking can provide the same information that the
hardware shadow pagetables feature provides, without having to stop
and scan all pagetables at every precopy pass. So it would remove the
complexity issues from dirty tracking.

Most importantly, for most usages soft dirty is not enough regardless
of performance considerations, as it can't block the fault, while
userfaultfd_wp can. Throttling the write faults is fundamental to be
able to guarantee a maximum amount of allocations in the snapshot use
case, i.e. postcopy live snapshotting and redis snapshotting with a
userfault thread instead of fork() (fork() in fact cannot throttle
the write faults, nor decide the granularity of the COW faults in the
parent, which is why redis underperforms with THP on).

Clearly soft dirty is better than mprotect + sigsegv for
non-cooperative usages like checkpoint, but I believe userfaultfd_wp
would be even better for that, even though it will schedule. Perhaps
later we could add an async queue mode to enable and disable at
runtime, so the userfault could still be notified to userland through
uffd asynchronously, while the faulting thread continues running
without blocking.

Yet another difference is that soft dirty exposes the memory
granularity the kernel decided to use internally, so it'll report 2MB
dirty if a THP could be allocated or 4KB dirty if it couldn't. With
userfaultfd it's always userland that decides the granularity of the
faults, and userland cannot possibly notice any difference in behavior
or runtime depending on THP being used or not. Of course, for userland
to give the kernel a chance to avoid splitting THPs in the user
faulted regions, userland would need to use a 2MB granularity in the
UFFDIO ioctls (i.e. calling UFFDIO_WRITEPROTECT with 2MB aligned
"start, end" addresses etc.).
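
As a sketch of what "2MB aligned" means for the caller, the helper below
widens an arbitrary range to 2MB boundaries before it would be handed to
the ioctl. struct wp_range and wp_range_align_2m() are hypothetical
names for this example only, and the 2MB THP size is an assumption that
holds on x86-64 but not on every architecture.

```c
#include <stdint.h>

#define HPAGE_SIZE_2M (2ULL << 20)	/* assumed THP size (x86-64) */

/* Hypothetical holder for the range that would be passed to the ioctl. */
struct wp_range {
	uint64_t start;
	uint64_t len;
};

/* Widen [addr, addr + len) to 2MB boundaries; len must be non-zero. */
static struct wp_range wp_range_align_2m(uint64_t addr, uint64_t len)
{
	uint64_t mask = ~(HPAGE_SIZE_2M - 1);
	uint64_t start = addr & mask;
	uint64_t end = (addr + len + HPAGE_SIZE_2M - 1) & mask;
	struct wp_range r = { start, end - start };

	return r;
}
```

Note that widening the range also write-protects memory the caller did
not ask for, so this is a tradeoff between extra faults and keeping the
THP intact.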

That said, for the time being I'm trying to allow soft dirty and
userfaultfd_wp to work simultaneously on the same "vmas", so that they
stay orthogonal.

userfaultfd_wp already works for test programs and it should be safe
as far as kernel safety is concerned, but I don't think swap is being
handled right in the current code, and the pmd/pte_(swp)_uffd_wp
pagetable bitflag I'm adding should fix it.

Andrea Arcangeli (1):
  soft_dirty: fix soft_dirty during THP split

 mm/huge_memory.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* [PATCH 1/1] soft_dirty: fix soft_dirty during THP split
  2016-08-19 12:41 [PATCH 0/1] soft_dirty: fix soft_dirty during THP split Andrea Arcangeli
@ 2016-08-19 12:41 ` Andrea Arcangeli
  2016-08-19 13:17   ` Pavel Emelyanov
  2016-08-19 13:20 ` [PATCH 0/1] " Pavel Emelyanov
  1 sibling, 1 reply; 9+ messages in thread
From: Andrea Arcangeli @ 2016-08-19 12:41 UTC (permalink / raw)
  To: Kirill A. Shutemov, Pavel Emelyanov, Andrew Morton; +Cc: linux-mm

Transfer the soft_dirty from pmd to pte during THP splits.

This fix avoids losing the soft_dirty bit and avoids userland memory
corruption in the checkpoint.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b9570b5..cb95a83 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1512,7 +1512,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	struct page *page;
 	pgtable_t pgtable;
 	pmd_t _pmd;
-	bool young, write, dirty;
+	bool young, write, dirty, soft_dirty;
 	unsigned long addr;
 	int i;
 
@@ -1546,6 +1546,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	write = pmd_write(*pmd);
 	young = pmd_young(*pmd);
 	dirty = pmd_dirty(*pmd);
+	soft_dirty = pmd_soft_dirty(*pmd);
 
 	pmdp_huge_split_prepare(vma, haddr, pmd);
 	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
@@ -1562,6 +1563,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			swp_entry_t swp_entry;
 			swp_entry = make_migration_entry(page + i, write);
 			entry = swp_entry_to_pte(swp_entry);
+			if (soft_dirty)
+				entry = pte_swp_mksoft_dirty(entry);
 		} else {
 			entry = mk_pte(page + i, vma->vm_page_prot);
 			entry = maybe_mkwrite(entry, vma);
@@ -1569,6 +1572,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 				entry = pte_wrprotect(entry);
 			if (!young)
 				entry = pte_mkold(entry);
+			if (soft_dirty)
+				entry = pte_mksoft_dirty(entry);
 		}
 		if (dirty)
 			SetPageDirty(page + i);


* Re: [PATCH 1/1] soft_dirty: fix soft_dirty during THP split
  2016-08-19 12:41 ` [PATCH 1/1] " Andrea Arcangeli
@ 2016-08-19 13:17   ` Pavel Emelyanov
  0 siblings, 0 replies; 9+ messages in thread
From: Pavel Emelyanov @ 2016-08-19 13:17 UTC (permalink / raw)
  To: Andrea Arcangeli, Kirill A. Shutemov, Andrew Morton; +Cc: linux-mm

On 08/19/2016 03:41 PM, Andrea Arcangeli wrote:
> Transfer the soft_dirty from pmd to pte during THP splits.
> 
> This fix avoids losing the soft_dirty bit and avoids userland memory
> corruption in the checkpoint.

Nasty :( Thanks for catching this!

> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>


* Re: [PATCH 0/1] soft_dirty: fix soft_dirty during THP split
  2016-08-19 12:41 [PATCH 0/1] soft_dirty: fix soft_dirty during THP split Andrea Arcangeli
  2016-08-19 12:41 ` [PATCH 1/1] " Andrea Arcangeli
@ 2016-08-19 13:20 ` Pavel Emelyanov
  2016-08-19 13:43   ` Andrea Arcangeli
  1 sibling, 1 reply; 9+ messages in thread
From: Pavel Emelyanov @ 2016-08-19 13:20 UTC (permalink / raw)
  To: Andrea Arcangeli, Kirill A. Shutemov, Andrew Morton; +Cc: linux-mm

On 08/19/2016 03:41 PM, Andrea Arcangeli wrote:
> I'm optimistic once userfaultfd WP support is fully accurate with
> pmd/pte_uffd_wp tracking enabled, we can then remove soft dirty some
> time after that.

And (!) after non-cooperative patches are functional too.

> Clearly soft dirty is better than mprotect + sigsegv for
> non-cooperative usages like checkpoint, but I believe userfaultfd_wp
> would be even better for that, even though it will schedule. Perhaps
> later we could add an async queue mode to enable and disable at
> runtime, so the userfault could still be notified to userland through
> uffd asynchronously, while the faulting thread continues running
> without blocking.

Yes. Another problem of soft-dirty that will be addressed by uffd is
simultaneous memory tracking of two ... scanners (?) E.g. when we
reset soft-dirty to track the mem and then some other software comes
and tries to do the same, the whole soft-dirty state becomes screwed.
With uffd we'll at least have the ability for the first tracker to
keep the 2nd one off the tracking task.
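
For reference, a minimal sketch of how a tracker reads the soft-dirty
state from userspace. The bit position (bit 55 of a pagemap entry) and
the reset by writing "4" to /proc/PID/clear_refs come from the kernel's
soft-dirty documentation; the helper names are made up for this
example. Since the reset applies to the whole task, a second tracker
doing the same reset wipes the state the first one was relying on.

```c
#include <stdint.h>

#define PM_SOFT_DIRTY (1ULL << 55)	/* pagemap bit 55: pte is soft-dirty */

/* Byte offset in /proc/PID/pagemap of the entry describing vaddr. */
static uint64_t pagemap_offset(uint64_t vaddr, uint64_t page_size)
{
	return (vaddr / page_size) * sizeof(uint64_t);
}

/* Non-zero if the 64-bit pagemap entry reports the page as soft-dirty. */
static int pagemap_soft_dirty(uint64_t entry)
{
	return (entry & PM_SOFT_DIRTY) != 0;
}
```

A tracker would pread() 8 bytes at pagemap_offset() for each page of
interest and test the result with pagemap_soft_dirty().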


* Re: [PATCH 0/1] soft_dirty: fix soft_dirty during THP split
  2016-08-19 13:20 ` [PATCH 0/1] " Pavel Emelyanov
@ 2016-08-19 13:43   ` Andrea Arcangeli
  2016-08-19 13:52     ` Pavel Emelyanov
  0 siblings, 1 reply; 9+ messages in thread
From: Andrea Arcangeli @ 2016-08-19 13:43 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: Kirill A. Shutemov, Andrew Morton, linux-mm

On Fri, Aug 19, 2016 at 04:20:22PM +0300, Pavel Emelyanov wrote:
> And (!) after non-cooperative patches are functional too.

I merged your non-cooperative patches in my tree although there's no
testcase to exercise them yet.

> Yes. Another problem of soft-dirty that will be addressed by uffd is
> simultaneous memory tracking of two ... scanners (?) E.g. when we
> reset soft-dirty to track the mem and then some other software comes
> and tries to do the same, the whole soft-dirty state becomes screwed.
> With uffd we'll at least have the ability for the first tracker to
> keep the 2nd one off the tracking task.

Yes, that sounds like nesting will have to work for it though.

		/*
		 * Check that this vma isn't already owned by a
		 * different userfaultfd. We can't allow more than one
		 * userfaultfd to own a single vma simultaneously or we
		 * wouldn't know which one to deliver the userfaults to.
		 */
		ret = -EBUSY;
		if (cur->vm_userfaultfd_ctx.ctx &&
		    cur->vm_userfaultfd_ctx.ctx != ctx)
			goto out_unlock;

This check shall be lifted... and it'll complicate the code quite a
bit to lift it.

My main long term worry at the moment for the non-cooperative usage in
fact is not really the non-cooperative code itself, but the nesting of
uffd if the app is already using its own set of uffds for its own
purposes. The nesting won't be straightforward.

Do you have plans to solve the nesting?

It's not just for the use case you mentioned above of two WP trackers
on the same vma, which sounds more "cooperative" than when the app
already uses "uffd" for its own runtime.

userfaultfd is going to be used by apps like databases for reliability
purposes on hugetlbfs or tmpfs (both already supported in my
development tree), and I believe it'll be perfect to optimize the
redis snapshotting, removing any cons of THP-on and further optimizing
the snapshot with threads instead of processes. Potentially it can
also be used by to-native compilers to stop overwriting the write bit
at every memory modification (and it'd be interesting to check if the
JVM could use it to drop the write bit too). Here is a research
article about this last use case of WP tracking:

https://medium.com/@MartinCracauer/generational-garbage-collection-write-barriers-write-protection-and-userfaultfd-2-8b0e796b8f7f

The nesting with virtual machines is straightforward because the uffd
used by qemu becomes invisible to the guest. The complexities with the
nesting happen when it has to work at the host level in a
non-cooperative way.

Thanks,
Andrea


* Re: [PATCH 0/1] soft_dirty: fix soft_dirty during THP split
  2016-08-19 13:43   ` Andrea Arcangeli
@ 2016-08-19 13:52     ` Pavel Emelyanov
  2016-08-19 14:37       ` Andrea Arcangeli
  2016-08-23 11:03       ` Mike Rapoport
  0 siblings, 2 replies; 9+ messages in thread
From: Pavel Emelyanov @ 2016-08-19 13:52 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Kirill A. Shutemov, Andrew Morton, linux-mm, Mike Rapoport

On 08/19/2016 04:43 PM, Andrea Arcangeli wrote:
> On Fri, Aug 19, 2016 at 04:20:22PM +0300, Pavel Emelyanov wrote:
>> And (!) after non-cooperative patches are functional too.
> 
> I merged your non-cooperative patches in my tree although there's no
> testcase to exercise them yet.

Hm... Are you talking about some in-kernel test, or just any? We have
tests in CRIU tree for UFFD (not sure we've wired up the non-cooperative
part though).

>> Yes. Another problem of soft-dirty that will be addressed by uffd is
>> simultaneous memory tracking of two ... scanners (?) E.g. when we
>> reset soft-dirty to track the mem and then some other software comes
>> and tries to do the same, the whole soft-dirty state becomes screwed.
>> With uffd we'll at least have the ability for the first tracker to
>> keep the 2nd one off the tracking task.
> 
> Yes, that sounds like nesting will have to work for it though.
> 
> 		/*
> 		 * Check that this vma isn't already owned by a
> 		 * different userfaultfd. We can't allow more than one
> 		 * userfaultfd to own a single vma simultaneously or we
> 		 * wouldn't know which one to deliver the userfaults to.
> 		 */
> 		ret = -EBUSY;
> 		if (cur->vm_userfaultfd_ctx.ctx &&
> 		    cur->vm_userfaultfd_ctx.ctx != ctx)
> 			goto out_unlock;
> 
> This check shall be lifted... and it'll complicate the code quite a
> bit to lift it.

:)

> My main long term worry at the moment for the non-cooperative usage in
> fact is not really the non-cooperative code itself, but the nesting of
> uffd if the app is already using its own set of uffds for its own
> purposes. The nesting won't be straightforward.

And my main worry about this is COW-sharing. If we have two tasks that
fork()-ed from each other and we try to lazily restore a page that
is still COW-ed between them, the uffd API doesn't give us any way to
do it. So we effectively break COW on lazy restore. Do you have any
ideas about what can be done?

> Do you have plans to solve the nesting?

We have ... readiness to do it :) since once CRIU hits this we'll have to.

> It's not just for the use case you mentioned above of two WP trackers
> on the same vma which sounds more "cooperative" than when the app
> already uses "uffd" for its own runtime.
> 
> userfaultfd is going to be used by apps like databases for reliability
> purposes on hugetlbfs or tmpfs (both already supported in my
> development tree), and I believe it'll be perfect to optimize the
> redis snapshotting, removing any cons of THP-on and further optimizing
> the snapshot with threads instead of processes. Potentially it can
> also be used by to-native compilers to stop overwriting the write bit
> at every memory modification (and it'd be interesting to check if the
> JVM could use it to drop the write bit too).

Yes, yes :) Apparently we'll hit this quite soon.

-- Pavel


* Re: [PATCH 0/1] soft_dirty: fix soft_dirty during THP split
  2016-08-19 13:52     ` Pavel Emelyanov
@ 2016-08-19 14:37       ` Andrea Arcangeli
  2016-08-22 16:35         ` Pavel Emelyanov
  2016-08-23 11:03       ` Mike Rapoport
  1 sibling, 1 reply; 9+ messages in thread
From: Andrea Arcangeli @ 2016-08-19 14:37 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Kirill A. Shutemov, Andrew Morton, linux-mm, Mike Rapoport

On Fri, Aug 19, 2016 at 04:52:51PM +0300, Pavel Emelyanov wrote:
> Hm... Are you talking about some in-kernel test, or just any? We have
> tests in CRIU tree for UFFD (not sure we've wired up the non-cooperative
> part though).

Nice. I wasn't aware you had uffd specific tests in CRIU, I'll check.

I was referring to tools/testing/selftests/vm/userfault*, but I
suppose it's fine in CRIU as well. A self-contained test suitable for
testing/selftests would be nice too, as not everyone will run the CRIU
tests to test the kernel.

Currently what's tested is anon missing, tmpfs missing and hugetlbfs
missing and they all work (just fixed two tmpfs bugs yesterday thanks
to the tmpfs test that crashed my workstation when I tried it, now it
passes fine :).

> And my main worry about this is COW-sharing. If we have two tasks that
> fork()-ed from each other and we try to lazily restore a page that
> is still COW-ed between them, the uffd API doesn't give us anything to
> do it. So we effectively break COW on lazy restore. Do you have any
> ideas what can be done about it?

Building a shared page is tricky, not even khugepaged was doing that
for anon.

Kirill extended khugepaged to do it, along with the THP on tmpfs
support, as it's more important for tmpfs (I haven't yet checked if it
landed upstream with the rest of tmpfs in 4.8-rc though).

The main API problem is that the uffd is different between parent and
child: fork with your non-cooperative patches gives you a new uffd
that represents the child mm.

To create a shared page among two "mm" the API should be able to
specify the two "mm" and two "addresses" atomically in the same
ioctl. And the uffd _is_ the "mm" with the current API.

So what it takes is to add a UFFDIO_COPY_COW that takes as
parameters an address for the current "uffd" and a list of "int uffd,
unsigned long address" pairs.
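
Purely as an illustration of that parameter list, the argument could
look like the sketch below. UFFDIO_COPY_COW does not exist; both struct
layouts, every field name and the copy_cow_args_ok() helper are
hypothetical. The dst/src/len fields are an assumption borrowed from
the shape of the existing UFFDIO_COPY argument.

```c
/* One "int uffd, unsigned long address" pair naming another mm. */
struct uffdio_copy_cow_peer {
	int uffd;		/* userfaultfd representing the other mm */
	unsigned long address;	/* where the page is mapped in that mm */
};

/* Hypothetical argument of the proposed (non-existent) UFFDIO_COPY_COW. */
struct uffdio_copy_cow {
	unsigned long dst;	/* address in the mm of the issuing uffd */
	unsigned long src;	/* buffer holding the page contents */
	unsigned long len;	/* length in bytes, page aligned */
	unsigned int nr_peers;	/* number of entries in peers[] */
	const struct uffdio_copy_cow_peer *peers;
};

/* Basic sanity check a caller could run before issuing the request. */
static int copy_cow_args_ok(const struct uffdio_copy_cow *arg,
			    unsigned long page_size)
{
	unsigned int i;

	if (!arg->len || arg->len % page_size || arg->dst % page_size)
		return 0;
	for (i = 0; i < arg->nr_peers; i++)
		if (arg->peers[i].uffd < 0 ||
		    arg->peers[i].address % page_size)
			return 0;
	return 1;
}
```

Passing all mm/address pairs in one argument is what would let the
kernel install the same page everywhere atomically, instead of one
private copy per mm.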

Even with plain UFFDIO_COPY things should still work reliably; it'll
just take more memory and break COW during restore. The important
thing is that "break" is as in "allocate more memory", not as in
"crashing" :).

> We have ... readiness to do it :) since once CRIU hits this we'll have to.

Ok great.

I also thought about it a bit and I think it's just a matter of
specifying which uffd should get the notification first. The manager
will then take the notification first and call a UFFDIO_FAULT_PASS
to cascade it to the second uffd registered in the region if the page
was missing in the source container, without waking up the task
blocked in handle_userfault. To find out whether the page is missing
in the source container you could use pagemap.
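
A minimal sketch of that pagemap check, assuming anonymous memory: per
the kernel's pagemap documentation, bit 63 of an entry means the page is
present in RAM and bit 62 means it is in swap, so a page with neither
bit set was never populated. The helper name is made up for this
example.

```c
#include <stdint.h>

#define PM_PRESENT (1ULL << 63)	/* pagemap bit 63: page present in RAM */
#define PM_SWAPPED (1ULL << 62)	/* pagemap bit 62: page is in swap */

/* Non-zero if the entry shows neither a RAM page nor a swap entry. */
static int pagemap_page_missing(uint64_t entry)
{
	return (entry & (PM_PRESENT | PM_SWAPPED)) == 0;
}
```

Only when this returns non-zero for the faulting address would the
manager pass the fault on to the second uffd.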

Thanks,
Andrea


* Re: [PATCH 0/1] soft_dirty: fix soft_dirty during THP split
  2016-08-19 14:37       ` Andrea Arcangeli
@ 2016-08-22 16:35         ` Pavel Emelyanov
  0 siblings, 0 replies; 9+ messages in thread
From: Pavel Emelyanov @ 2016-08-22 16:35 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Kirill A. Shutemov, Andrew Morton, linux-mm, Mike Rapoport

On 08/19/2016 05:37 PM, Andrea Arcangeli wrote:
> The main API problem is the uffd is different between parent and
> child, fork with your non cooperative patches gives you a new uffd
> that represents the child mm.

Yes.

> To create a shared page among two "mm" the API should be able to
> specify the two "mm" and two "addresses" atomically in the same
> ioctl. And the uffd _is_ the "mm" with the current API.

Well, with the current approach the mm equals the uffd file, so
passing one uffd descriptor into another's ioctl should do the trick.

> So what it takes to do it is to add a UFFDIO_COPY_COW that takes as
> parameter an address for the current "uffd" and a list of "int uffd,
> unsigned long address" pairs.

Yup :)


* Re: [PATCH 0/1] soft_dirty: fix soft_dirty during THP split
  2016-08-19 13:52     ` Pavel Emelyanov
  2016-08-19 14:37       ` Andrea Arcangeli
@ 2016-08-23 11:03       ` Mike Rapoport
  1 sibling, 0 replies; 9+ messages in thread
From: Mike Rapoport @ 2016-08-23 11:03 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Andrea Arcangeli, Kirill A. Shutemov, Andrew Morton, linux-mm,
	Mike Rapoport

On Fri, Aug 19, 2016 at 04:52:51PM +0300, Pavel Emelyanov wrote:
> On 08/19/2016 04:43 PM, Andrea Arcangeli wrote:
> > On Fri, Aug 19, 2016 at 04:20:22PM +0300, Pavel Emelyanov wrote:
> >> And (!) after non-cooperative patches are functional too.
> > 
> > I merged your non-cooperative patches in my tree although there's no
> > testcase to exercise them yet.
> 
> Hm... Are you talking about some in-kernel test, or just any? We have
> tests in CRIU tree for UFFD (not sure we've wired up the non-cooperative
> part though).

Well, CRIU is by definition non-cooperative :)
Still, we don't have fork() and other events in CRIU lazy restore yet.
I have some brute-force additions to selftests/vm/userfaultfd.c that
verify that the events work, and I'm now trying to get a clean version.

BTW, with the addition of hugetlbfs and tmpfs support to userfaultfd,
we'd need MADV_REMOVE and fallocate(PUNCH_HOLE) events in addition to
MADV_DONTNEED...
 
--
Sincerely yours,
Mike.
