linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/4] Restore change_pte optimization to its former glory
@ 2019-01-31 18:37 jglisse
  2019-01-31 18:37 ` [RFC PATCH 1/4] uprobes: use set_pte_at() not set_pte_at_notify() jglisse
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: jglisse @ 2019-01-31 18:37 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Jérôme Glisse, Andrea Arcangeli,
	Peter Xu, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andrew Morton,
	Matthew Wilcox, Paolo Bonzini, Radim Krčmář,
	Michal Hocko, kvm

From: Jérôme Glisse <jglisse@redhat.com>

This patchset is on top of my patchset that adds context information
to the mmu notifier [1]; you can find a branch with everything in [2].
I have not tested it, but I wanted to get the discussion started. I
believe it is correct, but I am not sure what kind of kvm test I can
run to exercise this.

The idea is that since kvm invalidates the secondary MMUs from within
the invalidate_range callback, the change_pte() optimization is lost.
With this patchset, every time the core mm uses set_pte_at_notify(),
and thus change_pte() gets called, we can ignore the invalidate_range
callback altogether and rely only on the change_pte callback.

Note that this is only valid when going either from a read and write
pte to a read only pte with the same pfn, or from a read only pte to a
read and write pte with a different pfn. The other side of the story
is that the primary mmu pte is cleared with ptep_clear_flush_notify
before the call to change_pte.
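
For readers less familiar with the mmu notifier API, here is a minimal
sketch of the ordering described above (simplified from the call sites
in mm/ksm.c and mm/memory.c; locking and error handling are omitted,
and ptep/newpte are placeholder names):

	flush_cache_page(vma, addr, pte_pfn(*ptep));
	/* clear the primary MMU pte and flush its TLB entry */
	ptep_clear_flush_notify(vma, addr, ptep);
	/*
	 * set_pte_at_notify() expands to mmu_notifier_change_pte()
	 * followed by set_pte_at(), so the secondary MMUs learn about
	 * the new pte before the primary MMU can use it.
	 */
	set_pte_at_notify(mm, addr, ptep, newpte);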

Also, with the mmu notifier context information [1], you can further
optimize other cases like mprotect or write protection when forking.
You can use the new context information to infer that the invalidation
is a read only update of the primary mmu and update the secondary mmu
accordingly, instead of clearing it and forcing a fault even for read
access. I do not know whether that is an optimization that would bear
any fruit for kvm; it does help for device drivers. You can also
optimize the soft dirty update.
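
As a rough sketch of what such a driver side optimization could look
like (illustrative only; struct my_mirror and the my_mirror_* helpers
are hypothetical and not part of this series), an
invalidate_range_start() callback could check the event and downgrade
its mappings instead of tearing them down:

	struct my_mirror {
		struct mmu_notifier mn;
		/* device mirror state ... */
	};

	static int my_invalidate_range_start(struct mmu_notifier *mn,
			const struct mmu_notifier_range *range)
	{
		struct my_mirror *m = container_of(mn, struct my_mirror, mn);

		switch (range->event) {
		case MMU_NOTIFY_PROTECTION_VMA:
		case MMU_NOTIFY_PROTECTION_PAGE:
			/* primary MMU only loses write permission: keep
			 * the mirror mapping but make it read only */
			my_mirror_write_protect(m, range->start, range->end);
			break;
		default:
			/* anything else: play safe and invalidate */
			my_mirror_invalidate(m, range->start, range->end);
			break;
		}
		return 0;
	}

(With patches 2 and 3 of this series, the flag bits would have to be
masked off range->event before the switch.)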

Cheers,
Jérôme


[1] https://lore.kernel.org/linux-fsdevel/20190123222315.1122-1-jglisse@redhat.com/T/#m69e8f589240e18acbf196a1c8aa1d6fc97bd3565
[2] https://cgit.freedesktop.org/~glisse/linux/log/?h=kvm-restore-change_pte

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: kvm@vger.kernel.org

Jérôme Glisse (4):
  uprobes: use set_pte_at() not set_pte_at_notify()
  mm/mmu_notifier: use unsigned for event field in range struct
  mm/mmu_notifier: set MMU_NOTIFIER_USE_CHANGE_PTE flag where
    appropriate
  kvm/mmu_notifier: re-enable the change_pte() optimization.

 include/linux/mmu_notifier.h | 21 +++++++++++++++++++--
 kernel/events/uprobes.c      |  3 +--
 mm/ksm.c                     |  6 ++++--
 mm/memory.c                  |  3 ++-
 virt/kvm/kvm_main.c          | 16 ++++++++++++++++
 5 files changed, 42 insertions(+), 7 deletions(-)

-- 
2.17.1



* [RFC PATCH 1/4] uprobes: use set_pte_at() not set_pte_at_notify()
  2019-01-31 18:37 [RFC PATCH 0/4] Restore change_pte optimization to its former glory jglisse
@ 2019-01-31 18:37 ` jglisse
  2019-02-02  0:50   ` Andrea Arcangeli
  2019-01-31 18:37 ` [RFC PATCH 2/4] mm/mmu_notifier: use unsigned for event field in range struct jglisse
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: jglisse @ 2019-01-31 18:37 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Jérôme Glisse, Andrea Arcangeli,
	Peter Xu, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andrew Morton,
	Matthew Wilcox, Paolo Bonzini, Radim Krčmář,
	Michal Hocko, kvm

From: Jérôme Glisse <jglisse@redhat.com>

Using set_pte_at_notify() triggers useless calls to change_pte(), so
just use set_pte_at() instead. The reason is that set_pte_at_notify()
should only be used when going either from a read and write pte to a
read only pte with the same pfn, or from a read only pte to a read and
write pte with a different pfn.

set_pte_at_notify() was used here because the __replace_page() code
came from mm/ksm.c, where the above rules hold.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: kvm@vger.kernel.org
---
 kernel/events/uprobes.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 87e76a1dc758..a4807b1edd7f 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -207,8 +207,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	flush_cache_page(vma, addr, pte_pfn(*pvmw.pte));
 	ptep_clear_flush_notify(vma, addr, pvmw.pte);
-	set_pte_at_notify(mm, addr, pvmw.pte,
-			mk_pte(new_page, vma->vm_page_prot));
+	set_pte_at(mm, addr, pvmw.pte, mk_pte(new_page, vma->vm_page_prot));
 
 	page_remove_rmap(old_page, false);
 	if (!page_mapped(old_page))
-- 
2.17.1



* [RFC PATCH 2/4] mm/mmu_notifier: use unsigned for event field in range struct
  2019-01-31 18:37 [RFC PATCH 0/4] Restore change_pte optimization to its former glory jglisse
  2019-01-31 18:37 ` [RFC PATCH 1/4] uprobes: use set_pte_at() not set_pte_at_notify() jglisse
@ 2019-01-31 18:37 ` jglisse
  2019-02-02  1:13   ` Andrea Arcangeli
  2019-01-31 18:37 ` [RFC PATCH 3/4] mm/mmu_notifier: set MMU_NOTIFIER_USE_CHANGE_PTE flag where appropriate jglisse
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: jglisse @ 2019-01-31 18:37 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Jérôme Glisse, Andrea Arcangeli,
	Peter Xu, Andrew Morton, Paolo Bonzini,
	Radim Krčmář,
	kvm

From: Jérôme Glisse <jglisse@redhat.com>

Use unsigned for the event field in the range struct so that we can
also set flags alongside the event. This patch changes the field and
introduces the MMU_NOTIFIER_EVENT_BITS helper.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: kvm@vger.kernel.org
---
 include/linux/mmu_notifier.h | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index be873c431886..d7a35975c2bd 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -6,6 +6,7 @@
 #include <linux/spinlock.h>
 #include <linux/mm_types.h>
 #include <linux/srcu.h>
+#include <linux/log2.h>
 
 struct mmu_notifier;
 struct mmu_notifier_ops;
@@ -38,8 +39,11 @@ enum mmu_notifier_event {
 	MMU_NOTIFY_PROTECTION_VMA,
 	MMU_NOTIFY_PROTECTION_PAGE,
 	MMU_NOTIFY_SOFT_DIRTY,
+	MMU_NOTIFY_EVENT_MAX
 };
 
+#define MMU_NOTIFIER_EVENT_BITS order_base_2(MMU_NOTIFY_EVENT_MAX)
+
 #ifdef CONFIG_MMU_NOTIFIER
 
 /*
@@ -60,7 +64,7 @@ struct mmu_notifier_range {
 	struct mm_struct *mm;
 	unsigned long start;
 	unsigned long end;
-	enum mmu_notifier_event event;
+	unsigned event;
 	bool blockable;
 };
 
@@ -352,7 +356,7 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
 
 
 static inline void mmu_notifier_range_init(struct mmu_notifier_range *range,
-					   enum mmu_notifier_event event,
+					   unsigned event,
 					   struct vm_area_struct *vma,
 					   struct mm_struct *mm,
 					   unsigned long start,
-- 
2.17.1



* [RFC PATCH 3/4] mm/mmu_notifier: set MMU_NOTIFIER_USE_CHANGE_PTE flag where appropriate
  2019-01-31 18:37 [RFC PATCH 0/4] Restore change_pte optimization to its former glory jglisse
  2019-01-31 18:37 ` [RFC PATCH 1/4] uprobes: use set_pte_at() not set_pte_at_notify() jglisse
  2019-01-31 18:37 ` [RFC PATCH 2/4] mm/mmu_notifier: use unsigned for event field in range struct jglisse
@ 2019-01-31 18:37 ` jglisse
  2019-01-31 18:37 ` [RFC PATCH 4/4] kvm/mmu_notifier: re-enable the change_pte() optimization jglisse
  2019-02-01 23:57 ` [RFC PATCH 0/4] Restore change_pte optimization to its former glory Andrea Arcangeli
  4 siblings, 0 replies; 18+ messages in thread
From: jglisse @ 2019-01-31 18:37 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Jérôme Glisse, Andrea Arcangeli,
	Peter Xu, Andrew Morton, Paolo Bonzini,
	Radim Krčmář,
	kvm

From: Jérôme Glisse <jglisse@redhat.com>

When notifying a change for a range, use the MMU_NOTIFIER_USE_CHANGE_PTE
flag for page table updates that use set_pte_at_notify() and where we
are going either from read and write to read only with the same pfn,
or from read only to read and write with a new pfn.

Note that set_pte_at_notify() itself should only be used in rare cases,
i.e. we do not want to use it when we are updating a significant range
of virtual addresses and thus a significant number of ptes. Instead,
for those cases the event provided to the mmu notifier
invalidate_range_start() callback should be used for optimization.
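
To make the intended usage concrete, here is a small sketch (not part
of the patch; the mask expression is only an illustration of how the
packed value can be split again):

	/* producer: a single pte update that goes through set_pte_at_notify() */
	mmu_notifier_range_init(&range,
				MMU_NOTIFY_CLEAR | MMU_NOTIFIER_USE_CHANGE_PTE,
				vma, mm, addr, addr + PAGE_SIZE);

	/* consumer: split the packed value back into event and flag */
	unsigned event = range->event & ((1U << MMU_NOTIFIER_EVENT_BITS) - 1);
	bool use_change_pte = range->event & MMU_NOTIFIER_USE_CHANGE_PTE;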

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: kvm@vger.kernel.org
---
 include/linux/mmu_notifier.h | 13 +++++++++++++
 mm/ksm.c                     |  6 ++++--
 mm/memory.c                  |  3 ++-
 3 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index d7a35975c2bd..0885bf33dc9c 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -43,6 +43,19 @@ enum mmu_notifier_event {
 };
 
 #define MMU_NOTIFIER_EVENT_BITS order_base_2(MMU_NOTIFY_EVENT_MAX)
+/*
+ * Set MMU_NOTIFIER_USE_CHANGE_PTE only when the page table is updated with
+ * set_pte_at_notify() and when the pte is updated either from read and write
+ * to read only with the same pfn, or from read only to read and write with a
+ * different pfn. It is illegal to set it in any other circumstances.
+ *
+ * Note that set_pte_at_notify() should not be used outside of the above cases.
+ * When updating a range in batch (like write protecting a range) it is better
+ * to rely on invalidate_range_start() and struct mmu_notifier_range to infer
+ * the kind of update that is happening (as an example you can look at the
+ * mmu_notifier_range_update_to_read_only() function).
+ */
+#define MMU_NOTIFIER_USE_CHANGE_PTE (1 << MMU_NOTIFIER_EVENT_BITS)
 
 #ifdef CONFIG_MMU_NOTIFIER
 
diff --git a/mm/ksm.c b/mm/ksm.c
index 97757c5fa15f..b7fb7b560cc0 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1051,7 +1051,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 
 	BUG_ON(PageTransCompound(page));
 
-	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, vma, mm,
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR |
+				MMU_NOTIFIER_USE_CHANGE_PTE, vma, mm,
 				pvmw.address,
 				pvmw.address + PAGE_SIZE);
 	mmu_notifier_invalidate_range_start(&range);
@@ -1140,7 +1141,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	if (!pmd)
 		goto out;
 
-	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, vma, mm, addr,
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR |
+				MMU_NOTIFIER_USE_CHANGE_PTE, vma, mm, addr,
 				addr + PAGE_SIZE);
 	mmu_notifier_invalidate_range_start(&range);
 
diff --git a/mm/memory.c b/mm/memory.c
index a8c6922526f6..daf4b0f92af8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2275,7 +2275,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 
 	__SetPageUptodate(new_page);
 
-	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, vma, mm,
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR |
+				MMU_NOTIFIER_USE_CHANGE_PTE, vma, mm,
 				vmf->address & PAGE_MASK,
 				(vmf->address & PAGE_MASK) + PAGE_SIZE);
 	mmu_notifier_invalidate_range_start(&range);
-- 
2.17.1



* [RFC PATCH 4/4] kvm/mmu_notifier: re-enable the change_pte() optimization.
  2019-01-31 18:37 [RFC PATCH 0/4] Restore change_pte optimization to its former glory jglisse
                   ` (2 preceding siblings ...)
  2019-01-31 18:37 ` [RFC PATCH 3/4] mm/mmu_notifier: set MMU_NOTIFIER_USE_CHANGE_PTE flag where appropriate jglisse
@ 2019-01-31 18:37 ` jglisse
  2019-02-01 23:57 ` [RFC PATCH 0/4] Restore change_pte optimization to its former glory Andrea Arcangeli
  4 siblings, 0 replies; 18+ messages in thread
From: jglisse @ 2019-01-31 18:37 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Jérôme Glisse, Andrea Arcangeli,
	Peter Xu, Andrew Morton, Paolo Bonzini,
	Radim Krčmář,
	kvm

From: Jérôme Glisse <jglisse@redhat.com>

Since the changes to the mmu notifier, the change_pte() optimization
was lost for kvm. This re-enables it whenever a pte is going from read
and write to read only with the same pfn, or from read only to read
and write with a different pfn.

It is safe to update the secondary MMUs because the primary MMU pte
invalidation must have already happened with a ptep_clear_flush()
before set_pte_at_notify() is invoked (and thus before the change_pte()
callback).
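
For context, the existing change_pte handler in kvm is roughly the
following (paraphrased from virt/kvm/kvm_main.c of this era, details
may differ slightly); this is what does the work once the invalidate
callbacks patched below return early:

	static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
						struct mm_struct *mm,
						unsigned long address, pte_t pte)
	{
		struct kvm *kvm = mmu_notifier_to_kvm(mn);
		int idx;

		idx = srcu_read_lock(&kvm->srcu);
		spin_lock(&kvm->mmu_lock);
		kvm->mmu_notifier_seq++;
		kvm_set_spte_hva(kvm, address, pte);	/* update the spte in place */
		spin_unlock(&kvm->mmu_lock);
		srcu_read_unlock(&kvm->srcu, idx);
	}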

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: kvm@vger.kernel.org
---
 virt/kvm/kvm_main.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5ecea812cb6a..fec155c2d7b8 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -369,6 +369,14 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	int need_tlb_flush = 0, idx;
 	int ret;
 
+	/*
+	 * Nothing to do when MMU_NOTIFIER_USE_CHANGE_PTE is set: it means
+	 * that change_pte() will be called and this is a situation in which
+	 * we are allowed to rely only on change_pte().
+	 */
+	if (range->event & MMU_NOTIFIER_USE_CHANGE_PTE)
+		return 0;
+
 	idx = srcu_read_lock(&kvm->srcu);
 	spin_lock(&kvm->mmu_lock);
 	/*
@@ -398,6 +406,14 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 
+	/*
+	 * Nothing to do when MMU_NOTIFIER_USE_CHANGE_PTE is set: it means
+	 * that change_pte() will be called and this is a situation in which
+	 * we are allowed to rely only on change_pte().
+	 */
+	if (range->event & MMU_NOTIFIER_USE_CHANGE_PTE)
+		return;
+
 	spin_lock(&kvm->mmu_lock);
 	/*
 	 * This sequence increase will notify the kvm page fault that
-- 
2.17.1



* Re: [RFC PATCH 0/4] Restore change_pte optimization to its former glory
  2019-01-31 18:37 [RFC PATCH 0/4] Restore change_pte optimization to its former glory jglisse
                   ` (3 preceding siblings ...)
  2019-01-31 18:37 ` [RFC PATCH 4/4] kvm/mmu_notifier: re-enable the change_pte() optimization jglisse
@ 2019-02-01 23:57 ` Andrea Arcangeli
  2019-02-02  0:14   ` Andrea Arcangeli
  2019-02-11 19:09   ` Jerome Glisse
  4 siblings, 2 replies; 18+ messages in thread
From: Andrea Arcangeli @ 2019-02-01 23:57 UTC (permalink / raw)
  To: jglisse
  Cc: linux-mm, linux-kernel, Peter Xu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Andrew Morton, Matthew Wilcox, Paolo Bonzini,
	Radim Krčmář,
	Michal Hocko, kvm

Hello everyone,

On Thu, Jan 31, 2019 at 01:37:02PM -0500, Jerome Glisse wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> This patchset is on top of my patchset to add context information to
> mmu notifier [1] you can find a branch with everything [2]. I have not
> tested it but i wanted to get the discussion started. I believe it is
> correct but i am not sure what kind of kvm test i can run to exercise
> this.
> 
> The idea is that since kvm will invalidate the secondary MMUs within
> invalidate_range callback then the change_pte() optimization is lost.
> With this patchset everytime core mm is using set_pte_at_notify() and
> thus change_pte() get calls then we can ignore the invalidate_range
> callback altogether and only rely on change_pte callback.
> 
> Note that this is only valid when either going from a read and write
> pte to a read only pte with same pfn, or from a read only pte to a
> read and write pte with different pfn. The other side of the story
> is that the primary mmu pte is clear with ptep_clear_flush_notify
> before the call to change_pte.

If it's cleared with ptep_clear_flush_notify, change_pte still won't
work. The above text needs updating to say "ptep_clear_flush":
set_pte_at_notify is all about having only ptep_clear_flush before it,
otherwise it's the same as having a range invalidate preceding it.

With regard to the code, wp_page_copy() needs
s/ptep_clear_flush_notify/ptep_clear_flush/ before set_pte_at_notify.

change_pte relies on the ptep_clear_flush preceding the
set_pte_at_notify; that makes sure that if the secondary MMU mapping
randomly disappears between ptep_clear_flush and set_pte_at_notify,
gup_fast will wait and block on the PT lock until set_pte_at_notify
has completed, before trying to re-establish a secondary MMU mapping.

So then we only have to worry about what happens because we left the
secondary MMU mapping potentially intact even though we flushed the
primary MMU mapping with ptep_clear_flush (as opposed to
ptep_clear_flush_notify, which would tear down the secondary MMU
mapping too).

In your wording above, at least the "with a different pfn" is
superfluous. I think it's ok if the protection changes from read-only
to read-write and the pfn remains the same, like when we take over a
page because it's not shared anymore (the fork child quit).

It's also ok to change the pfn if the mapping is read-only and remains
read-only; this is what KSM does in replace_page.

The read-write to read-only case must not change pfn to avoid losing
coherency from the secondary MMU point of view. This isn't so much
about change_pte itself, but the fact that the page-copy generally
happens well before the pte mangling starts. This case never presents
itself in the code because KSM is first write protecting the page and
only later merging it, regardless of change_pte or not.

The important thing is that the secondary MMU must be updated first
(unlike the invalidates) to be sure the secondary MMU already points
to the new page when the pfn changes and the protection changes from
read-only to read-write (COW fault). The primary MMU cannot read/write
to the page anyway while we update the secondary MMU because we did
ptep_clear_flush() before calling set_pte_at_notify(). So this
ordering of "ptep_clear_flush; change_pte; set_pte_at" ensures
whenever the CPU can access the memory, the access is synchronous
with the secondary MMUs because they've all been updated already.

If (in set_pte_at_notify) we were to call change_pte() after
set_pte_at(), what would happen is that the CPU could write to the page
through a TLB fill without a page fault while the secondary MMUs still
read the old memory in the old readonly page.

Thanks,
Andrea


* Re: [RFC PATCH 0/4] Restore change_pte optimization to its former glory
  2019-02-01 23:57 ` [RFC PATCH 0/4] Restore change_pte optimization to its former glory Andrea Arcangeli
@ 2019-02-02  0:14   ` Andrea Arcangeli
  2019-02-11 19:09   ` Jerome Glisse
  1 sibling, 0 replies; 18+ messages in thread
From: Andrea Arcangeli @ 2019-02-02  0:14 UTC (permalink / raw)
  To: jglisse
  Cc: linux-mm, linux-kernel, Peter Xu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Andrew Morton, Matthew Wilcox, Paolo Bonzini,
	Radim Krčmář,
	Michal Hocko, kvm

On Fri, Feb 01, 2019 at 06:57:38PM -0500, Andrea Arcangeli wrote:
> If it's cleared with ptep_clear_flush_notify, change_pte still won't
> work. The above text needs updating with
> "ptep_clear_flush". set_pte_at_notify is all about having
> ptep_clear_flush only before it or it's the same as having a range
> invalidate preceding it.
> 
> With regard to the code, wp_page_copy() needs
> s/ptep_clear_flush_notify/ptep_clear_flush/ before set_pte_at_notify.

Oops, the above two statements were incorrect, because
ptep_clear_flush_notify doesn't interfere with change_pte and will
only invalidate secondary MMU mappings on those secondary MMUs that
share the same pagetables with the primary MMU and that in turn won't
ever implement a change_pte method.



* Re: [RFC PATCH 1/4] uprobes: use set_pte_at() not set_pte_at_notify()
  2019-01-31 18:37 ` [RFC PATCH 1/4] uprobes: use set_pte_at() not set_pte_at_notify() jglisse
@ 2019-02-02  0:50   ` Andrea Arcangeli
  2019-02-11 19:27     ` Jerome Glisse
  0 siblings, 1 reply; 18+ messages in thread
From: Andrea Arcangeli @ 2019-02-02  0:50 UTC (permalink / raw)
  To: jglisse
  Cc: linux-mm, linux-kernel, Peter Xu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Andrew Morton, Matthew Wilcox, Paolo Bonzini,
	Radim Krčmář,
	Michal Hocko, kvm

On Thu, Jan 31, 2019 at 01:37:03PM -0500, Jerome Glisse wrote:
> @@ -207,8 +207,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
>  
>  	flush_cache_page(vma, addr, pte_pfn(*pvmw.pte));
>  	ptep_clear_flush_notify(vma, addr, pvmw.pte);
> -	set_pte_at_notify(mm, addr, pvmw.pte,
> -			mk_pte(new_page, vma->vm_page_prot));
> +	set_pte_at(mm, addr, pvmw.pte, mk_pte(new_page, vma->vm_page_prot));
>  
>  	page_remove_rmap(old_page, false);
>  	if (!page_mapped(old_page))

This seems racy by design in the way it copies the page, if the vma
mapping isn't readonly to begin with (in which case it'd be ok to
change the pfn with change_pte too: it'd be a read-only to read-only
change, which is ok).

If the code copies a writable page there's not much issue if coherency
is lost by other means too.

That said, this isn't a worthwhile optimization for uprobes, so because
of the lack of explicit read-only enforcement I agree it's simpler to
skip change_pte above.

It's orthogonal, but in this function the
mmu_notifier_invalidate_range_end(&range); can be optimized to
mmu_notifier_invalidate_range_only_end(&range); otherwise there's no
point in retaining the _notify in ptep_clear_flush_notify.
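
i.e. something like this (untested) at the end of __replace_page():

-	mmu_notifier_invalidate_range_end(&range);
+	mmu_notifier_invalidate_range_only_end(&range);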


* Re: [RFC PATCH 2/4] mm/mmu_notifier: use unsigned for event field in range struct
  2019-01-31 18:37 ` [RFC PATCH 2/4] mm/mmu_notifier: use unsigned for event field in range struct jglisse
@ 2019-02-02  1:13   ` Andrea Arcangeli
  0 siblings, 0 replies; 18+ messages in thread
From: Andrea Arcangeli @ 2019-02-02  1:13 UTC (permalink / raw)
  To: jglisse
  Cc: linux-mm, linux-kernel, Peter Xu, Andrew Morton, Paolo Bonzini,
	Radim Krčmář,
	kvm

On Thu, Jan 31, 2019 at 01:37:04PM -0500, Jerome Glisse wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> Use unsigned for event field in range struct so that we can also set
> flags with the event. This patch change the field and introduce the
> helper.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Radim Krčmář <rkrcmar@redhat.com>
> Cc: kvm@vger.kernel.org
> ---
>  include/linux/mmu_notifier.h | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index be873c431886..d7a35975c2bd 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -6,6 +6,7 @@
>  #include <linux/spinlock.h>
>  #include <linux/mm_types.h>
>  #include <linux/srcu.h>
> +#include <linux/log2.h>
>  
>  struct mmu_notifier;
>  struct mmu_notifier_ops;
> @@ -38,8 +39,11 @@ enum mmu_notifier_event {
>  	MMU_NOTIFY_PROTECTION_VMA,
>  	MMU_NOTIFY_PROTECTION_PAGE,
>  	MMU_NOTIFY_SOFT_DIRTY,
> +	MMU_NOTIFY_EVENT_MAX
>  };
>  
> +#define MMU_NOTIFIER_EVENT_BITS order_base_2(MMU_NOTIFY_EVENT_MAX)
> +
>  #ifdef CONFIG_MMU_NOTIFIER
>  
>  /*
> @@ -60,7 +64,7 @@ struct mmu_notifier_range {
>  	struct mm_struct *mm;
>  	unsigned long start;
>  	unsigned long end;
> -	enum mmu_notifier_event event;
> +	unsigned event;
>  	bool blockable;
>  };

This is only allocated on the stack, so saving RAM by mixing bitfields
with the enum in the same 4 bytes isn't of maximum priority.

A possibly cleaner way to save those 4 bytes without mixing enum with
bitfields by hand is to add an "unsigned short flags", which will make
"event/flags/blockable" fit in the same 8 bytes as before the patch
(bool only needs 1 byte, and the first flag bit can then start from 0).

Yet another way is to drop blockable and convert it to a flag in
"unsigned int flags".


* Re: [RFC PATCH 0/4] Restore change_pte optimization to its former glory
  2019-02-01 23:57 ` [RFC PATCH 0/4] Restore change_pte optimization to its former glory Andrea Arcangeli
  2019-02-02  0:14   ` Andrea Arcangeli
@ 2019-02-11 19:09   ` Jerome Glisse
  2019-02-11 20:02     ` Andrea Arcangeli
  1 sibling, 1 reply; 18+ messages in thread
From: Jerome Glisse @ 2019-02-11 19:09 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, linux-kernel, Peter Xu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Andrew Morton, Matthew Wilcox, Paolo Bonzini,
	Radim Krčmář,
	Michal Hocko, kvm

On Fri, Feb 01, 2019 at 06:57:38PM -0500, Andrea Arcangeli wrote:
> Hello everyone,
> 
> On Thu, Jan 31, 2019 at 01:37:02PM -0500, Jerome Glisse wrote:
> > From: Jérôme Glisse <jglisse@redhat.com>
> > 
> > This patchset is on top of my patchset to add context information to
> > mmu notifier [1] you can find a branch with everything [2]. I have not
> > tested it but i wanted to get the discussion started. I believe it is
> > correct but i am not sure what kind of kvm test i can run to exercise
> > this.
> > 
> > The idea is that since kvm will invalidate the secondary MMUs within
> > invalidate_range callback then the change_pte() optimization is lost.
> > With this patchset everytime core mm is using set_pte_at_notify() and
> > thus change_pte() get calls then we can ignore the invalidate_range
> > callback altogether and only rely on change_pte callback.
> > 
> > Note that this is only valid when either going from a read and write
> > pte to a read only pte with same pfn, or from a read only pte to a
> > read and write pte with different pfn. The other side of the story
> > is that the primary mmu pte is clear with ptep_clear_flush_notify
> > before the call to change_pte.
> 
> If it's cleared with ptep_clear_flush_notify, change_pte still won't
> work. The above text needs updating with
> "ptep_clear_flush". set_pte_at_notify is all about having
> ptep_clear_flush only before it or it's the same as having a range
> invalidate preceding it.
> 
> With regard to the code, wp_page_copy() needs
> s/ptep_clear_flush_notify/ptep_clear_flush/ before set_pte_at_notify.
> 
> change_pte relies on the ptep_clear_flush preceding the
> set_pte_at_notify that will make sure if the secondary MMU mapping
> randomly disappears between ptep_clear_flush and set_pte_at_notify,
> gup_fast will wait and block on the PT lock until after
> set_pte_at_notify is completed before trying to re-establish a
> secondary MMU mapping.
> 
> So then we've only to worry about what happens because we left the
> secondary MMU mapping potentially intact despite we flushed the
> primary MMU mapping with ptep_clear_flush (as opposed to
> ptep_clear_flush_notify which would teardown the secondary MMU mapping
> too).

So all of the above is moot since, as you pointed out in the other
email, ptep_clear_flush_notify does not invalidate the kvm secondary
mmu.


> 
> In you wording above at least the "with a different pfn" is
> superflous. I think it's ok if the protection changes from read-only
> to read-write and the pfn remains the same. Like when we takeover a
> page because it's not shared anymore (fork child quit).
> 
> It's also ok to change pfn if the mapping is read-only and remains
> read-only, this is what KSM does in replace_page.

Yes, I thought this was obvious; I will reword and probably just list
every case that is fine.

> 
> The read-write to read-only case must not change pfn to avoid losing
> coherency from the secondary MMU point of view. This isn't so much
> about change_pte itself, but the fact that the page-copy generally
> happens well before the pte mangling starts. This case never presents
> itself in the code because KSM is first write protecting the page and
> only later merging it, regardless of change_pte or not.
> 
> The important thing is that the secondary MMU must be updated first
> (unlike the invalidates) to be sure the secondary MMU already points
> to the new page when the pfn changes and the protection changes from
> read-only to read-write (COW fault). The primary MMU cannot read/write
> to the page anyway while we update the secondary MMU because we did
> ptep_clear_flush() before calling set_pte_at_notify(). So this
> ordering of "ptep_clear_flush; change_pte; set_pte_at" ensures
> whenever the CPU can access the memory, the access is synchronous
> with the secondary MMUs because they've all been updated already.
> 
> If (in set_pte_at_notify) we were to call change_pte() after
> set_pte_at() what would happen is that the CPU could write to the page
> through a TLB fill without page fault while the secondary MMUs still
> read the old memory in the old readonly page.

Yeah. By the way, do you have any good workload for me to test this?
I was thinking of running a few identical VMs and having KSM work on
them. Is there some way to trigger KVM to fork? The other case is
breaking COW after fork.

Cheers,
Jérôme


* Re: [RFC PATCH 1/4] uprobes: use set_pte_at() not set_pte_at_notify()
  2019-02-02  0:50   ` Andrea Arcangeli
@ 2019-02-11 19:27     ` Jerome Glisse
  0 siblings, 0 replies; 18+ messages in thread
From: Jerome Glisse @ 2019-02-11 19:27 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, linux-kernel, Peter Xu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Andrew Morton, Paolo Bonzini,
	Radim Krčmář,
	Michal Hocko, kvm

Background: we are discussing __replace_page() in
    kernel/events/uprobes.c

and whether it can be called on a page that can be written to through
its virtual address mapping.

On Fri, Feb 01, 2019 at 07:50:22PM -0500, Andrea Arcangeli wrote:
> On Thu, Jan 31, 2019 at 01:37:03PM -0500, Jerome Glisse wrote:
> > @@ -207,8 +207,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
> >  
> >  	flush_cache_page(vma, addr, pte_pfn(*pvmw.pte));
> >  	ptep_clear_flush_notify(vma, addr, pvmw.pte);
> > -	set_pte_at_notify(mm, addr, pvmw.pte,
> > -			mk_pte(new_page, vma->vm_page_prot));
> > +	set_pte_at(mm, addr, pvmw.pte, mk_pte(new_page, vma->vm_page_prot));
> >  
> >  	page_remove_rmap(old_page, false);
> >  	if (!page_mapped(old_page))
> 
> This seems racy by design in the way it copies the page, if the vma
> mapping isn't readonly to begin with (in which case it'd be ok to
> change the pfn with change_pte too, it'd be a from read-only to
> read-only change which is ok).
> 
> If the code copies a writable page there's no much issue if coherency
> is lost by other means too.

I am not sure the race exists, but I am not familiar with the uprobe
code; maybe the page is already write protected and thus the copy is
fine. In fact that is likely the case, but there is no check for it.
Maybe there should be a check?

Maybe someone familiar with this code can chime in.

> 
> Said that this isn't a worthwhile optimization for uprobes so because
> of the lack of explicit read-only enforcement, I agree it's simpler to
> skip change_pte above.
> 
> It's orthogonal, but in this function the
> mmu_notifier_invalidate_range_end(&range); can be optimized to
> mmu_notifier_invalidate_range_only_end(&range); otherwise there's no
> point to retain the _notify in ptep_clear_flush_notify.

We need to keep the _notify for the IOMMU, otherwise it would break
the IOMMU. The IOMMU can walk the page table at any time, and thus we
need to first clear the table and then notify the IOMMU to flush the
TLB on all the devices that might have a TLB entry. Only then can we
set the new pte.

But yes, the mmu_notifier_invalidate_range_end can be optimized to the
only_end variant. I will do a separate patch for this, as it is
orthogonal, as you pointed out :)

Cheers,
Jérôme


* Re: [RFC PATCH 0/4] Restore change_pte optimization to its former glory
  2019-02-11 19:09   ` Jerome Glisse
@ 2019-02-11 20:02     ` Andrea Arcangeli
  2019-02-18 16:04       ` Jerome Glisse
  0 siblings, 1 reply; 18+ messages in thread
From: Andrea Arcangeli @ 2019-02-11 20:02 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, Peter Xu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Andrew Morton, Matthew Wilcox, Paolo Bonzini,
	Radim Krčmář,
	Michal Hocko, kvm

On Mon, Feb 11, 2019 at 02:09:31PM -0500, Jerome Glisse wrote:
> Yeah, between do you have any good workload for me to test this ? I
> was thinking of running few same VM and having KSM work on them. Is
> there some way to trigger KVM to fork ? As the other case is breaking
> COW after fork.

KVM can fork on guest pci-hotplug events or network init to run host
scripts and re-init the signals before doing the exec, but it won't
move the needle because all guest memory registered in the MMU
notifier is set as MADV_DONTFORK... so fork() is a noop unless qemu is
also modified not to call MADV_DONTFORK.

Calling if (!fork()) exit(0) from a timer at regular intervals during
qemu runtime, after turning off MADV_DONTFORK in qemu, would make it
possible to exercise fork against the KVM MMU Notifier methods.
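
A minimal userspace sketch of that idea (just a test hack to be pasted
into qemu or a wrapper, not something to merge; error handling omitted):

	#include <signal.h>
	#include <sys/time.h>
	#include <unistd.h>

	static void fork_tick(int sig)
	{
		if (fork() == 0)
			_exit(0);	/* child exits right away, parent keeps running */
	}

	static void start_fork_timer(void)
	{
		struct itimerval it = {
			.it_interval = { .tv_sec = 1 },
			.it_value    = { .tv_sec = 1 },
		};

		signal(SIGCHLD, SIG_IGN);	/* auto-reap the short lived children */
		signal(SIGALRM, fork_tick);
		setitimer(ITIMER_REAL, &it, NULL);
	}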

The optimized change_pte code in the copy-on-write path is the same
post-fork or post-KSM merge, and fork() itself doesn't use change_pte
while KSM does, so with regard to change_pte, testing with only KSM and
no fork() should already provide good coverage. It'll cover the
read-write -> read-only transition with the same PFN
(write_protect_page), the read-only to read-only transition changing
PFN (replace_page), as well as the read-only -> read-write transition
changing PFN (wp_page_copy), all three optimized with change_pte. Fork
would not leverage change_pte for the first two cases.


* Re: [RFC PATCH 0/4] Restore change_pte optimization to its former glory
  2019-02-11 20:02     ` Andrea Arcangeli
@ 2019-02-18 16:04       ` Jerome Glisse
  2019-02-18 17:45         ` Andrea Arcangeli
  0 siblings, 1 reply; 18+ messages in thread
From: Jerome Glisse @ 2019-02-18 16:04 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, linux-kernel, Peter Xu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Andrew Morton, Matthew Wilcox, Paolo Bonzini,
	Radim Krčmář,
	Michal Hocko, kvm

On Mon, Feb 11, 2019 at 03:02:00PM -0500, Andrea Arcangeli wrote:
> On Mon, Feb 11, 2019 at 02:09:31PM -0500, Jerome Glisse wrote:
> > Yeah, between do you have any good workload for me to test this ? I
> > was thinking of running few same VM and having KSM work on them. Is
> > there some way to trigger KVM to fork ? As the other case is breaking
> > COW after fork.
> 
> KVM can fork on guest pci-hotplug events or network init to run host
> scripts and re-init the signals before doing the exec, but it won't
> move the needle because all guest memory registered in the MMU
> notifier is set as MADV_DONTFORK... so fork() is a noop unless qemu is
> also modified not to call MADV_DONTFORK.
> 
> Calling if (!fork()) exit(0) from a timer at regular intervals during
> qemu runtime after turning off MADV_DONTFORK in qemu would allow to
> exercise fork against the KVM MMU Notifier methods.
> 
> The optimized change_pte code in copy-on-write code is the same
> post-fork or post-KSM merge and fork() itself doesn't use change_pte
> while KSM does, so with regard to change_pte it should already provide
> a good test coverage to test with only KSM without fork(). It'll cover
> the read-write -> readonly transition with same PFN
> (write_protect_page), the read-only to read-only changing PFN
> (replace_page) as well as the readonly -> read-write transition
> changing PFN (wp_page_copy) all three optimized with change_pte. Fork
> would not leverage change_pte for the first two cases.

So I ran two identical VMs side by side (copies of the same COW image)
and built the same kernel tree inside each (that is the only important
workload that exists ;)), but change_pte did not have any impact:

before  mean  {real: 1358.250977, user: 16650.880859, sys: 839.199524, npages: 76855.390625}
before  stdev {real:    6.744010, user:   108.863762, sys:   6.840437, npages:  1868.071899}
after   mean  {real: 1357.833740, user: 16685.849609, sys: 839.646973, npages: 76210.601562}
after   stdev {real:    5.124797, user:    78.469360, sys:   7.009164, npages:  2468.017578}
without mean  {real: 1358.501343, user: 16674.478516, sys: 837.791992, npages: 76225.203125}
without stdev {real:    5.541104, user:    97.998367, sys:   6.715869, npages:  1682.392578}

Above is the time taken by make inside each VM for an allyesconfig
build. npages is the number of shared pages reported on the host at
the end of the build.

There is no change between before and after the patchset that restores
change_pte. I also tried removing the change_pte callback altogether
("without" above) to see if that had any effect, and it did not have
any effect either.

Should we still restore change_pte()? It does not hurt, but it does not
seem to help in any way either. Maybe you have a better benchmark I
could run?

Cheers,
Jérôme


* Re: [RFC PATCH 0/4] Restore change_pte optimization to its former glory
  2019-02-18 16:04       ` Jerome Glisse
@ 2019-02-18 17:45         ` Andrea Arcangeli
  2019-02-18 18:20           ` Jerome Glisse
                             ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Andrea Arcangeli @ 2019-02-18 17:45 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, Peter Xu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Andrew Morton, Matthew Wilcox, Paolo Bonzini,
	Radim Krčmář,
	Michal Hocko, kvm

On Mon, Feb 18, 2019 at 11:04:13AM -0500, Jerome Glisse wrote:
> So i run 2 exact same VMs side by side (copy of same COW image) and
> built the same kernel tree inside each (that is the only important
> workload that exist ;)) but the change_pte did not have any impact:
> 
> before  mean  {real: 1358.250977, user: 16650.880859, sys: 839.199524, npages: 76855.390625}
> before  stdev {real:    6.744010, user:   108.863762, sys:   6.840437, npages:  1868.071899}
> after   mean  {real: 1357.833740, user: 16685.849609, sys: 839.646973, npages: 76210.601562}
> after   stdev {real:    5.124797, user:    78.469360, sys:   7.009164, npages:  2468.017578}
> without mean  {real: 1358.501343, user: 16674.478516, sys: 837.791992, npages: 76225.203125}
> without stdev {real:    5.541104, user:    97.998367, sys:   6.715869, npages:  1682.392578}
> 
> Above is time taken by make inside each VM for all yes config. npages
> is the number of page shared reported on the host at the end of the
> build.

Did you set /sys/kernel/mm/ksm/sleep_millisecs to 0?

It would also help to remove the checksum check from mm/ksm.c:

-	if (rmap_item->oldchecksum != checksum) {
-		rmap_item->oldchecksum = checksum;
-		return;
-	}

One way or another, /sys/kernel/mm/ksm/pages_shared and/or
pages_sharing need to change significantly to be sure we're exercising
the COW/merging code that uses change_pte. KSM is smart enough to merge
only pages that do not change frequently, and with the default KSM code
this probably works too well for a kernel build.

> Should we still restore change_pte() ? It does not hurt, but it does
> not seems to help in anyway. Maybe you have a better benchmark i could
> run ?

We could also try a microbenchmark based on
ltp/testcases/kernel/mem/ksm/ksm02.c, which should already trigger a
merge flood and a COW flood during its internal processing.
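
A stripped down sketch of such a microbenchmark (only an illustration
of the merge flood then COW flood pattern, not the actual ltp test):

	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define NPAGES	(256 * 1024UL)

	int main(void)
	{
		long psz = sysconf(_SC_PAGESIZE);
		size_t len = NPAGES * psz;
		char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (buf == MAP_FAILED)
			return 1;

		/* merge flood: every page gets identical content, then let
		 * ksmd merge them (needs /sys/kernel/mm/ksm/run set to 1) */
		memset(buf, 0x5a, len);
		madvise(buf, len, MADV_MERGEABLE);
		sleep(30);	/* give ksmd time to scan and merge */

		/* COW flood: dirty one byte per page, breaking every merged
		 * mapping (the read-only -> read-write change_pte case) */
		for (size_t i = 0; i < NPAGES; i++)
			buf[i * psz] = (char)i;

		return 0;
	}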

Thanks,
Andrea


* Re: [RFC PATCH 0/4] Restore change_pte optimization to its former glory
  2019-02-18 17:45         ` Andrea Arcangeli
@ 2019-02-18 18:20           ` Jerome Glisse
  2019-02-19  2:37           ` Peter Xu
  2019-02-19  3:33           ` Jerome Glisse
  2 siblings, 0 replies; 18+ messages in thread
From: Jerome Glisse @ 2019-02-18 18:20 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, linux-kernel, Peter Xu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Andrew Morton, Matthew Wilcox, Paolo Bonzini,
	Radim Krčmář,
	Michal Hocko, kvm

On Mon, Feb 18, 2019 at 12:45:05PM -0500, Andrea Arcangeli wrote:
> On Mon, Feb 18, 2019 at 11:04:13AM -0500, Jerome Glisse wrote:
> > So i run 2 exact same VMs side by side (copy of same COW image) and
> > built the same kernel tree inside each (that is the only important
> > workload that exist ;)) but the change_pte did not have any impact:
> > 
> > before  mean  {real: 1358.250977, user: 16650.880859, sys: 839.199524, npages: 76855.390625}
> > before  stdev {real:    6.744010, user:   108.863762, sys:   6.840437, npages:  1868.071899}
> > after   mean  {real: 1357.833740, user: 16685.849609, sys: 839.646973, npages: 76210.601562}
> > after   stdev {real:    5.124797, user:    78.469360, sys:   7.009164, npages:  2468.017578}
> > without mean  {real: 1358.501343, user: 16674.478516, sys: 837.791992, npages: 76225.203125}
> > without stdev {real:    5.541104, user:    97.998367, sys:   6.715869, npages:  1682.392578}
> > 
> > Above is time taken by make inside each VM for all yes config. npages
> > is the number of page shared reported on the host at the end of the
> > build.
> 
> Did you set /sys/kernel/mm/ksm/sleep_millisecs to 0?

No, but I have increased pages_to_scan to 10000, and during the kernel
build I see the number of shared pages increase steadily, so it is
definitely merging things.

> 
> It would also help to remove the checksum check from mm/ksm.c:
> 
> -	if (rmap_item->oldchecksum != checksum) {
> -		rmap_item->oldchecksum = checksum;
> -		return;
> -	}

Will try with that.

> 
> One way or another, /sys/kernel/mm/ksm/pages_shared and/or
> pages_sharing need to change significantly to be sure we're exercising
> the COW/merging code that uses change_pte. KSM is smart enough to
> merge only not frequently changing pages, and with the default KSM
> code this probably works too well for a kernel build.
> 
> > Should we still restore change_pte() ? It does not hurt, but it does
> > not seems to help in anyway. Maybe you have a better benchmark i could
> > run ?
> 
> We could also try a microbenchmark based on
> ltp/testcases/kernel/mem/ksm/ksm02.c that already should trigger a
> merge flood and a COW flood during its internal processing.

Will try that.

Cheers,
Jérôme


* Re: [RFC PATCH 0/4] Restore change_pte optimization to its former glory
  2019-02-18 17:45         ` Andrea Arcangeli
  2019-02-18 18:20           ` Jerome Glisse
@ 2019-02-19  2:37           ` Peter Xu
  2019-02-19  2:43             ` Jerome Glisse
  2019-02-19  3:33           ` Jerome Glisse
  2 siblings, 1 reply; 18+ messages in thread
From: Peter Xu @ 2019-02-19  2:37 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Jerome Glisse, linux-mm, linux-kernel, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Andrew Morton, Matthew Wilcox,
	Paolo Bonzini, Radim Krčmář,
	Michal Hocko, kvm

On Mon, Feb 18, 2019 at 12:45:05PM -0500, Andrea Arcangeli wrote:
> On Mon, Feb 18, 2019 at 11:04:13AM -0500, Jerome Glisse wrote:
> > So i run 2 exact same VMs side by side (copy of same COW image) and
> > built the same kernel tree inside each (that is the only important
> > workload that exist ;)) but the change_pte did not have any impact:
> > 
> > before  mean  {real: 1358.250977, user: 16650.880859, sys: 839.199524, npages: 76855.390625}
> > before  stdev {real:    6.744010, user:   108.863762, sys:   6.840437, npages:  1868.071899}
> > after   mean  {real: 1357.833740, user: 16685.849609, sys: 839.646973, npages: 76210.601562}
> > after   stdev {real:    5.124797, user:    78.469360, sys:   7.009164, npages:  2468.017578}
> > without mean  {real: 1358.501343, user: 16674.478516, sys: 837.791992, npages: 76225.203125}
> > without stdev {real:    5.541104, user:    97.998367, sys:   6.715869, npages:  1682.392578}
> > 
> > Above is time taken by make inside each VM for all yes config. npages
> > is the number of page shared reported on the host at the end of the
> > build.
> 
> Did you set /sys/kernel/mm/ksm/sleep_millisecs to 0?
> 
> It would also help to remove the checksum check from mm/ksm.c:
> 
> -	if (rmap_item->oldchecksum != checksum) {
> -		rmap_item->oldchecksum = checksum;
> -		return;
> -	}
> 
> One way or another, /sys/kernel/mm/ksm/pages_shared and/or
> pages_sharing need to change significantly to be sure we're exercising
> the COW/merging code that uses change_pte. KSM is smart enough to
> merge only not frequently changing pages, and with the default KSM
> code this probably works too well for a kernel build.

Would it also make sense to track how many pages are really affected
by change_pte (say, in kvm_set_pte_rmapp, count the available SPTEs
that are correctly rebuilt)?  I'm thinking that even if many pages are
merged by KSM it's still possible that these pages are not actively
shadowed by the KVM MMU, while change_pte should only affect actively
shadowed SPTEs.  In other words, IMHO we might not be able to observe
obvious performance differences if the pages we are accessing are not
merged by KSM.  In our case (building the kernel), IIUC the most likely
shared pages are system image pages; however, I wonder whether these
pages are frequently accessed during a kernel build, and whether that
could explain the similar performance numbers.

Thanks,

-- 
Peter Xu


* Re: [RFC PATCH 0/4] Restore change_pte optimization to its former glory
  2019-02-19  2:37           ` Peter Xu
@ 2019-02-19  2:43             ` Jerome Glisse
  0 siblings, 0 replies; 18+ messages in thread
From: Jerome Glisse @ 2019-02-19  2:43 UTC (permalink / raw)
  To: Peter Xu
  Cc: Andrea Arcangeli, linux-mm, linux-kernel, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Andrew Morton, Matthew Wilcox,
	Paolo Bonzini, Radim Krčmář,
	Michal Hocko, kvm

On Tue, Feb 19, 2019 at 10:37:01AM +0800, Peter Xu wrote:
> On Mon, Feb 18, 2019 at 12:45:05PM -0500, Andrea Arcangeli wrote:
> > On Mon, Feb 18, 2019 at 11:04:13AM -0500, Jerome Glisse wrote:
> > > So i run 2 exact same VMs side by side (copy of same COW image) and
> > > built the same kernel tree inside each (that is the only important
> > > workload that exist ;)) but the change_pte did not have any impact:
> > > 
> > > before  mean  {real: 1358.250977, user: 16650.880859, sys: 839.199524, npages: 76855.390625}
> > > before  stdev {real:    6.744010, user:   108.863762, sys:   6.840437, npages:  1868.071899}
> > > after   mean  {real: 1357.833740, user: 16685.849609, sys: 839.646973, npages: 76210.601562}
> > > after   stdev {real:    5.124797, user:    78.469360, sys:   7.009164, npages:  2468.017578}
> > > without mean  {real: 1358.501343, user: 16674.478516, sys: 837.791992, npages: 76225.203125}
> > > without stdev {real:    5.541104, user:    97.998367, sys:   6.715869, npages:  1682.392578}
> > > 
> > > Above is time taken by make inside each VM for all yes config. npages
> > > is the number of page shared reported on the host at the end of the
> > > build.
> > 
> > Did you set /sys/kernel/mm/ksm/sleep_millisecs to 0?
> > 
> > It would also help to remove the checksum check from mm/ksm.c:
> > 
> > -	if (rmap_item->oldchecksum != checksum) {
> > -		rmap_item->oldchecksum = checksum;
> > -		return;
> > -	}
> > 
> > One way or another, /sys/kernel/mm/ksm/pages_shared and/or
> > pages_sharing need to change significantly to be sure we're exercising
> > the COW/merging code that uses change_pte. KSM is smart enough to
> > merge only not frequently changing pages, and with the default KSM
> > code this probably works too well for a kernel build.
> 
> Would it also make sense to track how many pages are really affected
> by change_pte (say, in kvm_set_pte_rmapp, count avaliable SPTEs that
> are correctly rebuilt)?  I'm thinking even if many pages are merged by
> KSM it's still possible that these pages are not actively shadowed by
> KVM MMU, meanwhile change_pte should only affect actively shadowed
> SPTEs.  In other words, IMHO we might not be able to observe obvious
> performance differeneces if the pages we are accessing are not merged
> by KSM.  In our case (building the kernel), IIUC the mostly possible
> shared pages are system image pages, however when building the kernel
> I'm thinking whether these pages will be frequently accesses, and
> whether this could lead to similar performance numbers.

I checked that: if no KVM is running, KSM never merges anything (after
bumping pages_to_scan to 10000 and sleep_millisecs to 0). It starts
merging once I start KVM. Then I wait a bit for KSM to stabilize (ie to
merge the stock KVM pages). It is only once the KSM count is somewhat
stable that I run the test, and I check that the KSM count goes up
significantly while the test is running.

KSM will definitely go through the change_pte path for KVM, so I am
definitely testing the change_pte path.

I have been running the microbenchmark and on that I do see a perf
improvement; I will report shortly once I am done gathering enough
data.

Cheers,
Jérôme


* Re: [RFC PATCH 0/4] Restore change_pte optimization to its former glory
  2019-02-18 17:45         ` Andrea Arcangeli
  2019-02-18 18:20           ` Jerome Glisse
  2019-02-19  2:37           ` Peter Xu
@ 2019-02-19  3:33           ` Jerome Glisse
  2 siblings, 0 replies; 18+ messages in thread
From: Jerome Glisse @ 2019-02-19  3:33 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, linux-kernel, Peter Xu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Andrew Morton, Paolo Bonzini,
	Radim Krčmář,
	Michal Hocko, kvm

On Mon, Feb 18, 2019 at 12:45:05PM -0500, Andrea Arcangeli wrote:
> On Mon, Feb 18, 2019 at 11:04:13AM -0500, Jerome Glisse wrote:
> > So i run 2 exact same VMs side by side (copy of same COW image) and
> > built the same kernel tree inside each (that is the only important
> > workload that exist ;)) but the change_pte did not have any impact:
> > 
> > before  mean  {real: 1358.250977, user: 16650.880859, sys: 839.199524, npages: 76855.390625}
> > before  stdev {real:    6.744010, user:   108.863762, sys:   6.840437, npages:  1868.071899}
> > after   mean  {real: 1357.833740, user: 16685.849609, sys: 839.646973, npages: 76210.601562}
> > after   stdev {real:    5.124797, user:    78.469360, sys:   7.009164, npages:  2468.017578}
> > without mean  {real: 1358.501343, user: 16674.478516, sys: 837.791992, npages: 76225.203125}
> > without stdev {real:    5.541104, user:    97.998367, sys:   6.715869, npages:  1682.392578}
> > 
> > Above is time taken by make inside each VM for all yes config. npages
> > is the number of page shared reported on the host at the end of the
> > build.
> 
> Did you set /sys/kernel/mm/ksm/sleep_millisecs to 0?
> 
> It would also help to remove the checksum check from mm/ksm.c:
> 
> -	if (rmap_item->oldchecksum != checksum) {
> -		rmap_item->oldchecksum = checksum;
> -		return;
> -	}
> 
> One way or another, /sys/kernel/mm/ksm/pages_shared and/or
> pages_sharing need to change significantly to be sure we're exercising
> the COW/merging code that uses change_pte. KSM is smart enough to
> merge only not frequently changing pages, and with the default KSM
> code this probably works too well for a kernel build.
> 
> > Should we still restore change_pte() ? It does not hurt, but it does
> > not seems to help in anyway. Maybe you have a better benchmark i could
> > run ?
> 
> We could also try a microbenchmark based on
> ltp/testcases/kernel/mem/ksm/ksm02.c that already should trigger a
> merge flood and a COW flood during its internal processing.

So with that, and with the checksum test removed, there is an
improvement of roughly 7%~8% in the time spent in the kernel:

before  mean  {real: 675.460632, user: 857.771423, sys: 215.929657, npages: 4773.066895}
before  stdev {real:  37.035435, user:   4.395942, sys:   3.976172, npages:  675.352783}
after   mean  {real: 672.515503, user: 855.817322, sys: 200.902710, npages: 4899.000000}
after   stdev {real:  37.340954, user:   4.051633, sys:   3.894153, npages:  742.413452}

I am guessing that for a kernel build this gets lost in the noise and
that the KSM changes do not have that much of an impact. So I will
repost the mmu notifier changes shortly in the hope of getting them
into 5.1, and I will post the KVM part separately shortly thereafter.

If there is any more testing you wish me to do, let me know.

Cheers,
Jérôme

