* [PATCH 0/4] powernv: kvm: numa fault improvement
@ 2013-12-11  8:47 ` Liu Ping Fan
  0 siblings, 0 replies; 36+ messages in thread
From: Liu Ping Fan @ 2013-12-11  8:47 UTC (permalink / raw)
  To: linuxppc-dev, kvm-ppc; +Cc: Paul Mackerras, Aneesh Kumar K.V, Alexander Graf

This series is based on Aneesh's series "[PATCH -V2 0/5] powerpc: mm: Numa
faults support for ppc64".

It applies the same idea as the previous thread "[PATCH 0/3] optimize for
powerpc _PAGE_NUMA" (for which I am still trying to get a machine to show
numbers).

For this series, though, I think there is a good justification: the
well-known heavy cost of switching context between guest and host.

If my assumption is correct, I will CC kvm@vger.kernel.org from the next
version.


Liu Ping Fan (4):
  mm: export numa_migrate_prep()
  powernv: kvm: make _PAGE_NUMA take effect
  powernv: kvm: extend input param for lookup_linux_pte
  powernv: kvm: make the handling of _PAGE_NUMA faster for guest

 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 38 ++++++++++++++++++++++++++++++++++---
 include/linux/mm.h                  |  2 ++
 2 files changed, 37 insertions(+), 3 deletions(-)

-- 
1.8.1.4

* [PATCH 1/4] mm: export numa_migrate_prep()
  2013-12-11  8:47 ` Liu Ping Fan
@ 2013-12-11  8:47   ` Liu Ping Fan
  -1 siblings, 0 replies; 36+ messages in thread
From: Liu Ping Fan @ 2013-12-11  8:47 UTC (permalink / raw)
  To: linuxppc-dev, kvm-ppc; +Cc: Paul Mackerras, Aneesh Kumar K.V, Alexander Graf

powerpc will use it in the fast path.
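
As a rough illustration (not part of the diff), the generic caller
pattern, modelled on do_numa_page() in mm/memory.c, looks something
like this sketch:

        int page_nid, target_nid;

        page_nid = page_to_nid(page);
        /* numa_migrate_prep() takes a reference on the page and asks
         * the NUMA policy whether the page is misplaced */
        target_nid = numa_migrate_prep(page, vma, addr, page_nid);
        if (target_nid == -1) {
                /* already on the right node: just drop the reference */
                put_page(page);
        } else {
                /* misplaced: hand it to the migration code, which
                 * consumes the reference */
                migrate_misplaced_page(page, vma, target_nid);
        }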

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
 include/linux/mm.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5ab0e22..420fb77 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1092,6 +1092,8 @@ extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long
 extern int mprotect_fixup(struct vm_area_struct *vma,
 			  struct vm_area_struct **pprev, unsigned long start,
 			  unsigned long end, unsigned long newflags);
+extern int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
+				unsigned long addr, int page_nid);
 
 /*
  * doesn't attempt to fault and will return short.
-- 
1.8.1.4

* [PATCH 2/4] powernv: kvm: make _PAGE_NUMA take effect
  2013-12-11  8:47 ` Liu Ping Fan
@ 2013-12-11  8:47   ` Liu Ping Fan
  -1 siblings, 0 replies; 36+ messages in thread
From: Liu Ping Fan @ 2013-12-11  8:47 UTC (permalink / raw)
  To: linuxppc-dev, kvm-ppc; +Cc: Paul Mackerras, Aneesh Kumar K.V, Alexander Graf

To make _PAGE_NUMA take effect, we must force the check when the
guest uses a hypercall to set up an HPTE.

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 9c51544..af8602d 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -232,7 +232,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 		/* Look up the Linux PTE for the backing page */
 		pte_size = psize;
 		pte = lookup_linux_pte(pgdir, hva, writing, &pte_size);
-		if (pte_present(pte)) {
+		if (pte_present(pte) && !pte_numa(pte)) {
 			if (writing && !pte_write(pte))
 				/* make the actual HPTE be read-only */
 				ptel = hpte_make_readonly(ptel);
-- 
1.8.1.4

* [PATCH 3/4] powernv: kvm: extend input param for lookup_linux_pte
  2013-12-11  8:47 ` Liu Ping Fan
@ 2013-12-11  8:47   ` Liu Ping Fan
  -1 siblings, 0 replies; 36+ messages in thread
From: Liu Ping Fan @ 2013-12-11  8:47 UTC (permalink / raw)
  To: linuxppc-dev, kvm-ppc; +Cc: Paul Mackerras, Aneesh Kumar K.V, Alexander Graf

This will be needed by the next patch.

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
Can it be merged with the next patch?
---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index af8602d..ae46052 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -135,7 +135,8 @@ static void remove_revmap_chain(struct kvm *kvm, long pte_index,
 }
 
 static pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
-			      int writing, unsigned long *pte_sizep)
+			      int writing, unsigned long *pte_sizep,
+			      pte_t **ptepp)
 {
 	pte_t *ptep;
 	unsigned long ps = *pte_sizep;
@@ -144,6 +145,8 @@ static pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
 	ptep = find_linux_pte_or_hugepte(pgdir, hva, &hugepage_shift);
 	if (!ptep)
 		return __pte(0);
+	if (ptepp != NULL)
+		*ptepp = ptep;
 	if (hugepage_shift)
 		*pte_sizep = 1ul << hugepage_shift;
 	else
@@ -231,7 +234,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 
 		/* Look up the Linux PTE for the backing page */
 		pte_size = psize;
-		pte = lookup_linux_pte(pgdir, hva, writing, &pte_size);
+		pte = lookup_linux_pte(pgdir, hva, writing, &pte_size, NULL);
 		if (pte_present(pte) && !pte_numa(pte)) {
 			if (writing && !pte_write(pte))
 				/* make the actual HPTE be read-only */
@@ -671,7 +674,8 @@ long kvmppc_h_protect(struct kvm_vcpu *vcpu, unsigned long flags,
 			memslot = __gfn_to_memslot(kvm_memslots(kvm), gfn);
 			if (memslot) {
 				hva = __gfn_to_hva_memslot(memslot, gfn);
-				pte = lookup_linux_pte(pgdir, hva, 1, &psize);
+				pte = lookup_linux_pte(pgdir, hva, 1, &psize,
+							NULL);
 				if (pte_present(pte) && !pte_write(pte))
 					r = hpte_make_readonly(r);
 			}
-- 
1.8.1.4

* [PATCH 4/4] powernv: kvm: make the handling of _PAGE_NUMA faster for guest
  2013-12-11  8:47 ` Liu Ping Fan
@ 2013-12-11  8:47   ` Liu Ping Fan
  -1 siblings, 0 replies; 36+ messages in thread
From: Liu Ping Fan @ 2013-12-11  8:47 UTC (permalink / raw)
  To: linuxppc-dev, kvm-ppc; +Cc: Paul Mackerras, Aneesh Kumar K.V, Alexander Graf

The periodic _PAGE_NUMA check can easily hit a page that is already
correctly placed. For that case, when the guest tries to set up an HPTE
in real mode, we try to resolve the NUMA fault in real mode as well,
since the switch between guest context and host context costs too much.

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index ae46052..a06b199 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -179,6 +179,11 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 	unsigned int writing;
 	unsigned long mmu_seq;
 	unsigned long rcbits;
+	struct mm_struct *mm = kvm->mm;
+	struct vm_area_struct *vma;
+	int page_nid, target_nid;
+	struct page *test_page;
+	pte_t *ptep;
 
 	psize = hpte_page_size(pteh, ptel);
 	if (!psize)
@@ -234,8 +239,26 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 
 		/* Look up the Linux PTE for the backing page */
 		pte_size = psize;
-		pte = lookup_linux_pte(pgdir, hva, writing, &pte_size, NULL);
-		if (pte_present(pte) && !pte_numa(pte)) {
+		pte = lookup_linux_pte(pgdir, hva, writing, &pte_size, &ptep);
+		if (pte_present(pte)) {
+			if (pte_numa(pte)) {
+				/* If fail, let gup handle it */
+				if (unlikely(!down_read_trylock(&mm->mmap_sem)))
+					goto pte_check;
+
+				vma = find_vma(mm, hva);
+				up_read(&mm->mmap_sem);
+				test_page = pte_page(pte);
+				page_nid = page_to_nid(test_page);
+				target_nid = numa_migrate_prep(test_page, vma,
+							 hva, page_nid);
+				put_page(test_page);
+				if (unlikely(target_nid != -1)) {
+					/* If fail, let gup handle it */
+					goto pte_check;
+				}
+			}
+
 			if (writing && !pte_write(pte))
 				/* make the actual HPTE be read-only */
 				ptel = hpte_make_readonly(ptel);
@@ -244,6 +267,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 		}
 	}
 
+pte_check:
 	if (pte_size < psize)
 		return H_PARAMETER;
 	if (pa && pte_size > psize)
@@ -339,6 +363,10 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 			pteh &= ~HPTE_V_VALID;
 			unlock_rmap(rmap);
 		} else {
+			if (pte_numa(pte) && pa) {
+				pte = pte_mknonnuma(pte);
+				*ptep = pte;
+			}
 			kvmppc_add_revmap_chain(kvm, rev, rmap, pte_index,
 						realmode);
 			/* Only set R/C in real HPTE if already set in *rmap */
-- 
1.8.1.4

* Re: [PATCH 0/4] powernv: kvm: numa fault improvement
@ 2014-01-09 12:08   ` Alexander Graf
  0 siblings, 0 replies; 36+ messages in thread
From: Alexander Graf @ 2014-01-09 12:08 UTC (permalink / raw)
  To: Liu Ping Fan; +Cc: Paul Mackerras, linuxppc-dev, Aneesh Kumar K.V, kvm-ppc


On 11.12.2013, at 09:47, Liu Ping Fan <kernelfans@gmail.com> wrote:

> This series is based on Aneesh's series  "[PATCH -V2 0/5] powerpc: mm: Numa faults support for ppc64"
> 
> For this series, I apply the same idea from the previous thread "[PATCH 0/3] optimize for powerpc _PAGE_NUMA"
> (for which, I still try to get a machine to show nums)
> 
> But for this series, I think that I have a good justification -- the fact of heavy cost when switching context between guest and host,
> which is  well known.

This cover letter isn't really telling me anything. Please put in a proper description of what you're trying to achieve, why you're trying to achieve it, and convince your readers that it's a good idea to do it the way you do it.

> If my suppose is correct, will CCing kvm@vger.kernel.org from next version.

This translates to me as "This is an RFC"?


Alex


* Re: [PATCH 0/4] powernv: kvm: numa fault improvement
  2014-01-09 12:08   ` Alexander Graf
@ 2014-01-15  6:36     ` Liu ping fan
  -1 siblings, 0 replies; 36+ messages in thread
From: Liu ping fan @ 2014-01-15  6:36 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Paul Mackerras, linuxppc-dev, Aneesh Kumar K.V, kvm-ppc

On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf <agraf@suse.de> wrote:
>
> On 11.12.2013, at 09:47, Liu Ping Fan <kernelfans@gmail.com> wrote:
>
>> This series is based on Aneesh's series  "[PATCH -V2 0/5] powerpc: mm: Numa faults support for ppc64"
>>
>> For this series, I apply the same idea from the previous thread "[PATCH 0/3] optimize for powerpc _PAGE_NUMA"
>> (for which, I still try to get a machine to show nums)
>>
>> But for this series, I think that I have a good justification -- the fact of heavy cost when switching context between guest and host,
>> which is  well known.
>
> This cover letter isn't really telling me anything. Please put a proper description of what you're trying to achieve, why you're trying to achieve what you're trying and convince your readers that it's a good idea to do it the way you do it.
>
Sorry for the unclear message. After introducing _PAGE_NUMA,
kvmppc_do_h_enter() can no longer fill in the HPTE for the guest.
Instead, it has to rely on the host's kvmppc_book3s_hv_page_fault()
calling do_numa_page() to do the NUMA fault check. This incurs the
overhead of exiting from rmode to vmode. My idea is that in
kvmppc_do_h_enter() we do a quick check: if the page is correctly
placed, there is no need to exit to vmode (i.e. saving the HTAB and
switching the SLB).
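
Roughly, the quick check I have in mind looks like the sketch below
(illustrative only: lookup_linux_pte() and numa_migrate_prep() are the
helpers this series touches, the function itself and its error handling
are made up for the example):

static int hpte_numa_quick_check(struct kvm *kvm, pgd_t *pgdir,
                                 unsigned long hva, int writing,
                                 unsigned long *pte_sizep)
{
        struct mm_struct *mm = kvm->mm;
        struct vm_area_struct *vma;
        struct page *page;
        pte_t *ptep, pte;
        int target_nid;

        pte = lookup_linux_pte(pgdir, hva, writing, pte_sizep, &ptep);
        if (!pte_present(pte) || !pte_numa(pte))
                return 0;               /* nothing special to do */

        /* we cannot sleep in real mode, so only try the lock */
        if (!down_read_trylock(&mm->mmap_sem))
                return -EAGAIN;         /* give up, exit to vmode */
        vma = find_vma(mm, hva);
        up_read(&mm->mmap_sem);
        if (!vma || hva < vma->vm_start)
                return -EAGAIN;

        page = pte_page(pte);
        target_nid = numa_migrate_prep(page, vma, hva, page_to_nid(page));
        put_page(page);                 /* numa_migrate_prep() took a ref */
        if (target_nid != -1)
                return -EAGAIN;         /* misplaced, let vmode migrate it */

        *ptep = pte_mknonnuma(pte);     /* right node: clear the hint */
        return 0;
}

A non-zero return would simply make kvmppc_do_h_enter() fall back to
the existing slow path.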

>> If my suppose is correct, will CCing kvm@vger.kernel.org from next version.
>
> This translates to me as "This is an RFC"?
>
Yes, I am not quite sure about it. I have no bare-metal machine to
verify it on, so I hope that, at least in theory, it is correct.

Thanks and regards,
Ping Fan
>
> Alex
>

* Re: [PATCH 0/4] powernv: kvm: numa fault improvement
@ 2014-01-20 14:48       ` Alexander Graf
  0 siblings, 0 replies; 36+ messages in thread
From: Alexander Graf @ 2014-01-20 14:48 UTC (permalink / raw)
  To: Liu ping fan; +Cc: Paul Mackerras, linuxppc-dev, Aneesh Kumar K.V, kvm-ppc


On 15.01.2014, at 07:36, Liu ping fan <kernelfans@gmail.com> wrote:

> On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf <agraf@suse.de> wrote:
>> 
>> On 11.12.2013, at 09:47, Liu Ping Fan <kernelfans@gmail.com> wrote:
>> 
>>> This series is based on Aneesh's series  "[PATCH -V2 0/5] powerpc: mm: Numa faults support for ppc64"
>>> 
>>> For this series, I apply the same idea from the previous thread "[PATCH 0/3] optimize for powerpc _PAGE_NUMA"
>>> (for which, I still try to get a machine to show nums)
>>> 
>>> But for this series, I think that I have a good justification -- the fact of heavy cost when switching context between guest and host,
>>> which is  well known.
>> 
>> This cover letter isn't really telling me anything. Please put a proper description of what you're trying to achieve, why you're trying to achieve what you're trying and convince your readers that it's a good idea to do it the way you do it.
>> 
> Sorry for the unclear message. After introducing the _PAGE_NUMA,
> kvmppc_do_h_enter() can not fill up the hpte for guest. Instead, it
> should rely on host's kvmppc_book3s_hv_page_fault() to call
> do_numa_page() to do the numa fault check. This incurs the overhead
> when exiting from rmode to vmode.  My idea is that in
> kvmppc_do_h_enter(), we do a quick check, if the page is right placed,
> there is no need to exit to vmode (i.e saving htab, slab switching)
> 
>>> If my suppose is correct, will CCing kvm@vger.kernel.org from next version.
>> 
>> This translates to me as "This is an RFC"?
>> 
> Yes, I am not quite sure about it. I have no bare-metal to verify it.
> So I hope at least, from the theory, it is correct.

Paul, could you please give this some thought and maybe benchmark it?


Alex


* Re: [PATCH 2/4] powernv: kvm: make _PAGE_NUMA take effect
  2013-12-11  8:47   ` Liu Ping Fan
@ 2014-01-20 15:34     ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 36+ messages in thread
From: Aneesh Kumar K.V @ 2014-01-20 15:22 UTC (permalink / raw)
  To: Liu Ping Fan, linuxppc-dev, kvm-ppc; +Cc: Paul Mackerras, Alexander Graf

Liu Ping Fan <kernelfans@gmail.com> writes:

> To make _PAGE_NUMA take effect, we should force the checking when
> guest uses hypercall to setup hpte.
>
> Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> ---
>  arch/powerpc/kvm/book3s_hv_rm_mmu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> index 9c51544..af8602d 100644
> --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> @@ -232,7 +232,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
>  		/* Look up the Linux PTE for the backing page */
>  		pte_size = psize;
>  		pte = lookup_linux_pte(pgdir, hva, writing, &pte_size);
> -		if (pte_present(pte)) {
> +		if (pte_present(pte) && !pte_numa(pte)) {
>  			if (writing && !pte_write(pte))
>  				/* make the actual HPTE be read-only */
>  				ptel = hpte_make_readonly(ptel);

How did we end up doing an h_enter on a pte entry with the pte_numa bit set?

-aneesh

* Re: [PATCH 0/4] powernv: kvm: numa fault improvement
  2014-01-15  6:36     ` Liu ping fan
@ 2014-01-20 15:57       ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 36+ messages in thread
From: Aneesh Kumar K.V @ 2014-01-20 15:45 UTC (permalink / raw)
  To: Liu ping fan, Alexander Graf; +Cc: Paul Mackerras, linuxppc-dev, kvm-ppc

Liu ping fan <kernelfans@gmail.com> writes:

> On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf <agraf@suse.de> wrote:
>>
>> On 11.12.2013, at 09:47, Liu Ping Fan <kernelfans@gmail.com> wrote:
>>
>>> This series is based on Aneesh's series  "[PATCH -V2 0/5] powerpc: mm: Numa faults support for ppc64"
>>>
>>> For this series, I apply the same idea from the previous thread "[PATCH 0/3] optimize for powerpc _PAGE_NUMA"
>>> (for which, I still try to get a machine to show nums)
>>>
>>> But for this series, I think that I have a good justification -- the fact of heavy cost when switching context between guest and host,
>>> which is  well known.
>>
>> This cover letter isn't really telling me anything. Please put a proper description of what you're trying to achieve, why you're trying to achieve what you're trying and convince your readers that it's a good idea to do it the way you do it.
>>
> Sorry for the unclear message. After introducing the _PAGE_NUMA,
> kvmppc_do_h_enter() can not fill up the hpte for guest. Instead, it
> should rely on host's kvmppc_book3s_hv_page_fault() to call
> do_numa_page() to do the numa fault check. This incurs the overhead
> when exiting from rmode to vmode.  My idea is that in
> kvmppc_do_h_enter(), we do a quick check, if the page is right placed,
> there is no need to exit to vmode (i.e saving htab, slab switching)

Can you explain more? Are we looking at hcalls from the guest with the
hypervisor handling them in real mode? If so, why would the guest issue
an hcall on a pte entry that has PAGE_NUMA set? Or is this about the
hypervisor handling a missing hpte because the host swapped the page
out? In that case, how do we end up in h_enter? IIUC for that case we
should get to kvmppc_hpte_hv_fault.


>
>>> If my suppose is correct, will CCing kvm@vger.kernel.org from next version.
>>
>> This translates to me as "This is an RFC"?
>>
> Yes, I am not quite sure about it. I have no bare-metal to verify it.
> So I hope at least, from the theory, it is correct.
>

-aneesh

* Re: [PATCH 0/4] powernv: kvm: numa fault improvement
  2014-01-20 15:57       ` Aneesh Kumar K.V
@ 2014-01-21  2:30         ` Liu ping fan
  -1 siblings, 0 replies; 36+ messages in thread
From: Liu ping fan @ 2014-01-21  2:30 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Paul Mackerras, linuxppc-dev, Alexander Graf, kvm-ppc

On Mon, Jan 20, 2014 at 11:45 PM, Aneesh Kumar K.V
<aneesh.kumar@linux.vnet.ibm.com> wrote:
> Liu ping fan <kernelfans@gmail.com> writes:
>
>> On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf <agraf@suse.de> wrote:
>>>
>>> On 11.12.2013, at 09:47, Liu Ping Fan <kernelfans@gmail.com> wrote:
>>>
>>>> This series is based on Aneesh's series  "[PATCH -V2 0/5] powerpc: mm: Numa faults support for ppc64"
>>>>
>>>> For this series, I apply the same idea from the previous thread "[PATCH 0/3] optimize for powerpc _PAGE_NUMA"
>>>> (for which, I still try to get a machine to show nums)
>>>>
>>>> But for this series, I think that I have a good justification -- the fact of heavy cost when switching context between guest and host,
>>>> which is  well known.
>>>
>>> This cover letter isn't really telling me anything. Please put a proper description of what you're trying to achieve, why you're trying to achieve what you're trying and convince your readers that it's a good idea to do it the way you do it.
>>>
>> Sorry for the unclear message. After introducing the _PAGE_NUMA,
>> kvmppc_do_h_enter() can not fill up the hpte for guest. Instead, it
>> should rely on host's kvmppc_book3s_hv_page_fault() to call
>> do_numa_page() to do the numa fault check. This incurs the overhead
>> when exiting from rmode to vmode.  My idea is that in
>> kvmppc_do_h_enter(), we do a quick check, if the page is right placed,
>> there is no need to exit to vmode (i.e saving htab, slab switching)
>
> Can you explain more. Are we looking at hcall from guest  and
> hypervisor handling them in real mode ? If so why would guest issue a
> hcall on a pte entry that have PAGE_NUMA set. Or is this about
> hypervisor handling a missing hpte, because of host swapping this page
> out ? In that case how we end up in h_enter ? IIUC for that case we
> should get to kvmppc_hpte_hv_fault.
>
After setting _PAGE_NUMA, we should flush out all hptes, both in the
host's htab and in the guest's. So when the guest tries to access the
memory, the host finds that there is no hpte ready for the guest in the
guest's htab, and the host raises a DSI to the guest. That is how the
guest ends up in h_enter. And as you can see in the current code, we
try this quick path first; only if it fails do we resort to the slow
path, kvmppc_hpte_hv_fault.

Thanks and regards,
Fan
>
>>
>>>> If my suppose is correct, will CCing kvm@vger.kernel.org from next version.
>>>
>>> This translates to me as "This is an RFC"?
>>>
>> Yes, I am not quite sure about it. I have no bare-metal to verify it.
>> So I hope at least, from the theory, it is correct.
>>
>
> -aneesh
>

* Re: [PATCH 0/4] powernv: kvm: numa fault improvement
  2014-01-21  2:30         ` Liu ping fan
@ 2014-01-21  3:52           ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 36+ messages in thread
From: Aneesh Kumar K.V @ 2014-01-21  3:40 UTC (permalink / raw)
  To: Liu ping fan; +Cc: Paul Mackerras, linuxppc-dev, Alexander Graf, kvm-ppc

Liu ping fan <kernelfans@gmail.com> writes:

> On Mon, Jan 20, 2014 at 11:45 PM, Aneesh Kumar K.V
> <aneesh.kumar@linux.vnet.ibm.com> wrote:
>> Liu ping fan <kernelfans@gmail.com> writes:
>>
>>> On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf <agraf@suse.de> wrote:
>>>>
>>>> On 11.12.2013, at 09:47, Liu Ping Fan <kernelfans@gmail.com> wrote:
>>>>
>>>>> This series is based on Aneesh's series  "[PATCH -V2 0/5] powerpc: mm: Numa faults support for ppc64"
>>>>>
>>>>> For this series, I apply the same idea from the previous thread "[PATCH 0/3] optimize for powerpc _PAGE_NUMA"
>>>>> (for which, I still try to get a machine to show nums)
>>>>>
>>>>> But for this series, I think that I have a good justification -- the fact of heavy cost when switching context between guest and host,
>>>>> which is  well known.
>>>>
>>>> This cover letter isn't really telling me anything. Please put a proper description of what you're trying to achieve, why you're trying to achieve what you're trying and convince your readers that it's a good idea to do it the way you do it.
>>>>
>>> Sorry for the unclear message. After introducing the _PAGE_NUMA,
>>> kvmppc_do_h_enter() can not fill up the hpte for guest. Instead, it
>>> should rely on host's kvmppc_book3s_hv_page_fault() to call
>>> do_numa_page() to do the numa fault check. This incurs the overhead
>>> when exiting from rmode to vmode.  My idea is that in
>>> kvmppc_do_h_enter(), we do a quick check, if the page is right placed,
>>> there is no need to exit to vmode (i.e saving htab, slab switching)
>>
>> Can you explain more. Are we looking at hcall from guest  and
>> hypervisor handling them in real mode ? If so why would guest issue a
>> hcall on a pte entry that have PAGE_NUMA set. Or is this about
>> hypervisor handling a missing hpte, because of host swapping this page
>> out ? In that case how we end up in h_enter ? IIUC for that case we
>> should get to kvmppc_hpte_hv_fault.
>>
> After setting _PAGE_NUMA, we should flush out all hptes both in host's
> htab and guest's. So when guest tries to access memory, host finds
> that there is not hpte ready for guest in guest's htab. And host
> should raise dsi to guest.

Now the guest receives that fault, removes the PAGE_NUMA bit and does
an hpte_insert. So before we do an hpte_insert (or H_ENTER), we should
have cleared the PAGE_NUMA bit.

>This incurs that guest ends up in h_enter.
> And you can see in current code, we also try this quick path firstly.
> Only if fail, we will resort to slow path --  kvmppc_hpte_hv_fault.

Hmm? hpte_hv_fault is the hypervisor handling the fault.

-aneesh

* Re: [PATCH 0/4] powernv: kvm: numa fault improvement
  2014-01-21  3:52           ` Aneesh Kumar K.V
@ 2014-01-21  9:07             ` Liu ping fan
  -1 siblings, 0 replies; 36+ messages in thread
From: Liu ping fan @ 2014-01-21  9:07 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Paul Mackerras, linuxppc-dev, Alexander Graf, kvm-ppc

On Tue, Jan 21, 2014 at 11:40 AM, Aneesh Kumar K.V
<aneesh.kumar@linux.vnet.ibm.com> wrote:
> Liu ping fan <kernelfans@gmail.com> writes:
>
>> On Mon, Jan 20, 2014 at 11:45 PM, Aneesh Kumar K.V
>> <aneesh.kumar@linux.vnet.ibm.com> wrote:
>>> Liu ping fan <kernelfans@gmail.com> writes:
>>>
>>>> On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf <agraf@suse.de> wrote:
>>>>>
>>>>> On 11.12.2013, at 09:47, Liu Ping Fan <kernelfans@gmail.com> wrote:
>>>>>
>>>>>> This series is based on Aneesh's series  "[PATCH -V2 0/5] powerpc: mm: Numa faults support for ppc64"
>>>>>>
>>>>>> For this series, I apply the same idea from the previous thread "[PATCH 0/3] optimize for powerpc _PAGE_NUMA"
>>>>>> (for which, I still try to get a machine to show nums)
>>>>>>
>>>>>> But for this series, I think that I have a good justification -- the fact of heavy cost when switching context between guest and host,
>>>>>> which is  well known.
>>>>>
>>>>> This cover letter isn't really telling me anything. Please put a proper description of what you're trying to achieve, why you're trying to achieve what you're trying and convince your readers that it's a good idea to do it the way you do it.
>>>>>
>>>> Sorry for the unclear message. After introducing the _PAGE_NUMA,
>>>> kvmppc_do_h_enter() can not fill up the hpte for guest. Instead, it
>>>> should rely on host's kvmppc_book3s_hv_page_fault() to call
>>>> do_numa_page() to do the numa fault check. This incurs the overhead
>>>> when exiting from rmode to vmode.  My idea is that in
>>>> kvmppc_do_h_enter(), we do a quick check, if the page is right placed,
>>>> there is no need to exit to vmode (i.e saving htab, slab switching)
>>>
>>> Can you explain more. Are we looking at hcall from guest  and
>>> hypervisor handling them in real mode ? If so why would guest issue a
>>> hcall on a pte entry that have PAGE_NUMA set. Or is this about
>>> hypervisor handling a missing hpte, because of host swapping this page
>>> out ? In that case how we end up in h_enter ? IIUC for that case we
>>> should get to kvmppc_hpte_hv_fault.
>>>
>> After setting _PAGE_NUMA, we should flush out all hptes both in host's
>> htab and guest's. So when guest tries to access memory, host finds
>> that there is not hpte ready for guest in guest's htab. And host
>> should raise dsi to guest.
>
> Now guest receive that fault, removes the PAGE_NUMA bit and do an
> hpte_insert. So before we do an hpte_insert (or H_ENTER) we should have
> cleared PAGE_NUMA bit.
>
>>This incurs that guest ends up in h_enter.
>> And you can see in current code, we also try this quick path firstly.
>> Only if fail, we will resort to slow path --  kvmppc_hpte_hv_fault.
>
> hmm ? hpte_hv_fault is the hypervisor handling the fault.
>
After our discussion on IRC, I think we should also do the fast check
in kvmppc_hpte_hv_fault() for the HPTE_V_ABSENT case, and let H_ENTER
take care of the remaining case, i.e. no hpte when pte_mknuma. Right?

Thanks and regards,
Fan
> -aneesh
>

* Re: [PATCH 0/4] powernv: kvm: numa fault improvement
  2014-01-21  9:07             ` Liu ping fan
@ 2014-01-21  9:11               ` Liu ping fan
  -1 siblings, 0 replies; 36+ messages in thread
From: Liu ping fan @ 2014-01-21  9:11 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Paul Mackerras, linuxppc-dev, Alexander Graf, kvm-ppc

On Tue, Jan 21, 2014 at 5:07 PM, Liu ping fan <kernelfans@gmail.com> wrote:
> On Tue, Jan 21, 2014 at 11:40 AM, Aneesh Kumar K.V
> <aneesh.kumar@linux.vnet.ibm.com> wrote:
>> Liu ping fan <kernelfans@gmail.com> writes:
>>
>>> On Mon, Jan 20, 2014 at 11:45 PM, Aneesh Kumar K.V
>>> <aneesh.kumar@linux.vnet.ibm.com> wrote:
>>>> Liu ping fan <kernelfans@gmail.com> writes:
>>>>
>>>>> On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf <agraf@suse.de> wrote:
>>>>>>
>>>>>> On 11.12.2013, at 09:47, Liu Ping Fan <kernelfans@gmail.com> wrote:
>>>>>>
>>>>>>> This series is based on Aneesh's series  "[PATCH -V2 0/5] powerpc: mm: Numa faults support for ppc64"
>>>>>>>
>>>>>>> For this series, I apply the same idea from the previous thread "[PATCH 0/3] optimize for powerpc _PAGE_NUMA"
>>>>>>> (for which, I still try to get a machine to show nums)
>>>>>>>
>>>>>>> But for this series, I think that I have a good justification -- the fact of heavy cost when switching context between guest and host,
>>>>>>> which is  well known.
>>>>>>
>>>>>> This cover letter isn't really telling me anything. Please put a proper description of what you're trying to achieve, why you're trying to achieve what you're trying and convince your readers that it's a good idea to do it the way you do it.
>>>>>>
>>>>> Sorry for the unclear message. After introducing the _PAGE_NUMA,
>>>>> kvmppc_do_h_enter() can not fill up the hpte for guest. Instead, it
>>>>> should rely on host's kvmppc_book3s_hv_page_fault() to call
>>>>> do_numa_page() to do the numa fault check. This incurs the overhead
>>>>> when exiting from rmode to vmode.  My idea is that in
>>>>> kvmppc_do_h_enter(), we do a quick check, if the page is right placed,
>>>>> there is no need to exit to vmode (i.e saving htab, slab switching)
>>>>
>>>> Can you explain more. Are we looking at hcall from guest  and
>>>> hypervisor handling them in real mode ? If so why would guest issue a
>>>> hcall on a pte entry that have PAGE_NUMA set. Or is this about
>>>> hypervisor handling a missing hpte, because of host swapping this page
>>>> out ? In that case how we end up in h_enter ? IIUC for that case we
>>>> should get to kvmppc_hpte_hv_fault.
>>>>
>>> After setting _PAGE_NUMA, we should flush out all hptes both in host's
>>> htab and guest's. So when guest tries to access memory, host finds
>>> that there is not hpte ready for guest in guest's htab. And host
>>> should raise dsi to guest.
>>
>> Now guest receive that fault, removes the PAGE_NUMA bit and do an
>> hpte_insert. So before we do an hpte_insert (or H_ENTER) we should have
>> cleared PAGE_NUMA bit.
>>
>>>This incurs that guest ends up in h_enter.
>>> And you can see in current code, we also try this quick path firstly.
>>> Only if fail, we will resort to slow path --  kvmppc_hpte_hv_fault.
>>
>> hmm ? hpte_hv_fault is the hypervisor handling the fault.
>>
> After we discuss in irc. I think we should also do the fast check in
> kvmppc_hpte_hv_fault() for the case of HPTE_V_ABSENT,
> and let H_ENTER take care of the rest case i.e. no hpte when pte_mknuma. Right?
>
Or we can defer the quick fix out of H_ENTER, let the host take the
fault again, and do the fix in kvmppc_hpte_hv_fault() instead.
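
To make the two options concrete, the quick placement check under discussion would look roughly like the sketch below. This is an illustration only, not the code from the series: kvmppc_numa_page_well_placed() is a made-up name, and whether pte_page()/page_to_nid() can be used safely from real mode is exactly the question raised earlier in the thread.

/*
 * Sketch only: decide, without leaving real mode, whether a pte with
 * _PAGE_NUMA set points at a page that already sits on the node the
 * vcpu is running on.  If it does, the caller can install a valid hpte
 * directly; if not, it falls back to the existing slow path, where
 * kvmppc_book3s_hv_page_fault() ends up calling do_numa_page().
 */
static bool kvmppc_numa_page_well_placed(pte_t pte)
{
	struct page *page;

	if (!pte_numa(pte))
		return true;	/* no NUMA hinting fault pending */

	page = pte_page(pte);
	/* cheap comparison only: no locks and no migration from real mode */
	return page_to_nid(page) == numa_node_id();
}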

> Thanks and regards,
> Fan
>> -aneesh
>>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/4] powernv: kvm: numa fault improvement
  2014-01-20 14:48       ` Alexander Graf
@ 2014-01-21 11:22         ` Paul Mackerras
  -1 siblings, 0 replies; 36+ messages in thread
From: Paul Mackerras @ 2014-01-21 11:22 UTC (permalink / raw)
  To: Alexander Graf; +Cc: linuxppc-dev, kvm-ppc, Liu ping fan, Aneesh Kumar K.V

On Mon, Jan 20, 2014 at 03:48:36PM +0100, Alexander Graf wrote:
> 
> On 15.01.2014, at 07:36, Liu ping fan <kernelfans@gmail.com> wrote:
> 
> > On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf <agraf@suse.de> wrote:
> >> 
> >> On 11.12.2013, at 09:47, Liu Ping Fan <kernelfans@gmail.com> wrote:
> >> 
> >>> This series is based on Aneesh's series  "[PATCH -V2 0/5] powerpc: mm: Numa faults support for ppc64"
> >>> 
> >>> For this series, I apply the same idea from the previous thread "[PATCH 0/3] optimize for powerpc _PAGE_NUMA"
> >>> (for which, I still try to get a machine to show nums)
> >>> 
> >>> But for this series, I think that I have a good justification -- the fact of heavy cost when switching context between guest and host,
> >>> which is  well known.
> >> 
> >> This cover letter isn't really telling me anything. Please put a proper description of what you're trying to achieve, why you're trying to achieve what you're trying and convince your readers that it's a good idea to do it the way you do it.
> >> 
> > Sorry for the unclear message. After introducing the _PAGE_NUMA,
> > kvmppc_do_h_enter() can not fill up the hpte for guest. Instead, it
> > should rely on host's kvmppc_book3s_hv_page_fault() to call
> > do_numa_page() to do the numa fault check. This incurs the overhead
> > when exiting from rmode to vmode.  My idea is that in
> > kvmppc_do_h_enter(), we do a quick check, if the page is right placed,
> > there is no need to exit to vmode (i.e saving htab, slab switching)
> > 
> >>> If my suppose is correct, will CCing kvm@vger.kernel.org from next version.
> >> 
> >> This translates to me as "This is an RFC"?
> >> 
> > Yes, I am not quite sure about it. I have no bare-metal to verify it.
> > So I hope at least, from the theory, it is correct.
> 
> Paul, could you please give this some thought and maybe benchmark it?

OK, once I get Aneesh to tell me how I get to have ptes with
_PAGE_NUMA set in the first place. :)

Paul.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/4] powernv: kvm: numa fault improvement
  2014-01-21 11:22         ` Paul Mackerras
@ 2014-01-22  5:30           ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 36+ messages in thread
From: Aneesh Kumar K.V @ 2014-01-22  5:18 UTC (permalink / raw)
  To: Paul Mackerras, Alexander Graf; +Cc: linuxppc-dev, kvm-ppc, Liu ping fan

Paul Mackerras <paulus@samba.org> writes:

> On Mon, Jan 20, 2014 at 03:48:36PM +0100, Alexander Graf wrote:
>> 
>> On 15.01.2014, at 07:36, Liu ping fan <kernelfans@gmail.com> wrote:
>> 
>> > On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf <agraf@suse.de> wrote:
>> >> 
>> >> On 11.12.2013, at 09:47, Liu Ping Fan <kernelfans@gmail.com> wrote:
>> >> 
>> >>> This series is based on Aneesh's series  "[PATCH -V2 0/5] powerpc: mm: Numa faults support for ppc64"
>> >>> 
>> >>> For this series, I apply the same idea from the previous thread "[PATCH 0/3] optimize for powerpc _PAGE_NUMA"
>> >>> (for which, I still try to get a machine to show nums)
>> >>> 
>> >>> But for this series, I think that I have a good justification -- the fact of heavy cost when switching context between guest and host,
>> >>> which is  well known.
>> >> 
>> >> This cover letter isn't really telling me anything. Please put a proper description of what you're trying to achieve, why you're trying to achieve what you're trying and convince your readers that it's a good idea to do it the way you do it.
>> >> 
>> > Sorry for the unclear message. After introducing the _PAGE_NUMA,
>> > kvmppc_do_h_enter() can not fill up the hpte for guest. Instead, it
>> > should rely on host's kvmppc_book3s_hv_page_fault() to call
>> > do_numa_page() to do the numa fault check. This incurs the overhead
>> > when exiting from rmode to vmode.  My idea is that in
>> > kvmppc_do_h_enter(), we do a quick check, if the page is right placed,
>> > there is no need to exit to vmode (i.e saving htab, slab switching)
>> > 
>> >>> If my suppose is correct, will CCing kvm@vger.kernel.org from next version.
>> >> 
>> >> This translates to me as "This is an RFC"?
>> >> 
>> > Yes, I am not quite sure about it. I have no bare-metal to verify it.
>> > So I hope at least, from the theory, it is correct.
>> 
>> Paul, could you please give this some thought and maybe benchmark it?
>
> OK, once I get Aneesh to tell me how I get to have ptes with
> _PAGE_NUMA set in the first place. :)
>

I guess we want patch 2, which Liu has sent separately and I have
reviewed: http://article.gmane.org/gmane.comp.emulators.kvm.powerpc.devel/8619
I am not sure about the rest of the patches in the series.
We definitely don't want to numa migrate on henter. We may want to do
that on fault. But even there, IMHO, we should let the host take the
fault and do the numa migration instead of doing this in guest context.

-aneesh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/4] powernv: kvm: numa fault improvement
  2014-01-22  5:30           ` Aneesh Kumar K.V
@ 2014-01-22  8:33             ` Liu ping fan
  -1 siblings, 0 replies; 36+ messages in thread
From: Liu ping fan @ 2014-01-22  8:33 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: linuxppc-dev, Paul Mackerras, Alexander Graf, kvm-ppc

On Wed, Jan 22, 2014 at 1:18 PM, Aneesh Kumar K.V
<aneesh.kumar@linux.vnet.ibm.com> wrote:
> Paul Mackerras <paulus@samba.org> writes:
>
>> On Mon, Jan 20, 2014 at 03:48:36PM +0100, Alexander Graf wrote:
>>>
>>> On 15.01.2014, at 07:36, Liu ping fan <kernelfans@gmail.com> wrote:
>>>
>>> > On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf <agraf@suse.de> wrote:
>>> >>
>>> >> On 11.12.2013, at 09:47, Liu Ping Fan <kernelfans@gmail.com> wrote:
>>> >>
>>> >>> This series is based on Aneesh's series  "[PATCH -V2 0/5] powerpc: mm: Numa faults support for ppc64"
>>> >>>
>>> >>> For this series, I apply the same idea from the previous thread "[PATCH 0/3] optimize for powerpc _PAGE_NUMA"
>>> >>> (for which, I still try to get a machine to show nums)
>>> >>>
>>> >>> But for this series, I think that I have a good justification -- the fact of heavy cost when switching context between guest and host,
>>> >>> which is  well known.
>>> >>
>>> >> This cover letter isn't really telling me anything. Please put a proper description of what you're trying to achieve, why you're trying to achieve what you're trying and convince your readers that it's a good idea to do it the way you do it.
>>> >>
>>> > Sorry for the unclear message. After introducing the _PAGE_NUMA,
>>> > kvmppc_do_h_enter() can not fill up the hpte for guest. Instead, it
>>> > should rely on host's kvmppc_book3s_hv_page_fault() to call
>>> > do_numa_page() to do the numa fault check. This incurs the overhead
>>> > when exiting from rmode to vmode.  My idea is that in
>>> > kvmppc_do_h_enter(), we do a quick check, if the page is right placed,
>>> > there is no need to exit to vmode (i.e saving htab, slab switching)
>>> >
>>> >>> If my suppose is correct, will CCing kvm@vger.kernel.org from next version.
>>> >>
>>> >> This translates to me as "This is an RFC"?
>>> >>
>>> > Yes, I am not quite sure about it. I have no bare-metal to verify it.
>>> > So I hope at least, from the theory, it is correct.
>>>
>>> Paul, could you please give this some thought and maybe benchmark it?
>>
>> OK, once I get Aneesh to tell me how I get to have ptes with
>> _PAGE_NUMA set in the first place. :)
>>
>
> I guess we want patch 2, Which Liu has sent separately and I have
> reviewed. http://article.gmane.org/gmane.comp.emulators.kvm.powerpc.devel/8619
> I am not sure about the rest of the patches in the series.
> We definitely don't want to numa migrate on henter. We may want to do
> that on fault. But even there, IMHO, we should let the host take the
> fault and do the numa migration instead of doing this in guest context.
>
My patch does NOT do the NUMA migration in guest context (h_enter).
Instead, it only does a pre-check to see whether NUMA migration is
needed. If it is needed, the host will take the fault and do the NUMA
migration as it currently does. Otherwise, h_enter can directly set up
the hpte without HPTE_V_ABSENT.
And since pte_mknuma() is applied system-wide periodically, the guest is
quite likely to suffer from HPTE_V_ABSENT. (As in my previous reply, I
think we should also place the quick check in kvmppc_hpte_hv_fault();
see the sketch below.)
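
As a rough illustration of that control flow (again a sketch, not the code in this series; kvmppc_numa_page_well_placed() is the hypothetical helper from the earlier sketch, and the actual hpte bookkeeping of kvmppc_do_h_enter() is elided):

	/* Sketch of the decision in kvmppc_do_h_enter() for a _PAGE_NUMA pte */
	if (pte_present(pte) && pte_numa(pte)) {
		if (kvmppc_numa_page_well_placed(pte)) {
			/*
			 * Page is already on the right node: install a
			 * normal, valid hpte so the guest does not fault
			 * again and no exit to virtual mode is needed.
			 */
		} else {
			/*
			 * Misplaced page: keep today's behaviour and mark
			 * the entry HPTE_V_ABSENT, so the next guest access
			 * exits to kvmppc_book3s_hv_page_fault(), which does
			 * the NUMA migration in host context.
			 */
			pteh |= HPTE_V_ABSENT;
		}
	}

The point is that the well-placed case, which should be the common one once the balancer has settled, never pays for the rmode->vmode transition that motivates the series.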

Thx,
Fan

> -aneesh
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/4] powernv: kvm: numa fault improvement
  2014-01-20 14:48       ` Alexander Graf
@ 2014-02-26  3:09         ` Liu ping fan
  -1 siblings, 0 replies; 36+ messages in thread
From: Liu ping fan @ 2014-02-26  3:09 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Paul Mackerras, linuxppc-dev, Aneesh Kumar K.V, kvm-ppc

Sorry for the late update. It took a long time to get a test machine,
and then I hit a series of other bugs which I could not resolve easily.
For now I have some higher-priority tasks, and will come back to this
topic when time is available.
Besides this, I did some basic tests of an HV guest with NUMA faulting
enabled and disabled; they show about a 10% drop in performance when
NUMA faulting is on. (Tested with $pg_random_access 60 4 200, and the
guest has 10GB of mlocked pages.)
I think this is caused by the following factors: cache misses, TLB
misses, guest->host exits, and the hw-threads cooperating to exit from
guest state. I hope my patches help reduce the cost of the guest->host
exits and of the hw-threads cooperating to exit.

My test case forks 4 child processes on the guest (matching the 4
hw-threads), and each of them randomly accesses a page-aligned area.
I would welcome suggestions about the test case, so that when I have
time I can improve and finish the testing.

Thanks,
Fan
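
For reference, the program below is plain POSIX C; it can be built with "gcc -O2 -o pg_random_access pg_random_access.c -lrt" and run as "./pg_random_access 60 4 200" (60 seconds, 4 children, 200MB per child). The exact compiler flags are an assumption; the -lrt is only needed for shm_open() on older glibc.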

--- test case: usage: pg_random_access  secs  fork_num  mem_size---
#include <ctype.h>
#include <errno.h>
#include <libgen.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <signal.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/timerfd.h>
#include <time.h>
#include <stdint.h>        /* Definition of uint64_t */
#include <poll.h>


/* stop command, shm segment name and page size used by the test */
#define CMD_STOP 0x1234
#define SHM_FNAME "/numafault_shm"
#define PAGE_SIZE (1<<12)

/* the protocol defined on the shm */
#define SHM_CMD_OFF 0x0
#define SHM_CNT_OFF 0x1
#define SHM_MESSAGE_OFF 0x2

#define handle_error(msg) \
        do { perror(msg); exit(EXIT_FAILURE); } while (0)


void __inline__ random_access(void *region_start, int len)
{
        int *p;
        int num;

        /* page-aligned random offset; the modulo keeps the whole region
         * reachable even when len is not a power of two */
        num = random() % len;
        num &= ~(PAGE_SIZE - 1);
        p = region_start + num;
        *p = 0x654321;
}

static int numafault_body(int size_MB)
{
        /* MB sizes are always PAGE_SIZE aligned, so testing one fault per page is fine */
        int size = size_MB*1024*1024;
        void *region_start = malloc(size);
        unsigned long *pmap;
        int shm_fid;
        unsigned long cnt = 0;
        pid_t pid = getpid();
        char *dst;
        char buf[128];

        shm_fid = shm_open(SHM_FNAME, O_RDWR, S_IRUSR | S_IWUSR);
        /*
         * The parent already sized the segment to PAGE_SIZE; do not shrink
         * it here, and map the whole page so the message area at
         * SHM_MESSAGE_OFF stays within the file.
         */
        pmap = mmap(NULL, PAGE_SIZE, PROT_WRITE | PROT_READ,
                    MAP_SHARED, shm_fid, 0);
        if (pmap == MAP_FAILED) {
                printf("child fail to setup mmap of shm\n");
                return -1;
        }

        while (*(pmap+SHM_CMD_OFF) != CMD_STOP){
                random_access(region_start, size);
                cnt++;
        }

        __atomic_fetch_add((pmap+SHM_CNT_OFF), cnt, __ATOMIC_SEQ_CST);
        dst = (char *)(pmap+SHM_MESSAGE_OFF);
        //tofix, need lock
        sprintf(buf, "child [%i] cnt=%lu\n", pid, cnt);
        strcat(dst, buf);

        munmap(pmap, PAGE_SIZE);
        /* the parent does the final shm_unlink() */
        fprintf(stdout, "[%i] cnt=%lu\n", pid, cnt);
        fflush(stdout);
        exit(0);

}

int main(int argc, char **argv)
{
        int i;
        pid_t pid;
        int shm_fid;
        unsigned long *pmap;
        int fork_num;
        int size;
        char *dst_info;

        struct itimerspec new_value;
        int fd;
        struct timespec now;
        uint64_t exp, tot_exp;
        ssize_t s;
        struct pollfd pfd;
        int elapsed;

        if (argc != 4){
            fprintf(stderr, "%s wait-secs [secs elapsed before parent asks the children to exit]\n \
                    fork-num [child num]\n \
                    size [memory region covered by each child in MB]\n",
                    argv[0]);
            exit(EXIT_FAILURE);
        }
        elapsed = atoi(argv[1]);
        fork_num = atoi(argv[2]);
        size = atoi(argv[3]);
        printf("fork %i child process to test mem %i MB for a period: %i sec\n",
                fork_num, size, elapsed);

        fd = timerfd_create(CLOCK_REALTIME, 0);
        if (fd == -1)
            handle_error("timerfd_create");


        shm_fid = shm_open(SHM_FNAME, O_CREAT | O_RDWR, S_IRUSR | S_IWUSR);
        ftruncate(shm_fid, PAGE_SIZE);
        pmap = mmap(NULL, PAGE_SIZE, PROT_WRITE | PROT_READ,
                    MAP_SHARED, shm_fid, 0);
        if (pmap == MAP_FAILED) {
                printf("fail to setup mmap of shm\n");
                return -1;
        }
        /* clear the command word, the counter and the message area */
        memset(pmap, 0, PAGE_SIZE);
        //wmb();

        for (i = 0; i < fork_num; i++){
                switch (pid = fork())
                {
                case 0:            /* child */
                        numafault_body(size);
                        exit(0);
                case -1:           /* error */
                        fprintf(stderr, "fork failed: %s\n", strerror(errno));
                        break;
                default:           /* parent */
                        printf("fork child [%i]\n", pid);
                }
        }

        if (clock_gettime(CLOCK_REALTIME, &now) == -1)
                handle_error("clock_gettime");

        /* Create a CLOCK_REALTIME absolute timer with initial expiration
         * and interval as specified on the command line */

        new_value.it_value.tv_sec = now.tv_sec + elapsed;
        new_value.it_value.tv_nsec = now.tv_nsec;
        new_value.it_interval.tv_sec = 0;
        new_value.it_interval.tv_nsec = 0;

        if (timerfd_settime(fd, TFD_TIMER_ABSTIME, &new_value, NULL) == -1)
                handle_error("timerfd_settime");

        pfd.fd = fd;
        pfd.events = POLLIN;
        pfd.revents = 0;
        /* -1: infinite wait */
        poll(&pfd, 1, -1);



        /* ask children to stop and get back cnt */

        *(pmap + SHM_CMD_OFF) = CMD_STOP;

        /* reap every child so all counters have been added to the total */
        for (i = 0; i < fork_num; i++)
                wait(NULL);
        dst_info = (char *)(pmap + SHM_MESSAGE_OFF);
        printf("%s", dst_info);
        printf("total cnt:%lu\n", *(pmap + SHM_CNT_OFF));

        munmap(pmap, PAGE_SIZE);
        shm_unlink(SHM_FNAME);
}




On Mon, Jan 20, 2014 at 10:48 PM, Alexander Graf <agraf@suse.de> wrote:
>
> On 15.01.2014, at 07:36, Liu ping fan <kernelfans@gmail.com> wrote:
>
>> On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf <agraf@suse.de> wrote:
>>>
>>> On 11.12.2013, at 09:47, Liu Ping Fan <kernelfans@gmail.com> wrote:
>>>
>>>> This series is based on Aneesh's series  "[PATCH -V2 0/5] powerpc: mm: Numa faults support for ppc64"
>>>>
>>>> For this series, I apply the same idea from the previous thread "[PATCH 0/3] optimize for powerpc _PAGE_NUMA"
>>>> (for which, I still try to get a machine to show nums)
>>>>
>>>> But for this series, I think that I have a good justification -- the fact of heavy cost when switching context between guest and host,
>>>> which is  well known.
>>>
>>> This cover letter isn't really telling me anything. Please put a proper description of what you're trying to achieve, why you're trying to achieve what you're trying and convince your readers that it's a good idea to do it the way you do it.
>>>
>> Sorry for the unclear message. After introducing the _PAGE_NUMA,
>> kvmppc_do_h_enter() can not fill up the hpte for guest. Instead, it
>> should rely on host's kvmppc_book3s_hv_page_fault() to call
>> do_numa_page() to do the numa fault check. This incurs the overhead
>> when exiting from rmode to vmode.  My idea is that in
>> kvmppc_do_h_enter(), we do a quick check, if the page is right placed,
>> there is no need to exit to vmode (i.e saving htab, slab switching)
>>
>>>> If my suppose is correct, will CCing kvm@vger.kernel.org from next version.
>>>
>>> This translates to me as "This is an RFC"?
>>>
>> Yes, I am not quite sure about it. I have no bare-metal to verify it.
>> So I hope at least, from the theory, it is correct.
>
> Paul, could you please give this some thought and maybe benchmark it?
>
>
> Alex
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread

Thread overview: 36+ messages
2013-12-11  8:47 [PATCH 0/4] powernv: kvm: numa fault improvement Liu Ping Fan
2013-12-11  8:47 ` Liu Ping Fan
2013-12-11  8:47 ` [PATCH 1/4] mm: export numa_migrate_prep() Liu Ping Fan
2013-12-11  8:47   ` Liu Ping Fan
2013-12-11  8:47 ` [PATCH 2/4] powernv: kvm: make _PAGE_NUMA take effect Liu Ping Fan
2013-12-11  8:47   ` Liu Ping Fan
2014-01-20 15:22   ` Aneesh Kumar K.V
2014-01-20 15:34     ` Aneesh Kumar K.V
2013-12-11  8:47 ` [PATCH 3/4] powernv: kvm: extend input param for lookup_linux_pte Liu Ping Fan
2013-12-11  8:47   ` Liu Ping Fan
2013-12-11  8:47 ` [PATCH 4/4] powernv: kvm: make the handling of _PAGE_NUMA faster for guest Liu Ping Fan
2013-12-11  8:47   ` Liu Ping Fan
2014-01-09 12:08 ` [PATCH 0/4] powernv: kvm: numa fault improvement Alexander Graf
2014-01-09 12:08   ` Alexander Graf
2014-01-15  6:36   ` Liu ping fan
2014-01-15  6:36     ` Liu ping fan
2014-01-20 14:48     ` Alexander Graf
2014-01-20 14:48       ` Alexander Graf
2014-01-21 11:22       ` Paul Mackerras
2014-01-21 11:22         ` Paul Mackerras
2014-01-22  5:18         ` Aneesh Kumar K.V
2014-01-22  5:30           ` Aneesh Kumar K.V
2014-01-22  8:33           ` Liu ping fan
2014-01-22  8:33             ` Liu ping fan
2014-02-26  3:09       ` Liu ping fan
2014-02-26  3:09         ` Liu ping fan
2014-01-20 15:45     ` Aneesh Kumar K.V
2014-01-20 15:57       ` Aneesh Kumar K.V
2014-01-21  2:30       ` Liu ping fan
2014-01-21  2:30         ` Liu ping fan
2014-01-21  3:40         ` Aneesh Kumar K.V
2014-01-21  3:52           ` Aneesh Kumar K.V
2014-01-21  9:07           ` Liu ping fan
2014-01-21  9:07             ` Liu ping fan
2014-01-21  9:11             ` Liu ping fan
2014-01-21  9:11               ` Liu ping fan
