All of lore.kernel.org
* Fix NUMA problems in transparent hugepages and KSM
@ 2011-02-21 19:07 ` Andi Kleen
  0 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2011-02-21 19:07 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, linux-kernel, aarcange, lwoodman

The current transparent hugepage daemon (khugepaged) can mess up
local memory affinity on NUMA systems. When it copies memory into a
huge page it does not necessarily keep the new page on the same node
as the original, locally allocated small pages.

While fixing this I also found some more related issues:
- The NUMA policy interleaving for THP was using the small
page size, not the huge page size.
- KSM and THP copies also did not preserve the local node.
- The accounting for local/remote allocations in the daemon
was misleading.
- There were no VM statistics counters for THP, which made it
impossible to analyze its behaviour.
 
At least some of the bug fixes are 2.6.38 candidates IMHO, because
some of the NUMA problems are pretty bad and can cause performance
problems in some workloads.

What can be delayed are the __GFP_OTHER_NODE and statistics changes.

Git tree:

  git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc-2.6.git thp-numa

Andi Kleen (8):
      Fix interleaving for transparent hugepages
      Change alloc_pages_vma to pass down the policy node for local policy
      Preserve local node for KSM copies
      Preserve original node for transparent huge page copies
      Use correct numa policy node for transparent hugepages
      Add __GFP_OTHER_NODE flag
      Use GFP_OTHER_NODE for transparent huge pages
      Add VM counters for transparent hugepages

 include/linux/gfp.h    |   13 ++++++++---
 include/linux/vmstat.h |   11 ++++++++-
 mm/huge_memory.c       |   49 +++++++++++++++++++++++++++++++++--------------
 mm/ksm.c               |    3 +-
 mm/mempolicy.c         |   16 +++++++-------
 mm/page_alloc.c        |    2 +-
 mm/vmstat.c            |   17 ++++++++++++++-
 7 files changed, 78 insertions(+), 33 deletions(-)

-Andi

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 1/8] Fix interleaving for transparent hugepages
  2011-02-21 19:07 ` Andi Kleen
@ 2011-02-21 19:07   ` Andi Kleen
  -1 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2011-02-21 19:07 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, linux-kernel, aarcange, lwoodman, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Bugfix, independent of the rest of the series.

The THP code didn't pass the correct interleaving shift to the memory
policy code. Fix this here by adjusting for the order.
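
For context, a simplified sketch of what interleave_nid() does with
the shift argument (simplified; the real helper also folds in
vma->vm_pgoff), which is why the shift has to reflect the huge page
granularity rather than the base page size:

	/* illustrative only, not part of the patch */
	off = (addr - vma->vm_start) >> shift;	/* index in units of 1 << shift */
	nid = offset_il_node(pol, vma, off);	/* map the index onto the interleave node set */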

Cc: aarcange@redhat.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 mm/mempolicy.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 368fc9d..76c51b7 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1830,7 +1830,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 	if (unlikely(pol->mode == MPOL_INTERLEAVE)) {
 		unsigned nid;
 
-		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT);
+		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT << order);
 		mpol_cond_put(pol);
 		page = alloc_page_interleave(gfp, order, nid);
 		put_mems_allowed();
-- 
1.7.4


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 2/8] Change alloc_pages_vma to pass down the policy node for local policy
  2011-02-21 19:07 ` Andi Kleen
@ 2011-02-21 19:07   ` Andi Kleen
  -1 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2011-02-21 19:07 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, linux-kernel, aarcange, lwoodman, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Currently alloc_pages_vma() always uses the local node as the policy
node for the LOCAL policy. Pass this node down as an argument instead.

No behaviour change from this patch, but it will be needed for the
follow-on patches.

Cc: aarcange@redhat.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 include/linux/gfp.h |    9 +++++----
 mm/huge_memory.c    |    2 +-
 mm/mempolicy.c      |   11 +++++------
 3 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 0b84c61..782e74a 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -332,16 +332,17 @@ alloc_pages(gfp_t gfp_mask, unsigned int order)
 	return alloc_pages_current(gfp_mask, order);
 }
 extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
-			struct vm_area_struct *vma, unsigned long addr);
+		    	struct vm_area_struct *vma, unsigned long addr,
+			int node);
 #else
 #define alloc_pages(gfp_mask, order) \
 		alloc_pages_node(numa_node_id(), gfp_mask, order)
-#define alloc_pages_vma(gfp_mask, order, vma, addr)	\
+#define alloc_pages_vma(gfp_mask, order, vma, addr, node)	\
 	alloc_pages(gfp_mask, order)
 #endif
 #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
-#define alloc_page_vma(gfp_mask, vma, addr)	\
-	alloc_pages_vma(gfp_mask, 0, vma, addr)
+#define alloc_page_vma(gfp_mask, vma, addr)			\
+	alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id())
 
 extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
 extern unsigned long get_zeroed_page(gfp_t gfp_mask);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e62ddb8..73ecca5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -653,7 +653,7 @@ static inline struct page *alloc_hugepage_vma(int defrag,
 					      unsigned long haddr)
 {
 	return alloc_pages_vma(alloc_hugepage_gfpmask(defrag),
-			       HPAGE_PMD_ORDER, vma, haddr);
+			       HPAGE_PMD_ORDER, vma, haddr, numa_node_id());
 }
 
 #ifndef CONFIG_NUMA
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 76c51b7..d3d1e747 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1524,10 +1524,9 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
 }
 
 /* Return a zonelist indicated by gfp for node representing a mempolicy */
-static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy)
+static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
+	int nd)
 {
-	int nd = numa_node_id();
-
 	switch (policy->mode) {
 	case MPOL_PREFERRED:
 		if (!(policy->flags & MPOL_F_LOCAL))
@@ -1679,7 +1678,7 @@ struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
 		zl = node_zonelist(interleave_nid(*mpol, vma, addr,
 				huge_page_shift(hstate_vma(vma))), gfp_flags);
 	} else {
-		zl = policy_zonelist(gfp_flags, *mpol);
+		zl = policy_zonelist(gfp_flags, *mpol, numa_node_id());
 		if ((*mpol)->mode == MPOL_BIND)
 			*nodemask = &(*mpol)->v.nodes;
 	}
@@ -1820,7 +1819,7 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
  */
 struct page *
 alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
-		unsigned long addr)
+		unsigned long addr, int node)
 {
 	struct mempolicy *pol = get_vma_policy(current, vma, addr);
 	struct zonelist *zl;
@@ -1836,7 +1835,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		put_mems_allowed();
 		return page;
 	}
-	zl = policy_zonelist(gfp, pol);
+	zl = policy_zonelist(gfp, pol, node);
 	if (unlikely(mpol_needs_cond_ref(pol))) {
 		/*
 		 * slow path: ref counted shared policy
-- 
1.7.4


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 3/8] Preserve local node for KSM copies
  2011-02-21 19:07 ` Andi Kleen
@ 2011-02-21 19:07   ` Andi Kleen
  -1 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2011-02-21 19:07 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, linux-kernel, aarcange, lwoodman, Andi Kleen, arcange

From: Andi Kleen <ak@linux.intel.com>

Add an alloc_page_vma_node() that allows passing the "local" node in.
Use it in KSM to allocate copy pages on the same node as the original
page where possible.

Cc: arcange@redhat.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 include/linux/gfp.h |    2 ++
 mm/ksm.c            |    3 ++-
 2 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 782e74a..814d50e 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -343,6 +343,8 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
 #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
 #define alloc_page_vma(gfp_mask, vma, addr)			\
 	alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id())
+#define alloc_page_vma_node(gfp_mask, vma, addr, node)		\
+	alloc_pages_vma(gfp_mask, 0, vma, addr, node)
 
 extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
 extern unsigned long get_zeroed_page(gfp_t gfp_mask);
diff --git a/mm/ksm.c b/mm/ksm.c
index c2b2a94..fc83335 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1575,7 +1575,8 @@ struct page *ksm_does_need_to_copy(struct page *page,
 {
 	struct page *new_page;
 
-	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+	new_page = alloc_page_vma_node(GFP_HIGHUSER_MOVABLE, vma, address,
+				       page_to_nid(page));
 	if (new_page) {
 		copy_user_highpage(new_page, page, address, vma);
 
-- 
1.7.4


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 4/8] Preserve original node for transparent huge page copies
  2011-02-21 19:07 ` Andi Kleen
@ 2011-02-21 19:07   ` Andi Kleen
  -1 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2011-02-21 19:07 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, linux-kernel, aarcange, lwoodman, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

This makes a difference for the LOCAL policy, where the node cannot
be determined from the policy itself, but has to be taken from the
original page.

Cc: aarcange@redhat.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 mm/huge_memory.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 73ecca5..00a5c39 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -799,8 +799,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	}
 
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
-		pages[i] = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
-					  vma, address);
+		pages[i] = alloc_page_vma_node(GFP_HIGHUSER_MOVABLE,
+					       vma, address, page_to_nid(page));
 		if (unlikely(!pages[i] ||
 			     mem_cgroup_newpage_charge(pages[i], mm,
 						       GFP_KERNEL))) {
-- 
1.7.4


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 5/8] Use correct numa policy node for transparent hugepages
  2011-02-21 19:07 ` Andi Kleen
@ 2011-02-21 19:07   ` Andi Kleen
  -1 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2011-02-21 19:07 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, linux-kernel, aarcange, lwoodman, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Pass down the correct node for a transparent hugepage allocation.
Most callers continue to use the current node; however, khugepaged
now uses the node of the first page to be collapsed instead. This
ensures that khugepaged does not mess up local memory for an
existing process which uses local policy.

The choice of node is somewhat primitive currently: it just uses the
node of the first page in the pmd range. An alternative would be to
look at multiple pages and use the most popular node. I used the
simplest variant for now, which should work well enough for the
common case of all pages being on the same node.
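
A rough sketch of the most-popular-node alternative, for illustration
only (the node_load array and its placement are hypothetical and not
part of this series):

	/* hypothetical: reset at the start of each scan, then tally the
	 * node of every small page while khugepaged_scan_pmd() walks
	 * the pte range
	 */
	static int node_load[MAX_NUMNODES];

	/* ... inside the pte scan loop: */
	node_load[page_to_nid(page)]++;

	/* ... after the scan, pick the node that holds the most pages: */
	int nid, max = 0, node = numa_node_id();
	for (nid = 0; nid < MAX_NUMNODES; nid++) {
		if (node_load[nid] > max) {
			max = node_load[nid];
			node = nid;
		}
	}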

Cc: aarcange@redhat.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 mm/huge_memory.c |   24 +++++++++++++++++-------
 mm/mempolicy.c   |    3 ++-
 2 files changed, 19 insertions(+), 8 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 00a5c39..5a05b35 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -650,10 +650,10 @@ static inline gfp_t alloc_hugepage_gfpmask(int defrag)
 
 static inline struct page *alloc_hugepage_vma(int defrag,
 					      struct vm_area_struct *vma,
-					      unsigned long haddr)
+					      unsigned long haddr, int nd)
 {
 	return alloc_pages_vma(alloc_hugepage_gfpmask(defrag),
-			       HPAGE_PMD_ORDER, vma, haddr, numa_node_id());
+			       HPAGE_PMD_ORDER, vma, haddr, nd);
 }
 
 #ifndef CONFIG_NUMA
@@ -678,7 +678,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (unlikely(khugepaged_enter(vma)))
 			return VM_FAULT_OOM;
 		page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
-					  vma, haddr);
+					  vma, haddr, numa_node_id());
 		if (unlikely(!page))
 			goto out;
 		if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
@@ -902,7 +902,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (transparent_hugepage_enabled(vma) &&
 	    !transparent_hugepage_debug_cow())
 		new_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
-					      vma, haddr);
+					      vma, haddr, numa_node_id());
 	else
 		new_page = NULL;
 
@@ -1745,7 +1745,8 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
 static void collapse_huge_page(struct mm_struct *mm,
 			       unsigned long address,
 			       struct page **hpage,
-			       struct vm_area_struct *vma)
+			       struct vm_area_struct *vma,
+			       int node)
 {
 	pgd_t *pgd;
 	pud_t *pud;
@@ -1773,7 +1774,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 * mmap_sem in read mode is good idea also to allow greater
 	 * scalability.
 	 */
-	new_page = alloc_hugepage_vma(khugepaged_defrag(), vma, address);
+	new_page = alloc_hugepage_vma(khugepaged_defrag(), vma, address,
+				      node);
 	if (unlikely(!new_page)) {
 		up_read(&mm->mmap_sem);
 		*hpage = ERR_PTR(-ENOMEM);
@@ -1917,6 +1919,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 	struct page *page;
 	unsigned long _address;
 	spinlock_t *ptl;
+	int node = -1;
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
@@ -1947,6 +1950,13 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		page = vm_normal_page(vma, _address, pteval);
 		if (unlikely(!page))
 			goto out_unmap;
+		/*
+		 * Choose the node of the first page. This could
+		 * be more sophisticated and look at more pages,
+		 * but isn't for now.
+		 */
+		if (node == -1)
+			node = page_to_nid(page);
 		VM_BUG_ON(PageCompound(page));
 		if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
 			goto out_unmap;
@@ -1963,7 +1973,7 @@ out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret)
 		/* collapse_huge_page will return with the mmap_sem released */
-		collapse_huge_page(mm, address, hpage, vma);
+		collapse_huge_page(mm, address, hpage, vma, node);
 out:
 	return ret;
 }
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d3d1e747..0e7515a 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1891,7 +1891,8 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
 	else
 		page = __alloc_pages_nodemask(gfp, order,
-			policy_zonelist(gfp, pol), policy_nodemask(gfp, pol));
+	      			policy_zonelist(gfp, pol, numa_node_id()), 
+				policy_nodemask(gfp, pol));
 	put_mems_allowed();
 	return page;
 }
-- 
1.7.4


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 6/8] Add __GFP_OTHER_NODE flag
  2011-02-21 19:07 ` Andi Kleen
@ 2011-02-21 19:07   ` Andi Kleen
  -1 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2011-02-21 19:07 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, linux-kernel, aarcange, lwoodman, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Add a new __GFP_OTHER_NODE flag to tell the low level NUMA statistics
in zone_statistics() that an allocation is done on behalf of another
thread. This way the local and remote counters can still be correct,
even when background daemons like khugepaged are changing memory
mappings.
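
To illustrate the accounting with a hypothetical example (not taken
from the patch): khugepaged runs on a CPU of node 0 and collapses
pages for a task whose memory lives on node 1, and the huge page is
allocated on node 1 (the preferred zone). Without __GFP_OTHER_NODE
the allocation is compared against khugepaged's own node and counted
as NUMA_OTHER; with the flag it is compared against
preferred_zone->node and counted as NUMA_LOCAL, which matches what
the task actually sees.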

This only affects the accounting, but I think it's worth doing that
right to avoid confusing users.

I first tried to just pass down the right node, but this required
a lot of changes to pass down this parameter and at least one
addition of a 10th argument to a 9 argument function. Using
the flag is a lot less intrusive.

Open question: should this also be used for migration?

Cc: aarcange@redhat.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 include/linux/gfp.h    |    2 ++
 include/linux/vmstat.h |    4 ++--
 mm/page_alloc.c        |    2 +-
 mm/vmstat.c            |    9 +++++++--
 4 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 814d50e..a064724 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -35,6 +35,7 @@ struct vm_area_struct;
 #define ___GFP_NOTRACK		0
 #endif
 #define ___GFP_NO_KSWAPD	0x400000u
+#define ___GFP_OTHER_NODE	0x800000u
 
 /*
  * GFP bitmasks..
@@ -83,6 +84,7 @@ struct vm_area_struct;
 #define __GFP_NOTRACK	((__force gfp_t)___GFP_NOTRACK)  /* Don't track with kmemcheck */
 
 #define __GFP_NO_KSWAPD	((__force gfp_t)___GFP_NO_KSWAPD)
+#define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
 
 /*
  * This may seem redundant, but it's a way of annotating false positives vs.
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 833e676..9b5c63d 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -220,12 +220,12 @@ static inline unsigned long node_page_state(int node,
 		zone_page_state(&zones[ZONE_MOVABLE], item);
 }
 
-extern void zone_statistics(struct zone *, struct zone *);
+extern void zone_statistics(struct zone *, struct zone *, gfp_t gfp);
 
 #else
 
 #define node_page_state(node, item) global_page_state(item)
-#define zone_statistics(_zl,_z) do { } while (0)
+#define zone_statistics(_zl,_z, gfp) do { } while (0)
 
 #endif /* CONFIG_NUMA */
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a873e61..4ce06a6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1333,7 +1333,7 @@ again:
 	}
 
 	__count_zone_vm_events(PGALLOC, zone, 1 << order);
-	zone_statistics(preferred_zone, zone);
+	zone_statistics(preferred_zone, zone, gfp_flags);
 	local_irq_restore(flags);
 
 	VM_BUG_ON(bad_range(zone, page));
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 0c3b504..2b461ed 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -500,8 +500,12 @@ void refresh_cpu_vm_stats(int cpu)
  * z 	    = the zone from which the allocation occurred.
  *
  * Must be called with interrupts disabled.
+ * 
+ * When __GFP_OTHER_NODE is set assume the node of the preferred
+ * zone is the local node. This is useful for daemons who allocate
+ * memory on behalf of other processes.
  */
-void zone_statistics(struct zone *preferred_zone, struct zone *z)
+void zone_statistics(struct zone *preferred_zone, struct zone *z, gfp_t flags)
 {
 	if (z->zone_pgdat == preferred_zone->zone_pgdat) {
 		__inc_zone_state(z, NUMA_HIT);
@@ -509,7 +513,8 @@ void zone_statistics(struct zone *preferred_zone, struct zone *z)
 		__inc_zone_state(z, NUMA_MISS);
 		__inc_zone_state(preferred_zone, NUMA_FOREIGN);
 	}
-	if (z->node == numa_node_id())
+	if (z->node == ((flags & __GFP_OTHER_NODE) ? 
+			preferred_zone->node : numa_node_id()))
 		__inc_zone_state(z, NUMA_LOCAL);
 	else
 		__inc_zone_state(z, NUMA_OTHER);
-- 
1.7.4


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 7/8] Use GFP_OTHER_NODE for transparent huge pages
  2011-02-21 19:07 ` Andi Kleen
@ 2011-02-21 19:07   ` Andi Kleen
  -1 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2011-02-21 19:07 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, linux-kernel, aarcange, lwoodman, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Pass __GFP_OTHER_NODE for transparent hugepage NUMA allocations
done by the khugepaged daemon. This way the low level accounting
for local versus remote pages works correctly.

Cc: aarcange@redhat.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 mm/huge_memory.c |   18 ++++++++++--------
 1 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5a05b35..877756e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -643,16 +643,17 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 	return ret;
 }
 
-static inline gfp_t alloc_hugepage_gfpmask(int defrag)
+static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
 {
-	return GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT);
+	return (GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT)) | extra_gfp;
 }
 
 static inline struct page *alloc_hugepage_vma(int defrag,
 					      struct vm_area_struct *vma,
-					      unsigned long haddr, int nd)
+					      unsigned long haddr, int nd,
+					      gfp_t extra_gfp)
 {
-	return alloc_pages_vma(alloc_hugepage_gfpmask(defrag),
+	return alloc_pages_vma(alloc_hugepage_gfpmask(defrag, extra_gfp),
 			       HPAGE_PMD_ORDER, vma, haddr, nd);
 }
 
@@ -678,7 +679,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (unlikely(khugepaged_enter(vma)))
 			return VM_FAULT_OOM;
 		page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
-					  vma, haddr, numa_node_id());
+					  vma, haddr, numa_node_id(), 0);
 		if (unlikely(!page))
 			goto out;
 		if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
@@ -799,7 +800,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	}
 
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
-		pages[i] = alloc_page_vma_node(GFP_HIGHUSER_MOVABLE,
+		pages[i] = alloc_page_vma_node(GFP_HIGHUSER_MOVABLE | 
+					       __GFP_OTHER_NODE,
 					       vma, address, page_to_nid(page));
 		if (unlikely(!pages[i] ||
 			     mem_cgroup_newpage_charge(pages[i], mm,
@@ -902,7 +904,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (transparent_hugepage_enabled(vma) &&
 	    !transparent_hugepage_debug_cow())
 		new_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
-					      vma, haddr, numa_node_id());
+					      vma, haddr, numa_node_id(), 0);
 	else
 		new_page = NULL;
 
@@ -1775,7 +1777,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 * scalability.
 	 */
 	new_page = alloc_hugepage_vma(khugepaged_defrag(), vma, address,
-				      node);
+				      node, __GFP_OTHER_NODE);
 	if (unlikely(!new_page)) {
 		up_read(&mm->mmap_sem);
 		*hpage = ERR_PTR(-ENOMEM);
-- 
1.7.4


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 8/8] Add VM counters for transparent hugepages
  2011-02-21 19:07 ` Andi Kleen
@ 2011-02-21 19:07   ` Andi Kleen
  -1 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2011-02-21 19:07 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, linux-kernel, aarcange, lwoodman, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

I found it difficult to make sense of transparent huge pages without
having any counters for their actions. Add some counters to vmstat
for allocations of transparent hugepages and for fallbacks to smaller
pages.

Optional patch, but useful for development and understanding the system.
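
Once applied, the new counters show up in /proc/vmstat under the
names added to vmstat_text (thp_direct_alloc, thp_daemon_alloc,
thp_direct_fallback, thp_daemon_alloc_failed, thp_split), so they can
be read with e.g. "grep thp_ /proc/vmstat" or any tool that parses
/proc/vmstat.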

Cc: aarcange@redhat.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 include/linux/vmstat.h |    7 +++++++
 mm/huge_memory.c       |   13 ++++++++++---
 mm/vmstat.c            |    8 ++++++++
 3 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 9b5c63d..7794d1a7 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -58,6 +58,13 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		UNEVICTABLE_PGCLEARED,	/* on COW, page truncate */
 		UNEVICTABLE_PGSTRANDED,	/* unable to isolate on unlock */
 		UNEVICTABLE_MLOCKFREED,
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	        THP_DIRECT_ALLOC,
+		THP_DAEMON_ALLOC,	
+		THP_DIRECT_FALLBACK,	
+		THP_DAEMON_ALLOC_FAILED,
+		THP_SPLIT,
+#endif
 		NR_VM_EVENT_ITEMS
 };
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 877756e..4ef8c32 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -680,13 +680,15 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			return VM_FAULT_OOM;
 		page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
 					  vma, haddr, numa_node_id(), 0);
-		if (unlikely(!page))
+		if (unlikely(!page)) {
+			count_vm_event(THP_DIRECT_FALLBACK);
 			goto out;
+		}
 		if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
 			put_page(page);
 			goto out;
 		}
-
+		count_vm_event(THP_DIRECT_ALLOC);
 		return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page);
 	}
 out:
@@ -909,6 +911,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		new_page = NULL;
 
 	if (unlikely(!new_page)) {
+		count_vm_event(THP_DIRECT_FALLBACK);
 		ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
 						   pmd, orig_pmd, page, haddr);
 		put_page(page);
@@ -921,7 +924,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		ret |= VM_FAULT_OOM;
 		goto out;
 	}
-
+	count_vm_event(THP_DIRECT_ALLOC);
 	copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
 	__SetPageUptodate(new_page);
 
@@ -1780,6 +1783,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 				      node, __GFP_OTHER_NODE);
 	if (unlikely(!new_page)) {
 		up_read(&mm->mmap_sem);
+		count_vm_event(THP_DAEMON_ALLOC_FAILED);
 		*hpage = ERR_PTR(-ENOMEM);
 		return;
 	}
@@ -2286,6 +2290,9 @@ void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd)
 		spin_unlock(&mm->page_table_lock);
 		return;
 	}
+
+	count_vm_event(THP_SPLIT);
+
 	page = pmd_page(*pmd);
 	VM_BUG_ON(!page_count(page));
 	get_page(page);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 2b461ed..f3ab7e9 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -946,6 +946,14 @@ static const char * const vmstat_text[] = {
 	"unevictable_pgs_stranded",
 	"unevictable_pgs_mlockfreed",
 #endif
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	"thp_direct_alloc",
+	"thp_daemon_alloc",
+	"thp_direct_fallback",
+	"thp_daemon_alloc_failed",
+	"thp_split",
+#endif
 };
 
 static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
-- 
1.7.4


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/8] Fix interleaving for transparent hugepages
  2011-02-21 19:07   ` Andi Kleen
@ 2011-02-22 15:34     ` Christoph Lameter
  -1 siblings, 0 replies; 48+ messages in thread
From: Christoph Lameter @ 2011-02-22 15:34 UTC (permalink / raw)
  To: Andi Kleen; +Cc: akpm, linux-mm, linux-kernel, aarcange, lwoodman, Andi Kleen


On Mon, 21 Feb 2011, Andi Kleen wrote:

> @@ -1830,7 +1830,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  	if (unlikely(pol->mode == MPOL_INTERLEAVE)) {
>  		unsigned nid;
>
> -		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT);
> +		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT << order);

Should be PAGE_SHIFT + order.
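
For illustration (assuming x86-64 with 4K base pages and 2MB THP, so
PAGE_SHIFT = 12 and order = HPAGE_PMD_ORDER = 9): the shift argument
is a bit count, so PAGE_SHIFT << order evaluates to 12 << 9 = 6144,
while the intended value is 12 + 9 = 21, i.e. HPAGE_PMD_SHIFT:

	nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);	/* 12 + 9 = 21 */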

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/8] Change alloc_pages_vma to pass down the policy node for local policy
  2011-02-21 19:07   ` Andi Kleen
@ 2011-02-22 15:42     ` Christoph Lameter
  -1 siblings, 0 replies; 48+ messages in thread
From: Christoph Lameter @ 2011-02-22 15:42 UTC (permalink / raw)
  To: Andi Kleen; +Cc: akpm, linux-mm, linux-kernel, aarcange, lwoodman, Andi Kleen

On Mon, 21 Feb 2011, Andi Kleen wrote:

> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1524,10 +1524,9 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
>  }
>
>  /* Return a zonelist indicated by gfp for node representing a mempolicy */
> -static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy)
> +static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
> +	int nd)
>  {
> -	int nd = numa_node_id();
> -
>  	switch (policy->mode) {
>  	case MPOL_PREFERRED:
>  		if (!(policy->flags & MPOL_F_LOCAL))
> @@ -1679,7 +1678,7 @@ struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
>  		zl = node_zonelist(interleave_nid(*mpol, vma, addr,
>  				huge_page_shift(hstate_vma(vma))), gfp_flags);
>  	} else {
> -		zl = policy_zonelist(gfp_flags, *mpol);
> +		zl = policy_zonelist(gfp_flags, *mpol, numa_node_id());
>  		if ((*mpol)->mode == MPOL_BIND)
>  			*nodemask = &(*mpol)->v.nodes;
>  	}

If we do that then why not consolidate the MPOL_INTERLEAVE treatment in
policy_zonelist() as well? It looks awfully similar now and would
simplify the code and likely get rid of some functions.
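
A purely illustrative sketch of such a consolidation (not an actual patch; it reuses helpers mempolicy.c already has, but the exact shape is only an assumption):

/* Sketch: fold the MPOL_INTERLEAVE case into policy_zonelist() so callers
 * like alloc_pages_vma() and huge_zonelist() no longer special-case it. */
static struct zonelist *policy_zonelist_sketch(gfp_t gfp,
		struct mempolicy *policy, struct vm_area_struct *vma,
		unsigned long addr, int shift, int nd)
{
	switch (policy->mode) {
	case MPOL_INTERLEAVE:
		/* pick the interleave node from the vma offset */
		nd = interleave_nid(policy, vma, addr, shift);
		break;
	case MPOL_PREFERRED:
		if (!(policy->flags & MPOL_F_LOCAL))
			nd = policy->v.preferred_node;
		break;
	default:
		/* MPOL_BIND and friends keep the caller-provided node */
		break;
	}
	return node_zonelist(nd, gfp);
}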


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/8] Preserve local node for KSM copies
  2011-02-21 19:07   ` Andi Kleen
@ 2011-02-22 15:47     ` Christoph Lameter
  -1 siblings, 0 replies; 48+ messages in thread
From: Christoph Lameter @ 2011-02-22 15:47 UTC (permalink / raw)
  To: Andi Kleen
  Cc: akpm, linux-mm, linux-kernel, aarcange, lwoodman, Andi Kleen, arcange

On Mon, 21 Feb 2011, Andi Kleen wrote:

> Add a alloc_page_vma_node that allows passing the "local" node in.
> Use it in ksm to allocate copy pages on the same node as
> the original as possible.

Why would that be useful? The shared page could be on a node that is not
near the process that maps the page. Would it not be better to allocate on
the node that is local to the process that maps the page?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/8] Preserve local node for KSM copies
  2011-02-22 15:47     ` Christoph Lameter
@ 2011-02-22 16:20       ` Andrea Arcangeli
  -1 siblings, 0 replies; 48+ messages in thread
From: Andrea Arcangeli @ 2011-02-22 16:20 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andi Kleen, akpm, linux-mm, linux-kernel, lwoodman, Andi Kleen, arcange

On Tue, Feb 22, 2011 at 09:47:26AM -0600, Christoph Lameter wrote:
> On Mon, 21 Feb 2011, Andi Kleen wrote:
> 
> > Add a alloc_page_vma_node that allows passing the "local" node in.
> > Use it in ksm to allocate copy pages on the same node as
> > the original as possible.
> 
> Why would that be useful? The shared page could be on a node that is not
> near the process that maps the page. Would it not be better to allocate on
> the node that is local to the process that maps the page?

This is what I was trying to understand. To me it looks like this is
making things worse. Following the "vma" hint like the current code
does sounds much better than following the previous page, which may
have been allocated on the wrong node if the allocation from the right
node couldn't be satisfied at the time the page was first allocated
(we should still try to allocate from the right node following the vma
hint).

In the KSM case this badness is exacerbated by the fact that a KSM
page is guaranteed to be randomly NUMA-allocated, because it's shared
across all processes regardless of their vma settings. KSM is not NUMA
aware, so following the location of the KSM "page" seems a regression
compared to the current code, which at least follows the vma when
bringing up a page during swapin.

I generally have a hard time seeing how following "page" (which is
especially wrong with KSM because of the very sharing effect) is
better than following "vma".

KSM may become NUMA aware if we replicate the stable tree in each
node, but we're not even close to that, so I have a hard time seeing
how "page" hinting instead of "vma" hinting can do any good,
especially in the KSM case. I have to think more about this, but if
you have suggestions you're welcome.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/8] Fix interleaving for transparent hugepages
  2011-02-22 15:34     ` Christoph Lameter
@ 2011-02-22 16:29       ` Andrea Arcangeli
  -1 siblings, 0 replies; 48+ messages in thread
From: Andrea Arcangeli @ 2011-02-22 16:29 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andi Kleen, akpm, linux-mm, linux-kernel, lwoodman, Andi Kleen

On Tue, Feb 22, 2011 at 09:34:33AM -0600, Christoph Lameter wrote:
> 
> On Mon, 21 Feb 2011, Andi Kleen wrote:
> 
> > @@ -1830,7 +1830,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
> >  	if (unlikely(pol->mode == MPOL_INTERLEAVE)) {
> >  		unsigned nid;
> >
> > -		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT);
> > +		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT << order);
> 
> Should be PAGE_SHIFT + order.

This one is very good after changing it to + order. I had updated
alloc_page_interleave to take an order parameter but didn't adjust
the nid according to the order. This was my incomplete modification
for hugepage interleaving:

@@ -1830,7 +1832,7 @@ alloc_page_vma(gfp_t gfp, struct vm_area_struct *vma, unsigned long addr)
 
                nid = interleave_nid(pol, vma, addr, PAGE_SHIFT);
                mpol_cond_put(pol);
-               page = alloc_page_interleave(gfp, 0, nid);
+               page = alloc_page_interleave(gfp, order, nid);
                put_mems_allowed();
                return page;
        }


Andi, can you resubmit this one, fixed with +, to Andrew? This one can
go in 2.6.38. For the rest, frankly, I have a hard time seeing how it
cannot hurt performance (instead of improving it), especially for KSM.
It's impossible to improve KSM for NUMA with that change to
ksm_does_need_to_copy, at the very least, and the same reasoning
applies to the rest. I'll think more about the others; I would just
prefer to include the above fix quickly, since the rest doesn't seem
that urgent even if it really is improving performance (instead of
hurting it, as I suspect).

Acked-by: Andrea Arcangeli <aarcange@redhat.com>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 8/8] Add VM counters for transparent hugepages
  2011-02-21 19:07   ` Andi Kleen
@ 2011-02-22 16:36     ` Dave Hansen
  -1 siblings, 0 replies; 48+ messages in thread
From: Dave Hansen @ 2011-02-22 16:36 UTC (permalink / raw)
  To: Andi Kleen; +Cc: akpm, linux-mm, linux-kernel, aarcange, lwoodman, Andi Kleen

On Mon, 2011-02-21 at 11:07 -0800, Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> 
> I found it difficult to make sense of transparent huge pages without
> having any counters for its actions. Add some counters to vmstat
> for allocation of transparent hugepages and fallback to smaller
> pages.
> 
> Optional patch, but useful for development and understanding the system.

Very nice.  I did the same thing, splits-only.  I also found this stuff
a must-have for trying to do any work with transparent hugepages.  It's
just impossible otherwise.

Acked-by: Dave Hansen <dave@linux.vnet.ibm.com>



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 8/8] Add VM counters for transparent hugepages
  2011-02-22 16:36     ` Dave Hansen
@ 2011-02-22 16:43       ` Andrea Arcangeli
  -1 siblings, 0 replies; 48+ messages in thread
From: Andrea Arcangeli @ 2011-02-22 16:43 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andi Kleen, akpm, linux-mm, linux-kernel, lwoodman, Andi Kleen

On Tue, Feb 22, 2011 at 08:36:26AM -0800, Dave Hansen wrote:
> On Mon, 2011-02-21 at 11:07 -0800, Andi Kleen wrote:
> > From: Andi Kleen <ak@linux.intel.com>
> > 
> > I found it difficult to make sense of transparent huge pages without
> > having any counters for its actions. Add some counters to vmstat
> > for allocation of transparent hugepages and fallback to smaller
> > pages.
> > 
> > Optional patch, but useful for development and understanding the system.
> 
> Very nice.  I did the same thing, splits-only.  I also found this stuff
> a must-have for trying to do any work with transparent hugepages.  It's
> just impossible otherwise.

This patch is good too. I think 1 and 8 can go in; patch 1 is high
priority.

For patches 2-5 I have a hard time seeing how they're not hurting
performance instead of improving it. Patch 3 in particular looks dead
wrong and has no chance given the very design of KSM and the random
(vma-ignoring) NUMA affinity of PageKSM; patch 3 is most certainly a
regression.

Patches 6-7 I haven't evaluated yet; they look good, even if low
priority.

Acked-by: Andrea Arcangeli <aarcange@redhat.com>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/8] Fix interleaving for transparent hugepages
  2011-02-22 15:34     ` Christoph Lameter
@ 2011-02-22 17:11       ` Andi Kleen
  -1 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2011-02-22 17:11 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andi Kleen, akpm, linux-mm, linux-kernel, aarcange, lwoodman

On 2/22/2011 7:34 AM, Christoph Lameter wrote:
> On Mon, 21 Feb 2011, Andi Kleen wrote:
>
>> @@ -1830,7 +1830,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>>   	if (unlikely(pol->mode == MPOL_INTERLEAVE)) {
>>   		unsigned nid;
>>
>> -		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT);
>> +		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT<<  order);
> Should be PAGE_SHIFT + order.

Oops. True. Thanks.

-Andi


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/8] Preserve local node for KSM copies
  2011-02-22 16:20       ` Andrea Arcangeli
@ 2011-02-22 17:15         ` Andi Kleen
  -1 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2011-02-22 17:15 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Christoph Lameter, Andi Kleen, akpm, linux-mm, linux-kernel,
	lwoodman, arcange

On 2/22/2011 8:20 AM, Andrea Arcangeli wrote:
> On Tue, Feb 22, 2011 at 09:47:26AM -0600, Christoph Lameter wrote:
>> On Mon, 21 Feb 2011, Andi Kleen wrote:
>>
>>> Add a alloc_page_vma_node that allows passing the "local" node in.
>>> Use it in ksm to allocate copy pages on the same node as
>>> the original as possible.
>> Why would that be useful? The shared page could be on a node that is not
>> near the process that maps the page. Would it not be better to allocate on
>> the node that is local to the process that maps the page?
> This is what I was trying to understand. To me it looks like this is
> making things worse. Following the "vma" advice like current code

The problem is: most people use the local policy, and for that the
correct node is the node the memory got allocated on.

You cannot get that information from the vma, because the vma just
says "local".

If you don't use that node the daemon will arbitrarily destroy memory
locality. In fact this has happened. This is a really bad regression
caused by THP.

Now, full KSM locality is more difficult (obviously you can have a
conflict), but keeping the copy on the original node is still the most
predictable choice. At least you won't make anything worse for that
process.
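
A minimal sketch of the idea, assuming the alloc_page_vma_node() helper this series introduces; the exact signature is taken from the patch description, not the literal code:

#include <linux/mm.h>
#include <linux/gfp.h>

/* Sketch: allocate the replacement page on the node the original data
 * already lives on, instead of on khugepaged's/ksmd's current node. */
static struct page *copy_on_same_node(struct page *orig,
		struct vm_area_struct *vma, unsigned long addr)
{
	int nid = page_to_nid(orig);	/* node the data was allocated on */

	return alloc_page_vma_node(GFP_HIGHUSER_MOVABLE, vma, addr, nid);
}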

-Andi


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 8/8] Add VM counters for transparent hugepages
  2011-02-22 16:43       ` Andrea Arcangeli
@ 2011-02-22 18:02         ` Andi Kleen
  -1 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2011-02-22 18:02 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Dave Hansen, Andi Kleen, akpm, linux-mm, linux-kernel, lwoodman,
	Andi Kleen

On Tue, Feb 22, 2011 at 05:43:31PM +0100, Andrea Arcangeli wrote:
> On Tue, Feb 22, 2011 at 08:36:26AM -0800, Dave Hansen wrote:
> > On Mon, 2011-02-21 at 11:07 -0800, Andi Kleen wrote:
> > > From: Andi Kleen <ak@linux.intel.com>
> > > 
> > > I found it difficult to make sense of transparent huge pages without
> > > having any counters for its actions. Add some counters to vmstat
> > > for allocation of transparent hugepages and fallback to smaller
> > > pages.
> > > 
> > > Optional patch, but useful for development and understanding the system.
> > 
> > Very nice.  I did the same thing, splits-only.  I also found this stuff
> > a must-have for trying to do any work with transparent hugepages.  It's
> > just impossible otherwise.
> 
> This patch is good too. 1 and 8 I think can go in, patch 1 is high
> priority.
> 
> Patches 2-5 I've an hard time to see how they're not hurting
> performance instead of improving it, especially patch 3 looks dead

Well, right now THP destroys memory locality, and that is quite a bad
regression. Destroying memory locality hurts performance significantly.

In general the assumption that you can get the full policy from the
vma alone is wrong: for the local policy you always have to look at
the node of the existing page too.

I haven't had any reports of KSM doing this, but it seems better to
fix it in the same way. I don't feel very strongly about KSM here
though; I guess these parts could be dropped too. You're probably
right that KSM is a bit of a lost cause for NUMA anyway. Also, in my
experience KSM usually covers very little memory anyway, so it
shouldn't matter too much. I can drop that part if it's controversial.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/8] Preserve local node for KSM copies
  2011-02-22 15:47     ` Christoph Lameter
@ 2011-02-22 18:24       ` Andi Kleen
  -1 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2011-02-22 18:24 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andi Kleen, akpm, linux-mm, linux-kernel, aarcange, lwoodman,
	Andi Kleen, arcange

On Tue, Feb 22, 2011 at 09:47:26AM -0600, Christoph Lameter wrote:
> On Mon, 21 Feb 2011, Andi Kleen wrote:
> 
> > Add a alloc_page_vma_node that allows passing the "local" node in.
> > Use it in ksm to allocate copy pages on the same node as
> > the original as possible.
> 
> Why would that be useful? The shared page could be on a node that is not
> near the process that maps the page. Would it not be better to allocate on
> the node that is local to the process that maps the page?

Either could be wrong, but not moving the mappings seems most deterministic
to me. At least one of the processes (whoever allocated the page first) 
will stay with its local memory.

Also, the alloc_page_vma_node() call is used for THP too. I guess I
should split it out.

-Andi

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 6/8] Add __GFP_OTHER_NODE flag
  2011-02-21 19:07   ` Andi Kleen
@ 2011-02-22 21:42     ` David Rientjes
  -1 siblings, 0 replies; 48+ messages in thread
From: David Rientjes @ 2011-02-22 21:42 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, linux-mm, linux-kernel, Andrea Arcangeli,
	lwoodman, Andi Kleen

On Mon, 21 Feb 2011, Andi Kleen wrote:

> From: Andi Kleen <ak@linux.intel.com>
> 
> Add a new __GFP_OTHER_NODE flag to tell the low level numa statistics
> in zone_statistics() that an allocation is on behalf of another thread.
> This way the local and remote counters can be still correct, even
> when background daemons like khugepaged are changing memory
> mappings.
> 
> This only affects the accounting, but I think it's worth doing that
> right to avoid confusing users.
> 

This makes the accounting worse: NUMA_LOCAL is defined as "allocation
from local node," meaning it's local to the allocating cpu, not local
to the node being targeted.

Further, preferred_zone has taken on a much more significant meaning
than just statistics: it impacts the behavior of memory compaction and
how long congestion timeouts are, if a timeout is taken at all,
depending on the I/O being done on behalf of the zone.

A better way to address the issue is to make sure preferred_zone is
actually correct by passing the appropriate zonelist into the
allocator in the first place.

> I first tried to just pass down the right node, but this required
> a lot of changes to pass down this parameter and at least one
> addition of a 10th argument to a 9 argument function. Using
> the flag is a lot less intrusive.
> 

And adding a branch to every successful page allocation for statistics 
isn't intrusive?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 6/8] Add __GFP_OTHER_NODE flag
  2011-02-22 21:42     ` David Rientjes
@ 2011-02-22 21:47       ` Andi Kleen
  -1 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2011-02-22 21:47 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andi Kleen, Andrew Morton, linux-mm, linux-kernel,
	Andrea Arcangeli, lwoodman

On 2/22/2011 1:42 PM, David Rientjes wrote:
>
> This makes the accounting worse, NUMA_LOCAL is defined as "allocation from
> local node," meaning it's local to the allocating cpu, not local to the
> node being targeted.

Local to the process, really (and I defined it originally ...). That
is what I'm implementing.

I don't think "local to some random kernel daemon which changes
mappings on behalf of others" makes any sense as semantics.

> Further, preferred_zone has taken on a much more significant meaning other
> than just statistics: it impacts the behavior of memory compaction and how
> long congestion timeouts are, if a timeout is taken at all, depending on
> the I/O being done on behalf of the zone.
>
> A better way to address the issue is by making sure preferred_zone is
> actually correct by using the appropriate zonelist to be passed into the
> allocator in the first place

That is what is done already (well for THP together with my other patches)
The problem is just that local_hit/miss still uses numa_node_id() and 
not the preferred zone
to do the accounting.  In most cases that's fine and intended, just not 
for these
special daemons.
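
To make that concrete, a sketch of the accounting change; this is not the literal patch, and the plumbing of the gfp flag into zone_statistics() is assumed from the description:

#include <linux/mm.h>
#include <linux/vmstat.h>

/* Sketch: with __GFP_OTHER_NODE, "local" means the node the allocation
 * is done on behalf of (the preferred zone's node), not the node
 * khugepaged happens to be running on. */
static void zone_statistics_sketch(struct zone *preferred_zone,
		struct zone *z, gfp_t flags)
{
	int local_nid = numa_node_id();

	if (flags & __GFP_OTHER_NODE)
		local_nid = zone_to_nid(preferred_zone);

	if (zone_to_nid(z) == zone_to_nid(preferred_zone))
		__inc_zone_state(z, NUMA_HIT);
	else {
		__inc_zone_state(z, NUMA_MISS);
		__inc_zone_state(preferred_zone, NUMA_FOREIGN);
	}
	if (zone_to_nid(z) == local_nid)
		__inc_zone_state(z, NUMA_LOCAL);
	else
		__inc_zone_state(z, NUMA_OTHER);
}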


-Andi


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 6/8] Add __GFP_OTHER_NODE flag
  2011-02-22 21:47       ` Andi Kleen
@ 2011-02-22 22:08         ` David Rientjes
  -1 siblings, 0 replies; 48+ messages in thread
From: David Rientjes @ 2011-02-22 22:08 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andi Kleen, Andrew Morton, linux-mm, linux-kernel,
	Andrea Arcangeli, lwoodman

On Tue, 22 Feb 2011, Andi Kleen wrote:

> > This makes the accounting worse, NUMA_LOCAL is defined as "allocation from
> > local node," meaning it's local to the allocating cpu, not local to the
> > node being targeted.
> 
> Local to the process really (and I defined it originally ...)  That is what
> I'm implementing
> 
> I don't think "local to some random kernel daemon which changes mappings on
> behalf of others"
> makes any sense as semantics.
> 

You could make the same argument for anything using kmalloc_node() since 
preferred_zone may very well not be on the allocating cpu's node.  So you 
either define NUMA_LOCAL to account for when a cpu allocates memory local 
to itself (as its name implies) or you define it to account for when 
memory comes from the preferred_zone's node as determined by the zonelist.  
It's not useful to change it from the former to the latter since it's 
already the definition of NUMA_HIT.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 6/8] Add __GFP_OTHER_NODE flag
  2011-02-22 22:08         ` David Rientjes
@ 2011-02-22 22:52           ` Andi Kleen
  -1 siblings, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2011-02-22 22:52 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andi Kleen, Andrew Morton, linux-mm, linux-kernel,
	Andrea Arcangeli, lwoodman


> You could make the same argument for anything using kmalloc_node() since
> preferred_zone may very well not be on the allocating cpu's node.

You're right, it is not always; that is why I defined a new flag. In
the cases where the flag is passed, it is.



>   So you
> either define NUMA_LOCAL to account for when a cpu allocates memory local
> to itself (as it's name implies) or you define it to account for when
> memory comes from the preferred_zone's node as determined by the zonelist.

That's already numa_hit, as you say.

I just don't think "local to some random kernel daemon that means
nothing to the user" is a useful definition for local_hit.

When I defined the counter I intended it to be local to the user
process. It always was like that too; only THP changed the rules.

-Andi



^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2011-02-22 22:52 UTC | newest]

Thread overview:
2011-02-21 19:07 Fix NUMA problems in transparent hugepages and KSM Andi Kleen
2011-02-21 19:07 ` [PATCH 1/8] Fix interleaving for transparent hugepages Andi Kleen
2011-02-22 15:34   ` Christoph Lameter
2011-02-22 16:29     ` Andrea Arcangeli
2011-02-22 17:11     ` Andi Kleen
2011-02-21 19:07 ` [PATCH 2/8] Change alloc_pages_vma to pass down the policy node for local policy Andi Kleen
2011-02-22 15:42   ` Christoph Lameter
2011-02-21 19:07 ` [PATCH 3/8] Preserve local node for KSM copies Andi Kleen
2011-02-22 15:47   ` Christoph Lameter
2011-02-22 16:20     ` Andrea Arcangeli
2011-02-22 17:15       ` Andi Kleen
2011-02-22 18:24     ` Andi Kleen
2011-02-21 19:07 ` [PATCH 4/8] Preserve original node for transparent huge page copies Andi Kleen
2011-02-21 19:07 ` [PATCH 5/8] Use correct numa policy node for transparent hugepages Andi Kleen
2011-02-21 19:07 ` [PATCH 6/8] Add __GFP_OTHER_NODE flag Andi Kleen
2011-02-22 21:42   ` David Rientjes
2011-02-22 21:47     ` Andi Kleen
2011-02-22 22:08       ` David Rientjes
2011-02-22 22:52         ` Andi Kleen
2011-02-21 19:07 ` [PATCH 7/8] Use GFP_OTHER_NODE for transparent huge pages Andi Kleen
2011-02-21 19:07 ` [PATCH 8/8] Add VM counters for transparent hugepages Andi Kleen
2011-02-22 16:36   ` Dave Hansen
2011-02-22 16:43     ` Andrea Arcangeli
2011-02-22 18:02       ` Andi Kleen
