* [PATCH 00/39] [RFC] AutoNUMA alpha10
From: Andrea Arcangeli @ 2012-03-26 17:45 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

Hi everyone,

This is the result of the first round of cleanups of the AutoNUMA patch.

This can also be fetched through an autonuma-alpha10 branch.

git clone --reference linux -b autonuma-alpha10 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git

A first attempt at a high-level description of how this works can be
found in the comment of the patch "autonuma: core".

Some benchmark results can be found in the document below (updated to
include the SMT regression test, with only half of the cores loaded
and half idle, on page 7).

Page 9 measures the overhead of the knuma_scand pass during a kernel
build with its default 10sec knuma_scand pass interval.

http://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma_bench-20120322.pdf

TODO:

1) THP native migration
2) dynamically allocate the AutoNUMA struct page fields, as memcg
   does, so they won't take any memory if the kernel is booted on
   non-NUMA hardware
3) write Documentation/vm/autonuma.txt with a kernel internal focus
4) possibly find a way to improve the kernel/sched/numa.c algorithm
   with an implementation that is more integrated into CFS

Andrea Arcangeli (36):
  autonuma: make set_pmd_at always available
  xen: document Xen is using an unused bit for the pagetables
  autonuma: define _PAGE_NUMA_PTE and _PAGE_NUMA_PMD
  autonuma: x86 pte_numa() and pmd_numa()
  autonuma: generic pte_numa() and pmd_numa()
  autonuma: teach gup_fast about pte_numa
  autonuma: introduce kthread_bind_node()
  autonuma: mm_autonuma and sched_autonuma data structures
  autonuma: define the autonuma flags
  autonuma: core autonuma.h header
  autonuma: CPU follow memory algorithm
  autonuma: add page structure fields
  autonuma: knuma_migrated per NUMA node queues
  autonuma: init knuma_migrated queues
  autonuma: autonuma_enter/exit
  autonuma: call autonuma_setup_new_exec()
  autonuma: alloc/free/init sched_autonuma
  autonuma: alloc/free/init mm_autonuma
  mm: add unlikely to the mm allocation failure check
  autonuma: avoid CFS select_task_rq_fair to return -1
  autonuma: select_task_rq_fair cleanup new_cpu < 0 fix
  autonuma: teach CFS about autonuma affinity
  autonuma: select_idle_sibling cleanup target assignment
  autonuma: core
  autonuma: follow_page check for pte_numa/pmd_numa
  autonuma: default mempolicy follow AutoNUMA
  autonuma: call autonuma_split_huge_page()
  autonuma: make khugepaged pte_numa aware
  autonuma: retain page last_nid information in khugepaged
  autonuma: numa hinting page faults entry points
  autonuma: reset autonuma page data when pages are freed
  autonuma: initialize page structure fields
  autonuma: link mm/autonuma.o and kernel/sched/numa.o
  autonuma: add CONFIG_AUTONUMA and CONFIG_AUTONUMA_DEFAULT_ENABLED
  autonuma: boost khugepaged scanning rate
  autonuma: NUMA scheduler SMT awareness

Hillf Danton (3):
  autonuma: fix selecting task runqueue
  autonuma: fix finding idlest cpu
  autonuma: fix selecting idle sibling

 arch/x86/include/asm/paravirt.h      |    2 -
 arch/x86/include/asm/pgtable.h       |   51 ++-
 arch/x86/include/asm/pgtable_types.h |   22 +-
 arch/x86/mm/gup.c                    |    2 +-
 fs/exec.c                            |    3 +
 include/asm-generic/pgtable.h        |   12 +
 include/linux/autonuma.h             |   41 +
 include/linux/autonuma_flags.h       |   72 ++
 include/linux/autonuma_sched.h       |   61 ++
 include/linux/autonuma_types.h       |   54 ++
 include/linux/huge_mm.h              |    2 +
 include/linux/kthread.h              |    1 +
 include/linux/mm_types.h             |   30 +
 include/linux/mmzone.h               |    6 +
 include/linux/sched.h                |    3 +
 kernel/fork.c                        |   36 +-
 kernel/kthread.c                     |   23 +
 kernel/sched/Makefile                |    3 +-
 kernel/sched/core.c                  |   12 +-
 kernel/sched/fair.c                  |   68 ++-
 kernel/sched/numa.c                  |  320 ++++++++
 kernel/sched/sched.h                 |   12 +
 mm/Kconfig                           |   13 +
 mm/Makefile                          |    1 +
 mm/autonuma.c                        | 1444 ++++++++++++++++++++++++++++++++++
 mm/huge_memory.c                     |   51 ++-
 mm/memory.c                          |   36 +-
 mm/mempolicy.c                       |   15 +-
 mm/mmu_context.c                     |    2 +
 mm/page_alloc.c                      |   19 +
 30 files changed, 2373 insertions(+), 44 deletions(-)
 create mode 100644 include/linux/autonuma.h
 create mode 100644 include/linux/autonuma_flags.h
 create mode 100644 include/linux/autonuma_sched.h
 create mode 100644 include/linux/autonuma_types.h
 create mode 100644 kernel/sched/numa.c
 create mode 100644 mm/autonuma.c


* [PATCH 01/39] autonuma: make set_pmd_at always available
From: Andrea Arcangeli @ 2012-03-26 17:45 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

set_pmd_at() will also be used for the knuma_scand/pmd = 1 (default)
mode even when TRANSPARENT_HUGEPAGE=n. Make it available so the build
won't fail.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/include/asm/paravirt.h |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index aa0f913..f34370d 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -564,7 +564,6 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
 		PVOP_VCALL4(pv_mmu_ops.set_pte_at, mm, addr, ptep, pte.pte);
 }
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 			      pmd_t *pmdp, pmd_t pmd)
 {
@@ -575,7 +574,6 @@ static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 		PVOP_VCALL4(pv_mmu_ops.set_pmd_at, mm, addr, pmdp,
 			    native_pmd_val(pmd));
 }
-#endif
 
 static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
 {

* [PATCH 02/39] xen: document Xen is using an unused bit for the pagetables
From: Andrea Arcangeli @ 2012-03-26 17:45 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

Xen has taken over the last reserved bit available for the
pagetables, which is set through ioremap. Document this and make the
code more readable.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/include/asm/pgtable_types.h |   11 +++++++++--
 1 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 013286a..b74cac9 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -17,7 +17,7 @@
 #define _PAGE_BIT_PAT		7	/* on 4KB pages */
 #define _PAGE_BIT_GLOBAL	8	/* Global TLB entry PPro+ */
 #define _PAGE_BIT_UNUSED1	9	/* available for programmer */
-#define _PAGE_BIT_IOMAP		10	/* flag used to indicate IO mapping */
+#define _PAGE_BIT_UNUSED2	10
 #define _PAGE_BIT_HIDDEN	11	/* hidden by kmemcheck */
 #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
 #define _PAGE_BIT_SPECIAL	_PAGE_BIT_UNUSED1
@@ -41,7 +41,7 @@
 #define _PAGE_PSE	(_AT(pteval_t, 1) << _PAGE_BIT_PSE)
 #define _PAGE_GLOBAL	(_AT(pteval_t, 1) << _PAGE_BIT_GLOBAL)
 #define _PAGE_UNUSED1	(_AT(pteval_t, 1) << _PAGE_BIT_UNUSED1)
-#define _PAGE_IOMAP	(_AT(pteval_t, 1) << _PAGE_BIT_IOMAP)
+#define _PAGE_UNUSED2	(_AT(pteval_t, 1) << _PAGE_BIT_UNUSED2)
 #define _PAGE_PAT	(_AT(pteval_t, 1) << _PAGE_BIT_PAT)
 #define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
 #define _PAGE_SPECIAL	(_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
@@ -49,6 +49,13 @@
 #define _PAGE_SPLITTING	(_AT(pteval_t, 1) << _PAGE_BIT_SPLITTING)
 #define __HAVE_ARCH_PTE_SPECIAL
 
+/* flag used to indicate IO mapping */
+#ifdef CONFIG_XEN
+#define _PAGE_IOMAP	(_AT(pteval_t, 1) << _PAGE_BIT_UNUSED2)
+#else
+#define _PAGE_IOMAP	(_AT(pteval_t, 0))
+#endif
+
 #ifdef CONFIG_KMEMCHECK
 #define _PAGE_HIDDEN	(_AT(pteval_t, 1) << _PAGE_BIT_HIDDEN)
 #else

* [PATCH 03/39] autonuma: define _PAGE_NUMA_PTE and _PAGE_NUMA_PMD
From: Andrea Arcangeli @ 2012-03-26 17:45 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

We will set these bitflags only when the pte or pmd is non-present.

They work like PROT_NONE, but they identify a request for a NUMA
hinting page fault to trigger.

Because we want to be able to set these bitflags in any established
pte or pmd (while clearing the present bit at the same time) without
losing information, these bitflags must never be set when the pte or
pmd is present.

For _PAGE_NUMA_PTE the pte bitflag used is _PAGE_PSE, which cannot be
set on ptes and which also fits in between _PAGE_FILE and
_PAGE_PROTNONE, avoiding any need to alter the swp entry format.

For _PAGE_NUMA_PMD, we use a reserved bitflag. pmds never contain swap
entries, but if in the future we swap out transparent hugepages, we
must keep in mind not to use the _PAGE_UNUSED2 bitflag in the swap
entry format and to start the swap entry offset above it.

_PAGE_UNUSED2 is used by Xen, but only on ptes established by ioremap;
it's never used on pmds, so there's no risk of collision with Xen.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/include/asm/pgtable_types.h |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index b74cac9..6e2d954 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -71,6 +71,17 @@
 #define _PAGE_FILE	(_AT(pteval_t, 1) << _PAGE_BIT_FILE)
 #define _PAGE_PROTNONE	(_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
 
+/*
+ * Cannot be set on pte. The fact it's in between _PAGE_FILE and
+ * _PAGE_PROTNONE avoids having to alter the swp entries.
+ */
+#define _PAGE_NUMA_PTE	_PAGE_PSE
+/*
+ * Cannot be set on pmd, if transparent hugepages will be swapped out
+ * the swap entry offset must start above it.
+ */
+#define _PAGE_NUMA_PMD	_PAGE_UNUSED2
+
 #define _PAGE_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER |	\
 			 _PAGE_ACCESSED | _PAGE_DIRTY)
 #define _KERNPG_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED |	\

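To make the encoding concrete, the check boils down to testing the
NUMA bit together with the present bit: the NUMA bit only has this
meaning while _PAGE_PRESENT is clear. A minimal sketch, in the style
of the pte_numa() helper the next patch adds (pte_is_numa_sketch is an
illustrative name, not part of this patch):

/* Sketch only: a pte is a NUMA hinting pte when the NUMA bit is set
 * and the present bit is clear; a present pte must never match,
 * whatever other bits it has set. */
static inline int pte_is_numa_sketch(pte_t pte)
{
        return (pte_flags(pte) &
                (_PAGE_NUMA_PTE | _PAGE_PRESENT)) == _PAGE_NUMA_PTE;
}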

* [PATCH 04/39] autonuma: x86 pte_numa() and pmd_numa()
From: Andrea Arcangeli @ 2012-03-26 17:45 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

Implement pte_numa and pmd_numa and related methods on x86 arch.

We must atomically set the numa bit and clear the present bit to
define a pte_numa or pmd_numa.

Whenever a pte or pmd is set as pte_numa or pmd_numa, the first time
a thread touches that virtual address a NUMA hinting page fault will
trigger. The NUMA hinting page fault will simply clear the NUMA bit
and set the present bit again to resolve the page fault.

NUMA hinting page faults are used:

1) to fill in the per-thread NUMA statistic stored for each thread in
   a current->sched_autonuma data structure

2) to track the per-node last_nid information in the page structure to
   detect false sharing

3) to queue the page mapped by the pte_numa or pmd_numa for async
   migration if there have been enough NUMA hinting page faults on the
   page coming from remote CPUs

NUMA hinting page faults don't do anything except collecting
information and possibly adding pages to migrate queues. They're
extremely quick and absolutely non blocking. They don't allocate any
memory either.

The only "input" information of the AutoNUMA algorithm that isn't
collected through NUMA hinting page faults is the per-process (not
per-thread) mm->mm_autonuma statistics. Those mm_autonuma statistics
are collected by the knuma_scand pmd/pte scans, which are also
responsible for setting the pte_numa/pmd_numa that activate the NUMA
hinting page faults.

knuma_scand -> NUMA hinting page faults
  |                       |
 \|/                     \|/
mm_autonuma  <->  sched_autonuma (CPU follow memory, this is mm_autonuma too)
                  page last_nid  (false thread sharing/thread shared memory detection )
                  queue or cancel page migration (memory follow CPU)

After pages are queued, one knuma_migratedN daemon per NUMA node
takes care of migrating them at a perfectly steady rate, in parallel
from all nodes and in round robin across the incoming nodes targeting
the same destination node. This keeps all memory channels in large
boxes active at the same time, avoids hammering a single memory
channel for too long, and so minimizes the latency effects of the
migrations on the memory bus.

Once pages are queued for async migration by knuma_migratedN, their
migration can still be canceled before they're actually migrated, if
false sharing is later detected.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/include/asm/pgtable.h |   51 +++++++++++++++++++++++++++++++++++++--
 1 files changed, 48 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 49afb3f..7514fa6 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -109,7 +109,7 @@ static inline int pte_write(pte_t pte)
 
 static inline int pte_file(pte_t pte)
 {
-	return pte_flags(pte) & _PAGE_FILE;
+	return (pte_flags(pte) & _PAGE_FILE) == _PAGE_FILE;
 }
 
 static inline int pte_huge(pte_t pte)
@@ -405,7 +405,9 @@ static inline int pte_same(pte_t a, pte_t b)
 
 static inline int pte_present(pte_t a)
 {
-	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
+	/* _PAGE_NUMA includes _PAGE_PROTNONE */
+	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
+			       _PAGE_NUMA_PTE);
 }
 
 static inline int pte_hidden(pte_t pte)
@@ -415,7 +417,46 @@ static inline int pte_hidden(pte_t pte)
 
 static inline int pmd_present(pmd_t pmd)
 {
-	return pmd_flags(pmd) & _PAGE_PRESENT;
+	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE |
+				 _PAGE_NUMA_PMD);
+}
+
+#ifdef CONFIG_AUTONUMA
+static inline int pte_numa(pte_t pte)
+{
+	return (pte_flags(pte) &
+		(_PAGE_NUMA_PTE|_PAGE_PRESENT)) == _PAGE_NUMA_PTE;
+}
+
+static inline int pmd_numa(pmd_t pmd)
+{
+	return (pmd_flags(pmd) &
+		(_PAGE_NUMA_PMD|_PAGE_PRESENT)) == _PAGE_NUMA_PMD;
+}
+#endif
+
+static inline pte_t pte_mknotnuma(pte_t pte)
+{
+	pte = pte_clear_flags(pte, _PAGE_NUMA_PTE);
+	return pte_set_flags(pte, _PAGE_PRESENT|_PAGE_ACCESSED);
+}
+
+static inline pmd_t pmd_mknotnuma(pmd_t pmd)
+{
+	pmd = pmd_clear_flags(pmd, _PAGE_NUMA_PMD);
+	return pmd_set_flags(pmd, _PAGE_PRESENT|_PAGE_ACCESSED);
+}
+
+static inline pte_t pte_mknuma(pte_t pte)
+{
+	pte = pte_set_flags(pte, _PAGE_NUMA_PTE);
+	return pte_clear_flags(pte, _PAGE_PRESENT);
+}
+
+static inline pmd_t pmd_mknuma(pmd_t pmd)
+{
+	pmd = pmd_set_flags(pmd, _PAGE_NUMA_PMD);
+	return pmd_clear_flags(pmd, _PAGE_PRESENT);
 }
 
 static inline int pmd_none(pmd_t pmd)
@@ -474,6 +515,10 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 
 static inline int pmd_bad(pmd_t pmd)
 {
+#ifdef CONFIG_AUTONUMA
+	if (pmd_numa(pmd))
+		return 0;
+#endif
 	return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
 }
 

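For illustration only, resolving a NUMA hinting fault on a single pte
boils down to flipping these bits back under the page table lock. The
real entry points are added later in the series ("autonuma: numa
hinting page faults entry points"); pte_numa_fixup() below and its
locking are just a sketch built on the helpers above:

/* Sketch only: resolve a NUMA hinting fault on one pte by clearing
 * the NUMA bit and setting the present bit again. */
static void pte_numa_fixup(struct vm_area_struct *vma, unsigned long addr,
                           pte_t *ptep, spinlock_t *ptl)
{
        pte_t pte;

        spin_lock(ptl);
        pte = *ptep;
        if (pte_numa(pte)) {
                /* account the fault and possibly queue the page for
                 * migration to this CPU's node here */
                pte = pte_mknotnuma(pte); /* _PAGE_PRESENT|_PAGE_ACCESSED */
                set_pte_at(vma->vm_mm, addr, ptep, pte);
                update_mmu_cache(vma, addr, ptep);
        }
        spin_unlock(ptl);
}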

* [PATCH 05/39] autonuma: generic pte_numa() and pmd_numa()
From: Andrea Arcangeli @ 2012-03-26 17:45 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

Implement generic versions of the methods. They're used when
CONFIG_AUTONUMA=n, and they're a no-op.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/asm-generic/pgtable.h |   12 ++++++++++++
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 125c54e..d22552b 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -503,6 +503,18 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
 #endif
 }
 
+#ifndef CONFIG_AUTONUMA
+static inline int pte_numa(pte_t pte)
+{
+	return 0;
+}
+
+static inline int pmd_numa(pmd_t pmd)
+{
+	return 0;
+}
+#endif /* CONFIG_AUTONUMA */
+
 #endif /* CONFIG_MMU */
 
 #endif /* !__ASSEMBLY__ */


* [PATCH 06/39] autonuma: teach gup_fast about pte_numa
From: Andrea Arcangeli @ 2012-03-26 17:45 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

gup_fast will skip over non-present ptes (pte_numa requires the pte
to be non-present). So no explicit check is needed for pte_numa in
the pte case.

gup_fast will also automatically skip over THP when the trans huge
pmd is non-present (pmd_numa requires the pmd to be non-present).

But for the special pmd mode scan of knuma_scand
(/sys/kernel/mm/autonuma/knuma_scand/pmd == 1), the pmd may be of
NUMA type (so non-present too) while the ptes below it may still be
present. gup_pte_range wouldn't notice that the pmd is of NUMA type.
So, to avoid losing a NUMA hinting page fault with gup_fast, we need
an explicit check for pmd_numa() here, to be sure it will fault
through gup -> handle_mm_fault.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/mm/gup.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index dd74e46..bf36575 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -164,7 +164,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 		 * wait_split_huge_page() would never return as the
 		 * tlb flush IPI wouldn't run.
 		 */
-		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+		if (pmd_none(pmd) || pmd_trans_splitting(pmd) || pmd_numa(pmd))
 			return 0;
 		if (unlikely(pmd_large(pmd))) {
 			if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))


* [PATCH 07/39] autonuma: introduce kthread_bind_node()
From: Andrea Arcangeli @ 2012-03-26 17:45 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

This function makes it easy to bind the per-node knuma_migrated
threads to their respective NUMA nodes. Those threads take memory
from the other nodes (in round robin, with an incoming queue for each
remote node) and move that memory to their local node.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/kthread.h |    1 +
 kernel/kthread.c        |   23 +++++++++++++++++++++++
 2 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 0714b24..e733f97 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -33,6 +33,7 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
 })
 
 void kthread_bind(struct task_struct *k, unsigned int cpu);
+void kthread_bind_node(struct task_struct *p, int nid);
 int kthread_stop(struct task_struct *k);
 int kthread_should_stop(void);
 bool kthread_freezable_should_stop(bool *was_frozen);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 3d3de63..48b36f9 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -234,6 +234,29 @@ void kthread_bind(struct task_struct *p, unsigned int cpu)
 EXPORT_SYMBOL(kthread_bind);
 
 /**
+ * kthread_bind_node - bind a just-created kthread to the CPUs of a node.
+ * @p: thread created by kthread_create().
+ * @nid: node (might not be online, must be possible) for @k to run on.
+ *
+ * Description: This function is equivalent to set_cpus_allowed(),
+ * except that @nid doesn't need to be online, and the thread must be
+ * stopped (i.e., just returned from kthread_create()).
+ */
+void kthread_bind_node(struct task_struct *p, int nid)
+{
+	/* Must have done schedule() in kthread() before we set_task_cpu */
+	if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
+		WARN_ON(1);
+		return;
+	}
+
+	/* It's safe because the task is inactive. */
+	do_set_cpus_allowed(p, cpumask_of_node(nid));
+	p->flags |= PF_THREAD_BOUND;
+}
+EXPORT_SYMBOL(kthread_bind_node);
+
+/**
  * kthread_stop - stop a thread created by kthread_create().
  * @k: thread created by kthread_create().
  *

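As a usage sketch (the real startup code lives in "autonuma: core";
knuma_migrated() and the loop below are placeholders), each per-node
daemon would be created with kthread_create_on_node() and then pinned
to the CPUs of its node with the new helper:

/* Sketch only: start one migration daemon per possible NUMA node
 * and bind each one to the CPUs of its own node. */
static int __init knuma_migrated_start_sketch(void)
{
        int nid;

        for_each_node_state(nid, N_POSSIBLE) {
                struct task_struct *p;

                p = kthread_create_on_node(knuma_migrated, NODE_DATA(nid),
                                           nid, "knuma_migrated%d", nid);
                if (IS_ERR(p))
                        return PTR_ERR(p);
                kthread_bind_node(p, nid);
                wake_up_process(p);
        }
        return 0;
}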

* [PATCH 08/39] autonuma: mm_autonuma and sched_autonuma data structures
From: Andrea Arcangeli @ 2012-03-26 17:45 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

Define the two data structures that collect the per-process (in the
mm) and per-thread (in the task_struct) statistical information that
is the input of the CPU follow memory algorithms in the NUMA
scheduler.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma_types.h |   54 ++++++++++++++++++++++++++++++++++++++++
 1 files changed, 54 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/autonuma_types.h

diff --git a/include/linux/autonuma_types.h b/include/linux/autonuma_types.h
new file mode 100644
index 0000000..818e7c7
--- /dev/null
+++ b/include/linux/autonuma_types.h
@@ -0,0 +1,54 @@
+#ifndef _LINUX_AUTONUMA_TYPES_H
+#define _LINUX_AUTONUMA_TYPES_H
+
+#ifdef CONFIG_AUTONUMA
+
+#include <linux/numa.h>
+
+struct mm_autonuma {
+	struct list_head mm_node;
+	struct mm_struct *mm;
+	unsigned long numa_fault_tot; /* reset from here */
+	unsigned long numa_fault_pass;
+	unsigned long numa_fault[0];
+};
+
+extern int alloc_mm_autonuma(struct mm_struct *mm);
+extern void free_mm_autonuma(struct mm_struct *mm);
+extern void __init mm_autonuma_init(void);
+
+struct sched_autonuma {
+	int autonuma_node;
+	bool autonuma_stop_one_cpu; /* reset from here */
+	unsigned long numa_fault_pass;
+	unsigned long numa_fault_tot;
+	unsigned long numa_fault[0];
+};
+
+extern int alloc_sched_autonuma(struct task_struct *tsk,
+				struct task_struct *orig,
+				int node);
+extern void __init sched_autonuma_init(void);
+extern void free_sched_autonuma(struct task_struct *tsk);
+
+#else /* CONFIG_AUTONUMA */
+
+static inline int alloc_mm_autonuma(struct mm_struct *mm)
+{
+	return 0;
+}
+static inline void free_mm_autonuma(struct mm_struct *mm) {}
+static inline void mm_autonuma_init(void) {}
+
+static inline int alloc_sched_autonuma(struct task_struct *tsk,
+				       struct task_struct *orig,
+				       int node)
+{
+	return 0;
+}
+static inline void sched_autonuma_init(void) {}
+static inline void free_sched_autonuma(struct task_struct *tsk) {}
+
+#endif /* CONFIG_AUTONUMA */
+
+#endif /* _LINUX_AUTONUMA_TYPES_H */

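Both structures end with a numa_fault[0] flexible array, so they are
meant to be allocated with one unsigned long counter per possible
NUMA node appended. A minimal sketch of the sizing
(mm_autonuma_alloc_sketch is illustrative; the real work is done by
alloc_mm_autonuma() later in the series):

/* Sketch only: size numa_fault[] with one counter per possible node. */
static struct mm_autonuma *mm_autonuma_alloc_sketch(struct mm_struct *mm)
{
        struct mm_autonuma *mma;

        mma = kzalloc(sizeof(*mma) +
                      num_possible_nodes() * sizeof(unsigned long),
                      GFP_KERNEL);
        if (!mma)
                return NULL;
        mma->mm = mm;
        INIT_LIST_HEAD(&mma->mm_node);
        return mma;
}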

* [PATCH 09/39] autonuma: define the autonuma flags
From: Andrea Arcangeli @ 2012-03-26 17:45 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

These flags are the ones tweaked through sysfs. They control the
behavior of AutoNUMA, from enabling or disabling it to selecting
various runtime options.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma_flags.h |   62 ++++++++++++++++++++++++++++++++++++++++
 1 files changed, 62 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/autonuma_flags.h

diff --git a/include/linux/autonuma_flags.h b/include/linux/autonuma_flags.h
new file mode 100644
index 0000000..9c702fd
--- /dev/null
+++ b/include/linux/autonuma_flags.h
@@ -0,0 +1,62 @@
+#ifndef _LINUX_AUTONUMA_FLAGS_H
+#define _LINUX_AUTONUMA_FLAGS_H
+
+enum autonuma_flag {
+	AUTONUMA_FLAG,
+	AUTONUMA_IMPOSSIBLE,
+	AUTONUMA_DEBUG_FLAG,
+	AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG,
+	AUTONUMA_SCHED_CLONE_RESET_FLAG,
+	AUTONUMA_SCHED_FORK_RESET_FLAG,
+	AUTONUMA_SCAN_PMD_FLAG,
+	AUTONUMA_SCAN_USE_WORKING_SET_FLAG,
+	AUTONUMA_MIGRATE_DEFER_FLAG,
+};
+
+extern unsigned long autonuma_flags;
+
+static bool inline autonuma_enabled(void)
+{
+	return !!test_bit(AUTONUMA_FLAG, &autonuma_flags);
+}
+
+static bool inline autonuma_debug(void)
+{
+	return !!test_bit(AUTONUMA_DEBUG_FLAG, &autonuma_flags);
+}
+
+static bool inline autonuma_sched_load_balance_strict(void)
+{
+	return !!test_bit(AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG,
+			  &autonuma_flags);
+}
+
+static bool inline autonuma_sched_clone_reset(void)
+{
+	return !!test_bit(AUTONUMA_SCHED_CLONE_RESET_FLAG,
+			  &autonuma_flags);
+}
+
+static bool inline autonuma_sched_fork_reset(void)
+{
+	return !!test_bit(AUTONUMA_SCHED_FORK_RESET_FLAG,
+			  &autonuma_flags);
+}
+
+static bool inline autonuma_scan_pmd(void)
+{
+	return !!test_bit(AUTONUMA_SCAN_PMD_FLAG, &autonuma_flags);
+}
+
+static bool inline autonuma_scan_use_working_set(void)
+{
+	return !!test_bit(AUTONUMA_SCAN_USE_WORKING_SET_FLAG,
+			  &autonuma_flags);
+}
+
+static inline bool autonuma_migrate_defer(void)
+{
+	return !!test_bit(AUTONUMA_MIGRATE_DEFER_FLAG, &autonuma_flags);
+}
+
+#endif /* _LINUX_AUTONUMA_FLAGS_H */

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 10/39] autonuma: core autonuma.h header
  2012-03-26 17:45 ` Andrea Arcangeli
@ 2012-03-26 17:45   ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-26 17:45 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

This is the generic autonuma.h header that defines the AutoNUMA
specific functions like autonuma_setup_new_exec,
autonuma_split_huge_page, numa_hinting_fault, etc...

As usual, functions like numa_hinting_fault that only matter for
builds with CONFIG_AUTONUMA=y are declared unconditionally, but they
are only linked into the kernel if CONFIG_AUTONUMA=y. With
CONFIG_AUTONUMA=n their call sites are optimized away at build time
(otherwise the kernel would fail to link).
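
A minimal sketch of the call-site pattern this relies on (not part of
this patch, shown only to illustrate the build-time elimination):

	/*
	 * sketch: with CONFIG_AUTONUMA=n, pte_numa() is expected to be a
	 * compile-time constant 0, so gcc drops the call below and the
	 * unresolved __pte_numa_fixup symbol never reaches the linker.
	 */
	if (pte_numa(pte))
		pte = __pte_numa_fixup(mm, vma, addr, pte, ptep);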

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma.h |   41 +++++++++++++++++++++++++++++++++++++++++
 1 files changed, 41 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/autonuma.h

diff --git a/include/linux/autonuma.h b/include/linux/autonuma.h
new file mode 100644
index 0000000..a963dcb
--- /dev/null
+++ b/include/linux/autonuma.h
@@ -0,0 +1,41 @@
+#ifndef _LINUX_AUTONUMA_H
+#define _LINUX_AUTONUMA_H
+
+#ifdef CONFIG_AUTONUMA
+
+#include <linux/autonuma_flags.h>
+
+extern void autonuma_enter(struct mm_struct *mm);
+extern void autonuma_exit(struct mm_struct *mm);
+extern void __autonuma_migrate_page_remove(struct page *page);
+extern void autonuma_migrate_split_huge_page(struct page *page,
+					     struct page *page_tail);
+extern void autonuma_setup_new_exec(struct task_struct *p);
+
+static inline void autonuma_migrate_page_remove(struct page *page)
+{
+	if (ACCESS_ONCE(page->autonuma_migrate_nid) >= 0)
+		__autonuma_migrate_page_remove(page);
+}
+
+#define autonuma_printk(format, args...) \
+	do { if (autonuma_debug()) printk(format, ##args); } while (0)
+
+#else /* CONFIG_AUTONUMA */
+
+static inline void autonuma_enter(struct mm_struct *mm) {}
+static inline void autonuma_exit(struct mm_struct *mm) {}
+static inline void autonuma_migrate_page_remove(struct page *page) {}
+static inline void autonuma_migrate_split_huge_page(struct page *page,
+						    struct page *page_tail) {}
+static inline void autonuma_setup_new_exec(struct task_struct *p) {}
+
+#endif /* CONFIG_AUTONUMA */
+
+extern pte_t __pte_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
+			      unsigned long addr, pte_t pte, pte_t *ptep);
+extern void __pmd_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
+			     unsigned long addr, pmd_t *pmd);
+extern void numa_hinting_fault(struct page *page, int numpages);
+
+#endif /* _LINUX_AUTONUMA_H */

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 11/39] autonuma: CPU follow memory algorithm
  2012-03-26 17:45 ` Andrea Arcangeli
@ 2012-03-26 17:45   ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-26 17:45 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

This algorithm takes as input the statistical information filled by the
knuma_scand (mm->mm_autonuma) and by the NUMA hinting page faults
(p->sched_autonuma), evaluates it for the current scheduled task, and
compares it against every other running process to see if it should
move the current task to another NUMA node.

For example, if there is an idle CPU in the NUMA node where the
current task prefers to be scheduled (according to the mm_autonuma
and sched_autonuma data structures), the task is migrated there
instead of being kept running on the current CPU.

When the scheduler decides whether the task should be migrated to a
different NUMA node or stay in the same NUMA node, the decision is
stored into p->sched_autonuma->autonuma_node. The fair scheduler then
tries to keep the task on that autonuma_node too.

The code includes fixes from Hillf Danton <dhillf@gmail.com>.
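
As a rough sketch of the comparison this boils down to (not part of
the patch; "fault[]" stands for either the per-thread or the
per-process NUMA fault statistics, and weight_of_task_on() is a
made-up placeholder for the weight computed for whatever runs on the
remote CPU):

	long w_nid     = fault[nid] * 1000 / fault_tot;     /* candidate node */
	long w_cpu_nid = fault[cpu_nid] * 1000 / fault_tot; /* current node   */
	long w_other   = weight_of_task_on(cpu);            /* hypothetical   */

	if (w_nid > w_other && w_nid > w_cpu_nid) {
		/*
		 * Candidate: the estimated gain of moving there is
		 * (w_nid - w_other) + (w_nid - w_cpu_nid); the CPU with
		 * the largest gain is selected.
		 */
	}

The 1000 is the AUTONUMA_BALANCE_SCALE used by the real code below.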

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma_sched.h |   61 +++++++++
 include/linux/mm_types.h       |    5 +
 include/linux/sched.h          |    3 +
 kernel/sched/core.c            |   12 +-
 kernel/sched/numa.c            |  276 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h           |   12 ++
 6 files changed, 361 insertions(+), 8 deletions(-)
 create mode 100644 include/linux/autonuma_sched.h
 create mode 100644 kernel/sched/numa.c

diff --git a/include/linux/autonuma_sched.h b/include/linux/autonuma_sched.h
new file mode 100644
index 0000000..286308f
--- /dev/null
+++ b/include/linux/autonuma_sched.h
@@ -0,0 +1,61 @@
+#ifndef _LINUX_AUTONUMA_SCHED_H
+#define _LINUX_AUTONUMA_SCHED_H
+
+#include <linux/autonuma_flags.h>
+
+static inline bool task_autonuma_cpu(struct task_struct *p, int cpu)
+{
+#ifdef CONFIG_AUTONUMA
+	int autonuma_node;
+	struct sched_autonuma *sched_autonuma = p->sched_autonuma;
+
+	if (!sched_autonuma)
+		return true;
+
+	autonuma_node = ACCESS_ONCE(sched_autonuma->autonuma_node);
+	if (autonuma_node < 0 || autonuma_node == cpu_to_node(cpu))
+		return true;
+	else
+		return false;
+#else
+	return true;
+#endif
+}
+
+static inline bool task_autonuma(void)
+{
+#ifdef CONFIG_AUTONUMA
+	struct task_struct *p = current;
+	struct sched_autonuma *sched_autonuma = p->sched_autonuma;
+	int autonuma_node;
+
+	if (!sched_autonuma)
+		return true;
+
+	autonuma_node = ACCESS_ONCE(sched_autonuma->autonuma_node);
+	if (autonuma_node < 0 || autonuma_node == numa_node_id())
+		return true;
+	else
+		return false;
+#else
+	return true;
+#endif
+}
+
+#ifdef CONFIG_AUTONUMA
+extern void sched_autonuma_balance(void);
+extern bool sched_autonuma_can_migrate_task(struct task_struct *p,
+					    int numa, int dst_cpu,
+					    enum cpu_idle_type idle,
+					    struct cpumask *allowed);
+#else /* CONFIG_AUTONUMA */
+static inline void sched_autonuma_balance(void) {}
+static inline bool sched_autonuma_can_migrate_task(struct task_struct *p,
+						   int numa, int dst_cpu,
+						   enum cpu_idle_type idle,
+						   struct cpumask *allowed) {
+	return true;
+}
+#endif /* CONFIG_AUTONUMA */
+
+#endif /* _LINUX_AUTONUMA_SCHED_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3cc3062..2567798 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -12,6 +12,7 @@
 #include <linux/completion.h>
 #include <linux/cpumask.h>
 #include <linux/page-debug-flags.h>
+#include <linux/autonuma_types.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
 
@@ -388,6 +389,10 @@ struct mm_struct {
 #ifdef CONFIG_CPUMASK_OFFSTACK
 	struct cpumask cpumask_allocation;
 #endif
+#ifdef CONFIG_AUTONUMA
+	/* this is used by the scheduler and the page allocator */
+	struct mm_autonuma *mm_autonuma;
+#endif
 };
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0c3854b..d0f2182 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1553,6 +1553,9 @@ struct task_struct {
 	struct mempolicy *mempolicy;	/* Protected by alloc_lock */
 	short il_next;
 	short pref_node_fork;
+#ifdef CONFIG_AUTONUMA
+	struct sched_autonuma *sched_autonuma;
+#endif
 #endif
 	struct rcu_head rcu;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 503d642..501b87f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -72,6 +72,7 @@
 #include <linux/slab.h>
 #include <linux/init_task.h>
 #include <linux/binfmts.h>
+#include <linux/autonuma_sched.h>
 
 #include <asm/tlb.h>
 #include <asm/irq_regs.h>
@@ -1117,13 +1118,6 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	__set_task_cpu(p, new_cpu);
 }
 
-struct migration_arg {
-	struct task_struct *task;
-	int dest_cpu;
-};
-
-static int migration_cpu_stop(void *data);
-
 /*
  * wait_task_inactive - wait for a thread to unschedule.
  *
@@ -3220,6 +3214,8 @@ need_resched:
 
 	post_schedule(rq);
 
+	sched_autonuma_balance();
+
 	sched_preempt_enable_no_resched();
 	if (need_resched())
 		goto need_resched;
@@ -5055,7 +5051,7 @@ fail:
  * and performs thread migration by bumping thread off CPU then
  * 'pushing' onto another runqueue.
  */
-static int migration_cpu_stop(void *data)
+int migration_cpu_stop(void *data)
 {
 	struct migration_arg *arg = data;
 
diff --git a/kernel/sched/numa.c b/kernel/sched/numa.c
new file mode 100644
index 0000000..d51e1ec
--- /dev/null
+++ b/kernel/sched/numa.c
@@ -0,0 +1,276 @@
+/*
+ *  Copyright (C) 2012  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/sched.h>
+#include <linux/autonuma_sched.h>
+#include <asm/tlb.h>
+
+#include "sched.h"
+
+#define AUTONUMA_BALANCE_SCALE 1000
+
+/*
+ * This function is responsible for deciding which is the best CPU
+ * each process should be running on according to the NUMA
+ * affinity. To do that it evaluates all CPUs and checks if there's
+ * any remote CPU where the current process has more NUMA affinity
+ * than with the current CPU, and where the process running on the
+ * remote CPU has less NUMA affinity than the current process to run
+ * on the remote CPU. Ideally this should be expanded to take all
+ * runnable processes into account but this is a good
+ * approximation. When we compare the NUMA affinity between the
+ * current and remote CPU we use the per-thread information if the
+ * remote CPU runs a thread of the same process that the current task
+ * belongs to, or the per-process information if the remote CPU runs a
+ * different process than the current one. If the remote CPU runs the
+ * idle task we require both the per-thread and per-process
+ * information to have more affinity with the remote CPU than with the
+ * current CPU for a migration to happen.
+ *
+ * This has O(N) complexity but N isn't the number of running
+ * processes, but the number of CPUs, so if you assume a constant
+ * number of CPUs (capped at NR_CPUS) it is O(1). O(1) misleading math
+ * aside, the number of cachelines touched with thousands of CPUs might
+ * make it measurable. Calling this at every schedule may also be
+ * overkill and it may be enough to call it with a frequency similar
+ * to the load balancing, but by doing so we're also verifying the
+ * algorithm is a converging one in all workloads if performance is
+ * improved and there's no frequent CPU migration, so it's good in the
+ * short term for stressing the algorithm.
+ */
+void sched_autonuma_balance(void)
+{
+	int cpu, nid, selected_cpu, selected_nid;
+	int cpu_nid = numa_node_id();
+	int this_cpu = smp_processor_id();
+	unsigned long p_w, p_t, m_w, m_t;
+	unsigned long weight_delta_max, weight;
+	struct cpumask *allowed;
+	struct migration_arg arg;
+	struct task_struct *p = current;
+	struct sched_autonuma *sched_autonuma = p->sched_autonuma;
+
+	/* per-cpu statically allocated in runqueues */
+	long *weight_others;
+	long *weight_current;
+	long *weight_current_mm;
+	unsigned long *mm_mask;
+
+	if (!sched_autonuma || sched_autonuma->autonuma_stop_one_cpu || !p->mm)
+		return;
+
+	if (!autonuma_enabled()) {
+		if (sched_autonuma->autonuma_node != -1)
+			sched_autonuma->autonuma_node = -1;
+		return;
+	}
+
+	allowed = tsk_cpus_allowed(p);
+
+	m_t = ACCESS_ONCE(p->mm->mm_autonuma->numa_fault_tot);
+	p_t = sched_autonuma->numa_fault_tot;
+	/*
+	 * If a process still misses the per-thread or per-process
+	 * information skip it.
+	 */
+	if (!m_t || !p_t)
+		return;
+
+	weight_others = cpu_rq(this_cpu)->weight_others;
+	weight_current = cpu_rq(this_cpu)->weight_current;
+	weight_current_mm = cpu_rq(this_cpu)->weight_current_mm;
+	mm_mask = cpu_rq(this_cpu)->mm_mask;
+
+	for_each_online_node(nid) {
+		m_w = ACCESS_ONCE(p->mm->mm_autonuma->numa_fault[nid]);
+		p_w = sched_autonuma->numa_fault[nid];
+		if (m_w > m_t)
+			m_t = m_w;
+		weight_current_mm[nid] = m_w*AUTONUMA_BALANCE_SCALE/m_t;
+		if (p_w > p_t)
+			p_t = p_w;
+		weight_current[nid] = p_w*AUTONUMA_BALANCE_SCALE/p_t;
+	}
+
+	bitmap_zero(mm_mask, NR_CPUS);
+	for_each_online_node(nid) {
+		if (nid == cpu_nid)
+			continue;
+		for_each_cpu_and(cpu, cpumask_of_node(nid), allowed) {
+			struct mm_struct *mm;
+			struct rq *rq = cpu_rq(cpu);
+			if (!cpu_online(cpu))
+				continue;
+			weight_others[cpu] = LONG_MAX;
+			if (idle_cpu(cpu) &&
+			    rq->avg_idle > sysctl_sched_migration_cost) {
+				if (weight_current[nid] >
+				    weight_current[cpu_nid] &&
+				    weight_current_mm[nid] >
+				    weight_current_mm[cpu_nid])
+					weight_others[cpu] = -1;
+				continue;
+			}
+			mm = rq->curr->mm;
+			if (!mm)
+				continue;
+			raw_spin_lock_irq(&rq->lock);
+			/* recheck after implicit barrier() */
+			mm = rq->curr->mm;
+			if (!mm) {
+				raw_spin_unlock_irq(&rq->lock);
+				continue;
+			}
+			m_t = ACCESS_ONCE(mm->mm_autonuma->numa_fault_tot);
+			p_t = rq->curr->sched_autonuma->numa_fault_tot;
+			if (!m_t || !p_t) {
+				raw_spin_unlock_irq(&rq->lock);
+				continue;
+			}
+			m_w = ACCESS_ONCE(mm->mm_autonuma->numa_fault[nid]);
+			p_w = rq->curr->sched_autonuma->numa_fault[nid];
+			raw_spin_unlock_irq(&rq->lock);
+			if (mm == p->mm) {
+				if (p_w > p_t)
+					p_t = p_w;
+				weight_others[cpu] = p_w*
+					AUTONUMA_BALANCE_SCALE/p_t;
+
+				__set_bit(cpu, mm_mask);
+			} else {
+				if (m_w > m_t)
+					m_t = m_w;
+				weight_others[cpu] = m_w*
+					AUTONUMA_BALANCE_SCALE/m_t;
+			}
+		}
+	}
+
+	selected_cpu = this_cpu;
+	selected_nid = cpu_nid;
+	weight_delta_max = 0;
+
+	for_each_online_node(nid) {
+		if (nid == cpu_nid)
+			continue;
+		for_each_cpu_and(cpu, cpumask_of_node(nid), allowed) {
+			long w_nid, w_cpu_nid;
+			if (!cpu_online(cpu))
+				continue;
+			if (test_bit(cpu, mm_mask)) {
+				w_nid = weight_current[nid];
+				w_cpu_nid = weight_current[cpu_nid];
+			} else {
+				w_nid = weight_current_mm[nid];
+				w_cpu_nid = weight_current_mm[cpu_nid];
+			}
+			if (w_nid > weight_others[cpu] &&
+			    w_nid > w_cpu_nid) {
+				weight = w_nid -
+					weight_others[cpu] +
+					w_nid -
+					w_cpu_nid;
+				if (weight > weight_delta_max) {
+					weight_delta_max = weight;
+					selected_cpu = cpu;
+					selected_nid = nid;
+				}
+			}
+		}
+	}
+
+	if (sched_autonuma->autonuma_node != selected_nid)
+		sched_autonuma->autonuma_node = selected_nid;
+	if (selected_cpu != this_cpu) {
+		if (autonuma_debug())
+			printk("%p %d - %dto%d - %dto%d - %ld %ld %ld - %s\n",
+			       p->mm, p->pid, cpu_nid, selected_nid,
+			       this_cpu, selected_cpu,
+			       weight_others[selected_cpu],
+			       test_bit(selected_cpu, mm_mask) ?
+			       weight_current[selected_nid] :
+			       weight_current_mm[selected_nid],
+			       test_bit(selected_cpu, mm_mask) ?
+			       weight_current[cpu_nid] :
+			       weight_current_mm[cpu_nid],
+			       test_bit(selected_cpu, mm_mask) ?
+			       "thread" : "process");
+		BUG_ON(cpu_nid == selected_nid);
+		goto found;
+	}
+
+	return;
+
+found:
+	arg = (struct migration_arg) { p, selected_cpu };
+	/* Need help from migration thread: drop lock and wait. */
+	sched_autonuma->autonuma_stop_one_cpu = true;
+	sched_preempt_enable_no_resched();
+	stop_one_cpu(this_cpu, migration_cpu_stop, &arg);
+	preempt_disable();
+	sched_autonuma->autonuma_stop_one_cpu = false;
+	tlb_migrate_finish(p->mm);
+}
+
+bool sched_autonuma_can_migrate_task(struct task_struct *p,
+				     int numa, int dst_cpu,
+				     enum cpu_idle_type idle,
+				     struct cpumask *allowed)
+{
+	if (!task_autonuma_cpu(p, dst_cpu)) {
+		int cpu;
+		int autonuma_node;
+
+		if (numa)
+			return false;
+		if (autonuma_sched_load_balance_strict() &&
+		    idle != CPU_NEWLY_IDLE && idle != CPU_IDLE)
+			return false;
+
+		autonuma_node = ACCESS_ONCE(p->sched_autonuma->autonuma_node);
+		for_each_cpu_and(cpu, cpumask_of_node(autonuma_node),
+				 allowed) {
+			struct rq *rq = cpu_rq(cpu);
+			int _autonuma_node;
+			struct sched_autonuma *sa;
+			if (!cpu_online(cpu))
+				continue;
+			sa = rq->curr->sched_autonuma;
+			_autonuma_node = ACCESS_ONCE(sa->autonuma_node);
+			if (_autonuma_node != autonuma_node)
+				return false;
+			if (idle_cpu(cpu) && rq->avg_idle >=
+			    sysctl_sched_migration_cost)
+				return false;
+		}
+	}
+	return true;
+}
+
+void sched_autonuma_dump_mm(void)
+{
+	int nid, cpu;
+	struct cpumask x;
+	cpumask_setall(&x);
+	for_each_online_node(nid) {
+		for_each_cpu(cpu, cpumask_of_node(nid)) {
+			struct rq *rq = cpu_rq(cpu);
+			struct mm_struct *mm = rq->curr->mm;
+			int nr = 0, cpux;
+			if (!cpumask_test_cpu(cpu, &x))
+				continue;
+			for_each_cpu(cpux, cpumask_of_node(nid)) {
+				struct rq *rqx = cpu_rq(cpux);
+				if (rqx->curr->mm == mm) {
+					nr++;
+					cpumask_clear_cpu(cpux, &x);
+				}
+			}
+			printk("nid %d mm %p nr %d\n", nid, mm, nr);
+		}
+	}
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 42b1f30..261a295 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -463,6 +463,12 @@ struct rq {
 #ifdef CONFIG_SMP
 	struct llist_head wake_list;
 #endif
+#ifdef CONFIG_AUTONUMA
+	long weight_others[NR_CPUS];
+	long weight_current[MAX_NUMNODES];
+	long weight_current_mm[MAX_NUMNODES];
+	DECLARE_BITMAP(mm_mask, NR_CPUS);
+#endif
 };
 
 static inline int cpu_of(struct rq *rq)
@@ -526,6 +532,12 @@ static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
 DECLARE_PER_CPU(struct sched_domain *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_id);
 
+struct migration_arg {
+	struct task_struct *task;
+	int dest_cpu;
+};
+extern int migration_cpu_stop(void *data);
+
 #endif /* CONFIG_SMP */
 
 #include "stats.h"

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 12/39] autonuma: add page structure fields
  2012-03-26 17:45 ` Andrea Arcangeli
@ 2012-03-26 17:45   ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-26 17:45 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

On 64bit archs, 20 bytes are used for async memory migration (specific
to the knuma_migrated per-node threads), and 4 bytes are used for the
thread NUMA false sharing detection logic.
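
(Assuming the usual 64bit layout, that is: the 4-byte
autonuma_migrate_nid plus the 16-byte list_head linking the page into
the knuma_migrated queues account for the 20 bytes, and the 4-byte
autonuma_last_nid is the false sharing detection part.)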

This is a bad implementation due to lack of time to do a proper one.

These new AutoNUMA fields must be moved to the pgdat, like memcg
does, so that they are only allocated at boot time if the kernel is
booted on NUMA hardware, and so that they are not allocated at all if
"noautonuma" is passed as a boot parameter, even on NUMA hardware.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/mm_types.h |   25 +++++++++++++++++++++++++
 1 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2567798..0a163bd 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -125,6 +125,31 @@ struct page {
 		struct page *first_page;	/* Compound tail pages */
 	};
 
+#ifdef CONFIG_AUTONUMA
+	/*
+	 * FIXME: move to pgdat section along with the memcg and allocate
+	 * at runtime only in presence of a numa system.
+	 */
+	/*
+	 * To modify autonuma_last_nid locklessly, the architecture
+	 * needs SMP atomic granularity < sizeof(long); not all archs
+	 * have that, notably some alpha. Archs without it require
+	 * autonuma_last_nid to be a long.
+	 */
+#if BITS_PER_LONG > 32
+	int autonuma_migrate_nid;
+	int autonuma_last_nid;
+#else
+#if MAX_NUMNODES >= 32768
+#error "too many nodes"
+#endif
+	/* FIXME: remember to check the updates are atomic */
+	short autonuma_migrate_nid;
+	short autonuma_last_nid;
+#endif
+	struct list_head autonuma_migrate_node;
+#endif
+
 	/*
 	 * On machines where all RAM is mapped into kernel address space,
 	 * we can simply calculate the virtual address. On machines with

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 13/39] autonuma: knuma_migrated per NUMA node queues
  2012-03-26 17:45 ` Andrea Arcangeli
@ 2012-03-26 17:46   ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-26 17:46 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

This implements the knuma_migrated queues. Pages are added to these
queues through the NUMA hinting page faults (memory-follows-CPU
algorithm with false sharing evaluation), and knuma_migrated is then
woken with a certain hysteresis to migrate the memory in round robin
from all remote nodes to its local node.

The head that belongs to the local node knuma_migrated runs on must
be empty for now; it is not being used.
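
To make the flow concrete, here is a sketch (not in this patch) of
how a NUMA hinting fault is expected to feed these queues and wake
the daemon; the wakeup threshold name is made up:

	pg_data_t *pgdat = NODE_DATA(dst_nid);	/* node the page should move to */
	unsigned long nr;

	spin_lock(&pgdat->autonuma_lock);
	/* queue keyed by the node the page currently sits on */
	list_add_tail(&page->autonuma_migrate_node,
		      &pgdat->autonuma_migrate_head[page_to_nid(page)]);
	nr = ++pgdat->autonuma_nr_migrate_pages;
	spin_unlock(&pgdat->autonuma_lock);

	/* hysteresis: only wake knuma_migrated once enough pages piled up */
	if (nr >= knuma_migrated_wakeup_threshold)	/* hypothetical knob */
		wake_up_interruptible(&pgdat->autonuma_knuma_migrated_wait);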

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/mmzone.h |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index dff7115..b60747a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -666,6 +666,12 @@ typedef struct pglist_data {
 	struct task_struct *kswapd;
 	int kswapd_max_order;
 	enum zone_type classzone_idx;
+#ifdef CONFIG_AUTONUMA
+	spinlock_t autonuma_lock;
+	struct list_head autonuma_migrate_head[MAX_NUMNODES];
+	unsigned long autonuma_nr_migrate_pages;
+	wait_queue_head_t autonuma_knuma_migrated_wait;
+#endif
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 14/39] autonuma: init knuma_migrated queues
  2012-03-26 17:45 ` Andrea Arcangeli
@ 2012-03-26 17:46   ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-26 17:46 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

Initialize the knuma_migrated queues at boot time.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/page_alloc.c |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6f2b600..b9c80df 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -58,6 +58,7 @@
 #include <linux/memcontrol.h>
 #include <linux/prefetch.h>
 #include <linux/page-debug-flags.h>
+#include <linux/autonuma.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -4255,8 +4256,18 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 	int nid = pgdat->node_id;
 	unsigned long zone_start_pfn = pgdat->node_start_pfn;
 	int ret;
+#ifdef CONFIG_AUTONUMA
+	int node_iter;
+#endif
 
 	pgdat_resize_init(pgdat);
+#ifdef CONFIG_AUTONUMA
+	spin_lock_init(&pgdat->autonuma_lock);
+	init_waitqueue_head(&pgdat->autonuma_knuma_migrated_wait);
+	pgdat->autonuma_nr_migrate_pages = 0;
+	for_each_node(node_iter)
+		INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
+#endif
 	pgdat->nr_zones = 0;
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	pgdat->kswapd_max_order = 0;

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 15/39] autonuma: autonuma_enter/exit
  2012-03-26 17:45 ` Andrea Arcangeli
@ 2012-03-26 17:46   ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-26 17:46 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

The first gear in the whole AutoNUMA algorithm is knuma_scand. If
knuma_scand doesn't run, AutoNUMA is a full bypass. If knuma_scand is
stopped, all other AutoNUMA gears soon settle down too.

knuma_scand is the daemon that sets pmd_numa and pte_numa, which
allows the NUMA hinting page faults to start; all other actions then
follow as a reaction to that.

knuma_scand scans a list of "mm" structures, and this is where we
register and unregister each "mm" with AutoNUMA so knuma_scand can
scan it.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/fork.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index b9372a0..ba3b339 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -67,6 +67,7 @@
 #include <linux/oom.h>
 #include <linux/khugepaged.h>
 #include <linux/signalfd.h>
+#include <linux/autonuma.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -505,6 +506,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
 		mmu_notifier_mm_init(mm);
+		autonuma_enter(mm);
 		return mm;
 	}
 
@@ -572,6 +574,7 @@ void mmput(struct mm_struct *mm)
 		exit_aio(mm);
 		ksm_exit(mm);
 		khugepaged_exit(mm); /* must run before exit_mmap */
+		autonuma_exit(mm); /* must run before exit_mmap */
 		exit_mmap(mm);
 		set_mm_exe_file(mm, NULL);
 		if (!list_empty(&mm->mmlist)) {

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 16/39] autonuma: call autonuma_setup_new_exec()
  2012-03-26 17:45 ` Andrea Arcangeli
@ 2012-03-26 17:46   ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-26 17:46 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

This resets all per-thread and per-process statistics across exec
syscalls or after kernel threads detach from the mm. The past
statistical NUMA information is unlikely to remain relevant for the
future in these cases.
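
As a rough illustration of "resetting the statistics", assuming a
sched_autonuma structure with an autonuma_node field (-1 meaning "not
profiled yet", as later patches describe) and a hypothetical per-node
fault counter array called numa_fault[]:

#include <linux/sched.h>
#include <linux/string.h>

void autonuma_setup_new_exec(struct task_struct *p)
{
	struct sched_autonuma *sa = p->sched_autonuma;

	if (!sa)		/* non-NUMA hardware: never allocated */
		return;

	sa->autonuma_node = -1;	/* forget the preferred node */
	/* numa_fault[] is an assumed field name for the per-node stats */
	memset(sa->numa_fault, 0, sizeof(sa->numa_fault));
}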

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/exec.c        |    3 +++
 mm/mmu_context.c |    2 ++
 2 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 23559c2..488cc4d 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
 #include <linux/pipe_fs_i.h>
 #include <linux/oom.h>
 #include <linux/compat.h>
+#include <linux/autonuma.h>
 
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -1175,6 +1176,8 @@ void setup_new_exec(struct linux_binprm * bprm)
 			
 	flush_signal_handlers(current, 0);
 	flush_old_files(current->files);
+
+	autonuma_setup_new_exec(current);
 }
 EXPORT_SYMBOL(setup_new_exec);
 
diff --git a/mm/mmu_context.c b/mm/mmu_context.c
index 3dcfaf4..40f0f13 100644
--- a/mm/mmu_context.c
+++ b/mm/mmu_context.c
@@ -7,6 +7,7 @@
 #include <linux/mmu_context.h>
 #include <linux/export.h>
 #include <linux/sched.h>
+#include <linux/autonuma.h>
 
 #include <asm/mmu_context.h>
 
@@ -58,5 +59,6 @@ void unuse_mm(struct mm_struct *mm)
 	/* active_mm is still 'mm' */
 	enter_lazy_tlb(mm, tsk);
 	task_unlock(tsk);
+	autonuma_setup_new_exec(tsk);
 }
 EXPORT_SYMBOL_GPL(unuse_mm);

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 17/39] autonuma: alloc/free/init sched_autonuma
  2012-03-26 17:45 ` Andrea Arcangeli
@ 2012-03-26 17:46   ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-26 17:46 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

This is where the dynamically allocated sched_autonuma structure is
handled.

The reason for keeping this outside of the task_struct, besides not
using too much kernel stack, is that it is only allocated on NUMA
hardware. Non-NUMA hardware therefore only pays the memory cost of a
pointer in the task_struct (which remains NULL at all times in that
case).

If the kernel is compiled with CONFIG_AUTONUMA=n, not even the pointer
is allocated, of course.
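
A sketch of the allocation side, under the assumption of a dedicated
slab cache created by sched_autonuma_init() and a simple "is this NUMA
hardware?" test; the cache name and the initialization details are
illustrative only:

#include <linux/slab.h>
#include <linux/sched.h>
#include <linux/nodemask.h>

static struct kmem_cache *sched_autonuma_cachep;	/* assumed name */

int alloc_sched_autonuma(struct task_struct *tsk,
			 struct task_struct *orig, int node)
{
	tsk->sched_autonuma = NULL;
	if (num_possible_nodes() <= 1)	/* not NUMA: only pay the pointer */
		return 0;

	tsk->sched_autonuma = kmem_cache_alloc_node(sched_autonuma_cachep,
						    GFP_KERNEL, node);
	if (!tsk->sched_autonuma)
		return -ENOMEM;

	/* start unprofiled; whether the parent's statistics are
	 * inherited or reset is a policy detail handled elsewhere */
	tsk->sched_autonuma->autonuma_node = -1;
	return 0;
}

void free_sched_autonuma(struct task_struct *tsk)
{
	if (tsk->sched_autonuma)
		kmem_cache_free(sched_autonuma_cachep, tsk->sched_autonuma);
}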

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/fork.c |   24 ++++++++++++++----------
 1 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index ba3b339..938098b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -168,6 +168,7 @@ static void account_kernel_stack(struct thread_info *ti, int account)
 void free_task(struct task_struct *tsk)
 {
 	account_kernel_stack(tsk->stack, -1);
+	free_sched_autonuma(tsk);
 	free_thread_info(tsk->stack);
 	rt_mutex_debug_task_free(tsk);
 	ftrace_graph_exit_task(tsk);
@@ -227,6 +228,8 @@ void __init fork_init(unsigned long mempages)
 	/* do the arch specific task caches init */
 	arch_task_cache_init();
 
+	sched_autonuma_init();
+
 	/*
 	 * The default maximum number of threads is set to a safe
 	 * value: the thread structures can take up at most half
@@ -259,23 +262,23 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
 	struct thread_info *ti;
 	unsigned long *stackend;
 	int node = tsk_fork_get_node(orig);
-	int err;
 
 	prepare_to_copy(orig);
 
 	tsk = alloc_task_struct_node(node);
-	if (!tsk)
+	if (unlikely(!tsk))
 		return NULL;
 
 	ti = alloc_thread_info_node(tsk, node);
-	if (!ti) {
-		free_task_struct(tsk);
-		return NULL;
-	}
+	if (unlikely(!ti))
+		goto out_task_struct;
 
-	err = arch_dup_task_struct(tsk, orig);
-	if (err)
-		goto out;
+	if (unlikely(arch_dup_task_struct(tsk, orig)))
+		goto out_thread_info;
+
+	if (unlikely(alloc_sched_autonuma(tsk, orig, node)))
+		/* free_thread_info() undoes arch_dup_task_struct() too */
+		goto out_thread_info;
 
 	tsk->stack = ti;
 
@@ -303,8 +306,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
 
 	return tsk;
 
-out:
+out_thread_info:
 	free_thread_info(ti);
+out_task_struct:
 	free_task_struct(tsk);
 	return NULL;
 }

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 18/39] autonuma: alloc/free/init mm_autonuma
  2012-03-26 17:45 ` Andrea Arcangeli
@ 2012-03-26 17:46   ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-26 17:46 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

This is where the mm_autonuma structure is handled. Just like
sched_autonuma, it is only allocated at runtime if the hardware the
kernel is running on has been detected as NUMA. On non-NUMA hardware
the memory cost is reduced to one pointer per mm.

To get rid of the pointer in each mm as well, the kernel can be
compiled with CONFIG_AUTONUMA=n.
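
Presumably mm_autonuma_init() mirrors the mm_struct slab cache setup
visible in the proc_caches_init() hunk below; a sketch, with the cache
name and the NUMA test as assumptions:

#include <linux/slab.h>
#include <linux/nodemask.h>

static struct kmem_cache *mm_autonuma_cachep;	/* assumed name */

void __init mm_autonuma_init(void)
{
	if (num_possible_nodes() <= 1)
		return;		/* non-NUMA: never allocate mm_autonuma */
	mm_autonuma_cachep = kmem_cache_create("mm_autonuma",
					       sizeof(struct mm_autonuma),
					       0, SLAB_PANIC, NULL);
}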

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/fork.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 938098b..aefe24f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -492,6 +492,8 @@ static void mm_init_aio(struct mm_struct *mm)
 
 static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 {
+	if (unlikely(alloc_mm_autonuma(mm)))
+		goto out_free_mm;
 	atomic_set(&mm->mm_users, 1);
 	atomic_set(&mm->mm_count, 1);
 	init_rwsem(&mm->mmap_sem);
@@ -514,6 +516,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 		return mm;
 	}
 
+	free_mm_autonuma(mm);
+out_free_mm:
 	free_mm(mm);
 	return NULL;
 }
@@ -563,6 +567,7 @@ void __mmdrop(struct mm_struct *mm)
 	destroy_context(mm);
 	mmu_notifier_mm_destroy(mm);
 	check_mm(mm);
+	free_mm_autonuma(mm);
 	free_mm(mm);
 }
 EXPORT_SYMBOL_GPL(__mmdrop);
@@ -842,6 +847,7 @@ fail_nocontext:
 	 * If init_new_context() failed, we cannot use mmput() to free the mm
 	 * because it calls destroy_context()
 	 */
+	free_mm_autonuma(mm);
 	mm_free_pgd(mm);
 	free_mm(mm);
 	return NULL;
@@ -1664,6 +1670,7 @@ void __init proc_caches_init(void)
 	mm_cachep = kmem_cache_create("mm_struct",
 			sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_NOTRACK, NULL);
+	mm_autonuma_init();
 	vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC);
 	mmap_init();
 	nsproxy_cache_init();

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 19/39] mm: add unlikely to the mm allocation failure check
  2012-03-26 17:45 ` Andrea Arcangeli
@ 2012-03-26 17:46   ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-26 17:46 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

Very minor optimization: hint gcc that the mm allocation failure path
is unlikely.
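
For reference, unlikely() is the usual kernel wrapper around gcc's
__builtin_expect() (include/linux/compiler.h), which lets the compiler
lay out the failure branch off the hot path:

#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)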

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/fork.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index aefe24f..31b2f35 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -547,7 +547,7 @@ struct mm_struct *mm_alloc(void)
 	struct mm_struct *mm;
 
 	mm = allocate_mm();
-	if (!mm)
+	if (unlikely(!mm))
 		return NULL;
 
 	memset(mm, 0, sizeof(*mm));

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 20/39] autonuma: avoid CFS select_task_rq_fair to return -1
  2012-03-26 17:45 ` Andrea Arcangeli
@ 2012-03-26 17:46   ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-26 17:46 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

Fix select_task_rq_fair() so it cannot return a -1 (invalid CPU)
value.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/sched/fair.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 94340c7..25e9e5b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2769,6 +2769,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		goto unlock;
 	}
 
+	prev_cpu = new_cpu;
 	while (sd) {
 		int load_idx = sd->forkexec_idx;
 		struct sched_group *group;
@@ -2792,6 +2793,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		if (new_cpu == -1 || new_cpu == cpu) {
 			/* Now try balancing at a lower domain level of cpu */
 			sd = sd->child;
+			new_cpu = prev_cpu;
 			continue;
 		}
 
@@ -2810,6 +2812,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 unlock:
 	rcu_read_unlock();
 
+	BUG_ON(new_cpu < 0);
 	return new_cpu;
 }
 #endif /* CONFIG_SMP */

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 21/39] autonuma: fix selecting task runqueue
  2012-03-26 17:45 ` Andrea Arcangeli
@ 2012-03-26 17:46   ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-26 17:46 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

From: Hillf Danton <dhillf@gmail.com>

Without comments, the following three hunks, I guess,
======
@@ -2788,6 +2801,7 @@ select_task_rq_fair(struct task_struct *p,
 		goto unlock;
 	}

+	prev_cpu = new_cpu;
 	while (sd) {
 		int load_idx = sd->forkexec_idx;
 		struct sched_group *group;
@@ -2811,6 +2825,7 @@ select_task_rq_fair(struct task_struct *p,
 		if (new_cpu == -1 || new_cpu == cpu) {
 			/* Now try balancing at a lower domain level of cpu */
 			sd = sd->child;
+			new_cpu = prev_cpu;
 			continue;
 		}

@@ -2826,6 +2841,7 @@ select_task_rq_fair(struct task_struct *p,
 		}
 		/* while loop will break here if sd == NULL */
 	}
+	BUG_ON(new_cpu < 0);
 unlock:
 	rcu_read_unlock();

======
were added to make certain that the selected cpu is valid, based on
the BUG_ON.

But a question is raised: why is prev_cpu changed?

Andrea's answer: yes, the BUG_ON was introduced to verify the function
wouldn't return -1. This patch fixes the problem too.

Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/sched/fair.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 25e9e5b..a8498e0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2769,7 +2769,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		goto unlock;
 	}
 
-	prev_cpu = new_cpu;
 	while (sd) {
 		int load_idx = sd->forkexec_idx;
 		struct sched_group *group;
@@ -2793,7 +2792,10 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		if (new_cpu == -1 || new_cpu == cpu) {
 			/* Now try balancing at a lower domain level of cpu */
 			sd = sd->child;
-			new_cpu = prev_cpu;
+			if (new_cpu == -1) {
+				/* Only for certain that new cpu is valid */
+				new_cpu = prev_cpu;
+			}
 			continue;
 		}
 

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 22/39] autonuma: select_task_rq_fair cleanup new_cpu < 0 fix
  2012-03-26 17:45 ` Andrea Arcangeli
@ 2012-03-26 17:46   ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-26 17:46 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

Clean up the comment for the case where find_idlest_cpu() returns -1,
and check for < 0 instead of == -1, as that should allow gcc to
generate more optimal code.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/sched/fair.c |    5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a8498e0..0c60f46 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2792,10 +2792,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		if (new_cpu == -1 || new_cpu == cpu) {
 			/* Now try balancing at a lower domain level of cpu */
 			sd = sd->child;
-			if (new_cpu == -1) {
-				/* Only for certain that new cpu is valid */
+			if (new_cpu < 0)
+				/* Return prev_cpu is find_idlest_cpu failed */
 				new_cpu = prev_cpu;
-			}
 			continue;
 		}
 

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 23/39] autonuma: teach CFS about autonuma affinity
  2012-03-26 17:45 ` Andrea Arcangeli
@ 2012-03-26 17:46   ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-26 17:46 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

The CFS scheduler is still in charge of all scheduling
decisions. AutoNUMA balancing will at times override those, but
generally we just rely on the CFS scheduler to keep doing its thing,
while preferring the autonuma affine nodes when deciding to move a
process to a different runqueue or when waking it up.

For example, idle balancing will look into the runqueues of the busy
CPUs, but it will first search for a task that wants to run on the
idle CPU in AutoNUMA terms (task_autonuma_cpu() returning true).

Most of this is encoded in can_migrate_task() becoming AutoNUMA aware
and running two passes for each balancing pass: the first pass NUMA
aware, and the second one relaxed.

The idle/newidle balancing is always allowed to fall back to
non-affine AutoNUMA tasks. Load balancing (which is more a fairness
than a performance issue) is instead only able to cross the AutoNUMA
affinity if the flag controlled by
/sys/kernel/mm/autonuma/scheduler/load_balance_strict is not set (it
is set by default).

Tasks that haven't been fully profiled yet are not affected by this,
because their p->sched_autonuma->autonuma_node is still set to the
original value of -1 and task_autonuma_cpu() always returns true in
that case.
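
A sketch of the affinity test described above; it relies only on
details stated in this series (p->sched_autonuma->autonuma_node, and
-1 meaning "not profiled yet"), while the function body itself is
illustrative:

#include <linux/sched.h>
#include <linux/topology.h>

static inline bool task_autonuma_cpu(struct task_struct *p, int cpu)
{
	int autonuma_node;

	if (!p->sched_autonuma)		/* non-NUMA hardware */
		return true;

	autonuma_node = p->sched_autonuma->autonuma_node;
	if (autonuma_node < 0)		/* not profiled yet */
		return true;

	return autonuma_node == cpu_to_node(cpu);
}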

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/sched/fair.c |   57 ++++++++++++++++++++++++++++++++++++++++++++------
 1 files changed, 50 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0c60f46..166168d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -26,6 +26,7 @@
 #include <linux/slab.h>
 #include <linux/profile.h>
 #include <linux/interrupt.h>
+#include <linux/autonuma_sched.h>
 
 #include <trace/events/sched.h>
 
@@ -2618,6 +2619,8 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 
 	/* Traverse only the allowed CPUs */
 	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
+		if (task_autonuma_cpu(p, i))
+			continue;
 		load = weighted_cpuload(i);
 
 		if (load < min_load || (load == min_load && i == this_cpu)) {
@@ -2639,24 +2642,28 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	struct sched_domain *sd;
 	struct sched_group *sg;
 	int i;
+	bool numa;
 
 	/*
 	 * If the task is going to be woken-up on this cpu and if it is
 	 * already idle, then it is the right target.
 	 */
-	if (target == cpu && idle_cpu(cpu))
+	if (target == cpu && idle_cpu(cpu) && task_autonuma_cpu(p, cpu))
 		return cpu;
 
 	/*
 	 * If the task is going to be woken-up on the cpu where it previously
 	 * ran and if it is currently idle, then it the right target.
 	 */
-	if (target == prev_cpu && idle_cpu(prev_cpu))
+	if (target == prev_cpu && idle_cpu(prev_cpu) &&
+	    task_autonuma_cpu(p, prev_cpu))
 		return prev_cpu;
 
 	/*
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
+	numa = true;
+again:
 	sd = rcu_dereference(per_cpu(sd_llc, target));
 	for_each_lower_domain(sd) {
 		sg = sd->groups;
@@ -2666,7 +2673,8 @@ static int select_idle_sibling(struct task_struct *p, int target)
 				goto next;
 
 			for_each_cpu(i, sched_group_cpus(sg)) {
-				if (!idle_cpu(i))
+				if (!idle_cpu(i) ||
+				    (numa && !task_autonuma_cpu(p, i)))
 					goto next;
 			}
 
@@ -2677,6 +2685,10 @@ next:
 			sg = sg->next;
 		} while (sg != sd->groups);
 	}
+	if (numa) {
+		numa = false;
+		goto again;
+	}
 done:
 	return target;
 }
@@ -2707,7 +2719,8 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		return prev_cpu;
 
 	if (sd_flag & SD_BALANCE_WAKE) {
-		if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+		if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) &&
+		    task_autonuma_cpu(p, cpu))
 			want_affine = 1;
 		new_cpu = prev_cpu;
 	}
@@ -3075,6 +3088,7 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
 
 #define LBF_ALL_PINNED	0x01
 #define LBF_NEED_BREAK	0x02
+#define LBF_NUMA	0x04
 
 struct lb_env {
 	struct sched_domain	*sd;
@@ -3145,13 +3159,14 @@ static
 int can_migrate_task(struct task_struct *p, struct lb_env *env)
 {
 	int tsk_cache_hot = 0;
+	struct cpumask *allowed = tsk_cpus_allowed(p);
 	/*
 	 * We do not migrate tasks that are:
 	 * 1) running (obviously), or
 	 * 2) cannot be migrated to this CPU due to cpus_allowed, or
 	 * 3) are cache-hot on their current CPU.
 	 */
-	if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p))) {
+	if (!cpumask_test_cpu(env->dst_cpu, allowed)) {
 		schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
 		return 0;
 	}
@@ -3162,6 +3177,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 		return 0;
 	}
 
+	if (!sched_autonuma_can_migrate_task(p, env->flags & LBF_NUMA,
+					     env->dst_cpu, env->idle, allowed))
+		return 0;
+
 	/*
 	 * Aggressive migration if:
 	 * 1) task is cache cold, or
@@ -3198,6 +3217,8 @@ static int move_one_task(struct lb_env *env)
 {
 	struct task_struct *p, *n;
 
+	env->flags |= LBF_NUMA;
+numa_repeat:
 	list_for_each_entry_safe(p, n, &env->src_rq->cfs_tasks, se.group_node) {
 		if (throttled_lb_pair(task_group(p), env->src_rq->cpu, env->dst_cpu))
 			continue;
@@ -3212,8 +3233,14 @@ static int move_one_task(struct lb_env *env)
 		 * stats here rather than inside move_task().
 		 */
 		schedstat_inc(env->sd, lb_gained[env->idle]);
+		env->flags &= ~LBF_NUMA;
 		return 1;
 	}
+	if (env->flags & LBF_NUMA) {
+		env->flags &= ~LBF_NUMA;
+		goto numa_repeat;
+	}
+
 	return 0;
 }
 
@@ -3236,6 +3263,8 @@ static int move_tasks(struct lb_env *env)
 	if (env->load_move <= 0)
 		return 0;
 
+	env->flags |= LBF_NUMA;
+numa_repeat:
 	while (!list_empty(tasks)) {
 		p = list_first_entry(tasks, struct task_struct, se.group_node);
 
@@ -3275,9 +3304,13 @@ static int move_tasks(struct lb_env *env)
 		 * kernels will stop after the first task is pulled to minimize
 		 * the critical section.
 		 */
-		if (env->idle == CPU_NEWLY_IDLE)
-			break;
+		if (env->idle == CPU_NEWLY_IDLE) {
+			env->flags &= ~LBF_NUMA;
+			goto out;
+		}
 #endif
+		/* not idle anymore after pulling first task */
+		env->idle = CPU_NOT_IDLE;
 
 		/*
 		 * We only want to steal up to the prescribed amount of
@@ -3290,6 +3323,16 @@ static int move_tasks(struct lb_env *env)
 next:
 		list_move_tail(&p->se.group_node, tasks);
 	}
+	if (env->flags & LBF_NUMA) {
+		env->flags &= ~LBF_NUMA;
+		if (env->load_move > 0) {
+			env->loop = 0;
+			goto numa_repeat;
+		}
+	}
+#ifdef CONFIG_PREEMPT
+out:
+#endif
 
 	/*
 	 * Right now, this is one of only two places move_task() is called,

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 24/39] autonuma: fix finding idlest cpu
  2012-03-26 17:45 ` Andrea Arcangeli
@ 2012-03-26 17:46   ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-26 17:46 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

From: Hillf Danton <dhillf@gmail.com>

If autonuma is not enabled, no cpu is selected, which is a behavior
change. We have to fix it.

Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/sched/fair.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 166168d..bf109cc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2619,11 +2619,11 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 
 	/* Traverse only the allowed CPUs */
 	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
-		if (task_autonuma_cpu(p, i))
-			continue;
 		load = weighted_cpuload(i);
 
 		if (load < min_load || (load == min_load && i == this_cpu)) {
+			if (!task_autonuma_cpu(p, i))
+				continue;
 			min_load = load;
 			idlest = i;
 		}

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 25/39] autonuma: fix selecting idle sibling
  2012-03-26 17:45 ` Andrea Arcangeli
@ 2012-03-26 17:46   ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-26 17:46 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

From: Hillf Danton <dhillf@gmail.com>

An autonuma cpu is selected only from within an idle group, without
requiring that every cpu in the group be autonuma-affine for the
given task.

Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/sched/fair.c |   25 +++++++++++++------------
 1 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bf109cc..0d2fe26 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2642,7 +2642,6 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	struct sched_domain *sd;
 	struct sched_group *sg;
 	int i;
-	bool numa;
 
 	/*
 	 * If the task is going to be woken-up on this cpu and if it is
@@ -2662,8 +2661,6 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	/*
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
-	numa = true;
-again:
 	sd = rcu_dereference(per_cpu(sd_llc, target));
 	for_each_lower_domain(sd) {
 		sg = sd->groups;
@@ -2673,22 +2670,26 @@ again:
 				goto next;
 
 			for_each_cpu(i, sched_group_cpus(sg)) {
-				if (!idle_cpu(i) ||
-				    (numa && !task_autonuma_cpu(p, i)))
+				if (!idle_cpu(i))
 					goto next;
 			}
 
-			target = cpumask_first_and(sched_group_cpus(sg),
-					tsk_cpus_allowed(p));
-			goto done;
+			cpu = -1;
+			for_each_cpu_and(i, sched_group_cpus(sg),
+						tsk_cpus_allowed(p)) {
+				/* Find autonuma cpu only in idle group */
+				if (task_autonuma_cpu(p, i)) {
+					target = i;
+					goto done;
+				}
+				if (cpu == -1)
+					cpu = i;
+			}
+			target = cpu;
 next:
 			sg = sg->next;
 		} while (sg != sd->groups);
 	}
-	if (numa) {
-		numa = false;
-		goto again;
-	}
 done:
 	return target;
 }

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 26/39] autonuma: select_idle_sibling cleanup target assignment
  2012-03-26 17:45 ` Andrea Arcangeli
@ 2012-03-26 17:46   ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-26 17:46 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

Clean up the code to avoid reusing the cpu variable, improving
readability.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/sched/fair.c |   10 ++++++----
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0d2fe26..693adc5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2642,6 +2642,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	struct sched_domain *sd;
 	struct sched_group *sg;
 	int i;
+	bool idle_target;
 
 	/*
 	 * If the task is going to be woken-up on this cpu and if it is
@@ -2661,6 +2662,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	/*
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
+	idle_target = false;
 	sd = rcu_dereference(per_cpu(sd_llc, target));
 	for_each_lower_domain(sd) {
 		sg = sd->groups;
@@ -2674,7 +2676,6 @@ static int select_idle_sibling(struct task_struct *p, int target)
 					goto next;
 			}
 
-			cpu = -1;
 			for_each_cpu_and(i, sched_group_cpus(sg),
 						tsk_cpus_allowed(p)) {
 				/* Find autonuma cpu only in idle group */
@@ -2682,10 +2683,11 @@ static int select_idle_sibling(struct task_struct *p, int target)
 					target = i;
 					goto done;
 				}
-				if (cpu == -1)
-					cpu = i;
+				if (!idle_target) {
+					idle_target = true;
+					target = i;
+				}
 			}
-			target = cpu;
 next:
 			sg = sg->next;
 		} while (sg != sd->groups);

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 27/39] autonuma: core
  2012-03-26 17:45 ` Andrea Arcangeli
@ 2012-03-26 17:46   ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-26 17:46 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

This implements knuma_scand, the NUMA hinting page faults started by
knuma_scand, the knuma_migrated daemons that migrate the memory queued
by the NUMA hinting faults, the statistics gathering code (run by
knuma_scand for the mm_autonuma statistics and by the NUMA hinting page
faults for the sched_autonuma statistics), and most of the rest of the
AutoNUMA core logic, such as the false sharing detection, the sysfs
interface and the initialization routines.
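
For reference, this is a summary of the sysfs interface registered by
the initialization code below (the layout simply mirrors the attribute
groups defined in this patch; mm_kobj corresponds to /sys/kernel/mm):

    /sys/kernel/mm/autonuma/{enabled,debug}
    /sys/kernel/mm/autonuma/knuma_scand/{scan_sleep_millisecs,
        scan_sleep_pass_millisecs,pages_to_scan,pages_scanned,
        full_scans,pmd,working_set}
    /sys/kernel/mm/autonuma/knuma_migrated/{migrate_sleep_millisecs,
        pages_to_migrate,pages_migrated,defer}
    /sys/kernel/mm/autonuma/scheduler/{clone_reset,fork_reset,
        load_balance_strict}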

When knuma_scand is not running, AutoNUMA is a complete bypass: it
must not alter the runtime of the memory management and of the
scheduler at all.

The whole AutoNUMA logic is a chain reaction driven by the actions of
knuma_scand.

knuma_scand is the first gear: it collects the per-process mm_autonuma
statistics and, at the same time, it marks the ptes and pmds it scans
as pte_numa and pmd_numa.
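
As a condensed illustration of the per-pte step (the real code is
knumad_scan_pmd() below; the helper name here is made up and the
pmd-level AUTONUMA_SCAN_PMD_FLAG mode is ignored for brevity), every
present, non-shared pte gets accounted in the per-mm temporary
statistics and is then marked pte_numa so that the next access
triggers a NUMA hinting fault:

    static void scan_one_pte_sketch(struct mm_struct *mm,
                                    struct vm_area_struct *vma,
                                    unsigned long addr, pte_t *ptep)
    {
            pte_t pteval = *ptep;
            struct page *page;

            if (!pte_present(pteval) || pte_numa(pteval))
                    return;
            page = vm_normal_page(vma, addr, pteval);
            if (!page || page_mapcount(page) != 1)
                    return;
            /* remember which node currently backs this page */
            mm_autonuma_numa_fault_tmp(mm)[page_migrate_nid(page)]++;
            /* arm the NUMA hinting fault for the next access */
            set_pte_at(mm, addr, ptep, pte_mknuma(pteval));
    }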

The second gear is the NUMA hinting page faults. These are triggered
by the pte_numa ptes and pmd_numa pmds. They collect the per-thread
sched_autonuma statistics. They also implement the memory-follow-CPU
logic, which tracks whether pages are repeatedly accessed by remote
nodes. The memory-follow-CPU logic can decide to migrate pages across
NUMA nodes by queuing them for migration in the per-node
knuma_migrated queues.
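
In short, the per-thread side of a NUMA hinting fault boils down to
the following (a sketch; the real code is
numa_hinting_fault_cpu_follow_memory() below and the helper name here
is invented; on a new knuma_scand pass the counters are additionally
aged by halving them, see cpu_follow_memory_pass()):

    static void account_numa_fault_sketch(struct task_struct *p,
                                          int access_nid, int numpages)
    {
            /* charge the node that ends up hosting the faulted page */
            p->sched_autonuma->numa_fault[access_nid] += numpages;
            p->sched_autonuma->numa_fault_tot += numpages;
    }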

The third gear is knuma_migrated. There is one knuma_migrated daemon
per node. Pages pending migration are queued in a matrix of lists:
each knuma_migrated, running in parallel with the others, walks its
lists and migrates the queued pages in round robin from every incoming
node to the node that knuma_migrated is running on.
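
The "matrix of lists" is indexed by (destination node, source node): a
page sitting on page_nid that should move to dst_nid is queued on the
dst_nid pgdat, in the list dedicated to page_nid. A hypothetical
accessor makes the indexing explicit (autonuma_migrate_head is the
array this patch series adds to struct pglist_data):

    static inline struct list_head *migrate_queue(int dst_nid, int src_nid)
    {
            /* one list per (destination, source) node pair */
            return &NODE_DATA(dst_nid)->autonuma_migrate_head[src_nid];
    }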

The fourth gear is the NUMA scheduler balancing code. It evaluates
the statistical information collected in mm->mm_autonuma and
p->sched_autonuma, together with the status of all CPUs, to decide
whether tasks should be migrated to CPUs on remote nodes.
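
The scheduler side itself is not in this patch (it lives in
kernel/sched/numa.c), but purely as an illustration of how these
statistics can be consumed, the node that received most of a task's
NUMA hinting faults is the natural candidate for that task's
autonuma_node:

    /* illustrative only: not the actual kernel/sched/numa.c algorithm */
    static int task_preferred_nid_sketch(struct task_struct *p)
    {
            int nid, best_nid = -1;
            unsigned long best_faults = 0;

            for_each_node(nid) {
                    unsigned long faults = p->sched_autonuma->numa_fault[nid];
                    if (faults > best_faults) {
                            best_faults = faults;
                            best_nid = nid;
                    }
            }
            return best_nid;
    }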

The code includes fixes from Hillf Danton <dhillf@gmail.com>.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/autonuma.c | 1437 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 1437 insertions(+), 0 deletions(-)
 create mode 100644 mm/autonuma.c

diff --git a/mm/autonuma.c b/mm/autonuma.c
new file mode 100644
index 0000000..7ca4992
--- /dev/null
+++ b/mm/autonuma.c
@@ -0,0 +1,1437 @@
+/*
+ *  Copyright (C) 2012  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ *
+ *  Boot with "numa=fake=2" to test on not NUMA systems.
+ */
+
+#include <linux/mm.h>
+#include <linux/rmap.h>
+#include <linux/kthread.h>
+#include <linux/mmu_notifier.h>
+#include <linux/freezer.h>
+#include <linux/mm_inline.h>
+#include <linux/migrate.h>
+#include <linux/swap.h>
+#include <linux/autonuma.h>
+#include <asm/tlbflush.h>
+#include <asm/pgtable.h>
+
+unsigned long autonuma_flags __read_mostly =
+	(1<<AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG)|
+	(1<<AUTONUMA_SCHED_CLONE_RESET_FLAG)|
+	(1<<AUTONUMA_SCHED_FORK_RESET_FLAG)|
+#ifdef CONFIG_AUTONUMA_DEFAULT_ENABLED
+	(1<<AUTONUMA_FLAG)|
+#endif
+	(1<<AUTONUMA_SCAN_PMD_FLAG);
+
+static DEFINE_MUTEX(knumad_mm_mutex);
+
+/* knuma_scand */
+static unsigned int scan_sleep_millisecs __read_mostly = 100;
+static unsigned int scan_sleep_pass_millisecs __read_mostly = 5000;
+static unsigned int pages_to_scan __read_mostly = 128*1024*1024/PAGE_SIZE;
+static DECLARE_WAIT_QUEUE_HEAD(knuma_scand_wait);
+static unsigned long full_scans;
+static unsigned long pages_scanned;
+
+/* knuma_migrated */
+static unsigned int migrate_sleep_millisecs __read_mostly = 100;
+static unsigned int pages_to_migrate __read_mostly = 128*1024*1024/PAGE_SIZE;
+static volatile unsigned long pages_migrated;
+
+static struct knumad_scan {
+	struct list_head mm_head;
+	struct mm_struct *mm;
+	unsigned long address;
+} knumad_scan = {
+	.mm_head = LIST_HEAD_INIT(knumad_scan.mm_head),
+};
+
+static inline bool autonuma_impossible(void)
+{
+	return num_possible_nodes() <= 1 ||
+		test_bit(AUTONUMA_IMPOSSIBLE, &autonuma_flags);
+}
+
+static inline void autonuma_migrate_lock(int nid)
+{
+	spin_lock(&NODE_DATA(nid)->autonuma_lock);
+}
+
+static inline void autonuma_migrate_unlock(int nid)
+{
+	spin_unlock(&NODE_DATA(nid)->autonuma_lock);
+}
+
+static inline void autonuma_migrate_lock_irq(int nid)
+{
+	spin_lock_irq(&NODE_DATA(nid)->autonuma_lock);
+}
+
+static inline void autonuma_migrate_unlock_irq(int nid)
+{
+	spin_unlock_irq(&NODE_DATA(nid)->autonuma_lock);
+}
+
+/* caller already holds the compound_lock */
+void autonuma_migrate_split_huge_page(struct page *page,
+				      struct page *page_tail)
+{
+	int nid, last_nid;
+
+	nid = page->autonuma_migrate_nid;
+	VM_BUG_ON(nid >= MAX_NUMNODES);
+	VM_BUG_ON(nid < -1);
+	VM_BUG_ON(page_tail->autonuma_migrate_nid != -1);
+	if (nid >= 0) {
+		VM_BUG_ON(page_to_nid(page) != page_to_nid(page_tail));
+		autonuma_migrate_lock(nid);
+		list_add_tail(&page_tail->autonuma_migrate_node,
+			      &page->autonuma_migrate_node);
+		autonuma_migrate_unlock(nid);
+
+		page_tail->autonuma_migrate_nid = nid;
+	}
+
+	last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+	if (last_nid >= 0)
+		page_tail->autonuma_last_nid = last_nid;
+}
+
+void __autonuma_migrate_page_remove(struct page *page)
+{
+	unsigned long flags;
+	int nid;
+
+	flags = compound_lock_irqsave(page);
+
+	nid = page->autonuma_migrate_nid;
+	VM_BUG_ON(nid >= MAX_NUMNODES);
+	VM_BUG_ON(nid < -1);
+	if (nid >= 0) {
+		int numpages = hpage_nr_pages(page);
+		autonuma_migrate_lock(nid);
+		list_del(&page->autonuma_migrate_node);
+		NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
+		autonuma_migrate_unlock(nid);
+
+		page->autonuma_migrate_nid = -1;
+	}
+
+	compound_unlock_irqrestore(page, flags);
+}
+
+static void __autonuma_migrate_page_add(struct page *page, int dst_nid,
+					int page_nid)
+{
+	unsigned long flags;
+	int nid;
+	int numpages;
+	unsigned long nr_migrate_pages;
+	wait_queue_head_t *wait_queue;
+
+	VM_BUG_ON(dst_nid >= MAX_NUMNODES);
+	VM_BUG_ON(dst_nid < -1);
+	VM_BUG_ON(page_nid >= MAX_NUMNODES);
+	VM_BUG_ON(page_nid < -1);
+
+	VM_BUG_ON(page_nid == dst_nid);
+	VM_BUG_ON(page_to_nid(page) != page_nid);
+
+	flags = compound_lock_irqsave(page);
+
+	numpages = hpage_nr_pages(page);
+	nid = page->autonuma_migrate_nid;
+	VM_BUG_ON(nid >= MAX_NUMNODES);
+	VM_BUG_ON(nid < -1);
+	if (nid >= 0) {
+		autonuma_migrate_lock(nid);
+		list_del(&page->autonuma_migrate_node);
+		NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
+		autonuma_migrate_unlock(nid);
+	}
+
+	autonuma_migrate_lock(dst_nid);
+	list_add(&page->autonuma_migrate_node,
+		 &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid]);
+	NODE_DATA(dst_nid)->autonuma_nr_migrate_pages += numpages;
+	nr_migrate_pages = NODE_DATA(dst_nid)->autonuma_nr_migrate_pages;
+
+	autonuma_migrate_unlock(dst_nid);
+
+	page->autonuma_migrate_nid = dst_nid;
+
+	compound_unlock_irqrestore(page, flags);
+
+	if (!autonuma_migrate_defer()) {
+		wait_queue = &NODE_DATA(dst_nid)->autonuma_knuma_migrated_wait;
+		if (nr_migrate_pages >= pages_to_migrate &&
+		    nr_migrate_pages - numpages < pages_to_migrate &&
+		    waitqueue_active(wait_queue))
+			wake_up_interruptible(wait_queue);
+	}
+}
+
+static void autonuma_migrate_page_add(struct page *page, int dst_nid,
+				      int page_nid)
+{
+	int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+	if (migrate_nid != dst_nid)
+		__autonuma_migrate_page_add(page, dst_nid, page_nid);
+}
+
+static bool balance_pgdat(struct pglist_data *pgdat,
+			  int nr_migrate_pages)
+{
+	/* FIXME: this only check the wmarks, make it move
+	 * "unused" memory or pagecache by queuing it to
+	 * pgdat->autonuma_migrate_head[pgdat->node_id].
+	 */
+	int z;
+	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+		struct zone *zone = pgdat->node_zones + z;
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (zone->all_unreclaimable)
+			continue;
+
+		/*
+		 * FIXME: deal with order with THP, maybe if
+		 * kswapd will learn using compaction, otherwise
+		 * order = 0 probably is ok.
+		 * FIXME: in theory we're ok if we can obtain
+		 * pages_to_migrate pages from all zones, it doesn't
+		 * need to be all in a single zone. We care about the
+		 * pgdat, the zone not.
+		 */
+
+		/*
+		 * Try not to wakeup kswapd by allocating
+		 * pages_to_migrate pages.
+		 */
+		if (!zone_watermark_ok(zone, 0,
+				       high_wmark_pages(zone) +
+				       nr_migrate_pages,
+				       0, 0))
+			continue;
+		return true;
+	}
+	return false;
+}
+
+static void cpu_follow_memory_pass(struct task_struct *p,
+				   struct sched_autonuma *sched_autonuma,
+				   unsigned long *numa_fault)
+{
+	int nid;
+	for_each_node(nid)
+		numa_fault[nid] >>= 1;
+	sched_autonuma->numa_fault_tot >>= 1;
+}
+
+static void numa_hinting_fault_cpu_follow_memory(struct task_struct *p,
+						 int access_nid,
+						 int numpages,
+						 bool pass)
+{
+	struct sched_autonuma *sched_autonuma = p->sched_autonuma;
+	unsigned long *numa_fault = sched_autonuma->numa_fault;
+	if (unlikely(pass))
+		cpu_follow_memory_pass(p, sched_autonuma, numa_fault);
+	numa_fault[access_nid] += numpages;
+	sched_autonuma->numa_fault_tot += numpages;
+}
+
+static inline bool last_nid_set(struct task_struct *p,
+				struct page *page, int cpu_nid)
+{
+	bool ret = true;
+	int autonuma_last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+	VM_BUG_ON(cpu_nid < 0);
+	VM_BUG_ON(cpu_nid >= MAX_NUMNODES);
+	if (autonuma_last_nid >= 0 && autonuma_last_nid != cpu_nid) {
+		int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+		if (migrate_nid >= 0 && migrate_nid != cpu_nid)
+			__autonuma_migrate_page_remove(page);
+		ret = false;
+	}
+	if (autonuma_last_nid != cpu_nid)
+		ACCESS_ONCE(page->autonuma_last_nid) = cpu_nid;
+	return ret;
+}
+
+static int __page_migrate_nid(struct page *page, int page_nid)
+{
+	int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+	if (migrate_nid < 0)
+		migrate_nid = page_nid;
+#if 0
+	return page_nid;
+#endif
+	return migrate_nid;
+}
+
+static int page_migrate_nid(struct page *page)
+{
+	return __page_migrate_nid(page, page_to_nid(page));
+}
+
+static int numa_hinting_fault_memory_follow_cpu(struct task_struct *p,
+						struct page *page,
+						int cpu_nid, int page_nid,
+						bool pass)
+{
+	if (!last_nid_set(p, page, cpu_nid))
+		return __page_migrate_nid(page, page_nid);
+	if (!PageLRU(page))
+		return page_nid;
+	if (cpu_nid != page_nid)
+		autonuma_migrate_page_add(page, cpu_nid, page_nid);
+	else
+		autonuma_migrate_page_remove(page);
+	return cpu_nid;
+}
+
+void numa_hinting_fault(struct page *page, int numpages)
+{
+	WARN_ON_ONCE(!current->mm);
+	if (likely(current->mm && !current->mempolicy && autonuma_enabled())) {
+		struct task_struct *p = current;
+		int cpu_nid, page_nid, access_nid;
+		bool pass;
+
+		pass = p->sched_autonuma->numa_fault_pass !=
+			p->mm->mm_autonuma->numa_fault_pass;
+		page_nid = page_to_nid(page);
+		cpu_nid = numa_node_id();
+		VM_BUG_ON(cpu_nid < 0);
+		VM_BUG_ON(cpu_nid >= MAX_NUMNODES);
+		access_nid = numa_hinting_fault_memory_follow_cpu(p, page,
+								  cpu_nid,
+								  page_nid,
+								  pass);
+		numa_hinting_fault_cpu_follow_memory(p, access_nid,
+						     numpages, pass);
+		if (unlikely(pass))
+			p->sched_autonuma->numa_fault_pass =
+				p->mm->mm_autonuma->numa_fault_pass;
+	}
+}
+
+pte_t __pte_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
+		       unsigned long addr, pte_t pte, pte_t *ptep)
+{
+	struct page *page;
+	pte = pte_mknotnuma(pte);
+	set_pte_at(mm, addr, ptep, pte);
+	page = vm_normal_page(vma, addr, pte);
+	BUG_ON(!page);
+	numa_hinting_fault(page, 1);
+	return pte;
+}
+
+void __pmd_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
+		      unsigned long addr, pmd_t *pmdp)
+{
+	pmd_t pmd;
+	pte_t *pte;
+	unsigned long _addr = addr & PMD_MASK;
+	spinlock_t *ptl;
+	bool numa = false;
+
+	spin_lock(&mm->page_table_lock);
+	pmd = *pmdp;
+	if (pmd_numa(pmd)) {
+		set_pmd_at(mm, _addr, pmdp, pmd_mknotnuma(pmd));
+		numa = true;
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	if (!numa)
+		return;
+
+	pte = pte_offset_map_lock(mm, pmdp, _addr, &ptl);
+	for (addr = _addr; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
+		pte_t pteval = *pte;
+		struct page * page;
+		if (!pte_present(pteval))
+			continue;
+		if (pte_numa(pteval)) {
+			pteval = pte_mknotnuma(pteval);
+			set_pte_at(mm, addr, pte, pteval);
+		}
+		page = vm_normal_page(vma, addr, pteval);
+		if (unlikely(!page))
+			continue;
+		/* only check non-shared pages */
+		if (page_mapcount(page) != 1)
+			continue;
+		numa_hinting_fault(page, 1);
+	}
+	pte_unmap_unlock(pte, ptl);
+}
+
+static inline int sched_autonuma_size(void)
+{
+	return sizeof(struct sched_autonuma) +
+		num_possible_nodes() * sizeof(unsigned long);
+}
+
+static inline int sched_autonuma_reset_size(void)
+{
+	struct sched_autonuma *sched_autonuma = NULL;
+	return sched_autonuma_size() -
+		(int)((char *)(&sched_autonuma->autonuma_stop_one_cpu) -
+		      (char *)sched_autonuma);
+}
+
+static void sched_autonuma_reset(struct sched_autonuma *sched_autonuma)
+{
+	sched_autonuma->autonuma_node = -1;
+	memset(&sched_autonuma->autonuma_stop_one_cpu, 0,
+	       sched_autonuma_reset_size());
+}
+
+static inline int mm_autonuma_fault_size(void)
+{
+	return num_possible_nodes() * sizeof(unsigned long);
+}
+
+static inline unsigned long *mm_autonuma_numa_fault_tmp(struct mm_struct *mm)
+{
+	return mm->mm_autonuma->numa_fault + num_possible_nodes();
+}
+
+static inline int mm_autonuma_size(void)
+{
+	return sizeof(struct mm_autonuma) + mm_autonuma_fault_size() * 2;
+}
+
+static inline int mm_autonuma_reset_size(void)
+{
+	struct mm_autonuma *mm_autonuma = NULL;
+	return mm_autonuma_size() -
+		(int)((char *)(&mm_autonuma->numa_fault_tot) -
+		      (char *)mm_autonuma);
+}
+
+static void mm_autonuma_reset(struct mm_autonuma *mm_autonuma)
+{
+	memset(&mm_autonuma->numa_fault_tot, 0, mm_autonuma_reset_size());
+}
+
+void autonuma_setup_new_exec(struct task_struct *p)
+{
+	if (p->sched_autonuma)
+		sched_autonuma_reset(p->sched_autonuma);
+	if (p->mm && p->mm->mm_autonuma)
+		mm_autonuma_reset(p->mm->mm_autonuma);
+}
+
+static inline int knumad_test_exit(struct mm_struct *mm)
+{
+	return atomic_read(&mm->mm_users) == 0;
+}
+
+static int knumad_scan_pmd(struct mm_struct *mm,
+			   struct vm_area_struct *vma,
+			   unsigned long address)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte, *_pte;
+	struct page *page;
+	unsigned long _address, end;
+	spinlock_t *ptl;
+	int ret = 0;
+
+	VM_BUG_ON(address & ~PAGE_MASK);
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		goto out;
+
+	pmd = pmd_offset(pud, address);
+	if (pmd_none(*pmd))
+		goto out;
+	if (pmd_trans_huge(*pmd)) {
+		spin_lock(&mm->page_table_lock);
+		if (pmd_trans_huge(*pmd)) {
+			VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+			if (unlikely(pmd_trans_splitting(*pmd))) {
+				spin_unlock(&mm->page_table_lock);
+				wait_split_huge_page(vma->anon_vma, pmd);
+			} else {
+				int page_nid;
+				unsigned long *numa_fault_tmp;
+				ret = HPAGE_PMD_NR;
+
+				if (autonuma_scan_use_working_set() &&
+				    pmd_numa(*pmd)) {
+					spin_unlock(&mm->page_table_lock);
+					goto out;
+				}
+
+				page = pmd_page(*pmd);
+
+				/* only check non-shared pages */
+				if (page_mapcount(page) != 1) {
+					spin_unlock(&mm->page_table_lock);
+					goto out;
+				}
+
+				page_nid = page_migrate_nid(page);
+				numa_fault_tmp = mm_autonuma_numa_fault_tmp(mm);
+				numa_fault_tmp[page_nid] += ret;
+
+				if (pmd_numa(*pmd)) {
+					spin_unlock(&mm->page_table_lock);
+					goto out;
+				}
+
+				set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
+				/* defer TLB flush to lower the overhead */
+				spin_unlock(&mm->page_table_lock);
+				goto out;
+			}
+		} else
+			spin_unlock(&mm->page_table_lock);
+	}
+
+	VM_BUG_ON(!pmd_present(*pmd));
+
+	end = min(vma->vm_end, (address + PMD_SIZE) & PMD_MASK);
+	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+	for (_address = address, _pte = pte; _address < end;
+	     _pte++, _address += PAGE_SIZE) {
+		unsigned long *numa_fault_tmp;
+		pte_t pteval = *_pte;
+		if (!pte_present(pteval))
+			continue;
+		if (autonuma_scan_use_working_set() &&
+		    pte_numa(pteval))
+			continue;
+		page = vm_normal_page(vma, _address, pteval);
+		if (unlikely(!page))
+			continue;
+		/* only check non-shared pages */
+		if (page_mapcount(page) != 1)
+			continue;
+
+		numa_fault_tmp = mm_autonuma_numa_fault_tmp(mm);
+		numa_fault_tmp[page_migrate_nid(page)]++;
+
+		if (pte_numa(pteval))
+			continue;
+
+		if (!autonuma_scan_pmd())
+			set_pte_at(mm, _address, _pte, pte_mknuma(pteval));
+
+		/* defer TLB flush to lower the overhead */
+		ret++;
+	}
+	pte_unmap_unlock(pte, ptl);
+
+	if (ret && !pmd_numa(*pmd) && autonuma_scan_pmd()) {
+		spin_lock(&mm->page_table_lock);
+		set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
+		spin_unlock(&mm->page_table_lock);
+		/* defer TLB flush to lower the overhead */
+	}
+
+out:
+	return ret;
+}
+
+static void mm_numa_fault_flush(struct mm_struct *mm)
+{
+	int nid;
+	struct mm_autonuma *mma = mm->mm_autonuma;
+	unsigned long *numa_fault_tmp = mm_autonuma_numa_fault_tmp(mm);
+	unsigned long tot = 0;
+	/* FIXME: protect this with seqlock against autonuma_balance() */
+	for_each_node(nid) {
+		mma->numa_fault[nid] = numa_fault_tmp[nid];
+		tot += mma->numa_fault[nid];
+		numa_fault_tmp[nid] = 0;
+	}
+	mma->numa_fault_tot = tot;
+}
+
+static int knumad_do_scan(void)
+{
+	struct mm_struct *mm;
+	struct mm_autonuma *mm_autonuma;
+	unsigned long address;
+	struct vm_area_struct *vma;
+	int progress = 0;
+
+	mm = knumad_scan.mm;
+	if (!mm) {
+		if (unlikely(list_empty(&knumad_scan.mm_head)))
+			return pages_to_scan;
+		mm_autonuma = list_entry(knumad_scan.mm_head.next,
+					 struct mm_autonuma, mm_node);
+		mm = mm_autonuma->mm;
+		knumad_scan.address = 0;
+		knumad_scan.mm = mm;
+		atomic_inc(&mm->mm_count);
+		mm_autonuma->numa_fault_pass++;
+	}
+	address = knumad_scan.address;
+
+	mutex_unlock(&knumad_mm_mutex);
+
+	down_read(&mm->mmap_sem);
+	if (unlikely(knumad_test_exit(mm)))
+		vma = NULL;
+	else
+		vma = find_vma(mm, address);
+
+	progress++;
+	for (; vma && progress < pages_to_scan; vma = vma->vm_next) {
+		unsigned long start_addr, end_addr;
+		cond_resched();
+		if (unlikely(knumad_test_exit(mm))) {
+			progress++;
+			break;
+		}
+
+		if (!vma->anon_vma || vma_policy(vma)) {
+			progress++;
+			continue;
+		}
+		if (is_vma_temporary_stack(vma)) {
+			progress++;
+			continue;
+		}
+
+		VM_BUG_ON(address & ~PAGE_MASK);
+		if (address < vma->vm_start)
+			address = vma->vm_start;
+
+		start_addr = address;
+		while (address < vma->vm_end) {
+			cond_resched();
+			if (unlikely(knumad_test_exit(mm)))
+				break;
+
+			VM_BUG_ON(address < vma->vm_start ||
+				  address + PAGE_SIZE > vma->vm_end);
+			progress += knumad_scan_pmd(mm, vma, address);
+			/* move to next address */
+			address = (address + PMD_SIZE) & PMD_MASK;
+			if (progress >= pages_to_scan)
+				break;
+		}
+		end_addr = min(address, vma->vm_end);
+
+		/*
+		 * Flush the TLB for the mm to start the numa
+		 * hinting minor page faults after we finish
+		 * scanning this vma part.
+		 */
+		mmu_notifier_invalidate_range_start(vma->vm_mm, start_addr,
+						    end_addr);
+		flush_tlb_range(vma, start_addr, end_addr);
+		mmu_notifier_invalidate_range_end(vma->vm_mm, start_addr,
+						  end_addr);
+	}
+	up_read(&mm->mmap_sem); /* exit_mmap will destroy ptes after this */
+
+	mutex_lock(&knumad_mm_mutex);
+	VM_BUG_ON(knumad_scan.mm != mm);
+	knumad_scan.address = address;
+	/*
+	 * Change the current mm if this mm is about to die, or if we
+	 * scanned all vmas of this mm.
+	 */
+	if (knumad_test_exit(mm) || !vma) {
+		mm_autonuma = mm->mm_autonuma;
+		if (mm_autonuma->mm_node.next != &knumad_scan.mm_head) {
+			mm_autonuma = list_entry(mm_autonuma->mm_node.next,
+						 struct mm_autonuma, mm_node);
+			knumad_scan.mm = mm_autonuma->mm;
+			atomic_inc(&knumad_scan.mm->mm_count);
+			knumad_scan.address = 0;
+			knumad_scan.mm->mm_autonuma->numa_fault_pass++;
+		} else
+			knumad_scan.mm = NULL;
+
+		if (knumad_test_exit(mm))
+			list_del(&mm->mm_autonuma->mm_node);
+		else
+			mm_numa_fault_flush(mm);
+
+		mmdrop(mm);
+	}
+
+	return progress;
+}
+
+static void wake_up_knuma_migrated(void)
+{
+	int nid;
+
+	lru_add_drain();
+	for_each_online_node(nid) {
+		struct pglist_data *pgdat = NODE_DATA(nid);
+		if (pgdat->autonuma_nr_migrate_pages &&
+		    waitqueue_active(&pgdat->autonuma_knuma_migrated_wait))
+			wake_up_interruptible(&pgdat->
+					      autonuma_knuma_migrated_wait);
+	}
+}
+
+static void knuma_scand_disabled(void)
+{
+	if (!autonuma_enabled())
+		wait_event_freezable(knuma_scand_wait,
+				     autonuma_enabled() ||
+				     kthread_should_stop());
+}
+
+static int knuma_scand(void *none)
+{
+	struct mm_struct *mm = NULL;
+	int progress = 0, _progress;
+	unsigned long total_progress = 0;
+
+	set_freezable();
+
+	knuma_scand_disabled();
+
+	mutex_lock(&knumad_mm_mutex);
+
+	for (;;) {
+		if (unlikely(kthread_should_stop()))
+			break;
+		_progress = knumad_do_scan();
+		progress += _progress;
+		total_progress += _progress;
+		mutex_unlock(&knumad_mm_mutex);
+
+		if (unlikely(!knumad_scan.mm)) {
+			autonuma_printk("knuma_scand %lu\n", total_progress);
+			pages_scanned += total_progress;
+			total_progress = 0;
+			full_scans++;
+
+			wait_event_freezable_timeout(knuma_scand_wait,
+						     kthread_should_stop(),
+						     msecs_to_jiffies(
+						     scan_sleep_pass_millisecs));
+			/* flush the last pending pages < pages_to_migrate */
+			wake_up_knuma_migrated();
+			wait_event_freezable_timeout(knuma_scand_wait,
+						     kthread_should_stop(),
+						     msecs_to_jiffies(
+						     scan_sleep_pass_millisecs));
+
+			if (autonuma_debug()) {
+				extern void sched_autonuma_dump_mm(void);
+				sched_autonuma_dump_mm();
+			}
+
+			/* wait while there is no pinned mm */
+			knuma_scand_disabled();
+		}
+		if (progress > pages_to_scan) {
+			progress = 0;
+			wait_event_freezable_timeout(knuma_scand_wait,
+						     kthread_should_stop(),
+						     msecs_to_jiffies(
+						     scan_sleep_millisecs));
+		}
+		cond_resched();
+		mutex_lock(&knumad_mm_mutex);
+	}
+
+	mm = knumad_scan.mm;
+	knumad_scan.mm = NULL;
+	if (mm)
+		list_del(&mm->mm_autonuma->mm_node);
+	mutex_unlock(&knumad_mm_mutex);
+
+	if (mm)
+		mmdrop(mm);
+
+	return 0;
+}
+
+static int isolate_migratepages(struct list_head *migratepages,
+				struct pglist_data *pgdat)
+{
+	int nr = 0, nid;
+	struct list_head *heads = pgdat->autonuma_migrate_head;
+
+	/* FIXME: THP balancing, restart from last nid */
+	for_each_online_node(nid) {
+		struct zone *zone;
+		struct page *page;
+		cond_resched();
+		VM_BUG_ON(numa_node_id() != pgdat->node_id);
+		if (nid == pgdat->node_id) {
+			VM_BUG_ON(!list_empty(&heads[nid]));
+			continue;
+		}
+		if (list_empty(&heads[nid]))
+			continue;
+		/* some page wants to go to this pgdat */
+		/*
+		 * Take the lock with irqs disabled to avoid a lock
+		 * inversion with the lru_lock which is taken before
+		 * the autonuma_migrate_lock in split_huge_page, and
+		 * that could be taken by interrupts after we obtained
+		 * the autonuma_migrate_lock here, if we didn't disable
+		 * irqs.
+		 */
+		autonuma_migrate_lock_irq(pgdat->node_id);
+		if (list_empty(&heads[nid])) {
+			autonuma_migrate_unlock_irq(pgdat->node_id);
+			continue;
+		}
+		page = list_entry(heads[nid].prev,
+				  struct page,
+				  autonuma_migrate_node);
+		if (unlikely(!get_page_unless_zero(page))) {
+			/*
+			 * Is getting freed and will remove self from the
+			 * autonuma list shortly, skip it for now.
+			 */
+			list_del(&page->autonuma_migrate_node);
+			list_add(&page->autonuma_migrate_node,
+				 &heads[nid]);
+			autonuma_migrate_unlock_irq(pgdat->node_id);
+			autonuma_printk("autonuma migrate page is free\n");
+			continue;
+		}
+		if (!PageLRU(page)) {
+			autonuma_migrate_unlock_irq(pgdat->node_id);
+			autonuma_printk("autonuma migrate page not in LRU\n");
+			__autonuma_migrate_page_remove(page);
+			put_page(page);
+			continue;
+		}
+		autonuma_migrate_unlock_irq(pgdat->node_id);
+
+		VM_BUG_ON(nid != page_to_nid(page));
+
+		if (PageTransHuge(page))
+			/* FIXME: remove split_huge_page */
+			split_huge_page(page);
+
+		__autonuma_migrate_page_remove(page);
+		put_page(page);
+
+		zone = page_zone(page);
+		spin_lock_irq(&zone->lru_lock);
+		if (!__isolate_lru_page(page, ISOLATE_ACTIVE|ISOLATE_INACTIVE,
+					0)) {
+			/* FIXME NR_ISOLATED */
+			VM_BUG_ON(PageTransCompound(page));
+			del_page_from_lru_list(zone, page, page_lru(page));
+			inc_zone_state(zone, page_is_file_cache(page) ?
+				       NR_ISOLATED_FILE : NR_ISOLATED_ANON);
+			spin_unlock_irq(&zone->lru_lock);
+
+			list_add(&page->lru, migratepages);
+			nr += hpage_nr_pages(page);
+		} else {
+			/* FIXME: losing page, safest and simplest for now */
+			spin_unlock_irq(&zone->lru_lock);
+			autonuma_printk("autonuma migrate page lost\n");
+		}
+
+	}
+
+	return nr;
+}
+
+static struct page *alloc_migrate_dst_page(struct page *page,
+					   unsigned long data,
+					   int **result)
+{
+	int nid = (int) data;
+	struct page *newpage;
+	newpage = alloc_pages_exact_node(nid,
+					 GFP_HIGHUSER_MOVABLE | GFP_THISNODE,
+					 0);
+	if (newpage)
+		newpage->autonuma_last_nid = page->autonuma_last_nid;
+	return newpage;
+}
+
+static void knumad_do_migrate(struct pglist_data *pgdat)
+{
+	int nr_migrate_pages = 0;
+	LIST_HEAD(migratepages);
+
+	autonuma_printk("nr_migrate_pages %lu to node %d\n",
+			pgdat->autonuma_nr_migrate_pages, pgdat->node_id);
+	do {
+		int isolated = 0;
+		if (balance_pgdat(pgdat, nr_migrate_pages))
+			isolated = isolate_migratepages(&migratepages, pgdat);
+		/* FIXME: might need to check too many isolated */
+		if (!isolated)
+			break;
+		nr_migrate_pages += isolated;
+	} while (nr_migrate_pages < pages_to_migrate);
+
+	if (nr_migrate_pages) {
+		int err;
+		autonuma_printk("migrate %d to node %d\n", nr_migrate_pages,
+				pgdat->node_id);
+		pages_migrated += nr_migrate_pages; /* FIXME: per node */
+		err = migrate_pages(&migratepages, alloc_migrate_dst_page,
+				    pgdat->node_id, false, true);
+		if (err)
+			/* FIXME: requeue failed pages */
+			putback_lru_pages(&migratepages);
+	}
+}
+
+static int knuma_migrated(void *arg)
+{
+	struct pglist_data *pgdat = (struct pglist_data *)arg;
+	int nid = pgdat->node_id;
+	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(nowakeup);
+
+	set_freezable();
+
+	for (;;) {
+		if (unlikely(kthread_should_stop()))
+			break;
+		/* FIXME: scan the free levels of this node we may not
+		 * be allowed to receive memory if the wmark of this
+		 * pgdat are below high.  In the future also add
+		 * not-interesting pages like not-accessed pages to
+		 * pgdat->autonuma_migrate_head[pgdat->node_id]; so we
+		 * can move our memory away to other nodes in order
+		 * to satisfy the high-wmark described above (so migration
+		 * can continue).
+		 */
+		knumad_do_migrate(pgdat);
+		if (!pgdat->autonuma_nr_migrate_pages) {
+			wait_event_freezable(
+				pgdat->autonuma_knuma_migrated_wait,
+				pgdat->autonuma_nr_migrate_pages ||
+				kthread_should_stop());
+			autonuma_printk("wake knuma_migrated %d\n", nid);
+		} else
+			wait_event_freezable_timeout(nowakeup,
+						     kthread_should_stop(),
+						     msecs_to_jiffies(
+						     migrate_sleep_millisecs));
+	}
+
+	return 0;
+}
+
+void autonuma_enter(struct mm_struct *mm)
+{
+	if (autonuma_impossible())
+		return;
+
+	mutex_lock(&knumad_mm_mutex);
+	list_add_tail(&mm->mm_autonuma->mm_node, &knumad_scan.mm_head);
+	mutex_unlock(&knumad_mm_mutex);
+}
+
+void autonuma_exit(struct mm_struct *mm)
+{
+	bool serialize;
+
+	if (autonuma_impossible())
+		return;
+
+	serialize = false;
+	mutex_lock(&knumad_mm_mutex);
+	if (knumad_scan.mm == mm)
+		serialize = true;
+	else
+		list_del(&mm->mm_autonuma->mm_node);
+	mutex_unlock(&knumad_mm_mutex);
+
+	if (serialize) {
+		down_write(&mm->mmap_sem);
+		up_write(&mm->mmap_sem);
+	}
+}
+
+static int start_knuma_scand(void)
+{
+	int err = 0;
+	struct task_struct *knumad_thread;
+
+	knumad_thread = kthread_run(knuma_scand, NULL, "knuma_scand");
+	if (unlikely(IS_ERR(knumad_thread))) {
+		autonuma_printk(KERN_ERR
+				"knumad: kthread_run(knuma_scand) failed\n");
+		err = PTR_ERR(knumad_thread);
+	}
+	return err;
+}
+
+static int start_knuma_migrated(void)
+{
+	int err = 0;
+	struct task_struct *knumad_thread;
+	int nid;
+
+	for_each_online_node(nid) {
+		knumad_thread = kthread_create_on_node(knuma_migrated,
+						       NODE_DATA(nid),
+						       nid,
+						       "knuma_migrated%d",
+						       nid);
+		if (unlikely(IS_ERR(knumad_thread))) {
+			autonuma_printk(KERN_ERR
+					"knumad: "
+					"kthread_run(knuma_migrated%d) "
+					"failed\n", nid);
+			err = PTR_ERR(knumad_thread);
+		} else {
+			autonuma_printk("cpumask %d %lx\n", nid,
+					cpumask_of_node(nid)->bits[0]);
+			kthread_bind_node(knumad_thread, nid);
+			wake_up_process(knumad_thread);
+		}
+	}
+	return err;
+}
+
+
+#ifdef CONFIG_SYSFS
+
+static ssize_t flag_show(struct kobject *kobj,
+			 struct kobj_attribute *attr, char *buf,
+			 enum autonuma_flag flag)
+{
+	return sprintf(buf, "%d\n",
+		       !!test_bit(flag, &autonuma_flags));
+}
+static ssize_t flag_store(struct kobject *kobj,
+			  struct kobj_attribute *attr,
+			  const char *buf, size_t count,
+			  enum autonuma_flag flag)
+{
+	unsigned long value;
+	int ret;
+
+	ret = kstrtoul(buf, 10, &value);
+	if (ret < 0)
+		return ret;
+	if (value > 1)
+		return -EINVAL;
+
+	if (value)
+		set_bit(flag, &autonuma_flags);
+	else
+		clear_bit(flag, &autonuma_flags);
+
+	return count;
+}
+
+static ssize_t enabled_show(struct kobject *kobj,
+			    struct kobj_attribute *attr, char *buf)
+{
+	return flag_show(kobj, attr, buf, AUTONUMA_FLAG);
+}
+static ssize_t enabled_store(struct kobject *kobj,
+			     struct kobj_attribute *attr,
+			     const char *buf, size_t count)
+{
+	ssize_t ret;
+
+	ret = flag_store(kobj, attr, buf, count, AUTONUMA_FLAG);
+
+	if (ret > 0 && autonuma_enabled())
+		wake_up_interruptible(&knuma_scand_wait);
+
+	return ret;
+}
+static struct kobj_attribute enabled_attr =
+	__ATTR(enabled, 0644, enabled_show, enabled_store);
+
+#define SYSFS_ENTRY(NAME, FLAG)						\
+static ssize_t NAME ## _show(struct kobject *kobj,			\
+			     struct kobj_attribute *attr, char *buf)	\
+{									\
+	return flag_show(kobj, attr, buf, FLAG);			\
+}									\
+									\
+static ssize_t NAME ## _store(struct kobject *kobj,			\
+			      struct kobj_attribute *attr,		\
+			      const char *buf, size_t count)		\
+{									\
+	return flag_store(kobj, attr, buf, count, FLAG);		\
+}									\
+static struct kobj_attribute NAME ## _attr =				\
+	__ATTR(NAME, 0644, NAME ## _show, NAME ## _store);
+
+SYSFS_ENTRY(debug, AUTONUMA_DEBUG_FLAG);
+SYSFS_ENTRY(pmd, AUTONUMA_SCAN_PMD_FLAG);
+SYSFS_ENTRY(working_set, AUTONUMA_SCAN_USE_WORKING_SET_FLAG);
+SYSFS_ENTRY(defer, AUTONUMA_MIGRATE_DEFER_FLAG);
+SYSFS_ENTRY(load_balance_strict, AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG);
+SYSFS_ENTRY(clone_reset, AUTONUMA_SCHED_CLONE_RESET_FLAG);
+SYSFS_ENTRY(fork_reset, AUTONUMA_SCHED_FORK_RESET_FLAG);
+
+#undef SYSFS_ENTRY
+
+enum {
+	SYSFS_KNUMA_SCAND_SLEEP_ENTRY,
+	SYSFS_KNUMA_SCAND_PAGES_ENTRY,
+	SYSFS_KNUMA_MIGRATED_SLEEP_ENTRY,
+	SYSFS_KNUMA_MIGRATED_PAGES_ENTRY,
+};
+
+#define SYSFS_ENTRY(NAME, SYSFS_TYPE)				\
+static ssize_t NAME ## _show(struct kobject *kobj,		\
+			     struct kobj_attribute *attr,	\
+			     char *buf)				\
+{								\
+	return sprintf(buf, "%u\n", NAME);			\
+}								\
+static ssize_t NAME ## _store(struct kobject *kobj,		\
+			      struct kobj_attribute *attr,	\
+			      const char *buf, size_t count)	\
+{								\
+	unsigned long val;					\
+	int err;						\
+								\
+	err = strict_strtoul(buf, 10, &val);			\
+	if (err || val > UINT_MAX)				\
+		return -EINVAL;					\
+	switch (SYSFS_TYPE) {					\
+	case SYSFS_KNUMA_SCAND_PAGES_ENTRY:			\
+	case SYSFS_KNUMA_MIGRATED_PAGES_ENTRY:			\
+		if (!val)					\
+			return -EINVAL;				\
+		break;						\
+	}							\
+								\
+	NAME = val;						\
+	switch (SYSFS_TYPE) {					\
+	case SYSFS_KNUMA_SCAND_SLEEP_ENTRY:			\
+		wake_up_interruptible(&knuma_scand_wait);	\
+		break;						\
+	case							\
+		SYSFS_KNUMA_MIGRATED_SLEEP_ENTRY:		\
+		wake_up_knuma_migrated();			\
+		break;						\
+	}							\
+								\
+	return count;						\
+}								\
+static struct kobj_attribute NAME ## _attr =			\
+	__ATTR(NAME, 0644, NAME ## _show, NAME ## _store);
+
+SYSFS_ENTRY(scan_sleep_millisecs, SYSFS_KNUMA_SCAND_SLEEP_ENTRY);
+SYSFS_ENTRY(scan_sleep_pass_millisecs, SYSFS_KNUMA_SCAND_SLEEP_ENTRY);
+SYSFS_ENTRY(pages_to_scan, SYSFS_KNUMA_SCAND_PAGES_ENTRY);
+
+SYSFS_ENTRY(migrate_sleep_millisecs, SYSFS_KNUMA_MIGRATED_SLEEP_ENTRY);
+SYSFS_ENTRY(pages_to_migrate, SYSFS_KNUMA_MIGRATED_PAGES_ENTRY);
+
+#undef SYSFS_ENTRY
+
+static struct attribute *autonuma_attr[] = {
+	&enabled_attr.attr,
+	&debug_attr.attr,
+	NULL,
+};
+static struct attribute_group autonuma_attr_group = {
+	.attrs = autonuma_attr,
+};
+
+#define SYSFS_ENTRY(NAME)					\
+static ssize_t NAME ## _show(struct kobject *kobj,		\
+			     struct kobj_attribute *attr,	\
+			     char *buf)				\
+{								\
+	return sprintf(buf, "%lu\n", NAME);			\
+}								\
+static struct kobj_attribute NAME ## _attr =			\
+	__ATTR_RO(NAME);
+
+SYSFS_ENTRY(full_scans);
+SYSFS_ENTRY(pages_scanned);
+SYSFS_ENTRY(pages_migrated);
+
+#undef SYSFS_ENTRY
+
+static struct attribute *knuma_scand_attr[] = {
+	&scan_sleep_millisecs_attr.attr,
+	&scan_sleep_pass_millisecs_attr.attr,
+	&pages_to_scan_attr.attr,
+	&pages_scanned_attr.attr,
+	&full_scans_attr.attr,
+	&pmd_attr.attr,
+	&working_set_attr.attr,
+	NULL,
+};
+static struct attribute_group knuma_scand_attr_group = {
+	.attrs = knuma_scand_attr,
+	.name = "knuma_scand",
+};
+
+static struct attribute *knuma_migrated_attr[] = {
+	&migrate_sleep_millisecs_attr.attr,
+	&pages_to_migrate_attr.attr,
+	&pages_migrated_attr.attr,
+	&defer_attr.attr,
+	NULL,
+};
+static struct attribute_group knuma_migrated_attr_group = {
+	.attrs = knuma_migrated_attr,
+	.name = "knuma_migrated",
+};
+
+static struct attribute *scheduler_attr[] = {
+	&clone_reset_attr.attr,
+	&fork_reset_attr.attr,
+	&load_balance_strict_attr.attr,
+	NULL,
+};
+static struct attribute_group scheduler_attr_group = {
+	.attrs = scheduler_attr,
+	.name = "scheduler",
+};
+
+static int __init autonuma_init_sysfs(struct kobject **autonuma_kobj)
+{
+	int err;
+
+	*autonuma_kobj = kobject_create_and_add("autonuma", mm_kobj);
+	if (unlikely(!*autonuma_kobj)) {
+		printk(KERN_ERR "autonuma: failed kobject create\n");
+		return -ENOMEM;
+	}
+
+	err = sysfs_create_group(*autonuma_kobj, &autonuma_attr_group);
+	if (err) {
+		printk(KERN_ERR "autonuma: failed register autonuma group\n");
+		goto delete_obj;
+	}
+
+	err = sysfs_create_group(*autonuma_kobj, &knuma_scand_attr_group);
+	if (err) {
+		printk(KERN_ERR
+		       "autonuma: failed register knuma_scand group\n");
+		goto remove_autonuma;
+	}
+
+	err = sysfs_create_group(*autonuma_kobj, &knuma_migrated_attr_group);
+	if (err) {
+		printk(KERN_ERR
+		       "autonuma: failed register knuma_migrated group\n");
+		goto remove_knuma_scand;
+	}
+
+	err = sysfs_create_group(*autonuma_kobj, &scheduler_attr_group);
+	if (err) {
+		printk(KERN_ERR
+		       "autonuma: failed register scheduler group\n");
+		goto remove_knuma_migrated;
+	}
+
+	return 0;
+
+remove_knuma_migrated:
+	sysfs_remove_group(*autonuma_kobj, &knuma_migrated_attr_group);
+remove_knuma_scand:
+	sysfs_remove_group(*autonuma_kobj, &knuma_scand_attr_group);
+remove_autonuma:
+	sysfs_remove_group(*autonuma_kobj, &autonuma_attr_group);
+delete_obj:
+	kobject_put(*autonuma_kobj);
+	return err;
+}
+
+static void __init autonuma_exit_sysfs(struct kobject *autonuma_kobj)
+{
+	sysfs_remove_group(autonuma_kobj, &knuma_migrated_attr_group);
+	sysfs_remove_group(autonuma_kobj, &knuma_scand_attr_group);
+	sysfs_remove_group(autonuma_kobj, &autonuma_attr_group);
+	kobject_put(autonuma_kobj);
+}
+#else
+static inline int autonuma_init_sysfs(struct kobject **autonuma_kobj)
+{
+	return 0;
+}
+
+static inline void autonuma_exit_sysfs(struct kobject *autonuma_kobj)
+{
+}
+#endif /* CONFIG_SYSFS */
+
+static int __init noautonuma_setup(char *str)
+{
+	if (!autonuma_impossible()) {
+		printk("AutoNUMA permanently disabled\n");
+		set_bit(AUTONUMA_IMPOSSIBLE, &autonuma_flags);
+		BUG_ON(!autonuma_impossible());
+	}
+	return 1;
+}
+__setup("noautonuma", noautonuma_setup);
+
+static int __init autonuma_init(void)
+{
+	int err;
+	struct kobject *autonuma_kobj;
+
+	VM_BUG_ON(num_possible_nodes() < 1);
+	if (autonuma_impossible())
+		return -EINVAL;
+
+	err = autonuma_init_sysfs(&autonuma_kobj);
+	if (err)
+		return err;
+
+	err = start_knuma_scand();
+	if (err) {
+		printk("failed to start knuma_scand\n");
+		goto out;
+	}
+	err = start_knuma_migrated();
+	if (err) {
+		printk("failed to start knuma_migrated\n");
+		goto out;
+	}
+
+	printk("AutoNUMA initialized successfully\n");
+	return err;
+
+out:
+	autonuma_exit_sysfs(autonuma_kobj);
+	return err;
+}
+module_init(autonuma_init)
+
+static struct kmem_cache *sched_autonuma_cachep;
+
+int alloc_sched_autonuma(struct task_struct *tsk, struct task_struct *orig,
+			 int node)
+{
+	int err = 1;
+	struct sched_autonuma *sched_autonuma;
+
+	if (autonuma_impossible())
+		goto no_numa;
+	sched_autonuma = kmem_cache_alloc_node(sched_autonuma_cachep,
+					       GFP_KERNEL, node);
+	if (!sched_autonuma)
+		goto out;
+	if (autonuma_sched_clone_reset())
+		sched_autonuma_reset(sched_autonuma);
+	else {
+		memcpy(sched_autonuma, orig->sched_autonuma,
+		       sched_autonuma_size());
+		BUG_ON(sched_autonuma->autonuma_stop_one_cpu);
+	}
+	tsk->sched_autonuma = sched_autonuma;
+no_numa:
+	err = 0;
+out:
+	return err;
+}
+
+void free_sched_autonuma(struct task_struct *tsk)
+{
+	if (autonuma_impossible()) {
+		BUG_ON(tsk->sched_autonuma);
+		return;
+	}
+
+	BUG_ON(!tsk->sched_autonuma);
+	kmem_cache_free(sched_autonuma_cachep, tsk->sched_autonuma);
+	tsk->sched_autonuma = NULL;
+}
+
+void __init sched_autonuma_init(void)
+{
+	struct sched_autonuma *sched_autonuma;
+
+	BUG_ON(current != &init_task);
+
+	if (autonuma_impossible())
+		return;
+
+	sched_autonuma_cachep =
+		kmem_cache_create("sched_autonuma",
+				  sched_autonuma_size(), 0,
+				  SLAB_PANIC | SLAB_HWCACHE_ALIGN, NULL);
+
+	sched_autonuma = kmem_cache_alloc_node(sched_autonuma_cachep,
+					       GFP_KERNEL, numa_node_id());
+	BUG_ON(!sched_autonuma);
+	sched_autonuma_reset(sched_autonuma);
+	BUG_ON(current->sched_autonuma);
+	current->sched_autonuma = sched_autonuma;
+}
+
+static struct kmem_cache *mm_autonuma_cachep;
+
+int alloc_mm_autonuma(struct mm_struct *mm)
+{
+	int err = 1;
+	struct mm_autonuma *mm_autonuma;
+
+	if (autonuma_impossible())
+		goto no_numa;
+	mm_autonuma = kmem_cache_alloc(mm_autonuma_cachep, GFP_KERNEL);
+	if (!mm_autonuma)
+		goto out;
+	if (autonuma_sched_fork_reset() || !mm->mm_autonuma)
+		mm_autonuma_reset(mm_autonuma);
+	else
+		memcpy(mm_autonuma, mm->mm_autonuma, mm_autonuma_size());
+	mm->mm_autonuma = mm_autonuma;
+	mm_autonuma->mm = mm;
+no_numa:
+	err = 0;
+out:
+	return err;
+}
+
+void free_mm_autonuma(struct mm_struct *mm)
+{
+	if (autonuma_impossible()) {
+		BUG_ON(mm->mm_autonuma);
+		return;
+	}
+
+	BUG_ON(!mm->mm_autonuma);
+	kmem_cache_free(mm_autonuma_cachep, mm->mm_autonuma);
+	mm->mm_autonuma = NULL;
+}
+
+void __init mm_autonuma_init(void)
+{
+	BUG_ON(current != &init_task);
+	BUG_ON(current->mm);
+
+	if (autonuma_impossible())
+		return;
+
+	mm_autonuma_cachep =
+		kmem_cache_create("mm_autonuma",
+				  mm_autonuma_size(), 0,
+				  SLAB_PANIC | SLAB_HWCACHE_ALIGN, NULL);
+}

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 27/39] autonuma: core
@ 2012-03-26 17:46   ` Andrea Arcangeli
  0 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-26 17:46 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

This implements knuma_scand, the NUMA hinting page faults started by
knuma_scand, the knuma_migrated daemons that migrate the memory queued
by the NUMA hinting faults, the statistics gathering code (run by
knuma_scand for the mm_autonuma statistics and by the NUMA hinting page
faults for the sched_autonuma statistics), and most of the rest of the
AutoNUMA core logic, such as the false sharing detection, the sysfs
interface and the initialization routines.

When knuma_scand is not running, AutoNUMA is a complete bypass: it
must not alter the runtime of the memory management and of the
scheduler at all.

The whole AutoNUMA logic is a chain reaction driven by the actions of
knuma_scand.

knuma_scand is the first gear: it collects the per-process mm_autonuma
statistics and, at the same time, it marks the ptes and pmds it scans
as pte_numa and pmd_numa.

The second gear is the NUMA hinting page faults. These are triggered
by the pte_numa ptes and pmd_numa pmds. They collect the per-thread
sched_autonuma statistics. They also implement the memory-follow-CPU
logic, which tracks whether pages are repeatedly accessed by remote
nodes. The memory-follow-CPU logic can decide to migrate pages across
NUMA nodes by queuing them for migration in the per-node
knuma_migrated queues.

The third gear is knuma_migrated. There is one knuma_migrated daemon
per node. Pages pending migration are queued in a matrix of lists:
each knuma_migrated, running in parallel with the others, walks its
lists and migrates the queued pages in round robin from every incoming
node to the node that knuma_migrated is running on.

The fourth gear is the NUMA scheduler balancing code. It evaluates
the statistical information collected in mm->mm_autonuma and
p->sched_autonuma, together with the status of all CPUs, to decide
whether tasks should be migrated to CPUs on remote nodes.

The code includes fixes from Hillf Danton <dhillf@gmail.com>.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/autonuma.c | 1437 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 1437 insertions(+), 0 deletions(-)
 create mode 100644 mm/autonuma.c

diff --git a/mm/autonuma.c b/mm/autonuma.c
new file mode 100644
index 0000000..7ca4992
--- /dev/null
+++ b/mm/autonuma.c
@@ -0,0 +1,1437 @@
+/*
+ *  Copyright (C) 2012  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ *
+ *  Boot with "numa=fake=2" to test on not NUMA systems.
+ */
+
+#include <linux/mm.h>
+#include <linux/rmap.h>
+#include <linux/kthread.h>
+#include <linux/mmu_notifier.h>
+#include <linux/freezer.h>
+#include <linux/mm_inline.h>
+#include <linux/migrate.h>
+#include <linux/swap.h>
+#include <linux/autonuma.h>
+#include <asm/tlbflush.h>
+#include <asm/pgtable.h>
+
+unsigned long autonuma_flags __read_mostly =
+	(1<<AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG)|
+	(1<<AUTONUMA_SCHED_CLONE_RESET_FLAG)|
+	(1<<AUTONUMA_SCHED_FORK_RESET_FLAG)|
+#ifdef CONFIG_AUTONUMA_DEFAULT_ENABLED
+	(1<<AUTONUMA_FLAG)|
+#endif
+	(1<<AUTONUMA_SCAN_PMD_FLAG);
+
+static DEFINE_MUTEX(knumad_mm_mutex);
+
+/* knuma_scand */
+static unsigned int scan_sleep_millisecs __read_mostly = 100;
+static unsigned int scan_sleep_pass_millisecs __read_mostly = 5000;
+static unsigned int pages_to_scan __read_mostly = 128*1024*1024/PAGE_SIZE;
+static DECLARE_WAIT_QUEUE_HEAD(knuma_scand_wait);
+static unsigned long full_scans;
+static unsigned long pages_scanned;
+
+/* knuma_migrated */
+static unsigned int migrate_sleep_millisecs __read_mostly = 100;
+static unsigned int pages_to_migrate __read_mostly = 128*1024*1024/PAGE_SIZE;
+static volatile unsigned long pages_migrated;
+
+static struct knumad_scan {
+	struct list_head mm_head;
+	struct mm_struct *mm;
+	unsigned long address;
+} knumad_scan = {
+	.mm_head = LIST_HEAD_INIT(knumad_scan.mm_head),
+};
+
+static inline bool autonuma_impossible(void)
+{
+	return num_possible_nodes() <= 1 ||
+		test_bit(AUTONUMA_IMPOSSIBLE, &autonuma_flags);
+}
+
+static inline void autonuma_migrate_lock(int nid)
+{
+	spin_lock(&NODE_DATA(nid)->autonuma_lock);
+}
+
+static inline void autonuma_migrate_unlock(int nid)
+{
+	spin_unlock(&NODE_DATA(nid)->autonuma_lock);
+}
+
+static inline void autonuma_migrate_lock_irq(int nid)
+{
+	spin_lock_irq(&NODE_DATA(nid)->autonuma_lock);
+}
+
+static inline void autonuma_migrate_unlock_irq(int nid)
+{
+	spin_unlock_irq(&NODE_DATA(nid)->autonuma_lock);
+}
+
+/* caller already holds the compound_lock */
+void autonuma_migrate_split_huge_page(struct page *page,
+				      struct page *page_tail)
+{
+	int nid, last_nid;
+
+	nid = page->autonuma_migrate_nid;
+	VM_BUG_ON(nid >= MAX_NUMNODES);
+	VM_BUG_ON(nid < -1);
+	VM_BUG_ON(page_tail->autonuma_migrate_nid != -1);
+	if (nid >= 0) {
+		VM_BUG_ON(page_to_nid(page) != page_to_nid(page_tail));
+		autonuma_migrate_lock(nid);
+		list_add_tail(&page_tail->autonuma_migrate_node,
+			      &page->autonuma_migrate_node);
+		autonuma_migrate_unlock(nid);
+
+		page_tail->autonuma_migrate_nid = nid;
+	}
+
+	last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+	if (last_nid >= 0)
+		page_tail->autonuma_last_nid = last_nid;
+}
+
+void __autonuma_migrate_page_remove(struct page *page)
+{
+	unsigned long flags;
+	int nid;
+
+	flags = compound_lock_irqsave(page);
+
+	nid = page->autonuma_migrate_nid;
+	VM_BUG_ON(nid >= MAX_NUMNODES);
+	VM_BUG_ON(nid < -1);
+	if (nid >= 0) {
+		int numpages = hpage_nr_pages(page);
+		autonuma_migrate_lock(nid);
+		list_del(&page->autonuma_migrate_node);
+		NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
+		autonuma_migrate_unlock(nid);
+
+		page->autonuma_migrate_nid = -1;
+	}
+
+	compound_unlock_irqrestore(page, flags);
+}
+
+static void __autonuma_migrate_page_add(struct page *page, int dst_nid,
+					int page_nid)
+{
+	unsigned long flags;
+	int nid;
+	int numpages;
+	unsigned long nr_migrate_pages;
+	wait_queue_head_t *wait_queue;
+
+	VM_BUG_ON(dst_nid >= MAX_NUMNODES);
+	VM_BUG_ON(dst_nid < -1);
+	VM_BUG_ON(page_nid >= MAX_NUMNODES);
+	VM_BUG_ON(page_nid < -1);
+
+	VM_BUG_ON(page_nid == dst_nid);
+	VM_BUG_ON(page_to_nid(page) != page_nid);
+
+	flags = compound_lock_irqsave(page);
+
+	numpages = hpage_nr_pages(page);
+	nid = page->autonuma_migrate_nid;
+	VM_BUG_ON(nid >= MAX_NUMNODES);
+	VM_BUG_ON(nid < -1);
+	if (nid >= 0) {
+		autonuma_migrate_lock(nid);
+		list_del(&page->autonuma_migrate_node);
+		NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
+		autonuma_migrate_unlock(nid);
+	}
+
+	autonuma_migrate_lock(dst_nid);
+	list_add(&page->autonuma_migrate_node,
+		 &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid]);
+	NODE_DATA(dst_nid)->autonuma_nr_migrate_pages += numpages;
+	nr_migrate_pages = NODE_DATA(dst_nid)->autonuma_nr_migrate_pages;
+
+	autonuma_migrate_unlock(dst_nid);
+
+	page->autonuma_migrate_nid = dst_nid;
+
+	compound_unlock_irqrestore(page, flags);
+
+	if (!autonuma_migrate_defer()) {
+		wait_queue = &NODE_DATA(dst_nid)->autonuma_knuma_migrated_wait;
+		if (nr_migrate_pages >= pages_to_migrate &&
+		    nr_migrate_pages - numpages < pages_to_migrate &&
+		    waitqueue_active(wait_queue))
+			wake_up_interruptible(wait_queue);
+	}
+}
+
+static void autonuma_migrate_page_add(struct page *page, int dst_nid,
+				      int page_nid)
+{
+	int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+	if (migrate_nid != dst_nid)
+		__autonuma_migrate_page_add(page, dst_nid, page_nid);
+}
+
+static bool balance_pgdat(struct pglist_data *pgdat,
+			  int nr_migrate_pages)
+{
+	/* FIXME: this only check the wmarks, make it move
+	 * "unused" memory or pagecache by queuing it to
+	 * pgdat->autonuma_migrate_head[pgdat->node_id].
+	 */
+	int z;
+	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+		struct zone *zone = pgdat->node_zones + z;
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (zone->all_unreclaimable)
+			continue;
+
+		/*
+		 * FIXME: deal with order with THP, maybe if
+		 * kswapd will learn using compaction, otherwise
+		 * order = 0 probably is ok.
+		 * FIXME: in theory we're ok if we can obtain
+		 * pages_to_migrate pages from all zones, it doesn't
+		 * need to be all in a single zone. We care about the
+		 * pgdat, the zone not.
+		 */
+
+		/*
+		 * Try not to wakeup kswapd by allocating
+		 * pages_to_migrate pages.
+		 */
+		if (!zone_watermark_ok(zone, 0,
+				       high_wmark_pages(zone) +
+				       nr_migrate_pages,
+				       0, 0))
+			continue;
+		return true;
+	}
+	return false;
+}
+
+static void cpu_follow_memory_pass(struct task_struct *p,
+				   struct sched_autonuma *sched_autonuma,
+				   unsigned long *numa_fault)
+{
+	int nid;
+	for_each_node(nid)
+		numa_fault[nid] >>= 1;
+	sched_autonuma->numa_fault_tot >>= 1;
+}
+
+static void numa_hinting_fault_cpu_follow_memory(struct task_struct *p,
+						 int access_nid,
+						 int numpages,
+						 bool pass)
+{
+	struct sched_autonuma *sched_autonuma = p->sched_autonuma;
+	unsigned long *numa_fault = sched_autonuma->numa_fault;
+	if (unlikely(pass))
+		cpu_follow_memory_pass(p, sched_autonuma, numa_fault);
+	numa_fault[access_nid] += numpages;
+	sched_autonuma->numa_fault_tot += numpages;
+}
+
+static inline bool last_nid_set(struct task_struct *p,
+				struct page *page, int cpu_nid)
+{
+	bool ret = true;
+	int autonuma_last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+	VM_BUG_ON(cpu_nid < 0);
+	VM_BUG_ON(cpu_nid >= MAX_NUMNODES);
+	if (autonuma_last_nid >= 0 && autonuma_last_nid != cpu_nid) {
+		int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+		if (migrate_nid >= 0 && migrate_nid != cpu_nid)
+			__autonuma_migrate_page_remove(page);
+		ret = false;
+	}
+	if (autonuma_last_nid != cpu_nid)
+		ACCESS_ONCE(page->autonuma_last_nid) = cpu_nid;
+	return ret;
+}
+
+static int __page_migrate_nid(struct page *page, int page_nid)
+{
+	int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+	if (migrate_nid < 0)
+		migrate_nid = page_nid;
+#if 0
+	return page_nid;
+#endif
+	return migrate_nid;
+}
+
+static int page_migrate_nid(struct page *page)
+{
+	return __page_migrate_nid(page, page_to_nid(page));
+}
+
+static int numa_hinting_fault_memory_follow_cpu(struct task_struct *p,
+						struct page *page,
+						int cpu_nid, int page_nid,
+						bool pass)
+{
+	if (!last_nid_set(p, page, cpu_nid))
+		return __page_migrate_nid(page, page_nid);
+	if (!PageLRU(page))
+		return page_nid;
+	if (cpu_nid != page_nid)
+		autonuma_migrate_page_add(page, cpu_nid, page_nid);
+	else
+		autonuma_migrate_page_remove(page);
+	return cpu_nid;
+}
+
+void numa_hinting_fault(struct page *page, int numpages)
+{
+	WARN_ON_ONCE(!current->mm);
+	if (likely(current->mm && !current->mempolicy && autonuma_enabled())) {
+		struct task_struct *p = current;
+		int cpu_nid, page_nid, access_nid;
+		bool pass;
+
+		pass = p->sched_autonuma->numa_fault_pass !=
+			p->mm->mm_autonuma->numa_fault_pass;
+		page_nid = page_to_nid(page);
+		cpu_nid = numa_node_id();
+		VM_BUG_ON(cpu_nid < 0);
+		VM_BUG_ON(cpu_nid >= MAX_NUMNODES);
+		access_nid = numa_hinting_fault_memory_follow_cpu(p, page,
+								  cpu_nid,
+								  page_nid,
+								  pass);
+		numa_hinting_fault_cpu_follow_memory(p, access_nid,
+						     numpages, pass);
+		if (unlikely(pass))
+			p->sched_autonuma->numa_fault_pass =
+				p->mm->mm_autonuma->numa_fault_pass;
+	}
+}
+
+pte_t __pte_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
+		       unsigned long addr, pte_t pte, pte_t *ptep)
+{
+	struct page *page;
+	pte = pte_mknotnuma(pte);
+	set_pte_at(mm, addr, ptep, pte);
+	page = vm_normal_page(vma, addr, pte);
+	BUG_ON(!page);
+	numa_hinting_fault(page, 1);
+	return pte;
+}
+
+void __pmd_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
+		      unsigned long addr, pmd_t *pmdp)
+{
+	pmd_t pmd;
+	pte_t *pte;
+	unsigned long _addr = addr & PMD_MASK;
+	spinlock_t *ptl;
+	bool numa = false;
+
+	spin_lock(&mm->page_table_lock);
+	pmd = *pmdp;
+	if (pmd_numa(pmd)) {
+		set_pmd_at(mm, _addr, pmdp, pmd_mknotnuma(pmd));
+		numa = true;
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	if (!numa)
+		return;
+
+	pte = pte_offset_map_lock(mm, pmdp, _addr, &ptl);
+	for (addr = _addr; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
+		pte_t pteval = *pte;
+		struct page * page;
+		if (!pte_present(pteval))
+			continue;
+		if (pte_numa(pteval)) {
+			pteval = pte_mknotnuma(pteval);
+			set_pte_at(mm, addr, pte, pteval);
+		}
+		page = vm_normal_page(vma, addr, pteval);
+		if (unlikely(!page))
+			continue;
+		/* only check non-shared pages */
+		if (page_mapcount(page) != 1)
+			continue;
+		numa_hinting_fault(page, 1);
+	}
+	pte_unmap_unlock(pte, ptl);
+}
+
+static inline int sched_autonuma_size(void)
+{
+	return sizeof(struct sched_autonuma) +
+		num_possible_nodes() * sizeof(unsigned long);
+}
+
+static inline int sched_autonuma_reset_size(void)
+{
+	struct sched_autonuma *sched_autonuma = NULL;
+	return sched_autonuma_size() -
+		(int)((char *)(&sched_autonuma->autonuma_stop_one_cpu) -
+		      (char *)sched_autonuma);
+}
+
+static void sched_autonuma_reset(struct sched_autonuma *sched_autonuma)
+{
+	sched_autonuma->autonuma_node = -1;
+	memset(&sched_autonuma->autonuma_stop_one_cpu, 0,
+	       sched_autonuma_reset_size());
+}
+
+static inline int mm_autonuma_fault_size(void)
+{
+	return num_possible_nodes() * sizeof(unsigned long);
+}
+
+static inline unsigned long *mm_autonuma_numa_fault_tmp(struct mm_struct *mm)
+{
+	return mm->mm_autonuma->numa_fault + num_possible_nodes();
+}
+
+static inline int mm_autonuma_size(void)
+{
+	return sizeof(struct mm_autonuma) + mm_autonuma_fault_size() * 2;
+}
+
+static inline int mm_autonuma_reset_size(void)
+{
+	struct mm_autonuma *mm_autonuma = NULL;
+	return mm_autonuma_size() -
+		(int)((char *)(&mm_autonuma->numa_fault_tot) -
+		      (char *)mm_autonuma);
+}
+
+static void mm_autonuma_reset(struct mm_autonuma *mm_autonuma)
+{
+	memset(&mm_autonuma->numa_fault_tot, 0, mm_autonuma_reset_size());
+}
+
+void autonuma_setup_new_exec(struct task_struct *p)
+{
+	if (p->sched_autonuma)
+		sched_autonuma_reset(p->sched_autonuma);
+	if (p->mm && p->mm->mm_autonuma)
+		mm_autonuma_reset(p->mm->mm_autonuma);
+}
+
+static inline int knumad_test_exit(struct mm_struct *mm)
+{
+	return atomic_read(&mm->mm_users) == 0;
+}
+
+static int knumad_scan_pmd(struct mm_struct *mm,
+			   struct vm_area_struct *vma,
+			   unsigned long address)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte, *_pte;
+	struct page *page;
+	unsigned long _address, end;
+	spinlock_t *ptl;
+	int ret = 0;
+
+	VM_BUG_ON(address & ~PAGE_MASK);
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		goto out;
+
+	pmd = pmd_offset(pud, address);
+	if (pmd_none(*pmd))
+		goto out;
+	if (pmd_trans_huge(*pmd)) {
+		spin_lock(&mm->page_table_lock);
+		if (pmd_trans_huge(*pmd)) {
+			VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+			if (unlikely(pmd_trans_splitting(*pmd))) {
+				spin_unlock(&mm->page_table_lock);
+				wait_split_huge_page(vma->anon_vma, pmd);
+			} else {
+				int page_nid;
+				unsigned long *numa_fault_tmp;
+				ret = HPAGE_PMD_NR;
+
+				if (autonuma_scan_use_working_set() &&
+				    pmd_numa(*pmd)) {
+					spin_unlock(&mm->page_table_lock);
+					goto out;
+				}
+
+				page = pmd_page(*pmd);
+
+				/* only check non-shared pages */
+				if (page_mapcount(page) != 1) {
+					spin_unlock(&mm->page_table_lock);
+					goto out;
+				}
+
+				page_nid = page_migrate_nid(page);
+				numa_fault_tmp = mm_autonuma_numa_fault_tmp(mm);
+				numa_fault_tmp[page_nid] += ret;
+
+				if (pmd_numa(*pmd)) {
+					spin_unlock(&mm->page_table_lock);
+					goto out;
+				}
+
+				set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
+				/* defer TLB flush to lower the overhead */
+				spin_unlock(&mm->page_table_lock);
+				goto out;
+			}
+		} else
+			spin_unlock(&mm->page_table_lock);
+	}
+
+	VM_BUG_ON(!pmd_present(*pmd));
+
+	end = min(vma->vm_end, (address + PMD_SIZE) & PMD_MASK);
+	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+	for (_address = address, _pte = pte; _address < end;
+	     _pte++, _address += PAGE_SIZE) {
+		unsigned long *numa_fault_tmp;
+		pte_t pteval = *_pte;
+		if (!pte_present(pteval))
+			continue;
+		if (autonuma_scan_use_working_set() &&
+		    pte_numa(pteval))
+			continue;
+		page = vm_normal_page(vma, _address, pteval);
+		if (unlikely(!page))
+			continue;
+		/* only check non-shared pages */
+		if (page_mapcount(page) != 1)
+			continue;
+
+		numa_fault_tmp = mm_autonuma_numa_fault_tmp(mm);
+		numa_fault_tmp[page_migrate_nid(page)]++;
+
+		if (pte_numa(pteval))
+			continue;
+
+		if (!autonuma_scan_pmd())
+			set_pte_at(mm, _address, _pte, pte_mknuma(pteval));
+
+		/* defer TLB flush to lower the overhead */
+		ret++;
+	}
+	pte_unmap_unlock(pte, ptl);
+
+	if (ret && !pmd_numa(*pmd) && autonuma_scan_pmd()) {
+		spin_lock(&mm->page_table_lock);
+		set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
+		spin_unlock(&mm->page_table_lock);
+		/* defer TLB flush to lower the overhead */
+	}
+
+out:
+	return ret;
+}
+
+static void mm_numa_fault_flush(struct mm_struct *mm)
+{
+	int nid;
+	struct mm_autonuma *mma = mm->mm_autonuma;
+	unsigned long *numa_fault_tmp = mm_autonuma_numa_fault_tmp(mm);
+	unsigned long tot = 0;
+	/* FIXME: protect this with seqlock against autonuma_balance() */
+	for_each_node(nid) {
+		mma->numa_fault[nid] = numa_fault_tmp[nid];
+		tot += mma->numa_fault[nid];
+		numa_fault_tmp[nid] = 0;
+	}
+	mma->numa_fault_tot = tot;
+}
+
+static int knumad_do_scan(void)
+{
+	struct mm_struct *mm;
+	struct mm_autonuma *mm_autonuma;
+	unsigned long address;
+	struct vm_area_struct *vma;
+	int progress = 0;
+
+	mm = knumad_scan.mm;
+	if (!mm) {
+		if (unlikely(list_empty(&knumad_scan.mm_head)))
+			return pages_to_scan;
+		mm_autonuma = list_entry(knumad_scan.mm_head.next,
+					 struct mm_autonuma, mm_node);
+		mm = mm_autonuma->mm;
+		knumad_scan.address = 0;
+		knumad_scan.mm = mm;
+		atomic_inc(&mm->mm_count);
+		mm_autonuma->numa_fault_pass++;
+	}
+	address = knumad_scan.address;
+
+	mutex_unlock(&knumad_mm_mutex);
+
+	down_read(&mm->mmap_sem);
+	if (unlikely(knumad_test_exit(mm)))
+		vma = NULL;
+	else
+		vma = find_vma(mm, address);
+
+	progress++;
+	for (; vma && progress < pages_to_scan; vma = vma->vm_next) {
+		unsigned long start_addr, end_addr;
+		cond_resched();
+		if (unlikely(knumad_test_exit(mm))) {
+			progress++;
+			break;
+		}
+
+		if (!vma->anon_vma || vma_policy(vma)) {
+			progress++;
+			continue;
+		}
+		if (is_vma_temporary_stack(vma)) {
+			progress++;
+			continue;
+		}
+
+		VM_BUG_ON(address & ~PAGE_MASK);
+		if (address < vma->vm_start)
+			address = vma->vm_start;
+
+		start_addr = address;
+		while (address < vma->vm_end) {
+			cond_resched();
+			if (unlikely(knumad_test_exit(mm)))
+				break;
+
+			VM_BUG_ON(address < vma->vm_start ||
+				  address + PAGE_SIZE > vma->vm_end);
+			progress += knumad_scan_pmd(mm, vma, address);
+			/* move to next address */
+			address = (address + PMD_SIZE) & PMD_MASK;
+			if (progress >= pages_to_scan)
+				break;
+		}
+		end_addr = min(address, vma->vm_end);
+
+		/*
+		 * Flush the TLB for the mm to start the numa
+		 * hinting minor page faults after we finish
+		 * scanning this vma part.
+		 */
+		mmu_notifier_invalidate_range_start(vma->vm_mm, start_addr,
+						    end_addr);
+		flush_tlb_range(vma, start_addr, end_addr);
+		mmu_notifier_invalidate_range_end(vma->vm_mm, start_addr,
+						  end_addr);
+	}
+	up_read(&mm->mmap_sem); /* exit_mmap will destroy ptes after this */
+
+	mutex_lock(&knumad_mm_mutex);
+	VM_BUG_ON(knumad_scan.mm != mm);
+	knumad_scan.address = address;
+	/*
+	 * Change the current mm if this mm is about to die, or if we
+	 * scanned all vmas of this mm.
+	 */
+	if (knumad_test_exit(mm) || !vma) {
+		mm_autonuma = mm->mm_autonuma;
+		if (mm_autonuma->mm_node.next != &knumad_scan.mm_head) {
+			mm_autonuma = list_entry(mm_autonuma->mm_node.next,
+						 struct mm_autonuma, mm_node);
+			knumad_scan.mm = mm_autonuma->mm;
+			atomic_inc(&knumad_scan.mm->mm_count);
+			knumad_scan.address = 0;
+			knumad_scan.mm->mm_autonuma->numa_fault_pass++;
+		} else
+			knumad_scan.mm = NULL;
+
+		if (knumad_test_exit(mm))
+			list_del(&mm->mm_autonuma->mm_node);
+		else
+			mm_numa_fault_flush(mm);
+
+		mmdrop(mm);
+	}
+
+	return progress;
+}
+
+static void wake_up_knuma_migrated(void)
+{
+	int nid;
+
+	lru_add_drain();
+	for_each_online_node(nid) {
+		struct pglist_data *pgdat = NODE_DATA(nid);
+		if (pgdat->autonuma_nr_migrate_pages &&
+		    waitqueue_active(&pgdat->autonuma_knuma_migrated_wait))
+			wake_up_interruptible(&pgdat->
+					      autonuma_knuma_migrated_wait);
+	}
+}
+
+static void knuma_scand_disabled(void)
+{
+	if (!autonuma_enabled())
+		wait_event_freezable(knuma_scand_wait,
+				     autonuma_enabled() ||
+				     kthread_should_stop());
+}
+
+static int knuma_scand(void *none)
+{
+	struct mm_struct *mm = NULL;
+	int progress = 0, _progress;
+	unsigned long total_progress = 0;
+
+	set_freezable();
+
+	knuma_scand_disabled();
+
+	mutex_lock(&knumad_mm_mutex);
+
+	for (;;) {
+		if (unlikely(kthread_should_stop()))
+			break;
+		_progress = knumad_do_scan();
+		progress += _progress;
+		total_progress += _progress;
+		mutex_unlock(&knumad_mm_mutex);
+
+		if (unlikely(!knumad_scan.mm)) {
+			autonuma_printk("knuma_scand %lu\n", total_progress);
+			pages_scanned += total_progress;
+			total_progress = 0;
+			full_scans++;
+
+			wait_event_freezable_timeout(knuma_scand_wait,
+						     kthread_should_stop(),
+						     msecs_to_jiffies(
+						     scan_sleep_pass_millisecs));
+			/* flush the last pending pages < pages_to_migrate */
+			wake_up_knuma_migrated();
+			wait_event_freezable_timeout(knuma_scand_wait,
+						     kthread_should_stop(),
+						     msecs_to_jiffies(
+						     scan_sleep_pass_millisecs));
+
+			if (autonuma_debug()) {
+				extern void sched_autonuma_dump_mm(void);
+				sched_autonuma_dump_mm();
+			}
+
+			/* wait while there is no pinned mm */
+			knuma_scand_disabled();
+		}
+		if (progress > pages_to_scan) {
+			progress = 0;
+			wait_event_freezable_timeout(knuma_scand_wait,
+						     kthread_should_stop(),
+						     msecs_to_jiffies(
+						     scan_sleep_millisecs));
+		}
+		cond_resched();
+		mutex_lock(&knumad_mm_mutex);
+	}
+
+	mm = knumad_scan.mm;
+	knumad_scan.mm = NULL;
+	if (mm)
+		list_del(&mm->mm_autonuma->mm_node);
+	mutex_unlock(&knumad_mm_mutex);
+
+	if (mm)
+		mmdrop(mm);
+
+	return 0;
+}
+
+static int isolate_migratepages(struct list_head *migratepages,
+				struct pglist_data *pgdat)
+{
+	int nr = 0, nid;
+	struct list_head *heads = pgdat->autonuma_migrate_head;
+
+	/* FIXME: THP balancing, restart from last nid */
+	for_each_online_node(nid) {
+		struct zone *zone;
+		struct page *page;
+		cond_resched();
+		VM_BUG_ON(numa_node_id() != pgdat->node_id);
+		if (nid == pgdat->node_id) {
+			VM_BUG_ON(!list_empty(&heads[nid]));
+			continue;
+		}
+		if (list_empty(&heads[nid]))
+			continue;
+		/* some page wants to go to this pgdat */
+		/*
+		 * Take the lock with irqs disabled to avoid a lock
+		 * inversion with the lru_lock which is taken before
+		 * the autonuma_migrate_lock in split_huge_page, and
+		 * that could be taken by interrupts after we obtained
+		 * the autonuma_migrate_lock here, if we didn't disable
+		 * irqs.
+		 */
+		autonuma_migrate_lock_irq(pgdat->node_id);
+		if (list_empty(&heads[nid])) {
+			autonuma_migrate_unlock_irq(pgdat->node_id);
+			continue;
+		}
+		page = list_entry(heads[nid].prev,
+				  struct page,
+				  autonuma_migrate_node);
+		if (unlikely(!get_page_unless_zero(page))) {
+			/*
+			 * Is getting freed and will remove self from the
+			 * autonuma list shortly, skip it for now.
+			 */
+			list_del(&page->autonuma_migrate_node);
+			list_add(&page->autonuma_migrate_node,
+				 &heads[nid]);
+			autonuma_migrate_unlock_irq(pgdat->node_id);
+			autonuma_printk("autonuma migrate page is free\n");
+			continue;
+		}
+		if (!PageLRU(page)) {
+			autonuma_migrate_unlock_irq(pgdat->node_id);
+			autonuma_printk("autonuma migrate page not in LRU\n");
+			__autonuma_migrate_page_remove(page);
+			put_page(page);
+			continue;
+		}
+		autonuma_migrate_unlock_irq(pgdat->node_id);
+
+		VM_BUG_ON(nid != page_to_nid(page));
+
+		if (PageTransHuge(page))
+			/* FIXME: remove split_huge_page */
+			split_huge_page(page);
+
+		__autonuma_migrate_page_remove(page);
+		put_page(page);
+
+		zone = page_zone(page);
+		spin_lock_irq(&zone->lru_lock);
+		if (!__isolate_lru_page(page, ISOLATE_ACTIVE|ISOLATE_INACTIVE,
+					0)) {
+			/* FIXME NR_ISOLATED */
+			VM_BUG_ON(PageTransCompound(page));
+			del_page_from_lru_list(zone, page, page_lru(page));
+			inc_zone_state(zone, page_is_file_cache(page) ?
+				       NR_ISOLATED_FILE : NR_ISOLATED_ANON);
+			spin_unlock_irq(&zone->lru_lock);
+
+			list_add(&page->lru, migratepages);
+			nr += hpage_nr_pages(page);
+		} else {
+			/* FIXME: losing page, safest and simplest for now */
+			spin_unlock_irq(&zone->lru_lock);
+			autonuma_printk("autonuma migrate page lost\n");
+		}
+
+	}
+
+	return nr;
+}
+
+static struct page *alloc_migrate_dst_page(struct page *page,
+					   unsigned long data,
+					   int **result)
+{
+	int nid = (int) data;
+	struct page *newpage;
+	newpage = alloc_pages_exact_node(nid,
+					 GFP_HIGHUSER_MOVABLE | GFP_THISNODE,
+					 0);
+	if (newpage)
+		newpage->autonuma_last_nid = page->autonuma_last_nid;
+	return newpage;
+}
+
+static void knumad_do_migrate(struct pglist_data *pgdat)
+{
+	int nr_migrate_pages = 0;
+	LIST_HEAD(migratepages);
+
+	autonuma_printk("nr_migrate_pages %lu to node %d\n",
+			pgdat->autonuma_nr_migrate_pages, pgdat->node_id);
+	do {
+		int isolated = 0;
+		if (balance_pgdat(pgdat, nr_migrate_pages))
+			isolated = isolate_migratepages(&migratepages, pgdat);
+		/* FIXME: might need to check too many isolated */
+		if (!isolated)
+			break;
+		nr_migrate_pages += isolated;
+	} while (nr_migrate_pages < pages_to_migrate);
+
+	if (nr_migrate_pages) {
+		int err;
+		autonuma_printk("migrate %d to node %d\n", nr_migrate_pages,
+				pgdat->node_id);
+		pages_migrated += nr_migrate_pages; /* FIXME: per node */
+		err = migrate_pages(&migratepages, alloc_migrate_dst_page,
+				    pgdat->node_id, false, true);
+		if (err)
+			/* FIXME: requeue failed pages */
+			putback_lru_pages(&migratepages);
+	}
+}
+
+static int knuma_migrated(void *arg)
+{
+	struct pglist_data *pgdat = (struct pglist_data *)arg;
+	int nid = pgdat->node_id;
+	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(nowakeup);
+
+	set_freezable();
+
+	for (;;) {
+		if (unlikely(kthread_should_stop()))
+			break;
+		/* FIXME: scan the free levels of this node; we may not
+		 * be allowed to receive memory if the wmark of this
+		 * pgdat is below high.  In the future also add
+		 * not-interesting pages like not-accessed pages to
+		 * pgdat->autonuma_migrate_head[pgdat->node_id], so we
+		 * can move our memory away to other nodes in order
+		 * to satisfy the high-wmark described above (so migration
+		 * can continue).
+		 */
+		knumad_do_migrate(pgdat);
+		if (!pgdat->autonuma_nr_migrate_pages) {
+			wait_event_freezable(
+				pgdat->autonuma_knuma_migrated_wait,
+				pgdat->autonuma_nr_migrate_pages ||
+				kthread_should_stop());
+			autonuma_printk("wake knuma_migrated %d\n", nid);
+		} else
+			wait_event_freezable_timeout(nowakeup,
+						     kthread_should_stop(),
+						     msecs_to_jiffies(
+						     migrate_sleep_millisecs));
+	}
+
+	return 0;
+}
+
+void autonuma_enter(struct mm_struct *mm)
+{
+	if (autonuma_impossible())
+		return;
+
+	mutex_lock(&knumad_mm_mutex);
+	list_add_tail(&mm->mm_autonuma->mm_node, &knumad_scan.mm_head);
+	mutex_unlock(&knumad_mm_mutex);
+}
+
+void autonuma_exit(struct mm_struct *mm)
+{
+	bool serialize;
+
+	if (autonuma_impossible())
+		return;
+
+	serialize = false;
+	mutex_lock(&knumad_mm_mutex);
+	if (knumad_scan.mm == mm)
+		serialize = true;
+	else
+		list_del(&mm->mm_autonuma->mm_node);
+	mutex_unlock(&knumad_mm_mutex);
+
+	if (serialize) {
+		down_write(&mm->mmap_sem);
+		up_write(&mm->mmap_sem);
+	}
+}
+
+static int start_knuma_scand(void)
+{
+	int err = 0;
+	struct task_struct *knumad_thread;
+
+	knumad_thread = kthread_run(knuma_scand, NULL, "knuma_scand");
+	if (unlikely(IS_ERR(knumad_thread))) {
+		autonuma_printk(KERN_ERR
+				"knumad: kthread_run(knuma_scand) failed\n");
+		err = PTR_ERR(knumad_thread);
+	}
+	return err;
+}
+
+static int start_knuma_migrated(void)
+{
+	int err = 0;
+	struct task_struct *knumad_thread;
+	int nid;
+
+	for_each_online_node(nid) {
+		knumad_thread = kthread_create_on_node(knuma_migrated,
+						       NODE_DATA(nid),
+						       nid,
+						       "knuma_migrated%d",
+						       nid);
+		if (unlikely(IS_ERR(knumad_thread))) {
+			autonuma_printk(KERN_ERR
+					"knumad: "
+					"kthread_run(knuma_migrated%d) "
+					"failed\n", nid);
+			err = PTR_ERR(knumad_thread);
+		} else {
+			autonuma_printk("cpumask %d %lx\n", nid,
+					cpumask_of_node(nid)->bits[0]);
+			kthread_bind_node(knumad_thread, nid);
+			wake_up_process(knumad_thread);
+		}
+	}
+	return err;
+}
+
+
+#ifdef CONFIG_SYSFS
+
+static ssize_t flag_show(struct kobject *kobj,
+			 struct kobj_attribute *attr, char *buf,
+			 enum autonuma_flag flag)
+{
+	return sprintf(buf, "%d\n",
+		       !!test_bit(flag, &autonuma_flags));
+}
+static ssize_t flag_store(struct kobject *kobj,
+			  struct kobj_attribute *attr,
+			  const char *buf, size_t count,
+			  enum autonuma_flag flag)
+{
+	unsigned long value;
+	int ret;
+
+	ret = kstrtoul(buf, 10, &value);
+	if (ret < 0)
+		return ret;
+	if (value > 1)
+		return -EINVAL;
+
+	if (value)
+		set_bit(flag, &autonuma_flags);
+	else
+		clear_bit(flag, &autonuma_flags);
+
+	return count;
+}
+
+static ssize_t enabled_show(struct kobject *kobj,
+			    struct kobj_attribute *attr, char *buf)
+{
+	return flag_show(kobj, attr, buf, AUTONUMA_FLAG);
+}
+static ssize_t enabled_store(struct kobject *kobj,
+			     struct kobj_attribute *attr,
+			     const char *buf, size_t count)
+{
+	ssize_t ret;
+
+	ret = flag_store(kobj, attr, buf, count, AUTONUMA_FLAG);
+
+	if (ret > 0 && autonuma_enabled())
+		wake_up_interruptible(&knuma_scand_wait);
+
+	return ret;
+}
+static struct kobj_attribute enabled_attr =
+	__ATTR(enabled, 0644, enabled_show, enabled_store);
+
+#define SYSFS_ENTRY(NAME, FLAG)						\
+static ssize_t NAME ## _show(struct kobject *kobj,			\
+			     struct kobj_attribute *attr, char *buf)	\
+{									\
+	return flag_show(kobj, attr, buf, FLAG);			\
+}									\
+									\
+static ssize_t NAME ## _store(struct kobject *kobj,			\
+			      struct kobj_attribute *attr,		\
+			      const char *buf, size_t count)		\
+{									\
+	return flag_store(kobj, attr, buf, count, FLAG);		\
+}									\
+static struct kobj_attribute NAME ## _attr =				\
+	__ATTR(NAME, 0644, NAME ## _show, NAME ## _store);
+
+SYSFS_ENTRY(debug, AUTONUMA_DEBUG_FLAG);
+SYSFS_ENTRY(pmd, AUTONUMA_SCAN_PMD_FLAG);
+SYSFS_ENTRY(working_set, AUTONUMA_SCAN_USE_WORKING_SET_FLAG);
+SYSFS_ENTRY(defer, AUTONUMA_MIGRATE_DEFER_FLAG);
+SYSFS_ENTRY(load_balance_strict, AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG);
+SYSFS_ENTRY(clone_reset, AUTONUMA_SCHED_CLONE_RESET_FLAG);
+SYSFS_ENTRY(fork_reset, AUTONUMA_SCHED_FORK_RESET_FLAG);
+
+#undef SYSFS_ENTRY
+
+enum {
+	SYSFS_KNUMA_SCAND_SLEEP_ENTRY,
+	SYSFS_KNUMA_SCAND_PAGES_ENTRY,
+	SYSFS_KNUMA_MIGRATED_SLEEP_ENTRY,
+	SYSFS_KNUMA_MIGRATED_PAGES_ENTRY,
+};
+
+#define SYSFS_ENTRY(NAME, SYSFS_TYPE)				\
+static ssize_t NAME ## _show(struct kobject *kobj,		\
+			     struct kobj_attribute *attr,	\
+			     char *buf)				\
+{								\
+	return sprintf(buf, "%u\n", NAME);			\
+}								\
+static ssize_t NAME ## _store(struct kobject *kobj,		\
+			      struct kobj_attribute *attr,	\
+			      const char *buf, size_t count)	\
+{								\
+	unsigned long val;					\
+	int err;						\
+								\
+	err = strict_strtoul(buf, 10, &val);			\
+	if (err || val > UINT_MAX)				\
+		return -EINVAL;					\
+	switch (SYSFS_TYPE) {					\
+	case SYSFS_KNUMA_SCAND_PAGES_ENTRY:			\
+	case SYSFS_KNUMA_MIGRATED_PAGES_ENTRY:			\
+		if (!val)					\
+			return -EINVAL;				\
+		break;						\
+	}							\
+								\
+	NAME = val;						\
+	switch (SYSFS_TYPE) {					\
+	case SYSFS_KNUMA_SCAND_SLEEP_ENTRY:			\
+		wake_up_interruptible(&knuma_scand_wait);	\
+		break;						\
+	case							\
+		SYSFS_KNUMA_MIGRATED_SLEEP_ENTRY:		\
+		wake_up_knuma_migrated();			\
+		break;						\
+	}							\
+								\
+	return count;						\
+}								\
+static struct kobj_attribute NAME ## _attr =			\
+	__ATTR(NAME, 0644, NAME ## _show, NAME ## _store);
+
+SYSFS_ENTRY(scan_sleep_millisecs, SYSFS_KNUMA_SCAND_SLEEP_ENTRY);
+SYSFS_ENTRY(scan_sleep_pass_millisecs, SYSFS_KNUMA_SCAND_SLEEP_ENTRY);
+SYSFS_ENTRY(pages_to_scan, SYSFS_KNUMA_SCAND_PAGES_ENTRY);
+
+SYSFS_ENTRY(migrate_sleep_millisecs, SYSFS_KNUMA_MIGRATED_SLEEP_ENTRY);
+SYSFS_ENTRY(pages_to_migrate, SYSFS_KNUMA_MIGRATED_PAGES_ENTRY);
+
+#undef SYSFS_ENTRY
+
+static struct attribute *autonuma_attr[] = {
+	&enabled_attr.attr,
+	&debug_attr.attr,
+	NULL,
+};
+static struct attribute_group autonuma_attr_group = {
+	.attrs = autonuma_attr,
+};
+
+#define SYSFS_ENTRY(NAME)					\
+static ssize_t NAME ## _show(struct kobject *kobj,		\
+			     struct kobj_attribute *attr,	\
+			     char *buf)				\
+{								\
+	return sprintf(buf, "%lu\n", NAME);			\
+}								\
+static struct kobj_attribute NAME ## _attr =			\
+	__ATTR_RO(NAME);
+
+SYSFS_ENTRY(full_scans);
+SYSFS_ENTRY(pages_scanned);
+SYSFS_ENTRY(pages_migrated);
+
+#undef SYSFS_ENTRY
+
+static struct attribute *knuma_scand_attr[] = {
+	&scan_sleep_millisecs_attr.attr,
+	&scan_sleep_pass_millisecs_attr.attr,
+	&pages_to_scan_attr.attr,
+	&pages_scanned_attr.attr,
+	&full_scans_attr.attr,
+	&pmd_attr.attr,
+	&working_set_attr.attr,
+	NULL,
+};
+static struct attribute_group knuma_scand_attr_group = {
+	.attrs = knuma_scand_attr,
+	.name = "knuma_scand",
+};
+
+static struct attribute *knuma_migrated_attr[] = {
+	&migrate_sleep_millisecs_attr.attr,
+	&pages_to_migrate_attr.attr,
+	&pages_migrated_attr.attr,
+	&defer_attr.attr,
+	NULL,
+};
+static struct attribute_group knuma_migrated_attr_group = {
+	.attrs = knuma_migrated_attr,
+	.name = "knuma_migrated",
+};
+
+static struct attribute *scheduler_attr[] = {
+	&clone_reset_attr.attr,
+	&fork_reset_attr.attr,
+	&load_balance_strict_attr.attr,
+	NULL,
+};
+static struct attribute_group scheduler_attr_group = {
+	.attrs = scheduler_attr,
+	.name = "scheduler",
+};
+
+static int __init autonuma_init_sysfs(struct kobject **autonuma_kobj)
+{
+	int err;
+
+	*autonuma_kobj = kobject_create_and_add("autonuma", mm_kobj);
+	if (unlikely(!*autonuma_kobj)) {
+		printk(KERN_ERR "autonuma: failed kobject create\n");
+		return -ENOMEM;
+	}
+
+	err = sysfs_create_group(*autonuma_kobj, &autonuma_attr_group);
+	if (err) {
+		printk(KERN_ERR "autonuma: failed register autonuma group\n");
+		goto delete_obj;
+	}
+
+	err = sysfs_create_group(*autonuma_kobj, &knuma_scand_attr_group);
+	if (err) {
+		printk(KERN_ERR
+		       "autonuma: failed register knuma_scand group\n");
+		goto remove_autonuma;
+	}
+
+	err = sysfs_create_group(*autonuma_kobj, &knuma_migrated_attr_group);
+	if (err) {
+		printk(KERN_ERR
+		       "autonuma: failed register knuma_migrated group\n");
+		goto remove_knuma_scand;
+	}
+
+	err = sysfs_create_group(*autonuma_kobj, &scheduler_attr_group);
+	if (err) {
+		printk(KERN_ERR
+		       "autonuma: failed register scheduler group\n");
+		goto remove_knuma_migrated;
+	}
+
+	return 0;
+
+remove_knuma_migrated:
+	sysfs_remove_group(*autonuma_kobj, &knuma_migrated_attr_group);
+remove_knuma_scand:
+	sysfs_remove_group(*autonuma_kobj, &knuma_scand_attr_group);
+remove_autonuma:
+	sysfs_remove_group(*autonuma_kobj, &autonuma_attr_group);
+delete_obj:
+	kobject_put(*autonuma_kobj);
+	return err;
+}
+
+static void __init autonuma_exit_sysfs(struct kobject *autonuma_kobj)
+{
+	sysfs_remove_group(autonuma_kobj, &knuma_migrated_attr_group);
+	sysfs_remove_group(autonuma_kobj, &knuma_scand_attr_group);
+	sysfs_remove_group(autonuma_kobj, &autonuma_attr_group);
+	kobject_put(autonuma_kobj);
+}
+#else
+static inline int autonuma_init_sysfs(struct kobject **autonuma_kobj)
+{
+	return 0;
+}
+
+static inline void autonuma_exit_sysfs(struct kobject *autonuma_kobj)
+{
+}
+#endif /* CONFIG_SYSFS */
+
+static int __init noautonuma_setup(char *str)
+{
+	if (!autonuma_impossible()) {
+		printk("AutoNUMA permanently disabled\n");
+		set_bit(AUTONUMA_IMPOSSIBLE, &autonuma_flags);
+		BUG_ON(!autonuma_impossible());
+	}
+	return 1;
+}
+__setup("noautonuma", noautonuma_setup);
+
+static int __init autonuma_init(void)
+{
+	int err;
+	struct kobject *autonuma_kobj;
+
+	VM_BUG_ON(num_possible_nodes() < 1);
+	if (autonuma_impossible())
+		return -EINVAL;
+
+	err = autonuma_init_sysfs(&autonuma_kobj);
+	if (err)
+		return err;
+
+	err = start_knuma_scand();
+	if (err) {
+		printk("failed to start knuma_scand\n");
+		goto out;
+	}
+	err = start_knuma_migrated();
+	if (err) {
+		printk("failed to start knuma_migrated\n");
+		goto out;
+	}
+
+	printk("AutoNUMA initialized successfully\n");
+	return err;
+
+out:
+	autonuma_exit_sysfs(autonuma_kobj);
+	return err;
+}
+module_init(autonuma_init)
+
+static struct kmem_cache *sched_autonuma_cachep;
+
+int alloc_sched_autonuma(struct task_struct *tsk, struct task_struct *orig,
+			 int node)
+{
+	int err = 1;
+	struct sched_autonuma *sched_autonuma;
+
+	if (autonuma_impossible())
+		goto no_numa;
+	sched_autonuma = kmem_cache_alloc_node(sched_autonuma_cachep,
+					       GFP_KERNEL, node);
+	if (!sched_autonuma)
+		goto out;
+	if (autonuma_sched_clone_reset())
+		sched_autonuma_reset(sched_autonuma);
+	else {
+		memcpy(sched_autonuma, orig->sched_autonuma,
+		       sched_autonuma_size());
+		BUG_ON(sched_autonuma->autonuma_stop_one_cpu);
+	}
+	tsk->sched_autonuma = sched_autonuma;
+no_numa:
+	err = 0;
+out:
+	return err;
+}
+
+void free_sched_autonuma(struct task_struct *tsk)
+{
+	if (autonuma_impossible()) {
+		BUG_ON(tsk->sched_autonuma);
+		return;
+	}
+
+	BUG_ON(!tsk->sched_autonuma);
+	kmem_cache_free(sched_autonuma_cachep, tsk->sched_autonuma);
+	tsk->sched_autonuma = NULL;
+}
+
+void __init sched_autonuma_init(void)
+{
+	struct sched_autonuma *sched_autonuma;
+
+	BUG_ON(current != &init_task);
+
+	if (autonuma_impossible())
+		return;
+
+	sched_autonuma_cachep =
+		kmem_cache_create("sched_autonuma",
+				  sched_autonuma_size(), 0,
+				  SLAB_PANIC | SLAB_HWCACHE_ALIGN, NULL);
+
+	sched_autonuma = kmem_cache_alloc_node(sched_autonuma_cachep,
+					       GFP_KERNEL, numa_node_id());
+	BUG_ON(!sched_autonuma);
+	sched_autonuma_reset(sched_autonuma);
+	BUG_ON(current->sched_autonuma);
+	current->sched_autonuma = sched_autonuma;
+}
+
+static struct kmem_cache *mm_autonuma_cachep;
+
+int alloc_mm_autonuma(struct mm_struct *mm)
+{
+	int err = 1;
+	struct mm_autonuma *mm_autonuma;
+
+	if (autonuma_impossible())
+		goto no_numa;
+	mm_autonuma = kmem_cache_alloc(mm_autonuma_cachep, GFP_KERNEL);
+	if (!mm_autonuma)
+		goto out;
+	if (autonuma_sched_fork_reset() || !mm->mm_autonuma)
+		mm_autonuma_reset(mm_autonuma);
+	else
+		memcpy(mm_autonuma, mm->mm_autonuma, mm_autonuma_size());
+	mm->mm_autonuma = mm_autonuma;
+	mm_autonuma->mm = mm;
+no_numa:
+	err = 0;
+out:
+	return err;
+}
+
+void free_mm_autonuma(struct mm_struct *mm)
+{
+	if (autonuma_impossible()) {
+		BUG_ON(mm->mm_autonuma);
+		return;
+	}
+
+	BUG_ON(!mm->mm_autonuma);
+	kmem_cache_free(mm_autonuma_cachep, mm->mm_autonuma);
+	mm->mm_autonuma = NULL;
+}
+
+void __init mm_autonuma_init(void)
+{
+	BUG_ON(current != &init_task);
+	BUG_ON(current->mm);
+
+	if (autonuma_impossible())
+		return;
+
+	mm_autonuma_cachep =
+		kmem_cache_create("mm_autonuma",
+				  mm_autonuma_size(), 0,
+				  SLAB_PANIC | SLAB_HWCACHE_ALIGN, NULL);
+}
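
A note on the reset helpers in this file: sched_autonuma_reset_size() and
mm_autonuma_reset_size() compute the number of bytes from a chosen member to
the end of the variable-sized allocation, so the reset can memset the
statistics (including the trailing per-node counters) while preserving the
members that precede them.  Below is a minimal standalone sketch of the same
idiom; the struct and names are invented for illustration, and it uses
offsetof() where the patch open-codes the offset with NULL-pointer
arithmetic.

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NR_NODES 4

struct stats {
	void *keep_me;           /* preserved across a reset */
	unsigned long fault_tot; /* first member cleared by the reset */
	unsigned long fault[];   /* per-node array allocated past the struct */
};

/* size of the struct plus its trailing per-node array */
static size_t stats_size(void)
{
	return sizeof(struct stats) + NR_NODES * sizeof(unsigned long);
}

/* bytes from fault_tot to the end of the allocation */
static size_t stats_reset_size(void)
{
	return stats_size() - offsetof(struct stats, fault_tot);
}

static void stats_reset(struct stats *s)
{
	memset(&s->fault_tot, 0, stats_reset_size());
}

int main(void)
{
	struct stats *s = malloc(stats_size());
	int i;

	s->keep_me = s;
	s->fault_tot = 42;
	for (i = 0; i < NR_NODES; i++)
		s->fault[i] = i;

	stats_reset(s);
	printf("keep_me preserved: %d, fault_tot now %lu\n",
	       s->keep_me == s, s->fault_tot);
	free(s);
	return 0;
}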

^ permalink raw reply related	[flat|nested] 125+ messages in thread
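
The knuma_scand main loop above is easier to follow once its resumable-scan
shape is seen in isolation: knumad_do_scan() does a bounded amount of work
(roughly pages_to_scan pages) per call and records a cursor (knumad_scan.mm
plus knumad_scan.address) so the next call resumes where the previous one
stopped, moving on to the next mm once the current one is exhausted.  Here is
a toy userspace sketch of that structure; every name in it is invented for
illustration and nothing below is kernel API.

#include <stdio.h>

#define PAGES_TO_SCAN 32	/* work quantum per wakeup */

struct mm_like {
	const char *name;
	int nr_pages;
};

static struct mm_like mms[] = {
	{ "mm-a", 50 }, { "mm-b", 10 }, { "mm-c", 75 },
};
#define NR_MMS ((int)(sizeof(mms) / sizeof(mms[0])))

/* scan cursor: which mm we are in and how far into it we got */
static int cur_mm;
static int cur_page;

/*
 * Process at most PAGES_TO_SCAN pages and return how much was done.
 * The cursor persists across calls, so the scan resumes where it
 * stopped, like knumad_scan.mm/knumad_scan.address in the patch.
 */
static int do_scan(void)
{
	int progress = 0;

	while (progress < PAGES_TO_SCAN && cur_mm < NR_MMS) {
		struct mm_like *mm = &mms[cur_mm];

		/* "scan" one page of the current mm */
		cur_page++;
		progress++;

		if (cur_page >= mm->nr_pages) {
			/* finished this mm, move to the next one */
			printf("finished %s\n", mm->name);
			cur_mm++;
			cur_page = 0;
		}
	}
	return progress;
}

int main(void)
{
	int pass = 0, done;

	while ((done = do_scan()) > 0)
		printf("pass %d scanned %d pages\n", pass++, done);
	return 0;
}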

* [PATCH 28/39] autonuma: follow_page check for pte_numa/pmd_numa
  2012-03-26 17:45 ` Andrea Arcangeli
@ 2012-03-26 17:46   ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-26 17:46 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

Without this, follow_page wouldn't trigger the NUMA hinting faults.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/memory.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index eac888a..a0f35cd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1486,7 +1486,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 		goto no_page_table;
 
 	pmd = pmd_offset(pud, address);
-	if (pmd_none(*pmd))
+	if (pmd_none(*pmd) || pmd_numa(*pmd))
 		goto no_page_table;
 	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
 		BUG_ON(flags & FOLL_GET);
@@ -1520,7 +1520,7 @@ split_fallthrough:
 	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
 
 	pte = *ptep;
-	if (!pte_present(pte))
+	if (!pte_present(pte) || pte_numa(pte))
 		goto no_page;
 	if ((flags & FOLL_WRITE) && !pte_write(pte))
 		goto unlock;

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 29/39] autonuma: default mempolicy follow AutoNUMA
  2012-03-26 17:45 ` Andrea Arcangeli
@ 2012-03-26 17:46   ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-26 17:46 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

If the task has already been moved to an autonuma_node, try to allocate
memory from it even if it's temporarily not the local node. Chances
are that's where most of its memory is already located and where the
task will run in the future.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/mempolicy.c |   15 +++++++++++++--
 1 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index cfb6c86..f3c03cb 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1929,10 +1929,21 @@ retry_cpuset:
 	 */
 	if (pol->mode == MPOL_INTERLEAVE)
 		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
-	else
+	else {
+		int nid;
+#ifdef CONFIG_AUTONUMA
+		nid = -1;
+		if (current->sched_autonuma)
+			nid = current->sched_autonuma->autonuma_node;
+		if (nid < 0)
+			nid = numa_node_id();
+#else
+		nid = numa_node_id();
+#endif
 		page = __alloc_pages_nodemask(gfp, order,
-				policy_zonelist(gfp, pol, numa_node_id()),
+				policy_zonelist(gfp, pol, nid),
 				policy_nodemask(gfp, pol));
+	}
 
 	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
 		goto retry_cpuset;

^ permalink raw reply related	[flat|nested] 125+ messages in thread
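
The nid selection added above reduces to a two-step preference: use the
task's autonuma_node when the balancer has picked one, otherwise fall back
to the local node.  A trivial sketch of just that decision, with invented
names and -1 standing for "no autonuma_node chosen", as in the patch:

#include <stdio.h>

static int pick_alloc_node(int autonuma_node, int local_node)
{
	return autonuma_node >= 0 ? autonuma_node : local_node;
}

int main(void)
{
	/* not yet placed by the balancer: allocate on the local node */
	printf("%d\n", pick_alloc_node(-1, 2));	/* prints 2 */
	/* placed on node 3 but temporarily running on node 0: prefer
	 * node 3, where most of the task's memory already lives
	 */
	printf("%d\n", pick_alloc_node(3, 0));	/* prints 3 */
	return 0;
}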

* [PATCH 30/39] autonuma: call autonuma_split_huge_page()
  2012-03-26 17:45 ` Andrea Arcangeli
@ 2012-03-26 17:46   ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-26 17:46 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

This is needed to make sure the tail pages are also queued into the
migration queues of knuma_migrated across a transparent hugepage
split.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 383ae4d..b1c047b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -17,6 +17,7 @@
 #include <linux/khugepaged.h>
 #include <linux/freezer.h>
 #include <linux/mman.h>
+#include <linux/autonuma.h>
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
 #include "internal.h"
@@ -1307,6 +1308,7 @@ static void __split_huge_page_refcount(struct page *page)
 
 
 		lru_add_page_tail(zone, page, page_tail);
+		autonuma_migrate_split_huge_page(page, page_tail);
 	}
 	atomic_sub(tail_count, &page->_count);
 	BUG_ON(__page_count(page) <= 0);

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 31/39] autonuma: make khugepaged pte_numa aware
  2012-03-26 17:45 ` Andrea Arcangeli
@ 2012-03-26 17:46   ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-26 17:46 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

If any of the ptes that khugepaged is collapsing is a pte_numa, the
resulting trans huge pmd will be a pmd_numa too.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c |   13 +++++++++++--
 1 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b1c047b..d388517 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1790,12 +1790,13 @@ out:
 	return isolated;
 }
 
-static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
+static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
 				      struct vm_area_struct *vma,
 				      unsigned long address,
 				      spinlock_t *ptl)
 {
 	pte_t *_pte;
+	bool mknuma = false;
 	for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++) {
 		pte_t pteval = *_pte;
 		struct page *src_page;
@@ -1823,11 +1824,15 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
 			page_remove_rmap(src_page);
 			spin_unlock(ptl);
 			free_page_and_swap_cache(src_page);
+
+			mknuma |= pte_numa(pteval);
 		}
 
 		address += PAGE_SIZE;
 		page++;
 	}
+
+	return mknuma;
 }
 
 static void collapse_huge_page(struct mm_struct *mm,
@@ -1845,6 +1850,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	spinlock_t *ptl;
 	int isolated;
 	unsigned long hstart, hend;
+	bool mknuma = false;
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 #ifndef CONFIG_NUMA
@@ -1963,7 +1969,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	anon_vma_unlock(vma->anon_vma);
 
-	__collapse_huge_page_copy(pte, new_page, vma, address, ptl);
+	mknuma = pmd_numa(_pmd);
+	mknuma |= __collapse_huge_page_copy(pte, new_page, vma, address, ptl);
 	pte_unmap(pte);
 	__SetPageUptodate(new_page);
 	pgtable = pmd_pgtable(_pmd);
@@ -1973,6 +1980,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	_pmd = mk_pmd(new_page, vma->vm_page_prot);
 	_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
 	_pmd = pmd_mkhuge(_pmd);
+	if (mknuma)
+		_pmd = pmd_mknuma(_pmd);
 
 	/*
 	 * spin_lock() below is not the equivalent of smp_wmb(), so

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 32/39] autonuma: retain page last_nid information in khugepaged
  2012-03-26 17:45 ` Andrea Arcangeli
@ 2012-03-26 17:46   ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-26 17:46 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

When pages are collapsed, try to keep the last_nid information from one
of the original pages.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d388517..76bdc48 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1805,7 +1805,18 @@ static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
 			clear_user_highpage(page, address);
 			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
 		} else {
+#ifdef CONFIG_AUTONUMA
+			int autonuma_last_nid;
+#endif
 			src_page = pte_page(pteval);
+#ifdef CONFIG_AUTONUMA
+			/* pick the last one, better than nothing */
+			autonuma_last_nid =
+				ACCESS_ONCE(src_page->autonuma_last_nid);
+			if (autonuma_last_nid >= 0)
+				ACCESS_ONCE(page->autonuma_last_nid) =
+					autonuma_last_nid;
+#endif
 			copy_user_highpage(page, src_page, address, vma);
 			VM_BUG_ON(page_mapcount(src_page) != 1);
 			VM_BUG_ON(page_count(src_page) != 2);

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 33/39] autonuma: numa hinting page faults entry points
  2012-03-26 17:45 ` Andrea Arcangeli
@ 2012-03-26 17:46   ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-26 17:46 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

This is where the numa hinting page faults are detected and passed
over to the AutoNUMA core logic.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/huge_mm.h |    2 ++
 mm/huge_memory.c        |   17 +++++++++++++++++
 mm/memory.c             |   32 ++++++++++++++++++++++++++++++++
 3 files changed, 51 insertions(+), 0 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index c8af7a2..72eac1d 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -11,6 +11,8 @@ extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			       unsigned long address, pmd_t *pmd,
 			       pmd_t orig_pmd);
+extern pmd_t __huge_pmd_numa_fixup(struct mm_struct *mm, unsigned long addr,
+				   pmd_t pmd, pmd_t *pmdp);
 extern pgtable_t get_pmd_huge_pte(struct mm_struct *mm);
 extern struct page *follow_trans_huge_pmd(struct mm_struct *mm,
 					  unsigned long addr,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 76bdc48..017c0a3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1030,6 +1030,23 @@ out:
 	return page;
 }
 
+#ifdef CONFIG_AUTONUMA
+pmd_t __huge_pmd_numa_fixup(struct mm_struct *mm, unsigned long addr,
+			    pmd_t pmd, pmd_t *pmdp)
+{
+	spin_lock(&mm->page_table_lock);
+	if (pmd_same(pmd, *pmdp)) {
+		struct page *page = pmd_page(pmd);
+		pmd = pmd_mknotnuma(pmd);
+		set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmdp, pmd);
+		numa_hinting_fault(page, HPAGE_PMD_NR);
+		VM_BUG_ON(pmd_numa(pmd));
+	}
+	spin_unlock(&mm->page_table_lock);
+	return pmd;
+}
+#endif
+
 int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 pmd_t *pmd, unsigned long addr)
 {
diff --git a/mm/memory.c b/mm/memory.c
index a0f35cd..9dcfc35 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -57,6 +57,7 @@
 #include <linux/swapops.h>
 #include <linux/elf.h>
 #include <linux/gfp.h>
+#include <linux/autonuma.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -3401,6 +3402,32 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
 }
 
+static inline pte_t pte_numa_fixup(struct mm_struct *mm,
+				   struct vm_area_struct *vma,
+				   unsigned long addr, pte_t pte, pte_t *ptep)
+{
+	if (pte_numa(pte))
+		pte = __pte_numa_fixup(mm, vma, addr, pte, ptep);
+	return pte;
+}
+
+static inline void pmd_numa_fixup(struct mm_struct *mm,
+				  struct vm_area_struct *vma,
+				  unsigned long addr, pmd_t *pmd)
+{
+	if (pmd_numa(*pmd))
+		__pmd_numa_fixup(mm, vma, addr, pmd);
+}
+
+static inline pmd_t huge_pmd_numa_fixup(struct mm_struct *mm,
+					unsigned long addr,
+					pmd_t pmd, pmd_t *pmdp)
+{
+	if (pmd_numa(pmd))
+		pmd = __huge_pmd_numa_fixup(mm, addr, pmd, pmdp);
+	return pmd;
+}
+
 /*
  * These routines also need to handle stuff like marking pages dirty
  * and/or accessed for architectures that don't do it in hardware (most
@@ -3443,6 +3470,7 @@ int handle_pte_fault(struct mm_struct *mm,
 	spin_lock(ptl);
 	if (unlikely(!pte_same(*pte, entry)))
 		goto unlock;
+	entry = pte_numa_fixup(mm, vma, address, entry, pte);
 	if (flags & FAULT_FLAG_WRITE) {
 		if (!pte_write(entry))
 			return do_wp_page(mm, vma, address,
@@ -3504,6 +3532,8 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		pmd_t orig_pmd = *pmd;
 		barrier();
 		if (pmd_trans_huge(orig_pmd)) {
+			orig_pmd = huge_pmd_numa_fixup(mm, address,
+						       orig_pmd, pmd);
 			if (flags & FAULT_FLAG_WRITE &&
 			    !pmd_write(orig_pmd) &&
 			    !pmd_trans_splitting(orig_pmd))
@@ -3513,6 +3543,8 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 	}
 
+	pmd_numa_fixup(mm, vma, address, pmd);
+
 	/*
 	 * Use __pte_alloc instead of pte_alloc_map, because we can't
 	 * run pte_offset_map on the pmd, if an huge pmd could

^ permalink raw reply related	[flat|nested] 125+ messages in thread
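
The interesting part of __huge_pmd_numa_fixup() above is the pmd_same()
recheck: the fault handler looked at the pmd without holding
page_table_lock, so before touching it the fixup retakes the lock and
verifies the entry is still the one it saw; if another CPU changed it in
the meantime, nothing is done and the fault is simply retried.  A small
userspace sketch of that revalidate-under-lock pattern (all names invented,
a pthread mutex standing in for the page table lock):

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int entry = 42;		/* stands in for the pmd the fault saw */

/*
 * Update the entry only if it still matches what the lockless fault
 * handler observed; otherwise back off and let the fault be retried.
 */
static bool fixup_if_unchanged(int seen, int new_value)
{
	bool done = false;

	pthread_mutex_lock(&lock);
	if (entry == seen) {		/* revalidate under the lock */
		entry = new_value;	/* e.g. clear the NUMA bit */
		done = true;
	}
	pthread_mutex_unlock(&lock);
	return done;
}

int main(void)
{
	printf("fixed up: %d, entry now %d\n",
	       fixup_if_unchanged(42, 7), entry);
	/* second attempt sees a stale value and must not touch the entry */
	printf("fixed up: %d, entry now %d\n",
	       fixup_if_unchanged(42, 9), entry);
	return 0;
}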

* [PATCH 34/39] autonuma: reset autonuma page data when pages are freed
  2012-03-26 17:45 ` Andrea Arcangeli
@ 2012-03-26 17:46   ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-26 17:46 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

When pages are freed, abort any pending migration. If knuma_migrated
arrives first, it will notice because get_page_unless_zero will fail.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/page_alloc.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b9c80df..c986c0c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -614,6 +614,10 @@ static inline int free_pages_check(struct page *page)
 		bad_page(page);
 		return 1;
 	}
+	autonuma_migrate_page_remove(page);
+#ifdef CONFIG_AUTONUMA
+	ACCESS_ONCE(page->autonuma_last_nid) = -1;
+#endif
 	if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
 		page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	return 0;

^ permalink raw reply related	[flat|nested] 125+ messages in thread
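
The changelog relies on the refcount handshake between the page allocator
and knuma_migrated: whichever side gets to the page first wins, and the
loser backs off because get_page_unless_zero() refuses to take a reference
on a page whose count has already dropped to zero.  A userspace sketch of
that "take a reference unless it is already dying" pattern, written with
C11 atomics as a stand-in for the kernel's implementation:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Take a reference only if the object is not already on its way to
 * being freed (refcount == 0); never resurrect a dying object.
 */
static bool get_unless_zero(atomic_int *refcount)
{
	int old = atomic_load(refcount);

	while (old != 0) {
		if (atomic_compare_exchange_weak(refcount, &old, old + 1))
			return true;
		/* old was reloaded by the failed CAS; retry */
	}
	return false;
}

int main(void)
{
	atomic_int live = 2, dying = 0;

	printf("live:  %d\n", get_unless_zero(&live));	/* 1: ref taken */
	printf("dying: %d\n", get_unless_zero(&dying));	/* 0: skip it   */
	return 0;
}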

* [PATCH 35/39] autonuma: initialize page structure fields
  2012-03-26 17:45 ` Andrea Arcangeli
@ 2012-03-26 17:46   ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-26 17:46 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

Initialize the AutoNUMA page structure fields at boot.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/page_alloc.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c986c0c..3ac525a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3633,6 +3633,10 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
 
 		INIT_LIST_HEAD(&page->lru);
+#ifdef CONFIG_AUTONUMA
+		page->autonuma_last_nid = -1;
+		page->autonuma_migrate_nid = -1;
+#endif
 #ifdef WANT_PAGE_VIRTUAL
 		/* The shift won't overflow because ZONE_NORMAL is below 4G. */
 		if (!is_highmem_idx(zone))

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 36/39] autonuma: link mm/autonuma.o and kernel/sched/numa.o
  2012-03-26 17:45 ` Andrea Arcangeli
@ 2012-03-26 17:46   ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-26 17:46 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

Link the AutoNUMA core and scheduler object files in the kernel if
CONFIG_AUTONUMA=y.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/sched/Makefile |    3 +--
 mm/Makefile           |    1 +
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 9a7dd35..783a840 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -16,5 +16,4 @@ obj-$(CONFIG_SMP) += cpupri.o
 obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
-
-
+obj-$(CONFIG_AUTONUMA) += numa.o
diff --git a/mm/Makefile b/mm/Makefile
index 50ec00e..67c77bd 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -29,6 +29,7 @@ obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o thrash.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
 obj-$(CONFIG_NUMA) 	+= mempolicy.o
+obj-$(CONFIG_AUTONUMA) 	+= autonuma.o
 obj-$(CONFIG_SPARSEMEM)	+= sparse.o
 obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_SLOB) += slob.o

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 37/39] autonuma: add CONFIG_AUTONUMA and CONFIG_AUTONUMA_DEFAULT_ENABLED
  2012-03-26 17:45 ` Andrea Arcangeli
@ 2012-03-26 17:46   ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-26 17:46 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

Add the config options to allow building the kernel with AutoNUMA.

If CONFIG_AUTONUMA_DEFAULT_ENABLED is "=y", then
/sys/kernel/mm/autonuma/enabled will be equal to 1, and AutoNUMA will
be enabled automatically at boot.

CONFIG_AUTONUMA currently depends on X86, because no other arch
implements pte_numa/pmd_numa yet and selecting =y elsewhere would
result in a build failure; this restriction shall be relaxed in the
future. Porting AutoNUMA to other archs should be pretty simple.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/Kconfig |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index e338407..cbfdb15 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -207,6 +207,19 @@ config MIGRATION
 	  pages as migration can relocate pages to satisfy a huge page
 	  allocation instead of reclaiming.
 
+config AUTONUMA
+	bool "Auto NUMA"
+	select MIGRATION
+	depends on NUMA && X86
+	help
+	  Automatic NUMA CPU scheduling and memory migration.
+
+config AUTONUMA_DEFAULT_ENABLED
+	bool "Auto NUMA default enabled"
+	depends on AUTONUMA
+	help
+	  Automatic NUMA CPU scheduling and memory migration enabled at boot.
+
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
 

^ permalink raw reply related	[flat|nested] 125+ messages in thread
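
For illustration, a sketch of how the Kconfig default added above is
consumed at boot (the real initializer is in mm/autonuma.c and is
visible in the last patch of this series; the variable name here is
shortened for the example):

#include <linux/autonuma_flags.h>

/* AUTONUMA_FLAG set at build time => /sys/kernel/mm/autonuma/enabled reads 1 at boot */
unsigned long example_autonuma_flags =
#ifdef CONFIG_AUTONUMA_DEFAULT_ENABLED
	(1UL << AUTONUMA_FLAG) |
#endif
	0;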

* [PATCH 38/39] autonuma: boost khugepaged scanning rate
  2012-03-26 17:45 ` Andrea Arcangeli
@ 2012-03-26 17:46   ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-26 17:46 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

Until THP native migration is implemented, it's safer to boost the
khugepaged scanning rate, because every memory migration currently
splits the hugepages, so the regular scanning rate becomes too low
when lots of memory is migrated.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 017c0a3..b919c0c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -573,6 +573,14 @@ static int __init hugepage_init(void)
 
 	set_recommended_min_free_kbytes();
 
+#ifdef CONFIG_AUTONUMA
+	/* Hack, remove after THP native migration */
+	if (num_possible_nodes() > 1) {
+		khugepaged_scan_sleep_millisecs = 100;
+		khugepaged_alloc_sleep_millisecs = 10000;
+	}
+#endif
+
 	return 0;
 out:
 	hugepage_exit_sysfs(hugepage_kobj);

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [PATCH 39/39] autonuma: NUMA scheduler SMT awareness
  2012-03-26 17:45 ` Andrea Arcangeli
@ 2012-03-26 17:46   ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-26 17:46 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

Add SMT awareness to the NUMA scheduler so that it will not move load
from fully idle SMT threads to semi-idle SMT threads.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma_flags.h |   10 ++++++++
 kernel/sched/numa.c            |   50 +++++++++++++++++++++++++++++++++++++--
 mm/autonuma.c                  |    7 +++++
 3 files changed, 64 insertions(+), 3 deletions(-)

diff --git a/include/linux/autonuma_flags.h b/include/linux/autonuma_flags.h
index 9c702fd..d6b34b0 100644
--- a/include/linux/autonuma_flags.h
+++ b/include/linux/autonuma_flags.h
@@ -8,6 +8,7 @@ enum autonuma_flag {
 	AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG,
 	AUTONUMA_SCHED_CLONE_RESET_FLAG,
 	AUTONUMA_SCHED_FORK_RESET_FLAG,
+	AUTONUMA_SCHED_SMT_FLAG,
 	AUTONUMA_SCAN_PMD_FLAG,
 	AUTONUMA_SCAN_USE_WORKING_SET_FLAG,
 	AUTONUMA_MIGRATE_DEFER_FLAG,
@@ -43,6 +44,15 @@ static bool inline autonuma_sched_fork_reset(void)
 			  &autonuma_flags);
 }
 
+static bool inline autonuma_sched_smt(void)
+{
+#ifdef CONFIG_SCHED_SMT
+	return !!test_bit(AUTONUMA_SCHED_SMT_FLAG, &autonuma_flags);
+#else
+	return 0;
+#endif
+}
+
 static bool inline autonuma_scan_pmd(void)
 {
 	return !!test_bit(AUTONUMA_SCAN_PMD_FLAG, &autonuma_flags);
diff --git a/kernel/sched/numa.c b/kernel/sched/numa.c
index d51e1ec..4211305 100644
--- a/kernel/sched/numa.c
+++ b/kernel/sched/numa.c
@@ -11,6 +11,30 @@
 
 #include "sched.h"
 
+static inline bool idle_cpu_avg(int cpu, bool require_avg_idle)
+{
+	struct rq *rq = cpu_rq(cpu);
+	return idle_cpu(cpu) && (!require_avg_idle ||
+				 rq->avg_idle > sysctl_sched_migration_cost);
+}
+
+/* A false avg_idle param makes it easier for smt_idle() to return true */
+static bool smt_idle(int _cpu, bool require_avg_idle)
+{
+#ifdef CONFIG_SCHED_SMT
+	int cpu;
+
+	for_each_cpu_and(cpu, topology_thread_cpumask(_cpu), cpu_online_mask) {
+		if (cpu == _cpu)
+			continue;
+		if (!idle_cpu_avg(cpu, require_avg_idle))
+			return false;
+	}
+#endif
+
+	return true;
+}
+
 #define AUTONUMA_BALANCE_SCALE 1000
 
 /*
@@ -47,6 +71,7 @@ void sched_autonuma_balance(void)
 	int cpu, nid, selected_cpu, selected_nid;
 	int cpu_nid = numa_node_id();
 	int this_cpu = smp_processor_id();
+	int this_smt_idle;
 	unsigned long p_w, p_t, m_w, m_t;
 	unsigned long weight_delta_max, weight;
 	struct cpumask *allowed;
@@ -96,6 +121,7 @@ void sched_autonuma_balance(void)
 		weight_current[nid] = p_w*AUTONUMA_BALANCE_SCALE/p_t;
 	}
 
+	this_smt_idle = smt_idle(this_cpu, false);
 	bitmap_zero(mm_mask, NR_CPUS);
 	for_each_online_node(nid) {
 		if (nid == cpu_nid)
@@ -103,11 +129,24 @@ void sched_autonuma_balance(void)
 		for_each_cpu_and(cpu, cpumask_of_node(nid), allowed) {
 			struct mm_struct *mm;
 			struct rq *rq = cpu_rq(cpu);
+			bool other_smt_idle;
 			if (!cpu_online(cpu))
 				continue;
 			weight_others[cpu] = LONG_MAX;
-			if (idle_cpu(cpu) &&
-			    rq->avg_idle > sysctl_sched_migration_cost) {
+
+			other_smt_idle = smt_idle(cpu, true);
+			if (autonuma_sched_smt() &&
+			    this_smt_idle && !other_smt_idle)
+				continue;
+
+			if (idle_cpu_avg(cpu, true)) {
+				if (autonuma_sched_smt() &&
+				    !this_smt_idle && other_smt_idle) {
+					/* NUMA affinity override */
+					weight_others[cpu] = -2;
+					continue;
+				}
+
 				if (weight_current[nid] >
 				    weight_current[cpu_nid] &&
 				    weight_current_mm[nid] >
@@ -115,6 +154,11 @@ void sched_autonuma_balance(void)
 					weight_others[cpu] = -1;
 				continue;
 			}
+
+			if (autonuma_sched_smt() &&
+			    this_smt_idle && cpu_rq(this_cpu)->nr_running <= 1)
+				continue;
+
 			mm = rq->curr->mm;
 			if (!mm)
 				continue;
@@ -169,7 +213,7 @@ void sched_autonuma_balance(void)
 				w_cpu_nid = weight_current_mm[cpu_nid];
 			}
 			if (w_nid > weight_others[cpu] &&
-			    w_nid > w_cpu_nid) {
+			    (w_nid > w_cpu_nid || weight_others[cpu] == -2)) {
 				weight = w_nid -
 					weight_others[cpu] +
 					w_nid -
diff --git a/mm/autonuma.c b/mm/autonuma.c
index 7ca4992..4cce6a1 100644
--- a/mm/autonuma.c
+++ b/mm/autonuma.c
@@ -23,6 +23,7 @@ unsigned long autonuma_flags __read_mostly =
 	(1<<AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG)|
 	(1<<AUTONUMA_SCHED_CLONE_RESET_FLAG)|
 	(1<<AUTONUMA_SCHED_FORK_RESET_FLAG)|
+	(1<<AUTONUMA_SCHED_SMT_FLAG)|
 #ifdef CONFIG_AUTONUMA_DEFAULT_ENABLED
 	(1<<AUTONUMA_FLAG)|
 #endif
@@ -1089,6 +1090,9 @@ SYSFS_ENTRY(defer, AUTONUMA_MIGRATE_DEFER_FLAG);
 SYSFS_ENTRY(load_balance_strict, AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG);
 SYSFS_ENTRY(clone_reset, AUTONUMA_SCHED_CLONE_RESET_FLAG);
 SYSFS_ENTRY(fork_reset, AUTONUMA_SCHED_FORK_RESET_FLAG);
+#ifdef CONFIG_SCHED_SMT
+SYSFS_ENTRY(smt, AUTONUMA_SCHED_SMT_FLAG);
+#endif
 
 #undef SYSFS_ENTRY
 
@@ -1205,6 +1209,9 @@ static struct attribute *scheduler_attr[] = {
 	&clone_reset_attr.attr,
 	&fork_reset_attr.attr,
 	&load_balance_strict_attr.attr,
+#ifdef CONFIG_SCHED_SMT
+	&smt_attr.attr,
+#endif
 	NULL,
 };
 static struct attribute_group scheduler_attr_group = {
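
For illustration, the SMT filter from the scheduler hunk above reduced
to a standalone predicate (a sketch, not part of the patch; it only
restates the first "continue" condition, using the smt_idle() helper
the patch adds):

/*
 * Do not move load away from a core whose SMT siblings are all idle
 * onto a core that already has a busy sibling.  "this_cpu" is the
 * balancing CPU, "other_cpu" the candidate destination.
 */
static bool example_smt_allows_move(int this_cpu, int other_cpu)
{
	bool this_core_idle  = smt_idle(this_cpu, false);
	bool other_core_idle = smt_idle(other_cpu, true);

	if (this_core_idle && !other_core_idle)
		return false;
	return true;
}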

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* Re: [PATCH 11/39] autonuma: CPU follow memory algorithm
  2012-03-26 17:45   ` Andrea Arcangeli
@ 2012-03-26 18:25     ` Peter Zijlstra
  -1 siblings, 0 replies; 125+ messages in thread
From: Peter Zijlstra @ 2012-03-26 18:25 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

On Mon, 2012-03-26 at 19:45 +0200, Andrea Arcangeli wrote:
> @@ -3220,6 +3214,8 @@ need_resched:
>  
>         post_schedule(rq);
>  
> +       sched_autonuma_balance();
> +
>         sched_preempt_enable_no_resched();
>         if (need_resched())
>                 goto need_resched; 

I already told you, this isn't ever going to happen. You do _NOT_ put a
for_each_online_cpu() loop in the middle of schedule().

You also do not call stop_one_cpu(migration_cpu_stop) in schedule to
force migrate the task you just scheduled to away from this cpu. That's
retarded.

Nacked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 07/39] autonuma: introduce kthread_bind_node()
  2012-03-26 17:45   ` Andrea Arcangeli
@ 2012-03-26 18:32     ` Peter Zijlstra
  -1 siblings, 0 replies; 125+ messages in thread
From: Peter Zijlstra @ 2012-03-26 18:32 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

On Mon, 2012-03-26 at 19:45 +0200, Andrea Arcangeli wrote:
> +void kthread_bind_node(struct task_struct *p, int nid)
> +{
> +       /* Must have done schedule() in kthread() before we set_task_cpu */
> +       if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
> +               WARN_ON(1);
> +               return;
> +       }
> +
> +       /* It's safe because the task is inactive. */
> +       do_set_cpus_allowed(p, cpumask_of_node(nid));
> +       p->flags |= PF_THREAD_BOUND;
> +}
> +EXPORT_SYMBOL(kthread_bind_node);

That's a wrong use of PF_THREAD_BOUND, we should only use that for
cpumask_weight(tsk_cpus_allowed()) == 1.
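
A sketch (not from this thread) of the constraint being pointed out,
using only helpers that already exist in the scheduler:

/*
 * PF_THREAD_BOUND is meant for kthreads pinned to exactly one CPU, so
 * after a node-wide binding the flag would only be justified when the
 * resulting mask happens to contain a single CPU.
 */
static void example_bind_node(struct task_struct *p, int nid)
{
	do_set_cpus_allowed(p, cpumask_of_node(nid));
	if (cpumask_weight(tsk_cpus_allowed(p)) == 1)
		p->flags |= PF_THREAD_BOUND;
}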

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 39/39] autonuma: NUMA scheduler SMT awareness
  2012-03-26 17:46   ` Andrea Arcangeli
@ 2012-03-26 18:57     ` Peter Zijlstra
  -1 siblings, 0 replies; 125+ messages in thread
From: Peter Zijlstra @ 2012-03-26 18:57 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

On Mon, 2012-03-26 at 19:46 +0200, Andrea Arcangeli wrote:
> Add SMT awareness to the NUMA scheduler so that it will not move load
> from fully idle SMT threads, to semi idle SMT threads.

This shows a complete fail in design, you're working around the regular
scheduler/load-balancer instead of with it and hence are duplicating all
kinds of stuff.

I'll not have that..

Nacked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 11/39] autonuma: CPU follow memory algorithm
  2012-03-26 18:25     ` Peter Zijlstra
@ 2012-03-26 19:28       ` Rik van Riel
  -1 siblings, 0 replies; 125+ messages in thread
From: Rik van Riel @ 2012-03-26 19:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, Hillf Danton,
	Dan Smith, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Johannes Weiner

On 03/26/2012 02:25 PM, Peter Zijlstra wrote:
> On Mon, 2012-03-26 at 19:45 +0200, Andrea Arcangeli wrote:
>> @@ -3220,6 +3214,8 @@ need_resched:
>>
>>          post_schedule(rq);
>>
>> +       sched_autonuma_balance();
>> +
>>          sched_preempt_enable_no_resched();
>>          if (need_resched())
>>                  goto need_resched;
>
> I already told you, this isn't ever going to happen. You do _NOT_ put a
> for_each_online_cpu() loop in the middle of schedule().

Agreed, it looks O(N), but because every CPU will be calling
it its behaviour will be O(N^2) and has the potential to
completely break systems with a large number of CPUs.

Finding a lower-overhead way of doing the balancing does not
seem like an insurmountable problem.

> You also do not call stop_one_cpu(migration_cpu_stop) in schedule to
> force migrate the task you just scheduled to away from this cpu. That's
> retarded.
>
> Nacked-by: Peter Zijlstra<a.p.zijlstra@chello.nl>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 20/39] autonuma: avoid CFS select_task_rq_fair to return -1
  2012-03-26 17:46   ` Andrea Arcangeli
@ 2012-03-26 19:36     ` Peter Zijlstra
  -1 siblings, 0 replies; 125+ messages in thread
From: Peter Zijlstra @ 2012-03-26 19:36 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

On Mon, 2012-03-26 at 19:46 +0200, Andrea Arcangeli wrote:
> Fix to avoid -1 retval.

Please fold this and the following 5 patches into something sane. 6
patches changing the same few lines and none of them with a useful
changelog isn't how we do things.



^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 11/39] autonuma: CPU follow memory algorithm
  2012-03-26 19:28       ` Rik van Riel
@ 2012-03-26 19:44         ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-26 19:44 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, linux-kernel, linux-mm, Hillf Danton, Dan Smith,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner

On Mon, Mar 26, 2012 at 03:28:37PM -0400, Rik van Riel wrote:
> Agreed, it looks O(N), but because every CPU will be calling
> it its behaviour will be O(N^2) and has the potential to
> completely break systems with a large number of CPUs.

As I wrote in the comment before the function, math speaking, this
looks like O(N) but it is O(1), not O(N) nor O(N^2). This is because N
= NR_CPUS = 1.

As I also wrote in the comment before the function, this is called at
every schedule in the short term, primarily because I want to see a
flood if this algorithm does something wrong after I do:

echo 1 >/sys/kernel/mm/autonuma/debug

 * This has O(N) complexity but N isn't the number of running
 * processes, but the number of CPUs, so if you assume a constant
 * number of CPUs (capped at NR_CPUS) it is O(1). O(1) misleading math
 * aside, the number of cachelines touched with thousands of CPU might
 * make it measurable. Calling this at every schedule may also be
 * overkill and it may be enough to call it with a frequency similar
 * to the load balancing, but by doing so we're also verifying the
 * algorithm is a converging one in all workloads if performance is
 * improved and there's no frequent CPU migration, so it's good in the
 * short term for stressing the algorithm.

Over time (not urgent) this can be called at a regular interval like
load_balance() or be more integrated within CFS so it doesn't need to
be called at all.

For the short term it shall be called at every schedule for debug
reasons so I wouldn't suggest to make an effort to call it at lower
frequency right now. If somebody wants to make an effort to make it
more integrated in CFS that's welcome though, but I would still like a
tweak to force the algorithm synchronously during every schedule
decision like now so I can verify it converges at the scheduler level
and there is not a flood of worthless bounces.
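
For reference, a minimal sketch (not part of the patchset) of the
"regular interval like load_balance()" idea above; the 100ms period
and the per-CPU variable name are assumptions made for the example:

static DEFINE_PER_CPU(unsigned long, example_numa_balance_next);

static void example_autonuma_balance_throttled(void)
{
	unsigned long *next = &__get_cpu_var(example_numa_balance_next);

	/* run the O(nr_cpus) scan at most once per interval per CPU */
	if (time_before(jiffies, *next))
		return;
	*next = jiffies + msecs_to_jiffies(100);	/* assumed period */
	sched_autonuma_balance();
}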

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 11/39] autonuma: CPU follow memory algorithm
  2012-03-26 19:44         ` Andrea Arcangeli
  (?)
@ 2012-03-26 19:58         ` Linus Torvalds
  2012-03-26 20:39             ` Andrea Arcangeli
  -1 siblings, 1 reply; 125+ messages in thread
From: Linus Torvalds @ 2012-03-26 19:58 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hillf Danton, Paul E. McKenney, Dan Smith, Paul Turner,
	Peter Zijlstra, Lai Jiangshan, Rik van Riel, Ingo Molnar,
	Andrew Morton, Lee Schermerhorn, linux-mm, Suresh Siddha,
	Mike Galbraith, Bharata B Rao, Thomas Gleixner, Johannes Weiner,
	linux-kernel

On Mar 26, 2012 12:45 PM, "Andrea Arcangeli" <aarcange@redhat.com> wrote:
>
> As I wrote in the comment before the function, math speaking, this
> looks like O(N) but it is O(1), not O(N) nor O(N^2). This is because N
> = NR_CPUS = 1.

That's just stupid sophistry.

No, you can't just say that it's limited to some large constant, and thus
the same as O(1).

That's the worst kind of lie: something that's technically true if you look
at it a certain stupid way, but isn't actually true in practice.

It's clearly O(n) in number of CPUs, and people told you it can't go into
the scheduler. Stop arguing idiotic things. Just say you'll fix it, instead
of looking like a tool.

      Linus

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 11/39] autonuma: CPU follow memory algorithm
  2012-03-26 19:58         ` Linus Torvalds
@ 2012-03-26 20:39             ` Andrea Arcangeli
  0 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-26 20:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hillf Danton, Paul E. McKenney, Dan Smith, Paul Turner,
	Peter Zijlstra, Lai Jiangshan, Rik van Riel, Ingo Molnar,
	Andrew Morton, Lee Schermerhorn, linux-mm, Suresh Siddha,
	Mike Galbraith, Bharata B Rao, Thomas Gleixner, Johannes Weiner,
	linux-kernel

Hi,

On Mon, Mar 26, 2012 at 12:58:05PM -0700, Linus Torvalds wrote:
> On Mar 26, 2012 12:45 PM, "Andrea Arcangeli" <aarcange@redhat.com> wrote:
> >
> > As I wrote in the comment before the function, math speaking, this
> > looks like O(N) but it is O(1), not O(N) nor O(N^2). This is because N
> > = NR_CPUS = 1.
> 
> That's just stupid sophistry.

I agree, this is why I warned everyone in the comment before the
function with the adjective "misleading":

 * O(1) misleading math
 * aside, the number of cachelines touched with thousands of CPU might
 * make it measurable.

> No, you can't just say that it's limited to some large constant, and thus
> the same as O(1).

I pointed out it is O(1) just because if we use the O notation we may
as well do the math right about it.

I may not have been clear but I never meant that because it is O(1)
(NR_CPUS constant) it means it's already ok as it is now.

> 
> That's the worst kind of lie: something that's technically true if you look
> at it a certain stupid way, but isn't actually true in practice.
> 
> It's clearly O(n) in number of CPUs, and people told you it can't go into
> the scheduler. Stop arguing idiotic things. Just say you'll fix it, instead
> of looking like a tool.

About fixing it, this can be called at a regular interval like
load_balance() (which also has a higher cost than the per-cpu
schedule fast path, in having to walk over the other CPU runqueues) or
to be more integrated within CFS so it doesn't need to be called at
all.

I didn't think it was urgent to fix (also because it has a debug
benefit to keep it like this in the short term), but I definitely
intended to fix it.

I also would welcome people who know the scheduler so much better
than me to rewrite or fix it as they like.

To be crystal clear: I totally agree to fix this, in the comment
before the code I wrote:

 * it's good in the
 * short term for stressing the algorithm.

I probably wasn't clear enough, but I already implicitly meant it
shall be optimized further later.

If there's a slight disagreement, it is only on the "urgency" of fixing
it, but I will certainly change my priorities on this after reading
your comments!

Thanks for looking into this.
Andrea

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 20/39] autonuma: avoid CFS select_task_rq_fair to return -1
  2012-03-26 19:36     ` Peter Zijlstra
@ 2012-03-26 20:53       ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-26 20:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

On Mon, Mar 26, 2012 at 09:36:54PM +0200, Peter Zijlstra wrote:
> On Mon, 2012-03-26 at 19:46 +0200, Andrea Arcangeli wrote:
> > Fix to avoid -1 retval.
> 
> Please fold this and the following 5 patches into something sane. 6
> patches changing the same few lines and none of them with a useful
> changelog isn't how we do things.

I folded the next two patches, and the other two patches, into the
later CFS patch (I still kept 2 patches total for that file, as there
are two different things happening, so it should be simpler to review
them separately).

I should have folded those in the first place, but I tried to retain
exact attribution of the fixes. I agree that for now the priority
should be to keep the code as easy to review as possible, so I added
attribution in the header of a common commit, as I already did for
other commits before.

You can review the folded version in the autonuma-dev-smt branch which
I just pushed (not fast forward):

git clone --reference linux -b autonuma-dev-smt git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
check head is 3b1ff002978862264c4a24308bddc5e7da24e9ee
git format-patch 0b100d7

0020-autonuma-avoid-CFS-select_task_rq_fair-to-return-1.patch
0021-autonuma-teach-CFS-about-autonuma-affinity.patch

This should be much simpler to review, sorry for the confusion.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 39/39] autonuma: NUMA scheduler SMT awareness
  2012-03-26 18:57     ` Peter Zijlstra
@ 2012-03-27  0:00       ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-27  0:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

On Mon, Mar 26, 2012 at 08:57:03PM +0200, Peter Zijlstra wrote:
> On Mon, 2012-03-26 at 19:46 +0200, Andrea Arcangeli wrote:
> > Add SMT awareness to the NUMA scheduler so that it will not move load
> > from fully idle SMT threads, to semi idle SMT threads.
> 
> This shows a complete fail in design, you're working around the regular
> scheduler/load-balancer instead of with it and hence are duplicating all
> kinds of stuff.
> 
> I'll not have that..

I think here you're confusing implementation issues with
design.

I already mentioned the need for closer integration with CFS as point 4
of my TODO list in the first email of this thread. The current
implementation is just good enough to evaluate the AutoNUMA math and
the resulting final performance (and after cleaning it up, it'll run
even faster, if anything).

If you want to contribute to sched/numa.c by integrating it with CFS
and removing the code duplication, you're welcome. I tried for a short
while, but it wasn't even obvious which exact lines in fair.c handle
SMT (I'm aware of SD_SHARE_CPUPOWER in SD_SIBLING_INIT, but things
weren't crystal clear there; in fact it's even hard to extrapolate the
exact semantics of all the SD_ bitflags, and the comment on the right
isn't very helpful either). An explanation of the exact lines in CFS
where SMT is handled would be welcome too, if I shall do the cleanup.
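
(For what it's worth, a sketch, not from this thread, of the flag test
referred to here: SD_SHARE_CPUPOWER on a sched_domain's flags is what
marks the SMT/sibling level.)

static bool example_sd_is_smt_level(struct sched_domain *sd)
{
	/* set by SD_SIBLING_INIT on the per-sibling domain */
	return sd->flags & SD_SHARE_CPUPOWER;
}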

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 11/39] autonuma: CPU follow memory algorithm
  2012-03-26 20:39             ` Andrea Arcangeli
@ 2012-03-27  8:39               ` Peter Zijlstra
  -1 siblings, 0 replies; 125+ messages in thread
From: Peter Zijlstra @ 2012-03-27  8:39 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Hillf Danton, Paul E. McKenney, Dan Smith,
	Paul Turner, Lai Jiangshan, Rik van Riel, Ingo Molnar,
	Andrew Morton, Lee Schermerhorn, linux-mm, Suresh Siddha,
	Mike Galbraith, Bharata B Rao, Thomas Gleixner, Johannes Weiner,
	linux-kernel

On Mon, 2012-03-26 at 22:39 +0200, Andrea Arcangeli wrote:

> > No, you can't just say that it's limited to some large constant, and thus
> > the same as O(1).
> 
> I pointed out it is O(1) just because if we use the O notation we may
> as well do the math right about it.

I think you have a fundamental misunderstanding of the concepts here.
You do not get to fill in n for whatever specific instance of the
problem you have.

The traveling salesman problem can be solved in O(n!); simply because
you know your route will not be larger than 10 houses doesn't mean you
can say your algorithm will be O(10!) and thus O(1).

That's simply not how it works.

You can talk pretty much anything down to O(1) that way. Take an
algorithm that is O(n) in the number of tasks, since you know you have a
pid-space constraint of 30bits you can never have more than 2^30 (aka
1Gi) tasks, hence your algorithm is O(2^30) aka O(1).

> I also would welcome people who knows the scheduler so much better
> than me to rewrite or fix it as they like it.

Again, you seem unclear on how things work, you want this nonsense, you
get to write it.

I am most certainly not going to fix your mess as I completely disagree
with the approach taken.

> I probably wasn't clear enough, but I already implicitly meant it
> shall be optimized further later.

You're in fact very unclear. You post patches without the RFC tag,
meaning you think they're ready to be considered. You write huge
misleading comments instead of /* XXX crap, needs fixing */.

Also, I find your language to be overly obtuse and hard to parse, but
that could be my fault, we're both non-native speakers.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 11/39] autonuma: CPU follow memory algorithm
  2012-03-27  8:39               ` Peter Zijlstra
@ 2012-03-27 14:37                 ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-27 14:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Hillf Danton, Paul E. McKenney, Dan Smith,
	Paul Turner, Lai Jiangshan, Rik van Riel, Ingo Molnar,
	Andrew Morton, Lee Schermerhorn, linux-mm, Suresh Siddha,
	Mike Galbraith, Bharata B Rao, Thomas Gleixner, Johannes Weiner,
	linux-kernel

On Tue, Mar 27, 2012 at 10:39:55AM +0200, Peter Zijlstra wrote:
> You can talk pretty much anything down to O(1) that way. Take an
> algorithm that is O(n) in the number of tasks, since you know you have a
> pid-space constraint of 30bits you can never have more than 2^30 (aka
> 1Gi) tasks, hence your algorithm is O(2^30) aka O(1).

Still this O notation thingy... This is not about the max value but
about whether the number is _variable_ or _fixed_.

If you have a variable number of entries (and a variable amount of
memory) in a list it's O(N), where N is the number of entries (even if
we know the max RAM is maybe 4TB?). If you have a _fixed_ number of
them it's O(1), even if the fixed number is very large.

It basically shows it won't degrade depending on load, and the
per-schedule cost remains exactly fixed at all times (non-linear
cacheline and out-of-order CPU execution/HT effects aside).

If it were O(N), the time this takes to run for each schedule would
have to vary at runtime depending on some variable factor N, and
that's not the case here.

You can argue about CPU hotplug though.

But this is just math nitpicking, because I already pointed out that I
agree the cacheline hits on a 1024-way would be measurable and need
fixing.

I'm not sure how useful it is to keep arguing about the O notation when
we agree on what shall be optimized in practice.
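
To make the distinction concrete, here is a standalone toy example
(not kernel code): the first loop has a compile-time fixed bound, so
its per-call cost never changes with the load; the second walks a
runtime-variable list, so its cost grows with N.

#include <stdio.h>

#define NR_NODES 4				/* fixed at build time */

struct item { struct item *next; };

static int busiest_node(const long load[NR_NODES])
{
	int best = 0;

	for (int n = 1; n < NR_NODES; n++)	/* always NR_NODES - 1 iterations */
		if (load[n] > load[best])
			best = n;
	return best;
}

static unsigned long list_length(const struct item *head)
{
	unsigned long n = 0;

	for (; head; head = head->next)		/* iterations depend on the list */
		n++;
	return n;
}

int main(void)
{
	long load[NR_NODES] = { 3, 9, 1, 4 };
	struct item a = { 0 }, b = { &a }, c = { &b };

	printf("busiest node: %d\n", busiest_node(load));	/* 1 */
	printf("list length: %lu\n", list_length(&c));		/* 3 */
	return 0;
}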

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 07/39] autonuma: introduce kthread_bind_node()
  2012-03-26 18:32     ` Peter Zijlstra
@ 2012-03-27 15:22       ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-27 15:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

On Mon, Mar 26, 2012 at 08:32:35PM +0200, Peter Zijlstra wrote:
> On Mon, 2012-03-26 at 19:45 +0200, Andrea Arcangeli wrote:
> > +void kthread_bind_node(struct task_struct *p, int nid)
> > +{
> > +       /* Must have done schedule() in kthread() before we set_task_cpu */
> > +       if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
> > +               WARN_ON(1);
> > +               return;
> > +       }
> > +
> > +       /* It's safe because the task is inactive. */
> > +       do_set_cpus_allowed(p, cpumask_of_node(nid));
> > +       p->flags |= PF_THREAD_BOUND;
> > +}
> > +EXPORT_SYMBOL(kthread_bind_node);
> 
> That's a wrong use of PF_THREAD_BOUND, we should only use that for
> cpumask_weight(tsk_cpus_allowed()) == 1.

I don't see what's wrong with more than 1 CPU in the hard bind cpumask.

The only two places that care about PF_THREAD_BOUND are quoted.

This is just to prevent the "root" user from shooting itself in the
foot and crashing the kernel by changing the CPU bindings of the kernel
thread (potentially breaking assumptions the kernel thread makes about
numa_node_id()).

knuma_migratedN, for example, will BUG_ON if the binding is changed
under it, before anything bad can happen.

Maybe this wasn't the intended initial semantic of PF_THREAD_BOUND,
but this extends it without apparent downsides and adds a bit more
robustness.
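
As a usage sketch only (not taken verbatim from the patch), such a
per-node daemon can be started with the kthread_bind_node() quoted
above, following the usual kthread_create()/kthread_bind()/
wake_up_process() pattern; knuma_migrated() and the name format below
just stand in for whatever the patch actually uses:

static void start_knuma_migrated(int nid)
{
	struct task_struct *p;

	p = kthread_create(knuma_migrated, NULL, "knuma_migrated%d", nid);
	if (IS_ERR(p))
		return;

	/* allow every CPU of node nid and set PF_THREAD_BOUND */
	kthread_bind_node(p, nid);
	wake_up_process(p);
}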

int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
{
	unsigned long flags;
	struct rq *rq;
	unsigned int dest_cpu;
	int ret = 0;

	rq = task_rq_lock(p, &flags);

	if (cpumask_equal(&p->cpus_allowed, new_mask))
		goto out;

	if (!cpumask_intersects(new_mask, cpu_active_mask)) {
		ret = -EINVAL;
		goto out;
	}

	if (unlikely((p->flags & PF_THREAD_BOUND) && p != current)) {
		ret = -EINVAL;
		goto out;
	}
[..]

static int cpuset_can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
{
	struct cpuset *cs = cgroup_cs(cgrp);
	struct task_struct *task;
	int ret;

	if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
		return -ENOSPC;

	cgroup_taskset_for_each(task, cgrp, tset) {
		/*
		 * Kthreads bound to specific cpus cannot be moved to a new
		 * cpuset; we cannot change their cpu affinity and
		 * isolating such threads by their set of allowed nodes is
		 * unnecessary.  Thus, cpusets are not applicable for such
		 * threads.  This prevents checking for success of
		 * set_cpus_allowed_ptr() on all attached tasks before
		 * cpus_allowed may be changed.
		 */
		if (task->flags & PF_THREAD_BOUND)
			return -EINVAL;
[..]

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 07/39] autonuma: introduce kthread_bind_node()
  2012-03-27 15:22       ` Andrea Arcangeli
@ 2012-03-27 15:45         ` Peter Zijlstra
  -1 siblings, 0 replies; 125+ messages in thread
From: Peter Zijlstra @ 2012-03-27 15:45 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

On Tue, 2012-03-27 at 17:22 +0200, Andrea Arcangeli wrote:
> I don't see what's wrong with more than 1 CPU in the hard bind
> cpumask.

Because it's currently broken, but we're trying to restore its pure
semantic so that we can use it in more places again, like
debug_smp_processor_id(). Testing a single process flag is _much_
cheaper than testing ->cpus_allowed.
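
To illustrate the cost difference, a kernel-context sketch (not a
patch, just the two checks side by side):

/* one load plus one mask test */
static inline bool bound_check_cheap(struct task_struct *p)
{
	return p->flags & PF_THREAD_BOUND;
}

/* walks a potentially large cpumask bitmap on every call */
static inline bool bound_check_expensive(struct task_struct *p)
{
	return cpumask_weight(tsk_cpus_allowed(p)) == 1;
}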

Adding more broken isn't an option.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 07/39] autonuma: introduce kthread_bind_node()
  2012-03-27 15:45         ` Peter Zijlstra
@ 2012-03-27 16:04           ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-27 16:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

On Tue, Mar 27, 2012 at 05:45:35PM +0200, Peter Zijlstra wrote:
> On Tue, 2012-03-27 at 17:22 +0200, Andrea Arcangeli wrote:
> > I don't see what's wrong with more than 1 CPU in the hard bind
> > cpumask.
> 
> Because its currently broken, but we're trying to restore its pure
> semantic so that we can use it in more places again, like
> debug_smp_processor_id(). Testing a single process flag is _much_
> cheaper than testing ->cpus_allowed.
> 
> Adding more broken isn't an option.

I would suggest you use a new bitflag for that _future_ optimization
that you plan to do, without altering the way the current bitflag
works.

I doubt knuma_migrated will ever be the only kernel thread that wants
to run with a NUMA NODE-wide CPU binding (instead of single-CPU
binding).

Being able to keep using this bitflag for NUMA-wide bindings in the
future as well (after you do the optimization you planned) is going to
reduce the chances of the root user shooting himself in the foot, for
both the kernel thread node-BIND and the single-cpu-BIND.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 11/39] autonuma: CPU follow memory algorithm
  2012-03-27  8:39               ` Peter Zijlstra
@ 2012-03-27 16:15                 ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-27 16:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Hillf Danton, Paul E. McKenney, Dan Smith,
	Paul Turner, Lai Jiangshan, Rik van Riel, Ingo Molnar,
	Andrew Morton, Lee Schermerhorn, linux-mm, Suresh Siddha,
	Mike Galbraith, Bharata B Rao, Thomas Gleixner, Johannes Weiner,
	linux-kernel

On Tue, Mar 27, 2012 at 10:39:55AM +0200, Peter Zijlstra wrote:
> I am most certainly not going to fix your mess as I completely disagree
> with the approach taken.

This is _purely_ a performance optimization, so if my design is so bad,
and you're also requiring all apps that span more than one NUMA node to
be modified to use your new syscalls, you won't have any problem
winning against AutoNUMA in the benchmarks.

At the moment I can't believe your design has a chance to compete.

But please prove me wrong with the numbers, and I won't be stubborn:
I'll rm -r autonuma and (if you let me) I'll be happy to contribute to
your code.

> You're in fact very unclear. You post patches without the RFC tag,

Subject: [PATCH 00/39] [RFC] AutoNUMA alpha10

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 07/39] autonuma: introduce kthread_bind_node()
  2012-03-27 16:04           ` Andrea Arcangeli
@ 2012-03-27 16:19             ` Peter Zijlstra
  -1 siblings, 0 replies; 125+ messages in thread
From: Peter Zijlstra @ 2012-03-27 16:19 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

On Tue, 2012-03-27 at 18:04 +0200, Andrea Arcangeli wrote:
> On Tue, Mar 27, 2012 at 05:45:35PM +0200, Peter Zijlstra wrote:
> > On Tue, 2012-03-27 at 17:22 +0200, Andrea Arcangeli wrote:
> > > I don't see what's wrong with more than 1 CPU in the hard bind
> > > cpumask.
> > 
> > Because its currently broken, but we're trying to restore its pure
> > semantic so that we can use it in more places again, like
> > debug_smp_processor_id(). Testing a single process flag is _much_
> > cheaper than testing ->cpus_allowed.
> > 
> > Adding more broken isn't an option.
> 
> I would suggest you to use a new bitflag for that _future_
> optimization that you plan to do without altering the way the current
> bitflag works.
> 
> I doubt knuma_migrated will ever be the only kernel thread that wants
> to run with a NUMA NODE-wide CPU binding (instead of single-CPU
> binding).
> 
> Being able to keep using this bitflag for NUMA-wide bindings too in
> the future as well (after you do the optimization you planned), is
> going to reduce the chances of the root user shooting himself in the
> foot for both the kernel thread node-BIND and the single-cpu-BIND.

But then the current flag is a misnomer. Also, there's no correctness
issue with the per-node threads; it's perfectly fine if they run
someplace else, so I don't think we should restrict userspace from
forcing them away from their preferred node.

So even if you were to introduce a new flag, I'd still object.

The only reason to ever refuse userspace moving a task around is if it
will break stuff. The worst that can happen with a node-affine thread
is that it'll incur remote memory penalties, and that's not fatal.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 11/39] autonuma: CPU follow memory algorithm
  2012-03-27  8:39               ` Peter Zijlstra
@ 2012-03-27 17:09                 ` Ingo Molnar
  -1 siblings, 0 replies; 125+ messages in thread
From: Ingo Molnar @ 2012-03-27 17:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Linus Torvalds, Hillf Danton, Paul E. McKenney,
	Dan Smith, Paul Turner, Lai Jiangshan, Rik van Riel, Ingo Molnar,
	Andrew Morton, Lee Schermerhorn, linux-mm, Suresh Siddha,
	Mike Galbraith, Bharata B Rao, Thomas Gleixner, Johannes Weiner,
	linux-kernel


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> You can talk pretty much anything down to O(1) that way. Take 
> an algorithm that is O(n) in the number of tasks, since you 
> know you have a pid-space constraint of 30bits you can never 
> have more than 2^30 (aka 1Gi) tasks, hence your algorithm is 
> O(2^30) aka O(1).

We can go even further than that: IIRC all physical states of
this universe fit into a roughly 2^1000 finite state-space, so
every computing problem in this universe is O(2^1000), i.e.
every computing problem we can ever work on is O(1).

Really, I think Andrea is missing the big picture here.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 11/39] autonuma: CPU follow memory algorithm
  2012-03-27 16:15                 ` Andrea Arcangeli
@ 2012-03-28 11:26                   ` Peter Zijlstra
  -1 siblings, 0 replies; 125+ messages in thread
From: Peter Zijlstra @ 2012-03-28 11:26 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Hillf Danton, Paul E. McKenney, Dan Smith,
	Paul Turner, Lai Jiangshan, Rik van Riel, Ingo Molnar,
	Andrew Morton, Lee Schermerhorn, linux-mm, Suresh Siddha,
	Mike Galbraith, Bharata B Rao, Thomas Gleixner, Johannes Weiner,
	linux-kernel

On Tue, 2012-03-27 at 18:15 +0200, Andrea Arcangeli wrote:
> This is _purely_ a performance optimization so if my design is so bad,
> and you're also requiring all apps that spans over more than one NUMA
> node to be modified to use your new syscalls, you won't have problems
> to win against AutoNUMA in the benchmarks. 

Right, so can we agree that the only case where they diverge is single
processes that have multiple threads and are bigger than a single node (either
in memory, cputime or both)?

I've asked you several times why you care about that one case so much, but
without answer.

I'll grant you that unmodified such processes might do better with your
stuff, however:

 - your stuff assumes there is a fair amount of locality to exploit.

   I'm not seeing how this is true in general, since data partitioning is hard
   and for those problems where it's possible people tend to already do so,
   yielding natural points to add the syscalls.

 - your stuff doesn't actually nest, since a guest kernel has no clue as to
   what constitutes a node (or if there even is such a thing) it will randomly
   move tasks around on the vcpus, with complete disrespect for whatever host
   vcpu<->page mappings you set up.

   guest kernels actively scramble whatever relations you're building by
   scanning, destroying whatever (temporal) locality you think you might
   have found.

 - also, by not exposing NUMA to the guest kernel, the guest kernel/userspace
   has no clue it needs to behave as if there's multiple nodes etc..

Furthermore, most applications that are really big tend to have already thought
about parallelism and have employed things like data-parallelism if at all
possible. If this is not possible (many problems fall in this category) there
really isn't much you can do.

Related to this is that all applications that currently use mbind() and
sched_setaffinity() are trivial to convert.
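
For concreteness, the kind of explicit placement such an application
already does (or can trivially add) looks roughly like the sketch
below. It is plain userspace code, not from any patch, with made-up CPU
numbers and sizes, and it uses the libnuma mbind() wrapper (link with
-lnuma):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/mman.h>
#include <numaif.h>		/* mbind(), MPOL_BIND */

int main(void)
{
	size_t len = 16 << 20;
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	unsigned long nodemask = 1UL << 0;	/* node 0 only */
	cpu_set_t cpus;

	if (buf == MAP_FAILED)
		return 1;

	/* keep this thread on CPUs 0-3, assumed here to be node 0 */
	CPU_ZERO(&cpus);
	for (int cpu = 0; cpu < 4; cpu++)
		CPU_SET(cpu, &cpus);
	if (sched_setaffinity(0, sizeof(cpus), &cpus))
		perror("sched_setaffinity");

	/* allocate the buffer's memory from node 0 only */
	if (mbind(buf, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0))
		perror("mbind");

	/* ... the worker touches buf from here on ... */
	munmap(buf, len);
	return 0;
}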

Also, really big threaded programs have a natural enemy: the shared state
that makes them a process, most dominantly the shared address space
(mmap_sem etc.).

There's also the reason Avi mentioned: core counts tend to go up, which
means nodes are getting bigger and bigger.

But most importantly, your solution is big, complex and costly specifically to
handle this case which, as per the above reasons, I think is not very
interesting.

So why not do the simple thing first before going overboard for a case that
might be irrelevant?


^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 39/39] autonuma: NUMA scheduler SMT awareness
  2012-03-27  0:00       ` Andrea Arcangeli
@ 2012-03-28 13:51         ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-28 13:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner

On Tue, Mar 27, 2012 at 02:00:12AM +0200, Andrea Arcangeli wrote:
> On Mon, Mar 26, 2012 at 08:57:03PM +0200, Peter Zijlstra wrote:
> > On Mon, 2012-03-26 at 19:46 +0200, Andrea Arcangeli wrote:
> > > Add SMT awareness to the NUMA scheduler so that it will not move load
> > > from fully idle SMT threads, to semi idle SMT threads.
> > 
> > This shows a complete fail in design, you're working around the regular
> > scheduler/load-balancer instead of with it and hence are duplicating all
> > kinds of stuff.
> > 
> > I'll not have that..
> 
> I think here you're misunderstanding implementation issues with
> design.
> 
> I already mentioned the need of closer integration in CFS as point 4
> of my TODO list in the first email of this thread. The current

I pushed an autonuma-alpha11 branch where I dropped the SMT logic (the
one you NAKed) entirely from the AutoNUMA scheduler. Not just that, I
dropped the idle balancing as well.

It seems slower to react, but its active idle balancing is smarter and
on average it's maxing out the memory channel bandwidth better now.

I hope I eliminated the code duplication. What remains in AutoNUMA is
the NUMA active load balancing, which CFS has zero clue about.

I did a full regression test and it passed, and now multi-instance
stream shall also run much faster with nr_process > 1 and nr_process <
nr_cpus/2.

About the need of closer integration with CFS, note also that your
kernel/sched/numa.c code was doing things like:

+       // XXX should be sched_domain aware
+       for_each_online_node(node) {

So I hope you will understand why I had to take a few shortcuts, but
over time I'm fully committed to integrating numa.c better wherever
possible, and especially to removing the call at every schedule so it
will scale fine to thousands of CPUs. It's just not trivial to do.

About your code, I have a hard time believing that driving the
scheduler from a static home-node placement decided at exec/fork, like
your code does, could have a chance to compete with the AutoNUMA math
for workloads with very variable load and several threads and
processes going idle and loading the CPUs again. Real life
unfortunately isn't as trivial as a multi-instance stream. I believe
you can handle multi-instance streams OK though.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 11/39] autonuma: CPU follow memory algorithm
  2012-03-28 11:26                   ` Peter Zijlstra
@ 2012-03-28 18:39                     ` Andrea Arcangeli
  -1 siblings, 0 replies; 125+ messages in thread
From: Andrea Arcangeli @ 2012-03-28 18:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Hillf Danton, Paul E. McKenney, Dan Smith,
	Paul Turner, Lai Jiangshan, Rik van Riel, Ingo Molnar,
	Andrew Morton, Lee Schermerhorn, linux-mm, Suresh Siddha,
	Mike Galbraith, Bharata B Rao, Thomas Gleixner, Johannes Weiner,
	linux-kernel

Hi,

On Wed, Mar 28, 2012 at 01:26:08PM +0200, Peter Zijlstra wrote:
> Right, so can we agree that the only case where they diverge is single
> processes that have multiple threads and are bigger than a single node (either
> in memory, cputime or both)?

I think it vastly diverges for processes that are smaller than one
node too: 1) your numa/sched goes blind with an almost arbitrary home
node, and 2) your migrate-on-fault will be unable to provide an
efficient and steady async background migration.

> I've asked you several times why you care about that one case so much, but
> without answer.

If this case wasn't important to you, you wouldn't need to introduce
your syscalls.

> I'll grant you that unmodified such processes might do better with your
> stuff, however:
> 
>  - your stuff assumes there is a fair amount of locality to exploit.
> 
>    I'm not seeing how this is true in general, since data partitioning is hard
>    and for those problems where its possible people tend to already do so,
>    yielding natural points to add the syscalls.

Later, I plan to detect this and lay out interleaved pages
automatically, so you don't even need to set MPOL_INTERLEAVE manually.

>  - your stuff doesn't actually nest, since a guest kernel has no clue as to
>    what constitutes a node (or if there even is such a thing) it will randomly
>    move tasks around on the vcpus, with complete disrespect for whatever host
>    vcpu<->page mappings you set up.
> 
>    guest kernels actively scramble whatever relations you're building by
>    scanning, destroying whatever (temporal) locality you think you might
>    have found.

This shall work fine, running AutoNUMA in both guest and host. qemu
just needs to create a vtopology for the guest that matches the
hardware topology. Hard binds in the guest will also work great (they
create node locality too).

A paravirt layer could also hint the host about vcpu switches so it can
shift the host NUMA stats across, but I haven't thought too much about
this possible paravirt numa-sched optimization; it's not mandatory,
just an idea.

> Related to this is that all applications that currently use mbind() and
> sched_setaffinity() are trivial to convert.

Too bad firefox isn't using mbind yet. My primary targets are the 99%
of apps out there running on a 24-way 2-node system or equivalent, and
KVM.

I agree converting qemu to the syscalls would be trivial though.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 02/39] xen: document Xen is using an unused bit for the pagetables
  2012-03-26 17:45   ` Andrea Arcangeli
@ 2012-03-30 21:40     ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 125+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-03-30 21:40 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner

On Mon, Mar 26, 2012 at 07:45:49PM +0200, Andrea Arcangeli wrote:
> Xen has taken over the last reserved bit available for the pagetables
> which is set through ioremap, this documents it and makes the code
> more readable.

About a year ago we redid the P2M code to ditch the major use case for this.
But there were two leftover cases that I hadn't found a good solution
for that would allow us to completely eliminate the use of this bit:

1). When setting a PTE of a PFN which overlaps an E820 hole or any of the
    non-E820-RAM entries, we look up the PFN in the P2M, find out that
    it is a 1:1 mapping, and return pte.pte | pfn << PAGE_SIZE.

    But we also stick _PAGE_IOMAP on it so that when the call to
    xen_pte_val is done we don't end up doing the lookup in the P2M tree
    once more and just set the pte as is.

    So this is the dance between xen_pte_val and xen_make_pte (a rough
    sketch of it follows below).

2). When userspace tries to mmap guest memory for save/migrate or to
    set something up in the guest, it would use xen_remap_domain_mfn_range
    to set up PTEs with the guest's PFN (gpfn). _PAGE_IOMAP is used
    again to tell xen_pte_val to not bother looking it up in the P2M
    tree and to use it as is.
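
To make the dance in 1) concrete, here is a rough standalone sketch.
All names are made up and this is not the real xen_make_pte() /
xen_pte_val() code, just the idea of a software PTE bit that means
"return this value as is, skip the P2M lookup":

#include <stdio.h>

#define FAKE_PAGE_IOMAP (1UL << 10)	/* stands in for _PAGE_IOMAP */

static unsigned long fake_p2m_lookup(unsigned long val)
{
	return val ^ 0xf000;		/* pretend translation */
}

static unsigned long fake_make_pte(unsigned long pfn, int identity)
{
	unsigned long pte = pfn << 12;

	if (identity)
		pte |= FAKE_PAGE_IOMAP;	/* no P2M lookup on the way back */
	return pte;
}

static unsigned long fake_pte_val(unsigned long pte)
{
	if (pte & FAKE_PAGE_IOMAP)
		return pte;		/* use as is */
	return fake_p2m_lookup(pte);	/* normal path: translate */
}

int main(void)
{
	printf("identity: %lx\n", fake_pte_val(fake_make_pte(0x1234, 1)));
	printf("normal:   %lx\n", fake_pte_val(fake_make_pte(0x1234, 0)));
	return 0;
}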

So.. any thoughts on how to eliminate the usage of this?
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  arch/x86/include/asm/pgtable_types.h |   11 +++++++++--
>  1 files changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index 013286a..b74cac9 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -17,7 +17,7 @@
>  #define _PAGE_BIT_PAT		7	/* on 4KB pages */
>  #define _PAGE_BIT_GLOBAL	8	/* Global TLB entry PPro+ */
>  #define _PAGE_BIT_UNUSED1	9	/* available for programmer */
> -#define _PAGE_BIT_IOMAP		10	/* flag used to indicate IO mapping */
> +#define _PAGE_BIT_UNUSED2	10
>  #define _PAGE_BIT_HIDDEN	11	/* hidden by kmemcheck */
>  #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
>  #define _PAGE_BIT_SPECIAL	_PAGE_BIT_UNUSED1
> @@ -41,7 +41,7 @@
>  #define _PAGE_PSE	(_AT(pteval_t, 1) << _PAGE_BIT_PSE)
>  #define _PAGE_GLOBAL	(_AT(pteval_t, 1) << _PAGE_BIT_GLOBAL)
>  #define _PAGE_UNUSED1	(_AT(pteval_t, 1) << _PAGE_BIT_UNUSED1)
> -#define _PAGE_IOMAP	(_AT(pteval_t, 1) << _PAGE_BIT_IOMAP)
> +#define _PAGE_UNUSED2	(_AT(pteval_t, 1) << _PAGE_BIT_UNUSED2)
>  #define _PAGE_PAT	(_AT(pteval_t, 1) << _PAGE_BIT_PAT)
>  #define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
>  #define _PAGE_SPECIAL	(_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
> @@ -49,6 +49,13 @@
>  #define _PAGE_SPLITTING	(_AT(pteval_t, 1) << _PAGE_BIT_SPLITTING)
>  #define __HAVE_ARCH_PTE_SPECIAL
>  
> +/* flag used to indicate IO mapping */
> +#ifdef CONFIG_XEN
> +#define _PAGE_IOMAP	(_AT(pteval_t, 1) << _PAGE_BIT_UNUSED2)
> +#else
> +#define _PAGE_IOMAP	(_AT(pteval_t, 0))
> +#endif
> +
>  #ifdef CONFIG_KMEMCHECK
>  #define _PAGE_HIDDEN	(_AT(pteval_t, 1) << _PAGE_BIT_HIDDEN)
>  #else
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [PATCH 00/39] [RFC] AutoNUMA alpha10
  2012-03-26 17:45 ` Andrea Arcangeli
@ 2012-04-03 20:35   ` Srivatsa Vaddagiri
  -1 siblings, 0 replies; 125+ messages in thread
From: Srivatsa Vaddagiri @ 2012-04-03 20:35 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner

* Andrea Arcangeli <aarcange@redhat.com> [2012-03-26 19:45:47]:

> This is the result of the first round of cleanups of the AutoNUMA patch.

I happened to test numasched and autonuma against a Java benchmark and
here are some results (higher scores are better).

Base            : 1 (std. dev : 91%)
Numa sched      : 2.17 (std. dev : 15%)
Autonuma        : 2.56 (std. dev : 10.7%)

Numa sched scores ~2.2x the base case, and Autonuma is a further ~18%
better than numasched. Note the high standard deviation in the base case.
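
To spell out the arithmetic behind those relative numbers, a trivial
sketch (the normalized scores are copied from the table above; nothing
beyond the ratios is implied):

#include <stdio.h>

int main(void)
{
	/* Normalized scores from the table above (base = 1). */
	double base = 1.0, numasched = 2.17, autonuma = 2.56;

	printf("numasched vs base:     %.2fx\n", numasched / base);
	printf("autonuma vs numasched: +%.0f%%\n",
	       (autonuma / numasched - 1.0) * 100.0);
	return 0;
}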

Also, given the differences in base kernel versions between the two
setups, this is admittedly not an apples-to-apples comparison. Getting
both patch sets onto a common code base would make that kind of
comparison possible.

Details:

Base = tip (ee415e2) + numasched patches posted on 3/16.
       qemu-kvm 0.12.1

Numa sched = tip (ee415e2) + numasched patches posted on 3/16.
             Modified version of qemu-kvm 1.0.50 that creates memsched groups

Autonuma = Autonuma alpha10 (SHA1 4596315). qemu-kvm 0.12.1

Machine with two quad-core (w/ HT) Intel Nehalem CPUs and two NUMA nodes,
each with 8GB of memory.

3 VMs are created:
	VM1 and VM2, each with 4 vcpus, 3GB of memory and 1024 cpu.shares.
	Each runs memory hogs consuming a total of 2.5GB of memory (the
	2.5GB is first written to and then continuously read in a loop).

	VM3 with 8 vcpus, 4GB of memory and 2048 cpu.shares. It runs the
	SPECjbb2000 benchmark with 8 warehouses (consuming a 2GB heap).

The benchmark was repeated 5 times. Each run consisted of launching VM1
first, waiting for it to initialize (wrt memory footprint), then
launching VM2 and waiting for it to initialize before launching VM3 and
the benchmark inside it. At the end of each run, all VMs were destroyed
and the process repeated.

- vatsa


^ permalink raw reply	[flat|nested] 125+ messages in thread

end of thread, other threads:[~2012-04-03 20:35 UTC | newest]

Thread overview: 125+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-03-26 17:45 [PATCH 00/39] [RFC] AutoNUMA alpha10 Andrea Arcangeli
2012-03-26 17:45 ` Andrea Arcangeli
2012-03-26 17:45 ` [PATCH 01/39] autonuma: make set_pmd_at always available Andrea Arcangeli
2012-03-26 17:45   ` Andrea Arcangeli
2012-03-26 17:45 ` [PATCH 02/39] xen: document Xen is using an unused bit for the pagetables Andrea Arcangeli
2012-03-26 17:45   ` Andrea Arcangeli
2012-03-30 21:40   ` Konrad Rzeszutek Wilk
2012-03-30 21:40     ` Konrad Rzeszutek Wilk
2012-03-26 17:45 ` [PATCH 03/39] autonuma: define _PAGE_NUMA_PTE and _PAGE_NUMA_PMD Andrea Arcangeli
2012-03-26 17:45   ` Andrea Arcangeli
2012-03-26 17:45 ` [PATCH 04/39] autonuma: x86 pte_numa() and pmd_numa() Andrea Arcangeli
2012-03-26 17:45   ` Andrea Arcangeli
2012-03-26 17:45 ` [PATCH 05/39] autonuma: generic " Andrea Arcangeli
2012-03-26 17:45   ` Andrea Arcangeli
2012-03-26 17:45 ` [PATCH 06/39] autonuma: teach gup_fast about pte_numa Andrea Arcangeli
2012-03-26 17:45   ` Andrea Arcangeli
2012-03-26 17:45 ` [PATCH 07/39] autonuma: introduce kthread_bind_node() Andrea Arcangeli
2012-03-26 17:45   ` Andrea Arcangeli
2012-03-26 18:32   ` Peter Zijlstra
2012-03-26 18:32     ` Peter Zijlstra
2012-03-27 15:22     ` Andrea Arcangeli
2012-03-27 15:22       ` Andrea Arcangeli
2012-03-27 15:45       ` Peter Zijlstra
2012-03-27 15:45         ` Peter Zijlstra
2012-03-27 16:04         ` Andrea Arcangeli
2012-03-27 16:04           ` Andrea Arcangeli
2012-03-27 16:19           ` Peter Zijlstra
2012-03-27 16:19             ` Peter Zijlstra
2012-03-26 17:45 ` [PATCH 08/39] autonuma: mm_autonuma and sched_autonuma data structures Andrea Arcangeli
2012-03-26 17:45   ` Andrea Arcangeli
2012-03-26 17:45 ` [PATCH 09/39] autonuma: define the autonuma flags Andrea Arcangeli
2012-03-26 17:45   ` Andrea Arcangeli
2012-03-26 17:45 ` [PATCH 10/39] autonuma: core autonuma.h header Andrea Arcangeli
2012-03-26 17:45   ` Andrea Arcangeli
2012-03-26 17:45 ` [PATCH 11/39] autonuma: CPU follow memory algorithm Andrea Arcangeli
2012-03-26 17:45   ` Andrea Arcangeli
2012-03-26 18:25   ` Peter Zijlstra
2012-03-26 18:25     ` Peter Zijlstra
2012-03-26 19:28     ` Rik van Riel
2012-03-26 19:28       ` Rik van Riel
2012-03-26 19:44       ` Andrea Arcangeli
2012-03-26 19:44         ` Andrea Arcangeli
2012-03-26 19:58         ` Linus Torvalds
2012-03-26 20:39           ` Andrea Arcangeli
2012-03-26 20:39             ` Andrea Arcangeli
2012-03-27  8:39             ` Peter Zijlstra
2012-03-27  8:39               ` Peter Zijlstra
2012-03-27 14:37               ` Andrea Arcangeli
2012-03-27 14:37                 ` Andrea Arcangeli
2012-03-27 16:15               ` Andrea Arcangeli
2012-03-27 16:15                 ` Andrea Arcangeli
2012-03-28 11:26                 ` Peter Zijlstra
2012-03-28 11:26                   ` Peter Zijlstra
2012-03-28 18:39                   ` Andrea Arcangeli
2012-03-28 18:39                     ` Andrea Arcangeli
2012-03-27 17:09               ` Ingo Molnar
2012-03-27 17:09                 ` Ingo Molnar
2012-03-26 17:45 ` [PATCH 12/39] autonuma: add page structure fields Andrea Arcangeli
2012-03-26 17:45   ` Andrea Arcangeli
2012-03-26 17:46 ` [PATCH 13/39] autonuma: knuma_migrated per NUMA node queues Andrea Arcangeli
2012-03-26 17:46   ` Andrea Arcangeli
2012-03-26 17:46 ` [PATCH 14/39] autonuma: init knuma_migrated queues Andrea Arcangeli
2012-03-26 17:46   ` Andrea Arcangeli
2012-03-26 17:46 ` [PATCH 15/39] autonuma: autonuma_enter/exit Andrea Arcangeli
2012-03-26 17:46   ` Andrea Arcangeli
2012-03-26 17:46 ` [PATCH 16/39] autonuma: call autonuma_setup_new_exec() Andrea Arcangeli
2012-03-26 17:46   ` Andrea Arcangeli
2012-03-26 17:46 ` [PATCH 17/39] autonuma: alloc/free/init sched_autonuma Andrea Arcangeli
2012-03-26 17:46   ` Andrea Arcangeli
2012-03-26 17:46 ` [PATCH 18/39] autonuma: alloc/free/init mm_autonuma Andrea Arcangeli
2012-03-26 17:46   ` Andrea Arcangeli
2012-03-26 17:46 ` [PATCH 19/39] mm: add unlikely to the mm allocation failure check Andrea Arcangeli
2012-03-26 17:46   ` Andrea Arcangeli
2012-03-26 17:46 ` [PATCH 20/39] autonuma: avoid CFS select_task_rq_fair to return -1 Andrea Arcangeli
2012-03-26 17:46   ` Andrea Arcangeli
2012-03-26 19:36   ` Peter Zijlstra
2012-03-26 19:36     ` Peter Zijlstra
2012-03-26 20:53     ` Andrea Arcangeli
2012-03-26 20:53       ` Andrea Arcangeli
2012-03-26 17:46 ` [PATCH 21/39] autonuma: fix selecting task runqueue Andrea Arcangeli
2012-03-26 17:46   ` Andrea Arcangeli
2012-03-26 17:46 ` [PATCH 22/39] autonuma: select_task_rq_fair cleanup new_cpu < 0 fix Andrea Arcangeli
2012-03-26 17:46   ` Andrea Arcangeli
2012-03-26 17:46 ` [PATCH 23/39] autonuma: teach CFS about autonuma affinity Andrea Arcangeli
2012-03-26 17:46   ` Andrea Arcangeli
2012-03-26 17:46 ` [PATCH 24/39] autonuma: fix finding idlest cpu Andrea Arcangeli
2012-03-26 17:46   ` Andrea Arcangeli
2012-03-26 17:46 ` [PATCH 25/39] autonuma: fix selecting idle sibling Andrea Arcangeli
2012-03-26 17:46   ` Andrea Arcangeli
2012-03-26 17:46 ` [PATCH 26/39] autonuma: select_idle_sibling cleanup target assignment Andrea Arcangeli
2012-03-26 17:46   ` Andrea Arcangeli
2012-03-26 17:46 ` [PATCH 27/39] autonuma: core Andrea Arcangeli
2012-03-26 17:46   ` Andrea Arcangeli
2012-03-26 17:46 ` [PATCH 28/39] autonuma: follow_page check for pte_numa/pmd_numa Andrea Arcangeli
2012-03-26 17:46   ` Andrea Arcangeli
2012-03-26 17:46 ` [PATCH 29/39] autonuma: default mempolicy follow AutoNUMA Andrea Arcangeli
2012-03-26 17:46   ` Andrea Arcangeli
2012-03-26 17:46 ` [PATCH 30/39] autonuma: call autonuma_split_huge_page() Andrea Arcangeli
2012-03-26 17:46   ` Andrea Arcangeli
2012-03-26 17:46 ` [PATCH 31/39] autonuma: make khugepaged pte_numa aware Andrea Arcangeli
2012-03-26 17:46   ` Andrea Arcangeli
2012-03-26 17:46 ` [PATCH 32/39] autonuma: retain page last_nid information in khugepaged Andrea Arcangeli
2012-03-26 17:46   ` Andrea Arcangeli
2012-03-26 17:46 ` [PATCH 33/39] autonuma: numa hinting page faults entry points Andrea Arcangeli
2012-03-26 17:46   ` Andrea Arcangeli
2012-03-26 17:46 ` [PATCH 34/39] autonuma: reset autonuma page data when pages are freed Andrea Arcangeli
2012-03-26 17:46   ` Andrea Arcangeli
2012-03-26 17:46 ` [PATCH 35/39] autonuma: initialize page structure fields Andrea Arcangeli
2012-03-26 17:46   ` Andrea Arcangeli
2012-03-26 17:46 ` [PATCH 36/39] autonuma: link mm/autonuma.o and kernel/sched/numa.o Andrea Arcangeli
2012-03-26 17:46   ` Andrea Arcangeli
2012-03-26 17:46 ` [PATCH 37/39] autonuma: add CONFIG_AUTONUMA and CONFIG_AUTONUMA_DEFAULT_ENABLED Andrea Arcangeli
2012-03-26 17:46   ` Andrea Arcangeli
2012-03-26 17:46 ` [PATCH 38/39] autonuma: boost khugepaged scanning rate Andrea Arcangeli
2012-03-26 17:46   ` Andrea Arcangeli
2012-03-26 17:46 ` [PATCH 39/39] autonuma: NUMA scheduler SMT awareness Andrea Arcangeli
2012-03-26 17:46   ` Andrea Arcangeli
2012-03-26 18:57   ` Peter Zijlstra
2012-03-26 18:57     ` Peter Zijlstra
2012-03-27  0:00     ` Andrea Arcangeli
2012-03-27  0:00       ` Andrea Arcangeli
2012-03-28 13:51       ` Andrea Arcangeli
2012-03-28 13:51         ` Andrea Arcangeli
2012-04-03 20:35 ` [PATCH 00/39] [RFC] AutoNUMA alpha10 Srivatsa Vaddagiri
2012-04-03 20:35   ` Srivatsa Vaddagiri
