* [PATCH 00/35] AutoNUMA alpha14
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

Hello everyone,

It's time for a new autonuma-alpha14 milestone.

I removed the [RFC] from the Subject because 1) this is a release I'm
quite happy with (on the implementation side it allows the same kernel
image to boot optimally on both NUMA and non-NUMA hardware, and it
avoids altering the scheduler runtime most of the time) and 2) the
benchmark results collected so far show that this design performs
best.

Realistically, I believe nobody is going to change threaded
applications to specify which thread is using which memory, with the
only exceptions being QEMU and a few others.

For non-threaded apps that fit in a NUMA node, there's no way a blind
home node can perform nearly as well as AutoNUMA: AutoNUMA monitors
the whole memory status of the running processes and optimizes the
memory placement and CPU placement dynamically accordingly. There's a
small memory and CPU cost in collecting so much information to be able
to make smart decisions, but the benefits largely outweigh those
costs.

If a big task has been idle for a long while but suddenly starts
computing, AutoNUMA may totally change the memory and CPU placement of
the other running tasks according to what's best, because it has
enough information to take optimal NUMA placement decisions.

git clone --reference linux -b autonuma-alpha14 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git autonuma-alpha14

Development autonuma branch (currently equal to autonuma-alpha14 ==
a49fedcc284a8e8b47175fbc23e9d3b075884e53):

git clone --reference linux -b autonuma git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
to update: git fetch; git checkout -f origin/autonuma

Changelog from alpha13 to alpha14:

o page_autonuma introduction: no memory is wasted if the kernel is
  booted on non-NUMA hardware. Tested with flatmem/sparsemem on x86
  with autonuma=y/n and sparsemem/vsparsemem on x86_64 with
  autonuma=y/n.

  The "noautonuma" kernel param disables autonuma permanently, even
  when booted on NUMA hardware (no /sys/kernel/mm/autonuma and no
  page_autonuma allocations, similar to cgroup_disable=memory)

o autonuma_balance only runs along with run_rebalance_domains, to
  avoid altering the scheduler runtime. autonuma_balance gives a
  "kick" to the scheduler only during load balance events (it
  overrides the load balance activity if needed). This change has not
  yet been tested on specjbb or more scheduling-intensive benchmarks,
  but I don't expect measurable NUMA affinity regressions. For
  compute-intensive loads not involving a flood of scheduling activity,
  this has already been verified not to show any performance
  regression, and it will boost the scheduler performance compared to
  previous autonuma releases.

  Note: autonuma_balance still runs from normal context (not softirq
  context like run_rebalance_domains) to be able to wait on process
  migration (avoid _nowait), but most of the time it does nothing at
  all.

Changelog from alpha11 to alpha13:

o autonuma_balance optimization (take the fast path when the process
  is in the preferred NUMA node)

TODO:

o THP native migration (orthogonal and also needed for
  cpuset/migrate_pages(2)/numa/sched).

o distribute pagecache to other nodes (and maybe shared memory or
  other movable memory) if knuma_migrated stops because the local node
  is full

Andrea Arcangeli (35):
  mm: add unlikely to the mm allocation failure check
  autonuma: make set_pmd_at always available
  xen: document Xen is using an unused bit for the pagetables
  autonuma: define _PAGE_NUMA_PTE and _PAGE_NUMA_PMD
  autonuma: x86 pte_numa() and pmd_numa()
  autonuma: generic pte_numa() and pmd_numa()
  autonuma: teach gup_fast about pte_numa
  autonuma: introduce kthread_bind_node()
  autonuma: mm_autonuma and sched_autonuma data structures
  autonuma: define the autonuma flags
  autonuma: core autonuma.h header
  autonuma: CPU follow memory algorithm
  autonuma: add page structure fields
  autonuma: knuma_migrated per NUMA node queues
  autonuma: init knuma_migrated queues
  autonuma: autonuma_enter/exit
  autonuma: call autonuma_setup_new_exec()
  autonuma: alloc/free/init sched_autonuma
  autonuma: alloc/free/init mm_autonuma
  autonuma: avoid CFS select_task_rq_fair to return -1
  autonuma: teach CFS about autonuma affinity
  autonuma: sched_set_autonuma_need_balance
  autonuma: core
  autonuma: follow_page check for pte_numa/pmd_numa
  autonuma: default mempolicy follow AutoNUMA
  autonuma: call autonuma_split_huge_page()
  autonuma: make khugepaged pte_numa aware
  autonuma: retain page last_nid information in khugepaged
  autonuma: numa hinting page faults entry points
  autonuma: reset autonuma page data when pages are freed
  autonuma: initialize page structure fields
  autonuma: link mm/autonuma.o and kernel/sched/numa.o
  autonuma: add CONFIG_AUTONUMA and CONFIG_AUTONUMA_DEFAULT_ENABLED
  autonuma: boost khugepaged scanning rate
  autonuma: page_autonuma

 arch/x86/include/asm/paravirt.h      |    2 -
 arch/x86/include/asm/pgtable.h       |   51 ++-
 arch/x86/include/asm/pgtable_types.h |   22 +-
 arch/x86/mm/gup.c                    |    2 +-
 fs/exec.c                            |    3 +
 include/asm-generic/pgtable.h        |   12 +
 include/linux/autonuma.h             |   53 ++
 include/linux/autonuma_flags.h       |   68 ++
 include/linux/autonuma_sched.h       |   50 ++
 include/linux/autonuma_types.h       |   88 ++
 include/linux/huge_mm.h              |    2 +
 include/linux/kthread.h              |    1 +
 include/linux/mm_types.h             |    5 +
 include/linux/mmzone.h               |   18 +
 include/linux/page_autonuma.h        |   53 ++
 include/linux/sched.h                |    3 +
 init/main.c                          |    2 +
 kernel/fork.c                        |   36 +-
 kernel/kthread.c                     |   23 +
 kernel/sched/Makefile                |    1 +
 kernel/sched/core.c                  |   12 +-
 kernel/sched/fair.c                  |   72 ++-
 kernel/sched/numa.c                  |  281 +++++++
 kernel/sched/sched.h                 |   10 +
 mm/Kconfig                           |   13 +
 mm/Makefile                          |    1 +
 mm/autonuma.c                        | 1464 ++++++++++++++++++++++++++++++++++
 mm/huge_memory.c                     |   58 ++-
 mm/memory.c                          |   36 +-
 mm/mempolicy.c                       |   15 +-
 mm/mmu_context.c                     |    2 +
 mm/page_alloc.c                      |    6 +-
 mm/page_autonuma.c                   |  234 ++++++
 mm/sparse.c                          |  126 +++-
 34 files changed, 2776 insertions(+), 49 deletions(-)
 create mode 100644 include/linux/autonuma.h
 create mode 100644 include/linux/autonuma_flags.h
 create mode 100644 include/linux/autonuma_sched.h
 create mode 100644 include/linux/autonuma_types.h
 create mode 100644 include/linux/page_autonuma.h
 create mode 100644 kernel/sched/numa.c
 create mode 100644 mm/autonuma.c
 create mode 100644 mm/page_autonuma.c

* [PATCH 01/35] mm: add unlikely to the mm allocation failure check
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

Very minor optimization to hint gcc.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/fork.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 47b4e4f..98db8b0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -571,7 +571,7 @@ struct mm_struct *mm_alloc(void)
 	struct mm_struct *mm;
 
 	mm = allocate_mm();
-	if (!mm)
+	if (unlikely(!mm))
 		return NULL;
 
 	memset(mm, 0, sizeof(*mm));

* [PATCH 02/35] autonuma: make set_pmd_at always available
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

set_pmd_at() will also be used for the knuma_scand/pmd = 1 (default)
mode even when TRANSPARENT_HUGEPAGE=n. Make it available so the build
won't fail.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/include/asm/paravirt.h |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 6cbbabf..e99fb37 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -564,7 +564,6 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
 		PVOP_VCALL4(pv_mmu_ops.set_pte_at, mm, addr, ptep, pte.pte);
 }
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 			      pmd_t *pmdp, pmd_t pmd)
 {
@@ -575,7 +574,6 @@ static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 		PVOP_VCALL4(pv_mmu_ops.set_pmd_at, mm, addr, pmdp,
 			    native_pmd_val(pmd));
 }
-#endif
 
 static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
 {

* [PATCH 03/35] xen: document Xen is using an unused bit for the pagetables
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

Xen has taken over the last reserved bit available for the pagetables,
which is set through ioremap. This patch documents it and makes the
code more readable.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/include/asm/pgtable_types.h |   11 +++++++++--
 1 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 013286a..b74cac9 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -17,7 +17,7 @@
 #define _PAGE_BIT_PAT		7	/* on 4KB pages */
 #define _PAGE_BIT_GLOBAL	8	/* Global TLB entry PPro+ */
 #define _PAGE_BIT_UNUSED1	9	/* available for programmer */
-#define _PAGE_BIT_IOMAP		10	/* flag used to indicate IO mapping */
+#define _PAGE_BIT_UNUSED2	10
 #define _PAGE_BIT_HIDDEN	11	/* hidden by kmemcheck */
 #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
 #define _PAGE_BIT_SPECIAL	_PAGE_BIT_UNUSED1
@@ -41,7 +41,7 @@
 #define _PAGE_PSE	(_AT(pteval_t, 1) << _PAGE_BIT_PSE)
 #define _PAGE_GLOBAL	(_AT(pteval_t, 1) << _PAGE_BIT_GLOBAL)
 #define _PAGE_UNUSED1	(_AT(pteval_t, 1) << _PAGE_BIT_UNUSED1)
-#define _PAGE_IOMAP	(_AT(pteval_t, 1) << _PAGE_BIT_IOMAP)
+#define _PAGE_UNUSED2	(_AT(pteval_t, 1) << _PAGE_BIT_UNUSED2)
 #define _PAGE_PAT	(_AT(pteval_t, 1) << _PAGE_BIT_PAT)
 #define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
 #define _PAGE_SPECIAL	(_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
@@ -49,6 +49,13 @@
 #define _PAGE_SPLITTING	(_AT(pteval_t, 1) << _PAGE_BIT_SPLITTING)
 #define __HAVE_ARCH_PTE_SPECIAL
 
+/* flag used to indicate IO mapping */
+#ifdef CONFIG_XEN
+#define _PAGE_IOMAP	(_AT(pteval_t, 1) << _PAGE_BIT_UNUSED2)
+#else
+#define _PAGE_IOMAP	(_AT(pteval_t, 0))
+#endif
+
 #ifdef CONFIG_KMEMCHECK
 #define _PAGE_HIDDEN	(_AT(pteval_t, 1) << _PAGE_BIT_HIDDEN)
 #else

* [PATCH 04/35] autonuma: define _PAGE_NUMA_PTE and _PAGE_NUMA_PMD
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

We will set these bitflags only when the pmd or pte is not present.

They work like PROT_NONE, but they identify a request for the numa
hinting page fault to trigger.

Because we want to be able to set these bitflags in any established
pte or pmd (while clearing the present bit at the same time) without
losing information, these bitflags must never be set when the pte and
pmd are present.

For _PAGE_NUMA_PTE the bitflag used is _PAGE_PSE, which cannot
otherwise be set on ptes, and it also fits in between _PAGE_FILE and
_PAGE_PROTNONE, which avoids having to alter the swp entry format.

For _PAGE_NUMA_PMD we use a reserved bitflag. pmds never contain swap
entries, but if in the future we swap out transparent hugepages, we
must keep in mind not to use the _PAGE_UNUSED2 bitflag in the swap
entry format and to start the swap entry offset above it.

_PAGE_UNUSED2 is used by Xen, but only on ptes established by ioremap;
it's never used on pmds, so there's no risk of collision with Xen.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/include/asm/pgtable_types.h |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index b74cac9..6e2d954 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -71,6 +71,17 @@
 #define _PAGE_FILE	(_AT(pteval_t, 1) << _PAGE_BIT_FILE)
 #define _PAGE_PROTNONE	(_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
 
+/*
+ * Cannot be set on pte. The fact it's in between _PAGE_FILE and
+ * _PAGE_PROTNONE avoids having to alter the swp entries.
+ */
+#define _PAGE_NUMA_PTE	_PAGE_PSE
+/*
+ * Cannot be set on pmd, if transparent hugepages will be swapped out
+ * the swap entry offset must start above it.
+ */
+#define _PAGE_NUMA_PMD	_PAGE_UNUSED2
+
 #define _PAGE_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER |	\
 			 _PAGE_ACCESSED | _PAGE_DIRTY)
 #define _KERNPG_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED |	\

* [PATCH 05/35] autonuma: x86 pte_numa() and pmd_numa()
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

Implement pte_numa and pmd_numa and related methods on x86 arch.

We must atomically set the numa bit and clear the present bit to
define a pte_numa or pmd_numa.

Whenever a pte or pmd is set as pte_numa or pmd_numa, the first time
a thread touches that virtual address a NUMA hinting page fault
triggers. The NUMA hinting page fault simply clears the NUMA bit and
sets the present bit again to resolve the page fault.

NUMA hinting page faults are used:

1) to fill in the per-thread NUMA statistics stored for each thread in
   the current->sched_autonuma data structure

2) to track the per-node last_nid information in the page structure to
   detect false sharing

3) to queue the page mapped by the pte_numa or pmd_numa for async
   migration if there have been enough NUMA hinting page faults on the
   page coming from remote CPUs

NUMA hinting page faults don't do anything except collect information
and possibly add pages to migrate queues. They're extremely quick and
absolutely non-blocking. They don't allocate any memory either.

The only "input" information of the AutoNUMA algorithm that isn't
collected through NUMA hinting page faults is the per-process (not
per-thread) mm->mm_autonuma statistics. Those mm_autonuma statistics
are collected by the knuma_scand pmd/pte scans, which are also
responsible for setting pte_numa/pmd_numa to activate the NUMA hinting
page faults.

knuma_scand -> NUMA hinting page faults
  |                       |
 \|/                     \|/
mm_autonuma  <->  sched_autonuma (CPU follow memory, this is mm_autonuma too)
                  page last_nid  (false thread sharing/thread shared memory detection)
                  queue or cancel page migration (memory follow CPU)

After pages are queued, one knuma_migratedN daemon per NUMA node
takes care of migrating the pages at a perfectly steady rate, in
parallel from all nodes and in round robin across all the incoming
nodes going to the same destination node. This keeps all memory
channels in large boxes active at the same time and avoids hammering
a single memory channel for too long, minimizing memory bus migration
latency effects.

Once pages are queued for async migration by knuma_migratedN, their
migration can still be canceled before they're actually migrated, if
false sharing is later detected.
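
Purely as an illustration (not part of this patch), here is a minimal
sketch of how a NUMA hinting page fault could be resolved with the pte
helpers added below; the statistics and queueing helpers
(autonuma_record_fault(), autonuma_maybe_queue_migration()) are
hypothetical placeholders for code introduced later in the series:

static void numa_hinting_fault_resolve_pte(struct mm_struct *mm,
					   struct vm_area_struct *vma,
					   unsigned long addr,
					   pte_t *ptep, pte_t entry)
{
	struct page *page;

	/* clear the NUMA bit and set the present bit again */
	entry = pte_mknotnuma(entry);
	set_pte_at(mm, addr, ptep, entry);
	update_mmu_cache(vma, addr, ptep);

	page = vm_normal_page(vma, addr, entry);
	if (!page)
		return;

	/* hypothetical: per-thread stats, last_nid check, async migration */
	autonuma_record_fault(current, page_to_nid(page));
	autonuma_maybe_queue_migration(page);
}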

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/include/asm/pgtable.h |   51 +++++++++++++++++++++++++++++++++++++--
 1 files changed, 48 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 49afb3f..7514fa6 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -109,7 +109,7 @@ static inline int pte_write(pte_t pte)
 
 static inline int pte_file(pte_t pte)
 {
-	return pte_flags(pte) & _PAGE_FILE;
+	return (pte_flags(pte) & _PAGE_FILE) == _PAGE_FILE;
 }
 
 static inline int pte_huge(pte_t pte)
@@ -405,7 +405,9 @@ static inline int pte_same(pte_t a, pte_t b)
 
 static inline int pte_present(pte_t a)
 {
-	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
+	/* _PAGE_NUMA includes _PAGE_PROTNONE */
+	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
+			       _PAGE_NUMA_PTE);
 }
 
 static inline int pte_hidden(pte_t pte)
@@ -415,7 +417,46 @@ static inline int pte_hidden(pte_t pte)
 
 static inline int pmd_present(pmd_t pmd)
 {
-	return pmd_flags(pmd) & _PAGE_PRESENT;
+	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE |
+				 _PAGE_NUMA_PMD);
+}
+
+#ifdef CONFIG_AUTONUMA
+static inline int pte_numa(pte_t pte)
+{
+	return (pte_flags(pte) &
+		(_PAGE_NUMA_PTE|_PAGE_PRESENT)) == _PAGE_NUMA_PTE;
+}
+
+static inline int pmd_numa(pmd_t pmd)
+{
+	return (pmd_flags(pmd) &
+		(_PAGE_NUMA_PMD|_PAGE_PRESENT)) == _PAGE_NUMA_PMD;
+}
+#endif
+
+static inline pte_t pte_mknotnuma(pte_t pte)
+{
+	pte = pte_clear_flags(pte, _PAGE_NUMA_PTE);
+	return pte_set_flags(pte, _PAGE_PRESENT|_PAGE_ACCESSED);
+}
+
+static inline pmd_t pmd_mknotnuma(pmd_t pmd)
+{
+	pmd = pmd_clear_flags(pmd, _PAGE_NUMA_PMD);
+	return pmd_set_flags(pmd, _PAGE_PRESENT|_PAGE_ACCESSED);
+}
+
+static inline pte_t pte_mknuma(pte_t pte)
+{
+	pte = pte_set_flags(pte, _PAGE_NUMA_PTE);
+	return pte_clear_flags(pte, _PAGE_PRESENT);
+}
+
+static inline pmd_t pmd_mknuma(pmd_t pmd)
+{
+	pmd = pmd_set_flags(pmd, _PAGE_NUMA_PMD);
+	return pmd_clear_flags(pmd, _PAGE_PRESENT);
 }
 
 static inline int pmd_none(pmd_t pmd)
@@ -474,6 +515,10 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 
 static inline int pmd_bad(pmd_t pmd)
 {
+#ifdef CONFIG_AUTONUMA
+	if (pmd_numa(pmd))
+		return 0;
+#endif
 	return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
 }
 

* [PATCH 06/35] autonuma: generic pte_numa() and pmd_numa()
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

Implement generic versions of the methods. They're used when
CONFIG_AUTONUMA=n, and they're a noop.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/asm-generic/pgtable.h |   12 ++++++++++++
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index fa596d9..780f707 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -521,6 +521,18 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
 #endif
 }
 
+#ifndef CONFIG_AUTONUMA
+static inline int pte_numa(pte_t pte)
+{
+	return 0;
+}
+
+static inline int pmd_numa(pmd_t pmd)
+{
+	return 0;
+}
+#endif /* CONFIG_AUTONUMA */
+
 #endif /* CONFIG_MMU */
 
 #endif /* !__ASSEMBLY__ */

* [PATCH 07/35] autonuma: teach gup_fast about pte_numa
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

gup_fast will skip over non-present ptes (pte_numa requires the pte to
be non-present). So no explicit check is needed for pte_numa in the
pte case.

gup_fast will also automatically skip over THPs when the trans huge
pmd is non-present (pmd_numa requires the pmd to be non-present).

But for the special pmd mode scan of knuma_scand
(/sys/kernel/mm/autonuma/knuma_scand/pmd == 1), the pmd may be of numa
type (so non-present too) while the pte may be present. gup_pte_range
wouldn't notice the pmd is of numa type. So to avoid losing a NUMA
hinting page fault with gup_fast, we need an explicit check for
pmd_numa() here to be sure it will fault through gup ->
handle_mm_fault.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/mm/gup.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index dd74e46..bf36575 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -164,7 +164,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 		 * wait_split_huge_page() would never return as the
 		 * tlb flush IPI wouldn't run.
 		 */
-		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+		if (pmd_none(pmd) || pmd_trans_splitting(pmd) || pmd_numa(pmd))
 			return 0;
 		if (unlikely(pmd_large(pmd))) {
 			if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))

* [PATCH 08/35] autonuma: introduce kthread_bind_node()
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

This function makes it easy to bind the per-node knuma_migrated
threads to their respective NUMA nodes. Those threads take memory from
the other nodes (in round robin, with an incoming queue for each
remote node) and move that memory to their local node.
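
As a usage sketch (not part of this patch), this is roughly how the
per-node knuma_migrated daemons introduced later in the series could
be created and bound; the thread function knuma_migrated() and the
init helper name are assumptions here:

static int __init knuma_migrated_start_all(void)
{
	int nid;

	for_each_online_node(nid) {
		struct task_struct *t;

		/* allocate the kthread's structures on the node it will serve */
		t = kthread_create_on_node(knuma_migrated, NODE_DATA(nid), nid,
					   "knuma_migrated%d", nid);
		if (IS_ERR(t))
			continue;
		/* restrict the daemon to the CPUs of its own NUMA node */
		kthread_bind_node(t, nid);
		wake_up_process(t);
	}
	return 0;
}

Binding by node rather than by a single CPU keeps each daemon on its
destination node while still letting the scheduler pick any CPU there.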

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/kthread.h |    1 +
 kernel/kthread.c        |   23 +++++++++++++++++++++++
 2 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 0714b24..e733f97 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -33,6 +33,7 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
 })
 
 void kthread_bind(struct task_struct *k, unsigned int cpu);
+void kthread_bind_node(struct task_struct *p, int nid);
 int kthread_stop(struct task_struct *k);
 int kthread_should_stop(void);
 bool kthread_freezable_should_stop(bool *was_frozen);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 3d3de63..48b36f9 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -234,6 +234,29 @@ void kthread_bind(struct task_struct *p, unsigned int cpu)
 EXPORT_SYMBOL(kthread_bind);
 
 /**
+ * kthread_bind_node - bind a just-created kthread to the CPUs of a node.
+ * @p: thread created by kthread_create().
+ * @nid: node (might not be online, must be possible) for @k to run on.
+ *
+ * Description: This function is equivalent to set_cpus_allowed(),
+ * except that @nid doesn't need to be online, and the thread must be
+ * stopped (i.e., just returned from kthread_create()).
+ */
+void kthread_bind_node(struct task_struct *p, int nid)
+{
+	/* Must have done schedule() in kthread() before we set_task_cpu */
+	if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
+		WARN_ON(1);
+		return;
+	}
+
+	/* It's safe because the task is inactive. */
+	do_set_cpus_allowed(p, cpumask_of_node(nid));
+	p->flags |= PF_THREAD_BOUND;
+}
+EXPORT_SYMBOL(kthread_bind_node);
+
+/**
  * kthread_stop - stop a thread created by kthread_create().
  * @k: thread created by kthread_create().
  *

* [PATCH 09/35] autonuma: mm_autonuma and sched_autonuma data structures
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

Define the two data structures that collect the per-process (in the
mm) and per-thread (in the task_struct) statistical information that
is the input to the CPU-follow-memory algorithm in the NUMA
scheduler.
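
Both structures end with a numa_fault[0] flexible array holding one
counter per node, so the allocation has to be sized at runtime. A
minimal sketch of how the mm side is presumably allocated (the
_sketch name and the exact sizing are assumptions, not the actual
alloc_mm_autonuma() implementation):

static int alloc_mm_autonuma_sketch(struct mm_struct *mm)
{
	struct mm_autonuma *mma;

	/* header plus one unsigned long fault counter per possible node */
	mma = kzalloc(sizeof(*mma) + nr_node_ids * sizeof(unsigned long),
		      GFP_KERNEL);
	if (!mma)
		return -ENOMEM;
	mma->mm = mm;		/* back-pointer used by the scheduler side */
	mm->mm_autonuma = mma;	/* mm_struct field added later in the series */
	return 0;
}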

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma_types.h |   57 ++++++++++++++++++++++++++++++++++++++++
 1 files changed, 57 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/autonuma_types.h

diff --git a/include/linux/autonuma_types.h b/include/linux/autonuma_types.h
new file mode 100644
index 0000000..65b175b
--- /dev/null
+++ b/include/linux/autonuma_types.h
@@ -0,0 +1,57 @@
+#ifndef _LINUX_AUTONUMA_TYPES_H
+#define _LINUX_AUTONUMA_TYPES_H
+
+#ifdef CONFIG_AUTONUMA
+
+#include <linux/numa.h>
+
+struct mm_autonuma {
+	struct list_head mm_node;
+	struct mm_struct *mm;
+	unsigned long numa_fault_tot; /* reset from here */
+	unsigned long numa_fault_pass;
+	unsigned long numa_fault[0];
+};
+
+extern int alloc_mm_autonuma(struct mm_struct *mm);
+extern void free_mm_autonuma(struct mm_struct *mm);
+extern void __init mm_autonuma_init(void);
+
+#define SCHED_AUTONUMA_FLAG_STOP_ONE_CPU	(1<<0)
+#define SCHED_AUTONUMA_FLAG_NEED_BALANCE	(1<<1)
+
+struct sched_autonuma {
+	int autonuma_node;
+	unsigned int autonuma_flags; /* zeroed from here */
+	unsigned long numa_fault_pass;
+	unsigned long numa_fault_tot;
+	unsigned long numa_fault[0];
+};
+
+extern int alloc_sched_autonuma(struct task_struct *tsk,
+				struct task_struct *orig,
+				int node);
+extern void __init sched_autonuma_init(void);
+extern void free_sched_autonuma(struct task_struct *tsk);
+
+#else /* CONFIG_AUTONUMA */
+
+static inline int alloc_mm_autonuma(struct mm_struct *mm)
+{
+	return 0;
+}
+static inline void free_mm_autonuma(struct mm_struct *mm) {}
+static inline void mm_autonuma_init(void) {}
+
+static inline int alloc_sched_autonuma(struct task_struct *tsk,
+				       struct task_struct *orig,
+				       int node)
+{
+	return 0;
+}
+static inline void sched_autonuma_init(void) {}
+static inline void free_sched_autonuma(struct task_struct *tsk) {}
+
+#endif /* CONFIG_AUTONUMA */
+
+#endif /* _LINUX_AUTONUMA_TYPES_H */

* [PATCH 10/35] autonuma: define the autonuma flags
  2012-05-25 17:02 ` Andrea Arcangeli
@ 2012-05-25 17:02   ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

These flags are the ones tweaked through sysfs; they control the
behavior of AutoNUMA, from enabling/disabling it to selecting various
runtime options.
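
A sketch of how one of these bits is presumably flipped from a sysfs
store handler (the handler and the "debug" attribute below are
illustrative; only autonuma_flags and the AUTONUMA_*_FLAG bits come
from this patch):

/* e.g. "echo 1 > /sys/kernel/mm/autonuma/debug" (exact path assumed) */
static ssize_t debug_store(struct kobject *kobj, struct kobj_attribute *attr,
			   const char *buf, size_t count)
{
	unsigned long val;

	if (kstrtoul(buf, 10, &val))
		return -EINVAL;
	if (val)
		set_bit(AUTONUMA_DEBUG_FLAG, &autonuma_flags);
	else
		clear_bit(AUTONUMA_DEBUG_FLAG, &autonuma_flags);
	return count;
}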

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma_flags.h |   62 ++++++++++++++++++++++++++++++++++++++++
 1 files changed, 62 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/autonuma_flags.h

diff --git a/include/linux/autonuma_flags.h b/include/linux/autonuma_flags.h
new file mode 100644
index 0000000..9c702fd
--- /dev/null
+++ b/include/linux/autonuma_flags.h
@@ -0,0 +1,62 @@
+#ifndef _LINUX_AUTONUMA_FLAGS_H
+#define _LINUX_AUTONUMA_FLAGS_H
+
+enum autonuma_flag {
+	AUTONUMA_FLAG,
+	AUTONUMA_IMPOSSIBLE,
+	AUTONUMA_DEBUG_FLAG,
+	AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG,
+	AUTONUMA_SCHED_CLONE_RESET_FLAG,
+	AUTONUMA_SCHED_FORK_RESET_FLAG,
+	AUTONUMA_SCAN_PMD_FLAG,
+	AUTONUMA_SCAN_USE_WORKING_SET_FLAG,
+	AUTONUMA_MIGRATE_DEFER_FLAG,
+};
+
+extern unsigned long autonuma_flags;
+
+static bool inline autonuma_enabled(void)
+{
+	return !!test_bit(AUTONUMA_FLAG, &autonuma_flags);
+}
+
+static bool inline autonuma_debug(void)
+{
+	return !!test_bit(AUTONUMA_DEBUG_FLAG, &autonuma_flags);
+}
+
+static bool inline autonuma_sched_load_balance_strict(void)
+{
+	return !!test_bit(AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG,
+			  &autonuma_flags);
+}
+
+static bool inline autonuma_sched_clone_reset(void)
+{
+	return !!test_bit(AUTONUMA_SCHED_CLONE_RESET_FLAG,
+			  &autonuma_flags);
+}
+
+static bool inline autonuma_sched_fork_reset(void)
+{
+	return !!test_bit(AUTONUMA_SCHED_FORK_RESET_FLAG,
+			  &autonuma_flags);
+}
+
+static bool inline autonuma_scan_pmd(void)
+{
+	return !!test_bit(AUTONUMA_SCAN_PMD_FLAG, &autonuma_flags);
+}
+
+static bool inline autonuma_scan_use_working_set(void)
+{
+	return !!test_bit(AUTONUMA_SCAN_USE_WORKING_SET_FLAG,
+			  &autonuma_flags);
+}
+
+static bool inline autonuma_migrate_defer(void)
+{
+	return !!test_bit(AUTONUMA_MIGRATE_DEFER_FLAG, &autonuma_flags);
+}
+
+#endif /* _LINUX_AUTONUMA_FLAGS_H */

* [PATCH 11/35] autonuma: core autonuma.h header
  2012-05-25 17:02 ` Andrea Arcangeli
@ 2012-05-25 17:02   ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

This is the generic autonuma.h header; it declares the AutoNUMA
specific functions like autonuma_setup_new_exec,
autonuma_split_huge_page, numa_hinting_fault, etc.

As usual, functions like numa_hinting_fault that only matter for
builds with CONFIG_AUTONUMA=y are declared unconditionally, but they
are only built into the kernel when CONFIG_AUTONUMA=y. With
CONFIG_AUTONUMA=n their call sites must be optimized away at build
time (or the kernel won't link).
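
A minimal sketch of that build pattern (the caller below is
hypothetical; the point is only that a call to an extern function
with no CONFIG_AUTONUMA=n definition has to sit under a condition the
compiler can prove false in that config, so the symbol reference is
eliminated before link time):

static void numa_fault_sketch(struct page *page)
{
	/*
	 * IS_ENABLED() folds to a compile-time constant: with
	 * CONFIG_AUTONUMA=n the branch and the call to the undefined
	 * numa_hinting_fault() symbol are compiled out, so the final
	 * link still succeeds.
	 */
	if (IS_ENABLED(CONFIG_AUTONUMA))
		numa_hinting_fault(page, 1);
}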

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma.h |   41 +++++++++++++++++++++++++++++++++++++++++
 1 files changed, 41 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/autonuma.h

diff --git a/include/linux/autonuma.h b/include/linux/autonuma.h
new file mode 100644
index 0000000..a963dcb
--- /dev/null
+++ b/include/linux/autonuma.h
@@ -0,0 +1,41 @@
+#ifndef _LINUX_AUTONUMA_H
+#define _LINUX_AUTONUMA_H
+
+#ifdef CONFIG_AUTONUMA
+
+#include <linux/autonuma_flags.h>
+
+extern void autonuma_enter(struct mm_struct *mm);
+extern void autonuma_exit(struct mm_struct *mm);
+extern void __autonuma_migrate_page_remove(struct page *page);
+extern void autonuma_migrate_split_huge_page(struct page *page,
+					     struct page *page_tail);
+extern void autonuma_setup_new_exec(struct task_struct *p);
+
+static inline void autonuma_migrate_page_remove(struct page *page)
+{
+	if (ACCESS_ONCE(page->autonuma_migrate_nid) >= 0)
+		__autonuma_migrate_page_remove(page);
+}
+
+#define autonuma_printk(format, args...) \
+	if (autonuma_debug()) printk(format, ##args)
+
+#else /* CONFIG_AUTONUMA */
+
+static inline void autonuma_enter(struct mm_struct *mm) {}
+static inline void autonuma_exit(struct mm_struct *mm) {}
+static inline void autonuma_migrate_page_remove(struct page *page) {}
+static inline void autonuma_migrate_split_huge_page(struct page *page,
+						    struct page *page_tail) {}
+static inline void autonuma_setup_new_exec(struct task_struct *p) {}
+
+#endif /* CONFIG_AUTONUMA */
+
+extern pte_t __pte_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
+			      unsigned long addr, pte_t pte, pte_t *ptep);
+extern void __pmd_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
+			     unsigned long addr, pmd_t *pmd);
+extern void numa_hinting_fault(struct page *page, int numpages);
+
+#endif /* _LINUX_AUTONUMA_H */

* [PATCH 12/35] autonuma: CPU follow memory algorithm
  2012-05-25 17:02 ` Andrea Arcangeli
@ 2012-05-25 17:02   ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

This algorithm takes as input the statistical information filled in
by knuma_scand (mm->mm_autonuma) and by the NUMA hinting page faults
(p->sched_autonuma), evaluates it for the currently scheduled task,
and compares it against every other running process to see if the
current task should be moved to another NUMA node.

For example, if there's an idle CPU in the NUMA node where the
current task prefers to be scheduled (according to the mm_autonuma
and sched_autonuma data structures), the task will be migrated there
instead of continuing to run on the current CPU.

When the scheduler decides whether the task should be migrated to a
different NUMA node or stay in the same one, the decision is stored
into p->sched_autonuma->autonuma_node. The fair scheduler then tries
to keep the task on that autonuma_node too.
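
In practice the comparison boils down to per-node fault fractions; a
hedged sketch of the arithmetic used when a candidate remote CPU is
evaluated (helper names are illustrative, the formula and the
AUTONUMA_BALANCE_SCALE constant follow sched_autonuma_balance() in
the patch below, locking and the thread/process distinction are
ignored):

/* fraction of NUMA hinting faults that hit nid, scaled to integers */
static long numa_weight(unsigned long *fault, unsigned long fault_tot, int nid)
{
	return fault[nid] * AUTONUMA_BALANCE_SCALE / fault_tot;
}

/*
 * Migrate only if the current task likes the remote node more than
 * both the task currently running there (w_other) and its own current
 * node (w_cpu_nid); the sum of the two gains is what gets maximized
 * across all candidate CPUs.
 */
static bool better_candidate(long w_nid, long w_other, long w_cpu_nid,
			     unsigned long *best_gain)
{
	if (w_nid > w_other && w_nid > w_cpu_nid) {
		unsigned long gain = (w_nid - w_other) + (w_nid - w_cpu_nid);

		if (gain > *best_gain) {
			*best_gain = gain;
			return true;
		}
	}
	return false;
}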

Code includes fixes and cleanups from Hillf Danton <dhillf@gmail.com>.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma_sched.h |   50 +++++++
 include/linux/mm_types.h       |    5 +
 include/linux/sched.h          |    3 +
 kernel/sched/core.c            |   12 +-
 kernel/sched/numa.c            |  281 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h           |   10 ++
 6 files changed, 353 insertions(+), 8 deletions(-)
 create mode 100644 include/linux/autonuma_sched.h
 create mode 100644 kernel/sched/numa.c

diff --git a/include/linux/autonuma_sched.h b/include/linux/autonuma_sched.h
new file mode 100644
index 0000000..9a4d945
--- /dev/null
+++ b/include/linux/autonuma_sched.h
@@ -0,0 +1,50 @@
+#ifndef _LINUX_AUTONUMA_SCHED_H
+#define _LINUX_AUTONUMA_SCHED_H
+
+#include <linux/autonuma_flags.h>
+
+static bool inline task_autonuma_cpu(struct task_struct *p, int cpu)
+{
+#ifdef CONFIG_AUTONUMA
+	int autonuma_node;
+	struct sched_autonuma *sched_autonuma = p->sched_autonuma;
+
+	if (!sched_autonuma)
+		return true;
+
+	autonuma_node = ACCESS_ONCE(sched_autonuma->autonuma_node);
+	if (autonuma_node < 0 || autonuma_node == cpu_to_node(cpu))
+		return true;
+	else
+		return false;
+#else
+	return true;
+#endif
+}
+
+static inline void sched_set_autonuma_need_balance(void)
+{
+#ifdef CONFIG_AUTONUMA
+	struct sched_autonuma *sa = current->sched_autonuma;
+
+	if (sa && current->mm)
+		sa->autonuma_flags |= SCHED_AUTONUMA_FLAG_NEED_BALANCE;
+#endif
+}
+
+#ifdef CONFIG_AUTONUMA
+extern void sched_autonuma_balance(void);
+extern bool sched_autonuma_can_migrate_task(struct task_struct *p,
+					    int numa, int dst_cpu,
+					    enum cpu_idle_type idle);
+#else /* CONFIG_AUTONUMA */
+static inline void sched_autonuma_balance(void) {}
+static inline bool sched_autonuma_can_migrate_task(struct task_struct *p,
+						   int numa, int dst_cpu,
+						   enum cpu_idle_type idle)
+{
+	return true;
+}
+#endif /* CONFIG_AUTONUMA */
+
+#endif /* _LINUX_AUTONUMA_SCHED_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 26574c7..780ded7 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -13,6 +13,7 @@
 #include <linux/cpumask.h>
 #include <linux/page-debug-flags.h>
 #include <linux/uprobes.h>
+#include <linux/autonuma_types.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
 
@@ -390,6 +391,10 @@ struct mm_struct {
 	struct cpumask cpumask_allocation;
 #endif
 	struct uprobes_state uprobes_state;
+#ifdef CONFIG_AUTONUMA
+	/* this is used by the scheduler and the page allocator */
+	struct mm_autonuma *mm_autonuma;
+#endif
 };
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f45c0b2..60a699c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1507,6 +1507,9 @@ struct task_struct {
 	struct mempolicy *mempolicy;	/* Protected by alloc_lock */
 	short il_next;
 	short pref_node_fork;
+#ifdef CONFIG_AUTONUMA
+	struct sched_autonuma *sched_autonuma;
+#endif
 #endif
 	struct rcu_head rcu;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 39eb601..e3e4c99 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -72,6 +72,7 @@
 #include <linux/slab.h>
 #include <linux/init_task.h>
 #include <linux/binfmts.h>
+#include <linux/autonuma_sched.h>
 
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
@@ -1117,13 +1118,6 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	__set_task_cpu(p, new_cpu);
 }
 
-struct migration_arg {
-	struct task_struct *task;
-	int dest_cpu;
-};
-
-static int migration_cpu_stop(void *data);
-
 /*
  * wait_task_inactive - wait for a thread to unschedule.
  *
@@ -3274,6 +3268,8 @@ need_resched:
 
 	post_schedule(rq);
 
+	sched_autonuma_balance();
+
 	sched_preempt_enable_no_resched();
 	if (need_resched())
 		goto need_resched;
@@ -5106,7 +5102,7 @@ fail:
  * and performs thread migration by bumping thread off CPU then
  * 'pushing' onto another runqueue.
  */
-static int migration_cpu_stop(void *data)
+int migration_cpu_stop(void *data)
 {
 	struct migration_arg *arg = data;
 
diff --git a/kernel/sched/numa.c b/kernel/sched/numa.c
new file mode 100644
index 0000000..499a197
--- /dev/null
+++ b/kernel/sched/numa.c
@@ -0,0 +1,281 @@
+/*
+ *  Copyright (C) 2012  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/sched.h>
+#include <linux/autonuma_sched.h>
+#include <asm/tlb.h>
+
+#include "sched.h"
+
+#define AUTONUMA_BALANCE_SCALE 1000
+
+enum {
+	W_TYPE_THREAD,
+	W_TYPE_PROCESS,
+};
+
+/*
+ * This function is responsible for deciding which is the best CPU
+ * each process should be running on according to the NUMA
+ * affinity. To do that it evaluates all CPUs and checks if there's
+ * any remote CPU where the current process has more NUMA affinity
+ * than with the current CPU, and where the process running on the
+ * remote CPU has less NUMA affinity than the current process to run
+ * on the remote CPU. Ideally this should be expanded to take all
+ * runnable processes into account but this is a good
+ * approximation. When we compare the NUMA affinity between the
+ * current and remote CPU we use the per-thread information if the
+ * remote CPU runs a thread of the same process that the current task
+ * belongs to, or the per-process information if the remote CPU runs a
+ * different process than the current one. If the remote CPU runs the
+ * idle task we require both the per-thread and per-process
+ * information to have more affinity with the remote CPU than with the
+ * current CPU for a migration to happen.
+ *
+ * This has O(N) complexity but N isn't the number of running
+ * processes, but the number of CPUs, so if you assume a constant
+ * number of CPUs (capped at NR_CPUS) it is O(1). O(1) misleading math
+ * aside, the number of cachelines touched with thousands of CPU might
+ * make it measurable. Calling this at every schedule may also be
+ * overkill and it may be enough to call it with a frequency similar
+ * to the load balancing, but by doing so we're also verifying the
+ * algorithm is a converging one in all workloads if performance is
+ * improved and there's no frequent CPU migration, so it's good in the
+ * short term for stressing the algorithm.
+ */
+void sched_autonuma_balance(void)
+{
+	int cpu, nid, selected_cpu, selected_nid;
+	int cpu_nid = numa_node_id();
+	int this_cpu = smp_processor_id();
+	unsigned long p_w, p_t, m_w, m_t, p_w_max, m_w_max;
+	unsigned long weight_delta_max, weight;
+	long s_w_nid = -1, s_w_cpu_nid = -1, s_w_other = -1;
+	int s_w_type = -1;
+	struct cpumask *allowed;
+	struct migration_arg arg;
+	struct task_struct *p = current;
+	struct sched_autonuma *sched_autonuma = p->sched_autonuma;
+
+	/* per-cpu statically allocated in runqueues */
+	long *weight_current;
+	long *weight_current_mm;
+
+	if (!sched_autonuma || !p->mm)
+		return;
+
+	if (!(sched_autonuma->autonuma_flags &
+	      SCHED_AUTONUMA_FLAG_NEED_BALANCE))
+		return;
+	else
+		sched_autonuma->autonuma_flags &=
+			~SCHED_AUTONUMA_FLAG_NEED_BALANCE;
+
+	if (sched_autonuma->autonuma_flags & SCHED_AUTONUMA_FLAG_STOP_ONE_CPU)
+		return;
+
+	if (!autonuma_enabled()) {
+		if (sched_autonuma->autonuma_node != -1)
+			sched_autonuma->autonuma_node = -1;
+		return;
+	}
+
+	allowed = tsk_cpus_allowed(p);
+
+	m_t = ACCESS_ONCE(p->mm->mm_autonuma->numa_fault_tot);
+	p_t = sched_autonuma->numa_fault_tot;
+	/*
+	 * If a process still misses the per-thread or per-process
+	 * information skip it.
+	 */
+	if (!m_t || !p_t)
+		return;
+
+	weight_current = cpu_rq(this_cpu)->weight_current;
+	weight_current_mm = cpu_rq(this_cpu)->weight_current_mm;
+
+	p_w_max = m_w_max = 0;
+	selected_nid = -1;
+	for_each_online_node(nid) {
+		int hits = 0;
+		m_w = ACCESS_ONCE(p->mm->mm_autonuma->numa_fault[nid]);
+		p_w = sched_autonuma->numa_fault[nid];
+		if (m_w > m_t)
+			m_t = m_w;
+		weight_current_mm[nid] = m_w*AUTONUMA_BALANCE_SCALE/m_t;
+		if (p_w > p_t)
+			p_t = p_w;
+		weight_current[nid] = p_w*AUTONUMA_BALANCE_SCALE/p_t;
+		if (weight_current_mm[nid] > m_w_max) {
+			m_w_max = weight_current_mm[nid];
+			hits++;
+		}
+		if (weight_current[nid] > p_w_max) {
+			p_w_max = weight_current[nid];
+			hits++;
+		}
+		if (hits == 2)
+			selected_nid = nid;
+	}
+	if (selected_nid == cpu_nid) {
+		if (sched_autonuma->autonuma_node != selected_nid)
+			sched_autonuma->autonuma_node = selected_nid;
+		return;
+	}
+
+	selected_cpu = this_cpu;
+	selected_nid = cpu_nid;
+	weight = weight_delta_max = 0;
+
+	for_each_online_node(nid) {
+		if (nid == cpu_nid)
+			continue;
+		for_each_cpu_and(cpu, cpumask_of_node(nid), allowed) {
+			long w_nid, w_cpu_nid, w_other;
+			int w_type;
+			struct mm_struct *mm;
+			struct rq *rq = cpu_rq(cpu);
+			if (!cpu_online(cpu))
+				continue;
+
+			if (idle_cpu(cpu))
+				/*
+				 * Offload the whole IDLE balancing
+				 * and physical / logical imbalances
+				 * to CFS.
+				 */
+				continue;
+
+			mm = rq->curr->mm;
+			if (!mm)
+				continue;
+			raw_spin_lock_irq(&rq->lock);
+			/* recheck after implicit barrier() */
+			mm = rq->curr->mm;
+			if (!mm) {
+				raw_spin_unlock_irq(&rq->lock);
+				continue;
+			}
+			m_t = ACCESS_ONCE(mm->mm_autonuma->numa_fault_tot);
+			p_t = rq->curr->sched_autonuma->numa_fault_tot;
+			if (!m_t || !p_t) {
+				raw_spin_unlock_irq(&rq->lock);
+				continue;
+			}
+			m_w = ACCESS_ONCE(mm->mm_autonuma->numa_fault[nid]);
+			p_w = rq->curr->sched_autonuma->numa_fault[nid];
+			raw_spin_unlock_irq(&rq->lock);
+			if (mm == p->mm) {
+				if (p_w > p_t)
+					p_t = p_w;
+				w_other = p_w*AUTONUMA_BALANCE_SCALE/p_t;
+				w_nid = weight_current[nid];
+				w_cpu_nid = weight_current[cpu_nid];
+				w_type = W_TYPE_THREAD;
+			} else {
+				if (m_w > m_t)
+					m_t = m_w;
+				w_other = m_w*AUTONUMA_BALANCE_SCALE/m_t;
+				w_nid = weight_current_mm[nid];
+				w_cpu_nid = weight_current_mm[cpu_nid];
+				w_type = W_TYPE_PROCESS;
+			}
+
+			if (w_nid > w_other && w_nid > w_cpu_nid) {
+				weight = w_nid - w_other + w_nid - w_cpu_nid;
+
+				if (weight > weight_delta_max) {
+					weight_delta_max = weight;
+					selected_cpu = cpu;
+					selected_nid = nid;
+
+					s_w_other = w_other;
+					s_w_nid = w_nid;
+					s_w_cpu_nid = w_cpu_nid;
+					s_w_type = w_type;
+				}
+			}
+		}
+	}
+
+	if (sched_autonuma->autonuma_node != selected_nid)
+		sched_autonuma->autonuma_node = selected_nid;
+	if (selected_cpu != this_cpu) {
+		if (autonuma_debug()) {
+			char *w_type_str = NULL;
+			switch (s_w_type) {
+			case W_TYPE_THREAD:
+				w_type_str = "thread";
+				break;
+			case W_TYPE_PROCESS:
+				w_type_str = "process";
+				break;
+			}
+			printk("%p %d - %dto%d - %dto%d - %ld %ld %ld - %s\n",
+			       p->mm, p->pid, cpu_nid, selected_nid,
+			       this_cpu, selected_cpu,
+			       s_w_other, s_w_nid, s_w_cpu_nid,
+			       w_type_str);
+		}
+		BUG_ON(cpu_nid == selected_nid);
+		goto found;
+	}
+
+	return;
+
+found:
+	arg = (struct migration_arg) { p, selected_cpu };
+	/* Need help from migration thread: drop lock and wait. */
+	sched_autonuma->autonuma_flags |= SCHED_AUTONUMA_FLAG_STOP_ONE_CPU;
+	sched_preempt_enable_no_resched();
+	stop_one_cpu(this_cpu, migration_cpu_stop, &arg);
+	preempt_disable();
+	sched_autonuma->autonuma_flags &= ~SCHED_AUTONUMA_FLAG_STOP_ONE_CPU;
+	tlb_migrate_finish(p->mm);
+}
+
+bool sched_autonuma_can_migrate_task(struct task_struct *p,
+				     int numa, int dst_cpu,
+				     enum cpu_idle_type idle)
+{
+	if (!task_autonuma_cpu(p, dst_cpu)) {
+		if (numa)
+			return false;
+		if (autonuma_sched_load_balance_strict() &&
+		    idle != CPU_NEWLY_IDLE && idle != CPU_IDLE)
+			return false;
+	}
+	return true;
+}
+
+void sched_autonuma_dump_mm(void)
+{
+	int nid, cpu;
+	cpumask_var_t x;
+
+	if (!alloc_cpumask_var(&x, GFP_KERNEL))
+		return;
+	cpumask_setall(x);
+	for_each_online_node(nid) {
+		for_each_cpu(cpu, cpumask_of_node(nid)) {
+			struct rq *rq = cpu_rq(cpu);
+			struct mm_struct *mm = rq->curr->mm;
+			int nr = 0, cpux;
+			if (!cpumask_test_cpu(cpu, x))
+				continue;
+			for_each_cpu(cpux, cpumask_of_node(nid)) {
+				struct rq *rqx = cpu_rq(cpux);
+				if (rqx->curr->mm == mm) {
+					nr++;
+					cpumask_clear_cpu(cpux, x);
+				}
+			}
+			printk("nid %d mm %p nr %d\n", nid, mm, nr);
+		}
+	}
+	free_cpumask_var(x);
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ba9dccf..b12b8cd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -463,6 +463,10 @@ struct rq {
 #ifdef CONFIG_SMP
 	struct llist_head wake_list;
 #endif
+#ifdef CONFIG_AUTONUMA
+	long weight_current[MAX_NUMNODES];
+	long weight_current_mm[MAX_NUMNODES];
+#endif
 };
 
 static inline int cpu_of(struct rq *rq)
@@ -526,6 +530,12 @@ static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
 DECLARE_PER_CPU(struct sched_domain *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_id);
 
+struct migration_arg {
+	struct task_struct *task;
+	int dest_cpu;
+};
+extern int migration_cpu_stop(void *data);
+
 #endif /* CONFIG_SMP */
 
 #include "stats.h"

* [PATCH 13/35] autonuma: add page structure fields
  2012-05-25 17:02 ` Andrea Arcangeli
@ 2012-05-25 17:02   ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

On 64-bit archs, 20 bytes are used for async memory migration
(specific to the knuma_migrated per-node threads), and 4 bytes are
used for the thread NUMA false sharing detection logic.

This is a bad implementation, due to lack of time to do a proper one.

These new AutoNUMA fields should be moved to the pgdat, like memcg
does, so that they're only allocated at boot time if the kernel is
booted on NUMA hardware, and so that they're not allocated, even on
NUMA hardware, if "noautonuma" is passed as a boot parameter to the
kernel.
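
For reference, the per-page cost quoted above follows directly from
the fields added below, assuming a 64-bit build where a pointer is 8
bytes and an int is 4 bytes:

  struct list_head autonuma_migrate_node    2 * 8 = 16 bytes
  int autonuma_migrate_nid                        =  4 bytes
                     -> async memory migration    = 20 bytes
  int autonuma_last_nid                           =  4 bytes
                     -> false sharing detection   =  4 bytes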

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/mm_types.h |   25 +++++++++++++++++++++++++
 1 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 780ded7..e8dc82c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -126,6 +126,31 @@ struct page {
 		struct page *first_page;	/* Compound tail pages */
 	};
 
+#ifdef CONFIG_AUTONUMA
+	/*
+	 * FIXME: move to pgdat section along with the memcg and allocate
+	 * at runtime only in presence of a numa system.
+	 */
+	/*
+	 * To modify autonuma_last_nid locklessly, the architecture
+	 * needs SMP atomic granularity < sizeof(long); not all archs
+	 * have that, notably some alphas. Archs without it require
+	 * autonuma_last_nid to be a long.
+	 */
+#if BITS_PER_LONG > 32
+	int autonuma_migrate_nid;
+	int autonuma_last_nid;
+#else
+#if MAX_NUMNODES >= 32768
+#error "too many nodes"
+#endif
+	/* FIXME: remember to check the updates are atomic */
+	short autonuma_migrate_nid;
+	short autonuma_last_nid;
+#endif
+	struct list_head autonuma_migrate_node;
+#endif
+
 	/*
 	 * On machines where all RAM is mapped into kernel address space,
 	 * we can simply calculate the virtual address. On machines with

* [PATCH 14/35] autonuma: knuma_migrated per NUMA node queues
  2012-05-25 17:02 ` Andrea Arcangeli
@ 2012-05-25 17:02   ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

This implements the knuma_migrated queues. Pages are added to these
queues through the NUMA hinting page faults (memory-follows-CPU
algorithm with false sharing evaluation), and knuma_migrated is then
woken with a certain hysteresis to migrate the memory, in round robin
from all remote nodes, to its local node.

The list head that belongs to the local node knuma_migrated runs on
must be empty for now and is not being used.
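
A hedged sketch of the producer side described above (the function
name and the wakeup threshold are illustrative assumptions; the pgdat
fields are the ones added by this patch and page->autonuma_migrate_node
comes from the earlier struct page patch):

/* called from the NUMA hinting fault path once a page has been
 * selected for migration to dst_nid */
static void queue_page_for_knuma_migrated(struct page *page, int dst_nid)
{
	pg_data_t *pgdat = NODE_DATA(dst_nid);
	int src_nid = page_to_nid(page);
	unsigned long flags;

	spin_lock_irqsave(&pgdat->autonuma_lock, flags);
	/* one incoming list per source node, drained in round robin */
	list_add_tail(&page->autonuma_migrate_node,
		      &pgdat->autonuma_migrate_head[src_nid]);
	pgdat->autonuma_nr_migrate_pages++;
	spin_unlock_irqrestore(&pgdat->autonuma_lock, flags);

	/* hysteresis: only kick the daemon once enough pages are queued */
	if (pgdat->autonuma_nr_migrate_pages > 512 /* assumed threshold */)
		wake_up_interruptible(&pgdat->autonuma_knuma_migrated_wait);
}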

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/mmzone.h |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 41aa49b..8e578e6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -666,6 +666,12 @@ typedef struct pglist_data {
 	struct task_struct *kswapd;
 	int kswapd_max_order;
 	enum zone_type classzone_idx;
+#ifdef CONFIG_AUTONUMA
+	spinlock_t autonuma_lock;
+	struct list_head autonuma_migrate_head[MAX_NUMNODES];
+	unsigned long autonuma_nr_migrate_pages;
+	wait_queue_head_t autonuma_knuma_migrated_wait;
+#endif
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)

* [PATCH 15/35] autonuma: init knuma_migrated queues
  2012-05-25 17:02 ` Andrea Arcangeli
@ 2012-05-25 17:02   ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

Initialize the knuma_migrated queues at boot time.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/page_alloc.c |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3d69735..3d1ee70 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -58,6 +58,7 @@
 #include <linux/memcontrol.h>
 #include <linux/prefetch.h>
 #include <linux/page-debug-flags.h>
+#include <linux/autonuma.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -4295,8 +4296,18 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 	int nid = pgdat->node_id;
 	unsigned long zone_start_pfn = pgdat->node_start_pfn;
 	int ret;
+#ifdef CONFIG_AUTONUMA
+	int node_iter;
+#endif
 
 	pgdat_resize_init(pgdat);
+#ifdef CONFIG_AUTONUMA
+	spin_lock_init(&pgdat->autonuma_lock);
+	init_waitqueue_head(&pgdat->autonuma_knuma_migrated_wait);
+	pgdat->autonuma_nr_migrate_pages = 0;
+	for_each_node(node_iter)
+		INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
+#endif
 	pgdat->nr_zones = 0;
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	pgdat->kswapd_max_order = 0;

^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 16/35] autonuma: autonuma_enter/exit
  2012-05-25 17:02 ` Andrea Arcangeli
@ 2012-05-25 17:02   ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

The first gear in the whole AutoNUMA algorithm is knuma_scand. If
knuma_scand doesn't run, AutoNUMA is a full bypass. If knuma_scand is
stopped, soon all other AutoNUMA gears will settle down too.

knuma_scand is the daemon that sets the pmd_numa and pte_numa bits and so
allows the NUMA hinting page faults to start; all other actions then follow
as a reaction to that.

knuma_scand scans a list of "mm" structures. This patch adds the hooks that
register an "mm" with AutoNUMA and unregister it again, so that knuma_scand
can scan it; a rough sketch of the hooks follows.
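
For orientation, a rough sketch of what these two hooks boil down to. The
knumad_scan.mm_head list and knumad_mm_mutex come from the AutoNUMA core
patch later in the series, the mm_node link field name is made up for this
illustration, and details such as coordinating with an in-progress scan are
omitted:

/* Rough sketch only; see mm/autonuma.c in the core patch for the real code. */
void autonuma_enter(struct mm_struct *mm)
{
	if (autonuma_impossible())	/* non-NUMA hardware or permanently off */
		return;

	mutex_lock(&knumad_mm_mutex);
	/* make this mm visible to the knuma_scand daemon */
	list_add_tail(&mm->mm_autonuma->mm_node, &knumad_scan.mm_head);
	mutex_unlock(&knumad_mm_mutex);
}

void autonuma_exit(struct mm_struct *mm)
{
	if (autonuma_impossible())
		return;

	mutex_lock(&knumad_mm_mutex);
	/* stop knuma_scand from ever looking at this mm again */
	list_del(&mm->mm_autonuma->mm_node);
	mutex_unlock(&knumad_mm_mutex);
}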

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/fork.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 98db8b0..237c34e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -70,6 +70,7 @@
 #include <linux/khugepaged.h>
 #include <linux/signalfd.h>
 #include <linux/uprobes.h>
+#include <linux/autonuma.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -539,6 +540,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
 		mmu_notifier_mm_init(mm);
+		autonuma_enter(mm);
 		return mm;
 	}
 
@@ -607,6 +609,7 @@ void mmput(struct mm_struct *mm)
 		exit_aio(mm);
 		ksm_exit(mm);
 		khugepaged_exit(mm); /* must run before exit_mmap */
+		autonuma_exit(mm); /* must run before exit_mmap */
 		exit_mmap(mm);
 		set_mm_exe_file(mm, NULL);
 		if (!list_empty(&mm->mmlist)) {

^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 17/35] autonuma: call autonuma_setup_new_exec()
  2012-05-25 17:02 ` Andrea Arcangeli
@ 2012-05-25 17:02   ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

This resets all per-thread and per-process statistics across exec syscalls,
or after a kernel thread detaches from the mm. The past statistical NUMA
information is unlikely to be relevant for the future in these cases.
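
For reference, "resetting" here means forgetting the learned NUMA placement.
The core patch later in this series implements the per-thread reset helper
roughly as below: the preferred node goes back to -1 and everything from the
flags onward, including the per-node fault statistics, is zeroed
(sched_autonuma_reset_size() covers exactly that range).

/* Rough shape of the reset helper; the full version is in mm/autonuma.c. */
static void sched_autonuma_reset(struct sched_autonuma *sched_autonuma)
{
	sched_autonuma->autonuma_node = -1;	/* no preferred node anymore */
	/* wipe the flags and the per-node NUMA fault statistics */
	memset(&sched_autonuma->autonuma_flags, 0,
	       sched_autonuma_reset_size());
}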

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/exec.c        |    3 +++
 mm/mmu_context.c |    2 ++
 2 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 52c9e2f..17330ba 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
 #include <linux/pipe_fs_i.h>
 #include <linux/oom.h>
 #include <linux/compat.h>
+#include <linux/autonuma.h>
 
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -1176,6 +1177,8 @@ void setup_new_exec(struct linux_binprm * bprm)
 			
 	flush_signal_handlers(current, 0);
 	flush_old_files(current->files);
+
+	autonuma_setup_new_exec(current);
 }
 EXPORT_SYMBOL(setup_new_exec);
 
diff --git a/mm/mmu_context.c b/mm/mmu_context.c
index 3dcfaf4..40f0f13 100644
--- a/mm/mmu_context.c
+++ b/mm/mmu_context.c
@@ -7,6 +7,7 @@
 #include <linux/mmu_context.h>
 #include <linux/export.h>
 #include <linux/sched.h>
+#include <linux/autonuma.h>
 
 #include <asm/mmu_context.h>
 
@@ -58,5 +59,6 @@ void unuse_mm(struct mm_struct *mm)
 	/* active_mm is still 'mm' */
 	enter_lazy_tlb(mm, tsk);
 	task_unlock(tsk);
+	autonuma_setup_new_exec(tsk);
 }
 EXPORT_SYMBOL_GPL(unuse_mm);

^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 18/35] autonuma: alloc/free/init sched_autonuma
  2012-05-25 17:02 ` Andrea Arcangeli
@ 2012-05-25 17:02   ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

This is where the dynamically allocated sched_autonuma structure is handled.

The reason for keeping this outside of the task_struct, besides not
bloating the task_struct itself, is that it only needs to be allocated on
NUMA hardware. Non-NUMA hardware then only pays the memory cost of one
pointer per task, which remains NULL at all times.

If the kernel is compiled with CONFIG_AUTONUMA=n, not even the pointer is
present, of course.
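
The helpers used in the diff below (sched_autonuma_init(),
alloc_sched_autonuma(), free_sched_autonuma()) are defined elsewhere in the
series. Purely to illustrate the "allocate only on NUMA hardware" point, a
plausible slab-based sketch of their shape follows; it is an assumption,
not the series' actual code:

/* Plausible sketch only; not the series' actual implementation. */
static struct kmem_cache *sched_autonuma_cachep;

void __init sched_autonuma_init(void)
{
	if (num_possible_nodes() <= 1)
		return;		/* non-NUMA: never allocate anything */

	sched_autonuma_cachep = kmem_cache_create("sched_autonuma",
						  sched_autonuma_size(),
						  0, SLAB_PANIC, NULL);
}

int alloc_sched_autonuma(struct task_struct *tsk, struct task_struct *orig,
			 int node)
{
	tsk->sched_autonuma = NULL;
	if (!sched_autonuma_cachep)
		return 0;	/* non-NUMA: only the NULL pointer is paid */

	tsk->sched_autonuma = kmem_cache_alloc_node(sched_autonuma_cachep,
						    GFP_KERNEL, node);
	if (!tsk->sched_autonuma)
		return -ENOMEM;

	/* start from the parent's state; clone/fork may reset it later */
	memcpy(tsk->sched_autonuma, orig->sched_autonuma,
	       sched_autonuma_size());
	return 0;
}

void free_sched_autonuma(struct task_struct *tsk)
{
	if (tsk->sched_autonuma)
		kmem_cache_free(sched_autonuma_cachep, tsk->sched_autonuma);
}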

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/fork.c |   24 ++++++++++++++----------
 1 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 237c34e..d323eb1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -206,6 +206,7 @@ static void account_kernel_stack(struct thread_info *ti, int account)
 void free_task(struct task_struct *tsk)
 {
 	account_kernel_stack(tsk->stack, -1);
+	free_sched_autonuma(tsk);
 	free_thread_info(tsk->stack);
 	rt_mutex_debug_task_free(tsk);
 	ftrace_graph_exit_task(tsk);
@@ -260,6 +261,8 @@ void __init fork_init(unsigned long mempages)
 	/* do the arch specific task caches init */
 	arch_task_cache_init();
 
+	sched_autonuma_init();
+
 	/*
 	 * The default maximum number of threads is set to a safe
 	 * value: the thread structures can take up at most half
@@ -292,21 +295,21 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
 	struct thread_info *ti;
 	unsigned long *stackend;
 	int node = tsk_fork_get_node(orig);
-	int err;
 
 	tsk = alloc_task_struct_node(node);
-	if (!tsk)
+	if (unlikely(!tsk))
 		return NULL;
 
 	ti = alloc_thread_info_node(tsk, node);
-	if (!ti) {
-		free_task_struct(tsk);
-		return NULL;
-	}
+	if (unlikely(!ti))
+		goto out_task_struct;
 
-	err = arch_dup_task_struct(tsk, orig);
-	if (err)
-		goto out;
+	if (unlikely(arch_dup_task_struct(tsk, orig)))
+		goto out_thread_info;
+
+	if (unlikely(alloc_sched_autonuma(tsk, orig, node)))
+		/* free_thread_info() undoes arch_dup_task_struct() too */
+		goto out_thread_info;
 
 	tsk->stack = ti;
 
@@ -334,8 +337,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
 
 	return tsk;
 
-out:
+out_thread_info:
 	free_thread_info(ti);
+out_task_struct:
 	free_task_struct(tsk);
 	return NULL;
 }

^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 19/35] autonuma: alloc/free/init mm_autonuma
  2012-05-25 17:02 ` Andrea Arcangeli
@ 2012-05-25 17:02   ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

This is where the mm_autonuma structure is handled. Just like
sched_autonuma, it is only allocated at runtime if the hardware the kernel
is running on has been detected as NUMA. On non-NUMA hardware the memory
cost is reduced to one pointer per mm.

To get rid of even that pointer in each mm, the kernel can be compiled with
CONFIG_AUTONUMA=n.
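
For scale, using the sizing helpers from the core patch later in the
series: the variable part of mm_autonuma is two per-node arrays of unsigned
long (the collected NUMA fault statistics plus a scratch copy), so a 64-bit
kernel with 4 possible nodes pays sizeof(struct mm_autonuma) + 2 * 4 * 8 =
sizeof(struct mm_autonuma) + 64 bytes per mm when AutoNUMA is active, and
only the NULL pointer otherwise.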

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/fork.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index d323eb1..22f102e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -526,6 +526,8 @@ static void mm_init_aio(struct mm_struct *mm)
 
 static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 {
+	if (unlikely(alloc_mm_autonuma(mm)))
+		goto out_free_mm;
 	atomic_set(&mm->mm_users, 1);
 	atomic_set(&mm->mm_count, 1);
 	init_rwsem(&mm->mmap_sem);
@@ -548,6 +550,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 		return mm;
 	}
 
+	free_mm_autonuma(mm);
+out_free_mm:
 	free_mm(mm);
 	return NULL;
 }
@@ -597,6 +601,7 @@ void __mmdrop(struct mm_struct *mm)
 	destroy_context(mm);
 	mmu_notifier_mm_destroy(mm);
 	check_mm(mm);
+	free_mm_autonuma(mm);
 	free_mm(mm);
 }
 EXPORT_SYMBOL_GPL(__mmdrop);
@@ -880,6 +885,7 @@ fail_nocontext:
 	 * If init_new_context() failed, we cannot use mmput() to free the mm
 	 * because it calls destroy_context()
 	 */
+	free_mm_autonuma(mm);
 	mm_free_pgd(mm);
 	free_mm(mm);
 	return NULL;
@@ -1706,6 +1712,7 @@ void __init proc_caches_init(void)
 	mm_cachep = kmem_cache_create("mm_struct",
 			sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_NOTRACK, NULL);
+	mm_autonuma_init();
 	vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC);
 	mmap_init();
 	nsproxy_cache_init();

^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 20/35] autonuma: avoid CFS select_task_rq_fair to return -1
  2012-05-25 17:02 ` Andrea Arcangeli
@ 2012-05-25 17:02   ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

Fix select_task_rq_fair() so it cannot return -1: if find_idlest_cpu()
fails, fall back to prev_cpu.

Includes fixes from Hillf Danton <dhillf@gmail.com>.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/sched/fair.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 940e6d1..137119f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2789,6 +2789,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		if (new_cpu == -1 || new_cpu == cpu) {
 			/* Now try balancing at a lower domain level of cpu */
 			sd = sd->child;
+			if (new_cpu < 0)
+				/* Return prev_cpu if find_idlest_cpu failed */
+				new_cpu = prev_cpu;
 			continue;
 		}
 
@@ -2807,6 +2810,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 unlock:
 	rcu_read_unlock();
 
+	BUG_ON(new_cpu < 0);
 	return new_cpu;
 }
 #endif /* CONFIG_SMP */

^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 21/35] autonuma: teach CFS about autonuma affinity
  2012-05-25 17:02 ` Andrea Arcangeli
@ 2012-05-25 17:02   ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

The CFS scheduler is still in charge of all scheduling decisions. AutoNUMA
balancing will at times override those, but generally we just rely on the
CFS scheduler to keep doing its thing, while preferring the autonuma affine
nodes when deciding to move a process to a different runqueue or when
waking it up.

For example, the idle balancing will look into the runqueues of the busy
CPUs, but it will first search for a task that wants to run on the idle CPU
in AutoNUMA terms (task_autonuma_cpu() returning true).

Most of this is encoded in can_migrate_task() becoming AutoNUMA aware and
running two passes for each balancing pass: the first NUMA aware, the
second relaxed.

The idle/newidle balancing is always allowed to fall back to non-affine
AutoNUMA tasks. Load balancing (which is more a fairness than a performance
issue) can instead only cross the AutoNUMA affinity if the flag controlled
by /sys/kernel/mm/autonuma/scheduler/load_balance_strict is not set (it is
set by default).

Tasks that haven't been fully profiled yet are not affected by this,
because their p->sched_autonuma->autonuma_node is still set to the initial
value of -1 and task_autonuma_cpu() always returns true in that case; a
sketch of that check follows.
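
A sketch of the affinity test referred to throughout this description (the
real task_autonuma_cpu() is presumably provided by the
<linux/autonuma_sched.h> header that the diff below starts including; this
version only illustrates the "-1 means not yet profiled" behavior and is
not the exact code):

/* Illustrative only; not the exact implementation from the series. */
static inline bool task_autonuma_cpu(struct task_struct *p, int cpu)
{
	int autonuma_node;

	if (!p->sched_autonuma)		/* non-NUMA hardware or disabled */
		return true;

	autonuma_node = ACCESS_ONCE(p->sched_autonuma->autonuma_node);
	/* -1: not fully profiled yet, any CPU is as good as any other */
	return autonuma_node < 0 || autonuma_node == cpu_to_node(cpu);
}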

Includes fixes from Hillf Danton <dhillf@gmail.com>.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/sched/fair.c |   65 +++++++++++++++++++++++++++++++++++++++++++-------
 1 files changed, 56 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 137119f..99d1d33 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -26,6 +26,7 @@
 #include <linux/slab.h>
 #include <linux/profile.h>
 #include <linux/interrupt.h>
+#include <linux/autonuma_sched.h>
 
 #include <trace/events/sched.h>
 
@@ -2621,6 +2622,8 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 		load = weighted_cpuload(i);
 
 		if (load < min_load || (load == min_load && i == this_cpu)) {
+			if (!task_autonuma_cpu(p, i))
+				continue;
 			min_load = load;
 			idlest = i;
 		}
@@ -2639,24 +2642,27 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	struct sched_domain *sd;
 	struct sched_group *sg;
 	int i;
+	bool idle_target;
 
 	/*
 	 * If the task is going to be woken-up on this cpu and if it is
 	 * already idle, then it is the right target.
 	 */
-	if (target == cpu && idle_cpu(cpu))
+	if (target == cpu && idle_cpu(cpu) && task_autonuma_cpu(p, cpu))
 		return cpu;
 
 	/*
 	 * If the task is going to be woken-up on the cpu where it previously
 	 * ran and if it is currently idle, then it the right target.
 	 */
-	if (target == prev_cpu && idle_cpu(prev_cpu))
+	if (target == prev_cpu && idle_cpu(prev_cpu) &&
+	    task_autonuma_cpu(p, prev_cpu))
 		return prev_cpu;
 
 	/*
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
+	idle_target = false;
 	sd = rcu_dereference(per_cpu(sd_llc, target));
 	for_each_lower_domain(sd) {
 		sg = sd->groups;
@@ -2670,9 +2676,18 @@ static int select_idle_sibling(struct task_struct *p, int target)
 					goto next;
 			}
 
-			target = cpumask_first_and(sched_group_cpus(sg),
-					tsk_cpus_allowed(p));
-			goto done;
+			for_each_cpu_and(i, sched_group_cpus(sg),
+						tsk_cpus_allowed(p)) {
+				/* Find autonuma cpu only in idle group */
+				if (task_autonuma_cpu(p, i)) {
+					target = i;
+					goto done;
+				}
+				if (!idle_target) {
+					idle_target = true;
+					target = i;
+				}
+			}
 next:
 			sg = sg->next;
 		} while (sg != sd->groups);
@@ -2707,7 +2722,8 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		return prev_cpu;
 
 	if (sd_flag & SD_BALANCE_WAKE) {
-		if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+		if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) &&
+		    task_autonuma_cpu(p, cpu))
 			want_affine = 1;
 		new_cpu = prev_cpu;
 	}
@@ -3072,6 +3088,7 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
 
 #define LBF_ALL_PINNED	0x01
 #define LBF_NEED_BREAK	0x02
+#define LBF_NUMA	0x04
 
 struct lb_env {
 	struct sched_domain	*sd;
@@ -3142,13 +3159,14 @@ static
 int can_migrate_task(struct task_struct *p, struct lb_env *env)
 {
 	int tsk_cache_hot = 0;
+	struct cpumask *allowed = tsk_cpus_allowed(p);
 	/*
 	 * We do not migrate tasks that are:
 	 * 1) running (obviously), or
 	 * 2) cannot be migrated to this CPU due to cpus_allowed, or
 	 * 3) are cache-hot on their current CPU.
 	 */
-	if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p))) {
+	if (!cpumask_test_cpu(env->dst_cpu, allowed)) {
 		schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
 		return 0;
 	}
@@ -3159,6 +3177,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 		return 0;
 	}
 
+	if (!sched_autonuma_can_migrate_task(p, env->flags & LBF_NUMA,
+					     env->dst_cpu, env->idle))
+		return 0;
+
 	/*
 	 * Aggressive migration if:
 	 * 1) task is cache cold, or
@@ -3195,6 +3217,8 @@ static int move_one_task(struct lb_env *env)
 {
 	struct task_struct *p, *n;
 
+	env->flags |= LBF_NUMA;
+numa_repeat:
 	list_for_each_entry_safe(p, n, &env->src_rq->cfs_tasks, se.group_node) {
 		if (throttled_lb_pair(task_group(p), env->src_rq->cpu, env->dst_cpu))
 			continue;
@@ -3209,8 +3233,14 @@ static int move_one_task(struct lb_env *env)
 		 * stats here rather than inside move_task().
 		 */
 		schedstat_inc(env->sd, lb_gained[env->idle]);
+		env->flags &= ~LBF_NUMA;
 		return 1;
 	}
+	if (env->flags & LBF_NUMA) {
+		env->flags &= ~LBF_NUMA;
+		goto numa_repeat;
+	}
+
 	return 0;
 }
 
@@ -3235,6 +3265,8 @@ static int move_tasks(struct lb_env *env)
 	if (env->imbalance <= 0)
 		return 0;
 
+	env->flags |= LBF_NUMA;
+numa_repeat:
 	while (!list_empty(tasks)) {
 		p = list_first_entry(tasks, struct task_struct, se.group_node);
 
@@ -3274,9 +3306,13 @@ static int move_tasks(struct lb_env *env)
 		 * kernels will stop after the first task is pulled to minimize
 		 * the critical section.
 		 */
-		if (env->idle == CPU_NEWLY_IDLE)
-			break;
+		if (env->idle == CPU_NEWLY_IDLE) {
+			env->flags &= ~LBF_NUMA;
+			goto out;
+		}
 #endif
+		/* not idle anymore after pulling first task */
+		env->idle = CPU_NOT_IDLE;
 
 		/*
 		 * We only want to steal up to the prescribed amount of
@@ -3289,6 +3325,17 @@ static int move_tasks(struct lb_env *env)
 next:
 		list_move_tail(&p->se.group_node, tasks);
 	}
+	if ((env->flags & (LBF_NUMA|LBF_NEED_BREAK)) == LBF_NUMA) {
+		env->flags &= ~LBF_NUMA;
+		if (env->imbalance > 0) {
+			env->loop = 0;
+			env->loop_break = sched_nr_migrate_break;
+			goto numa_repeat;
+		}
+	}
+#ifdef CONFIG_PREEMPT
+out:
+#endif
 
 	/*
 	 * Right now, this is one of only two places move_task() is called,

^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 22/35] autonuma: sched_set_autonuma_need_balance
  2012-05-25 17:02 ` Andrea Arcangeli
@ 2012-05-25 17:02   ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

Invoke autonuma_balance only on the busy CPUs, at the same frequency as the
CFS load balance.
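
The implementation of sched_set_autonuma_need_balance() is not part of this
patch. Purely as an illustration of the intent, one plausible shape (the
autonuma_balance entry point is named in the description above; the gating
condition is an assumption):

/* One plausible shape, for illustration only; not the series' actual code. */
void sched_set_autonuma_need_balance(void)
{
	struct sched_autonuma *sa = current->sched_autonuma;

	/* only bother once this task has gathered some NUMA fault statistics */
	if (sa && sa->numa_fault_tot)
		autonuma_balance();
}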

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/sched/fair.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 99d1d33..1357938 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4893,6 +4893,9 @@ static void run_rebalance_domains(struct softirq_action *h)
 
 	rebalance_domains(this_cpu, idle);
 
+	if (!this_rq->idle_balance)
+		sched_set_autonuma_need_balance();
+
 	/*
 	 * If this cpu has a pending nohz_balance_kick, then do the
 	 * balancing on behalf of the other idle cpus whose ticks are

^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 23/35] autonuma: core
  2012-05-25 17:02 ` Andrea Arcangeli
@ 2012-05-25 17:02   ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

This implements knuma_scand, the NUMA hinting faults started by
knuma_scand, the knuma_migrated daemons that migrate the memory queued by
the NUMA hinting faults, the statistics gathering code run by knuma_scand
for the mm_autonuma and by the NUMA hinting page faults for the
sched_autonuma, and most of the rest of the AutoNUMA core logic, like the
false sharing detection and the sysfs and initialization routines.

When knuma_scand is not running, the AutoNUMA algorithm is a full bypass
and it must not alter the runtime behavior of the memory management and the
scheduler.

The whole AutoNUMA logic is a chain reaction resulting from the actions of
knuma_scand. The various parts of the code can be described as different
gears (gears as in glxgears).

knuma_scand is the first gear: it collects the mm_autonuma per-process
statistics and at the same time marks the ptes/pmds it scans as pte_numa
and pmd_numa. A simplified sketch of its main loop follows.
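
A heavily simplified sketch of the first gear's main loop (the tunables
scan_sleep_millisecs, scan_sleep_pass_millisecs and pages_to_scan appear in
the code below; the knumad_do_scan() helper name and the exact sleep policy
are illustrative assumptions):

/* Heavily simplified sketch of the knuma_scand kthread; not the real loop. */
static int knuma_scand_thread(void *none)
{
	for (;;) {
		int scanned;

		mutex_lock(&knumad_mm_mutex);
		/*
		 * Walk the registered mms, marking up to pages_to_scan
		 * ptes/pmds as pte_numa/pmd_numa.
		 */
		scanned = knumad_do_scan();
		mutex_unlock(&knumad_mm_mutex);

		if (kthread_should_stop())
			break;

		/* nap briefly between chunks, longer between full passes */
		schedule_timeout_interruptible(msecs_to_jiffies(
			scanned < pages_to_scan ? scan_sleep_pass_millisecs :
						  scan_sleep_millisecs));
	}
	return 0;
}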

The second gear is the NUMA hinting page faults. These are triggered by the
pte_numa/pmd_numa ptes/pmds. They collect the sched_autonuma per-thread
statistics. They also implement the memory-follow-CPU logic, which tracks
whether pages are repeatedly accessed by remote nodes. The
memory-follow-CPU logic can decide to migrate pages across NUMA nodes by
queuing them for migration in the per-node knuma_migrated queues.

The third gear is knuma_migrated. There is one knuma_migrated daemon per
node. Pages pending migration are queued in a matrix of lists. Each
knuma_migrated (in parallel with the others) goes over those lists and
migrates the queued pages, in round robin from each incoming node, to the
node it is running on. A simplified sketch of one pass follows.
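
Similarly, a simplified sketch of one third-gear pass on node "nid" (the
locking helpers and list layout match the code below; page isolation,
reference counting and the actual call into the migration code are
omitted):

/* Simplified sketch of one knuma_migrated pass; not the real code. */
static void knuma_migrated_pass(int nid)
{
	pg_data_t *pgdat = NODE_DATA(nid);
	int src_nid;

	/* round robin over the incoming queues, one per remote node */
	for_each_online_node(src_nid) {
		struct page *page;

		if (src_nid == nid)
			continue;

		autonuma_migrate_lock_irq(nid);
		if (list_empty(&pgdat->autonuma_migrate_head[src_nid])) {
			autonuma_migrate_unlock_irq(nid);
			continue;
		}
		page = list_first_entry(&pgdat->autonuma_migrate_head[src_nid],
					struct page, autonuma_migrate_node);
		/* the real code also dequeues and pins the page here */
		autonuma_migrate_unlock_irq(nid);

		/* isolate "page" and migrate it from src_nid to nid */
	}
}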

The fourth gear is the NUMA scheduler balancing code. It processes the
statistical information collected in mm->mm_autonuma and p->sched_autonuma
and evaluates the status of all CPUs to decide whether tasks should be
migrated to CPUs in remote nodes.

The code includes fixes from Hillf Danton <dhillf@gmail.com>.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/autonuma.c | 1449 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 1449 insertions(+), 0 deletions(-)
 create mode 100644 mm/autonuma.c

diff --git a/mm/autonuma.c b/mm/autonuma.c
new file mode 100644
index 0000000..88c7ab3
--- /dev/null
+++ b/mm/autonuma.c
@@ -0,0 +1,1449 @@
+/*
+ *  Copyright (C) 2012  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ *
+ *  Boot with "numa=fake=2" to test on not NUMA systems.
+ */
+
+#include <linux/mm.h>
+#include <linux/rmap.h>
+#include <linux/kthread.h>
+#include <linux/mmu_notifier.h>
+#include <linux/freezer.h>
+#include <linux/mm_inline.h>
+#include <linux/migrate.h>
+#include <linux/swap.h>
+#include <linux/autonuma.h>
+#include <asm/tlbflush.h>
+#include <asm/pgtable.h>
+
+unsigned long autonuma_flags __read_mostly =
+	(1<<AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG)|
+	(1<<AUTONUMA_SCHED_CLONE_RESET_FLAG)|
+	(1<<AUTONUMA_SCHED_FORK_RESET_FLAG)|
+#ifdef CONFIG_AUTONUMA_DEFAULT_ENABLED
+	(1<<AUTONUMA_FLAG)|
+#endif
+	(1<<AUTONUMA_SCAN_PMD_FLAG);
+
+static DEFINE_MUTEX(knumad_mm_mutex);
+
+/* knuma_scand */
+static unsigned int scan_sleep_millisecs __read_mostly = 100;
+static unsigned int scan_sleep_pass_millisecs __read_mostly = 5000;
+static unsigned int pages_to_scan __read_mostly = 128*1024*1024/PAGE_SIZE;
+static DECLARE_WAIT_QUEUE_HEAD(knuma_scand_wait);
+static unsigned long full_scans;
+static unsigned long pages_scanned;
+
+/* knuma_migrated */
+static unsigned int migrate_sleep_millisecs __read_mostly = 100;
+static unsigned int pages_to_migrate __read_mostly = 128*1024*1024/PAGE_SIZE;
+static volatile unsigned long pages_migrated;
+
+static struct knumad_scan {
+	struct list_head mm_head;
+	struct mm_struct *mm;
+	unsigned long address;
+} knumad_scan = {
+	.mm_head = LIST_HEAD_INIT(knumad_scan.mm_head),
+};
+
+static inline bool autonuma_impossible(void)
+{
+	return num_possible_nodes() <= 1 ||
+		test_bit(AUTONUMA_IMPOSSIBLE, &autonuma_flags);
+}
+
+static inline void autonuma_migrate_lock(int nid)
+{
+	spin_lock(&NODE_DATA(nid)->autonuma_lock);
+}
+
+static inline void autonuma_migrate_unlock(int nid)
+{
+	spin_unlock(&NODE_DATA(nid)->autonuma_lock);
+}
+
+static inline void autonuma_migrate_lock_irq(int nid)
+{
+	spin_lock_irq(&NODE_DATA(nid)->autonuma_lock);
+}
+
+static inline void autonuma_migrate_unlock_irq(int nid)
+{
+	spin_unlock_irq(&NODE_DATA(nid)->autonuma_lock);
+}
+
+/* caller already holds the compound_lock */
+void autonuma_migrate_split_huge_page(struct page *page,
+				      struct page *page_tail)
+{
+	int nid, last_nid;
+
+	nid = page->autonuma_migrate_nid;
+	VM_BUG_ON(nid >= MAX_NUMNODES);
+	VM_BUG_ON(nid < -1);
+	VM_BUG_ON(page_tail->autonuma_migrate_nid != -1);
+	if (nid >= 0) {
+		VM_BUG_ON(page_to_nid(page) != page_to_nid(page_tail));
+		autonuma_migrate_lock(nid);
+		list_add_tail(&page_tail->autonuma_migrate_node,
+			      &page->autonuma_migrate_node);
+		autonuma_migrate_unlock(nid);
+
+		page_tail->autonuma_migrate_nid = nid;
+	}
+
+	last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+	if (last_nid >= 0)
+		page_tail->autonuma_last_nid = last_nid;
+}
+
+void __autonuma_migrate_page_remove(struct page *page)
+{
+	unsigned long flags;
+	int nid;
+
+	flags = compound_lock_irqsave(page);
+
+	nid = page->autonuma_migrate_nid;
+	VM_BUG_ON(nid >= MAX_NUMNODES);
+	VM_BUG_ON(nid < -1);
+	if (nid >= 0) {
+		int numpages = hpage_nr_pages(page);
+		autonuma_migrate_lock(nid);
+		list_del(&page->autonuma_migrate_node);
+		NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
+		autonuma_migrate_unlock(nid);
+
+		page->autonuma_migrate_nid = -1;
+	}
+
+	compound_unlock_irqrestore(page, flags);
+}
+
+static void __autonuma_migrate_page_add(struct page *page, int dst_nid,
+					int page_nid)
+{
+	unsigned long flags;
+	int nid;
+	int numpages;
+	unsigned long nr_migrate_pages;
+	wait_queue_head_t *wait_queue;
+
+	VM_BUG_ON(dst_nid >= MAX_NUMNODES);
+	VM_BUG_ON(dst_nid < -1);
+	VM_BUG_ON(page_nid >= MAX_NUMNODES);
+	VM_BUG_ON(page_nid < -1);
+
+	VM_BUG_ON(page_nid == dst_nid);
+	VM_BUG_ON(page_to_nid(page) != page_nid);
+
+	flags = compound_lock_irqsave(page);
+
+	numpages = hpage_nr_pages(page);
+	nid = page->autonuma_migrate_nid;
+	VM_BUG_ON(nid >= MAX_NUMNODES);
+	VM_BUG_ON(nid < -1);
+	if (nid >= 0) {
+		autonuma_migrate_lock(nid);
+		list_del(&page->autonuma_migrate_node);
+		NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
+		autonuma_migrate_unlock(nid);
+	}
+
+	autonuma_migrate_lock(dst_nid);
+	list_add(&page->autonuma_migrate_node,
+		 &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid]);
+	NODE_DATA(dst_nid)->autonuma_nr_migrate_pages += numpages;
+	nr_migrate_pages = NODE_DATA(dst_nid)->autonuma_nr_migrate_pages;
+
+	autonuma_migrate_unlock(dst_nid);
+
+	page->autonuma_migrate_nid = dst_nid;
+
+	compound_unlock_irqrestore(page, flags);
+
+	if (!autonuma_migrate_defer()) {
+		wait_queue = &NODE_DATA(dst_nid)->autonuma_knuma_migrated_wait;
+		if (nr_migrate_pages >= pages_to_migrate &&
+		    nr_migrate_pages - numpages < pages_to_migrate &&
+		    waitqueue_active(wait_queue))
+			wake_up_interruptible(wait_queue);
+	}
+}
+
+static void autonuma_migrate_page_add(struct page *page, int dst_nid,
+				      int page_nid)
+{
+	int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+	if (migrate_nid != dst_nid)
+		__autonuma_migrate_page_add(page, dst_nid, page_nid);
+}
+
+static bool balance_pgdat(struct pglist_data *pgdat,
+			  int nr_migrate_pages)
+{
+	/* FIXME: this only check the wmarks, make it move
+	 * "unused" memory or pagecache by queuing it to
+	 * pgdat->autonuma_migrate_head[pgdat->node_id].
+	 */
+	int z;
+	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+		struct zone *zone = pgdat->node_zones + z;
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (zone->all_unreclaimable)
+			continue;
+
+		/*
+		 * FIXME: deal with order with THP, maybe if
+		 * kswapd will learn using compaction, otherwise
+		 * order = 0 probably is ok.
+		 * FIXME: in theory we're ok if we can obtain
+		 * pages_to_migrate pages from all zones, it doesn't
+		 * need to be all in a single zone. We care about the
+		 * pgdat, the zone not.
+		 */
+
+		/*
+		 * Try not to wakeup kswapd by allocating
+		 * pages_to_migrate pages.
+		 */
+		if (!zone_watermark_ok(zone, 0,
+				       high_wmark_pages(zone) +
+				       nr_migrate_pages,
+				       0, 0))
+			continue;
+		return true;
+	}
+	return false;
+}
+
+static void cpu_follow_memory_pass(struct task_struct *p,
+				   struct sched_autonuma *sched_autonuma,
+				   unsigned long *numa_fault)
+{
+	int nid;
+	for_each_node(nid)
+		numa_fault[nid] >>= 1;
+	sched_autonuma->numa_fault_tot >>= 1;
+}
+
+static void numa_hinting_fault_cpu_follow_memory(struct task_struct *p,
+						 int access_nid,
+						 int numpages,
+						 bool pass)
+{
+	struct sched_autonuma *sched_autonuma = p->sched_autonuma;
+	unsigned long *numa_fault = sched_autonuma->numa_fault;
+	if (unlikely(pass))
+		cpu_follow_memory_pass(p, sched_autonuma, numa_fault);
+	numa_fault[access_nid] += numpages;
+	sched_autonuma->numa_fault_tot += numpages;
+}
+
+static inline bool last_nid_set(struct task_struct *p,
+				struct page *page, int cpu_nid)
+{
+	bool ret = true;
+	int autonuma_last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+	VM_BUG_ON(cpu_nid < 0);
+	VM_BUG_ON(cpu_nid >= MAX_NUMNODES);
+	if (autonuma_last_nid >= 0 && autonuma_last_nid != cpu_nid) {
+		int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+		if (migrate_nid >= 0 && migrate_nid != cpu_nid)
+			__autonuma_migrate_page_remove(page);
+		ret = false;
+	}
+	if (autonuma_last_nid != cpu_nid)
+		ACCESS_ONCE(page->autonuma_last_nid) = cpu_nid;
+	return ret;
+}
+
+static int __page_migrate_nid(struct page *page, int page_nid)
+{
+	int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+	if (migrate_nid < 0)
+		migrate_nid = page_nid;
+#if 0
+	return page_nid;
+#endif
+	return migrate_nid;
+}
+
+static int page_migrate_nid(struct page *page)
+{
+	return __page_migrate_nid(page, page_to_nid(page));
+}
+
+static int numa_hinting_fault_memory_follow_cpu(struct task_struct *p,
+						struct page *page,
+						int cpu_nid, int page_nid,
+						bool pass)
+{
+	if (!last_nid_set(p, page, cpu_nid))
+		return __page_migrate_nid(page, page_nid);
+	if (!PageLRU(page))
+		return page_nid;
+	if (cpu_nid != page_nid)
+		autonuma_migrate_page_add(page, cpu_nid, page_nid);
+	else
+		autonuma_migrate_page_remove(page);
+	return cpu_nid;
+}
+
+void numa_hinting_fault(struct page *page, int numpages)
+{
+	WARN_ON_ONCE(!current->mm);
+	if (likely(current->mm && !current->mempolicy && autonuma_enabled())) {
+		struct task_struct *p = current;
+		int cpu_nid, page_nid, access_nid;
+		bool pass;
+
+		pass = p->sched_autonuma->numa_fault_pass !=
+			p->mm->mm_autonuma->numa_fault_pass;
+		page_nid = page_to_nid(page);
+		cpu_nid = numa_node_id();
+		VM_BUG_ON(cpu_nid < 0);
+		VM_BUG_ON(cpu_nid >= MAX_NUMNODES);
+		access_nid = numa_hinting_fault_memory_follow_cpu(p, page,
+								  cpu_nid,
+								  page_nid,
+								  pass);
+		numa_hinting_fault_cpu_follow_memory(p, access_nid,
+						     numpages, pass);
+		if (unlikely(pass))
+			p->sched_autonuma->numa_fault_pass =
+				p->mm->mm_autonuma->numa_fault_pass;
+	}
+}
+
+pte_t __pte_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
+		       unsigned long addr, pte_t pte, pte_t *ptep)
+{
+	struct page *page;
+	pte = pte_mknotnuma(pte);
+	set_pte_at(mm, addr, ptep, pte);
+	page = vm_normal_page(vma, addr, pte);
+	BUG_ON(!page);
+	numa_hinting_fault(page, 1);
+	return pte;
+}
+
+void __pmd_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
+		      unsigned long addr, pmd_t *pmdp)
+{
+	pmd_t pmd;
+	pte_t *pte;
+	unsigned long _addr = addr & PMD_MASK;
+	spinlock_t *ptl;
+	bool numa = false;
+
+	spin_lock(&mm->page_table_lock);
+	pmd = *pmdp;
+	if (pmd_numa(pmd)) {
+		set_pmd_at(mm, _addr, pmdp, pmd_mknotnuma(pmd));
+		numa = true;
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	if (!numa)
+		return;
+
+	pte = pte_offset_map_lock(mm, pmdp, _addr, &ptl);
+	for (addr = _addr; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
+		pte_t pteval = *pte;
+		struct page * page;
+		if (!pte_present(pteval))
+			continue;
+		if (pte_numa(pteval)) {
+			pteval = pte_mknotnuma(pteval);
+			set_pte_at(mm, addr, pte, pteval);
+		}
+		page = vm_normal_page(vma, addr, pteval);
+		if (unlikely(!page))
+			continue;
+		/* only check non-shared pages */
+		if (page_mapcount(page) != 1)
+			continue;
+		numa_hinting_fault(page, 1);
+	}
+	pte_unmap_unlock(pte, ptl);
+}
+
+static inline int sched_autonuma_size(void)
+{
+	return sizeof(struct sched_autonuma) +
+		num_possible_nodes() * sizeof(unsigned long);
+}
+
+static inline int sched_autonuma_reset_size(void)
+{
+	struct sched_autonuma *sched_autonuma = NULL;
+	return sched_autonuma_size() -
+		(int)((char *)(&sched_autonuma->autonuma_flags) -
+		      (char *)sched_autonuma);
+}
+
+static void sched_autonuma_reset(struct sched_autonuma *sched_autonuma)
+{
+	sched_autonuma->autonuma_node = -1;
+	memset(&sched_autonuma->autonuma_flags, 0,
+	       sched_autonuma_reset_size());
+}
+
+static inline int mm_autonuma_fault_size(void)
+{
+	return num_possible_nodes() * sizeof(unsigned long);
+}
+
+static inline unsigned long *mm_autonuma_numa_fault_tmp(struct mm_struct *mm)
+{
+	return mm->mm_autonuma->numa_fault + num_possible_nodes();
+}
+
+static inline int mm_autonuma_size(void)
+{
+	return sizeof(struct mm_autonuma) + mm_autonuma_fault_size() * 2;
+}
+
+static inline int mm_autonuma_reset_size(void)
+{
+	struct mm_autonuma *mm_autonuma = NULL;
+	return mm_autonuma_size() -
+		(int)((char *)(&mm_autonuma->numa_fault_tot) -
+		      (char *)mm_autonuma);
+}
+
+static void mm_autonuma_reset(struct mm_autonuma *mm_autonuma)
+{
+	memset(&mm_autonuma->numa_fault_tot, 0, mm_autonuma_reset_size());
+}
+
+void autonuma_setup_new_exec(struct task_struct *p)
+{
+	if (p->sched_autonuma)
+		sched_autonuma_reset(p->sched_autonuma);
+	if (p->mm && p->mm->mm_autonuma)
+		mm_autonuma_reset(p->mm->mm_autonuma);
+}
+
+static inline int knumad_test_exit(struct mm_struct *mm)
+{
+	return atomic_read(&mm->mm_users) == 0;
+}
+
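+/*
+ * Scan one pmd worth of virtual address space: account, in the
+ * mm_autonuma temporary fault array, the node each present
+ * non-shared page currently sits on, and mark the pmd (or the
+ * individual ptes) as pmd_numa/pte_numa so the next access
+ * triggers a NUMA hinting fault. Returns the progress made, in
+ * pages.
+ */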
+static int knumad_scan_pmd(struct mm_struct *mm,
+			   struct vm_area_struct *vma,
+			   unsigned long address)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte, *_pte;
+	struct page *page;
+	unsigned long _address, end;
+	spinlock_t *ptl;
+	int ret = 0;
+
+	VM_BUG_ON(address & ~PAGE_MASK);
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		goto out;
+
+	pmd = pmd_offset(pud, address);
+	if (pmd_none(*pmd))
+		goto out;
+	if (pmd_trans_huge(*pmd)) {
+		spin_lock(&mm->page_table_lock);
+		if (pmd_trans_huge(*pmd)) {
+			VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+			if (unlikely(pmd_trans_splitting(*pmd))) {
+				spin_unlock(&mm->page_table_lock);
+				wait_split_huge_page(vma->anon_vma, pmd);
+			} else {
+				int page_nid;
+				unsigned long *numa_fault_tmp;
+				ret = HPAGE_PMD_NR;
+
+				if (autonuma_scan_use_working_set() &&
+				    pmd_numa(*pmd)) {
+					spin_unlock(&mm->page_table_lock);
+					goto out;
+				}
+
+				page = pmd_page(*pmd);
+
+				/* only check non-shared pages */
+				if (page_mapcount(page) != 1) {
+					spin_unlock(&mm->page_table_lock);
+					goto out;
+				}
+
+				page_nid = page_migrate_nid(page);
+				numa_fault_tmp = mm_autonuma_numa_fault_tmp(mm);
+				numa_fault_tmp[page_nid] += ret;
+
+				if (pmd_numa(*pmd)) {
+					spin_unlock(&mm->page_table_lock);
+					goto out;
+				}
+
+				set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
+				/* defer TLB flush to lower the overhead */
+				spin_unlock(&mm->page_table_lock);
+				goto out;
+			}
+		} else
+			spin_unlock(&mm->page_table_lock);
+	}
+
+	VM_BUG_ON(!pmd_present(*pmd));
+
+	end = min(vma->vm_end, (address + PMD_SIZE) & PMD_MASK);
+	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+	for (_address = address, _pte = pte; _address < end;
+	     _pte++, _address += PAGE_SIZE) {
+		unsigned long *numa_fault_tmp;
+		pte_t pteval = *_pte;
+		if (!pte_present(pteval))
+			continue;
+		if (autonuma_scan_use_working_set() &&
+		    pte_numa(pteval))
+			continue;
+		page = vm_normal_page(vma, _address, pteval);
+		if (unlikely(!page))
+			continue;
+		/* only check non-shared pages */
+		if (page_mapcount(page) != 1)
+			continue;
+
+		numa_fault_tmp = mm_autonuma_numa_fault_tmp(mm);
+		numa_fault_tmp[page_migrate_nid(page)]++;
+
+		if (pte_numa(pteval))
+			continue;
+
+		if (!autonuma_scan_pmd())
+			set_pte_at(mm, _address, _pte, pte_mknuma(pteval));
+
+		/* defer TLB flush to lower the overhead */
+		ret++;
+	}
+	pte_unmap_unlock(pte, ptl);
+
+	if (ret && !pmd_numa(*pmd) && autonuma_scan_pmd()) {
+		spin_lock(&mm->page_table_lock);
+		set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
+		spin_unlock(&mm->page_table_lock);
+		/* defer TLB flush to lower the overhead */
+	}
+
+out:
+	return ret;
+}
+
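+/*
+ * A full scan of this mm just completed: publish the per-node
+ * fault counts collected in the temporary buffer into mm_autonuma
+ * and clear the temporary buffer for the next pass.
+ */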
+static void mm_numa_fault_flush(struct mm_struct *mm)
+{
+	int nid;
+	struct mm_autonuma *mma = mm->mm_autonuma;
+	unsigned long *numa_fault_tmp = mm_autonuma_numa_fault_tmp(mm);
+	unsigned long tot = 0;
+	/* FIXME: protect this with seqlock against autonuma_balance() */
+	for_each_node(nid) {
+		mma->numa_fault[nid] = numa_fault_tmp[nid];
+		tot += mma->numa_fault[nid];
+		numa_fault_tmp[nid] = 0;
+	}
+	mma->numa_fault_tot = tot;
+}
+
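+/*
+ * Scan up to pages_to_scan pages of the current mm; if the whole
+ * mm has been scanned (or it is exiting), move knumad_scan to the
+ * next mm in the list. Called with knumad_mm_mutex held; the mutex
+ * is dropped while the mmap_sem of the scanned mm is held and
+ * re-taken before returning.
+ */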
+static int knumad_do_scan(void)
+{
+	struct mm_struct *mm;
+	struct mm_autonuma *mm_autonuma;
+	unsigned long address;
+	struct vm_area_struct *vma;
+	int progress = 0;
+
+	mm = knumad_scan.mm;
+	if (!mm) {
+		if (unlikely(list_empty(&knumad_scan.mm_head)))
+			return pages_to_scan;
+		mm_autonuma = list_entry(knumad_scan.mm_head.next,
+					 struct mm_autonuma, mm_node);
+		mm = mm_autonuma->mm;
+		knumad_scan.address = 0;
+		knumad_scan.mm = mm;
+		atomic_inc(&mm->mm_count);
+		mm_autonuma->numa_fault_pass++;
+	}
+	address = knumad_scan.address;
+
+	mutex_unlock(&knumad_mm_mutex);
+
+	down_read(&mm->mmap_sem);
+	if (unlikely(knumad_test_exit(mm)))
+		vma = NULL;
+	else
+		vma = find_vma(mm, address);
+
+	progress++;
+	for (; vma && progress < pages_to_scan; vma = vma->vm_next) {
+		unsigned long start_addr, end_addr;
+		cond_resched();
+		if (unlikely(knumad_test_exit(mm))) {
+			progress++;
+			break;
+		}
+
+		if (!vma->anon_vma || vma_policy(vma)) {
+			progress++;
+			continue;
+		}
+		if (is_vma_temporary_stack(vma)) {
+			progress++;
+			continue;
+		}
+
+		VM_BUG_ON(address & ~PAGE_MASK);
+		if (address < vma->vm_start)
+			address = vma->vm_start;
+
+		start_addr = address;
+		while (address < vma->vm_end) {
+			cond_resched();
+			if (unlikely(knumad_test_exit(mm)))
+				break;
+
+			VM_BUG_ON(address < vma->vm_start ||
+				  address + PAGE_SIZE > vma->vm_end);
+			progress += knumad_scan_pmd(mm, vma, address);
+			/* move to next address */
+			address = (address + PMD_SIZE) & PMD_MASK;
+			if (progress >= pages_to_scan)
+				break;
+		}
+		end_addr = min(address, vma->vm_end);
+
+		/*
+		 * Flush the TLB for the mm to start the numa
+		 * hinting minor page faults after we finish
+		 * scanning this vma part.
+		 */
+		mmu_notifier_invalidate_range_start(vma->vm_mm, start_addr,
+						    end_addr);
+		flush_tlb_range(vma, start_addr, end_addr);
+		mmu_notifier_invalidate_range_end(vma->vm_mm, start_addr,
+						  end_addr);
+	}
+	up_read(&mm->mmap_sem); /* exit_mmap will destroy ptes after this */
+
+	mutex_lock(&knumad_mm_mutex);
+	VM_BUG_ON(knumad_scan.mm != mm);
+	knumad_scan.address = address;
+	/*
+	 * Change the current mm if this mm is about to die, or if we
+	 * scanned all vmas of this mm.
+	 */
+	if (knumad_test_exit(mm) || !vma) {
+		mm_autonuma = mm->mm_autonuma;
+		if (mm_autonuma->mm_node.next != &knumad_scan.mm_head) {
+			mm_autonuma = list_entry(mm_autonuma->mm_node.next,
+						 struct mm_autonuma, mm_node);
+			knumad_scan.mm = mm_autonuma->mm;
+			atomic_inc(&knumad_scan.mm->mm_count);
+			knumad_scan.address = 0;
+			knumad_scan.mm->mm_autonuma->numa_fault_pass++;
+		} else
+			knumad_scan.mm = NULL;
+
+		if (knumad_test_exit(mm))
+			list_del(&mm->mm_autonuma->mm_node);
+		else
+			mm_numa_fault_flush(mm);
+
+		mmdrop(mm);
+	}
+
+	return progress;
+}
+
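+/*
+ * Wake every knuma_migrated daemon that has pages queued, without
+ * waiting for the per-node queues to reach pages_to_migrate.
+ */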
+static void wake_up_knuma_migrated(void)
+{
+	int nid;
+
+	lru_add_drain();
+	for_each_online_node(nid) {
+		struct pglist_data *pgdat = NODE_DATA(nid);
+		if (pgdat->autonuma_nr_migrate_pages &&
+		    waitqueue_active(&pgdat->autonuma_knuma_migrated_wait))
+			wake_up_interruptible(&pgdat->
+					      autonuma_knuma_migrated_wait);
+	}
+}
+
+static void knuma_scand_disabled(void)
+{
+	if (!autonuma_enabled())
+		wait_event_freezable(knuma_scand_wait,
+				     autonuma_enabled() ||
+				     kthread_should_stop());
+}
+
+static int knuma_scand(void *none)
+{
+	struct mm_struct *mm = NULL;
+	int progress = 0, _progress;
+	unsigned long total_progress = 0;
+
+	set_freezable();
+
+	knuma_scand_disabled();
+
+	mutex_lock(&knumad_mm_mutex);
+
+	for (;;) {
+		if (unlikely(kthread_should_stop()))
+			break;
+		_progress = knumad_do_scan();
+		progress += _progress;
+		total_progress += _progress;
+		mutex_unlock(&knumad_mm_mutex);
+
+		if (unlikely(!knumad_scan.mm)) {
+			autonuma_printk("knuma_scand %lu\n", total_progress);
+			pages_scanned += total_progress;
+			total_progress = 0;
+			full_scans++;
+
+			wait_event_freezable_timeout(knuma_scand_wait,
+						     kthread_should_stop(),
+						     msecs_to_jiffies(
+						     scan_sleep_pass_millisecs));
+			/* flush the last pending pages < pages_to_migrate */
+			wake_up_knuma_migrated();
+			wait_event_freezable_timeout(knuma_scand_wait,
+						     kthread_should_stop(),
+						     msecs_to_jiffies(
+						     scan_sleep_pass_millisecs));
+
+			if (autonuma_debug()) {
+				extern void sched_autonuma_dump_mm(void);
+				sched_autonuma_dump_mm();
+			}
+
+			/* wait while there is no pinned mm */
+			knuma_scand_disabled();
+		}
+		if (progress > pages_to_scan) {
+			progress = 0;
+			wait_event_freezable_timeout(knuma_scand_wait,
+						     kthread_should_stop(),
+						     msecs_to_jiffies(
+						     scan_sleep_millisecs));
+		}
+		cond_resched();
+		mutex_lock(&knumad_mm_mutex);
+	}
+
+	mm = knumad_scan.mm;
+	knumad_scan.mm = NULL;
+	if (mm)
+		list_del(&mm->mm_autonuma->mm_node);
+	mutex_unlock(&knumad_mm_mutex);
+
+	if (mm)
+		mmdrop(mm);
+
+	return 0;
+}
+
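+/*
+ * For each source node, take the oldest page queued for migration
+ * to this pgdat and isolate it from its LRU, adding it to the
+ * migratepages list. Returns the number of base pages isolated.
+ */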
+static int isolate_migratepages(struct list_head *migratepages,
+				struct pglist_data *pgdat)
+{
+	int nr = 0, nid;
+	struct list_head *heads = pgdat->autonuma_migrate_head;
+
+	/* FIXME: THP balancing, restart from last nid */
+	for_each_online_node(nid) {
+		struct zone *zone;
+		struct page *page;
+		cond_resched();
+		VM_BUG_ON(numa_node_id() != pgdat->node_id);
+		if (nid == pgdat->node_id) {
+			VM_BUG_ON(!list_empty(&heads[nid]));
+			continue;
+		}
+		if (list_empty(&heads[nid]))
+			continue;
+		/* some page wants to go to this pgdat */
+		/*
+		 * Take the lock with irqs disabled to avoid a lock
+		 * inversion with the lru_lock which is taken before
+		 * the autonuma_migrate_lock in split_huge_page, and
+		 * that could be taken by interrupts after we obtained
+		 * the autonuma_migrate_lock here, if we didn't disable
+		 * irqs.
+		 */
+		autonuma_migrate_lock_irq(pgdat->node_id);
+		if (list_empty(&heads[nid])) {
+			autonuma_migrate_unlock_irq(pgdat->node_id);
+			continue;
+		}
+		page = list_entry(heads[nid].prev,
+				  struct page,
+				  autonuma_migrate_node);
+		if (unlikely(!get_page_unless_zero(page))) {
+			/*
+			 * Is getting freed and will remove self from the
+			 * autonuma list shortly, skip it for now.
+			 */
+			list_del(&page->autonuma_migrate_node);
+			list_add(&page->autonuma_migrate_node,
+				 &heads[nid]);
+			autonuma_migrate_unlock_irq(pgdat->node_id);
+			autonuma_printk("autonuma migrate page is free\n");
+			continue;
+		}
+		if (!PageLRU(page)) {
+			autonuma_migrate_unlock_irq(pgdat->node_id);
+			autonuma_printk("autonuma migrate page not in LRU\n");
+			__autonuma_migrate_page_remove(page);
+			put_page(page);
+			continue;
+		}
+		autonuma_migrate_unlock_irq(pgdat->node_id);
+
+		VM_BUG_ON(nid != page_to_nid(page));
+
+		if (PageAnon(page) && PageTransHuge(page))
+			/* FIXME: remove split_huge_page */
+			split_huge_page(page);
+
+		__autonuma_migrate_page_remove(page);
+
+		zone = page_zone(page);
+		spin_lock_irq(&zone->lru_lock);
+		if (!__isolate_lru_page(page, ISOLATE_ACTIVE|ISOLATE_INACTIVE,
+					0)) {
+			VM_BUG_ON(PageTransCompound(page));
+			del_page_from_lru_list(zone, page, page_lru(page));
+			inc_zone_state(zone, page_is_file_cache(page) ?
+				       NR_ISOLATED_FILE : NR_ISOLATED_ANON);
+			spin_unlock_irq(&zone->lru_lock);
+			/*
+			 * Hold the page pin at least until
+			 * __isolate_lru_page succeeds
+			 * (__isolate_lru_page takes a second pin when
+			 * it succeeds). If we released the pin before
+			 * __isolate_lru_page returned, the page could
+			 * have been freed and reallocated from under
+			 * us, rendering worthless our previous checks
+			 * on the page, including the split_huge_page
+			 * call.
+			 */
+			put_page(page);
+
+			list_add(&page->lru, migratepages);
+			nr += hpage_nr_pages(page);
+		} else {
+			/* FIXME: losing page, safest and simplest for now */
+			spin_unlock_irq(&zone->lru_lock);
+			put_page(page);
+			autonuma_printk("autonuma migrate page lost\n");
+		}
+	}
+
+	return nr;
+}
+
+static struct page *alloc_migrate_dst_page(struct page *page,
+					   unsigned long data,
+					   int **result)
+{
+	int nid = (int) data;
+	struct page *newpage;
+	newpage = alloc_pages_exact_node(nid,
+					 GFP_HIGHUSER_MOVABLE | GFP_THISNODE,
+					 0);
+	if (newpage)
+		newpage->autonuma_last_nid = page->autonuma_last_nid;
+	return newpage;
+}
+
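+/*
+ * Migrate up to pages_to_migrate of the queued pages to this node,
+ * as long as the node watermarks show it can absorb them.
+ */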
+static void knumad_do_migrate(struct pglist_data *pgdat)
+{
+	int nr_migrate_pages = 0;
+	LIST_HEAD(migratepages);
+
+	autonuma_printk("nr_migrate_pages %lu to node %d\n",
+			pgdat->autonuma_nr_migrate_pages, pgdat->node_id);
+	do {
+		int isolated = 0;
+		if (balance_pgdat(pgdat, nr_migrate_pages))
+			isolated = isolate_migratepages(&migratepages, pgdat);
+		/* FIXME: might need to check too many isolated */
+		if (!isolated)
+			break;
+		nr_migrate_pages += isolated;
+	} while (nr_migrate_pages < pages_to_migrate);
+
+	if (nr_migrate_pages) {
+		int err;
+		autonuma_printk("migrate %d to node %d\n", nr_migrate_pages,
+				pgdat->node_id);
+		pages_migrated += nr_migrate_pages; /* FIXME: per node */
+		err = migrate_pages(&migratepages, alloc_migrate_dst_page,
+				    pgdat->node_id, false, true);
+		if (err)
+			/* FIXME: requeue failed pages */
+			putback_lru_pages(&migratepages);
+	}
+}
+
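+/*
+ * Per-node daemon: migrates to this node the pages that the NUMA
+ * hinting faults queued for it, then waits either for more pages
+ * to be queued or for a short timeout before retrying.
+ */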
+static int knuma_migrated(void *arg)
+{
+	struct pglist_data *pgdat = (struct pglist_data *)arg;
+	int nid = pgdat->node_id;
+	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(nowakeup);
+
+	set_freezable();
+
+	for (;;) {
+		if (unlikely(kthread_should_stop()))
+			break;
+		/* FIXME: scan the free levels of this node; we may not
+		 * be allowed to receive memory if the watermarks of
+		 * this pgdat are below high.  In the future also add
+		 * not-interesting pages, like not-accessed pages, to
+		 * pgdat->autonuma_migrate_head[pgdat->node_id], so we
+		 * can move our memory away to other nodes in order
+		 * to satisfy the high watermark described above (so
+		 * migration can continue).
+		 */
+		knumad_do_migrate(pgdat);
+		if (!pgdat->autonuma_nr_migrate_pages) {
+			wait_event_freezable(
+				pgdat->autonuma_knuma_migrated_wait,
+				pgdat->autonuma_nr_migrate_pages ||
+				kthread_should_stop());
+			autonuma_printk("wake knuma_migrated %d\n", nid);
+		} else
+			wait_event_freezable_timeout(nowakeup,
+						     kthread_should_stop(),
+						     msecs_to_jiffies(
+						     migrate_sleep_millisecs));
+	}
+
+	return 0;
+}
+
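+/* Register an mm with knuma_scand so its address space gets scanned. */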
+void autonuma_enter(struct mm_struct *mm)
+{
+	if (autonuma_impossible())
+		return;
+
+	mutex_lock(&knumad_mm_mutex);
+	list_add_tail(&mm->mm_autonuma->mm_node, &knumad_scan.mm_head);
+	mutex_unlock(&knumad_mm_mutex);
+}
+
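+/*
+ * Unregister an mm at exit time. If knuma_scand is currently
+ * scanning this mm we cannot remove it from the list here: taking
+ * and releasing mmap_sem for writing serializes against the scan,
+ * and knuma_scand itself will drop the mm once it notices it is
+ * exiting.
+ */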
+void autonuma_exit(struct mm_struct *mm)
+{
+	bool serialize;
+
+	if (autonuma_impossible())
+		return;
+
+	serialize = false;
+	mutex_lock(&knumad_mm_mutex);
+	if (knumad_scan.mm == mm)
+		serialize = true;
+	else
+		list_del(&mm->mm_autonuma->mm_node);
+	mutex_unlock(&knumad_mm_mutex);
+
+	if (serialize) {
+		down_write(&mm->mmap_sem);
+		up_write(&mm->mmap_sem);
+	}
+}
+
+static int start_knuma_scand(void)
+{
+	int err = 0;
+	struct task_struct *knumad_thread;
+
+	knumad_thread = kthread_run(knuma_scand, NULL, "knuma_scand");
+	if (unlikely(IS_ERR(knumad_thread))) {
+		autonuma_printk(KERN_ERR
+				"knumad: kthread_run(knuma_scand) failed\n");
+		err = PTR_ERR(knumad_thread);
+	}
+	return err;
+}
+
+static int start_knuma_migrated(void)
+{
+	int err = 0;
+	struct task_struct *knumad_thread;
+	int nid;
+
+	for_each_online_node(nid) {
+		knumad_thread = kthread_create_on_node(knuma_migrated,
+						       NODE_DATA(nid),
+						       nid,
+						       "knuma_migrated%d",
+						       nid);
+		if (unlikely(IS_ERR(knumad_thread))) {
+			autonuma_printk(KERN_ERR
+					"knumad: "
+					"kthread_run(knuma_migrated%d) "
+					"failed\n", nid);
+			err = PTR_ERR(knumad_thread);
+		} else {
+			autonuma_printk("cpumask %d %lx\n", nid,
+					cpumask_of_node(nid)->bits[0]);
+			kthread_bind_node(knumad_thread, nid);
+			wake_up_process(knumad_thread);
+		}
+	}
+	return err;
+}
+
+
+#ifdef CONFIG_SYSFS
+
+static ssize_t flag_show(struct kobject *kobj,
+			 struct kobj_attribute *attr, char *buf,
+			 enum autonuma_flag flag)
+{
+	return sprintf(buf, "%d\n",
+		       !!test_bit(flag, &autonuma_flags));
+}
+static ssize_t flag_store(struct kobject *kobj,
+			  struct kobj_attribute *attr,
+			  const char *buf, size_t count,
+			  enum autonuma_flag flag)
+{
+	unsigned long value;
+	int ret;
+
+	ret = kstrtoul(buf, 10, &value);
+	if (ret < 0)
+		return ret;
+	if (value > 1)
+		return -EINVAL;
+
+	if (value)
+		set_bit(flag, &autonuma_flags);
+	else
+		clear_bit(flag, &autonuma_flags);
+
+	return count;
+}
+
+static ssize_t enabled_show(struct kobject *kobj,
+			    struct kobj_attribute *attr, char *buf)
+{
+	return flag_show(kobj, attr, buf, AUTONUMA_FLAG);
+}
+static ssize_t enabled_store(struct kobject *kobj,
+			     struct kobj_attribute *attr,
+			     const char *buf, size_t count)
+{
+	ssize_t ret;
+
+	ret = flag_store(kobj, attr, buf, count, AUTONUMA_FLAG);
+
+	if (ret > 0 && autonuma_enabled())
+		wake_up_interruptible(&knuma_scand_wait);
+
+	return ret;
+}
+static struct kobj_attribute enabled_attr =
+	__ATTR(enabled, 0644, enabled_show, enabled_store);
+
+#define SYSFS_ENTRY(NAME, FLAG)						\
+static ssize_t NAME ## _show(struct kobject *kobj,			\
+			     struct kobj_attribute *attr, char *buf)	\
+{									\
+	return flag_show(kobj, attr, buf, FLAG);			\
+}									\
+									\
+static ssize_t NAME ## _store(struct kobject *kobj,			\
+			      struct kobj_attribute *attr,		\
+			      const char *buf, size_t count)		\
+{									\
+	return flag_store(kobj, attr, buf, count, FLAG);		\
+}									\
+static struct kobj_attribute NAME ## _attr =				\
+	__ATTR(NAME, 0644, NAME ## _show, NAME ## _store);
+
+SYSFS_ENTRY(debug, AUTONUMA_DEBUG_FLAG);
+SYSFS_ENTRY(pmd, AUTONUMA_SCAN_PMD_FLAG);
+SYSFS_ENTRY(working_set, AUTONUMA_SCAN_USE_WORKING_SET_FLAG);
+SYSFS_ENTRY(defer, AUTONUMA_MIGRATE_DEFER_FLAG);
+SYSFS_ENTRY(load_balance_strict, AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG);
+SYSFS_ENTRY(clone_reset, AUTONUMA_SCHED_CLONE_RESET_FLAG);
+SYSFS_ENTRY(fork_reset, AUTONUMA_SCHED_FORK_RESET_FLAG);
+
+#undef SYSFS_ENTRY
+
+enum {
+	SYSFS_KNUMA_SCAND_SLEEP_ENTRY,
+	SYSFS_KNUMA_SCAND_PAGES_ENTRY,
+	SYSFS_KNUMA_MIGRATED_SLEEP_ENTRY,
+	SYSFS_KNUMA_MIGRATED_PAGES_ENTRY,
+};
+
+#define SYSFS_ENTRY(NAME, SYSFS_TYPE)				\
+static ssize_t NAME ## _show(struct kobject *kobj,		\
+			     struct kobj_attribute *attr,	\
+			     char *buf)				\
+{								\
+	return sprintf(buf, "%u\n", NAME);			\
+}								\
+static ssize_t NAME ## _store(struct kobject *kobj,		\
+			      struct kobj_attribute *attr,	\
+			      const char *buf, size_t count)	\
+{								\
+	unsigned long val;					\
+	int err;						\
+								\
+	err = strict_strtoul(buf, 10, &val);			\
+	if (err || val > UINT_MAX)				\
+		return -EINVAL;					\
+	switch (SYSFS_TYPE) {					\
+	case SYSFS_KNUMA_SCAND_PAGES_ENTRY:			\
+	case SYSFS_KNUMA_MIGRATED_PAGES_ENTRY:			\
+		if (!val)					\
+			return -EINVAL;				\
+		break;						\
+	}							\
+								\
+	NAME = val;						\
+	switch (SYSFS_TYPE) {					\
+	case SYSFS_KNUMA_SCAND_SLEEP_ENTRY:			\
+		wake_up_interruptible(&knuma_scand_wait);	\
+		break;						\
+	case							\
+		SYSFS_KNUMA_MIGRATED_SLEEP_ENTRY:		\
+		wake_up_knuma_migrated();			\
+		break;						\
+	}							\
+								\
+	return count;						\
+}								\
+static struct kobj_attribute NAME ## _attr =			\
+	__ATTR(NAME, 0644, NAME ## _show, NAME ## _store);
+
+SYSFS_ENTRY(scan_sleep_millisecs, SYSFS_KNUMA_SCAND_SLEEP_ENTRY);
+SYSFS_ENTRY(scan_sleep_pass_millisecs, SYSFS_KNUMA_SCAND_SLEEP_ENTRY);
+SYSFS_ENTRY(pages_to_scan, SYSFS_KNUMA_SCAND_PAGES_ENTRY);
+
+SYSFS_ENTRY(migrate_sleep_millisecs, SYSFS_KNUMA_MIGRATED_SLEEP_ENTRY);
+SYSFS_ENTRY(pages_to_migrate, SYSFS_KNUMA_MIGRATED_PAGES_ENTRY);
+
+#undef SYSFS_ENTRY
+
+static struct attribute *autonuma_attr[] = {
+	&enabled_attr.attr,
+	&debug_attr.attr,
+	NULL,
+};
+static struct attribute_group autonuma_attr_group = {
+	.attrs = autonuma_attr,
+};
+
+#define SYSFS_ENTRY(NAME)					\
+static ssize_t NAME ## _show(struct kobject *kobj,		\
+			     struct kobj_attribute *attr,	\
+			     char *buf)				\
+{								\
+	return sprintf(buf, "%lu\n", NAME);			\
+}								\
+static struct kobj_attribute NAME ## _attr =			\
+	__ATTR_RO(NAME);
+
+SYSFS_ENTRY(full_scans);
+SYSFS_ENTRY(pages_scanned);
+SYSFS_ENTRY(pages_migrated);
+
+#undef SYSFS_ENTRY
+
+static struct attribute *knuma_scand_attr[] = {
+	&scan_sleep_millisecs_attr.attr,
+	&scan_sleep_pass_millisecs_attr.attr,
+	&pages_to_scan_attr.attr,
+	&pages_scanned_attr.attr,
+	&full_scans_attr.attr,
+	&pmd_attr.attr,
+	&working_set_attr.attr,
+	NULL,
+};
+static struct attribute_group knuma_scand_attr_group = {
+	.attrs = knuma_scand_attr,
+	.name = "knuma_scand",
+};
+
+static struct attribute *knuma_migrated_attr[] = {
+	&migrate_sleep_millisecs_attr.attr,
+	&pages_to_migrate_attr.attr,
+	&pages_migrated_attr.attr,
+	&defer_attr.attr,
+	NULL,
+};
+static struct attribute_group knuma_migrated_attr_group = {
+	.attrs = knuma_migrated_attr,
+	.name = "knuma_migrated",
+};
+
+static struct attribute *scheduler_attr[] = {
+	&clone_reset_attr.attr,
+	&fork_reset_attr.attr,
+	&load_balance_strict_attr.attr,
+	NULL,
+};
+static struct attribute_group scheduler_attr_group = {
+	.attrs = scheduler_attr,
+	.name = "scheduler",
+};
+
+static int __init autonuma_init_sysfs(struct kobject **autonuma_kobj)
+{
+	int err;
+
+	*autonuma_kobj = kobject_create_and_add("autonuma", mm_kobj);
+	if (unlikely(!*autonuma_kobj)) {
+		printk(KERN_ERR "autonuma: failed kobject create\n");
+		return -ENOMEM;
+	}
+
+	err = sysfs_create_group(*autonuma_kobj, &autonuma_attr_group);
+	if (err) {
+		printk(KERN_ERR "autonuma: failed register autonuma group\n");
+		goto delete_obj;
+	}
+
+	err = sysfs_create_group(*autonuma_kobj, &knuma_scand_attr_group);
+	if (err) {
+		printk(KERN_ERR
+		       "autonuma: failed register knuma_scand group\n");
+		goto remove_autonuma;
+	}
+
+	err = sysfs_create_group(*autonuma_kobj, &knuma_migrated_attr_group);
+	if (err) {
+		printk(KERN_ERR
+		       "autonuma: failed register knuma_migrated group\n");
+		goto remove_knuma_scand;
+	}
+
+	err = sysfs_create_group(*autonuma_kobj, &scheduler_attr_group);
+	if (err) {
+		printk(KERN_ERR
+		       "autonuma: failed register scheduler group\n");
+		goto remove_knuma_migrated;
+	}
+
+	return 0;
+
+remove_knuma_migrated:
+	sysfs_remove_group(*autonuma_kobj, &knuma_migrated_attr_group);
+remove_knuma_scand:
+	sysfs_remove_group(*autonuma_kobj, &knuma_scand_attr_group);
+remove_autonuma:
+	sysfs_remove_group(*autonuma_kobj, &autonuma_attr_group);
+delete_obj:
+	kobject_put(*autonuma_kobj);
+	return err;
+}
+
+static void __init autonuma_exit_sysfs(struct kobject *autonuma_kobj)
+{
+	sysfs_remove_group(autonuma_kobj, &knuma_migrated_attr_group);
+	sysfs_remove_group(autonuma_kobj, &knuma_scand_attr_group);
+	sysfs_remove_group(autonuma_kobj, &autonuma_attr_group);
+	kobject_put(autonuma_kobj);
+}
+#else
+static inline int autonuma_init_sysfs(struct kobject **autonuma_kobj)
+{
+	return 0;
+}
+
+static inline void autonuma_exit_sysfs(struct kobject *autonuma_kobj)
+{
+}
+#endif /* CONFIG_SYSFS */
+
+static int __init noautonuma_setup(char *str)
+{
+	if (!autonuma_impossible()) {
+		printk("AutoNUMA permanently disabled\n");
+		set_bit(AUTONUMA_IMPOSSIBLE, &autonuma_flags);
+		BUG_ON(!autonuma_impossible());
+	}
+	return 1;
+}
+__setup("noautonuma", noautonuma_setup);
+
+static int __init autonuma_init(void)
+{
+	int err;
+	struct kobject *autonuma_kobj;
+
+	VM_BUG_ON(num_possible_nodes() < 1);
+	if (autonuma_impossible())
+		return -EINVAL;
+
+	err = autonuma_init_sysfs(&autonuma_kobj);
+	if (err)
+		return err;
+
+	err = start_knuma_scand();
+	if (err) {
+		printk("failed to start knuma_scand\n");
+		goto out;
+	}
+	err = start_knuma_migrated();
+	if (err) {
+		printk("failed to start knuma_migrated\n");
+		goto out;
+	}
+
+	printk("AutoNUMA initialized successfully\n");
+	return err;
+
+out:
+	autonuma_exit_sysfs(autonuma_kobj);
+	return err;
+}
+module_init(autonuma_init)
+
+static struct kmem_cache *sched_autonuma_cachep;
+
+int alloc_sched_autonuma(struct task_struct *tsk, struct task_struct *orig,
+			 int node)
+{
+	int err = 1;
+	struct sched_autonuma *sched_autonuma;
+
+	if (autonuma_impossible())
+		goto no_numa;
+	sched_autonuma = kmem_cache_alloc_node(sched_autonuma_cachep,
+					       GFP_KERNEL, node);
+	if (!sched_autonuma)
+		goto out;
+	if (autonuma_sched_clone_reset())
+		sched_autonuma_reset(sched_autonuma);
+	else {
+		memcpy(sched_autonuma, orig->sched_autonuma,
+		       sched_autonuma_size());
+		BUG_ON(sched_autonuma->autonuma_flags &
+		       SCHED_AUTONUMA_FLAG_STOP_ONE_CPU);
+		sched_autonuma->autonuma_flags = 0;
+	}
+	tsk->sched_autonuma = sched_autonuma;
+no_numa:
+	err = 0;
+out:
+	return err;
+}
+
+void free_sched_autonuma(struct task_struct *tsk)
+{
+	if (autonuma_impossible()) {
+		BUG_ON(tsk->sched_autonuma);
+		return;
+	}
+
+	BUG_ON(!tsk->sched_autonuma);
+	kmem_cache_free(sched_autonuma_cachep, tsk->sched_autonuma);
+	tsk->sched_autonuma = NULL;
+}
+
+void __init sched_autonuma_init(void)
+{
+	struct sched_autonuma *sched_autonuma;
+
+	BUG_ON(current != &init_task);
+
+	if (autonuma_impossible())
+		return;
+
+	sched_autonuma_cachep =
+		kmem_cache_create("sched_autonuma",
+				  sched_autonuma_size(), 0,
+				  SLAB_PANIC | SLAB_HWCACHE_ALIGN, NULL);
+
+	sched_autonuma = kmem_cache_alloc_node(sched_autonuma_cachep,
+					       GFP_KERNEL, numa_node_id());
+	BUG_ON(!sched_autonuma);
+	sched_autonuma_reset(sched_autonuma);
+	BUG_ON(current->sched_autonuma);
+	current->sched_autonuma = sched_autonuma;
+}
+
+static struct kmem_cache *mm_autonuma_cachep;
+
+int alloc_mm_autonuma(struct mm_struct *mm)
+{
+	int err = 1;
+	struct mm_autonuma *mm_autonuma;
+
+	if (autonuma_impossible())
+		goto no_numa;
+	mm_autonuma = kmem_cache_alloc(mm_autonuma_cachep, GFP_KERNEL);
+	if (!mm_autonuma)
+		goto out;
+	if (autonuma_sched_fork_reset() || !mm->mm_autonuma)
+		mm_autonuma_reset(mm_autonuma);
+	else
+		memcpy(mm_autonuma, mm->mm_autonuma, mm_autonuma_size());
+	mm->mm_autonuma = mm_autonuma;
+	mm_autonuma->mm = mm;
+no_numa:
+	err = 0;
+out:
+	return err;
+}
+
+void free_mm_autonuma(struct mm_struct *mm)
+{
+	if (autonuma_impossible()) {
+		BUG_ON(mm->mm_autonuma);
+		return;
+	}
+
+	BUG_ON(!mm->mm_autonuma);
+	kmem_cache_free(mm_autonuma_cachep, mm->mm_autonuma);
+	mm->mm_autonuma = NULL;
+}
+
+void __init mm_autonuma_init(void)
+{
+	BUG_ON(current != &init_task);
+	BUG_ON(current->mm);
+
+	if (autonuma_impossible())
+		return;
+
+	mm_autonuma_cachep =
+		kmem_cache_create("mm_autonuma",
+				  mm_autonuma_size(), 0,
+				  SLAB_PANIC | SLAB_HWCACHE_ALIGN, NULL);
+}

^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 23/35] autonuma: core
@ 2012-05-25 17:02   ` Andrea Arcangeli
  0 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

This implements knuma_scand, the NUMA hinting page faults started by
knuma_scand, the knuma_migrated daemons that migrate the memory queued
by the NUMA hinting faults, the statistics gathering code run by
knuma_scand for the mm_autonuma statistics and by the NUMA hinting
page faults for the sched_autonuma statistics, and most of the rest of
the AutoNUMA core logic, such as the false sharing detection, the
sysfs interface and the initialization routines.

When knuma_scand is not running, the AutoNUMA algorithm is a full
bypass and must not alter the runtime behavior of the memory
management or of the scheduler.

The whole AutoNUMA logic is a chain reaction set off by the actions of
knuma_scand. The various parts of the code can be described as
different gears (gears as in glxgears).

knuma_scand is the first gear: it collects the mm_autonuma per-process
statistics and at the same time marks the ptes and pmds it scans as
pte_numa and pmd_numa.

The second gear is the NUMA hinting page faults, triggered by the
pte_numa/pmd_numa bits set on the ptes and pmds. They collect the
sched_autonuma per-thread statistics. They also implement the
memory-follows-CPU logic, which tracks whether pages are repeatedly
accessed by remote nodes. The memory-follows-CPU logic can decide to
migrate pages across different NUMA nodes by queuing the pages for
migration in the per-node knuma_migrated queues.

The third gear is knuma_migrated. There is one knuma_migrated daemon
per node. Pages pending migration are queued in a matrix of lists.
Each knuma_migrated (in parallel with the others) walks those lists
and migrates the pages queued for migration, in round robin from each
incoming node, to the node it is running on.

The fourth gear is the NUMA scheduler balancing code. It evaluates the
statistical information collected in mm->mm_autonuma and
p->sched_autonuma, together with the status of all CPUs, to decide
whether tasks should be migrated to CPUs on remote nodes.

The code includes fixes from Hillf Danton <dhillf@gmail.com>.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/autonuma.c | 1449 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 1449 insertions(+), 0 deletions(-)
 create mode 100644 mm/autonuma.c

diff --git a/mm/autonuma.c b/mm/autonuma.c
new file mode 100644
index 0000000..88c7ab3
--- /dev/null
+++ b/mm/autonuma.c
@@ -0,0 +1,1449 @@
+/*
+ *  Copyright (C) 2012  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ *
+ *  Boot with "numa=fake=2" to test on non-NUMA systems.
+ */
+
+#include <linux/mm.h>
+#include <linux/rmap.h>
+#include <linux/kthread.h>
+#include <linux/mmu_notifier.h>
+#include <linux/freezer.h>
+#include <linux/mm_inline.h>
+#include <linux/migrate.h>
+#include <linux/swap.h>
+#include <linux/autonuma.h>
+#include <asm/tlbflush.h>
+#include <asm/pgtable.h>
+
+unsigned long autonuma_flags __read_mostly =
+	(1<<AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG)|
+	(1<<AUTONUMA_SCHED_CLONE_RESET_FLAG)|
+	(1<<AUTONUMA_SCHED_FORK_RESET_FLAG)|
+#ifdef CONFIG_AUTONUMA_DEFAULT_ENABLED
+	(1<<AUTONUMA_FLAG)|
+#endif
+	(1<<AUTONUMA_SCAN_PMD_FLAG);
+
+static DEFINE_MUTEX(knumad_mm_mutex);
+
+/* knuma_scand */
+static unsigned int scan_sleep_millisecs __read_mostly = 100;
+static unsigned int scan_sleep_pass_millisecs __read_mostly = 5000;
+static unsigned int pages_to_scan __read_mostly = 128*1024*1024/PAGE_SIZE;
+static DECLARE_WAIT_QUEUE_HEAD(knuma_scand_wait);
+static unsigned long full_scans;
+static unsigned long pages_scanned;
+
+/* knuma_migrated */
+static unsigned int migrate_sleep_millisecs __read_mostly = 100;
+static unsigned int pages_to_migrate __read_mostly = 128*1024*1024/PAGE_SIZE;
+static volatile unsigned long pages_migrated;
+
+static struct knumad_scan {
+	struct list_head mm_head;
+	struct mm_struct *mm;
+	unsigned long address;
+} knumad_scan = {
+	.mm_head = LIST_HEAD_INIT(knumad_scan.mm_head),
+};
+
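+/*
+ * AutoNUMA is "impossible" (and stays fully disabled, with no
+ * per-task or per-mm allocations) when the machine has a single
+ * NUMA node or when "noautonuma" is passed on the kernel command
+ * line.
+ */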
+static inline bool autonuma_impossible(void)
+{
+	return num_possible_nodes() <= 1 ||
+		test_bit(AUTONUMA_IMPOSSIBLE, &autonuma_flags);
+}
+
+static inline void autonuma_migrate_lock(int nid)
+{
+	spin_lock(&NODE_DATA(nid)->autonuma_lock);
+}
+
+static inline void autonuma_migrate_unlock(int nid)
+{
+	spin_unlock(&NODE_DATA(nid)->autonuma_lock);
+}
+
+static inline void autonuma_migrate_lock_irq(int nid)
+{
+	spin_lock_irq(&NODE_DATA(nid)->autonuma_lock);
+}
+
+static inline void autonuma_migrate_unlock_irq(int nid)
+{
+	spin_unlock_irq(&NODE_DATA(nid)->autonuma_lock);
+}
+
+/* caller already holds the compound_lock */
+void autonuma_migrate_split_huge_page(struct page *page,
+				      struct page *page_tail)
+{
+	int nid, last_nid;
+
+	nid = page->autonuma_migrate_nid;
+	VM_BUG_ON(nid >= MAX_NUMNODES);
+	VM_BUG_ON(nid < -1);
+	VM_BUG_ON(page_tail->autonuma_migrate_nid != -1);
+	if (nid >= 0) {
+		VM_BUG_ON(page_to_nid(page) != page_to_nid(page_tail));
+		autonuma_migrate_lock(nid);
+		list_add_tail(&page_tail->autonuma_migrate_node,
+			      &page->autonuma_migrate_node);
+		autonuma_migrate_unlock(nid);
+
+		page_tail->autonuma_migrate_nid = nid;
+	}
+
+	last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+	if (last_nid >= 0)
+		page_tail->autonuma_last_nid = last_nid;
+}
+
+void __autonuma_migrate_page_remove(struct page *page)
+{
+	unsigned long flags;
+	int nid;
+
+	flags = compound_lock_irqsave(page);
+
+	nid = page->autonuma_migrate_nid;
+	VM_BUG_ON(nid >= MAX_NUMNODES);
+	VM_BUG_ON(nid < -1);
+	if (nid >= 0) {
+		int numpages = hpage_nr_pages(page);
+		autonuma_migrate_lock(nid);
+		list_del(&page->autonuma_migrate_node);
+		NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
+		autonuma_migrate_unlock(nid);
+
+		page->autonuma_migrate_nid = -1;
+	}
+
+	compound_unlock_irqrestore(page, flags);
+}
+
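+/*
+ * Queue the page for migration to dst_nid: drop it from any queue
+ * it was already on, add it to dst_nid's incoming list indexed by
+ * the node the page currently sits on, and wake knuma_migrated on
+ * dst_nid once the queue crosses the pages_to_migrate threshold
+ * (unless the defer mode is enabled).
+ */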
+static void __autonuma_migrate_page_add(struct page *page, int dst_nid,
+					int page_nid)
+{
+	unsigned long flags;
+	int nid;
+	int numpages;
+	unsigned long nr_migrate_pages;
+	wait_queue_head_t *wait_queue;
+
+	VM_BUG_ON(dst_nid >= MAX_NUMNODES);
+	VM_BUG_ON(dst_nid < -1);
+	VM_BUG_ON(page_nid >= MAX_NUMNODES);
+	VM_BUG_ON(page_nid < -1);
+
+	VM_BUG_ON(page_nid == dst_nid);
+	VM_BUG_ON(page_to_nid(page) != page_nid);
+
+	flags = compound_lock_irqsave(page);
+
+	numpages = hpage_nr_pages(page);
+	nid = page->autonuma_migrate_nid;
+	VM_BUG_ON(nid >= MAX_NUMNODES);
+	VM_BUG_ON(nid < -1);
+	if (nid >= 0) {
+		autonuma_migrate_lock(nid);
+		list_del(&page->autonuma_migrate_node);
+		NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
+		autonuma_migrate_unlock(nid);
+	}
+
+	autonuma_migrate_lock(dst_nid);
+	list_add(&page->autonuma_migrate_node,
+		 &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid]);
+	NODE_DATA(dst_nid)->autonuma_nr_migrate_pages += numpages;
+	nr_migrate_pages = NODE_DATA(dst_nid)->autonuma_nr_migrate_pages;
+
+	autonuma_migrate_unlock(dst_nid);
+
+	page->autonuma_migrate_nid = dst_nid;
+
+	compound_unlock_irqrestore(page, flags);
+
+	if (!autonuma_migrate_defer()) {
+		wait_queue = &NODE_DATA(dst_nid)->autonuma_knuma_migrated_wait;
+		if (nr_migrate_pages >= pages_to_migrate &&
+		    nr_migrate_pages - numpages < pages_to_migrate &&
+		    waitqueue_active(wait_queue))
+			wake_up_interruptible(wait_queue);
+	}
+}
+
+static void autonuma_migrate_page_add(struct page *page, int dst_nid,
+				      int page_nid)
+{
+	int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+	if (migrate_nid != dst_nid)
+		__autonuma_migrate_page_add(page, dst_nid, page_nid);
+}
+
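+/*
+ * Return true if at least one zone of this node can absorb
+ * nr_migrate_pages more pages while staying above its high
+ * watermark, so that receiving them does not wake up kswapd.
+ */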
+static bool balance_pgdat(struct pglist_data *pgdat,
+			  int nr_migrate_pages)
+{
+	/* FIXME: this only checks the wmarks; make it move
+	 * "unused" memory or pagecache by queuing it to
+	 * pgdat->autonuma_migrate_head[pgdat->node_id].
+	 */
+	int z;
+	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+		struct zone *zone = pgdat->node_zones + z;
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (zone->all_unreclaimable)
+			continue;
+
+		/*
+		 * FIXME: deal with the order for THP, maybe if
+		 * kswapd learns to use compaction; otherwise
+		 * order = 0 is probably ok.
+		 * FIXME: in theory we're ok if we can obtain
+		 * pages_to_migrate pages from all zones together;
+		 * they don't need to all come from a single zone.
+		 * We care about the pgdat, not the zone.
+		 */
+
+		/*
+		 * Try not to wakeup kswapd by allocating
+		 * pages_to_migrate pages.
+		 */
+		if (!zone_watermark_ok(zone, 0,
+				       high_wmark_pages(zone) +
+				       nr_migrate_pages,
+				       0, 0))
+			continue;
+		return true;
+	}
+	return false;
+}
+
+static void cpu_follow_memory_pass(struct task_struct *p,
+				   struct sched_autonuma *sched_autonuma,
+				   unsigned long *numa_fault)
+{
+	int nid;
+	for_each_node(nid)
+		numa_fault[nid] >>= 1;
+	sched_autonuma->numa_fault_tot >>= 1;
+}
+
+static void numa_hinting_fault_cpu_follow_memory(struct task_struct *p,
+						 int access_nid,
+						 int numpages,
+						 bool pass)
+{
+	struct sched_autonuma *sched_autonuma = p->sched_autonuma;
+	unsigned long *numa_fault = sched_autonuma->numa_fault;
+	if (unlikely(pass))
+		cpu_follow_memory_pass(p, sched_autonuma, numa_fault);
+	numa_fault[access_nid] += numpages;
+	sched_autonuma->numa_fault_tot += numpages;
+}
+
+static inline bool last_nid_set(struct task_struct *p,
+				struct page *page, int cpu_nid)
+{
+	bool ret = true;
+	int autonuma_last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+	VM_BUG_ON(cpu_nid < 0);
+	VM_BUG_ON(cpu_nid >= MAX_NUMNODES);
+	if (autonuma_last_nid >= 0 && autonuma_last_nid != cpu_nid) {
+		int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+		if (migrate_nid >= 0 && migrate_nid != cpu_nid)
+			__autonuma_migrate_page_remove(page);
+		ret = false;
+	}
+	if (autonuma_last_nid != cpu_nid)
+		ACCESS_ONCE(page->autonuma_last_nid) = cpu_nid;
+	return ret;
+}
+
+static int __page_migrate_nid(struct page *page, int page_nid)
+{
+	int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+	if (migrate_nid < 0)
+		migrate_nid = page_nid;
+#if 0
+	return page_nid;
+#endif
+	return migrate_nid;
+}
+
+static int page_migrate_nid(struct page *page)
+{
+	return __page_migrate_nid(page, page_to_nid(page));
+}
+
+static int numa_hinting_fault_memory_follow_cpu(struct task_struct *p,
+						struct page *page,
+						int cpu_nid, int page_nid,
+						bool pass)
+{
+	if (!last_nid_set(p, page, cpu_nid))
+		return __page_migrate_nid(page, page_nid);
+	if (!PageLRU(page))
+		return page_nid;
+	if (cpu_nid != page_nid)
+		autonuma_migrate_page_add(page, cpu_nid, page_nid);
+	else
+		autonuma_migrate_page_remove(page);
+	return cpu_nid;
+}
+
+void numa_hinting_fault(struct page *page, int numpages)
+{
+	WARN_ON_ONCE(!current->mm);
+	if (likely(current->mm && !current->mempolicy && autonuma_enabled())) {
+		struct task_struct *p = current;
+		int cpu_nid, page_nid, access_nid;
+		bool pass;
+
+		pass = p->sched_autonuma->numa_fault_pass !=
+			p->mm->mm_autonuma->numa_fault_pass;
+		page_nid = page_to_nid(page);
+		cpu_nid = numa_node_id();
+		VM_BUG_ON(cpu_nid < 0);
+		VM_BUG_ON(cpu_nid >= MAX_NUMNODES);
+		access_nid = numa_hinting_fault_memory_follow_cpu(p, page,
+								  cpu_nid,
+								  page_nid,
+								  pass);
+		numa_hinting_fault_cpu_follow_memory(p, access_nid,
+						     numpages, pass);
+		if (unlikely(pass))
+			p->sched_autonuma->numa_fault_pass =
+				p->mm->mm_autonuma->numa_fault_pass;
+	}
+}
+
+pte_t __pte_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
+		       unsigned long addr, pte_t pte, pte_t *ptep)
+{
+	struct page *page;
+	pte = pte_mknotnuma(pte);
+	set_pte_at(mm, addr, ptep, pte);
+	page = vm_normal_page(vma, addr, pte);
+	BUG_ON(!page);
+	numa_hinting_fault(page, 1);
+	return pte;
+}
+
+void __pmd_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
+		      unsigned long addr, pmd_t *pmdp)
+{
+	pmd_t pmd;
+	pte_t *pte;
+	unsigned long _addr = addr & PMD_MASK;
+	spinlock_t *ptl;
+	bool numa = false;
+
+	spin_lock(&mm->page_table_lock);
+	pmd = *pmdp;
+	if (pmd_numa(pmd)) {
+		set_pmd_at(mm, _addr, pmdp, pmd_mknotnuma(pmd));
+		numa = true;
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	if (!numa)
+		return;
+
+	pte = pte_offset_map_lock(mm, pmdp, _addr, &ptl);
+	for (addr = _addr; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
+		pte_t pteval = *pte;
+		struct page * page;
+		if (!pte_present(pteval))
+			continue;
+		if (pte_numa(pteval)) {
+			pteval = pte_mknotnuma(pteval);
+			set_pte_at(mm, addr, pte, pteval);
+		}
+		page = vm_normal_page(vma, addr, pteval);
+		if (unlikely(!page))
+			continue;
+		/* only check non-shared pages */
+		if (page_mapcount(page) != 1)
+			continue;
+		numa_hinting_fault(page, 1);
+	}
+	pte_unmap_unlock(pte, ptl);
+}
+
+static inline int sched_autonuma_size(void)
+{
+	return sizeof(struct sched_autonuma) +
+		num_possible_nodes() * sizeof(unsigned long);
+}
+
+static inline int sched_autonuma_reset_size(void)
+{
+	struct sched_autonuma *sched_autonuma = NULL;
+	return sched_autonuma_size() -
+		(int)((char *)(&sched_autonuma->autonuma_flags) -
+		      (char *)sched_autonuma);
+}
+
+static void sched_autonuma_reset(struct sched_autonuma *sched_autonuma)
+{
+	sched_autonuma->autonuma_node = -1;
+	memset(&sched_autonuma->autonuma_flags, 0,
+	       sched_autonuma_reset_size());
+}
+
+static inline int mm_autonuma_fault_size(void)
+{
+	return num_possible_nodes() * sizeof(unsigned long);
+}
+
+static inline unsigned long *mm_autonuma_numa_fault_tmp(struct mm_struct *mm)
+{
+	return mm->mm_autonuma->numa_fault + num_possible_nodes();
+}
+
+static inline int mm_autonuma_size(void)
+{
+	return sizeof(struct mm_autonuma) + mm_autonuma_fault_size() * 2;
+}
+
+static inline int mm_autonuma_reset_size(void)
+{
+	struct mm_autonuma *mm_autonuma = NULL;
+	return mm_autonuma_size() -
+		(int)((char *)(&mm_autonuma->numa_fault_tot) -
+		      (char *)mm_autonuma);
+}
+
+static void mm_autonuma_reset(struct mm_autonuma *mm_autonuma)
+{
+	memset(&mm_autonuma->numa_fault_tot, 0, mm_autonuma_reset_size());
+}
+
+void autonuma_setup_new_exec(struct task_struct *p)
+{
+	if (p->sched_autonuma)
+		sched_autonuma_reset(p->sched_autonuma);
+	if (p->mm && p->mm->mm_autonuma)
+		mm_autonuma_reset(p->mm->mm_autonuma);
+}
+
+static inline int knumad_test_exit(struct mm_struct *mm)
+{
+	return atomic_read(&mm->mm_users) == 0;
+}
+
+static int knumad_scan_pmd(struct mm_struct *mm,
+			   struct vm_area_struct *vma,
+			   unsigned long address)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte, *_pte;
+	struct page *page;
+	unsigned long _address, end;
+	spinlock_t *ptl;
+	int ret = 0;
+
+	VM_BUG_ON(address & ~PAGE_MASK);
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		goto out;
+
+	pmd = pmd_offset(pud, address);
+	if (pmd_none(*pmd))
+		goto out;
+	if (pmd_trans_huge(*pmd)) {
+		spin_lock(&mm->page_table_lock);
+		if (pmd_trans_huge(*pmd)) {
+			VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+			if (unlikely(pmd_trans_splitting(*pmd))) {
+				spin_unlock(&mm->page_table_lock);
+				wait_split_huge_page(vma->anon_vma, pmd);
+			} else {
+				int page_nid;
+				unsigned long *numa_fault_tmp;
+				ret = HPAGE_PMD_NR;
+
+				if (autonuma_scan_use_working_set() &&
+				    pmd_numa(*pmd)) {
+					spin_unlock(&mm->page_table_lock);
+					goto out;
+				}
+
+				page = pmd_page(*pmd);
+
+				/* only check non-shared pages */
+				if (page_mapcount(page) != 1) {
+					spin_unlock(&mm->page_table_lock);
+					goto out;
+				}
+
+				page_nid = page_migrate_nid(page);
+				numa_fault_tmp = mm_autonuma_numa_fault_tmp(mm);
+				numa_fault_tmp[page_nid] += ret;
+
+				if (pmd_numa(*pmd)) {
+					spin_unlock(&mm->page_table_lock);
+					goto out;
+				}
+
+				set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
+				/* defer TLB flush to lower the overhead */
+				spin_unlock(&mm->page_table_lock);
+				goto out;
+			}
+		} else
+			spin_unlock(&mm->page_table_lock);
+	}
+
+	VM_BUG_ON(!pmd_present(*pmd));
+
+	end = min(vma->vm_end, (address + PMD_SIZE) & PMD_MASK);
+	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+	for (_address = address, _pte = pte; _address < end;
+	     _pte++, _address += PAGE_SIZE) {
+		unsigned long *numa_fault_tmp;
+		pte_t pteval = *_pte;
+		if (!pte_present(pteval))
+			continue;
+		if (autonuma_scan_use_working_set() &&
+		    pte_numa(pteval))
+			continue;
+		page = vm_normal_page(vma, _address, pteval);
+		if (unlikely(!page))
+			continue;
+		/* only check non-shared pages */
+		if (page_mapcount(page) != 1)
+			continue;
+
+		numa_fault_tmp = mm_autonuma_numa_fault_tmp(mm);
+		numa_fault_tmp[page_migrate_nid(page)]++;
+
+		if (pte_numa(pteval))
+			continue;
+
+		if (!autonuma_scan_pmd())
+			set_pte_at(mm, _address, _pte, pte_mknuma(pteval));
+
+		/* defer TLB flush to lower the overhead */
+		ret++;
+	}
+	pte_unmap_unlock(pte, ptl);
+
+	if (ret && !pmd_numa(*pmd) && autonuma_scan_pmd()) {
+		spin_lock(&mm->page_table_lock);
+		set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
+		spin_unlock(&mm->page_table_lock);
+		/* defer TLB flush to lower the overhead */
+	}
+
+out:
+	return ret;
+}
+
+static void mm_numa_fault_flush(struct mm_struct *mm)
+{
+	int nid;
+	struct mm_autonuma *mma = mm->mm_autonuma;
+	unsigned long *numa_fault_tmp = mm_autonuma_numa_fault_tmp(mm);
+	unsigned long tot = 0;
+	/* FIXME: protect this with seqlock against autonuma_balance() */
+	for_each_node(nid) {
+		mma->numa_fault[nid] = numa_fault_tmp[nid];
+		tot += mma->numa_fault[nid];
+		numa_fault_tmp[nid] = 0;
+	}
+	mma->numa_fault_tot = tot;
+}
+
+static int knumad_do_scan(void)
+{
+	struct mm_struct *mm;
+	struct mm_autonuma *mm_autonuma;
+	unsigned long address;
+	struct vm_area_struct *vma;
+	int progress = 0;
+
+	mm = knumad_scan.mm;
+	if (!mm) {
+		if (unlikely(list_empty(&knumad_scan.mm_head)))
+			return pages_to_scan;
+		mm_autonuma = list_entry(knumad_scan.mm_head.next,
+					 struct mm_autonuma, mm_node);
+		mm = mm_autonuma->mm;
+		knumad_scan.address = 0;
+		knumad_scan.mm = mm;
+		atomic_inc(&mm->mm_count);
+		mm_autonuma->numa_fault_pass++;
+	}
+	address = knumad_scan.address;
+
+	mutex_unlock(&knumad_mm_mutex);
+
+	down_read(&mm->mmap_sem);
+	if (unlikely(knumad_test_exit(mm)))
+		vma = NULL;
+	else
+		vma = find_vma(mm, address);
+
+	progress++;
+	for (; vma && progress < pages_to_scan; vma = vma->vm_next) {
+		unsigned long start_addr, end_addr;
+		cond_resched();
+		if (unlikely(knumad_test_exit(mm))) {
+			progress++;
+			break;
+		}
+
+		if (!vma->anon_vma || vma_policy(vma)) {
+			progress++;
+			continue;
+		}
+		if (is_vma_temporary_stack(vma)) {
+			progress++;
+			continue;
+		}
+
+		VM_BUG_ON(address & ~PAGE_MASK);
+		if (address < vma->vm_start)
+			address = vma->vm_start;
+
+		start_addr = address;
+		while (address < vma->vm_end) {
+			cond_resched();
+			if (unlikely(knumad_test_exit(mm)))
+				break;
+
+			VM_BUG_ON(address < vma->vm_start ||
+				  address + PAGE_SIZE > vma->vm_end);
+			progress += knumad_scan_pmd(mm, vma, address);
+			/* move to next address */
+			address = (address + PMD_SIZE) & PMD_MASK;
+			if (progress >= pages_to_scan)
+				break;
+		}
+		end_addr = min(address, vma->vm_end);
+
+		/*
+		 * Flush the TLB for the mm to start the numa
+		 * hinting minor page faults after we finish
+		 * scanning this vma part.
+		 */
+		mmu_notifier_invalidate_range_start(vma->vm_mm, start_addr,
+						    end_addr);
+		flush_tlb_range(vma, start_addr, end_addr);
+		mmu_notifier_invalidate_range_end(vma->vm_mm, start_addr,
+						  end_addr);
+	}
+	up_read(&mm->mmap_sem); /* exit_mmap will destroy ptes after this */
+
+	mutex_lock(&knumad_mm_mutex);
+	VM_BUG_ON(knumad_scan.mm != mm);
+	knumad_scan.address = address;
+	/*
+	 * Change the current mm if this mm is about to die, or if we
+	 * scanned all vmas of this mm.
+	 */
+	if (knumad_test_exit(mm) || !vma) {
+		mm_autonuma = mm->mm_autonuma;
+		if (mm_autonuma->mm_node.next != &knumad_scan.mm_head) {
+			mm_autonuma = list_entry(mm_autonuma->mm_node.next,
+						 struct mm_autonuma, mm_node);
+			knumad_scan.mm = mm_autonuma->mm;
+			atomic_inc(&knumad_scan.mm->mm_count);
+			knumad_scan.address = 0;
+			knumad_scan.mm->mm_autonuma->numa_fault_pass++;
+		} else
+			knumad_scan.mm = NULL;
+
+		if (knumad_test_exit(mm))
+			list_del(&mm->mm_autonuma->mm_node);
+		else
+			mm_numa_fault_flush(mm);
+
+		mmdrop(mm);
+	}
+
+	return progress;
+}
+
+static void wake_up_knuma_migrated(void)
+{
+	int nid;
+
+	lru_add_drain();
+	for_each_online_node(nid) {
+		struct pglist_data *pgdat = NODE_DATA(nid);
+		if (pgdat->autonuma_nr_migrate_pages &&
+		    waitqueue_active(&pgdat->autonuma_knuma_migrated_wait))
+			wake_up_interruptible(&pgdat->
+					      autonuma_knuma_migrated_wait);
+	}
+}
+
+static void knuma_scand_disabled(void)
+{
+	if (!autonuma_enabled())
+		wait_event_freezable(knuma_scand_wait,
+				     autonuma_enabled() ||
+				     kthread_should_stop());
+}
+
+static int knuma_scand(void *none)
+{
+	struct mm_struct *mm = NULL;
+	int progress = 0, _progress;
+	unsigned long total_progress = 0;
+
+	set_freezable();
+
+	knuma_scand_disabled();
+
+	mutex_lock(&knumad_mm_mutex);
+
+	for (;;) {
+		if (unlikely(kthread_should_stop()))
+			break;
+		_progress = knumad_do_scan();
+		progress += _progress;
+		total_progress += _progress;
+		mutex_unlock(&knumad_mm_mutex);
+
+		if (unlikely(!knumad_scan.mm)) {
+			autonuma_printk("knuma_scand %lu\n", total_progress);
+			pages_scanned += total_progress;
+			total_progress = 0;
+			full_scans++;
+
+			wait_event_freezable_timeout(knuma_scand_wait,
+						     kthread_should_stop(),
+						     msecs_to_jiffies(
+						     scan_sleep_pass_millisecs));
+			/* flush the last pending pages < pages_to_migrate */
+			wake_up_knuma_migrated();
+			wait_event_freezable_timeout(knuma_scand_wait,
+						     kthread_should_stop(),
+						     msecs_to_jiffies(
+						     scan_sleep_pass_millisecs));
+
+			if (autonuma_debug()) {
+				extern void sched_autonuma_dump_mm(void);
+				sched_autonuma_dump_mm();
+			}
+
+			/* wait while there is no pinned mm */
+			knuma_scand_disabled();
+		}
+		if (progress > pages_to_scan) {
+			progress = 0;
+			wait_event_freezable_timeout(knuma_scand_wait,
+						     kthread_should_stop(),
+						     msecs_to_jiffies(
+						     scan_sleep_millisecs));
+		}
+		cond_resched();
+		mutex_lock(&knumad_mm_mutex);
+	}
+
+	mm = knumad_scan.mm;
+	knumad_scan.mm = NULL;
+	if (mm)
+		list_del(&mm->mm_autonuma->mm_node);
+	mutex_unlock(&knumad_mm_mutex);
+
+	if (mm)
+		mmdrop(mm);
+
+	return 0;
+}
+
+static int isolate_migratepages(struct list_head *migratepages,
+				struct pglist_data *pgdat)
+{
+	int nr = 0, nid;
+	struct list_head *heads = pgdat->autonuma_migrate_head;
+
+	/* FIXME: THP balancing, restart from last nid */
+	for_each_online_node(nid) {
+		struct zone *zone;
+		struct page *page;
+		cond_resched();
+		VM_BUG_ON(numa_node_id() != pgdat->node_id);
+		if (nid == pgdat->node_id) {
+			VM_BUG_ON(!list_empty(&heads[nid]));
+			continue;
+		}
+		if (list_empty(&heads[nid]))
+			continue;
+		/* some page wants to go to this pgdat */
+		/*
+		 * Take the lock with irqs disabled to avoid a lock
+		 * inversion with the lru_lock which is taken before
+		 * the autonuma_migrate_lock in split_huge_page, and
+		 * that could be taken by interrupts after we obtained
+		 * the autonuma_migrate_lock here, if we didn't disable
+		 * irqs.
+		 */
+		autonuma_migrate_lock_irq(pgdat->node_id);
+		if (list_empty(&heads[nid])) {
+			autonuma_migrate_unlock_irq(pgdat->node_id);
+			continue;
+		}
+		page = list_entry(heads[nid].prev,
+				  struct page,
+				  autonuma_migrate_node);
+		if (unlikely(!get_page_unless_zero(page))) {
+			/*
+			 * Is getting freed and will remove self from the
+			 * autonuma list shortly, skip it for now.
+			 */
+			list_del(&page->autonuma_migrate_node);
+			list_add(&page->autonuma_migrate_node,
+				 &heads[nid]);
+			autonuma_migrate_unlock_irq(pgdat->node_id);
+			autonuma_printk("autonuma migrate page is free\n");
+			continue;
+		}
+		if (!PageLRU(page)) {
+			autonuma_migrate_unlock_irq(pgdat->node_id);
+			autonuma_printk("autonuma migrate page not in LRU\n");
+			__autonuma_migrate_page_remove(page);
+			put_page(page);
+			continue;
+		}
+		autonuma_migrate_unlock_irq(pgdat->node_id);
+
+		VM_BUG_ON(nid != page_to_nid(page));
+
+		if (PageAnon(page) && PageTransHuge(page))
+			/* FIXME: remove split_huge_page */
+			split_huge_page(page);
+
+		__autonuma_migrate_page_remove(page);
+
+		zone = page_zone(page);
+		spin_lock_irq(&zone->lru_lock);
+		if (!__isolate_lru_page(page, ISOLATE_ACTIVE|ISOLATE_INACTIVE,
+					0)) {
+			VM_BUG_ON(PageTransCompound(page));
+			del_page_from_lru_list(zone, page, page_lru(page));
+			inc_zone_state(zone, page_is_file_cache(page) ?
+				       NR_ISOLATED_FILE : NR_ISOLATED_ANON);
+			spin_unlock_irq(&zone->lru_lock);
+			/*
+			 * Hold the page pin at least until
+			 * __isolate_lru_page succeeds
+			 * (__isolate_lru_page takes a second pin when
+			 * it succeeds). If we released the pin before
+			 * __isolate_lru_page returned, the page could
+			 * have been freed and reallocated from under
+			 * us, rendering worthless our previous checks
+			 * on the page, including the split_huge_page
+			 * call.
+			 */
+			put_page(page);
+
+			list_add(&page->lru, migratepages);
+			nr += hpage_nr_pages(page);
+		} else {
+			/* FIXME: losing page, safest and simplest for now */
+			spin_unlock_irq(&zone->lru_lock);
+			put_page(page);
+			autonuma_printk("autonuma migrate page lost\n");
+		}
+	}
+
+	return nr;
+}
+
+static struct page *alloc_migrate_dst_page(struct page *page,
+					   unsigned long data,
+					   int **result)
+{
+	int nid = (int) data;
+	struct page *newpage;
+	newpage = alloc_pages_exact_node(nid,
+					 GFP_HIGHUSER_MOVABLE | GFP_THISNODE,
+					 0);
+	if (newpage)
+		newpage->autonuma_last_nid = page->autonuma_last_nid;
+	return newpage;
+}
+
+static void knumad_do_migrate(struct pglist_data *pgdat)
+{
+	int nr_migrate_pages = 0;
+	LIST_HEAD(migratepages);
+
+	autonuma_printk("nr_migrate_pages %lu to node %d\n",
+			pgdat->autonuma_nr_migrate_pages, pgdat->node_id);
+	do {
+		int isolated = 0;
+		if (balance_pgdat(pgdat, nr_migrate_pages))
+			isolated = isolate_migratepages(&migratepages, pgdat);
+		/* FIXME: might need to check too many isolated */
+		if (!isolated)
+			break;
+		nr_migrate_pages += isolated;
+	} while (nr_migrate_pages < pages_to_migrate);
+
+	if (nr_migrate_pages) {
+		int err;
+		autonuma_printk("migrate %d to node %d\n", nr_migrate_pages,
+				pgdat->node_id);
+		pages_migrated += nr_migrate_pages; /* FIXME: per node */
+		err = migrate_pages(&migratepages, alloc_migrate_dst_page,
+				    pgdat->node_id, false, true);
+		if (err)
+			/* FIXME: requeue failed pages */
+			putback_lru_pages(&migratepages);
+	}
+}
+
+static int knuma_migrated(void *arg)
+{
+	struct pglist_data *pgdat = (struct pglist_data *)arg;
+	int nid = pgdat->node_id;
+	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(nowakeup);
+
+	set_freezable();
+
+	for (;;) {
+		if (unlikely(kthread_should_stop()))
+			break;
+		/* FIXME: scan the free levels of this node; we may not
+		 * be allowed to receive memory if the watermarks of
+		 * this pgdat are below high.  In the future also add
+		 * not-interesting pages, like not-accessed pages, to
+		 * pgdat->autonuma_migrate_head[pgdat->node_id], so we
+		 * can move our memory away to other nodes in order
+		 * to satisfy the high watermark described above (so
+		 * migration can continue).
+		 */
+		knumad_do_migrate(pgdat);
+		if (!pgdat->autonuma_nr_migrate_pages) {
+			wait_event_freezable(
+				pgdat->autonuma_knuma_migrated_wait,
+				pgdat->autonuma_nr_migrate_pages ||
+				kthread_should_stop());
+			autonuma_printk("wake knuma_migrated %d\n", nid);
+		} else
+			wait_event_freezable_timeout(nowakeup,
+						     kthread_should_stop(),
+						     msecs_to_jiffies(
+						     migrate_sleep_millisecs));
+	}
+
+	return 0;
+}
+
+void autonuma_enter(struct mm_struct *mm)
+{
+	if (autonuma_impossible())
+		return;
+
+	mutex_lock(&knumad_mm_mutex);
+	list_add_tail(&mm->mm_autonuma->mm_node, &knumad_scan.mm_head);
+	mutex_unlock(&knumad_mm_mutex);
+}
+
+void autonuma_exit(struct mm_struct *mm)
+{
+	bool serialize;
+
+	if (autonuma_impossible())
+		return;
+
+	serialize = false;
+	mutex_lock(&knumad_mm_mutex);
+	if (knumad_scan.mm == mm)
+		serialize = true;
+	else
+		list_del(&mm->mm_autonuma->mm_node);
+	mutex_unlock(&knumad_mm_mutex);
+
+	if (serialize) {
+		down_write(&mm->mmap_sem);
+		up_write(&mm->mmap_sem);
+	}
+}
+
+static int start_knuma_scand(void)
+{
+	int err = 0;
+	struct task_struct *knumad_thread;
+
+	knumad_thread = kthread_run(knuma_scand, NULL, "knuma_scand");
+	if (unlikely(IS_ERR(knumad_thread))) {
+		autonuma_printk(KERN_ERR
+				"knumad: kthread_run(knuma_scand) failed\n");
+		err = PTR_ERR(knumad_thread);
+	}
+	return err;
+}
+
+static int start_knuma_migrated(void)
+{
+	int err = 0;
+	struct task_struct *knumad_thread;
+	int nid;
+
+	for_each_online_node(nid) {
+		knumad_thread = kthread_create_on_node(knuma_migrated,
+						       NODE_DATA(nid),
+						       nid,
+						       "knuma_migrated%d",
+						       nid);
+		if (unlikely(IS_ERR(knumad_thread))) {
+			autonuma_printk(KERN_ERR
+					"knumad: "
+					"kthread_run(knuma_migrated%d) "
+					"failed\n", nid);
+			err = PTR_ERR(knumad_thread);
+		} else {
+			autonuma_printk("cpumask %d %lx\n", nid,
+					cpumask_of_node(nid)->bits[0]);
+			kthread_bind_node(knumad_thread, nid);
+			wake_up_process(knumad_thread);
+		}
+	}
+	return err;
+}
+
+
+#ifdef CONFIG_SYSFS
+
+static ssize_t flag_show(struct kobject *kobj,
+			 struct kobj_attribute *attr, char *buf,
+			 enum autonuma_flag flag)
+{
+	return sprintf(buf, "%d\n",
+		       !!test_bit(flag, &autonuma_flags));
+}
+static ssize_t flag_store(struct kobject *kobj,
+			  struct kobj_attribute *attr,
+			  const char *buf, size_t count,
+			  enum autonuma_flag flag)
+{
+	unsigned long value;
+	int ret;
+
+	ret = kstrtoul(buf, 10, &value);
+	if (ret < 0)
+		return ret;
+	if (value > 1)
+		return -EINVAL;
+
+	if (value)
+		set_bit(flag, &autonuma_flags);
+	else
+		clear_bit(flag, &autonuma_flags);
+
+	return count;
+}
+
+static ssize_t enabled_show(struct kobject *kobj,
+			    struct kobj_attribute *attr, char *buf)
+{
+	return flag_show(kobj, attr, buf, AUTONUMA_FLAG);
+}
+static ssize_t enabled_store(struct kobject *kobj,
+			     struct kobj_attribute *attr,
+			     const char *buf, size_t count)
+{
+	ssize_t ret;
+
+	ret = flag_store(kobj, attr, buf, count, AUTONUMA_FLAG);
+
+	if (ret > 0 && autonuma_enabled())
+		wake_up_interruptible(&knuma_scand_wait);
+
+	return ret;
+}
+static struct kobj_attribute enabled_attr =
+	__ATTR(enabled, 0644, enabled_show, enabled_store);
+
+#define SYSFS_ENTRY(NAME, FLAG)						\
+static ssize_t NAME ## _show(struct kobject *kobj,			\
+			     struct kobj_attribute *attr, char *buf)	\
+{									\
+	return flag_show(kobj, attr, buf, FLAG);			\
+}									\
+									\
+static ssize_t NAME ## _store(struct kobject *kobj,			\
+			      struct kobj_attribute *attr,		\
+			      const char *buf, size_t count)		\
+{									\
+	return flag_store(kobj, attr, buf, count, FLAG);		\
+}									\
+static struct kobj_attribute NAME ## _attr =				\
+	__ATTR(NAME, 0644, NAME ## _show, NAME ## _store);
+
+SYSFS_ENTRY(debug, AUTONUMA_DEBUG_FLAG);
+SYSFS_ENTRY(pmd, AUTONUMA_SCAN_PMD_FLAG);
+SYSFS_ENTRY(working_set, AUTONUMA_SCAN_USE_WORKING_SET_FLAG);
+SYSFS_ENTRY(defer, AUTONUMA_MIGRATE_DEFER_FLAG);
+SYSFS_ENTRY(load_balance_strict, AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG);
+SYSFS_ENTRY(clone_reset, AUTONUMA_SCHED_CLONE_RESET_FLAG);
+SYSFS_ENTRY(fork_reset, AUTONUMA_SCHED_FORK_RESET_FLAG);
+
+#undef SYSFS_ENTRY
+
+enum {
+	SYSFS_KNUMA_SCAND_SLEEP_ENTRY,
+	SYSFS_KNUMA_SCAND_PAGES_ENTRY,
+	SYSFS_KNUMA_MIGRATED_SLEEP_ENTRY,
+	SYSFS_KNUMA_MIGRATED_PAGES_ENTRY,
+};
+
+#define SYSFS_ENTRY(NAME, SYSFS_TYPE)				\
+static ssize_t NAME ## _show(struct kobject *kobj,		\
+			     struct kobj_attribute *attr,	\
+			     char *buf)				\
+{								\
+	return sprintf(buf, "%u\n", NAME);			\
+}								\
+static ssize_t NAME ## _store(struct kobject *kobj,		\
+			      struct kobj_attribute *attr,	\
+			      const char *buf, size_t count)	\
+{								\
+	unsigned long val;					\
+	int err;						\
+								\
+	err = strict_strtoul(buf, 10, &val);			\
+	if (err || val > UINT_MAX)				\
+		return -EINVAL;					\
+	switch (SYSFS_TYPE) {					\
+	case SYSFS_KNUMA_SCAND_PAGES_ENTRY:			\
+	case SYSFS_KNUMA_MIGRATED_PAGES_ENTRY:			\
+		if (!val)					\
+			return -EINVAL;				\
+		break;						\
+	}							\
+								\
+	NAME = val;						\
+	switch (SYSFS_TYPE) {					\
+	case SYSFS_KNUMA_SCAND_SLEEP_ENTRY:			\
+		wake_up_interruptible(&knuma_scand_wait);	\
+		break;						\
+	case							\
+		SYSFS_KNUMA_MIGRATED_SLEEP_ENTRY:		\
+		wake_up_knuma_migrated();			\
+		break;						\
+	}							\
+								\
+	return count;						\
+}								\
+static struct kobj_attribute NAME ## _attr =			\
+	__ATTR(NAME, 0644, NAME ## _show, NAME ## _store);
+
+SYSFS_ENTRY(scan_sleep_millisecs, SYSFS_KNUMA_SCAND_SLEEP_ENTRY);
+SYSFS_ENTRY(scan_sleep_pass_millisecs, SYSFS_KNUMA_SCAND_SLEEP_ENTRY);
+SYSFS_ENTRY(pages_to_scan, SYSFS_KNUMA_SCAND_PAGES_ENTRY);
+
+SYSFS_ENTRY(migrate_sleep_millisecs, SYSFS_KNUMA_MIGRATED_SLEEP_ENTRY);
+SYSFS_ENTRY(pages_to_migrate, SYSFS_KNUMA_MIGRATED_PAGES_ENTRY);
+
+#undef SYSFS_ENTRY
+
+static struct attribute *autonuma_attr[] = {
+	&enabled_attr.attr,
+	&debug_attr.attr,
+	NULL,
+};
+static struct attribute_group autonuma_attr_group = {
+	.attrs = autonuma_attr,
+};
+
+#define SYSFS_ENTRY(NAME)					\
+static ssize_t NAME ## _show(struct kobject *kobj,		\
+			     struct kobj_attribute *attr,	\
+			     char *buf)				\
+{								\
+	return sprintf(buf, "%lu\n", NAME);			\
+}								\
+static struct kobj_attribute NAME ## _attr =			\
+	__ATTR_RO(NAME);
+
+SYSFS_ENTRY(full_scans);
+SYSFS_ENTRY(pages_scanned);
+SYSFS_ENTRY(pages_migrated);
+
+#undef SYSFS_ENTRY
+
+static struct attribute *knuma_scand_attr[] = {
+	&scan_sleep_millisecs_attr.attr,
+	&scan_sleep_pass_millisecs_attr.attr,
+	&pages_to_scan_attr.attr,
+	&pages_scanned_attr.attr,
+	&full_scans_attr.attr,
+	&pmd_attr.attr,
+	&working_set_attr.attr,
+	NULL,
+};
+static struct attribute_group knuma_scand_attr_group = {
+	.attrs = knuma_scand_attr,
+	.name = "knuma_scand",
+};
+
+static struct attribute *knuma_migrated_attr[] = {
+	&migrate_sleep_millisecs_attr.attr,
+	&pages_to_migrate_attr.attr,
+	&pages_migrated_attr.attr,
+	&defer_attr.attr,
+	NULL,
+};
+static struct attribute_group knuma_migrated_attr_group = {
+	.attrs = knuma_migrated_attr,
+	.name = "knuma_migrated",
+};
+
+static struct attribute *scheduler_attr[] = {
+	&clone_reset_attr.attr,
+	&fork_reset_attr.attr,
+	&load_balance_strict_attr.attr,
+	NULL,
+};
+static struct attribute_group scheduler_attr_group = {
+	.attrs = scheduler_attr,
+	.name = "scheduler",
+};
+
+static int __init autonuma_init_sysfs(struct kobject **autonuma_kobj)
+{
+	int err;
+
+	*autonuma_kobj = kobject_create_and_add("autonuma", mm_kobj);
+	if (unlikely(!*autonuma_kobj)) {
+		printk(KERN_ERR "autonuma: failed kobject create\n");
+		return -ENOMEM;
+	}
+
+	err = sysfs_create_group(*autonuma_kobj, &autonuma_attr_group);
+	if (err) {
+		printk(KERN_ERR "autonuma: failed register autonuma group\n");
+		goto delete_obj;
+	}
+
+	err = sysfs_create_group(*autonuma_kobj, &knuma_scand_attr_group);
+	if (err) {
+		printk(KERN_ERR
+		       "autonuma: failed register knuma_scand group\n");
+		goto remove_autonuma;
+	}
+
+	err = sysfs_create_group(*autonuma_kobj, &knuma_migrated_attr_group);
+	if (err) {
+		printk(KERN_ERR
+		       "autonuma: failed register knuma_migrated group\n");
+		goto remove_knuma_scand;
+	}
+
+	err = sysfs_create_group(*autonuma_kobj, &scheduler_attr_group);
+	if (err) {
+		printk(KERN_ERR
+		       "autonuma: failed register scheduler group\n");
+		goto remove_knuma_migrated;
+	}
+
+	return 0;
+
+remove_knuma_migrated:
+	sysfs_remove_group(*autonuma_kobj, &knuma_migrated_attr_group);
+remove_knuma_scand:
+	sysfs_remove_group(*autonuma_kobj, &knuma_scand_attr_group);
+remove_autonuma:
+	sysfs_remove_group(*autonuma_kobj, &autonuma_attr_group);
+delete_obj:
+	kobject_put(*autonuma_kobj);
+	return err;
+}
+
+static void __init autonuma_exit_sysfs(struct kobject *autonuma_kobj)
+{
+	sysfs_remove_group(autonuma_kobj, &knuma_migrated_attr_group);
+	sysfs_remove_group(autonuma_kobj, &knuma_scand_attr_group);
+	sysfs_remove_group(autonuma_kobj, &autonuma_attr_group);
+	kobject_put(autonuma_kobj);
+}
+#else
+static inline int autonuma_init_sysfs(struct kobject **autonuma_kobj)
+{
+	return 0;
+}
+
+static inline void autonuma_exit_sysfs(struct kobject *autonuma_kobj)
+{
+}
+#endif /* CONFIG_SYSFS */
+
+static int __init noautonuma_setup(char *str)
+{
+	if (!autonuma_impossible()) {
+		printk("AutoNUMA permanently disabled\n");
+		set_bit(AUTONUMA_IMPOSSIBLE, &autonuma_flags);
+		BUG_ON(!autonuma_impossible());
+	}
+	return 1;
+}
+__setup("noautonuma", noautonuma_setup);
+
+static int __init autonuma_init(void)
+{
+	int err;
+	struct kobject *autonuma_kobj;
+
+	VM_BUG_ON(num_possible_nodes() < 1);
+	if (autonuma_impossible())
+		return -EINVAL;
+
+	err = autonuma_init_sysfs(&autonuma_kobj);
+	if (err)
+		return err;
+
+	err = start_knuma_scand();
+	if (err) {
+		printk("failed to start knuma_scand\n");
+		goto out;
+	}
+	err = start_knuma_migrated();
+	if (err) {
+		printk("failed to start knuma_migrated\n");
+		goto out;
+	}
+
+	printk("AutoNUMA initialized successfully\n");
+	return err;
+
+out:
+	autonuma_exit_sysfs(autonuma_kobj);
+	return err;
+}
+module_init(autonuma_init)
+
+static struct kmem_cache *sched_autonuma_cachep;
+
+int alloc_sched_autonuma(struct task_struct *tsk, struct task_struct *orig,
+			 int node)
+{
+	int err = 1;
+	struct sched_autonuma *sched_autonuma;
+
+	if (autonuma_impossible())
+		goto no_numa;
+	sched_autonuma = kmem_cache_alloc_node(sched_autonuma_cachep,
+					       GFP_KERNEL, node);
+	if (!sched_autonuma)
+		goto out;
+	if (autonuma_sched_clone_reset())
+		sched_autonuma_reset(sched_autonuma);
+	else {
+		memcpy(sched_autonuma, orig->sched_autonuma,
+		       sched_autonuma_size());
+		BUG_ON(sched_autonuma->autonuma_flags &
+		       SCHED_AUTONUMA_FLAG_STOP_ONE_CPU);
+		sched_autonuma->autonuma_flags = 0;
+	}
+	tsk->sched_autonuma = sched_autonuma;
+no_numa:
+	err = 0;
+out:
+	return err;
+}
+
+void free_sched_autonuma(struct task_struct *tsk)
+{
+	if (autonuma_impossible()) {
+		BUG_ON(tsk->sched_autonuma);
+		return;
+	}
+
+	BUG_ON(!tsk->sched_autonuma);
+	kmem_cache_free(sched_autonuma_cachep, tsk->sched_autonuma);
+	tsk->sched_autonuma = NULL;
+}
+
+void __init sched_autonuma_init(void)
+{
+	struct sched_autonuma *sched_autonuma;
+
+	BUG_ON(current != &init_task);
+
+	if (autonuma_impossible())
+		return;
+
+	sched_autonuma_cachep =
+		kmem_cache_create("sched_autonuma",
+				  sched_autonuma_size(), 0,
+				  SLAB_PANIC | SLAB_HWCACHE_ALIGN, NULL);
+
+	sched_autonuma = kmem_cache_alloc_node(sched_autonuma_cachep,
+					       GFP_KERNEL, numa_node_id());
+	BUG_ON(!sched_autonuma);
+	sched_autonuma_reset(sched_autonuma);
+	BUG_ON(current->sched_autonuma);
+	current->sched_autonuma = sched_autonuma;
+}
+
+static struct kmem_cache *mm_autonuma_cachep;
+
+int alloc_mm_autonuma(struct mm_struct *mm)
+{
+	int err = 1;
+	struct mm_autonuma *mm_autonuma;
+
+	if (autonuma_impossible())
+		goto no_numa;
+	mm_autonuma = kmem_cache_alloc(mm_autonuma_cachep, GFP_KERNEL);
+	if (!mm_autonuma)
+		goto out;
+	if (autonuma_sched_fork_reset() || !mm->mm_autonuma)
+		mm_autonuma_reset(mm_autonuma);
+	else
+		memcpy(mm_autonuma, mm->mm_autonuma, mm_autonuma_size());
+	mm->mm_autonuma = mm_autonuma;
+	mm_autonuma->mm = mm;
+no_numa:
+	err = 0;
+out:
+	return err;
+}
+
+void free_mm_autonuma(struct mm_struct *mm)
+{
+	if (autonuma_impossible()) {
+		BUG_ON(mm->mm_autonuma);
+		return;
+	}
+
+	BUG_ON(!mm->mm_autonuma);
+	kmem_cache_free(mm_autonuma_cachep, mm->mm_autonuma);
+	mm->mm_autonuma = NULL;
+}
+
+void __init mm_autonuma_init(void)
+{
+	BUG_ON(current != &init_task);
+	BUG_ON(current->mm);
+
+	if (autonuma_impossible())
+		return;
+
+	mm_autonuma_cachep =
+		kmem_cache_create("mm_autonuma",
+				  mm_autonuma_size(), 0,
+				  SLAB_PANIC | SLAB_HWCACHE_ALIGN, NULL);
+}

^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 24/35] autonuma: follow_page check for pte_numa/pmd_numa
  2012-05-25 17:02 ` Andrea Arcangeli
@ 2012-05-25 17:02   ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

Without this, follow_page wouldn't trigger the NUMA hinting faults.
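
For reference, the bail-out relies on the usual retry pattern in the
follow_page() callers; a simplified sketch of the __get_user_pages() loop
(existing upstream code paraphrased here, not something added by this patch):

	struct page *page;

	while (!(page = follow_page(vma, address, foll_flags))) {
		/* pte_numa/pmd_numa entries now take this path too */
		int ret = handle_mm_fault(mm, vma, address,
					  (foll_flags & FOLL_WRITE) ?
					  FAULT_FLAG_WRITE : 0);
		if (ret & VM_FAULT_ERROR)
			break;	/* real error handling omitted */
	}

Once follow_page() refuses to return a page for a NUMA pte/pmd, the fault
path runs and the NUMA hinting fault entry points (patch 29) take over.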

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/memory.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7f265fc..e3aa47c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1483,7 +1483,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 		goto no_page_table;
 
 	pmd = pmd_offset(pud, address);
-	if (pmd_none(*pmd))
+	if (pmd_none(*pmd) || pmd_numa(*pmd))
 		goto no_page_table;
 	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
 		BUG_ON(flags & FOLL_GET);
@@ -1517,7 +1517,7 @@ split_fallthrough:
 	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
 
 	pte = *ptep;
-	if (!pte_present(pte))
+	if (!pte_present(pte) || pte_numa(pte))
 		goto no_page;
 	if ((flags & FOLL_WRITE) && !pte_write(pte))
 		goto unlock;

^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 25/35] autonuma: default mempolicy follow AutoNUMA
  2012-05-25 17:02 ` Andrea Arcangeli
@ 2012-05-25 17:02   ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

If the task has already been moved to an autonuma_node, try to allocate
memory from that node even if it's temporarily not the local node. Chances
are that's where most of the task's memory is already located and where the
task will run in the future.
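
A helper-style sketch of the node selection the hunk below open-codes (the
helper name is made up for illustration and is not part of the patch):

#ifdef CONFIG_AUTONUMA
static inline int autonuma_preferred_nid(void)
{
	/* prefer the autonuma_node picked by the AutoNUMA balancer */
	if (current->sched_autonuma &&
	    current->sched_autonuma->autonuma_node >= 0)
		return current->sched_autonuma->autonuma_node;
	/* not placed yet: fall back to the local node as usual */
	return numa_node_id();
}
#else
static inline int autonuma_preferred_nid(void)
{
	return numa_node_id();
}
#endif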

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/mempolicy.c |   15 +++++++++++++--
 1 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 88f9422..b6b88f6 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1925,10 +1925,21 @@ retry_cpuset:
 	 */
 	if (pol->mode == MPOL_INTERLEAVE)
 		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
-	else
+	else {
+		int nid;
+#ifdef CONFIG_AUTONUMA
+		nid = -1;
+		if (current->sched_autonuma)
+			nid = current->sched_autonuma->autonuma_node;
+		if (nid < 0)
+			nid = numa_node_id();
+#else
+		nid = numa_node_id();
+#endif
 		page = __alloc_pages_nodemask(gfp, order,
-				policy_zonelist(gfp, pol, numa_node_id()),
+				policy_zonelist(gfp, pol, nid),
 				policy_nodemask(gfp, pol));
+	}
 
 	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
 		goto retry_cpuset;

^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 26/35] autonuma: call autonuma_split_huge_page()
  2012-05-25 17:02 ` Andrea Arcangeli
@ 2012-05-25 17:02   ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

This is needed to make sure the tail pages are also queued into the
migration queues of knuma_migrated across a transparent hugepage
split.
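
Conceptually the hook has to do something along these lines for each tail
page (a sketch only, not the actual autonuma_migrate_split_huge_page() body,
which lives in the AutoNUMA core patch):

	/* presumably the last_nid statistic wants to survive the split too */
	page_tail->autonuma_last_nid = page->autonuma_last_nid;
	if (page->autonuma_migrate_nid >= 0) {
		/*
		 * The head page was queued for migration: requeue the
		 * tail on the destination node's autonuma_migrate_head[]
		 * list, under that node's migrate lock, so knuma_migrated
		 * still sees the tail after the split.
		 */
	}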

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 383ae4d..b1c047b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -17,6 +17,7 @@
 #include <linux/khugepaged.h>
 #include <linux/freezer.h>
 #include <linux/mman.h>
+#include <linux/autonuma.h>
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
 #include "internal.h"
@@ -1307,6 +1308,7 @@ static void __split_huge_page_refcount(struct page *page)
 
 
 		lru_add_page_tail(zone, page, page_tail);
+		autonuma_migrate_split_huge_page(page, page_tail);
 	}
 	atomic_sub(tail_count, &page->_count);
 	BUG_ON(__page_count(page) <= 0);

^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 27/35] autonuma: make khugepaged pte_numa aware
  2012-05-25 17:02 ` Andrea Arcangeli
@ 2012-05-25 17:02   ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

If any of the ptes that khugepaged is collapsing was a pte_numa, the
resulting trans huge pmd will be a pmd_numa too.
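
Condensed, the resulting flow in collapse_huge_page() is (a sketch stitched
together from the two hunks below, nothing beyond what the patch does):

	bool mknuma;

	/* the old pmd itself may already carry the NUMA bit ... */
	mknuma = pmd_numa(_pmd);
	/* ... and so may any of the HPAGE_PMD_NR ptes being collapsed */
	mknuma |= __collapse_huge_page_copy(pte, new_page, vma, address, ptl);

	/* ... in which case the new huge pmd inherits it */
	if (mknuma)
		_pmd = pmd_mknuma(_pmd);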

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c |   13 +++++++++++--
 1 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b1c047b..d388517 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1790,12 +1790,13 @@ out:
 	return isolated;
 }
 
-static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
+static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
 				      struct vm_area_struct *vma,
 				      unsigned long address,
 				      spinlock_t *ptl)
 {
 	pte_t *_pte;
+	bool mknuma = false;
 	for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++) {
 		pte_t pteval = *_pte;
 		struct page *src_page;
@@ -1823,11 +1824,15 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
 			page_remove_rmap(src_page);
 			spin_unlock(ptl);
 			free_page_and_swap_cache(src_page);
+
+			mknuma |= pte_numa(pteval);
 		}
 
 		address += PAGE_SIZE;
 		page++;
 	}
+
+	return mknuma;
 }
 
 static void collapse_huge_page(struct mm_struct *mm,
@@ -1845,6 +1850,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	spinlock_t *ptl;
 	int isolated;
 	unsigned long hstart, hend;
+	bool mknuma = false;
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 #ifndef CONFIG_NUMA
@@ -1963,7 +1969,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	anon_vma_unlock(vma->anon_vma);
 
-	__collapse_huge_page_copy(pte, new_page, vma, address, ptl);
+	mknuma = pmd_numa(_pmd);
+	mknuma |= __collapse_huge_page_copy(pte, new_page, vma, address, ptl);
 	pte_unmap(pte);
 	__SetPageUptodate(new_page);
 	pgtable = pmd_pgtable(_pmd);
@@ -1973,6 +1980,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	_pmd = mk_pmd(new_page, vma->vm_page_prot);
 	_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
 	_pmd = pmd_mkhuge(_pmd);
+	if (mknuma)
+		_pmd = pmd_mknuma(_pmd);
 
 	/*
 	 * spin_lock() below is not the equivalent of smp_wmb(), so

^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 28/35] autonuma: retain page last_nid information in khugepaged
  2012-05-25 17:02 ` Andrea Arcangeli
@ 2012-05-25 17:02   ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

When pages are collapsed, try to keep the last_nid information from one
of the original pages.
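
The #ifdef dance in the hunk below is equivalent to this helper-style sketch
(the helper name is made up for illustration; the patch open-codes it):

static inline void autonuma_copy_last_nid(struct page *dst, struct page *src)
{
#ifdef CONFIG_AUTONUMA
	int last_nid = ACCESS_ONCE(src->autonuma_last_nid);

	/* "pick the last one, better than nothing" */
	if (last_nid >= 0)
		ACCESS_ONCE(dst->autonuma_last_nid) = last_nid;
#endif
}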

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d388517..76bdc48 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1805,7 +1805,18 @@ static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
 			clear_user_highpage(page, address);
 			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
 		} else {
+#ifdef CONFIG_AUTONUMA
+			int autonuma_last_nid;
+#endif
 			src_page = pte_page(pteval);
+#ifdef CONFIG_AUTONUMA
+			/* pick the last one, better than nothing */
+			autonuma_last_nid =
+				ACCESS_ONCE(src_page->autonuma_last_nid);
+			if (autonuma_last_nid >= 0)
+				ACCESS_ONCE(page->autonuma_last_nid) =
+					autonuma_last_nid;
+#endif
 			copy_user_highpage(page, src_page, address, vma);
 			VM_BUG_ON(page_mapcount(src_page) != 1);
 			VM_BUG_ON(page_count(src_page) != 2);

^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 29/35] autonuma: numa hinting page faults entry points
  2012-05-25 17:02 ` Andrea Arcangeli
@ 2012-05-25 17:02   ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

This is where the NUMA hinting page faults are detected and passed
over to the AutoNUMA core logic.
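
The huge pmd fixup is quoted in full below; the pte-level __pte_numa_fixup()
lives in the AutoNUMA core patch, but assuming it mirrors the pmd variant it
boils down to this sketch (running with the pte lock already held by
handle_pte_fault()):

	/* sketch, not the actual __pte_numa_fixup() body */
	pte = pte_mknotnuma(pte);
	set_pte_at(mm, addr, ptep, pte);
	update_mmu_cache(vma, addr, ptep);
	/* NULL check on vm_normal_page() omitted in this sketch */
	numa_hinting_fault(vm_normal_page(vma, addr, pte), 1);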

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/huge_mm.h |    2 ++
 mm/huge_memory.c        |   17 +++++++++++++++++
 mm/memory.c             |   32 ++++++++++++++++++++++++++++++++
 3 files changed, 51 insertions(+), 0 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index c8af7a2..72eac1d 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -11,6 +11,8 @@ extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			       unsigned long address, pmd_t *pmd,
 			       pmd_t orig_pmd);
+extern pmd_t __huge_pmd_numa_fixup(struct mm_struct *mm, unsigned long addr,
+				   pmd_t pmd, pmd_t *pmdp);
 extern pgtable_t get_pmd_huge_pte(struct mm_struct *mm);
 extern struct page *follow_trans_huge_pmd(struct mm_struct *mm,
 					  unsigned long addr,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 76bdc48..017c0a3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1030,6 +1030,23 @@ out:
 	return page;
 }
 
+#ifdef CONFIG_AUTONUMA
+pmd_t __huge_pmd_numa_fixup(struct mm_struct *mm, unsigned long addr,
+			    pmd_t pmd, pmd_t *pmdp)
+{
+	spin_lock(&mm->page_table_lock);
+	if (pmd_same(pmd, *pmdp)) {
+		struct page *page = pmd_page(pmd);
+		pmd = pmd_mknotnuma(pmd);
+		set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmdp, pmd);
+		numa_hinting_fault(page, HPAGE_PMD_NR);
+		VM_BUG_ON(pmd_numa(pmd));
+	}
+	spin_unlock(&mm->page_table_lock);
+	return pmd;
+}
+#endif
+
 int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 pmd_t *pmd, unsigned long addr)
 {
diff --git a/mm/memory.c b/mm/memory.c
index e3aa47c..316ce54 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -57,6 +57,7 @@
 #include <linux/swapops.h>
 #include <linux/elf.h>
 #include <linux/gfp.h>
+#include <linux/autonuma.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -3398,6 +3399,32 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
 }
 
+static inline pte_t pte_numa_fixup(struct mm_struct *mm,
+				   struct vm_area_struct *vma,
+				   unsigned long addr, pte_t pte, pte_t *ptep)
+{
+	if (pte_numa(pte))
+		pte = __pte_numa_fixup(mm, vma, addr, pte, ptep);
+	return pte;
+}
+
+static inline void pmd_numa_fixup(struct mm_struct *mm,
+				  struct vm_area_struct *vma,
+				  unsigned long addr, pmd_t *pmd)
+{
+	if (pmd_numa(*pmd))
+		__pmd_numa_fixup(mm, vma, addr, pmd);
+}
+
+static inline pmd_t huge_pmd_numa_fixup(struct mm_struct *mm,
+					unsigned long addr,
+					pmd_t pmd, pmd_t *pmdp)
+{
+	if (pmd_numa(pmd))
+		pmd = __huge_pmd_numa_fixup(mm, addr, pmd, pmdp);
+	return pmd;
+}
+
 /*
  * These routines also need to handle stuff like marking pages dirty
  * and/or accessed for architectures that don't do it in hardware (most
@@ -3440,6 +3467,7 @@ int handle_pte_fault(struct mm_struct *mm,
 	spin_lock(ptl);
 	if (unlikely(!pte_same(*pte, entry)))
 		goto unlock;
+	entry = pte_numa_fixup(mm, vma, address, entry, pte);
 	if (flags & FAULT_FLAG_WRITE) {
 		if (!pte_write(entry))
 			return do_wp_page(mm, vma, address,
@@ -3501,6 +3529,8 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		pmd_t orig_pmd = *pmd;
 		barrier();
 		if (pmd_trans_huge(orig_pmd)) {
+			orig_pmd = huge_pmd_numa_fixup(mm, address,
+						       orig_pmd, pmd);
 			if (flags & FAULT_FLAG_WRITE &&
 			    !pmd_write(orig_pmd) &&
 			    !pmd_trans_splitting(orig_pmd))
@@ -3510,6 +3540,8 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 	}
 
+	pmd_numa_fixup(mm, vma, address, pmd);
+
 	/*
 	 * Use __pte_alloc instead of pte_alloc_map, because we can't
 	 * run pte_offset_map on the pmd, if an huge pmd could

^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 30/35] autonuma: reset autonuma page data when pages are freed
  2012-05-25 17:02 ` Andrea Arcangeli
@ 2012-05-25 17:02   ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

When pages are freed, abort any pending migration. If knuma_migrated gets to
the page first, it will notice because get_page_unless_zero() will fail.
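
On the knuma_migrated side the race resolves to this pattern (a simplified
sketch of how an entry of the per-node migration queue is consumed):

	if (!get_page_unless_zero(page)) {
		/*
		 * The page is being freed under us: the free path
		 * (free_pages_check() below) removes it from the
		 * migration queue and resets autonuma_last_nid, so the
		 * entry is simply skipped.
		 */
		continue;
	}
	/* from here on the pin keeps the page and the checks on it valid */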

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/page_alloc.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3d1ee70..1d3163f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -614,6 +614,10 @@ static inline int free_pages_check(struct page *page)
 		bad_page(page);
 		return 1;
 	}
+	autonuma_migrate_page_remove(page);
+#ifdef CONFIG_AUTONUMA
+	ACCESS_ONCE(page->autonuma_last_nid) = -1;
+#endif
 	if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
 		page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	return 0;

^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 31/35] autonuma: initialize page structure fields
  2012-05-25 17:02 ` Andrea Arcangeli
@ 2012-05-25 17:02   ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

Initialize the AutoNUMA page structure fields at boot.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/page_alloc.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1d3163f..3c354d4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3673,6 +3673,10 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
 
 		INIT_LIST_HEAD(&page->lru);
+#ifdef CONFIG_AUTONUMA
+		page->autonuma_last_nid = -1;
+		page->autonuma_migrate_nid = -1;
+#endif
 #ifdef WANT_PAGE_VIRTUAL
 		/* The shift won't overflow because ZONE_NORMAL is below 4G. */
 		if (!is_highmem_idx(zone))

^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 32/35] autonuma: link mm/autonuma.o and kernel/sched/numa.o
  2012-05-25 17:02 ` Andrea Arcangeli
@ 2012-05-25 17:02   ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

Link the AutoNUMA core and scheduler object files into the kernel if
CONFIG_AUTONUMA=y.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/sched/Makefile |    1 +
 mm/Makefile           |    1 +
 2 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 173ea52..783a840 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -16,3 +16,4 @@ obj-$(CONFIG_SMP) += cpupri.o
 obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
+obj-$(CONFIG_AUTONUMA) += numa.o
diff --git a/mm/Makefile b/mm/Makefile
index 50ec00e..67c77bd 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -29,6 +29,7 @@ obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o thrash.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
 obj-$(CONFIG_NUMA) 	+= mempolicy.o
+obj-$(CONFIG_AUTONUMA) 	+= autonuma.o
 obj-$(CONFIG_SPARSEMEM)	+= sparse.o
 obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_SLOB) += slob.o

^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 33/35] autonuma: add CONFIG_AUTONUMA and CONFIG_AUTONUMA_DEFAULT_ENABLED
  2012-05-25 17:02 ` Andrea Arcangeli
@ 2012-05-25 17:02   ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

Add the config options to allow building the kernel with AutoNUMA.

If CONFIG_AUTONUMA_DEFAULT_ENABLED is "=y", then
/sys/kernel/mm/autonuma/enabled will be equal to 1, and AutoNUMA will
be enabled automatically at boot.

CONFIG_AUTONUMA currently depends on X86, because no other arch
implements pte_numa/pmd_numa yet and selecting =y elsewhere would result
in a failed build; this restriction shall be relaxed in the future. Porting
AutoNUMA to other archs should be pretty simple.
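
A sketch of how the default-enabled option is typically consumed (the exact
initializer is an assumption, not quoted from the core patch; autonuma_flags
and AUTONUMA_FLAG are the names used by the sysfs code of the core patch):

unsigned long autonuma_flags __read_mostly =
#ifdef CONFIG_AUTONUMA_DEFAULT_ENABLED
	(1UL << AUTONUMA_FLAG) |
#endif
	0;

/sys/kernel/mm/autonuma/enabled then simply reports
!!test_bit(AUTONUMA_FLAG, &autonuma_flags), which is what the enabled_show()
handler in the core patch prints.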

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/Kconfig |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index e338407..cbfdb15 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -207,6 +207,19 @@ config MIGRATION
 	  pages as migration can relocate pages to satisfy a huge page
 	  allocation instead of reclaiming.
 
+config AUTONUMA
+	bool "Auto NUMA"
+	select MIGRATION
+	depends on NUMA && X86
+	help
+	  Automatic NUMA CPU scheduling and memory migration.
+
+config AUTONUMA_DEFAULT_ENABLED
+	bool "Auto NUMA default enabled"
+	depends on AUTONUMA
+	help
+	  Automatic NUMA CPU scheduling and memory migration enabled at boot.
+
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
 

^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 34/35] autonuma: boost khugepaged scanning rate
  2012-05-25 17:02 ` Andrea Arcangeli
@ 2012-05-25 17:02   ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

Until THP native migration is implemented, it's safer to boost the
khugepaged scanning rate, because memory migration currently splits the
hugepages. The regular scanning rate becomes too low to re-collapse them
when lots of memory is migrated.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 017c0a3..b919c0c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -573,6 +573,14 @@ static int __init hugepage_init(void)
 
 	set_recommended_min_free_kbytes();
 
+#ifdef CONFIG_AUTONUMA
+	/* Hack, remove after THP native migration */
+	if (num_possible_nodes() > 1) {
+		khugepaged_scan_sleep_millisecs = 100;
+		khugepaged_alloc_sleep_millisecs = 10000;
+	}
+#endif
+
 	return 0;
 out:
 	hugepage_exit_sysfs(hugepage_kobj);

^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 34/35] autonuma: boost khugepaged scanning rate
@ 2012-05-25 17:02   ` Andrea Arcangeli
  0 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

Until THP native migration is implemented, it's safer to boost the
khugepaged scanning rate because every memory migration splits the
hugepages, so the regular scanning rate becomes too low when lots of
memory is migrated.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 017c0a3..b919c0c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -573,6 +573,14 @@ static int __init hugepage_init(void)
 
 	set_recommended_min_free_kbytes();
 
+#ifdef CONFIG_AUTONUMA
+	/* Hack, remove after THP native migration */
+	if (num_possible_nodes() > 1) {
+		khugepaged_scan_sleep_millisecs = 100;
+		khugepaged_alloc_sleep_millisecs = 10000;
+	}
+#endif
+
 	return 0;
 out:
 	hugepage_exit_sysfs(hugepage_kobj);


^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 35/35] autonuma: page_autonuma
  2012-05-25 17:02 ` Andrea Arcangeli
@ 2012-05-25 17:02   ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

Move the AutoNUMA per-page information from the "struct page" to a
separate page_autonuma data structure allocated in the memsection
(with sparsemem) or in the pgdat (with flatmem).

This is done to avoid growing the size of the "struct page"; the
page_autonuma data is only allocated if the kernel has been booted on
real NUMA hardware (and not if "noautonuma" is passed as a kernel
parameter).
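
A minimal sketch of the intended calling pattern (the function name
below is illustrative; only lookup_page_autonuma() and
autonuma_impossible() come from this series): translate the struct
page to its descriptor once, and touch nothing on non-NUMA boots:

static int example_page_migrate_nid(struct page *page)
{
	struct page_autonuma *page_autonuma;
	int nid;

	/* nothing is allocated on non-NUMA boots or with "noautonuma" */
	if (autonuma_impossible())
		return page_to_nid(page);

	page_autonuma = lookup_page_autonuma(page);
	nid = ACCESS_ONCE(page_autonuma->autonuma_migrate_nid);
	if (nid >= 0)
		return nid;
	return page_to_nid(page);
}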

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma.h       |   18 +++-
 include/linux/autonuma_flags.h |    6 +
 include/linux/autonuma_types.h |   31 ++++++
 include/linux/mm_types.h       |   25 -----
 include/linux/mmzone.h         |   14 +++-
 include/linux/page_autonuma.h  |   53 +++++++++
 init/main.c                    |    2 +
 mm/Makefile                    |    2 +-
 mm/autonuma.c                  |   95 ++++++++++-------
 mm/huge_memory.c               |   11 ++-
 mm/page_alloc.c                |   23 +----
 mm/page_autonuma.c             |  234 ++++++++++++++++++++++++++++++++++++++++
 mm/sparse.c                    |  126 ++++++++++++++++++++-
 13 files changed, 543 insertions(+), 97 deletions(-)
 create mode 100644 include/linux/page_autonuma.h
 create mode 100644 mm/page_autonuma.c

diff --git a/include/linux/autonuma.h b/include/linux/autonuma.h
index a963dcb..1eb84d0 100644
--- a/include/linux/autonuma.h
+++ b/include/linux/autonuma.h
@@ -7,15 +7,26 @@
 
 extern void autonuma_enter(struct mm_struct *mm);
 extern void autonuma_exit(struct mm_struct *mm);
-extern void __autonuma_migrate_page_remove(struct page *page);
+extern void __autonuma_migrate_page_remove(struct page *,
+					   struct page_autonuma *);
 extern void autonuma_migrate_split_huge_page(struct page *page,
 					     struct page *page_tail);
 extern void autonuma_setup_new_exec(struct task_struct *p);
+extern struct page_autonuma *lookup_page_autonuma(struct page *page);
 
 static inline void autonuma_migrate_page_remove(struct page *page)
 {
-	if (ACCESS_ONCE(page->autonuma_migrate_nid) >= 0)
-		__autonuma_migrate_page_remove(page);
+	struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+	if (ACCESS_ONCE(page_autonuma->autonuma_migrate_nid) >= 0)
+		__autonuma_migrate_page_remove(page, page_autonuma);
+}
+
+static inline void autonuma_free_page(struct page *page)
+{
+	if (!autonuma_impossible()) {
+		autonuma_migrate_page_remove(page);
+		ACCESS_ONCE(lookup_page_autonuma(page)->autonuma_last_nid) = -1;
+	}
 }
 
 #define autonuma_printk(format, args...) \
@@ -29,6 +40,7 @@ static inline void autonuma_migrate_page_remove(struct page *page) {}
 static inline void autonuma_migrate_split_huge_page(struct page *page,
 						    struct page *page_tail) {}
 static inline void autonuma_setup_new_exec(struct task_struct *p) {}
+static inline void autonuma_free_page(struct page *page) {}
 
 #endif /* CONFIG_AUTONUMA */
 
diff --git a/include/linux/autonuma_flags.h b/include/linux/autonuma_flags.h
index 9c702fd..6ec837a 100644
--- a/include/linux/autonuma_flags.h
+++ b/include/linux/autonuma_flags.h
@@ -15,6 +15,12 @@ enum autonuma_flag {
 
 extern unsigned long autonuma_flags;
 
+static inline bool autonuma_impossible(void)
+{
+	return num_possible_nodes() <= 1 ||
+		test_bit(AUTONUMA_IMPOSSIBLE, &autonuma_flags);
+}
+
 static bool inline autonuma_enabled(void)
 {
 	return !!test_bit(AUTONUMA_FLAG, &autonuma_flags);
diff --git a/include/linux/autonuma_types.h b/include/linux/autonuma_types.h
index 65b175b..28f64ec 100644
--- a/include/linux/autonuma_types.h
+++ b/include/linux/autonuma_types.h
@@ -28,6 +28,37 @@ struct sched_autonuma {
 	unsigned long numa_fault[0];
 };
 
+struct page_autonuma {
+	/*
+	 * FIXME: move to pgdat section along with the memcg and allocate
+	 * at runtime only in presence of a numa system.
+	 */
+	/*
+	 * To modify autonuma_last_nid locklessly, the architecture
+	 * needs SMP atomic granularity < sizeof(long); not all archs
+	 * have that, notably some alphas. Archs without it require
+	 * autonuma_last_nid to be a long.
+	 */
+#if BITS_PER_LONG > 32
+	int autonuma_migrate_nid;
+	int autonuma_last_nid;
+#else
+#if MAX_NUMNODES >= 32768
+#error "too many nodes"
+#endif
+	/* FIXME: remember to check the updates are atomic */
+	short autonuma_migrate_nid;
+	short autonuma_last_nid;
+#endif
+	struct list_head autonuma_migrate_node;
+
+	/*
+	 * To find the page starting from the autonuma_migrate_node we
+	 * need a backlink.
+	 */
+	struct page *page;
+};
+
 extern int alloc_sched_autonuma(struct task_struct *tsk,
 				struct task_struct *orig,
 				int node);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e8dc82c..780ded7 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -126,31 +126,6 @@ struct page {
 		struct page *first_page;	/* Compound tail pages */
 	};
 
-#ifdef CONFIG_AUTONUMA
-	/*
-	 * FIXME: move to pgdat section along with the memcg and allocate
-	 * at runtime only in presence of a numa system.
-	 */
-	/*
-	 * To modify autonuma_last_nid lockless the architecture,
-	 * needs SMP atomic granularity < sizeof(long), not all archs
-	 * have that, notably some alpha. Archs without that requires
-	 * autonuma_last_nid to be a long.
-	 */
-#if BITS_PER_LONG > 32
-	int autonuma_migrate_nid;
-	int autonuma_last_nid;
-#else
-#if MAX_NUMNODES >= 32768
-#error "too many nodes"
-#endif
-	/* FIXME: remember to check the updates are atomic */
-	short autonuma_migrate_nid;
-	short autonuma_last_nid;
-#endif
-	struct list_head autonuma_migrate_node;
-#endif
-
 	/*
 	 * On machines where all RAM is mapped into kernel address space,
 	 * we can simply calculate the virtual address. On machines with
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8e578e6..89fa49f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -667,10 +667,13 @@ typedef struct pglist_data {
 	int kswapd_max_order;
 	enum zone_type classzone_idx;
 #ifdef CONFIG_AUTONUMA
-	spinlock_t autonuma_lock;
+#if !defined(CONFIG_SPARSEMEM)
+	struct page_autonuma *node_page_autonuma;
+#endif
 	struct list_head autonuma_migrate_head[MAX_NUMNODES];
 	unsigned long autonuma_nr_migrate_pages;
 	wait_queue_head_t autonuma_knuma_migrated_wait;
+	spinlock_t autonuma_lock;
 #endif
 } pg_data_t;
 
@@ -1022,6 +1025,15 @@ struct mem_section {
 	 * section. (see memcontrol.h/page_cgroup.h about this.)
 	 */
 	struct page_cgroup *page_cgroup;
+#endif
+#ifdef CONFIG_AUTONUMA
+	/*
+	 * If !SPARSEMEM, pgdat doesn't have page_autonuma pointer. We use
+	 * section.
+	 */
+	struct page_autonuma *section_page_autonuma;
+#endif
+#if defined(CONFIG_CGROUP_MEM_RES_CTLR) ^ defined(CONFIG_AUTONUMA)
 	unsigned long pad;
 #endif
 };
diff --git a/include/linux/page_autonuma.h b/include/linux/page_autonuma.h
new file mode 100644
index 0000000..05d2862
--- /dev/null
+++ b/include/linux/page_autonuma.h
@@ -0,0 +1,53 @@
+#ifndef _LINUX_PAGE_AUTONUMA_H
+#define _LINUX_PAGE_AUTONUMA_H
+
+#if defined(CONFIG_AUTONUMA) && !defined(CONFIG_SPARSEMEM)
+extern void __init page_autonuma_init_flatmem(void);
+#else
+static inline void __init page_autonuma_init_flatmem(void) {}
+#endif
+
+#ifdef CONFIG_AUTONUMA
+
+#include <linux/autonuma_flags.h>
+
+extern void __meminit page_autonuma_map_init(struct page *page,
+					     struct page_autonuma *page_autonuma,
+					     int nr_pages);
+
+#ifdef CONFIG_SPARSEMEM
+#define PAGE_AUTONUMA_SIZE (sizeof(struct page_autonuma))
+#define SECTION_PAGE_AUTONUMA_SIZE (PAGE_AUTONUMA_SIZE *	\
+				    PAGES_PER_SECTION)
+#endif
+
+extern void __meminit pgdat_autonuma_init(struct pglist_data *);
+
+#else /* CONFIG_AUTONUMA */
+
+#ifdef CONFIG_SPARSEMEM
+struct page_autonuma;
+#define PAGE_AUTONUMA_SIZE 0
+#define SECTION_PAGE_AUTONUMA_SIZE 0
+
+#define autonuma_impossible() true
+
+#endif
+
+static inline void pgdat_autonuma_init(struct pglist_data *pgdat) {}
+
+#endif /* CONFIG_AUTONUMA */
+
+#ifdef CONFIG_SPARSEMEM
+extern struct page_autonuma * __meminit __kmalloc_section_page_autonuma(int nid,
+									unsigned long nr_pages);
+extern void __meminit __kfree_section_page_autonuma(struct page_autonuma *page_autonuma,
+						    unsigned long nr_pages);
+extern void __init sparse_early_page_autonuma_alloc_node(struct page_autonuma **page_autonuma_map,
+							 unsigned long pnum_begin,
+							 unsigned long pnum_end,
+							 unsigned long map_count,
+							 int nodeid);
+#endif
+
+#endif /* _LINUX_PAGE_AUTONUMA_H */
diff --git a/init/main.c b/init/main.c
index 1ca6b32..275e914 100644
--- a/init/main.c
+++ b/init/main.c
@@ -68,6 +68,7 @@
 #include <linux/shmem_fs.h>
 #include <linux/slab.h>
 #include <linux/perf_event.h>
+#include <linux/page_autonuma.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -455,6 +456,7 @@ static void __init mm_init(void)
 	 * bigger than MAX_ORDER unless SPARSEMEM.
 	 */
 	page_cgroup_init_flatmem();
+	page_autonuma_init_flatmem();
 	mem_init();
 	kmem_cache_init();
 	percpu_init_late();
diff --git a/mm/Makefile b/mm/Makefile
index 67c77bd..5410eba 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -29,7 +29,7 @@ obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o thrash.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
 obj-$(CONFIG_NUMA) 	+= mempolicy.o
-obj-$(CONFIG_AUTONUMA) 	+= autonuma.o
+obj-$(CONFIG_AUTONUMA) 	+= autonuma.o page_autonuma.o
 obj-$(CONFIG_SPARSEMEM)	+= sparse.o
 obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_SLOB) += slob.o
diff --git a/mm/autonuma.c b/mm/autonuma.c
index 88c7ab3..96c02a2 100644
--- a/mm/autonuma.c
+++ b/mm/autonuma.c
@@ -51,12 +51,6 @@ static struct knumad_scan {
 	.mm_head = LIST_HEAD_INIT(knumad_scan.mm_head),
 };
 
-static inline bool autonuma_impossible(void)
-{
-	return num_possible_nodes() <= 1 ||
-		test_bit(AUTONUMA_IMPOSSIBLE, &autonuma_flags);
-}
-
 static inline void autonuma_migrate_lock(int nid)
 {
 	spin_lock(&NODE_DATA(nid)->autonuma_lock);
@@ -82,51 +76,57 @@ void autonuma_migrate_split_huge_page(struct page *page,
 				      struct page *page_tail)
 {
 	int nid, last_nid;
+	struct page_autonuma *page_autonuma, *page_tail_autonuma;
 
-	nid = page->autonuma_migrate_nid;
+	page_autonuma = lookup_page_autonuma(page);
+	page_tail_autonuma = lookup_page_autonuma(page_tail);
+
+	nid = page_autonuma->autonuma_migrate_nid;
 	VM_BUG_ON(nid >= MAX_NUMNODES);
 	VM_BUG_ON(nid < -1);
-	VM_BUG_ON(page_tail->autonuma_migrate_nid != -1);
+	VM_BUG_ON(page_tail_autonuma->autonuma_migrate_nid != -1);
 	if (nid >= 0) {
 		VM_BUG_ON(page_to_nid(page) != page_to_nid(page_tail));
 		autonuma_migrate_lock(nid);
-		list_add_tail(&page_tail->autonuma_migrate_node,
-			      &page->autonuma_migrate_node);
+		list_add_tail(&page_tail_autonuma->autonuma_migrate_node,
+			      &page_autonuma->autonuma_migrate_node);
 		autonuma_migrate_unlock(nid);
 
-		page_tail->autonuma_migrate_nid = nid;
+		page_tail_autonuma->autonuma_migrate_nid = nid;
 	}
 
-	last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+	last_nid = ACCESS_ONCE(page_autonuma->autonuma_last_nid);
 	if (last_nid >= 0)
-		page_tail->autonuma_last_nid = last_nid;
+		page_tail_autonuma->autonuma_last_nid = last_nid;
 }
 
-void __autonuma_migrate_page_remove(struct page *page)
+void __autonuma_migrate_page_remove(struct page *page,
+				    struct page_autonuma *page_autonuma)
 {
 	unsigned long flags;
 	int nid;
 
 	flags = compound_lock_irqsave(page);
 
-	nid = page->autonuma_migrate_nid;
+	nid = page_autonuma->autonuma_migrate_nid;
 	VM_BUG_ON(nid >= MAX_NUMNODES);
 	VM_BUG_ON(nid < -1);
 	if (nid >= 0) {
 		int numpages = hpage_nr_pages(page);
 		autonuma_migrate_lock(nid);
-		list_del(&page->autonuma_migrate_node);
+		list_del(&page_autonuma->autonuma_migrate_node);
 		NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
 		autonuma_migrate_unlock(nid);
 
-		page->autonuma_migrate_nid = -1;
+		page_autonuma->autonuma_migrate_nid = -1;
 	}
 
 	compound_unlock_irqrestore(page, flags);
 }
 
-static void __autonuma_migrate_page_add(struct page *page, int dst_nid,
-					int page_nid)
+static void __autonuma_migrate_page_add(struct page *page,
+					struct page_autonuma *page_autonuma,
+					int dst_nid, int page_nid)
 {
 	unsigned long flags;
 	int nid;
@@ -145,25 +145,25 @@ static void __autonuma_migrate_page_add(struct page *page, int dst_nid,
 	flags = compound_lock_irqsave(page);
 
 	numpages = hpage_nr_pages(page);
-	nid = page->autonuma_migrate_nid;
+	nid = page_autonuma->autonuma_migrate_nid;
 	VM_BUG_ON(nid >= MAX_NUMNODES);
 	VM_BUG_ON(nid < -1);
 	if (nid >= 0) {
 		autonuma_migrate_lock(nid);
-		list_del(&page->autonuma_migrate_node);
+		list_del(&page_autonuma->autonuma_migrate_node);
 		NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
 		autonuma_migrate_unlock(nid);
 	}
 
 	autonuma_migrate_lock(dst_nid);
-	list_add(&page->autonuma_migrate_node,
+	list_add(&page_autonuma->autonuma_migrate_node,
 		 &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid]);
 	NODE_DATA(dst_nid)->autonuma_nr_migrate_pages += numpages;
 	nr_migrate_pages = NODE_DATA(dst_nid)->autonuma_nr_migrate_pages;
 
 	autonuma_migrate_unlock(dst_nid);
 
-	page->autonuma_migrate_nid = dst_nid;
+	page_autonuma->autonuma_migrate_nid = dst_nid;
 
 	compound_unlock_irqrestore(page, flags);
 
@@ -179,9 +179,13 @@ static void __autonuma_migrate_page_add(struct page *page, int dst_nid,
 static void autonuma_migrate_page_add(struct page *page, int dst_nid,
 				      int page_nid)
 {
-	int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+	int migrate_nid;
+	struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+
+	migrate_nid = ACCESS_ONCE(page_autonuma->autonuma_migrate_nid);
 	if (migrate_nid != dst_nid)
-		__autonuma_migrate_page_add(page, dst_nid, page_nid);
+		__autonuma_migrate_page_add(page, page_autonuma,
+					    dst_nid, page_nid);
 }
 
 static bool balance_pgdat(struct pglist_data *pgdat,
@@ -252,23 +256,26 @@ static inline bool last_nid_set(struct task_struct *p,
 				struct page *page, int cpu_nid)
 {
 	bool ret = true;
-	int autonuma_last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+	struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+	int autonuma_last_nid = ACCESS_ONCE(page_autonuma->autonuma_last_nid);
 	VM_BUG_ON(cpu_nid < 0);
 	VM_BUG_ON(cpu_nid >= MAX_NUMNODES);
 	if (autonuma_last_nid >= 0 && autonuma_last_nid != cpu_nid) {
-		int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+		int migrate_nid;
+		migrate_nid = ACCESS_ONCE(page_autonuma->autonuma_migrate_nid);
 		if (migrate_nid >= 0 && migrate_nid != cpu_nid)
-			__autonuma_migrate_page_remove(page);
+			__autonuma_migrate_page_remove(page, page_autonuma);
 		ret = false;
 	}
 	if (autonuma_last_nid != cpu_nid)
-		ACCESS_ONCE(page->autonuma_last_nid) = cpu_nid;
+		ACCESS_ONCE(page_autonuma->autonuma_last_nid) = cpu_nid;
 	return ret;
 }
 
 static int __page_migrate_nid(struct page *page, int page_nid)
 {
-	int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+	struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+	int migrate_nid = ACCESS_ONCE(page_autonuma->autonuma_migrate_nid);
 	if (migrate_nid < 0)
 		migrate_nid = page_nid;
 #if 0
@@ -780,6 +787,7 @@ static int isolate_migratepages(struct list_head *migratepages,
 	for_each_online_node(nid) {
 		struct zone *zone;
 		struct page *page;
+		struct page_autonuma *page_autonuma;
 		cond_resched();
 		VM_BUG_ON(numa_node_id() != pgdat->node_id);
 		if (nid == pgdat->node_id) {
@@ -802,16 +810,17 @@ static int isolate_migratepages(struct list_head *migratepages,
 			autonuma_migrate_unlock_irq(pgdat->node_id);
 			continue;
 		}
-		page = list_entry(heads[nid].prev,
-				  struct page,
-				  autonuma_migrate_node);
+		page_autonuma = list_entry(heads[nid].prev,
+					   struct page_autonuma,
+					   autonuma_migrate_node);
+		page = page_autonuma->page;
 		if (unlikely(!get_page_unless_zero(page))) {
 			/*
 			 * Is getting freed and will remove self from the
 			 * autonuma list shortly, skip it for now.
 			 */
-			list_del(&page->autonuma_migrate_node);
-			list_add(&page->autonuma_migrate_node,
+			list_del(&page_autonuma->autonuma_migrate_node);
+			list_add(&page_autonuma->autonuma_migrate_node,
 				 &heads[nid]);
 			autonuma_migrate_unlock_irq(pgdat->node_id);
 			autonuma_printk("autonuma migrate page is free\n");
@@ -820,7 +829,7 @@ static int isolate_migratepages(struct list_head *migratepages,
 		if (!PageLRU(page)) {
 			autonuma_migrate_unlock_irq(pgdat->node_id);
 			autonuma_printk("autonuma migrate page not in LRU\n");
-			__autonuma_migrate_page_remove(page);
+			__autonuma_migrate_page_remove(page, page_autonuma);
 			put_page(page);
 			continue;
 		}
@@ -832,7 +841,7 @@ static int isolate_migratepages(struct list_head *migratepages,
 			/* FIXME: remove split_huge_page */
 			split_huge_page(page);
 
-		__autonuma_migrate_page_remove(page);
+		__autonuma_migrate_page_remove(page, page_autonuma);
 
 		zone = page_zone(page);
 		spin_lock_irq(&zone->lru_lock);
@@ -875,11 +884,16 @@ static struct page *alloc_migrate_dst_page(struct page *page,
 {
 	int nid = (int) data;
 	struct page *newpage;
+	struct page_autonuma *page_autonuma, *newpage_autonuma;
 	newpage = alloc_pages_exact_node(nid,
 					 GFP_HIGHUSER_MOVABLE | GFP_THISNODE,
 					 0);
-	if (newpage)
-		newpage->autonuma_last_nid = page->autonuma_last_nid;
+	if (newpage) {
+		page_autonuma = lookup_page_autonuma(page);
+		newpage_autonuma = lookup_page_autonuma(newpage);
+		newpage_autonuma->autonuma_last_nid =
+			page_autonuma->autonuma_last_nid;
+	}
 	return newpage;
 }
 
@@ -1299,7 +1313,8 @@ static int __init noautonuma_setup(char *str)
 	}
 	return 1;
 }
-__setup("noautonuma", noautonuma_setup);
+/* early so sparse.c also can see it */
+early_param("noautonuma", noautonuma_setup);
 
 static int __init autonuma_init(void)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b919c0c..faaf73f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1822,6 +1822,12 @@ static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
 {
 	pte_t *_pte;
 	bool mknuma = false;
+#ifdef CONFIG_AUTONUMA
+	struct page_autonuma *src_page_an, *page_an;
+
+	page_an = lookup_page_autonuma(page);
+#endif
+
 	for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++) {
 		pte_t pteval = *_pte;
 		struct page *src_page;
@@ -1835,11 +1841,12 @@ static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
 #endif
 			src_page = pte_page(pteval);
 #ifdef CONFIG_AUTONUMA
+			src_page_an = lookup_page_autonuma(src_page);
 			/* pick the last one, better than nothing */
 			autonuma_last_nid =
-				ACCESS_ONCE(src_page->autonuma_last_nid);
+				ACCESS_ONCE(src_page_an->autonuma_last_nid);
 			if (autonuma_last_nid >= 0)
-				ACCESS_ONCE(page->autonuma_last_nid) =
+				ACCESS_ONCE(page_an->autonuma_last_nid) =
 					autonuma_last_nid;
 #endif
 			copy_user_highpage(page, src_page, address, vma);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c354d4..b8c13ff 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -59,6 +59,7 @@
 #include <linux/prefetch.h>
 #include <linux/page-debug-flags.h>
 #include <linux/autonuma.h>
+#include <linux/page_autonuma.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -614,10 +615,7 @@ static inline int free_pages_check(struct page *page)
 		bad_page(page);
 		return 1;
 	}
-	autonuma_migrate_page_remove(page);
-#ifdef CONFIG_AUTONUMA
-	ACCESS_ONCE(page->autonuma_last_nid) = -1;
-#endif
+	autonuma_free_page(page);
 	if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
 		page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	return 0;
@@ -3673,10 +3671,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
 
 		INIT_LIST_HEAD(&page->lru);
-#ifdef CONFIG_AUTONUMA
-		page->autonuma_last_nid = -1;
-		page->autonuma_migrate_nid = -1;
-#endif
 #ifdef WANT_PAGE_VIRTUAL
 		/* The shift won't overflow because ZONE_NORMAL is below 4G. */
 		if (!is_highmem_idx(zone))
@@ -4304,23 +4298,14 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 	int nid = pgdat->node_id;
 	unsigned long zone_start_pfn = pgdat->node_start_pfn;
 	int ret;
-#ifdef CONFIG_AUTONUMA
-	int node_iter;
-#endif
 
 	pgdat_resize_init(pgdat);
-#ifdef CONFIG_AUTONUMA
-	spin_lock_init(&pgdat->autonuma_lock);
-	init_waitqueue_head(&pgdat->autonuma_knuma_migrated_wait);
-	pgdat->autonuma_nr_migrate_pages = 0;
-	for_each_node(node_iter)
-		INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
-#endif
 	pgdat->nr_zones = 0;
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	pgdat->kswapd_max_order = 0;
 	pgdat_page_cgroup_init(pgdat);
-	
+	pgdat_autonuma_init(pgdat);
+
 	for (j = 0; j < MAX_NR_ZONES; j++) {
 		struct zone *zone = pgdat->node_zones + j;
 		unsigned long size, realsize, memmap_pages;
diff --git a/mm/page_autonuma.c b/mm/page_autonuma.c
new file mode 100644
index 0000000..131b5c9
--- /dev/null
+++ b/mm/page_autonuma.c
@@ -0,0 +1,234 @@
+#include <linux/mm.h>
+#include <linux/memory.h>
+#include <linux/autonuma_flags.h>
+#include <linux/page_autonuma.h>
+#include <linux/bootmem.h>
+
+void __meminit page_autonuma_map_init(struct page *page,
+				      struct page_autonuma *page_autonuma,
+				      int nr_pages)
+{
+	struct page *end;
+	for (end = page + nr_pages; page < end; page++, page_autonuma++) {
+		page_autonuma->autonuma_last_nid = -1;
+		page_autonuma->autonuma_migrate_nid = -1;
+		page_autonuma->page = page;
+	}
+}
+
+static void __meminit __pgdat_autonuma_init(struct pglist_data *pgdat)
+{
+	int node_iter;
+
+	spin_lock_init(&pgdat->autonuma_lock);
+	init_waitqueue_head(&pgdat->autonuma_knuma_migrated_wait);
+	pgdat->autonuma_nr_migrate_pages = 0;
+	for_each_node(node_iter)
+		INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
+}
+
+#if !defined(CONFIG_SPARSEMEM)
+
+static unsigned long total_usage;
+
+void __meminit pgdat_autonuma_init(struct pglist_data *pgdat)
+{
+	__pgdat_autonuma_init(pgdat);
+	pgdat->node_page_autonuma = NULL;
+}
+
+struct page_autonuma *lookup_page_autonuma(struct page *page)
+{
+	unsigned long pfn = page_to_pfn(page);
+	unsigned long offset;
+	struct page_autonuma *base;
+
+	base = NODE_DATA(page_to_nid(page))->node_page_autonuma;
+#ifdef CONFIG_DEBUG_VM
+	/*
+	 * The sanity checks the page allocator does upon freeing a
+	 * page can reach here before the page_autonuma arrays are
+	 * allocated when feeding a range of pages to the allocator
+	 * for the first time during bootup or memory hotplug.
+	 */
+	if (unlikely(!base))
+		return NULL;
+#endif
+	offset = pfn - NODE_DATA(page_to_nid(page))->node_start_pfn;
+	return base + offset;
+}
+
+static int __init alloc_node_page_autonuma(int nid)
+{
+	struct page_autonuma *base;
+	unsigned long table_size;
+	unsigned long nr_pages;
+
+	nr_pages = NODE_DATA(nid)->node_spanned_pages;
+	if (!nr_pages)
+		return 0;
+
+	table_size = sizeof(struct page_autonuma) * nr_pages;
+
+	base = __alloc_bootmem_node_nopanic(NODE_DATA(nid),
+			table_size, PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+	if (!base)
+		return -ENOMEM;
+	NODE_DATA(nid)->node_page_autonuma = base;
+	total_usage += table_size;
+	page_autonuma_map_init(NODE_DATA(nid)->node_mem_map, base, nr_pages);
+	return 0;
+}
+
+void __init page_autonuma_init_flatmem(void)
+{
+
+	int nid, fail;
+
+	if (autonuma_impossible())
+		return;
+
+	for_each_online_node(nid)  {
+		fail = alloc_node_page_autonuma(nid);
+		if (fail)
+			goto fail;
+	}
+	printk(KERN_INFO "allocated %lu KBytes of page_autonuma\n",
+	       total_usage >> 10);
+	printk(KERN_INFO "please try the 'noautonuma' option if you"
+	" don't want to allocate page_autonuma memory\n");
+	return;
+fail:
+	printk(KERN_CRIT "allocation of page_autonuma failed.\n");
+	printk(KERN_CRIT "please try the 'noautonuma' boot option\n");
+	panic("Out of memory");
+}
+
+#else /* CONFIG_SPARSEMEM */
+
+struct page_autonuma *lookup_page_autonuma(struct page *page)
+{
+	unsigned long pfn = page_to_pfn(page);
+	struct mem_section *section = __pfn_to_section(pfn);
+
+	/* if it's not a power of two we may be wasting memory */
+	BUILD_BUG_ON(SECTION_PAGE_AUTONUMA_SIZE &
+		     (SECTION_PAGE_AUTONUMA_SIZE-1));
+
+#ifdef CONFIG_DEBUG_VM
+	/*
+	 * The sanity checks the page allocator does upon freeing a
+	 * page can reach here before the page_autonuma arrays are
+	 * allocated when feeding a range of pages to the allocator
+	 * for the first time during bootup or memory hotplug.
+	 */
+	if (!section->section_page_autonuma)
+		return NULL;
+#endif
+	return section->section_page_autonuma + pfn;
+}
+
+void __meminit pgdat_autonuma_init(struct pglist_data *pgdat)
+{
+	__pgdat_autonuma_init(pgdat);
+}
+
+struct page_autonuma * __meminit __kmalloc_section_page_autonuma(int nid,
+								 unsigned long nr_pages)
+{
+	struct page_autonuma *ret;
+	struct page *page;
+	unsigned long memmap_size = PAGE_AUTONUMA_SIZE * nr_pages;
+
+	page = alloc_pages_node(nid, GFP_KERNEL|__GFP_NOWARN,
+				get_order(memmap_size));
+	if (page)
+		goto got_map_page_autonuma;
+
+	ret = vmalloc(memmap_size);
+	if (ret)
+		goto out;
+
+	return NULL;
+got_map_page_autonuma:
+	ret = (struct page_autonuma *)pfn_to_kaddr(page_to_pfn(page));
+out:
+	return ret;
+}
+
+void __meminit __kfree_section_page_autonuma(struct page_autonuma *page_autonuma,
+					     unsigned long nr_pages)
+{
+	if (is_vmalloc_addr(page_autonuma))
+		vfree(page_autonuma);
+	else
+		free_pages((unsigned long)page_autonuma,
+			   get_order(PAGE_AUTONUMA_SIZE * nr_pages));
+}
+
+static struct page_autonuma __init *sparse_page_autonuma_map_populate(unsigned long pnum,
+								      int nid)
+{
+	struct page_autonuma *map;
+	unsigned long size;
+
+	map = alloc_remap(nid, SECTION_PAGE_AUTONUMA_SIZE);
+	if (map)
+		return map;
+
+	size = PAGE_ALIGN(SECTION_PAGE_AUTONUMA_SIZE);
+	map = __alloc_bootmem_node_high(NODE_DATA(nid), size,
+					PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+	return map;
+}
+
+void __init sparse_early_page_autonuma_alloc_node(struct page_autonuma **page_autonuma_map,
+						  unsigned long pnum_begin,
+						  unsigned long pnum_end,
+						  unsigned long map_count,
+						  int nodeid)
+{
+	void *map;
+	unsigned long pnum;
+	unsigned long size = SECTION_PAGE_AUTONUMA_SIZE;
+
+	map = alloc_remap(nodeid, size * map_count);
+	if (map) {
+		for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
+			if (!present_section_nr(pnum))
+				continue;
+			page_autonuma_map[pnum] = map;
+			map += size;
+		}
+		return;
+	}
+
+	size = PAGE_ALIGN(size);
+	map = __alloc_bootmem_node_high(NODE_DATA(nodeid), size * map_count,
+					PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+	if (map) {
+		for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
+			if (!present_section_nr(pnum))
+				continue;
+			page_autonuma_map[pnum] = map;
+			map += size;
+		}
+		return;
+	}
+
+	/* fallback */
+	for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
+		struct mem_section *ms;
+
+		if (!present_section_nr(pnum))
+			continue;
+		page_autonuma_map[pnum] = sparse_page_autonuma_map_populate(pnum, nodeid);
+		if (page_autonuma_map[pnum])
+			continue;
+		ms = __nr_to_section(pnum);
+		printk(KERN_ERR "%s: sparsemem page_autonuma map backing failed, "
+		       "some memory will not be available.\n", __func__);
+	}
+}
+
+#endif
diff --git a/mm/sparse.c b/mm/sparse.c
index a8bc7d3..e20d891 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -9,6 +9,7 @@
 #include <linux/export.h>
 #include <linux/spinlock.h>
 #include <linux/vmalloc.h>
+#include <linux/page_autonuma.h>
 #include "internal.h"
 #include <asm/dma.h>
 #include <asm/pgalloc.h>
@@ -242,7 +243,8 @@ struct page *sparse_decode_mem_map(unsigned long coded_mem_map, unsigned long pn
 
 static int __meminit sparse_init_one_section(struct mem_section *ms,
 		unsigned long pnum, struct page *mem_map,
-		unsigned long *pageblock_bitmap)
+		unsigned long *pageblock_bitmap,
+		struct page_autonuma *page_autonuma)
 {
 	if (!present_section(ms))
 		return -EINVAL;
@@ -251,6 +253,14 @@ static int __meminit sparse_init_one_section(struct mem_section *ms,
 	ms->section_mem_map |= sparse_encode_mem_map(mem_map, pnum) |
 							SECTION_HAS_MEM_MAP;
  	ms->pageblock_flags = pageblock_bitmap;
+#ifdef CONFIG_AUTONUMA
+	if (page_autonuma) {
+		ms->section_page_autonuma = page_autonuma - section_nr_to_pfn(pnum);
+		page_autonuma_map_init(mem_map, page_autonuma, PAGES_PER_SECTION);
+	}
+#else
+	BUG_ON(page_autonuma);
+#endif
 
 	return 1;
 }
@@ -485,6 +495,9 @@ void __init sparse_init(void)
 	int size2;
 	struct page **map_map;
 #endif
+	struct page_autonuma **uninitialized_var(page_autonuma_map);
+	struct page_autonuma *page_autonuma;
+	int size3;
 
 	/*
 	 * map is using big page (aka 2M in x86 64 bit)
@@ -579,6 +592,62 @@ void __init sparse_init(void)
 					 map_count, nodeid_begin);
 #endif
 
+	if (!autonuma_impossible()) {
+		unsigned long total_page_autonuma;
+		unsigned long page_autonuma_count;
+
+		size3 = sizeof(struct page_autonuma *) * NR_MEM_SECTIONS;
+		page_autonuma_map = alloc_bootmem(size3);
+		if (!page_autonuma_map)
+			panic("can not allocate page_autonuma_map\n");
+
+		for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
+			struct mem_section *ms;
+
+			if (!present_section_nr(pnum))
+				continue;
+			ms = __nr_to_section(pnum);
+			nodeid_begin = sparse_early_nid(ms);
+			pnum_begin = pnum;
+			break;
+		}
+		total_page_autonuma = 0;
+		page_autonuma_count = 1;
+		for (pnum = pnum_begin + 1; pnum < NR_MEM_SECTIONS; pnum++) {
+			struct mem_section *ms;
+			int nodeid;
+
+			if (!present_section_nr(pnum))
+				continue;
+			ms = __nr_to_section(pnum);
+			nodeid = sparse_early_nid(ms);
+			if (nodeid == nodeid_begin) {
+				page_autonuma_count++;
+				continue;
+			}
+			/* ok, we need to take care of pnum_begin to pnum - 1 */
+			sparse_early_page_autonuma_alloc_node(page_autonuma_map,
+							      pnum_begin,
+							      NR_MEM_SECTIONS,
+							      page_autonuma_count,
+							      nodeid_begin);
+			total_page_autonuma += SECTION_PAGE_AUTONUMA_SIZE * page_autonuma_count;
+			/* new start, update count etc*/
+			nodeid_begin = nodeid;
+			pnum_begin = pnum;
+			page_autonuma_count = 1;
+		}
+		/* ok, last chunk */
+		sparse_early_page_autonuma_alloc_node(page_autonuma_map, pnum_begin,
+						      NR_MEM_SECTIONS,
+						      page_autonuma_count, nodeid_begin);
+		total_page_autonuma += SECTION_PAGE_AUTONUMA_SIZE * page_autonuma_count;
+		printk("allocated %lu KBytes of page_autonuma\n",
+		       total_page_autonuma >> 10);
+		printk(KERN_INFO "please try the 'noautonuma' option if you"
+		       " don't want to allocate page_autonuma memory\n");
+	}
+
 	for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
 		if (!present_section_nr(pnum))
 			continue;
@@ -587,6 +656,14 @@ void __init sparse_init(void)
 		if (!usemap)
 			continue;
 
+		if (autonuma_impossible())
+			page_autonuma = NULL;
+		else {
+			page_autonuma = page_autonuma_map[pnum];
+			if (!page_autonuma)
+				continue;
+		}
+
 #ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
 		map = map_map[pnum];
 #else
@@ -596,11 +673,13 @@ void __init sparse_init(void)
 			continue;
 
 		sparse_init_one_section(__nr_to_section(pnum), pnum, map,
-								usemap);
+					usemap, page_autonuma);
 	}
 
 	vmemmap_populate_print_last();
 
+	if (!autonuma_impossible())
+		free_bootmem(__pa(page_autonuma_map), size3);
 #ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
 	free_bootmem(__pa(map_map), size2);
 #endif
@@ -687,7 +766,8 @@ static void free_map_bootmem(struct page *page, unsigned long nr_pages)
 }
 #endif /* CONFIG_SPARSEMEM_VMEMMAP */
 
-static void free_section_usemap(struct page *memmap, unsigned long *usemap)
+static void free_section_usemap(struct page *memmap, unsigned long *usemap,
+				struct page_autonuma *page_autonuma)
 {
 	struct page *usemap_page;
 	unsigned long nr_pages;
@@ -701,8 +781,14 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap)
 	 */
 	if (PageSlab(usemap_page)) {
 		kfree(usemap);
-		if (memmap)
+		if (memmap) {
 			__kfree_section_memmap(memmap, PAGES_PER_SECTION);
+			if (!autonuma_impossible())
+				__kfree_section_page_autonuma(page_autonuma,
+							      PAGES_PER_SECTION);
+			else
+				BUG_ON(page_autonuma);
+		}
 		return;
 	}
 
@@ -719,6 +805,13 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap)
 			>> PAGE_SHIFT;
 
 		free_map_bootmem(memmap_page, nr_pages);
+
+		if (!autonuma_impossible()) {
+			struct page *page_autonuma_page;
+			page_autonuma_page = virt_to_page(page_autonuma);
+			free_map_bootmem(page_autonuma_page, nr_pages);
+		} else
+			BUG_ON(page_autonuma);
 	}
 }
 
@@ -734,6 +827,7 @@ int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	struct mem_section *ms;
 	struct page *memmap;
+	struct page_autonuma *page_autonuma;
 	unsigned long *usemap;
 	unsigned long flags;
 	int ret;
@@ -753,6 +847,16 @@ int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
 		__kfree_section_memmap(memmap, nr_pages);
 		return -ENOMEM;
 	}
+	if (!autonuma_impossible()) {
+		page_autonuma = __kmalloc_section_page_autonuma(pgdat->node_id,
+								nr_pages);
+		if (!page_autonuma) {
+			kfree(usemap);
+			__kfree_section_memmap(memmap, nr_pages);
+			return -ENOMEM;
+		}
+	} else
+		page_autonuma = NULL;
 
 	pgdat_resize_lock(pgdat, &flags);
 
@@ -764,11 +868,16 @@ int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
 
 	ms->section_mem_map |= SECTION_MARKED_PRESENT;
 
-	ret = sparse_init_one_section(ms, section_nr, memmap, usemap);
+	ret = sparse_init_one_section(ms, section_nr, memmap, usemap,
+				      page_autonuma);
 
 out:
 	pgdat_resize_unlock(pgdat, &flags);
 	if (ret <= 0) {
+		if (!autonuma_impossible())
+			__kfree_section_page_autonuma(page_autonuma, nr_pages);
+		else
+			BUG_ON(page_autonuma);
 		kfree(usemap);
 		__kfree_section_memmap(memmap, nr_pages);
 	}
@@ -779,6 +888,7 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
 {
 	struct page *memmap = NULL;
 	unsigned long *usemap = NULL;
+	struct page_autonuma *page_autonuma = NULL;
 
 	if (ms->section_mem_map) {
 		usemap = ms->pageblock_flags;
@@ -786,8 +896,12 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
 						__section_nr(ms));
 		ms->section_mem_map = 0;
 		ms->pageblock_flags = NULL;
+
+#ifdef CONFIG_AUTONUMA
+		page_autonuma = ms->section_page_autonuma;
+#endif
 	}
 
-	free_section_usemap(memmap, usemap);
+	free_section_usemap(memmap, usemap, page_autonuma);
 }
 #endif

^ permalink raw reply related	[flat|nested] 236+ messages in thread

* [PATCH 35/35] autonuma: page_autonuma
@ 2012-05-25 17:02   ` Andrea Arcangeli
  0 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-25 17:02 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

Move the AutoNUMA per-page information from the "struct page" to a
separate page_autonuma data structure allocated in the memsection
(with sparsemem) or in the pgdat (with flatmem).

This is done to avoid growing the size of the "struct page"; the
page_autonuma data is only allocated if the kernel has been booted on
real NUMA hardware (and not if "noautonuma" is passed as a kernel
parameter).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma.h       |   18 +++-
 include/linux/autonuma_flags.h |    6 +
 include/linux/autonuma_types.h |   31 ++++++
 include/linux/mm_types.h       |   25 -----
 include/linux/mmzone.h         |   14 +++-
 include/linux/page_autonuma.h  |   53 +++++++++
 init/main.c                    |    2 +
 mm/Makefile                    |    2 +-
 mm/autonuma.c                  |   95 ++++++++++-------
 mm/huge_memory.c               |   11 ++-
 mm/page_alloc.c                |   23 +----
 mm/page_autonuma.c             |  234 ++++++++++++++++++++++++++++++++++++++++
 mm/sparse.c                    |  126 ++++++++++++++++++++-
 13 files changed, 543 insertions(+), 97 deletions(-)
 create mode 100644 include/linux/page_autonuma.h
 create mode 100644 mm/page_autonuma.c

diff --git a/include/linux/autonuma.h b/include/linux/autonuma.h
index a963dcb..1eb84d0 100644
--- a/include/linux/autonuma.h
+++ b/include/linux/autonuma.h
@@ -7,15 +7,26 @@
 
 extern void autonuma_enter(struct mm_struct *mm);
 extern void autonuma_exit(struct mm_struct *mm);
-extern void __autonuma_migrate_page_remove(struct page *page);
+extern void __autonuma_migrate_page_remove(struct page *,
+					   struct page_autonuma *);
 extern void autonuma_migrate_split_huge_page(struct page *page,
 					     struct page *page_tail);
 extern void autonuma_setup_new_exec(struct task_struct *p);
+extern struct page_autonuma *lookup_page_autonuma(struct page *page);
 
 static inline void autonuma_migrate_page_remove(struct page *page)
 {
-	if (ACCESS_ONCE(page->autonuma_migrate_nid) >= 0)
-		__autonuma_migrate_page_remove(page);
+	struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+	if (ACCESS_ONCE(page_autonuma->autonuma_migrate_nid) >= 0)
+		__autonuma_migrate_page_remove(page, page_autonuma);
+}
+
+static inline void autonuma_free_page(struct page *page)
+{
+	if (!autonuma_impossible()) {
+		autonuma_migrate_page_remove(page);
+		ACCESS_ONCE(lookup_page_autonuma(page)->autonuma_last_nid) = -1;
+	}
 }
 
 #define autonuma_printk(format, args...) \
@@ -29,6 +40,7 @@ static inline void autonuma_migrate_page_remove(struct page *page) {}
 static inline void autonuma_migrate_split_huge_page(struct page *page,
 						    struct page *page_tail) {}
 static inline void autonuma_setup_new_exec(struct task_struct *p) {}
+static inline void autonuma_free_page(struct page *page) {}
 
 #endif /* CONFIG_AUTONUMA */
 
diff --git a/include/linux/autonuma_flags.h b/include/linux/autonuma_flags.h
index 9c702fd..6ec837a 100644
--- a/include/linux/autonuma_flags.h
+++ b/include/linux/autonuma_flags.h
@@ -15,6 +15,12 @@ enum autonuma_flag {
 
 extern unsigned long autonuma_flags;
 
+static inline bool autonuma_impossible(void)
+{
+	return num_possible_nodes() <= 1 ||
+		test_bit(AUTONUMA_IMPOSSIBLE, &autonuma_flags);
+}
+
 static bool inline autonuma_enabled(void)
 {
 	return !!test_bit(AUTONUMA_FLAG, &autonuma_flags);
diff --git a/include/linux/autonuma_types.h b/include/linux/autonuma_types.h
index 65b175b..28f64ec 100644
--- a/include/linux/autonuma_types.h
+++ b/include/linux/autonuma_types.h
@@ -28,6 +28,37 @@ struct sched_autonuma {
 	unsigned long numa_fault[0];
 };
 
+struct page_autonuma {
+	/*
+	 * FIXME: move to pgdat section along with the memcg and allocate
+	 * at runtime only in presence of a numa system.
+	 */
+	/*
+	 * To modify autonuma_last_nid locklessly, the architecture
+	 * needs SMP atomic granularity < sizeof(long); not all archs
+	 * have that, notably some alphas. Archs without it require
+	 * autonuma_last_nid to be a long.
+	 */
+#if BITS_PER_LONG > 32
+	int autonuma_migrate_nid;
+	int autonuma_last_nid;
+#else
+#if MAX_NUMNODES >= 32768
+#error "too many nodes"
+#endif
+	/* FIXME: remember to check the updates are atomic */
+	short autonuma_migrate_nid;
+	short autonuma_last_nid;
+#endif
+	struct list_head autonuma_migrate_node;
+
+	/*
+	 * To find the page starting from the autonuma_migrate_node we
+	 * need a backlink.
+	 */
+	struct page *page;
+};
+
 extern int alloc_sched_autonuma(struct task_struct *tsk,
 				struct task_struct *orig,
 				int node);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e8dc82c..780ded7 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -126,31 +126,6 @@ struct page {
 		struct page *first_page;	/* Compound tail pages */
 	};
 
-#ifdef CONFIG_AUTONUMA
-	/*
-	 * FIXME: move to pgdat section along with the memcg and allocate
-	 * at runtime only in presence of a numa system.
-	 */
-	/*
-	 * To modify autonuma_last_nid lockless the architecture,
-	 * needs SMP atomic granularity < sizeof(long), not all archs
-	 * have that, notably some alpha. Archs without that requires
-	 * autonuma_last_nid to be a long.
-	 */
-#if BITS_PER_LONG > 32
-	int autonuma_migrate_nid;
-	int autonuma_last_nid;
-#else
-#if MAX_NUMNODES >= 32768
-#error "too many nodes"
-#endif
-	/* FIXME: remember to check the updates are atomic */
-	short autonuma_migrate_nid;
-	short autonuma_last_nid;
-#endif
-	struct list_head autonuma_migrate_node;
-#endif
-
 	/*
 	 * On machines where all RAM is mapped into kernel address space,
 	 * we can simply calculate the virtual address. On machines with
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8e578e6..89fa49f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -667,10 +667,13 @@ typedef struct pglist_data {
 	int kswapd_max_order;
 	enum zone_type classzone_idx;
 #ifdef CONFIG_AUTONUMA
-	spinlock_t autonuma_lock;
+#if !defined(CONFIG_SPARSEMEM)
+	struct page_autonuma *node_page_autonuma;
+#endif
 	struct list_head autonuma_migrate_head[MAX_NUMNODES];
 	unsigned long autonuma_nr_migrate_pages;
 	wait_queue_head_t autonuma_knuma_migrated_wait;
+	spinlock_t autonuma_lock;
 #endif
 } pg_data_t;
 
@@ -1022,6 +1025,15 @@ struct mem_section {
 	 * section. (see memcontrol.h/page_cgroup.h about this.)
 	 */
 	struct page_cgroup *page_cgroup;
+#endif
+#ifdef CONFIG_AUTONUMA
+	/*
+	 * If !SPARSEMEM, pgdat doesn't have page_autonuma pointer. We use
+	 * section.
+	 */
+	struct page_autonuma *section_page_autonuma;
+#endif
+#if defined(CONFIG_CGROUP_MEM_RES_CTLR) ^ defined(CONFIG_AUTONUMA)
 	unsigned long pad;
 #endif
 };
diff --git a/include/linux/page_autonuma.h b/include/linux/page_autonuma.h
new file mode 100644
index 0000000..05d2862
--- /dev/null
+++ b/include/linux/page_autonuma.h
@@ -0,0 +1,53 @@
+#ifndef _LINUX_PAGE_AUTONUMA_H
+#define _LINUX_PAGE_AUTONUMA_H
+
+#if defined(CONFIG_AUTONUMA) && !defined(CONFIG_SPARSEMEM)
+extern void __init page_autonuma_init_flatmem(void);
+#else
+static inline void __init page_autonuma_init_flatmem(void) {}
+#endif
+
+#ifdef CONFIG_AUTONUMA
+
+#include <linux/autonuma_flags.h>
+
+extern void __meminit page_autonuma_map_init(struct page *page,
+					     struct page_autonuma *page_autonuma,
+					     int nr_pages);
+
+#ifdef CONFIG_SPARSEMEM
+#define PAGE_AUTONUMA_SIZE (sizeof(struct page_autonuma))
+#define SECTION_PAGE_AUTONUMA_SIZE (PAGE_AUTONUMA_SIZE *	\
+				    PAGES_PER_SECTION)
+#endif
+
+extern void __meminit pgdat_autonuma_init(struct pglist_data *);
+
+#else /* CONFIG_AUTONUMA */
+
+#ifdef CONFIG_SPARSEMEM
+struct page_autonuma;
+#define PAGE_AUTONUMA_SIZE 0
+#define SECTION_PAGE_AUTONUMA_SIZE 0
+
+#define autonuma_impossible() true
+
+#endif
+
+static inline void pgdat_autonuma_init(struct pglist_data *pgdat) {}
+
+#endif /* CONFIG_AUTONUMA */
+
+#ifdef CONFIG_SPARSEMEM
+extern struct page_autonuma * __meminit __kmalloc_section_page_autonuma(int nid,
+									unsigned long nr_pages);
+extern void __meminit __kfree_section_page_autonuma(struct page_autonuma *page_autonuma,
+						    unsigned long nr_pages);
+extern void __init sparse_early_page_autonuma_alloc_node(struct page_autonuma **page_autonuma_map,
+							 unsigned long pnum_begin,
+							 unsigned long pnum_end,
+							 unsigned long map_count,
+							 int nodeid);
+#endif
+
+#endif /* _LINUX_PAGE_AUTONUMA_H */
diff --git a/init/main.c b/init/main.c
index 1ca6b32..275e914 100644
--- a/init/main.c
+++ b/init/main.c
@@ -68,6 +68,7 @@
 #include <linux/shmem_fs.h>
 #include <linux/slab.h>
 #include <linux/perf_event.h>
+#include <linux/page_autonuma.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -455,6 +456,7 @@ static void __init mm_init(void)
 	 * bigger than MAX_ORDER unless SPARSEMEM.
 	 */
 	page_cgroup_init_flatmem();
+	page_autonuma_init_flatmem();
 	mem_init();
 	kmem_cache_init();
 	percpu_init_late();
diff --git a/mm/Makefile b/mm/Makefile
index 67c77bd..5410eba 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -29,7 +29,7 @@ obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o thrash.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
 obj-$(CONFIG_NUMA) 	+= mempolicy.o
-obj-$(CONFIG_AUTONUMA) 	+= autonuma.o
+obj-$(CONFIG_AUTONUMA) 	+= autonuma.o page_autonuma.o
 obj-$(CONFIG_SPARSEMEM)	+= sparse.o
 obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_SLOB) += slob.o
diff --git a/mm/autonuma.c b/mm/autonuma.c
index 88c7ab3..96c02a2 100644
--- a/mm/autonuma.c
+++ b/mm/autonuma.c
@@ -51,12 +51,6 @@ static struct knumad_scan {
 	.mm_head = LIST_HEAD_INIT(knumad_scan.mm_head),
 };
 
-static inline bool autonuma_impossible(void)
-{
-	return num_possible_nodes() <= 1 ||
-		test_bit(AUTONUMA_IMPOSSIBLE, &autonuma_flags);
-}
-
 static inline void autonuma_migrate_lock(int nid)
 {
 	spin_lock(&NODE_DATA(nid)->autonuma_lock);
@@ -82,51 +76,57 @@ void autonuma_migrate_split_huge_page(struct page *page,
 				      struct page *page_tail)
 {
 	int nid, last_nid;
+	struct page_autonuma *page_autonuma, *page_tail_autonuma;
 
-	nid = page->autonuma_migrate_nid;
+	page_autonuma = lookup_page_autonuma(page);
+	page_tail_autonuma = lookup_page_autonuma(page_tail);
+
+	nid = page_autonuma->autonuma_migrate_nid;
 	VM_BUG_ON(nid >= MAX_NUMNODES);
 	VM_BUG_ON(nid < -1);
-	VM_BUG_ON(page_tail->autonuma_migrate_nid != -1);
+	VM_BUG_ON(page_tail_autonuma->autonuma_migrate_nid != -1);
 	if (nid >= 0) {
 		VM_BUG_ON(page_to_nid(page) != page_to_nid(page_tail));
 		autonuma_migrate_lock(nid);
-		list_add_tail(&page_tail->autonuma_migrate_node,
-			      &page->autonuma_migrate_node);
+		list_add_tail(&page_tail_autonuma->autonuma_migrate_node,
+			      &page_autonuma->autonuma_migrate_node);
 		autonuma_migrate_unlock(nid);
 
-		page_tail->autonuma_migrate_nid = nid;
+		page_tail_autonuma->autonuma_migrate_nid = nid;
 	}
 
-	last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+	last_nid = ACCESS_ONCE(page_autonuma->autonuma_last_nid);
 	if (last_nid >= 0)
-		page_tail->autonuma_last_nid = last_nid;
+		page_tail_autonuma->autonuma_last_nid = last_nid;
 }
 
-void __autonuma_migrate_page_remove(struct page *page)
+void __autonuma_migrate_page_remove(struct page *page,
+				    struct page_autonuma *page_autonuma)
 {
 	unsigned long flags;
 	int nid;
 
 	flags = compound_lock_irqsave(page);
 
-	nid = page->autonuma_migrate_nid;
+	nid = page_autonuma->autonuma_migrate_nid;
 	VM_BUG_ON(nid >= MAX_NUMNODES);
 	VM_BUG_ON(nid < -1);
 	if (nid >= 0) {
 		int numpages = hpage_nr_pages(page);
 		autonuma_migrate_lock(nid);
-		list_del(&page->autonuma_migrate_node);
+		list_del(&page_autonuma->autonuma_migrate_node);
 		NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
 		autonuma_migrate_unlock(nid);
 
-		page->autonuma_migrate_nid = -1;
+		page_autonuma->autonuma_migrate_nid = -1;
 	}
 
 	compound_unlock_irqrestore(page, flags);
 }
 
-static void __autonuma_migrate_page_add(struct page *page, int dst_nid,
-					int page_nid)
+static void __autonuma_migrate_page_add(struct page *page,
+					struct page_autonuma *page_autonuma,
+					int dst_nid, int page_nid)
 {
 	unsigned long flags;
 	int nid;
@@ -145,25 +145,25 @@ static void __autonuma_migrate_page_add(struct page *page, int dst_nid,
 	flags = compound_lock_irqsave(page);
 
 	numpages = hpage_nr_pages(page);
-	nid = page->autonuma_migrate_nid;
+	nid = page_autonuma->autonuma_migrate_nid;
 	VM_BUG_ON(nid >= MAX_NUMNODES);
 	VM_BUG_ON(nid < -1);
 	if (nid >= 0) {
 		autonuma_migrate_lock(nid);
-		list_del(&page->autonuma_migrate_node);
+		list_del(&page_autonuma->autonuma_migrate_node);
 		NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
 		autonuma_migrate_unlock(nid);
 	}
 
 	autonuma_migrate_lock(dst_nid);
-	list_add(&page->autonuma_migrate_node,
+	list_add(&page_autonuma->autonuma_migrate_node,
 		 &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid]);
 	NODE_DATA(dst_nid)->autonuma_nr_migrate_pages += numpages;
 	nr_migrate_pages = NODE_DATA(dst_nid)->autonuma_nr_migrate_pages;
 
 	autonuma_migrate_unlock(dst_nid);
 
-	page->autonuma_migrate_nid = dst_nid;
+	page_autonuma->autonuma_migrate_nid = dst_nid;
 
 	compound_unlock_irqrestore(page, flags);
 
@@ -179,9 +179,13 @@ static void __autonuma_migrate_page_add(struct page *page, int dst_nid,
 static void autonuma_migrate_page_add(struct page *page, int dst_nid,
 				      int page_nid)
 {
-	int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+	int migrate_nid;
+	struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+
+	migrate_nid = ACCESS_ONCE(page_autonuma->autonuma_migrate_nid);
 	if (migrate_nid != dst_nid)
-		__autonuma_migrate_page_add(page, dst_nid, page_nid);
+		__autonuma_migrate_page_add(page, page_autonuma,
+					    dst_nid, page_nid);
 }
 
 static bool balance_pgdat(struct pglist_data *pgdat,
@@ -252,23 +256,26 @@ static inline bool last_nid_set(struct task_struct *p,
 				struct page *page, int cpu_nid)
 {
 	bool ret = true;
-	int autonuma_last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+	struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+	int autonuma_last_nid = ACCESS_ONCE(page_autonuma->autonuma_last_nid);
 	VM_BUG_ON(cpu_nid < 0);
 	VM_BUG_ON(cpu_nid >= MAX_NUMNODES);
 	if (autonuma_last_nid >= 0 && autonuma_last_nid != cpu_nid) {
-		int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+		int migrate_nid;
+		migrate_nid = ACCESS_ONCE(page_autonuma->autonuma_migrate_nid);
 		if (migrate_nid >= 0 && migrate_nid != cpu_nid)
-			__autonuma_migrate_page_remove(page);
+			__autonuma_migrate_page_remove(page, page_autonuma);
 		ret = false;
 	}
 	if (autonuma_last_nid != cpu_nid)
-		ACCESS_ONCE(page->autonuma_last_nid) = cpu_nid;
+		ACCESS_ONCE(page_autonuma->autonuma_last_nid) = cpu_nid;
 	return ret;
 }
 
 static int __page_migrate_nid(struct page *page, int page_nid)
 {
-	int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+	struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+	int migrate_nid = ACCESS_ONCE(page_autonuma->autonuma_migrate_nid);
 	if (migrate_nid < 0)
 		migrate_nid = page_nid;
 #if 0
@@ -780,6 +787,7 @@ static int isolate_migratepages(struct list_head *migratepages,
 	for_each_online_node(nid) {
 		struct zone *zone;
 		struct page *page;
+		struct page_autonuma *page_autonuma;
 		cond_resched();
 		VM_BUG_ON(numa_node_id() != pgdat->node_id);
 		if (nid == pgdat->node_id) {
@@ -802,16 +810,17 @@ static int isolate_migratepages(struct list_head *migratepages,
 			autonuma_migrate_unlock_irq(pgdat->node_id);
 			continue;
 		}
-		page = list_entry(heads[nid].prev,
-				  struct page,
-				  autonuma_migrate_node);
+		page_autonuma = list_entry(heads[nid].prev,
+					   struct page_autonuma,
+					   autonuma_migrate_node);
+		page = page_autonuma->page;
 		if (unlikely(!get_page_unless_zero(page))) {
 			/*
 			 * Is getting freed and will remove self from the
 			 * autonuma list shortly, skip it for now.
 			 */
-			list_del(&page->autonuma_migrate_node);
-			list_add(&page->autonuma_migrate_node,
+			list_del(&page_autonuma->autonuma_migrate_node);
+			list_add(&page_autonuma->autonuma_migrate_node,
 				 &heads[nid]);
 			autonuma_migrate_unlock_irq(pgdat->node_id);
 			autonuma_printk("autonuma migrate page is free\n");
@@ -820,7 +829,7 @@ static int isolate_migratepages(struct list_head *migratepages,
 		if (!PageLRU(page)) {
 			autonuma_migrate_unlock_irq(pgdat->node_id);
 			autonuma_printk("autonuma migrate page not in LRU\n");
-			__autonuma_migrate_page_remove(page);
+			__autonuma_migrate_page_remove(page, page_autonuma);
 			put_page(page);
 			continue;
 		}
@@ -832,7 +841,7 @@ static int isolate_migratepages(struct list_head *migratepages,
 			/* FIXME: remove split_huge_page */
 			split_huge_page(page);
 
-		__autonuma_migrate_page_remove(page);
+		__autonuma_migrate_page_remove(page, page_autonuma);
 
 		zone = page_zone(page);
 		spin_lock_irq(&zone->lru_lock);
@@ -875,11 +884,16 @@ static struct page *alloc_migrate_dst_page(struct page *page,
 {
 	int nid = (int) data;
 	struct page *newpage;
+	struct page_autonuma *page_autonuma, *newpage_autonuma;
 	newpage = alloc_pages_exact_node(nid,
 					 GFP_HIGHUSER_MOVABLE | GFP_THISNODE,
 					 0);
-	if (newpage)
-		newpage->autonuma_last_nid = page->autonuma_last_nid;
+	if (newpage) {
+		page_autonuma = lookup_page_autonuma(page);
+		newpage_autonuma = lookup_page_autonuma(newpage);
+		newpage_autonuma->autonuma_last_nid =
+			page_autonuma->autonuma_last_nid;
+	}
 	return newpage;
 }
 
@@ -1299,7 +1313,8 @@ static int __init noautonuma_setup(char *str)
 	}
 	return 1;
 }
-__setup("noautonuma", noautonuma_setup);
+/* early so sparse.c also can see it */
+early_param("noautonuma", noautonuma_setup);
 
 static int __init autonuma_init(void)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b919c0c..faaf73f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1822,6 +1822,12 @@ static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
 {
 	pte_t *_pte;
 	bool mknuma = false;
+#ifdef CONFIG_AUTONUMA
+	struct page_autonuma *src_page_an, *page_an;
+
+	page_an = lookup_page_autonuma(page);
+#endif
+
 	for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++) {
 		pte_t pteval = *_pte;
 		struct page *src_page;
@@ -1835,11 +1841,12 @@ static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
 #endif
 			src_page = pte_page(pteval);
 #ifdef CONFIG_AUTONUMA
+			src_page_an = lookup_page_autonuma(src_page);
 			/* pick the last one, better than nothing */
 			autonuma_last_nid =
-				ACCESS_ONCE(src_page->autonuma_last_nid);
+				ACCESS_ONCE(src_page_an->autonuma_last_nid);
 			if (autonuma_last_nid >= 0)
-				ACCESS_ONCE(page->autonuma_last_nid) =
+				ACCESS_ONCE(page_an->autonuma_last_nid) =
 					autonuma_last_nid;
 #endif
 			copy_user_highpage(page, src_page, address, vma);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c354d4..b8c13ff 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -59,6 +59,7 @@
 #include <linux/prefetch.h>
 #include <linux/page-debug-flags.h>
 #include <linux/autonuma.h>
+#include <linux/page_autonuma.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -614,10 +615,7 @@ static inline int free_pages_check(struct page *page)
 		bad_page(page);
 		return 1;
 	}
-	autonuma_migrate_page_remove(page);
-#ifdef CONFIG_AUTONUMA
-	ACCESS_ONCE(page->autonuma_last_nid) = -1;
-#endif
+	autonuma_free_page(page);
 	if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
 		page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	return 0;
@@ -3673,10 +3671,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
 
 		INIT_LIST_HEAD(&page->lru);
-#ifdef CONFIG_AUTONUMA
-		page->autonuma_last_nid = -1;
-		page->autonuma_migrate_nid = -1;
-#endif
 #ifdef WANT_PAGE_VIRTUAL
 		/* The shift won't overflow because ZONE_NORMAL is below 4G. */
 		if (!is_highmem_idx(zone))
@@ -4304,23 +4298,14 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 	int nid = pgdat->node_id;
 	unsigned long zone_start_pfn = pgdat->node_start_pfn;
 	int ret;
-#ifdef CONFIG_AUTONUMA
-	int node_iter;
-#endif
 
 	pgdat_resize_init(pgdat);
-#ifdef CONFIG_AUTONUMA
-	spin_lock_init(&pgdat->autonuma_lock);
-	init_waitqueue_head(&pgdat->autonuma_knuma_migrated_wait);
-	pgdat->autonuma_nr_migrate_pages = 0;
-	for_each_node(node_iter)
-		INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
-#endif
 	pgdat->nr_zones = 0;
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	pgdat->kswapd_max_order = 0;
 	pgdat_page_cgroup_init(pgdat);
-	
+	pgdat_autonuma_init(pgdat);
+
 	for (j = 0; j < MAX_NR_ZONES; j++) {
 		struct zone *zone = pgdat->node_zones + j;
 		unsigned long size, realsize, memmap_pages;
diff --git a/mm/page_autonuma.c b/mm/page_autonuma.c
new file mode 100644
index 0000000..131b5c9
--- /dev/null
+++ b/mm/page_autonuma.c
@@ -0,0 +1,234 @@
+#include <linux/mm.h>
+#include <linux/memory.h>
+#include <linux/autonuma_flags.h>
+#include <linux/page_autonuma.h>
+#include <linux/bootmem.h>
+
+void __meminit page_autonuma_map_init(struct page *page,
+				      struct page_autonuma *page_autonuma,
+				      int nr_pages)
+{
+	struct page *end;
+	for (end = page + nr_pages; page < end; page++, page_autonuma++) {
+		page_autonuma->autonuma_last_nid = -1;
+		page_autonuma->autonuma_migrate_nid = -1;
+		page_autonuma->page = page;
+	}
+}
+
+static void __meminit __pgdat_autonuma_init(struct pglist_data *pgdat)
+{
+	int node_iter;
+
+	spin_lock_init(&pgdat->autonuma_lock);
+	init_waitqueue_head(&pgdat->autonuma_knuma_migrated_wait);
+	pgdat->autonuma_nr_migrate_pages = 0;
+	for_each_node(node_iter)
+		INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
+}
+
+#if !defined(CONFIG_SPARSEMEM)
+
+static unsigned long total_usage;
+
+void __meminit pgdat_autonuma_init(struct pglist_data *pgdat)
+{
+	__pgdat_autonuma_init(pgdat);
+	pgdat->node_page_autonuma = NULL;
+}
+
+struct page_autonuma *lookup_page_autonuma(struct page *page)
+{
+	unsigned long pfn = page_to_pfn(page);
+	unsigned long offset;
+	struct page_autonuma *base;
+
+	base = NODE_DATA(page_to_nid(page))->node_page_autonuma;
+#ifdef CONFIG_DEBUG_VM
+	/*
+	 * The sanity checks the page allocator does upon freeing a
+	 * page can reach here before the page_autonuma arrays are
+	 * allocated when feeding a range of pages to the allocator
+	 * for the first time during bootup or memory hotplug.
+	 */
+	if (unlikely(!base))
+		return NULL;
+#endif
+	offset = pfn - NODE_DATA(page_to_nid(page))->node_start_pfn;
+	return base + offset;
+}
+
+static int __init alloc_node_page_autonuma(int nid)
+{
+	struct page_autonuma *base;
+	unsigned long table_size;
+	unsigned long nr_pages;
+
+	nr_pages = NODE_DATA(nid)->node_spanned_pages;
+	if (!nr_pages)
+		return 0;
+
+	table_size = sizeof(struct page_autonuma) * nr_pages;
+
+	base = __alloc_bootmem_node_nopanic(NODE_DATA(nid),
+			table_size, PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+	if (!base)
+		return -ENOMEM;
+	NODE_DATA(nid)->node_page_autonuma = base;
+	total_usage += table_size;
+	page_autonuma_map_init(NODE_DATA(nid)->node_mem_map, base, nr_pages);
+	return 0;
+}
+
+void __init page_autonuma_init_flatmem(void)
+{
+
+	int nid, fail;
+
+	if (autonuma_impossible())
+		return;
+
+	for_each_online_node(nid)  {
+		fail = alloc_node_page_autonuma(nid);
+		if (fail)
+			goto fail;
+	}
+	printk(KERN_INFO "allocated %lu KBytes of page_autonuma\n",
+	       total_usage >> 10);
+	printk(KERN_INFO "please try the 'noautonuma' option if you"
+	" don't want to allocate page_autonuma memory\n");
+	return;
+fail:
+	printk(KERN_CRIT "allocation of page_autonuma failed.\n");
+	printk(KERN_CRIT "please try the 'noautonuma' boot option\n");
+	panic("Out of memory");
+}
+
+#else /* CONFIG_SPARSEMEM */
+
+struct page_autonuma *lookup_page_autonuma(struct page *page)
+{
+	unsigned long pfn = page_to_pfn(page);
+	struct mem_section *section = __pfn_to_section(pfn);
+
+	/* if it's not a power of two we may be wasting memory */
+	BUILD_BUG_ON(SECTION_PAGE_AUTONUMA_SIZE &
+		     (SECTION_PAGE_AUTONUMA_SIZE-1));
+
+#ifdef CONFIG_DEBUG_VM
+	/*
+	 * The sanity checks the page allocator does upon freeing a
+	 * page can reach here before the page_autonuma arrays are
+	 * allocated when feeding a range of pages to the allocator
+	 * for the first time during bootup or memory hotplug.
+	 */
+	if (!section->section_page_autonuma)
+		return NULL;
+#endif
+	return section->section_page_autonuma + pfn;
+}
+
+void __meminit pgdat_autonuma_init(struct pglist_data *pgdat)
+{
+	__pgdat_autonuma_init(pgdat);
+}
+
+struct page_autonuma * __meminit __kmalloc_section_page_autonuma(int nid,
+								 unsigned long nr_pages)
+{
+	struct page_autonuma *ret;
+	struct page *page;
+	unsigned long memmap_size = PAGE_AUTONUMA_SIZE * nr_pages;
+
+	page = alloc_pages_node(nid, GFP_KERNEL|__GFP_NOWARN,
+				get_order(memmap_size));
+	if (page)
+		goto got_map_page_autonuma;
+
+	ret = vmalloc(memmap_size);
+	if (ret)
+		goto out;
+
+	return NULL;
+got_map_page_autonuma:
+	ret = (struct page_autonuma *)pfn_to_kaddr(page_to_pfn(page));
+out:
+	return ret;
+}
+
+void __meminit __kfree_section_page_autonuma(struct page_autonuma *page_autonuma,
+					     unsigned long nr_pages)
+{
+	if (is_vmalloc_addr(page_autonuma))
+		vfree(page_autonuma);
+	else
+		free_pages((unsigned long)page_autonuma,
+			   get_order(PAGE_AUTONUMA_SIZE * nr_pages));
+}
+
+static struct page_autonuma __init *sparse_page_autonuma_map_populate(unsigned long pnum,
+								      int nid)
+{
+	struct page_autonuma *map;
+	unsigned long size;
+
+	map = alloc_remap(nid, SECTION_PAGE_AUTONUMA_SIZE);
+	if (map)
+		return map;
+
+	size = PAGE_ALIGN(SECTION_PAGE_AUTONUMA_SIZE);
+	map = __alloc_bootmem_node_high(NODE_DATA(nid), size,
+					PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+	return map;
+}
+
+void __init sparse_early_page_autonuma_alloc_node(struct page_autonuma **page_autonuma_map,
+						  unsigned long pnum_begin,
+						  unsigned long pnum_end,
+						  unsigned long map_count,
+						  int nodeid)
+{
+	void *map;
+	unsigned long pnum;
+	unsigned long size = SECTION_PAGE_AUTONUMA_SIZE;
+
+	map = alloc_remap(nodeid, size * map_count);
+	if (map) {
+		for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
+			if (!present_section_nr(pnum))
+				continue;
+			page_autonuma_map[pnum] = map;
+			map += size;
+		}
+		return;
+	}
+
+	size = PAGE_ALIGN(size);
+	map = __alloc_bootmem_node_high(NODE_DATA(nodeid), size * map_count,
+					PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+	if (map) {
+		for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
+			if (!present_section_nr(pnum))
+				continue;
+			page_autonuma_map[pnum] = map;
+			map += size;
+		}
+		return;
+	}
+
+	/* fallback */
+	for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
+		struct mem_section *ms;
+
+		if (!present_section_nr(pnum))
+			continue;
+		page_autonuma_map[pnum] = sparse_page_autonuma_map_populate(pnum, nodeid);
+		if (page_autonuma_map[pnum])
+			continue;
+		ms = __nr_to_section(pnum);
+		printk(KERN_ERR "%s: sparsemem page_autonuma map backing failed "
+		       "some memory will not be available.\n", __func__);
+	}
+}
+
+#endif
diff --git a/mm/sparse.c b/mm/sparse.c
index a8bc7d3..e20d891 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -9,6 +9,7 @@
 #include <linux/export.h>
 #include <linux/spinlock.h>
 #include <linux/vmalloc.h>
+#include <linux/page_autonuma.h>
 #include "internal.h"
 #include <asm/dma.h>
 #include <asm/pgalloc.h>
@@ -242,7 +243,8 @@ struct page *sparse_decode_mem_map(unsigned long coded_mem_map, unsigned long pn
 
 static int __meminit sparse_init_one_section(struct mem_section *ms,
 		unsigned long pnum, struct page *mem_map,
-		unsigned long *pageblock_bitmap)
+		unsigned long *pageblock_bitmap,
+		struct page_autonuma *page_autonuma)
 {
 	if (!present_section(ms))
 		return -EINVAL;
@@ -251,6 +253,14 @@ static int __meminit sparse_init_one_section(struct mem_section *ms,
 	ms->section_mem_map |= sparse_encode_mem_map(mem_map, pnum) |
 							SECTION_HAS_MEM_MAP;
  	ms->pageblock_flags = pageblock_bitmap;
+#ifdef CONFIG_AUTONUMA
+	if (page_autonuma) {
+		ms->section_page_autonuma = page_autonuma - section_nr_to_pfn(pnum);
+		page_autonuma_map_init(mem_map, page_autonuma, PAGES_PER_SECTION);
+	}
+#else
+	BUG_ON(page_autonuma);
+#endif
 
 	return 1;
 }
@@ -485,6 +495,9 @@ void __init sparse_init(void)
 	int size2;
 	struct page **map_map;
 #endif
+	struct page_autonuma **uninitialized_var(page_autonuma_map);
+	struct page_autonuma *page_autonuma;
+	int size3;
 
 	/*
 	 * map is using big page (aka 2M in x86 64 bit)
@@ -579,6 +592,62 @@ void __init sparse_init(void)
 					 map_count, nodeid_begin);
 #endif
 
+	if (!autonuma_impossible()) {
+		unsigned long total_page_autonuma;
+		unsigned long page_autonuma_count;
+
+		size3 = sizeof(struct page_autonuma *) * NR_MEM_SECTIONS;
+		page_autonuma_map = alloc_bootmem(size3);
+		if (!page_autonuma_map)
+			panic("can not allocate page_autonuma_map\n");
+
+		for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
+			struct mem_section *ms;
+
+			if (!present_section_nr(pnum))
+				continue;
+			ms = __nr_to_section(pnum);
+			nodeid_begin = sparse_early_nid(ms);
+			pnum_begin = pnum;
+			break;
+		}
+		total_page_autonuma = 0;
+		page_autonuma_count = 1;
+		for (pnum = pnum_begin + 1; pnum < NR_MEM_SECTIONS; pnum++) {
+			struct mem_section *ms;
+			int nodeid;
+
+			if (!present_section_nr(pnum))
+				continue;
+			ms = __nr_to_section(pnum);
+			nodeid = sparse_early_nid(ms);
+			if (nodeid == nodeid_begin) {
+				page_autonuma_count++;
+				continue;
+			}
+			/* ok, we need to take care of pnum_begin to pnum - 1 */
+			sparse_early_page_autonuma_alloc_node(page_autonuma_map,
+							      pnum_begin,
+							      NR_MEM_SECTIONS,
+							      page_autonuma_count,
+							      nodeid_begin);
+			total_page_autonuma += SECTION_PAGE_AUTONUMA_SIZE * page_autonuma_count;
+			/* new start, update count etc*/
+			nodeid_begin = nodeid;
+			pnum_begin = pnum;
+			page_autonuma_count = 1;
+		}
+		/* ok, last chunk */
+		sparse_early_page_autonuma_alloc_node(page_autonuma_map, pnum_begin,
+						      NR_MEM_SECTIONS,
+						      page_autonuma_count, nodeid_begin);
+		total_page_autonuma += SECTION_PAGE_AUTONUMA_SIZE * page_autonuma_count;
+		printk(KERN_INFO "allocated %lu KBytes of page_autonuma\n",
+		       total_page_autonuma >> 10);
+		printk(KERN_INFO "please try the 'noautonuma' option if you"
+		       " don't want to allocate page_autonuma memory\n");
+	}
+
 	for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
 		if (!present_section_nr(pnum))
 			continue;
@@ -587,6 +656,14 @@ void __init sparse_init(void)
 		if (!usemap)
 			continue;
 
+		if (autonuma_impossible())
+			page_autonuma = NULL;
+		else {
+			page_autonuma = page_autonuma_map[pnum];
+			if (!page_autonuma)
+				continue;
+		}
+
 #ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
 		map = map_map[pnum];
 #else
@@ -596,11 +673,13 @@ void __init sparse_init(void)
 			continue;
 
 		sparse_init_one_section(__nr_to_section(pnum), pnum, map,
-								usemap);
+					usemap, page_autonuma);
 	}
 
 	vmemmap_populate_print_last();
 
+	if (!autonuma_impossible())
+		free_bootmem(__pa(page_autonuma_map), size3);
 #ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
 	free_bootmem(__pa(map_map), size2);
 #endif
@@ -687,7 +766,8 @@ static void free_map_bootmem(struct page *page, unsigned long nr_pages)
 }
 #endif /* CONFIG_SPARSEMEM_VMEMMAP */
 
-static void free_section_usemap(struct page *memmap, unsigned long *usemap)
+static void free_section_usemap(struct page *memmap, unsigned long *usemap,
+				struct page_autonuma *page_autonuma)
 {
 	struct page *usemap_page;
 	unsigned long nr_pages;
@@ -701,8 +781,14 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap)
 	 */
 	if (PageSlab(usemap_page)) {
 		kfree(usemap);
-		if (memmap)
+		if (memmap) {
 			__kfree_section_memmap(memmap, PAGES_PER_SECTION);
+			if (!autonuma_impossible())
+				__kfree_section_page_autonuma(page_autonuma,
+							      PAGES_PER_SECTION);
+			else
+				BUG_ON(page_autonuma);
+		}
 		return;
 	}
 
@@ -719,6 +805,13 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap)
 			>> PAGE_SHIFT;
 
 		free_map_bootmem(memmap_page, nr_pages);
+
+		if (!autonuma_impossible()) {
+			struct page *page_autonuma_page;
+			page_autonuma_page = virt_to_page(page_autonuma);
+			free_map_bootmem(page_autonuma_page, nr_pages);
+		} else
+			BUG_ON(page_autonuma);
 	}
 }
 
@@ -734,6 +827,7 @@ int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	struct mem_section *ms;
 	struct page *memmap;
+	struct page_autonuma *page_autonuma;
 	unsigned long *usemap;
 	unsigned long flags;
 	int ret;
@@ -753,6 +847,16 @@ int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
 		__kfree_section_memmap(memmap, nr_pages);
 		return -ENOMEM;
 	}
+	if (!autonuma_impossible()) {
+		page_autonuma = __kmalloc_section_page_autonuma(pgdat->node_id,
+								nr_pages);
+		if (!page_autonuma) {
+			kfree(usemap);
+			__kfree_section_memmap(memmap, nr_pages);
+			return -ENOMEM;
+		}
+	} else
+		page_autonuma = NULL;
 
 	pgdat_resize_lock(pgdat, &flags);
 
@@ -764,11 +868,16 @@ int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
 
 	ms->section_mem_map |= SECTION_MARKED_PRESENT;
 
-	ret = sparse_init_one_section(ms, section_nr, memmap, usemap);
+	ret = sparse_init_one_section(ms, section_nr, memmap, usemap,
+				      page_autonuma);
 
 out:
 	pgdat_resize_unlock(pgdat, &flags);
 	if (ret <= 0) {
+		if (!autonuma_impossible())
+			__kfree_section_page_autonuma(page_autonuma, nr_pages);
+		else
+			BUG_ON(page_autonuma);
 		kfree(usemap);
 		__kfree_section_memmap(memmap, nr_pages);
 	}
@@ -779,6 +888,7 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
 {
 	struct page *memmap = NULL;
 	unsigned long *usemap = NULL;
+	struct page_autonuma *page_autonuma = NULL;
 
 	if (ms->section_mem_map) {
 		usemap = ms->pageblock_flags;
@@ -786,8 +896,12 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
 						__section_nr(ms));
 		ms->section_mem_map = 0;
 		ms->pageblock_flags = NULL;
+
+#ifdef CONFIG_AUTONUMA
+		page_autonuma = ms->section_page_autonuma;
+#endif
 	}
 
-	free_section_usemap(memmap, usemap);
+	free_section_usemap(memmap, usemap, page_autonuma);
 }
 #endif
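
For readers following the conversion, here is a minimal sketch (not part of
the patch; the helper name is hypothetical) of how callers are expected to
reach the per-page AutoNUMA data once it lives in struct page_autonuma
rather than in struct page:

	static inline int page_autonuma_last_nid(struct page *page)
	{
		struct page_autonuma *page_autonuma = lookup_page_autonuma(page);

		/* with CONFIG_DEBUG_VM the lookup can return NULL early during boot */
		if (!page_autonuma)
			return -1;
		return ACCESS_ONCE(page_autonuma->autonuma_last_nid);
	}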


^ permalink raw reply related	[flat|nested] 236+ messages in thread

* Re: [PATCH 03/35] xen: document Xen is using an unused bit for the pagetables
  2012-05-25 17:02   ` Andrea Arcangeli
@ 2012-05-25 20:26     ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 236+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-05-25 20:26 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter

On Fri, May 25, 2012 at 07:02:07PM +0200, Andrea Arcangeli wrote:
> Xen has taken over the last reserved bit available for the pagetables
> which is set through ioremap, this documents it and makes the code
> more readable.

Andrea, my previous response had a question about this - I was wondering
if you had a chance to look at it in your busy schedule and provide
some advice on how to remove the _PAGE_IOMAP altogether?

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 03/35] xen: document Xen is using an unused bit for the pagetables
  2012-05-25 20:26     ` Konrad Rzeszutek Wilk
@ 2012-05-26 15:59       ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-26 15:59 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter

Hi Konrad,

On Fri, May 25, 2012 at 04:26:56PM -0400, Konrad Rzeszutek Wilk wrote:
> On Fri, May 25, 2012 at 07:02:07PM +0200, Andrea Arcangeli wrote:
> > Xen has taken over the last reserved bit available for the pagetables
> > which is set through ioremap, this documents it and makes the code
> > more readable.
> 
> Andrea, my previous respone had a question about this - was wondering
> if you had a chance to look at that in your busy schedule and provide
> some advice on how to remove the _PAGE_IOMAP altogether?

I read your response, but I haven't looked into the P2M tree and
xen_val_pte code yet, sorry. Thanks for looking into this; if it's
possible to remove it without downsides, it would be a nice
cleanup. It's not urgent though, we're not running out of reserved
pte bits yet :).

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 00/35] AutoNUMA alpha14
  2012-05-25 17:02 ` Andrea Arcangeli
@ 2012-05-26 17:28   ` Rik van Riel
  -1 siblings, 0 replies; 236+ messages in thread
From: Rik van Riel @ 2012-05-26 17:28 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

On 05/25/2012 01:02 PM, Andrea Arcangeli wrote:

> I believe (realistically speaking) nobody is going to change
> applications to specify which thread is using which memory (for
> threaded apps) with the only exception of QEMU and a few others.

This is the point of contention.  I believe that for some
programs these kinds of modifications might happen, but
that for some other programs - managed runtimes like JVMs -
it is fundamentally impossible to do proper NUMA hinting,
because the programming languages that run on top of the
runtimes have no concept of pointers or memory ranges, making
it impossible to do those kinds of hints without fundamentally
changing the programming languages in question.

It would be good to get everybody's ideas out there on this
topic, because this is the fundamental factor in deciding
between Peter's approach to NUMA and Andrea's approach.

Ingo? Andrew? Linus? Paul?

> For not threaded apps that fits in a NUMA node, there's no way a blind
> home node can perform nearly as good as AutoNUMA:

The small tasks are easy. I suspect that either implementation
can be tuned to produce good results there.

It is the large programs (that do not fit in a NUMA node, either
due to too much memory, or due to too many threads) that will
force our hand in deciding whether to go with Peter's approach
or your approach.

I believe we do need to think carefully about this issue and decide
on a NUMA approach based on the fundamental technical properties of
each approach.

After we figure out what we want to do, we can nit-pick on the
codebase in question, and make sure it gets completely fixed.
I am sure neither codebase is perfect right now, but both are
entirely fixable.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 00/35] AutoNUMA alpha14
  2012-05-26 17:28   ` Rik van Riel
@ 2012-05-26 20:42     ` Linus Torvalds
  -1 siblings, 0 replies; 236+ messages in thread
From: Linus Torvalds @ 2012-05-26 20:42 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, Hillf Danton,
	Dan Smith, Peter Zijlstra, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter

On Sat, May 26, 2012 at 10:28 AM, Rik van Riel <riel@redhat.com> wrote:
>
> It would be good to get everybody's ideas out there on this
> topic, because this is the fundamental factor in deciding
> between Peter's approach to NUMA and Andrea's approach.
>
> Ingo? Andrew? Linus? Paul?

I'm a *firm* believer that if it cannot be done automatically "well
enough", the absolute last thing we should ever do is worry about the
crazy people who think they can tweak it to perfection with complex
interfaces.

You can't do it, except for trivial loads (often benchmarks), and for
very specific machines.

So I think very strongly that we should entirely dismiss all the
people who want to do manual placement and claim that they know what
their loads do. They're either full of sh*t (most likely), or they
have a very specific benchmark and platform that they are tuning for
that is totally irrelevant to everybody else.

What we *should* try to aim for is a system that doesn't do horribly
badly right out of the box. IOW, no tuning what-so-ever (at most a
kind of "yes, I want you to try to do the NUMA thing" flag to just
enable it at all), and try to not suck.

Seriously. "Try to avoid sucking" is *way* superior to "We can let the
user tweak things to their hearts' content". Because users won't get it
right.

Give the anal people a knob they can tweak, and tell them it does
something fancy. And never actually wire the damn thing up. They'll be
really happy with their OCD tweaking, and do lots of nice graphs that
just show how the error bars are so big that you can find any damn
pattern you want in random noise.

                 Linus

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 23/35] autonuma: core
  2012-05-25 17:02   ` Andrea Arcangeli
@ 2012-05-29 11:45     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 236+ messages in thread
From: Kirill A. Shutemov @ 2012-05-29 11:45 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter

On Fri, May 25, 2012 at 07:02:27PM +0200, Andrea Arcangeli wrote:

> +static int knumad_do_scan(void)
> +{

...

> +	if (knumad_test_exit(mm) || !vma) {
> +		mm_autonuma = mm->mm_autonuma;
> +		if (mm_autonuma->mm_node.next != &knumad_scan.mm_head) {
> +			mm_autonuma = list_entry(mm_autonuma->mm_node.next,
> +						 struct mm_autonuma, mm_node);
> +			knumad_scan.mm = mm_autonuma->mm;
> +			atomic_inc(&knumad_scan.mm->mm_count);
> +			knumad_scan.address = 0;
> +			knumad_scan.mm->mm_autonuma->numa_fault_pass++;
> +		} else
> +			knumad_scan.mm = NULL;

knumad_scan.mm should be set to NULL only after the list_del(), otherwise
you will race with autonuma_exit():

[   22.905208] ------------[ cut here ]------------
[   23.003620] WARNING: at /home/kas/git/public/linux/lib/list_debug.c:50 __list_del_entry+0x63/0xd0()
[   23.003621] Hardware name: QSSC-S4R
[   23.003624] list_del corruption, ffff880771a49300->next is LIST_POISON1 (dead000000100100)
[   23.003626] Modules linked in: megaraid_sas
[   23.003629] Pid: 569, comm: udevd Not tainted 3.4.0+ #31
[   23.003630] Call Trace:
[   23.003640]  [<ffffffff8105956f>] warn_slowpath_common+0x7f/0xc0
[   23.003643]  [<ffffffff81059666>] warn_slowpath_fmt+0x46/0x50
[   23.003645]  [<ffffffff813202e3>] __list_del_entry+0x63/0xd0
[   23.003648]  [<ffffffff81320361>] list_del+0x11/0x40
[   23.003654]  [<ffffffff8117b80f>] autonuma_exit+0x5f/0xb0
[   23.003657]  [<ffffffff810567ab>] mmput+0x7b/0x120
[   23.003663]  [<ffffffff8105e7d8>] exit_mm+0x108/0x130
[   23.003674]  [<ffffffff8165da5b>] ? _raw_spin_unlock_irq+0x2b/0x40
[   23.003677]  [<ffffffff8105e94a>] do_exit+0x14a/0x8d0
[   23.003682]  [<ffffffff811b71c6>] ? mntput+0x26/0x40
[   23.003688]  [<ffffffff8119a599>] ? fput+0x1c9/0x270
[   23.003693]  [<ffffffff81319dd9>] ? lockdep_sys_exit_thunk+0x35/0x67
[   23.003696]  [<ffffffff8105f41f>] do_group_exit+0x4f/0xc0
[   23.003698]  [<ffffffff8105f4a7>] sys_exit_group+0x17/0x20
[   23.003703]  [<ffffffff816663e9>] system_call_fastpath+0x16/0x1b
[   23.003705] ---[ end trace 8b21c7adb0af191b ]---

> +
> +		if (knumad_test_exit(mm))
> +			list_del(&mm->mm_autonuma->mm_node);
> +		else
> +			mm_numa_fault_flush(mm);
> +
> +		mmdrop(mm);
> +	}
> +
> +	return progress;
> +}

...

> +
> +static int knuma_scand(void *none)
> +{

...

> +	mm = knumad_scan.mm;
> +	knumad_scan.mm = NULL;

The same problem here.

> +	if (mm)
> +		list_del(&mm->mm_autonuma->mm_node);
> +	mutex_unlock(&knumad_mm_mutex);
> +
> +	if (mm)
> +		mmdrop(mm);
> +
> +	return 0;
> +}

-- 
 Kirill A. Shutemov
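
A minimal sketch of the reordering being suggested (untested; it assumes
knumad_mm_mutex is held around this block, as in the patch): keep
knumad_scan.mm pointing at "mm" until after the list_del(), so a concurrent
autonuma_exit() still sees knumad_scan.mm == mm and leaves the unlinking to
knuma_scand:

	if (knumad_test_exit(mm) || !vma) {
		struct mm_autonuma *next_ma = NULL;

		mm_autonuma = mm->mm_autonuma;
		if (mm_autonuma->mm_node.next != &knumad_scan.mm_head)
			next_ma = list_entry(mm_autonuma->mm_node.next,
					     struct mm_autonuma, mm_node);

		/* unlink while knumad_scan.mm still points at "mm" */
		if (knumad_test_exit(mm))
			list_del(&mm_autonuma->mm_node);
		else
			mm_numa_fault_flush(mm);

		/* only now advance (or clear) knumad_scan.mm */
		if (next_ma) {
			knumad_scan.mm = next_ma->mm;
			atomic_inc(&knumad_scan.mm->mm_count);
			knumad_scan.address = 0;
			knumad_scan.mm->mm_autonuma->numa_fault_pass++;
		} else
			knumad_scan.mm = NULL;

		mmdrop(mm);
	}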

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 08/35] autonuma: introduce kthread_bind_node()
  2012-05-25 17:02   ` Andrea Arcangeli
@ 2012-05-29 12:49     ` Peter Zijlstra
  -1 siblings, 0 replies; 236+ messages in thread
From: Peter Zijlstra @ 2012-05-29 12:49 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
>  /**
> + * kthread_bind_node - bind a just-created kthread to the CPUs of a node.
> + * @p: thread created by kthread_create().
> + * @nid: node (might not be online, must be possible) for @k to run on.
> + *
> + * Description: This function is equivalent to set_cpus_allowed(),
> + * except that @nid doesn't need to be online, and the thread must be
> + * stopped (i.e., just returned from kthread_create()).
> + */
> +void kthread_bind_node(struct task_struct *p, int nid)
> +{
> +       /* Must have done schedule() in kthread() before we set_task_cpu */
> +       if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
> +               WARN_ON(1);
> +               return;
> +       }
> +
> +       /* It's safe because the task is inactive. */
> +       do_set_cpus_allowed(p, cpumask_of_node(nid));
> +       p->flags |= PF_THREAD_BOUND;

No, as I've said before, this is wrong. You should only ever use
PF_THREAD_BOUND when it's strictly per-cpu. Moving your NUMA threads
to a different node is silly but not fatal in any way.

> +}
> +EXPORT_SYMBOL(kthread_bind_node); 
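
For comparison, a sketch of the variant the comment above implies (untested):
keep the node binding but do not set PF_THREAD_BOUND, so the mask stays an
ordinary affinity that can still be overridden later, and a thread that ends
up off-node is merely suboptimal instead of violating a per-cpu invariant:

	void kthread_bind_node(struct task_struct *p, int nid)
	{
		/* Must have done schedule() in kthread() before we set_task_cpu */
		if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
			WARN_ON(1);
			return;
		}

		/* It's safe because the task is inactive. */
		do_set_cpus_allowed(p, cpumask_of_node(nid));
		/* no PF_THREAD_BOUND: reserve that for strictly per-cpu kthreads */
	}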



^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 12/35] autonuma: CPU follow memory algorithm
  2012-05-25 17:02   ` Andrea Arcangeli
@ 2012-05-29 13:00     ` Peter Zijlstra
  -1 siblings, 0 replies; 236+ messages in thread
From: Peter Zijlstra @ 2012-05-29 13:00 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
> @@ -3274,6 +3268,8 @@ need_resched:
>  
>         post_schedule(rq);
>  
> +       sched_autonuma_balance();
> +
>         sched_preempt_enable_no_resched();
>         if (need_resched())
>                 goto need_resched;



> +void sched_autonuma_balance(void)
> +{

> +       for_each_online_node(nid) {
> +       }

> +       for_each_online_node(nid) {
> +               for_each_cpu_and(cpu, cpumask_of_node(nid), allowed) {


> +               }
> +       }

> +       stop_one_cpu(this_cpu, migration_cpu_stop, &arg);
> +} 

NAK

You do _NOT_ put an O(nr_cpus) or even O(nr_nodes) loop in the middle of
schedule().

I see you've made it conditional, but schedule() taking that long --
even occasionally -- is just not cool.

schedule() calling schedule() is also an absolute abomination.

You were told to fix this several times..
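
One possible shape of the fix being asked for, sketched with hypothetical
helper names (this is not the AutoNUMA code): keep schedule() O(1) and push
the O(nr_nodes)/O(nr_cpus) scan out to the existing SCHED_SOFTIRQ
load-balance path:

	static DEFINE_PER_CPU(int, autonuma_balance_wanted);

	/* cheap hook, safe to call from schedule(): no loops, no locks */
	static inline void autonuma_balance_kick(void)
	{
		__this_cpu_write(autonuma_balance_wanted, 1);
	}

	/* called from the rebalance softirq instead of from schedule() */
	static void autonuma_balance_check(void)
	{
		if (!__this_cpu_read(autonuma_balance_wanted))
			return;
		__this_cpu_write(autonuma_balance_wanted, 0);
		/*
		 * The expensive node/cpu scan runs here; the actual task
		 * migration may still have to be deferred to process
		 * context, since stop_one_cpu() can sleep.
		 */
		sched_autonuma_balance();
	}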

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 12/35] autonuma: CPU follow memory algorithm
  2012-05-25 17:02   ` Andrea Arcangeli
@ 2012-05-29 13:10     ` Peter Zijlstra
  -1 siblings, 0 replies; 236+ messages in thread
From: Peter Zijlstra @ 2012-05-29 13:10 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
> + * This function is responsible for deciding which is the best CPU
> + * each process should be running on according to the NUMA
> + * affinity. To do that it evaluates all CPUs and checks if there's
> + * any remote CPU where the current process has more NUMA affinity
> + * than with the current CPU, and where the process running on the
> + * remote CPU has less NUMA affinity than the current process to run
> + * on the remote CPU. Ideally this should be expanded to take all
> + * runnable processes into account but this is a good
> + * approximation. When we compare the NUMA affinity between the
> + * current and remote CPU we use the per-thread information if the
> + * remote CPU runs a thread of the same process that the current task
> + * belongs to, or the per-process information if the remote CPU runs a
> + * different process than the current one. If the remote CPU runs the
> + * idle task we require both the per-thread and per-process
> + * information to have more affinity with the remote CPU than with the
> + * current CPU for a migration to happen. 

This doesn't explain anything in the dense code that follows.

What statistics, how are they used, with what goal etc..
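
A pseudo-C reading of the rule the quoted comment describes (identifiers are
hypothetical and the idle-CPU case is left out; a reading aid, not the
patch's actual code):

	int this_nid = numa_node_id(), best_cpu = -1, cpu, nid;

	for_each_online_node(nid) {
		for_each_cpu_and(cpu, cpumask_of_node(nid), tsk_cpus_allowed(p)) {
			struct task_struct *other = cpu_rq(cpu)->curr;
			/*
			 * aff() stands for the per-thread statistics when
			 * "other" shares p's mm, and for the per-process
			 * statistics otherwise; the idle case (which needs
			 * both to agree) is elided.
			 */
			if (aff(p, nid) > aff(p, this_nid) &&
			    aff(p, nid) > aff(other, nid))
				best_cpu = cpu;	/* candidate for migration */
		}
	}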



^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 13/35] autonuma: add page structure fields
  2012-05-25 17:02   ` Andrea Arcangeli
@ 2012-05-29 13:16     ` Peter Zijlstra
  -1 siblings, 0 replies; 236+ messages in thread
From: Peter Zijlstra @ 2012-05-29 13:16 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:

> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 780ded7..e8dc82c 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -126,6 +126,31 @@ struct page {
>  		struct page *first_page;	/* Compound tail pages */
>  	};
>  
> +#ifdef CONFIG_AUTONUMA
> +	/*
> +	 * FIXME: move to pgdat section along with the memcg and allocate
> +	 * at runtime only in presence of a numa system.
> +	 */
> +	/*
> +	 * To modify autonuma_last_nid lockless the architecture,
> +	 * needs SMP atomic granularity < sizeof(long), not all archs
> +	 * have that, notably some alpha. Archs without that requires
> +	 * autonuma_last_nid to be a long.
> +	 */
> +#if BITS_PER_LONG > 32
> +	int autonuma_migrate_nid;
> +	int autonuma_last_nid;
> +#else
> +#if MAX_NUMNODES >= 32768
> +#error "too many nodes"
> +#endif
> +	/* FIXME: remember to check the updates are atomic */
> +	short autonuma_migrate_nid;
> +	short autonuma_last_nid;
> +#endif
> +	struct list_head autonuma_migrate_node;
> +#endif
> +
>  	/*
>  	 * On machines where all RAM is mapped into kernel address space,
>  	 * we can simply calculate the virtual address. On machines with


24 bytes per page, or ~0.6% of memory gone. This is far too great a
price to pay.

At LSF/MM Rik already suggested you limit the number of pages that can
be migrated concurrently and use that to move the extra list_head out of
struct page and into a smaller number of extra structures, reducing the
total overhead.
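
A sketch of that direction (hypothetical structure and names, for
illustration only): keep only the two small nid fields in struct page and
move the list linkage into a bounded pool of migration descriptors, so the
worst-case overhead is capped by the number of pages allowed to be queued
for migration at any one time:

	struct autonuma_migrate_entry {
		struct list_head node;	/* per-destination-node migrate queue */
		struct page *page;	/* page waiting to be migrated */
	};

	/* bounds both the memory overhead and the migration backlog */
	#define AUTONUMA_MIGRATE_MAX	1024

	static struct autonuma_migrate_entry autonuma_migrate_pool[AUTONUMA_MIGRATE_MAX];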



^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 00/35] AutoNUMA alpha14
  2012-05-25 17:02 ` Andrea Arcangeli
@ 2012-05-29 13:36   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 236+ messages in thread
From: Kirill A. Shutemov @ 2012-05-29 13:36 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter

<2>[  729.065896] kernel BUG at /home/kas/git/public/linux/mm/autonuma.c:850!
<4>[  729.176966] invalid opcode: 0000 [#1] SMP 
<4>[  729.287517] CPU 24 
<4>[  729.397025] Modules linked in: sunrpc bnep bluetooth rfkill cpufreq_ondemand acpi_cpufreq freq_table mperf ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack coretemp kvm asix usbnet igb i7core_edac crc32c_intel iTCO_wdt i2c_i801 ioatdma pcspkr tpm_tis microcode joydev mii i2c_core iTCO_vendor_support tpm edac_core dca ptp tpm_bios pps_core megaraid_sas [last unloaded: scsi_wait_scan]
<4>[  729.870867] 
<4>[  729.989848] Pid: 342, comm: knuma_migrated0 Not tainted 3.4.0+ #32 QCI QSSC-S4R/QSSC-S4R
<4>[  730.111497] RIP: 0010:[<ffffffff8117baf5>]  [<ffffffff8117baf5>] knuma_migrated+0x915/0xa50
<4>[  730.234615] RSP: 0018:ffff88026c8b7d40  EFLAGS: 00010006
<4>[  730.357993] RAX: 0000000000000000 RBX: ffff88027ffea000 RCX: 0000000000000002
<4>[  730.482959] RDX: 0000000000000002 RSI: ffffea0017b7001c RDI: ffffea0017b70000
<4>[  730.607709] RBP: ffff88026c8b7e90 R08: 0000000000000001 R09: 0000000000000000
<4>[  730.733082] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
<4>[  730.858424] R13: 0000000000000200 R14: ffffea0017b70000 R15: ffff88067ffeae00
<4>[  730.983686] FS:  0000000000000000(0000) GS:ffff880272200000(0000) knlGS:0000000000000000
<4>[  731.110169] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
<4>[  731.236396] CR2: 00007ff3463dd000 CR3: 0000000001c0b000 CR4: 00000000000007e0
<4>[  731.363987] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[  731.490875] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[  731.616769] Process knuma_migrated0 (pid: 342, threadinfo ffff88026c8b6000, task ffff88026c8ac5a0)
<4>[  731.745286] Stack:
<4>[  731.871079]  ffff88027fffdef0 ffff88026c8ac5a0 0000000000000082 ffff88026c8ac5a0
<4>[  731.999154]  ffff88026c8b7d98 0000000100000003 ffffea000f165e60 ffff88067ffeb2c0
<4>[  732.126565]  ffffea000f165e60 ffffea000f165e20 ffffffff8107df90 ffff88026c8b7d98
<4>[  732.253488] Call Trace:
<4>[  732.377354]  [<ffffffff8107df90>] ? __init_waitqueue_head+0x60/0x60
<4>[  732.501250]  [<ffffffff8107e075>] ? finish_wait+0x45/0x90
<4>[  732.623816]  [<ffffffff8117b1e0>] ? __autonuma_migrate_page_remove+0x130/0x130
<4>[  732.748194]  [<ffffffff8107d437>] kthread+0xb7/0xc0
<4>[  732.872468]  [<ffffffff81668324>] kernel_thread_helper+0x4/0x10
<4>[  732.997588]  [<ffffffff8107d380>] ? __init_kthread_worker+0x70/0x70
<4>[  733.120411]  [<ffffffff81668320>] ? gs_change+0x13/0x13
<4>[  733.240230] Code: 4e 00 48 8b 05 6d 05 b8 00 a8 04 0f 84 b5 f9 ff ff 48 c7 c7 b0 c9 9e 81 31 c0 e8 04 6c 4d 00 e9 a2 f9 ff ff 66 90 e8 8a 87 4d 00 <0f> 0b 48 c7 c7 d0 c9 9e 81 31 c0 e8 e8 6b 4d 00 e9 73 f9 ff ff 
<1>[  733.489612] RIP  [<ffffffff8117baf5>] knuma_migrated+0x915/0xa50
<4>[  733.614281]  RSP <ffff88026c8b7d40>
<4>[  733.736855] ---[ end trace 25052e4d75b2f1f6 ]---

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 14/35] autonuma: knuma_migrated per NUMA node queues
  2012-05-25 17:02   ` Andrea Arcangeli
@ 2012-05-29 13:51     ` Peter Zijlstra
  -1 siblings, 0 replies; 236+ messages in thread
From: Peter Zijlstra @ 2012-05-29 13:51 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:


> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 41aa49b..8e578e6 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -666,6 +666,12 @@ typedef struct pglist_data {
>  	struct task_struct *kswapd;
>  	int kswapd_max_order;
>  	enum zone_type classzone_idx;
> +#ifdef CONFIG_AUTONUMA
> +	spinlock_t autonuma_lock;
> +	struct list_head autonuma_migrate_head[MAX_NUMNODES];
> +	unsigned long autonuma_nr_migrate_pages;
> +	wait_queue_head_t autonuma_knuma_migrated_wait;
> +#endif
>  } pg_data_t;
>  
>  #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)

O(nr_nodes^2) data.. ISTR people rewriting a certain slab allocator to
get rid of that :-)

Also, don't forget that MAX_NUMNODES is an unconditional 512 on distro
kernels, even when we only have 2.

Now the total wasted space isn't too bad since it's only 16 bytes per
entry, totaling a whole 2M for a 256 node system. But still, something
like that wants at least a mention somewhere.
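
Purely as an illustration of how that cost could be avoided (my sketch,
not part of the posted series, and it assumes autonuma_migrate_head is
turned into a pointer): size the array by nr_node_ids at boot, so a
2-node box running a distro config allocates 2 list heads per pgdat
rather than 512.

#include <linux/mmzone.h>
#include <linux/nodemask.h>
#include <linux/list.h>
#include <linux/slab.h>

static int __init autonuma_alloc_migrate_heads(pg_data_t *pgdat)
{
	int nid;

	/* nr_node_ids is the number of possible nodes (often 2), not the
	 * compile-time MAX_NUMNODES of the distro config. */
	pgdat->autonuma_migrate_head =
		kcalloc(nr_node_ids, sizeof(struct list_head), GFP_KERNEL);
	if (!pgdat->autonuma_migrate_head)
		return -ENOMEM;

	for (nid = 0; nid < nr_node_ids; nid++)
		INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[nid]);

	return 0;
}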



^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 12/35] autonuma: CPU follow memory algorithm
  2012-05-29 13:00     ` Peter Zijlstra
@ 2012-05-29 13:54       ` Rik van Riel
  -1 siblings, 0 replies; 236+ messages in thread
From: Rik van Riel @ 2012-05-29 13:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, Hillf Danton,
	Dan Smith, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter

On 05/29/2012 09:00 AM, Peter Zijlstra wrote:
> On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
>> @@ -3274,6 +3268,8 @@ need_resched:
>>
>>          post_schedule(rq);
>>
>> +       sched_autonuma_balance();
>> +
>>          sched_preempt_enable_no_resched();
>>          if (need_resched())
>>                  goto need_resched;
>
>
>
>> +void sched_autonuma_balance(void)
>> +{
>
>> +       for_each_online_node(nid) {
>> +       }
>
>> +       for_each_online_node(nid) {
>> +               for_each_cpu_and(cpu, cpumask_of_node(nid), allowed) {
>
>
>> +               }
>> +       }
>
>> +       stop_one_cpu(this_cpu, migration_cpu_stop,&arg);
>> +}
>
> NAK
>
> You do _NOT_ put a O(nr_cpus) or even O(nr_nodes) loop in the middle of
> schedule().
>
> I see you've made it conditional, but schedule() taking that long --
> even occasionally -- is just not cool.
>
> schedule() calling schedule() is also an absolute abomination.
>
> You were told to fix this several times..

Do you have any suggestions for how Andrea could fix this?

Pairwise comparisons with a busy CPU/node?
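
One way to read that suggestion, sketched purely for illustration
(nothing in this thread specifies it this way, and the helper below is
hypothetical): look at a single candidate node per invocation and
rotate through the online nodes across calls, so the work done near the
scheduler stays O(1) instead of walking every node and every CPU.

#include <linux/nodemask.h>
#include <linux/numa.h>

/*
 * Hypothetical sketch: pick one remote node to compare against,
 * rotating across calls.  Not safe against concurrent callers as
 * written; a real version would keep the rotor per-CPU or under a lock.
 */
static int autonuma_next_candidate_node(int this_nid)
{
	static int rotor;
	int nid;

	nid = next_node(rotor, node_online_map);
	if (nid >= MAX_NUMNODES)
		nid = first_node(node_online_map);
	rotor = nid;

	return nid == this_nid ? NUMA_NO_NODE : nid;
}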

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 13/35] autonuma: add page structure fields
  2012-05-29 13:16     ` Peter Zijlstra
@ 2012-05-29 13:56       ` Rik van Riel
  -1 siblings, 0 replies; 236+ messages in thread
From: Rik van Riel @ 2012-05-29 13:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, Hillf Danton,
	Dan Smith, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter

On 05/29/2012 09:16 AM, Peter Zijlstra wrote:
> On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:

> 24 bytes per page.. or ~0.6% of memory gone. This is far too great a
> price to pay.
>
> At LSF/MM Rik already suggested you limit the number of pages that can
> be migrated concurrently and use this to move the extra list_head out of
> struct page and into a smaller amount of extra structures, reducing the
> total overhead.

For THP, we should be able to track this NUMA info on a
2MB page granularity.

It is not like we will ever want to break up a large
page into small pages anyway (with different 4kB pages
going to different NUMA nodes), because the THP benefit
is on the same order of magnitude as the NUMA benefit.
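
As a purely arithmetical illustration of the saving (assuming the same
~24 bytes of tracking data per tracked unit): 24 bytes per 4kB page is
24/4096 ~= 0.6% of RAM, while 24 bytes per 2MB huge page is 24/2097152
~= 0.001%, a 512x reduction for memory fully backed by THP.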

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 20/35] autonuma: avoid CFS select_task_rq_fair to return -1
  2012-05-25 17:02   ` Andrea Arcangeli
@ 2012-05-29 14:02     ` Peter Zijlstra
  -1 siblings, 0 replies; 236+ messages in thread
From: Peter Zijlstra @ 2012-05-29 14:02 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
> Fix to avoid -1 retval.
> 
> Includes fixes from Hillf Danton <dhillf@gmail.com>.

This changelog is very much insufficient. It fails to mention why your
solution is the right one or if there's something else wrong with that
code.

> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  kernel/sched/fair.c |    4 ++++
>  1 files changed, 4 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 940e6d1..137119f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2789,6 +2789,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>  		if (new_cpu == -1 || new_cpu == cpu) {
>  			/* Now try balancing at a lower domain level of cpu */
>  			sd = sd->child;
> +			if (new_cpu < 0)
> +				/* Return prev_cpu if find_idlest_cpu failed */
> +				new_cpu = prev_cpu;
>  			continue;
>  		}
>  
> @@ -2807,6 +2810,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>  unlock:
>  	rcu_read_unlock();
>  
> +	BUG_ON(new_cpu < 0);
>  	return new_cpu;
>  }
>  #endif /* CONFIG_SMP */


^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 03/35] xen: document Xen is using an unused bit for the pagetables
  2012-05-26 15:59       ` Andrea Arcangeli
@ 2012-05-29 14:10         ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 236+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-05-29 14:10 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter

On Sat, May 26, 2012 at 05:59:12PM +0200, Andrea Arcangeli wrote:
> Hi Konrad,
> 
> On Fri, May 25, 2012 at 04:26:56PM -0400, Konrad Rzeszutek Wilk wrote:
> > On Fri, May 25, 2012 at 07:02:07PM +0200, Andrea Arcangeli wrote:
> > > Xen has taken over the last reserved bit available for the pagetables
> > > which is set through ioremap, this documents it and makes the code
> > > more readable.
> > 
> > Andrea, my previous respone had a question about this - was wondering
> > if you had a chance to look at that in your busy schedule and provide
> > some advice on how to remove the _PAGE_IOMAP altogether?
> 
> I read your response but I didn't look into the P2M tree and
> xen_val_pte code yet, sorry. Thanks for looking into this, if it's
> possible to remove it without downsides, it would be a nice

Yeah, I am not really thrilled about it.

> cleanup. It's not urgent though, we're not running out of reserved
> pte bits yet :).

Oh, your git comment says "the last reserved bit". Let me
look through all your patches to see how the AutoNUMA code works -
I am probably just missing something simple.

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 13/35] autonuma: add page structure fields
  2012-05-29 13:56       ` Rik van Riel
@ 2012-05-29 14:54         ` Peter Zijlstra
  -1 siblings, 0 replies; 236+ messages in thread
From: Peter Zijlstra @ 2012-05-29 14:54 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, Hillf Danton,
	Dan Smith, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter

On Tue, 2012-05-29 at 09:56 -0400, Rik van Riel wrote:
> On 05/29/2012 09:16 AM, Peter Zijlstra wrote:
> > On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
> 
> > 24 bytes per page.. or ~0.6% of memory gone. This is far too great a
> > price to pay.
> >
> > At LSF/MM Rik already suggested you limit the number of pages that can
> > be migrated concurrently and use this to move the extra list_head out of
> > struct page and into a smaller amount of extra structures, reducing the
> > total overhead.
> 
> For THP, we should be able to track this NUMA info on a
> 2MB page granularity.

Yeah, but that's another x86-only feature, _IF_ we're going to do this
it must be done for all archs that have CONFIG_NUMA, thus we're stuck
with 4k (or other base page size).

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: AutoNUMA alpha14
  2012-05-29 13:36   ` Kirill A. Shutemov
@ 2012-05-29 15:43     ` Petr Holasek
  -1 siblings, 0 replies; 236+ messages in thread
From: Petr Holasek @ 2012-05-29 15:43 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, Hillf Danton,
	Dan Smith, Peter Zijlstra, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

On Tue, 29 May 2012, Kirill A. Shutemov wrote:

> <4>[  732.253488] Call Trace:
> <4>[  732.377354]  [<ffffffff8107df90>] ? __init_waitqueue_head+0x60/0x60
> <4>[  732.501250]  [<ffffffff8107e075>] ? finish_wait+0x45/0x90
> <4>[  732.623816]  [<ffffffff8117b1e0>] ? __autonuma_migrate_page_remove+0x130/0x130
> <4>[  732.748194]  [<ffffffff8107d437>] kthread+0xb7/0xc0
> <4>[  732.872468]  [<ffffffff81668324>] kernel_thread_helper+0x4/0x10
> <4>[  732.997588]  [<ffffffff8107d380>] ? __init_kthread_worker+0x70/0x70
> <4>[  733.120411]  [<ffffffff81668320>] ? gs_change+0x13/0x13
> <4>[  733.240230] Code: 4e 00 48 8b 05 6d 05 b8 00 a8 04 0f 84 b5 f9 ff ff 48 c7 c7 b0 c9 9e 81 31 c0 e8 04 6c 4d 00 e9 a2 f9 ff ff 66 90 e8 8a 87 4d 00 <0f> 0b 48 c7 c7 d0 c9 9e 81 31 c0 e8 e8 6b 4d 00 e9 73 f9 ff ff 
> <1>[  733.489612] RIP  [<ffffffff8117baf5>] knuma_migrated+0x915/0xa50
> <4>[  733.614281]  RSP <ffff88026c8b7d40>
> <4>[  733.736855] ---[ end trace 25052e4d75b2f1f6 ]---
> 

Similar problem with __autonuma_migrate_page_remove here. 

[ 1945.516632] ------------[ cut here ]------------
[ 1945.516636] WARNING: at lib/list_debug.c:50 __list_del_entry+0x63/0xd0()
[ 1945.516642] Hardware name: ProLiant DL585 G5   
[ 1945.516651] list_del corruption, ffff88017d68b068->next is LIST_POISON1 (dead000000100100)
[ 1945.516682] Modules linked in: ipt_MASQUERADE nf_conntrack_netbios_ns nf_conntrack_broadcast ip6table_mangle lockd ip6t_REJECT sunrpc nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables iptable_nat nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack mperf freq_table kvm_amd kvm pcspkr amd64_edac_mod edac_core serio_raw bnx2 microcode edac_mce_amd shpchp k10temp hpilo ipmi_si ipmi_msghandler hpwdt qla2xxx hpsa ata_generic pata_acpi scsi_transport_fc scsi_tgt cciss pata_amd radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core [last unloaded: scsi_wait_scan]
[ 1945.516694] Pid: 150, comm: knuma_migrated0 Tainted: G        W    3.4.0aa_alpha+ #3
[ 1945.516701] Call Trace:
[ 1945.516710]  [<ffffffff8105788f>] warn_slowpath_common+0x7f/0xc0
[ 1945.516717]  [<ffffffff81057986>] warn_slowpath_fmt+0x46/0x50
[ 1945.516726]  [<ffffffff812f9713>] __list_del_entry+0x63/0xd0
[ 1945.516735]  [<ffffffff812f9791>] list_del+0x11/0x40
[ 1945.516743]  [<ffffffff81165b98>] __autonuma_migrate_page_remove+0x48/0x80
[ 1945.516746]  [<ffffffff81165e66>] knuma_migrated+0x296/0x8a0
[ 1945.516749]  [<ffffffff8107a200>] ? wake_up_bit+0x40/0x40
[ 1945.516758]  [<ffffffff81165bd0>] ? __autonuma_migrate_page_remove+0x80/0x80
[ 1945.516766]  [<ffffffff81079cc3>] kthread+0x93/0xa0
[ 1945.516780]  [<ffffffff81626f24>] kernel_thread_helper+0x4/0x10
[ 1945.516791]  [<ffffffff81079c30>] ? flush_kthread_worker+0x80/0x80
[ 1945.516798]  [<ffffffff81626f20>] ? gs_change+0x13/0x13
[ 1945.516800] ---[ end trace 7cab294af87bd79f ]---

I am getting this warning continually during memory intensive operations,
e.g. AutoNUMA benchmarks from Andrea.
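
For readers unfamiliar with the poison values: list_del() sets ->next to
LIST_POISON1, so a warning that ->next is LIST_POISON1 means the same
entry went through list_del() twice.  The usual way to make removal
idempotent (shown only as an illustration of the failure class, with a
hypothetical field name, not as the confirmed cause here) is to re-check
the entry under the queue lock and use list_del_init():

	spin_lock(&pgdat->autonuma_lock);
	if (!list_empty(&page_autonuma->migrate_node))	/* still queued? */
		list_del_init(&page_autonuma->migrate_node);
	spin_unlock(&pgdat->autonuma_lock);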

thanks,
Petr H

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 00/35] AutoNUMA alpha14
  2012-05-26 20:42     ` Linus Torvalds
@ 2012-05-29 15:53       ` Christoph Lameter
  -1 siblings, 0 replies; 236+ messages in thread
From: Christoph Lameter @ 2012-05-29 15:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, Andrea Arcangeli, linux-kernel, linux-mm,
	Hillf Danton, Dan Smith, Peter Zijlstra, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Johannes Weiner, Srivatsa Vaddagiri

On Sat, 26 May 2012, Linus Torvalds wrote:

>
> I'm a *firm* believer that if it cannot be done automatically "well
> enough", the absolute last thing we should ever do is worry about the
> crazy people who think they can tweak it to perfection with complex
> interfaces.
>
> You can't do it, except for trivial loads (often benchmarks), and for
> very specific machines.

NUMA APIs already exist that allow tuning for the NUMA cases by allowing
the application to specify where to get memory from and where to run the
threads of a process. Those require the application to be aware of the
NUMA topology and exploit the capabilities there explicitly. Typically one
would like to reserve processors and memory for a single application that
then does the distribution of the load on its own. NUMA aware applications
like that do not benefit and do not need either of the mechanisms proposed
here.

What these automatic migration schemes (autonuma is really a bad term for
this. These are *migration* schemes where the memory is moved between NUMA
nodes automatically so call it AutoMigration if you like) try to do is to
avoid the tuning bits and automatically distribute generic process loads
in a NUMA aware fashion in order to improve performance. This is no easy
task since the cost of migrating a page is much higher than the
additional latency due to access of memory from a distant node. A huge
number of accesses must occur in order to amortize the migration of a
page. Various companies in decades past have tried to implement
automigration schemes without too much success.
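
To put a rough number on that amortization argument (the figures below
are illustrative placeholders, not measurements from this thread): with
a migration cost in the tens of microseconds and a remote access only a
few tens of nanoseconds slower than a local one, a page must be touched
hundreds of times before the move pays off, and the real overheads
(page table manipulation, TLB flushes, cache effects) push that higher.

#include <stdio.h>

int main(void)
{
	/* All three numbers are assumptions for illustration only. */
	double migrate_cost_ns = 20000.0; /* copy 4kB + unmap/TLB work */
	double local_ns = 100.0;          /* local DRAM access */
	double remote_ns = 160.0;         /* remote-node DRAM access */

	printf("break-even after ~%.0f remote accesses avoided\n",
	       migrate_cost_ns / (remote_ns - local_ns));
	return 0;
}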

I think the proof that we need is that a general mix of applications
actually benefits from an auto migration scheme. I would also like to see
that it does no harm to existing NUMA aware applications.


^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 03/35] xen: document Xen is using an unused bit for the pagetables
  2012-05-29 14:10         ` Konrad Rzeszutek Wilk
@ 2012-05-29 16:01           ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-29 16:01 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter

Hi,

On Tue, May 29, 2012 at 10:10:49AM -0400, Konrad Rzeszutek Wilk wrote:
> Oh, your git comment says "the last reserved bit". Let me
> look through all your patches to see how the AutoNUMA code works -
> I am probably just missing something simple.

Ah, with "the last reserved bit" I didn't mean AutoNUMA is using
it. It just means there is nothing left if somebody in the future
needs it. AutoNUMA happened to need it initially, but I figured out I
was better off not using it. Initially I had to make AUTONUMA=y
mutually exclusive with XEN=y, but that's not the case anymore. So at
this point the patch is only a cleanup; I could drop it too, but I
thought it was cleaner to keep it.

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 21/35] autonuma: teach CFS about autonuma affinity
  2012-05-25 17:02   ` Andrea Arcangeli
@ 2012-05-29 16:05     ` Peter Zijlstra
  -1 siblings, 0 replies; 236+ messages in thread
From: Peter Zijlstra @ 2012-05-29 16:05 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
> The CFS scheduler is still in charge of all scheduling
> decisions. AutoNUMA balancing at times will override those. But
> generally we'll just rely on the CFS scheduler to keep doing its
> thing, but while preferring the autonuma affine nodes when deciding
> to move a process to a different runqueue or when waking it up.
>
> For example the idle balancing, will look into the runqueues of the
> busy CPUs, but it'll search first for a task that wants to run into
> the idle CPU in AutoNUMA terms (task_autonuma_cpu() being true).
> 
> Most of this is encoded in the can_migrate_task becoming AutoNUMA
> aware and running two passes for each balancing pass, the first NUMA
> aware, and the second one relaxed.
> 
> The idle/newidle balancing is always allowed to fallback into
> non-affine AutoNUMA tasks. The load_balancing (which is more a
> fairness than a performance issue) is instead only able to cross over
> the AutoNUMA affinity if the flag controlled by
> /sys/kernel/mm/autonuma/scheduler/load_balance_strict is not set (it
> is set by default).

This is unacceptable, and contradicts your earlier claim that you rely
on the regular load-balancer.

The strict mode needs to go, load-balancing is a best effort and
fairness is important -- so much so to some people that I get complaints
the current thing isn't strong enough.

Your strict mode basically supplants any and all balancing done at node
level and above.

Please use something like: 

  https://lkml.org/lkml/2012/5/19/53

with the sched_setnode() function from:

  https://lkml.org/lkml/2012/5/18/109

Fairness matters because people expect similar throughput or runtimes,
so balancing such that we first ensure equal load on cpus and only then
bother with node placement should be the order.

Furthermore, load-balancing does things like trying to place tasks that
wake each-other closer together, your strict mode completely breaks
that. Instead, if the balancer finds these tasks are related and should
be together that should be a hint the memory needs to come to them, not
the other way around.


^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 00/35] AutoNUMA alpha14
  2012-05-29 15:53       ` Christoph Lameter
@ 2012-05-29 16:08         ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-29 16:08 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linus Torvalds, Rik van Riel, linux-kernel, linux-mm,
	Hillf Danton, Dan Smith, Peter Zijlstra, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Johannes Weiner, Srivatsa Vaddagiri

Hi,

On Tue, May 29, 2012 at 10:53:32AM -0500, Christoph Lameter wrote:
> then does the distribution of the load on its own. NUMA aware applications
> like that do not benefit and do not need either of the mechanisms proposed
> here.

Agreed. Whoever changes their apps to optimize at that low level is
unlikely to want to risk hitting a migrate-on-fault (or AutoNUMA async
migration, for that matter).

> I think the proof that we need is that a general mix of applications
> actually benefits from an auto migration scheme. I would also like to see

Agreed.

> that it does no harm to existing NUMA aware applications.

As far as AutoNUMA is concerned, it will be a total bypass whenever
the mpol isn't MPOL_DEFAULT. So it shouldn't harm. Shared memory is
also bypassed.

It only alters the behavior of MPOL_DEFAULT; any other kind of
mempolicy is unaffected, and all CPU bindings are also unaffected.

If an app has only a few vmas that are MPOL_DEFAULT those few will be
handled by AutoNUMA.
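
As an illustration of the kind of check that implies (a sketch with an
assumed helper name, not a quote from the series): a vma carrying any
explicit policy, or backed by shared memory, would simply be skipped,
so only default-policy, non-shared memory is ever touched by the
automatic placement.

#include <linux/mm.h>
#include <linux/mempolicy.h>

/* Hypothetical helper: true when AutoNUMA should leave the vma alone. */
static bool autonuma_vma_is_bypassed(struct vm_area_struct *vma)
{
	struct mempolicy *pol = vma_policy(vma);

	if (pol && pol->mode != MPOL_DEFAULT)
		return true;		/* explicit mempolicy wins */
	if (vma->vm_flags & VM_SHARED)
		return true;		/* shared memory is bypassed */
	return false;
}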

If people think AutoMigration is a better name I should rename
it. It's up to you. I thought using a "NUMA" suffix would make it
more intuitive that if your hardware isn't NUMA, this won't do
anything at all. Migration as a feature isn't limited to NUMA (see
compaction etc..). Comments welcome.

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 08/35] autonuma: introduce kthread_bind_node()
  2012-05-29 12:49     ` Peter Zijlstra
@ 2012-05-29 16:11       ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-29 16:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

On Tue, May 29, 2012 at 02:49:13PM +0200, Peter Zijlstra wrote:
> On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
> >  /**
> > + * kthread_bind_node - bind a just-created kthread to the CPUs of a node.
> > + * @p: thread created by kthread_create().
> > + * @nid: node (might not be online, must be possible) for @k to run on.
> > + *
> > + * Description: This function is equivalent to set_cpus_allowed(),
> > + * except that @nid doesn't need to be online, and the thread must be
> > + * stopped (i.e., just returned from kthread_create()).
> > + */
> > +void kthread_bind_node(struct task_struct *p, int nid)
> > +{
> > +       /* Must have done schedule() in kthread() before we set_task_cpu */
> > +       if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
> > +               WARN_ON(1);
> > +               return;
> > +       }
> > +
> > +       /* It's safe because the task is inactive. */
> > +       do_set_cpus_allowed(p, cpumask_of_node(nid));
> > +       p->flags |= PF_THREAD_BOUND;
> 
> No, I've said before, this is wrong. You should only ever use
> PF_THREAD_BOUND when its strictly per-cpu. Moving the your numa threads
> to a different node is silly but not fatal in any way.

I changed the semantics of that bitflag: now it means userland isn't
allowed to shoot itself in the foot and mess with whatever CPU
bindings the kernel has set for the kernel thread.

It'd be a clear regression not to set PF_THREAD_BOUND there. It would
be even worse to remove the CPU binding to the node: it'd risk copying
memory with both src and dst on nodes remote from the CPU where
knuma_migrate runs (there aren't just 2-node systems out there).

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 22/35] autonuma: sched_set_autonuma_need_balance
  2012-05-25 17:02   ` Andrea Arcangeli
@ 2012-05-29 16:12     ` Peter Zijlstra
  -1 siblings, 0 replies; 236+ messages in thread
From: Peter Zijlstra @ 2012-05-29 16:12 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
> Invoke autonuma_balance only on the busy CPUs at the same frequency of
> the CFS load balance.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  kernel/sched/fair.c |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 99d1d33..1357938 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4893,6 +4893,9 @@ static void run_rebalance_domains(struct softirq_action *h)
>  
>  	rebalance_domains(this_cpu, idle);
>  
> +	if (!this_rq->idle_balance)
> +		sched_set_autonuma_need_balance();
> +

This just isn't enough... the whole thing needs to move out of
schedule(). The only time schedule() should ever look at another cpu
is if it's idle.

As it stands, load-balance actually takes too much time as it is to
live in a softirq; -rt gets around that by pushing all softirqs into a
thread, and I was thinking of doing some of that for mainline too.



^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 23/35] autonuma: core
  2012-05-25 17:02   ` Andrea Arcangeli
@ 2012-05-29 16:27     ` Peter Zijlstra
  -1 siblings, 0 replies; 236+ messages in thread
From: Peter Zijlstra @ 2012-05-29 16:27 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
> This implements knuma_scand, the numa_hinting faults started by
> knuma_scand, the knuma_migrated that migrates the memory queued by the
> NUMA hinting faults, the statistics gathering code that is done by
> knuma_scand for the mm_autonuma and by the numa hinting page faults
> for the sched_autonuma, and most of the rest of the AutoNUMA core
> logics like the false sharing detection, sysfs and initialization
> routines.
> 
> The AutoNUMA algorithm when knuma_scand is not running is a full
> bypass and it must not alter the runtime of memory management and
> scheduler.
> 
> The whole AutoNUMA logic is a chain reaction as result of the actions
> of the knuma_scand. The various parts of the code can be described
> like different gears (gears as in glxgears).
> 
> knuma_scand is the first gear and it collects the mm_autonuma per-process
> statistics and at the same time it sets the pte/pmd it scans as
> pte_numa and pmd_numa.
> 
> The second gear are the numa hinting page faults. These are triggered
> by the pte_numa/pmd_numa pmd/ptes. They collect the sched_autonuma
> per-thread statistics. They also implement the memory follow CPU logic
> where we track if pages are repeatedly accessed by remote nodes. The
> memory follow CPU logic can decide to migrate pages across different
> NUMA nodes by queuing the pages for migration in the per-node
> knuma_migrated queues.
> 
> The third gear is knuma_migrated. There is one knuma_migrated daemon
> per node. Pages pending for migration are queued in a matrix of
> lists. Each knuma_migrated (in parallel with each other) goes over
> those lists and migrates the pages queued for migration in round robin
> from each incoming node to the node where knuma_migrated is running
> on.
> 
> The fourth gear is the NUMA scheduler balancing code. That computes
> the statistical information collected in mm->mm_autonuma and
> p->sched_autonuma and evaluates the status of all CPUs to decide if
> tasks should be migrated to CPUs in remote nodes. 

IOW:

"knuma_scand 'unmaps' ptes and collects mm stats, this triggers
numa_hinting pagefaults, using these we collect per task stats.

knuma_migrated migrates pages to their destination node. Something
queues pages.

The numa scheduling code uses the gathered stats to place tasks."


That covers just about all you said; now the interesting bits are
still missing:

 - how do you do false sharing;

 - what stats do you gather, and how are they used at each stage;

 - what's your balance goal, and how is that expressed and 
   converged upon.

Also, what I've not seen anywhere are scheduling stats: what if,
despite your hint that a particular process should run on a particular
node, it doesn't and sticks to where it's at (granted, with strict
this can't happen -- but it should)?
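
For illustration only, a rough sketch of the "matrix of lists" shape
the quoted changelog describes (the names here are made up, not the
patch's): each destination node owns one incoming list per source
node, and that node's knuma_migrated drains the lists round-robin so
no single source node starves the others.

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/numa.h>

/* hypothetical per-destination-node queue set */
struct autonuma_migrate_queues {
	spinlock_t lock;
	/* incoming[src_nid]: pages queued from src_nid toward this node */
	struct list_head incoming[MAX_NUMNODES];
};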



^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 30/35] autonuma: reset autonuma page data when pages are freed
  2012-05-25 17:02   ` Andrea Arcangeli
@ 2012-05-29 16:30     ` Peter Zijlstra
  -1 siblings, 0 replies; 236+ messages in thread
From: Peter Zijlstra @ 2012-05-29 16:30 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
> When pages are freed abort any pending migration. If knuma_migrated
> arrives first it will notice because get_page_unless_zero would fail.

But knuma_migrated can run on a different cpu than the one this free
is happening on; ACCESS_ONCE() won't cure that.

What's that ACCESS_ONCE() good for?

Also, you already have an autonuma_ hook right there, so why add more
#ifdeffery?

> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  mm/page_alloc.c |    4 ++++
>  1 files changed, 4 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3d1ee70..1d3163f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -614,6 +614,10 @@ static inline int free_pages_check(struct page *page)
>  		bad_page(page);
>  		return 1;
>  	}
> +	autonuma_migrate_page_remove(page);
> +#ifdef CONFIG_AUTONUMA
> +	ACCESS_ONCE(page->autonuma_last_nid) = -1;
> +#endif
>  	if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
>  		page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
>  	return 0;


^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 13/35] autonuma: add page structure fields
  2012-05-29 13:16     ` Peter Zijlstra
@ 2012-05-29 16:38       ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-29 16:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

On Tue, May 29, 2012 at 03:16:25PM +0200, Peter Zijlstra wrote:
> 24 bytes per page.. or ~0.6% of memory gone. This is far too great a
> price to pay.

I don't think it's too great: memcg uses half of that, and yet nobody
is booting with cgroup_disable=memory even on not-NUMA servers with
less RAM.

> At LSF/MM Rik already suggested you limit the number of pages that can
> be migrated concurrently and use this to move the extra list_head out of
> struct page and into a smaller amount of extra structures, reducing the
> total overhead.

It would reduce the memory overhead, but it'll make the code more
complex and require more locking; plus, allowing for very long
migration LRUs provides an additional means of false-sharing
avoidance. Those are LRUs: even if the last_nid false-sharing logic
passes, the page still has to reach the tail of the list before being
migrated, and if false sharing happens in the meanwhile we'll remove
it from the LRU.

But I'm all for experimenting. It's just not something I had the time
to try yet. I would certainly love to see how it performs with a
reduced max size of the list. I totally agree it's a good idea to try
it out, and I don't exclude that it will work fine, but it's not
obvious it's worth the memory saving.

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 35/35] autonuma: page_autonuma
  2012-05-25 17:02   ` Andrea Arcangeli
@ 2012-05-29 16:44     ` Peter Zijlstra
  -1 siblings, 0 replies; 236+ messages in thread
From: Peter Zijlstra @ 2012-05-29 16:44 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
> Move the AutoNUMA per page information from the "struct page" to a
> separate page_autonuma data structure allocated in the memsection
> (with sparsemem) or in the pgdat (with flatmem).
> 
> This is done to avoid growing the size of the "struct page" and the
> page_autonuma data is only allocated if the kernel has been booted on
> real NUMA hardware (or if noautonuma is passed as parameter to the
> kernel).
> 

Argh, please fold this change back into the series proper. If you want
to keep it... as it is, it's not really an improvement IMO; see below.

> +struct page_autonuma {
> +       /*
> +        * FIXME: move to pgdat section along with the memcg and allocate
> +        * at runtime only in presence of a numa system.
> +        */
> +       /*
> +        * To modify autonuma_last_nid lockless the architecture,
> +        * needs SMP atomic granularity < sizeof(long), not all archs
> +        * have that, notably some alpha. Archs without that requires
> +        * autonuma_last_nid to be a long.
> +        */

Looking at arch/alpha/include/asm/xchg.h it looks to have that just
fine, so maybe we simply don't support SMP on those early Alphas that
had that weirdness.

> +#if BITS_PER_LONG > 32
> +       int autonuma_migrate_nid;
> +       int autonuma_last_nid;
> +#else
> +#if MAX_NUMNODES >= 32768
> +#error "too many nodes"
> +#endif
> +       /* FIXME: remember to check the updates are atomic */
> +       short autonuma_migrate_nid;
> +       short autonuma_last_nid;
> +#endif
> +       struct list_head autonuma_migrate_node;
> +
> +       /*
> +        * To find the page starting from the autonuma_migrate_node we
> +        * need a backlink.
> +        */
> +       struct page *page;
> +}; 

This makes a shadow page frame of 32 bytes per page, or ~0.8% of memory.
This isn't in fact an improvement.

The suggestion from Rik was to have something like a sqrt(nr_pages)
(?) scaled array of such things containing the list_head and page
pointer -- and leave the two nids in the regular page frame. Although
I think you've got to fight the memcg people over that last word in
struct page.

That places a limit on the number of pages that can be in migration
concurrently, but also greatly reduces the memory overhead.
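
A rough sketch of the shape being suggested, for illustration only
(the names and the pool size are assumptions, not part of any patch):
keep the two nids in struct page itself and move the list linkage plus
back-pointer into a small fixed pool, so only a bounded number of
pages can be queued for migration at any time.

#include <linux/list.h>
#include <linux/mm_types.h>

struct autonuma_migrate_slot {
	struct list_head node;	/* linked on a per-destination-node queue */
	struct page *page;	/* the page queued for migration */
};

/* e.g. scaled from sqrt(nr_pages) instead of one entry per page */
#define AUTONUMA_MIGRATE_SLOTS	4096
static struct autonuma_migrate_slot migrate_slots[AUTONUMA_MIGRATE_SLOTS];
static LIST_HEAD(free_migrate_slots);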

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 13/35] autonuma: add page structure fields
  2012-05-29 16:38       ` Andrea Arcangeli
@ 2012-05-29 16:46         ` Rik van Riel
  -1 siblings, 0 replies; 236+ messages in thread
From: Rik van Riel @ 2012-05-29 16:46 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Peter Zijlstra, linux-kernel, linux-mm, Hillf Danton, Dan Smith,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

On 05/29/2012 12:38 PM, Andrea Arcangeli wrote:
> On Tue, May 29, 2012 at 03:16:25PM +0200, Peter Zijlstra wrote:
>> 24 bytes per page.. or ~0.6% of memory gone. This is far too great a
>> price to pay.
>
> I don't think it's too great, memcg uses for half of that and yet
> nobody is booting with cgroup_disable=memory even on not-NUMA servers
> with less RAM.

Not any more.

Ever since the memcg naturalization work by Johannes,
a page is only ever on one LRU list and the memcg
memory overhead is gone.

> But I'm all for experimenting. It's just not something I had the time
> to try yet. I will certainly love to see how it performs by reducing
> the max size of the list. I totally agree it's a good idea to try it
> out, and I don't exclude it will work fine, but it's not obvious it's
> worth the memory saving.

That's fair enough.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 30/35] autonuma: reset autonuma page data when pages are freed
  2012-05-29 16:30     ` Peter Zijlstra
@ 2012-05-29 16:49       ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-29 16:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

On Tue, May 29, 2012 at 06:30:29PM +0200, Peter Zijlstra wrote:
> On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
> > When pages are freed abort any pending migration. If knuma_migrated
> > arrives first it will notice because get_page_unless_zero would fail.
> 
> But knuma_migrated can run on a different cpu than this free is
> happening, ACCESS_ONCE() won't cure that.

knuma_migrated won't alter the last_nid, and it generally won't work
on any page that has page_count() == 0.

last_nid is the false-sharing avoidance information (btw, that really
had better exist for every page, unlike the list node, which could
perhaps be limited).

Then there's a second false-sharing avoidance through the implicit
properties of the autonuma_migrate_head LRUs and the migration
cancellation in numa_hinting_fault_memory_follow_cpu (which is why I
wouldn't like the idea of an insert-only list: even if it would save a
pointer per page, I couldn't cancel the migration when false sharing
is detected and knuma_migrated is congested).

> What's that ACCESS_ONCE() good for?

The ACCESS_ONCE was used when setting last_nid, to tell gcc the value
can change from under it. It shouldn't alter the code emitted here,
and it's probably superfluous in any case.

But considering that the page is being freed, I don't think it can
change from under us here, so this was definitely superfluous; numa
hinting page faults can't run on that page. I will remove it, thanks!
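
For illustration, this is the kind of lockless update being talked
about (a sketch with an assumed helper name, not the patch's code):
the numa hinting fault records the last faulting node with no lock
held, and ACCESS_ONCE() only keeps gcc from tearing or re-reading the
plain int store/load.

static inline void autonuma_record_last_nid(struct page *page, int this_nid)
{
	if (ACCESS_ONCE(page->autonuma_last_nid) != this_nid)
		ACCESS_ONCE(page->autonuma_last_nid) = this_nid;
}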

> 
> Also, you already have an autonuma_ hook right there, why add more
> #ifdeffery ?

Agreed, the #ifdef is in fact already cleaned up in page_autonuma,
with autonuma_free_page:

-       autonuma_migrate_page_remove(page);
-#ifdef CONFIG_AUTONUMA
-       ACCESS_ONCE(page->autonuma_last_nid) = -1;
-#endif
+       autonuma_free_page(page);

> 
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > ---
> >  mm/page_alloc.c |    4 ++++
> >  1 files changed, 4 insertions(+), 0 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 3d1ee70..1d3163f 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -614,6 +614,10 @@ static inline int free_pages_check(struct page *page)
> >  		bad_page(page);
> >  		return 1;
> >  	}
> > +	autonuma_migrate_page_remove(page);
> > +#ifdef CONFIG_AUTONUMA
> > +	ACCESS_ONCE(page->autonuma_last_nid) = -1;
> > +#endif
> >  	if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
> >  		page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
> >  	return 0;
> 

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 13/35] autonuma: add page structure fields
  2012-05-29 16:46         ` Rik van Riel
@ 2012-05-29 16:56           ` Peter Zijlstra
  -1 siblings, 0 replies; 236+ messages in thread
From: Peter Zijlstra @ 2012-05-29 16:56 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, Hillf Danton,
	Dan Smith, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter

On Tue, 2012-05-29 at 12:46 -0400, Rik van Riel wrote:
> > I don't think it's too great, memcg uses for half of that and yet
> > nobody is booting with cgroup_disable=memory even on not-NUMA servers
> > with less RAM.

Right, it was such a hit we had to disable that by default on RHEL6.

> Not any more. 

Right, hnaz did great work there, but weren't there still a few pieces
of the shadow page frame left? ISTR LSF/MM talk of moving the last few
bits into the regular page frame, taking the word that became
available through fc9bb8c768 ("mm: Rearrange struct page").



^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 08/35] autonuma: introduce kthread_bind_node()
  2012-05-29 16:11       ` Andrea Arcangeli
@ 2012-05-29 17:04         ` Peter Zijlstra
  -1 siblings, 0 replies; 236+ messages in thread
From: Peter Zijlstra @ 2012-05-29 17:04 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

On Tue, 2012-05-29 at 18:11 +0200, Andrea Arcangeli wrote:
> On Tue, May 29, 2012 at 02:49:13PM +0200, Peter Zijlstra wrote:
> > On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
> > >  /**
> > > + * kthread_bind_node - bind a just-created kthread to the CPUs of a node.
> > > + * @p: thread created by kthread_create().
> > > + * @nid: node (might not be online, must be possible) for @k to run on.
> > > + *
> > > + * Description: This function is equivalent to set_cpus_allowed(),
> > > + * except that @nid doesn't need to be online, and the thread must be
> > > + * stopped (i.e., just returned from kthread_create()).
> > > + */
> > > +void kthread_bind_node(struct task_struct *p, int nid)
> > > +{
> > > +       /* Must have done schedule() in kthread() before we set_task_cpu */
> > > +       if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
> > > +               WARN_ON(1);
> > > +               return;
> > > +       }
> > > +
> > > +       /* It's safe because the task is inactive. */
> > > +       do_set_cpus_allowed(p, cpumask_of_node(nid));
> > > +       p->flags |= PF_THREAD_BOUND;
> > 
> > No, I've said before, this is wrong. You should only ever use
> > PF_THREAD_BOUND when its strictly per-cpu. Moving the your numa threads
> > to a different node is silly but not fatal in any way.
> 
> I changed the semantics of that bitflag, now it means: userland isn't
> allowed to shoot itself in the foot and mess with whatever CPU
> bindings the kernel has set for the kernel thread.

Yeah, and you did so without mentioning that in your changelog.
Furthermore I object to that change. I object even more strongly to
doing it without mention and keeping a misleading comment near the
definition.

> It'd be a clear regress not to set PF_THREAD_BOUND there. It would be
> even worse to remove the CPU binding to the node: it'd risk to copy
> memory with both src and dst being in remote nodes from the CPU where
> knuma_migrate runs on (there aren't just 2 node systems out there).

Just teach each knuma_migrated what node it represents and don't use
numa_node_id().

That way you can change the affinity just fine; it'll be sub-optimal,
copying memory from node x to node y through node z, but it'll still
work correctly.

numa isn't special in the way per-cpu stuff is special.
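
A sketch of what's being suggested, for illustration (the worker name
and loop body are assumptions, not actual code): pass the node id to
each knuma_migrated at creation time rather than deriving it from
numa_node_id(), so a changed CPU affinity costs only performance,
never correctness.

#include <linux/kthread.h>
#include <linux/sched.h>

static int knuma_migrated_fn(void *data)	/* hypothetical worker */
{
	int dst_nid = (long)data;	/* the node this daemon serves */

	while (!kthread_should_stop()) {
		/* migrate queued pages toward dst_nid, wherever we run */
		schedule_timeout_interruptible(HZ);
	}
	return 0;
}

/* at creation time:
 *	kthread_create(knuma_migrated_fn, (void *)(long)nid,
 *		       "knuma_migrated%d", nid);
 */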

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 35/35] autonuma: page_autonuma
  2012-05-29 16:44     ` Peter Zijlstra
@ 2012-05-29 17:14       ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-29 17:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

On Tue, May 29, 2012 at 06:44:15PM +0200, Peter Zijlstra wrote:
> On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
> > Move the AutoNUMA per page information from the "struct page" to a
> > separate page_autonuma data structure allocated in the memsection
> > (with sparsemem) or in the pgdat (with flatmem).
> > 
> > This is done to avoid growing the size of the "struct page" and the
> > page_autonuma data is only allocated if the kernel has been booted on
> > real NUMA hardware (or if noautonuma is passed as parameter to the
> > kernel).
> > 
> 
> Argh, please fold this change back into the series proper. If you want
> to keep it.. as it is its not really an improvement IMO, see below.

The whole objective of this patch is to avoid allocating the
page_autonuma structures when the kernel is booted on not-NUMA
hardware.

It's not an improvement when booting the kernel on NUMA hardware,
that's for sure.

I didn't merge it with the previous patch because this was the most
experimental recent change, so I wanted bisectability here. When
something goes wrong here, the kernel won't boot, so unless you use
kvm with gdbstub it's a little tricky to debug (indeed I debugged it
with gdbstub; there it's trivial).

> > +struct page_autonuma {
> > +       /*
> > +        * FIXME: move to pgdat section along with the memcg and allocate
> > +        * at runtime only in presence of a numa system.
> > +        */
> > +       /*
> > +        * To modify autonuma_last_nid lockless the architecture,
> > +        * needs SMP atomic granularity < sizeof(long), not all archs
> > +        * have that, notably some alpha. Archs without that requires
> > +        * autonuma_last_nid to be a long.
> > +        */
> 
> Looking at arch/alpha/include/asm/xchg.h it looks to have that just
> fine, so maybe we simply don't support SMP on those early Alphas that
> had that weirdness.

I agree we should never risk that.

> This makes a shadow page frame of 32 bytes per page, or ~0.8% of memory.
> This isn't in fact an improvement.
> 
> The suggestion done by Rik was to have something like a sqrt(nr_pages)
> (?) scaled array of such things containing the list_head and page
> pointer -- and leave the two nids in the regular page frame. Although I
> think you've got to fight the memcg people over that last word in struct
> page.
> 
> That places a limit on the amount of pages that can be in migration
> concurrently, but also greatly reduces the memory overhead.

Yes, however for the last_nid I'd still need it for every page (and if
I allocate it dynamically I still first need to find a way to remove
the struct page pointer).

I thought of adding a pointer in the memsection (or maybe using a
vmemmap so that it won't even require a pointer in every memsection).
I have to check a few more things before I allow the autonuma->page
translation without a page pointer, notably to verify that the
boot-time allocation points won't just allocate power-of-two blocks of
memory (they shouldn't, but I didn't verify).

This is clearly a move in the right direction to avoid the memory
overhead when not booted on NUMA hardware, and I don't think there's
anything fundamental that prevents us from removing the page pointer
from the page_autonuma structure and later experimenting with a
limited-size array of async migration structures.
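
For illustration, one possible shape of the memsection-based lookup
mentioned above, loosely modeled on memcg's lookup_page_cgroup (the
section_page_autonuma field and the helper name are assumptions, not
the actual implementation):

#include <linux/mm.h>
#include <linux/mmzone.h>

struct page_autonuma *lookup_page_autonuma(struct page *page)
{
	unsigned long pfn = page_to_pfn(page);
	struct mem_section *section = __pfn_to_section(pfn);

	/* assumed per-section base, pre-biased so it can be indexed by pfn */
	return section->section_page_autonuma + pfn;
}

/* the reverse page_autonuma -> page translation could use the same
 * offset arithmetic, which is what would let the back-pointer go away */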

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 00/35] AutoNUMA alpha14
  2012-05-29 13:36   ` Kirill A. Shutemov
@ 2012-05-29 17:15     ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-29 17:15 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter

Hi Kirill,

The anon page was munmapped just after get_page_unless_zero obtained a
refcount in knuma_migrated. This can happen for example if a big
process exits while knuma_migrated starts to migrate the page. In that
case split_huge_page would do nothing, but when it does nothing it
notifies the caller by returning 1. When it returns 1, we just need to
put_page and bail out (the page isn't split in that case and it's
pointless to try to migrate a freed page).

I also made the code more strict now, to be sure the reason for the
bug wasn't a hugepage in the LRU that wasn't Anon; such a thing must
not exist, but this will verify it just in case.

I'll push it to the origin/autonuma branch of aa.git shortly
(rebased), could you try if it helps?

diff --git a/mm/autonuma.c b/mm/autonuma.c
index 3d4c2a7..c2a5a82 100644
--- a/mm/autonuma.c
+++ b/mm/autonuma.c
@@ -840,9 +840,17 @@ static int isolate_migratepages(struct list_head *migratepages,
 
 		VM_BUG_ON(nid != page_to_nid(page));
 
-		if (PageAnon(page) && PageTransHuge(page))
+		if (PageTransHuge(page)) {
+			VM_BUG_ON(!PageAnon(page));
 			/* FIXME: remove split_huge_page */
-			split_huge_page(page);
+			if (unlikely(split_huge_page(page))) {
+				autonuma_printk("autonuma migrate THP free\n");
+				__autonuma_migrate_page_remove(page,
+							       page_autonuma);
+				put_page(page);
+				continue;
+			}
+		}
 
 		__autonuma_migrate_page_remove(page, page_autonuma);
 

Thanks a lot,
Andrea

BTW, it's interesting that knuma_migrated0 runs on CPU24; just in case,
you may also want to verify that it's correct with numactl --hardware
(in my case the highest cpuid in node0 is 17). It's not related to the
above fix, which is needed anyway.

^ permalink raw reply related	[flat|nested] 236+ messages in thread

* Re: [PATCH 22/35] autonuma: sched_set_autonuma_need_balance
  2012-05-29 16:12     ` Peter Zijlstra
@ 2012-05-29 17:33       ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-29 17:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

On Tue, May 29, 2012 at 06:12:22PM +0200, Peter Zijlstra wrote:
> On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
> > Invoke autonuma_balance only on the busy CPUs at the same frequency of
> > the CFS load balance.
> > 
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > ---
> >  kernel/sched/fair.c |    3 +++
> >  1 files changed, 3 insertions(+), 0 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 99d1d33..1357938 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4893,6 +4893,9 @@ static void run_rebalance_domains(struct softirq_action *h)
> >  
> >  	rebalance_domains(this_cpu, idle);
> >  
> > +	if (!this_rq->idle_balance)
> > +		sched_set_autonuma_need_balance();
> > +
> 
> This just isn't enough.. the whole thing needs to move out of
> schedule(). The only time schedule() should ever look at another cpu is
> if its idle.
> 
> As it stands load-balance actually takes too much time as it is to live
> in a softirq, -rt gets around that by pushing all softirqs into a thread
> and I was thinking of doing some of that for mainline too.

No worries, I didn't mean to leave it like this forever. I was
considering using the stop_cpu _nowait variant but I didn't have
enough time to work out whether it would fit my case. I need to think
about that again.

I was considering which thread to use for it, or whether to use the
stop_cpu _nowait variant that active balancing uses, but it wasn't so
easy to change, and since from a practical standpoint it already flies,
I released it. It's already an improvement: the previous approach was
mostly a debug approach, to check whether autonuma_balance would flood
the debug log by failing to converge.

autonuma_balance isn't fundamentally different from load_balance: they
both look around at the other runqueues to see if some task should be
moved.

If you move the load_balance to a kernel thread, I could move
autonuma_balance there too.

I just wasn't sure whether invoking a schedule() to actually call
autonuma_balance() made any sense, so I thought running it from the
softirq too with the non-blocking _nowait variant (or keeping it in
schedule() to be able to call stop_one_cpu without _nowait) would have
been more efficient.

The moment I gave up on the _nowait variant before releasing was when
I couldn't understand what tlb_migrate_finish is doing, and why it's
not present in the _nowait version in fair.c. Can you explain that to
me?

Obviously it's only used by ia64, so I could just as well ignore it,
but it was still an additional annoyance that made me think I needed a
bit more time to think about it.

I'm glad you acknowledge load_balance already takes the bulk of the
time, as it needs to find the busiest runqueue by checking the other
CPUs' runqueues too... With autonuma14 there's no measurable difference
in hackbench anymore between autonuma=y, the noautonuma boot parameter,
and upstream without autonuma applied (not just autonuma=n). So the
cost on a 24-way SMP is zero.

Then I also measured it with lockdep and all lock/mutex
debugging/stats enabled: there's a slightly measurable slowdown in
hackbench that may not be a measurement error, but it's barely
noticeable, and I expect that removing load_balance from the softirq
would gain more than removing autonuma_balance (it goes from 70 to 80
sec on average IIRC, but the error is about 10 sec, just the average
seems slightly higher). With lockdep and all the other debugging
disabled it takes a fixed 6 sec for all configs and the difference is
definitely not measurable (tested with both threads and processes, not
that it makes any difference for this).

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 13/35] autonuma: add page structure fields
  2012-05-29 16:38       ` Andrea Arcangeli
@ 2012-05-29 17:38         ` Linus Torvalds
  -1 siblings, 0 replies; 236+ messages in thread
From: Linus Torvalds @ 2012-05-29 17:38 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Peter Zijlstra, linux-kernel, linux-mm, Hillf Danton, Dan Smith,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

On Tue, May 29, 2012 at 9:38 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> On Tue, May 29, 2012 at 03:16:25PM +0200, Peter Zijlstra wrote:
>> 24 bytes per page.. or ~0.6% of memory gone. This is far too great a
>> price to pay.
>
> I don't think it's too great, memcg uses for half of that and yet
> nobody is booting with cgroup_disable=memory even on not-NUMA servers
> with less RAM.

A big fraction of one percent is absolutely unacceptable.

Our "struct page" is one of our biggest memory users, there's no way
we should cavalierly make it even bigger.

It's also a huge performance sink, the cache miss on struct page tends
to be one of the biggest problems in managing memory. We may not ever
fix that, but making struct page bigger certainly isn't going to help
the bad cache behavior.

                    Linus

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 22/35] autonuma: sched_set_autonuma_need_balance
  2012-05-29 17:33       ` Andrea Arcangeli
@ 2012-05-29 17:43         ` Peter Zijlstra
  -1 siblings, 0 replies; 236+ messages in thread
From: Peter Zijlstra @ 2012-05-29 17:43 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

On Tue, 2012-05-29 at 19:33 +0200, Andrea Arcangeli wrote:
> So the cost on a 24-way SMP 

is irrelevant.. also, not every cpu gets to the 24 cpu domain, just 2
do.

When you do for_each_cpu() think at least 4096, if you do
for_each_node() think at least 256.

Add to that the knowledge that doing 4096 remote memory accesses will
cost multiple jiffies, then realize you're wanting to do that with
preemption disabled.
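
(Rough arithmetic for scale, not numbers from this thread: at a few
hundred nanoseconds per remote cache miss and a few cache lines touched
per remote runqueue, 4096 CPUs works out to a couple of milliseconds
per pass, i.e. multiple jiffies at HZ=1000.)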

That's just a very big no go.

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 08/35] autonuma: introduce kthread_bind_node()
  2012-05-29 17:04         ` Peter Zijlstra
@ 2012-05-29 17:44           ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-29 17:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

On Tue, May 29, 2012 at 07:04:51PM +0200, Peter Zijlstra wrote:
> doing it without mention and keeping a misleading comment near the
> definition.

Right, I forgot to update the comment, I fixed it now.

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 60a699c..0b84494 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1788,7 +1788,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
 #define PF_SWAPWRITE	0x00800000	/* Allowed to write to swap */
 #define PF_SPREAD_PAGE	0x01000000	/* Spread page cache over cpuset */
 #define PF_SPREAD_SLAB	0x02000000	/* Spread some slab caches over cpuset */
-#define PF_THREAD_BOUND	0x04000000	/* Thread bound to specific cpu */
+#define PF_THREAD_BOUND	0x04000000	/* Thread bound to specific cpus */
 #define PF_MCE_EARLY    0x08000000      /* Early kill for mce process policy */
 #define PF_MEMPOLICY	0x10000000	/* Non-default NUMA mempolicy */
 #define PF_MUTEX_TESTER	0x20000000	/* Thread belongs to the rt mutex tester */
 

> Just teach each knuma_migrated what node it represents and don't use
> numa_node_id().

It already works like that: I never use numa_node_id(), I always use
the pgdat passed as a parameter to the kernel thread through its
pointer argument.

But it'd be totally bad not to do the hard bindings to the cpu_s_ of
the node, and not using PF_THREAD_BOUND would just allow userland to
shoot itself in the foot. I mean, if PF_THREAD_BOUND didn't exist
already I wouldn't add it, but considering somebody bothered to
implement it for the sake of making the userland root user "safer",
it'd be really silly not to take advantage of it for knuma_migrated too
(even if it binds to more than 1 CPU).
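
For reference, a minimal sketch of what binding a per-node kernel
thread to its node's CPUs can look like, assuming it mirrors
kthread_bind() (illustrative only, not necessarily the literal
kthread_bind_node() from the patch):

static void bind_kthread_to_node(struct task_struct *p, int nid)
{
	/* must have done schedule() in kthread() before touching affinity */
	if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
		WARN_ON(1);
		return;
	}
	/* safe because the task is inactive: run only on this node's CPUs */
	do_set_cpus_allowed(p, cpumask_of_node(nid));
	/* refuse a later sched_setaffinity() from userland */
	p->flags |= PF_THREAD_BOUND;
}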

Additionally I added a bugcheck in the main knuma_migrated loop:

		VM_BUG_ON(numa_node_id() != pgdat->node_id);

to be sure it never goes wrong. The above bugcheck is what allowed me
to find a bug in the NUMA emulation, fixed in commit
d71b5a73fe9af42752c4329b087f7911b35f8f79.

> That way you can change the affinity just fine, it'll be sub-optimal,
> copying memory from node x to node y through node z, but it'll still
> work correctly.

I don't think allowing userland to do suboptimal things (even if it
will only decrease performance and still work correctly) makes
sense (considering somebody added PF_THREAD_BOUND already and it's
zero cost to use).

> numa isn't special in the way per-cpu stuff is special.

Agreed that it won't be as bad as getting per-cpu stuff wrong; it's
only a 50% slowdown in the worst case, but it's a guaranteed regression
in the best case too, so there's no reason to allow root to shoot
itself in the foot.

^ permalink raw reply related	[flat|nested] 236+ messages in thread

* Re: [PATCH 08/35] autonuma: introduce kthread_bind_node()
  2012-05-29 17:44           ` Andrea Arcangeli
@ 2012-05-29 17:48             ` Peter Zijlstra
  -1 siblings, 0 replies; 236+ messages in thread
From: Peter Zijlstra @ 2012-05-29 17:48 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

On Tue, 2012-05-29 at 19:44 +0200, Andrea Arcangeli wrote:
> 
> But it'd be totally bad not to do the hard bindings to the cpu_s_ of
> the node, and not using PF_THREAD_BOUND would just allow userland to
> shoot itself in the foot. I mean if PF_THREAD_BOUND wouldn't exist
> already I wouldn't add it, but considering somebody bothered to
> implement it for the sake to make userland root user "safer", it'd be
> really silly not to take advantage of that for knuma_migrated too
> (even if it binds to more than 1 CPU). 

No, I'm absolutely ok with the user shooting himself in the foot. The
thing exists because you can crash stuff if you get it wrong with
per-cpu.

Crashing is not good, worse performance is his own damn fault.

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 13/35] autonuma: add page structure fields
  2012-05-29 17:38         ` Linus Torvalds
@ 2012-05-29 18:09           ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-29 18:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, linux-kernel, linux-mm, Hillf Danton, Dan Smith,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

On Tue, May 29, 2012 at 10:38:34AM -0700, Linus Torvalds wrote:
> A big fraction of one percent is absolutely unacceptable.
> 
> Our "struct page" is one of our biggest memory users, there's no way
> we should cavalierly make it even bigger.
> 
> It's also a huge performance sink, the cache miss on struct page tends
> to be one of the biggest problems in managing memory. We may not ever
> fix that, but making struct page bigger certainly isn't going to help
> the bad cache behavior.

The cache effects on the VM fast paths shouldn't be altered, and no
additional per-page memory is allocated when booting the same bzImage
on non-NUMA hardware.

But when booted on NUMA hardware it now takes 8 bytes more than
before: 32 bytes are allocated for every page (with autonuma13 it was
only 24 bytes). The struct page itself isn't modified.

I want to remove the page pointer from the page_autonuma structure, to
bring the overhead back to 0.58% instead of the current 0.78% (like it
was in autonuma alpha13, before the page_autonuma introduction). That
shouldn't be difficult and it's the next step.

Those changes aren't visible to anything but the *autonuma.* files,
and the cache misses from accessing the page_autonuma structure
shouldn't be measurable (the only fast-path access is from
autonuma_free_page). Even if we find a way to shrink it below 0.58%,
it won't be intrusive to the rest of the kernel.

memcg takes 0.39% on every system built with
CONFIG_CGROUP_MEM_RES_CTLR=y unless the kernel is booted with
cgroup_disable=memory (and nobody does).
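
(Back-of-the-envelope, assuming the usual 4096-byte page size, the
percentages quoted in this thread line up with the per-page sizes:
32/4096 ~= 0.78%, 24/4096 ~= 0.58%, and the 0.39% figure for memcg
corresponds to about 16 bytes per page.)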

I'll do my best to shrink it further; as mentioned, I'm very willing
to experiment with a fixed-size array sized as a function of the RAM
per node, to reduce the overhead (Michel and Rik suggested that at the
MM summit too). Maybe it'll just work fine even if the max size of the
LRU is reduced by a factor of 10. In the worst case I personally
believe lots of people would be ok with paying 0.58%, considering
they're paying 0.39% even on much smaller non-NUMA systems to boot with
memcg. And I'm sure I can reduce it at least to 0.58% without any
downside.

It's a lot of work to reduce it below 0.58%, so before doing that I
believe it's fair enough to do enough performance measurements and
reviews to be sure the design flies.

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 08/35] autonuma: introduce kthread_bind_node()
  2012-05-29 17:48             ` Peter Zijlstra
@ 2012-05-29 18:15               ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-29 18:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

On Tue, May 29, 2012 at 07:48:06PM +0200, Peter Zijlstra wrote:
> On Tue, 2012-05-29 at 19:44 +0200, Andrea Arcangeli wrote:
> > 
> > But it'd be totally bad not to do the hard bindings to the cpu_s_ of
> > the node, and not using PF_THREAD_BOUND would just allow userland to
> > shoot itself in the foot. I mean if PF_THREAD_BOUND wouldn't exist
> > already I wouldn't add it, but considering somebody bothered to
> > implement it for the sake to make userland root user "safer", it'd be
> > really silly not to take advantage of that for knuma_migrated too
> > (even if it binds to more than 1 CPU). 
> 
> No, I'm absolutely ok with the user shooting himself in the foot. The
> thing exists because you can crash stuff if you get it wrong with
> per-cpu.
> 
> Crashing is not good, worse performance is his own damn fault.

Some people don't like letting root write to /dev/mem or run rm -r /
either. I'm not in that camp, but if you're not in that camp either,
then you should _never_ care about setting PF_THREAD_BOUND, no matter
whether it's about crashing or just slowing down the kernel.

If such a thing exists anyway, well, using it to keep the user from
either crashing the system or screwing up its performance can only be
a bonus.

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 22/35] autonuma: sched_set_autonuma_need_balance
  2012-05-29 17:43         ` Peter Zijlstra
@ 2012-05-29 18:24           ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-29 18:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

On Tue, May 29, 2012 at 07:43:27PM +0200, Peter Zijlstra wrote:
> On Tue, 2012-05-29 at 19:33 +0200, Andrea Arcangeli wrote:
> > So the cost on a 24-way SMP 
> 
> is irrelevant.. also, not every cpu gets to the 24 cpu domain, just 2
> do.
> 
> When you do for_each_cpu() think at least 4096, if you do
> for_each_node() think at least 256.
> 
> Add to that the knowledge that doing 4096 remote memory accesses will
> cost multiple jiffies, then realize you're wanting to do that with
> preemption disabled.
> 
> That's just a very big no go.

I'm thinking of 4096/256; this is why I mentioned it's a 24-way
system. I think the hackbench run should be repeated on a much bigger
system to see what happens; I'm not saying it'll already work fine
there.

But from autonuma13 to 14 it's a world of difference in hackbench
terms, to the point the cost is zero on a 24-way.

My idea down the road, for multi-hop systems, is to balance across the
1-hop nodes at the regular load_balance interval, move to the 2-hop
nodes at half that frequency, the 3-hop nodes at 1/4th of the
frequency, etc. That change alone should help tremendously with 256
nodes and 5/6 hops. And it should be quite easy to implement too.
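
A minimal sketch of that hop-scaled frequency, purely illustrative
(node_hops() is a hypothetical helper and the pass counter is assumed
to tick once per load-balance interval):

static bool autonuma_balance_this_pass(int this_node, int other_node,
				       unsigned long nr_passes)
{
	int hops = node_hops(this_node, other_node);	/* 1, 2, 3, ... */

	/* 1 hop: every pass, 2 hops: every 2nd, 3 hops: every 4th, ... */
	return (nr_passes & ((1UL << (hops - 1)) - 1)) == 0;
}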

knuma_migrated also needs to learn more about the hops, and should
probably scan the LRU heads coming from the closer hops at a higher
frequency.

The code is not "hops" aware yet, and certainly there are still lots
of optimizations to do for the very big systems. I think it's already
quite good right now for most servers, and I don't see blockers to
optimizing it for the extremely big cases (and I expect it'd already
work better than nothing in those setups). I removed [RFC] because I'm
quite happy with it now (there were things I wasn't happy with before),
but I didn't mean it's finished.

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 13/35] autonuma: add page structure fields
  2012-05-29 16:56           ` Peter Zijlstra
@ 2012-05-29 18:35             ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-29 18:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, linux-kernel, linux-mm, Hillf Danton, Dan Smith,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

On Tue, May 29, 2012 at 06:56:53PM +0200, Peter Zijlstra wrote:
> On Tue, 2012-05-29 at 12:46 -0400, Rik van Riel wrote:
> > > I don't think it's too great, memcg uses for half of that and yet
> > > nobody is booting with cgroup_disable=memory even on not-NUMA servers
> > > with less RAM.
> 
> Right, it was such a hit we had to disable that by default on RHEL6.

CONFIG_CGROUP_MEM_RES_CTLR is =y; do you mean cgroup_disable=memory is
set by default in grub? I didn't notice that.

If a certain number of users are passing cgroup_disable=memory at boot
because they don't need the feature, well, that's perfectly reasonable
and the way it should be. That's why such an option exists, and it's
also why I provided the noautonuma parameter.

> Right, hnaz did great work there, but wasn't there still some few pieces
> of the shadow page frame left? ISTR LSF/MM talk of moving the last few
> bits into the regular page frame, taking the word that became available
> through: fc9bb8c768 ("mm: Rearrange struct page").

The memcg diet topic has been around for a long time; they started
working on it more than a year ago. I'm currently referring to current
upstream (maybe a week old).

But this is normal: first you focus on the algorithm, then you worry
about how to optimize the implementation to reduce the memory usage
without altering the runtime (well, without altering it too much at
least...).

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 13/35] autonuma: add page structure fields
  2012-05-29 17:38         ` Linus Torvalds
@ 2012-05-29 20:42           ` Rik van Riel
  -1 siblings, 0 replies; 236+ messages in thread
From: Rik van Riel @ 2012-05-29 20:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, Peter Zijlstra, linux-kernel, linux-mm,
	Hillf Danton, Dan Smith, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter

On 05/29/2012 01:38 PM, Linus Torvalds wrote:
> On Tue, May 29, 2012 at 9:38 AM, Andrea Arcangeli<aarcange@redhat.com>  wrote:
>> On Tue, May 29, 2012 at 03:16:25PM +0200, Peter Zijlstra wrote:
>>> 24 bytes per page.. or ~0.6% of memory gone. This is far too great a
>>> price to pay.
>>
>> I don't think it's too great, memcg uses for half of that and yet
>> nobody is booting with cgroup_disable=memory even on not-NUMA servers
>> with less RAM.
>
> A big fraction of one percent is absolutely unacceptable.

Andrea, here is a quick back of the envelope idea.

In every zone, we keep an array of pointers to pages and
other needed info for knumad.  We do not need as many as
we have pages in a zone, because we do not want to move
all that memory across anyway (especially in larger systems).
Maybe the number of entries can scale up with the square
root of the zone size?

struct numa_pq_entry {
	struct page *page;
	pg_data_t *destination;
};

For each zone, we can have a numa queueing struct

struct numa_queue {
	struct numa_pq_entry *current_knumad;
	struct numa_pq_entry *current_queue;
	struct numa_pq_entry entries[];	/* flexible array of queue slots */
};

Pages can get added to the knumad queue by filling
in a pointer and a destination node, and by setting
a page flag indicating that this page should be
moved to another NUMA node.

If something happens to the page that would cancel
the queuing, we simply clear that page flag.

When knumad gets around to an entry in the array,
it will check to see if the "should migrate" page
flag is still set. If it is not, it skips the entry.

The current_knumad and current_queue entries can
be used to simply keep circular buffer semantics.
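
A minimal sketch of those semantics (illustrative only:
SetPageNumaMigrate/TestClearPageNumaMigrate, numa_queue_next() and
migrate_page_to_node() are placeholders, not existing kernel APIs):

/* producer side: fill the next slot and mark the page as queued */
static void numa_queue_add(struct numa_queue *nq, struct page *page,
			   pg_data_t *dest)
{
	struct numa_pq_entry *e = nq->current_queue;

	e->page = page;
	e->destination = dest;
	SetPageNumaMigrate(page);		/* hypothetical page flag */
	nq->current_queue = numa_queue_next(nq, e);	/* circular advance */
}

/* knumad side: skip entries whose flag was cleared in the meantime */
static void numa_queue_pop(struct numa_queue *nq)
{
	struct numa_pq_entry *e = nq->current_knumad;

	if (e->page && TestClearPageNumaMigrate(e->page))
		migrate_page_to_node(e->page, e->destination);
	nq->current_knumad = numa_queue_next(nq, e);
}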

Does this look reasonable?

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 22/35] autonuma: sched_set_autonuma_need_balance
  2012-05-29 17:33       ` Andrea Arcangeli
@ 2012-05-29 22:21         ` Peter Zijlstra
  -1 siblings, 0 replies; 236+ messages in thread
From: Peter Zijlstra @ 2012-05-29 22:21 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

On Tue, 2012-05-29 at 19:33 +0200, Andrea Arcangeli wrote:

> No worries, I didn't mean to leave it like this forever. I was
> considering using the stop cpu _nowait variant but I didn't have
> enough time to realize if it would work for my case. I need to rethink
> about that.

No, you're not going to use any stop_cpu variant at all. Nothing is
_that_ urgent. Your whole strict mode needs to go, it completely wrecks
the regular balancer.

> The moment I gave up on the _nowait variant before releasing is when I
> couldn't understand what is tlb_migrate_finish doing, and why it's not
> present in the _nowait version in fair.c. Can you explain me that?

It's an optional tlb flush; I guess they didn't find it worth the
effort for active_balance -- it should be fairly rare anyway.

> I'm glad you acknowledge load_balance already takes a bulk of the time
> as it needs to find the busiest runqueue checking other CPU runqueues
> too...

I've never said otherwise; it's always been about where you do it, and
the middle of schedule() just isn't it. And I'm getting very tired of
having to repeat myself.

Also for regular load-balance only 2 cpus will ever scan all cpus, the
rest will only scan smaller ranges. Your thing does n-1 nodes worth of
cpus for every cpu.



^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 23/35] autonuma: core
  2012-05-29 11:45     ` Kirill A. Shutemov
@ 2012-05-30  0:03       ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-30  0:03 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter

On Tue, May 29, 2012 at 02:45:54PM +0300, Kirill A. Shutemov wrote:
> On Fri, May 25, 2012 at 07:02:27PM +0200, Andrea Arcangeli wrote:
> 
> > +static int knumad_do_scan(void)
> > +{
> 
> ...
> 
> > +	if (knumad_test_exit(mm) || !vma) {
> > +		mm_autonuma = mm->mm_autonuma;
> > +		if (mm_autonuma->mm_node.next != &knumad_scan.mm_head) {
> > +			mm_autonuma = list_entry(mm_autonuma->mm_node.next,
> > +						 struct mm_autonuma, mm_node);
> > +			knumad_scan.mm = mm_autonuma->mm;
> > +			atomic_inc(&knumad_scan.mm->mm_count);
> > +			knumad_scan.address = 0;
> > +			knumad_scan.mm->mm_autonuma->numa_fault_pass++;
> > +		} else
> > +			knumad_scan.mm = NULL;
> 
> knumad_scan.mm should be nulled only after list_del otherwise you will
> have race with autonuma_exit():

Thanks for noticing. I managed to reproduce it by setting
knuma_scand/scan_sleep_millisecs and
knuma_scand/scan_sleep_pass_millisecs both to 0 and running a loop of
"while :; do memhog -r10 10m &>/dev/null; done".

So the problem was that if knuma_scand changed knumad_scan.mm after
mm->mm_users reached 0 but before autonuma_exit ran, autonuma_exit
wouldn't notice that the mm->mm_autonuma was already unlinked and it
would unlink it again.

autonuma_exit itself doesn't need to tell knuma_scand anything,
because if it notices knumad_scan.mm == mm, it does nothing and it
_always_ relies on knuma_scand to unlink it.

And if instead knumad_scan.mm != mm, then autonuma_exit knows the
knuma_scand daemon will never have a chance to see the "mm" in the
list again if autonuma_exit got there first (setting mm_autonuma->mm =
NULL there is just a debug tweak, as the comment says).

The "serialize" path is there only to wait for the knuma_scand main
loop before taking down the mm (it's not related to the list
management).

mm_autonuma->mm is useless after the "mm_autonuma" has been unlinked,
so it's ok to use it to track whether knuma_scand got there first.

The exit path of the kernel daemon also forgot to check for
knumad_test_exit(mm) before unlinking, but that only runs if
kthread_should_stop() is true, and nobody calls kthread_stop so it's
only a theoretical improvement.

So this seems to fix it.

diff --git a/mm/autonuma.c b/mm/autonuma.c
index c2a5a82..768250a 100644
--- a/mm/autonuma.c
+++ b/mm/autonuma.c
@@ -679,9 +679,12 @@ static int knumad_do_scan(void)
 		} else
 			knumad_scan.mm = NULL;
 
-		if (knumad_test_exit(mm))
+		if (knumad_test_exit(mm)) {
 			list_del(&mm->mm_autonuma->mm_node);
-		else
+			/* tell autonuma_exit not to list_del */
+			VM_BUG_ON(mm->mm_autonuma->mm != mm);
+			mm->mm_autonuma->mm = NULL;
+		} else
 			mm_numa_fault_flush(mm);
 
 		mmdrop(mm);
@@ -770,8 +773,12 @@ static int knuma_scand(void *none)
 
 	mm = knumad_scan.mm;
 	knumad_scan.mm = NULL;
-	if (mm)
+	if (mm && knumad_test_exit(mm)) {
 		list_del(&mm->mm_autonuma->mm_node);
+		/* tell autonuma_exit not to list_del */
+		VM_BUG_ON(mm->mm_autonuma->mm != mm);
+		mm->mm_autonuma->mm = NULL;
+	}
 	mutex_unlock(&knumad_mm_mutex);
 
 	if (mm)
@@ -996,11 +1003,15 @@ void autonuma_exit(struct mm_struct *mm)
 	mutex_lock(&knumad_mm_mutex);
 	if (knumad_scan.mm == mm)
 		serialize = true;
-	else
+	else if (mm->mm_autonuma->mm) {
+		VM_BUG_ON(mm->mm_autonuma->mm != mm);
+		mm->mm_autonuma->mm = NULL; /* debug */
 		list_del(&mm->mm_autonuma->mm_node);
+	}
 	mutex_unlock(&knumad_mm_mutex);
 
 	if (serialize) {
+		/* prevent the mm to go away under knumad_do_scan main loop */
 		down_write(&mm->mmap_sem);
 		up_write(&mm->mmap_sem);
 	}

^ permalink raw reply related	[flat|nested] 236+ messages in thread

* Re: [PATCH 14/35] autonuma: knuma_migrated per NUMA node queues
  2012-05-29 13:51     ` Peter Zijlstra
@ 2012-05-30  0:14       ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-30  0:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

Hi,

On Tue, May 29, 2012 at 03:51:08PM +0200, Peter Zijlstra wrote:
> On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
> 
> 
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 41aa49b..8e578e6 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -666,6 +666,12 @@ typedef struct pglist_data {
> >  	struct task_struct *kswapd;
> >  	int kswapd_max_order;
> >  	enum zone_type classzone_idx;
> > +#ifdef CONFIG_AUTONUMA
> > +	spinlock_t autonuma_lock;
> > +	struct list_head autonuma_migrate_head[MAX_NUMNODES];
> > +	unsigned long autonuma_nr_migrate_pages;
> > +	wait_queue_head_t autonuma_knuma_migrated_wait;
> > +#endif
> >  } pg_data_t;
> >  
> >  #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
> 
> O(nr_nodes^2) data.. ISTR people rewriting a certain slab allocator to
> get rid of that :-)
> 
> Also, don't forget that MAX_NUMNODES is an unconditional 512 on distro
> kernels, even when we only have 2.
> 
> Now the total wasted space isn't too bad since its only 16 bytes,
> totaling a whole 2M for a 256 node system. But still, something like
> that wants at least a mention somewhere.

I fully agree; I prefer to fix it and I was fully aware of
this. It's not a big deal so it got low priority to be fixed, but I
intended to optimize this.

As long as num_possible_nodes() is initialized before the pgdat is
allocated, it shouldn't be difficult to optimize this by moving struct
list_head autonuma_migrate_head[0] to the end of the structure.

mm_autonuma and sched_autonuma initially also had MAX_NUMNODES arrays
in them, then I converted them to dynamic allocations to be optimal. The
same needs to happen here.
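
Something like this (sketch only; it's basically what the follow-up
patch later in this thread implements):

	typedef struct pglist_data {
		/* ... existing fields ... */
		unsigned long autonuma_nr_migrate_pages;
		wait_queue_head_t autonuma_knuma_migrated_wait;
		spinlock_t autonuma_lock;
		/* must stay last: sized at allocation time */
		struct list_head autonuma_migrate_head[0];
	} pg_data_t;

	#define autonuma_pglist_data_size()			\
		(sizeof(struct pglist_data) +			\
		 sizeof(struct list_head) * num_possible_nodes())

	/* and the pgdat allocation becomes */
	pgdat = kzalloc(autonuma_pglist_data_size(), GFP_KERNEL);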

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 13/35] autonuma: add page structure fields
  2012-05-29 14:54         ` Peter Zijlstra
@ 2012-05-30  8:25           ` KOSAKI Motohiro
  -1 siblings, 0 replies; 236+ messages in thread
From: KOSAKI Motohiro @ 2012-05-30  8:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Andrea Arcangeli, linux-kernel, linux-mm,
	Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Johannes Weiner, Srivatsa Vaddagiri,
	Christoph Lameter, kosaki.motohiro

(5/29/12 10:54 AM), Peter Zijlstra wrote:
> On Tue, 2012-05-29 at 09:56 -0400, Rik van Riel wrote:
>> On 05/29/2012 09:16 AM, Peter Zijlstra wrote:
>>> On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
>>
>>> 24 bytes per page.. or ~0.6% of memory gone. This is far too great a
>>> price to pay.
>>>
>>> At LSF/MM Rik already suggested you limit the number of pages that can
>>> be migrated concurrently and use this to move the extra list_head out of
>>> struct page and into a smaller amount of extra structures, reducing the
>>> total overhead.
>>
>> For THP, we should be able to track this NUMA info on a
>> 2MB page granularity.
>
> Yeah, but that's another x86-only feature, _IF_ we're going to do this
> it must be done for all archs that have CONFIG_NUMA, thus we're stuck
> with 4k (or other base page size).

Even if THP=n, we don't need 4k granularity. All modern malloc implementations have
per-thread heaps (e.g. glibc calls it an arena) and they are usually 1-8MB in size. So, if
the heap is larger than 2MB, we can always use per-pmd tracking; iow, memory consumption
is reduced to 1/512.
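
(For scale: 24 bytes of per-4k-page state is 24/4096 ~= 0.6% of RAM,
the figure quoted above; one entry per 2MB pmd instead is 24/2097152
~= 0.001%, which is the 1/512 reduction.)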

My suggestion is to track at per-pmd (i.e. 2M size) granularity and fix glibc too (the current
glibc malloc has a dynamic arena size adjusting feature and the arena often becomes
less than 2M).



^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 13/35] autonuma: add page structure fields
  2012-05-30  8:25           ` KOSAKI Motohiro
@ 2012-05-30  9:06             ` Peter Zijlstra
  -1 siblings, 0 replies; 236+ messages in thread
From: Peter Zijlstra @ 2012-05-30  9:06 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Rik van Riel, Andrea Arcangeli, linux-kernel, linux-mm,
	Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Johannes Weiner, Srivatsa Vaddagiri,
	Christoph Lameter

On Wed, 2012-05-30 at 04:25 -0400, KOSAKI Motohiro wrote:
> (5/29/12 10:54 AM), Peter Zijlstra wrote:
> > On Tue, 2012-05-29 at 09:56 -0400, Rik van Riel wrote:
> >> On 05/29/2012 09:16 AM, Peter Zijlstra wrote:
> >>> On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
> >>
> >>> 24 bytes per page.. or ~0.6% of memory gone. This is far too great a
> >>> price to pay.
> >>>
> >>> At LSF/MM Rik already suggested you limit the number of pages that can
> >>> be migrated concurrently and use this to move the extra list_head out of
> >>> struct page and into a smaller amount of extra structures, reducing the
> >>> total overhead.
> >>
> >> For THP, we should be able to track this NUMA info on a
> >> 2MB page granularity.
> >
> > Yeah, but that's another x86-only feature, _IF_ we're going to do this
> > it must be done for all archs that have CONFIG_NUMA, thus we're stuck
> > with 4k (or other base page size).
> 
> Even if THP=n, we don't need 4k granularity. All modern malloc implementation have
> per-thread heap (e.g. glibc call it as arena) and it is usually 1-8MB size. So, if
> it is larger than 2MB, we can always use per-pmd tracking. iow, memory consumption
> reduce to 1/512.

Yes, and we all know objects allocated in one thread are never shared
with other threads.. the producer-consumer pattern seems fairly popular
and will destroy your argument.

> My suggestion is, track per-pmd (i.e. 2M size) granularity and fix glibc too (current
> glibc malloc has dynamically arena size adjusting feature and then it often become
> less than 2M).

The trouble with making this per pmd is that you then get the false
sharing per pmd, so if there's shared data on the 2m page you'll not
know where to put it.

I also know of some folks who did a strict per-cpu allocator based on
some kernel patches I hope to see posted sometime soon. This is because if
you have many more threads than cpus the wasted space in your arenas is
tremendous.

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 13/35] autonuma: add page structure fields
  2012-05-30  9:06             ` Peter Zijlstra
@ 2012-05-30  9:41               ` KOSAKI Motohiro
  -1 siblings, 0 replies; 236+ messages in thread
From: KOSAKI Motohiro @ 2012-05-30  9:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: KOSAKI Motohiro, Rik van Riel, Andrea Arcangeli, linux-kernel,
	linux-mm, Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Johannes Weiner, Srivatsa Vaddagiri,
	Christoph Lameter

(5/30/12 5:06 AM), Peter Zijlstra wrote:
> On Wed, 2012-05-30 at 04:25 -0400, KOSAKI Motohiro wrote:
>> (5/29/12 10:54 AM), Peter Zijlstra wrote:
>>> On Tue, 2012-05-29 at 09:56 -0400, Rik van Riel wrote:
>>>> On 05/29/2012 09:16 AM, Peter Zijlstra wrote:
>>>>> On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
>>>>
>>>>> 24 bytes per page.. or ~0.6% of memory gone. This is far too great a
>>>>> price to pay.
>>>>>
>>>>> At LSF/MM Rik already suggested you limit the number of pages that can
>>>>> be migrated concurrently and use this to move the extra list_head out of
>>>>> struct page and into a smaller amount of extra structures, reducing the
>>>>> total overhead.
>>>>
>>>> For THP, we should be able to track this NUMA info on a
>>>> 2MB page granularity.
>>>
>>> Yeah, but that's another x86-only feature, _IF_ we're going to do this
>>> it must be done for all archs that have CONFIG_NUMA, thus we're stuck
>>> with 4k (or other base page size).
>>
>> Even if THP=n, we don't need 4k granularity. All modern malloc implementation have
>> per-thread heap (e.g. glibc call it as arena) and it is usually 1-8MB size. So, if
>> it is larger than 2MB, we can always use per-pmd tracking. iow, memory consumption
>> reduce to 1/512.
>
> Yes, and we all know objects allocated in one thread are never shared
> with other threads.. the producer-consumer pattern seems fairly popular
> and will destroy your argument.

THP also hits the producer-consumer pattern. But, as far as I know, people haven't observed
significant performance degradation, thus I _guessed_ the performance-critical producer-consumer
pattern is rare. Just a guess.


>> My suggestion is, track per-pmd (i.e. 2M size) granularity and fix glibc too (current
>> glibc malloc has dynamically arena size adjusting feature and then it often become
>> less than 2M).
>
> The trouble with making this per pmd is that you then get the false
> sharing per pmd, so if there's shared data on the 2m page you'll not
> know where to put it.
>
> I also know of some folks who did a strict per-cpu allocator based on
> some kernel patches I hope to see posted sometime soon. This because if
> you have many more threads than cpus the wasted space in your areas is
> tremendous.


^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 13/35] autonuma: add page structure fields
  2012-05-30  9:41               ` KOSAKI Motohiro
@ 2012-05-30  9:55                 ` Peter Zijlstra
  -1 siblings, 0 replies; 236+ messages in thread
From: Peter Zijlstra @ 2012-05-30  9:55 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Rik van Riel, Andrea Arcangeli, linux-kernel, linux-mm,
	Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Johannes Weiner, Srivatsa Vaddagiri,
	Christoph Lameter

On Wed, 2012-05-30 at 05:41 -0400, KOSAKI Motohiro wrote:
> > Yes, and we all know objects allocated in one thread are never shared
> > with other threads.. the producer-consumer pattern seems fairly popular
> > and will destroy your argument.
> 
> THP also strike producer-consumer pattern. But, as far as I know, people haven't observed
> significant performance degression. thus I _guessed_ performance critical producer-consumer
> pattern is rare. Just guess. 

Not so, as long as the arenas span PMDs THP can back them using huge
pages, regardless of what objects live in that virtual space (or indeed
whether it's given out as objects at all or lives on the free-lists).

THP doesn't care about what lives in the virtual space, all it cares
about is ranges spanning PMDs that are populated densely enough.


^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 13/35] autonuma: add page structure fields
  2012-05-30  9:06             ` Peter Zijlstra
@ 2012-05-30 13:49               ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-30 13:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: KOSAKI Motohiro, Rik van Riel, linux-kernel, linux-mm,
	Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Johannes Weiner, Srivatsa Vaddagiri,
	Christoph Lameter

On Wed, May 30, 2012 at 11:06:03AM +0200, Peter Zijlstra wrote:
> The trouble with making this per pmd is that you then get the false
> sharing per pmd, so if there's shared data on the 2m page you'll not
> know where to put it.

The numa hinting page fault is already scanning the pmd only, and it's
working fine. So reducing the page_autonuma to one per pmd would not
reduce the granularity of the information with the default settings
everyone has been using so far, but it would prevent this runtime
tweak from working:

echo 0 >/sys/kernel/mm/autonuma/knuma_scand/pmd

I'm thinking about it, but probably reducing the page_autonuma to one
per pmd is going to be the simplest solution considering that by default we
only track the pmd anyway.

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 00/35] AutoNUMA alpha14
  2012-05-26 20:42     ` Linus Torvalds
@ 2012-05-30 14:46       ` Peter Zijlstra
  -1 siblings, 0 replies; 236+ messages in thread
From: Peter Zijlstra @ 2012-05-30 14:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, Andrea Arcangeli, linux-kernel, linux-mm,
	Hillf Danton, Dan Smith, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter

On Sat, 2012-05-26 at 13:42 -0700, Linus Torvalds wrote:

> I'm a *firm* believer that if it cannot be done automatically "well
> enough", the absolute last thing we should ever do is worry about the
> crazy people who think they can tweak it to perfection with complex
> interfaces.
> 
> You can't do it, except for trivial loads (often benchmarks), and for
> very specific machines.
> 
> So I think very strongly that we should entirely dismiss all the
> people who want to do manual placement and claim that they know what
> their loads do. They're either full of sh*t (most likely), or they
> have a very specific benchmark and platform that they are tuning for
> that is totally irrelevant to everybody else.
> 
> What we *should* try to aim for is a system that doesn't do horribly
> badly right out of the box. IOW, no tuning what-so-ever (at most a
> kind of "yes, I want you to try to do the NUMA thing" flag to just
> enable it at all), and try to not suck.
> 
> Seriously. "Try to avoid sucking" is *way* superior to "We can let the
> user tweak things to their hearts content". Because users won't get it
> right.
> 
> Give the anal people a knob they can tweak, and tell them it does
> something fancy. And never actually wire the damn thing up. They'll be
> really happy with their OCD tweaking, and do lots of nice graphs that
> just show how the error bars are so big that you can find any damn
> pattern you want in random noise.

So the thing is, my homenode-per-process approach should work for
everything except the case where a single process out-strips a single
node in either cpu utilization or memory consumption.

Now I claim such processes are rare since nodes are big, typically 6-8
cores. Writing anything that can sustain parallel execution larger than
that is very specialist (and typically already employs strong data
separation).

Yes there are such things out there, some use JVMs, some are virtual
machines, some are regular applications, but by and large processes are small
compared to nodes.

So my approach is to focus on the normal case, and provide 2 system calls
to replace sched_setaffinity() and mbind() for the people who use those.
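
For reference, the hard affinity being replaced is used roughly like
this from userspace today (minimal sketch, cpu/node numbers made up;
mbind() is declared in <numaif.h> from libnuma):

	#define _GNU_SOURCE
	#include <sched.h>	/* sched_setaffinity(), CPU_SET() */
	#include <numaif.h>	/* mbind(), MPOL_BIND */
	#include <sys/mman.h>
	#include <stdio.h>

	int main(void)
	{
		/* pin the task to cpu 0 ... */
		cpu_set_t cpus;
		CPU_ZERO(&cpus);
		CPU_SET(0, &cpus);
		if (sched_setaffinity(0, sizeof(cpus), &cpus))
			perror("sched_setaffinity");

		/* ... and bind an anonymous mapping to node 0 */
		size_t len = 16 << 20;
		void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (buf == MAP_FAILED)
			return 1;
		unsigned long nodemask = 1UL << 0;
		if (mbind(buf, len, MPOL_BIND, &nodemask,
			  sizeof(nodemask) * 8, 0))
			perror("mbind");
		return 0;
	}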

Now, maybe I shouldn't have bothered with the system calls.. but I
thought providing something better than hard-affinity would be nice.


Andrea went the other way and focused on these big processes. His
approach relies on a pte scanner and faults. His code builds a
page<->thread map and, using this data, either moves memory around or
moves processes (I'm a little vague on the details simply because I haven't
seen it explained anywhere yet -- and the code is non-obvious).

I have a number of problems with both the approach as well as the
implementation. 

On the approach my biggest complaints are:

 - the complexity, it focuses on the rarest sort of processes and thus
   results in a rather complex setup.

 - load-balance state explosion, the page-tables become part of the
   load-balance state -- this is a lot of extra state making
   reproduction more 'interesting'.

 - the overhead, since it's per page, it needs per-page state.

 - I don't see how it can reliably work for virtual machines, because
   the host page<->thread (vcpu) relation doesn't reflect a
   data<->compute relation in this case. The guest scheduler can move
   the guest thread (the compute) part around between the vcpus at a
   much higher rate than the host will update its page<->vcpu map.

On the implementation:

 - he works around the scheduler instead of with it.

 - it's x86 only (although he claims adding archs is trivial,
   I've yet to see the first !x86 support).

 - complete lack of useful comments describing the balancing goal and
   approach.

The worst part is that I've asked for this stuff several times, but
nothing seems forth-coming.

Anyway, I prefer doing the simple thing first and then seeing if there's
need for more complexity, esp. given the overheads involved. But if you
prefer we can dive off the deep end :-)

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 00/35] AutoNUMA alpha14
  2012-05-30 14:46       ` Peter Zijlstra
@ 2012-05-30 15:30         ` Ingo Molnar
  -1 siblings, 0 replies; 236+ messages in thread
From: Ingo Molnar @ 2012-05-30 15:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Rik van Riel, Andrea Arcangeli, linux-kernel,
	linux-mm, Hillf Danton, Dan Smith, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Johannes Weiner, Srivatsa Vaddagiri,
	Christoph Lameter


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> So the thing is, my homenode-per-process approach should work 
> for everything except the case where a single process 
> out-strips a single node in either cpu utilization or memory 
> consumption.
> 
> Now I claim such processes are rare since nodes are big, 
> typically 6-8 cores. Writing anything that can sustain 
> parallel execution larger than that is very specialist (and 
> typically already employs strong data separation).
> 
> Yes there are such things out there, some use JVMs some are 
> virtual machines some regular applications, but by and large 
> processes are small compared to nodes.
> 
> So my approach is focus on the normal case, and provide 2 
> system calls to replace sched_setaffinity() and mbind() for 
> the people who use those.

We could certainly strike those from the first version, if Linus 
agrees with the general approach.

This gives us degrees of freedom as it's an obvious on/off kernel 
feature which we can fix or remove if it does not work.

I'd even venture that it should be on by default: it's an 
obvious placement strategy for everything sane that does not try 
to nest some other execution environment within Linux (i.e. 
specialist runtimes).

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 14/35] autonuma: knuma_migrated per NUMA node queues
  2012-05-30  0:14       ` Andrea Arcangeli
@ 2012-05-30 18:19         ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-30 18:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

On Wed, May 30, 2012 at 02:14:38AM +0200, Andrea Arcangeli wrote:
> I fully agree, I prefer to fix it and I was fully aware about

I did this yesterday; it saves a couple of pages on my numa
system with node shift = 9. However I'm not sure anymore if it's
really worth it... but since I did it I may as well keep it.
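
(node shift = 9 means MAX_NUMNODES = 512, so the static array was
512 * sizeof(struct list_head) = 8k per pgdat with 16-byte list
heads, i.e. the two 4k pages per node this saves.)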

==
From: Andrea Arcangeli <aarcange@redhat.com>
Subject: [PATCH] autonuma: autonuma_migrate_head[0] dynamic size

Reduce the autonuma_migrate_head array entries from MAX_NUMNODES to
num_possible_nodes() or zero if autonuma_impossible() is true.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/mm/numa.c             |    6 ++++--
 arch/x86/mm/numa_32.c          |    3 ++-
 include/linux/memory_hotplug.h |    3 ++-
 include/linux/mmzone.h         |    8 +++++++-
 include/linux/page_autonuma.h  |   10 ++++++++--
 mm/memory_hotplug.c            |    2 +-
 mm/page_autonuma.c             |    5 +++--
 7 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 2d125be..a4a9e92 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -11,6 +11,7 @@
 #include <linux/nodemask.h>
 #include <linux/sched.h>
 #include <linux/topology.h>
+#include <linux/page_autonuma.h>
 
 #include <asm/e820.h>
 #include <asm/proto.h>
@@ -192,7 +193,8 @@ int __init numa_add_memblk(int nid, u64 start, u64 end)
 /* Initialize NODE_DATA for a node on the local memory */
 static void __init setup_node_data(int nid, u64 start, u64 end)
 {
-	const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
+	const size_t nd_size = roundup(autonuma_pglist_data_size(),
+				       PAGE_SIZE);
 	bool remapped = false;
 	u64 nd_pa;
 	void *nd;
@@ -239,7 +241,7 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
 		printk(KERN_INFO "    NODE_DATA(%d) on node %d\n", nid, tnid);
 
 	node_data[nid] = nd;
-	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
+	memset(NODE_DATA(nid), 0, autonuma_pglist_data_size());
 	NODE_DATA(nid)->node_id = nid;
 	NODE_DATA(nid)->node_start_pfn = start >> PAGE_SHIFT;
 	NODE_DATA(nid)->node_spanned_pages = (end - start) >> PAGE_SHIFT;
diff --git a/arch/x86/mm/numa_32.c b/arch/x86/mm/numa_32.c
index 534255a..d32d6cc 100644
--- a/arch/x86/mm/numa_32.c
+++ b/arch/x86/mm/numa_32.c
@@ -25,6 +25,7 @@
 #include <linux/bootmem.h>
 #include <linux/memblock.h>
 #include <linux/module.h>
+#include <linux/page_autonuma.h>
 
 #include "numa_internal.h"
 
@@ -194,7 +195,7 @@ void __init init_alloc_remap(int nid, u64 start, u64 end)
 
 	/* calculate the necessary space aligned to large page size */
 	size = node_memmap_size_bytes(nid, start_pfn, end_pfn);
-	size += ALIGN(sizeof(pg_data_t), PAGE_SIZE);
+	size += ALIGN(autonuma_pglist_data_size(), PAGE_SIZE);
 	size = ALIGN(size, LARGE_PAGE_BYTES);
 
 	/* allocate node memory and the lowmem remap area */
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 910550f..76b1840 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -5,6 +5,7 @@
 #include <linux/spinlock.h>
 #include <linux/notifier.h>
 #include <linux/bug.h>
+#include <linux/page_autonuma.h>
 
 struct page;
 struct zone;
@@ -130,7 +131,7 @@ extern void arch_refresh_nodedata(int nid, pg_data_t *pgdat);
  */
 #define generic_alloc_nodedata(nid)				\
 ({								\
-	kzalloc(sizeof(pg_data_t), GFP_KERNEL);			\
+	kzalloc(autonuma_pglist_data_size(), GFP_KERNEL);	\
 })
 /*
  * This definition is just for error path in node hotadd.
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e66da74..ed5b0c0 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -701,10 +701,16 @@ typedef struct pglist_data {
 #if !defined(CONFIG_SPARSEMEM)
 	struct page_autonuma *node_page_autonuma;
 #endif
-	struct list_head autonuma_migrate_head[MAX_NUMNODES];
 	unsigned long autonuma_nr_migrate_pages;
 	wait_queue_head_t autonuma_knuma_migrated_wait;
 	spinlock_t autonuma_lock;
+	/*
+	 * Archs supporting AutoNUMA should allocate the pgdat with
+	 * size autonuma_pglist_data_size() after including
+	 * <linux/page_autonuma.h> and the below field must remain the
+	 * last one of this structure.
+	 */
+	struct list_head autonuma_migrate_head[0];
 #endif
 } pg_data_t;
 
diff --git a/include/linux/page_autonuma.h b/include/linux/page_autonuma.h
index 05d2862..1d02643 100644
--- a/include/linux/page_autonuma.h
+++ b/include/linux/page_autonuma.h
@@ -10,6 +10,7 @@ static inline void __init page_autonuma_init_flatmem(void) {}
 #ifdef CONFIG_AUTONUMA
 
 #include <linux/autonuma_flags.h>
+#include <linux/autonuma_types.h>
 
 extern void __meminit page_autonuma_map_init(struct page *page,
 					     struct page_autonuma *page_autonuma,
@@ -29,11 +30,10 @@ extern void __meminit pgdat_autonuma_init(struct pglist_data *);
 struct page_autonuma;
 #define PAGE_AUTONUMA_SIZE 0
 #define SECTION_PAGE_AUTONUMA_SIZE 0
+#endif
 
 #define autonuma_impossible() true
 
-#endif
-
 static inline void pgdat_autonuma_init(struct pglist_data *pgdat) {}
 
 #endif /* CONFIG_AUTONUMA */
@@ -50,4 +50,10 @@ extern void __init sparse_early_page_autonuma_alloc_node(struct page_autonuma **
 							 int nodeid);
 #endif
 
+/* inline won't work here */
+#define autonuma_pglist_data_size() (sizeof(struct pglist_data) +	\
+				     (autonuma_impossible() ? 0 :	\
+				      sizeof(struct list_head) * \
+				      num_possible_nodes()))
+
 #endif /* _LINUX_PAGE_AUTONUMA_H */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0d7e3ec..604995b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -164,7 +164,7 @@ void register_page_bootmem_info_node(struct pglist_data *pgdat)
 	struct page *page;
 	struct zone *zone;
 
-	nr_pages = PAGE_ALIGN(sizeof(struct pglist_data)) >> PAGE_SHIFT;
+	nr_pages = PAGE_ALIGN(autonuma_pglist_data_size()) >> PAGE_SHIFT;
 	page = virt_to_page(pgdat);
 
 	for (i = 0; i < nr_pages; i++, page++)
diff --git a/mm/page_autonuma.c b/mm/page_autonuma.c
index 131b5c9..c5c340b 100644
--- a/mm/page_autonuma.c
+++ b/mm/page_autonuma.c
@@ -23,8 +23,9 @@ static void __meminit __pgdat_autonuma_init(struct pglist_data *pgdat)
 	spin_lock_init(&pgdat->autonuma_lock);
 	init_waitqueue_head(&pgdat->autonuma_knuma_migrated_wait);
 	pgdat->autonuma_nr_migrate_pages = 0;
-	for_each_node(node_iter)
-		INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
+	if (!autonuma_impossible())
+		for_each_node(node_iter)
+			INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
 }
 
 #if !defined(CONFIG_SPARSEMEM)

^ permalink raw reply related	[flat|nested] 236+ messages in thread

* Re: [PATCH 14/35] autonuma: knuma_migrated per NUMA node queues
@ 2012-05-30 18:19         ` Andrea Arcangeli
  0 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-30 18:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

On Wed, May 30, 2012 at 02:14:38AM +0200, Andrea Arcangeli wrote:
> I fully agree, I prefer to fix it and I was fully aware about

I did this yesterday, this is saving a couple of pages on my numa
system with node shift = 9. However I'm not sure anymore if it's
really worth it... but since I did it I may as well keep it.

==
From: Andrea Arcangeli <aarcange@redhat.com>
Subject: [PATCH] autonuma: autonuma_migrate_head[0] dynamic size

Reduce the autonuma_migrate_head array entries from MAX_NUMNODES to
num_possible_nodes() or zero if autonuma_impossible() is true.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/mm/numa.c             |    6 ++++--
 arch/x86/mm/numa_32.c          |    3 ++-
 include/linux/memory_hotplug.h |    3 ++-
 include/linux/mmzone.h         |    8 +++++++-
 include/linux/page_autonuma.h  |   10 ++++++++--
 mm/memory_hotplug.c            |    2 +-
 mm/page_autonuma.c             |    5 +++--
 7 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 2d125be..a4a9e92 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -11,6 +11,7 @@
 #include <linux/nodemask.h>
 #include <linux/sched.h>
 #include <linux/topology.h>
+#include <linux/page_autonuma.h>
 
 #include <asm/e820.h>
 #include <asm/proto.h>
@@ -192,7 +193,8 @@ int __init numa_add_memblk(int nid, u64 start, u64 end)
 /* Initialize NODE_DATA for a node on the local memory */
 static void __init setup_node_data(int nid, u64 start, u64 end)
 {
-	const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
+	const size_t nd_size = roundup(autonuma_pglist_data_size(),
+				       PAGE_SIZE);
 	bool remapped = false;
 	u64 nd_pa;
 	void *nd;
@@ -239,7 +241,7 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
 		printk(KERN_INFO "    NODE_DATA(%d) on node %d\n", nid, tnid);
 
 	node_data[nid] = nd;
-	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
+	memset(NODE_DATA(nid), 0, autonuma_pglist_data_size());
 	NODE_DATA(nid)->node_id = nid;
 	NODE_DATA(nid)->node_start_pfn = start >> PAGE_SHIFT;
 	NODE_DATA(nid)->node_spanned_pages = (end - start) >> PAGE_SHIFT;
diff --git a/arch/x86/mm/numa_32.c b/arch/x86/mm/numa_32.c
index 534255a..d32d6cc 100644
--- a/arch/x86/mm/numa_32.c
+++ b/arch/x86/mm/numa_32.c
@@ -25,6 +25,7 @@
 #include <linux/bootmem.h>
 #include <linux/memblock.h>
 #include <linux/module.h>
+#include <linux/page_autonuma.h>
 
 #include "numa_internal.h"
 
@@ -194,7 +195,7 @@ void __init init_alloc_remap(int nid, u64 start, u64 end)
 
 	/* calculate the necessary space aligned to large page size */
 	size = node_memmap_size_bytes(nid, start_pfn, end_pfn);
-	size += ALIGN(sizeof(pg_data_t), PAGE_SIZE);
+	size += ALIGN(autonuma_pglist_data_size(), PAGE_SIZE);
 	size = ALIGN(size, LARGE_PAGE_BYTES);
 
 	/* allocate node memory and the lowmem remap area */
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 910550f..76b1840 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -5,6 +5,7 @@
 #include <linux/spinlock.h>
 #include <linux/notifier.h>
 #include <linux/bug.h>
+#include <linux/page_autonuma.h>
 
 struct page;
 struct zone;
@@ -130,7 +131,7 @@ extern void arch_refresh_nodedata(int nid, pg_data_t *pgdat);
  */
 #define generic_alloc_nodedata(nid)				\
 ({								\
-	kzalloc(sizeof(pg_data_t), GFP_KERNEL);			\
+	kzalloc(autonuma_pglist_data_size(), GFP_KERNEL);	\
 })
 /*
  * This definition is just for error path in node hotadd.
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e66da74..ed5b0c0 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -701,10 +701,16 @@ typedef struct pglist_data {
 #if !defined(CONFIG_SPARSEMEM)
 	struct page_autonuma *node_page_autonuma;
 #endif
-	struct list_head autonuma_migrate_head[MAX_NUMNODES];
 	unsigned long autonuma_nr_migrate_pages;
 	wait_queue_head_t autonuma_knuma_migrated_wait;
 	spinlock_t autonuma_lock;
+	/*
+	 * Archs supporting AutoNUMA should allocate the pgdat with
+	 * size autonuma_pglist_data_size() after including
+	 * <linux/page_autonuma.h> and the below field must remain the
+	 * last one of this structure.
+	 */
+	struct list_head autonuma_migrate_head[0];
 #endif
 } pg_data_t;
 
diff --git a/include/linux/page_autonuma.h b/include/linux/page_autonuma.h
index 05d2862..1d02643 100644
--- a/include/linux/page_autonuma.h
+++ b/include/linux/page_autonuma.h
@@ -10,6 +10,7 @@ static inline void __init page_autonuma_init_flatmem(void) {}
 #ifdef CONFIG_AUTONUMA
 
 #include <linux/autonuma_flags.h>
+#include <linux/autonuma_types.h>
 
 extern void __meminit page_autonuma_map_init(struct page *page,
 					     struct page_autonuma *page_autonuma,
@@ -29,11 +30,10 @@ extern void __meminit pgdat_autonuma_init(struct pglist_data *);
 struct page_autonuma;
 #define PAGE_AUTONUMA_SIZE 0
 #define SECTION_PAGE_AUTONUMA_SIZE 0
+#endif
 
 #define autonuma_impossible() true
 
-#endif
-
 static inline void pgdat_autonuma_init(struct pglist_data *pgdat) {}
 
 #endif /* CONFIG_AUTONUMA */
@@ -50,4 +50,10 @@ extern void __init sparse_early_page_autonuma_alloc_node(struct page_autonuma **
 							 int nodeid);
 #endif
 
+/* inline won't work here */
+#define autonuma_pglist_data_size() (sizeof(struct pglist_data) +	\
+				     (autonuma_impossible() ? 0 :	\
+				      sizeof(struct list_head) * \
+				      num_possible_nodes()))
+
 #endif /* _LINUX_PAGE_AUTONUMA_H */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0d7e3ec..604995b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -164,7 +164,7 @@ void register_page_bootmem_info_node(struct pglist_data *pgdat)
 	struct page *page;
 	struct zone *zone;
 
-	nr_pages = PAGE_ALIGN(sizeof(struct pglist_data)) >> PAGE_SHIFT;
+	nr_pages = PAGE_ALIGN(autonuma_pglist_data_size()) >> PAGE_SHIFT;
 	page = virt_to_page(pgdat);
 
 	for (i = 0; i < nr_pages; i++, page++)
diff --git a/mm/page_autonuma.c b/mm/page_autonuma.c
index 131b5c9..c5c340b 100644
--- a/mm/page_autonuma.c
+++ b/mm/page_autonuma.c
@@ -23,8 +23,9 @@ static void __meminit __pgdat_autonuma_init(struct pglist_data *pgdat)
 	spin_lock_init(&pgdat->autonuma_lock);
 	init_waitqueue_head(&pgdat->autonuma_knuma_migrated_wait);
 	pgdat->autonuma_nr_migrate_pages = 0;
-	for_each_node(node_iter)
-		INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
+	if (!autonuma_impossible())
+		for_each_node(node_iter)
+			INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
 }
 
 #if !defined(CONFIG_SPARSEMEM)
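
A note on the pattern relied on above (illustration only, not part of
the patch): the pgdat ends in a zero-length array that is sized at
allocation time. A minimal standalone sketch, with made-up names and
assuming <linux/slab.h> and <linux/list.h>:

	struct demo {
		int nid;
		struct list_head heads[0];	/* must stay the last field */
	};

	static struct demo *demo_alloc(int nr_nodes)
	{
		/* one allocation covers the struct plus nr_nodes list heads */
		struct demo *d = kzalloc(sizeof(*d) +
					 nr_nodes * sizeof(struct list_head),
					 GFP_KERNEL);
		int i;

		if (d)
			for (i = 0; i < nr_nodes; i++)
				INIT_LIST_HEAD(&d->heads[i]);
		return d;
	}

With nr_nodes == 0 (the autonuma_impossible() case) the trailing array
costs nothing, which is what autonuma_pglist_data_size() achieves for
pg_data_t.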


^ permalink raw reply related	[flat|nested] 236+ messages in thread

* Re: [PATCH 04/35] autonuma: define _PAGE_NUMA_PTE and _PAGE_NUMA_PMD
  2012-05-25 17:02   ` Andrea Arcangeli
@ 2012-05-30 18:22     ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 236+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-05-30 18:22 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter

On Fri, May 25, 2012 at 07:02:08PM +0200, Andrea Arcangeli wrote:
> We will set these bitflags only when the pmd and pte are non-present.
> 
> They work like PROT_NONE but they identify a request for the numa
> hinting page fault to trigger.
> 
> Because we want to be able to set these bitflags in any established pte
> or pmd (while clearing the present bit at the same time) without
> losing information, these bitflags must never be set when the pte and
> pmd are present.
> 
> For _PAGE_NUMA_PTE the pte bitflag used is _PAGE_PSE, which cannot be
> set on ptes and it also fits in between _PAGE_FILE and _PAGE_PROTNONE
> which avoids having to alter the swp entries format.
> 
> For _PAGE_NUMA_PMD, we use a reserved bitflag. pmds never contain
> swap_entries but if in the future we'll swap transparent hugepages, we
> must keep in mind not to use the _PAGE_UNUSED2 bitflag in the swap
> entry format and to start the swap entry offset above it.
> 
> _PAGE_UNUSED2 is used by Xen but only on ptes established by ioremap,
> but it's never used on pmds so there's no risk of collision with Xen.

Thank you for looking at this from the Xen side. The interesting thing
is that I believe the _PAGE_PAT (or _PAGE_PSE) is actually used by
Xen on PTEs. It is used to mark the pages WC. <sigh>



> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  arch/x86/include/asm/pgtable_types.h |   11 +++++++++++
>  1 files changed, 11 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index b74cac9..6e2d954 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -71,6 +71,17 @@
>  #define _PAGE_FILE	(_AT(pteval_t, 1) << _PAGE_BIT_FILE)
>  #define _PAGE_PROTNONE	(_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
>  
> +/*
> + * Cannot be set on pte. The fact it's in between _PAGE_FILE and
> + * _PAGE_PROTNONE avoids having to alter the swp entries.
> + */
> +#define _PAGE_NUMA_PTE	_PAGE_PSE
> +/*
> + * Cannot be set on pmd, if transparent hugepages will be swapped out
> + * the swap entry offset must start above it.
> + */
> +#define _PAGE_NUMA_PMD	_PAGE_UNUSED2
> +
>  #define _PAGE_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER |	\
>  			 _PAGE_ACCESSED | _PAGE_DIRTY)
>  #define _KERNPG_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED |	\
> 

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 04/35] autonuma: define _PAGE_NUMA_PTE and _PAGE_NUMA_PMD
  2012-05-30 18:22     ` Konrad Rzeszutek Wilk
@ 2012-05-30 18:34       ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-30 18:34 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter

Hi Konrad,

On Wed, May 30, 2012 at 02:22:49PM -0400, Konrad Rzeszutek Wilk wrote:
> Thank you for loking at this from the xen side. The interesting thing
> is that I believe the _PAGE_PAT (or _PAGE_PSE) is actually used on
> Xen on PTEs. It is used to mark the pages WC. <sigh>

Oops, I'm using _PAGE_PSE too on the pte, but only when it's unmapped.

static inline int pte_numa(pte_t pte)
{
	return (pte_flags(pte) &
		(_PAGE_NUMA_PTE|_PAGE_PRESENT)) == _PAGE_NUMA_PTE;
}

And _PAGE_UNUSED2 (_PAGE_IOMAP) is used for the pmd but _PAGE_IOMAP by
Xen should only be set on ptes.
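
To spell out the pmd side too (only a sketch mirroring pte_numa above,
not necessarily the literal code in the tree):

static inline int pmd_numa(pmd_t pmd)
{
	/* _PAGE_NUMA_PMD is only meaningful on a non-present pmd */
	return (pmd_flags(pmd) &
		(_PAGE_NUMA_PMD|_PAGE_PRESENT)) == _PAGE_NUMA_PMD;
}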

The only way to use _PAGE_PSE safely on the pte is if the pte is
non-present; is this what Xen is also doing? (in turn colliding with
pte_numa)

Now if I shrink the size of the page_autonuma to one entry per pmd
(instead of per pte) I may as well drop pte_numa entirely and only
leave pmd_numa. At the moment it's possible to switch between the two
models at runtime with sysctl (if one wants to do more expensive
granular tracking). I'm still uncertain about the best way to shrink
the page_autonuma size; we'll see.

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 04/35] autonuma: define _PAGE_NUMA_PTE and _PAGE_NUMA_PMD
  2012-05-30 18:34       ` Andrea Arcangeli
@ 2012-05-30 20:01         ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 236+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-05-30 20:01 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter

On Wed, May 30, 2012 at 08:34:06PM +0200, Andrea Arcangeli wrote:
> Hi Konrad,
> 
> On Wed, May 30, 2012 at 02:22:49PM -0400, Konrad Rzeszutek Wilk wrote:
> > Thank you for loking at this from the xen side. The interesting thing
> > is that I believe the _PAGE_PAT (or _PAGE_PSE) is actually used on
> > Xen on PTEs. It is used to mark the pages WC. <sigh>
> 
> Oops, I'm using _PAGE_PSE too on the pte, but only when it's unmapped.
> 
> static inline int pte_numa(pte_t pte)
> {
> 	return (pte_flags(pte) &
> 		(_PAGE_NUMA_PTE|_PAGE_PRESENT)) == _PAGE_NUMA_PTE;
> }
> 
> And _PAGE_UNUSED2 (_PAGE_IOMAP) is used for the pmd but _PAGE_IOMAP by
> Xen should only be set on ptes.
<nods>
> 
> The only way to use _PAGE_PSE safe on the pte is if the pte is
> non-present, is this what Xen is also doing? (in turn colliding with
> pte_numa)

The only time the _PAGE_PSE (_PAGE_PAT) is set is when
_PAGE_PCD | _PAGE_PWT are set. It is this ugly transformation:

 if (pat_enabled && (pteval & (_PAGE_PWT | _PAGE_PCD)))
	pteval = (pteval & ~(_PAGE_PWT | _PAGE_PCD)) | _PAGE_PAT;

and then writing the pte with the PAT bit set instead of PWT/PCD to
mark it as WC. There is a corresponding reverse too (to read the pte -
so the pte_val calls): if _PAGE_PAT is detected it removes _PAGE_PAT
and returns the PTE as if it had _PAGE_PWT | _PAGE_PCD.

So that little bit of code will need some tweaking - as it does
that even if _PAGE_PRESENT is not set. Meaning it would
transform your _PAGE_PAT to _PAGE_PWT | _PAGE_PCD. Gah!
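
Roughly, that code would need to gate the reverse translation on the
present bit, something like this (an untested sketch, not actual Xen
code):

	/* leave non-present ptes (e.g. pte_numa ones) alone */
	if (!(pteval & _PAGE_PRESENT))
		return pteval;
	if (pat_enabled && (pteval & _PAGE_PAT))
		pteval = (pteval & ~_PAGE_PAT) | _PAGE_PWT | _PAGE_PCD;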


> 
> Now if I shrink the size of the page_autonuma to one entry per pmd
> (instead of per pte) I may as well drop pte_numa entirely and only
> leave pmd_numa. At the moment it's possible to switch between the two
> models at runtime with sysctl (if one wants to do a more expensive
> granular tracking). I'm still uncertain on the best way to shrink
> the page_autonuma size we'll see.

OK. I can whip up a patch to deal with the 'Gah!' case easily if needed.

Thanks!

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 06/35] autonuma: generic pte_numa() and pmd_numa()
  2012-05-25 17:02   ` Andrea Arcangeli
@ 2012-05-30 20:23     ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 236+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-05-30 20:23 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter

On Fri, May 25, 2012 at 07:02:10PM +0200, Andrea Arcangeli wrote:
> Implement generic version of the methods. They're used when
> CONFIG_AUTONUMA=n, and they're a noop.

I think you can roll that in the previous patch.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  include/asm-generic/pgtable.h |   12 ++++++++++++
>  1 files changed, 12 insertions(+), 0 deletions(-)
> 
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index fa596d9..780f707 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -521,6 +521,18 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
>  #endif
>  }
>  
> +#ifndef CONFIG_AUTONUMA
> +static inline int pte_numa(pte_t pte)
> +{
> +	return 0;
> +}
> +
> +static inline int pmd_numa(pmd_t pmd)
> +{
> +	return 0;
> +}
> +#endif /* CONFIG_AUTONUMA */
> +
>  #endif /* CONFIG_MMU */
>  
>  #endif /* !__ASSEMBLY__ */
> 

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 08/35] autonuma: introduce kthread_bind_node()
  2012-05-25 17:02   ` Andrea Arcangeli
@ 2012-05-30 20:26     ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 236+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-05-30 20:26 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter

On Fri, May 25, 2012 at 07:02:12PM +0200, Andrea Arcangeli wrote:
> This function makes it easy to bind the per-node knuma_migrated
> threads to their respective NUMA nodes. Those threads take memory from
> the other nodes (in round robin with an incoming queue for each remote
> node) and they move that memory to their local node.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  include/linux/kthread.h |    1 +
>  kernel/kthread.c        |   23 +++++++++++++++++++++++
>  2 files changed, 24 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/kthread.h b/include/linux/kthread.h
> index 0714b24..e733f97 100644
> --- a/include/linux/kthread.h
> +++ b/include/linux/kthread.h
> @@ -33,6 +33,7 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
>  })
>  
>  void kthread_bind(struct task_struct *k, unsigned int cpu);
> +void kthread_bind_node(struct task_struct *p, int nid);
>  int kthread_stop(struct task_struct *k);
>  int kthread_should_stop(void);
>  bool kthread_freezable_should_stop(bool *was_frozen);
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 3d3de63..48b36f9 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -234,6 +234,29 @@ void kthread_bind(struct task_struct *p, unsigned int cpu)
>  EXPORT_SYMBOL(kthread_bind);
>  
>  /**
> + * kthread_bind_node - bind a just-created kthread to the CPUs of a node.
> + * @p: thread created by kthread_create().
> + * @nid: node (might not be online, must be possible) for @p to run on.
> + *
> + * Description: This function is equivalent to set_cpus_allowed(),
> + * except that @nid doesn't need to be online, and the thread must be
> + * stopped (i.e., just returned from kthread_create()).
> + */
> +void kthread_bind_node(struct task_struct *p, int nid)
> +{
> +	/* Must have done schedule() in kthread() before we set_task_cpu */
> +	if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
> +		WARN_ON(1);
> +		return;
> +	}
> +
> +	/* It's safe because the task is inactive. */
> +	do_set_cpus_allowed(p, cpumask_of_node(nid));
> +	p->flags |= PF_THREAD_BOUND;
> +}
> +EXPORT_SYMBOL(kthread_bind_node);

_GPL?
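
For context, the intended call sequence (an untested sketch assuming
the usual kthread helpers; knuma_migrated here is just the threadfn
from this series):

	struct task_struct *t;

	t = kthread_create(knuma_migrated, NODE_DATA(nid),
			   "knuma_migrated%d", nid);
	if (!IS_ERR(t)) {
		/* pin the kthread to the CPUs of its NUMA node */
		kthread_bind_node(t, nid);
		wake_up_process(t);
	}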

> +
> +/**
>   * kthread_stop - stop a thread created by kthread_create().
>   * @k: thread created by kthread_create().
>   *
> 

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 18/35] autonuma: alloc/free/init sched_autonuma
  2012-05-25 17:02   ` Andrea Arcangeli
@ 2012-05-30 20:55     ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 236+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-05-30 20:55 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter

On Fri, May 25, 2012 at 07:02:22PM +0200, Andrea Arcangeli wrote:
> This is where the dynamically allocated sched_autonuma structure is
> being handled.
> 
> The reason for keeping this outside of the task_struct besides not
> using too much kernel stack, is to only allocate it on NUMA
> hardware. So not-NUMA hardware only pays the memory of a pointer
> in the kernel stack (which remains NULL at all times in that case).
> 
> If the kernel is compiled with CONFIG_AUTONUMA=n, not even the pointer
> is allocated on the kernel stack of course.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  kernel/fork.c |   24 ++++++++++++++----------
>  1 files changed, 14 insertions(+), 10 deletions(-)
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 237c34e..d323eb1 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -206,6 +206,7 @@ static void account_kernel_stack(struct thread_info *ti, int account)
>  void free_task(struct task_struct *tsk)
>  {
>  	account_kernel_stack(tsk->stack, -1);
> +	free_sched_autonuma(tsk);
>  	free_thread_info(tsk->stack);
>  	rt_mutex_debug_task_free(tsk);
>  	ftrace_graph_exit_task(tsk);
> @@ -260,6 +261,8 @@ void __init fork_init(unsigned long mempages)
>  	/* do the arch specific task caches init */
>  	arch_task_cache_init();
>  
> +	sched_autonuma_init();
> +
>  	/*
>  	 * The default maximum number of threads is set to a safe
>  	 * value: the thread structures can take up at most half
> @@ -292,21 +295,21 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
>  	struct thread_info *ti;
>  	unsigned long *stackend;
>  	int node = tsk_fork_get_node(orig);
> -	int err;
>  
>  	tsk = alloc_task_struct_node(node);
> -	if (!tsk)
> +	if (unlikely(!tsk))
>  		return NULL;
>  
>  	ti = alloc_thread_info_node(tsk, node);
> -	if (!ti) {
> -		free_task_struct(tsk);
> -		return NULL;
> -	}
> +	if (unlikely(!ti))

Should those "unlikely" have their own commit? Did you
run this with the likely/unlikely tracer to confirm that it
does give a speedup?


> +		goto out_task_struct;
>  
> -	err = arch_dup_task_struct(tsk, orig);
> -	if (err)
> -		goto out;
> +	if (unlikely(arch_dup_task_struct(tsk, orig)))
> +		goto out_thread_info;
> +
> +	if (unlikely(alloc_sched_autonuma(tsk, orig, node)))
> +		/* free_thread_info() undoes arch_dup_task_struct() too */
> +		goto out_thread_info;
>  
>  	tsk->stack = ti;
>  
> @@ -334,8 +337,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
>  
>  	return tsk;
>  
> -out:
> +out_thread_info:
>  	free_thread_info(ti);
> +out_task_struct:
>  	free_task_struct(tsk);
>  	return NULL;
>  }
> 

^ permalink raw reply	[flat|nested] 236+ messages in thread

* AutoNUMA15
  2012-05-29 15:43     ` Petr Holasek
@ 2012-05-31 18:08       ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-31 18:08 UTC (permalink / raw)
  To: Petr Holasek
  Cc: Kirill A. Shutemov, linux-kernel, linux-mm, Hillf Danton,
	Dan Smith, Peter Zijlstra, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

Hi,

On Tue, May 29, 2012 at 05:43:09PM +0200, Petr Holasek wrote:
> Similar problem with __autonuma_migrate_page_remove here. 
> 
> [ 1945.516632] ------------[ cut here ]------------
> [ 1945.516636] WARNING: at lib/list_debug.c:50 __list_del_entry+0x63/0xd0()
> [ 1945.516642] Hardware name: ProLiant DL585 G5   
> [ 1945.516651] list_del corruption, ffff88017d68b068->next is LIST_POISON1 (dead000000100100)
> [ 1945.516682] Modules linked in: ipt_MASQUERADE nf_conntrack_netbios_ns nf_conntrack_broadcast ip6table_mangle lockd ip6t_REJECT sunrpc nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables iptable_nat nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack mperf freq_table kvm_amd kvm pcspkr amd64_edac_mod edac_core serio_raw bnx2 microcode edac_mce_amd shpchp k10temp hpilo ipmi_si ipmi_msghandler hpwdt qla2xxx hpsa ata_generic pata_acpi scsi_transport_fc scsi_tgt cciss pata_amd radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core [last unloaded: scsi_wait_scan]
> [ 1945.516694] Pid: 150, comm: knuma_migrated0 Tainted: G        W    3.4.0aa_alpha+ #3
> [ 1945.516701] Call Trace:
> [ 1945.516710]  [<ffffffff8105788f>] warn_slowpath_common+0x7f/0xc0
> [ 1945.516717]  [<ffffffff81057986>] warn_slowpath_fmt+0x46/0x50
> [ 1945.516726]  [<ffffffff812f9713>] __list_del_entry+0x63/0xd0
> [ 1945.516735]  [<ffffffff812f9791>] list_del+0x11/0x40
> [ 1945.516743]  [<ffffffff81165b98>] __autonuma_migrate_page_remove+0x48/0x80
> [ 1945.516746]  [<ffffffff81165e66>] knuma_migrated+0x296/0x8a0
> [ 1945.516749]  [<ffffffff8107a200>] ? wake_up_bit+0x40/0x40
> [ 1945.516758]  [<ffffffff81165bd0>] ? __autonuma_migrate_page_remove+0x80/0x80
> [ 1945.516766]  [<ffffffff81079cc3>] kthread+0x93/0xa0
> [ 1945.516780]  [<ffffffff81626f24>] kernel_thread_helper+0x4/0x10
> [ 1945.516791]  [<ffffffff81079c30>] ? flush_kthread_worker+0x80/0x80
> [ 1945.516798]  [<ffffffff81626f20>] ? gs_change+0x13/0x13
> [ 1945.516800] ---[ end trace 7cab294af87bd79f ]---

I didn't manage to reproduce it on my hardware but it seems this was
caused by the autonuma_migrate_split_huge_page: the tail page list
linking wasn't surrounded by the compound lock to make list insertion
and migrate_nid setting atomic, as is done everywhere else (the
caller holding the lock on the head page wasn't enough to make the
tails stable too).
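
Schematically, the fix makes the tail linking look like this (just a
sketch with a placeholder helper name, not the literal diff):

	/* for each tail page while splitting the hugepage */
	compound_lock(page_tail);
	/* queue the tail on the right migrate list and set its
	 * migrate_nid atomically, like every other page */
	autonuma_migrate_page_add_tail(page_tail, nid); /* placeholder */
	compound_unlock(page_tail);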

I released an AutoNUMA15 branch that includes all pending fixes:

git clone --reference linux -b autonuma15 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 13/35] autonuma: add page structure fields
  2012-05-30 13:49               ` Andrea Arcangeli
@ 2012-05-31 18:18                 ` Peter Zijlstra
  -1 siblings, 0 replies; 236+ messages in thread
From: Peter Zijlstra @ 2012-05-31 18:18 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KOSAKI Motohiro, Rik van Riel, linux-kernel, linux-mm,
	Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Johannes Weiner, Srivatsa Vaddagiri,
	Christoph Lameter

On Wed, 2012-05-30 at 15:49 +0200, Andrea Arcangeli wrote:
> 
> I'm thinking about it but probably reducing the page_autonuma to one
> per pmd is going to be the simplest solution considering by default we
> only track the pmd anyway. 

Do also consider that some archs have larger base page size. So their
effective PMD size is increased as well.

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: AutoNUMA15
  2012-06-01  0:04           ` AutoNUMA15 Andrea Arcangeli
@ 2012-05-31 18:52             ` Don Morris
  0 siblings, 0 replies; 236+ messages in thread
From: Don Morris @ 2012-05-31 18:52 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm

On 05/31/2012 05:04 PM, Andrea Arcangeli wrote:
> On Fri, Jun 01, 2012 at 12:54:06AM +0200, Andrea Arcangeli wrote:
>> I'll push a fix in the origin/autonuma branch as soon as I figure it
>> out...
> 
> 7f9729f89000-7f9729faa000 rw-p 00000000 00:00 0
> 7f9729faa000-7f9729fea000 rw-s 000c0000 00:15 15364 /dev/mem
> 
> addr:00007f9729fc5000 vm_flags:08000875 anon_vma:          (null)
> mapping:ffff880430025410 index:b8
> 
> The reason for the false positive was that there are multiple vmas in
> the same pmd range, and I was passing the single vma belonging to the
> page fault address for all ptes in that pmd.
> 
> The vma is only used for that check, which is why it was harmless.
> 
> The vma found during page fault would have been valid for the pmd huge
> numa fixup and the pte numa fixup, but not for the less granular pmd
> numa fixup (not huge).
> 
> This also explains why echo 0 >/sys/kernel/mm/autonuma/knuma_scand/pmd
> avoided the warnings.
> 
> Can you test the below? I'll push the fix in the origin/autonuma branch.

Makes sense -- and the patch works as expected. No spurious warnings
or odd behaviors.

Thanks,
Don

> 
> Thanks!
> 
> ---
>  include/linux/autonuma.h |    4 ++--
>  mm/autonuma.c            |   19 +++++++++++++++++--
>  mm/memory.c              |    5 ++---
>  3 files changed, 21 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/autonuma.h b/include/linux/autonuma.h
> index b0a8d87..67af86a 100644
> --- a/include/linux/autonuma.h
> +++ b/include/linux/autonuma.h
> @@ -46,8 +46,8 @@ static inline void autonuma_free_page(struct page *page) {}
>  
>  extern pte_t __pte_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
>  			      unsigned long addr, pte_t pte, pte_t *ptep);
> -extern void __pmd_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
> -			     unsigned long addr, pmd_t *pmd);
> +extern void __pmd_numa_fixup(struct mm_struct *mm, unsigned long addr,
> +			     pmd_t *pmd);
>  extern void numa_hinting_fault(struct page *page, int numpages);
>  
>  #endif /* _LINUX_AUTONUMA_H */
> diff --git a/mm/autonuma.c b/mm/autonuma.c
> index d37647a..ca4c189 100644
> --- a/mm/autonuma.c
> +++ b/mm/autonuma.c
> @@ -349,14 +349,16 @@ pte_t __pte_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
>  	return pte;
>  }
>  
> -void __pmd_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
> +void __pmd_numa_fixup(struct mm_struct *mm,
>  		      unsigned long addr, pmd_t *pmdp)
>  {
>  	pmd_t pmd;
>  	pte_t *pte;
>  	unsigned long _addr = addr & PMD_MASK;
> +	unsigned long offset;
>  	spinlock_t *ptl;
>  	bool numa = false;
> +	struct vm_area_struct *vma;
>  
>  	spin_lock(&mm->page_table_lock);
>  	pmd = *pmdp;
> @@ -369,12 +371,25 @@ void __pmd_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (!numa)
>  		return;
>  
> +	vma = find_vma(mm, _addr);
> +	/* we're in a page fault so some vma must be in the range */
> +	BUG_ON(!vma);
> +	BUG_ON(vma->vm_start >= _addr + PMD_SIZE);
> +	offset = max(_addr, vma->vm_start) & ~PMD_MASK;
> +	VM_BUG_ON(offset >= PMD_SIZE);
>  	pte = pte_offset_map_lock(mm, pmdp, _addr, &ptl);
> -	for (addr = _addr; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
> +	pte += offset >> PAGE_SHIFT;
> +	for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
>  		pte_t pteval = *pte;
>  		struct page * page;
>  		if (!pte_present(pteval))
>  			continue;
> +		if (addr >= vma->vm_end) {
> +			vma = find_vma(mm, addr);
> +			/* there's a pte present so there must be a vma */
> +			BUG_ON(!vma);
> +			BUG_ON(addr < vma->vm_start);
> +		}
>  		if (pte_numa(pteval)) {
>  			pteval = pte_mknotnuma(pteval);
>  			set_pte_at(mm, addr, pte, pteval);
> diff --git a/mm/memory.c b/mm/memory.c
> index f46cf8a..bbf10c7 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3409,11 +3409,10 @@ static inline pte_t pte_numa_fixup(struct mm_struct *mm,
>  }
>  
>  static inline void pmd_numa_fixup(struct mm_struct *mm,
> -				  struct vm_area_struct *vma,
>  				  unsigned long addr, pmd_t *pmd)
>  {
>  	if (pmd_numa(*pmd))
> -		__pmd_numa_fixup(mm, vma, addr, pmd);
> +		__pmd_numa_fixup(mm, addr, pmd);
>  }
>  
>  static inline pmd_t huge_pmd_numa_fixup(struct mm_struct *mm,
> @@ -3552,7 +3551,7 @@ retry:
>  		}
>  	}
>  
> -	pmd_numa_fixup(mm, vma, address, pmd);
> +	pmd_numa_fixup(mm, address, pmd);
>  
>  	/*
>  	 * Use __pte_alloc instead of pte_alloc_map, because we can't
> .
> 


^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: AutoNUMA15
  2012-05-31 18:08       ` AutoNUMA15 Andrea Arcangeli
  (?)
@ 2012-05-31 20:01       ` Don Morris
  2012-05-31 22:54         ` AutoNUMA15 Andrea Arcangeli
  -1 siblings, 1 reply; 236+ messages in thread
From: Don Morris @ 2012-05-31 20:01 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm

[-- Attachment #1: Type: text/plain, Size: 7583 bytes --]

On 05/31/2012 11:08 AM, Andrea Arcangeli wrote:
> Hi,
> 
> On Tue, May 29, 2012 at 05:43:09PM +0200, Petr Holasek wrote:
>> Similar problem with __autonuma_migrate_page_remove here. 
>>
>> [ 1945.516632] ------------[ cut here ]------------
>> [ 1945.516636] WARNING: at lib/list_debug.c:50 __list_del_entry+0x63/0xd0()
>> [ 1945.516642] Hardware name: ProLiant DL585 G5   
>> [ 1945.516651] list_del corruption, ffff88017d68b068->next is LIST_POISON1 (dead000000100100)
>> [ 1945.516682] Modules linked in: ipt_MASQUERADE nf_conntrack_netbios_ns nf_conntrack_broadcast ip6table_mangle lockd ip6t_REJECT sunrpc nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables iptable_nat nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack mperf freq_table kvm_amd kvm pcspkr amd64_edac_mod edac_core serio_raw bnx2 microcode edac_mce_amd shpchp k10temp hpilo ipmi_si ipmi_msghandler hpwdt qla2xxx hpsa ata_generic pata_acpi scsi_transport_fc scsi_tgt cciss pata_amd radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core [last unloaded: scsi_wait_scan]
>> [ 1945.516694] Pid: 150, comm: knuma_migrated0 Tainted: G        W    3.4.0aa_alpha+ #3
>> [ 1945.516701] Call Trace:
>> [ 1945.516710]  [<ffffffff8105788f>] warn_slowpath_common+0x7f/0xc0
>> [ 1945.516717]  [<ffffffff81057986>] warn_slowpath_fmt+0x46/0x50
>> [ 1945.516726]  [<ffffffff812f9713>] __list_del_entry+0x63/0xd0
>> [ 1945.516735]  [<ffffffff812f9791>] list_del+0x11/0x40
>> [ 1945.516743]  [<ffffffff81165b98>] __autonuma_migrate_page_remove+0x48/0x80
>> [ 1945.516746]  [<ffffffff81165e66>] knuma_migrated+0x296/0x8a0
>> [ 1945.516749]  [<ffffffff8107a200>] ? wake_up_bit+0x40/0x40
>> [ 1945.516758]  [<ffffffff81165bd0>] ? __autonuma_migrate_page_remove+0x80/0x80
>> [ 1945.516766]  [<ffffffff81079cc3>] kthread+0x93/0xa0
>> [ 1945.516780]  [<ffffffff81626f24>] kernel_thread_helper+0x4/0x10
>> [ 1945.516791]  [<ffffffff81079c30>] ? flush_kthread_worker+0x80/0x80
>> [ 1945.516798]  [<ffffffff81626f20>] ? gs_change+0x13/0x13
>> [ 1945.516800] ---[ end trace 7cab294af87bd79f ]---
> 
> I didn't manage to reproduce it on my hardware but it seems this was
> caused by the autonuma_migrate_split_huge_page: the tail page list
> linking wasn't surrounded by the compound lock to make list insertion
> and migrate_nid setting atomic like it happens everywhere else (the
> caller holding the lock on the head page wasn't enough to make the
> tails stable too).
> 
> I released an AutoNUMA15 branch that includes all pending fixes:
> 
> git clone --reference linux -b autonuma15 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git

Got this from AutoNUMA14 as well, thought I'd just missed a fix... but
it is happening with AutoNUMA15 as well, now:

[  258.911975] BUG: Bad page map in process Xorg  pte:80000000c2e4626f
pmd:8208b2067
[  258.911979] addr:00007f495e3fe000 vm_flags:08100073
anon_vma:ffff88081d8bde10 mapping:ffff880821a249f0 index:14
[  258.911984] vma->vm_ops->fault: filemap_fault+0x0/0x4c0
[  258.911987] vma->vm_file->f_op->mmap: ext4_file_mmap+0x0/0x60
[  258.911990] Pid: 4707, comm: Xorg Tainted: G    B
3.4.0-09252-gebb5196 #1
[  258.911992] Call Trace:
[  258.911997]  [<ffffffff8116c754>] print_bad_pte+0x1d4/0x270
[  258.912003]  [<ffffffff811a117a>] ? mem_cgroup_count_vm_event+0x1a/0xd0
[  258.912015]  [<ffffffff8116d9e9>] vm_normal_page+0x69/0x80
[  258.912021]  [<ffffffff8118f1f4>] __pmd_numa_fixup+0x124/0x160
[  258.912026]  [<ffffffff81807e80>] ? do_page_fault+0xd0/0x570
[  258.912031]  [<ffffffff81171217>] handle_mm_fault+0x2b7/0x380
[  258.912037]  [<ffffffff81807f06>] do_page_fault+0x156/0x570
[  258.912042]  [<ffffffff810eb0b8>] ? lockdep_sys_exit+0x28/0x90
[  258.912052]  [<ffffffff8139ef09>] ? lockdep_sys_exit_thunk+0x35/0x67
[  258.912064]  [<ffffffff818049b5>] page_fault+0x25/0x30
[  258.912071] BUG: Bad page map in process Xorg  pte:80000000c2e4726f
pmd:8208b2067
[  258.912079] addr:00007f495e3ff000 vm_flags:08100073
anon_vma:ffff88081d8bde10 mapping:ffff880821a249f0 index:15
[  258.912085] vma->vm_ops->fault: filemap_fault+0x0/0x4c0
[  258.912088] vma->vm_file->f_op->mmap: ext4_file_mmap+0x0/0x60
[  258.912096] Pid: 4707, comm: Xorg Tainted: G    B
3.4.0-09252-gebb5196 #1
[  258.912098] Call Trace:
[  258.912107]  [<ffffffff8116c754>] print_bad_pte+0x1d4/0x270
[  258.912115]  [<ffffffff811a117a>] ? mem_cgroup_count_vm_event+0x1a/0xd0
[  258.912124]  [<ffffffff8116d9e9>] vm_normal_page+0x69/0x80
[  258.912137]  [<ffffffff8118f1f4>] __pmd_numa_fixup+0x124/0x160
[  258.912146]  [<ffffffff81807e80>] ? do_page_fault+0xd0/0x570
[  258.912156]  [<ffffffff81171217>] handle_mm_fault+0x2b7/0x380
[  258.912165]  [<ffffffff81807f06>] do_page_fault+0x156/0x570
[  258.912173]  [<ffffffff810eb0b8>] ? lockdep_sys_exit+0x28/0x90
[  258.912180]  [<ffffffff8139ef09>] ? lockdep_sys_exit_thunk+0x35/0x67
[  258.912186]  [<ffffffff818049b5>] page_fault+0x25/0x30

(repeats... a *lot*).

Config file attached. It seems to happen consistently in Xorg, and
likely for a good reason (see below).

If I turn off cgroups, I still get the problem on fault:

[  816.631466] BUG: Bad page map in process Xorg  pte:80000000c2e3526f
pmd:81e57f067
[  816.631473] addr:00007fa5ae495000 vm_flags:08000075 anon_vma:
  (null) mapping:ffff88049d3791b0 index:ffffffffffff4
[  816.631478] vma->vm_ops->fault: filemap_fault+0x0/0x4b0
[  816.631486] vma->vm_file->f_op->mmap: ext4_file_mmap+0x0/0x60
[  816.631490] Pid: 4711, comm: Xorg Tainted: G    B
3.4.0-09252-gebb5196 #3
[  816.631496] Call Trace:
[  816.631501]  [<ffffffff8116bae4>] print_bad_pte+0x1d4/0x270
[  816.631509]  [<ffffffff8116cd69>] vm_normal_page+0x69/0x80
[  816.631519]  [<ffffffff8118df14>] __pmd_numa_fixup+0x124/0x160
[  816.631529]  [<ffffffff817fdf80>] ? do_page_fault+0xd0/0x570
[  816.631538]  [<ffffffff81170447>] handle_mm_fault+0x2a7/0x360
[  816.631548]  [<ffffffff817fe006>] do_page_fault+0x156/0x570
[  816.631557]  [<ffffffff8139505d>] ? copy_user_generic_string+0x2d/0x40
[  816.631565]  [<ffffffff811b3561>] ?
poll_select_copy_remaining+0x101/0x170
[  816.631576]  [<ffffffff810eae58>] ? lockdep_sys_exit+0x28/0x90
[  816.631586]  [<ffffffff813967d9>] ? lockdep_sys_exit_thunk+0x35/0x67
[  816.631596]  [<ffffffff817faab5>] page_fault+0x25/0x30

From a little printk addition -- it looks like this is the
following block:

	if (HAVE_PTE_SPECIAL) {
		if (likely(!pte_special(pte)))
			goto check_pfn;
		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
			return NULL;
		if (!is_zero_pfn(pfn)) {
			print_bad_pte(vma, addr, pte, NULL);
		}
		return NULL;
	}

We have a special pte, but with a non-zero pfn. Being Xorg, I'm
assuming this is a remap of a kernel page into the user virtual
address space, but that's just a gut instinct. Since I read
the above as "we don't expect to ever take spurious faults
on instantiated special ptes", I would think you'd need
to either never migrate such special pages and have
__pmd_numa_fixup() skip them with an explicit check before
the vm_normal_page() call, or expand vm_normal_page() to
recognize this as a legal case. Since the NUMA bit/state
is cleared on the pmd/pte prior to this call, you'll
have to pass the NUMA-ness of the fault down by other means.
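
To illustrate the first option, here's a rough sketch of the kind
of check I mean in the pte walk -- the loop shape and the names
(pte, addr, end, vma, page) are just my assumptions about what
__pmd_numa_fixup() looks like, not its actual code:

	for (; addr < end; pte++, addr += PAGE_SIZE) {
		pte_t entry = *pte;

		if (!pte_present(entry))
			continue;
		/* never treat special mappings as migratable NUMA pages */
		if (pte_special(entry))
			continue;
		page = vm_normal_page(vma, addr, entry);
		if (!page)
			continue;
		/* ... existing NUMA hinting / migration logic ... */
	}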

Of course... I'm still really ramping up on this kernel, so
that could all be hokum, too. Hopefully it helps.

I can dump the EFI memory map and whatnot to you if you
need it, but I think this is more of an algorithmic issue
than a hardware corner case. 2 sockets, 6 cores per socket, HT
on, 32GB of RAM split a little asymmetrically (18GB / 14GB)
if it matters.

Thanks,
Don Morris




[-- Attachment #2: config_radeon --]
[-- Type: text/plain, Size: 81921 bytes --]

#
# Automatically generated file; DO NOT EDIT.
# Linux/x86_64 3.4.0 Kernel Configuration
#
CONFIG_64BIT=y
# CONFIG_X86_32 is not set
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_MMU=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_DEFAULT_IDLE=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_ARCH_HAS_CPU_AUTOPROBE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_HAVE_INTEL_TXT=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11"
CONFIG_ARCH_CPU_PROBE_RELEASE=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_HAVE_IRQ_WORK=y
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
# CONFIG_FHANDLE is not set
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
# CONFIG_TASK_XACCT is not set
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_WATCH=y
CONFIG_AUDIT_TREE=y
# CONFIG_AUDIT_LOGINUID_IMMUTABLE is not set
CONFIG_HAVE_GENERIC_HARDIRQS=y

#
# IRQ subsystem
#
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_PREEMPT_RCU is not set
CONFIG_RCU_FANOUT=64
CONFIG_RCU_FANOUT_LEAF=16
# CONFIG_RCU_FANOUT_EXACT is not set
# CONFIG_RCU_FAST_NO_HZ is not set
# CONFIG_TREE_RCU_TRACE is not set
CONFIG_IKCONFIG=m
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=18
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_CGROUP_FREEZER=y
# CONFIG_CGROUP_DEVICE is not set
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_RESOURCE_COUNTERS=y
CONFIG_CGROUP_MEM_RES_CTLR=y
CONFIG_CGROUP_MEM_RES_CTLR_SWAP=y
CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED=y
CONFIG_CGROUP_MEM_RES_CTLR_KMEM=y
CONFIG_CGROUP_PERF=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_CFS_BANDWIDTH=y
CONFIG_RT_GROUP_SCHED=y
CONFIG_BLK_CGROUP=y
# CONFIG_DEBUG_BLK_CGROUP is not set
# CONFIG_CHECKPOINT_RESTORE is not set
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_IPC_NS=y
CONFIG_PID_NS=y
CONFIG_NET_NS=y
CONFIG_SCHED_AUTOGROUP=y
CONFIG_MM_OWNER=y
# CONFIG_SYSFS_DEPRECATED is not set
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
CONFIG_RD_BZIP2=y
CONFIG_RD_LZMA=y
CONFIG_RD_XZ=y
CONFIG_RD_LZO=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
CONFIG_ANON_INODES=y
# CONFIG_EXPERT is not set
CONFIG_UID16=y
# CONFIG_SYSCTL_SYSCALL is not set
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_HAVE_PCSPKR_PLATFORM=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
# CONFIG_EMBEDDED is not set
CONFIG_HAVE_PERF_EVENTS=y

#
# Kernel Performance Events And Counters
#
CONFIG_PERF_EVENTS=y
# CONFIG_DEBUG_PERF_USE_VMALLOC is not set
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_PCI_QUIRKS=y
CONFIG_SLUB_DEBUG=y
# CONFIG_COMPAT_BRK is not set
# CONFIG_SLAB is not set
CONFIG_SLUB=y
CONFIG_PROFILING=y
CONFIG_TRACEPOINTS=y
CONFIG_OPROFILE=y
# CONFIG_OPROFILE_EVENT_MULTIPLEX is not set
CONFIG_HAVE_OPROFILE=y
CONFIG_OPROFILE_NMI_TIMER=y
CONFIG_KPROBES=y
CONFIG_JUMP_LABEL=y
CONFIG_OPTPROBES=y
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_KRETPROBES=y
CONFIG_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_OPTPROBES=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_HAVE_DMA_ATTRS=y
CONFIG_USE_GENERIC_SMP_HELPERS=y
CONFIG_GENERIC_SMP_IDLE_THREAD=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
CONFIG_HAVE_DMA_API_DEBUG=y
CONFIG_HAVE_HW_BREAKPOINT=y
CONFIG_HAVE_MIXED_BREAKPOINTS_REGS=y
CONFIG_HAVE_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_PERF_EVENTS_NMI=y
CONFIG_HAVE_ARCH_JUMP_LABEL=y
CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG=y
CONFIG_HAVE_ALIGNED_STRUCT_PAGE=y
CONFIG_HAVE_CMPXCHG_LOCAL=y
CONFIG_HAVE_CMPXCHG_DOUBLE=y
CONFIG_ARCH_WANT_OLD_COMPAT_IPC=y
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
CONFIG_SECCOMP_FILTER=y

#
# GCOV-based kernel profiling
#
# CONFIG_GCOV_KERNEL is not set
# CONFIG_HAVE_GENERIC_DMA_COHERENT is not set
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
# CONFIG_MODULE_FORCE_LOAD is not set
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_FORCE_UNLOAD=y
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
CONFIG_BLK_DEV_BSG=y
CONFIG_BLK_DEV_BSGLIB=y
CONFIG_BLK_DEV_INTEGRITY=y
# CONFIG_BLK_DEV_THROTTLING is not set

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
# CONFIG_ACORN_PARTITION is not set
# CONFIG_OSF_PARTITION is not set
# CONFIG_AMIGA_PARTITION is not set
# CONFIG_ATARI_PARTITION is not set
# CONFIG_MAC_PARTITION is not set
CONFIG_MSDOS_PARTITION=y
CONFIG_BSD_DISKLABEL=y
# CONFIG_MINIX_SUBPARTITION is not set
# CONFIG_SOLARIS_X86_PARTITION is not set
# CONFIG_UNIXWARE_DISKLABEL is not set
# CONFIG_LDM_PARTITION is not set
# CONFIG_SGI_PARTITION is not set
# CONFIG_ULTRIX_PARTITION is not set
# CONFIG_SUN_PARTITION is not set
# CONFIG_KARMA_PARTITION is not set
CONFIG_EFI_PARTITION=y
# CONFIG_SYSV68_PARTITION is not set
CONFIG_BLOCK_COMPAT=y

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
CONFIG_CFQ_GROUP_IOSCHED=y
# CONFIG_DEFAULT_DEADLINE is not set
# CONFIG_DEFAULT_CFQ is not set
CONFIG_DEFAULT_NOOP=y
CONFIG_DEFAULT_IOSCHED="noop"
CONFIG_PREEMPT_NOTIFIERS=y
# CONFIG_INLINE_SPIN_TRYLOCK is not set
# CONFIG_INLINE_SPIN_TRYLOCK_BH is not set
# CONFIG_INLINE_SPIN_LOCK is not set
# CONFIG_INLINE_SPIN_LOCK_BH is not set
# CONFIG_INLINE_SPIN_LOCK_IRQ is not set
# CONFIG_INLINE_SPIN_LOCK_IRQSAVE is not set
CONFIG_UNINLINE_SPIN_UNLOCK=y
# CONFIG_INLINE_SPIN_UNLOCK_BH is not set
# CONFIG_INLINE_SPIN_UNLOCK_IRQ is not set
# CONFIG_INLINE_SPIN_UNLOCK_IRQRESTORE is not set
# CONFIG_INLINE_READ_TRYLOCK is not set
# CONFIG_INLINE_READ_LOCK is not set
# CONFIG_INLINE_READ_LOCK_BH is not set
# CONFIG_INLINE_READ_LOCK_IRQ is not set
# CONFIG_INLINE_READ_LOCK_IRQSAVE is not set
# CONFIG_INLINE_READ_UNLOCK is not set
# CONFIG_INLINE_READ_UNLOCK_BH is not set
# CONFIG_INLINE_READ_UNLOCK_IRQ is not set
# CONFIG_INLINE_READ_UNLOCK_IRQRESTORE is not set
# CONFIG_INLINE_WRITE_TRYLOCK is not set
# CONFIG_INLINE_WRITE_LOCK is not set
# CONFIG_INLINE_WRITE_LOCK_BH is not set
# CONFIG_INLINE_WRITE_LOCK_IRQ is not set
# CONFIG_INLINE_WRITE_LOCK_IRQSAVE is not set
# CONFIG_INLINE_WRITE_UNLOCK is not set
# CONFIG_INLINE_WRITE_UNLOCK_BH is not set
# CONFIG_INLINE_WRITE_UNLOCK_IRQ is not set
# CONFIG_INLINE_WRITE_UNLOCK_IRQRESTORE is not set
# CONFIG_MUTEX_SPIN_ON_OWNER is not set
CONFIG_FREEZER=y

#
# Processor type and features
#
CONFIG_ZONE_DMA=y
CONFIG_SMP=y
CONFIG_X86_MPPARSE=y
# CONFIG_X86_EXTENDED_PLATFORM is not set
CONFIG_X86_SUPPORTS_MEMORY_FAILURE=y
CONFIG_SCHED_OMIT_FRAME_POINTER=y
CONFIG_PARAVIRT_GUEST=y
CONFIG_PARAVIRT_TIME_ACCOUNTING=y
CONFIG_XEN=y
CONFIG_XEN_DOM0=y
CONFIG_XEN_PRIVILEGED_GUEST=y
CONFIG_XEN_PVHVM=y
CONFIG_XEN_MAX_DOMAIN_MEMORY=500
CONFIG_XEN_SAVE_RESTORE=y
CONFIG_XEN_DEBUG_FS=y
CONFIG_KVM_CLOCK=y
CONFIG_KVM_GUEST=y
CONFIG_PARAVIRT=y
CONFIG_PARAVIRT_SPINLOCKS=y
CONFIG_PARAVIRT_CLOCK=y
# CONFIG_PARAVIRT_DEBUG is not set
CONFIG_NO_BOOTMEM=y
# CONFIG_MEMTEST is not set
# CONFIG_MK8 is not set
# CONFIG_MPSC is not set
CONFIG_MCORE2=y
# CONFIG_MATOM is not set
# CONFIG_GENERIC_CPU is not set
CONFIG_X86_INTERNODE_CACHE_SHIFT=6
CONFIG_X86_CMPXCHG=y
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_XADD=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INTEL_USERCOPY=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_P6_NOP=y
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_CENTAUR=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
CONFIG_GART_IOMMU=y
# CONFIG_CALGARY_IOMMU is not set
CONFIG_SWIOTLB=y
CONFIG_IOMMU_HELPER=y
# CONFIG_MAXSMP is not set
CONFIG_NR_CPUS=24
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
CONFIG_IRQ_TIME_ACCOUNTING=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_COUNT=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
CONFIG_X86_MCE_THRESHOLD=y
# CONFIG_X86_MCE_INJECT is not set
CONFIG_X86_THERMAL_VECTOR=y
# CONFIG_I8K is not set
CONFIG_MICROCODE=y
CONFIG_MICROCODE_INTEL=y
CONFIG_MICROCODE_AMD=y
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
CONFIG_ARCH_PHYS_ADDR_T_64BIT=y
CONFIG_ARCH_DMA_ADDR_T_64BIT=y
CONFIG_DIRECT_GBPAGES=y
CONFIG_NUMA=y
# CONFIG_AMD_NUMA is not set
CONFIG_X86_64_ACPI_NUMA=y
CONFIG_NODES_SPAN_OTHER_NODES=y
# CONFIG_NUMA_EMU is not set
CONFIG_NODES_SHIFT=1
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_ARCH_PROC_KCORE_TEXT=y
CONFIG_ILLEGAL_POINTER_VALUE=0xdead000000000000
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_SPARSEMEM_MANUAL=y
CONFIG_SPARSEMEM=y
CONFIG_NEED_MULTIPLE_NODES=y
CONFIG_HAVE_MEMORY_PRESENT=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=y
CONFIG_SPARSEMEM_VMEMMAP=y
CONFIG_HAVE_MEMBLOCK=y
CONFIG_HAVE_MEMBLOCK_NODE_MAP=y
CONFIG_ARCH_DISCARD_MEMBLOCK=y
# CONFIG_MEMORY_HOTPLUG is not set
CONFIG_PAGEFLAGS_EXTENDED=y
CONFIG_SPLIT_PTLOCK_CPUS=999999
CONFIG_COMPACTION=y
CONFIG_MIGRATION=y
CONFIG_AUTONUMA=y
CONFIG_AUTONUMA_DEFAULT_ENABLED=y
CONFIG_PHYS_ADDR_T_64BIT=y
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
CONFIG_MMU_NOTIFIER=y
CONFIG_KSM=y
CONFIG_DEFAULT_MMAP_MIN_ADDR=4096
CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y
# CONFIG_MEMORY_FAILURE is not set
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
# CONFIG_TRANSPARENT_HUGEPAGE_MADVISE is not set
CONFIG_CROSS_MEMORY_ATTACH=y
# CONFIG_CLEANCACHE is not set
CONFIG_X86_CHECK_BIOS_CORRUPTION=y
CONFIG_X86_BOOTPARAM_MEMORY_CORRUPTION_CHECK=y
CONFIG_X86_RESERVE_LOW=64
CONFIG_MTRR=y
# CONFIG_MTRR_SANITIZER is not set
CONFIG_X86_PAT=y
CONFIG_ARCH_USES_PG_UNCACHED=y
CONFIG_ARCH_RANDOM=y
CONFIG_EFI=y
# CONFIG_EFI_STUB is not set
CONFIG_SECCOMP=y
# CONFIG_CC_STACKPROTECTOR is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
CONFIG_SCHED_HRTICK=y
CONFIG_KEXEC=y
CONFIG_CRASH_DUMP=y
# CONFIG_KEXEC_JUMP is not set
CONFIG_PHYSICAL_START=0x1000000
CONFIG_RELOCATABLE=y
CONFIG_PHYSICAL_ALIGN=0x1000000
CONFIG_HOTPLUG_CPU=y
# CONFIG_COMPAT_VDSO is not set
# CONFIG_CMDLINE_BOOL is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_USE_PERCPU_NUMA_NODE_ID=y

#
# Power management and ACPI options
#
CONFIG_ARCH_HIBERNATION_HEADER=y
CONFIG_SUSPEND=y
CONFIG_SUSPEND_FREEZER=y
CONFIG_HIBERNATE_CALLBACKS=y
CONFIG_HIBERNATION=y
CONFIG_PM_STD_PARTITION=""
CONFIG_PM_SLEEP=y
CONFIG_PM_SLEEP_SMP=y
# CONFIG_PM_AUTOSLEEP is not set
# CONFIG_PM_WAKELOCKS is not set
CONFIG_PM_RUNTIME=y
CONFIG_PM=y
CONFIG_PM_DEBUG=y
# CONFIG_PM_ADVANCED_DEBUG is not set
# CONFIG_PM_TEST_SUSPEND is not set
CONFIG_CAN_PM_TRACE=y
CONFIG_PM_TRACE=y
CONFIG_PM_TRACE_RTC=y
CONFIG_ACPI=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_PROCFS=y
# CONFIG_ACPI_PROCFS_POWER is not set
# CONFIG_ACPI_EC_DEBUGFS is not set
CONFIG_ACPI_PROC_EVENT=y
CONFIG_ACPI_AC=y
CONFIG_ACPI_BATTERY=y
CONFIG_ACPI_BUTTON=y
CONFIG_ACPI_VIDEO=y
CONFIG_ACPI_FAN=y
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_HOTPLUG_CPU=y
CONFIG_ACPI_PROCESSOR_AGGREGATOR=y
CONFIG_ACPI_THERMAL=y
CONFIG_ACPI_NUMA=y
# CONFIG_ACPI_CUSTOM_DSDT is not set
CONFIG_ACPI_BLACKLIST_YEAR=0
# CONFIG_ACPI_DEBUG is not set
# CONFIG_ACPI_PCI_SLOT is not set
CONFIG_X86_PM_TIMER=y
CONFIG_ACPI_CONTAINER=y
CONFIG_ACPI_SBS=y
CONFIG_ACPI_HED=y
# CONFIG_ACPI_CUSTOM_METHOD is not set
CONFIG_ACPI_BGRT=y
CONFIG_ACPI_APEI=y
CONFIG_ACPI_APEI_GHES=y
CONFIG_ACPI_APEI_PCIEAER=y
# CONFIG_ACPI_APEI_EINJ is not set
# CONFIG_ACPI_APEI_ERST_DEBUG is not set
# CONFIG_SFI is not set

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=y
CONFIG_CPU_FREQ_STAT=y
CONFIG_CPU_FREQ_STAT_DETAILS=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
# CONFIG_CPU_FREQ_GOV_POWERSAVE is not set
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
# CONFIG_CPU_FREQ_GOV_CONSERVATIVE is not set

#
# x86 CPU frequency scaling drivers
#
CONFIG_X86_PCC_CPUFREQ=m
CONFIG_X86_ACPI_CPUFREQ=y
# CONFIG_X86_POWERNOW_K8 is not set
# CONFIG_X86_SPEEDSTEP_CENTRINO is not set
# CONFIG_X86_P4_CLOCKMOD is not set

#
# shared options
#
# CONFIG_X86_SPEEDSTEP_LIB is not set
CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y
CONFIG_CPU_IDLE_GOV_MENU=y
CONFIG_INTEL_IDLE=y

#
# Memory power savings
#
# CONFIG_I7300_IDLE is not set

#
# Bus options (PCI etc.)
#
CONFIG_PCI=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_PCI_XEN=y
CONFIG_PCI_DOMAINS=y
# CONFIG_PCI_CNB20LE_QUIRK is not set
CONFIG_PCIEPORTBUS=y
CONFIG_PCIEAER=y
# CONFIG_PCIE_ECRC is not set
# CONFIG_PCIEAER_INJECT is not set
CONFIG_PCIEASPM=y
# CONFIG_PCIEASPM_DEBUG is not set
CONFIG_PCIEASPM_DEFAULT=y
# CONFIG_PCIEASPM_POWERSAVE is not set
# CONFIG_PCIEASPM_PERFORMANCE is not set
CONFIG_PCIE_PME=y
CONFIG_ARCH_SUPPORTS_MSI=y
CONFIG_PCI_MSI=y
# CONFIG_PCI_DEBUG is not set
# CONFIG_PCI_REALLOC_ENABLE_AUTO is not set
# CONFIG_PCI_STUB is not set
CONFIG_XEN_PCIDEV_FRONTEND=y
CONFIG_HT_IRQ=y
CONFIG_PCI_ATS=y
CONFIG_PCI_IOV=y
CONFIG_PCI_PRI=y
CONFIG_PCI_PASID=y
# CONFIG_PCI_IOAPIC is not set
CONFIG_PCI_LABEL=y
CONFIG_ISA_DMA_API=y
CONFIG_AMD_NB=y
# CONFIG_PCCARD is not set
# CONFIG_HOTPLUG_PCI is not set
# CONFIG_RAPIDIO is not set

#
# Executable file formats / Emulations
#
CONFIG_BINFMT_ELF=y
CONFIG_COMPAT_BINFMT_ELF=y
CONFIG_ARCH_BINFMT_ELF_RANDOMIZE_PIE=y
CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS=y
# CONFIG_HAVE_AOUT is not set
CONFIG_BINFMT_MISC=y
CONFIG_IA32_EMULATION=y
# CONFIG_IA32_AOUT is not set
# CONFIG_X86_X32 is not set
CONFIG_COMPAT=y
CONFIG_COMPAT_FOR_U64_ALIGNMENT=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_KEYS_COMPAT=y
CONFIG_HAVE_TEXT_POKE_SMP=y
CONFIG_X86_DEV_DMA_OPS=y
CONFIG_NET=y
CONFIG_COMPAT_NETLINK_MESSAGES=y

#
# Networking options
#
CONFIG_PACKET=y
CONFIG_UNIX=y
# CONFIG_UNIX_DIAG is not set
CONFIG_XFRM=y
CONFIG_XFRM_ALGO=y
CONFIG_XFRM_USER=y
# CONFIG_XFRM_SUB_POLICY is not set
# CONFIG_XFRM_MIGRATE is not set
# CONFIG_XFRM_STATISTICS is not set
# CONFIG_NET_KEY is not set
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
# CONFIG_IP_FIB_TRIE_STATS is not set
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_MULTIPATH=y
CONFIG_IP_ROUTE_VERBOSE=y
CONFIG_IP_PNP=y
CONFIG_IP_PNP_DHCP=y
CONFIG_IP_PNP_BOOTP=y
CONFIG_IP_PNP_RARP=y
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE_DEMUX is not set
CONFIG_IP_MROUTE=y
# CONFIG_IP_MROUTE_MULTIPLE_TABLES is not set
CONFIG_IP_PIMSM_V1=y
CONFIG_IP_PIMSM_V2=y
# CONFIG_ARPD is not set
CONFIG_SYN_COOKIES=y
# CONFIG_INET_AH is not set
# CONFIG_INET_ESP is not set
# CONFIG_INET_IPCOMP is not set
# CONFIG_INET_XFRM_TUNNEL is not set
CONFIG_INET_TUNNEL=y
# CONFIG_INET_XFRM_MODE_TRANSPORT is not set
# CONFIG_INET_XFRM_MODE_TUNNEL is not set
# CONFIG_INET_XFRM_MODE_BEET is not set
CONFIG_INET_LRO=y
# CONFIG_INET_DIAG is not set
CONFIG_TCP_CONG_ADVANCED=y
# CONFIG_TCP_CONG_BIC is not set
CONFIG_TCP_CONG_CUBIC=y
# CONFIG_TCP_CONG_WESTWOOD is not set
# CONFIG_TCP_CONG_HTCP is not set
# CONFIG_TCP_CONG_HSTCP is not set
# CONFIG_TCP_CONG_HYBLA is not set
# CONFIG_TCP_CONG_VEGAS is not set
# CONFIG_TCP_CONG_SCALABLE is not set
# CONFIG_TCP_CONG_LP is not set
# CONFIG_TCP_CONG_VENO is not set
# CONFIG_TCP_CONG_YEAH is not set
# CONFIG_TCP_CONG_ILLINOIS is not set
CONFIG_DEFAULT_CUBIC=y
# CONFIG_DEFAULT_RENO is not set
CONFIG_DEFAULT_TCP_CONG="cubic"
CONFIG_TCP_MD5SIG=y
CONFIG_IPV6=y
# CONFIG_IPV6_PRIVACY is not set
# CONFIG_IPV6_ROUTER_PREF is not set
# CONFIG_IPV6_OPTIMISTIC_DAD is not set
CONFIG_INET6_AH=y
CONFIG_INET6_ESP=y
# CONFIG_INET6_IPCOMP is not set
# CONFIG_IPV6_MIP6 is not set
# CONFIG_INET6_XFRM_TUNNEL is not set
# CONFIG_INET6_TUNNEL is not set
CONFIG_INET6_XFRM_MODE_TRANSPORT=y
CONFIG_INET6_XFRM_MODE_TUNNEL=y
CONFIG_INET6_XFRM_MODE_BEET=y
# CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION is not set
CONFIG_IPV6_SIT=y
# CONFIG_IPV6_SIT_6RD is not set
CONFIG_IPV6_NDISC_NODETYPE=y
# CONFIG_IPV6_TUNNEL is not set
# CONFIG_IPV6_MULTIPLE_TABLES is not set
# CONFIG_IPV6_MROUTE is not set
CONFIG_NETLABEL=y
CONFIG_NETWORK_SECMARK=y
# CONFIG_NETWORK_PHY_TIMESTAMPING is not set
CONFIG_NETFILTER=y
# CONFIG_NETFILTER_DEBUG is not set
# CONFIG_NETFILTER_ADVANCED is not set

#
# Core Netfilter Configuration
#
CONFIG_NETFILTER_NETLINK=y
CONFIG_NETFILTER_NETLINK_LOG=y
CONFIG_NF_CONNTRACK=y
CONFIG_NF_CONNTRACK_SECMARK=y
CONFIG_NF_CONNTRACK_PROCFS=y
CONFIG_NF_CONNTRACK_FTP=y
CONFIG_NF_CONNTRACK_IRC=y
# CONFIG_NF_CONNTRACK_NETBIOS_NS is not set
CONFIG_NF_CONNTRACK_SIP=y
CONFIG_NF_CT_NETLINK=y
CONFIG_NETFILTER_XTABLES=y

#
# Xtables combined modules
#
CONFIG_NETFILTER_XT_MARK=m

#
# Xtables targets
#
CONFIG_NETFILTER_XT_TARGET_CONNSECMARK=y
CONFIG_NETFILTER_XT_TARGET_LOG=m
CONFIG_NETFILTER_XT_TARGET_NFLOG=y
CONFIG_NETFILTER_XT_TARGET_SECMARK=y
CONFIG_NETFILTER_XT_TARGET_TCPMSS=y

#
# Xtables matches
#
CONFIG_NETFILTER_XT_MATCH_CONNTRACK=y
CONFIG_NETFILTER_XT_MATCH_POLICY=y
CONFIG_NETFILTER_XT_MATCH_STATE=y
# CONFIG_IP_SET is not set
# CONFIG_IP_VS is not set

#
# IP: Netfilter Configuration
#
CONFIG_NF_DEFRAG_IPV4=y
CONFIG_NF_CONNTRACK_IPV4=y
CONFIG_NF_CONNTRACK_PROC_COMPAT=y
CONFIG_IP_NF_IPTABLES=y
CONFIG_IP_NF_FILTER=y
CONFIG_IP_NF_TARGET_REJECT=y
CONFIG_IP_NF_TARGET_ULOG=y
CONFIG_NF_NAT=y
CONFIG_NF_NAT_NEEDED=y
CONFIG_IP_NF_TARGET_MASQUERADE=y
CONFIG_NF_NAT_FTP=y
CONFIG_NF_NAT_IRC=y
# CONFIG_NF_NAT_TFTP is not set
# CONFIG_NF_NAT_AMANDA is not set
# CONFIG_NF_NAT_PPTP is not set
# CONFIG_NF_NAT_H323 is not set
CONFIG_NF_NAT_SIP=y
CONFIG_IP_NF_MANGLE=y
# CONFIG_IP_NF_RAW is not set

#
# IPv6: Netfilter Configuration
#
CONFIG_NF_DEFRAG_IPV6=y
CONFIG_NF_CONNTRACK_IPV6=y
CONFIG_IP6_NF_IPTABLES=y
CONFIG_IP6_NF_MATCH_IPV6HEADER=y
CONFIG_IP6_NF_FILTER=y
CONFIG_IP6_NF_TARGET_REJECT=y
CONFIG_IP6_NF_MANGLE=y
# CONFIG_IP6_NF_RAW is not set
# CONFIG_BRIDGE_NF_EBTABLES is not set
# CONFIG_IP_DCCP is not set
CONFIG_IP_SCTP=m
# CONFIG_NET_SCTPPROBE is not set
# CONFIG_SCTP_DBG_MSG is not set
# CONFIG_SCTP_DBG_OBJCNT is not set
CONFIG_SCTP_HMAC_NONE=y
# CONFIG_SCTP_HMAC_SHA1 is not set
# CONFIG_SCTP_HMAC_MD5 is not set
# CONFIG_RDS is not set
# CONFIG_TIPC is not set
# CONFIG_ATM is not set
# CONFIG_L2TP is not set
CONFIG_STP=m
CONFIG_BRIDGE=m
CONFIG_BRIDGE_IGMP_SNOOPING=y
# CONFIG_NET_DSA is not set
# CONFIG_VLAN_8021Q is not set
# CONFIG_DECNET is not set
CONFIG_LLC=m
# CONFIG_LLC2 is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_WAN_ROUTER is not set
# CONFIG_PHONET is not set
# CONFIG_IEEE802154 is not set
CONFIG_NET_SCHED=y

#
# Queueing/Scheduling
#
# CONFIG_NET_SCH_CBQ is not set
# CONFIG_NET_SCH_HTB is not set
# CONFIG_NET_SCH_HFSC is not set
# CONFIG_NET_SCH_PRIO is not set
# CONFIG_NET_SCH_MULTIQ is not set
# CONFIG_NET_SCH_RED is not set
# CONFIG_NET_SCH_SFB is not set
# CONFIG_NET_SCH_SFQ is not set
# CONFIG_NET_SCH_TEQL is not set
# CONFIG_NET_SCH_TBF is not set
# CONFIG_NET_SCH_GRED is not set
# CONFIG_NET_SCH_DSMARK is not set
# CONFIG_NET_SCH_NETEM is not set
# CONFIG_NET_SCH_DRR is not set
# CONFIG_NET_SCH_MQPRIO is not set
# CONFIG_NET_SCH_CHOKE is not set
# CONFIG_NET_SCH_QFQ is not set
# CONFIG_NET_SCH_CODEL is not set
# CONFIG_NET_SCH_FQ_CODEL is not set
# CONFIG_NET_SCH_INGRESS is not set
# CONFIG_NET_SCH_PLUG is not set

#
# Classification
#
CONFIG_NET_CLS=y
# CONFIG_NET_CLS_BASIC is not set
# CONFIG_NET_CLS_TCINDEX is not set
# CONFIG_NET_CLS_ROUTE4 is not set
# CONFIG_NET_CLS_FW is not set
# CONFIG_NET_CLS_U32 is not set
# CONFIG_NET_CLS_RSVP is not set
# CONFIG_NET_CLS_RSVP6 is not set
# CONFIG_NET_CLS_FLOW is not set
# CONFIG_NET_CLS_CGROUP is not set
CONFIG_NET_EMATCH=y
CONFIG_NET_EMATCH_STACK=32
# CONFIG_NET_EMATCH_CMP is not set
# CONFIG_NET_EMATCH_NBYTE is not set
# CONFIG_NET_EMATCH_U32 is not set
# CONFIG_NET_EMATCH_META is not set
# CONFIG_NET_EMATCH_TEXT is not set
CONFIG_NET_CLS_ACT=y
# CONFIG_NET_ACT_POLICE is not set
# CONFIG_NET_ACT_GACT is not set
# CONFIG_NET_ACT_MIRRED is not set
# CONFIG_NET_ACT_IPT is not set
# CONFIG_NET_ACT_NAT is not set
# CONFIG_NET_ACT_PEDIT is not set
# CONFIG_NET_ACT_SIMP is not set
# CONFIG_NET_ACT_SKBEDIT is not set
# CONFIG_NET_ACT_CSUM is not set
CONFIG_NET_SCH_FIFO=y
# CONFIG_DCB is not set
CONFIG_DNS_RESOLVER=y
# CONFIG_BATMAN_ADV is not set
CONFIG_OPENVSWITCH=m
CONFIG_RPS=y
CONFIG_RFS_ACCEL=y
CONFIG_XPS=y
# CONFIG_NETPRIO_CGROUP is not set
CONFIG_BQL=y
# CONFIG_BPF_JIT is not set

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_NET_TCPPROBE is not set
# CONFIG_NET_DROP_MONITOR is not set
# CONFIG_HAMRADIO is not set
# CONFIG_CAN is not set
# CONFIG_IRDA is not set
# CONFIG_BT is not set
# CONFIG_AF_RXRPC is not set
CONFIG_FIB_RULES=y
CONFIG_WIRELESS=y
CONFIG_WEXT_CORE=y
CONFIG_WEXT_PROC=y
CONFIG_CFG80211=y
# CONFIG_NL80211_TESTMODE is not set
# CONFIG_CFG80211_DEVELOPER_WARNINGS is not set
# CONFIG_CFG80211_REG_DEBUG is not set
CONFIG_CFG80211_DEFAULT_PS=y
# CONFIG_CFG80211_DEBUGFS is not set
# CONFIG_CFG80211_INTERNAL_REGDB is not set
CONFIG_CFG80211_WEXT=y
# CONFIG_WIRELESS_EXT_SYSFS is not set
# CONFIG_LIB80211 is not set
CONFIG_MAC80211=y
CONFIG_MAC80211_HAS_RC=y
CONFIG_MAC80211_RC_MINSTREL=y
CONFIG_MAC80211_RC_MINSTREL_HT=y
CONFIG_MAC80211_RC_DEFAULT_MINSTREL=y
CONFIG_MAC80211_RC_DEFAULT="minstrel_ht"
# CONFIG_MAC80211_MESH is not set
# CONFIG_MAC80211_DEBUGFS is not set
# CONFIG_MAC80211_DEBUG_MENU is not set
# CONFIG_WIMAX is not set
# CONFIG_RFKILL is not set
# CONFIG_NET_9P is not set
# CONFIG_CAIF is not set
# CONFIG_CEPH_LIB is not set
# CONFIG_NFC is not set
CONFIG_HAVE_BPF_JIT=y

#
# Device Drivers
#

#
# Generic Driver Options
#
CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
# CONFIG_DEVTMPFS is not set
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
CONFIG_FIRMWARE_IN_KERNEL=y
CONFIG_EXTRA_FIRMWARE=""
# CONFIG_DEBUG_DRIVER is not set
CONFIG_DEBUG_DEVRES=y
CONFIG_SYS_HYPERVISOR=y
# CONFIG_GENERIC_CPU_DEVICES is not set
CONFIG_DMA_SHARED_BUFFER=y
CONFIG_CONNECTOR=y
CONFIG_PROC_EVENTS=y
# CONFIG_MTD is not set
# CONFIG_PARPORT is not set
CONFIG_PNP=y
CONFIG_PNP_DEBUG_MESSAGES=y

#
# Protocols
#
CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
# CONFIG_BLK_DEV_FD is not set
# CONFIG_BLK_DEV_PCIESSD_MTIP32XX is not set
# CONFIG_BLK_CPQ_DA is not set
# CONFIG_BLK_CPQ_CISS_DA is not set
# CONFIG_BLK_DEV_DAC960 is not set
# CONFIG_BLK_DEV_UMEM is not set
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=y
CONFIG_BLK_DEV_LOOP_MIN_COUNT=8
# CONFIG_BLK_DEV_CRYPTOLOOP is not set
# CONFIG_BLK_DEV_DRBD is not set
# CONFIG_BLK_DEV_NBD is not set
# CONFIG_BLK_DEV_NVME is not set
# CONFIG_BLK_DEV_SX8 is not set
# CONFIG_BLK_DEV_UB is not set
CONFIG_BLK_DEV_RAM=y
CONFIG_BLK_DEV_RAM_COUNT=16
CONFIG_BLK_DEV_RAM_SIZE=16384
# CONFIG_BLK_DEV_XIP is not set
CONFIG_CDROM_PKTCDVD=y
CONFIG_CDROM_PKTCDVD_BUFFERS=8
CONFIG_CDROM_PKTCDVD_WCACHE=y
# CONFIG_ATA_OVER_ETH is not set
CONFIG_XEN_BLKDEV_FRONTEND=y
# CONFIG_XEN_BLKDEV_BACKEND is not set
# CONFIG_VIRTIO_BLK is not set
# CONFIG_BLK_DEV_HD is not set
# CONFIG_BLK_DEV_RBD is not set

#
# Misc devices
#
# CONFIG_SENSORS_LIS3LV02D is not set
# CONFIG_AD525X_DPOT is not set
# CONFIG_IBM_ASM is not set
# CONFIG_PHANTOM is not set
# CONFIG_INTEL_MID_PTI is not set
# CONFIG_SGI_IOC4 is not set
# CONFIG_TIFM_CORE is not set
# CONFIG_ICS932S401 is not set
# CONFIG_ENCLOSURE_SERVICES is not set
# CONFIG_HP_ILO is not set
# CONFIG_APDS9802ALS is not set
# CONFIG_ISL29003 is not set
# CONFIG_ISL29020 is not set
# CONFIG_SENSORS_TSL2550 is not set
# CONFIG_SENSORS_BH1780 is not set
# CONFIG_SENSORS_BH1770 is not set
# CONFIG_SENSORS_APDS990X is not set
# CONFIG_HMC6352 is not set
# CONFIG_DS1682 is not set
# CONFIG_VMWARE_BALLOON is not set
# CONFIG_BMP085_I2C is not set
# CONFIG_PCH_PHUB is not set
# CONFIG_USB_SWITCH_FSA9480 is not set
# CONFIG_C2PORT is not set

#
# EEPROM support
#
# CONFIG_EEPROM_AT24 is not set
# CONFIG_EEPROM_LEGACY is not set
# CONFIG_EEPROM_MAX6875 is not set
# CONFIG_EEPROM_93CX6 is not set
# CONFIG_CB710_CORE is not set

#
# Texas Instruments shared transport line discipline
#
# CONFIG_SENSORS_LIS3_I2C is not set

#
# Altera FPGA firmware download module
#
# CONFIG_ALTERA_STAPL is not set
CONFIG_HAVE_IDE=y
# CONFIG_IDE is not set

#
# SCSI device support
#
CONFIG_SCSI_MOD=y
# CONFIG_RAID_ATTRS is not set
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
# CONFIG_SCSI_TGT is not set
# CONFIG_SCSI_NETLINK is not set
CONFIG_SCSI_PROC_FS=y

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
# CONFIG_CHR_DEV_ST is not set
# CONFIG_CHR_DEV_OSST is not set
CONFIG_BLK_DEV_SR=y
CONFIG_BLK_DEV_SR_VENDOR=y
CONFIG_CHR_DEV_SG=y
# CONFIG_CHR_DEV_SCH is not set
# CONFIG_SCSI_MULTI_LUN is not set
CONFIG_SCSI_CONSTANTS=y
# CONFIG_SCSI_LOGGING is not set
# CONFIG_SCSI_SCAN_ASYNC is not set
CONFIG_SCSI_WAIT_SCAN=m

#
# SCSI Transports
#
CONFIG_SCSI_SPI_ATTRS=y
# CONFIG_SCSI_FC_ATTRS is not set
# CONFIG_SCSI_ISCSI_ATTRS is not set
# CONFIG_SCSI_SAS_ATTRS is not set
# CONFIG_SCSI_SAS_LIBSAS is not set
# CONFIG_SCSI_SRP_ATTRS is not set
# CONFIG_SCSI_LOWLEVEL is not set
# CONFIG_SCSI_DH is not set
# CONFIG_SCSI_OSD_INITIATOR is not set
CONFIG_ATA=y
# CONFIG_ATA_NONSTANDARD is not set
CONFIG_ATA_VERBOSE_ERROR=y
CONFIG_ATA_ACPI=y
CONFIG_SATA_PMP=y

#
# Controllers with non-SFF native interface
#
CONFIG_SATA_AHCI=y
# CONFIG_SATA_AHCI_PLATFORM is not set
# CONFIG_SATA_INIC162X is not set
# CONFIG_SATA_ACARD_AHCI is not set
# CONFIG_SATA_SIL24 is not set
CONFIG_ATA_SFF=y

#
# SFF controllers with custom DMA interface
#
# CONFIG_PDC_ADMA is not set
# CONFIG_SATA_QSTOR is not set
# CONFIG_SATA_SX4 is not set
CONFIG_ATA_BMDMA=y

#
# SATA SFF controllers with BMDMA
#
CONFIG_ATA_PIIX=y
# CONFIG_SATA_MV is not set
# CONFIG_SATA_NV is not set
# CONFIG_SATA_PROMISE is not set
# CONFIG_SATA_SIL is not set
# CONFIG_SATA_SIS is not set
# CONFIG_SATA_SVW is not set
# CONFIG_SATA_ULI is not set
# CONFIG_SATA_VIA is not set
# CONFIG_SATA_VITESSE is not set

#
# PATA SFF controllers with BMDMA
#
# CONFIG_PATA_ALI is not set
CONFIG_PATA_AMD=y
# CONFIG_PATA_ARASAN_CF is not set
# CONFIG_PATA_ARTOP is not set
# CONFIG_PATA_ATIIXP is not set
# CONFIG_PATA_ATP867X is not set
# CONFIG_PATA_CMD64X is not set
# CONFIG_PATA_CS5520 is not set
# CONFIG_PATA_CS5530 is not set
# CONFIG_PATA_CS5536 is not set
# CONFIG_PATA_CYPRESS is not set
# CONFIG_PATA_EFAR is not set
# CONFIG_PATA_HPT366 is not set
# CONFIG_PATA_HPT37X is not set
# CONFIG_PATA_HPT3X2N is not set
# CONFIG_PATA_HPT3X3 is not set
# CONFIG_PATA_IT8213 is not set
# CONFIG_PATA_IT821X is not set
# CONFIG_PATA_JMICRON is not set
# CONFIG_PATA_MARVELL is not set
# CONFIG_PATA_NETCELL is not set
# CONFIG_PATA_NINJA32 is not set
# CONFIG_PATA_NS87415 is not set
CONFIG_PATA_OLDPIIX=y
# CONFIG_PATA_OPTIDMA is not set
# CONFIG_PATA_PDC2027X is not set
# CONFIG_PATA_PDC_OLD is not set
# CONFIG_PATA_RADISYS is not set
# CONFIG_PATA_RDC is not set
# CONFIG_PATA_SC1200 is not set
CONFIG_PATA_SCH=y
# CONFIG_PATA_SERVERWORKS is not set
# CONFIG_PATA_SIL680 is not set
# CONFIG_PATA_SIS is not set
# CONFIG_PATA_TOSHIBA is not set
# CONFIG_PATA_TRIFLEX is not set
# CONFIG_PATA_VIA is not set
# CONFIG_PATA_WINBOND is not set

#
# PIO-only SFF controllers
#
# CONFIG_PATA_CMD640_PCI is not set
# CONFIG_PATA_MPIIX is not set
# CONFIG_PATA_NS87410 is not set
# CONFIG_PATA_OPTI is not set
# CONFIG_PATA_RZ1000 is not set

#
# Generic fallback / legacy drivers
#
# CONFIG_PATA_ACPI is not set
# CONFIG_ATA_GENERIC is not set
# CONFIG_PATA_LEGACY is not set
CONFIG_MD=y
CONFIG_BLK_DEV_MD=y
CONFIG_MD_AUTODETECT=y
# CONFIG_MD_LINEAR is not set
# CONFIG_MD_RAID0 is not set
# CONFIG_MD_RAID1 is not set
# CONFIG_MD_RAID10 is not set
# CONFIG_MD_RAID456 is not set
# CONFIG_MD_MULTIPATH is not set
# CONFIG_MD_FAULTY is not set
CONFIG_BLK_DEV_DM=y
# CONFIG_DM_DEBUG is not set
# CONFIG_DM_CRYPT is not set
# CONFIG_DM_SNAPSHOT is not set
# CONFIG_DM_THIN_PROVISIONING is not set
CONFIG_DM_MIRROR=y
# CONFIG_DM_RAID is not set
# CONFIG_DM_LOG_USERSPACE is not set
CONFIG_DM_ZERO=y
# CONFIG_DM_MULTIPATH is not set
# CONFIG_DM_DELAY is not set
# CONFIG_DM_UEVENT is not set
# CONFIG_DM_FLAKEY is not set
# CONFIG_DM_VERITY is not set
# CONFIG_TARGET_CORE is not set
# CONFIG_FUSION is not set

#
# IEEE 1394 (FireWire) support
#
# CONFIG_FIREWIRE is not set
# CONFIG_FIREWIRE_NOSY is not set
# CONFIG_I2O is not set
CONFIG_MACINTOSH_DRIVERS=y
CONFIG_MAC_EMUMOUSEBTN=y
CONFIG_NETDEVICES=y
CONFIG_NET_CORE=y
# CONFIG_BONDING is not set
# CONFIG_DUMMY is not set
# CONFIG_EQUALIZER is not set
# CONFIG_NET_FC is not set
CONFIG_MII=y
# CONFIG_IFB is not set
# CONFIG_NET_TEAM is not set
# CONFIG_MACVLAN is not set
CONFIG_NETCONSOLE=y
CONFIG_NETPOLL=y
# CONFIG_NETPOLL_TRAP is not set
CONFIG_NET_POLL_CONTROLLER=y
# CONFIG_TUN is not set
# CONFIG_VETH is not set
# CONFIG_VIRTIO_NET is not set
# CONFIG_ARCNET is not set

#
# CAIF transport drivers
#
CONFIG_ETHERNET=y
CONFIG_MDIO=m
# CONFIG_NET_VENDOR_3COM is not set
# CONFIG_NET_VENDOR_ADAPTEC is not set
# CONFIG_NET_VENDOR_ALTEON is not set
# CONFIG_NET_VENDOR_AMD is not set
# CONFIG_NET_VENDOR_ATHEROS is not set
# CONFIG_NET_VENDOR_BROADCOM is not set
# CONFIG_NET_VENDOR_BROCADE is not set
# CONFIG_NET_CALXEDA_XGMAC is not set
# CONFIG_NET_VENDOR_CHELSIO is not set
# CONFIG_NET_VENDOR_CISCO is not set
# CONFIG_DNET is not set
# CONFIG_NET_VENDOR_DEC is not set
# CONFIG_NET_VENDOR_DLINK is not set
# CONFIG_NET_VENDOR_EMULEX is not set
# CONFIG_NET_VENDOR_EXAR is not set
# CONFIG_NET_VENDOR_HP is not set
CONFIG_NET_VENDOR_INTEL=y
CONFIG_E100=y
CONFIG_E1000=y
CONFIG_E1000E=m
CONFIG_IGB=m
CONFIG_IGBVF=m
CONFIG_IXGB=m
CONFIG_IXGBE=m
CONFIG_IXGBE_HWMON=y
CONFIG_IXGBEVF=m
CONFIG_NET_VENDOR_I825XX=y
# CONFIG_ZNET is not set
# CONFIG_IP1000 is not set
# CONFIG_JME is not set
# CONFIG_NET_VENDOR_MARVELL is not set
# CONFIG_NET_VENDOR_MELLANOX is not set
# CONFIG_NET_VENDOR_MICREL is not set
# CONFIG_NET_VENDOR_MYRI is not set
# CONFIG_FEALNX is not set
# CONFIG_NET_VENDOR_NATSEMI is not set
# CONFIG_NET_VENDOR_NVIDIA is not set
# CONFIG_NET_VENDOR_OKI is not set
# CONFIG_ETHOC is not set
# CONFIG_NET_PACKET_ENGINE is not set
# CONFIG_NET_VENDOR_QLOGIC is not set
# CONFIG_NET_VENDOR_REALTEK is not set
# CONFIG_NET_VENDOR_RDC is not set
# CONFIG_NET_VENDOR_SEEQ is not set
# CONFIG_NET_VENDOR_SILAN is not set
# CONFIG_NET_VENDOR_SIS is not set
# CONFIG_SFC is not set
# CONFIG_NET_VENDOR_SMSC is not set
# CONFIG_NET_VENDOR_STMICRO is not set
# CONFIG_NET_VENDOR_SUN is not set
# CONFIG_NET_VENDOR_TEHUTI is not set
# CONFIG_NET_VENDOR_TI is not set
# CONFIG_NET_VENDOR_VIA is not set
CONFIG_NET_VENDOR_WIZNET=y
# CONFIG_WIZNET_W5100 is not set
# CONFIG_WIZNET_W5300 is not set
# CONFIG_FDDI is not set
# CONFIG_HIPPI is not set
# CONFIG_NET_SB1000 is not set
CONFIG_PHYLIB=y

#
# MII PHY device drivers
#
# CONFIG_AMD_PHY is not set
# CONFIG_MARVELL_PHY is not set
# CONFIG_DAVICOM_PHY is not set
# CONFIG_QSEMI_PHY is not set
# CONFIG_LXT_PHY is not set
# CONFIG_CICADA_PHY is not set
# CONFIG_VITESSE_PHY is not set
# CONFIG_SMSC_PHY is not set
# CONFIG_BROADCOM_PHY is not set
# CONFIG_ICPLUS_PHY is not set
# CONFIG_REALTEK_PHY is not set
# CONFIG_NATIONAL_PHY is not set
# CONFIG_STE10XP is not set
# CONFIG_LSI_ET1011C_PHY is not set
# CONFIG_MICREL_PHY is not set
# CONFIG_FIXED_PHY is not set
# CONFIG_MDIO_BITBANG is not set
CONFIG_PPP=y
CONFIG_PPP_BSDCOMP=y
CONFIG_PPP_DEFLATE=y
CONFIG_PPP_FILTER=y
CONFIG_PPP_MPPE=y
CONFIG_PPP_MULTILINK=y
CONFIG_PPPOE=y
CONFIG_PPP_ASYNC=y
CONFIG_PPP_SYNC_TTY=y
# CONFIG_SLIP is not set
CONFIG_SLHC=y

#
# USB Network Adapters
#
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
# CONFIG_USB_USBNET is not set
# CONFIG_USB_IPHETH is not set
# CONFIG_WLAN is not set

#
# Enable WiMAX (Networking options) to see the WiMAX drivers
#
# CONFIG_WAN is not set
CONFIG_XEN_NETDEV_FRONTEND=y
# CONFIG_XEN_NETDEV_BACKEND is not set
# CONFIG_VMXNET3 is not set
# CONFIG_ISDN is not set

#
# Input device support
#
CONFIG_INPUT=y
CONFIG_INPUT_FF_MEMLESS=y
CONFIG_INPUT_POLLDEV=y
CONFIG_INPUT_SPARSEKMAP=y
# CONFIG_INPUT_MATRIXKMAP is not set

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
# CONFIG_INPUT_MOUSEDEV_PSAUX is not set
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
# CONFIG_INPUT_JOYDEV is not set
CONFIG_INPUT_EVDEV=y
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
# CONFIG_KEYBOARD_ADP5588 is not set
# CONFIG_KEYBOARD_ADP5589 is not set
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_QT1070 is not set
# CONFIG_KEYBOARD_QT2160 is not set
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_TCA6416 is not set
# CONFIG_KEYBOARD_TCA8418 is not set
# CONFIG_KEYBOARD_LM8333 is not set
# CONFIG_KEYBOARD_MAX7359 is not set
# CONFIG_KEYBOARD_MCS is not set
# CONFIG_KEYBOARD_MPR121 is not set
# CONFIG_KEYBOARD_NEWTON is not set
# CONFIG_KEYBOARD_OPENCORES is not set
# CONFIG_KEYBOARD_STOWAWAY is not set
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_OMAP4 is not set
# CONFIG_KEYBOARD_XTKBD is not set
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
CONFIG_MOUSE_PS2_ALPS=y
CONFIG_MOUSE_PS2_LOGIPS2PP=y
CONFIG_MOUSE_PS2_SYNAPTICS=y
CONFIG_MOUSE_PS2_LIFEBOOK=y
CONFIG_MOUSE_PS2_TRACKPOINT=y
# CONFIG_MOUSE_PS2_ELANTECH is not set
# CONFIG_MOUSE_PS2_SENTELIC is not set
# CONFIG_MOUSE_PS2_TOUCHKIT is not set
# CONFIG_MOUSE_SERIAL is not set
# CONFIG_MOUSE_APPLETOUCH is not set
# CONFIG_MOUSE_BCM5974 is not set
# CONFIG_MOUSE_VSXXXAA is not set
# CONFIG_MOUSE_SYNAPTICS_I2C is not set
# CONFIG_MOUSE_SYNAPTICS_USB is not set
CONFIG_INPUT_JOYSTICK=y
# CONFIG_JOYSTICK_ANALOG is not set
# CONFIG_JOYSTICK_A3D is not set
# CONFIG_JOYSTICK_ADI is not set
# CONFIG_JOYSTICK_COBRA is not set
# CONFIG_JOYSTICK_GF2K is not set
# CONFIG_JOYSTICK_GRIP is not set
# CONFIG_JOYSTICK_GRIP_MP is not set
# CONFIG_JOYSTICK_GUILLEMOT is not set
# CONFIG_JOYSTICK_INTERACT is not set
# CONFIG_JOYSTICK_SIDEWINDER is not set
# CONFIG_JOYSTICK_TMDC is not set
# CONFIG_JOYSTICK_IFORCE is not set
# CONFIG_JOYSTICK_WARRIOR is not set
# CONFIG_JOYSTICK_MAGELLAN is not set
# CONFIG_JOYSTICK_SPACEORB is not set
# CONFIG_JOYSTICK_SPACEBALL is not set
# CONFIG_JOYSTICK_STINGER is not set
# CONFIG_JOYSTICK_TWIDJOY is not set
# CONFIG_JOYSTICK_ZHENHUA is not set
# CONFIG_JOYSTICK_AS5011 is not set
# CONFIG_JOYSTICK_JOYDUMP is not set
# CONFIG_JOYSTICK_XPAD is not set
CONFIG_INPUT_TABLET=y
# CONFIG_TABLET_USB_ACECAD is not set
# CONFIG_TABLET_USB_AIPTEK is not set
# CONFIG_TABLET_USB_GTCO is not set
# CONFIG_TABLET_USB_HANWANG is not set
# CONFIG_TABLET_USB_KBTAB is not set
# CONFIG_TABLET_USB_WACOM is not set
CONFIG_INPUT_TOUCHSCREEN=y
# CONFIG_TOUCHSCREEN_AD7879 is not set
# CONFIG_TOUCHSCREEN_ATMEL_MXT is not set
# CONFIG_TOUCHSCREEN_BU21013 is not set
# CONFIG_TOUCHSCREEN_CYTTSP_CORE is not set
# CONFIG_TOUCHSCREEN_DYNAPRO is not set
# CONFIG_TOUCHSCREEN_HAMPSHIRE is not set
# CONFIG_TOUCHSCREEN_EETI is not set
# CONFIG_TOUCHSCREEN_EGALAX is not set
# CONFIG_TOUCHSCREEN_FUJITSU is not set
# CONFIG_TOUCHSCREEN_ILI210X is not set
# CONFIG_TOUCHSCREEN_GUNZE is not set
# CONFIG_TOUCHSCREEN_ELO is not set
# CONFIG_TOUCHSCREEN_WACOM_W8001 is not set
# CONFIG_TOUCHSCREEN_WACOM_I2C is not set
# CONFIG_TOUCHSCREEN_MAX11801 is not set
# CONFIG_TOUCHSCREEN_MCS5000 is not set
# CONFIG_TOUCHSCREEN_MTOUCH is not set
# CONFIG_TOUCHSCREEN_INEXIO is not set
# CONFIG_TOUCHSCREEN_MK712 is not set
# CONFIG_TOUCHSCREEN_PENMOUNT is not set
# CONFIG_TOUCHSCREEN_TOUCHRIGHT is not set
# CONFIG_TOUCHSCREEN_TOUCHWIN is not set
# CONFIG_TOUCHSCREEN_PIXCIR is not set
# CONFIG_TOUCHSCREEN_WM97XX is not set
# CONFIG_TOUCHSCREEN_USB_COMPOSITE is not set
# CONFIG_TOUCHSCREEN_TOUCHIT213 is not set
# CONFIG_TOUCHSCREEN_TSC_SERIO is not set
# CONFIG_TOUCHSCREEN_TSC2007 is not set
# CONFIG_TOUCHSCREEN_ST1232 is not set
# CONFIG_TOUCHSCREEN_TPS6507X is not set
CONFIG_INPUT_MISC=y
# CONFIG_INPUT_AD714X is not set
# CONFIG_INPUT_BMA150 is not set
# CONFIG_INPUT_PCSPKR is not set
# CONFIG_INPUT_MMA8450 is not set
# CONFIG_INPUT_MPU3050 is not set
# CONFIG_INPUT_ATLAS_BTNS is not set
# CONFIG_INPUT_ATI_REMOTE2 is not set
# CONFIG_INPUT_KEYSPAN_REMOTE is not set
# CONFIG_INPUT_KXTJ9 is not set
# CONFIG_INPUT_POWERMATE is not set
# CONFIG_INPUT_YEALINK is not set
# CONFIG_INPUT_CM109 is not set
# CONFIG_INPUT_UINPUT is not set
# CONFIG_INPUT_PCF8574 is not set
# CONFIG_INPUT_ADXL34X is not set
# CONFIG_INPUT_CMA3000 is not set
CONFIG_INPUT_XEN_KBDDEV_FRONTEND=y

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=y
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
# CONFIG_SERIO_RAW is not set
# CONFIG_SERIO_ALTERA_PS2 is not set
# CONFIG_SERIO_PS2MULT is not set
# CONFIG_GAMEPORT is not set

#
# Character devices
#
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_VT_CONSOLE_SLEEP=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
CONFIG_UNIX98_PTYS=y
# CONFIG_DEVPTS_MULTIPLE_INSTANCES is not set
# CONFIG_LEGACY_PTYS is not set
CONFIG_SERIAL_NONSTANDARD=y
# CONFIG_ROCKETPORT is not set
# CONFIG_CYCLADES is not set
# CONFIG_MOXA_INTELLIO is not set
# CONFIG_MOXA_SMARTIO is not set
# CONFIG_SYNCLINK is not set
# CONFIG_SYNCLINKMP is not set
# CONFIG_SYNCLINK_GT is not set
# CONFIG_NOZOMI is not set
# CONFIG_ISI is not set
# CONFIG_N_HDLC is not set
# CONFIG_N_GSM is not set
# CONFIG_TRACE_SINK is not set
CONFIG_DEVKMEM=y
# CONFIG_STALDRV is not set

#
# Serial drivers
#
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_PNP=y
CONFIG_SERIAL_8250_NR_UARTS=32
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
CONFIG_SERIAL_8250_SHARE_IRQ=y
CONFIG_SERIAL_8250_DETECT_IRQ=y
CONFIG_SERIAL_8250_RSA=y

#
# Non-8250 serial port support
#
# CONFIG_SERIAL_MFD_HSU is not set
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
# CONFIG_SERIAL_JSM is not set
# CONFIG_SERIAL_TIMBERDALE is not set
# CONFIG_SERIAL_ALTERA_JTAGUART is not set
# CONFIG_SERIAL_ALTERA_UART is not set
# CONFIG_SERIAL_PCH_UART is not set
# CONFIG_SERIAL_XILINX_PS_UART is not set
CONFIG_HVC_DRIVER=y
CONFIG_HVC_IRQ=y
CONFIG_HVC_XEN=y
CONFIG_HVC_XEN_FRONTEND=y
# CONFIG_VIRTIO_CONSOLE is not set
# CONFIG_IPMI_HANDLER is not set
CONFIG_HW_RANDOM=y
# CONFIG_HW_RANDOM_TIMERIOMEM is not set
CONFIG_HW_RANDOM_INTEL=m
# CONFIG_HW_RANDOM_AMD is not set
# CONFIG_HW_RANDOM_VIA is not set
# CONFIG_HW_RANDOM_VIRTIO is not set
CONFIG_NVRAM=y
# CONFIG_R3964 is not set
# CONFIG_APPLICOM is not set
# CONFIG_MWAVE is not set
# CONFIG_RAW_DRIVER is not set
CONFIG_HPET=y
# CONFIG_HPET_MMAP is not set
# CONFIG_HANGCHECK_TIMER is not set
# CONFIG_TCG_TPM is not set
# CONFIG_TELCLOCK is not set
CONFIG_DEVPORT=y
CONFIG_I2C=y
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_COMPAT=y
# CONFIG_I2C_CHARDEV is not set
# CONFIG_I2C_MUX is not set
CONFIG_I2C_HELPER_AUTO=y
CONFIG_I2C_ALGOBIT=y

#
# I2C Hardware Bus support
#

#
# PC SMBus host controller drivers
#
# CONFIG_I2C_ALI1535 is not set
# CONFIG_I2C_ALI1563 is not set
# CONFIG_I2C_ALI15X3 is not set
# CONFIG_I2C_AMD756 is not set
# CONFIG_I2C_AMD8111 is not set
CONFIG_I2C_I801=y
# CONFIG_I2C_ISCH is not set
# CONFIG_I2C_PIIX4 is not set
# CONFIG_I2C_NFORCE2 is not set
# CONFIG_I2C_SIS5595 is not set
# CONFIG_I2C_SIS630 is not set
# CONFIG_I2C_SIS96X is not set
# CONFIG_I2C_VIA is not set
# CONFIG_I2C_VIAPRO is not set

#
# ACPI drivers
#
# CONFIG_I2C_SCMI is not set

#
# I2C system bus drivers (mostly embedded / system-on-chip)
#
# CONFIG_I2C_DESIGNWARE_PCI is not set
# CONFIG_I2C_EG20T is not set
# CONFIG_I2C_INTEL_MID is not set
# CONFIG_I2C_OCORES is not set
# CONFIG_I2C_PCA_PLATFORM is not set
# CONFIG_I2C_PXA_PCI is not set
# CONFIG_I2C_SIMTEC is not set
# CONFIG_I2C_XILINX is not set

#
# External I2C/SMBus adapter drivers
#
# CONFIG_I2C_DIOLAN_U2C is not set
# CONFIG_I2C_PARPORT_LIGHT is not set
# CONFIG_I2C_TAOS_EVM is not set
# CONFIG_I2C_TINY_USB is not set

#
# Other I2C/SMBus bus drivers
#
# CONFIG_I2C_STUB is not set
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
# CONFIG_SPI is not set
# CONFIG_HSI is not set

#
# PPS support
#
# CONFIG_PPS is not set

#
# PPS generators support
#

#
# PTP clock support
#

#
# Enable Device Drivers -> PPS to see the PTP clock options.
#
CONFIG_ARCH_WANT_OPTIONAL_GPIOLIB=y
# CONFIG_GPIOLIB is not set
# CONFIG_W1 is not set
CONFIG_POWER_SUPPLY=y
# CONFIG_POWER_SUPPLY_DEBUG is not set
# CONFIG_PDA_POWER is not set
# CONFIG_TEST_POWER is not set
# CONFIG_BATTERY_DS2780 is not set
# CONFIG_BATTERY_DS2781 is not set
# CONFIG_BATTERY_DS2782 is not set
# CONFIG_BATTERY_SBS is not set
# CONFIG_BATTERY_BQ27x00 is not set
# CONFIG_BATTERY_MAX17040 is not set
# CONFIG_BATTERY_MAX17042 is not set
# CONFIG_CHARGER_MAX8903 is not set
# CONFIG_CHARGER_LP8727 is not set
# CONFIG_CHARGER_SMB347 is not set
CONFIG_HWMON=y
# CONFIG_HWMON_VID is not set
# CONFIG_HWMON_DEBUG_CHIP is not set

#
# Native drivers
#
# CONFIG_SENSORS_ABITUGURU is not set
# CONFIG_SENSORS_ABITUGURU3 is not set
# CONFIG_SENSORS_AD7414 is not set
# CONFIG_SENSORS_AD7418 is not set
# CONFIG_SENSORS_ADM1021 is not set
# CONFIG_SENSORS_ADM1025 is not set
# CONFIG_SENSORS_ADM1026 is not set
# CONFIG_SENSORS_ADM1029 is not set
# CONFIG_SENSORS_ADM1031 is not set
# CONFIG_SENSORS_ADM9240 is not set
# CONFIG_SENSORS_ADT7411 is not set
# CONFIG_SENSORS_ADT7462 is not set
# CONFIG_SENSORS_ADT7470 is not set
# CONFIG_SENSORS_ADT7475 is not set
# CONFIG_SENSORS_ASC7621 is not set
# CONFIG_SENSORS_K8TEMP is not set
# CONFIG_SENSORS_K10TEMP is not set
# CONFIG_SENSORS_FAM15H_POWER is not set
# CONFIG_SENSORS_ASB100 is not set
# CONFIG_SENSORS_ATXP1 is not set
# CONFIG_SENSORS_DS620 is not set
# CONFIG_SENSORS_DS1621 is not set
# CONFIG_SENSORS_I5K_AMB is not set
# CONFIG_SENSORS_F71805F is not set
# CONFIG_SENSORS_F71882FG is not set
# CONFIG_SENSORS_F75375S is not set
# CONFIG_SENSORS_FSCHMD is not set
# CONFIG_SENSORS_G760A is not set
# CONFIG_SENSORS_GL518SM is not set
# CONFIG_SENSORS_GL520SM is not set
CONFIG_SENSORS_CORETEMP=m
# CONFIG_SENSORS_IT87 is not set
# CONFIG_SENSORS_JC42 is not set
# CONFIG_SENSORS_LINEAGE is not set
# CONFIG_SENSORS_LM63 is not set
# CONFIG_SENSORS_LM73 is not set
# CONFIG_SENSORS_LM75 is not set
# CONFIG_SENSORS_LM77 is not set
# CONFIG_SENSORS_LM78 is not set
# CONFIG_SENSORS_LM80 is not set
# CONFIG_SENSORS_LM83 is not set
# CONFIG_SENSORS_LM85 is not set
# CONFIG_SENSORS_LM87 is not set
# CONFIG_SENSORS_LM90 is not set
# CONFIG_SENSORS_LM92 is not set
# CONFIG_SENSORS_LM93 is not set
# CONFIG_SENSORS_LTC4151 is not set
# CONFIG_SENSORS_LTC4215 is not set
# CONFIG_SENSORS_LTC4245 is not set
# CONFIG_SENSORS_LTC4261 is not set
# CONFIG_SENSORS_LM95241 is not set
# CONFIG_SENSORS_LM95245 is not set
# CONFIG_SENSORS_MAX16065 is not set
# CONFIG_SENSORS_MAX1619 is not set
# CONFIG_SENSORS_MAX1668 is not set
# CONFIG_SENSORS_MAX6639 is not set
# CONFIG_SENSORS_MAX6642 is not set
# CONFIG_SENSORS_MAX6650 is not set
# CONFIG_SENSORS_MCP3021 is not set
# CONFIG_SENSORS_NTC_THERMISTOR is not set
# CONFIG_SENSORS_PC87360 is not set
# CONFIG_SENSORS_PC87427 is not set
# CONFIG_SENSORS_PCF8591 is not set
# CONFIG_PMBUS is not set
# CONFIG_SENSORS_SHT21 is not set
# CONFIG_SENSORS_SIS5595 is not set
# CONFIG_SENSORS_SMM665 is not set
# CONFIG_SENSORS_DME1737 is not set
# CONFIG_SENSORS_EMC1403 is not set
# CONFIG_SENSORS_EMC2103 is not set
# CONFIG_SENSORS_EMC6W201 is not set
# CONFIG_SENSORS_SMSC47M1 is not set
# CONFIG_SENSORS_SMSC47M192 is not set
# CONFIG_SENSORS_SMSC47B397 is not set
# CONFIG_SENSORS_SCH56XX_COMMON is not set
# CONFIG_SENSORS_SCH5627 is not set
# CONFIG_SENSORS_SCH5636 is not set
# CONFIG_SENSORS_ADS1015 is not set
# CONFIG_SENSORS_ADS7828 is not set
# CONFIG_SENSORS_AMC6821 is not set
# CONFIG_SENSORS_INA2XX is not set
# CONFIG_SENSORS_THMC50 is not set
# CONFIG_SENSORS_TMP102 is not set
# CONFIG_SENSORS_TMP401 is not set
# CONFIG_SENSORS_TMP421 is not set
# CONFIG_SENSORS_VIA_CPUTEMP is not set
# CONFIG_SENSORS_VIA686A is not set
# CONFIG_SENSORS_VT1211 is not set
# CONFIG_SENSORS_VT8231 is not set
# CONFIG_SENSORS_W83781D is not set
# CONFIG_SENSORS_W83791D is not set
# CONFIG_SENSORS_W83792D is not set
# CONFIG_SENSORS_W83793 is not set
# CONFIG_SENSORS_W83795 is not set
# CONFIG_SENSORS_W83L785TS is not set
# CONFIG_SENSORS_W83L786NG is not set
# CONFIG_SENSORS_W83627HF is not set
# CONFIG_SENSORS_W83627EHF is not set
# CONFIG_SENSORS_APPLESMC is not set

#
# ACPI drivers
#
CONFIG_SENSORS_ACPI_POWER=m
# CONFIG_SENSORS_ATK0110 is not set
CONFIG_THERMAL=y
CONFIG_THERMAL_HWMON=y
CONFIG_WATCHDOG=y
# CONFIG_WATCHDOG_CORE is not set
# CONFIG_WATCHDOG_NOWAYOUT is not set

#
# Watchdog Device Drivers
#
# CONFIG_SOFT_WATCHDOG is not set
# CONFIG_ACQUIRE_WDT is not set
# CONFIG_ADVANTECH_WDT is not set
# CONFIG_ALIM1535_WDT is not set
# CONFIG_ALIM7101_WDT is not set
# CONFIG_F71808E_WDT is not set
# CONFIG_SP5100_TCO is not set
# CONFIG_SC520_WDT is not set
# CONFIG_SBC_FITPC2_WATCHDOG is not set
# CONFIG_EUROTECH_WDT is not set
# CONFIG_IB700_WDT is not set
# CONFIG_IBMASR is not set
# CONFIG_WAFER_WDT is not set
# CONFIG_I6300ESB_WDT is not set
# CONFIG_IE6XX_WDT is not set
# CONFIG_ITCO_WDT is not set
# CONFIG_IT8712F_WDT is not set
# CONFIG_IT87_WDT is not set
# CONFIG_HP_WATCHDOG is not set
# CONFIG_SC1200_WDT is not set
# CONFIG_PC87413_WDT is not set
# CONFIG_NV_TCO is not set
# CONFIG_60XX_WDT is not set
# CONFIG_SBC8360_WDT is not set
# CONFIG_CPU5_WDT is not set
# CONFIG_SMSC_SCH311X_WDT is not set
# CONFIG_SMSC37B787_WDT is not set
# CONFIG_VIA_WDT is not set
# CONFIG_W83627HF_WDT is not set
# CONFIG_W83697HF_WDT is not set
# CONFIG_W83697UG_WDT is not set
# CONFIG_W83877F_WDT is not set
# CONFIG_W83977F_WDT is not set
# CONFIG_MACHZ_WDT is not set
# CONFIG_SBC_EPX_C3_WATCHDOG is not set
# CONFIG_XEN_WDT is not set

#
# PCI-based Watchdog Cards
#
# CONFIG_PCIPCWATCHDOG is not set
# CONFIG_WDTPCI is not set

#
# USB-based Watchdog Cards
#
# CONFIG_USBPCWATCHDOG is not set
CONFIG_SSB_POSSIBLE=y

#
# Sonics Silicon Backplane
#
# CONFIG_SSB is not set
CONFIG_BCMA_POSSIBLE=y

#
# Broadcom specific AMBA
#
# CONFIG_BCMA is not set

#
# Multifunction device drivers
#
# CONFIG_MFD_CORE is not set
# CONFIG_MFD_88PM860X is not set
# CONFIG_MFD_SM501 is not set
# CONFIG_HTC_PASIC3 is not set
# CONFIG_MFD_LM3533 is not set
# CONFIG_TPS6105X is not set
# CONFIG_TPS6507X is not set
# CONFIG_MFD_TPS65217 is not set
# CONFIG_TWL4030_CORE is not set
# CONFIG_TWL6040_CORE is not set
# CONFIG_MFD_STMPE is not set
# CONFIG_MFD_TC3589X is not set
# CONFIG_MFD_TMIO is not set
# CONFIG_PMIC_DA903X is not set
# CONFIG_MFD_DA9052_I2C is not set
# CONFIG_PMIC_ADP5520 is not set
# CONFIG_MFD_MAX77693 is not set
# CONFIG_MFD_MAX8925 is not set
# CONFIG_MFD_MAX8997 is not set
# CONFIG_MFD_MAX8998 is not set
# CONFIG_MFD_S5M_CORE is not set
# CONFIG_MFD_WM8400 is not set
# CONFIG_MFD_WM831X_I2C is not set
# CONFIG_MFD_WM8350_I2C is not set
# CONFIG_MFD_WM8994 is not set
# CONFIG_MFD_PCF50633 is not set
# CONFIG_MFD_MC13XXX_I2C is not set
# CONFIG_ABX500_CORE is not set
# CONFIG_MFD_CS5535 is not set
# CONFIG_LPC_SCH is not set
# CONFIG_LPC_ICH is not set
# CONFIG_MFD_RDC321X is not set
# CONFIG_MFD_JANZ_CMODIO is not set
# CONFIG_MFD_VX855 is not set
# CONFIG_MFD_WL1273_CORE is not set
# CONFIG_MFD_TPS65090 is not set
# CONFIG_MFD_RC5T583 is not set
# CONFIG_MFD_PALMAS is not set
# CONFIG_REGULATOR is not set
# CONFIG_MEDIA_SUPPORT is not set

#
# Graphics support
#
CONFIG_AGP=y
CONFIG_AGP_AMD64=y
CONFIG_AGP_INTEL=y
# CONFIG_AGP_SIS is not set
# CONFIG_AGP_VIA is not set
CONFIG_VGA_ARB=y
CONFIG_VGA_ARB_MAX_GPUS=2
# CONFIG_VGA_SWITCHEROO is not set
CONFIG_DRM=y
CONFIG_DRM_KMS_HELPER=m
# CONFIG_DRM_LOAD_EDID_FIRMWARE is not set
CONFIG_DRM_TTM=m
# CONFIG_DRM_TDFX is not set
# CONFIG_DRM_R128 is not set
CONFIG_DRM_RADEON=m
CONFIG_DRM_RADEON_KMS=y
CONFIG_DRM_NOUVEAU=m
CONFIG_DRM_NOUVEAU_BACKLIGHT=y
CONFIG_DRM_NOUVEAU_DEBUG=y

#
# I2C encoder or helper chips
#
# CONFIG_DRM_I2C_CH7006 is not set
# CONFIG_DRM_I2C_SIL164 is not set
# CONFIG_DRM_I810 is not set
# CONFIG_DRM_I915 is not set
# CONFIG_DRM_MGA is not set
# CONFIG_DRM_SIS is not set
# CONFIG_DRM_VIA is not set
# CONFIG_DRM_SAVAGE is not set
# CONFIG_DRM_VMWGFX is not set
# CONFIG_DRM_GMA500 is not set
# CONFIG_DRM_UDL is not set
# CONFIG_DRM_AST is not set
# CONFIG_DRM_MGAG200 is not set
# CONFIG_DRM_CIRRUS_QEMU is not set
# CONFIG_STUB_POULSBO is not set
CONFIG_VGASTATE=m
CONFIG_VIDEO_OUTPUT_CONTROL=y
CONFIG_FB=y
CONFIG_FIRMWARE_EDID=y
CONFIG_FB_DDC=m
CONFIG_FB_BOOT_VESA_SUPPORT=y
CONFIG_FB_CFB_FILLRECT=y
CONFIG_FB_CFB_COPYAREA=y
CONFIG_FB_CFB_IMAGEBLIT=y
# CONFIG_FB_CFB_REV_PIXELS_IN_BYTE is not set
CONFIG_FB_SYS_FILLRECT=y
CONFIG_FB_SYS_COPYAREA=y
CONFIG_FB_SYS_IMAGEBLIT=y
# CONFIG_FB_FOREIGN_ENDIAN is not set
CONFIG_FB_SYS_FOPS=y
# CONFIG_FB_WMT_GE_ROPS is not set
CONFIG_FB_DEFERRED_IO=y
# CONFIG_FB_SVGALIB is not set
# CONFIG_FB_MACMODES is not set
CONFIG_FB_BACKLIGHT=y
CONFIG_FB_MODE_HELPERS=y
CONFIG_FB_TILEBLITTING=y

#
# Frame buffer hardware drivers
#
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ARC is not set
# CONFIG_FB_ASILIANT is not set
# CONFIG_FB_IMSTT is not set
# CONFIG_FB_VGA16 is not set
# CONFIG_FB_UVESA is not set
CONFIG_FB_VESA=y
CONFIG_FB_EFI=y
# CONFIG_FB_N411 is not set
# CONFIG_FB_HGA is not set
# CONFIG_FB_S1D13XXX is not set
CONFIG_FB_NVIDIA=m
CONFIG_FB_NVIDIA_I2C=y
# CONFIG_FB_NVIDIA_DEBUG is not set
CONFIG_FB_NVIDIA_BACKLIGHT=y
CONFIG_FB_RIVA=m
CONFIG_FB_RIVA_I2C=y
# CONFIG_FB_RIVA_DEBUG is not set
CONFIG_FB_RIVA_BACKLIGHT=y
# CONFIG_FB_I740 is not set
# CONFIG_FB_LE80578 is not set
# CONFIG_FB_MATROX is not set
CONFIG_FB_RADEON=m
CONFIG_FB_RADEON_I2C=y
CONFIG_FB_RADEON_BACKLIGHT=y
CONFIG_FB_RADEON_DEBUG=y
# CONFIG_FB_ATY128 is not set
# CONFIG_FB_ATY is not set
# CONFIG_FB_S3 is not set
# CONFIG_FB_SAVAGE is not set
# CONFIG_FB_SIS is not set
# CONFIG_FB_VIA is not set
# CONFIG_FB_NEOMAGIC is not set
# CONFIG_FB_KYRO is not set
# CONFIG_FB_3DFX is not set
# CONFIG_FB_VOODOO1 is not set
# CONFIG_FB_VT8623 is not set
# CONFIG_FB_TRIDENT is not set
# CONFIG_FB_ARK is not set
# CONFIG_FB_PM3 is not set
# CONFIG_FB_CARMINE is not set
# CONFIG_FB_GEODE is not set
# CONFIG_FB_SMSCUFX is not set
# CONFIG_FB_UDL is not set
# CONFIG_FB_VIRTUAL is not set
CONFIG_XEN_FBDEV_FRONTEND=y
# CONFIG_FB_METRONOME is not set
# CONFIG_FB_MB862XX is not set
# CONFIG_FB_BROADSHEET is not set
# CONFIG_EXYNOS_VIDEO is not set
CONFIG_BACKLIGHT_LCD_SUPPORT=y
# CONFIG_LCD_CLASS_DEVICE is not set
CONFIG_BACKLIGHT_CLASS_DEVICE=y
CONFIG_BACKLIGHT_GENERIC=y
# CONFIG_BACKLIGHT_PROGEAR is not set
# CONFIG_BACKLIGHT_APPLE is not set
# CONFIG_BACKLIGHT_SAHARA is not set
# CONFIG_BACKLIGHT_ADP8860 is not set
# CONFIG_BACKLIGHT_ADP8870 is not set
# CONFIG_BACKLIGHT_LP855X is not set

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
CONFIG_VGACON_SOFT_SCROLLBACK=y
CONFIG_VGACON_SOFT_SCROLLBACK_SIZE=64
CONFIG_DUMMY_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY=y
# CONFIG_FRAMEBUFFER_CONSOLE_ROTATION is not set
# CONFIG_FONTS is not set
CONFIG_FONT_8x8=y
CONFIG_FONT_8x16=y
CONFIG_LOGO=y
# CONFIG_LOGO_LINUX_MONO is not set
# CONFIG_LOGO_LINUX_VGA16 is not set
CONFIG_LOGO_LINUX_CLUT224=y
CONFIG_SOUND=y
CONFIG_SOUND_OSS_CORE=y
CONFIG_SOUND_OSS_CORE_PRECLAIM=y
CONFIG_SND=y
CONFIG_SND_TIMER=y
CONFIG_SND_PCM=y
CONFIG_SND_HWDEP=y
CONFIG_SND_RAWMIDI=m
CONFIG_SND_JACK=y
CONFIG_SND_SEQUENCER=y
CONFIG_SND_SEQ_DUMMY=y
CONFIG_SND_OSSEMUL=y
CONFIG_SND_MIXER_OSS=y
CONFIG_SND_PCM_OSS=y
CONFIG_SND_PCM_OSS_PLUGINS=y
CONFIG_SND_SEQUENCER_OSS=y
CONFIG_SND_HRTIMER=y
CONFIG_SND_SEQ_HRTIMER_DEFAULT=y
CONFIG_SND_DYNAMIC_MINORS=y
CONFIG_SND_SUPPORT_OLD_API=y
CONFIG_SND_VERBOSE_PROCFS=y
# CONFIG_SND_VERBOSE_PRINTK is not set
# CONFIG_SND_DEBUG is not set
CONFIG_SND_VMASTER=y
CONFIG_SND_KCTL_JACK=y
CONFIG_SND_DMA_SGBUF=y
CONFIG_SND_RAWMIDI_SEQ=m
# CONFIG_SND_OPL3_LIB_SEQ is not set
# CONFIG_SND_OPL4_LIB_SEQ is not set
# CONFIG_SND_SBAWE_SEQ is not set
CONFIG_SND_EMU10K1_SEQ=m
CONFIG_SND_AC97_CODEC=m
CONFIG_SND_DRIVERS=y
# CONFIG_SND_PCSP is not set
# CONFIG_SND_DUMMY is not set
# CONFIG_SND_ALOOP is not set
# CONFIG_SND_VIRMIDI is not set
# CONFIG_SND_MTPAV is not set
# CONFIG_SND_SERIAL_U16550 is not set
# CONFIG_SND_MPU401 is not set
# CONFIG_SND_AC97_POWER_SAVE is not set
CONFIG_SND_PCI=y
# CONFIG_SND_AD1889 is not set
# CONFIG_SND_ALS300 is not set
# CONFIG_SND_ALS4000 is not set
# CONFIG_SND_ALI5451 is not set
# CONFIG_SND_ASIHPI is not set
# CONFIG_SND_ATIIXP is not set
# CONFIG_SND_ATIIXP_MODEM is not set
# CONFIG_SND_AU8810 is not set
# CONFIG_SND_AU8820 is not set
# CONFIG_SND_AU8830 is not set
# CONFIG_SND_AW2 is not set
# CONFIG_SND_AZT3328 is not set
# CONFIG_SND_BT87X is not set
# CONFIG_SND_CA0106 is not set
# CONFIG_SND_CMIPCI is not set
# CONFIG_SND_OXYGEN is not set
# CONFIG_SND_CS4281 is not set
# CONFIG_SND_CS46XX is not set
# CONFIG_SND_CS5530 is not set
# CONFIG_SND_CS5535AUDIO is not set
# CONFIG_SND_CTXFI is not set
# CONFIG_SND_DARLA20 is not set
# CONFIG_SND_GINA20 is not set
# CONFIG_SND_LAYLA20 is not set
# CONFIG_SND_DARLA24 is not set
# CONFIG_SND_GINA24 is not set
# CONFIG_SND_LAYLA24 is not set
# CONFIG_SND_MONA is not set
# CONFIG_SND_MIA is not set
# CONFIG_SND_ECHO3G is not set
# CONFIG_SND_INDIGO is not set
# CONFIG_SND_INDIGOIO is not set
# CONFIG_SND_INDIGODJ is not set
# CONFIG_SND_INDIGOIOX is not set
# CONFIG_SND_INDIGODJX is not set
CONFIG_SND_EMU10K1=m
# CONFIG_SND_EMU10K1X is not set
# CONFIG_SND_ENS1370 is not set
# CONFIG_SND_ENS1371 is not set
# CONFIG_SND_ES1938 is not set
# CONFIG_SND_ES1968 is not set
# CONFIG_SND_FM801 is not set
CONFIG_SND_HDA_INTEL=y
CONFIG_SND_HDA_PREALLOC_SIZE=64
CONFIG_SND_HDA_HWDEP=y
CONFIG_SND_HDA_RECONFIG=y
CONFIG_SND_HDA_INPUT_BEEP=y
CONFIG_SND_HDA_INPUT_BEEP_MODE=1
CONFIG_SND_HDA_INPUT_JACK=y
CONFIG_SND_HDA_PATCH_LOADER=y
CONFIG_SND_HDA_CODEC_REALTEK=y
CONFIG_SND_HDA_ENABLE_REALTEK_QUIRKS=y
CONFIG_SND_HDA_CODEC_ANALOG=y
CONFIG_SND_HDA_CODEC_SIGMATEL=y
CONFIG_SND_HDA_CODEC_VIA=y
CONFIG_SND_HDA_CODEC_HDMI=y
CONFIG_SND_HDA_CODEC_CIRRUS=y
CONFIG_SND_HDA_CODEC_CONEXANT=y
CONFIG_SND_HDA_CODEC_CA0110=y
CONFIG_SND_HDA_CODEC_CA0132=y
CONFIG_SND_HDA_CODEC_CMEDIA=y
CONFIG_SND_HDA_CODEC_SI3054=y
CONFIG_SND_HDA_GENERIC=y
# CONFIG_SND_HDA_POWER_SAVE is not set
# CONFIG_SND_HDSP is not set
# CONFIG_SND_HDSPM is not set
# CONFIG_SND_ICE1712 is not set
# CONFIG_SND_ICE1724 is not set
CONFIG_SND_INTEL8X0=m
CONFIG_SND_INTEL8X0M=m
# CONFIG_SND_KORG1212 is not set
# CONFIG_SND_LOLA is not set
# CONFIG_SND_LX6464ES is not set
# CONFIG_SND_MAESTRO3 is not set
# CONFIG_SND_MIXART is not set
# CONFIG_SND_NM256 is not set
# CONFIG_SND_PCXHR is not set
# CONFIG_SND_RIPTIDE is not set
# CONFIG_SND_RME32 is not set
# CONFIG_SND_RME96 is not set
# CONFIG_SND_RME9652 is not set
# CONFIG_SND_SONICVIBES is not set
# CONFIG_SND_TRIDENT is not set
# CONFIG_SND_VIA82XX is not set
# CONFIG_SND_VIA82XX_MODEM is not set
# CONFIG_SND_VIRTUOSO is not set
# CONFIG_SND_VX222 is not set
# CONFIG_SND_YMFPCI is not set
CONFIG_SND_USB=y
# CONFIG_SND_USB_AUDIO is not set
# CONFIG_SND_USB_UA101 is not set
# CONFIG_SND_USB_USX2Y is not set
# CONFIG_SND_USB_CAIAQ is not set
# CONFIG_SND_USB_US122L is not set
# CONFIG_SND_USB_6FIRE is not set
# CONFIG_SND_SOC is not set
# CONFIG_SOUND_PRIME is not set
CONFIG_AC97_BUS=m
CONFIG_HID_SUPPORT=y
CONFIG_HID=y
# CONFIG_HID_BATTERY_STRENGTH is not set
CONFIG_HIDRAW=y

#
# USB Input Devices
#
CONFIG_USB_HID=y
CONFIG_HID_PID=y
CONFIG_USB_HIDDEV=y

#
# Special HID drivers
#
CONFIG_HID_GENERIC=y
CONFIG_HID_A4TECH=y
# CONFIG_HID_ACRUX is not set
CONFIG_HID_APPLE=y
# CONFIG_HID_AUREAL is not set
CONFIG_HID_BELKIN=y
CONFIG_HID_CHERRY=y
CONFIG_HID_CHICONY=y
# CONFIG_HID_PRODIKEYS is not set
CONFIG_HID_CYPRESS=y
# CONFIG_HID_DRAGONRISE is not set
# CONFIG_HID_EMS_FF is not set
CONFIG_HID_EZKEY=y
# CONFIG_HID_HOLTEK is not set
# CONFIG_HID_KEYTOUCH is not set
# CONFIG_HID_KYE is not set
# CONFIG_HID_UCLOGIC is not set
# CONFIG_HID_WALTOP is not set
CONFIG_HID_GYRATION=y
# CONFIG_HID_TWINHAN is not set
CONFIG_HID_KENSINGTON=y
# CONFIG_HID_LCPOWER is not set
CONFIG_HID_LOGITECH=y
CONFIG_HID_LOGITECH_DJ=m
CONFIG_LOGITECH_FF=y
# CONFIG_LOGIRUMBLEPAD2_FF is not set
# CONFIG_LOGIG940_FF is not set
CONFIG_LOGIWHEELS_FF=y
CONFIG_HID_MICROSOFT=y
CONFIG_HID_MONTEREY=y
# CONFIG_HID_MULTITOUCH is not set
CONFIG_HID_NTRIG=y
# CONFIG_HID_ORTEK is not set
CONFIG_HID_PANTHERLORD=y
CONFIG_PANTHERLORD_FF=y
CONFIG_HID_PETALYNX=y
# CONFIG_HID_PICOLCD is not set
# CONFIG_HID_PRIMAX is not set
# CONFIG_HID_ROCCAT is not set
# CONFIG_HID_SAITEK is not set
CONFIG_HID_SAMSUNG=y
CONFIG_HID_SONY=y
# CONFIG_HID_SPEEDLINK is not set
CONFIG_HID_SUNPLUS=y
# CONFIG_HID_GREENASIA is not set
# CONFIG_HID_SMARTJOYPLUS is not set
# CONFIG_HID_TIVO is not set
CONFIG_HID_TOPSEED=y
# CONFIG_HID_THRUSTMASTER is not set
# CONFIG_HID_ZEROPLUS is not set
# CONFIG_HID_ZYDACRON is not set
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB_ARCH_HAS_EHCI=y
CONFIG_USB_ARCH_HAS_XHCI=y
CONFIG_USB_SUPPORT=y
CONFIG_USB_COMMON=y
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB=y
CONFIG_USB_DEBUG=y
CONFIG_USB_ANNOUNCE_NEW_DEVICES=y

#
# Miscellaneous USB options
#
# CONFIG_USB_DYNAMIC_MINORS is not set
# CONFIG_USB_SUSPEND is not set
CONFIG_USB_MON=y
# CONFIG_USB_WUSB_CBAF is not set

#
# USB Host Controller Drivers
#
# CONFIG_USB_C67X00_HCD is not set
CONFIG_USB_XHCI_HCD=m
# CONFIG_USB_XHCI_HCD_DEBUGGING is not set
CONFIG_USB_EHCI_HCD=y
# CONFIG_USB_EHCI_ROOT_HUB_TT is not set
# CONFIG_USB_EHCI_TT_NEWSCHED is not set
# CONFIG_USB_OXU210HP_HCD is not set
# CONFIG_USB_ISP116X_HCD is not set
# CONFIG_USB_ISP1760_HCD is not set
# CONFIG_USB_ISP1362_HCD is not set
CONFIG_USB_OHCI_HCD=y
# CONFIG_USB_OHCI_HCD_PLATFORM is not set
# CONFIG_USB_EHCI_HCD_PLATFORM is not set
# CONFIG_USB_OHCI_BIG_ENDIAN_DESC is not set
# CONFIG_USB_OHCI_BIG_ENDIAN_MMIO is not set
CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_UHCI_HCD=y
# CONFIG_USB_SL811_HCD is not set
# CONFIG_USB_R8A66597_HCD is not set
# CONFIG_USB_CHIPIDEA is not set

#
# USB Device Class drivers
#
# CONFIG_USB_ACM is not set
CONFIG_USB_PRINTER=y
# CONFIG_USB_WDM is not set
# CONFIG_USB_TMC is not set

#
# NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may
#

#
# also be needed; see USB_STORAGE Help for more info
#
CONFIG_USB_STORAGE=y
# CONFIG_USB_STORAGE_DEBUG is not set
# CONFIG_USB_STORAGE_REALTEK is not set
# CONFIG_USB_STORAGE_DATAFAB is not set
# CONFIG_USB_STORAGE_FREECOM is not set
# CONFIG_USB_STORAGE_ISD200 is not set
# CONFIG_USB_STORAGE_USBAT is not set
# CONFIG_USB_STORAGE_SDDR09 is not set
# CONFIG_USB_STORAGE_SDDR55 is not set
# CONFIG_USB_STORAGE_JUMPSHOT is not set
# CONFIG_USB_STORAGE_ALAUDA is not set
# CONFIG_USB_STORAGE_ONETOUCH is not set
# CONFIG_USB_STORAGE_KARMA is not set
# CONFIG_USB_STORAGE_CYPRESS_ATACB is not set
# CONFIG_USB_STORAGE_ENE_UB6250 is not set
# CONFIG_USB_UAS is not set
CONFIG_USB_LIBUSUAL=y

#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set

#
# USB port drivers
#
# CONFIG_USB_SERIAL is not set

#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_ADUTUX is not set
# CONFIG_USB_SEVSEG is not set
# CONFIG_USB_RIO500 is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_LED is not set
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_FTDI_ELAN is not set
# CONFIG_USB_APPLEDISPLAY is not set
# CONFIG_USB_SISUSBVGA is not set
# CONFIG_USB_LD is not set
# CONFIG_USB_TRANCEVIBRATOR is not set
# CONFIG_USB_IOWARRIOR is not set
# CONFIG_USB_TEST is not set
# CONFIG_USB_ISIGHTFW is not set
# CONFIG_USB_YUREX is not set

#
# USB Physical Layer drivers
#
# CONFIG_USB_ISP1301 is not set
# CONFIG_USB_GADGET is not set

#
# OTG and related infrastructure
#
# CONFIG_NOP_USB_XCEIV is not set
# CONFIG_UWB is not set
# CONFIG_MMC is not set
# CONFIG_MEMSTICK is not set
# CONFIG_NEW_LEDS is not set
# CONFIG_ACCESSIBILITY is not set
# CONFIG_INFINIBAND is not set
CONFIG_EDAC=y

#
# Reporting subsystems
#
# CONFIG_EDAC_DEBUG is not set
CONFIG_EDAC_DECODE_MCE=y
# CONFIG_EDAC_MCE_INJ is not set
# CONFIG_EDAC_MM_EDAC is not set
CONFIG_RTC_LIB=y
CONFIG_RTC_CLASS=y
# CONFIG_RTC_HCTOSYS is not set
# CONFIG_RTC_DEBUG is not set

#
# RTC interfaces
#
CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y
# CONFIG_RTC_INTF_DEV_UIE_EMUL is not set
# CONFIG_RTC_DRV_TEST is not set

#
# I2C RTC drivers
#
# CONFIG_RTC_DRV_DS1307 is not set
# CONFIG_RTC_DRV_DS1374 is not set
# CONFIG_RTC_DRV_DS1672 is not set
# CONFIG_RTC_DRV_DS3232 is not set
# CONFIG_RTC_DRV_MAX6900 is not set
# CONFIG_RTC_DRV_RS5C372 is not set
# CONFIG_RTC_DRV_ISL1208 is not set
# CONFIG_RTC_DRV_ISL12022 is not set
# CONFIG_RTC_DRV_X1205 is not set
# CONFIG_RTC_DRV_PCF8563 is not set
# CONFIG_RTC_DRV_PCF8583 is not set
# CONFIG_RTC_DRV_M41T80 is not set
# CONFIG_RTC_DRV_BQ32K is not set
# CONFIG_RTC_DRV_S35390A is not set
# CONFIG_RTC_DRV_FM3130 is not set
# CONFIG_RTC_DRV_RX8581 is not set
# CONFIG_RTC_DRV_RX8025 is not set
# CONFIG_RTC_DRV_EM3027 is not set
# CONFIG_RTC_DRV_RV3029C2 is not set

#
# SPI RTC drivers
#

#
# Platform RTC drivers
#
CONFIG_RTC_DRV_CMOS=y
# CONFIG_RTC_DRV_DS1286 is not set
# CONFIG_RTC_DRV_DS1511 is not set
# CONFIG_RTC_DRV_DS1553 is not set
# CONFIG_RTC_DRV_DS1742 is not set
# CONFIG_RTC_DRV_STK17TA8 is not set
# CONFIG_RTC_DRV_M48T86 is not set
# CONFIG_RTC_DRV_M48T35 is not set
# CONFIG_RTC_DRV_M48T59 is not set
# CONFIG_RTC_DRV_MSM6242 is not set
# CONFIG_RTC_DRV_BQ4802 is not set
# CONFIG_RTC_DRV_RP5C01 is not set
# CONFIG_RTC_DRV_V3020 is not set

#
# on-CPU RTC drivers
#
CONFIG_DMADEVICES=y
# CONFIG_DMADEVICES_DEBUG is not set

#
# DMA Devices
#
# CONFIG_INTEL_MID_DMAC is not set
# CONFIG_INTEL_IOATDMA is not set
# CONFIG_TIMB_DMA is not set
# CONFIG_PCH_DMA is not set
CONFIG_AUXDISPLAY=y
CONFIG_UIO=m
# CONFIG_UIO_CIF is not set
# CONFIG_UIO_PDRV is not set
# CONFIG_UIO_PDRV_GENIRQ is not set
# CONFIG_UIO_AEC is not set
# CONFIG_UIO_SERCOS3 is not set
# CONFIG_UIO_PCI_GENERIC is not set
# CONFIG_UIO_NETX is not set
CONFIG_VIRTIO=m
CONFIG_VIRTIO_RING=m

#
# Virtio drivers
#
CONFIG_VIRTIO_PCI=m
CONFIG_VIRTIO_BALLOON=m
CONFIG_VIRTIO_MMIO=m
# CONFIG_VIRTIO_MMIO_CMDLINE_DEVICES is not set

#
# Microsoft Hyper-V guest support
#
# CONFIG_HYPERV is not set

#
# Xen driver support
#
CONFIG_XEN_BALLOON=y
CONFIG_XEN_SCRUB_PAGES=y
CONFIG_XEN_DEV_EVTCHN=y
CONFIG_XEN_BACKEND=y
CONFIG_XENFS=y
CONFIG_XEN_COMPAT_XENFS=y
CONFIG_XEN_SYS_HYPERVISOR=y
CONFIG_XEN_XENBUS_FRONTEND=y
CONFIG_XEN_GNTDEV=m
CONFIG_XEN_GRANT_DEV_ALLOC=m
CONFIG_SWIOTLB_XEN=y
CONFIG_XEN_PCIDEV_BACKEND=m
CONFIG_XEN_PRIVCMD=y
CONFIG_XEN_ACPI_PROCESSOR=m
# CONFIG_STAGING is not set
CONFIG_X86_PLATFORM_DEVICES=y
# CONFIG_ACER_WMI is not set
# CONFIG_ACERHDF is not set
# CONFIG_ASUS_LAPTOP is not set
# CONFIG_DELL_WMI is not set
# CONFIG_DELL_WMI_AIO is not set
# CONFIG_FUJITSU_LAPTOP is not set
# CONFIG_FUJITSU_TABLET is not set
# CONFIG_HP_ACCEL is not set
CONFIG_HP_WMI=m
# CONFIG_PANASONIC_LAPTOP is not set
# CONFIG_THINKPAD_ACPI is not set
# CONFIG_SENSORS_HDAPS is not set
# CONFIG_INTEL_MENLOW is not set
CONFIG_ACPI_WMI=m
# CONFIG_MSI_WMI is not set
# CONFIG_TOPSTAR_LAPTOP is not set
# CONFIG_ACPI_TOSHIBA is not set
# CONFIG_TOSHIBA_BT_RFKILL is not set
# CONFIG_ACPI_CMPC is not set
# CONFIG_INTEL_IPS is not set
# CONFIG_IBM_RTL is not set
# CONFIG_XO15_EBOOK is not set
# CONFIG_SAMSUNG_LAPTOP is not set
CONFIG_MXM_WMI=m
# CONFIG_SAMSUNG_Q10 is not set
# CONFIG_APPLE_GMUX is not set

#
# Hardware Spinlock drivers
#
CONFIG_CLKEVT_I8253=y
CONFIG_I8253_LOCK=y
CONFIG_CLKBLD_I8253=y
CONFIG_IOMMU_API=y
CONFIG_IOMMU_SUPPORT=y
CONFIG_AMD_IOMMU=y
CONFIG_AMD_IOMMU_STATS=y
# CONFIG_AMD_IOMMU_V2 is not set
CONFIG_DMAR_TABLE=y
CONFIG_INTEL_IOMMU=y
CONFIG_INTEL_IOMMU_DEFAULT_ON=y
CONFIG_INTEL_IOMMU_FLOPPY_WA=y
# CONFIG_IRQ_REMAP is not set

#
# Remoteproc drivers (EXPERIMENTAL)
#

#
# Rpmsg drivers (EXPERIMENTAL)
#
CONFIG_VIRT_DRIVERS=y
CONFIG_PM_DEVFREQ=y

#
# DEVFREQ Governors
#
# CONFIG_DEVFREQ_GOV_SIMPLE_ONDEMAND is not set
# CONFIG_DEVFREQ_GOV_PERFORMANCE is not set
# CONFIG_DEVFREQ_GOV_POWERSAVE is not set
# CONFIG_DEVFREQ_GOV_USERSPACE is not set

#
# DEVFREQ Drivers
#
# CONFIG_EXTCON is not set
# CONFIG_MEMORY is not set
# CONFIG_IIO is not set
# CONFIG_VME_BUS is not set

#
# Firmware Drivers
#
# CONFIG_EDD is not set
CONFIG_FIRMWARE_MEMMAP=y
CONFIG_EFI_VARS=y
# CONFIG_DELL_RBU is not set
# CONFIG_DCDBAS is not set
CONFIG_DMIID=y
# CONFIG_DMI_SYSFS is not set
# CONFIG_ISCSI_IBFT_FIND is not set
# CONFIG_GOOGLE_FIRMWARE is not set

#
# File systems
#
CONFIG_DCACHE_WORD_ACCESS=y
CONFIG_EXT2_FS=y
CONFIG_EXT2_FS_XATTR=y
CONFIG_EXT2_FS_POSIX_ACL=y
CONFIG_EXT2_FS_SECURITY=y
# CONFIG_EXT2_FS_XIP is not set
CONFIG_EXT3_FS=y
# CONFIG_EXT3_DEFAULTS_TO_ORDERED is not set
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
CONFIG_EXT4_FS=y
CONFIG_EXT4_FS_XATTR=y
CONFIG_EXT4_FS_POSIX_ACL=y
CONFIG_EXT4_FS_SECURITY=y
# CONFIG_EXT4_DEBUG is not set
CONFIG_JBD=y
# CONFIG_JBD_DEBUG is not set
CONFIG_JBD2=y
# CONFIG_JBD2_DEBUG is not set
CONFIG_FS_MBCACHE=y
# CONFIG_REISERFS_FS is not set
# CONFIG_JFS_FS is not set
# CONFIG_XFS_FS is not set
# CONFIG_GFS2_FS is not set
# CONFIG_OCFS2_FS is not set
# CONFIG_BTRFS_FS is not set
# CONFIG_NILFS2_FS is not set
CONFIG_FS_POSIX_ACL=y
CONFIG_FILE_LOCKING=y
CONFIG_FSNOTIFY=y
CONFIG_DNOTIFY=y
CONFIG_INOTIFY_USER=y
# CONFIG_FANOTIFY is not set
CONFIG_QUOTA=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
# CONFIG_PRINT_QUOTA_WARNING is not set
# CONFIG_QUOTA_DEBUG is not set
CONFIG_QUOTA_TREE=y
# CONFIG_QFMT_V1 is not set
CONFIG_QFMT_V2=y
CONFIG_QUOTACTL=y
CONFIG_QUOTACTL_COMPAT=y
CONFIG_AUTOFS4_FS=y
CONFIG_FUSE_FS=m
# CONFIG_CUSE is not set
CONFIG_GENERIC_ACL=y

#
# Caches
#
CONFIG_FSCACHE=m
CONFIG_FSCACHE_STATS=y
CONFIG_FSCACHE_HISTOGRAM=y
# CONFIG_FSCACHE_DEBUG is not set
CONFIG_FSCACHE_OBJECT_LIST=y
CONFIG_CACHEFILES=m
# CONFIG_CACHEFILES_DEBUG is not set
# CONFIG_CACHEFILES_HISTOGRAM is not set

#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=y
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
CONFIG_UDF_FS=y
CONFIG_UDF_NLS=y

#
# DOS/FAT/NT Filesystems
#
CONFIG_FAT_FS=y
CONFIG_MSDOS_FS=y
CONFIG_VFAT_FS=y
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-1"
CONFIG_NTFS_FS=y
# CONFIG_NTFS_DEBUG is not set
CONFIG_NTFS_RW=y

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_VMCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_PROC_PAGE_MONITOR=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_TMPFS_XATTR=y
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
CONFIG_CONFIGFS_FS=m
CONFIG_MISC_FILESYSTEMS=y
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_ECRYPT_FS is not set
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
# CONFIG_LOGFS is not set
# CONFIG_CRAMFS is not set
# CONFIG_SQUASHFS is not set
# CONFIG_VXFS_FS is not set
# CONFIG_MINIX_FS is not set
# CONFIG_OMFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
# CONFIG_QNX6FS_FS is not set
# CONFIG_ROMFS_FS is not set
CONFIG_PSTORE=y
# CONFIG_PSTORE_RAM is not set
# CONFIG_SYSV_FS is not set
# CONFIG_UFS_FS is not set
CONFIG_NETWORK_FILESYSTEMS=y
CONFIG_NFS_FS=y
CONFIG_NFS_V2=y
CONFIG_NFS_V3=y
CONFIG_NFS_V3_ACL=y
CONFIG_NFS_V4=y
# CONFIG_NFS_V4_1 is not set
CONFIG_ROOT_NFS=y
# CONFIG_NFS_USE_LEGACY_DNS is not set
CONFIG_NFS_USE_KERNEL_DNS=y
# CONFIG_NFSD is not set
CONFIG_LOCKD=y
CONFIG_LOCKD_V4=y
CONFIG_NFS_ACL_SUPPORT=y
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=y
CONFIG_SUNRPC_GSS=y
# CONFIG_SUNRPC_DEBUG is not set
# CONFIG_CEPH_FS is not set
# CONFIG_CIFS is not set
# CONFIG_NCP_FS is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="utf8"
CONFIG_NLS_CODEPAGE_437=y
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
# CONFIG_NLS_CODEPAGE_850 is not set
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
# CONFIG_NLS_CODEPAGE_936 is not set
# CONFIG_NLS_CODEPAGE_950 is not set
# CONFIG_NLS_CODEPAGE_932 is not set
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
# CONFIG_NLS_CODEPAGE_1250 is not set
# CONFIG_NLS_CODEPAGE_1251 is not set
CONFIG_NLS_ASCII=y
CONFIG_NLS_ISO8859_1=y
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
# CONFIG_NLS_ISO8859_15 is not set
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
CONFIG_NLS_UTF8=y
CONFIG_DLM=m
# CONFIG_DLM_DEBUG is not set

#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_PRINTK_TIME=y
CONFIG_DEFAULT_MESSAGE_LOGLEVEL=4
# CONFIG_ENABLE_WARN_DEPRECATED is not set
CONFIG_ENABLE_MUST_CHECK=y
CONFIG_FRAME_WARN=2048
CONFIG_MAGIC_SYSRQ=y
# CONFIG_STRIP_ASM_SYMS is not set
# CONFIG_READABLE_ASM is not set
# CONFIG_UNUSED_SYMBOLS is not set
CONFIG_DEBUG_FS=y
# CONFIG_HEADERS_CHECK is not set
# CONFIG_DEBUG_SECTION_MISMATCH is not set
CONFIG_DEBUG_KERNEL=y
# CONFIG_DEBUG_SHIRQ is not set
CONFIG_LOCKUP_DETECTOR=y
CONFIG_HARDLOCKUP_DETECTOR=y
# CONFIG_BOOTPARAM_HARDLOCKUP_PANIC is not set
CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=0
# CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=0
CONFIG_DETECT_HUNG_TASK=y
CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=120
# CONFIG_BOOTPARAM_HUNG_TASK_PANIC is not set
CONFIG_BOOTPARAM_HUNG_TASK_PANIC_VALUE=0
CONFIG_SCHED_DEBUG=y
CONFIG_SCHEDSTATS=y
CONFIG_TIMER_STATS=y
# CONFIG_DEBUG_OBJECTS is not set
# CONFIG_SLUB_DEBUG_ON is not set
# CONFIG_SLUB_STATS is not set
# CONFIG_DEBUG_KMEMLEAK is not set
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_RT_MUTEX_TESTER is not set
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_LOCK_ALLOC=y
# CONFIG_PROVE_LOCKING is not set
# CONFIG_SPARSE_RCU_POINTER is not set
CONFIG_LOCKDEP=y
CONFIG_LOCK_STAT=y
# CONFIG_DEBUG_LOCKDEP is not set
CONFIG_DEBUG_ATOMIC_SLEEP=y
CONFIG_DEBUG_LOCKING_API_SELFTESTS=y
CONFIG_STACKTRACE=y
CONFIG_DEBUG_STACK_USAGE=y
# CONFIG_DEBUG_KOBJECT is not set
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_INFO=y
# CONFIG_DEBUG_INFO_REDUCED is not set
# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_VIRTUAL is not set
# CONFIG_DEBUG_WRITECOUNT is not set
CONFIG_DEBUG_MEMORY_INIT=y
# CONFIG_DEBUG_LIST is not set
# CONFIG_TEST_LIST_SORT is not set
# CONFIG_DEBUG_SG is not set
# CONFIG_DEBUG_NOTIFIERS is not set
# CONFIG_DEBUG_CREDENTIALS is not set
CONFIG_ARCH_WANT_FRAME_POINTERS=y
CONFIG_FRAME_POINTER=y
# CONFIG_BOOT_PRINTK_DELAY is not set
# CONFIG_RCU_TORTURE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=60
# CONFIG_RCU_CPU_STALL_INFO is not set
# CONFIG_RCU_TRACE is not set
# CONFIG_KPROBES_SANITY_TEST is not set
# CONFIG_BACKTRACE_SELF_TEST is not set
# CONFIG_DEBUG_BLOCK_EXT_DEVT is not set
# CONFIG_DEBUG_FORCE_WEAK_PER_CPU is not set
# CONFIG_DEBUG_PER_CPU_MAPS is not set
# CONFIG_LKDTM is not set
# CONFIG_CPU_NOTIFIER_ERROR_INJECT is not set
# CONFIG_FAULT_INJECTION is not set
CONFIG_LATENCYTOP=y
# CONFIG_DEBUG_PAGEALLOC is not set
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_FP_TEST=y
CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_SYSCALL_TRACEPOINTS=y
CONFIG_HAVE_C_RECORDMCOUNT=y
CONFIG_RING_BUFFER=y
CONFIG_EVENT_TRACING=y
# CONFIG_EVENT_POWER_TRACING_DEPRECATED is not set
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_RING_BUFFER_ALLOW_SWAP=y
CONFIG_TRACING=y
CONFIG_GENERIC_TRACER=y
CONFIG_TRACING_SUPPORT=y
CONFIG_FTRACE=y
# CONFIG_FUNCTION_TRACER is not set
# CONFIG_IRQSOFF_TRACER is not set
# CONFIG_SCHED_TRACER is not set
# CONFIG_FTRACE_SYSCALLS is not set
CONFIG_BRANCH_PROFILE_NONE=y
# CONFIG_PROFILE_ANNOTATED_BRANCHES is not set
# CONFIG_PROFILE_ALL_BRANCHES is not set
# CONFIG_STACK_TRACER is not set
CONFIG_BLK_DEV_IO_TRACE=y
CONFIG_KPROBE_EVENT=y
# CONFIG_UPROBE_EVENT is not set
CONFIG_PROBE_EVENTS=y
# CONFIG_FTRACE_STARTUP_TEST is not set
# CONFIG_MMIOTRACE is not set
# CONFIG_RING_BUFFER_BENCHMARK is not set
# CONFIG_PROVIDE_OHCI1394_DMA_INIT is not set
# CONFIG_DYNAMIC_DEBUG is not set
# CONFIG_DMA_API_DEBUG is not set
# CONFIG_ATOMIC64_SELFTEST is not set
# CONFIG_SAMPLES is not set
CONFIG_HAVE_ARCH_KGDB=y
# CONFIG_KGDB is not set
CONFIG_HAVE_ARCH_KMEMCHECK=y
# CONFIG_KMEMCHECK is not set
# CONFIG_TEST_KSTRTOX is not set
# CONFIG_STRICT_DEVMEM is not set
CONFIG_X86_VERBOSE_BOOTUP=y
CONFIG_EARLY_PRINTK=y
CONFIG_EARLY_PRINTK_DBGP=y
CONFIG_DEBUG_STACKOVERFLOW=y
# CONFIG_X86_PTDUMP is not set
CONFIG_DEBUG_RODATA=y
# CONFIG_DEBUG_RODATA_TEST is not set
# CONFIG_DEBUG_SET_MODULE_RONX is not set
CONFIG_DEBUG_NX_TEST=m
# CONFIG_IOMMU_DEBUG is not set
# CONFIG_IOMMU_STRESS is not set
CONFIG_HAVE_MMIOTRACE_SUPPORT=y
# CONFIG_X86_DECODER_SELFTEST is not set
CONFIG_IO_DELAY_TYPE_0X80=0
CONFIG_IO_DELAY_TYPE_0XED=1
CONFIG_IO_DELAY_TYPE_UDELAY=2
CONFIG_IO_DELAY_TYPE_NONE=3
CONFIG_IO_DELAY_0X80=y
# CONFIG_IO_DELAY_0XED is not set
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
CONFIG_DEFAULT_IO_DELAY_TYPE=0
CONFIG_DEBUG_BOOT_PARAMS=y
# CONFIG_CPA_DEBUG is not set
CONFIG_OPTIMIZE_INLINING=y
# CONFIG_DEBUG_STRICT_USER_COPY_CHECKS is not set
CONFIG_DEBUG_NMI_SELFTEST=y

#
# Security options
#
CONFIG_KEYS=y
# CONFIG_ENCRYPTED_KEYS is not set
CONFIG_KEYS_DEBUG_PROC_KEYS=y
# CONFIG_SECURITY_DMESG_RESTRICT is not set
CONFIG_SECURITY=y
# CONFIG_SECURITYFS is not set
CONFIG_SECURITY_NETWORK=y
# CONFIG_SECURITY_NETWORK_XFRM is not set
# CONFIG_SECURITY_PATH is not set
# CONFIG_INTEL_TXT is not set
CONFIG_LSM_MMAP_MIN_ADDR=65536
CONFIG_SECURITY_SELINUX=y
CONFIG_SECURITY_SELINUX_BOOTPARAM=y
CONFIG_SECURITY_SELINUX_BOOTPARAM_VALUE=1
CONFIG_SECURITY_SELINUX_DISABLE=y
CONFIG_SECURITY_SELINUX_DEVELOP=y
CONFIG_SECURITY_SELINUX_AVC_STATS=y
CONFIG_SECURITY_SELINUX_CHECKREQPROT_VALUE=1
# CONFIG_SECURITY_SELINUX_POLICYDB_VERSION_MAX is not set
# CONFIG_SECURITY_SMACK is not set
# CONFIG_SECURITY_TOMOYO is not set
# CONFIG_SECURITY_APPARMOR is not set
# CONFIG_SECURITY_YAMA is not set
# CONFIG_IMA is not set
# CONFIG_EVM is not set
CONFIG_DEFAULT_SECURITY_SELINUX=y
# CONFIG_DEFAULT_SECURITY_DAC is not set
CONFIG_DEFAULT_SECURITY="selinux"
CONFIG_CRYPTO=y

#
# Crypto core or helper
#
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_ALGAPI2=y
CONFIG_CRYPTO_AEAD=y
CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_BLKCIPHER=y
CONFIG_CRYPTO_BLKCIPHER2=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_PCOMP2=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_MANAGER2=y
# CONFIG_CRYPTO_USER is not set
CONFIG_CRYPTO_MANAGER_DISABLE_TESTS=y
# CONFIG_CRYPTO_GF128MUL is not set
# CONFIG_CRYPTO_NULL is not set
# CONFIG_CRYPTO_PCRYPT is not set
CONFIG_CRYPTO_WORKQUEUE=y
# CONFIG_CRYPTO_CRYPTD is not set
CONFIG_CRYPTO_AUTHENC=y
# CONFIG_CRYPTO_TEST is not set

#
# Authenticated Encryption with Associated Data
#
# CONFIG_CRYPTO_CCM is not set
# CONFIG_CRYPTO_GCM is not set
# CONFIG_CRYPTO_SEQIV is not set

#
# Block modes
#
CONFIG_CRYPTO_CBC=y
# CONFIG_CRYPTO_CTR is not set
# CONFIG_CRYPTO_CTS is not set
CONFIG_CRYPTO_ECB=y
# CONFIG_CRYPTO_LRW is not set
# CONFIG_CRYPTO_PCBC is not set
# CONFIG_CRYPTO_XTS is not set

#
# Hash modes
#
CONFIG_CRYPTO_HMAC=y
# CONFIG_CRYPTO_XCBC is not set
# CONFIG_CRYPTO_VMAC is not set

#
# Digest
#
CONFIG_CRYPTO_CRC32C=m
# CONFIG_CRYPTO_CRC32C_INTEL is not set
# CONFIG_CRYPTO_GHASH is not set
# CONFIG_CRYPTO_MD4 is not set
CONFIG_CRYPTO_MD5=y
# CONFIG_CRYPTO_MICHAEL_MIC is not set
# CONFIG_CRYPTO_RMD128 is not set
# CONFIG_CRYPTO_RMD160 is not set
# CONFIG_CRYPTO_RMD256 is not set
# CONFIG_CRYPTO_RMD320 is not set
CONFIG_CRYPTO_SHA1=y
# CONFIG_CRYPTO_SHA1_SSSE3 is not set
# CONFIG_CRYPTO_SHA256 is not set
# CONFIG_CRYPTO_SHA512 is not set
# CONFIG_CRYPTO_TGR192 is not set
# CONFIG_CRYPTO_WP512 is not set
# CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL is not set

#
# Ciphers
#
CONFIG_CRYPTO_AES=y
# CONFIG_CRYPTO_AES_X86_64 is not set
# CONFIG_CRYPTO_AES_NI_INTEL is not set
# CONFIG_CRYPTO_ANUBIS is not set
CONFIG_CRYPTO_ARC4=y
# CONFIG_CRYPTO_BLOWFISH is not set
# CONFIG_CRYPTO_BLOWFISH_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA is not set
# CONFIG_CRYPTO_CAMELLIA_X86_64 is not set
# CONFIG_CRYPTO_CAST5 is not set
# CONFIG_CRYPTO_CAST6 is not set
CONFIG_CRYPTO_DES=y
# CONFIG_CRYPTO_FCRYPT is not set
# CONFIG_CRYPTO_KHAZAD is not set
# CONFIG_CRYPTO_SALSA20 is not set
# CONFIG_CRYPTO_SALSA20_X86_64 is not set
# CONFIG_CRYPTO_SEED is not set
# CONFIG_CRYPTO_SERPENT is not set
# CONFIG_CRYPTO_SERPENT_SSE2_X86_64 is not set
# CONFIG_CRYPTO_TEA is not set
# CONFIG_CRYPTO_TWOFISH is not set
# CONFIG_CRYPTO_TWOFISH_X86_64 is not set
# CONFIG_CRYPTO_TWOFISH_X86_64_3WAY is not set

#
# Compression
#
# CONFIG_CRYPTO_DEFLATE is not set
# CONFIG_CRYPTO_ZLIB is not set
# CONFIG_CRYPTO_LZO is not set

#
# Random Number Generation
#
# CONFIG_CRYPTO_ANSI_CPRNG is not set
# CONFIG_CRYPTO_USER_API_HASH is not set
# CONFIG_CRYPTO_USER_API_SKCIPHER is not set
CONFIG_CRYPTO_HW=y
# CONFIG_CRYPTO_DEV_PADLOCK is not set
CONFIG_HAVE_KVM=y
CONFIG_HAVE_KVM_IRQCHIP=y
CONFIG_HAVE_KVM_EVENTFD=y
CONFIG_KVM_APIC_ARCHITECTURE=y
CONFIG_KVM_MMIO=y
CONFIG_KVM_ASYNC_PF=y
CONFIG_HAVE_KVM_MSI=y
CONFIG_VIRTUALIZATION=y
CONFIG_KVM=y
CONFIG_KVM_INTEL=y
# CONFIG_KVM_AMD is not set
# CONFIG_KVM_MMU_AUDIT is not set
CONFIG_VHOST_NET=m
CONFIG_BINARY_PRINTF=y

#
# Library routines
#
CONFIG_BITREVERSE=y
CONFIG_GENERIC_STRNCPY_FROM_USER=y
CONFIG_GENERIC_STRNLEN_USER=y
CONFIG_GENERIC_FIND_FIRST_BIT=y
CONFIG_GENERIC_PCI_IOMAP=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_IO=y
CONFIG_CRC_CCITT=y
CONFIG_CRC16=y
CONFIG_CRC_T10DIF=y
CONFIG_CRC_ITU_T=y
CONFIG_CRC32=y
# CONFIG_CRC32_SELFTEST is not set
CONFIG_CRC32_SLICEBY8=y
# CONFIG_CRC32_SLICEBY4 is not set
# CONFIG_CRC32_SARWATE is not set
# CONFIG_CRC32_BIT is not set
# CONFIG_CRC7 is not set
CONFIG_LIBCRC32C=m
# CONFIG_CRC8 is not set
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=y
CONFIG_LZO_COMPRESS=y
CONFIG_LZO_DECOMPRESS=y
CONFIG_XZ_DEC=y
CONFIG_XZ_DEC_X86=y
CONFIG_XZ_DEC_POWERPC=y
CONFIG_XZ_DEC_IA64=y
CONFIG_XZ_DEC_ARM=y
CONFIG_XZ_DEC_ARMTHUMB=y
CONFIG_XZ_DEC_SPARC=y
CONFIG_XZ_DEC_BCJ=y
# CONFIG_XZ_DEC_TEST is not set
CONFIG_DECOMPRESS_GZIP=y
CONFIG_DECOMPRESS_BZIP2=y
CONFIG_DECOMPRESS_LZMA=y
CONFIG_DECOMPRESS_XZ=y
CONFIG_DECOMPRESS_LZO=y
CONFIG_GENERIC_ALLOCATOR=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_DMA=y
CONFIG_CHECK_SIGNATURE=y
CONFIG_CPU_RMAP=y
CONFIG_DQL=y
CONFIG_NLATTR=y
CONFIG_AVERAGE=y
# CONFIG_CORDIC is not set
# CONFIG_DDR is not set


^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: AutoNUMA15
  2012-05-31 20:01       ` AutoNUMA15 Don Morris
@ 2012-05-31 22:54         ` Andrea Arcangeli
  2012-06-01  0:04           ` AutoNUMA15 Andrea Arcangeli
  0 siblings, 1 reply; 236+ messages in thread
From: Andrea Arcangeli @ 2012-05-31 22:54 UTC (permalink / raw)
  To: Don Morris; +Cc: linux-mm

Hi Don,

On Thu, May 31, 2012 at 01:01:19PM -0700, Don Morris wrote:
> We're a special pte, but a non-zero pfn. Being Xorg, I'm
> assuming this is a remap of a kernel page into the user virtual
> address space, but that's just a gut instinct. Since I read

I reproduced it. The address is in /dev/mem, the other is a nonlinear
ext4 map.

> the above as "We don't expect to ever take spurious faults
> on instantiated special ptes", I would think you'd need

Agreed, I'd better skip VM_PFNMAP|VM_MIXEDMAP, but it still
shouldn't fail like this: vm_normal_page shouldn't error out on a
pte_special.

> Of course... I'm still really ramping up on this kernel, so
> that could all be hokum, too. Hopefully it helps.

It helps a lot, thanks!

> I can dump the EFI memory map and whatnot to you if you
> need it, but I think this is more of an algorithmic issue

No need, I can reproduce.

On the bright side, it looks totally harmless and you can ignore it.

And if you run "echo 0 >/sys/kernel/mm/autonuma/knuma_scand/pmd" it
seems to go away, but I suggest keeping the default and ignoring it;
the pmd scan saves 1% of the overhead.

I'll push a fix in the origin/autonuma branch as soon as I figure it
out...

Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: AutoNUMA15
  2012-05-31 22:54         ` AutoNUMA15 Andrea Arcangeli
@ 2012-06-01  0:04           ` Andrea Arcangeli
  2012-05-31 18:52             ` AutoNUMA15 Don Morris
  0 siblings, 1 reply; 236+ messages in thread
From: Andrea Arcangeli @ 2012-06-01  0:04 UTC (permalink / raw)
  To: Don Morris; +Cc: linux-mm

On Fri, Jun 01, 2012 at 12:54:06AM +0200, Andrea Arcangeli wrote:
> I'll push a fix in the origin/autonuma branch as soon as I figure it
> out...

7f9729f89000-7f9729faa000 rw-p 00000000 00:00 0
7f9729faa000-7f9729fea000 rw-s 000c0000 00:15 15364 /dev/mem

addr:00007f9729fc5000 vm_flags:08000875 anon_vma:          (null)
mapping:ffff880430025410 index:b8

The reason for the false positive was that there are multiple vmas in
the same pmd range, and I was passing the single vma belonging to the
page fault address for all ptes in that pmd.

The vma is only used for that check, which is why it was harmless.

The vma found during the page fault would have been valid for the huge
pmd numa fixup and the pte numa fixup, but not for the less granular
pmd numa fixup (not huge).

This also explains why echo 0 >/sys/kernel/mm/autonuma/knuma_scand/pmd
avoided the warnings.

Can you test the below? I'll push the fix in the origin/autonuma branch.

Thanks!

---
 include/linux/autonuma.h |    4 ++--
 mm/autonuma.c            |   19 +++++++++++++++++--
 mm/memory.c              |    5 ++---
 3 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/include/linux/autonuma.h b/include/linux/autonuma.h
index b0a8d87..67af86a 100644
--- a/include/linux/autonuma.h
+++ b/include/linux/autonuma.h
@@ -46,8 +46,8 @@ static inline void autonuma_free_page(struct page *page) {}
 
 extern pte_t __pte_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
 			      unsigned long addr, pte_t pte, pte_t *ptep);
-extern void __pmd_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
-			     unsigned long addr, pmd_t *pmd);
+extern void __pmd_numa_fixup(struct mm_struct *mm, unsigned long addr,
+			     pmd_t *pmd);
 extern void numa_hinting_fault(struct page *page, int numpages);
 
 #endif /* _LINUX_AUTONUMA_H */
diff --git a/mm/autonuma.c b/mm/autonuma.c
index d37647a..ca4c189 100644
--- a/mm/autonuma.c
+++ b/mm/autonuma.c
@@ -349,14 +349,16 @@ pte_t __pte_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
 	return pte;
 }
 
-void __pmd_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
+void __pmd_numa_fixup(struct mm_struct *mm,
 		      unsigned long addr, pmd_t *pmdp)
 {
 	pmd_t pmd;
 	pte_t *pte;
 	unsigned long _addr = addr & PMD_MASK;
+	unsigned long offset;
 	spinlock_t *ptl;
 	bool numa = false;
+	struct vm_area_struct *vma;
 
 	spin_lock(&mm->page_table_lock);
 	pmd = *pmdp;
@@ -369,12 +371,25 @@ void __pmd_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!numa)
 		return;
 
+	vma = find_vma(mm, _addr);
+	/* we're in a page fault so some vma must be in the range */
+	BUG_ON(!vma);
+	BUG_ON(vma->vm_start >= _addr + PMD_SIZE);
+	offset = max(_addr, vma->vm_start) & ~PMD_MASK;
+	VM_BUG_ON(offset >= PMD_SIZE);
 	pte = pte_offset_map_lock(mm, pmdp, _addr, &ptl);
-	for (addr = _addr; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
+	pte += offset >> PAGE_SHIFT;
+	for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
 		pte_t pteval = *pte;
 		struct page * page;
 		if (!pte_present(pteval))
 			continue;
+		if (addr >= vma->vm_end) {
+			vma = find_vma(mm, addr);
+			/* there's a pte present so there must be a vma */
+			BUG_ON(!vma);
+			BUG_ON(addr < vma->vm_start);
+		}
 		if (pte_numa(pteval)) {
 			pteval = pte_mknotnuma(pteval);
 			set_pte_at(mm, addr, pte, pteval);
diff --git a/mm/memory.c b/mm/memory.c
index f46cf8a..bbf10c7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3409,11 +3409,10 @@ static inline pte_t pte_numa_fixup(struct mm_struct *mm,
 }
 
 static inline void pmd_numa_fixup(struct mm_struct *mm,
-				  struct vm_area_struct *vma,
 				  unsigned long addr, pmd_t *pmd)
 {
 	if (pmd_numa(*pmd))
-		__pmd_numa_fixup(mm, vma, addr, pmd);
+		__pmd_numa_fixup(mm, addr, pmd);
 }
 
 static inline pmd_t huge_pmd_numa_fixup(struct mm_struct *mm,
@@ -3552,7 +3551,7 @@ retry:
 		}
 	}
 
-	pmd_numa_fixup(mm, vma, address, pmd);
+	pmd_numa_fixup(mm, address, pmd);
 
 	/*
 	 * Use __pte_alloc instead of pte_alloc_map, because we can't

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 236+ messages in thread

* Re: [PATCH 00/35] AutoNUMA alpha14
  2012-05-25 17:02 ` Andrea Arcangeli
@ 2012-06-01 22:41   ` Mauricio Faria de Oliveira
  -1 siblings, 0 replies; 236+ messages in thread
From: Mauricio Faria de Oliveira @ 2012-06-01 22:41 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, srikar,
	mjw

Hi Andrea, everyone..

AA> Changelog from alpha13 to alpha14:
AA> [...]
AA> o autonuma_balance only runs along with run_rebalance_domains, to
AA>    avoid altering the scheduler runtime. [...]
AA>    [...] This change has not
AA>    yet been tested on specjbb or more schedule intensive benchmarks,
AA>    but I don't expect measurable NUMA affinity regressions. [...]

Perhaps I can contribute a bit to the SPECjbb tests.

I got SPECjbb2005 results for 3.4-rc2 mainline, numasched,
autonuma-alpha10, and autonuma-alpha13. If you judge the data to be OK,
it may serve as a baseline for comparing autonuma-alpha13/14 to check
for NUMA affinity regressions.

The system is an Intel 2-socket Blade. Each NUMA node has 6 cores (+6 
hyperthreads) and 12 GB RAM. Different permutations of THP, KSM, and VM 
memory size were tested for each kernel.

I'll have to leave the analysis of each variable for you, as I'm not 
familiar w/ the code and expected impacts; but I'm perfectly fine with 
providing more details about the tests, environment and procedures, and 
even some reruns, if needed.

Please CC me on questions and comments.


Environment:
------------

Host:
- Enterprise Linux Distro
- Kernel: 3.4-rc2 (either mainline, or patched w/ numasched, 
autonuma-alpha10, or autonuma-alpha13)
- 2 NUMA nodes. 6 cores + 6 hyperthreads/node, 12 GB RAM/node.
   (total of 24 logical CPUs and 24 GB RAM)
- Hypervisor: qemu-kvm 1.0.50 (+ memsched patches only for numasched)

VMs:
- Enterprise Linux Distro
- Distro Kernel

   1 Main VM (VM1) -- relevant benchmark score.
   - 12 vCPUs
   - 12 GB (for '< 1 Node' configuration) or 14 GB (for '> 1 Node' 
configuration)

   2 Noise VMs (VM2 and VM3)
   - each noise VM has half of the remaining resources.
   - 6 vCPUs
   - 4 GB (for '< 1 Node' configuration) or 3 GB ('> 1 Node' configuration)
     (to sum 20 GB w/ main VM + 4 GB for host = total 24 GB)

Settings:
- Swapping disabled on host and VMs.
- Memory Overcommit enabled on host and VMs.
- THP on host is a variable. THP disabled on VMs.
- KSM on host is a variable. KSM disabled on VMs.


Results
=======

Reference is mainline kernel with THP disabled (its score is 
approximately 100%). It performed similarly (less than 2% difference) on 
the 4 permutations of KSM and Main VM memory size.

For the results of all permutations, see chart [1].
One interesting permutation seems to be: No THP (disabled); KSM (enabled).

Interpretation:
- higher is better;
- main VM should perform better than noise VMs;
- noise VMs should perform similarly.


Main VM < 1 Node
-----------------

                 Main VM     Noise VM    Noise VM
mainline        ~100%       60%         60%
numasched *     50%/135%    30%/58%     40%/68%
autonuma-a10    125%        60%         60%
autonuma-a13    126%        32%         32%

* numasched yielded a wide range of scores. Is this behavior expected?


Main VM > 1 Node.
-----------------

                 Main VM     Noise VM    Noise VM
mainline        ~100%       60%         59%
numasched       60%         48%         48%
autonuma-a10    62%         37%         38%
autonuma-a13    125%        61%         63%



Considerations:
---------------

The 3 VMs ran SPECjbb2005, synchronously starting the benchmark.

For the benchmark run to take about the same time on the 3 VMs, its
configuration for the Noise VMs is different from that of the Main VM.
So comparing VM1 scores w/ VM2 or VM3 scores is not reasonable.
But comparing scores between VM2 and VM3 is perfectly fine (it's
evidence of the balancing performed).

Sometimes both autonuma and numasched prioritized one of the Noise VMs 
over the other Noise VM, or even over the Main VM. In these cases, some 
reruns would yield scores of 'expected proportion', given the VMs 
configuration (Main VM w/ the highest score, both Noise VMs with lower 
scores which are about the same).

The non-expected proportion scores happened less often w/ 
autonuma-alpha13, followed by autonuma-alpha10, and finally numasched 
(i.e., numasched had the greatest rate of non-expected proportion scores).

For most permutations, numasched didn't yield scores of the expected 
proportion. I'd like to know how likely this is to happen before 
performing additional runs to confirm it. Could anyone provide 
evidence or thoughts?


Links:
------

[1] http://dl.dropbox.com/u/82832537/kvm-numa-comparison-0.png


-- 
Mauricio Faria de Oliveira
IBM Linux Technology Center



^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 13/35] autonuma: add page structure fields
  2012-05-31 18:18                 ` Peter Zijlstra
@ 2012-06-05 14:51                   ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-06-05 14:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: KOSAKI Motohiro, Rik van Riel, linux-kernel, linux-mm,
	Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Johannes Weiner, Srivatsa Vaddagiri,
	Christoph Lameter

Hi,

On Thu, May 31, 2012 at 08:18:59PM +0200, Peter Zijlstra wrote:
> On Wed, 2012-05-30 at 15:49 +0200, Andrea Arcangeli wrote:
> > 
> > I'm thinking about it but probably reducing the page_autonuma to one
> > per pmd is going to be the simplest solution considering by default we
> > only track the pmd anyway. 
> 
> Do also consider that some archs have larger base page size. So their
> effective PMD size is increased as well.

With a larger PAGE_SIZE like 64k I doubt this would be a concern; it's
just that 4k is too small.

I've now done a number of cleanups and already added a number of
comments. I'll write the badly needed docs on the autonuma_balance()
function ASAP, but at least the cleanups are already committed in the
autonuma branch of my git tree.

From my side, the thing that annoys me the most at the moment is the
page_autonuma size.

So I gave more thought to the idea outlined above, but I gave up after
less than a minute of thinking about what I could run into doing
that. The fact that we do pmd tracking in knuma_scand by default
(possible to disable with sysfs) is irrelevant. Unless I'm only going
to track THP pages, one page_autonuma per pmd won't work: when the
pmd_numa fault triggers, it's all nonlinear on whatever scattered 4k
page is pointed to by the pte, not shared pagecache especially.
I kept thinking more on it, I should have now figured how to reduce
the page_autonuma to 12 bytes per 4k page on both 32bit and 64bit
without losing information (no code written yet but this one should
work). I just couldn't shrink it below 12 bytes without going into
ridiculous high and worthless complexities.

After this change AutoNUMA will bail out if either of the two
conditions below is true:

1) MAX_NUMNODES >= 65536
2) any NUMA node pgdat.node_spanned_pages >= 16TB/PAGE_SIZE

That means AutoNUMA will disengage itself automatically on boot on x86
NUMA systems with more than 1152921504606846976 bytes of RAM; that's
60 bits of physical address space, and no x86 CPU even gets that far
in terms of physical address space.
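
To make that concrete, here is a minimal sketch of the kind of
boot-time check implied by the two conditions above. This is purely
illustrative: the function name and exact placement are assumptions,
not the actual AutoNUMA code.

	/*
	 * Illustrative sketch only (hypothetical names).  The pfn offsets
	 * in page_autonuma are 32bit, so each node must span fewer than
	 * 2^32 pages; with 4k pages that's 16TB per node.  Node ids must
	 * also fit in the 16bit fields.
	 */
	static bool __init autonuma_layout_unsupported(void)
	{
		int nid;

		if (MAX_NUMNODES >= (1 << 16))
			return true;

		for_each_online_node(nid)
			if ((u64)NODE_DATA(nid)->node_spanned_pages >= (1ULL << 32))
				return true;

		return false;
	}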

Other archs requiring more memory than that will hopefully have a
PAGE_SIZE > 4KB (in turn doubling the per-node limit of RAM with every
doubling of the PAGE_SIZE, without having to increase the size of
page_autonuma beyond 12 bytes even on 64bit).

A packed 12 bytes per page should be all I need (maybe some arch with
alignment troubles may prefer to make it 16 bytes, but on x86 packed
should work). So on x86 that's 0.29% of RAM used for autonuma, and it
is only spent when booting on NUMA hardware (and trivial to get rid of
by passing "noautonuma" on the command line).

If I leave the anti-false-sharing last_nid information in the page
structure plus a pointer to a dynamic structure, that would still be
about 12 bytes. So I'd rather spend those 12 bytes to avoid having to
point to a dynamic object, which would in fact waste even more memory
in addition to the 12 bytes of pointer+last_nid.

The details of the solution:

struct page_autonuma {
    short autonuma_last_nid;
    short autonuma_migrate_nid;
    unsigned int pfn_offset_next;
    unsigned int pfn_offset_prev;
} __attribute__((packed));

page_autonuma can only point to a page that belongs to the same node
(page_autonuma is queued into the
NODE_DATA(autonuma_migrate_nid)->autonuma_migrate_head[src_nid]) where
src_nid is the source node that page_autonuma belongs to, so all pages
in the autonuma_migrate_head[src_nid] lru must come from the same
src_nid. So the next page_autonuma in the list will be
lookup_page_autonuma(pfn_to_page(NODE_DATA(src_nid)->node_start_pfn +
page_autonuma->pfn_offset_next)) etc..

Of course all list_add/del must be hardcoded specially for this, but
it's not a conceptually difficult solution; we just can't use list.h
and straight pointers anymore, so some conversion must happen.
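
For illustration, a minimal sketch of how the offset-based links could
be walked, assuming the lookup_page_autonuma() helper mentioned above;
page_autonuma_next() and page_autonuma_offset() are hypothetical names,
not the actual conversion.

	/*
	 * Purely illustrative sketch based on the description above: the
	 * "next" link is a 32bit pfn offset relative to the source node's
	 * node_start_pfn instead of a pointer.
	 */
	static struct page_autonuma *page_autonuma_next(int src_nid,
							struct page_autonuma *pa)
	{
		unsigned long pfn;

		pfn = NODE_DATA(src_nid)->node_start_pfn + pa->pfn_offset_next;
		return lookup_page_autonuma(pfn_to_page(pfn));
	}

	static unsigned int page_autonuma_offset(int src_nid, struct page *page)
	{
		/* encode a page of src_nid as an offset from its node_start_pfn */
		return page_to_pfn(page) - NODE_DATA(src_nid)->node_start_pfn;
	}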

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 04/35] autonuma: define _PAGE_NUMA_PTE and _PAGE_NUMA_PMD
  2012-05-30 20:01         ` Konrad Rzeszutek Wilk
@ 2012-06-05 17:13           ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-06-05 17:13 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter

On Wed, May 30, 2012 at 04:01:51PM -0400, Konrad Rzeszutek Wilk wrote:
> The only time the _PAGE_PSE (_PAGE_PAT) is set is when
> _PAGE_PCD | _PAGE_PWT are set. It is this ugly transformation
> of doing:
> 
>  if (pat_enabled && _PAGE_PWT | _PAGE_PCD)
> 	pte = ~(_PAGE_PWT | _PAGE_PCD) | _PAGE_PAT;
> 
> and then writting the pte with the 7th bit set instead of the
> 2nd and 3rd to mark it as WC. There is a corresponding reverse too
> (to read the pte - so the pte_val calls) - so if _PAGE_PAT is
> detected it will remove the _PAGE_PAT and return the PTE as
> if it had _PAGE_PWT | _PAGE_PCD.
> 
> So that little bit of code will need some tweaking - as it does
> that even if _PAGE_PRESENT is not set. Meaning it would
> transform your _PAGE_PAT to _PAGE_PWT | _PAGE_PCD. Gah!

It looks like this is disabled in current upstream?
8eaffa67b43e99ae581622c5133e20b0f48bcef1

> OK. I can whip up a patch to deal with the 'Gah!' case easily if needed.

That would help! But again it looks disabled in Xen?

For the Linux host (no Xen), when I decided to use PSE I checked this part:

	/* Set PWT to Write-Combining. All other bits stay the same */
	/*
	 * PTE encoding used in Linux:
	 *      PAT
	 *      |PCD
	 *      ||PWT
	 *      |||
	 *      000 WB		_PAGE_CACHE_WB
	 *      001 WC		_PAGE_CACHE_WC
	 *      010 UC-		_PAGE_CACHE_UC_MINUS
	 *      011 UC		_PAGE_CACHE_UC
	 * PAT bit unused
	 */

I need to go read the specs PDF and audit the code against the specs
to be sure, but if my interpretation is correct, PAT is never set on a
Linux host (no virt) the way the relevant MSRs are programmed.

If I couldn't use the PSE (/PAT) it'd screw with 32bit because I need
to poke a bit between _PAGE_BIT_DIRTY and _PAGE_BIT_GLOBAL to avoid
losing space on the swap entry, and there's just one bit in that range
(PSE).

_PAGE_UNUSED1 (besides being used by Xen) wouldn't work unless I
changed the swp entry format for 32bit x86, reducing the max amount of
swap (conditional on CONFIG_AUTONUMA, so it wouldn't be the end of the
world; plus the amount of swap on 32bit NUMA may not be so important).

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 04/35] autonuma: define _PAGE_NUMA_PTE and _PAGE_NUMA_PMD
  2012-06-05 17:13           ` Andrea Arcangeli
@ 2012-06-05 17:17             ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 236+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-06-05 17:17 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Konrad Rzeszutek Wilk, linux-kernel, linux-mm, Hillf Danton,
	Dan Smith, Peter Zijlstra, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

On Tue, Jun 05, 2012 at 07:13:54PM +0200, Andrea Arcangeli wrote:
> On Wed, May 30, 2012 at 04:01:51PM -0400, Konrad Rzeszutek Wilk wrote:
> > The only time the _PAGE_PSE (_PAGE_PAT) is set is when
> > _PAGE_PCD | _PAGE_PWT are set. It is this ugly transformation
> > of doing:
> > 
> >  if (pat_enabled && _PAGE_PWT | _PAGE_PCD)
> > 	pte = ~(_PAGE_PWT | _PAGE_PCD) | _PAGE_PAT;
> > 
> > and then writing the pte with the 7th bit set instead of the
> > 2nd and 3rd to mark it as WC. There is a corresponding reverse too
> > (to read the pte - so the pte_val calls) - so if _PAGE_PAT is
> > detected it will remove the _PAGE_PAT and return the PTE as
> > if it had _PAGE_PWT | _PAGE_PCD.
> > 
> > So that little bit of code will need some tweaking - as it does
> > that even if _PAGE_PRESENT is not set. Meaning it would
> > transform your _PAGE_PAT to _PAGE_PWT | _PAGE_PCD. Gah!
> 
> It looks like this is disabled in current upstream?
> 8eaffa67b43e99ae581622c5133e20b0f48bcef1

Yup. But it is a temporary bandaid that I hope to fix soon.
> 
> > OK. I can whip up a patch to deal with the 'Gah!' case easily if needed.
> 
> That would help! But again it looks disabled in Xen?
> 
> About linux host (no xen) when I decided to use PSE I checked this part:
> 
> 	/* Set PWT to Write-Combining. All other bits stay the same */
> 	/*
> 	 * PTE encoding used in Linux:
> 	 *      PAT
> 	 *      |PCD
> 	 *      ||PWT
> 	 *      |||
> 	 *      000 WB		_PAGE_CACHE_WB
> 	 *      001 WC		_PAGE_CACHE_WC
> 	 *      010 UC-		_PAGE_CACHE_UC_MINUS
> 	 *      011 UC		_PAGE_CACHE_UC
> 	 * PAT bit unused
> 	 */
> 
> I need to go read the specs pdf and audit the code against the specs
> to be sure, but if my interpretation is correct, PAT is never set on a
> linux host (novirt) the way the relevant MSRs are programmed.
> 
> If I couldn't use the PSE (/PAT) it'd screw with 32bit because I need
> to poke a bit between _PAGE_BIT_DIRTY and _PAGE_BIT_GLOBAL to avoid
> losing space on the swap entry, and there's just one bit in that range
> (PSE).
> 
> _PAGE_UNUSED1 (besides, it's used by Xen) wouldn't work unless I
> changed the swp entry format for 32bit x86, reducing the max amount of
> swap (conditional on CONFIG_AUTONUMA, so it wouldn't be the end of the
> world; plus the amount of swap on 32bit NUMA may not be that important).

Yeah, I concur. I think we should stick with _PAGE_PAT (/PSE), and I
can cook up the appropriate patch for it on the Xen side.

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 04/35] autonuma: define _PAGE_NUMA_PTE and _PAGE_NUMA_PMD
  2012-06-05 17:17             ` Konrad Rzeszutek Wilk
@ 2012-06-05 17:40               ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-06-05 17:40 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Konrad Rzeszutek Wilk, linux-kernel, linux-mm, Hillf Danton,
	Dan Smith, Peter Zijlstra, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

On Tue, Jun 05, 2012 at 01:17:27PM -0400, Konrad Rzeszutek Wilk wrote:
> Yup. But it is a temporary bandaid that I hope to fix soon.

Ok.

> Yeah, I concur. I think we should stick with _PAGE_PAT (/PSE), and I
> can cook up the appropriate patch for it on the Xen side.

Great, thanks!

Andrea

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: AutoNUMA15
  2012-05-31 18:08       ` AutoNUMA15 Andrea Arcangeli
@ 2012-06-07  2:30         ` Zhouping Liu
  -1 siblings, 0 replies; 236+ messages in thread
From: Zhouping Liu @ 2012-06-07  2:30 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Petr Holasek, Kirill A. Shutemov, linux-kernel, linux-mm,
	Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, caiqian

On 06/01/2012 02:08 AM, Andrea Arcangeli wrote:
> Hi,
>
> On Tue, May 29, 2012 at 05:43:09PM +0200, Petr Holasek wrote:
>> Similar problem with __autonuma_migrate_page_remove here.
>>
>> [ 1945.516632] ------------[ cut here ]------------
>> [ 1945.516636] WARNING: at lib/list_debug.c:50 __list_del_entry+0x63/0xd0()
>> [ 1945.516642] Hardware name: ProLiant DL585 G5
>> [ 1945.516651] list_del corruption, ffff88017d68b068->next is LIST_POISON1 (dead000000100100)
>> [ 1945.516682] Modules linked in: ipt_MASQUERADE nf_conntrack_netbios_ns nf_conntrack_broadcast ip6table_mangle lockd ip6t_REJECT sunrpc nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables iptable_nat nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack mperf freq_table kvm_amd kvm pcspkr amd64_edac_mod edac_core serio_raw bnx2 microcode edac_mce_amd shpchp k10temp hpilo ipmi_si ipmi_msghandler hpwdt qla2xxx hpsa ata_generic pata_acpi scsi_transport_fc scsi_tgt cciss pata_amd radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core [last unloaded: scsi_wait_scan]
>> [ 1945.516694] Pid: 150, comm: knuma_migrated0 Tainted: G        W    3.4.0aa_alpha+ #3
>> [ 1945.516701] Call Trace:
>> [ 1945.516710]  [<ffffffff8105788f>] warn_slowpath_common+0x7f/0xc0
>> [ 1945.516717]  [<ffffffff81057986>] warn_slowpath_fmt+0x46/0x50
>> [ 1945.516726]  [<ffffffff812f9713>] __list_del_entry+0x63/0xd0
>> [ 1945.516735]  [<ffffffff812f9791>] list_del+0x11/0x40
>> [ 1945.516743]  [<ffffffff81165b98>] __autonuma_migrate_page_remove+0x48/0x80
>> [ 1945.516746]  [<ffffffff81165e66>] knuma_migrated+0x296/0x8a0
>> [ 1945.516749]  [<ffffffff8107a200>] ? wake_up_bit+0x40/0x40
>> [ 1945.516758]  [<ffffffff81165bd0>] ? __autonuma_migrate_page_remove+0x80/0x80
>> [ 1945.516766]  [<ffffffff81079cc3>] kthread+0x93/0xa0
>> [ 1945.516780]  [<ffffffff81626f24>] kernel_thread_helper+0x4/0x10
>> [ 1945.516791]  [<ffffffff81079c30>] ? flush_kthread_worker+0x80/0x80
>> [ 1945.516798]  [<ffffffff81626f20>] ? gs_change+0x13/0x13
>> [ 1945.516800] ---[ end trace 7cab294af87bd79f ]---
> I didn't manage to reproduce it on my hardware but it seems this was
> caused by the autonuma_migrate_split_huge_page: the tail page list
> linking wasn't surrounded by the compound lock to make list insertion
> and migrate_nid setting atomic like it happens everywhere else (the
> caller holding the lock on the head page wasn't enough to make the
> tails stable too).
>
> I released an AutoNUMA15 branch that includes all pending fixes:
>
> git clone --reference linux -b autonuma15 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git

Hi, Andrea and all

when I tested the autonuma patch set, a kernel panic occurred while
booting the newly compiled kernel; I also hit the issue in the latest
Linus tree (3.5.0-rc1). A partial call trace is:

[    2.635443] kernel BUG at include/linux/gfp.h:318!
[    2.642998] invalid opcode: 0000 [#1] SMP
[    2.651148] CPU 0
[    2.653911] Modules linked in:
[    2.662388]
[    2.664657] Pid: 1, comm: swapper/0 Not tainted 3.4.0+ #1 HP ProLiant 
DL585 G7
[    2.677609] RIP: 0010:[<ffffffff811b044d>]  [<ffffffff811b044d>] 
new_slab+0x26d/0x310
[    2.692803] RSP: 0018:ffff880135ad3c80  EFLAGS: 00010246
[    2.702541] RAX: 0000000000000000 RBX: ffff880137008c80 RCX: 
ffff8801377db780
[    2.716402] RDX: ffff880135bf8000 RSI: 0000000000000003 RDI: 
00000000000052d0
[    2.728471] RBP: ffff880135ad3cb0 R08: 0000000000000000 R09: 
0000000000000000
[    2.743791] R10: 0000000000000001 R11: 0000000000000000 R12: 
00000000000040d0
[    2.756111] R13: ffff880137008c80 R14: 0000000000000001 R15: 
0000000000030027
[    2.770428] FS:  0000000000000000(0000) GS:ffff880137600000(0000) 
knlGS:0000000000000000
[    2.786319] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[    2.798100] CR2: 0000000000000000 CR3: 000000000196b000 CR4: 
00000000000007f0
[    2.810264] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[    2.824889] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 
0000000000000400
[    2.836882] Process swapper/0 (pid: 1, threadinfo ffff880135ad2000, 
task ffff880135bf8000)
[    2.856452] Stack:
[    2.859175]  ffff880135ad3ca0 0000000000000002 0000000000000001 
ffff880137008c80
[    2.872325]  ffff8801377db760 ffff880137008c80 ffff880135ad3db0 
ffffffff8167632f
[    2.887248]  ffffffff8167e0e7 0000000000000000 ffff8801377db780 
ffff8801377db770
[    2.899666] Call Trace:
[    2.906792]  [<ffffffff8167632f>] __slab_alloc+0x351/0x4d2
[    2.914238]  [<ffffffff8167e0e7>] ? mutex_lock_nested+0x2e7/0x390
[    2.925157]  [<ffffffff813350d8>] ? alloc_cpumask_var_node+0x28/0x90
[    2.939430]  [<ffffffff81c81e50>] ? sched_init_smp+0x16a/0x3b4
[    2.949790]  [<ffffffff811b1a04>] kmem_cache_alloc_node_trace+0xa4/0x250
[    2.964259]  [<ffffffff8109e72f>] ? kzalloc+0xf/0x20
[    2.976298]  [<ffffffff81c81e50>] ? sched_init_smp+0x16a/0x3b4
[    2.984664]  [<ffffffff81c81e50>] sched_init_smp+0x16a/0x3b4
[    2.997217]  [<ffffffff81c66d57>] kernel_init+0xe3/0x215
[    3.006848]  [<ffffffff810d4c3d>] ? trace_hardirqs_on_caller+0x10d/0x1a0
[    3.020673]  [<ffffffff8168c3b4>] kernel_thread_helper+0x4/0x10
[    3.031154]  [<ffffffff81682470>] ? retint_restore_args+0x13/0x13
[    3.040816]  [<ffffffff81c66c74>] ? start_kernel+0x401/0x401
[    3.052881]  [<ffffffff8168c3b0>] ? gs_change+0x13/0x13
[    3.061692] Code: 1f 80 00 00 00 00 fa 66 66 90 66 66 90 e8 cc e2 f1 
ff e9 71 fe ff ff 0f 1f 80 00 00 00 00 e8 8b 25 ff ff 49 89 c5 e9 4a fe 
ff ff <0f> 0b 0f 0b 49 8b 45 00 31 c9 f6 c4 40 74 04 41 8b 4d 68 ba 00
[    3.095893] RIP  [<ffffffff811b044d>] new_slab+0x26d/0x310
[    3.107828]  RSP <ffff880135ad3c80>
[    3.114024] ---[ end trace e696d6ddf3adb276 ]---
[    3.121541] swapper/0 used greatest stack depth: 4768 bytes left
[    3.143784] Kernel panic - not syncing: Attempted to kill init! 
exitcode=0x0000000b
[    3.143784]

The above errors occurred on my two boxes:
on one machine, which has 120GB RAM and 8 NUMA nodes with AMD CPUs, the
kernel panic occurred both in autonuma15 and in the Linus tree (3.5.0-rc1),
but on another one, which has 16GB RAM and 4 NUMA nodes with AMD CPUs, the
kernel panic only occurred in autonuma15, with no such issue in the Linus tree.

The whole panic log is available at
http://www.sanweiying.org/download/kernel_panic_log
and the config file at http://www.sanweiying.org/download/config

please feel free to tell me if you need more detailed info.

Thanks,
Zhouping

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: AutoNUMA15
  2012-06-07  2:30         ` AutoNUMA15 Zhouping Liu
  (?)
@ 2012-06-07 11:44         ` Hillf Danton
  2012-06-07 13:30           ` AutoNUMA15 Andrea Arcangeli
  2012-06-07 14:08           ` AutoNUMA15 Zhouping Liu
  -1 siblings, 2 replies; 236+ messages in thread
From: Hillf Danton @ 2012-06-07 11:44 UTC (permalink / raw)
  To: Zhouping Liu; +Cc: Andrea Arcangeli, LKML

Hi Zhouping

On Thu, Jun 7, 2012 at 10:30 AM, Zhouping Liu <zliu@redhat.com> wrote:
>
> [    3.114024] ---[ end trace e696d6ddf3adb276 ]---
> [    3.121541] swapper/0 used greatest stack depth: 4768 bytes left
> [    3.143784] Kernel panic - not syncing: Attempted to kill init!
> exitcode=0x0000000b
> [    3.143784]
>
> such above errors occurred in my two boxes:
> in one machine, which has 120Gb RAM and 8 numa nodes with AMD CPU, kernel
> panic occurred in autonuma15 and Linus tree(3.5.0-rc1)
> but in another one, which has 16Gb RAM and 4 numa nodes with AMD CPU, kernel
> panic only occurred in autonuma15, no such issues in Linus tree,
>
Related to the fix at https://lkml.org/lkml/2012/6/5/31 ?

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: AutoNUMA15
  2012-06-07 11:44         ` AutoNUMA15 Hillf Danton
@ 2012-06-07 13:30           ` Andrea Arcangeli
  2012-06-07 14:08           ` AutoNUMA15 Zhouping Liu
  1 sibling, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-06-07 13:30 UTC (permalink / raw)
  To: Hillf Danton; +Cc: Zhouping Liu, LKML

On Thu, Jun 07, 2012 at 07:44:33PM +0800, Hillf Danton wrote:
> Hi Zhouping
> 
> On Thu, Jun 7, 2012 at 10:30 AM, Zhouping Liu <zliu@redhat.com> wrote:
> >
> > [    3.114024] ---[ end trace e696d6ddf3adb276 ]---
> > [    3.121541] swapper/0 used greatest stack depth: 4768 bytes left
> > [    3.143784] Kernel panic - not syncing: Attempted to kill init!
> > exitcode=0x0000000b
> > [    3.143784]
> >
> > such above errors occurred in my two boxes:
> > in one machine, which has 120Gb RAM and 8 numa nodes with AMD CPU, kernel
> > panic occurred in autonuma15 and Linus tree(3.5.0-rc1)
> > but in another one, which has 16Gb RAM and 4 numa nodes with AMD CPU, kernel
> > panic only occurred in autonuma15, no such issues in Linus tree,
> >
> Related to fix at https://lkml.org/lkml/2012/6/5/31  ?

Right, thanks! I pushed an update after an upstream rebase to fix it.

git fetch; git checkout -f origin/autonuma

or:

git clone --reference linux -b autonuma git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git

Please let me know if you still have problems.

Andrea

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: AutoNUMA15
  2012-06-07 11:44         ` AutoNUMA15 Hillf Danton
  2012-06-07 13:30           ` AutoNUMA15 Andrea Arcangeli
@ 2012-06-07 14:08           ` Zhouping Liu
  2012-06-07 19:37             ` AutoNUMA15 Andrea Arcangeli
  1 sibling, 1 reply; 236+ messages in thread
From: Zhouping Liu @ 2012-06-07 14:08 UTC (permalink / raw)
  To: Hillf Danton; +Cc: Andrea Arcangeli, LKML

> On Thu, Jun 7, 2012 at 10:30 AM, Zhouping Liu <zliu@redhat.com>
> wrote:
> >
> > [    3.114024] ---[ end trace e696d6ddf3adb276 ]---
> > [    3.121541] swapper/0 used greatest stack depth: 4768 bytes left
> > [    3.143784] Kernel panic - not syncing: Attempted to kill init!
> > exitcode=0x0000000b
> > [    3.143784]
> >
> > such above errors occurred in my two boxes:
> > in one machine, which has 120Gb RAM and 8 numa nodes with AMD CPU,
> > kernel
> > panic occurred in autonuma15 and Linus tree(3.5.0-rc1)
> > but in another one, which has 16Gb RAM and 4 numa nodes with AMD
> > CPU, kernel
> > panic only occurred in autonuma15, no such issues in Linus tree,
> >
> Related to fix at https://lkml.org/lkml/2012/6/5/31  ?
> 

hi, Hillf

Thanks! But the Linus tree I tested already contains the patch.
I also tested autonuma15 with the patch applied just now, and
the panic is still there, so maybe it's a new issue...

-- 
Thanks,
Zhouping

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: AutoNUMA15
  2012-06-07 14:08           ` AutoNUMA15 Zhouping Liu
@ 2012-06-07 19:37             ` Andrea Arcangeli
  2012-06-08  6:09               ` AutoNUMA15 Zhouping Liu
  2012-06-08 13:32               ` AutoNUMA15 Peter Zijlstra
  0 siblings, 2 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-06-07 19:37 UTC (permalink / raw)
  To: Zhouping Liu; +Cc: Hillf Danton, LKML, Peter Zijlstra

On Thu, Jun 07, 2012 at 10:08:52AM -0400, Zhouping Liu wrote:
> > On Thu, Jun 7, 2012 at 10:30 AM, Zhouping Liu <zliu@redhat.com>
> > wrote:
> > >
> > > [    3.114024] ---[ end trace e696d6ddf3adb276 ]---
> > > [    3.121541] swapper/0 used greatest stack depth: 4768 bytes left
> > > [    3.143784] Kernel panic - not syncing: Attempted to kill init!
> > > exitcode=0x0000000b
> > > [    3.143784]
> > >
> > > such above errors occurred in my two boxes:
> > > in one machine, which has 120Gb RAM and 8 numa nodes with AMD CPU,
> > > kernel
> > > panic occurred in autonuma15 and Linus tree(3.5.0-rc1)
> > > but in another one, which has 16Gb RAM and 4 numa nodes with AMD
> > > CPU, kernel
> > > panic only occurred in autonuma15, no such issues in Linus tree,
> > >
> > Related to fix at https://lkml.org/lkml/2012/6/5/31  ?
> >
> 
> hi, Hillf
> 
> Thanks! but the Linus tree I tested has contained the patch,
> also I tested it in autunuma15 with the patch just now, and
> the panic is still alive, so maybe it's a new issues...

I guess this 74a5ce20e6eeeb3751340b390e7ac1d1d07bbf55 or this
8e7fbcbc22c12414bcc9dfdd683637f58fb32759 may have introduced a problem
with sgp->power being null.

After applying the zalloc_node fix it oopses in a different place, here:

	/* Adjust by relative CPU power of the group */
	sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / group->sgp->power;

power is zero.
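
(A debugging sketch only, not a proposed fix: assuming one just wants to
confirm that a sched_group really reaches this division with
sgp->power == 0 and still capture the rest of the trace, something like
the below could be placed right before the division; bumping power to 1
is purely to dodge the divide error while instrumenting.)

	if (WARN_ONCE(!group->sgp->power,
		      "sched: sched_group has zero power\n"))
		group->sgp->power = 1;	/* debugging only, avoid the divide error */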

[    3.243773] divide error: 0000 [#1] SMP
[    3.244564] CPU 5
[    3.245016] Modules linked in:
[    3.245642]
[    3.245939] Pid: 0, comm: swapper/5 Not tainted 3.5.0-rc1+ #1 HP ProLiant DL785 G6   
[    3.247640] RIP: 0010:[<ffffffff810afbeb>]  [<ffffffff810afbeb>] update_sd_lb_stats+0x27b/0x620
[    3.249534] RSP: 0000:ffff880411207b48  EFLAGS: 00010056
[    3.250636] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff880811496d00
[    3.252174] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8818116a0548
[    3.253509] RBP: ffff880411207c28 R08: 0000000000000000 R09: 0000000000000000
[    3.255073] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
[    3.256607] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000030
[    3.258278] FS:  0000000000000000(0000) GS:ffff881817200000(0000) knlGS:0000000000000000
[    3.260010] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[    3.261250] CR2: 0000000000000000 CR3: 000000000196f000 CR4: 00000000000007e0
[    3.262586] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    3.263912] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[    3.265320] Process swapper/5 (pid: 0, threadinfo ffff880411206000, task ffff8804111fa680)
[    3.267150] Stack:
[    3.267670]  0000000000000001 ffff880411207e34 ffff880411207bb8 ffff880411207d90
[    3.269344]  00000000ffffffff ffff8818116a0548 00000000001d4780 00000000001d4780
[    3.270953]  ffff880416c21000 ffff880411207c38 ffff8818116a0560 0000000000000000
[    3.272379] Call Trace:
[    3.272933]  [<ffffffff810affc9>] find_busiest_group+0x39/0x4b0
[    3.274214]  [<ffffffff810b0545>] load_balance+0x105/0xac0
[    3.275408]  [<ffffffff810ceefd>] ? trace_hardirqs_off+0xd/0x10
[    3.276695]  [<ffffffff810aa26f>] ? local_clock+0x6f/0x80
[    3.277925]  [<ffffffff810b1500>] idle_balance+0x130/0x2d0
[    3.279137]  [<ffffffff810b1420>] ? idle_balance+0x50/0x2d0
[    3.280224]  [<ffffffff81683e40>] __schedule+0x910/0xa00
[    3.281229]  [<ffffffff81684269>] schedule+0x29/0x70
[    3.282165]  [<ffffffff8102352f>] cpu_idle+0x12f/0x140
[    3.283130]  [<ffffffff8166bf85>] start_secondary+0x262/0x264

Please let me know if it rings a bell; it looks like an upstream problem.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: AutoNUMA15
  2012-06-07 19:37             ` AutoNUMA15 Andrea Arcangeli
@ 2012-06-08  6:09               ` Zhouping Liu
  2012-06-08 13:04                 ` AutoNUMA15 Hillf Danton
  2012-06-08 13:32               ` AutoNUMA15 Peter Zijlstra
  1 sibling, 1 reply; 236+ messages in thread
From: Zhouping Liu @ 2012-06-08  6:09 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Hillf Danton, LKML, Peter Zijlstra

Hi,

> > 
> > Thanks! but the Linus tree I tested has contained the patch,
> > also I tested it in autunuma15 with the patch just now, and
> > the panic is still alive, so maybe it's a new issues...
> 
> I guess this 74a5ce20e6eeeb3751340b390e7ac1d1d07bbf55 or this
> 8e7fbcbc22c12414bcc9dfdd683637f58fb32759 may have introduced a
> problem
> with sgp->power being null.

I have tested the kernel after reverting the two commits 74a5ce20e6ee & 8e7fbcbc22c1241,
but unfortunately the panic is still there, so I think it's probably not
related to those two commits. I will do more investigation to find out
what introduced the panic; please let me know if you need me to test
any specific commit.

Thanks,
Zhouping

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: AutoNUMA15
  2012-06-08  6:09               ` AutoNUMA15 Zhouping Liu
@ 2012-06-08 13:04                 ` Hillf Danton
  0 siblings, 0 replies; 236+ messages in thread
From: Hillf Danton @ 2012-06-08 13:04 UTC (permalink / raw)
  To: Zhouping Liu; +Cc: Andrea Arcangeli, LKML, Peter Zijlstra

On Fri, Jun 8, 2012 at 2:09 PM, Zhouping Liu <zliu@redhat.com> wrote:
>
> I have tested the kernel after reverting the two commits 74a5ce20e6ee & 8e7fbcbc22c1241,
> but unfortunately the panic is still there, so I think it's probably not
> related to those two commits. I will do more investigation to find out
>

I see three reverts at

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=log;h=refs/heads/autonuma

74a5ce20e Revert "sched: Fix SD_OVERLAP"
9f646389a Revert "sched/x86: Use cpu_llc_shared_mask(cpu) for coregroup_mask"
8e7fbcbc2 Revert "sched: Remove stale power aware scheduling remnants and
			dysfunctional knobs"

Would you please also take the revert of 9f646389a into account?

Good Weekend
Hillf

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: AutoNUMA15
  2012-06-07 19:37             ` AutoNUMA15 Andrea Arcangeli
  2012-06-08  6:09               ` AutoNUMA15 Zhouping Liu
@ 2012-06-08 13:32               ` Peter Zijlstra
  2012-06-08 16:31                 ` AutoNUMA15 Zhouping Liu
  1 sibling, 1 reply; 236+ messages in thread
From: Peter Zijlstra @ 2012-06-08 13:32 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Zhouping Liu, Hillf Danton, LKML

Don't hide reports like this in subjects like the above

On Thu, 2012-06-07 at 21:37 +0200, Andrea Arcangeli wrote:
> Please let me know if it rings a bell, it looks an upstream problem.

Please try tip/master, if it still fails, please report in a new thread
with appropriate subject.


^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: AutoNUMA15
  2012-06-07  2:30         ` AutoNUMA15 Zhouping Liu
  (?)
  (?)
@ 2012-06-08 13:43         ` Chen
  -1 siblings, 0 replies; 236+ messages in thread
From: Chen @ 2012-06-08 13:43 UTC (permalink / raw)
  To: Zhouping Liu, linux-kernel

On Thu, Jun 7, 2012 at 10:30 AM, Zhouping Liu <zliu@redhat.com> wrote:
> On 06/01/2012 02:08 AM, Andrea Arcangeli wrote:
>>
>> Hi,
>>
>> On Tue, May 29, 2012 at 05:43:09PM +0200, Petr Holasek wrote:
>>>
>>> Similar problem with __autonuma_migrate_page_remove here.
>>>
>>> [ 1945.516632] ------------[ cut here ]------------
>>> [ 1945.516636] WARNING: at lib/list_debug.c:50
>>> __list_del_entry+0x63/0xd0()
>>> [ 1945.516642] Hardware name: ProLiant DL585 G5
>>> [ 1945.516651] list_del corruption, ffff88017d68b068->next is
>>> LIST_POISON1 (dead000000100100)
>>> [ 1945.516682] Modules linked in: ipt_MASQUERADE nf_conntrack_netbios_ns
>>> nf_conntrack_broadcast ip6table_mangle lockd ip6t_REJECT sunrpc
>>> nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables iptable_nat
>>> nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack
>>> nf_conntrack mperf freq_table kvm_amd kvm pcspkr amd64_edac_mod edac_core
>>> serio_raw bnx2 microcode edac_mce_amd shpchp k10temp hpilo ipmi_si
>>> ipmi_msghandler hpwdt qla2xxx hpsa ata_generic pata_acpi scsi_transport_fc
>>> scsi_tgt cciss pata_amd radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core
>>> [last unloaded: scsi_wait_scan]
>>> [ 1945.516694] Pid: 150, comm: knuma_migrated0 Tainted: G        W
>>>  3.4.0aa_alpha+ #3
>>> [ 1945.516701] Call Trace:
>>> [ 1945.516710]  [<ffffffff8105788f>] warn_slowpath_common+0x7f/0xc0
>>> [ 1945.516717]  [<ffffffff81057986>] warn_slowpath_fmt+0x46/0x50
>>> [ 1945.516726]  [<ffffffff812f9713>] __list_del_entry+0x63/0xd0
>>> [ 1945.516735]  [<ffffffff812f9791>] list_del+0x11/0x40
>>> [ 1945.516743]  [<ffffffff81165b98>]
>>> __autonuma_migrate_page_remove+0x48/0x80
>>> [ 1945.516746]  [<ffffffff81165e66>] knuma_migrated+0x296/0x8a0
>>> [ 1945.516749]  [<ffffffff8107a200>] ? wake_up_bit+0x40/0x40
>>> [ 1945.516758]  [<ffffffff81165bd0>] ?
>>> __autonuma_migrate_page_remove+0x80/0x80
>>> [ 1945.516766]  [<ffffffff81079cc3>] kthread+0x93/0xa0
>>> [ 1945.516780]  [<ffffffff81626f24>] kernel_thread_helper+0x4/0x10
>>> [ 1945.516791]  [<ffffffff81079c30>] ? flush_kthread_worker+0x80/0x80
>>> [ 1945.516798]  [<ffffffff81626f20>] ? gs_change+0x13/0x13
>>> [ 1945.516800] ---[ end trace 7cab294af87bd79f ]---
>>
>> I didn't manage to reproduce it on my hardware but it seems this was
>> caused by the autonuma_migrate_split_huge_page: the tail page list
>> linking wasn't surrounded by the compound lock to make list insertion
>> and migrate_nid setting atomic like it happens everywhere else (the
>> caller holding the lock on the head page wasn't enough to make the
>> tails stable too).
>>
>> I released an AutoNUMA15 branch that includes all pending fixes:
>>
>> git clone --reference linux -b autonuma15
>> git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
>
>
> Hi, Andrea and all
>
> when I tested the autonuma patch set, a kernel panic occurred while
> booting the newly compiled kernel; I also hit the issue in the latest
> Linus tree (3.5.0-rc1). A partial call trace is:
>
> [    2.635443] kernel BUG at include/linux/gfp.h:318!
> [    2.642998] invalid opcode: 0000 [#1] SMP
> [    2.651148] CPU 0
> [    2.653911] Modules linked in:
> [    2.662388]
> [    2.664657] Pid: 1, comm: swapper/0 Not tainted 3.4.0+ #1 HP ProLiant
> DL585 G7
> [    2.677609] RIP: 0010:[<ffffffff811b044d>]  [<ffffffff811b044d>]
> new_slab+0x26d/0x310
> [    2.692803] RSP: 0018:ffff880135ad3c80  EFLAGS: 00010246
> [    2.702541] RAX: 0000000000000000 RBX: ffff880137008c80 RCX:
> ffff8801377db780
> [    2.716402] RDX: ffff880135bf8000 RSI: 0000000000000003 RDI:
> 00000000000052d0
> [    2.728471] RBP: ffff880135ad3cb0 R08: 0000000000000000 R09:
> 0000000000000000
> [    2.743791] R10: 0000000000000001 R11: 0000000000000000 R12:
> 00000000000040d0
> [    2.756111] R13: ffff880137008c80 R14: 0000000000000001 R15:
> 0000000000030027
> [    2.770428] FS:  0000000000000000(0000) GS:ffff880137600000(0000)
> knlGS:0000000000000000
> [    2.786319] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [    2.798100] CR2: 0000000000000000 CR3: 000000000196b000 CR4:
> 00000000000007f0
> [    2.810264] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [    2.824889] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> [    2.836882] Process swapper/0 (pid: 1, threadinfo ffff880135ad2000, task
> ffff880135bf8000)
> [    2.856452] Stack:
> [    2.859175]  ffff880135ad3ca0 0000000000000002 0000000000000001
> ffff880137008c80
> [    2.872325]  ffff8801377db760 ffff880137008c80 ffff880135ad3db0
> ffffffff8167632f
> [    2.887248]  ffffffff8167e0e7 0000000000000000 ffff8801377db780
> ffff8801377db770
> [    2.899666] Call Trace:
> [    2.906792]  [<ffffffff8167632f>] __slab_alloc+0x351/0x4d2
> [    2.914238]  [<ffffffff8167e0e7>] ? mutex_lock_nested+0x2e7/0x390
> [    2.925157]  [<ffffffff813350d8>] ? alloc_cpumask_var_node+0x28/0x90
> [    2.939430]  [<ffffffff81c81e50>] ? sched_init_smp+0x16a/0x3b4
> [    2.949790]  [<ffffffff811b1a04>] kmem_cache_alloc_node_trace+0xa4/0x250
> [    2.964259]  [<ffffffff8109e72f>] ? kzalloc+0xf/0x20
> [    2.976298]  [<ffffffff81c81e50>] ? sched_init_smp+0x16a/0x3b4
> [    2.984664]  [<ffffffff81c81e50>] sched_init_smp+0x16a/0x3b4
> [    2.997217]  [<ffffffff81c66d57>] kernel_init+0xe3/0x215
> [    3.006848]  [<ffffffff810d4c3d>] ? trace_hardirqs_on_caller+0x10d/0x1a0
> [    3.020673]  [<ffffffff8168c3b4>] kernel_thread_helper+0x4/0x10
> [    3.031154]  [<ffffffff81682470>] ? retint_restore_args+0x13/0x13
> [    3.040816]  [<ffffffff81c66c74>] ? start_kernel+0x401/0x401
> [    3.052881]  [<ffffffff8168c3b0>] ? gs_change+0x13/0x13
> [    3.061692] Code: 1f 80 00 00 00 00 fa 66 66 90 66 66 90 e8 cc e2 f1 ff
> e9 71 fe ff ff 0f 1f 80 00 00 00 00 e8 8b 25 ff ff 49 89 c5 e9 4a fe ff ff
> <0f> 0b 0f 0b 49 8b 45 00 31 c9 f6 c4 40 74 04 41 8b 4d 68 ba 00
> [    3.095893] RIP  [<ffffffff811b044d>] new_slab+0x26d/0x310
> [ 3.107828]  RSP <ffff880135ad3c80>
> [    3.114024] ---[ end trace e696d6ddf3adb276 ]---
> [    3.121541] swapper/0 used greatest stack depth: 4768 bytes left
> [    3.143784] Kernel panic - not syncing: Attempted to kill init!
> exitcode=0x0000000b
> [    3.143784]
>
> such above errors occurred in my two boxes:
> in one machine, which has 120Gb RAM and 8 numa nodes with AMD CPU, kernel
> panic occurred in autonuma15 and Linus tree(3.5.0-rc1)
> but in another one, which has 16Gb RAM and 4 numa nodes with AMD CPU, kernel
> panic only occurred in autonuma15, no such issues in Linus tree,
>
> whole panic info is available in
> http://www.sanweiying.org/download/kernel_panic_log
> and config file in http://www.sanweiying.org/download/config
>
> please feel free to tell me if you need more detailed info.
>
> Thanks,
> Zhouping
>

Killing init actually means there is a problem in a scheduling-related
subsystem. Check it and fix it yourself after reading the design. :-)

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: AutoNUMA15
  2012-06-08 13:32               ` AutoNUMA15 Peter Zijlstra
@ 2012-06-08 16:31                 ` Zhouping Liu
  0 siblings, 0 replies; 236+ messages in thread
From: Zhouping Liu @ 2012-06-08 16:31 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Arcangeli; +Cc: Hillf Danton, LKML

Hi,

> 
> Please try tip/master, if it still fails, please report in a new
> thread
> with appropriate subject.

I tested tip/master (commit b2f5ce55c4e683); it's OK, without the above panic.
I also tested mainline (commit 48d212a2eecaca2), and the panic still exists.

I will open a new thread to track the issue.

-- 
Thanks,
Zhouping

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 13/35] autonuma: add page structure fields
  2012-06-05 14:51                   ` Andrea Arcangeli
@ 2012-06-19 18:06                     ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-06-19 18:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: KOSAKI Motohiro, Rik van Riel, linux-kernel, linux-mm,
	Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Johannes Weiner, Srivatsa Vaddagiri,
	Christoph Lameter

Hi everyone,

On Tue, Jun 05, 2012 at 04:51:23PM +0200, Andrea Arcangeli wrote:
> The details of the solution:
> 
> struct page_autonuma {
>     short autonuma_last_nid;
>     short autonuma_migrate_nid;
>     unsigned int pfn_offset_next;
>     unsigned int pfn_offset_prev;
> } __attribute__((packed));
> 
> page_autonuma can only point to a page that belongs to the same node
> (page_autonuma is queued into the
> NODE_DATA(autonuma_migrate_nid)->autonuma_migrate_head[src_nid]) where
> src_nid is the source node that page_autonuma belongs to, so all pages
> in the autonuma_migrate_head[src_nid] lru must come from the same
> src_nid. So the next page_autonuma in the list will be
> lookup_page_autonuma(pfn_to_page(NODE_DATA(src_nid)->node_start_pfn +
> page_autonuma->pfn_offset_next)) etc..
> 
> Of course all list_add/del must be hardcoded specially for this, but
> it's not a conceptually difficult solution, just we can't use list.h
> and straight pointers anymore and some conversion must happen.

So here is the above idea, implemented and working fine (it seems...?!?
it has been running for only half an hour, but all benchmark regression
tests passed with the same score as before, and I verified memory goes
in all directions during the bench, so there's a good chance it's ok).

It actually works even if a node has more than 16TB, but in that case
it will WARN_ONCE on the first page that is asked to migrate from an
offset above 16TB from the start of the node, and then it will simply
continue, skipping the migration of pages whose offset is too large.
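
(For reference, a minimal sketch of the pfn-offset to struct page
conversion the quoted design relies on; the real helper is the
autonuma_list_entry_to_page() declared in the patch below, whose body
lives in the new mm/autonuma_list.c and is not part of the quoted
hunks, so this is only an assumed shape, not the actual implementation:)

struct page *autonuma_list_entry_to_page(int nid,
					 autonuma_list_entry pfn_offset)
{
	/* only real pfn offsets are ever stored in the list entries */
	VM_BUG_ON(pfn_offset > AUTONUMA_LIST_MAX_PFN_OFFSET);
	return pfn_to_page(NODE_DATA(nid)->node_start_pfn + pfn_offset);
}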

The next part coming is the documentation of autonuma_balance() at the
top of kernel/sched/numa.c and a cleanup of the autonuma_balance
callout location (if I can figure out how to do an active balance on
the running task from softirq). The location at the moment is there
just to be invoked after load_balance runs, so it shouldn't make a
runtime difference after I clean it up (hackbench already runs
identically to upstream), but it will certainly be nice to
micro-optimize away a call and a branch from the schedule() fast path.

After that I'll write Documentation/vm/AutoNUMA.txt and I'll finish
the THP native migration (the latter assuming nobody does it before I
get there; if somebody wants to do it sooner, we figured out the
locking details with Johannes during the MM summit, but it's some work
to implement it).

===
From 17e1cbc02c1b41037248d9952179ff293a287d58 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@redhat.com>
Date: Tue, 19 Jun 2012 18:55:25 +0200
Subject: [PATCH] autonuma: shrink the per-page page_autonuma struct size

From 32 to 12 bytes, so the AutoNUMA memory footprint is reduced to
0.29% of RAM.

This however will fail to migrate pages above a 16 Terabyte offset
from the start of each node (migration failure isn't fatal, simply
those pages will not follow the CPU, a warning will be printed in the
log just once in that case).

AutoNUMA will also fail to build if there are more than (2**15)-1
nodes supported by the MAX_NUMNODES at build time (it would be easy to
relax it to (2**16)-1 nodes without increasing the memory footprint,
but it's not even worth it, so let's keep the negative space reserved
for now).

This means the max RAM configuration fully supported by AutoNUMA
becomes AUTONUMA_LIST_MAX_PFN_OFFSET multiplied by 32767 nodes
multiplied by the PAGE_SIZE (assume 4096 here, but for some archs it's
bigger).

4096*32767*(0xffffffff-3)>>(10*5) = 511 PetaBytes.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma_list.h  |   94 ++++++++++++++++++++++
 include/linux/autonuma_types.h |   48 +++++++-----
 include/linux/mmzone.h         |    3 +-
 include/linux/page_autonuma.h  |    2 +-
 mm/Makefile                    |    2 +-
 mm/autonuma.c                  |   75 +++++++++++++-----
 mm/autonuma_list.c             |  167 ++++++++++++++++++++++++++++++++++++++++
 mm/page_autonuma.c             |   15 ++--
 8 files changed, 355 insertions(+), 51 deletions(-)
 create mode 100644 include/linux/autonuma_list.h
 create mode 100644 mm/autonuma_list.c

diff --git a/include/linux/autonuma_list.h b/include/linux/autonuma_list.h
new file mode 100644
index 0000000..0f338e9
--- /dev/null
+++ b/include/linux/autonuma_list.h
@@ -0,0 +1,94 @@
+#ifndef __AUTONUMA_LIST_H
+#define __AUTONUMA_LIST_H
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+
+typedef uint32_t autonuma_list_entry;
+#define AUTONUMA_LIST_MAX_PFN_OFFSET	(AUTONUMA_LIST_HEAD-3)
+#define AUTONUMA_LIST_POISON1		(AUTONUMA_LIST_HEAD-2)
+#define AUTONUMA_LIST_POISON2		(AUTONUMA_LIST_HEAD-1)
+#define AUTONUMA_LIST_HEAD		((uint32_t)UINT_MAX)
+
+struct autonuma_list_head {
+	autonuma_list_entry anl_next_pfn;
+	autonuma_list_entry anl_prev_pfn;
+};
+
+static inline void AUTONUMA_INIT_LIST_HEAD(struct autonuma_list_head *anl)
+{
+	anl->anl_next_pfn = AUTONUMA_LIST_HEAD;
+	anl->anl_prev_pfn = AUTONUMA_LIST_HEAD;
+}
+
+/* abstraction conversion methods */
+extern struct page *autonuma_list_entry_to_page(int nid,
+					autonuma_list_entry pfn_offset);
+extern autonuma_list_entry autonuma_page_to_list_entry(int page_nid,
+						       struct page *page);
+extern struct autonuma_list_head *__autonuma_list_head(int page_nid,
+					struct autonuma_list_head *head,
+					autonuma_list_entry pfn_offset);
+
+extern bool __autonuma_list_add(int page_nid,
+				struct page *page,
+				struct autonuma_list_head *head,
+				autonuma_list_entry prev,
+				autonuma_list_entry next);
+
+/*
+ * autonuma_list_add - add a new entry
+ *
+ * Insert a new entry after the specified head.
+ */
+static inline bool autonuma_list_add(int page_nid,
+				     struct page *page,
+				     autonuma_list_entry entry,
+				     struct autonuma_list_head *head)
+{
+	struct autonuma_list_head *entry_head;
+	entry_head = __autonuma_list_head(page_nid, head, entry);
+	return __autonuma_list_add(page_nid, page, head,
+				   entry, entry_head->anl_next_pfn);
+}
+
+/*
+ * autonuma_list_add_tail - add a new entry
+ *
+ * Insert a new entry before the specified head.
+ * This is useful for implementing queues.
+ */
+static inline bool autonuma_list_add_tail(int page_nid,
+					  struct page *page,
+					  autonuma_list_entry entry,
+					  struct autonuma_list_head *head)
+{
+	struct autonuma_list_head *entry_head;
+	entry_head = __autonuma_list_head(page_nid, head, entry);
+	return __autonuma_list_add(page_nid, page, head,
+				   entry_head->anl_prev_pfn, entry);
+}
+
+/*
+ * autonuma_list_del - deletes entry from list.
+ * @entry: the element to delete from the list.
+ */
+extern void autonuma_list_del(int page_nid,
+			      struct autonuma_list_head *entry,
+			      struct autonuma_list_head *head);
+
+extern bool autonuma_list_empty(const struct autonuma_list_head *head);
+
+#if 0 /* not needed so far */
+/*
+ * autonuma_list_is_singular - tests whether a list has just one entry.
+ * @head: the list to test.
+ */
+static inline int autonuma_list_is_singular(const struct autonuma_list_head *head)
+{
+	return !autonuma_list_empty(head) &&
+		(head->anl_next_pfn == head->anl_prev_pfn);
+}
+#endif
+
+#endif /* __AUTONUMA_LIST_H */
diff --git a/include/linux/autonuma_types.h b/include/linux/autonuma_types.h
index 6662990..1abde9c5 100644
--- a/include/linux/autonuma_types.h
+++ b/include/linux/autonuma_types.h
@@ -4,6 +4,7 @@
 #ifdef CONFIG_AUTONUMA
 
 #include <linux/numa.h>
+#include <linux/autonuma_list.h>
 
 /*
  * Per-mm (process) structure dynamically allocated only if autonuma
@@ -45,15 +46,36 @@ struct task_autonuma {
 /*
  * Per page (or per-pageblock) structure dynamically allocated only if
  * autonuma is not impossible.
+ *
+ * This structure takes 12 bytes per page for all architectures. There
+ * are two constraints to make this work:
+ *
+ * 1) the build will abort if * MAX_NUMNODES is too big according to
+ *    the #error check below
+ *
+ * 2) AutoNUMA will not succeed to insert into the migration queue any
+ *    page whose pfn offset value (offset with respect to the first
+ *    pfn of the node) is bigger than AUTONUMA_LIST_MAX_PFN_OFFSET
+ *    (NOTE: AUTONUMA_LIST_MAX_PFN_OFFSET is still a valid pfn offset
+ *    value). This means with huge node sizes and small PAGE_SIZE,
+ *    some pages may not be allowed to be migrated.
  */
 struct page_autonuma {
 	/*
 	 * To modify autonuma_last_nid lockless the architecture,
 	 * needs SMP atomic granularity < sizeof(long), not all archs
-	 * have that, notably some alpha. Archs without that requires
+	 * have that, notably some ancient alpha (but none of those
+	 * should run in NUMA systems). Archs without that requires
 	 * autonuma_last_nid to be a long.
 	 */
-#if BITS_PER_LONG > 32
+#if MAX_NUMNODES > 32767
+	/*
+	 * Verify at build time that int16_t for autonuma_migrate_nid
+	 * and autonuma_last_nid won't risk to overflow, max allowed
+	 * nid value is (2**15)-1.
+	 */
+#error "too many nodes"
+#endif
 	/*
 	 * autonuma_migrate_nid is -1 if the page_autonuma structure
 	 * is not linked into any
@@ -63,7 +85,7 @@ struct page_autonuma {
 	 * page_nid is the nid that the page (referenced by the
 	 * page_autonuma structure) belongs to.
 	 */
-	int autonuma_migrate_nid;
+	int16_t autonuma_migrate_nid;
 	/*
 	 * autonuma_last_nid records which is the NUMA nid that tried
 	 * to access this page at the last NUMA hinting page fault.
@@ -72,28 +94,14 @@ struct page_autonuma {
 	 * it will make different threads trashing on the same pages,
 	 * converge on the same NUMA node (if possible).
 	 */
-	int autonuma_last_nid;
-#else
-#if MAX_NUMNODES >= 32768
-#error "too many nodes"
-#endif
-	short autonuma_migrate_nid;
-	short autonuma_last_nid;
-#endif
+	int16_t autonuma_last_nid;
+
 	/*
 	 * This is the list node that links the page (referenced by
 	 * the page_autonuma structure) in the
 	 * &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid] lru.
 	 */
-	struct list_head autonuma_migrate_node;
-
-	/*
-	 * To find the page starting from the autonuma_migrate_node we
-	 * need a backlink.
-	 *
-	 * FIXME: drop it;
-	 */
-	struct page *page;
+	struct autonuma_list_head autonuma_migrate_node;
 };
 
 extern int alloc_task_autonuma(struct task_struct *tsk,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ed5b0c0..acefdfa 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -17,6 +17,7 @@
 #include <linux/pageblock-flags.h>
 #include <generated/bounds.h>
 #include <linux/atomic.h>
+#include <linux/autonuma_list.h>
 #include <asm/page.h>
 
 /* Free memory management - zoned buddy allocator.  */
@@ -710,7 +711,7 @@ typedef struct pglist_data {
 	 * <linux/page_autonuma.h> and the below field must remain the
 	 * last one of this structure.
 	 */
-	struct list_head autonuma_migrate_head[0];
+	struct autonuma_list_head autonuma_migrate_head[0];
 #endif
 } pg_data_t;
 
diff --git a/include/linux/page_autonuma.h b/include/linux/page_autonuma.h
index bc7a629..e78beda 100644
--- a/include/linux/page_autonuma.h
+++ b/include/linux/page_autonuma.h
@@ -53,7 +53,7 @@ extern void __init sparse_early_page_autonuma_alloc_node(struct page_autonuma **
 /* inline won't work here */
 #define autonuma_pglist_data_size() (sizeof(struct pglist_data) +	\
 				     (autonuma_impossible() ? 0 :	\
-				      sizeof(struct list_head) * \
+				      sizeof(struct autonuma_list_head) * \
 				      num_possible_nodes()))
 
 #endif /* _LINUX_PAGE_AUTONUMA_H */
diff --git a/mm/Makefile b/mm/Makefile
index a4d8354..4aa90d4 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -33,7 +33,7 @@ obj-$(CONFIG_FRONTSWAP)	+= frontswap.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
 obj-$(CONFIG_NUMA) 	+= mempolicy.o
-obj-$(CONFIG_AUTONUMA) 	+= autonuma.o page_autonuma.o
+obj-$(CONFIG_AUTONUMA) 	+= autonuma.o page_autonuma.o autonuma_list.o
 obj-$(CONFIG_SPARSEMEM)	+= sparse.o
 obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_SLOB) += slob.o
diff --git a/mm/autonuma.c b/mm/autonuma.c
index 9834f5d..8aed9af 100644
--- a/mm/autonuma.c
+++ b/mm/autonuma.c
@@ -89,12 +89,21 @@ void autonuma_migrate_split_huge_page(struct page *page,
 	VM_BUG_ON(nid < -1);
 	VM_BUG_ON(page_tail_autonuma->autonuma_migrate_nid != -1);
 	if (nid >= 0) {
-		VM_BUG_ON(page_to_nid(page) != page_to_nid(page_tail));
+		int page_nid = page_to_nid(page);
+		struct autonuma_list_head *head;
+		autonuma_list_entry entry;
+		entry = autonuma_page_to_list_entry(page_nid, page);
+		head = &NODE_DATA(nid)->autonuma_migrate_head[page_nid];
+		VM_BUG_ON(page_nid != page_to_nid(page_tail));
+		VM_BUG_ON(page_nid == nid);
 
 		compound_lock(page_tail);
 		autonuma_migrate_lock(nid);
-		list_add_tail(&page_tail_autonuma->autonuma_migrate_node,
-			      &page_autonuma->autonuma_migrate_node);
+		if (!autonuma_list_add_tail(page_nid,
+					    page_tail,
+					    entry,
+					    head))
+			BUG();
 		autonuma_migrate_unlock(nid);
 
 		page_tail_autonuma->autonuma_migrate_nid = nid;
@@ -119,8 +128,15 @@ void __autonuma_migrate_page_remove(struct page *page,
 	VM_BUG_ON(nid < -1);
 	if (nid >= 0) {
 		int numpages = hpage_nr_pages(page);
+		int page_nid = page_to_nid(page);
+		struct autonuma_list_head *head;
+		VM_BUG_ON(nid == page_nid);
+		head = &NODE_DATA(nid)->autonuma_migrate_head[page_nid];
+
 		autonuma_migrate_lock(nid);
-		list_del(&page_autonuma->autonuma_migrate_node);
+		autonuma_list_del(page_nid,
+				  &page_autonuma->autonuma_migrate_node,
+				  head);
 		NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
 		autonuma_migrate_unlock(nid);
 
@@ -139,6 +155,8 @@ static void __autonuma_migrate_page_add(struct page *page,
 	int numpages;
 	unsigned long nr_migrate_pages;
 	wait_queue_head_t *wait_queue;
+	struct autonuma_list_head *head;
+	bool added;
 
 	VM_BUG_ON(dst_nid >= MAX_NUMNODES);
 	VM_BUG_ON(dst_nid < -1);
@@ -155,25 +173,33 @@ static void __autonuma_migrate_page_add(struct page *page,
 	VM_BUG_ON(nid >= MAX_NUMNODES);
 	VM_BUG_ON(nid < -1);
 	if (nid >= 0) {
+		VM_BUG_ON(nid == page_nid);
+		head = &NODE_DATA(nid)->autonuma_migrate_head[page_nid];
+
 		autonuma_migrate_lock(nid);
-		list_del(&page_autonuma->autonuma_migrate_node);
+		autonuma_list_del(page_nid,
+				  &page_autonuma->autonuma_migrate_node,
+				  head);
 		NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
 		autonuma_migrate_unlock(nid);
 	}
 
+	head = &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid];
+
 	autonuma_migrate_lock(dst_nid);
-	list_add(&page_autonuma->autonuma_migrate_node,
-		 &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid]);
-	NODE_DATA(dst_nid)->autonuma_nr_migrate_pages += numpages;
+	added = autonuma_list_add(page_nid, page, AUTONUMA_LIST_HEAD, head);
+	if (added)
+		NODE_DATA(dst_nid)->autonuma_nr_migrate_pages += numpages;
 	nr_migrate_pages = NODE_DATA(dst_nid)->autonuma_nr_migrate_pages;
 
 	autonuma_migrate_unlock(dst_nid);
 
-	page_autonuma->autonuma_migrate_nid = dst_nid;
+	if (added)
+		page_autonuma->autonuma_migrate_nid = dst_nid;
 
 	compound_unlock_irqrestore(page, flags);
 
-	if (!autonuma_migrate_defer()) {
+	if (added && !autonuma_migrate_defer()) {
 		wait_queue = &NODE_DATA(dst_nid)->autonuma_knuma_migrated_wait;
 		if (nr_migrate_pages >= pages_to_migrate &&
 		    nr_migrate_pages - numpages < pages_to_migrate &&
@@ -813,7 +839,7 @@ static int isolate_migratepages(struct list_head *migratepages,
 				struct pglist_data *pgdat)
 {
 	int nr = 0, nid;
-	struct list_head *heads = pgdat->autonuma_migrate_head;
+	struct autonuma_list_head *heads = pgdat->autonuma_migrate_head;
 
 	/* FIXME: THP balancing, restart from last nid */
 	for_each_online_node(nid) {
@@ -825,10 +851,10 @@ static int isolate_migratepages(struct list_head *migratepages,
 		cond_resched();
 		VM_BUG_ON(numa_node_id() != pgdat->node_id);
 		if (nid == pgdat->node_id) {
-			VM_BUG_ON(!list_empty(&heads[nid]));
+			VM_BUG_ON(!autonuma_list_empty(&heads[nid]));
 			continue;
 		}
-		if (list_empty(&heads[nid]))
+		if (autonuma_list_empty(&heads[nid]))
 			continue;
 		/* some page wants to go to this pgdat */
 		/*
@@ -840,22 +866,29 @@ static int isolate_migratepages(struct list_head *migratepages,
 		 * irqs.
 		 */
 		autonuma_migrate_lock_irq(pgdat->node_id);
-		if (list_empty(&heads[nid])) {
+		if (autonuma_list_empty(&heads[nid])) {
 			autonuma_migrate_unlock_irq(pgdat->node_id);
 			continue;
 		}
-		page_autonuma = list_entry(heads[nid].prev,
-					   struct page_autonuma,
-					   autonuma_migrate_node);
-		page = page_autonuma->page;
+		page = autonuma_list_entry_to_page(nid,
+						   heads[nid].anl_prev_pfn);
+		page_autonuma = lookup_page_autonuma(page);
 		if (unlikely(!get_page_unless_zero(page))) {
+			int page_nid = page_to_nid(page);
+			struct autonuma_list_head *entry_head;
+			VM_BUG_ON(nid == page_nid);
+
 			/*
 			 * Is getting freed and will remove self from the
 			 * autonuma list shortly, skip it for now.
 			 */
-			list_del(&page_autonuma->autonuma_migrate_node);
-			list_add(&page_autonuma->autonuma_migrate_node,
-				 &heads[nid]);
+			entry_head = &page_autonuma->autonuma_migrate_node;
+			autonuma_list_del(page_nid, entry_head,
+					  &heads[nid]);
+			if (!autonuma_list_add(page_nid, page,
+					       AUTONUMA_LIST_HEAD,
+					       &heads[nid]))
+				BUG();
 			autonuma_migrate_unlock_irq(pgdat->node_id);
 			autonuma_printk("autonuma migrate page is free\n");
 			continue;
diff --git a/mm/autonuma_list.c b/mm/autonuma_list.c
new file mode 100644
index 0000000..2c840f7
--- /dev/null
+++ b/mm/autonuma_list.c
@@ -0,0 +1,167 @@
+/*
+ * Copyright 2006, Red Hat, Inc., Dave Jones
+ * Copyright 2012, Red Hat, Inc.
+ * Released under the General Public License (GPL).
+ *
+ * This file contains the linked list implementations for
+ * autonuma migration lists.
+ */
+
+#include <linux/mm.h>
+#include <linux/autonuma.h>
+
+/*
+ * Insert a new entry between two known consecutive entries.
+ *
+ * This is only for internal list manipulation where we know
+ * the prev/next entries already!
+ *
+ * return true if succeeded, or false if the (page_nid, pfn_offset)
+ * pair couldn't represent the pfn and the list_add didn't succeed.
+ */
+bool __autonuma_list_add(int page_nid,
+			 struct page *page,
+			 struct autonuma_list_head *head,
+			 autonuma_list_entry prev,
+			 autonuma_list_entry next)
+{
+	autonuma_list_entry new;
+
+	VM_BUG_ON(page_nid != page_to_nid(page));
+	new = autonuma_page_to_list_entry(page_nid, page);
+	if (new > AUTONUMA_LIST_MAX_PFN_OFFSET)
+		return false;
+
+	WARN(new == prev || new == next,
+	     "autonuma_list_add double add: new=%u, prev=%u, next=%u.\n",
+	     new, prev, next);
+
+	__autonuma_list_head(page_nid, head, next)->anl_prev_pfn = new;
+	__autonuma_list_head(page_nid, head, new)->anl_next_pfn = next;
+	__autonuma_list_head(page_nid, head, new)->anl_prev_pfn = prev;
+	__autonuma_list_head(page_nid, head, prev)->anl_next_pfn = new;
+	return true;
+}
+
+static inline void __autonuma_list_del_entry(int page_nid,
+					     struct autonuma_list_head *entry,
+					     struct autonuma_list_head *head)
+{
+	autonuma_list_entry prev, next;
+
+	prev = entry->anl_prev_pfn;
+	next = entry->anl_next_pfn;
+
+	if (WARN(next == AUTONUMA_LIST_POISON1,
+		 "autonuma_list_del corruption, "
+		 "%p->anl_next_pfn is AUTONUMA_LIST_POISON1 (%u)\n",
+		entry, AUTONUMA_LIST_POISON1) ||
+	    WARN(prev == AUTONUMA_LIST_POISON2,
+		"autonuma_list_del corruption, "
+		 "%p->anl_prev_pfn is AUTONUMA_LIST_POISON2 (%u)\n",
+		entry, AUTONUMA_LIST_POISON2))
+		return;
+
+	__autonuma_list_head(page_nid, head, next)->anl_prev_pfn = prev;
+	__autonuma_list_head(page_nid, head, prev)->anl_next_pfn = next;
+}
+
+/*
+ * autonuma_list_del - deletes entry from list.
+ *
+ * Note: autonuma_list_empty on entry does not return true after this,
+ * the entry is in an undefined state.
+ */
+void autonuma_list_del(int page_nid, struct autonuma_list_head *entry,
+		       struct autonuma_list_head *head)
+{
+	__autonuma_list_del_entry(page_nid, entry, head);
+	entry->anl_next_pfn = AUTONUMA_LIST_POISON1;
+	entry->anl_prev_pfn = AUTONUMA_LIST_POISON2;
+}
+
+/*
+ * autonuma_list_empty - tests whether a list is empty
+ * @head: the list to test.
+ */
+bool autonuma_list_empty(const struct autonuma_list_head *head)
+{
+	bool ret = false;
+	if (head->anl_next_pfn == AUTONUMA_LIST_HEAD) {
+		ret = true;
+		BUG_ON(head->anl_prev_pfn != AUTONUMA_LIST_HEAD);
+	}
+	return ret;
+}
+
+/* abstraction conversion methods */
+
+static inline struct page *__autonuma_list_entry_to_page(int page_nid,
+							 autonuma_list_entry pfn_offset)
+{
+	struct pglist_data *pgdat = NODE_DATA(page_nid);
+	unsigned long pfn = pgdat->node_start_pfn + pfn_offset;
+	return pfn_to_page(pfn);
+}
+
+struct page *autonuma_list_entry_to_page(int page_nid,
+					 autonuma_list_entry pfn_offset)
+{
+	VM_BUG_ON(page_nid < 0);
+	BUG_ON(pfn_offset == AUTONUMA_LIST_POISON1);
+	BUG_ON(pfn_offset == AUTONUMA_LIST_POISON2);
+	BUG_ON(pfn_offset == AUTONUMA_LIST_HEAD);
+	return __autonuma_list_entry_to_page(page_nid, pfn_offset);
+}
+
+/*
+ * returns a value above AUTONUMA_LIST_MAX_PFN_OFFSET if the pfn is
+ * located at too large an offset from the start of the node and cannot be
+ * represented by the (page_nid, pfn_offset) pair.
+ */
+autonuma_list_entry autonuma_page_to_list_entry(int page_nid,
+						struct page *page)
+{
+	unsigned long pfn = page_to_pfn(page);
+	struct pglist_data *pgdat = NODE_DATA(page_nid);
+	VM_BUG_ON(page_nid != page_to_nid(page));
+	BUG_ON(pfn < pgdat->node_start_pfn);
+	pfn -= pgdat->node_start_pfn;
+	if (pfn > AUTONUMA_LIST_MAX_PFN_OFFSET) {
+		WARN_ONCE(1, "autonuma_page_to_list_entry: "
+			  "pfn_offset  %lu, pgdat %p, "
+			  "pgdat->node_start_pfn %lu\n",
+			  pfn, pgdat, pgdat->node_start_pfn);
+		/*
+		 * Any value bigger than AUTONUMA_LIST_MAX_PFN_OFFSET
+		 * will work as an error retval, but better pick one
+		 * that will cause noise if computed wrong by the
+		 * caller.
+		 */
+		return AUTONUMA_LIST_POISON1;
+	}
+	return pfn; /* fits in the 32bit autonuma_list_entry without losing information */
+}
+
+static inline struct autonuma_list_head *____autonuma_list_head(int page_nid,
+					autonuma_list_entry pfn_offset)
+{
+	struct pglist_data *pgdat = NODE_DATA(page_nid);
+	unsigned long pfn = pgdat->node_start_pfn + pfn_offset;
+	struct page *page = pfn_to_page(pfn);
+	struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+	return &page_autonuma->autonuma_migrate_node;
+}
+
+struct autonuma_list_head *__autonuma_list_head(int page_nid,
+					struct autonuma_list_head *head,
+					autonuma_list_entry pfn_offset)
+{
+	VM_BUG_ON(page_nid < 0);
+	BUG_ON(pfn_offset == AUTONUMA_LIST_POISON1);
+	BUG_ON(pfn_offset == AUTONUMA_LIST_POISON2);
+	if (pfn_offset != AUTONUMA_LIST_HEAD)
+		return ____autonuma_list_head(page_nid, pfn_offset);
+	else
+		return head;
+}
diff --git a/mm/page_autonuma.c b/mm/page_autonuma.c
index f929d81..151f25c 100644
--- a/mm/page_autonuma.c
+++ b/mm/page_autonuma.c
@@ -12,7 +12,6 @@ void __meminit page_autonuma_map_init(struct page *page,
 	for (end = page + nr_pages; page < end; page++, page_autonuma++) {
 		page_autonuma->autonuma_last_nid = -1;
 		page_autonuma->autonuma_migrate_nid = -1;
-		page_autonuma->page = page;
 	}
 }
 
@@ -20,12 +19,18 @@ static void __meminit __pgdat_autonuma_init(struct pglist_data *pgdat)
 {
 	int node_iter;
 
+	/* verify the per-page page_autonuma 12 byte fixed cost */
+	BUILD_BUG_ON((unsigned long) &((struct page_autonuma *)0)[1] != 12);
+
 	spin_lock_init(&pgdat->autonuma_lock);
 	init_waitqueue_head(&pgdat->autonuma_knuma_migrated_wait);
 	pgdat->autonuma_nr_migrate_pages = 0;
 	if (!autonuma_impossible())
-		for_each_node(node_iter)
-			INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
+		for_each_node(node_iter) {
+			struct autonuma_list_head *head;
+			head = &pgdat->autonuma_migrate_head[node_iter];
+			AUTONUMA_INIT_LIST_HEAD(head);
+		}
 }
 
 #if !defined(CONFIG_SPARSEMEM)
@@ -112,10 +117,6 @@ struct page_autonuma *lookup_page_autonuma(struct page *page)
 	unsigned long pfn = page_to_pfn(page);
 	struct mem_section *section = __pfn_to_section(pfn);
 
-	/* if it's not a power of two we may be wasting memory */
-	BUILD_BUG_ON(SECTION_PAGE_AUTONUMA_SIZE &
-		     (SECTION_PAGE_AUTONUMA_SIZE-1));
-
 #ifdef CONFIG_DEBUG_VM
 	/*
 	 * The sanity checks the page allocator does upon freeing a

^ permalink raw reply related	[flat|nested] 236+ messages in thread

* Re: [PATCH 13/35] autonuma: add page structure fields
@ 2012-06-19 18:06                     ` Andrea Arcangeli
  0 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-06-19 18:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: KOSAKI Motohiro, Rik van Riel, linux-kernel, linux-mm,
	Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Johannes Weiner, Srivatsa Vaddagiri,
	Christoph Lameter

Hi everyone,

On Tue, Jun 05, 2012 at 04:51:23PM +0200, Andrea Arcangeli wrote:
> The details of the solution:
> 
> struct page_autonuma {
>     short autonuma_last_nid;
>     short autonuma_migrate_nid;
>     unsigned int pfn_offset_next;
>     unsigned int pfn_offset_prev;
> } __attribute__((packed));
> 
> page_autonuma can only point to a page that belongs to the same node
> (page_autonuma is queued into the
> NODE_DATA(autonuma_migrate_nid)->autonuma_migrate_head[src_nid]) where
> src_nid is the source node that page_autonuma belongs to, so all pages
> in the autonuma_migrate_head[src_nid] lru must come from the same
> src_nid. So the next page_autonuma in the list will be
> lookup_page_autonuma(pfn_to_page(NODE_DATA(src_nid)->node_start_pfn +
> page_autonuma->pfn_offset_next)) etc..
> 
> Of course all list_add/del must be hardcoded specially for this, but
> it's not a conceptually difficult solution, just we can't use list.h
> and stright pointers anymore and some conversion must happen.

So here is the above idea, implemented and apparently working fine (it
seems...?!? it has been running for only half an hour, but all
benchmark regression tests passed with the same scores as before and I
verified memory goes in all directions during the bench, so there's a
good chance it's ok).

It actually works even if a node has more than 16TB, but in that case
it will WARN_ONCE on the first page that is migrated at an offset
above 16TB from the start of the node, and then it will continue,
simply skipping migration of the pages whose offset is too large.
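
To make the conversion more concrete, here is a tiny standalone
userspace sketch (toy names and a toy 16-page node, not the kernel
code; the sentinel below stands in for AUTONUMA_LIST_HEAD and no
poison values are modelled) of a doubly linked list whose links are
32bit pfn offsets from the node's first pfn instead of struct page
pointers:

#include <stdint.h>
#include <stdio.h>

#define TOY_LIST_HEAD UINT32_MAX	/* sentinel meaning "the head itself" */

struct toy_link { uint32_t next, prev; };	/* 8 bytes of links per page */

struct toy_node {
	unsigned long start_pfn;	/* like NODE_DATA(nid)->node_start_pfn */
	struct toy_link head;		/* the list head */
	struct toy_link page[16];	/* per-page links, indexed by pfn offset */
};

static struct toy_link *toy_link_of(struct toy_node *n, uint32_t off)
{
	return off == TOY_LIST_HEAD ? &n->head : &n->page[off];
}

/* queue the page at "pfn" right after the head, like list_add() */
static void toy_add(struct toy_node *n, unsigned long pfn)
{
	uint32_t new = pfn - n->start_pfn;	/* offset must fit in 32 bits */
	uint32_t next = n->head.next;

	toy_link_of(n, next)->prev = new;
	n->page[new].next = next;
	n->page[new].prev = TOY_LIST_HEAD;
	n->head.next = new;
}

int main(void)
{
	struct toy_node n = { .start_pfn = 0x100000,
			      .head = { TOY_LIST_HEAD, TOY_LIST_HEAD } };

	toy_add(&n, 0x100003);
	toy_add(&n, 0x10000a);

	/* walk back to the page identity: pfn = node start_pfn + offset */
	for (uint32_t off = n.head.next; off != TOY_LIST_HEAD;
	     off = n.page[off].next)
		printf("queued pfn %#lx\n", n.start_pfn + off);
	return 0;
}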

The next part coming is the documentation of autonuma_balance() at the
top of kernel/sched/numa.c and a cleanup of the autonuma_balance
callout location (if I can figure out how to do an active balance on
the running task from softirq). The current location is there just so
it is invoked after load_balance runs, so it shouldn't make a runtime
difference once I clean it up (hackbench already runs identically to
upstream), but it will certainly be nice to micro-optimize away a call
and a branch from the schedule() fast path.

After that I'll write Documentation/vm/AutoNUMA.txt and I'll finish
the THP native migration (the latter assuming nobody does it before I
get there; if somebody wants to do it sooner, we figured out the
locking details with Johannes during the MM summit, but it's some work
to implement).

===

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: AutoNUMA15
  2012-05-31 18:08       ` AutoNUMA15 Andrea Arcangeli
@ 2012-06-21  7:29         ` Alex Shi
  -1 siblings, 0 replies; 236+ messages in thread
From: Alex Shi @ 2012-06-21  7:29 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Petr Holasek, Kirill A. Shutemov, linux-kernel, linux-mm,
	Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi, Chen, Tim C

> I released an AutoNUMA15 branch that includes all pending fixes:
>
> git clone --reference linux -b autonuma15 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
>

I did some quick testing of our
specjbb2005/oltp/hackbench/tbench/netperf-loop/fio/ffsb benchmarks on
NHM EP/EX, Core2 EP, and Romley EP machines. In general, no clear
performance change was found. Are these results expected for this
patch set?

Regards!
Alex

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: AutoNUMA15
  2012-06-21  7:29         ` AutoNUMA15 Alex Shi
@ 2012-06-21 14:55           ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-06-21 14:55 UTC (permalink / raw)
  To: Alex Shi
  Cc: Petr Holasek, Kirill A. Shutemov, linux-kernel, linux-mm,
	Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi, Chen, Tim C

On Thu, Jun 21, 2012 at 03:29:52PM +0800, Alex Shi wrote:
> > I released an AutoNUMA15 branch that includes all pending fixes:
> >
> > git clone --reference linux -b autonuma15 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
> >
> 
> I did some quick testing of our
> specjbb2005/oltp/hackbench/tbench/netperf-loop/fio/ffsb benchmarks on
> NHM EP/EX, Core2 EP, and Romley EP machines. In general, no clear
> performance change was found. Are these results expected for this
> patch set?

hackbench and network benchmarks won't benefit (the former
overschedules like crazy, so there's no way any autonuma balancing can
have an effect with such overscheduling and a zillion threads; the
latter is I/O dominated and usually takes so little RAM that it
doesn't matter, as the memory accesses on the kernel side and the DMA
issue should dominate its CPU utilization). A similar issue applies to
filesystem benchmarks like fio.

On all _system_ time dominated kernel benchmarks you are not expected
to measure a performance improvement, and if you don't measure a
regression that's more than enough.

The only benchmarks that benefit are userland ones where the user/nice
time in top dominates. AutoNUMA cannot optimize or move kernel memory
around; it only optimizes userland computations.

So you should run HPC jobs. The only strange thing here is that
specjbb2005 gets a measurable, significant boost with AutoNUMA, so if
you didn't get a boost even with that you may want to verify:

cat /sys/kernel/mm/autonuma/enabled == 1

Also verify:

CONFIG_AUTONUMA_DEFAULT_ENABLED=y

If that's 1, well, maybe the memory interconnect is so fast that there's
no benefit?

My numa01/02 benchmarks measure the best and worst case of the hardware
(not of the software) via the -DINVERSE_BIND and -DHARD_BIND
parameters; you can consider running those to verify.

Probably there should be a little boot-time kernel benchmark to
measure the inverse-bind vs hard-bind performance across the first two
nodes; if the difference is nil, AutoNUMA should disengage and not even
allocate the page_autonuma (now only 12 bytes per page, but anyway).

If you can retest with autonuma17 it would help too, as a performance
issue was fixed there and it would stress the new autonuma migration
lru code:

git clone --reference linux -b autonuma17 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git autonuma17

And the very latest is always at the autonuma branch:

git clone --reference linux -b autonuma git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git autonuma

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 12/35] autonuma: CPU follow memory algorithm
  2012-05-29 13:10     ` Peter Zijlstra
@ 2012-06-22 17:36       ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-06-22 17:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter

On Tue, May 29, 2012 at 03:10:04PM +0200, Peter Zijlstra wrote:
> This doesn't explain anything in the dense code that follows.
> 
> What statistics, how are they used, with what goal etc..

Right, sorry for taking so long to update the docs. I tried to write a
more useful comment to explain why it converges and what the objective
of the math is. This is the current status and it includes everything
related to the autonuma balancing, plus lots of documentation on it
(in fact almost more documentation than code), including the part that
makes CFS prioritize picking from the autonuma_node.

If you consider that this is _everything_ needed in terms of scheduler
code, I think it's quite simple, and with the docs it should be a lot
clearer.

Moving the callout out of schedule() is the next step, but it's only an
implementation issue (and I would have already done it if only I could
schedule from softirq, but this isn't preempt-rt...).

/*
 *  Copyright (C) 2012  Red Hat, Inc.
 *
 *  This work is licensed under the terms of the GNU GPL, version 2. See
 *  the COPYING file in the top-level directory.
 */

#include <linux/sched.h>
#include <linux/autonuma_sched.h>
#include <asm/tlb.h>

#include "sched.h"

#define AUTONUMA_BALANCE_SCALE 1000

enum {
	W_TYPE_THREAD,
	W_TYPE_PROCESS,
};

/*
 * This function is responsible for deciding which is the best CPU
 * each process should be running on according to the NUMA statistics
 * collected in mm->mm_autonuma and tsk->task_autonuma.
 *
 * The core math that evaluates the current CPU against the CPUs of
 * all _other_ nodes is this:
 *
 *			if (w_nid > w_other && w_nid > w_cpu_nid) {
 *				weight = w_nid - w_other + w_nid - w_cpu_nid;
 *
 * w_nid: worthiness of moving the current thread/process to the other
 * CPU.
 *
 * w_other: worthiness of moving the thread/process running in the
 * other CPU to the current CPU.
 *
 * w_cpu_nid: worthiness of keeping the current thread/process in the
 * current CPU.
 *
 * We run the above math on every CPU not part of the current NUMA
 * node, and we compare the current process against the other
 * processes running in the other CPUs in the remote NUMA nodes. The
 * objective is to select the cpu (in selected_cpu) with a bigger
 * worthiness weight (calculated as w_nid - w_other + w_nid -
 * w_cpu_nid). The bigger the worthiness weight of the other CPU the
 * bigger the gain we'll get by moving the current process to the
 * selected_cpu (not only the biggest immediate CPU gain but also the
 * fewer async memory migrations that will be required to reach full
 * convergence later). If we select a cpu we migrate the current
 * process to it.
 *
 * Checking that the other process prefers to run here (w_nid >
 * w_other) and not only that we prefer to run there (w_nid >
 * w_cpu_nid) completely avoids ping pongs and ensures (temporary)
 * convergence of the algorithm (at least from a CPU standpoint).
 *
 * It's then up to the idle balancing code that will run as soon as
 * the current CPU goes idle to pick the other process and move it
 * here.
 *
 * By only evaluating running processes against running processes we
 * avoid interfering with the stock CFS active idle balancing, which
 * is so critical to performing optimally with HT enabled (getting HT
 * wrong is worse than running on remote memory, so the active idle balancing
 * has the priority). The idle balancing (and all other CFS load
 * balancing) is however NUMA aware through the introduction of
 * sched_autonuma_can_migrate_task(). CFS searches the CPUs in the
 * tsk->autonuma_node first when it needs to find idle CPUs during the
 * idle balancing or tasks to pick during the load balancing.
 *
 * Then in the background asynchronously the memory always slowly
 * follows the CPU. Here the CPU follows the memory as fast as it can
 * (as long as active idle balancing permits).
 *
 * One non trivial bit of this logic that deserves an explanation is
 * how the three crucial variables of the core math
 * (w_nid/w_other/wcpu_nid) are going to change depending on whether
 * the other CPU is running a thread of the current process or a
 * thread of a different process. A simple example is required: assume
 * there are 2 processes with 4 threads per process and two nodes with
 * 4 CPUs each. Because the 8 threads in total belong to two different
 * processes, by using the process statistics when comparing threads of
 * different processes, we'll end up converging reliably and quickly
 * in a configuration where the first process is entirely contained in the
 * first node and the second process is entirely contained in the
 * second node. Now if you knew in advance that all threads only use
 * thread local memory and there's no sharing of memory between the
 * different threads, it wouldn't matter if use per-thread or per-mm
 * statistics in the w_nid/w_other/wcpu_nid and we could use
 * per-thread statistics at all times. But clearly with threads it's
 * expectable to get some sharing of memory, so to avoid false sharing
 * it's better to keep all threads of the same process in the same
 * node (or if they don't fit in a single node, in as fewer nodes as
 * possible), and this is why we've to use processes statistics in
 * w_nid/w_other/wcpu_nid when comparing thread of different
 * processes. Why instead we've to use thread statistics when
 * comparing threads of the same process should be already obvious if
 * you're still reading (hint: the mm statistics are identical for
 * threads of the same process). If some process doesn't fit in one
 * node, the thread statistics will then distribute the threads to the
 * best nodes within the group of nodes where the process is
 * contained.
 *
 * This is an example of the CPU layout after the startup of 2
 * processes with 12 threads each:
 *
 * nid 0 mm ffff880433367b80 nr 6
 * nid 0 mm ffff880433367480 nr 5
 * nid 1 mm ffff880433367b80 nr 6
 * nid 1 mm ffff880433367480 nr 6
 *
 * And after a few seconds it becomes:
 *
 * nid 0 mm ffff880433367b80 nr 12
 * nid 1 mm ffff880433367480 nr 11
 *
 * You can see it happening yourself by enabling debug with sysfs.
 *
 * Before scanning all other CPUs runqueues to compute the above math,
 * we also verify that we're not already in the preferred nid from the
 * point of view of both the process statistics and the thread
 * statistics. In that case we can return to the caller without having
 * to check any other CPU's runqueue, because full convergence has
 * already been reached.
 *
 * Ideally this should be expanded to take all runnable processes into
 * account but this is a good enough approximation because some
 * runnable processes may run only for a short time so statistically
 * there will always be a bias toward the processes that use most of
 * the CPU, and that's ideal (if a process runs only for a short time,
 * it won't matter too much if the NUMA balancing isn't optimal for
 * it).
 *
 * This function is invoked at the same frequency as the load balancer
 * and only if the CPU is not idle. The rest of the time we depend on
 * CFS to keep sticking to the current CPU or to prioritize on the
 * CPUs in the selected_nid recorded in the task autonuma_node.
 */
void sched_autonuma_balance(void)
{
	int cpu, nid, selected_cpu, selected_nid, selected_nid_mm;
	int cpu_nid = numa_node_id();
	int this_cpu = smp_processor_id();
	unsigned long t_w, t_t, m_w, m_t, t_w_max, m_w_max;
	unsigned long weight_delta_max, weight;
	long s_w_nid = -1, s_w_cpu_nid = -1, s_w_other = -1;
	int s_w_type = -1;
	struct cpumask *allowed;
	struct migration_arg arg;
	struct task_struct *p = current;
	struct task_autonuma *task_autonuma = p->task_autonuma;

	/* per-cpu statically allocated in runqueues */
	long *task_numa_weight;
	long *mm_numa_weight;

	if (!task_autonuma || !p->mm)
		return;

	if (!(task_autonuma->autonuma_flags &
	      SCHED_AUTONUMA_FLAG_NEED_BALANCE))
		return;
	else
		task_autonuma->autonuma_flags &=
			~SCHED_AUTONUMA_FLAG_NEED_BALANCE;

	if (task_autonuma->autonuma_flags & SCHED_AUTONUMA_FLAG_STOP_ONE_CPU)
		return;

	if (!autonuma_enabled()) {
		if (task_autonuma->autonuma_node != -1)
			task_autonuma->autonuma_node = -1;
		return;
	}

	allowed = tsk_cpus_allowed(p);

	m_t = ACCESS_ONCE(p->mm->mm_autonuma->mm_numa_fault_tot);
	t_t = task_autonuma->task_numa_fault_tot;
	/*
	 * If a process still misses the per-thread or per-process
	 * information skip it.
	 */
	if (!m_t || !t_t)
		return;

	task_numa_weight = cpu_rq(this_cpu)->task_numa_weight;
	mm_numa_weight = cpu_rq(this_cpu)->mm_numa_weight;

	/*
	 * See if we already converged to skip the more expensive loop
	 * below, if we can already predict here with only CPU local
	 * information, that it would selected the current cpu_nid.
	 */
	t_w_max = m_w_max = 0;
	selected_nid = selected_nid_mm = -1;
	for_each_online_node(nid) {
		m_w = ACCESS_ONCE(p->mm->mm_autonuma->mm_numa_fault[nid]);
		t_w = task_autonuma->task_numa_fault[nid];
		if (m_w > m_t)
			m_t = m_w;
		mm_numa_weight[nid] = m_w*AUTONUMA_BALANCE_SCALE/m_t;
		if (t_w > t_t)
			t_t = t_w;
		task_numa_weight[nid] = t_w*AUTONUMA_BALANCE_SCALE/t_t;
		if (mm_numa_weight[nid] > m_w_max) {
			m_w_max = mm_numa_weight[nid];
			selected_nid_mm = nid;
		}
		if (task_numa_weight[nid] > t_w_max) {
			t_w_max = task_numa_weight[nid];
			selected_nid = nid;
		}
	}
	if (selected_nid == cpu_nid && selected_nid_mm == selected_nid) {
		if (task_autonuma->autonuma_node != selected_nid)
			task_autonuma->autonuma_node = selected_nid;
		return;
	}

	selected_cpu = this_cpu;
	/*
	 * Avoid the process migration if we don't find an ideal not
	 * idle CPU (hence the above selected_cpu = this_cpu), but
	 * keep the autonuma_node pointing to the node with most of
	 * the thread memory as selected above using the thread
	 * statistical data so the idle balancing code keeps
	 * prioritizing on it when selecting an idle CPU where to run
	 * the task on. Do not set it to the cpu_nid which would keep
	 * it in the current nid even if maybe the thread memory got
	 * allocated somewhere else because the current nid was
	 * already full.
	 *
	 * NOTE: selected_nid should never be below zero here, it's
	 * not a BUG_ON(selected_nid < 0), because it's nicer to keep
	 * the autonuma thread/mm statistics speculative.
	 */
	if (selected_nid < 0)
		selected_nid = cpu_nid;
	weight = weight_delta_max = 0;

	for_each_online_node(nid) {
		if (nid == cpu_nid)
			continue;
		for_each_cpu_and(cpu, cpumask_of_node(nid), allowed) {
			long w_nid, w_cpu_nid, w_other;
			int w_type;
			struct mm_struct *mm;
			struct rq *rq = cpu_rq(cpu);
			if (!cpu_online(cpu))
				continue;

			if (idle_cpu(cpu))
				/*
				 * Offload the whole IDLE balancing
				 * and physical / logical imbalances
				 * to CFS.
				 */
				continue;

			mm = rq->curr->mm;
			if (!mm)
				continue;
			raw_spin_lock_irq(&rq->lock);
			/* recheck after implicit barrier() */
			mm = rq->curr->mm;
			if (!mm) {
				raw_spin_unlock_irq(&rq->lock);
				continue;
			}
			m_t = ACCESS_ONCE(mm->mm_autonuma->mm_numa_fault_tot);
			t_t = rq->curr->task_autonuma->task_numa_fault_tot;
			if (!m_t || !t_t) {
				raw_spin_unlock_irq(&rq->lock);
				continue;
			}
			m_w = ACCESS_ONCE(mm->mm_autonuma->mm_numa_fault[nid]);
			t_w = rq->curr->task_autonuma->task_numa_fault[nid];
			raw_spin_unlock_irq(&rq->lock);
			if (mm == p->mm) {
				if (t_w > t_t)
					t_t = t_w;
				w_other = t_w*AUTONUMA_BALANCE_SCALE/t_t;
				w_nid = task_numa_weight[nid];
				w_cpu_nid = task_numa_weight[cpu_nid];
				w_type = W_TYPE_THREAD;
			} else {
				if (m_w > m_t)
					m_t = m_w;
				w_other = m_w*AUTONUMA_BALANCE_SCALE/m_t;
				w_nid = mm_numa_weight[nid];
				w_cpu_nid = mm_numa_weight[cpu_nid];
				w_type = W_TYPE_PROCESS;
			}

			if (w_nid > w_other && w_nid > w_cpu_nid) {
				weight = w_nid - w_other + w_nid - w_cpu_nid;

				if (weight > weight_delta_max) {
					weight_delta_max = weight;
					selected_cpu = cpu;
					selected_nid = nid;

					s_w_other = w_other;
					s_w_nid = w_nid;
					s_w_cpu_nid = w_cpu_nid;
					s_w_type = w_type;
				}
			}
		}
	}

	if (task_autonuma->autonuma_node != selected_nid)
		task_autonuma->autonuma_node = selected_nid;
	if (selected_cpu != this_cpu) {
		if (autonuma_debug()) {
			char *w_type_str = NULL;
			switch (s_w_type) {
			case W_TYPE_THREAD:
				w_type_str = "thread";
				break;
			case W_TYPE_PROCESS:
				w_type_str = "process";
				break;
			}
			printk("%p %d - %dto%d - %dto%d - %ld %ld %ld - %s\n",
			       p->mm, p->pid, cpu_nid, selected_nid,
			       this_cpu, selected_cpu,
			       s_w_other, s_w_nid, s_w_cpu_nid,
			       w_type_str);
		}
		BUG_ON(cpu_nid == selected_nid);
		goto found;
	}

	return;

found:
	arg = (struct migration_arg) { p, selected_cpu };
	/* Need help from migration thread: drop lock and wait. */
	task_autonuma->autonuma_flags |= SCHED_AUTONUMA_FLAG_STOP_ONE_CPU;
	sched_preempt_enable_no_resched();
	stop_one_cpu(this_cpu, migration_cpu_stop, &arg);
	preempt_disable();
	task_autonuma->autonuma_flags &= ~SCHED_AUTONUMA_FLAG_STOP_ONE_CPU;
	tlb_migrate_finish(p->mm);
}

/*
 * This is called by CFS can_migrate_task() to prioritize the
 * selection of AutoNUMA affine tasks (according to the autonuma_node)
 * during the CFS load balance, active balance, etc...
 *
 * This is first called with numa == true to skip tasks that are not
 * AutoNUMA affine. If this is later called with a numa == false
 * parameter, it means a first pass of CFS load balancing wasn't
 * satisfied by an AutoNUMA affine task, so we can decide to fall back
 * to allowing migration of non-affine tasks.
 *
 * If load_balance_strict is enabled, AutoNUMA will only allow
 * migration of tasks for idle balancing purposes (the idle balancing
 * of CFS is never altered by AutoNUMA). In non-strict mode the
 * load balancing is not altered and the AutoNUMA affinity is
 * disregarded in favor of higher fairness.
 *
 * The load_balance_strict mode (tunable through sysfs), if enabled,
 * tends to partition the system and in turn it may reduce the
 * scheduler fairness across different NUMA nodes, but it shall deliver
 * higher global performance.
 */
bool sched_autonuma_can_migrate_task(struct task_struct *p,
				     int numa, int dst_cpu,
				     enum cpu_idle_type idle)
{
	if (!task_autonuma_cpu(p, dst_cpu)) {
		if (numa)
			return false;
		if (autonuma_sched_load_balance_strict() &&
		    idle != CPU_NEWLY_IDLE && idle != CPU_IDLE)
			return false;
	}
	return true;
}

void sched_autonuma_dump_mm(void)
{
	int nid, cpu;
	cpumask_var_t x;

	if (!alloc_cpumask_var(&x, GFP_KERNEL))
		return;
	cpumask_setall(x);
	for_each_online_node(nid) {
		for_each_cpu(cpu, cpumask_of_node(nid)) {
			struct rq *rq = cpu_rq(cpu);
			struct mm_struct *mm = rq->curr->mm;
			int nr = 0, cpux;
			if (!cpumask_test_cpu(cpu, x))
				continue;
			for_each_cpu(cpux, cpumask_of_node(nid)) {
				struct rq *rqx = cpu_rq(cpux);
				if (rqx->curr->mm == mm) {
					nr++;
					cpumask_clear_cpu(cpux, x);
				}
			}
			printk("nid %d mm %p nr %d\n", nid, mm, nr);
		}
	}
	free_cpumask_var(x);
}
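
As a side note for anyone following the weight math documented above,
here is a tiny standalone example with made-up fault counts (purely
illustrative numbers, not measured data) showing how the
w_nid/w_other/w_cpu_nid comparison and the resulting weight behave:

#include <stdio.h>

#define AUTONUMA_BALANCE_SCALE 1000

int main(void)
{
	/* current task: 200 of its 1000 faults on node 0 (where it runs),
	 * 800 on node 1 */
	long t_fault[2] = { 200, 800 }, t_tot = 1000;
	/* task running on the candidate CPU in node 1: 300 of its
	 * 1000 faults happened on node 1 */
	long other_fault_nid1 = 300, other_tot = 1000;

	long w_cpu_nid = t_fault[0] * AUTONUMA_BALANCE_SCALE / t_tot;	/* 200 */
	long w_nid     = t_fault[1] * AUTONUMA_BALANCE_SCALE / t_tot;	/* 800 */
	long w_other   = other_fault_nid1 * AUTONUMA_BALANCE_SCALE / other_tot; /* 300 */

	/* both "we prefer to be there" and "it prefers to be here" hold */
	if (w_nid > w_other && w_nid > w_cpu_nid)
		printf("candidate weight = %ld\n",
		       (w_nid - w_other) + (w_nid - w_cpu_nid));	/* 1100 */
	return 0;
}

The larger that weight is for a remote CPU, the more the current task
prefers that node over its own and the more the task currently there
prefers to leave, so the candidate with the maximum weight becomes
selected_cpu.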

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: [PATCH 00/35] AutoNUMA alpha14
  2012-06-01 22:41   ` Mauricio Faria de Oliveira
@ 2012-06-22 17:57     ` Andrea Arcangeli
  -1 siblings, 0 replies; 236+ messages in thread
From: Andrea Arcangeli @ 2012-06-22 17:57 UTC (permalink / raw)
  To: Mauricio Faria de Oliveira
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, srikar,
	mjw

Hi Mauricio and everyone,

On Fri, Jun 01, 2012 at 07:41:09PM -0300, Mauricio Faria de Oliveira wrote:
> I got SPECjbb2005 results for 3.4-rc2 mainline, numasched, 
> autonuma-alpha10, and autonuma-alpha13. If you judge the data is OK it 
> may suit a comparison between autonuma-alpha13/14 to verify NUMA 
> affinity regressions.
> 
> The system is an Intel 2-socket Blade. Each NUMA node has 6 cores (+6 
> hyperthreads) and 12 GB RAM. Different permutations of THP, KSM, and VM 
> memory size were tested for each kernel.
> 
> I'll have to leave the analysis of each variable for you, as I'm not 
> familiar w/ the code and expected impacts; but I'm perfectly fine with 
> providing more details about the tests, environment and procedures, and 
> even some reruns, if needed.

So autonuma10 didn't have fully working idle balancing yet; that's
why it's under-performing. My initial regression test didn't cover the
idle balancing. That got fixed in autonuma11 (notably, it also fixes
multi-instance streams).

Your testing methodology was perfect, because you tested with THP off
on the host too. This rules out the possibility that different
khugepaged default settings could skew the results (when AutoNUMA
engages it boosts khugepaged to compensate for the fact that THP native
migration isn't available yet, so THP gets split across memory
migrations and we need to collapse the pages more aggressively).
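
For reference, the khugepaged knobs that get boosted can be checked
from sysfs (standard mainline THP paths, shown here only as a sanity
check; the exact boosted values AutoNUMA picks may differ):

  cd /sys/kernel/mm/transparent_hugepage/khugepaged
  cat scan_sleep_millisecs alloc_sleep_millisecs pages_to_scan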

Another thing I noticed is that in the THP off, KSM off, VM1 < node
case, on autonuma13 VM1 gets slightly less priority and scores only 87%
(but VM2/3 score higher than on mainline). It may be a non-reproducible
hyperthreading effect that happens once in a while (the active
balancing probably isn't as fast as it should be, and I'm seeing some
effect of that even on upstream without patches when half of the
hyperthreads are idle), but more likely it's one of the bugs that I've
been fixing lately.

If you have time, you may consider running this again on
autonuma18. Lots of changes and improvements happened since
autonuma13. The amount of memory used by page_autonuma (per page) has
also been significantly reduced from 24 bytes (or 32 since autonuma14)
to 12, and the scheduler side should be much faster when overscheduling.

git clone --reference linux -b autonuma18 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git autonuma18

Two other tweaks to test (only if you have time):

echo 15000 >/sys/kernel/mm/autonuma/knuma_scand/scan_sleep_pass_millisecs

Thanks a lot for your great effort, this was very useful!
Andrea

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: AutoNUMA15
  2012-06-21 14:55           ` AutoNUMA15 Andrea Arcangeli
@ 2012-06-26  7:52             ` Alex Shi
  -1 siblings, 0 replies; 236+ messages in thread
From: Alex Shi @ 2012-06-26  7:52 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Alex Shi, Petr Holasek, Kirill A. Shutemov, linux-kernel,
	linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Chen,
	Tim C

On 06/21/2012 10:55 PM, Andrea Arcangeli wrote:

> On Thu, Jun 21, 2012 at 03:29:52PM +0800, Alex Shi wrote:
>>> I released an AutoNUMA15 branch that includes all pending fixes:
>>>
>>> git clone --reference linux -b autonuma15 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
>>>
>>
>> I did a quick testing on our
>> specjbb2005/oltp/hackbench/tbench/netperf-loop/fio/ffsb on NHM EP/EX,
>> Core2 EP, Romely EP machine, In generally no clear performance change
>> found. Is this results expected for this patch set?
> 
> hackbench and network benchs won't get benefit (the former
> overschedule like crazy so there's no way any autonuma balancing can
> have effect with such an overscheduling and zillion of threads, the
> latter is I/O dominated usually taking so little RAM it doesn't
> matter, the memory accesses on the kernel side and DMA issue should
> dominate it in CPU utilization). Similar issue for filesystem
> benchmarks like fio.
> 
> On all _system_ time dominated kernel benchmarks it is expected not to
> measure a performance optimization and if you don't measure a
> regression it's more than enough.
> 
> The only benchmarks that gets benefit are userland where the user/nice
> time in top dominates. AutoNUMA cannot optimize or move kernel memory
> around, it only optimizes userland computations.
> 
> So you should run HPC jobs. The only strange thing here is that
> specjbb2005 gets a measurable significant boost with AutoNUMA so if
> you didn't even get a boost with that you may want to verify:
> 
> cat /sys/kernel/mm/autonuma/enabled == 1
> 
> Also verify:
> 
> CONFIG_AUTONUMA_DEFAULT_ENABLED=y
> 
> If that's 1 well maybe the memory interconnect is so fast that there's
> no benefit?
> 
> My numa01/02 benchmarks measures the best worst case of the hardware
> (not software), with -DINVERSE_BIND -DHARD_BIND parameters, you can
> consider running that to verify.


Could you give a URL for the benchmarks?

> 
> Probably there should be a little boot time kernel benchmark to
> measure the inverse bind vs hard bind performance across the first two
> nodes, if the difference is nil AutoNUMA should disengage and not even
> allocate the page_autonuma (now only 12 bytes per page but anyway).
> 
> If you can retest with autonuma17 it would help too as there was some
> performance issue fixed and it'd stress the new autonuma migration lru
> code:
> 
> git clone --reference linux -b autonuma17 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git autonuma17
> 
> And the very latest is always at the autonuma branch:
> 
> git clone --reference linux -b autonuma git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git autonuma


I got the commits up to 2c7535e100805d and retested specjbb2005 with
jrockit and openjdk again on my Romley EP (2P * 8 cores * HT, with 64GB
memory). I find the openjdk has about a 2% regression, while jrockit
has no clear change.


The testing uses 2 instances, each of them pinned to a node. Some
settings are here:
  per_jvm_warehouse_rampup = 3.0
  per_jvm_warehouse_rampdown = 20.0
  jvm_instances = 2
  deterministic_random_seed = false
  ramp_up_seconds = 30
  measurement_seconds = 240
  starting_number_warehouses = 1
  increment_number_warehouses = 1
  ending_number_warehouses = 34
  expected_peak_warehouse = 16

openjdk
java options:
-Xmx8g -Xms8g -Xincgc

jrockit uses hugetlb, and its options are:
-Xmx8g -Xms8g -Xns4g -XXaggressive -Xlargepages -XXlazyUnlocking
-Xgc:genpar -XXtlasize:min=16k,preferred=64k

> 
> Thanks,
> Andrea



^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: AutoNUMA15
  2012-06-26  7:52             ` AutoNUMA15 Alex Shi
  (?)
@ 2012-06-26 12:03             ` Andrea Arcangeli
  2012-07-12  2:36                 ` AutoNUMA15 Alex Shi
  -1 siblings, 1 reply; 236+ messages in thread
From: Andrea Arcangeli @ 2012-06-26 12:03 UTC (permalink / raw)
  To: Alex Shi
  Cc: Alex Shi, Petr Holasek, Kirill A. Shutemov, linux-kernel,
	linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Chen,
	Tim C

[-- Attachment #1: Type: text/plain, Size: 4437 bytes --]

On Tue, Jun 26, 2012 at 03:52:26PM +0800, Alex Shi wrote:
> Could you give a URL for the benchmarks?

I posted them to lkml a few months ago; I'm attaching them here. There
is actually a more polished version around that I haven't had time to
test yet. For now I'm attaching the old version, which I'm still using
to verify the regressions.

If you edit the .c files to set the right hard/inverse binds, and then
build once with -DHARD_BIND and again with -DINVERSE_BIND, you can
measure the NUMA effects of your hardware. numactl --hardware will give
you the topology, so you can check whether the CPU ranges hard-coded in
the sources are OK for your hardware.
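
For completeness, a build sketch (binary names are arbitrary): the
attached sources use pthreads and set_mempolicy() from libnuma, and
from a quick read of them INVERSE_BIND only takes effect together with
HARD_BIND:

  gcc -O2 -pthread numa01.c -o numa01 -lnuma
  gcc -O2 -pthread -DHARD_BIND numa01.c -o numa01_hard -lnuma
  gcc -O2 -pthread -DHARD_BIND -DINVERSE_BIND numa01.c -o numa01_inverse -lnuma

(and the same for numa02.c).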

> memory). I find the openjdk has about a 2% regression, while jrockit has no

A 2% regression is, in the worst case, the cost of the numa hinting
page faults (or in the best case a measurement error) when you get no
benefit from the vastly increased NUMA affinity.

You can reduce that overhead to below 1% by multiplying the
/sys/kernel/mm/autonuma/knuma_scand/scan_sleep_millisecs and
/sys/kernel/mm/autonuma/knuma_scand/scan_sleep_pass_millisecs values by
2 or 3 times. Especially the latter: if set to 15000 it will reduce the
overhead by 1%.
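
For example (same knuma_scand paths as above; the exact factor is only
indicative):

  cd /sys/kernel/mm/autonuma/knuma_scand
  cat scan_sleep_millisecs scan_sleep_pass_millisecs   # current (aggressive) defaults
  echo 15000 > scan_sleep_pass_millisecs               # run full scan passes less often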

The current AutoNUMA defaults are hyper-aggressive; with benchmarks
running for several minutes you can easily reduce the AutoNUMA
aggressiveness to pay a lower fixed cost in the numa hinting page
faults without reducing overall performance.

The boost when you use AutoNUMA is >20%, sometimes as high as 100%, so
the 2% is lost in the noise, but over time we should reduce it
(especially with a hypervisor-tuned profile for those cloud nodes which
only run virtual machines, in turn with quite constant loads, where
there's no need to react that fast).

> the testing user 2 instances, each of them are pinned to a node. some
> setting is here:

OK, the problem is that you must not pin anything. If you hard pin,
AutoNUMA won't do anything on those processes.

It is impossible to run faster than raw hard pinning, because AutoNUMA
also has to migrate memory, while hard pinning avoids all memory
migrations.

AutoNUMA aims to get as close to hard-pinning performance as possible
without having to use hard pinning; that's the whole point.

So this explains why you measure a 2% regression or no difference:
with hard pins used at all times, only the AutoNUMA worst-case overhead
can be measured (and I explained above how it can be reduced).

A plan I can suggest for this benchmark is this:

1) "upstream default"
  - no hugetlbfs (AutoNUMA cannot migrate hugetlbfs memory)
  - no hard pinning of CPUs or memory to nodes
  - CONFIG_AUTONUMA=n
  - CONFIG_TRANSPARENT_HUGEPAGE=y

2) "autonuma"
  - no hugetlbfs (AutoNUMA cannot migrate hugetlbfs memory)
  - no hard pinning of CPUs or memory to nodes
  - CONFIG_AUTONUMA=y
  - CONFIG_AUTONUMA_DEFAULT_ENABLED=y
  - CONFIG_TRANSPARENT_HUGEPAGE=y

3) "autonuma lower numa hinting page fault overhead"
  - no hugetlbfs (AutoNUMA cannot migrate hugetlbfs memory)
  - no hard pinning of CPUs or memory to nodes
  - CONFIG_AUTONUMA=y
  - CONFIG_AUTONUMA_DEFAULT_ENABLED=y
  - CONFIG_TRANSPARENT_HUGEPAGE=y
  - echo 15000 >/sys/kernel/mm/autonuma/knuma_scand/scan_sleep_pass_millisecs

4) "upstream hard pinning and transparent hugepage"
  - hard pinning of CPUs or memory to nodes
  - CONFIG_AUTONUMA=n
  - CONFIG_TRANSPARENT_HUGEPAGE=y

5) "upstream hard pinning and hugetlbfs"
  - hugetlbfs
  - hard pinning of CPUs or memory to nodes
  - CONFIG_AUTONUMA=n
  - CONFIG_TRANSPARENT_HUGEPAGE=y (y/n won't matter if you use hugetlbfs)

Then you can compare 1/2/3/4/5.
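
(For the hard-pinned runs in 4 and 5, numactl is one way to do the
pinning; assuming the 2-node layout, something along these lines, with
the actual java command lines filled in:)

  numactl --cpunodebind=0 --membind=0 java ... &
  numactl --cpunodebind=1 --membind=1 java ... &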

The minimum to make a meaningful comparison is 1 vs 2. The next best
comparison is 1 vs 2 vs 4 (4 is a very useful reference too because the
closer AutoNUMA gets to 4 the better! Beating 1 is trivial; getting
very close to 4 is less easy because 4 isn't migrating any memory).

Running 3 and 5 is optional; I mentioned 5 mainly because you liked
running it with hugetlbfs and not just THP.

> jrockit use hugetlb and its options:

hugetlbfs should be disabled when AutoNUMA is enabled, because AutoNUMA
won't try to migrate hugetlbfs memory (not that it makes any difference
if the memory is hard pinned). THP should deliver the same performance
as hugetlbfs for the JVM, and THP memory can be migrated by AutoNUMA
(as well as mmapped, non-shared pagecache, not just anon memory).

Thanks a lot, and looking forward to seeing how things go when you
remove the hard pins.

Andrea

[-- Attachment #2: numa01.c --]
[-- Type: text/x-c, Size: 3046 bytes --]

/*
 *  Copyright (C) 2012  Red Hat, Inc.
 *
 *  This work is licensed under the terms of the GNU GPL, version 2. See
 *  the COPYING file in the top-level directory.
 */

#define _GNU_SOURCE
#include <pthread.h>
#include <strings.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <numaif.h>
#include <sched.h>
#include <time.h>
#include <sys/wait.h>
#include <sys/file.h>

//#define KVM
#ifndef KVM
#define THREADS 12
#define SIZE (3UL*1024*1024*1024)
#else
#define THREADS 4
#define SIZE (200*1024*1024)
#endif
//#define THREAD_ALLOC
#ifdef THREAD_ALLOC
#define THREAD_SIZE (SIZE/THREADS)
#else
#define THREAD_SIZE SIZE
#endif
//#define HARD_BIND
//#define INVERSE_BIND
//#define NO_BIND_FORCE_SAME_NODE

static char *p_global;
static unsigned long nodemask_global;

void *thread(void * arg)
{
	char *p = arg;
	int i;
#ifndef KVM
#ifndef THREAD_ALLOC
	int nr = 50;
#else
	int nr = 1000;
#endif
#else
	int nr = 500;
#endif
#ifdef NO_BIND_FORCE_SAME_NODE
	if (set_mempolicy(MPOL_BIND, &nodemask_global, 3) < 0)
		perror("set_mempolicy"), printf("%lu\n", nodemask_global),
			exit(1);
#endif
	bzero(p_global, SIZE);
#ifdef NO_BIND_FORCE_SAME_NODE
	if (set_mempolicy(MPOL_DEFAULT, NULL, 3) < 0)
		perror("set_mempolicy"), exit(1);
#endif
	for (i = 0; i < nr; i++) {
#if 1
		bzero(p, THREAD_SIZE);
#else
		memcpy(p, p+THREAD_SIZE/2, THREAD_SIZE/2);
#endif
		asm volatile("" : : : "memory");
	}
	return NULL;
}

int main()
{
	int i;
	pthread_t pthread[THREADS];
	char *p;
	pid_t pid;
	cpu_set_t cpumask;
	int f;
	unsigned long nodemask;

	nodemask_global = (time(NULL) & 1) + 1;
	f = creat("lock", 0400);
	if (f < 0)
		perror("creat"), exit(1);
	if (unlink("lock") < 0)
		perror("unlink"), exit(1);

	if ((pid = fork()) < 0)
		perror("fork"), exit(1);

	p_global = p = malloc(SIZE);
	if (!p)
		perror("malloc"), exit(1);
	CPU_ZERO(&cpumask);
	if (!pid) {
		nodemask = 1;
		for (i = 0; i < 6; i++)
			CPU_SET(i, &cpumask);
#if 1
		for (i = 12; i < 18; i++)
			CPU_SET(i, &cpumask);
#else
		for (i = 6; i < 12; i++)
			CPU_SET(i, &cpumask);
#endif
	} else {
		nodemask = 2;
		for (i = 6; i < 12; i++)
			CPU_SET(i, &cpumask);
		for (i = 18; i < 24; i++)
			CPU_SET(i, &cpumask);
	}
#ifdef INVERSE_BIND
	if (nodemask == 1)
		nodemask = 2;
	else if (nodemask == 2)
		nodemask = 1;
#endif
#if 0
	if (pid)
		goto skip;
#endif
#ifdef HARD_BIND
	if (sched_setaffinity(0, sizeof(cpumask), &cpumask) < 0)
		perror("sched_setaffinity"), exit(1);
#endif
#ifdef HARD_BIND
	if (set_mempolicy(MPOL_BIND, &nodemask, 3) < 0)
		perror("set_mempolicy"), printf("%lu\n", nodemask), exit(1);
#endif
#if 0
	bzero(p, SIZE);
#endif
	for (i = 0; i < THREADS; i++) {
		char *_p = p;
#ifdef THREAD_ALLOC
		_p += THREAD_SIZE * i;
#endif
		if (pthread_create(&pthread[i], NULL, thread, _p) != 0)
			perror("pthread_create"), exit(1);
	}
	for (i = 0; i < THREADS; i++)
		if (pthread_join(pthread[i], NULL) != 0)
			perror("pthread_join"), exit(1);
#if 1
skip:
#endif
	if (pid)
		if (wait(NULL) < 0)
			perror("wait"), exit(1);
	return 0;
}

[-- Attachment #3: numa02.c --]
[-- Type: text/x-c, Size: 2138 bytes --]

/*
 *  Copyright (C) 2012  Red Hat, Inc.
 *
 *  This work is licensed under the terms of the GNU GPL, version 2. See
 *  the COPYING file in the top-level directory.
 */

#define _GNU_SOURCE
#include <pthread.h>
#include <strings.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <numaif.h>
#include <sched.h>
#include <sys/wait.h>
#include <sys/file.h>

//#define KVM
#ifndef KVM
#ifndef SMT
#define THREADS 24
#else
#define THREADS 12
#endif
#define SIZE (1UL*1024*1024*1024)
#else
#ifndef SMT
#define THREADS 8
#else
#define THREADS 4
#endif
#define SIZE (500*1024*1024)
#endif
#define TOTALSIZE (4UL*1024*1024*1024*200)
#define THREAD_SIZE (SIZE/THREADS)
//#define HARD_BIND
//#define INVERSE_BIND

static void *thread(void * arg)
{
	char *p = arg;
	int i;
	for (i = 0; i < TOTALSIZE/SIZE; i++) {
		bzero(p, THREAD_SIZE);
		asm volatile("" : : : "memory");
	}
	return NULL;
}

#ifdef HARD_BIND
static void bind(int node)
{
	int i;
	unsigned long nodemask;
	cpu_set_t cpumask;
	CPU_ZERO(&cpumask);
	if (!node) {
		nodemask = 1;
		for (i = 0; i < 6; i++)
			CPU_SET(i, &cpumask);
		for (i = 12; i < 18; i++)
			CPU_SET(i, &cpumask);
	} else {
		nodemask = 2;
		for (i = 6; i < 12; i++)
			CPU_SET(i, &cpumask);
		for (i = 18; i < 24; i++)
			CPU_SET(i, &cpumask);
	}
	if (sched_setaffinity(0, sizeof(cpumask), &cpumask) < 0)
		perror("sched_setaffinity"), exit(1);
	if (set_mempolicy(MPOL_BIND, &nodemask, 3) < 0)
		perror("set_mempolicy"), printf("%lu\n", nodemask), exit(1);
}
#else
static void bind(int node) {}
#endif

int main()
{
	int i;
	pthread_t pthread[THREADS];
	char *p;
	pid_t pid;
	int f;

	p = malloc(SIZE);
	if (!p)
		perror("malloc"), exit(1);
	bind(0);
	bzero(p, SIZE/2);
	bind(1);
	bzero(p+SIZE/2, SIZE/2);
	for (i = 0; i < THREADS; i++) {
		char *_p = p + THREAD_SIZE * i;
#ifdef INVERSE_BIND
		bind(i < THREADS/2);
#else
		bind(i >= THREADS/2);
#endif
		if (pthread_create(&pthread[i], NULL, thread, _p) != 0)
			perror("pthread_create"), exit(1);
	}
	for (i = 0; i < THREADS; i++)
		if (pthread_join(pthread[i], NULL) != 0)
			perror("pthread_join"), exit(1);
	return 0;
}

^ permalink raw reply	[flat|nested] 236+ messages in thread

* Re: AutoNUMA15
  2012-06-26 12:03             ` AutoNUMA15 Andrea Arcangeli
@ 2012-07-12  2:36                 ` Alex Shi
  0 siblings, 0 replies; 236+ messages in thread
From: Alex Shi @ 2012-07-12  2:36 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Alex Shi, Petr Holasek, Kirill A. Shutemov, linux-kernel,
	linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Chen,
	Tim C

>

> OK, the problem is that you must not pin anything. If you hard pin,
> AutoNUMA won't do anything on those processes.
> 
> It is impossible to run faster than raw hard pinning, because AutoNUMA
> also has to migrate memory, while hard pinning avoids all memory
> migrations.
> 



> 
> Thanks a lot, and looking forward to seeing how things go when you
> remove the hard pins.
> 



Andrea:
I continued testing specjbb2005 with your patch at 2c7535e100805d9,
with the hard pin removed for the openjdk JVM.
My NHM EP machine has 12GB memory and 16 LCPUs. The following data uses
each scenario's result on 3.5-rc2 as the 100% base.

			3.5-rc2 	3.5-rc2+autonuma
2 JVM, each 1GBmem 	 100%		100%
1 JVM with 2GBmem	 100%		100%

2 JVM, each 4GBmem	 100%		98%~100%
1 JVM with 4GB mem	 100%		98%~100%

So my testing didn't find a benefit from the autonuma patch, and when
using a bigger memory size the patch introduces more variation and may
cause a 2% performance drop. My openjdk options are "-Xmx4g -Xms4g -Xincgc".

I am wondering whether specjbb can show your patch's advantage.

^ permalink raw reply	[flat|nested] 236+ messages in thread

end of thread, other threads:[~2012-07-12  2:36 UTC | newest]

Thread overview: 236+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-05-25 17:02 [PATCH 00/35] AutoNUMA alpha14 Andrea Arcangeli
2012-05-25 17:02 ` Andrea Arcangeli
2012-05-25 17:02 ` [PATCH 01/35] mm: add unlikely to the mm allocation failure check Andrea Arcangeli
2012-05-25 17:02   ` Andrea Arcangeli
2012-05-25 17:02 ` [PATCH 02/35] autonuma: make set_pmd_at always available Andrea Arcangeli
2012-05-25 17:02   ` Andrea Arcangeli
2012-05-25 17:02 ` [PATCH 03/35] xen: document Xen is using an unused bit for the pagetables Andrea Arcangeli
2012-05-25 17:02   ` Andrea Arcangeli
2012-05-25 20:26   ` Konrad Rzeszutek Wilk
2012-05-25 20:26     ` Konrad Rzeszutek Wilk
2012-05-26 15:59     ` Andrea Arcangeli
2012-05-26 15:59       ` Andrea Arcangeli
2012-05-29 14:10       ` Konrad Rzeszutek Wilk
2012-05-29 14:10         ` Konrad Rzeszutek Wilk
2012-05-29 16:01         ` Andrea Arcangeli
2012-05-29 16:01           ` Andrea Arcangeli
2012-05-25 17:02 ` [PATCH 04/35] autonuma: define _PAGE_NUMA_PTE and _PAGE_NUMA_PMD Andrea Arcangeli
2012-05-25 17:02   ` Andrea Arcangeli
2012-05-30 18:22   ` Konrad Rzeszutek Wilk
2012-05-30 18:22     ` Konrad Rzeszutek Wilk
2012-05-30 18:34     ` Andrea Arcangeli
2012-05-30 18:34       ` Andrea Arcangeli
2012-05-30 20:01       ` Konrad Rzeszutek Wilk
2012-05-30 20:01         ` Konrad Rzeszutek Wilk
2012-06-05 17:13         ` Andrea Arcangeli
2012-06-05 17:13           ` Andrea Arcangeli
2012-06-05 17:17           ` Konrad Rzeszutek Wilk
2012-06-05 17:17             ` Konrad Rzeszutek Wilk
2012-06-05 17:40             ` Andrea Arcangeli
2012-06-05 17:40               ` Andrea Arcangeli
2012-05-25 17:02 ` [PATCH 05/35] autonuma: x86 pte_numa() and pmd_numa() Andrea Arcangeli
2012-05-25 17:02   ` Andrea Arcangeli
2012-05-25 17:02 ` [PATCH 06/35] autonuma: generic " Andrea Arcangeli
2012-05-25 17:02   ` Andrea Arcangeli
2012-05-30 20:23   ` Konrad Rzeszutek Wilk
2012-05-30 20:23     ` Konrad Rzeszutek Wilk
2012-05-25 17:02 ` [PATCH 07/35] autonuma: teach gup_fast about pte_numa Andrea Arcangeli
2012-05-25 17:02   ` Andrea Arcangeli
2012-05-25 17:02 ` [PATCH 08/35] autonuma: introduce kthread_bind_node() Andrea Arcangeli
2012-05-25 17:02   ` Andrea Arcangeli
2012-05-29 12:49   ` Peter Zijlstra
2012-05-29 12:49     ` Peter Zijlstra
2012-05-29 16:11     ` Andrea Arcangeli
2012-05-29 16:11       ` Andrea Arcangeli
2012-05-29 17:04       ` Peter Zijlstra
2012-05-29 17:04         ` Peter Zijlstra
2012-05-29 17:44         ` Andrea Arcangeli
2012-05-29 17:44           ` Andrea Arcangeli
2012-05-29 17:48           ` Peter Zijlstra
2012-05-29 17:48             ` Peter Zijlstra
2012-05-29 18:15             ` Andrea Arcangeli
2012-05-29 18:15               ` Andrea Arcangeli
2012-05-30 20:26   ` Konrad Rzeszutek Wilk
2012-05-30 20:26     ` Konrad Rzeszutek Wilk
2012-05-25 17:02 ` [PATCH 09/35] autonuma: mm_autonuma and sched_autonuma data structures Andrea Arcangeli
2012-05-25 17:02   ` Andrea Arcangeli
2012-05-25 17:02 ` [PATCH 10/35] autonuma: define the autonuma flags Andrea Arcangeli
2012-05-25 17:02   ` Andrea Arcangeli
2012-05-25 17:02 ` [PATCH 11/35] autonuma: core autonuma.h header Andrea Arcangeli
2012-05-25 17:02   ` Andrea Arcangeli
2012-05-25 17:02 ` [PATCH 12/35] autonuma: CPU follow memory algorithm Andrea Arcangeli
2012-05-25 17:02   ` Andrea Arcangeli
2012-05-29 13:00   ` Peter Zijlstra
2012-05-29 13:00     ` Peter Zijlstra
2012-05-29 13:54     ` Rik van Riel
2012-05-29 13:54       ` Rik van Riel
2012-05-29 13:10   ` Peter Zijlstra
2012-05-29 13:10     ` Peter Zijlstra
2012-06-22 17:36     ` Andrea Arcangeli
2012-06-22 17:36       ` Andrea Arcangeli
2012-05-25 17:02 ` [PATCH 13/35] autonuma: add page structure fields Andrea Arcangeli
2012-05-25 17:02   ` Andrea Arcangeli
2012-05-29 13:16   ` Peter Zijlstra
2012-05-29 13:16     ` Peter Zijlstra
2012-05-29 13:56     ` Rik van Riel
2012-05-29 13:56       ` Rik van Riel
2012-05-29 14:54       ` Peter Zijlstra
2012-05-29 14:54         ` Peter Zijlstra
2012-05-30  8:25         ` KOSAKI Motohiro
2012-05-30  8:25           ` KOSAKI Motohiro
2012-05-30  9:06           ` Peter Zijlstra
2012-05-30  9:06             ` Peter Zijlstra
2012-05-30  9:41             ` KOSAKI Motohiro
2012-05-30  9:41               ` KOSAKI Motohiro
2012-05-30  9:55               ` Peter Zijlstra
2012-05-30  9:55                 ` Peter Zijlstra
2012-05-30 13:49             ` Andrea Arcangeli
2012-05-30 13:49               ` Andrea Arcangeli
2012-05-31 18:18               ` Peter Zijlstra
2012-05-31 18:18                 ` Peter Zijlstra
2012-06-05 14:51                 ` Andrea Arcangeli
2012-06-05 14:51                   ` Andrea Arcangeli
2012-06-19 18:06                   ` Andrea Arcangeli
2012-06-19 18:06                     ` Andrea Arcangeli
2012-05-29 16:38     ` Andrea Arcangeli
2012-05-29 16:38       ` Andrea Arcangeli
2012-05-29 16:46       ` Rik van Riel
2012-05-29 16:46         ` Rik van Riel
2012-05-29 16:56         ` Peter Zijlstra
2012-05-29 16:56           ` Peter Zijlstra
2012-05-29 18:35           ` Andrea Arcangeli
2012-05-29 18:35             ` Andrea Arcangeli
2012-05-29 17:38       ` Linus Torvalds
2012-05-29 17:38         ` Linus Torvalds
2012-05-29 18:09         ` Andrea Arcangeli
2012-05-29 18:09           ` Andrea Arcangeli
2012-05-29 20:42         ` Rik van Riel
2012-05-29 20:42           ` Rik van Riel
2012-05-25 17:02 ` [PATCH 14/35] autonuma: knuma_migrated per NUMA node queues Andrea Arcangeli
2012-05-25 17:02   ` Andrea Arcangeli
2012-05-29 13:51   ` Peter Zijlstra
2012-05-29 13:51     ` Peter Zijlstra
2012-05-30  0:14     ` Andrea Arcangeli
2012-05-30  0:14       ` Andrea Arcangeli
2012-05-30 18:19       ` Andrea Arcangeli
2012-05-30 18:19         ` Andrea Arcangeli
2012-05-25 17:02 ` [PATCH 15/35] autonuma: init knuma_migrated queues Andrea Arcangeli
2012-05-25 17:02   ` Andrea Arcangeli
2012-05-25 17:02 ` [PATCH 16/35] autonuma: autonuma_enter/exit Andrea Arcangeli
2012-05-25 17:02   ` Andrea Arcangeli
2012-05-25 17:02 ` [PATCH 17/35] autonuma: call autonuma_setup_new_exec() Andrea Arcangeli
2012-05-25 17:02   ` Andrea Arcangeli
2012-05-25 17:02 ` [PATCH 18/35] autonuma: alloc/free/init sched_autonuma Andrea Arcangeli
2012-05-25 17:02   ` Andrea Arcangeli
2012-05-30 20:55   ` Konrad Rzeszutek Wilk
2012-05-30 20:55     ` Konrad Rzeszutek Wilk
2012-05-25 17:02 ` [PATCH 19/35] autonuma: alloc/free/init mm_autonuma Andrea Arcangeli
2012-05-25 17:02   ` Andrea Arcangeli
2012-05-25 17:02 ` [PATCH 20/35] autonuma: avoid CFS select_task_rq_fair to return -1 Andrea Arcangeli
2012-05-25 17:02   ` Andrea Arcangeli
2012-05-29 14:02   ` Peter Zijlstra
2012-05-29 14:02     ` Peter Zijlstra
2012-05-25 17:02 ` [PATCH 21/35] autonuma: teach CFS about autonuma affinity Andrea Arcangeli
2012-05-25 17:02   ` Andrea Arcangeli
2012-05-29 16:05   ` Peter Zijlstra
2012-05-29 16:05     ` Peter Zijlstra
2012-05-25 17:02 ` [PATCH 22/35] autonuma: sched_set_autonuma_need_balance Andrea Arcangeli
2012-05-25 17:02   ` Andrea Arcangeli
2012-05-29 16:12   ` Peter Zijlstra
2012-05-29 16:12     ` Peter Zijlstra
2012-05-29 17:33     ` Andrea Arcangeli
2012-05-29 17:33       ` Andrea Arcangeli
2012-05-29 17:43       ` Peter Zijlstra
2012-05-29 17:43         ` Peter Zijlstra
2012-05-29 18:24         ` Andrea Arcangeli
2012-05-29 18:24           ` Andrea Arcangeli
2012-05-29 22:21       ` Peter Zijlstra
2012-05-29 22:21         ` Peter Zijlstra
2012-05-25 17:02 ` [PATCH 23/35] autonuma: core Andrea Arcangeli
2012-05-25 17:02   ` Andrea Arcangeli
2012-05-29 11:45   ` Kirill A. Shutemov
2012-05-29 11:45     ` Kirill A. Shutemov
2012-05-30  0:03     ` Andrea Arcangeli
2012-05-30  0:03       ` Andrea Arcangeli
2012-05-29 16:27   ` Peter Zijlstra
2012-05-29 16:27     ` Peter Zijlstra
2012-05-25 17:02 ` [PATCH 24/35] autonuma: follow_page check for pte_numa/pmd_numa Andrea Arcangeli
2012-05-25 17:02   ` Andrea Arcangeli
2012-05-25 17:02 ` [PATCH 25/35] autonuma: default mempolicy follow AutoNUMA Andrea Arcangeli
2012-05-25 17:02   ` Andrea Arcangeli
2012-05-25 17:02 ` [PATCH 26/35] autonuma: call autonuma_split_huge_page() Andrea Arcangeli
2012-05-25 17:02   ` Andrea Arcangeli
2012-05-25 17:02 ` [PATCH 27/35] autonuma: make khugepaged pte_numa aware Andrea Arcangeli
2012-05-25 17:02   ` Andrea Arcangeli
2012-05-25 17:02 ` [PATCH 28/35] autonuma: retain page last_nid information in khugepaged Andrea Arcangeli
2012-05-25 17:02   ` Andrea Arcangeli
2012-05-25 17:02 ` [PATCH 29/35] autonuma: numa hinting page faults entry points Andrea Arcangeli
2012-05-25 17:02   ` Andrea Arcangeli
2012-05-25 17:02 ` [PATCH 30/35] autonuma: reset autonuma page data when pages are freed Andrea Arcangeli
2012-05-25 17:02   ` Andrea Arcangeli
2012-05-29 16:30   ` Peter Zijlstra
2012-05-29 16:30     ` Peter Zijlstra
2012-05-29 16:49     ` Andrea Arcangeli
2012-05-29 16:49       ` Andrea Arcangeli
2012-05-25 17:02 ` [PATCH 31/35] autonuma: initialize page structure fields Andrea Arcangeli
2012-05-25 17:02   ` Andrea Arcangeli
2012-05-25 17:02 ` [PATCH 32/35] autonuma: link mm/autonuma.o and kernel/sched/numa.o Andrea Arcangeli
2012-05-25 17:02   ` Andrea Arcangeli
2012-05-25 17:02 ` [PATCH 33/35] autonuma: add CONFIG_AUTONUMA and CONFIG_AUTONUMA_DEFAULT_ENABLED Andrea Arcangeli
2012-05-25 17:02   ` Andrea Arcangeli
2012-05-25 17:02 ` [PATCH 34/35] autonuma: boost khugepaged scanning rate Andrea Arcangeli
2012-05-25 17:02   ` Andrea Arcangeli
2012-05-25 17:02 ` [PATCH 35/35] autonuma: page_autonuma Andrea Arcangeli
2012-05-25 17:02   ` Andrea Arcangeli
2012-05-29 16:44   ` Peter Zijlstra
2012-05-29 16:44     ` Peter Zijlstra
2012-05-29 17:14     ` Andrea Arcangeli
2012-05-29 17:14       ` Andrea Arcangeli
2012-05-26 17:28 ` [PATCH 00/35] AutoNUMA alpha14 Rik van Riel
2012-05-26 17:28   ` Rik van Riel
2012-05-26 20:42   ` Linus Torvalds
2012-05-26 20:42     ` Linus Torvalds
2012-05-29 15:53     ` Christoph Lameter
2012-05-29 15:53       ` Christoph Lameter
2012-05-29 16:08       ` Andrea Arcangeli
2012-05-29 16:08         ` Andrea Arcangeli
2012-05-30 14:46     ` Peter Zijlstra
2012-05-30 14:46       ` Peter Zijlstra
2012-05-30 15:30       ` Ingo Molnar
2012-05-30 15:30         ` Ingo Molnar
2012-05-29 13:36 ` Kirill A. Shutemov
2012-05-29 13:36   ` Kirill A. Shutemov
2012-05-29 15:43   ` Petr Holasek
2012-05-29 15:43     ` Petr Holasek
2012-05-31 18:08     ` AutoNUMA15 Andrea Arcangeli
2012-05-31 18:08       ` AutoNUMA15 Andrea Arcangeli
2012-05-31 20:01       ` AutoNUMA15 Don Morris
2012-05-31 22:54         ` AutoNUMA15 Andrea Arcangeli
2012-06-01  0:04           ` AutoNUMA15 Andrea Arcangeli
2012-05-31 18:52             ` AutoNUMA15 Don Morris
2012-06-07  2:30       ` AutoNUMA15 Zhouping Liu
2012-06-07  2:30         ` AutoNUMA15 Zhouping Liu
2012-06-07 11:44         ` AutoNUMA15 Hillf Danton
2012-06-07 13:30           ` AutoNUMA15 Andrea Arcangeli
2012-06-07 14:08           ` AutoNUMA15 Zhouping Liu
2012-06-07 19:37             ` AutoNUMA15 Andrea Arcangeli
2012-06-08  6:09               ` AutoNUMA15 Zhouping Liu
2012-06-08 13:04                 ` AutoNUMA15 Hillf Danton
2012-06-08 13:32               ` AutoNUMA15 Peter Zijlstra
2012-06-08 16:31                 ` AutoNUMA15 Zhouping Liu
2012-06-08 13:43         ` AutoNUMA15 Chen
2012-06-21  7:29       ` AutoNUMA15 Alex Shi
2012-06-21  7:29         ` AutoNUMA15 Alex Shi
2012-06-21 14:55         ` AutoNUMA15 Andrea Arcangeli
2012-06-21 14:55           ` AutoNUMA15 Andrea Arcangeli
2012-06-26  7:52           ` AutoNUMA15 Alex Shi
2012-06-26  7:52             ` AutoNUMA15 Alex Shi
2012-06-26 12:03             ` AutoNUMA15 Andrea Arcangeli
2012-07-12  2:36               ` AutoNUMA15 Alex Shi
2012-07-12  2:36                 ` AutoNUMA15 Alex Shi
2012-05-29 17:15   ` [PATCH 00/35] AutoNUMA alpha14 Andrea Arcangeli
2012-05-29 17:15     ` Andrea Arcangeli
2012-06-01 22:41 ` Mauricio Faria de Oliveira
2012-06-01 22:41   ` Mauricio Faria de Oliveira
2012-06-22 17:57   ` Andrea Arcangeli
2012-06-22 17:57     ` Andrea Arcangeli

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.