* [PATCH 00/40] AutoNUMA19
From: Andrea Arcangeli @ 2012-06-28 12:55 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Hello everyone,

It's time for a new AutoNUMA19 release.

The objective of AutoNUMA is to perform as close as possible to (and
sometimes faster than) hard NUMA CPU/memory binding setups, without
requiring the administrator to manually set up any hard NUMA bindings.

https://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma_bench-20120530.pdf
(NOTE: the TODO slide is obsolete)

git clone --reference linux -b autonuma19 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git autonuma19

Development autonuma branch:

git clone --reference linux -b autonuma git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git

To update:

git fetch
git checkout -f origin/autonuma

Changelog from AutoNUMA-alpha14 to AutoNUMA19:

o the sched_autonuma_balance callout was removed from schedule(); it now runs
  in the softirq along with the CFS load balancing

o lots of documentation about the math in the sched_autonuma_balance algorithm

o fixed a bug in the fast path detection in sched_autonuma_balance that could
  decrease performance with many nodes

o reduced the page_autonuma memory overhead from 32 to 12 bytes per page
  (rough sizing figures follow this changelog)

o fixed a crash in __pmd_numa_fixup

o knuma_scand won't scan VM_MIXEDMAP|PFNMAP vmas (it never touched those ptes
  anyway)

o fixed a crash in autonuma_exit

o fixed a crash in knuma_migratedN when split_huge_page returns 0, as the
  page has already been freed

o assorted cleanups and probably more
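
As a rough, illustrative sizing of the overhead reduction above (assuming
4KB pages): 1GB of RAM holds 262144 pages, so the per-page page_autonuma
struct costs about:

  32 bytes/page * 262144 pages/GB = 8MB per GB of RAM (~0.78%)
  12 bytes/page * 262144 pages/GB = 3MB per GB of RAM (~0.29%)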

Changelog from alpha13 to alpha14:

o page_autonuma introduction: no memory is wasted if the kernel is booted
  on non-NUMA hardware. Tested with flatmem/sparsemem on x86 with
  autonuma=y/n and sparsemem/vsparsemem on x86_64 with autonuma=y/n.
  The "noautonuma" kernel param disables autonuma permanently, even when
  booted on NUMA hardware (no /sys/kernel/mm/autonuma, and no
  page_autonuma allocations, like cgroup_disable=memory)

o autonuma_balance only runs along with run_rebalance_domains, to
  avoid altering the usual scheduler runtime. autonuma_balance gives a
  "kick" to the scheduler after a rebalance (it overrides the load
  balance activity if needed). It has not yet been tested on specjbb or
  other scheduling-intensive benchmarks; hopefully there's no NUMA
  regression. For compute-intensive loads not involving a flood of
  scheduling activity this doesn't show any performance regression,
  and it avoids altering the stock scheduler's performance. It goes in
  the direction of being less intrusive with the stock scheduler
  runtime.

  Note: autonuma_balance still runs from normal context (not softirq
  context like run_rebalance_domains) to be able to wait on process
  migration (avoid _nowait), but most of the time it does nothing at
  all.

Changelog from alpha11 to alpha13:

o autonuma_balance optimization (take the fast path when the process is in
  the preferred NUMA node)

TODO:

o THP native migration (orthogonal and also needed for
  cpuset/migrate_pages(2)/numa/sched).

o port to ppc64 (Ben?). Any arch able to support PROT_NONE can also support
  AutoNUMA; in short, all archs should work fine with AutoNUMA.

Andrea Arcangeli (40):
  mm: add unlikely to the mm allocation failure check
  autonuma: make set_pmd_at always available
  autonuma: export is_vma_temporary_stack() even if
    CONFIG_TRANSPARENT_HUGEPAGE=n
  xen: document Xen is using an unused bit for the pagetables
  autonuma: define _PAGE_NUMA_PTE and _PAGE_NUMA_PMD
  autonuma: x86 pte_numa() and pmd_numa()
  autonuma: generic pte_numa() and pmd_numa()
  autonuma: teach gup_fast about pte_numa
  autonuma: introduce kthread_bind_node()
  autonuma: mm_autonuma and sched_autonuma data structures
  autonuma: define the autonuma flags
  autonuma: core autonuma.h header
  autonuma: CPU follow memory algorithm
  autonuma: add page structure fields
  autonuma: knuma_migrated per NUMA node queues
  autonuma: init knuma_migrated queues
  autonuma: autonuma_enter/exit
  autonuma: call autonuma_setup_new_exec()
  autonuma: alloc/free/init sched_autonuma
  autonuma: alloc/free/init mm_autonuma
  autonuma: avoid CFS select_task_rq_fair to return -1
  autonuma: teach CFS about autonuma affinity
  autonuma: sched_set_autonuma_need_balance
  autonuma: core
  autonuma: follow_page check for pte_numa/pmd_numa
  autonuma: default mempolicy follow AutoNUMA
  autonuma: call autonuma_split_huge_page()
  autonuma: make khugepaged pte_numa aware
  autonuma: retain page last_nid information in khugepaged
  autonuma: numa hinting page faults entry points
  autonuma: reset autonuma page data when pages are freed
  autonuma: initialize page structure fields
  autonuma: link mm/autonuma.o and kernel/sched/numa.o
  autonuma: add CONFIG_AUTONUMA and CONFIG_AUTONUMA_DEFAULT_ENABLED
  autonuma: boost khugepaged scanning rate
  autonuma: page_autonuma
  autonuma: page_autonuma change #include for sparse
  autonuma: autonuma_migrate_head[0] dynamic size
  autonuma: bugcheck page_autonuma fields on newly allocated pages
  autonuma: shrink the per-page page_autonuma struct size

 arch/x86/include/asm/paravirt.h      |    2 -
 arch/x86/include/asm/pgtable.h       |   51 ++-
 arch/x86/include/asm/pgtable_types.h |   22 +-
 arch/x86/mm/gup.c                    |    2 +-
 arch/x86/mm/numa.c                   |    6 +-
 arch/x86/mm/numa_32.c                |    3 +-
 fs/exec.c                            |    3 +
 include/asm-generic/pgtable.h        |   12 +
 include/linux/autonuma.h             |   64 ++
 include/linux/autonuma_flags.h       |   68 ++
 include/linux/autonuma_list.h        |   94 ++
 include/linux/autonuma_sched.h       |   50 ++
 include/linux/autonuma_types.h       |  130 +++
 include/linux/huge_mm.h              |    6 +-
 include/linux/kthread.h              |    1 +
 include/linux/memory_hotplug.h       |    3 +-
 include/linux/mm_types.h             |    5 +
 include/linux/mmzone.h               |   25 +
 include/linux/page_autonuma.h        |   59 ++
 include/linux/sched.h                |    5 +-
 init/main.c                          |    2 +
 kernel/fork.c                        |   36 +-
 kernel/kthread.c                     |   23 +
 kernel/sched/Makefile                |    1 +
 kernel/sched/core.c                  |    1 +
 kernel/sched/fair.c                  |   72 ++-
 kernel/sched/numa.c                  |  586 +++++++++++++
 kernel/sched/sched.h                 |   18 +
 mm/Kconfig                           |   13 +
 mm/Makefile                          |    1 +
 mm/autonuma.c                        | 1549 ++++++++++++++++++++++++++++++++++
 mm/autonuma_list.c                   |  167 ++++
 mm/huge_memory.c                     |   59 ++-
 mm/memory.c                          |   35 +-
 mm/memory_hotplug.c                  |    2 +-
 mm/mempolicy.c                       |   15 +-
 mm/mmu_context.c                     |    2 +
 mm/page_alloc.c                      |    5 +
 mm/page_autonuma.c                   |  236 ++++++
 mm/sparse.c                          |  126 +++-
 40 files changed, 3512 insertions(+), 48 deletions(-)
 create mode 100644 include/linux/autonuma.h
 create mode 100644 include/linux/autonuma_flags.h
 create mode 100644 include/linux/autonuma_list.h
 create mode 100644 include/linux/autonuma_sched.h
 create mode 100644 include/linux/autonuma_types.h
 create mode 100644 include/linux/page_autonuma.h
 create mode 100644 kernel/sched/numa.c
 create mode 100644 mm/autonuma.c
 create mode 100644 mm/autonuma_list.c
 create mode 100644 mm/page_autonuma.c


* [PATCH 01/40] mm: add unlikely to the mm allocation failure check
From: Andrea Arcangeli @ 2012-06-28 12:55 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Very minor optimization to hint gcc.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/fork.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index ab5211b..5fcfa70 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -572,7 +572,7 @@ struct mm_struct *mm_alloc(void)
 	struct mm_struct *mm;
 
 	mm = allocate_mm();
-	if (!mm)
+	if (unlikely(!mm))
 		return NULL;
 
 	memset(mm, 0, sizeof(*mm));
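
For reference, unlikely() is the kernel's standard branch-prediction hint;
it tells gcc the condition is expected to be false, so the failure path can
be laid out out of line. A simplified form of the definition in
<linux/compiler.h>:

	#define likely(x)	__builtin_expect(!!(x), 1)
	#define unlikely(x)	__builtin_expect(!!(x), 0)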

* [PATCH 02/40] autonuma: make set_pmd_at always available
From: Andrea Arcangeli @ 2012-06-28 12:55 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

set_pmd_at() will also be used for the knuma_scand/pmd = 1 (default)
mode even when TRANSPARENT_HUGEPAGE=n. Make it available so the build
won't fail.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/include/asm/paravirt.h |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 6cbbabf..e99fb37 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -564,7 +564,6 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
 		PVOP_VCALL4(pv_mmu_ops.set_pte_at, mm, addr, ptep, pte.pte);
 }
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 			      pmd_t *pmdp, pmd_t pmd)
 {
@@ -575,7 +574,6 @@ static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 		PVOP_VCALL4(pv_mmu_ops.set_pmd_at, mm, addr, pmdp,
 			    native_pmd_val(pmd));
 }
-#endif
 
 static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
 {

* [PATCH 03/40] autonuma: export is_vma_temporary_stack() even if CONFIG_TRANSPARENT_HUGEPAGE=n
From: Andrea Arcangeli @ 2012-06-28 12:55 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

is_vma_temporary_stack() is needed by mm/autonuma.c too, and without
this the build breaks with CONFIG_TRANSPARENT_HUGEPAGE=n.

Reported-by: Petr Holasek <pholasek@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/huge_mm.h |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4c59b11..ad4e2e0 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -54,13 +54,13 @@ extern pmd_t *page_check_address_pmd(struct page *page,
 #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
 #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
 
+extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define HPAGE_PMD_SHIFT HPAGE_SHIFT
 #define HPAGE_PMD_MASK HPAGE_MASK
 #define HPAGE_PMD_SIZE HPAGE_SIZE
 
-extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
-
 #define transparent_hugepage_enabled(__vma)				\
 	((transparent_hugepage_flags &					\
 	  (1<<TRANSPARENT_HUGEPAGE_FLAG) ||				\

* [PATCH 04/40] xen: document Xen is using an unused bit for the pagetables
From: Andrea Arcangeli @ 2012-06-28 12:55 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Xen has taken over the last reserved bit available for the pagetables,
which is set through ioremap. This documents it and makes the code
more readable.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/include/asm/pgtable_types.h |   11 +++++++++--
 1 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 013286a..b74cac9 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -17,7 +17,7 @@
 #define _PAGE_BIT_PAT		7	/* on 4KB pages */
 #define _PAGE_BIT_GLOBAL	8	/* Global TLB entry PPro+ */
 #define _PAGE_BIT_UNUSED1	9	/* available for programmer */
-#define _PAGE_BIT_IOMAP		10	/* flag used to indicate IO mapping */
+#define _PAGE_BIT_UNUSED2	10
 #define _PAGE_BIT_HIDDEN	11	/* hidden by kmemcheck */
 #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
 #define _PAGE_BIT_SPECIAL	_PAGE_BIT_UNUSED1
@@ -41,7 +41,7 @@
 #define _PAGE_PSE	(_AT(pteval_t, 1) << _PAGE_BIT_PSE)
 #define _PAGE_GLOBAL	(_AT(pteval_t, 1) << _PAGE_BIT_GLOBAL)
 #define _PAGE_UNUSED1	(_AT(pteval_t, 1) << _PAGE_BIT_UNUSED1)
-#define _PAGE_IOMAP	(_AT(pteval_t, 1) << _PAGE_BIT_IOMAP)
+#define _PAGE_UNUSED2	(_AT(pteval_t, 1) << _PAGE_BIT_UNUSED2)
 #define _PAGE_PAT	(_AT(pteval_t, 1) << _PAGE_BIT_PAT)
 #define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
 #define _PAGE_SPECIAL	(_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
@@ -49,6 +49,13 @@
 #define _PAGE_SPLITTING	(_AT(pteval_t, 1) << _PAGE_BIT_SPLITTING)
 #define __HAVE_ARCH_PTE_SPECIAL
 
+/* flag used to indicate IO mapping */
+#ifdef CONFIG_XEN
+#define _PAGE_IOMAP	(_AT(pteval_t, 1) << _PAGE_BIT_UNUSED2)
+#else
+#define _PAGE_IOMAP	(_AT(pteval_t, 0))
+#endif
+
 #ifdef CONFIG_KMEMCHECK
 #define _PAGE_HIDDEN	(_AT(pteval_t, 1) << _PAGE_BIT_HIDDEN)
 #else

* [PATCH 05/40] autonuma: define _PAGE_NUMA_PTE and _PAGE_NUMA_PMD
From: Andrea Arcangeli @ 2012-06-28 12:55 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

We will set these bitflags only when the pmd or pte is non-present.

They work like PROT_NONE but they identify a request for the numa
hinting page fault to trigger.

Because we want to be able to set these bitflags in any established pte
or pmd (while clearing the present bit at the same time) without
losing information, these bitflags must never be set when the pte or
pmd is present.

For _PAGE_NUMA_PTE the pte bitflag used is _PAGE_PSE, which cannot
otherwise be set on ptes, and it also fits in between _PAGE_FILE and
_PAGE_PROTNONE, which avoids having to alter the swp entry format.

For _PAGE_NUMA_PMD, we use a reserved bitflag. pmds never contain
swap entries, but if in the future we swap out transparent hugepages, we
must keep in mind not to use the _PAGE_UNUSED2 bitflag in the swap
entry format and to start the swap entry offset above it.

_PAGE_UNUSED2 is used by Xen, but only on ptes established by ioremap;
it's never used on pmds, so there's no risk of collision with Xen.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/include/asm/pgtable_types.h |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index b74cac9..6e2d954 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -71,6 +71,17 @@
 #define _PAGE_FILE	(_AT(pteval_t, 1) << _PAGE_BIT_FILE)
 #define _PAGE_PROTNONE	(_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
 
+/*
+ * Cannot be set on pte. The fact it's in between _PAGE_FILE and
+ * _PAGE_PROTNONE avoids having to alter the swp entries.
+ */
+#define _PAGE_NUMA_PTE	_PAGE_PSE
+/*
+ * Cannot be set on pmd, if transparent hugepages will be swapped out
+ * the swap entry offset must start above it.
+ */
+#define _PAGE_NUMA_PMD	_PAGE_UNUSED2
+
 #define _PAGE_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER |	\
 			 _PAGE_ACCESSED | _PAGE_DIRTY)
 #define _KERNPG_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED |	\

* [PATCH 06/40] autonuma: x86 pte_numa() and pmd_numa()
From: Andrea Arcangeli @ 2012-06-28 12:55 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Implement pte_numa and pmd_numa and related methods on x86 arch.

We must atomically set the numa bit and clear the present bit to
define a pte_numa or pmd_numa.

Whenever a pte or pmd is set as pte_numa or pmd_numa, the first time a
thread touches that virtual address a NUMA hinting page fault will
trigger. The NUMA hinting page fault will simply clear the NUMA bit
and set the present bit again to resolve the page fault.

NUMA hinting page faults are used:

1) to fill in the per-thread NUMA statistics stored for each thread in
   a current->sched_autonuma data structure

2) to track the per-node last_nid information in the page structure to
   detect false sharing

3) to queue the page mapped by the pte_numa or pmd_numa for async
   migration if there have been enough NUMA hinting page faults on the
   page coming from remote CPUs

NUMA hinting page faults don't do anything except collect
information and possibly add pages to migration queues. They're
extremely quick and absolutely non-blocking. They don't allocate any
memory either.

The only "input" information of the AutoNUMA algorithm that isn't
collected through NUMA hinting page faults are the per-process
(per-thread not) mm->mm_autonuma statistics. Those mm_autonuma
statistics are collected by the knuma_scand pmd/pte scans that are
also responsible for setting the pte_numa/pmd_numa to activate the
NUMA hinting page faults.

knuma_scand -> NUMA hinting page faults
  |                       |
 \|/                     \|/
mm_autonuma  <->  sched_autonuma (CPU follow memory, this is mm_autonuma too)
                  page last_nid  (false thread sharing/thread shared memory detection)
                  queue or cancel page migration (memory follow CPU)

After pages are queued, there is one knuma_migratedN daemon per NUMA
node that takes care of migrating the pages at a perfectly steady
rate, in parallel from all nodes and in round robin across all incoming
nodes going to the same destination node. This keeps all memory channels
in large boxes active at the same time, avoids hammering a single
memory channel for too long, and minimizes memory-bus migration latency
effects.

Once pages are queued for async migration by knuma_migratedN, their
migration can still be canceled before they're actually migrated, if
false sharing is later detected.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/include/asm/pgtable.h |   51 +++++++++++++++++++++++++++++++++++++--
 1 files changed, 48 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 49afb3f..7514fa6 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -109,7 +109,7 @@ static inline int pte_write(pte_t pte)
 
 static inline int pte_file(pte_t pte)
 {
-	return pte_flags(pte) & _PAGE_FILE;
+	return (pte_flags(pte) & _PAGE_FILE) == _PAGE_FILE;
 }
 
 static inline int pte_huge(pte_t pte)
@@ -405,7 +405,9 @@ static inline int pte_same(pte_t a, pte_t b)
 
 static inline int pte_present(pte_t a)
 {
-	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
+	/* _PAGE_NUMA includes _PAGE_PROTNONE */
+	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
+			       _PAGE_NUMA_PTE);
 }
 
 static inline int pte_hidden(pte_t pte)
@@ -415,7 +417,46 @@ static inline int pte_hidden(pte_t pte)
 
 static inline int pmd_present(pmd_t pmd)
 {
-	return pmd_flags(pmd) & _PAGE_PRESENT;
+	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE |
+				 _PAGE_NUMA_PMD);
+}
+
+#ifdef CONFIG_AUTONUMA
+static inline int pte_numa(pte_t pte)
+{
+	return (pte_flags(pte) &
+		(_PAGE_NUMA_PTE|_PAGE_PRESENT)) == _PAGE_NUMA_PTE;
+}
+
+static inline int pmd_numa(pmd_t pmd)
+{
+	return (pmd_flags(pmd) &
+		(_PAGE_NUMA_PMD|_PAGE_PRESENT)) == _PAGE_NUMA_PMD;
+}
+#endif
+
+static inline pte_t pte_mknotnuma(pte_t pte)
+{
+	pte = pte_clear_flags(pte, _PAGE_NUMA_PTE);
+	return pte_set_flags(pte, _PAGE_PRESENT|_PAGE_ACCESSED);
+}
+
+static inline pmd_t pmd_mknotnuma(pmd_t pmd)
+{
+	pmd = pmd_clear_flags(pmd, _PAGE_NUMA_PMD);
+	return pmd_set_flags(pmd, _PAGE_PRESENT|_PAGE_ACCESSED);
+}
+
+static inline pte_t pte_mknuma(pte_t pte)
+{
+	pte = pte_set_flags(pte, _PAGE_NUMA_PTE);
+	return pte_clear_flags(pte, _PAGE_PRESENT);
+}
+
+static inline pmd_t pmd_mknuma(pmd_t pmd)
+{
+	pmd = pmd_set_flags(pmd, _PAGE_NUMA_PMD);
+	return pmd_clear_flags(pmd, _PAGE_PRESENT);
 }
 
 static inline int pmd_none(pmd_t pmd)
@@ -474,6 +515,10 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 
 static inline int pmd_bad(pmd_t pmd)
 {
+#ifdef CONFIG_AUTONUMA
+	if (pmd_numa(pmd))
+		return 0;
+#endif
 	return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
 }
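
As a minimal sketch of how a NUMA hinting fault on a pte could be resolved
with the helpers above (the function name, the locking shown and the
"collect stats / queue migration" placeholder are illustrative assumptions,
not the actual AutoNUMA fault path):

	/* illustrative only: make a pte_numa pte present again */
	static void numa_hinting_fault_sketch(struct mm_struct *mm,
					      unsigned long addr,
					      pte_t *ptep, pmd_t *pmd)
	{
		spinlock_t *ptl = pte_lockptr(mm, pmd);
		pte_t pte;

		spin_lock(ptl);
		pte = *ptep;
		if (pte_numa(pte)) {
			/* clear the NUMA bit, set _PAGE_PRESENT again */
			set_pte_at(mm, addr, ptep, pte_mknotnuma(pte));
			/* the real code would update sched_autonuma stats,
			 * the page last_nid and the migration queues here */
		}
		spin_unlock(ptl);
	}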
 

* [PATCH 07/40] autonuma: generic pte_numa() and pmd_numa()
From: Andrea Arcangeli @ 2012-06-28 12:55 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Implement the generic version of these methods. They're used when
CONFIG_AUTONUMA=n, and they're a noop.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/asm-generic/pgtable.h |   12 ++++++++++++
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index ff4947b..0ff87ec 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -530,6 +530,18 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
 #endif
 }
 
+#ifndef CONFIG_AUTONUMA
+static inline int pte_numa(pte_t pte)
+{
+	return 0;
+}
+
+static inline int pmd_numa(pmd_t pmd)
+{
+	return 0;
+}
+#endif /* CONFIG_AUTONUMA */
+
 #endif /* CONFIG_MMU */
 
 #endif /* !__ASSEMBLY__ */

* [PATCH 08/40] autonuma: teach gup_fast about pte_numa
From: Andrea Arcangeli @ 2012-06-28 12:55 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

gup_fast will skip over non-present ptes (pte_numa requires the pte to
be non-present), so no explicit check is needed for pte_numa in the
pte case.

gup_fast will also automatically skip over THP when the trans huge pmd
is non-present (pmd_numa requires the pmd to be non-present).

But for the special pmd mode scan of knuma_scand
(/sys/kernel/mm/autonuma/knuma_scand/pmd == 1), the pmd may be of numa
type (so non-present too) while the pte may be present. gup_pte_range
wouldn't notice the pmd is of numa type. So to avoid losing a NUMA
hinting page fault with gup_fast, we need an explicit check for
pmd_numa() here to be sure it will fault through gup ->
handle_mm_fault.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/mm/gup.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index dd74e46..bf36575 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -164,7 +164,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 		 * wait_split_huge_page() would never return as the
 		 * tlb flush IPI wouldn't run.
 		 */
-		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+		if (pmd_none(pmd) || pmd_trans_splitting(pmd) || pmd_numa(pmd))
 			return 0;
 		if (unlikely(pmd_large(pmd))) {
 			if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))

* [PATCH 09/40] autonuma: introduce kthread_bind_node()
From: Andrea Arcangeli @ 2012-06-28 12:55 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

This function makes it easy to bind the per-node knuma_migrated
threads to their respective NUMA nodes. Those threads take memory from
the other nodes (in round robin, with an incoming queue for each
remote node) and move that memory to their local node.
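
As an illustration only (a hedged sketch; the real knuma_migrated
setup is introduced later in the series and may differ), a per-node
daemon would typically be created unbound and then restricted to its
node's CPUs before being woken up:

	int nid;

	for_each_online_node(nid) {
		struct task_struct *t;

		t = kthread_create_on_node(knuma_migrated, NODE_DATA(nid),
					   nid, "knuma_migrated%d", nid);
		if (IS_ERR(t))
			continue;
		/* restrict the daemon to the CPUs of its own node */
		kthread_bind_node(t, nid);
		wake_up_process(t);
	}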

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/kthread.h |    1 +
 include/linux/sched.h   |    2 +-
 kernel/kthread.c        |   23 +++++++++++++++++++++++
 3 files changed, 25 insertions(+), 1 deletions(-)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 0714b24..e733f97 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -33,6 +33,7 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
 })
 
 void kthread_bind(struct task_struct *k, unsigned int cpu);
+void kthread_bind_node(struct task_struct *p, int nid);
 int kthread_stop(struct task_struct *k);
 int kthread_should_stop(void);
 bool kthread_freezable_should_stop(bool *was_frozen);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4059c0f..699324c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1792,7 +1792,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
 #define PF_SWAPWRITE	0x00800000	/* Allowed to write to swap */
 #define PF_SPREAD_PAGE	0x01000000	/* Spread page cache over cpuset */
 #define PF_SPREAD_SLAB	0x02000000	/* Spread some slab caches over cpuset */
-#define PF_THREAD_BOUND	0x04000000	/* Thread bound to specific cpu */
+#define PF_THREAD_BOUND	0x04000000	/* Thread bound to specific cpus */
 #define PF_MCE_EARLY    0x08000000      /* Early kill for mce process policy */
 #define PF_MEMPOLICY	0x10000000	/* Non-default NUMA mempolicy */
 #define PF_MUTEX_TESTER	0x20000000	/* Thread belongs to the rt mutex tester */
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 3d3de63..48b36f9 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -234,6 +234,29 @@ void kthread_bind(struct task_struct *p, unsigned int cpu)
 EXPORT_SYMBOL(kthread_bind);
 
 /**
+ * kthread_bind_node - bind a just-created kthread to the CPUs of a node.
+ * @p: thread created by kthread_create().
+ * @nid: node (might not be online, must be possible) for @p to run on.
+ *
+ * Description: This function is equivalent to set_cpus_allowed(),
+ * except that @nid doesn't need to be online, and the thread must be
+ * stopped (i.e., just returned from kthread_create()).
+ */
+void kthread_bind_node(struct task_struct *p, int nid)
+{
+	/* Must have done schedule() in kthread() before we set_task_cpu */
+	if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
+		WARN_ON(1);
+		return;
+	}
+
+	/* It's safe because the task is inactive. */
+	do_set_cpus_allowed(p, cpumask_of_node(nid));
+	p->flags |= PF_THREAD_BOUND;
+}
+EXPORT_SYMBOL(kthread_bind_node);
+
+/**
  * kthread_stop - stop a thread created by kthread_create().
  * @k: thread created by kthread_create().
  *

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH 10/40] autonuma: mm_autonuma and sched_autonuma data structures
  2012-06-28 12:55 ` Andrea Arcangeli
@ 2012-06-28 12:55   ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 12:55 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Define the two data structures that collect the per-process (in the
mm) and per-thread (in the task_struct) statistical information that
is the input of the CPU follow memory algorithm in the NUMA
scheduler.
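
Both structures end with a variable-size per-node array, so the
allocation size depends on the number of nodes. A hedged sketch of the
idea only (the real alloc_mm_autonuma()/alloc_task_autonuma() are
added later in the series and may differ, e.g. by zeroing only from
the documented field onwards):

	struct mm_autonuma *mma;

	mma = kzalloc(sizeof(*mma) +
		      nr_node_ids * sizeof(unsigned long), GFP_KERNEL);
	if (!mma)
		return -ENOMEM;
	mma->mm = mm;
	mm->mm_autonuma = mma;
	return 0;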

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma_types.h |   68 ++++++++++++++++++++++++++++++++++++++++
 1 files changed, 68 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/autonuma_types.h

diff --git a/include/linux/autonuma_types.h b/include/linux/autonuma_types.h
new file mode 100644
index 0000000..9e697e3
--- /dev/null
+++ b/include/linux/autonuma_types.h
@@ -0,0 +1,68 @@
+#ifndef _LINUX_AUTONUMA_TYPES_H
+#define _LINUX_AUTONUMA_TYPES_H
+
+#ifdef CONFIG_AUTONUMA
+
+#include <linux/numa.h>
+
+/*
+ * Per-mm (process) structure dynamically allocated only if autonuma
+ * is not impossible. This links the mm to scan into the
+ * knuma_scand.mm_head and it contains the NUMA memory placement
+ * statistics for the process (generated by knuma_scand).
+ */
+struct mm_autonuma {
+	/* list node to link the "mm" into the knuma_scand.mm_head */
+	struct list_head mm_node;
+	struct mm_struct *mm;
+	unsigned long mm_numa_fault_pass; /* zeroed from here during allocation */
+	unsigned long mm_numa_fault_tot;
+	unsigned long mm_numa_fault[0];
+};
+
+extern int alloc_mm_autonuma(struct mm_struct *mm);
+extern void free_mm_autonuma(struct mm_struct *mm);
+extern void __init mm_autonuma_init(void);
+
+/*
+ * Per-task (thread) structure dynamically allocated only if autonuma
+ * is not impossible. This contains the preferred autonuma_node where
+ * the userland thread should be scheduled into (only relevant if
+ * tsk->mm is not null) and the per-thread NUMA accesses statistics
+ * (generated by the NUMA hinting page faults).
+ */
+struct task_autonuma {
+	int autonuma_node;
+	/* zeroed from the below field during allocation */
+	unsigned long task_numa_fault_pass;
+	unsigned long task_numa_fault_tot;
+	unsigned long task_numa_fault[0];
+};
+
+extern int alloc_task_autonuma(struct task_struct *tsk,
+			       struct task_struct *orig,
+			       int node);
+extern void __init task_autonuma_init(void);
+extern void free_task_autonuma(struct task_struct *tsk);
+
+#else /* CONFIG_AUTONUMA */
+
+static inline int alloc_mm_autonuma(struct mm_struct *mm)
+{
+	return 0;
+}
+static inline void free_mm_autonuma(struct mm_struct *mm) {}
+static inline void mm_autonuma_init(void) {}
+
+static inline int alloc_task_autonuma(struct task_struct *tsk,
+				      struct task_struct *orig,
+				      int node)
+{
+	return 0;
+}
+static inline void task_autonuma_init(void) {}
+static inline void free_task_autonuma(struct task_struct *tsk) {}
+
+#endif /* CONFIG_AUTONUMA */
+
+#endif /* _LINUX_AUTONUMA_TYPES_H */

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH 11/40] autonuma: define the autonuma flags
  2012-06-28 12:55 ` Andrea Arcangeli
@ 2012-06-28 12:55   ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 12:55 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

These flags are the ones tweaked through sysfs. They control the
behavior of autonuma, from enabling/disabling it to selecting various
runtime options.
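
For illustration, a hedged sketch of how one of these bits could be
flipped from a sysfs store handler (the real sysfs wiring is added by
a later patch and its naming and error handling may differ):

	static ssize_t enabled_store(struct kobject *kobj,
				     struct kobj_attribute *attr,
				     const char *buf, size_t count)
	{
		unsigned long val;

		if (kstrtoul(buf, 10, &val))
			return -EINVAL;
		if (val)
			set_bit(AUTONUMA_FLAG, &autonuma_flags);
		else
			clear_bit(AUTONUMA_FLAG, &autonuma_flags);
		return count;
	}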

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma_flags.h |   62 ++++++++++++++++++++++++++++++++++++++++
 1 files changed, 62 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/autonuma_flags.h

diff --git a/include/linux/autonuma_flags.h b/include/linux/autonuma_flags.h
new file mode 100644
index 0000000..5e29a75
--- /dev/null
+++ b/include/linux/autonuma_flags.h
@@ -0,0 +1,62 @@
+#ifndef _LINUX_AUTONUMA_FLAGS_H
+#define _LINUX_AUTONUMA_FLAGS_H
+
+enum autonuma_flag {
+	AUTONUMA_FLAG,
+	AUTONUMA_IMPOSSIBLE_FLAG,
+	AUTONUMA_DEBUG_FLAG,
+	AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG,
+	AUTONUMA_SCHED_CLONE_RESET_FLAG,
+	AUTONUMA_SCHED_FORK_RESET_FLAG,
+	AUTONUMA_SCAN_PMD_FLAG,
+	AUTONUMA_SCAN_USE_WORKING_SET_FLAG,
+	AUTONUMA_MIGRATE_DEFER_FLAG,
+};
+
+extern unsigned long autonuma_flags;
+
+static inline bool autonuma_enabled(void)
+{
+	return !!test_bit(AUTONUMA_FLAG, &autonuma_flags);
+}
+
+static inline bool autonuma_debug(void)
+{
+	return !!test_bit(AUTONUMA_DEBUG_FLAG, &autonuma_flags);
+}
+
+static inline bool autonuma_sched_load_balance_strict(void)
+{
+	return !!test_bit(AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG,
+			  &autonuma_flags);
+}
+
+static inline bool autonuma_sched_clone_reset(void)
+{
+	return !!test_bit(AUTONUMA_SCHED_CLONE_RESET_FLAG,
+			  &autonuma_flags);
+}
+
+static inline bool autonuma_sched_fork_reset(void)
+{
+	return !!test_bit(AUTONUMA_SCHED_FORK_RESET_FLAG,
+			  &autonuma_flags);
+}
+
+static inline bool autonuma_scan_pmd(void)
+{
+	return !!test_bit(AUTONUMA_SCAN_PMD_FLAG, &autonuma_flags);
+}
+
+static inline bool autonuma_scan_use_working_set(void)
+{
+	return !!test_bit(AUTONUMA_SCAN_USE_WORKING_SET_FLAG,
+			  &autonuma_flags);
+}
+
+static inline bool autonuma_migrate_defer(void)
+{
+	return !!test_bit(AUTONUMA_MIGRATE_DEFER_FLAG, &autonuma_flags);
+}
+
+#endif /* _LINUX_AUTONUMA_FLAGS_H */

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH 12/40] autonuma: core autonuma.h header
  2012-06-28 12:55 ` Andrea Arcangeli
@ 2012-06-28 12:55   ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 12:55 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

This is the generic autonuma.h header that declares the generic
AutoNUMA specific functions like autonuma_setup_new_exec,
autonuma_migrate_split_huge_page, numa_hinting_fault, etc...

As usual, functions like numa_hinting_fault that only matter for
builds with CONFIG_AUTONUMA=y are declared unconditionally, but they
are only linked into the kernel if CONFIG_AUTONUMA=y. With
CONFIG_AUTONUMA=n their call sites are optimized away at build time
(or the kernel won't link).
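
A hedged illustration of that call site pattern (the hook placement
below is an assumption of this sketch, not taken from this patch; it
only assumes pte_numa() compiles to 0 when CONFIG_AUTONUMA=n, which is
what "optimized away at build time" implies):

	if (pte_numa(pte))
		/*
		 * With CONFIG_AUTONUMA=n this branch is a compile-time
		 * dead path, so the call is eliminated and
		 * __pte_numa_fixup() never needs to be linked in.
		 */
		pte = __pte_numa_fixup(mm, vma, addr, pte, ptep);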

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma.h |   41 +++++++++++++++++++++++++++++++++++++++++
 1 files changed, 41 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/autonuma.h

diff --git a/include/linux/autonuma.h b/include/linux/autonuma.h
new file mode 100644
index 0000000..85ca5eb
--- /dev/null
+++ b/include/linux/autonuma.h
@@ -0,0 +1,41 @@
+#ifndef _LINUX_AUTONUMA_H
+#define _LINUX_AUTONUMA_H
+
+#ifdef CONFIG_AUTONUMA
+
+#include <linux/autonuma_flags.h>
+
+extern void autonuma_enter(struct mm_struct *mm);
+extern void autonuma_exit(struct mm_struct *mm);
+extern void __autonuma_migrate_page_remove(struct page *page);
+extern void autonuma_migrate_split_huge_page(struct page *page,
+					     struct page *page_tail);
+extern void autonuma_setup_new_exec(struct task_struct *p);
+
+static inline void autonuma_migrate_page_remove(struct page *page)
+{
+	if (ACCESS_ONCE(page->autonuma_migrate_nid) >= 0)
+		__autonuma_migrate_page_remove(page);
+}
+
+#define autonuma_printk(format, args...) \
+	if (autonuma_debug()) printk(format, ##args)
+
+#else /* CONFIG_AUTONUMA */
+
+static inline void autonuma_enter(struct mm_struct *mm) {}
+static inline void autonuma_exit(struct mm_struct *mm) {}
+static inline void autonuma_migrate_page_remove(struct page *page) {}
+static inline void autonuma_migrate_split_huge_page(struct page *page,
+						    struct page *page_tail) {}
+static inline void autonuma_setup_new_exec(struct task_struct *p) {}
+
+#endif /* CONFIG_AUTONUMA */
+
+extern pte_t __pte_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
+			      unsigned long addr, pte_t pte, pte_t *ptep);
+extern void __pmd_numa_fixup(struct mm_struct *mm, unsigned long addr,
+			     pmd_t *pmd);
+extern void numa_hinting_fault(struct page *page, int numpages);
+
+#endif /* _LINUX_AUTONUMA_H */

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-28 12:55 ` Andrea Arcangeli
@ 2012-06-28 12:55   ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 12:55 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

This algorithm takes as input the statistical information filled by the
knuma_scand (mm->mm_autonuma) and by the NUMA hinting page faults
(p->sched_autonuma), evaluates it for the current scheduled task, and
compares it against every other running process to see if it should
move the current task to another NUMA node.

When the scheduler decides if the task should be migrated to a
different NUMA node or to stay in the same NUMA node, the decision is
then stored into p->sched_autonuma->autonuma_node. The fair scheduler
then tries to keep the task on the autonuma_node too.
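
To make the core comparison concrete, a small worked example (the
numbers are purely illustrative, on the 0..1000 scale set by
AUTONUMA_BALANCE_SCALE in the code below): suppose the current task
would have NUMA weight w_nid = 800 on a candidate remote node, only
w_cpu_nid = 300 on its current node, and the task running on the
candidate remote CPU has w_other = 400 there. Since 800 > 400 and
800 > 300 the candidate qualifies, and its combined weight is
(800 - 400) + (800 - 300) = 900; among all remote CPUs, the one with
the largest such weight becomes selected_cpu.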

Code includes fixes and cleanups from Hillf Danton <dhillf@gmail.com>.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma_sched.h |   50 ++++
 include/linux/mm_types.h       |    5 +
 include/linux/sched.h          |    3 +
 kernel/sched/core.c            |    1 +
 kernel/sched/numa.c            |  586 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h           |   18 ++
 6 files changed, 663 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/autonuma_sched.h
 create mode 100644 kernel/sched/numa.c

diff --git a/include/linux/autonuma_sched.h b/include/linux/autonuma_sched.h
new file mode 100644
index 0000000..aff31d4
--- /dev/null
+++ b/include/linux/autonuma_sched.h
@@ -0,0 +1,50 @@
+#ifndef _LINUX_AUTONUMA_SCHED_H
+#define _LINUX_AUTONUMA_SCHED_H
+
+#ifdef CONFIG_AUTONUMA
+#include <linux/autonuma_flags.h>
+
+extern void sched_autonuma_balance(void);
+extern bool sched_autonuma_can_migrate_task(struct task_struct *p,
+					    int numa, int dst_cpu,
+					    enum cpu_idle_type idle);
+#else /* CONFIG_AUTONUMA */
+static inline void sched_autonuma_balance(void) {}
+static inline bool sched_autonuma_can_migrate_task(struct task_struct *p,
+						   int numa, int dst_cpu,
+						   enum cpu_idle_type idle)
+{
+	return true;
+}
+#endif /* CONFIG_AUTONUMA */
+
+static bool inline task_autonuma_cpu(struct task_struct *p, int cpu)
+{
+#ifdef CONFIG_AUTONUMA
+	int autonuma_node;
+	struct task_autonuma *task_autonuma = p->task_autonuma;
+
+	if (!task_autonuma)
+		return true;
+
+	autonuma_node = ACCESS_ONCE(task_autonuma->autonuma_node);
+	if (autonuma_node < 0 || autonuma_node == cpu_to_node(cpu))
+		return true;
+	else
+		return false;
+#else
+	return true;
+#endif
+}
+
+static inline void sched_set_autonuma_need_balance(void)
+{
+#ifdef CONFIG_AUTONUMA
+	struct task_autonuma *ta = current->task_autonuma;
+
+	if (ta && current->mm)
+		sched_autonuma_balance();
+#endif
+}
+
+#endif /* _LINUX_AUTONUMA_SCHED_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 704a626..f0c6379 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -13,6 +13,7 @@
 #include <linux/cpumask.h>
 #include <linux/page-debug-flags.h>
 #include <linux/uprobes.h>
+#include <linux/autonuma_types.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
 
@@ -389,6 +390,10 @@ struct mm_struct {
 	struct cpumask cpumask_allocation;
 #endif
 	struct uprobes_state uprobes_state;
+#ifdef CONFIG_AUTONUMA
+	/* this is used by the scheduler and the page allocator */
+	struct mm_autonuma *mm_autonuma;
+#endif
 };
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 699324c..cb20347 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1514,6 +1514,9 @@ struct task_struct {
 	struct mempolicy *mempolicy;	/* Protected by alloc_lock */
 	short il_next;
 	short pref_node_fork;
+#ifdef CONFIG_AUTONUMA
+	struct task_autonuma *task_autonuma;
+#endif
 #endif
 	struct rcu_head rcu;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d5594a4..a8f94b9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -72,6 +72,7 @@
 #include <linux/slab.h>
 #include <linux/init_task.h>
 #include <linux/binfmts.h>
+#include <linux/autonuma_sched.h>
 
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
diff --git a/kernel/sched/numa.c b/kernel/sched/numa.c
new file mode 100644
index 0000000..72f6158
--- /dev/null
+++ b/kernel/sched/numa.c
@@ -0,0 +1,586 @@
+/*
+ *  Copyright (C) 2012  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/sched.h>
+#include <linux/autonuma_sched.h>
+#include <asm/tlb.h>
+
+#include "sched.h"
+
+/*
+ * autonuma_balance_cpu_stop() is a callback to be invoked by
+ * stop_one_cpu_nowait(). It is used by sched_autonuma_balance() to
+ * migrate the tasks to the selected_cpu, from softirq context.
+ */
+static int autonuma_balance_cpu_stop(void *data)
+{
+	struct rq *src_rq = data;
+	int src_cpu = cpu_of(src_rq);
+	int dst_cpu = src_rq->autonuma_balance_dst_cpu;
+	struct task_struct *p = src_rq->autonuma_balance_task;
+	struct rq *dst_rq = cpu_rq(dst_cpu);
+
+	raw_spin_lock_irq(&p->pi_lock);
+	raw_spin_lock(&src_rq->lock);
+
+	/* Make sure the selected cpu hasn't gone down in the meanwhile */
+	if (unlikely(src_cpu != smp_processor_id() ||
+		     !src_rq->autonuma_balance))
+		goto out_unlock;
+
+	/* Check if the affinity changed in the meanwhile */
+	if (!cpumask_test_cpu(dst_cpu, tsk_cpus_allowed(p)))
+		goto out_unlock;
+
+	/* Is the task to migrate still there? */
+	if (task_cpu(p) != src_cpu)
+		goto out_unlock;
+
+	BUG_ON(src_rq == dst_rq);
+
+	/* Prepare to move the task from src_rq to dst_rq */
+	double_lock_balance(src_rq, dst_rq);
+
+	/*
+	 * Supposedly pi_lock should have been enough but some code
+	 * seems to call __set_task_cpu without pi_lock.
+	 */
+	if (task_cpu(p) != src_cpu) {
+		WARN_ONCE(1, "autonuma_balance_cpu_stop: "
+			  "not pi_lock protected");
+		goto out_double_unlock;
+	}
+
+	/*
+	 * If the task is not on a rq, the autonuma_node will take
+	 * care of the NUMA affinity at the next wake-up.
+	 */
+	if (p->on_rq) {
+		deactivate_task(src_rq, p, 0);
+		set_task_cpu(p, dst_cpu);
+		activate_task(dst_rq, p, 0);
+		check_preempt_curr(dst_rq, p, 0);
+	}
+
+out_double_unlock:
+	double_unlock_balance(src_rq, dst_rq);
+out_unlock:
+	src_rq->autonuma_balance = false;
+	raw_spin_unlock(&src_rq->lock);
+	/* spinlocks act as barrier() so p is kept local on the stack */
+	raw_spin_unlock_irq(&p->pi_lock);
+	put_task_struct(p);
+	return 0;
+}
+
+#define AUTONUMA_BALANCE_SCALE 1000
+
+enum {
+	W_TYPE_THREAD,
+	W_TYPE_PROCESS,
+};
+
+/*
+ * This function sched_autonuma_balance() is responsible for deciding
+ * which is the best CPU each process should be running on according
+ * to the NUMA statistics collected in mm->mm_autonuma and
+ * tsk->task_autonuma.
+ *
+ * The core math that evaluates the current CPU against the CPUs of
+ * all _other_ nodes is this:
+ *
+ *	if (w_nid > w_other && w_nid > w_cpu_nid)
+ *		weight = w_nid - w_other + w_nid - w_cpu_nid;
+ *
+ * w_nid: NUMA affinity of the current thread/process if run on the
+ * other CPU.
+ *
+ * w_other: NUMA affinity of the other thread/process if run on the
+ * other CPU.
+ *
+ * w_cpu_nid: NUMA affinity of the current thread/process if run on
+ * the current CPU.
+ *
+ * weight: combined NUMA affinity benefit in moving the current
+ * thread/process to the other CPU taking into account both the higher
+ * NUMA affinity of the current process if run on the other CPU, and
+ * the increase in NUMA affinity in the other CPU by replacing the
+ * other process.
+ *
+ * We run the above math on every CPU not part of the current NUMA
+ * node, and we compare the current process against the other
+ * processes running in the other CPUs in the remote NUMA nodes. The
+ * objective is to select the cpu (in selected_cpu) with a bigger
+ * "weight". The bigger the "weight" the biggest gain we'll get by
+ * moving the current process to the selected_cpu (not only the
+ * biggest immediate CPU gain but also the fewer async memory
+ * migrations that will be required to reach full convergence
+ * later). If we select a cpu we migrate the current process to it.
+ *
+ * Checking that the current process has higher NUMA affinity than the
+ * other process on the other CPU (w_nid > w_other) and not only that
+ * the current process has higher NUMA affinity on the other CPU than
+ * on the current CPU (w_nid > w_cpu_nid) completely avoids ping pongs
+ * and ensures (temporary) convergence of the algorithm (at least from
+ * a CPU standpoint).
+ *
+ * It's then up to the idle balancing code that will run as soon as
+ * the current CPU goes idle to pick the other process and move it
+ * here (or in some other idle CPU if any).
+ *
+ * By only evaluating running processes against running processes we
+ * avoid interfering with the CFS stock active idle balancing, which
+ * is critical to optimal performance with HT enabled. (getting HT
+ * wrong is worse than running on remote memory so the active idle
+ * balancing has priority)
+ *
+ * Idle balancing and all other CFS load balancing become NUMA
+ * affinity aware through the introduction of
+ * sched_autonuma_can_migrate_task(). CFS searches CPUs in the task's
+ * autonuma_node first when it needs to find idle CPUs during idle
+ * balancing or tasks to pick during load balancing.
+ *
+ * The task's autonuma_node is the node selected by
+ * sched_autonuma_balance() when it migrates a task to the
+ * selected_cpu in the selected_nid.
+ *
+ * Once a process/thread has been moved to another node, closer to
+ * most of the memory it has recently accessed, any memory for that task
+ * not in the new node moves slowly (asynchronously in the background)
+ * to the new node. This is done by the knuma_migratedN (where the
+ * suffix N is the node id) daemon described in mm/autonuma.c.
+ *
+ * One non-trivial bit of this logic that deserves an explanation is
+ * how the three crucial variables of the core math
+ * (w_nid/w_other/w_cpu_nid) are going to change depending on whether
+ * the other CPU is running a thread of the current process, or a
+ * thread of a different process.
+ *
+ * A simple example is required. Given the following:
+ * - 2 processes
+ * - 4 threads per process
+ * - 2 NUMA nodes
+ * - 4 CPUS per NUMA node
+ *
+ * Because the 8 threads belong to 2 different processes, by using the
+ * process statistics when comparing threads of different processes,
+ * we will converge reliably and quickly to a configuration where the
+ * 1st process is entirely contained in one node and the 2nd process
+ * in the other node.
+ *
+ * If all threads only use thread local memory (no sharing of memory
+ * between the threads), it wouldn't matter if we use per-thread or
+ * per-mm statistics for w_nid/w_other/w_cpu_nid. We could then use
+ * per-thread statistics all the time.
+ *
+ * But clearly with threads it's expected to get some sharing of
+ * memory. To avoid false sharing it's better to keep all threads of
+ * the same process in the same node (or if they don't fit in a single
+ * node, in as fewer nodes as possible). This is why we have to use
+ * process statistics in w_nid/w_other/w_cpu_nid when comparing
+ * threads of different processes. Why instead do we have to use
+ * thread statistics when comparing threads of the same process? This
+ * should be obvious if you're still reading (hint: the mm statistics
+ * are identical for threads of the same process). If some process
+ * doesn't fit in one node, the thread statistics will then distribute
+ * the threads to the best nodes within the group of nodes where the
+ * process is contained.
+ *
+ * False sharing in the above sentence (and generally in AutoNUMA
+ * context) is intended as virtual memory accessed simultaneously (or
+ * frequently) by threads running in CPUs of different nodes. This
+ * doesn't refer to shared memory as in tmpfs, but it refers to
+ * CLONE_VM instead. If the threads access the same memory from CPUs
+ * of different nodes it means the memory accesses will be NUMA local
+ * for some thread and NUMA remote for some other thread. The only way
+ * to avoid NUMA false sharing is to schedule all threads accessing
+ * the same memory in the same node (which may or may not be possible,
+ * if it's not possible because there aren't enough CPUs in the node,
+ * the threads should be scheduled in as few nodes as possible and the
+ * nodes distance should be the lowest possible).
+ *
+ * This is an example of the CPU layout after the startup of 2
+ * processes with 12 threads each. This is some of the logs you will
+ * find in `dmesg` after running:
+ *
+ *	echo 1 >/sys/kernel/mm/autonuma/debug
+ *
+ * nid is the node id
+ * mm is the pointer to the mm structure (kind of the "ID" of the process)
+ * nr is the number of threads that belong to that process in that node id.
+ *
+ * This dumps the raw content of the CPUs' runqueues, it doesn't show
+ * kernel threads (the kernel thread dumping the below stats is
+ * clearly using one CPU, hence only 23 CPUs are dumped, clearly the
+ * debug mode can be improved but it's good enough to see what's going
+ * on).
+ *
+ * nid 0 mm ffff880433367b80 nr 6
+ * nid 0 mm ffff880433367480 nr 5
+ * nid 1 mm ffff880433367b80 nr 6
+ * nid 1 mm ffff880433367480 nr 6
+ *
+ * Now, the process with mm == ffff880433367b80 has 6 threads in node0
+ * and 6 threads in node1, while the process with mm ==
+ * ffff880433367480 has 5 threads in node0 and 6 threads running in
+ * node1.
+ *
+ * And after a few seconds it becomes:
+ *
+ * nid 0 mm ffff880433367b80 nr 12
+ * nid 1 mm ffff880433367480 nr 11
+ *
+ * Now, 12 threads of one process are running on node 0 and 11 threads
+ * of the other process are running on node 1.
+ *
+ * Before scanning all other CPUs' runqueues to compute the above
+ * math, we also verify that the current CPU is not already in the
+ * preferred NUMA node from the point of view of both the process
+ * statistics and the thread statistics. In such case we can return to
+ * the caller without having to check any other CPUs' runqueues
+ * because full convergence has been already reached.
+ *
+ * This algorithm might be expanded to take all runnable processes
+ * into account but examining just the currently running processes is
+ * a good enough approximation because some runnable processes may run
+ * only for a short time so statistically there will always be a bias
+ * on the processes that use most of the CPU. This is ideal
+ * because it doesn't matter if NUMA balancing isn't optimal for
+ * processes that run only for a short time.
+ *
+ * This function is invoked at the same frequency and in the same
+ * location of the CFS load balancer and only if the CPU is not
+ * idle. The rest of the time we depend on CFS to keep sticking to the
+ * current CPU or to prioritize on the CPUs in the selected_nid
+ * (recorded in the task's autonuma_node field).
+ */
+void sched_autonuma_balance(void)
+{
+	int cpu, nid, selected_cpu, selected_nid, selected_nid_mm;
+	int cpu_nid = numa_node_id();
+	int this_cpu = smp_processor_id();
+	/*
+	 * w_t: node thread weight
+	 * w_t_t: total sum of all node thread weights
+	 * w_m: node mm/process weight
+	 * w_m_t: total sum of all node mm/process weights
+	 */
+	unsigned long w_t, w_t_t, w_m, w_m_t;
+	unsigned long w_t_max, w_m_max;
+	unsigned long weight_max, weight;
+	long s_w_nid = -1, s_w_cpu_nid = -1, s_w_other = -1;
+	int s_w_type = -1;
+	struct cpumask *allowed;
+	struct task_struct *p = current;
+	struct task_autonuma *task_autonuma = p->task_autonuma;
+	struct rq *rq;
+
+	/* per-cpu statically allocated in runqueues */
+	long *task_numa_weight;
+	long *mm_numa_weight;
+
+	if (!task_autonuma || !p->mm)
+		return;
+
+	if (!autonuma_enabled()) {
+		if (task_autonuma->autonuma_node != -1)
+			task_autonuma->autonuma_node = -1;
+		return;
+	}
+
+	allowed = tsk_cpus_allowed(p);
+
+	/*
+	 * Do nothing if the task had no numa hinting page faults yet
+	 * or if the mm hasn't been fully scanned by knuma_scand yet.
+	 */
+	w_t_t = task_autonuma->task_numa_fault_tot;
+	if (!w_t_t)
+		return;
+	w_m_t = ACCESS_ONCE(p->mm->mm_autonuma->mm_numa_fault_tot);
+	if (!w_m_t)
+		return;
+
+	/*
+	 * The below two arrays holds the NUMA affinity information of
+	 * the current process if scheduled in the "nid". This is task
+	 * local and mm local information. We compute this information
+	 * for all nodes.
+	 *
+	 * task/mm_numa_weight[nid] will become w_nid.
+	 * task/mm_numa_weight[cpu_nid] will become w_cpu_nid.
+	 */
+	rq = cpu_rq(this_cpu);
+	task_numa_weight = rq->task_numa_weight;
+	mm_numa_weight = rq->mm_numa_weight;
+
+	w_t_max = w_m_max = 0;
+	selected_nid = selected_nid_mm = -1;
+	for_each_online_node(nid) {
+		w_m = ACCESS_ONCE(p->mm->mm_autonuma->mm_numa_fault[nid]);
+		w_t = task_autonuma->task_numa_fault[nid];
+		if (w_m > w_m_t)
+			w_m_t = w_m;
+		mm_numa_weight[nid] = w_m*AUTONUMA_BALANCE_SCALE/w_m_t;
+		if (w_t > w_t_t)
+			w_t_t = w_t;
+		task_numa_weight[nid] = w_t*AUTONUMA_BALANCE_SCALE/w_t_t;
+		if (mm_numa_weight[nid] > w_m_max) {
+			w_m_max = mm_numa_weight[nid];
+			selected_nid_mm = nid;
+		}
+		if (task_numa_weight[nid] > w_t_max) {
+			w_t_max = task_numa_weight[nid];
+			selected_nid = nid;
+		}
+	}
+	/*
+	 * See if we already converged to skip the more expensive loop
+	 * below. Return if we can already predict here with only
+	 * mm/task local information, that the below loop would
+	 * selected the current cpu_nid.
+	 */
+	if (selected_nid == cpu_nid && selected_nid_mm == selected_nid) {
+		if (task_autonuma->autonuma_node != selected_nid)
+			task_autonuma->autonuma_node = selected_nid;
+		return;
+	}
+
+	selected_cpu = this_cpu;
+	selected_nid = cpu_nid;
+
+	weight = weight_max = 0;
+
+	/* check that the following raw_spin_lock_irq is safe */
+	BUG_ON(irqs_disabled());
+
+	for_each_online_node(nid) {
+		/*
+		 * Calculate the "weight" for all CPUs that the
+		 * current process is allowed to be migrated to,
+		 * except the CPUs of the current nid (it would be
+		 * worthless from a NUMA affinity standpoint to
+		 * migrate the task to another CPU of the current
+		 * node).
+		 */
+		if (nid == cpu_nid)
+			continue;
+		for_each_cpu_and(cpu, cpumask_of_node(nid), allowed) {
+			long w_nid, w_cpu_nid, w_other;
+			int w_type;
+			struct mm_struct *mm;
+			rq = cpu_rq(cpu);
+			if (!cpu_online(cpu))
+				continue;
+
+			if (idle_cpu(cpu))
+				/*
+				 * Offload the whole IDLE balancing
+				 * and physical / logical imbalances
+				 * to CFS.
+				 */
+				continue;
+
+			mm = rq->curr->mm;
+			if (!mm)
+				continue;
+			/*
+			 * Grab the w_m/w_t/w_m_t/w_t_t of the
+			 * processes running in the other CPUs to
+			 * compute w_other.
+			 */
+			raw_spin_lock_irq(&rq->lock);
+			/* recheck after implicit barrier() */
+			mm = rq->curr->mm;
+			if (!mm) {
+				raw_spin_unlock_irq(&rq->lock);
+				continue;
+			}
+			w_m_t = ACCESS_ONCE(mm->mm_autonuma->mm_numa_fault_tot);
+			w_t_t = rq->curr->task_autonuma->task_numa_fault_tot;
+			if (!w_m_t || !w_t_t) {
+				raw_spin_unlock_irq(&rq->lock);
+				continue;
+			}
+			w_m = ACCESS_ONCE(mm->mm_autonuma->mm_numa_fault[nid]);
+			w_t = rq->curr->task_autonuma->task_numa_fault[nid];
+			raw_spin_unlock_irq(&rq->lock);
+			/*
+			 * Generate the w_nid/w_cpu_nid from the
+			 * pre-computed mm/task_numa_weight[] and
+			 * compute w_other using the w_m/w_t info
+			 * collected from the other process.
+			 */
+			if (mm == p->mm) {
+				if (w_t > w_t_t)
+					w_t_t = w_t;
+				w_other = w_t*AUTONUMA_BALANCE_SCALE/w_t_t;
+				w_nid = task_numa_weight[nid];
+				w_cpu_nid = task_numa_weight[cpu_nid];
+				w_type = W_TYPE_THREAD;
+			} else {
+				if (w_m > w_m_t)
+					w_m_t = w_m;
+				w_other = w_m*AUTONUMA_BALANCE_SCALE/w_m_t;
+				w_nid = mm_numa_weight[nid];
+				w_cpu_nid = mm_numa_weight[cpu_nid];
+				w_type = W_TYPE_PROCESS;
+			}
+
+			/*
+			 * Finally check if there's a combined gain in
+			 * NUMA affinity. If there is and it's the
+			 * biggest weight seen so far, record its
+			 * weight and select this NUMA remote "cpu" as
+			 * candidate migration destination.
+			 */
+			if (w_nid > w_other && w_nid > w_cpu_nid) {
+				weight = w_nid - w_other + w_nid - w_cpu_nid;
+
+				if (weight > weight_max) {
+					weight_max = weight;
+					selected_cpu = cpu;
+					selected_nid = nid;
+
+					s_w_other = w_other;
+					s_w_nid = w_nid;
+					s_w_cpu_nid = w_cpu_nid;
+					s_w_type = w_type;
+				}
+			}
+		}
+	}
+
+	if (task_autonuma->autonuma_node != selected_nid)
+		task_autonuma->autonuma_node = selected_nid;
+	if (selected_cpu != this_cpu) {
+		if (autonuma_debug()) {
+			char *w_type_str = NULL;
+			switch (s_w_type) {
+			case W_TYPE_THREAD:
+				w_type_str = "thread";
+				break;
+			case W_TYPE_PROCESS:
+				w_type_str = "process";
+				break;
+			}
+			printk("%p %d - %dto%d - %dto%d - %ld %ld %ld - %s\n",
+			       p->mm, p->pid, cpu_nid, selected_nid,
+			       this_cpu, selected_cpu,
+			       s_w_other, s_w_nid, s_w_cpu_nid,
+			       w_type_str);
+		}
+		BUG_ON(cpu_nid == selected_nid);
+		goto found;
+	}
+
+	return;
+
+found:
+	rq = cpu_rq(this_cpu);
+
+	/*
+	 * autonuma_balance synchronizes accesses to
+	 * autonuma_balance_work. Once set, it's cleared by the
+	 * callback once the migration work is finished.
+	 */
+	raw_spin_lock_irq(&rq->lock);
+	if (rq->autonuma_balance) {
+		raw_spin_unlock_irq(&rq->lock);
+		return;
+	}
+	rq->autonuma_balance = true;
+	raw_spin_unlock_irq(&rq->lock);
+
+	rq->autonuma_balance_dst_cpu = selected_cpu;
+	rq->autonuma_balance_task = p;
+	get_task_struct(p);
+
+	stop_one_cpu_nowait(this_cpu,
+			    autonuma_balance_cpu_stop, rq,
+			    &rq->autonuma_balance_work);
+#ifdef __ia64__
+#error "NOTE: tlb_migrate_finish won't run here"
+#endif
+}
+
+/*
+ * The function sched_autonuma_can_migrate_task is called by CFS
+ * can_migrate_task() to prioritize on the task's autonuma_node. It is
+ * called during load_balancing, idle balancing and in general
+ * before any task CPU migration event happens.
+ *
+ * The caller first scans the CFS migration candidate tasks passing a
+ * not zero numa parameter, to skip tasks without AutoNUMA affinity
+ * (according to the tasks's autonuma_node). If no task can be
+ * migrated in the first scan, a second scan is run with a zero numa
+ * parameter.
+ *
+ * If load_balance_strict is enabled, AutoNUMA will only allow
+ * migration of tasks for idle balancing purposes (the idle balancing
+ * of CFS is never altered by AutoNUMA). In the not strict mode the
+ * load balancing is not altered and the AutoNUMA affinity is
+ * disregarded in favor of higher fairness. The load_balance_strict
+ * knob is runtime tunable in sysfs.
+ *
+ * If load_balance_strict is enabled, it tends to partition the
+ * system. In turn it may reduce the scheduler fairness across NUMA
+ * nodes, but it should deliver higher global performance.
+ */
+bool sched_autonuma_can_migrate_task(struct task_struct *p,
+				     int numa, int dst_cpu,
+				     enum cpu_idle_type idle)
+{
+	if (!task_autonuma_cpu(p, dst_cpu)) {
+		if (numa)
+			return false;
+		if (autonuma_sched_load_balance_strict() &&
+		    idle != CPU_NEWLY_IDLE && idle != CPU_IDLE)
+			return false;
+	}
+	return true;
+}
+
+/*
+ * sched_autonuma_dump_mm is a purely debugging function called at
+ * regular intervals when /sys/kernel/mm/autonuma/debug is
+ * enabled. This prints in the kernel logs how the threads and
+ * processes are distributed in all NUMA nodes to easily check if the
+ * threads of the same processes are converging in the same
+ * nodes. This won't take into account kernel threads and because it
+ * runs itself from a kernel thread it won't show what was running in
+ * the current CPU, but it's simple and good enough to get what we
+ * need in the debug logs. This function can be disabled or deleted
+ * later.
+ */
+void sched_autonuma_dump_mm(void)
+{
+	int nid, cpu;
+	cpumask_var_t x;
+
+	if (!alloc_cpumask_var(&x, GFP_KERNEL))
+		return;
+	cpumask_setall(x);
+	for_each_online_node(nid) {
+		for_each_cpu(cpu, cpumask_of_node(nid)) {
+			struct rq *rq = cpu_rq(cpu);
+			struct mm_struct *mm = rq->curr->mm;
+			int nr = 0, cpux;
+			if (!cpumask_test_cpu(cpu, x))
+				continue;
+			for_each_cpu(cpux, cpumask_of_node(nid)) {
+				struct rq *rqx = cpu_rq(cpux);
+				if (rqx->curr->mm == mm) {
+					nr++;
+					cpumask_clear_cpu(cpux, x);
+				}
+			}
+			printk("nid %d mm %p nr %d\n", nid, mm, nr);
+		}
+	}
+	free_cpumask_var(x);
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6d52cea..e5b7ae9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -463,6 +463,24 @@ struct rq {
 #ifdef CONFIG_SMP
 	struct llist_head wake_list;
 #endif
+#ifdef CONFIG_AUTONUMA
+	/*
+	 * Per-cpu arrays to compute the per-thread and per-process
+	 * statistics. Allocated statically to avoid overflowing the
+	 * stack with large MAX_NUMNODES values.
+	 *
+	 * FIXME: allocate dynamically and with num_possible_nodes()
+	 * array sizes only if autonuma is not impossible, to save
+	 * some dozen KB of RAM when booting on not NUMA (or small
+	 * NUMA) systems.
+	 */
+	long task_numa_weight[MAX_NUMNODES];
+	long mm_numa_weight[MAX_NUMNODES];
+	bool autonuma_balance;
+	int autonuma_balance_dst_cpu;
+	struct task_struct *autonuma_balance_task;
+	struct cpu_stop_work autonuma_balance_work;
+#endif
 };
 
 static inline int cpu_of(struct rq *rq)

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH 13/40] autonuma: CPU follow memory algorithm
@ 2012-06-28 12:55   ` Andrea Arcangeli
  0 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 12:55 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

This algorithm takes as input the statistical information filled by the
knuma_scand (mm->mm_autonuma) and by the NUMA hinting page faults
(p->sched_autonuma), evaluates it for the current scheduled task, and
compares it against every other running process to see if it should
move the current task to another NUMA node.

When the scheduler decides if the task should be migrated to a
different NUMA node or to stay in the same NUMA node, the decision is
then stored into p->sched_autonuma->autonuma_node. The fair scheduler
then tries to keep the task on the autonuma_node too.

Code includes fixes and cleanups from Hillf Danton <dhillf@gmail.com>.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma_sched.h |   50 ++++
 include/linux/mm_types.h       |    5 +
 include/linux/sched.h          |    3 +
 kernel/sched/core.c            |    1 +
 kernel/sched/numa.c            |  586 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h           |   18 ++
 6 files changed, 663 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/autonuma_sched.h
 create mode 100644 kernel/sched/numa.c

diff --git a/include/linux/autonuma_sched.h b/include/linux/autonuma_sched.h
new file mode 100644
index 0000000..aff31d4
--- /dev/null
+++ b/include/linux/autonuma_sched.h
@@ -0,0 +1,50 @@
+#ifndef _LINUX_AUTONUMA_SCHED_H
+#define _LINUX_AUTONUMA_SCHED_H
+
+#ifdef CONFIG_AUTONUMA
+#include <linux/autonuma_flags.h>
+
+extern void sched_autonuma_balance(void);
+extern bool sched_autonuma_can_migrate_task(struct task_struct *p,
+					    int numa, int dst_cpu,
+					    enum cpu_idle_type idle);
+#else /* CONFIG_AUTONUMA */
+static inline void sched_autonuma_balance(void) {}
+static inline bool sched_autonuma_can_migrate_task(struct task_struct *p,
+						   int numa, int dst_cpu,
+						   enum cpu_idle_type idle)
+{
+	return true;
+}
+#endif /* CONFIG_AUTONUMA */
+
+static bool inline task_autonuma_cpu(struct task_struct *p, int cpu)
+{
+#ifdef CONFIG_AUTONUMA
+	int autonuma_node;
+	struct task_autonuma *task_autonuma = p->task_autonuma;
+
+	if (!task_autonuma)
+		return true;
+
+	autonuma_node = ACCESS_ONCE(task_autonuma->autonuma_node);
+	if (autonuma_node < 0 || autonuma_node == cpu_to_node(cpu))
+		return true;
+	else
+		return false;
+#else
+	return true;
+#endif
+}
+
+static inline void sched_set_autonuma_need_balance(void)
+{
+#ifdef CONFIG_AUTONUMA
+	struct task_autonuma *ta = current->task_autonuma;
+
+	if (ta && current->mm)
+		sched_autonuma_balance();
+#endif
+}
+
+#endif /* _LINUX_AUTONUMA_SCHED_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 704a626..f0c6379 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -13,6 +13,7 @@
 #include <linux/cpumask.h>
 #include <linux/page-debug-flags.h>
 #include <linux/uprobes.h>
+#include <linux/autonuma_types.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
 
@@ -389,6 +390,10 @@ struct mm_struct {
 	struct cpumask cpumask_allocation;
 #endif
 	struct uprobes_state uprobes_state;
+#ifdef CONFIG_AUTONUMA
+	/* this is used by the scheduler and the page allocator */
+	struct mm_autonuma *mm_autonuma;
+#endif
 };
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 699324c..cb20347 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1514,6 +1514,9 @@ struct task_struct {
 	struct mempolicy *mempolicy;	/* Protected by alloc_lock */
 	short il_next;
 	short pref_node_fork;
+#ifdef CONFIG_AUTONUMA
+	struct task_autonuma *task_autonuma;
+#endif
 #endif
 	struct rcu_head rcu;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d5594a4..a8f94b9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -72,6 +72,7 @@
 #include <linux/slab.h>
 #include <linux/init_task.h>
 #include <linux/binfmts.h>
+#include <linux/autonuma_sched.h>
 
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
diff --git a/kernel/sched/numa.c b/kernel/sched/numa.c
new file mode 100644
index 0000000..72f6158
--- /dev/null
+++ b/kernel/sched/numa.c
@@ -0,0 +1,586 @@
+/*
+ *  Copyright (C) 2012  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/sched.h>
+#include <linux/autonuma_sched.h>
+#include <asm/tlb.h>
+
+#include "sched.h"
+
+/*
+ * autonuma_balance_cpu_stop() is a callback to be invoked by
+ * stop_one_cpu_nowait(). It is used by sched_autonuma_balance() to
+ * migrate the tasks to the selected_cpu, from softirq context.
+ */
+static int autonuma_balance_cpu_stop(void *data)
+{
+	struct rq *src_rq = data;
+	int src_cpu = cpu_of(src_rq);
+	int dst_cpu = src_rq->autonuma_balance_dst_cpu;
+	struct task_struct *p = src_rq->autonuma_balance_task;
+	struct rq *dst_rq = cpu_rq(dst_cpu);
+
+	raw_spin_lock_irq(&p->pi_lock);
+	raw_spin_lock(&src_rq->lock);
+
+	/* Make sure the selected cpu hasn't gone down in the meanwhile */
+	if (unlikely(src_cpu != smp_processor_id() ||
+		     !src_rq->autonuma_balance))
+		goto out_unlock;
+
+	/* Check if the affinity changed in the meanwhile */
+	if (!cpumask_test_cpu(dst_cpu, tsk_cpus_allowed(p)))
+		goto out_unlock;
+
+	/* Is the task to migrate still there? */
+	if (task_cpu(p) != src_cpu)
+		goto out_unlock;
+
+	BUG_ON(src_rq == dst_rq);
+
+	/* Prepare to move the task from src_rq to dst_rq */
+	double_lock_balance(src_rq, dst_rq);
+
+	/*
+	 * Supposedly pi_lock should have been enough but some code
+	 * seems to call __set_task_cpu without pi_lock.
+	 */
+	if (task_cpu(p) != src_cpu) {
+		WARN_ONCE(1, "autonuma_balance_cpu_stop: "
+			  "not pi_lock protected");
+		goto out_double_unlock;
+	}
+
+	/*
+	 * If the task is not on a rq, the autonuma_node will take
+	 * care of the NUMA affinity at the next wake-up.
+	 */
+	if (p->on_rq) {
+		deactivate_task(src_rq, p, 0);
+		set_task_cpu(p, dst_cpu);
+		activate_task(dst_rq, p, 0);
+		check_preempt_curr(dst_rq, p, 0);
+	}
+
+out_double_unlock:
+	double_unlock_balance(src_rq, dst_rq);
+out_unlock:
+	src_rq->autonuma_balance = false;
+	raw_spin_unlock(&src_rq->lock);
+	/* the spinlocks act as barrier() so p is kept local on the stack */
+	raw_spin_unlock_irq(&p->pi_lock);
+	put_task_struct(p);
+	return 0;
+}
+
+#define AUTONUMA_BALANCE_SCALE 1000
+
+enum {
+	W_TYPE_THREAD,
+	W_TYPE_PROCESS,
+};
+
+/*
+ * This function sched_autonuma_balance() is responsible for deciding
+ * which is the best CPU each process should be running on according
+ * to the NUMA statistics collected in mm->mm_autonuma and
+ * tsk->task_autonuma.
+ *
+ * The core math that evaluates the current CPU against the CPUs of
+ * all _other_ nodes is this:
+ *
+ *	if (w_nid > w_other && w_nid > w_cpu_nid)
+ *		weight = w_nid - w_other + w_nid - w_cpu_nid;
+ *
+ * w_nid: NUMA affinity of the current thread/process if run on the
+ * other CPU.
+ *
+ * w_other: NUMA affinity of the other thread/process if run on the
+ * other CPU.
+ *
+ * w_cpu_nid: NUMA affinity of the current thread/process if run on
+ * the current CPU.
+ *
+ * weight: combined NUMA affinity benefit in moving the current
+ * thread/process to the other CPU taking into account both the higher
+ * NUMA affinity of the current process if run on the other CPU, and
+ * the increase in NUMA affinity in the other CPU by replacing the
+ * other process.
+ *
+ * We run the above math on every CPU not part of the current NUMA
+ * node, and we compare the current process against the other
+ * processes running in the other CPUs in the remote NUMA nodes. The
+ * objective is to select the cpu (in selected_cpu) with the biggest
+ * "weight". The bigger the "weight", the bigger the gain we'll get by
+ * moving the current process to the selected_cpu (not only the
+ * biggest immediate CPU gain, but also fewer async memory
+ * migrations will be required to reach full convergence
+ * later). If we select a cpu we migrate the current process to it.
+ *
+ * Checking that the current process has higher NUMA affinity than the
+ * other process on the other CPU (w_nid > w_other) and not only that
+ * the current process has higher NUMA affinity on the other CPU than
+ * on the current CPU (w_nid > w_cpu_nid) completely avoids ping pongs
+ * and ensures (temporary) convergence of the algorithm (at least from
+ * a CPU standpoint).
+ *
+ * It's then up to the idle balancing code that will run as soon as
+ * the current CPU goes idle to pick the other process and move it
+ * here (or in some other idle CPU if any).
+ *
+ * By only evaluating running processes against running processes we
+ * avoid interfering with the CFS stock active idle balancing, which
+ * is critical to optimal performance with HT enabled. (getting HT
+ * wrong is worse than running on remote memory so the active idle
+ * balancing has priority)
+ *
+ * Idle balancing and all other CFS load balancing become NUMA
+ * affinity aware through the introduction of
+ * sched_autonuma_can_migrate_task(). CFS searches CPUs in the task's
+ * autonuma_node first when it needs to find idle CPUs during idle
+ * balancing or tasks to pick during load balancing.
+ *
+ * The task's autonuma_node is the node selected by
+ * sched_autonuma_balance() when it migrates a task to the
+ * selected_cpu in the selected_nid.
+ *
+ * Once a process/thread has been moved to another node, closer to most
+ * of the memory it has recently accessed, any memory for that task
+ * not in the new node moves slowly (asynchronously in the background)
+ * to the new node. This is done by the knuma_migratedN (where the
+ * suffix N is the node id) daemon described in mm/autonuma.c.
+ *
+ * One non-trivial bit of this logic that deserves an explanation is
+ * how the three crucial variables of the core math
+ * (w_nid/w_other/w_cpu_nid) are going to change depending on whether
+ * the other CPU is running a thread of the current process, or a
+ * thread of a different process.
+ *
+ * A simple example is required. Given the following:
+ * - 2 processes
+ * - 4 threads per process
+ * - 2 NUMA nodes
+ * - 4 CPUs per NUMA node
+ *
+ * Because the 8 threads belong to 2 different processes, by using the
+ * process statistics when comparing threads of different processes,
+ * we will converge reliably and quickly to a configuration where the
+ * 1st process is entirely contained in one node and the 2nd process
+ * in the other node.
+ *
+ * If all threads only use thread local memory (no sharing of memory
+ * between the threads), it wouldn't matter if we use per-thread or
+ * per-mm statistics for w_nid/w_other/w_cpu_nid. We could then use
+ * per-thread statistics all the time.
+ *
+ * But clearly with threads it's expected to get some sharing of
+ * memory. To avoid false sharing it's better to keep all threads of
+ * the same process in the same node (or if they don't fit in a single
+ * node, in as few nodes as possible). This is why we have to use
+ * process statistics in w_nid/w_other/w_cpu_nid when comparing
+ * threads of different processes. Why instead do we have to use
+ * thread statistics when comparing threads of the same process? This
+ * should be obvious if you're still reading (hint: the mm statistics
+ * are identical for threads of the same process). If some process
+ * doesn't fit in one node, the thread statistics will then distribute
+ * the threads to the best nodes within the group of nodes where the
+ * process is contained.
+ *
+ * False sharing in the above sentence (and generally in the AutoNUMA
+ * context) means virtual memory accessed simultaneously (or
+ * frequently) by threads running on CPUs of different nodes. This
+ * doesn't refer to shared memory as in tmpfs, but it refers to
+ * CLONE_VM instead. If the threads access the same memory from CPUs
+ * of different nodes it means the memory accesses will be NUMA local
+ * for some thread and NUMA remote for some other thread. The only way
+ * to avoid NUMA false sharing is to schedule all threads accessing
+ * the same memory in the same node (which may or may not be possible,
+ * if it's not possible because there aren't enough CPUs in the node,
+ * the threads should be scheduled in as few nodes as possible and the
+ * nodes distance should be the lowest possible).
+ *
+ * This is an example of the CPU layout after the startup of 2
+ * processes with 12 threads each. This is some of the logs you will
+ * find in `dmesg` after running:
+ *
+ *	echo 1 >/sys/kernel/mm/autonuma/debug
+ *
+ * nid is the node id
+ * mm is the pointer to the mm structure (kind of the "ID" of the process)
+ * nr is the number of threads belonging to that process in that node id.
+ *
+ * This dumps the raw content of the CPUs' runqueues; it doesn't show
+ * kernel threads (the kernel thread dumping the below stats is
+ * clearly using one CPU, hence only 23 CPUs are dumped; the debug
+ * mode can be improved but it's good enough to see what's going
+ * on).
+ *
+ * nid 0 mm ffff880433367b80 nr 6
+ * nid 0 mm ffff880433367480 nr 5
+ * nid 1 mm ffff880433367b80 nr 6
+ * nid 1 mm ffff880433367480 nr 6
+ *
+ * Now, the process with mm == ffff880433367b80 has 6 threads in node0
+ * and 6 threads in node1, while the process with mm ==
+ * ffff880433367480 has 5 threads in node0 and 6 threads running in
+ * node1.
+ *
+ * And after a few seconds it becomes:
+ *
+ * nid 0 mm ffff880433367b80 nr 12
+ * nid 1 mm ffff880433367480 nr 11
+ *
+ * Now, 12 threads of one process are running on node 0 and 11 threads
+ * of the other process are running on node 1.
+ *
+ * Before scanning all other CPUs' runqueues to compute the above
+ * math, we also verify that the current CPU is not already in the
+ * preferred NUMA node from the point of view of both the process
+ * statistics and the thread statistics. In that case we can return to
+ * the caller without having to check any other CPUs' runqueues
+ * because full convergence has been already reached.
+ *
+ * This algorithm might be expanded to take all runnable processes
+ * into account but examining just the currently running processes is
+ * a good enough approximation because some runnable processes may run
+ * only for a short time, so statistically there will always be a bias
+ * toward the processes that use most of the CPU. This is ideal
+ * because it doesn't matter if NUMA balancing isn't optimal for
+ * processes that run only for a short time.
+ *
+ * This function is invoked at the same frequency and in the same
+ * location as the CFS load balancer, and only if the CPU is not
+ * idle. The rest of the time we depend on CFS to keep sticking to the
+ * current CPU or to prioritize on the CPUs in the selected_nid
+ * (recorded in the task's autonuma_node field).
+ */
+void sched_autonuma_balance(void)
+{
+	int cpu, nid, selected_cpu, selected_nid, selected_nid_mm;
+	int cpu_nid = numa_node_id();
+	int this_cpu = smp_processor_id();
+	/*
+	 * w_t: node thread weight
+	 * w_t_t: total sum of all node thread weights
+	 * w_m: node mm/process weight
+	 * w_m_t: total sum of all node mm/process weights
+	 */
+	unsigned long w_t, w_t_t, w_m, w_m_t;
+	unsigned long w_t_max, w_m_max;
+	unsigned long weight_max, weight;
+	long s_w_nid = -1, s_w_cpu_nid = -1, s_w_other = -1;
+	int s_w_type = -1;
+	struct cpumask *allowed;
+	struct task_struct *p = current;
+	struct task_autonuma *task_autonuma = p->task_autonuma;
+	struct rq *rq;
+
+	/* per-cpu statically allocated in runqueues */
+	long *task_numa_weight;
+	long *mm_numa_weight;
+
+	if (!task_autonuma || !p->mm)
+		return;
+
+	if (!autonuma_enabled()) {
+		if (task_autonuma->autonuma_node != -1)
+			task_autonuma->autonuma_node = -1;
+		return;
+	}
+
+	allowed = tsk_cpus_allowed(p);
+
+	/*
+	 * Do nothing if the task had no numa hinting page faults yet
+	 * or if the mm hasn't been fully scanned by knuma_scand yet.
+	 */
+	w_t_t = task_autonuma->task_numa_fault_tot;
+	if (!w_t_t)
+		return;
+	w_m_t = ACCESS_ONCE(p->mm->mm_autonuma->mm_numa_fault_tot);
+	if (!w_m_t)
+		return;
+
+	/*
+	 * The below two arrays hold the NUMA affinity information of
+	 * the current process if scheduled in the "nid". This is task
+	 * local and mm local information. We compute this information
+	 * for all nodes.
+	 *
+	 * task/mm_numa_weight[nid] will become w_nid.
+	 * task/mm_numa_weight[cpu_nid] will become w_cpu_nid.
+	 */
+	rq = cpu_rq(this_cpu);
+	task_numa_weight = rq->task_numa_weight;
+	mm_numa_weight = rq->mm_numa_weight;
+
+	w_t_max = w_m_max = 0;
+	selected_nid = selected_nid_mm = -1;
+	for_each_online_node(nid) {
+		w_m = ACCESS_ONCE(p->mm->mm_autonuma->mm_numa_fault[nid]);
+		w_t = task_autonuma->task_numa_fault[nid];
+		if (w_m > w_m_t)
+			w_m_t = w_m;
+		mm_numa_weight[nid] = w_m*AUTONUMA_BALANCE_SCALE/w_m_t;
+		if (w_t > w_t_t)
+			w_t_t = w_t;
+		task_numa_weight[nid] = w_t*AUTONUMA_BALANCE_SCALE/w_t_t;
+		if (mm_numa_weight[nid] > w_m_max) {
+			w_m_max = mm_numa_weight[nid];
+			selected_nid_mm = nid;
+		}
+		if (task_numa_weight[nid] > w_t_max) {
+			w_t_max = task_numa_weight[nid];
+			selected_nid = nid;
+		}
+	}
+	/*
+	 * See if we already converged to skip the more expensive loop
+	 * below. Return if we can already predict here, with only
+	 * mm/task local information, that the below loop would
+	 * select the current cpu_nid.
+	 */
+	if (selected_nid == cpu_nid && selected_nid_mm == selected_nid) {
+		if (task_autonuma->autonuma_node != selected_nid)
+			task_autonuma->autonuma_node = selected_nid;
+		return;
+	}
+
+	selected_cpu = this_cpu;
+	selected_nid = cpu_nid;
+
+	weight = weight_max = 0;
+
+	/* check that the following raw_spin_lock_irq is safe */
+	BUG_ON(irqs_disabled());
+
+	for_each_online_node(nid) {
+		/*
+		 * Calculate the "weight" for all CPUs that the
+		 * current process is allowed to be migrated to,
+		 * except the CPUs of the current nid (it would be
+		 * worthless from a NUMA affinity standpoint to
+		 * migrate the task to another CPU of the current
+		 * node).
+		 */
+		if (nid == cpu_nid)
+			continue;
+		for_each_cpu_and(cpu, cpumask_of_node(nid), allowed) {
+			long w_nid, w_cpu_nid, w_other;
+			int w_type;
+			struct mm_struct *mm;
+			rq = cpu_rq(cpu);
+			if (!cpu_online(cpu))
+				continue;
+
+			if (idle_cpu(cpu))
+				/*
+				 * Offload the idle balancing
+				 * and physical / logical imbalances
+				 * to CFS.
+				 */
+				continue;
+
+			mm = rq->curr->mm;
+			if (!mm)
+				continue;
+			/*
+			 * Grab the w_m/w_t/w_m_t/w_t_t of the
+			 * processes running in the other CPUs to
+			 * compute w_other.
+			 */
+			raw_spin_lock_irq(&rq->lock);
+			/* recheck after implicit barrier() */
+			mm = rq->curr->mm;
+			if (!mm) {
+				raw_spin_unlock_irq(&rq->lock);
+				continue;
+			}
+			w_m_t = ACCESS_ONCE(mm->mm_autonuma->mm_numa_fault_tot);
+			w_t_t = rq->curr->task_autonuma->task_numa_fault_tot;
+			if (!w_m_t || !w_t_t) {
+				raw_spin_unlock_irq(&rq->lock);
+				continue;
+			}
+			w_m = ACCESS_ONCE(mm->mm_autonuma->mm_numa_fault[nid]);
+			w_t = rq->curr->task_autonuma->task_numa_fault[nid];
+			raw_spin_unlock_irq(&rq->lock);
+			/*
+			 * Generate the w_nid/w_cpu_nid from the
+			 * pre-computed mm/task_numa_weight[] and
+			 * compute w_other using the w_m/w_t info
+			 * collected from the other process.
+			 */
+			if (mm == p->mm) {
+				if (w_t > w_t_t)
+					w_t_t = w_t;
+				w_other = w_t*AUTONUMA_BALANCE_SCALE/w_t_t;
+				w_nid = task_numa_weight[nid];
+				w_cpu_nid = task_numa_weight[cpu_nid];
+				w_type = W_TYPE_THREAD;
+			} else {
+				if (w_m > w_m_t)
+					w_m_t = w_m;
+				w_other = w_m*AUTONUMA_BALANCE_SCALE/w_m_t;
+				w_nid = mm_numa_weight[nid];
+				w_cpu_nid = mm_numa_weight[cpu_nid];
+				w_type = W_TYPE_PROCESS;
+			}
+
+			/*
+			 * Finally check if there's a combined gain in
+			 * NUMA affinity. If there is and it's the
+			 * biggest weight seen so far, record its
+			 * weight and select this NUMA remote "cpu" as
+			 * candidate migration destination.
+			 */
+			if (w_nid > w_other && w_nid > w_cpu_nid) {
+				weight = w_nid - w_other + w_nid - w_cpu_nid;
+
+				if (weight > weight_max) {
+					weight_max = weight;
+					selected_cpu = cpu;
+					selected_nid = nid;
+
+					s_w_other = w_other;
+					s_w_nid = w_nid;
+					s_w_cpu_nid = w_cpu_nid;
+					s_w_type = w_type;
+				}
+			}
+		}
+	}
+
+	if (task_autonuma->autonuma_node != selected_nid)
+		task_autonuma->autonuma_node = selected_nid;
+	if (selected_cpu != this_cpu) {
+		if (autonuma_debug()) {
+			char *w_type_str = NULL;
+			switch (s_w_type) {
+			case W_TYPE_THREAD:
+				w_type_str = "thread";
+				break;
+			case W_TYPE_PROCESS:
+				w_type_str = "process";
+				break;
+			}
+			printk("%p %d - %dto%d - %dto%d - %ld %ld %ld - %s\n",
+			       p->mm, p->pid, cpu_nid, selected_nid,
+			       this_cpu, selected_cpu,
+			       s_w_other, s_w_nid, s_w_cpu_nid,
+			       w_type_str);
+		}
+		BUG_ON(cpu_nid == selected_nid);
+		goto found;
+	}
+
+	return;
+
+found:
+	rq = cpu_rq(this_cpu);
+
+	/*
+	 * autonuma_balance synchronizes accesses to
+	 * autonuma_balance_work. Once set, it's cleared by the
+	 * callback once the migration work is finished.
+	 */
+	raw_spin_lock_irq(&rq->lock);
+	if (rq->autonuma_balance) {
+		raw_spin_unlock_irq(&rq->lock);
+		return;
+	}
+	rq->autonuma_balance = true;
+	raw_spin_unlock_irq(&rq->lock);
+
+	rq->autonuma_balance_dst_cpu = selected_cpu;
+	rq->autonuma_balance_task = p;
+	get_task_struct(p);
+
+	stop_one_cpu_nowait(this_cpu,
+			    autonuma_balance_cpu_stop, rq,
+			    &rq->autonuma_balance_work);
+#ifdef __ia64__
+#error "NOTE: tlb_migrate_finish won't run here"
+#endif
+}
+
+/*
+ * The function sched_autonuma_can_migrate_task is called by CFS
+ * can_migrate_task() to prioritize on the task's autonuma_node. It is
+ * called during load_balancing, idle balancing and in general
+ * before any task CPU migration event happens.
+ *
+ * The caller first scans the CFS migration candidate tasks passing a
+ * non-zero numa parameter, to skip tasks without AutoNUMA affinity
+ * (according to the task's autonuma_node). If no task can be
+ * migrated in the first scan, a second scan is run with a zero numa
+ * parameter.
+ *
+ * If load_balance_strict is enabled, AutoNUMA will only allow
+ * migration of tasks for idle balancing purposes (the idle balancing
+ * of CFS is never altered by AutoNUMA). In non-strict mode the
+ * load balancing is not altered and the AutoNUMA affinity is
+ * disregarded in favor of higher fairness. The load_balance_strict
+ * knob is runtime tunable in sysfs.
+ *
+ * If load_balance_strict is enabled, it tends to partition the
+ * system. In turn it may reduce the scheduler fairness across NUMA
+ * nodes, but it should deliver higher global performance.
+ */
+bool sched_autonuma_can_migrate_task(struct task_struct *p,
+				     int numa, int dst_cpu,
+				     enum cpu_idle_type idle)
+{
+	if (!task_autonuma_cpu(p, dst_cpu)) {
+		if (numa)
+			return false;
+		if (autonuma_sched_load_balance_strict() &&
+		    idle != CPU_NEWLY_IDLE && idle != CPU_IDLE)
+			return false;
+	}
+	return true;
+}
+
+/*
+ * sched_autonuma_dump_mm is a purely debugging function called at
+ * regular intervals when /sys/kernel/mm/autonuma/debug is
+ * enabled. This prints in the kernel logs how the threads and
+ * processes are distributed in all NUMA nodes to easily check if the
+ * threads of the same processes are converging in the same
+ * nodes. This won't take into account kernel threads and because it
+ * runs itself from a kernel thread it won't show what was running in
+ * the current CPU, but it's simple and good enough to get what we
+ * need in the debug logs. This function can be disabled or deleted
+ * later.
+ */
+void sched_autonuma_dump_mm(void)
+{
+	int nid, cpu;
+	cpumask_var_t x;
+
+	if (!alloc_cpumask_var(&x, GFP_KERNEL))
+		return;
+	cpumask_setall(x);
+	for_each_online_node(nid) {
+		for_each_cpu(cpu, cpumask_of_node(nid)) {
+			struct rq *rq = cpu_rq(cpu);
+			struct mm_struct *mm = rq->curr->mm;
+			int nr = 0, cpux;
+			if (!cpumask_test_cpu(cpu, x))
+				continue;
+			for_each_cpu(cpux, cpumask_of_node(nid)) {
+				struct rq *rqx = cpu_rq(cpux);
+				if (rqx->curr->mm == mm) {
+					nr++;
+					cpumask_clear_cpu(cpux, x);
+				}
+			}
+			printk("nid %d mm %p nr %d\n", nid, mm, nr);
+		}
+	}
+	free_cpumask_var(x);
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6d52cea..e5b7ae9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -463,6 +463,24 @@ struct rq {
 #ifdef CONFIG_SMP
 	struct llist_head wake_list;
 #endif
+#ifdef CONFIG_AUTONUMA
+	/*
+	 * Per-cpu arrays to compute the per-thread and per-process
+	 * statistics. Allocated statically to avoid overflowing the
+	 * stack with large MAX_NUMNODES values.
+	 *
+	 * FIXME: allocate dynamically and with num_possible_nodes()
+	 * array sizes only if autonuma is not impossible, to save
+	 * some dozen KB of RAM when booting on not NUMA (or small
+	 * NUMA) systems.
+	 */
+	long task_numa_weight[MAX_NUMNODES];
+	long mm_numa_weight[MAX_NUMNODES];
+	bool autonuma_balance;
+	int autonuma_balance_dst_cpu;
+	struct task_struct *autonuma_balance_task;
+	struct cpu_stop_work autonuma_balance_work;
+#endif
 };
 
 static inline int cpu_of(struct rq *rq)

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH 14/40] autonuma: add page structure fields
  2012-06-28 12:55 ` Andrea Arcangeli
@ 2012-06-28 12:55   ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 12:55 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 64bit archs, 20 bytes are used for async memory migration (specific
to the knuma_migrated per-node threads), and 4 bytes are used for the
thread NUMA false sharing detection logic.

This is a bad implementation due to lack of time to do a proper one.

These new AutoNUMA fields must be moved to the pgdat, like memcg
does, so that they're only allocated at boot time if the kernel is
booted on NUMA hardware, and so that they're not allocated at all,
even on NUMA hardware, if "noautonuma" is passed as a boot parameter
to the kernel.
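
For reference, on 64bit the 20 + 4 bytes above break down as follows
(a sketch of the accounting, assuming 8-byte pointers and 4-byte
ints, with the fields taken from the diff below):

	/* async memory migration (knuma_migrated): 16 + 4 = 20 bytes */
	struct list_head autonuma_migrate_node;	/* two pointers: 16 bytes */
	int autonuma_migrate_nid;		/* 4 bytes */

	/* thread NUMA false sharing detection: 4 bytes */
	int autonuma_last_nid;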

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/mm_types.h |   26 ++++++++++++++++++++++++++
 1 files changed, 26 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index f0c6379..d1248cf 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -136,6 +136,32 @@ struct page {
 		struct page *first_page;	/* Compound tail pages */
 	};
 
+#ifdef CONFIG_AUTONUMA
+	/*
+	 * FIXME: move to pgdat section along with the memcg and allocate
+	 * at runtime only in presence of a numa system.
+	 */
+	/*
+	 * To modify autonuma_last_nid locklessly, the architecture
+	 * needs SMP atomic granularity < sizeof(long); not all archs
+	 * have that, notably some ancient alphas (but none of those
+	 * should run in NUMA systems). Archs without it require
+	 * autonuma_last_nid to be a long.
+	 */
+#if BITS_PER_LONG > 32
+	int autonuma_migrate_nid;
+	int autonuma_last_nid;
+#else
+#if MAX_NUMNODES >= 32768
+#error "too many nodes"
+#endif
+	/* FIXME: remember to check the updates are atomic */
+	short autonuma_migrate_nid;
+	short autonuma_last_nid;
+#endif
+	struct list_head autonuma_migrate_node;
+#endif
+
 	/*
 	 * On machines where all RAM is mapped into kernel address space,
 	 * we can simply calculate the virtual address. On machines with

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH 15/40] autonuma: knuma_migrated per NUMA node queues
  2012-06-28 12:55 ` Andrea Arcangeli
@ 2012-06-28 12:55   ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 12:55 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

This implements the knuma_migrated queues. Pages are added to these
queues through the NUMA hinting page faults (memory-follows-CPU
algorithm with false sharing evaluation), and knuma_migrated is then
woken with a certain hysteresis to migrate the memory in round robin
from all remote nodes to its local node.

The list head that belongs to the local node that knuma_migrated runs
on must be empty for now and is not being used.
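
A rough sketch of how these queues are meant to be used by the NUMA
hinting page fault path (hypothetical helper: the real enqueue and
wakeup code lives in mm/autonuma.c and is not part of this patch, and
the hysteresis threshold below is made up for illustration):

	#define AUTONUMA_MIGRATE_HYSTERESIS 256	/* made-up threshold */

	static void autonuma_enqueue_migrate_page(struct page *page, int dst_nid)
	{
		pg_data_t *pgdat = NODE_DATA(dst_nid);

		spin_lock(&pgdat->autonuma_lock);
		/* queue the page on the destination node, keyed by source node */
		page->autonuma_migrate_nid = dst_nid;
		list_add_tail(&page->autonuma_migrate_node,
			      &pgdat->autonuma_migrate_head[page_to_nid(page)]);
		pgdat->autonuma_nr_migrate_pages++;
		spin_unlock(&pgdat->autonuma_lock);

		/* wake knuma_migrated only with some hysteresis */
		if (pgdat->autonuma_nr_migrate_pages > AUTONUMA_MIGRATE_HYSTERESIS)
			wake_up_interruptible(&pgdat->autonuma_knuma_migrated_wait);
	}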

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/mmzone.h |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2427706..d53b26a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -697,6 +697,12 @@ typedef struct pglist_data {
 	struct task_struct *kswapd;
 	int kswapd_max_order;
 	enum zone_type classzone_idx;
+#ifdef CONFIG_AUTONUMA
+	spinlock_t autonuma_lock;
+	struct list_head autonuma_migrate_head[MAX_NUMNODES];
+	unsigned long autonuma_nr_migrate_pages;
+	wait_queue_head_t autonuma_knuma_migrated_wait;
+#endif
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH 16/40] autonuma: init knuma_migrated queues
  2012-06-28 12:55 ` Andrea Arcangeli
@ 2012-06-28 12:55   ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 12:55 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Initialize the knuma_migrated queues at boot time.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/page_alloc.c |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a9710a4..48eabe9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -59,6 +59,7 @@
 #include <linux/prefetch.h>
 #include <linux/migrate.h>
 #include <linux/page-debug-flags.h>
+#include <linux/autonuma.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -4348,8 +4349,18 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 	int nid = pgdat->node_id;
 	unsigned long zone_start_pfn = pgdat->node_start_pfn;
 	int ret;
+#ifdef CONFIG_AUTONUMA
+	int node_iter;
+#endif
 
 	pgdat_resize_init(pgdat);
+#ifdef CONFIG_AUTONUMA
+	spin_lock_init(&pgdat->autonuma_lock);
+	init_waitqueue_head(&pgdat->autonuma_knuma_migrated_wait);
+	pgdat->autonuma_nr_migrate_pages = 0;
+	for_each_node(node_iter)
+		INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
+#endif
 	pgdat->nr_zones = 0;
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	pgdat->kswapd_max_order = 0;

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH 17/40] autonuma: autonuma_enter/exit
  2012-06-28 12:55 ` Andrea Arcangeli
@ 2012-06-28 12:55   ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 12:55 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

The first gear in the whole AutoNUMA algorithm is knuma_scand. If
knuma_scand doesn't run, AutoNUMA is completely bypassed. If
knuma_scand is stopped, all other AutoNUMA gears will soon settle
down too.

knuma_scand is the daemon that sets pmd_numa and pte_numa and so
allows the NUMA hinting page faults to start; all other actions then
follow as a reaction to that.

knuma_scand scans a list of "mm" structures, and this is where we
register and unregister each "mm" with AutoNUMA so that knuma_scand
can scan it.
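
A minimal sketch of what the two hooks are expected to do
(hypothetical: the real implementation is in mm/autonuma.c, not in
this patch, and the list, lock and mm_node field names below are made
up for illustration):

	static LIST_HEAD(knumad_scan_list);	/* made-up scan list name */
	static DEFINE_SPINLOCK(knumad_mm_lock);	/* made-up lock name */

	void autonuma_enter(struct mm_struct *mm)
	{
		spin_lock(&knumad_mm_lock);
		/* register the mm so knuma_scand will walk its page tables */
		list_add_tail(&mm->mm_autonuma->mm_node, &knumad_scan_list);
		spin_unlock(&knumad_mm_lock);
	}

	void autonuma_exit(struct mm_struct *mm)
	{
		spin_lock(&knumad_mm_lock);
		/* unregister before the address space goes away */
		list_del(&mm->mm_autonuma->mm_node);
		spin_unlock(&knumad_mm_lock);
	}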

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/fork.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 5fcfa70..d3c064c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -70,6 +70,7 @@
 #include <linux/khugepaged.h>
 #include <linux/signalfd.h>
 #include <linux/uprobes.h>
+#include <linux/autonuma.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -540,6 +541,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
 		mmu_notifier_mm_init(mm);
+		autonuma_enter(mm);
 		return mm;
 	}
 
@@ -608,6 +610,7 @@ void mmput(struct mm_struct *mm)
 		exit_aio(mm);
 		ksm_exit(mm);
 		khugepaged_exit(mm); /* must run before exit_mmap */
+		autonuma_exit(mm); /* must run before exit_mmap */
 		exit_mmap(mm);
 		set_mm_exe_file(mm, NULL);
 		if (!list_empty(&mm->mmlist)) {

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH 18/40] autonuma: call autonuma_setup_new_exec()
  2012-06-28 12:55 ` Andrea Arcangeli
@ 2012-06-28 12:55   ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 12:55 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

This resets all per-thread and per-process statistics across exec
syscalls or after a kernel thread detaches from the mm. The past
statistical NUMA information is unlikely to be relevant for the
future in these cases.
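
The per-thread part of that reset can be pictured like this (a sketch
only: the real autonuma_setup_new_exec() lives in mm/autonuma.c and
also resets the per-process mm_autonuma statistics; the size of the
task_numa_fault array is an assumption):

	void autonuma_setup_new_exec(struct task_struct *p)
	{
		struct task_autonuma *ta = p->task_autonuma;

		if (!ta)
			return;

		/* forget the stale NUMA placement and fault history */
		ta->autonuma_node = -1;
		ta->task_numa_fault_tot = 0;
		memset(ta->task_numa_fault, 0,
		       num_possible_nodes() * sizeof(ta->task_numa_fault[0]));
	}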

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/exec.c        |    3 +++
 mm/mmu_context.c |    2 ++
 2 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index da27b91..146ced2 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
 #include <linux/pipe_fs_i.h>
 #include <linux/oom.h>
 #include <linux/compat.h>
+#include <linux/autonuma.h>
 
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -1172,6 +1173,8 @@ void setup_new_exec(struct linux_binprm * bprm)
 			
 	flush_signal_handlers(current, 0);
 	flush_old_files(current->files);
+
+	autonuma_setup_new_exec(current);
 }
 EXPORT_SYMBOL(setup_new_exec);
 
diff --git a/mm/mmu_context.c b/mm/mmu_context.c
index 3dcfaf4..40f0f13 100644
--- a/mm/mmu_context.c
+++ b/mm/mmu_context.c
@@ -7,6 +7,7 @@
 #include <linux/mmu_context.h>
 #include <linux/export.h>
 #include <linux/sched.h>
+#include <linux/autonuma.h>
 
 #include <asm/mmu_context.h>
 
@@ -58,5 +59,6 @@ void unuse_mm(struct mm_struct *mm)
 	/* active_mm is still 'mm' */
 	enter_lazy_tlb(mm, tsk);
 	task_unlock(tsk);
+	autonuma_setup_new_exec(tsk);
 }
 EXPORT_SYMBOL_GPL(unuse_mm);

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH 19/40] autonuma: alloc/free/init sched_autonuma
  2012-06-28 12:55 ` Andrea Arcangeli
@ 2012-06-28 12:55   ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 12:55 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

This is where the dynamically allocated sched_autonuma structure is
handled.

The reason for keeping this outside of the task_struct, besides not
enlarging the task_struct, is to only allocate it on NUMA
hardware. Non-NUMA hardware then only pays the memory cost of a
pointer in the task_struct (which remains NULL at all times in that
case).

If the kernel is compiled with CONFIG_AUTONUMA=n, not even the
pointer is added to the task_struct, of course.
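
Judging from how the scheduler code earlier in this series uses it,
the structure looks roughly like this (a sketch: the real definition
is in include/linux/autonuma_types.h, which is not part of this
patch, and the flexible array member is an assumption):

	struct task_autonuma {
		int autonuma_node;		/* preferred node, -1 if none */
		/* per-thread NUMA hinting page fault statistics */
		unsigned long task_numa_fault_tot;
		unsigned long task_numa_fault[];	/* one counter per node */
	};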

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/fork.c |   24 ++++++++++++++----------
 1 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index d3c064c..0adbe09 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -206,6 +206,7 @@ static void account_kernel_stack(struct thread_info *ti, int account)
 void free_task(struct task_struct *tsk)
 {
 	account_kernel_stack(tsk->stack, -1);
+	free_task_autonuma(tsk);
 	free_thread_info(tsk->stack);
 	rt_mutex_debug_task_free(tsk);
 	ftrace_graph_exit_task(tsk);
@@ -260,6 +261,8 @@ void __init fork_init(unsigned long mempages)
 	/* do the arch specific task caches init */
 	arch_task_cache_init();
 
+	task_autonuma_init();
+
 	/*
 	 * The default maximum number of threads is set to a safe
 	 * value: the thread structures can take up at most half
@@ -292,21 +295,21 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
 	struct thread_info *ti;
 	unsigned long *stackend;
 	int node = tsk_fork_get_node(orig);
-	int err;
 
 	tsk = alloc_task_struct_node(node);
-	if (!tsk)
+	if (unlikely(!tsk))
 		return NULL;
 
 	ti = alloc_thread_info_node(tsk, node);
-	if (!ti) {
-		free_task_struct(tsk);
-		return NULL;
-	}
+	if (unlikely(!ti))
+		goto out_task_struct;
 
-	err = arch_dup_task_struct(tsk, orig);
-	if (err)
-		goto out;
+	if (unlikely(arch_dup_task_struct(tsk, orig)))
+		goto out_thread_info;
+
+	if (unlikely(alloc_task_autonuma(tsk, orig, node)))
+		/* free_thread_info() undoes arch_dup_task_struct() too */
+		goto out_thread_info;
 
 	tsk->stack = ti;
 
@@ -334,8 +337,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
 
 	return tsk;
 
-out:
+out_thread_info:
 	free_thread_info(ti);
+out_task_struct:
 	free_task_struct(tsk);
 	return NULL;
 }

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH 20/40] autonuma: alloc/free/init mm_autonuma
  2012-06-28 12:55 ` Andrea Arcangeli
@ 2012-06-28 12:56   ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 12:56 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

This is where the mm_autonuma structure is handled. Just like
sched_autonuma, it is only allocated at runtime if the hardware the
kernel is running on has been detected as NUMA. On non-NUMA hardware
the memory cost is reduced to one pointer per mm.

To get rid of that pointer in each mm, the kernel can be compiled
with CONFIG_AUTONUMA=n.
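
As with task_autonuma, the fields referenced by the scheduler code
suggest a layout roughly like this (a sketch: the real definition is
in include/linux/autonuma_types.h, not in this patch, and both the
back pointer and the flexible array member are assumptions):

	struct mm_autonuma {
		struct mm_struct *mm;		/* back pointer for knuma_scand */
		/* per-process NUMA hinting page fault statistics */
		unsigned long mm_numa_fault_tot;
		unsigned long mm_numa_fault[];	/* one counter per node */
	};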

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/fork.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 0adbe09..3e5a0d9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -527,6 +527,8 @@ static void mm_init_aio(struct mm_struct *mm)
 
 static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 {
+	if (unlikely(alloc_mm_autonuma(mm)))
+		goto out_free_mm;
 	atomic_set(&mm->mm_users, 1);
 	atomic_set(&mm->mm_count, 1);
 	init_rwsem(&mm->mmap_sem);
@@ -549,6 +551,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 		return mm;
 	}
 
+	free_mm_autonuma(mm);
+out_free_mm:
 	free_mm(mm);
 	return NULL;
 }
@@ -598,6 +602,7 @@ void __mmdrop(struct mm_struct *mm)
 	destroy_context(mm);
 	mmu_notifier_mm_destroy(mm);
 	check_mm(mm);
+	free_mm_autonuma(mm);
 	free_mm(mm);
 }
 EXPORT_SYMBOL_GPL(__mmdrop);
@@ -880,6 +885,7 @@ fail_nocontext:
 	 * If init_new_context() failed, we cannot use mmput() to free the mm
 	 * because it calls destroy_context()
 	 */
+	free_mm_autonuma(mm);
 	mm_free_pgd(mm);
 	free_mm(mm);
 	return NULL;
@@ -1702,6 +1708,7 @@ void __init proc_caches_init(void)
 	mm_cachep = kmem_cache_create("mm_struct",
 			sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_NOTRACK, NULL);
+	mm_autonuma_init();
 	vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC);
 	mmap_init();
 	nsproxy_cache_init();

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH 21/40] autonuma: avoid CFS select_task_rq_fair to return -1
  2012-06-28 12:55 ` Andrea Arcangeli
@ 2012-06-28 12:56   ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 12:56 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Fix select_task_rq_fair() to avoid returning -1.

Includes fixes from Hillf Danton <dhillf@gmail.com>.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/sched/fair.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c099cc6..fa96810 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2789,6 +2789,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		if (new_cpu == -1 || new_cpu == cpu) {
 			/* Now try balancing at a lower domain level of cpu */
 			sd = sd->child;
+			if (new_cpu < 0)
+				/* Return prev_cpu if find_idlest_cpu failed */
+				new_cpu = prev_cpu;
 			continue;
 		}
 
@@ -2807,6 +2810,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 unlock:
 	rcu_read_unlock();
 
+	BUG_ON(new_cpu < 0);
 	return new_cpu;
 }
 #endif /* CONFIG_SMP */
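
For context on where the -1 comes from: find_idlest_cpu() starts with
idlest = -1 and only updates it for CPUs that pass all its filters, so
a group in which every candidate is rejected (for instance by the
AutoNUMA affinity check added to it in the next patch) returns -1. A
simplified sketch of that helper, not a verbatim copy of the kernel
function:

/* Simplified sketch of find_idlest_cpu() with the AutoNUMA filter */
static int find_idlest_cpu(struct sched_group *group,
			   struct task_struct *p, int this_cpu)
{
	unsigned long load, min_load = ULONG_MAX;
	int idlest = -1;
	int i;

	/* Traverse only the allowed CPUs */
	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
		load = weighted_cpuload(i);
		if (load < min_load || (load == min_load && i == this_cpu)) {
			if (!task_autonuma_cpu(p, i))
				continue;	/* may reject every CPU */
			min_load = load;
			idlest = i;
		}
	}

	return idlest;	/* -1 if no CPU was accepted */
}

With the hunk above applied, such a -1 simply falls back to prev_cpu
instead of leaking out of select_task_rq_fair(), and the BUG_ON
documents the invariant.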

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH 22/40] autonuma: teach CFS about autonuma affinity
  2012-06-28 12:55 ` Andrea Arcangeli
@ 2012-06-28 12:56   ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 12:56 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

The CFS scheduler is still in charge of all scheduling decisions;
AutoNUMA balancing will at times override them. But generally we just
rely on the CFS scheduler to keep doing its thing, while preferring
the autonuma affine nodes when deciding to move a process to a
different runqueue or when waking it up.

For example, the idle balancing will look into the runqueues of the
busy CPUs, but it will first search for a task that wants to run on
the idle CPU in AutoNUMA terms (task_autonuma_cpu() being true).

Most of this is encoded in can_migrate_task becoming AutoNUMA aware
and running two passes for each balancing pass: the first pass NUMA
aware, and the second one relaxed.
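
The two-pass structure boils down to the pattern below, distilled and
simplified (the throttling check is omitted) from the move_one_task()
and move_tasks() hunks in the diff; LBF_NUMA and
sched_autonuma_can_migrate_task() are introduced by this patch:

static int move_one_task(struct lb_env *env)
{
	struct task_struct *p, *n;

	env->flags |= LBF_NUMA;
numa_repeat:
	list_for_each_entry_safe(p, n, &env->src_rq->cfs_tasks, se.group_node) {
		/*
		 * While LBF_NUMA is set, can_migrate_task() also asks
		 * sched_autonuma_can_migrate_task() whether env->dst_cpu
		 * belongs to the task's AutoNUMA affine node.
		 */
		if (!can_migrate_task(p, env))
			continue;
		move_task(p, env);
		env->flags &= ~LBF_NUMA;
		return 1;
	}
	if (env->flags & LBF_NUMA) {
		/* nothing affine was found: relax the filter, second pass */
		env->flags &= ~LBF_NUMA;
		goto numa_repeat;
	}
	return 0;
}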

The idle/newidle balancing is always allowed to fall back to
non-affine AutoNUMA tasks. The load balancing (which is more a
fairness than a performance issue) is instead only able to cross over
the AutoNUMA affinity if the flag controlled by
/sys/kernel/mm/autonuma/scheduler/load_balance_strict is not set (it
is set by default).

Tasks that haven't been fully profiled yet are not affected by this,
because their p->sched_autonuma->autonuma_node is still set to the
original value of -1 and task_autonuma_cpu will always return true in
that case.
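
For reference, a sketch of the affinity test this relies on. The real
helper is defined by the AutoNUMA scheduler code elsewhere in this
series (where the per-task structure is called task_autonuma rather
than sched_autonuma), so the field names and the exact body below are
assumptions:

/* Sketch only: not the exact implementation from this series */
static inline bool task_autonuma_cpu(struct task_struct *p, int cpu)
{
	int autonuma_node;

	if (!p->task_autonuma)		/* no AutoNUMA data allocated */
		return true;
	autonuma_node = ACCESS_ONCE(p->task_autonuma->autonuma_node);
	/* -1 means "not profiled yet", so any CPU is acceptable */
	return autonuma_node == -1 || autonuma_node == cpu_to_node(cpu);
}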

Includes fixes from Hillf Danton <dhillf@gmail.com>.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/sched/fair.c |   65 +++++++++++++++++++++++++++++++++++++++++++-------
 1 files changed, 56 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fa96810..dab9bdd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -26,6 +26,7 @@
 #include <linux/slab.h>
 #include <linux/profile.h>
 #include <linux/interrupt.h>
+#include <linux/autonuma_sched.h>
 
 #include <trace/events/sched.h>
 
@@ -2621,6 +2622,8 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 		load = weighted_cpuload(i);
 
 		if (load < min_load || (load == min_load && i == this_cpu)) {
+			if (!task_autonuma_cpu(p, i))
+				continue;
 			min_load = load;
 			idlest = i;
 		}
@@ -2639,24 +2642,27 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	struct sched_domain *sd;
 	struct sched_group *sg;
 	int i;
+	bool idle_target;
 
 	/*
 	 * If the task is going to be woken-up on this cpu and if it is
 	 * already idle, then it is the right target.
 	 */
-	if (target == cpu && idle_cpu(cpu))
+	if (target == cpu && idle_cpu(cpu) && task_autonuma_cpu(p, cpu))
 		return cpu;
 
 	/*
 	 * If the task is going to be woken-up on the cpu where it previously
 	 * ran and if it is currently idle, then it the right target.
 	 */
-	if (target == prev_cpu && idle_cpu(prev_cpu))
+	if (target == prev_cpu && idle_cpu(prev_cpu) &&
+	    task_autonuma_cpu(p, prev_cpu))
 		return prev_cpu;
 
 	/*
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
+	idle_target = false;
 	sd = rcu_dereference(per_cpu(sd_llc, target));
 	for_each_lower_domain(sd) {
 		sg = sd->groups;
@@ -2670,9 +2676,18 @@ static int select_idle_sibling(struct task_struct *p, int target)
 					goto next;
 			}
 
-			target = cpumask_first_and(sched_group_cpus(sg),
-					tsk_cpus_allowed(p));
-			goto done;
+			for_each_cpu_and(i, sched_group_cpus(sg),
+						tsk_cpus_allowed(p)) {
+				/* Find autonuma cpu only in idle group */
+				if (task_autonuma_cpu(p, i)) {
+					target = i;
+					goto done;
+				}
+				if (!idle_target) {
+					idle_target = true;
+					target = i;
+				}
+			}
 next:
 			sg = sg->next;
 		} while (sg != sd->groups);
@@ -2707,7 +2722,8 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		return prev_cpu;
 
 	if (sd_flag & SD_BALANCE_WAKE) {
-		if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+		if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) &&
+		    task_autonuma_cpu(p, cpu))
 			want_affine = 1;
 		new_cpu = prev_cpu;
 	}
@@ -3072,6 +3088,7 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
 
 #define LBF_ALL_PINNED	0x01
 #define LBF_NEED_BREAK	0x02
+#define LBF_NUMA	0x04
 
 struct lb_env {
 	struct sched_domain	*sd;
@@ -3142,13 +3159,14 @@ static
 int can_migrate_task(struct task_struct *p, struct lb_env *env)
 {
 	int tsk_cache_hot = 0;
+	struct cpumask *allowed = tsk_cpus_allowed(p);
 	/*
 	 * We do not migrate tasks that are:
 	 * 1) running (obviously), or
 	 * 2) cannot be migrated to this CPU due to cpus_allowed, or
 	 * 3) are cache-hot on their current CPU.
 	 */
-	if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p))) {
+	if (!cpumask_test_cpu(env->dst_cpu, allowed)) {
 		schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
 		return 0;
 	}
@@ -3159,6 +3177,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 		return 0;
 	}
 
+	if (!sched_autonuma_can_migrate_task(p, env->flags & LBF_NUMA,
+					     env->dst_cpu, env->idle))
+		return 0;
+
 	/*
 	 * Aggressive migration if:
 	 * 1) task is cache cold, or
@@ -3195,6 +3217,8 @@ static int move_one_task(struct lb_env *env)
 {
 	struct task_struct *p, *n;
 
+	env->flags |= LBF_NUMA;
+numa_repeat:
 	list_for_each_entry_safe(p, n, &env->src_rq->cfs_tasks, se.group_node) {
 		if (throttled_lb_pair(task_group(p), env->src_rq->cpu, env->dst_cpu))
 			continue;
@@ -3209,8 +3233,14 @@ static int move_one_task(struct lb_env *env)
 		 * stats here rather than inside move_task().
 		 */
 		schedstat_inc(env->sd, lb_gained[env->idle]);
+		env->flags &= ~LBF_NUMA;
 		return 1;
 	}
+	if (env->flags & LBF_NUMA) {
+		env->flags &= ~LBF_NUMA;
+		goto numa_repeat;
+	}
+
 	return 0;
 }
 
@@ -3235,6 +3265,8 @@ static int move_tasks(struct lb_env *env)
 	if (env->imbalance <= 0)
 		return 0;
 
+	env->flags |= LBF_NUMA;
+numa_repeat:
 	while (!list_empty(tasks)) {
 		p = list_first_entry(tasks, struct task_struct, se.group_node);
 
@@ -3274,9 +3306,13 @@ static int move_tasks(struct lb_env *env)
 		 * kernels will stop after the first task is pulled to minimize
 		 * the critical section.
 		 */
-		if (env->idle == CPU_NEWLY_IDLE)
-			break;
+		if (env->idle == CPU_NEWLY_IDLE) {
+			env->flags &= ~LBF_NUMA;
+			goto out;
+		}
 #endif
+		/* not idle anymore after pulling first task */
+		env->idle = CPU_NOT_IDLE;
 
 		/*
 		 * We only want to steal up to the prescribed amount of
@@ -3289,6 +3325,17 @@ static int move_tasks(struct lb_env *env)
 next:
 		list_move_tail(&p->se.group_node, tasks);
 	}
+	if ((env->flags & (LBF_NUMA|LBF_NEED_BREAK)) == LBF_NUMA) {
+		env->flags &= ~LBF_NUMA;
+		if (env->imbalance > 0) {
+			env->loop = 0;
+			env->loop_break = sched_nr_migrate_break;
+			goto numa_repeat;
+		}
+	}
+#ifdef CONFIG_PREEMPT
+out:
+#endif
 
 	/*
 	 * Right now, this is one of only two places move_task() is called,

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH 23/40] autonuma: sched_set_autonuma_need_balance
  2012-06-28 12:55 ` Andrea Arcangeli
@ 2012-06-28 12:56   ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 12:56 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Invoke autonuma_balance only on the busy CPUs, at the same frequency
as the CFS load balancing.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/sched/fair.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dab9bdd..ff288c0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4906,6 +4906,9 @@ static void run_rebalance_domains(struct softirq_action *h)
 
 	rebalance_domains(this_cpu, idle);
 
+	if (!this_rq->idle_balance)
+		sched_set_autonuma_need_balance();
+
 	/*
 	 * If this cpu has a pending nohz_balance_kick, then do the
 	 * balancing on behalf of the other idle cpus whose ticks are

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH 24/40] autonuma: core
  2012-06-28 12:55 ` Andrea Arcangeli
@ 2012-06-28 12:56   ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 12:56 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

This implements knuma_scand, the NUMA hinting faults started by
knuma_scand, the knuma_migrated daemons that migrate the memory
queued by the NUMA hinting faults, the statistics gathering code
(done by knuma_scand for the mm_autonuma and by the NUMA hinting page
faults for the sched_autonuma), and most of the rest of the AutoNUMA
core logic, like the false sharing detection, sysfs and
initialization routines.

When knuma_scand is not running, the AutoNUMA algorithm is a full
bypass and it must not alter the runtime of the memory management and
the scheduler.

The whole AutoNUMA logic is a chain reaction resulting from the
actions of knuma_scand. The various parts of the code can be
described as different gears (gears as in glxgears).

knuma_scand is the first gear: it collects the per-process
mm_autonuma statistics and, at the same time, marks the ptes/pmds it
scans as pte_numa and pmd_numa.

The second gear is the NUMA hinting page faults. These are triggered
by the pte_numa/pmd_numa ptes/pmds. They collect the sched_autonuma
per-thread statistics. They also implement the memory-follow-CPU
logic, which tracks whether pages are repeatedly accessed by remote
nodes. The memory-follow-CPU logic can decide to migrate pages across
different NUMA nodes by queuing the pages for migration in the
per-node knuma_migrated queues.
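
The false sharing filter boils down to requiring two NUMA hinting
faults in a row from the same node before a page is queued for
migration; a rough sketch of what last_nid_set() and
numa_hinting_fault_memory_follow_cpu() below do (simplified; this_nid
stands for the numa_node_id() of the faulting CPU, and the real code
also drops a stale migration request when the node changes):

/*
 * page->autonuma_last_nid remembers which node last faulted on the
 * page. Only when the current faulting node matches it is the page
 * considered for migration; a single fault from a new node just
 * updates the record, so pages bouncing between nodes stay put.
 */
if (ACCESS_ONCE(page->autonuma_last_nid) != this_nid)
	ACCESS_ONCE(page->autonuma_last_nid) = this_nid;	/* first hit */
else if (this_nid != page_to_nid(page))
	autonuma_migrate_page_add(page, this_nid, page_to_nid(page));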

The third gear is knuma_migrated. There is one knuma_migrated daemon
per node. Pages pending for migration are queued in a matrix of
lists. Each knuma_migrated (in parallel with the others) goes over
those lists and migrates, in round robin from each source node, the
pages queued for migration to the node that knuma_migrated is running
on.
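
The "matrix of lists" is indexed by destination and source node; a
sketch of the layout as used by the queueing and isolation code in
the diff below (the autonuma_migrate_head/autonuma_nr_migrate_pages
fields are added to struct pglist_data by another patch in this
series):

/*
 * For a page that currently lives on node "src" and should move to
 * node "dst", the queueing side does roughly:
 *
 *	list_add(&page->autonuma_migrate_node,
 *		 &NODE_DATA(dst)->autonuma_migrate_head[src]);
 *	NODE_DATA(dst)->autonuma_nr_migrate_pages += hpage_nr_pages(page);
 *
 * and knuma_migrated running on node "dst" walks
 * autonuma_migrate_head[src] for every src != dst in round robin,
 * isolating the pages and handing them to migrate_pages() with "dst"
 * as the allocation target (see alloc_migrate_dst_page()).
 */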

The fourth gear is the NUMA scheduler balancing code. It evaluates
the statistical information collected in mm->mm_autonuma and
p->sched_autonuma, together with the status of all CPUs, to decide if
tasks should be migrated to CPUs in remote nodes.

The code includes fixes from Hillf Danton <dhillf@gmail.com>.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/autonuma.c | 1491 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 1491 insertions(+), 0 deletions(-)
 create mode 100644 mm/autonuma.c

diff --git a/mm/autonuma.c b/mm/autonuma.c
new file mode 100644
index 0000000..f44272b
--- /dev/null
+++ b/mm/autonuma.c
@@ -0,0 +1,1491 @@
+/*
+ *  Copyright (C) 2012  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ *
+ *  Boot with "numa=fake=2" to test on not NUMA systems.
+ */
+
+#include <linux/mm.h>
+#include <linux/rmap.h>
+#include <linux/kthread.h>
+#include <linux/mmu_notifier.h>
+#include <linux/freezer.h>
+#include <linux/mm_inline.h>
+#include <linux/migrate.h>
+#include <linux/swap.h>
+#include <linux/autonuma.h>
+#include <asm/tlbflush.h>
+#include <asm/pgtable.h>
+
+unsigned long autonuma_flags __read_mostly =
+	(1<<AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG)|
+	(1<<AUTONUMA_SCHED_CLONE_RESET_FLAG)|
+	(1<<AUTONUMA_SCHED_FORK_RESET_FLAG)|
+#ifdef CONFIG_AUTONUMA_DEFAULT_ENABLED
+	(1<<AUTONUMA_FLAG)|
+#endif
+	(1<<AUTONUMA_SCAN_PMD_FLAG);
+
+static DEFINE_MUTEX(knumad_mm_mutex);
+
+/* knuma_scand */
+static unsigned int scan_sleep_millisecs __read_mostly = 100;
+static unsigned int scan_sleep_pass_millisecs __read_mostly = 5000;
+static unsigned int pages_to_scan __read_mostly = 128*1024*1024/PAGE_SIZE;
+static DECLARE_WAIT_QUEUE_HEAD(knuma_scand_wait);
+static unsigned long full_scans;
+static unsigned long pages_scanned;
+
+/* knuma_migrated */
+static unsigned int migrate_sleep_millisecs __read_mostly = 100;
+static unsigned int pages_to_migrate __read_mostly = 128*1024*1024/PAGE_SIZE;
+static volatile unsigned long pages_migrated;
+
+static struct knumad_scan {
+	struct list_head mm_head;
+	struct mm_struct *mm;
+	unsigned long address;
+} knumad_scan = {
+	.mm_head = LIST_HEAD_INIT(knumad_scan.mm_head),
+};
+
+static inline bool autonuma_impossible(void)
+{
+	return num_possible_nodes() <= 1 ||
+		test_bit(AUTONUMA_IMPOSSIBLE_FLAG, &autonuma_flags);
+}
+
+static inline void autonuma_migrate_lock(int nid)
+{
+	spin_lock(&NODE_DATA(nid)->autonuma_lock);
+}
+
+static inline void autonuma_migrate_unlock(int nid)
+{
+	spin_unlock(&NODE_DATA(nid)->autonuma_lock);
+}
+
+static inline void autonuma_migrate_lock_irq(int nid)
+{
+	spin_lock_irq(&NODE_DATA(nid)->autonuma_lock);
+}
+
+static inline void autonuma_migrate_unlock_irq(int nid)
+{
+	spin_unlock_irq(&NODE_DATA(nid)->autonuma_lock);
+}
+
+/* caller already holds the compound_lock */
+void autonuma_migrate_split_huge_page(struct page *page,
+				      struct page *page_tail)
+{
+	int nid, last_nid;
+
+	nid = page->autonuma_migrate_nid;
+	VM_BUG_ON(nid >= MAX_NUMNODES);
+	VM_BUG_ON(nid < -1);
+	VM_BUG_ON(page_tail->autonuma_migrate_nid != -1);
+	if (nid >= 0) {
+		VM_BUG_ON(page_to_nid(page) != page_to_nid(page_tail));
+
+		compound_lock(page_tail);
+		autonuma_migrate_lock(nid);
+		list_add_tail(&page_tail->autonuma_migrate_node,
+			      &page->autonuma_migrate_node);
+		autonuma_migrate_unlock(nid);
+
+		page_tail->autonuma_migrate_nid = nid;
+		compound_unlock(page_tail);
+	}
+
+	last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+	if (last_nid >= 0)
+		page_tail->autonuma_last_nid = last_nid;
+}
+
+void __autonuma_migrate_page_remove(struct page *page)
+{
+	unsigned long flags;
+	int nid;
+
+	flags = compound_lock_irqsave(page);
+
+	nid = page->autonuma_migrate_nid;
+	VM_BUG_ON(nid >= MAX_NUMNODES);
+	VM_BUG_ON(nid < -1);
+	if (nid >= 0) {
+		int numpages = hpage_nr_pages(page);
+		autonuma_migrate_lock(nid);
+		list_del(&page->autonuma_migrate_node);
+		NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
+		autonuma_migrate_unlock(nid);
+
+		page->autonuma_migrate_nid = -1;
+	}
+
+	compound_unlock_irqrestore(page, flags);
+}
+
+static void __autonuma_migrate_page_add(struct page *page, int dst_nid,
+					int page_nid)
+{
+	unsigned long flags;
+	int nid;
+	int numpages;
+	unsigned long nr_migrate_pages;
+	wait_queue_head_t *wait_queue;
+
+	VM_BUG_ON(dst_nid >= MAX_NUMNODES);
+	VM_BUG_ON(dst_nid < -1);
+	VM_BUG_ON(page_nid >= MAX_NUMNODES);
+	VM_BUG_ON(page_nid < -1);
+
+	VM_BUG_ON(page_nid == dst_nid);
+	VM_BUG_ON(page_to_nid(page) != page_nid);
+
+	flags = compound_lock_irqsave(page);
+
+	numpages = hpage_nr_pages(page);
+	nid = page->autonuma_migrate_nid;
+	VM_BUG_ON(nid >= MAX_NUMNODES);
+	VM_BUG_ON(nid < -1);
+	if (nid >= 0) {
+		autonuma_migrate_lock(nid);
+		list_del(&page->autonuma_migrate_node);
+		NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
+		autonuma_migrate_unlock(nid);
+	}
+
+	autonuma_migrate_lock(dst_nid);
+	list_add(&page->autonuma_migrate_node,
+		 &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid]);
+	NODE_DATA(dst_nid)->autonuma_nr_migrate_pages += numpages;
+	nr_migrate_pages = NODE_DATA(dst_nid)->autonuma_nr_migrate_pages;
+
+	autonuma_migrate_unlock(dst_nid);
+
+	page->autonuma_migrate_nid = dst_nid;
+
+	compound_unlock_irqrestore(page, flags);
+
+	if (!autonuma_migrate_defer()) {
+		wait_queue = &NODE_DATA(dst_nid)->autonuma_knuma_migrated_wait;
+		if (nr_migrate_pages >= pages_to_migrate &&
+		    nr_migrate_pages - numpages < pages_to_migrate &&
+		    waitqueue_active(wait_queue))
+			wake_up_interruptible(wait_queue);
+	}
+}
+
+static void autonuma_migrate_page_add(struct page *page, int dst_nid,
+				      int page_nid)
+{
+	int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+	if (migrate_nid != dst_nid)
+		__autonuma_migrate_page_add(page, dst_nid, page_nid);
+}
+
+static bool balance_pgdat(struct pglist_data *pgdat,
+			  int nr_migrate_pages)
+{
+	/* FIXME: this only check the wmarks, make it move
+	 * "unused" memory or pagecache by queuing it to
+	 * pgdat->autonuma_migrate_head[pgdat->node_id].
+	 */
+	int z;
+	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+		struct zone *zone = pgdat->node_zones + z;
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (zone->all_unreclaimable)
+			continue;
+
+		/*
+		 * FIXME: deal with order with THP, maybe if
+		 * kswapd will learn using compaction, otherwise
+		 * order = 0 probably is ok.
+		 * FIXME: in theory we're ok if we can obtain
+		 * pages_to_migrate pages from all zones, it doesn't
+		 * need to be all in a single zone. We care about the
+		 * pgdat, the zone not.
+		 */
+
+		/*
+		 * Try not to wakeup kswapd by allocating
+		 * pages_to_migrate pages.
+		 */
+		if (!zone_watermark_ok(zone, 0,
+				       high_wmark_pages(zone) +
+				       nr_migrate_pages,
+				       0, 0))
+			continue;
+		return true;
+	}
+	return false;
+}
+
+static void cpu_follow_memory_pass(struct task_struct *p,
+				   struct task_autonuma *task_autonuma,
+				   unsigned long *task_numa_fault)
+{
+	int nid;
+	for_each_node(nid)
+		task_numa_fault[nid] >>= 1;
+	task_autonuma->task_numa_fault_tot >>= 1;
+}
+
+static void numa_hinting_fault_cpu_follow_memory(struct task_struct *p,
+						 int access_nid,
+						 int numpages,
+						 bool pass)
+{
+	struct task_autonuma *task_autonuma = p->task_autonuma;
+	unsigned long *task_numa_fault = task_autonuma->task_numa_fault;
+	if (unlikely(pass))
+		cpu_follow_memory_pass(p, task_autonuma, task_numa_fault);
+	task_numa_fault[access_nid] += numpages;
+	task_autonuma->task_numa_fault_tot += numpages;
+}
+
+static inline bool last_nid_set(struct task_struct *p,
+				struct page *page, int cpu_nid)
+{
+	bool ret = true;
+	int autonuma_last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+	VM_BUG_ON(cpu_nid < 0);
+	VM_BUG_ON(cpu_nid >= MAX_NUMNODES);
+	if (autonuma_last_nid >= 0 && autonuma_last_nid != cpu_nid) {
+		int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+		if (migrate_nid >= 0 && migrate_nid != cpu_nid)
+			__autonuma_migrate_page_remove(page);
+		ret = false;
+	}
+	if (autonuma_last_nid != cpu_nid)
+		ACCESS_ONCE(page->autonuma_last_nid) = cpu_nid;
+	return ret;
+}
+
+static int __page_migrate_nid(struct page *page, int page_nid)
+{
+	int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+	if (migrate_nid < 0)
+		migrate_nid = page_nid;
+#if 0
+	return page_nid;
+#endif
+	return migrate_nid;
+}
+
+static int page_migrate_nid(struct page *page)
+{
+	return __page_migrate_nid(page, page_to_nid(page));
+}
+
+static int numa_hinting_fault_memory_follow_cpu(struct task_struct *p,
+						struct page *page,
+						int cpu_nid, int page_nid,
+						bool pass)
+{
+	if (!last_nid_set(p, page, cpu_nid))
+		return __page_migrate_nid(page, page_nid);
+	if (!PageLRU(page))
+		return page_nid;
+	if (cpu_nid != page_nid)
+		autonuma_migrate_page_add(page, cpu_nid, page_nid);
+	else
+		autonuma_migrate_page_remove(page);
+	return cpu_nid;
+}
+
+void numa_hinting_fault(struct page *page, int numpages)
+{
+	WARN_ON_ONCE(!current->mm);
+	if (likely(current->mm && !current->mempolicy && autonuma_enabled())) {
+		struct task_struct *p = current;
+		int cpu_nid, page_nid, access_nid;
+		bool pass;
+
+		pass = p->task_autonuma->task_numa_fault_pass !=
+			p->mm->mm_autonuma->mm_numa_fault_pass;
+		page_nid = page_to_nid(page);
+		cpu_nid = numa_node_id();
+		VM_BUG_ON(cpu_nid < 0);
+		VM_BUG_ON(cpu_nid >= MAX_NUMNODES);
+		access_nid = numa_hinting_fault_memory_follow_cpu(p, page,
+								  cpu_nid,
+								  page_nid,
+								  pass);
+		numa_hinting_fault_cpu_follow_memory(p, access_nid,
+						     numpages, pass);
+		if (unlikely(pass))
+			p->task_autonuma->task_numa_fault_pass =
+				p->mm->mm_autonuma->mm_numa_fault_pass;
+	}
+}
+
+pte_t __pte_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
+		       unsigned long addr, pte_t pte, pte_t *ptep)
+{
+	struct page *page;
+	pte = pte_mknotnuma(pte);
+	set_pte_at(mm, addr, ptep, pte);
+	page = vm_normal_page(vma, addr, pte);
+	BUG_ON(!page);
+	numa_hinting_fault(page, 1);
+	return pte;
+}
+
+void __pmd_numa_fixup(struct mm_struct *mm,
+		      unsigned long addr, pmd_t *pmdp)
+{
+	pmd_t pmd;
+	pte_t *pte;
+	unsigned long _addr = addr & PMD_MASK;
+	unsigned long offset;
+	spinlock_t *ptl;
+	bool numa = false;
+	struct vm_area_struct *vma;
+
+	spin_lock(&mm->page_table_lock);
+	pmd = *pmdp;
+	if (pmd_numa(pmd)) {
+		set_pmd_at(mm, _addr, pmdp, pmd_mknotnuma(pmd));
+		numa = true;
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	if (!numa)
+		return;
+
+	vma = find_vma(mm, _addr);
+	/* we're in a page fault so some vma must be in the range */
+	BUG_ON(!vma);
+	BUG_ON(vma->vm_start >= _addr + PMD_SIZE);
+	offset = max(_addr, vma->vm_start) & ~PMD_MASK;
+	VM_BUG_ON(offset >= PMD_SIZE);
+	pte = pte_offset_map_lock(mm, pmdp, _addr, &ptl);
+	pte += offset >> PAGE_SHIFT;
+	for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
+		pte_t pteval = *pte;
+		struct page * page;
+		if (!pte_present(pteval))
+			continue;
+		if (addr >= vma->vm_end) {
+			vma = find_vma(mm, addr);
+			/* there's a pte present so there must be a vma */
+			BUG_ON(!vma);
+			BUG_ON(addr < vma->vm_start);
+		}
+		if (pte_numa(pteval)) {
+			pteval = pte_mknotnuma(pteval);
+			set_pte_at(mm, addr, pte, pteval);
+		}
+		page = vm_normal_page(vma, addr, pteval);
+		if (unlikely(!page))
+			continue;
+		/* only check non-shared pages */
+		if (page_mapcount(page) != 1)
+			continue;
+		numa_hinting_fault(page, 1);
+	}
+	pte_unmap_unlock(pte, ptl);
+}
+
+static inline int task_autonuma_size(void)
+{
+	return sizeof(struct task_autonuma) +
+		num_possible_nodes() * sizeof(unsigned long);
+}
+
+static inline int task_autonuma_reset_size(void)
+{
+	struct task_autonuma *task_autonuma = NULL;
+	return task_autonuma_size() -
+		(int)((char *)(&task_autonuma->task_numa_fault_pass) -
+		      (char *)task_autonuma);
+}
+
+static void task_autonuma_reset(struct task_autonuma *task_autonuma)
+{
+	task_autonuma->autonuma_node = -1;
+	memset(&task_autonuma->task_numa_fault_pass, 0,
+	       task_autonuma_reset_size());
+}
+
+static inline int mm_autonuma_fault_size(void)
+{
+	return num_possible_nodes() * sizeof(unsigned long);
+}
+
+static inline unsigned long *mm_autonuma_numa_fault_tmp(struct mm_struct *mm)
+{
+	return mm->mm_autonuma->mm_numa_fault + num_possible_nodes();
+}
+
+static inline int mm_autonuma_size(void)
+{
+	return sizeof(struct mm_autonuma) + mm_autonuma_fault_size() * 2;
+}
+
+static inline int mm_autonuma_reset_size(void)
+{
+	struct mm_autonuma *mm_autonuma = NULL;
+	return mm_autonuma_size() -
+		(int)((char *)(&mm_autonuma->mm_numa_fault_pass) -
+		      (char *)mm_autonuma);
+}
+
+static void mm_autonuma_reset(struct mm_autonuma *mm_autonuma)
+{
+	memset(&mm_autonuma->mm_numa_fault_pass, 0, mm_autonuma_reset_size());
+}
+
+void autonuma_setup_new_exec(struct task_struct *p)
+{
+	if (p->task_autonuma)
+		task_autonuma_reset(p->task_autonuma);
+	if (p->mm && p->mm->mm_autonuma)
+		mm_autonuma_reset(p->mm->mm_autonuma);
+}
+
+static inline int knumad_test_exit(struct mm_struct *mm)
+{
+	return atomic_read(&mm->mm_users) == 0;
+}
+
+static int knumad_scan_pmd(struct mm_struct *mm,
+			   struct vm_area_struct *vma,
+			   unsigned long address)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte, *_pte;
+	struct page *page;
+	unsigned long _address, end;
+	spinlock_t *ptl;
+	int ret = 0;
+
+	VM_BUG_ON(address & ~PAGE_MASK);
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		goto out;
+
+	pmd = pmd_offset(pud, address);
+	if (pmd_none(*pmd))
+		goto out;
+	if (pmd_trans_huge(*pmd)) {
+		spin_lock(&mm->page_table_lock);
+		if (pmd_trans_huge(*pmd)) {
+			VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+			if (unlikely(pmd_trans_splitting(*pmd))) {
+				spin_unlock(&mm->page_table_lock);
+				wait_split_huge_page(vma->anon_vma, pmd);
+			} else {
+				int page_nid;
+				unsigned long *numa_fault_tmp;
+				ret = HPAGE_PMD_NR;
+
+				if (autonuma_scan_use_working_set() &&
+				    pmd_numa(*pmd)) {
+					spin_unlock(&mm->page_table_lock);
+					goto out;
+				}
+
+				page = pmd_page(*pmd);
+
+				/* only check non-shared pages */
+				if (page_mapcount(page) != 1) {
+					spin_unlock(&mm->page_table_lock);
+					goto out;
+				}
+
+				page_nid = page_migrate_nid(page);
+				numa_fault_tmp = mm_autonuma_numa_fault_tmp(mm);
+				numa_fault_tmp[page_nid] += ret;
+
+				if (pmd_numa(*pmd)) {
+					spin_unlock(&mm->page_table_lock);
+					goto out;
+				}
+
+				set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
+				/* defer TLB flush to lower the overhead */
+				spin_unlock(&mm->page_table_lock);
+				goto out;
+			}
+		} else
+			spin_unlock(&mm->page_table_lock);
+	}
+
+	VM_BUG_ON(!pmd_present(*pmd));
+
+	end = min(vma->vm_end, (address + PMD_SIZE) & PMD_MASK);
+	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+	for (_address = address, _pte = pte; _address < end;
+	     _pte++, _address += PAGE_SIZE) {
+		unsigned long *numa_fault_tmp;
+		pte_t pteval = *_pte;
+		if (!pte_present(pteval))
+			continue;
+		if (autonuma_scan_use_working_set() &&
+		    pte_numa(pteval))
+			continue;
+		page = vm_normal_page(vma, _address, pteval);
+		if (unlikely(!page))
+			continue;
+		/* only check non-shared pages */
+		if (page_mapcount(page) != 1)
+			continue;
+
+		numa_fault_tmp = mm_autonuma_numa_fault_tmp(mm);
+		numa_fault_tmp[page_migrate_nid(page)]++;
+
+		if (pte_numa(pteval))
+			continue;
+
+		if (!autonuma_scan_pmd())
+			set_pte_at(mm, _address, _pte, pte_mknuma(pteval));
+
+		/* defer TLB flush to lower the overhead */
+		ret++;
+	}
+	pte_unmap_unlock(pte, ptl);
+
+	if (ret && !pmd_numa(*pmd) && autonuma_scan_pmd()) {
+		spin_lock(&mm->page_table_lock);
+		set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
+		spin_unlock(&mm->page_table_lock);
+		/* defer TLB flush to lower the overhead */
+	}
+
+out:
+	return ret;
+}
+
+static void mm_numa_fault_flush(struct mm_struct *mm)
+{
+	int nid;
+	struct mm_autonuma *mma = mm->mm_autonuma;
+	unsigned long *numa_fault_tmp = mm_autonuma_numa_fault_tmp(mm);
+	unsigned long tot = 0;
+	/* FIXME: protect this with seqlock against autonuma_balance() */
+	for_each_node(nid) {
+		mma->mm_numa_fault[nid] = numa_fault_tmp[nid];
+		tot += mma->mm_numa_fault[nid];
+		numa_fault_tmp[nid] = 0;
+	}
+	mma->mm_numa_fault_tot = tot;
+}
+
+static int knumad_do_scan(void)
+{
+	struct mm_struct *mm;
+	struct mm_autonuma *mm_autonuma;
+	unsigned long address;
+	struct vm_area_struct *vma;
+	int progress = 0;
+
+	mm = knumad_scan.mm;
+	if (!mm) {
+		if (unlikely(list_empty(&knumad_scan.mm_head)))
+			return pages_to_scan;
+		mm_autonuma = list_entry(knumad_scan.mm_head.next,
+					 struct mm_autonuma, mm_node);
+		mm = mm_autonuma->mm;
+		knumad_scan.address = 0;
+		knumad_scan.mm = mm;
+		atomic_inc(&mm->mm_count);
+		mm_autonuma->mm_numa_fault_pass++;
+	}
+	address = knumad_scan.address;
+
+	mutex_unlock(&knumad_mm_mutex);
+
+	down_read(&mm->mmap_sem);
+	if (unlikely(knumad_test_exit(mm)))
+		vma = NULL;
+	else
+		vma = find_vma(mm, address);
+
+	progress++;
+	for (; vma && progress < pages_to_scan; vma = vma->vm_next) {
+		unsigned long start_addr, end_addr;
+		cond_resched();
+		if (unlikely(knumad_test_exit(mm))) {
+			progress++;
+			break;
+		}
+
+		if (!vma->anon_vma || vma_policy(vma)) {
+			progress++;
+			continue;
+		}
+		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)) {
+			progress++;
+			continue;
+		}
+		if (is_vma_temporary_stack(vma)) {
+			progress++;
+			continue;
+		}
+
+		VM_BUG_ON(address & ~PAGE_MASK);
+		if (address < vma->vm_start)
+			address = vma->vm_start;
+
+		start_addr = address;
+		while (address < vma->vm_end) {
+			cond_resched();
+			if (unlikely(knumad_test_exit(mm)))
+				break;
+
+			VM_BUG_ON(address < vma->vm_start ||
+				  address + PAGE_SIZE > vma->vm_end);
+			progress += knumad_scan_pmd(mm, vma, address);
+			/* move to next address */
+			address = (address + PMD_SIZE) & PMD_MASK;
+			if (progress >= pages_to_scan)
+				break;
+		}
+		end_addr = min(address, vma->vm_end);
+
+		/*
+		 * Flush the TLB for the mm to start the numa
+		 * hinting minor page faults after we finish
+		 * scanning this vma part.
+		 */
+		mmu_notifier_invalidate_range_start(vma->vm_mm, start_addr,
+						    end_addr);
+		flush_tlb_range(vma, start_addr, end_addr);
+		mmu_notifier_invalidate_range_end(vma->vm_mm, start_addr,
+						  end_addr);
+	}
+	up_read(&mm->mmap_sem); /* exit_mmap will destroy ptes after this */
+
+	mutex_lock(&knumad_mm_mutex);
+	VM_BUG_ON(knumad_scan.mm != mm);
+	knumad_scan.address = address;
+	/*
+	 * Change the current mm if this mm is about to die, or if we
+	 * scanned all vmas of this mm.
+	 */
+	if (knumad_test_exit(mm) || !vma) {
+		mm_autonuma = mm->mm_autonuma;
+		if (mm_autonuma->mm_node.next != &knumad_scan.mm_head) {
+			mm_autonuma = list_entry(mm_autonuma->mm_node.next,
+						 struct mm_autonuma, mm_node);
+			knumad_scan.mm = mm_autonuma->mm;
+			atomic_inc(&knumad_scan.mm->mm_count);
+			knumad_scan.address = 0;
+			knumad_scan.mm->mm_autonuma->mm_numa_fault_pass++;
+		} else
+			knumad_scan.mm = NULL;
+
+		if (knumad_test_exit(mm)) {
+			list_del(&mm->mm_autonuma->mm_node);
+			/* tell autonuma_exit not to list_del */
+			VM_BUG_ON(mm->mm_autonuma->mm != mm);
+			mm->mm_autonuma->mm = NULL;
+		} else
+			mm_numa_fault_flush(mm);
+
+		mmdrop(mm);
+	}
+
+	return progress;
+}
+
+static void wake_up_knuma_migrated(void)
+{
+	int nid;
+
+	lru_add_drain();
+	for_each_online_node(nid) {
+		struct pglist_data *pgdat = NODE_DATA(nid);
+		if (pgdat->autonuma_nr_migrate_pages &&
+		    waitqueue_active(&pgdat->autonuma_knuma_migrated_wait))
+			wake_up_interruptible(&pgdat->
+					      autonuma_knuma_migrated_wait);
+	}
+}
+
+static void knuma_scand_disabled(void)
+{
+	if (!autonuma_enabled())
+		wait_event_freezable(knuma_scand_wait,
+				     autonuma_enabled() ||
+				     kthread_should_stop());
+}
+
+static int knuma_scand(void *none)
+{
+	struct mm_struct *mm = NULL;
+	int progress = 0, _progress;
+	unsigned long total_progress = 0;
+
+	set_freezable();
+
+	knuma_scand_disabled();
+
+	mutex_lock(&knumad_mm_mutex);
+
+	for (;;) {
+		if (unlikely(kthread_should_stop()))
+			break;
+		_progress = knumad_do_scan();
+		progress += _progress;
+		total_progress += _progress;
+		mutex_unlock(&knumad_mm_mutex);
+
+		if (unlikely(!knumad_scan.mm)) {
+			autonuma_printk("knuma_scand %lu\n", total_progress);
+			pages_scanned += total_progress;
+			total_progress = 0;
+			full_scans++;
+
+			wait_event_freezable_timeout(knuma_scand_wait,
+						     kthread_should_stop(),
+						     msecs_to_jiffies(
+						     scan_sleep_pass_millisecs));
+			/* flush the last pending pages < pages_to_migrate */
+			wake_up_knuma_migrated();
+			wait_event_freezable_timeout(knuma_scand_wait,
+						     kthread_should_stop(),
+						     msecs_to_jiffies(
+						     scan_sleep_pass_millisecs));
+
+			if (autonuma_debug()) {
+				extern void sched_autonuma_dump_mm(void);
+				sched_autonuma_dump_mm();
+			}
+
+			/* wait while there is no pinned mm */
+			knuma_scand_disabled();
+		}
+		if (progress > pages_to_scan) {
+			progress = 0;
+			wait_event_freezable_timeout(knuma_scand_wait,
+						     kthread_should_stop(),
+						     msecs_to_jiffies(
+						     scan_sleep_millisecs));
+		}
+		cond_resched();
+		mutex_lock(&knumad_mm_mutex);
+	}
+
+	mm = knumad_scan.mm;
+	knumad_scan.mm = NULL;
+	if (mm && knumad_test_exit(mm)) {
+		list_del(&mm->mm_autonuma->mm_node);
+		/* tell autonuma_exit not to list_del */
+		VM_BUG_ON(mm->mm_autonuma->mm != mm);
+		mm->mm_autonuma->mm = NULL;
+	}
+	mutex_unlock(&knumad_mm_mutex);
+
+	if (mm)
+		mmdrop(mm);
+
+	return 0;
+}
+
+static int isolate_migratepages(struct list_head *migratepages,
+				struct pglist_data *pgdat)
+{
+	int nr = 0, nid;
+	struct list_head *heads = pgdat->autonuma_migrate_head;
+
+	/* FIXME: THP balancing, restart from last nid */
+	for_each_online_node(nid) {
+		struct zone *zone;
+		struct page *page;
+		struct lruvec *lruvec;
+
+		cond_resched();
+		VM_BUG_ON(numa_node_id() != pgdat->node_id);
+		if (nid == pgdat->node_id) {
+			VM_BUG_ON(!list_empty(&heads[nid]));
+			continue;
+		}
+		if (list_empty(&heads[nid]))
+			continue;
+		/* some page wants to go to this pgdat */
+		/*
+		 * Take the lock with irqs disabled to avoid a lock
+		 * inversion with the lru_lock which is taken before
+		 * the autonuma_migrate_lock in split_huge_page, and
+		 * that could be taken by interrupts after we obtained
+		 * the autonuma_migrate_lock here, if we didn't disable
+		 * irqs.
+		 */
+		autonuma_migrate_lock_irq(pgdat->node_id);
+		if (list_empty(&heads[nid])) {
+			autonuma_migrate_unlock_irq(pgdat->node_id);
+			continue;
+		}
+		page = list_entry(heads[nid].prev,
+				  struct page,
+				  autonuma_migrate_node);
+		if (unlikely(!get_page_unless_zero(page))) {
+			/*
+			 * Is getting freed and will remove self from the
+			 * autonuma list shortly, skip it for now.
+			 */
+			list_del(&page->autonuma_migrate_node);
+			list_add(&page->autonuma_migrate_node,
+				 &heads[nid]);
+			autonuma_migrate_unlock_irq(pgdat->node_id);
+			autonuma_printk("autonuma migrate page is free\n");
+			continue;
+		}
+		if (!PageLRU(page)) {
+			autonuma_migrate_unlock_irq(pgdat->node_id);
+			autonuma_printk("autonuma migrate page not in LRU\n");
+			__autonuma_migrate_page_remove(page);
+			put_page(page);
+			continue;
+		}
+		autonuma_migrate_unlock_irq(pgdat->node_id);
+
+		VM_BUG_ON(nid != page_to_nid(page));
+
+		if (PageTransHuge(page)) {
+			VM_BUG_ON(!PageAnon(page));
+			/* FIXME: remove split_huge_page */
+			if (unlikely(split_huge_page(page))) {
+				autonuma_printk("autonuma migrate THP free\n");
+				/* remove from the migration queue */
+				__autonuma_migrate_page_remove(page);
+				put_page(page);
+				continue;
+			}
+		}
+
+		__autonuma_migrate_page_remove(page);
+
+		zone = page_zone(page);
+		spin_lock_irq(&zone->lru_lock);
+
+		/* Must run under the lru_lock and before page isolation */
+		lruvec = mem_cgroup_page_lruvec(page, zone);
+
+		if (!__isolate_lru_page(page, 0)) {
+			VM_BUG_ON(PageTransCompound(page));
+			del_page_from_lru_list(page, lruvec, page_lru(page));
+			inc_zone_state(zone, page_is_file_cache(page) ?
+				       NR_ISOLATED_FILE : NR_ISOLATED_ANON);
+			spin_unlock_irq(&zone->lru_lock);
+			/*
+			 * hold the page pin at least until
+			 * __isolate_lru_page succeeds
+			 * (__isolate_lru_page takes a second pin when
+			 * it succeeds). If we release the pin before
+			 * __isolate_lru_page returns, the page could
+			 * have been freed and reallocated from under
+			 * us, so rendering worthless our previous
+			 * checks on the page including the
+			 * split_huge_page call.
+			 */
+			put_page(page);
+
+			list_add(&page->lru, migratepages);
+			nr += hpage_nr_pages(page);
+		} else {
+			/* FIXME: losing page, safest and simplest for now */
+			spin_unlock_irq(&zone->lru_lock);
+			put_page(page);
+			autonuma_printk("autonuma migrate page lost\n");
+		}
+	}
+
+	return nr;
+}
+
+static struct page *alloc_migrate_dst_page(struct page *page,
+					   unsigned long data,
+					   int **result)
+{
+	int nid = (int) data;
+	struct page *newpage;
+	newpage = alloc_pages_exact_node(nid,
+					 GFP_HIGHUSER_MOVABLE | GFP_THISNODE,
+					 0);
+	if (newpage)
+		newpage->autonuma_last_nid = page->autonuma_last_nid;
+	return newpage;
+}
+
+static void knumad_do_migrate(struct pglist_data *pgdat)
+{
+	int nr_migrate_pages = 0;
+	LIST_HEAD(migratepages);
+
+	autonuma_printk("nr_migrate_pages %lu to node %d\n",
+			pgdat->autonuma_nr_migrate_pages, pgdat->node_id);
+	do {
+		int isolated = 0;
+		if (balance_pgdat(pgdat, nr_migrate_pages))
+			isolated = isolate_migratepages(&migratepages, pgdat);
+		/* FIXME: might need to check too many isolated */
+		if (!isolated)
+			break;
+		nr_migrate_pages += isolated;
+	} while (nr_migrate_pages < pages_to_migrate);
+
+	if (nr_migrate_pages) {
+		int err;
+		autonuma_printk("migrate %d to node %d\n", nr_migrate_pages,
+				pgdat->node_id);
+		pages_migrated += nr_migrate_pages; /* FIXME: per node */
+		err = migrate_pages(&migratepages, alloc_migrate_dst_page,
+				    pgdat->node_id, false, true);
+		if (err)
+			/* FIXME: requeue failed pages */
+			putback_lru_pages(&migratepages);
+	}
+}
+
+static int knuma_migrated(void *arg)
+{
+	struct pglist_data *pgdat = (struct pglist_data *)arg;
+	int nid = pgdat->node_id;
+	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(nowakeup);
+
+	set_freezable();
+
+	for (;;) {
+		if (unlikely(kthread_should_stop()))
+			break;
+		/* FIXME: scan the free levels of this node we may not
+		 * be allowed to receive memory if the wmark of this
+		 * pgdat are below high.  In the future also add
+		 * not-interesting pages like not-accessed pages to
+		 * pgdat->autonuma_migrate_head[pgdat->node_id]; so we
+		 * can move our memory away to other nodes in order
+		 * to satisfy the high-wmark described above (so migration
+		 * can continue).
+		 */
+		knumad_do_migrate(pgdat);
+		if (!pgdat->autonuma_nr_migrate_pages) {
+			wait_event_freezable(
+				pgdat->autonuma_knuma_migrated_wait,
+				pgdat->autonuma_nr_migrate_pages ||
+				kthread_should_stop());
+			autonuma_printk("wake knuma_migrated %d\n", nid);
+		} else
+			wait_event_freezable_timeout(nowakeup,
+						     kthread_should_stop(),
+						     msecs_to_jiffies(
+						     migrate_sleep_millisecs));
+	}
+
+	return 0;
+}
+
+void autonuma_enter(struct mm_struct *mm)
+{
+	if (autonuma_impossible())
+		return;
+
+	mutex_lock(&knumad_mm_mutex);
+	list_add_tail(&mm->mm_autonuma->mm_node, &knumad_scan.mm_head);
+	mutex_unlock(&knumad_mm_mutex);
+}
+
+void autonuma_exit(struct mm_struct *mm)
+{
+	bool serialize;
+
+	if (autonuma_impossible())
+		return;
+
+	serialize = false;
+	mutex_lock(&knumad_mm_mutex);
+	if (knumad_scan.mm == mm)
+		serialize = true;
+	else if (mm->mm_autonuma->mm) {
+		VM_BUG_ON(mm->mm_autonuma->mm != mm);
+		mm->mm_autonuma->mm = NULL; /* debug */
+		list_del(&mm->mm_autonuma->mm_node);
+	}
+	mutex_unlock(&knumad_mm_mutex);
+
+	if (serialize) {
+		/* prevent the mm to go away under knumad_do_scan main loop */
+		down_write(&mm->mmap_sem);
+		up_write(&mm->mmap_sem);
+	}
+}
+
+static int start_knuma_scand(void)
+{
+	int err = 0;
+	struct task_struct *knumad_thread;
+
+	knumad_thread = kthread_run(knuma_scand, NULL, "knuma_scand");
+	if (unlikely(IS_ERR(knumad_thread))) {
+		autonuma_printk(KERN_ERR
+				"knumad: kthread_run(knuma_scand) failed\n");
+		err = PTR_ERR(knumad_thread);
+	}
+	return err;
+}
+
+static int start_knuma_migrated(void)
+{
+	int err = 0;
+	struct task_struct *knumad_thread;
+	int nid;
+
+	for_each_online_node(nid) {
+		knumad_thread = kthread_create_on_node(knuma_migrated,
+						       NODE_DATA(nid),
+						       nid,
+						       "knuma_migrated%d",
+						       nid);
+		if (unlikely(IS_ERR(knumad_thread))) {
+			autonuma_printk(KERN_ERR
+					"knumad: "
+					"kthread_run(knuma_migrated%d) "
+					"failed\n", nid);
+			err = PTR_ERR(knumad_thread);
+		} else {
+			autonuma_printk("cpumask %d %lx\n", nid,
+					cpumask_of_node(nid)->bits[0]);
+			kthread_bind_node(knumad_thread, nid);
+			wake_up_process(knumad_thread);
+		}
+	}
+	return err;
+}
+
+
+#ifdef CONFIG_SYSFS
+
+static ssize_t flag_show(struct kobject *kobj,
+			 struct kobj_attribute *attr, char *buf,
+			 enum autonuma_flag flag)
+{
+	return sprintf(buf, "%d\n",
+		       !!test_bit(flag, &autonuma_flags));
+}
+static ssize_t flag_store(struct kobject *kobj,
+			  struct kobj_attribute *attr,
+			  const char *buf, size_t count,
+			  enum autonuma_flag flag)
+{
+	unsigned long value;
+	int ret;
+
+	ret = kstrtoul(buf, 10, &value);
+	if (ret < 0)
+		return ret;
+	if (value > 1)
+		return -EINVAL;
+
+	if (value)
+		set_bit(flag, &autonuma_flags);
+	else
+		clear_bit(flag, &autonuma_flags);
+
+	return count;
+}
+
+static ssize_t enabled_show(struct kobject *kobj,
+			    struct kobj_attribute *attr, char *buf)
+{
+	return flag_show(kobj, attr, buf, AUTONUMA_FLAG);
+}
+static ssize_t enabled_store(struct kobject *kobj,
+			     struct kobj_attribute *attr,
+			     const char *buf, size_t count)
+{
+	ssize_t ret;
+
+	ret = flag_store(kobj, attr, buf, count, AUTONUMA_FLAG);
+
+	if (ret > 0 && autonuma_enabled())
+		wake_up_interruptible(&knuma_scand_wait);
+
+	return ret;
+}
+static struct kobj_attribute enabled_attr =
+	__ATTR(enabled, 0644, enabled_show, enabled_store);
+
+#define SYSFS_ENTRY(NAME, FLAG)						\
+static ssize_t NAME ## _show(struct kobject *kobj,			\
+			     struct kobj_attribute *attr, char *buf)	\
+{									\
+	return flag_show(kobj, attr, buf, FLAG);			\
+}									\
+									\
+static ssize_t NAME ## _store(struct kobject *kobj,			\
+			      struct kobj_attribute *attr,		\
+			      const char *buf, size_t count)		\
+{									\
+	return flag_store(kobj, attr, buf, count, FLAG);		\
+}									\
+static struct kobj_attribute NAME ## _attr =				\
+	__ATTR(NAME, 0644, NAME ## _show, NAME ## _store);
+
+SYSFS_ENTRY(debug, AUTONUMA_DEBUG_FLAG);
+SYSFS_ENTRY(pmd, AUTONUMA_SCAN_PMD_FLAG);
+SYSFS_ENTRY(working_set, AUTONUMA_SCAN_USE_WORKING_SET_FLAG);
+SYSFS_ENTRY(defer, AUTONUMA_MIGRATE_DEFER_FLAG);
+SYSFS_ENTRY(load_balance_strict, AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG);
+SYSFS_ENTRY(clone_reset, AUTONUMA_SCHED_CLONE_RESET_FLAG);
+SYSFS_ENTRY(fork_reset, AUTONUMA_SCHED_FORK_RESET_FLAG);
+
+#undef SYSFS_ENTRY
+
+enum {
+	SYSFS_KNUMA_SCAND_SLEEP_ENTRY,
+	SYSFS_KNUMA_SCAND_PAGES_ENTRY,
+	SYSFS_KNUMA_MIGRATED_SLEEP_ENTRY,
+	SYSFS_KNUMA_MIGRATED_PAGES_ENTRY,
+};
+
+#define SYSFS_ENTRY(NAME, SYSFS_TYPE)				\
+static ssize_t NAME ## _show(struct kobject *kobj,		\
+			     struct kobj_attribute *attr,	\
+			     char *buf)				\
+{								\
+	return sprintf(buf, "%u\n", NAME);			\
+}								\
+static ssize_t NAME ## _store(struct kobject *kobj,		\
+			      struct kobj_attribute *attr,	\
+			      const char *buf, size_t count)	\
+{								\
+	unsigned long val;					\
+	int err;						\
+								\
+	err = strict_strtoul(buf, 10, &val);			\
+	if (err || val > UINT_MAX)				\
+		return -EINVAL;					\
+	switch (SYSFS_TYPE) {					\
+	case SYSFS_KNUMA_SCAND_PAGES_ENTRY:			\
+	case SYSFS_KNUMA_MIGRATED_PAGES_ENTRY:			\
+		if (!val)					\
+			return -EINVAL;				\
+		break;						\
+	}							\
+								\
+	NAME = val;						\
+	switch (SYSFS_TYPE) {					\
+	case SYSFS_KNUMA_SCAND_SLEEP_ENTRY:			\
+		wake_up_interruptible(&knuma_scand_wait);	\
+		break;						\
+	case							\
+		SYSFS_KNUMA_MIGRATED_SLEEP_ENTRY:		\
+		wake_up_knuma_migrated();			\
+		break;						\
+	}							\
+								\
+	return count;						\
+}								\
+static struct kobj_attribute NAME ## _attr =			\
+	__ATTR(NAME, 0644, NAME ## _show, NAME ## _store);
+
+SYSFS_ENTRY(scan_sleep_millisecs, SYSFS_KNUMA_SCAND_SLEEP_ENTRY);
+SYSFS_ENTRY(scan_sleep_pass_millisecs, SYSFS_KNUMA_SCAND_SLEEP_ENTRY);
+SYSFS_ENTRY(pages_to_scan, SYSFS_KNUMA_SCAND_PAGES_ENTRY);
+
+SYSFS_ENTRY(migrate_sleep_millisecs, SYSFS_KNUMA_MIGRATED_SLEEP_ENTRY);
+SYSFS_ENTRY(pages_to_migrate, SYSFS_KNUMA_MIGRATED_PAGES_ENTRY);
+
+#undef SYSFS_ENTRY
+
+static struct attribute *autonuma_attr[] = {
+	&enabled_attr.attr,
+	&debug_attr.attr,
+	NULL,
+};
+static struct attribute_group autonuma_attr_group = {
+	.attrs = autonuma_attr,
+};
+
+#define SYSFS_ENTRY(NAME)					\
+static ssize_t NAME ## _show(struct kobject *kobj,		\
+			     struct kobj_attribute *attr,	\
+			     char *buf)				\
+{								\
+	return sprintf(buf, "%lu\n", NAME);			\
+}								\
+static struct kobj_attribute NAME ## _attr =			\
+	__ATTR_RO(NAME);
+
+SYSFS_ENTRY(full_scans);
+SYSFS_ENTRY(pages_scanned);
+SYSFS_ENTRY(pages_migrated);
+
+#undef SYSFS_ENTRY
+
+static struct attribute *knuma_scand_attr[] = {
+	&scan_sleep_millisecs_attr.attr,
+	&scan_sleep_pass_millisecs_attr.attr,
+	&pages_to_scan_attr.attr,
+	&pages_scanned_attr.attr,
+	&full_scans_attr.attr,
+	&pmd_attr.attr,
+	&working_set_attr.attr,
+	NULL,
+};
+static struct attribute_group knuma_scand_attr_group = {
+	.attrs = knuma_scand_attr,
+	.name = "knuma_scand",
+};
+
+static struct attribute *knuma_migrated_attr[] = {
+	&migrate_sleep_millisecs_attr.attr,
+	&pages_to_migrate_attr.attr,
+	&pages_migrated_attr.attr,
+	&defer_attr.attr,
+	NULL,
+};
+static struct attribute_group knuma_migrated_attr_group = {
+	.attrs = knuma_migrated_attr,
+	.name = "knuma_migrated",
+};
+
+static struct attribute *scheduler_attr[] = {
+	&clone_reset_attr.attr,
+	&fork_reset_attr.attr,
+	&load_balance_strict_attr.attr,
+	NULL,
+};
+static struct attribute_group scheduler_attr_group = {
+	.attrs = scheduler_attr,
+	.name = "scheduler",
+};
+
+static int __init autonuma_init_sysfs(struct kobject **autonuma_kobj)
+{
+	int err;
+
+	*autonuma_kobj = kobject_create_and_add("autonuma", mm_kobj);
+	if (unlikely(!*autonuma_kobj)) {
+		printk(KERN_ERR "autonuma: failed kobject create\n");
+		return -ENOMEM;
+	}
+
+	err = sysfs_create_group(*autonuma_kobj, &autonuma_attr_group);
+	if (err) {
+		printk(KERN_ERR "autonuma: failed register autonuma group\n");
+		goto delete_obj;
+	}
+
+	err = sysfs_create_group(*autonuma_kobj, &knuma_scand_attr_group);
+	if (err) {
+		printk(KERN_ERR
+		       "autonuma: failed register knuma_scand group\n");
+		goto remove_autonuma;
+	}
+
+	err = sysfs_create_group(*autonuma_kobj, &knuma_migrated_attr_group);
+	if (err) {
+		printk(KERN_ERR
+		       "autonuma: failed register knuma_migrated group\n");
+		goto remove_knuma_scand;
+	}
+
+	err = sysfs_create_group(*autonuma_kobj, &scheduler_attr_group);
+	if (err) {
+		printk(KERN_ERR
+		       "autonuma: failed register scheduler group\n");
+		goto remove_knuma_migrated;
+	}
+
+	return 0;
+
+remove_knuma_migrated:
+	sysfs_remove_group(*autonuma_kobj, &knuma_migrated_attr_group);
+remove_knuma_scand:
+	sysfs_remove_group(*autonuma_kobj, &knuma_scand_attr_group);
+remove_autonuma:
+	sysfs_remove_group(*autonuma_kobj, &autonuma_attr_group);
+delete_obj:
+	kobject_put(*autonuma_kobj);
+	return err;
+}
+
+static void __init autonuma_exit_sysfs(struct kobject *autonuma_kobj)
+{
+	sysfs_remove_group(autonuma_kobj, &knuma_migrated_attr_group);
+	sysfs_remove_group(autonuma_kobj, &knuma_scand_attr_group);
+	sysfs_remove_group(autonuma_kobj, &autonuma_attr_group);
+	kobject_put(autonuma_kobj);
+}
+#else
+static inline int autonuma_init_sysfs(struct kobject **autonuma_kobj)
+{
+	return 0;
+}
+
+static inline void autonuma_exit_sysfs(struct kobject *autonuma_kobj)
+{
+}
+#endif /* CONFIG_SYSFS */
+
+static int __init noautonuma_setup(char *str)
+{
+	if (!autonuma_impossible()) {
+		printk("AutoNUMA permanently disabled\n");
+		set_bit(AUTONUMA_IMPOSSIBLE_FLAG, &autonuma_flags);
+		BUG_ON(!autonuma_impossible());
+	}
+	return 1;
+}
+__setup("noautonuma", noautonuma_setup);
+
+static int __init autonuma_init(void)
+{
+	int err;
+	struct kobject *autonuma_kobj;
+
+	VM_BUG_ON(num_possible_nodes() < 1);
+	if (autonuma_impossible())
+		return -EINVAL;
+
+	err = autonuma_init_sysfs(&autonuma_kobj);
+	if (err)
+		return err;
+
+	err = start_knuma_scand();
+	if (err) {
+		printk("failed to start knuma_scand\n");
+		goto out;
+	}
+	err = start_knuma_migrated();
+	if (err) {
+		printk("failed to start knuma_migrated\n");
+		goto out;
+	}
+
+	printk("AutoNUMA initialized successfully\n");
+	return err;
+
+out:
+	autonuma_exit_sysfs(autonuma_kobj);
+	return err;
+}
+module_init(autonuma_init)
+
+static struct kmem_cache *task_autonuma_cachep;
+
+int alloc_task_autonuma(struct task_struct *tsk, struct task_struct *orig,
+			 int node)
+{
+	int err = 1;
+	struct task_autonuma *task_autonuma;
+
+	if (autonuma_impossible())
+		goto no_numa;
+	task_autonuma = kmem_cache_alloc_node(task_autonuma_cachep,
+					      GFP_KERNEL, node);
+	if (!task_autonuma)
+		goto out;
+	if (autonuma_sched_clone_reset())
+		task_autonuma_reset(task_autonuma);
+	else
+		memcpy(task_autonuma, orig->task_autonuma,
+		       task_autonuma_size());
+	tsk->task_autonuma = task_autonuma;
+no_numa:
+	err = 0;
+out:
+	return err;
+}
+
+void free_task_autonuma(struct task_struct *tsk)
+{
+	if (autonuma_impossible()) {
+		BUG_ON(tsk->task_autonuma);
+		return;
+	}
+
+	BUG_ON(!tsk->task_autonuma);
+	kmem_cache_free(task_autonuma_cachep, tsk->task_autonuma);
+	tsk->task_autonuma = NULL;
+}
+
+void __init task_autonuma_init(void)
+{
+	struct task_autonuma *task_autonuma;
+
+	BUG_ON(current != &init_task);
+
+	if (autonuma_impossible())
+		return;
+
+	task_autonuma_cachep =
+		kmem_cache_create("task_autonuma",
+				  task_autonuma_size(), 0,
+				  SLAB_PANIC | SLAB_HWCACHE_ALIGN, NULL);
+
+	task_autonuma = kmem_cache_alloc_node(task_autonuma_cachep,
+					      GFP_KERNEL, numa_node_id());
+	BUG_ON(!task_autonuma);
+	task_autonuma_reset(task_autonuma);
+	BUG_ON(current->task_autonuma);
+	current->task_autonuma = task_autonuma;
+}
+
+static struct kmem_cache *mm_autonuma_cachep;
+
+int alloc_mm_autonuma(struct mm_struct *mm)
+{
+	int err = 1;
+	struct mm_autonuma *mm_autonuma;
+
+	if (autonuma_impossible())
+		goto no_numa;
+	mm_autonuma = kmem_cache_alloc(mm_autonuma_cachep, GFP_KERNEL);
+	if (!mm_autonuma)
+		goto out;
+	if (autonuma_sched_fork_reset() || !mm->mm_autonuma)
+		mm_autonuma_reset(mm_autonuma);
+	else
+		memcpy(mm_autonuma, mm->mm_autonuma, mm_autonuma_size());
+	mm->mm_autonuma = mm_autonuma;
+	mm_autonuma->mm = mm;
+no_numa:
+	err = 0;
+out:
+	return err;
+}
+
+void free_mm_autonuma(struct mm_struct *mm)
+{
+	if (autonuma_impossible()) {
+		BUG_ON(mm->mm_autonuma);
+		return;
+	}
+
+	BUG_ON(!mm->mm_autonuma);
+	kmem_cache_free(mm_autonuma_cachep, mm->mm_autonuma);
+	mm->mm_autonuma = NULL;
+}
+
+void __init mm_autonuma_init(void)
+{
+	BUG_ON(current != &init_task);
+	BUG_ON(current->mm);
+
+	if (autonuma_impossible())
+		return;
+
+	mm_autonuma_cachep =
+		kmem_cache_create("mm_autonuma",
+				  mm_autonuma_size(), 0,
+				  SLAB_PANIC | SLAB_HWCACHE_ALIGN, NULL);
+}

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH 24/40] autonuma: core
@ 2012-06-28 12:56   ` Andrea Arcangeli
  0 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 12:56 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

This implements knuma_scand, the numa_hinting faults started by
knuma_scand, the knuma_migrated that migrates the memory queued by the
NUMA hinting faults, the statistics gathering code that is done by
knuma_scand for the mm_autonuma and by the numa hinting page faults
for the sched_autonuma, and most of the rest of the AutoNUMA core
logics like the false sharing detection, sysfs and initialization
routines.

When knuma_scand is not running, the AutoNUMA algorithm is a full
bypass and must not alter the runtime behavior of the memory
management and scheduler code.

The whole AutoNUMA logic is a chain reaction resulting from the
actions of knuma_scand. The various parts of the code can be described
as different gears (gears as in glxgears).

knuma_scand is the first gear: it collects the per-process
mm_autonuma statistics and at the same time marks the ptes/pmds it
scans as pte_numa and pmd_numa.

The second gear is the NUMA hinting page faults. These are triggered
by the pte_numa/pmd_numa markers set by knuma_scand. They collect the
per-thread sched_autonuma statistics and implement the memory follow
CPU logic, which tracks whether pages are repeatedly accessed by
remote nodes. The memory follow CPU logic can decide to migrate pages
across NUMA nodes by queuing them in the per-node knuma_migrated
queues.

The third gear is knuma_migrated. There is one knuma_migrated daemon
per node. Pages pending migration are queued in a matrix of lists:
each knuma_migrated (running in parallel with the others) walks those
lists and migrates the queued pages, in round robin from each incoming
node, to the node it is running on.
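
For illustration only (this is not part of the patch, and all
identifiers below are made up): a minimal userspace sketch of the
queue matrix drained by knuma_migrated. In the real code the matrix
is NODE_DATA(dst_nid)->autonuma_migrate_head[src_nid], one list per
incoming node kept by every destination node.

#include <stdio.h>

#define NR_NODES  4
#define QUEUE_LEN 16

/* queue[dst][src] holds fake page ids queued for migration src -> dst */
static int queue[NR_NODES][NR_NODES][QUEUE_LEN];
static int queued[NR_NODES][NR_NODES];	/* fill level of each queue */

static void queue_page(int dst, int src, int page_id)
{
	queue[dst][src][queued[dst][src]++] = page_id;
}

/* one pass of the per-node daemon of node "dst" */
static void knuma_migrated_pass(int dst)
{
	int src;

	for (src = 0; src < NR_NODES; src++) {
		if (src == dst || !queued[dst][src])
			continue;
		/* round robin: take one page from each incoming node */
		queued[dst][src]--;
		printf("node %d: migrate page %d from node %d\n",
		       dst, queue[dst][src][queued[dst][src]], src);
	}
}

int main(void)
{
	queue_page(0, 1, 100);	/* page 100 on node 1 wants to go to node 0 */
	queue_page(0, 2, 200);
	queue_page(0, 2, 201);

	knuma_migrated_pass(0);	/* moves one page from node 1, one from node 2 */
	knuma_migrated_pass(0);	/* moves the remaining page from node 2 */
	return 0;
}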

The fourth gear is the NUMA scheduler balancing code. It evaluates
the statistical information collected in mm->mm_autonuma and
p->sched_autonuma, together with the status of all CPUs, to decide
whether tasks should be migrated to CPUs in remote nodes.

The code includes fixes from Hillf Danton <dhillf@gmail.com>.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/autonuma.c | 1491 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 1491 insertions(+), 0 deletions(-)
 create mode 100644 mm/autonuma.c

diff --git a/mm/autonuma.c b/mm/autonuma.c
new file mode 100644
index 0000000..f44272b
--- /dev/null
+++ b/mm/autonuma.c
@@ -0,0 +1,1491 @@
+/*
+ *  Copyright (C) 2012  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ *
+ *  Boot with "numa=fake=2" to test on non-NUMA systems.
+ */
+
+#include <linux/mm.h>
+#include <linux/rmap.h>
+#include <linux/kthread.h>
+#include <linux/mmu_notifier.h>
+#include <linux/freezer.h>
+#include <linux/mm_inline.h>
+#include <linux/migrate.h>
+#include <linux/swap.h>
+#include <linux/autonuma.h>
+#include <asm/tlbflush.h>
+#include <asm/pgtable.h>
+
+unsigned long autonuma_flags __read_mostly =
+	(1<<AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG)|
+	(1<<AUTONUMA_SCHED_CLONE_RESET_FLAG)|
+	(1<<AUTONUMA_SCHED_FORK_RESET_FLAG)|
+#ifdef CONFIG_AUTONUMA_DEFAULT_ENABLED
+	(1<<AUTONUMA_FLAG)|
+#endif
+	(1<<AUTONUMA_SCAN_PMD_FLAG);
+
+static DEFINE_MUTEX(knumad_mm_mutex);
+
+/* knuma_scand */
+static unsigned int scan_sleep_millisecs __read_mostly = 100;
+static unsigned int scan_sleep_pass_millisecs __read_mostly = 5000;
+static unsigned int pages_to_scan __read_mostly = 128*1024*1024/PAGE_SIZE;
+static DECLARE_WAIT_QUEUE_HEAD(knuma_scand_wait);
+static unsigned long full_scans;
+static unsigned long pages_scanned;
+
+/* knuma_migrated */
+static unsigned int migrate_sleep_millisecs __read_mostly = 100;
+static unsigned int pages_to_migrate __read_mostly = 128*1024*1024/PAGE_SIZE;
+static volatile unsigned long pages_migrated;
+
+static struct knumad_scan {
+	struct list_head mm_head;
+	struct mm_struct *mm;
+	unsigned long address;
+} knumad_scan = {
+	.mm_head = LIST_HEAD_INIT(knumad_scan.mm_head),
+};
+
+static inline bool autonuma_impossible(void)
+{
+	return num_possible_nodes() <= 1 ||
+		test_bit(AUTONUMA_IMPOSSIBLE_FLAG, &autonuma_flags);
+}
+
+static inline void autonuma_migrate_lock(int nid)
+{
+	spin_lock(&NODE_DATA(nid)->autonuma_lock);
+}
+
+static inline void autonuma_migrate_unlock(int nid)
+{
+	spin_unlock(&NODE_DATA(nid)->autonuma_lock);
+}
+
+static inline void autonuma_migrate_lock_irq(int nid)
+{
+	spin_lock_irq(&NODE_DATA(nid)->autonuma_lock);
+}
+
+static inline void autonuma_migrate_unlock_irq(int nid)
+{
+	spin_unlock_irq(&NODE_DATA(nid)->autonuma_lock);
+}
+
+/* caller already holds the compound_lock */
+void autonuma_migrate_split_huge_page(struct page *page,
+				      struct page *page_tail)
+{
+	int nid, last_nid;
+
+	nid = page->autonuma_migrate_nid;
+	VM_BUG_ON(nid >= MAX_NUMNODES);
+	VM_BUG_ON(nid < -1);
+	VM_BUG_ON(page_tail->autonuma_migrate_nid != -1);
+	if (nid >= 0) {
+		VM_BUG_ON(page_to_nid(page) != page_to_nid(page_tail));
+
+		compound_lock(page_tail);
+		autonuma_migrate_lock(nid);
+		list_add_tail(&page_tail->autonuma_migrate_node,
+			      &page->autonuma_migrate_node);
+		autonuma_migrate_unlock(nid);
+
+		page_tail->autonuma_migrate_nid = nid;
+		compound_unlock(page_tail);
+	}
+
+	last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+	if (last_nid >= 0)
+		page_tail->autonuma_last_nid = last_nid;
+}
+
+void __autonuma_migrate_page_remove(struct page *page)
+{
+	unsigned long flags;
+	int nid;
+
+	flags = compound_lock_irqsave(page);
+
+	nid = page->autonuma_migrate_nid;
+	VM_BUG_ON(nid >= MAX_NUMNODES);
+	VM_BUG_ON(nid < -1);
+	if (nid >= 0) {
+		int numpages = hpage_nr_pages(page);
+		autonuma_migrate_lock(nid);
+		list_del(&page->autonuma_migrate_node);
+		NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
+		autonuma_migrate_unlock(nid);
+
+		page->autonuma_migrate_nid = -1;
+	}
+
+	compound_unlock_irqrestore(page, flags);
+}
+
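+/*
+ * Queue the page (currently sitting on node "page_nid") into the
+ * incoming list that node "dst_nid" keeps for "page_nid", dropping it
+ * from any queue it was already on, and wake dst_nid's knuma_migrated
+ * once enough pages are pending (unless migration is deferred via
+ * sysfs).
+ */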
+static void __autonuma_migrate_page_add(struct page *page, int dst_nid,
+					int page_nid)
+{
+	unsigned long flags;
+	int nid;
+	int numpages;
+	unsigned long nr_migrate_pages;
+	wait_queue_head_t *wait_queue;
+
+	VM_BUG_ON(dst_nid >= MAX_NUMNODES);
+	VM_BUG_ON(dst_nid < -1);
+	VM_BUG_ON(page_nid >= MAX_NUMNODES);
+	VM_BUG_ON(page_nid < -1);
+
+	VM_BUG_ON(page_nid == dst_nid);
+	VM_BUG_ON(page_to_nid(page) != page_nid);
+
+	flags = compound_lock_irqsave(page);
+
+	numpages = hpage_nr_pages(page);
+	nid = page->autonuma_migrate_nid;
+	VM_BUG_ON(nid >= MAX_NUMNODES);
+	VM_BUG_ON(nid < -1);
+	if (nid >= 0) {
+		autonuma_migrate_lock(nid);
+		list_del(&page->autonuma_migrate_node);
+		NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
+		autonuma_migrate_unlock(nid);
+	}
+
+	autonuma_migrate_lock(dst_nid);
+	list_add(&page->autonuma_migrate_node,
+		 &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid]);
+	NODE_DATA(dst_nid)->autonuma_nr_migrate_pages += numpages;
+	nr_migrate_pages = NODE_DATA(dst_nid)->autonuma_nr_migrate_pages;
+
+	autonuma_migrate_unlock(dst_nid);
+
+	page->autonuma_migrate_nid = dst_nid;
+
+	compound_unlock_irqrestore(page, flags);
+
+	if (!autonuma_migrate_defer()) {
+		wait_queue = &NODE_DATA(dst_nid)->autonuma_knuma_migrated_wait;
+		if (nr_migrate_pages >= pages_to_migrate &&
+		    nr_migrate_pages - numpages < pages_to_migrate &&
+		    waitqueue_active(wait_queue))
+			wake_up_interruptible(wait_queue);
+	}
+}
+
+static void autonuma_migrate_page_add(struct page *page, int dst_nid,
+				      int page_nid)
+{
+	int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+	if (migrate_nid != dst_nid)
+		__autonuma_migrate_page_add(page, dst_nid, page_nid);
+}
+
+static bool balance_pgdat(struct pglist_data *pgdat,
+			  int nr_migrate_pages)
+{
+	/* FIXME: this only checks the wmarks; make it move
+	 * "unused" memory or pagecache by queuing it to
+	 * pgdat->autonuma_migrate_head[pgdat->node_id].
+	 */
+	int z;
+	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+		struct zone *zone = pgdat->node_zones + z;
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (zone->all_unreclaimable)
+			continue;
+
+		/*
+		 * FIXME: deal with order with THP, maybe if
+		 * kswapd will learn using compaction, otherwise
+		 * order = 0 probably is ok.
+		 * FIXME: in theory we're ok if we can obtain
+		 * pages_to_migrate pages from all zones, it doesn't
+		 * need to be all in a single zone. We care about the
+		 * pgdat, not the zone.
+		 */
+
+		/*
+		 * Try not to wakeup kswapd by allocating
+		 * pages_to_migrate pages.
+		 */
+		if (!zone_watermark_ok(zone, 0,
+				       high_wmark_pages(zone) +
+				       nr_migrate_pages,
+				       0, 0))
+			continue;
+		return true;
+	}
+	return false;
+}
+
+static void cpu_follow_memory_pass(struct task_struct *p,
+				   struct task_autonuma *task_autonuma,
+				   unsigned long *task_numa_fault)
+{
+	int nid;
+	for_each_node(nid)
+		task_numa_fault[nid] >>= 1;
+	task_autonuma->task_numa_fault_tot >>= 1;
+}
+
+static void numa_hinting_fault_cpu_follow_memory(struct task_struct *p,
+						 int access_nid,
+						 int numpages,
+						 bool pass)
+{
+	struct task_autonuma *task_autonuma = p->task_autonuma;
+	unsigned long *task_numa_fault = task_autonuma->task_numa_fault;
+	if (unlikely(pass))
+		cpu_follow_memory_pass(p, task_autonuma, task_numa_fault);
+	task_numa_fault[access_nid] += numpages;
+	task_autonuma->task_numa_fault_tot += numpages;
+}
+
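+/*
+ * False sharing detection: confirm a migration only if the previous
+ * NUMA hinting fault on this page came from the same node that is
+ * faulting now (or if there is no previous record).  Otherwise record
+ * the new node in autonuma_last_nid and cancel a pending migration
+ * towards a different node.
+ */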
+static inline bool last_nid_set(struct task_struct *p,
+				struct page *page, int cpu_nid)
+{
+	bool ret = true;
+	int autonuma_last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+	VM_BUG_ON(cpu_nid < 0);
+	VM_BUG_ON(cpu_nid >= MAX_NUMNODES);
+	if (autonuma_last_nid >= 0 && autonuma_last_nid != cpu_nid) {
+		int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+		if (migrate_nid >= 0 && migrate_nid != cpu_nid)
+			__autonuma_migrate_page_remove(page);
+		ret = false;
+	}
+	if (autonuma_last_nid != cpu_nid)
+		ACCESS_ONCE(page->autonuma_last_nid) = cpu_nid;
+	return ret;
+}
+
+static int __page_migrate_nid(struct page *page, int page_nid)
+{
+	int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+	if (migrate_nid < 0)
+		migrate_nid = page_nid;
+#if 0
+	return page_nid;
+#endif
+	return migrate_nid;
+}
+
+static int page_migrate_nid(struct page *page)
+{
+	return __page_migrate_nid(page, page_to_nid(page));
+}
+
+static int numa_hinting_fault_memory_follow_cpu(struct task_struct *p,
+						struct page *page,
+						int cpu_nid, int page_nid,
+						bool pass)
+{
+	if (!last_nid_set(p, page, cpu_nid))
+		return __page_migrate_nid(page, page_nid);
+	if (!PageLRU(page))
+		return page_nid;
+	if (cpu_nid != page_nid)
+		autonuma_migrate_page_add(page, cpu_nid, page_nid);
+	else
+		autonuma_migrate_page_remove(page);
+	return cpu_nid;
+}
+
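+/*
+ * NUMA hinting page fault entry point (the "second gear"): account the
+ * fault in the per-thread statistics and run the memory follow CPU
+ * logic, which may queue the page for migration to the node of the CPU
+ * that touched it.
+ */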
+void numa_hinting_fault(struct page *page, int numpages)
+{
+	WARN_ON_ONCE(!current->mm);
+	if (likely(current->mm && !current->mempolicy && autonuma_enabled())) {
+		struct task_struct *p = current;
+		int cpu_nid, page_nid, access_nid;
+		bool pass;
+
+		pass = p->task_autonuma->task_numa_fault_pass !=
+			p->mm->mm_autonuma->mm_numa_fault_pass;
+		page_nid = page_to_nid(page);
+		cpu_nid = numa_node_id();
+		VM_BUG_ON(cpu_nid < 0);
+		VM_BUG_ON(cpu_nid >= MAX_NUMNODES);
+		access_nid = numa_hinting_fault_memory_follow_cpu(p, page,
+								  cpu_nid,
+								  page_nid,
+								  pass);
+		numa_hinting_fault_cpu_follow_memory(p, access_nid,
+						     numpages, pass);
+		if (unlikely(pass))
+			p->task_autonuma->task_numa_fault_pass =
+				p->mm->mm_autonuma->mm_numa_fault_pass;
+	}
+}
+
+pte_t __pte_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
+		       unsigned long addr, pte_t pte, pte_t *ptep)
+{
+	struct page *page;
+	pte = pte_mknotnuma(pte);
+	set_pte_at(mm, addr, ptep, pte);
+	page = vm_normal_page(vma, addr, pte);
+	BUG_ON(!page);
+	numa_hinting_fault(page, 1);
+	return pte;
+}
+
+void __pmd_numa_fixup(struct mm_struct *mm,
+		      unsigned long addr, pmd_t *pmdp)
+{
+	pmd_t pmd;
+	pte_t *pte;
+	unsigned long _addr = addr & PMD_MASK;
+	unsigned long offset;
+	spinlock_t *ptl;
+	bool numa = false;
+	struct vm_area_struct *vma;
+
+	spin_lock(&mm->page_table_lock);
+	pmd = *pmdp;
+	if (pmd_numa(pmd)) {
+		set_pmd_at(mm, _addr, pmdp, pmd_mknotnuma(pmd));
+		numa = true;
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	if (!numa)
+		return;
+
+	vma = find_vma(mm, _addr);
+	/* we're in a page fault so some vma must be in the range */
+	BUG_ON(!vma);
+	BUG_ON(vma->vm_start >= _addr + PMD_SIZE);
+	offset = max(_addr, vma->vm_start) & ~PMD_MASK;
+	VM_BUG_ON(offset >= PMD_SIZE);
+	pte = pte_offset_map_lock(mm, pmdp, _addr, &ptl);
+	pte += offset >> PAGE_SHIFT;
+	for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
+		pte_t pteval = *pte;
+		struct page * page;
+		if (!pte_present(pteval))
+			continue;
+		if (addr >= vma->vm_end) {
+			vma = find_vma(mm, addr);
+			/* there's a pte present so there must be a vma */
+			BUG_ON(!vma);
+			BUG_ON(addr < vma->vm_start);
+		}
+		if (pte_numa(pteval)) {
+			pteval = pte_mknotnuma(pteval);
+			set_pte_at(mm, addr, pte, pteval);
+		}
+		page = vm_normal_page(vma, addr, pteval);
+		if (unlikely(!page))
+			continue;
+		/* only check non-shared pages */
+		if (page_mapcount(page) != 1)
+			continue;
+		numa_hinting_fault(page, 1);
+	}
+	pte_unmap_unlock(pte, ptl);
+}
+
+static inline int task_autonuma_size(void)
+{
+	return sizeof(struct task_autonuma) +
+		num_possible_nodes() * sizeof(unsigned long);
+}
+
+static inline int task_autonuma_reset_size(void)
+{
+	struct task_autonuma *task_autonuma = NULL;
+	return task_autonuma_size() -
+		(int)((char *)(&task_autonuma->task_numa_fault_pass) -
+		      (char *)task_autonuma);
+}
+
+static void task_autonuma_reset(struct task_autonuma *task_autonuma)
+{
+	task_autonuma->autonuma_node = -1;
+	memset(&task_autonuma->task_numa_fault_pass, 0,
+	       task_autonuma_reset_size());
+}
+
+static inline int mm_autonuma_fault_size(void)
+{
+	return num_possible_nodes() * sizeof(unsigned long);
+}
+
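+/*
+ * mm_numa_fault is allocated with twice the number of nodes (see
+ * mm_autonuma_size()): the second half is used by knuma_scand as a
+ * private accumulator while scanning and is copied into the first
+ * half by mm_numa_fault_flush() at the end of each pass over the mm.
+ */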
+static inline unsigned long *mm_autonuma_numa_fault_tmp(struct mm_struct *mm)
+{
+	return mm->mm_autonuma->mm_numa_fault + num_possible_nodes();
+}
+
+static inline int mm_autonuma_size(void)
+{
+	return sizeof(struct mm_autonuma) + mm_autonuma_fault_size() * 2;
+}
+
+static inline int mm_autonuma_reset_size(void)
+{
+	struct mm_autonuma *mm_autonuma = NULL;
+	return mm_autonuma_size() -
+		(int)((char *)(&mm_autonuma->mm_numa_fault_pass) -
+		      (char *)mm_autonuma);
+}
+
+static void mm_autonuma_reset(struct mm_autonuma *mm_autonuma)
+{
+	memset(&mm_autonuma->mm_numa_fault_pass, 0, mm_autonuma_reset_size());
+}
+
+void autonuma_setup_new_exec(struct task_struct *p)
+{
+	if (p->task_autonuma)
+		task_autonuma_reset(p->task_autonuma);
+	if (p->mm && p->mm->mm_autonuma)
+		mm_autonuma_reset(p->mm->mm_autonuma);
+}
+
+static inline int knumad_test_exit(struct mm_struct *mm)
+{
+	return atomic_read(&mm->mm_users) == 0;
+}
+
+static int knumad_scan_pmd(struct mm_struct *mm,
+			   struct vm_area_struct *vma,
+			   unsigned long address)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte, *_pte;
+	struct page *page;
+	unsigned long _address, end;
+	spinlock_t *ptl;
+	int ret = 0;
+
+	VM_BUG_ON(address & ~PAGE_MASK);
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		goto out;
+
+	pmd = pmd_offset(pud, address);
+	if (pmd_none(*pmd))
+		goto out;
+	if (pmd_trans_huge(*pmd)) {
+		spin_lock(&mm->page_table_lock);
+		if (pmd_trans_huge(*pmd)) {
+			VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+			if (unlikely(pmd_trans_splitting(*pmd))) {
+				spin_unlock(&mm->page_table_lock);
+				wait_split_huge_page(vma->anon_vma, pmd);
+			} else {
+				int page_nid;
+				unsigned long *numa_fault_tmp;
+				ret = HPAGE_PMD_NR;
+
+				if (autonuma_scan_use_working_set() &&
+				    pmd_numa(*pmd)) {
+					spin_unlock(&mm->page_table_lock);
+					goto out;
+				}
+
+				page = pmd_page(*pmd);
+
+				/* only check non-shared pages */
+				if (page_mapcount(page) != 1) {
+					spin_unlock(&mm->page_table_lock);
+					goto out;
+				}
+
+				page_nid = page_migrate_nid(page);
+				numa_fault_tmp = mm_autonuma_numa_fault_tmp(mm);
+				numa_fault_tmp[page_nid] += ret;
+
+				if (pmd_numa(*pmd)) {
+					spin_unlock(&mm->page_table_lock);
+					goto out;
+				}
+
+				set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
+				/* defer TLB flush to lower the overhead */
+				spin_unlock(&mm->page_table_lock);
+				goto out;
+			}
+		} else
+			spin_unlock(&mm->page_table_lock);
+	}
+
+	VM_BUG_ON(!pmd_present(*pmd));
+
+	end = min(vma->vm_end, (address + PMD_SIZE) & PMD_MASK);
+	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+	for (_address = address, _pte = pte; _address < end;
+	     _pte++, _address += PAGE_SIZE) {
+		unsigned long *numa_fault_tmp;
+		pte_t pteval = *_pte;
+		if (!pte_present(pteval))
+			continue;
+		if (autonuma_scan_use_working_set() &&
+		    pte_numa(pteval))
+			continue;
+		page = vm_normal_page(vma, _address, pteval);
+		if (unlikely(!page))
+			continue;
+		/* only check non-shared pages */
+		if (page_mapcount(page) != 1)
+			continue;
+
+		numa_fault_tmp = mm_autonuma_numa_fault_tmp(mm);
+		numa_fault_tmp[page_migrate_nid(page)]++;
+
+		if (pte_numa(pteval))
+			continue;
+
+		if (!autonuma_scan_pmd())
+			set_pte_at(mm, _address, _pte, pte_mknuma(pteval));
+
+		/* defer TLB flush to lower the overhead */
+		ret++;
+	}
+	pte_unmap_unlock(pte, ptl);
+
+	if (ret && !pmd_numa(*pmd) && autonuma_scan_pmd()) {
+		spin_lock(&mm->page_table_lock);
+		set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
+		spin_unlock(&mm->page_table_lock);
+		/* defer TLB flush to lower the overhead */
+	}
+
+out:
+	return ret;
+}
+
+static void mm_numa_fault_flush(struct mm_struct *mm)
+{
+	int nid;
+	struct mm_autonuma *mma = mm->mm_autonuma;
+	unsigned long *numa_fault_tmp = mm_autonuma_numa_fault_tmp(mm);
+	unsigned long tot = 0;
+	/* FIXME: protect this with seqlock against autonuma_balance() */
+	for_each_node(nid) {
+		mma->mm_numa_fault[nid] = numa_fault_tmp[nid];
+		tot += mma->mm_numa_fault[nid];
+		numa_fault_tmp[nid] = 0;
+	}
+	mma->mm_numa_fault_tot = tot;
+}
+
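+/*
+ * Scan one batch (up to pages_to_scan pages worth of ptes/pmds) of the
+ * mm currently selected in knumad_scan, advancing to the next mm in
+ * the list once the current one is fully scanned or is exiting.
+ * Called by knuma_scand() with knumad_mm_mutex held; the mutex is
+ * dropped while the pagetables are walked and re-acquired before
+ * returning.
+ */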
+static int knumad_do_scan(void)
+{
+	struct mm_struct *mm;
+	struct mm_autonuma *mm_autonuma;
+	unsigned long address;
+	struct vm_area_struct *vma;
+	int progress = 0;
+
+	mm = knumad_scan.mm;
+	if (!mm) {
+		if (unlikely(list_empty(&knumad_scan.mm_head)))
+			return pages_to_scan;
+		mm_autonuma = list_entry(knumad_scan.mm_head.next,
+					 struct mm_autonuma, mm_node);
+		mm = mm_autonuma->mm;
+		knumad_scan.address = 0;
+		knumad_scan.mm = mm;
+		atomic_inc(&mm->mm_count);
+		mm_autonuma->mm_numa_fault_pass++;
+	}
+	address = knumad_scan.address;
+
+	mutex_unlock(&knumad_mm_mutex);
+
+	down_read(&mm->mmap_sem);
+	if (unlikely(knumad_test_exit(mm)))
+		vma = NULL;
+	else
+		vma = find_vma(mm, address);
+
+	progress++;
+	for (; vma && progress < pages_to_scan; vma = vma->vm_next) {
+		unsigned long start_addr, end_addr;
+		cond_resched();
+		if (unlikely(knumad_test_exit(mm))) {
+			progress++;
+			break;
+		}
+
+		if (!vma->anon_vma || vma_policy(vma)) {
+			progress++;
+			continue;
+		}
+		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)) {
+			progress++;
+			continue;
+		}
+		if (is_vma_temporary_stack(vma)) {
+			progress++;
+			continue;
+		}
+
+		VM_BUG_ON(address & ~PAGE_MASK);
+		if (address < vma->vm_start)
+			address = vma->vm_start;
+
+		start_addr = address;
+		while (address < vma->vm_end) {
+			cond_resched();
+			if (unlikely(knumad_test_exit(mm)))
+				break;
+
+			VM_BUG_ON(address < vma->vm_start ||
+				  address + PAGE_SIZE > vma->vm_end);
+			progress += knumad_scan_pmd(mm, vma, address);
+			/* move to next address */
+			address = (address + PMD_SIZE) & PMD_MASK;
+			if (progress >= pages_to_scan)
+				break;
+		}
+		end_addr = min(address, vma->vm_end);
+
+		/*
+		 * Flush the TLB for the mm to start the numa
+		 * hinting minor page faults after we finish
+		 * scanning this vma part.
+		 */
+		mmu_notifier_invalidate_range_start(vma->vm_mm, start_addr,
+						    end_addr);
+		flush_tlb_range(vma, start_addr, end_addr);
+		mmu_notifier_invalidate_range_end(vma->vm_mm, start_addr,
+						  end_addr);
+	}
+	up_read(&mm->mmap_sem); /* exit_mmap will destroy ptes after this */
+
+	mutex_lock(&knumad_mm_mutex);
+	VM_BUG_ON(knumad_scan.mm != mm);
+	knumad_scan.address = address;
+	/*
+	 * Change the current mm if this mm is about to die, or if we
+	 * scanned all vmas of this mm.
+	 */
+	if (knumad_test_exit(mm) || !vma) {
+		mm_autonuma = mm->mm_autonuma;
+		if (mm_autonuma->mm_node.next != &knumad_scan.mm_head) {
+			mm_autonuma = list_entry(mm_autonuma->mm_node.next,
+						 struct mm_autonuma, mm_node);
+			knumad_scan.mm = mm_autonuma->mm;
+			atomic_inc(&knumad_scan.mm->mm_count);
+			knumad_scan.address = 0;
+			knumad_scan.mm->mm_autonuma->mm_numa_fault_pass++;
+		} else
+			knumad_scan.mm = NULL;
+
+		if (knumad_test_exit(mm)) {
+			list_del(&mm->mm_autonuma->mm_node);
+			/* tell autonuma_exit not to list_del */
+			VM_BUG_ON(mm->mm_autonuma->mm != mm);
+			mm->mm_autonuma->mm = NULL;
+		} else
+			mm_numa_fault_flush(mm);
+
+		mmdrop(mm);
+	}
+
+	return progress;
+}
+
+static void wake_up_knuma_migrated(void)
+{
+	int nid;
+
+	lru_add_drain();
+	for_each_online_node(nid) {
+		struct pglist_data *pgdat = NODE_DATA(nid);
+		if (pgdat->autonuma_nr_migrate_pages &&
+		    waitqueue_active(&pgdat->autonuma_knuma_migrated_wait))
+			wake_up_interruptible(&pgdat->
+					      autonuma_knuma_migrated_wait);
+	}
+}
+
+static void knuma_scand_disabled(void)
+{
+	if (!autonuma_enabled())
+		wait_event_freezable(knuma_scand_wait,
+				     autonuma_enabled() ||
+				     kthread_should_stop());
+}
+
+static int knuma_scand(void *none)
+{
+	struct mm_struct *mm = NULL;
+	int progress = 0, _progress;
+	unsigned long total_progress = 0;
+
+	set_freezable();
+
+	knuma_scand_disabled();
+
+	mutex_lock(&knumad_mm_mutex);
+
+	for (;;) {
+		if (unlikely(kthread_should_stop()))
+			break;
+		_progress = knumad_do_scan();
+		progress += _progress;
+		total_progress += _progress;
+		mutex_unlock(&knumad_mm_mutex);
+
+		if (unlikely(!knumad_scan.mm)) {
+			autonuma_printk("knuma_scand %lu\n", total_progress);
+			pages_scanned += total_progress;
+			total_progress = 0;
+			full_scans++;
+
+			wait_event_freezable_timeout(knuma_scand_wait,
+						     kthread_should_stop(),
+						     msecs_to_jiffies(
+						     scan_sleep_pass_millisecs));
+			/* flush the last pending pages < pages_to_migrate */
+			wake_up_knuma_migrated();
+			wait_event_freezable_timeout(knuma_scand_wait,
+						     kthread_should_stop(),
+						     msecs_to_jiffies(
+						     scan_sleep_pass_millisecs));
+
+			if (autonuma_debug()) {
+				extern void sched_autonuma_dump_mm(void);
+				sched_autonuma_dump_mm();
+			}
+
+			/* wait while there is no pinned mm */
+			knuma_scand_disabled();
+		}
+		if (progress > pages_to_scan) {
+			progress = 0;
+			wait_event_freezable_timeout(knuma_scand_wait,
+						     kthread_should_stop(),
+						     msecs_to_jiffies(
+						     scan_sleep_millisecs));
+		}
+		cond_resched();
+		mutex_lock(&knumad_mm_mutex);
+	}
+
+	mm = knumad_scan.mm;
+	knumad_scan.mm = NULL;
+	if (mm && knumad_test_exit(mm)) {
+		list_del(&mm->mm_autonuma->mm_node);
+		/* tell autonuma_exit not to list_del */
+		VM_BUG_ON(mm->mm_autonuma->mm != mm);
+		mm->mm_autonuma->mm = NULL;
+	}
+	mutex_unlock(&knumad_mm_mutex);
+
+	if (mm)
+		mmdrop(mm);
+
+	return 0;
+}
+
+static int isolate_migratepages(struct list_head *migratepages,
+				struct pglist_data *pgdat)
+{
+	int nr = 0, nid;
+	struct list_head *heads = pgdat->autonuma_migrate_head;
+
+	/* FIXME: THP balancing, restart from last nid */
+	for_each_online_node(nid) {
+		struct zone *zone;
+		struct page *page;
+		struct lruvec *lruvec;
+
+		cond_resched();
+		VM_BUG_ON(numa_node_id() != pgdat->node_id);
+		if (nid == pgdat->node_id) {
+			VM_BUG_ON(!list_empty(&heads[nid]));
+			continue;
+		}
+		if (list_empty(&heads[nid]))
+			continue;
+		/* some page wants to go to this pgdat */
+		/*
+		 * Take the lock with irqs disabled to avoid a lock
+		 * inversion with the lru_lock which is taken before
+		 * the autonuma_migrate_lock in split_huge_page, and
+		 * that could be taken by interrupts after we obtained
+		 * the autonuma_migrate_lock here, if we didn't disable
+		 * irqs.
+		 */
+		autonuma_migrate_lock_irq(pgdat->node_id);
+		if (list_empty(&heads[nid])) {
+			autonuma_migrate_unlock_irq(pgdat->node_id);
+			continue;
+		}
+		page = list_entry(heads[nid].prev,
+				  struct page,
+				  autonuma_migrate_node);
+		if (unlikely(!get_page_unless_zero(page))) {
+			/*
+			 * Is getting freed and will remove self from the
+			 * autonuma list shortly, skip it for now.
+			 */
+			list_del(&page->autonuma_migrate_node);
+			list_add(&page->autonuma_migrate_node,
+				 &heads[nid]);
+			autonuma_migrate_unlock_irq(pgdat->node_id);
+			autonuma_printk("autonuma migrate page is free\n");
+			continue;
+		}
+		if (!PageLRU(page)) {
+			autonuma_migrate_unlock_irq(pgdat->node_id);
+			autonuma_printk("autonuma migrate page not in LRU\n");
+			__autonuma_migrate_page_remove(page);
+			put_page(page);
+			continue;
+		}
+		autonuma_migrate_unlock_irq(pgdat->node_id);
+
+		VM_BUG_ON(nid != page_to_nid(page));
+
+		if (PageTransHuge(page)) {
+			VM_BUG_ON(!PageAnon(page));
+			/* FIXME: remove split_huge_page */
+			if (unlikely(split_huge_page(page))) {
+				autonuma_printk("autonuma migrate THP free\n");
+				__autonuma_migrate_page_remove(page);
+				put_page(page);
+				continue;
+			}
+		}
+
+		__autonuma_migrate_page_remove(page);
+
+		zone = page_zone(page);
+		spin_lock_irq(&zone->lru_lock);
+
+		/* Must run under the lru_lock and before page isolation */
+		lruvec = mem_cgroup_page_lruvec(page, zone);
+
+		if (!__isolate_lru_page(page, 0)) {
+			VM_BUG_ON(PageTransCompound(page));
+			del_page_from_lru_list(page, lruvec, page_lru(page));
+			inc_zone_state(zone, page_is_file_cache(page) ?
+				       NR_ISOLATED_FILE : NR_ISOLATED_ANON);
+			spin_unlock_irq(&zone->lru_lock);
+			/*
+			 * hold the page pin at least until
+			 * __isolate_lru_page succeeds
+			 * (__isolate_lru_page takes a second pin when
+			 * it succeeds). If we release the pin before
+			 * __isolate_lru_page returns, the page could
+			 * have been freed and reallocated from under
+			 * us, so rendering worthless our previous
+			 * checks on the page including the
+			 * split_huge_page call.
+			 */
+			put_page(page);
+
+			list_add(&page->lru, migratepages);
+			nr += hpage_nr_pages(page);
+		} else {
+			/* FIXME: losing page, safest and simplest for now */
+			spin_unlock_irq(&zone->lru_lock);
+			put_page(page);
+			autonuma_printk("autonuma migrate page lost\n");
+		}
+	}
+
+	return nr;
+}
+
+static struct page *alloc_migrate_dst_page(struct page *page,
+					   unsigned long data,
+					   int **result)
+{
+	int nid = (int) data;
+	struct page *newpage;
+	newpage = alloc_pages_exact_node(nid,
+					 GFP_HIGHUSER_MOVABLE | GFP_THISNODE,
+					 0);
+	if (newpage)
+		newpage->autonuma_last_nid = page->autonuma_last_nid;
+	return newpage;
+}
+
+static void knumad_do_migrate(struct pglist_data *pgdat)
+{
+	int nr_migrate_pages = 0;
+	LIST_HEAD(migratepages);
+
+	autonuma_printk("nr_migrate_pages %lu to node %d\n",
+			pgdat->autonuma_nr_migrate_pages, pgdat->node_id);
+	do {
+		int isolated = 0;
+		if (balance_pgdat(pgdat, nr_migrate_pages))
+			isolated = isolate_migratepages(&migratepages, pgdat);
+		/* FIXME: might need to check too many isolated */
+		if (!isolated)
+			break;
+		nr_migrate_pages += isolated;
+	} while (nr_migrate_pages < pages_to_migrate);
+
+	if (nr_migrate_pages) {
+		int err;
+		autonuma_printk("migrate %d to node %d\n", nr_migrate_pages,
+				pgdat->node_id);
+		pages_migrated += nr_migrate_pages; /* FIXME: per node */
+		err = migrate_pages(&migratepages, alloc_migrate_dst_page,
+				    pgdat->node_id, false, true);
+		if (err)
+			/* FIXME: requeue failed pages */
+			putback_lru_pages(&migratepages);
+	}
+}
+
+static int knuma_migrated(void *arg)
+{
+	struct pglist_data *pgdat = (struct pglist_data *)arg;
+	int nid = pgdat->node_id;
+	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(nowakeup);
+
+	set_freezable();
+
+	for (;;) {
+		if (unlikely(kthread_should_stop()))
+			break;
+		/* FIXME: scan the free levels of this node; we may not
+		 * be allowed to receive memory if the wmarks of this
+		 * pgdat are below high.  In the future also add
+		 * not-interesting pages like not-accessed pages to
+		 * pgdat->autonuma_migrate_head[pgdat->node_id]; so we
+		 * can move our memory away to other nodes in order
+		 * to satisfy the high-wmark described above (so migration
+		 * can continue).
+		 */
+		knumad_do_migrate(pgdat);
+		if (!pgdat->autonuma_nr_migrate_pages) {
+			wait_event_freezable(
+				pgdat->autonuma_knuma_migrated_wait,
+				pgdat->autonuma_nr_migrate_pages ||
+				kthread_should_stop());
+			autonuma_printk("wake knuma_migrated %d\n", nid);
+		} else
+			wait_event_freezable_timeout(nowakeup,
+						     kthread_should_stop(),
+						     msecs_to_jiffies(
+						     migrate_sleep_millisecs));
+	}
+
+	return 0;
+}
+
+void autonuma_enter(struct mm_struct *mm)
+{
+	if (autonuma_impossible())
+		return;
+
+	mutex_lock(&knumad_mm_mutex);
+	list_add_tail(&mm->mm_autonuma->mm_node, &knumad_scan.mm_head);
+	mutex_unlock(&knumad_mm_mutex);
+}
+
+void autonuma_exit(struct mm_struct *mm)
+{
+	bool serialize;
+
+	if (autonuma_impossible())
+		return;
+
+	serialize = false;
+	mutex_lock(&knumad_mm_mutex);
+	if (knumad_scan.mm == mm)
+		serialize = true;
+	else if (mm->mm_autonuma->mm) {
+		VM_BUG_ON(mm->mm_autonuma->mm != mm);
+		mm->mm_autonuma->mm = NULL; /* debug */
+		list_del(&mm->mm_autonuma->mm_node);
+	}
+	mutex_unlock(&knumad_mm_mutex);
+
+	if (serialize) {
+		/* prevent the mm to go away under knumad_do_scan main loop */
+		down_write(&mm->mmap_sem);
+		up_write(&mm->mmap_sem);
+	}
+}
+
+static int start_knuma_scand(void)
+{
+	int err = 0;
+	struct task_struct *knumad_thread;
+
+	knumad_thread = kthread_run(knuma_scand, NULL, "knuma_scand");
+	if (unlikely(IS_ERR(knumad_thread))) {
+		autonuma_printk(KERN_ERR
+				"knumad: kthread_run(knuma_scand) failed\n");
+		err = PTR_ERR(knumad_thread);
+	}
+	return err;
+}
+
+static int start_knuma_migrated(void)
+{
+	int err = 0;
+	struct task_struct *knumad_thread;
+	int nid;
+
+	for_each_online_node(nid) {
+		knumad_thread = kthread_create_on_node(knuma_migrated,
+						       NODE_DATA(nid),
+						       nid,
+						       "knuma_migrated%d",
+						       nid);
+		if (unlikely(IS_ERR(knumad_thread))) {
+			autonuma_printk(KERN_ERR
+					"knumad: "
+					"kthread_run(knuma_migrated%d) "
+					"failed\n", nid);
+			err = PTR_ERR(knumad_thread);
+		} else {
+			autonuma_printk("cpumask %d %lx\n", nid,
+					cpumask_of_node(nid)->bits[0]);
+			kthread_bind_node(knumad_thread, nid);
+			wake_up_process(knumad_thread);
+		}
+	}
+	return err;
+}
+
+
+#ifdef CONFIG_SYSFS
+
+static ssize_t flag_show(struct kobject *kobj,
+			 struct kobj_attribute *attr, char *buf,
+			 enum autonuma_flag flag)
+{
+	return sprintf(buf, "%d\n",
+		       !!test_bit(flag, &autonuma_flags));
+}
+static ssize_t flag_store(struct kobject *kobj,
+			  struct kobj_attribute *attr,
+			  const char *buf, size_t count,
+			  enum autonuma_flag flag)
+{
+	unsigned long value;
+	int ret;
+
+	ret = kstrtoul(buf, 10, &value);
+	if (ret < 0)
+		return ret;
+	if (value > 1)
+		return -EINVAL;
+
+	if (value)
+		set_bit(flag, &autonuma_flags);
+	else
+		clear_bit(flag, &autonuma_flags);
+
+	return count;
+}
+
+static ssize_t enabled_show(struct kobject *kobj,
+			    struct kobj_attribute *attr, char *buf)
+{
+	return flag_show(kobj, attr, buf, AUTONUMA_FLAG);
+}
+static ssize_t enabled_store(struct kobject *kobj,
+			     struct kobj_attribute *attr,
+			     const char *buf, size_t count)
+{
+	ssize_t ret;
+
+	ret = flag_store(kobj, attr, buf, count, AUTONUMA_FLAG);
+
+	if (ret > 0 && autonuma_enabled())
+		wake_up_interruptible(&knuma_scand_wait);
+
+	return ret;
+}
+static struct kobj_attribute enabled_attr =
+	__ATTR(enabled, 0644, enabled_show, enabled_store);
+
+#define SYSFS_ENTRY(NAME, FLAG)						\
+static ssize_t NAME ## _show(struct kobject *kobj,			\
+			     struct kobj_attribute *attr, char *buf)	\
+{									\
+	return flag_show(kobj, attr, buf, FLAG);			\
+}									\
+									\
+static ssize_t NAME ## _store(struct kobject *kobj,			\
+			      struct kobj_attribute *attr,		\
+			      const char *buf, size_t count)		\
+{									\
+	return flag_store(kobj, attr, buf, count, FLAG);		\
+}									\
+static struct kobj_attribute NAME ## _attr =				\
+	__ATTR(NAME, 0644, NAME ## _show, NAME ## _store);
+
+SYSFS_ENTRY(debug, AUTONUMA_DEBUG_FLAG);
+SYSFS_ENTRY(pmd, AUTONUMA_SCAN_PMD_FLAG);
+SYSFS_ENTRY(working_set, AUTONUMA_SCAN_USE_WORKING_SET_FLAG);
+SYSFS_ENTRY(defer, AUTONUMA_MIGRATE_DEFER_FLAG);
+SYSFS_ENTRY(load_balance_strict, AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG);
+SYSFS_ENTRY(clone_reset, AUTONUMA_SCHED_CLONE_RESET_FLAG);
+SYSFS_ENTRY(fork_reset, AUTONUMA_SCHED_FORK_RESET_FLAG);
+
+#undef SYSFS_ENTRY
+
+enum {
+	SYSFS_KNUMA_SCAND_SLEEP_ENTRY,
+	SYSFS_KNUMA_SCAND_PAGES_ENTRY,
+	SYSFS_KNUMA_MIGRATED_SLEEP_ENTRY,
+	SYSFS_KNUMA_MIGRATED_PAGES_ENTRY,
+};
+
+#define SYSFS_ENTRY(NAME, SYSFS_TYPE)				\
+static ssize_t NAME ## _show(struct kobject *kobj,		\
+			     struct kobj_attribute *attr,	\
+			     char *buf)				\
+{								\
+	return sprintf(buf, "%u\n", NAME);			\
+}								\
+static ssize_t NAME ## _store(struct kobject *kobj,		\
+			      struct kobj_attribute *attr,	\
+			      const char *buf, size_t count)	\
+{								\
+	unsigned long val;					\
+	int err;						\
+								\
+	err = strict_strtoul(buf, 10, &val);			\
+	if (err || val > UINT_MAX)				\
+		return -EINVAL;					\
+	switch (SYSFS_TYPE) {					\
+	case SYSFS_KNUMA_SCAND_PAGES_ENTRY:			\
+	case SYSFS_KNUMA_MIGRATED_PAGES_ENTRY:			\
+		if (!val)					\
+			return -EINVAL;				\
+		break;						\
+	}							\
+								\
+	NAME = val;						\
+	switch (SYSFS_TYPE) {					\
+	case SYSFS_KNUMA_SCAND_SLEEP_ENTRY:			\
+		wake_up_interruptible(&knuma_scand_wait);	\
+		break;						\
+	case							\
+		SYSFS_KNUMA_MIGRATED_SLEEP_ENTRY:		\
+		wake_up_knuma_migrated();			\
+		break;						\
+	}							\
+								\
+	return count;						\
+}								\
+static struct kobj_attribute NAME ## _attr =			\
+	__ATTR(NAME, 0644, NAME ## _show, NAME ## _store);
+
+SYSFS_ENTRY(scan_sleep_millisecs, SYSFS_KNUMA_SCAND_SLEEP_ENTRY);
+SYSFS_ENTRY(scan_sleep_pass_millisecs, SYSFS_KNUMA_SCAND_SLEEP_ENTRY);
+SYSFS_ENTRY(pages_to_scan, SYSFS_KNUMA_SCAND_PAGES_ENTRY);
+
+SYSFS_ENTRY(migrate_sleep_millisecs, SYSFS_KNUMA_MIGRATED_SLEEP_ENTRY);
+SYSFS_ENTRY(pages_to_migrate, SYSFS_KNUMA_MIGRATED_PAGES_ENTRY);
+
+#undef SYSFS_ENTRY
+
+static struct attribute *autonuma_attr[] = {
+	&enabled_attr.attr,
+	&debug_attr.attr,
+	NULL,
+};
+static struct attribute_group autonuma_attr_group = {
+	.attrs = autonuma_attr,
+};
+
+#define SYSFS_ENTRY(NAME)					\
+static ssize_t NAME ## _show(struct kobject *kobj,		\
+			     struct kobj_attribute *attr,	\
+			     char *buf)				\
+{								\
+	return sprintf(buf, "%lu\n", NAME);			\
+}								\
+static struct kobj_attribute NAME ## _attr =			\
+	__ATTR_RO(NAME);
+
+SYSFS_ENTRY(full_scans);
+SYSFS_ENTRY(pages_scanned);
+SYSFS_ENTRY(pages_migrated);
+
+#undef SYSFS_ENTRY
+
+static struct attribute *knuma_scand_attr[] = {
+	&scan_sleep_millisecs_attr.attr,
+	&scan_sleep_pass_millisecs_attr.attr,
+	&pages_to_scan_attr.attr,
+	&pages_scanned_attr.attr,
+	&full_scans_attr.attr,
+	&pmd_attr.attr,
+	&working_set_attr.attr,
+	NULL,
+};
+static struct attribute_group knuma_scand_attr_group = {
+	.attrs = knuma_scand_attr,
+	.name = "knuma_scand",
+};
+
+static struct attribute *knuma_migrated_attr[] = {
+	&migrate_sleep_millisecs_attr.attr,
+	&pages_to_migrate_attr.attr,
+	&pages_migrated_attr.attr,
+	&defer_attr.attr,
+	NULL,
+};
+static struct attribute_group knuma_migrated_attr_group = {
+	.attrs = knuma_migrated_attr,
+	.name = "knuma_migrated",
+};
+
+static struct attribute *scheduler_attr[] = {
+	&clone_reset_attr.attr,
+	&fork_reset_attr.attr,
+	&load_balance_strict_attr.attr,
+	NULL,
+};
+static struct attribute_group scheduler_attr_group = {
+	.attrs = scheduler_attr,
+	.name = "scheduler",
+};
+
+static int __init autonuma_init_sysfs(struct kobject **autonuma_kobj)
+{
+	int err;
+
+	*autonuma_kobj = kobject_create_and_add("autonuma", mm_kobj);
+	if (unlikely(!*autonuma_kobj)) {
+		printk(KERN_ERR "autonuma: failed kobject create\n");
+		return -ENOMEM;
+	}
+
+	err = sysfs_create_group(*autonuma_kobj, &autonuma_attr_group);
+	if (err) {
+		printk(KERN_ERR "autonuma: failed register autonuma group\n");
+		goto delete_obj;
+	}
+
+	err = sysfs_create_group(*autonuma_kobj, &knuma_scand_attr_group);
+	if (err) {
+		printk(KERN_ERR
+		       "autonuma: failed register knuma_scand group\n");
+		goto remove_autonuma;
+	}
+
+	err = sysfs_create_group(*autonuma_kobj, &knuma_migrated_attr_group);
+	if (err) {
+		printk(KERN_ERR
+		       "autonuma: failed register knuma_migrated group\n");
+		goto remove_knuma_scand;
+	}
+
+	err = sysfs_create_group(*autonuma_kobj, &scheduler_attr_group);
+	if (err) {
+		printk(KERN_ERR
+		       "autonuma: failed register scheduler group\n");
+		goto remove_knuma_migrated;
+	}
+
+	return 0;
+
+remove_knuma_migrated:
+	sysfs_remove_group(*autonuma_kobj, &knuma_migrated_attr_group);
+remove_knuma_scand:
+	sysfs_remove_group(*autonuma_kobj, &knuma_scand_attr_group);
+remove_autonuma:
+	sysfs_remove_group(*autonuma_kobj, &autonuma_attr_group);
+delete_obj:
+	kobject_put(*autonuma_kobj);
+	return err;
+}
+
+static void __init autonuma_exit_sysfs(struct kobject *autonuma_kobj)
+{
+	sysfs_remove_group(autonuma_kobj, &knuma_migrated_attr_group);
+	sysfs_remove_group(autonuma_kobj, &knuma_scand_attr_group);
+	sysfs_remove_group(autonuma_kobj, &autonuma_attr_group);
+	kobject_put(autonuma_kobj);
+}
+#else
+static inline int autonuma_init_sysfs(struct kobject **autonuma_kobj)
+{
+	return 0;
+}
+
+static inline void autonuma_exit_sysfs(struct kobject *autonuma_kobj)
+{
+}
+#endif /* CONFIG_SYSFS */
+
+static int __init noautonuma_setup(char *str)
+{
+	if (!autonuma_impossible()) {
+		printk("AutoNUMA permanently disabled\n");
+		set_bit(AUTONUMA_IMPOSSIBLE_FLAG, &autonuma_flags);
+		BUG_ON(!autonuma_impossible());
+	}
+	return 1;
+}
+__setup("noautonuma", noautonuma_setup);
+
+static int __init autonuma_init(void)
+{
+	int err;
+	struct kobject *autonuma_kobj;
+
+	VM_BUG_ON(num_possible_nodes() < 1);
+	if (autonuma_impossible())
+		return -EINVAL;
+
+	err = autonuma_init_sysfs(&autonuma_kobj);
+	if (err)
+		return err;
+
+	err = start_knuma_scand();
+	if (err) {
+		printk("failed to start knuma_scand\n");
+		goto out;
+	}
+	err = start_knuma_migrated();
+	if (err) {
+		printk("failed to start knuma_migrated\n");
+		goto out;
+	}
+
+	printk("AutoNUMA initialized successfully\n");
+	return err;
+
+out:
+	autonuma_exit_sysfs(autonuma_kobj);
+	return err;
+}
+module_init(autonuma_init)
+
+static struct kmem_cache *task_autonuma_cachep;
+
+int alloc_task_autonuma(struct task_struct *tsk, struct task_struct *orig,
+			 int node)
+{
+	int err = 1;
+	struct task_autonuma *task_autonuma;
+
+	if (autonuma_impossible())
+		goto no_numa;
+	task_autonuma = kmem_cache_alloc_node(task_autonuma_cachep,
+					      GFP_KERNEL, node);
+	if (!task_autonuma)
+		goto out;
+	if (autonuma_sched_clone_reset())
+		task_autonuma_reset(task_autonuma);
+	else
+		memcpy(task_autonuma, orig->task_autonuma,
+		       task_autonuma_size());
+	tsk->task_autonuma = task_autonuma;
+no_numa:
+	err = 0;
+out:
+	return err;
+}
+
+void free_task_autonuma(struct task_struct *tsk)
+{
+	if (autonuma_impossible()) {
+		BUG_ON(tsk->task_autonuma);
+		return;
+	}
+
+	BUG_ON(!tsk->task_autonuma);
+	kmem_cache_free(task_autonuma_cachep, tsk->task_autonuma);
+	tsk->task_autonuma = NULL;
+}
+
+void __init task_autonuma_init(void)
+{
+	struct task_autonuma *task_autonuma;
+
+	BUG_ON(current != &init_task);
+
+	if (autonuma_impossible())
+		return;
+
+	task_autonuma_cachep =
+		kmem_cache_create("task_autonuma",
+				  task_autonuma_size(), 0,
+				  SLAB_PANIC | SLAB_HWCACHE_ALIGN, NULL);
+
+	task_autonuma = kmem_cache_alloc_node(task_autonuma_cachep,
+					      GFP_KERNEL, numa_node_id());
+	BUG_ON(!task_autonuma);
+	task_autonuma_reset(task_autonuma);
+	BUG_ON(current->task_autonuma);
+	current->task_autonuma = task_autonuma;
+}
+
+static struct kmem_cache *mm_autonuma_cachep;
+
+int alloc_mm_autonuma(struct mm_struct *mm)
+{
+	int err = 1;
+	struct mm_autonuma *mm_autonuma;
+
+	if (autonuma_impossible())
+		goto no_numa;
+	mm_autonuma = kmem_cache_alloc(mm_autonuma_cachep, GFP_KERNEL);
+	if (!mm_autonuma)
+		goto out;
+	if (autonuma_sched_fork_reset() || !mm->mm_autonuma)
+		mm_autonuma_reset(mm_autonuma);
+	else
+		memcpy(mm_autonuma, mm->mm_autonuma, mm_autonuma_size());
+	mm->mm_autonuma = mm_autonuma;
+	mm_autonuma->mm = mm;
+no_numa:
+	err = 0;
+out:
+	return err;
+}
+
+void free_mm_autonuma(struct mm_struct *mm)
+{
+	if (autonuma_impossible()) {
+		BUG_ON(mm->mm_autonuma);
+		return;
+	}
+
+	BUG_ON(!mm->mm_autonuma);
+	kmem_cache_free(mm_autonuma_cachep, mm->mm_autonuma);
+	mm->mm_autonuma = NULL;
+}
+
+void __init mm_autonuma_init(void)
+{
+	BUG_ON(current != &init_task);
+	BUG_ON(current->mm);
+
+	if (autonuma_impossible())
+		return;
+
+	mm_autonuma_cachep =
+		kmem_cache_create("mm_autonuma",
+				  mm_autonuma_size(), 0,
+				  SLAB_PANIC | SLAB_HWCACHE_ALIGN, NULL);
+}

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH 25/40] autonuma: follow_page check for pte_numa/pmd_numa
  2012-06-28 12:55 ` Andrea Arcangeli
@ 2012-06-28 12:56   ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 12:56 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Without this, follow_page wouldn't trigger the NUMA hinting faults.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/memory.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 2e9cab2..78b6acc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1491,7 +1491,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 		goto no_page_table;
 
 	pmd = pmd_offset(pud, address);
-	if (pmd_none(*pmd))
+	if (pmd_none(*pmd) || pmd_numa(*pmd))
 		goto no_page_table;
 	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
 		BUG_ON(flags & FOLL_GET);
@@ -1525,7 +1525,7 @@ split_fallthrough:
 	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
 
 	pte = *ptep;
-	if (!pte_present(pte))
+	if (!pte_present(pte) || pte_numa(pte))
 		goto no_page;
 	if ((flags & FOLL_WRITE) && !pte_write(pte))
 		goto unlock;

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH 26/40] autonuma: default mempolicy follow AutoNUMA
  2012-06-28 12:55 ` Andrea Arcangeli
@ 2012-06-28 12:56   ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 12:56 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

If the task has already been moved to an autonuma_node, try to
allocate memory from that node even if it is temporarily not the
local node. Chances are that is where most of the task's memory is
already located and where the task will run in the future.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/mempolicy.c |   15 +++++++++++++--
 1 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 1d771e4..86c0df0 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1945,10 +1945,21 @@ retry_cpuset:
 	 */
 	if (pol->mode == MPOL_INTERLEAVE)
 		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
-	else
+	else {
+		int nid;
+#ifdef CONFIG_AUTONUMA
+		nid = -1;
+		if (current->task_autonuma)
+			nid = current->task_autonuma->autonuma_node;
+		if (nid < 0)
+			nid = numa_node_id();
+#else
+		nid = numa_node_id();
+#endif
 		page = __alloc_pages_nodemask(gfp, order,
-				policy_zonelist(gfp, pol, numa_node_id()),
+				policy_zonelist(gfp, pol, nid),
 				policy_nodemask(gfp, pol));
+	}
 
 	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
 		goto retry_cpuset;

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH 27/40] autonuma: call autonuma_split_huge_page()
  2012-06-28 12:55 ` Andrea Arcangeli
@ 2012-06-28 12:56   ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 12:56 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

This is needed to make sure the tail pages are also queued into the
migration queues of knuma_migrated across a transparent hugepage
split.
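
For reference, the callee (added to mm/autonuma.c earlier in the series)
queues each tail page right behind the head on the same per-node
migration list and inherits the head's last_nid. Lightly trimmed sketch,
VM_BUG_ON checks dropped:

void autonuma_migrate_split_huge_page(struct page *page,
				      struct page *page_tail)
{
	int nid, last_nid;

	nid = page->autonuma_migrate_nid;
	if (nid >= 0) {
		compound_lock(page_tail);
		autonuma_migrate_lock(nid);
		/* keep the tail in the same destination-node queue */
		list_add_tail(&page_tail->autonuma_migrate_node,
			      &page->autonuma_migrate_node);
		autonuma_migrate_unlock(nid);
		page_tail->autonuma_migrate_nid = nid;
		compound_unlock(page_tail);
	}

	last_nid = ACCESS_ONCE(page->autonuma_last_nid);
	if (last_nid >= 0)
		page_tail->autonuma_last_nid = last_nid;
}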

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1598708..55fc72d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -17,6 +17,7 @@
 #include <linux/khugepaged.h>
 #include <linux/freezer.h>
 #include <linux/mman.h>
+#include <linux/autonuma.h>
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
 #include "internal.h"
@@ -1316,6 +1317,7 @@ static void __split_huge_page_refcount(struct page *page)
 		BUG_ON(!PageSwapBacked(page_tail));
 
 		lru_add_page_tail(page, page_tail, lruvec);
+		autonuma_migrate_split_huge_page(page, page_tail);
 	}
 	atomic_sub(tail_count, &page->_count);
 	BUG_ON(__page_count(page) <= 0);

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH 28/40] autonuma: make khugepaged pte_numa aware
  2012-06-28 12:55 ` Andrea Arcangeli
@ 2012-06-28 12:56   ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 12:56 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

If any of the ptes that khugepaged is collapsing was a pte_numa, the
resulting trans huge pmd will be a pmd_numa too.
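
Condensed, the logic added below amounts to this (sketch with a
hypothetical helper name; the patch open-codes it across
__collapse_huge_page_copy() and collapse_huge_page()):

static bool collapse_range_was_numa(pmd_t old_pmd, pte_t *ptes)
{
	bool mknuma = pmd_numa(old_pmd);	/* the pmd itself may be marked */
	int i;

	/* any pte_numa in the 2m range taints the new huge pmd */
	for (i = 0; i < HPAGE_PMD_NR; i++)
		mknuma |= pte_numa(ptes[i]);
	return mknuma;		/* if true the caller runs pmd_mknuma() */
}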

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c |   13 +++++++++++--
 1 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 55fc72d..094f82b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1799,12 +1799,13 @@ out:
 	return isolated;
 }
 
-static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
+static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
 				      struct vm_area_struct *vma,
 				      unsigned long address,
 				      spinlock_t *ptl)
 {
 	pte_t *_pte;
+	bool mknuma = false;
 	for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++) {
 		pte_t pteval = *_pte;
 		struct page *src_page;
@@ -1832,11 +1833,15 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
 			page_remove_rmap(src_page);
 			spin_unlock(ptl);
 			free_page_and_swap_cache(src_page);
+
+			mknuma |= pte_numa(pteval);
 		}
 
 		address += PAGE_SIZE;
 		page++;
 	}
+
+	return mknuma;
 }
 
 static void collapse_huge_page(struct mm_struct *mm,
@@ -1854,6 +1859,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	spinlock_t *ptl;
 	int isolated;
 	unsigned long hstart, hend;
+	bool mknuma = false;
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 #ifndef CONFIG_NUMA
@@ -1972,7 +1978,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	anon_vma_unlock(vma->anon_vma);
 
-	__collapse_huge_page_copy(pte, new_page, vma, address, ptl);
+	mknuma = pmd_numa(_pmd);
+	mknuma |= __collapse_huge_page_copy(pte, new_page, vma, address, ptl);
 	pte_unmap(pte);
 	__SetPageUptodate(new_page);
 	pgtable = pmd_pgtable(_pmd);
@@ -1982,6 +1989,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	_pmd = mk_pmd(new_page, vma->vm_page_prot);
 	_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
 	_pmd = pmd_mkhuge(_pmd);
+	if (mknuma)
+		_pmd = pmd_mknuma(_pmd);
 
 	/*
 	 * spin_lock() below is not the equivalent of smp_wmb(), so

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH 29/40] autonuma: retain page last_nid information in khugepaged
  2012-06-28 12:55 ` Andrea Arcangeli
@ 2012-06-28 12:56   ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 12:56 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

When pages are collapsed, try to preserve the last_nid information from
one of the original pages.
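
Schematically the transfer is just the following (hypothetical helper
name; the patch open-codes it in __collapse_huge_page_copy()). last_nid
is read and written locklessly, hence the ACCESS_ONCE() pair; racing
updates are tolerated:

static void inherit_last_nid(struct page *hpage, struct page *src_page)
{
	int nid = ACCESS_ONCE(src_page->autonuma_last_nid);

	/* pick the last one seen, better than nothing */
	if (nid >= 0)
		ACCESS_ONCE(hpage->autonuma_last_nid) = nid;
}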

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 094f82b..ae20409 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1814,7 +1814,18 @@ static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
 			clear_user_highpage(page, address);
 			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
 		} else {
+#ifdef CONFIG_AUTONUMA
+			int autonuma_last_nid;
+#endif
 			src_page = pte_page(pteval);
+#ifdef CONFIG_AUTONUMA
+			/* pick the last one, better than nothing */
+			autonuma_last_nid =
+				ACCESS_ONCE(src_page->autonuma_last_nid);
+			if (autonuma_last_nid >= 0)
+				ACCESS_ONCE(page->autonuma_last_nid) =
+					autonuma_last_nid;
+#endif
 			copy_user_highpage(page, src_page, address, vma);
 			VM_BUG_ON(page_mapcount(src_page) != 1);
 			VM_BUG_ON(page_count(src_page) != 2);

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH 30/40] autonuma: numa hinting page faults entry points
  2012-06-28 12:55 ` Andrea Arcangeli
@ 2012-06-28 12:56   ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 12:56 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

This is where the numa hinting page faults are detected and passed on
to the AutoNUMA core logic.
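
The __pte_numa_fixup() callee is not part of this patch (it lives in
the AutoNUMA core added earlier in the series), but it follows the same
shape as the __huge_pmd_numa_fixup() added below: make the entry
present again and report the touch to the core. Rough sketch, not the
exact core code:

static pte_t pte_numa_fixup_sketch(struct mm_struct *mm,
				   struct vm_area_struct *vma,
				   unsigned long addr, pte_t pte, pte_t *ptep)
{
	/* handle_pte_fault() already holds the pte lock here */
	struct page *page = vm_normal_page(vma, addr, pte);

	pte = pte_mknotnuma(pte);		/* make the pte present again */
	set_pte_at(mm, addr, ptep, pte);
	if (page)
		numa_hinting_fault(page, 1);	/* feed the AutoNUMA statistics */
	return pte;
}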

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/huge_mm.h |    2 ++
 mm/huge_memory.c        |   17 +++++++++++++++++
 mm/memory.c             |   31 +++++++++++++++++++++++++++++++
 3 files changed, 50 insertions(+), 0 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ad4e2e0..5270c81 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -11,6 +11,8 @@ extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			       unsigned long address, pmd_t *pmd,
 			       pmd_t orig_pmd);
+extern pmd_t __huge_pmd_numa_fixup(struct mm_struct *mm, unsigned long addr,
+				   pmd_t pmd, pmd_t *pmdp);
 extern pgtable_t get_pmd_huge_pte(struct mm_struct *mm);
 extern struct page *follow_trans_huge_pmd(struct mm_struct *mm,
 					  unsigned long addr,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ae20409..4fcdaf7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1037,6 +1037,23 @@ out:
 	return page;
 }
 
+#ifdef CONFIG_AUTONUMA
+pmd_t __huge_pmd_numa_fixup(struct mm_struct *mm, unsigned long addr,
+			    pmd_t pmd, pmd_t *pmdp)
+{
+	spin_lock(&mm->page_table_lock);
+	if (pmd_same(pmd, *pmdp)) {
+		struct page *page = pmd_page(pmd);
+		pmd = pmd_mknotnuma(pmd);
+		set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmdp, pmd);
+		numa_hinting_fault(page, HPAGE_PMD_NR);
+		VM_BUG_ON(pmd_numa(pmd));
+	}
+	spin_unlock(&mm->page_table_lock);
+	return pmd;
+}
+#endif
+
 int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 pmd_t *pmd, unsigned long addr)
 {
diff --git a/mm/memory.c b/mm/memory.c
index 78b6acc..d72aafd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -57,6 +57,7 @@
 #include <linux/swapops.h>
 #include <linux/elf.h>
 #include <linux/gfp.h>
+#include <linux/autonuma.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -3406,6 +3407,31 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
 }
 
+static inline pte_t pte_numa_fixup(struct mm_struct *mm,
+				   struct vm_area_struct *vma,
+				   unsigned long addr, pte_t pte, pte_t *ptep)
+{
+	if (pte_numa(pte))
+		pte = __pte_numa_fixup(mm, vma, addr, pte, ptep);
+	return pte;
+}
+
+static inline void pmd_numa_fixup(struct mm_struct *mm,
+				  unsigned long addr, pmd_t *pmd)
+{
+	if (pmd_numa(*pmd))
+		__pmd_numa_fixup(mm, addr, pmd);
+}
+
+static inline pmd_t huge_pmd_numa_fixup(struct mm_struct *mm,
+					unsigned long addr,
+					pmd_t pmd, pmd_t *pmdp)
+{
+	if (pmd_numa(pmd))
+		pmd = __huge_pmd_numa_fixup(mm, addr, pmd, pmdp);
+	return pmd;
+}
+
 /*
  * These routines also need to handle stuff like marking pages dirty
  * and/or accessed for architectures that don't do it in hardware (most
@@ -3448,6 +3474,7 @@ int handle_pte_fault(struct mm_struct *mm,
 	spin_lock(ptl);
 	if (unlikely(!pte_same(*pte, entry)))
 		goto unlock;
+	entry = pte_numa_fixup(mm, vma, address, entry, pte);
 	if (flags & FAULT_FLAG_WRITE) {
 		if (!pte_write(entry))
 			return do_wp_page(mm, vma, address,
@@ -3512,6 +3539,8 @@ retry:
 
 		barrier();
 		if (pmd_trans_huge(orig_pmd)) {
+			orig_pmd = huge_pmd_numa_fixup(mm, address,
+						       orig_pmd, pmd);
 			if (flags & FAULT_FLAG_WRITE &&
 			    !pmd_write(orig_pmd) &&
 			    !pmd_trans_splitting(orig_pmd)) {
@@ -3530,6 +3559,8 @@ retry:
 		}
 	}
 
+	pmd_numa_fixup(mm, address, pmd);
+
 	/*
 	 * Use __pte_alloc instead of pte_alloc_map, because we can't
 	 * run pte_offset_map on the pmd, if an huge pmd could

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH 31/40] autonuma: reset autonuma page data when pages are freed
  2012-06-28 12:55 ` Andrea Arcangeli
@ 2012-06-28 12:56   ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 12:56 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

When pages are freed, abort any pending migration. If knuma_migrated
gets to the page first, it will notice anyway, because
get_page_unless_zero() fails on a page that is being freed.
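
The consumer side of this handshake is knuma_migrated's isolation loop
(it shows up again in the mm/autonuma.c hunk of the page_autonuma patch
later in the series); condensed, it reacts like this:

page = list_entry(heads[nid].prev, struct page, autonuma_migrate_node);
if (unlikely(!get_page_unless_zero(page))) {
	/*
	 * Refcount already zero: the page is being freed and the free
	 * path (this patch) unlinks it from the queue, so rotate it
	 * and look at the next entry.
	 */
	list_del(&page->autonuma_migrate_node);
	list_add(&page->autonuma_migrate_node, &heads[nid]);
	continue;
}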

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/page_alloc.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48eabe9..841d964 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -615,6 +615,10 @@ static inline int free_pages_check(struct page *page)
 		bad_page(page);
 		return 1;
 	}
+	autonuma_migrate_page_remove(page);
+#ifdef CONFIG_AUTONUMA
+	page->autonuma_last_nid = -1;
+#endif
 	if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
 		page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	return 0;

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH 32/40] autonuma: initialize page structure fields
  2012-06-28 12:55 ` Andrea Arcangeli
@ 2012-06-28 12:56   ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 12:56 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Initialize the AutoNUMA page structure fields at boot.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/page_alloc.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 841d964..8c4ae8e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3729,6 +3729,10 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
 
 		INIT_LIST_HEAD(&page->lru);
+#ifdef CONFIG_AUTONUMA
+		page->autonuma_last_nid = -1;
+		page->autonuma_migrate_nid = -1;
+#endif
 #ifdef WANT_PAGE_VIRTUAL
 		/* The shift won't overflow because ZONE_NORMAL is below 4G. */
 		if (!is_highmem_idx(zone))

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH 33/40] autonuma: link mm/autonuma.o and kernel/sched/numa.o
  2012-06-28 12:55 ` Andrea Arcangeli
@ 2012-06-28 12:56   ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 12:56 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Link the AutoNUMA core and scheduler object files into the kernel when
CONFIG_AUTONUMA=y.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/sched/Makefile |    1 +
 mm/Makefile           |    1 +
 2 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 173ea52..783a840 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -16,3 +16,4 @@ obj-$(CONFIG_SMP) += cpupri.o
 obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
+obj-$(CONFIG_AUTONUMA) += numa.o
diff --git a/mm/Makefile b/mm/Makefile
index 2e2fbbe..15900fd 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -33,6 +33,7 @@ obj-$(CONFIG_FRONTSWAP)	+= frontswap.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
 obj-$(CONFIG_NUMA) 	+= mempolicy.o
+obj-$(CONFIG_AUTONUMA) 	+= autonuma.o
 obj-$(CONFIG_SPARSEMEM)	+= sparse.o
 obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_SLOB) += slob.o

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH 34/40] autonuma: add CONFIG_AUTONUMA and CONFIG_AUTONUMA_DEFAULT_ENABLED
  2012-06-28 12:55 ` Andrea Arcangeli
@ 2012-06-28 12:56   ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 12:56 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Add the config options to allow building the kernel with AutoNUMA.

If CONFIG_AUTONUMA_DEFAULT_ENABLED is "=y", then
/sys/kernel/mm/autonuma/enabled will be equal to 1, and AutoNUMA will
be enabled automatically at boot.
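
One plausible shape for how the Kconfig option feeds that boot-time
default is sketched below (the real autonuma_flags initializer in
mm/autonuma.c may set additional bits):

unsigned long autonuma_flags __read_mostly =
#ifdef CONFIG_AUTONUMA_DEFAULT_ENABLED
	(1UL << AUTONUMA_FLAG) |
#endif
	0;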

CONFIG_AUTONUMA currently depends on X86 because no other arch
implements pte/pmd_numa yet, so selecting =y elsewhere would break the
build. This restriction will be relaxed in the future; porting
AutoNUMA to other archs should be fairly simple.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/Kconfig |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index 82fed4e..330dd51 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -207,6 +207,19 @@ config MIGRATION
 	  pages as migration can relocate pages to satisfy a huge page
 	  allocation instead of reclaiming.
 
+config AUTONUMA
+	bool "Auto NUMA"
+	select MIGRATION
+	depends on NUMA && X86
+	help
+	  Automatic NUMA CPU scheduling and memory migration.
+
+config AUTONUMA_DEFAULT_ENABLED
+	bool "Auto NUMA default enabled"
+	depends on AUTONUMA
+	help
+	  Automatic NUMA CPU scheduling and memory migration enabled at boot.
+
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
 

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH 35/40] autonuma: boost khugepaged scanning rate
  2012-06-28 12:55 ` Andrea Arcangeli
@ 2012-06-28 12:56   ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 12:56 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Until THP native migration is implemented, it's safer to boost the
khugepaged scanning rate, because every memory migration currently
splits the hugepages. The regular scanning rate would be too low to
re-collapse them when lots of memory is being migrated.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4fcdaf7..bcaa8ac 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -573,6 +573,14 @@ static int __init hugepage_init(void)
 
 	set_recommended_min_free_kbytes();
 
+#ifdef CONFIG_AUTONUMA
+	/* Hack, remove after THP native migration */
+	if (num_possible_nodes() > 1) {
+		khugepaged_scan_sleep_millisecs = 100;
+		khugepaged_alloc_sleep_millisecs = 10000;
+	}
+#endif
+
 	return 0;
 out:
 	hugepage_exit_sysfs(hugepage_kobj);

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* [PATCH 36/40] autonuma: page_autonuma
  2012-06-28 12:55 ` Andrea Arcangeli
@ 2012-06-28 12:56   ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 12:56 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Move the AutoNUMA per-page information from "struct page" to a
separate page_autonuma data structure, allocated in the memsection
(with sparsemem) or in the pgdat (with flatmem).

This avoids growing the size of "struct page"; the page_autonuma data
is only allocated if the kernel has been booted on real NUMA hardware
and "noautonuma" was not passed as a kernel parameter.
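
The lookup helper mirrors lookup_page_cgroup(): with sparsemem the
per-section array is indexed by pfn, with flatmem the per-node array is
indexed by the node-local pfn. Rough sketch; the real mm/page_autonuma.c
may differ in details such as how the section array is biased:

struct page_autonuma *lookup_page_autonuma(struct page *page)
{
	unsigned long pfn = page_to_pfn(page);
#ifdef CONFIG_SPARSEMEM
	struct mem_section *section = __pfn_to_section(pfn);

	/* section_page_autonuma is biased so it can be indexed by pfn */
	return section->section_page_autonuma + pfn;
#else
	struct pglist_data *pgdat = NODE_DATA(page_to_nid(page));
	unsigned long offset = pfn - pgdat->node_start_pfn;

	return pgdat->node_page_autonuma + offset;
#endif
}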

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma.h       |   18 +++-
 include/linux/autonuma_flags.h |    6 +
 include/linux/autonuma_types.h |   55 ++++++++++
 include/linux/mm_types.h       |   26 -----
 include/linux/mmzone.h         |   14 +++-
 include/linux/page_autonuma.h  |   53 +++++++++
 init/main.c                    |    2 +
 mm/Makefile                    |    2 +-
 mm/autonuma.c                  |   98 ++++++++++-------
 mm/huge_memory.c               |   26 +++--
 mm/page_alloc.c                |   21 +---
 mm/page_autonuma.c             |  234 ++++++++++++++++++++++++++++++++++++++++
 mm/sparse.c                    |  126 ++++++++++++++++++++-
 13 files changed, 577 insertions(+), 104 deletions(-)
 create mode 100644 include/linux/page_autonuma.h
 create mode 100644 mm/page_autonuma.c

diff --git a/include/linux/autonuma.h b/include/linux/autonuma.h
index 85ca5eb..67af86a 100644
--- a/include/linux/autonuma.h
+++ b/include/linux/autonuma.h
@@ -7,15 +7,26 @@
 
 extern void autonuma_enter(struct mm_struct *mm);
 extern void autonuma_exit(struct mm_struct *mm);
-extern void __autonuma_migrate_page_remove(struct page *page);
+extern void __autonuma_migrate_page_remove(struct page *,
+					   struct page_autonuma *);
 extern void autonuma_migrate_split_huge_page(struct page *page,
 					     struct page *page_tail);
 extern void autonuma_setup_new_exec(struct task_struct *p);
+extern struct page_autonuma *lookup_page_autonuma(struct page *page);
 
 static inline void autonuma_migrate_page_remove(struct page *page)
 {
-	if (ACCESS_ONCE(page->autonuma_migrate_nid) >= 0)
-		__autonuma_migrate_page_remove(page);
+	struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+	if (ACCESS_ONCE(page_autonuma->autonuma_migrate_nid) >= 0)
+		__autonuma_migrate_page_remove(page, page_autonuma);
+}
+
+static inline void autonuma_free_page(struct page *page)
+{
+	if (!autonuma_impossible()) {
+		autonuma_migrate_page_remove(page);
+		lookup_page_autonuma(page)->autonuma_last_nid = -1;
+	}
 }
 
 #define autonuma_printk(format, args...) \
@@ -29,6 +40,7 @@ static inline void autonuma_migrate_page_remove(struct page *page) {}
 static inline void autonuma_migrate_split_huge_page(struct page *page,
 						    struct page *page_tail) {}
 static inline void autonuma_setup_new_exec(struct task_struct *p) {}
+static inline void autonuma_free_page(struct page *page) {}
 
 #endif /* CONFIG_AUTONUMA */
 
diff --git a/include/linux/autonuma_flags.h b/include/linux/autonuma_flags.h
index 5e29a75..035d993 100644
--- a/include/linux/autonuma_flags.h
+++ b/include/linux/autonuma_flags.h
@@ -15,6 +15,12 @@ enum autonuma_flag {
 
 extern unsigned long autonuma_flags;
 
+static inline bool autonuma_impossible(void)
+{
+	return num_possible_nodes() <= 1 ||
+		test_bit(AUTONUMA_IMPOSSIBLE_FLAG, &autonuma_flags);
+}
+
 static inline bool autonuma_enabled(void)
 {
 	return !!test_bit(AUTONUMA_FLAG, &autonuma_flags);
diff --git a/include/linux/autonuma_types.h b/include/linux/autonuma_types.h
index 9e697e3..1e860f6 100644
--- a/include/linux/autonuma_types.h
+++ b/include/linux/autonuma_types.h
@@ -39,6 +39,61 @@ struct task_autonuma {
 	unsigned long task_numa_fault[0];
 };
 
+/*
+ * Per page (or per-pageblock) structure dynamically allocated only if
+ * autonuma is not impossible.
+ */
+struct page_autonuma {
+	/*
+	 * To modify autonuma_last_nid locklessly, the architecture
+	 * needs SMP atomic granularity < sizeof(long); not all archs
+	 * have that, notably some ancient alpha (but none of those
+	 * should run in NUMA systems). Archs without it require
+	 * autonuma_last_nid to be a long.
+	 */
+#if BITS_PER_LONG > 32
+	/*
+	 * autonuma_migrate_nid is -1 if the page_autonuma structure
+	 * is not linked into any
+	 * pgdat->autonuma_migrate_head. Otherwise it means the
+	 * page_autonuma structure is linked into the
+	 * &NODE_DATA(autonuma_migrate_nid)->autonuma_migrate_head[page_nid].
+	 * page_nid is the nid that the page (referenced by the
+	 * page_autonuma structure) belongs to.
+	 */
+	int autonuma_migrate_nid;
+	/*
+	 * autonuma_last_nid records the NUMA nid that tried to access
+	 * this page at the last NUMA hinting page fault. If it changed,
+	 * AutoNUMA will not try to migrate the page to the nid the
+	 * thread is running on; instead, it will make the different
+	 * threads thrashing on the same pages converge on the same
+	 * NUMA node (if possible).
+	 */
+	int autonuma_last_nid;
+#else
+#if MAX_NUMNODES >= 32768
+#error "too many nodes"
+#endif
+	short autonuma_migrate_nid;
+	short autonuma_last_nid;
+#endif
+	/*
+	 * This is the list node that links the page (referenced by
+	 * the page_autonuma structure) in the
+	 * &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid] lru.
+	 */
+	struct list_head autonuma_migrate_node;
+
+	/*
+	 * To find the page starting from the autonuma_migrate_node we
+	 * need a backlink.
+	 *
+	 * FIXME: drop it;
+	 */
+	struct page *page;
+};
+
 extern int alloc_task_autonuma(struct task_struct *tsk,
 			       struct task_struct *orig,
 			       int node);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d1248cf..f0c6379 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -136,32 +136,6 @@ struct page {
 		struct page *first_page;	/* Compound tail pages */
 	};
 
-#ifdef CONFIG_AUTONUMA
-	/*
-	 * FIXME: move to pgdat section along with the memcg and allocate
-	 * at runtime only in presence of a numa system.
-	 */
-	/*
-	 * To modify autonuma_last_nid lockless the architecture,
-	 * needs SMP atomic granularity < sizeof(long), not all archs
-	 * have that, notably some ancient alpha (but none of those
-	 * should run in NUMA systems). Archs without that requires
-	 * autonuma_last_nid to be a long.
-	 */
-#if BITS_PER_LONG > 32
-	int autonuma_migrate_nid;
-	int autonuma_last_nid;
-#else
-#if MAX_NUMNODES >= 32768
-#error "too many nodes"
-#endif
-	/* FIXME: remember to check the updates are atomic */
-	short autonuma_migrate_nid;
-	short autonuma_last_nid;
-#endif
-	struct list_head autonuma_migrate_node;
-#endif
-
 	/*
 	 * On machines where all RAM is mapped into kernel address space,
 	 * we can simply calculate the virtual address. On machines with
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d53b26a..e66da74 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -698,10 +698,13 @@ typedef struct pglist_data {
 	int kswapd_max_order;
 	enum zone_type classzone_idx;
 #ifdef CONFIG_AUTONUMA
-	spinlock_t autonuma_lock;
+#if !defined(CONFIG_SPARSEMEM)
+	struct page_autonuma *node_page_autonuma;
+#endif
 	struct list_head autonuma_migrate_head[MAX_NUMNODES];
 	unsigned long autonuma_nr_migrate_pages;
 	wait_queue_head_t autonuma_knuma_migrated_wait;
+	spinlock_t autonuma_lock;
 #endif
 } pg_data_t;
 
@@ -1064,6 +1067,15 @@ struct mem_section {
 	 * section. (see memcontrol.h/page_cgroup.h about this.)
 	 */
 	struct page_cgroup *page_cgroup;
+#endif
+#ifdef CONFIG_AUTONUMA
+	/*
+	 * With SPARSEMEM the pgdat doesn't carry the page_autonuma
+	 * pointer; it is kept here in the mem_section instead.
+	 */
+	struct page_autonuma *section_page_autonuma;
+#endif
+#if defined(CONFIG_CGROUP_MEM_RES_CTLR) ^ defined(CONFIG_AUTONUMA)
 	unsigned long pad;
 #endif
 };
diff --git a/include/linux/page_autonuma.h b/include/linux/page_autonuma.h
new file mode 100644
index 0000000..d748aa2
--- /dev/null
+++ b/include/linux/page_autonuma.h
@@ -0,0 +1,53 @@
+#ifndef _LINUX_PAGE_AUTONUMA_H
+#define _LINUX_PAGE_AUTONUMA_H
+
+#if defined(CONFIG_AUTONUMA) && !defined(CONFIG_SPARSEMEM)
+extern void __init page_autonuma_init_flatmem(void);
+#else
+static inline void __init page_autonuma_init_flatmem(void) {}
+#endif
+
+#ifdef CONFIG_AUTONUMA
+
+#include <linux/autonuma_flags.h>
+
+extern void __meminit page_autonuma_map_init(struct page *page,
+					     struct page_autonuma *page_autonuma,
+					     int nr_pages);
+
+#ifdef CONFIG_SPARSEMEM
+#define PAGE_AUTONUMA_SIZE (sizeof(struct page_autonuma))
+#define SECTION_PAGE_AUTONUMA_SIZE (PAGE_AUTONUMA_SIZE *	\
+				    PAGES_PER_SECTION)
+#endif
+
+extern void __meminit pgdat_autonuma_init(struct pglist_data *);
+
+#else /* CONFIG_AUTONUMA */
+
+#ifdef CONFIG_SPARSEMEM
+struct page_autonuma;
+#define PAGE_AUTONUMA_SIZE 0
+#define SECTION_PAGE_AUTONUMA_SIZE 0
+
+#define autonuma_impossible() true
+
+#endif
+
+static inline void pgdat_autonuma_init(struct pglist_data *pgdat) {}
+
+#endif /* CONFIG_AUTONUMA */
+
+#ifdef CONFIG_SPARSEMEM
+extern struct page_autonuma * __meminit __kmalloc_section_page_autonuma(int nid,
+									unsigned long nr_pages);
+extern void __kfree_section_page_autonuma(struct page_autonuma *page_autonuma,
+					  unsigned long nr_pages);
+extern void __init sparse_early_page_autonuma_alloc_node(struct page_autonuma **page_autonuma_map,
+							 unsigned long pnum_begin,
+							 unsigned long pnum_end,
+							 unsigned long map_count,
+							 int nodeid);
+#endif
+
+#endif /* _LINUX_PAGE_AUTONUMA_H */
diff --git a/init/main.c b/init/main.c
index b5cc0a7..070a377 100644
--- a/init/main.c
+++ b/init/main.c
@@ -68,6 +68,7 @@
 #include <linux/shmem_fs.h>
 #include <linux/slab.h>
 #include <linux/perf_event.h>
+#include <linux/page_autonuma.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -455,6 +456,7 @@ static void __init mm_init(void)
 	 * bigger than MAX_ORDER unless SPARSEMEM.
 	 */
 	page_cgroup_init_flatmem();
+	page_autonuma_init_flatmem();
 	mem_init();
 	kmem_cache_init();
 	percpu_init_late();
diff --git a/mm/Makefile b/mm/Makefile
index 15900fd..a4d8354 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -33,7 +33,7 @@ obj-$(CONFIG_FRONTSWAP)	+= frontswap.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
 obj-$(CONFIG_NUMA) 	+= mempolicy.o
-obj-$(CONFIG_AUTONUMA) 	+= autonuma.o
+obj-$(CONFIG_AUTONUMA) 	+= autonuma.o page_autonuma.o
 obj-$(CONFIG_SPARSEMEM)	+= sparse.o
 obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_SLOB) += slob.o
diff --git a/mm/autonuma.c b/mm/autonuma.c
index f44272b..ec4d492 100644
--- a/mm/autonuma.c
+++ b/mm/autonuma.c
@@ -51,12 +51,6 @@ static struct knumad_scan {
 	.mm_head = LIST_HEAD_INIT(knumad_scan.mm_head),
 };
 
-static inline bool autonuma_impossible(void)
-{
-	return num_possible_nodes() <= 1 ||
-		test_bit(AUTONUMA_IMPOSSIBLE_FLAG, &autonuma_flags);
-}
-
 static inline void autonuma_migrate_lock(int nid)
 {
 	spin_lock(&NODE_DATA(nid)->autonuma_lock);
@@ -82,54 +76,63 @@ void autonuma_migrate_split_huge_page(struct page *page,
 				      struct page *page_tail)
 {
 	int nid, last_nid;
+	struct page_autonuma *page_autonuma, *page_tail_autonuma;
 
-	nid = page->autonuma_migrate_nid;
+	if (autonuma_impossible())
+		return;
+
+	page_autonuma = lookup_page_autonuma(page);
+	page_tail_autonuma = lookup_page_autonuma(page_tail);
+
+	nid = page_autonuma->autonuma_migrate_nid;
 	VM_BUG_ON(nid >= MAX_NUMNODES);
 	VM_BUG_ON(nid < -1);
-	VM_BUG_ON(page_tail->autonuma_migrate_nid != -1);
+	VM_BUG_ON(page_tail_autonuma->autonuma_migrate_nid != -1);
 	if (nid >= 0) {
 		VM_BUG_ON(page_to_nid(page) != page_to_nid(page_tail));
 
 		compound_lock(page_tail);
 		autonuma_migrate_lock(nid);
-		list_add_tail(&page_tail->autonuma_migrate_node,
-			      &page->autonuma_migrate_node);
+		list_add_tail(&page_tail_autonuma->autonuma_migrate_node,
+			      &page_autonuma->autonuma_migrate_node);
 		autonuma_migrate_unlock(nid);
 
-		page_tail->autonuma_migrate_nid = nid;
+		page_tail_autonuma->autonuma_migrate_nid = nid;
 		compound_unlock(page_tail);
 	}
 
-	last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+	last_nid = ACCESS_ONCE(page_autonuma->autonuma_last_nid);
 	if (last_nid >= 0)
-		page_tail->autonuma_last_nid = last_nid;
+		page_tail_autonuma->autonuma_last_nid = last_nid;
 }
 
-void __autonuma_migrate_page_remove(struct page *page)
+void __autonuma_migrate_page_remove(struct page *page,
+				    struct page_autonuma *page_autonuma)
 {
 	unsigned long flags;
 	int nid;
 
 	flags = compound_lock_irqsave(page);
 
-	nid = page->autonuma_migrate_nid;
+	nid = page_autonuma->autonuma_migrate_nid;
 	VM_BUG_ON(nid >= MAX_NUMNODES);
 	VM_BUG_ON(nid < -1);
 	if (nid >= 0) {
 		int numpages = hpage_nr_pages(page);
 		autonuma_migrate_lock(nid);
-		list_del(&page->autonuma_migrate_node);
+		list_del(&page_autonuma->autonuma_migrate_node);
 		NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
 		autonuma_migrate_unlock(nid);
 
-		page->autonuma_migrate_nid = -1;
+		page_autonuma->autonuma_migrate_nid = -1;
 	}
 
 	compound_unlock_irqrestore(page, flags);
 }
 
-static void __autonuma_migrate_page_add(struct page *page, int dst_nid,
-					int page_nid)
+static void __autonuma_migrate_page_add(struct page *page,
+					struct page_autonuma *page_autonuma,
+					int dst_nid, int page_nid)
 {
 	unsigned long flags;
 	int nid;
@@ -148,25 +151,25 @@ static void __autonuma_migrate_page_add(struct page *page, int dst_nid,
 	flags = compound_lock_irqsave(page);
 
 	numpages = hpage_nr_pages(page);
-	nid = page->autonuma_migrate_nid;
+	nid = page_autonuma->autonuma_migrate_nid;
 	VM_BUG_ON(nid >= MAX_NUMNODES);
 	VM_BUG_ON(nid < -1);
 	if (nid >= 0) {
 		autonuma_migrate_lock(nid);
-		list_del(&page->autonuma_migrate_node);
+		list_del(&page_autonuma->autonuma_migrate_node);
 		NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
 		autonuma_migrate_unlock(nid);
 	}
 
 	autonuma_migrate_lock(dst_nid);
-	list_add(&page->autonuma_migrate_node,
+	list_add(&page_autonuma->autonuma_migrate_node,
 		 &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid]);
 	NODE_DATA(dst_nid)->autonuma_nr_migrate_pages += numpages;
 	nr_migrate_pages = NODE_DATA(dst_nid)->autonuma_nr_migrate_pages;
 
 	autonuma_migrate_unlock(dst_nid);
 
-	page->autonuma_migrate_nid = dst_nid;
+	page_autonuma->autonuma_migrate_nid = dst_nid;
 
 	compound_unlock_irqrestore(page, flags);
 
@@ -182,9 +185,13 @@ static void __autonuma_migrate_page_add(struct page *page, int dst_nid,
 static void autonuma_migrate_page_add(struct page *page, int dst_nid,
 				      int page_nid)
 {
-	int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+	int migrate_nid;
+	struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+
+	migrate_nid = ACCESS_ONCE(page_autonuma->autonuma_migrate_nid);
 	if (migrate_nid != dst_nid)
-		__autonuma_migrate_page_add(page, dst_nid, page_nid);
+		__autonuma_migrate_page_add(page, page_autonuma,
+					    dst_nid, page_nid);
 }
 
 static bool balance_pgdat(struct pglist_data *pgdat,
@@ -255,23 +262,26 @@ static inline bool last_nid_set(struct task_struct *p,
 				struct page *page, int cpu_nid)
 {
 	bool ret = true;
-	int autonuma_last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+	struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+	int autonuma_last_nid = ACCESS_ONCE(page_autonuma->autonuma_last_nid);
 	VM_BUG_ON(cpu_nid < 0);
 	VM_BUG_ON(cpu_nid >= MAX_NUMNODES);
 	if (autonuma_last_nid >= 0 && autonuma_last_nid != cpu_nid) {
-		int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+		int migrate_nid;
+		migrate_nid = ACCESS_ONCE(page_autonuma->autonuma_migrate_nid);
 		if (migrate_nid >= 0 && migrate_nid != cpu_nid)
-			__autonuma_migrate_page_remove(page);
+			__autonuma_migrate_page_remove(page, page_autonuma);
 		ret = false;
 	}
 	if (autonuma_last_nid != cpu_nid)
-		ACCESS_ONCE(page->autonuma_last_nid) = cpu_nid;
+		ACCESS_ONCE(page_autonuma->autonuma_last_nid) = cpu_nid;
 	return ret;
 }
 
 static int __page_migrate_nid(struct page *page, int page_nid)
 {
-	int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+	struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+	int migrate_nid = ACCESS_ONCE(page_autonuma->autonuma_migrate_nid);
 	if (migrate_nid < 0)
 		migrate_nid = page_nid;
 #if 0
@@ -810,6 +820,7 @@ static int isolate_migratepages(struct list_head *migratepages,
 		struct zone *zone;
 		struct page *page;
 		struct lruvec *lruvec;
+		struct page_autonuma *page_autonuma;
 
 		cond_resched();
 		VM_BUG_ON(numa_node_id() != pgdat->node_id);
@@ -833,16 +844,17 @@ static int isolate_migratepages(struct list_head *migratepages,
 			autonuma_migrate_unlock_irq(pgdat->node_id);
 			continue;
 		}
-		page = list_entry(heads[nid].prev,
-				  struct page,
-				  autonuma_migrate_node);
+		page_autonuma = list_entry(heads[nid].prev,
+					   struct page_autonuma,
+					   autonuma_migrate_node);
+		page = page_autonuma->page;
 		if (unlikely(!get_page_unless_zero(page))) {
 			/*
 			 * Is getting freed and will remove self from the
 			 * autonuma list shortly, skip it for now.
 			 */
-			list_del(&page->autonuma_migrate_node);
-			list_add(&page->autonuma_migrate_node,
+			list_del(&page_autonuma->autonuma_migrate_node);
+			list_add(&page_autonuma->autonuma_migrate_node,
 				 &heads[nid]);
 			autonuma_migrate_unlock_irq(pgdat->node_id);
 			autonuma_printk("autonuma migrate page is free\n");
@@ -851,7 +863,7 @@ static int isolate_migratepages(struct list_head *migratepages,
 		if (!PageLRU(page)) {
 			autonuma_migrate_unlock_irq(pgdat->node_id);
 			autonuma_printk("autonuma migrate page not in LRU\n");
-			__autonuma_migrate_page_remove(page);
+			__autonuma_migrate_page_remove(page, page_autonuma);
 			put_page(page);
 			continue;
 		}
@@ -871,7 +883,7 @@ static int isolate_migratepages(struct list_head *migratepages,
 			}
 		}
 
-		__autonuma_migrate_page_remove(page);
+		__autonuma_migrate_page_remove(page, page_autonuma);
 
 		zone = page_zone(page);
 		spin_lock_irq(&zone->lru_lock);
@@ -917,11 +929,16 @@ static struct page *alloc_migrate_dst_page(struct page *page,
 {
 	int nid = (int) data;
 	struct page *newpage;
+	struct page_autonuma *page_autonuma, *newpage_autonuma;
 	newpage = alloc_pages_exact_node(nid,
 					 GFP_HIGHUSER_MOVABLE | GFP_THISNODE,
 					 0);
-	if (newpage)
-		newpage->autonuma_last_nid = page->autonuma_last_nid;
+	if (newpage) {
+		page_autonuma = lookup_page_autonuma(page);
+		newpage_autonuma = lookup_page_autonuma(newpage);
+		newpage_autonuma->autonuma_last_nid =
+			page_autonuma->autonuma_last_nid;
+	}
 	return newpage;
 }
 
@@ -1345,7 +1362,8 @@ static int __init noautonuma_setup(char *str)
 	}
 	return 1;
 }
-__setup("noautonuma", noautonuma_setup);
+/* early so sparse.c also can see it */
+early_param("noautonuma", noautonuma_setup);
 
 static int __init autonuma_init(void)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bcaa8ac..c5e47bc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1831,6 +1831,13 @@ static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
 {
 	pte_t *_pte;
 	bool mknuma = false;
+#ifdef CONFIG_AUTONUMA
+	struct page_autonuma *src_page_an, *page_an = NULL;
+
+	if (!autonuma_impossible())
+		page_an = lookup_page_autonuma(page);
+#endif
+
 	for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++) {
 		pte_t pteval = *_pte;
 		struct page *src_page;
@@ -1839,17 +1846,18 @@ static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
 			clear_user_highpage(page, address);
 			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
 		} else {
-#ifdef CONFIG_AUTONUMA
-			int autonuma_last_nid;
-#endif
 			src_page = pte_page(pteval);
 #ifdef CONFIG_AUTONUMA
-			/* pick the last one, better than nothing */
-			autonuma_last_nid =
-				ACCESS_ONCE(src_page->autonuma_last_nid);
-			if (autonuma_last_nid >= 0)
-				ACCESS_ONCE(page->autonuma_last_nid) =
-					autonuma_last_nid;
+			if (!autonuma_impossible()) {
+				int autonuma_last_nid;
+				src_page_an = lookup_page_autonuma(src_page);
+				/* pick the last one, better than nothing */
+				autonuma_last_nid =
+					ACCESS_ONCE(src_page_an->autonuma_last_nid);
+				if (autonuma_last_nid >= 0)
+					ACCESS_ONCE(page_an->autonuma_last_nid) =
+						autonuma_last_nid;
+			}
 #endif
 			copy_user_highpage(page, src_page, address, vma);
 			VM_BUG_ON(page_mapcount(src_page) != 1);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8c4ae8e..2d53a1f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -60,6 +60,7 @@
 #include <linux/migrate.h>
 #include <linux/page-debug-flags.h>
 #include <linux/autonuma.h>
+#include <linux/page_autonuma.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -615,10 +616,7 @@ static inline int free_pages_check(struct page *page)
 		bad_page(page);
 		return 1;
 	}
-	autonuma_migrate_page_remove(page);
-#ifdef CONFIG_AUTONUMA
-	page->autonuma_last_nid = -1;
-#endif
+	autonuma_free_page(page);
 	if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
 		page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	return 0;
@@ -3729,10 +3727,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
 
 		INIT_LIST_HEAD(&page->lru);
-#ifdef CONFIG_AUTONUMA
-		page->autonuma_last_nid = -1;
-		page->autonuma_migrate_nid = -1;
-#endif
 #ifdef WANT_PAGE_VIRTUAL
 		/* The shift won't overflow because ZONE_NORMAL is below 4G. */
 		if (!is_highmem_idx(zone))
@@ -4357,22 +4351,13 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 	int nid = pgdat->node_id;
 	unsigned long zone_start_pfn = pgdat->node_start_pfn;
 	int ret;
-#ifdef CONFIG_AUTONUMA
-	int node_iter;
-#endif
 
 	pgdat_resize_init(pgdat);
-#ifdef CONFIG_AUTONUMA
-	spin_lock_init(&pgdat->autonuma_lock);
-	init_waitqueue_head(&pgdat->autonuma_knuma_migrated_wait);
-	pgdat->autonuma_nr_migrate_pages = 0;
-	for_each_node(node_iter)
-		INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
-#endif
 	pgdat->nr_zones = 0;
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	pgdat->kswapd_max_order = 0;
 	pgdat_page_cgroup_init(pgdat);
+	pgdat_autonuma_init(pgdat);
 
 	for (j = 0; j < MAX_NR_ZONES; j++) {
 		struct zone *zone = pgdat->node_zones + j;
diff --git a/mm/page_autonuma.c b/mm/page_autonuma.c
new file mode 100644
index 0000000..bace9b8
--- /dev/null
+++ b/mm/page_autonuma.c
@@ -0,0 +1,234 @@
+#include <linux/mm.h>
+#include <linux/memory.h>
+#include <linux/autonuma_flags.h>
+#include <linux/page_autonuma.h>
+#include <linux/bootmem.h>
+
+void __meminit page_autonuma_map_init(struct page *page,
+				      struct page_autonuma *page_autonuma,
+				      int nr_pages)
+{
+	struct page *end;
+	for (end = page + nr_pages; page < end; page++, page_autonuma++) {
+		page_autonuma->autonuma_last_nid = -1;
+		page_autonuma->autonuma_migrate_nid = -1;
+		page_autonuma->page = page;
+	}
+}
+
+static void __meminit __pgdat_autonuma_init(struct pglist_data *pgdat)
+{
+	int node_iter;
+
+	spin_lock_init(&pgdat->autonuma_lock);
+	init_waitqueue_head(&pgdat->autonuma_knuma_migrated_wait);
+	pgdat->autonuma_nr_migrate_pages = 0;
+	for_each_node(node_iter)
+		INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
+}
+
+#if !defined(CONFIG_SPARSEMEM)
+
+static unsigned long total_usage;
+
+void __meminit pgdat_autonuma_init(struct pglist_data *pgdat)
+{
+	__pgdat_autonuma_init(pgdat);
+	pgdat->node_page_autonuma = NULL;
+}
+
+struct page_autonuma *lookup_page_autonuma(struct page *page)
+{
+	unsigned long pfn = page_to_pfn(page);
+	unsigned long offset;
+	struct page_autonuma *base;
+
+	base = NODE_DATA(page_to_nid(page))->node_page_autonuma;
+#ifdef CONFIG_DEBUG_VM
+	/*
+	 * The sanity checks the page allocator does upon freeing a
+	 * page can reach here before the page_autonuma arrays are
+	 * allocated when feeding a range of pages to the allocator
+	 * for the first time during bootup or memory hotplug.
+	 */
+	if (unlikely(!base))
+		return NULL;
+#endif
+	offset = pfn - NODE_DATA(page_to_nid(page))->node_start_pfn;
+	return base + offset;
+}
+
+static int __init alloc_node_page_autonuma(int nid)
+{
+	struct page_autonuma *base;
+	unsigned long table_size;
+	unsigned long nr_pages;
+
+	nr_pages = NODE_DATA(nid)->node_spanned_pages;
+	if (!nr_pages)
+		return 0;
+
+	table_size = sizeof(struct page_autonuma) * nr_pages;
+
+	base = __alloc_bootmem_node_nopanic(NODE_DATA(nid),
+			table_size, PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+	if (!base)
+		return -ENOMEM;
+	NODE_DATA(nid)->node_page_autonuma = base;
+	total_usage += table_size;
+	page_autonuma_map_init(NODE_DATA(nid)->node_mem_map, base, nr_pages);
+	return 0;
+}
+
+void __init page_autonuma_init_flatmem(void)
+{
+
+	int nid, fail;
+
+	if (autonuma_impossible())
+		return;
+
+	for_each_online_node(nid)  {
+		fail = alloc_node_page_autonuma(nid);
+		if (fail)
+			goto fail;
+	}
+	printk(KERN_INFO "allocated %lu KBytes of page_autonuma\n",
+	       total_usage >> 10);
+	printk(KERN_INFO "please try the 'noautonuma' option if you"
+	" don't want to allocate page_autonuma memory\n");
+	return;
+fail:
+	printk(KERN_CRIT "allocation of page_autonuma failed.\n");
+	printk(KERN_CRIT "please try the 'noautonuma' boot option\n");
+	panic("Out of memory");
+}
+
+#else /* CONFIG_SPARSEMEM */
+
+struct page_autonuma *lookup_page_autonuma(struct page *page)
+{
+	unsigned long pfn = page_to_pfn(page);
+	struct mem_section *section = __pfn_to_section(pfn);
+
+	/* if it's not a power of two we may be wasting memory */
+	BUILD_BUG_ON(SECTION_PAGE_AUTONUMA_SIZE &
+		     (SECTION_PAGE_AUTONUMA_SIZE-1));
+
+#ifdef CONFIG_DEBUG_VM
+	/*
+	 * The sanity checks the page allocator does upon freeing a
+	 * page can reach here before the page_autonuma arrays are
+	 * allocated when feeding a range of pages to the allocator
+	 * for the first time during bootup or memory hotplug.
+	 */
+	if (!section->section_page_autonuma)
+		return NULL;
+#endif
+	return section->section_page_autonuma + pfn;
+}
+
+void __meminit pgdat_autonuma_init(struct pglist_data *pgdat)
+{
+	__pgdat_autonuma_init(pgdat);
+}
+
+struct page_autonuma * __meminit __kmalloc_section_page_autonuma(int nid,
+								 unsigned long nr_pages)
+{
+	struct page_autonuma *ret;
+	struct page *page;
+	unsigned long memmap_size = PAGE_AUTONUMA_SIZE * nr_pages;
+
+	page = alloc_pages_node(nid, GFP_KERNEL|__GFP_NOWARN,
+				get_order(memmap_size));
+	if (page)
+		goto got_map_page_autonuma;
+
+	ret = vmalloc(memmap_size);
+	if (ret)
+		goto out;
+
+	return NULL;
+got_map_page_autonuma:
+	ret = (struct page_autonuma *)pfn_to_kaddr(page_to_pfn(page));
+out:
+	return ret;
+}
+
+void __kfree_section_page_autonuma(struct page_autonuma *page_autonuma,
+				   unsigned long nr_pages)
+{
+	if (is_vmalloc_addr(page_autonuma))
+		vfree(page_autonuma);
+	else
+		free_pages((unsigned long)page_autonuma,
+			   get_order(PAGE_AUTONUMA_SIZE * nr_pages));
+}
+
+static struct page_autonuma __init *sparse_page_autonuma_map_populate(unsigned long pnum,
+								      int nid)
+{
+	struct page_autonuma *map;
+	unsigned long size;
+
+	map = alloc_remap(nid, SECTION_PAGE_AUTONUMA_SIZE);
+	if (map)
+		return map;
+
+	size = PAGE_ALIGN(SECTION_PAGE_AUTONUMA_SIZE);
+	map = __alloc_bootmem_node_high(NODE_DATA(nid), size,
+					PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+	return map;
+}
+
+void __init sparse_early_page_autonuma_alloc_node(struct page_autonuma **page_autonuma_map,
+						  unsigned long pnum_begin,
+						  unsigned long pnum_end,
+						  unsigned long map_count,
+						  int nodeid)
+{
+	void *map;
+	unsigned long pnum;
+	unsigned long size = SECTION_PAGE_AUTONUMA_SIZE;
+
+	map = alloc_remap(nodeid, size * map_count);
+	if (map) {
+		for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
+			if (!present_section_nr(pnum))
+				continue;
+			page_autonuma_map[pnum] = map;
+			map += size;
+		}
+		return;
+	}
+
+	size = PAGE_ALIGN(size);
+	map = __alloc_bootmem_node_high(NODE_DATA(nodeid), size * map_count,
+					PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+	if (map) {
+		for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
+			if (!present_section_nr(pnum))
+				continue;
+			page_autonuma_map[pnum] = map;
+			map += size;
+		}
+		return;
+	}
+
+	/* fallback */
+	for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
+		struct mem_section *ms;
+
+		if (!present_section_nr(pnum))
+			continue;
+		page_autonuma_map[pnum] = sparse_page_autonuma_map_populate(pnum, nodeid);
+		if (page_autonuma_map[pnum])
+			continue;
+		ms = __nr_to_section(pnum);
+		printk(KERN_ERR "%s: sparsemem page_autonuma map backing failed, "
+		       "some memory will not be available.\n", __func__);
+	}
+}
+
+#endif
diff --git a/mm/sparse.c b/mm/sparse.c
index 6a4bf91..1eb301e 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -9,6 +9,7 @@
 #include <linux/export.h>
 #include <linux/spinlock.h>
 #include <linux/vmalloc.h>
+#include <linux/page_autonuma.h>
 #include "internal.h"
 #include <asm/dma.h>
 #include <asm/pgalloc.h>
@@ -242,7 +243,8 @@ struct page *sparse_decode_mem_map(unsigned long coded_mem_map, unsigned long pn
 
 static int __meminit sparse_init_one_section(struct mem_section *ms,
 		unsigned long pnum, struct page *mem_map,
-		unsigned long *pageblock_bitmap)
+		unsigned long *pageblock_bitmap,
+		struct page_autonuma *page_autonuma)
 {
 	if (!present_section(ms))
 		return -EINVAL;
@@ -251,6 +253,14 @@ static int __meminit sparse_init_one_section(struct mem_section *ms,
 	ms->section_mem_map |= sparse_encode_mem_map(mem_map, pnum) |
 							SECTION_HAS_MEM_MAP;
  	ms->pageblock_flags = pageblock_bitmap;
+#ifdef CONFIG_AUTONUMA
+	if (page_autonuma) {
+		ms->section_page_autonuma = page_autonuma - section_nr_to_pfn(pnum);
+		page_autonuma_map_init(mem_map, page_autonuma, PAGES_PER_SECTION);
+	}
+#else
+	BUG_ON(page_autonuma);
+#endif
 
 	return 1;
 }
@@ -484,6 +494,9 @@ void __init sparse_init(void)
 	int size2;
 	struct page **map_map;
 #endif
+	struct page_autonuma **uninitialized_var(page_autonuma_map);
+	struct page_autonuma *page_autonuma;
+	int size3;
 
 	/*
 	 * map is using big page (aka 2M in x86 64 bit)
@@ -578,6 +591,62 @@ void __init sparse_init(void)
 					 map_count, nodeid_begin);
 #endif
 
+	if (!autonuma_impossible()) {
+		unsigned long total_page_autonuma;
+		unsigned long page_autonuma_count;
+
+		size3 = sizeof(struct page_autonuma *) * NR_MEM_SECTIONS;
+		page_autonuma_map = alloc_bootmem(size3);
+		if (!page_autonuma_map)
+			panic("can not allocate page_autonuma_map\n");
+
+		for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
+			struct mem_section *ms;
+
+			if (!present_section_nr(pnum))
+				continue;
+			ms = __nr_to_section(pnum);
+			nodeid_begin = sparse_early_nid(ms);
+			pnum_begin = pnum;
+			break;
+		}
+		total_page_autonuma = 0;
+		page_autonuma_count = 1;
+		for (pnum = pnum_begin + 1; pnum < NR_MEM_SECTIONS; pnum++) {
+			struct mem_section *ms;
+			int nodeid;
+
+			if (!present_section_nr(pnum))
+				continue;
+			ms = __nr_to_section(pnum);
+			nodeid = sparse_early_nid(ms);
+			if (nodeid == nodeid_begin) {
+				page_autonuma_count++;
+				continue;
+			}
+			/* ok, we need to take care of pnum_begin to pnum - 1 */
+			sparse_early_page_autonuma_alloc_node(page_autonuma_map,
+							      pnum_begin,
+							      NR_MEM_SECTIONS,
+							      page_autonuma_count,
+							      nodeid_begin);
+			total_page_autonuma += SECTION_PAGE_AUTONUMA_SIZE * page_autonuma_count;
+			/* new start, update count etc*/
+			nodeid_begin = nodeid;
+			pnum_begin = pnum;
+			page_autonuma_count = 1;
+		}
+		/* ok, last chunk */
+		sparse_early_page_autonuma_alloc_node(page_autonuma_map, pnum_begin,
+						      NR_MEM_SECTIONS,
+						      page_autonuma_count, nodeid_begin);
+		total_page_autonuma += SECTION_PAGE_AUTONUMA_SIZE * page_autonuma_count;
+		printk("allocated %lu KBytes of page_autonuma\n",
+		       total_page_autonuma >> 10);
+		printk(KERN_INFO "please try the 'noautonuma' option if you"
+		       " don't want to allocate page_autonuma memory\n");
+	}
+
 	for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
 		if (!present_section_nr(pnum))
 			continue;
@@ -586,6 +655,14 @@ void __init sparse_init(void)
 		if (!usemap)
 			continue;
 
+		if (autonuma_impossible())
+			page_autonuma = NULL;
+		else {
+			page_autonuma = page_autonuma_map[pnum];
+			if (!page_autonuma)
+				continue;
+		}
+
 #ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
 		map = map_map[pnum];
 #else
@@ -595,11 +672,13 @@ void __init sparse_init(void)
 			continue;
 
 		sparse_init_one_section(__nr_to_section(pnum), pnum, map,
-								usemap);
+					usemap, page_autonuma);
 	}
 
 	vmemmap_populate_print_last();
 
+	if (!autonuma_impossible())
+		free_bootmem(__pa(page_autonuma_map), size3);
 #ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
 	free_bootmem(__pa(map_map), size2);
 #endif
@@ -686,7 +765,8 @@ static void free_map_bootmem(struct page *page, unsigned long nr_pages)
 }
 #endif /* CONFIG_SPARSEMEM_VMEMMAP */
 
-static void free_section_usemap(struct page *memmap, unsigned long *usemap)
+static void free_section_usemap(struct page *memmap, unsigned long *usemap,
+				struct page_autonuma *page_autonuma)
 {
 	struct page *usemap_page;
 	unsigned long nr_pages;
@@ -700,8 +780,14 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap)
 	 */
 	if (PageSlab(usemap_page)) {
 		kfree(usemap);
-		if (memmap)
+		if (memmap) {
 			__kfree_section_memmap(memmap, PAGES_PER_SECTION);
+			if (!autonuma_impossible())
+				__kfree_section_page_autonuma(page_autonuma,
+							      PAGES_PER_SECTION);
+			else
+				BUG_ON(page_autonuma);
+		}
 		return;
 	}
 
@@ -718,6 +804,13 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap)
 			>> PAGE_SHIFT;
 
 		free_map_bootmem(memmap_page, nr_pages);
+
+		if (!autonuma_impossible()) {
+			struct page *page_autonuma_page;
+			page_autonuma_page = virt_to_page(page_autonuma);
+			free_map_bootmem(page_autonuma_page, nr_pages);
+		} else
+			BUG_ON(page_autonuma);
 	}
 }
 
@@ -733,6 +826,7 @@ int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	struct mem_section *ms;
 	struct page *memmap;
+	struct page_autonuma *page_autonuma;
 	unsigned long *usemap;
 	unsigned long flags;
 	int ret;
@@ -752,6 +846,16 @@ int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
 		__kfree_section_memmap(memmap, nr_pages);
 		return -ENOMEM;
 	}
+	if (!autonuma_impossible()) {
+		page_autonuma = __kmalloc_section_page_autonuma(pgdat->node_id,
+								nr_pages);
+		if (!page_autonuma) {
+			kfree(usemap);
+			__kfree_section_memmap(memmap, nr_pages);
+			return -ENOMEM;
+		}
+	} else
+		page_autonuma = NULL;
 
 	pgdat_resize_lock(pgdat, &flags);
 
@@ -763,11 +867,16 @@ int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
 
 	ms->section_mem_map |= SECTION_MARKED_PRESENT;
 
-	ret = sparse_init_one_section(ms, section_nr, memmap, usemap);
+	ret = sparse_init_one_section(ms, section_nr, memmap, usemap,
+				      page_autonuma);
 
 out:
 	pgdat_resize_unlock(pgdat, &flags);
 	if (ret <= 0) {
+		if (!autonuma_impossible())
+			__kfree_section_page_autonuma(page_autonuma, nr_pages);
+		else
+			BUG_ON(page_autonuma);
 		kfree(usemap);
 		__kfree_section_memmap(memmap, nr_pages);
 	}
@@ -778,6 +887,7 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
 {
 	struct page *memmap = NULL;
 	unsigned long *usemap = NULL;
+	struct page_autonuma *page_autonuma = NULL;
 
 	if (ms->section_mem_map) {
 		usemap = ms->pageblock_flags;
@@ -785,8 +895,12 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
 						__section_nr(ms));
 		ms->section_mem_map = 0;
 		ms->pageblock_flags = NULL;
+
+#ifdef CONFIG_AUTONUMA
+		page_autonuma = ms->section_page_autonuma;
+#endif
 	}
 
-	free_section_usemap(memmap, usemap);
+	free_section_usemap(memmap, usemap, page_autonuma);
 }
 #endif


* [PATCH 36/40] autonuma: page_autonuma
@ 2012-06-28 12:56   ` Andrea Arcangeli
  0 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 12:56 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Move the AutoNUMA per page information from the "struct page" to a
separate page_autonuma data structure allocated in the memsection
(with sparsemem) or in the pgdat (with flatmem).

This is done to avoid growing the size of the "struct page": the
page_autonuma data is only allocated if the kernel has been booted on
real NUMA hardware (and "noautonuma" has not been passed as a kernel
parameter).
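
Just to illustrate the indirection, here is a minimal sketch of how a
caller now reads the per-page AutoNUMA state (page_last_nid_example is a
made-up name for this example; autonuma_impossible(),
lookup_page_autonuma() and the autonuma_last_nid field are the helpers
and fields introduced below):

static int page_last_nid_example(struct page *page)
{
	struct page_autonuma *page_autonuma;

	/* nothing is allocated on non-NUMA hardware or "noautonuma" boots */
	if (autonuma_impossible())
		return -1;

	/* backed by the pgdat array (flatmem) or the memsection (sparsemem) */
	page_autonuma = lookup_page_autonuma(page);
	return ACCESS_ONCE(page_autonuma->autonuma_last_nid);
}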

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma.h       |   18 +++-
 include/linux/autonuma_flags.h |    6 +
 include/linux/autonuma_types.h |   55 ++++++++++
 include/linux/mm_types.h       |   26 -----
 include/linux/mmzone.h         |   14 +++-
 include/linux/page_autonuma.h  |   53 +++++++++
 init/main.c                    |    2 +
 mm/Makefile                    |    2 +-
 mm/autonuma.c                  |   98 ++++++++++-------
 mm/huge_memory.c               |   26 +++--
 mm/page_alloc.c                |   21 +---
 mm/page_autonuma.c             |  234 ++++++++++++++++++++++++++++++++++++++++
 mm/sparse.c                    |  126 ++++++++++++++++++++-
 13 files changed, 577 insertions(+), 104 deletions(-)
 create mode 100644 include/linux/page_autonuma.h
 create mode 100644 mm/page_autonuma.c

diff --git a/include/linux/autonuma.h b/include/linux/autonuma.h
index 85ca5eb..67af86a 100644
--- a/include/linux/autonuma.h
+++ b/include/linux/autonuma.h
@@ -7,15 +7,26 @@
 
 extern void autonuma_enter(struct mm_struct *mm);
 extern void autonuma_exit(struct mm_struct *mm);
-extern void __autonuma_migrate_page_remove(struct page *page);
+extern void __autonuma_migrate_page_remove(struct page *,
+					   struct page_autonuma *);
 extern void autonuma_migrate_split_huge_page(struct page *page,
 					     struct page *page_tail);
 extern void autonuma_setup_new_exec(struct task_struct *p);
+extern struct page_autonuma *lookup_page_autonuma(struct page *page);
 
 static inline void autonuma_migrate_page_remove(struct page *page)
 {
-	if (ACCESS_ONCE(page->autonuma_migrate_nid) >= 0)
-		__autonuma_migrate_page_remove(page);
+	struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+	if (ACCESS_ONCE(page_autonuma->autonuma_migrate_nid) >= 0)
+		__autonuma_migrate_page_remove(page, page_autonuma);
+}
+
+static inline void autonuma_free_page(struct page *page)
+{
+	if (!autonuma_impossible()) {
+		autonuma_migrate_page_remove(page);
+		lookup_page_autonuma(page)->autonuma_last_nid = -1;
+	}
 }
 
 #define autonuma_printk(format, args...) \
@@ -29,6 +40,7 @@ static inline void autonuma_migrate_page_remove(struct page *page) {}
 static inline void autonuma_migrate_split_huge_page(struct page *page,
 						    struct page *page_tail) {}
 static inline void autonuma_setup_new_exec(struct task_struct *p) {}
+static inline void autonuma_free_page(struct page *page) {}
 
 #endif /* CONFIG_AUTONUMA */
 
diff --git a/include/linux/autonuma_flags.h b/include/linux/autonuma_flags.h
index 5e29a75..035d993 100644
--- a/include/linux/autonuma_flags.h
+++ b/include/linux/autonuma_flags.h
@@ -15,6 +15,12 @@ enum autonuma_flag {
 
 extern unsigned long autonuma_flags;
 
+static inline bool autonuma_impossible(void)
+{
+	return num_possible_nodes() <= 1 ||
+		test_bit(AUTONUMA_IMPOSSIBLE_FLAG, &autonuma_flags);
+}
+
 static inline bool autonuma_enabled(void)
 {
 	return !!test_bit(AUTONUMA_FLAG, &autonuma_flags);
diff --git a/include/linux/autonuma_types.h b/include/linux/autonuma_types.h
index 9e697e3..1e860f6 100644
--- a/include/linux/autonuma_types.h
+++ b/include/linux/autonuma_types.h
@@ -39,6 +39,61 @@ struct task_autonuma {
 	unsigned long task_numa_fault[0];
 };
 
+/*
+ * Per page (or per-pageblock) structure dynamically allocated only if
+ * autonuma is not impossible.
+ */
+struct page_autonuma {
+	/*
+	 * To modify autonuma_last_nid lockless the architecture,
+	 * needs SMP atomic granularity < sizeof(long), not all archs
+	 * have that, notably some ancient alpha (but none of those
+	 * should run in NUMA systems). Archs without that requires
+	 * autonuma_last_nid to be a long.
+	 */
+#if BITS_PER_LONG > 32
+	/*
+	 * autonuma_migrate_nid is -1 if the page_autonuma structure
+	 * is not linked into any
+	 * pgdat->autonuma_migrate_head. Otherwise it means the
+	 * page_autonuma structure is linked into the
+	 * &NODE_DATA(autonuma_migrate_nid)->autonuma_migrate_head[page_nid].
+	 * page_nid is the nid that the page (referenced by the
+	 * page_autonuma structure) belongs to.
+	 */
+	int autonuma_migrate_nid;
+	/*
+	 * autonuma_last_nid records which is the NUMA nid that tried
+	 * to access this page at the last NUMA hinting page fault.
+	 * If it changed, AutoNUMA will not try to migrate the page to
+	 * the nid where the thread is running on and to the contrary,
+	 * it will make different threads trashing on the same pages,
+	 * converge on the same NUMA node (if possible).
+	 */
+	int autonuma_last_nid;
+#else
+#if MAX_NUMNODES >= 32768
+#error "too many nodes"
+#endif
+	short autonuma_migrate_nid;
+	short autonuma_last_nid;
+#endif
+	/*
+	 * This is the list node that links the page (referenced by
+	 * the page_autonuma structure) in the
+	 * &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid] lru.
+	 */
+	struct list_head autonuma_migrate_node;
+
+	/*
+	 * To find the page starting from the autonuma_migrate_node we
+	 * need a backlink.
+	 *
+	 * FIXME: drop it;
+	 */
+	struct page *page;
+};
+
 extern int alloc_task_autonuma(struct task_struct *tsk,
 			       struct task_struct *orig,
 			       int node);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d1248cf..f0c6379 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -136,32 +136,6 @@ struct page {
 		struct page *first_page;	/* Compound tail pages */
 	};
 
-#ifdef CONFIG_AUTONUMA
-	/*
-	 * FIXME: move to pgdat section along with the memcg and allocate
-	 * at runtime only in presence of a numa system.
-	 */
-	/*
-	 * To modify autonuma_last_nid lockless the architecture,
-	 * needs SMP atomic granularity < sizeof(long), not all archs
-	 * have that, notably some ancient alpha (but none of those
-	 * should run in NUMA systems). Archs without that requires
-	 * autonuma_last_nid to be a long.
-	 */
-#if BITS_PER_LONG > 32
-	int autonuma_migrate_nid;
-	int autonuma_last_nid;
-#else
-#if MAX_NUMNODES >= 32768
-#error "too many nodes"
-#endif
-	/* FIXME: remember to check the updates are atomic */
-	short autonuma_migrate_nid;
-	short autonuma_last_nid;
-#endif
-	struct list_head autonuma_migrate_node;
-#endif
-
 	/*
 	 * On machines where all RAM is mapped into kernel address space,
 	 * we can simply calculate the virtual address. On machines with
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d53b26a..e66da74 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -698,10 +698,13 @@ typedef struct pglist_data {
 	int kswapd_max_order;
 	enum zone_type classzone_idx;
 #ifdef CONFIG_AUTONUMA
-	spinlock_t autonuma_lock;
+#if !defined(CONFIG_SPARSEMEM)
+	struct page_autonuma *node_page_autonuma;
+#endif
 	struct list_head autonuma_migrate_head[MAX_NUMNODES];
 	unsigned long autonuma_nr_migrate_pages;
 	wait_queue_head_t autonuma_knuma_migrated_wait;
+	spinlock_t autonuma_lock;
 #endif
 } pg_data_t;
 
@@ -1064,6 +1067,15 @@ struct mem_section {
 	 * section. (see memcontrol.h/page_cgroup.h about this.)
 	 */
 	struct page_cgroup *page_cgroup;
+#endif
+#ifdef CONFIG_AUTONUMA
+	/*
+	 * If !SPARSEMEM, pgdat doesn't have page_autonuma pointer. We use
+	 * section.
+	 */
+	struct page_autonuma *section_page_autonuma;
+#endif
+#if defined(CONFIG_CGROUP_MEM_RES_CTLR) ^ defined(CONFIG_AUTONUMA)
 	unsigned long pad;
 #endif
 };
diff --git a/include/linux/page_autonuma.h b/include/linux/page_autonuma.h
new file mode 100644
index 0000000..d748aa2
--- /dev/null
+++ b/include/linux/page_autonuma.h
@@ -0,0 +1,53 @@
+#ifndef _LINUX_PAGE_AUTONUMA_H
+#define _LINUX_PAGE_AUTONUMA_H
+
+#if defined(CONFIG_AUTONUMA) && !defined(CONFIG_SPARSEMEM)
+extern void __init page_autonuma_init_flatmem(void);
+#else
+static inline void __init page_autonuma_init_flatmem(void) {}
+#endif
+
+#ifdef CONFIG_AUTONUMA
+
+#include <linux/autonuma_flags.h>
+
+extern void __meminit page_autonuma_map_init(struct page *page,
+					     struct page_autonuma *page_autonuma,
+					     int nr_pages);
+
+#ifdef CONFIG_SPARSEMEM
+#define PAGE_AUTONUMA_SIZE (sizeof(struct page_autonuma))
+#define SECTION_PAGE_AUTONUMA_SIZE (PAGE_AUTONUMA_SIZE *	\
+				    PAGES_PER_SECTION)
+#endif
+
+extern void __meminit pgdat_autonuma_init(struct pglist_data *);
+
+#else /* CONFIG_AUTONUMA */
+
+#ifdef CONFIG_SPARSEMEM
+struct page_autonuma;
+#define PAGE_AUTONUMA_SIZE 0
+#define SECTION_PAGE_AUTONUMA_SIZE 0
+
+#define autonuma_impossible() true
+
+#endif
+
+static inline void pgdat_autonuma_init(struct pglist_data *pgdat) {}
+
+#endif /* CONFIG_AUTONUMA */
+
+#ifdef CONFIG_SPARSEMEM
+extern struct page_autonuma * __meminit __kmalloc_section_page_autonuma(int nid,
+									unsigned long nr_pages);
+extern void __kfree_section_page_autonuma(struct page_autonuma *page_autonuma,
+					  unsigned long nr_pages);
+extern void __init sparse_early_page_autonuma_alloc_node(struct page_autonuma **page_autonuma_map,
+							 unsigned long pnum_begin,
+							 unsigned long pnum_end,
+							 unsigned long map_count,
+							 int nodeid);
+#endif
+
+#endif /* _LINUX_PAGE_AUTONUMA_H */
diff --git a/init/main.c b/init/main.c
index b5cc0a7..070a377 100644
--- a/init/main.c
+++ b/init/main.c
@@ -68,6 +68,7 @@
 #include <linux/shmem_fs.h>
 #include <linux/slab.h>
 #include <linux/perf_event.h>
+#include <linux/page_autonuma.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -455,6 +456,7 @@ static void __init mm_init(void)
 	 * bigger than MAX_ORDER unless SPARSEMEM.
 	 */
 	page_cgroup_init_flatmem();
+	page_autonuma_init_flatmem();
 	mem_init();
 	kmem_cache_init();
 	percpu_init_late();
diff --git a/mm/Makefile b/mm/Makefile
index 15900fd..a4d8354 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -33,7 +33,7 @@ obj-$(CONFIG_FRONTSWAP)	+= frontswap.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
 obj-$(CONFIG_NUMA) 	+= mempolicy.o
-obj-$(CONFIG_AUTONUMA) 	+= autonuma.o
+obj-$(CONFIG_AUTONUMA) 	+= autonuma.o page_autonuma.o
 obj-$(CONFIG_SPARSEMEM)	+= sparse.o
 obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_SLOB) += slob.o
diff --git a/mm/autonuma.c b/mm/autonuma.c
index f44272b..ec4d492 100644
--- a/mm/autonuma.c
+++ b/mm/autonuma.c
@@ -51,12 +51,6 @@ static struct knumad_scan {
 	.mm_head = LIST_HEAD_INIT(knumad_scan.mm_head),
 };
 
-static inline bool autonuma_impossible(void)
-{
-	return num_possible_nodes() <= 1 ||
-		test_bit(AUTONUMA_IMPOSSIBLE_FLAG, &autonuma_flags);
-}
-
 static inline void autonuma_migrate_lock(int nid)
 {
 	spin_lock(&NODE_DATA(nid)->autonuma_lock);
@@ -82,54 +76,63 @@ void autonuma_migrate_split_huge_page(struct page *page,
 				      struct page *page_tail)
 {
 	int nid, last_nid;
+	struct page_autonuma *page_autonuma, *page_tail_autonuma;
 
-	nid = page->autonuma_migrate_nid;
+	if (autonuma_impossible())
+		return;
+
+	page_autonuma = lookup_page_autonuma(page);
+	page_tail_autonuma = lookup_page_autonuma(page_tail);
+
+	nid = page_autonuma->autonuma_migrate_nid;
 	VM_BUG_ON(nid >= MAX_NUMNODES);
 	VM_BUG_ON(nid < -1);
-	VM_BUG_ON(page_tail->autonuma_migrate_nid != -1);
+	VM_BUG_ON(page_tail_autonuma->autonuma_migrate_nid != -1);
 	if (nid >= 0) {
 		VM_BUG_ON(page_to_nid(page) != page_to_nid(page_tail));
 
 		compound_lock(page_tail);
 		autonuma_migrate_lock(nid);
-		list_add_tail(&page_tail->autonuma_migrate_node,
-			      &page->autonuma_migrate_node);
+		list_add_tail(&page_tail_autonuma->autonuma_migrate_node,
+			      &page_autonuma->autonuma_migrate_node);
 		autonuma_migrate_unlock(nid);
 
-		page_tail->autonuma_migrate_nid = nid;
+		page_tail_autonuma->autonuma_migrate_nid = nid;
 		compound_unlock(page_tail);
 	}
 
-	last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+	last_nid = ACCESS_ONCE(page_autonuma->autonuma_last_nid);
 	if (last_nid >= 0)
-		page_tail->autonuma_last_nid = last_nid;
+		page_tail_autonuma->autonuma_last_nid = last_nid;
 }
 
-void __autonuma_migrate_page_remove(struct page *page)
+void __autonuma_migrate_page_remove(struct page *page,
+				    struct page_autonuma *page_autonuma)
 {
 	unsigned long flags;
 	int nid;
 
 	flags = compound_lock_irqsave(page);
 
-	nid = page->autonuma_migrate_nid;
+	nid = page_autonuma->autonuma_migrate_nid;
 	VM_BUG_ON(nid >= MAX_NUMNODES);
 	VM_BUG_ON(nid < -1);
 	if (nid >= 0) {
 		int numpages = hpage_nr_pages(page);
 		autonuma_migrate_lock(nid);
-		list_del(&page->autonuma_migrate_node);
+		list_del(&page_autonuma->autonuma_migrate_node);
 		NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
 		autonuma_migrate_unlock(nid);
 
-		page->autonuma_migrate_nid = -1;
+		page_autonuma->autonuma_migrate_nid = -1;
 	}
 
 	compound_unlock_irqrestore(page, flags);
 }
 
-static void __autonuma_migrate_page_add(struct page *page, int dst_nid,
-					int page_nid)
+static void __autonuma_migrate_page_add(struct page *page,
+					struct page_autonuma *page_autonuma,
+					int dst_nid, int page_nid)
 {
 	unsigned long flags;
 	int nid;
@@ -148,25 +151,25 @@ static void __autonuma_migrate_page_add(struct page *page, int dst_nid,
 	flags = compound_lock_irqsave(page);
 
 	numpages = hpage_nr_pages(page);
-	nid = page->autonuma_migrate_nid;
+	nid = page_autonuma->autonuma_migrate_nid;
 	VM_BUG_ON(nid >= MAX_NUMNODES);
 	VM_BUG_ON(nid < -1);
 	if (nid >= 0) {
 		autonuma_migrate_lock(nid);
-		list_del(&page->autonuma_migrate_node);
+		list_del(&page_autonuma->autonuma_migrate_node);
 		NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
 		autonuma_migrate_unlock(nid);
 	}
 
 	autonuma_migrate_lock(dst_nid);
-	list_add(&page->autonuma_migrate_node,
+	list_add(&page_autonuma->autonuma_migrate_node,
 		 &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid]);
 	NODE_DATA(dst_nid)->autonuma_nr_migrate_pages += numpages;
 	nr_migrate_pages = NODE_DATA(dst_nid)->autonuma_nr_migrate_pages;
 
 	autonuma_migrate_unlock(dst_nid);
 
-	page->autonuma_migrate_nid = dst_nid;
+	page_autonuma->autonuma_migrate_nid = dst_nid;
 
 	compound_unlock_irqrestore(page, flags);
 
@@ -182,9 +185,13 @@ static void __autonuma_migrate_page_add(struct page *page, int dst_nid,
 static void autonuma_migrate_page_add(struct page *page, int dst_nid,
 				      int page_nid)
 {
-	int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+	int migrate_nid;
+	struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+
+	migrate_nid = ACCESS_ONCE(page_autonuma->autonuma_migrate_nid);
 	if (migrate_nid != dst_nid)
-		__autonuma_migrate_page_add(page, dst_nid, page_nid);
+		__autonuma_migrate_page_add(page, page_autonuma,
+					    dst_nid, page_nid);
 }
 
 static bool balance_pgdat(struct pglist_data *pgdat,
@@ -255,23 +262,26 @@ static inline bool last_nid_set(struct task_struct *p,
 				struct page *page, int cpu_nid)
 {
 	bool ret = true;
-	int autonuma_last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+	struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+	int autonuma_last_nid = ACCESS_ONCE(page_autonuma->autonuma_last_nid);
 	VM_BUG_ON(cpu_nid < 0);
 	VM_BUG_ON(cpu_nid >= MAX_NUMNODES);
 	if (autonuma_last_nid >= 0 && autonuma_last_nid != cpu_nid) {
-		int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+		int migrate_nid;
+		migrate_nid = ACCESS_ONCE(page_autonuma->autonuma_migrate_nid);
 		if (migrate_nid >= 0 && migrate_nid != cpu_nid)
-			__autonuma_migrate_page_remove(page);
+			__autonuma_migrate_page_remove(page, page_autonuma);
 		ret = false;
 	}
 	if (autonuma_last_nid != cpu_nid)
-		ACCESS_ONCE(page->autonuma_last_nid) = cpu_nid;
+		ACCESS_ONCE(page_autonuma->autonuma_last_nid) = cpu_nid;
 	return ret;
 }
 
 static int __page_migrate_nid(struct page *page, int page_nid)
 {
-	int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+	struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+	int migrate_nid = ACCESS_ONCE(page_autonuma->autonuma_migrate_nid);
 	if (migrate_nid < 0)
 		migrate_nid = page_nid;
 #if 0
@@ -810,6 +820,7 @@ static int isolate_migratepages(struct list_head *migratepages,
 		struct zone *zone;
 		struct page *page;
 		struct lruvec *lruvec;
+		struct page_autonuma *page_autonuma;
 
 		cond_resched();
 		VM_BUG_ON(numa_node_id() != pgdat->node_id);
@@ -833,16 +844,17 @@ static int isolate_migratepages(struct list_head *migratepages,
 			autonuma_migrate_unlock_irq(pgdat->node_id);
 			continue;
 		}
-		page = list_entry(heads[nid].prev,
-				  struct page,
-				  autonuma_migrate_node);
+		page_autonuma = list_entry(heads[nid].prev,
+					   struct page_autonuma,
+					   autonuma_migrate_node);
+		page = page_autonuma->page;
 		if (unlikely(!get_page_unless_zero(page))) {
 			/*
 			 * Is getting freed and will remove self from the
 			 * autonuma list shortly, skip it for now.
 			 */
-			list_del(&page->autonuma_migrate_node);
-			list_add(&page->autonuma_migrate_node,
+			list_del(&page_autonuma->autonuma_migrate_node);
+			list_add(&page_autonuma->autonuma_migrate_node,
 				 &heads[nid]);
 			autonuma_migrate_unlock_irq(pgdat->node_id);
 			autonuma_printk("autonuma migrate page is free\n");
@@ -851,7 +863,7 @@ static int isolate_migratepages(struct list_head *migratepages,
 		if (!PageLRU(page)) {
 			autonuma_migrate_unlock_irq(pgdat->node_id);
 			autonuma_printk("autonuma migrate page not in LRU\n");
-			__autonuma_migrate_page_remove(page);
+			__autonuma_migrate_page_remove(page, page_autonuma);
 			put_page(page);
 			continue;
 		}
@@ -871,7 +883,7 @@ static int isolate_migratepages(struct list_head *migratepages,
 			}
 		}
 
-		__autonuma_migrate_page_remove(page);
+		__autonuma_migrate_page_remove(page, page_autonuma);
 
 		zone = page_zone(page);
 		spin_lock_irq(&zone->lru_lock);
@@ -917,11 +929,16 @@ static struct page *alloc_migrate_dst_page(struct page *page,
 {
 	int nid = (int) data;
 	struct page *newpage;
+	struct page_autonuma *page_autonuma, *newpage_autonuma;
 	newpage = alloc_pages_exact_node(nid,
 					 GFP_HIGHUSER_MOVABLE | GFP_THISNODE,
 					 0);
-	if (newpage)
-		newpage->autonuma_last_nid = page->autonuma_last_nid;
+	if (newpage) {
+		page_autonuma = lookup_page_autonuma(page);
+		newpage_autonuma = lookup_page_autonuma(newpage);
+		newpage_autonuma->autonuma_last_nid =
+			page_autonuma->autonuma_last_nid;
+	}
 	return newpage;
 }
 
@@ -1345,7 +1362,8 @@ static int __init noautonuma_setup(char *str)
 	}
 	return 1;
 }
-__setup("noautonuma", noautonuma_setup);
+/* early so sparse.c also can see it */
+early_param("noautonuma", noautonuma_setup);
 
 static int __init autonuma_init(void)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bcaa8ac..c5e47bc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1831,6 +1831,13 @@ static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
 {
 	pte_t *_pte;
 	bool mknuma = false;
+#ifdef CONFIG_AUTONUMA
+	struct page_autonuma *src_page_an, *page_an = NULL;
+
+	if (!autonuma_impossible())
+		page_an = lookup_page_autonuma(page);
+#endif
+
 	for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++) {
 		pte_t pteval = *_pte;
 		struct page *src_page;
@@ -1839,17 +1846,18 @@ static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
 			clear_user_highpage(page, address);
 			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
 		} else {
-#ifdef CONFIG_AUTONUMA
-			int autonuma_last_nid;
-#endif
 			src_page = pte_page(pteval);
 #ifdef CONFIG_AUTONUMA
-			/* pick the last one, better than nothing */
-			autonuma_last_nid =
-				ACCESS_ONCE(src_page->autonuma_last_nid);
-			if (autonuma_last_nid >= 0)
-				ACCESS_ONCE(page->autonuma_last_nid) =
-					autonuma_last_nid;
+			if (!autonuma_impossible()) {
+				int autonuma_last_nid;
+				src_page_an = lookup_page_autonuma(src_page);
+				/* pick the last one, better than nothing */
+				autonuma_last_nid =
+					ACCESS_ONCE(src_page_an->autonuma_last_nid);
+				if (autonuma_last_nid >= 0)
+					ACCESS_ONCE(page_an->autonuma_last_nid) =
+						autonuma_last_nid;
+			}
 #endif
 			copy_user_highpage(page, src_page, address, vma);
 			VM_BUG_ON(page_mapcount(src_page) != 1);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8c4ae8e..2d53a1f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -60,6 +60,7 @@
 #include <linux/migrate.h>
 #include <linux/page-debug-flags.h>
 #include <linux/autonuma.h>
+#include <linux/page_autonuma.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -615,10 +616,7 @@ static inline int free_pages_check(struct page *page)
 		bad_page(page);
 		return 1;
 	}
-	autonuma_migrate_page_remove(page);
-#ifdef CONFIG_AUTONUMA
-	page->autonuma_last_nid = -1;
-#endif
+	autonuma_free_page(page);
 	if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
 		page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	return 0;
@@ -3729,10 +3727,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
 
 		INIT_LIST_HEAD(&page->lru);
-#ifdef CONFIG_AUTONUMA
-		page->autonuma_last_nid = -1;
-		page->autonuma_migrate_nid = -1;
-#endif
 #ifdef WANT_PAGE_VIRTUAL
 		/* The shift won't overflow because ZONE_NORMAL is below 4G. */
 		if (!is_highmem_idx(zone))
@@ -4357,22 +4351,13 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 	int nid = pgdat->node_id;
 	unsigned long zone_start_pfn = pgdat->node_start_pfn;
 	int ret;
-#ifdef CONFIG_AUTONUMA
-	int node_iter;
-#endif
 
 	pgdat_resize_init(pgdat);
-#ifdef CONFIG_AUTONUMA
-	spin_lock_init(&pgdat->autonuma_lock);
-	init_waitqueue_head(&pgdat->autonuma_knuma_migrated_wait);
-	pgdat->autonuma_nr_migrate_pages = 0;
-	for_each_node(node_iter)
-		INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
-#endif
 	pgdat->nr_zones = 0;
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	pgdat->kswapd_max_order = 0;
 	pgdat_page_cgroup_init(pgdat);
+	pgdat_autonuma_init(pgdat);
 
 	for (j = 0; j < MAX_NR_ZONES; j++) {
 		struct zone *zone = pgdat->node_zones + j;
diff --git a/mm/page_autonuma.c b/mm/page_autonuma.c
new file mode 100644
index 0000000..bace9b8
--- /dev/null
+++ b/mm/page_autonuma.c
@@ -0,0 +1,234 @@
+#include <linux/mm.h>
+#include <linux/memory.h>
+#include <linux/autonuma_flags.h>
+#include <linux/page_autonuma.h>
+#include <linux/bootmem.h>
+
+void __meminit page_autonuma_map_init(struct page *page,
+				      struct page_autonuma *page_autonuma,
+				      int nr_pages)
+{
+	struct page *end;
+	for (end = page + nr_pages; page < end; page++, page_autonuma++) {
+		page_autonuma->autonuma_last_nid = -1;
+		page_autonuma->autonuma_migrate_nid = -1;
+		page_autonuma->page = page;
+	}
+}
+
+static void __meminit __pgdat_autonuma_init(struct pglist_data *pgdat)
+{
+	int node_iter;
+
+	spin_lock_init(&pgdat->autonuma_lock);
+	init_waitqueue_head(&pgdat->autonuma_knuma_migrated_wait);
+	pgdat->autonuma_nr_migrate_pages = 0;
+	for_each_node(node_iter)
+		INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
+}
+
+#if !defined(CONFIG_SPARSEMEM)
+
+static unsigned long total_usage;
+
+void __meminit pgdat_autonuma_init(struct pglist_data *pgdat)
+{
+	__pgdat_autonuma_init(pgdat);
+	pgdat->node_page_autonuma = NULL;
+}
+
+struct page_autonuma *lookup_page_autonuma(struct page *page)
+{
+	unsigned long pfn = page_to_pfn(page);
+	unsigned long offset;
+	struct page_autonuma *base;
+
+	base = NODE_DATA(page_to_nid(page))->node_page_autonuma;
+#ifdef CONFIG_DEBUG_VM
+	/*
+	 * The sanity checks the page allocator does upon freeing a
+	 * page can reach here before the page_autonuma arrays are
+	 * allocated when feeding a range of pages to the allocator
+	 * for the first time during bootup or memory hotplug.
+	 */
+	if (unlikely(!base))
+		return NULL;
+#endif
+	offset = pfn - NODE_DATA(page_to_nid(page))->node_start_pfn;
+	return base + offset;
+}
+
+static int __init alloc_node_page_autonuma(int nid)
+{
+	struct page_autonuma *base;
+	unsigned long table_size;
+	unsigned long nr_pages;
+
+	nr_pages = NODE_DATA(nid)->node_spanned_pages;
+	if (!nr_pages)
+		return 0;
+
+	table_size = sizeof(struct page_autonuma) * nr_pages;
+
+	base = __alloc_bootmem_node_nopanic(NODE_DATA(nid),
+			table_size, PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+	if (!base)
+		return -ENOMEM;
+	NODE_DATA(nid)->node_page_autonuma = base;
+	total_usage += table_size;
+	page_autonuma_map_init(NODE_DATA(nid)->node_mem_map, base, nr_pages);
+	return 0;
+}
+
+void __init page_autonuma_init_flatmem(void)
+{
+
+	int nid, fail;
+
+	if (autonuma_impossible())
+		return;
+
+	for_each_online_node(nid)  {
+		fail = alloc_node_page_autonuma(nid);
+		if (fail)
+			goto fail;
+	}
+	printk(KERN_INFO "allocated %lu KBytes of page_autonuma\n",
+	       total_usage >> 10);
+	printk(KERN_INFO "please try the 'noautonuma' option if you"
+	" don't want to allocate page_autonuma memory\n");
+	return;
+fail:
+	printk(KERN_CRIT "allocation of page_autonuma failed.\n");
+	printk(KERN_CRIT "please try the 'noautonuma' boot option\n");
+	panic("Out of memory");
+}
+
+#else /* CONFIG_SPARSEMEM */
+
+struct page_autonuma *lookup_page_autonuma(struct page *page)
+{
+	unsigned long pfn = page_to_pfn(page);
+	struct mem_section *section = __pfn_to_section(pfn);
+
+	/* if it's not a power of two we may be wasting memory */
+	BUILD_BUG_ON(SECTION_PAGE_AUTONUMA_SIZE &
+		     (SECTION_PAGE_AUTONUMA_SIZE-1));
+
+#ifdef CONFIG_DEBUG_VM
+	/*
+	 * The sanity checks the page allocator does upon freeing a
+	 * page can reach here before the page_autonuma arrays are
+	 * allocated when feeding a range of pages to the allocator
+	 * for the first time during bootup or memory hotplug.
+	 */
+	if (!section->section_page_autonuma)
+		return NULL;
+#endif
+	return section->section_page_autonuma + pfn;
+}
+
+void __meminit pgdat_autonuma_init(struct pglist_data *pgdat)
+{
+	__pgdat_autonuma_init(pgdat);
+}
+
+struct page_autonuma * __meminit __kmalloc_section_page_autonuma(int nid,
+								 unsigned long nr_pages)
+{
+	struct page_autonuma *ret;
+	struct page *page;
+	unsigned long memmap_size = PAGE_AUTONUMA_SIZE * nr_pages;
+
+	page = alloc_pages_node(nid, GFP_KERNEL|__GFP_NOWARN,
+				get_order(memmap_size));
+	if (page)
+		goto got_map_page_autonuma;
+
+	ret = vmalloc(memmap_size);
+	if (ret)
+		goto out;
+
+	return NULL;
+got_map_page_autonuma:
+	ret = (struct page_autonuma *)pfn_to_kaddr(page_to_pfn(page));
+out:
+	return ret;
+}
+
+void __kfree_section_page_autonuma(struct page_autonuma *page_autonuma,
+				   unsigned long nr_pages)
+{
+	if (is_vmalloc_addr(page_autonuma))
+		vfree(page_autonuma);
+	else
+		free_pages((unsigned long)page_autonuma,
+			   get_order(PAGE_AUTONUMA_SIZE * nr_pages));
+}
+
+static struct page_autonuma __init *sparse_page_autonuma_map_populate(unsigned long pnum,
+								      int nid)
+{
+	struct page_autonuma *map;
+	unsigned long size;
+
+	map = alloc_remap(nid, SECTION_PAGE_AUTONUMA_SIZE);
+	if (map)
+		return map;
+
+	size = PAGE_ALIGN(SECTION_PAGE_AUTONUMA_SIZE);
+	map = __alloc_bootmem_node_high(NODE_DATA(nid), size,
+					PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+	return map;
+}
+
+void __init sparse_early_page_autonuma_alloc_node(struct page_autonuma **page_autonuma_map,
+						  unsigned long pnum_begin,
+						  unsigned long pnum_end,
+						  unsigned long map_count,
+						  int nodeid)
+{
+	void *map;
+	unsigned long pnum;
+	unsigned long size = SECTION_PAGE_AUTONUMA_SIZE;
+
+	map = alloc_remap(nodeid, size * map_count);
+	if (map) {
+		for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
+			if (!present_section_nr(pnum))
+				continue;
+			page_autonuma_map[pnum] = map;
+			map += size;
+		}
+		return;
+	}
+
+	size = PAGE_ALIGN(size);
+	map = __alloc_bootmem_node_high(NODE_DATA(nodeid), size * map_count,
+					PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+	if (map) {
+		for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
+			if (!present_section_nr(pnum))
+				continue;
+			page_autonuma_map[pnum] = map;
+			map += size;
+		}
+		return;
+	}
+
+	/* fallback */
+	for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
+		struct mem_section *ms;
+
+		if (!present_section_nr(pnum))
+			continue;
+		page_autonuma_map[pnum] = sparse_page_autonuma_map_populate(pnum, nodeid);
+		if (page_autonuma_map[pnum])
+			continue;
+		ms = __nr_to_section(pnum);
+		printk(KERN_ERR "%s: sparsemem page_autonuma map backing failed, "
+		       "some memory will not be available.\n", __func__);
+	}
+}
+
+#endif
diff --git a/mm/sparse.c b/mm/sparse.c
index 6a4bf91..1eb301e 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -9,6 +9,7 @@
 #include <linux/export.h>
 #include <linux/spinlock.h>
 #include <linux/vmalloc.h>
+#include <linux/page_autonuma.h>
 #include "internal.h"
 #include <asm/dma.h>
 #include <asm/pgalloc.h>
@@ -242,7 +243,8 @@ struct page *sparse_decode_mem_map(unsigned long coded_mem_map, unsigned long pn
 
 static int __meminit sparse_init_one_section(struct mem_section *ms,
 		unsigned long pnum, struct page *mem_map,
-		unsigned long *pageblock_bitmap)
+		unsigned long *pageblock_bitmap,
+		struct page_autonuma *page_autonuma)
 {
 	if (!present_section(ms))
 		return -EINVAL;
@@ -251,6 +253,14 @@ static int __meminit sparse_init_one_section(struct mem_section *ms,
 	ms->section_mem_map |= sparse_encode_mem_map(mem_map, pnum) |
 							SECTION_HAS_MEM_MAP;
  	ms->pageblock_flags = pageblock_bitmap;
+#ifdef CONFIG_AUTONUMA
+	if (page_autonuma) {
+		ms->section_page_autonuma = page_autonuma - section_nr_to_pfn(pnum);
+		page_autonuma_map_init(mem_map, page_autonuma, PAGES_PER_SECTION);
+	}
+#else
+	BUG_ON(page_autonuma);
+#endif
 
 	return 1;
 }
@@ -484,6 +494,9 @@ void __init sparse_init(void)
 	int size2;
 	struct page **map_map;
 #endif
+	struct page_autonuma **uninitialized_var(page_autonuma_map);
+	struct page_autonuma *page_autonuma;
+	int size3;
 
 	/*
 	 * map is using big page (aka 2M in x86 64 bit)
@@ -578,6 +591,62 @@ void __init sparse_init(void)
 					 map_count, nodeid_begin);
 #endif
 
+	if (!autonuma_impossible()) {
+		unsigned long total_page_autonuma;
+		unsigned long page_autonuma_count;
+
+		size3 = sizeof(struct page_autonuma *) * NR_MEM_SECTIONS;
+		page_autonuma_map = alloc_bootmem(size3);
+		if (!page_autonuma_map)
+			panic("can not allocate page_autonuma_map\n");
+
+		for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
+			struct mem_section *ms;
+
+			if (!present_section_nr(pnum))
+				continue;
+			ms = __nr_to_section(pnum);
+			nodeid_begin = sparse_early_nid(ms);
+			pnum_begin = pnum;
+			break;
+		}
+		total_page_autonuma = 0;
+		page_autonuma_count = 1;
+		for (pnum = pnum_begin + 1; pnum < NR_MEM_SECTIONS; pnum++) {
+			struct mem_section *ms;
+			int nodeid;
+
+			if (!present_section_nr(pnum))
+				continue;
+			ms = __nr_to_section(pnum);
+			nodeid = sparse_early_nid(ms);
+			if (nodeid == nodeid_begin) {
+				page_autonuma_count++;
+				continue;
+			}
+			/* ok, we need to take care of pnum_begin to pnum - 1 */
+			sparse_early_page_autonuma_alloc_node(page_autonuma_map,
+							      pnum_begin,
+							      NR_MEM_SECTIONS,
+							      page_autonuma_count,
+							      nodeid_begin);
+			total_page_autonuma += SECTION_PAGE_AUTONUMA_SIZE * page_autonuma_count;
+			/* new start, update count etc*/
+			nodeid_begin = nodeid;
+			pnum_begin = pnum;
+			page_autonuma_count = 1;
+		}
+		/* ok, last chunk */
+		sparse_early_page_autonuma_alloc_node(page_autonuma_map, pnum_begin,
+						      NR_MEM_SECTIONS,
+						      page_autonuma_count, nodeid_begin);
+		total_page_autonuma += SECTION_PAGE_AUTONUMA_SIZE * page_autonuma_count;
+		printk("allocated %lu KBytes of page_autonuma\n",
+		       total_page_autonuma >> 10);
+		printk(KERN_INFO "please try the 'noautonuma' option if you"
+		       " don't want to allocate page_autonuma memory\n");
+	}
+
 	for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
 		if (!present_section_nr(pnum))
 			continue;
@@ -586,6 +655,14 @@ void __init sparse_init(void)
 		if (!usemap)
 			continue;
 
+		if (autonuma_impossible())
+			page_autonuma = NULL;
+		else {
+			page_autonuma = page_autonuma_map[pnum];
+			if (!page_autonuma)
+				continue;
+		}
+
 #ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
 		map = map_map[pnum];
 #else
@@ -595,11 +672,13 @@ void __init sparse_init(void)
 			continue;
 
 		sparse_init_one_section(__nr_to_section(pnum), pnum, map,
-								usemap);
+					usemap, page_autonuma);
 	}
 
 	vmemmap_populate_print_last();
 
+	if (!autonuma_impossible())
+		free_bootmem(__pa(page_autonuma_map), size3);
 #ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
 	free_bootmem(__pa(map_map), size2);
 #endif
@@ -686,7 +765,8 @@ static void free_map_bootmem(struct page *page, unsigned long nr_pages)
 }
 #endif /* CONFIG_SPARSEMEM_VMEMMAP */
 
-static void free_section_usemap(struct page *memmap, unsigned long *usemap)
+static void free_section_usemap(struct page *memmap, unsigned long *usemap,
+				struct page_autonuma *page_autonuma)
 {
 	struct page *usemap_page;
 	unsigned long nr_pages;
@@ -700,8 +780,14 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap)
 	 */
 	if (PageSlab(usemap_page)) {
 		kfree(usemap);
-		if (memmap)
+		if (memmap) {
 			__kfree_section_memmap(memmap, PAGES_PER_SECTION);
+			if (!autonuma_impossible())
+				__kfree_section_page_autonuma(page_autonuma,
+							      PAGES_PER_SECTION);
+			else
+				BUG_ON(page_autonuma);
+		}
 		return;
 	}
 
@@ -718,6 +804,13 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap)
 			>> PAGE_SHIFT;
 
 		free_map_bootmem(memmap_page, nr_pages);
+
+		if (!autonuma_impossible()) {
+			struct page *page_autonuma_page;
+			page_autonuma_page = virt_to_page(page_autonuma);
+			free_map_bootmem(page_autonuma_page, nr_pages);
+		} else
+			BUG_ON(page_autonuma);
 	}
 }
 
@@ -733,6 +826,7 @@ int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	struct mem_section *ms;
 	struct page *memmap;
+	struct page_autonuma *page_autonuma;
 	unsigned long *usemap;
 	unsigned long flags;
 	int ret;
@@ -752,6 +846,16 @@ int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
 		__kfree_section_memmap(memmap, nr_pages);
 		return -ENOMEM;
 	}
+	if (!autonuma_impossible()) {
+		page_autonuma = __kmalloc_section_page_autonuma(pgdat->node_id,
+								nr_pages);
+		if (!page_autonuma) {
+			kfree(usemap);
+			__kfree_section_memmap(memmap, nr_pages);
+			return -ENOMEM;
+		}
+	} else
+		page_autonuma = NULL;
 
 	pgdat_resize_lock(pgdat, &flags);
 
@@ -763,11 +867,16 @@ int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
 
 	ms->section_mem_map |= SECTION_MARKED_PRESENT;
 
-	ret = sparse_init_one_section(ms, section_nr, memmap, usemap);
+	ret = sparse_init_one_section(ms, section_nr, memmap, usemap,
+				      page_autonuma);
 
 out:
 	pgdat_resize_unlock(pgdat, &flags);
 	if (ret <= 0) {
+		if (!autonuma_impossible())
+			__kfree_section_page_autonuma(page_autonuma, nr_pages);
+		else
+			BUG_ON(page_autonuma);
 		kfree(usemap);
 		__kfree_section_memmap(memmap, nr_pages);
 	}
@@ -778,6 +887,7 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
 {
 	struct page *memmap = NULL;
 	unsigned long *usemap = NULL;
+	struct page_autonuma *page_autonuma = NULL;
 
 	if (ms->section_mem_map) {
 		usemap = ms->pageblock_flags;
@@ -785,8 +895,12 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
 						__section_nr(ms));
 		ms->section_mem_map = 0;
 		ms->pageblock_flags = NULL;
+
+#ifdef CONFIG_AUTONUMA
+		page_autonuma = ms->section_page_autonuma;
+#endif
 	}
 
-	free_section_usemap(memmap, usemap);
+	free_section_usemap(memmap, usemap, page_autonuma);
 }
 #endif


* [PATCH 37/40] autonuma: page_autonuma change #include for sparse
  2012-06-28 12:55 ` Andrea Arcangeli
@ 2012-06-28 12:56   ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 12:56 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

sparse (make C=1) warns about lookup_page_autonuma not being declared.
That's a false positive, but we can silence it by being less strict in
the includes.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/page_autonuma.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/page_autonuma.c b/mm/page_autonuma.c
index bace9b8..2468c9e 100644
--- a/mm/page_autonuma.c
+++ b/mm/page_autonuma.c
@@ -1,6 +1,6 @@
 #include <linux/mm.h>
 #include <linux/memory.h>
-#include <linux/autonuma_flags.h>
+#include <linux/autonuma.h>
 #include <linux/page_autonuma.h>
 #include <linux/bootmem.h>
 


* [PATCH 38/40] autonuma: autonuma_migrate_head[0] dynamic size
  2012-06-28 12:55 ` Andrea Arcangeli
@ 2012-06-28 12:56   ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 12:56 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Reduce the autonuma_migrate_head array entries from MAX_NUMNODES to
num_possible_nodes() or zero if autonuma_impossible() is true.
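
As a minimal standalone sketch (illustration only; the names below are
made up and this is not kernel code), the zero-length trailing array
pattern this change relies on looks like the following: the structure
is allocated with extra room for a runtime-sized array of list heads,
so no space at all is reserved when the array isn't wanted:

	#include <stdio.h>
	#include <stdlib.h>

	struct list_head { struct list_head *next, *prev; };

	struct pgdat_sketch {
		unsigned long nr_migrate_pages;
		/* GNU C zero-length array: must stay the last field */
		struct list_head migrate_head[0];
	};

	static struct pgdat_sketch *alloc_pgdat_sketch(int nr_nodes)
	{
		/* room for the trailing array is added only when needed */
		size_t size = sizeof(struct pgdat_sketch) +
			      sizeof(struct list_head) * nr_nodes;
		struct pgdat_sketch *p = calloc(1, size);
		if (!p)
			return NULL;
		for (int i = 0; i < nr_nodes; i++)
			p->migrate_head[i].next = p->migrate_head[i].prev =
				&p->migrate_head[i];
		return p;
	}

	int main(void)
	{
		struct pgdat_sketch *p = alloc_pgdat_sketch(4);
		printf("base struct: %zu bytes, plus 4 list heads\n",
		       sizeof(struct pgdat_sketch));
		free(p);
		return 0;
	}

In the patch the equivalent sizing is done by
autonuma_pglist_data_size(), which adds sizeof(struct list_head) *
num_possible_nodes() to sizeof(struct pglist_data) only when
autonuma_impossible() is false.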

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/mm/numa.c             |    6 ++++--
 arch/x86/mm/numa_32.c          |    3 ++-
 include/linux/memory_hotplug.h |    3 ++-
 include/linux/mmzone.h         |    8 +++++++-
 include/linux/page_autonuma.h  |   10 ++++++++--
 mm/memory_hotplug.c            |    2 +-
 mm/page_autonuma.c             |    5 +++--
 7 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 2d125be..a4a9e92 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -11,6 +11,7 @@
 #include <linux/nodemask.h>
 #include <linux/sched.h>
 #include <linux/topology.h>
+#include <linux/page_autonuma.h>
 
 #include <asm/e820.h>
 #include <asm/proto.h>
@@ -192,7 +193,8 @@ int __init numa_add_memblk(int nid, u64 start, u64 end)
 /* Initialize NODE_DATA for a node on the local memory */
 static void __init setup_node_data(int nid, u64 start, u64 end)
 {
-	const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
+	const size_t nd_size = roundup(autonuma_pglist_data_size(),
+				       PAGE_SIZE);
 	bool remapped = false;
 	u64 nd_pa;
 	void *nd;
@@ -239,7 +241,7 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
 		printk(KERN_INFO "    NODE_DATA(%d) on node %d\n", nid, tnid);
 
 	node_data[nid] = nd;
-	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
+	memset(NODE_DATA(nid), 0, autonuma_pglist_data_size());
 	NODE_DATA(nid)->node_id = nid;
 	NODE_DATA(nid)->node_start_pfn = start >> PAGE_SHIFT;
 	NODE_DATA(nid)->node_spanned_pages = (end - start) >> PAGE_SHIFT;
diff --git a/arch/x86/mm/numa_32.c b/arch/x86/mm/numa_32.c
index 534255a..d32d6cc 100644
--- a/arch/x86/mm/numa_32.c
+++ b/arch/x86/mm/numa_32.c
@@ -25,6 +25,7 @@
 #include <linux/bootmem.h>
 #include <linux/memblock.h>
 #include <linux/module.h>
+#include <linux/page_autonuma.h>
 
 #include "numa_internal.h"
 
@@ -194,7 +195,7 @@ void __init init_alloc_remap(int nid, u64 start, u64 end)
 
 	/* calculate the necessary space aligned to large page size */
 	size = node_memmap_size_bytes(nid, start_pfn, end_pfn);
-	size += ALIGN(sizeof(pg_data_t), PAGE_SIZE);
+	size += ALIGN(autonuma_pglist_data_size(), PAGE_SIZE);
 	size = ALIGN(size, LARGE_PAGE_BYTES);
 
 	/* allocate node memory and the lowmem remap area */
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 910550f..76b1840 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -5,6 +5,7 @@
 #include <linux/spinlock.h>
 #include <linux/notifier.h>
 #include <linux/bug.h>
+#include <linux/page_autonuma.h>
 
 struct page;
 struct zone;
@@ -130,7 +131,7 @@ extern void arch_refresh_nodedata(int nid, pg_data_t *pgdat);
  */
 #define generic_alloc_nodedata(nid)				\
 ({								\
-	kzalloc(sizeof(pg_data_t), GFP_KERNEL);			\
+	kzalloc(autonuma_pglist_data_size(), GFP_KERNEL);	\
 })
 /*
  * This definition is just for error path in node hotadd.
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e66da74..ed5b0c0 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -701,10 +701,16 @@ typedef struct pglist_data {
 #if !defined(CONFIG_SPARSEMEM)
 	struct page_autonuma *node_page_autonuma;
 #endif
-	struct list_head autonuma_migrate_head[MAX_NUMNODES];
 	unsigned long autonuma_nr_migrate_pages;
 	wait_queue_head_t autonuma_knuma_migrated_wait;
 	spinlock_t autonuma_lock;
+	/*
+	 * Archs supporting AutoNUMA should allocate the pgdat with
+	 * size autonuma_pglist_data_size() after including
+	 * <linux/page_autonuma.h> and the below field must remain the
+	 * last one of this structure.
+	 */
+	struct list_head autonuma_migrate_head[0];
 #endif
 } pg_data_t;
 
diff --git a/include/linux/page_autonuma.h b/include/linux/page_autonuma.h
index d748aa2..bc7a629 100644
--- a/include/linux/page_autonuma.h
+++ b/include/linux/page_autonuma.h
@@ -10,6 +10,7 @@ static inline void __init page_autonuma_init_flatmem(void) {}
 #ifdef CONFIG_AUTONUMA
 
 #include <linux/autonuma_flags.h>
+#include <linux/autonuma_types.h>
 
 extern void __meminit page_autonuma_map_init(struct page *page,
 					     struct page_autonuma *page_autonuma,
@@ -29,11 +30,10 @@ extern void __meminit pgdat_autonuma_init(struct pglist_data *);
 struct page_autonuma;
 #define PAGE_AUTONUMA_SIZE 0
 #define SECTION_PAGE_AUTONUMA_SIZE 0
+#endif
 
 #define autonuma_impossible() true
 
-#endif
-
 static inline void pgdat_autonuma_init(struct pglist_data *pgdat) {}
 
 #endif /* CONFIG_AUTONUMA */
@@ -50,4 +50,10 @@ extern void __init sparse_early_page_autonuma_alloc_node(struct page_autonuma **
 							 int nodeid);
 #endif
 
+/* inline won't work here */
+#define autonuma_pglist_data_size() (sizeof(struct pglist_data) +	\
+				     (autonuma_impossible() ? 0 :	\
+				      sizeof(struct list_head) * \
+				      num_possible_nodes()))
+
 #endif /* _LINUX_PAGE_AUTONUMA_H */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0d7e3ec..604995b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -164,7 +164,7 @@ void register_page_bootmem_info_node(struct pglist_data *pgdat)
 	struct page *page;
 	struct zone *zone;
 
-	nr_pages = PAGE_ALIGN(sizeof(struct pglist_data)) >> PAGE_SHIFT;
+	nr_pages = PAGE_ALIGN(autonuma_pglist_data_size()) >> PAGE_SHIFT;
 	page = virt_to_page(pgdat);
 
 	for (i = 0; i < nr_pages; i++, page++)
diff --git a/mm/page_autonuma.c b/mm/page_autonuma.c
index 2468c9e..d7c5e4a 100644
--- a/mm/page_autonuma.c
+++ b/mm/page_autonuma.c
@@ -23,8 +23,9 @@ static void __meminit __pgdat_autonuma_init(struct pglist_data *pgdat)
 	spin_lock_init(&pgdat->autonuma_lock);
 	init_waitqueue_head(&pgdat->autonuma_knuma_migrated_wait);
 	pgdat->autonuma_nr_migrate_pages = 0;
-	for_each_node(node_iter)
-		INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
+	if (!autonuma_impossible())
+		for_each_node(node_iter)
+			INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
 }
 
 #if !defined(CONFIG_SPARSEMEM)


* [PATCH 39/40] autonuma: bugcheck page_autonuma fields on newly allocated pages
  2012-06-28 12:55 ` Andrea Arcangeli
@ 2012-06-28 12:56   ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 12:56 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Debug tweak.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma.h |   11 +++++++++++
 mm/page_alloc.c          |    1 +
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/include/linux/autonuma.h b/include/linux/autonuma.h
index 67af86a..05bd8c1 100644
--- a/include/linux/autonuma.h
+++ b/include/linux/autonuma.h
@@ -29,6 +29,16 @@ static inline void autonuma_free_page(struct page *page)
 	}
 }
 
+static inline void autonuma_check_new_page(struct page *page)
+{
+	struct page_autonuma *page_autonuma;
+	if (!autonuma_impossible()) {
+		page_autonuma = lookup_page_autonuma(page);
+		BUG_ON(page_autonuma->autonuma_migrate_nid != -1);
+		BUG_ON(page_autonuma->autonuma_last_nid != -1);
+	}
+}
+
 #define autonuma_printk(format, args...) \
 	if (autonuma_debug()) printk(format, ##args)
 
@@ -41,6 +51,7 @@ static inline void autonuma_migrate_split_huge_page(struct page *page,
 						    struct page *page_tail) {}
 static inline void autonuma_setup_new_exec(struct task_struct *p) {}
 static inline void autonuma_free_page(struct page *page) {}
+static inline void autonuma_check_new_page(struct page *page) {}
 
 #endif /* CONFIG_AUTONUMA */
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2d53a1f..5943ed2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -833,6 +833,7 @@ static inline int check_new_page(struct page *page)
 		bad_page(page);
 		return 1;
 	}
+	autonuma_check_new_page(page);
 	return 0;
 }
 


* [PATCH 40/40] autonuma: shrink the per-page page_autonuma struct size
  2012-06-28 12:55 ` Andrea Arcangeli
@ 2012-06-28 12:56   ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 12:56 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

From 32 to 12 bytes, so the AutoNUMA memory footprint is reduced to
0.29% of RAM.

This however will fail to migrate pages above a 16 Terabyte offset
from the start of each node: with 4096-byte pages the 32-bit pfn
offset used by the new list links covers about 2^32 pages, i.e. 16TB
per node. Migration failure isn't fatal: those pages simply won't
follow the CPU, and a warning is printed in the log just once in that
case.

AutoNUMA will also fail to build if MAX_NUMNODES allows more than
(2**15)-1 nodes at build time (it would be easy to relax this to
(2**16)-1 nodes without increasing the memory footprint, but that's
not worth it yet, so let's keep the negative space reserved for now).

This means the max RAM configuration fully supported by AutoNUMA
becomes AUTONUMA_LIST_MAX_PFN_OFFSET multiplied by 32767 nodes
multiplied by the PAGE_SIZE (assume 4096 here, but for some archs it's
bigger).

4096*32767*(0xffffffff-3)>>(10*5) = 511 PetaBytes.
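
A quick standalone size check (illustration only; the field names
mirror the patch but this is not kernel code, and it makes the usual
assumption of 4-byte alignment with no padding) of where the 12 bytes
per page come from: two 16-bit nids plus two 32-bit pfn-offset list
links:

	#include <stdint.h>
	#include <stdio.h>

	struct autonuma_list_head_sketch {
		uint32_t anl_next_pfn;	/* pfn offset of the next entry */
		uint32_t anl_prev_pfn;	/* pfn offset of the prev entry */
	};

	struct page_autonuma_sketch {
		int16_t autonuma_migrate_nid;	/* destination nid, or -1 */
		int16_t autonuma_last_nid;	/* last faulting nid, or -1 */
		struct autonuma_list_head_sketch autonuma_migrate_node;
	};

	int main(void)
	{
		/* 2+2 bytes of nids + 8 bytes of links = 12 bytes */
		printf("%zu bytes per page\n",
		       sizeof(struct page_autonuma_sketch));
		return 0;
	}

With 4096-byte pages, 12/4096 of RAM is the 0.29% overhead quoted
above.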

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma_list.h  |   94 ++++++++++++++++++++++
 include/linux/autonuma_types.h |   45 ++++++-----
 include/linux/mmzone.h         |    3 +-
 include/linux/page_autonuma.h  |    2 +-
 mm/Makefile                    |    2 +-
 mm/autonuma.c                  |   86 +++++++++++++++------
 mm/autonuma_list.c             |  167 ++++++++++++++++++++++++++++++++++++++++
 mm/page_autonuma.c             |   15 ++--
 8 files changed, 362 insertions(+), 52 deletions(-)
 create mode 100644 include/linux/autonuma_list.h
 create mode 100644 mm/autonuma_list.c

diff --git a/include/linux/autonuma_list.h b/include/linux/autonuma_list.h
new file mode 100644
index 0000000..0f338e9
--- /dev/null
+++ b/include/linux/autonuma_list.h
@@ -0,0 +1,94 @@
+#ifndef __AUTONUMA_LIST_H
+#define __AUTONUMA_LIST_H
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+
+typedef uint32_t autonuma_list_entry;
+#define AUTONUMA_LIST_MAX_PFN_OFFSET	(AUTONUMA_LIST_HEAD-3)
+#define AUTONUMA_LIST_POISON1		(AUTONUMA_LIST_HEAD-2)
+#define AUTONUMA_LIST_POISON2		(AUTONUMA_LIST_HEAD-1)
+#define AUTONUMA_LIST_HEAD		((uint32_t)UINT_MAX)
+
+struct autonuma_list_head {
+	autonuma_list_entry anl_next_pfn;
+	autonuma_list_entry anl_prev_pfn;
+};
+
+static inline void AUTONUMA_INIT_LIST_HEAD(struct autonuma_list_head *anl)
+{
+	anl->anl_next_pfn = AUTONUMA_LIST_HEAD;
+	anl->anl_prev_pfn = AUTONUMA_LIST_HEAD;
+}
+
+/* abstraction conversion methods */
+extern struct page *autonuma_list_entry_to_page(int nid,
+					autonuma_list_entry pfn_offset);
+extern autonuma_list_entry autonuma_page_to_list_entry(int page_nid,
+						       struct page *page);
+extern struct autonuma_list_head *__autonuma_list_head(int page_nid,
+					struct autonuma_list_head *head,
+					autonuma_list_entry pfn_offset);
+
+extern bool __autonuma_list_add(int page_nid,
+				struct page *page,
+				struct autonuma_list_head *head,
+				autonuma_list_entry prev,
+				autonuma_list_entry next);
+
+/*
+ * autonuma_list_add - add a new entry
+ *
+ * Insert a new entry after the specified head.
+ */
+static inline bool autonuma_list_add(int page_nid,
+				     struct page *page,
+				     autonuma_list_entry entry,
+				     struct autonuma_list_head *head)
+{
+	struct autonuma_list_head *entry_head;
+	entry_head = __autonuma_list_head(page_nid, head, entry);
+	return __autonuma_list_add(page_nid, page, head,
+				   entry, entry_head->anl_next_pfn);
+}
+
+/*
+ * autonuma_list_add_tail - add a new entry
+ *
+ * Insert a new entry before the specified head.
+ * This is useful for implementing queues.
+ */
+static inline bool autonuma_list_add_tail(int page_nid,
+					  struct page *page,
+					  autonuma_list_entry entry,
+					  struct autonuma_list_head *head)
+{
+	struct autonuma_list_head *entry_head;
+	entry_head = __autonuma_list_head(page_nid, head, entry);
+	return __autonuma_list_add(page_nid, page, head,
+				   entry_head->anl_prev_pfn, entry);
+}
+
+/*
+ * autonuma_list_del - deletes entry from list.
+ * @entry: the element to delete from the list.
+ */
+extern void autonuma_list_del(int page_nid,
+			      struct autonuma_list_head *entry,
+			      struct autonuma_list_head *head);
+
+extern bool autonuma_list_empty(const struct autonuma_list_head *head);
+
+#if 0 /* not needed so far */
+/*
+ * autonuma_list_is_singular - tests whether a list has just one entry.
+ * @head: the list to test.
+ */
+static inline int autonuma_list_is_singular(const struct autonuma_list_head *head)
+{
+	return !autonuma_list_empty(head) &&
+		(head->anl_next_pfn == head->anl_prev_pfn);
+}
+#endif
+
+#endif /* __AUTONUMA_LIST_H */
diff --git a/include/linux/autonuma_types.h b/include/linux/autonuma_types.h
index 1e860f6..579e126 100644
--- a/include/linux/autonuma_types.h
+++ b/include/linux/autonuma_types.h
@@ -4,6 +4,7 @@
 #ifdef CONFIG_AUTONUMA
 
 #include <linux/numa.h>
+#include <linux/autonuma_list.h>
 
 /*
  * Per-mm (process) structure dynamically allocated only if autonuma
@@ -42,6 +43,19 @@ struct task_autonuma {
 /*
  * Per page (or per-pageblock) structure dynamically allocated only if
  * autonuma is not impossible.
+ *
+ * This structure takes 12 bytes per page for all architectures. There
+ * are two constraints to make this work:
+ *
+ * 1) the build will abort if MAX_NUMNODES is too big according to
+ *    the #error check below
+ *
+ * 2) AutoNUMA will not succeed to insert into the migration queue any
+ *    page whose pfn offset value (offset with respect to the first
+ *    pfn of the node) is bigger than AUTONUMA_LIST_MAX_PFN_OFFSET
+ *    (NOTE: AUTONUMA_LIST_MAX_PFN_OFFSET is still a valid pfn offset
+ *    value). This means with huge node sizes and small PAGE_SIZE,
+ *    some pages may not be allowed to be migrated.
  */
 struct page_autonuma {
 	/*
@@ -51,7 +65,14 @@ struct page_autonuma {
 	 * should run in NUMA systems). Archs without that requires
 	 * autonuma_last_nid to be a long.
 	 */
-#if BITS_PER_LONG > 32
+#if MAX_NUMNODES > 32767
+	/*
+	 * Verify at build time that int16_t for autonuma_migrate_nid
+	 * and autonuma_last_nid won't risk to overflow, max allowed
+	 * nid value is (2**15)-1.
+	 */
+#error "too many nodes"
+#endif
 	/*
 	 * autonuma_migrate_nid is -1 if the page_autonuma structure
 	 * is not linked into any
@@ -61,7 +82,7 @@ struct page_autonuma {
 	 * page_nid is the nid that the page (referenced by the
 	 * page_autonuma structure) belongs to.
 	 */
-	int autonuma_migrate_nid;
+	int16_t autonuma_migrate_nid;
 	/*
 	 * autonuma_last_nid records which is the NUMA nid that tried
 	 * to access this page at the last NUMA hinting page fault.
@@ -70,28 +91,14 @@ struct page_autonuma {
 	 * it will make different threads trashing on the same pages,
 	 * converge on the same NUMA node (if possible).
 	 */
-	int autonuma_last_nid;
-#else
-#if MAX_NUMNODES >= 32768
-#error "too many nodes"
-#endif
-	short autonuma_migrate_nid;
-	short autonuma_last_nid;
-#endif
+	int16_t autonuma_last_nid;
+
 	/*
 	 * This is the list node that links the page (referenced by
 	 * the page_autonuma structure) in the
 	 * &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid] lru.
 	 */
-	struct list_head autonuma_migrate_node;
-
-	/*
-	 * To find the page starting from the autonuma_migrate_node we
-	 * need a backlink.
-	 *
-	 * FIXME: drop it;
-	 */
-	struct page *page;
+	struct autonuma_list_head autonuma_migrate_node;
 };
 
 extern int alloc_task_autonuma(struct task_struct *tsk,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ed5b0c0..acefdfa 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -17,6 +17,7 @@
 #include <linux/pageblock-flags.h>
 #include <generated/bounds.h>
 #include <linux/atomic.h>
+#include <linux/autonuma_list.h>
 #include <asm/page.h>
 
 /* Free memory management - zoned buddy allocator.  */
@@ -710,7 +711,7 @@ typedef struct pglist_data {
 	 * <linux/page_autonuma.h> and the below field must remain the
 	 * last one of this structure.
 	 */
-	struct list_head autonuma_migrate_head[0];
+	struct autonuma_list_head autonuma_migrate_head[0];
 #endif
 } pg_data_t;
 
diff --git a/include/linux/page_autonuma.h b/include/linux/page_autonuma.h
index bc7a629..e78beda 100644
--- a/include/linux/page_autonuma.h
+++ b/include/linux/page_autonuma.h
@@ -53,7 +53,7 @@ extern void __init sparse_early_page_autonuma_alloc_node(struct page_autonuma **
 /* inline won't work here */
 #define autonuma_pglist_data_size() (sizeof(struct pglist_data) +	\
 				     (autonuma_impossible() ? 0 :	\
-				      sizeof(struct list_head) * \
+				      sizeof(struct autonuma_list_head) * \
 				      num_possible_nodes()))
 
 #endif /* _LINUX_PAGE_AUTONUMA_H */
diff --git a/mm/Makefile b/mm/Makefile
index a4d8354..4aa90d4 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -33,7 +33,7 @@ obj-$(CONFIG_FRONTSWAP)	+= frontswap.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
 obj-$(CONFIG_NUMA) 	+= mempolicy.o
-obj-$(CONFIG_AUTONUMA) 	+= autonuma.o page_autonuma.o
+obj-$(CONFIG_AUTONUMA) 	+= autonuma.o page_autonuma.o autonuma_list.o
 obj-$(CONFIG_SPARSEMEM)	+= sparse.o
 obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_SLOB) += slob.o
diff --git a/mm/autonuma.c b/mm/autonuma.c
index ec4d492..1873a7b 100644
--- a/mm/autonuma.c
+++ b/mm/autonuma.c
@@ -89,15 +89,30 @@ void autonuma_migrate_split_huge_page(struct page *page,
 	VM_BUG_ON(nid < -1);
 	VM_BUG_ON(page_tail_autonuma->autonuma_migrate_nid != -1);
 	if (nid >= 0) {
-		VM_BUG_ON(page_to_nid(page) != page_to_nid(page_tail));
+		bool added;
+		int page_nid = page_to_nid(page);
+		struct autonuma_list_head *head;
+		autonuma_list_entry entry;
+		entry = autonuma_page_to_list_entry(page_nid, page);
+		head = &NODE_DATA(nid)->autonuma_migrate_head[page_nid];
+		VM_BUG_ON(page_nid != page_to_nid(page_tail));
+		VM_BUG_ON(page_nid == nid);
 
 		compound_lock(page_tail);
 		autonuma_migrate_lock(nid);
-		list_add_tail(&page_tail_autonuma->autonuma_migrate_node,
-			      &page_autonuma->autonuma_migrate_node);
+		added = autonuma_list_add_tail(page_nid, page_tail, entry,
+					       head);
+		/*
+		 * AUTONUMA_LIST_MAX_PFN_OFFSET+1 isn't a power of 2
+		 * so "added" may be false if there's a pfn overflow
+		 * in the list.
+		 */
+		if (!added)
+			NODE_DATA(nid)->autonuma_nr_migrate_pages--;
 		autonuma_migrate_unlock(nid);
 
-		page_tail_autonuma->autonuma_migrate_nid = nid;
+		if (added)
+			page_tail_autonuma->autonuma_migrate_nid = nid;
 		compound_unlock(page_tail);
 	}
 
@@ -119,8 +134,15 @@ void __autonuma_migrate_page_remove(struct page *page,
 	VM_BUG_ON(nid < -1);
 	if (nid >= 0) {
 		int numpages = hpage_nr_pages(page);
+		int page_nid = page_to_nid(page);
+		struct autonuma_list_head *head;
+		VM_BUG_ON(nid == page_nid);
+		head = &NODE_DATA(nid)->autonuma_migrate_head[page_nid];
+
 		autonuma_migrate_lock(nid);
-		list_del(&page_autonuma->autonuma_migrate_node);
+		autonuma_list_del(page_nid,
+				  &page_autonuma->autonuma_migrate_node,
+				  head);
 		NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
 		autonuma_migrate_unlock(nid);
 
@@ -139,6 +161,8 @@ static void __autonuma_migrate_page_add(struct page *page,
 	int numpages;
 	unsigned long nr_migrate_pages;
 	wait_queue_head_t *wait_queue;
+	struct autonuma_list_head *head;
+	bool added;
 
 	VM_BUG_ON(dst_nid >= MAX_NUMNODES);
 	VM_BUG_ON(dst_nid < -1);
@@ -155,25 +179,34 @@ static void __autonuma_migrate_page_add(struct page *page,
 	VM_BUG_ON(nid >= MAX_NUMNODES);
 	VM_BUG_ON(nid < -1);
 	if (nid >= 0) {
+		VM_BUG_ON(nid == page_nid);
+		head = &NODE_DATA(nid)->autonuma_migrate_head[page_nid];
+
 		autonuma_migrate_lock(nid);
-		list_del(&page_autonuma->autonuma_migrate_node);
+		autonuma_list_del(page_nid,
+				  &page_autonuma->autonuma_migrate_node,
+				  head);
 		NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
 		autonuma_migrate_unlock(nid);
 	}
 
+	head = &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid];
+
 	autonuma_migrate_lock(dst_nid);
-	list_add(&page_autonuma->autonuma_migrate_node,
-		 &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid]);
-	NODE_DATA(dst_nid)->autonuma_nr_migrate_pages += numpages;
-	nr_migrate_pages = NODE_DATA(dst_nid)->autonuma_nr_migrate_pages;
+	added = autonuma_list_add(page_nid, page, AUTONUMA_LIST_HEAD, head);
+	if (added) {
+		NODE_DATA(dst_nid)->autonuma_nr_migrate_pages += numpages;
+		nr_migrate_pages = NODE_DATA(dst_nid)->autonuma_nr_migrate_pages;
+	}
 
 	autonuma_migrate_unlock(dst_nid);
 
-	page_autonuma->autonuma_migrate_nid = dst_nid;
+	if (added)
+		page_autonuma->autonuma_migrate_nid = dst_nid;
 
 	compound_unlock_irqrestore(page, flags);
 
-	if (!autonuma_migrate_defer()) {
+	if (added && !autonuma_migrate_defer()) {
 		wait_queue = &NODE_DATA(dst_nid)->autonuma_knuma_migrated_wait;
 		if (nr_migrate_pages >= pages_to_migrate &&
 		    nr_migrate_pages - numpages < pages_to_migrate &&
@@ -813,7 +846,7 @@ static int isolate_migratepages(struct list_head *migratepages,
 				struct pglist_data *pgdat)
 {
 	int nr = 0, nid;
-	struct list_head *heads = pgdat->autonuma_migrate_head;
+	struct autonuma_list_head *heads = pgdat->autonuma_migrate_head;
 
 	/* FIXME: THP balancing, restart from last nid */
 	for_each_online_node(nid) {
@@ -825,10 +858,10 @@ static int isolate_migratepages(struct list_head *migratepages,
 		cond_resched();
 		VM_BUG_ON(numa_node_id() != pgdat->node_id);
 		if (nid == pgdat->node_id) {
-			VM_BUG_ON(!list_empty(&heads[nid]));
+			VM_BUG_ON(!autonuma_list_empty(&heads[nid]));
 			continue;
 		}
-		if (list_empty(&heads[nid]))
+		if (autonuma_list_empty(&heads[nid]))
 			continue;
 		/* some page wants to go to this pgdat */
 		/*
@@ -840,22 +873,29 @@ static int isolate_migratepages(struct list_head *migratepages,
 		 * irqs.
 		 */
 		autonuma_migrate_lock_irq(pgdat->node_id);
-		if (list_empty(&heads[nid])) {
+		if (autonuma_list_empty(&heads[nid])) {
 			autonuma_migrate_unlock_irq(pgdat->node_id);
 			continue;
 		}
-		page_autonuma = list_entry(heads[nid].prev,
-					   struct page_autonuma,
-					   autonuma_migrate_node);
-		page = page_autonuma->page;
+		page = autonuma_list_entry_to_page(nid,
+						   heads[nid].anl_prev_pfn);
+		page_autonuma = lookup_page_autonuma(page);
 		if (unlikely(!get_page_unless_zero(page))) {
+			int page_nid = page_to_nid(page);
+			struct autonuma_list_head *entry_head;
+			VM_BUG_ON(nid == page_nid);
+
 			/*
 			 * Is getting freed and will remove self from the
 			 * autonuma list shortly, skip it for now.
 			 */
-			list_del(&page_autonuma->autonuma_migrate_node);
-			list_add(&page_autonuma->autonuma_migrate_node,
-				 &heads[nid]);
+			entry_head = &page_autonuma->autonuma_migrate_node;
+			autonuma_list_del(page_nid, entry_head,
+					  &heads[nid]);
+			if (!autonuma_list_add(page_nid, page,
+					       AUTONUMA_LIST_HEAD,
+					       &heads[nid]))
+				BUG();
 			autonuma_migrate_unlock_irq(pgdat->node_id);
 			autonuma_printk("autonuma migrate page is free\n");
 			continue;
diff --git a/mm/autonuma_list.c b/mm/autonuma_list.c
new file mode 100644
index 0000000..2c840f7
--- /dev/null
+++ b/mm/autonuma_list.c
@@ -0,0 +1,167 @@
+/*
+ * Copyright 2006, Red Hat, Inc., Dave Jones
+ * Copyright 2012, Red Hat, Inc.
+ * Released under the General Public License (GPL).
+ *
+ * This file contains the linked list implementations for
+ * autonuma migration lists.
+ */
+
+#include <linux/mm.h>
+#include <linux/autonuma.h>
+
+/*
+ * Insert a new entry between two known consecutive entries.
+ *
+ * This is only for internal list manipulation where we know
+ * the prev/next entries already!
+ *
+ * return true if succeeded, or false if the (page_nid, pfn_offset)
+ * pair couldn't represent the pfn and the list_add didn't succeed.
+ */
+bool __autonuma_list_add(int page_nid,
+			 struct page *page,
+			 struct autonuma_list_head *head,
+			 autonuma_list_entry prev,
+			 autonuma_list_entry next)
+{
+	autonuma_list_entry new;
+
+	VM_BUG_ON(page_nid != page_to_nid(page));
+	new = autonuma_page_to_list_entry(page_nid, page);
+	if (new > AUTONUMA_LIST_MAX_PFN_OFFSET)
+		return false;
+
+	WARN(new == prev || new == next,
+	     "autonuma_list_add double add: new=%u, prev=%u, next=%u.\n",
+	     new, prev, next);
+
+	__autonuma_list_head(page_nid, head, next)->anl_prev_pfn = new;
+	__autonuma_list_head(page_nid, head, new)->anl_next_pfn = next;
+	__autonuma_list_head(page_nid, head, new)->anl_prev_pfn = prev;
+	__autonuma_list_head(page_nid, head, prev)->anl_next_pfn = new;
+	return true;
+}
+
+static inline void __autonuma_list_del_entry(int page_nid,
+					     struct autonuma_list_head *entry,
+					     struct autonuma_list_head *head)
+{
+	autonuma_list_entry prev, next;
+
+	prev = entry->anl_prev_pfn;
+	next = entry->anl_next_pfn;
+
+	if (WARN(next == AUTONUMA_LIST_POISON1,
+		 "autonuma_list_del corruption, "
+		 "%p->anl_next_pfn is AUTONUMA_LIST_POISON1 (%u)\n",
+		entry, AUTONUMA_LIST_POISON1) ||
+	    WARN(prev == AUTONUMA_LIST_POISON2,
+		"autonuma_list_del corruption, "
+		 "%p->anl_prev_pfn is AUTONUMA_LIST_POISON2 (%u)\n",
+		entry, AUTONUMA_LIST_POISON2))
+		return;
+
+	__autonuma_list_head(page_nid, head, next)->anl_prev_pfn = prev;
+	__autonuma_list_head(page_nid, head, prev)->anl_next_pfn = next;
+}
+
+/*
+ * autonuma_list_del - deletes entry from list.
+ *
+ * Note: autonuma_list_empty on entry does not return true after this,
+ * the entry is in an undefined state.
+ */
+void autonuma_list_del(int page_nid, struct autonuma_list_head *entry,
+		       struct autonuma_list_head *head)
+{
+	__autonuma_list_del_entry(page_nid, entry, head);
+	entry->anl_next_pfn = AUTONUMA_LIST_POISON1;
+	entry->anl_prev_pfn = AUTONUMA_LIST_POISON2;
+}
+
+/*
+ * autonuma_list_empty - tests whether a list is empty
+ * @head: the list to test.
+ */
+bool autonuma_list_empty(const struct autonuma_list_head *head)
+{
+	bool ret = false;
+	if (head->anl_next_pfn == AUTONUMA_LIST_HEAD) {
+		ret = true;
+		BUG_ON(head->anl_prev_pfn != AUTONUMA_LIST_HEAD);
+	}
+	return ret;
+}
+
+/* abstraction conversion methods */
+
+static inline struct page *__autonuma_list_entry_to_page(int page_nid,
+							 autonuma_list_entry pfn_offset)
+{
+	struct pglist_data *pgdat = NODE_DATA(page_nid);
+	unsigned long pfn = pgdat->node_start_pfn + pfn_offset;
+	return pfn_to_page(pfn);
+}
+
+struct page *autonuma_list_entry_to_page(int page_nid,
+					 autonuma_list_entry pfn_offset)
+{
+	VM_BUG_ON(page_nid < 0);
+	BUG_ON(pfn_offset == AUTONUMA_LIST_POISON1);
+	BUG_ON(pfn_offset == AUTONUMA_LIST_POISON2);
+	BUG_ON(pfn_offset == AUTONUMA_LIST_HEAD);
+	return __autonuma_list_entry_to_page(page_nid, pfn_offset);
+}
+
+/*
+ * returns a value above AUTONUMA_LIST_MAX_PFN_OFFSET if the pfn is
+ * located a too big offset from the start of the node and cannot be
+ * represented by the (page_nid, pfn_offset) pair.
+ */
+autonuma_list_entry autonuma_page_to_list_entry(int page_nid,
+						struct page *page)
+{
+	unsigned long pfn = page_to_pfn(page);
+	struct pglist_data *pgdat = NODE_DATA(page_nid);
+	VM_BUG_ON(page_nid != page_to_nid(page));
+	BUG_ON(pfn < pgdat->node_start_pfn);
+	pfn -= pgdat->node_start_pfn;
+	if (pfn > AUTONUMA_LIST_MAX_PFN_OFFSET) {
+		WARN_ONCE(1, "autonuma_page_to_list_entry: "
+			  "pfn_offset  %lu, pgdat %p, "
+			  "pgdat->node_start_pfn %lu\n",
+			  pfn, pgdat, pgdat->node_start_pfn);
+		/*
+		 * Any value bigger than AUTONUMA_LIST_MAX_PFN_OFFSET
+		 * will work as an error retval, but better pick one
+		 * that will cause noise if computed wrong by the
+		 * caller.
+		 */
+		return AUTONUMA_LIST_POISON1;
+	}
+	return pfn; /* fits in autonuma_list_entry without losing information */
+}
+
+static inline struct autonuma_list_head *____autonuma_list_head(int page_nid,
+					autonuma_list_entry pfn_offset)
+{
+	struct pglist_data *pgdat = NODE_DATA(page_nid);
+	unsigned long pfn = pgdat->node_start_pfn + pfn_offset;
+	struct page *page = pfn_to_page(pfn);
+	struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+	return &page_autonuma->autonuma_migrate_node;
+}
+
+struct autonuma_list_head *__autonuma_list_head(int page_nid,
+					struct autonuma_list_head *head,
+					autonuma_list_entry pfn_offset)
+{
+	VM_BUG_ON(page_nid < 0);
+	BUG_ON(pfn_offset == AUTONUMA_LIST_POISON1);
+	BUG_ON(pfn_offset == AUTONUMA_LIST_POISON2);
+	if (pfn_offset != AUTONUMA_LIST_HEAD)
+		return ____autonuma_list_head(page_nid, pfn_offset);
+	else
+		return head;
+}
diff --git a/mm/page_autonuma.c b/mm/page_autonuma.c
index d7c5e4a..b629074 100644
--- a/mm/page_autonuma.c
+++ b/mm/page_autonuma.c
@@ -12,7 +12,6 @@ void __meminit page_autonuma_map_init(struct page *page,
 	for (end = page + nr_pages; page < end; page++, page_autonuma++) {
 		page_autonuma->autonuma_last_nid = -1;
 		page_autonuma->autonuma_migrate_nid = -1;
-		page_autonuma->page = page;
 	}
 }
 
@@ -20,12 +19,18 @@ static void __meminit __pgdat_autonuma_init(struct pglist_data *pgdat)
 {
 	int node_iter;
 
+	/* verify the per-page page_autonuma 12 byte fixed cost */
+	BUILD_BUG_ON((unsigned long) &((struct page_autonuma *)0)[1] != 12);
+
 	spin_lock_init(&pgdat->autonuma_lock);
 	init_waitqueue_head(&pgdat->autonuma_knuma_migrated_wait);
 	pgdat->autonuma_nr_migrate_pages = 0;
 	if (!autonuma_impossible())
-		for_each_node(node_iter)
-			INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
+		for_each_node(node_iter) {
+			struct autonuma_list_head *head;
+			head = &pgdat->autonuma_migrate_head[node_iter];
+			AUTONUMA_INIT_LIST_HEAD(head);
+		}
 }
 
 #if !defined(CONFIG_SPARSEMEM)
@@ -112,10 +117,6 @@ struct page_autonuma *lookup_page_autonuma(struct page *page)
 	unsigned long pfn = page_to_pfn(page);
 	struct mem_section *section = __pfn_to_section(pfn);
 
-	/* if it's not a power of two we may be wasting memory */
-	BUILD_BUG_ON(SECTION_PAGE_AUTONUMA_SIZE &
-		     (SECTION_PAGE_AUTONUMA_SIZE-1));
-
 #ifdef CONFIG_DEBUG_VM
 	/*
 	 * The sanity checks the page allocator does upon freeing a


* [PATCH 40/40] autonuma: shrink the per-page page_autonuma struct size
@ 2012-06-28 12:56   ` Andrea Arcangeli
  0 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 12:56 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

From 32 to 12 bytes, so the AutoNUMA memory footprint is reduced to
0.29% of RAM.

This however will fail to migrate pages above a 16 Terabyte offset
from the start of each node: with 4096-byte pages the 32-bit pfn
offset used by the new list links covers about 2^32 pages, i.e. 16TB
per node. Migration failure isn't fatal: those pages simply won't
follow the CPU, and a warning is printed in the log just once in that
case.

AutoNUMA will also fail to build if MAX_NUMNODES allows more than
(2**15)-1 nodes at build time (it would be easy to relax this to
(2**16)-1 nodes without increasing the memory footprint, but that's
not worth it yet, so let's keep the negative space reserved for now).

This means the max RAM configuration fully supported by AutoNUMA
becomes AUTONUMA_LIST_MAX_PFN_OFFSET multiplied by 32767 nodes
multiplied by the PAGE_SIZE (assume 4096 here, but for some archs it's
bigger).

4096*32767*(0xffffffff-3)>>(10*5) = 511 PetaBytes.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma_list.h  |   94 ++++++++++++++++++++++
 include/linux/autonuma_types.h |   45 ++++++-----
 include/linux/mmzone.h         |    3 +-
 include/linux/page_autonuma.h  |    2 +-
 mm/Makefile                    |    2 +-
 mm/autonuma.c                  |   86 +++++++++++++++------
 mm/autonuma_list.c             |  167 ++++++++++++++++++++++++++++++++++++++++
 mm/page_autonuma.c             |   15 ++--
 8 files changed, 362 insertions(+), 52 deletions(-)
 create mode 100644 include/linux/autonuma_list.h
 create mode 100644 mm/autonuma_list.c

diff --git a/include/linux/autonuma_list.h b/include/linux/autonuma_list.h
new file mode 100644
index 0000000..0f338e9
--- /dev/null
+++ b/include/linux/autonuma_list.h
@@ -0,0 +1,94 @@
+#ifndef __AUTONUMA_LIST_H
+#define __AUTONUMA_LIST_H
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+
+typedef uint32_t autonuma_list_entry;
+#define AUTONUMA_LIST_MAX_PFN_OFFSET	(AUTONUMA_LIST_HEAD-3)
+#define AUTONUMA_LIST_POISON1		(AUTONUMA_LIST_HEAD-2)
+#define AUTONUMA_LIST_POISON2		(AUTONUMA_LIST_HEAD-1)
+#define AUTONUMA_LIST_HEAD		((uint32_t)UINT_MAX)
+
+struct autonuma_list_head {
+	autonuma_list_entry anl_next_pfn;
+	autonuma_list_entry anl_prev_pfn;
+};
+
+static inline void AUTONUMA_INIT_LIST_HEAD(struct autonuma_list_head *anl)
+{
+	anl->anl_next_pfn = AUTONUMA_LIST_HEAD;
+	anl->anl_prev_pfn = AUTONUMA_LIST_HEAD;
+}
+
+/* abstraction conversion methods */
+extern struct page *autonuma_list_entry_to_page(int nid,
+					autonuma_list_entry pfn_offset);
+extern autonuma_list_entry autonuma_page_to_list_entry(int page_nid,
+						       struct page *page);
+extern struct autonuma_list_head *__autonuma_list_head(int page_nid,
+					struct autonuma_list_head *head,
+					autonuma_list_entry pfn_offset);
+
+extern bool __autonuma_list_add(int page_nid,
+				struct page *page,
+				struct autonuma_list_head *head,
+				autonuma_list_entry prev,
+				autonuma_list_entry next);
+
+/*
+ * autonuma_list_add - add a new entry
+ *
+ * Insert a new entry after the specified head.
+ */
+static inline bool autonuma_list_add(int page_nid,
+				     struct page *page,
+				     autonuma_list_entry entry,
+				     struct autonuma_list_head *head)
+{
+	struct autonuma_list_head *entry_head;
+	entry_head = __autonuma_list_head(page_nid, head, entry);
+	return __autonuma_list_add(page_nid, page, head,
+				   entry, entry_head->anl_next_pfn);
+}
+
+/*
+ * autonuma_list_add_tail - add a new entry
+ *
+ * Insert a new entry before the specified head.
+ * This is useful for implementing queues.
+ */
+static inline bool autonuma_list_add_tail(int page_nid,
+					  struct page *page,
+					  autonuma_list_entry entry,
+					  struct autonuma_list_head *head)
+{
+	struct autonuma_list_head *entry_head;
+	entry_head = __autonuma_list_head(page_nid, head, entry);
+	return __autonuma_list_add(page_nid, page, head,
+				   entry_head->anl_prev_pfn, entry);
+}
+
+/*
+ * autonuma_list_del - deletes entry from list.
+ * @entry: the element to delete from the list.
+ */
+extern void autonuma_list_del(int page_nid,
+			      struct autonuma_list_head *entry,
+			      struct autonuma_list_head *head);
+
+extern bool autonuma_list_empty(const struct autonuma_list_head *head);
+
+#if 0 /* not needed so far */
+/*
+ * autonuma_list_is_singular - tests whether a list has just one entry.
+ * @head: the list to test.
+ */
+static inline int autonuma_list_is_singular(const struct autonuma_list_head *head)
+{
+	return !autonuma_list_empty(head) &&
+		(head->anl_next_pfn == head->anl_prev_pfn);
+}
+#endif
+
+#endif /* __AUTONUMA_LIST_H */
diff --git a/include/linux/autonuma_types.h b/include/linux/autonuma_types.h
index 1e860f6..579e126 100644
--- a/include/linux/autonuma_types.h
+++ b/include/linux/autonuma_types.h
@@ -4,6 +4,7 @@
 #ifdef CONFIG_AUTONUMA
 
 #include <linux/numa.h>
+#include <linux/autonuma_list.h>
 
 /*
  * Per-mm (process) structure dynamically allocated only if autonuma
@@ -42,6 +43,19 @@ struct task_autonuma {
 /*
  * Per page (or per-pageblock) structure dynamically allocated only if
  * autonuma is not impossible.
+ *
+ * This structure takes 12 bytes per page for all architectures. There
+ * are two constraints to make this work:
+ *
+ * 1) the build will abort if MAX_NUMNODES is too big according to
+ *    the #error check below
+ *
+ * 2) AutoNUMA will not succeed to insert into the migration queue any
+ *    page whose pfn offset value (offset with respect to the first
+ *    pfn of the node) is bigger than AUTONUMA_LIST_MAX_PFN_OFFSET
+ *    (NOTE: AUTONUMA_LIST_MAX_PFN_OFFSET is still a valid pfn offset
+ *    value). This means with huge node sizes and small PAGE_SIZE,
+ *    some pages may not be allowed to be migrated.
  */
 struct page_autonuma {
 	/*
@@ -51,7 +65,14 @@ struct page_autonuma {
 	 * should run in NUMA systems). Archs without that requires
 	 * autonuma_last_nid to be a long.
 	 */
-#if BITS_PER_LONG > 32
+#if MAX_NUMNODES > 32767
+	/*
+	 * Verify at build time that int16_t for autonuma_migrate_nid
+	 * and autonuma_last_nid won't risk to overflow, max allowed
+	 * nid value is (2**15)-1.
+	 */
+#error "too many nodes"
+#endif
 	/*
 	 * autonuma_migrate_nid is -1 if the page_autonuma structure
 	 * is not linked into any
@@ -61,7 +82,7 @@ struct page_autonuma {
 	 * page_nid is the nid that the page (referenced by the
 	 * page_autonuma structure) belongs to.
 	 */
-	int autonuma_migrate_nid;
+	int16_t autonuma_migrate_nid;
 	/*
 	 * autonuma_last_nid records which is the NUMA nid that tried
 	 * to access this page at the last NUMA hinting page fault.
@@ -70,28 +91,14 @@ struct page_autonuma {
 	 * it will make different threads trashing on the same pages,
 	 * converge on the same NUMA node (if possible).
 	 */
-	int autonuma_last_nid;
-#else
-#if MAX_NUMNODES >= 32768
-#error "too many nodes"
-#endif
-	short autonuma_migrate_nid;
-	short autonuma_last_nid;
-#endif
+	int16_t autonuma_last_nid;
+
 	/*
 	 * This is the list node that links the page (referenced by
 	 * the page_autonuma structure) in the
 	 * &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid] lru.
 	 */
-	struct list_head autonuma_migrate_node;
-
-	/*
-	 * To find the page starting from the autonuma_migrate_node we
-	 * need a backlink.
-	 *
-	 * FIXME: drop it;
-	 */
-	struct page *page;
+	struct autonuma_list_head autonuma_migrate_node;
 };
 
 extern int alloc_task_autonuma(struct task_struct *tsk,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ed5b0c0..acefdfa 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -17,6 +17,7 @@
 #include <linux/pageblock-flags.h>
 #include <generated/bounds.h>
 #include <linux/atomic.h>
+#include <linux/autonuma_list.h>
 #include <asm/page.h>
 
 /* Free memory management - zoned buddy allocator.  */
@@ -710,7 +711,7 @@ typedef struct pglist_data {
 	 * <linux/page_autonuma.h> and the below field must remain the
 	 * last one of this structure.
 	 */
-	struct list_head autonuma_migrate_head[0];
+	struct autonuma_list_head autonuma_migrate_head[0];
 #endif
 } pg_data_t;
 
diff --git a/include/linux/page_autonuma.h b/include/linux/page_autonuma.h
index bc7a629..e78beda 100644
--- a/include/linux/page_autonuma.h
+++ b/include/linux/page_autonuma.h
@@ -53,7 +53,7 @@ extern void __init sparse_early_page_autonuma_alloc_node(struct page_autonuma **
 /* inline won't work here */
 #define autonuma_pglist_data_size() (sizeof(struct pglist_data) +	\
 				     (autonuma_impossible() ? 0 :	\
-				      sizeof(struct list_head) * \
+				      sizeof(struct autonuma_list_head) * \
 				      num_possible_nodes()))
 
 #endif /* _LINUX_PAGE_AUTONUMA_H */
diff --git a/mm/Makefile b/mm/Makefile
index a4d8354..4aa90d4 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -33,7 +33,7 @@ obj-$(CONFIG_FRONTSWAP)	+= frontswap.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
 obj-$(CONFIG_NUMA) 	+= mempolicy.o
-obj-$(CONFIG_AUTONUMA) 	+= autonuma.o page_autonuma.o
+obj-$(CONFIG_AUTONUMA) 	+= autonuma.o page_autonuma.o autonuma_list.o
 obj-$(CONFIG_SPARSEMEM)	+= sparse.o
 obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_SLOB) += slob.o
diff --git a/mm/autonuma.c b/mm/autonuma.c
index ec4d492..1873a7b 100644
--- a/mm/autonuma.c
+++ b/mm/autonuma.c
@@ -89,15 +89,30 @@ void autonuma_migrate_split_huge_page(struct page *page,
 	VM_BUG_ON(nid < -1);
 	VM_BUG_ON(page_tail_autonuma->autonuma_migrate_nid != -1);
 	if (nid >= 0) {
-		VM_BUG_ON(page_to_nid(page) != page_to_nid(page_tail));
+		bool added;
+		int page_nid = page_to_nid(page);
+		struct autonuma_list_head *head;
+		autonuma_list_entry entry;
+		entry = autonuma_page_to_list_entry(page_nid, page);
+		head = &NODE_DATA(nid)->autonuma_migrate_head[page_nid];
+		VM_BUG_ON(page_nid != page_to_nid(page_tail));
+		VM_BUG_ON(page_nid == nid);
 
 		compound_lock(page_tail);
 		autonuma_migrate_lock(nid);
-		list_add_tail(&page_tail_autonuma->autonuma_migrate_node,
-			      &page_autonuma->autonuma_migrate_node);
+		added = autonuma_list_add_tail(page_nid, page_tail, entry,
+					       head);
+		/*
+		 * AUTONUMA_LIST_MAX_PFN_OFFSET+1 isn't a power of 2
+		 * so "added" may be false if there's a pfn overflow
+		 * in the list.
+		 */
+		if (!added)
+			NODE_DATA(nid)->autonuma_nr_migrate_pages--;
 		autonuma_migrate_unlock(nid);
 
-		page_tail_autonuma->autonuma_migrate_nid = nid;
+		if (added)
+			page_tail_autonuma->autonuma_migrate_nid = nid;
 		compound_unlock(page_tail);
 	}
 
@@ -119,8 +134,15 @@ void __autonuma_migrate_page_remove(struct page *page,
 	VM_BUG_ON(nid < -1);
 	if (nid >= 0) {
 		int numpages = hpage_nr_pages(page);
+		int page_nid = page_to_nid(page);
+		struct autonuma_list_head *head;
+		VM_BUG_ON(nid == page_nid);
+		head = &NODE_DATA(nid)->autonuma_migrate_head[page_nid];
+
 		autonuma_migrate_lock(nid);
-		list_del(&page_autonuma->autonuma_migrate_node);
+		autonuma_list_del(page_nid,
+				  &page_autonuma->autonuma_migrate_node,
+				  head);
 		NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
 		autonuma_migrate_unlock(nid);
 
@@ -139,6 +161,8 @@ static void __autonuma_migrate_page_add(struct page *page,
 	int numpages;
 	unsigned long nr_migrate_pages;
 	wait_queue_head_t *wait_queue;
+	struct autonuma_list_head *head;
+	bool added;
 
 	VM_BUG_ON(dst_nid >= MAX_NUMNODES);
 	VM_BUG_ON(dst_nid < -1);
@@ -155,25 +179,34 @@ static void __autonuma_migrate_page_add(struct page *page,
 	VM_BUG_ON(nid >= MAX_NUMNODES);
 	VM_BUG_ON(nid < -1);
 	if (nid >= 0) {
+		VM_BUG_ON(nid == page_nid);
+		head = &NODE_DATA(nid)->autonuma_migrate_head[page_nid];
+
 		autonuma_migrate_lock(nid);
-		list_del(&page_autonuma->autonuma_migrate_node);
+		autonuma_list_del(page_nid,
+				  &page_autonuma->autonuma_migrate_node,
+				  head);
 		NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
 		autonuma_migrate_unlock(nid);
 	}
 
+	head = &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid];
+
 	autonuma_migrate_lock(dst_nid);
-	list_add(&page_autonuma->autonuma_migrate_node,
-		 &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid]);
-	NODE_DATA(dst_nid)->autonuma_nr_migrate_pages += numpages;
-	nr_migrate_pages = NODE_DATA(dst_nid)->autonuma_nr_migrate_pages;
+	added = autonuma_list_add(page_nid, page, AUTONUMA_LIST_HEAD, head);
+	if (added) {
+		NODE_DATA(dst_nid)->autonuma_nr_migrate_pages += numpages;
+		nr_migrate_pages = NODE_DATA(dst_nid)->autonuma_nr_migrate_pages;
+	}
 
 	autonuma_migrate_unlock(dst_nid);
 
-	page_autonuma->autonuma_migrate_nid = dst_nid;
+	if (added)
+		page_autonuma->autonuma_migrate_nid = dst_nid;
 
 	compound_unlock_irqrestore(page, flags);
 
-	if (!autonuma_migrate_defer()) {
+	if (added && !autonuma_migrate_defer()) {
 		wait_queue = &NODE_DATA(dst_nid)->autonuma_knuma_migrated_wait;
 		if (nr_migrate_pages >= pages_to_migrate &&
 		    nr_migrate_pages - numpages < pages_to_migrate &&
@@ -813,7 +846,7 @@ static int isolate_migratepages(struct list_head *migratepages,
 				struct pglist_data *pgdat)
 {
 	int nr = 0, nid;
-	struct list_head *heads = pgdat->autonuma_migrate_head;
+	struct autonuma_list_head *heads = pgdat->autonuma_migrate_head;
 
 	/* FIXME: THP balancing, restart from last nid */
 	for_each_online_node(nid) {
@@ -825,10 +858,10 @@ static int isolate_migratepages(struct list_head *migratepages,
 		cond_resched();
 		VM_BUG_ON(numa_node_id() != pgdat->node_id);
 		if (nid == pgdat->node_id) {
-			VM_BUG_ON(!list_empty(&heads[nid]));
+			VM_BUG_ON(!autonuma_list_empty(&heads[nid]));
 			continue;
 		}
-		if (list_empty(&heads[nid]))
+		if (autonuma_list_empty(&heads[nid]))
 			continue;
 		/* some page wants to go to this pgdat */
 		/*
@@ -840,22 +873,29 @@ static int isolate_migratepages(struct list_head *migratepages,
 		 * irqs.
 		 */
 		autonuma_migrate_lock_irq(pgdat->node_id);
-		if (list_empty(&heads[nid])) {
+		if (autonuma_list_empty(&heads[nid])) {
 			autonuma_migrate_unlock_irq(pgdat->node_id);
 			continue;
 		}
-		page_autonuma = list_entry(heads[nid].prev,
-					   struct page_autonuma,
-					   autonuma_migrate_node);
-		page = page_autonuma->page;
+		page = autonuma_list_entry_to_page(nid,
+						   heads[nid].anl_prev_pfn);
+		page_autonuma = lookup_page_autonuma(page);
 		if (unlikely(!get_page_unless_zero(page))) {
+			int page_nid = page_to_nid(page);
+			struct autonuma_list_head *entry_head;
+			VM_BUG_ON(nid == page_nid);
+
 			/*
 			 * Is getting freed and will remove self from the
 			 * autonuma list shortly, skip it for now.
 			 */
-			list_del(&page_autonuma->autonuma_migrate_node);
-			list_add(&page_autonuma->autonuma_migrate_node,
-				 &heads[nid]);
+			entry_head = &page_autonuma->autonuma_migrate_node;
+			autonuma_list_del(page_nid, entry_head,
+					  &heads[nid]);
+			if (!autonuma_list_add(page_nid, page,
+					       AUTONUMA_LIST_HEAD,
+					       &heads[nid]))
+				BUG();
 			autonuma_migrate_unlock_irq(pgdat->node_id);
 			autonuma_printk("autonuma migrate page is free\n");
 			continue;
diff --git a/mm/autonuma_list.c b/mm/autonuma_list.c
new file mode 100644
index 0000000..2c840f7
--- /dev/null
+++ b/mm/autonuma_list.c
@@ -0,0 +1,167 @@
+/*
+ * Copyright 2006, Red Hat, Inc., Dave Jones
+ * Copyright 2012, Red Hat, Inc.
+ * Released under the General Public License (GPL).
+ *
+ * This file contains the linked list implementations for
+ * autonuma migration lists.
+ */
+
+#include <linux/mm.h>
+#include <linux/autonuma.h>
+
+/*
+ * Insert a new entry between two known consecutive entries.
+ *
+ * This is only for internal list manipulation where we know
+ * the prev/next entries already!
+ *
+ * return true if succeeded, or false if the (page_nid, pfn_offset)
+ * pair couldn't represent the pfn and the list_add didn't succeed.
+ */
+bool __autonuma_list_add(int page_nid,
+			 struct page *page,
+			 struct autonuma_list_head *head,
+			 autonuma_list_entry prev,
+			 autonuma_list_entry next)
+{
+	autonuma_list_entry new;
+
+	VM_BUG_ON(page_nid != page_to_nid(page));
+	new = autonuma_page_to_list_entry(page_nid, page);
+	if (new > AUTONUMA_LIST_MAX_PFN_OFFSET)
+		return false;
+
+	WARN(new == prev || new == next,
+	     "autonuma_list_add double add: new=%u, prev=%u, next=%u.\n",
+	     new, prev, next);
+
+	__autonuma_list_head(page_nid, head, next)->anl_prev_pfn = new;
+	__autonuma_list_head(page_nid, head, new)->anl_next_pfn = next;
+	__autonuma_list_head(page_nid, head, new)->anl_prev_pfn = prev;
+	__autonuma_list_head(page_nid, head, prev)->anl_next_pfn = new;
+	return true;
+}
+
+static inline void __autonuma_list_del_entry(int page_nid,
+					     struct autonuma_list_head *entry,
+					     struct autonuma_list_head *head)
+{
+	autonuma_list_entry prev, next;
+
+	prev = entry->anl_prev_pfn;
+	next = entry->anl_next_pfn;
+
+	if (WARN(next == AUTONUMA_LIST_POISON1,
+		 "autonuma_list_del corruption, "
+		 "%p->anl_next_pfn is AUTONUMA_LIST_POISON1 (%u)\n",
+		entry, AUTONUMA_LIST_POISON1) ||
+	    WARN(prev == AUTONUMA_LIST_POISON2,
+		"autonuma_list_del corruption, "
+		 "%p->anl_prev_pfn is AUTONUMA_LIST_POISON2 (%u)\n",
+		entry, AUTONUMA_LIST_POISON2))
+		return;
+
+	__autonuma_list_head(page_nid, head, next)->anl_prev_pfn = prev;
+	__autonuma_list_head(page_nid, head, prev)->anl_next_pfn = next;
+}
+
+/*
+ * autonuma_list_del - deletes entry from list.
+ *
+ * Note: autonuma_list_empty on entry does not return true after this,
+ * the entry is in an undefined state.
+ */
+void autonuma_list_del(int page_nid, struct autonuma_list_head *entry,
+		       struct autonuma_list_head *head)
+{
+	__autonuma_list_del_entry(page_nid, entry, head);
+	entry->anl_next_pfn = AUTONUMA_LIST_POISON1;
+	entry->anl_prev_pfn = AUTONUMA_LIST_POISON2;
+}
+
+/*
+ * autonuma_list_empty - tests whether a list is empty
+ * @head: the list to test.
+ */
+bool autonuma_list_empty(const struct autonuma_list_head *head)
+{
+	bool ret = false;
+	if (head->anl_next_pfn == AUTONUMA_LIST_HEAD) {
+		ret = true;
+		BUG_ON(head->anl_prev_pfn != AUTONUMA_LIST_HEAD);
+	}
+	return ret;
+}
+
+/* abstraction conversion methods */
+
+static inline struct page *__autonuma_list_entry_to_page(int page_nid,
+							 autonuma_list_entry pfn_offset)
+{
+	struct pglist_data *pgdat = NODE_DATA(page_nid);
+	unsigned long pfn = pgdat->node_start_pfn + pfn_offset;
+	return pfn_to_page(pfn);
+}
+
+struct page *autonuma_list_entry_to_page(int page_nid,
+					 autonuma_list_entry pfn_offset)
+{
+	VM_BUG_ON(page_nid < 0);
+	BUG_ON(pfn_offset == AUTONUMA_LIST_POISON1);
+	BUG_ON(pfn_offset == AUTONUMA_LIST_POISON2);
+	BUG_ON(pfn_offset == AUTONUMA_LIST_HEAD);
+	return __autonuma_list_entry_to_page(page_nid, pfn_offset);
+}
+
+/*
+ * returns a value above AUTONUMA_LIST_MAX_PFN_OFFSET if the pfn is
+ * located a too big offset from the start of the node and cannot be
+ * represented by the (page_nid, pfn_offset) pair.
+ */
+autonuma_list_entry autonuma_page_to_list_entry(int page_nid,
+						struct page *page)
+{
+	unsigned long pfn = page_to_pfn(page);
+	struct pglist_data *pgdat = NODE_DATA(page_nid);
+	VM_BUG_ON(page_nid != page_to_nid(page));
+	BUG_ON(pfn < pgdat->node_start_pfn);
+	pfn -= pgdat->node_start_pfn;
+	if (pfn > AUTONUMA_LIST_MAX_PFN_OFFSET) {
+		WARN_ONCE(1, "autonuma_page_to_list_entry: "
+			  "pfn_offset  %lu, pgdat %p, "
+			  "pgdat->node_start_pfn %lu\n",
+			  pfn, pgdat, pgdat->node_start_pfn);
+		/*
+		 * Any value bigger than AUTONUMA_LIST_MAX_PFN_OFFSET
+		 * will work as an error retval, but better pick one
+		 * that will cause noise if computed wrong by the
+		 * caller.
+		 */
+		return AUTONUMA_LIST_POISON1;
+	}
+	return pfn; /* convert to uint16_t without losing information */
+}
+
+static inline struct autonuma_list_head *____autonuma_list_head(int page_nid,
+					autonuma_list_entry pfn_offset)
+{
+	struct pglist_data *pgdat = NODE_DATA(page_nid);
+	unsigned long pfn = pgdat->node_start_pfn + pfn_offset;
+	struct page *page = pfn_to_page(pfn);
+	struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+	return &page_autonuma->autonuma_migrate_node;
+}
+
+struct autonuma_list_head *__autonuma_list_head(int page_nid,
+					struct autonuma_list_head *head,
+					autonuma_list_entry pfn_offset)
+{
+	VM_BUG_ON(page_nid < 0);
+	BUG_ON(pfn_offset == AUTONUMA_LIST_POISON1);
+	BUG_ON(pfn_offset == AUTONUMA_LIST_POISON2);
+	if (pfn_offset != AUTONUMA_LIST_HEAD)
+		return ____autonuma_list_head(page_nid, pfn_offset);
+	else
+		return head;
+}
diff --git a/mm/page_autonuma.c b/mm/page_autonuma.c
index d7c5e4a..b629074 100644
--- a/mm/page_autonuma.c
+++ b/mm/page_autonuma.c
@@ -12,7 +12,6 @@ void __meminit page_autonuma_map_init(struct page *page,
 	for (end = page + nr_pages; page < end; page++, page_autonuma++) {
 		page_autonuma->autonuma_last_nid = -1;
 		page_autonuma->autonuma_migrate_nid = -1;
-		page_autonuma->page = page;
 	}
 }
 
@@ -20,12 +19,18 @@ static void __meminit __pgdat_autonuma_init(struct pglist_data *pgdat)
 {
 	int node_iter;
 
+	/* verify the per-page page_autonuma 12 byte fixed cost */
+	BUILD_BUG_ON((unsigned long) &((struct page_autonuma *)0)[1] != 12);
+
 	spin_lock_init(&pgdat->autonuma_lock);
 	init_waitqueue_head(&pgdat->autonuma_knuma_migrated_wait);
 	pgdat->autonuma_nr_migrate_pages = 0;
 	if (!autonuma_impossible())
-		for_each_node(node_iter)
-			INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
+		for_each_node(node_iter) {
+			struct autonuma_list_head *head;
+			head = &pgdat->autonuma_migrate_head[node_iter];
+			AUTONUMA_INIT_LIST_HEAD(head);
+		}
 }
 
 #if !defined(CONFIG_SPARSEMEM)
@@ -112,10 +117,6 @@ struct page_autonuma *lookup_page_autonuma(struct page *page)
 	unsigned long pfn = page_to_pfn(page);
 	struct mem_section *section = __pfn_to_section(pfn);
 
-	/* if it's not a power of two we may be wasting memory */
-	BUILD_BUG_ON(SECTION_PAGE_AUTONUMA_SIZE &
-		     (SECTION_PAGE_AUTONUMA_SIZE-1));
-
 #ifdef CONFIG_DEBUG_VM
 	/*
 	 * The sanity checks the page allocator does upon freeing a


^ permalink raw reply related	[flat|nested] 327+ messages in thread
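
The essence of the autonuma_list flavour introduced above is that each
list link is stored as a 16-bit pfn offset from the node's start pfn
instead of a full pointer, which is what shrinks struct page_autonuma to
the 12 bytes checked by the BUILD_BUG_ON, at the price that a page whose
pfn offset doesn't fit (or collides with the POISON/HEAD encodings)
simply can't be queued. A self-contained toy of that encoding, with
made-up names and values rather than the kernel code:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Toy only: links are 16-bit pfn offsets, not pointers. */
#define SKETCH_LIST_HEAD   0xffffU   /* stand-in for AUTONUMA_LIST_HEAD      */
#define SKETCH_MAX_OFFSET  0xfffcU   /* offsets above this can't be stored   */

struct sketch_page_autonuma {
	int32_t autonuma_last_nid;   /* 4 bytes                              */
	int32_t autonuma_migrate_nid;/* 4 bytes                              */
	uint16_t anl_next_pfn;       /* 2 bytes: link = pfn - node_start_pfn */
	uint16_t anl_prev_pfn;       /* 2 bytes                              */
};                                   /* the 12-byte per-page fixed cost      */

int main(void)
{
	unsigned long node_start_pfn = 0x100000;
	unsigned long pfn = 0x100123;            /* some page in this node   */
	unsigned long offset = pfn - node_start_pfn;

	assert(sizeof(struct sketch_page_autonuma) == 12);
	if (offset > SKETCH_MAX_OFFSET)
		printf("pfn 0x%lx not representable, list_add fails\n", pfn);
	else
		printf("pfn 0x%lx stored as 16-bit offset 0x%lx\n", pfn, offset);
	return 0;
}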

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-28 12:55   ` Andrea Arcangeli
@ 2012-06-28 14:46     ` Peter Zijlstra
  -1 siblings, 0 replies; 327+ messages in thread
From: Peter Zijlstra @ 2012-06-28 14:46 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, 2012-06-28 at 14:55 +0200, Andrea Arcangeli wrote:
> +/*
> + * This function sched_autonuma_balance() is responsible for deciding
> + * which is the best CPU each process should be running on according
> + * to the NUMA statistics collected in mm->mm_autonuma and
> + * tsk->task_autonuma.
> + *
> + * The core math that evaluates the current CPU against the CPUs of
> + * all _other_ nodes is this:
> + *
> + *     if (w_nid > w_other && w_nid > w_cpu_nid)
> + *             weight = w_nid - w_other + w_nid - w_cpu_nid;
> + *
> + * w_nid: NUMA affinity of the current thread/process if run on the
> + * other CPU.
> + *
> + * w_other: NUMA affinity of the other thread/process if run on the
> + * other CPU.
> + *
> + * w_cpu_nid: NUMA affinity of the current thread/process if run on
> + * the current CPU.
> + *
> + * weight: combined NUMA affinity benefit in moving the current
> + * thread/process to the other CPU taking into account both the
> higher
> + * NUMA affinity of the current process if run on the other CPU, and
> + * the increase in NUMA affinity in the other CPU by replacing the
> + * other process.

A lot of words, all meaningless without a proper definition of the w_*
stuff. How are they calculated, and why?

> + * We run the above math on every CPU not part of the current NUMA
> + * node, and we compare the current process against the other
> + * processes running in the other CPUs in the remote NUMA nodes. The
> + * objective is to select the cpu (in selected_cpu) with a bigger
> + * "weight". The bigger the "weight" the biggest gain we'll get by
> + * moving the current process to the selected_cpu (not only the
> + * biggest immediate CPU gain but also the fewer async memory
> + * migrations that will be required to reach full convergence
> + * later). If we select a cpu we migrate the current process to it.

So you do something like: 

	max_(i, node(i) != curr_node) { weight_i }

That is, you have this weight, then what do you do?
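
Spelled out, the quoted description seems to amount to: compute that
weight for every CPU of every remote node, keep the maximum, and migrate
the current task to the winner. A self-contained toy with made-up tasks
and affinity numbers (not the patch's code) of that selection:

#include <stdio.h>

#define NCPUS 4

int main(void)
{
	int cpu_node[NCPUS] = { 0, 0, 1, 1 };   /* CPU -> NUMA node          */
	int running[NCPUS]  = { 0, 1, 2, 3 };   /* CPU -> task currently on  */
	long affinity[4][2] = {                 /* task x node -> "w_*"      */
		{ 10, 90 },  /* task 0 is the "current" task, on CPU 0       */
		{ 50, 50 },
		{ 80, 20 },
		{ 60, 40 },
	};
	int curr = 0, curr_nid = cpu_node[0];
	int selected_cpu = -1, cpu;
	long best_weight = 0;

	for (cpu = 0; cpu < NCPUS; cpu++) {
		int nid = cpu_node[cpu];
		long w_nid, w_other, w_cpu_nid, weight;

		if (nid == curr_nid)
			continue;               /* only remote nodes         */
		w_nid     = affinity[curr][nid];
		w_other   = affinity[running[cpu]][nid];
		w_cpu_nid = affinity[curr][curr_nid];
		if (w_nid > w_other && w_nid > w_cpu_nid) {
			weight = (w_nid - w_other) + (w_nid - w_cpu_nid);
			if (weight > best_weight) {
				best_weight = weight;
				selected_cpu = cpu;
			}
		}
	}
	printf("selected_cpu=%d weight=%ld\n", selected_cpu, best_weight);
	return 0;
}

With these numbers it prints selected_cpu=2 weight=150: CPU 2 sits on the
node where the current task has most of its affinity, and the task
running there has the least to lose by being displaced.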

> + * Checking that the current process has higher NUMA affinity than
> the
> + * other process on the other CPU (w_nid > w_other) and not only that
> + * the current process has higher NUMA affinity on the other CPU than
> + * on the current CPU (w_nid > w_cpu_nid) completely avoids ping
> pongs
> + * and ensures (temporary) convergence of the algorithm (at least
> from
> + * a CPU standpoint).

How does that follow?

> + * It's then up to the idle balancing code that will run as soon as
> + * the current CPU goes idle to pick the other process and move it
> + * here (or in some other idle CPU if any).
> + *
> + * By only evaluating running processes against running processes we
> + * avoid interfering with the CFS stock active idle balancing, which
> + * is critical to optimal performance with HT enabled. (getting HT
> + * wrong is worse than running on remote memory so the active idle
> + * balancing has priority)

what?

> + * Idle balancing and all other CFS load balancing become NUMA
> + * affinity aware through the introduction of
> + * sched_autonuma_can_migrate_task(). CFS searches CPUs in the task's
> + * autonuma_node first when it needs to find idle CPUs during idle
> + * balancing or tasks to pick during load balancing.

You talk a lot about idle balance, but there's zero mention of fairness.
This is worrisome.

> + * The task's autonuma_node is the node selected by
> + * sched_autonuma_balance() when it migrates a task to the
> + * selected_cpu in the selected_nid

I think I already said that strict was out of the question and hard
movement like that simply didn't make sense.

> + * Once a process/thread has been moved to another node, closer to
> the
> + * much of memory it has recently accessed,

closer to the recently accessed memory you mean?

>  any memory for that task
> + * not in the new node moves slowly (asynchronously in the
> background)
> + * to the new node. This is done by the knuma_migratedN (where the
> + * suffix N is the node id) daemon described in mm/autonuma.c.
> + *
> + * One non trivial bit of this logic that deserves an explanation is
> + * how the three crucial variables of the core math
> + * (w_nid/w_other/wcpu_nid) are going to change depending on whether
> + * the other CPU is running a thread of the current process, or a
> + * thread of a different process.

No no no... it's not a friggin detail, it's absolutely crucial. Also, if
you'd given a proper definition you wouldn't need to hand-wave your way
around the dynamics either, because that would simply follow from the
definition.

<snip terrible example>

> + * Before scanning all other CPUs' runqueues to compute the above
> + * math,

OK, let's stop calling the one isolated conditional you mentioned 'math'.
On its own it's useless.

>  we also verify that the current CPU is not already in the
> + * preferred NUMA node from the point of view of both the process
> + * statistics and the thread statistics. In such case we can return
> to
> + * the caller without having to check any other CPUs' runqueues
> + * because full convergence has been already reached.

Things being in the 'preferred' place don't have much to do with
convergence. Does your model have local minima/maxima where it can get
stuck, or does it always find a global min/max?


> + * This algorithm might be expanded to take all runnable processes
> + * into account but examining just the currently running processes is
> + * a good enough approximation because some runnable processes may
> run
> + * only for a short time so statistically there will always be a bias
> + * on the processes that uses most the of the CPU. This is ideal
> + * because it doesn't matter if NUMA balancing isn't optimal for
> + * processes that run only for a short time.

Almost, but not quite... it would be so if the sampling could be proven
to be unbiased. But it's quite possible for a task to consume most of the
CPU time and never show up as the current task in your load-balance run.
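
For instance, this toy (plain user-space C, made-up numbers) has a task
that consumes 90% of the CPU yet is never the task found running when the
periodic tick samples, so a current-task-only sampler attributes all the
load to the other task:

#include <stdio.h>

int main(void)
{
	int tick_period = 10;           /* sample once every 10 time units  */
	int run_a = 0, run_b = 0, samples_a = 0, samples_b = 0;
	int t;

	for (t = 0; t < 1000; t++) {
		int phase = t % tick_period;
		int running_a = (phase < 9);    /* A runs 90% of the time... */

		if (running_a)
			run_a++;
		else
			run_b++;                /* ...B only around the tick */
		if (phase == 9) {               /* periodic sample           */
			if (running_a)
				samples_a++;
			else
				samples_b++;
		}
	}
	printf("cpu time: A=%d B=%d\n", run_a, run_b);
	printf("samples:  A=%d B=%d\n", samples_a, samples_b);
	return 0;
}

It prints cpu time A=900 B=100 but samples A=0 B=100.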



As it stands you wrote a lot of words.. but none of them were really
helpful in understanding what you do.

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-28 12:55   ` Andrea Arcangeli
@ 2012-06-28 14:53     ` Peter Zijlstra
  -1 siblings, 0 replies; 327+ messages in thread
From: Peter Zijlstra @ 2012-06-28 14:53 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, 2012-06-28 at 14:55 +0200, Andrea Arcangeli wrote:
> +#ifdef __ia64__
> +#error "NOTE: tlb_migrate_finish won't run here"
> +#endif 

https://lkml.org/lkml/2012/5/29/359

It's an optional thing; not running it isn't fatal at all.

Also, ia64 has CONFIG_NUMA so all this code had better run on it.

That said, I've also already told you to stop using such forceful
migration; that simply doesn't make any sense, NUMA balancing isn't that
critical.

Unless you're going to listen to the feedback I give you, I'm going to
completely stop reading your patches; I don't give a rat's arse that you
work for the same company anymore.

You're impossible to work with.

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 05/40] autonuma: define _PAGE_NUMA_PTE and _PAGE_NUMA_PMD
  2012-06-28 15:13     ` Don Morris
@ 2012-06-28 15:00       ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-28 15:00 UTC (permalink / raw)
  To: Don Morris; +Cc: linux-kernel, linux-mm

Hi Don,

On Thu, Jun 28, 2012 at 08:13:11AM -0700, Don Morris wrote:
> On 06/28/2012 05:55 AM, Andrea Arcangeli wrote:
> > We will set these bitflags only when the pmd and pte is non present.
> > 
> 
> Just a couple grammar nitpicks.
> 
> > They work like PROT_NONE but they identify a request for the numa
> > hinting page fault to trigger.
> > 
> > Because we want to be able to set these bitflag in any established pte
> 
> these bitflags
> 
> > or pmd (while clearing the present bit at the same time) without
> > losing information, these bitflags must never be set when the pte and
> > pmd are present.
> > 
> > For _PAGE_NUMA_PTE the pte bitflag used is _PAGE_PSE, which cannot be
> > set on ptes and it also fits in between _PAGE_FILE and _PAGE_PROTNONE
> > which avoids having to alter the swp entries format.
> > 
> > For _PAGE_NUMA_PMD, we use a reserved bitflag. pmds never contain
> > swap_entries but if in the future we'll swap transparent hugepages, we
> > must keep in mind not to use the _PAGE_UNUSED2 bitflag in the swap
> > entry format and to start the swap entry offset above it.
> > 
> > PAGE_UNUSED2 is used by Xen but only on ptes established by ioremap,
> > but it's never used on pmds so there's no risk of collision with Xen.
> 
> Maybe "but only on ptes established by ioremap, never on pmds so
> there's no risk of collision with Xen." ? The extra "but" just
> doesn't flow in the original.

Agreed and applied, thanks!
Andrea

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 05/40] autonuma: define _PAGE_NUMA_PTE and _PAGE_NUMA_PMD
  2012-06-28 12:55   ` Andrea Arcangeli
@ 2012-06-28 15:13     ` Don Morris
  -1 siblings, 0 replies; 327+ messages in thread
From: Don Morris @ 2012-06-28 15:13 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk,
	Benjamin Herrenschmidt

On 06/28/2012 05:55 AM, Andrea Arcangeli wrote:
> We will set these bitflags only when the pmd and pte is non present.
> 

Just a couple grammar nitpicks.

> They work like PROT_NONE but they identify a request for the numa
> hinting page fault to trigger.
> 
> Because we want to be able to set these bitflag in any established pte

these bitflags

> or pmd (while clearing the present bit at the same time) without
> losing information, these bitflags must never be set when the pte and
> pmd are present.
> 
> For _PAGE_NUMA_PTE the pte bitflag used is _PAGE_PSE, which cannot be
> set on ptes and it also fits in between _PAGE_FILE and _PAGE_PROTNONE
> which avoids having to alter the swp entries format.
> 
> For _PAGE_NUMA_PMD, we use a reserved bitflag. pmds never contain
> swap_entries but if in the future we'll swap transparent hugepages, we
> must keep in mind not to use the _PAGE_UNUSED2 bitflag in the swap
> entry format and to start the swap entry offset above it.
> 
> PAGE_UNUSED2 is used by Xen but only on ptes established by ioremap,
> but it's never used on pmds so there's no risk of collision with Xen.

Maybe "but only on ptes established by ioremap, never on pmds so
there's no risk of collision with Xen." ? The extra "but" just
doesn't flow in the original.

Don Morris

> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  arch/x86/include/asm/pgtable_types.h |   11 +++++++++++
>  1 files changed, 11 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index b74cac9..6e2d954 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -71,6 +71,17 @@
>  #define _PAGE_FILE	(_AT(pteval_t, 1) << _PAGE_BIT_FILE)
>  #define _PAGE_PROTNONE	(_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
>  
> +/*
> + * Cannot be set on pte. The fact it's in between _PAGE_FILE and
> + * _PAGE_PROTNONE avoids having to alter the swp entries.
> + */
> +#define _PAGE_NUMA_PTE	_PAGE_PSE
> +/*
> + * Cannot be set on pmd, if transparent hugepages will be swapped out
> + * the swap entry offset must start above it.
> + */
> +#define _PAGE_NUMA_PMD	_PAGE_UNUSED2
> +
>  #define _PAGE_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER |	\
>  			 _PAGE_ACCESSED | _PAGE_DIRTY)
>  #define _KERNPG_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED |	\
> 


^ permalink raw reply	[flat|nested] 327+ messages in thread
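
To make the trick in the quoted commit message concrete: a pte is made
non-present while a spare bit is set on it, so the fault path can tell a
NUMA hinting fault from a genuine fault without losing any information.
A self-contained toy of that bit manipulation, reusing the quoted x86 bit
positions (_PAGE_PRESENT is bit 0, _PAGE_PSE is bit 7) but with made-up
names, not the kernel accessors:

#include <stdint.h>
#include <stdio.h>

#define SK_PAGE_PRESENT   (1ULL << 0)
#define SK_PAGE_PSE       (1ULL << 7)
#define SK_PAGE_NUMA_PTE  SK_PAGE_PSE   /* quoted patch: reuse _PAGE_PSE */

typedef uint64_t sk_pte_t;

/* NUMA hinting fault wanted: marked but not present */
static int sk_pte_numa(sk_pte_t pte)
{
	return (pte & (SK_PAGE_PRESENT | SK_PAGE_NUMA_PTE)) == SK_PAGE_NUMA_PTE;
}

/* clear the present bit, set the marker, keep everything else intact */
static sk_pte_t sk_pte_mknuma(sk_pte_t pte)
{
	return (pte | SK_PAGE_NUMA_PTE) & ~SK_PAGE_PRESENT;
}

int main(void)
{
	sk_pte_t pte = SK_PAGE_PRESENT | 0x1000;   /* some mapped page */

	pte = sk_pte_mknuma(pte);
	printf("pte_numa=%d present=%d pfn bits kept=%#llx\n",
	       sk_pte_numa(pte), !!(pte & SK_PAGE_PRESENT),
	       (unsigned long long)(pte & ~0xfffULL));
	return 0;
}

This only works because _PAGE_PSE can never legitimately be set on a 4k
pte, which is exactly the property the quoted commit message relies on.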

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-28 14:53     ` Peter Zijlstra
@ 2012-06-29 12:16       ` Hillf Danton
  -1 siblings, 0 replies; 327+ messages in thread
From: Hillf Danton @ 2012-06-29 12:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, Dan Smith,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Jun 28, 2012 at 10:53 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> Unless you're going to listen to feedback I give you, I'm going to
> completely stop reading your patches, I don't give a rats arse you work
> for the same company anymore.
>

Were you brought up, Peter, in a dirty environment with a polluted mind?
And stop shaming Red Hat, since you are no longer a teenager.

If you did nothing wrong, give me the email address of your manager
at Red Hat in reply, please.

> You're impossible to work with.

Clearly you show that you are not sane enough nor able to work with others.

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-29 12:16       ` Hillf Danton
@ 2012-06-29 12:55         ` Ingo Molnar
  -1 siblings, 0 replies; 327+ messages in thread
From: Ingo Molnar @ 2012-06-29 12:55 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Peter Zijlstra, Andrea Arcangeli, linux-kernel, linux-mm,
	Dan Smith, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Rik van Riel, Johannes Weiner, Srivatsa Vaddagiri,
	Christoph Lameter, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt


* Hillf Danton <dhillf@gmail.com> wrote:

> On Thu, Jun 28, 2012 at 10:53 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> >
> > Unless you're going to listen to feedback I give you, I'm 
> > going to completely stop reading your patches, I don't give 
> > a rats arse you work for the same company anymore.
> 
> Are you brought up, Peter, in dirty environment with mind 
> polluted?

You do not seem to be aware of the history of this patch set;
I suspect Peter got "polluted" by Andrea ignoring his repeated
review feedback...

If his multiple rounds of polite (and extensive) review didn't
have much of an effect, then maybe some amount of not-so-nice
shouting will have more of an effect?

The other option would be to NAK and ignore the patch set; in
that sense Peter is being a lot more constructive and forward-looking
than a polite NAK would be, even if the language is rough.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 01/40] mm: add unlikely to the mm allocation failure check
  2012-06-28 12:55   ` Andrea Arcangeli
@ 2012-06-29 14:10     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 14:10 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> Very minor optimization to hint gcc.
>
> Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

This looks like something that could be submitted separately,
reducing the size of your autonuma patch series a little...

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 02/40] autonuma: make set_pmd_at always available
  2012-06-28 12:55   ` Andrea Arcangeli
@ 2012-06-29 14:10     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 14:10 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> set_pmd_at() will also be used for the knuma_scand/pmd = 1 (default)
> mode even when TRANSPARENT_HUGEPAGE=n. Make it available so the build
> won't fail.
>
> Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 03/40] autonuma: export is_vma_temporary_stack() even if CONFIG_TRANSPARENT_HUGEPAGE=n
  2012-06-28 12:55   ` Andrea Arcangeli
@ 2012-06-29 14:11     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 14:11 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> is_vma_temporary_stack() is needed by mm/autonuma.c too, and without
> this the build breaks with CONFIG_TRANSPARENT_HUGEPAGE=n.
>
> Reported-by: Petr Holasek<pholasek@redhat.com>
> Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-28 14:46     ` Peter Zijlstra
@ 2012-06-29 14:11       ` Nai Xia
  -1 siblings, 0 replies; 327+ messages in thread
From: Nai Xia @ 2012-06-29 14:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, Hillf Danton,
	Dan Smith, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Rik van Riel, Johannes Weiner, Srivatsa Vaddagiri,
	Christoph Lameter, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt



On 2012年06月28日 22:46, Peter Zijlstra wrote:
> On Thu, 2012-06-28 at 14:55 +0200, Andrea Arcangeli wrote:
>> +/*
>> + * This function sched_autonuma_balance() is responsible for deciding
>> + * which is the best CPU each process should be running on according
>> + * to the NUMA statistics collected in mm->mm_autonuma and
>> + * tsk->task_autonuma.
>> + *
>> + * The core math that evaluates the current CPU against the CPUs of
>> + * all _other_ nodes is this:
>> + *
>> + *     if (w_nid > w_other && w_nid > w_cpu_nid)
>> + *             weight = w_nid - w_other + w_nid - w_cpu_nid;
>> + *
>> + * w_nid: NUMA affinity of the current thread/process if run on the
>> + * other CPU.
>> + *
>> + * w_other: NUMA affinity of the other thread/process if run on the
>> + * other CPU.
>> + *
>> + * w_cpu_nid: NUMA affinity of the current thread/process if run on
>> + * the current CPU.
>> + *
>> + * weight: combined NUMA affinity benefit in moving the current
>> + * thread/process to the other CPU taking into account both the
>> higher
>> + * NUMA affinity of the current process if run on the other CPU, and
>> + * the increase in NUMA affinity in the other CPU by replacing the
>> + * other process.
>
> A lot of words, all meaningless without a proper definition of w_*
> stuff. How are they calculated and why.
>
>> + * We run the above math on every CPU not part of the current NUMA
>> + * node, and we compare the current process against the other
>> + * processes running in the other CPUs in the remote NUMA nodes. The
>> + * objective is to select the cpu (in selected_cpu) with a bigger
>> + * "weight". The bigger the "weight" the biggest gain we'll get by
>> + * moving the current process to the selected_cpu (not only the
>> + * biggest immediate CPU gain but also the fewer async memory
>> + * migrations that will be required to reach full convergence
>> + * later). If we select a cpu we migrate the current process to it.
>
> So you do something like:
>
> 	max_(i, node(i) != curr_node) { weight_i }
>
> That is, you have this weight, then what do you do?
>
>> + * Checking that the current process has higher NUMA affinity than
>> the
>> + * other process on the other CPU (w_nid > w_other) and not only that
>> + * the current process has higher NUMA affinity on the other CPU than
>> + * on the current CPU (w_nid > w_cpu_nid) completely avoids ping
>> pongs
>> + * and ensures (temporary) convergence of the algorithm (at least
>> from
>> + * a CPU standpoint).
>
> How does that follow?
>
>> + * It's then up to the idle balancing code that will run as soon as
>> + * the current CPU goes idle to pick the other process and move it
>> + * here (or in some other idle CPU if any).
>> + *
>> + * By only evaluating running processes against running processes we
>> + * avoid interfering with the CFS stock active idle balancing, which
>> + * is critical to optimal performance with HT enabled. (getting HT
>> + * wrong is worse than running on remote memory so the active idle
>> + * balancing has priority)
>
> what?
>
>> + * Idle balancing and all other CFS load balancing become NUMA
>> + * affinity aware through the introduction of
>> + * sched_autonuma_can_migrate_task(). CFS searches CPUs in the task's
>> + * autonuma_node first when it needs to find idle CPUs during idle
>> + * balancing or tasks to pick during load balancing.
>
> You talk a lot about idle balance, but there's zero mention of fairness.
> This is worrysome.
>
>> + * The task's autonuma_node is the node selected by
>> + * sched_autonuma_balance() when it migrates a task to the
>> + * selected_cpu in the selected_nid
>
> I think I already said that strict was out of the question and hard
> movement like that simply didn't make sense.
>
>> + * Once a process/thread has been moved to another node, closer to
>> the
>> + * much of memory it has recently accessed,
>
> closer to the recently accessed memory you mean?
>
>>   any memory for that task
>> + * not in the new node moves slowly (asynchronously in the
>> background)
>> + * to the new node. This is done by the knuma_migratedN (where the
>> + * suffix N is the node id) daemon described in mm/autonuma.c.
>> + *
>> + * One non trivial bit of this logic that deserves an explanation is
>> + * how the three crucial variables of the core math
>> + * (w_nid/w_other/wcpu_nid) are going to change depending on whether
>> + * the other CPU is running a thread of the current process, or a
>> + * thread of a different process.
>
> No no no,.. its not a friggin detail, its absolutely crucial. Also, if
> you'd given proper definition you wouldn't need to hand wave your way
> around the dynamics either because that would simply follow from the
> definition.
>
> <snip terrible example>
>
>> + * Before scanning all other CPUs' runqueues to compute the above
>> + * math,
>
> OK, lets stop calling the one isolated conditional you mentioned 'math'.
> On its own its useless.
>
>>   we also verify that the current CPU is not already in the
>> + * preferred NUMA node from the point of view of both the process
>> + * statistics and the thread statistics. In such case we can return
>> to
>> + * the caller without having to check any other CPUs' runqueues
>> + * because full convergence has been already reached.
>
> Things being in the 'preferred' place don't have much to do with
> convergence. Does your model have local minima/maxima where it can get
> stuck, or does it always find a global min/max?
>
>
>> + * This algorithm might be expanded to take all runnable processes
>> + * into account but examining just the currently running processes is
>> + * a good enough approximation because some runnable processes may
>> run
>> + * only for a short time so statistically there will always be a bias
>> + * on the processes that uses most the of the CPU. This is ideal
>> + * because it doesn't matter if NUMA balancing isn't optimal for
>> + * processes that run only for a short time.
>
> Almost, but not quite.. it would be so if the sampling could be proven
> to be unbiased. But its quite possible for a task to consume most cpu
> time and never show up as the current task in your load-balance run.

Same here, I have another similar question regarding sampling:
if one process does a very intensive visit of a small set of pages in
this node, but only an occasional visit of a large set of pages in
another node, will this algorithm make a very bad judgment? I guess
the answer would be: it's possible, and the judgment depends on the
racing pattern between the process and your knuma_scand.

Usually, if we are using sampling, we work on the assumption that if
the sampling is not accurate we only lose the chance of a better
optimization, but do NOT make a bad/false judgment.

Andrea, sorry, I don't have enough time to look into all your patches'
details (and also since I'm not on the CCs ;-) ), but my intuition
tells me that your current sampling and weight algorithm is far from
optimal.

>
>
>
> As it stands you wrote a lot of words.. but none of them were really
> helpful in understanding what you do.
>

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
@ 2012-06-29 14:11       ` Nai Xia
  0 siblings, 0 replies; 327+ messages in thread
From: Nai Xia @ 2012-06-29 14:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, Hillf Danton,
	Dan Smith, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Rik van Riel, Johannes Weiner, Srivatsa Vaddagiri,
	Christoph Lameter, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt



On 06/28/2012 22:46, Peter Zijlstra wrote:
> On Thu, 2012-06-28 at 14:55 +0200, Andrea Arcangeli wrote:
>> +/*
>> + * This function sched_autonuma_balance() is responsible for deciding
>> + * which is the best CPU each process should be running on according
>> + * to the NUMA statistics collected in mm->mm_autonuma and
>> + * tsk->task_autonuma.
>> + *
>> + * The core math that evaluates the current CPU against the CPUs of
>> + * all _other_ nodes is this:
>> + *
>> + *     if (w_nid > w_other && w_nid > w_cpu_nid)
>> + *             weight = w_nid - w_other + w_nid - w_cpu_nid;
>> + *
>> + * w_nid: NUMA affinity of the current thread/process if run on the
>> + * other CPU.
>> + *
>> + * w_other: NUMA affinity of the other thread/process if run on the
>> + * other CPU.
>> + *
>> + * w_cpu_nid: NUMA affinity of the current thread/process if run on
>> + * the current CPU.
>> + *
>> + * weight: combined NUMA affinity benefit in moving the current
>> + * thread/process to the other CPU taking into account both the
>> higher
>> + * NUMA affinity of the current process if run on the other CPU, and
>> + * the increase in NUMA affinity in the other CPU by replacing the
>> + * other process.
>
> A lot of words, all meaningless without a proper definition of w_*
> stuff. How are they calculated and why.
>
>> + * We run the above math on every CPU not part of the current NUMA
>> + * node, and we compare the current process against the other
>> + * processes running in the other CPUs in the remote NUMA nodes. The
>> + * objective is to select the cpu (in selected_cpu) with a bigger
>> + * "weight". The bigger the "weight" the biggest gain we'll get by
>> + * moving the current process to the selected_cpu (not only the
>> + * biggest immediate CPU gain but also the fewer async memory
>> + * migrations that will be required to reach full convergence
>> + * later). If we select a cpu we migrate the current process to it.
>
> So you do something like:
>
> 	max_(i, node(i) != curr_node) { weight_i }
>
> That is, you have this weight, then what do you do?
>
>> + * Checking that the current process has higher NUMA affinity than
>> the
>> + * other process on the other CPU (w_nid > w_other) and not only that
>> + * the current process has higher NUMA affinity on the other CPU than
>> + * on the current CPU (w_nid > w_cpu_nid) completely avoids ping
>> pongs
>> + * and ensures (temporary) convergence of the algorithm (at least
>> from
>> + * a CPU standpoint).
>
> How does that follow?
>
>> + * It's then up to the idle balancing code that will run as soon as
>> + * the current CPU goes idle to pick the other process and move it
>> + * here (or in some other idle CPU if any).
>> + *
>> + * By only evaluating running processes against running processes we
>> + * avoid interfering with the CFS stock active idle balancing, which
>> + * is critical to optimal performance with HT enabled. (getting HT
>> + * wrong is worse than running on remote memory so the active idle
>> + * balancing has priority)
>
> what?
>
>> + * Idle balancing and all other CFS load balancing become NUMA
>> + * affinity aware through the introduction of
>> + * sched_autonuma_can_migrate_task(). CFS searches CPUs in the task's
>> + * autonuma_node first when it needs to find idle CPUs during idle
>> + * balancing or tasks to pick during load balancing.
>
> You talk a lot about idle balance, but there's zero mention of fairness.
> This is worrisome.
>
>> + * The task's autonuma_node is the node selected by
>> + * sched_autonuma_balance() when it migrates a task to the
>> + * selected_cpu in the selected_nid
>
> I think I already said that strict was out of the question and hard
> movement like that simply didn't make sense.
>
>> + * Once a process/thread has been moved to another node, closer to
>> the
>> + * much of memory it has recently accessed,
>
> closer to the recently accessed memory you mean?
>
>>   any memory for that task
>> + * not in the new node moves slowly (asynchronously in the
>> background)
>> + * to the new node. This is done by the knuma_migratedN (where the
>> + * suffix N is the node id) daemon described in mm/autonuma.c.
>> + *
>> + * One non trivial bit of this logic that deserves an explanation is
>> + * how the three crucial variables of the core math
>> + * (w_nid/w_other/wcpu_nid) are going to change depending on whether
>> + * the other CPU is running a thread of the current process, or a
>> + * thread of a different process.
>
> No no no,.. its not a friggin detail, its absolutely crucial. Also, if
> you'd given proper definition you wouldn't need to hand wave your way
> around the dynamics either because that would simply follow from the
> definition.
>
> <snip terrible example>
>
>> + * Before scanning all other CPUs' runqueues to compute the above
>> + * math,
>
> OK, lets stop calling the one isolated conditional you mentioned 'math'.
> On its own its useless.
>
>>   we also verify that the current CPU is not already in the
>> + * preferred NUMA node from the point of view of both the process
>> + * statistics and the thread statistics. In such case we can return
>> to
>> + * the caller without having to check any other CPUs' runqueues
>> + * because full convergence has been already reached.
>
> Things being in the 'preferred' place don't have much to do with
> convergence. Does your model have local minima/maxima where it can get
> stuck, or does it always find a global min/max?
>
>
>> + * This algorithm might be expanded to take all runnable processes
>> + * into account but examining just the currently running processes is
>> + * a good enough approximation because some runnable processes may
>> run
>> + * only for a short time so statistically there will always be a bias
>> + * on the processes that uses most the of the CPU. This is ideal
>> + * because it doesn't matter if NUMA balancing isn't optimal for
>> + * processes that run only for a short time.
>
> Almost, but not quite.. it would be so if the sampling could be proven
> to be unbiased. But its quite possible for a task to consume most cpu
> time and never show up as the current task in your load-balance run.

Same here, I have another similar question regarding sampling:
if one process makes very intensive visits to a small set of pages in
this node, but only occasional visits to a large set of pages in
another node, will this algorithm make a very bad judgment? I guess
the answer would be: it's possible, and the judgment depends on the
racing pattern between the process and your knuma_scand.

Usually, when we use sampling, we work on the assumption that if the
sampling is not accurate, we only lose a chance at a better
optimization, but do NOT make a bad/false judgment.

Andrea, sorry, I don't have enough time to look into all of your
patches' details (and also since I'm not on the CCs ;-) ), but my
intuition tells me that your current sampling and weight algorithm is
far from optimal.

>
>
>
> As it stands you wrote a lot of words.. but none of them were really
> helpful in understanding what you do.
>


^ permalink raw reply	[flat|nested] 327+ messages in thread
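
To make the conditional quoted above concrete, here is a minimal sketch
of the selection loop sched_autonuma_balance() is described as running.
The helper task_affinity_on_node() and the idea that the w_* weights are
ratios derived from the NUMA hinting fault statistics are assumptions
for illustration only, not the patch's actual code:

    /*
     * Sketch: pick the remote CPU with the largest combined gain.
     *   w_nid     - affinity of the current task on the candidate node
     *   w_other   - affinity of the task now running on that CPU
     *   w_cpu_nid - affinity of the current task on its current node
     */
    int cpu, selected_cpu = -1;
    int this_nid = numa_node_id();
    long best_weight = 0;

    for_each_online_cpu(cpu) {
        int nid = cpu_to_node(cpu);
        long w_nid, w_other, w_cpu_nid, weight;

        if (nid == this_nid)
            continue;

        w_nid     = task_affinity_on_node(current, nid);      /* assumed helper */
        w_other   = task_affinity_on_node(cpu_curr(cpu), nid); /* assumed helper */
        w_cpu_nid = task_affinity_on_node(current, this_nid);  /* assumed helper */

        /* the quoted anti-ping-pong condition */
        if (w_nid > w_other && w_nid > w_cpu_nid) {
            weight = w_nid - w_other + w_nid - w_cpu_nid;
            if (weight > best_weight) {
                best_weight = weight;
                selected_cpu = cpu;
            }
        }
    }
    /* if selected_cpu >= 0, migrate the current task there; idle
       balancing is then expected to pull the displaced task away */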

* Re: [PATCH 04/40] xen: document Xen is using an unused bit for the pagetables
  2012-06-28 12:55   ` Andrea Arcangeli
@ 2012-06-29 14:16     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 14:16 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> Xen has taken over the last reserved bit available for the pagetables
> which is set through ioremap, this documents it and makes the code
> more readable.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>   arch/x86/include/asm/pgtable_types.h |   11 +++++++++--
>   1 files changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index 013286a..b74cac9 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -17,7 +17,7 @@
>   #define _PAGE_BIT_PAT		7	/* on 4KB pages */
>   #define _PAGE_BIT_GLOBAL	8	/* Global TLB entry PPro+ */
>   #define _PAGE_BIT_UNUSED1	9	/* available for programmer */
> -#define _PAGE_BIT_IOMAP		10	/* flag used to indicate IO mapping */
> +#define _PAGE_BIT_UNUSED2	10

Considering that Xen is using it, it is not really
unused, is it?

Not that I can think of a better name, considering
you are using this bit for something else at the PMD
level...

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 04/40] xen: document Xen is using an unused bit for the pagetables
@ 2012-06-29 14:16     ` Rik van Riel
  0 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 14:16 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> Xen has taken over the last reserved bit available for the pagetables
> which is set through ioremap, this documents it and makes the code
> more readable.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>   arch/x86/include/asm/pgtable_types.h |   11 +++++++++--
>   1 files changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index 013286a..b74cac9 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -17,7 +17,7 @@
>   #define _PAGE_BIT_PAT		7	/* on 4KB pages */
>   #define _PAGE_BIT_GLOBAL	8	/* Global TLB entry PPro+ */
>   #define _PAGE_BIT_UNUSED1	9	/* available for programmer */
> -#define _PAGE_BIT_IOMAP		10	/* flag used to indicate IO mapping */
> +#define _PAGE_BIT_UNUSED2	10

Considering that Xen is using it, it is not really
unused, is it?

Not that I can think of a better name, considering
you are using this bit for something else at the PMD
level...

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 05/40] autonuma: define _PAGE_NUMA_PTE and _PAGE_NUMA_PMD
  2012-06-28 12:55   ` Andrea Arcangeli
@ 2012-06-29 14:26     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 14:26 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:

> +/*
> + * Cannot be set on pte. The fact it's in between _PAGE_FILE and
> + * _PAGE_PROTNONE avoids having to alter the swp entries.
> + */
> +#define _PAGE_NUMA_PTE	_PAGE_PSE
> +/*
> + * Cannot be set on pmd, if transparent hugepages will be swapped out
> + * the swap entry offset must start above it.
> + */
> +#define _PAGE_NUMA_PMD	_PAGE_UNUSED2

Those comments only tell us what the flags can NOT be
used for, not what they are actually used for.

That needs to be fixed.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread
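
For reference, one possible wording of the documentation being asked for,
based on how the series is described elsewhere in this thread (knuma_scand
arms the bits, the next access takes a NUMA hinting page fault); this is a
reader's sketch, not the patch's actual comment:

    /*
     * _PAGE_NUMA_PTE marks a pte as a NUMA hinting fault trigger:
     * knuma_scand sets this bit and clears _PAGE_PRESENT, so the next
     * access faults into handle_mm_fault(), where the access is counted
     * per node and the page may be queued for migration by
     * knuma_migratedN.  It reuses _PAGE_PSE, which can never be set on a
     * pte, and it sits between _PAGE_FILE and _PAGE_PROTNONE so the swap
     * entry encoding does not have to change.
     */
    #define _PAGE_NUMA_PTE	_PAGE_PSE
    /*
     * _PAGE_NUMA_PMD plays the same role for pmds (huge pmds, and the
     * pmd-granularity scan mode of knuma_scand).  It uses _PAGE_UNUSED2,
     * which can never be set on a pmd; if transparent hugepages are ever
     * swapped out, the swap entry offset must start above this bit.
     */
    #define _PAGE_NUMA_PMD	_PAGE_UNUSED2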

* Re: [PATCH 05/40] autonuma: define _PAGE_NUMA_PTE and _PAGE_NUMA_PMD
@ 2012-06-29 14:26     ` Rik van Riel
  0 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 14:26 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:

> +/*
> + * Cannot be set on pte. The fact it's in between _PAGE_FILE and
> + * _PAGE_PROTNONE avoids having to alter the swp entries.
> + */
> +#define _PAGE_NUMA_PTE	_PAGE_PSE
> +/*
> + * Cannot be set on pmd, if transparent hugepages will be swapped out
> + * the swap entry offset must start above it.
> + */
> +#define _PAGE_NUMA_PMD	_PAGE_UNUSED2

Those comments only tell us what the flags can NOT be
used for, not what they are actually used for.

That needs to be fixed.

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 06/40] autonuma: x86 pte_numa() and pmd_numa()
  2012-06-28 12:55   ` Andrea Arcangeli
@ 2012-06-29 15:02     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 15:02 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:

>   static inline int pte_file(pte_t pte)
>   {
> -	return pte_flags(pte) & _PAGE_FILE;
> +	return (pte_flags(pte) & _PAGE_FILE) == _PAGE_FILE;
>   }

Wait, why is this change made?  Surely _PAGE_FILE is just
one single bit and this change is not useful?

If there is a reason for this change, please document it.

> @@ -405,7 +405,9 @@ static inline int pte_same(pte_t a, pte_t b)
>
>   static inline int pte_present(pte_t a)
>   {
> -	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
> +	/* _PAGE_NUMA includes _PAGE_PROTNONE */
> +	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
> +			       _PAGE_NUMA_PTE);
>   }
>
>   static inline int pte_hidden(pte_t pte)
> @@ -415,7 +417,46 @@ static inline int pte_hidden(pte_t pte)
>
>   static inline int pmd_present(pmd_t pmd)
>   {
> -	return pmd_flags(pmd) & _PAGE_PRESENT;
> +	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE |
> +				 _PAGE_NUMA_PMD);
> +}

Somewhat subtle. Better documentation in patch 5 will
help explain this.

> +#ifdef CONFIG_AUTONUMA
> +static inline int pte_numa(pte_t pte)
> +{
> +	return (pte_flags(pte) &
> +		(_PAGE_NUMA_PTE|_PAGE_PRESENT)) == _PAGE_NUMA_PTE;
> +}
> +
> +static inline int pmd_numa(pmd_t pmd)
> +{
> +	return (pmd_flags(pmd) &
> +		(_PAGE_NUMA_PMD|_PAGE_PRESENT)) == _PAGE_NUMA_PMD;
> +}
> +#endif

These could use a little explanation of how _PAGE_NUMA_* is
used and what the flags mean.

> +static inline pte_t pte_mknotnuma(pte_t pte)
> +{
> +	pte = pte_clear_flags(pte, _PAGE_NUMA_PTE);
> +	return pte_set_flags(pte, _PAGE_PRESENT|_PAGE_ACCESSED);
> +}
> +
> +static inline pmd_t pmd_mknotnuma(pmd_t pmd)
> +{
> +	pmd = pmd_clear_flags(pmd, _PAGE_NUMA_PMD);
> +	return pmd_set_flags(pmd, _PAGE_PRESENT|_PAGE_ACCESSED);
> +}
> +
> +static inline pte_t pte_mknuma(pte_t pte)
> +{
> +	pte = pte_set_flags(pte, _PAGE_NUMA_PTE);
> +	return pte_clear_flags(pte, _PAGE_PRESENT);
> +}
> +
> +static inline pmd_t pmd_mknuma(pmd_t pmd)
> +{
> +	pmd = pmd_set_flags(pmd, _PAGE_NUMA_PMD);
> +	return pmd_clear_flags(pmd, _PAGE_PRESENT);
>   }

These functions could use some explanation, too.

Why do the top ones set _PAGE_ACCESSED, while the bottom ones
leave _PAGE_ACCESSED alone?

I can guess the answer, but it should be documented so it is
also clear to people with less experience in the VM.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 06/40] autonuma: x86 pte_numa() and pmd_numa()
@ 2012-06-29 15:02     ` Rik van Riel
  0 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 15:02 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:

>   static inline int pte_file(pte_t pte)
>   {
> -	return pte_flags(pte) & _PAGE_FILE;
> +	return (pte_flags(pte) & _PAGE_FILE) == _PAGE_FILE;
>   }

Wait, why is this change made?  Surely _PAGE_FILE is just
one single bit and this change is not useful?

If there is a reason for this change, please document it.

> @@ -405,7 +405,9 @@ static inline int pte_same(pte_t a, pte_t b)
>
>   static inline int pte_present(pte_t a)
>   {
> -	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
> +	/* _PAGE_NUMA includes _PAGE_PROTNONE */
> +	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
> +			       _PAGE_NUMA_PTE);
>   }
>
>   static inline int pte_hidden(pte_t pte)
> @@ -415,7 +417,46 @@ static inline int pte_hidden(pte_t pte)
>
>   static inline int pmd_present(pmd_t pmd)
>   {
> -	return pmd_flags(pmd) & _PAGE_PRESENT;
> +	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE |
> +				 _PAGE_NUMA_PMD);
> +}

Somewhat subtle. Better documentation in patch 5 will
help explain this.

> +#ifdef CONFIG_AUTONUMA
> +static inline int pte_numa(pte_t pte)
> +{
> +	return (pte_flags(pte) &
> +		(_PAGE_NUMA_PTE|_PAGE_PRESENT)) == _PAGE_NUMA_PTE;
> +}
> +
> +static inline int pmd_numa(pmd_t pmd)
> +{
> +	return (pmd_flags(pmd) &
> +		(_PAGE_NUMA_PMD|_PAGE_PRESENT)) == _PAGE_NUMA_PMD;
> +}
> +#endif

These could use a little explanation of how _PAGE_NUMA_* is
used and what the flags mean.

> +static inline pte_t pte_mknotnuma(pte_t pte)
> +{
> +	pte = pte_clear_flags(pte, _PAGE_NUMA_PTE);
> +	return pte_set_flags(pte, _PAGE_PRESENT|_PAGE_ACCESSED);
> +}
> +
> +static inline pmd_t pmd_mknotnuma(pmd_t pmd)
> +{
> +	pmd = pmd_clear_flags(pmd, _PAGE_NUMA_PMD);
> +	return pmd_set_flags(pmd, _PAGE_PRESENT|_PAGE_ACCESSED);
> +}
> +
> +static inline pte_t pte_mknuma(pte_t pte)
> +{
> +	pte = pte_set_flags(pte, _PAGE_NUMA_PTE);
> +	return pte_clear_flags(pte, _PAGE_PRESENT);
> +}
> +
> +static inline pmd_t pmd_mknuma(pmd_t pmd)
> +{
> +	pmd = pmd_set_flags(pmd, _PAGE_NUMA_PMD);
> +	return pmd_clear_flags(pmd, _PAGE_PRESENT);
>   }

These functions could use some explanation, too.

Why do the top ones set _PAGE_ACCESSED, while the bottom ones
leave _PAGE_ACCESSED alone?

I can guess the answer, but it should be documented so it is
also clear to people with less experience in the VM.

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 327+ messages in thread
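
On the _PAGE_ACCESSED question above, a plausible reading (inferred from
how the series is described in this thread, not taken from the patch) is
that pte_mknotnuma() runs in the NUMA hinting fault path while
pte_mknuma() runs from the background knuma_scand pass; written as the
kind of comment being asked for:

    static inline pte_t pte_mknotnuma(pte_t pte)
    {
        pte = pte_clear_flags(pte, _PAGE_NUMA_PTE);
        /*
         * Called when a NUMA hinting fault is resolved: the page has just
         * been touched, so make the pte present again and set
         * _PAGE_ACCESSED up front, sparing the hardware an immediate
         * accessed-bit update when the access is retried.
         */
        return pte_set_flags(pte, _PAGE_PRESENT|_PAGE_ACCESSED);
    }

    static inline pte_t pte_mknuma(pte_t pte)
    {
        pte = pte_set_flags(pte, _PAGE_NUMA_PTE);
        /*
         * Called from the background scan: only arm the hinting fault by
         * clearing _PAGE_PRESENT, and leave _PAGE_ACCESSED alone so the
         * scan does not make idle pages look recently referenced.
         */
        return pte_clear_flags(pte, _PAGE_PRESENT);
    }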

* Re: [PATCH 07/40] autonuma: generic pte_numa() and pmd_numa()
  2012-06-28 12:55   ` Andrea Arcangeli
@ 2012-06-29 15:13     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 15:13 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> Implement generic version of the methods. They're used when
> CONFIG_AUTONUMA=n, and they're a noop.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread
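
When CONFIG_AUTONUMA=n the generic versions being acked here presumably
reduce to no-ops along these lines (a sketch of what "they're a noop"
means, not the literal hunk):

    #ifndef CONFIG_AUTONUMA
    static inline int pte_numa(pte_t pte)
    {
        return 0;	/* no NUMA hinting ptes without AutoNUMA */
    }

    static inline int pmd_numa(pmd_t pmd)
    {
        return 0;	/* likewise for pmds */
    }
    #endif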

* Re: [PATCH 07/40] autonuma: generic pte_numa() and pmd_numa()
@ 2012-06-29 15:13     ` Rik van Riel
  0 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 15:13 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> Implement generic version of the methods. They're used when
> CONFIG_AUTONUMA=n, and they're a noop.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 08/40] autonuma: teach gup_fast about pte_numa
  2012-06-28 12:55   ` Andrea Arcangeli
@ 2012-06-29 15:27     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 15:27 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> gup_fast will skip over non present ptes (pte_numa requires the pte to
> be non present). So no explicit check is needed for pte_numa in the
> pte case.
>
> gup_fast will also automatically skip over THP when the trans huge pmd
> is non present (pmd_numa requires the pmd to be non present).
>
> But for the special pmd mode scan of knuma_scand
> (/sys/kernel/mm/autonuma/knuma_scand/pmd == 1), the pmd may be of numa
> type (so non present too), the pte may be present. gup_pte_range
> wouldn't notice the pmd is of numa type. So to avoid losing a NUMA
> hinting page fault with gup_fast we need an explicit check for
> pmd_numa() here to be sure it will fault through gup ->
> handle_mm_fault.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Assuming pmd_numa will get the documentation I asked for a few
patches back, this patch is fine, since people will just be able
to look at a nice comment above pmd_numa and see what is going on.

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 08/40] autonuma: teach gup_fast about pte_numa
@ 2012-06-29 15:27     ` Rik van Riel
  0 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 15:27 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> gup_fast will skip over non present ptes (pte_numa requires the pte to
> be non present). So no explicit check is needed for pte_numa in the
> pte case.
>
> gup_fast will also automatically skip over THP when the trans huge pmd
> is non present (pmd_numa requires the pmd to be non present).
>
> But for the special pmd mode scan of knuma_scand
> (/sys/kernel/mm/autonuma/knuma_scand/pmd == 1), the pmd may be of numa
> type (so non present too), the pte may be present. gup_pte_range
> wouldn't notice the pmd is of numa type. So to avoid losing a NUMA
> hinting page fault with gup_fast we need an explicit check for
> pmd_numa() here to be sure it will fault through gup ->
> handle_mm_fault.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Assuming pmd_numa will get the documentation I asked for a few
patches back, this patch is fine, since people will just be able
to look at a nice comment above pmd_numa and see what is going on.

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 327+ messages in thread
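
The explicit check argued for in the changelog above amounts to bailing
out of the lockless fast path whenever the pmd is of NUMA type, so the
access falls back to the slow gup path and handle_mm_fault() can take the
NUMA hinting fault.  A simplified sketch of the test inside the
gup_pmd_range() loop (the exact placement is an assumption):

    pmd_t pmd = *pmdp;

    /*
     * pmd_numa() entries are non-present, but in the pmd-granularity scan
     * mode the ptes underneath may still be present; without this check
     * gup_pte_range() would walk them and the NUMA hinting fault would be
     * lost.
     */
    if (pmd_none(pmd) || pmd_trans_splitting(pmd) || pmd_numa(pmd))
        return 0;	/* fall back to the slow gup path */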

* Re: [PATCH 09/40] autonuma: introduce kthread_bind_node()
  2012-06-28 12:55   ` Andrea Arcangeli
@ 2012-06-29 15:36     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 15:36 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:

> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1792,7 +1792,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
>   #define PF_SWAPWRITE	0x00800000	/* Allowed to write to swap */
>   #define PF_SPREAD_PAGE	0x01000000	/* Spread page cache over cpuset */
>   #define PF_SPREAD_SLAB	0x02000000	/* Spread some slab caches over cpuset */
> -#define PF_THREAD_BOUND	0x04000000	/* Thread bound to specific cpu */
> +#define PF_THREAD_BOUND	0x04000000	/* Thread bound to specific cpus */
>   #define PF_MCE_EARLY    0x08000000      /* Early kill for mce process policy */
>   #define PF_MEMPOLICY	0x10000000	/* Non-default NUMA mempolicy */
>   #define PF_MUTEX_TESTER	0x20000000	/* Thread belongs to the rt mutex tester */

Changing the semantics of PF_THREAD_BOUND without so much as
a comment in your changelog or buy-in from the scheduler
maintainers is a big no-no.

Is there any reason you even need PF_THREAD_BOUND in your
kernel numa threads?

I do not see much at all in the scheduler code that uses
PF_THREAD_BOUND and it is not clear at all that your
numa threads get any benefit from them...

Why do you think you need it?

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 09/40] autonuma: introduce kthread_bind_node()
@ 2012-06-29 15:36     ` Rik van Riel
  0 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 15:36 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:

> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1792,7 +1792,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
>   #define PF_SWAPWRITE	0x00800000	/* Allowed to write to swap */
>   #define PF_SPREAD_PAGE	0x01000000	/* Spread page cache over cpuset */
>   #define PF_SPREAD_SLAB	0x02000000	/* Spread some slab caches over cpuset */
> -#define PF_THREAD_BOUND	0x04000000	/* Thread bound to specific cpu */
> +#define PF_THREAD_BOUND	0x04000000	/* Thread bound to specific cpus */
>   #define PF_MCE_EARLY    0x08000000      /* Early kill for mce process policy */
>   #define PF_MEMPOLICY	0x10000000	/* Non-default NUMA mempolicy */
>   #define PF_MUTEX_TESTER	0x20000000	/* Thread belongs to the rt mutex tester */

Changing the semantics of PF_THREAD_BOUND without so much as
a comment in your changelog or buy-in from the scheduler
maintainers is a big no-no.

Is there any reason you even need PF_THREAD_BOUND in your
kernel numa threads?

I do not see much at all in the scheduler code that uses
PF_THREAD_BOUND and it is not clear at all that your
numa threads get any benefit from them...

Why do you think you need it?

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 10/40] autonuma: mm_autonuma and sched_autonuma data structures
  2012-06-28 12:55   ` Andrea Arcangeli
@ 2012-06-29 15:47     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 15:47 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:

You tell us when the data structures are not allocated, but
you do not tell us how the data structures are used, or what
the fields inside the data structures mean.

This makes it very hard for other people to figure out the
code later. Please document these kinds of things properly.

> +/*
> + * Per-mm (process) structure dynamically allocated only if autonuma
> + * is not impossible. This links the mm to scan into the
> + * knuma_scand.mm_head and it contains the NUMA memory placement
> + * statistics for the process (generated by knuma_scand).
> + */
> +struct mm_autonuma {
> +	/* list node to link the "mm" into the knuma_scand.mm_head */
> +	struct list_head mm_node;
> +	struct mm_struct *mm;
> +	unsigned long mm_numa_fault_pass; /* zeroed from here during allocation */
> +	unsigned long mm_numa_fault_tot;
> +	unsigned long mm_numa_fault[0];
> +};

> +/*
> + * Per-task (thread) structure dynamically allocated only if autonuma
> + * is not impossible. This contains the preferred autonuma_node where
> + * the userland thread should be scheduled into (only relevant if
> + * tsk->mm is not null) and the per-thread NUMA accesses statistics
> + * (generated by the NUMA hinting page faults).
> + */
> +struct task_autonuma {
> +	int autonuma_node;
> +	/* zeroed from the below field during allocation */
> +	unsigned long task_numa_fault_pass;
> +	unsigned long task_numa_fault_tot;
> +	unsigned long task_numa_fault[0];
> +};



-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 10/40] autonuma: mm_autonuma and sched_autonuma data structures
@ 2012-06-29 15:47     ` Rik van Riel
  0 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 15:47 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:

You tell us when the data structures are not allocated, but
you do not tell us how the data structures are used, or what
the fields inside the data structures mean.

This makes it very hard for other people to figure out the
code later. Please document these kinds of things properly.

> +/*
> + * Per-mm (process) structure dynamically allocated only if autonuma
> + * is not impossible. This links the mm to scan into the
> + * knuma_scand.mm_head and it contains the NUMA memory placement
> + * statistics for the process (generated by knuma_scand).
> + */
> +struct mm_autonuma {
> +	/* list node to link the "mm" into the knuma_scand.mm_head */
> +	struct list_head mm_node;
> +	struct mm_struct *mm;
> +	unsigned long mm_numa_fault_pass; /* zeroed from here during allocation */
> +	unsigned long mm_numa_fault_tot;
> +	unsigned long mm_numa_fault[0];
> +};

> +/*
> + * Per-task (thread) structure dynamically allocated only if autonuma
> + * is not impossible. This contains the preferred autonuma_node where
> + * the userland thread should be scheduled into (only relevant if
> + * tsk->mm is not null) and the per-thread NUMA accesses statistics
> + * (generated by the NUMA hinting page faults).
> + */
> +struct task_autonuma {
> +	int autonuma_node;
> +	/* zeroed from the below field during allocation */
> +	unsigned long task_numa_fault_pass;
> +	unsigned long task_numa_fault_tot;
> +	unsigned long task_numa_fault[0];
> +};



-- 
All rights reversed


^ permalink raw reply	[flat|nested] 327+ messages in thread
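
To illustrate what the per-node arrays above hold, here is a sketch of how
a NUMA hinting fault would presumably feed the per-thread statistics (the
function name and the task_struct field name are assumptions; the fields
themselves are from the quoted structure, and the mm_autonuma counters are
filled by knuma_scand instead, per the quoted comment):

    /* called from the NUMA hinting page fault, sketch only */
    static void task_account_numa_fault(struct task_struct *p, int nid,
                                        int numpages)
    {
        struct task_autonuma *ta = p->task_autonuma;	/* assumed field name */

        ta->task_numa_fault[nid] += numpages;	/* accesses seen on nid */
        ta->task_numa_fault_tot += numpages;	/* total, for normalization */
    }

The w_* affinities discussed under patch 13/40 would then presumably be
ratios like task_numa_fault[nid] / task_numa_fault_tot.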

* Re: [PATCH 09/40] autonuma: introduce kthread_bind_node()
  2012-06-29 15:36     ` Rik van Riel
@ 2012-06-29 16:04       ` Peter Zijlstra
  -1 siblings, 0 replies; 327+ messages in thread
From: Peter Zijlstra @ 2012-06-29 16:04 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, Hillf Danton,
	Dan Smith, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Fri, 2012-06-29 at 11:36 -0400, Rik van Riel wrote:
> On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> 
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1792,7 +1792,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
> >   #define PF_SWAPWRITE	0x00800000	/* Allowed to write to swap */
> >   #define PF_SPREAD_PAGE	0x01000000	/* Spread page cache over cpuset */
> >   #define PF_SPREAD_SLAB	0x02000000	/* Spread some slab caches over cpuset */
> > -#define PF_THREAD_BOUND	0x04000000	/* Thread bound to specific cpu */
> > +#define PF_THREAD_BOUND	0x04000000	/* Thread bound to specific cpus */
> >   #define PF_MCE_EARLY    0x08000000      /* Early kill for mce process policy */
> >   #define PF_MEMPOLICY	0x10000000	/* Non-default NUMA mempolicy */
> >   #define PF_MUTEX_TESTER	0x20000000	/* Thread belongs to the rt mutex tester */
> 
> Changing the semantics of PF_THREAD_BOUND without so much as
> a comment in your changelog or buy-in from the scheduler
> maintainers is a big no-no.

In fact I've already said a number of times this patch isn't going
anywhere.



^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 09/40] autonuma: introduce kthread_bind_node()
@ 2012-06-29 16:04       ` Peter Zijlstra
  0 siblings, 0 replies; 327+ messages in thread
From: Peter Zijlstra @ 2012-06-29 16:04 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, Hillf Danton,
	Dan Smith, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Fri, 2012-06-29 at 11:36 -0400, Rik van Riel wrote:
> On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> 
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1792,7 +1792,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
> >   #define PF_SWAPWRITE	0x00800000	/* Allowed to write to swap */
> >   #define PF_SPREAD_PAGE	0x01000000	/* Spread page cache over cpuset */
> >   #define PF_SPREAD_SLAB	0x02000000	/* Spread some slab caches over cpuset */
> > -#define PF_THREAD_BOUND	0x04000000	/* Thread bound to specific cpu */
> > +#define PF_THREAD_BOUND	0x04000000	/* Thread bound to specific cpus */
> >   #define PF_MCE_EARLY    0x08000000      /* Early kill for mce process policy */
> >   #define PF_MEMPOLICY	0x10000000	/* Non-default NUMA mempolicy */
> >   #define PF_MUTEX_TESTER	0x20000000	/* Thread belongs to the rt mutex tester */
> 
> Changing the semantics of PF_THREAD_BOUND without so much as
> a comment in your changelog or buy-in from the scheduler
> maintainers is a big no-no.

In fact I've already said a number of times this patch isn't going
anywhere.



^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 11/40] autonuma: define the autonuma flags
  2012-06-28 12:55   ` Andrea Arcangeli
@ 2012-06-29 16:10     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 16:10 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> These flags are the ones tweaked through sysfs, they control the
> behavior of autonuma, from enabling disabling it, to selecting various
> runtime options.

That's all fine and dandy, but what do these flags mean?

How do you expect people to be able to maintain this code,
or control autonuma behaviour, when these flags are not
documented at all?

Please document them.

> +enum autonuma_flag {
> +	AUTONUMA_FLAG,
> +	AUTONUMA_IMPOSSIBLE_FLAG,
> +	AUTONUMA_DEBUG_FLAG,
> +	AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG,
> +	AUTONUMA_SCHED_CLONE_RESET_FLAG,
> +	AUTONUMA_SCHED_FORK_RESET_FLAG,
> +	AUTONUMA_SCAN_PMD_FLAG,
> +	AUTONUMA_SCAN_USE_WORKING_SET_FLAG,
> +	AUTONUMA_MIGRATE_DEFER_FLAG,
> +};


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 11/40] autonuma: define the autonuma flags
@ 2012-06-29 16:10     ` Rik van Riel
  0 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 16:10 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> These flags are the ones tweaked through sysfs, they control the
> behavior of autonuma, from enabling disabling it, to selecting various
> runtime options.

That's all fine and dandy, but what do these flags mean?

How do you expect people to be able to maintain this code,
or control autonuma behaviour, when these flags are not
documented at all?

Please document them.

> +enum autonuma_flag {
> +	AUTONUMA_FLAG,
> +	AUTONUMA_IMPOSSIBLE_FLAG,
> +	AUTONUMA_DEBUG_FLAG,
> +	AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG,
> +	AUTONUMA_SCHED_CLONE_RESET_FLAG,
> +	AUTONUMA_SCHED_FORK_RESET_FLAG,
> +	AUTONUMA_SCAN_PMD_FLAG,
> +	AUTONUMA_SCAN_USE_WORKING_SET_FLAG,
> +	AUTONUMA_MIGRATE_DEFER_FLAG,
> +};


-- 
All rights reversed


^ permalink raw reply	[flat|nested] 327+ messages in thread
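
Whatever the final documentation says, the flags are presumably consumed
through simple bit tests on a global mask; a sketch of the pattern (the
variable name is an assumption, and of the meanings only the pmd-granular
scan mode is grounded in this thread, via patch 08/40):

    extern unsigned long autonuma_flags;	/* assumed: tweaked via sysfs */

    static inline bool autonuma_enabled(void)
    {
        return test_bit(AUTONUMA_FLAG, &autonuma_flags);
    }

    static inline bool autonuma_scan_pmd(void)
    {
        /* scan at pmd granularity, cf. knuma_scand/pmd in patch 08/40 */
        return test_bit(AUTONUMA_SCAN_PMD_FLAG, &autonuma_flags);
    }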

* Re: [PATCH 09/40] autonuma: introduce kthread_bind_node()
  2012-06-29 16:04       ` Peter Zijlstra
@ 2012-06-29 16:11         ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 16:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, Hillf Danton,
	Dan Smith, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/29/2012 12:04 PM, Peter Zijlstra wrote:
> On Fri, 2012-06-29 at 11:36 -0400, Rik van Riel wrote:
>> On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
>>
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -1792,7 +1792,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
>>>    #define PF_SWAPWRITE	0x00800000	/* Allowed to write to swap */
>>>    #define PF_SPREAD_PAGE	0x01000000	/* Spread page cache over cpuset */
>>>    #define PF_SPREAD_SLAB	0x02000000	/* Spread some slab caches over cpuset */
>>> -#define PF_THREAD_BOUND	0x04000000	/* Thread bound to specific cpu */
>>> +#define PF_THREAD_BOUND	0x04000000	/* Thread bound to specific cpus */
>>>    #define PF_MCE_EARLY    0x08000000      /* Early kill for mce process policy */
>>>    #define PF_MEMPOLICY	0x10000000	/* Non-default NUMA mempolicy */
>>>    #define PF_MUTEX_TESTER	0x20000000	/* Thread belongs to the rt mutex tester */
>>
>> Changing the semantics of PF_THREAD_BOUND without so much as
>> a comment in your changelog or buy-in from the scheduler
>> maintainers is a big no-no.
>
> In fact I've already said a number of times this patch isn't going
> anywhere.

Which is probably fine, because I see no reason why Andrea's
numa threads would need PF_THREAD_BOUND in the first place.

Andrea?

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 09/40] autonuma: introduce kthread_bind_node()
@ 2012-06-29 16:11         ` Rik van Riel
  0 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 16:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, Hillf Danton,
	Dan Smith, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/29/2012 12:04 PM, Peter Zijlstra wrote:
> On Fri, 2012-06-29 at 11:36 -0400, Rik van Riel wrote:
>> On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
>>
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -1792,7 +1792,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
>>>    #define PF_SWAPWRITE	0x00800000	/* Allowed to write to swap */
>>>    #define PF_SPREAD_PAGE	0x01000000	/* Spread page cache over cpuset */
>>>    #define PF_SPREAD_SLAB	0x02000000	/* Spread some slab caches over cpuset */
>>> -#define PF_THREAD_BOUND	0x04000000	/* Thread bound to specific cpu */
>>> +#define PF_THREAD_BOUND	0x04000000	/* Thread bound to specific cpus */
>>>    #define PF_MCE_EARLY    0x08000000      /* Early kill for mce process policy */
>>>    #define PF_MEMPOLICY	0x10000000	/* Non-default NUMA mempolicy */
>>>    #define PF_MUTEX_TESTER	0x20000000	/* Thread belongs to the rt mutex tester */
>>
>> Changing the semantics of PF_THREAD_BOUND without so much as
>> a comment in your changelog or buy-in from the scheduler
>> maintainers is a big no-no.
>
> In fact I've already said a number of times this patch isn't going
> anywhere.

Which is probably fine, because I see no reason why Andrea's
numa threads would need PF_THREAD_BOUND in the first place.

Andrea?

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-29 14:11       ` Nai Xia
@ 2012-06-29 16:30         ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-29 16:30 UTC (permalink / raw)
  To: Nai Xia
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Hi Nai,

On Fri, Jun 29, 2012 at 10:11:35PM +0800, Nai Xia wrote:
> If one process makes very intensive visits to a small set of pages in
> this node, but only occasional visits to a large set of pages in
> another node, will this algorithm make a very bad judgment? I guess
> the answer would be: it's possible, and the judgment depends on the
> racing pattern between the process and your knuma_scand.

Correct: depending on whether the knuma_scand/scan_pass_sleep_millisecs
pass is more or less occasional than the visits to the large set of
pages, it may behave differently.

Note that every algorithm will have a limit on how smart it can be.

Just to make a random example: if you lookup some pagecache a million
times and some other pagecache a dozen times, their "aging"
information in the pagecache will end up identical. Yet we know one
set of pages is clearly higher priority than the other. We've only so
many levels of lrus and so many referenced/active bitflags per
page. Once you get at the top, then all is equal.

Does this mean the "active" list working set detection is useless just
because we can't differentiate a million of lookups on a few pages, vs
a dozen of lookups on lots of pages?

Last but not least, in the very example you mention it's not even
clear whether the process should be scheduled on the CPU close to the
small set of pages accessed frequently, or on the CPU close to the
large set of pages accessed occasionally. If the small set of pages
fits in the 8 MBytes of L2 cache, then it's better to put the process
on the other CPU, near the large set of pages that can't fit in the
L2 cache. Lots of hardware details would have to be evaluated to
really know what's the right thing in such a case, even if it were
you having to decide.

But the real reason why the above isn't an issue and why we don't need
to solve that problem perfectly: there's not just a CPU follow memory
algorithm in AutoNUMA. There's also the memory follow CPU
algorithm. AutoNUMA will do its best to change the layout of your
example to one that has only one clear solution: the occasional lookups
of the large set of pages will eventually make those pages end up in the
node together with the small set of pages (or the other way around), and
this is how it's solved.

In any case, whatever wrong decision it takes, it will at least be a
better decision than numa/sched, where there's absolutely zero
information about which pages the process is accessing. And best of all,
with AutoNUMA you also know which pages the _thread_ is accessing, so it
will also be able to take optimal decisions when there are more threads
than CPUs in a node (as long as not all thread accesses are shared).

Hope this explains things better.
Andrea

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
@ 2012-06-29 16:30         ` Andrea Arcangeli
  0 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-29 16:30 UTC (permalink / raw)
  To: Nai Xia
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Hi Nai,

On Fri, Jun 29, 2012 at 10:11:35PM +0800, Nai Xia wrote:
> If one process makes very intensive visits to a small set of pages in
> this node, but only occasional visits to a large set of pages in
> another node, will this algorithm make a very bad judgment? I guess
> the answer would be: it's possible, and the judgment depends on the
> racing pattern between the process and your knuma_scand.

Correct: depending on whether the knuma_scand/scan_pass_sleep_millisecs
pass is more or less occasional than the visits to the large set of
pages, it may behave differently.

Note that every algorithm will have a limit on how smart it can be.

Just to make a random example: if you lookup some pagecache a million
times and some other pagecache a dozen times, their "aging"
information in the pagecache will end up identical. Yet we know one
set of pages is clearly higher priority than the other. We've only so
many levels of lrus and so many referenced/active bitflags per
page. Once you get at the top, then all is equal.

Does this mean the "active" list working set detection is useless just
because we can't differentiate a million of lookups on a few pages, vs
a dozen of lookups on lots of pages?

Last but not least, in the very example you mention it's not even
clear whether the process should be scheduled on the CPU close to the
small set of pages accessed frequently, or on the CPU close to the
large set of pages accessed occasionally. If the small set of pages
fits in the 8 MBytes of L2 cache, then it's better to put the process
on the other CPU, near the large set of pages that can't fit in the
L2 cache. Lots of hardware details would have to be evaluated to
really know what's the right thing in such a case, even if it were
you having to decide.

But the real reason why the above isn't an issue and why we don't need
to solve that problem perfectly: there's not just a CPU follow memory
algorithm in AutoNUMA. There's also the memory follow CPU
algorithm. AutoNUMA will do its best to change the layout of your
example to one that has only one clear solution: the occasional lookups
of the large set of pages will eventually make those pages end up in the
node together with the small set of pages (or the other way around), and
this is how it's solved.

In any case, whatever wrong decision it takes, it will at least be a
better decision than numa/sched, where there's absolutely zero
information about which pages the process is accessing. And best of all,
with AutoNUMA you also know which pages the _thread_ is accessing, so it
will also be able to take optimal decisions when there are more threads
than CPUs in a node (as long as not all thread accesses are shared).

Hope this explains things better.
Andrea


^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 09/40] autonuma: introduce kthread_bind_node()
  2012-06-29 15:36     ` Rik van Riel
@ 2012-06-29 16:38       ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-29 16:38 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Fri, Jun 29, 2012 at 11:36:26AM -0400, Rik van Riel wrote:
> On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> 
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1792,7 +1792,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
> >   #define PF_SWAPWRITE	0x00800000	/* Allowed to write to swap */
> >   #define PF_SPREAD_PAGE	0x01000000	/* Spread page cache over cpuset */
> >   #define PF_SPREAD_SLAB	0x02000000	/* Spread some slab caches over cpuset */
> > -#define PF_THREAD_BOUND	0x04000000	/* Thread bound to specific cpu */
> > +#define PF_THREAD_BOUND	0x04000000	/* Thread bound to specific cpus */
> >   #define PF_MCE_EARLY    0x08000000      /* Early kill for mce process policy */
> >   #define PF_MEMPOLICY	0x10000000	/* Non-default NUMA mempolicy */
> >   #define PF_MUTEX_TESTER	0x20000000	/* Thread belongs to the rt mutex tester */
> 
> Changing the semantics of PF_THREAD_BOUND without so much as
> a comment in your changelog or buy-in from the scheduler
> maintainers is a big no-no.
> 
> Is there any reason you even need PF_THREAD_BOUND in your
> kernel numa threads?
> 
> I do not see much at all in the scheduler code that uses
> PF_THREAD_BOUND and it is not clear at all that your
> numa threads get any benefit from them...
> 
> Why do you think you need it?

Nobody needs that flag anyway; you could drop it from the kernel in
all the places it is used and nothing would break, it is never
"needed". But since somebody bothered to add this reliability feature
to the kernel, why not take advantage of it whenever possible?

This flag is only used to prevent userland from messing with the
kernel CPU bindings of kernel threads. It is there to keep the root
user from shooting itself in the foot.

So far it has been used to prevent changing bindings to a single
CPU. I'm setting it also after making a multiple-CPU bind (all CPUs of
the node, instead of just 1 CPU). I hope it's clear to everybody that
this is a perfectly ok usage, and whether the bind is done on 1 CPU,
10 CPUs or all CPUs, nothing changes in how the bitflag works.

There's no legitimate reason to allow the root user to change the CPU
binding of knuma_migratedN. Any change would be a guaranteed
regression. So there's no reason not to enforce the node-wide binding,
if nothing else to document that the binding enforced is the ideal one
in all possible conditions.
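
For reference, what I mean boils down to something like the following
sketch (not the actual kthread_bind_node() implementation from the
patch, just an illustration of the concept using set_cpus_allowed_ptr()
and cpumask_of_node()):

#include <linux/sched.h>
#include <linux/cpumask.h>
#include <linux/topology.h>

/*
 * Conceptual sketch only: allow a kernel thread to run on every CPU
 * of one NUMA node, then mark it PF_THREAD_BOUND so userland cannot
 * override the affinity later via sched_setaffinity().
 */
static void bind_kthread_to_node_sketch(struct task_struct *k, int nid)
{
	set_cpus_allowed_ptr(k, cpumask_of_node(nid));
	k->flags |= PF_THREAD_BOUND;
}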

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-29 12:55         ` Ingo Molnar
@ 2012-06-29 16:51           ` Dor Laor
  -1 siblings, 0 replies; 327+ messages in thread
From: Dor Laor @ 2012-06-29 16:51 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Hillf Danton, Peter Zijlstra, Andrea Arcangeli, linux-kernel,
	linux-mm, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/29/2012 08:55 AM, Ingo Molnar wrote:
>
> * Hillf Danton <dhillf@gmail.com> wrote:
>
>> On Thu, Jun 28, 2012 at 10:53 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>>>
>>> Unless you're going to listen to feedback I give you, I'm
>>> going to completely stop reading your patches, I don't give
>>> a rats arse you work for the same company anymore.
>>
>> Are you brought up, Peter, in dirty environment with mind
>> polluted?
>
> You do not seem to be aware of the history of this patch-set,
> I suspect Peter got "polluted" by Andrea ignoring his repeated
> review feedbacks...

AFAIK, Andrea answered many of Peter's requests by reducing the memory
overhead, adding documentation and changing the scheduler integration.

When someone plants 'crap' too often in his comments, it's not a surprise
that some of them will get ignored. Moreover, I don't think the decent
comments got ignored; sometimes both were talking along parallel lines -
even in this case, it's hard to say whether Peter would like to add ia64
support or would just like to get rid of the forceful migration as a whole.

Since it takes more time to fully understand the code than to write the
comments, I suggest going the extra mile there and making sure the review
is crystal clear.

>
> If his multiple rounds of polite (and extensive) review didn't
> have much of an effect then maybe some amount of not so nice
> shouting has more of an effect?
>
> The other option would be to NAK and ignore the patchset, in
> that sense Peter is a lot more constructive and forward looking
> than a polite NAK would be, even if the language is rough.

A NAK is better w/ a further explanation or even a suggestion about
alternatives. The previous comments were not shouts but the mother of
all NAKs.

There are some in the Linux community that adore flames, but this is a
perfect example of how that approach slows innovation instead of
advancing it.

Some developers have a thick skin and nothing gets in; others are human
and have feelings. With a tiny difference in behavior we can do much,
much better. What works in a loud f2f discussion doesn't play well in
email.

Or alternatively:

/*
 * can_nice - check if folks on lkml can be nicer & more productive
 * @p: person
 * @nice: nice value
 * Since nice isn't a negative property, nice is an uint here.
 */
int can_nice(const struct person *p, const unsigned int nice)
{
	int nice_rlim = MAX_LIMIT_BEFORE_NAK;

	BUG_ON(!capable(CAP_SYS_NICE));

	if (nice_rlim >= task_rlimit(p, RLIMIT_NICE))
		printk(KERN_INFO "Please NAK w/ a decent explanation or "
		       "submit an alternative patch\n");

	return 0;
}

Ingo, what's your technical perspective of this particular patch?

Cheers,
Dor

>
> Thanks,
>
> 	Ingo
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>



^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 09/40] autonuma: introduce kthread_bind_node()
  2012-06-29 16:38       ` Andrea Arcangeli
@ 2012-06-29 16:58         ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 16:58 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/29/2012 12:38 PM, Andrea Arcangeli wrote:
> On Fri, Jun 29, 2012 at 11:36:26AM -0400, Rik van Riel wrote:
>> On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
>>
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -1792,7 +1792,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
>>>    #define PF_SWAPWRITE	0x00800000	/* Allowed to write to swap */
>>>    #define PF_SPREAD_PAGE	0x01000000	/* Spread page cache over cpuset */
>>>    #define PF_SPREAD_SLAB	0x02000000	/* Spread some slab caches over cpuset */
>>> -#define PF_THREAD_BOUND	0x04000000	/* Thread bound to specific cpu */
>>> +#define PF_THREAD_BOUND	0x04000000	/* Thread bound to specific cpus */
>>>    #define PF_MCE_EARLY    0x08000000      /* Early kill for mce process policy */
>>>    #define PF_MEMPOLICY	0x10000000	/* Non-default NUMA mempolicy */
>>>    #define PF_MUTEX_TESTER	0x20000000	/* Thread belongs to the rt mutex tester */
>>
>> Changing the semantics of PF_THREAD_BOUND without so much as
>> a comment in your changelog or buy-in from the scheduler
>> maintainers is a big no-no.
>>
>> Is there any reason you even need PF_THREAD_BOUND in your
>> kernel numa threads?
>>
>> I do not see much at all in the scheduler code that uses
>> PF_THREAD_BOUND and it is not clear at all that your
>> numa threads get any benefit from them...
>>
>> Why do you think you need it?

> This flag is only used to prevent userland to mess with the kernel CPU
> binds of kernel threads. It is used to avoid the root user to shoot
> itself in the foot.
>
> So far it has been used to prevent changing bindings to a single
> CPU. I'm setting it also after making a multiple-cpu bind (all CPUs of
> the node, instead of just 1 CPU).

Fair enough.  Looking at the scheduler code some more, I
see that all PF_THREAD_BOUND seems to do is block userspace
from changing a thread's CPU bindings.

Peter and Ingo, what is the special magic in PF_THREAD_BOUND
that should make it only apply to kernel threads that are bound
to a single CPU?

Allowing it for threads that are bound to a NUMA node
could make some sense for e.g. kswapd...

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 10/40] autonuma: mm_autonuma and sched_autonuma data structures
  2012-06-28 12:55   ` Andrea Arcangeli
@ 2012-06-29 17:45     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 17:45 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> Define the two data structures that collect the per-process (in the
> mm) and per-thread (in the task_struct) statistical information that
> are the input of the CPU follow memory algorithms in the NUMA
> scheduler.

I just noticed the subject of this email is misleading, too.

This patch does not introduce sched_autonuma at all.

*searches around for the patch that does*

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-28 12:55   ` Andrea Arcangeli
@ 2012-06-29 18:03     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 18:03 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> This algorithm takes as input the statistical information filled by the
> knuma_scand (mm->mm_autonuma) and by the NUMA hinting page faults
> (p->sched_autonuma),

Somewhat confusing patch order, since the NUMA hinting page faults
appear to be later in the patch series.

At least the data structures got introduced earlier, albeit without
any documentation whatsoever (that needs fixing).

> evaluates it for the current scheduled task, and
> compares it against every other running process to see if it should
> move the current task to another NUMA node.

This is a little worrying. What if you are running on a system
with hundreds of NUMA nodes? How often does this code run?

> +static bool inline task_autonuma_cpu(struct task_struct *p, int cpu)
> +{
> +#ifdef CONFIG_AUTONUMA
> +	int autonuma_node;
> +	struct task_autonuma *task_autonuma = p->task_autonuma;
> +
> +	if (!task_autonuma)
> +		return true;
> +
> +	autonuma_node = ACCESS_ONCE(task_autonuma->autonuma_node);
> +	if (autonuma_node<  0 || autonuma_node == cpu_to_node(cpu))
> +		return true;
> +	else
> +		return false;
> +#else
> +	return true;
> +#endif
> +}

What is the return value of task_autonuma_cpu supposed
to represent?  It is not at all clear what this function
is trying to do...

> +#ifdef CONFIG_AUTONUMA
> +	/* this is used by the scheduler and the page allocator */
> +	struct mm_autonuma *mm_autonuma;
> +#endif
>   };

Great.  What is it used for, and how?
Why is that not documented?

> @@ -1514,6 +1514,9 @@ struct task_struct {
>   	struct mempolicy *mempolicy;	/* Protected by alloc_lock */
>   	short il_next;
>   	short pref_node_fork;
> +#ifdef CONFIG_AUTONUMA
> +	struct task_autonuma *task_autonuma;
> +#endif

This could use a comment, too.  I know task_struct has historically
been documented rather poorly, but it may be time to break that
tradition and add documentation.

> +/*
> + * autonuma_balance_cpu_stop() is a callback to be invoked by
> + * stop_one_cpu_nowait(). It is used by sched_autonuma_balance() to
> + * migrate the tasks to the selected_cpu, from softirq context.
> + */
> +static int autonuma_balance_cpu_stop(void *data)
> +{

Uhhh what?   It does not look like anything ends up stopped
as a result of this function running.

It looks like the function migrates a task to another NUMA
node, and always returns 0. Maybe void should be the return
type, and not the argument type?

It would be nice if the function name described what the
function does.

> +	struct rq *src_rq = data;

A void* as the function parameter, when you know what the
data pointer actually is?

Why are you doing this?

> +	int src_cpu = cpu_of(src_rq);
> +	int dst_cpu = src_rq->autonuma_balance_dst_cpu;
> +	struct task_struct *p = src_rq->autonuma_balance_task;

Why is the task to be migrated an item in the runqueue struct,
and not a function argument?

This seems backwards from the way things are usually done.
Not saying it is wrong, but doing things this way needs a good
explanation.

> +out_unlock:
> +	src_rq->autonuma_balance = false;
> +	raw_spin_unlock(&src_rq->lock);
> +	/* spinlocks acts as barrier() so p is stored local on the stack */

What race are you trying to protect against?

Surely the reason p continues to be valid is that you are
holding a refcount to the task?

> +	raw_spin_unlock_irq(&p->pi_lock);
> +	put_task_struct(p);
> +	return 0;
> +}

> +enum {
> +	W_TYPE_THREAD,
> +	W_TYPE_PROCESS,
> +};

What is W?  What is the difference between thread type
and process type Ws?

You wrote a lot of text describing sched_autonuma_balance(),
but none of it helps me understand what you are trying to do :(

> + * We run the above math on every CPU not part of the current NUMA
> + * node, and we compare the current process against the other
> + * processes running in the other CPUs in the remote NUMA nodes. The
> + * objective is to select the cpu (in selected_cpu) with a bigger
> + * "weight". The bigger the "weight" the biggest gain we'll get by
> + * moving the current process to the selected_cpu (not only the
> + * biggest immediate CPU gain but also the fewer async memory
> + * migrations that will be required to reach full convergence
> + * later). If we select a cpu we migrate the current process to it.

The one thing you have not described at all is what factors
go into the weight calculation, and why you are using those.

We can all read C and figure out what the code does, but
we need to know why.

What factors does the code use to weigh the NUMA nodes and processes?

Why are statistics kept both on a per process and a per thread basis?

What is the difference between those two?

What makes a particular NUMA node a good node for a thread to run on?

When is it worthwhile moving stuff around?

When is it not worthwhile?

> + * One non trivial bit of this logic that deserves an explanation is
> + * how the three crucial variables of the core math
> + * (w_nid/w_other/wcpu_nid) are going to change depending on whether
> + * the other CPU is running a thread of the current process, or a
> + * thread of a different process.

It would be nice to know what w_nid/w_other/w_cpu_nid mean.

You have a one-line description of them higher up in the comment,
but there is still no description at all of what factors go into
calculating the weights, or why...

> + * A simple example is required. Given the following:
> + * - 2 processes
> + * - 4 threads per process
> + * - 2 NUMA nodes
> + * - 4 CPUS per NUMA node
> + *
> + * Because the 8 threads belong to 2 different processes, by using the
> + * process statistics when comparing threads of different processes,
> + * we will converge reliably and quickly to a configuration where the
> + * 1st process is entirely contained in one node and the 2nd process
> + * in the other node.
> + *
> + * If all threads only use thread local memory (no sharing of memory
> + * between the threads), it wouldn't matter if we use per-thread or
> + * per-mm statistics for w_nid/w_other/w_cpu_nid. We could then use
> + * per-thread statistics all the time.
> + *
> + * But clearly with threads it's expected to get some sharing of
> + * memory. To avoid false sharing it's better to keep all threads of
> + * the same process in the same node (or if they don't fit in a single
> + * node, in as fewer nodes as possible). This is why we have to use
> + * processes statistics in w_nid/w_other/wcpu_nid when comparing
> + * threads of different processes. Why instead do we have to use
> + * thread statistics when comparing threads of the same process? This
> + * should be obvious if you're still reading

Nothing at all here is obvious, because you have not explained
what factors go into determining each weight.

You describe a lot of specific details, but are missing the
general overview that helps us make sense of things.

> +void sched_autonuma_balance(void)
> +{
> +	int cpu, nid, selected_cpu, selected_nid, selected_nid_mm;
> +	int cpu_nid = numa_node_id();
> +	int this_cpu = smp_processor_id();
> +	/*
> +	 * w_t: node thread weight
> +	 * w_t_t: total sum of all node thread weights
> +	 * w_m: node mm/process weight
> +	 * w_m_t: total sum of all node mm/process weights
> +	 */
> +	unsigned long w_t, w_t_t, w_m, w_m_t;
> +	unsigned long w_t_max, w_m_max;
> +	unsigned long weight_max, weight;
> +	long s_w_nid = -1, s_w_cpu_nid = -1, s_w_other = -1;
> +	int s_w_type = -1;
> +	struct cpumask *allowed;
> +	struct task_struct *p = current;
> +	struct task_autonuma *task_autonuma = p->task_autonuma;

Considering that p is always current, it may be better to just use
current throughout the function, that way people can see at a glance
that "p" cannot go away while the code is running, because current
is running the code on itself.

> +	/*
> +	 * The below two arrays holds the NUMA affinity information of
> +	 * the current process if scheduled in the "nid". This is task
> +	 * local and mm local information. We compute this information
> +	 * for all nodes.
> +	 *
> +	 * task/mm_numa_weight[nid] will become w_nid.
> +	 * task/mm_numa_weight[cpu_nid] will become w_cpu_nid.
> +	 */
> +	rq = cpu_rq(this_cpu);
> +	task_numa_weight = rq->task_numa_weight;
> +	mm_numa_weight = rq->mm_numa_weight;

It is a mystery to me why these items are allocated in the
runqueue structure. We have per-cpu allocations for things
like this, why are you adding them to the runqueue?

If there is a reason, you need to document it.

> +	w_t_max = w_m_max = 0;
> +	selected_nid = selected_nid_mm = -1;
> +	for_each_online_node(nid) {
> +		w_m = ACCESS_ONCE(p->mm->mm_autonuma->mm_numa_fault[nid]);
> +		w_t = task_autonuma->task_numa_fault[nid];
> +		if (w_m > w_m_t)
> +			w_m_t = w_m;
> +		mm_numa_weight[nid] = w_m*AUTONUMA_BALANCE_SCALE/w_m_t;
> +		if (w_t > w_t_t)
> +			w_t_t = w_t;
> +		task_numa_weight[nid] = w_t*AUTONUMA_BALANCE_SCALE/w_t_t;
> +		if (mm_numa_weight[nid] > w_m_max) {
> +			w_m_max = mm_numa_weight[nid];
> +			selected_nid_mm = nid;
> +		}
> +		if (task_numa_weight[nid] > w_t_max) {
> +			w_t_max = task_numa_weight[nid];
> +			selected_nid = nid;
> +		}
> +	}

What do the task and mm numa weights mean?

What factors go into calculating them?

Is it better to have a higher or a lower number? :)

We could use less documentation of what the code
does, and more explaining what the code is trying
to do, and why.


Under what circumstances do we continue into this loop?

What is it trying to do?

> +	for_each_online_node(nid) {
> +		/*
> +		 * Calculate the "weight" for all CPUs that the
> +		 * current process is allowed to be migrated to,
> +		 * except the CPUs of the current nid (it would be
> +		 * worthless from a NUMA affinity standpoint to
> +		 * migrate the task to another CPU of the current
> +		 * node).
> +		 */
> +		if (nid == cpu_nid)
> +			continue;
> +		for_each_cpu_and(cpu, cpumask_of_node(nid), allowed) {
> +			long w_nid, w_cpu_nid, w_other;
> +			int w_type;
> +			struct mm_struct *mm;
> +			rq = cpu_rq(cpu);
> +			if (!cpu_online(cpu))
> +				continue;
> +
> +			if (idle_cpu(cpu))
> +				/*
> +				 * Offload the while IDLE balancing
> +				 * and physical / logical imbalances
> +				 * to CFS.
> +				 */

			/* CFS idle balancing takes care of this */

> +				continue;
> +
> +			mm = rq->curr->mm;
> +			if (!mm)
> +				continue;
> +			/*
> +			 * Grab the w_m/w_t/w_m_t/w_t_t of the
> +			 * processes running in the other CPUs to
> +			 * compute w_other.
> +			 */
> +			raw_spin_lock_irq(&rq->lock);
> +			/* recheck after implicit barrier() */
> +			mm = rq->curr->mm;
> +			if (!mm) {
> +				raw_spin_unlock_irq(&rq->lock);
> +				continue;
> +			}
> +			w_m_t = ACCESS_ONCE(mm->mm_autonuma->mm_numa_fault_tot);
> +			w_t_t = rq->curr->task_autonuma->task_numa_fault_tot;
> +			if (!w_m_t || !w_t_t) {
> +				raw_spin_unlock_irq(&rq->lock);
> +				continue;
> +			}
> +			w_m = ACCESS_ONCE(mm->mm_autonuma->mm_numa_fault[nid]);
> +			w_t = rq->curr->task_autonuma->task_numa_fault[nid];
> +			raw_spin_unlock_irq(&rq->lock);

Is this why the info is stored in the runqueue struct?

How do we know the other runqueue's data is consistent?
We seem to be doing our own updates outside of the lock...

How do we know the other runqueue's data is up to date?
How often is this function run?

> +			/*
> +			 * Generate the w_nid/w_cpu_nid from the
> +			 * pre-computed mm/task_numa_weight[] and
> +			 * compute w_other using the w_m/w_t info
> +			 * collected from the other process.
> +			 */
> +			if (mm == p->mm) {

			if (mm == current->mm) {

> +				if (w_t > w_t_t)
> +					w_t_t = w_t;
> +				w_other = w_t*AUTONUMA_BALANCE_SCALE/w_t_t;
> +				w_nid = task_numa_weight[nid];
> +				w_cpu_nid = task_numa_weight[cpu_nid];
> +				w_type = W_TYPE_THREAD;
> +			} else {
> +				if (w_m > w_m_t)
> +					w_m_t = w_m;
> +				w_other = w_m*AUTONUMA_BALANCE_SCALE/w_m_t;
> +				w_nid = mm_numa_weight[nid];
> +				w_cpu_nid = mm_numa_weight[cpu_nid];
> +				w_type = W_TYPE_PROCESS;
> +			}

Wait, what?

Why is w_t used in one case, and w_m in the other?

Explaining the meaning of the two, and how each is used,
would help people understand this code.

> +			/*
> +			 * Finally check if there's a combined gain in
> +			 * NUMA affinity. If there is and it's the
> +			 * biggest weight seen so far, record its
> +			 * weight and select this NUMA remote "cpu" as
> +			 * candidate migration destination.
> +			 */
> +			if (w_nid > w_other && w_nid > w_cpu_nid) {
> +				weight = w_nid - w_other + w_nid - w_cpu_nid;

I read this as "check if moving the current task to the other CPU,
and moving its task away, would increase overall NUMA affinity".

Is that correct?
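
For concreteness, plugging made-up numbers (invented purely for
illustration, all on the same AUTONUMA_BALANCE_SCALE scale) into that
check:

	/*
	 * w_nid     = 80  affinity of the current task for the remote node
	 * w_other   = 30  affinity of the task running on that remote cpu
	 * w_cpu_nid = 40  affinity of the current task for its current node
	 *
	 * 80 > 30 and 80 > 40, so the exchange looks like a combined gain:
	 *
	 *	weight = (80 - 30) + (80 - 40) = 90
	 *
	 * and that remote cpu is remembered as selected_cpu if 90 is the
	 * biggest weight seen so far.
	 */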

> +	stop_one_cpu_nowait(this_cpu,
> +			    autonuma_balance_cpu_stop, rq,
> +			&rq->autonuma_balance_work);
> +#ifdef __ia64__
> +#error "NOTE: tlb_migrate_finish won't run here"
> +#endif
> +}

So that is why the function is called autonuma_balance_cpu_stop?
Even though its function is to migrate a task?

What will happen in the IA64 case?
Memory corruption?
What would the IA64 maintainers have to do to make things work?

> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 6d52cea..e5b7ae9 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -463,6 +463,24 @@ struct rq {
>   #ifdef CONFIG_SMP
>   	struct llist_head wake_list;
>   #endif
> +#ifdef CONFIG_AUTONUMA
> +	/*
> +	 * Per-cpu arrays to compute the per-thread and per-process
> +	 * statistics. Allocated statically to avoid overflowing the
> +	 * stack with large MAX_NUMNODES values.
> +	 *
> +	 * FIXME: allocate dynamically and with num_possible_nodes()
> +	 * array sizes only if autonuma is not impossible, to save
> +	 * some dozen KB of RAM when booting on not NUMA (or small
> +	 * NUMA) systems.
> +	 */

I have a second FIXME: document what these fields actually mean,
what they are used for, and why they are allocated as part of the
runqueue structure.

> +	long task_numa_weight[MAX_NUMNODES];
> +	long mm_numa_weight[MAX_NUMNODES];
> +	bool autonuma_balance;
> +	int autonuma_balance_dst_cpu;
> +	struct task_struct *autonuma_balance_task;
> +	struct cpu_stop_work autonuma_balance_work;
> +#endif


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 14/40] autonuma: add page structure fields
  2012-06-28 12:55   ` Andrea Arcangeli
@ 2012-06-29 18:06     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 18:06 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> On 64bit archs, 20 bytes are used for async memory migration (specific
> to the knuma_migrated per-node threads), and 4 bytes are used for the
> thread NUMA false sharing detection logic.
>
> This is a bad implementation due lack of time to do a proper one.

It is not ideal, no.

If you document what everything does, maybe somebody else
will understand the code well enough to help fix it.

> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -136,6 +136,32 @@ struct page {
>   		struct page *first_page;	/* Compound tail pages */
>   	};
>
> +#ifdef CONFIG_AUTONUMA
> +	/*
> +	 * FIXME: move to pgdat section along with the memcg and allocate
> +	 * at runtime only in presence of a numa system.
> +	 */

Once you fix it, could you fold the fix into this patch?

> +	/*
> +	 * To modify autonuma_last_nid lockless the architecture,
> +	 * needs SMP atomic granularity < sizeof(long), not all archs
> +	 * have that, notably some ancient alpha (but none of those
> +	 * should run in NUMA systems). Archs without that requires
> +	 * autonuma_last_nid to be a long.
> +	 */
> +#if BITS_PER_LONG > 32
> +	int autonuma_migrate_nid;
> +	int autonuma_last_nid;
> +#else
> +#if MAX_NUMNODES >= 32768
> +#error "too many nodes"
> +#endif
> +	/* FIXME: remember to check the updates are atomic */
> +	short autonuma_migrate_nid;
> +	short autonuma_last_nid;
> +#endif
> +	struct list_head autonuma_migrate_node;
> +#endif

Please document what these fields mean.


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-29 16:30         ` Andrea Arcangeli
@ 2012-06-29 18:09           ` Nai Xia
  -1 siblings, 0 replies; 327+ messages in thread
From: Nai Xia @ 2012-06-29 18:09 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt



On 2012年06月30日 00:30, Andrea Arcangeli wrote:
> Hi Nai,
>
> On Fri, Jun 29, 2012 at 10:11:35PM +0800, Nai Xia wrote:
>> If one process does very intensive visits of a small set of pages in this
>> node, but only occasional visits of a large set of pages in another node,
>> will this algorithm make a very bad judgment? I guess the answer would
>> be: it's possible, and this judgment depends on the racing pattern
>> between the process and your knuma_scand.
>
> Depending on whether the knuma_scand pass (scan_pass_sleep_millisecs) is more or less
> occasional than the visits to the large set of pages, it may behave
> differently, correct.

I bet this race is more subtle than that, but since you admit
this judgment is subject to a race, it doesn't matter how subtle
it would be.

>
> Note that every algorithm will have a limit on how smart it can be.
>
> Just to make a random example: if you look up some pagecache a million
> times and some other pagecache a dozen times, their "aging"
> information in the pagecache will end up identical. Yet we know one
> set of pages is clearly higher priority than the other. We've only so
> many levels of LRUs and so many referenced/active bitflags per
> page. Once you get to the top, all is equal.
>
> Does this mean the "active" list working set detection is useless just
> because we can't differentiate a million of lookups on a few pages, vs
> a dozen of lookups on lots of pages?

I knew you would give us the LRU example. ;D
But unfortunately the LRU approximation cannot justify your case:
there are cases where the LRU approximation behaves very badly,
but enough research over the years has told us that 90% of workloads
conform to this kind of approximation, and every programmer has
been taught to write LRU-friendly programs.

But we have no idea how well real world workloads will conform to your
algo, especially with respect to the racing pattern.


>
> Last but not least, in the very example you mention it's not even
> clear that the process should be scheduled on the CPU where the
> small set of pages is accessed frequently, or on the CPU where the
> large set of pages is accessed occasionally. If the small set of
> pages fits in the 8MBytes of L2 cache, then it's better to put the
> process on the other CPU, where the large set of pages can't fit in the
> L2 cache. Lots of hardware details would have to be evaluated to really know
> what's the right thing in such a case, even if it was you having to
> decide.

That's exactly why I think it's more subtle and why I don't feel confident
about your algo -- the effectiveness of your algorithm depends on so
many uncertain things.

>
> But the real reason why the above isn't an issue and why we don't need
> to solve that problem perfectly: there's not just a CPU follow memory
> algorithm in AutoNUMA. There's also the memory follow CPU
> algorithm. AutoNUMA will do its best to change the layout of your
> example to one that has only one clear solution: the occasional lookup
> of the large set of pages, will make those eventually go in the node
> together with the small set of pages (or the other way around), and
> this is how it's solved.

Not sure I follow: if you fall back on this, then why all this complexity?
This fallback amounts to a "just group all the pages to the running" policy.


>
> In any case, whatever wrong decision it will take, it will at least be
> a better decision than the numa/sched where there's absolutely zero
> information about what pages the process is accessing. And best of all
> with AutoNUMA you also know which pages the _thread_ is accessing so
> it will also be able to take optimal decisions if there are more
> threads than CPUs in a node (as long as not all thread accesses are
> shared).

Yeah, we need the information. But how to make the best of the information
is a big problem.
I feel you cannot address my question with verbal reasoning alone,
if you currently have in hand no survey of the common page access
patterns of real world workloads.

Maybe the assumption of your algorithm is right, maybe not...


>
> Hope this explains things better.
> Andrea


Thanks,

Nai

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 15/40] autonuma: knuma_migrated per NUMA node queues
  2012-06-28 12:55   ` Andrea Arcangeli
@ 2012-06-29 18:31     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 18:31 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:

> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 2427706..d53b26a 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -697,6 +697,12 @@ typedef struct pglist_data {
>   	struct task_struct *kswapd;
>   	int kswapd_max_order;
>   	enum zone_type classzone_idx;
> +#ifdef CONFIG_AUTONUMA
> +	spinlock_t autonuma_lock;
> +	struct list_head autonuma_migrate_head[MAX_NUMNODES];
> +	unsigned long autonuma_nr_migrate_pages;
> +	wait_queue_head_t autonuma_knuma_migrated_wait;
> +#endif
>   } pg_data_t;
>
>   #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)

Once again, the data structure could use documentation.

What are these things for?
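
One plausible reading, going by the field names and the knuma_migrated design
described elsewhere in the series (an interpretation, not wording from the patch):

#ifdef CONFIG_AUTONUMA
	/* Protects the migrate lists and the page count below. */
	spinlock_t autonuma_lock;
	/*
	 * Queues of pages waiting to be migrated into this node, one list
	 * per source node the pages currently live on; knuma_migrated
	 * drains them.
	 */
	struct list_head autonuma_migrate_head[MAX_NUMNODES];
	/* Total number of pages currently sitting on the queues above. */
	unsigned long autonuma_nr_migrate_pages;
	/* Where this node's knuma_migrated sleeps until work is queued. */
	wait_queue_head_t autonuma_knuma_migrated_wait;
#endif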

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 16/40] autonuma: init knuma_migrated queues
  2012-06-28 12:55   ` Andrea Arcangeli
@ 2012-06-29 18:35     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 18:35 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> Initialize the knuma_migrated queues at boot time.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>   mm/page_alloc.c |   11 +++++++++++
>   1 files changed, 11 insertions(+), 0 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a9710a4..48eabe9 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -59,6 +59,7 @@
>   #include <linux/prefetch.h>
>   #include <linux/migrate.h>
>   #include <linux/page-debug-flags.h>
> +#include <linux/autonuma.h>
>
>   #include <asm/tlbflush.h>
>   #include <asm/div64.h>
> @@ -4348,8 +4349,18 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
>   	int nid = pgdat->node_id;
>   	unsigned long zone_start_pfn = pgdat->node_start_pfn;
>   	int ret;
> +#ifdef CONFIG_AUTONUMA
> +	int node_iter;
> +#endif
>
>   	pgdat_resize_init(pgdat);
> +#ifdef CONFIG_AUTONUMA
> +	spin_lock_init(&pgdat->autonuma_lock);
> +	init_waitqueue_head(&pgdat->autonuma_knuma_migrated_wait);
> +	pgdat->autonuma_nr_migrate_pages = 0;
> +	for_each_node(node_iter)
> +		INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
> +#endif

Should this be a __paginginit function inside one of the
autonuma files, so we can avoid the ifdefs here?
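
Something along these lines would do it; the helper name below is made up for
illustration, it is not a function from the series:

/* include/linux/autonuma.h */
#ifdef CONFIG_AUTONUMA
extern void __paginginit autonuma_pglist_data_init(struct pglist_data *pgdat);
#else
static inline void autonuma_pglist_data_init(struct pglist_data *pgdat) {}
#endif

/* mm/autonuma.c */
void __paginginit autonuma_pglist_data_init(struct pglist_data *pgdat)
{
	int node_iter;

	spin_lock_init(&pgdat->autonuma_lock);
	init_waitqueue_head(&pgdat->autonuma_knuma_migrated_wait);
	pgdat->autonuma_nr_migrate_pages = 0;
	for_each_node(node_iter)
		INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
}

free_area_init_core() could then call it unconditionally, with the
CONFIG_AUTONUMA=n build compiling it away.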

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 17/40] autonuma: autonuma_enter/exit
  2012-06-28 12:55   ` Andrea Arcangeli
@ 2012-06-29 18:37     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 18:37 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> The first gear in the whole AutoNUMA algorithm is knuma_scand. If
> knuma_scand doesn't run, AutoNUMA is a full bypass. If knuma_scand is
> stopped, soon all other AutoNUMA gears will settle down too.
>
> knuma_scand is the daemon that sets pmd_numa and pte_numa and
> allows the NUMA hinting page faults to start; all other
> actions follow as a reaction to that.
>
> knuma_scand scans a list of "mm" and this is where we register and
> unregister the "mm" into AutoNUMA for knuma_scand to scan them.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>
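
For readers following along, the registration half this patch adds boils down
to something like the sketch below (simplified; the list, lock and field names
are assumptions for illustration, not the identifiers used in the patch):

static LIST_HEAD(knuma_scand_mm_list);		/* mm's for knuma_scand to scan */
static DEFINE_SPINLOCK(knuma_scand_lock);

void autonuma_enter(struct mm_struct *mm)	/* when an mm is created */
{
	spin_lock(&knuma_scand_lock);
	list_add_tail(&mm->mm_autonuma->mm_node, &knuma_scand_mm_list);
	spin_unlock(&knuma_scand_lock);
}

void autonuma_exit(struct mm_struct *mm)	/* when the mm goes away */
{
	spin_lock(&knuma_scand_lock);
	list_del(&mm->mm_autonuma->mm_node);
	spin_unlock(&knuma_scand_lock);
}

knuma_scand then walks that list on every pass, marking ptes/pmds pte_numa and
pmd_numa so the NUMA hinting page faults start firing.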

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 18/40] autonuma: call autonuma_setup_new_exec()
  2012-06-28 12:55   ` Andrea Arcangeli
@ 2012-06-29 18:39     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 18:39 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> This resets all per-thread and per-process statistics across exec
> syscalls or after kernel threads detached from the mm. The past
> statistical NUMA information is unlikely to be relevant for the future
> in these cases.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-29 16:51           ` Dor Laor
@ 2012-06-29 18:41             ` Peter Zijlstra
  -1 siblings, 0 replies; 327+ messages in thread
From: Peter Zijlstra @ 2012-06-29 18:41 UTC (permalink / raw)
  To: dlaor
  Cc: Ingo Molnar, Hillf Danton, Andrea Arcangeli, linux-kernel,
	linux-mm, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Fri, 2012-06-29 at 12:51 -0400, Dor Laor wrote:
> It's hard to say whether Peter would like to add ia64 support or
> just to get rid of the forceful migration as a whole.

I've stated several times that all archs that have CONFIG_NUMA must be
supported before we can consider any of this. I've no intention of doing
so myself. Andrea wants this, Andrea gets to do it.

I've also stated several times that forceful migration in the context of
numa balancing must go.

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-29 18:41             ` Peter Zijlstra
@ 2012-06-29 18:46               ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 18:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: dlaor, Ingo Molnar, Hillf Danton, Andrea Arcangeli, linux-kernel,
	linux-mm, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Johannes Weiner, Srivatsa Vaddagiri,
	Christoph Lameter, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

On 06/29/2012 02:41 PM, Peter Zijlstra wrote:
> On Fri, 2012-06-29 at 12:51 -0400, Dor Laor wrote:
>> It's hard to say whether Peter would like to add ia64 support or
>> just to get rid of the forceful migration as a whole.
>
> I've stated several times that all archs that have CONFIG_NUMA must be
> supported before we can consider any of this. I've no intention of doing
> so myself. Andrea wants this, Andrea gets to do it.

I am not convinced all architectures that have CONFIG_NUMA
need to be a requirement, since some of them (eg. Alpha)
seem to be lacking a maintainer nowadays.

It would be good if Andrea could touch base with the maintainers
of the actively maintained architectures with NUMA, and get them
to sign off on the way autonuma does things, and work with them
to get autonuma ported to those architectures.

> I've also stated several times that forceful migration in the context of
> numa balancing must go.

I am not convinced about this part either way.

I do not see how a migration numa thread (which could potentially
use idle cpu time) will be any worse than migrate on fault, which
will always take away time from the userspace process.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-29 16:51           ` Dor Laor
@ 2012-06-29 18:49             ` Peter Zijlstra
  -1 siblings, 0 replies; 327+ messages in thread
From: Peter Zijlstra @ 2012-06-29 18:49 UTC (permalink / raw)
  To: dlaor
  Cc: Ingo Molnar, Hillf Danton, Andrea Arcangeli, linux-kernel,
	linux-mm, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Fri, 2012-06-29 at 12:51 -0400, Dor Laor wrote:
> AFAIK, Andrea answered many of Peter's request by reducing the memory 
> overhead, adding documentation and changing the scheduler integration.
> 
He ignored even more. Getting him to change anything is like pulling
teeth. I'm well tired of it. Take for instance this kthread_bind_node()
nonsense, I've said from the very beginning that wasn't good. Nor is it
required, yet he persists in including it.

The thing is, I would very much like to talk about the design of this
thing, but there's just nothing coming. Yes Andrea wrote a lot of words,
but they didn't explain anything much at all.

And while I have a fair idea of what and how it's working, I still miss a
number of critical fundamentals of the whole approach.

And yes I'm tired and I'm cranky.. but wouldn't you be if you'd spent
days poring over dense and ill-documented code, giving comments only to
have your feedback dismissed and ignored.



^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-29 18:46               ` Rik van Riel
@ 2012-06-29 18:51                 ` Peter Zijlstra
  -1 siblings, 0 replies; 327+ messages in thread
From: Peter Zijlstra @ 2012-06-29 18:51 UTC (permalink / raw)
  To: Rik van Riel
  Cc: dlaor, Ingo Molnar, Hillf Danton, Andrea Arcangeli, linux-kernel,
	linux-mm, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Johannes Weiner, Srivatsa Vaddagiri,
	Christoph Lameter, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

On Fri, 2012-06-29 at 14:46 -0400, Rik van Riel wrote:
> > I've also stated several times that forceful migration in the context of
> > numa balancing must go.
> 
> I am not convinced about this part either way.
> 
> I do not see how a migration numa thread (which could potentially
> use idle cpu time) will be any worse than migrate on fault, which
> will always take away time from the userspace process. 

Any NUMA stuff is long term, it really shouldn't matter on the timescale
of a few jiffies.

NUMA placement should also not over-ride fairness, esp. not by default.

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 19/40] autonuma: alloc/free/init sched_autonuma
  2012-06-28 12:55   ` Andrea Arcangeli
@ 2012-06-29 18:52     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 18:52 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> This is where the dynamically allocated sched_autonuma structure is
> being handled.
>
> The reason for keeping this outside of the task_struct, besides not
> using too much kernel stack, is to only allocate it on NUMA
> hardware. So non-NUMA hardware only pays the memory of a pointer
> in the kernel stack (which remains NULL at all times in that case).

What is not documented is the reason for keeping it at all.

What is in the data structure?

What is the data structure used for?

How do we use it?

> +	if (unlikely(alloc_task_autonuma(tsk, orig, node)))
> +		/* free_thread_info() undoes arch_dup_task_struct() too */
> +		goto out_thread_info;

Oh, you mean task_autonuma, and not sched_autonuma?

Please fix the commit message and the subject.
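
The allocation pattern the changelog describes (pay only a NULL pointer on
non-NUMA hardware) would look roughly like this; the predicate and cache names
below are placeholders for illustration, not the patch's actual identifiers:

/* sketch: called from copy_process() */
int alloc_task_autonuma(struct task_struct *tsk,
			struct task_struct *orig, int node)
{
	tsk->task_autonuma = NULL;	/* the only cost on non-NUMA hardware */
	if (!autonuma_possible())	/* placeholder: NUMA hw, "noautonuma" not set */
		return 0;
	tsk->task_autonuma = kmem_cache_alloc_node(task_autonuma_cachep,
						   GFP_KERNEL, node);
	if (!tsk->task_autonuma)
		return -ENOMEM;	/* nonzero return makes copy_process() unwind */
	/* per-thread NUMA stats start out reset or copied from orig here */
	return 0;
}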

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-29 16:51           ` Dor Laor
@ 2012-06-29 18:53             ` Peter Zijlstra
  -1 siblings, 0 replies; 327+ messages in thread
From: Peter Zijlstra @ 2012-06-29 18:53 UTC (permalink / raw)
  To: dlaor
  Cc: Ingo Molnar, Hillf Danton, Andrea Arcangeli, linux-kernel,
	linux-mm, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Fri, 2012-06-29 at 12:51 -0400, Dor Laor wrote:
> The previous comments were not shouts but the mother of all NAKs. 

I never said any such thing. I just said why should I bother reading
your stuff if you're ignoring most my feedback anyway.

If you want to read that as a NAK, not my problem.

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 20/40] autonuma: alloc/free/init mm_autonuma
  2012-06-28 12:56   ` Andrea Arcangeli
@ 2012-06-29 18:54     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 18:54 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:
> This is where the mm_autonuma structure is being handled. Just like
> sched_autonuma, this is only allocated at runtime if the hardware the
> kernel is running on has been detected as NUMA. On non-NUMA hardware
> the memory cost is reduced to one pointer per mm.
>
> To get rid of the pointer in each mm, the kernel can be compiled
> with CONFIG_AUTONUMA=n.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Same comments as before.  A description of what the data
structure is used for and how would be good.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-29 18:46               ` Rik van Riel
@ 2012-06-29 18:57                 ` Peter Zijlstra
  -1 siblings, 0 replies; 327+ messages in thread
From: Peter Zijlstra @ 2012-06-29 18:57 UTC (permalink / raw)
  To: Rik van Riel
  Cc: dlaor, Ingo Molnar, Hillf Danton, Andrea Arcangeli, linux-kernel,
	linux-mm, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Johannes Weiner, Srivatsa Vaddagiri,
	Christoph Lameter, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

On Fri, 2012-06-29 at 14:46 -0400, Rik van Riel wrote:
> 
> I am not convinced all architectures that have CONFIG_NUMA
> need to be a requirement, since some of them (eg. Alpha)
> seem to be lacking a maintainer nowadays. 

Still, this NUMA balancing stuff is not a small tweak to load-balancing.
It's a very significant change in how you schedule. Having such great
differences across architectures isn't something I look forward to.

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 21/40] autonuma: avoid CFS select_task_rq_fair to return -1
  2012-06-28 12:56   ` Andrea Arcangeli
@ 2012-06-29 18:57     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 18:57 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:
> Fix to avoid -1 retval.
>
> Includes fixes from Hillf Danton <dhillf@gmail.com>.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>   kernel/sched/fair.c |    4 ++++
>   1 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c099cc6..fa96810 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2789,6 +2789,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>   		if (new_cpu == -1 || new_cpu == cpu) {
>   			/* Now try balancing at a lower domain level of cpu */
>   			sd = sd->child;
> +			if (new_cpu < 0)
> +				/* Return prev_cpu if find_idlest_cpu failed */
> +				new_cpu = prev_cpu;
>   			continue;
>   		}
>
> @@ -2807,6 +2810,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>   unlock:
>   	rcu_read_unlock();
>
> +	BUG_ON(new_cpu < 0);
>   	return new_cpu;
>   }
>   #endif /* CONFIG_SMP */

Wait, what?

Either this is a scheduler bugfix, in which case you
are better off submitting it separately and reducing
the size of your autonuma patch queue, or this is a
behaviour change in the scheduler that needs better
arguments than a 1-line changelog.
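
For reference, the behavioural change in the hunk above reduces to the
following (illustration only, distilled from the quoted diff, not code from
the series): when find_idlest_cpu() comes back empty-handed, fall back to the
CPU the task last ran on instead of letting -1 escape.

static int sanitize_new_cpu(int new_cpu, int prev_cpu)
{
	/* find_idlest_cpu() signals "nothing suitable found" with -1 */
	return new_cpu < 0 ? prev_cpu : new_cpu;
}

The added BUG_ON() then asserts the resulting invariant: select_task_rq_fair()
never returns a negative CPU number.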

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-29 18:57                 ` Peter Zijlstra
@ 2012-06-29 19:03                   ` Peter Zijlstra
  -1 siblings, 0 replies; 327+ messages in thread
From: Peter Zijlstra @ 2012-06-29 19:03 UTC (permalink / raw)
  To: Rik van Riel
  Cc: dlaor, Ingo Molnar, Hillf Danton, Andrea Arcangeli, linux-kernel,
	linux-mm, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Johannes Weiner, Srivatsa Vaddagiri,
	Christoph Lameter, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

On Fri, 2012-06-29 at 20:57 +0200, Peter Zijlstra wrote:
> On Fri, 2012-06-29 at 14:46 -0400, Rik van Riel wrote:
> > 
> > I am not convinced all architectures that have CONFIG_NUMA
> > need to be a requirement, since some of them (eg. Alpha)
> > seem to be lacking a maintainer nowadays. 
> 
> Still, this NUMA balancing stuff is not a small tweak to load-balancing.
> Its a very significant change is how you schedule. Having such great
> differences over architectures isn't something I look forward to.

Also, Andrea keeps insisting arch support is trivial, so I don't see the
problem.

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-29 16:51           ` Dor Laor
@ 2012-06-29 19:04             ` Peter Zijlstra
  -1 siblings, 0 replies; 327+ messages in thread
From: Peter Zijlstra @ 2012-06-29 19:04 UTC (permalink / raw)
  To: dlaor
  Cc: Ingo Molnar, Hillf Danton, Andrea Arcangeli, linux-kernel,
	linux-mm, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Fri, 2012-06-29 at 12:51 -0400, Dor Laor wrote:
> Some developers have a thick skin and nothing gets in, others are human
> and have feelings. With a tiny difference in behavior we can do much,
> much better. What works in a loud f2f discussion doesn't play well in
> email.

We're all humans, we all have feelings, and I'm frigging upset.

As a maintainer I try and do my best to support and maintain the
subsystems I'm responsible for. I take this very seriously.

I don't agree with the approach Andrea takes, we all know that, yet I do
want to talk about it. The problem is, many of the crucial details are
non-obvious and no sane explanation seems forthcoming.

I really feel I'm talking to deaf ears.

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 21/40] autonuma: avoid CFS select_task_rq_fair to return -1
  2012-06-29 18:57     ` Rik van Riel
@ 2012-06-29 19:05       ` Peter Zijlstra
  -1 siblings, 0 replies; 327+ messages in thread
From: Peter Zijlstra @ 2012-06-29 19:05 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, Hillf Danton,
	Dan Smith, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Fri, 2012-06-29 at 14:57 -0400, Rik van Riel wrote:
> Either this is a scheduler bugfix, in which case you
> are better off submitting it separately and reducing
> the size of your autonuma patch queue, or this is a
> behaviour change in the scheduler that needs better
> arguments than a 1-line changelog. 

I've only said this like 2 or 3 times.. :/

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 21/40] autonuma: avoid CFS select_task_rq_fair to return -1
@ 2012-06-29 19:05       ` Peter Zijlstra
  0 siblings, 0 replies; 327+ messages in thread
From: Peter Zijlstra @ 2012-06-29 19:05 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, Hillf Danton,
	Dan Smith, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Fri, 2012-06-29 at 14:57 -0400, Rik van Riel wrote:
> Either this is a scheduler bugfix, in which case you
> are better off submitting it separately and reducing
> the size of your autonuma patch queue, or this is a
> behaviour change in the scheduler that needs better
> arguments than a 1-line changelog. 

I've only said this like 2 or 3 times.. :/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 21/40] autonuma: avoid CFS select_task_rq_fair to return -1
  2012-06-29 19:05       ` Peter Zijlstra
@ 2012-06-29 19:07         ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 19:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, Hillf Danton,
	Dan Smith, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/29/2012 03:05 PM, Peter Zijlstra wrote:
> On Fri, 2012-06-29 at 14:57 -0400, Rik van Riel wrote:
>> Either this is a scheduler bugfix, in which case you
>> are better off submitting it separately and reducing
>> the size of your autonuma patch queue, or this is a
>> behaviour change in the scheduler that needs better
>> arguments than a 1-line changelog.
>
> I've only said this like 2 or 3 times.. :/

I'll keep saying it until Andrea has fixed it :)

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 21/40] autonuma: avoid CFS select_task_rq_fair to return -1
@ 2012-06-29 19:07         ` Rik van Riel
  0 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 19:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, Hillf Danton,
	Dan Smith, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/29/2012 03:05 PM, Peter Zijlstra wrote:
> On Fri, 2012-06-29 at 14:57 -0400, Rik van Riel wrote:
>> Either this is a scheduler bugfix, in which case you
>> are better off submitting it separately and reducing
>> the size of your autonuma patch queue, or this is a
>> behaviour change in the scheduler that needs better
>> arguments than a 1-line changelog.
>
> I've only said this like 2 or 3 times.. :/

I'll keep saying it until Andrea has fixed it :)

-- 
All rights reversed

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-29 19:03                   ` Peter Zijlstra
@ 2012-06-29 19:19                     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 19:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: dlaor, Ingo Molnar, Hillf Danton, Andrea Arcangeli, linux-kernel,
	linux-mm, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Johannes Weiner, Srivatsa Vaddagiri,
	Christoph Lameter, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

On 06/29/2012 03:03 PM, Peter Zijlstra wrote:
> On Fri, 2012-06-29 at 20:57 +0200, Peter Zijlstra wrote:
>> On Fri, 2012-06-29 at 14:46 -0400, Rik van Riel wrote:
>>>
>>> I am not convinced all architectures that have CONFIG_NUMA
>>> need to be a requirement, since some of them (eg. Alpha)
>>> seem to be lacking a maintainer nowadays.
>>
>> Still, this NUMA balancing stuff is not a small tweak to load-balancing.
>> It's a very significant change in how you schedule. Having such great
>> differences across architectures isn't something I look forward to.

I am not too worried about the performance of architectures
that are essentially orphaned :)

> Also, Andrea keeps insisting arch support is trivial, so I don't see the
> problem.

Getting it implemented in one or two additional architectures
would be good, to get a template out there that can be used by
other architecture maintainers.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
@ 2012-06-29 19:19                     ` Rik van Riel
  0 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-06-29 19:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: dlaor, Ingo Molnar, Hillf Danton, Andrea Arcangeli, linux-kernel,
	linux-mm, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Johannes Weiner, Srivatsa Vaddagiri,
	Christoph Lameter, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

On 06/29/2012 03:03 PM, Peter Zijlstra wrote:
> On Fri, 2012-06-29 at 20:57 +0200, Peter Zijlstra wrote:
>> On Fri, 2012-06-29 at 14:46 -0400, Rik van Riel wrote:
>>>
>>> I am not convinced all architectures that have CONFIG_NUMA
>>> need to be a requirement, since some of them (eg. Alpha)
>>> seem to be lacking a maintainer nowadays.
>>
>> Still, this NUMA balancing stuff is not a small tweak to load-balancing.
>> It's a very significant change in how you schedule. Having such great
>> differences across architectures isn't something I look forward to.

I am not too worried about the performance of architectures
that are essentially orphaned :)

> Also, Andrea keeps insisting arch support is trivial, so I don't see the
> problem.

Getting it implemented in one or two additional architectures
would be good, to get a template out there that can be used by
other architecture maintainers.

-- 
All rights reversed

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-29 18:53             ` Peter Zijlstra
  (?)
@ 2012-06-29 20:01             ` Nai Xia
  2012-06-29 20:44               ` Nai Xia
                                 ` (2 more replies)
  -1 siblings, 3 replies; 327+ messages in thread
From: Nai Xia @ 2012-06-29 20:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: dlaor, Ingo Molnar, Hillf Danton, Andrea Arcangeli, linux-kernel,
	linux-mm, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Sat, Jun 30, 2012 at 2:53 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Fri, 2012-06-29 at 12:51 -0400, Dor Laor wrote:
>> The previous comments were not shouts but the mother of all NAKs.
>
> I never said any such thing. I just said why should I bother reading
> your stuff if you're ignoring most my feedback anyway.
>
> If you want to read that as a NAK, not my problem.

Hey guys, can I say NAK to these patches?

Now I am aware that this sampling algorithm is completely broken, if we
take a few seconds to see what it is trying to solve:

We all know that LRU is trying to solve the question of "what are the
pages recently accessed?", so it's enough to use pte bits to
approximate.

However, the numa balancing problem is fundamentally like this:

In some time unit,

      W = pages_accessed  *  average_page_access_frequence

We are trying to move the process to the node having max W, right?

Andrea's patch can only approximate the pages_accessed number in a
time unit (the scan interval); I don't think it can catch even 1% of
average_page_access_frequence on a busy workload.
Blindly assuming that all the pages' average_page_access_frequence is
the same is seemingly broken to me.

Sometimes, it's good to have a good view of your problem before
spending a lot of time coding.
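
To make the weighting above concrete, here is a minimal sketch of the
decision it implies (the struct, the function and the numbers behind it
are made up purely for illustration; this is not code from the series):

struct toy_node_stat {
	unsigned long pages_accessed;	/* pages touched in one scan interval */
	unsigned long avg_access_freq;	/* accesses per page in that interval */
};

/* pick the node with max W = pages_accessed * average_page_access_frequence */
static int toy_pick_node(const struct toy_node_stat *stat, int nr_nodes)
{
	unsigned long best_w = 0;
	int nid, best_nid = 0;

	for (nid = 0; nid < nr_nodes; nid++) {
		unsigned long w = stat[nid].pages_accessed *
				  stat[nid].avg_access_freq;
		if (w > best_w) {
			best_w = w;
			best_nid = nid;
		}
	}
	return best_nid;
}

The scanner can estimate the first factor per interval; the objection
is that it has no measurement of the second one.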

>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-29 19:04             ` Peter Zijlstra
  (?)
@ 2012-06-29 20:27             ` Nai Xia
  -1 siblings, 0 replies; 327+ messages in thread
From: Nai Xia @ 2012-06-29 20:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: dlaor, Ingo Molnar, Hillf Danton, Andrea Arcangeli, linux-kernel,
	linux-mm, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Sat, Jun 30, 2012 at 3:04 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Fri, 2012-06-29 at 12:51 -0400, Dor Laor wrote:
>> Some developers have a thick skin and nothing gets in, others are human
>> and have feelings. Using a tiny difference in behavior we can do much
>> much better. What's works in a f2f loud discussion doesn't play well in
>> email.
>
> We're all humans, we all have feelings, and I'm frigging upset.
>
> As a maintainer I try and do my best to support and maintain the
> subsystems I'm responsible for. I take this very seriously.

So, would you please very seriously look into my NAK?
It looks very clear to me.

>
> I don't agree with the approach Andrea takes, we all know that, yet I do
> want to talk about it. The problem is, many of the crucial details are
> non-obvious and no sane explanation seems forthcoming.
>
> I really feel I'm talking to deaf ears.
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-29 20:01             ` Nai Xia
@ 2012-06-29 20:44               ` Nai Xia
  2012-06-30  1:23               ` Andrea Arcangeli
  2012-07-05 18:07               ` Rik van Riel
  2 siblings, 0 replies; 327+ messages in thread
From: Nai Xia @ 2012-06-29 20:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: dlaor, Ingo Molnar, Hillf Danton, Andrea Arcangeli, linux-kernel,
	linux-mm, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Sat, Jun 30, 2012 at 4:01 AM, Nai Xia <nai.xia@gmail.com> wrote:
> On Sat, Jun 30, 2012 at 2:53 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>> On Fri, 2012-06-29 at 12:51 -0400, Dor Laor wrote:
>>> The previous comments were not shouts but the mother of all NAKs.
>>
>> I never said any such thing. I just said why should I bother reading
>> your stuff if you're ignoring most my feedback anyway.
>>
>> If you want to read that as a NAK, not my problem.
>
> Hey guys, Can I say NAK to these patches ?
>
> Now I aware that this sampling algorithm is completely broken, if we take
> a few seconds to see what it is trying to solve:
>
> We all know that LRU is try to solve the question of "what are the
> pages recently accessed?",
> so its engouth to use pte bits to approximate.
>
> However, the numa balancing problem is fundamentally like this:
>
> In some time unit,
>
>      W = pages_accessed  *  average_page_access_frequence
>
> We are trying to move process to the node having max W,  right?
>
> Andrea's patch can only approximate the pages_accessed number in a
> time unit(scan interval),
> I don't think it can catch even 1% of  average_page_access_frequence
> on a busy workload.
> Blindly assuming that all the pages'  average_page_access_frequence is

Oh, sorry for my typo, I meant "frequency".


> the same is seemly
> broken to me.
>
> Sometimes, it's good to have a good view of your problem before
> spending a lot time coding.
>
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@kvack.org.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 21/40] autonuma: avoid CFS select_task_rq_fair to return -1
  2012-06-29 19:07         ` Rik van Riel
@ 2012-06-29 20:48           ` Ingo Molnar
  -1 siblings, 0 replies; 327+ messages in thread
From: Ingo Molnar @ 2012-06-29 20:48 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, Andrea Arcangeli, linux-kernel, linux-mm,
	Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Johannes Weiner, Srivatsa Vaddagiri,
	Christoph Lameter, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt


* Rik van Riel <riel@redhat.com> wrote:

> On 06/29/2012 03:05 PM, Peter Zijlstra wrote:
> >On Fri, 2012-06-29 at 14:57 -0400, Rik van Riel wrote:
> >>Either this is a scheduler bugfix, in which case you
> >>are better off submitting it separately and reducing
> >>the size of your autonuma patch queue, or this is a
> >>behaviour change in the scheduler that needs better
> >>arguments than a 1-line changelog.
> >
> >I've only said this like 2 or 3 times.. :/
> 
> I'll keep saying it until Andrea has fixed it :)

But that's just wrong - patch submitters *MUST* be responsive 
and forthcoming. Mistakes are OK, but this goes well beyond 
that. A patch-queue must generally not be resubmitted for yet 
another review round, as long as there are still unaddressed 
review feedback items.

The thing is, core kernel code maintainers like PeterZ don't 
scale and the number of patches to review is huge - yet Andrea 
keeps wasting Peter's time with the same things again and 
again... How much is too much?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 21/40] autonuma: avoid CFS select_task_rq_fair to return -1
@ 2012-06-29 20:48           ` Ingo Molnar
  0 siblings, 0 replies; 327+ messages in thread
From: Ingo Molnar @ 2012-06-29 20:48 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, Andrea Arcangeli, linux-kernel, linux-mm,
	Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Johannes Weiner, Srivatsa Vaddagiri,
	Christoph Lameter, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt


* Rik van Riel <riel@redhat.com> wrote:

> On 06/29/2012 03:05 PM, Peter Zijlstra wrote:
> >On Fri, 2012-06-29 at 14:57 -0400, Rik van Riel wrote:
> >>Either this is a scheduler bugfix, in which case you
> >>are better off submitting it separately and reducing
> >>the size of your autonuma patch queue, or this is a
> >>behaviour change in the scheduler that needs better
> >>arguments than a 1-line changelog.
> >
> >I've only said this like 2 or 3 times.. :/
> 
> I'll keep saying it until Andrea has fixed it :)

But that's just wrong - patch submitters *MUST* be responsive 
and forthcoming. Mistakes are OK, but this goes well beyond 
that. A patch-queue must generally not be resubmitted for yet 
another review round, as long as there are still unaddressed 
review feedback items.

The thing is, core kernel code maintainers like PeterZ don't 
scale and the number of patches to review is huge - yet Andrea 
keeps wasting Peter's time with the same things again and 
again... How much is too much?

Thanks,

	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-29 16:30         ` Andrea Arcangeli
@ 2012-06-29 21:02           ` Nai Xia
  -1 siblings, 0 replies; 327+ messages in thread
From: Nai Xia @ 2012-06-29 21:02 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Sat, Jun 30, 2012 at 12:30 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> Hi Nai,
>
> On Fri, Jun 29, 2012 at 10:11:35PM +0800, Nai Xia wrote:
>> If one process do very intensive visit of a small set of pages in this
>> node, but occasional visit of a large set of pages in another node.
>> Will this algorithm do a very bad judgment? I guess the answer would
>> be: it's possible and this judgment depends on the racing pattern
>> between the process and your knuma_scand.
>
> Depending if the knuma_scand/scan_pass_sleep_millisecs is more or less
> occasional than the visit of a large set of pages it may behave
> differently correct.
>
> Note that every algorithm will have a limit on how smart it can be.
>
> Just to make a random example: if you lookup some pagecache a million
> times and some other pagecache a dozen times, their "aging"
> information in the pagecache will end up identical. Yet we know one
> set of pages is clearly higher priority than the other. We've only so
> many levels of lrus and so many referenced/active bitflags per
> page. Once you get at the top, then all is equal.
>
> Does this mean the "active" list working set detection is useless just
> because we can't differentiate a million of lookups on a few pages, vs
> a dozen of lookups on lots of pages?
>
> Last but not the least, in the very example you mention it's not even
> clear that the process should be scheduled in the CPU where there is
> the small set of pages accessed frequently, or the CPU where there's
> the large set of pages accessed occasionally. If the small sets of
> pages fits in the 8MBytes of the L2 cache, then it's better to put the
> process in the other CPU where the large set of pages can't fit in the
> L2 cache. Lots of hardware details should be evaluated, to really know
> what's the right thing in such case even if it was you having to
> decide.
>
> But the real reason why the above isn't an issue and why we don't need
> to solve that problem perfectly: there's not just a CPU follow memory
> algorithm in AutoNUMA. There's also the memory follow CPU
> algorithm. AutoNUMA will do its best to change the layout of your
> example to one that has only one clear solution: the occasional lookup
> of the large set of pages, will make those eventually go in the node
> together with the small set of pages (or the other way around), and
> this is how it's solved.
>
> In any case, whatever wrong decision it will take, it will at least be
> a better decision than the numa/sched where there's absolutely zero
> information about what pages the process is accessing. And best of all
> with AutoNUMA you also know which pages the _thread_ is accessing so
> it will also be able to take optimal decisions if there are more
> threads than CPUs in a node (as long as not all thread accesses are
> shared).
>
> Hope this explains things better.
> Andrea

Hi Andrea,

Sorry for being so negative, but this problem seems so clear to me.
I might have pointed all these out if you had CC'd me since the first
version; I am not always on the list watching posts....

Sincerely,

Nai

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
@ 2012-06-29 21:02           ` Nai Xia
  0 siblings, 0 replies; 327+ messages in thread
From: Nai Xia @ 2012-06-29 21:02 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Sat, Jun 30, 2012 at 12:30 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> Hi Nai,
>
> On Fri, Jun 29, 2012 at 10:11:35PM +0800, Nai Xia wrote:
>> If one process do very intensive visit of a small set of pages in this
>> node, but occasional visit of a large set of pages in another node.
>> Will this algorithm do a very bad judgment? I guess the answer would
>> be: it's possible and this judgment depends on the racing pattern
>> between the process and your knuma_scand.
>
> Depending if the knuma_scand/scan_pass_sleep_millisecs is more or less
> occasional than the visit of a large set of pages it may behave
> differently correct.
>
> Note that every algorithm will have a limit on how smart it can be.
>
> Just to make a random example: if you lookup some pagecache a million
> times and some other pagecache a dozen times, their "aging"
> information in the pagecache will end up identical. Yet we know one
> set of pages is clearly higher priority than the other. We've only so
> many levels of lrus and so many referenced/active bitflags per
> page. Once you get at the top, then all is equal.
>
> Does this mean the "active" list working set detection is useless just
> because we can't differentiate a million of lookups on a few pages, vs
> a dozen of lookups on lots of pages?
>
> Last but not the least, in the very example you mention it's not even
> clear that the process should be scheduled in the CPU where there is
> the small set of pages accessed frequently, or the CPU where there's
> the large set of pages accessed occasionally. If the small sets of
> pages fits in the 8MBytes of the L2 cache, then it's better to put the
> process in the other CPU where the large set of pages can't fit in the
> L2 cache. Lots of hardware details should be evaluated, to really know
> what's the right thing in such case even if it was you having to
> decide.
>
> But the real reason why the above isn't an issue and why we don't need
> to solve that problem perfectly: there's not just a CPU follow memory
> algorithm in AutoNUMA. There's also the memory follow CPU
> algorithm. AutoNUMA will do its best to change the layout of your
> example to one that has only one clear solution: the occasional lookup
> of the large set of pages, will make those eventually go in the node
> together with the small set of pages (or the other way around), and
> this is how it's solved.
>
> In any case, whatever wrong decision it will take, it will at least be
> a better decision than the numa/sched where there's absolutely zero
> information about what pages the process is accessing. And best of all
> with AutoNUMA you also know which pages the _thread_ is accessing so
> it will also be able to take optimal decisions if there are more
> threads than CPUs in a node (as long as not all thread accesses are
> shared).
>
> Hope this explains things better.
> Andrea

Hi Andrea,

Sorry for being so negative, but this problem seems so clear to me.
I might have pointed all these out if you had CC'd me since the first
version; I am not always on the list watching posts....

Sincerely,

Nai

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-29 20:01             ` Nai Xia
  2012-06-29 20:44               ` Nai Xia
@ 2012-06-30  1:23               ` Andrea Arcangeli
  2012-06-30  2:43                 ` Nai Xia
  2012-07-05 18:07               ` Rik van Riel
  2 siblings, 1 reply; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-30  1:23 UTC (permalink / raw)
  To: Nai Xia
  Cc: Peter Zijlstra, dlaor, Ingo Molnar, Hillf Danton, linux-kernel,
	linux-mm, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Sat, Jun 30, 2012 at 04:01:50AM +0800, Nai Xia wrote:
> On Sat, Jun 30, 2012 at 2:53 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > On Fri, 2012-06-29 at 12:51 -0400, Dor Laor wrote:
> >> The previous comments were not shouts but the mother of all NAKs.
> >
> > I never said any such thing. I just said why should I bother reading
> > your stuff if you're ignoring most my feedback anyway.
> >
> > If you want to read that as a NAK, not my problem.
> 
> Hey guys, Can I say NAK to these patches ?
> 
> Now I aware that this sampling algorithm is completely broken, if we take
> a few seconds to see what it is trying to solve:
> 
> We all know that LRU is try to solve the question of "what are the
> pages recently accessed?",
> so its engouth to use pte bits to approximate.

I made an example about the active list to try to explain why your
example is still going to work fine.

Once a page becomes active (from inactive) and is a referenced active
page, it won't become _very_active_ or _very_very_active_ or more, no
matter how many more times you look up the pagecache.

The LRU order wasn't relevant here.

> However, the numa balancing problem is fundamentally like this:
> 
> In some time unit,
> 
>       W = pages_accessed  *  average_page_access_frequence
> 
> We are trying to move process to the node having max W,  right?

First of all, the mm_autonuma statistics are not a function of time,
and there is no page access frequency there.

mm_autonuma is static information collected by knuma_scand from the
pagetables. That's static and 100% accurate on the whole process and
definitely not generated by the numa hinting page faults. I could shut
off all numa hinting page faults permanently and still generate the
mm_autonuma information identically.

There's a knob in /sys/kernel/mm/autonuma/knuma_scand/working_set that
you can enable if you want to use a "runtime" and not static
information for the mm_autonuma too, but that's not the default for
now (but I think it may be a better default, there wasn't enough time
to test this yet)

The task_autonuma (thread) statistics are the only thing that is
sampled by default in a 10sec interval (the interval is tunable with
sysfs too, and 10sec is likely too aggressive, 30sec sounds better;
we're eventually going to make it dynamic anyway)

So even if you were right, the thread statistics only kicks in to
balance threads against threads of the same process, most of the time
what's more important are the mm_autonuma statistics.
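
As a rough sketch of the two kinds of statistics described above (the
struct and field names here are illustrative only, not the exact ones
used by the patch series):

#include <linux/numa.h>		/* MAX_NUMNODES */

struct mm_autonuma_sketch {
	/*
	 * Whole-process view, rebuilt by knuma_scand from the
	 * pagetables on every pass: static, not sampled.
	 */
	unsigned long pages_on_node[MAX_NUMNODES];
};

struct task_autonuma_sketch {
	/*
	 * Per-thread NUMA footprint, fed by the numa hinting page
	 * faults and decayed with an exponential backoff every pass.
	 */
	unsigned long faults_from_node[MAX_NUMNODES];
};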

But in reality the thread statistics also work perfectly for the job,
as an approximation of the NUMA memory footprint of the thread (vs the
other threads). And then the rest of the memory slowly follows
whatever node CPUs I placed the thread (even if that's not the
absolutely best one at all times).

> Andrea's patch can only approximate the pages_accessed number in a
> time unit(scan interval),
> I don't think it can catch even 1% of  average_page_access_frequence
> on a busy workload.
> Blindly assuming that all the pages'  average_page_access_frequence is
> the same is seemly
> broken to me.

All we need is an approximation to take a better than random decision,
even if you get it 1% right, it's still better than 0% right by going
blind. Your 1% is too pessimistic, in my tests the thread statistics
are more like >90% correct in average (I monitor them with the debug
mode constantly).

If this 1% right case happens once in a million samples, who cares;
it's not going to run measurably slower anyway (and it will still be
better than picking a 0% right node).

What you're saying is that because the active list in the pagecache
won't differentiate between 10 cache hits and 20 cache hits, we should
drop the active list and stop activating pages and just treat them
all the same because in some unlucky access pattern, the active list
may only get right 1% of the working set. But there's a reason why the
active list exists even though it may get things wrong in some corner case
and possibly leave the large amount of pages accessed infrequently in
the inactive list forever (even if it gets things only 1% right in
those worst cases, it's still better than 0% right and no active list
at all).

To say it in another way, you may still crash with the car even if
you're careful, but do you think it's better to watch at the street or
to drive blindfolded?

numa/sched drives blindfolded, autonuma watches around every 10sec
very carefully for the best next turn to take with the car and to
avoid obstacles, you can imagine who wins.

Watching the street carefully every 10sec doesn't mean the next moment
a missile won't hit your car to make you crash, you're still having
better chances not to crash than by driving blindfolded.

numa/sched pretends to compete without collecting information for the
NUMA thread memory footprint (task_autonuma, sampled with an
exponential backoff at 10sec intervals), and without process
information (full static information from the pagetables, not
sampled). No matter how you compute stuff, if you've nothing
meaningful in input to your algorithm you lose. And it looks like you
believe that you can take better decisions with nothing in input to
your NUMA placement algorithm, because my thread info (task_autonuma)
isn't 100% perfect at all times and it can't predict the future. The
alternative is to get that information from syscalls, but even
ignoring the -ENOMEM from split_vma, that will lead to userland bugs
and overall the task_autonuma information may be more reliable in the
end, even if it's sampled using an exponential backoff.

Also note the exponential backoff thing: it's not really just the last
interval, it's the last interval plus half of the previous interval
plus 1/4 of the one before that, etc..., and we can trivially control
the decay.
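
A minimal sketch of that decay (names are illustrative; the real code
differs in the details):

/*
 * Every sampling pass: halve the accumulated per-node fault counts,
 * then add the faults seen in the last interval.  The result weighs
 * the last interval fully, the one before it by 1/2, the one before
 * that by 1/4, and so on.
 */
static void decay_and_add(unsigned long *acc, const unsigned long *last,
			  int nr_nodes)
{
	int nid;

	for (nid = 0; nid < nr_nodes; nid++) {
		acc[nid] >>= 1;			/* exponential backoff */
		acc[nid] += last[nid];		/* last interval at full weight */
	}
}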

All we need is a direction, and knowing _exactly_ what the task did
over the last 10 seconds (even if it can't predict what the thread
will do in the next 1sec) is enough to get one. After we take the
direction the memory will follow, so
we cannot care less what it does in the next second because that will
follow the CPU (after a while, last_nid anti-false-sharing logic
permitting), and at least we'll know for sure that the memory accessed
in the last 10sec is already local and that defines the best node to
schedule the thread.

I don't mean there's no room for improvement in the way the input data
can be computed, and even in the way the input data can be generated;
the exponential backoff decay can be tuned too. I just tried to do the
simplest computations on the data to make the workloads converge fast
and you're welcome to contribute.

But I believe the task_autonuma information is extremely valuable and
we can trust it very much knowing we'll get a great placement. The
concern you have isn't invalid, but it's a very minor one and the
sampling rate effects you are concerned about, while real, are lost
in the noise in practice.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-30  1:23               ` Andrea Arcangeli
@ 2012-06-30  2:43                 ` Nai Xia
  2012-06-30  5:48                   ` Dor Laor
  2012-06-30 12:48                   ` Andrea Arcangeli
  0 siblings, 2 replies; 327+ messages in thread
From: Nai Xia @ 2012-06-30  2:43 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Peter Zijlstra, dlaor, Ingo Molnar, Hillf Danton, linux-kernel,
	linux-mm, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Sat, Jun 30, 2012 at 9:23 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> On Sat, Jun 30, 2012 at 04:01:50AM +0800, Nai Xia wrote:
>> On Sat, Jun 30, 2012 at 2:53 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>> > On Fri, 2012-06-29 at 12:51 -0400, Dor Laor wrote:
>> >> The previous comments were not shouts but the mother of all NAKs.
>> >
>> > I never said any such thing. I just said why should I bother reading
>> > your stuff if you're ignoring most my feedback anyway.
>> >
>> > If you want to read that as a NAK, not my problem.
>>
>> Hey guys, Can I say NAK to these patches ?
>>
>> Now I aware that this sampling algorithm is completely broken, if we take
>> a few seconds to see what it is trying to solve:
>>
>> We all know that LRU is try to solve the question of "what are the
>> pages recently accessed?",
>> so its engouth to use pte bits to approximate.
>
> I made an example about the active list to try to explain it why your
> example is still going to work fine.
>
> After it becomes active (from inactive) and it's being a referenced
> active page, it won't become _very_active_ or _very_very_active_ or
> more no matter how many more times you look up the pagecache.
>
> The LRU order wasn't relevant here.
>
>> However, the numa balancing problem is fundamentally like this:
>>
>> In some time unit,
>>
>>       W = pages_accessed  *  average_page_access_frequence
>>
>> We are trying to move process to the node having max W,  right?
>
> First of all, the mm_autonuma statistics are not in function of time
> and there is no page access frequency there.
>
> mm_autonuma is static information collected by knuma_scand from the
> pagetables. That's static and 100% accurate on the whole process and
> definitely not generated by the numa hinting page faults. I could shut
> off all numa hinting page faults permanently and still generate the
> mm_autonuma information identically.
>
> There's a knob in /sys/kernel/mm/autonuma/knuma_scand/working_set that
> you can enable if you want to use a "runtime" and not static
> information for the mm_autonuma too, but that's not the default for
> now (but I think it may be a better default, there wasn't enough time
> to test this yet)
>
> The task_autonuma (thread) statistics are the only thing that is
> sampled by default in a 10sec interval (the interval tunable too with
> sysfs, and 10sec is likely too aggressive, 30sec sounds better, we're
> eventually going to make it dynamic anyway)
>
> So even if you were right, the thread statistics only kicks in to
> balance threads against threads of the same process, most of the time
> what's more important are the mm_autonuma statistics.
>
> But in reality the thread statistics also works perfectly for the job,
> as an approximation of the NUMA memory footprint of the thread (vs the
> other threads). And then the rest of the memory slowly follows
> whatever node CPUs I placed the thread (even if that's not the
> absolutely best one at all times).
>
>> Andrea's patch can only approximate the pages_accessed number in a
>> time unit(scan interval),
>> I don't think it can catch even 1% of  average_page_access_frequence
>> on a busy workload.
>> Blindly assuming that all the pages'  average_page_access_frequence is
>> the same is seemly
>> broken to me.
>
> All we need is an approximation to take a better than random decision,
> even if you get it 1% right, it's still better than 0% right by going
> blind. Your 1% is too pessimistic, in my tests the thread statistics
> are more like >90% correct in average (I monitor them with the debug
> mode constantly).
>
> If this 1% right, happens one a million samples, who cares, it's not
> going to run measurably slower anyway (and it will still be better
> than picking a 0% right node).
>
> What you're saying is that because the active list in the pagecache
> won't differentiate between 10 cache hits and 20 cache hits, we should
> drop the active list and stop activating pages and just threat them
> all the same because in some unlucky access pattern, the active list
> may only get right 1% of the working set. But there's a reason why the
> active list exists despite it may get things wrong in some corner case
> and possibly leave the large amount of pages accessed infrequently in
> the inactive list forever (even if it gets things only 1% right in
> those worst cases, it's still better than 0% right and no active list
> at all).
>
> To say it in another way, you may still crash with the car even if
> you're careful, but do you think it's better to watch at the street or
> to drive blindfolded?
>
> numa/sched drives blindfolded, autonuma watches around every 10sec
> very carefully for the best next turn to take with the car and to
> avoid obstacles, you can imagine who wins.
>
> Watching the street carefully every 10sec doesn't mean the next moment
> a missile won't hit your car to make you crash, you're still having
> better chances not to crash than by driving blindfolded.
>
> numa/sched pretends to compete without collecting information for the
> NUMA thread memory footprint (task_autonuma, sampled with a
> exponential backoff at 10sec intervals), and without process
> information (full static information from the pagetables, not
> sampled). No matter how you compute stuff, if you've nothing
> meaningful in input to your algorithm you lose. And it looks like you
> believe that you can take better decisions with nothing in input to
> your NUMA placement algorithm, because my thread info (task_autonuma)
> isn't 100% perfect at all times and it can't predict the future. The
> alternative is to get that information from syscalls, but even
> ignoring the -ENOMEM from split_vma, that will lead to userland bugs
> and overall the task_autonuma information may be more reliable in the
> end, even if it's sampled using an exponential backoff.
>
> Also note the exponential backoff thing, it's not really the last
> interval, it's the last interval plus half the previous interval plus
> 1/4 the previous interval etc... and we can trivially control the
> decay.
>
> All we need is to get a direction and knowing _exactly_ what the task
> did over the last 10 seconds (even if it can't predict the future of
> what the thread will do in the next 1sec), is all we need to get a
> direction. After we take the direction then the memory will follow so
> we cannot care less what it does in the next second because that will
> follow the CPU (after a while, last_nid anti-false-sharing logic
> permitting), and at least we'll know for sure that the memory accessed
> in the last 10sec is already local and that defines the best node to
> schedule the thread.
>
> I don't mean there's no room for improvement in the way the input data
> can be computed, and even in the way the input data can be generated,
> the exponential backoff decay can be tuned too, I just tried to do the
> simplest computations on the data to make the workloads converge fast
> and you're welcome to contribute.
>
> But I believe the task_autonuma information is extremely valuable and
> we can trust it very much knowing we'll get a great placement. The
> concern you have isn't invalid, but it's a very minor one and the
> sampling rate effects you are concerned about, while real, they're
> lost in the noise in practice.

Well, I think I am not convinced by this many words. And surely
I will NOT follow your reasoning of "having information is always
better than nothing". We all know that a badly biased balancing is
worse than randomness: at least randomness means "average, fair
play, ...".
With all uncertain things, I think only a comprehensive survey
of real world workloads can tell if my concern is significant or not.

So I think my suggestion to you is: show the world some solid and
sound real-world proof that your approximation is > 90% accurate, just
like the pioneers already did for LRU (this problem is surely different
from LRU). Tons of words will not do this.

Thanks,

Nai

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 04/40] xen: document Xen is using an unused bit for the pagetables
  2012-06-28 12:55   ` Andrea Arcangeli
@ 2012-06-30  4:47     ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 327+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-06-30  4:47 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Jun 28, 2012 at 02:55:44PM +0200, Andrea Arcangeli wrote:
> Xen has taken over the last reserved bit available for the pagetables

Some time ago when I saw this patch I asked about it (if there is a
way to actually stop using this bit) and you mentioned it is not the
last bit available for pagetables. Perhaps you should alter the comment
in this description?

> which is set through ioremap, this documents it and makes the code

It actually is set through ioremap, gntdev (to map another guest's
memory), and on pfns which fall in the non-RAM and gap sections of the
E820.

> more readable.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  arch/x86/include/asm/pgtable_types.h |   11 +++++++++--
>  1 files changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index 013286a..b74cac9 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -17,7 +17,7 @@
>  #define _PAGE_BIT_PAT		7	/* on 4KB pages */
>  #define _PAGE_BIT_GLOBAL	8	/* Global TLB entry PPro+ */
>  #define _PAGE_BIT_UNUSED1	9	/* available for programmer */
> -#define _PAGE_BIT_IOMAP		10	/* flag used to indicate IO mapping */
> +#define _PAGE_BIT_UNUSED2	10
>  #define _PAGE_BIT_HIDDEN	11	/* hidden by kmemcheck */
>  #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
>  #define _PAGE_BIT_SPECIAL	_PAGE_BIT_UNUSED1
> @@ -41,7 +41,7 @@
>  #define _PAGE_PSE	(_AT(pteval_t, 1) << _PAGE_BIT_PSE)
>  #define _PAGE_GLOBAL	(_AT(pteval_t, 1) << _PAGE_BIT_GLOBAL)
>  #define _PAGE_UNUSED1	(_AT(pteval_t, 1) << _PAGE_BIT_UNUSED1)
> -#define _PAGE_IOMAP	(_AT(pteval_t, 1) << _PAGE_BIT_IOMAP)
> +#define _PAGE_UNUSED2	(_AT(pteval_t, 1) << _PAGE_BIT_UNUSED2)
>  #define _PAGE_PAT	(_AT(pteval_t, 1) << _PAGE_BIT_PAT)
>  #define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
>  #define _PAGE_SPECIAL	(_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
> @@ -49,6 +49,13 @@
>  #define _PAGE_SPLITTING	(_AT(pteval_t, 1) << _PAGE_BIT_SPLITTING)
>  #define __HAVE_ARCH_PTE_SPECIAL
>  
> +/* flag used to indicate IO mapping */
> +#ifdef CONFIG_XEN
> +#define _PAGE_IOMAP	(_AT(pteval_t, 1) << _PAGE_BIT_UNUSED2)
> +#else
> +#define _PAGE_IOMAP	(_AT(pteval_t, 0))
> +#endif
> +
>  #ifdef CONFIG_KMEMCHECK
>  #define _PAGE_HIDDEN	(_AT(pteval_t, 1) << _PAGE_BIT_HIDDEN)
>  #else
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 04/40] xen: document Xen is using an unused bit for the pagetables
@ 2012-06-30  4:47     ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 327+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-06-30  4:47 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Jun 28, 2012 at 02:55:44PM +0200, Andrea Arcangeli wrote:
> Xen has taken over the last reserved bit available for the pagetables

Some time ago when I saw this patch I asked about it (if there is a
way to actually stop using this bit) and you mentioned it is not the
last bit available for pagetables. Perhaps you should alter the comment
in this description?

> which is set through ioremap, this documents it and makes the code

It actually is set through ioremap, gntdev (to map another guest's
memory), and on pfns which fall in the non-RAM and gap sections of the
E820.

> more readable.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  arch/x86/include/asm/pgtable_types.h |   11 +++++++++--
>  1 files changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index 013286a..b74cac9 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -17,7 +17,7 @@
>  #define _PAGE_BIT_PAT		7	/* on 4KB pages */
>  #define _PAGE_BIT_GLOBAL	8	/* Global TLB entry PPro+ */
>  #define _PAGE_BIT_UNUSED1	9	/* available for programmer */
> -#define _PAGE_BIT_IOMAP		10	/* flag used to indicate IO mapping */
> +#define _PAGE_BIT_UNUSED2	10
>  #define _PAGE_BIT_HIDDEN	11	/* hidden by kmemcheck */
>  #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
>  #define _PAGE_BIT_SPECIAL	_PAGE_BIT_UNUSED1
> @@ -41,7 +41,7 @@
>  #define _PAGE_PSE	(_AT(pteval_t, 1) << _PAGE_BIT_PSE)
>  #define _PAGE_GLOBAL	(_AT(pteval_t, 1) << _PAGE_BIT_GLOBAL)
>  #define _PAGE_UNUSED1	(_AT(pteval_t, 1) << _PAGE_BIT_UNUSED1)
> -#define _PAGE_IOMAP	(_AT(pteval_t, 1) << _PAGE_BIT_IOMAP)
> +#define _PAGE_UNUSED2	(_AT(pteval_t, 1) << _PAGE_BIT_UNUSED2)
>  #define _PAGE_PAT	(_AT(pteval_t, 1) << _PAGE_BIT_PAT)
>  #define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
>  #define _PAGE_SPECIAL	(_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
> @@ -49,6 +49,13 @@
>  #define _PAGE_SPLITTING	(_AT(pteval_t, 1) << _PAGE_BIT_SPLITTING)
>  #define __HAVE_ARCH_PTE_SPECIAL
>  
> +/* flag used to indicate IO mapping */
> +#ifdef CONFIG_XEN
> +#define _PAGE_IOMAP	(_AT(pteval_t, 1) << _PAGE_BIT_UNUSED2)
> +#else
> +#define _PAGE_IOMAP	(_AT(pteval_t, 0))
> +#endif
> +
>  #ifdef CONFIG_KMEMCHECK
>  #define _PAGE_HIDDEN	(_AT(pteval_t, 1) << _PAGE_BIT_HIDDEN)
>  #else
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 09/40] autonuma: introduce kthread_bind_node()
  2012-06-28 12:55   ` Andrea Arcangeli
@ 2012-06-30  4:50     ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 327+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-06-30  4:50 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Jun 28, 2012 at 02:55:49PM +0200, Andrea Arcangeli wrote:
> This function makes it easy to bind the per-node knuma_migrated
> threads to their respective NUMA nodes. Those threads take memory from
> the other nodes (in round robin with a incoming queue for each remote
> node) and they move that memory to their local node.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  include/linux/kthread.h |    1 +
>  include/linux/sched.h   |    2 +-
>  kernel/kthread.c        |   23 +++++++++++++++++++++++
>  3 files changed, 25 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/kthread.h b/include/linux/kthread.h
> index 0714b24..e733f97 100644
> --- a/include/linux/kthread.h
> +++ b/include/linux/kthread.h
> @@ -33,6 +33,7 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
>  })
>  
>  void kthread_bind(struct task_struct *k, unsigned int cpu);
> +void kthread_bind_node(struct task_struct *p, int nid);
>  int kthread_stop(struct task_struct *k);
>  int kthread_should_stop(void);
>  bool kthread_freezable_should_stop(bool *was_frozen);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 4059c0f..699324c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1792,7 +1792,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
>  #define PF_SWAPWRITE	0x00800000	/* Allowed to write to swap */
>  #define PF_SPREAD_PAGE	0x01000000	/* Spread page cache over cpuset */
>  #define PF_SPREAD_SLAB	0x02000000	/* Spread some slab caches over cpuset */
> -#define PF_THREAD_BOUND	0x04000000	/* Thread bound to specific cpu */
> +#define PF_THREAD_BOUND	0x04000000	/* Thread bound to specific cpus */
>  #define PF_MCE_EARLY    0x08000000      /* Early kill for mce process policy */
>  #define PF_MEMPOLICY	0x10000000	/* Non-default NUMA mempolicy */
>  #define PF_MUTEX_TESTER	0x20000000	/* Thread belongs to the rt mutex tester */
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 3d3de63..48b36f9 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -234,6 +234,29 @@ void kthread_bind(struct task_struct *p, unsigned int cpu)
>  EXPORT_SYMBOL(kthread_bind);
>  
>  /**
> + * kthread_bind_node - bind a just-created kthread to the CPUs of a node.
> + * @p: thread created by kthread_create().
> + * @nid: node (might not be online, must be possible) for @k to run on.
> + *
> + * Description: This function is equivalent to set_cpus_allowed(),
> + * except that @nid doesn't need to be online, and the thread must be
> + * stopped (i.e., just returned from kthread_create()).
> + */
> +void kthread_bind_node(struct task_struct *p, int nid)
> +{
> +	/* Must have done schedule() in kthread() before we set_task_cpu */
> +	if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
> +		WARN_ON(1);
> +		return;
> +	}
> +
> +	/* It's safe because the task is inactive. */
> +	do_set_cpus_allowed(p, cpumask_of_node(nid));
> +	p->flags |= PF_THREAD_BOUND;
> +}
> +EXPORT_SYMBOL(kthread_bind_node);

_GPL ?
> +
> +/**
>   * kthread_stop - stop a thread created by kthread_create().
>   * @k: thread created by kthread_create().
>   *
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 11/40] autonuma: define the autonuma flags
  2012-06-28 12:55   ` Andrea Arcangeli
@ 2012-06-30  4:58     ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 327+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-06-30  4:58 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Jun 28, 2012 at 02:55:51PM +0200, Andrea Arcangeli wrote:
> These flags are the ones tweaked through sysfs, they control the
> behavior of autonuma, from enabling disabling it, to selecting various
> runtime options.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  include/linux/autonuma_flags.h |   62 ++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 62 insertions(+), 0 deletions(-)
>  create mode 100644 include/linux/autonuma_flags.h
> 
> diff --git a/include/linux/autonuma_flags.h b/include/linux/autonuma_flags.h
> new file mode 100644
> index 0000000..5e29a75
> --- /dev/null
> +++ b/include/linux/autonuma_flags.h
> @@ -0,0 +1,62 @@
> +#ifndef _LINUX_AUTONUMA_FLAGS_H
> +#define _LINUX_AUTONUMA_FLAGS_H
> +
> +enum autonuma_flag {

These aren't really flags. They are bit-fields.
> +	AUTONUMA_FLAG,

Looking at the code, this is the flag that turns AutoNUMA on. Perhaps a
better name, such as AUTONUMA_ACTIVE_FLAG?


> +	AUTONUMA_IMPOSSIBLE_FLAG,
> +	AUTONUMA_DEBUG_FLAG,
> +	AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG,

I might have gotten my math wrong, but if you have
AUTONUMA_SCHED_LOAD_BALANCE.. set (so 3), that also means
that bits 0 and 1 are on. In other words AUTONUMA_FLAG
and AUTONUMA_IMPOSSIBLE_FLAG are turned on.

> +	AUTONUMA_SCHED_CLONE_RESET_FLAG,
> +	AUTONUMA_SCHED_FORK_RESET_FLAG,
> +	AUTONUMA_SCAN_PMD_FLAG,

So this is 6, which is binary 110. So AUTONUMA_FLAG
gets turned off.

You definitely want to convert these to #defines or
at least define the proper numbers.
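
For illustration, one way to "define the proper numbers" would be to give
the enum members explicit values. Note that these values are bit numbers
passed to test_bit() in the helpers below, not masks, so the explicit
assignment in this sketch is purely documentation (it mirrors the quoted
enum, nothing more):

	enum autonuma_flag {
		AUTONUMA_FLAG				= 0,
		AUTONUMA_IMPOSSIBLE_FLAG		= 1,
		AUTONUMA_DEBUG_FLAG			= 2,
		AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG	= 3,
		AUTONUMA_SCHED_CLONE_RESET_FLAG		= 4,
		AUTONUMA_SCHED_FORK_RESET_FLAG		= 5,
		AUTONUMA_SCAN_PMD_FLAG			= 6,
		AUTONUMA_SCAN_USE_WORKING_SET_FLAG	= 7,
		AUTONUMA_MIGRATE_DEFER_FLAG		= 8,
	};
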
> +	AUTONUMA_SCAN_USE_WORKING_SET_FLAG,
> +	AUTONUMA_MIGRATE_DEFER_FLAG,
> +};
> +
> +extern unsigned long autonuma_flags;
> +
> +static inline bool autonuma_enabled(void)
> +{
> +	return !!test_bit(AUTONUMA_FLAG, &autonuma_flags);
> +}
> +
> +static inline bool autonuma_debug(void)
> +{
> +	return !!test_bit(AUTONUMA_DEBUG_FLAG, &autonuma_flags);
> +}
> +
> +static inline bool autonuma_sched_load_balance_strict(void)
> +{
> +	return !!test_bit(AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG,
> +			  &autonuma_flags);
> +}
> +
> +static inline bool autonuma_sched_clone_reset(void)
> +{
> +	return !!test_bit(AUTONUMA_SCHED_CLONE_RESET_FLAG,
> +			  &autonuma_flags);
> +}
> +
> +static inline bool autonuma_sched_fork_reset(void)
> +{
> +	return !!test_bit(AUTONUMA_SCHED_FORK_RESET_FLAG,
> +			  &autonuma_flags);
> +}
> +
> +static inline bool autonuma_scan_pmd(void)
> +{
> +	return !!test_bit(AUTONUMA_SCAN_PMD_FLAG, &autonuma_flags);
> +}
> +
> +static inline bool autonuma_scan_use_working_set(void)
> +{
> +	return !!test_bit(AUTONUMA_SCAN_USE_WORKING_SET_FLAG,
> +			  &autonuma_flags);
> +}
> +
> +static inline bool autonuma_migrate_defer(void)
> +{
> +	return !!test_bit(AUTONUMA_MIGRATE_DEFER_FLAG, &autonuma_flags);
> +}
> +
> +#endif /* _LINUX_AUTONUMA_FLAGS_H */
> 

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 11/40] autonuma: define the autonuma flags
  2012-06-28 12:55   ` Andrea Arcangeli
@ 2012-06-30  5:01     ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 327+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-06-30  5:01 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Jun 28, 2012 at 02:55:51PM +0200, Andrea Arcangeli wrote:
> These flags are the ones tweaked through sysfs, they control the
> behavior of autonuma, from enabling disabling it, to selecting various
> runtime options.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  include/linux/autonuma_flags.h |   62 ++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 62 insertions(+), 0 deletions(-)
>  create mode 100644 include/linux/autonuma_flags.h
> 
> diff --git a/include/linux/autonuma_flags.h b/include/linux/autonuma_flags.h
> new file mode 100644
> index 0000000..5e29a75
> --- /dev/null
> +++ b/include/linux/autonuma_flags.h
> @@ -0,0 +1,62 @@
> +#ifndef _LINUX_AUTONUMA_FLAGS_H
> +#define _LINUX_AUTONUMA_FLAGS_H
> +
> +enum autonuma_flag {
> +	AUTONUMA_FLAG,
> +	AUTONUMA_IMPOSSIBLE_FLAG,
> +	AUTONUMA_DEBUG_FLAG,
> +	AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG,
> +	AUTONUMA_SCHED_CLONE_RESET_FLAG,
> +	AUTONUMA_SCHED_FORK_RESET_FLAG,
> +	AUTONUMA_SCAN_PMD_FLAG,
> +	AUTONUMA_SCAN_USE_WORKING_SET_FLAG,
> +	AUTONUMA_MIGRATE_DEFER_FLAG,
> +};
> +
> +extern unsigned long autonuma_flags;

I could not find this variable in the preceding patches. Which patch
actually uses it?

Also, is there a way to keep the AutoNUMA framework from initializing
at all? Hold that thought, it is probably in some of the other patches.
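
For what it's worth, a minimal sketch of the kind of boot-time kill switch
being asked about could look like this (hypothetical: it assumes the
AUTONUMA_IMPOSSIBLE_FLAG bit is what the rest of the series checks before
allocating anything, and that <linux/init.h> is already included):

	static int __init noautonuma_setup(char *str)
	{
		/* hypothetical: mark AutoNUMA as permanently unavailable */
		set_bit(AUTONUMA_IMPOSSIBLE_FLAG, &autonuma_flags);
		return 1;
	}
	__setup("noautonuma", noautonuma_setup);
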

> +
> +static inline bool autonuma_enabled(void)
> +{
> +	return !!test_bit(AUTONUMA_FLAG, &autonuma_flags);
> +}
> +
> +static inline bool autonuma_debug(void)
> +{
> +	return !!test_bit(AUTONUMA_DEBUG_FLAG, &autonuma_flags);
> +}
> +
> +static inline bool autonuma_sched_load_balance_strict(void)
> +{
> +	return !!test_bit(AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG,
> +			  &autonuma_flags);
> +}
> +
> +static inline bool autonuma_sched_clone_reset(void)
> +{
> +	return !!test_bit(AUTONUMA_SCHED_CLONE_RESET_FLAG,
> +			  &autonuma_flags);
> +}
> +
> +static inline bool autonuma_sched_fork_reset(void)
> +{
> +	return !!test_bit(AUTONUMA_SCHED_FORK_RESET_FLAG,
> +			  &autonuma_flags);
> +}
> +
> +static inline bool autonuma_scan_pmd(void)
> +{
> +	return !!test_bit(AUTONUMA_SCAN_PMD_FLAG, &autonuma_flags);
> +}
> +
> +static inline bool autonuma_scan_use_working_set(void)
> +{
> +	return !!test_bit(AUTONUMA_SCAN_USE_WORKING_SET_FLAG,
> +			  &autonuma_flags);
> +}
> +
> +static inline bool autonuma_migrate_defer(void)
> +{
> +	return !!test_bit(AUTONUMA_MIGRATE_DEFER_FLAG, &autonuma_flags);
> +}
> +
> +#endif /* _LINUX_AUTONUMA_FLAGS_H */
> 

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 18/40] autonuma: call autonuma_setup_new_exec()
  2012-06-28 12:55   ` Andrea Arcangeli
@ 2012-06-30  5:04     ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 327+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-06-30  5:04 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Jun 28, 2012 at 02:55:58PM +0200, Andrea Arcangeli wrote:
> This resets all per-thread and per-process statistics across exec
> syscalls or after kernel threads detached from the mm. The past
> statistical NUMA information is unlikely to be relevant for the future
> in these cases.

The previous patch mentioned that it can run in bypass mode. Is
this also able to do so? Meaning that these calls end up being no-ops?

Thanks!
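
For concreteness, the kind of runtime bypass being asked about would make
the call a no-op when AutoNUMA is disabled, along these lines (a sketch
only; autonuma_enabled() is the flags helper from earlier in the series,
while the reset helper name and the task_autonuma field are made up here):

	void autonuma_setup_new_exec(struct task_struct *p)
	{
		if (!autonuma_enabled())
			return;		/* nothing to reset, AutoNUMA is off */
		if (p->task_autonuma)
			task_autonuma_reset(p->task_autonuma);	/* hypothetical */
	}
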
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  fs/exec.c        |    3 +++
>  mm/mmu_context.c |    2 ++
>  2 files changed, 5 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index da27b91..146ced2 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -55,6 +55,7 @@
>  #include <linux/pipe_fs_i.h>
>  #include <linux/oom.h>
>  #include <linux/compat.h>
> +#include <linux/autonuma.h>
>  
>  #include <asm/uaccess.h>
>  #include <asm/mmu_context.h>
> @@ -1172,6 +1173,8 @@ void setup_new_exec(struct linux_binprm * bprm)
>  			
>  	flush_signal_handlers(current, 0);
>  	flush_old_files(current->files);
> +
> +	autonuma_setup_new_exec(current);
>  }
>  EXPORT_SYMBOL(setup_new_exec);
>  
> diff --git a/mm/mmu_context.c b/mm/mmu_context.c
> index 3dcfaf4..40f0f13 100644
> --- a/mm/mmu_context.c
> +++ b/mm/mmu_context.c
> @@ -7,6 +7,7 @@
>  #include <linux/mmu_context.h>
>  #include <linux/export.h>
>  #include <linux/sched.h>
> +#include <linux/autonuma.h>
>  
>  #include <asm/mmu_context.h>
>  
> @@ -58,5 +59,6 @@ void unuse_mm(struct mm_struct *mm)
>  	/* active_mm is still 'mm' */
>  	enter_lazy_tlb(mm, tsk);
>  	task_unlock(tsk);
> +	autonuma_setup_new_exec(tsk);
>  }
>  EXPORT_SYMBOL_GPL(unuse_mm);
> 

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 19/40] autonuma: alloc/free/init sched_autonuma
  2012-06-28 12:55   ` Andrea Arcangeli
@ 2012-06-30  5:10     ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 327+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-06-30  5:10 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Jun 28, 2012 at 02:55:59PM +0200, Andrea Arcangeli wrote:
> This is where the dynamically allocated sched_autonuma structure is
> being handled.
> 
> The reason for keeping this outside of the task_struct besides not
> using too much kernel stack, is to only allocate it on NUMA
> hardware. So the not NUMA hardware only pays the memory of a pointer
> in the kernel stack (which remains NULL at all times in that case). 

.. snip..
> +	if (unlikely(alloc_task_autonuma(tsk, orig, node)))
> +		/* free_thread_info() undoes arch_dup_task_struct() too */
> +		goto out_thread_info;
>  

That looks (without seeing the implementation, and just from reading the
git commit) like on non-NUMA machines it would fail - and end up
stopping the creation of a task.

Perhaps a better name for the function would be alloc_always_task_autonuma,
since the function (at least from the description of this patch) will
always succeed. Perhaps even remove the
"if (unlikely(..))" bit?


^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 20/40] autonuma: alloc/free/init mm_autonuma
  2012-06-28 12:56   ` Andrea Arcangeli
@ 2012-06-30  5:12     ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 327+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-06-30  5:12 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Jun 28, 2012 at 02:56:00PM +0200, Andrea Arcangeli wrote:
> This is where the mm_autonuma structure is being handled. Just like
> sched_autonuma, this is only allocated at runtime if the hardware the
> kernel is running on has been detected as NUMA. On not NUMA hardware

I think the correct wording is "non-NUMA", not "not NUMA".

> the memory cost is reduced to one pointer per mm.
> 
> To get rid of the pointer in the each mm, the kernel can be compiled
> with CONFIG_AUTONUMA=n.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  kernel/fork.c |    7 +++++++
>  1 files changed, 7 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 0adbe09..3e5a0d9 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -527,6 +527,8 @@ static void mm_init_aio(struct mm_struct *mm)
>  
>  static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
>  {
> +	if (unlikely(alloc_mm_autonuma(mm)))
> +		goto out_free_mm;

So reading that, I would think that on non-NUMA machines this would fail
(since there is nothing to allocate). But that is not the case
(I hope!?). Perhaps just make the function not return a value?
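
A sketch of the no-return-value variant being suggested (hypothetical: the
mm_autonuma field name and the fall-back behaviour on allocation failure
are assumptions, not necessarily what the series actually does):

	void alloc_mm_autonuma(struct mm_struct *mm)
	{
		mm->mm_autonuma = NULL;
		if (autonuma_impossible())
			return;		/* non-NUMA: nothing to allocate */
		mm->mm_autonuma = kzalloc(sizeof(*mm->mm_autonuma), GFP_KERNEL);
		/* if the allocation fails, AutoNUMA simply stays off for this mm */
	}
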


^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 36/40] autonuma: page_autonuma
  2012-06-28 12:56   ` Andrea Arcangeli
@ 2012-06-30  5:24     ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 327+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-06-30  5:24 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Jun 28, 2012 at 02:56:16PM +0200, Andrea Arcangeli wrote:
> Move the AutoNUMA per page information from the "struct page" to a
> separate page_autonuma data structure allocated in the memsection
> (with sparsemem) or in the pgdat (with flatmem).
> 
> This is done to avoid growing the size of the "struct page" and the
> page_autonuma data is only allocated if the kernel has been booted on
> real NUMA hardware (or if noautonuma is passed as parameter to the
> kernel).
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  include/linux/autonuma.h       |   18 +++-
>  include/linux/autonuma_flags.h |    6 +
>  include/linux/autonuma_types.h |   55 ++++++++++
>  include/linux/mm_types.h       |   26 -----
>  include/linux/mmzone.h         |   14 +++-
>  include/linux/page_autonuma.h  |   53 +++++++++
>  init/main.c                    |    2 +
>  mm/Makefile                    |    2 +-
>  mm/autonuma.c                  |   98 ++++++++++-------
>  mm/huge_memory.c               |   26 +++--
>  mm/page_alloc.c                |   21 +---
>  mm/page_autonuma.c             |  234 ++++++++++++++++++++++++++++++++++++++++
>  mm/sparse.c                    |  126 ++++++++++++++++++++-
>  13 files changed, 577 insertions(+), 104 deletions(-)
>  create mode 100644 include/linux/page_autonuma.h
>  create mode 100644 mm/page_autonuma.c
> 
> diff --git a/include/linux/autonuma.h b/include/linux/autonuma.h
> index 85ca5eb..67af86a 100644
> --- a/include/linux/autonuma.h
> +++ b/include/linux/autonuma.h
> @@ -7,15 +7,26 @@
>  
>  extern void autonuma_enter(struct mm_struct *mm);
>  extern void autonuma_exit(struct mm_struct *mm);
> -extern void __autonuma_migrate_page_remove(struct page *page);
> +extern void __autonuma_migrate_page_remove(struct page *,
> +					   struct page_autonuma *);
>  extern void autonuma_migrate_split_huge_page(struct page *page,
>  					     struct page *page_tail);
>  extern void autonuma_setup_new_exec(struct task_struct *p);
> +extern struct page_autonuma *lookup_page_autonuma(struct page *page);
>  
>  static inline void autonuma_migrate_page_remove(struct page *page)
>  {
> -	if (ACCESS_ONCE(page->autonuma_migrate_nid) >= 0)
> -		__autonuma_migrate_page_remove(page);
> +	struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
> +	if (ACCESS_ONCE(page_autonuma->autonuma_migrate_nid) >= 0)
> +		__autonuma_migrate_page_remove(page, page_autonuma);
> +}
> +
> +static inline void autonuma_free_page(struct page *page)
> +{
> +	if (!autonuma_impossible()) {

I think you are better off using a different name.

Perhaps 'if (autonuma_on())'?
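
Something along these lines, i.e. a hypothetical helper that simply inverts
the existing autonuma_impossible() test so the call sites avoid the double
negative:

	static inline bool autonuma_on(void)
	{
		return !autonuma_impossible();
	}
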

> +		autonuma_migrate_page_remove(page);
> +		lookup_page_autonuma(page)->autonuma_last_nid = -1;
> +	}
>  }
>  
>  #define autonuma_printk(format, args...) \
> @@ -29,6 +40,7 @@ static inline void autonuma_migrate_page_remove(struct page *page) {}
>  static inline void autonuma_migrate_split_huge_page(struct page *page,
>  						    struct page *page_tail) {}
>  static inline void autonuma_setup_new_exec(struct task_struct *p) {}
> +static inline void autonuma_free_page(struct page *page) {}
>  
>  #endif /* CONFIG_AUTONUMA */
>  
> diff --git a/include/linux/autonuma_flags.h b/include/linux/autonuma_flags.h
> index 5e29a75..035d993 100644
> --- a/include/linux/autonuma_flags.h
> +++ b/include/linux/autonuma_flags.h
> @@ -15,6 +15,12 @@ enum autonuma_flag {
>  
>  extern unsigned long autonuma_flags;
>  
> +static inline bool autonuma_impossible(void)
> +{
> +	return num_possible_nodes() <= 1 ||
> +		test_bit(AUTONUMA_IMPOSSIBLE_FLAG, &autonuma_flags);
> +}
> +
>  static inline bool autonuma_enabled(void)
>  {
>  	return !!test_bit(AUTONUMA_FLAG, &autonuma_flags);
> diff --git a/include/linux/autonuma_types.h b/include/linux/autonuma_types.h
> index 9e697e3..1e860f6 100644
> --- a/include/linux/autonuma_types.h
> +++ b/include/linux/autonuma_types.h
> @@ -39,6 +39,61 @@ struct task_autonuma {
>  	unsigned long task_numa_fault[0];
>  };
>  
> +/*
> + * Per page (or per-pageblock) structure dynamically allocated only if
> + * autonuma is not impossible.

not impossible? So possible?

> + */
> +struct page_autonuma {
> +	/*
> +	 * To modify autonuma_last_nid lockless the architecture,
> +	 * needs SMP atomic granularity < sizeof(long), not all archs
> +	 * have that, notably some ancient alpha (but none of those
> +	 * should run in NUMA systems). Archs without that requires
> +	 * autonuma_last_nid to be a long.
> +	 */
> +#if BITS_PER_LONG > 32
> +	/*
> +	 * autonuma_migrate_nid is -1 if the page_autonuma structure
> +	 * is not linked into any
> +	 * pgdat->autonuma_migrate_head. Otherwise it means the
> +	 * page_autonuma structure is linked into the
> +	 * &NODE_DATA(autonuma_migrate_nid)->autonuma_migrate_head[page_nid].
> +	 * page_nid is the nid that the page (referenced by the
> +	 * page_autonuma structure) belongs to.
> +	 */
> +	int autonuma_migrate_nid;
> +	/*
> +	 * autonuma_last_nid records which is the NUMA nid that tried
> +	 * to access this page at the last NUMA hinting page fault.
> +	 * If it changed, AutoNUMA will not try to migrate the page to
> +	 * the nid where the thread is running on and to the contrary,
> +	 * it will make different threads trashing on the same pages,
> +	 * converge on the same NUMA node (if possible).
> +	 */
> +	int autonuma_last_nid;
> +#else
> +#if MAX_NUMNODES >= 32768
> +#error "too many nodes"
> +#endif
> +	short autonuma_migrate_nid;
> +	short autonuma_last_nid;
> +#endif
> +	/*
> +	 * This is the list node that links the page (referenced by
> +	 * the page_autonuma structure) in the
> +	 * &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid] lru.
> +	 */
> +	struct list_head autonuma_migrate_node;
> +
> +	/*
> +	 * To find the page starting from the autonuma_migrate_node we
> +	 * need a backlink.
> +	 *
> +	 * FIXME: drop it;
> +	 */
> +	struct page *page;
> +};
> +
>  extern int alloc_task_autonuma(struct task_struct *tsk,
>  			       struct task_struct *orig,
>  			       int node);
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index d1248cf..f0c6379 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -136,32 +136,6 @@ struct page {
>  		struct page *first_page;	/* Compound tail pages */
>  	};
>  
> -#ifdef CONFIG_AUTONUMA
> -	/*
> -	 * FIXME: move to pgdat section along with the memcg and allocate
> -	 * at runtime only in presence of a numa system.
> -	 */
> -	/*
> -	 * To modify autonuma_last_nid lockless the architecture,
> -	 * needs SMP atomic granularity < sizeof(long), not all archs
> -	 * have that, notably some ancient alpha (but none of those
> -	 * should run in NUMA systems). Archs without that requires
> -	 * autonuma_last_nid to be a long.
> -	 */
> -#if BITS_PER_LONG > 32
> -	int autonuma_migrate_nid;
> -	int autonuma_last_nid;
> -#else
> -#if MAX_NUMNODES >= 32768
> -#error "too many nodes"
> -#endif
> -	/* FIXME: remember to check the updates are atomic */
> -	short autonuma_migrate_nid;
> -	short autonuma_last_nid;
> -#endif
> -	struct list_head autonuma_migrate_node;
> -#endif
> -
>  	/*
>  	 * On machines where all RAM is mapped into kernel address space,
>  	 * we can simply calculate the virtual address. On machines with
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index d53b26a..e66da74 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -698,10 +698,13 @@ typedef struct pglist_data {
>  	int kswapd_max_order;
>  	enum zone_type classzone_idx;
>  #ifdef CONFIG_AUTONUMA
> -	spinlock_t autonuma_lock;
> +#if !defined(CONFIG_SPARSEMEM)
> +	struct page_autonuma *node_page_autonuma;
> +#endif
>  	struct list_head autonuma_migrate_head[MAX_NUMNODES];
>  	unsigned long autonuma_nr_migrate_pages;
>  	wait_queue_head_t autonuma_knuma_migrated_wait;
> +	spinlock_t autonuma_lock;
>  #endif
>  } pg_data_t;
>  
> @@ -1064,6 +1067,15 @@ struct mem_section {
>  	 * section. (see memcontrol.h/page_cgroup.h about this.)
>  	 */
>  	struct page_cgroup *page_cgroup;
> +#endif
> +#ifdef CONFIG_AUTONUMA
> +	/*
> +	 * If !SPARSEMEM, pgdat doesn't have page_autonuma pointer. We use
> +	 * section.
> +	 */
> +	struct page_autonuma *section_page_autonuma;
> +#endif
> +#if defined(CONFIG_CGROUP_MEM_RES_CTLR) ^ defined(CONFIG_AUTONUMA)
>  	unsigned long pad;
>  #endif
>  };
> diff --git a/include/linux/page_autonuma.h b/include/linux/page_autonuma.h
> new file mode 100644
> index 0000000..d748aa2
> --- /dev/null
> +++ b/include/linux/page_autonuma.h
> @@ -0,0 +1,53 @@
> +#ifndef _LINUX_PAGE_AUTONUMA_H
> +#define _LINUX_PAGE_AUTONUMA_H
> +
> +#if defined(CONFIG_AUTONUMA) && !defined(CONFIG_SPARSEMEM)
> +extern void __init page_autonuma_init_flatmem(void);
> +#else
> +static inline void __init page_autonuma_init_flatmem(void) {}
> +#endif
> +
> +#ifdef CONFIG_AUTONUMA
> +
> +#include <linux/autonuma_flags.h>
> +
> +extern void __meminit page_autonuma_map_init(struct page *page,
> +					     struct page_autonuma *page_autonuma,
> +					     int nr_pages);
> +
> +#ifdef CONFIG_SPARSEMEM
> +#define PAGE_AUTONUMA_SIZE (sizeof(struct page_autonuma))
> +#define SECTION_PAGE_AUTONUMA_SIZE (PAGE_AUTONUMA_SIZE *	\
> +				    PAGES_PER_SECTION)
> +#endif
> +
> +extern void __meminit pgdat_autonuma_init(struct pglist_data *);
> +
> +#else /* CONFIG_AUTONUMA */
> +
> +#ifdef CONFIG_SPARSEMEM
> +struct page_autonuma;
> +#define PAGE_AUTONUMA_SIZE 0
> +#define SECTION_PAGE_AUTONUMA_SIZE 0
> +
> +#define autonuma_impossible() true
> +
> +#endif
> +
> +static inline void pgdat_autonuma_init(struct pglist_data *pgdat) {}
> +
> +#endif /* CONFIG_AUTONUMA */
> +
> +#ifdef CONFIG_SPARSEMEM
> +extern struct page_autonuma * __meminit __kmalloc_section_page_autonuma(int nid,
> +									unsigned long nr_pages);
> +extern void __kfree_section_page_autonuma(struct page_autonuma *page_autonuma,
> +					  unsigned long nr_pages);
> +extern void __init sparse_early_page_autonuma_alloc_node(struct page_autonuma **page_autonuma_map,
> +							 unsigned long pnum_begin,
> +							 unsigned long pnum_end,
> +							 unsigned long map_count,
> +							 int nodeid);
> +#endif
> +
> +#endif /* _LINUX_PAGE_AUTONUMA_H */
> diff --git a/init/main.c b/init/main.c
> index b5cc0a7..070a377 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -68,6 +68,7 @@
>  #include <linux/shmem_fs.h>
>  #include <linux/slab.h>
>  #include <linux/perf_event.h>
> +#include <linux/page_autonuma.h>
>  
>  #include <asm/io.h>
>  #include <asm/bugs.h>
> @@ -455,6 +456,7 @@ static void __init mm_init(void)
>  	 * bigger than MAX_ORDER unless SPARSEMEM.
>  	 */
>  	page_cgroup_init_flatmem();
> +	page_autonuma_init_flatmem();
>  	mem_init();
>  	kmem_cache_init();
>  	percpu_init_late();
> diff --git a/mm/Makefile b/mm/Makefile
> index 15900fd..a4d8354 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -33,7 +33,7 @@ obj-$(CONFIG_FRONTSWAP)	+= frontswap.o
>  obj-$(CONFIG_HAS_DMA)	+= dmapool.o
>  obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
>  obj-$(CONFIG_NUMA) 	+= mempolicy.o
> -obj-$(CONFIG_AUTONUMA) 	+= autonuma.o
> +obj-$(CONFIG_AUTONUMA) 	+= autonuma.o page_autonuma.o
>  obj-$(CONFIG_SPARSEMEM)	+= sparse.o
>  obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
>  obj-$(CONFIG_SLOB) += slob.o
> diff --git a/mm/autonuma.c b/mm/autonuma.c
> index f44272b..ec4d492 100644
> --- a/mm/autonuma.c
> +++ b/mm/autonuma.c
> @@ -51,12 +51,6 @@ static struct knumad_scan {
>  	.mm_head = LIST_HEAD_INIT(knumad_scan.mm_head),
>  };
>  
> -static inline bool autonuma_impossible(void)
> -{
> -	return num_possible_nodes() <= 1 ||
> -		test_bit(AUTONUMA_IMPOSSIBLE_FLAG, &autonuma_flags);
> -}
> -
>  static inline void autonuma_migrate_lock(int nid)
>  {
>  	spin_lock(&NODE_DATA(nid)->autonuma_lock);
> @@ -82,54 +76,63 @@ void autonuma_migrate_split_huge_page(struct page *page,
>  				      struct page *page_tail)
>  {
>  	int nid, last_nid;
> +	struct page_autonuma *page_autonuma, *page_tail_autonuma;
>  
> -	nid = page->autonuma_migrate_nid;
> +	if (autonuma_impossible())

Is it just better to call it 'autonuma_off()' ?

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 36/40] autonuma: page_autonuma
@ 2012-06-30  5:24     ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 327+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-06-30  5:24 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Jun 28, 2012 at 02:56:16PM +0200, Andrea Arcangeli wrote:
> Move the AutoNUMA per page information from the "struct page" to a
> separate page_autonuma data structure allocated in the memsection
> (with sparsemem) or in the pgdat (with flatmem).
> 
> This is done to avoid growing the size of the "struct page" and the
> page_autonuma data is only allocated if the kernel has been booted on
> real NUMA hardware (or if noautonuma is passed as parameter to the
> kernel).
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  include/linux/autonuma.h       |   18 +++-
>  include/linux/autonuma_flags.h |    6 +
>  include/linux/autonuma_types.h |   55 ++++++++++
>  include/linux/mm_types.h       |   26 -----
>  include/linux/mmzone.h         |   14 +++-
>  include/linux/page_autonuma.h  |   53 +++++++++
>  init/main.c                    |    2 +
>  mm/Makefile                    |    2 +-
>  mm/autonuma.c                  |   98 ++++++++++-------
>  mm/huge_memory.c               |   26 +++--
>  mm/page_alloc.c                |   21 +---
>  mm/page_autonuma.c             |  234 ++++++++++++++++++++++++++++++++++++++++
>  mm/sparse.c                    |  126 ++++++++++++++++++++-
>  13 files changed, 577 insertions(+), 104 deletions(-)
>  create mode 100644 include/linux/page_autonuma.h
>  create mode 100644 mm/page_autonuma.c
> 
> diff --git a/include/linux/autonuma.h b/include/linux/autonuma.h
> index 85ca5eb..67af86a 100644
> --- a/include/linux/autonuma.h
> +++ b/include/linux/autonuma.h
> @@ -7,15 +7,26 @@
>  
>  extern void autonuma_enter(struct mm_struct *mm);
>  extern void autonuma_exit(struct mm_struct *mm);
> -extern void __autonuma_migrate_page_remove(struct page *page);
> +extern void __autonuma_migrate_page_remove(struct page *,
> +					   struct page_autonuma *);
>  extern void autonuma_migrate_split_huge_page(struct page *page,
>  					     struct page *page_tail);
>  extern void autonuma_setup_new_exec(struct task_struct *p);
> +extern struct page_autonuma *lookup_page_autonuma(struct page *page);
>  
>  static inline void autonuma_migrate_page_remove(struct page *page)
>  {
> -	if (ACCESS_ONCE(page->autonuma_migrate_nid) >= 0)
> -		__autonuma_migrate_page_remove(page);
> +	struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
> +	if (ACCESS_ONCE(page_autonuma->autonuma_migrate_nid) >= 0)
> +		__autonuma_migrate_page_remove(page, page_autonuma);
> +}
> +
> +static inline void autonuma_free_page(struct page *page)
> +{
> +	if (!autonuma_impossible()) {

I think you are better using a different name.

Perhaps 'if (autonuma_on())'

> +		autonuma_migrate_page_remove(page);
> +		lookup_page_autonuma(page)->autonuma_last_nid = -1;
> +	}
>  }
>  
>  #define autonuma_printk(format, args...) \
> @@ -29,6 +40,7 @@ static inline void autonuma_migrate_page_remove(struct page *page) {}
>  static inline void autonuma_migrate_split_huge_page(struct page *page,
>  						    struct page *page_tail) {}
>  static inline void autonuma_setup_new_exec(struct task_struct *p) {}
> +static inline void autonuma_free_page(struct page *page) {}
>  
>  #endif /* CONFIG_AUTONUMA */
>  
> diff --git a/include/linux/autonuma_flags.h b/include/linux/autonuma_flags.h
> index 5e29a75..035d993 100644
> --- a/include/linux/autonuma_flags.h
> +++ b/include/linux/autonuma_flags.h
> @@ -15,6 +15,12 @@ enum autonuma_flag {
>  
>  extern unsigned long autonuma_flags;
>  
> +static inline bool autonuma_impossible(void)
> +{
> +	return num_possible_nodes() <= 1 ||
> +		test_bit(AUTONUMA_IMPOSSIBLE_FLAG, &autonuma_flags);
> +}
> +
>  static inline bool autonuma_enabled(void)
>  {
>  	return !!test_bit(AUTONUMA_FLAG, &autonuma_flags);
> diff --git a/include/linux/autonuma_types.h b/include/linux/autonuma_types.h
> index 9e697e3..1e860f6 100644
> --- a/include/linux/autonuma_types.h
> +++ b/include/linux/autonuma_types.h
> @@ -39,6 +39,61 @@ struct task_autonuma {
>  	unsigned long task_numa_fault[0];
>  };
>  
> +/*
> + * Per page (or per-pageblock) structure dynamically allocated only if
> + * autonuma is not impossible.

not impossible? So possible?

> + */
> +struct page_autonuma {
> +	/*
> +	 * To modify autonuma_last_nid lockless the architecture,
> +	 * needs SMP atomic granularity < sizeof(long), not all archs
> +	 * have that, notably some ancient alpha (but none of those
> +	 * should run in NUMA systems). Archs without that requires
> +	 * autonuma_last_nid to be a long.
> +	 */
> +#if BITS_PER_LONG > 32
> +	/*
> +	 * autonuma_migrate_nid is -1 if the page_autonuma structure
> +	 * is not linked into any
> +	 * pgdat->autonuma_migrate_head. Otherwise it means the
> +	 * page_autonuma structure is linked into the
> +	 * &NODE_DATA(autonuma_migrate_nid)->autonuma_migrate_head[page_nid].
> +	 * page_nid is the nid that the page (referenced by the
> +	 * page_autonuma structure) belongs to.
> +	 */
> +	int autonuma_migrate_nid;
> +	/*
> +	 * autonuma_last_nid records which is the NUMA nid that tried
> +	 * to access this page at the last NUMA hinting page fault.
> +	 * If it changed, AutoNUMA will not try to migrate the page to
> +	 * the nid where the thread is running on and to the contrary,
> +	 * it will make different threads trashing on the same pages,
> +	 * converge on the same NUMA node (if possible).
> +	 */
> +	int autonuma_last_nid;
> +#else
> +#if MAX_NUMNODES >= 32768
> +#error "too many nodes"
> +#endif
> +	short autonuma_migrate_nid;
> +	short autonuma_last_nid;
> +#endif
> +	/*
> +	 * This is the list node that links the page (referenced by
> +	 * the page_autonuma structure) in the
> +	 * &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid] lru.
> +	 */
> +	struct list_head autonuma_migrate_node;
> +
> +	/*
> +	 * To find the page starting from the autonuma_migrate_node we
> +	 * need a backlink.
> +	 *
> +	 * FIXME: drop it;
> +	 */
> +	struct page *page;
> +};
> +
>  extern int alloc_task_autonuma(struct task_struct *tsk,
>  			       struct task_struct *orig,
>  			       int node);
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index d1248cf..f0c6379 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -136,32 +136,6 @@ struct page {
>  		struct page *first_page;	/* Compound tail pages */
>  	};
>  
> -#ifdef CONFIG_AUTONUMA
> -	/*
> -	 * FIXME: move to pgdat section along with the memcg and allocate
> -	 * at runtime only in presence of a numa system.
> -	 */
> -	/*
> -	 * To modify autonuma_last_nid lockless the architecture,
> -	 * needs SMP atomic granularity < sizeof(long), not all archs
> -	 * have that, notably some ancient alpha (but none of those
> -	 * should run in NUMA systems). Archs without that requires
> -	 * autonuma_last_nid to be a long.
> -	 */
> -#if BITS_PER_LONG > 32
> -	int autonuma_migrate_nid;
> -	int autonuma_last_nid;
> -#else
> -#if MAX_NUMNODES >= 32768
> -#error "too many nodes"
> -#endif
> -	/* FIXME: remember to check the updates are atomic */
> -	short autonuma_migrate_nid;
> -	short autonuma_last_nid;
> -#endif
> -	struct list_head autonuma_migrate_node;
> -#endif
> -
>  	/*
>  	 * On machines where all RAM is mapped into kernel address space,
>  	 * we can simply calculate the virtual address. On machines with
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index d53b26a..e66da74 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -698,10 +698,13 @@ typedef struct pglist_data {
>  	int kswapd_max_order;
>  	enum zone_type classzone_idx;
>  #ifdef CONFIG_AUTONUMA
> -	spinlock_t autonuma_lock;
> +#if !defined(CONFIG_SPARSEMEM)
> +	struct page_autonuma *node_page_autonuma;
> +#endif
>  	struct list_head autonuma_migrate_head[MAX_NUMNODES];
>  	unsigned long autonuma_nr_migrate_pages;
>  	wait_queue_head_t autonuma_knuma_migrated_wait;
> +	spinlock_t autonuma_lock;
>  #endif
>  } pg_data_t;
>  
> @@ -1064,6 +1067,15 @@ struct mem_section {
>  	 * section. (see memcontrol.h/page_cgroup.h about this.)
>  	 */
>  	struct page_cgroup *page_cgroup;
> +#endif
> +#ifdef CONFIG_AUTONUMA
> +	/*
> +	 * If !SPARSEMEM, pgdat doesn't have page_autonuma pointer. We use
> +	 * section.
> +	 */
> +	struct page_autonuma *section_page_autonuma;
> +#endif
> +#if defined(CONFIG_CGROUP_MEM_RES_CTLR) ^ defined(CONFIG_AUTONUMA)
>  	unsigned long pad;
>  #endif
>  };
> diff --git a/include/linux/page_autonuma.h b/include/linux/page_autonuma.h
> new file mode 100644
> index 0000000..d748aa2
> --- /dev/null
> +++ b/include/linux/page_autonuma.h
> @@ -0,0 +1,53 @@
> +#ifndef _LINUX_PAGE_AUTONUMA_H
> +#define _LINUX_PAGE_AUTONUMA_H
> +
> +#if defined(CONFIG_AUTONUMA) && !defined(CONFIG_SPARSEMEM)
> +extern void __init page_autonuma_init_flatmem(void);
> +#else
> +static inline void __init page_autonuma_init_flatmem(void) {}
> +#endif
> +
> +#ifdef CONFIG_AUTONUMA
> +
> +#include <linux/autonuma_flags.h>
> +
> +extern void __meminit page_autonuma_map_init(struct page *page,
> +					     struct page_autonuma *page_autonuma,
> +					     int nr_pages);
> +
> +#ifdef CONFIG_SPARSEMEM
> +#define PAGE_AUTONUMA_SIZE (sizeof(struct page_autonuma))
> +#define SECTION_PAGE_AUTONUMA_SIZE (PAGE_AUTONUMA_SIZE *	\
> +				    PAGES_PER_SECTION)
> +#endif
> +
> +extern void __meminit pgdat_autonuma_init(struct pglist_data *);
> +
> +#else /* CONFIG_AUTONUMA */
> +
> +#ifdef CONFIG_SPARSEMEM
> +struct page_autonuma;
> +#define PAGE_AUTONUMA_SIZE 0
> +#define SECTION_PAGE_AUTONUMA_SIZE 0
> +
> +#define autonuma_impossible() true
> +
> +#endif
> +
> +static inline void pgdat_autonuma_init(struct pglist_data *pgdat) {}
> +
> +#endif /* CONFIG_AUTONUMA */
> +
> +#ifdef CONFIG_SPARSEMEM
> +extern struct page_autonuma * __meminit __kmalloc_section_page_autonuma(int nid,
> +									unsigned long nr_pages);
> +extern void __kfree_section_page_autonuma(struct page_autonuma *page_autonuma,
> +					  unsigned long nr_pages);
> +extern void __init sparse_early_page_autonuma_alloc_node(struct page_autonuma **page_autonuma_map,
> +							 unsigned long pnum_begin,
> +							 unsigned long pnum_end,
> +							 unsigned long map_count,
> +							 int nodeid);
> +#endif
> +
> +#endif /* _LINUX_PAGE_AUTONUMA_H */
> diff --git a/init/main.c b/init/main.c
> index b5cc0a7..070a377 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -68,6 +68,7 @@
>  #include <linux/shmem_fs.h>
>  #include <linux/slab.h>
>  #include <linux/perf_event.h>
> +#include <linux/page_autonuma.h>
>  
>  #include <asm/io.h>
>  #include <asm/bugs.h>
> @@ -455,6 +456,7 @@ static void __init mm_init(void)
>  	 * bigger than MAX_ORDER unless SPARSEMEM.
>  	 */
>  	page_cgroup_init_flatmem();
> +	page_autonuma_init_flatmem();
>  	mem_init();
>  	kmem_cache_init();
>  	percpu_init_late();
> diff --git a/mm/Makefile b/mm/Makefile
> index 15900fd..a4d8354 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -33,7 +33,7 @@ obj-$(CONFIG_FRONTSWAP)	+= frontswap.o
>  obj-$(CONFIG_HAS_DMA)	+= dmapool.o
>  obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
>  obj-$(CONFIG_NUMA) 	+= mempolicy.o
> -obj-$(CONFIG_AUTONUMA) 	+= autonuma.o
> +obj-$(CONFIG_AUTONUMA) 	+= autonuma.o page_autonuma.o
>  obj-$(CONFIG_SPARSEMEM)	+= sparse.o
>  obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
>  obj-$(CONFIG_SLOB) += slob.o
> diff --git a/mm/autonuma.c b/mm/autonuma.c
> index f44272b..ec4d492 100644
> --- a/mm/autonuma.c
> +++ b/mm/autonuma.c
> @@ -51,12 +51,6 @@ static struct knumad_scan {
>  	.mm_head = LIST_HEAD_INIT(knumad_scan.mm_head),
>  };
>  
> -static inline bool autonuma_impossible(void)
> -{
> -	return num_possible_nodes() <= 1 ||
> -		test_bit(AUTONUMA_IMPOSSIBLE_FLAG, &autonuma_flags);
> -}
> -
>  static inline void autonuma_migrate_lock(int nid)
>  {
>  	spin_lock(&NODE_DATA(nid)->autonuma_lock);
> @@ -82,54 +76,63 @@ void autonuma_migrate_split_huge_page(struct page *page,
>  				      struct page *page_tail)
>  {
>  	int nid, last_nid;
> +	struct page_autonuma *page_autonuma, *page_tail_autonuma;
>  
> -	nid = page->autonuma_migrate_nid;
> +	if (autonuma_impossible())

Is it just better to call it 'autonuma_off()' ?
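
For reference, this is just a sketch of how the helper would read under
the suggested name; the body is the one removed from mm/autonuma.c in
the hunk above, and whether AUTONUMA_IMPOSSIBLE_FLAG should be renamed
too is left open:

/* sketch of the suggested rename, body unchanged from the old helper */
static inline bool autonuma_off(void)
{
	return num_possible_nodes() <= 1 ||
		test_bit(AUTONUMA_IMPOSSIBLE_FLAG, &autonuma_flags);
}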


^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-30  2:43                 ` Nai Xia
@ 2012-06-30  5:48                   ` Dor Laor
  2012-06-30  6:58                     ` Nai Xia
  2012-06-30  8:23                     ` Nai Xia
  2012-06-30 12:48                   ` Andrea Arcangeli
  1 sibling, 2 replies; 327+ messages in thread
From: Dor Laor @ 2012-06-30  5:48 UTC (permalink / raw)
  To: Nai Xia
  Cc: Andrea Arcangeli, Peter Zijlstra, Ingo Molnar, Hillf Danton,
	linux-kernel, linux-mm, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/30/2012 05:43 AM, Nai Xia wrote:
> On Sat, Jun 30, 2012 at 9:23 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>> On Sat, Jun 30, 2012 at 04:01:50AM +0800, Nai Xia wrote:
>>> On Sat, Jun 30, 2012 at 2:53 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>>>> On Fri, 2012-06-29 at 12:51 -0400, Dor Laor wrote:
>>>>> The previous comments were not shouts but the mother of all NAKs.
>>>>
>>>> I never said any such thing. I just said why should I bother reading
>>>> your stuff if you're ignoring most my feedback anyway.
>>>>
>>>> If you want to read that as a NAK, not my problem.
>>>
>>> Hey guys, Can I say NAK to these patches ?
>>>
>>> Now I aware that this sampling algorithm is completely broken, if we take
>>> a few seconds to see what it is trying to solve:
>>>
>>> We all know that LRU is try to solve the question of "what are the
>>> pages recently accessed?",
>>> so its engouth to use pte bits to approximate.
>>
>> I made an example about the active list to try to explain it why your
>> example is still going to work fine.
>>
>> After it becomes active (from inactive) and it's being a referenced
>> active page, it won't become _very_active_ or _very_very_active_ or
>> more no matter how many more times you look up the pagecache.
>>
>> The LRU order wasn't relevant here.
>>
>>> However, the numa balancing problem is fundamentally like this:
>>>
>>> In some time unit,
>>>
>>>        W = pages_accessed  *  average_page_access_frequence
>>>
>>> We are trying to move process to the node having max W,  right?
>>
>> First of all, the mm_autonuma statistics are not in function of time
>> and there is no page access frequency there.
>>
>> mm_autonuma is static information collected by knuma_scand from the
>> pagetables. That's static and 100% accurate on the whole process and
>> definitely not generated by the numa hinting page faults. I could shut
>> off all numa hinting page faults permanently and still generate the
>> mm_autonuma information identically.
>>
>> There's a knob in /sys/kernel/mm/autonuma/knuma_scand/working_set that
>> you can enable if you want to use a "runtime" and not static
>> information for the mm_autonuma too, but that's not the default for
>> now (but I think it may be a better default, there wasn't enough time
>> to test this yet)
>>
>> The task_autonuma (thread) statistics are the only thing that is
>> sampled by default in a 10sec interval (the interval tunable too with
>> sysfs, and 10sec is likely too aggressive, 30sec sounds better, we're
>> eventually going to make it dynamic anyway)
>>
>> So even if you were right, the thread statistics only kicks in to
>> balance threads against threads of the same process, most of the time
>> what's more important are the mm_autonuma statistics.
>>
>> But in reality the thread statistics also works perfectly for the job,
>> as an approximation of the NUMA memory footprint of the thread (vs the
>> other threads). And then the rest of the memory slowly follows
>> whatever node CPUs I placed the thread (even if that's not the
>> absolutely best one at all times).
>>
>>> Andrea's patch can only approximate the pages_accessed number in a
>>> time unit(scan interval),
>>> I don't think it can catch even 1% of  average_page_access_frequence
>>> on a busy workload.
>>> Blindly assuming that all the pages'  average_page_access_frequence is
>>> the same is seemly
>>> broken to me.
>>
>> All we need is an approximation to take a better than random decision,
>> even if you get it 1% right, it's still better than 0% right by going
>> blind. Your 1% is too pessimistic, in my tests the thread statistics
>> are more like >90% correct in average (I monitor them with the debug
>> mode constantly).
>>
>> If this 1% right, happens one a million samples, who cares, it's not
>> going to run measurably slower anyway (and it will still be better
>> than picking a 0% right node).
>>
>> What you're saying is that because the active list in the pagecache
>> won't differentiate between 10 cache hits and 20 cache hits, we should
>> drop the active list and stop activating pages and just threat them
>> all the same because in some unlucky access pattern, the active list
>> may only get right 1% of the working set. But there's a reason why the
>> active list exists despite it may get things wrong in some corner case
>> and possibly leave the large amount of pages accessed infrequently in
>> the inactive list forever (even if it gets things only 1% right in
>> those worst cases, it's still better than 0% right and no active list
>> at all).
>>
>> To say it in another way, you may still crash with the car even if
>> you're careful, but do you think it's better to watch at the street or
>> to drive blindfolded?
>>
>> numa/sched drives blindfolded, autonuma watches around every 10sec
>> very carefully for the best next turn to take with the car and to
>> avoid obstacles, you can imagine who wins.
>>
>> Watching the street carefully every 10sec doesn't mean the next moment
>> a missile won't hit your car to make you crash, you're still having
>> better chances not to crash than by driving blindfolded.
>>
>> numa/sched pretends to compete without collecting information for the
>> NUMA thread memory footprint (task_autonuma, sampled with a
>> exponential backoff at 10sec intervals), and without process
>> information (full static information from the pagetables, not
>> sampled). No matter how you compute stuff, if you've nothing
>> meaningful in input to your algorithm you lose. And it looks like you
>> believe that you can take better decisions with nothing in input to
>> your NUMA placement algorithm, because my thread info (task_autonuma)
>> isn't 100% perfect at all times and it can't predict the future. The
>> alternative is to get that information from syscalls, but even
>> ignoring the -ENOMEM from split_vma, that will lead to userland bugs
>> and overall the task_autonuma information may be more reliable in the
>> end, even if it's sampled using an exponential backoff.
>>
>> Also note the exponential backoff thing, it's not really the last
>> interval, it's the last interval plus half the previous interval plus
>> 1/4 the previous interval etc... and we can trivially control the
>> decay.
>>
>> All we need is to get a direction and knowing _exactly_ what the task
>> did over the last 10 seconds (even if it can't predict the future of
>> what the thread will do in the next 1sec), is all we need to get a
>> direction. After we take the direction then the memory will follow so
>> we cannot care less what it does in the next second because that will
>> follow the CPU (after a while, last_nid anti-false-sharing logic
>> permitting), and at least we'll know for sure that the memory accessed
>> in the last 10sec is already local and that defines the best node to
>> schedule the thread.
>>
>> I don't mean there's no room for improvement in the way the input data
>> can be computed, and even in the way the input data can be generated,
>> the exponential backoff decay can be tuned too, I just tried to do the
>> simplest computations on the data to make the workloads converge fast
>> and you're welcome to contribute.
>>
>> But I believe the task_autonuma information is extremely valuable and
>> we can trust it very much knowing we'll get a great placement. The
>> concern you have isn't invalid, but it's a very minor one and the
>> sampling rate effects you are concerned about, while real, they're
>> lost in the noise in practice.
>
> Well, I think I am not convinced by your this many words. And surely
> I  will NOT follow your reasoning of "Having information is always
> good than nothing".  We all know that  an illy biased balancing is worse
> than randomness:  at least randomness means "average, fair play, ...".
> With all uncertain things, I think only a comprehensive survey
> of real world workloads can tell if my concern is significant or not.
>
> So I think my suggestion to you is:  Show world some solid and sound
> real world proof that your approximation is > 90% accurate, just like

The cover letter contained a link to the performance results:
https://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma_bench-20120530.pdf

It includes SPECjbb, kernel build, cpuHog in guests, and a handful of
unit tests.

I'm sure anyone can beat most kernel algorithms, including the LRU and
CFS, with some pathological case. The only way to improve the NUMA
balancing accuracy is to sample more, and sampling more means faulting
more == larger overhead.

Maybe it's worth adding a measurement so that, if a particular page has
bounced between nodes too many times, we stop scanning that page for a
while. It's an optimization that would need to prove its worth in real
life.
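
A minimal sketch of that idea follows; every name and threshold here is
made up for illustration, none of it is code from this series:

#include <stdbool.h>	/* in-kernel this would come from linux/types.h */

#define SKETCH_MAX_BOUNCES	4	/* assumed bounce threshold */
#define SKETCH_COOLDOWN_PASSES	8	/* knuma_scand passes to skip */

struct page_bounce_sketch {
	unsigned short bounces;		/* cross-node migrations seen */
	unsigned short cooldown;	/* passes left to skip this page */
};

/* the scanner would check this before arming pte_numa for the page */
static bool sketch_should_scan(struct page_bounce_sketch *pb)
{
	if (pb->cooldown) {
		pb->cooldown--;
		return false;
	}
	return true;
}

/* the numa hinting fault would call this when it migrates the page */
static void sketch_note_bounce(struct page_bounce_sketch *pb)
{
	if (++pb->bounces >= SKETCH_MAX_BOUNCES) {
		pb->bounces = 0;
		pb->cooldown = SKETCH_COOLDOWN_PASSES;
	}
}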

Cheers,
Dor

> the pioneers already did to LRU(This problem is surely different from
> LRU. ).  Tons of words, will not do this.
>
> Thanks,
>
> Nai
>



^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-30  5:48                   ` Dor Laor
@ 2012-06-30  6:58                     ` Nai Xia
  2012-06-30 13:04                       ` Andrea Arcangeli
                                         ` (2 more replies)
  2012-06-30  8:23                     ` Nai Xia
  1 sibling, 3 replies; 327+ messages in thread
From: Nai Xia @ 2012-06-30  6:58 UTC (permalink / raw)
  To: dlaor
  Cc: Andrea Arcangeli, Peter Zijlstra, Ingo Molnar, Hillf Danton,
	linux-kernel, linux-mm, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Sat, Jun 30, 2012 at 1:48 PM, Dor Laor <dlaor@redhat.com> wrote:
> On 06/30/2012 05:43 AM, Nai Xia wrote:
>>
>> On Sat, Jun 30, 2012 at 9:23 AM, Andrea Arcangeli <aarcange@redhat.com>
>> wrote:
>>>
>>> On Sat, Jun 30, 2012 at 04:01:50AM +0800, Nai Xia wrote:
>>>>
>>>> On Sat, Jun 30, 2012 at 2:53 AM, Peter Zijlstra <a.p.zijlstra@chello.nl>
>>>> wrote:
>>>>>
>>>>> On Fri, 2012-06-29 at 12:51 -0400, Dor Laor wrote:
>>>>>>
>>>>>> The previous comments were not shouts but the mother of all NAKs.
>>>>>
>>>>>
>>>>> I never said any such thing. I just said why should I bother reading
>>>>> your stuff if you're ignoring most my feedback anyway.
>>>>>
>>>>> If you want to read that as a NAK, not my problem.
>>>>
>>>>
>>>> Hey guys, Can I say NAK to these patches ?
>>>>
>>>> Now I aware that this sampling algorithm is completely broken, if we
>>>> take
>>>> a few seconds to see what it is trying to solve:
>>>>
>>>> We all know that LRU is try to solve the question of "what are the
>>>> pages recently accessed?",
>>>> so its engouth to use pte bits to approximate.
>>>
>>>
>>> I made an example about the active list to try to explain it why your
>>> example is still going to work fine.
>>>
>>> After it becomes active (from inactive) and it's being a referenced
>>> active page, it won't become _very_active_ or _very_very_active_ or
>>> more no matter how many more times you look up the pagecache.
>>>
>>> The LRU order wasn't relevant here.
>>>
>>>> However, the numa balancing problem is fundamentally like this:
>>>>
>>>> In some time unit,
>>>>
>>>>       W = pages_accessed  *  average_page_access_frequence
>>>>
>>>> We are trying to move process to the node having max W,  right?
>>>
>>>
>>> First of all, the mm_autonuma statistics are not in function of time
>>> and there is no page access frequency there.
>>>
>>> mm_autonuma is static information collected by knuma_scand from the
>>> pagetables. That's static and 100% accurate on the whole process and
>>> definitely not generated by the numa hinting page faults. I could shut
>>> off all numa hinting page faults permanently and still generate the
>>> mm_autonuma information identically.
>>>
>>> There's a knob in /sys/kernel/mm/autonuma/knuma_scand/working_set that
>>> you can enable if you want to use a "runtime" and not static
>>> information for the mm_autonuma too, but that's not the default for
>>> now (but I think it may be a better default, there wasn't enough time
>>> to test this yet)
>>>
>>> The task_autonuma (thread) statistics are the only thing that is
>>> sampled by default in a 10sec interval (the interval tunable too with
>>> sysfs, and 10sec is likely too aggressive, 30sec sounds better, we're
>>> eventually going to make it dynamic anyway)
>>>
>>> So even if you were right, the thread statistics only kicks in to
>>> balance threads against threads of the same process, most of the time
>>> what's more important are the mm_autonuma statistics.
>>>
>>> But in reality the thread statistics also works perfectly for the job,
>>> as an approximation of the NUMA memory footprint of the thread (vs the
>>> other threads). And then the rest of the memory slowly follows
>>> whatever node CPUs I placed the thread (even if that's not the
>>> absolutely best one at all times).
>>>
>>>> Andrea's patch can only approximate the pages_accessed number in a
>>>> time unit(scan interval),
>>>> I don't think it can catch even 1% of  average_page_access_frequence
>>>> on a busy workload.
>>>> Blindly assuming that all the pages'  average_page_access_frequence is
>>>> the same is seemly
>>>> broken to me.
>>>
>>>
>>> All we need is an approximation to take a better than random decision,
>>> even if you get it 1% right, it's still better than 0% right by going
>>> blind. Your 1% is too pessimistic, in my tests the thread statistics
>>> are more like >90% correct in average (I monitor them with the debug
>>> mode constantly).
>>>
>>> If this 1% right, happens one a million samples, who cares, it's not
>>> going to run measurably slower anyway (and it will still be better
>>> than picking a 0% right node).
>>>
>>> What you're saying is that because the active list in the pagecache
>>> won't differentiate between 10 cache hits and 20 cache hits, we should
>>> drop the active list and stop activating pages and just threat them
>>> all the same because in some unlucky access pattern, the active list
>>> may only get right 1% of the working set. But there's a reason why the
>>> active list exists despite it may get things wrong in some corner case
>>> and possibly leave the large amount of pages accessed infrequently in
>>> the inactive list forever (even if it gets things only 1% right in
>>> those worst cases, it's still better than 0% right and no active list
>>> at all).
>>>
>>> To say it in another way, you may still crash with the car even if
>>> you're careful, but do you think it's better to watch at the street or
>>> to drive blindfolded?
>>>
>>> numa/sched drives blindfolded, autonuma watches around every 10sec
>>> very carefully for the best next turn to take with the car and to
>>> avoid obstacles, you can imagine who wins.
>>>
>>> Watching the street carefully every 10sec doesn't mean the next moment
>>> a missile won't hit your car to make you crash, you're still having
>>> better chances not to crash than by driving blindfolded.
>>>
>>> numa/sched pretends to compete without collecting information for the
>>> NUMA thread memory footprint (task_autonuma, sampled with a
>>> exponential backoff at 10sec intervals), and without process
>>> information (full static information from the pagetables, not
>>> sampled). No matter how you compute stuff, if you've nothing
>>> meaningful in input to your algorithm you lose. And it looks like you
>>> believe that you can take better decisions with nothing in input to
>>> your NUMA placement algorithm, because my thread info (task_autonuma)
>>> isn't 100% perfect at all times and it can't predict the future. The
>>> alternative is to get that information from syscalls, but even
>>> ignoring the -ENOMEM from split_vma, that will lead to userland bugs
>>> and overall the task_autonuma information may be more reliable in the
>>> end, even if it's sampled using an exponential backoff.
>>>
>>> Also note the exponential backoff thing, it's not really the last
>>> interval, it's the last interval plus half the previous interval plus
>>> 1/4 the previous interval etc... and we can trivially control the
>>> decay.
>>>
>>> All we need is to get a direction and knowing _exactly_ what the task
>>> did over the last 10 seconds (even if it can't predict the future of
>>> what the thread will do in the next 1sec), is all we need to get a
>>> direction. After we take the direction then the memory will follow so
>>> we cannot care less what it does in the next second because that will
>>> follow the CPU (after a while, last_nid anti-false-sharing logic
>>> permitting), and at least we'll know for sure that the memory accessed
>>> in the last 10sec is already local and that defines the best node to
>>> schedule the thread.
>>>
>>> I don't mean there's no room for improvement in the way the input data
>>> can be computed, and even in the way the input data can be generated,
>>> the exponential backoff decay can be tuned too, I just tried to do the
>>> simplest computations on the data to make the workloads converge fast
>>> and you're welcome to contribute.
>>>
>>> But I believe the task_autonuma information is extremely valuable and
>>> we can trust it very much knowing we'll get a great placement. The
>>> concern you have isn't invalid, but it's a very minor one and the
>>> sampling rate effects you are concerned about, while real, they're
>>> lost in the noise in practice.
>>
>>
>> Well, I think I am not convinced by your this many words. And surely
>> I  will NOT follow your reasoning of "Having information is always
>> good than nothing".  We all know that  an illy biased balancing is worse
>> than randomness:  at least randomness means "average, fair play, ...".
>> With all uncertain things, I think only a comprehensive survey
>> of real world workloads can tell if my concern is significant or not.
>>
>> So I think my suggestion to you is:  Show world some solid and sound
>> real world proof that your approximation is > 90% accurate, just like
>
>
> The cover letter contained a link to the performance:
> https://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma_bench-20120530.pdf

Yes, I saw this. But if you consider this already a solid and
comprehensive proof, you win; I have no other words to say.

>
> It includes, specJbb, kernelbuild, cpuHog in guests, and handful of units
> tests.
>
> I'm sure anyone can beat most kernel algorithm with some pathological case
> including LRU and CFS. The only way to improve the numa balancing stuff is

As I already put it, the pathological cases for the LRU have been well
understood for decades, so they are quite valid to ignore. Every
programmer has been taught to avoid those cases, and reaching that
conclusion took much time from many talented brains.

But the problem with this algorithm has not been studied like that, and
you are drawing hasty conclusions about it without bothering to do
comprehensive research.

"Occasionally collect data from a wide range of pages, then do a
condensed computation on a small set of pages" looks like a very common
practice to me. But again, if you simply label this as "minor", I have
no other words to say.

> to sample more, meaning faulting more == larger overhead.

Are you sure you really want to compete on sampling speed with
CPU-intensive workloads?

OK, I think I'll stop discussing this topic now. Without strict and
comprehensive research on it, further arguments seem to me to be purely
based on imagination.

And I have no interest in beating any of your fancy algorithms; it
wouldn't bring me 1G$. I am just curious about the truth.

If you insist on ignoring constructive suggestions from others, that is
pretty much OK to do. But I (and possibly many others who are watching)
am quite likely to laugh at your development style.

Basically, anyone has the right to laugh if W = x * y and you only
approximate x while labeling y a minor factor. :D

Cheers,

Nai


^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-30  5:48                   ` Dor Laor
  2012-06-30  6:58                     ` Nai Xia
@ 2012-06-30  8:23                     ` Nai Xia
  2012-07-02  7:29                       ` Rik van Riel
  1 sibling, 1 reply; 327+ messages in thread
From: Nai Xia @ 2012-06-30  8:23 UTC (permalink / raw)
  To: dlaor
  Cc: Andrea Arcangeli, Peter Zijlstra, Ingo Molnar, Hillf Danton,
	linux-kernel, linux-mm, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Sat, Jun 30, 2012 at 1:48 PM, Dor Laor <dlaor@redhat.com> wrote:
> On 06/30/2012 05:43 AM, Nai Xia wrote:
>>
>> On Sat, Jun 30, 2012 at 9:23 AM, Andrea Arcangeli <aarcange@redhat.com>
>> wrote:
>>>
>>> On Sat, Jun 30, 2012 at 04:01:50AM +0800, Nai Xia wrote:
>>>>
>>>> On Sat, Jun 30, 2012 at 2:53 AM, Peter Zijlstra <a.p.zijlstra@chello.nl>
>>>> wrote:
>>>>>
>>>>> On Fri, 2012-06-29 at 12:51 -0400, Dor Laor wrote:
>>>>>>
>>>>>> The previous comments were not shouts but the mother of all NAKs.
>>>>>
>>>>>
>>>>> I never said any such thing. I just said why should I bother reading
>>>>> your stuff if you're ignoring most my feedback anyway.
>>>>>
>>>>> If you want to read that as a NAK, not my problem.
>>>>
>>>>
>>>> Hey guys, Can I say NAK to these patches ?
>>>>
>>>> Now I aware that this sampling algorithm is completely broken, if we
>>>> take
>>>> a few seconds to see what it is trying to solve:
>>>>
>>>> We all know that LRU is try to solve the question of "what are the
>>>> pages recently accessed?",
>>>> so its engouth to use pte bits to approximate.
>>>
>>>
>>> I made an example about the active list to try to explain it why your
>>> example is still going to work fine.
>>>
>>> After it becomes active (from inactive) and it's being a referenced
>>> active page, it won't become _very_active_ or _very_very_active_ or
>>> more no matter how many more times you look up the pagecache.
>>>
>>> The LRU order wasn't relevant here.
>>>
>>>> However, the numa balancing problem is fundamentally like this:
>>>>
>>>> In some time unit,
>>>>
>>>>       W = pages_accessed  *  average_page_access_frequence
>>>>
>>>> We are trying to move process to the node having max W,  right?
>>>
>>>
>>> First of all, the mm_autonuma statistics are not in function of time
>>> and there is no page access frequency there.
>>>
>>> mm_autonuma is static information collected by knuma_scand from the
>>> pagetables. That's static and 100% accurate on the whole process and
>>> definitely not generated by the numa hinting page faults. I could shut
>>> off all numa hinting page faults permanently and still generate the
>>> mm_autonuma information identically.
>>>
>>> There's a knob in /sys/kernel/mm/autonuma/knuma_scand/working_set that
>>> you can enable if you want to use a "runtime" and not static
>>> information for the mm_autonuma too, but that's not the default for
>>> now (but I think it may be a better default, there wasn't enough time
>>> to test this yet)
>>>
>>> The task_autonuma (thread) statistics are the only thing that is
>>> sampled by default in a 10sec interval (the interval tunable too with
>>> sysfs, and 10sec is likely too aggressive, 30sec sounds better, we're
>>> eventually going to make it dynamic anyway)
>>>
>>> So even if you were right, the thread statistics only kicks in to
>>> balance threads against threads of the same process, most of the time
>>> what's more important are the mm_autonuma statistics.
>>>
>>> But in reality the thread statistics also works perfectly for the job,
>>> as an approximation of the NUMA memory footprint of the thread (vs the
>>> other threads). And then the rest of the memory slowly follows
>>> whatever node CPUs I placed the thread (even if that's not the
>>> absolutely best one at all times).
>>>
>>>> Andrea's patch can only approximate the pages_accessed number in a
>>>> time unit(scan interval),
>>>> I don't think it can catch even 1% of  average_page_access_frequence
>>>> on a busy workload.
>>>> Blindly assuming that all the pages'  average_page_access_frequence is
>>>> the same is seemly
>>>> broken to me.
>>>
>>>
>>> All we need is an approximation to take a better than random decision,
>>> even if you get it 1% right, it's still better than 0% right by going
>>> blind. Your 1% is too pessimistic, in my tests the thread statistics
>>> are more like >90% correct in average (I monitor them with the debug
>>> mode constantly).
>>>
>>> If this 1% right, happens one a million samples, who cares, it's not
>>> going to run measurably slower anyway (and it will still be better
>>> than picking a 0% right node).
>>>
>>> What you're saying is that because the active list in the pagecache
>>> won't differentiate between 10 cache hits and 20 cache hits, we should
>>> drop the active list and stop activating pages and just threat them
>>> all the same because in some unlucky access pattern, the active list
>>> may only get right 1% of the working set. But there's a reason why the
>>> active list exists despite it may get things wrong in some corner case
>>> and possibly leave the large amount of pages accessed infrequently in
>>> the inactive list forever (even if it gets things only 1% right in
>>> those worst cases, it's still better than 0% right and no active list
>>> at all).
>>>
>>> To say it in another way, you may still crash with the car even if
>>> you're careful, but do you think it's better to watch at the street or
>>> to drive blindfolded?
>>>
>>> numa/sched drives blindfolded, autonuma watches around every 10sec
>>> very carefully for the best next turn to take with the car and to
>>> avoid obstacles, you can imagine who wins.
>>>
>>> Watching the street carefully every 10sec doesn't mean the next moment
>>> a missile won't hit your car to make you crash, you're still having
>>> better chances not to crash than by driving blindfolded.
>>>
>>> numa/sched pretends to compete without collecting information for the
>>> NUMA thread memory footprint (task_autonuma, sampled with a
>>> exponential backoff at 10sec intervals), and without process
>>> information (full static information from the pagetables, not
>>> sampled). No matter how you compute stuff, if you've nothing
>>> meaningful in input to your algorithm you lose. And it looks like you
>>> believe that you can take better decisions with nothing in input to
>>> your NUMA placement algorithm, because my thread info (task_autonuma)
>>> isn't 100% perfect at all times and it can't predict the future. The
>>> alternative is to get that information from syscalls, but even
>>> ignoring the -ENOMEM from split_vma, that will lead to userland bugs
>>> and overall the task_autonuma information may be more reliable in the
>>> end, even if it's sampled using an exponential backoff.
>>>
>>> Also note the exponential backoff thing, it's not really the last
>>> interval, it's the last interval plus half the previous interval plus
>>> 1/4 the previous interval etc... and we can trivially control the
>>> decay.
>>>
>>> All we need is to get a direction and knowing _exactly_ what the task
>>> did over the last 10 seconds (even if it can't predict the future of
>>> what the thread will do in the next 1sec), is all we need to get a
>>> direction. After we take the direction then the memory will follow so
>>> we cannot care less what it does in the next second because that will
>>> follow the CPU (after a while, last_nid anti-false-sharing logic
>>> permitting), and at least we'll know for sure that the memory accessed
>>> in the last 10sec is already local and that defines the best node to
>>> schedule the thread.
>>>
>>> I don't mean there's no room for improvement in the way the input data
>>> can be computed, and even in the way the input data can be generated,
>>> the exponential backoff decay can be tuned too, I just tried to do the
>>> simplest computations on the data to make the workloads converge fast
>>> and you're welcome to contribute.
>>>
>>> But I believe the task_autonuma information is extremely valuable and
>>> we can trust it very much knowing we'll get a great placement. The
>>> concern you have isn't invalid, but it's a very minor one and the
>>> sampling rate effects you are concerned about, while real, they're
>>> lost in the noise in practice.
>>
>>
>> Well, I think I am not convinced by your this many words. And surely
>> I  will NOT follow your reasoning of "Having information is always
>> good than nothing".  We all know that  an illy biased balancing is worse
>> than randomness:  at least randomness means "average, fair play, ...".
>> With all uncertain things, I think only a comprehensive survey
>> of real world workloads can tell if my concern is significant or not.
>>
>> So I think my suggestion to you is:  Show world some solid and sound
>> real world proof that your approximation is > 90% accurate, just like
>
>
> The cover letter contained a link to the performance:
> https://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma_bench-20120530.pdf
>
> It includes, specJbb, kernelbuild, cpuHog in guests, and handful of units
> tests.
>
> I'm sure anyone can beat most kernel algorithm with some pathological case
> including LRU and CFS. The only way to improve the numa balancing stuff is
> to sample more, meaning faulting more == larger overhead.
>
> Maybe its worth to add a measurement that if we've done too many bounding of
> a particular page to stop scan that page for a while. It's an optimization
> that needs to be prove it worth in real life.
>

Oh, sorry, I think I forgot a few last comments in my last post:

In case you really can take my advice and do comprehensive research,
make sure you compare the results of your fancy sampling algorithm with
this simple logic:

   "Blindly select a node, bind the process to it and move all its
    pages there."

Stupid as it may sound, I highly suspect it can come close to the
benchmark results you already have.

If that really turns out to be true, then all the sampling and
weighting stuff can be cut out.
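
For instance, the blind baseline could be a trivial userspace wrapper
like the sketch below (my own illustration using libnuma; pages already
allocated before the exec would still have to be moved separately, e.g.
with migratepages(8)):

/* blind-bind baseline sketch, illustration only; build with -lnuma */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int node;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <benchmark> [args...]\n", argv[0]);
		return 1;
	}
	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support\n");
		return 1;
	}

	/* blindly pick one node for the whole process lifetime */
	srand(getpid());
	node = rand() % (numa_max_node() + 1);

	numa_run_on_node(node);		/* pin the CPUs to that node */
	numa_set_preferred(node);	/* bias new allocations to it */

	fprintf(stderr, "blind-bound to node %d\n", node);
	execvp(argv[1], argv + 1);	/* run the benchmark under the bind */
	perror("execvp");
	return 1;
}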


Thanks,

Nai


^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-30  2:43                 ` Nai Xia
  2012-06-30  5:48                   ` Dor Laor
@ 2012-06-30 12:48                   ` Andrea Arcangeli
  2012-06-30 15:10                     ` Nai Xia
  1 sibling, 1 reply; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-30 12:48 UTC (permalink / raw)
  To: Nai Xia
  Cc: Peter Zijlstra, dlaor, Ingo Molnar, Hillf Danton, linux-kernel,
	linux-mm, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Sat, Jun 30, 2012 at 10:43:41AM +0800, Nai Xia wrote:
> Well, I think I am not convinced by your this many words. And surely
> I  will NOT follow your reasoning of "Having information is always
> good than nothing".  We all know that  an illy biased balancing is worse
> than randomness:  at least randomness means "average, fair play, ...".

The only way to get performance as good as the hard bindings is to
fully converge the load onto one node (or as few nodes as possible);
randomness won't get you very far in this case.

> With all uncertain things, I think only a comprehensive survey
> of real world workloads can tell if my concern is significant or not.

I welcome more real world tests.

I'm just not particularly concerned about your concern. The young bit
clearing during swapping would also be susceptible to it, just to make
another example. If that were a problem, swapping couldn't possibly
work OK either, because pte_numa and pte_young work the same way. In
fact pte_young is even less reliable, because the scan frequency is
more variable, so the phase effects will be even more visible.

The VM is a heuristic; it obviously doesn't need to be perfect at all
times. What matters is the probability that it does the right thing.
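
To make the analogy concrete, here is a toy model of the two samplers
(simplified, not the real kernel code, which goes through pte_t and
helpers like ptep_test_and_clear_young()); both clear a per-pte bit and
only learn that the page was touched at least once since the previous
pass:

struct toy_pte {
	unsigned int young:1;	/* accessed bit, sampled by reclaim */
	unsigned int numa:1;	/* pte_numa, armed by knuma_scand */
};

/* reclaim-style sampling: "was it referenced since I last cleared it?" */
static int toy_test_and_clear_young(struct toy_pte *pte)
{
	int was_young = pte->young;

	pte->young = 0;
	return was_young;
}

/* autonuma-style sampling: arm the bit, then the next access faults and
 * reports which node touched the page; one observation per scan pass,
 * just like one young-bit observation per reclaim pass */
static void toy_arm_numa_hinting(struct toy_pte *pte)
{
	pte->numa = 1;
}

static void toy_numa_hinting_fault(struct toy_pte *pte, int this_node,
				   unsigned long *fault_stats)
{
	if (!pte->numa)
		return;
	pte->numa = 0;
	fault_stats[this_node]++;	/* feeds the task_autonuma stats */
}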

> So I think my suggestion to you is:  Show world some solid and sound
> real world proof that your approximation is > 90% accurate, just like
> the pioneers already did to LRU(This problem is surely different from
> LRU. ).  Tons of words, will not do this.

http://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma_bench-20120530.pdf
http://dl.dropbox.com/u/82832537/kvm-numa-comparison-0.png

There's more but I haven't updated them yet.

Thanks,
Andrea


^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-30  6:58                     ` Nai Xia
@ 2012-06-30 13:04                       ` Andrea Arcangeli
  2012-06-30 15:19                         ` Nai Xia
  2012-06-30 19:37                       ` Dor Laor
  2012-06-30 23:55                         ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 327+ messages in thread
From: Andrea Arcangeli @ 2012-06-30 13:04 UTC (permalink / raw)
  To: Nai Xia
  Cc: dlaor, Peter Zijlstra, Ingo Molnar, Hillf Danton, linux-kernel,
	linux-mm, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Sat, Jun 30, 2012 at 02:58:29PM +0800, Nai Xia wrote:
> OK, I think I'd stop discussing this topic now. Without strict and comprehensive
> research on this topic, further arguments seems to me to be purely based on
> imagination.

I suggest considering how ptep_test_and_clear_young works on the
pte_young bit in the VM swapping code, and then applying your "concern"
to the pte_young bit scan. If you can't NAK the swapping code in the
kernel, well, I guess you can't NAK AutoNUMA either because of that
specific concern.

And no, I'm not saying this is trivial or obvious; I appreciate your
thoughts a lot. I'm just quite convinced this is a subtle detail, but
an irrelevant one that gets lost in the noise.

> If you insist on ignoring any constructive suggestions from others,
> it's pretty much ok to do so.  But I (and possibly many others who are
> watching)
> am pretty much  possible to do a LOL to your development style.

Well if you think answering your emails means ignoring your
suggestions, be my guest.


^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-30 12:48                   ` Andrea Arcangeli
@ 2012-06-30 15:10                     ` Nai Xia
  2012-07-02  7:36                       ` Rik van Riel
  0 siblings, 1 reply; 327+ messages in thread
From: Nai Xia @ 2012-06-30 15:10 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Peter Zijlstra, dlaor, Ingo Molnar, Hillf Danton, linux-kernel,
	linux-mm, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt



On 06/30/2012 20:48, Andrea Arcangeli wrote:
> On Sat, Jun 30, 2012 at 10:43:41AM +0800, Nai Xia wrote:
>> Well, I think I am not convinced by your this many words. And surely
>> I  will NOT follow your reasoning of "Having information is always
>> good than nothing".  We all know that  an illy biased balancing is worse
>> than randomness:  at least randomness means "average, fair play, ...".
>
> The only way to get good performance like the hard bindings is to
> fully converge the load into one node (or as fewer nodes as possible),
> randomness won't get you very far in this case.

I think by now everyone should agree that "converge the load into one
node" is correct. But I am just thinking that your random sampling is
not doing its work. Your benchmark is good.

But just as my last post pointed out, I wonder whether it is only
"converge the load into one node" that plays the important role. If
your random sampling is NOT really functioning as expected, or only
contributes a small gain to the whole benchmark, it may not be worth
all its complexity.

>
>> With all uncertain things, I think only a comprehensive survey
>> of real world workloads can tell if my concern is significant or not.
>
> I welcome more real world tests.
>
> I'm just not particularly concerned about your concern. The young bit
> clearing during swapping would also be susceptible to your concern
> just to make another example. If that would be a problem swapping
> wouldn't possibly work ok either because pte_numa or pte_young works
> the same way. In fact pte_young is even less reliable because the scan
> frequency will be more variable so the phase effects will be even more
> visible.

You know what makes me feel you are ignoring my words? Each time you
answer my mail with so many words, you keep losing my points in the
mail you answered:
Yes, pte_numa and pte_young work the same way, and they both can
answer the question of "which pages were accessed since the last
scan". For the LRU that's OK, it's quite enough. But for NUMA
balancing it's NOT. We should also care about the hotness of the page
sets, since if the workloads are complex we should NOT expect that "if
this page is accessed once, then it's always in my CPU cache during
the whole last scan interval".

The difference between the LRU and the problem you are trying to deal
with looks so obvious to me that I am worried you are still mixing
them up :(
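
Put as a toy example (mine, purely illustrative): what one scan pass
can produce is the first counter below, while W would also need the
second one, which a single accessed/numa bit per pte cannot provide:

#define TOY_MAX_NODES 8

/* what a scan pass approximates: how many of the task's pages on each
 * node were touched at least once since the previous pass */
static unsigned long toy_pages_touched[TOY_MAX_NODES];

/* what W = pages * frequency would also need: how many times those
 * pages were touched inside the interval; one bit per pte only records
 * the first access, so this stays unknown */
static unsigned long toy_weighted_accesses[TOY_MAX_NODES];

static int toy_preferred_node(void)
{
	int nid, best = 0;

	for (nid = 1; nid < TOY_MAX_NODES; nid++)
		if (toy_pages_touched[nid] > toy_pages_touched[best])
			best = nid;
	return best;	/* decided from counts alone, frequency ignored */
}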


>
> The VM is an heuristic, it obviously doesn't need to be perfect at all
> times, what matters is the probability that it does the right thing.

Probability is exactly what I was talking about and expect you guys to
analyze, and it is the thing I am curious about.

Thanks,

Nai


^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-30 13:04                       ` Andrea Arcangeli
@ 2012-06-30 15:19                         ` Nai Xia
  0 siblings, 0 replies; 327+ messages in thread
From: Nai Xia @ 2012-06-30 15:19 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: dlaor, Peter Zijlstra, Ingo Molnar, Hillf Danton, linux-kernel,
	linux-mm, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt



On 06/30/2012 21:04, Andrea Arcangeli wrote:
> On Sat, Jun 30, 2012 at 02:58:29PM +0800, Nai Xia wrote:
>> OK, I think I'd stop discussing this topic now. Without strict and comprehensive
>> research on this topic, further arguments seems to me to be purely based on
>> imagination.
>
> I suggest to consider how ptep_clear_and_test_young works on the
> pte_young bit on the VM swapping code. Then apply your "concern" to
> the pte_young bit scan. If you can't NAK the swapping code in the
> kernel, well I guess you can't nack AutoNUMA as well because of that
> specific concern.
>
> And no I'm not saying this is trivial or obvious, I appreciate your
> thoughts a lot, just I'm quite convinced this is a subtle detail but
> an irrelevant one that gets lost in the noise.
>
>> If you insist on ignoring any constructive suggestions from others,
>> it's pretty much ok to do so.  But I (and possibly many others who are
>> watching)
>> am pretty much  possible to do a LOL to your development style.
>
> Well if you think answering your emails means ignoring your
> suggestions, be my guest.

Thanks for your kindness. I was just feeling a little bit uncomfortable
about Dor's tone of "anyone can beat anything". It's OK now, since
anyone with eyes can easily see that I am not spamming this list.


Thanks,
Nai


^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-30  6:58                     ` Nai Xia
  2012-06-30 13:04                       ` Andrea Arcangeli
@ 2012-06-30 19:37                       ` Dor Laor
  2012-07-01  2:41                         ` Nai Xia
  2012-06-30 23:55                         ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 327+ messages in thread
From: Dor Laor @ 2012-06-30 19:37 UTC (permalink / raw)
  To: Nai Xia
  Cc: Andrea Arcangeli, Peter Zijlstra, Ingo Molnar, Hillf Danton,
	linux-kernel, linux-mm, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/30/2012 09:58 AM, Nai Xia wrote:
> On Sat, Jun 30, 2012 at 1:48 PM, Dor Laor <dlaor@redhat.com> wrote:
>> On 06/30/2012 05:43 AM, Nai Xia wrote:
>>>
>>> On Sat, Jun 30, 2012 at 9:23 AM, Andrea Arcangeli <aarcange@redhat.com>
>>> wrote:
>>>>
>>>> On Sat, Jun 30, 2012 at 04:01:50AM +0800, Nai Xia wrote:
>>>>>
>>>>> On Sat, Jun 30, 2012 at 2:53 AM, Peter Zijlstra <a.p.zijlstra@chello.nl>
>>>>> wrote:
>>>>>>
>>>>>> On Fri, 2012-06-29 at 12:51 -0400, Dor Laor wrote:
>>>>>>>
>>>>>>> The previous comments were not shouts but the mother of all NAKs.
>>>>>>
>>>>>>
>>>>>> I never said any such thing. I just said why should I bother reading
>>>>>> your stuff if you're ignoring most my feedback anyway.
>>>>>>
>>>>>> If you want to read that as a NAK, not my problem.
>>>>>
>>>>>
>>>>> Hey guys, Can I say NAK to these patches ?
>>>>>
>>>>> Now I aware that this sampling algorithm is completely broken, if we
>>>>> take
>>>>> a few seconds to see what it is trying to solve:
>>>>>
>>>>> We all know that LRU is try to solve the question of "what are the
>>>>> pages recently accessed?",
>>>>> so its engouth to use pte bits to approximate.
>>>>
>>>>
>>>> I made an example about the active list to try to explain it why your
>>>> example is still going to work fine.
>>>>
>>>> After it becomes active (from inactive) and it's being a referenced
>>>> active page, it won't become _very_active_ or _very_very_active_ or
>>>> more no matter how many more times you look up the pagecache.
>>>>
>>>> The LRU order wasn't relevant here.
>>>>
>>>>> However, the numa balancing problem is fundamentally like this:
>>>>>
>>>>> In some time unit,
>>>>>
>>>>>        W = pages_accessed  *  average_page_access_frequence
>>>>>
>>>>> We are trying to move process to the node having max W,  right?
>>>>
>>>>
>>>> First of all, the mm_autonuma statistics are not in function of time
>>>> and there is no page access frequency there.
>>>>
>>>> mm_autonuma is static information collected by knuma_scand from the
>>>> pagetables. That's static and 100% accurate on the whole process and
>>>> definitely not generated by the numa hinting page faults. I could shut
>>>> off all numa hinting page faults permanently and still generate the
>>>> mm_autonuma information identically.
>>>>
>>>> There's a knob in /sys/kernel/mm/autonuma/knuma_scand/working_set that
>>>> you can enable if you want to use a "runtime" and not static
>>>> information for the mm_autonuma too, but that's not the default for
>>>> now (but I think it may be a better default, there wasn't enough time
>>>> to test this yet)
>>>>
>>>> The task_autonuma (thread) statistics are the only thing that is
>>>> sampled by default in a 10sec interval (the interval tunable too with
>>>> sysfs, and 10sec is likely too aggressive, 30sec sounds better, we're
>>>> eventually going to make it dynamic anyway)
>>>>
>>>> So even if you were right, the thread statistics only kicks in to
>>>> balance threads against threads of the same process, most of the time
>>>> what's more important are the mm_autonuma statistics.
>>>>
>>>> But in reality the thread statistics also works perfectly for the job,
>>>> as an approximation of the NUMA memory footprint of the thread (vs the
>>>> other threads). And then the rest of the memory slowly follows
>>>> whatever node CPUs I placed the thread (even if that's not the
>>>> absolutely best one at all times).
>>>>
>>>>> Andrea's patch can only approximate the pages_accessed number in a
>>>>> time unit(scan interval),
>>>>> I don't think it can catch even 1% of  average_page_access_frequence
>>>>> on a busy workload.
>>>>> Blindly assuming that all the pages'  average_page_access_frequence is
>>>>> the same is seemly
>>>>> broken to me.
>>>>
>>>>
>>>> All we need is an approximation to take a better than random decision,
>>>> even if you get it 1% right, it's still better than 0% right by going
>>>> blind. Your 1% is too pessimistic, in my tests the thread statistics
>>>> are more like >90% correct in average (I monitor them with the debug
>>>> mode constantly).
>>>>
>>>> If this 1% right, happens one a million samples, who cares, it's not
>>>> going to run measurably slower anyway (and it will still be better
>>>> than picking a 0% right node).
>>>>
>>>> What you're saying is that because the active list in the pagecache
>>>> won't differentiate between 10 cache hits and 20 cache hits, we should
>>>> drop the active list and stop activating pages and just threat them
>>>> all the same because in some unlucky access pattern, the active list
>>>> may only get right 1% of the working set. But there's a reason why the
>>>> active list exists despite it may get things wrong in some corner case
>>>> and possibly leave the large amount of pages accessed infrequently in
>>>> the inactive list forever (even if it gets things only 1% right in
>>>> those worst cases, it's still better than 0% right and no active list
>>>> at all).
>>>>
>>>> To say it in another way, you may still crash with the car even if
>>>> you're careful, but do you think it's better to watch at the street or
>>>> to drive blindfolded?
>>>>
>>>> numa/sched drives blindfolded, autonuma watches around every 10sec
>>>> very carefully for the best next turn to take with the car and to
>>>> avoid obstacles, you can imagine who wins.
>>>>
>>>> Watching the street carefully every 10sec doesn't mean the next moment
>>>> a missile won't hit your car to make you crash, you're still having
>>>> better chances not to crash than by driving blindfolded.
>>>>
>>>> numa/sched pretends to compete without collecting information for the
>>>> NUMA thread memory footprint (task_autonuma, sampled with a
>>>> exponential backoff at 10sec intervals), and without process
>>>> information (full static information from the pagetables, not
>>>> sampled). No matter how you compute stuff, if you've nothing
>>>> meaningful in input to your algorithm you lose. And it looks like you
>>>> believe that you can take better decisions with nothing in input to
>>>> your NUMA placement algorithm, because my thread info (task_autonuma)
>>>> isn't 100% perfect at all times and it can't predict the future. The
>>>> alternative is to get that information from syscalls, but even
>>>> ignoring the -ENOMEM from split_vma, that will lead to userland bugs
>>>> and overall the task_autonuma information may be more reliable in the
>>>> end, even if it's sampled using an exponential backoff.
>>>>
>>>> Also note the exponential backoff thing, it's not really the last
>>>> interval, it's the last interval plus half the previous interval plus
>>>> 1/4 the previous interval etc... and we can trivially control the
>>>> decay.
>>>>
>>>> All we need is to get a direction and knowing _exactly_ what the task
>>>> did over the last 10 seconds (even if it can't predict the future of
>>>> what the thread will do in the next 1sec), is all we need to get a
>>>> direction. After we take the direction then the memory will follow so
>>>> we cannot care less what it does in the next second because that will
>>>> follow the CPU (after a while, last_nid anti-false-sharing logic
>>>> permitting), and at least we'll know for sure that the memory accessed
>>>> in the last 10sec is already local and that defines the best node to
>>>> schedule the thread.
>>>>
>>>> I don't mean there's no room for improvement in the way the input data
>>>> can be computed, and even in the way the input data can be generated,
>>>> the exponential backoff decay can be tuned too, I just tried to do the
>>>> simplest computations on the data to make the workloads converge fast
>>>> and you're welcome to contribute.
>>>>
>>>> But I believe the task_autonuma information is extremely valuable and
>>>> we can trust it very much knowing we'll get a great placement. The
>>>> concern you have isn't invalid, but it's a very minor one and the
>>>> sampling rate effects you are concerned about, while real, they're
>>>> lost in the noise in practice.
>>>
>>>
>>> Well, I think I am not convinced by your this many words. And surely
>>> I  will NOT follow your reasoning of "Having information is always
>>> good than nothing".  We all know that  an illy biased balancing is worse
>>> than randomness:  at least randomness means "average, fair play, ...".
>>> With all uncertain things, I think only a comprehensive survey
>>> of real world workloads can tell if my concern is significant or not.
>>>
>>> So I think my suggestion to you is:  Show world some solid and sound

                                          ^^^^^^^^^^^^^
>>> real world proof that your approximation is > 90% accurate, just like
>>
>>
>> The cover letter contained a link to the performance:
>> https://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma_bench-20120530.pdf
>
> Yes, I saw this. But if you consider this already a solid and
> comprehensive proof.
> You win ,  I have no other words to say.

No one says there is a proof; on the contrary, I said it's possible to
beat any heuristic algorithm, and Andrea explained the LRU is one too.

You asked above for real world examples, and that's what Andrea was
trying to provide (note that it includes a tiny regression with a
parallel kernel compile on tmpfs).

>
>>
>> It includes, specJbb, kernelbuild, cpuHog in guests, and handful of units
>> tests.
>>
>> I'm sure anyone can beat most kernel algorithm with some pathological case
>> including LRU and CFS. The only way to improve the numa balancing stuff is
>
> Like I already put, the pathological cases for LRU were already well understood
> for decades, they are quite valid to ignore.  And every programmer has
> be taught to
> avoid these cases.  And this conclusion took much much time of many many
> talented brains.

Who are these programmers you are talking about? The average Java
programmer is clueless w.r.t. memory allocation.
Even with KVM we have the double swap storm issue, when both the guest
and the host have the same page on their LRU lists.

>
> But the problem of this algorithm is not. And you are putting haste conclusion
> of it without bothering to do comprehensive research.
>
> "Collect the data from a wide range of pages occasionally,
> and then do a condense computing on a small set of pages" looks a very common
> practice to me.  But again, if you simply label this as "minor".
> I have no other words to say.
>
>> to sample more, meaning faulting more == larger overhead.
>
> Are you sure you really want to compete the sampling speed with CPU intensive
> workloads?

You didn't understand my point. I was saying exactly this: it's not
worth sampling more, because it carries a huge overhead.
Please don't be so fast on the 'send' trigger :)

>
> OK, I think I'd stop discussing this topic now. Without strict and comprehensive
> research on this topic, further arguments seems to me to be purely based on
> imagination.
>
> And I have no interest in beating any of your fancy algorithm, it wouldn't bring
> me 1G$.  I am just curiously about the truth.

No one besides you said it's fancy. I actually proposed a way to
mitigate such false migrations in my previous reply.


> Basically, anyone has the right to laugh,  if   W = x * y and you only

Laughing while discussing NUMA code is underestimated!
Let's not continue to spam the list; I think we've all made our points,
Dor


^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-30  6:58                     ` Nai Xia
@ 2012-06-30 23:55                         ` Benjamin Herrenschmidt
  2012-06-30 19:37                       ` Dor Laor
  2012-06-30 23:55                         ` Benjamin Herrenschmidt
  2 siblings, 0 replies; 327+ messages in thread
From: Benjamin Herrenschmidt @ 2012-06-30 23:55 UTC (permalink / raw)
  To: Nai Xia
  Cc: dlaor, Andrea Arcangeli, Peter Zijlstra, Ingo Molnar,
	Hillf Danton, linux-kernel, linux-mm, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris

On Sat, 2012-06-30 at 14:58 +0800, Nai Xia wrote:
> If you insist on ignoring any constructive suggestions from others,

But there is nothing constructive about your criticism.

You are basically saying that the whole thing cannot work unless it's
based on 20 years of research. Duh !

Ben.



^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-30 19:37                       ` Dor Laor
@ 2012-07-01  2:41                         ` Nai Xia
  0 siblings, 0 replies; 327+ messages in thread
From: Nai Xia @ 2012-07-01  2:41 UTC (permalink / raw)
  To: dlaor
  Cc: Andrea Arcangeli, Peter Zijlstra, Ingo Molnar, Hillf Danton,
	linux-kernel, linux-mm, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt



On 2012年07月01日 03:37, Dor Laor wrote:
> On 06/30/2012 09:58 AM, Nai Xia wrote:
>> On Sat, Jun 30, 2012 at 1:48 PM, Dor Laor <dlaor@redhat.com> wrote:
>>> On 06/30/2012 05:43 AM, Nai Xia wrote:
>>>>
>>>> On Sat, Jun 30, 2012 at 9:23 AM, Andrea Arcangeli <aarcange@redhat.com>
>>>> wrote:
>>>>>
>>>>> On Sat, Jun 30, 2012 at 04:01:50AM +0800, Nai Xia wrote:
>>>>>>
>>>>>> On Sat, Jun 30, 2012 at 2:53 AM, Peter Zijlstra <a.p.zijlstra@chello.nl>
>>>>>> wrote:
>>>>>>>
>>>>>>> On Fri, 2012-06-29 at 12:51 -0400, Dor Laor wrote:
>>>>>>>>
>>>>>>>> The previous comments were not shouts but the mother of all NAKs.
>>>>>>>
>>>>>>>
>>>>>>> I never said any such thing. I just said why should I bother reading
>>>>>>> your stuff if you're ignoring most my feedback anyway.
>>>>>>>
>>>>>>> If you want to read that as a NAK, not my problem.
>>>>>>
>>>>>>
>>>>>> Hey guys, Can I say NAK to these patches ?
>>>>>>
>>>>>> Now I aware that this sampling algorithm is completely broken, if we
>>>>>> take
>>>>>> a few seconds to see what it is trying to solve:
>>>>>>
>>>>>> We all know that LRU is try to solve the question of "what are the
>>>>>> pages recently accessed?",
>>>>>> so its engouth to use pte bits to approximate.
>>>>>
>>>>>
>>>>> I made an example about the active list to try to explain it why your
>>>>> example is still going to work fine.
>>>>>
>>>>> After it becomes active (from inactive) and it's being a referenced
>>>>> active page, it won't become _very_active_ or _very_very_active_ or
>>>>> more no matter how many more times you look up the pagecache.
>>>>>
>>>>> The LRU order wasn't relevant here.
>>>>>
>>>>>> However, the numa balancing problem is fundamentally like this:
>>>>>>
>>>>>> In some time unit,
>>>>>>
>>>>>> W = pages_accessed * average_page_access_frequence
>>>>>>
>>>>>> We are trying to move process to the node having max W, right?
>>>>>
>>>>>
>>>>> First of all, the mm_autonuma statistics are not in function of time
>>>>> and there is no page access frequency there.
>>>>>
>>>>> mm_autonuma is static information collected by knuma_scand from the
>>>>> pagetables. That's static and 100% accurate on the whole process and
>>>>> definitely not generated by the numa hinting page faults. I could shut
>>>>> off all numa hinting page faults permanently and still generate the
>>>>> mm_autonuma information identically.
>>>>>
>>>>> There's a knob in /sys/kernel/mm/autonuma/knuma_scand/working_set that
>>>>> you can enable if you want to use a "runtime" and not static
>>>>> information for the mm_autonuma too, but that's not the default for
>>>>> now (but I think it may be a better default, there wasn't enough time
>>>>> to test this yet)
>>>>>
>>>>> The task_autonuma (thread) statistics are the only thing that is
>>>>> sampled by default in a 10sec interval (the interval tunable too with
>>>>> sysfs, and 10sec is likely too aggressive, 30sec sounds better, we're
>>>>> eventually going to make it dynamic anyway)
>>>>>
>>>>> So even if you were right, the thread statistics only kicks in to
>>>>> balance threads against threads of the same process, most of the time
>>>>> what's more important are the mm_autonuma statistics.
>>>>>
>>>>> But in reality the thread statistics also works perfectly for the job,
>>>>> as an approximation of the NUMA memory footprint of the thread (vs the
>>>>> other threads). And then the rest of the memory slowly follows
>>>>> whatever node CPUs I placed the thread (even if that's not the
>>>>> absolutely best one at all times).
>>>>>
>>>>>> Andrea's patch can only approximate the pages_accessed number in a
>>>>>> time unit(scan interval),
>>>>>> I don't think it can catch even 1% of average_page_access_frequence
>>>>>> on a busy workload.
>>>>>> Blindly assuming that all the pages' average_page_access_frequence is
>>>>>> the same is seemly
>>>>>> broken to me.
>>>>>
>>>>>
>>>>> All we need is an approximation to take a better than random decision,
>>>>> even if you get it 1% right, it's still better than 0% right by going
>>>>> blind. Your 1% is too pessimistic, in my tests the thread statistics
>>>>> are more like >90% correct in average (I monitor them with the debug
>>>>> mode constantly).
>>>>>
>>>>> If this 1% right, happens one a million samples, who cares, it's not
>>>>> going to run measurably slower anyway (and it will still be better
>>>>> than picking a 0% right node).
>>>>>
>>>>> What you're saying is that because the active list in the pagecache
>>>>> won't differentiate between 10 cache hits and 20 cache hits, we should
>>>>> drop the active list and stop activating pages and just threat them
>>>>> all the same because in some unlucky access pattern, the active list
>>>>> may only get right 1% of the working set. But there's a reason why the
>>>>> active list exists despite it may get things wrong in some corner case
>>>>> and possibly leave the large amount of pages accessed infrequently in
>>>>> the inactive list forever (even if it gets things only 1% right in
>>>>> those worst cases, it's still better than 0% right and no active list
>>>>> at all).
>>>>>
>>>>> To say it in another way, you may still crash with the car even if
>>>>> you're careful, but do you think it's better to watch at the street or
>>>>> to drive blindfolded?
>>>>>
>>>>> numa/sched drives blindfolded, autonuma watches around every 10sec
>>>>> very carefully for the best next turn to take with the car and to
>>>>> avoid obstacles, you can imagine who wins.
>>>>>
>>>>> Watching the street carefully every 10sec doesn't mean the next moment
>>>>> a missile won't hit your car to make you crash, you're still having
>>>>> better chances not to crash than by driving blindfolded.
>>>>>
>>>>> numa/sched pretends to compete without collecting information for the
>>>>> NUMA thread memory footprint (task_autonuma, sampled with a
>>>>> exponential backoff at 10sec intervals), and without process
>>>>> information (full static information from the pagetables, not
>>>>> sampled). No matter how you compute stuff, if you've nothing
>>>>> meaningful in input to your algorithm you lose. And it looks like you
>>>>> believe that you can take better decisions with nothing in input to
>>>>> your NUMA placement algorithm, because my thread info (task_autonuma)
>>>>> isn't 100% perfect at all times and it can't predict the future. The
>>>>> alternative is to get that information from syscalls, but even
>>>>> ignoring the -ENOMEM from split_vma, that will lead to userland bugs
>>>>> and overall the task_autonuma information may be more reliable in the
>>>>> end, even if it's sampled using an exponential backoff.
>>>>>
>>>>> Also note the exponential backoff thing, it's not really the last
>>>>> interval, it's the last interval plus half the previous interval plus
>>>>> 1/4 the previous interval etc... and we can trivially control the
>>>>> decay.
>>>>>
>>>>> All we need is to get a direction and knowing _exactly_ what the task
>>>>> did over the last 10 seconds (even if it can't predict the future of
>>>>> what the thread will do in the next 1sec), is all we need to get a
>>>>> direction. After we take the direction then the memory will follow so
>>>>> we cannot care less what it does in the next second because that will
>>>>> follow the CPU (after a while, last_nid anti-false-sharing logic
>>>>> permitting), and at least we'll know for sure that the memory accessed
>>>>> in the last 10sec is already local and that defines the best node to
>>>>> schedule the thread.
>>>>>
>>>>> I don't mean there's no room for improvement in the way the input data
>>>>> can be computed, and even in the way the input data can be generated,
>>>>> the exponential backoff decay can be tuned too, I just tried to do the
>>>>> simplest computations on the data to make the workloads converge fast
>>>>> and you're welcome to contribute.
>>>>>
>>>>> But I believe the task_autonuma information is extremely valuable and
>>>>> we can trust it very much knowing we'll get a great placement. The
>>>>> concern you have isn't invalid, but it's a very minor one and the
>>>>> sampling rate effects you are concerned about, while real, they're
>>>>> lost in the noise in practice.
>>>>
>>>>
>>>> Well, I think I am not convinced by your this many words. And surely
>>>> I will NOT follow your reasoning of "Having information is always
>>>> good than nothing". We all know that an illy biased balancing is worse
>>>> than randomness: at least randomness means "average, fair play, ...".
>>>> With all uncertain things, I think only a comprehensive survey
>>>> of real world workloads can tell if my concern is significant or not.
>>>>
>>>> So I think my suggestion to you is: Show world some solid and sound
>
> ^^^^^^^^^^^^^
>>>> real world proof that your approximation is > 90% accurate, just like
>>>
>>>
>>> The cover letter contained a link to the performance:
>>> https://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma_bench-20120530.pdf
>>
>> Yes, I saw this. But if you consider this already a solid and
>> comprehensive proof.
>> You win , I have no other words to say.
>
> No one says there is a proof, on contrary, I said it's possible to beat any heuristic algorithm and Andrea explained the LRU is such too.
>
> You asked above for real world example and that's what Andrea was trying to achieve (note that it includes tiny regression w/ parallel kernel compile on tmpfs).
>
>>
>>>
>>> It includes, specJbb, kernelbuild, cpuHog in guests, and handful of units
>>> tests.
>>>
>>> I'm sure anyone can beat most kernel algorithm with some pathological case
>>> including LRU and CFS. The only way to improve the numa balancing stuff is
>>
>> Like I already put, the pathological cases for LRU were already well understood
>> for decades, they are quite valid to ignore. And every programmer has
>> be taught to
>> avoid these cases. And this conclusion took much much time of many many
>> talented brains.
>
> Who are these programmers that you talk about? The average Java programmer is clueless w.r.t memory allocation.
> Even w/ KVM we have an issue of double swap storm when both the guest and the host will have the same page on their LRU list.
>
>>
>> But the problem of this algorithm is not. And you are putting haste conclusion
>> of it without bothering to do comprehensive research.
>>
>> "Collect the data from a wide range of pages occasionally,
>> and then do a condense computing on a small set of pages" looks a very common
>> practice to me. But again, if you simply label this as "minor".
>> I have no other words to say.
>>
>>> to sample more, meaning faulting more == larger overhead.
>>
>> Are you sure you really want to compete the sampling speed with CPU intensive
>> workloads?
>
> You didn't understand my point - I was saying exactly this - it's not worth to sample more because it carries a huge over head.
> Pls don't be that fast on the 'send' trigger :)
>
>>
>> OK, I think I'd stop discussing this topic now. Without strict and comprehensive
>> research on this topic, further arguments seems to me to be purely based on
>> imagination.
>>
>> And I have no interest in beating any of your fancy algorithm, it wouldn't bring
>> me 1G$. I am just curiously about the truth.
>
> No one said it's fancy beside you. I actually proposed a way to relax such false migrations in my previous reply.
>
>
>> Basically, anyone has the right to laugh, if W = x * y and you only
>
> Laughing while discussing numa code is under estimated!
> Let's not continue to spam the list, I think we've all made our points,
> Dor

Note, my laughing is based on your attitude of "anyone can beat anything",
which is also underestimated. I remember that despite my criticism of
the sampling I was quite friendly to Andrea, so why the
acidity/bitterness?

But you are right. We all made our points. Let's stop it.



^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-30 23:55                         ` Benjamin Herrenschmidt
@ 2012-07-01  3:10                           ` Nai Xia
  -1 siblings, 0 replies; 327+ messages in thread
From: Nai Xia @ 2012-07-01  3:10 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: dlaor, Andrea Arcangeli, Peter Zijlstra, Ingo Molnar,
	Hillf Danton, linux-kernel, linux-mm, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris



On 2012年07月01日 07:55, Benjamin Herrenschmidt wrote:
> On Sat, 2012-06-30 at 14:58 +0800, Nai Xia wrote:
>> If you insist on ignoring any constructive suggestions from others,
>
> But there is nothing constructive about your criticism.
>
> You are basically saying that the whole thing cannot work unless it's
> based on 20 years of research. Duh !

1. You quote me wrongly: I said "group all pages to one node" is correct,
and quite possibly plays the major role in your benchmarks.
Sampling is completely broken from my point of view. PZ's patch also
has a similar idea of "group all pages to one node", which I think
is also correct.

2. My suggestion to Andrea: do some more comparative benchmarks to
see what's really happening inside, instead of only macro benchmarks.
You need 20 hours of carefully designed survey research
for a new algorithm, instead of reading my mail and spending 20 minutes
to reach a conclusion.

If you cannot see the constructiveness of my suggestion, that's
your problem, not mine.

I understand the hard feelings of seeing possible brokenness in a
thing you've already spent a lot of time on. But that's the way people
seek the truth.

You see, you guys have spent quite some time defending your points;
if this time had been used to follow my advice and do some further
analysis, maybe you'd already have some valuable information.

Dor was right, we all made our points. And we are all busy.
Let's stop it. Thanks.

>
> Ben.
>
>

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 20/40] autonuma: alloc/free/init mm_autonuma
  2012-06-28 12:56   ` Andrea Arcangeli
@ 2012-07-01 15:33     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-01 15:33 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:

> diff --git a/kernel/fork.c b/kernel/fork.c
> index 0adbe09..3e5a0d9 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -527,6 +527,8 @@ static void mm_init_aio(struct mm_struct *mm)
>
>   static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
>   {
> +	if (unlikely(alloc_mm_autonuma(mm)))
> +		goto out_free_mm;
>   	atomic_set(&mm->mm_users, 1);
>   	atomic_set(&mm->mm_count, 1);
>   	init_rwsem(&mm->mmap_sem);

I wonder if it would be possible to defer the allocation
of the mm_autonuma struct to knuma_scand, so short lived
processes never have to allocate and free the mm_autonuma
structure.

That way we only have a function call at exit time, and
the branch inside kfree that checks for a null pointer.
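
Rough, untested sketch of what I have in mind, reusing the names from
this patch set (knumad_attach_mm and mm_autonuma_cachep are made-up
names, and I am hand-waving how knuma_scand enumerates the mm's it has
not seen yet):

	/*
	 * Let the scanning daemon allocate mm_autonuma lazily the
	 * first time it visits an mm, so mm_init() pays nothing and
	 * short lived processes never allocate it at all.
	 * Locking for knumad_scan.mm_head is left out.
	 */
	static int knumad_attach_mm(struct mm_struct *mm)
	{
		struct mm_autonuma *mma;

		if (mm->mm_autonuma)
			return 0;

		mma = kmem_cache_zalloc(mm_autonuma_cachep, GFP_KERNEL);
		if (!mma)
			return -ENOMEM;	/* just skip this mm for now */

		mma->mm = mm;
		mm->mm_autonuma = mma;
		list_add_tail(&mma->mm_node, &knumad_scan.mm_head);
		return 0;
	}

	void free_mm_autonuma(struct mm_struct *mm)
	{
		/* NULL simply means knuma_scand never looked at this mm */
		if (!mm->mm_autonuma)
			return;
		kmem_cache_free(mm_autonuma_cachep, mm->mm_autonuma);
		mm->mm_autonuma = NULL;
	}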

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 22/40] autonuma: teach CFS about autonuma affinity
  2012-06-28 12:56   ` Andrea Arcangeli
@ 2012-07-01 16:37     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-01 16:37 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:

> @@ -2621,6 +2622,8 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>   		load = weighted_cpuload(i);
>
>   		if (load<  min_load || (load == min_load&&  i == this_cpu)) {
> +			if (!task_autonuma_cpu(p, i))
> +				continue;
>   			min_load = load;
>   			idlest = i;
>   		}

Is it right to only consider CPUs on the "right" NUMA
node, or do we want to harvest idle time elsewhere as
a last resort?

After your change the comment above find_idlest_cpu
no longer matches what the function does!

                 if (load < min_load || (load == min_load && i == this_cpu)) {
                         min_load = load;
                         idlest = i;
                 }

Would it make sense for task_autonuma_cpu(p, i) to be
part of the if () condition, since that is what you are
trying to do?

		if ((load < min_load || (load == min_load &&
			i == this_cpu)) && task_autonuma_cpu(p, i)) {

> @@ -2639,24 +2642,27 @@ static int select_idle_sibling(struct task_struct *p, int target)

These bits make sense.

>   	/*
>   	 * Otherwise, iterate the domains and find an elegible idle cpu.
>   	 */
> +	idle_target = false;
>   	sd = rcu_dereference(per_cpu(sd_llc, target));
>   	for_each_lower_domain(sd) {
>   		sg = sd->groups;
> @@ -2670,9 +2676,18 @@ static int select_idle_sibling(struct task_struct *p, int target)
>   					goto next;
>   			}
>
> -			target = cpumask_first_and(sched_group_cpus(sg),
> -					tsk_cpus_allowed(p));
> -			goto done;
> +			for_each_cpu_and(i, sched_group_cpus(sg),
> +						tsk_cpus_allowed(p)) {
> +				/* Find autonuma cpu only in idle group */
> +				if (task_autonuma_cpu(p, i)) {
> +					target = i;
> +					goto done;
> +				}
> +				if (!idle_target) {
> +					idle_target = true;
> +					target = i;
> +				}
> +			}

There already is a for loop right above this:

                         for_each_cpu(i, sched_group_cpus(sg)) {
                                 if (!idle_cpu(i))
                                         goto next;
                         }

It appears to loop over all the CPUs in a sched group, but
not really.  If the first CPU in the sched group is idle,
it will fall through.

If the first CPU in the sched group is not idle, we move
on to the next sched group, instead of looking at the
other CPUs in the sched group.

Peter, Ingo, what is the original code in select_idle_sibling
supposed to do?

That original for_each_cpu loop would make more sense if
it actually looped over each cpu in the group.

Then we could remember two targets. One idle target, and
one autonuma-compliant idle target.

If, after looping over the CPUs, we find no autonuma-compliant
target, we use the other idle target.

Does that make sense?

Am I overlooking something about the way select_idle_sibling
is supposed to work?
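
Untested sketch of the per-group walk I am thinking of, reusing the
names and the done/next labels from the patch (note it also drops the
"whole group must be idle" behaviour, which may or may not be what we
want):

	int fallback = -1;

	for_each_cpu_and(i, sched_group_cpus(sg), tsk_cpus_allowed(p)) {
		if (!idle_cpu(i))
			continue;
		/* an idle CPU that is also autonuma-compliant wins */
		if (task_autonuma_cpu(p, i)) {
			target = i;
			goto done;
		}
		/* otherwise remember the first plain idle CPU */
		if (fallback < 0)
			fallback = i;
	}
	if (fallback >= 0) {
		target = fallback;
		goto done;
	}
	/* no idle CPU at all in this group, try the next one */
	goto next;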

> @@ -3195,6 +3217,8 @@ static int move_one_task(struct lb_env *env)
>   {
>   	struct task_struct *p, *n;
>
> +	env->flags |= LBF_NUMA;
> +numa_repeat:
>   	list_for_each_entry_safe(p, n,&env->src_rq->cfs_tasks, se.group_node) {
>   		if (throttled_lb_pair(task_group(p), env->src_rq->cpu, env->dst_cpu))
>   			continue;
> @@ -3209,8 +3233,14 @@ static int move_one_task(struct lb_env *env)
>   		 * stats here rather than inside move_task().
>   		 */
>   		schedstat_inc(env->sd, lb_gained[env->idle]);
> +		env->flags&= ~LBF_NUMA;
>   		return 1;
>   	}
> +	if (env->flags&  LBF_NUMA) {
> +		env->flags&= ~LBF_NUMA;
> +		goto numa_repeat;
> +	}
> +
>   	return 0;
>   }

Would it make sense to remember the first non-autonuma-compliant
task that can be moved, and keep searching for one that fits
autonuma's criteria further down the line?

Then, if you fail to find a good autonuma task in the first
iteration, you do not have to loop over the list a second time.
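
Untested sketch of what I mean for move_one_task(), with the LBF_NUMA
two-pass loop replaced by a single pass that remembers a fallback task
(can_migrate_task() is assumed here to do only the non-NUMA checks):

	struct task_struct *p, *n, *fallback = NULL;

	list_for_each_entry_safe(p, n, &env->src_rq->cfs_tasks, se.group_node) {
		if (throttled_lb_pair(task_group(p), env->src_rq->cpu,
				      env->dst_cpu))
			continue;
		if (!can_migrate_task(p, env))
			continue;
		/* keep looking for an autonuma-compliant task ... */
		if (!task_autonuma_cpu(p, env->dst_cpu)) {
			/* ... but remember the first movable one */
			if (!fallback)
				fallback = p;
			continue;
		}
		move_task(p, env);
		schedstat_inc(env->sd, lb_gained[env->idle]);
		return 1;
	}
	if (fallback) {
		move_task(fallback, env);
		schedstat_inc(env->sd, lb_gained[env->idle]);
		return 1;
	}
	return 0;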

> @@ -3235,6 +3265,8 @@ static int move_tasks(struct lb_env *env)
>   	if (env->imbalance<= 0)
>   		return 0;
>
> +	env->flags |= LBF_NUMA;
> +numa_repeat:

Same here.  Loops are bad enough, and it looks like it would
only cost one pointer on the stack to avoid numa_repeat :)

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 23/40] autonuma: sched_set_autonuma_need_balance
  2012-06-28 12:56   ` Andrea Arcangeli
@ 2012-07-01 16:57     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-01 16:57 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:
> Invoke autonuma_balance only on the busy CPUs at the same frequency of
> the CFS load balance.
>
> Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>
> ---
>   kernel/sched/fair.c |    3 +++
>   1 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index dab9bdd..ff288c0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4906,6 +4906,9 @@ static void run_rebalance_domains(struct softirq_action *h)
>
>   	rebalance_domains(this_cpu, idle);
>
> +	if (!this_rq->idle_balance)
> +		sched_set_autonuma_need_balance();
> +

Misleading function name in patch 13, this actually calls
sched_autonuma_balance and is not setting a flag like the
name suggests (to me).

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 24/40] autonuma: core
  2012-06-28 12:56   ` Andrea Arcangeli
@ 2012-07-02  4:07     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-02  4:07 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:

> +unsigned long autonuma_flags __read_mostly =
> +	(1<<AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG)|
> +	(1<<AUTONUMA_SCHED_CLONE_RESET_FLAG)|
> +	(1<<AUTONUMA_SCHED_FORK_RESET_FLAG)|
> +#ifdef CONFIG_AUTONUMA_DEFAULT_ENABLED
> +	(1<<AUTONUMA_FLAG)|
> +#endif
> +	(1<<AUTONUMA_SCAN_PMD_FLAG);

Please document what the flags mean.

> +static DEFINE_MUTEX(knumad_mm_mutex);

Wait, you are working on a patch to increase performance on
(large) NUMA systems, but you are introducing a new GLOBAL
LOCK that needs to be taken for a task to be added on the
knumad_scan list?

Can you change the locking of that list, so fork and exit
are not serialized on knumad_mm_mutex?  Maybe RCU?

> +static inline bool autonuma_impossible(void)
> +{
> +	return num_possible_nodes()<= 1 ||
> +		test_bit(AUTONUMA_IMPOSSIBLE_FLAG,&autonuma_flags);
> +}

This seems to test whether autonuma is enabled or
disabled, and is called from a few hot paths.

Would it be better to turn this into a variable,
so this can be tested with one single compare?

Also, something like autonuma_enabled or
autonuma_disabled could be a clearer name.
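
For example (sketch only; the initcall name is made up and the
ordering against the "noautonuma" __setup handler would need
double checking):

	static bool autonuma_disabled __read_mostly;

	static inline bool autonuma_impossible(void)
	{
		return autonuma_disabled;
	}

	/* evaluate the expensive checks once, early at boot */
	static int __init autonuma_disabled_init(void)
	{
		if (num_possible_nodes() <= 1 ||
		    test_bit(AUTONUMA_IMPOSSIBLE_FLAG, &autonuma_flags))
			autonuma_disabled = true;
		return 0;
	}
	early_initcall(autonuma_disabled_init);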

> +/* caller already holds the compound_lock */
> +void autonuma_migrate_split_huge_page(struct page *page,
> +				      struct page *page_tail)
> +{
> +	int nid, last_nid;
> +
> +	nid = page->autonuma_migrate_nid;
> +	VM_BUG_ON(nid>= MAX_NUMNODES);
> +	VM_BUG_ON(nid<  -1);
> +	VM_BUG_ON(page_tail->autonuma_migrate_nid != -1);
> +	if (nid>= 0) {
> +		VM_BUG_ON(page_to_nid(page) != page_to_nid(page_tail));
> +
> +		compound_lock(page_tail);

The comment above the function says that the caller already
holds the compound lock, yet we try to grab it again here?

Is this a deadlock, or simply an out of date comment?

Either way, it needs to be fixed.

A comment telling us what this function is supposed to
do would not hurt, either.

> +void __autonuma_migrate_page_remove(struct page *page)
> +{
> +	unsigned long flags;
> +	int nid;

In fact, every function larger than about 5 lines could use
a short comment describing what its purpose is.

> +static void __autonuma_migrate_page_add(struct page *page, int dst_nid,
> +					int page_nid)
> +{

I wonder if _enqueue and _rmqueue (for the previous function)
would make more sense, since we are adding and removing the
page from a queue?

> +	VM_BUG_ON(dst_nid>= MAX_NUMNODES);
> +	VM_BUG_ON(dst_nid<  -1);
> +	VM_BUG_ON(page_nid>= MAX_NUMNODES);
> +	VM_BUG_ON(page_nid<  -1);
> +
> +	VM_BUG_ON(page_nid == dst_nid);
> +	VM_BUG_ON(page_to_nid(page) != page_nid);
> +
> +	flags = compound_lock_irqsave(page);

What does this lock protect against?

Should we check that those things are still true, after we
have acquired the lock?

> +static void autonuma_migrate_page_add(struct page *page, int dst_nid,
> +				      int page_nid)
> +{
> +	int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
> +	if (migrate_nid != dst_nid)
> +		__autonuma_migrate_page_add(page, dst_nid, page_nid);
> +}

Wait, how are three nids involved with the migration
of one page?

What is going on here, and why is there no comment explaining it?

> +static bool balance_pgdat(struct pglist_data *pgdat,
> +			  int nr_migrate_pages)
> +{
> +	/* FIXME: this only check the wmarks, make it move
> +	 * "unused" memory or pagecache by queuing it to
> +	 * pgdat->autonuma_migrate_head[pgdat->node_id].
> +	 */

vmscan.c also has a function called balance_pgdat, which does
something very different.

This function seems to check whether a node has enough free
memory. Maybe the name could reflect that?

> +static void cpu_follow_memory_pass(struct task_struct *p,
> +				   struct task_autonuma *task_autonuma,
> +				   unsigned long *task_numa_fault)
> +{
> +	int nid;
> +	for_each_node(nid)
> +		task_numa_fault[nid]>>= 1;
> +	task_autonuma->task_numa_fault_tot>>= 1;
> +}

This seems to age the statistic.  From the name I guess
it is called every pass, but something like numa_age_faults
might be a better name.

It could also use a short comment explaining why the statistics
are aged after each round, instead of zeroed out.
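
Something like this, maybe (the code is unchanged from the patch; only
the comments are new, and they just restate the exponential backoff
description given earlier in this thread):

	/*
	 * Age the NUMA fault statistics once per pass.  Halving
	 * instead of zeroing gives an exponential backoff: the last
	 * pass counts in full, the one before it with weight 1/2,
	 * then 1/4, and so on, so one noisy scan interval cannot
	 * flip the placement on its own.
	 */
	static void cpu_follow_memory_pass(struct task_struct *p,
					   struct task_autonuma *task_autonuma,
					   unsigned long *task_numa_fault)
	{
		int nid;
		for_each_node(nid)
			task_numa_fault[nid] >>= 1;
		task_autonuma->task_numa_fault_tot >>= 1;
	}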

> +static void numa_hinting_fault_cpu_follow_memory(struct task_struct *p,
> +						 int access_nid,
> +						 int numpages,
> +						 bool pass)
> +{
> +	struct task_autonuma *task_autonuma = p->task_autonuma;
> +	unsigned long *task_numa_fault = task_autonuma->task_numa_fault;
> +	if (unlikely(pass))
> +		cpu_follow_memory_pass(p, task_autonuma, task_numa_fault);
> +	task_numa_fault[access_nid] += numpages;
> +	task_autonuma->task_numa_fault_tot += numpages;
> +}

Function name seems to have no bearing on what the function
actually does, which appears to be some kind of statistics
update...

There is no explanation of when pass would be true (or false).

> +static inline bool last_nid_set(struct task_struct *p,
> +				struct page *page, int cpu_nid)
> +{
> +	bool ret = true;
> +	int autonuma_last_nid = ACCESS_ONCE(page->autonuma_last_nid);
> +	VM_BUG_ON(cpu_nid<  0);
> +	VM_BUG_ON(cpu_nid>= MAX_NUMNODES);
> +	if (autonuma_last_nid>= 0&&  autonuma_last_nid != cpu_nid) {
> +		int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
> +		if (migrate_nid>= 0&&  migrate_nid != cpu_nid)
> +			__autonuma_migrate_page_remove(page);
> +		ret = false;
> +	}
> +	if (autonuma_last_nid != cpu_nid)
> +		ACCESS_ONCE(page->autonuma_last_nid) = cpu_nid;
> +	return ret;
> +}

This function confuses me. Why is there an ACCESS_ONCE
around something that is accessed three times?

It looks like it is trying to set some info on a page,
possibly where the page should be migrated to, and cancel
any migration if the page already has a destination other
than our destination?

It does not help that I have no idea what last_nid means,
because that was not documented in the earlier patches.
The function could use a comment regarding its purpose.

> +static int __page_migrate_nid(struct page *page, int page_nid)
> +{
> +	int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
> +	if (migrate_nid<  0)
> +		migrate_nid = page_nid;
> +#if 0
> +	return page_nid;
> +#endif
> +	return migrate_nid;
> +}
> +
> +static int page_migrate_nid(struct page *page)
> +{
> +	return __page_migrate_nid(page, page_to_nid(page));
> +}

Why are there two functions that do the same thing?

Could this be collapsed into one function?

The #if 0 block could probably be removed, too.

> +static int knumad_scan_pmd(struct mm_struct *mm,
> +			   struct vm_area_struct *vma,
> +			   unsigned long address)

Like every other function, this one could use a comment
informing us of the general idea this function is supposed
to do.

> +{
> +	pgd_t *pgd;
> +	pud_t *pud;
> +	pmd_t *pmd;
> +	pte_t *pte, *_pte;
> +	struct page *page;
> +	unsigned long _address, end;
> +	spinlock_t *ptl;
> +	int ret = 0;
> +
> +	VM_BUG_ON(address&  ~PAGE_MASK);
> +
> +	pgd = pgd_offset(mm, address);
> +	if (!pgd_present(*pgd))
> +		goto out;
> +
> +	pud = pud_offset(pgd, address);
> +	if (!pud_present(*pud))
> +		goto out;
> +
> +	pmd = pmd_offset(pud, address);
> +	if (pmd_none(*pmd))
> +		goto out;
> +	if (pmd_trans_huge(*pmd)) {

This is a very large function.  Would it make sense to split the
pmd_trans_huge stuff into its own function?

> +		spin_lock(&mm->page_table_lock);
> +		if (pmd_trans_huge(*pmd)) {
> +			VM_BUG_ON(address&  ~HPAGE_PMD_MASK);
> +			if (unlikely(pmd_trans_splitting(*pmd))) {
> +				spin_unlock(&mm->page_table_lock);
> +				wait_split_huge_page(vma->anon_vma, pmd);
> +			} else {
> +				int page_nid;
> +				unsigned long *numa_fault_tmp;
> +				ret = HPAGE_PMD_NR;
> +
> +				if (autonuma_scan_use_working_set()&&
> +				    pmd_numa(*pmd)) {
> +					spin_unlock(&mm->page_table_lock);
> +					goto out;
> +				}
> +
> +				page = pmd_page(*pmd);
> +
> +				/* only check non-shared pages */
> +				if (page_mapcount(page) != 1) {
> +					spin_unlock(&mm->page_table_lock);
> +					goto out;
> +				}
> +
> +				page_nid = page_migrate_nid(page);
> +				numa_fault_tmp = mm_autonuma_numa_fault_tmp(mm);
> +				numa_fault_tmp[page_nid] += ret;
> +
> +				if (pmd_numa(*pmd)) {
> +					spin_unlock(&mm->page_table_lock);
> +					goto out;
> +				}
> +
> +				set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
> +				/* defer TLB flush to lower the overhead */
> +				spin_unlock(&mm->page_table_lock);
> +				goto out;
> +			}
> +		} else
> +			spin_unlock(&mm->page_table_lock);
> +	}
> +
> +	VM_BUG_ON(!pmd_present(*pmd));
> +
> +	end = min(vma->vm_end, (address + PMD_SIZE)&  PMD_MASK);
> +	pte = pte_offset_map_lock(mm, pmd, address,&ptl);
> +	for (_address = address, _pte = pte; _address<  end;
> +	     _pte++, _address += PAGE_SIZE) {
> +		unsigned long *numa_fault_tmp;
> +		pte_t pteval = *_pte;
> +		if (!pte_present(pteval))
> +			continue;
> +		if (autonuma_scan_use_working_set()&&
> +		    pte_numa(pteval))
> +			continue;

What is autonuma_scan_use_working_set supposed to do exactly?

This looks like a subtle piece of code, but without any explanation
of what it should do, I cannot verify whether it actually does that.

> +		page = vm_normal_page(vma, _address, pteval);
> +		if (unlikely(!page))
> +			continue;
> +		/* only check non-shared pages */
> +		if (page_mapcount(page) != 1)
> +			continue;
> +
> +		numa_fault_tmp = mm_autonuma_numa_fault_tmp(mm);
> +		numa_fault_tmp[page_migrate_nid(page)]++;

> +		if (pte_numa(pteval))
> +			continue;

Wait, so we count all pages, even the ones that are PTE_NUMA
as if they incurred faults, even when they did not?

pte_present seems to return true for a numa pte...

> +		if (!autonuma_scan_pmd())
> +			set_pte_at(mm, _address, _pte, pte_mknuma(pteval));
> +
> +		/* defer TLB flush to lower the overhead */
> +		ret++;
> +	}
> +	pte_unmap_unlock(pte, ptl);
> +
> +	if (ret&&  !pmd_numa(*pmd)&&  autonuma_scan_pmd()) {
> +		spin_lock(&mm->page_table_lock);
> +		set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
> +		spin_unlock(&mm->page_table_lock);
> +		/* defer TLB flush to lower the overhead */
> +	}

So depending on whether autonuma_scan_pmd is true or false,
we behave differently.  That wants some documenting, near
both the !autonuma_scan_pmd code and the code below...

> +static void mm_numa_fault_flush(struct mm_struct *mm)
> +{
> +	int nid;
> +	struct mm_autonuma *mma = mm->mm_autonuma;
> +	unsigned long *numa_fault_tmp = mm_autonuma_numa_fault_tmp(mm);
> +	unsigned long tot = 0;
> +	/* FIXME: protect this with seqlock against autonuma_balance() */

Yes, please do.

> +static int knumad_do_scan(void)
> +{
> +	struct mm_struct *mm;
> +	struct mm_autonuma *mm_autonuma;
> +	unsigned long address;
> +	struct vm_area_struct *vma;
> +	int progress = 0;
> +
> +	mm = knumad_scan.mm;
> +	if (!mm) {
> +		if (unlikely(list_empty(&knumad_scan.mm_head)))
> +			return pages_to_scan;

Wait a moment, in knuma_scand() you have this line:

		_progress = knumad_do_scan();

Why are you pretending you made progress, when you did not
scan anything?

This is nothing a comment cannot illuminate :)

> +	down_read(&mm->mmap_sem);
> +	if (unlikely(knumad_test_exit(mm)))

This could use a short comment.

		/* The process is exiting */
> +		vma = NULL;
> +	else
> +		vma = find_vma(mm, address);

This loop could use some comments:

> +	for (; vma&&  progress<  pages_to_scan; vma = vma->vm_next) {
> +		unsigned long start_addr, end_addr;
> +		cond_resched();
		/* process is exiting */
> +		if (unlikely(knumad_test_exit(mm))) {
> +			progress++;
> +			break;
> +		}
		/* only do anonymous memory without explicit numa placement */
> +		if (!vma->anon_vma || vma_policy(vma)) {
> +			progress++;
> +			continue;
> +		}
> +		if (vma->vm_flags&  (VM_PFNMAP | VM_MIXEDMAP)) {
> +			progress++;
> +			continue;
> +		}
> +		if (is_vma_temporary_stack(vma)) {
> +			progress++;
> +			continue;
> +		}
> +
> +		VM_BUG_ON(address&  ~PAGE_MASK);
> +		if (address<  vma->vm_start)
> +			address = vma->vm_start;

How can this happen, when we did a find_vma above?

> +		flush_tlb_range(vma, start_addr, end_addr);
> +		mmu_notifier_invalidate_range_end(vma->vm_mm, start_addr,
> +						  end_addr);
> +	}
> +	up_read(&mm->mmap_sem); /* exit_mmap will destroy ptes after this */

That does not seem like a certainty.  The process might not
exit for hours, or even days.

Also, mmap_sem protects against many more things than just
exit_mmap. It could also be mprotect, munmap or exec.

This may be the one comment in the series so far that is
best removed :)

> +static int knuma_scand(void *none)
> +{
> +	struct mm_struct *mm = NULL;
> +	int progress = 0, _progress;
> +	unsigned long total_progress = 0;
> +
> +	set_freezable();
> +
> +	knuma_scand_disabled();
> +
> +	mutex_lock(&knumad_mm_mutex);
> +
> +	for (;;) {
> +		if (unlikely(kthread_should_stop()))
> +			break;
> +		_progress = knumad_do_scan();
> +		progress += _progress;
> +		total_progress += _progress;

Huh?  Tracking both progress and total_progress?

There is no explanation of what the difference between these
two is, or why you need them.

> +	mm = knumad_scan.mm;
> +	knumad_scan.mm = NULL;
> +	if (mm&&  knumad_test_exit(mm)) {
> +		list_del(&mm->mm_autonuma->mm_node);
> +		/* tell autonuma_exit not to list_del */
> +		VM_BUG_ON(mm->mm_autonuma->mm != mm);
> +		mm->mm_autonuma->mm = NULL;
> +	}
> +	mutex_unlock(&knumad_mm_mutex);
> +
> +	if (mm)
> +		mmdrop(mm);
> +

Why doesn't knumad_do_scan take care of the mmdrop?

Doing this in the calling function is somewhat confusing.

I see that knumad_scan can hold the refcount of an mm
elevated in-between scans, for however long it sleeps.
Is that really something we want?

In the days where we did virtual scanning, the code in
vmscan.c simply moved a process to the end of the mm_list
once it was done with it, and the currently to-scan process
was always at the head of the list.

> +static int isolate_migratepages(struct list_head *migratepages,
> +				struct pglist_data *pgdat)

The kernel already has another function called isolate_migratepages.

Would be nice if this function could at least be documented to
state its purpose. Maybe renamed to make it clear this is the NUMA
specific version.

> +{
> +	int nr = 0, nid;
> +	struct list_head *heads = pgdat->autonuma_migrate_head;
> +
> +	/* FIXME: THP balancing, restart from last nid */

I guess a static variable in the pgdat struct could take care
of that?

> +		if (PageTransHuge(page)) {
> +			VM_BUG_ON(!PageAnon(page));
> +			/* FIXME: remove split_huge_page */

Fair enough. Other people are working on that code :)

> +		if (!__isolate_lru_page(page, 0)) {
> +			VM_BUG_ON(PageTransCompound(page));
> +			del_page_from_lru_list(page, lruvec, page_lru(page));
> +			inc_zone_state(zone, page_is_file_cache(page) ?
> +				       NR_ISOLATED_FILE : NR_ISOLATED_ANON);
> +			spin_unlock_irq(&zone->lru_lock);
> +			/*
> +			 * hold the page pin at least until
> +			 * __isolate_lru_page succeeds
> +			 * (__isolate_lru_page takes a second pin when
> +			 * it succeeds). If we release the pin before
> +			 * __isolate_lru_page returns, the page could
> +			 * have been freed and reallocated from under
> +			 * us, so rendering worthless our previous
> +			 * checks on the page including the
> +			 * split_huge_page call.
> +			 */
> +			put_page(page);
> +
> +			list_add(&page->lru, migratepages);
> +			nr += hpage_nr_pages(page);
> +		} else {
> +			/* FIXME: losing page, safest and simplest for now */

Losing page?  As in a memory leak?

Or as in	/* Something happened. Skip migrating the page. */ ?

> +static void knumad_do_migrate(struct pglist_data *pgdat)
> +{
> +	int nr_migrate_pages = 0;
> +	LIST_HEAD(migratepages);
> +
> +	autonuma_printk("nr_migrate_pages %lu to node %d\n",
> +			pgdat->autonuma_nr_migrate_pages, pgdat->node_id);
> +	do {
> +		int isolated = 0;
> +		if (balance_pgdat(pgdat, nr_migrate_pages))
> +			isolated = isolate_migratepages(&migratepages, pgdat);
> +		/* FIXME: might need to check too many isolated */

Would it help to have isolate_migratepages exit after it has
isolated a large enough number of pages?

I may be tired, but it looks like it is simply putting ALL
the to-be-migrated pages on the list, even if the number
is unreasonably large to migrate all at once.

> +		if (!isolated)
> +			break;
> +		nr_migrate_pages += isolated;
> +	} while (nr_migrate_pages<  pages_to_migrate);
> +
> +	if (nr_migrate_pages) {
> +		int err;
> +		autonuma_printk("migrate %d to node %d\n", nr_migrate_pages,
> +				pgdat->node_id);
> +		pages_migrated += nr_migrate_pages; /* FIXME: per node */
> +		err = migrate_pages(&migratepages, alloc_migrate_dst_page,
> +				    pgdat->node_id, false, true);
> +		if (err)
> +			/* FIXME: requeue failed pages */
> +			putback_lru_pages(&migratepages);

How about you add a parameter to your (renamed) isolate_migratepages
function to tell it how many pages you want at a time?

Then you could limit it to no more than the number of pages you have
available on the destination node.
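
Roughly like this (sketch only, names from the patch; just the extra
argument and the caller loop are shown, and computing the limit from
the destination node's free pages instead of pages_to_migrate would
follow the same pattern):

	/* isolate at most nr_wanted pages per call */
	static int isolate_migratepages(struct list_head *migratepages,
					struct pglist_data *pgdat,
					int nr_wanted);

	static void knumad_do_migrate(struct pglist_data *pgdat)
	{
		int nr_migrate_pages = 0;
		LIST_HEAD(migratepages);

		do {
			int isolated = 0;
			/* never isolate more than we still plan to migrate */
			int want = pages_to_migrate - nr_migrate_pages;

			if (balance_pgdat(pgdat, nr_migrate_pages))
				isolated = isolate_migratepages(&migratepages,
								pgdat, want);
			if (!isolated)
				break;
			nr_migrate_pages += isolated;
		} while (nr_migrate_pages < pages_to_migrate);

		/* the migrate_pages() call that follows stays the same */
	}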

> +void autonuma_enter(struct mm_struct *mm)
> +{
> +	if (autonuma_impossible())
> +		return;
> +
> +	mutex_lock(&knumad_mm_mutex);
> +	list_add_tail(&mm->mm_autonuma->mm_node,&knumad_scan.mm_head);
> +	mutex_unlock(&knumad_mm_mutex);
> +}

Adding a global lock to every fork seems like a really bad
idea. Please make this NUMA code more SMP friendly.

> +void autonuma_exit(struct mm_struct *mm)
> +{
> +	bool serialize;
> +
> +	if (autonuma_impossible())
> +		return;

And if you implement the "have the autonuma scanning
daemon allocate the mm_autonuma and task_autonuma
structures" idea, short lived processes can bail out
of autonuma_exit right here, without ever taking the
lock.

> +	serialize = false;
> +	mutex_lock(&knumad_mm_mutex);
> +	if (knumad_scan.mm == mm)
> +		serialize = true;
> +	else if (mm->mm_autonuma->mm) {
> +		VM_BUG_ON(mm->mm_autonuma->mm != mm);
> +		mm->mm_autonuma->mm = NULL; /* debug */
> +		list_del(&mm->mm_autonuma->mm_node);
> +	}
> +	mutex_unlock(&knumad_mm_mutex);
> +
> +	if (serialize) {
> +		/* prevent the mm to go away under knumad_do_scan main loop */
> +		down_write(&mm->mmap_sem);
> +		up_write(&mm->mmap_sem);

This is rather subtle.  A longer explanation could be good.

> +SYSFS_ENTRY(debug, AUTONUMA_DEBUG_FLAG);
> +SYSFS_ENTRY(pmd, AUTONUMA_SCAN_PMD_FLAG);
> +SYSFS_ENTRY(working_set, AUTONUMA_SCAN_USE_WORKING_SET_FLAG);
> +SYSFS_ENTRY(defer, AUTONUMA_MIGRATE_DEFER_FLAG);
> +SYSFS_ENTRY(load_balance_strict, AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG);
> +SYSFS_ENTRY(clone_reset, AUTONUMA_SCHED_CLONE_RESET_FLAG);
> +SYSFS_ENTRY(fork_reset, AUTONUMA_SCHED_FORK_RESET_FLAG);

> +#undef SYSFS_ENTRY
> +
> +enum {
> +	SYSFS_KNUMA_SCAND_SLEEP_ENTRY,
> +	SYSFS_KNUMA_SCAND_PAGES_ENTRY,
> +	SYSFS_KNUMA_MIGRATED_SLEEP_ENTRY,
> +	SYSFS_KNUMA_MIGRATED_PAGES_ENTRY,
> +};

Oh goodie, more magic flags.

Please document what they all mean.


> +SYSFS_ENTRY(scan_sleep_millisecs, SYSFS_KNUMA_SCAND_SLEEP_ENTRY);
> +SYSFS_ENTRY(scan_sleep_pass_millisecs, SYSFS_KNUMA_SCAND_SLEEP_ENTRY);
> +SYSFS_ENTRY(pages_to_scan, SYSFS_KNUMA_SCAND_PAGES_ENTRY);
> +
> +SYSFS_ENTRY(migrate_sleep_millisecs, SYSFS_KNUMA_MIGRATED_SLEEP_ENTRY);
> +SYSFS_ENTRY(pages_to_migrate, SYSFS_KNUMA_MIGRATED_PAGES_ENTRY);

These as well.

> +SYSFS_ENTRY(full_scans);
> +SYSFS_ENTRY(pages_scanned);
> +SYSFS_ENTRY(pages_migrated);

The same goes for the statistics.

Documentation, documentation, documentation.

And please don't tell me "full scans" in autonuma means
the same thing it means in KSM, because I still do not
know what it means in KSM...


> +static inline void autonuma_exit_sysfs(struct kobject *autonuma_kobj)
> +{
> +}
> +#endif /* CONFIG_SYSFS */
> +
> +static int __init noautonuma_setup(char *str)
> +{
> +	if (!autonuma_impossible()) {
> +		printk("AutoNUMA permanently disabled\n");
> +		set_bit(AUTONUMA_IMPOSSIBLE_FLAG, &autonuma_flags);

Ohhh, my guess was right.  autonuma_impossible really does
mean autonuma_disabled :)

> +int alloc_task_autonuma(struct task_struct *tsk, struct task_struct *orig,
> +			 int node)
> +{

This looks like something that can be done by the
numa scanning daemon.

> +void free_task_autonuma(struct task_struct *tsk)
> +{
> +	if (autonuma_impossible()) {
> +		BUG_ON(tsk->task_autonuma);
> +		return;
> +	}
> +
> +	BUG_ON(!tsk->task_autonuma);

And this looks like a desired thing, for short lived tasks :)

> +	kmem_cache_free(task_autonuma_cachep, tsk->task_autonuma);
> +	tsk->task_autonuma = NULL;
> +}

> +void free_mm_autonuma(struct mm_struct *mm)
> +{
> +	if (autonuma_impossible()) {
> +		BUG_ON(mm->mm_autonuma);
> +		return;
> +	}
> +
> +	BUG_ON(!mm->mm_autonuma);

Ditto for mm_autonuma

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 24/40] autonuma: core
@ 2012-07-02  4:07     ` Rik van Riel
  0 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-02  4:07 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:

> +unsigned long autonuma_flags __read_mostly =
> +	(1<<AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG)|
> +	(1<<AUTONUMA_SCHED_CLONE_RESET_FLAG)|
> +	(1<<AUTONUMA_SCHED_FORK_RESET_FLAG)|
> +#ifdef CONFIG_AUTONUMA_DEFAULT_ENABLED
> +	(1<<AUTONUMA_FLAG)|
> +#endif
> +	(1<<AUTONUMA_SCAN_PMD_FLAG);

Please document what the flags mean.

> +static DEFINE_MUTEX(knumad_mm_mutex);

Wait, you are working on a patch to increase performance on
(large) NUMA systems, but you are introducing a new GLOBAL
LOCK that needs to be taken for a task to be added on the
knumad_scan list?

Can you change the locking of that list, so fork and exit
are not serialized on knumad_mm_mutex?  Maybe RCU?
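
Even as a first step, a spinlock confined to the list manipulation
would be a lot lighter than a sleeping mutex in the fork path.
A rough, untested sketch (knumad_mm_lock is a made-up name):

	static DEFINE_SPINLOCK(knumad_mm_lock);

	void autonuma_enter(struct mm_struct *mm)
	{
		if (autonuma_impossible())
			return;

		spin_lock(&knumad_mm_lock);
		list_add_tail(&mm->mm_autonuma->mm_node,
			      &knumad_scan.mm_head);
		spin_unlock(&knumad_mm_lock);
	}

Longer term, per-node mm lists merged by the scanner, or an RCU
protected list, would get rid of the global cacheline bouncing
entirely.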

> +static inline bool autonuma_impossible(void)
> +{
> +	return num_possible_nodes() <= 1 ||
> +		test_bit(AUTONUMA_IMPOSSIBLE_FLAG, &autonuma_flags);
> +}

This seems to test whether autonuma is enabled or
disabled, and is called from a few hot paths.

Would it be better to turn this into a variable,
so this can be tested with one single compare?

Also, something like autonuma_enabled or
autonuma_disabled could be a clearer name.
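
Roughly (the names are placeholders, and the one-time init would
have to run after the "noautonuma" early param is parsed):

	static bool autonuma_possible __read_mostly;

	void __init autonuma_possible_init(void)
	{
		autonuma_possible = num_possible_nodes() > 1 &&
			!test_bit(AUTONUMA_IMPOSSIBLE_FLAG,
				  &autonuma_flags);
	}

	static inline bool autonuma_disabled(void)
	{
		return !autonuma_possible;
	}

That turns the hot path check into a single load and compare.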

> +/* caller already holds the compound_lock */
> +void autonuma_migrate_split_huge_page(struct page *page,
> +				      struct page *page_tail)
> +{
> +	int nid, last_nid;
> +
> +	nid = page->autonuma_migrate_nid;
> +	VM_BUG_ON(nid >= MAX_NUMNODES);
> +	VM_BUG_ON(nid < -1);
> +	VM_BUG_ON(page_tail->autonuma_migrate_nid != -1);
> +	if (nid >= 0) {
> +		VM_BUG_ON(page_to_nid(page) != page_to_nid(page_tail));
> +
> +		compound_lock(page_tail);

The comment above the function says that the caller already
holds the compound lock, yet we try to grab it again here?

Is this a deadlock, or simply an out of date comment?

Either way, it needs to be fixed.

A comment telling us what this function is supposed to
do would not hurt, either.

> +void __autonuma_migrate_page_remove(struct page *page)
> +{
> +	unsigned long flags;
> +	int nid;

In fact, every function larger than about 5 lines could use
a short comment describing what its purpose is.

> +static void __autonuma_migrate_page_add(struct page *page, int dst_nid,
> +					int page_nid)
> +{

I wonder if _enqueue and _rmqueue (for the previous function)
would make more sense, since we are adding and removing the
page from a queue?

> +	VM_BUG_ON(dst_nid >= MAX_NUMNODES);
> +	VM_BUG_ON(dst_nid < -1);
> +	VM_BUG_ON(page_nid >= MAX_NUMNODES);
> +	VM_BUG_ON(page_nid < -1);
> +
> +	VM_BUG_ON(page_nid == dst_nid);
> +	VM_BUG_ON(page_to_nid(page) != page_nid);
> +
> +	flags = compound_lock_irqsave(page);

What does this lock protect against?

Should we check that those things are still true, after we
have acquired the lock?

> +static void autonuma_migrate_page_add(struct page *page, int dst_nid,
> +				      int page_nid)
> +{
> +	int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
> +	if (migrate_nid != dst_nid)
> +		__autonuma_migrate_page_add(page, dst_nid, page_nid);
> +}

Wait, how are three nids involved with the migration
of one page?

What is going on here, and why is there no comment explaining it?

> +static bool balance_pgdat(struct pglist_data *pgdat,
> +			  int nr_migrate_pages)
> +{
> +	/* FIXME: this only check the wmarks, make it move
> +	 * "unused" memory or pagecache by queuing it to
> +	 * pgdat->autonuma_migrate_head[pgdat->node_id].
> +	 */

vmscan.c also has a function called balance_pgdat, which does
something very different.

This function seems to check whether a node has enough free
memory. Maybe the name could reflect that?

> +static void cpu_follow_memory_pass(struct task_struct *p,
> +				   struct task_autonuma *task_autonuma,
> +				   unsigned long *task_numa_fault)
> +{
> +	int nid;
> +	for_each_node(nid)
> +		task_numa_fault[nid] >>= 1;
> +	task_autonuma->task_numa_fault_tot >>= 1;
> +}

This seems to age the statistic.  From the name I guess
it is called every pass, but something like numa_age_faults
might be a better name.

It could also use a short comment explaining why the statistics
are aged after each round, instead of zeroed out.

> +static void numa_hinting_fault_cpu_follow_memory(struct task_struct *p,
> +						 int access_nid,
> +						 int numpages,
> +						 bool pass)
> +{
> +	struct task_autonuma *task_autonuma = p->task_autonuma;
> +	unsigned long *task_numa_fault = task_autonuma->task_numa_fault;
> +	if (unlikely(pass))
> +		cpu_follow_memory_pass(p, task_autonuma, task_numa_fault);
> +	task_numa_fault[access_nid] += numpages;
> +	task_autonuma->task_numa_fault_tot += numpages;
> +}

Function name seems to have no bearing on what the function
actually does, which appears to be some kind of statistics
update...

There is no explanation of when pass would be true (or false).

> +static inline bool last_nid_set(struct task_struct *p,
> +				struct page *page, int cpu_nid)
> +{
> +	bool ret = true;
> +	int autonuma_last_nid = ACCESS_ONCE(page->autonuma_last_nid);
> +	VM_BUG_ON(cpu_nid < 0);
> +	VM_BUG_ON(cpu_nid >= MAX_NUMNODES);
> +	if (autonuma_last_nid >= 0 && autonuma_last_nid != cpu_nid) {
> +		int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
> +		if (migrate_nid >= 0 && migrate_nid != cpu_nid)
> +			__autonuma_migrate_page_remove(page);
> +		ret = false;
> +	}
> +	if (autonuma_last_nid != cpu_nid)
> +		ACCESS_ONCE(page->autonuma_last_nid) = cpu_nid;
> +	return ret;
> +}

This function confuses me. Why is there an ACCESS_ONCE
around something that is accessed three times?

It looks like it is trying to set some info on a page,
possibly where the page should be migrated to, and cancel
any migration if the page already has a destination other
than our destination?

It does not help that I have no idea what last_nid means,
because that was not documented in the earlier patches.
The function could use a comment regarding its purpose.

> +static int __page_migrate_nid(struct page *page, int page_nid)
> +{
> +	int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
> +	if (migrate_nid < 0)
> +		migrate_nid = page_nid;
> +#if 0
> +	return page_nid;
> +#endif
> +	return migrate_nid;
> +}
> +
> +static int page_migrate_nid(struct page *page)
> +{
> +	return __page_migrate_nid(page, page_to_nid(page));
> +}

Why are there two functions that do the same thing?

Could this be collapsed into one function?

The #if 0 block could probably be removed, too.
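
Unless some caller really needs to pass in an already known
page_to_nid() value, one function should do, along the lines of:

	static int page_migrate_nid(struct page *page)
	{
		int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);

		return migrate_nid < 0 ? page_to_nid(page) : migrate_nid;
	}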

> +static int knumad_scan_pmd(struct mm_struct *mm,
> +			   struct vm_area_struct *vma,
> +			   unsigned long address)

Like every other function, this one could use a comment
informing us of the general idea this function is supposed
to do.

> +{
> +	pgd_t *pgd;
> +	pud_t *pud;
> +	pmd_t *pmd;
> +	pte_t *pte, *_pte;
> +	struct page *page;
> +	unsigned long _address, end;
> +	spinlock_t *ptl;
> +	int ret = 0;
> +
> +	VM_BUG_ON(address & ~PAGE_MASK);
> +
> +	pgd = pgd_offset(mm, address);
> +	if (!pgd_present(*pgd))
> +		goto out;
> +
> +	pud = pud_offset(pgd, address);
> +	if (!pud_present(*pud))
> +		goto out;
> +
> +	pmd = pmd_offset(pud, address);
> +	if (pmd_none(*pmd))
> +		goto out;
> +	if (pmd_trans_huge(*pmd)) {

This is a very large function.  Would it make sense to split the
pmd_trans_huge stuff into its own function?

> +		spin_lock(&mm->page_table_lock);
> +		if (pmd_trans_huge(*pmd)) {
> +			VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> +			if (unlikely(pmd_trans_splitting(*pmd))) {
> +				spin_unlock(&mm->page_table_lock);
> +				wait_split_huge_page(vma->anon_vma, pmd);
> +			} else {
> +				int page_nid;
> +				unsigned long *numa_fault_tmp;
> +				ret = HPAGE_PMD_NR;
> +
> +				if (autonuma_scan_use_working_set() &&
> +				    pmd_numa(*pmd)) {
> +					spin_unlock(&mm->page_table_lock);
> +					goto out;
> +				}
> +
> +				page = pmd_page(*pmd);
> +
> +				/* only check non-shared pages */
> +				if (page_mapcount(page) != 1) {
> +					spin_unlock(&mm->page_table_lock);
> +					goto out;
> +				}
> +
> +				page_nid = page_migrate_nid(page);
> +				numa_fault_tmp = mm_autonuma_numa_fault_tmp(mm);
> +				numa_fault_tmp[page_nid] += ret;
> +
> +				if (pmd_numa(*pmd)) {
> +					spin_unlock(&mm->page_table_lock);
> +					goto out;
> +				}
> +
> +				set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
> +				/* defer TLB flush to lower the overhead */
> +				spin_unlock(&mm->page_table_lock);
> +				goto out;
> +			}
> +		} else
> +			spin_unlock(&mm->page_table_lock);
> +	}
> +
> +	VM_BUG_ON(!pmd_present(*pmd));
> +
> +	end = min(vma->vm_end, (address + PMD_SIZE) & PMD_MASK);
> +	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> +	for (_address = address, _pte = pte; _address < end;
> +	     _pte++, _address += PAGE_SIZE) {
> +		unsigned long *numa_fault_tmp;
> +		pte_t pteval = *_pte;
> +		if (!pte_present(pteval))
> +			continue;
> +		if (autonuma_scan_use_working_set() &&
> +		    pte_numa(pteval))
> +			continue;

What is autonuma_scan_use_working_set supposed to do exactly?

This looks like a subtle piece of code, but without any explanation
of what it should do, I cannot verify whether it actually does that.

> +		page = vm_normal_page(vma, _address, pteval);
> +		if (unlikely(!page))
> +			continue;
> +		/* only check non-shared pages */
> +		if (page_mapcount(page) != 1)
> +			continue;
> +
> +		numa_fault_tmp = mm_autonuma_numa_fault_tmp(mm);
> +		numa_fault_tmp[page_migrate_nid(page)]++;

> +		if (pte_numa(pteval))
> +			continue;

Wait, so we count all pages, even the ones that are PTE_NUMA
as if they incurred faults, even when they did not?

pte_present seems to return true for a numa pte...

> +		if (!autonuma_scan_pmd())
> +			set_pte_at(mm, _address, _pte, pte_mknuma(pteval));
> +
> +		/* defer TLB flush to lower the overhead */
> +		ret++;
> +	}
> +	pte_unmap_unlock(pte, ptl);
> +
> +	if (ret && !pmd_numa(*pmd) && autonuma_scan_pmd()) {
> +		spin_lock(&mm->page_table_lock);
> +		set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
> +		spin_unlock(&mm->page_table_lock);
> +		/* defer TLB flush to lower the overhead */
> +	}

So depending on whether autonuma_scan_pmd is true or false,
we behave differently.  That wants some documenting, near
both the !autonuma_scan_pmd code and the code below...

> +static void mm_numa_fault_flush(struct mm_struct *mm)
> +{
> +	int nid;
> +	struct mm_autonuma *mma = mm->mm_autonuma;
> +	unsigned long *numa_fault_tmp = mm_autonuma_numa_fault_tmp(mm);
> +	unsigned long tot = 0;
> +	/* FIXME: protect this with seqlock against autonuma_balance() */

Yes, please do.
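
In case it helps, the usual seqlock pattern would look roughly like
this (the seqlock itself would have to be added to mm_autonuma, and
the field names below are guessed from context):

	/* writer side, here in mm_numa_fault_flush() */
	write_seqlock(&mma->fault_seqlock);
	for_each_node(nid) {
		mma->mm_numa_fault[nid] = numa_fault_tmp[nid];
		tot += numa_fault_tmp[nid];
		numa_fault_tmp[nid] = 0;
	}
	mma->mm_numa_fault_tot = tot;
	write_sequnlock(&mma->fault_seqlock);

	/* reader side, in autonuma_balance() */
	do {
		seq = read_seqbegin(&mma->fault_seqlock);
		fault = mma->mm_numa_fault[nid];
		fault_tot = mma->mm_numa_fault_tot;
	} while (read_seqretry(&mma->fault_seqlock, seq));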

> +static int knumad_do_scan(void)
> +{
> +	struct mm_struct *mm;
> +	struct mm_autonuma *mm_autonuma;
> +	unsigned long address;
> +	struct vm_area_struct *vma;
> +	int progress = 0;
> +
> +	mm = knumad_scan.mm;
> +	if (!mm) {
> +		if (unlikely(list_empty(&knumad_scan.mm_head)))
> +			return pages_to_scan;

Wait a moment, in knuma_scand() you have this line:

		_progress = knumad_do_scan();

Why are you pretending you made progress, when you did not
scan anything?

This is nothing a comment cannot illuminate :)

> +	down_read(&mm->mmap_sem);
> +	if (unlikely(knumad_test_exit(mm)))

This could use a short comment.

		/* The process is exiting */
> +		vma = NULL;
> +	else
> +		vma = find_vma(mm, address);

This loop could use some comments:

> +	for (; vma && progress < pages_to_scan; vma = vma->vm_next) {
> +		unsigned long start_addr, end_addr;
> +		cond_resched();
		/* process is exiting */
> +		if (unlikely(knumad_test_exit(mm))) {
> +			progress++;
> +			break;
> +		}
		/* only do anonymous memory without explicit numa placement */
> +		if (!vma->anon_vma || vma_policy(vma)) {
> +			progress++;
> +			continue;
> +		}
> +		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)) {
> +			progress++;
> +			continue;
> +		}
> +		if (is_vma_temporary_stack(vma)) {
> +			progress++;
> +			continue;
> +		}
> +
> +		VM_BUG_ON(address & ~PAGE_MASK);
> +		if (address < vma->vm_start)
> +			address = vma->vm_start;

How can this happen, when we did a find_vma above?

> +		flush_tlb_range(vma, start_addr, end_addr);
> +		mmu_notifier_invalidate_range_end(vma->vm_mm, start_addr,
> +						  end_addr);
> +	}
> +	up_read(&mm->mmap_sem); /* exit_mmap will destroy ptes after this */

That does not seem like a certainty.  The process might not
exit for hours, or even days.

Also, mmap_sem protects against many more things than just
exit_mmap. It could also be mprotect, munmap or exec.

This may be the one comment in the series so far that is
best removed :)

> +static int knuma_scand(void *none)
> +{
> +	struct mm_struct *mm = NULL;
> +	int progress = 0, _progress;
> +	unsigned long total_progress = 0;
> +
> +	set_freezable();
> +
> +	knuma_scand_disabled();
> +
> +	mutex_lock(&knumad_mm_mutex);
> +
> +	for (;;) {
> +		if (unlikely(kthread_should_stop()))
> +			break;
> +		_progress = knumad_do_scan();
> +		progress += _progress;
> +		total_progress += _progress;

Huh?  Tracking both progress and total_progress?

There is no explanation of what the difference between these
two is, or why you need them.

> +	mm = knumad_scan.mm;
> +	knumad_scan.mm = NULL;
> +	if (mm && knumad_test_exit(mm)) {
> +		list_del(&mm->mm_autonuma->mm_node);
> +		/* tell autonuma_exit not to list_del */
> +		VM_BUG_ON(mm->mm_autonuma->mm != mm);
> +		mm->mm_autonuma->mm = NULL;
> +	}
> +	mutex_unlock(&knumad_mm_mutex);
> +
> +	if (mm)
> +		mmdrop(mm);
> +

Why doesn't knumad_do_scan take care of the mmdrop?

Doing this in the calling function is somewhat confusing.

I see that knumad_scan can hold the refcount of an mm
elevated in-between scans, for however long it sleeps.
Is that really something we want?

In the days when we did virtual scanning, the code in
vmscan.c simply moved a process to the end of the mm_list
once it was done with it, and the currently to-scan process
was always at the head of the list.
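
In other words, roughly (glossing over the exit race that
knumad_scan.mm handles today):

	/* done scanning this mm: rotate it to the tail, drop the ref */
	mutex_lock(&knumad_mm_mutex);
	list_move_tail(&mm->mm_autonuma->mm_node, &knumad_scan.mm_head);
	mutex_unlock(&knumad_mm_mutex);
	mmdrop(mm);

	/* the next pass simply picks up whatever is at the head */
	mm_autonuma = list_first_entry(&knumad_scan.mm_head,
				       struct mm_autonuma, mm_node);

so nothing has to keep an mm reference pinned across the sleep.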

> +static int isolate_migratepages(struct list_head *migratepages,
> +				struct pglist_data *pgdat)

The kernel already has another function called isolate_migratepages.

It would be nice if this function were at least documented to
state its purpose, and maybe renamed to make it clear this is
the NUMA-specific version.

> +{
> +	int nr = 0, nid;
> +	struct list_head *heads = pgdat->autonuma_migrate_head;
> +
> +	/* FIXME: THP balancing, restart from last nid */

I guess a static variable in the pgdat struct could take care
of that?
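
For the record, that could be as simple as (the field name is made
up):

	/* in struct pglist_data */
	int autonuma_migrate_last_nid;

	/* in the isolation loop, resume where the last run stopped */
	nid = pgdat->autonuma_migrate_last_nid;
	for (i = 0; i < num_possible_nodes(); i++) {
		nid = next_node(nid, node_possible_map);
		if (nid >= MAX_NUMNODES)
			nid = first_node(node_possible_map);
		/* isolate from heads[nid] as the current code does */
	}
	pgdat->autonuma_migrate_last_nid = nid;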

> +		if (PageTransHuge(page)) {
> +			VM_BUG_ON(!PageAnon(page));
> +			/* FIXME: remove split_huge_page */

Fair enough. Other people are working on that code :)

> +		if (!__isolate_lru_page(page, 0)) {
> +			VM_BUG_ON(PageTransCompound(page));
> +			del_page_from_lru_list(page, lruvec, page_lru(page));
> +			inc_zone_state(zone, page_is_file_cache(page) ?
> +				       NR_ISOLATED_FILE : NR_ISOLATED_ANON);
> +			spin_unlock_irq(&zone->lru_lock);
> +			/*
> +			 * hold the page pin at least until
> +			 * __isolate_lru_page succeeds
> +			 * (__isolate_lru_page takes a second pin when
> +			 * it succeeds). If we release the pin before
> +			 * __isolate_lru_page returns, the page could
> +			 * have been freed and reallocated from under
> +			 * us, so rendering worthless our previous
> +			 * checks on the page including the
> +			 * split_huge_page call.
> +			 */
> +			put_page(page);
> +
> +			list_add(&page->lru, migratepages);
> +			nr += hpage_nr_pages(page);
> +		} else {
> +			/* FIXME: losing page, safest and simplest for now */

Losing page?  As in a memory leak?

Or as in	/* Something happened. Skip migrating the page. */ ?

> +static void knumad_do_migrate(struct pglist_data *pgdat)
> +{
> +	int nr_migrate_pages = 0;
> +	LIST_HEAD(migratepages);
> +
> +	autonuma_printk("nr_migrate_pages %lu to node %d\n",
> +			pgdat->autonuma_nr_migrate_pages, pgdat->node_id);
> +	do {
> +		int isolated = 0;
> +		if (balance_pgdat(pgdat, nr_migrate_pages))
> +			isolated = isolate_migratepages(&migratepages, pgdat);
> +		/* FIXME: might need to check too many isolated */

Would it help to have isolate_migratepages exit after it has
isolated a large enough number of pages?

I may be tired, but it looks like it simply puts ALL of the
to-be-migrated pages on the list, even if that number is
unreasonably large to migrate in one go.

> +		if (!isolated)
> +			break;
> +		nr_migrate_pages += isolated;
> +	} while (nr_migrate_pages < pages_to_migrate);
> +
> +	if (nr_migrate_pages) {
> +		int err;
> +		autonuma_printk("migrate %d to node %d\n", nr_migrate_pages,
> +				pgdat->node_id);
> +		pages_migrated += nr_migrate_pages; /* FIXME: per node */
> +		err = migrate_pages(&migratepages, alloc_migrate_dst_page,
> +				    pgdat->node_id, false, true);
> +		if (err)
> +			/* FIXME: requeue failed pages */
> +			putback_lru_pages(&migratepages);

How about you add a parameter to your (renamed) isolate_migratepages
function to tell it how many pages you want at a time?

Then you could limit it to no more than the number of pages you have
available on the destination node.
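
Concretely, something along these lines (the function name is just
a placeholder, to avoid the clash with compaction's
isolate_migratepages):

	static int numa_isolate_migratepages(struct list_head *migratepages,
					     struct pglist_data *pgdat,
					     int nr_max)
	{
		int nr = 0, nid;

		for_each_online_node(nid) {
			/* the same per-node isolation as today, plus: */
			if (nr >= nr_max)
				break;
			/* ... nr += hpage_nr_pages(page); ... */
		}
		return nr;
	}

and in knumad_do_migrate():

	isolated = numa_isolate_migratepages(&migratepages, pgdat,
					     pages_to_migrate -
					     nr_migrate_pages);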

> +void autonuma_enter(struct mm_struct *mm)
> +{
> +	if (autonuma_impossible())
> +		return;
> +
> +	mutex_lock(&knumad_mm_mutex);
> +	list_add_tail(&mm->mm_autonuma->mm_node, &knumad_scan.mm_head);
> +	mutex_unlock(&knumad_mm_mutex);
> +}

Adding a global lock to every fork seems like a really bad
idea. Please make this NUMA code more SMP friendly.

> +void autonuma_exit(struct mm_struct *mm)
> +{
> +	bool serialize;
> +
> +	if (autonuma_impossible())
> +		return;

And if you implement the "have the autonuma scanning
daemon allocate the mm_autonuma and task_autonuma
structures" idea, short lived processes can bail out
of autonuma_exit right here, without ever taking the
lock.
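
That is, assuming mm_autonuma becomes lazily allocated by the
scanner (so it stays NULL for processes that were never scanned),
the function could simply start with:

	if (!mm->mm_autonuma)
		return;

before taking any lock.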

> +	serialize = false;
> +	mutex_lock(&knumad_mm_mutex);
> +	if (knumad_scan.mm == mm)
> +		serialize = true;
> +	else if (mm->mm_autonuma->mm) {
> +		VM_BUG_ON(mm->mm_autonuma->mm != mm);
> +		mm->mm_autonuma->mm = NULL; /* debug */
> +		list_del(&mm->mm_autonuma->mm_node);
> +	}
> +	mutex_unlock(&knumad_mm_mutex);
> +
> +	if (serialize) {
> +		/* prevent the mm to go away under knumad_do_scan main loop */
> +		down_write(&mm->mmap_sem);
> +		up_write(&mm->mmap_sem);

This is rather subtle.  A longer explanation could be good.
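
If my reading of the interaction is right, maybe something like:

	/*
	 * knumad_do_scan() may be in the middle of walking this mm's
	 * page tables with mmap_sem held for read.  Taking mmap_sem
	 * for write (and releasing it immediately) waits for that
	 * walk to finish before the mm is torn down; afterwards the
	 * scanner notices the exit via knumad_test_exit().
	 */

in place of the one-liner.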

> +SYSFS_ENTRY(debug, AUTONUMA_DEBUG_FLAG);
> +SYSFS_ENTRY(pmd, AUTONUMA_SCAN_PMD_FLAG);
> +SYSFS_ENTRY(working_set, AUTONUMA_SCAN_USE_WORKING_SET_FLAG);
> +SYSFS_ENTRY(defer, AUTONUMA_MIGRATE_DEFER_FLAG);
> +SYSFS_ENTRY(load_balance_strict, AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG);
> +SYSFS_ENTRY(clone_reset, AUTONUMA_SCHED_CLONE_RESET_FLAG);
> +SYSFS_ENTRY(fork_reset, AUTONUMA_SCHED_FORK_RESET_FLAG);

> +#undef SYSFS_ENTRY
> +
> +enum {
> +	SYSFS_KNUMA_SCAND_SLEEP_ENTRY,
> +	SYSFS_KNUMA_SCAND_PAGES_ENTRY,
> +	SYSFS_KNUMA_MIGRATED_SLEEP_ENTRY,
> +	SYSFS_KNUMA_MIGRATED_PAGES_ENTRY,
> +};

Oh goodie, more magic flags.

Please document what they all mean.


> +SYSFS_ENTRY(scan_sleep_millisecs, SYSFS_KNUMA_SCAND_SLEEP_ENTRY);
> +SYSFS_ENTRY(scan_sleep_pass_millisecs, SYSFS_KNUMA_SCAND_SLEEP_ENTRY);
> +SYSFS_ENTRY(pages_to_scan, SYSFS_KNUMA_SCAND_PAGES_ENTRY);
> +
> +SYSFS_ENTRY(migrate_sleep_millisecs, SYSFS_KNUMA_MIGRATED_SLEEP_ENTRY);
> +SYSFS_ENTRY(pages_to_migrate, SYSFS_KNUMA_MIGRATED_PAGES_ENTRY);

These as well.

> +SYSFS_ENTRY(full_scans);
> +SYSFS_ENTRY(pages_scanned);
> +SYSFS_ENTRY(pages_migrated);

The same goes for the statistics.

Documentation, documentation, documentation.

And please don't tell me "full scans" in autonuma means
the same thing it means in KSM, because I still do not
know what it means in KSM...


> +static inline void autonuma_exit_sysfs(struct kobject *autonuma_kobj)
> +{
> +}
> +#endif /* CONFIG_SYSFS */
> +
> +static int __init noautonuma_setup(char *str)
> +{
> +	if (!autonuma_impossible()) {
> +		printk("AutoNUMA permanently disabled\n");
> +		set_bit(AUTONUMA_IMPOSSIBLE_FLAG, &autonuma_flags);

Ohhh, my guess was right.  autonuma_impossible really does
mean autonuma_disabled :)

> +int alloc_task_autonuma(struct task_struct *tsk, struct task_struct *orig,
> +			 int node)
> +{

This looks like something that can be done by the
numa scanning daemon.

> +void free_task_autonuma(struct task_struct *tsk)
> +{
> +	if (autonuma_impossible()) {
> +		BUG_ON(tsk->task_autonuma);
> +		return;
> +	}
> +
> +	BUG_ON(!tsk->task_autonuma);

And this looks like a desired thing, for short lived tasks :)

> +	kmem_cache_free(task_autonuma_cachep, tsk->task_autonuma);
> +	tsk->task_autonuma = NULL;
> +}

> +void free_mm_autonuma(struct mm_struct *mm)
> +{
> +	if (autonuma_impossible()) {
> +		BUG_ON(mm->mm_autonuma);
> +		return;
> +	}
> +
> +	BUG_ON(!mm->mm_autonuma);

Ditto for mm_autonuma

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 25/40] autonuma: follow_page check for pte_numa/pmd_numa
  2012-06-28 12:56   ` Andrea Arcangeli
@ 2012-07-02  4:14     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-02  4:14 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:
> Without this, follow_page wouldn't trigger the NUMA hinting faults.
>
> Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>

follow_page is called from many different places, not just
the process itself. One example would be ksm.

Do you really want to trigger NUMA hinting faults when the
mm != current->mm, or is that magically prevented somewhere?

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 25/40] autonuma: follow_page check for pte_numa/pmd_numa
@ 2012-07-02  4:14     ` Rik van Riel
  0 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-02  4:14 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:
> Without this, follow_page wouldn't trigger the NUMA hinting faults.
>
> Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>

follow_page is called from many different places, not just
the process itself. One example would be ksm.

Do you really want to trigger NUMA hinting faults when the
mm != current->mm, or is that magically prevented somewhere?

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 26/40] autonuma: default mempolicy follow AutoNUMA
  2012-06-28 12:56   ` Andrea Arcangeli
@ 2012-07-02  4:19     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-02  4:19 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:
> If the task has already been moved to an autonuma_node try to allocate
> memory from it even if it's temporarily not the local node. Chances
> are it's where most of its memory is already located and where it will
> run in the future.
>
> Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 26/40] autonuma: default mempolicy follow AutoNUMA
@ 2012-07-02  4:19     ` Rik van Riel
  0 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-02  4:19 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:
> If the task has already been moved to an autonuma_node try to allocate
> memory from it even if it's temporarily not the local node. Chances
> are it's where most of its memory is already located and where it will
> run in the future.
>
> Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 27/40] autonuma: call autonuma_split_huge_page()
  2012-06-28 12:56   ` Andrea Arcangeli
@ 2012-07-02  4:22     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-02  4:22 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:
> This is needed to make sure the tail pages are also queued into the
> migration queues of knuma_migrated across a transparent hugepage
> split.
>
> Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 27/40] autonuma: call autonuma_split_huge_page()
@ 2012-07-02  4:22     ` Rik van Riel
  0 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-02  4:22 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:
> This is needed to make sure the tail pages are also queued into the
> migration queues of knuma_migrated across a transparent hugepage
> split.
>
> Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 28/40] autonuma: make khugepaged pte_numa aware
  2012-06-28 12:56   ` Andrea Arcangeli
@ 2012-07-02  4:24     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-02  4:24 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:
> If any of the ptes that khugepaged is collapsing was a pte_numa, the
> resulting trans huge pmd will be a pmd_numa too.

Why?

If some of the ptes already got faulted in and made really
resident again, why do you want to incur a new NUMA fault
on the newly collapsed hugepage?

Is there something going on that we should know about?

If so, could you document it?

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 28/40] autonuma: make khugepaged pte_numa aware
@ 2012-07-02  4:24     ` Rik van Riel
  0 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-02  4:24 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:
> If any of the ptes that khugepaged is collapsing was a pte_numa, the
> resulting trans huge pmd will be a pmd_numa too.

Why?

If some of the ptes already got faulted in and made really
resident again, why do you want to incur a new NUMA fault
on the newly collapsed hugepage?

Is there something going on that we should know about?

If so, could you document it?

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 29/40] autonuma: retain page last_nid information in khugepaged
  2012-06-28 12:56   ` Andrea Arcangeli
@ 2012-07-02  4:33     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-02  4:33 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:
> When pages are collapsed try to keep the last_nid information from one
> of the original pages.
>
> Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>
> ---
>   mm/huge_memory.c |   11 +++++++++++
>   1 files changed, 11 insertions(+), 0 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 094f82b..ae20409 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1814,7 +1814,18 @@ static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
>   			clear_user_highpage(page, address);
>   			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
>   		} else {
> +#ifdef CONFIG_AUTONUMA
> +			int autonuma_last_nid;
> +#endif
>   			src_page = pte_page(pteval);
> +#ifdef CONFIG_AUTONUMA
> +			/* pick the last one, better than nothing */
> +			autonuma_last_nid =
> +				ACCESS_ONCE(src_page->autonuma_last_nid);
> +			if (autonuma_last_nid >= 0)
> +				ACCESS_ONCE(page->autonuma_last_nid) =
> +					autonuma_last_nid;
> +#endif
>   			copy_user_highpage(page, src_page, address, vma);
>   			VM_BUG_ON(page_mapcount(src_page) != 1);
>   			VM_BUG_ON(page_count(src_page) != 2);

Can you remember the node ID inside the loop, but do the
assignment just once after the loop has exited?

It seems excessive to make the assignment up to 512 times.
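
That is, roughly (keeping the "last one wins" behaviour):

	int last_nid = -1;

	/* inside the copy loop */
	int nid = ACCESS_ONCE(src_page->autonuma_last_nid);
	if (nid >= 0)
		last_nid = nid;

	/* once, after the loop */
	if (last_nid >= 0)
		ACCESS_ONCE(page->autonuma_last_nid) = last_nid;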

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 29/40] autonuma: retain page last_nid information in khugepaged
@ 2012-07-02  4:33     ` Rik van Riel
  0 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-02  4:33 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:
> When pages are collapsed try to keep the last_nid information from one
> of the original pages.
>
> Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>
> ---
>   mm/huge_memory.c |   11 +++++++++++
>   1 files changed, 11 insertions(+), 0 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 094f82b..ae20409 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1814,7 +1814,18 @@ static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
>   			clear_user_highpage(page, address);
>   			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
>   		} else {
> +#ifdef CONFIG_AUTONUMA
> +			int autonuma_last_nid;
> +#endif
>   			src_page = pte_page(pteval);
> +#ifdef CONFIG_AUTONUMA
> +			/* pick the last one, better than nothing */
> +			autonuma_last_nid =
> +				ACCESS_ONCE(src_page->autonuma_last_nid);
> +			if (autonuma_last_nid >= 0)
> +				ACCESS_ONCE(page->autonuma_last_nid) =
> +					autonuma_last_nid;
> +#endif
>   			copy_user_highpage(page, src_page, address, vma);
>   			VM_BUG_ON(page_mapcount(src_page) != 1);
>   			VM_BUG_ON(page_count(src_page) != 2);

Can you remember the node ID inside the loop, but do the
assignment just once after the loop has exited?

It seems excessive to make the assignment up to 512 times.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 30/40] autonuma: numa hinting page faults entry points
  2012-06-28 12:56   ` Andrea Arcangeli
@ 2012-07-02  4:47     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-02  4:47 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:

> +++ b/mm/huge_memory.c
> @@ -1037,6 +1037,23 @@ out:
>   	return page;
>   }
>
> +#ifdef CONFIG_AUTONUMA
> +pmd_t __huge_pmd_numa_fixup(struct mm_struct *mm, unsigned long addr,

This is under CONFIG_AUTONUMA

> +++ b/mm/memory.c

> +static inline pte_t pte_numa_fixup(struct mm_struct *mm,

> +static inline void pmd_numa_fixup(struct mm_struct *mm,

> +static inline pmd_t huge_pmd_numa_fixup(struct mm_struct *mm,

But these are not.  Please fix, or document why this is
not required.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 30/40] autonuma: numa hinting page faults entry points
@ 2012-07-02  4:47     ` Rik van Riel
  0 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-02  4:47 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:

> +++ b/mm/huge_memory.c
> @@ -1037,6 +1037,23 @@ out:
>   	return page;
>   }
>
> +#ifdef CONFIG_AUTONUMA
> +pmd_t __huge_pmd_numa_fixup(struct mm_struct *mm, unsigned long addr,

This is under CONFIG_AUTONUMA

> +++ b/mm/memory.c

> +static inline pte_t pte_numa_fixup(struct mm_struct *mm,

> +static inline void pmd_numa_fixup(struct mm_struct *mm,

> +static inline pmd_t huge_pmd_numa_fixup(struct mm_struct *mm,

But these are not.  Please fix, or document why this is
not required.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 31/40] autonuma: reset autonuma page data when pages are freed
  2012-06-28 12:56   ` Andrea Arcangeli
@ 2012-07-02  4:49     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-02  4:49 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:
> When pages are freed abort any pending migration. If knuma_migrated
> arrives first it will notice because get_page_unless_zero would fail.
>
> Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>
> ---
>   mm/page_alloc.c |    4 ++++
>   1 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 48eabe9..841d964 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -615,6 +615,10 @@ static inline int free_pages_check(struct page *page)
>   		bad_page(page);
>   		return 1;
>   	}
> +	autonuma_migrate_page_remove(page);
> +#ifdef CONFIG_AUTONUMA
> +	page->autonuma_last_nid = -1;
> +#endif

Should these both be under the #ifdef ?
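
Either move both statements under the #ifdef, or hide them behind a
single helper (autonuma_free_page is a made-up name), e.g.:

	#ifdef CONFIG_AUTONUMA
	static inline void autonuma_free_page(struct page *page)
	{
		autonuma_migrate_page_remove(page);
		page->autonuma_last_nid = -1;
	}
	#else
	static inline void autonuma_free_page(struct page *page) {}
	#endif

so that free_pages_check() itself stays #ifdef free.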

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 31/40] autonuma: reset autonuma page data when pages are freed
@ 2012-07-02  4:49     ` Rik van Riel
  0 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-02  4:49 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:
> When pages are freed abort any pending migration. If knuma_migrated
> arrives first it will notice because get_page_unless_zero would fail.
>
> Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>
> ---
>   mm/page_alloc.c |    4 ++++
>   1 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 48eabe9..841d964 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -615,6 +615,10 @@ static inline int free_pages_check(struct page *page)
>   		bad_page(page);
>   		return 1;
>   	}
> +	autonuma_migrate_page_remove(page);
> +#ifdef CONFIG_AUTONUMA
> +	page->autonuma_last_nid = -1;
> +#endif

Should these both be under the #ifdef ?

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 32/40] autonuma: initialize page structure fields
  2012-06-28 12:56   ` Andrea Arcangeli
@ 2012-07-02  4:50     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-02  4:50 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:
> Initialize the AutoNUMA page structure fields at boot.
>
> Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>

I really do not like having a changeset that is undone a
few patches later.  At the very least you could merge this
into patch 14/40

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 32/40] autonuma: initialize page structure fields
@ 2012-07-02  4:50     ` Rik van Riel
  0 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-02  4:50 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:
> Initialize the AutoNUMA page structure fields at boot.
>
> Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>

I really do not like having a changeset that is undone a
few patches later.  At the very least you could merge this
into patch 14/40

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 33/40] autonuma: link mm/autonuma.o and kernel/sched/numa.o
  2012-06-28 12:56   ` Andrea Arcangeli
@ 2012-07-02  4:56     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-02  4:56 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:
> Link the AutoNUMA core and scheduler object files in the kernel if
> CONFIG_AUTONUMA=y.
>
> Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>

Sure, however see my comments on the next patch ...

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 33/40] autonuma: link mm/autonuma.o and kernel/sched/numa.o
@ 2012-07-02  4:56     ` Rik van Riel
  0 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-02  4:56 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:
> Link the AutoNUMA core and scheduler object files in the kernel if
> CONFIG_AUTONUMA=y.
>
> Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>

Sure, however see my comments on the next patch ...

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 34/40] autonuma: add CONFIG_AUTONUMA and CONFIG_AUTONUMA_DEFAULT_ENABLED
  2012-06-28 12:56   ` Andrea Arcangeli
@ 2012-07-02  4:58     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-02  4:58 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:
> Add the config options to allow building the kernel with AutoNUMA.
>
> If CONFIG_AUTONUMA_DEFAULT_ENABLED is "=y", then
> /sys/kernel/mm/autonuma/enabled will be equal to 1, and AutoNUMA will
> be enabled automatically at boot.
>
> CONFIG_AUTONUMA currently depends on X86, because no other arch
> implements the pte/pmd_numa yet and selecting =y would result in a
> failed build, but this shall be relaxed in the future. Porting
> AutoNUMA to other archs should be pretty simple.
>
> Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>

The Makefile changes could be merged into this patch

> diff --git a/mm/Kconfig b/mm/Kconfig
> index 82fed4e..330dd51 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -207,6 +207,19 @@ config MIGRATION
>   	  pages as migration can relocate pages to satisfy a huge page
>   	  allocation instead of reclaiming.
>
> +config AUTONUMA
> +	bool "Auto NUMA"
> +	select MIGRATION
> +	depends on NUMA && X86

How about having the x86 architecture export a
HAVE_AUTONUMA flag, and testing for that?
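
I.e. the usual pattern, sketched:

	# arch/x86/Kconfig
	config X86
		...
		select HAVE_AUTONUMA

	# mm/Kconfig
	config HAVE_AUTONUMA
		bool

	config AUTONUMA
		bool "Auto NUMA"
		select MIGRATION
		depends on NUMA && HAVE_AUTONUMA

Then a new architecture only has to add the select once it grows
pte/pmd_numa support.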

> +	help
> +	  Automatic NUMA CPU scheduling and memory migration.

This could be expanded to list advantages and
disadvantages of having autonuma enabled.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 34/40] autonuma: add CONFIG_AUTONUMA and CONFIG_AUTONUMA_DEFAULT_ENABLED
@ 2012-07-02  4:58     ` Rik van Riel
  0 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-02  4:58 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:
> Add the config options to allow building the kernel with AutoNUMA.
>
> If CONFIG_AUTONUMA_DEFAULT_ENABLED is "=y", then
> /sys/kernel/mm/autonuma/enabled will be equal to 1, and AutoNUMA will
> be enabled automatically at boot.
>
> CONFIG_AUTONUMA currently depends on X86, because no other arch
> implements the pte/pmd_numa yet and selecting =y would result in a
> failed build, but this shall be relaxed in the future. Porting
> AutoNUMA to other archs should be pretty simple.
>
> Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>

The Makefile changes could be merged into this patch

> diff --git a/mm/Kconfig b/mm/Kconfig
> index 82fed4e..330dd51 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -207,6 +207,19 @@ config MIGRATION
>   	  pages as migration can relocate pages to satisfy a huge page
>   	  allocation instead of reclaiming.
>
> +config AUTONUMA
> +	bool "Auto NUMA"
> +	select MIGRATION
> +	depends on NUMA && X86

How about having the x86 architecture export a
HAVE_AUTONUMA flag, and testing for that?

> +	help
> +	  Automatic NUMA CPU scheduling and memory migration.

This could be expanded to list advantages and
disadvantages of having autonuma enabled.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 35/40] autonuma: boost khugepaged scanning rate
  2012-06-28 12:56   ` Andrea Arcangeli
@ 2012-07-02  5:12     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-02  5:12 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:
> Until THP native migration is implemented it's safer to boost
> khugepaged scanning rate because all memory migration are splitting
> the hugepages. So the regular rate of scanning becomes too low when
> lots of memory is migrated.

Not too fond of such a hack, but I guess it is ok for development...

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 35/40] autonuma: boost khugepaged scanning rate
@ 2012-07-02  5:12     ` Rik van Riel
  0 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-02  5:12 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:
> Until THP native migration is implemented it's safer to boost
> khugepaged scanning rate because all memory migration are splitting
> the hugepages. So the regular rate of scanning becomes too low when
> lots of memory is migrated.

Not too fond of such a hack, but I guess it is ok for development...

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 38/40] autonuma: autonuma_migrate_head[0] dynamic size
  2012-06-28 12:56   ` Andrea Arcangeli
@ 2012-07-02  5:15     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-02  5:15 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:
> Reduce the autonuma_migrate_head array entries from MAX_NUMNODES to
> num_possible_nodes() or zero if autonuma_impossible() is true.
>
> Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>

This could be merged into, or fitted into the series right after
patch 15. Having a dozen other patches in-between makes reviewing
much harder than it should have to be.


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 38/40] autonuma: autonuma_migrate_head[0] dynamic size
@ 2012-07-02  5:15     ` Rik van Riel
  0 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-02  5:15 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:
> Reduce the autonuma_migrate_head array entries from MAX_NUMNODES to
> num_possible_nodes() or zero if autonuma_impossible() is true.
>
> Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>

This could be merged into, or fitted into the series right after
patch 15. Having a dozen other patches in-between makes reviewing
much harder than it should have to be.


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 37/40] autonuma: page_autonuma change #include for sparse
  2012-06-28 12:56   ` Andrea Arcangeli
@ 2012-07-02  6:22     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-02  6:22 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:
> sparse (make C=1) warns about lookup_page_autonuma not being declared,
> that's a false positive, but we can shut it down by being less strict
> in the includes.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

It is a one line change.  Please fold it into the patch
that introduced the issue, and reduce the size of the
patch series.

> diff --git a/mm/page_autonuma.c b/mm/page_autonuma.c
> index bace9b8..2468c9e 100644
> --- a/mm/page_autonuma.c
> +++ b/mm/page_autonuma.c
> @@ -1,6 +1,6 @@
>   #include <linux/mm.h>
>   #include <linux/memory.h>
> -#include <linux/autonuma_flags.h>
> +#include <linux/autonuma.h>
>   #include <linux/page_autonuma.h>
>   #include <linux/bootmem.h>
>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 36/40] autonuma: page_autonuma
  2012-06-28 12:56   ` Andrea Arcangeli
@ 2012-07-02  6:37     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-02  6:37 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:

> +++ b/include/linux/autonuma_flags.h
> @@ -15,6 +15,12 @@ enum autonuma_flag {
>
>   extern unsigned long autonuma_flags;
>
> +static inline bool autonuma_impossible(void)
> +{
> +	return num_possible_nodes() <= 1 ||
> +		test_bit(AUTONUMA_IMPOSSIBLE_FLAG, &autonuma_flags);
> +}

When you fix the name of this function, could you also put it
in the right spot, in the patch where it is originally introduced?

Moving stuff around for no reason in a patch series is not very
reviewer friendly.

> diff --git a/include/linux/autonuma_types.h b/include/linux/autonuma_types.h
> index 9e697e3..1e860f6 100644
> --- a/include/linux/autonuma_types.h
> +++ b/include/linux/autonuma_types.h
> @@ -39,6 +39,61 @@ struct task_autonuma {
>   	unsigned long task_numa_fault[0];
>   };
>
> +/*
> + * Per page (or per-pageblock) structure dynamically allocated only if
> + * autonuma is not impossible.
> + */

Double negatives are not easy to read.

s/not impossible/enabled/

> +struct page_autonuma {
> +	/*
> +	 * To modify autonuma_last_nid lockless the architecture,
> +	 * needs SMP atomic granularity < sizeof(long), not all archs
> +	 * have that, notably some ancient alpha (but none of those
> +	 * should run in NUMA systems). Archs without that requires
> +	 * autonuma_last_nid to be a long.
> +	 */

If only all your data structures were documented like this.

I guess that will give you something to do, when addressing
the comments on the other patches :)

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index bcaa8ac..c5e47bc 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c

>   #ifdef CONFIG_AUTONUMA
> -			/* pick the last one, better than nothing */
> -			autonuma_last_nid =
> -				ACCESS_ONCE(src_page->autonuma_last_nid);
> -			if (autonuma_last_nid >= 0)
> -				ACCESS_ONCE(page->autonuma_last_nid) =
> -					autonuma_last_nid;
> +			if (!autonuma_impossible()) {
> +				int autonuma_last_nid;
> +				src_page_an = lookup_page_autonuma(src_page);
> +				/* pick the last one, better than nothing */
> +				autonuma_last_nid =
> +					ACCESS_ONCE(src_page_an->autonuma_last_nid);
> +				if (autonuma_last_nid >= 0)
> +					ACCESS_ONCE(page_an->autonuma_last_nid) =
> +						autonuma_last_nid;
> +			}

Remembering the last page the loop went through, and then
looking up the autonuma struct after you exit the loop could
be better.
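
For instance, roughly (a sketch only; the loop bounds and variable
names here are made up, not the actual huge_memory.c code):

	struct page *last_src_page = NULL;
	int i;

	for (i = 0; i < HPAGE_PMD_NR; i++) {
		/* existing per-subpage copy work stays as it is */
		last_src_page = src_page;	/* just remember the page */
	}

	if (!autonuma_impossible() && last_src_page) {
		/* one lookup after the loop instead of one per iteration */
		struct page_autonuma *src_an = lookup_page_autonuma(last_src_page);
		struct page_autonuma *dst_an = lookup_page_autonuma(page);
		int last_nid = ACCESS_ONCE(src_an->autonuma_last_nid);

		/* pick the last one, better than nothing */
		if (last_nid >= 0)
			ACCESS_ONCE(dst_an->autonuma_last_nid) = last_nid;
	}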

> diff --git a/mm/page_autonuma.c b/mm/page_autonuma.c
> new file mode 100644
> index 0000000..bace9b8
> --- /dev/null
> +++ b/mm/page_autonuma.c
> @@ -0,0 +1,234 @@
> +#include <linux/mm.h>
> +#include <linux/memory.h>
> +#include <linux/autonuma_flags.h>

This should be <linux/autonuma.h>

There is absolutely no good reason why that one-liner change
is a separate patch.

> +struct page_autonuma *lookup_page_autonuma(struct page *page)
> +{

> +	offset = pfn - NODE_DATA(page_to_nid(page))->node_start_pfn;
> +	return base + offset;
> +}

Doing this and the reverse allows you to drop the page pointer
in struct page_autonuma.

It would make sense to do that either in this patch, or in a
new one, but either way pulling it forward out of patch 40
would make the series easier to review for the next round.
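
For reference, the reverse lookup could look roughly like this (a
sketch; the per-node node_page_autonuma base array is an assumed
name, mirroring what lookup_page_autonuma() indexes above):

	static struct page *autonuma_to_page(int nid,
					     struct page_autonuma *page_autonuma)
	{
		/* assumed per-node array of page_autonuma structures */
		struct page_autonuma *base = NODE_DATA(nid)->node_page_autonuma;
		unsigned long pfn = NODE_DATA(nid)->node_start_pfn +
				    (page_autonuma - base);
		return pfn_to_page(pfn);
	}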

 > +fail:
 > +	printk(KERN_CRIT "allocation of page_autonuma failed.\n");
 > +	printk(KERN_CRIT "please try the 'noautonuma' boot option\n");
 > +	panic("Out of memory");
 > +}

The system can run just fine without autonuma.

Would it make sense to simply disable autonuma at this point,
but try to continue running?
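
Roughly (just a sketch; it ignores unwinding whatever was already
allocated, and reuses the flag names from the earlier patches):

	fail:
		printk(KERN_WARNING
		       "page_autonuma allocation failed, disabling AutoNUMA\n");
		/* make autonuma_impossible() return true from now on */
		set_bit(AUTONUMA_IMPOSSIBLE_FLAG, &autonuma_flags);
		clear_bit(AUTONUMA_FLAG, &autonuma_flags);
		return;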

> @@ -700,8 +780,14 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap)
>   	 */
>   	if (PageSlab(usemap_page)) {
>   		kfree(usemap);
> -		if (memmap)
> +		if (memmap) {
>   			__kfree_section_memmap(memmap, PAGES_PER_SECTION);
> +			if (!autonuma_impossible())
> +				__kfree_section_page_autonuma(page_autonuma,
> +							      PAGES_PER_SECTION);
> +			else
> +				BUG_ON(page_autonuma);

VM_BUG_ON ?

> +		if (!autonuma_impossible()) {
> +			struct page *page_autonuma_page;
> +			page_autonuma_page = virt_to_page(page_autonuma);
> +			free_map_bootmem(page_autonuma_page, nr_pages);
> +		} else
> +			BUG_ON(page_autonuma);

ditto

>   	pgdat_resize_unlock(pgdat, &flags);
>   	if (ret <= 0) {
> +		if (!autonuma_impossible())
> +			__kfree_section_page_autonuma(page_autonuma, nr_pages);
> +		else
> +			BUG_ON(page_autonuma);

VM_BUG_ON ?

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 39/40] autonuma: bugcheck page_autonuma fields on newly allocated pages
  2012-06-28 12:56   ` Andrea Arcangeli
@ 2012-07-02  6:40     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-02  6:40 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:
> Debug tweak.

> +static inline void autonuma_check_new_page(struct page *page)
> +{
> +	struct page_autonuma *page_autonuma;
> +	if (!autonuma_impossible()) {
> +		page_autonuma = lookup_page_autonuma(page);
> +		BUG_ON(page_autonuma->autonuma_migrate_nid != -1);
> +		BUG_ON(page_autonuma->autonuma_last_nid != -1);

At this point, BUG_ON is not likely to give us a useful backtrace
at all.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2d53a1f..5943ed2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -833,6 +833,7 @@ static inline int check_new_page(struct page *page)
>   		bad_page(page);
>   		return 1;
>   	}
> +	autonuma_check_new_page(page);
>   	return 0;
>   }

Why don't you hook into the return codes that
check_new_page uses?

They appear to be there for a reason.
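
Something along these lines, say (a sketch only):

	static inline int autonuma_check_new_page(struct page *page)
	{
		struct page_autonuma *page_autonuma;

		if (autonuma_impossible())
			return 0;

		page_autonuma = lookup_page_autonuma(page);
		/* non-zero means the page_autonuma state is corrupted */
		return page_autonuma->autonuma_migrate_nid != -1 ||
		       page_autonuma->autonuma_last_nid != -1;
	}

and then check_new_page() can OR that into its existing bad-page
condition, so a corrupted page_autonuma gets reported through
bad_page() like every other inconsistency, instead of killing the
kernel with a BUG_ON.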

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 40/40] autonuma: shrink the per-page page_autonuma struct size
  2012-06-28 12:56   ` Andrea Arcangeli
@ 2012-07-02  7:18     ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-02  7:18 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:
>  From 32 to 12 bytes, so the AutoNUMA memory footprint is reduced to
> 0.29% of RAM.

Still not ideal; however, once we get native THP migration working
it could be practical to switch to a "have a bucket with N
page_autonuma structures for every N*M pages" approach.

For example, we could have 4 struct page_autonuma entries for every 32
memory pages. That would necessitate reintroducing the page pointer
into struct page_autonuma, but it would reduce memory use by roughly
a factor of 8.

To get from a struct page to a struct page_autonuma, we would have
to look at the bucket and check whether one of the page_autonuma
structs points at us. If none do, we have to claim an available one.
On migration, we would have to free our page_autonuma struct, which
would make it available for other pages to use.

This would complicate the code somewhat, and potentially slow down
the migration of 4kB pages, but with 2MB pages things could continue
exactly the way they work today.

Does this seem reasonable in any way?
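
Roughly what I have in mind (a sketch only; the page_autonuma_bucket()
helper and the locking needed to claim and release entries are left
out):

	#define AUTONUMA_BUCKET_PAGES	32	/* N*M pages covered per bucket */
	#define AUTONUMA_BUCKET_ENTRIES	4	/* N page_autonuma entries in it */

	struct page_autonuma_bucket {
		struct page_autonuma entry[AUTONUMA_BUCKET_ENTRIES];
	};

	static struct page_autonuma *page_autonuma_claim(struct page *page)
	{
		struct page_autonuma_bucket *bucket = page_autonuma_bucket(page);
		int i;

		/* is one of the bucket's entries already tracking this page? */
		for (i = 0; i < AUTONUMA_BUCKET_ENTRIES; i++)
			if (bucket->entry[i].page == page)
				return &bucket->entry[i];

		/* no: claim a free entry, if any (page == NULL means free) */
		for (i = 0; i < AUTONUMA_BUCKET_ENTRIES; i++)
			if (!bucket->entry[i].page) {
				bucket->entry[i].page = page;
				return &bucket->entry[i];
			}

		return NULL;	/* bucket exhausted, caller has to cope */
	}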

> +++ b/include/linux/autonuma_list.h
> @@ -0,0 +1,94 @@
> +#ifndef __AUTONUMA_LIST_H
> +#define __AUTONUMA_LIST_H
> +
> +#include <linux/types.h>
> +#include <linux/kernel.h>

> +typedef uint32_t autonuma_list_entry;
> +#define AUTONUMA_LIST_MAX_PFN_OFFSET	(AUTONUMA_LIST_HEAD-3)
> +#define AUTONUMA_LIST_POISON1		(AUTONUMA_LIST_HEAD-2)
> +#define AUTONUMA_LIST_POISON2		(AUTONUMA_LIST_HEAD-1)
> +#define AUTONUMA_LIST_HEAD		((uint32_t)UINT_MAX)
> +
> +struct autonuma_list_head {
> +	autonuma_list_entry anl_next_pfn;
> +	autonuma_list_entry anl_prev_pfn;
> +};

This stuff needs to be documented with a large comment, explaining
what is done, what the limitations are, etc...

Having that documentation in the commit message is not going to help
somebody who is browsing the source code.

I also wonder if it would make sense to have this available as a
generic list type, not autonuma specific but an "item number list"
include file with corresponding macros.

It might be useful to have lists with item numbers, instead of
prev & next pointers, in other places in the kernel.

Besides, introducing this list type separately could make things
easier to review.
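
As a rough sketch of what such a generic type could look like (indices
into a caller-provided array instead of pointers; this is not the
patch's API, just an illustration):

	struct idx_list_node {
		uint32_t next;
		uint32_t prev;
	};

	#define IDX_LIST_NULL	((uint32_t)UINT_MAX)

	/* link @new right after @at, with all nodes living in base[] */
	static inline void idx_list_add(struct idx_list_node *base,
					uint32_t new, uint32_t at)
	{
		uint32_t n = base[at].next;

		base[new].prev = at;
		base[new].next = n;
		base[at].next = new;
		if (n != IDX_LIST_NULL)
			base[n].prev = new;
	}

	static inline void idx_list_del(struct idx_list_node *base,
					uint32_t node)
	{
		uint32_t p = base[node].prev, n = base[node].next;

		if (p != IDX_LIST_NULL)
			base[p].next = n;
		if (n != IDX_LIST_NULL)
			base[n].prev = p;
		base[node].next = base[node].prev = IDX_LIST_NULL;
	}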


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-30  8:23                     ` Nai Xia
@ 2012-07-02  7:29                       ` Rik van Riel
  2012-07-02  7:43                         ` Nai Xia
  0 siblings, 1 reply; 327+ messages in thread
From: Rik van Riel @ 2012-07-02  7:29 UTC (permalink / raw)
  To: Nai Xia
  Cc: dlaor, Andrea Arcangeli, Peter Zijlstra, Ingo Molnar,
	Hillf Danton, linux-kernel, linux-mm, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/30/2012 04:23 AM, Nai Xia wrote:

> Oh, sorry, I think I forgot few last comments in my last post:
>
> In case you really can take my advice and do comprehensive research,
> try to make sure that you compare the result of your fancy sampling algorithm
> with this simple logic:
>
>     "Blindly select a node and bind the process and move all pages to it."
>
> Stupid it may sound, I highly suspect it can approach the benchmarks
> you already did.
>
> If that's really the truth, then all the sampling and weighting stuff can
> be cut off.

All the sampling and weighting is there in order to deal with
processes that do not fit in one NUMA node.

This could be either a process that uses more memory than
what fits in one NUMA node, or a process with more threads
than there are processors in a NUMA node, or both.

That is what the complex code is for.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-30 15:10                     ` Nai Xia
@ 2012-07-02  7:36                       ` Rik van Riel
  2012-07-02  7:56                         ` Nai Xia
  0 siblings, 1 reply; 327+ messages in thread
From: Rik van Riel @ 2012-07-02  7:36 UTC (permalink / raw)
  To: nai.xia
  Cc: Andrea Arcangeli, Peter Zijlstra, dlaor, Ingo Molnar,
	Hillf Danton, linux-kernel, linux-mm, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/30/2012 11:10 AM, Nai Xia wrote:

> Yes, pte_numa or pte_young works the same way and they both can
> answer the problem of "which pages were accessed since last scan".
> For LRU, it's OK, it's quite enough. But for numa balancing it's NOT.

Getting LRU right may be much more important than getting
NUMA balancing right.

Retrieving wrongly evicted data from disk can be a million
times slower than fetching data from RAM, while the
penalty for accessing a remote NUMA node is only 20% or so.

> We also should care about the hotness of the page sets, since if the
> workloads are complex we should NOT be expecting that "if this page
> is accessed once, then it's always in my CPU cache during the whole
> last scan interval".
>
> The difference between LRU and the problem you are trying to deal
> with looks so obvious to me, I am so worried that you are still
> messing them up :(

For autonuma, it may be fine to have a lower likelihood of
obtaining an optimum result, because the penalty for getting
it wrong is so much lower.

Say that LRU evicted the wrong page once every 10,000
evictions. At a disk IO penalty of a million times slower
than accessing RAM, that would result in a 100x slowdown.

Now say that autonuma places a page in the wrong NUMA
node once every 10 times. With a 20% penalty for accessing
memory on a remote NUMA node, that results in a 2% slowdown.

Even if the NUMA penalty was 100% (2x as slow remote access
vs. local), it would only be a 10% slowdown.
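
Spelled out as expected costs (local RAM access = 1, using the rates
and penalties above):

  LRU:          0.9999 * 1 + 0.0001 * 1000000 ~= 101    (the ~100x slowdown)
  NUMA (20%):   0.9 * 1    + 0.1 * 1.2         = 1.02   (the 2% slowdown)
  NUMA (2x):    0.9 * 1    + 0.1 * 2.0         = 1.10   (the 10% slowdown)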

Why do you think CPU caches can get away with such small
associativity sets and simple eviction algorithms? :)

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-07-02  7:29                       ` Rik van Riel
@ 2012-07-02  7:43                         ` Nai Xia
  0 siblings, 0 replies; 327+ messages in thread
From: Nai Xia @ 2012-07-02  7:43 UTC (permalink / raw)
  To: Rik van Riel
  Cc: dlaor, Andrea Arcangeli, Peter Zijlstra, Ingo Molnar,
	Hillf Danton, linux-kernel, linux-mm, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt



On 2012-07-02 15:29, Rik van Riel wrote:
> On 06/30/2012 04:23 AM, Nai Xia wrote:
>
>> Oh, sorry, I think I forgot few last comments in my last post:
>>
>> In case you really can take my advice and do comprehensive research,
>> try to make sure that you compare the result of your fancy sampling algorithm
>> with this simple logic:
>>
>> "Blindly select a node and bind the process and move all pages to it."
>>
>> Stupid it may sound, I highly suspect it can approach the benchmarks
>> you already did.
>>
>> If that's really the truth, then all the sampling and weighting stuff can
>> be cut off.
>
> All the sampling and weighing is there in order to deal with
> processes that do not fit in one NUMA node.
>
> This could be either a process that uses more memory than
> what fits in one NUMA node, or a process with more threads
> than there are processors in a NUMA node, or both.
>
> That is what the complex code is for.
>

To quote Andrea's words: "fully converge the load into one node
(or as fewer nodes as possible)".

And I think I've said several times that this weight actually
has the form W = x * y. I think I've made my points
clear.

Well, if you insist, I will keep silent.

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-07-02  7:36                       ` Rik van Riel
@ 2012-07-02  7:56                         ` Nai Xia
  2012-07-02  8:17                           ` Rik van Riel
  0 siblings, 1 reply; 327+ messages in thread
From: Nai Xia @ 2012-07-02  7:56 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, Peter Zijlstra, dlaor, Ingo Molnar,
	Hillf Danton, linux-kernel, linux-mm, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt



On 2012-07-02 15:36, Rik van Riel wrote:
> On 06/30/2012 11:10 AM, Nai Xia wrote:
>
>> Yes, pte_numa or pte_young works the same way and they both can
>> answer the problem of "which pages were accessed since last scan".
>> For LRU, it's OK, it's quite enough. But for numa balancing it's NOT.
>
> Getting LRU right may be much more important than getting
> NUMA balancing right.
>
> Retrieving wrongly evicted data from disk can be a million
> times slower than fetching data from RAM, while the
> penalty for accessing a remote NUMA node is only 20% or so.
>
>> We also should care about the hotness of the page sets, since if the
>> workloads are complex we should NOT be expecting that "if this page
>> is accessed once, then it's always in my CPU cache during the whole
>> last scan interval".
>>
>> The difference between LRU and the problem you are trying to deal
>> with looks so obvious to me, I am so worried that you are still
>> messing them up :(
>
> For autonuma, it may be fine to have a lower likelihood of
> obtaining an optimum result, because the penalty for getting
> it wrong is so much lower.

I said, I actually want to see some detailed analysis
showing that this sampling is really playing an important role
in benchmarks as it claims to be. Not a quick
"lower likelyhood than optimum" conclusion.....

Please, Rik, I know your points, you don't have to explain
anymore. But I just cannot follow without research data.

If you think I am wrong, it's OK to ignore me from now on.....

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-07-02  7:56                         ` Nai Xia
@ 2012-07-02  8:17                           ` Rik van Riel
  2012-07-02  8:31                             ` Nai Xia
  0 siblings, 1 reply; 327+ messages in thread
From: Rik van Riel @ 2012-07-02  8:17 UTC (permalink / raw)
  To: nai.xia
  Cc: Andrea Arcangeli, Peter Zijlstra, dlaor, Ingo Molnar,
	Hillf Danton, linux-kernel, linux-mm, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 07/02/2012 03:56 AM, Nai Xia wrote:
>
>
> On 2012-07-02 15:36, Rik van Riel wrote:
>> On 06/30/2012 11:10 AM, Nai Xia wrote:
>>
>>> Yes, pte_numa or pte_young works the same way and they both can
>>> answer the problem of "which pages were accessed since last scan".
>>> For LRU, it's OK, it's quite enough. But for numa balancing it's NOT.
>>
>> Getting LRU right may be much more important than getting
>> NUMA balancing right.
>>
>> Retrieving wrongly evicted data from disk can be a million
>> times slower than fetching data from RAM, while the
>> penalty for accessing a remote NUMA node is only 20% or so.
>>
>>> We also should care about the hotness of the page sets, since if the
>>> workloads are complex we should NOT be expecting that "if this page
>>> is accessed once, then it's always in my CPU cache during the whole
>>> last scan interval".
>>>
>>> The difference between LRU and the problem you are trying to deal
>>> with looks so obvious to me, I am so worried that you are still
>>> messing them up :(
>>
>> For autonuma, it may be fine to have a lower likelihood of
>> obtaining an optimum result, because the penalty for getting
>> it wrong is so much lower.
>
> I said, I actually want to see some detailed analysis
> showing that this sampling is really playing an important role
> in benchmarks as it claims to be. Not a quick
> "lower likelyhood than optimum" conclusion.....
>
> Please, Rik, I know your points, you don't have to explain
> anymore. But I just cannot follow without research data.

What kind of data are you looking for?

I have seen a lot of generic comments in your emails,
and one gut feeling about Andrea's sampling algorithm,
but I seem to have missed the details of exactly what
you are looking for.

Btw, I share your feeling that Andrea's sampling
algorithm will probably not be able to distinguish
between NUMA nodes that are very frequent users of
a page, and NUMA nodes that use the same page much
less frequently.

However, I suspect that the penalty of getting it
wrong will be fairly low, while the overhead of
getting access frequency information will be
prohibitively high. There is a reason nobody uses
true LRU nowadays, but a clock-style algorithm instead.


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-07-02  8:17                           ` Rik van Riel
@ 2012-07-02  8:31                             ` Nai Xia
  0 siblings, 0 replies; 327+ messages in thread
From: Nai Xia @ 2012-07-02  8:31 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, Peter Zijlstra, dlaor, Ingo Molnar,
	Hillf Danton, linux-kernel, linux-mm, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt



On 2012-07-02 16:17, Rik van Riel wrote:
> On 07/02/2012 03:56 AM, Nai Xia wrote:
>>
>>
>> On 2012-07-02 15:36, Rik van Riel wrote:
>>> On 06/30/2012 11:10 AM, Nai Xia wrote:
>>>
>>>> Yes, pte_numa or pte_young works the same way and they both can
>>>> answer the problem of "which pages were accessed since last scan".
>>>> For LRU, it's OK, it's quite enough. But for numa balancing it's NOT.
>>>
>>> Getting LRU right may be much more important than getting
>>> NUMA balancing right.
>>>
>>> Retrieving wrongly evicted data from disk can be a million
>>> times slower than fetching data from RAM, while the
>>> penalty for accessing a remote NUMA node is only 20% or so.
>>>
>>>> We also should care about the hotness of the page sets, since if the
>>>> workloads are complex we should NOT be expecting that "if this page
>>>> is accessed once, then it's always in my CPU cache during the whole
>>>> last scan interval".
>>>>
>>>> The difference between LRU and the problem you are trying to deal
>>>> with looks so obvious to me, I am so worried that you are still
>>>> messing them up :(
>>>
>>> For autonuma, it may be fine to have a lower likelihood of
>>> obtaining an optimum result, because the penalty for getting
>>> it wrong is so much lower.
>>
>> I said, I actually want to see some detailed analysis
>> showing that this sampling is really playing an important role
>> in benchmarks as it claims to be. Not a quick
>> "lower likelyhood than optimum" conclusion.....
>>
>> Please, Rik, I know your points, you don't have to explain
>> anymore. But I just cannot follow without research data.
>
> What kind of data are you looking for?
>
> I have seen a lot of generic comments in your emails,
> and one gut feeling about Andrea's sampling algorithm,
> but I seem to have missed the details of exactly what
> you are looking for.
>
> Btw, I share your feeling that Andrea's sampling
> algorithm will probably not be able to distinguish
> between NUMA nodes that are very frequent users of
> a page, and NUMA nodes that use the same page much
> less frequently.
>
> However, I suspect that the penalty of getting it
> wrong will be fairly low, while the overhead of
> getting access frequency information will be
> prohibitively high. There is a reason nobody uses
> LRU nowadays, but a clock style algorithm instead.
>
>

I think I won't repeat myself again and again and
again and get lost in tons of words.

Thank you for your comments, Rik, and best wishes.
This is my last reply.

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 11/40] autonuma: define the autonuma flags
  2012-06-30  4:58     ` Konrad Rzeszutek Wilk
@ 2012-07-02 15:42       ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 327+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-07-02 15:42 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, Hillf Danton,
	Dan Smith, Peter Zijlstra, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Don Morris, Benjamin Herrenschmidt

On Sat, Jun 30, 2012 at 12:58:25AM -0400, Konrad Rzeszutek Wilk wrote:
> On Thu, Jun 28, 2012 at 02:55:51PM +0200, Andrea Arcangeli wrote:
> > These flags are the ones tweaked through sysfs, they control the
> > behavior of autonuma, from enabling disabling it, to selecting various
> > runtime options.
> > 
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > ---
> >  include/linux/autonuma_flags.h |   62 ++++++++++++++++++++++++++++++++++++++++
> >  1 files changed, 62 insertions(+), 0 deletions(-)
> >  create mode 100644 include/linux/autonuma_flags.h
> > 
> > diff --git a/include/linux/autonuma_flags.h b/include/linux/autonuma_flags.h
> > new file mode 100644
> > index 0000000..5e29a75
> > --- /dev/null
> > +++ b/include/linux/autonuma_flags.h
> > @@ -0,0 +1,62 @@
> > +#ifndef _LINUX_AUTONUMA_FLAGS_H
> > +#define _LINUX_AUTONUMA_FLAGS_H
> > +
> > +enum autonuma_flag {
> 
> These aren't really flags. They are bit-fields.
> A
> > +	AUTONUMA_FLAG,
> 
> Looking at the code, this is to turn it on. Perhaps a better name such
> as: AUTONUMA_ACTIVE_FLAG ?
> 
> 
> > +	AUTONUMA_IMPOSSIBLE_FLAG,
> > +	AUTONUMA_DEBUG_FLAG,
> > +	AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG,
> 
> I might have gotten my math wrong, but if you have
> AUTONUMA_SCHED_LOAD_BALACE.. set (so 3), that also means
> that bit 0 and 1 are on. In other words AUTONUMA_FLAG
> and AUTONUMA_IMPOSSIBLE_FLAG are turned on.
> 
> > +	AUTONUMA_SCHED_CLONE_RESET_FLAG,
> > +	AUTONUMA_SCHED_FORK_RESET_FLAG,
> > +	AUTONUMA_SCAN_PMD_FLAG,
> 
> So this is 6, which means 110 bits. So AUTONUMA_FLAG
> gets turned off.

Ignore that, please. I had it in my mind that test_bit was doing boolean logic (ANDing and
masking, and such), but that is not the case.
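
That is, the enum values are bit positions; e.g. (an illustration only,
not the kernel code):

	enum autonuma_flag {
		AUTONUMA_FLAG,					/* bit 0 */
		AUTONUMA_IMPOSSIBLE_FLAG,			/* bit 1 */
		AUTONUMA_DEBUG_FLAG,				/* bit 2 */
		AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG,	/* bit 3 */
	};

	unsigned long autonuma_flags;

	set_bit(AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG, &autonuma_flags);
	/*
	 * Only bit 3 is set now (the word is 0x8):
	 * test_bit(AUTONUMA_FLAG, &autonuma_flags) still returns 0,
	 * because set_bit()/test_bit() take a bit number, not a mask.
	 */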

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-29 19:19                     ` Rik van Riel
  (?)
@ 2012-07-02 16:57                     ` Vaidyanathan Srinivasan
  2012-07-05 16:56                       ` Vaidyanathan Srinivasan
  -1 siblings, 1 reply; 327+ messages in thread
From: Vaidyanathan Srinivasan @ 2012-07-02 16:57 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, dlaor, Ingo Molnar, Hillf Danton,
	Andrea Arcangeli, linux-kernel, linux-mm, Dan Smith,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

* Rik van Riel <riel@redhat.com> [2012-06-29 15:19:17]:

> On 06/29/2012 03:03 PM, Peter Zijlstra wrote:
> >On Fri, 2012-06-29 at 20:57 +0200, Peter Zijlstra wrote:
> >>On Fri, 2012-06-29 at 14:46 -0400, Rik van Riel wrote:
> >>>
> >>>I am not convinced all architectures that have CONFIG_NUMA
> >>>need to be a requirement, since some of them (eg. Alpha)
> >>>seem to be lacking a maintainer nowadays.
> >>
> >>Still, this NUMA balancing stuff is not a small tweak to load-balancing.
> >>It's a very significant change in how you schedule. Having such great
> >>differences over architectures isn't something I look forward to.
> 
> I am not too worried about the performance of architectures
> that are essentially orphaned :)
> 
> >Also, Andrea keeps insisting arch support is trivial, so I don't see the
> >problem.
> 
> Getting it implemented in one or two additional architectures
> would be good, to get a template out there that can be used by
> other architecture maintainers.

I am currently porting the framework over to powerpc.  I will share
the initial patches in a couple of days.

--Vaidy

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 04/40] xen: document Xen is using an unused bit for the pagetables
  2012-06-30  4:47     ` Konrad Rzeszutek Wilk
@ 2012-07-03 10:45       ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-07-03 10:45 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Hi Konrad,

On Sat, Jun 30, 2012 at 12:47:18AM -0400, Konrad Rzeszutek Wilk wrote:
> On Thu, Jun 28, 2012 at 02:55:44PM +0200, Andrea Arcangeli wrote:
> > Xen has taken over the last reserved bit available for the pagetables
> 
> Some time ago when I saw this patch I asked about it (if there is way
> to actually stop using this bit) and you mentioned it is not the last
> bit available for pagemaps. Perhaps you should alter the comment
> in this description?

As far as I can tell the comment is correct: it is the last bit
available. It's simply that I don't need to use it anymore. There are 3
reserved bits: one is used by Xen, the second is used by SPECIAL, and
the third is used by kmemcheck.
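
For reference, the three software bits in the x86 pte layout of that
era look roughly like this (bit numbers quoted from memory, so treat
them as illustrative rather than authoritative):

	#define _PAGE_BIT_UNUSED1	9	/* reused as _PAGE_BIT_SPECIAL */
	#define _PAGE_BIT_IOMAP		10	/* the bit Xen/ioremap uses */
	#define _PAGE_BIT_HIDDEN	11	/* hidden by kmemcheck */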

> > which is set through ioremap, this documents it and makes the code
> 
> It actually is through ioremap, gntdev (to map another guest memory),
> and on pfns which fall in E820 on the non-RAM and gap sections.

Well, I dropped this patch; there's too much other important work to
do, and this is only a documentation improvement and a cleanup that I
don't need.

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-28 14:46     ` Peter Zijlstra
@ 2012-07-03 11:53       ` Peter Zijlstra
  -1 siblings, 0 replies; 327+ messages in thread
From: Peter Zijlstra @ 2012-07-03 11:53 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, 2012-06-28 at 16:46 +0200, Peter Zijlstra wrote:
> As it stands you wrote a lot of words.. but none of them were really
> helpful in understanding what you do. 

Can you write something like the below for autonuma?

That is, present what your balancing goals are and why and in what
measures and at what cost.

Present it in 'proper' math, not examples.

Don't try and make it perfect -- the below isn't, just try and make it a
coherent story.

As a side note, does anybody have a good way to show (7) follows from (6) other
than waving hands? One has to show (6) is fully connected and that the max
path length is indeed log n. I spent an hour last night trying but I've
forgotten too much of graph theory to make it stick.

---
 kernel/sched/fair.c | 118 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 116 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3704ad3..2e44318 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3077,8 +3077,122 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preemp
 
 #ifdef CONFIG_SMP
 /**************************************************
- * Fair scheduling class load-balancing methods:
- */
+ * Fair scheduling class load-balancing methods.
+ *
+ * BASICS
+ *
+ * The purpose of load-balancing is to achieve the same basic fairness the
+ * per-cpu scheduler provides, namely provide a proportional amount of compute
+ * time to each task. This is expressed in the following equation:
+ *
+ *   W_i,n/P_i == W_j,n/P_j for all i,j                               (1)
+ *
+ * Where W_i,n is the n-th weight average for cpu i. The instantaneous weight
+ * W_i,0 is defined as:
+ *
+ *   W_i,0 = \Sum_j w_i,j                                             (2)
+ *
+ * Where w_i,j is the weight of the j-th runnable task on cpu i. This weight
+ * is derived from the nice value as per prio_to_weight[].
+ *
+ * The weight average is an exponential decay average of the instantaneous
+ * weight:
+ *
+ *   W'_i,n = (2^n - 1) / 2^n * W_i,n + 1 / 2^n * W_i,0               (3)
+ *
+ * P_i is the cpu power (or compute capacity) of cpu i, typically it is the
+ * fraction of 'recent' time available for SCHED_OTHER task execution. But it
+ * can also include other factors [XXX].
+ *
+ * To achieve this balance we define a measure of imbalance which follows
+ * directly from (1):
+ *
+ *   imb_i,j = max{ avg(W/P), W_i/P_i } - min{ avg(W/P), W_j/P_j }    (4)
+ *
+ * We then move tasks around to minimize the imbalance. In the continuous
+ * function space it is obvious this converges, in the discrete case we get
+ * a few fun cases generally called infeasible weight scenarios.
+ *
+ * [XXX expand on:
+ *     - infeasible weights;
+ *     - local vs global optima in the discrete case. ]
+ *
+ *
+ * SCHED DOMAINS
+ *
+ * In order to solve the imbalance equation (4), and avoid the obvious O(n^2)
+ * for all i,j solution, we create a tree of cpus that follows the hardware
+ * topology where each level pairs two lower groups (or better). This results
+ * in O(log n) layers. Furthermore we reduce the number of cpus going up the
+ * tree to only the first of the previous level and we decrease the frequency
+ * of load-balance at each level inv. proportional to the number of cpus in
+ * the groups.
+ *
+ * This yields:
+ *
+ *     log_2 n     1     n
+ *   \Sum       { --- * --- * 2^i } = O(n)                            (5)
+ *     i = 0      2^i   2^i
+ *                               `- size of each group
+ *         |         |     `- number of cpus doing load-balance
+ *         |         `- freq
+ *         `- sum over all levels
+ *
+ * Coupled with a limit on how many tasks we can migrate every balance pass,
+ * this makes (5) the runtime complexity of the balancer.
+ *
+ * An important property here is that each CPU is still (indirectly) connected
+ * to every other cpu in at most O(log n) steps:
+ *
+ * The adjacency matrix of the resulting graph is given by:
+ *
+ *             log_2 n     
+ *   A_i,j = \Union     (i % 2^k == 0) && i / 2^(k+1) == j / 2^(k+1)  (6)
+ *             k = 0
+ *
+ * And you'll find that:
+ *
+ *   A^(log_2 n)_i,j != 0  for all i,j                                (7)
+ *
+ * Showing there's indeed a path between every cpu in at most O(log n) steps.
+ * The task movement gives a factor of O(m), giving a convergence complexity
+ * of:
+ *
+ *   O(nm log n),  n := nr_cpus, m := nr_tasks                        (8)
+ *
+ *
+ * WORK CONSERVING
+ *
+ * In order to avoid CPUs going idle while there's still work to do, new idle
+ * balancing is more aggressive and has the newly idle cpu iterate up the domain
+ * tree itself instead of relying on other CPUs to bring it work.
+ *
+ * This adds some complexity to both (5) and (8) but it reduces the total idle
+ * time.
+ *
+ * [XXX more?]
+ *
+ *
+ * CGROUPS
+ *
+ * Cgroups make a horror show out of (2), instead of a simple sum we get:
+ *
+ *                                s_k,i
+ *   W_i,0 = \Sum_j \Prod_k w_k * -----                               (9)
+ *                                 S_k
+ *
+ * Where
+ *
+ *   s_k,i = \Sum_j w_i,j,k  and  S_k = \Sum_i s_k,i                 (10)
+ *
+ * w_i,j,k is the weight of the j-th runnable task in the k-th cgroup on cpu i.
+ *
+ * The big problem is S_k, its a global sum needed to compute a local (W_i)
+ * property.
+ *
+ * [XXX write more on how we solve this.. _after_ merging pjt's patches that
+ *      rewrite all of this once again.]
+ */ 
 
 static unsigned long __read_mostly max_load_balance_interval = HZ/10;
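
A quick sanity check that (5) really is O(n): each term reduces to
n/2^i, and

     log_2 n
   \Sum       n/2^i  <  2n  =  O(n)
     i = 0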
 


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
@ 2012-07-03 11:53       ` Peter Zijlstra
  0 siblings, 0 replies; 327+ messages in thread
From: Peter Zijlstra @ 2012-07-03 11:53 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, 2012-06-28 at 16:46 +0200, Peter Zijlstra wrote:
> As it stands you wrote a lot of words.. but none of them were really
> helpful in understanding what you do. 

Can you write something like the below for autonuma?

That is, present what your balancing goals are and why and in what
measures and at what cost.

Present it in 'proper' math, not examples.

Don't try and make it perfect -- the below isn't, just try and make it a
coherent story.

As a side note, does anybody have a good way to show 7 follows from 6 other
than waving hands? One has to show 6 is fully connected and that the max
path length is indeed log n. I spent an hour last night trying but I've
forgotten too much of graph theory to make it stick.

---
 kernel/sched/fair.c | 118 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 116 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3704ad3..2e44318 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3077,8 +3077,122 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preemp
 
 #ifdef CONFIG_SMP
 /**************************************************
- * Fair scheduling class load-balancing methods:
- */
+ * Fair scheduling class load-balancing methods.
+ *
+ * BASICS
+ *
+ * The purpose of load-balancing is to achieve the same basic fairness the
+ * per-cpu scheduler provides, namely provide a proportional amount of compute
+ * time to each task. This is expressed in the following equation:
+ *
+ *   W_i,n/P_i == W_j,n/P_j for all i,j                               (1)
+ *
+ * Where W_i,n is the n-th weight average for cpu i. The instantaneous weight
+ * W_i,0 is defined as:
+ *
+ *   W_i,0 = \Sum_j w_i,j                                             (2)
+ *
+ * Where w_i,j is the weight of the j-th runnable task on cpu i. This weight
+ * is derived from the nice value as per prio_to_weight[].
+ *
+ * The weight average is an exponential decay average of the instantaneous
+ * weight:
+ *
+ *   W'_i,n = (2^n - 1) / 2^n * W_i,n + 1 / 2^n * W_i,0               (3)
+ *
+ * P_i is the cpu power (or compute capacity) of cpu i; typically it is the
+ * fraction of 'recent' time available for SCHED_OTHER task execution. But it
+ * can also include other factors [XXX].
+ *
+ * To achieve this balance we define a measure of imbalance which follows
+ * directly from (1):
+ *
+ *   imb_i,j = max{ avg(W/P), W_i/P_i } - min{ avg(W/P), W_j/P_j }    (4)
+ *
+ * We then move tasks around to minimize the imbalance. In the continuous
+ * function space it is obvious this converges, in the discrete case we get
+ * a few fun cases generally called infeasible weight scenarios.
+ *
+ * [XXX expand on:
+ *     - infeasible weights;
+ *     - local vs global optima in the discrete case. ]
+ *
+ *
+ * SCHED DOMAINS
+ *
+ * In order to solve the imbalance equation (4), and avoid the obvious O(n^2)
+ * for all i,j solution, we create a tree of cpus that follows the hardware
+ * topology where each level pairs two lower groups (or better). This results
+ * in O(log n) layers. Furthermore we reduce the number of cpus going up the
+ * tree to only the first of the previous level and we decrease the frequency
+ * of load-balance at each level inv. proportional to the number of cpus in
+ * the groups.
+ *
+ * This yields:
+ *
+ *     log_2 n     1     n
+ *   \Sum       { --- * --- * 2^i } = O(n)                            (5)
+ *     i = 0      2^i   2^i
+ *                               `- size of each group
+ *         |         |     `- number of cpus doing load-balance
+ *         |         `- freq
+ *         `- sum over all levels
+ *
+ * Coupled with a limit on how many tasks we can migrate every balance pass,
+ * this makes (5) the runtime complexity of the balancer.
+ *
+ * An important property here is that each CPU is still (indirectly) connected
+ * to every other cpu in at most O(log n) steps:
+ *
+ * The adjacency matrix of the resulting graph is given by:
+ *
+ *             log_2 n     
+ *   A_i,j = \Union     (i % 2^k == 0) && i / 2^(k+1) == j / 2^(k+1)  (6)
+ *             k = 0
+ *
+ * And you'll find that:
+ *
+ *   A^(log_2 n)_i,j != 0  for all i,j                                (7)
+ *
+ * Showing there's indeed a path between every cpu in at most O(log n) steps.
+ * The task movement gives a factor of O(m), giving a convergence complexity
+ * of:
+ *
+ *   O(nm log n),  n := nr_cpus, m := nr_tasks                        (8)
+ *
+ *
+ * WORK CONSERVING
+ *
+ * In order to avoid CPUs going idle while there's still work to do, new idle
+ * balancing is more aggressive and has the newly idle cpu iterate up the domain
+ * tree itself instead of relying on other CPUs to bring it work.
+ *
+ * This adds some complexity to both (5) and (8) but it reduces the total idle
+ * time.
+ *
+ * [XXX more?]
+ *
+ *
+ * CGROUPS
+ *
+ * Cgroups make a horror show out of (2), instead of a simple sum we get:
+ *
+ *                                s_k,i
+ *   W_i,0 = \Sum_j \Prod_k w_k * -----                               (9)
+ *                                 S_k
+ *
+ * Where
+ *
+ *   s_k,i = \Sum_j w_i,j,k  and  S_k = \Sum_i s_k,i                 (10)
+ *
+ * w_i,j,k is the weight of the j-th runnable task in the k-th cgroup on cpu i.
+ *
+ * The big problem is S_k: it's a global sum needed to compute a local (W_i)
+ * property.
+ *
+ * [XXX write more on how we solve this.. _after_ merging pjt's patches that
+ *      rewrite all of this once again.]
+ */ 
 
 static unsigned long __read_mostly max_load_balance_interval = HZ/10;
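
Short of a real graph-theory argument for the side note above, (7) can at
least be checked by brute force for small power-of-two n. The userspace
sketch below (array size and output format are arbitrary) builds A per (6)
and raises it to the log2(n)-th power over the boolean semiring, verifying
that no entry is zero:

#include <stdio.h>
#include <string.h>

#define N 64				/* number of cpus, must be a power of two */

static int A[N][N], R[N][N], T[N][N];

int main(void)
{
	int i, j, k, l, logn = 0, ok = 1;

	for (i = N; i > 1; i >>= 1)
		logn++;

	/* adjacency matrix per (6) */
	for (i = 0; i < N; i++)
		for (j = 0; j < N; j++)
			for (k = 0; k <= logn; k++)
				if (i % (1 << k) == 0 &&
				    i / (1 << (k + 1)) == j / (1 << (k + 1)))
					A[i][j] = 1;

	/* R = A^logn over the boolean semiring */
	memcpy(R, A, sizeof(R));
	for (l = 1; l < logn; l++) {
		memset(T, 0, sizeof(T));
		for (i = 0; i < N; i++)
			for (k = 0; k < N; k++)
				if (R[i][k])
					for (j = 0; j < N; j++)
						if (A[k][j])
							T[i][j] = 1;
		memcpy(R, T, sizeof(R));
	}

	for (i = 0; i < N; i++)
		for (j = 0; j < N; j++)
			if (!R[i][j])
				ok = 0;

	printf("n = %d: A^(log2 n) is %s\n",
	       N, ok ? "all-nonzero" : "NOT all-nonzero");
	return 0;
}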
 


^ permalink raw reply related	[flat|nested] 327+ messages in thread

* Re: [PATCH 05/40] autonuma: define _PAGE_NUMA_PTE and _PAGE_NUMA_PMD
  2012-06-29 14:26     ` Rik van Riel
@ 2012-07-03 20:30       ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-07-03 20:30 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Hi Rik,

On Fri, Jun 29, 2012 at 10:26:39AM -0400, Rik van Riel wrote:
> On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> 
> > +/*
> > + * Cannot be set on pte. The fact it's in between _PAGE_FILE and
> > + * _PAGE_PROTNONE avoids having to alter the swp entries.
> > + */
> > +#define _PAGE_NUMA_PTE	_PAGE_PSE
> > +/*
> > + * Cannot be set on pmd, if transparent hugepages will be swapped out
> > + * the swap entry offset must start above it.
> > + */
> > +#define _PAGE_NUMA_PMD	_PAGE_UNUSED2
> 
> Those comments only tell us what the flags can NOT be
> used for, not what they are actually used for.

You can find an updated version of the comments here:

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commitdiff;h=927ca960d78fefe6fa6aaa260a5b35496abafec5

Thanks for all the feedback. I didn't reply immediately, but I'm
working through it, and many more bits have already been improved in
the autonuma branch. I will post them separately for further review.

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 05/40] autonuma: define _PAGE_NUMA_PTE and _PAGE_NUMA_PMD
@ 2012-07-03 20:30       ` Andrea Arcangeli
  0 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-07-03 20:30 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Hi Rik,

On Fri, Jun 29, 2012 at 10:26:39AM -0400, Rik van Riel wrote:
> On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> 
> > +/*
> > + * Cannot be set on pte. The fact it's in between _PAGE_FILE and
> > + * _PAGE_PROTNONE avoids having to alter the swp entries.
> > + */
> > +#define _PAGE_NUMA_PTE	_PAGE_PSE
> > +/*
> > + * Cannot be set on pmd, if transparent hugepages will be swapped out
> > + * the swap entry offset must start above it.
> > + */
> > +#define _PAGE_NUMA_PMD	_PAGE_UNUSED2
> 
> Those comments only tell us what the flags can NOT be
> used for, not what they are actually used for.

You can find an updated version of the comments here:

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commitdiff;h=927ca960d78fefe6fa6aaa260a5b35496abafec5

Thanks for all the feedback. I didn't reply immediately, but I'm
working through it, and many more bits have already been improved in
the autonuma branch. I will post them separately for further review.


^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 06/40] autonuma: x86 pte_numa() and pmd_numa()
  2012-06-29 15:02     ` Rik van Riel
@ 2012-07-04 23:03       ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-07-04 23:03 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Fri, Jun 29, 2012 at 11:02:41AM -0400, Rik van Riel wrote:
> On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> 
> >   static inline int pte_file(pte_t pte)
> >   {
> > -	return pte_flags(pte)&  _PAGE_FILE;
> > +	return (pte_flags(pte)&  _PAGE_FILE) == _PAGE_FILE;
> >   }
> 
> Wait, why is this change made?  Surely _PAGE_FILE is just
> one single bit and this change is not useful?
> 
> If there is a reason for this change, please document it.

I split it off into a separate patch with a proper commit log here.

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commitdiff;h=7b2292c7ab86205f3d630533dc9987449fea6347

I haven't checked if it triggers the same warning without the patchset
applied, but I need the build not to show warnings, so even if the
warning may be irrelevant for upstream, I don't like to have warnings.

> >   static inline int pte_hidden(pte_t pte)
> > @@ -415,7 +417,46 @@ static inline int pte_hidden(pte_t pte)
> >
> >   static inline int pmd_present(pmd_t pmd)
> >   {
> > -	return pmd_flags(pmd)&  _PAGE_PRESENT;
> > +	return pmd_flags(pmd)&  (_PAGE_PRESENT | _PAGE_PROTNONE |
> > +				 _PAGE_NUMA_PMD);
> > +}
> 
> Somewhat subtle. Better documentation in patch 5 will
> help explain this.

It's as subtle as PROTNONE but I added more explanation below as well
as in patch 5.

> > +#ifdef CONFIG_AUTONUMA
> > +static inline int pte_numa(pte_t pte)
> > +{
> > +	return (pte_flags(pte)&
> > +		(_PAGE_NUMA_PTE|_PAGE_PRESENT)) == _PAGE_NUMA_PTE;
> > +}
> > +
> > +static inline int pmd_numa(pmd_t pmd)
> > +{
> > +	return (pmd_flags(pmd)&
> > +		(_PAGE_NUMA_PMD|_PAGE_PRESENT)) == _PAGE_NUMA_PMD;
> > +}
> > +#endif
> 
> These could use a little explanation of how _PAGE_NUMA_* is
> used and what the flags mean.

Added:

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commitdiff;h=0e6537227de32c40bbf0a5bc6b11d27ba5779e68

> > +static inline pte_t pte_mknotnuma(pte_t pte)
> > +{
> > +	pte = pte_clear_flags(pte, _PAGE_NUMA_PTE);
> > +	return pte_set_flags(pte, _PAGE_PRESENT|_PAGE_ACCESSED);
> > +}
> > +
> > +static inline pmd_t pmd_mknotnuma(pmd_t pmd)
> > +{
> > +	pmd = pmd_clear_flags(pmd, _PAGE_NUMA_PMD);
> > +	return pmd_set_flags(pmd, _PAGE_PRESENT|_PAGE_ACCESSED);
> > +}
> > +
> > +static inline pte_t pte_mknuma(pte_t pte)
> > +{
> > +	pte = pte_set_flags(pte, _PAGE_NUMA_PTE);
> > +	return pte_clear_flags(pte, _PAGE_PRESENT);
> > +}
> > +
> > +static inline pmd_t pmd_mknuma(pmd_t pmd)
> > +{
> > +	pmd = pmd_set_flags(pmd, _PAGE_NUMA_PMD);
> > +	return pmd_clear_flags(pmd, _PAGE_PRESENT);
> >   }
> 
> These functions could use some explanation, too.
> 
> Why do the top ones set _PAGE_ACCESSED, while the bottom ones
> leave _PAGE_ACCESSED alone?
> 
> I can guess the answer, but it should be documented so it is
> also clear to people with less experience in the VM.

Added too in prev link.
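
For anyone skimming, the invariant those predicates encode can be played
with in a few lines of userspace; the bit values below are arbitrary
stand-ins, not the real x86 _PAGE_* values:

#include <stdio.h>

#define TOY_PRESENT	(1UL << 0)
#define TOY_NUMA	(1UL << 7)	/* stands in for _PAGE_NUMA_PTE */

/* same shape as the quoted pte_numa(): NUMA bit set, PRESENT clear */
static int toy_pte_numa(unsigned long flags)
{
	return (flags & (TOY_NUMA | TOY_PRESENT)) == TOY_NUMA;
}

int main(void)
{
	printf("%d\n", toy_pte_numa(TOY_NUMA));			/* 1: numa hinting entry */
	printf("%d\n", toy_pte_numa(TOY_NUMA | TOY_PRESENT));	/* 0: normally mapped */
	printf("%d\n", toy_pte_numa(0));			/* 0: none/swap entry */
	return 0;
}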

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 06/40] autonuma: x86 pte_numa() and pmd_numa()
@ 2012-07-04 23:03       ` Andrea Arcangeli
  0 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-07-04 23:03 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Fri, Jun 29, 2012 at 11:02:41AM -0400, Rik van Riel wrote:
> On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> 
> >   static inline int pte_file(pte_t pte)
> >   {
> > -	return pte_flags(pte)&  _PAGE_FILE;
> > +	return (pte_flags(pte)&  _PAGE_FILE) == _PAGE_FILE;
> >   }
> 
> Wait, why is this change made?  Surely _PAGE_FILE is just
> one single bit and this change is not useful?
> 
> If there is a reason for this change, please document it.

I split it off into a separate patch with a proper commit log here.

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commitdiff;h=7b2292c7ab86205f3d630533dc9987449fea6347

I haven't checked if it triggers the same warning without the patchset
applied, but I need the build not to show warnings, so even if the
warning may be irrelevant for upstream, I don't like to have warnings.

> >   static inline int pte_hidden(pte_t pte)
> > @@ -415,7 +417,46 @@ static inline int pte_hidden(pte_t pte)
> >
> >   static inline int pmd_present(pmd_t pmd)
> >   {
> > -	return pmd_flags(pmd)&  _PAGE_PRESENT;
> > +	return pmd_flags(pmd)&  (_PAGE_PRESENT | _PAGE_PROTNONE |
> > +				 _PAGE_NUMA_PMD);
> > +}
> 
> Somewhat subtle. Better documentation in patch 5 will
> help explain this.

It's as subtle as PROTNONE but I added more explanation below as well
as in patch 5.

> > +#ifdef CONFIG_AUTONUMA
> > +static inline int pte_numa(pte_t pte)
> > +{
> > +	return (pte_flags(pte)&
> > +		(_PAGE_NUMA_PTE|_PAGE_PRESENT)) == _PAGE_NUMA_PTE;
> > +}
> > +
> > +static inline int pmd_numa(pmd_t pmd)
> > +{
> > +	return (pmd_flags(pmd)&
> > +		(_PAGE_NUMA_PMD|_PAGE_PRESENT)) == _PAGE_NUMA_PMD;
> > +}
> > +#endif
> 
> These could use a little explanation of how _PAGE_NUMA_* is
> used and what the flags mean.

Added:

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commitdiff;h=0e6537227de32c40bbf0a5bc6b11d27ba5779e68

> > +static inline pte_t pte_mknotnuma(pte_t pte)
> > +{
> > +	pte = pte_clear_flags(pte, _PAGE_NUMA_PTE);
> > +	return pte_set_flags(pte, _PAGE_PRESENT|_PAGE_ACCESSED);
> > +}
> > +
> > +static inline pmd_t pmd_mknotnuma(pmd_t pmd)
> > +{
> > +	pmd = pmd_clear_flags(pmd, _PAGE_NUMA_PMD);
> > +	return pmd_set_flags(pmd, _PAGE_PRESENT|_PAGE_ACCESSED);
> > +}
> > +
> > +static inline pte_t pte_mknuma(pte_t pte)
> > +{
> > +	pte = pte_set_flags(pte, _PAGE_NUMA_PTE);
> > +	return pte_clear_flags(pte, _PAGE_PRESENT);
> > +}
> > +
> > +static inline pmd_t pmd_mknuma(pmd_t pmd)
> > +{
> > +	pmd = pmd_set_flags(pmd, _PAGE_NUMA_PMD);
> > +	return pmd_clear_flags(pmd, _PAGE_PRESENT);
> >   }
> 
> These functions could use some explanation, too.
> 
> Why do the top ones set _PAGE_ACCESSED, while the bottom ones
> leave _PAGE_ACCESSED alone?
> 
> I can guess the answer, but it should be documented so it is
> also clear to people with less experience in the VM.

Added too in prev link.


^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 04/40] xen: document Xen is using an unused bit for the pagetables
  2012-06-29 14:16     ` Rik van Riel
@ 2012-07-04 23:05       ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-07-04 23:05 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Fri, Jun 29, 2012 at 10:16:12AM -0400, Rik van Riel wrote:
> On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> > Xen has taken over the last reserved bit available for the pagetables
> > which is set through ioremap, this documents it and makes the code
> > more readable.
> >
> > Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>
> > ---
> >   arch/x86/include/asm/pgtable_types.h |   11 +++++++++--
> >   1 files changed, 9 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> > index 013286a..b74cac9 100644
> > --- a/arch/x86/include/asm/pgtable_types.h
> > +++ b/arch/x86/include/asm/pgtable_types.h
> > @@ -17,7 +17,7 @@
> >   #define _PAGE_BIT_PAT		7	/* on 4KB pages */
> >   #define _PAGE_BIT_GLOBAL	8	/* Global TLB entry PPro+ */
> >   #define _PAGE_BIT_UNUSED1	9	/* available for programmer */
> > -#define _PAGE_BIT_IOMAP		10	/* flag used to indicate IO mapping */
> > +#define _PAGE_BIT_UNUSED2	10
> 
> Considering that Xen is using it, it is not really
> unused, is it?

_PAGE_BIT_UNUSED1 is used too (_PAGE_BIT_SPECIAL). Unused stands for
unused by the CPU, not by the OS. But this patch is dropped.

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 04/40] xen: document Xen is using an unused bit for the pagetables
@ 2012-07-04 23:05       ` Andrea Arcangeli
  0 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-07-04 23:05 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Fri, Jun 29, 2012 at 10:16:12AM -0400, Rik van Riel wrote:
> On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> > Xen has taken over the last reserved bit available for the pagetables
> > which is set through ioremap, this documents it and makes the code
> > more readable.
> >
> > Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>
> > ---
> >   arch/x86/include/asm/pgtable_types.h |   11 +++++++++--
> >   1 files changed, 9 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> > index 013286a..b74cac9 100644
> > --- a/arch/x86/include/asm/pgtable_types.h
> > +++ b/arch/x86/include/asm/pgtable_types.h
> > @@ -17,7 +17,7 @@
> >   #define _PAGE_BIT_PAT		7	/* on 4KB pages */
> >   #define _PAGE_BIT_GLOBAL	8	/* Global TLB entry PPro+ */
> >   #define _PAGE_BIT_UNUSED1	9	/* available for programmer */
> > -#define _PAGE_BIT_IOMAP		10	/* flag used to indicate IO mapping */
> > +#define _PAGE_BIT_UNUSED2	10
> 
> Considering that Xen is using it, it is not really
> unused, is it?

_PAGE_BIT_UNUSED1 is used too (_PAGE_BIT_SPECIAL). Unused stands for
unused by the CPU, not by the OS. But this patch is dropped.


^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 09/40] autonuma: introduce kthread_bind_node()
  2012-06-30  4:50     ` Konrad Rzeszutek Wilk
@ 2012-07-04 23:14       ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-07-04 23:14 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Sat, Jun 30, 2012 at 12:50:14AM -0400, Konrad Rzeszutek Wilk wrote:
> On Thu, Jun 28, 2012 at 02:55:49PM +0200, Andrea Arcangeli wrote:
> > This function makes it easy to bind the per-node knuma_migrated
> > threads to their respective NUMA nodes. Those threads take memory from
> > the other nodes (in round robin with a incoming queue for each remote
> > node) and they move that memory to their local node.
> > 
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > ---
> >  include/linux/kthread.h |    1 +
> >  include/linux/sched.h   |    2 +-
> >  kernel/kthread.c        |   23 +++++++++++++++++++++++
> >  3 files changed, 25 insertions(+), 1 deletions(-)
> > 
> > diff --git a/include/linux/kthread.h b/include/linux/kthread.h
> > index 0714b24..e733f97 100644
> > --- a/include/linux/kthread.h
> > +++ b/include/linux/kthread.h
> > @@ -33,6 +33,7 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
> >  })
> >  
> >  void kthread_bind(struct task_struct *k, unsigned int cpu);
> > +void kthread_bind_node(struct task_struct *p, int nid);
> >  int kthread_stop(struct task_struct *k);
> >  int kthread_should_stop(void);
> >  bool kthread_freezable_should_stop(bool *was_frozen);
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 4059c0f..699324c 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1792,7 +1792,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
> >  #define PF_SWAPWRITE	0x00800000	/* Allowed to write to swap */
> >  #define PF_SPREAD_PAGE	0x01000000	/* Spread page cache over cpuset */
> >  #define PF_SPREAD_SLAB	0x02000000	/* Spread some slab caches over cpuset */
> > -#define PF_THREAD_BOUND	0x04000000	/* Thread bound to specific cpu */
> > +#define PF_THREAD_BOUND	0x04000000	/* Thread bound to specific cpus */
> >  #define PF_MCE_EARLY    0x08000000      /* Early kill for mce process policy */
> >  #define PF_MEMPOLICY	0x10000000	/* Non-default NUMA mempolicy */
> >  #define PF_MUTEX_TESTER	0x20000000	/* Thread belongs to the rt mutex tester */
> > diff --git a/kernel/kthread.c b/kernel/kthread.c
> > index 3d3de63..48b36f9 100644
> > --- a/kernel/kthread.c
> > +++ b/kernel/kthread.c
> > @@ -234,6 +234,29 @@ void kthread_bind(struct task_struct *p, unsigned int cpu)
> >  EXPORT_SYMBOL(kthread_bind);
> >  
> >  /**
> > + * kthread_bind_node - bind a just-created kthread to the CPUs of a node.
> > + * @p: thread created by kthread_create().
> > + * @nid: node (might not be online, must be possible) for @k to run on.
> > + *
> > + * Description: This function is equivalent to set_cpus_allowed(),
> > + * except that @nid doesn't need to be online, and the thread must be
> > + * stopped (i.e., just returned from kthread_create()).
> > + */
> > +void kthread_bind_node(struct task_struct *p, int nid)
> > +{
> > +	/* Must have done schedule() in kthread() before we set_task_cpu */
> > +	if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
> > +		WARN_ON(1);
> > +		return;
> > +	}
> > +
> > +	/* It's safe because the task is inactive. */
> > +	do_set_cpus_allowed(p, cpumask_of_node(nid));
> > +	p->flags |= PF_THREAD_BOUND;
> > +}
> > +EXPORT_SYMBOL(kthread_bind_node);
> 
> _GPL ?

/**
 * kthread_bind - bind a just-created kthread to a cpu.
 * @p: thread created by kthread_create().
 * @cpu: cpu (might not be online, must be possible) for @k to run on.
 *
 * Description: This function is equivalent to set_cpus_allowed(),
 * except that @cpu doesn't need to be online, and the thread must be
 * stopped (i.e., just returned from kthread_create()).
 */
void kthread_bind(struct task_struct *p, unsigned int cpu)
{
	/* Must have done schedule() in kthread() before we set_task_cpu */
	if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
		WARN_ON(1);
		return;
	}

	/* It's safe because the task is inactive. */
	do_set_cpus_allowed(p, cpumask_of(cpu));
	p->flags |= PF_THREAD_BOUND;
}
EXPORT_SYMBOL(kthread_bind);

/**
 * kthread_bind_node - bind a just-created kthread to the CPUs of a node.
 * @p: thread created by kthread_create().
 * @nid: node (might not be online, must be possible) for @k to run on.
 *
 * Description: This function is equivalent to set_cpus_allowed(),
 * except that @nid doesn't need to be online, and the thread must be
 * stopped (i.e., just returned from kthread_create()).
 */
void kthread_bind_node(struct task_struct *p, int nid)
{
	/* Must have done schedule() in kthread() before we set_task_cpu */
	if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
		WARN_ON(1);
		return;
	}

	/* It's safe because the task is inactive. */
	do_set_cpus_allowed(p, cpumask_of_node(nid));
	p->flags |= PF_THREAD_BOUND;
}
EXPORT_SYMBOL(kthread_bind_node);

The above should explain why it's not _GPL right now. As far as
AutoNUMA is concerned I can drop the EXPORT_SYMBOL completely and not
allow modules to call this. In fact I could have coded this inside
autonuma too.

I can change it to _GPL, drop the EXPORT_SYMBOL or move it inside the
autonuma code, let me know what you prefer. If I hear nothing I won't
make changes.
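
For completeness, a minimal sketch of how a per-node daemon could use the
helper (hypothetical thread function and names, not the actual
knuma_migrated code; only kthread_bind_node() above and the standard
kthread API are assumed):

#include <linux/err.h>
#include <linux/jiffies.h>
#include <linux/kthread.h>
#include <linux/sched.h>

/* hypothetical per-node daemon body, stands in for the real knuma_migratedN */
static int toy_migrated(void *data)
{
	while (!kthread_should_stop())
		schedule_timeout_interruptible(HZ);
	return 0;
}

static int start_toy_daemon(int nid)
{
	struct task_struct *p;

	p = kthread_create_on_node(toy_migrated, NULL, nid,
				   "toy_migrated%d", nid);
	if (IS_ERR(p))
		return PTR_ERR(p);
	kthread_bind_node(p, nid);	/* allow only the cpus of @nid */
	wake_up_process(p);
	return 0;
}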

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 09/40] autonuma: introduce kthread_bind_node()
@ 2012-07-04 23:14       ` Andrea Arcangeli
  0 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-07-04 23:14 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Sat, Jun 30, 2012 at 12:50:14AM -0400, Konrad Rzeszutek Wilk wrote:
> On Thu, Jun 28, 2012 at 02:55:49PM +0200, Andrea Arcangeli wrote:
> > This function makes it easy to bind the per-node knuma_migrated
> > threads to their respective NUMA nodes. Those threads take memory from
> > the other nodes (in round robin with a incoming queue for each remote
> > node) and they move that memory to their local node.
> > 
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > ---
> >  include/linux/kthread.h |    1 +
> >  include/linux/sched.h   |    2 +-
> >  kernel/kthread.c        |   23 +++++++++++++++++++++++
> >  3 files changed, 25 insertions(+), 1 deletions(-)
> > 
> > diff --git a/include/linux/kthread.h b/include/linux/kthread.h
> > index 0714b24..e733f97 100644
> > --- a/include/linux/kthread.h
> > +++ b/include/linux/kthread.h
> > @@ -33,6 +33,7 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
> >  })
> >  
> >  void kthread_bind(struct task_struct *k, unsigned int cpu);
> > +void kthread_bind_node(struct task_struct *p, int nid);
> >  int kthread_stop(struct task_struct *k);
> >  int kthread_should_stop(void);
> >  bool kthread_freezable_should_stop(bool *was_frozen);
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 4059c0f..699324c 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1792,7 +1792,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
> >  #define PF_SWAPWRITE	0x00800000	/* Allowed to write to swap */
> >  #define PF_SPREAD_PAGE	0x01000000	/* Spread page cache over cpuset */
> >  #define PF_SPREAD_SLAB	0x02000000	/* Spread some slab caches over cpuset */
> > -#define PF_THREAD_BOUND	0x04000000	/* Thread bound to specific cpu */
> > +#define PF_THREAD_BOUND	0x04000000	/* Thread bound to specific cpus */
> >  #define PF_MCE_EARLY    0x08000000      /* Early kill for mce process policy */
> >  #define PF_MEMPOLICY	0x10000000	/* Non-default NUMA mempolicy */
> >  #define PF_MUTEX_TESTER	0x20000000	/* Thread belongs to the rt mutex tester */
> > diff --git a/kernel/kthread.c b/kernel/kthread.c
> > index 3d3de63..48b36f9 100644
> > --- a/kernel/kthread.c
> > +++ b/kernel/kthread.c
> > @@ -234,6 +234,29 @@ void kthread_bind(struct task_struct *p, unsigned int cpu)
> >  EXPORT_SYMBOL(kthread_bind);
> >  
> >  /**
> > + * kthread_bind_node - bind a just-created kthread to the CPUs of a node.
> > + * @p: thread created by kthread_create().
> > + * @nid: node (might not be online, must be possible) for @k to run on.
> > + *
> > + * Description: This function is equivalent to set_cpus_allowed(),
> > + * except that @nid doesn't need to be online, and the thread must be
> > + * stopped (i.e., just returned from kthread_create()).
> > + */
> > +void kthread_bind_node(struct task_struct *p, int nid)
> > +{
> > +	/* Must have done schedule() in kthread() before we set_task_cpu */
> > +	if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
> > +		WARN_ON(1);
> > +		return;
> > +	}
> > +
> > +	/* It's safe because the task is inactive. */
> > +	do_set_cpus_allowed(p, cpumask_of_node(nid));
> > +	p->flags |= PF_THREAD_BOUND;
> > +}
> > +EXPORT_SYMBOL(kthread_bind_node);
> 
> _GPL ?

/**
 * kthread_bind - bind a just-created kthread to a cpu.
 * @p: thread created by kthread_create().
 * @cpu: cpu (might not be online, must be possible) for @k to run on.
 *
 * Description: This function is equivalent to set_cpus_allowed(),
 * except that @cpu doesn't need to be online, and the thread must be
 * stopped (i.e., just returned from kthread_create()).
 */
void kthread_bind(struct task_struct *p, unsigned int cpu)
{
	/* Must have done schedule() in kthread() before we set_task_cpu */
	if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
		WARN_ON(1);
		return;
	}

	/* It's safe because the task is inactive. */
	do_set_cpus_allowed(p, cpumask_of(cpu));
	p->flags |= PF_THREAD_BOUND;
}
EXPORT_SYMBOL(kthread_bind);

/**
 * kthread_bind_node - bind a just-created kthread to the CPUs of a node.
 * @p: thread created by kthread_create().
 * @nid: node (might not be online, must be possible) for @k to run on.
 *
 * Description: This function is equivalent to set_cpus_allowed(),
 * except that @nid doesn't need to be online, and the thread must be
 * stopped (i.e., just returned from kthread_create()).
 */
void kthread_bind_node(struct task_struct *p, int nid)
{
	/* Must have done schedule() in kthread() before we set_task_cpu */
	if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
		WARN_ON(1);
		return;
	}

	/* It's safe because the task is inactive. */
	do_set_cpus_allowed(p, cpumask_of_node(nid));
	p->flags |= PF_THREAD_BOUND;
}
EXPORT_SYMBOL(kthread_bind_node);

The above should explain why it's not _GPL right now. As far as
AutoNUMA is concerned I can drop the EXPORT_SYMBOL completely and not
allow modules to call this. In fact I could have coded this inside
autonuma too.

I can change it to _GPL, drop the EXPORT_SYMBOL or move it inside the
autonuma code, let me know what you prefer. If I hear nothing I won't
make changes.

Thanks,
Andrea


^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 10/40] autonuma: mm_autonuma and sched_autonuma data structures
  2012-06-29 17:45     ` Rik van Riel
@ 2012-07-04 23:16       ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-07-04 23:16 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Fri, Jun 29, 2012 at 01:45:21PM -0400, Rik van Riel wrote:
> On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> > Define the two data structures that collect the per-process (in the
> > mm) and per-thread (in the task_struct) statistical information that
> > are the input of the CPU follow memory algorithms in the NUMA
> > scheduler.
> 
> I just noticed the subject of this email is misleading, too.
> 
> This patch does not introduce sched_autonuma at all.
> 
> *searches around for the patch that does*

I corrected the subject and the commit message leftovers last
weekend. I'll send a note when the inline docs are improved.

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 10/40] autonuma: mm_autonuma and sched_autonuma data structures
@ 2012-07-04 23:16       ` Andrea Arcangeli
  0 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-07-04 23:16 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Fri, Jun 29, 2012 at 01:45:21PM -0400, Rik van Riel wrote:
> On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> > Define the two data structures that collect the per-process (in the
> > mm) and per-thread (in the task_struct) statistical information that
> > are the input of the CPU follow memory algorithms in the NUMA
> > scheduler.
> 
> I just noticed the subject of this email is misleading, too.
> 
> This patch does not introduce sched_autonuma at all.
> 
> *searches around for the patch that does*

I corrected the subject and the commit message leftovers last
weekend. I'll send a note when the inline docs are improved.


^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 11/40] autonuma: define the autonuma flags
  2012-06-30  5:01     ` Konrad Rzeszutek Wilk
@ 2012-07-04 23:45       ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-07-04 23:45 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Sat, Jun 30, 2012 at 01:01:44AM -0400, Konrad Rzeszutek Wilk wrote:
> On Thu, Jun 28, 2012 at 02:55:51PM +0200, Andrea Arcangeli wrote:
> > +extern unsigned long autonuma_flags;
> 
> I could not find the this variable in the preceding patches?
>
> Which patch actually uses it?

Well, the code below also uses that variable, but the variable is
defined in mm/autonuma.c in a later patch.

The aa.git/autonuma branch is constantly tracked by Wu's C=1 checker
with the allconfig tester, and every commit is built independently, so
if I add even a minuscule error that breaks inter-bisectability, or a
C=1 warning in the header inclusion order, I get an automated email.
Any patch reordering needs to take those things into account, while
also trying to keep the size of the patches small (I can't put all new
files in a single patch).

> Also, is there a way to force the AutoNUMA framework
> from not initializing at all? Hold that thought, it probably
> is in some of the other patches.

That is what the AUTONUMA_IMPOSSIBLE_FLAG is about: that flag is set
if you boot with "noautonuma" as a kernel parameter. (Soon that flag
will be renamed to AUTONUMA_POSSIBLE_FLAG and it'll work in the same
way but with the opposite polarity.)

The 12 bytes per page overhead goes away, the
knuma_scand/knuma_migratedN kernel daemons are not started, the
task_autonuma/mm_autonuma structures are not allocated, and the kernel
boots as if the hardware were not NUMA (the only loss is a pointer in
the mm struct... the pointer in the task struct is zero cost, stack
overflow permitting :).
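
Just to sketch how such a flag could be wired up (apart from
AUTONUMA_IMPOSSIBLE_FLAG and autonuma_flags, the names below are made
up, and the real mm/autonuma.c wiring may differ):

#include <linux/bitops.h>
#include <linux/cache.h>
#include <linux/init.h>

/* illustrative bit numbers only; the real ones live in autonuma_flags.h */
enum toy_autonuma_flag {
	TOY_AUTONUMA_FLAG,
	TOY_AUTONUMA_IMPOSSIBLE_FLAG,	/* set by "noautonuma" per the above */
};

unsigned long autonuma_flags __read_mostly;

static int __init noautonuma_setup(char *str)
{
	set_bit(TOY_AUTONUMA_IMPOSSIBLE_FLAG, &autonuma_flags);
	return 1;
}
__setup("noautonuma", noautonuma_setup);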

> 
> > +
> > +static inline bool autonuma_enabled(void)
> > +{
> > +	return !!test_bit(AUTONUMA_FLAG, &autonuma_flags);
> > +}
> > +
> > +static inline bool autonuma_debug(void)
> > +{
> > +	return !!test_bit(AUTONUMA_DEBUG_FLAG, &autonuma_flags);
> > +}
> > +
> > +static inline bool autonuma_sched_load_balance_strict(void)
> > +{
> > +	return !!test_bit(AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG,
> > +			  &autonuma_flags);
> > +}
> > +
> > +static inline bool autonuma_sched_clone_reset(void)
> > +{
> > +	return !!test_bit(AUTONUMA_SCHED_CLONE_RESET_FLAG,
> > +			  &autonuma_flags);
> > +}
> > +
> > +static inline bool autonuma_sched_fork_reset(void)
> > +{
> > +	return !!test_bit(AUTONUMA_SCHED_FORK_RESET_FLAG,
> > +			  &autonuma_flags);
> > +}
> > +
> > +static inline bool autonuma_scan_pmd(void)
> > +{
> > +	return !!test_bit(AUTONUMA_SCAN_PMD_FLAG, &autonuma_flags);
> > +}
> > +
> > +static inline bool autonuma_scan_use_working_set(void)
> > +{
> > +	return !!test_bit(AUTONUMA_SCAN_USE_WORKING_SET_FLAG,
> > +			  &autonuma_flags);
> > +}
> > +
> > +static inline bool autonuma_migrate_defer(void)
> > +{
> > +	return !!test_bit(AUTONUMA_MIGRATE_DEFER_FLAG, &autonuma_flags);
> > +}
> > +
> > +#endif /* _LINUX_AUTONUMA_FLAGS_H */
> > 

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 11/40] autonuma: define the autonuma flags
@ 2012-07-04 23:45       ` Andrea Arcangeli
  0 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-07-04 23:45 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Sat, Jun 30, 2012 at 01:01:44AM -0400, Konrad Rzeszutek Wilk wrote:
> On Thu, Jun 28, 2012 at 02:55:51PM +0200, Andrea Arcangeli wrote:
> > +extern unsigned long autonuma_flags;
> 
> I could not find the this variable in the preceding patches?
>
> Which patch actually uses it?

Well, the code below also uses that variable, but the variable is
defined in mm/autonuma.c in a later patch.

The aa.git/autonuma branch is constantly tracked by Wu's C=1 checker
with the allconfig tester, and every commit is built independently, so
if I add even a minuscule error that breaks inter-bisectability, or a
C=1 warning in the header inclusion order, I get an automated email.
Any patch reordering needs to take those things into account, while
also trying to keep the size of the patches small (I can't put all new
files in a single patch).

> Also, is there a way to force the AutoNUMA framework
> from not initializing at all? Hold that thought, it probably
> is in some of the other patches.

That is what the AUTONUMA_IMPOSSIBLE_FLAG is about: that flag is set
if you boot with "noautonuma" as a kernel parameter. (Soon that flag
will be renamed to AUTONUMA_POSSIBLE_FLAG and it'll work in the same
way but with the opposite polarity.)

The 12 bytes per page overhead goes away, the
knuma_scand/knuma_migratedN kernel daemons are not started, the
task_autonuma/mm_autonuma structures are not allocated, and the kernel
boots as if the hardware were not NUMA (the only loss is a pointer in
the mm struct... the pointer in the task struct is zero cost, stack
overflow permitting :).

> 
> > +
> > +static inline bool autonuma_enabled(void)
> > +{
> > +	return !!test_bit(AUTONUMA_FLAG, &autonuma_flags);
> > +}
> > +
> > +static inline bool autonuma_debug(void)
> > +{
> > +	return !!test_bit(AUTONUMA_DEBUG_FLAG, &autonuma_flags);
> > +}
> > +
> > +static inline bool autonuma_sched_load_balance_strict(void)
> > +{
> > +	return !!test_bit(AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG,
> > +			  &autonuma_flags);
> > +}
> > +
> > +static inline bool autonuma_sched_clone_reset(void)
> > +{
> > +	return !!test_bit(AUTONUMA_SCHED_CLONE_RESET_FLAG,
> > +			  &autonuma_flags);
> > +}
> > +
> > +static inline bool autonuma_sched_fork_reset(void)
> > +{
> > +	return !!test_bit(AUTONUMA_SCHED_FORK_RESET_FLAG,
> > +			  &autonuma_flags);
> > +}
> > +
> > +static inline bool autonuma_scan_pmd(void)
> > +{
> > +	return !!test_bit(AUTONUMA_SCAN_PMD_FLAG, &autonuma_flags);
> > +}
> > +
> > +static inline bool autonuma_scan_use_working_set(void)
> > +{
> > +	return !!test_bit(AUTONUMA_SCAN_USE_WORKING_SET_FLAG,
> > +			  &autonuma_flags);
> > +}
> > +
> > +static inline bool autonuma_migrate_defer(void)
> > +{
> > +	return !!test_bit(AUTONUMA_MIGRATE_DEFER_FLAG, &autonuma_flags);
> > +}
> > +
> > +#endif /* _LINUX_AUTONUMA_FLAGS_H */
> > 


^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 09/40] autonuma: introduce kthread_bind_node()
  2012-07-04 23:14       ` Andrea Arcangeli
@ 2012-07-05 12:04         ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 327+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-07-05 12:04 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

> /**
>  * kthread_bind_node - bind a just-created kthread to the CPUs of a node.
>  * @p: thread created by kthread_create().
>  * @nid: node (might not be online, must be possible) for @k to run on.
>  *
>  * Description: This function is equivalent to set_cpus_allowed(),
>  * except that @nid doesn't need to be online, and the thread must be
>  * stopped (i.e., just returned from kthread_create()).
>  */
> void kthread_bind_node(struct task_struct *p, int nid)
> {
> 	/* Must have done schedule() in kthread() before we set_task_cpu */
> 	if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
> 		WARN_ON(1);
> 		return;
> 	}
> 
> 	/* It's safe because the task is inactive. */
> 	do_set_cpus_allowed(p, cpumask_of_node(nid));
> 	p->flags |= PF_THREAD_BOUND;
> }
> EXPORT_SYMBOL(kthread_bind_node);
> 
> The above should explain why it's not _GPL right now. As far as
> AutoNUMA is concerned I can drop the EXPORT_SYMBOL completely and not
> allow modules to call this. In fact I could have coded this inside
> autonuma too.

Ok. How about dropping it, and then exporting it if it's needed for
modules.
> 
> I can change it to _GPL, drop the EXPORT_SYMBOL or move it inside the
> autonuma code, let me know what you prefer. If I hear nothing I won't
> make changes.
> 
> Thanks,
> Andrea
> 

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 09/40] autonuma: introduce kthread_bind_node()
@ 2012-07-05 12:04         ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 327+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-07-05 12:04 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

> /**
>  * kthread_bind_node - bind a just-created kthread to the CPUs of a node.
>  * @p: thread created by kthread_create().
>  * @nid: node (might not be online, must be possible) for @k to run on.
>  *
>  * Description: This function is equivalent to set_cpus_allowed(),
>  * except that @nid doesn't need to be online, and the thread must be
>  * stopped (i.e., just returned from kthread_create()).
>  */
> void kthread_bind_node(struct task_struct *p, int nid)
> {
> 	/* Must have done schedule() in kthread() before we set_task_cpu */
> 	if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
> 		WARN_ON(1);
> 		return;
> 	}
> 
> 	/* It's safe because the task is inactive. */
> 	do_set_cpus_allowed(p, cpumask_of_node(nid));
> 	p->flags |= PF_THREAD_BOUND;
> }
> EXPORT_SYMBOL(kthread_bind_node);
> 
> The above should explain why it's not _GPL right now. As far as
> AutoNUMA is concerned I can drop the EXPORT_SYMBOL completely and not
> allow modules to call this. In fact I could have coded this inside
> autonuma too.

Ok. How about dropping it, and then exporting it if it's needed for
modules.
> 
> I can change it to _GPL, drop the EXPORT_SYMBOL or move it inside the
> autonuma code, let me know what you prefer. If I hear nothing I won't
> make changes.
> 
> Thanks,
> Andrea
> 


^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 09/40] autonuma: introduce kthread_bind_node()
  2012-07-04 23:14       ` Andrea Arcangeli
@ 2012-07-05 12:18         ` Peter Zijlstra
  -1 siblings, 0 replies; 327+ messages in thread
From: Peter Zijlstra @ 2012-07-05 12:18 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Konrad Rzeszutek Wilk, linux-kernel, linux-mm, Hillf Danton,
	Dan Smith, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Rik van Riel, Johannes Weiner, Srivatsa Vaddagiri,
	Christoph Lameter, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

On Thu, 2012-07-05 at 01:14 +0200, Andrea Arcangeli wrote:
> I can change it to _GPL, drop the EXPORT_SYMBOL or move it inside the
> autonuma code, let me know what you prefer. If I hear nothing I won't
> make changes. 

If I find even a single instance of PF_THREAD_BOUND in your next posting
I'll simply not look at it at all.




^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 09/40] autonuma: introduce kthread_bind_node()
@ 2012-07-05 12:18         ` Peter Zijlstra
  0 siblings, 0 replies; 327+ messages in thread
From: Peter Zijlstra @ 2012-07-05 12:18 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Konrad Rzeszutek Wilk, linux-kernel, linux-mm, Hillf Danton,
	Dan Smith, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Rik van Riel, Johannes Weiner, Srivatsa Vaddagiri,
	Christoph Lameter, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

On Thu, 2012-07-05 at 01:14 +0200, Andrea Arcangeli wrote:
> I can change it to _GPL, drop the EXPORT_SYMBOL or move it inside the
> autonuma code, let me know what you prefer. If I hear nothing I won't
> make changes. 

If I find even a single instance of PF_THREAD_BOUND in your next posting
I'll simply not look at it at all.




^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 09/40] autonuma: introduce kthread_bind_node()
  2012-07-05 12:18         ` Peter Zijlstra
@ 2012-07-05 12:21           ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-07-05 12:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Konrad Rzeszutek Wilk, linux-kernel, linux-mm, Hillf Danton,
	Dan Smith, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Rik van Riel, Johannes Weiner, Srivatsa Vaddagiri,
	Christoph Lameter, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

On Thu, Jul 05, 2012 at 02:18:30PM +0200, Peter Zijlstra wrote:
> On Thu, 2012-07-05 at 01:14 +0200, Andrea Arcangeli wrote:
> > I can change it to _GPL, drop the EXPORT_SYMBOL or move it inside the
> > autonuma code, let me know what you prefer. If I hear nothing I won't
> > make changes. 
> 
> If I find even a single instance of PF_THREAD_BOUND in your next posting
> I'll simply not look at it at all.

Thanks for the info, that's one more reason to keep it.

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 09/40] autonuma: introduce kthread_bind_node()
@ 2012-07-05 12:21           ` Andrea Arcangeli
  0 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-07-05 12:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Konrad Rzeszutek Wilk, linux-kernel, linux-mm, Hillf Danton,
	Dan Smith, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Rik van Riel, Johannes Weiner, Srivatsa Vaddagiri,
	Christoph Lameter, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

On Thu, Jul 05, 2012 at 02:18:30PM +0200, Peter Zijlstra wrote:
> On Thu, 2012-07-05 at 01:14 +0200, Andrea Arcangeli wrote:
> > I can change it to _GPL, drop the EXPORT_SYMBOL or move it inside the
> > autonuma code, let me know what you prefer. If I hear nothing I won't
> > make changes. 
> 
> If I find even a single instance of PF_THREAD_BOUND in your next posting
> I'll simply not look at it at all.

Thanks for the info, that's one more reason to keep it.


^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 09/40] autonuma: introduce kthread_bind_node()
  2012-07-05 12:04         ` Konrad Rzeszutek Wilk
@ 2012-07-05 12:28           ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-07-05 12:28 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Jul 05, 2012 at 08:04:13AM -0400, Konrad Rzeszutek Wilk wrote:
> Ok. How about dropping it and then if its needed for modules then
> export it out.

Ok.

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commit;h=be38c0751557b02fe5141ef078f1d1571932875e

Now let's just hope Peter holds to his promises.

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 09/40] autonuma: introduce kthread_bind_node()
@ 2012-07-05 12:28           ` Andrea Arcangeli
  0 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-07-05 12:28 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Thu, Jul 05, 2012 at 08:04:13AM -0400, Konrad Rzeszutek Wilk wrote:
> Ok. How about dropping it and then if its needed for modules then
> export it out.

Ok.

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commit;h=be38c0751557b02fe5141ef078f1d1571932875e

Now let's just hope Peter holds to his promises.


^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 09/40] autonuma: introduce kthread_bind_node()
  2012-06-29 16:58         ` Rik van Riel
@ 2012-07-05 13:09           ` Johannes Weiner
  -1 siblings, 0 replies; 327+ messages in thread
From: Johannes Weiner @ 2012-07-05 13:09 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, Hillf Danton,
	Dan Smith, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Fri, Jun 29, 2012 at 12:58:01PM -0400, Rik van Riel wrote:
> On 06/29/2012 12:38 PM, Andrea Arcangeli wrote:
> >On Fri, Jun 29, 2012 at 11:36:26AM -0400, Rik van Riel wrote:
> >>On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> >>
> >>>--- a/include/linux/sched.h
> >>>+++ b/include/linux/sched.h
> >>>@@ -1792,7 +1792,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
> >>>   #define PF_SWAPWRITE	0x00800000	/* Allowed to write to swap */
> >>>   #define PF_SPREAD_PAGE	0x01000000	/* Spread page cache over cpuset */
> >>>   #define PF_SPREAD_SLAB	0x02000000	/* Spread some slab caches over cpuset */
> >>>-#define PF_THREAD_BOUND	0x04000000	/* Thread bound to specific cpu */
> >>>+#define PF_THREAD_BOUND	0x04000000	/* Thread bound to specific cpus */
> >>>   #define PF_MCE_EARLY    0x08000000      /* Early kill for mce process policy */
> >>>   #define PF_MEMPOLICY	0x10000000	/* Non-default NUMA mempolicy */
> >>>   #define PF_MUTEX_TESTER	0x20000000	/* Thread belongs to the rt mutex tester */
> >>
> >>Changing the semantics of PF_THREAD_BOUND without so much as
> >>a comment in your changelog or buy-in from the scheduler
> >>maintainers is a big no-no.
> >>
> >>Is there any reason you even need PF_THREAD_BOUND in your
> >>kernel numa threads?
> >>
> >>I do not see much at all in the scheduler code that uses
> >>PF_THREAD_BOUND and it is not clear at all that your
> >>numa threads get any benefit from them...
> >>
> >>Why do you think you need it?
> 
> >This flag is only used to prevent userland to mess with the kernel CPU
> >binds of kernel threads. It is used to avoid the root user to shoot
> >itself in the foot.
> >
> >So far it has been used to prevent changing bindings to a single
> >CPU. I'm setting it also after making a multiple-cpu bind (all CPUs of
> >the node, instead of just 1 CPU).
> 
> Fair enough.  Looking at the scheduler code some more, I
> see that all PF_THREAD_BOUND seems to do is block userspace
> from changing a thread's CPU bindings.
> 
> Peter and Ingo, what is the special magic in PF_THREAD_BOUND
> that should make it only apply to kernel threads that are bound
> to a single CPU?

In the very first review iteration of AutoNUMA, Peter argued that the
scheduler people want to use this flag in other places where they rely
on this thing meaning a single cpu, not a group of them (check out the
cpumask test in debug_smp_processor_id() in lib/smp_processor_id.c).
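
For reference, that check looks roughly like this (paraphrased from
memory of lib/smp_processor_id.c of that era, so details may differ
slightly; the point is that the test only passes when the allowed mask
is exactly one CPU):

notrace unsigned int debug_smp_processor_id(void)
{
	int this_cpu = raw_smp_processor_id();

	if (likely(preempt_count()))
		goto out;

	if (irqs_disabled())
		goto out;

	/*
	 * A kernel thread bound to a single CPU may safely use
	 * smp_processor_id(); a node-wide mask would not pass this test.
	 */
	if (cpumask_equal(tsk_cpus_allowed(current), cpumask_of(this_cpu)))
		goto out;

	printk(KERN_ERR "BUG: using smp_processor_id() in preemptible code\n");
	dump_stack();
out:
	return this_cpu;
}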

He also argued that preventing root from rebinding the numa daemons is
not critical to this feature at all.  And I have to agree.

I certainly think this is NOT the change to make a stand on in this
patch set, seriously.  Not over a nice-to-have like this, which doesn't
really hurt to drop but does create contention.

Bringing in such a flag so that other daemons can use it can always be
a separate effort, and I don't think anything is really lost by
dropping the change from this series.

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-07-02 16:57                     ` Vaidyanathan Srinivasan
@ 2012-07-05 16:56                       ` Vaidyanathan Srinivasan
  2012-07-06 13:04                         ` Hillf Danton
  0 siblings, 1 reply; 327+ messages in thread
From: Vaidyanathan Srinivasan @ 2012-07-05 16:56 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, dlaor, Ingo Molnar, Hillf Danton,
	Andrea Arcangeli, linux-kernel, linux-mm, Dan Smith,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

* Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> [2012-07-02 22:27:15]:

> * Rik van Riel <riel@redhat.com> [2012-06-29 15:19:17]:
> 
> > On 06/29/2012 03:03 PM, Peter Zijlstra wrote:
> > >On Fri, 2012-06-29 at 20:57 +0200, Peter Zijlstra wrote:
> > >>On Fri, 2012-06-29 at 14:46 -0400, Rik van Riel wrote:
> > >>>
> > >>>I am not convinced all architectures that have CONFIG_NUMA
> > >>>need to be a requirement, since some of them (eg. Alpha)
> > >>>seem to be lacking a maintainer nowadays.
> > >>
> > >>Still, this NUMA balancing stuff is not a small tweak to load-balancing.
> > >>Its a very significant change is how you schedule. Having such great
> > >>differences over architectures isn't something I look forward to.
> > 
> > I am not too worried about the performance of architectures
> > that are essentially orphaned :)
> > 
> > >Also, Andrea keeps insisting arch support is trivial, so I don't see the
> > >problem.
> > 
> > Getting it implemented in one or two additional architectures
> > would be good, to get a template out there that can be used by
> > other architecture maintainers.
> 
> I am currently porting the framework over to powerpc.  I will share
> the initial patches in couple of days.

Here is the rough set of changes required to get the autonuma
framework working on powerpc.  This patch applies on top of the
autonuma19 branch.  It is still work-in-progress; my goal is to
evaluate the implementation overheads and benefits on the powerpc
architecture.

--Vaidy

commit ede91b0af6c56d0ef5d1b07d195d7b59c3a324d0
Author: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Date:   Thu Jul 5 20:48:43 2012 +0530

    Basic changes to port Andrea's autonuma patches to powerpc.
    
    * PMD flagging is not required on powerpc since large pages
      are tracked in ptes.
    * Yet to be tested with large pages
    * This is an initial patch that partially works
    * knuma_scand and numa hinting page faults work
    * Page migration is yet to be observed/verified
    
    Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>

diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index 2e0e411..279d283 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -33,10 +33,56 @@ static inline int pte_dirty(pte_t pte)		{ return pte_val(pte) & _PAGE_DIRTY; }
 static inline int pte_young(pte_t pte)		{ return pte_val(pte) & _PAGE_ACCESSED; }
 static inline int pte_file(pte_t pte)		{ return pte_val(pte) & _PAGE_FILE; }
 static inline int pte_special(pte_t pte)	{ return pte_val(pte) & _PAGE_SPECIAL; }
-static inline int pte_present(pte_t pte)	{ return pte_val(pte) & _PAGE_PRESENT; }
+static inline int pte_present(pte_t pte)	{ return pte_val(pte) &
+							(_PAGE_PRESENT|_PAGE_NUMA_PTE); }
 static inline int pte_none(pte_t pte)		{ return (pte_val(pte) & ~_PTE_NONE_MASK) == 0; }
 static inline pgprot_t pte_pgprot(pte_t pte)	{ return __pgprot(pte_val(pte) & PAGE_PROT_BITS); }
 
+#ifdef CONFIG_AUTONUMA
+static inline int pte_numa(pte_t pte)
+{
+       return (pte_val(pte) &
+               (_PAGE_NUMA_PTE|_PAGE_PRESENT)) == _PAGE_NUMA_PTE;
+}
+
+#endif
+
+static inline pte_t pte_mknotnuma(pte_t pte)
+{
+       pte_val(pte) &= ~_PAGE_NUMA_PTE;
+       pte_val(pte) |= (_PAGE_PRESENT|_PAGE_ACCESSED);
+
+       return pte;
+}
+
+static inline pte_t pte_mknuma(pte_t pte)
+{
+       pte_val(pte) |= _PAGE_NUMA_PTE;
+       pte_val(pte) &= ~_PAGE_PRESENT;
+       return pte;
+}
+
+static inline int pmd_numa(pmd_t pmd)
+{
+       /* PMD tracking not implemented */
+       return 0;
+}
+
+static inline pmd_t pmd_mknotnuma(pmd_t pmd)
+{
+	BUG();
+	return pmd;
+}
+
+static inline pmd_t pmd_mknuma(pmd_t pmd)
+{
+	BUG();
+	return pmd;
+}
+
+/* No pmd flags on powerpc */
+#define set_pmd_at(mm, addr, pmdp, pmd)  do { } while (0)
+
 /* Conversion functions: convert a page and protection to a page entry,
  * and a page entry and page directory to the page they refer to.
  *
diff --git a/arch/powerpc/include/asm/pte-hash64-64k.h b/arch/powerpc/include/asm/pte-hash64-64k.h
index 59247e8..f7e1468 100644
--- a/arch/powerpc/include/asm/pte-hash64-64k.h
+++ b/arch/powerpc/include/asm/pte-hash64-64k.h
@@ -7,6 +7,8 @@
 #define _PAGE_COMBO	0x10000000 /* this is a combo 4k page */
 #define _PAGE_4K_PFN	0x20000000 /* PFN is for a single 4k page */
 
+#define _PAGE_NUMA_PTE 0x40000000 /* Adjust PTE_RPN_SHIFT below */
+
 /* For 64K page, we don't have a separate _PAGE_HASHPTE bit. Instead,
  * we set that to be the whole sub-bits mask. The C code will only
  * test this, so a multi-bit mask will work. For combo pages, this
@@ -36,7 +38,7 @@
  * That gives us a max RPN of 34 bits, which means a max of 50 bits
  * of addressable physical space, or 46 bits for the special 4k PFNs.
  */
-#define PTE_RPN_SHIFT	(30)
+#define PTE_RPN_SHIFT	(31)
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index b6edbb3..45afa9a 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -22,6 +22,7 @@
 #include <linux/pfn.h>
 #include <linux/cpuset.h>
 #include <linux/node.h>
+#include <linux/page_autonuma.h>
 #include <asm/sparsemem.h>
 #include <asm/prom.h>
 #include <asm/smp.h>
@@ -1043,7 +1044,7 @@ void __init do_init_bootmem(void)
 		 * all reserved areas marked.
 		 */
 		NODE_DATA(nid) = careful_zallocation(nid,
-					sizeof(struct pglist_data),
+					autonuma_pglist_data_size(),
 					SMP_CACHE_BYTES, end_pfn);
 
   		dbg("node %d\n", nid);
diff --git a/mm/Kconfig b/mm/Kconfig
index 330dd51..6fb6f25 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -210,7 +210,7 @@ config MIGRATION
 config AUTONUMA
 	bool "Auto NUMA"
 	select MIGRATION
-	depends on NUMA && X86
+	depends on NUMA && (X86 || PPC64)
 	help
 	  Automatic NUMA CPU scheduling and memory migration.
 
diff --git a/mm/autonuma.c b/mm/autonuma.c
index 1873a7b..58fa785 100644
--- a/mm/autonuma.c
+++ b/mm/autonuma.c
@@ -26,7 +26,7 @@ unsigned long autonuma_flags __read_mostly =
 #ifdef CONFIG_AUTONUMA_DEFAULT_ENABLED
 	(1<<AUTONUMA_FLAG)|
 #endif
-	(1<<AUTONUMA_SCAN_PMD_FLAG);
+	(0<<AUTONUMA_SCAN_PMD_FLAG);
 
 static DEFINE_MUTEX(knumad_mm_mutex);
 
diff --git a/mm/page_autonuma.c b/mm/page_autonuma.c
index b629074..96c2e26 100644
--- a/mm/page_autonuma.c
+++ b/mm/page_autonuma.c
@@ -1,5 +1,6 @@
 #include <linux/mm.h>
 #include <linux/memory.h>
+#include <linux/vmalloc.h>
 #include <linux/autonuma.h>
 #include <linux/page_autonuma.h>
 #include <linux/bootmem.h>

^ permalink raw reply related	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-06-29 20:01             ` Nai Xia
  2012-06-29 20:44               ` Nai Xia
  2012-06-30  1:23               ` Andrea Arcangeli
@ 2012-07-05 18:07               ` Rik van Riel
  2012-07-05 22:59                 ` Andrea Arcangeli
  2012-07-06  1:00                 ` Nai Xia
  2 siblings, 2 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-05 18:07 UTC (permalink / raw)
  To: Nai Xia
  Cc: Peter Zijlstra, dlaor, Ingo Molnar, Hillf Danton,
	Andrea Arcangeli, linux-kernel, linux-mm, Dan Smith,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/29/2012 04:01 PM, Nai Xia wrote:

> Hey guys, Can I say NAK to these patches ?

Not necessarily the patches, but thinking about your
points some more, I thought of a much more serious
potential problem with Andrea's code.

> Now I aware that this sampling algorithm is completely broken, if we take
> a few seconds to see what it is trying to solve:

> Andrea's patch can only approximate the pages_accessed number in a
> time unit(scan interval),
> I don't think it can catch even 1% of  average_page_access_frequence
> on a busy workload.

It is much more "interesting" than that.

Once the first thread gets a NUMA pagefault on a
particular page, the page is made present in the
page tables and NO OTHER THREAD will get NUMA
page faults.

That means when trying to compare the weighting
of NUMA accesses between different threads in a
10 second interval, we only know THE FIRST FAULT.

We have no information on whether any other threads
tried to access the same page, because we do not
get faults more frequently.

Not only do we not get use frequency information,
we may not get the information on which threads use
which memory, at all.
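
To make the mechanism concrete, here is a minimal sketch of the
hinting fault path being described (the helper and field names are
illustrative, not verbatim from the series, though task_numa_fault[]
matches the statistics quoted further down):

/*
 * Whichever thread faults first on a pte_numa() pte gets the access
 * accounted to its own task_autonuma and makes the pte present again,
 * so until knuma_scand's next pass every other thread touches the page
 * without faulting at all.
 */
static void numa_hinting_fault_one(struct mm_struct *mm, unsigned long addr,
				   pte_t *ptep, struct page *page)
{
	struct task_autonuma *ta = current->task_autonuma;

	/* only the first faulting thread is accounted for this page */
	ta->task_numa_fault[page_to_nid(page)]++;
	ta->task_numa_fault_tot++;

	/* clear the NUMA bit: later accesses by any thread stay silent */
	set_pte_at(mm, addr, ptep, pte_mknotnuma(*ptep));
}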

Somehow Andrea's code still seems to work.

It would be very interesting to know why.

How much sense does the following code still make,
considering we may never get all the info on which
threads use which memory?

+			/*
+			 * Generate the w_nid/w_cpu_nid from the
+			 * pre-computed mm/task_numa_weight[] and
+			 * compute w_other using the w_m/w_t info
+			 * collected from the other process.
+			 */
+			if (mm == p->mm) {
+				if (w_t > w_t_t)
+					w_t_t = w_t;
+				w_other = w_t*AUTONUMA_BALANCE_SCALE/w_t_t;
+				w_nid = task_numa_weight[nid];
+				w_cpu_nid = task_numa_weight[cpu_nid];
+				w_type = W_TYPE_THREAD;

Andrea, what is the real reason your code works?

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 09/40] autonuma: introduce kthread_bind_node()
  2012-07-05 13:09           ` Johannes Weiner
@ 2012-07-05 18:33             ` Glauber Costa
  -1 siblings, 0 replies; 327+ messages in thread
From: Glauber Costa @ 2012-07-05 18:33 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Rik van Riel, Andrea Arcangeli, linux-kernel, linux-mm,
	Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Christoph Lameter,
	Alex Shi, Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk,
	Don Morris, Benjamin Herrenschmidt

On 07/05/2012 05:09 PM, Johannes Weiner wrote:
> In the very first review iteration of AutoNUMA, Peter argued that the
> scheduler people want to use this flag in other places where they rely
> on this thing meaning a single cpu, not a group of them (check out the
> cpumask test in debug_smp_processor_id() in lib/smp_processor_id.c).
> 
> He also argued that preventing root from rebinding the numa daemons is
> not critical to this feature at all.  And I have to agree.

Despite not being a scheduler expert, I'll have to side with that as
well. The thing I have in mind is: We have people whose usecase depend
on completely isolating cpus, with nothing but a specialized task
running on it. For those people, even the hard binding between cpu0 and
the timer interrupt is a big problem.

If you force a per-node binding of a kthread, you are basically saying
that those people are unable to isolate a node. Or else, that they have
to choose between that, and AutoNUMA. Both are suboptimal choices, to
say the least.
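
For context, the reason root cannot simply move such a kthread is the
affinity check tied to the flag; roughly this (a sketch from memory of
the set_cpus_allowed_ptr() check, not a verbatim 3.5 excerpt):

static int userspace_may_rebind(struct task_struct *p)
{
	/* even root's sched_setaffinity()/cpuset move is refused here */
	if ((p->flags & PF_THREAD_BOUND) && p != current)
		return -EINVAL;
	return 0;
}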


^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 09/40] autonuma: introduce kthread_bind_node()
  2012-07-05 18:33             ` Glauber Costa
@ 2012-07-05 20:07               ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-07-05 20:07 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Johannes Weiner, Rik van Riel, linux-kernel, linux-mm,
	Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Srivatsa Vaddagiri, Christoph Lameter,
	Alex Shi, Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk,
	Don Morris, Benjamin Herrenschmidt

Hi Johannes and Glauber,

> On 07/05/2012 05:09 PM, Johannes Weiner wrote:
> > In the very first review iteration of AutoNUMA, Peter argued that the
> > scheduler people want to use this flag in other places where they rely
> > on this thing meaning a single cpu, not a group of them (check out the
> > cpumask test in debug_smp_processor_id() in lib/smp_processor_id.c).

I already suggested adding a new bitflag for that future optimization,
when the future code is added, if that optimization is really worth it.
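
Something along these lines, where PF_NODE_BOUND is a purely
hypothetical name and the bit value is only a placeholder for whatever
bit happens to be free:

#define PF_THREAD_BOUND	0x04000000	/* Thread bound to a single cpu */
#define PF_NODE_BOUND	0x00000001	/* hypothetical: bound to one node's cpus */

kthread_bind_node() would then set the new flag, leaving
PF_THREAD_BOUND free to keep meaning "exactly one CPU" for any future
scheduler use.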

> > 
> > He also argued that preventing root from rebinding the numa daemons is
> > not critical to this feature at all.  And I have to agree.

It's not critical indeed; it was just a minor reliability improvement,
just like it is right now for all the other usages.

On Thu, Jul 05, 2012 at 10:33:35PM +0400, Glauber Costa wrote:
> Despite not being a scheduler expert, I'll have to side with that as
> well. The thing I have in mind is: We have people whose usecase depend
> on completely isolating cpus, with nothing but a specialized task
> running on it. For those people, even the hard binding between cpu0 and
> the timer interrupt is a big problem.
> 
> If you force a per-node binding of a kthread, you are basically saying
> that those people are unable to isolate a node. Or else, that they have
> to choose between that, and AutoNUMA. Both are suboptimal choices, to
> say the least.

I'm afraid AutoNUMA will be disabled in those nanosecond-latency
setups, so knuma_migrated won't ever have a chance to run in the
first place. I doubt they want to risk hitting a migration fault:
even if the migration is async with AutoNUMA, there's still a chance
for userland to hit a migration pte. Plus, whoever starts altering the
CPU bindings of kernel daemons and removing irqs from cpus is likely
to be using hard bindings on the userland app too, so they won't need
AutoNUMA.

However I agree we cannot tell for sure, so your argument of wanting
to isolate the CPU while leaving AutoNUMA enabled is the most
convincing one so far in favour of no longer setting the flag on
knuma_migrated.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-07-05 18:07               ` Rik van Riel
@ 2012-07-05 22:59                 ` Andrea Arcangeli
  2012-07-06  1:00                 ` Nai Xia
  1 sibling, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-07-05 22:59 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Nai Xia, Peter Zijlstra, dlaor, Ingo Molnar, Hillf Danton,
	linux-kernel, linux-mm, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Johannes Weiner, Srivatsa Vaddagiri,
	Christoph Lameter, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

Hi Rik,

On Thu, Jul 05, 2012 at 02:07:06PM -0400, Rik van Riel wrote:
> Once the first thread gets a NUMA pagefault on a
> particular page, the page is made present in the
> page tables and NO OTHER THREAD will get NUMA
> page faults.

Oh this is a great question, thanks for raising it.

> That means when trying to compare the weighting
> of NUMA accesses between different threads in a
> 10 second interval, we only know THE FIRST FAULT.

The task_autonuma statistics don't include only the first fault every
10sec: by the time we compare stuff in the scheduler we have a trail
of fault history that decays with an exponential backoff (it can be
tuned to decay slower; right now it goes down pretty aggressively with
a shift right).

static void cpu_follow_memory_pass(struct task_struct *p,
				   struct task_autonuma *task_autonuma,
				   unsigned long *task_numa_fault)
{
	int nid;
	for_each_node(nid)
		task_numa_fault[nid] >>= 1;
	task_autonuma->task_numa_fault_tot >>= 1;
}

So depending on which thread gets there first at every pass, that
information will still accumulate fine over multiple task_autonuma
structures if the amounts are large.

> We have no information on whether any other threads
> tried to access the same page, because we do not
> get faults more frequently.

If all threads access the same pages and there is false sharing, the
last_nid logic will eventually trigger. You're right that for maybe two
or three passes in a row it's the same thread getting to the same page
first, but eventually another thread will get there and the last_nid
will change. The more nodes and the more threads, the less likely it
is that the same thread gets there first.

The moment another thread on a different node accesses the page, any
pending migration is aborted if the page is still in the page_autonuma
LRU, and the autonuma_last_nid will then have to be reconfirmed before
we migrate it anywhere again.
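
A minimal sketch of that last_nid filter as just described (the
accessor and locking are simplified away, so this is not a verbatim
excerpt from the series):

static bool last_nid_allows_migration(struct page_autonuma *page_autonuma,
				      int this_nid)
{
	if (page_autonuma->autonuma_last_nid != this_nid) {
		/* not confirmed yet: remember the node, don't migrate */
		page_autonuma->autonuma_last_nid = this_nid;
		return false;
	}
	/* two consecutive hinting faults from the same node: migrate */
	return true;
}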

> Not only do we not get use frequency information,
> we may not get the information on which threads use
> which memory, at all.
> 
> Somehow Andrea's code still seems to work.

It works when the process fits in the node because we use the
mm_autonuma statistics when comparing the percentage of memory
utilization per node (the so-called w_other/w_nid/w_cpu_nid) of
threads belonging to different processes. This alone solves all false
sharing if the process fits in the node, so the above issue becomes
irrelevant (we already converged without using task_autonuma).
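
In other words each weight is simply the share of the recorded NUMA
activity (faults for task_autonuma, tracked pages for mm_autonuma)
sitting on a given node, scaled by AUTONUMA_BALANCE_SCALE; a
simplified sketch, not a verbatim excerpt:

/* returns 0..AUTONUMA_BALANCE_SCALE, i.e. 0%..100% on node "nid" */
static unsigned long numa_weight(unsigned long *numa_fault,
				 unsigned long numa_fault_tot, int nid)
{
	if (!numa_fault_tot)
		return 0;
	return numa_fault[nid] * AUTONUMA_BALANCE_SCALE / numa_fault_tot;
}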

Now, if the process doesn't fit in the node and there is false
sharing, the false sharing will be accounted with a smaller factor,
and it will be accounted for at its original memory location thanks to
the last_nid logic. The memory will not be migrated, because of the
last_nid logic (statistically speaking).

Some spillover will definitely materialize, but it won't be significant
as long as the NUMA thrashing is not enormous. If the NUMA thrashing is
unlimited, well, that workload is impossible to optimize, it's
impossible to converge anywhere, and the best would be to do
MADV_INTERLEAVE.

But note that we're only talking about memory with page_mapcount=1
here; shared memory will never generate a single migration spillover
or numa hinting page fault.

> How much sense does the following code still make,
> considering we may never get all the info on which
> threads use which memory?

It is required to handle the case of threads that have local memory
but don't all fit in a single node. That is the only case involving
more threads than CPUs in a node that we can perfectly converge.

This scenario is handled optimally thanks to the mm == p->mm code
below.

There can be false sharing too, as long as there is also some local
memory to converge on (it may be impossible to converge on the falsely
shared regions even if we accounted them more aggressively).

I don't exclude that the reduced accounting of falsely shared memory
you are asking about may actually be beneficial. The more threads are
involved in the false sharing, the more the accounting of the falsely
shared regions will be reduced, and that may help to converge without
mistakes. The more threads are involved in the false sharing, the more
likely it is impossible to converge on the falsely shared memory.

Last but not least, this is what the hardware gives us; it looks like
good enough info to me, but I'm just trying to live with the only
information we can collect from the hardware efficiently.

> 
> +			/*
> +			 * Generate the w_nid/w_cpu_nid from the
> +			 * pre-computed mm/task_numa_weight[] and
> +			 * compute w_other using the w_m/w_t info
> +			 * collected from the other process.
> +			 */
> +			if (mm == p->mm) {
> +				if (w_t > w_t_t)
> +					w_t_t = w_t;
> +				w_other = w_t*AUTONUMA_BALANCE_SCALE/w_t_t;
> +				w_nid = task_numa_weight[nid];
> +				w_cpu_nid = task_numa_weight[cpu_nid];
> +				w_type = W_TYPE_THREAD;
> 
> Andrea, what is the real reason your code works?

I tried to explain it above, but it's getting too long again and I
wouldn't know which part to drop. If it's too messy, ignore it and
I'll try again later.

PS. This stuff isn't set in stone. I'm not saying this is the best
data collection or the best way to compute the data; I believe it's
closer to the absolute minimum amount of info and the minimum
computation on the data required to perform like the hard bindings in
the majority of workloads.

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-07-05 18:07               ` Rik van Riel
  2012-07-05 22:59                 ` Andrea Arcangeli
@ 2012-07-06  1:00                 ` Nai Xia
  1 sibling, 0 replies; 327+ messages in thread
From: Nai Xia @ 2012-07-06  1:00 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, dlaor, Ingo Molnar, Hillf Danton,
	Andrea Arcangeli, linux-kernel, linux-mm, Dan Smith,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt



On 2012-07-06 02:07, Rik van Riel wrote:
> On 06/29/2012 04:01 PM, Nai Xia wrote:
>
>> Hey guys, Can I say NAK to these patches ?
>
> Not necessarily the patches, but thinking about your
> points some more, I thought of a much more serious
> potential problem with Andrea's code.
>
>> Now I aware that this sampling algorithm is completely broken, if we take
>> a few seconds to see what it is trying to solve:
>
>> Andrea's patch can only approximate the pages_accessed number in a
>> time unit(scan interval),
>> I don't think it can catch even 1% of average_page_access_frequence
>> on a busy workload.
>
> It is much more "interesting" than that.
>
> Once the first thread gets a NUMA pagefault on a
> particular page, the page is made present in the
> page tables and NO OTHER THREAD will get NUMA
> page faults.
>
> That means when trying to compare the weighting
> of NUMA accesses between different threads in a
> 10 second interval, we only know THE FIRST FAULT.
>
> We have no information on whether any other threads
> tried to access the same page, because we do not
> get faults more frequently.
>
> Not only do we not get use frequency information,
> we may not get the information on which threads use
> which memory, at all.
>
> Somehow Andrea's code still seems to work.

On this point alone, I agree with Andrea's reasoning:
1. This information gets averaged into the noise.
2. If a thread statistically gets more faults than others,
then it may deserve to be biased towards.

Note, I only mean the reasoning; I don't have enough
confidence that Andrea's code is really working like
this, since I didn't do micro benchmarks on this part
of the code.

>
> It would be very interesting to know why.

Note, my personal experience tells me that
sometimes you write a complex system and it works
like a charm. And later you cut out 30% of its
code and it still works like a charm.

Sometimes a part of a system is just not that
relevant to the output of the whole benchmark,
and this fact may make it seem to have good
resistance to false negatives/positives.
It's time to look inside with benchmarks, IMO.

Again, I have no intention of, and no benefit in,
disabling this algorithm. I am only curious
about the truth. I hope nobody will get offended.


Thanks,

Nai

>
> How much sense does the following code still make,
> considering we may never get all the info on which
> threads use which memory?
>
> +			/*
> +			 * Generate the w_nid/w_cpu_nid from the
> +			 * pre-computed mm/task_numa_weight[] and
> +			 * compute w_other using the w_m/w_t info
> +			 * collected from the other process.
> +			 */
> +			if (mm == p->mm) {
> +				if (w_t > w_t_t)
> +					w_t_t = w_t;
> +				w_other = w_t*AUTONUMA_BALANCE_SCALE/w_t_t;
> +				w_nid = task_numa_weight[nid];
> +				w_cpu_nid = task_numa_weight[cpu_nid];
> +				w_type = W_TYPE_THREAD;
>
> Andrea, what is the real reason your code works?

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-07-05 16:56                       ` Vaidyanathan Srinivasan
@ 2012-07-06 13:04                         ` Hillf Danton
  2012-07-06 18:38                           ` Vaidyanathan Srinivasan
  0 siblings, 1 reply; 327+ messages in thread
From: Hillf Danton @ 2012-07-06 13:04 UTC (permalink / raw)
  To: svaidy
  Cc: Rik van Riel, Peter Zijlstra, dlaor, Ingo Molnar,
	Andrea Arcangeli, linux-kernel, linux-mm, Dan Smith,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Hi Vaidy,

On Fri, Jul 6, 2012 at 12:56 AM, Vaidyanathan Srinivasan
<svaidy@linux.vnet.ibm.com> wrote:
> --- a/mm/autonuma.c
> +++ b/mm/autonuma.c
> @@ -26,7 +26,7 @@ unsigned long autonuma_flags __read_mostly =
>  #ifdef CONFIG_AUTONUMA_DEFAULT_ENABLED
>         (1<<AUTONUMA_FLAG)|
>  #endif
> -       (1<<AUTONUMA_SCAN_PMD_FLAG);
> +       (0<<AUTONUMA_SCAN_PMD_FLAG);
>

Let X86 scan pmd by default, agree?

Good Weekend
Hillf

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-07-06 13:04                         ` Hillf Danton
@ 2012-07-06 18:38                           ` Vaidyanathan Srinivasan
  2012-07-12 13:12                             ` Andrea Arcangeli
  0 siblings, 1 reply; 327+ messages in thread
From: Vaidyanathan Srinivasan @ 2012-07-06 18:38 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Rik van Riel, Peter Zijlstra, dlaor, Ingo Molnar,
	Andrea Arcangeli, linux-kernel, linux-mm, Dan Smith,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

* Hillf Danton <dhillf@gmail.com> [2012-07-06 21:04:56]:

> Hi Vaidy,
> 
> On Fri, Jul 6, 2012 at 12:56 AM, Vaidyanathan Srinivasan
> <svaidy@linux.vnet.ibm.com> wrote:
> > --- a/mm/autonuma.c
> > +++ b/mm/autonuma.c
> > @@ -26,7 +26,7 @@ unsigned long autonuma_flags __read_mostly =
> >  #ifdef CONFIG_AUTONUMA_DEFAULT_ENABLED
> >         (1<<AUTONUMA_FLAG)|
> >  #endif
> > -       (1<<AUTONUMA_SCAN_PMD_FLAG);
> > +       (0<<AUTONUMA_SCAN_PMD_FLAG);
> >
> 
> Let X86 scan pmd by default, agree?

Sure, yes.  This patch just lists the changes required to get the
framework running on powerpc so that we know the location of code
changes.

We will need arch-specific default flags and leave this ON for x86.
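
One possible shape for that, as a sketch only (the #ifdef split below
is illustrative, not what ended up being committed):

/* let the architecture pick whether PMD scanning is on by default */
#ifdef CONFIG_X86
#define AUTONUMA_SCAN_PMD_DEFAULT	(1 << AUTONUMA_SCAN_PMD_FLAG)
#else
#define AUTONUMA_SCAN_PMD_DEFAULT	0
#endif

unsigned long autonuma_flags __read_mostly =
#ifdef CONFIG_AUTONUMA_DEFAULT_ENABLED
	(1 << AUTONUMA_FLAG) |
#endif
	AUTONUMA_SCAN_PMD_DEFAULT;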

--Vaidy

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 00/40] AutoNUMA19
  2012-06-28 12:55 ` Andrea Arcangeli
@ 2012-07-09 15:40   ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-09 15:40 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:

> o reduced the page_autonuma memory overhead to from 32 to 12 bytes per page

This bit still worries me a little.

Between the compact list management code, the migrate
daemon, the list locking and page queuing and unqueuing
code, we are looking at 1000-2000 lines of code.

I know your reasoning why asynchronous migration done
by a kernel thread should be better than migrate-on-fault,
but there is no actual evidence that it is in fact true.

Would it be an idea to simplify the code for now, and
add the asynchronous migration later on, if there are
compelling benchmark results that show it to be useful?

The amount of source code involved is large enough that
a justification like that would be useful, IMHO...

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
  2012-07-06 18:38                           ` Vaidyanathan Srinivasan
@ 2012-07-12 13:12                             ` Andrea Arcangeli
  0 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-07-12 13:12 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan
  Cc: Hillf Danton, Rik van Riel, Peter Zijlstra, dlaor, Ingo Molnar,
	linux-kernel, linux-mm, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Johannes Weiner, Srivatsa Vaddagiri,
	Christoph Lameter, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

On Sat, Jul 07, 2012 at 12:08:28AM +0530, Vaidyanathan Srinivasan wrote:
> * Hillf Danton <dhillf@gmail.com> [2012-07-06 21:04:56]:
> 
> > Hi Vaidy,
> > 
> > On Fri, Jul 6, 2012 at 12:56 AM, Vaidyanathan Srinivasan
> > <svaidy@linux.vnet.ibm.com> wrote:
> > > --- a/mm/autonuma.c
> > > +++ b/mm/autonuma.c
> > > @@ -26,7 +26,7 @@ unsigned long autonuma_flags __read_mostly =
> > >  #ifdef CONFIG_AUTONUMA_DEFAULT_ENABLED
> > >         (1<<AUTONUMA_FLAG)|
> > >  #endif
> > > -       (1<<AUTONUMA_SCAN_PMD_FLAG);
> > > +       (0<<AUTONUMA_SCAN_PMD_FLAG);
> > >
> > 
> > Let X86 scan pmd by default, agree?
> 
> Sure, yes.  This patch just lists the changes required to get the
> framework running on powerpc so that we know the location of code
> changes.
> 
> We will need arch-specific default flags and leave this ON for x86.

I applied it and added the flag. Let me know if this is ok.

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commitdiff;h=1cf85a3f23326bba89d197e845ab6f7883d0efd3
http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commitdiff;h=0276c4e2f7e9c3fc856f9dd5be319c2db1761cb4

I'm trying to fix all review points before releasing a new autonuma20
but you can follow the current status on the devel origin/autonuma
branch.

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 18/40] autonuma: call autonuma_setup_new_exec()
  2012-06-30  5:04     ` Konrad Rzeszutek Wilk
@ 2012-07-12 17:50       ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-07-12 17:50 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Hi,

On Sat, Jun 30, 2012 at 01:04:26AM -0400, Konrad Rzeszutek Wilk wrote:
> On Thu, Jun 28, 2012 at 02:55:58PM +0200, Andrea Arcangeli wrote:
> > This resets all per-thread and per-process statistics across exec
> > syscalls or after kernel threads detached from the mm. The past
> > statistical NUMA information is unlikely to be relevant for the future
> > in these cases.
> 
> The previous patch mentioned that it can run in bypass mode. Is
> this also able to do so? Meaning that these calls end up doing nops?

execve always resets unconditionally; there's no option to inherit the
stats from the program running execve. By default all thread and mm
stats are also reset across fork/clone, but there's a way to disable
that via sysfs because those are more likely to retain some
statistical significance.
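
As a rough sketch of what that means in code (the reset helpers and
the per-task/per-mm pointer names are illustrative, not lifted
verbatim from the series):

/* execve path: old NUMA statistics are meaningless for the new image */
void autonuma_setup_new_exec(struct task_struct *p)
{
	if (p->task_autonuma)
		task_autonuma_reset(p->task_autonuma);	/* per-thread stats */
	if (p->mm && p->mm->mm_autonuma)
		mm_autonuma_reset(p->mm->mm_autonuma);	/* per-process stats */
}

The fork/clone path would perform the same resets, but only when the
sysfs knob mentioned above leaves them enabled.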

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 19/40] autonuma: alloc/free/init sched_autonuma
  2012-06-30  5:10     ` Konrad Rzeszutek Wilk
@ 2012-07-12 17:59       ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-07-12 17:59 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Hi Konrad,

On Sat, Jun 30, 2012 at 01:10:01AM -0400, Konrad Rzeszutek Wilk wrote:
> On Thu, Jun 28, 2012 at 02:55:59PM +0200, Andrea Arcangeli wrote:
> > This is where the dynamically allocated sched_autonuma structure is
> > being handled.
> > 
> > The reason for keeping this outside of the task_struct besides not
> > using too much kernel stack, is to only allocate it on NUMA
> > hardware. So the not NUMA hardware only pays the memory of a pointer
> > in the kernel stack (which remains NULL at all times in that case). 
> 
> .. snip..
> > +	if (unlikely(alloc_task_autonuma(tsk, orig, node)))
> > +		/* free_thread_info() undoes arch_dup_task_struct() too */
> > +		goto out_thread_info;
> >  
> 
> That looks (without seeing the implementation) and from reading the git
> commit, like that on non-NUMA machines it would fail - and end up
> stop the creation of a task.

On non-NUMA systems, or if you boot with noautonuma, this does nothing
but return 0 (if CONFIG_AUTONUMA=n it returns 0 at build time, so the
above block is discarded by gcc). It can only fail when AutoNUMA is
enabled on real NUMA hardware, when it does allocate something but you
run OOM (likely also triggering the oom killer and so on, but it must
handle the OOM like every other kernel allocation or the whole kernel
becomes OOM-deadlock prone).
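
A sketch of the shape being described (the autonuma_possible() test,
the cache name and the task field name are placeholders, not the
series' exact code):

int alloc_task_autonuma(struct task_struct *tsk, struct task_struct *orig,
			int node)
{
	tsk->task_autonuma = NULL;

	/* non-NUMA hardware or "noautonuma" boot: nothing to allocate */
	if (!autonuma_possible())
		return 0;

	tsk->task_autonuma = kmem_cache_alloc_node(task_autonuma_cachep,
						   GFP_KERNEL, node);
	if (!tsk->task_autonuma)
		return -ENOMEM;	/* only on real NUMA hardware, under OOM */

	/* the real code also initializes the stats, possibly from orig */
	return 0;
}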

> 
> Perhaps a better name for the function: alloc_always_task_autonuma
> since the function (at least from the description of this patch) will
> always succeed. Perhaps even remove the:
> "if unlikely(..)" bit?

It will not always succeed and it will not always allocate.

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 20/40] autonuma: alloc/free/init mm_autonuma
  2012-06-30  5:12     ` Konrad Rzeszutek Wilk
@ 2012-07-12 18:08       ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-07-12 18:08 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Sat, Jun 30, 2012 at 01:12:18AM -0400, Konrad Rzeszutek Wilk wrote:
> On Thu, Jun 28, 2012 at 02:56:00PM +0200, Andrea Arcangeli wrote:
> > This is where the mm_autonuma structure is being handled. Just like
> > sched_autonuma, this is only allocated at runtime if the hardware the
> > kernel is running on has been detected as NUMA. On not NUMA hardware
> 
> I think the correct wording is "non-NUMA", not "not NUMA".

That sounds far too easy to me, but I've no idea which one is right here.

> > the memory cost is reduced to one pointer per mm.
> > 
> > To get rid of the pointer in the each mm, the kernel can be compiled
> > with CONFIG_AUTONUMA=n.
> > 
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > ---
> >  kernel/fork.c |    7 +++++++
> >  1 files changed, 7 insertions(+), 0 deletions(-)
> > 
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 0adbe09..3e5a0d9 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -527,6 +527,8 @@ static void mm_init_aio(struct mm_struct *mm)
> >  
> >  static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
> >  {
> > +	if (unlikely(alloc_mm_autonuma(mm)))
> > +		goto out_free_mm;
> 
> So reading that I would think that on non-NUMA machines this would fail
> (since there is nothing to allocate). But that is not the case
> (I hope!?) Perhaps just make the function not return any values?

It doesn't fail; it returns 0 on non-NUMA. It's identical to
alloc_task_autonuma, as per the previous email.

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 20/40] autonuma: alloc/free/init mm_autonuma
  2012-07-12 18:08       ` Andrea Arcangeli
@ 2012-07-12 18:17         ` Johannes Weiner
  -1 siblings, 0 replies; 327+ messages in thread
From: Johannes Weiner @ 2012-07-12 18:17 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Konrad Rzeszutek Wilk, linux-kernel, linux-mm, Hillf Danton,
	Dan Smith, Peter Zijlstra, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Srivatsa Vaddagiri,
	Christoph Lameter, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

On Thu, Jul 12, 2012 at 08:08:28PM +0200, Andrea Arcangeli wrote:
> On Sat, Jun 30, 2012 at 01:12:18AM -0400, Konrad Rzeszutek Wilk wrote:
> > On Thu, Jun 28, 2012 at 02:56:00PM +0200, Andrea Arcangeli wrote:
> > > This is where the mm_autonuma structure is being handled. Just like
> > > sched_autonuma, this is only allocated at runtime if the hardware the
> > > kernel is running on has been detected as NUMA. On not NUMA hardware
> > 
> > I think the correct wording is "non-NUMA", not "not NUMA".
> 
> That sounds far too easy to me, but I've no idea what the right wording
> is here.

UMA?

Single-node hardware?

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 20/40] autonuma: alloc/free/init mm_autonuma
  2012-07-01 15:33     ` Rik van Riel
@ 2012-07-12 18:27       ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-07-12 18:27 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Hi Rik,

On Sun, Jul 01, 2012 at 11:33:17AM -0400, Rik van Riel wrote:
> On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:
> 
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 0adbe09..3e5a0d9 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -527,6 +527,8 @@ static void mm_init_aio(struct mm_struct *mm)
> >
> >   static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
> >   {
> > +	if (unlikely(alloc_mm_autonuma(mm)))
> > +		goto out_free_mm;
> >   	atomic_set(&mm->mm_users, 1);
> >   	atomic_set(&mm->mm_count, 1);
> >   	init_rwsem(&mm->mmap_sem);
> 
> I wonder if it would be possible to defer the allocation
> of the mm_autonuma struct to knuma_scand, so short lived
> processes never have to allocate and free the mm_autonuma
> structure.
> 
> That way we only have a function call at exit time, and
> the branch inside kfree that checks for a null pointer.

It would be possible to convert them to prepare_mm/task_autonuma (the
mm side especially would only be a branch once in a while), but it would
then become impossible to inherit the mm/task stats across
fork/clone. Right now the default is to reset them, but two sysfs
switches control that, and I wouldn't drop those until I've had the time
to experiment with how large kernel builds are affected by enabling the
stats inheritance. Right now kernel builds are unaffected because of
the default stat-resetting behavior, and gcc runs too quickly to be
measured.
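
For reference, the fork-path policy I mean looks roughly like this
(just a sketch; the helper names and the exact sysfs switch names are
made up, not the real ones):

/* called from copy_process() after alloc_task_autonuma() succeeded */
static void task_autonuma_fork(struct task_struct *tsk,
			       struct task_struct *orig,
			       unsigned long clone_flags)
{
	struct task_autonuma *ta = tsk->task_autonuma;

	if (!ta)
		return;	/* non-NUMA hardware or "noautonuma" boot */

	/* one switch for thread clones, one for forks of a new mm */
	if ((clone_flags & CLONE_VM) ? autonuma_sched_clone_inherit() :
				       autonuma_sched_fork_inherit())
		memcpy(ta, orig->task_autonuma, task_autonuma_size());
	else
		task_autonuma_reset(ta);	/* the current default */
}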

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 28/40] autonuma: make khugepaged pte_numa aware
  2012-07-02  4:24     ` Rik van Riel
@ 2012-07-12 18:50       ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-07-12 18:50 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Mon, Jul 02, 2012 at 12:24:36AM -0400, Rik van Riel wrote:
> On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:
> > If any of the ptes that khugepaged is collapsing was a pte_numa, the
> > resulting trans huge pmd will be a pmd_numa too.
> 
> Why?
> 
> If some of the ptes already got faulted in and made really
> resident again, why do you want to incur a new NUMA fault
> on the newly collapsed hugepage?

If we don't set pmd_numa on the collapsed hugepage, the result is that
we'll underestimate the thread NUMA affinity to the node where the
hugepage is located (mm affinity is recorded independently by the NUMA
hinting page faults).

Whether that is better or worse I guess depends on luck; we just lose
information.

I guess overestimating the node affinity to a node with hugepages
just collapsed is better than underestimating it, more often than not.

I doubt it matters much whether just one pte_numa or all of them create
a pmd_numa.
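
In collapse terms it boils down to something like this (a sketch of the
rule, not the literal khugepaged change; it assumes a pmd_mknuma()
counterpart to pmd_numa()):

/* returns the trans huge pmd, marked NUMA if any collapsed pte was */
static pmd_t collapse_propagate_numa(pte_t *pte, pmd_t huge_pmd)
{
	int i;
	bool any_numa = false;

	for (i = 0; i < HPAGE_PMD_NR; i++)
		if (pte_numa(pte[i]))
			any_numa = true;

	return any_numa ? pmd_mknuma(huge_pmd) : huge_pmd;
}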

With the pmd scan mode (enabled by default) we fault in at
pmd granularity regardless of THP, so either way it's the same; this is
only an issue when you set knuma_scand/pmd = 0 at runtime.

> Is there something on we should know about?
> 
> If so, could you document it?

I'll add a note.

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 36/40] autonuma: page_autonuma
  2012-06-30  5:24     ` Konrad Rzeszutek Wilk
@ 2012-07-12 19:43       ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-07-12 19:43 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Sat, Jun 30, 2012 at 01:24:05AM -0400, Konrad Rzeszutek Wilk wrote:
> I think you are better using a different name.
> 
> Perhaps 'if (autonuma_on())'

I changed it to AUTONUMA_POSSIBLE_FLAG/autonuma_possible() and
optimized the implementation to a single test_bit on the read-mostly
flag variable.

"possible" is the term all the NUMA code in the kernel already uses, with
almost the same meaning (modulo the noautonuma parameter) as
num_possible_nodes() etc...
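
I.e. something along these lines (a sketch of the approach; the flags
word is the one from the autonuma flags patch, but the exact declaration
may differ):

extern unsigned long autonuma_flags;	/* __read_mostly in the .c file */

static inline bool autonuma_possible(void)
{
	/* cleared at boot on non-NUMA hardware or with "noautonuma" */
	return test_bit(AUTONUMA_POSSIBLE_FLAG, &autonuma_flags);
}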

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 36/40] autonuma: page_autonuma
  2012-07-02  6:37     ` Rik van Riel
@ 2012-07-12 19:58       ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-07-12 19:58 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Mon, Jul 02, 2012 at 02:37:10AM -0400, Rik van Riel wrote:
>  > +fail:
>  > +	printk(KERN_CRIT "allocation of page_autonuma failed.\n");
>  > +	printk(KERN_CRIT "please try the 'noautonuma' boot option\n");
>  > +	panic("Out of memory");
>  > +}
> 
> The system can run just fine without autonuma.
> 
> Would it make sense to simply disable autonuma at this point,
> but to try continue running?

BTW, the same would apply to mm/page_cgroup.c, but I think the idea
here is that something serious went wrong. Working around it with the
noautonuma boot option is enough.

> 
> > @@ -700,8 +780,14 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap)
> >   	 */
> >   	if (PageSlab(usemap_page)) {
> >   		kfree(usemap);
> > -		if (memmap)
> > +		if (memmap) {
> >   			__kfree_section_memmap(memmap, PAGES_PER_SECTION);
> > +			if (!autonuma_impossible())
> > +				__kfree_section_page_autonuma(page_autonuma,
> > +							      PAGES_PER_SECTION);
> > +			else
> > +				BUG_ON(page_autonuma);
> 
> VM_BUG_ON ?
> 
> > +		if (!autonuma_impossible()) {
> > +			struct page *page_autonuma_page;
> > +			page_autonuma_page = virt_to_page(page_autonuma);
> > +			free_map_bootmem(page_autonuma_page, nr_pages);
> > +		} else
> > +			BUG_ON(page_autonuma);
> 
> ditto
> 
> >   	pgdat_resize_unlock(pgdat,&flags);
> >   	if (ret<= 0) {
> > +		if (!autonuma_impossible())
> > +			__kfree_section_page_autonuma(page_autonuma, nr_pages);
> > +		else
> > +			BUG_ON(page_autonuma);
> 
> VM_BUG_ON ?

These only run at the boot stage, so performance is irrelevant and it's
safer to keep them as BUG_ON.

The rest was corrected.

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 40/40] autonuma: shrink the per-page page_autonuma struct size
  2012-07-02  7:18     ` Rik van Riel
@ 2012-07-12 20:21       ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-07-12 20:21 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Mon, Jul 02, 2012 at 03:18:46AM -0400, Rik van Riel wrote:
> On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:
> >  From 32 to 12 bytes, so the AutoNUMA memory footprint is reduced to
> > 0.29% of RAM.
> 
> Still not ideal, however once we get native THP migration working
> it could be practical to switch to a "have a bucket with N
> page_autonuma structures for every N*M pages" approach.
> 
> For example, we could have 4 struct page_autonuma pages, for 32
> memory pages. That would necessitate reintroducing the page pointer
> into struct page_autonuma, but it would reduce memory use by roughly
> a factor 8.
> 
> To get from a struct page to a struct page_autonuma, we would have
> to look at the bucket and check whether one of the page_autonuma
> structs points at us. If none do, we have to claim an available one.
> On migration, we would have to free our page_autonuma struct, which
> would make it available for other pages to use.
> 
> This would complicate the code somewhat, and potentially slow down
> the migration of 4kB pages, but with 2MB pages things could continue
> exactly the way they work today.
> 
> Does this seem reasonably in any way?

Reducing the max LRU size loses info too. The thing I dislike is that
knuma_migrated may not migrate the page until a few knuma_scand passes
have completed on large systems (giving last_nid_set a chance to notice
if there's false sharing and cancel the migration). I conceptually like
the unlimited-size LRU migration list.

The other con is that it'll increase the complexity even more by
having to deal with dynamic objects instead of an extension of the
struct page.

And the 2 bytes for the last_nid information would need to be retained
for every page, unless we drop the last_nid logic, which I doubt would
be good.

And the alternative without a hash doesn't pay off: one could reduce
it to 8 bytes per page (for the pointer to the page_autonuma structure)
plus 2 bytes for last_nid, so 10 bytes per page plus the actual array,
instead of the current 12 bytes per page.
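
For reference, a 12-byte layout consistent with the numbers above looks
roughly like this (the field names are illustrative, not necessarily the
ones in the patch). The migrate LRU links are stored as 32-bit "item
numbers" into the per-node page_autonuma array instead of 64-bit
pointers:

struct page_autonuma {
	s16 autonuma_last_nid;		/* 2 bytes: last nid that faulted */
	s16 autonuma_migrate_nid;	/* 2 bytes: queued-for-migration nid */
	u32 autonuma_migrate_next;	/* 4 bytes: index links, */
	u32 autonuma_migrate_prev;	/* 4 bytes: instead of 8-byte pointers */
};	/* 12 bytes per page, vs 32 with a list_head and pointers */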

> I also wonder if it would make sense to have this available as a
> generic list type, not autonuma specific but an "item number list"
> include file with corresponding macros.
> 
> It might be useful to have lists with item numbers, instead of
> prev & next pointers, in other places in the kernel.
> 
> Besides, introducing this list type separately could make things
> easier to review.

Macros that bypass type checking usually aren't recommended, and it's
certainly more readable as it is now. But this can always be done
later if needed.

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 28/40] autonuma: make khugepaged pte_numa aware
  2012-07-12 18:50       ` Andrea Arcangeli
@ 2012-07-12 21:25         ` Rik van Riel
  -1 siblings, 0 replies; 327+ messages in thread
From: Rik van Riel @ 2012-07-12 21:25 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 07/12/2012 02:50 PM, Andrea Arcangeli wrote:
> On Mon, Jul 02, 2012 at 12:24:36AM -0400, Rik van Riel wrote:
>> On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:
>>> If any of the ptes that khugepaged is collapsing was a pte_numa, the
>>> resulting trans huge pmd will be a pmd_numa too.
>>
>> Why?
>>
>> If some of the ptes already got faulted in and made really
>> resident again, why do you want to incur a new NUMA fault
>> on the newly collapsed hugepage?
>
> If we don't set pmd_numa on the collapsed hugepage, the result is that
> we'll understimate the thread NUMA affinity to the node where the
> hugepage is located (mm affinity is recorded independently by the NUMA
> hinting page faults).
>
> If it's better or worse I guess depends on luck, we just lose
> information.

Fair enough.





^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 20/40] autonuma: alloc/free/init mm_autonuma
  2012-07-12 18:17         ` Johannes Weiner
@ 2012-07-13 14:19           ` Christoph Lameter
  -1 siblings, 0 replies; 327+ messages in thread
From: Christoph Lameter @ 2012-07-13 14:19 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrea Arcangeli, Konrad Rzeszutek Wilk, linux-kernel, linux-mm,
	Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

On Thu, 12 Jul 2012, Johannes Weiner wrote:

> On Thu, Jul 12, 2012 at 08:08:28PM +0200, Andrea Arcangeli wrote:
> > On Sat, Jun 30, 2012 at 01:12:18AM -0400, Konrad Rzeszutek Wilk wrote:
> > > On Thu, Jun 28, 2012 at 02:56:00PM +0200, Andrea Arcangeli wrote:
> > > > This is where the mm_autonuma structure is being handled. Just like
> > > > sched_autonuma, this is only allocated at runtime if the hardware the
> > > > kernel is running on has been detected as NUMA. On not NUMA hardware
> > >
> > > I think the correct wording is "non-NUMA", not "not NUMA".
> >
> > That sounds far too easy to me, but I've no idea what the right wording
> > is here.
>
> UMA?
>
> Single-node hardware?

Let's be simple and stay with non-NUMA.



^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 25/40] autonuma: follow_page check for pte_numa/pmd_numa
  2012-07-02  4:14     ` Rik van Riel
@ 2012-07-14 16:43       ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-07-14 16:43 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Mon, Jul 02, 2012 at 12:14:11AM -0400, Rik van Riel wrote:
> On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:
> > Without this, follow_page wouldn't trigger the NUMA hinting faults.
> >
> > Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>
> 
> follow_page is called from many different places, not just
> the process itself. One example would be ksm.
> 
> Do you really want to trigger NUMA hinting faults when the
> mm != current->mm, or is that magically prevented somewhere?

The NUMA hinting page fault will update "current->task_autonuma"
according to the page_nid of the page triggering the NUMA hinting page
fault in follow_page. It doesn't matter if the page belongs to a
different mm; all we care about is the page_nid that was accessed by the
"current" task through a NUMA hinting fault.

When I started thinking about the benefit it could provide, I thought it
wouldn't be worth it, because task_autonuma statistics are only used to
balance threads belonging to the same process, while mm_autonuma is used
to balance tasks belonging to different processes. And mm_autonuma
will never be able to take things like this into account.

So I converted the !current->mm check to a current->mm != mm check
here to save a bit of CPU and skip it in the autonuma branch.

void numa_hinting_fault(struct mm_struct *mm, struct page *page, int numpages)
{
	/*
	 * "current->mm" could be different from "mm" if
	 * get_user_pages() triggered the fault on some other process
	 * "mm". It wouldn't be a problem to account this NUMA hinting
	 * page fault on the current->task_autonuma statistics even if
	 * it was triggered on a page mapped on a different
	 * "mm". However task_autonuma isn't used to balance threads
	 * belonging to different processes so it wouldn't help and in
	 * turn it's not worth it.
	 */
	if (likely(current->mm == mm && !current->mempolicy && autonuma_enabled())) {

But I was thinking of the usual case of one ptracer task with a single
thread; now I've changed my mind, and I think it can help when
there's just one process with a ton of threads spanning multiple nodes,
and one of the threads is ptracing an otherwise idle task and
accessing a lot of RAM through ptrace. So I think I'll roll it back to
the autonuma21 status and allow accounting of all page_nids even for a
different mm again. But this is mostly a theoretical issue.

It can lead to a funny weighting where mm_autonuma shows 100% of the
weight in one node, while task_autonuma shows 95% of the weight in
another node. But it should still work fine, as we won't allow the
thread to go to that different node if a different process runs there.
If a thread of the same process runs in the node where task_autonuma
shows 95% of the weight, then it's better to put the thread there if it
has a higher weight than the other thread of the same process, so it'll
be fine despite mm_autonuma and task_autonuma disagreeing.

Disagreement between task_autonuma and mm_autonuma happens all the time
and it's perfectly normal; this just exacerbates it a little more.

^ permalink raw reply	[flat|nested] 327+ messages in thread

* Re: [PATCH 20/40] autonuma: alloc/free/init mm_autonuma
  2012-07-13 14:19           ` Christoph Lameter
@ 2012-07-14 17:01             ` Andrea Arcangeli
  -1 siblings, 0 replies; 327+ messages in thread
From: Andrea Arcangeli @ 2012-07-14 17:01 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Johannes Weiner, Konrad Rzeszutek Wilk, linux-kernel, linux-mm,
	Hillf Danton, Dan Smith, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel,
	Srivatsa Vaddagiri, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Benjamin Herrenschmidt

On Fri, Jul 13, 2012 at 09:19:06AM -0500, Christoph Lameter wrote:
> On Thu, 12 Jul 2012, Johannes Weiner wrote:
> 
> > On Thu, Jul 12, 2012 at 08:08:28PM +0200, Andrea Arcangeli wrote:
> > > On Sat, Jun 30, 2012 at 01:12:18AM -0400, Konrad Rzeszutek Wilk wrote:
> > > > On Thu, Jun 28, 2012 at 02:56:00PM +0200, Andrea Arcangeli wrote:
> > > > > This is where the mm_autonuma structure is being handled. Just like
> > > > > sched_autonuma, this is only allocated at runtime if the hardware the
> > > > > kernel is running on has been detected as NUMA. On not NUMA hardware
> > > >
> > > > I think the correct wording is "non-NUMA", not "not NUMA".
> > >
> > > That sounds far too easy to me, but I've no idea what the right wording
> > > is here.
> >
> > UMA?
> >
> > Single-node hardware?
> 
> Lets be simple and stay with non-NUMA.

Ok, I already corrected all occurrences.

Thanks.

^ permalink raw reply	[flat|nested] 327+ messages in thread

* [tip:sched/core] sched: Describe CFS load-balancer
  2012-07-03 11:53       ` Peter Zijlstra
  (?)
@ 2012-10-24  9:58       ` tip-bot for Peter Zijlstra
  -1 siblings, 0 replies; 327+ messages in thread
From: tip-bot for Peter Zijlstra @ 2012-10-24  9:58 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, a.p.zijlstra, peterz, tglx

Commit-ID:  e9c84cb8d5f1b1ea6fcbe6190d51dc84b6975938
Gitweb:     http://git.kernel.org/tip/e9c84cb8d5f1b1ea6fcbe6190d51dc84b6975938
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Tue, 3 Jul 2012 13:53:26 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 24 Oct 2012 10:27:33 +0200

sched: Describe CFS load-balancer

Add some scribbles on how and why the load-balancer works..

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1341316406.23484.64.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c |  118 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 116 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3e6a353..a319d56c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3456,8 +3456,122 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preemp
 
 #ifdef CONFIG_SMP
 /**************************************************
- * Fair scheduling class load-balancing methods:
- */
+ * Fair scheduling class load-balancing methods.
+ *
+ * BASICS
+ *
+ * The purpose of load-balancing is to achieve the same basic fairness the
+ * per-cpu scheduler provides, namely provide a proportional amount of compute
+ * time to each task. This is expressed in the following equation:
+ *
+ *   W_i,n/P_i == W_j,n/P_j for all i,j                               (1)
+ *
+ * Where W_i,n is the n-th weight average for cpu i. The instantaneous weight
+ * W_i,0 is defined as:
+ *
+ *   W_i,0 = \Sum_j w_i,j                                             (2)
+ *
+ * Where w_i,j is the weight of the j-th runnable task on cpu i. This weight
+ * is derived from the nice value as per prio_to_weight[].
+ *
+ * The weight average is an exponential decay average of the instantaneous
+ * weight:
+ *
+ *   W'_i,n = (2^n - 1) / 2^n * W_i,n + 1 / 2^n * W_i,0               (3)
+ *
+ * P_i is the cpu power (or compute capacity) of cpu i, typically it is the
+ * fraction of 'recent' time available for SCHED_OTHER task execution. But it
+ * can also include other factors [XXX].
+ *
+ * To achieve this balance we define a measure of imbalance which follows
+ * directly from (1):
+ *
+ *   imb_i,j = max{ avg(W/P), W_i/P_i } - min{ avg(W/P), W_j/P_j }    (4)
+ *
+ * We then move tasks around to minimize the imbalance. In the continuous
+ * function space it is obvious this converges, in the discrete case we get
+ * a few fun cases generally called infeasible weight scenarios.
+ *
+ * [XXX expand on:
+ *     - infeasible weights;
+ *     - local vs global optima in the discrete case. ]
+ *
+ *
+ * SCHED DOMAINS
+ *
+ * In order to solve the imbalance equation (4), and avoid the obvious O(n^2)
+ * for all i,j solution, we create a tree of cpus that follows the hardware
+ * topology where each level pairs two lower groups (or better). This results
+ * in O(log n) layers. Furthermore we reduce the number of cpus going up the
+ * tree to only the first of the previous level and we decrease the frequency
+ * of load-balance at each level inv. proportional to the number of cpus in
+ * the groups.
+ *
+ * This yields:
+ *
+ *     log_2 n     1     n
+ *   \Sum       { --- * --- * 2^i } = O(n)                            (5)
+ *     i = 0      2^i   2^i
+ *                               `- size of each group
+ *         |         |     `- number of cpus doing load-balance
+ *         |         `- freq
+ *         `- sum over all levels
+ *
+ * Coupled with a limit on how many tasks we can migrate every balance pass,
+ * this makes (5) the runtime complexity of the balancer.
+ *
+ * An important property here is that each CPU is still (indirectly) connected
+ * to every other cpu in at most O(log n) steps:
+ *
+ * The adjacency matrix of the resulting graph is given by:
+ *
+ *             log_2 n     
+ *   A_i,j = \Union     (i % 2^k == 0) && i / 2^(k+1) == j / 2^(k+1)  (6)
+ *             k = 0
+ *
+ * And you'll find that:
+ *
+ *   A^(log_2 n)_i,j != 0  for all i,j                                (7)
+ *
+ * Showing there's indeed a path between every cpu in at most O(log n) steps.
+ * The task movement gives a factor of O(m), giving a convergence complexity
+ * of:
+ *
+ *   O(nm log n),  n := nr_cpus, m := nr_tasks                        (8)
+ *
+ *
+ * WORK CONSERVING
+ *
+ * In order to avoid CPUs going idle while there's still work to do, new idle
+ * balancing is more aggressive and has the newly idle cpu iterate up the domain
+ * tree itself instead of relying on other CPUs to bring it work.
+ *
+ * This adds some complexity to both (5) and (8) but it reduces the total idle
+ * time.
+ *
+ * [XXX more?]
+ *
+ *
+ * CGROUPS
+ *
+ * Cgroups make a horror show out of (2), instead of a simple sum we get:
+ *
+ *                                s_k,i
+ *   W_i,0 = \Sum_j \Prod_k w_k * -----                               (9)
+ *                                 S_k
+ *
+ * Where
+ *
+ *   s_k,i = \Sum_j w_i,j,k  and  S_k = \Sum_i s_k,i                 (10)
+ *
+ * w_i,j,k is the weight of the j-th runnable task in the k-th cgroup on cpu i.
+ *
+ * The big problem is S_k, it's a global sum needed to compute a local (W_i)
+ * property.
+ *
+ * [XXX write more on how we solve this.. _after_ merging pjt's patches that
+ *      rewrite all of this once again.]
+ */ 
 
 static unsigned long __read_mostly max_load_balance_interval = HZ/10;
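
A quick numeric check of (5): for n = 8 cpus the per-level cost
1/2^i * n/2^i * 2^i reduces to n/2^i, so the sum over levels
i = 0..log_2 n is 8 + 4 + 2 + 1 = 15 = 2n - 1, i.e. O(n) as claimed.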
 

^ permalink raw reply related	[flat|nested] 327+ messages in thread

end of thread, other threads:[~2012-10-24 10:00 UTC | newest]

Thread overview: 327+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-06-28 12:55 [PATCH 00/40] AutoNUMA19 Andrea Arcangeli
2012-06-28 12:55 ` Andrea Arcangeli
2012-06-28 12:55 ` [PATCH 01/40] mm: add unlikely to the mm allocation failure check Andrea Arcangeli
2012-06-28 12:55   ` Andrea Arcangeli
2012-06-29 14:10   ` Rik van Riel
2012-06-29 14:10     ` Rik van Riel
2012-06-28 12:55 ` [PATCH 02/40] autonuma: make set_pmd_at always available Andrea Arcangeli
2012-06-28 12:55   ` Andrea Arcangeli
2012-06-29 14:10   ` Rik van Riel
2012-06-29 14:10     ` Rik van Riel
2012-06-28 12:55 ` [PATCH 03/40] autonuma: export is_vma_temporary_stack() even if CONFIG_TRANSPARENT_HUGEPAGE=n Andrea Arcangeli
2012-06-28 12:55   ` Andrea Arcangeli
2012-06-29 14:11   ` Rik van Riel
2012-06-29 14:11     ` Rik van Riel
2012-06-28 12:55 ` [PATCH 04/40] xen: document Xen is using an unused bit for the pagetables Andrea Arcangeli
2012-06-28 12:55   ` Andrea Arcangeli
2012-06-29 14:16   ` Rik van Riel
2012-06-29 14:16     ` Rik van Riel
2012-07-04 23:05     ` Andrea Arcangeli
2012-07-04 23:05       ` Andrea Arcangeli
2012-06-30  4:47   ` Konrad Rzeszutek Wilk
2012-06-30  4:47     ` Konrad Rzeszutek Wilk
2012-07-03 10:45     ` Andrea Arcangeli
2012-07-03 10:45       ` Andrea Arcangeli
2012-06-28 12:55 ` [PATCH 05/40] autonuma: define _PAGE_NUMA_PTE and _PAGE_NUMA_PMD Andrea Arcangeli
2012-06-28 12:55   ` Andrea Arcangeli
2012-06-28 15:13   ` Don Morris
2012-06-28 15:13     ` Don Morris
2012-06-28 15:00     ` Andrea Arcangeli
2012-06-28 15:00       ` Andrea Arcangeli
2012-06-29 14:26   ` Rik van Riel
2012-06-29 14:26     ` Rik van Riel
2012-07-03 20:30     ` Andrea Arcangeli
2012-07-03 20:30       ` Andrea Arcangeli
2012-06-28 12:55 ` [PATCH 06/40] autonuma: x86 pte_numa() and pmd_numa() Andrea Arcangeli
2012-06-28 12:55   ` Andrea Arcangeli
2012-06-29 15:02   ` Rik van Riel
2012-06-29 15:02     ` Rik van Riel
2012-07-04 23:03     ` Andrea Arcangeli
2012-07-04 23:03       ` Andrea Arcangeli
2012-06-28 12:55 ` [PATCH 07/40] autonuma: generic " Andrea Arcangeli
2012-06-28 12:55   ` Andrea Arcangeli
2012-06-29 15:13   ` Rik van Riel
2012-06-29 15:13     ` Rik van Riel
2012-06-28 12:55 ` [PATCH 08/40] autonuma: teach gup_fast about pte_numa Andrea Arcangeli
2012-06-28 12:55   ` Andrea Arcangeli
2012-06-29 15:27   ` Rik van Riel
2012-06-29 15:27     ` Rik van Riel
2012-06-28 12:55 ` [PATCH 09/40] autonuma: introduce kthread_bind_node() Andrea Arcangeli
2012-06-28 12:55   ` Andrea Arcangeli
2012-06-29 15:36   ` Rik van Riel
2012-06-29 15:36     ` Rik van Riel
2012-06-29 16:04     ` Peter Zijlstra
2012-06-29 16:04       ` Peter Zijlstra
2012-06-29 16:11       ` Rik van Riel
2012-06-29 16:11         ` Rik van Riel
2012-06-29 16:38     ` Andrea Arcangeli
2012-06-29 16:38       ` Andrea Arcangeli
2012-06-29 16:58       ` Rik van Riel
2012-06-29 16:58         ` Rik van Riel
2012-07-05 13:09         ` Johannes Weiner
2012-07-05 13:09           ` Johannes Weiner
2012-07-05 18:33           ` Glauber Costa
2012-07-05 18:33             ` Glauber Costa
2012-07-05 20:07             ` Andrea Arcangeli
2012-07-05 20:07               ` Andrea Arcangeli
2012-06-30  4:50   ` Konrad Rzeszutek Wilk
2012-06-30  4:50     ` Konrad Rzeszutek Wilk
2012-07-04 23:14     ` Andrea Arcangeli
2012-07-04 23:14       ` Andrea Arcangeli
2012-07-05 12:04       ` Konrad Rzeszutek Wilk
2012-07-05 12:04         ` Konrad Rzeszutek Wilk
2012-07-05 12:28         ` Andrea Arcangeli
2012-07-05 12:28           ` Andrea Arcangeli
2012-07-05 12:18       ` Peter Zijlstra
2012-07-05 12:18         ` Peter Zijlstra
2012-07-05 12:21         ` Andrea Arcangeli
2012-07-05 12:21           ` Andrea Arcangeli
2012-06-28 12:55 ` [PATCH 10/40] autonuma: mm_autonuma and sched_autonuma data structures Andrea Arcangeli
2012-06-28 12:55   ` Andrea Arcangeli
2012-06-29 15:47   ` Rik van Riel
2012-06-29 15:47     ` Rik van Riel
2012-06-29 17:45   ` Rik van Riel
2012-06-29 17:45     ` Rik van Riel
2012-07-04 23:16     ` Andrea Arcangeli
2012-07-04 23:16       ` Andrea Arcangeli
2012-06-28 12:55 ` [PATCH 11/40] autonuma: define the autonuma flags Andrea Arcangeli
2012-06-28 12:55   ` Andrea Arcangeli
2012-06-29 16:10   ` Rik van Riel
2012-06-29 16:10     ` Rik van Riel
2012-06-30  4:58   ` Konrad Rzeszutek Wilk
2012-06-30  4:58     ` Konrad Rzeszutek Wilk
2012-07-02 15:42     ` Konrad Rzeszutek Wilk
2012-07-02 15:42       ` Konrad Rzeszutek Wilk
2012-06-30  5:01   ` Konrad Rzeszutek Wilk
2012-06-30  5:01     ` Konrad Rzeszutek Wilk
2012-07-04 23:45     ` Andrea Arcangeli
2012-07-04 23:45       ` Andrea Arcangeli
2012-06-28 12:55 ` [PATCH 12/40] autonuma: core autonuma.h header Andrea Arcangeli
2012-06-28 12:55   ` Andrea Arcangeli
2012-06-28 12:55 ` [PATCH 13/40] autonuma: CPU follow memory algorithm Andrea Arcangeli
2012-06-28 12:55   ` Andrea Arcangeli
2012-06-28 14:46   ` Peter Zijlstra
2012-06-28 14:46     ` Peter Zijlstra
2012-06-29 14:11     ` Nai Xia
2012-06-29 14:11       ` Nai Xia
2012-06-29 16:30       ` Andrea Arcangeli
2012-06-29 16:30         ` Andrea Arcangeli
2012-06-29 18:09         ` Nai Xia
2012-06-29 18:09           ` Nai Xia
2012-06-29 21:02         ` Nai Xia
2012-06-29 21:02           ` Nai Xia
2012-07-03 11:53     ` Peter Zijlstra
2012-07-03 11:53       ` Peter Zijlstra
2012-10-24  9:58       ` [tip:sched/core] sched: Describe CFS load-balancer tip-bot for Peter Zijlstra
2012-06-28 14:53   ` [PATCH 13/40] autonuma: CPU follow memory algorithm Peter Zijlstra
2012-06-28 14:53     ` Peter Zijlstra
2012-06-29 12:16     ` Hillf Danton
2012-06-29 12:16       ` Hillf Danton
2012-06-29 12:55       ` Ingo Molnar
2012-06-29 12:55         ` Ingo Molnar
2012-06-29 16:51         ` Dor Laor
2012-06-29 16:51           ` Dor Laor
2012-06-29 18:41           ` Peter Zijlstra
2012-06-29 18:41             ` Peter Zijlstra
2012-06-29 18:46             ` Rik van Riel
2012-06-29 18:46               ` Rik van Riel
2012-06-29 18:51               ` Peter Zijlstra
2012-06-29 18:51                 ` Peter Zijlstra
2012-06-29 18:57               ` Peter Zijlstra
2012-06-29 18:57                 ` Peter Zijlstra
2012-06-29 19:03                 ` Peter Zijlstra
2012-06-29 19:03                   ` Peter Zijlstra
2012-06-29 19:19                   ` Rik van Riel
2012-06-29 19:19                     ` Rik van Riel
2012-07-02 16:57                     ` Vaidyanathan Srinivasan
2012-07-05 16:56                       ` Vaidyanathan Srinivasan
2012-07-06 13:04                         ` Hillf Danton
2012-07-06 18:38                           ` Vaidyanathan Srinivasan
2012-07-12 13:12                             ` Andrea Arcangeli
2012-06-29 18:49           ` Peter Zijlstra
2012-06-29 18:49             ` Peter Zijlstra
2012-06-29 18:53           ` Peter Zijlstra
2012-06-29 18:53             ` Peter Zijlstra
2012-06-29 20:01             ` Nai Xia
2012-06-29 20:44               ` Nai Xia
2012-06-30  1:23               ` Andrea Arcangeli
2012-06-30  2:43                 ` Nai Xia
2012-06-30  5:48                   ` Dor Laor
2012-06-30  6:58                     ` Nai Xia
2012-06-30 13:04                       ` Andrea Arcangeli
2012-06-30 15:19                         ` Nai Xia
2012-06-30 19:37                       ` Dor Laor
2012-07-01  2:41                         ` Nai Xia
2012-06-30 23:55                       ` Benjamin Herrenschmidt
2012-06-30 23:55                         ` Benjamin Herrenschmidt
2012-07-01  3:10                         ` Nai Xia
2012-07-01  3:10                           ` Nai Xia
2012-06-30  8:23                     ` Nai Xia
2012-07-02  7:29                       ` Rik van Riel
2012-07-02  7:43                         ` Nai Xia
2012-06-30 12:48                   ` Andrea Arcangeli
2012-06-30 15:10                     ` Nai Xia
2012-07-02  7:36                       ` Rik van Riel
2012-07-02  7:56                         ` Nai Xia
2012-07-02  8:17                           ` Rik van Riel
2012-07-02  8:31                             ` Nai Xia
2012-07-05 18:07               ` Rik van Riel
2012-07-05 22:59                 ` Andrea Arcangeli
2012-07-06  1:00                 ` Nai Xia
2012-06-29 19:04           ` Peter Zijlstra
2012-06-29 19:04             ` Peter Zijlstra
2012-06-29 20:27             ` Nai Xia
2012-06-29 18:03   ` Rik van Riel
2012-06-29 18:03     ` Rik van Riel
2012-06-28 12:55 ` [PATCH 14/40] autonuma: add page structure fields Andrea Arcangeli
2012-06-28 12:55   ` Andrea Arcangeli
2012-06-29 18:06   ` Rik van Riel
2012-06-29 18:06     ` Rik van Riel
2012-06-28 12:55 ` [PATCH 15/40] autonuma: knuma_migrated per NUMA node queues Andrea Arcangeli
2012-06-28 12:55   ` Andrea Arcangeli
2012-06-29 18:31   ` Rik van Riel
2012-06-29 18:31     ` Rik van Riel
2012-06-28 12:55 ` [PATCH 16/40] autonuma: init knuma_migrated queues Andrea Arcangeli
2012-06-28 12:55   ` Andrea Arcangeli
2012-06-29 18:35   ` Rik van Riel
2012-06-29 18:35     ` Rik van Riel
2012-06-28 12:55 ` [PATCH 17/40] autonuma: autonuma_enter/exit Andrea Arcangeli
2012-06-28 12:55   ` Andrea Arcangeli
2012-06-29 18:37   ` Rik van Riel
2012-06-29 18:37     ` Rik van Riel
2012-06-28 12:55 ` [PATCH 18/40] autonuma: call autonuma_setup_new_exec() Andrea Arcangeli
2012-06-28 12:55   ` Andrea Arcangeli
2012-06-29 18:39   ` Rik van Riel
2012-06-29 18:39     ` Rik van Riel
2012-06-30  5:04   ` Konrad Rzeszutek Wilk
2012-06-30  5:04     ` Konrad Rzeszutek Wilk
2012-07-12 17:50     ` Andrea Arcangeli
2012-07-12 17:50       ` Andrea Arcangeli
2012-06-28 12:55 ` [PATCH 19/40] autonuma: alloc/free/init sched_autonuma Andrea Arcangeli
2012-06-28 12:55   ` Andrea Arcangeli
2012-06-29 18:52   ` Rik van Riel
2012-06-29 18:52     ` Rik van Riel
2012-06-30  5:10   ` Konrad Rzeszutek Wilk
2012-06-30  5:10     ` Konrad Rzeszutek Wilk
2012-07-12 17:59     ` Andrea Arcangeli
2012-07-12 17:59       ` Andrea Arcangeli
2012-06-28 12:56 ` [PATCH 20/40] autonuma: alloc/free/init mm_autonuma Andrea Arcangeli
2012-06-28 12:56   ` Andrea Arcangeli
2012-06-29 18:54   ` Rik van Riel
2012-06-29 18:54     ` Rik van Riel
2012-06-30  5:12   ` Konrad Rzeszutek Wilk
2012-06-30  5:12     ` Konrad Rzeszutek Wilk
2012-07-12 18:08     ` Andrea Arcangeli
2012-07-12 18:08       ` Andrea Arcangeli
2012-07-12 18:17       ` Johannes Weiner
2012-07-12 18:17         ` Johannes Weiner
2012-07-13 14:19         ` Christoph Lameter
2012-07-13 14:19           ` Christoph Lameter
2012-07-14 17:01           ` Andrea Arcangeli
2012-07-14 17:01             ` Andrea Arcangeli
2012-07-01 15:33   ` Rik van Riel
2012-07-01 15:33     ` Rik van Riel
2012-07-12 18:27     ` Andrea Arcangeli
2012-07-12 18:27       ` Andrea Arcangeli
2012-06-28 12:56 ` [PATCH 21/40] autonuma: avoid CFS select_task_rq_fair to return -1 Andrea Arcangeli
2012-06-28 12:56   ` Andrea Arcangeli
2012-06-29 18:57   ` Rik van Riel
2012-06-29 18:57     ` Rik van Riel
2012-06-29 19:05     ` Peter Zijlstra
2012-06-29 19:05       ` Peter Zijlstra
2012-06-29 19:07       ` Rik van Riel
2012-06-29 19:07         ` Rik van Riel
2012-06-29 20:48         ` Ingo Molnar
2012-06-29 20:48           ` Ingo Molnar
2012-06-28 12:56 ` [PATCH 22/40] autonuma: teach CFS about autonuma affinity Andrea Arcangeli
2012-06-28 12:56   ` Andrea Arcangeli
2012-07-01 16:37   ` Rik van Riel
2012-07-01 16:37     ` Rik van Riel
2012-06-28 12:56 ` [PATCH 23/40] autonuma: sched_set_autonuma_need_balance Andrea Arcangeli
2012-06-28 12:56   ` Andrea Arcangeli
2012-07-01 16:57   ` Rik van Riel
2012-07-01 16:57     ` Rik van Riel
2012-06-28 12:56 ` [PATCH 24/40] autonuma: core Andrea Arcangeli
2012-06-28 12:56   ` Andrea Arcangeli
2012-07-02  4:07   ` Rik van Riel
2012-07-02  4:07     ` Rik van Riel
2012-06-28 12:56 ` [PATCH 25/40] autonuma: follow_page check for pte_numa/pmd_numa Andrea Arcangeli
2012-06-28 12:56   ` Andrea Arcangeli
2012-07-02  4:14   ` Rik van Riel
2012-07-02  4:14     ` Rik van Riel
2012-07-14 16:43     ` Andrea Arcangeli
2012-07-14 16:43       ` Andrea Arcangeli
2012-06-28 12:56 ` [PATCH 26/40] autonuma: default mempolicy follow AutoNUMA Andrea Arcangeli
2012-06-28 12:56   ` Andrea Arcangeli
2012-07-02  4:19   ` Rik van Riel
2012-07-02  4:19     ` Rik van Riel
2012-06-28 12:56 ` [PATCH 27/40] autonuma: call autonuma_split_huge_page() Andrea Arcangeli
2012-06-28 12:56   ` Andrea Arcangeli
2012-07-02  4:22   ` Rik van Riel
2012-07-02  4:22     ` Rik van Riel
2012-06-28 12:56 ` [PATCH 28/40] autonuma: make khugepaged pte_numa aware Andrea Arcangeli
2012-06-28 12:56   ` Andrea Arcangeli
2012-07-02  4:24   ` Rik van Riel
2012-07-02  4:24     ` Rik van Riel
2012-07-12 18:50     ` Andrea Arcangeli
2012-07-12 18:50       ` Andrea Arcangeli
2012-07-12 21:25       ` Rik van Riel
2012-07-12 21:25         ` Rik van Riel
2012-06-28 12:56 ` [PATCH 29/40] autonuma: retain page last_nid information in khugepaged Andrea Arcangeli
2012-06-28 12:56   ` Andrea Arcangeli
2012-07-02  4:33   ` Rik van Riel
2012-07-02  4:33     ` Rik van Riel
2012-06-28 12:56 ` [PATCH 30/40] autonuma: numa hinting page faults entry points Andrea Arcangeli
2012-06-28 12:56   ` Andrea Arcangeli
2012-07-02  4:47   ` Rik van Riel
2012-07-02  4:47     ` Rik van Riel
2012-06-28 12:56 ` [PATCH 31/40] autonuma: reset autonuma page data when pages are freed Andrea Arcangeli
2012-06-28 12:56   ` Andrea Arcangeli
2012-07-02  4:49   ` Rik van Riel
2012-07-02  4:49     ` Rik van Riel
2012-06-28 12:56 ` [PATCH 32/40] autonuma: initialize page structure fields Andrea Arcangeli
2012-06-28 12:56   ` Andrea Arcangeli
2012-07-02  4:50   ` Rik van Riel
2012-07-02  4:50     ` Rik van Riel
2012-06-28 12:56 ` [PATCH 33/40] autonuma: link mm/autonuma.o and kernel/sched/numa.o Andrea Arcangeli
2012-06-28 12:56   ` Andrea Arcangeli
2012-07-02  4:56   ` Rik van Riel
2012-07-02  4:56     ` Rik van Riel
2012-06-28 12:56 ` [PATCH 34/40] autonuma: add CONFIG_AUTONUMA and CONFIG_AUTONUMA_DEFAULT_ENABLED Andrea Arcangeli
2012-06-28 12:56   ` Andrea Arcangeli
2012-07-02  4:58   ` Rik van Riel
2012-07-02  4:58     ` Rik van Riel
2012-06-28 12:56 ` [PATCH 35/40] autonuma: boost khugepaged scanning rate Andrea Arcangeli
2012-06-28 12:56   ` Andrea Arcangeli
2012-07-02  5:12   ` Rik van Riel
2012-07-02  5:12     ` Rik van Riel
2012-06-28 12:56 ` [PATCH 36/40] autonuma: page_autonuma Andrea Arcangeli
2012-06-28 12:56   ` Andrea Arcangeli
2012-06-30  5:24   ` Konrad Rzeszutek Wilk
2012-06-30  5:24     ` Konrad Rzeszutek Wilk
2012-07-12 19:43     ` Andrea Arcangeli
2012-07-12 19:43       ` Andrea Arcangeli
2012-07-02  6:37   ` Rik van Riel
2012-07-02  6:37     ` Rik van Riel
2012-07-12 19:58     ` Andrea Arcangeli
2012-07-12 19:58       ` Andrea Arcangeli
2012-06-28 12:56 ` [PATCH 37/40] autonuma: page_autonuma change #include for sparse Andrea Arcangeli
2012-06-28 12:56   ` Andrea Arcangeli
2012-07-02  6:22   ` Rik van Riel
2012-07-02  6:22     ` Rik van Riel
2012-06-28 12:56 ` [PATCH 38/40] autonuma: autonuma_migrate_head[0] dynamic size Andrea Arcangeli
2012-06-28 12:56   ` Andrea Arcangeli
2012-07-02  5:15   ` Rik van Riel
2012-07-02  5:15     ` Rik van Riel
2012-06-28 12:56 ` [PATCH 39/40] autonuma: bugcheck page_autonuma fields on newly allocated pages Andrea Arcangeli
2012-06-28 12:56   ` Andrea Arcangeli
2012-07-02  6:40   ` Rik van Riel
2012-07-02  6:40     ` Rik van Riel
2012-06-28 12:56 ` [PATCH 40/40] autonuma: shrink the per-page page_autonuma struct size Andrea Arcangeli
2012-06-28 12:56   ` Andrea Arcangeli
2012-07-02  7:18   ` Rik van Riel
2012-07-02  7:18     ` Rik van Riel
2012-07-12 20:21     ` Andrea Arcangeli
2012-07-12 20:21       ` Andrea Arcangeli
2012-07-09 15:40 ` [PATCH 00/40] AutoNUMA19 Rik van Riel
2012-07-09 15:40   ` Rik van Riel
