linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/36] AutoNUMA24
@ 2012-08-22 14:58 Andrea Arcangeli
  2012-08-22 14:58 ` [PATCH 01/36] autonuma: make set_pmd_at always available Andrea Arcangeli
                   ` (36 more replies)
  0 siblings, 37 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:58 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Hello everyone,

Before the Kernel Summit, I think it's a good idea to post a new
AutoNUMA24 and to go through a new review cycle. The last review cycle
was fundamental in improving the patchset. Thanks!

The objective of AutoNUMA is to perform as close as possible to (and
sometimes faster than) hard NUMA CPU/memory binding setups, without
requiring the administrator to manually set up any hard NUMA bindings.

I hope everyone sees this is a hard problem: what one thinks will work
great in theory may not run so great when tested in practice. But I'd
like to remind everyone that all research is good and valuable. All
approaches to solving the problem are worthwhile, regardless of
whether they work better or worse. The sched-numa rewrite is also a
very interesting approach and I hope everyone agrees it's wonderful
that both ways of solving the problem are being researched. Whatever
is merged upstream in the end won't change the fact that all the work
done to try to solve this hard problem is valuable and worthwhile.

AutoNUMA24 branch:

git clone --reference linux -b autonuma24 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git

Development autonuma branch:

git clone --reference linux -b autonuma git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git

To update:

git fetch
git checkout -f origin/autonuma

PDF with some benchmark results:

http://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma-vs-sched-numa-rewrite-20120817.pdf
http://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma_bench-20120530.pdf

Changelog from AutoNUMA19 to AutoNUMA24:

o Improved lots of comments and header commit messages.

o Rewrote from scratch the comment at the top of kernel/sched/numa.c,
  as the old comment wasn't well received in upstream reviews. It now
  tries to describe the algorithm from a global point of view.

o Added ppc64 support.

o Improved patch splitup.

o Lots of code cleanups and variable renames to make the code more readable.

o Try to take advantage of task_autonuma_nid before the knuma_scand is
  complete.

o Moved some performance tuning sysfs tweaks under DEBUG_VM so they
  won't be visible on production kernels.

o Enabled by default the working set mode for the mm_autonuma data
  collection.

o Halved the size of the mm_autonuma structure.

o scan_sleep_pass_millisecs is now more intuitive (you can set it to
  10000 to get one pass every 10 sec; in the previous release it had
  to be set to 5000 to get one pass every 10 sec). See the userland
  sketch after this changelog.

o Removed PF_THREAD_BOUND to allow CPU isolation. Turned the VM_BUG_ON
  verifying the hard binding into a WARN_ON_ONCE so the knuma_migrated
  can be moved by root anywhere safely.

o Optimized autonuma_possible() to avoid checking num_possible_nodes()
  every time.

o Added the math on the last_nid statistical effects from the
  sched-numa rewrite (which also incorporated the last_nid logic of
  AutoNUMA).

o Now handle systems with holes in the NUMA nodemask. Lots of
  num_possible_nodes() uses were replaced with nr_node_ids
  (nr_node_ids is not so nice a name for such information).

o Fixed a bug affecting KSM. KSM failed to merge pages mapped with a
  pte_numa pte; LTP found it, and now KSM passes LTP fine.

o Fixed repeated CPU scheduler migrations in sched_autonuma_balance()
  (the idle load balancing sometimes was faster and moved the task
  back to its previous CPU before it had a chance to be scheduled on
  the destination CPU).

o More...
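
Just as an illustration of the scan_sleep_pass_millisecs item above,
here is a minimal userland sketch that sets one pass every 10 seconds;
the exact sysfs path is an assumption based on the
/sys/kernel/mm/autonuma/knuma_scand/ layout used elsewhere in this
series:

	#include <stdio.h>

	int main(void)
	{
		/* one knuma_scand pass every 10 seconds */
		FILE *f = fopen("/sys/kernel/mm/autonuma/knuma_scand/"
				"scan_sleep_pass_millisecs", "w");

		if (!f)
			return 1;
		fprintf(f, "10000\n");
		return fclose(f) ? 1 : 0;
	}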

Changelog from AutoNUMA-alpha14 to AutoNUMA19:

o sched_autonuma_balance callout location removed from schedule(); it
  now runs in the softirq along with the CFS load balancing

o lots of documentation about the math in the sched_autonuma_balance algorithm

o fixed a bug in the fast path detection in sched_autonuma_balance that could
  decrease performance with many nodes

o reduced the page_autonuma memory overhead from 32 to 12 bytes per page

o fixed a crash in __pmd_numa_fixup

o knuma_scand won't scan VM_MIXEDMAP|VM_PFNMAP vmas (it never touched
  those ptes anyway)

o fixed a crash in autonuma_exit

o fixed a crash when split_huge_page returns 0 in knuma_migratedN as the page
  has been freed already

o assorted cleanups and probably more

Changelog from alpha13 to alpha14:

o page_autonuma introduction: no memory is wasted if the kernel is
  booted on non-NUMA hardware. Tested with flatmem/sparsemem on x86
  autonuma=y/n and sparsemem/vsparsemem on x86_64 with autonuma=y/n.
  The "noautonuma" kernel param disables autonuma permanently, also
  when booted on NUMA hardware (no /sys/kernel/mm/autonuma, and no
  page_autonuma allocations, like cgroup_disable=memory)

o autonuma_balance only runs along with run_rebalance_domains, to
  avoid altering the usual scheduler runtime. autonuma_balance gives a
  "kick" to the scheduler after a rebalance (it overrides the load
  balance activity if needed). It's not yet tested on specjbb or more
  scheduling-intensive benchmarks; hopefully there's no NUMA
  regression. For compute-intensive loads not involving a flood of
  scheduling activity this doesn't show any performance regression,
  and it avoids altering the strict scheduler performance. It goes in
  the direction of being less intrusive to the stock scheduler
  runtime.

  Note: autonuma_balance still runs from normal context (not softirq
  context like run_rebalance_domains) to be able to wait on process
  migration (avoid _nowait), but most of the time it does nothing at
  all.

Changelog from alpha11 to alpha13:

o autonuma_balance optimization (take the fast path when the process
  is in the preferred NUMA node)

TODO:

o THP native migration (orthogonal and also needed for
  cpuset/migrate_pages(2)/numa/sched).

Andrea Arcangeli (35):
  autonuma: make set_pmd_at always available
  autonuma: export is_vma_temporary_stack() even if
    CONFIG_TRANSPARENT_HUGEPAGE=n
  autonuma: define _PAGE_NUMA_PTE and _PAGE_NUMA_PMD
  autonuma: pte_numa() and pmd_numa()
  autonuma: teach gup_fast about pmd_numa
  autonuma: introduce kthread_bind_node()
  autonuma: mm_autonuma and task_autonuma data structures
  autonuma: define the autonuma flags
  autonuma: core autonuma.h header
  autonuma: CPU follows memory algorithm
  autonuma: add page structure fields
  autonuma: knuma_migrated per NUMA node queues
  autonuma: autonuma_enter/exit
  autonuma: call autonuma_setup_new_exec()
  autonuma: alloc/free/init task_autonuma
  autonuma: alloc/free/init mm_autonuma
  autonuma: prevent select_task_rq_fair to return -1
  autonuma: teach CFS about autonuma affinity
  autonuma: memory follows CPU algorithm and task/mm_autonuma stats
    collection
  autonuma: default mempolicy follow AutoNUMA
  autonuma: call autonuma_split_huge_page()
  autonuma: make khugepaged pte_numa aware
  autonuma: retain page last_nid information in khugepaged
  autonuma: numa hinting page faults entry points
  autonuma: reset autonuma page data when pages are freed
  autonuma: link mm/autonuma.o and kernel/sched/numa.o
  autonuma: add CONFIG_AUTONUMA and CONFIG_AUTONUMA_DEFAULT_ENABLED
  autonuma: page_autonuma
  autonuma: autonuma_migrate_head[0] dynamic size
  autonuma: bugcheck page_autonuma fields on newly allocated pages
  autonuma: shrink the per-page page_autonuma struct size
  autonuma: boost khugepaged scanning rate
  autonuma: make the AUTONUMA_SCAN_PMD_FLAG conditional to
    CONFIG_HAVE_ARCH_AUTONUMA_SCAN_PMD
  autonuma: add knuma_migrated/allow_first_fault in sysfs
  autonuma: add mm_autonuma working set estimation

Vaidyanathan Srinivasan (1):
  autonuma: powerpc port

 arch/Kconfig                              |    6 +
 arch/powerpc/Kconfig                      |    6 +
 arch/powerpc/include/asm/pgtable.h        |   48 +-
 arch/powerpc/include/asm/pte-hash64-64k.h |    4 +-
 arch/powerpc/mm/numa.c                    |    3 +-
 arch/x86/Kconfig                          |    2 +
 arch/x86/include/asm/paravirt.h           |    2 -
 arch/x86/include/asm/pgtable.h            |   65 ++-
 arch/x86/include/asm/pgtable_types.h      |   28 +
 arch/x86/mm/gup.c                         |   13 +-
 arch/x86/mm/numa.c                        |    6 +-
 arch/x86/mm/numa_32.c                     |    3 +-
 fs/exec.c                                 |    7 +
 include/asm-generic/pgtable.h             |   12 +
 include/linux/autonuma.h                  |   72 ++
 include/linux/autonuma_flags.h            |  168 +++
 include/linux/autonuma_list.h             |  100 ++
 include/linux/autonuma_sched.h            |   50 +
 include/linux/autonuma_types.h            |  169 +++
 include/linux/huge_mm.h                   |    6 +-
 include/linux/kthread.h                   |    1 +
 include/linux/memory_hotplug.h            |    3 +-
 include/linux/mm_types.h                  |    5 +
 include/linux/mmzone.h                    |   38 +
 include/linux/page_autonuma.h             |   59 +
 include/linux/sched.h                     |    3 +
 init/main.c                               |    2 +
 kernel/fork.c                             |   18 +
 kernel/kthread.c                          |   21 +
 kernel/sched/Makefile                     |    1 +
 kernel/sched/core.c                       |    1 +
 kernel/sched/fair.c                       |   86 ++-
 kernel/sched/numa.c                       |  604 ++++++++++
 kernel/sched/sched.h                      |   19 +
 mm/Kconfig                                |   17 +
 mm/Makefile                               |    1 +
 mm/autonuma.c                             | 1727 +++++++++++++++++++++++++++++
 mm/autonuma_list.c                        |  169 +++
 mm/huge_memory.c                          |   78 ++-
 mm/memory.c                               |   31 +
 mm/memory_hotplug.c                       |    2 +-
 mm/mempolicy.c                            |   12 +-
 mm/mmu_context.c                          |    3 +
 mm/page_alloc.c                           |    7 +-
 mm/page_autonuma.c                        |  248 +++++
 mm/sparse.c                               |  126 ++-
 46 files changed, 4014 insertions(+), 38 deletions(-)
 create mode 100644 include/linux/autonuma.h
 create mode 100644 include/linux/autonuma_flags.h
 create mode 100644 include/linux/autonuma_list.h
 create mode 100644 include/linux/autonuma_sched.h
 create mode 100644 include/linux/autonuma_types.h
 create mode 100644 include/linux/page_autonuma.h
 create mode 100644 kernel/sched/numa.c
 create mode 100644 mm/autonuma.c
 create mode 100644 mm/autonuma_list.c
 create mode 100644 mm/page_autonuma.c



* [PATCH 01/36] autonuma: make set_pmd_at always available
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
@ 2012-08-22 14:58 ` Andrea Arcangeli
  2012-08-22 14:58 ` [PATCH 02/36] autonuma: export is_vma_temporary_stack() even if CONFIG_TRANSPARENT_HUGEPAGE=n Andrea Arcangeli
                   ` (35 subsequent siblings)
  36 siblings, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:58 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

set_pmd_at() will also be used for the knuma_scand/pmd = 1 (default)
mode even when TRANSPARENT_HUGEPAGE=n. Make it available so the build
won't fail.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/include/asm/paravirt.h |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index a0facf3..5edd174 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -528,7 +528,6 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
 		PVOP_VCALL4(pv_mmu_ops.set_pte_at, mm, addr, ptep, pte.pte);
 }
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 			      pmd_t *pmdp, pmd_t pmd)
 {
@@ -539,7 +538,6 @@ static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 		PVOP_VCALL4(pv_mmu_ops.set_pmd_at, mm, addr, pmdp,
 			    native_pmd_val(pmd));
 }
-#endif
 
 static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
 {


* [PATCH 02/36] autonuma: export is_vma_temporary_stack() even if CONFIG_TRANSPARENT_HUGEPAGE=n
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
  2012-08-22 14:58 ` [PATCH 01/36] autonuma: make set_pmd_at always available Andrea Arcangeli
@ 2012-08-22 14:58 ` Andrea Arcangeli
  2012-08-22 14:58 ` [PATCH 03/36] autonuma: define _PAGE_NUMA_PTE and _PAGE_NUMA_PMD Andrea Arcangeli
                   ` (34 subsequent siblings)
  36 siblings, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:58 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

is_vma_temporary_stack() is needed by mm/autonuma.c too, and without
this the build breaks with CONFIG_TRANSPARENT_HUGEPAGE=n.

Reported-by: Petr Holasek <pholasek@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/huge_mm.h |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4c59b11..ad4e2e0 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -54,13 +54,13 @@ extern pmd_t *page_check_address_pmd(struct page *page,
 #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
 #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
 
+extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define HPAGE_PMD_SHIFT HPAGE_SHIFT
 #define HPAGE_PMD_MASK HPAGE_MASK
 #define HPAGE_PMD_SIZE HPAGE_SIZE
 
-extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
-
 #define transparent_hugepage_enabled(__vma)				\
 	((transparent_hugepage_flags &					\
 	  (1<<TRANSPARENT_HUGEPAGE_FLAG) ||				\


* [PATCH 03/36] autonuma: define _PAGE_NUMA_PTE and _PAGE_NUMA_PMD
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
  2012-08-22 14:58 ` [PATCH 01/36] autonuma: make set_pmd_at always available Andrea Arcangeli
  2012-08-22 14:58 ` [PATCH 02/36] autonuma: export is_vma_temporary_stack() even if CONFIG_TRANSPARENT_HUGEPAGE=n Andrea Arcangeli
@ 2012-08-22 14:58 ` Andrea Arcangeli
  2012-08-22 14:58 ` [PATCH 04/36] autonuma: pte_numa() and pmd_numa() Andrea Arcangeli
                   ` (33 subsequent siblings)
  36 siblings, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:58 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

We will set these bitflags only when the pmd or pte is not present.

They work like PROTNONE but they identify a request for the numa
hinting page fault to trigger.

Because we want to be able to set these bitflags in any established
pte or pmd (while clearing the present bit at the same time) without
losing information, these bitflags must never be set when the pte and
pmd are present.

For _PAGE_NUMA_PTE the pte bitflag used is _PAGE_PSE, which is never
otherwise set on ptes, and it also fits in between _PAGE_FILE and
_PAGE_PROTNONE, which avoids having to alter the swp entry format.

For _PAGE_NUMA_PMD, we use a reserved bitflag. pmds never contain
swap entries, but if in the future we swap transparent hugepages, we
must keep in mind not to use the _PAGE_UNUSED2 bitflag in the swap
entry format and to start the swap entry offset above it.

_PAGE_UNUSED2 is used by Xen, but only on ptes established by ioremap,
never on pmds, so there's no risk of collision with Xen.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/include/asm/pgtable_types.h |   28 ++++++++++++++++++++++++++++
 1 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 013286a..400d771 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -64,6 +64,34 @@
 #define _PAGE_FILE	(_AT(pteval_t, 1) << _PAGE_BIT_FILE)
 #define _PAGE_PROTNONE	(_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
 
+/*
+ * _PAGE_NUMA_PTE indicates that this page will trigger a numa hinting
+ * minor page fault to gather autonuma statistics (see
+ * pte_numa()). The bit picked (7) is purposefully in between the
+ * _PAGE_FILE (6) and _PAGE_PROTNONE (8) bits. Therefore, it doesn't
+ * require changes to the swp entry format because that bit is always
+ * zero when the pte is not present. It is also always zero when the
+ * pte is present (_PAGE_PAT (7) is never set on the pte according to
+ * arch/x86/mm/pat.c).
+ *
+ * The bit picked must be always zero when the pmd is present and not
+ * present, so that we don't lose information when we set it while
+ * atomically clearing the present bit.
+ */
+#define _PAGE_NUMA_PTE	_PAGE_PSE
+/*
+ * _PAGE_NUMA_PMD indicates that this page will trigger a numa hinting
+ * minor page fault to gather autonuma statistics (see
+ * pmd_numa())._PAGE_IOMAP is used by Xen but only on the pte, never
+ * on the pmd. If transparent hugepages will be swapped out natively,
+ * the swap entry offset will have to start above _PAGE_IOMAP.
+ *
+ * The bit picked must be always zero when the pmd is present and not
+ * present, so that we don't lose information when we set it while
+ * atomically clearing the present bit.
+ */
+#define _PAGE_NUMA_PMD	_PAGE_IOMAP
+
 #define _PAGE_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER |	\
 			 _PAGE_ACCESSED | _PAGE_DIRTY)
 #define _KERNPG_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED |	\

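To make the invariant described in the commit message concrete, here
is a small sketch (not part of the patch) of compile-time checks one
could write with these definitions in scope; the function name is
hypothetical:

	/*
	 * The NUMA bits must never overlap _PAGE_PRESENT, so setting
	 * them while atomically clearing the present bit cannot lose
	 * information.
	 */
	static inline void autonuma_page_bits_sanity_check(void)
	{
		BUILD_BUG_ON(_PAGE_NUMA_PTE & _PAGE_PRESENT);
		BUILD_BUG_ON(_PAGE_NUMA_PMD & _PAGE_PRESENT);
	}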

* [PATCH 04/36] autonuma: pte_numa() and pmd_numa()
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
                   ` (2 preceding siblings ...)
  2012-08-22 14:58 ` [PATCH 03/36] autonuma: define _PAGE_NUMA_PTE and _PAGE_NUMA_PMD Andrea Arcangeli
@ 2012-08-22 14:58 ` Andrea Arcangeli
  2012-08-22 14:58 ` [PATCH 05/36] autonuma: teach gup_fast about pmd_numa Andrea Arcangeli
                   ` (32 subsequent siblings)
  36 siblings, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:58 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Implement pte_numa and pmd_numa.

We must atomically set the numa bit and clear the present bit to
define a pte_numa or pmd_numa.

Once a pte or pmd has been set as pte_numa or pmd_numa, the next time
a thread touches a virtual address in the corresponding virtual range,
a NUMA hinting page fault will trigger. The NUMA hinting page fault
will clear the NUMA bit and set the present bit again to resolve the
page fault.

NUMA hinting page faults are used:

1) to fill in the per-thread NUMA statistics stored for each thread in
   its current->task_autonuma data structure

2) to track the last_nid (the node of the last NUMA hinting fault)
   information in the page structure to detect false sharing

3) to queue the page mapped by the pte_numa or pmd_numa for async
   migration if there have been enough NUMA hinting page faults on the
   page coming from remote CPUs

NUMA hinting page faults collect information and possibly add pages to
migrate queues. They are extremely quick, absolutely non-blocking and
do not allocate memory.

The generic implementation is used when CONFIG_AUTONUMA=n.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/include/asm/pgtable.h |   65 ++++++++++++++++++++++++++++++++++++++-
 include/asm-generic/pgtable.h  |   12 +++++++
 2 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index b49e70d..bfe42aa 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -405,7 +405,8 @@ static inline int pte_same(pte_t a, pte_t b)
 
 static inline int pte_present(pte_t a)
 {
-	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
+	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
+			       _PAGE_NUMA_PTE);
 }
 
 static inline int pte_hidden(pte_t pte)
@@ -421,7 +422,63 @@ static inline int pmd_present(pmd_t pmd)
 	 * the _PAGE_PSE flag will remain set at all times while the
 	 * _PAGE_PRESENT bit is clear).
 	 */
-	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE);
+	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE |
+				 _PAGE_NUMA_PMD);
+}
+
+#ifdef CONFIG_AUTONUMA
+/*
+ * _PAGE_NUMA_PTE and _PAGE_NUMA_PMD work identically to
+ * _PAGE_PROTNONE. They're set only when _PAGE_PRESENT is not
+ * set and they're never set if _PAGE_PRESENT is set.
+ *
+ * pte/pmd_present() returns true if pte/pmd_numa returns true. Page
+ * fault triggers on those regions if pte/pmd_numa returns true
+ * (because _PAGE_PRESENT is not set).
+ */
+static inline int pte_numa(pte_t pte)
+{
+	return (pte_flags(pte) &
+		(_PAGE_NUMA_PTE|_PAGE_PRESENT)) == _PAGE_NUMA_PTE;
+}
+
+static inline int pmd_numa(pmd_t pmd)
+{
+	return (pmd_flags(pmd) &
+		(_PAGE_NUMA_PMD|_PAGE_PRESENT)) == _PAGE_NUMA_PMD;
+}
+#endif
+
+/*
+ * pte/pmd_mknuma sets the _PAGE_ACCESSED bitflag automatically
+ * because they're called by the NUMA hinting minor page fault. If we
+ * wouldn't set the _PAGE_ACCESSED bitflag here, the TLB miss handler
+ * would be forced to set it later while filling the TLB after we
+ * return to userland. That would trigger a second write to memory
+ * that we optimize away by setting _PAGE_ACCESSED here.
+ */
+static inline pte_t pte_mknonnuma(pte_t pte)
+{
+	pte = pte_clear_flags(pte, _PAGE_NUMA_PTE);
+	return pte_set_flags(pte, _PAGE_PRESENT|_PAGE_ACCESSED);
+}
+
+static inline pmd_t pmd_mknonnuma(pmd_t pmd)
+{
+	pmd = pmd_clear_flags(pmd, _PAGE_NUMA_PMD);
+	return pmd_set_flags(pmd, _PAGE_PRESENT|_PAGE_ACCESSED);
+}
+
+static inline pte_t pte_mknuma(pte_t pte)
+{
+	pte = pte_set_flags(pte, _PAGE_NUMA_PTE);
+	return pte_clear_flags(pte, _PAGE_PRESENT);
+}
+
+static inline pmd_t pmd_mknuma(pmd_t pmd)
+{
+	pmd = pmd_set_flags(pmd, _PAGE_NUMA_PMD);
+	return pmd_clear_flags(pmd, _PAGE_PRESENT);
 }
 
 static inline int pmd_none(pmd_t pmd)
@@ -480,6 +537,10 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 
 static inline int pmd_bad(pmd_t pmd)
 {
+#ifdef CONFIG_AUTONUMA
+	if (pmd_numa(pmd))
+		return 0;
+#endif
 	return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
 }
 
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index ff4947b..0ff87ec 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -530,6 +530,18 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
 #endif
 }
 
+#ifndef CONFIG_AUTONUMA
+static inline int pte_numa(pte_t pte)
+{
+	return 0;
+}
+
+static inline int pmd_numa(pmd_t pmd)
+{
+	return 0;
+}
+#endif /* CONFIG_AUTONUMA */
+
 #endif /* CONFIG_MMU */
 
 #endif /* !__ASSEMBLY__ */

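As a rough illustration of how these helpers fit together, here is a
sketch of resolving a NUMA hinting fault on a pte. The function name is
hypothetical, and locking, statistics collection and migration queueing
are omitted (the real entry points are added later in this series):

	static void numa_hinting_fault_fixup_sketch(struct vm_area_struct *vma,
						    unsigned long addr,
						    pte_t *ptep)
	{
		pte_t pte = *ptep;

		if (!pte_numa(pte))
			return;	/* not a NUMA hinting fault */

		/* Make the pte present again and mark it accessed. */
		pte = pte_mknonnuma(pte);
		set_pte_at(vma->vm_mm, addr, ptep, pte);
		update_mmu_cache(vma, addr, ptep);
	}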

* [PATCH 05/36] autonuma: teach gup_fast about pmd_numa
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
                   ` (3 preceding siblings ...)
  2012-08-22 14:58 ` [PATCH 04/36] autonuma: pte_numa() and pmd_numa() Andrea Arcangeli
@ 2012-08-22 14:58 ` Andrea Arcangeli
  2012-08-22 14:58 ` [PATCH 06/36] autonuma: introduce kthread_bind_node() Andrea Arcangeli
                   ` (31 subsequent siblings)
  36 siblings, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:58 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

In the special "pmd" mode of knuma_scand
(/sys/kernel/mm/autonuma/knuma_scand/pmd == 1), the pmd may be of numa
type (_PAGE_PRESENT not set) while the ptes under it may be
present. Therefore, gup_pmd_range() must return 0 in this case to
avoid losing a NUMA hinting page fault during gup_fast.

Note: gup_fast will skip over non present ptes (like numa types), so
no explicit check is needed for the pte_numa case. gup_fast will also
skip over THP when the trans huge pmd is non present. So, the pmd_numa
case will also be correctly skipped with no additional code changes
required.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/mm/gup.c |   13 ++++++++++++-
 1 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index dd74e46..02c5ec5 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -163,8 +163,19 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 		 * can't because it has irq disabled and
 		 * wait_split_huge_page() would never return as the
 		 * tlb flush IPI wouldn't run.
+		 *
+		 * The pmd_numa() check is needed because the code
+		 * doesn't check the _PAGE_PRESENT bit of the pmd if
+		 * the gup_pte_range() path is taken. NOTE: not all
+		 * gup_fast users will will access the page contents
+		 * gup_fast users will access the page contents
+		 * KVM does. So we're forced to trigger NUMA hinting
+		 * page faults unconditionally for all gup_fast users
+		 * even though NUMA hinting page faults aren't useful
+		 * to I/O drivers that will access the page with DMA
+		 * and not with the CPU.
 		 */
-		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+		if (pmd_none(pmd) || pmd_trans_splitting(pmd) || pmd_numa(pmd))
 			return 0;
 		if (unlikely(pmd_large(pmd))) {
 			if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))


* [PATCH 06/36] autonuma: introduce kthread_bind_node()
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
                   ` (4 preceding siblings ...)
  2012-08-22 14:58 ` [PATCH 05/36] autonuma: teach gup_fast about pmd_numa Andrea Arcangeli
@ 2012-08-22 14:58 ` Andrea Arcangeli
  2012-08-22 14:58 ` [PATCH 07/36] autonuma: mm_autonuma and task_autonuma data structures Andrea Arcangeli
                   ` (30 subsequent siblings)
  36 siblings, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:58 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

This function makes it easy to bind the per-node knuma_migrated
threads to their respective NUMA nodes. Those threads take memory from
the other nodes (in round robin, with an incoming queue for each
remote node) and move that memory to their local node.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/kthread.h |    1 +
 kernel/kthread.c        |   21 +++++++++++++++++++++
 2 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 22ccf9d..5901aad 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -33,6 +33,7 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
 })
 
 void kthread_bind(struct task_struct *k, unsigned int cpu);
+void kthread_bind_node(struct task_struct *p, int nid);
 int kthread_stop(struct task_struct *k);
 int kthread_should_stop(void);
 bool kthread_freezable_should_stop(bool *was_frozen);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index b579af5..0034e5f 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -234,6 +234,27 @@ void kthread_bind(struct task_struct *p, unsigned int cpu)
 EXPORT_SYMBOL(kthread_bind);
 
 /**
+ * kthread_bind_node - bind a just-created kthread to the CPUs of a node.
+ * @p: thread created by kthread_create().
+ * @nid: node (might not be online, must be possible) for @p to run on.
+ *
+ * Description: This function is equivalent to set_cpus_allowed(),
+ * except that @nid doesn't need to be online, and the thread must be
+ * stopped (i.e., just returned from kthread_create()).
+ */
+void kthread_bind_node(struct task_struct *p, int nid)
+{
+	/* Must have done schedule() in kthread() before we set_task_cpu */
+	if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
+		WARN_ON(1);
+		return;
+	}
+
+	/* It's safe because the task is inactive. */
+	do_set_cpus_allowed(p, cpumask_of_node(nid));
+}
+
+/**
  * kthread_stop - stop a thread created by kthread_create().
  * @k: thread created by kthread_create().
  *

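For illustration, a minimal sketch of how a per-node daemon could be
created and bound with this helper. knuma_migrated_sketch_fn() and the
thread naming below are assumptions for the example, not code from
this series:

	static int knuma_migrated_sketch_fn(void *arg)
	{
		/* placeholder loop; the real daemon lives in mm/autonuma.c */
		while (!kthread_should_stop())
			schedule_timeout_interruptible(HZ);
		return 0;
	}

	static int __init knuma_migrated_start_sketch(void)
	{
		int nid;

		for_each_node_state(nid, N_POSSIBLE) {
			struct task_struct *tsk;

			tsk = kthread_create_on_node(knuma_migrated_sketch_fn,
						     NULL, nid,
						     "knuma_migrated%d", nid);
			if (IS_ERR(tsk))
				return PTR_ERR(tsk);
			kthread_bind_node(tsk, nid);
			wake_up_process(tsk);
		}
		return 0;
	}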

* [PATCH 07/36] autonuma: mm_autonuma and task_autonuma data structures
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
                   ` (5 preceding siblings ...)
  2012-08-22 14:58 ` [PATCH 06/36] autonuma: introduce kthread_bind_node() Andrea Arcangeli
@ 2012-08-22 14:58 ` Andrea Arcangeli
  2012-08-22 14:58 ` [PATCH 08/36] autonuma: define the autonuma flags Andrea Arcangeli
                   ` (29 subsequent siblings)
  36 siblings, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:58 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Define the two data structures that collect the per-process (in the
mm) and per-thread (in the task_struct) statistical information that
is the input of the CPU-follows-memory algorithm in the NUMA
scheduler.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma_types.h |  107 ++++++++++++++++++++++++++++++++++++++++
 1 files changed, 107 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/autonuma_types.h

diff --git a/include/linux/autonuma_types.h b/include/linux/autonuma_types.h
new file mode 100644
index 0000000..9673ce8
--- /dev/null
+++ b/include/linux/autonuma_types.h
@@ -0,0 +1,107 @@
+#ifndef _LINUX_AUTONUMA_TYPES_H
+#define _LINUX_AUTONUMA_TYPES_H
+
+#ifdef CONFIG_AUTONUMA
+
+#include <linux/numa.h>
+
+
+/*
+ * Per-mm (per-process) structure that contains the NUMA memory
+ * placement statistics generated by the knuma scan daemon. This
+ * structure is dynamically allocated only if AutoNUMA is possible on
+ * this system. They are linked together in a list headed within the
+ * knumad_scan structure.
+ */
+struct mm_autonuma {
+	/* link for knuma_scand's list of mm structures to scan */
+	struct list_head mm_node;
+	/* Pointer to associated mm structure */
+	struct mm_struct *mm;
+
+	/*
+	 * Zeroed from here during allocation, check
+	 * mm_autonuma_reset() if you alter the below.
+	 */
+
+	/*
+	 * Pass counter for this mm. This exist only to be able to
+	 * tell when it's time to apply the exponential backoff on the
+	 * task_autonuma statistics.
+	 */
+	unsigned long mm_numa_fault_pass;
+	/* Total number of pages that will trigger NUMA faults for this mm */
+	unsigned long mm_numa_fault_tot;
+	/* Number of pages that will trigger NUMA faults for each [nid] */
+	unsigned long mm_numa_fault[0];
+	/* do not add more variables here, the above array size is dynamic */
+};
+
+extern int alloc_mm_autonuma(struct mm_struct *mm);
+extern void free_mm_autonuma(struct mm_struct *mm);
+extern void __init mm_autonuma_init(void);
+
+/*
+ * Per-task (thread) structure that contains the NUMA memory placement
+ * statistics generated by the knuma scan daemon. This structure is
+ * dynamically allocated only if AutoNUMA is possible on this
+ * system. They are linked together in a list headed within the
+ * knumad_scan structure.
+ */
+struct task_autonuma {
+	/* node id the CPU scheduler should try to stick with (-1 if none) */
+	int task_selected_nid;
+
+	/*
+	 * Zeroed from here during allocation, check
+	 * mm_autonuma_reset() if you alter the below.
+	 */
+
+	/*
+	 * Pass counter for this task. When the pass counter is found
+	 * out of sync with the mm_numa_fault_pass we know it's time
+	 * to apply the exponential backoff on the task_autonuma
+	 * statistics, and then we synchronize it with
+	 * mm_numa_fault_pass. This pass counter is needed because in
+	 * knuma_scand we work on the mm and we've no visibility on
+	 * the task_autonuma. Furthermore it would be detrimental to
+	 * apply exponential backoff to all task_autonuma associated
+	 * to a certain mm_autonuma (potentially zeroing out the trail
+	 * of statistical data in task_autonuma) if the task is idle
+	 * for a long period of time (i.e. several knuma_scand passes).
+	 */
+	unsigned long task_numa_fault_pass;
+	/* Total number of eligible pages that triggered NUMA faults */
+	unsigned long task_numa_fault_tot;
+	/* Number of pages that triggered NUMA faults for each [nid] */
+	unsigned long task_numa_fault[0];
+	/* do not add more variables here, the above array size is dynamic */
+};
+
+extern int alloc_task_autonuma(struct task_struct *tsk,
+			       struct task_struct *orig,
+			       int node);
+extern void __init task_autonuma_init(void);
+extern void free_task_autonuma(struct task_struct *tsk);
+
+#else /* CONFIG_AUTONUMA */
+
+static inline int alloc_mm_autonuma(struct mm_struct *mm)
+{
+	return 0;
+}
+static inline void free_mm_autonuma(struct mm_struct *mm) {}
+static inline void mm_autonuma_init(void) {}
+
+static inline int alloc_task_autonuma(struct task_struct *tsk,
+				      struct task_struct *orig,
+				      int node)
+{
+	return 0;
+}
+static inline void task_autonuma_init(void) {}
+static inline void free_task_autonuma(struct task_struct *tsk) {}
+
+#endif /* CONFIG_AUTONUMA */
+
+#endif /* _LINUX_AUTONUMA_TYPES_H */

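To make the zero-length mm_numa_fault[] array concrete, here is a
sketch of how such a structure could be sized and allocated. The
function name is hypothetical and the real allocation code added later
in the series may differ in the details:

	static struct mm_autonuma *mm_autonuma_alloc_sketch(struct mm_struct *mm)
	{
		struct mm_autonuma *mma;
		/* header plus one mm_numa_fault[] counter per node id */
		size_t size = sizeof(*mma) +
			      nr_node_ids * sizeof(mma->mm_numa_fault[0]);

		mma = kzalloc(size, GFP_KERNEL);
		if (!mma)
			return NULL;
		mma->mm = mm;
		return mma;
	}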

* [PATCH 08/36] autonuma: define the autonuma flags
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
                   ` (6 preceding siblings ...)
  2012-08-22 14:58 ` [PATCH 07/36] autonuma: mm_autonuma and task_autonuma data structures Andrea Arcangeli
@ 2012-08-22 14:58 ` Andrea Arcangeli
  2012-08-22 14:58 ` [PATCH 09/36] autonuma: core autonuma.h header Andrea Arcangeli
                   ` (28 subsequent siblings)
  36 siblings, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:58 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

These flags are the ones tweaked through sysfs; they control the
behavior of AutoNUMA, from enabling/disabling it to selecting various
runtime options.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma_flags.h |  129 ++++++++++++++++++++++++++++++++++++++++
 1 files changed, 129 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/autonuma_flags.h

diff --git a/include/linux/autonuma_flags.h b/include/linux/autonuma_flags.h
new file mode 100644
index 0000000..f53203a
--- /dev/null
+++ b/include/linux/autonuma_flags.h
@@ -0,0 +1,129 @@
+#ifndef _LINUX_AUTONUMA_FLAGS_H
+#define _LINUX_AUTONUMA_FLAGS_H
+
+/*
+ * If CONFIG_AUTONUMA=n this file isn't included and only
+ * autonuma_possible() is defined (as false) in autonuma.h to allow
+ * optimizing away at compile time blocks of common code without using
+ * #ifdefs.
+ */
+#ifndef CONFIG_AUTONUMA
+#error "autonuma flags included by mistake"
+#endif
+
+enum autonuma_flag {
+	/*
+	 * Set if the kernel wasn't passed the "noautonuma" boot
+	 * parameter and the hardware is NUMA. If AutoNUMA is not
+	 * possible the value of all other flags becomes irrelevant
+	 * (they will never be checked) and AutoNUMA can't be enabled.
+	 *
+	 * No defaults: depends on hardware discovery and "noautonuma"
+	 * early param.
+	 */
+	AUTONUMA_POSSIBLE_FLAG,
+	/*
+	 * If AutoNUMA is possible, this defines if AutoNUMA is
+	 * currently enabled or disabled. It can be toggled at runtime
+	 * through sysfs.
+	 *
+	 * The default depends on CONFIG_AUTONUMA_DEFAULT_ENABLED.
+	 */
+	AUTONUMA_ENABLED_FLAG,
+	/*
+	 * If set through sysfs this will print lots of debug info
+	 * about the AutoNUMA activities in the kernel logs.
+	 *
+	 * Default not set.
+	 */
+	AUTONUMA_DEBUG_FLAG,
+	/*
+	 * This defines if CFS should prioritize between load
+	 * balancing fairness or NUMA affinity, if there are no idle
+	 * CPUs available. If this flag is set AutoNUMA will
+	 * prioritize on NUMA affinity and it will disregard
+	 * inter-node fairness.
+	 *
+	 * Default not set.
+	 */
+	AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG,
+	/*
+	 * This flag defines if the task/mm_autonuma statistics should
+	 * be inherithed from the parent task/process or instead if
+	 * be inherited from the parent task/process or instead if
+	 * task/mm_autonuma statistics are always cleared across
+	 * execve and there's no way to disable that.
+	 *
+	 * Default set.
+	 */
+	AUTONUMA_SCHED_RESET_FLAG,
+	/*
+	 * If set, this tells knuma_scand to trigger NUMA hinting page
+	 * faults at the pmd level instead of the pte level. This
+	 * reduces the number of NUMA hinting faults potentially
+	 * saving CPU time. It reduces the accuracy of the
+	 * task_autonuma statistics (but does not change the accuracy
+	 * of the mm_autonuma statistics). This flag can be toggled
+	 * through sysfs at runtime.
+	 *
+	 * This flag does not affect AutoNUMA with transparent
+	 * hugepages (THP). With THP the NUMA hinting page faults
+	 * always happen at the pmd level, regardless of the setting
+	 * of this flag. Note: there is no reduction in accuracy of
+	 * task_autonuma statistics with THP.
+	 *
+	 * Default set.
+	 */
+	AUTONUMA_SCAN_PMD_FLAG,
+	/*
+	 * If set, knuma_migrated will wake up in the middle of each
+	 * knuma_scand pass, regardless of how many pages have been
+	 * already queued. If not set knuma_migrated will wake up as
+	 * soon as the number of pages in the migration LRU reached a
+	 * certain threshold.
+	 *
+	 * Default not set.
+	 */
+	AUTONUMA_MIGRATE_DEFER_FLAG,
+};
+
+extern unsigned long autonuma_flags;
+
+static inline bool autonuma_possible(void)
+{
+	return test_bit(AUTONUMA_POSSIBLE_FLAG, &autonuma_flags);
+}
+
+static inline bool autonuma_enabled(void)
+{
+	return test_bit(AUTONUMA_ENABLED_FLAG, &autonuma_flags);
+}
+
+static inline bool autonuma_debug(void)
+{
+	return test_bit(AUTONUMA_DEBUG_FLAG, &autonuma_flags);
+}
+
+static inline bool autonuma_sched_load_balance_strict(void)
+{
+	return test_bit(AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG,
+			&autonuma_flags);
+}
+
+static inline bool autonuma_sched_reset(void)
+{
+	return test_bit(AUTONUMA_SCHED_RESET_FLAG,
+			&autonuma_flags);
+}
+
+static inline bool autonuma_scan_pmd(void)
+{
+	return test_bit(AUTONUMA_SCAN_PMD_FLAG, &autonuma_flags);
+}
+
+static inline bool autonuma_migrate_defer(void)
+{
+	return test_bit(AUTONUMA_MIGRATE_DEFER_FLAG, &autonuma_flags);
+}
+
+#endif /* _LINUX_AUTONUMA_FLAGS_H */

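For illustration, a sketch of how one of these flags could be flipped
at runtime (for instance from a sysfs store method); the helper name
is hypothetical and the sysfs wiring itself is not shown:

	static void autonuma_set_flag_sketch(enum autonuma_flag flag, bool val)
	{
		/* autonuma_flags is the shared bitmask declared above */
		if (val)
			set_bit(flag, &autonuma_flags);
		else
			clear_bit(flag, &autonuma_flags);
	}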

* [PATCH 09/36] autonuma: core autonuma.h header
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
                   ` (7 preceding siblings ...)
  2012-08-22 14:58 ` [PATCH 08/36] autonuma: define the autonuma flags Andrea Arcangeli
@ 2012-08-22 14:58 ` Andrea Arcangeli
  2012-08-22 14:58 ` [PATCH 10/36] autonuma: CPU follows memory algorithm Andrea Arcangeli
                   ` (27 subsequent siblings)
  36 siblings, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:58 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Header that defines the generic AutoNUMA specific functions.

All functions are defined unconditionally, but are only linked into
the kernel if CONFIG_AUTONUMA=y. When CONFIG_AUTONUMA=n, their call
sites are optimized away at build time (or the kernel wouldn't link).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma.h |   41 +++++++++++++++++++++++++++++++++++++++++
 1 files changed, 41 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/autonuma.h

diff --git a/include/linux/autonuma.h b/include/linux/autonuma.h
new file mode 100644
index 0000000..85ca5eb
--- /dev/null
+++ b/include/linux/autonuma.h
@@ -0,0 +1,41 @@
+#ifndef _LINUX_AUTONUMA_H
+#define _LINUX_AUTONUMA_H
+
+#ifdef CONFIG_AUTONUMA
+
+#include <linux/autonuma_flags.h>
+
+extern void autonuma_enter(struct mm_struct *mm);
+extern void autonuma_exit(struct mm_struct *mm);
+extern void __autonuma_migrate_page_remove(struct page *page);
+extern void autonuma_migrate_split_huge_page(struct page *page,
+					     struct page *page_tail);
+extern void autonuma_setup_new_exec(struct task_struct *p);
+
+static inline void autonuma_migrate_page_remove(struct page *page)
+{
+	if (ACCESS_ONCE(page->autonuma_migrate_nid) >= 0)
+		__autonuma_migrate_page_remove(page);
+}
+
+#define autonuma_printk(format, args...) \
+	if (autonuma_debug()) printk(format, ##args)
+
+#else /* CONFIG_AUTONUMA */
+
+static inline void autonuma_enter(struct mm_struct *mm) {}
+static inline void autonuma_exit(struct mm_struct *mm) {}
+static inline void autonuma_migrate_page_remove(struct page *page) {}
+static inline void autonuma_migrate_split_huge_page(struct page *page,
+						    struct page *page_tail) {}
+static inline void autonuma_setup_new_exec(struct task_struct *p) {}
+
+#endif /* CONFIG_AUTONUMA */
+
+extern pte_t __pte_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
+			      unsigned long addr, pte_t pte, pte_t *ptep);
+extern void __pmd_numa_fixup(struct mm_struct *mm, unsigned long addr,
+			     pmd_t *pmd);
+extern void numa_hinting_fault(struct page *page, int numpages);
+
+#endif /* _LINUX_AUTONUMA_H */


* [PATCH 10/36] autonuma: CPU follows memory algorithm
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
                   ` (8 preceding siblings ...)
  2012-08-22 14:58 ` [PATCH 09/36] autonuma: core autonuma.h header Andrea Arcangeli
@ 2012-08-22 14:58 ` Andrea Arcangeli
  2012-08-22 14:58 ` [PATCH 11/36] autonuma: add page structure fields Andrea Arcangeli
                   ` (26 subsequent siblings)
  36 siblings, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:58 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

This algorithm takes as input the statistical information filled in by
knuma_scand (mm->mm_autonuma) and by the NUMA hinting page faults
(p->task_autonuma), evaluates it for the current scheduled task, and
compares it against every other running process to see if it should
move the current task to another NUMA node.

When the scheduler decides if the task should be migrated to a
different NUMA node or to stay in the same NUMA node, the decision is
then stored into p->task_autonuma->task_selected_nid. The fair
scheduler then tries to keep the task on the task_selected_nid.

Code includes fixes and cleanups from Hillf Danton <dhillf@gmail.com>.

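A worked example of the core exchange check documented in the
kernel/sched/numa.c comment below (the numbers are invented purely for
illustration; weights are in AUTONUMA_BALANCE_SCALE units, where 1000
means fully converged):

	/*
	 * w_this_nid = 300	this task's weight on its current node
	 * w_nid      = 700	this task's weight on the candidate node
	 * w_other    = 400	the other task's weight on that node
	 *
	 * w_nid > w_other and w_nid > w_this_nid, so exchanging the two
	 * tasks improves convergence, with priority:
	 *
	 *	weight = (700 - 400) + (700 - 300) = 700
	 *
	 * If w_other were 800 instead, the exchange would be rejected.
	 */
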
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma_sched.h |   50 ++++
 include/linux/mm_types.h       |    5 +
 include/linux/sched.h          |    3 +
 kernel/sched/core.c            |    1 +
 kernel/sched/fair.c            |    4 +
 kernel/sched/numa.c            |  604 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h           |   19 ++
 7 files changed, 686 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/autonuma_sched.h
 create mode 100644 kernel/sched/numa.c

diff --git a/include/linux/autonuma_sched.h b/include/linux/autonuma_sched.h
new file mode 100644
index 0000000..588bba5
--- /dev/null
+++ b/include/linux/autonuma_sched.h
@@ -0,0 +1,50 @@
+#ifndef _LINUX_AUTONUMA_SCHED_H
+#define _LINUX_AUTONUMA_SCHED_H
+
+#ifdef CONFIG_AUTONUMA
+#include <linux/autonuma_flags.h>
+
+extern void __sched_autonuma_balance(void);
+extern bool sched_autonuma_can_migrate_task(struct task_struct *p,
+					    int numa, int dst_cpu,
+					    enum cpu_idle_type idle);
+
+static bool inline task_autonuma_cpu(struct task_struct *p, int cpu)
+{
+	int task_selected_nid;
+	struct task_autonuma *task_autonuma = p->task_autonuma;
+
+	if (!task_autonuma)
+		return true;
+
+	task_selected_nid = ACCESS_ONCE(task_autonuma->task_selected_nid);
+	if (task_selected_nid < 0 || task_selected_nid == cpu_to_node(cpu))
+		return true;
+	else
+		return false;
+}
+
+static inline void sched_autonuma_balance(void)
+{
+	struct task_autonuma *ta = current->task_autonuma;
+
+	if (ta && current->mm)
+		__sched_autonuma_balance();
+}
+#else /* CONFIG_AUTONUMA */
+static inline bool sched_autonuma_can_migrate_task(struct task_struct *p,
+						   int numa, int dst_cpu,
+						   enum cpu_idle_type idle)
+{
+	return true;
+}
+
+static bool inline task_autonuma_cpu(struct task_struct *p, int cpu)
+{
+	return true;
+}
+
+static inline void sched_autonuma_balance(void) {}
+#endif /* CONFIG_AUTONUMA */
+
+#endif /* _LINUX_AUTONUMA_SCHED_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index bf78672..c80101c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -13,6 +13,7 @@
 #include <linux/cpumask.h>
 #include <linux/page-debug-flags.h>
 #include <linux/uprobes.h>
+#include <linux/autonuma_types.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
 
@@ -405,6 +406,10 @@ struct mm_struct {
 	struct cpumask cpumask_allocation;
 #endif
 	struct uprobes_state uprobes_state;
+#ifdef CONFIG_AUTONUMA
+	/* this is used by the scheduler and the page allocator */
+	struct mm_autonuma *mm_autonuma;
+#endif
 };
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b8c8664..8b91676 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1523,6 +1523,9 @@ struct task_struct {
 	struct mempolicy *mempolicy;	/* Protected by alloc_lock */
 	short il_next;
 	short pref_node_fork;
+#ifdef CONFIG_AUTONUMA
+	struct task_autonuma *task_autonuma;
+#endif
 #endif
 	struct rcu_head rcu;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fbf1fd0..d0af967 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -72,6 +72,7 @@
 #include <linux/slab.h>
 #include <linux/init_task.h>
 #include <linux/binfmts.h>
+#include <linux/autonuma_sched.h>
 
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c219bf8..42a88fa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -26,6 +26,7 @@
 #include <linux/slab.h>
 #include <linux/profile.h>
 #include <linux/interrupt.h>
+#include <linux/autonuma_sched.h>
 
 #include <trace/events/sched.h>
 
@@ -4920,6 +4921,9 @@ static void run_rebalance_domains(struct softirq_action *h)
 
 	rebalance_domains(this_cpu, idle);
 
+	if (!this_rq->idle_balance)
+		sched_autonuma_balance();
+
 	/*
 	 * If this cpu has a pending nohz_balance_kick, then do the
 	 * balancing on behalf of the other idle cpus whose ticks are
diff --git a/kernel/sched/numa.c b/kernel/sched/numa.c
new file mode 100644
index 0000000..2646c82
--- /dev/null
+++ b/kernel/sched/numa.c
@@ -0,0 +1,604 @@
+/*
+ *  Copyright (C) 2012  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/sched.h>
+#include <linux/autonuma_sched.h>
+#include <asm/tlb.h>
+
+#include "sched.h"
+
+/*
+ * Callback used by the AutoNUMA balancer to migrate a task to the
+ * selected CPU. Invoked by stop_one_cpu_nowait().
+ */
+static int autonuma_balance_cpu_stop(void *data)
+{
+	struct rq *src_rq = data;
+	int src_cpu = cpu_of(src_rq);
+	int dst_cpu = src_rq->autonuma_balance_dst_cpu;
+	struct task_struct *p = src_rq->autonuma_balance_task;
+	struct rq *dst_rq = cpu_rq(dst_cpu);
+
+	raw_spin_lock_irq(&p->pi_lock);
+	raw_spin_lock(&src_rq->lock);
+
+	/* Make sure the selected cpu hasn't gone down in the meanwhile */
+	if (unlikely(src_cpu != smp_processor_id() ||
+		     !src_rq->autonuma_balance))
+		goto out_unlock;
+
+	/* Check if the affinity changed in the meanwhile */
+	if (!cpumask_test_cpu(dst_cpu, tsk_cpus_allowed(p)))
+		goto out_unlock;
+
+	/* Is the task to migrate still there? */
+	if (task_cpu(p) != src_cpu)
+		goto out_unlock;
+
+	BUG_ON(src_rq == dst_rq);
+
+	/* Prepare to move the task from src_rq to dst_rq */
+	double_lock_balance(src_rq, dst_rq);
+
+	/*
+	 * Supposedly pi_lock should have been enough but some code
+	 * seems to call __set_task_cpu without pi_lock.
+	 */
+	if (task_cpu(p) != src_cpu)
+		goto out_double_unlock;
+
+	/*
+	 * If the task is not on a rq, the task_selected_nid will take
+	 * care of the NUMA affinity at the next wake-up.
+	 */
+	if (p->on_rq) {
+		deactivate_task(src_rq, p, 0);
+		set_task_cpu(p, dst_cpu);
+		activate_task(dst_rq, p, 0);
+		check_preempt_curr(dst_rq, p, 0);
+	}
+
+out_double_unlock:
+	double_unlock_balance(src_rq, dst_rq);
+out_unlock:
+	src_rq->autonuma_balance = false;
+	raw_spin_unlock(&src_rq->lock);
+	/* spinlocks act as barrier() so p is stored locally on the stack */
+	raw_spin_unlock_irq(&p->pi_lock);
+	put_task_struct(p);
+	return 0;
+}
+
+#define AUTONUMA_BALANCE_SCALE 1000
+
+enum {
+	W_TYPE_THREAD,
+	W_TYPE_PROCESS,
+};
+
+/*
+ * This function __sched_autonuma_balance() is responsible for
+ * deciding which is the best CPU each process should be running on
+ * according to the NUMA statistics collected in mm->mm_autonuma and
+ * tsk->task_autonuma.
+ *
+ * This will not alter the active idle load balancing and most other
+ * scheduling activity, it works by exchanging running tasks across
+ * CPUs located in different NUMA nodes, when such an exchange
+ * provides a net benefit in increasing the system wide NUMA
+ * convergence.
+ *
+ * The tasks that are the closest to "fully converged" are given the
+ * maximum priority in being moved to their "best node".
+ *
+ * "Full convergence" is achieved when all memory accesses by a task
+ * are 100% local to the CPU it is running on. A task's "best node" is
+ * the NUMA node that recently had the most memory accesses from the
+ * task. The tasks that are closest to being fully converged are given
+ * maximum priority for being moved to their "best node."
+ *
+ * To find how close a task is to converging we use weights. These
+ * weights are computed using the task_autonuma and mm_autonuma
+ * statistics. These weights represent the percentage amounts of
+ * memory accesses (in AUTONUMA_BALANCE_SCALE) that each task recently
+ * had in each node. If the weight of one node is equal to
+ * AUTONUMA_BALANCE_SCALE that implies the task reached "full
+ * convergence" in that given node. To the contrary, a node with a
+ * zero weight would be the "worst node" for the task.
+ *
+ * If the weights for two tasks on CPUs in different nodes are equal
+ * no switch will happen.
+ *
+ * The core math that evaluates the current CPU against the CPUs of
+ * all other nodes is this:
+ *
+ *	if (w_nid > w_other && w_nid > w_this_nid)
+ *		weight = w_nid - w_other + w_nid - w_this_nid;
+ *
+ * w_nid is the memory weight of this task on the other CPU.
+ * w_other is the memory weight of the other task in the other CPU.
+ * w_this_nid is the memory weight of this task on the current CPU.
+ *
+ * w_nid > w_other means: the current task is closer to fully converge
+ * on the node of the other CPU than the other task that is currently
+ * running in the other CPU.
+ *
+ * w_nid > w_this_nid means: the current task is closer to converge on
+ * the node of the other CPU than in the current node.
+ *
+ * If both checks succeed it guarantees that we found a way to
+ * multilaterally improve the system wide NUMA
+ * convergence. Multilateral here means that the same checks will not
+ * succeed again on those same two tasks, after the task exchange, so
+ * there is no risk of ping-pong.
+ *
+ * If a task exchange can happen because the two checks succeed, we
+ * select the destination CPU that will give us the biggest increase
+ * in system wide convergence (i.e. biggest "weight", in the above
+ * quoted code).
+ *
+ * CFS is NUMA aware via sched_autonuma_can_migrate_task(). CFS searches
+ * CPUs in the task's task_selected_nid first during load balancing and
+ * idle balancing.
+ *
+ * The task's task_selected_nid is the node selected by
+ * __sched_autonuma_balance() when it migrates the current task to the
+ * selected cpu in the selected node during the task exchange.
+ *
+ * Once a task has been moved to another node, closer to most of the
+ * memory it has recently accessed, any memory for that task not in
+ * the new node moves slowly (asynchronously in the background) to the
+ * new node. This is done by the knuma_migratedN (where the suffix N
+ * is the node id) daemon described in mm/autonuma.c.
+ *
+ * One important thing is how we calculate the weights using
+ * task_autonuma or mm_autonuma, depending if the other CPU is running
+ * a thread of the current process, or a thread of a different
+ * process.
+ *
+ * We use the mm_autonuma statistics to calculate the NUMA weights of
+ * the two task candidates for exchange if the task in the other CPU
+ * belongs to a different process. This way all threads of the same
+ * process will converge to the same node, which is the one with the
+ * highest percentage of memory for the process.  This will happen
+ * even if the thread's "best node" is busy running threads of a
+ * different process.
+ *
+ * If the two candidate tasks for exchange are threads of the same
+ * process, we use the task_autonuma information (because the
+ * mm_autonuma information is identical). By using the task_autonuma
+ * statistics, each thread follows its own memory locality and they
+ * will not necessarily converge on the same node. This is often very
+ * desirable for processes with more theads than CPUs on each NUMA
+ * node.
+ *
+ * To avoid the risk of NUMA false sharing it's best to schedule all
+ * threads accessing the same memory in the same node (or in as few
+ * nodes as possible if they can't all fit in a single node).
+ *
+ * False sharing in the above sentence means simultaneous virtual
+ * memory accesses to the same pages of memory by threads running on
+ * CPUs of different nodes. Sharing here doesn't refer to shared
+ * memory as in tmpfs, but to memory shared through CLONE_VM.
+ *
+ * This algorithm might be expanded to take all runnable processes
+ * into account later.
+ *
+ * This algorithm is executed by every CPU in the context of the
+ * SCHED_SOFTIRQ load balancing event at regular intervals.
+ *
+ * If the task is found to have converged in the current node, we
+ * already know that the check "w_nid > w_this_nid" will not succeed,
+ * so the function returns without having to check any of the CPUs of
+ * the other NUMA nodes.
+ */
+void __sched_autonuma_balance(void)
+{
+	int cpu, nid, selected_cpu, selected_nid, selected_nid_mm;
+	int this_nid = numa_node_id();
+	int this_cpu = smp_processor_id();
+	/*
+	 * task_weight: node thread weight
+	 * task_tot: total sum of all node thread weights
+	 * mm_weight: node mm/process weight
+	 * mm_tot: total sum of all node mm/process weights
+	 */
+	unsigned long task_weight, task_tot, mm_weight, mm_tot;
+	unsigned long task_max, mm_max;
+	unsigned long weight_max, weight;
+	long s_w_nid = -1, s_w_this_nid = -1, s_w_other = -1;
+	int s_w_type = -1;
+	struct cpumask *allowed;
+	struct task_struct *p = current, *selected_task;
+	struct task_autonuma *task_autonuma = p->task_autonuma;
+	struct mm_autonuma *mm_autonuma;
+	struct rq *rq;
+
+	/* per-cpu statically allocated in runqueues */
+	long *task_numa_weight;
+	long *mm_numa_weight;
+
+	if (!task_autonuma || !p->mm)
+		return;
+
+	if (!autonuma_enabled()) {
+		if (task_autonuma->task_selected_nid != -1)
+			task_autonuma->task_selected_nid = -1;
+		return;
+	}
+
+	allowed = tsk_cpus_allowed(p);
+	mm_autonuma = p->mm->mm_autonuma;
+
+	/*
+	 * If the task has no NUMA hinting page faults or if the mm
+	 * hasn't been fully scanned by knuma_scand yet, set the task's
+	 * selected nid to the current nid, to avoid the task bouncing
+	 * around randomly.
+	 */
+	mm_tot = ACCESS_ONCE(mm_autonuma->mm_numa_fault_tot);
+	if (!mm_tot) {
+		if (task_autonuma->task_selected_nid != this_nid)
+			task_autonuma->task_selected_nid = this_nid;
+		return;
+	}
+	task_tot = task_autonuma->task_numa_fault_tot;
+	if (!task_tot) {
+		if (task_autonuma->task_selected_nid != this_nid)
+			task_autonuma->task_selected_nid = this_nid;
+		return;
+	}
+
+	rq = cpu_rq(this_cpu);
+
+	/*
+	 * If a previous balance work for this runqueue is still
+	 * pending, we cannot queue another migration: try again later.
+	 */
+	if (ACCESS_ONCE(rq->autonuma_balance))
+		return;
+
+	/*
+	 * The following two arrays will hold the NUMA affinity weight
+	 * information for the current process if scheduled on the
+	 * given NUMA node.
+	 *
+	 * mm_numa_weight[nid] - mm NUMA affinity weight for the NUMA node
+	 * task_numa_weight[nid] - task NUMA affinity weight for the NUMA node
+	 */
+	task_numa_weight = rq->task_numa_weight;
+	mm_numa_weight = rq->mm_numa_weight;
+
+	/*
+	 * Identify the NUMA node where this thread (task_struct), and
+	 * the process (mm_struct) as a whole, has the largest number
+	 * of NUMA faults.
+	 */
+	task_max = mm_max = 0;
+	selected_nid = selected_nid_mm = -1;
+	for_each_online_node(nid) {
+		mm_weight = ACCESS_ONCE(mm_autonuma->mm_numa_fault[nid]);
+		task_weight = task_autonuma->task_numa_fault[nid];
+		if (mm_weight > mm_tot)
+			/* could be removed with a seqlock */
+			mm_tot = mm_weight;
+		mm_numa_weight[nid] = mm_weight*AUTONUMA_BALANCE_SCALE/mm_tot;
+		if (task_weight > task_tot) {
+			task_tot = task_weight;
+			WARN_ON(1);
+		}
+		task_numa_weight[nid] = task_weight*AUTONUMA_BALANCE_SCALE/task_tot;
+		if (mm_numa_weight[nid] > mm_max) {
+			mm_max = mm_numa_weight[nid];
+			selected_nid_mm = nid;
+		}
+		if (task_numa_weight[nid] > task_max) {
+			task_max = task_numa_weight[nid];
+			selected_nid = nid;
+		}
+	}
+	/*
+	 * If this NUMA node is the selected one, based on process
+	 * memory and task NUMA faults, set task_selected_nid and
+	 * we're done.
+	 */
+	if (selected_nid == this_nid && selected_nid_mm == selected_nid) {
+		if (task_autonuma->task_selected_nid != selected_nid)
+			task_autonuma->task_selected_nid = selected_nid;
+		return;
+	}
+
+	selected_cpu = this_cpu;
+	selected_nid = this_nid;
+
+	weight = weight_max = 0;
+
+	selected_task = NULL;
+
+	/* check that the following raw_spin_lock_irq is safe */
+	BUG_ON(irqs_disabled());
+
+	/*
+	 * Check the other NUMA nodes to see if there is a task we
+	 * should exchange places with.
+	 */
+	for_each_online_node(nid) {
+		/* No need to check our current node. */
+		if (nid == this_nid)
+			continue;
+		for_each_cpu_and(cpu, cpumask_of_node(nid), allowed) {
+			long w_nid, w_this_nid, w_other;
+			int w_type;
+			struct mm_struct *mm;
+			struct task_struct *_selected_task;
+			rq = cpu_rq(cpu);
+			if (!cpu_online(cpu))
+				continue;
+
+			/* CFS takes care of idle balancing. */
+			if (idle_cpu(cpu))
+				continue;
+
+			mm = rq->curr->mm;
+			if (!mm)
+				continue;
+
+			/*
+			 * Check if the _selected_task is already
+			 * pending migration. Do it locklessly: it's an
+			 * optimistic racy check anyway.
+			 */
+			if (ACCESS_ONCE(rq->autonuma_balance))
+				continue;
+
+			/*
+			 * Grab the mm_weight/task_weight/mm_tot/task_tot of the
+			 * processes running in the other CPUs to
+			 * compute w_other.
+			 */
+			raw_spin_lock_irq(&rq->lock);
+			_selected_task = rq->curr;
+			/* recheck after implicit barrier() */
+			mm = _selected_task->mm;
+			if (!mm) {
+				raw_spin_unlock_irq(&rq->lock);
+				continue;
+			}
+			mm_tot = ACCESS_ONCE(mm->mm_autonuma->mm_numa_fault_tot);
+			task_tot = _selected_task->task_autonuma->task_numa_fault_tot;
+			if (!mm_tot || !task_tot) {
+				/* Need NUMA faults to evaluate NUMA placement. */
+				raw_spin_unlock_irq(&rq->lock);
+				continue;
+			}
+			/*
+			 * Check that the _selected_task is allowed
+			 * to be migrated to this_cpu.
+			 */
+			if (!cpumask_test_cpu(this_cpu,
+					      tsk_cpus_allowed(_selected_task))) {
+				raw_spin_unlock_irq(&rq->lock);
+				continue;
+			}
+			mm_weight = ACCESS_ONCE(mm->mm_autonuma->mm_numa_fault[nid]);
+			task_weight = _selected_task->task_autonuma->task_numa_fault[nid];
+			raw_spin_unlock_irq(&rq->lock);
+
+			if (mm == p->mm) {
+				/*
+				 * This task is another thread in the
+				 * same process. Use the task statistics.
+				 */
+				if (task_weight > task_tot)
+					task_tot = task_weight;
+				w_other = task_weight*AUTONUMA_BALANCE_SCALE/task_tot;
+				w_nid = task_numa_weight[nid];
+				w_this_nid = task_numa_weight[this_nid];
+				w_type = W_TYPE_THREAD;
+			} else {
+				/*
+				 * This task is part of another process.
+				 * Use the mm statistics.
+				 */
+				if (mm_weight > mm_tot)
+					mm_tot = mm_weight;
+				w_other = mm_weight*AUTONUMA_BALANCE_SCALE/mm_tot;
+				w_nid = mm_numa_weight[nid];
+				w_this_nid = mm_numa_weight[this_nid];
+				w_type = W_TYPE_PROCESS;
+			}
+
+			/*
+			 * Would swapping NUMA location with this task
+			 * reduce the total number of cross-node NUMA
+			 * faults in the system?
+			 */
+			if (w_nid > w_other && w_nid > w_this_nid) {
+				weight = w_nid - w_other + w_nid - w_this_nid;
+
+				/* Remember the best candidate. */
+				if (weight > weight_max) {
+					weight_max = weight;
+					selected_cpu = cpu;
+					selected_nid = nid;
+
+					s_w_other = w_other;
+					s_w_nid = w_nid;
+					s_w_this_nid = w_this_nid;
+					s_w_type = w_type;
+					selected_task = _selected_task;
+				}
+			}
+		}
+	}
+
+	if (task_autonuma->task_selected_nid != selected_nid)
+		task_autonuma->task_selected_nid = selected_nid;
+	if (selected_cpu != this_cpu) {
+		if (autonuma_debug()) {
+			char *w_type_str = NULL;
+			switch (s_w_type) {
+			case W_TYPE_THREAD:
+				w_type_str = "thread";
+				break;
+			case W_TYPE_PROCESS:
+				w_type_str = "process";
+				break;
+			}
+			printk("%p %d - %dto%d - %dto%d - %ld %ld %ld - %s\n",
+			       p->mm, p->pid, this_nid, selected_nid,
+			       this_cpu, selected_cpu,
+			       s_w_other, s_w_nid, s_w_this_nid,
+			       w_type_str);
+		}
+		BUG_ON(this_nid == selected_nid);
+		goto found;
+	}
+
+	return;
+
+found:
+	rq = cpu_rq(this_cpu);
+
+	/*
+	 * autonuma_balance synchronizes accesses to
+	 * autonuma_balance_work. After set, it's cleared by the
+	 * callback once the migration work is finished.
+	 */
+	raw_spin_lock_irq(&rq->lock);
+	if (rq->autonuma_balance) {
+		raw_spin_unlock_irq(&rq->lock);
+		return;
+	}
+	rq->autonuma_balance = true;
+	raw_spin_unlock_irq(&rq->lock);
+
+	rq->autonuma_balance_dst_cpu = selected_cpu;
+	rq->autonuma_balance_task = p;
+	get_task_struct(p);
+
+	/* Do the actual migration. */
+	stop_one_cpu_nowait(this_cpu,
+			    autonuma_balance_cpu_stop, rq,
+			    &rq->autonuma_balance_work);
+
+	BUG_ON(!selected_task);
+	rq = cpu_rq(selected_cpu);
+
+	/*
+	 * autonuma_balance synchronizes accesses to
+	 * autonuma_balance_work. After set, it's cleared by the
+	 * callback once the migration work is finished.
+	 */
+	raw_spin_lock_irq(&rq->lock);
+	/*
+	 * The chance of selected_task having quit in the meanwhile
+	 * and another task having reused its previous task struct is
+	 * tiny. Even if it happens the kernel will be stable.
+	 */
+	if (rq->autonuma_balance || rq->curr != selected_task) {
+		raw_spin_unlock_irq(&rq->lock);
+		return;
+	}
+	rq->autonuma_balance = true;
+	/* take the pin on the task struct before dropping the lock */
+	get_task_struct(selected_task);
+	raw_spin_unlock_irq(&rq->lock);
+
+	rq->autonuma_balance_dst_cpu = this_cpu;
+	rq->autonuma_balance_task = selected_task;
+
+	/* Do the actual migration. */
+	stop_one_cpu_nowait(selected_cpu,
+			    autonuma_balance_cpu_stop, rq,
+			    &rq->autonuma_balance_work);
+#ifdef __ia64__
+#error "NOTE: tlb_migrate_finish won't run here, review before deleting"
+#endif
+}
+
+/*
+ * The function sched_autonuma_can_migrate_task is called by CFS
+ * can_migrate_task() to prioritize on the task's
+ * task_selected_nid. It is called during load_balancing, idle
+ * balancing and in general before any task CPU migration event
+ * happens.
+ *
+ * The caller first scans the CFS migration candidate tasks passing a
+ * non-zero numa parameter, to skip tasks without AutoNUMA affinity
+ * (according to the task's task_selected_nid). If no task can be
+ * migrated in the first scan, a second scan is run with a zero numa
+ * parameter.
+ *
+ * If the numa parameter is not zero, this function allows the task
+ * migration only if the dst_cpu of the migration is in the node
+ * selected by AutoNUMA or if it's an idle load balancing event.
+ *
+ * If load_balance_strict is enabled, AutoNUMA will only allow
+ * migration of tasks for idle balancing purposes (the idle balancing
+ * of CFS is never altered by AutoNUMA). In non-strict mode the load
+ * balancing is not altered and the AutoNUMA affinity is disregarded
+ * in favor of higher fairness. The load_balance_strict knob is
+ * tunable at runtime through sysfs.
+ *
+ * If load_balance_strict is enabled, it tends to partition the
+ * system. In turn it may reduce the scheduler fairness across NUMA
+ * nodes, but it should deliver higher global performance.
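+ *
+ * For example (hypothetical scenario): with a non-zero numa
+ * parameter, a task whose task_selected_nid points to node 2 is only
+ * allowed to migrate to a dst_cpu belonging to node 2; the caller's
+ * second pass with numa == 0 then treats it like any other task,
+ * unless load_balance_strict is set and this is not an idle/newidle
+ * balancing event.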
+ */
+bool sched_autonuma_can_migrate_task(struct task_struct *p,
+				     int numa, int dst_cpu,
+				     enum cpu_idle_type idle)
+{
+	if (!task_autonuma_cpu(p, dst_cpu)) {
+		if (numa)
+			return false;
+		if (autonuma_sched_load_balance_strict() &&
+		    idle != CPU_NEWLY_IDLE && idle != CPU_IDLE)
+			return false;
+	}
+	return true;
+}
+
+/*
+ * sched_autonuma_dump_mm is a pure debugging function, called at
+ * regular intervals when /sys/kernel/mm/autonuma/debug is
+ * enabled. It prints to the kernel log how threads and processes are
+ * distributed across the NUMA nodes, to easily check whether the
+ * threads of the same process are converging on the same nodes. It
+ * doesn't take kernel threads into account and, because it runs from
+ * a kernel thread itself, it won't show what was running on the
+ * current CPU, but it's simple and good enough to get what we need
+ * in the debug logs. This function can be disabled or deleted later.
+ */
+void sched_autonuma_dump_mm(void)
+{
+	int nid, cpu;
+	cpumask_var_t x;
+
+	if (!alloc_cpumask_var(&x, GFP_KERNEL))
+		return;
+	cpumask_setall(x);
+	for_each_online_node(nid) {
+		for_each_cpu(cpu, cpumask_of_node(nid)) {
+			struct rq *rq = cpu_rq(cpu);
+			struct mm_struct *mm = rq->curr->mm;
+			int nr = 0, cpux;
+			if (!cpumask_test_cpu(cpu, x))
+				continue;
+			for_each_cpu(cpux, cpumask_of_node(nid)) {
+				struct rq *rqx = cpu_rq(cpux);
+				if (rqx->curr->mm == mm) {
+					nr++;
+					cpumask_clear_cpu(cpux, x);
+				}
+			}
+			printk("nid %d process %p nr_threads %d\n", nid, mm, nr);
+		}
+	}
+	free_cpumask_var(x);
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f6714d0..458b711 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -467,6 +467,25 @@ struct rq {
 #ifdef CONFIG_SMP
 	struct llist_head wake_list;
 #endif
+#ifdef CONFIG_AUTONUMA
+	/* stop_one_cpu_nowait() data used by autonuma_balance_cpu_stop() */
+	bool autonuma_balance;
+	int autonuma_balance_dst_cpu;
+	struct task_struct *autonuma_balance_task;
+	struct cpu_stop_work autonuma_balance_work;
+	/*
+	 * Per-cpu arrays used to compute the per-thread and
+	 * per-process NUMA affinity weights (per nid) for the current
+	 * process. Allocated statically to avoid overflowing the
+	 * stack with large MAX_NUMNODES values.
+	 *
+	 * FIXME: allocate with dynamic num_possible_nodes() array
+	 * sizes and only if autonuma is possible, to save a few dozen
+	 * KB of RAM when booting on non-NUMA (or small NUMA) systems.
+	 */
+	long task_numa_weight[MAX_NUMNODES];
+	long mm_numa_weight[MAX_NUMNODES];
+#endif
 };
 
 static inline int cpu_of(struct rq *rq)

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 11/36] autonuma: add page structure fields
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
                   ` (9 preceding siblings ...)
  2012-08-22 14:58 ` [PATCH 10/36] autonuma: CPU follows memory algorithm Andrea Arcangeli
@ 2012-08-22 14:58 ` Andrea Arcangeli
  2012-08-22 14:58 ` [PATCH 12/36] autonuma: knuma_migrated per NUMA node queues Andrea Arcangeli
                   ` (25 subsequent siblings)
  36 siblings, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:58 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 64bit archs, 20 bytes are used for async memory migration (specific
to the knuma_migrated per-node threads), and 4 bytes are used for the
thread NUMA false sharing detection logic.

This is the basic implementation improved by later patches.

Later patches move the new fields to a dynamically allocated
page_autonuma of 32 bytes per page (only allocated if booted on NUMA
hardware, unless "noautonuma" is passed as a kernel boot parameter).
Yet another later patch introduces the autonuma_list and reduces the
size of the page_autonuma from 32 to 12 bytes.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/mm_types.h |   26 ++++++++++++++++++++++++++
 mm/page_alloc.c          |    4 ++++
 2 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c80101c..3f10fef 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -152,6 +152,32 @@ struct page {
 		struct page *first_page;	/* Compound tail pages */
 	};
 
+#ifdef CONFIG_AUTONUMA
+	/*
+	 * FIXME: move to pgdat section along with the memcg and allocate
+	 * at runtime only in presence of a numa system.
+	 */
+	/*
+	 * To modify autonuma_last_nid lockless the architecture,
+	 * needs SMP atomic granularity < sizeof(long), not all archs
+	 * have that, notably some ancient alpha (but none of those
+	 * should run in NUMA systems). Archs without that requires
+	 * autonuma_last_nid to be a long.
+	 */
+#ifdef CONFIG_64BIT
+	int autonuma_migrate_nid;
+	int autonuma_last_nid;
+#else
+#if MAX_NUMNODES > 32767
+#error "too many nodes"
+#endif
+	/* FIXME: remember to check the updates are atomic */
+	short autonuma_migrate_nid;
+	short autonuma_last_nid;
+#endif
+	struct list_head autonuma_migrate_node;
+#endif
+
 	/*
 	 * On machines where all RAM is mapped into kernel address space,
 	 * we can simply calculate the virtual address. On machines with
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ff61443..a6337b3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3787,6 +3787,10 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
 
 		INIT_LIST_HEAD(&page->lru);
+#ifdef CONFIG_AUTONUMA
+		page->autonuma_last_nid = -1;
+		page->autonuma_migrate_nid = -1;
+#endif
 #ifdef WANT_PAGE_VIRTUAL
 		/* The shift won't overflow because ZONE_NORMAL is below 4G. */
 		if (!is_highmem_idx(zone))

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 12/36] autonuma: knuma_migrated per NUMA node queues
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
                   ` (10 preceding siblings ...)
  2012-08-22 14:58 ` [PATCH 11/36] autonuma: add page structure fields Andrea Arcangeli
@ 2012-08-22 14:58 ` Andrea Arcangeli
  2012-08-22 14:58 ` [PATCH 13/36] autonuma: autonuma_enter/exit Andrea Arcangeli
                   ` (24 subsequent siblings)
  36 siblings, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:58 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

This defines the knuma_migrated queues. There is one knuma_migrated
per NUMA node with active CPUs. Pages are added to these queues
through the NUMA hinting page fault (memory follow CPU algorithm with
false sharing evaluation). The daemons are then woken up with a
certain hysteresis to migrate the memory in a round robin fashion from
all remote nodes to the daemon's local node.

The list head that belongs to the local node knuma_migrated runs on
must be empty for now; it's not being used.
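
As a minimal illustrative sketch (not part of the patch; the helper
name below is made up, and the compound lock / pgdat->autonuma_lock
locking is omitted), queueing a page that currently resides on node 1
for migration to node 3 amounts to indexing the destination node's
array by the source node:

	static void queue_for_migration_sketch(struct page *page)
	{
		int page_nid = page_to_nid(page);	/* e.g. node 1 */
		int dst_nid = 3;	/* node whose CPUs keep accessing it */

		list_add(&page->autonuma_migrate_node,
			 &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid]);
		NODE_DATA(dst_nid)->autonuma_nr_migrate_pages +=
			hpage_nr_pages(page);
	}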

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/mmzone.h |   18 ++++++++++++++++++
 mm/page_alloc.c        |   11 +++++++++++
 2 files changed, 29 insertions(+), 0 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2daa54f..a5920f8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -709,6 +709,24 @@ typedef struct pglist_data {
 	struct task_struct *kswapd;	/* Protected by lock_memory_hotplug() */
 	int kswapd_max_order;
 	enum zone_type classzone_idx;
+#ifdef CONFIG_AUTONUMA
+	/*
+	 * lock serializing all lists with heads in the
+	 * autonuma_migrate_head[] array, and the
+	 * autonuma_nr_migrate_pages field.
+	 */
+	spinlock_t autonuma_lock;
+	/*
+	 * All pages from node "page_nid" to be migrated to this node,
+	 * will be queued into the list
+	 * autonuma_migrate_head[page_nid].
+	 */
+	struct list_head autonuma_migrate_head[MAX_NUMNODES];
+	/* number of pages from other nodes queued for migration to this node */
+	unsigned long autonuma_nr_migrate_pages;
+	/* waitqueue for this node knuma_migrated daemon */
+	wait_queue_head_t autonuma_knuma_migrated_wait;
+#endif
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a6337b3..8c9cad5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -58,6 +58,7 @@
 #include <linux/prefetch.h>
 #include <linux/migrate.h>
 #include <linux/page-debug-flags.h>
+#include <linux/autonuma.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -4391,8 +4392,18 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 	int nid = pgdat->node_id;
 	unsigned long zone_start_pfn = pgdat->node_start_pfn;
 	int ret;
+#ifdef CONFIG_AUTONUMA
+	int node_iter;
+#endif
 
 	pgdat_resize_init(pgdat);
+#ifdef CONFIG_AUTONUMA
+	spin_lock_init(&pgdat->autonuma_lock);
+	init_waitqueue_head(&pgdat->autonuma_knuma_migrated_wait);
+	pgdat->autonuma_nr_migrate_pages = 0;
+	for_each_node(node_iter)
+		INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
+#endif
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
 	pgdat_page_cgroup_init(pgdat);

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 13/36] autonuma: autonuma_enter/exit
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
                   ` (11 preceding siblings ...)
  2012-08-22 14:58 ` [PATCH 12/36] autonuma: knuma_migrated per NUMA node queues Andrea Arcangeli
@ 2012-08-22 14:58 ` Andrea Arcangeli
  2012-08-22 14:58 ` [PATCH 14/36] autonuma: call autonuma_setup_new_exec() Andrea Arcangeli
                   ` (23 subsequent siblings)
  36 siblings, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:58 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

This is where we register (and unregister) an "mm" structure with
AutoNUMA so that knuma_scand can scan it.

knuma_scand is the first gear in the whole AutoNUMA algorithm.
knuma_scand is the daemon that scans the "mm" structures in the list
and sets pmd_numa and pte_numa to allow the NUMA hinting page faults
to start. All other actions follow after that. If knuma_scand doesn't
run, AutoNUMA is fully bypassed. If knuma_scand is stopped, soon all
other AutoNUMA gears will settle down too.
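
Conceptually, registration boils down to something like the sketch
below (the actual implementation is added later in the series and also
skips non-NUMA systems; knumad_mm_mutex, knuma_scand_data.mm_head and
the mm_node list link are the ones knuma_scand iterates over):

	void autonuma_enter(struct mm_struct *mm)
	{
		mutex_lock(&knumad_mm_mutex);
		list_add_tail(&mm->mm_autonuma->mm_node,
			      &knuma_scand_data.mm_head);
		mutex_unlock(&knumad_mm_mutex);
	}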

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/fork.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index f998e53..fbc67ee 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -70,6 +70,7 @@
 #include <linux/khugepaged.h>
 #include <linux/signalfd.h>
 #include <linux/uprobes.h>
+#include <linux/autonuma.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -541,6 +542,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
 		mmu_notifier_mm_init(mm);
+		autonuma_enter(mm);
 		return mm;
 	}
 
@@ -609,6 +611,7 @@ void mmput(struct mm_struct *mm)
 		exit_aio(mm);
 		ksm_exit(mm);
 		khugepaged_exit(mm); /* must run before exit_mmap */
+		autonuma_exit(mm); /* must run before exit_mmap */
 		exit_mmap(mm);
 		set_mm_exe_file(mm, NULL);
 		if (!list_empty(&mm->mmlist)) {

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 14/36] autonuma: call autonuma_setup_new_exec()
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
                   ` (12 preceding siblings ...)
  2012-08-22 14:58 ` [PATCH 13/36] autonuma: autonuma_enter/exit Andrea Arcangeli
@ 2012-08-22 14:58 ` Andrea Arcangeli
  2012-08-22 14:58 ` [PATCH 15/36] autonuma: alloc/free/init task_autonuma Andrea Arcangeli
                   ` (22 subsequent siblings)
  36 siblings, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:58 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

This resets all per-thread and per-process statistics across exec
syscalls or after kernel threads detach from the mm. The past
statistical NUMA information is unlikely to be relevant for the future
in these cases.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/exec.c        |    7 +++++++
 mm/mmu_context.c |    3 +++
 2 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 574cf4d..1d55077 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
 #include <linux/pipe_fs_i.h>
 #include <linux/oom.h>
 #include <linux/compat.h>
+#include <linux/autonuma.h>
 
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -1172,6 +1173,12 @@ void setup_new_exec(struct linux_binprm * bprm)
 			
 	flush_signal_handlers(current, 0);
 	flush_old_files(current->files);
+
+	/*
+	 * Reset autonuma counters, as past NUMA information
+	 * is unlikely to be relevant for the future.
+	 */
+	autonuma_setup_new_exec(current);
 }
 EXPORT_SYMBOL(setup_new_exec);
 
diff --git a/mm/mmu_context.c b/mm/mmu_context.c
index 3dcfaf4..e6fff1c 100644
--- a/mm/mmu_context.c
+++ b/mm/mmu_context.c
@@ -7,6 +7,7 @@
 #include <linux/mmu_context.h>
 #include <linux/export.h>
 #include <linux/sched.h>
+#include <linux/autonuma.h>
 
 #include <asm/mmu_context.h>
 
@@ -52,6 +53,8 @@ void unuse_mm(struct mm_struct *mm)
 {
 	struct task_struct *tsk = current;
 
+	autonuma_setup_new_exec(tsk);
+
 	task_lock(tsk);
 	sync_mm_rss(mm);
 	tsk->mm = NULL;

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 15/36] autonuma: alloc/free/init task_autonuma
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
                   ` (13 preceding siblings ...)
  2012-08-22 14:58 ` [PATCH 14/36] autonuma: call autonuma_setup_new_exec() Andrea Arcangeli
@ 2012-08-22 14:58 ` Andrea Arcangeli
  2012-08-22 14:59 ` [PATCH 16/36] autonuma: alloc/free/init mm_autonuma Andrea Arcangeli
                   ` (21 subsequent siblings)
  36 siblings, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:58 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

This is where the dynamically allocated task_autonuma structure is
being handled.

This is the structure holding the per-thread NUMA statistics generated
by the NUMA hinting page faults. This per-thread NUMA statistical
information is needed by sched_autonuma_balance to make optimal NUMA
balancing decisions.

It also contains the task_selected_nid which hints the stock CPU
scheduler on the best NUMA node to schedule this thread on (as decided
by sched_autonuma_balance).

The reason for keeping this outside of the task_struct, besides not
using too much kernel stack, is to allocate it only on NUMA
hardware. So non-NUMA hardware only pays the memory cost of a pointer
in the kernel stack (which remains NULL at all times in that case).

If the kernel is compiled with CONFIG_AUTONUMA=n, of course not even
the pointer is allocated on the kernel stack.
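
For reference, the structure being allocated here looks roughly like
the sketch below (the exact definition and field order live elsewhere
in the series and may differ; the field names are the ones used by the
scheduler and NUMA hinting fault code):

	struct task_autonuma {
		int task_selected_nid;	/* node hint for CFS, -1 if none */
		unsigned long task_numa_fault_pass;
		unsigned long task_numa_fault_tot;
		unsigned long task_numa_fault[0];	/* one counter per node */
	};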

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/fork.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index fbc67ee..9ba6e9b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -209,6 +209,7 @@ void free_task(struct task_struct *tsk)
 {
 	account_kernel_stack(tsk->stack, -1);
 	arch_release_thread_info(tsk->stack);
+	free_task_autonuma(tsk);
 	free_thread_info(tsk->stack);
 	rt_mutex_debug_task_free(tsk);
 	ftrace_graph_exit_task(tsk);
@@ -264,6 +265,9 @@ void __init fork_init(unsigned long mempages)
 	/* do the arch specific task caches init */
 	arch_task_cache_init();
 
+	/* prepare task_autonuma for alloc_task_autonuma/free_task_autonuma */
+	task_autonuma_init();
+
 	/*
 	 * The default maximum number of threads is set to a safe
 	 * value: the thread structures can take up at most half
@@ -310,6 +314,10 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
 	if (err)
 		goto free_ti;
 
+	if (unlikely(alloc_task_autonuma(tsk, orig, node)))
+		/* free_thread_info() undoes arch_dup_task_struct() too */
+		goto free_ti;
+
 	tsk->stack = ti;
 
 	setup_thread_stack(tsk, orig);

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 16/36] autonuma: alloc/free/init mm_autonuma
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
                   ` (14 preceding siblings ...)
  2012-08-22 14:58 ` [PATCH 15/36] autonuma: alloc/free/init task_autonuma Andrea Arcangeli
@ 2012-08-22 14:59 ` Andrea Arcangeli
  2012-08-22 14:59 ` [PATCH 17/36] autonuma: prevent select_task_rq_fair to return -1 Andrea Arcangeli
                   ` (20 subsequent siblings)
  36 siblings, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

This is where the mm_autonuma structure is being handled.

mm_autonuma holds the link for knuma_scand's list of mm structures to
scan and a pointer to the associated mm structure for knuma_scand's
convenience.

It also contains the per-mm NUMA statistics collected by knuma_scand
daemon. The per-mm NUMA statistics are needed by
sched_autonuma_balance to take appropriate NUMA balancing decision
when balancing threads belonging to different processes.

Just like task_autonuma, this is only allocated at runtime if the
hardware the kernel is running on has been detected as NUMA. On
non-NUMA hardware the memory cost is reduced to one pointer per mm.

To get rid of the pointer in each mm, the kernel can be compiled
with CONFIG_AUTONUMA=n.
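
For reference, a sketch of the structure being allocated here (the
exact definition lives elsewhere in the series and may differ; the
field names are the ones referenced by knuma_scand and the scheduler
code):

	struct mm_autonuma {
		struct list_head mm_node;	/* knuma_scand's scan list */
		struct mm_struct *mm;
		unsigned long mm_numa_fault_pass;
		unsigned long mm_numa_fault_tot;
		unsigned long mm_numa_fault[0];	/* one counter per node */
	};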

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/fork.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 9ba6e9b..7367c32 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -532,6 +532,8 @@ static void mm_init_aio(struct mm_struct *mm)
 
 static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 {
+	if (unlikely(alloc_mm_autonuma(mm)))
+		goto out_free_mm;
 	atomic_set(&mm->mm_users, 1);
 	atomic_set(&mm->mm_count, 1);
 	init_rwsem(&mm->mmap_sem);
@@ -554,6 +556,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 		return mm;
 	}
 
+	free_mm_autonuma(mm);
+out_free_mm:
 	free_mm(mm);
 	return NULL;
 }
@@ -603,6 +607,7 @@ void __mmdrop(struct mm_struct *mm)
 	destroy_context(mm);
 	mmu_notifier_mm_destroy(mm);
 	check_mm(mm);
+	free_mm_autonuma(mm);
 	free_mm(mm);
 }
 EXPORT_SYMBOL_GPL(__mmdrop);
@@ -885,6 +890,7 @@ fail_nocontext:
 	 * If init_new_context() failed, we cannot use mmput() to free the mm
 	 * because it calls destroy_context()
 	 */
+	free_mm_autonuma(mm);
 	mm_free_pgd(mm);
 	free_mm(mm);
 	return NULL;
@@ -1707,6 +1713,7 @@ void __init proc_caches_init(void)
 	mm_cachep = kmem_cache_create("mm_struct",
 			sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_NOTRACK, NULL);
+	mm_autonuma_init();
 	vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC);
 	mmap_init();
 	nsproxy_cache_init();

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 17/36] autonuma: prevent select_task_rq_fair to return -1
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
                   ` (15 preceding siblings ...)
  2012-08-22 14:59 ` [PATCH 16/36] autonuma: alloc/free/init mm_autonuma Andrea Arcangeli
@ 2012-08-22 14:59 ` Andrea Arcangeli
  2012-08-22 14:59 ` [PATCH 18/36] autonuma: teach CFS about autonuma affinity Andrea Arcangeli
                   ` (19 subsequent siblings)
  36 siblings, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

find_idlest_cpu, when run over all domain levels, shouldn't normally
return -1. With the introduction of the NUMA affinity check that
should still be true most of the time, but it's not guaranteed if the
NUMA affinity of the task changes very quickly. So it's better not to
depend on timing.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/sched/fair.c |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 42a88fa..677b99e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2794,6 +2794,17 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 unlock:
 	rcu_read_unlock();
 
+#ifdef CONFIG_AUTONUMA
+	if (new_cpu < 0)
+		/*
+		 * find_idlest_cpu() may return -1 if
+		 * task_autonuma_cpu() changes all the time, it's very
+		 * unlikely, but we must handle it if it ever happens.
+		 */
+		new_cpu = prev_cpu;
+#endif
+	BUG_ON(new_cpu < 0);
+
 	return new_cpu;
 }
 #endif /* CONFIG_SMP */

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 18/36] autonuma: teach CFS about autonuma affinity
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
                   ` (16 preceding siblings ...)
  2012-08-22 14:59 ` [PATCH 17/36] autonuma: prevent select_task_rq_fair to return -1 Andrea Arcangeli
@ 2012-08-22 14:59 ` Andrea Arcangeli
  2012-08-22 14:59 ` [PATCH 19/36] autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection Andrea Arcangeli
                   ` (18 subsequent siblings)
  36 siblings, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

The CFS scheduler is still in charge of all scheduling decisions. At
times, however, AutoNUMA balancing will override them.

Generally, we'll just rely on the CFS scheduler to keep doing its
thing, while preferring the task's AutoNUMA affine node when deciding
to move a task to a different runqueue or when waking it up.

For example, idle balancing, while looking into the runqueues of busy
CPUs, will first look for a task that "wants" to run on the NUMA node
of this idle CPU (one where task_autonuma_cpu() returns true).

Most of this is encoded in can_migrate_task becoming AutoNUMA aware
and running two passes for each balancing pass, the first NUMA aware,
and the second one relaxed.

Idle or newidle balancing is always allowed to fall back to scheduling
non-affine AutoNUMA tasks (ones with task_selected_nid set to another
node). Load balancing, which affects fairness more than performance,
is only able to schedule against AutoNUMA affinity if the flag
/sys/kernel/mm/autonuma/scheduler/load_balance_strict is not set.

Tasks that haven't been fully profiled yet are not affected by this
because their p->task_autonuma->task_selected_nid is still set to the
original value of -1 and task_autonuma_cpu will always return true in
that case.

Includes fixes from Hillf Danton <dhillf@gmail.com>.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/sched/fair.c |   71 ++++++++++++++++++++++++++++++++++++++++++--------
 1 files changed, 59 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 677b99e..560a170 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2622,6 +2622,8 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 		load = weighted_cpuload(i);
 
 		if (load < min_load || (load == min_load && i == this_cpu)) {
+			if (!task_autonuma_cpu(p, i))
+				continue;
 			min_load = load;
 			idlest = i;
 		}
@@ -2638,31 +2640,43 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	struct sched_domain *sd;
+	bool idle_target;
 
 	/*
-	 * If the task is going to be woken-up on this cpu and if it is
-	 * already idle, then it is the right target.
+	 * If the task is going to be woken-up on this cpu and if it
+	 * is already idle and if this cpu is in the AutoNUMA selected
+	 * NUMA node, then it is the right target.
 	 */
-	if (target == cpu && idle_cpu(cpu))
+	if (target == cpu && idle_cpu(cpu) && task_autonuma_cpu(p, cpu))
 		return cpu;
 
 	/*
-	 * If the task is going to be woken-up on the cpu where it previously
-	 * ran and if it is currently idle, then it the right target.
+	 * If the task is going to be woken-up on the cpu where it
+	 * previously ran and if it is currently idle and if the cpu
+	 * where it run previously is in the AutoNUMA selected node,
+	 * then it the right target.
 	 */
-	if (target == prev_cpu && idle_cpu(prev_cpu))
+	if (target == prev_cpu && idle_cpu(prev_cpu) &&
+	    task_autonuma_cpu(p, prev_cpu))
 		return prev_cpu;
 
 	/*
 	 * Otherwise, check assigned siblings to find an elegible idle cpu.
 	 */
+	idle_target = false;
 	sd = rcu_dereference(per_cpu(sd_llc, target));
 
 	for_each_lower_domain(sd) {
 		if (!cpumask_test_cpu(sd->idle_buddy, tsk_cpus_allowed(p)))
 			continue;
-		if (idle_cpu(sd->idle_buddy))
-			return sd->idle_buddy;
+		if (idle_cpu(sd->idle_buddy)) {
+			if (task_autonuma_cpu(p, sd->idle_buddy))
+				return sd->idle_buddy;
+			else if (!idle_target) {
+				idle_target = true;
+				target = sd->idle_buddy;
+			}
+		}
 	}
 
 	return target;
@@ -2694,7 +2708,8 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		return prev_cpu;
 
 	if (sd_flag & SD_BALANCE_WAKE) {
-		if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+		if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) &&
+		    task_autonuma_cpu(p, cpu))
 			want_affine = 1;
 		new_cpu = prev_cpu;
 	}
@@ -3067,6 +3082,7 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
 #define LBF_ALL_PINNED	0x01
 #define LBF_NEED_BREAK	0x02
 #define LBF_SOME_PINNED 0x04
+#define LBF_NUMA	0x08
 
 struct lb_env {
 	struct sched_domain	*sd;
@@ -3146,7 +3162,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * We do not migrate tasks that are:
 	 * 1) running (obviously), or
 	 * 2) cannot be migrated to this CPU due to cpus_allowed, or
-	 * 3) are cache-hot on their current CPU.
+	 * 3) are cache-hot on their current CPU, or
+	 * 4) going to be migrated to a dst_cpu not in the selected NUMA node
+	 *    if LBF_NUMA is set.
 	 */
 	if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p))) {
 		int new_dst_cpu;
@@ -3181,6 +3199,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 		return 0;
 	}
 
+	if (!sched_autonuma_can_migrate_task(p, env->flags & LBF_NUMA,
+					     env->dst_cpu, env->idle))
+		return 0;
+
 	/*
 	 * Aggressive migration if:
 	 * 1) task is cache cold, or
@@ -3217,6 +3239,8 @@ static int move_one_task(struct lb_env *env)
 {
 	struct task_struct *p, *n;
 
+	env->flags |= LBF_NUMA;
+numa_repeat:
 	list_for_each_entry_safe(p, n, &env->src_rq->cfs_tasks, se.group_node) {
 		if (throttled_lb_pair(task_group(p), env->src_rq->cpu, env->dst_cpu))
 			continue;
@@ -3231,8 +3255,14 @@ static int move_one_task(struct lb_env *env)
 		 * stats here rather than inside move_task().
 		 */
 		schedstat_inc(env->sd, lb_gained[env->idle]);
+		env->flags &= ~LBF_NUMA;
 		return 1;
 	}
+	if (env->flags & LBF_NUMA) {
+		env->flags &= ~LBF_NUMA;
+		goto numa_repeat;
+	}
+
 	return 0;
 }
 
@@ -3257,6 +3287,8 @@ static int move_tasks(struct lb_env *env)
 	if (env->imbalance <= 0)
 		return 0;
 
+	env->flags |= LBF_NUMA;
+numa_repeat:
 	while (!list_empty(tasks)) {
 		p = list_first_entry(tasks, struct task_struct, se.group_node);
 
@@ -3296,9 +3328,13 @@ static int move_tasks(struct lb_env *env)
 		 * kernels will stop after the first task is pulled to minimize
 		 * the critical section.
 		 */
-		if (env->idle == CPU_NEWLY_IDLE)
-			break;
+		if (env->idle == CPU_NEWLY_IDLE) {
+			env->flags &= ~LBF_NUMA;
+			goto out;
+		}
 #endif
+		/* not idle anymore after pulling first task */
+		env->idle = CPU_NOT_IDLE;
 
 		/*
 		 * We only want to steal up to the prescribed amount of
@@ -3311,6 +3347,17 @@ static int move_tasks(struct lb_env *env)
 next:
 		list_move_tail(&p->se.group_node, tasks);
 	}
+	if ((env->flags & (LBF_NUMA|LBF_NEED_BREAK)) == LBF_NUMA) {
+		env->flags &= ~LBF_NUMA;
+		if (env->imbalance > 0) {
+			env->loop = 0;
+			env->loop_break = sched_nr_migrate_break;
+			goto numa_repeat;
+		}
+	}
+#ifdef CONFIG_PREEMPT
+out:
+#endif
 
 	/*
 	 * Right now, this is one of only two places move_task() is called,

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 19/36] autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
                   ` (17 preceding siblings ...)
  2012-08-22 14:59 ` [PATCH 18/36] autonuma: teach CFS about autonuma affinity Andrea Arcangeli
@ 2012-08-22 14:59 ` Andrea Arcangeli
  2012-08-22 20:19   ` Andi Kleen
  2012-08-22 14:59 ` [PATCH 20/36] autonuma: default mempolicy follow AutoNUMA Andrea Arcangeli
                   ` (17 subsequent siblings)
  36 siblings, 1 reply; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

This implements the following parts of autonuma:

o knuma_scand: daemon for setting pte_numa and pmd_numa while
  gathering NUMA mm stats

o NUMA hinting page fault handler: queues pages for migration and
  gathers NUMA task stats

o knuma_migrated: kernel threads that migrate memory from remote nodes
  to the local node

o The rest of autonuma core logic: false sharing detection, sysfs and
  initialization routines

When knuma_scand is not running, the AutoNUMA algorithm is fully
bypassed and will not alter the runtime behavior of memory management
or the scheduler.

The whole AutoNUMA logic is a chain reaction as a result of the
actions of the knuma_scand. Various parts of the code can be described
like different gears (gears as in glxgears).

knuma_scand is the first gear and it collects the mm_autonuma
per-process statistics and at the same time it sets the ptes and pmds
it scans respectively as pte_numa and pmd_numa.

The second gear are the numa hinting page faults. These are triggered
by the pte_numa/pmd_numa pmd/ptes. They collect the task_autonuma
per-thread statistics. They also implement the memory follow CPU logic
where we track if pages are repeatedly accessed by remote nodes. The
memory follow CPU logic can decide to migrate pages across different
NUMA nodes by queuing the pages for migration in the per-node
knuma_migrated queues.

The third gear is knuma_migrated. There is one knuma_migrated daemon
per node. Pages pending for migration are queued in a matrix of
lists. Each knuma_migrated (in parallel with each other) goes over
those lists and migrates the pages queued for migration in round robin
from each incoming node to the node where knuma_migrated is running.

The fourth gear is the NUMA scheduler balancing code. That computes
the statistical information collected in mm->mm_autonuma and
p->task_autonuma and evaluates the status of all CPUs to decide if
tasks should be migrated to CPUs in remote nodes.

The only "input" information of the AutoNUMA algorithm that isn't
collected through NUMA hinting page faults are the per-process
mm->mm_autonuma statistics. Those mm_autonuma statistics are collected
by the knuma_scand pmd/pte scans that are also responsible for setting
pte_numa/pmd_numa to activate the NUMA hinting page faults.

knuma_scand -> NUMA hinting page faults
  |                       |
 \|/                     \|/
mm_autonuma  <->  task_autonuma (CPU follow memory, this is mm_autonuma too)
                  page last_nid  (false thread sharing/thread shared memory detection)
                  queue or cancel page migration (memory follow CPU)

There is one knuma_migratedN daemon per NUMA node. After pages are
queued for a node, the knuma_migratedN daemon for that node will take
care of migrating pages to the node at a steady rate, in parallel
with the daemons for the other nodes. Each daemon migrates pages in a
round-robin fashion from all the other nodes. This keeps all memory
channels in a large system active at the same time and avoids
hammering a single memory channel for too long, thus minimizing
memory bus migration latency effects.

Once a page is queued for asynchronous migration, the migration can
still be canceled if false sharing is later detected.

The code includes some fixes from Hillf Danton <dhillf@gmail.com>.

Math documentation on autonuma_last_nid in the header of
last_nid_set() reworked from sched-numa code by Peter Zijlstra
<a.p.zijlstra@chello.nl>.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Hillf Danton <dhillf@gmail.com>
---
 mm/autonuma.c | 1619 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 1619 insertions(+), 0 deletions(-)
 create mode 100644 mm/autonuma.c

diff --git a/mm/autonuma.c b/mm/autonuma.c
new file mode 100644
index 0000000..a505ec3
--- /dev/null
+++ b/mm/autonuma.c
@@ -0,0 +1,1619 @@
+/*
+ *  Copyright (C) 2012  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ *
+ *  Boot with "numa=fake=2" to test on non NUMA systems.
+ */
+
+#include <linux/mm.h>
+#include <linux/rmap.h>
+#include <linux/kthread.h>
+#include <linux/mmu_notifier.h>
+#include <linux/freezer.h>
+#include <linux/mm_inline.h>
+#include <linux/migrate.h>
+#include <linux/swap.h>
+#include <linux/autonuma.h>
+#include <asm/tlbflush.h>
+#include <asm/pgtable.h>
+
+unsigned long autonuma_flags __read_mostly =
+	(1<<AUTONUMA_POSSIBLE_FLAG)
+	|(1<<AUTONUMA_SCHED_RESET_FLAG)
+#ifdef CONFIG_AUTONUMA_DEFAULT_ENABLED
+	|(1<<AUTONUMA_ENABLED_FLAG)
+#endif
+	|(1<<AUTONUMA_SCAN_PMD_FLAG);
+
+static DEFINE_MUTEX(knumad_mm_mutex);
+
+/* knuma_scand */
+static unsigned int scan_sleep_millisecs __read_mostly = 100;
+static unsigned int scan_sleep_pass_millisecs __read_mostly = 10000;
+static unsigned int pages_to_scan __read_mostly = 128*1024*1024/PAGE_SIZE;
+static DECLARE_WAIT_QUEUE_HEAD(knuma_scand_wait);
+static unsigned long full_scans;
+static unsigned long pages_scanned;
+
+/* knuma_migrated */
+static unsigned int migrate_sleep_millisecs __read_mostly = 100;
+static unsigned int pages_to_migrate __read_mostly = 128*1024*1024/PAGE_SIZE;
+static volatile unsigned long pages_migrated;
+
+static struct knuma_scand_data {
+	struct list_head mm_head; /* entry: mm->mm_autonuma->mm_node */
+	struct mm_struct *mm;
+	unsigned long address;
+	unsigned long *mm_numa_fault_tmp;
+} knuma_scand_data = {
+	.mm_head = LIST_HEAD_INIT(knuma_scand_data.mm_head),
+};
+
+static inline void autonuma_migrate_lock(int nid)
+{
+	spin_lock(&NODE_DATA(nid)->autonuma_lock);
+}
+
+static inline void autonuma_migrate_unlock(int nid)
+{
+	spin_unlock(&NODE_DATA(nid)->autonuma_lock);
+}
+
+static inline void autonuma_migrate_lock_irq(int nid)
+{
+	spin_lock_irq(&NODE_DATA(nid)->autonuma_lock);
+}
+
+static inline void autonuma_migrate_unlock_irq(int nid)
+{
+	spin_unlock_irq(&NODE_DATA(nid)->autonuma_lock);
+}
+
+/* caller already holds the compound_lock */
+void autonuma_migrate_split_huge_page(struct page *page,
+				      struct page *page_tail)
+{
+	int nid, last_nid;
+
+	nid = page->autonuma_migrate_nid;
+	VM_BUG_ON(nid >= MAX_NUMNODES);
+	VM_BUG_ON(nid < -1);
+	VM_BUG_ON(page_tail->autonuma_migrate_nid != -1);
+	if (nid >= 0) {
+		VM_BUG_ON(page_to_nid(page) != page_to_nid(page_tail));
+
+		/*
+		 * The caller only takes the compound_lock for the
+		 * head page. Here we take the lock on the tail page,
+		 * too. So after the pages become visible (after the
+		 * below autonuma_migrate_unlock), they can't be
+		 * removed from the LRU until we drop the
+		 * compound_lock for page_tail.
+		 */
+		compound_lock(page_tail);
+		autonuma_migrate_lock(nid);
+		list_add_tail(&page_tail->autonuma_migrate_node,
+			      &page->autonuma_migrate_node);
+		autonuma_migrate_unlock(nid);
+
+		page_tail->autonuma_migrate_nid = nid;
+		compound_unlock(page_tail);
+	}
+
+	last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+	if (last_nid >= 0)
+		page_tail->autonuma_last_nid = last_nid;
+}
+
+void __autonuma_migrate_page_remove(struct page *page)
+{
+	unsigned long flags;
+	int nid;
+
+	flags = compound_lock_irqsave(page);
+
+	nid = page->autonuma_migrate_nid;
+	VM_BUG_ON(nid >= MAX_NUMNODES);
+	VM_BUG_ON(nid < -1);
+	if (nid >= 0) {
+		int numpages = hpage_nr_pages(page);
+		autonuma_migrate_lock(nid);
+		list_del(&page->autonuma_migrate_node);
+		NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
+		autonuma_migrate_unlock(nid);
+
+		page->autonuma_migrate_nid = -1;
+	}
+
+	compound_unlock_irqrestore(page, flags);
+}
+
+static void __autonuma_migrate_page_add(struct page *page, int dst_nid,
+					int page_nid)
+{
+	unsigned long flags;
+	int nid;
+	int numpages;
+	unsigned long nr_migrate_pages;
+	wait_queue_head_t *wait_queue;
+
+	VM_BUG_ON(dst_nid >= MAX_NUMNODES);
+	VM_BUG_ON(dst_nid < -1);
+	VM_BUG_ON(page_nid >= MAX_NUMNODES);
+	VM_BUG_ON(page_nid < -1);
+
+	VM_BUG_ON(page_nid == dst_nid);
+	VM_BUG_ON(page_to_nid(page) != page_nid);
+
+	/*
+	 * Remove the page from the old migrate node's lru list (if it
+	 * was queued) and add it to the new node's lru list. The page
+	 * autonuma_migrate_nid that tracks where and if the page is
+	 * queued is protected by the compound lock so take that
+	 * first.
+	 */
+	flags = compound_lock_irqsave(page);
+
+	numpages = hpage_nr_pages(page);
+	nid = page->autonuma_migrate_nid;
+	VM_BUG_ON(nid >= MAX_NUMNODES);
+	VM_BUG_ON(nid < -1);
+	if (nid >= 0) {
+		autonuma_migrate_lock(nid);
+		list_del(&page->autonuma_migrate_node);
+		NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
+		autonuma_migrate_unlock(nid);
+	}
+
+	autonuma_migrate_lock(dst_nid);
+	list_add(&page->autonuma_migrate_node,
+		 &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid]);
+	NODE_DATA(dst_nid)->autonuma_nr_migrate_pages += numpages;
+	nr_migrate_pages = NODE_DATA(dst_nid)->autonuma_nr_migrate_pages;
+
+	autonuma_migrate_unlock(dst_nid);
+
+	page->autonuma_migrate_nid = dst_nid;
+
+	compound_unlock_irqrestore(page, flags);
+
+	if (!autonuma_migrate_defer()) {
+		wait_queue = &NODE_DATA(dst_nid)->autonuma_knuma_migrated_wait;
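+		/*
+		 * Hysteresis: only wake knuma_migrated when the queue
+		 * has just crossed the pages_to_migrate threshold, so
+		 * that queueing further pages doesn't keep generating
+		 * wakeups.
+		 */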
+		if (nr_migrate_pages >= pages_to_migrate &&
+		    nr_migrate_pages - numpages < pages_to_migrate &&
+		    waitqueue_active(wait_queue))
+			wake_up_interruptible(wait_queue);
+	}
+}
+
+static void autonuma_migrate_page_add(struct page *page, int dst_nid,
+				      int page_nid)
+{
+	int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+	if (migrate_nid != dst_nid)
+		__autonuma_migrate_page_add(page, dst_nid, page_nid);
+}
+
+static bool autonuma_balance_pgdat(struct pglist_data *pgdat,
+				   int nr_migrate_pages)
+{
+	/* FIXME: this only checks the wmarks, make it move
+	 * "unused" memory or pagecache by queuing it to
+	 * pgdat->autonuma_migrate_head[pgdat->node_id].
+	 */
+	int z;
+	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+		struct zone *zone = pgdat->node_zones + z;
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (zone->all_unreclaimable)
+			continue;
+
+		/*
+		 * FIXME: in theory we're ok if we can obtain
+		 * pages_to_migrate pages from all zones, it doesn't
+		 * need to be all in a single zone. We care about the
+		 * pgdat, not the zone.
+		 */
+
+		/*
+		 * Try not to wake up kswapd by allocating
+		 * pages_to_migrate pages.
+		 */
+		if (!zone_watermark_ok(zone, 0,
+				       high_wmark_pages(zone) +
+				       nr_migrate_pages,
+				       0, 0))
+			continue;
+		return true;
+	}
+	return false;
+}
+
+static void cpu_follow_memory_pass(struct task_struct *p,
+				   struct task_autonuma *task_autonuma,
+				   unsigned long *task_numa_fault)
+{
+	int nid;
+	/* If a new pass started, degrade the stats by a factor of 2 */
+	for_each_node(nid)
+		task_numa_fault[nid] >>= 1;
+	task_autonuma->task_numa_fault_tot >>= 1;
+}
+
+static void numa_hinting_fault_cpu_follow_memory(struct task_struct *p,
+						 int access_nid,
+						 int numpages,
+						 bool new_pass)
+{
+	struct task_autonuma *task_autonuma = p->task_autonuma;
+	unsigned long *task_numa_fault = task_autonuma->task_numa_fault;
+
+	/* prevent sched_autonuma_balance() from running on top of us */
+	local_bh_disable();
+
+	if (unlikely(new_pass))
+		cpu_follow_memory_pass(p, task_autonuma, task_numa_fault);
+	task_numa_fault[access_nid] += numpages;
+	task_autonuma->task_numa_fault_tot += numpages;
+
+	local_bh_enable();
+}
+
+/*
+ * In this function we build a temporal CPU_node<->page relation by
+ * using a two-stage autonuma_last_nid filter to remove short/unlikely
+ * relations.
+ *
+ * Using P(p) ~ n_p / n_t as per frequentist probability, we can
+ * equate a node's CPU usage of a particular page (n_p) per total
+ * usage of this page (n_t) (in a given time-span) to a probability.
+ *
+ * Our periodic faults will then sample this probability and getting
+ * the same result twice in a row, given these samples are fully
+ * independent, is then given by P(p)^2, provided our sample period
+ * is sufficiently short compared to the usage pattern.
+ *
+ * This quadratic squishes small probabilities, making it less likely
+ * that we act on an unlikely CPU_node<->page relation.
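+ *
+ * For example (hypothetical numbers): if a node is responsible for
+ * 30% of the recent faults on a page, the chance of sampling it twice
+ * in a row is only ~9%, so a migration toward that node is rarely
+ * triggered; at 90% the chance is ~81% and the migration happens
+ * quickly.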
+ */
+static inline bool last_nid_set(struct page *page, int this_nid)
+{
+	bool ret = true;
+	int autonuma_last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+	VM_BUG_ON(this_nid < 0);
+	VM_BUG_ON(this_nid >= MAX_NUMNODES);
+	if (autonuma_last_nid >= 0 && autonuma_last_nid != this_nid) {
+		int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+		if (migrate_nid >= 0)
+			__autonuma_migrate_page_remove(page);
+		ret = false;
+	}
+	if (autonuma_last_nid != this_nid)
+		ACCESS_ONCE(page->autonuma_last_nid) = this_nid;
+	return ret;
+}
+
+static int __page_migrate_nid(struct page *page, int page_nid)
+{
+	int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+	if (migrate_nid < 0)
+		migrate_nid = page_nid;
+	return migrate_nid;
+}
+
+static int page_migrate_nid(struct page *page)
+{
+	return __page_migrate_nid(page, page_to_nid(page));
+}
+
+static int numa_hinting_fault_memory_follow_cpu(struct page *page,
+						int this_nid, int page_nid,
+						bool new_pass)
+{
+	if (!last_nid_set(page, this_nid))
+		return page_nid;
+	if (!PageLRU(page))
+		return page_nid;
+	if (this_nid != page_nid)
+		autonuma_migrate_page_add(page, this_nid, page_nid);
+	else
+		autonuma_migrate_page_remove(page);
+	return this_nid;
+}
+
+void numa_hinting_fault(struct page *page, int numpages)
+{
+	/*
+	 * "current->mm" could be different from the "mm" where the
+	 * NUMA hinting page fault happened, if get_user_pages()
+	 * triggered the fault on some other process "mm". That is ok,
+	 * all we care about is to count the "page_nid" access on the
+	 * current->task_autonuma, even if the page belongs to a
+	 * different "mm".
+	 */
+	WARN_ON_ONCE(!current->mm);
+	if (likely(current->mm && !current->mempolicy && autonuma_enabled())) {
+		struct task_struct *p = current;
+		int this_nid, page_nid, access_nid;
+		bool new_pass;
+
+		/*
+		 * new_pass is only true the first time the thread
+		 * faults on this pass of knuma_scand.
+		 */
+		new_pass = p->task_autonuma->task_numa_fault_pass !=
+			p->mm->mm_autonuma->mm_numa_fault_pass;
+		page_nid = page_to_nid(page);
+		this_nid = numa_node_id();
+		VM_BUG_ON(this_nid < 0);
+		VM_BUG_ON(this_nid >= MAX_NUMNODES);
+		access_nid = numa_hinting_fault_memory_follow_cpu(page,
+								  this_nid,
+								  page_nid,
+								  new_pass);
+		numa_hinting_fault_cpu_follow_memory(p, access_nid,
+						     numpages, new_pass);
+		if (unlikely(new_pass))
+			/*
+			 * Set the task's fault_pass equal to the new
+			 * mm's fault_pass, so new_pass will be false
+			 * on the next fault by this thread in this
+			 * same pass.
+			 */
+			p->task_autonuma->task_numa_fault_pass =
+				p->mm->mm_autonuma->mm_numa_fault_pass;
+	}
+}
+
+/* NUMA hinting page fault entry point for ptes */
+pte_t __pte_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
+		       unsigned long addr, pte_t pte, pte_t *ptep)
+{
+	struct page *page;
+	pte = pte_mknonnuma(pte);
+	set_pte_at(mm, addr, ptep, pte);
+	page = vm_normal_page(vma, addr, pte);
+	BUG_ON(!page);
+	numa_hinting_fault(page, 1);
+	return pte;
+}
+
+/* NUMA hinting page fault entry point for regular pmds */
+void __pmd_numa_fixup(struct mm_struct *mm,
+		      unsigned long addr, pmd_t *pmdp)
+{
+	pmd_t pmd;
+	pte_t *pte;
+	unsigned long _addr = addr & PMD_MASK;
+	unsigned long offset;
+	spinlock_t *ptl;
+	bool numa = false;
+	struct vm_area_struct *vma;
+
+	spin_lock(&mm->page_table_lock);
+	pmd = *pmdp;
+	if (pmd_numa(pmd)) {
+		set_pmd_at(mm, _addr, pmdp, pmd_mknonnuma(pmd));
+		numa = true;
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	if (!numa)
+		return;
+
+	vma = find_vma(mm, _addr);
+	/* we're in a page fault so some vma must be in the range */
+	BUG_ON(!vma);
+	BUG_ON(vma->vm_start >= _addr + PMD_SIZE);
+	offset = max(_addr, vma->vm_start) & ~PMD_MASK;
+	VM_BUG_ON(offset >= PMD_SIZE);
+	pte = pte_offset_map_lock(mm, pmdp, _addr, &ptl);
+	pte += offset >> PAGE_SHIFT;
+	for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
+		pte_t pteval = *pte;
+		struct page *page;
+		if (!pte_present(pteval))
+			continue;
+		if (addr >= vma->vm_end) {
+			vma = find_vma(mm, addr);
+			/* there's a pte present so there must be a vma */
+			BUG_ON(!vma);
+			BUG_ON(addr < vma->vm_start);
+		}
+		if (pte_numa(pteval)) {
+			pteval = pte_mknonnuma(pteval);
+			set_pte_at(mm, addr, pte, pteval);
+		}
+		page = vm_normal_page(vma, addr, pteval);
+		if (unlikely(!page))
+			continue;
+		/* only check non-shared pages */
+		if (page_mapcount(page) != 1)
+			continue;
+		numa_hinting_fault(page, 1);
+	}
+	pte_unmap_unlock(pte, ptl);
+}
+
+static inline int task_autonuma_size(void)
+{
+	return sizeof(struct task_autonuma) +
+		nr_node_ids * sizeof(unsigned long);
+}
+
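+/*
+ * Size of the trailing part of task_autonuma cleared by
+ * task_autonuma_reset(): everything from task_numa_fault_pass up to
+ * and including the per-node task_numa_fault[] array
+ * (task_selected_nid is reset separately). The offset is computed
+ * with an open-coded offsetof on a NULL pointer;
+ * mm_autonuma_reset_size() below uses the same idiom for mm_autonuma.
+ */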
+static inline int task_autonuma_reset_size(void)
+{
+	struct task_autonuma *task_autonuma = NULL;
+	return task_autonuma_size() -
+		(int)((char *)(&task_autonuma->task_numa_fault_pass) -
+		      (char *)task_autonuma);
+}
+
+static void task_autonuma_reset(struct task_autonuma *task_autonuma)
+{
+	task_autonuma->task_selected_nid = -1;
+	memset(&task_autonuma->task_numa_fault_pass, 0,
+	       task_autonuma_reset_size());
+}
+
+static inline int mm_autonuma_fault_size(void)
+{
+	return nr_node_ids * sizeof(unsigned long);
+}
+
+static inline int mm_autonuma_size(void)
+{
+	return sizeof(struct mm_autonuma) + mm_autonuma_fault_size();
+}
+
+static inline int mm_autonuma_reset_size(void)
+{
+	struct mm_autonuma *mm_autonuma = NULL;
+	return mm_autonuma_size() -
+		(int)((char *)(&mm_autonuma->mm_numa_fault_pass) -
+		      (char *)mm_autonuma);
+}
+
+static void mm_autonuma_reset(struct mm_autonuma *mm_autonuma)
+{
+	memset(&mm_autonuma->mm_numa_fault_pass, 0, mm_autonuma_reset_size());
+}
+
+void autonuma_setup_new_exec(struct task_struct *p)
+{
+	if (p->task_autonuma)
+		task_autonuma_reset(p->task_autonuma);
+	if (p->mm && p->mm->mm_autonuma)
+		mm_autonuma_reset(p->mm->mm_autonuma);
+}
+
+static inline int knumad_test_exit(struct mm_struct *mm)
+{
+	return atomic_read(&mm->mm_users) == 0;
+}
+
+/*
+ * Here we search for non-shared page mappings (mapcount == 1) and we
+ * set up the pmd/pte_numa on those mappings so the very next access
+ * will fire a NUMA hinting page fault. We also collect the
+ * mm_autonuma statistics for this process mm at the same time.
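+ *
+ * The TLB flushes for the modified pmds/ptes are intentionally
+ * deferred to the caller (knumad_do_scan), which flushes once per
+ * scanned vma range to lower the overhead.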
+ */
+static int knuma_scand_pmd(struct mm_struct *mm,
+			   struct vm_area_struct *vma,
+			   unsigned long address)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte, *_pte;
+	struct page *page;
+	unsigned long _address, end;
+	spinlock_t *ptl;
+	int ret = 0;
+
+	VM_BUG_ON(address & ~PAGE_MASK);
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		goto out;
+
+	pmd = pmd_offset(pud, address);
+	if (pmd_none(*pmd))
+		goto out;
+
+	if (pmd_trans_huge(*pmd)) {
+		spin_lock(&mm->page_table_lock);
+		if (pmd_trans_huge(*pmd)) {
+			VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+			if (unlikely(pmd_trans_splitting(*pmd))) {
+				spin_unlock(&mm->page_table_lock);
+				wait_split_huge_page(vma->anon_vma, pmd);
+			} else {
+				int page_nid;
+				unsigned long *fault_tmp;
+				ret = HPAGE_PMD_NR;
+
+				page = pmd_page(*pmd);
+
+				/* only check non-shared pages */
+				if (page_mapcount(page) != 1) {
+					spin_unlock(&mm->page_table_lock);
+					goto out;
+				}
+
+				page_nid = page_migrate_nid(page);
+				fault_tmp = knuma_scand_data.mm_numa_fault_tmp;
+				fault_tmp[page_nid] += ret;
+
+				if (pmd_numa(*pmd)) {
+					spin_unlock(&mm->page_table_lock);
+					goto out;
+				}
+
+				set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
+				/* defer TLB flush to lower the overhead */
+				spin_unlock(&mm->page_table_lock);
+				goto out;
+			}
+		} else
+			spin_unlock(&mm->page_table_lock);
+	}
+
+	VM_BUG_ON(!pmd_present(*pmd));
+
+	end = min(vma->vm_end, (address + PMD_SIZE) & PMD_MASK);
+	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+	for (_address = address, _pte = pte; _address < end;
+	     _pte++, _address += PAGE_SIZE) {
+		pte_t pteval = *_pte;
+		unsigned long *fault_tmp;
+		if (!pte_present(pteval))
+			continue;
+		page = vm_normal_page(vma, _address, pteval);
+		if (unlikely(!page))
+			continue;
+		/* only check non-shared pages */
+		if (page_mapcount(page) != 1)
+			continue;
+
+		fault_tmp = knuma_scand_data.mm_numa_fault_tmp;
+		fault_tmp[page_migrate_nid(page)]++;
+
+		if (pte_numa(pteval))
+			continue;
+
+		if (!autonuma_scan_pmd())
+			set_pte_at(mm, _address, _pte, pte_mknuma(pteval));
+
+		/* defer TLB flush to lower the overhead */
+		ret++;
+	}
+	pte_unmap_unlock(pte, ptl);
+
+	if (ret && !pmd_numa(*pmd) && autonuma_scan_pmd()) {
+		/*
+		 * Mark the page table pmd as numa if "autonuma scan
+		 * pmd" mode is enabled.
+		 */
+		spin_lock(&mm->page_table_lock);
+		set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
+		spin_unlock(&mm->page_table_lock);
+		/* defer TLB flush to lower the overhead */
+	}
+
+out:
+	return ret;
+}
+
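+/*
+ * Publish the statistics accumulated in the temporary buffer
+ * (knuma_scand_data.mm_numa_fault_tmp) into mm->mm_autonuma once the
+ * pass over this mm is complete, instead of updating mm_autonuma
+ * incrementally while the scan is still in progress.
+ */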
+static void mm_numa_fault_tmp_flush(struct mm_struct *mm)
+{
+	int nid;
+	struct mm_autonuma *mma = mm->mm_autonuma;
+	unsigned long tot;
+	unsigned long *fault_tmp = knuma_scand_data.mm_numa_fault_tmp;
+
+	/* FIXME: would be better protected with write_seqlock_bh() */
+	local_bh_disable();
+
+	tot = 0;
+	for_each_node(nid) {
+		unsigned long faults = fault_tmp[nid];
+		fault_tmp[nid] = 0;
+		mma->mm_numa_fault[nid] = faults;
+		tot += faults;
+	}
+	mma->mm_numa_fault_tot = tot;
+
+	local_bh_enable();
+}
+
+static void mm_numa_fault_tmp_reset(void)
+{
+	memset(knuma_scand_data.mm_numa_fault_tmp, 0,
+	       mm_autonuma_fault_size());
+}
+
+static inline void validate_mm_numa_fault_tmp(unsigned long address)
+{
+#ifdef CONFIG_DEBUG_VM
+	int nid;
+	if (address)
+		return;
+	for_each_node(nid)
+		BUG_ON(knuma_scand_data.mm_numa_fault_tmp[nid]);
+#endif
+}
+
+/*
+ * Scan the next part of the mm. Keep track of the progress made and
+ * return it.
+ */
+static int knumad_do_scan(void)
+{
+	struct mm_struct *mm;
+	struct mm_autonuma *mm_autonuma;
+	unsigned long address;
+	struct vm_area_struct *vma;
+	int progress = 0;
+
+	mm = knuma_scand_data.mm;
+	/*
+	 * knuma_scand_data.mm is NULL after the end of each
+	 * knuma_scand pass. So when it's NULL we start from
+	 * scratch with the very first mm in the list.
+	 */
+	if (!mm) {
+		if (unlikely(list_empty(&knuma_scand_data.mm_head)))
+			return pages_to_scan;
+		mm_autonuma = list_entry(knuma_scand_data.mm_head.next,
+					 struct mm_autonuma, mm_node);
+		mm = mm_autonuma->mm;
+		knuma_scand_data.address = 0;
+		knuma_scand_data.mm = mm;
+		atomic_inc(&mm->mm_count);
+		mm_autonuma->mm_numa_fault_pass++;
+	}
+	address = knuma_scand_data.address;
+
+	validate_mm_numa_fault_tmp(address);
+
+	mutex_unlock(&knumad_mm_mutex);
+
+	down_read(&mm->mmap_sem);
+	if (unlikely(knumad_test_exit(mm)))
+		vma = NULL;
+	else
+		vma = find_vma(mm, address);
+
+	progress++;
+	for (; vma && progress < pages_to_scan; vma = vma->vm_next) {
+		unsigned long start_addr, end_addr;
+		cond_resched();
+		if (unlikely(knumad_test_exit(mm))) {
+			progress++;
+			break;
+		}
+
+		if (!vma->anon_vma || vma_policy(vma)) {
+			progress++;
+			continue;
+		}
+		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)) {
+			progress++;
+			continue;
+		}
+		if (is_vma_temporary_stack(vma)) {
+			progress++;
+			continue;
+		}
+
+		VM_BUG_ON(address & ~PAGE_MASK);
+		if (address < vma->vm_start)
+			address = vma->vm_start;
+
+		start_addr = address;
+		while (address < vma->vm_end) {
+			cond_resched();
+			if (unlikely(knumad_test_exit(mm)))
+				break;
+
+			VM_BUG_ON(address < vma->vm_start ||
+				  address + PAGE_SIZE > vma->vm_end);
+			progress += knuma_scand_pmd(mm, vma, address);
+			/* move to next address */
+			address = (address + PMD_SIZE) & PMD_MASK;
+			if (progress >= pages_to_scan)
+				break;
+		}
+		end_addr = min(address, vma->vm_end);
+
+		/*
+		 * Flush the TLB for the mm to start the NUMA hinting
+		 * page faults after we finish scanning this vma part.
+		 */
+		mmu_notifier_invalidate_range_start(vma->vm_mm, start_addr,
+						    end_addr);
+		flush_tlb_range(vma, start_addr, end_addr);
+		mmu_notifier_invalidate_range_end(vma->vm_mm, start_addr,
+						  end_addr);
+	}
+	up_read(&mm->mmap_sem); /* exit_mmap will destroy ptes after this */
+
+	mutex_lock(&knumad_mm_mutex);
+	VM_BUG_ON(knuma_scand_data.mm != mm);
+	knuma_scand_data.address = address;
+	/*
+	 * Change the current mm if this mm is about to die, or if we
+	 * scanned all vmas of this mm.
+	 */
+	if (knumad_test_exit(mm) || !vma) {
+		mm_autonuma = mm->mm_autonuma;
+		if (mm_autonuma->mm_node.next != &knuma_scand_data.mm_head) {
+			mm_autonuma = list_entry(mm_autonuma->mm_node.next,
+						 struct mm_autonuma, mm_node);
+			knuma_scand_data.mm = mm_autonuma->mm;
+			atomic_inc(&knuma_scand_data.mm->mm_count);
+			knuma_scand_data.address = 0;
+			knuma_scand_data.mm->mm_autonuma->mm_numa_fault_pass++;
+		} else
+			knuma_scand_data.mm = NULL;
+
+		if (knumad_test_exit(mm)) {
+			list_del(&mm->mm_autonuma->mm_node);
+			/* tell autonuma_exit not to list_del */
+			VM_BUG_ON(mm->mm_autonuma->mm != mm);
+			mm->mm_autonuma->mm = NULL;
+			mm_numa_fault_tmp_reset();
+		} else
+			mm_numa_fault_tmp_flush(mm);
+
+		mmdrop(mm);
+	}
+
+	return progress;
+}
+
+static void wake_up_knuma_migrated(void)
+{
+	int nid;
+
+	lru_add_drain();
+	for_each_online_node(nid) {
+		struct pglist_data *pgdat = NODE_DATA(nid);
+		if (pgdat->autonuma_nr_migrate_pages &&
+		    waitqueue_active(&pgdat->autonuma_knuma_migrated_wait))
+			wake_up_interruptible(&pgdat->
+					      autonuma_knuma_migrated_wait);
+	}
+}
+
+static void knuma_scand_disabled(void)
+{
+	if (!autonuma_enabled())
+		wait_event_freezable(knuma_scand_wait,
+				     autonuma_enabled() ||
+				     kthread_should_stop());
+}
+
+static int knuma_scand(void *none)
+{
+	struct mm_struct *mm = NULL;
+	int progress = 0, _progress;
+	unsigned long total_progress = 0;
+
+	set_freezable();
+
+	knuma_scand_disabled();
+
+	/*
+	 * Serialize the knuma_scand_data against
+	 * autonuma_enter/exit().
+	 */
+	mutex_lock(&knumad_mm_mutex);
+
+	for (;;) {
+		if (unlikely(kthread_should_stop()))
+			break;
+
+		/* Do one loop of scanning, keeping track of the progress */
+		_progress = knumad_do_scan();
+		progress += _progress;
+		total_progress += _progress;
+		mutex_unlock(&knumad_mm_mutex);
+
+		/* Check if we completed one full scan pass */
+		if (unlikely(!knuma_scand_data.mm)) {
+			autonuma_printk("knuma_scand %lu\n", total_progress);
+			pages_scanned += total_progress;
+			total_progress = 0;
+			full_scans++;
+
+			wait_event_freezable_timeout(knuma_scand_wait,
+						     kthread_should_stop(),
+						     msecs_to_jiffies(
+						     scan_sleep_pass_millisecs/2));
+			/* flush the last pending pages < pages_to_migrate */
+			wake_up_knuma_migrated();
+			wait_event_freezable_timeout(knuma_scand_wait,
+						     kthread_should_stop(),
+						     msecs_to_jiffies(
+						     scan_sleep_pass_millisecs/2));
+
+			if (autonuma_debug()) {
+				extern void sched_autonuma_dump_mm(void);
+				sched_autonuma_dump_mm();
+			}
+
+			/* wait while there is no pinned mm */
+			knuma_scand_disabled();
+		}
+		if (progress > pages_to_scan) {
+			progress = 0;
+			wait_event_freezable_timeout(knuma_scand_wait,
+						     kthread_should_stop(),
+						     msecs_to_jiffies(
+						     scan_sleep_millisecs));
+		}
+		cond_resched();
+		mutex_lock(&knumad_mm_mutex);
+	}
+
+	mm = knuma_scand_data.mm;
+	knuma_scand_data.mm = NULL;
+	if (mm && knumad_test_exit(mm)) {
+		list_del(&mm->mm_autonuma->mm_node);
+		/* tell autonuma_exit not to list_del */
+		VM_BUG_ON(mm->mm_autonuma->mm != mm);
+		mm->mm_autonuma->mm = NULL;
+	}
+	mutex_unlock(&knumad_mm_mutex);
+
+	if (mm)
+		mmdrop(mm);
+	mm_numa_fault_tmp_reset();
+
+	return 0;
+}
+
+static int isolate_migratepages(struct list_head *migratepages,
+				struct pglist_data *pgdat)
+{
+	int nr = 0, nid;
+	struct list_head *heads = pgdat->autonuma_migrate_head;
+
+	/* FIXME: THP balancing, restart from last nid */
+	for_each_online_node(nid) {
+		struct zone *zone;
+		struct page *page;
+		struct lruvec *lruvec;
+
+		cond_resched();
+		/*
+		 * Let the admin notice if the CPU binding of the
+		 * knuma_migrated kernel threads has been altered in a
+		 * suboptimal way.
+		 */
+		WARN_ONCE(numa_node_id() != pgdat->node_id,
+			  "knuma_migrated%d: the CPU binding of this kernel "
+			  "thread has been altered in a suboptimal way\n",
+			  pgdat->node_id);
+		if (nid == pgdat->node_id) {
+			VM_BUG_ON(!list_empty(&heads[nid]));
+			continue;
+		}
+		if (list_empty(&heads[nid]))
+			continue;
+		/* some page wants to go to this pgdat */
+		/*
+		 * Take the lock with irqs disabled to avoid a lock
+		 * inversion with the lru_lock. The lru_lock is taken
+		 * before the autonuma_migrate_lock in
+		 * split_huge_page. If we didn't disable irqs, the
+		 * lru_lock could be taken by interrupts after we have
+		 * obtained the autonuma_migrate_lock here.
+		 */
+		autonuma_migrate_lock_irq(pgdat->node_id);
+		if (list_empty(&heads[nid])) {
+			autonuma_migrate_unlock_irq(pgdat->node_id);
+			continue;
+		}
+		page = list_entry(heads[nid].prev,
+				  struct page,
+				  autonuma_migrate_node);
+		if (unlikely(!get_page_unless_zero(page))) {
+			/*
+			 * The page is being freed and will remove itself
+			 * from the autonuma list shortly, so skip it for now.
+			 */
+			list_del(&page->autonuma_migrate_node);
+			list_add(&page->autonuma_migrate_node,
+				 &heads[nid]);
+			autonuma_migrate_unlock_irq(pgdat->node_id);
+			autonuma_printk("autonuma migrate page is free\n");
+			continue;
+		}
+		autonuma_migrate_unlock_irq(pgdat->node_id);
+		if (!PageLRU(page)) {
+			autonuma_printk("autonuma migrate page not in LRU\n");
+			__autonuma_migrate_page_remove(page);
+			put_page(page);
+			continue;
+		}
+
+		VM_BUG_ON(nid != page_to_nid(page));
+
+		if (PageTransHuge(page)) {
+			VM_BUG_ON(!PageAnon(page));
+			/* FIXME: remove split_huge_page */
+			if (unlikely(split_huge_page(page))) {
+				autonuma_printk("autonuma migrate THP free\n");
+				__autonuma_migrate_page_remove(page);
+				put_page(page);
+				continue;
+			}
+		}
+
+		__autonuma_migrate_page_remove(page);
+
+		zone = page_zone(page);
+		spin_lock_irq(&zone->lru_lock);
+
+		/* Must run under the lru_lock and before page isolation */
+		lruvec = mem_cgroup_page_lruvec(page, zone);
+
+		if (!__isolate_lru_page(page, 0)) {
+			VM_BUG_ON(PageTransCompound(page));
+			del_page_from_lru_list(page, lruvec, page_lru(page));
+			inc_zone_state(zone, page_is_file_cache(page) ?
+				       NR_ISOLATED_FILE : NR_ISOLATED_ANON);
+			spin_unlock_irq(&zone->lru_lock);
+			/*
+			 * Pin the page at least until
+			 * __isolate_lru_page succeeds
+			 * (__isolate_lru_page pins it again when it
+			 * succeeds). If we unpin before
+			 * __isolate_lru_page returns, the page could
+			 * be freed and reallocated out from under
+			 * us. Thus our previous checks on the page,
+			 * and split_huge_page, would be worthless.
+			 */
+			put_page(page);
+
+			list_add(&page->lru, migratepages);
+			nr += hpage_nr_pages(page);
+		} else {
+			/* FIXME: losing page, safest and simplest for now */
+			spin_unlock_irq(&zone->lru_lock);
+			put_page(page);
+			autonuma_printk("autonuma migrate page lost\n");
+		}
+	}
+
+	return nr;
+}
+
+static struct page *alloc_migrate_dst_page(struct page *page,
+					   unsigned long data,
+					   int **result)
+{
+	int nid = (int) data;
+	struct page *newpage;
+	newpage = alloc_pages_exact_node(nid,
+					 GFP_HIGHUSER_MOVABLE | GFP_THISNODE,
+					 0);
+	if (newpage)
+		newpage->autonuma_last_nid = page->autonuma_last_nid;
+	return newpage;
+}
+
+static void knumad_do_migrate(struct pglist_data *pgdat)
+{
+	int nr_migrate_pages = 0;
+	LIST_HEAD(migratepages);
+
+	autonuma_printk("nr_migrate_pages %lu to node %d\n",
+			pgdat->autonuma_nr_migrate_pages, pgdat->node_id);
+	do {
+		int isolated = 0;
+		if (autonuma_balance_pgdat(pgdat, nr_migrate_pages))
+			isolated = isolate_migratepages(&migratepages, pgdat);
+		/* FIXME: might need to check too many isolated */
+		if (!isolated)
+			break;
+		nr_migrate_pages += isolated;
+	} while (nr_migrate_pages < pages_to_migrate);
+
+	if (nr_migrate_pages) {
+		int err;
+		autonuma_printk("migrate %d to node %d\n", nr_migrate_pages,
+				pgdat->node_id);
+		pages_migrated += nr_migrate_pages; /* FIXME: per node */
+		err = migrate_pages(&migratepages, alloc_migrate_dst_page,
+				    pgdat->node_id, false, true);
+		if (err)
+			/* FIXME: requeue failed pages */
+			putback_lru_pages(&migratepages);
+	}
+}
+
+static int knuma_migrated(void *arg)
+{
+	struct pglist_data *pgdat = (struct pglist_data *)arg;
+	int nid = pgdat->node_id;
+	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(nowakeup);
+
+	set_freezable();
+
+	for (;;) {
+		if (unlikely(kthread_should_stop()))
+			break;
+		/* FIXME: scan the free levels of this node: we may
+		 * not be allowed to receive memory if the wmarks of
+		 * this pgdat are below high. In the future also add
+		 * uninteresting pages, like not recently accessed
+		 * pages, to pgdat->autonuma_migrate_head[pgdat->node_id]
+		 * so we can move our memory away to other nodes in
+		 * order to satisfy the high wmark described above (so
+		 * migration can continue).
+		 */
+		knumad_do_migrate(pgdat);
+		if (!pgdat->autonuma_nr_migrate_pages) {
+			wait_event_freezable(
+				pgdat->autonuma_knuma_migrated_wait,
+				pgdat->autonuma_nr_migrate_pages ||
+				kthread_should_stop());
+			autonuma_printk("wake knuma_migrated %d\n", nid);
+		} else
+			wait_event_freezable_timeout(nowakeup,
+						     kthread_should_stop(),
+						     msecs_to_jiffies(
+						     migrate_sleep_millisecs));
+	}
+
+	return 0;
+}
+
+void autonuma_enter(struct mm_struct *mm)
+{
+	if (!autonuma_possible())
+		return;
+
+	mutex_lock(&knumad_mm_mutex);
+	list_add_tail(&mm->mm_autonuma->mm_node, &knuma_scand_data.mm_head);
+	mutex_unlock(&knumad_mm_mutex);
+}
+
+void autonuma_exit(struct mm_struct *mm)
+{
+	bool serialize;
+
+	if (!autonuma_possible())
+		return;
+
+	serialize = false;
+	mutex_lock(&knumad_mm_mutex);
+	if (knuma_scand_data.mm == mm)
+		serialize = true;
+	else if (mm->mm_autonuma->mm) {
+		VM_BUG_ON(mm->mm_autonuma->mm != mm);
+		mm->mm_autonuma->mm = NULL; /* debug */
+		list_del(&mm->mm_autonuma->mm_node);
+	}
+	mutex_unlock(&knumad_mm_mutex);
+
+	if (serialize) {
+		/* prevent the mm from going away under knumad_do_scan */
+		down_write(&mm->mmap_sem);
+		up_write(&mm->mmap_sem);
+	}
+}
+
+static int start_knuma_scand(void)
+{
+	int err = 0;
+	struct task_struct *knumad_thread;
+
+	knuma_scand_data.mm_numa_fault_tmp = kzalloc(mm_autonuma_fault_size(),
+						     GFP_KERNEL);
+	if (!knuma_scand_data.mm_numa_fault_tmp)
+		return -ENOMEM;
+
+	knumad_thread = kthread_run(knuma_scand, NULL, "knuma_scand");
+	if (unlikely(IS_ERR(knumad_thread))) {
+		autonuma_printk(KERN_ERR
+				"knumad: kthread_run(knuma_scand) failed\n");
+		err = PTR_ERR(knumad_thread);
+	}
+	return err;
+}
+
+static int start_knuma_migrated(void)
+{
+	int err = 0;
+	struct task_struct *knumad_thread;
+	int nid;
+
+	for_each_online_node(nid) {
+		knumad_thread = kthread_create_on_node(knuma_migrated,
+						       NODE_DATA(nid),
+						       nid,
+						       "knuma_migrated%d",
+						       nid);
+		if (unlikely(IS_ERR(knumad_thread))) {
+			autonuma_printk(KERN_ERR
+					"knumad: "
+					"kthread_run(knuma_migrated%d) "
+					"failed\n", nid);
+			err = PTR_ERR(knumad_thread);
+		} else {
+			autonuma_printk("cpumask %d %lx\n", nid,
+					cpumask_of_node(nid)->bits[0]);
+			kthread_bind_node(knumad_thread, nid);
+			wake_up_process(knumad_thread);
+		}
+	}
+	return err;
+}
+
+
+#ifdef CONFIG_SYSFS
+
+static ssize_t flag_show(struct kobject *kobj,
+			 struct kobj_attribute *attr, char *buf,
+			 enum autonuma_flag flag)
+{
+	return sprintf(buf, "%d\n",
+		       !!test_bit(flag, &autonuma_flags));
+}
+static ssize_t flag_store(struct kobject *kobj,
+			  struct kobj_attribute *attr,
+			  const char *buf, size_t count,
+			  enum autonuma_flag flag)
+{
+	unsigned long value;
+	int ret;
+
+	ret = kstrtoul(buf, 10, &value);
+	if (ret < 0)
+		return ret;
+	if (value > 1)
+		return -EINVAL;
+
+	if (value)
+		set_bit(flag, &autonuma_flags);
+	else
+		clear_bit(flag, &autonuma_flags);
+
+	return count;
+}
+
+static ssize_t enabled_show(struct kobject *kobj,
+			    struct kobj_attribute *attr, char *buf)
+{
+	return flag_show(kobj, attr, buf, AUTONUMA_ENABLED_FLAG);
+}
+static ssize_t enabled_store(struct kobject *kobj,
+			     struct kobj_attribute *attr,
+			     const char *buf, size_t count)
+{
+	ssize_t ret;
+
+	ret = flag_store(kobj, attr, buf, count, AUTONUMA_ENABLED_FLAG);
+
+	if (ret > 0 && autonuma_enabled())
+		wake_up_interruptible(&knuma_scand_wait);
+
+	return ret;
+}
+static struct kobj_attribute enabled_attr =
+	__ATTR(enabled, 0644, enabled_show, enabled_store);
+
+#define SYSFS_ENTRY(NAME, FLAG)						\
+static ssize_t NAME ## _show(struct kobject *kobj,			\
+			     struct kobj_attribute *attr, char *buf)	\
+{									\
+	return flag_show(kobj, attr, buf, FLAG);			\
+}									\
+									\
+static ssize_t NAME ## _store(struct kobject *kobj,			\
+			      struct kobj_attribute *attr,		\
+			      const char *buf, size_t count)		\
+{									\
+	return flag_store(kobj, attr, buf, count, FLAG);		\
+}									\
+static struct kobj_attribute NAME ## _attr =				\
+	__ATTR(NAME, 0644, NAME ## _show, NAME ## _store);
+
+SYSFS_ENTRY(pmd, AUTONUMA_SCAN_PMD_FLAG);
+SYSFS_ENTRY(debug, AUTONUMA_DEBUG_FLAG);
+#ifdef CONFIG_DEBUG_VM
+SYSFS_ENTRY(load_balance_strict, AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG);
+SYSFS_ENTRY(defer, AUTONUMA_MIGRATE_DEFER_FLAG);
+SYSFS_ENTRY(reset, AUTONUMA_SCHED_RESET_FLAG);
+#endif /* CONFIG_DEBUG_VM */
+
+#undef SYSFS_ENTRY
+
+enum {
+	SYSFS_KNUMA_SCAND_SLEEP_ENTRY,
+	SYSFS_KNUMA_SCAND_PAGES_ENTRY,
+	SYSFS_KNUMA_MIGRATED_SLEEP_ENTRY,
+	SYSFS_KNUMA_MIGRATED_PAGES_ENTRY,
+};
+
+#define SYSFS_ENTRY(NAME, SYSFS_TYPE)				\
+static ssize_t NAME ## _show(struct kobject *kobj,		\
+			     struct kobj_attribute *attr,	\
+			     char *buf)				\
+{								\
+	return sprintf(buf, "%u\n", NAME);			\
+}								\
+static ssize_t NAME ## _store(struct kobject *kobj,		\
+			      struct kobj_attribute *attr,	\
+			      const char *buf, size_t count)	\
+{								\
+	unsigned long val;					\
+	int err;						\
+								\
+	err = strict_strtoul(buf, 10, &val);			\
+	if (err || val > UINT_MAX)				\
+		return -EINVAL;					\
+	switch (SYSFS_TYPE) {					\
+	case SYSFS_KNUMA_SCAND_PAGES_ENTRY:			\
+	case SYSFS_KNUMA_MIGRATED_PAGES_ENTRY:			\
+		if (!val)					\
+			return -EINVAL;				\
+		break;						\
+	}							\
+								\
+	NAME = val;						\
+	switch (SYSFS_TYPE) {					\
+	case SYSFS_KNUMA_SCAND_SLEEP_ENTRY:			\
+		wake_up_interruptible(&knuma_scand_wait);	\
+		break;						\
+	case							\
+		SYSFS_KNUMA_MIGRATED_SLEEP_ENTRY:		\
+		wake_up_knuma_migrated();			\
+		break;						\
+	}							\
+								\
+	return count;						\
+}								\
+static struct kobj_attribute NAME ## _attr =			\
+	__ATTR(NAME, 0644, NAME ## _show, NAME ## _store);
+
+SYSFS_ENTRY(scan_sleep_millisecs, SYSFS_KNUMA_SCAND_SLEEP_ENTRY);
+SYSFS_ENTRY(scan_sleep_pass_millisecs, SYSFS_KNUMA_SCAND_SLEEP_ENTRY);
+SYSFS_ENTRY(pages_to_scan, SYSFS_KNUMA_SCAND_PAGES_ENTRY);
+
+SYSFS_ENTRY(migrate_sleep_millisecs, SYSFS_KNUMA_MIGRATED_SLEEP_ENTRY);
+SYSFS_ENTRY(pages_to_migrate, SYSFS_KNUMA_MIGRATED_PAGES_ENTRY);
+
+#undef SYSFS_ENTRY
+
+static struct attribute *autonuma_attr[] = {
+	&enabled_attr.attr,
+	&debug_attr.attr,
+	NULL,
+};
+static struct attribute_group autonuma_attr_group = {
+	.attrs = autonuma_attr,
+};
+
+#define SYSFS_ENTRY(NAME)					\
+static ssize_t NAME ## _show(struct kobject *kobj,		\
+			     struct kobj_attribute *attr,	\
+			     char *buf)				\
+{								\
+	return sprintf(buf, "%lu\n", NAME);			\
+}								\
+static struct kobj_attribute NAME ## _attr =			\
+	__ATTR_RO(NAME);
+
+SYSFS_ENTRY(full_scans);
+SYSFS_ENTRY(pages_scanned);
+SYSFS_ENTRY(pages_migrated);
+
+#undef SYSFS_ENTRY
+
+static struct attribute *knuma_scand_attr[] = {
+	&scan_sleep_millisecs_attr.attr,
+	&scan_sleep_pass_millisecs_attr.attr,
+	&pages_to_scan_attr.attr,
+	&pages_scanned_attr.attr,
+	&full_scans_attr.attr,
+	&pmd_attr.attr,
+	NULL,
+};
+static struct attribute_group knuma_scand_attr_group = {
+	.attrs = knuma_scand_attr,
+	.name = "knuma_scand",
+};
+
+static struct attribute *knuma_migrated_attr[] = {
+	&migrate_sleep_millisecs_attr.attr,
+	&pages_to_migrate_attr.attr,
+	&pages_migrated_attr.attr,
+#ifdef CONFIG_DEBUG_VM
+	&defer_attr.attr,
+#endif
+	NULL,
+};
+static struct attribute_group knuma_migrated_attr_group = {
+	.attrs = knuma_migrated_attr,
+	.name = "knuma_migrated",
+};
+
+#ifdef CONFIG_DEBUG_VM
+static struct attribute *scheduler_attr[] = {
+	&load_balance_strict_attr.attr,
+	&reset_attr.attr,
+	NULL,
+};
+static struct attribute_group scheduler_attr_group = {
+	.attrs = scheduler_attr,
+	.name = "scheduler",
+};
+#endif
+
+static int __init autonuma_init_sysfs(struct kobject **autonuma_kobj)
+{
+	int err;
+
+	*autonuma_kobj = kobject_create_and_add("autonuma", mm_kobj);
+	if (unlikely(!*autonuma_kobj)) {
+		printk(KERN_ERR "autonuma: failed kobject create\n");
+		return -ENOMEM;
+	}
+
+	err = sysfs_create_group(*autonuma_kobj, &autonuma_attr_group);
+	if (err) {
+		printk(KERN_ERR "autonuma: failed register autonuma group\n");
+		goto delete_obj;
+	}
+
+	err = sysfs_create_group(*autonuma_kobj, &knuma_scand_attr_group);
+	if (err) {
+		printk(KERN_ERR
+		       "autonuma: failed register knuma_scand group\n");
+		goto remove_autonuma;
+	}
+
+	err = sysfs_create_group(*autonuma_kobj, &knuma_migrated_attr_group);
+	if (err) {
+		printk(KERN_ERR
+		       "autonuma: failed register knuma_migrated group\n");
+		goto remove_knuma_scand;
+	}
+
+#ifdef CONFIG_DEBUG_VM
+	err = sysfs_create_group(*autonuma_kobj, &scheduler_attr_group);
+	if (err) {
+		printk(KERN_ERR
+		       "autonuma: failed register scheduler group\n");
+		goto remove_knuma_migrated;
+	}
+#endif
+
+	return 0;
+
+#ifdef CONFIG_DEBUG_VM
+remove_knuma_migrated:
+	sysfs_remove_group(*autonuma_kobj, &knuma_migrated_attr_group);
+#endif
+remove_knuma_scand:
+	sysfs_remove_group(*autonuma_kobj, &knuma_scand_attr_group);
+remove_autonuma:
+	sysfs_remove_group(*autonuma_kobj, &autonuma_attr_group);
+delete_obj:
+	kobject_put(*autonuma_kobj);
+	return err;
+}
+
+static void __init autonuma_exit_sysfs(struct kobject *autonuma_kobj)
+{
+	sysfs_remove_group(autonuma_kobj, &knuma_migrated_attr_group);
+	sysfs_remove_group(autonuma_kobj, &knuma_scand_attr_group);
+	sysfs_remove_group(autonuma_kobj, &autonuma_attr_group);
+	kobject_put(autonuma_kobj);
+}
+#else
+static inline int autonuma_init_sysfs(struct kobject **autonuma_kobj)
+{
+	return 0;
+}
+
+static inline void autonuma_exit_sysfs(struct kobject *autonuma_kobj)
+{
+}
+#endif /* CONFIG_SYSFS */
+
+static int __init noautonuma_setup(char *str)
+{
+	if (autonuma_possible()) {
+		printk("AutoNUMA permanently disabled\n");
+		clear_bit(AUTONUMA_POSSIBLE_FLAG, &autonuma_flags);
+		WARN_ON(autonuma_possible()); /* avoid early crash */
+	}
+	return 1;
+}
+__setup("noautonuma", noautonuma_setup);
+
+static bool autonuma_init_checks_failed(void)
+{
+	/* safety checks on nr_node_ids */
+	int last_nid = find_last_bit(node_states[N_POSSIBLE].bits, MAX_NUMNODES);
+	if (last_nid + 1 != nr_node_ids) {
+		WARN_ON(1);
+		return true;
+	}
+	if (num_possible_nodes() > nr_node_ids) {
+		WARN_ON(1);
+		return true;
+	}
+	return false;
+}
+
+static int __init autonuma_init(void)
+{
+	int err;
+	struct kobject *autonuma_kobj;
+
+	VM_BUG_ON(num_possible_nodes() < 1);
+	if (num_possible_nodes() <= 1 || !autonuma_possible()) {
+		clear_bit(AUTONUMA_POSSIBLE_FLAG, &autonuma_flags);
+		return -EINVAL;
+	} else if (autonuma_init_checks_failed()) {
+		printk("autonuma disengaged: init checks failed\n");
+		clear_bit(AUTONUMA_POSSIBLE_FLAG, &autonuma_flags);
+		return -EINVAL;
+	}
+
+	err = autonuma_init_sysfs(&autonuma_kobj);
+	if (err)
+		return err;
+
+	err = start_knuma_scand();
+	if (err) {
+		printk("failed to start knuma_scand\n");
+		goto out;
+	}
+	err = start_knuma_migrated();
+	if (err) {
+		printk("failed to start knuma_migrated\n");
+		goto out;
+	}
+
+	printk("AutoNUMA initialized successfully\n");
+	return err;
+
+out:
+	autonuma_exit_sysfs(autonuma_kobj);
+	return err;
+}
+module_init(autonuma_init)
+
+static struct kmem_cache *task_autonuma_cachep;
+
+int alloc_task_autonuma(struct task_struct *tsk, struct task_struct *orig,
+			 int node)
+{
+	int err = 1;
+	struct task_autonuma *task_autonuma;
+
+	if (!autonuma_possible())
+		goto no_numa;
+	task_autonuma = kmem_cache_alloc_node(task_autonuma_cachep,
+					      GFP_KERNEL, node);
+	if (!task_autonuma)
+		goto out;
+	if (autonuma_sched_reset())
+		task_autonuma_reset(task_autonuma);
+	else
+		memcpy(task_autonuma, orig->task_autonuma,
+		       task_autonuma_size());
+	tsk->task_autonuma = task_autonuma;
+no_numa:
+	err = 0;
+out:
+	return err;
+}
+
+void free_task_autonuma(struct task_struct *tsk)
+{
+	if (!autonuma_possible()) {
+		BUG_ON(tsk->task_autonuma);
+		return;
+	}
+
+	BUG_ON(!tsk->task_autonuma);
+	kmem_cache_free(task_autonuma_cachep, tsk->task_autonuma);
+	tsk->task_autonuma = NULL;
+}
+
+void __init task_autonuma_init(void)
+{
+	struct task_autonuma *task_autonuma;
+
+	BUG_ON(current != &init_task);
+
+	if (!autonuma_possible())
+		return;
+
+	task_autonuma_cachep =
+		kmem_cache_create("task_autonuma",
+				  task_autonuma_size(), 0,
+				  SLAB_PANIC | SLAB_HWCACHE_ALIGN, NULL);
+
+	task_autonuma = kmem_cache_alloc_node(task_autonuma_cachep,
+					      GFP_KERNEL, numa_node_id());
+	BUG_ON(!task_autonuma);
+	task_autonuma_reset(task_autonuma);
+	BUG_ON(current->task_autonuma);
+	current->task_autonuma = task_autonuma;
+}
+
+static struct kmem_cache *mm_autonuma_cachep;
+
+int alloc_mm_autonuma(struct mm_struct *mm)
+{
+	int err = 1;
+	struct mm_autonuma *mm_autonuma;
+
+	if (!autonuma_possible())
+		goto no_numa;
+	mm_autonuma = kmem_cache_alloc(mm_autonuma_cachep, GFP_KERNEL);
+	if (!mm_autonuma)
+		goto out;
+	if (autonuma_sched_reset() || !mm->mm_autonuma)
+		mm_autonuma_reset(mm_autonuma);
+	else
+		memcpy(mm_autonuma, mm->mm_autonuma, mm_autonuma_size());
+
+	/*
+	 * We're not leaking memory here: if mm->mm_autonuma is not
+	 * NULL, it is a non-refcounted copy of the parent's
+	 * mm->mm_autonuma pointer.
+	 */
+	mm->mm_autonuma = mm_autonuma;
+	mm_autonuma->mm = mm;
+no_numa:
+	err = 0;
+out:
+	return err;
+}
+
+void free_mm_autonuma(struct mm_struct *mm)
+{
+	if (!autonuma_possible()) {
+		BUG_ON(mm->mm_autonuma);
+		return;
+	}
+
+	BUG_ON(!mm->mm_autonuma);
+	kmem_cache_free(mm_autonuma_cachep, mm->mm_autonuma);
+	mm->mm_autonuma = NULL;
+}
+
+void __init mm_autonuma_init(void)
+{
+	BUG_ON(current != &init_task);
+	BUG_ON(current->mm);
+
+	if (!autonuma_possible())
+		return;
+
+	mm_autonuma_cachep =
+		kmem_cache_create("mm_autonuma",
+				  mm_autonuma_size(), 0,
+				  SLAB_PANIC | SLAB_HWCACHE_ALIGN, NULL);
+}

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 20/36] autonuma: default mempolicy follow AutoNUMA
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
                   ` (18 preceding siblings ...)
  2012-08-22 14:59 ` [PATCH 19/36] autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection Andrea Arcangeli
@ 2012-08-22 14:59 ` Andrea Arcangeli
  2012-08-22 14:59 ` [PATCH 21/36] autonuma: call autonuma_split_huge_page() Andrea Arcangeli
                   ` (16 subsequent siblings)
  36 siblings, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

If a task_selected_nid has already been selected for the task, try to
allocate memory from it even if it's temporarily not the local
node. Chances are it's where most of its memory is already located and
where it will run in the future.
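
A minimal standalone sketch of the node choice this patch applies in
the default-policy allocation path (simplified; the helper name here is
made up for illustration, and -1 stands for "no node selected yet" as
with task_selected_nid):

static int autonuma_alloc_nid(int task_selected_nid, int local_nid)
{
	/* prefer the node AutoNUMA has already converged on */
	if (task_selected_nid >= 0)
		return task_selected_nid;
	/* otherwise fall back to the local node as before */
	return local_nid;
}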

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/mempolicy.c |   12 ++++++++++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index bd92431..19a8f72 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1951,10 +1951,18 @@ retry_cpuset:
 	 */
 	if (pol->mode == MPOL_INTERLEAVE)
 		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
-	else
+	else {
+		int nid = -1;
+#ifdef CONFIG_AUTONUMA
+		if (current->task_autonuma)
+			nid = current->task_autonuma->task_selected_nid;
+#endif
+		if (nid < 0)
+			nid = numa_node_id();
 		page = __alloc_pages_nodemask(gfp, order,
-				policy_zonelist(gfp, pol, numa_node_id()),
+				policy_zonelist(gfp, pol, nid),
 				policy_nodemask(gfp, pol));
+	}
 
 	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
 		goto retry_cpuset;

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 21/36] autonuma: call autonuma_split_huge_page()
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
                   ` (19 preceding siblings ...)
  2012-08-22 14:59 ` [PATCH 20/36] autonuma: default mempolicy follow AutoNUMA Andrea Arcangeli
@ 2012-08-22 14:59 ` Andrea Arcangeli
  2012-08-22 14:59 ` [PATCH 22/36] autonuma: make khugepaged pte_numa aware Andrea Arcangeli
                   ` (15 subsequent siblings)
  36 siblings, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

This is needed to make sure the tail pages are also queued into the
migration queues of knuma_migrated across a transparent hugepage
split.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5cdf668..08fd33c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -17,6 +17,7 @@
 #include <linux/khugepaged.h>
 #include <linux/freezer.h>
 #include <linux/mman.h>
+#include <linux/autonuma.h>
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
 #include "internal.h"
@@ -1316,6 +1317,7 @@ static void __split_huge_page_refcount(struct page *page)
 		BUG_ON(!PageSwapBacked(page_tail));
 
 		lru_add_page_tail(page, page_tail, lruvec);
+		autonuma_migrate_split_huge_page(page, page_tail);
 	}
 	atomic_sub(tail_count, &page->_count);
 	BUG_ON(__page_count(page) <= 0);

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 22/36] autonuma: make khugepaged pte_numa aware
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
                   ` (20 preceding siblings ...)
  2012-08-22 14:59 ` [PATCH 21/36] autonuma: call autonuma_split_huge_page() Andrea Arcangeli
@ 2012-08-22 14:59 ` Andrea Arcangeli
  2012-08-22 14:59 ` [PATCH 23/36] autonuma: retain page last_nid information in khugepaged Andrea Arcangeli
                   ` (14 subsequent siblings)
  36 siblings, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

If any of the ptes that khugepaged is collapsing was a pte_numa, the
resulting trans huge pmd will be a pmd_numa too.

See the comment inline for why we require just one pte_numa pte to
make a pmd_numa pmd. If needed later we could change the number of
pte_numa ptes required to create a pmd_numa and make it tunable with
sysfs too.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c |   33 +++++++++++++++++++++++++++++++--
 1 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 08fd33c..a65590f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1816,12 +1816,19 @@ out:
 	return isolated;
 }
 
-static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
+/*
+ * Do the actual data copy for mapped ptes and release the mapped
+ * pages, or alternatively zero out the transparent hugepage in the
+ * mapping holes. Transfer the page_autonuma information in the
+ * process. Return true if any of the mapped ptes was of numa type.
+ */
+static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
 				      struct vm_area_struct *vma,
 				      unsigned long address,
 				      spinlock_t *ptl)
 {
 	pte_t *_pte;
+	bool mknuma = false;
 	for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++) {
 		pte_t pteval = *_pte;
 		struct page *src_page;
@@ -1849,11 +1856,29 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
 			page_remove_rmap(src_page);
 			spin_unlock(ptl);
 			free_page_and_swap_cache(src_page);
+
+			/*
+			 * Only require one pte_numa mapped by a pmd
+			 * to make it a pmd_numa, too. To avoid the
+			 * risk of losing NUMA hinting page faults, it
+			 * is better to overestimate the NUMA node
+			 * affinity with a node where we just
+			 * collapsed a hugepage, rather than
+			 * underestimate it.
+			 *
+			 * Note: if AUTONUMA_SCAN_PMD_FLAG is set, we
+			 * won't find any pte_numa ptes since we're
+			 * only setting NUMA hinting at the pmd
+			 * level.
+			 */
+			mknuma |= pte_numa(pteval);
 		}
 
 		address += PAGE_SIZE;
 		page++;
 	}
+
+	return mknuma;
 }
 
 static void collapse_huge_page(struct mm_struct *mm,
@@ -1871,6 +1896,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	spinlock_t *ptl;
 	int isolated;
 	unsigned long hstart, hend;
+	bool mknuma = false;
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 #ifndef CONFIG_NUMA
@@ -1989,7 +2015,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	anon_vma_unlock(vma->anon_vma);
 
-	__collapse_huge_page_copy(pte, new_page, vma, address, ptl);
+	mknuma = pmd_numa(_pmd);
+	mknuma |= __collapse_huge_page_copy(pte, new_page, vma, address, ptl);
 	pte_unmap(pte);
 	__SetPageUptodate(new_page);
 	pgtable = pmd_pgtable(_pmd);
@@ -1999,6 +2026,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	_pmd = mk_pmd(new_page, vma->vm_page_prot);
 	_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
 	_pmd = pmd_mkhuge(_pmd);
+	if (mknuma)
+		_pmd = pmd_mknuma(_pmd);
 
 	/*
 	 * spin_lock() below is not the equivalent of smp_wmb(), so

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 23/36] autonuma: retain page last_nid information in khugepaged
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
                   ` (21 preceding siblings ...)
  2012-08-22 14:59 ` [PATCH 22/36] autonuma: make khugepaged pte_numa aware Andrea Arcangeli
@ 2012-08-22 14:59 ` Andrea Arcangeli
  2012-08-22 14:59 ` [PATCH 24/36] autonuma: numa hinting page faults entry points Andrea Arcangeli
                   ` (13 subsequent siblings)
  36 siblings, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

When pages are collapsed, try to keep the last_nid information from one
of the original pages.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c |   14 ++++++++++++++
 1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a65590f..0d2a12f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1829,6 +1829,9 @@ static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
 {
 	pte_t *_pte;
 	bool mknuma = false;
+#ifdef CONFIG_AUTONUMA
+	int autonuma_last_nid = -1;
+#endif
 	for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++) {
 		pte_t pteval = *_pte;
 		struct page *src_page;
@@ -1838,6 +1841,17 @@ static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
 			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
 		} else {
 			src_page = pte_page(pteval);
+#ifdef CONFIG_AUTONUMA
+			/* pick the first one, better than nothing */
+			if (autonuma_last_nid < 0) {
+				autonuma_last_nid =
+					ACCESS_ONCE(src_page->
+						    autonuma_last_nid);
+				if (autonuma_last_nid >= 0)
+					ACCESS_ONCE(page->autonuma_last_nid) =
+						autonuma_last_nid;
+			}
+#endif
 			copy_user_highpage(page, src_page, address, vma);
 			VM_BUG_ON(page_mapcount(src_page) != 1);
 			VM_BUG_ON(page_count(src_page) != 0);

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 24/36] autonuma: numa hinting page faults entry points
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
                   ` (22 preceding siblings ...)
  2012-08-22 14:59 ` [PATCH 23/36] autonuma: retain page last_nid information in khugepaged Andrea Arcangeli
@ 2012-08-22 14:59 ` Andrea Arcangeli
  2012-08-22 14:59 ` [PATCH 25/36] autonuma: reset autonuma page data when pages are freed Andrea Arcangeli
                   ` (12 subsequent siblings)
  36 siblings, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

This is where the numa hinting page faults are detected and passed
over to the AutoNUMA core logic.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/huge_mm.h |    2 ++
 mm/huge_memory.c        |   18 ++++++++++++++++++
 mm/memory.c             |   31 +++++++++++++++++++++++++++++++
 3 files changed, 51 insertions(+), 0 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ad4e2e0..5270c81 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -11,6 +11,8 @@ extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			       unsigned long address, pmd_t *pmd,
 			       pmd_t orig_pmd);
+extern pmd_t __huge_pmd_numa_fixup(struct mm_struct *mm, unsigned long addr,
+				   pmd_t pmd, pmd_t *pmdp);
 extern pgtable_t get_pmd_huge_pte(struct mm_struct *mm);
 extern struct page *follow_trans_huge_pmd(struct mm_struct *mm,
 					  unsigned long addr,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0d2a12f..067cba1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1037,6 +1037,24 @@ out:
 	return page;
 }
 
+#ifdef CONFIG_AUTONUMA
+/* NUMA hinting page fault entry point for trans huge pmds */
+pmd_t __huge_pmd_numa_fixup(struct mm_struct *mm, unsigned long addr,
+			    pmd_t pmd, pmd_t *pmdp)
+{
+	spin_lock(&mm->page_table_lock);
+	if (pmd_same(pmd, *pmdp)) {
+		struct page *page = pmd_page(pmd);
+		pmd = pmd_mknonnuma(pmd);
+		set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmdp, pmd);
+		numa_hinting_fault(page, HPAGE_PMD_NR);
+		VM_BUG_ON(pmd_numa(pmd));
+	}
+	spin_unlock(&mm->page_table_lock);
+	return pmd;
+}
+#endif
+
 int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 pmd_t *pmd, unsigned long addr)
 {
diff --git a/mm/memory.c b/mm/memory.c
index 71282f5..00f1ae7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -57,6 +57,7 @@
 #include <linux/swapops.h>
 #include <linux/elf.h>
 #include <linux/gfp.h>
+#include <linux/autonuma.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -3418,6 +3419,31 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
 }
 
+static inline pte_t pte_numa_fixup(struct mm_struct *mm,
+				   struct vm_area_struct *vma,
+				   unsigned long addr, pte_t pte, pte_t *ptep)
+{
+	if (pte_numa(pte))
+		pte = __pte_numa_fixup(mm, vma, addr, pte, ptep);
+	return pte;
+}
+
+static inline void pmd_numa_fixup(struct mm_struct *mm,
+				  unsigned long addr, pmd_t *pmd)
+{
+	if (pmd_numa(*pmd))
+		__pmd_numa_fixup(mm, addr, pmd);
+}
+
+static inline pmd_t huge_pmd_numa_fixup(struct mm_struct *mm,
+					unsigned long addr,
+					pmd_t pmd, pmd_t *pmdp)
+{
+	if (pmd_numa(pmd))
+		pmd = __huge_pmd_numa_fixup(mm, addr, pmd, pmdp);
+	return pmd;
+}
+
 /*
  * These routines also need to handle stuff like marking pages dirty
  * and/or accessed for architectures that don't do it in hardware (most
@@ -3460,6 +3486,7 @@ int handle_pte_fault(struct mm_struct *mm,
 	spin_lock(ptl);
 	if (unlikely(!pte_same(*pte, entry)))
 		goto unlock;
+	entry = pte_numa_fixup(mm, vma, address, entry, pte);
 	if (flags & FAULT_FLAG_WRITE) {
 		if (!pte_write(entry))
 			return do_wp_page(mm, vma, address,
@@ -3530,6 +3557,8 @@ retry:
 		 */
 		orig_pmd = ACCESS_ONCE(*pmd);
 		if (pmd_trans_huge(orig_pmd)) {
+			orig_pmd = huge_pmd_numa_fixup(mm, address,
+						       orig_pmd, pmd);
 			if (flags & FAULT_FLAG_WRITE &&
 			    !pmd_write(orig_pmd) &&
 			    !pmd_trans_splitting(orig_pmd)) {
@@ -3548,6 +3577,8 @@ retry:
 		}
 	}
 
+	pmd_numa_fixup(mm, address, pmd);
+
 	/*
 	 * Use __pte_alloc instead of pte_alloc_map, because we can't
 	 * run pte_offset_map on the pmd, if an huge pmd could

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 25/36] autonuma: reset autonuma page data when pages are freed
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
                   ` (23 preceding siblings ...)
  2012-08-22 14:59 ` [PATCH 24/36] autonuma: numa hinting page faults entry points Andrea Arcangeli
@ 2012-08-22 14:59 ` Andrea Arcangeli
  2012-08-22 14:59 ` [PATCH 26/36] autonuma: link mm/autonuma.o and kernel/sched/numa.o Andrea Arcangeli
                   ` (11 subsequent siblings)
  36 siblings, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

When pages are freed, abort any pending migration. If knuma_migrated
arrives first, it will notice because get_page_unless_zero would fail.

You can safely ignore the #ifdef because a later patch (page_autonuma)
cleans it up.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/page_alloc.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8c9cad5..49e2916 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -619,6 +619,10 @@ static inline int free_pages_check(struct page *page)
 		bad_page(page);
 		return 1;
 	}
+	autonuma_migrate_page_remove(page);
+#ifdef CONFIG_AUTONUMA
+	page->autonuma_last_nid = -1;
+#endif
 	if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
 		page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	return 0;

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 26/36] autonuma: link mm/autonuma.o and kernel/sched/numa.o
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
                   ` (24 preceding siblings ...)
  2012-08-22 14:59 ` [PATCH 25/36] autonuma: reset autonuma page data when pages are freed Andrea Arcangeli
@ 2012-08-22 14:59 ` Andrea Arcangeli
  2012-08-22 14:59 ` [PATCH 27/36] autonuma: add CONFIG_AUTONUMA and CONFIG_AUTONUMA_DEFAULT_ENABLED Andrea Arcangeli
                   ` (10 subsequent siblings)
  36 siblings, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Link the AutoNUMA core and scheduler object files in the kernel if
CONFIG_AUTONUMA=y.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 kernel/sched/Makefile |    1 +
 mm/Makefile           |    1 +
 2 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 173ea52..783a840 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -16,3 +16,4 @@ obj-$(CONFIG_SMP) += cpupri.o
 obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
+obj-$(CONFIG_AUTONUMA) += numa.o
diff --git a/mm/Makefile b/mm/Makefile
index 92753e2..0fd3165 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -34,6 +34,7 @@ obj-$(CONFIG_FRONTSWAP)	+= frontswap.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
 obj-$(CONFIG_NUMA) 	+= mempolicy.o
+obj-$(CONFIG_AUTONUMA) 	+= autonuma.o
 obj-$(CONFIG_SPARSEMEM)	+= sparse.o
 obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_SLOB) += slob.o

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 27/36] autonuma: add CONFIG_AUTONUMA and CONFIG_AUTONUMA_DEFAULT_ENABLED
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
                   ` (25 preceding siblings ...)
  2012-08-22 14:59 ` [PATCH 26/36] autonuma: link mm/autonuma.o and kernel/sched/numa.o Andrea Arcangeli
@ 2012-08-22 14:59 ` Andrea Arcangeli
  2012-08-22 14:59 ` [PATCH 28/36] autonuma: page_autonuma Andrea Arcangeli
                   ` (9 subsequent siblings)
  36 siblings, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Add the config options to allow building the kernel with AutoNUMA.

If CONFIG_AUTONUMA_DEFAULT_ENABLED is "=y", then
/sys/kernel/mm/autonuma/enabled will be equal to 1, and AutoNUMA will
be enabled automatically at boot.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/Kconfig     |    3 +++
 arch/x86/Kconfig |    1 +
 mm/Kconfig       |   17 +++++++++++++++++
 3 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 72f2fa1..ee3ed89 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -281,4 +281,7 @@ config SECCOMP_FILTER
 
 	  See Documentation/prctl/seccomp_filter.txt for details.
 
+config HAVE_ARCH_AUTONUMA
+	bool
+
 source "kernel/gcov/Kconfig"
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8ec3a1a..4cbdfce 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -97,6 +97,7 @@ config X86
 	select KTIME_SCALAR if X86_32
 	select GENERIC_STRNCPY_FROM_USER
 	select GENERIC_STRNLEN_USER
+	select HAVE_ARCH_AUTONUMA
 
 config INSTRUCTION_DECODER
 	def_bool (KPROBES || PERF_EVENTS || UPROBES)
diff --git a/mm/Kconfig b/mm/Kconfig
index d5c8019..f00a0cd 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -211,6 +211,23 @@ config MIGRATION
 	  pages as migration can relocate pages to satisfy a huge page
 	  allocation instead of reclaiming.
 
+config AUTONUMA
+	bool "AutoNUMA"
+	select MIGRATION
+	depends on NUMA && HAVE_ARCH_AUTONUMA
+	help
+	  Automatic NUMA CPU scheduling and memory migration.
+
+	  This avoids requiring the administrator to manually set up
+	  hard NUMA bindings in order to achieve optimal performance
+	  on NUMA hardware.
+
+config AUTONUMA_DEFAULT_ENABLED
+	bool "AutoNUMA default enabled"
+	depends on AUTONUMA
+	help
+	  Automatic NUMA CPU scheduling and memory migration enabled at boot.
+
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
 


* [PATCH 28/36] autonuma: page_autonuma
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
                   ` (26 preceding siblings ...)
  2012-08-22 14:59 ` [PATCH 27/36] autonuma: add CONFIG_AUTONUMA and CONFIG_AUTONUMA_DEFAULT_ENABLED Andrea Arcangeli
@ 2012-08-22 14:59 ` Andrea Arcangeli
  2012-08-22 14:59 ` [PATCH 29/36] autonuma: autonuma_migrate_head[0] dynamic size Andrea Arcangeli
                   ` (8 subsequent siblings)
  36 siblings, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Move the AutoNUMA per-page information out of "struct page" and into a
separate page_autonuma data structure, allocated in the memsection
(with sparsemem) or in the pgdat (with flatmem).

This is done to avoid growing the size of "struct page". The
page_autonuma data is only allocated if the kernel is booted on real
NUMA hardware and noautonuma is not passed as a parameter to the
kernel.
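
As an aside for reviewers, here is a rough userspace sketch of the
flatmem lookup idea (demo-only names, not the kernel code): one
page_autonuma array per node, indexed by the pfn offset from the
node's first pfn.

#include <stdio.h>
#include <stdlib.h>

struct page_autonuma_demo {
	int autonuma_migrate_nid;
	int autonuma_last_nid;
};

struct node_demo {
	unsigned long node_start_pfn;
	unsigned long node_spanned_pages;
	struct page_autonuma_demo *node_page_autonuma;
};

/* same idea as lookup_page_autonuma() in the !SPARSEMEM case */
static struct page_autonuma_demo *lookup_demo(struct node_demo *node,
					      unsigned long pfn)
{
	return node->node_page_autonuma + (pfn - node->node_start_pfn);
}

int main(void)
{
	struct node_demo node = { .node_start_pfn = 0x100000,
				  .node_spanned_pages = 1024 };
	unsigned long i;

	node.node_page_autonuma = malloc(node.node_spanned_pages *
					 sizeof(*node.node_page_autonuma));
	if (!node.node_page_autonuma)
		return 1;
	for (i = 0; i < node.node_spanned_pages; i++) {
		node.node_page_autonuma[i].autonuma_migrate_nid = -1;
		node.node_page_autonuma[i].autonuma_last_nid = -1;
	}
	lookup_demo(&node, 0x100010)->autonuma_last_nid = 1;
	printf("last_nid for pfn 0x100010: %d\n",
	       lookup_demo(&node, 0x100010)->autonuma_last_nid);
	free(node.node_page_autonuma);
	return 0;
}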

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma.h       |   18 +++-
 include/linux/autonuma_types.h |   55 +++++++++
 include/linux/mm_types.h       |   26 ----
 include/linux/mmzone.h         |   24 +++-
 include/linux/page_autonuma.h  |   53 +++++++++
 init/main.c                    |    2 +
 mm/Makefile                    |    2 +-
 mm/autonuma.c                  |  102 +++++++++++------
 mm/huge_memory.c               |   13 ++-
 mm/page_alloc.c                |   21 +---
 mm/page_autonuma.c             |  246 ++++++++++++++++++++++++++++++++++++++++
 mm/sparse.c                    |  126 +++++++++++++++++++-
 12 files changed, 588 insertions(+), 100 deletions(-)
 create mode 100644 include/linux/page_autonuma.h
 create mode 100644 mm/page_autonuma.c

diff --git a/include/linux/autonuma.h b/include/linux/autonuma.h
index 85ca5eb..1d87ecc 100644
--- a/include/linux/autonuma.h
+++ b/include/linux/autonuma.h
@@ -7,15 +7,26 @@
 
 extern void autonuma_enter(struct mm_struct *mm);
 extern void autonuma_exit(struct mm_struct *mm);
-extern void __autonuma_migrate_page_remove(struct page *page);
+extern void __autonuma_migrate_page_remove(struct page *,
+					   struct page_autonuma *);
 extern void autonuma_migrate_split_huge_page(struct page *page,
 					     struct page *page_tail);
 extern void autonuma_setup_new_exec(struct task_struct *p);
+extern struct page_autonuma *lookup_page_autonuma(struct page *page);
 
 static inline void autonuma_migrate_page_remove(struct page *page)
 {
-	if (ACCESS_ONCE(page->autonuma_migrate_nid) >= 0)
-		__autonuma_migrate_page_remove(page);
+	struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+	if (ACCESS_ONCE(page_autonuma->autonuma_migrate_nid) >= 0)
+		__autonuma_migrate_page_remove(page, page_autonuma);
+}
+
+static inline void autonuma_free_page(struct page *page)
+{
+	if (autonuma_possible()) {
+		autonuma_migrate_page_remove(page);
+		lookup_page_autonuma(page)->autonuma_last_nid = -1;
+	}
 }
 
 #define autonuma_printk(format, args...) \
@@ -29,6 +40,7 @@ static inline void autonuma_migrate_page_remove(struct page *page) {}
 static inline void autonuma_migrate_split_huge_page(struct page *page,
 						    struct page *page_tail) {}
 static inline void autonuma_setup_new_exec(struct task_struct *p) {}
+static inline void autonuma_free_page(struct page *page) {}
 
 #endif /* CONFIG_AUTONUMA */
 
diff --git a/include/linux/autonuma_types.h b/include/linux/autonuma_types.h
index 9673ce8..525c31f 100644
--- a/include/linux/autonuma_types.h
+++ b/include/linux/autonuma_types.h
@@ -78,6 +78,61 @@ struct task_autonuma {
 	/* do not add more variables here, the above array size is dynamic */
 };
 
+/*
+ * Per page (or per-pageblock) structure dynamically allocated only if
+ * autonuma is possible.
+ */
+struct page_autonuma {
+	/*
+	 * To modify autonuma_last_nid locklessly, the architecture
+	 * needs to have SMP atomic granularity < sizeof(long). Not
+	 * all architectures have this, notably some ancient Alphas
+	 * (but none of them should run in NUMA
+	 * systems). Architectures without this granularity require
+	 * autonuma_last_nid to be a long.
+	 */
+#ifdef CONFIG_64BIT
+	/*
+	 * If autonuma_migrate_nid is >= 0, it means the page_autonuma
+	 * structure is linked into one of the NUMA node's migrate
+	 * lists. Which list is determined by the NUMA node the page
+	 * belongs to. If autonuma_migrate_nid is -1, the
+	 * page_autonuma structure is not linked into any NUMA node's
+	 * migrate list.
+	 */
+	int autonuma_migrate_nid;
+	/*
+	 * autonuma_last_nid records the NUMA node that accessed the
+	 * page during the last NUMA hinting page fault. If a
+	 * different node accesses the page next, AutoNUMA will not
+	 * migrate the page. This tries to avoid page thrashing by
+	 * requiring that a page be accessed by the same node twice in
+	 * a row before it is queued for migration.
+	 */
+	int autonuma_last_nid;
+#else
+#if MAX_NUMNODES > 32767
+#error "too many nodes"
+#endif
+	short autonuma_migrate_nid;
+	short autonuma_last_nid;
+#endif
+	/*
+	 * This is the list node that links the page (referenced by
+	 * the page_autonuma structure) in the
+	 * &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid] lru.
+	 */
+	struct list_head autonuma_migrate_node;
+
+	/*
+	 * To find the page starting from the autonuma_migrate_node we
+	 * need a backlink.
+	 *
+	 * FIXME: drop it;
+	 */
+	struct page *page;
+};
+
 extern int alloc_task_autonuma(struct task_struct *tsk,
 			       struct task_struct *orig,
 			       int node);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3f10fef..c80101c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -152,32 +152,6 @@ struct page {
 		struct page *first_page;	/* Compound tail pages */
 	};
 
-#ifdef CONFIG_AUTONUMA
-	/*
-	 * FIXME: move to pgdat section along with the memcg and allocate
-	 * at runtime only in presence of a numa system.
-	 */
-	/*
-	 * To modify autonuma_last_nid lockless the architecture,
-	 * needs SMP atomic granularity < sizeof(long), not all archs
-	 * have that, notably some ancient alpha (but none of those
-	 * should run in NUMA systems). Archs without that requires
-	 * autonuma_last_nid to be a long.
-	 */
-#ifdef CONFIG_64BIT
-	int autonuma_migrate_nid;
-	int autonuma_last_nid;
-#else
-#if MAX_NUMNODES > 32767
-#error "too many nodes"
-#endif
-	/* FIXME: remember to check the updates are atomic */
-	short autonuma_migrate_nid;
-	short autonuma_last_nid;
-#endif
-	struct list_head autonuma_migrate_node;
-#endif
-
 	/*
 	 * On machines where all RAM is mapped into kernel address space,
 	 * we can simply calculate the virtual address. On machines with
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a5920f8..853e236 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -710,12 +710,9 @@ typedef struct pglist_data {
 	int kswapd_max_order;
 	enum zone_type classzone_idx;
 #ifdef CONFIG_AUTONUMA
-	/*
-	 * lock serializing all lists with heads in the
-	 * autonuma_migrate_head[] array, and the
-	 * autonuma_nr_migrate_pages field.
-	 */
-	spinlock_t autonuma_lock;
+#if !defined(CONFIG_SPARSEMEM)
+	struct page_autonuma *node_page_autonuma;
+#endif
 	/*
 	 * All pages from node "page_nid" to be migrated to this node,
 	 * will be queued into the list
@@ -726,6 +723,12 @@ typedef struct pglist_data {
 	unsigned long autonuma_nr_migrate_pages;
 	/* waitqueue for this node knuma_migrated daemon */
 	wait_queue_head_t autonuma_knuma_migrated_wait;
+	/*
+	 * lock serializing all lists with heads in the
+	 * autonuma_migrate_head[] array, and the
+	 * autonuma_nr_migrate_pages field.
+	 */
+	spinlock_t autonuma_lock;
 #endif
 } pg_data_t;
 
@@ -1088,6 +1091,15 @@ struct mem_section {
 	 * section. (see memcontrol.h/page_cgroup.h about this.)
 	 */
 	struct page_cgroup *page_cgroup;
+#endif
+#ifdef CONFIG_AUTONUMA
+	/*
+	 * With SPARSEMEM the page_autonuma pointer is not kept in the
+	 * pgdat; it is stored here in the mem_section instead.
+	 */
+	struct page_autonuma *section_page_autonuma;
+#endif
+#if defined(CONFIG_MEMCG) ^ defined(CONFIG_AUTONUMA)
 	unsigned long pad;
 #endif
 };
diff --git a/include/linux/page_autonuma.h b/include/linux/page_autonuma.h
new file mode 100644
index 0000000..9763e61
--- /dev/null
+++ b/include/linux/page_autonuma.h
@@ -0,0 +1,53 @@
+#ifndef _LINUX_PAGE_AUTONUMA_H
+#define _LINUX_PAGE_AUTONUMA_H
+
+#if defined(CONFIG_AUTONUMA) && !defined(CONFIG_SPARSEMEM)
+extern void __init page_autonuma_init_flatmem(void);
+#else
+static inline void __init page_autonuma_init_flatmem(void) {}
+#endif
+
+#ifdef CONFIG_AUTONUMA
+
+#include <linux/autonuma_flags.h>
+
+extern void __meminit page_autonuma_map_init(struct page *page,
+					     struct page_autonuma *page_autonuma,
+					     int nr_pages);
+
+#ifdef CONFIG_SPARSEMEM
+#define PAGE_AUTONUMA_SIZE (sizeof(struct page_autonuma))
+#define SECTION_PAGE_AUTONUMA_SIZE (PAGE_AUTONUMA_SIZE *	\
+				    PAGES_PER_SECTION)
+#endif
+
+extern void __meminit pgdat_autonuma_init(struct pglist_data *);
+
+#else /* CONFIG_AUTONUMA */
+
+#ifdef CONFIG_SPARSEMEM
+struct page_autonuma;
+#define PAGE_AUTONUMA_SIZE 0
+#define SECTION_PAGE_AUTONUMA_SIZE 0
+
+#define autonuma_possible() false
+
+#endif /* CONFIG_SPARSEMEM */
+
+static inline void pgdat_autonuma_init(struct pglist_data *pgdat) {}
+
+#endif /* CONFIG_AUTONUMA */
+
+#ifdef CONFIG_SPARSEMEM
+extern struct page_autonuma * __meminit __kmalloc_section_page_autonuma(int nid,
+									unsigned long nr_pages);
+extern void __kfree_section_page_autonuma(struct page_autonuma *page_autonuma,
+					  unsigned long nr_pages);
+extern void __init sparse_early_page_autonuma_alloc_node(struct page_autonuma **page_autonuma_map,
+							 unsigned long pnum_begin,
+							 unsigned long pnum_end,
+							 unsigned long map_count,
+							 int nodeid);
+#endif
+
+#endif /* _LINUX_PAGE_AUTONUMA_H */
diff --git a/init/main.c b/init/main.c
index b286730..586764f 100644
--- a/init/main.c
+++ b/init/main.c
@@ -69,6 +69,7 @@
 #include <linux/slab.h>
 #include <linux/perf_event.h>
 #include <linux/file.h>
+#include <linux/page_autonuma.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -456,6 +457,7 @@ static void __init mm_init(void)
 	 * bigger than MAX_ORDER unless SPARSEMEM.
 	 */
 	page_cgroup_init_flatmem();
+	page_autonuma_init_flatmem();
 	mem_init();
 	kmem_cache_init();
 	percpu_init_late();
diff --git a/mm/Makefile b/mm/Makefile
index 0fd3165..5a4fa30 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -34,7 +34,7 @@ obj-$(CONFIG_FRONTSWAP)	+= frontswap.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
 obj-$(CONFIG_NUMA) 	+= mempolicy.o
-obj-$(CONFIG_AUTONUMA) 	+= autonuma.o
+obj-$(CONFIG_AUTONUMA) 	+= autonuma.o page_autonuma.o
 obj-$(CONFIG_SPARSEMEM)	+= sparse.o
 obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_SLOB) += slob.o
diff --git a/mm/autonuma.c b/mm/autonuma.c
index a505ec3..7967507 100644
--- a/mm/autonuma.c
+++ b/mm/autonuma.c
@@ -76,11 +76,18 @@ void autonuma_migrate_split_huge_page(struct page *page,
 				      struct page *page_tail)
 {
 	int nid, last_nid;
+	struct page_autonuma *page_autonuma, *page_tail_autonuma;
 
-	nid = page->autonuma_migrate_nid;
+	if (!autonuma_possible())
+		return;
+
+	page_autonuma = lookup_page_autonuma(page);
+	page_tail_autonuma = lookup_page_autonuma(page_tail);
+
+	nid = page_autonuma->autonuma_migrate_nid;
 	VM_BUG_ON(nid >= MAX_NUMNODES);
 	VM_BUG_ON(nid < -1);
-	VM_BUG_ON(page_tail->autonuma_migrate_nid != -1);
+	VM_BUG_ON(page_tail_autonuma->autonuma_migrate_nid != -1);
 	if (nid >= 0) {
 		VM_BUG_ON(page_to_nid(page) != page_to_nid(page_tail));
 
@@ -94,44 +101,46 @@ void autonuma_migrate_split_huge_page(struct page *page,
 		 */
 		compound_lock(page_tail);
 		autonuma_migrate_lock(nid);
-		list_add_tail(&page_tail->autonuma_migrate_node,
-			      &page->autonuma_migrate_node);
+		list_add_tail(&page_tail_autonuma->autonuma_migrate_node,
+			      &page_autonuma->autonuma_migrate_node);
 		autonuma_migrate_unlock(nid);
 
-		page_tail->autonuma_migrate_nid = nid;
+		page_tail_autonuma->autonuma_migrate_nid = nid;
 		compound_unlock(page_tail);
 	}
 
-	last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+	last_nid = ACCESS_ONCE(page_autonuma->autonuma_last_nid);
 	if (last_nid >= 0)
-		page_tail->autonuma_last_nid = last_nid;
+		page_tail_autonuma->autonuma_last_nid = last_nid;
 }
 
-void __autonuma_migrate_page_remove(struct page *page)
+void __autonuma_migrate_page_remove(struct page *page,
+				    struct page_autonuma *page_autonuma)
 {
 	unsigned long flags;
 	int nid;
 
 	flags = compound_lock_irqsave(page);
 
-	nid = page->autonuma_migrate_nid;
+	nid = page_autonuma->autonuma_migrate_nid;
 	VM_BUG_ON(nid >= MAX_NUMNODES);
 	VM_BUG_ON(nid < -1);
 	if (nid >= 0) {
 		int numpages = hpage_nr_pages(page);
 		autonuma_migrate_lock(nid);
-		list_del(&page->autonuma_migrate_node);
+		list_del(&page_autonuma->autonuma_migrate_node);
 		NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
 		autonuma_migrate_unlock(nid);
 
-		page->autonuma_migrate_nid = -1;
+		page_autonuma->autonuma_migrate_nid = -1;
 	}
 
 	compound_unlock_irqrestore(page, flags);
 }
 
-static void __autonuma_migrate_page_add(struct page *page, int dst_nid,
-					int page_nid)
+static void __autonuma_migrate_page_add(struct page *page,
+					struct page_autonuma *page_autonuma,
+					int dst_nid, int page_nid)
 {
 	unsigned long flags;
 	int nid;
@@ -157,25 +166,25 @@ static void __autonuma_migrate_page_add(struct page *page, int dst_nid,
 	flags = compound_lock_irqsave(page);
 
 	numpages = hpage_nr_pages(page);
-	nid = page->autonuma_migrate_nid;
+	nid = page_autonuma->autonuma_migrate_nid;
 	VM_BUG_ON(nid >= MAX_NUMNODES);
 	VM_BUG_ON(nid < -1);
 	if (nid >= 0) {
 		autonuma_migrate_lock(nid);
-		list_del(&page->autonuma_migrate_node);
+		list_del(&page_autonuma->autonuma_migrate_node);
 		NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
 		autonuma_migrate_unlock(nid);
 	}
 
 	autonuma_migrate_lock(dst_nid);
-	list_add(&page->autonuma_migrate_node,
+	list_add(&page_autonuma->autonuma_migrate_node,
 		 &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid]);
 	NODE_DATA(dst_nid)->autonuma_nr_migrate_pages += numpages;
 	nr_migrate_pages = NODE_DATA(dst_nid)->autonuma_nr_migrate_pages;
 
 	autonuma_migrate_unlock(dst_nid);
 
-	page->autonuma_migrate_nid = dst_nid;
+	page_autonuma->autonuma_migrate_nid = dst_nid;
 
 	compound_unlock_irqrestore(page, flags);
 
@@ -191,9 +200,13 @@ static void __autonuma_migrate_page_add(struct page *page, int dst_nid,
 static void autonuma_migrate_page_add(struct page *page, int dst_nid,
 				      int page_nid)
 {
-	int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+	int migrate_nid;
+	struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+
+	migrate_nid = ACCESS_ONCE(page_autonuma->autonuma_migrate_nid);
 	if (migrate_nid != dst_nid)
-		__autonuma_migrate_page_add(page, dst_nid, page_nid);
+		__autonuma_migrate_page_add(page, page_autonuma,
+					    dst_nid, page_nid);
 }
 
 static bool autonuma_balance_pgdat(struct pglist_data *pgdat,
@@ -284,23 +297,26 @@ static void numa_hinting_fault_cpu_follow_memory(struct task_struct *p,
 static inline bool last_nid_set(struct page *page, int this_nid)
 {
 	bool ret = true;
-	int autonuma_last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+	struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+	int autonuma_last_nid = ACCESS_ONCE(page_autonuma->autonuma_last_nid);
 	VM_BUG_ON(this_nid < 0);
 	VM_BUG_ON(this_nid >= MAX_NUMNODES);
 	if (autonuma_last_nid >= 0 && autonuma_last_nid != this_nid) {
-		int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+		int migrate_nid;
+		migrate_nid = ACCESS_ONCE(page_autonuma->autonuma_migrate_nid);
 		if (migrate_nid >= 0)
-			__autonuma_migrate_page_remove(page);
+			__autonuma_migrate_page_remove(page, page_autonuma);
 		ret = false;
 	}
 	if (autonuma_last_nid != this_nid)
-		ACCESS_ONCE(page->autonuma_last_nid) = this_nid;
+		ACCESS_ONCE(page_autonuma->autonuma_last_nid) = this_nid;
 	return ret;
 }
 
 static int __page_migrate_nid(struct page *page, int page_nid)
 {
-	int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+	struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+	int migrate_nid = ACCESS_ONCE(page_autonuma->autonuma_migrate_nid);
 	if (migrate_nid < 0)
 		migrate_nid = page_nid;
 	return migrate_nid;
@@ -895,6 +911,7 @@ static int isolate_migratepages(struct list_head *migratepages,
 		struct zone *zone;
 		struct page *page;
 		struct lruvec *lruvec;
+		struct page_autonuma *page_autonuma;
 
 		cond_resched();
 		/*
@@ -926,16 +943,17 @@ static int isolate_migratepages(struct list_head *migratepages,
 			autonuma_migrate_unlock_irq(pgdat->node_id);
 			continue;
 		}
-		page = list_entry(heads[nid].prev,
-				  struct page,
-				  autonuma_migrate_node);
+		page_autonuma = list_entry(heads[nid].prev,
+					   struct page_autonuma,
+					   autonuma_migrate_node);
+		page = page_autonuma->page;
 		if (unlikely(!get_page_unless_zero(page))) {
 			/*
 			 * Is getting freed and will remove self from the
 			 * autonuma list shortly, skip it for now.
 			 */
-			list_del(&page->autonuma_migrate_node);
-			list_add(&page->autonuma_migrate_node,
+			list_del(&page_autonuma->autonuma_migrate_node);
+			list_add(&page_autonuma->autonuma_migrate_node,
 				 &heads[nid]);
 			autonuma_migrate_unlock_irq(pgdat->node_id);
 			autonuma_printk("autonuma migrate page is free\n");
@@ -944,7 +962,7 @@ static int isolate_migratepages(struct list_head *migratepages,
 		autonuma_migrate_unlock_irq(pgdat->node_id);
 		if (!PageLRU(page)) {
 			autonuma_printk("autonuma migrate page not in LRU\n");
-			__autonuma_migrate_page_remove(page);
+			__autonuma_migrate_page_remove(page, page_autonuma);
 			put_page(page);
 			continue;
 		}
@@ -956,13 +974,14 @@ static int isolate_migratepages(struct list_head *migratepages,
 			/* FIXME: remove split_huge_page */
 			if (unlikely(split_huge_page(page))) {
 				autonuma_printk("autonuma migrate THP free\n");
-				__autonuma_migrate_page_remove(page);
+				__autonuma_migrate_page_remove(page,
+							       page_autonuma);
 				put_page(page);
 				continue;
 			}
 		}
 
-		__autonuma_migrate_page_remove(page);
+		__autonuma_migrate_page_remove(page, page_autonuma);
 
 		zone = page_zone(page);
 		spin_lock_irq(&zone->lru_lock);
@@ -1007,11 +1026,16 @@ static struct page *alloc_migrate_dst_page(struct page *page,
 {
 	int nid = (int) data;
 	struct page *newpage;
+	struct page_autonuma *page_autonuma, *newpage_autonuma;
 	newpage = alloc_pages_exact_node(nid,
 					 GFP_HIGHUSER_MOVABLE | GFP_THISNODE,
 					 0);
-	if (newpage)
-		newpage->autonuma_last_nid = page->autonuma_last_nid;
+	if (newpage) {
+		page_autonuma = lookup_page_autonuma(page);
+		newpage_autonuma = lookup_page_autonuma(newpage);
+		newpage_autonuma->autonuma_last_nid =
+			page_autonuma->autonuma_last_nid;
+	}
 	return newpage;
 }
 
@@ -1446,7 +1470,8 @@ static int __init noautonuma_setup(char *str)
 	}
 	return 1;
 }
-__setup("noautonuma", noautonuma_setup);
+/* early so sparse.c also can see it */
+early_param("noautonuma", noautonuma_setup);
 
 static bool autonuma_init_checks_failed(void)
 {
@@ -1470,7 +1495,12 @@ static int __init autonuma_init(void)
 
 	VM_BUG_ON(num_possible_nodes() < 1);
 	if (num_possible_nodes() <= 1 || !autonuma_possible()) {
-		clear_bit(AUTONUMA_POSSIBLE_FLAG, &autonuma_flags);
+		/* should have been already initialized by page_autonuma */
+		if (autonuma_possible()) {
+			WARN_ON(1);
+			/* try to fixup if it wasn't ok */
+			clear_bit(AUTONUMA_POSSIBLE_FLAG, &autonuma_flags);
+		}
 		return -EINVAL;
 	} else if (autonuma_init_checks_failed()) {
 		printk("autonuma disengaged: init checks failed\n");
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 067cba1..579e52b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1849,7 +1849,12 @@ static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
 	bool mknuma = false;
 #ifdef CONFIG_AUTONUMA
 	int autonuma_last_nid = -1;
+	struct page_autonuma *src_page_an, *page_an = NULL;
+
+	if (autonuma_possible())
+		page_an = lookup_page_autonuma(page);
 #endif
+
 	for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++) {
 		pte_t pteval = *_pte;
 		struct page *src_page;
@@ -1861,12 +1866,12 @@ static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
 			src_page = pte_page(pteval);
 #ifdef CONFIG_AUTONUMA
 			/* pick the first one, better than nothing */
-			if (autonuma_last_nid < 0) {
+			if (autonuma_possible() && autonuma_last_nid < 0) {
+				src_page_an = lookup_page_autonuma(src_page);
 				autonuma_last_nid =
-					ACCESS_ONCE(src_page->
-						    autonuma_last_nid);
+					ACCESS_ONCE(src_page_an->autonuma_last_nid);
 				if (autonuma_last_nid >= 0)
-					ACCESS_ONCE(page->autonuma_last_nid) =
+					ACCESS_ONCE(page_an->autonuma_last_nid) =
 						autonuma_last_nid;
 			}
 #endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 49e2916..74b73fa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -59,6 +59,7 @@
 #include <linux/migrate.h>
 #include <linux/page-debug-flags.h>
 #include <linux/autonuma.h>
+#include <linux/page_autonuma.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -619,10 +620,7 @@ static inline int free_pages_check(struct page *page)
 		bad_page(page);
 		return 1;
 	}
-	autonuma_migrate_page_remove(page);
-#ifdef CONFIG_AUTONUMA
-	page->autonuma_last_nid = -1;
-#endif
+	autonuma_free_page(page);
 	if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
 		page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	return 0;
@@ -3792,10 +3790,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
 
 		INIT_LIST_HEAD(&page->lru);
-#ifdef CONFIG_AUTONUMA
-		page->autonuma_last_nid = -1;
-		page->autonuma_migrate_nid = -1;
-#endif
 #ifdef WANT_PAGE_VIRTUAL
 		/* The shift won't overflow because ZONE_NORMAL is below 4G. */
 		if (!is_highmem_idx(zone))
@@ -4396,21 +4390,12 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 	int nid = pgdat->node_id;
 	unsigned long zone_start_pfn = pgdat->node_start_pfn;
 	int ret;
-#ifdef CONFIG_AUTONUMA
-	int node_iter;
-#endif
 
 	pgdat_resize_init(pgdat);
-#ifdef CONFIG_AUTONUMA
-	spin_lock_init(&pgdat->autonuma_lock);
-	init_waitqueue_head(&pgdat->autonuma_knuma_migrated_wait);
-	pgdat->autonuma_nr_migrate_pages = 0;
-	for_each_node(node_iter)
-		INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
-#endif
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
 	pgdat_page_cgroup_init(pgdat);
+	pgdat_autonuma_init(pgdat);
 
 	for (j = 0; j < MAX_NR_ZONES; j++) {
 		struct zone *zone = pgdat->node_zones + j;
diff --git a/mm/page_autonuma.c b/mm/page_autonuma.c
new file mode 100644
index 0000000..46d616c
--- /dev/null
+++ b/mm/page_autonuma.c
@@ -0,0 +1,246 @@
+#include <linux/mm.h>
+#include <linux/memory.h>
+#include <linux/autonuma.h>
+#include <linux/page_autonuma.h>
+#include <linux/bootmem.h>
+
+void __meminit page_autonuma_map_init(struct page *page,
+				      struct page_autonuma *page_autonuma,
+				      int nr_pages)
+{
+	struct page *end;
+	for (end = page + nr_pages; page < end; page++, page_autonuma++) {
+		page_autonuma->autonuma_last_nid = -1;
+		page_autonuma->autonuma_migrate_nid = -1;
+		page_autonuma->page = page;
+	}
+}
+
+static void __meminit __pgdat_autonuma_init(struct pglist_data *pgdat)
+{
+	int node_iter;
+
+	spin_lock_init(&pgdat->autonuma_lock);
+	init_waitqueue_head(&pgdat->autonuma_knuma_migrated_wait);
+	pgdat->autonuma_nr_migrate_pages = 0;
+
+	/* initialize autonuma_possible() */
+	if (num_possible_nodes() <= 1)
+		clear_bit(AUTONUMA_POSSIBLE_FLAG, &autonuma_flags);
+
+	/* noautonuma early param may also clear AUTONUMA_POSSIBLE_FLAG */
+	if (autonuma_possible())
+		for_each_node(node_iter)
+			INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
+}
+
+#if !defined(CONFIG_SPARSEMEM)
+
+static unsigned long total_usage;
+
+void __meminit pgdat_autonuma_init(struct pglist_data *pgdat)
+{
+	__pgdat_autonuma_init(pgdat);
+	pgdat->node_page_autonuma = NULL;
+}
+
+struct page_autonuma *lookup_page_autonuma(struct page *page)
+{
+	unsigned long pfn = page_to_pfn(page);
+	unsigned long offset;
+	struct page_autonuma *base;
+
+	base = NODE_DATA(page_to_nid(page))->node_page_autonuma;
+#ifdef CONFIG_DEBUG_VM
+	/*
+	 * The sanity checks the page allocator does upon freeing a
+	 * page can reach here before the page_autonuma arrays are
+	 * allocated when feeding a range of pages to the allocator
+	 * for the first time during bootup or memory hotplug.
+	 */
+	if (unlikely(!base))
+		return NULL;
+#endif
+	offset = pfn - NODE_DATA(page_to_nid(page))->node_start_pfn;
+	return base + offset;
+}
+
+static int __init alloc_node_page_autonuma(int nid)
+{
+	struct page_autonuma *base;
+	unsigned long table_size;
+	unsigned long nr_pages;
+
+	nr_pages = NODE_DATA(nid)->node_spanned_pages;
+	if (!nr_pages)
+		return 0;
+
+	table_size = sizeof(struct page_autonuma) * nr_pages;
+
+	base = __alloc_bootmem_node_nopanic(NODE_DATA(nid),
+			table_size, PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+	if (!base)
+		return -ENOMEM;
+	NODE_DATA(nid)->node_page_autonuma = base;
+	total_usage += table_size;
+	page_autonuma_map_init(NODE_DATA(nid)->node_mem_map, base, nr_pages);
+	return 0;
+}
+
+void __init page_autonuma_init_flatmem(void)
+{
+
+	int nid, fail;
+
+	/* __pgdat_autonuma_init initialized autonuma_possible() */
+	if (!autonuma_possible())
+		return;
+
+	for_each_online_node(nid)  {
+		fail = alloc_node_page_autonuma(nid);
+		if (fail)
+			goto fail;
+	}
+	printk(KERN_INFO "allocated %lu KBytes of page_autonuma\n",
+	       total_usage >> 10);
+	printk(KERN_INFO "please try the 'noautonuma' option if you"
+	" don't want to allocate page_autonuma memory\n");
+	return;
+fail:
+	printk(KERN_CRIT "allocation of page_autonuma failed.\n");
+	printk(KERN_CRIT "please try the 'noautonuma' boot option\n");
+	panic("Out of memory");
+}
+
+#else /* CONFIG_SPARSEMEM */
+
+struct page_autonuma *lookup_page_autonuma(struct page *page)
+{
+	unsigned long pfn = page_to_pfn(page);
+	struct mem_section *section = __pfn_to_section(pfn);
+
+	/* if it's not a power of two we may be wasting memory */
+	BUILD_BUG_ON(SECTION_PAGE_AUTONUMA_SIZE &
+		     (SECTION_PAGE_AUTONUMA_SIZE-1));
+
+	/* memsection must be a power of two */
+	BUILD_BUG_ON(sizeof(struct mem_section) &
+		     (sizeof(struct mem_section)-1));
+
+#ifdef CONFIG_DEBUG_VM
+	/*
+	 * The sanity checks the page allocator does upon freeing a
+	 * page can reach here before the page_autonuma arrays are
+	 * allocated when feeding a range of pages to the allocator
+	 * for the first time during bootup or memory hotplug.
+	 */
+	if (!section->section_page_autonuma)
+		return NULL;
+#endif
+	return section->section_page_autonuma + pfn;
+}
+
+void __meminit pgdat_autonuma_init(struct pglist_data *pgdat)
+{
+	__pgdat_autonuma_init(pgdat);
+}
+
+struct page_autonuma * __meminit __kmalloc_section_page_autonuma(int nid,
+								 unsigned long nr_pages)
+{
+	struct page_autonuma *ret;
+	struct page *page;
+	unsigned long memmap_size = PAGE_AUTONUMA_SIZE * nr_pages;
+
+	page = alloc_pages_node(nid, GFP_KERNEL|__GFP_NOWARN,
+				get_order(memmap_size));
+	if (page)
+		goto got_map_page_autonuma;
+
+	ret = vmalloc(memmap_size);
+	if (ret)
+		goto out;
+
+	return NULL;
+got_map_page_autonuma:
+	ret = (struct page_autonuma *)pfn_to_kaddr(page_to_pfn(page));
+out:
+	return ret;
+}
+
+void __kfree_section_page_autonuma(struct page_autonuma *page_autonuma,
+				   unsigned long nr_pages)
+{
+	if (is_vmalloc_addr(page_autonuma))
+		vfree(page_autonuma);
+	else
+		free_pages((unsigned long)page_autonuma,
+			   get_order(PAGE_AUTONUMA_SIZE * nr_pages));
+}
+
+static struct page_autonuma __init *sparse_page_autonuma_map_populate(unsigned long pnum,
+								      int nid)
+{
+	struct page_autonuma *map;
+	unsigned long size;
+
+	map = alloc_remap(nid, SECTION_PAGE_AUTONUMA_SIZE);
+	if (map)
+		return map;
+
+	size = PAGE_ALIGN(SECTION_PAGE_AUTONUMA_SIZE);
+	map = __alloc_bootmem_node_high(NODE_DATA(nid), size,
+					PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+	return map;
+}
+
+void __init sparse_early_page_autonuma_alloc_node(struct page_autonuma **page_autonuma_map,
+						  unsigned long pnum_begin,
+						  unsigned long pnum_end,
+						  unsigned long map_count,
+						  int nodeid)
+{
+	void *map;
+	unsigned long pnum;
+	unsigned long size = SECTION_PAGE_AUTONUMA_SIZE;
+
+	map = alloc_remap(nodeid, size * map_count);
+	if (map) {
+		for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
+			if (!present_section_nr(pnum))
+				continue;
+			page_autonuma_map[pnum] = map;
+			map += size;
+		}
+		return;
+	}
+
+	size = PAGE_ALIGN(size);
+	map = __alloc_bootmem_node_high(NODE_DATA(nodeid), size * map_count,
+					PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+	if (map) {
+		for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
+			if (!present_section_nr(pnum))
+				continue;
+			page_autonuma_map[pnum] = map;
+			map += size;
+		}
+		return;
+	}
+
+	/* fallback */
+	for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
+		struct mem_section *ms;
+
+		if (!present_section_nr(pnum))
+			continue;
+		page_autonuma_map[pnum] = sparse_page_autonuma_map_populate(pnum, nodeid);
+		if (page_autonuma_map[pnum])
+			continue;
+		ms = __nr_to_section(pnum);
+		printk(KERN_ERR "%s: sparsemem page_autonuma map backing failed "
+		       "some memory will not be available.\n", __func__);
+	}
+}
+
+#endif /* CONFIG_SPARSEMEM */
diff --git a/mm/sparse.c b/mm/sparse.c
index fac95f2..5b8d018 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -9,6 +9,7 @@
 #include <linux/export.h>
 #include <linux/spinlock.h>
 #include <linux/vmalloc.h>
+#include <linux/page_autonuma.h>
 #include "internal.h"
 #include <asm/dma.h>
 #include <asm/pgalloc.h>
@@ -230,7 +231,8 @@ struct page *sparse_decode_mem_map(unsigned long coded_mem_map, unsigned long pn
 
 static int __meminit sparse_init_one_section(struct mem_section *ms,
 		unsigned long pnum, struct page *mem_map,
-		unsigned long *pageblock_bitmap)
+		unsigned long *pageblock_bitmap,
+		struct page_autonuma *page_autonuma)
 {
 	if (!present_section(ms))
 		return -EINVAL;
@@ -239,6 +241,14 @@ static int __meminit sparse_init_one_section(struct mem_section *ms,
 	ms->section_mem_map |= sparse_encode_mem_map(mem_map, pnum) |
 							SECTION_HAS_MEM_MAP;
  	ms->pageblock_flags = pageblock_bitmap;
+#ifdef CONFIG_AUTONUMA
+	if (page_autonuma) {
+		ms->section_page_autonuma = page_autonuma - section_nr_to_pfn(pnum);
+		page_autonuma_map_init(mem_map, page_autonuma, PAGES_PER_SECTION);
+	}
+#else
+	BUG_ON(page_autonuma);
+#endif
 
 	return 1;
 }
@@ -480,6 +490,9 @@ void __init sparse_init(void)
 	int size2;
 	struct page **map_map;
 #endif
+	struct page_autonuma **uninitialized_var(page_autonuma_map);
+	struct page_autonuma *page_autonuma;
+	int size3;
 
 	/* Setup pageblock_order for HUGETLB_PAGE_SIZE_VARIABLE */
 	set_pageblock_order();
@@ -577,6 +590,63 @@ void __init sparse_init(void)
 					 map_count, nodeid_begin);
 #endif
 
+	/* __pgdat_autonuma_init initialized autonuma_possible() */
+	if (autonuma_possible()) {
+		unsigned long total_page_autonuma;
+		unsigned long page_autonuma_count;
+
+		size3 = sizeof(struct page_autonuma *) * NR_MEM_SECTIONS;
+		page_autonuma_map = alloc_bootmem(size3);
+		if (!page_autonuma_map)
+			panic("can not allocate page_autonuma_map\n");
+
+		for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
+			struct mem_section *ms;
+
+			if (!present_section_nr(pnum))
+				continue;
+			ms = __nr_to_section(pnum);
+			nodeid_begin = sparse_early_nid(ms);
+			pnum_begin = pnum;
+			break;
+		}
+		total_page_autonuma = 0;
+		page_autonuma_count = 1;
+		for (pnum = pnum_begin + 1; pnum < NR_MEM_SECTIONS; pnum++) {
+			struct mem_section *ms;
+			int nodeid;
+
+			if (!present_section_nr(pnum))
+				continue;
+			ms = __nr_to_section(pnum);
+			nodeid = sparse_early_nid(ms);
+			if (nodeid == nodeid_begin) {
+				page_autonuma_count++;
+				continue;
+			}
+			/* ok, we need to take care of pnum_begin to pnum - 1 */
+			sparse_early_page_autonuma_alloc_node(page_autonuma_map,
+							      pnum_begin,
+							      NR_MEM_SECTIONS,
+							      page_autonuma_count,
+							      nodeid_begin);
+			total_page_autonuma += SECTION_PAGE_AUTONUMA_SIZE * page_autonuma_count;
+			/* new start, update count etc*/
+			nodeid_begin = nodeid;
+			pnum_begin = pnum;
+			page_autonuma_count = 1;
+		}
+		/* ok, last chunk */
+		sparse_early_page_autonuma_alloc_node(page_autonuma_map, pnum_begin,
+						      NR_MEM_SECTIONS,
+						      page_autonuma_count, nodeid_begin);
+		total_page_autonuma += SECTION_PAGE_AUTONUMA_SIZE * page_autonuma_count;
+		printk("allocated %lu KBytes of page_autonuma\n",
+		       total_page_autonuma >> 10);
+		printk(KERN_INFO "please try the 'noautonuma' option if you"
+		       " don't want to allocate page_autonuma memory\n");
+	}
+
 	for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
 		if (!present_section_nr(pnum))
 			continue;
@@ -585,6 +655,13 @@ void __init sparse_init(void)
 		if (!usemap)
 			continue;
 
+		if (autonuma_possible()) {
+			page_autonuma = page_autonuma_map[pnum];
+			if (!page_autonuma)
+				continue;
+		} else
+			page_autonuma = NULL;
+
 #ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
 		map = map_map[pnum];
 #else
@@ -594,11 +671,13 @@ void __init sparse_init(void)
 			continue;
 
 		sparse_init_one_section(__nr_to_section(pnum), pnum, map,
-								usemap);
+					usemap, page_autonuma);
 	}
 
 	vmemmap_populate_print_last();
 
+	if (autonuma_possible())
+		free_bootmem(__pa(page_autonuma_map), size3);
 #ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
 	free_bootmem(__pa(map_map), size2);
 #endif
@@ -685,7 +764,8 @@ static void free_map_bootmem(struct page *page, unsigned long nr_pages)
 }
 #endif /* CONFIG_SPARSEMEM_VMEMMAP */
 
-static void free_section_usemap(struct page *memmap, unsigned long *usemap)
+static void free_section_usemap(struct page *memmap, unsigned long *usemap,
+				struct page_autonuma *page_autonuma)
 {
 	struct page *usemap_page;
 	unsigned long nr_pages;
@@ -699,8 +779,14 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap)
 	 */
 	if (PageSlab(usemap_page)) {
 		kfree(usemap);
-		if (memmap)
+		if (memmap) {
 			__kfree_section_memmap(memmap, PAGES_PER_SECTION);
+			if (autonuma_possible())
+				__kfree_section_page_autonuma(page_autonuma,
+							      PAGES_PER_SECTION);
+			else
+				BUG_ON(page_autonuma);
+		}
 		return;
 	}
 
@@ -717,6 +803,13 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap)
 			>> PAGE_SHIFT;
 
 		free_map_bootmem(memmap_page, nr_pages);
+
+		if (autonuma_possible()) {
+			struct page *page_autonuma_page;
+			page_autonuma_page = virt_to_page(page_autonuma);
+			free_map_bootmem(page_autonuma_page, nr_pages);
+		} else
+			BUG_ON(page_autonuma);
 	}
 }
 
@@ -732,6 +825,7 @@ int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	struct mem_section *ms;
 	struct page *memmap;
+	struct page_autonuma *page_autonuma;
 	unsigned long *usemap;
 	unsigned long flags;
 	int ret;
@@ -751,6 +845,16 @@ int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
 		__kfree_section_memmap(memmap, nr_pages);
 		return -ENOMEM;
 	}
+	if (autonuma_possible()) {
+		page_autonuma = __kmalloc_section_page_autonuma(pgdat->node_id,
+								nr_pages);
+		if (!page_autonuma) {
+			kfree(usemap);
+			__kfree_section_memmap(memmap, nr_pages);
+			return -ENOMEM;
+		}
+	} else
+		page_autonuma = NULL;
 
 	pgdat_resize_lock(pgdat, &flags);
 
@@ -762,11 +866,16 @@ int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
 
 	ms->section_mem_map |= SECTION_MARKED_PRESENT;
 
-	ret = sparse_init_one_section(ms, section_nr, memmap, usemap);
+	ret = sparse_init_one_section(ms, section_nr, memmap, usemap,
+				      page_autonuma);
 
 out:
 	pgdat_resize_unlock(pgdat, &flags);
 	if (ret <= 0) {
+		if (autonuma_possible())
+			__kfree_section_page_autonuma(page_autonuma, nr_pages);
+		else
+			BUG_ON(page_autonuma);
 		kfree(usemap);
 		__kfree_section_memmap(memmap, nr_pages);
 	}
@@ -777,6 +886,7 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
 {
 	struct page *memmap = NULL;
 	unsigned long *usemap = NULL;
+	struct page_autonuma *page_autonuma = NULL;
 
 	if (ms->section_mem_map) {
 		usemap = ms->pageblock_flags;
@@ -784,8 +894,12 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
 						__section_nr(ms));
 		ms->section_mem_map = 0;
 		ms->pageblock_flags = NULL;
+
+#ifdef CONFIG_AUTONUMA
+		page_autonuma = ms->section_page_autonuma;
+#endif
 	}
 
-	free_section_usemap(memmap, usemap);
+	free_section_usemap(memmap, usemap, page_autonuma);
 }
 #endif


* [PATCH 29/36] autonuma: autonuma_migrate_head[0] dynamic size
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
                   ` (27 preceding siblings ...)
  2012-08-22 14:59 ` [PATCH 28/36] autonuma: page_autonuma Andrea Arcangeli
@ 2012-08-22 14:59 ` Andrea Arcangeli
  2012-08-22 14:59 ` [PATCH 30/36] autonuma: bugcheck page_autonuma fields on newly allocated pages Andrea Arcangeli
                   ` (7 subsequent siblings)
  36 siblings, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Reduce the number of autonuma_migrate_head array entries from
MAX_NUMNODES to nr_node_ids, or to zero if autonuma is not possible,
by turning it into a dynamically sized array at the end of the pgdat.
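
For readers unfamiliar with the trick used here, a standalone sketch
of the allocation pattern (demo-only names, not the kernel
structures): the migrate-list heads become a flexible trailing array
and the pgdat is allocated with enough extra room for nr_node_ids
entries.

#include <stdio.h>
#include <stdlib.h>

struct list_head_demo {
	struct list_head_demo *next, *prev;
};

struct pgdat_demo {
	unsigned long autonuma_nr_migrate_pages;
	/* must stay last: its storage comes from the oversized allocation */
	struct list_head_demo autonuma_migrate_head[];
};

static size_t pgdat_demo_size(int nr_node_ids, int autonuma_possible)
{
	return sizeof(struct pgdat_demo) +
		(autonuma_possible ?
		 sizeof(struct list_head_demo) * nr_node_ids : 0);
}

int main(void)
{
	int nr_node_ids = 4, i;
	struct pgdat_demo *pgdat = calloc(1, pgdat_demo_size(nr_node_ids, 1));

	if (!pgdat)
		return 1;
	for (i = 0; i < nr_node_ids; i++) {
		pgdat->autonuma_migrate_head[i].next =
			&pgdat->autonuma_migrate_head[i];
		pgdat->autonuma_migrate_head[i].prev =
			&pgdat->autonuma_migrate_head[i];
	}
	printf("pgdat size with %d nodes: %zu bytes\n",
	       nr_node_ids, pgdat_demo_size(nr_node_ids, 1));
	free(pgdat);
	return 0;
}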

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/mm/numa.c             |    6 ++++--
 arch/x86/mm/numa_32.c          |    3 ++-
 include/linux/memory_hotplug.h |    3 ++-
 include/linux/mmzone.h         |   19 +++++++++++++------
 include/linux/page_autonuma.h  |   10 ++++++++--
 mm/memory_hotplug.c            |    2 +-
 6 files changed, 30 insertions(+), 13 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 2d125be..a4a9e92 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -11,6 +11,7 @@
 #include <linux/nodemask.h>
 #include <linux/sched.h>
 #include <linux/topology.h>
+#include <linux/page_autonuma.h>
 
 #include <asm/e820.h>
 #include <asm/proto.h>
@@ -192,7 +193,8 @@ int __init numa_add_memblk(int nid, u64 start, u64 end)
 /* Initialize NODE_DATA for a node on the local memory */
 static void __init setup_node_data(int nid, u64 start, u64 end)
 {
-	const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
+	const size_t nd_size = roundup(autonuma_pglist_data_size(),
+				       PAGE_SIZE);
 	bool remapped = false;
 	u64 nd_pa;
 	void *nd;
@@ -239,7 +241,7 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
 		printk(KERN_INFO "    NODE_DATA(%d) on node %d\n", nid, tnid);
 
 	node_data[nid] = nd;
-	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
+	memset(NODE_DATA(nid), 0, autonuma_pglist_data_size());
 	NODE_DATA(nid)->node_id = nid;
 	NODE_DATA(nid)->node_start_pfn = start >> PAGE_SHIFT;
 	NODE_DATA(nid)->node_spanned_pages = (end - start) >> PAGE_SHIFT;
diff --git a/arch/x86/mm/numa_32.c b/arch/x86/mm/numa_32.c
index 534255a..d32d6cc 100644
--- a/arch/x86/mm/numa_32.c
+++ b/arch/x86/mm/numa_32.c
@@ -25,6 +25,7 @@
 #include <linux/bootmem.h>
 #include <linux/memblock.h>
 #include <linux/module.h>
+#include <linux/page_autonuma.h>
 
 #include "numa_internal.h"
 
@@ -194,7 +195,7 @@ void __init init_alloc_remap(int nid, u64 start, u64 end)
 
 	/* calculate the necessary space aligned to large page size */
 	size = node_memmap_size_bytes(nid, start_pfn, end_pfn);
-	size += ALIGN(sizeof(pg_data_t), PAGE_SIZE);
+	size += ALIGN(autonuma_pglist_data_size(), PAGE_SIZE);
 	size = ALIGN(size, LARGE_PAGE_BYTES);
 
 	/* allocate node memory and the lowmem remap area */
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 910550f..76b1840 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -5,6 +5,7 @@
 #include <linux/spinlock.h>
 #include <linux/notifier.h>
 #include <linux/bug.h>
+#include <linux/page_autonuma.h>
 
 struct page;
 struct zone;
@@ -130,7 +131,7 @@ extern void arch_refresh_nodedata(int nid, pg_data_t *pgdat);
  */
 #define generic_alloc_nodedata(nid)				\
 ({								\
-	kzalloc(sizeof(pg_data_t), GFP_KERNEL);			\
+	kzalloc(autonuma_pglist_data_size(), GFP_KERNEL);	\
 })
 /*
  * This definition is just for error path in node hotadd.
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 853e236..4d8e100 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -713,12 +713,6 @@ typedef struct pglist_data {
 #if !defined(CONFIG_SPARSEMEM)
 	struct page_autonuma *node_page_autonuma;
 #endif
-	/*
-	 * All pages from node "page_nid" to be migrated to this node,
-	 * will be queued into the list
-	 * autonuma_migrate_head[page_nid].
-	 */
-	struct list_head autonuma_migrate_head[MAX_NUMNODES];
 	/* number of pages from other nodes queued for migration to this node */
 	unsigned long autonuma_nr_migrate_pages;
 	/* waitqueue for this node knuma_migrated daemon */
@@ -729,7 +723,20 @@ typedef struct pglist_data {
 	 * autonuma_nr_migrate_pages field.
 	 */
 	spinlock_t autonuma_lock;
+	/*
+	 * All pages from node "page_nid" to be migrated to this node,
+	 * will be queued into the list
+	 * autonuma_migrate_head[page_nid].
+	 *
+	 * Arches supporting AutoNUMA must allocate the pgdat
+	 * structure using the size returned from the
+	 * autonuma_pglist_data_size() function after including
+	 * <linux/page_autonuma.h>. The below field must remain the
+	 * last one of this structure.
+	 */
+	struct list_head autonuma_migrate_head[0];
 #endif
+	/* do not add more variables here, the above array size is dynamic */
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
diff --git a/include/linux/page_autonuma.h b/include/linux/page_autonuma.h
index 9763e61..bd6249c 100644
--- a/include/linux/page_autonuma.h
+++ b/include/linux/page_autonuma.h
@@ -10,6 +10,7 @@ static inline void __init page_autonuma_init_flatmem(void) {}
 #ifdef CONFIG_AUTONUMA
 
 #include <linux/autonuma_flags.h>
+#include <linux/autonuma_types.h>
 
 extern void __meminit page_autonuma_map_init(struct page *page,
 					     struct page_autonuma *page_autonuma,
@@ -29,11 +30,10 @@ extern void __meminit pgdat_autonuma_init(struct pglist_data *);
 struct page_autonuma;
 #define PAGE_AUTONUMA_SIZE 0
 #define SECTION_PAGE_AUTONUMA_SIZE 0
+#endif /* CONFIG_SPARSEMEM */
 
 #define autonuma_possible() false
 
-#endif /* CONFIG_SPARSEMEM */
-
 static inline void pgdat_autonuma_init(struct pglist_data *pgdat) {}
 
 #endif /* CONFIG_AUTONUMA */
@@ -50,4 +50,10 @@ extern void __init sparse_early_page_autonuma_alloc_node(struct page_autonuma **
 							 int nodeid);
 #endif
 
+/* inline won't work here */
+#define autonuma_pglist_data_size() (sizeof(struct pglist_data) +	\
+				     (autonuma_possible() ?		\
+				      sizeof(struct list_head) * \
+				      nr_node_ids : 0))
+
 #endif /* _LINUX_PAGE_AUTONUMA_H */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 3ad25f9..86b37db 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -164,7 +164,7 @@ void register_page_bootmem_info_node(struct pglist_data *pgdat)
 	struct page *page;
 	struct zone *zone;
 
-	nr_pages = PAGE_ALIGN(sizeof(struct pglist_data)) >> PAGE_SHIFT;
+	nr_pages = PAGE_ALIGN(autonuma_pglist_data_size()) >> PAGE_SHIFT;
 	page = virt_to_page(pgdat);
 
 	for (i = 0; i < nr_pages; i++, page++)


* [PATCH 30/36] autonuma: bugcheck page_autonuma fields on newly allocated pages
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
                   ` (28 preceding siblings ...)
  2012-08-22 14:59 ` [PATCH 29/36] autonuma: autonuma_migrate_head[0] dynamic size Andrea Arcangeli
@ 2012-08-22 14:59 ` Andrea Arcangeli
  2012-08-22 14:59 ` [PATCH 31/36] autonuma: shrink the per-page page_autonuma struct size Andrea Arcangeli
                   ` (6 subsequent siblings)
  36 siblings, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Debug tweak: on page allocation, check that the page_autonuma fields
of the new page are still in their reset state (-1); warn and report
the page as bad otherwise.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma.h |   19 +++++++++++++++++++
 mm/page_alloc.c          |    3 ++-
 2 files changed, 21 insertions(+), 1 deletions(-)

diff --git a/include/linux/autonuma.h b/include/linux/autonuma.h
index 1d87ecc..8a779e0 100644
--- a/include/linux/autonuma.h
+++ b/include/linux/autonuma.h
@@ -29,6 +29,24 @@ static inline void autonuma_free_page(struct page *page)
 	}
 }
 
+static inline int autonuma_check_new_page(struct page *page)
+{
+	struct page_autonuma *page_autonuma;
+	int ret = 0;
+	if (autonuma_possible()) {
+		page_autonuma = lookup_page_autonuma(page);
+		if (unlikely(page_autonuma->autonuma_migrate_nid != -1)) {
+			ret = 1;
+			WARN_ON(1);
+		}
+		if (unlikely(page_autonuma->autonuma_last_nid != -1)) {
+			ret = 1;
+			WARN_ON(1);
+		}
+	}
+	return ret;
+}
+
 #define autonuma_printk(format, args...) \
 	if (autonuma_debug()) printk(format, ##args)
 
@@ -41,6 +59,7 @@ static inline void autonuma_migrate_split_huge_page(struct page *page,
 						    struct page *page_tail) {}
 static inline void autonuma_setup_new_exec(struct task_struct *p) {}
 static inline void autonuma_free_page(struct page *page) {}
+static inline int autonuma_check_new_page(struct page *page) { return 0; }
 
 #endif /* CONFIG_AUTONUMA */
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 74b73fa..87a4d5b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -833,7 +833,8 @@ static inline int check_new_page(struct page *page)
 		(page->mapping != NULL)  |
 		(__page_count(page) != 0)  |
 		(page->flags & PAGE_FLAGS_CHECK_AT_PREP) |
-		(mem_cgroup_bad_page_check(page)))) {
+		(mem_cgroup_bad_page_check(page)) |
+		autonuma_check_new_page(page))) {
 		bad_page(page);
 		return 1;
 	}


* [PATCH 31/36] autonuma: shrink the per-page page_autonuma struct size
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
                   ` (29 preceding siblings ...)
  2012-08-22 14:59 ` [PATCH 30/36] autonuma: bugcheck page_autonuma fields on newly allocated pages Andrea Arcangeli
@ 2012-08-22 14:59 ` Andrea Arcangeli
  2012-08-22 14:59 ` [PATCH 32/36] autonuma: boost khugepaged scanning rate Andrea Arcangeli
                   ` (5 subsequent siblings)
  36 siblings, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Shrink the per-page page_autonuma structure from 32 to 12 bytes, so
the AutoNUMA memory footprint is reduced to 0.29% of RAM.

However, this will fail to migrate pages located above a 16 Terabyte
offset from the start of their node (the failure isn't fatal: those
pages simply will not follow the CPU, and a warning is printed in the
log just once in that case).

AutoNUMA will also fail to build if MAX_NUMNODES allows more than
(2**15)-1 nodes at build time (it would be easy to relax this to
(2**16)-1 nodes without increasing the memory footprint, but it's not
worth it yet, so let's keep the negative space reserved for now).

This means the maximum RAM configuration fully supported by AutoNUMA
becomes AUTONUMA_LIST_MAX_PFN_OFFSET pages per node, multiplied by
32767 nodes and by PAGE_SIZE (4096 is assumed here, but it is bigger
on some archs):

4096*32767*(0xffffffff-3)>>(10*5) = 511 PetaBytes.
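
For anyone who wants to double check the arithmetic, here is a trivial
program doing the same computation (it assumes 4096 byte pages and the
(2**15)-1 node limit introduced by this patch; the 12 bytes per page
come from the two 16-bit nids plus the two 32-bit pfn-offset list
links):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t max_pfn_offset = 0xffffffffULL - 3; /* AUTONUMA_LIST_MAX_PFN_OFFSET */
	uint64_t page_size = 4096;
	uint64_t max_nodes = 32767; /* (2**15)-1 */
	uint64_t bytes = page_size * max_nodes * max_pfn_offset;

	/* prints 511, i.e. the 511 PetaBytes quoted above */
	printf("%llu\n", (unsigned long long)(bytes >> 50));
	return 0;
}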

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma_list.h  |  100 +++++++++++++++++++++++
 include/linux/autonuma_types.h |   45 ++++++-----
 include/linux/mmzone.h         |    3 +-
 include/linux/page_autonuma.h  |    2 +-
 mm/Makefile                    |    2 +-
 mm/autonuma.c                  |   93 ++++++++++++++++------
 mm/autonuma_list.c             |  169 ++++++++++++++++++++++++++++++++++++++++
 mm/page_autonuma.c             |   24 +++---
 8 files changed, 380 insertions(+), 58 deletions(-)
 create mode 100644 include/linux/autonuma_list.h
 create mode 100644 mm/autonuma_list.c

diff --git a/include/linux/autonuma_list.h b/include/linux/autonuma_list.h
new file mode 100644
index 0000000..b77acb4
--- /dev/null
+++ b/include/linux/autonuma_list.h
@@ -0,0 +1,100 @@
+#ifndef __AUTONUMA_LIST_H
+#define __AUTONUMA_LIST_H
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+
+typedef uint32_t autonuma_list_entry;
+#define AUTONUMA_LIST_MAX_PFN_OFFSET	(AUTONUMA_LIST_HEAD-3)
+#define AUTONUMA_LIST_POISON1		(AUTONUMA_LIST_HEAD-2)
+#define AUTONUMA_LIST_POISON2		(AUTONUMA_LIST_HEAD-1)
+#define AUTONUMA_LIST_HEAD		((uint32_t)UINT_MAX)
+
+struct autonuma_list_head {
+	autonuma_list_entry anl_next_pfn;
+	autonuma_list_entry anl_prev_pfn;
+};
+
+static inline void AUTONUMA_INIT_LIST_HEAD(struct autonuma_list_head *anl)
+{
+	anl->anl_next_pfn = AUTONUMA_LIST_HEAD;
+	anl->anl_prev_pfn = AUTONUMA_LIST_HEAD;
+}
+
+/* abstraction conversion methods */
+extern struct page *autonuma_list_entry_to_page(int nid,
+					autonuma_list_entry pfn_offset);
+extern autonuma_list_entry autonuma_page_to_list_entry(int page_nid,
+						       struct page *page);
+extern struct autonuma_list_head *__autonuma_list_head(int page_nid,
+					struct autonuma_list_head *head,
+					autonuma_list_entry pfn_offset);
+
+extern bool __autonuma_list_add(int page_nid,
+				struct page *page,
+				struct autonuma_list_head *head,
+				autonuma_list_entry prev,
+				autonuma_list_entry next);
+
+/*
+ * autonuma_list_add - add a new entry
+ *
+ * Insert a new entry after the specified head.
+ */
+static inline bool autonuma_list_add(int page_nid,
+				     struct page *page,
+				     autonuma_list_entry entry,
+				     struct autonuma_list_head *head)
+{
+	struct autonuma_list_head *entry_head;
+	entry_head = __autonuma_list_head(page_nid, head, entry);
+	return __autonuma_list_add(page_nid, page, head,
+				   entry, entry_head->anl_next_pfn);
+}
+
+/*
+ * autonuma_list_add_tail - add a new entry
+ *
+ * Insert a new entry before the specified head.
+ * This is useful for implementing queues.
+ */
+static inline bool autonuma_list_add_tail(int page_nid,
+					  struct page *page,
+					  autonuma_list_entry entry,
+					  struct autonuma_list_head *head)
+{
+	struct autonuma_list_head *entry_head;
+	entry_head = __autonuma_list_head(page_nid, head, entry);
+	return __autonuma_list_add(page_nid, page, head,
+				   entry_head->anl_prev_pfn, entry);
+}
+
+/*
+ * autonuma_list_del - deletes entry from list.
+ * @entry: the element to delete from the list.
+ */
+extern void autonuma_list_del(int page_nid,
+			      struct autonuma_list_head *entry,
+			      struct autonuma_list_head *head);
+
+static inline bool autonuma_list_empty(const struct autonuma_list_head *head)
+{
+	return ACCESS_ONCE(head->anl_next_pfn) == AUTONUMA_LIST_HEAD;
+}
+
+/* safe to call only when the list cannot change under us */
+extern bool autonuma_list_empty_debug(const struct autonuma_list_head *head);
+
+#if 0 /* not needed so far */
+/*
+ * autonuma_list_is_singular - tests whether a list has just one entry.
+ * @head: the list to test.
+ */
+static inline int autonuma_list_is_singular(const struct autonuma_list_head *head)
+{
+	return !autonuma_list_empty_debug(head) &&
+		(head->anl_next_pfn == head->anl_prev_pfn);
+}
+#endif
+
+#endif /* __AUTONUMA_LIST_H */
diff --git a/include/linux/autonuma_types.h b/include/linux/autonuma_types.h
index 525c31f..80ace7f 100644
--- a/include/linux/autonuma_types.h
+++ b/include/linux/autonuma_types.h
@@ -4,6 +4,7 @@
 #ifdef CONFIG_AUTONUMA
 
 #include <linux/numa.h>
+#include <linux/autonuma_list.h>
 
 
 /*
@@ -81,6 +82,19 @@ struct task_autonuma {
 /*
  * Per page (or per-pageblock) structure dynamically allocated only if
  * autonuma is possible.
+ *
+ * This structure takes 12 bytes per page for all architectures. There
+ * are two constraints to make this work:
+ *
+ * 1) the build will abort if MAX_NUMNODES is too big according to
+ *    the #error check below
+ *
+ * 2) AutoNUMA will not succeed to insert into the migration queue any
+ *    page whose pfn offset value (offset with respect to the first
+ *    pfn of the node) is bigger than AUTONUMA_LIST_MAX_PFN_OFFSET
+ *    (NOTE: AUTONUMA_LIST_MAX_PFN_OFFSET is still a valid pfn offset
+ *    value). This means with huge node sizes and small PAGE_SIZE,
+ *    some pages may not be allowed to be migrated.
  */
 struct page_autonuma {
 	/*
@@ -91,7 +105,14 @@ struct page_autonuma {
 	 * systems). Architectures without this granularity require
 	 * autonuma_last_nid to be a long.
 	 */
-#ifdef CONFIG_64BIT
+#if MAX_NUMNODES > 32767
+	/*
+	 * Verify at build time that int16_t for autonuma_migrate_nid
+	 * and autonuma_last_nid won't risk to overflow, max allowed
+	 * nid value is (2**15)-1.
+	 */
+#error "too many nodes"
+#endif
 	/*
 	 * If autonuma_migrate_nid is >= 0, it means the page_autonuma
 	 * structure is linked into one of the NUMA node's migrate
@@ -100,7 +121,7 @@ struct page_autonuma {
 	 * page_autonuma structure is not linked into any NUMA node's
 	 * migrate list.
 	 */
-	int autonuma_migrate_nid;
+	int16_t autonuma_migrate_nid;
 	/*
 	 * autonuma_last_nid records the NUMA node that accessed the
 	 * page during the last NUMA hinting page fault. If a
@@ -109,28 +130,14 @@ struct page_autonuma {
 	 * requiring that a page be accessed by the same node twice in
 	 * a row before it is queued for migration.
 	 */
-	int autonuma_last_nid;
-#else
-#if MAX_NUMNODES > 32767
-#error "too many nodes"
-#endif
-	short autonuma_migrate_nid;
-	short autonuma_last_nid;
-#endif
+	int16_t autonuma_last_nid;
+
 	/*
 	 * This is the list node that links the page (referenced by
 	 * the page_autonuma structure) in the
 	 * &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid] lru.
 	 */
-	struct list_head autonuma_migrate_node;
-
-	/*
-	 * To find the page starting from the autonuma_migrate_node we
-	 * need a backlink.
-	 *
-	 * FIXME: drop it;
-	 */
-	struct page *page;
+	struct autonuma_list_head autonuma_migrate_node;
 };
 
 extern int alloc_task_autonuma(struct task_struct *tsk,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4d8e100..af71633 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -17,6 +17,7 @@
 #include <linux/pageblock-flags.h>
 #include <generated/bounds.h>
 #include <linux/atomic.h>
+#include <linux/autonuma_list.h>
 #include <asm/page.h>
 
 /* Free memory management - zoned buddy allocator.  */
@@ -734,7 +735,7 @@ typedef struct pglist_data {
 	 * <linux/page_autonuma.h>. The below field must remain the
 	 * last one of this structure.
 	 */
-	struct list_head autonuma_migrate_head[0];
+	struct autonuma_list_head autonuma_migrate_head[0];
 #endif
 	/* do not add more variables here, the above array size is dynamic */
 } pg_data_t;
diff --git a/include/linux/page_autonuma.h b/include/linux/page_autonuma.h
index bd6249c..aeb2e9d 100644
--- a/include/linux/page_autonuma.h
+++ b/include/linux/page_autonuma.h
@@ -53,7 +53,7 @@ extern void __init sparse_early_page_autonuma_alloc_node(struct page_autonuma **
 /* inline won't work here */
 #define autonuma_pglist_data_size() (sizeof(struct pglist_data) +	\
 				     (autonuma_possible() ?		\
-				      sizeof(struct list_head) * \
+				      sizeof(struct autonuma_list_head) * \
 				      nr_node_ids : 0))
 
 #endif /* _LINUX_PAGE_AUTONUMA_H */
diff --git a/mm/Makefile b/mm/Makefile
index 5a4fa30..04357c1 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -34,7 +34,7 @@ obj-$(CONFIG_FRONTSWAP)	+= frontswap.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
 obj-$(CONFIG_NUMA) 	+= mempolicy.o
-obj-$(CONFIG_AUTONUMA) 	+= autonuma.o page_autonuma.o
+obj-$(CONFIG_AUTONUMA) 	+= autonuma.o page_autonuma.o autonuma_list.o
 obj-$(CONFIG_SPARSEMEM)	+= sparse.o
 obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_SLOB) += slob.o
diff --git a/mm/autonuma.c b/mm/autonuma.c
index 7967507..ada6c57 100644
--- a/mm/autonuma.c
+++ b/mm/autonuma.c
@@ -89,7 +89,14 @@ void autonuma_migrate_split_huge_page(struct page *page,
 	VM_BUG_ON(nid < -1);
 	VM_BUG_ON(page_tail_autonuma->autonuma_migrate_nid != -1);
 	if (nid >= 0) {
-		VM_BUG_ON(page_to_nid(page) != page_to_nid(page_tail));
+		bool added;
+		int page_nid = page_to_nid(page);
+		struct autonuma_list_head *head;
+		autonuma_list_entry entry;
+		entry = autonuma_page_to_list_entry(page_nid, page);
+		head = &NODE_DATA(nid)->autonuma_migrate_head[page_nid];
+		VM_BUG_ON(page_nid != page_to_nid(page_tail));
+		VM_BUG_ON(page_nid == nid);
 
 		/*
 		 * The caller only takes the compound_lock for the
@@ -101,11 +108,19 @@ void autonuma_migrate_split_huge_page(struct page *page,
 		 */
 		compound_lock(page_tail);
 		autonuma_migrate_lock(nid);
-		list_add_tail(&page_tail_autonuma->autonuma_migrate_node,
-			      &page_autonuma->autonuma_migrate_node);
+		added = autonuma_list_add_tail(page_nid, page_tail, entry,
+					       head);
+		/*
+		 * AUTONUMA_LIST_MAX_PFN_OFFSET+1 isn't a power of 2
+		 * so "added" may be false if there's a pfn overflow
+		 * in the list.
+		 */
+		if (!added)
+			NODE_DATA(nid)->autonuma_nr_migrate_pages--;
 		autonuma_migrate_unlock(nid);
 
-		page_tail_autonuma->autonuma_migrate_nid = nid;
+		if (added)
+			page_tail_autonuma->autonuma_migrate_nid = nid;
 		compound_unlock(page_tail);
 	}
 
@@ -127,8 +142,15 @@ void __autonuma_migrate_page_remove(struct page *page,
 	VM_BUG_ON(nid < -1);
 	if (nid >= 0) {
 		int numpages = hpage_nr_pages(page);
+		int page_nid = page_to_nid(page);
+		struct autonuma_list_head *head;
+		VM_BUG_ON(nid == page_nid);
+		head = &NODE_DATA(nid)->autonuma_migrate_head[page_nid];
+
 		autonuma_migrate_lock(nid);
-		list_del(&page_autonuma->autonuma_migrate_node);
+		autonuma_list_del(page_nid,
+				  &page_autonuma->autonuma_migrate_node,
+				  head);
 		NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
 		autonuma_migrate_unlock(nid);
 
@@ -147,6 +169,8 @@ static void __autonuma_migrate_page_add(struct page *page,
 	int numpages;
 	unsigned long nr_migrate_pages;
 	wait_queue_head_t *wait_queue;
+	struct autonuma_list_head *head;
+	bool added;
 
 	VM_BUG_ON(dst_nid >= MAX_NUMNODES);
 	VM_BUG_ON(dst_nid < -1);
@@ -170,25 +194,42 @@ static void __autonuma_migrate_page_add(struct page *page,
 	VM_BUG_ON(nid >= MAX_NUMNODES);
 	VM_BUG_ON(nid < -1);
 	if (nid >= 0) {
+		VM_BUG_ON(nid == page_nid);
+		head = &NODE_DATA(nid)->autonuma_migrate_head[page_nid];
+
 		autonuma_migrate_lock(nid);
-		list_del(&page_autonuma->autonuma_migrate_node);
+		autonuma_list_del(page_nid,
+				  &page_autonuma->autonuma_migrate_node,
+				  head);
 		NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
 		autonuma_migrate_unlock(nid);
 	}
 
+	head = &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid];
+
 	autonuma_migrate_lock(dst_nid);
-	list_add(&page_autonuma->autonuma_migrate_node,
-		 &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid]);
-	NODE_DATA(dst_nid)->autonuma_nr_migrate_pages += numpages;
-	nr_migrate_pages = NODE_DATA(dst_nid)->autonuma_nr_migrate_pages;
+	added = autonuma_list_add(page_nid, page, AUTONUMA_LIST_HEAD, head);
+	if (added) {
+		NODE_DATA(dst_nid)->autonuma_nr_migrate_pages += numpages;
+		nr_migrate_pages = NODE_DATA(dst_nid)->autonuma_nr_migrate_pages;
+	}
 
 	autonuma_migrate_unlock(dst_nid);
 
-	page_autonuma->autonuma_migrate_nid = dst_nid;
+	if (added)
+		page_autonuma->autonuma_migrate_nid = dst_nid;
 
 	compound_unlock_irqrestore(page, flags);
 
-	if (!autonuma_migrate_defer()) {
+	/* Done, if migrate defer flag is set */
+	if (autonuma_migrate_defer())
+		return;
+
+	/*
+	 * Wake up migrate daemon if the number of pages has reached
+	 * the threshold.
+	 */
+	if (added) {
 		wait_queue = &NODE_DATA(dst_nid)->autonuma_knuma_migrated_wait;
 		if (nr_migrate_pages >= pages_to_migrate &&
 		    nr_migrate_pages - numpages < pages_to_migrate &&
@@ -904,7 +945,7 @@ static int isolate_migratepages(struct list_head *migratepages,
 				struct pglist_data *pgdat)
 {
 	int nr = 0, nid;
-	struct list_head *heads = pgdat->autonuma_migrate_head;
+	struct autonuma_list_head *heads = pgdat->autonuma_migrate_head;
 
 	/* FIXME: THP balancing, restart from last nid */
 	for_each_online_node(nid) {
@@ -924,10 +965,10 @@ static int isolate_migratepages(struct list_head *migratepages,
 			  "thread has been altered in a suboptimal way\n",
 			  pgdat->node_id);
 		if (nid == pgdat->node_id) {
-			VM_BUG_ON(!list_empty(&heads[nid]));
+			VM_BUG_ON(!autonuma_list_empty_debug(&heads[nid]));
 			continue;
 		}
-		if (list_empty(&heads[nid]))
+		if (autonuma_list_empty(&heads[nid]))
 			continue;
 		/* some page wants to go to this pgdat */
 		/*
@@ -939,22 +980,26 @@ static int isolate_migratepages(struct list_head *migratepages,
 		 * obtained the autonuma_migrate_lock here.
 		 */
 		autonuma_migrate_lock_irq(pgdat->node_id);
-		if (list_empty(&heads[nid])) {
+		if (autonuma_list_empty_debug(&heads[nid])) {
 			autonuma_migrate_unlock_irq(pgdat->node_id);
 			continue;
 		}
-		page_autonuma = list_entry(heads[nid].prev,
-					   struct page_autonuma,
-					   autonuma_migrate_node);
-		page = page_autonuma->page;
+		page = autonuma_list_entry_to_page(nid,
+						   heads[nid].anl_prev_pfn);
+		BUG_ON(nid != page_to_nid(page));
+		page_autonuma = lookup_page_autonuma(page);
 		if (unlikely(!get_page_unless_zero(page))) {
+			struct autonuma_list_head *entry_head;
 			/*
 			 * Is getting freed and will remove self from the
 			 * autonuma list shortly, skip it for now.
 			 */
-			list_del(&page_autonuma->autonuma_migrate_node);
-			list_add(&page_autonuma->autonuma_migrate_node,
-				 &heads[nid]);
+			entry_head = &page_autonuma->autonuma_migrate_node;
+			autonuma_list_del(nid, entry_head, &heads[nid]);
+			if (!autonuma_list_add(nid, page,
+					       AUTONUMA_LIST_HEAD,
+					       &heads[nid]))
+				BUG();
 			autonuma_migrate_unlock_irq(pgdat->node_id);
 			autonuma_printk("autonuma migrate page is free\n");
 			continue;
@@ -967,8 +1012,6 @@ static int isolate_migratepages(struct list_head *migratepages,
 			continue;
 		}
 
-		VM_BUG_ON(nid != page_to_nid(page));
-
 		if (PageTransHuge(page)) {
 			VM_BUG_ON(!PageAnon(page));
 			/* FIXME: remove split_huge_page */
diff --git a/mm/autonuma_list.c b/mm/autonuma_list.c
new file mode 100644
index 0000000..0a1cab1
--- /dev/null
+++ b/mm/autonuma_list.c
@@ -0,0 +1,169 @@
+/*
+ * Copyright 2006, Red Hat, Inc., Dave Jones
+ * Copyright 2012, Red Hat, Inc.
+ * Released under the General Public License (GPL).
+ *
+ * This file contains the linked list implementations for
+ * autonuma migration lists.
+ */
+
+#include <linux/mm.h>
+#include <linux/autonuma.h>
+
+/*
+ * Insert a new entry between two known consecutive entries.
+ *
+ * This is only for internal list manipulation where we know
+ * the prev/next entries already!
+ *
+ * return true if succeeded, or false if the (page_nid, pfn_offset)
+ * pair couldn't represent the pfn and the list_add didn't succeed.
+ */
+bool __autonuma_list_add(int page_nid,
+			 struct page *page,
+			 struct autonuma_list_head *head,
+			 autonuma_list_entry prev,
+			 autonuma_list_entry next)
+{
+	autonuma_list_entry new;
+
+	VM_BUG_ON(page_nid != page_to_nid(page));
+	new = autonuma_page_to_list_entry(page_nid, page);
+	if (new > AUTONUMA_LIST_MAX_PFN_OFFSET)
+		return false;
+
+	WARN(new == prev || new == next,
+	     "autonuma_list_add double add: new=%u, prev=%u, next=%u.\n",
+	     new, prev, next);
+
+	__autonuma_list_head(page_nid, head, next)->anl_prev_pfn = new;
+	__autonuma_list_head(page_nid, head, new)->anl_next_pfn = next;
+	__autonuma_list_head(page_nid, head, new)->anl_prev_pfn = prev;
+	__autonuma_list_head(page_nid, head, prev)->anl_next_pfn = new;
+	return true;
+}
+
+static inline void __autonuma_list_del_entry(int page_nid,
+					     struct autonuma_list_head *entry,
+					     struct autonuma_list_head *head)
+{
+	autonuma_list_entry prev, next;
+
+	next = entry->anl_next_pfn;
+	prev = entry->anl_prev_pfn;
+
+	if (WARN(next == AUTONUMA_LIST_POISON1,
+		 "autonuma_list_del corruption, "
+		 "%p->anl_next_pfn is AUTONUMA_LIST_POISON1 (%u)\n",
+		entry, AUTONUMA_LIST_POISON1) ||
+	    WARN(prev == AUTONUMA_LIST_POISON2,
+		"autonuma_list_del corruption, "
+		 "%p->anl_prev_pfn is AUTONUMA_LIST_POISON2 (%u)\n",
+		entry, AUTONUMA_LIST_POISON2))
+		return;
+
+	__autonuma_list_head(page_nid, head, next)->anl_prev_pfn = prev;
+	__autonuma_list_head(page_nid, head, prev)->anl_next_pfn = next;
+}
+
+/*
+ * autonuma_list_del - deletes entry from list.
+ *
+ * Note: autonuma_list_empty on entry does not return true after this,
+ * the entry is in an undefined state.
+ */
+void autonuma_list_del(int page_nid, struct autonuma_list_head *entry,
+		       struct autonuma_list_head *head)
+{
+	__autonuma_list_del_entry(page_nid, entry, head);
+	entry->anl_next_pfn = AUTONUMA_LIST_POISON1;
+	entry->anl_prev_pfn = AUTONUMA_LIST_POISON2;
+}
+
+/*
+ * autonuma_list_empty_debug - tests whether a list is empty
+ * @head: the list to test.
+ */
+bool autonuma_list_empty_debug(const struct autonuma_list_head *head)
+{
+	bool ret = false;
+	if (head->anl_next_pfn == AUTONUMA_LIST_HEAD) {
+		ret = true;
+		BUG_ON(head->anl_prev_pfn != AUTONUMA_LIST_HEAD);
+	}
+	return ret;
+}
+
+/* abstraction conversion methods */
+
+static inline struct page *__autonuma_list_entry_to_page(int page_nid,
+							 autonuma_list_entry pfn_offset)
+{
+	struct pglist_data *pgdat = NODE_DATA(page_nid);
+	unsigned long pfn = pgdat->node_start_pfn + pfn_offset;
+	BUG_ON(pfn_offset >= pgdat->node_spanned_pages);
+	return pfn_to_page(pfn);
+}
+
+struct page *autonuma_list_entry_to_page(int page_nid,
+					 autonuma_list_entry pfn_offset)
+{
+	VM_BUG_ON(page_nid < 0);
+	BUG_ON(pfn_offset == AUTONUMA_LIST_POISON1);
+	BUG_ON(pfn_offset == AUTONUMA_LIST_POISON2);
+	BUG_ON(pfn_offset == AUTONUMA_LIST_HEAD);
+	return __autonuma_list_entry_to_page(page_nid, pfn_offset);
+}
+
+/*
+ * returns a value above AUTONUMA_LIST_MAX_PFN_OFFSET if the pfn is
+ * located a too big offset from the start of the node and cannot be
+ * represented by the (page_nid, pfn_offset) pair.
+ */
+autonuma_list_entry autonuma_page_to_list_entry(int page_nid,
+						struct page *page)
+{
+	unsigned long pfn = page_to_pfn(page);
+	struct pglist_data *pgdat = NODE_DATA(page_nid);
+	VM_BUG_ON(page_nid != page_to_nid(page));
+	BUG_ON(pfn < pgdat->node_start_pfn);
+	BUG_ON(pfn >= pgdat->node_start_pfn + pgdat->node_spanned_pages);
+	pfn -= pgdat->node_start_pfn;
+	if (pfn > AUTONUMA_LIST_MAX_PFN_OFFSET) {
+		WARN_ONCE(1, "autonuma_page_to_list_entry: "
+			  "pfn_offset  %lu, pgdat %p, "
+			  "pgdat->node_start_pfn %lu\n",
+			  pfn, pgdat, pgdat->node_start_pfn);
+		/*
+		 * Any value bigger than AUTONUMA_LIST_MAX_PFN_OFFSET
+		 * will work as an error retval, but better pick one
+		 * that will cause noise if computed wrong by the
+		 * caller.
+		 */
+		return AUTONUMA_LIST_POISON1;
+	}
+	return pfn; /* fits in an autonuma_list_entry without losing information */
+}
+
+static inline struct autonuma_list_head *____autonuma_list_head(int page_nid,
+					autonuma_list_entry pfn_offset)
+{
+	struct pglist_data *pgdat = NODE_DATA(page_nid);
+	unsigned long pfn = pgdat->node_start_pfn + pfn_offset;
+	struct page *page = pfn_to_page(pfn);
+	struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+	return &page_autonuma->autonuma_migrate_node;
+}
+
+struct autonuma_list_head *__autonuma_list_head(int page_nid,
+					struct autonuma_list_head *head,
+					autonuma_list_entry pfn_offset)
+{
+	VM_BUG_ON(page_nid < 0);
+	BUG_ON(pfn_offset == AUTONUMA_LIST_POISON1);
+	BUG_ON(pfn_offset == AUTONUMA_LIST_POISON2);
+	if (pfn_offset != AUTONUMA_LIST_HEAD)
+		return ____autonuma_list_head(page_nid, pfn_offset);
+	else
+		return head;
+}
diff --git a/mm/page_autonuma.c b/mm/page_autonuma.c
index 46d616c..c8ba137 100644
--- a/mm/page_autonuma.c
+++ b/mm/page_autonuma.c
@@ -1,5 +1,6 @@
 #include <linux/mm.h>
 #include <linux/memory.h>
+#include <linux/vmalloc.h>
 #include <linux/autonuma.h>
 #include <linux/page_autonuma.h>
 #include <linux/bootmem.h>
@@ -12,7 +13,6 @@ void __meminit page_autonuma_map_init(struct page *page,
 	for (end = page + nr_pages; page < end; page++, page_autonuma++) {
 		page_autonuma->autonuma_last_nid = -1;
 		page_autonuma->autonuma_migrate_nid = -1;
-		page_autonuma->page = page;
 	}
 }
 
@@ -20,6 +20,9 @@ static void __meminit __pgdat_autonuma_init(struct pglist_data *pgdat)
 {
 	int node_iter;
 
+	/* verify the per-page page_autonuma 12 byte fixed cost */
+	BUILD_BUG_ON((unsigned long) &((struct page_autonuma *)0)[1] != 12);
+
 	spin_lock_init(&pgdat->autonuma_lock);
 	init_waitqueue_head(&pgdat->autonuma_knuma_migrated_wait);
 	pgdat->autonuma_nr_migrate_pages = 0;
@@ -30,8 +33,11 @@ static void __meminit __pgdat_autonuma_init(struct pglist_data *pgdat)
 
 	/* noautonuma early param may also clear AUTONUMA_POSSIBLE_FLAG */
 	if (autonuma_possible())
-		for_each_node(node_iter)
-			INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
+		for_each_node(node_iter) {
+			struct autonuma_list_head *head;
+			head = &pgdat->autonuma_migrate_head[node_iter];
+			AUTONUMA_INIT_LIST_HEAD(head);
+		}
 }
 
 #if !defined(CONFIG_SPARSEMEM)
@@ -119,14 +125,6 @@ struct page_autonuma *lookup_page_autonuma(struct page *page)
 	unsigned long pfn = page_to_pfn(page);
 	struct mem_section *section = __pfn_to_section(pfn);
 
-	/* if it's not a power of two we may be wasting memory */
-	BUILD_BUG_ON(SECTION_PAGE_AUTONUMA_SIZE &
-		     (SECTION_PAGE_AUTONUMA_SIZE-1));
-
-	/* memsection must be a power of two */
-	BUILD_BUG_ON(sizeof(struct mem_section) &
-		     (sizeof(struct mem_section)-1));
-
 #ifdef CONFIG_DEBUG_VM
 	/*
 	 * The sanity checks the page allocator does upon freeing a
@@ -142,6 +140,10 @@ struct page_autonuma *lookup_page_autonuma(struct page *page)
 
 void __meminit pgdat_autonuma_init(struct pglist_data *pgdat)
 {
+	/* memsection must be a power of two */
+	BUILD_BUG_ON(sizeof(struct mem_section) &
+		     (sizeof(struct mem_section)-1));
+
 	__pgdat_autonuma_init(pgdat);
 }
 

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 32/36] autonuma: boost khugepaged scanning rate
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
                   ` (30 preceding siblings ...)
  2012-08-22 14:59 ` [PATCH 31/36] autonuma: shrink the per-page page_autonuma struct size Andrea Arcangeli
@ 2012-08-22 14:59 ` Andrea Arcangeli
  2012-08-22 14:59 ` [PATCH 33/36] autonuma: powerpc port Andrea Arcangeli
                   ` (4 subsequent siblings)
  36 siblings, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Until THP native migration is implemented it's safer to boost the
khugepaged scanning rate, because every memory migration currently
splits the hugepages involved. The regular scanning rate then becomes
too low when lots of memory is being migrated.
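
For reference (assuming the usual khugepaged defaults of
scan_sleep_millisecs=10000 and alloc_sleep_millisecs=60000), the values
set below make khugepaged wake up to collapse pages roughly 100 times
more often and retry failed hugepage allocations about 6 times more
often whenever AutoNUMA is possible.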

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 579e52b..00320b6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -573,6 +573,12 @@ static int __init hugepage_init(void)
 
 	set_recommended_min_free_kbytes();
 
+	/* Hack, remove after THP native migration */
+	if (autonuma_possible()) {
+		khugepaged_scan_sleep_millisecs = 100;
+		khugepaged_alloc_sleep_millisecs = 10000;
+	}
+
 	return 0;
 out:
 	hugepage_exit_sysfs(hugepage_kobj);

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 33/36] autonuma: powerpc port
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
                   ` (31 preceding siblings ...)
  2012-08-22 14:59 ` [PATCH 32/36] autonuma: boost khugepaged scanning rate Andrea Arcangeli
@ 2012-08-22 14:59 ` Andrea Arcangeli
  2012-08-22 22:01   ` Benjamin Herrenschmidt
  2012-08-22 14:59 ` [PATCH 34/36] autonuma: make the AUTONUMA_SCAN_PMD_FLAG conditional to CONFIG_HAVE_ARCH_AUTONUMA_SCAN_PMD Andrea Arcangeli
                   ` (3 subsequent siblings)
  36 siblings, 1 reply; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

From: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>

    * PMD flagging is not required on powerpc since large pages
      are tracked in ptes.
    * Yet to be tested with large pages
    * This is an initial patch that partially works
    * knuma_scand and numa hinting page faults work
    * Page migration is yet to be observed/verified

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/powerpc/include/asm/pgtable.h        |   48 ++++++++++++++++++++++++++++-
 arch/powerpc/include/asm/pte-hash64-64k.h |    4 ++-
 arch/powerpc/mm/numa.c                    |    3 +-
 mm/autonuma.c                             |    2 +-
 4 files changed, 53 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index 2e0e411..5f03079 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -33,10 +33,56 @@ static inline int pte_dirty(pte_t pte)		{ return pte_val(pte) & _PAGE_DIRTY; }
 static inline int pte_young(pte_t pte)		{ return pte_val(pte) & _PAGE_ACCESSED; }
 static inline int pte_file(pte_t pte)		{ return pte_val(pte) & _PAGE_FILE; }
 static inline int pte_special(pte_t pte)	{ return pte_val(pte) & _PAGE_SPECIAL; }
-static inline int pte_present(pte_t pte)	{ return pte_val(pte) & _PAGE_PRESENT; }
+static inline int pte_present(pte_t pte)	{ return pte_val(pte) &
+							(_PAGE_PRESENT|_PAGE_NUMA_PTE); }
 static inline int pte_none(pte_t pte)		{ return (pte_val(pte) & ~_PTE_NONE_MASK) == 0; }
 static inline pgprot_t pte_pgprot(pte_t pte)	{ return __pgprot(pte_val(pte) & PAGE_PROT_BITS); }
 
+#ifdef CONFIG_AUTONUMA
+static inline int pte_numa(pte_t pte)
+{
+       return (pte_val(pte) &
+               (_PAGE_NUMA_PTE|_PAGE_PRESENT)) == _PAGE_NUMA_PTE;
+}
+
+#endif
+
+static inline pte_t pte_mknonnuma(pte_t pte)
+{
+       pte_val(pte) &= ~_PAGE_NUMA_PTE;
+       pte_val(pte) |= (_PAGE_PRESENT|_PAGE_ACCESSED);
+
+       return pte;
+}
+
+static inline pte_t pte_mknuma(pte_t pte)
+{
+       pte_val(pte) |= _PAGE_NUMA_PTE;
+       pte_val(pte) &= ~_PAGE_PRESENT;
+       return pte;
+}
+
+static inline int pmd_numa(pmd_t pmd)
+{
+       /* PMD tracking not implemented */
+       return 0;
+}
+
+static inline pmd_t pmd_mknonnuma(pmd_t pmd)
+{
+	BUG();
+	return pmd;
+}
+
+static inline pmd_t pmd_mknuma(pmd_t pmd)
+{
+	BUG();
+	return pmd;
+}
+
+/* No pmd flags on powerpc */
+#define set_pmd_at(mm, addr, pmdp, pmd)  do { } while (0)
+
 /* Conversion functions: convert a page and protection to a page entry,
  * and a page entry and page directory to the page they refer to.
  *
diff --git a/arch/powerpc/include/asm/pte-hash64-64k.h b/arch/powerpc/include/asm/pte-hash64-64k.h
index 59247e8..f7e1468 100644
--- a/arch/powerpc/include/asm/pte-hash64-64k.h
+++ b/arch/powerpc/include/asm/pte-hash64-64k.h
@@ -7,6 +7,8 @@
 #define _PAGE_COMBO	0x10000000 /* this is a combo 4k page */
 #define _PAGE_4K_PFN	0x20000000 /* PFN is for a single 4k page */
 
+#define _PAGE_NUMA_PTE 0x40000000 /* Adjust PTE_RPN_SHIFT below */
+
 /* For 64K page, we don't have a separate _PAGE_HASHPTE bit. Instead,
  * we set that to be the whole sub-bits mask. The C code will only
  * test this, so a multi-bit mask will work. For combo pages, this
@@ -36,7 +38,7 @@
  * That gives us a max RPN of 34 bits, which means a max of 50 bits
  * of addressable physical space, or 46 bits for the special 4k PFNs.
  */
-#define PTE_RPN_SHIFT	(30)
+#define PTE_RPN_SHIFT	(31)
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 39b1597..80af41e 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -22,6 +22,7 @@
 #include <linux/pfn.h>
 #include <linux/cpuset.h>
 #include <linux/node.h>
+#include <linux/page_autonuma.h>
 #include <asm/sparsemem.h>
 #include <asm/prom.h>
 #include <asm/smp.h>
@@ -1045,7 +1046,7 @@ void __init do_init_bootmem(void)
 		 * all reserved areas marked.
 		 */
 		NODE_DATA(nid) = careful_zallocation(nid,
-					sizeof(struct pglist_data),
+					autonuma_pglist_data_size(),
 					SMP_CACHE_BYTES, end_pfn);
 
   		dbg("node %d\n", nid);
diff --git a/mm/autonuma.c b/mm/autonuma.c
index ada6c57..a4da3f3 100644
--- a/mm/autonuma.c
+++ b/mm/autonuma.c
@@ -25,7 +25,7 @@ unsigned long autonuma_flags __read_mostly =
 #ifdef CONFIG_AUTONUMA_DEFAULT_ENABLED
 	|(1<<AUTONUMA_ENABLED_FLAG)
 #endif
-	|(1<<AUTONUMA_SCAN_PMD_FLAG);
+	|(0<<AUTONUMA_SCAN_PMD_FLAG);
 
 static DEFINE_MUTEX(knumad_mm_mutex);
 

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 34/36] autonuma: make the AUTONUMA_SCAN_PMD_FLAG conditional to CONFIG_HAVE_ARCH_AUTONUMA_SCAN_PMD
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
                   ` (32 preceding siblings ...)
  2012-08-22 14:59 ` [PATCH 33/36] autonuma: powerpc port Andrea Arcangeli
@ 2012-08-22 14:59 ` Andrea Arcangeli
  2012-08-22 14:59 ` [PATCH 35/36] autonuma: add knuma_migrated/allow_first_fault in sysfs Andrea Arcangeli
                   ` (2 subsequent siblings)
  36 siblings, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Remove the sysfs entry /sys/kernel/mm/autonuma/knuma_scand/pmd and
force the knuma_scand pmd mode off if
CONFIG_HAVE_ARCH_AUTONUMA_SCAN_PMD is not set by the architecture.

Enable AutoNUMA for PPC64.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/Kconfig         |    3 +++
 arch/powerpc/Kconfig |    6 ++++++
 arch/x86/Kconfig     |    1 +
 mm/autonuma.c        |    9 ++++++++-
 4 files changed, 18 insertions(+), 1 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index ee3ed89..6f4f19f 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -284,4 +284,7 @@ config SECCOMP_FILTER
 config HAVE_ARCH_AUTONUMA
 	bool
 
+config HAVE_ARCH_AUTONUMA_SCAN_PMD
+	bool
+
 source "kernel/gcov/Kconfig"
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 352f416..73fa908 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -140,6 +140,12 @@ config PPC
 	select GENERIC_STRNCPY_FROM_USER
 	select GENERIC_STRNLEN_USER
 
+# allow AutoNUMA only on PPC64 for now
+config PPC_HAVE_ARCH_AUTONUMA
+	bool
+	default y if PPC64
+	select HAVE_ARCH_AUTONUMA
+
 config EARLY_PRINTK
 	bool
 	default y
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4cbdfce..f24bff8 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -98,6 +98,7 @@ config X86
 	select GENERIC_STRNCPY_FROM_USER
 	select GENERIC_STRNLEN_USER
 	select HAVE_ARCH_AUTONUMA
+	select HAVE_ARCH_AUTONUMA_SCAN_PMD
 
 config INSTRUCTION_DECODER
 	def_bool (KPROBES || PERF_EVENTS || UPROBES)
diff --git a/mm/autonuma.c b/mm/autonuma.c
index a4da3f3..4b7c744 100644
--- a/mm/autonuma.c
+++ b/mm/autonuma.c
@@ -25,7 +25,10 @@ unsigned long autonuma_flags __read_mostly =
 #ifdef CONFIG_AUTONUMA_DEFAULT_ENABLED
 	|(1<<AUTONUMA_ENABLED_FLAG)
 #endif
-	|(0<<AUTONUMA_SCAN_PMD_FLAG);
+#ifdef CONFIG_HAVE_ARCH_AUTONUMA_SCAN_PMD
+	|(1<<AUTONUMA_SCAN_PMD_FLAG)
+#endif
+	;
 
 static DEFINE_MUTEX(knumad_mm_mutex);
 
@@ -1300,7 +1303,9 @@ static ssize_t NAME ## _store(struct kobject *kobj,			\
 static struct kobj_attribute NAME ## _attr =				\
 	__ATTR(NAME, 0644, NAME ## _show, NAME ## _store);
 
+#ifdef CONFIG_HAVE_ARCH_AUTONUMA_SCAN_PMD
 SYSFS_ENTRY(pmd, AUTONUMA_SCAN_PMD_FLAG);
+#endif /* CONFIG_HAVE_ARCH_AUTONUMA_SCAN_PMD */
 SYSFS_ENTRY(debug, AUTONUMA_DEBUG_FLAG);
 #ifdef CONFIG_DEBUG_VM
 SYSFS_ENTRY(load_balance_strict, AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG);
@@ -1398,7 +1403,9 @@ static struct attribute *knuma_scand_attr[] = {
 	&pages_to_scan_attr.attr,
 	&pages_scanned_attr.attr,
 	&full_scans_attr.attr,
+#ifdef CONFIG_HAVE_ARCH_AUTONUMA_SCAN_PMD
 	&pmd_attr.attr,
+#endif
 	NULL,
 };
 static struct attribute_group knuma_scand_attr_group = {

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 35/36] autonuma: add knuma_migrated/allow_first_fault in sysfs
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
                   ` (33 preceding siblings ...)
  2012-08-22 14:59 ` [PATCH 34/36] autonuma: make the AUTONUMA_SCAN_PMD_FLAG conditional to CONFIG_HAVE_ARCH_AUTONUMA_SCAN_PMD Andrea Arcangeli
@ 2012-08-22 14:59 ` Andrea Arcangeli
  2012-08-22 14:59 ` [PATCH 36/36] autonuma: add mm_autonuma working set estimation Andrea Arcangeli
  2012-08-22 19:26 ` [PATCH 00/36] AutoNUMA24 Rik van Riel
  36 siblings, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

This sysfs control, if enabled, allows memory to be migrated on the
very first numa hinting page fault.

If disabled, the first fault is never enough and migration requires a
confirmation through the last_nid logic.

By default, the first fault is allowed to migrate memory. Disabling it
may increase the time it takes to converge, but it reduces some
initial thrashing in case of NUMA false sharing.
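
In simplified form the resulting migration gate looks like this (a
sketch only, with a made-up helper name; the real change is to
last_nid_set() in the diff below):

	/* sketch: may this NUMA hinting fault queue a migration? */
	static bool may_migrate(int autonuma_last_nid, int this_nid)
	{
		if (autonuma_migrate_allow_first_fault() &&
		    autonuma_last_nid < 0)
			return true;	/* first fault ever on this page */
		/* otherwise require two faults in a row from this_nid */
		return autonuma_last_nid == this_nid;
	}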

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma_flags.h |   20 ++++++++++++++++++++
 mm/autonuma.c                  |    7 +++++--
 2 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/include/linux/autonuma_flags.h b/include/linux/autonuma_flags.h
index f53203a..28756ca 100644
--- a/include/linux/autonuma_flags.h
+++ b/include/linux/autonuma_flags.h
@@ -85,6 +85,20 @@ enum autonuma_flag {
 	 * Default not set.
 	 */
 	AUTONUMA_MIGRATE_DEFER_FLAG,
+	/*
+	 * If set, a page must successfully pass a last_nid check
+	 * before it can be migrated even if it's the very first NUMA
+	 * hinting page fault occurring on the page. If not set, the
+	 * first NUMA hinting page fault of a newly allocated page
+	 * will always pass the last_nid check.
+	 *
+	 * If not set, a newly started workload can converge quicker,
+	 * but it may incur more false positive migrations before
+	 * reaching convergence.
+	 *
+	 * Default not set.
+	 */
+	AUTONUMA_MIGRATE_ALLOW_FIRST_FAULT_FLAG,
 };
 
 extern unsigned long autonuma_flags;
@@ -126,4 +140,10 @@ static inline bool autonuma_migrate_defer(void)
 	return test_bit(AUTONUMA_MIGRATE_DEFER_FLAG, &autonuma_flags);
 }
 
+static inline bool autonuma_migrate_allow_first_fault(void)
+{
+	return test_bit(AUTONUMA_MIGRATE_ALLOW_FIRST_FAULT_FLAG,
+			&autonuma_flags);
+}
+
 #endif /* _LINUX_AUTONUMA_FLAGS_H */
diff --git a/mm/autonuma.c b/mm/autonuma.c
index 4b7c744..e7570df 100644
--- a/mm/autonuma.c
+++ b/mm/autonuma.c
@@ -28,7 +28,7 @@ unsigned long autonuma_flags __read_mostly =
 #ifdef CONFIG_HAVE_ARCH_AUTONUMA_SCAN_PMD
 	|(1<<AUTONUMA_SCAN_PMD_FLAG)
 #endif
-	;
+	|(1<<AUTONUMA_MIGRATE_ALLOW_FIRST_FAULT_FLAG);
 
 static DEFINE_MUTEX(knumad_mm_mutex);
 
@@ -345,7 +345,8 @@ static inline bool last_nid_set(struct page *page, int this_nid)
 	int autonuma_last_nid = ACCESS_ONCE(page_autonuma->autonuma_last_nid);
 	VM_BUG_ON(this_nid < 0);
 	VM_BUG_ON(this_nid >= MAX_NUMNODES);
-	if (autonuma_last_nid >= 0 && autonuma_last_nid != this_nid) {
+	if ((!autonuma_migrate_allow_first_fault() ||
+	     autonuma_last_nid >= 0) && autonuma_last_nid != this_nid) {
 		int migrate_nid;
 		migrate_nid = ACCESS_ONCE(page_autonuma->autonuma_migrate_nid);
 		if (migrate_nid >= 0)
@@ -1311,6 +1312,7 @@ SYSFS_ENTRY(debug, AUTONUMA_DEBUG_FLAG);
 SYSFS_ENTRY(load_balance_strict, AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG);
 SYSFS_ENTRY(defer, AUTONUMA_MIGRATE_DEFER_FLAG);
 SYSFS_ENTRY(reset, AUTONUMA_SCHED_RESET_FLAG);
+SYSFS_ENTRY(allow_first_fault, AUTONUMA_MIGRATE_ALLOW_FIRST_FAULT_FLAG);
 #endif /* CONFIG_DEBUG_VM */
 
 #undef SYSFS_ENTRY
@@ -1419,6 +1421,7 @@ static struct attribute *knuma_migrated_attr[] = {
 	&pages_migrated_attr.attr,
 #ifdef CONFIG_DEBUG_VM
 	&defer_attr.attr,
+	&allow_first_fault_attr.attr,
 #endif
 	NULL,
 };

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 36/36] autonuma: add mm_autonuma working set estimation
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
                   ` (34 preceding siblings ...)
  2012-08-22 14:59 ` [PATCH 35/36] autonuma: add knuma_migrated/allow_first_fault in sysfs Andrea Arcangeli
@ 2012-08-22 14:59 ` Andrea Arcangeli
  2012-08-22 19:26 ` [PATCH 00/36] AutoNUMA24 Rik van Riel
  36 siblings, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 14:59 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
	Mike Galbraith, Paul E. McKenney, Lai Jiangshan, Bharata B Rao,
	Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

Working set estimation will only record memory that was recently used
and that in turn is eligible for automatic migration. It will ignore
memory that is never accessed by the process, which in turn will never
be considered for migration. This can speed up NUMA convergence if
large areas of memory are never used.
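
The filter itself is small: a pte that is still marked pte_numa when
knuma_scand comes around again was never faulted on since the previous
pass, so it is not part of the working set. Simplified sketch of the
check added to the pte scan loop (the real code is in the diff below):

	if (!pte_present(pteval))
		continue;
	if (autonuma_mm_working_set() && pte_numa(pteval))
		continue;	/* untouched since the last pass: skip */
	/* otherwise account the page in mm_autonuma and re-arm pte_numa */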

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/autonuma_flags.h |   25 ++++++++++++++++++++++---
 mm/autonuma.c                  |   25 +++++++++++++++++++++++++
 2 files changed, 47 insertions(+), 3 deletions(-)

diff --git a/include/linux/autonuma_flags.h b/include/linux/autonuma_flags.h
index 28756ca..f72f2e2 100644
--- a/include/linux/autonuma_flags.h
+++ b/include/linux/autonuma_flags.h
@@ -62,9 +62,10 @@ enum autonuma_flag {
 	 * faults at the pmd level instead of the pte level. This
 	 * reduces the number of NUMA hinting faults potentially
 	 * saving CPU time. It reduces the accuracy of the
-	 * task_autonuma statistics (but does not change the accuracy
-	 * of the mm_autonuma statistics). This flag can be toggled
-	 * through sysfs as runtime.
+	 * task_autonuma statistics (it doesn't change the accuracy of
+	 * the mm_autonuma statistics if the mm_working_set mode is
+	 * not set). This flag can be toggled through sysfs as
+	 * runtime.
 	 *
 	 * This flag does not affect AutoNUMA with transparent
 	 * hugepages (THP). With THP the NUMA hinting page faults
@@ -99,6 +100,18 @@ enum autonuma_flag {
 	 * Default not set.
 	 */
 	AUTONUMA_MIGRATE_ALLOW_FIRST_FAULT_FLAG,
+	/*
+	 * If set, mm_autonuma will represent a working set estimation
+	 * of the memory used by the process over the last knuma_scand
+	 * pass.
+	 *
+	 * If not set, mm_autonuma will represent all (not shared)
+	 * memory eligible for automatic migration mapped by the
+	 * process.
+	 *
+	 * Default set.
+	 */
+	AUTONUMA_MM_WORKING_SET_FLAG,
 };
 
 extern unsigned long autonuma_flags;
@@ -146,4 +159,10 @@ static inline bool autonuma_migrate_allow_first_fault(void)
 			&autonuma_flags);
 }
 
+static inline bool autonuma_mm_working_set(void)
+{
+	return test_bit(AUTONUMA_MM_WORKING_SET_FLAG,
+			&autonuma_flags);
+}
+
 #endif /* _LINUX_AUTONUMA_FLAGS_H */
diff --git a/mm/autonuma.c b/mm/autonuma.c
index e7570df..71ce619 100644
--- a/mm/autonuma.c
+++ b/mm/autonuma.c
@@ -28,6 +28,7 @@ unsigned long autonuma_flags __read_mostly =
 #ifdef CONFIG_HAVE_ARCH_AUTONUMA_SCAN_PMD
 	|(1<<AUTONUMA_SCAN_PMD_FLAG)
 #endif
+	|(1<<AUTONUMA_MM_WORKING_SET_FLAG)
 	|(1<<AUTONUMA_MIGRATE_ALLOW_FIRST_FAULT_FLAG);
 
 static DEFINE_MUTEX(knumad_mm_mutex);
@@ -603,6 +604,12 @@ static int knuma_scand_pmd(struct mm_struct *mm,
 				unsigned long *fault_tmp;
 				ret = HPAGE_PMD_NR;
 
+				if (autonuma_mm_working_set() &&
+				    pmd_numa(*pmd)) {
+					spin_unlock(&mm->page_table_lock);
+					goto out;
+				}
+
 				page = pmd_page(*pmd);
 
 				/* only check non-shared pages */
@@ -639,6 +646,9 @@ static int knuma_scand_pmd(struct mm_struct *mm,
 		unsigned long *fault_tmp;
 		if (!pte_present(pteval))
 			continue;
+		if (autonuma_mm_working_set() &&
+		    pte_numa(pteval))
+			continue;
 		page = vm_normal_page(vma, _address, pteval);
 		if (unlikely(!page))
 			continue;
@@ -682,6 +692,17 @@ static void mm_numa_fault_tmp_flush(struct mm_struct *mm)
 	unsigned long tot;
 	unsigned long *fault_tmp = knuma_scand_data.mm_numa_fault_tmp;
 
+	if (autonuma_mm_working_set()) {
+		for_each_node(nid) {
+			tot = fault_tmp[nid];
+			if (tot)
+				break;
+		}
+		if (!tot)
+			/* process was idle, keep the old data */
+			return;
+	}
+
 	/* FIXME: would be better protected with write_seqlock_bh() */
 	local_bh_disable();
 
@@ -1313,6 +1334,7 @@ SYSFS_ENTRY(load_balance_strict, AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG);
 SYSFS_ENTRY(defer, AUTONUMA_MIGRATE_DEFER_FLAG);
 SYSFS_ENTRY(reset, AUTONUMA_SCHED_RESET_FLAG);
 SYSFS_ENTRY(allow_first_fault, AUTONUMA_MIGRATE_ALLOW_FIRST_FAULT_FLAG);
+SYSFS_ENTRY(mm_working_set, AUTONUMA_MM_WORKING_SET_FLAG);
 #endif /* CONFIG_DEBUG_VM */
 
 #undef SYSFS_ENTRY
@@ -1408,6 +1430,9 @@ static struct attribute *knuma_scand_attr[] = {
 #ifdef CONFIG_HAVE_ARCH_AUTONUMA_SCAN_PMD
 	&pmd_attr.attr,
 #endif
+#ifdef CONFIG_DEBUG_VM
+	&mm_working_set_attr.attr,
+#endif
 	NULL,
 };
 static struct attribute_group knuma_scand_attr_group = {

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH 00/36] AutoNUMA24
  2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
                   ` (35 preceding siblings ...)
  2012-08-22 14:59 ` [PATCH 36/36] autonuma: add mm_autonuma working set estimation Andrea Arcangeli
@ 2012-08-22 19:26 ` Rik van Riel
  2012-08-22 21:40   ` Ingo Molnar
  36 siblings, 1 reply; 54+ messages in thread
From: Rik van Riel @ 2012-08-22 19:26 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On 08/22/2012 10:58 AM, Andrea Arcangeli wrote:
> Hello everyone,
>
> Before the Kernel Summit, I think it's good idea to post a new
> AutoNUMA24 and to go through a new review cycle. The last review cycle
> has been fundamental in improving the patchset. Thanks!

Thanks for improving the code and incorporating all our
feedback. The AutoNUMA codebase is now in a state where
I can live with it.

I hope the code will be acceptable to others, too.

> The objective of AutoNUMA is to be able to perform as close as
> possible to (and sometime faster than) the NUMA hard CPU/memory
> bindings setups, without requiring the administrator to manually setup
> any NUMA hard bind.

It is a difficult problem, but the performance numbers
I have seen before (with older versions) seem to suggest
that AutoNUMA is accomplishing the goal.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 19/36] autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection
  2012-08-22 14:59 ` [PATCH 19/36] autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection Andrea Arcangeli
@ 2012-08-22 20:19   ` Andi Kleen
  2012-08-22 21:22     ` Hugh Dickins
  2012-08-22 21:24     ` Andrea Arcangeli
  0 siblings, 2 replies; 54+ messages in thread
From: Andi Kleen @ 2012-08-22 20:19 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel, linux-mm

Andrea Arcangeli <aarcange@redhat.com> writes:

> +/*
> + * In this function we build a temporal CPU_node<->page relation by
> + * using a two-stage autonuma_last_nid filter to remove short/unlikely
> + * relations.
> + *
> + * Using P(p) ~ n_p / n_t as per frequentest probability, we can
> + * equate a node's CPU usage of a particular page (n_p) per total
> + * usage of this page (n_t) (in a given time-span) to a probability.
> + *
> + * Our periodic faults will then sample this probability and getting
> + * the same result twice in a row, given these samples are fully
> + * independent, is then given by P(n)^2, provided our sample period
> + * is sufficiently short compared to the usage pattern.
> + *
> + * This quadric squishes small probabilities, making it less likely
> + * we act on an unlikely CPU_node<->page relation.
> + */

The code does not seem to do what the comment describes.

> +static inline bool last_nid_set(struct page *page, int this_nid)
> +{
> +	bool ret = true;
> +	int autonuma_last_nid = ACCESS_ONCE(page->autonuma_last_nid);
> +	VM_BUG_ON(this_nid < 0);
> +	VM_BUG_ON(this_nid >= MAX_NUMNODES);
> +	if (autonuma_last_nid >= 0 && autonuma_last_nid != this_nid) {
> +		int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
> +		if (migrate_nid >= 0)
> +			__autonuma_migrate_page_remove(page);
> +		ret = false;
> +	}
> +	if (autonuma_last_nid != this_nid)
> +		ACCESS_ONCE(page->autonuma_last_nid) = this_nid;
> +	return ret;
> +}
> +
> +		/*
> +		 * Take the lock with irqs disabled to avoid a lock
> +		 * inversion with the lru_lock. The lru_lock is taken
> +		 * before the autonuma_migrate_lock in
> +		 * split_huge_page. If we didn't disable irqs, the
> +		 * lru_lock could be taken by interrupts after we have
> +		 * obtained the autonuma_migrate_lock here.
> +		 */

Which interrupt code takes the lru_lock? That sounds like a bug.


-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 19/36] autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection
  2012-08-22 20:19   ` Andi Kleen
@ 2012-08-22 21:22     ` Hugh Dickins
  2012-08-22 21:24     ` Andrea Arcangeli
  1 sibling, 0 replies; 54+ messages in thread
From: Hugh Dickins @ 2012-08-22 21:22 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andrea Arcangeli, linux-kernel, linux-mm

On Wed, 22 Aug 2012, Andi Kleen wrote:
> Andrea Arcangeli <aarcange@redhat.com> writes:
> > +		/*
> > +		 * Take the lock with irqs disabled to avoid a lock
> > +		 * inversion with the lru_lock. The lru_lock is taken
> > +		 * before the autonuma_migrate_lock in
> > +		 * split_huge_page. If we didn't disable irqs, the
> > +		 * lru_lock could be taken by interrupts after we have
> > +		 * obtained the autonuma_migrate_lock here.
> > +		 */
> 
> Which interrupt code takes the lru_lock? That sounds like a bug.

Not a bug: the clearest example is end_page_writeback() calling
rotate_reclaimable_page(); but I think once you probe deeper, you
find some other mm/swap.c pagevec operations which may get called
from interrupt, and end up freeing unrelated PageLRU pages.

Hugh

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 19/36] autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection
  2012-08-22 20:19   ` Andi Kleen
  2012-08-22 21:22     ` Hugh Dickins
@ 2012-08-22 21:24     ` Andrea Arcangeli
  2012-08-22 22:37       ` Andi Kleen
  1 sibling, 1 reply; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 21:24 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm

Hi Andi,

On Wed, Aug 22, 2012 at 01:19:04PM -0700, Andi Kleen wrote:
> Andrea Arcangeli <aarcange@redhat.com> writes:
> 
> > +/*
> > + * In this function we build a temporal CPU_node<->page relation by
> > + * using a two-stage autonuma_last_nid filter to remove short/unlikely
> > + * relations.
> > + *
> > + * Using P(p) ~ n_p / n_t as per frequentest probability, we can
> > + * equate a node's CPU usage of a particular page (n_p) per total
> > + * usage of this page (n_t) (in a given time-span) to a probability.
> > + *
> > + * Our periodic faults will then sample this probability and getting
> > + * the same result twice in a row, given these samples are fully
> > + * independent, is then given by P(n)^2, provided our sample period
> > + * is sufficiently short compared to the usage pattern.
> > + *
> > + * This quadric squishes small probabilities, making it less likely
> > + * we act on an unlikely CPU_node<->page relation.
> > + */
> 
> The code does not seem to do what the comment describes.

This comment seems quite accurate to me (btw I took it from the
sched-numa rewrite with minor changes).

By requiring a confirmation through periodic samples that the memory
access happens twice in a row from the same node, we increase the
probability of doing worthwhile memory migrations and we diminish the
risk of worthless migrations as a result of false relations/sharing.
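
As a worked example: if a remote node accounts for only 30% of the
accesses to a page, a single sample would migrate it 30% of the time,
while requiring the same node on two consecutive, roughly independent
samples cuts that to about 0.3^2 = 9%; a node doing 90% of the
accesses still passes about 81% of the time.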

> > +static inline bool last_nid_set(struct page *page, int this_nid)
> > +{
> > +	bool ret = true;
> > +	int autonuma_last_nid = ACCESS_ONCE(page->autonuma_last_nid);
> > +	VM_BUG_ON(this_nid < 0);
> > +	VM_BUG_ON(this_nid >= MAX_NUMNODES);
> > +	if (autonuma_last_nid >= 0 && autonuma_last_nid != this_nid) {
> > +		int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
> > +		if (migrate_nid >= 0)
> > +			__autonuma_migrate_page_remove(page);
> > +		ret = false;
> > +	}
> > +	if (autonuma_last_nid != this_nid)
> > +		ACCESS_ONCE(page->autonuma_last_nid) = this_nid;
> > +	return ret;
> > +}
> > +
> > +		/*
> > +		 * Take the lock with irqs disabled to avoid a lock
> > +		 * inversion with the lru_lock. The lru_lock is taken
> > +		 * before the autonuma_migrate_lock in
> > +		 * split_huge_page. If we didn't disable irqs, the
> > +		 * lru_lock could be taken by interrupts after we have
> > +		 * obtained the autonuma_migrate_lock here.
> > +		 */
> 
> Which interrupt code takes the lru_lock? That sounds like a bug.

Disabling irqs around lru_lock was an optimization to avoid increasing
the hold time of the lock when all critical sections were short after
the isolation code. Now it's used to rotate lrus at I/O completion too.

end_page_writeback -> rotate_reclaimable_page -> pagevec_move_tail

=========================================================
[ INFO: possible irq lock inversion dependency detected ]
3.6.0-rc2+ #46 Not tainted
---------------------------------------------------------
numa01/7725 just changed the state of lock:
 (&(&zone->lru_lock)->rlock){..-.-.}, at: [<ffffffff81106e5e>] pagevec_lru_move_fn+0x9e/0x110
but this lock took another, SOFTIRQ-unsafe lock in the past:
 (&(&pgdat->autonuma_lock)->rlock){+.+.-.}

and interrupts could create inverse lock ordering between them.


other info that might help us debug this:
 Possible interrupt unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&(&pgdat->autonuma_lock)->rlock);
                               local_irq_disable();
                               lock(&(&zone->lru_lock)->rlock);
                               lock(&(&pgdat->autonuma_lock)->rlock);
  <Interrupt>
    lock(&(&zone->lru_lock)->rlock);

 *** DEADLOCK ***

2 locks held by numa01/7725:
 #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff815527f1>] do_page_fault+0x121/0x520
 #1:  (rcu_read_lock){.+.+..}, at: [<ffffffff81153ee8>] __mem_cgroup_try_charge+0x348/0xbb0

the shortest dependencies between 2nd lock and 1st lock:
 -> (&(&pgdat->autonuma_lock)->rlock){+.+.-.} ops: 7031259 {
    HARDIRQ-ON-W at:
                      [<ffffffff810b9e6f>] mark_held_locks+0x5f/0x140
                      [<ffffffff810ba002>] trace_hardirqs_on_caller+0xb2/0x1a0
                      [<ffffffff810ba0fd>] trace_hardirqs_on+0xd/0x10
                      [<ffffffff8113de49>] knuma_migrated+0x259/0xab0
                      [<ffffffff8107fdd6>] kthread+0xb6/0xc0
                      [<ffffffff81557204>] kernel_thread_helper+0x4/0x10
    SOFTIRQ-ON-W at:
                      [<ffffffff810b9e6f>] mark_held_locks+0x5f/0x140
                      [<ffffffff810ba05d>] trace_hardirqs_on_caller+0x10d/0x1a0
                      [<ffffffff810ba0fd>] trace_hardirqs_on+0xd/0x10
                      [<ffffffff8113de49>] knuma_migrated+0x259/0xab0
                      [<ffffffff8107fdd6>] kthread+0xb6/0xc0
                      [<ffffffff81557204>] kernel_thread_helper+0x4/0x10
    IN-RECLAIM_FS-W at:
                         [<ffffffff810b78f4>] __lock_acquire+0x5c4/0x1dd0
                         [<ffffffff810b9682>] lock_acquire+0x62/0x80
                         [<ffffffff8154de9b>] _raw_spin_lock+0x3b/0x50
                         [<ffffffff8113dafd>] __autonuma_migrate_page_remove+0xdd/0x1d0
                         [<ffffffff81101483>] free_pages_prepare+0xe3/0x190
                         [<ffffffff811016b4>] free_hot_cold_page+0x44/0x1d0
                         [<ffffffff81101a6e>] free_hot_cold_page_list+0x3e/0x60
                         [<ffffffff81106d81>] release_pages+0x1f1/0x230
                         [<ffffffff81106eb0>] pagevec_lru_move_fn+0xf0/0x110
                         [<ffffffff81106ee7>] __pagevec_lru_add+0x17/0x20
                         [<ffffffff8110780b>] lru_add_drain_cpu+0x9b/0x130
                         [<ffffffff81107969>] lru_add_drain+0x29/0x40
                         [<ffffffff8110add5>] shrink_active_list+0x65/0x340
                         [<ffffffff8110c483>] balance_pgdat+0x323/0x890
                         [<ffffffff8110cbb3>] kswapd+0x1c3/0x340
                         [<ffffffff8107fdd6>] kthread+0xb6/0xc0
                         [<ffffffff81557204>] kernel_thread_helper+0x4/0x10
    INITIAL USE at:
                     [<ffffffff810b762f>] __lock_acquire+0x2ff/0x1dd0
                     [<ffffffff810b9682>] lock_acquire+0x62/0x80
                     [<ffffffff8154de9b>] _raw_spin_lock+0x3b/0x50
                     [<ffffffff8113e95b>] numa_hinting_fault+0x2bb/0x5b0
                     [<ffffffff8113ee9d>] __pmd_numa_fixup+0x1cd/0x200
                     [<ffffffff8111de08>] handle_mm_fault+0x2c8/0x380
                     [<ffffffff8155285e>] do_page_fault+0x18e/0x520
                     [<ffffffff8154ed85>] page_fault+0x25/0x30
                     [<ffffffff81172d7c>] sys_poll+0x6c/0x100
                     [<ffffffff815560b9>] system_call_fastpath+0x16/0x1b
  }
  ... key      at: [<ffffffff8220b968>] __key.16051+0x0/0x18
  ... acquired at:
   [<ffffffff810b9682>] lock_acquire+0x62/0x80
   [<ffffffff8154de9b>] _raw_spin_lock+0x3b/0x50
   [<ffffffff8113d929>] autonuma_migrate_split_huge_page+0x119/0x210
   [<ffffffff8114c897>] split_huge_page+0x267/0x7f0
   [<ffffffff8113df52>] knuma_migrated+0x362/0xab0
   [<ffffffff8107fdd6>] kthread+0xb6/0xc0
   [<ffffffff81557204>] kernel_thread_helper+0x4/0x10

-> (&(&zone->lru_lock)->rlock){..-.-.} ops: 10130605 {
   IN-SOFTIRQ-W at:
                    [<ffffffff810b7a95>] __lock_acquire+0x765/0x1dd0
                    [<ffffffff810b9682>] lock_acquire+0x62/0x80
                    [<ffffffff8154dfe3>] _raw_spin_lock_irqsave+0x53/0x70
                    [<ffffffff81106e5e>] pagevec_lru_move_fn+0x9e/0x110
                    [<ffffffff81106faf>] pagevec_move_tail+0x1f/0x30
                    [<ffffffff811074fd>] rotate_reclaimable_page+0xdd/0x100
                    [<ffffffff810f90ad>] end_page_writeback+0x4d/0x60
                    [<ffffffff8112f21b>] end_swap_bio_write+0x2b/0x80
                    [<ffffffff8118f9d8>] bio_endio+0x18/0x30
                    [<ffffffff8124970b>] req_bio_endio.clone.53+0x8b/0xd0
                    [<ffffffff81249840>] blk_update_request+0xf0/0x5a0
                    [<ffffffff81249d1f>] blk_update_bidi_request+0x2f/0x90
                    [<ffffffff81249daa>] blk_end_bidi_request+0x2a/0x80
                    [<ffffffff81249e3b>] blk_end_request+0xb/0x10
                    [<ffffffff81349e27>] scsi_io_completion+0x97/0x640
                    [<ffffffff8134238e>] scsi_finish_command+0xbe/0xf0
                    [<ffffffff81349c1f>] scsi_softirq_done+0x9f/0x130
                    [<ffffffff8124fee2>] blk_done_softirq+0x82/0xa0
                    [<ffffffff81064278>] __do_softirq+0xc8/0x180
                    [<ffffffff815572fc>] call_softirq+0x1c/0x30
                    [<ffffffff81004375>] do_softirq+0xa5/0xe0
                    [<ffffffff8106462e>] irq_exit+0x9e/0xc0
                    [<ffffffff8102330f>] smp_call_function_single_interrupt+0x2f/0x40
                    [<ffffffff81556d6f>] call_function_single_interrupt+0x6f/0x80
                    [<ffffffff8114ff6e>] mem_cgroup_from_task+0x4e/0xd0
                    [<ffffffff81153f5d>] __mem_cgroup_try_charge+0x3bd/0xbb0
                    [<ffffffff81154e54>] mem_cgroup_charge_common+0x64/0xc0
                    [<ffffffff811554c1>] mem_cgroup_newpage_charge+0x31/0x40
                    [<ffffffff8111d5fa>] handle_pte_fault+0x70a/0xa90
                    [<ffffffff8111dd93>] handle_mm_fault+0x253/0x380
                    [<ffffffff8155285e>] do_page_fault+0x18e/0x520
                    [<ffffffff8154ed85>] page_fault+0x25/0x30
   IN-RECLAIM_FS-W at:
                       [<ffffffff810b78f4>] __lock_acquire+0x5c4/0x1dd0
                       [<ffffffff810b9682>] lock_acquire+0x62/0x80
                       [<ffffffff8154dfe3>] _raw_spin_lock_irqsave+0x53/0x70
                       [<ffffffff81106e5e>] pagevec_lru_move_fn+0x9e/0x110
                       [<ffffffff81106ee7>] __pagevec_lru_add+0x17/0x20
                       [<ffffffff8110780b>] lru_add_drain_cpu+0x9b/0x130
                       [<ffffffff81107969>] lru_add_drain+0x29/0x40
                       [<ffffffff8110add5>] shrink_active_list+0x65/0x340
                       [<ffffffff8110c483>] balance_pgdat+0x323/0x890
                       [<ffffffff8110cbb3>] kswapd+0x1c3/0x340
                       [<ffffffff8107fdd6>] kthread+0xb6/0xc0
                       [<ffffffff81557204>] kernel_thread_helper+0x4/0x10
   INITIAL USE at:
                   [<ffffffff810b762f>] __lock_acquire+0x2ff/0x1dd0
                   [<ffffffff810b9682>] lock_acquire+0x62/0x80
                   [<ffffffff8154dfe3>] _raw_spin_lock_irqsave+0x53/0x70
                   [<ffffffff81106e5e>] pagevec_lru_move_fn+0x9e/0x110
                   [<ffffffff81106ee7>] __pagevec_lru_add+0x17/0x20
                   [<ffffffff8110780b>] lru_add_drain_cpu+0x9b/0x130
                   [<ffffffff81107969>] lru_add_drain+0x29/0x40
                   [<ffffffff81107991>] __pagevec_release+0x11/0x30
                   [<ffffffff81108454>] truncate_inode_pages_range+0x344/0x4b0
                   [<ffffffff81108640>] truncate_inode_pages+0x10/0x20
                   [<ffffffff811926da>] kill_bdev+0x2a/0x40
                   [<ffffffff81192aff>] __blkdev_put+0x6f/0x1d0
                   [<ffffffff81192cbb>] blkdev_put+0x5b/0x170
                   [<ffffffff81253cfa>] add_disk+0x41a/0x4a0
                   [<ffffffff81355290>] sd_probe_async+0x120/0x1d0
                   [<ffffffff8108800d>] async_run_entry_fn+0x7d/0x180
                   [<ffffffff810777ff>] process_one_work+0x19f/0x510
                   [<ffffffff8107a7e7>] worker_thread+0x1a7/0x4b0
                   [<ffffffff8107fdd6>] kthread+0xb6/0xc0
                   [<ffffffff81557204>] kernel_thread_helper+0x4/0x10
 }
 ... key      at: [<ffffffff822094c8>] __key.34621+0x0/0x8
 ... acquired at:
   [<ffffffff810b5fde>] check_usage_forwards+0x8e/0x110
   [<ffffffff810b6ed6>] mark_lock+0x1d6/0x630
   [<ffffffff810b7a95>] __lock_acquire+0x765/0x1dd0
   [<ffffffff810b9682>] lock_acquire+0x62/0x80
   [<ffffffff8154dfe3>] _raw_spin_lock_irqsave+0x53/0x70
   [<ffffffff81106e5e>] pagevec_lru_move_fn+0x9e/0x110
   [<ffffffff81106faf>] pagevec_move_tail+0x1f/0x30
   [<ffffffff811074fd>] rotate_reclaimable_page+0xdd/0x100
   [<ffffffff810f90ad>] end_page_writeback+0x4d/0x60
   [<ffffffff8112f21b>] end_swap_bio_write+0x2b/0x80
   [<ffffffff8118f9d8>] bio_endio+0x18/0x30
   [<ffffffff8124970b>] req_bio_endio.clone.53+0x8b/0xd0
   [<ffffffff81249840>] blk_update_request+0xf0/0x5a0
   [<ffffffff81249d1f>] blk_update_bidi_request+0x2f/0x90
   [<ffffffff81249daa>] blk_end_bidi_request+0x2a/0x80
   [<ffffffff81249e3b>] blk_end_request+0xb/0x10
   [<ffffffff81349e27>] scsi_io_completion+0x97/0x640
   [<ffffffff8134238e>] scsi_finish_command+0xbe/0xf0
   [<ffffffff81349c1f>] scsi_softirq_done+0x9f/0x130
   [<ffffffff8124fee2>] blk_done_softirq+0x82/0xa0
   [<ffffffff81064278>] __do_softirq+0xc8/0x180
   [<ffffffff815572fc>] call_softirq+0x1c/0x30
   [<ffffffff81004375>] do_softirq+0xa5/0xe0
   [<ffffffff8106462e>] irq_exit+0x9e/0xc0
   [<ffffffff8102330f>] smp_call_function_single_interrupt+0x2f/0x40
   [<ffffffff81556d6f>] call_function_single_interrupt+0x6f/0x80
   [<ffffffff8114ff6e>] mem_cgroup_from_task+0x4e/0xd0
   [<ffffffff81153f5d>] __mem_cgroup_try_charge+0x3bd/0xbb0
   [<ffffffff81154e54>] mem_cgroup_charge_common+0x64/0xc0
   [<ffffffff811554c1>] mem_cgroup_newpage_charge+0x31/0x40
   [<ffffffff8111d5fa>] handle_pte_fault+0x70a/0xa90
   [<ffffffff8111dd93>] handle_mm_fault+0x253/0x380
   [<ffffffff8155285e>] do_page_fault+0x18e/0x520
   [<ffffffff8154ed85>] page_fault+0x25/0x30


stack backtrace:
Pid: 7725, comm: numa01 Not tainted 3.6.0-rc2+ #46
Call Trace:
 <IRQ>  [<ffffffff810b5f06>] print_irq_inversion_bug+0x1c6/0x210
 [<ffffffff810b5f50>] ? print_irq_inversion_bug+0x210/0x210
 [<ffffffff810b5fde>] check_usage_forwards+0x8e/0x110
 [<ffffffff810b6ed6>] mark_lock+0x1d6/0x630
 [<ffffffff810b7a95>] __lock_acquire+0x765/0x1dd0
 [<ffffffff810fb790>] ? mempool_alloc_slab+0x10/0x20
 [<ffffffff811465cb>] ? kmem_cache_alloc+0xbb/0x1b0
 [<ffffffff810b9682>] lock_acquire+0x62/0x80
 [<ffffffff81106e5e>] ? pagevec_lru_move_fn+0x9e/0x110
 [<ffffffff8154dfe3>] _raw_spin_lock_irqsave+0x53/0x70
 [<ffffffff81106e5e>] ? pagevec_lru_move_fn+0x9e/0x110
 [<ffffffff81106e5e>] pagevec_lru_move_fn+0x9e/0x110
 [<ffffffff81106400>] ? __pagevec_lru_add_fn+0x130/0x130
 [<ffffffff81106faf>] pagevec_move_tail+0x1f/0x30
 [<ffffffff811074fd>] rotate_reclaimable_page+0xdd/0x100
 [<ffffffff810f90ad>] end_page_writeback+0x4d/0x60
 [<ffffffff81349592>] ? scsi_request_fn+0xa2/0x4b0
 [<ffffffff8112f21b>] end_swap_bio_write+0x2b/0x80
 [<ffffffff8118f9d8>] bio_endio+0x18/0x30
 [<ffffffff8124970b>] req_bio_endio.clone.53+0x8b/0xd0
 [<ffffffff81249840>] blk_update_request+0xf0/0x5a0
 [<ffffffff81249a7a>] ? blk_update_request+0x32a/0x5a0
 [<ffffffff81249d1f>] blk_update_bidi_request+0x2f/0x90
 [<ffffffff81249daa>] blk_end_bidi_request+0x2a/0x80
 [<ffffffff81249e3b>] blk_end_request+0xb/0x10
 [<ffffffff81349e27>] scsi_io_completion+0x97/0x640
 [<ffffffff8134238e>] scsi_finish_command+0xbe/0xf0
 [<ffffffff81349c1f>] scsi_softirq_done+0x9f/0x130
 [<ffffffff8124fee2>] blk_done_softirq+0x82/0xa0
 [<ffffffff81064278>] __do_softirq+0xc8/0x180
 [<ffffffff810b4b5d>] ? trace_hardirqs_off+0xd/0x10
 [<ffffffff815572fc>] call_softirq+0x1c/0x30
 [<ffffffff81004375>] do_softirq+0xa5/0xe0
 [<ffffffff8106462e>] irq_exit+0x9e/0xc0
 [<ffffffff8102330f>] smp_call_function_single_interrupt+0x2f/0x40
 [<ffffffff81556d6f>] call_function_single_interrupt+0x6f/0x80
 <EOI>  [<ffffffff8107c699>] ? debug_lockdep_rcu_enabled+0x29/0x40
 [<ffffffff8114ff6e>] mem_cgroup_from_task+0x4e/0xd0
 [<ffffffff81153f5d>] __mem_cgroup_try_charge+0x3bd/0xbb0
 [<ffffffff81153ee8>] ? __mem_cgroup_try_charge+0x348/0xbb0
 [<ffffffff81154e54>] mem_cgroup_charge_common+0x64/0xc0
 [<ffffffff811554c1>] mem_cgroup_newpage_charge+0x31/0x40
 [<ffffffff8111d5fa>] handle_pte_fault+0x70a/0xa90
 [<ffffffff81101875>] ? __free_pages+0x35/0x40
 [<ffffffff8111dd93>] handle_mm_fault+0x253/0x380
 [<ffffffff8155285e>] do_page_fault+0x18e/0x520
 [<ffffffff812693de>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff810dff0f>] ? rcu_irq_exit+0x7f/0xd0
 [<ffffffff8154eb70>] ? retint_restore_args+0x13/0x13
 [<ffffffff8126941d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
 [<ffffffff8154ed85>] page_fault+0x25/0x30

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 00/36] AutoNUMA24
  2012-08-22 19:26 ` [PATCH 00/36] AutoNUMA24 Rik van Riel
@ 2012-08-22 21:40   ` Ingo Molnar
  2012-08-22 22:19     ` Andrea Arcangeli
  0 siblings, 1 reply; 54+ messages in thread
From: Ingo Molnar @ 2012-08-22 21:40 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, Hillf Danton,
	Dan Smith, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt


* Rik van Riel <riel@redhat.com> wrote:

> On 08/22/2012 10:58 AM, Andrea Arcangeli wrote:
> >Hello everyone,
> >
> >Before the Kernel Summit, I think it's good idea to post a new
> >AutoNUMA24 and to go through a new review cycle. The last review cycle
> >has been fundamental in improving the patchset. Thanks!
> 
> Thanks for improving the code and incorporating all our 
> feedback. The AutoNUMA codebase is now in a state where I can 
> live with it.
> 
> I hope the code will be acceptable to others, too.

Lots of scheduler changes. Has all of peterz's review feedback 
been addressed?

Hm, he isn't even Cc:-ed, how is that supposed to work?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 33/36] autonuma: powerpc port
  2012-08-22 14:59 ` [PATCH 33/36] autonuma: powerpc port Andrea Arcangeli
@ 2012-08-22 22:01   ` Benjamin Herrenschmidt
  2012-08-22 22:35     ` Andrea Arcangeli
  2012-08-22 22:56     ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 54+ messages in thread
From: Benjamin Herrenschmidt @ 2012-08-22 22:01 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Andrea Arcangeli

On Wed, 2012-08-22 at 16:59 +0200, Andrea Arcangeli wrote:
> From: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
> 
>     * PMD flagging is not required in powerpc since large pages
>       are tracked in ptes.
>     * Yet to be tested with large pages
>     * This is an initial patch that partially works
>     * knuma_scand and numa hinting page faults work
>     * Page migration is yet to be observed/verified
> 
> Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

I don't like this.

---
>  arch/powerpc/include/asm/pgtable.h        |   48 ++++++++++++++++++++++++++++-
>  arch/powerpc/include/asm/pte-hash64-64k.h |    4 ++-
>  arch/powerpc/mm/numa.c                    |    3 +-
>  mm/autonuma.c                             |    2 +-
>  4 files changed, 53 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
> index 2e0e411..5f03079 100644
> --- a/arch/powerpc/include/asm/pgtable.h
> +++ b/arch/powerpc/include/asm/pgtable.h
> @@ -33,10 +33,56 @@ static inline int pte_dirty(pte_t pte)		{ return pte_val(pte) & _PAGE_DIRTY; }
>  static inline int pte_young(pte_t pte)		{ return pte_val(pte) & _PAGE_ACCESSED; }
>  static inline int pte_file(pte_t pte)		{ return pte_val(pte) & _PAGE_FILE; }
>  static inline int pte_special(pte_t pte)	{ return pte_val(pte) & _PAGE_SPECIAL; }
> -static inline int pte_present(pte_t pte)	{ return pte_val(pte) & _PAGE_PRESENT; }
> +static inline int pte_present(pte_t pte)	{ return pte_val(pte) &
> +							(_PAGE_PRESENT|_PAGE_NUMA_PTE); }

Is this absolutely necessary ? (testing two bits). It somewhat changes
the semantics of "pte_present" which I don't really like.

>  static inline int pte_none(pte_t pte)		{ return (pte_val(pte) & ~_PTE_NONE_MASK) == 0; }
>  static inline pgprot_t pte_pgprot(pte_t pte)	{ return __pgprot(pte_val(pte) & PAGE_PROT_BITS); }
>  
> +#ifdef CONFIG_AUTONUMA
> +static inline int pte_numa(pte_t pte)
> +{
> +       return (pte_val(pte) &
> +               (_PAGE_NUMA_PTE|_PAGE_PRESENT)) == _PAGE_NUMA_PTE;
> +}
> +
> +#endif

Why the ifdef and not anywhere else ?

> +static inline pte_t pte_mknonnuma(pte_t pte)
> +{
> +       pte_val(pte) &= ~_PAGE_NUMA_PTE;
> +       pte_val(pte) |= (_PAGE_PRESENT|_PAGE_ACCESSED);
> +
> +       return pte;
> +}
> +
> +static inline pte_t pte_mknuma(pte_t pte)
> +{
> +       pte_val(pte) |= _PAGE_NUMA_PTE;
> +       pte_val(pte) &= ~_PAGE_PRESENT;
> +       return pte;
> +}
> +
> +static inline int pmd_numa(pmd_t pmd)
> +{
> +       /* PMD tracking not implemented */
> +       return 0;
> +}
> +
> +static inline pmd_t pmd_mknonnuma(pmd_t pmd)
> +{
> +	BUG();
> +	return pmd;
> +}
> +
> +static inline pmd_t pmd_mknuma(pmd_t pmd)
> +{
> +	BUG();
> +	return pmd;
> +}
> +
> +/* No pmd flags on powerpc */
> +#define set_pmd_at(mm, addr, pmdp, pmd)  do { } while (0)
> +
>  /* Conversion functions: convert a page and protection to a page entry,
>   * and a page entry and page directory to the page they refer to.
>   *
> diff --git a/arch/powerpc/include/asm/pte-hash64-64k.h b/arch/powerpc/include/asm/pte-hash64-64k.h
> index 59247e8..f7e1468 100644
> --- a/arch/powerpc/include/asm/pte-hash64-64k.h
> +++ b/arch/powerpc/include/asm/pte-hash64-64k.h
> @@ -7,6 +7,8 @@
>  #define _PAGE_COMBO	0x10000000 /* this is a combo 4k page */
>  #define _PAGE_4K_PFN	0x20000000 /* PFN is for a single 4k page */
>  
> +#define _PAGE_NUMA_PTE 0x40000000 /* Adjust PTE_RPN_SHIFT below */
> +
>  /* For 64K page, we don't have a separate _PAGE_HASHPTE bit. Instead,
>   * we set that to be the whole sub-bits mask. The C code will only
>   * test this, so a multi-bit mask will work. For combo pages, this
> @@ -36,7 +38,7 @@
>   * That gives us a max RPN of 34 bits, which means a max of 50 bits
>   * of addressable physical space, or 46 bits for the special 4k PFNs.
>   */
> -#define PTE_RPN_SHIFT	(30)
> +#define PTE_RPN_SHIFT	(31)

I'm concerned. We are already running short on RPN bits. We can't spare
more. If you absolutely need a PTE bit, we'll need to explore ways to
free some, but just reducing the RPN isn't an option.

Think of what happens if PTE_4K_PFN is set...

Also you conveniently avoided all the other pte-*.h variants meaning you
broke the build for everything except ppc64 with 64k pages.

>  #ifndef __ASSEMBLY__
>  
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 39b1597..80af41e 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -22,6 +22,7 @@
>  #include <linux/pfn.h>
>  #include <linux/cpuset.h>
>  #include <linux/node.h>
> +#include <linux/page_autonuma.h>
>  #include <asm/sparsemem.h>
>  #include <asm/prom.h>
>  #include <asm/smp.h>
> @@ -1045,7 +1046,7 @@ void __init do_init_bootmem(void)
>  		 * all reserved areas marked.
>  		 */
>  		NODE_DATA(nid) = careful_zallocation(nid,
> -					sizeof(struct pglist_data),
> +					autonuma_pglist_data_size(),
>  					SMP_CACHE_BYTES, end_pfn);
>  
>    		dbg("node %d\n", nid);
> diff --git a/mm/autonuma.c b/mm/autonuma.c
> index ada6c57..a4da3f3 100644
> --- a/mm/autonuma.c
> +++ b/mm/autonuma.c
> @@ -25,7 +25,7 @@ unsigned long autonuma_flags __read_mostly =
>  #ifdef CONFIG_AUTONUMA_DEFAULT_ENABLED
>  	|(1<<AUTONUMA_ENABLED_FLAG)
>  #endif
> -	|(1<<AUTONUMA_SCAN_PMD_FLAG);
> +	|(0<<AUTONUMA_SCAN_PMD_FLAG);

That changes the default across all architectures, is that ok vs.
Andrea ?

Cheers,
Ben.

>  static DEFINE_MUTEX(knumad_mm_mutex);
>  



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 00/36] AutoNUMA24
  2012-08-22 21:40   ` Ingo Molnar
@ 2012-08-22 22:19     ` Andrea Arcangeli
  2012-08-23  8:42       ` Ingo Molnar
  0 siblings, 1 reply; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 22:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rik van Riel, linux-kernel, linux-mm, Hillf Danton, Dan Smith,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt

On Wed, Aug 22, 2012 at 11:40:48PM +0200, Ingo Molnar wrote:
> 
> * Rik van Riel <riel@redhat.com> wrote:
> 
> > On 08/22/2012 10:58 AM, Andrea Arcangeli wrote:
> > >Hello everyone,
> > >
> > >Before the Kernel Summit, I think it's good idea to post a new
> > >AutoNUMA24 and to go through a new review cycle. The last review cycle
> > >has been fundamental in improving the patchset. Thanks!
> > 
> > Thanks for improving the code and incorporating all our 
> > feedback. The AutoNUMA codebase is now in a state where I can 
> > live with it.
> > 
> > I hope the code will be acceptable to others, too.
> 
> Lots of scheduler changes. Has all of peterz's review feedback 
> been addressed?

git diff --stat origin kernel/sched/
 kernel/sched/Makefile |    1 +
 kernel/sched/core.c   |    1 +
 kernel/sched/fair.c   |   86 ++++++-
 kernel/sched/numa.c   |  604 +++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h  |   19 ++
 5 files changed, 699 insertions(+), 12 deletions(-)

Lots of scheduler changes only if CONFIG_AUTONUMA=y. If
CONFIG_AUTONUMA=n it's just 107 lines of scheduler changes (numa.c
won't get built in that case).

> Hm, he isn't even Cc:-ed, how is that supposed to work?

I separately forwarded him the announcement email because I wanted to
add a few more (minor) details for him. Of course Peter's review is
fundamental and appreciated and already helped to make the code a lot
better.

His previous comments should have been addressed, the documentation of
sched_autonuma_balance has been rewritten from scratch,
PF_THREAD_BOUND is gone, etc... It's possible we'll have to go through
more rewrites of the docs if this still isn't good enough. I don't
know yet. This is what the review is about after all :).

numa.c is self contained, and I see that as a plus. First, it's easy
to nuke AutoNUMA by just deleting the .[ch] files and fixing up the
resulting build errors, in case a better algorithm emerges in the
future. Second, it's trivial to prove those 107 lines won't regress
CFS when CONFIG_AUTONUMA=n.

About the CFS integration: sched_autonuma_balance() is simply yet
another idle active load balancing method.

The idle active load balancing that, with hyperthreading, takes care
of distributing threads so that all idle cores get filled, in
principle works identically to AutoNUMA. It also works side by side
with CFS and moves tasks under CFS control (possibly without CFS
noticing) to optimize for HT. numa.c in principle does the exact same
thing (and it also calls the very same stop_one_cpu_nowait to do the
migrations), but it optimizes for NUMA instead of HT.
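
As a rough, user-space illustration of the kind of placement decision
involved (the data, names and the simplistic "pick the node with the
most hinting faults" rule below are invented for illustration; they
are not the actual kernel/sched/numa.c algorithm):

#include <stdio.h>

#define NR_NODES 2

struct task_stats {
	const char *name;
	unsigned long faults[NR_NODES];	/* NUMA hinting faults per node */
	int cur_node;
};

static int preferred_node(const struct task_stats *t)
{
	int best = 0;

	for (int nid = 1; nid < NR_NODES; nid++)
		if (t->faults[nid] > t->faults[best])
			best = nid;
	return best;
}

int main(void)
{
	struct task_stats tasks[] = {
		{ "taskA", { 900, 100 }, 1 },
		{ "taskB", {  50, 700 }, 1 },
	};

	for (int i = 0; i < 2; i++) {
		int want = preferred_node(&tasks[i]);

		if (want != tasks[i].cur_node)
			printf("%s: would migrate node %d -> %d\n",
			       tasks[i].name, tasks[i].cur_node, want);
		else
			printf("%s: stays on node %d\n",
			       tasks[i].name, want);
	}
	return 0;
}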

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 33/36] autonuma: powerpc port
  2012-08-22 22:01   ` Benjamin Herrenschmidt
@ 2012-08-22 22:35     ` Andrea Arcangeli
  2012-08-23  5:11       ` Benjamin Herrenschmidt
  2012-08-22 22:56     ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 22:35 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Vaidyanathan Srinivasan, linux-kernel, linux-mm, Hillf Danton,
	Dan Smith, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Rik van Riel, Johannes Weiner, Srivatsa Vaddagiri,
	Christoph Lameter, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris

On Thu, Aug 23, 2012 at 08:01:47AM +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2012-08-22 at 16:59 +0200, Andrea Arcangeli wrote:
> > diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
> > index 2e0e411..5f03079 100644
> > --- a/arch/powerpc/include/asm/pgtable.h
> > +++ b/arch/powerpc/include/asm/pgtable.h
> > @@ -33,10 +33,56 @@ static inline int pte_dirty(pte_t pte)		{ return pte_val(pte) & _PAGE_DIRTY; }
> >  static inline int pte_young(pte_t pte)		{ return pte_val(pte) & _PAGE_ACCESSED; }
> >  static inline int pte_file(pte_t pte)		{ return pte_val(pte) & _PAGE_FILE; }
> >  static inline int pte_special(pte_t pte)	{ return pte_val(pte) & _PAGE_SPECIAL; }
> > -static inline int pte_present(pte_t pte)	{ return pte_val(pte) & _PAGE_PRESENT; }
> > +static inline int pte_present(pte_t pte)	{ return pte_val(pte) &
> > +							(_PAGE_PRESENT|_PAGE_NUMA_PTE); }
> 
> Is this absolutely necessary ? (testing two bits). It somewhat changes
> the semantics of "pte_present" which I don't really like.

I'm actually surprised you don't already check for PROTNONE
there. Anyway yes this is necessary, the whole concept of NUMA hinting
page faults is to make the pte not present, and to set another bit (be
it a reserved bit or PROTNONE doesn't change anything in that
respect). But another bit replacing _PAGE_PRESENT must exist.

This change is zero cost at runtime, and 0x1 or 0x3 won't change a
thing for the CPU.
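
To make the encoding concrete, here is a standalone user-space model
of the pte transitions described above (the bit values are
illustrative, not the real powerpc or x86 layout):

#include <stdio.h>
#include <stdint.h>

typedef uint64_t pte_t;

#define _PAGE_PRESENT	0x1ULL
#define _PAGE_NUMA_PTE	0x40000000ULL	/* illustrative spare bit */

/* NUMA hinting pte: the spare bit is set and the present bit cleared */
static int pte_numa(pte_t pte)
{
	return (pte & (_PAGE_NUMA_PTE | _PAGE_PRESENT)) == _PAGE_NUMA_PTE;
}

static pte_t pte_mknuma(pte_t pte)
{
	pte |= _PAGE_NUMA_PTE;
	pte &= ~_PAGE_PRESENT;
	return pte;
}

static pte_t pte_mknonnuma(pte_t pte)
{
	pte &= ~_PAGE_NUMA_PTE;
	pte |= _PAGE_PRESENT;
	return pte;
}

int main(void)
{
	pte_t pte = _PAGE_PRESENT | 0x1000ULL;	/* a mapped pte */

	pte = pte_mknuma(pte);			/* knuma_scand marks it */
	printf("pte_numa: %d\n", pte_numa(pte));	/* 1: next access faults */
	pte = pte_mknonnuma(pte);		/* the hinting fault fixes it up */
	printf("pte_numa: %d\n", pte_numa(pte));	/* 0: normal access again */
	return 0;
}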

> >  static inline int pte_none(pte_t pte)		{ return (pte_val(pte) & ~_PTE_NONE_MASK) == 0; }
> >  static inline pgprot_t pte_pgprot(pte_t pte)	{ return __pgprot(pte_val(pte) & PAGE_PROT_BITS); }
> >  
> > +#ifdef CONFIG_AUTONUMA
> > +static inline int pte_numa(pte_t pte)
> > +{
> > +       return (pte_val(pte) &
> > +               (_PAGE_NUMA_PTE|_PAGE_PRESENT)) == _PAGE_NUMA_PTE;
> > +}
> > +
> > +#endif
> 
> Why the ifdef and not anywhere else ?

The generic version is implemented in asm-generic/pgtable.h to avoid dups.

> > diff --git a/arch/powerpc/include/asm/pte-hash64-64k.h b/arch/powerpc/include/asm/pte-hash64-64k.h
> > index 59247e8..f7e1468 100644
> > --- a/arch/powerpc/include/asm/pte-hash64-64k.h
> > +++ b/arch/powerpc/include/asm/pte-hash64-64k.h
> > @@ -7,6 +7,8 @@
> >  #define _PAGE_COMBO	0x10000000 /* this is a combo 4k page */
> >  #define _PAGE_4K_PFN	0x20000000 /* PFN is for a single 4k page */
> >  
> > +#define _PAGE_NUMA_PTE 0x40000000 /* Adjust PTE_RPN_SHIFT below */
> > +
> >  /* For 64K page, we don't have a separate _PAGE_HASHPTE bit. Instead,
> >   * we set that to be the whole sub-bits mask. The C code will only
> >   * test this, so a multi-bit mask will work. For combo pages, this
> > @@ -36,7 +38,7 @@
> >   * That gives us a max RPN of 34 bits, which means a max of 50 bits
> >   * of addressable physical space, or 46 bits for the special 4k PFNs.
> >   */
> > -#define PTE_RPN_SHIFT	(30)
> > +#define PTE_RPN_SHIFT	(31)
> 
> I'm concerned. We are already running short on RPN bits. We can't spare
> more. If you absolutely need a PTE bit, we'll need to explore ways to
> free some, but just reducing the RPN isn't an option.

No way to do it without a spare bit.

Note that this is now true for the sched-numa rewrite as well, because
it also introduced the NUMA hinting page faults of AutoNUMA (what it
does during the fault is different there, but the mechanism of firing
them and the need for a spare pte bit are identical).

But you must have a bit for protnone, don't you? You can implement it
with prot none; I can add the vma as a parameter to some function to
achieve it if you need. It may be a good idea to do that anyway, even
if there's no need on x86 at this point.

> Think of what happens if PTE_4K_PFN is set...

It may very well be broken when PTE_4K_PFN is set; I'm not familiar
with that. If that's the case we'll just add an option to prevent
AUTONUMA=y from being set if PTE_4K_PFN is set. Thanks for the info.

> Also you conveniently avoided all the other pte-*.h variants meaning you
> broke the build for everything except ppc64 with 64k pages.

This can only be enabled on PPC64 in Kconfig, so there's no problem
for ppc32.

> > diff --git a/mm/autonuma.c b/mm/autonuma.c
> > index ada6c57..a4da3f3 100644
> > --- a/mm/autonuma.c
> > +++ b/mm/autonuma.c
> > @@ -25,7 +25,7 @@ unsigned long autonuma_flags __read_mostly =
> >  #ifdef CONFIG_AUTONUMA_DEFAULT_ENABLED
> >  	|(1<<AUTONUMA_ENABLED_FLAG)
> >  #endif
> > -	|(1<<AUTONUMA_SCAN_PMD_FLAG);
> > +	|(0<<AUTONUMA_SCAN_PMD_FLAG);
> 
> That changes the default across all architectures, is that ok vs.
> Andrea ?

:) Indeed! But the next patch (34) undoes this hack. I just merged the
patch with "git am" and then introduced a proper way for the arch to
specify if the PMD scan is supported or not in an incremental
patch. Adding ppc64 support, and making the PMD scan mode arch
conditional are two separate things so I thought it was cleaner
keeping those in two separate patches but I can fold them if you
prefer.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 19/36] autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection
  2012-08-22 21:24     ` Andrea Arcangeli
@ 2012-08-22 22:37       ` Andi Kleen
  2012-08-22 22:46         ` Andrea Arcangeli
  0 siblings, 1 reply; 54+ messages in thread
From: Andi Kleen @ 2012-08-22 22:37 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andi Kleen, linux-kernel, linux-mm

> 
> This comment seems quite accurate to me (btw I took it from the
> sched-numa rewrite with minor changes).

I had expected it to describe the next function. If it's a strategic
overview maybe it should be somewhere else.

> Disabling irqs around lru_lock was an optimization to avoid increasing
> the hold time of the lock when all critical sections were short after
> the isolation code. Now it's used to rotate lrus at I/O completion too.

Thanks.
-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 19/36] autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection
  2012-08-22 22:37       ` Andi Kleen
@ 2012-08-22 22:46         ` Andrea Arcangeli
  0 siblings, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 22:46 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm

Hi Andi,

On Thu, Aug 23, 2012 at 12:37:33AM +0200, Andi Kleen wrote:
> > 
> > This comment seems quite accurate to me (btw I took it from the
> > sched-numa rewrite with minor changes).
> 
> I had expected it to describe the next function. If it's a strategic
> overview maybe it should be somewhere else.

Well the next function is last_nid_set, and that's where the last_nid
logic is implemented. The comment explains why last_nid statistically
provides a benefit so I thought it was an ok location, but I welcome
suggestions to move it somewhere else.
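
For illustration, a toy user-space sketch of the kind of statistical
filter being discussed; the rule below (only queue a page for
migration after two consecutive hinting faults from the same node) is
an assumed, simplified variant for illustration, not necessarily the
exact last_nid_set() logic:

#include <stdio.h>

struct toy_page {
	int last_nid;	/* node of the previous hinting fault, -1 if none */
};

/* return 1 if the page should be queued for migration to this_nid */
static int last_nid_filter(struct toy_page *page, int this_nid)
{
	int migrate = (page->last_nid == this_nid);

	page->last_nid = this_nid;
	return migrate;
}

int main(void)
{
	struct toy_page page = { .last_nid = -1 };
	int faults[] = { 1, 0, 1, 1 };	/* nodes of successive faults */

	for (int i = 0; i < 4; i++)
		printf("fault from node %d -> migrate=%d\n",
		       faults[i], last_nid_filter(&page, faults[i]));
	return 0;
}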

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 33/36] autonuma: powerpc port
  2012-08-22 22:01   ` Benjamin Herrenschmidt
  2012-08-22 22:35     ` Andrea Arcangeli
@ 2012-08-22 22:56     ` Benjamin Herrenschmidt
  2012-08-22 23:06       ` Andrea Arcangeli
  2012-08-23  4:15       ` Vaidyanathan Srinivasan
  1 sibling, 2 replies; 54+ messages in thread
From: Benjamin Herrenschmidt @ 2012-08-22 22:56 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Andrea Arcangeli

On Thu, 2012-08-23 at 08:01 +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2012-08-22 at 16:59 +0200, Andrea Arcangeli wrote:
> > From: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
> > 
> >     * PMD flagging is not required in powerpc since large pages
> >       are tracked in ptes.
> >     * Yet to be tested with large pages
> >     * This is an initial patch that partially works
> >     * knuma_scand and numa hinting page faults work
> >     * Page migration is yet to be observed/verified
> > 
> > Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> 
> I don't like this.

What I mean here is that it's fine as a proof of concept ;-) I don't
like it being in a series aimed at upstream...

We can try to flush out the issues, but as it is, the patch isn't
upstreamable imho.

As for finding PTE bits, I have a few ideas we need to discuss, but
nothing simple I'm afraid.

Cheers,
Ben.

> ---
> >  arch/powerpc/include/asm/pgtable.h        |   48 ++++++++++++++++++++++++++++-
> >  arch/powerpc/include/asm/pte-hash64-64k.h |    4 ++-
> >  arch/powerpc/mm/numa.c                    |    3 +-
> >  mm/autonuma.c                             |    2 +-
> >  4 files changed, 53 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
> > index 2e0e411..5f03079 100644
> > --- a/arch/powerpc/include/asm/pgtable.h
> > +++ b/arch/powerpc/include/asm/pgtable.h
> > @@ -33,10 +33,56 @@ static inline int pte_dirty(pte_t pte)		{ return pte_val(pte) & _PAGE_DIRTY; }
> >  static inline int pte_young(pte_t pte)		{ return pte_val(pte) & _PAGE_ACCESSED; }
> >  static inline int pte_file(pte_t pte)		{ return pte_val(pte) & _PAGE_FILE; }
> >  static inline int pte_special(pte_t pte)	{ return pte_val(pte) & _PAGE_SPECIAL; }
> > -static inline int pte_present(pte_t pte)	{ return pte_val(pte) & _PAGE_PRESENT; }
> > +static inline int pte_present(pte_t pte)	{ return pte_val(pte) &
> > +							(_PAGE_PRESENT|_PAGE_NUMA_PTE); }
> 
> Is this absolutely necessary ? (testing two bits). It somewhat changes
> the semantics of "pte_present" which I don't really like.
> 
> >  static inline int pte_none(pte_t pte)		{ return (pte_val(pte) & ~_PTE_NONE_MASK) == 0; }
> >  static inline pgprot_t pte_pgprot(pte_t pte)	{ return __pgprot(pte_val(pte) & PAGE_PROT_BITS); }
> >  
> > +#ifdef CONFIG_AUTONUMA
> > +static inline int pte_numa(pte_t pte)
> > +{
> > +       return (pte_val(pte) &
> > +               (_PAGE_NUMA_PTE|_PAGE_PRESENT)) == _PAGE_NUMA_PTE;
> > +}
> > +
> > +#endif
> 
> Why the ifdef and not anywhere else ?
> 
> > +static inline pte_t pte_mknonnuma(pte_t pte)
> > +{
> > +       pte_val(pte) &= ~_PAGE_NUMA_PTE;
> > +       pte_val(pte) |= (_PAGE_PRESENT|_PAGE_ACCESSED);
> > +
> > +       return pte;
> > +}
> > +
> > +static inline pte_t pte_mknuma(pte_t pte)
> > +{
> > +       pte_val(pte) |= _PAGE_NUMA_PTE;
> > +       pte_val(pte) &= ~_PAGE_PRESENT;
> > +       return pte;
> > +}
> > +
> > +static inline int pmd_numa(pmd_t pmd)
> > +{
> > +       /* PMD tracking not implemented */
> > +       return 0;
> > +}
> > +
> > +static inline pmd_t pmd_mknonnuma(pmd_t pmd)
> > +{
> > +	BUG();
> > +	return pmd;
> > +}
> > +
> > +static inline pmd_t pmd_mknuma(pmd_t pmd)
> > +{
> > +	BUG();
> > +	return pmd;
> > +}
> > +
> > +/* No pmd flags on powerpc */
> > +#define set_pmd_at(mm, addr, pmdp, pmd)  do { } while (0)
> > +
> >  /* Conversion functions: convert a page and protection to a page entry,
> >   * and a page entry and page directory to the page they refer to.
> >   *
> > diff --git a/arch/powerpc/include/asm/pte-hash64-64k.h b/arch/powerpc/include/asm/pte-hash64-64k.h
> > index 59247e8..f7e1468 100644
> > --- a/arch/powerpc/include/asm/pte-hash64-64k.h
> > +++ b/arch/powerpc/include/asm/pte-hash64-64k.h
> > @@ -7,6 +7,8 @@
> >  #define _PAGE_COMBO	0x10000000 /* this is a combo 4k page */
> >  #define _PAGE_4K_PFN	0x20000000 /* PFN is for a single 4k page */
> >  
> > +#define _PAGE_NUMA_PTE 0x40000000 /* Adjust PTE_RPN_SHIFT below */
> > +
> >  /* For 64K page, we don't have a separate _PAGE_HASHPTE bit. Instead,
> >   * we set that to be the whole sub-bits mask. The C code will only
> >   * test this, so a multi-bit mask will work. For combo pages, this
> > @@ -36,7 +38,7 @@
> >   * That gives us a max RPN of 34 bits, which means a max of 50 bits
> >   * of addressable physical space, or 46 bits for the special 4k PFNs.
> >   */
> > -#define PTE_RPN_SHIFT	(30)
> > +#define PTE_RPN_SHIFT	(31)
> 
> I'm concerned. We are already running short on RPN bits. We can't spare
> more. If you absolutely need a PTE bit, we'll need to explore ways to
> free some, but just reducing the RPN isn't an option.
> 
> Think of what happens if PTE_4K_PFN is set...
> 
> Also you conveniently avoided all the other pte-*.h variants meaning you
> broke the build for everything except ppc64 with 64k pages.
> 
> >  #ifndef __ASSEMBLY__
> >  
> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > index 39b1597..80af41e 100644
> > --- a/arch/powerpc/mm/numa.c
> > +++ b/arch/powerpc/mm/numa.c
> > @@ -22,6 +22,7 @@
> >  #include <linux/pfn.h>
> >  #include <linux/cpuset.h>
> >  #include <linux/node.h>
> > +#include <linux/page_autonuma.h>
> >  #include <asm/sparsemem.h>
> >  #include <asm/prom.h>
> >  #include <asm/smp.h>
> > @@ -1045,7 +1046,7 @@ void __init do_init_bootmem(void)
> >  		 * all reserved areas marked.
> >  		 */
> >  		NODE_DATA(nid) = careful_zallocation(nid,
> > -					sizeof(struct pglist_data),
> > +					autonuma_pglist_data_size(),
> >  					SMP_CACHE_BYTES, end_pfn);
> >  
> >    		dbg("node %d\n", nid);
> > diff --git a/mm/autonuma.c b/mm/autonuma.c
> > index ada6c57..a4da3f3 100644
> > --- a/mm/autonuma.c
> > +++ b/mm/autonuma.c
> > @@ -25,7 +25,7 @@ unsigned long autonuma_flags __read_mostly =
> >  #ifdef CONFIG_AUTONUMA_DEFAULT_ENABLED
> >  	|(1<<AUTONUMA_ENABLED_FLAG)
> >  #endif
> > -	|(1<<AUTONUMA_SCAN_PMD_FLAG);
> > +	|(0<<AUTONUMA_SCAN_PMD_FLAG);
> 
> That changes the default across all architectures, is that ok vs.
> Andrea ?
> 
> Cheers,
> Ben.
> 
> >  static DEFINE_MUTEX(knumad_mm_mutex);
> >  
> 



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 33/36] autonuma: powerpc port
  2012-08-22 22:56     ` Benjamin Herrenschmidt
@ 2012-08-22 23:06       ` Andrea Arcangeli
  2012-08-23  4:15       ` Vaidyanathan Srinivasan
  1 sibling, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-22 23:06 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Vaidyanathan Srinivasan, linux-kernel, linux-mm, Hillf Danton,
	Dan Smith, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Rik van Riel, Johannes Weiner, Srivatsa Vaddagiri,
	Christoph Lameter, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris

Hi Benjamin,

On Thu, Aug 23, 2012 at 08:56:34AM +1000, Benjamin Herrenschmidt wrote:
> What I mean here is that it's fine as a proof of concept ;-) I don't
> like it being in a series aimed at upstream...
> 
> We can try to flush out the issues, but as it is, the patch isn't
> upstreamable imho.

Well there's no real urgency to merge the ppc64 support immediately. I
will move it to the end of the patchset. Until the ppc64 patch is
applied, you simply cannot set AUTONUMA=y, but there's no regression
whatsoever.

> As for finding PTE bits, I have a few ideas we need to discuss, but
> nothing simple I'm afraid.

Sure we can discuss it.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 33/36] autonuma: powerpc port
  2012-08-22 22:56     ` Benjamin Herrenschmidt
  2012-08-22 23:06       ` Andrea Arcangeli
@ 2012-08-23  4:15       ` Vaidyanathan Srinivasan
  1 sibling, 0 replies; 54+ messages in thread
From: Vaidyanathan Srinivasan @ 2012-08-23  4:15 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-kernel, linux-mm, Hillf Danton, Dan Smith, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Paul Turner,
	Suresh Siddha, Mike Galbraith, Paul E. McKenney, Lai Jiangshan,
	Bharata B Rao, Lee Schermerhorn, Rik van Riel, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Andrea Arcangeli

* Benjamin Herrenschmidt <benh@kernel.crashing.org> [2012-08-23 08:56:34]:

> On Thu, 2012-08-23 at 08:01 +1000, Benjamin Herrenschmidt wrote:
> > On Wed, 2012-08-22 at 16:59 +0200, Andrea Arcangeli wrote:
> > > From: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
> > > 
> > >     * PMD flagging is not required in powerpc since large pages
> > >       are tracked in ptes.
> > >     * Yet to be tested with large pages
> > >     * This is an initial patch that partially works
> > >     * knuma_scand and numa hinting page faults work
> > >     * Page migration is yet to be observed/verified
> > > 
> > > Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
> > > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > 
> > I don't like this.
> 
> What I mean here is that it's fine as a proof of concept ;-) I don't
> like it being in a series aimed at upstream...

I agree.  My intent was to get the ppc64 discussion started and also
to see what it takes to bring autonuma to a new arch.

> We can try to flush out the issues, but as it is, the patch isn't
> upstreamable imho.

Yes.  The patch is still in the RFC phase and good only for
review/testing.

> As for finding PTE bits, I have a few ideas we need to discuss, but
> nothing simple I'm afraid.

Sure Ben, let's try them and find the best fit.

--Vaidy


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 33/36] autonuma: powerpc port
  2012-08-22 22:35     ` Andrea Arcangeli
@ 2012-08-23  5:11       ` Benjamin Herrenschmidt
  2012-08-23 15:23         ` Andrea Arcangeli
  2012-08-23 22:13         ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 54+ messages in thread
From: Benjamin Herrenschmidt @ 2012-08-23  5:11 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Vaidyanathan Srinivasan, linux-kernel, linux-mm, Hillf Danton,
	Dan Smith, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Rik van Riel, Johannes Weiner, Srivatsa Vaddagiri,
	Christoph Lameter, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Tony Breeds, Kumar Gala

On Thu, 2012-08-23 at 00:35 +0200, Andrea Arcangeli wrote:

> I'm actually surprised you don't already check for PROTNONE
> there. Anyway yes this is necessary, the whole concept of NUMA hinting
> page faults is to make the pte not present, and to set another bit (be
> it a reserved bit or PROTNONE doesn't change anything in that
> respect). But another bit replacing _PAGE_PRESENT must exist.
> 
> This change is zero cost at runtime, and 0x1 or 0x3 won't change a
> thing for the CPU.

We don't have PROTNONE on ppc, see below.

 .../...

> > I'm concerned. We are already running short on RPN bits. We can't spare
> > more. If you absolutely need a PTE bit, we'll need to explore ways to
> > free some, but just reducing the RPN isn't an option.
> 
> No way to do it without a spare bit.
> 
> Note that this is now true for the sched-numa rewrite as well, because
> it also introduced the NUMA hinting page faults of AutoNUMA (what it
> does during the fault is different there, but the mechanism of firing
> them and the need for a spare pte bit are identical).
> 
> But you must have a bit for protnone, don't you? You can implement it
> with prot none; I can add the vma as a parameter to some function to
> achieve it if you need. It may be a good idea to do that anyway, even
> if there's no need on x86 at this point.

So we don't do protnone, and now that you mention it, I think that means
that some of our embedded stuff is busted :-)

Basically PROT_NONE turns into _PAGE_PRESENT without _PAGE_USER for us.

On server, user accesses effectively use the user protection bits due to
the fact that the user segments are tagged as such. So the fact that
PROT_NONE -> !_PAGE_USER for us is essentially enough.

However, the embedded ppc situation is more interesting... and it looks
like it is indeed broken, meaning that a user can coerce the kernel into
accessing PROT_NONE on its behalf with copy_from_user & co (though read
only really).

Looks like the SW TLB handlers used on embedded should also check
whether the address is a user or kernel address, and enforce _PAGE_USER
in the former case. They might have done in the past, it's possible that
it's code we lost, but as it is, it's broken.

The case of HW loaded TLB embedded will need a different definition of
PAGE_NONE as well I suspect. Kumar, can you have a look ?

> > Think of what happens if PTE_4K_PFN is set...
> 
> It may very well be broken when PTE_4K_PFN is set; I'm not familiar
> with that. If that's the case we'll just add an option to prevent
> AUTONUMA=y from being set if PTE_4K_PFN is set. Thanks for the info.
> 
> > Also you conveniently avoided all the other pte-*.h variants meaning you
> > broke the build for everything except ppc64 with 64k pages.
> 
> This can only be enabled on PPC64 in Kconfig, so there's no problem
> for ppc32.

I wasn't especially thinking of ppc32... there's also hash64-4k or
embedded 64... Also pgtable.h is common, so all those added uses of
_PAGE_NUMA_PTE to static inline functions are going to break the build
unless _PAGE_NUMA_PTE is #defined to 0 when not used (we do that for a
bunch of bits in pte-common.h already).
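
(For reference, the "define the unused bit to 0" pattern in a tiny
standalone form; the value and helper below are illustrative only, not
the actual pte-common.h code:)

#include <stdio.h>
#include <stdint.h>

/* A config that implements the feature would provide a real bit, e.g.
 * #define _PAGE_NUMA_PTE 0x40000000ULL.  Without it, fall back to 0 so
 * the generic helpers still compile and fold away. */
#ifndef _PAGE_NUMA_PTE
#define _PAGE_NUMA_PTE 0ULL
#endif

static int pte_numa(uint64_t pte)
{
	return (pte & _PAGE_NUMA_PTE) != 0;	/* always 0 when the bit is 0 */
}

int main(void)
{
	printf("pte_numa(0x40000001) = %d\n", pte_numa(0x40000001ULL));
	return 0;
}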

> > > diff --git a/mm/autonuma.c b/mm/autonuma.c
> > > index ada6c57..a4da3f3 100644
> > > --- a/mm/autonuma.c
> > > +++ b/mm/autonuma.c
> > > @@ -25,7 +25,7 @@ unsigned long autonuma_flags __read_mostly =
> > >  #ifdef CONFIG_AUTONUMA_DEFAULT_ENABLED
> > >  	|(1<<AUTONUMA_ENABLED_FLAG)
> > >  #endif
> > > -	|(1<<AUTONUMA_SCAN_PMD_FLAG);
> > > +	|(0<<AUTONUMA_SCAN_PMD_FLAG);
> > 
> > That changes the default across all architectures, is that ok vs.
> > Andrea ?
> 
> :) Indeed! But the next patch (34) undoes this hack. I just merged the
> patch with "git am" and then introduced a proper way for the arch to
> specify if the PMD scan is supported or not in an incremental
> patch. Adding ppc64 support, and making the PMD scan mode arch
> conditional are two separate things so I thought it was cleaner
> keeping those in two separate patches but I can fold them if you
> prefer.

Ok.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 00/36] AutoNUMA24
  2012-08-22 22:19     ` Andrea Arcangeli
@ 2012-08-23  8:42       ` Ingo Molnar
  0 siblings, 0 replies; 54+ messages in thread
From: Ingo Molnar @ 2012-08-23  8:42 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, linux-kernel, linux-mm, Hillf Danton, Dan Smith,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Paul Turner, Suresh Siddha, Mike Galbraith, Paul E. McKenney,
	Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Johannes Weiner,
	Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
	Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
	Benjamin Herrenschmidt, Peter Zijlstra


* Andrea Arcangeli <aarcange@redhat.com> wrote:

> On Wed, Aug 22, 2012 at 11:40:48PM +0200, Ingo Molnar wrote:
> > 
> > * Rik van Riel <riel@redhat.com> wrote:
> > 
> > > On 08/22/2012 10:58 AM, Andrea Arcangeli wrote:
> > > >Hello everyone,
> > > >
> > > >Before the Kernel Summit, I think it's good idea to post a new
> > > >AutoNUMA24 and to go through a new review cycle. The last review cycle
> > > >has been fundamental in improving the patchset. Thanks!
> > > 
> > > Thanks for improving the code and incorporating all our 
> > > feedback. The AutoNUMA codebase is now in a state where I can 
> > > live with it.
> > > 
> > > I hope the code will be acceptable to others, too.
> > 
> > Lots of scheduler changes. Has all of peterz's review feedback 
> > been addressed?
> 
> git diff --stat origin kernel/sched/
>  kernel/sched/Makefile |    1 +
>  kernel/sched/core.c   |    1 +
>  kernel/sched/fair.c   |   86 ++++++-
>  kernel/sched/numa.c   |  604 +++++++++++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/sched.h  |   19 ++
>  5 files changed, 699 insertions(+), 12 deletions(-)
> 
> Lots of scheduler changes only if CONFIG_AUTONUMA=y.

That's a lot of scheduler changes.

> [...] If CONFIG_AUTONUMA=n it's just 107 lines of scheduler 
> changes (numa.c won't get built in that case).
> 
> > Hm, he isn't even Cc:-ed, how is that supposed to work?
> 
> I separately forwarded him the announcement email because I 
> wanted to add a few more (minor) details for him. Of course 
> Peter's review is fundamental and appreciated and already 
> helped to make the code a lot better.

I see no reason why such details shouldn't be discussed openly 
and why forwarding him things separately should cause you to 
drop a scheduler co-maintainer from the Cc:, with a 700-line 
kernel/sched/ diffstat ...

> His previous comments should have been addressed, [...]

That's good news. Peter?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 33/36] autonuma: powerpc port
  2012-08-23  5:11       ` Benjamin Herrenschmidt
@ 2012-08-23 15:23         ` Andrea Arcangeli
  2012-08-23 22:13         ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 54+ messages in thread
From: Andrea Arcangeli @ 2012-08-23 15:23 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Vaidyanathan Srinivasan, linux-kernel, linux-mm, Hillf Danton,
	Dan Smith, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Rik van Riel, Johannes Weiner, Srivatsa Vaddagiri,
	Christoph Lameter, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Tony Breeds, Kumar Gala

Hi Benjamin,

On Thu, Aug 23, 2012 at 03:11:00PM +1000, Benjamin Herrenschmidt wrote:
> Basically PROT_NONE turns into _PAGE_PRESENT without _PAGE_USER for us.

Maybe the simplest is to implement pte_numa as !_PAGE_USER too. No
need to clear the _PAGE_PRESENT bit and to alter pte_present() if
clearing _PAGE_USER already achieves it.

It should be trivial to add the vma parameter to pte_numa(pte, vma),
so you can implement pte_numa by checking vma->vm_page_prot in the
inline pte_numa function to tell whether it's a real prot none (in
which case pte_numa returns false) or a NUMA hinting page fault (in
which case pte_numa returns true).
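
A minimal user-space sketch of that idea, assuming an architecture
where both PROT_NONE and the NUMA hint are encoded as "present but
!_PAGE_USER" (types, names and bit values below are invented for
illustration):

#include <stdio.h>
#include <stdint.h>

typedef uint64_t pte_t;
typedef uint64_t pgprot_t;

#define _PAGE_PRESENT	0x1ULL
#define _PAGE_USER	0x2ULL

struct toy_vma {
	pgprot_t vm_page_prot;	/* what the vma's protection would set */
};

static int pte_numa(pte_t pte, const struct toy_vma *vma)
{
	/* only present ptes with user access revoked are candidates */
	if (!(pte & _PAGE_PRESENT) || (pte & _PAGE_USER))
		return 0;
	/* real prot none: the vma itself never grants user access */
	if (!(vma->vm_page_prot & _PAGE_USER))
		return 0;
	/* otherwise user access was cleared only for NUMA hinting */
	return 1;
}

int main(void)
{
	struct toy_vma rw   = { .vm_page_prot = _PAGE_PRESENT | _PAGE_USER };
	struct toy_vma none = { .vm_page_prot = _PAGE_PRESENT };
	pte_t pte = _PAGE_PRESENT;	/* present, !_PAGE_USER */

	printf("rw vma:        pte_numa = %d\n", pte_numa(pte, &rw));
	printf("PROT_NONE vma: pte_numa = %d\n", pte_numa(pte, &none));
	return 0;
}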

> However, the embedded ppc situation is more interesting... and it looks
> like it is indeed broken, meaning that a user can coerce the kernel into
> accessing PROT_NONE on its behalf with copy_from_user & co (though read
> only really).
> 
> Looks like the SW TLB handlers used on embedded should also check
> whether the address is a user or kernel address, and enforce _PAGE_USER
> in the former case. They might have done in the past, it's possible that
> it's code we lost, but as it is, it's broken.
> 
> The case of HW loaded TLB embedded will need a different definition of
> PAGE_NONE as well I suspect. Kumar, can you have a look ?

Even if we can't track copy-user accesses with the NUMA
hinting page faults, AUTONUMA should still work fairly well. The
flakey PROTNONE on embedded is more a problem in itself than it would
be for pte_numa on embedded.

OTOH AutoNUMA working on embedded isn't important, so it may just be
better to disable it until !_PAGE_USER is reliable.

> I wasn't especially thinking of ppc32... there's also hash64-4k or
> embedded 64... Also pgtable.h is common, so all those added uses of
> _PAGE_NUMA_PTE to static inline functions are going to break the build
> unless _PAGE_NUMA_PTE is #defined to 0 when not used (we do that for a
> bunch of bits in pte-common.h already).

It'd actually be worse if it did build ;). But I guess using
_PAGE_USER to implement pte_numa will solve the problem for the 4k
page size too.

We can discuss this during kernel summit ;).

Thanks a lot!
Andrea

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 33/36] autonuma: powerpc port
  2012-08-23  5:11       ` Benjamin Herrenschmidt
  2012-08-23 15:23         ` Andrea Arcangeli
@ 2012-08-23 22:13         ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 54+ messages in thread
From: Benjamin Herrenschmidt @ 2012-08-23 22:13 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Vaidyanathan Srinivasan, linux-kernel, linux-mm, Hillf Danton,
	Dan Smith, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
	Paul E. McKenney, Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
	Rik van Riel, Johannes Weiner, Srivatsa Vaddagiri,
	Christoph Lameter, Alex Shi, Mauricio Faria de Oliveira,
	Konrad Rzeszutek Wilk, Don Morris, Tony Breeds, Kumar Gala

On Thu, 2012-08-23 at 15:11 +1000, Benjamin Herrenschmidt wrote:

> So we don't do protnone, and now that you mention it, I think that
> means that some of our embedded stuff is busted :-)
> 
> Basically PROT_NONE turns into _PAGE_PRESENT without _PAGE_USER for us.

 .../...

> Looks like the SW TLB handlers used on embedded should also check
> whether the address is a user or kernel address, and enforce
> _PAGE_USER in the former case. They might have done in the past, it's
> possible that it's code we lost, but as it is, it's broken.
> 
> The case of HW loaded TLB embedded will need a different definition of
> PAGE_NONE as well I suspect. Kumar, can you have a look ?

Ok, replying to myself... I wrote some of that stuff so I was all ready
to put the brown paper bag on etc... but in fact:

 - On Book3e.h, we have all 6 protection bits in the PTE (user R,W,X and
supervisor R,W,X). _PAGE_BASE has none of them and _PAGE_USER brings
both UR and SR. Since _PAGE_USER is not set for PROT_NONE we should be
fine. That's the one I wrote so here goes the brown paper bag :-)

 - 44x/47x is in trouble. _PAGE_USER is just a bit in the PTE that the
TLB load handler uses to copy the S bits into the U bits. So we need to
modify the code to also refuse to load a TLB entry with an EA below
PAGE_OFFSET if _PAGE_USER isn't set. I'll give a patch a try today if
I get a chance, else it will have to wait until after I'm back from
Plumbers.

 - 8xx is probably in trouble, I don't know, I never touch that code, so
somebody from FSL should have a look if they care.

 - FSL BookE looks wrong after a quick look, I'll also let FSL take care
of it.

Cheers,
Ben.




^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2012-08-23 22:15 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-08-22 14:58 [PATCH 00/36] AutoNUMA24 Andrea Arcangeli
2012-08-22 14:58 ` [PATCH 01/36] autonuma: make set_pmd_at always available Andrea Arcangeli
2012-08-22 14:58 ` [PATCH 02/36] autonuma: export is_vma_temporary_stack() even if CONFIG_TRANSPARENT_HUGEPAGE=n Andrea Arcangeli
2012-08-22 14:58 ` [PATCH 03/36] autonuma: define _PAGE_NUMA_PTE and _PAGE_NUMA_PMD Andrea Arcangeli
2012-08-22 14:58 ` [PATCH 04/36] autonuma: pte_numa() and pmd_numa() Andrea Arcangeli
2012-08-22 14:58 ` [PATCH 05/36] autonuma: teach gup_fast about pmd_numa Andrea Arcangeli
2012-08-22 14:58 ` [PATCH 06/36] autonuma: introduce kthread_bind_node() Andrea Arcangeli
2012-08-22 14:58 ` [PATCH 07/36] autonuma: mm_autonuma and task_autonuma data structures Andrea Arcangeli
2012-08-22 14:58 ` [PATCH 08/36] autonuma: define the autonuma flags Andrea Arcangeli
2012-08-22 14:58 ` [PATCH 09/36] autonuma: core autonuma.h header Andrea Arcangeli
2012-08-22 14:58 ` [PATCH 10/36] autonuma: CPU follows memory algorithm Andrea Arcangeli
2012-08-22 14:58 ` [PATCH 11/36] autonuma: add page structure fields Andrea Arcangeli
2012-08-22 14:58 ` [PATCH 12/36] autonuma: knuma_migrated per NUMA node queues Andrea Arcangeli
2012-08-22 14:58 ` [PATCH 13/36] autonuma: autonuma_enter/exit Andrea Arcangeli
2012-08-22 14:58 ` [PATCH 14/36] autonuma: call autonuma_setup_new_exec() Andrea Arcangeli
2012-08-22 14:58 ` [PATCH 15/36] autonuma: alloc/free/init task_autonuma Andrea Arcangeli
2012-08-22 14:59 ` [PATCH 16/36] autonuma: alloc/free/init mm_autonuma Andrea Arcangeli
2012-08-22 14:59 ` [PATCH 17/36] autonuma: prevent select_task_rq_fair to return -1 Andrea Arcangeli
2012-08-22 14:59 ` [PATCH 18/36] autonuma: teach CFS about autonuma affinity Andrea Arcangeli
2012-08-22 14:59 ` [PATCH 19/36] autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection Andrea Arcangeli
2012-08-22 20:19   ` Andi Kleen
2012-08-22 21:22     ` Hugh Dickins
2012-08-22 21:24     ` Andrea Arcangeli
2012-08-22 22:37       ` Andi Kleen
2012-08-22 22:46         ` Andrea Arcangeli
2012-08-22 14:59 ` [PATCH 20/36] autonuma: default mempolicy follow AutoNUMA Andrea Arcangeli
2012-08-22 14:59 ` [PATCH 21/36] autonuma: call autonuma_split_huge_page() Andrea Arcangeli
2012-08-22 14:59 ` [PATCH 22/36] autonuma: make khugepaged pte_numa aware Andrea Arcangeli
2012-08-22 14:59 ` [PATCH 23/36] autonuma: retain page last_nid information in khugepaged Andrea Arcangeli
2012-08-22 14:59 ` [PATCH 24/36] autonuma: numa hinting page faults entry points Andrea Arcangeli
2012-08-22 14:59 ` [PATCH 25/36] autonuma: reset autonuma page data when pages are freed Andrea Arcangeli
2012-08-22 14:59 ` [PATCH 26/36] autonuma: link mm/autonuma.o and kernel/sched/numa.o Andrea Arcangeli
2012-08-22 14:59 ` [PATCH 27/36] autonuma: add CONFIG_AUTONUMA and CONFIG_AUTONUMA_DEFAULT_ENABLED Andrea Arcangeli
2012-08-22 14:59 ` [PATCH 28/36] autonuma: page_autonuma Andrea Arcangeli
2012-08-22 14:59 ` [PATCH 29/36] autonuma: autonuma_migrate_head[0] dynamic size Andrea Arcangeli
2012-08-22 14:59 ` [PATCH 30/36] autonuma: bugcheck page_autonuma fields on newly allocated pages Andrea Arcangeli
2012-08-22 14:59 ` [PATCH 31/36] autonuma: shrink the per-page page_autonuma struct size Andrea Arcangeli
2012-08-22 14:59 ` [PATCH 32/36] autonuma: boost khugepaged scanning rate Andrea Arcangeli
2012-08-22 14:59 ` [PATCH 33/36] autonuma: powerpc port Andrea Arcangeli
2012-08-22 22:01   ` Benjamin Herrenschmidt
2012-08-22 22:35     ` Andrea Arcangeli
2012-08-23  5:11       ` Benjamin Herrenschmidt
2012-08-23 15:23         ` Andrea Arcangeli
2012-08-23 22:13         ` Benjamin Herrenschmidt
2012-08-22 22:56     ` Benjamin Herrenschmidt
2012-08-22 23:06       ` Andrea Arcangeli
2012-08-23  4:15       ` Vaidyanathan Srinivasan
2012-08-22 14:59 ` [PATCH 34/36] autonuma: make the AUTONUMA_SCAN_PMD_FLAG conditional to CONFIG_HAVE_ARCH_AUTONUMA_SCAN_PMD Andrea Arcangeli
2012-08-22 14:59 ` [PATCH 35/36] autonuma: add knuma_migrated/allow_first_fault in sysfs Andrea Arcangeli
2012-08-22 14:59 ` [PATCH 36/36] autonuma: add mm_autonuma working set estimation Andrea Arcangeli
2012-08-22 19:26 ` [PATCH 00/36] AutoNUMA24 Rik van Riel
2012-08-22 21:40   ` Ingo Molnar
2012-08-22 22:19     ` Andrea Arcangeli
2012-08-23  8:42       ` Ingo Molnar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).