* [PATCH 00/31] numa/core patches
@ 2012-10-25 12:16 Peter Zijlstra
  2012-10-25 12:16 ` [PATCH 01/31] sched, numa, mm: Make find_busiest_queue() a method Peter Zijlstra
                   ` (33 more replies)
  0 siblings, 34 replies; 135+ messages in thread
From: Peter Zijlstra @ 2012-10-25 12:16 UTC (permalink / raw)
  To: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Zijlstra, Ingo Molnar

Hi all,

Here's a re-post of the NUMA scheduling and migration improvement
patches that we are working on. These include techniques from
AutoNUMA and the sched/numa tree and form a unified basis - it
has all the bits that look good and mergeable.

With these patches applied, the mbind() system call expands to
new modes of lazy-migration binding, and if the
CONFIG_SCHED_NUMA=y .config option is enabled the scheduler
will automatically sample the working set of tasks via page
faults. Based on that information the scheduler then tries
to balance smartly, putting tasks on a home node and migrating
CPU work and memory onto the same node.
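
A rough illustration of that first point, not from the series itself: a
userspace sketch of lazy-migration binding via mbind(). MPOL_NOOP and
MPOL_MF_LAZY are only introduced by patches 12 and 15 below, so the
fallback defines are assumptions for illustration only.

/* build: cc -o lazybind lazybind.c -lnuma */
#include <stdio.h>
#include <sys/mman.h>
#include <numaif.h>		/* mbind(), MPOL_MF_MOVE */

#ifndef MPOL_NOOP
#define MPOL_NOOP	5	/* assumed value, from patch 12/31 */
#endif
#ifndef MPOL_MF_LAZY
#define MPOL_MF_LAZY	(1<<3)	/* assumed value, from patch 15/31 */
#endif

int main(void)
{
	size_t len = 64UL << 20;
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;

	/*
	 * Keep whatever policy already applies (NOOP), but map the range
	 * PROT_NONE so the next task touching it faults and the pages
	 * lazily migrate towards that task's node.
	 */
	if (mbind(buf, len, MPOL_NOOP, NULL, 0,
		  MPOL_MF_MOVE | MPOL_MF_LAZY) != 0)
		perror("mbind");

	munmap(buf, len);
	return 0;
}

On a kernel without this series the call simply fails with EINVAL, which
the sketch reports and ignores.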

They are functional in their current state and have been tested on
a variety of x86 NUMA hardware.

These patches will continue their life in tip:numa/core and unless
there are major showstoppers they are intended for the v3.8
merge window.

We believe that they provide a solid basis for future work.

Please review .. once again and holler if you see anything funny! :-)




* [PATCH 01/31] sched, numa, mm: Make find_busiest_queue() a method
  2012-10-25 12:16 [PATCH 00/31] numa/core patches Peter Zijlstra
@ 2012-10-25 12:16 ` Peter Zijlstra
  2012-10-25 12:16 ` [PATCH 02/31] sched, numa, mm: Describe the NUMA scheduling problem formally Peter Zijlstra
                   ` (32 subsequent siblings)
  33 siblings, 0 replies; 135+ messages in thread
From: Peter Zijlstra @ 2012-10-25 12:16 UTC (permalink / raw)
  To: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Zijlstra, Paul Turner,
	Lee Schermerhorn, Christoph Lameter, Ingo Molnar

[-- Attachment #1: 0001-sched-numa-mm-Make-find_busiest_queue-a-method.patch --]
[-- Type: text/plain, Size: 1859 bytes --]

It's a bit awkward, but it was the least painful means of modifying the
queue selection. It is used in a later patch to conditionally use a
random queue.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Turner <pjt@google.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c |   20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

Index: tip/kernel/sched/fair.c
===================================================================
--- tip.orig/kernel/sched/fair.c
+++ tip/kernel/sched/fair.c
@@ -3063,6 +3063,9 @@ struct lb_env {
 	unsigned int		loop;
 	unsigned int		loop_break;
 	unsigned int		loop_max;
+
+	struct rq *		(*find_busiest_queue)(struct lb_env *,
+						      struct sched_group *);
 };
 
 /*
@@ -4236,13 +4239,14 @@ static int load_balance(int this_cpu, st
 	struct cpumask *cpus = __get_cpu_var(load_balance_tmpmask);
 
 	struct lb_env env = {
-		.sd		= sd,
-		.dst_cpu	= this_cpu,
-		.dst_rq		= this_rq,
-		.dst_grpmask    = sched_group_cpus(sd->groups),
-		.idle		= idle,
-		.loop_break	= sched_nr_migrate_break,
-		.cpus		= cpus,
+		.sd		    = sd,
+		.dst_cpu	    = this_cpu,
+		.dst_rq		    = this_rq,
+		.dst_grpmask        = sched_group_cpus(sd->groups),
+		.idle		    = idle,
+		.loop_break	    = sched_nr_migrate_break,
+		.cpus		    = cpus,
+		.find_busiest_queue = find_busiest_queue,
 	};
 
 	cpumask_copy(cpus, cpu_active_mask);
@@ -4261,7 +4265,7 @@ redo:
 		goto out_balanced;
 	}
 
-	busiest = find_busiest_queue(&env, group);
+	busiest = env.find_busiest_queue(&env, group);
 	if (!busiest) {
 		schedstat_inc(sd, lb_nobusyq[idle]);
 		goto out_balanced;
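
[ Illustration only, not part of the patch: a standalone sketch of the
  "queue selection as a struct member" pattern used above, so that a
  later patch can override the selector without touching the balancing
  flow. All names below are invented. ]

#include <stdio.h>

struct queue { int id; int load; };

struct env {
	struct queue	*queues;
	int		nr_queues;
	/* the "method": pick which queue to pull work from */
	struct queue	*(*find_busiest_queue)(struct env *env);
};

static struct queue *default_find_busiest(struct env *env)
{
	struct queue *busiest = &env->queues[0];
	int i;

	for (i = 1; i < env->nr_queues; i++)
		if (env->queues[i].load > busiest->load)
			busiest = &env->queues[i];
	return busiest;
}

int main(void)
{
	struct queue qs[] = { { 0, 3 }, { 1, 7 }, { 2, 5 } };
	struct env env = {
		.queues			= qs,
		.nr_queues		= 3,
		/* a later user can plug in a different selector here */
		.find_busiest_queue	= default_find_busiest,
	};
	struct queue *busiest = env.find_busiest_queue(&env);

	printf("busiest queue: %d (load %d)\n", busiest->id, busiest->load);
	return 0;
}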




* [PATCH 02/31] sched, numa, mm: Describe the NUMA scheduling problem formally
  2012-10-25 12:16 [PATCH 00/31] numa/core patches Peter Zijlstra
  2012-10-25 12:16 ` [PATCH 01/31] sched, numa, mm: Make find_busiest_queue() a method Peter Zijlstra
@ 2012-10-25 12:16 ` Peter Zijlstra
  2012-11-01  9:56   ` Mel Gorman
  2012-10-25 12:16 ` [PATCH 03/31] mm/thp: Preserve pgprot across huge page split Peter Zijlstra
                   ` (31 subsequent siblings)
  33 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2012-10-25 12:16 UTC (permalink / raw)
  To: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Zijlstra, H. Peter Anvin,
	Mike Galbraith, Ingo Molnar

[-- Attachment #1: 0002-sched-numa-mm-Describe-the-NUMA-scheduling-problem-f.patch --]
[-- Type: text/plain, Size: 9951 bytes --]

This is probably a first: a formal description of a complex high-level
computing problem within the kernel source.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Rik van Riel <riel@redhat.com>
[ Next step: generate the kernel source from such formal descriptions and retire to a tropical island! ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 Documentation/scheduler/numa-problem.txt |  230 +++++++++++++++++++++++++++++++
 1 file changed, 230 insertions(+)
 create mode 100644 Documentation/scheduler/numa-problem.txt

Index: tip/Documentation/scheduler/numa-problem.txt
===================================================================
--- /dev/null
+++ tip/Documentation/scheduler/numa-problem.txt
@@ -0,0 +1,230 @@
+
+
+Effective NUMA scheduling problem statement, described formally:
+
+ * minimize interconnect traffic
+
+For each task 't_i' we have memory; this memory can be spread over multiple
+physical nodes. Let us denote this as 'p_i,k': the memory task 't_i' has on
+node 'k' in [pages].
+
+If a task shares memory with another task let us denote this as:
+'s_i,k', the memory shared between tasks including 't_i' residing on node
+'k'.
+
+Let 'M' be the distribution that governs all 'p' and 's', ie. the page placement.
+
+Similarly, let us define 'fp_i,k' and 'fs_i,k' resp. as the (average) usage
+frequency over those memory regions [1/s] such that the product gives an
+(average) bandwidth 'bp' and 'bs' in [pages/s].
+
+(note: multiple tasks sharing memory naturally avoid duplicate accounting
+       because each task will have its own access frequency 'fs')
+
+(pjt: I think this frequency is more numerically consistent if you explicitly 
+      restrict p/s above to be the working-set. (It also makes explicit the 
+      requirement for <C0,M0> to change about a change in the working set.)
+
+      Doing this does have the nice property that it lets you use your frequency
+      measurement as a weak-ordering for the benefit a task would receive when
+      we can't fit everything.
+
+      e.g. task1 has working set 10mb, f=90%
+           task2 has working set 90mb, f=10%
+
+      Both are using 9mb/s of bandwidth, but we'd expect a much larger benefit
+      from task1 being on the right node than task2. )
+
+Let 'C' map every task 't_i' to a cpu 'c_i' and its corresponding node 'n_i':
+
+  C: t_i -> {c_i, n_i}
+
+This gives us the total interconnect traffic between nodes 'k' and 'l',
+'T_k,l', as:
+
+  T_k,l = \Sum_i bp_i,l + bs_i,l + \Sum_j bp_j,k + bs_j,k where n_i == k, n_j == l
+
+And our goal is to obtain C0 and M0 such that:
+
+  T_k,l(C0, M0) <= T_k,l(C, M) for all C, M where k != l
+
+(note: we could introduce 'nc(k,l)' as the cost function of accessing memory
+       on node 'l' from node 'k'; this would be useful for bigger NUMA systems
+
+ pjt: I agree nice to have, but intuition suggests diminishing returns on more
+      usual systems given factors like Haswell's enormous 35mb l3 cache and
+      QPI being able to do a direct fetch.)
+
+(note: do we need a limit on the total memory per node?)
+
+
+ * fairness
+
+For each task 't_i' we have a weight 'w_i' (related to nice), and each cpu
+'c_n' has a compute capacity 'P_n'. Again, using our map 'C' we can formulate
+a load 'L_n':
+
+  L_n = 1/P_n * \Sum_i w_i for all c_i = n
+
+Using that, we can formulate a load difference between CPUs:
+
+  L_n,m = | L_n - L_m |
+
+This allows us to state the fairness goal as:
+
+  L_n,m(C0) <= L_n,m(C) for all C, n != m
+
+(pjt: It can also be usefully stated that, having converged at C0:
+
+   | L_n(C0) - L_m(C0) | <= 4/3 * | G_n( U(t_i, t_j) ) - G_m( U(t_i, t_j) ) |
+
+      Where G_n,m is the greedy partition of tasks between L_n and L_m. This is
+      the "worst" partition we should accept; but having it gives us a useful 
+      bound on how much we can reasonably adjust L_n/L_m at a Pareto point to 
+      favor T_n,m. )
+
+Together they give us the complete multi-objective optimization problem:
+
+  min_C,M [ L_n,m(C), T_k,l(C,M) ]
+
+
+
+Notes:
+
+ - the memory bandwidth problem is very much an inter-process problem; in
+   particular, there is no such concept as a process in the above problem.
+
+ - the naive solution would completely prefer fairness over interconnect
+   traffic; the more complicated solution could pick another Pareto point using
+   an aggregate objective function such that we balance the loss of work
+   efficiency against the gain of running. We'd want to more or less suggest
+   there to be a fixed bound on the error from the Pareto line for any
+   such solution.
+
+References:
+
+  http://en.wikipedia.org/wiki/Mathematical_optimization
+  http://en.wikipedia.org/wiki/Multi-objective_optimization
+
+
+* warning, significant hand-waving ahead, improvements welcome *
+
+
+Partial solutions / approximations:
+
+ 1) have task node placement be a pure preference from the 'fairness' pov.
+
+This means we always prefer fairness over interconnect bandwidth. This reduces
+the problem to:
+
+  min_C,M [ T_k,l(C,M) ]
+
+ 2a) migrate memory towards 'n_i' (the task's node).
+
+This creates memory movement such that 'p_i,k for k != n_i' becomes 0 -- 
+provided 'n_i' stays stable enough and there's sufficient memory (looks like
+we might need memory limits for this).
+
+This does however not provide us with any 's_i' (shared) information. It does
+remove 'M', since it defines memory placement in terms of task placement.
+
+XXX properties of this M vs a potential optimal
+
+ 2b) migrate memory towards 'n_i' using 2 samples.
+
+This separates pages into those that will migrate and those that will not due
+to the two samples not matching. We could consider the first to be of 'p_i'
+(private) and the second to be of 's_i' (shared).
+
+This interpretation can be motivated by the previously observed property that
+'p_i,k for k != n_i' should become 0 under sufficient memory, leaving only
+'s_i' (shared). (here we lose the need for memory limits again, since it
+becomes indistinguishable from shared).
+
+XXX include the statistical babble on double sampling somewhere near
+
+This reduces the problem further; we lose 'M' as per 2a, and it further reduces
+the 'T_k,l' (interconnect traffic) term to only include shared (since per the
+above all private will be local):
+
+  T_k,l = \Sum_i bs_i,l for every n_i = k, l != k
+
+[ more or less matches the state of sched/numa and describes its remaining
+  problems and assumptions. It should work well for tasks without significant
+  shared memory usage between tasks. ]
+
+Possible future directions:
+
+Motivated by the form of 'T_k,l', try and obtain each term of the sum, so we
+can evaluate it;
+
+ 3a) add per-task per node counters
+
+At fault time, count the number of pages the task faults on for each node.
+This should give an approximation of 'p_i' for the local node and 's_i,k' for
+all remote nodes.
+
+While these numbers provide pages per scan, and so have the unit [pages/s],
+they don't count repeat accesses and thus aren't actually representative of
+our bandwidth numbers.
+
+ 3b) additional frequency term
+
+Additionally (or instead if it turns out we don't need the raw 'p' and 's' 
+numbers) we can approximate the repeat accesses by using the time since marking
+the pages as an indication of the access frequency.
+
+Let 'I' be the interval of marking pages and 'e' the elapsed time since the
+last marking; then we could estimate the number of accesses 'a' as 'a = I / e'.
+If we then increment the node counters using 'a' instead of 1 we might get
+a better estimate of bandwidth terms.
+
+ 3c) additional averaging; can be applied on top of either a/b.
+
+[ Rik argues that decaying averages on 3a might be sufficient for bandwidth since
+  the decaying avg includes the old accesses and therefore has a measure of repeat
+  accesses.
+
+  Rik also argued that the sample frequency is too low to get accurate access
+  frequency measurements. I'm not entirely convinced; even at low sample
+  frequencies the avg elapsed time 'e' over multiple samples should still
+  give us a fair approximation of the avg access frequency 'a'.
+
+  So doing both b&c has a fair chance of working and allowing us to distinguish
+  between important and less important memory accesses.
+
+  Experimentation has shown no benefit from the added frequency term so far. ]
+
+This will give us 'bp_i' and 'bs_i,k' so that we can approximately compute
+'T_k,l'. Our optimization problem now reads:
+
+  min_C [ \Sum_i bs_i,l for every n_i = k, l != k ]
+
+This includes only shared terms; this makes sense since all task-private memory
+will become local as per 2.
+
+This suggests that if there is significant shared memory, we should try and
+move towards it.
+
+ 4) move towards where 'most' memory is
+
+The simplest significance test is comparing the biggest shared 's_i,k' against
+the private 'p_i'. If we have more shared than private, move towards it.
+
+This effectively makes us move towards where most of our memory is, and forms
+a feedback loop with 2. We migrate memory towards us and we migrate towards
+where 'most' memory is.
+
+(Note: even if there were two tasks fully thrashing the same shared memory, it
+       is very rare for there to be a 50/50 split in memory; lacking a perfect
+       split, the smaller will move towards the larger. In case of a perfect
+       split, we'll tie-break towards the lower node number.)
+
+ 5) 'throttle' 4's node placement
+
+Since per 2b our 's_i,k' and 'p_i' require at least two scans to 'stabilize'
+and show representative numbers, we should limit node-migration to not be
+faster than this.
+
+ n) poke holes in previous that require more stuff and describe it.
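
A standalone sketch of the per-task per-node fault accounting described in
3a/3b above: count one unit per fault (3a) and optionally weight it by the
estimated access frequency a = I / e (3b). The names, the node count and
the time units are illustrative assumptions, not kernel code.

#include <stdio.h>

#define NR_NODES 4

struct numa_faults {
	unsigned long faults[NR_NODES];		/* weighted pages per node */
};

/* 'interval' is the scan interval I, 'elapsed' the time e since the page
 * was last marked; both in the same time unit (say, milliseconds). */
static void account_numa_fault(struct numa_faults *nf, int page_nid,
			       unsigned long interval, unsigned long elapsed)
{
	unsigned long weight = 1;		/* 3a: plain page count */

	if (elapsed && interval > elapsed)
		weight = interval / elapsed;	/* 3b: a = I / e */

	nf->faults[page_nid] += weight;
}

int main(void)
{
	struct numa_faults nf = { { 0 } };
	int n;

	account_numa_fault(&nf, 1, 100, 25);	/* re-touched quickly: a = 4 */
	account_numa_fault(&nf, 2, 100, 100);	/* touched once per scan: a = 1 */

	for (n = 0; n < NR_NODES; n++)
		printf("node %d: %lu weighted faults\n", n, nf.faults[n]);
	return 0;
}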




* [PATCH 03/31] mm/thp: Preserve pgprot across huge page split
  2012-10-25 12:16 [PATCH 00/31] numa/core patches Peter Zijlstra
  2012-10-25 12:16 ` [PATCH 01/31] sched, numa, mm: Make find_busiest_queue() a method Peter Zijlstra
  2012-10-25 12:16 ` [PATCH 02/31] sched, numa, mm: Describe the NUMA scheduling problem formally Peter Zijlstra
@ 2012-10-25 12:16 ` Peter Zijlstra
  2012-11-01 10:22   ` Mel Gorman
  2012-10-25 12:16 ` [PATCH 04/31] x86/mm: Introduce pte_accessible() Peter Zijlstra
                   ` (30 subsequent siblings)
  33 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2012-10-25 12:16 UTC (permalink / raw)
  To: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Zijlstra, Paul Turner, Ingo Molnar

[-- Attachment #1: 0003-mm-thp-Preserve-pgprot-across-huge-page-split.patch --]
[-- Type: text/plain, Size: 5899 bytes --]

We're going to play games with page protections; ensure we don't lose
them over a THP split.

Collapse seems to always allocate a new (huge) page, which should
already end up on the new target node, so losing protections there
isn't a problem.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Paul Turner <pjt@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/pgtable.h |    1 
 mm/huge_memory.c               |  105 +++++++++++++++++++----------------------
 2 files changed, 51 insertions(+), 55 deletions(-)

Index: tip/arch/x86/include/asm/pgtable.h
===================================================================
--- tip.orig/arch/x86/include/asm/pgtable.h
+++ tip/arch/x86/include/asm/pgtable.h
@@ -349,6 +349,7 @@ static inline pgprot_t pgprot_modify(pgp
 }
 
 #define pte_pgprot(x) __pgprot(pte_flags(x) & PTE_FLAGS_MASK)
+#define pmd_pgprot(x) __pgprot(pmd_val(x) & ~_HPAGE_CHG_MASK)
 
 #define canon_pgprot(p) __pgprot(massage_pgprot(p))
 
Index: tip/mm/huge_memory.c
===================================================================
--- tip.orig/mm/huge_memory.c
+++ tip/mm/huge_memory.c
@@ -1343,63 +1343,60 @@ static int __split_huge_page_map(struct
 	int ret = 0, i;
 	pgtable_t pgtable;
 	unsigned long haddr;
+	pgprot_t prot;
 
 	spin_lock(&mm->page_table_lock);
 	pmd = page_check_address_pmd(page, mm, address,
 				     PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
-	if (pmd) {
-		pgtable = pgtable_trans_huge_withdraw(mm);
-		pmd_populate(mm, &_pmd, pgtable);
-
-		haddr = address;
-		for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
-			pte_t *pte, entry;
-			BUG_ON(PageCompound(page+i));
-			entry = mk_pte(page + i, vma->vm_page_prot);
-			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-			if (!pmd_write(*pmd))
-				entry = pte_wrprotect(entry);
-			else
-				BUG_ON(page_mapcount(page) != 1);
-			if (!pmd_young(*pmd))
-				entry = pte_mkold(entry);
-			pte = pte_offset_map(&_pmd, haddr);
-			BUG_ON(!pte_none(*pte));
-			set_pte_at(mm, haddr, pte, entry);
-			pte_unmap(pte);
-		}
-
-		smp_wmb(); /* make pte visible before pmd */
-		/*
-		 * Up to this point the pmd is present and huge and
-		 * userland has the whole access to the hugepage
-		 * during the split (which happens in place). If we
-		 * overwrite the pmd with the not-huge version
-		 * pointing to the pte here (which of course we could
-		 * if all CPUs were bug free), userland could trigger
-		 * a small page size TLB miss on the small sized TLB
-		 * while the hugepage TLB entry is still established
-		 * in the huge TLB. Some CPU doesn't like that. See
-		 * http://support.amd.com/us/Processor_TechDocs/41322.pdf,
-		 * Erratum 383 on page 93. Intel should be safe but is
-		 * also warns that it's only safe if the permission
-		 * and cache attributes of the two entries loaded in
-		 * the two TLB is identical (which should be the case
-		 * here). But it is generally safer to never allow
-		 * small and huge TLB entries for the same virtual
-		 * address to be loaded simultaneously. So instead of
-		 * doing "pmd_populate(); flush_tlb_range();" we first
-		 * mark the current pmd notpresent (atomically because
-		 * here the pmd_trans_huge and pmd_trans_splitting
-		 * must remain set at all times on the pmd until the
-		 * split is complete for this pmd), then we flush the
-		 * SMP TLB and finally we write the non-huge version
-		 * of the pmd entry with pmd_populate.
-		 */
-		pmdp_invalidate(vma, address, pmd);
-		pmd_populate(mm, pmd, pgtable);
-		ret = 1;
+	if (!pmd)
+		goto unlock;
+
+	prot = pmd_pgprot(*pmd);
+	pgtable = pgtable_trans_huge_withdraw(mm);
+	pmd_populate(mm, &_pmd, pgtable);
+
+	for (i = 0, haddr = address; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+		pte_t *pte, entry;
+
+		BUG_ON(PageCompound(page+i));
+		entry = mk_pte(page + i, prot);
+		entry = pte_mkdirty(entry);
+		if (!pmd_young(*pmd))
+			entry = pte_mkold(entry);
+		pte = pte_offset_map(&_pmd, haddr);
+		BUG_ON(!pte_none(*pte));
+		set_pte_at(mm, haddr, pte, entry);
+		pte_unmap(pte);
 	}
+
+	smp_wmb(); /* make ptes visible before pmd, see __pte_alloc */
+	/*
+	 * Up to this point the pmd is present and huge.
+	 *
+	 * If we overwrite the pmd with the not-huge version, we could trigger
+	 * a small page size TLB miss on the small sized TLB while the hugepage
+	 * TLB entry is still established in the huge TLB.
+	 *
+	 * Some CPUs don't like that. See
+	 * http://support.amd.com/us/Processor_TechDocs/41322.pdf, Erratum 383
+	 * on page 93.
+	 *
+	 * Thus it is generally safer to never allow small and huge TLB entries
+	 * for overlapping virtual addresses to be loaded. So we first mark the
+	 * current pmd not present, then we flush the TLB and finally we write
+	 * the non-huge version of the pmd entry with pmd_populate.
+	 *
+	 * The above needs to be done under the ptl because pmd_trans_huge and
+	 * pmd_trans_splitting must remain set on the pmd until the split is
+	 * complete. The ptl also protects against concurrent faults due to
+	 * making the pmd not-present.
+	 */
+	set_pmd_at(mm, address, pmd, pmd_mknotpresent(*pmd));
+	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+	pmd_populate(mm, pmd, pgtable);
+	ret = 1;
+
+unlock:
 	spin_unlock(&mm->page_table_lock);
 
 	return ret;
@@ -2287,10 +2284,8 @@ static void khugepaged_do_scan(void)
 {
 	struct page *hpage = NULL;
 	unsigned int progress = 0, pass_through_head = 0;
-	unsigned int pages = khugepaged_pages_to_scan;
 	bool wait = true;
-
-	barrier(); /* write khugepaged_pages_to_scan to local stack */
+	unsigned int pages = ACCESS_ONCE(khugepaged_pages_to_scan);
 
 	while (progress < pages) {
 		if (!khugepaged_prealloc_page(&hpage, &wait))




* [PATCH 04/31] x86/mm: Introduce pte_accessible()
  2012-10-25 12:16 [PATCH 00/31] numa/core patches Peter Zijlstra
                   ` (2 preceding siblings ...)
  2012-10-25 12:16 ` [PATCH 03/31] mm/thp: Preserve pgprot across huge page split Peter Zijlstra
@ 2012-10-25 12:16 ` Peter Zijlstra
  2012-10-25 20:10   ` Linus Torvalds
  2012-11-01 10:42   ` [PATCH 04/31] " Mel Gorman
  2012-10-25 12:16 ` [PATCH 05/31] x86/mm: Reduce tlb flushes from ptep_set_access_flags() Peter Zijlstra
                   ` (29 subsequent siblings)
  33 siblings, 2 replies; 135+ messages in thread
From: Peter Zijlstra @ 2012-10-25 12:16 UTC (permalink / raw)
  To: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Zijlstra, Ingo Molnar

[-- Attachment #1: 0004-x86-mm-Introduce-pte_accessible.patch --]
[-- Type: text/plain, Size: 1835 bytes --]

From: Rik van Riel <riel@redhat.com>

We need pte_present to return true for _PAGE_PROTNONE pages, to indicate that
the pte is associated with a page.

However, for TLB flushing purposes, we would like to know whether the pte
points to an actually accessible page.  This allows us to skip remote TLB
flushes for pages that are not actually accessible.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/pgtable.h |    6 ++++++
 include/asm-generic/pgtable.h  |    4 ++++
 2 files changed, 10 insertions(+)

Index: tip/arch/x86/include/asm/pgtable.h
===================================================================
--- tip.orig/arch/x86/include/asm/pgtable.h
+++ tip/arch/x86/include/asm/pgtable.h
@@ -408,6 +408,12 @@ static inline int pte_present(pte_t a)
 	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
 }
 
+#define __HAVE_ARCH_PTE_ACCESSIBLE
+static inline int pte_accessible(pte_t a)
+{
+	return pte_flags(a) & _PAGE_PRESENT;
+}
+
 static inline int pte_hidden(pte_t pte)
 {
 	return pte_flags(pte) & _PAGE_HIDDEN;
Index: tip/include/asm-generic/pgtable.h
===================================================================
--- tip.orig/include/asm-generic/pgtable.h
+++ tip/include/asm-generic/pgtable.h
@@ -219,6 +219,10 @@ static inline int pmd_same(pmd_t pmd_a,
 #define move_pte(pte, prot, old_addr, new_addr)	(pte)
 #endif
 
+#ifndef __HAVE_ARCH_PTE_ACCESSIBLE
+#define pte_accessible(pte)		pte_present(pte)
+#endif
+
 #ifndef flush_tlb_fix_spurious_fault
 #define flush_tlb_fix_spurious_fault(vma, address) flush_tlb_page(vma, address)
 #endif




* [PATCH 05/31] x86/mm: Reduce tlb flushes from ptep_set_access_flags()
  2012-10-25 12:16 [PATCH 00/31] numa/core patches Peter Zijlstra
                   ` (3 preceding siblings ...)
  2012-10-25 12:16 ` [PATCH 04/31] x86/mm: Introduce pte_accessible() Peter Zijlstra
@ 2012-10-25 12:16 ` Peter Zijlstra
  2012-10-25 20:17   ` Linus Torvalds
  2012-10-25 12:16 ` [PATCH 06/31] mm: Only flush the TLB when clearing an accessible pte Peter Zijlstra
                   ` (28 subsequent siblings)
  33 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2012-10-25 12:16 UTC (permalink / raw)
  To: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Zijlstra, Ingo Molnar

[-- Attachment #1: 0005-x86-mm-Reduce-tlb-flushes-from-ptep_set_access_flags.patch --]
[-- Type: text/plain, Size: 1694 bytes --]

From: Rik van Riel <riel@redhat.com>

If ptep_set_access_flags() is invoked to upgrade access permissions
on a PTE, there is no security or data integrity reason to do a
remote TLB flush.

Lazily letting another CPU incur a spurious page fault occasionally
is (much!) cheaper than aggressively flushing everybody else's TLB.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/pgtable.c |   17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

Index: tip/arch/x86/mm/pgtable.c
===================================================================
--- tip.orig/arch/x86/mm/pgtable.c
+++ tip/arch/x86/mm/pgtable.c
@@ -306,11 +306,26 @@ int ptep_set_access_flags(struct vm_area
 			  pte_t entry, int dirty)
 {
 	int changed = !pte_same(*ptep, entry);
+	/*
+	 * If the page used to be inaccessible (_PAGE_PROTNONE), or
+	 * this call upgrades the access permissions on the same page,
+	 * it is safe to skip the remote TLB flush.
+	 */
+	bool flush_remote = false;
+	if (!pte_accessible(*ptep))
+		flush_remote = false;
+	else if (pte_pfn(*ptep) != pte_pfn(entry) ||
+			(pte_write(*ptep) && !pte_write(entry)) ||
+			(pte_exec(*ptep) && !pte_exec(entry)))
+		flush_remote = true;
 
 	if (changed && dirty) {
 		*ptep = entry;
 		pte_update_defer(vma->vm_mm, address, ptep);
-		flush_tlb_page(vma, address);
+		if (flush_remote)
+			flush_tlb_page(vma, address);
+		else
+			__flush_tlb_one(address);
 	}
 
 	return changed;




* [PATCH 06/31] mm: Only flush the TLB when clearing an accessible pte
  2012-10-25 12:16 [PATCH 00/31] numa/core patches Peter Zijlstra
                   ` (4 preceding siblings ...)
  2012-10-25 12:16 ` [PATCH 05/31] x86/mm: Reduce tlb flushes from ptep_set_access_flags() Peter Zijlstra
@ 2012-10-25 12:16 ` Peter Zijlstra
  2012-10-25 12:16 ` [PATCH 07/31] sched, numa, mm, s390/thp: Implement pmd_pgprot() for s390 Peter Zijlstra
                   ` (27 subsequent siblings)
  33 siblings, 0 replies; 135+ messages in thread
From: Peter Zijlstra @ 2012-10-25 12:16 UTC (permalink / raw)
  To: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Zijlstra, Ingo Molnar

[-- Attachment #1: 0006-mm-Only-flush-the-TLB-when-clearing-an-accessible-pt.patch --]
[-- Type: text/plain, Size: 970 bytes --]

From: Rik van Riel <riel@redhat.com>

If ptep_clear_flush() is called to clear a page table entry that is
accessible anyway by the CPU, eg. a _PAGE_PROTNONE page table entry,
there is no need to flush the TLB on remote CPUs.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/pgtable-generic.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: tip/mm/pgtable-generic.c
===================================================================
--- tip.orig/mm/pgtable-generic.c
+++ tip/mm/pgtable-generic.c
@@ -88,7 +88,8 @@ pte_t ptep_clear_flush(struct vm_area_st
 {
 	pte_t pte;
 	pte = ptep_get_and_clear((vma)->vm_mm, address, ptep);
-	flush_tlb_page(vma, address);
+	if (pte_accessible(pte))
+		flush_tlb_page(vma, address);
 	return pte;
 }
 #endif




* [PATCH 07/31] sched, numa, mm, s390/thp: Implement pmd_pgprot() for s390
  2012-10-25 12:16 [PATCH 00/31] numa/core patches Peter Zijlstra
                   ` (5 preceding siblings ...)
  2012-10-25 12:16 ` [PATCH 06/31] mm: Only flush the TLB when clearing an accessible pte Peter Zijlstra
@ 2012-10-25 12:16 ` Peter Zijlstra
  2012-11-01 10:49   ` Mel Gorman
  2012-10-25 12:16 ` [PATCH 08/31] sched, numa, mm, MIPS/thp: Add pmd_pgprot() implementation Peter Zijlstra
                   ` (26 subsequent siblings)
  33 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2012-10-25 12:16 UTC (permalink / raw)
  To: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton
  Cc: linux-kernel, linux-mm, Gerald Schaefer, Martin Schwidefsky,
	Heiko Carstens, Peter Zijlstra, Ralf Baechle, Ingo Molnar

[-- Attachment #1: 0007-sched-numa-mm-s390-thp-Implement-pmd_pgprot-for-s390.patch --]
[-- Type: text/plain, Size: 1199 bytes --]

From: Gerald Schaefer <gerald.schaefer@de.ibm.com>

This patch adds an implementation of pmd_pgprot() for s390,
in preparation for future THP changes.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/s390/include/asm/pgtable.h |   13 +++++++++++++
 1 file changed, 13 insertions(+)

Index: tip/arch/s390/include/asm/pgtable.h
===================================================================
--- tip.orig/arch/s390/include/asm/pgtable.h
+++ tip/arch/s390/include/asm/pgtable.h
@@ -1240,6 +1240,19 @@ static inline void set_pmd_at(struct mm_
 	*pmdp = entry;
 }
 
+static inline pgprot_t pmd_pgprot(pmd_t pmd)
+{
+	pgprot_t prot = PAGE_RW;
+
+	if (pmd_val(pmd) & _SEGMENT_ENTRY_RO) {
+		if (pmd_val(pmd) & _SEGMENT_ENTRY_INV)
+			prot = PAGE_NONE;
+		else
+			prot = PAGE_RO;
+	}
+	return prot;
+}
+
 static inline unsigned long massage_pgprot_pmd(pgprot_t pgprot)
 {
 	unsigned long pgprot_pmd = 0;




* [PATCH 08/31] sched, numa, mm, MIPS/thp: Add pmd_pgprot() implementation
  2012-10-25 12:16 [PATCH 00/31] numa/core patches Peter Zijlstra
                   ` (6 preceding siblings ...)
  2012-10-25 12:16 ` [PATCH 07/31] sched, numa, mm, s390/thp: Implement pmd_pgprot() for s390 Peter Zijlstra
@ 2012-10-25 12:16 ` Peter Zijlstra
  2012-10-25 12:16 ` [PATCH 09/31] mm/pgprot: Move the pgprot_modify() fallback definition to mm.h Peter Zijlstra
                   ` (25 subsequent siblings)
  33 siblings, 0 replies; 135+ messages in thread
From: Peter Zijlstra @ 2012-10-25 12:16 UTC (permalink / raw)
  To: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton
  Cc: linux-kernel, linux-mm, Ralf Baechle, Martin Schwidefsky,
	Heiko Carstens, Peter Zijlstra, Ingo Molnar

[-- Attachment #1: 0008-sched-numa-mm-MIPS-thp-Add-pmd_pgprot-implementation.patch --]
[-- Type: text/plain, Size: 982 bytes --]

From: Ralf Baechle <ralf@linux-mips.org>

Add the pmd_pgprot() method that will be needed
by the new NUMA code.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/mips/include/asm/pgtable.h |    2 ++
 1 file changed, 2 insertions(+)

Index: tip/arch/mips/include/asm/pgtable.h
===================================================================
--- tip.orig/arch/mips/include/asm/pgtable.h
+++ tip/arch/mips/include/asm/pgtable.h
@@ -89,6 +89,8 @@ static inline int is_zero_pfn(unsigned l
 
 extern void paging_init(void);
 
+#define pmd_pgprot(x)		__pgprot(pmd_val(x) & ~_PAGE_CHG_MASK)
+
 /*
  * Conversion functions: convert a page and protection to a page entry,
  * and a page entry and page directory to the page they refer to.




* [PATCH 09/31] mm/pgprot: Move the pgprot_modify() fallback definition to mm.h
  2012-10-25 12:16 [PATCH 00/31] numa/core patches Peter Zijlstra
                   ` (7 preceding siblings ...)
  2012-10-25 12:16 ` [PATCH 08/31] sched, numa, mm, MIPS/thp: Add pmd_pgprot() implementation Peter Zijlstra
@ 2012-10-25 12:16 ` Peter Zijlstra
  2012-10-25 12:16 ` [PATCH 10/31] mm/mpol: Remove NUMA_INTERLEAVE_HIT Peter Zijlstra
                   ` (24 subsequent siblings)
  33 siblings, 0 replies; 135+ messages in thread
From: Peter Zijlstra @ 2012-10-25 12:16 UTC (permalink / raw)
  To: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Zijlstra, Paul Turner, Ingo Molnar

[-- Attachment #1: 0009-mm-pgprot-Move-the-pgprot_modify-fallback-definition.patch --]
[-- Type: text/plain, Size: 2022 bytes --]

From: Ingo Molnar <mingo@kernel.org>

pgprot_modify() is available on x86, but on other architectures it only
gets defined in mm/mprotect.c - breaking the build if anything outside
of mprotect.c tries to make use of this function.

Move it to the generic pgprot area in mm.h, so that an upcoming patch
can make use of it.

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Rik van Riel <riel@redhat.com>
Cc: Paul Turner <pjt@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/mm.h |   13 +++++++++++++
 mm/mprotect.c      |    7 -------
 2 files changed, 13 insertions(+), 7 deletions(-)

Index: tip/include/linux/mm.h
===================================================================
--- tip.orig/include/linux/mm.h
+++ tip/include/linux/mm.h
@@ -164,6 +164,19 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_TRIED	0x40	/* second try */
 
 /*
+ * Some architectures (such as x86) may need to preserve certain pgprot
+ * bits, without complicating generic pgprot code.
+ *
+ * Most architectures don't care:
+ */
+#ifndef pgprot_modify
+static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
+{
+	return newprot;
+}
+#endif
+
+/*
  * vm_fault is filled by the the pagefault handler and passed to the vma's
  * ->fault function. The vma's ->fault is responsible for returning a bitmask
  * of VM_FAULT_xxx flags that give details about how the fault was handled.
Index: tip/mm/mprotect.c
===================================================================
--- tip.orig/mm/mprotect.c
+++ tip/mm/mprotect.c
@@ -28,13 +28,6 @@
 #include <asm/cacheflush.h>
 #include <asm/tlbflush.h>
 
-#ifndef pgprot_modify
-static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
-{
-	return newprot;
-}
-#endif
-
 static void change_pte_range(struct mm_struct *mm, pmd_t *pmd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
 		int dirty_accountable)




* [PATCH 10/31] mm/mpol: Remove NUMA_INTERLEAVE_HIT
  2012-10-25 12:16 [PATCH 00/31] numa/core patches Peter Zijlstra
                   ` (8 preceding siblings ...)
  2012-10-25 12:16 ` [PATCH 09/31] mm/pgprot: Move the pgprot_modify() fallback definition to mm.h Peter Zijlstra
@ 2012-10-25 12:16 ` Peter Zijlstra
  2012-10-25 20:58   ` Andi Kleen
  2012-10-25 12:16 ` [PATCH 11/31] mm/mpol: Make MPOL_LOCAL a real policy Peter Zijlstra
                   ` (23 subsequent siblings)
  33 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2012-10-25 12:16 UTC (permalink / raw)
  To: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Zijlstra, Lee Schermerhorn, Ingo Molnar

[-- Attachment #1: 0010-mm-mpol-Remove-NUMA_INTERLEAVE_HIT.patch --]
[-- Type: text/plain, Size: 5522 bytes --]

The NUMA_INTERLEAVE_HIT statistic is useless on its own; it wants to be
compared to either a total of interleave allocations or to a miss count.
Remove it.

Fixing it would be possible, but since we've gone years without these
statistics I figure we can continue that way.

Also NUMA_HIT fully includes NUMA_INTERLEAVE_HIT so users might
switch to using that.

This cleans up some of the weird MPOL_INTERLEAVE allocation exceptions.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 drivers/base/node.c    |    2 -
 include/linux/mmzone.h |    1 
 mm/mempolicy.c         |   68 +++++++++++++++----------------------------------
 mm/vmstat.c            |    1 
 4 files changed, 22 insertions(+), 50 deletions(-)

Index: tip/drivers/base/node.c
===================================================================
--- tip.orig/drivers/base/node.c
+++ tip/drivers/base/node.c
@@ -169,7 +169,7 @@ static ssize_t node_read_numastat(struct
 		       node_page_state(dev->id, NUMA_HIT),
 		       node_page_state(dev->id, NUMA_MISS),
 		       node_page_state(dev->id, NUMA_FOREIGN),
-		       node_page_state(dev->id, NUMA_INTERLEAVE_HIT),
+		       0UL,
 		       node_page_state(dev->id, NUMA_LOCAL),
 		       node_page_state(dev->id, NUMA_OTHER));
 }
Index: tip/include/linux/mmzone.h
===================================================================
--- tip.orig/include/linux/mmzone.h
+++ tip/include/linux/mmzone.h
@@ -137,7 +137,6 @@ enum zone_stat_item {
 	NUMA_HIT,		/* allocated in intended node */
 	NUMA_MISS,		/* allocated in non intended node */
 	NUMA_FOREIGN,		/* was intended here, hit elsewhere */
-	NUMA_INTERLEAVE_HIT,	/* interleaver preferred this zone */
 	NUMA_LOCAL,		/* allocation from local node */
 	NUMA_OTHER,		/* allocation from other node */
 #endif
Index: tip/mm/mempolicy.c
===================================================================
--- tip.orig/mm/mempolicy.c
+++ tip/mm/mempolicy.c
@@ -1587,11 +1587,29 @@ static nodemask_t *policy_nodemask(gfp_t
 	return NULL;
 }
 
+/* Do dynamic interleaving for a process */
+static unsigned interleave_nodes(struct mempolicy *policy)
+{
+	unsigned nid, next;
+	struct task_struct *me = current;
+
+	nid = me->il_next;
+	next = next_node(nid, policy->v.nodes);
+	if (next >= MAX_NUMNODES)
+		next = first_node(policy->v.nodes);
+	if (next < MAX_NUMNODES)
+		me->il_next = next;
+	return nid;
+}
+
 /* Return a zonelist indicated by gfp for node representing a mempolicy */
 static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
 	int nd)
 {
 	switch (policy->mode) {
+	case MPOL_INTERLEAVE:
+		nd = interleave_nodes(policy);
+		break;
 	case MPOL_PREFERRED:
 		if (!(policy->flags & MPOL_F_LOCAL))
 			nd = policy->v.preferred_node;
@@ -1613,21 +1631,6 @@ static struct zonelist *policy_zonelist(
 	return node_zonelist(nd, gfp);
 }
 
-/* Do dynamic interleaving for a process */
-static unsigned interleave_nodes(struct mempolicy *policy)
-{
-	unsigned nid, next;
-	struct task_struct *me = current;
-
-	nid = me->il_next;
-	next = next_node(nid, policy->v.nodes);
-	if (next >= MAX_NUMNODES)
-		next = first_node(policy->v.nodes);
-	if (next < MAX_NUMNODES)
-		me->il_next = next;
-	return nid;
-}
-
 /*
  * Depending on the memory policy provide a node from which to allocate the
  * next slab entry.
@@ -1864,21 +1867,6 @@ out:
 	return ret;
 }
 
-/* Allocate a page in interleaved policy.
-   Own path because it needs to do special accounting. */
-static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
-					unsigned nid)
-{
-	struct zonelist *zl;
-	struct page *page;
-
-	zl = node_zonelist(nid, gfp);
-	page = __alloc_pages(gfp, order, zl);
-	if (page && page_zone(page) == zonelist_zone(&zl->_zonerefs[0]))
-		inc_zone_page_state(page, NUMA_INTERLEAVE_HIT);
-	return page;
-}
-
 /**
  * 	alloc_pages_vma	- Allocate a page for a VMA.
  *
@@ -1915,17 +1903,6 @@ retry_cpuset:
 	pol = get_vma_policy(current, vma, addr);
 	cpuset_mems_cookie = get_mems_allowed();
 
-	if (unlikely(pol->mode == MPOL_INTERLEAVE)) {
-		unsigned nid;
-
-		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
-		mpol_cond_put(pol);
-		page = alloc_page_interleave(gfp, order, nid);
-		if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
-			goto retry_cpuset;
-
-		return page;
-	}
 	zl = policy_zonelist(gfp, pol, node);
 	if (unlikely(mpol_needs_cond_ref(pol))) {
 		/*
@@ -1983,12 +1960,9 @@ retry_cpuset:
 	 * No reference counting needed for current->mempolicy
 	 * nor system default_policy
 	 */
-	if (pol->mode == MPOL_INTERLEAVE)
-		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
-	else
-		page = __alloc_pages_nodemask(gfp, order,
-				policy_zonelist(gfp, pol, numa_node_id()),
-				policy_nodemask(gfp, pol));
+	page = __alloc_pages_nodemask(gfp, order,
+			policy_zonelist(gfp, pol, numa_node_id()),
+			policy_nodemask(gfp, pol));
 
 	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
 		goto retry_cpuset;
Index: tip/mm/vmstat.c
===================================================================
--- tip.orig/mm/vmstat.c
+++ tip/mm/vmstat.c
@@ -729,7 +729,6 @@ const char * const vmstat_text[] = {
 	"numa_hit",
 	"numa_miss",
 	"numa_foreign",
-	"numa_interleave",
 	"numa_local",
 	"numa_other",
 #endif




* [PATCH 11/31] mm/mpol: Make MPOL_LOCAL a real policy
  2012-10-25 12:16 [PATCH 00/31] numa/core patches Peter Zijlstra
                   ` (9 preceding siblings ...)
  2012-10-25 12:16 ` [PATCH 10/31] mm/mpol: Remove NUMA_INTERLEAVE_HIT Peter Zijlstra
@ 2012-10-25 12:16 ` Peter Zijlstra
  2012-11-01 10:58   ` Mel Gorman
  2012-10-25 12:16 ` [PATCH 12/31] mm/mpol: Add MPOL_MF_NOOP Peter Zijlstra
                   ` (22 subsequent siblings)
  33 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2012-10-25 12:16 UTC (permalink / raw)
  To: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton
  Cc: linux-kernel, linux-mm, Lee Schermerhorn, Peter Zijlstra, Ingo Molnar

[-- Attachment #1: 0011-mm-mpol-Make-MPOL_LOCAL-a-real-policy.patch --]
[-- Type: text/plain, Size: 2129 bytes --]

Make MPOL_LOCAL a real and exposed policy such that applications that
relied on the previous default behaviour can explicitly request it.

Requested-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/uapi/linux/mempolicy.h |    1 +
 mm/mempolicy.c                 |    9 ++++++---
 2 files changed, 7 insertions(+), 3 deletions(-)

Index: tip/include/uapi/linux/mempolicy.h
===================================================================
--- tip.orig/include/uapi/linux/mempolicy.h
+++ tip/include/uapi/linux/mempolicy.h
@@ -20,6 +20,7 @@ enum {
 	MPOL_PREFERRED,
 	MPOL_BIND,
 	MPOL_INTERLEAVE,
+	MPOL_LOCAL,
 	MPOL_MAX,	/* always last member of enum */
 };
 
Index: tip/mm/mempolicy.c
===================================================================
--- tip.orig/mm/mempolicy.c
+++ tip/mm/mempolicy.c
@@ -269,6 +269,10 @@ static struct mempolicy *mpol_new(unsign
 			     (flags & MPOL_F_RELATIVE_NODES)))
 				return ERR_PTR(-EINVAL);
 		}
+	} else if (mode == MPOL_LOCAL) {
+		if (!nodes_empty(*nodes))
+			return ERR_PTR(-EINVAL);
+		mode = MPOL_PREFERRED;
 	} else if (nodes_empty(*nodes))
 		return ERR_PTR(-EINVAL);
 	policy = kmem_cache_alloc(policy_cache, GFP_KERNEL);
@@ -2371,7 +2375,6 @@ void numa_default_policy(void)
  * "local" is pseudo-policy:  MPOL_PREFERRED with MPOL_F_LOCAL flag
  * Used only for mpol_parse_str() and mpol_to_str()
  */
-#define MPOL_LOCAL MPOL_MAX
 static const char * const policy_modes[] =
 {
 	[MPOL_DEFAULT]    = "default",
@@ -2424,12 +2427,12 @@ int mpol_parse_str(char *str, struct mem
 	if (flags)
 		*flags++ = '\0';	/* terminate mode string */
 
-	for (mode = 0; mode <= MPOL_LOCAL; mode++) {
+	for (mode = 0; mode < MPOL_MAX; mode++) {
 		if (!strcmp(str, policy_modes[mode])) {
 			break;
 		}
 	}
-	if (mode > MPOL_LOCAL)
+	if (mode >= MPOL_MAX)
 		goto out;
 
 	switch (mode) {
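
Illustration only, not part of the patch: with MPOL_LOCAL exposed, an
application can request node-local allocation explicitly via
set_mempolicy(). The fallback define assumes the value MPOL_LOCAL gets in
this series; released <numaif.h> headers may not carry it.

/* build: cc -o local local.c -lnuma */
#include <stdio.h>
#include <numaif.h>		/* set_mempolicy() */

#ifndef MPOL_LOCAL
#define MPOL_LOCAL 4		/* assumed value after this patch */
#endif

int main(void)
{
	/* allocate future memory from the node the task is running on */
	if (set_mempolicy(MPOL_LOCAL, NULL, 0) != 0)
		perror("set_mempolicy(MPOL_LOCAL)");
	return 0;
}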




* [PATCH 12/31] mm/mpol: Add MPOL_MF_NOOP
  2012-10-25 12:16 [PATCH 00/31] numa/core patches Peter Zijlstra
                   ` (10 preceding siblings ...)
  2012-10-25 12:16 ` [PATCH 11/31] mm/mpol: Make MPOL_LOCAL a real policy Peter Zijlstra
@ 2012-10-25 12:16 ` Peter Zijlstra
  2012-11-01 11:10   ` Mel Gorman
  2012-10-25 12:16 ` [PATCH 13/31] mm/mpol: Check for misplaced page Peter Zijlstra
                   ` (21 subsequent siblings)
  33 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2012-10-25 12:16 UTC (permalink / raw)
  To: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton
  Cc: linux-kernel, linux-mm, Lee Schermerhorn, Peter Zijlstra, Ingo Molnar

[-- Attachment #1: 0012-mm-mpol-Add-MPOL_MF_NOOP.patch --]
[-- Type: text/plain, Size: 3046 bytes --]

From: Lee Schermerhorn <lee.schermerhorn@hp.com>

This patch augments the MPOL_MF_LAZY feature by adding a "NOOP" policy
to mbind().  When the NOOP policy is used with the 'MOVE' and 'LAZY'
flags, mbind() will map the pages PROT_NONE so that they will be
migrated on the next touch.

This allows an application to prepare for a new phase of operation
where different regions of shared storage will be assigned to
worker threads, w/o changing policy.  Note that we could just use
"default" policy in this case.  However, this also allows an
application to request that pages be migrated, only if necessary,
to follow any arbitrary policy that might currently apply to a
range of pages, without knowing the policy, or without specifying
multiple mbind()s for ranges with different policies.

[ Bug in early version of mpol_parse_str() reported by Fengguang Wu. ]

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/uapi/linux/mempolicy.h |    1 +
 mm/mempolicy.c                 |   11 ++++++-----
 2 files changed, 7 insertions(+), 5 deletions(-)

Index: tip/include/uapi/linux/mempolicy.h
===================================================================
--- tip.orig/include/uapi/linux/mempolicy.h
+++ tip/include/uapi/linux/mempolicy.h
@@ -21,6 +21,7 @@ enum {
 	MPOL_BIND,
 	MPOL_INTERLEAVE,
 	MPOL_LOCAL,
+	MPOL_NOOP,		/* retain existing policy for range */
 	MPOL_MAX,	/* always last member of enum */
 };
 
Index: tip/mm/mempolicy.c
===================================================================
--- tip.orig/mm/mempolicy.c
+++ tip/mm/mempolicy.c
@@ -251,10 +251,10 @@ static struct mempolicy *mpol_new(unsign
 	pr_debug("setting mode %d flags %d nodes[0] %lx\n",
 		 mode, flags, nodes ? nodes_addr(*nodes)[0] : -1);
 
-	if (mode == MPOL_DEFAULT) {
+	if (mode == MPOL_DEFAULT || mode == MPOL_NOOP) {
 		if (nodes && !nodes_empty(*nodes))
 			return ERR_PTR(-EINVAL);
-		return NULL;	/* simply delete any existing policy */
+		return NULL;
 	}
 	VM_BUG_ON(!nodes);
 
@@ -1146,7 +1146,7 @@ static long do_mbind(unsigned long start
 	if (start & ~PAGE_MASK)
 		return -EINVAL;
 
-	if (mode == MPOL_DEFAULT)
+	if (mode == MPOL_DEFAULT || mode == MPOL_NOOP)
 		flags &= ~MPOL_MF_STRICT;
 
 	len = (len + PAGE_SIZE - 1) & PAGE_MASK;
@@ -2381,7 +2381,8 @@ static const char * const policy_modes[]
 	[MPOL_PREFERRED]  = "prefer",
 	[MPOL_BIND]       = "bind",
 	[MPOL_INTERLEAVE] = "interleave",
-	[MPOL_LOCAL]      = "local"
+	[MPOL_LOCAL]      = "local",
+	[MPOL_NOOP]	  = "noop",	/* should not actually be used */
 };
 
 
@@ -2432,7 +2433,7 @@ int mpol_parse_str(char *str, struct mem
 			break;
 		}
 	}
-	if (mode >= MPOL_MAX)
+	if (mode >= MPOL_MAX || mode == MPOL_NOOP)
 		goto out;
 
 	switch (mode) {




* [PATCH 13/31] mm/mpol: Check for misplaced page
  2012-10-25 12:16 [PATCH 00/31] numa/core patches Peter Zijlstra
                   ` (11 preceding siblings ...)
  2012-10-25 12:16 ` [PATCH 12/31] mm/mpol: Add MPOL_MF_NOOP Peter Zijlstra
@ 2012-10-25 12:16 ` Peter Zijlstra
  2012-10-25 12:16 ` [PATCH 14/31] mm/mpol: Create special PROT_NONE infrastructure Peter Zijlstra
                   ` (20 subsequent siblings)
  33 siblings, 0 replies; 135+ messages in thread
From: Peter Zijlstra @ 2012-10-25 12:16 UTC (permalink / raw)
  To: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton
  Cc: linux-kernel, linux-mm, Lee Schermerhorn, Peter Zijlstra, Ingo Molnar

[-- Attachment #1: 0013-mm-mpol-Check-for-misplaced-page.patch --]
[-- Type: text/plain, Size: 5000 bytes --]

From: Lee Schermerhorn <lee.schermerhorn@hp.com>

This patch provides a new function to test whether a page resides
on a node that is appropriate for the mempolicy for the vma and
address where the page is supposed to be mapped.  This involves
looking up the node where the page belongs.  So, the function
returns that node so that it may be used to allocate the page
without consulting the policy again.

A subsequent patch will call this function from the fault path.
Because of this, I don't want to go ahead and allocate the page, e.g.,
via alloc_page_vma() only to have to free it if it has the correct
policy.  So, I just mimic the alloc_page_vma() node computation
logic--sort of.

Note:  we could use this function to implement a MPOL_MF_STRICT
behavior when migrating pages to match mbind() mempolicy--e.g.,
to ensure that pages in an interleaved range are reinterleaved
rather than left where they are when they reside on any node in
the interleave nodemask.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
[ Added MPOL_F_LAZY to trigger migrate-on-fault;
  simplified code now that we don't have to bother
  with special crap for interleaved ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/mempolicy.h      |    8 ++++
 include/uapi/linux/mempolicy.h |    1 
 mm/mempolicy.c                 |   76 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 85 insertions(+)

Index: tip/include/linux/mempolicy.h
===================================================================
--- tip.orig/include/linux/mempolicy.h
+++ tip/include/linux/mempolicy.h
@@ -198,6 +198,8 @@ static inline int vma_migratable(struct
 	return 1;
 }
 
+extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long);
+
 #else
 
 struct mempolicy {};
@@ -323,5 +325,11 @@ static inline int mpol_to_str(char *buff
 	return 0;
 }
 
+static inline int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
+				 unsigned long address)
+{
+	return -1; /* no node preference */
+}
+
 #endif /* CONFIG_NUMA */
 #endif
Index: tip/include/uapi/linux/mempolicy.h
===================================================================
--- tip.orig/include/uapi/linux/mempolicy.h
+++ tip/include/uapi/linux/mempolicy.h
@@ -61,6 +61,7 @@ enum mpol_rebind_step {
 #define MPOL_F_SHARED  (1 << 0)	/* identify shared policies */
 #define MPOL_F_LOCAL   (1 << 1)	/* preferred local allocation */
 #define MPOL_F_REBINDING (1 << 2)	/* identify policies in rebinding */
+#define MPOL_F_MOF	(1 << 3) /* this policy wants migrate on fault */
 
 
 #endif /* _UAPI_LINUX_MEMPOLICY_H */
Index: tip/mm/mempolicy.c
===================================================================
--- tip.orig/mm/mempolicy.c
+++ tip/mm/mempolicy.c
@@ -2153,6 +2153,82 @@ static void sp_free(struct sp_node *n)
 	kmem_cache_free(sn_cache, n);
 }
 
+/**
+ * mpol_misplaced - check whether current page node is valid in policy
+ *
+ * @page   - page to be checked
+ * @vma    - vm area where page mapped
+ * @addr   - virtual address where page mapped
+ *
+ * Lookup current policy node id for vma,addr and "compare to" page's
+ * node id.
+ *
+ * Returns:
+ *	-1	- not misplaced, page is in the right node
+ *	node	- node id where the page should be
+ *
+ * Policy determination "mimics" alloc_page_vma().
+ * Called from fault path where we know the vma and faulting address.
+ */
+int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr)
+{
+	struct mempolicy *pol;
+	struct zone *zone;
+	int curnid = page_to_nid(page);
+	unsigned long pgoff;
+	int polnid = -1;
+	int ret = -1;
+
+	BUG_ON(!vma);
+
+	pol = get_vma_policy(current, vma, addr);
+	if (!(pol->flags & MPOL_F_MOF))
+		goto out;
+
+	switch (pol->mode) {
+	case MPOL_INTERLEAVE:
+		BUG_ON(addr >= vma->vm_end);
+		BUG_ON(addr < vma->vm_start);
+
+		pgoff = vma->vm_pgoff;
+		pgoff += (addr - vma->vm_start) >> PAGE_SHIFT;
+		polnid = offset_il_node(pol, vma, pgoff);
+		break;
+
+	case MPOL_PREFERRED:
+		if (pol->flags & MPOL_F_LOCAL)
+			polnid = numa_node_id();
+		else
+			polnid = pol->v.preferred_node;
+		break;
+
+	case MPOL_BIND:
+		/*
+		 * allows binding to multiple nodes.
+		 * use current page if in policy nodemask,
+		 * else select nearest allowed node, if any.
+		 * If no allowed nodes, use current [!misplaced].
+		 */
+		if (node_isset(curnid, pol->v.nodes))
+			goto out;
+		(void)first_zones_zonelist(
+				node_zonelist(numa_node_id(), GFP_HIGHUSER),
+				gfp_zone(GFP_HIGHUSER),
+				&pol->v.nodes, &zone);
+		polnid = zone->node;
+		break;
+
+	default:
+		BUG();
+	}
+	if (curnid != polnid)
+		ret = polnid;
+out:
+	mpol_cond_put(pol);
+
+	return ret;
+}
+
 static void sp_delete(struct shared_policy *sp, struct sp_node *n)
 {
 	pr_debug("deleting %lx-l%lx\n", n->start, n->end);




* [PATCH 14/31] mm/mpol: Create special PROT_NONE infrastructure
  2012-10-25 12:16 [PATCH 00/31] numa/core patches Peter Zijlstra
                   ` (12 preceding siblings ...)
  2012-10-25 12:16 ` [PATCH 13/31] mm/mpol: Check for misplaced page Peter Zijlstra
@ 2012-10-25 12:16 ` Peter Zijlstra
  2012-11-01 11:51   ` Mel Gorman
  2012-10-25 12:16 ` [PATCH 15/31] mm/mpol: Add MPOL_MF_LAZY Peter Zijlstra
                   ` (19 subsequent siblings)
  33 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2012-10-25 12:16 UTC (permalink / raw)
  To: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Zijlstra, Paul Turner, Ingo Molnar

[-- Attachment #1: 0014-mm-mpol-Create-special-PROT_NONE-infrastructure.patch --]
[-- Type: text/plain, Size: 9999 bytes --]

In order to facilitate a lazy -- fault driven -- migration of pages,
create a special transient PROT_NONE variant; we can then use the
resulting 'spurious' protection faults to drive our migrations.

Pages that already had an effective PROT_NONE mapping will not be
detected as generating these 'spurious' faults, for the simple reason
that we cannot distinguish them by their protection bits; see
pte_numa().

This isn't a problem since PROT_NONE (and possibly PROT_WRITE with
dirty tracking) aren't used or are rare enough for us to not care
about their placement.
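
For illustration, this is how the two halves introduced below are meant
to fit together: the write side that knocks the protections out and the
read side that recognises the resulting faults. A condensed sketch, not
literal kernel code; the helpers are added in the include/linux/mm.h and
mm/memory.c hunks, and the first real change_prot_none() caller arrives
with the MPOL_MF_LAZY patch that follows:

	/* write side: strip READ|WRITE|EXEC so the next access faults */
	change_prot_none(vma, start, end);	/* uses vma_prot_none(vma) */

	/* read side, from handle_pte_fault(): is this one of 'our' faults? */
	if (pte_numa(vma, entry))
		return do_numa_page(mm, vma, address, pte, pmd, flags, entry);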

Suggested-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Paul Turner <pjt@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
[ fixed various cross-arch and THP/!THP details ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/huge_mm.h |   19 ++++++++++++
 include/linux/mm.h      |   18 +++++++++++
 mm/huge_memory.c        |   32 ++++++++++++++++++++
 mm/memory.c             |   75 +++++++++++++++++++++++++++++++++++++++++++-----
 mm/mprotect.c           |   24 ++++++++++-----
 5 files changed, 154 insertions(+), 14 deletions(-)

Index: tip/include/linux/huge_mm.h
===================================================================
--- tip.orig/include/linux/huge_mm.h
+++ tip/include/linux/huge_mm.h
@@ -159,6 +159,13 @@ static inline struct page *compound_tran
 	}
 	return page;
 }
+
+extern bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd);
+
+extern void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+				  unsigned long address, pmd_t *pmd,
+				  unsigned int flags, pmd_t orig_pmd);
+
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
 #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
 #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
@@ -195,6 +202,18 @@ static inline int pmd_trans_huge_lock(pm
 {
 	return 0;
 }
+
+static inline bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd)
+{
+	return false;
+}
+
+static inline void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+				  unsigned long address, pmd_t *pmd,
+				  unsigned int flags, pmd_t orig_pmd)
+{
+}
+
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #endif /* _LINUX_HUGE_MM_H */
Index: tip/include/linux/mm.h
===================================================================
--- tip.orig/include/linux/mm.h
+++ tip/include/linux/mm.h
@@ -1091,6 +1091,9 @@ extern unsigned long move_page_tables(st
 extern unsigned long do_mremap(unsigned long addr,
 			       unsigned long old_len, unsigned long new_len,
 			       unsigned long flags, unsigned long new_addr);
+extern void change_protection(struct vm_area_struct *vma, unsigned long start,
+			      unsigned long end, pgprot_t newprot,
+			      int dirty_accountable);
 extern int mprotect_fixup(struct vm_area_struct *vma,
 			  struct vm_area_struct **pprev, unsigned long start,
 			  unsigned long end, unsigned long newflags);
@@ -1561,6 +1564,21 @@ static inline pgprot_t vm_get_page_prot(
 }
 #endif
 
+static inline pgprot_t vma_prot_none(struct vm_area_struct *vma)
+{
+	/*
+	 * obtain PROT_NONE by removing READ|WRITE|EXEC privs
+	 */
+	vm_flags_t vmflags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
+	return pgprot_modify(vma->vm_page_prot, vm_get_page_prot(vmflags));
+}
+
+static inline void
+change_prot_none(struct vm_area_struct *vma, unsigned long start, unsigned long end)
+{
+	change_protection(vma, start, end, vma_prot_none(vma), 0);
+}
+
 struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
 int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
 			unsigned long pfn, unsigned long size, pgprot_t);
Index: tip/mm/huge_memory.c
===================================================================
--- tip.orig/mm/huge_memory.c
+++ tip/mm/huge_memory.c
@@ -725,6 +725,38 @@ out:
 	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
 }
 
+bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd)
+{
+	/*
+	 * See pte_numa().
+	 */
+	if (pmd_same(pmd, pmd_modify(pmd, vma->vm_page_prot)))
+		return false;
+
+	return pmd_same(pmd, pmd_modify(pmd, vma_prot_none(vma)));
+}
+
+void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			   unsigned long address, pmd_t *pmd,
+			   unsigned int flags, pmd_t entry)
+{
+	unsigned long haddr = address & HPAGE_PMD_MASK;
+
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(*pmd, entry)))
+		goto out_unlock;
+
+	/* do fancy stuff */
+
+	/* change back to regular protection */
+	entry = pmd_modify(entry, vma->vm_page_prot);
+	if (pmdp_set_access_flags(vma, haddr, pmd, entry, 1))
+		update_mmu_cache_pmd(vma, address, entry);
+
+out_unlock:
+	spin_unlock(&mm->page_table_lock);
+}
+
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
 		  struct vm_area_struct *vma)
Index: tip/mm/memory.c
===================================================================
--- tip.orig/mm/memory.c
+++ tip/mm/memory.c
@@ -1464,6 +1464,25 @@ int zap_vma_ptes(struct vm_area_struct *
 }
 EXPORT_SYMBOL_GPL(zap_vma_ptes);
 
+static bool pte_numa(struct vm_area_struct *vma, pte_t pte)
+{
+	/*
+	 * If we have the normal vma->vm_page_prot protections we're not a
+	 * 'special' PROT_NONE page.
+	 *
+	 * This means we cannot get 'special' PROT_NONE faults from genuine
+	 * PROT_NONE maps, nor from PROT_WRITE file maps that do dirty
+	 * tracking.
+	 *
+	 * Neither case is really interesting for our current use though so we
+	 * don't care.
+	 */
+	if (pte_same(pte, pte_modify(pte, vma->vm_page_prot)))
+		return false;
+
+	return pte_same(pte, pte_modify(pte, vma_prot_none(vma)));
+}
+
 /**
  * follow_page - look up a page descriptor from a user-virtual address
  * @vma: vm_area_struct mapping @address
@@ -3433,6 +3452,41 @@ static int do_nonlinear_fault(struct mm_
 	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
 }
 
+static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long address, pte_t *ptep, pmd_t *pmd,
+			unsigned int flags, pte_t entry)
+{
+	spinlock_t *ptl;
+	int ret = 0;
+
+	if (!pte_unmap_same(mm, pmd, ptep, entry))
+		goto out;
+
+	/*
+	 * Do fancy stuff...
+	 */
+
+	/*
+	 * OK, nothing to do,.. change the protection back to what it
+	 * ought to be.
+	 */
+	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+	if (unlikely(!pte_same(*ptep, entry)))
+		goto unlock;
+
+	flush_cache_page(vma, address, pte_pfn(entry));
+
+	ptep_modify_prot_start(mm, address, ptep);
+	entry = pte_modify(entry, vma->vm_page_prot);
+	ptep_modify_prot_commit(mm, address, ptep, entry);
+
+	update_mmu_cache(vma, address, ptep);
+unlock:
+	pte_unmap_unlock(ptep, ptl);
+out:
+	return ret;
+}
+
 /*
  * These routines also need to handle stuff like marking pages dirty
  * and/or accessed for architectures that don't do it in hardware (most
@@ -3471,6 +3525,9 @@ int handle_pte_fault(struct mm_struct *m
 					pte, pmd, flags, entry);
 	}
 
+	if (pte_numa(vma, entry))
+		return do_numa_page(mm, vma, address, pte, pmd, flags, entry);
+
 	ptl = pte_lockptr(mm, pmd);
 	spin_lock(ptl);
 	if (unlikely(!pte_same(*pte, entry)))
@@ -3535,13 +3592,16 @@ retry:
 							  pmd, flags);
 	} else {
 		pmd_t orig_pmd = *pmd;
-		int ret;
+		int ret = 0;
 
 		barrier();
-		if (pmd_trans_huge(orig_pmd)) {
-			if (flags & FAULT_FLAG_WRITE &&
-			    !pmd_write(orig_pmd) &&
-			    !pmd_trans_splitting(orig_pmd)) {
+		if (pmd_trans_huge(orig_pmd) && !pmd_trans_splitting(orig_pmd)) {
+			if (pmd_numa(vma, orig_pmd)) {
+				do_huge_pmd_numa_page(mm, vma, address, pmd,
+						      flags, orig_pmd);
+			}
+
+			if ((flags & FAULT_FLAG_WRITE) && !pmd_write(orig_pmd)) {
 				ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
 							  orig_pmd);
 				/*
@@ -3551,12 +3611,13 @@ retry:
 				 */
 				if (unlikely(ret & VM_FAULT_OOM))
 					goto retry;
-				return ret;
 			}
-			return 0;
+
+			return ret;
 		}
 	}
 
+
 	/*
 	 * Use __pte_alloc instead of pte_alloc_map, because we can't
 	 * run pte_offset_map on the pmd, if an huge pmd could
Index: tip/mm/mprotect.c
===================================================================
--- tip.orig/mm/mprotect.c
+++ tip/mm/mprotect.c
@@ -112,7 +112,7 @@ static inline void change_pud_range(stru
 	} while (pud++, addr = next, addr != end);
 }
 
-static void change_protection(struct vm_area_struct *vma,
+static void change_protection_range(struct vm_area_struct *vma,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
 		int dirty_accountable)
 {
@@ -134,6 +134,20 @@ static void change_protection(struct vm_
 	flush_tlb_range(vma, start, end);
 }
 
+void change_protection(struct vm_area_struct *vma, unsigned long start,
+		       unsigned long end, pgprot_t newprot,
+		       int dirty_accountable)
+{
+	struct mm_struct *mm = vma->vm_mm;
+
+	mmu_notifier_invalidate_range_start(mm, start, end);
+	if (is_vm_hugetlb_page(vma))
+		hugetlb_change_protection(vma, start, end, newprot);
+	else
+		change_protection_range(vma, start, end, newprot, dirty_accountable);
+	mmu_notifier_invalidate_range_end(mm, start, end);
+}
+
 int
 mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 	unsigned long start, unsigned long end, unsigned long newflags)
@@ -206,12 +220,8 @@ success:
 		dirty_accountable = 1;
 	}
 
-	mmu_notifier_invalidate_range_start(mm, start, end);
-	if (is_vm_hugetlb_page(vma))
-		hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
-	else
-		change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
-	mmu_notifier_invalidate_range_end(mm, start, end);
+	change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
+
 	vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
 	vm_stat_account(mm, newflags, vma->vm_file, nrpages);
 	perf_event_mmap(vma);



^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 15/31] mm/mpol: Add MPOL_MF_LAZY
  2012-10-25 12:16 [PATCH 00/31] numa/core patches Peter Zijlstra
                   ` (13 preceding siblings ...)
  2012-10-25 12:16 ` [PATCH 14/31] mm/mpol: Create special PROT_NONE infrastructure Peter Zijlstra
@ 2012-10-25 12:16 ` Peter Zijlstra
  2012-11-01 12:01   ` Mel Gorman
  2012-10-25 12:16 ` [PATCH 16/31] numa, mm: Support NUMA hinting page faults from gup/gup_fast Peter Zijlstra
                   ` (18 subsequent siblings)
  33 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2012-10-25 12:16 UTC (permalink / raw)
  To: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton
  Cc: linux-kernel, linux-mm, Lee Schermerhorn, Lee Schermerhorn,
	Peter Zijlstra, Ingo Molnar

[-- Attachment #1: 0015-mm-mpol-Add-MPOL_MF_LAZY.patch --]
[-- Type: text/plain, Size: 5340 bytes --]

From: Lee Schermerhorn <lee.schermerhorn@hp.com>

This patch adds another mbind() flag to request "lazy migration".  The
flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected
pages are marked PROT_NONE. The pages will be migrated in the fault
path on "first touch", if the policy dictates at that time.

"Lazy Migration" will allow testing of migrate-on-fault via mbind().
It also allows applications to specify that only subsequently touched
pages be migrated to obey the new policy, instead of all pages in the
range. This can be useful for multi-threaded applications working on a
large shared data area that is initialized by an initial thread,
resulting in all pages being on one [or a few, if overflowed] nodes.
After the PROT_NONE marking, the pages in regions assigned to the
worker threads will be automatically migrated local to the threads on
first touch.
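
As a usage illustration, a minimal userspace sketch of requesting lazy
migration with the new flag (build against libnuma's <numaif.h> and link
with -lnuma; MPOL_MF_LAZY is not in older installed headers, so its value
is taken from the uapi hunk below):

#include <numaif.h>		/* mbind(), MPOL_BIND, MPOL_MF_MOVE */
#include <stdio.h>

#ifndef MPOL_MF_LAZY
#define MPOL_MF_LAZY	(1 << 3)	/* matches the new uapi flag */
#endif

/* Re-bind [buf, buf+len) to 'node'; pages only move on first touch. */
int rebind_lazy(void *buf, unsigned long len, int node)
{
	unsigned long nodemask = 1UL << node;

	if (mbind(buf, len, MPOL_BIND, &nodemask, 8 * sizeof(nodemask) + 1,
		  MPOL_MF_MOVE | MPOL_MF_LAZY)) {
		perror("mbind");
		return -1;
	}
	return 0;
}

Without MPOL_MF_LAZY the same call would migrate the whole range
synchronously; with it the range is merely marked PROT_NONE and each page
is migrated when it is next touched, if the policy still says it is
misplaced at that time.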

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
[ nearly complete rewrite.. ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/uapi/linux/mempolicy.h |   13 ++++++++--
 mm/mempolicy.c                 |   49 ++++++++++++++++++++++++++---------------
 2 files changed, 42 insertions(+), 20 deletions(-)

Index: tip/include/uapi/linux/mempolicy.h
===================================================================
--- tip.orig/include/uapi/linux/mempolicy.h
+++ tip/include/uapi/linux/mempolicy.h
@@ -49,9 +49,16 @@ enum mpol_rebind_step {
 
 /* Flags for mbind */
 #define MPOL_MF_STRICT	(1<<0)	/* Verify existing pages in the mapping */
-#define MPOL_MF_MOVE	(1<<1)	/* Move pages owned by this process to conform to mapping */
-#define MPOL_MF_MOVE_ALL (1<<2)	/* Move every page to conform to mapping */
-#define MPOL_MF_INTERNAL (1<<3)	/* Internal flags start here */
+#define MPOL_MF_MOVE	 (1<<1)	/* Move pages owned by this process to conform
+				   to policy */
+#define MPOL_MF_MOVE_ALL (1<<2)	/* Move every page to conform to policy */
+#define MPOL_MF_LAZY	 (1<<3)	/* Modifies '_MOVE:  lazy migrate on fault */
+#define MPOL_MF_INTERNAL (1<<4)	/* Internal flags start here */
+
+#define MPOL_MF_VALID	(MPOL_MF_STRICT   | 	\
+			 MPOL_MF_MOVE     | 	\
+			 MPOL_MF_MOVE_ALL |	\
+			 MPOL_MF_LAZY)
 
 /*
  * Internal flags that share the struct mempolicy flags word with
Index: tip/mm/mempolicy.c
===================================================================
--- tip.orig/mm/mempolicy.c
+++ tip/mm/mempolicy.c
@@ -583,22 +583,32 @@ check_range(struct mm_struct *mm, unsign
 		return ERR_PTR(-EFAULT);
 	prev = NULL;
 	for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
+		unsigned long endvma = vma->vm_end;
+
+		if (endvma > end)
+			endvma = end;
+		if (vma->vm_start > start)
+			start = vma->vm_start;
+
 		if (!(flags & MPOL_MF_DISCONTIG_OK)) {
 			if (!vma->vm_next && vma->vm_end < end)
 				return ERR_PTR(-EFAULT);
 			if (prev && prev->vm_end < vma->vm_start)
 				return ERR_PTR(-EFAULT);
 		}
-		if (!is_vm_hugetlb_page(vma) &&
-		    ((flags & MPOL_MF_STRICT) ||
+
+		if (is_vm_hugetlb_page(vma))
+			goto next;
+
+		if (flags & MPOL_MF_LAZY) {
+			change_prot_none(vma, start, endvma);
+			goto next;
+		}
+
+		if ((flags & MPOL_MF_STRICT) ||
 		     ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
-				vma_migratable(vma)))) {
-			unsigned long endvma = vma->vm_end;
+		      vma_migratable(vma))) {
 
-			if (endvma > end)
-				endvma = end;
-			if (vma->vm_start > start)
-				start = vma->vm_start;
 			err = check_pgd_range(vma, start, endvma, nodes,
 						flags, private);
 			if (err) {
@@ -606,6 +616,7 @@ check_range(struct mm_struct *mm, unsign
 				break;
 			}
 		}
+next:
 		prev = vma;
 	}
 	return first;
@@ -1137,8 +1148,7 @@ static long do_mbind(unsigned long start
 	int err;
 	LIST_HEAD(pagelist);
 
-	if (flags & ~(unsigned long)(MPOL_MF_STRICT |
-				     MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+  	if (flags & ~(unsigned long)MPOL_MF_VALID)
 		return -EINVAL;
 	if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
 		return -EPERM;
@@ -1161,6 +1171,9 @@ static long do_mbind(unsigned long start
 	if (IS_ERR(new))
 		return PTR_ERR(new);
 
+	if (flags & MPOL_MF_LAZY)
+		new->flags |= MPOL_F_MOF;
+
 	/*
 	 * If we are using the default policy then operation
 	 * on discontinuous address spaces is okay after all
@@ -1197,21 +1210,23 @@ static long do_mbind(unsigned long start
 	vma = check_range(mm, start, end, nmask,
 			  flags | MPOL_MF_INVERT, &pagelist);
 
-	err = PTR_ERR(vma);
-	if (!IS_ERR(vma)) {
-		int nr_failed = 0;
-
+	err = PTR_ERR(vma);	/* maybe ... */
+	if (!IS_ERR(vma) && mode != MPOL_NOOP)
 		err = mbind_range(mm, start, end, new);
 
+	if (!err) {
+		int nr_failed = 0;
+
 		if (!list_empty(&pagelist)) {
+			WARN_ON_ONCE(flags & MPOL_MF_LAZY);
 			nr_failed = migrate_pages(&pagelist, new_vma_page,
-						(unsigned long)vma,
-						false, MIGRATE_SYNC);
+						  (unsigned long)vma,
+						  false, MIGRATE_SYNC);
 			if (nr_failed)
 				putback_lru_pages(&pagelist);
 		}
 
-		if (!err && nr_failed && (flags & MPOL_MF_STRICT))
+		if (nr_failed && (flags & MPOL_MF_STRICT))
 			err = -EIO;
 	} else
 		putback_lru_pages(&pagelist);



^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 16/31] numa, mm: Support NUMA hinting page faults from gup/gup_fast
  2012-10-25 12:16 [PATCH 00/31] numa/core patches Peter Zijlstra
                   ` (14 preceding siblings ...)
  2012-10-25 12:16 ` [PATCH 15/31] mm/mpol: Add MPOL_MF_LAZY Peter Zijlstra
@ 2012-10-25 12:16 ` Peter Zijlstra
  2012-10-25 12:16 ` [PATCH 17/31] mm/migrate: Introduce migrate_misplaced_page() Peter Zijlstra
                   ` (17 subsequent siblings)
  33 siblings, 0 replies; 135+ messages in thread
From: Peter Zijlstra @ 2012-10-25 12:16 UTC (permalink / raw)
  To: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Zijlstra, Ingo Molnar

[-- Attachment #1: 0016-numa-mm-Support-NUMA-hinting-page-faults-from-gup-gu.patch --]
[-- Type: text/plain, Size: 2971 bytes --]

From: Ingo Molnar <mingo@kernel.org>

Introduce FOLL_NUMA to tell follow_page to check
pte/pmd_numa. get_user_pages must use FOLL_NUMA, and it's safe to do
so because it always invokes handle_mm_fault and retries the
follow_page later.

KVM secondary MMU page faults will trigger the NUMA hinting page
faults through gup_fast -> get_user_pages -> follow_page ->
handle_mm_fault.

Other follow_page callers like KSM should not use FOLL_NUMA, or they
would fail to get the pages if they use follow_page instead of
get_user_pages.
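
Put differently, the rule the __get_user_pages() hunk encodes is roughly
(sketch, not literal kernel code):

	if (!(gup_flags & FOLL_FORCE))
		gup_flags |= FOLL_NUMA;	/* gup can retry via handle_mm_fault() */

	/*
	 * A bare follow_page() caller such as KSM never sets FOLL_NUMA and
	 * therefore simply fails to get a page for a NUMA-hinting pte/pmd,
	 * rather than triggering a hinting fault it cannot retry.
	 */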

[ This patch was picked up from the AutoNUMA tree. ]

Originally-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
[ ported to this tree. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/mm.h |    1 +
 mm/memory.c        |   17 +++++++++++++++++
 2 files changed, 18 insertions(+)

Index: tip/include/linux/mm.h
===================================================================
--- tip.orig/include/linux/mm.h
+++ tip/include/linux/mm.h
@@ -1600,6 +1600,7 @@ struct page *follow_page(struct vm_area_
 #define FOLL_MLOCK	0x40	/* mark page as mlocked */
 #define FOLL_SPLIT	0x80	/* don't return transhuge pages, split them */
 #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
+#define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
Index: tip/mm/memory.c
===================================================================
--- tip.orig/mm/memory.c
+++ tip/mm/memory.c
@@ -1536,6 +1536,8 @@ struct page *follow_page(struct vm_area_
 		page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
 		goto out;
 	}
+	if ((flags & FOLL_NUMA) && pmd_numa(vma, *pmd))
+		goto no_page_table;
 	if (pmd_trans_huge(*pmd)) {
 		if (flags & FOLL_SPLIT) {
 			split_huge_page_pmd(mm, pmd);
@@ -1565,6 +1567,8 @@ split_fallthrough:
 	pte = *ptep;
 	if (!pte_present(pte))
 		goto no_page;
+	if ((flags & FOLL_NUMA) && pte_numa(vma, pte))
+		goto no_page;
 	if ((flags & FOLL_WRITE) && !pte_write(pte))
 		goto unlock;
 
@@ -1716,6 +1720,19 @@ int __get_user_pages(struct task_struct
 			(VM_WRITE | VM_MAYWRITE) : (VM_READ | VM_MAYREAD);
 	vm_flags &= (gup_flags & FOLL_FORCE) ?
 			(VM_MAYREAD | VM_MAYWRITE) : (VM_READ | VM_WRITE);
+
+	/*
+	 * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault
+	 * would be called on PROT_NONE ranges. We must never invoke
+	 * handle_mm_fault on PROT_NONE ranges or the NUMA hinting
+	 * page faults would unprotect the PROT_NONE ranges if
+	 * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd
+	 * bitflag. So to avoid that, don't set FOLL_NUMA if
+	 * FOLL_FORCE is set.
+	 */
+	if (!(gup_flags & FOLL_FORCE))
+		gup_flags |= FOLL_NUMA;
+
 	i = 0;
 
 	do {



^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 17/31] mm/migrate: Introduce migrate_misplaced_page()
  2012-10-25 12:16 [PATCH 00/31] numa/core patches Peter Zijlstra
                   ` (15 preceding siblings ...)
  2012-10-25 12:16 ` [PATCH 16/31] numa, mm: Support NUMA hinting page faults from gup/gup_fast Peter Zijlstra
@ 2012-10-25 12:16 ` Peter Zijlstra
  2012-11-01 12:20   ` Mel Gorman
  2012-10-25 12:16 ` [PATCH 18/31] mm/mpol: Use special PROT_NONE to migrate pages Peter Zijlstra
                   ` (16 subsequent siblings)
  33 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2012-10-25 12:16 UTC (permalink / raw)
  To: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Zijlstra, Paul Turner, Ingo Molnar

[-- Attachment #1: 0017-mm-migrate-Introduce-migrate_misplaced_page.patch --]
[-- Type: text/plain, Size: 6345 bytes --]

Add migrate_misplaced_page(), which deals with migrating pages from
faults.

This includes adding a new MIGRATE_FAULT migration mode to
deal with the extra page reference required due to having to look up
the page.
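
The intended call pattern, as wired up by the follow-on "Use special
PROT_NONE to migrate pages" patch, looks roughly like this (sketch; the
caller's reference is what the extra MIGRATE_FAULT count accounts for):

	get_page(page);			/* the reference MIGRATE_FAULT expects */
	node = mpol_misplaced(page, vma, address);
	if (node != -1 && !migrate_misplaced_page(page, node))
		page_nid = node;	/* 0 return: the page was moved */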

Based-on-work-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Paul Turner <pjt@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/migrate.h      |    7 +++
 include/linux/migrate_mode.h |    3 +
 mm/migrate.c                 |   85 ++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 87 insertions(+), 8 deletions(-)

Index: tip/include/linux/migrate.h
===================================================================
--- tip.orig/include/linux/migrate.h
+++ tip/include/linux/migrate.h
@@ -30,6 +30,7 @@ extern int migrate_vmas(struct mm_struct
 extern void migrate_page_copy(struct page *newpage, struct page *page);
 extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 				  struct page *newpage, struct page *page);
+extern int migrate_misplaced_page(struct page *page, int node);
 #else
 
 static inline void putback_lru_pages(struct list_head *l) {}
@@ -63,5 +64,11 @@ static inline int migrate_huge_page_move
 #define migrate_page NULL
 #define fail_migrate_page NULL
 
+static inline
+int migrate_misplaced_page(struct page *page, int node)
+{
+	return -EAGAIN; /* can't migrate now */
+}
 #endif /* CONFIG_MIGRATION */
+
 #endif /* _LINUX_MIGRATE_H */
Index: tip/include/linux/migrate_mode.h
===================================================================
--- tip.orig/include/linux/migrate_mode.h
+++ tip/include/linux/migrate_mode.h
@@ -6,11 +6,14 @@
  *	on most operations but not ->writepage as the potential stall time
  *	is too significant
  * MIGRATE_SYNC will block when migrating pages
+ * MIGRATE_FAULT called from the fault path to migrate-on-fault for mempolicy
+ *	this path has an extra reference count
  */
 enum migrate_mode {
 	MIGRATE_ASYNC,
 	MIGRATE_SYNC_LIGHT,
 	MIGRATE_SYNC,
+	MIGRATE_FAULT,
 };
 
 #endif		/* MIGRATE_MODE_H_INCLUDED */
Index: tip/mm/migrate.c
===================================================================
--- tip.orig/mm/migrate.c
+++ tip/mm/migrate.c
@@ -225,7 +225,7 @@ static bool buffer_migrate_lock_buffers(
 	struct buffer_head *bh = head;
 
 	/* Simple case, sync compaction */
-	if (mode != MIGRATE_ASYNC) {
+	if (mode != MIGRATE_ASYNC && mode != MIGRATE_FAULT) {
 		do {
 			get_bh(bh);
 			lock_buffer(bh);
@@ -279,12 +279,22 @@ static int migrate_page_move_mapping(str
 		struct page *newpage, struct page *page,
 		struct buffer_head *head, enum migrate_mode mode)
 {
-	int expected_count;
+	int expected_count = 0;
 	void **pslot;
 
+	if (mode == MIGRATE_FAULT) {
+		/*
+		 * MIGRATE_FAULT has an extra reference on the page and
+		 * otherwise acts like ASYNC, no point in delaying the
+		 * fault, we'll try again next time.
+		 */
+		expected_count++;
+	}
+
 	if (!mapping) {
 		/* Anonymous page without mapping */
-		if (page_count(page) != 1)
+		expected_count += 1;
+		if (page_count(page) != expected_count)
 			return -EAGAIN;
 		return 0;
 	}
@@ -294,7 +304,7 @@ static int migrate_page_move_mapping(str
 	pslot = radix_tree_lookup_slot(&mapping->page_tree,
  					page_index(page));
 
-	expected_count = 2 + page_has_private(page);
+	expected_count += 2 + page_has_private(page);
 	if (page_count(page) != expected_count ||
 		radix_tree_deref_slot_protected(pslot, &mapping->tree_lock) != page) {
 		spin_unlock_irq(&mapping->tree_lock);
@@ -313,7 +323,7 @@ static int migrate_page_move_mapping(str
 	 * the mapping back due to an elevated page count, we would have to
 	 * block waiting on other references to be dropped.
 	 */
-	if (mode == MIGRATE_ASYNC && head &&
+	if ((mode == MIGRATE_ASYNC || mode == MIGRATE_FAULT) && head &&
 			!buffer_migrate_lock_buffers(head, mode)) {
 		page_unfreeze_refs(page, expected_count);
 		spin_unlock_irq(&mapping->tree_lock);
@@ -521,7 +531,7 @@ int buffer_migrate_page(struct address_s
 	 * with an IRQ-safe spinlock held. In the sync case, the buffers
 	 * need to be locked now
 	 */
-	if (mode != MIGRATE_ASYNC)
+	if (mode != MIGRATE_ASYNC && mode != MIGRATE_FAULT)
 		BUG_ON(!buffer_migrate_lock_buffers(head, mode));
 
 	ClearPagePrivate(page);
@@ -687,7 +697,7 @@ static int __unmap_and_move(struct page
 	struct anon_vma *anon_vma = NULL;
 
 	if (!trylock_page(page)) {
-		if (!force || mode == MIGRATE_ASYNC)
+		if (!force || mode == MIGRATE_ASYNC || mode == MIGRATE_FAULT)
 			goto out;
 
 		/*
@@ -1403,4 +1413,63 @@ int migrate_vmas(struct mm_struct *mm, c
  	}
  	return err;
 }
-#endif
+
+/*
+ * Attempt to migrate a misplaced page to the specified destination
+ * node.
+ */
+int migrate_misplaced_page(struct page *page, int node)
+{
+	struct address_space *mapping = page_mapping(page);
+	int page_lru = page_is_file_cache(page);
+	struct page *newpage;
+	int ret = -EAGAIN;
+	gfp_t gfp = GFP_HIGHUSER_MOVABLE;
+
+	/*
+	 * Don't migrate pages that are mapped in multiple processes.
+	 */
+	if (page_mapcount(page) != 1)
+		goto out;
+
+	/*
+	 * Never wait for allocations just to migrate on fault, but don't dip
+	 * into reserves. And, only accept pages from the specified node. No
+	 * sense migrating to a different "misplaced" page!
+	 */
+	if (mapping)
+		gfp = mapping_gfp_mask(mapping);
+	gfp &= ~__GFP_WAIT;
+	gfp |= __GFP_NOMEMALLOC | GFP_THISNODE;
+
+	newpage = alloc_pages_node(node, gfp, 0);
+	if (!newpage) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	if (isolate_lru_page(page)) {
+		ret = -EBUSY;
+		goto put_new;
+	}
+
+	inc_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
+	ret = __unmap_and_move(page, newpage, 0, 0, MIGRATE_FAULT);
+	/*
+	 * A page that has been migrated has all references removed and will be
+	 * freed. A page that has not been migrated will have kept its
+	 * references and be restored.
+	 */
+	dec_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
+	putback_lru_page(page);
+put_new:
+	/*
+	 * Move the new page to the LRU. If migration was not successful
+	 * then this will free the page.
+	 */
+	putback_lru_page(newpage);
+out:
+	return ret;
+}
+
+#endif /* CONFIG_NUMA */



^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 18/31] mm/mpol: Use special PROT_NONE to migrate pages
  2012-10-25 12:16 [PATCH 00/31] numa/core patches Peter Zijlstra
                   ` (16 preceding siblings ...)
  2012-10-25 12:16 ` [PATCH 17/31] mm/migrate: Introduce migrate_misplaced_page() Peter Zijlstra
@ 2012-10-25 12:16 ` Peter Zijlstra
  2012-10-25 12:16 ` [PATCH 19/31] sched, numa, mm: Introduce tsk_home_node() Peter Zijlstra
                   ` (15 subsequent siblings)
  33 siblings, 0 replies; 135+ messages in thread
From: Peter Zijlstra @ 2012-10-25 12:16 UTC (permalink / raw)
  To: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Zijlstra, Paul Turner, Ingo Molnar

[-- Attachment #1: 0018-mm-mpol-Use-special-PROT_NONE-to-migrate-pages.patch --]
[-- Type: text/plain, Size: 5135 bytes --]

Combine our previous PROT_NONE, mpol_misplaced and
migrate_misplaced_page() pieces into an effective migrate-on-fault
scheme.

Note that (on x86) we rely on PROT_NONE pages being !present and avoid
the TLB flush from try_to_unmap(TTU_MIGRATION). This greatly improves
the page-migration performance.
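
The resulting fault path, condensed from the do_numa_page() hunk below
(sketch; locking, the re-check after a failed migration and the
reference dropping are omitted):

	page = vm_normal_page(vma, address, entry);
	if (page) {
		node = mpol_misplaced(page, vma, address);
		if (node != -1 && !migrate_misplaced_page(page, node))
			return 0;	/* migrated on fault */
	}

	/* not misplaced (or migration failed): restore the protections */
	ptep_modify_prot_start(mm, address, ptep);
	entry = pte_modify(entry, vma->vm_page_prot);
	ptep_modify_prot_commit(mm, address, ptep, entry);
	/* no TLB flush needed, the PTE was only upgraded */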

Suggested-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Paul Turner <pjt@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/huge_memory.c |   41 ++++++++++++++++++++++++++++++++++-
 mm/memory.c      |   63 +++++++++++++++++++++++++++++++++++++++----------------
 2 files changed, 85 insertions(+), 19 deletions(-)

Index: tip/mm/huge_memory.c
===================================================================
--- tip.orig/mm/huge_memory.c
+++ tip/mm/huge_memory.c
@@ -18,6 +18,7 @@
 #include <linux/freezer.h>
 #include <linux/mman.h>
 #include <linux/pagemap.h>
+#include <linux/migrate.h>
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
 #include "internal.h"
@@ -741,12 +742,48 @@ void do_huge_pmd_numa_page(struct mm_str
 			   unsigned int flags, pmd_t entry)
 {
 	unsigned long haddr = address & HPAGE_PMD_MASK;
+	struct page *page = NULL;
+	int node;
 
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(*pmd, entry)))
 		goto out_unlock;
 
-	/* do fancy stuff */
+	if (unlikely(pmd_trans_splitting(entry))) {
+		spin_unlock(&mm->page_table_lock);
+		wait_split_huge_page(vma->anon_vma, pmd);
+		return;
+	}
+
+#ifdef CONFIG_NUMA
+	page = pmd_page(entry);
+	VM_BUG_ON(!PageCompound(page) || !PageHead(page));
+
+	get_page(page);
+	spin_unlock(&mm->page_table_lock);
+
+	/*
+	 * XXX should we serialize against split_huge_page ?
+	 */
+
+	node = mpol_misplaced(page, vma, haddr);
+	if (node == -1)
+		goto do_fixup;
+
+	/*
+	 * Due to lacking code to migrate thp pages, we'll split
+	 * (which preserves the special PROT_NONE) and re-take the
+	 * fault on the normal pages.
+	 */
+	split_huge_page(page);
+	put_page(page);
+	return;
+
+do_fixup:
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(*pmd, entry)))
+		goto out_unlock;
+#endif
 
 	/* change back to regular protection */
 	entry = pmd_modify(entry, vma->vm_page_prot);
@@ -755,6 +792,8 @@ void do_huge_pmd_numa_page(struct mm_str
 
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
+	if (page)
+		put_page(page);
 }
 
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
Index: tip/mm/memory.c
===================================================================
--- tip.orig/mm/memory.c
+++ tip/mm/memory.c
@@ -57,6 +57,7 @@
 #include <linux/swapops.h>
 #include <linux/elf.h>
 #include <linux/gfp.h>
+#include <linux/migrate.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -1467,8 +1468,10 @@ EXPORT_SYMBOL_GPL(zap_vma_ptes);
 static bool pte_numa(struct vm_area_struct *vma, pte_t pte)
 {
 	/*
-	 * If we have the normal vma->vm_page_prot protections we're not a
-	 * 'special' PROT_NONE page.
+	 * For NUMA page faults, we use PROT_NONE ptes in VMAs with
+	 * "normal" vma->vm_page_prot protections.  Genuine PROT_NONE
+	 * VMAs should never get here, because the fault handling code
+	 * will notice that the VMA has no read or write permissions.
 	 *
 	 * This means we cannot get 'special' PROT_NONE faults from genuine
 	 * PROT_NONE maps, nor from PROT_WRITE file maps that do dirty
@@ -3473,35 +3476,59 @@ static int do_numa_page(struct mm_struct
 			unsigned long address, pte_t *ptep, pmd_t *pmd,
 			unsigned int flags, pte_t entry)
 {
+	struct page *page = NULL;
+	int node, page_nid = -1;
 	spinlock_t *ptl;
-	int ret = 0;
-
-	if (!pte_unmap_same(mm, pmd, ptep, entry))
-		goto out;
 
-	/*
-	 * Do fancy stuff...
-	 */
-
-	/*
-	 * OK, nothing to do,.. change the protection back to what it
-	 * ought to be.
-	 */
-	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+	ptl = pte_lockptr(mm, pmd);
+	spin_lock(ptl);
 	if (unlikely(!pte_same(*ptep, entry)))
-		goto unlock;
+		goto out_unlock;
+
+	page = vm_normal_page(vma, address, entry);
+	if (page) {
+		get_page(page);
+		page_nid = page_to_nid(page);
+		node = mpol_misplaced(page, vma, address);
+		if (node != -1)
+			goto migrate;
+	}
 
+out_pte_upgrade_unlock:
 	flush_cache_page(vma, address, pte_pfn(entry));
 
 	ptep_modify_prot_start(mm, address, ptep);
 	entry = pte_modify(entry, vma->vm_page_prot);
 	ptep_modify_prot_commit(mm, address, ptep, entry);
 
+	/* No TLB flush needed because we upgraded the PTE */
+
 	update_mmu_cache(vma, address, ptep);
-unlock:
+
+out_unlock:
 	pte_unmap_unlock(ptep, ptl);
 out:
-	return ret;
+	if (page)
+		put_page(page);
+
+	return 0;
+
+migrate:
+	pte_unmap_unlock(ptep, ptl);
+
+	if (!migrate_misplaced_page(page, node)) {
+		page_nid = node;
+		goto out;
+	}
+
+	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+	if (!pte_same(*ptep, entry)) {
+		put_page(page);
+		page = NULL;
+		goto out_unlock;
+	}
+
+	goto out_pte_upgrade_unlock;
 }
 
 /*



^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 19/31] sched, numa, mm: Introduce tsk_home_node()
  2012-10-25 12:16 [PATCH 00/31] numa/core patches Peter Zijlstra
                   ` (17 preceding siblings ...)
  2012-10-25 12:16 ` [PATCH 18/31] mm/mpol: Use special PROT_NONE to migrate pages Peter Zijlstra
@ 2012-10-25 12:16 ` Peter Zijlstra
  2012-11-01 13:48   ` Mel Gorman
  2012-10-25 12:16 ` [PATCH 20/31] sched, numa, mm/mpol: Make mempolicy home-node aware Peter Zijlstra
                   ` (14 subsequent siblings)
  33 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2012-10-25 12:16 UTC (permalink / raw)
  To: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Zijlstra, Lee Schermerhorn, Ingo Molnar

[-- Attachment #1: 0019-sched-numa-mm-Introduce-tsk_home_node.patch --]
[-- Type: text/plain, Size: 5314 bytes --]

Introduce the home-node concept for tasks. In order to keep memory
locality we need to have something to stay local to, so we define the
home-node of a task as the node we prefer to allocate memory from and
prefer to execute on.

These are not hard guarantees, merely soft preferences. This allows for
optimal resource usage: we can run a task away from the home-node, since
the remote memory hit -- while expensive -- is less expensive than not
running at all, or only very little, due to severe cpu overload.

Similarly, we can allocate memory from another node if our home-node
is depleted; again, some memory is better than no memory.

This patch merely introduces the basic infrastructure; all policy
comes later.

NOTE: we introduce the concept of EMBEDDED_NUMA: these are
architectures where the memory access cost doesn't depend on the cpu
but purely on the physical address -- embedded boards with cheap
(slow) and expensive (fast) memory banks.
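
A hypothetical consumer, just to make the "soft preference" semantics
concrete (illustration only -- the real consumers are added by later
patches in this series):

	struct page *page = NULL;
	int node = tsk_home_node(current);	/* -1: no preference, or !CONFIG_SCHED_NUMA */

	if (node != -1)			/* prefer the home node... */
		page = alloc_pages_node(node, GFP_KERNEL | __GFP_THISNODE, 0);
	if (!page)			/* ...but some memory beats no memory */
		page = alloc_pages(GFP_KERNEL, 0);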

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/sh/mm/Kconfig        |    1 +
 include/linux/init_task.h |    8 ++++++++
 include/linux/sched.h     |   12 ++++++++++++
 init/Kconfig              |   14 ++++++++++++++
 kernel/sched/core.c       |   36 ++++++++++++++++++++++++++++++++++++
 5 files changed, 71 insertions(+)

Index: tip/arch/sh/mm/Kconfig
===================================================================
--- tip.orig/arch/sh/mm/Kconfig
+++ tip/arch/sh/mm/Kconfig
@@ -111,6 +111,7 @@ config VSYSCALL
 config NUMA
 	bool "Non Uniform Memory Access (NUMA) Support"
 	depends on MMU && SYS_SUPPORTS_NUMA && EXPERIMENTAL
+	select EMBEDDED_NUMA
 	default n
 	help
 	  Some SH systems have many various memories scattered around
Index: tip/include/linux/init_task.h
===================================================================
--- tip.orig/include/linux/init_task.h
+++ tip/include/linux/init_task.h
@@ -143,6 +143,13 @@ extern struct task_group root_task_group
 
 #define INIT_TASK_COMM "swapper"
 
+#ifdef CONFIG_SCHED_NUMA
+# define INIT_TASK_NUMA(tsk)						\
+	.node = -1,
+#else
+# define INIT_TASK_NUMA(tsk)
+#endif
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -210,6 +217,7 @@ extern struct task_group root_task_group
 	INIT_TRACE_RECURSION						\
 	INIT_TASK_RCU_PREEMPT(tsk)					\
 	INIT_CPUSET_SEQ							\
+	INIT_TASK_NUMA(tsk)						\
 }
 
 
Index: tip/include/linux/sched.h
===================================================================
--- tip.orig/include/linux/sched.h
+++ tip/include/linux/sched.h
@@ -1479,6 +1479,9 @@ struct task_struct {
 	short il_next;
 	short pref_node_fork;
 #endif
+#ifdef CONFIG_SCHED_NUMA
+	int node;
+#endif
 	struct rcu_head rcu;
 
 	/*
@@ -1553,6 +1556,15 @@ struct task_struct {
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
+static inline int tsk_home_node(struct task_struct *p)
+{
+#ifdef CONFIG_SCHED_NUMA
+	return p->node;
+#else
+	return -1;
+#endif
+}
+
 /*
  * Priority of a process goes from 0..MAX_PRIO-1, valid RT
  * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
Index: tip/init/Kconfig
===================================================================
--- tip.orig/init/Kconfig
+++ tip/init/Kconfig
@@ -696,6 +696,20 @@ config LOG_BUF_SHIFT
 config HAVE_UNSTABLE_SCHED_CLOCK
 	bool
 
+#
+# For architectures that (ab)use NUMA to represent different memory regions
+# all cpu-local but of different latencies, such as SuperH.
+#
+config EMBEDDED_NUMA
+	bool
+
+config SCHED_NUMA
+	bool "Memory placement aware NUMA scheduler"
+	default n
+	depends on SMP && NUMA && MIGRATION && !EMBEDDED_NUMA
+	help
+	  This option adds support for automatic NUMA aware memory/task placement.
+
 menuconfig CGROUPS
 	boolean "Control Group support"
 	depends on EVENTFD
Index: tip/kernel/sched/core.c
===================================================================
--- tip.orig/kernel/sched/core.c
+++ tip/kernel/sched/core.c
@@ -5959,6 +5959,42 @@ static struct sched_domain_topology_leve
 
 static struct sched_domain_topology_level *sched_domain_topology = default_topology;
 
+#ifdef CONFIG_SCHED_NUMA
+
+/*
+ * Requeues a task ensuring its on the right load-balance list so
+ * that it might get migrated to its new home.
+ *
+ * Note that we cannot actively migrate ourselves since our callers
+ * can be from atomic context. We rely on the regular load-balance
+ * mechanisms to move us around -- its all preference anyway.
+ */
+void sched_setnode(struct task_struct *p, int node)
+{
+	unsigned long flags;
+	int on_rq, running;
+	struct rq *rq;
+
+	rq = task_rq_lock(p, &flags);
+	on_rq = p->on_rq;
+	running = task_current(rq, p);
+
+	if (on_rq)
+		dequeue_task(rq, p, 0);
+	if (running)
+		p->sched_class->put_prev_task(rq, p);
+
+	p->node = node;
+
+	if (running)
+		p->sched_class->set_curr_task(rq);
+	if (on_rq)
+		enqueue_task(rq, p, 0);
+	task_rq_unlock(rq, p, &flags);
+}
+
+#endif /* CONFIG_SCHED_NUMA */
+
 #ifdef CONFIG_NUMA
 
 static int sched_domains_numa_levels;



^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 20/31] sched, numa, mm/mpol: Make mempolicy home-node aware
  2012-10-25 12:16 [PATCH 00/31] numa/core patches Peter Zijlstra
                   ` (18 preceding siblings ...)
  2012-10-25 12:16 ` [PATCH 19/31] sched, numa, mm: Introduce tsk_home_node() Peter Zijlstra
@ 2012-10-25 12:16 ` Peter Zijlstra
  2012-11-01 13:58   ` Mel Gorman
  2012-10-25 12:16 ` [PATCH 21/31] sched, numa, mm: Introduce sched_feat_numa() Peter Zijlstra
                   ` (13 subsequent siblings)
  33 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2012-10-25 12:16 UTC (permalink / raw)
  To: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Zijlstra, Paul Turner,
	Lee Schermerhorn, Christoph Lameter, Ingo Molnar

[-- Attachment #1: 0020-sched-numa-mm-mpol-Make-mempolicy-home-node-aware.patch --]
[-- Type: text/plain, Size: 2598 bytes --]

Add another layer of fallback policy to make the home node concept
useful from a memory allocation PoV.

This changes the mpol order to:

 - vma->vm_ops->get_policy	[if applicable]
 - vma->vm_policy		[if applicable]
 - task->mempolicy
 - tsk_home_node() preferred	[NEW]
 - default_policy

Note that the tsk_home_node() policy has Migrate-on-Fault enabled to
facilitate efficient on-demand memory migration.
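
The net effect for a task that has a home node but no explicit
mempolicy, condensed from the hunks below (sketch):

	if (p->mempolicy)
		pol = p->mempolicy;	/* an explicit task policy still wins */
	else if (tsk_home_node(p) != -1)
		/* fall back to MPOL_PREFERRED on the home node, with
		 * MPOL_F_MOF set so misplaced pages migrate on fault */
		pol = &preferred_node_policy[tsk_home_node(p)];
	else
		pol = NULL;		/* callers then use default_policy */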

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Turner <pjt@google.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/mempolicy.c |   29 +++++++++++++++++++++++++++--
 1 file changed, 27 insertions(+), 2 deletions(-)

Index: tip/mm/mempolicy.c
===================================================================
--- tip.orig/mm/mempolicy.c
+++ tip/mm/mempolicy.c
@@ -117,6 +117,22 @@ static struct mempolicy default_policy =
 	.flags = MPOL_F_LOCAL,
 };
 
+static struct mempolicy preferred_node_policy[MAX_NUMNODES];
+
+static struct mempolicy *get_task_policy(struct task_struct *p)
+{
+	struct mempolicy *pol = p->mempolicy;
+	int node;
+
+	if (!pol) {
+		node = tsk_home_node(p);
+		if (node != -1)
+			pol = &preferred_node_policy[node];
+	}
+
+	return pol;
+}
+
 static const struct mempolicy_operations {
 	int (*create)(struct mempolicy *pol, const nodemask_t *nodes);
 	/*
@@ -1565,7 +1581,7 @@ asmlinkage long compat_sys_mbind(compat_
 struct mempolicy *get_vma_policy(struct task_struct *task,
 		struct vm_area_struct *vma, unsigned long addr)
 {
-	struct mempolicy *pol = task->mempolicy;
+	struct mempolicy *pol = get_task_policy(task);
 
 	if (vma) {
 		if (vma->vm_ops && vma->vm_ops->get_policy) {
@@ -1965,7 +1981,7 @@ retry_cpuset:
  */
 struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 {
-	struct mempolicy *pol = current->mempolicy;
+	struct mempolicy *pol = get_task_policy(current);
 	struct page *page;
 	unsigned int cpuset_mems_cookie;
 
@@ -2424,6 +2440,15 @@ void __init numa_policy_init(void)
 				     sizeof(struct sp_node),
 				     0, SLAB_PANIC, NULL);
 
+	for_each_node(nid) {
+		preferred_node_policy[nid] = (struct mempolicy) {
+			.refcnt = ATOMIC_INIT(1),
+			.mode = MPOL_PREFERRED,
+			.flags = MPOL_F_MOF,
+			.v = { .preferred_node = nid, },
+		};
+	}
+
 	/*
 	 * Set interleaving policy for system init. Interleaving is only
 	 * enabled across suitably sized nodes (default is >= 16MB), or



^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 21/31] sched, numa, mm: Introduce sched_feat_numa()
  2012-10-25 12:16 [PATCH 00/31] numa/core patches Peter Zijlstra
                   ` (19 preceding siblings ...)
  2012-10-25 12:16 ` [PATCH 20/31] sched, numa, mm/mpol: Make mempolicy home-node aware Peter Zijlstra
@ 2012-10-25 12:16 ` Peter Zijlstra
  2012-11-01 14:00   ` Mel Gorman
  2012-10-25 12:16 ` [PATCH 22/31] sched, numa, mm: Implement THP migration Peter Zijlstra
                   ` (12 subsequent siblings)
  33 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2012-10-25 12:16 UTC (permalink / raw)
  To: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Zijlstra, Paul Turner,
	Lee Schermerhorn, Christoph Lameter, Ingo Molnar

[-- Attachment #1: 0021-sched-numa-mm-Introduce-sched_feat_numa.patch --]
[-- Type: text/plain, Size: 1021 bytes --]

Avoid a few #ifdef's later on.
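
Typical use in the later patches looks like this (sketch; NUMA_PULL is
one of the feature flags added by the home-node awareness patch further
down):

	if (sched_feat_numa(NUMA_PULL)) {
		/* e.g. try to pull an offnode task towards its home node */
	}

With CONFIG_SCHED_NUMA=n this evaluates to 0 and the block is compiled
away, so the call site itself needs no #ifdef.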

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Turner <pjt@google.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/sched.h |    6 ++++++
 1 file changed, 6 insertions(+)

Index: tip/kernel/sched/sched.h
===================================================================
--- tip.orig/kernel/sched/sched.h
+++ tip/kernel/sched/sched.h
@@ -648,6 +648,12 @@ extern struct static_key sched_feat_keys
 #define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))
 #endif /* SCHED_DEBUG && HAVE_JUMP_LABEL */
 
+#ifdef CONFIG_SCHED_NUMA
+#define sched_feat_numa(x) sched_feat(x)
+#else
+#define sched_feat_numa(x) (0)
+#endif
+
 static inline u64 global_rt_period(void)
 {
 	return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;



^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 22/31] sched, numa, mm: Implement THP migration
  2012-10-25 12:16 [PATCH 00/31] numa/core patches Peter Zijlstra
                   ` (20 preceding siblings ...)
  2012-10-25 12:16 ` [PATCH 21/31] sched, numa, mm: Introduce sched_feat_numa() Peter Zijlstra
@ 2012-10-25 12:16 ` Peter Zijlstra
  2012-11-01 14:16   ` Mel Gorman
  2012-10-25 12:16 ` [PATCH 23/31] sched, numa, mm: Implement home-node awareness Peter Zijlstra
                   ` (11 subsequent siblings)
  33 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2012-10-25 12:16 UTC (permalink / raw)
  To: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Zijlstra, Ingo Molnar

[-- Attachment #1: 0022-sched-numa-mm-Implement-THP-migration.patch --]
[-- Type: text/plain, Size: 5105 bytes --]

Add THP migration for the NUMA working set scanning fault case.

It uses the page lock to serialize. No migration pte dance is
necessary because the pte is already unmapped when we decide
to migrate.
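
The shape of the migration path added below, condensed (sketch; the
pmd_same() re-checks and the failure unwinding are omitted):

	lock_page(page);			/* page lock is the only serialization */
	new_page = alloc_pages_node(node,
			(GFP_TRANSHUGE | GFP_THISNODE) & ~__GFP_WAIT,
			HPAGE_PMD_ORDER);
	migrate_page_copy(new_page, page);	/* no migration ptes: the pmd is PROT_NONE */

	entry = mk_pmd(new_page, vma->vm_page_prot);
	entry = pmd_mkhuge(maybe_pmd_mkwrite(pmd_mkdirty(entry), vma));
	page_add_new_anon_rmap(new_page, vma, haddr);
	set_pmd_at(mm, haddr, pmd, entry);
	page_remove_rmap(page);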

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
[ Significant fixes and changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/huge_memory.c |  133 ++++++++++++++++++++++++++++++++++++++++++-------------
 mm/migrate.c     |    2 
 2 files changed, 104 insertions(+), 31 deletions(-)

Index: tip/mm/huge_memory.c
===================================================================
--- tip.orig/mm/huge_memory.c
+++ tip/mm/huge_memory.c
@@ -742,12 +742,13 @@ void do_huge_pmd_numa_page(struct mm_str
 			   unsigned int flags, pmd_t entry)
 {
 	unsigned long haddr = address & HPAGE_PMD_MASK;
+	struct page *new_page = NULL;
 	struct page *page = NULL;
-	int node;
+	int node, lru;
 
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(*pmd, entry)))
-		goto out_unlock;
+		goto unlock;
 
 	if (unlikely(pmd_trans_splitting(entry))) {
 		spin_unlock(&mm->page_table_lock);
@@ -755,45 +756,117 @@ void do_huge_pmd_numa_page(struct mm_str
 		return;
 	}
 
-#ifdef CONFIG_NUMA
 	page = pmd_page(entry);
-	VM_BUG_ON(!PageCompound(page) || !PageHead(page));
+	if (page) {
+		VM_BUG_ON(!PageCompound(page) || !PageHead(page));
 
-	get_page(page);
+		get_page(page);
+		node = mpol_misplaced(page, vma, haddr);
+		if (node != -1)
+			goto migrate;
+	}
+
+fixup:
+	/* change back to regular protection */
+	entry = pmd_modify(entry, vma->vm_page_prot);
+	set_pmd_at(mm, haddr, pmd, entry);
+	update_mmu_cache_pmd(vma, address, entry);
+
+unlock:
 	spin_unlock(&mm->page_table_lock);
+	if (page)
+		put_page(page);
 
-	/*
-	 * XXX should we serialize against split_huge_page ?
-	 */
-
-	node = mpol_misplaced(page, vma, haddr);
-	if (node == -1)
-		goto do_fixup;
-
-	/*
-	 * Due to lacking code to migrate thp pages, we'll split
-	 * (which preserves the special PROT_NONE) and re-take the
-	 * fault on the normal pages.
-	 */
-	split_huge_page(page);
-	put_page(page);
 	return;
 
-do_fixup:
+migrate:
+	spin_unlock(&mm->page_table_lock);
+
+	lock_page(page);
 	spin_lock(&mm->page_table_lock);
-	if (unlikely(!pmd_same(*pmd, entry)))
-		goto out_unlock;
-#endif
+	if (unlikely(!pmd_same(*pmd, entry))) {
+		spin_unlock(&mm->page_table_lock);
+		unlock_page(page);
+		put_page(page);
+		return;
+	}
+	spin_unlock(&mm->page_table_lock);
 
-	/* change back to regular protection */
-	entry = pmd_modify(entry, vma->vm_page_prot);
-	if (pmdp_set_access_flags(vma, haddr, pmd, entry, 1))
-		update_mmu_cache_pmd(vma, address, entry);
+	new_page = alloc_pages_node(node,
+	    (GFP_TRANSHUGE | GFP_THISNODE) & ~__GFP_WAIT,
+	    HPAGE_PMD_ORDER);
+
+	if (!new_page)
+		goto alloc_fail;
+
+	lru = PageLRU(page);
+
+	if (lru && isolate_lru_page(page)) /* does an implicit get_page() */
+		goto alloc_fail;
+
+	if (!trylock_page(new_page))
+		BUG();
+
+	/* anon mapping, we can simply copy page->mapping to the new page: */
+	new_page->mapping = page->mapping;
+	new_page->index = page->index;
 
-out_unlock:
+	migrate_page_copy(new_page, page);
+
+	WARN_ON(PageLRU(new_page));
+
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(*pmd, entry))) {
+		spin_unlock(&mm->page_table_lock);
+		if (lru)
+			putback_lru_page(page);
+
+		unlock_page(new_page);
+		ClearPageActive(new_page);	/* Set by migrate_page_copy() */
+		new_page->mapping = NULL;
+		put_page(new_page);		/* Free it */
+
+		unlock_page(page);
+		put_page(page);			/* Drop the local reference */
+
+		return;
+	}
+
+	entry = mk_pmd(new_page, vma->vm_page_prot);
+	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+	entry = pmd_mkhuge(entry);
+
+	page_add_new_anon_rmap(new_page, vma, haddr);
+
+	set_pmd_at(mm, haddr, pmd, entry);
+	update_mmu_cache_pmd(vma, address, entry);
+	page_remove_rmap(page);
 	spin_unlock(&mm->page_table_lock);
-	if (page)
+
+	put_page(page);			/* Drop the rmap reference */
+
+	if (lru)
+		put_page(page);		/* drop the LRU isolation reference */
+
+	unlock_page(new_page);
+	unlock_page(page);
+	put_page(page);			/* Drop the local reference */
+
+	return;
+
+alloc_fail:
+	if (new_page)
+		put_page(new_page);
+
+	unlock_page(page);
+
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(*pmd, entry))) {
 		put_page(page);
+		page = NULL;
+		goto unlock;
+	}
+	goto fixup;
 }
 
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
Index: tip/mm/migrate.c
===================================================================
--- tip.orig/mm/migrate.c
+++ tip/mm/migrate.c
@@ -417,7 +417,7 @@ int migrate_huge_page_move_mapping(struc
  */
 void migrate_page_copy(struct page *newpage, struct page *page)
 {
-	if (PageHuge(page))
+	if (PageHuge(page) || PageTransHuge(page))
 		copy_huge_page(newpage, page);
 	else
 		copy_highpage(newpage, page);



^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 23/31] sched, numa, mm: Implement home-node awareness
  2012-10-25 12:16 [PATCH 00/31] numa/core patches Peter Zijlstra
                   ` (21 preceding siblings ...)
  2012-10-25 12:16 ` [PATCH 22/31] sched, numa, mm: Implement THP migration Peter Zijlstra
@ 2012-10-25 12:16 ` Peter Zijlstra
  2012-11-01 15:06   ` Mel Gorman
  2012-10-25 12:16 ` [PATCH 24/31] sched, numa, mm: Introduce last_nid in the pageframe Peter Zijlstra
                   ` (10 subsequent siblings)
  33 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2012-10-25 12:16 UTC (permalink / raw)
  To: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Zijlstra, Paul Turner,
	Lee Schermerhorn, Christoph Lameter, Ingo Molnar

[-- Attachment #1: 0023-sched-numa-mm-Implement-home-node-awareness.patch --]
[-- Type: text/plain, Size: 22641 bytes --]

Implement home node preference in the scheduler's load-balancer.

This is done in four pieces:

 - task_numa_hot(); make it harder to migrate tasks away from their
   home-node, controlled using the NUMA_HOT feature flag.

 - select_task_rq_fair(); prefer placing the task on its home-node,
   controlled using the NUMA_TTWU_BIAS feature flag. Disabled by
   default because we found this to be far too aggressive.

 - load_balance(); during the regular pull load-balance pass, try
   pulling tasks that are on the wrong node first with a preference
   of moving them nearer to their home-node through task_numa_hot(),
   controlled through the NUMA_PULL feature flag.

 - load_balance(); when the balancer finds no imbalance, introduce
   some imbalance such that it still prefers to move tasks towards
   their home-node, using active load-balance if needed, controlled
   through the NUMA_PULL_BIAS feature flag.

   In particular, only introduce this BIAS if the system is otherwise
   properly (weight) balanced and we either have an offnode or !numa
   task to trade for it.

In order to easily find off-node tasks, split the per-cpu task list
into two parts.
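
That split boils down to the following enqueue-time accounting,
condensed from the account_numa_enqueue() hunk below:

	if (tsk_home_node(p) != cpu_to_node(task_cpu(p))) {
		p->numa_contrib = task_h_load(p);
		rq->offnode_weight += p->numa_contrib;
		rq->offnode_running++;
		tasks = &rq->offnode_tasks;	/* second list: wrong-node tasks */
	} else {
		rq->onnode_running++;		/* stays on rq->cfs_tasks */
	}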

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Turner <pjt@google.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h   |    3 
 kernel/sched/core.c     |   28 +++
 kernel/sched/debug.c    |    3 
 kernel/sched/fair.c     |  349 +++++++++++++++++++++++++++++++++++++++++++++---
 kernel/sched/features.h |   10 +
 kernel/sched/sched.h    |   17 ++
 6 files changed, 384 insertions(+), 26 deletions(-)

Index: tip/include/linux/sched.h
===================================================================
--- tip.orig/include/linux/sched.h
+++ tip/include/linux/sched.h
@@ -823,6 +823,7 @@ enum cpu_idle_type {
 #define SD_ASYM_PACKING		0x0800  /* Place busy groups earlier in the domain */
 #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
 #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
+#define SD_NUMA			0x4000	/* cross-node balancing */
 
 extern int __weak arch_sd_sibiling_asym_packing(void);
 
@@ -1481,6 +1482,7 @@ struct task_struct {
 #endif
 #ifdef CONFIG_SCHED_NUMA
 	int node;
+	unsigned long numa_contrib;
 #endif
 	struct rcu_head rcu;
 
@@ -2084,6 +2086,7 @@ extern int sched_setscheduler(struct tas
 			      const struct sched_param *);
 extern int sched_setscheduler_nocheck(struct task_struct *, int,
 				      const struct sched_param *);
+extern void sched_setnode(struct task_struct *p, int node);
 extern struct task_struct *idle_task(int cpu);
 /**
  * is_idle_task - is the specified task an idle task?
Index: tip/kernel/sched/core.c
===================================================================
--- tip.orig/kernel/sched/core.c
+++ tip/kernel/sched/core.c
@@ -5484,7 +5484,9 @@ static void destroy_sched_domains(struct
 DEFINE_PER_CPU(struct sched_domain *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_id);
 
-static void update_top_cache_domain(int cpu)
+DEFINE_PER_CPU(struct sched_domain *, sd_node);
+
+static void update_domain_cache(int cpu)
 {
 	struct sched_domain *sd;
 	int id = cpu;
@@ -5495,6 +5497,15 @@ static void update_top_cache_domain(int
 
 	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
 	per_cpu(sd_llc_id, cpu) = id;
+
+	for_each_domain(cpu, sd) {
+		if (cpumask_equal(sched_domain_span(sd),
+				  cpumask_of_node(cpu_to_node(cpu))))
+			goto got_node;
+	}
+	sd = NULL;
+got_node:
+	rcu_assign_pointer(per_cpu(sd_node, cpu), sd);
 }
 
 /*
@@ -5537,7 +5548,7 @@ cpu_attach_domain(struct sched_domain *s
 	rcu_assign_pointer(rq->sd, sd);
 	destroy_sched_domains(tmp, cpu);
 
-	update_top_cache_domain(cpu);
+	update_domain_cache(cpu);
 }
 
 /* cpus with isolated domains */
@@ -5965,9 +5976,9 @@ static struct sched_domain_topology_leve
  * Requeues a task ensuring its on the right load-balance list so
  * that it might get migrated to its new home.
  *
- * Note that we cannot actively migrate ourselves since our callers
- * can be from atomic context. We rely on the regular load-balance
- * mechanisms to move us around -- its all preference anyway.
+ * Since home-node is pure preference there's no hard migrate to force
+ * us anywhere, this also allows us to call this from atomic context if
+ * required.
  */
 void sched_setnode(struct task_struct *p, int node)
 {
@@ -6040,6 +6051,7 @@ sd_numa_init(struct sched_domain_topolog
 					| 0*SD_SHARE_PKG_RESOURCES
 					| 1*SD_SERIALIZE
 					| 0*SD_PREFER_SIBLING
+					| 1*SD_NUMA
 					| sd_local_flags(level)
 					,
 		.last_balance		= jiffies,
@@ -6901,6 +6913,12 @@ void __init sched_init(void)
 		rq->avg_idle = 2*sysctl_sched_migration_cost;
 
 		INIT_LIST_HEAD(&rq->cfs_tasks);
+#ifdef CONFIG_SCHED_NUMA
+		INIT_LIST_HEAD(&rq->offnode_tasks);
+		rq->onnode_running = 0;
+		rq->offnode_running = 0;
+		rq->offnode_weight = 0;
+#endif
 
 		rq_attach_root(rq, &def_root_domain);
 #ifdef CONFIG_NO_HZ
Index: tip/kernel/sched/debug.c
===================================================================
--- tip.orig/kernel/sched/debug.c
+++ tip/kernel/sched/debug.c
@@ -132,6 +132,9 @@ print_task(struct seq_file *m, struct rq
 	SEQ_printf(m, "%15Ld %15Ld %15Ld.%06ld %15Ld.%06ld %15Ld.%06ld",
 		0LL, 0LL, 0LL, 0L, 0LL, 0L, 0LL, 0L);
 #endif
+#ifdef CONFIG_SCHED_NUMA
+	SEQ_printf(m, " %d/%d", p->node, cpu_to_node(task_cpu(p)));
+#endif
 #ifdef CONFIG_CGROUP_SCHED
 	SEQ_printf(m, " %s", task_group_path(task_group(p)));
 #endif
Index: tip/kernel/sched/fair.c
===================================================================
--- tip.orig/kernel/sched/fair.c
+++ tip/kernel/sched/fair.c
@@ -26,6 +26,7 @@
 #include <linux/slab.h>
 #include <linux/profile.h>
 #include <linux/interrupt.h>
+#include <linux/random.h>
 
 #include <trace/events/sched.h>
 
@@ -773,6 +774,51 @@ update_stats_curr_start(struct cfs_rq *c
 }
 
 /**************************************************
+ * Scheduling class numa methods.
+ */
+
+#ifdef CONFIG_SMP
+static unsigned long task_h_load(struct task_struct *p);
+#endif
+
+#ifdef CONFIG_SCHED_NUMA
+static struct list_head *account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+	struct list_head *tasks = &rq->cfs_tasks;
+
+	if (tsk_home_node(p) != cpu_to_node(task_cpu(p))) {
+		p->numa_contrib = task_h_load(p);
+		rq->offnode_weight += p->numa_contrib;
+		rq->offnode_running++;
+		tasks = &rq->offnode_tasks;
+	} else
+		rq->onnode_running++;
+
+	return tasks;
+}
+
+static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+	if (tsk_home_node(p) != cpu_to_node(task_cpu(p))) {
+		rq->offnode_weight -= p->numa_contrib;
+		rq->offnode_running--;
+	} else
+		rq->onnode_running--;
+}
+#else
+#ifdef CONFIG_SMP
+static struct list_head *account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+	return NULL;
+}
+#endif
+
+static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+}
+#endif /* CONFIG_SCHED_NUMA */
+
+/**************************************************
  * Scheduling class queueing methods:
  */
 
@@ -783,9 +829,17 @@ account_entity_enqueue(struct cfs_rq *cf
 	if (!parent_entity(se))
 		update_load_add(&rq_of(cfs_rq)->load, se->load.weight);
 #ifdef CONFIG_SMP
-	if (entity_is_task(se))
-		list_add(&se->group_node, &rq_of(cfs_rq)->cfs_tasks);
-#endif
+	if (entity_is_task(se)) {
+		struct rq *rq = rq_of(cfs_rq);
+		struct task_struct *p = task_of(se);
+		struct list_head *tasks = &rq->cfs_tasks;
+
+		if (tsk_home_node(p) != -1)
+			tasks = account_numa_enqueue(rq, p);
+
+		list_add(&se->group_node, tasks);
+	}
+#endif /* CONFIG_SMP */
 	cfs_rq->nr_running++;
 }
 
@@ -795,8 +849,14 @@ account_entity_dequeue(struct cfs_rq *cf
 	update_load_sub(&cfs_rq->load, se->load.weight);
 	if (!parent_entity(se))
 		update_load_sub(&rq_of(cfs_rq)->load, se->load.weight);
-	if (entity_is_task(se))
+	if (entity_is_task(se)) {
+		struct task_struct *p = task_of(se);
+
 		list_del_init(&se->group_node);
+
+		if (tsk_home_node(p) != -1)
+			account_numa_dequeue(rq_of(cfs_rq), p);
+	}
 	cfs_rq->nr_running--;
 }
 
@@ -2681,6 +2741,35 @@ done:
 	return target;
 }
 
+#ifdef CONFIG_SCHED_NUMA
+static inline bool pick_numa_rand(int n)
+{
+	return !(get_random_int() % n);
+}
+
+/*
+ * Pick a random eligible CPU in the target node, hopefully faster
+ * than doing a least-loaded scan.
+ */
+static int numa_select_node_cpu(struct task_struct *p, int node)
+{
+	int weight = cpumask_weight(cpumask_of_node(node));
+	int i, cpu = -1;
+
+	for_each_cpu_and(i, cpumask_of_node(node), tsk_cpus_allowed(p)) {
+		if (cpu < 0 || pick_numa_rand(weight))
+			cpu = i;
+	}
+
+	return cpu;
+}
+#else
+static int numa_select_node_cpu(struct task_struct *p, int node)
+{
+	return -1;
+}
+#endif /* CONFIG_SCHED_NUMA */
+
 /*
  * sched_balance_self: balance the current task (running on cpu) in domains
  * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
@@ -2701,6 +2790,7 @@ select_task_rq_fair(struct task_struct *
 	int new_cpu = cpu;
 	int want_affine = 0;
 	int sync = wake_flags & WF_SYNC;
+	int node = tsk_home_node(p);
 
 	if (p->nr_cpus_allowed == 1)
 		return prev_cpu;
@@ -2712,6 +2802,36 @@ select_task_rq_fair(struct task_struct *
 	}
 
 	rcu_read_lock();
+	if (sched_feat_numa(NUMA_TTWU_BIAS) && node != -1) {
+		/*
+		 * For fork,exec find the idlest cpu in the home-node.
+		 */
+		if (sd_flag & (SD_BALANCE_FORK|SD_BALANCE_EXEC)) {
+			int node_cpu = numa_select_node_cpu(p, node);
+			if (node_cpu < 0)
+				goto find_sd;
+
+			new_cpu = cpu = node_cpu;
+			sd = per_cpu(sd_node, cpu);
+			goto pick_idlest;
+		}
+
+		/*
+		 * For wake, pretend we were running in the home-node.
+		 */
+		if (cpu_to_node(prev_cpu) != node) {
+			int node_cpu = numa_select_node_cpu(p, node);
+			if (node_cpu < 0)
+				goto find_sd;
+
+			if (sched_feat_numa(NUMA_TTWU_TO))
+				cpu = node_cpu;
+			else
+				prev_cpu = node_cpu;
+		}
+	}
+
+find_sd:
 	for_each_domain(cpu, tmp) {
 		if (!(tmp->flags & SD_LOAD_BALANCE))
 			continue;
@@ -2738,6 +2858,7 @@ select_task_rq_fair(struct task_struct *
 		goto unlock;
 	}
 
+pick_idlest:
 	while (sd) {
 		int load_idx = sd->forkexec_idx;
 		struct sched_group *group;
@@ -3060,6 +3181,8 @@ struct lb_env {
 
 	unsigned int		flags;
 
+	struct list_head	*tasks;
+
 	unsigned int		loop;
 	unsigned int		loop_break;
 	unsigned int		loop_max;
@@ -3080,11 +3203,28 @@ static void move_task(struct task_struct
 	check_preempt_curr(env->dst_rq, p, 0);
 }
 
+static int task_numa_hot(struct task_struct *p, struct lb_env *env)
+{
+	int from_dist, to_dist;
+	int node = tsk_home_node(p);
+
+	if (!sched_feat_numa(NUMA_HOT) || node == -1)
+		return 0; /* no node preference */
+
+	from_dist = node_distance(cpu_to_node(env->src_cpu), node);
+	to_dist = node_distance(cpu_to_node(env->dst_cpu), node);
+
+	if (to_dist < from_dist)
+		return 0; /* getting closer is ok */
+
+	return 1; /* stick to where we are */
+}
+
 /*
  * Is this task likely cache-hot:
  */
 static int
-task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
+task_hot(struct task_struct *p, struct lb_env *env)
 {
 	s64 delta;
 
@@ -3107,7 +3247,7 @@ task_hot(struct task_struct *p, u64 now,
 	if (sysctl_sched_migration_cost == 0)
 		return 0;
 
-	delta = now - p->se.exec_start;
+	delta = env->src_rq->clock_task - p->se.exec_start;
 
 	return delta < (s64)sysctl_sched_migration_cost;
 }
@@ -3164,7 +3304,9 @@ int can_migrate_task(struct task_struct
 	 * 2) too many balance attempts have failed.
 	 */
 
-	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
+	tsk_cache_hot = task_hot(p, env);
+	if (env->idle == CPU_NOT_IDLE)
+		tsk_cache_hot |= task_numa_hot(p, env);
 	if (!tsk_cache_hot ||
 		env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
 #ifdef CONFIG_SCHEDSTATS
@@ -3190,11 +3332,11 @@ int can_migrate_task(struct task_struct
  *
  * Called with both runqueues locked.
  */
-static int move_one_task(struct lb_env *env)
+static int __move_one_task(struct lb_env *env)
 {
 	struct task_struct *p, *n;
 
-	list_for_each_entry_safe(p, n, &env->src_rq->cfs_tasks, se.group_node) {
+	list_for_each_entry_safe(p, n, env->tasks, se.group_node) {
 		if (throttled_lb_pair(task_group(p), env->src_rq->cpu, env->dst_cpu))
 			continue;
 
@@ -3213,7 +3355,20 @@ static int move_one_task(struct lb_env *
 	return 0;
 }
 
-static unsigned long task_h_load(struct task_struct *p);
+static int move_one_task(struct lb_env *env)
+{
+	if (sched_feat_numa(NUMA_PULL)) {
+		env->tasks = offnode_tasks(env->src_rq);
+		if (__move_one_task(env))
+			return 1;
+	}
+
+	env->tasks = &env->src_rq->cfs_tasks;
+	if (__move_one_task(env))
+		return 1;
+
+	return 0;
+}
 
 static const unsigned int sched_nr_migrate_break = 32;
 
@@ -3226,7 +3381,6 @@ static const unsigned int sched_nr_migra
  */
 static int move_tasks(struct lb_env *env)
 {
-	struct list_head *tasks = &env->src_rq->cfs_tasks;
 	struct task_struct *p;
 	unsigned long load;
 	int pulled = 0;
@@ -3234,8 +3388,9 @@ static int move_tasks(struct lb_env *env
 	if (env->imbalance <= 0)
 		return 0;
 
-	while (!list_empty(tasks)) {
-		p = list_first_entry(tasks, struct task_struct, se.group_node);
+again:
+	while (!list_empty(env->tasks)) {
+		p = list_first_entry(env->tasks, struct task_struct, se.group_node);
 
 		env->loop++;
 		/* We've more or less seen every task there is, call it quits */
@@ -3246,7 +3401,7 @@ static int move_tasks(struct lb_env *env
 		if (env->loop > env->loop_break) {
 			env->loop_break += sched_nr_migrate_break;
 			env->flags |= LBF_NEED_BREAK;
-			break;
+			goto out;
 		}
 
 		if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
@@ -3274,7 +3429,7 @@ static int move_tasks(struct lb_env *env
 		 * the critical section.
 		 */
 		if (env->idle == CPU_NEWLY_IDLE)
-			break;
+			goto out;
 #endif
 
 		/*
@@ -3282,13 +3437,20 @@ static int move_tasks(struct lb_env *env
 		 * weighted load.
 		 */
 		if (env->imbalance <= 0)
-			break;
+			goto out;
 
 		continue;
 next:
-		list_move_tail(&p->se.group_node, tasks);
+		list_move_tail(&p->se.group_node, env->tasks);
+	}
+
+	if (env->tasks == offnode_tasks(env->src_rq)) {
+		env->tasks = &env->src_rq->cfs_tasks;
+		env->loop = 0;
+		goto again;
 	}
 
+out:
 	/*
 	 * Right now, this is one of only two places move_task() is called,
 	 * so we can safely collect move_task() stats here rather than
@@ -3407,12 +3569,13 @@ static inline void update_shares(int cpu
 static inline void update_h_load(long cpu)
 {
 }
-
+#ifdef CONFIG_SMP
 static unsigned long task_h_load(struct task_struct *p)
 {
 	return p->se.load.weight;
 }
 #endif
+#endif
 
 /********** Helpers for find_busiest_group ************************/
 /*
@@ -3443,6 +3606,14 @@ struct sd_lb_stats {
 	unsigned int  busiest_group_weight;
 
 	int group_imb; /* Is there imbalance in this sd */
+#ifdef CONFIG_SCHED_NUMA
+	struct sched_group *numa_group; /* group which has offnode_tasks */
+	unsigned long numa_group_weight;
+	unsigned long numa_group_running;
+
+	unsigned long this_offnode_running;
+	unsigned long this_onnode_running;
+#endif
 };
 
 /*
@@ -3458,6 +3629,11 @@ struct sg_lb_stats {
 	unsigned long group_weight;
 	int group_imb; /* Is there an imbalance in the group ? */
 	int group_has_capacity; /* Is there extra capacity in the group? */
+#ifdef CONFIG_SCHED_NUMA
+	unsigned long numa_offnode_weight;
+	unsigned long numa_offnode_running;
+	unsigned long numa_onnode_running;
+#endif
 };
 
 /**
@@ -3486,6 +3662,121 @@ static inline int get_sd_load_idx(struct
 	return load_idx;
 }
 
+#ifdef CONFIG_SCHED_NUMA
+static inline void update_sg_numa_stats(struct sg_lb_stats *sgs, struct rq *rq)
+{
+	sgs->numa_offnode_weight += rq->offnode_weight;
+	sgs->numa_offnode_running += rq->offnode_running;
+	sgs->numa_onnode_running += rq->onnode_running;
+}
+
+/*
+ * Since the offnode lists are indiscriminate (they contain tasks for all other
+ * nodes) it is impossible to say if there's any task on there that wants to
+ * move towards the pulling cpu. Therefore select a random offnode list to pull
+ * from such that eventually we'll try them all.
+ *
+ * Select a random group that has offnode tasks as sds->numa_group
+ */
+static inline void update_sd_numa_stats(struct sched_domain *sd,
+		struct sched_group *group, struct sd_lb_stats *sds,
+		int local_group, struct sg_lb_stats *sgs)
+{
+	if (!(sd->flags & SD_NUMA))
+		return;
+
+	if (local_group) {
+		sds->this_offnode_running = sgs->numa_offnode_running;
+		sds->this_onnode_running  = sgs->numa_onnode_running;
+		return;
+	}
+
+	if (!sgs->numa_offnode_running)
+		return;
+
+	if (!sds->numa_group || pick_numa_rand(sd->span_weight / group->group_weight)) {
+		sds->numa_group = group;
+		sds->numa_group_weight = sgs->numa_offnode_weight;
+		sds->numa_group_running = sgs->numa_offnode_running;
+	}
+}
+
+/*
+ * Pick a random queue from the group that has offnode tasks.
+ */
+static struct rq *find_busiest_numa_queue(struct lb_env *env,
+					  struct sched_group *group)
+{
+	struct rq *busiest = NULL, *rq;
+	int cpu;
+
+	for_each_cpu_and(cpu, sched_group_cpus(group), env->cpus) {
+		rq = cpu_rq(cpu);
+		if (!rq->offnode_running)
+			continue;
+		if (!busiest || pick_numa_rand(group->group_weight))
+			busiest = rq;
+	}
+
+	return busiest;
+}
+
+/*
+ * Called in case of no other imbalance; if there is a queue running offnode
+ * tasks we'll say we're imbalanced anyway to nudge these tasks towards their
+ * proper node.
+ */
+static inline int check_numa_busiest_group(struct lb_env *env, struct sd_lb_stats *sds)
+{
+	if (!sched_feat(NUMA_PULL_BIAS))
+		return 0;
+
+	if (!sds->numa_group)
+		return 0;
+
+	/*
+	 * Only pull an offnode task home if we've got offnode or !numa tasks to trade for it.
+	 */
+	if (!sds->this_offnode_running &&
+	    !(sds->this_nr_running - sds->this_onnode_running - sds->this_offnode_running))
+		return 0;
+
+	env->imbalance = sds->numa_group_weight / sds->numa_group_running;
+	sds->busiest = sds->numa_group;
+	env->find_busiest_queue = find_busiest_numa_queue;
+	return 1;
+}
+
+static inline bool need_active_numa_balance(struct lb_env *env)
+{
+	return env->find_busiest_queue == find_busiest_numa_queue &&
+			env->src_rq->offnode_running == 1 &&
+			env->src_rq->nr_running == 1;
+}
+
+#else /* CONFIG_SCHED_NUMA */
+
+static inline void update_sg_numa_stats(struct sg_lb_stats *sgs, struct rq *rq)
+{
+}
+
+static inline void update_sd_numa_stats(struct sched_domain *sd,
+		struct sched_group *group, struct sd_lb_stats *sds,
+		int local_group, struct sg_lb_stats *sgs)
+{
+}
+
+static inline int check_numa_busiest_group(struct lb_env *env, struct sd_lb_stats *sds)
+{
+	return 0;
+}
+
+static inline bool need_active_numa_balance(struct lb_env *env)
+{
+	return false;
+}
+#endif /* CONFIG_SCHED_NUMA */
+
 unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu)
 {
 	return SCHED_POWER_SCALE;
@@ -3701,6 +3992,8 @@ static inline void update_sg_lb_stats(st
 		sgs->sum_weighted_load += weighted_cpuload(i);
 		if (idle_cpu(i))
 			sgs->idle_cpus++;
+
+		update_sg_numa_stats(sgs, rq);
 	}
 
 	/*
@@ -3854,6 +4147,8 @@ static inline void update_sd_lb_stats(st
 			sds->group_imb = sgs.group_imb;
 		}
 
+		update_sd_numa_stats(env->sd, sg, sds, local_group, &sgs);
+
 		sg = sg->next;
 	} while (sg != env->sd->groups);
 }
@@ -4084,7 +4379,7 @@ find_busiest_group(struct lb_env *env, i
 
 	/* There is no busy sibling group to pull tasks from */
 	if (!sds.busiest || sds.busiest_nr_running == 0)
-		goto out_balanced;
+		goto ret;
 
 	sds.avg_load = (SCHED_POWER_SCALE * sds.total_load) / sds.total_pwr;
 
@@ -4106,14 +4401,14 @@ find_busiest_group(struct lb_env *env, i
 	 * don't try and pull any tasks.
 	 */
 	if (sds.this_load >= sds.max_load)
-		goto out_balanced;
+		goto ret;
 
 	/*
 	 * Don't pull any tasks if this group is already above the domain
 	 * average load.
 	 */
 	if (sds.this_load >= sds.avg_load)
-		goto out_balanced;
+		goto ret;
 
 	if (env->idle == CPU_IDLE) {
 		/*
@@ -4140,6 +4435,9 @@ force_balance:
 	return sds.busiest;
 
 out_balanced:
+	if (check_numa_busiest_group(env, &sds))
+		return sds.busiest;
+
 ret:
 	env->imbalance = 0;
 	return NULL;
@@ -4218,6 +4516,9 @@ static int need_active_balance(struct lb
 			return 1;
 	}
 
+	if (need_active_numa_balance(env))
+		return 1;
+
 	return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
 }
 
@@ -4270,6 +4571,8 @@ redo:
 		schedstat_inc(sd, lb_nobusyq[idle]);
 		goto out_balanced;
 	}
+	env.src_rq  = busiest;
+	env.src_cpu = busiest->cpu;
 
 	BUG_ON(busiest == env.dst_rq);
 
@@ -4288,6 +4591,10 @@ redo:
 		env.src_cpu   = busiest->cpu;
 		env.src_rq    = busiest;
 		env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
+		if (sched_feat_numa(NUMA_PULL))
+			env.tasks = offnode_tasks(busiest);
+		else
+			env.tasks = &busiest->cfs_tasks;
 
 		update_h_load(env.src_cpu);
 more_balance:
Index: tip/kernel/sched/features.h
===================================================================
--- tip.orig/kernel/sched/features.h
+++ tip/kernel/sched/features.h
@@ -61,3 +61,13 @@ SCHED_FEAT(TTWU_QUEUE, true)
 SCHED_FEAT(FORCE_SD_OVERLAP, false)
 SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)
+
+#ifdef CONFIG_SCHED_NUMA
+SCHED_FEAT(NUMA,           true)
+SCHED_FEAT(NUMA_HOT,       true)
+SCHED_FEAT(NUMA_TTWU_BIAS, false)
+SCHED_FEAT(NUMA_TTWU_TO,   false)
+SCHED_FEAT(NUMA_PULL,      true)
+SCHED_FEAT(NUMA_PULL_BIAS, true)
+#endif
+
Index: tip/kernel/sched/sched.h
===================================================================
--- tip.orig/kernel/sched/sched.h
+++ tip/kernel/sched/sched.h
@@ -418,6 +418,13 @@ struct rq {
 
 	struct list_head cfs_tasks;
 
+#ifdef CONFIG_SCHED_NUMA
+	unsigned long    onnode_running;
+	unsigned long    offnode_running;
+	unsigned long	 offnode_weight;
+	struct list_head offnode_tasks;
+#endif
+
 	u64 rt_avg;
 	u64 age_stamp;
 	u64 idle_stamp;
@@ -469,6 +476,15 @@ struct rq {
 #endif
 };
 
+static inline struct list_head *offnode_tasks(struct rq *rq)
+{
+#ifdef CONFIG_SCHED_NUMA
+	return &rq->offnode_tasks;
+#else
+	return NULL;
+#endif
+}
+
 static inline int cpu_of(struct rq *rq)
 {
 #ifdef CONFIG_SMP
@@ -529,6 +545,7 @@ static inline struct sched_domain *highe
 
 DECLARE_PER_CPU(struct sched_domain *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_id);
+DECLARE_PER_CPU(struct sched_domain *, sd_node);
 
 extern int group_balance_cpu(struct sched_group *sg);
 



^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 24/31] sched, numa, mm: Introduce last_nid in the pageframe
  2012-10-25 12:16 [PATCH 00/31] numa/core patches Peter Zijlstra
                   ` (22 preceding siblings ...)
  2012-10-25 12:16 ` [PATCH 23/31] sched, numa, mm: Implement home-node awareness Peter Zijlstra
@ 2012-10-25 12:16 ` Peter Zijlstra
  2012-11-01 15:17   ` Mel Gorman
  2012-10-25 12:16 ` [PATCH 25/31] sched, numa, mm/mpol: Add_MPOL_F_HOME Peter Zijlstra
                   ` (9 subsequent siblings)
  33 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2012-10-25 12:16 UTC (permalink / raw)
  To: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Zijlstra, Ingo Molnar

[-- Attachment #1: 0024-sched-numa-mm-Fold-page-nid_last-into-page-flags-whe.patch --]
[-- Type: text/plain, Size: 10584 bytes --]

Introduce a per-page last_nid field and fold it into the struct
page::flags field whenever possible.

The unlikely/rare 32bit NUMA configs will likely grow the page-frame.

Completely dropping 32bit support for CONFIG_SCHED_NUMA would simplify
things, but it would also remove the warning if we grow enough 64bit
only page-flags to push the last-nid out.
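
When the bits do fit, last_nid sits next to the zone bits and is accessed
with plain shift-and-mask arithmetic; a condensed illustration of the
accessors added below (not the exact code):

	/* page->flags: | [SECTION] | [NODE] | ZONE | [LAST_NID] | ... | FLAGS | */
	nid = (page->flags >> LAST_NID_PGSHIFT) & LAST_NID_MASK;

	/* updates use a cmpxchg loop so concurrent flags updates aren't lost */
	do {
		old = flags = page->flags;
		flags &= ~(LAST_NID_MASK << LAST_NID_PGSHIFT);
		flags |= (nid & LAST_NID_MASK) << LAST_NID_PGSHIFT;
	} while (cmpxchg(&page->flags, old, flags) != old);

Only when SECTIONS+NODES+ZONES+LAST_NID no longer fit in the flags word do
we fall back to the separate page->_last_nid field (and emit the build
warning mentioned above).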

Suggested-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/mm.h                |   90 ++++++++++++++++++++------------------
 include/linux/mm_types.h          |    5 ++
 include/linux/mmzone.h            |   14 -----
 include/linux/page-flags-layout.h |   83 +++++++++++++++++++++++++++++++++++
 mm/huge_memory.c                  |    1 
 mm/memory.c                       |    4 +
 6 files changed, 143 insertions(+), 54 deletions(-)
 create mode 100644 include/linux/page-flags-layout.h

Index: tip/include/linux/mm.h
===================================================================
--- tip.orig/include/linux/mm.h
+++ tip/include/linux/mm.h
@@ -594,50 +594,11 @@ static inline pte_t maybe_mkwrite(pte_t
  * sets it, so none of the operations on it need to be atomic.
  */
 
-
-/*
- * page->flags layout:
- *
- * There are three possibilities for how page->flags get
- * laid out.  The first is for the normal case, without
- * sparsemem.  The second is for sparsemem when there is
- * plenty of space for node and section.  The last is when
- * we have run out of space and have to fall back to an
- * alternate (slower) way of determining the node.
- *
- * No sparsemem or sparsemem vmemmap: |       NODE     | ZONE | ... | FLAGS |
- * classic sparse with space for node:| SECTION | NODE | ZONE | ... | FLAGS |
- * classic sparse no space for node:  | SECTION |     ZONE    | ... | FLAGS |
- */
-#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
-#define SECTIONS_WIDTH		SECTIONS_SHIFT
-#else
-#define SECTIONS_WIDTH		0
-#endif
-
-#define ZONES_WIDTH		ZONES_SHIFT
-
-#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
-#define NODES_WIDTH		NODES_SHIFT
-#else
-#ifdef CONFIG_SPARSEMEM_VMEMMAP
-#error "Vmemmap: No space for nodes field in page flags"
-#endif
-#define NODES_WIDTH		0
-#endif
-
-/* Page flags: | [SECTION] | [NODE] | ZONE | ... | FLAGS | */
+/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_NID] | ... | FLAGS | */
 #define SECTIONS_PGOFF		((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
 #define NODES_PGOFF		(SECTIONS_PGOFF - NODES_WIDTH)
 #define ZONES_PGOFF		(NODES_PGOFF - ZONES_WIDTH)
-
-/*
- * We are going to use the flags for the page to node mapping if its in
- * there.  This includes the case where there is no node, so it is implicit.
- */
-#if !(NODES_WIDTH > 0 || NODES_SHIFT == 0)
-#define NODE_NOT_IN_PAGE_FLAGS
-#endif
+#define LAST_NID_PGOFF		(ZONES_PGOFF - LAST_NID_WIDTH)
 
 /*
  * Define the bit shifts to access each section.  For non-existent
@@ -647,6 +608,7 @@ static inline pte_t maybe_mkwrite(pte_t
 #define SECTIONS_PGSHIFT	(SECTIONS_PGOFF * (SECTIONS_WIDTH != 0))
 #define NODES_PGSHIFT		(NODES_PGOFF * (NODES_WIDTH != 0))
 #define ZONES_PGSHIFT		(ZONES_PGOFF * (ZONES_WIDTH != 0))
+#define LAST_NID_PGSHIFT	(LAST_NID_PGOFF * (LAST_NID_WIDTH != 0))
 
 /* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */
 #ifdef NODE_NOT_IN_PAGE_FLAGS
@@ -668,6 +630,7 @@ static inline pte_t maybe_mkwrite(pte_t
 #define ZONES_MASK		((1UL << ZONES_WIDTH) - 1)
 #define NODES_MASK		((1UL << NODES_WIDTH) - 1)
 #define SECTIONS_MASK		((1UL << SECTIONS_WIDTH) - 1)
+#define LAST_NID_MASK		((1UL << LAST_NID_WIDTH) - 1)
 #define ZONEID_MASK		((1UL << ZONEID_SHIFT) - 1)
 
 static inline enum zone_type page_zonenum(const struct page *page)
@@ -706,6 +669,51 @@ static inline int page_to_nid(const stru
 }
 #endif
 
+#ifdef CONFIG_SCHED_NUMA
+#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
+static inline int page_xchg_last_nid(struct page *page, int nid)
+{
+	return xchg(&page->_last_nid, nid);
+}
+
+static inline int page_last_nid(struct page *page)
+{
+	return page->_last_nid;
+}
+#else
+static inline int page_xchg_last_nid(struct page *page, int nid)
+{
+	unsigned long old_flags, flags;
+	int last_nid;
+
+	do {
+		old_flags = flags = page->flags;
+		last_nid = (flags >> LAST_NID_PGSHIFT) & LAST_NID_MASK;
+
+		flags &= ~(LAST_NID_MASK << LAST_NID_PGSHIFT);
+		flags |= (nid & LAST_NID_MASK) << LAST_NID_PGSHIFT;
+	} while (unlikely(cmpxchg(&page->flags, old_flags, flags) != old_flags));
+
+	return last_nid;
+}
+
+static inline int page_last_nid(struct page *page)
+{
+	return (page->flags >> LAST_NID_PGSHIFT) & LAST_NID_MASK;
+}
+#endif /* LAST_NID_NOT_IN_PAGE_FLAGS */
+#else /* CONFIG_SCHED_NUMA */
+static inline int page_xchg_last_nid(struct page *page, int nid)
+{
+	return page_to_nid(page);
+}
+
+static inline int page_last_nid(struct page *page)
+{
+	return page_to_nid(page);
+}
+#endif /* CONFIG_SCHED_NUMA */
+
 static inline struct zone *page_zone(const struct page *page)
 {
 	return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
Index: tip/include/linux/mm_types.h
===================================================================
--- tip.orig/include/linux/mm_types.h
+++ tip/include/linux/mm_types.h
@@ -12,6 +12,7 @@
 #include <linux/cpumask.h>
 #include <linux/page-debug-flags.h>
 #include <linux/uprobes.h>
+#include <linux/page-flags-layout.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
 
@@ -175,6 +176,10 @@ struct page {
 	 */
 	void *shadow;
 #endif
+
+#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
+	int _last_nid;
+#endif
 }
 /*
  * The struct page can be forced to be double word aligned so that atomic ops
Index: tip/include/linux/mmzone.h
===================================================================
--- tip.orig/include/linux/mmzone.h
+++ tip/include/linux/mmzone.h
@@ -15,7 +15,7 @@
 #include <linux/seqlock.h>
 #include <linux/nodemask.h>
 #include <linux/pageblock-flags.h>
-#include <generated/bounds.h>
+#include <linux/page-flags-layout.h>
 #include <linux/atomic.h>
 #include <asm/page.h>
 
@@ -317,16 +317,6 @@ enum zone_type {
  * match the requested limits. See gfp_zone() in include/linux/gfp.h
  */
 
-#if MAX_NR_ZONES < 2
-#define ZONES_SHIFT 0
-#elif MAX_NR_ZONES <= 2
-#define ZONES_SHIFT 1
-#elif MAX_NR_ZONES <= 4
-#define ZONES_SHIFT 2
-#else
-#error ZONES_SHIFT -- too many zones configured adjust calculation
-#endif
-
 struct zone {
 	/* Fields commonly accessed by the page allocator */
 
@@ -1029,8 +1019,6 @@ static inline unsigned long early_pfn_to
  * PA_SECTION_SHIFT		physical address to/from section number
  * PFN_SECTION_SHIFT		pfn to/from section number
  */
-#define SECTIONS_SHIFT		(MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
-
 #define PA_SECTION_SHIFT	(SECTION_SIZE_BITS)
 #define PFN_SECTION_SHIFT	(SECTION_SIZE_BITS - PAGE_SHIFT)
 
Index: tip/include/linux/page-flags-layout.h
===================================================================
--- /dev/null
+++ tip/include/linux/page-flags-layout.h
@@ -0,0 +1,83 @@
+#ifndef _LINUX_PAGE_FLAGS_LAYOUT
+#define _LINUX_PAGE_FLAGS_LAYOUT
+
+#include <linux/numa.h>
+#include <generated/bounds.h>
+
+#if MAX_NR_ZONES < 2
+#define ZONES_SHIFT 0
+#elif MAX_NR_ZONES <= 2
+#define ZONES_SHIFT 1
+#elif MAX_NR_ZONES <= 4
+#define ZONES_SHIFT 2
+#else
+#error ZONES_SHIFT -- too many zones configured adjust calculation
+#endif
+
+#ifdef CONFIG_SPARSEMEM
+#include <asm/sparsemem.h>
+
+/* 
+ * SECTION_SHIFT    		#bits space required to store a section #
+ */
+#define SECTIONS_SHIFT         (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
+#endif
+
+/*
+ * page->flags layout:
+ *
+ * There are five possibilities for how page->flags get laid out.  The first
+ * (and second) is for the normal case, without sparsemem. The third (and
+ * fourth) are for sparsemem when there is plenty of space for node and
+ * section. The last is when we have run out of space and have to fall back
+ * to an alternate (slower) way of determining the node.
+ *
+ * No sparsemem or sparsemem vmemmap: |       NODE     | ZONE |            ... | FLAGS |
+ *     "      plus space for last_nid:|       NODE     | ZONE | LAST_NID | ... | FLAGS |
+ * classic sparse with space for node:| SECTION | NODE | ZONE |            ... | FLAGS |
+ *     "      plus space for last_nid:| SECTION | NODE | ZONE | LAST_NID | ... | FLAGS |
+ * classic sparse no space for node:  | SECTION |     ZONE    |            ... | FLAGS |
+ */
+#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
+
+#define SECTIONS_WIDTH		SECTIONS_SHIFT
+#else
+#define SECTIONS_WIDTH		0
+#endif
+
+#define ZONES_WIDTH		ZONES_SHIFT
+
+#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#define NODES_WIDTH		NODES_SHIFT
+#else
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+#error "Vmemmap: No space for nodes field in page flags"
+#endif
+#define NODES_WIDTH		0
+#endif
+
+#ifdef CONFIG_SCHED_NUMA
+#define LAST_NID_SHIFT	NODES_SHIFT
+#else
+#define LAST_NID_SHIFT	0
+#endif
+
+#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_NID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#define LAST_NID_WIDTH	LAST_NID_SHIFT
+#else
+#define LAST_NID_WIDTH	0
+#endif
+
+/*
+ * We are going to use the flags for the page to node mapping if its in
+ * there.  This includes the case where there is no node, so it is implicit.
+ */
+#if !(NODES_WIDTH > 0 || NODES_SHIFT == 0)
+#define NODE_NOT_IN_PAGE_FLAGS
+#endif
+
+#if defined(CONFIG_SCHED_NUMA) && LAST_NID_WIDTH == 0
+#define LAST_NID_NOT_IN_PAGE_FLAGS
+#endif
+
+#endif /* _LINUX_PAGE_FLAGS_LAYOUT */
Index: tip/mm/huge_memory.c
===================================================================
--- tip.orig/mm/huge_memory.c
+++ tip/mm/huge_memory.c
@@ -1440,6 +1440,7 @@ static void __split_huge_page_refcount(s
 		page_tail->mapping = page->mapping;
 
 		page_tail->index = page->index + i;
+		page_xchg_last_nid(page_tail, page_last_nid(page));
 
 		BUG_ON(!PageAnon(page_tail));
 		BUG_ON(!PageUptodate(page_tail));
Index: tip/mm/memory.c
===================================================================
--- tip.orig/mm/memory.c
+++ tip/mm/memory.c
@@ -68,6 +68,10 @@
 
 #include "internal.h"
 
+#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
+#warning Unfortunate NUMA config, growing page-frame for last_nid.
+#endif
+
 #ifndef CONFIG_NEED_MULTIPLE_NODES
 /* use the per-pgdat data instead for discontigmem - mbligh */
 unsigned long max_mapnr;



^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 25/31] sched, numa, mm/mpol: Add_MPOL_F_HOME
  2012-10-25 12:16 [PATCH 00/31] numa/core patches Peter Zijlstra
                   ` (23 preceding siblings ...)
  2012-10-25 12:16 ` [PATCH 24/31] sched, numa, mm: Introduce last_nid in the pageframe Peter Zijlstra
@ 2012-10-25 12:16 ` Peter Zijlstra
  2012-10-25 12:16 ` [PATCH 26/31] sched, numa, mm: Add fault driven placement and migration policy Peter Zijlstra
                   ` (8 subsequent siblings)
  33 siblings, 0 replies; 135+ messages in thread
From: Peter Zijlstra @ 2012-10-25 12:16 UTC (permalink / raw)
  To: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Zijlstra, Paul Turner, Ingo Molnar

[-- Attachment #1: 0025-sched-numa-mm-Add_MPOL_F_HOME.patch --]
[-- Type: text/plain, Size: 3081 bytes --]

Add MPOL_F_HOME, to implement multi-stage home node binding.
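
As the comment added to mpol_misplaced() below explains, requiring the same
node to show up in page_last_nid() twice in a row turns a per-sample
probability p into roughly p^2 (assuming independent samples). For example,
a task responsible for only 10% of the faults on a shared page would pass
the filter about 1% of the time, while a task that is effectively the sole
user of the page (p close to 1) still gets it migrated promptly. (Numbers
are illustrative only.)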

Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Suggested-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Turner <pjt@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/uapi/linux/mempolicy.h |    1 +
 mm/mempolicy.c                 |   34 +++++++++++++++++++++++++++++++++-
 2 files changed, 34 insertions(+), 1 deletion(-)

Index: tip/include/uapi/linux/mempolicy.h
===================================================================
--- tip.orig/include/uapi/linux/mempolicy.h
+++ tip/include/uapi/linux/mempolicy.h
@@ -69,6 +69,7 @@ enum mpol_rebind_step {
 #define MPOL_F_LOCAL   (1 << 1)	/* preferred local allocation */
 #define MPOL_F_REBINDING (1 << 2)	/* identify policies in rebinding */
 #define MPOL_F_MOF	(1 << 3) /* this policy wants migrate on fault */
+#define MPOL_F_HOME	(1 << 4) /* this is the home-node policy */
 
 
 #endif /* _UAPI_LINUX_MEMPOLICY_H */
Index: tip/mm/mempolicy.c
===================================================================
--- tip.orig/mm/mempolicy.c
+++ tip/mm/mempolicy.c
@@ -2190,6 +2190,7 @@ static void sp_free(struct sp_node *n)
  * @page   - page to be checked
  * @vma    - vm area where page mapped
  * @addr   - virtual address where page mapped
+ * @multi  - use multi-stage node binding
  *
  * Lookup current policy node id for vma,addr and "compare to" page's
  * node id.
@@ -2252,6 +2253,37 @@ int mpol_misplaced(struct page *page, st
 	default:
 		BUG();
 	}
+
+	/*
+	 * Multi-stage node selection is used in conjunction with a periodic
+	 * migration fault to build a temporal task<->page relation. By
+	 * using a two-stage filter we remove short/unlikely relations.
+	 *
+	 * Using P(p) ~ n_p / n_t as per frequentist probability, we can
+	 * equate a task's usage of a particular page (n_p) per total usage
+	 * of this page (n_t) (in a given time-span) to a probability.
+	 *
+	 * Our periodic faults will then sample this probability and getting
+	 * the same result twice in a row, given these samples are fully
+	 * independent, is then given by P(n)^2, provided our sample period
+	 * is sufficiently short compared to the usage pattern.
+	 *
+	 * This quadric squishes small probabilities, making it less likely
+	 * we act on an unlikely task<->page relation.
+	 */
+	if (pol->flags & MPOL_F_HOME) {
+		int last_nid;
+
+		/*
+		 * Migrate towards the current node, depends on
+		 * task_numa_placement() details.
+		 */
+		polnid = numa_node_id();
+		last_nid = page_xchg_last_nid(page, polnid);
+		if (last_nid != polnid)
+			goto out;
+	}
+
 	if (curnid != polnid)
 		ret = polnid;
 out:
@@ -2444,7 +2476,7 @@ void __init numa_policy_init(void)
 		preferred_node_policy[nid] = (struct mempolicy) {
 			.refcnt = ATOMIC_INIT(1),
 			.mode = MPOL_PREFERRED,
-			.flags = MPOL_F_MOF,
+			.flags = MPOL_F_MOF | MPOL_F_HOME,
 			.v = { .preferred_node = nid, },
 		};
 	}



^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 26/31] sched, numa, mm: Add fault driven placement and migration policy
  2012-10-25 12:16 [PATCH 00/31] numa/core patches Peter Zijlstra
                   ` (24 preceding siblings ...)
  2012-10-25 12:16 ` [PATCH 25/31] sched, numa, mm/mpol: Add_MPOL_F_HOME Peter Zijlstra
@ 2012-10-25 12:16 ` Peter Zijlstra
  2012-10-25 20:53   ` Linus Torvalds
                     ` (2 more replies)
  2012-10-25 12:16 ` [PATCH 27/31] sched, numa, mm: Add credits for NUMA placement Peter Zijlstra
                   ` (7 subsequent siblings)
  33 siblings, 3 replies; 135+ messages in thread
From: Peter Zijlstra @ 2012-10-25 12:16 UTC (permalink / raw)
  To: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton
  Cc: linux-kernel, linux-mm, Ingo Molnar, Peter Zijlstra

[-- Attachment #1: 0026-sched-numa-mm-Add-fault-driven-placement-and-migrati.patch --]
[-- Type: text/plain, Size: 15586 bytes --]

As per the problem/design document Documentation/scheduler/numa-problem.txt
implement 3ac & 4.

( A pure 3a was found too unstable, I did briefly try 3bc
  but found no significant improvement. )

Implement a per-task memory placement scheme that relies on a regular
PROT_NONE 'migration' fault to scan the memory space of the process
and uses a two-stage migration scheme to reduce the influence of
unlikely usage relations.

It relies on the assumption that the compute part is tied to a
particular task and builds a task<->page relation set to model the
compute<->data relation.

In the previous patch we made memory migrate towards where the task
is running, here we select the node on which most memory is located
as the preferred node to run on.

This creates a feed-back control loop between trying to schedule a
task on a node and migrating memory towards the node the task is
scheduled on. 
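
Condensed, the placement half of that loop looks roughly like this
(an illustration of the task_numa_placement() hunk below, not the exact
code):

	/* once per scan sequence: pick the node with the most recorded faults */
	for (node = 0; node < nr_node_ids; node++) {
		if (p->numa_faults[node] > max_faults) {
			max_faults = p->numa_faults[node];
			max_node = node;
		}
		p->numa_faults[node] /= 2;	/* decay old samples */
	}
	if (max_node != -1 && p->node != max_node)
		sched_setnode(p, max_node);	/* task follows its memory */

while the MPOL_F_HOME policy from the previous patch keeps migrating pages
towards whatever node the task is currently running on, which is what
closes the loop. (The real code additionally damps re-placement with the
NUMA_SETTLE feature and the sched_numa_settle_count sysctl.)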

Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Suggested-by: Rik van Riel <riel@redhat.com>
Fixes-by: David Rientjes <rientjes@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/mm_types.h |    4 +
 include/linux/sched.h    |   35 +++++++--
 kernel/sched/core.c      |   16 ++++
 kernel/sched/fair.c      |  175 +++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/features.h  |    1 
 kernel/sched/sched.h     |   31 +++++---
 kernel/sysctl.c          |   31 +++++++-
 mm/huge_memory.c         |    7 +
 mm/memory.c              |    4 -
 9 files changed, 282 insertions(+), 22 deletions(-)
Index: tip/include/linux/mm_types.h
===================================================================
--- tip.orig/include/linux/mm_types.h
+++ tip/include/linux/mm_types.h
@@ -403,6 +403,10 @@ struct mm_struct {
 #ifdef CONFIG_CPUMASK_OFFSTACK
 	struct cpumask cpumask_allocation;
 #endif
+#ifdef CONFIG_SCHED_NUMA
+	unsigned long numa_next_scan;
+	int numa_scan_seq;
+#endif
 	struct uprobes_state uprobes_state;
 };
 
Index: tip/include/linux/sched.h
===================================================================
--- tip.orig/include/linux/sched.h
+++ tip/include/linux/sched.h
@@ -1481,9 +1481,16 @@ struct task_struct {
 	short pref_node_fork;
 #endif
 #ifdef CONFIG_SCHED_NUMA
-	int node;
+	int node;			/* task home node   */
+	int numa_scan_seq;
+	int numa_migrate_seq;
+	unsigned int numa_scan_period;
+	u64 node_stamp;			/* migration stamp  */
 	unsigned long numa_contrib;
-#endif
+	unsigned long *numa_faults;
+	struct callback_head numa_work;
+#endif /* CONFIG_SCHED_NUMA */
+
 	struct rcu_head rcu;
 
 	/*
@@ -1558,15 +1565,24 @@ struct task_struct {
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
+#ifdef CONFIG_SCHED_NUMA
 static inline int tsk_home_node(struct task_struct *p)
 {
-#ifdef CONFIG_SCHED_NUMA
 	return p->node;
+}
+
+extern void task_numa_fault(int node, int pages);
 #else
+static inline int tsk_home_node(struct task_struct *p)
+{
 	return -1;
-#endif
 }
 
+static inline void task_numa_fault(int node, int pages)
+{
+}
+#endif /* CONFIG_SCHED_NUMA */
+
 /*
  * Priority of a process goes from 0..MAX_PRIO-1, valid RT
  * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
@@ -2004,6 +2020,10 @@ enum sched_tunable_scaling {
 };
 extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 
+extern unsigned int sysctl_sched_numa_scan_period_min;
+extern unsigned int sysctl_sched_numa_scan_period_max;
+extern unsigned int sysctl_sched_numa_settle_count;
+
 #ifdef CONFIG_SCHED_DEBUG
 extern unsigned int sysctl_sched_migration_cost;
 extern unsigned int sysctl_sched_nr_migrate;
@@ -2014,18 +2034,17 @@ extern unsigned int sysctl_sched_shares_
 int sched_proc_update_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *length,
 		loff_t *ppos);
-#endif
-#ifdef CONFIG_SCHED_DEBUG
+
 static inline unsigned int get_sysctl_timer_migration(void)
 {
 	return sysctl_timer_migration;
 }
-#else
+#else /* CONFIG_SCHED_DEBUG */
 static inline unsigned int get_sysctl_timer_migration(void)
 {
 	return 1;
 }
-#endif
+#endif /* CONFIG_SCHED_DEBUG */
 extern unsigned int sysctl_sched_rt_period;
 extern int sysctl_sched_rt_runtime;
 
Index: tip/kernel/sched/core.c
===================================================================
--- tip.orig/kernel/sched/core.c
+++ tip/kernel/sched/core.c
@@ -1533,6 +1533,21 @@ static void __sched_fork(struct task_str
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	INIT_HLIST_HEAD(&p->preempt_notifiers);
 #endif
+
+#ifdef CONFIG_SCHED_NUMA
+	if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
+		p->mm->numa_next_scan = jiffies;
+		p->mm->numa_scan_seq = 0;
+	}
+
+	p->node = -1;
+	p->node_stamp = 0ULL;
+	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
+	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
+	p->numa_faults = NULL;
+	p->numa_scan_period = sysctl_sched_numa_scan_period_min;
+	p->numa_work.next = &p->numa_work;
+#endif /* CONFIG_SCHED_NUMA */
 }
 
 /*
@@ -1774,6 +1789,7 @@ static void finish_task_switch(struct rq
 	if (mm)
 		mmdrop(mm);
 	if (unlikely(prev_state == TASK_DEAD)) {
+		task_numa_free(prev);
 		/*
 		 * Remove function-return probe instances associated with this
 		 * task and put them back on the free list.
Index: tip/kernel/sched/fair.c
===================================================================
--- tip.orig/kernel/sched/fair.c
+++ tip/kernel/sched/fair.c
@@ -27,6 +27,8 @@
 #include <linux/profile.h>
 #include <linux/interrupt.h>
 #include <linux/random.h>
+#include <linux/mempolicy.h>
+#include <linux/task_work.h>
 
 #include <trace/events/sched.h>
 
@@ -775,6 +777,21 @@ update_stats_curr_start(struct cfs_rq *c
 
 /**************************************************
  * Scheduling class numa methods.
+ *
+ * The purpose of the NUMA bits are to maintain compute (task) and data
+ * (memory) locality. We try and achieve this by making tasks stick to
+ * a particular node (their home node) but if fairness mandates they run
+ * elsewhere for long enough, we let the memory follow them.
+ *
+ * Tasks start out with their home-node unset (-1); this effectively means
+ * they act !NUMA until we've established the task is busy enough to bother
+ * with placement.
+ *
+ * We keep a home-node per task and use periodic fault scans to try and
+ * establish a task<->page relation. This assumes the task<->page relation is a
+ * compute<->data relation; this is false for things like virt. and n:m
+ * threading solutions but it's the best we can do given the information we
+ * have.
  */
 
 #ifdef CONFIG_SMP
@@ -805,6 +822,157 @@ static void account_numa_dequeue(struct
 	} else
 		rq->onnode_running--;
 }
+
+/*
+ * numa task sample period in ms: 5s
+ */
+unsigned int sysctl_sched_numa_scan_period_min = 5000;
+unsigned int sysctl_sched_numa_scan_period_max = 5000*16;
+
+/*
+ * Wait for the 2-sample stuff to settle before migrating again
+ */
+unsigned int sysctl_sched_numa_settle_count = 2;
+
+static void task_numa_placement(struct task_struct *p)
+{
+	unsigned long faults, max_faults = 0;
+	int node, max_node = -1;
+	int seq = ACCESS_ONCE(p->mm->numa_scan_seq);
+
+	if (p->numa_scan_seq == seq)
+		return;
+
+	p->numa_scan_seq = seq;
+
+	for (node = 0; node < nr_node_ids; node++) {
+		faults = p->numa_faults[node];
+
+		if (faults > max_faults) {
+			max_faults = faults;
+			max_node = node;
+		}
+
+		p->numa_faults[node] /= 2;
+	}
+
+	if (max_node == -1)
+		return;
+
+	if (p->node != max_node) {
+		p->numa_scan_period = sysctl_sched_numa_scan_period_min;
+		if (sched_feat(NUMA_SETTLE) &&
+		    (seq - p->numa_migrate_seq) <= (int)sysctl_sched_numa_settle_count)
+			return;
+		p->numa_migrate_seq = seq;
+		sched_setnode(p, max_node);
+	} else {
+		p->numa_scan_period = min(sysctl_sched_numa_scan_period_max,
+				p->numa_scan_period * 2);
+	}
+}
+
+/*
+ * Got a PROT_NONE fault for a page on @node.
+ */
+void task_numa_fault(int node, int pages)
+{
+	struct task_struct *p = current;
+
+	if (unlikely(!p->numa_faults)) {
+		int size = sizeof(unsigned long) * nr_node_ids;
+
+		p->numa_faults = kzalloc(size, GFP_KERNEL);
+		if (!p->numa_faults)
+			return;
+	}
+
+	task_numa_placement(p);
+
+	p->numa_faults[node] += pages;
+}
+
+/*
+ * The expensive part of numa migration is done from task_work context.
+ * Triggered from task_tick_numa().
+ */
+void task_numa_work(struct callback_head *work)
+{
+	unsigned long migrate, next_scan, now = jiffies;
+	struct task_struct *p = current;
+	struct mm_struct *mm = p->mm;
+
+	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
+
+	work->next = work; /* protect against double add */
+	/*
+	 * Who cares about NUMA placement when they're dying.
+	 *
+	 * NOTE: make sure not to dereference p->mm before this check,
+	 * exit_task_work() happens _after_ exit_mm() so we could be called
+	 * without p->mm even though we still had it when we enqueued this
+	 * work.
+	 */
+	if (p->flags & PF_EXITING)
+		return;
+
+	/*
+	 * Enforce maximal scan/migration frequency..
+	 */
+	migrate = mm->numa_next_scan;
+	if (time_before(now, migrate))
+		return;
+
+	next_scan = now + 2*msecs_to_jiffies(sysctl_sched_numa_scan_period_min);
+	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
+		return;
+
+	ACCESS_ONCE(mm->numa_scan_seq)++;
+	{
+		struct vm_area_struct *vma;
+
+		down_write(&mm->mmap_sem);
+		for (vma = mm->mmap; vma; vma = vma->vm_next) {
+			if (!vma_migratable(vma))
+				continue;
+			change_protection(vma, vma->vm_start, vma->vm_end, vma_prot_none(vma), 0);
+		}
+		up_write(&mm->mmap_sem);
+	}
+}
+
+/*
+ * Drive the periodic memory faults..
+ */
+void task_tick_numa(struct rq *rq, struct task_struct *curr)
+{
+	struct callback_head *work = &curr->numa_work;
+	u64 period, now;
+
+	/*
+	 * We don't care about NUMA placement if we don't have memory.
+	 */
+	if (!curr->mm || (curr->flags & PF_EXITING) || work->next != work)
+		return;
+
+	/*
+	 * Using runtime rather than walltime has the dual advantage that
+	 * we (mostly) drive the selection from busy threads and that the
+	 * task needs to have done some actual work before we bother with
+	 * NUMA placement.
+	 */
+	now = curr->se.sum_exec_runtime;
+	period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
+
+	if (now - curr->node_stamp > period) {
+		curr->node_stamp = now;
+
+		if (!time_before(jiffies, curr->mm->numa_next_scan)) {
+			init_task_work(work, task_numa_work); /* TODO: move this into sched_fork() */
+			task_work_add(curr, work, true);
+		}
+	}
+}
 #else
 #ifdef CONFIG_SMP
 static struct list_head *account_numa_enqueue(struct rq *rq, struct task_struct *p)
@@ -816,6 +984,10 @@ static struct list_head *account_numa_en
 static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
 {
 }
+
+static void task_tick_numa(struct rq *rq, struct task_struct *curr)
+{
+}
 #endif /* CONFIG_SCHED_NUMA */
 
 /**************************************************
@@ -5265,6 +5437,9 @@ static void task_tick_fair(struct rq *rq
 		cfs_rq = cfs_rq_of(se);
 		entity_tick(cfs_rq, se, queued);
 	}
+
+	if (sched_feat_numa(NUMA))
+		task_tick_numa(rq, curr);
 }
 
 /*
Index: tip/kernel/sched/features.h
===================================================================
--- tip.orig/kernel/sched/features.h
+++ tip/kernel/sched/features.h
@@ -69,5 +69,6 @@ SCHED_FEAT(NUMA_TTWU_BIAS, false)
 SCHED_FEAT(NUMA_TTWU_TO,   false)
 SCHED_FEAT(NUMA_PULL,      true)
 SCHED_FEAT(NUMA_PULL_BIAS, true)
+SCHED_FEAT(NUMA_SETTLE,    true)
 #endif
 
Index: tip/kernel/sched/sched.h
===================================================================
--- tip.orig/kernel/sched/sched.h
+++ tip/kernel/sched/sched.h
@@ -3,6 +3,7 @@
 #include <linux/mutex.h>
 #include <linux/spinlock.h>
 #include <linux/stop_machine.h>
+#include <linux/slab.h>
 
 #include "cpupri.h"
 
@@ -476,15 +477,6 @@ struct rq {
 #endif
 };
 
-static inline struct list_head *offnode_tasks(struct rq *rq)
-{
-#ifdef CONFIG_SCHED_NUMA
-	return &rq->offnode_tasks;
-#else
-	return NULL;
-#endif
-}
-
 static inline int cpu_of(struct rq *rq)
 {
 #ifdef CONFIG_SMP
@@ -502,6 +494,27 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		(&__raw_get_cpu_var(runqueues))
 
+#ifdef CONFIG_SCHED_NUMA
+static inline struct list_head *offnode_tasks(struct rq *rq)
+{
+	return &rq->offnode_tasks;
+}
+
+static inline void task_numa_free(struct task_struct *p)
+{
+	kfree(p->numa_faults);
+}
+#else /* CONFIG_SCHED_NUMA */
+static inline struct list_head *offnode_tasks(struct rq *rq)
+{
+	return NULL;
+}
+
+static inline void task_numa_free(struct task_struct *p)
+{
+}
+#endif /* CONFIG_SCHED_NUMA */
+
 #ifdef CONFIG_SMP
 
 #define rcu_dereference_check_sched_domain(p) \
Index: tip/kernel/sysctl.c
===================================================================
--- tip.orig/kernel/sysctl.c
+++ tip/kernel/sysctl.c
@@ -256,9 +256,11 @@ static int min_sched_granularity_ns = 10
 static int max_sched_granularity_ns = NSEC_PER_SEC;	/* 1 second */
 static int min_wakeup_granularity_ns;			/* 0 usecs */
 static int max_wakeup_granularity_ns = NSEC_PER_SEC;	/* 1 second */
+#ifdef CONFIG_SMP
 static int min_sched_tunable_scaling = SCHED_TUNABLESCALING_NONE;
 static int max_sched_tunable_scaling = SCHED_TUNABLESCALING_END-1;
-#endif
+#endif /* CONFIG_SMP */
+#endif /* CONFIG_SCHED_DEBUG */
 
 #ifdef CONFIG_COMPACTION
 static int min_extfrag_threshold;
@@ -301,6 +303,7 @@ static struct ctl_table kern_table[] = {
 		.extra1		= &min_wakeup_granularity_ns,
 		.extra2		= &max_wakeup_granularity_ns,
 	},
+#ifdef CONFIG_SMP
 	{
 		.procname	= "sched_tunable_scaling",
 		.data		= &sysctl_sched_tunable_scaling,
@@ -347,7 +350,31 @@ static struct ctl_table kern_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one,
 	},
-#endif
+#endif /* CONFIG_SMP */
+#ifdef CONFIG_SCHED_NUMA
+	{
+		.procname	= "sched_numa_scan_period_min_ms",
+		.data		= &sysctl_sched_numa_scan_period_min,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
+		.procname	= "sched_numa_scan_period_max_ms",
+		.data		= &sysctl_sched_numa_scan_period_max,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
+		.procname	= "sched_numa_settle_count",
+		.data		= &sysctl_sched_numa_settle_count,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+#endif /* CONFIG_SCHED_NUMA */
+#endif /* CONFIG_SCHED_DEBUG */
 	{
 		.procname	= "sched_rt_period_us",
 		.data		= &sysctl_sched_rt_period,
Index: tip/mm/huge_memory.c
===================================================================
--- tip.orig/mm/huge_memory.c
+++ tip/mm/huge_memory.c
@@ -774,9 +774,10 @@ fixup:
 
 unlock:
 	spin_unlock(&mm->page_table_lock);
-	if (page)
+	if (page) {
+		task_numa_fault(page_to_nid(page), HPAGE_PMD_NR);
 		put_page(page);
-
+	}
 	return;
 
 migrate:
@@ -845,6 +846,8 @@ migrate:
 
 	put_page(page);			/* Drop the rmap reference */
 
+	task_numa_fault(node, HPAGE_PMD_NR);
+
 	if (lru)
 		put_page(page);		/* drop the LRU isolation reference */
 
Index: tip/mm/memory.c
===================================================================
--- tip.orig/mm/memory.c
+++ tip/mm/memory.c
@@ -3512,8 +3512,10 @@ out_pte_upgrade_unlock:
 out_unlock:
 	pte_unmap_unlock(ptep, ptl);
 out:
-	if (page)
+	if (page) {
+		task_numa_fault(page_nid, 1);
 		put_page(page);
+	}
 
 	return 0;
 



^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 27/31] sched, numa, mm: Add credits for NUMA placement
  2012-10-25 12:16 [PATCH 00/31] numa/core patches Peter Zijlstra
                   ` (25 preceding siblings ...)
  2012-10-25 12:16 ` [PATCH 26/31] sched, numa, mm: Add fault driven placement and migration policy Peter Zijlstra
@ 2012-10-25 12:16 ` Peter Zijlstra
  2012-10-25 12:16 ` [PATCH 28/31] sched, numa, mm: Implement constant, per task Working Set Sampling (WSS) rate Peter Zijlstra
                   ` (6 subsequent siblings)
  33 siblings, 0 replies; 135+ messages in thread
From: Peter Zijlstra @ 2012-10-25 12:16 UTC (permalink / raw)
  To: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Zijlstra, Ingo Molnar

[-- Attachment #1: 0027-sched-numa-mm-Add-credits-for-NUMA-placement.patch --]
[-- Type: text/plain, Size: 2137 bytes --]

From: Rik van Riel <riel@redhat.com>

The NUMA placement code has been rewritten several times, but
the basic ideas took a lot of work to develop. The people who
put in the work deserve credit for it. Thanks Andrea & Peter :)

[ The Documentation/scheduler/numa-problem.txt file should
  probably be rewritten once we figure out the final details of
  what the NUMA code needs to do, and why. ]

Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: aarcange@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
----
This is against tip.git numa/core
---
 CREDITS             |    1 +
 kernel/sched/fair.c |    3 +++
 mm/memory.c         |    2 ++
 3 files changed, 6 insertions(+)

Index: tip/CREDITS
===================================================================
--- tip.orig/CREDITS
+++ tip/CREDITS
@@ -125,6 +125,7 @@ D: Author of pscan that helps to fix lp/
 D: Author of lil (Linux Interrupt Latency benchmark)
 D: Fixed the shm swap deallocation at swapoff time (try_to_unuse message)
 D: VM hacker
+D: NUMA task placement
 D: Various other kernel hacks
 S: Imola 40026
 S: Italy
Index: tip/kernel/sched/fair.c
===================================================================
--- tip.orig/kernel/sched/fair.c
+++ tip/kernel/sched/fair.c
@@ -18,6 +18,9 @@
  *
  *  Adaptive scheduling granularity, math enhancements by Peter Zijlstra
  *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
+ *
+ *  NUMA placement, statistics and algorithm by Andrea Arcangeli,
+ *  CFS balancing changes by Peter Zijlstra. Copyright (C) 2012 Red Hat, Inc.
  */
 
 #include <linux/latencytop.h>
Index: tip/mm/memory.c
===================================================================
--- tip.orig/mm/memory.c
+++ tip/mm/memory.c
@@ -36,6 +36,8 @@
  *		(Gerhard.Wichert@pdb.siemens.de)
  *
  * Aug/Sep 2004 Changed to four level page tables (Andi Kleen)
+ *
+ * 2012 - NUMA placement page faults (Andrea Arcangeli, Peter Zijlstra)
  */
 
 #include <linux/kernel_stat.h>



^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 28/31] sched, numa, mm: Implement constant, per task Working Set Sampling (WSS) rate
  2012-10-25 12:16 [PATCH 00/31] numa/core patches Peter Zijlstra
                   ` (26 preceding siblings ...)
  2012-10-25 12:16 ` [PATCH 27/31] sched, numa, mm: Add credits for NUMA placement Peter Zijlstra
@ 2012-10-25 12:16 ` Peter Zijlstra
  2012-11-01 15:48   ` Mel Gorman
  2012-10-25 12:16 ` [PATCH 29/31] sched, numa, mm: Add NUMA_MIGRATION feature flag Peter Zijlstra
                   ` (5 subsequent siblings)
  33 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2012-10-25 12:16 UTC (permalink / raw)
  To: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Zijlstra, Ingo Molnar

[-- Attachment #1: 0028-sched-numa-mm-Implement-constant-per-task-Working-Se.patch --]
[-- Type: text/plain, Size: 6293 bytes --]

Previously, to probe the working set of a task, we'd use
a very simple and crude method: mark all of its address
space PROT_NONE.

That method has various (obvious) disadvantages:

 - it samples the working set at dissimilar rates,
   giving some tasks a sampling quality advantage
   over others.

 - creates performance problems for tasks with very
   large working sets

 - over-samples processes with large address spaces but
   which only very rarely execute

Improve that method by keeping a rotating offset into the
address space that marks the current position of the scan,
and advancing it at a constant rate (proportional to the CPU
cycles the task executes). If the offset reaches the last
mapped address of the mm, the scan starts over at the first
address.

The per-task nature of the working set sampling functionality
in this tree allows such constant rate, per task,
execution-weight proportional sampling of the working set,
with an adaptive sampling interval/frequency that goes from
once per 100 msecs up to just once per 1.6 seconds.
The current sampling volume is 256 MB per interval.

As tasks mature and converge their working set, so does the
sampling rate slow down to just a trickle, 256 MB per 1.6
seconds of CPU time executed.

This, beyond being adaptive, also rate-limits rarely
executing systems and does not over-sample on overloaded
systems.
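
Back-of-the-envelope, with the defaults below: at the minimum 100 msec
period the scanner covers at most 256 MB per interval, i.e. roughly 2.5 GB
of address space per second of CPU time consumed; once a task has settled
and the period has backed off to 1.6 seconds, that drops to about 160 MB
per second of CPU time. A task that barely runs barely gets scanned, since
the trigger is driven off its execution time rather than walltime.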

[ In AutoNUMA speak, this patch deals with the effective sampling
  rate of the 'hinting page fault'. AutoNUMA's scanning is
  currently rate-limited, but it is also fundamentally
  single-threaded, executing in the knuma_scand kernel thread,
  so the limit in AutoNUMA is global and does not scale up with
  the number of CPUs, nor does it scan tasks in an execution
  proportional manner.

  So the idea of rate-limiting the scanning was first implemented
  in the AutoNUMA tree via a global rate limit. This patch goes
  beyond that by implementing an execution rate proportional
  working set sampling rate that is not implemented via a single
  global scanning daemon. ]

[ Dan Carpenter pointed out a possible NULL pointer dereference in the
  first version of this patch. ]
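
The scan size is tunable at runtime via the new sysctl; a usage
sketch (assuming the usual /proc/sys mount):

   # cat /proc/sys/kernel/sched_numa_scan_size_mb
   256
   # echo 512 > /proc/sys/kernel/sched_numa_scan_size_mb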

Based-on-idea-by: Andrea Arcangeli <aarcange@redhat.com>
Bug-Found-By: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
[ Wrote changelog and fixed bug. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/mm_types.h |    1 +
 include/linux/sched.h    |    1 +
 kernel/sched/fair.c      |   43 ++++++++++++++++++++++++++++++-------------
 kernel/sysctl.c          |    7 +++++++
 4 files changed, 39 insertions(+), 13 deletions(-)

Index: tip/include/linux/mm_types.h
===================================================================
--- tip.orig/include/linux/mm_types.h
+++ tip/include/linux/mm_types.h
@@ -405,6 +405,7 @@ struct mm_struct {
 #endif
 #ifdef CONFIG_SCHED_NUMA
 	unsigned long numa_next_scan;
+	unsigned long numa_scan_offset;
 	int numa_scan_seq;
 #endif
 	struct uprobes_state uprobes_state;
Index: tip/include/linux/sched.h
===================================================================
--- tip.orig/include/linux/sched.h
+++ tip/include/linux/sched.h
@@ -2022,6 +2022,7 @@ extern enum sched_tunable_scaling sysctl
 
 extern unsigned int sysctl_sched_numa_scan_period_min;
 extern unsigned int sysctl_sched_numa_scan_period_max;
+extern unsigned int sysctl_sched_numa_scan_size;
 extern unsigned int sysctl_sched_numa_settle_count;
 
 #ifdef CONFIG_SCHED_DEBUG
Index: tip/kernel/sched/fair.c
===================================================================
--- tip.orig/kernel/sched/fair.c
+++ tip/kernel/sched/fair.c
@@ -829,8 +829,9 @@ static void account_numa_dequeue(struct
 /*
  * numa task sample period in ms: 5s
  */
-unsigned int sysctl_sched_numa_scan_period_min = 5000;
-unsigned int sysctl_sched_numa_scan_period_max = 5000*16;
+unsigned int sysctl_sched_numa_scan_period_min = 100;
+unsigned int sysctl_sched_numa_scan_period_max = 100*16;
+unsigned int sysctl_sched_numa_scan_size = 256;   /* MB */
 
 /*
  * Wait for the 2-sample stuff to settle before migrating again
@@ -904,6 +905,9 @@ void task_numa_work(struct callback_head
 	unsigned long migrate, next_scan, now = jiffies;
 	struct task_struct *p = current;
 	struct mm_struct *mm = p->mm;
+	struct vm_area_struct *vma;
+	unsigned long offset, end;
+	long length;
 
 	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
 
@@ -930,18 +934,31 @@ void task_numa_work(struct callback_head
 	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
 		return;
 
-	ACCESS_ONCE(mm->numa_scan_seq)++;
-	{
-		struct vm_area_struct *vma;
-
-		down_write(&mm->mmap_sem);
-		for (vma = mm->mmap; vma; vma = vma->vm_next) {
-			if (!vma_migratable(vma))
-				continue;
-			change_protection(vma, vma->vm_start, vma->vm_end, vma_prot_none(vma), 0);
-		}
-		up_write(&mm->mmap_sem);
+	offset = mm->numa_scan_offset;
+	length = sysctl_sched_numa_scan_size;
+	length <<= 20;
+
+	down_write(&mm->mmap_sem);
+	vma = find_vma(mm, offset);
+	if (!vma) {
+		ACCESS_ONCE(mm->numa_scan_seq)++;
+		offset = 0;
+		vma = mm->mmap;
+	}
+	for (; vma && length > 0; vma = vma->vm_next) {
+		if (!vma_migratable(vma))
+			continue;
+
+		offset = max(offset, vma->vm_start);
+		end = min(ALIGN(offset + length, HPAGE_SIZE), vma->vm_end);
+		length -= end - offset;
+
+		change_prot_none(vma, offset, end);
+
+		offset = end;
 	}
+	mm->numa_scan_offset = offset;
+	up_write(&mm->mmap_sem);
 }
 
 /*
Index: tip/kernel/sysctl.c
===================================================================
--- tip.orig/kernel/sysctl.c
+++ tip/kernel/sysctl.c
@@ -367,6 +367,13 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
+		.procname	= "sched_numa_scan_size_mb",
+		.data		= &sysctl_sched_numa_scan_size,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
 		.procname	= "sched_numa_settle_count",
 		.data		= &sysctl_sched_numa_settle_count,
 		.maxlen		= sizeof(unsigned int),



^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 29/31] sched, numa, mm: Add NUMA_MIGRATION feature flag
  2012-10-25 12:16 [PATCH 00/31] numa/core patches Peter Zijlstra
                   ` (27 preceding siblings ...)
  2012-10-25 12:16 ` [PATCH 28/31] sched, numa, mm: Implement constant, per task Working Set Sampling (WSS) rate Peter Zijlstra
@ 2012-10-25 12:16 ` Peter Zijlstra
  2012-10-25 12:16 ` [PATCH 30/31] sched, numa, mm: Implement slow start for working set sampling Peter Zijlstra
                   ` (4 subsequent siblings)
  33 siblings, 0 replies; 135+ messages in thread
From: Peter Zijlstra @ 2012-10-25 12:16 UTC (permalink / raw)
  To: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton
  Cc: linux-kernel, linux-mm, Ingo Molnar, Peter Zijlstra

[-- Attachment #1: 0029-sched-numa-mm-Add-NUMA_MIGRATION-feature-flag.patch --]
[-- Type: text/plain, Size: 1765 bytes --]

From: Ingo Molnar <mingo@kernel.org>

After this patch, doing:

   # echo NO_NUMA_MIGRATION > /sys/kernel/debug/sched_features

will turn off the NUMA placement logic/policy - but keep the
working set sampling faults in place.

This allows the debugging of the WSS facility, by using it
but keeping vanilla, non-NUMA CPU and memory placement
policies.

Default enabled. Generates no extra code on !CONFIG_SCHED_DEBUG.
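
With CONFIG_SCHED_DEBUG=y the current state of the flag can be
inspected, and the placement policy switched back on, via the
same debugfs file, for example:

   # cat /sys/kernel/debug/sched_features
   ... NUMA NUMA_MIGRATION NUMA_HOT ...
   # echo NUMA_MIGRATION > /sys/kernel/debug/sched_features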

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
---
 kernel/sched/core.c     |    3 +++
 kernel/sched/features.h |    3 +++
 2 files changed, 6 insertions(+)

Index: tip/kernel/sched/core.c
===================================================================
--- tip.orig/kernel/sched/core.c
+++ tip/kernel/sched/core.c
@@ -6002,6 +6002,9 @@ void sched_setnode(struct task_struct *p
 	int on_rq, running;
 	struct rq *rq;
 
+	if (!sched_feat(NUMA_MIGRATION))
+		return;
+
 	rq = task_rq_lock(p, &flags);
 	on_rq = p->on_rq;
 	running = task_current(rq, p);
Index: tip/kernel/sched/features.h
===================================================================
--- tip.orig/kernel/sched/features.h
+++ tip/kernel/sched/features.h
@@ -63,7 +63,10 @@ SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)
 
 #ifdef CONFIG_SCHED_NUMA
+/* Do the working set probing faults: */
 SCHED_FEAT(NUMA,           true)
+/* Do actual migration/placement based on the working set information: */
+SCHED_FEAT(NUMA_MIGRATION, true)
 SCHED_FEAT(NUMA_HOT,       true)
 SCHED_FEAT(NUMA_TTWU_BIAS, false)
 SCHED_FEAT(NUMA_TTWU_TO,   false)



^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 30/31] sched, numa, mm: Implement slow start for working set sampling
  2012-10-25 12:16 [PATCH 00/31] numa/core patches Peter Zijlstra
                   ` (28 preceding siblings ...)
  2012-10-25 12:16 ` [PATCH 29/31] sched, numa, mm: Add NUMA_MIGRATION feature flag Peter Zijlstra
@ 2012-10-25 12:16 ` Peter Zijlstra
  2012-11-01 15:52   ` Mel Gorman
  2012-10-25 12:16 ` [PATCH 31/31] sched, numa, mm: Add memcg support to do_huge_pmd_numa_page() Peter Zijlstra
                   ` (3 subsequent siblings)
  33 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2012-10-25 12:16 UTC (permalink / raw)
  To: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Zijlstra, Ingo Molnar

[-- Attachment #1: 0030-sched-numa-mm-Implement-slow-start-for-working-set-s.patch --]
[-- Type: text/plain, Size: 5295 bytes --]

Add a 1 second delay before starting to scan the working set of
a task and starting to balance it amongst nodes.

[ note that before the constant per task WSS sampling rate patch
  the initial scan would happen much later still, in effect that
  patch caused this regression. ]

The theory is that short-run tasks benefit very little from NUMA
placement: they come and go, and they better stick to the node
they were started on. As tasks mature and rebalance to other CPUs
and nodes, their NUMA placement has to change as well, and it
starts to matter more and more.

In practice this change fixes an observable kbuild regression:

   # [ a perf stat --null --repeat 10 test of ten bzImage builds to /dev/shm ]

   !NUMA:
   45.291088843 seconds time elapsed                                          ( +-  0.40% )
   45.154231752 seconds time elapsed                                          ( +-  0.36% )

   +NUMA, no slow start:
   46.172308123 seconds time elapsed                                          ( +-  0.30% )
   46.343168745 seconds time elapsed                                          ( +-  0.25% )

   +NUMA, 1 sec slow start:
   45.224189155 seconds time elapsed                                          ( +-  0.25% )
   45.160866532 seconds time elapsed                                          ( +-  0.17% )

and it also fixes an observable perf bench (hackbench) regression:

   # perf stat --null --repeat 10 perf bench sched messaging

   -NUMA:                  0.246225691 seconds time elapsed                   ( +-  1.31% )
   +NUMA no slow start:    0.252620063 seconds time elapsed                   ( +-  1.13% )

   +NUMA 1sec delay:       0.248076230 seconds time elapsed                   ( +-  1.35% )

The implementation is simple and straightforward, most of the patch
deals with adding the /proc/sys/kernel/sched_numa_scan_delay_ms tunable
knob.
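
The delay defaults to 1000 msecs and, like the other knobs, can be
adjusted at runtime; a usage sketch (assuming the usual /proc/sys
mount):

   # cat /proc/sys/kernel/sched_numa_scan_delay_ms
   1000
   # echo 2000 > /proc/sys/kernel/sched_numa_scan_delay_ms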

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
[ Wrote the changelog, ran measurements, tuned the default. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h |    1 +
 kernel/sched/core.c   |    2 +-
 kernel/sched/fair.c   |   11 +++++++----
 kernel/sysctl.c       |    7 +++++++
 4 files changed, 16 insertions(+), 5 deletions(-)

Index: tip/include/linux/sched.h
===================================================================
--- tip.orig/include/linux/sched.h
+++ tip/include/linux/sched.h
@@ -2020,6 +2020,7 @@ enum sched_tunable_scaling {
 };
 extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 
+extern unsigned int sysctl_sched_numa_scan_delay;
 extern unsigned int sysctl_sched_numa_scan_period_min;
 extern unsigned int sysctl_sched_numa_scan_period_max;
 extern unsigned int sysctl_sched_numa_scan_size;
Index: tip/kernel/sched/core.c
===================================================================
--- tip.orig/kernel/sched/core.c
+++ tip/kernel/sched/core.c
@@ -1545,7 +1545,7 @@ static void __sched_fork(struct task_str
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
 	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
 	p->numa_faults = NULL;
-	p->numa_scan_period = sysctl_sched_numa_scan_period_min;
+	p->numa_scan_period = sysctl_sched_numa_scan_delay;
 	p->numa_work.next = &p->numa_work;
 #endif /* CONFIG_SCHED_NUMA */
 }
Index: tip/kernel/sched/fair.c
===================================================================
--- tip.orig/kernel/sched/fair.c
+++ tip/kernel/sched/fair.c
@@ -827,11 +827,12 @@ static void account_numa_dequeue(struct
 }
 
 /*
- * numa task sample period in ms: 5s
+ * Scan @scan_size MB every @scan_period after an initial @scan_delay.
  */
-unsigned int sysctl_sched_numa_scan_period_min = 100;
-unsigned int sysctl_sched_numa_scan_period_max = 100*16;
-unsigned int sysctl_sched_numa_scan_size = 256;   /* MB */
+unsigned int sysctl_sched_numa_scan_delay = 1000;	/* ms */
+unsigned int sysctl_sched_numa_scan_period_min = 100;	/* ms */
+unsigned int sysctl_sched_numa_scan_period_max = 100*16;/* ms */
+unsigned int sysctl_sched_numa_scan_size = 256;		/* MB */
 
 /*
  * Wait for the 2-sample stuff to settle before migrating again
@@ -985,6 +986,8 @@ void task_tick_numa(struct rq *rq, struc
 	period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
 
 	if (now - curr->node_stamp > period) {
+		if (!curr->node_stamp)
+			curr->numa_scan_period = sysctl_sched_numa_scan_period_min;
 		curr->node_stamp = now;
 
 		if (!time_before(jiffies, curr->mm->numa_next_scan)) {
Index: tip/kernel/sysctl.c
===================================================================
--- tip.orig/kernel/sysctl.c
+++ tip/kernel/sysctl.c
@@ -353,6 +353,13 @@ static struct ctl_table kern_table[] = {
 #endif /* CONFIG_SMP */
 #ifdef CONFIG_SCHED_NUMA
 	{
+		.procname	= "sched_numa_scan_delay_ms",
+		.data		= &sysctl_sched_numa_scan_delay,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
 		.procname	= "sched_numa_scan_period_min_ms",
 		.data		= &sysctl_sched_numa_scan_period_min,
 		.maxlen		= sizeof(unsigned int),



^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 31/31] sched, numa, mm: Add memcg support to do_huge_pmd_numa_page()
  2012-10-25 12:16 [PATCH 00/31] numa/core patches Peter Zijlstra
                   ` (29 preceding siblings ...)
  2012-10-25 12:16 ` [PATCH 30/31] sched, numa, mm: Implement slow start for working set sampling Peter Zijlstra
@ 2012-10-25 12:16 ` Peter Zijlstra
  2012-10-26  9:07 ` [PATCH 00/31] numa/core patches Zhouping Liu
                   ` (2 subsequent siblings)
  33 siblings, 0 replies; 135+ messages in thread
From: Peter Zijlstra @ 2012-10-25 12:16 UTC (permalink / raw)
  To: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton
  Cc: linux-kernel, linux-mm, Ingo Molnar

[-- Attachment #1: 0031-sched-numa-mm-Add-memcg-support.patch --]
[-- Type: text/plain, Size: 1294 bytes --]

From: Johannes Weiner <hannes@cmpxchg.org>

[ Turned email suggestions into patch plus fixes. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/huge_memory.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

Index: tip/mm/huge_memory.c
===================================================================
--- tip.orig/mm/huge_memory.c
+++ tip/mm/huge_memory.c
@@ -742,6 +742,7 @@ void do_huge_pmd_numa_page(struct mm_str
 			   unsigned int flags, pmd_t entry)
 {
 	unsigned long haddr = address & HPAGE_PMD_MASK;
+	struct mem_cgroup *memcg = NULL;
 	struct page *new_page = NULL;
 	struct page *page = NULL;
 	int node, lru;
@@ -800,6 +801,8 @@ migrate:
 	if (!new_page)
 		goto alloc_fail;
 
+	mem_cgroup_prepare_migration(page, new_page, &memcg);
+
 	lru = PageLRU(page);
 
 	if (lru && isolate_lru_page(page)) /* does an implicit get_page() */
@@ -852,14 +855,19 @@ migrate:
 		put_page(page);		/* drop the LRU isolation reference */
 
 	unlock_page(new_page);
+
+	mem_cgroup_end_migration(memcg, page, new_page, true);
+
 	unlock_page(page);
 	put_page(page);			/* Drop the local reference */
 
 	return;
 
 alloc_fail:
-	if (new_page)
+	if (new_page) {
+		mem_cgroup_end_migration(memcg, page, new_page, false);
 		put_page(new_page);
+	}
 
 	unlock_page(page);
 



^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 04/31] x86/mm: Introduce pte_accessible()
  2012-10-25 12:16 ` [PATCH 04/31] x86/mm: Introduce pte_accessible() Peter Zijlstra
@ 2012-10-25 20:10   ` Linus Torvalds
  2012-10-26  6:24     ` [PATCH 04/31, v2] " Ingo Molnar
  2012-11-01 10:42   ` [PATCH 04/31] " Mel Gorman
  1 sibling, 1 reply; 135+ messages in thread
From: Linus Torvalds @ 2012-10-25 20:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Andrew Morton, linux-kernel, linux-mm,
	Ingo Molnar

NAK NAK NAK.

On Thu, Oct 25, 2012 at 5:16 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> +#define __HAVE_ARCH_PTE_ACCESSIBLE
> +static inline int pte_accessible(pte_t a)

Stop doing these f*cking crazy ad-hoc "I have some other name
available" #defines.

Use the same name, for chrissake! Don't make up new random names.

Just do

   #define pte_accessible pte_accessible

and then you can use

   #ifndef pte_accessible

to define the generic thing. Instead of having this INSANE "two
different names for the same f*cking thing" crap.

Stop it. Really.

Also, this:

> +#ifndef __HAVE_ARCH_PTE_ACCESSIBLE
> +#define pte_accessible(pte)            pte_present(pte)
> +#endif

looks unsafe and like a really bad idea.

You should probably do

  #ifndef pte_accessible
    #define pte_accessible(pte) ((void)(pte),1)
  #endif

because you have no idea if other architectures do

 (a) the same trick as x86 does for PROT_NONE (I can already tell you
from a quick grep that ia64, m32r, m68k and sh do it)
 (b) might not perhaps be caching non-present pte's anyway

So NAK on this whole patch. It's bad. It's ugly, it's wrong, and it's
actively buggy.

                Linus

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 05/31] x86/mm: Reduce tlb flushes from ptep_set_access_flags()
  2012-10-25 12:16 ` [PATCH 05/31] x86/mm: Reduce tlb flushes from ptep_set_access_flags() Peter Zijlstra
@ 2012-10-25 20:17   ` Linus Torvalds
  2012-10-26  2:30     ` Rik van Riel
  0 siblings, 1 reply; 135+ messages in thread
From: Linus Torvalds @ 2012-10-25 20:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Andrew Morton, linux-kernel, linux-mm,
	Ingo Molnar

On Thu, Oct 25, 2012 at 5:16 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> From: Rik van Riel <riel@redhat.com>
>
> @@ -306,11 +306,26 @@ int ptep_set_access_flags(struct vm_area
>                           pte_t entry, int dirty)
>  {
>         int changed = !pte_same(*ptep, entry);
> +       /*
> +        * If the page used to be inaccessible (_PAGE_PROTNONE), or
> +        * this call upgrades the access permissions on the same page,
> +        * it is safe to skip the remote TLB flush.
> +        */
> +       bool flush_remote = false;
> +       if (!pte_accessible(*ptep))
> +               flush_remote = false;
> +       else if (pte_pfn(*ptep) != pte_pfn(entry) ||
> +                       (pte_write(*ptep) && !pte_write(entry)) ||
> +                       (pte_exec(*ptep) && !pte_exec(entry)))
> +               flush_remote = true;
>
>         if (changed && dirty) {

Did anybody ever actually look at this sh*t-for-brains patch?

Yeah, I'm grumpy. But I'm wasting time looking at patches that have
new code in them that is stupid and retarded.

This is the VM, guys, we don't add stupid and retarded code.

LOOK at the code, for chrissake. Just look at it. And if you don't see
why the above is stupid and retarded, you damn well shouldn't be
touching VM code.

              Linus

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 26/31] sched, numa, mm: Add fault driven placement and migration policy
  2012-10-25 12:16 ` [PATCH 26/31] sched, numa, mm: Add fault driven placement and migration policy Peter Zijlstra
@ 2012-10-25 20:53   ` Linus Torvalds
  2012-10-26  7:15     ` Ingo Molnar
  2012-10-30 19:23   ` Rik van Riel
  2012-11-01 15:40   ` Mel Gorman
  2 siblings, 1 reply; 135+ messages in thread
From: Linus Torvalds @ 2012-10-25 20:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Andrew Morton, linux-kernel, linux-mm,
	Ingo Molnar

On Thu, Oct 25, 2012 at 5:16 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> +       /*
> +        * Using runtime rather than walltime has the dual advantage that
> +        * we (mostly) drive the selection from busy threads and that the
> +        * task needs to have done some actual work before we bother with
> +        * NUMA placement.
> +        */

That explanation makes sense..

> +       now = curr->se.sum_exec_runtime;
> +       period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
> +
> +       if (now - curr->node_stamp > period) {
> +               curr->node_stamp = now;
> +
> +               if (!time_before(jiffies, curr->mm->numa_next_scan)) {

.. but then the whole "numa_next_scan" thing ends up being about
real-time anyway?

So 'numa_scan_period' is in CPU time (msec, converted to nsec at
runtime rather than when setting it), but 'numa_next_scan' is in
wallclock time (jiffies)?

But *both* of them are based on the same 'numa_scan_period' thing that
the user sets in ms.

So numa_scan_period is interpreted as both wallclock *and* as runtime?

Maybe this works, but it doesn't really make much sense. And what is
the impact of this on machines that run lots of loads with delays
(whether due to IO or timers)?

                     Linus

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 10/31] mm/mpol: Remove NUMA_INTERLEAVE_HIT
  2012-10-25 12:16 ` [PATCH 10/31] mm/mpol: Remove NUMA_INTERLEAVE_HIT Peter Zijlstra
@ 2012-10-25 20:58   ` Andi Kleen
  2012-10-26  7:59     ` Ingo Molnar
  0 siblings, 1 reply; 135+ messages in thread
From: Andi Kleen @ 2012-10-25 20:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton, linux-kernel,
	linux-mm, Lee Schermerhorn, Ingo Molnar

Peter Zijlstra <a.p.zijlstra@chello.nl> writes:

> Since the NUMA_INTERLEAVE_HIT statistic is useless on its own; it wants
> to be compared to either a total of interleave allocations or to a miss
> count, remove it.

NACK, as already posted several times.

This breaks the numactl test suite, which is the only way currently to
test interleaving.

Please don't ignore review feedback.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 05/31] x86/mm: Reduce tlb flushes from ptep_set_access_flags()
  2012-10-25 20:17   ` Linus Torvalds
@ 2012-10-26  2:30     ` Rik van Riel
  2012-10-26  2:56       ` Linus Torvalds
  0 siblings, 1 reply; 135+ messages in thread
From: Rik van Riel @ 2012-10-26  2:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Andrew Morton, linux-kernel, linux-mm,
	Ingo Molnar

On 10/25/2012 04:17 PM, Linus Torvalds wrote:
> On Thu, Oct 25, 2012 at 5:16 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>> From: Rik van Riel <riel@redhat.com>
>>
>> @@ -306,11 +306,26 @@ int ptep_set_access_flags(struct vm_area
>>                            pte_t entry, int dirty)
>>   {
>>          int changed = !pte_same(*ptep, entry);
>> +       /*
>> +        * If the page used to be inaccessible (_PAGE_PROTNONE), or
>> +        * this call upgrades the access permissions on the same page,
>> +        * it is safe to skip the remote TLB flush.
>> +        */
>> +       bool flush_remote = false;
>> +       if (!pte_accessible(*ptep))
>> +               flush_remote = false;
>> +       else if (pte_pfn(*ptep) != pte_pfn(entry) ||
>> +                       (pte_write(*ptep) && !pte_write(entry)) ||
>> +                       (pte_exec(*ptep) && !pte_exec(entry)))
>> +               flush_remote = true;
>>
>>          if (changed && dirty) {
>
> Did anybody ever actually look at this sh*t-for-brains patch?
>
> Yeah, I'm grumpy. But I'm wasting time looking at patches that have
> new code in them that is stupid and retarded.
>
> This is the VM, guys, we don't add stupid and retarded code.
>
> LOOK at the code, for chrissake. Just look at it. And if you don't see
> why the above is stupid and retarded, you damn well shouldn't be
> touching VM code.

I agree it is pretty ugly.  However, the above patch
did get rid of a gigantic performance regression with
Peter's code.

Doing unnecessary remote TLB flushes was costing about
90% performance with specjbb on a 4 node system.

However, if we can guarantee that ptep_set_access_flags
is only ever called for pte permission _upgrades_, we
can simply get rid of the remote TLB flush on x86, and
skip the paranoia tests we are doing above.

Do we have that kind of guarantee?


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 05/31] x86/mm: Reduce tlb flushes from ptep_set_access_flags()
  2012-10-26  2:30     ` Rik van Riel
@ 2012-10-26  2:56       ` Linus Torvalds
  2012-10-26  3:57         ` Rik van Riel
  0 siblings, 1 reply; 135+ messages in thread
From: Linus Torvalds @ 2012-10-26  2:56 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Andrew Morton, linux-kernel, linux-mm,
	Ingo Molnar

On Thu, Oct 25, 2012 at 7:30 PM, Rik van Riel <riel@redhat.com> wrote:
>>
>> LOOK at the code, for chrissake. Just look at it. And if you don't see
>> why the above is stupid and retarded, you damn well shouldn't be
>> touching VM code.
>
> I agree it is pretty ugly.  However, the above patch
> did get rid of a gigantic performance regression with
> Peter's code.

Rik, *LOOK* at the code like I asked you to, instead of making excuses for it.

I'm not necessarily arguing with what the code tries to do. I'm
arguing with the fact that the code is pure and utter *garbage*.

It has two major (and I mean *MAJOR*) problems, both of which
individually should make you ashamed for ever posting that piece of
shit:

The obvious-without-even-understanding-semantics problem:

 - it's humongously stupidly written. It calculates that
'flush_remote' flag WHETHER IT GETS USED OR NOT.

   Christ. I can kind of expect stuff like that in driver code etc,
but in VM routines?

   Yes, the compiler may be smart enough to actually fix up the
idiocy. That doesn't make it less stupid.

The more-subtle-but-fundamental-problem:

 - regardless of how stupidly written it is on a very superficial
level, it's even more stupid in a much more fundamental way.

   That whole routine is explicitly written to be opportunistic. It is
*documented* to only set the access flags, so comparing anything else
is stupid, wouldn't you say?

Documented where? It's actually explicitly documented in the
pgtable-generic.c file which has the generic implementation of that
thing. But it's implicitly documented both in the name of the function
(do take another look) *and* in the actual implementation of the
function.

Look at the code: it doesn't even always update the page tables AT ALL
(and no, the return value does *not* reflect whether it updated it or
not!)

Also, notice how we update the pte entry with a simple

    *ptep = entry;

statement, not with the usual expensive page table updates? The only
thing that makes this safe is that we *only* do it with the exact same
page frame number (anything else would be disastrously buggy on 32-bit
PAE, for example). And we only ever do it with the dirty bit always
set, because otherwise we might be silently dropping a concurrent
hardware update of the dirty bit of the previous pte value on another
CPU.

The latter requirement is why the x86 code does

    if (changed && dirty) {

while the generic code checks just "If (changed)" (and then uses the
much more expensive set_pte_at() that has the proper dirty-bit
guarantees, and generates atomic accesses, not to mention various
virtualization crap).

In other words, everything that was added by that patch is PURE AND
UTTER SHIT. And THAT is what I'm objecting to.

Guess what? If you want to optimize the function to not do remote TLB
flushes, then just do that! None of the garbage. Just change the

    flush_tlb_page(vma, address);

line to

    __flush_tlb_one(address);

and it should damn well work. Because everything I see about
"flush_remote" looks just wrong, wrong, wrong.

And if there really is some reason for that whole flush_remote
braindamage, then we have much bigger problems, namely the fact that
we've broken the documented semantics of that function, and we're
doing various other things that are completely and utterly invalid
unless the above semantics hold.

So that patch should be burned, and possibly used as an example of
horribly crappy code for later generations. At no point should it be
applied.

                 Linus

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 05/31] x86/mm: Reduce tlb flushes from ptep_set_access_flags()
  2012-10-26  2:56       ` Linus Torvalds
@ 2012-10-26  3:57         ` Rik van Riel
  2012-10-26  4:23           ` Linus Torvalds
  0 siblings, 1 reply; 135+ messages in thread
From: Rik van Riel @ 2012-10-26  3:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Andrew Morton, linux-kernel, linux-mm,
	Ingo Molnar

On 10/25/2012 10:56 PM, Linus Torvalds wrote:

> Guess what? If you want to optimize the function to not do remote TLB
> flushes, then just do that! None of the garbage. Just change the
>
>      flush_tlb_page(vma, address);
>
> line to
>
>      __flush_tlb_one(address);

That may not even be needed.  Apparently Intel chips
automatically flush an entry from the TLB when it
causes a page fault.  I assume AMD chips do the same,
because flush_tlb_fix_spurious_fault evaluates to
nothing on x86.

> and it should damn well work. Because everything I see about
> "flush_remote" looks just wrong, wrong, wrong.

Are there architectures where we do need to flush
remote TLBs on upgrading the permissions on a PTE?

Because that is what the implementation in
pgtable-generic.c seems to be doing as well...

> And if there really is some reason for that whole flush_remote
> braindamage, then we have much bigger problems, namely the fact that
> we've broken the documented semantics of that function, and we're
> doing various other things that are completely and utterly invalid
> unless the above semantics hold.

Want to just remove the TLB flush entirely and see
if anything breaks in 3.8-rc1?

 From reading the code again, it looks like things
should indeed work ok.


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 05/31] x86/mm: Reduce tlb flushes from ptep_set_access_flags()
  2012-10-26  3:57         ` Rik van Riel
@ 2012-10-26  4:23           ` Linus Torvalds
  2012-10-26  6:42             ` Ingo Molnar
  2012-10-26 12:34             ` Michel Lespinasse
  0 siblings, 2 replies; 135+ messages in thread
From: Linus Torvalds @ 2012-10-26  4:23 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Andrew Morton, linux-kernel, linux-mm,
	Ingo Molnar

On Thu, Oct 25, 2012 at 8:57 PM, Rik van Riel <riel@redhat.com> wrote:
>
> That may not even be needed.  Apparently Intel chips
> automatically flush an entry from the TLB when it
> causes a page fault.  I assume AMD chips do the same,
> because flush_tlb_fix_spurious_fault evaluates to
> nothing on x86.

Yes. It's not architected as far as I know, though. But I agree, it's
possible - even likely - we could avoid TLB flushing entirely on x86.

If you want to try it, I would seriously suggest you do it as a
separate commit though, just in case.

> Are there architectures where we do need to flush
> remote TLBs on upgrading the permissions on a PTE?

I *suspect* that whole TLB flush just magically became an SMP one
without anybody ever really thinking about it.

So it's quite possible we could do this to the pgtable-generic.c code
too. However, we don't actually have any generic way to do a local
single-address flush (the __flush_tlb_one() thing is
architecture-specific, although it exists on a few architectures).
We'd need to add a local_flush_tlb_page(vma, address) function.

Alternatively, we could decide to use the "tlb_fix_spurious_fault()"
thing in there. Possibly just do it unconditionally in the caller - or
even just specify that the fault handler has to do it. And stop
returning a value at all from ptep_set_access_flags() (I *think*
that's the only thing the return value gets used for - flushing the
TLB on the local cpu for the cpu's that want it).

> Want to just remove the TLB flush entirely and see
> if anything breaks in 3.8-rc1?
>
> From reading the code again, it looks like things
> should indeed work ok.

I would be open to it, but just in case it causes bisectable problems
I'd really want to see it in two patches ("make it always do the local
flush" followed by "remove even the local flush"), and then it would
pinpoint any need.

              Linus

^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 04/31, v2] x86/mm: Introduce pte_accessible()
  2012-10-25 20:10   ` Linus Torvalds
@ 2012-10-26  6:24     ` Ingo Molnar
  0 siblings, 0 replies; 135+ messages in thread
From: Ingo Molnar @ 2012-10-26  6:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Rik van Riel, Andrea Arcangeli, Mel Gorman,
	Johannes Weiner, Thomas Gleixner, Andrew Morton, linux-kernel,
	linux-mm


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> NAK NAK NAK.
> 
> On Thu, Oct 25, 2012 at 5:16 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> >
> > +#define __HAVE_ARCH_PTE_ACCESSIBLE
> > +static inline int pte_accessible(pte_t a)
> 
> Stop doing these f*cking crazy ad-hoc "I have some other name
> available" #defines.
> 
> Use the same name, for chrissake! Don't make up new random names.
> 
> Just do
> 
>    #define pte_accessible pte_accessible
> 
> and then you can use
> 
>    #ifndef pte_accessible
> 
> to define the generic thing. Instead of having this INSANE "two
> different names for the same f*cking thing" crap.

Yeah...

> Stop it. Really.
> 
> Also, this:
> 
> > +#ifndef __HAVE_ARCH_PTE_ACCESSIBLE
> > +#define pte_accessible(pte)            pte_present(pte)
> > +#endif
> 
> looks unsafe and like a really bad idea.
> 
> You should probably do
> 
>   #ifndef pte_accessible
>     #define pte_accessible(pte) ((void)(pte),1)
>   #endif
> 
> because you have no idea if other architectures do
> 
>  (a) the same trick as x86 does for PROT_NONE (I can already tell you
> from a quick grep that ia64, m32r, m68k and sh do it)
>  (b) might not perhaps be caching non-present pte's anyway

Indeed that's much safer and each arch can opt-in consciously 
instead of us offering a potentially unsafe optimization.

> So NAK on this whole patch. It's bad. It's ugly, it's wrong, 
> and it's actively buggy.

I have fixed it as per the updated patch below. Only very 
lightly tested.

Thanks,

	Ingo

----------------------------->
Subject: x86/mm: Introduce pte_accessible()
From: Rik van Riel <riel@redhat.com>
Date: Tue, 9 Oct 2012 15:31:12 +0200

We need pte_present to return true for _PAGE_PROTNONE pages, to indicate that
the pte is associated with a page.

However, for TLB flushing purposes, we would like to know whether the pte
points to an actually accessible page.  This allows us to skip remote TLB
flushes for pages that are not actually accessible.

Fill in this method for x86 and provide a safe (but slower) method
on other architectures.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Fixed-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-66p11te4uj23gevgh4j987ip@git.kernel.org
[ Added Linus's review fixes. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/pgtable.h |    6 ++++++
 include/asm-generic/pgtable.h  |    4 ++++
 2 files changed, 10 insertions(+)

Index: tip/arch/x86/include/asm/pgtable.h
===================================================================
--- tip.orig/arch/x86/include/asm/pgtable.h
+++ tip/arch/x86/include/asm/pgtable.h
@@ -408,6 +408,12 @@ static inline int pte_present(pte_t a)
 	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
 }
 
+#define pte_accessible pte_accessible
+static inline int pte_accessible(pte_t a)
+{
+	return pte_flags(a) & _PAGE_PRESENT;
+}
+
 static inline int pte_hidden(pte_t pte)
 {
 	return pte_flags(pte) & _PAGE_HIDDEN;
Index: tip/include/asm-generic/pgtable.h
===================================================================
--- tip.orig/include/asm-generic/pgtable.h
+++ tip/include/asm-generic/pgtable.h
@@ -219,6 +219,10 @@ static inline int pmd_same(pmd_t pmd_a,
 #define move_pte(pte, prot, old_addr, new_addr)	(pte)
 #endif
 
+#ifndef pte_accessible
+# define pte_accessible(pte)		((void)(pte),1)
+#endif
+
 #ifndef flush_tlb_fix_spurious_fault
 #define flush_tlb_fix_spurious_fault(vma, address) flush_tlb_page(vma, address)
 #endif

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 05/31] x86/mm: Reduce tlb flushes from ptep_set_access_flags()
  2012-10-26  4:23           ` Linus Torvalds
@ 2012-10-26  6:42             ` Ingo Molnar
  2012-10-26 12:34             ` Michel Lespinasse
  1 sibling, 0 replies; 135+ messages in thread
From: Ingo Molnar @ 2012-10-26  6:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, Peter Zijlstra, Andrea Arcangeli, Mel Gorman,
	Johannes Weiner, Thomas Gleixner, Andrew Morton, linux-kernel,
	linux-mm


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, Oct 25, 2012 at 8:57 PM, Rik van Riel <riel@redhat.com> wrote:
> >
> > That may not even be needed.  Apparently Intel chips 
> > automatically flush an entry from the TLB when it causes a 
> > page fault.  I assume AMD chips do the same, because 
> > flush_tlb_fix_spurious_fault evaluates to nothing on x86.
> 
> Yes. It's not architected as far as I know, though. But I 
> agree, it's possible - even likely - we could avoid TLB 
> flushing entirely on x86.
> 
> If you want to try it, I would seriously suggest you do it as 
> a separate commit though, just in case.

Ok, will do it like that. INVLPG overhead is a small effect, 
nevertheless it's worth trying.

What *has* shown up in my profiles though, and which drove some 
of these changes is that for heavily threaded VM-intense 
workloads such as a single SPECjbb JVM instance running on all 
CPUs and all nodes, TLB flushes with any sort of serialization 
aspect are absolutely deadly.

So just to be *able* to verify the performance benefit and 
impact of some of the later NUMA-directed changes, we had to 
eliminate a number of scalability bottlenecks and put these 
optimization patches in front of the main changes.

That is why you have to go 20+ patches into the queue to see the 
real point :-/

> > Are there architectures where we do need to flush remote 
> > TLBs on upgrading the permissions on a PTE?
> 
> I *suspect* that whole TLB flush just magically became an SMP 
> one without anybody ever really thinking about it.

Yeah, and I think part of the problem is that it's also not a 
particularly straightforward performance bottleneck to analyze: 
SMP TLB flushing does not mainly show up as visible high overhead 
in profiles, it mostly shows up as extra idle time.

If the nature of the workload is that it has extra available 
paralellism that can fill in the idle time, it will mask much of 
the effect and there's only a slight shift in the profile.

It needs a borderline loaded system and sleep profiling to 
pinpoint these sources of overhead.

[...]
> > From reading the code again, it looks like things should 
> > indeed work ok.
> 
> I would be open to it, but just in case it causes bisectable 
> problems I'd really want to see it in two patches ("make it 
> always do the local flush" followed by "remove even the local 
> flush"), and then it would pinpoint any need.

Yeah, 100% agreed.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 26/31] sched, numa, mm: Add fault driven placement and migration policy
  2012-10-25 20:53   ` Linus Torvalds
@ 2012-10-26  7:15     ` Ingo Molnar
  2012-10-26 13:50       ` Ingo Molnar
  0 siblings, 1 reply; 135+ messages in thread
From: Ingo Molnar @ 2012-10-26  7:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Rik van Riel, Andrea Arcangeli, Mel Gorman,
	Johannes Weiner, Thomas Gleixner, Andrew Morton, linux-kernel,
	linux-mm


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, Oct 25, 2012 at 5:16 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > +       /*
> > +        * Using runtime rather than walltime has the dual advantage that
> > +        * we (mostly) drive the selection from busy threads and that the
> > +        * task needs to have done some actual work before we bother with
> > +        * NUMA placement.
> > +        */
> 
> That explanation makes sense..
> 
> > +       now = curr->se.sum_exec_runtime;
> > +       period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
> > +
> > +       if (now - curr->node_stamp > period) {
> > +               curr->node_stamp = now;
> > +
> > +               if (!time_before(jiffies, curr->mm->numa_next_scan)) {
> 
> .. but then the whole "numa_next_scan" thing ends up being 
> about real-time anyway?
>
> So 'numa_scan_period' is in CPU time (msec, converted to nsec 
> at runtime rather than when setting it), but 'numa_next_scan' 
> is in wallclock time (jiffies)?
> 
> But *both* of them are based on the same 'numa_scan_period' 
> thing that the user sets in ms.
> 
> So numa_scan_period is interpreted as both wallclock *and* as 
> runtime?
> 
> Maybe this works, but it doesn't really make much sense.

So, the relationship between wall clock time and execution 
runtime is that in the limit they run at the same speed: when 
there's a single task running. In any other case execution 
runtime can only run slower than wall time.

So the bit you found weird:

> > +               if (!time_before(jiffies, curr->mm->numa_next_scan)) {

together with the task_numa_work() frequency limit:

        /*
         * Enforce maximal scan/migration frequency..
         */
        migrate = mm->numa_next_scan;
        if (time_before(now, migrate))
                return;

        next_scan = now + 2*msecs_to_jiffies(sysctl_sched_numa_scan_period_min);
        if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
                return;

puts an upper limit on the per mm scanning frequency.

This prevents us from over-sampling if there are many threads: if 
all threads happen to come in at the same time we don't create a 
spike in overhead.

We also avoid multiple threads scanning at once in parallel to 
each other. Faults are nicely parallel, especially with all the 
preparatory patches in place, so the distributed nature of the 
faults itself is not a problem.

So we have two conflicting goals here: on one hand we have a 
quality of sampling goal which asks for per task runtime 
proportional scanning on all threads, but we also have a 
performance goal and don't actually want all threads running at 
the same time. This frequency limit avoids the over-sampling 
scenario while still fulfilling the per task sampling property, 
statistically on average.
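
Concretely, with the sysctl_sched_numa_scan_period_min default of
100 msecs used in this series, the cmpxchg() above pushes
mm->numa_next_scan ahead by:

        2 * 100 msecs = 200 msecs

i.e. the unmapping pass runs at most ~5 times per second per mm,
no matter how many threads the process has.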

If you agree that we should do it like that and if the 
implementation is correct and optimal, I will put a better 
explanation into the code.

[
  task_numa_work() performance side note:

  We are also *very* close to being able to use down_read() instead
  of down_write() in the sampling-unmap code in 
  task_numa_work(), as it should be safe in theory to call 
  change_protection(PROT_NONE) in parallel - but there's one 
  regression that disagrees with this theory so we use 
  down_write() at the moment.

  Maybe you could help us there: can you see a reason why the
  change_prot_none()->change_protection() call in
  task_numa_work() can not occur in parallel to a page fault in
  another thread on another CPU? It should be safe - yet if we 
  change it I can see occasional corruption of user-space state: 
  segfaults and register corruption.
]

> [...] And what is the impact of this on machines that run lots 
> of loads with delays (whether due to IO or timers)?

I've done sysbench OLTP measurements which showed no apparent 
regressions:

 #
 # Comparing { res-schednuma-NO_NUMA.txt } to { res-schednuma-+NUMA.txt }:
 #
 #  threads     improvement %       SysBench OLTP transactions/second
 #-------------------------------------------------------------------
         2:            2.11 %              #    2160.20  vs.  2205.80
         4:           -5.52 %              #    4202.04  vs.  3969.97
         8:            0.01 %              #    6894.45  vs.  6895.45
        16:           -0.31 %              #   11840.77  vs. 11804.30
        24:           -0.56 %              #   15053.98  vs. 14969.14
        30:            0.56 %              #   17043.23  vs. 17138.21
        32:           -1.08 %              #   17797.04  vs. 17604.67
        34:            1.04 %              #   18158.10  vs. 18347.22
        36:           -0.16 %              #   18125.42  vs. 18096.68
        40:            0.45 %              #   18218.73  vs. 18300.59
        48:           -0.39 %              #   18266.91  vs. 18195.26
        56:           -0.11 %              #   18285.56  vs. 18265.74
        64:            0.23 %              #   18304.74  vs. 18347.51
        96:            0.18 %              #   18268.44  vs. 18302.04
       128:            0.22 %              #   18058.92  vs. 18099.34
       256:            1.63 %              #   17068.55  vs. 17347.14
       512:            6.86 %              #   13452.18  vs. 14375.08

No regression is the best we can hope for I think, given that 
OLTP typically has huge global caches and global serialization, 
so any NUMA-conscious placement will at most be a nuisance.

We've also done kbuild measurements - which too is a pretty 
sleepy workload that is too fast for any migration techniques to 
help.

But even sysbench isn't doing very long delays, so I will do 
more IO delay targeted measurements.

So I've been actively looking for and checking the worst-case 
loads for this feature. The feature obviously helps long-run, 
CPU-intense workloads, but those aren't the challenging ones 
really IMO: I spent 70% of the time analyzing workloads that are 
not expected to be friends with this feature.

We are also keeping CONFIG_SCHED_NUMA off by default for good 
measure.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 10/31] mm/mpol: Remove NUMA_INTERLEAVE_HIT
  2012-10-25 20:58   ` Andi Kleen
@ 2012-10-26  7:59     ` Ingo Molnar
  0 siblings, 0 replies; 135+ messages in thread
From: Ingo Molnar @ 2012-10-26  7:59 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Rik van Riel, Andrea Arcangeli, Mel Gorman,
	Johannes Weiner, Thomas Gleixner, Linus Torvalds, Andrew Morton,
	linux-kernel, linux-mm, Lee Schermerhorn


* Andi Kleen <andi@firstfloor.org> wrote:

> Peter Zijlstra <a.p.zijlstra@chello.nl> writes:
> 
> > Since the NUMA_INTERLEAVE_HIT statistic is useless on its 
> > own; it wants to be compared to either a total of interleave 
> > allocations or to a miss count, remove it.
> >
> > Fixing it would be possible, but since we've gone years 
> > without these statistics I figure we can continue that way.
> >
> > Also NUMA_HIT fully includes NUMA_INTERLEAVE_HIT so users 
> > might switch to using that.
> >
> > This cleans up some of the weird MPOL_INTERLEAVE allocation 
> > exceptions.
> 
> NACK, as already posted several times.
> 
> This breaks the numactl test suite, which is the only way 
> currently to test interleaving.

This patch is not essential to the NUMA series so I've zapped it 
from the patch queue and fixed up the roll-on effects.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/31] numa/core patches
  2012-10-25 12:16 [PATCH 00/31] numa/core patches Peter Zijlstra
                   ` (30 preceding siblings ...)
  2012-10-25 12:16 ` [PATCH 31/31] sched, numa, mm: Add memcg support to do_huge_pmd_numa_page() Peter Zijlstra
@ 2012-10-26  9:07 ` Zhouping Liu
  2012-10-26  9:08   ` Peter Zijlstra
  2012-10-30 12:20 ` Mel Gorman
  2012-11-05 17:11 ` Srikar Dronamraju
  33 siblings, 1 reply; 135+ messages in thread
From: Zhouping Liu @ 2012-10-26  9:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton, linux-kernel,
	linux-mm, Ingo Molnar

[-- Attachment #1: Type: text/plain, Size: 5265 bytes --]

On 10/25/2012 08:16 PM, Peter Zijlstra wrote:
> Hi all,
>
> Here's a re-post of the NUMA scheduling and migration improvement
> patches that we are working on. These include techniques from
> AutoNUMA and the sched/numa tree and form a unified basis - it
> has got all the bits that look good and mergeable.
>
> With these patches applied, the mbind system calls expand to
> new modes of lazy-migration binding, and if the
> CONFIG_SCHED_NUMA=y .config option is enabled the scheduler
> will automatically sample the working set of tasks via page
> faults. Based on that information the scheduler then tries
> to balance smartly, put tasks on a home node and migrate CPU
> work and memory on the same node.
>
> They are functional in their current state and have had testing on
> a variety of x86 NUMA hardware.
>
> These patches will continue their life in tip:numa/core and unless
> there are major showstoppers they are intended for the v3.8
> merge window.
>
> We believe that they provide a solid basis for future work.
>
> Please review .. once again and holler if you see anything funny! :-)

Hi,

I tested the patch set, but there's one issue blocking me:

  kernel BUG at mm/memcontrol.c:3263!

--------- snip -----------------
[  179.804754] kernel BUG at mm/memcontrol.c:3263!
[  179.874356] invalid opcode: 0000 [#1] SMP
[  179.939377] Modules linked in: fuse ip6table_filter ip6_tables 
ebtable_nat ebtables bnep bluetooth rfkill iptable_mangle 
nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack be2iscsi 
iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi 
ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp 
libiscsi_tcp libiscsi scsi_transport_iscsi vfat fat iTCO_wdt cdc_ether 
coretemp iTCO_vendor_support usbnet mii ioatdma lpc_ich crc32c_intel 
bnx2 shpchp i7core_edac pcspkr tpm_tis tpm i2c_i801 mfd_core tpm_bios 
edac_core dca serio_raw microcode vhost_net tun macvtap macvlan 
kvm_intel kvm uinput mgag200 i2c_algo_bit drm_kms_helper ttm drm 
megaraid_sas i2c_core
[  180.737647] CPU 7
[  180.759586] Pid: 1316, comm: X Not tainted 3.7.0-rc2+ #3 IBM IBM 
System x3400 M3 Server -[7379I08]-/69Y4356
[  180.918591] RIP: 0010:[<ffffffff8118c39a>]  [<ffffffff8118c39a>] 
mem_cgroup_prepare_migration+0xba/0xd0
[  181.047572] RSP: 0000:ffff880179113d38  EFLAGS: 00013202
[  181.127009] RAX: 0040100000084069 RBX: ffffea0005b28000 RCX: 
ffffea00099a805c
[  181.228674] RDX: ffff880179113d90 RSI: ffffea00099a8000 RDI: 
ffffea0005b28000
[  181.331080] RBP: ffff880179113d58 R08: 0000000000280000 R09: 
ffff88027fffff80
[  181.433163] R10: 00000000000000d4 R11: 00000037e9f7bd90 R12: 
ffff880179113d90
[  181.533866] R13: 00007fc5ffa00000 R14: ffff880178001fe8 R15: 
000000016ca001e0
[  181.635264] FS:  00007fc600ddb940(0000) GS:ffff88027fc60000(0000) 
knlGS:0000000000000000
[  181.753726] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  181.842013] CR2: 00007fc5ffa00000 CR3: 00000001779d2000 CR4: 
00000000000007e0
[  181.945346] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[  182.049416] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 
0000000000000400
[  182.153796] Process X (pid: 1316, threadinfo ffff880179112000, task 
ffff880179364620)
[  182.266464] Stack:
[  182.309943]  ffff880177d2c980 00007fc5ffa00000 ffffea0005b28000 
ffff880177d2c980
[  182.418164]  ffff880179113dc8 ffffffff81183b60 ffff880177d2c9dc 
0000000178001fe0
[  182.526366]  ffff880177856a50 ffffea00099a8000 ffff880177d2cc38 
0000000000000000
[  182.633709] Call Trace:
[  182.681450]  [<ffffffff81183b60>] do_huge_pmd_numa_page+0x180/0x500
[  182.775090]  [<ffffffff811585c9>] handle_mm_fault+0x1e9/0x360
[  182.863038]  [<ffffffff81632b62>] __do_page_fault+0x172/0x4e0
[  182.950574]  [<ffffffff8101c283>] ? __switch_to_xtra+0x163/0x1a0
[  183.041512]  [<ffffffff8101281e>] ? __switch_to+0x3ce/0x4a0
[  183.126832]  [<ffffffff8162d686>] ? __schedule+0x3c6/0x7a0
[  183.211216]  [<ffffffff81632ede>] do_page_fault+0xe/0x10
[  183.293705]  [<ffffffff8162f518>] page_fault+0x28/0x30
[  183.373909] Code: 00 48 8b 78 08 48 8b 57 10 83 e2 01 75 05 f0 83 47 
08 01 f6 43 08 01 74 bb f0 80 08 04 eb b5 f3 90 48 8b 10 80 e2 01 75 f6 
eb 94 <0f> 0b 0f 1f 40 00 e8 9c b4 49 00 66 66 2e 0f 1f 84 00 00 00 00
[  183.651946] RIP  [<ffffffff8118c39a>] 
mem_cgroup_prepare_migration+0xba/0xd0
[  183.760378]  RSP <ffff880179113d38>

===========================================================================

My system has two NUMA nodes.

There are two methods that can reproduce the bug on my machine.
1. start X server:
     # startx
    it's 100% reproducible, and it can crash the system.

2. Compiling the kernel source with multiple threads:
     # make -j N
    this action can produce a Call Trace similar to the one above,
    but it *didn't* crash the system

The whole dmesg log and config file are attached.

Also, I have tested the mainline kernel without the sched/numa patch
set applied, and there are no such issues.

please let me know if you need more info.

Thanks,
Zhouping

>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


[-- Attachment #2: config_sched_numa --]
[-- Type: text/plain, Size: 114674 bytes --]

#
# Automatically generated file; DO NOT EDIT.
# Linux/x86_64 3.7.0-rc2 Kernel Configuration
#
CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_MMU=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_DEFAULT_IDLE=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_ARCH_HAS_CPU_AUTOPROBE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_HAVE_INTEL_TXT=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11"
CONFIG_ARCH_CPU_PROBE_RELEASE=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_HAVE_IRQ_WORK=y
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_FHANDLE=y
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_WATCH=y
CONFIG_AUDIT_TREE=y
# CONFIG_AUDIT_LOGINUID_IMMUTABLE is not set
CONFIG_HAVE_GENERIC_HARDIRQS=y

#
# IRQ subsystem
#
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_IRQ_DOMAIN=y
# CONFIG_IRQ_DOMAIN_DEBUG is not set
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y

#
# CPU/Task time and stats accounting
#
# CONFIG_TICK_CPU_ACCOUNTING is not set
CONFIG_IRQ_TIME_ACCOUNTING=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_PREEMPT_RCU is not set
# CONFIG_RCU_USER_QS is not set
CONFIG_RCU_FANOUT=64
CONFIG_RCU_FANOUT_LEAF=16
# CONFIG_RCU_FANOUT_EXACT is not set
# CONFIG_RCU_FAST_NO_HZ is not set
# CONFIG_TREE_RCU_TRACE is not set
# CONFIG_IKCONFIG is not set
CONFIG_LOG_BUF_SHIFT=18
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_SCHED_NUMA=y
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_RESOURCE_COUNTERS=y
CONFIG_MEMCG=y
CONFIG_MEMCG_SWAP=y
# CONFIG_MEMCG_SWAP_ENABLED is not set
CONFIG_MEMCG_KMEM=y
# CONFIG_CGROUP_HUGETLB is not set
CONFIG_CGROUP_PERF=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_CFS_BANDWIDTH=y
CONFIG_RT_GROUP_SCHED=y
CONFIG_BLK_CGROUP=y
# CONFIG_DEBUG_BLK_CGROUP is not set
# CONFIG_CHECKPOINT_RESTORE is not set
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_IPC_NS=y
CONFIG_PID_NS=y
CONFIG_NET_NS=y
CONFIG_SCHED_AUTOGROUP=y
CONFIG_MM_OWNER=y
# CONFIG_SYSFS_DEPRECATED is not set
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
CONFIG_RD_BZIP2=y
CONFIG_RD_LZMA=y
CONFIG_RD_XZ=y
CONFIG_RD_LZO=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
CONFIG_ANON_INODES=y
# CONFIG_EXPERT is not set
CONFIG_HAVE_UID16=y
CONFIG_UID16=y
# CONFIG_SYSCTL_SYSCALL is not set
CONFIG_SYSCTL_EXCEPTION_TRACE=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_HAVE_PCSPKR_PLATFORM=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
# CONFIG_EMBEDDED is not set
CONFIG_HAVE_PERF_EVENTS=y

#
# Kernel Performance Events And Counters
#
CONFIG_PERF_EVENTS=y
# CONFIG_DEBUG_PERF_USE_VMALLOC is not set
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_PCI_QUIRKS=y
CONFIG_SLUB_DEBUG=y
# CONFIG_COMPAT_BRK is not set
# CONFIG_SLAB is not set
CONFIG_SLUB=y
CONFIG_PROFILING=y
CONFIG_TRACEPOINTS=y
CONFIG_OPROFILE=m
CONFIG_OPROFILE_EVENT_MULTIPLEX=y
CONFIG_HAVE_OPROFILE=y
CONFIG_OPROFILE_NMI_TIMER=y
CONFIG_KPROBES=y
CONFIG_JUMP_LABEL=y
CONFIG_OPTPROBES=y
CONFIG_UPROBES=y
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_KRETPROBES=y
CONFIG_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_OPTPROBES=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_HAVE_DMA_ATTRS=y
CONFIG_USE_GENERIC_SMP_HELPERS=y
CONFIG_GENERIC_SMP_IDLE_THREAD=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
CONFIG_HAVE_DMA_API_DEBUG=y
CONFIG_HAVE_HW_BREAKPOINT=y
CONFIG_HAVE_MIXED_BREAKPOINTS_REGS=y
CONFIG_HAVE_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_PERF_EVENTS_NMI=y
CONFIG_HAVE_PERF_REGS=y
CONFIG_HAVE_PERF_USER_STACK_DUMP=y
CONFIG_HAVE_ARCH_JUMP_LABEL=y
CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG=y
CONFIG_HAVE_ALIGNED_STRUCT_PAGE=y
CONFIG_HAVE_CMPXCHG_LOCAL=y
CONFIG_HAVE_CMPXCHG_DOUBLE=y
CONFIG_ARCH_WANT_COMPAT_IPC_PARSE_VERSION=y
CONFIG_ARCH_WANT_OLD_COMPAT_IPC=y
CONFIG_GENERIC_KERNEL_THREAD=y
CONFIG_GENERIC_KERNEL_EXECVE=y
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
CONFIG_SECCOMP_FILTER=y
CONFIG_HAVE_RCU_USER_QS=y
CONFIG_HAVE_IRQ_TIME_ACCOUNTING=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
CONFIG_MODULES_USE_ELF_RELA=y

#
# GCOV-based kernel profiling
#
# CONFIG_GCOV_KERNEL is not set
# CONFIG_HAVE_GENERIC_DMA_COHERENT is not set
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
# CONFIG_MODULE_FORCE_LOAD is not set
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_MODULE_SIG=y
# CONFIG_MODULE_SIG_FORCE is not set
# CONFIG_MODULE_SIG_SHA1 is not set
# CONFIG_MODULE_SIG_SHA224 is not set
CONFIG_MODULE_SIG_SHA256=y
# CONFIG_MODULE_SIG_SHA384 is not set
# CONFIG_MODULE_SIG_SHA512 is not set
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
CONFIG_BLK_DEV_BSG=y
CONFIG_BLK_DEV_BSGLIB=y
CONFIG_BLK_DEV_INTEGRITY=y
CONFIG_BLK_DEV_THROTTLING=y

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
# CONFIG_ACORN_PARTITION is not set
CONFIG_OSF_PARTITION=y
CONFIG_AMIGA_PARTITION=y
# CONFIG_ATARI_PARTITION is not set
CONFIG_MAC_PARTITION=y
CONFIG_MSDOS_PARTITION=y
CONFIG_BSD_DISKLABEL=y
CONFIG_MINIX_SUBPARTITION=y
CONFIG_SOLARIS_X86_PARTITION=y
CONFIG_UNIXWARE_DISKLABEL=y
# CONFIG_LDM_PARTITION is not set
CONFIG_SGI_PARTITION=y
# CONFIG_ULTRIX_PARTITION is not set
CONFIG_SUN_PARTITION=y
CONFIG_KARMA_PARTITION=y
CONFIG_EFI_PARTITION=y
# CONFIG_SYSV68_PARTITION is not set
CONFIG_BLOCK_COMPAT=y

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
CONFIG_CFQ_GROUP_IOSCHED=y
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_PADATA=y
CONFIG_ASN1=y
CONFIG_INLINE_SPIN_UNLOCK_IRQ=y
CONFIG_INLINE_READ_UNLOCK=y
CONFIG_INLINE_READ_UNLOCK_IRQ=y
CONFIG_INLINE_WRITE_UNLOCK=y
CONFIG_INLINE_WRITE_UNLOCK_IRQ=y
CONFIG_MUTEX_SPIN_ON_OWNER=y
CONFIG_FREEZER=y

#
# Processor type and features
#
CONFIG_ZONE_DMA=y
CONFIG_SMP=y
CONFIG_X86_X2APIC=y
CONFIG_X86_MPPARSE=y
CONFIG_X86_EXTENDED_PLATFORM=y
# CONFIG_X86_NUMACHIP is not set
# CONFIG_X86_VSMP is not set
# CONFIG_X86_UV is not set
CONFIG_X86_SUPPORTS_MEMORY_FAILURE=y
CONFIG_SCHED_OMIT_FRAME_POINTER=y
CONFIG_PARAVIRT_GUEST=y
CONFIG_PARAVIRT_TIME_ACCOUNTING=y
CONFIG_XEN=y
CONFIG_XEN_DOM0=y
CONFIG_XEN_PRIVILEGED_GUEST=y
CONFIG_XEN_PVHVM=y
CONFIG_XEN_MAX_DOMAIN_MEMORY=500
CONFIG_XEN_SAVE_RESTORE=y
CONFIG_XEN_DEBUG_FS=y
CONFIG_KVM_GUEST=y
CONFIG_PARAVIRT=y
# CONFIG_PARAVIRT_SPINLOCKS is not set
CONFIG_PARAVIRT_CLOCK=y
# CONFIG_PARAVIRT_DEBUG is not set
CONFIG_NO_BOOTMEM=y
# CONFIG_MEMTEST is not set
# CONFIG_MK8 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
# CONFIG_MATOM is not set
CONFIG_GENERIC_CPU=y
CONFIG_X86_INTERNODE_CACHE_SHIFT=6
CONFIG_X86_CMPXCHG=y
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_XADD=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_CENTAUR=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
CONFIG_GART_IOMMU=y
# CONFIG_CALGARY_IOMMU is not set
CONFIG_SWIOTLB=y
CONFIG_IOMMU_HELPER=y
# CONFIG_MAXSMP is not set
CONFIG_NR_CPUS=128
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
CONFIG_X86_MCE_THRESHOLD=y
# CONFIG_X86_MCE_INJECT is not set
CONFIG_X86_THERMAL_VECTOR=y
CONFIG_I8K=m
CONFIG_MICROCODE=m
CONFIG_MICROCODE_INTEL=y
CONFIG_MICROCODE_AMD=y
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
CONFIG_ARCH_PHYS_ADDR_T_64BIT=y
CONFIG_ARCH_DMA_ADDR_T_64BIT=y
CONFIG_DIRECT_GBPAGES=y
CONFIG_NUMA=y
CONFIG_AMD_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
CONFIG_NODES_SPAN_OTHER_NODES=y
# CONFIG_NUMA_EMU is not set
CONFIG_NODES_SHIFT=9
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_ARCH_PROC_KCORE_TEXT=y
CONFIG_ILLEGAL_POINTER_VALUE=0xdead000000000000
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_SPARSEMEM_MANUAL=y
CONFIG_SPARSEMEM=y
CONFIG_NEED_MULTIPLE_NODES=y
CONFIG_HAVE_MEMORY_PRESENT=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=y
CONFIG_SPARSEMEM_VMEMMAP=y
CONFIG_HAVE_MEMBLOCK=y
CONFIG_HAVE_MEMBLOCK_NODE_MAP=y
CONFIG_ARCH_DISCARD_MEMBLOCK=y
CONFIG_MEMORY_ISOLATION=y
# CONFIG_MEMORY_HOTPLUG is not set
CONFIG_PAGEFLAGS_EXTENDED=y
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_COMPACTION=y
CONFIG_MIGRATION=y
CONFIG_PHYS_ADDR_T_64BIT=y
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
CONFIG_MMU_NOTIFIER=y
CONFIG_KSM=y
CONFIG_DEFAULT_MMAP_MIN_ADDR=4096
CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y
CONFIG_MEMORY_FAILURE=y
CONFIG_HWPOISON_INJECT=m
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
# CONFIG_TRANSPARENT_HUGEPAGE_MADVISE is not set
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_CLEANCACHE=y
CONFIG_FRONTSWAP=y
CONFIG_X86_CHECK_BIOS_CORRUPTION=y
# CONFIG_X86_BOOTPARAM_MEMORY_CORRUPTION_CHECK is not set
CONFIG_X86_RESERVE_LOW=64
CONFIG_MTRR=y
CONFIG_MTRR_SANITIZER=y
CONFIG_MTRR_SANITIZER_ENABLE_DEFAULT=1
CONFIG_MTRR_SANITIZER_SPARE_REG_NR_DEFAULT=1
CONFIG_X86_PAT=y
CONFIG_ARCH_USES_PG_UNCACHED=y
CONFIG_ARCH_RANDOM=y
CONFIG_X86_SMAP=y
CONFIG_EFI=y
CONFIG_EFI_STUB=y
CONFIG_SECCOMP=y
CONFIG_CC_STACKPROTECTOR=y
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
CONFIG_SCHED_HRTICK=y
CONFIG_KEXEC=y
CONFIG_CRASH_DUMP=y
CONFIG_KEXEC_JUMP=y
CONFIG_PHYSICAL_START=0x1000000
CONFIG_RELOCATABLE=y
CONFIG_PHYSICAL_ALIGN=0x1000000
CONFIG_HOTPLUG_CPU=y
# CONFIG_COMPAT_VDSO is not set
# CONFIG_CMDLINE_BOOL is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_USE_PERCPU_NUMA_NODE_ID=y

#
# Power management and ACPI options
#
CONFIG_ARCH_HIBERNATION_HEADER=y
CONFIG_SUSPEND=y
CONFIG_SUSPEND_FREEZER=y
CONFIG_HIBERNATE_CALLBACKS=y
CONFIG_HIBERNATION=y
CONFIG_PM_STD_PARTITION=""
CONFIG_PM_SLEEP=y
CONFIG_PM_SLEEP_SMP=y
# CONFIG_PM_AUTOSLEEP is not set
# CONFIG_PM_WAKELOCKS is not set
CONFIG_PM_RUNTIME=y
CONFIG_PM=y
CONFIG_PM_DEBUG=y
CONFIG_PM_ADVANCED_DEBUG=y
# CONFIG_PM_TEST_SUSPEND is not set
CONFIG_PM_SLEEP_DEBUG=y
CONFIG_PM_TRACE=y
CONFIG_PM_TRACE_RTC=y
CONFIG_ACPI=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_PROCFS=y
# CONFIG_ACPI_PROCFS_POWER is not set
CONFIG_ACPI_EC_DEBUGFS=m
# CONFIG_ACPI_PROC_EVENT is not set
CONFIG_ACPI_AC=y
CONFIG_ACPI_BATTERY=y
CONFIG_ACPI_BUTTON=y
CONFIG_ACPI_VIDEO=m
CONFIG_ACPI_FAN=y
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_IPMI=m
CONFIG_ACPI_HOTPLUG_CPU=y
CONFIG_ACPI_PROCESSOR_AGGREGATOR=m
CONFIG_ACPI_THERMAL=y
CONFIG_ACPI_NUMA=y
# CONFIG_ACPI_CUSTOM_DSDT is not set
CONFIG_ACPI_BLACKLIST_YEAR=0
# CONFIG_ACPI_DEBUG is not set
CONFIG_ACPI_PCI_SLOT=y
CONFIG_X86_PM_TIMER=y
CONFIG_ACPI_CONTAINER=y
CONFIG_ACPI_SBS=m
CONFIG_ACPI_HED=y
CONFIG_ACPI_CUSTOM_METHOD=m
# CONFIG_ACPI_BGRT is not set
CONFIG_ACPI_APEI=y
CONFIG_ACPI_APEI_GHES=y
CONFIG_ACPI_APEI_PCIEAER=y
CONFIG_ACPI_APEI_MEMORY_FAILURE=y
# CONFIG_ACPI_APEI_EINJ is not set
# CONFIG_ACPI_APEI_ERST_DEBUG is not set
CONFIG_SFI=y

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=y
CONFIG_CPU_FREQ_STAT=m
CONFIG_CPU_FREQ_STAT_DETAILS=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y

#
# x86 CPU frequency scaling drivers
#
CONFIG_X86_PCC_CPUFREQ=y
CONFIG_X86_ACPI_CPUFREQ=y
CONFIG_X86_ACPI_CPUFREQ_CPB=y
CONFIG_X86_POWERNOW_K8=y
# CONFIG_X86_SPEEDSTEP_CENTRINO is not set
CONFIG_X86_P4_CLOCKMOD=y

#
# shared options
#
CONFIG_X86_SPEEDSTEP_LIB=y
CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y
CONFIG_CPU_IDLE_GOV_MENU=y
# CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED is not set
CONFIG_INTEL_IDLE=y

#
# Memory power savings
#
CONFIG_I7300_IDLE_IOAT_CHANNEL=y
CONFIG_I7300_IDLE=m

#
# Bus options (PCI etc.)
#
CONFIG_PCI=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_PCI_XEN=y
CONFIG_PCI_DOMAINS=y
CONFIG_PCIEPORTBUS=y
CONFIG_HOTPLUG_PCI_PCIE=y
CONFIG_PCIEAER=y
CONFIG_PCIE_ECRC=y
CONFIG_PCIEAER_INJECT=m
CONFIG_PCIEASPM=y
# CONFIG_PCIEASPM_DEBUG is not set
CONFIG_PCIEASPM_DEFAULT=y
# CONFIG_PCIEASPM_POWERSAVE is not set
# CONFIG_PCIEASPM_PERFORMANCE is not set
CONFIG_PCIE_PME=y
CONFIG_ARCH_SUPPORTS_MSI=y
CONFIG_PCI_MSI=y
# CONFIG_PCI_DEBUG is not set
# CONFIG_PCI_REALLOC_ENABLE_AUTO is not set
CONFIG_PCI_STUB=y
CONFIG_XEN_PCIDEV_FRONTEND=m
CONFIG_HT_IRQ=y
CONFIG_PCI_ATS=y
CONFIG_PCI_IOV=y
CONFIG_PCI_PRI=y
CONFIG_PCI_PASID=y
CONFIG_PCI_IOAPIC=y
CONFIG_PCI_LABEL=y
CONFIG_ISA_DMA_API=y
CONFIG_AMD_NB=y
CONFIG_PCCARD=y
CONFIG_PCMCIA=y
CONFIG_PCMCIA_LOAD_CIS=y
CONFIG_CARDBUS=y

#
# PC-card bridges
#
CONFIG_YENTA=m
CONFIG_YENTA_O2=y
CONFIG_YENTA_RICOH=y
CONFIG_YENTA_TI=y
CONFIG_YENTA_ENE_TUNE=y
CONFIG_YENTA_TOSHIBA=y
CONFIG_PD6729=m
CONFIG_I82092=m
CONFIG_PCCARD_NONSTATIC=y
CONFIG_HOTPLUG_PCI=y
CONFIG_HOTPLUG_PCI_ACPI=y
CONFIG_HOTPLUG_PCI_ACPI_IBM=m
# CONFIG_HOTPLUG_PCI_CPCI is not set
CONFIG_HOTPLUG_PCI_SHPC=m
# CONFIG_RAPIDIO is not set

#
# Executable file formats / Emulations
#
CONFIG_BINFMT_ELF=y
CONFIG_COMPAT_BINFMT_ELF=y
CONFIG_ARCH_BINFMT_ELF_RANDOMIZE_PIE=y
CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS=y
# CONFIG_HAVE_AOUT is not set
CONFIG_BINFMT_MISC=m
CONFIG_COREDUMP=y
CONFIG_IA32_EMULATION=y
# CONFIG_IA32_AOUT is not set
# CONFIG_X86_X32 is not set
CONFIG_COMPAT=y
CONFIG_COMPAT_FOR_U64_ALIGNMENT=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_KEYS_COMPAT=y
CONFIG_HAVE_TEXT_POKE_SMP=y
CONFIG_X86_DEV_DMA_OPS=y
CONFIG_NET=y
CONFIG_COMPAT_NETLINK_MESSAGES=y

#
# Networking options
#
CONFIG_PACKET=y
# CONFIG_PACKET_DIAG is not set
CONFIG_UNIX=y
CONFIG_UNIX_DIAG=m
CONFIG_XFRM=y
CONFIG_XFRM_ALGO=y
CONFIG_XFRM_USER=y
CONFIG_XFRM_SUB_POLICY=y
CONFIG_XFRM_MIGRATE=y
CONFIG_XFRM_STATISTICS=y
CONFIG_XFRM_IPCOMP=m
CONFIG_NET_KEY=m
CONFIG_NET_KEY_MIGRATE=y
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_IP_FIB_TRIE_STATS=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_MULTIPATH=y
CONFIG_IP_ROUTE_VERBOSE=y
CONFIG_IP_ROUTE_CLASSID=y
# CONFIG_IP_PNP is not set
CONFIG_NET_IPIP=m
CONFIG_NET_IPGRE_DEMUX=m
CONFIG_NET_IPGRE=m
CONFIG_NET_IPGRE_BROADCAST=y
CONFIG_IP_MROUTE=y
CONFIG_IP_MROUTE_MULTIPLE_TABLES=y
CONFIG_IP_PIMSM_V1=y
CONFIG_IP_PIMSM_V2=y
CONFIG_ARPD=y
CONFIG_SYN_COOKIES=y
CONFIG_NET_IPVTI=m
CONFIG_INET_AH=m
CONFIG_INET_ESP=m
CONFIG_INET_IPCOMP=m
CONFIG_INET_XFRM_TUNNEL=m
CONFIG_INET_TUNNEL=m
CONFIG_INET_XFRM_MODE_TRANSPORT=m
CONFIG_INET_XFRM_MODE_TUNNEL=m
CONFIG_INET_XFRM_MODE_BEET=m
CONFIG_INET_LRO=y
CONFIG_INET_DIAG=m
CONFIG_INET_TCP_DIAG=m
CONFIG_INET_UDP_DIAG=m
CONFIG_TCP_CONG_ADVANCED=y
CONFIG_TCP_CONG_BIC=m
CONFIG_TCP_CONG_CUBIC=y
CONFIG_TCP_CONG_WESTWOOD=m
CONFIG_TCP_CONG_HTCP=m
CONFIG_TCP_CONG_HSTCP=m
CONFIG_TCP_CONG_HYBLA=m
CONFIG_TCP_CONG_VEGAS=m
CONFIG_TCP_CONG_SCALABLE=m
CONFIG_TCP_CONG_LP=m
CONFIG_TCP_CONG_VENO=m
CONFIG_TCP_CONG_YEAH=m
CONFIG_TCP_CONG_ILLINOIS=m
CONFIG_DEFAULT_CUBIC=y
# CONFIG_DEFAULT_RENO is not set
CONFIG_DEFAULT_TCP_CONG="cubic"
CONFIG_TCP_MD5SIG=y
CONFIG_IPV6=y
CONFIG_IPV6_PRIVACY=y
CONFIG_IPV6_ROUTER_PREF=y
CONFIG_IPV6_ROUTE_INFO=y
CONFIG_IPV6_OPTIMISTIC_DAD=y
CONFIG_INET6_AH=m
CONFIG_INET6_ESP=m
CONFIG_INET6_IPCOMP=m
CONFIG_IPV6_MIP6=y
CONFIG_INET6_XFRM_TUNNEL=m
CONFIG_INET6_TUNNEL=m
CONFIG_INET6_XFRM_MODE_TRANSPORT=m
CONFIG_INET6_XFRM_MODE_TUNNEL=m
CONFIG_INET6_XFRM_MODE_BEET=m
CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION=m
CONFIG_IPV6_SIT=m
CONFIG_IPV6_SIT_6RD=y
CONFIG_IPV6_NDISC_NODETYPE=y
CONFIG_IPV6_TUNNEL=m
# CONFIG_IPV6_GRE is not set
CONFIG_IPV6_MULTIPLE_TABLES=y
CONFIG_IPV6_SUBTREES=y
CONFIG_IPV6_MROUTE=y
CONFIG_IPV6_MROUTE_MULTIPLE_TABLES=y
CONFIG_IPV6_PIMSM_V2=y
CONFIG_NETLABEL=y
CONFIG_NETWORK_SECMARK=y
CONFIG_NETWORK_PHY_TIMESTAMPING=y
CONFIG_NETFILTER=y
# CONFIG_NETFILTER_DEBUG is not set
CONFIG_NETFILTER_ADVANCED=y
CONFIG_BRIDGE_NETFILTER=y

#
# Core Netfilter Configuration
#
CONFIG_NETFILTER_NETLINK=m
CONFIG_NETFILTER_NETLINK_ACCT=m
CONFIG_NETFILTER_NETLINK_QUEUE=m
CONFIG_NETFILTER_NETLINK_LOG=m
CONFIG_NF_CONNTRACK=m
CONFIG_NF_CONNTRACK_MARK=y
CONFIG_NF_CONNTRACK_SECMARK=y
CONFIG_NF_CONNTRACK_ZONES=y
CONFIG_NF_CONNTRACK_PROCFS=y
CONFIG_NF_CONNTRACK_EVENTS=y
# CONFIG_NF_CONNTRACK_TIMEOUT is not set
CONFIG_NF_CONNTRACK_TIMESTAMP=y
CONFIG_NF_CT_PROTO_DCCP=m
CONFIG_NF_CT_PROTO_GRE=m
CONFIG_NF_CT_PROTO_SCTP=m
CONFIG_NF_CT_PROTO_UDPLITE=m
CONFIG_NF_CONNTRACK_AMANDA=m
CONFIG_NF_CONNTRACK_FTP=m
CONFIG_NF_CONNTRACK_H323=m
CONFIG_NF_CONNTRACK_IRC=m
CONFIG_NF_CONNTRACK_BROADCAST=m
CONFIG_NF_CONNTRACK_NETBIOS_NS=m
CONFIG_NF_CONNTRACK_SNMP=m
CONFIG_NF_CONNTRACK_PPTP=m
CONFIG_NF_CONNTRACK_SANE=m
CONFIG_NF_CONNTRACK_SIP=m
CONFIG_NF_CONNTRACK_TFTP=m
CONFIG_NF_CT_NETLINK=m
# CONFIG_NF_CT_NETLINK_TIMEOUT is not set
CONFIG_NF_CT_NETLINK_HELPER=m
CONFIG_NETFILTER_NETLINK_QUEUE_CT=y
CONFIG_NETFILTER_TPROXY=m
CONFIG_NETFILTER_XTABLES=y

#
# Xtables combined modules
#
CONFIG_NETFILTER_XT_MARK=m
CONFIG_NETFILTER_XT_CONNMARK=m
CONFIG_NETFILTER_XT_SET=m

#
# Xtables targets
#
CONFIG_NETFILTER_XT_TARGET_AUDIT=m
CONFIG_NETFILTER_XT_TARGET_CHECKSUM=m
CONFIG_NETFILTER_XT_TARGET_CLASSIFY=m
CONFIG_NETFILTER_XT_TARGET_CONNMARK=m
CONFIG_NETFILTER_XT_TARGET_CONNSECMARK=m
CONFIG_NETFILTER_XT_TARGET_CT=m
CONFIG_NETFILTER_XT_TARGET_DSCP=m
CONFIG_NETFILTER_XT_TARGET_HL=m
CONFIG_NETFILTER_XT_TARGET_HMARK=m
CONFIG_NETFILTER_XT_TARGET_IDLETIMER=m
CONFIG_NETFILTER_XT_TARGET_LED=m
CONFIG_NETFILTER_XT_TARGET_LOG=m
CONFIG_NETFILTER_XT_TARGET_MARK=m
CONFIG_NETFILTER_XT_TARGET_NFLOG=m
CONFIG_NETFILTER_XT_TARGET_NFQUEUE=m
CONFIG_NETFILTER_XT_TARGET_RATEEST=m
CONFIG_NETFILTER_XT_TARGET_TEE=m
CONFIG_NETFILTER_XT_TARGET_TPROXY=m
CONFIG_NETFILTER_XT_TARGET_TRACE=m
CONFIG_NETFILTER_XT_TARGET_SECMARK=m
CONFIG_NETFILTER_XT_TARGET_TCPMSS=m
CONFIG_NETFILTER_XT_TARGET_TCPOPTSTRIP=m

#
# Xtables matches
#
CONFIG_NETFILTER_XT_MATCH_ADDRTYPE=m
CONFIG_NETFILTER_XT_MATCH_CLUSTER=m
CONFIG_NETFILTER_XT_MATCH_COMMENT=m
CONFIG_NETFILTER_XT_MATCH_CONNBYTES=m
CONFIG_NETFILTER_XT_MATCH_CONNLIMIT=m
CONFIG_NETFILTER_XT_MATCH_CONNMARK=m
CONFIG_NETFILTER_XT_MATCH_CONNTRACK=m
CONFIG_NETFILTER_XT_MATCH_CPU=m
CONFIG_NETFILTER_XT_MATCH_DCCP=m
CONFIG_NETFILTER_XT_MATCH_DEVGROUP=m
CONFIG_NETFILTER_XT_MATCH_DSCP=m
CONFIG_NETFILTER_XT_MATCH_ECN=m
CONFIG_NETFILTER_XT_MATCH_ESP=m
CONFIG_NETFILTER_XT_MATCH_HASHLIMIT=m
CONFIG_NETFILTER_XT_MATCH_HELPER=m
CONFIG_NETFILTER_XT_MATCH_HL=m
CONFIG_NETFILTER_XT_MATCH_IPRANGE=m
CONFIG_NETFILTER_XT_MATCH_IPVS=m
CONFIG_NETFILTER_XT_MATCH_LENGTH=m
CONFIG_NETFILTER_XT_MATCH_LIMIT=m
CONFIG_NETFILTER_XT_MATCH_MAC=m
CONFIG_NETFILTER_XT_MATCH_MARK=m
CONFIG_NETFILTER_XT_MATCH_MULTIPORT=m
CONFIG_NETFILTER_XT_MATCH_NFACCT=m
CONFIG_NETFILTER_XT_MATCH_OSF=m
CONFIG_NETFILTER_XT_MATCH_OWNER=m
CONFIG_NETFILTER_XT_MATCH_POLICY=m
CONFIG_NETFILTER_XT_MATCH_PHYSDEV=m
CONFIG_NETFILTER_XT_MATCH_PKTTYPE=m
CONFIG_NETFILTER_XT_MATCH_QUOTA=m
CONFIG_NETFILTER_XT_MATCH_RATEEST=m
CONFIG_NETFILTER_XT_MATCH_REALM=m
CONFIG_NETFILTER_XT_MATCH_RECENT=m
CONFIG_NETFILTER_XT_MATCH_SCTP=m
CONFIG_NETFILTER_XT_MATCH_SOCKET=m
CONFIG_NETFILTER_XT_MATCH_STATE=m
CONFIG_NETFILTER_XT_MATCH_STATISTIC=m
CONFIG_NETFILTER_XT_MATCH_STRING=m
CONFIG_NETFILTER_XT_MATCH_TCPMSS=m
CONFIG_NETFILTER_XT_MATCH_TIME=m
CONFIG_NETFILTER_XT_MATCH_U32=m
CONFIG_IP_SET=m
CONFIG_IP_SET_MAX=256
CONFIG_IP_SET_BITMAP_IP=m
CONFIG_IP_SET_BITMAP_IPMAC=m
CONFIG_IP_SET_BITMAP_PORT=m
CONFIG_IP_SET_HASH_IP=m
CONFIG_IP_SET_HASH_IPPORT=m
CONFIG_IP_SET_HASH_IPPORTIP=m
CONFIG_IP_SET_HASH_IPPORTNET=m
CONFIG_IP_SET_HASH_NET=m
CONFIG_IP_SET_HASH_NETPORT=m
CONFIG_IP_SET_HASH_NETIFACE=m
CONFIG_IP_SET_LIST_SET=m
CONFIG_IP_VS=m
CONFIG_IP_VS_IPV6=y
# CONFIG_IP_VS_DEBUG is not set
CONFIG_IP_VS_TAB_BITS=12

#
# IPVS transport protocol load balancing support
#
CONFIG_IP_VS_PROTO_TCP=y
CONFIG_IP_VS_PROTO_UDP=y
CONFIG_IP_VS_PROTO_AH_ESP=y
CONFIG_IP_VS_PROTO_ESP=y
CONFIG_IP_VS_PROTO_AH=y
CONFIG_IP_VS_PROTO_SCTP=y

#
# IPVS scheduler
#
CONFIG_IP_VS_RR=m
CONFIG_IP_VS_WRR=m
CONFIG_IP_VS_LC=m
CONFIG_IP_VS_WLC=m
CONFIG_IP_VS_LBLC=m
CONFIG_IP_VS_LBLCR=m
CONFIG_IP_VS_DH=m
CONFIG_IP_VS_SH=m
CONFIG_IP_VS_SED=m
CONFIG_IP_VS_NQ=m

#
# IPVS SH scheduler
#
CONFIG_IP_VS_SH_TAB_BITS=8

#
# IPVS application helper
#
CONFIG_IP_VS_NFCT=y
CONFIG_IP_VS_PE_SIP=m

#
# IP: Netfilter Configuration
#
CONFIG_NF_DEFRAG_IPV4=m
CONFIG_NF_CONNTRACK_IPV4=m
# CONFIG_NF_CONNTRACK_PROC_COMPAT is not set
CONFIG_IP_NF_QUEUE=m
CONFIG_IP_NF_IPTABLES=y
CONFIG_IP_NF_MATCH_AH=m
CONFIG_IP_NF_MATCH_ECN=m
CONFIG_IP_NF_MATCH_RPFILTER=m
CONFIG_IP_NF_MATCH_TTL=m
CONFIG_IP_NF_FILTER=y
CONFIG_IP_NF_TARGET_REJECT=y
CONFIG_IP_NF_TARGET_ULOG=m
# CONFIG_NF_NAT_IPV4 is not set
CONFIG_IP_NF_MANGLE=m
CONFIG_IP_NF_TARGET_CLUSTERIP=m
CONFIG_IP_NF_TARGET_ECN=m
CONFIG_IP_NF_TARGET_TTL=m
CONFIG_IP_NF_RAW=m
CONFIG_IP_NF_SECURITY=m
CONFIG_IP_NF_ARPTABLES=m
CONFIG_IP_NF_ARPFILTER=m
CONFIG_IP_NF_ARP_MANGLE=m

#
# IPv6: Netfilter Configuration
#
CONFIG_NF_DEFRAG_IPV6=m
CONFIG_NF_CONNTRACK_IPV6=m
CONFIG_IP6_NF_IPTABLES=m
CONFIG_IP6_NF_MATCH_AH=m
CONFIG_IP6_NF_MATCH_EUI64=m
CONFIG_IP6_NF_MATCH_FRAG=m
CONFIG_IP6_NF_MATCH_OPTS=m
CONFIG_IP6_NF_MATCH_HL=m
CONFIG_IP6_NF_MATCH_IPV6HEADER=m
CONFIG_IP6_NF_MATCH_MH=m
CONFIG_IP6_NF_MATCH_RPFILTER=m
CONFIG_IP6_NF_MATCH_RT=m
CONFIG_IP6_NF_TARGET_HL=m
CONFIG_IP6_NF_FILTER=m
CONFIG_IP6_NF_TARGET_REJECT=m
CONFIG_IP6_NF_MANGLE=m
CONFIG_IP6_NF_RAW=m
CONFIG_IP6_NF_SECURITY=m
# CONFIG_NF_NAT_IPV6 is not set
CONFIG_BRIDGE_NF_EBTABLES=m
CONFIG_BRIDGE_EBT_BROUTE=m
CONFIG_BRIDGE_EBT_T_FILTER=m
CONFIG_BRIDGE_EBT_T_NAT=m
CONFIG_BRIDGE_EBT_802_3=m
CONFIG_BRIDGE_EBT_AMONG=m
CONFIG_BRIDGE_EBT_ARP=m
CONFIG_BRIDGE_EBT_IP=m
CONFIG_BRIDGE_EBT_IP6=m
CONFIG_BRIDGE_EBT_LIMIT=m
CONFIG_BRIDGE_EBT_MARK=m
CONFIG_BRIDGE_EBT_PKTTYPE=m
CONFIG_BRIDGE_EBT_STP=m
CONFIG_BRIDGE_EBT_VLAN=m
CONFIG_BRIDGE_EBT_ARPREPLY=m
CONFIG_BRIDGE_EBT_DNAT=m
CONFIG_BRIDGE_EBT_MARK_T=m
CONFIG_BRIDGE_EBT_REDIRECT=m
CONFIG_BRIDGE_EBT_SNAT=m
CONFIG_BRIDGE_EBT_LOG=m
CONFIG_BRIDGE_EBT_ULOG=m
CONFIG_BRIDGE_EBT_NFLOG=m
CONFIG_IP_DCCP=m
CONFIG_INET_DCCP_DIAG=m

#
# DCCP CCIDs Configuration (EXPERIMENTAL)
#
# CONFIG_IP_DCCP_CCID2_DEBUG is not set
CONFIG_IP_DCCP_CCID3=y
# CONFIG_IP_DCCP_CCID3_DEBUG is not set
CONFIG_IP_DCCP_TFRC_LIB=y

#
# DCCP Kernel Hacking
#
# CONFIG_IP_DCCP_DEBUG is not set
# CONFIG_NET_DCCPPROBE is not set
CONFIG_IP_SCTP=m
CONFIG_NET_SCTPPROBE=m
# CONFIG_SCTP_DBG_MSG is not set
# CONFIG_SCTP_DBG_OBJCNT is not set
# CONFIG_SCTP_HMAC_NONE is not set
CONFIG_SCTP_HMAC_SHA1=y
# CONFIG_SCTP_HMAC_MD5 is not set
CONFIG_RDS=m
CONFIG_RDS_RDMA=m
CONFIG_RDS_TCP=m
# CONFIG_RDS_DEBUG is not set
# CONFIG_TIPC is not set
CONFIG_ATM=m
CONFIG_ATM_CLIP=m
# CONFIG_ATM_CLIP_NO_ICMP is not set
CONFIG_ATM_LANE=m
# CONFIG_ATM_MPOA is not set
CONFIG_ATM_BR2684=m
# CONFIG_ATM_BR2684_IPFILTER is not set
CONFIG_L2TP=m
CONFIG_L2TP_DEBUGFS=m
CONFIG_L2TP_V3=y
CONFIG_L2TP_IP=m
CONFIG_L2TP_ETH=m
CONFIG_STP=m
CONFIG_GARP=m
CONFIG_BRIDGE=m
CONFIG_BRIDGE_IGMP_SNOOPING=y
CONFIG_NET_DSA=m
CONFIG_NET_DSA_TAG_DSA=y
CONFIG_NET_DSA_TAG_EDSA=y
CONFIG_NET_DSA_TAG_TRAILER=y
CONFIG_VLAN_8021Q=m
CONFIG_VLAN_8021Q_GVRP=y
# CONFIG_DECNET is not set
CONFIG_LLC=m
# CONFIG_LLC2 is not set
CONFIG_IPX=m
# CONFIG_IPX_INTERN is not set
CONFIG_ATALK=m
CONFIG_DEV_APPLETALK=m
CONFIG_IPDDP=m
CONFIG_IPDDP_ENCAP=y
CONFIG_IPDDP_DECAP=y
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
CONFIG_WAN_ROUTER=m
# CONFIG_PHONET is not set
CONFIG_IEEE802154=m
CONFIG_IEEE802154_6LOWPAN=m
CONFIG_MAC802154=m
CONFIG_NET_SCHED=y

#
# Queueing/Scheduling
#
CONFIG_NET_SCH_CBQ=m
CONFIG_NET_SCH_HTB=m
CONFIG_NET_SCH_HFSC=m
CONFIG_NET_SCH_ATM=m
CONFIG_NET_SCH_PRIO=m
CONFIG_NET_SCH_MULTIQ=m
CONFIG_NET_SCH_RED=m
CONFIG_NET_SCH_SFB=m
CONFIG_NET_SCH_SFQ=m
CONFIG_NET_SCH_TEQL=m
CONFIG_NET_SCH_TBF=m
CONFIG_NET_SCH_GRED=m
CONFIG_NET_SCH_DSMARK=m
CONFIG_NET_SCH_NETEM=m
CONFIG_NET_SCH_DRR=m
CONFIG_NET_SCH_MQPRIO=m
CONFIG_NET_SCH_CHOKE=m
CONFIG_NET_SCH_QFQ=m
CONFIG_NET_SCH_CODEL=m
CONFIG_NET_SCH_FQ_CODEL=m
CONFIG_NET_SCH_INGRESS=m
CONFIG_NET_SCH_PLUG=m

#
# Classification
#
CONFIG_NET_CLS=y
CONFIG_NET_CLS_BASIC=m
CONFIG_NET_CLS_TCINDEX=m
CONFIG_NET_CLS_ROUTE4=m
CONFIG_NET_CLS_FW=m
CONFIG_NET_CLS_U32=m
CONFIG_CLS_U32_PERF=y
CONFIG_CLS_U32_MARK=y
CONFIG_NET_CLS_RSVP=m
CONFIG_NET_CLS_RSVP6=m
CONFIG_NET_CLS_FLOW=m
CONFIG_NET_CLS_CGROUP=y
CONFIG_NET_EMATCH=y
CONFIG_NET_EMATCH_STACK=32
CONFIG_NET_EMATCH_CMP=m
CONFIG_NET_EMATCH_NBYTE=m
CONFIG_NET_EMATCH_U32=m
CONFIG_NET_EMATCH_META=m
CONFIG_NET_EMATCH_TEXT=m
CONFIG_NET_EMATCH_IPSET=m
CONFIG_NET_CLS_ACT=y
CONFIG_NET_ACT_POLICE=m
CONFIG_NET_ACT_GACT=m
CONFIG_GACT_PROB=y
CONFIG_NET_ACT_MIRRED=m
CONFIG_NET_ACT_IPT=m
CONFIG_NET_ACT_NAT=m
CONFIG_NET_ACT_PEDIT=m
CONFIG_NET_ACT_SIMP=m
CONFIG_NET_ACT_SKBEDIT=m
CONFIG_NET_ACT_CSUM=m
CONFIG_NET_CLS_IND=y
CONFIG_NET_SCH_FIFO=y
CONFIG_DCB=y
CONFIG_DNS_RESOLVER=m
CONFIG_BATMAN_ADV=m
CONFIG_BATMAN_ADV_BLA=y
# CONFIG_BATMAN_ADV_DEBUG is not set
CONFIG_OPENVSWITCH=m
CONFIG_RPS=y
CONFIG_RFS_ACCEL=y
CONFIG_XPS=y
CONFIG_NETPRIO_CGROUP=m
CONFIG_BQL=y
CONFIG_BPF_JIT=y

#
# Network testing
#
CONFIG_NET_PKTGEN=m
# CONFIG_NET_TCPPROBE is not set
CONFIG_NET_DROP_MONITOR=y
CONFIG_HAMRADIO=y

#
# Packet Radio protocols
#
CONFIG_AX25=m
CONFIG_AX25_DAMA_SLAVE=y
CONFIG_NETROM=m
CONFIG_ROSE=m

#
# AX.25 network device drivers
#
CONFIG_MKISS=m
CONFIG_6PACK=m
CONFIG_BPQETHER=m
CONFIG_BAYCOM_SER_FDX=m
CONFIG_BAYCOM_SER_HDX=m
CONFIG_BAYCOM_PAR=m
CONFIG_YAM=m
# CONFIG_CAN is not set
CONFIG_IRDA=m

#
# IrDA protocols
#
CONFIG_IRLAN=m
CONFIG_IRNET=m
CONFIG_IRCOMM=m
# CONFIG_IRDA_ULTRA is not set

#
# IrDA options
#
CONFIG_IRDA_CACHE_LAST_LSAP=y
CONFIG_IRDA_FAST_RR=y
# CONFIG_IRDA_DEBUG is not set

#
# Infrared-port device drivers
#

#
# SIR device drivers
#
CONFIG_IRTTY_SIR=m

#
# Dongle support
#
CONFIG_DONGLE=y
CONFIG_ESI_DONGLE=m
CONFIG_ACTISYS_DONGLE=m
CONFIG_TEKRAM_DONGLE=m
CONFIG_TOIM3232_DONGLE=m
CONFIG_LITELINK_DONGLE=m
CONFIG_MA600_DONGLE=m
CONFIG_GIRBIL_DONGLE=m
CONFIG_MCP2120_DONGLE=m
CONFIG_OLD_BELKIN_DONGLE=m
CONFIG_ACT200L_DONGLE=m
CONFIG_KINGSUN_DONGLE=m
CONFIG_KSDAZZLE_DONGLE=m
CONFIG_KS959_DONGLE=m

#
# FIR device drivers
#
CONFIG_USB_IRDA=m
CONFIG_SIGMATEL_FIR=m
CONFIG_NSC_FIR=m
CONFIG_WINBOND_FIR=m
CONFIG_SMC_IRCC_FIR=m
CONFIG_ALI_FIR=m
CONFIG_VLSI_FIR=m
CONFIG_VIA_FIR=m
CONFIG_MCS_FIR=m
CONFIG_BT=m
CONFIG_BT_RFCOMM=m
CONFIG_BT_RFCOMM_TTY=y
CONFIG_BT_BNEP=m
CONFIG_BT_BNEP_MC_FILTER=y
CONFIG_BT_BNEP_PROTO_FILTER=y
CONFIG_BT_CMTP=m
CONFIG_BT_HIDP=m

#
# Bluetooth device drivers
#
CONFIG_BT_HCIBTUSB=m
CONFIG_BT_HCIBTSDIO=m
CONFIG_BT_HCIUART=m
CONFIG_BT_HCIUART_H4=y
CONFIG_BT_HCIUART_BCSP=y
CONFIG_BT_HCIUART_ATH3K=y
CONFIG_BT_HCIUART_LL=y
CONFIG_BT_HCIUART_3WIRE=y
CONFIG_BT_HCIBCM203X=m
CONFIG_BT_HCIBPA10X=m
CONFIG_BT_HCIBFUSB=m
CONFIG_BT_HCIDTL1=m
CONFIG_BT_HCIBT3C=m
CONFIG_BT_HCIBLUECARD=m
CONFIG_BT_HCIBTUART=m
CONFIG_BT_HCIVHCI=m
CONFIG_BT_MRVL=m
CONFIG_BT_MRVL_SDIO=m
CONFIG_BT_ATH3K=m
# CONFIG_AF_RXRPC is not set
CONFIG_FIB_RULES=y
CONFIG_WIRELESS=y
CONFIG_WIRELESS_EXT=y
CONFIG_WEXT_CORE=y
CONFIG_WEXT_PROC=y
CONFIG_WEXT_SPY=y
CONFIG_WEXT_PRIV=y
CONFIG_CFG80211=m
# CONFIG_NL80211_TESTMODE is not set
# CONFIG_CFG80211_DEVELOPER_WARNINGS is not set
# CONFIG_CFG80211_REG_DEBUG is not set
CONFIG_CFG80211_DEFAULT_PS=y
CONFIG_CFG80211_DEBUGFS=y
# CONFIG_CFG80211_INTERNAL_REGDB is not set
CONFIG_CFG80211_WEXT=y
CONFIG_LIB80211=m
CONFIG_LIB80211_CRYPT_WEP=m
CONFIG_LIB80211_CRYPT_CCMP=m
CONFIG_LIB80211_CRYPT_TKIP=m
# CONFIG_LIB80211_DEBUG is not set
CONFIG_MAC80211=m
CONFIG_MAC80211_HAS_RC=y
CONFIG_MAC80211_RC_MINSTREL=y
CONFIG_MAC80211_RC_MINSTREL_HT=y
CONFIG_MAC80211_RC_DEFAULT_MINSTREL=y
CONFIG_MAC80211_RC_DEFAULT="minstrel_ht"
CONFIG_MAC80211_MESH=y
CONFIG_MAC80211_LEDS=y
CONFIG_MAC80211_DEBUGFS=y
# CONFIG_MAC80211_MESSAGE_TRACING is not set
# CONFIG_MAC80211_DEBUG_MENU is not set
CONFIG_WIMAX=m
CONFIG_WIMAX_DEBUG_LEVEL=8
CONFIG_RFKILL=m
CONFIG_RFKILL_LEDS=y
CONFIG_RFKILL_INPUT=y
CONFIG_NET_9P=m
CONFIG_NET_9P_VIRTIO=m
CONFIG_NET_9P_RDMA=m
# CONFIG_NET_9P_DEBUG is not set
# CONFIG_CAIF is not set
CONFIG_CEPH_LIB=m
# CONFIG_CEPH_LIB_PRETTYDEBUG is not set
# CONFIG_CEPH_LIB_USE_DNS_RESOLVER is not set
CONFIG_NFC=m
CONFIG_NFC_NCI=m
CONFIG_NFC_HCI=m
CONFIG_NFC_SHDLC=y
CONFIG_NFC_LLCP=y

#
# Near Field Communication (NFC) devices
#
CONFIG_PN544_HCI_NFC=m
CONFIG_NFC_PN533=m
CONFIG_HAVE_BPF_JIT=y

#
# Device Drivers
#

#
# Generic Driver Options
#
CONFIG_UEVENT_HELPER_PATH=""
CONFIG_DEVTMPFS=y
CONFIG_DEVTMPFS_MOUNT=y
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
# CONFIG_FIRMWARE_IN_KERNEL is not set
CONFIG_EXTRA_FIRMWARE=""
# CONFIG_DEBUG_DRIVER is not set
CONFIG_DEBUG_DEVRES=y
CONFIG_SYS_HYPERVISOR=y
# CONFIG_GENERIC_CPU_DEVICES is not set
CONFIG_REGMAP=y
CONFIG_REGMAP_I2C=m
CONFIG_DMA_SHARED_BUFFER=y

#
# Bus devices
#
# CONFIG_OMAP_OCP2SCP is not set
CONFIG_CONNECTOR=y
CONFIG_PROC_EVENTS=y
CONFIG_MTD=m
# CONFIG_MTD_TESTS is not set
# CONFIG_MTD_REDBOOT_PARTS is not set
# CONFIG_MTD_AR7_PARTS is not set

#
# User Modules And Translation Layers
#
# CONFIG_MTD_CHAR is not set
# CONFIG_MTD_BLKDEVS is not set
# CONFIG_MTD_BLOCK is not set
# CONFIG_MTD_BLOCK_RO is not set
# CONFIG_FTL is not set
# CONFIG_NFTL is not set
# CONFIG_INFTL is not set
# CONFIG_RFD_FTL is not set
# CONFIG_SSFDC is not set
# CONFIG_SM_FTL is not set
# CONFIG_MTD_OOPS is not set
# CONFIG_MTD_SWAP is not set

#
# RAM/ROM/Flash chip drivers
#
# CONFIG_MTD_CFI is not set
# CONFIG_MTD_JEDECPROBE is not set
CONFIG_MTD_MAP_BANK_WIDTH_1=y
CONFIG_MTD_MAP_BANK_WIDTH_2=y
CONFIG_MTD_MAP_BANK_WIDTH_4=y
# CONFIG_MTD_MAP_BANK_WIDTH_8 is not set
# CONFIG_MTD_MAP_BANK_WIDTH_16 is not set
# CONFIG_MTD_MAP_BANK_WIDTH_32 is not set
CONFIG_MTD_CFI_I1=y
CONFIG_MTD_CFI_I2=y
# CONFIG_MTD_CFI_I4 is not set
# CONFIG_MTD_CFI_I8 is not set
# CONFIG_MTD_RAM is not set
# CONFIG_MTD_ROM is not set
# CONFIG_MTD_ABSENT is not set

#
# Mapping drivers for chip access
#
# CONFIG_MTD_COMPLEX_MAPPINGS is not set
# CONFIG_MTD_TS5500 is not set
# CONFIG_MTD_INTEL_VR_NOR is not set
# CONFIG_MTD_PLATRAM is not set

#
# Self-contained MTD device drivers
#
# CONFIG_MTD_PMC551 is not set
# CONFIG_MTD_SLRAM is not set
# CONFIG_MTD_PHRAM is not set
# CONFIG_MTD_MTDRAM is not set
# CONFIG_MTD_BLOCK2MTD is not set

#
# Disk-On-Chip Device Drivers
#
# CONFIG_MTD_DOCG3 is not set
# CONFIG_MTD_NAND is not set
# CONFIG_MTD_ONENAND is not set

#
# LPDDR flash memory drivers
#
# CONFIG_MTD_LPDDR is not set
CONFIG_MTD_UBI=m
CONFIG_MTD_UBI_WL_THRESHOLD=4096
CONFIG_MTD_UBI_BEB_LIMIT=20
# CONFIG_MTD_UBI_FASTMAP is not set
# CONFIG_MTD_UBI_GLUEBI is not set
CONFIG_PARPORT=m
CONFIG_PARPORT_PC=m
CONFIG_PARPORT_SERIAL=m
# CONFIG_PARPORT_PC_FIFO is not set
# CONFIG_PARPORT_PC_SUPERIO is not set
CONFIG_PARPORT_PC_PCMCIA=m
# CONFIG_PARPORT_GSC is not set
# CONFIG_PARPORT_AX88796 is not set
CONFIG_PARPORT_1284=y
CONFIG_PARPORT_NOT_PC=y
CONFIG_PNP=y
# CONFIG_PNP_DEBUG_MESSAGES is not set

#
# Protocols
#
CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
CONFIG_BLK_DEV_FD=m
# CONFIG_PARIDE is not set
CONFIG_BLK_DEV_PCIESSD_MTIP32XX=m
CONFIG_BLK_CPQ_DA=m
CONFIG_BLK_CPQ_CISS_DA=m
CONFIG_CISS_SCSI_TAPE=y
CONFIG_BLK_DEV_DAC960=m
CONFIG_BLK_DEV_UMEM=m
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=y
CONFIG_BLK_DEV_LOOP_MIN_COUNT=8
CONFIG_BLK_DEV_CRYPTOLOOP=m
CONFIG_BLK_DEV_DRBD=m
# CONFIG_DRBD_FAULT_INJECTION is not set
CONFIG_BLK_DEV_NBD=m
CONFIG_BLK_DEV_NVME=m
CONFIG_BLK_DEV_OSD=m
CONFIG_BLK_DEV_SX8=m
CONFIG_BLK_DEV_RAM=m
CONFIG_BLK_DEV_RAM_COUNT=16
CONFIG_BLK_DEV_RAM_SIZE=16384
# CONFIG_BLK_DEV_XIP is not set
CONFIG_CDROM_PKTCDVD=m
CONFIG_CDROM_PKTCDVD_BUFFERS=8
# CONFIG_CDROM_PKTCDVD_WCACHE is not set
CONFIG_ATA_OVER_ETH=m
CONFIG_XEN_BLKDEV_FRONTEND=m
CONFIG_XEN_BLKDEV_BACKEND=m
CONFIG_VIRTIO_BLK=m
# CONFIG_BLK_DEV_HD is not set
CONFIG_BLK_DEV_RBD=m

#
# Misc devices
#
CONFIG_SENSORS_LIS3LV02D=m
# CONFIG_AD525X_DPOT is not set
# CONFIG_IBM_ASM is not set
# CONFIG_PHANTOM is not set
# CONFIG_INTEL_MID_PTI is not set
CONFIG_SGI_IOC4=m
CONFIG_TIFM_CORE=m
CONFIG_TIFM_7XX1=m
# CONFIG_ICS932S401 is not set
CONFIG_ENCLOSURE_SERVICES=m
CONFIG_HP_ILO=m
CONFIG_APDS9802ALS=m
CONFIG_ISL29003=m
CONFIG_ISL29020=m
CONFIG_SENSORS_TSL2550=m
# CONFIG_SENSORS_BH1780 is not set
CONFIG_SENSORS_BH1770=m
CONFIG_SENSORS_APDS990X=m
# CONFIG_HMC6352 is not set
# CONFIG_DS1682 is not set
CONFIG_VMWARE_BALLOON=m
# CONFIG_BMP085_I2C is not set
CONFIG_PCH_PHUB=m
# CONFIG_USB_SWITCH_FSA9480 is not set
# CONFIG_C2PORT is not set

#
# EEPROM support
#
CONFIG_EEPROM_AT24=m
CONFIG_EEPROM_LEGACY=m
CONFIG_EEPROM_MAX6875=m
CONFIG_EEPROM_93CX6=m
CONFIG_CB710_CORE=m
# CONFIG_CB710_DEBUG is not set
CONFIG_CB710_DEBUG_ASSUMPTIONS=y

#
# Texas Instruments shared transport line discipline
#
CONFIG_SENSORS_LIS3_I2C=m

#
# Altera FPGA firmware download module
#
CONFIG_ALTERA_STAPL=m
CONFIG_INTEL_MEI=m
CONFIG_HAVE_IDE=y
# CONFIG_IDE is not set

#
# SCSI device support
#
CONFIG_SCSI_MOD=y
CONFIG_RAID_ATTRS=m
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
CONFIG_SCSI_TGT=m
CONFIG_SCSI_NETLINK=y
CONFIG_SCSI_PROC_FS=y

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
CONFIG_CHR_DEV_ST=m
CONFIG_CHR_DEV_OSST=m
CONFIG_BLK_DEV_SR=y
CONFIG_BLK_DEV_SR_VENDOR=y
CONFIG_CHR_DEV_SG=y
CONFIG_CHR_DEV_SCH=m
CONFIG_SCSI_ENCLOSURE=m
CONFIG_SCSI_MULTI_LUN=y
CONFIG_SCSI_CONSTANTS=y
CONFIG_SCSI_LOGGING=y
CONFIG_SCSI_SCAN_ASYNC=y

#
# SCSI Transports
#
CONFIG_SCSI_SPI_ATTRS=m
CONFIG_SCSI_FC_ATTRS=m
CONFIG_SCSI_FC_TGT_ATTRS=y
CONFIG_SCSI_ISCSI_ATTRS=m
CONFIG_SCSI_SAS_ATTRS=m
CONFIG_SCSI_SAS_LIBSAS=m
CONFIG_SCSI_SAS_ATA=y
CONFIG_SCSI_SAS_HOST_SMP=y
CONFIG_SCSI_SRP_ATTRS=m
CONFIG_SCSI_SRP_TGT_ATTRS=y
CONFIG_SCSI_LOWLEVEL=y
CONFIG_ISCSI_TCP=m
CONFIG_ISCSI_BOOT_SYSFS=m
CONFIG_SCSI_CXGB3_ISCSI=m
CONFIG_SCSI_CXGB4_ISCSI=m
CONFIG_SCSI_BNX2_ISCSI=m
CONFIG_SCSI_BNX2X_FCOE=m
CONFIG_BE2ISCSI=m
CONFIG_BLK_DEV_3W_XXXX_RAID=m
CONFIG_SCSI_HPSA=m
CONFIG_SCSI_3W_9XXX=m
CONFIG_SCSI_3W_SAS=m
CONFIG_SCSI_ACARD=m
CONFIG_SCSI_AACRAID=m
CONFIG_SCSI_AIC7XXX=m
CONFIG_AIC7XXX_CMDS_PER_DEVICE=4
CONFIG_AIC7XXX_RESET_DELAY_MS=15000
# CONFIG_AIC7XXX_DEBUG_ENABLE is not set
CONFIG_AIC7XXX_DEBUG_MASK=0
# CONFIG_AIC7XXX_REG_PRETTY_PRINT is not set
CONFIG_SCSI_AIC7XXX_OLD=m
CONFIG_SCSI_AIC79XX=m
CONFIG_AIC79XX_CMDS_PER_DEVICE=4
CONFIG_AIC79XX_RESET_DELAY_MS=15000
# CONFIG_AIC79XX_DEBUG_ENABLE is not set
CONFIG_AIC79XX_DEBUG_MASK=0
# CONFIG_AIC79XX_REG_PRETTY_PRINT is not set
CONFIG_SCSI_AIC94XX=m
# CONFIG_AIC94XX_DEBUG is not set
CONFIG_SCSI_MVSAS=m
# CONFIG_SCSI_MVSAS_DEBUG is not set
CONFIG_SCSI_MVSAS_TASKLET=y
CONFIG_SCSI_MVUMI=m
# CONFIG_SCSI_DPT_I2O is not set
CONFIG_SCSI_ADVANSYS=m
CONFIG_SCSI_ARCMSR=m
CONFIG_MEGARAID_NEWGEN=y
CONFIG_MEGARAID_MM=m
CONFIG_MEGARAID_MAILBOX=m
CONFIG_MEGARAID_LEGACY=m
CONFIG_MEGARAID_SAS=m
CONFIG_SCSI_MPT2SAS=m
CONFIG_SCSI_MPT2SAS_MAX_SGE=128
CONFIG_SCSI_MPT2SAS_LOGGING=y
CONFIG_SCSI_UFSHCD=m
CONFIG_SCSI_HPTIOP=m
CONFIG_SCSI_BUSLOGIC=m
CONFIG_VMWARE_PVSCSI=m
CONFIG_HYPERV_STORAGE=m
CONFIG_LIBFC=m
CONFIG_LIBFCOE=m
CONFIG_FCOE=m
CONFIG_FCOE_FNIC=m
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_EATA is not set
# CONFIG_SCSI_FUTURE_DOMAIN is not set
CONFIG_SCSI_GDTH=m
CONFIG_SCSI_ISCI=m
CONFIG_SCSI_IPS=m
CONFIG_SCSI_INITIO=m
CONFIG_SCSI_INIA100=m
# CONFIG_SCSI_PPA is not set
# CONFIG_SCSI_IMM is not set
CONFIG_SCSI_STEX=m
CONFIG_SCSI_SYM53C8XX_2=m
CONFIG_SCSI_SYM53C8XX_DMA_ADDRESSING_MODE=1
CONFIG_SCSI_SYM53C8XX_DEFAULT_TAGS=16
CONFIG_SCSI_SYM53C8XX_MAX_TAGS=64
CONFIG_SCSI_SYM53C8XX_MMIO=y
# CONFIG_SCSI_IPR is not set
CONFIG_SCSI_QLOGIC_1280=m
CONFIG_SCSI_QLA_FC=m
CONFIG_TCM_QLA2XXX=m
CONFIG_SCSI_QLA_ISCSI=m
CONFIG_SCSI_LPFC=m
# CONFIG_SCSI_LPFC_DEBUG_FS is not set
CONFIG_SCSI_DC395x=m
CONFIG_SCSI_DC390T=m
CONFIG_SCSI_DEBUG=m
CONFIG_SCSI_PMCRAID=m
CONFIG_SCSI_PM8001=m
CONFIG_SCSI_SRP=m
CONFIG_SCSI_BFA_FC=m
CONFIG_SCSI_VIRTIO=m
# CONFIG_SCSI_LOWLEVEL_PCMCIA is not set
CONFIG_SCSI_DH=y
CONFIG_SCSI_DH_RDAC=m
CONFIG_SCSI_DH_HP_SW=m
CONFIG_SCSI_DH_EMC=m
CONFIG_SCSI_DH_ALUA=m
CONFIG_SCSI_OSD_INITIATOR=m
CONFIG_SCSI_OSD_ULD=m
CONFIG_SCSI_OSD_DPRINT_SENSE=1
# CONFIG_SCSI_OSD_DEBUG is not set
CONFIG_ATA=y
# CONFIG_ATA_NONSTANDARD is not set
CONFIG_ATA_VERBOSE_ERROR=y
CONFIG_ATA_ACPI=y
CONFIG_SATA_PMP=y

#
# Controllers with non-SFF native interface
#
CONFIG_SATA_AHCI=y
CONFIG_SATA_AHCI_PLATFORM=m
CONFIG_SATA_INIC162X=m
CONFIG_SATA_ACARD_AHCI=m
CONFIG_SATA_SIL24=m
CONFIG_ATA_SFF=y

#
# SFF controllers with custom DMA interface
#
CONFIG_PDC_ADMA=m
CONFIG_SATA_QSTOR=m
CONFIG_SATA_SX4=m
CONFIG_ATA_BMDMA=y

#
# SATA SFF controllers with BMDMA
#
CONFIG_ATA_PIIX=y
# CONFIG_SATA_HIGHBANK is not set
CONFIG_SATA_MV=m
CONFIG_SATA_NV=m
CONFIG_SATA_PROMISE=m
CONFIG_SATA_SIL=m
CONFIG_SATA_SIS=m
CONFIG_SATA_SVW=m
CONFIG_SATA_ULI=m
CONFIG_SATA_VIA=m
CONFIG_SATA_VITESSE=m

#
# PATA SFF controllers with BMDMA
#
CONFIG_PATA_ALI=m
CONFIG_PATA_AMD=m
CONFIG_PATA_ARASAN_CF=m
CONFIG_PATA_ARTOP=m
CONFIG_PATA_ATIIXP=m
CONFIG_PATA_ATP867X=m
CONFIG_PATA_CMD64X=m
CONFIG_PATA_CS5520=m
CONFIG_PATA_CS5530=m
CONFIG_PATA_CS5536=m
CONFIG_PATA_CYPRESS=m
CONFIG_PATA_EFAR=m
CONFIG_PATA_HPT366=m
CONFIG_PATA_HPT37X=m
CONFIG_PATA_HPT3X2N=m
CONFIG_PATA_HPT3X3=m
# CONFIG_PATA_HPT3X3_DMA is not set
CONFIG_PATA_IT8213=m
CONFIG_PATA_IT821X=m
CONFIG_PATA_JMICRON=m
CONFIG_PATA_MARVELL=m
CONFIG_PATA_NETCELL=m
CONFIG_PATA_NINJA32=m
CONFIG_PATA_NS87415=m
CONFIG_PATA_OLDPIIX=m
CONFIG_PATA_OPTIDMA=m
CONFIG_PATA_PDC2027X=m
CONFIG_PATA_PDC_OLD=m
# CONFIG_PATA_RADISYS is not set
CONFIG_PATA_RDC=m
# CONFIG_PATA_SC1200 is not set
CONFIG_PATA_SCH=m
CONFIG_PATA_SERVERWORKS=m
CONFIG_PATA_SIL680=m
CONFIG_PATA_SIS=m
CONFIG_PATA_TOSHIBA=m
CONFIG_PATA_TRIFLEX=m
CONFIG_PATA_VIA=m
CONFIG_PATA_WINBOND=m

#
# PIO-only SFF controllers
#
CONFIG_PATA_CMD640_PCI=m
CONFIG_PATA_MPIIX=m
CONFIG_PATA_NS87410=m
CONFIG_PATA_OPTI=m
CONFIG_PATA_PCMCIA=m
# CONFIG_PATA_RZ1000 is not set

#
# Generic fallback / legacy drivers
#
CONFIG_PATA_ACPI=m
CONFIG_ATA_GENERIC=m
# CONFIG_PATA_LEGACY is not set
CONFIG_MD=y
CONFIG_BLK_DEV_MD=y
CONFIG_MD_AUTODETECT=y
CONFIG_MD_LINEAR=m
CONFIG_MD_RAID0=m
CONFIG_MD_RAID1=m
CONFIG_MD_RAID10=m
CONFIG_MD_RAID456=m
# CONFIG_MULTICORE_RAID456 is not set
CONFIG_MD_MULTIPATH=m
CONFIG_MD_FAULTY=m
CONFIG_BLK_DEV_DM=y
CONFIG_DM_DEBUG=y
CONFIG_DM_BUFIO=m
CONFIG_DM_BIO_PRISON=m
CONFIG_DM_PERSISTENT_DATA=m
CONFIG_DM_CRYPT=m
CONFIG_DM_SNAPSHOT=y
CONFIG_DM_THIN_PROVISIONING=m
# CONFIG_DM_DEBUG_BLOCK_STACK_TRACING is not set
CONFIG_DM_MIRROR=y
CONFIG_DM_RAID=m
CONFIG_DM_LOG_USERSPACE=m
CONFIG_DM_ZERO=y
CONFIG_DM_MULTIPATH=m
CONFIG_DM_MULTIPATH_QL=m
CONFIG_DM_MULTIPATH_ST=m
# CONFIG_DM_DELAY is not set
CONFIG_DM_UEVENT=y
CONFIG_DM_FLAKEY=m
CONFIG_DM_VERITY=m
CONFIG_TARGET_CORE=m
CONFIG_TCM_IBLOCK=m
CONFIG_TCM_FILEIO=m
CONFIG_TCM_PSCSI=m
CONFIG_LOOPBACK_TARGET=m
CONFIG_TCM_FC=m
CONFIG_ISCSI_TARGET=m
CONFIG_SBP_TARGET=m
CONFIG_FUSION=y
CONFIG_FUSION_SPI=m
CONFIG_FUSION_FC=m
CONFIG_FUSION_SAS=m
CONFIG_FUSION_MAX_SGE=40
CONFIG_FUSION_CTL=m
CONFIG_FUSION_LAN=m
CONFIG_FUSION_LOGGING=y

#
# IEEE 1394 (FireWire) support
#
CONFIG_FIREWIRE=m
CONFIG_FIREWIRE_OHCI=m
CONFIG_FIREWIRE_SBP2=m
CONFIG_FIREWIRE_NET=m
CONFIG_FIREWIRE_NOSY=m
# CONFIG_I2O is not set
CONFIG_MACINTOSH_DRIVERS=y
CONFIG_MAC_EMUMOUSEBTN=y
CONFIG_NETDEVICES=y
CONFIG_NET_CORE=y
CONFIG_BONDING=m
CONFIG_DUMMY=m
CONFIG_EQUALIZER=m
CONFIG_NET_FC=y
CONFIG_MII=m
CONFIG_IFB=m
CONFIG_NET_TEAM=m
CONFIG_NET_TEAM_MODE_BROADCAST=m
CONFIG_NET_TEAM_MODE_ROUNDROBIN=m
CONFIG_NET_TEAM_MODE_ACTIVEBACKUP=m
CONFIG_NET_TEAM_MODE_LOADBALANCE=m
CONFIG_MACVLAN=m
CONFIG_MACVTAP=m
# CONFIG_VXLAN is not set
CONFIG_NETCONSOLE=m
CONFIG_NETCONSOLE_DYNAMIC=y
CONFIG_NETPOLL=y
CONFIG_NETPOLL_TRAP=y
CONFIG_NET_POLL_CONTROLLER=y
CONFIG_TUN=m
CONFIG_VETH=m
CONFIG_VIRTIO_NET=m
CONFIG_SUNGEM_PHY=m
# CONFIG_ARCNET is not set
CONFIG_ATM_DRIVERS=y
# CONFIG_ATM_DUMMY is not set
CONFIG_ATM_TCP=m
# CONFIG_ATM_LANAI is not set
CONFIG_ATM_ENI=m
# CONFIG_ATM_ENI_DEBUG is not set
# CONFIG_ATM_ENI_TUNE_BURST is not set
CONFIG_ATM_FIRESTREAM=m
# CONFIG_ATM_ZATM is not set
CONFIG_ATM_NICSTAR=m
# CONFIG_ATM_NICSTAR_USE_SUNI is not set
# CONFIG_ATM_NICSTAR_USE_IDT77105 is not set
# CONFIG_ATM_IDT77252 is not set
# CONFIG_ATM_AMBASSADOR is not set
# CONFIG_ATM_HORIZON is not set
# CONFIG_ATM_IA is not set
# CONFIG_ATM_FORE200E is not set
CONFIG_ATM_HE=m
# CONFIG_ATM_HE_USE_SUNI is not set
CONFIG_ATM_SOLOS=m

#
# CAIF transport drivers
#

#
# Distributed Switch Architecture drivers
#
CONFIG_NET_DSA_MV88E6XXX=m
CONFIG_NET_DSA_MV88E6060=m
CONFIG_NET_DSA_MV88E6XXX_NEED_PPU=y
CONFIG_NET_DSA_MV88E6131=m
CONFIG_NET_DSA_MV88E6123_61_65=m
CONFIG_ETHERNET=y
CONFIG_MDIO=m
CONFIG_NET_VENDOR_3COM=y
CONFIG_PCMCIA_3C574=m
CONFIG_PCMCIA_3C589=m
CONFIG_VORTEX=m
CONFIG_TYPHOON=m
CONFIG_NET_VENDOR_ADAPTEC=y
CONFIG_ADAPTEC_STARFIRE=m
CONFIG_NET_VENDOR_ALTEON=y
CONFIG_ACENIC=m
# CONFIG_ACENIC_OMIT_TIGON_I is not set
CONFIG_NET_VENDOR_AMD=y
CONFIG_AMD8111_ETH=m
CONFIG_PCNET32=m
CONFIG_PCMCIA_NMCLAN=m
CONFIG_NET_VENDOR_ATHEROS=y
CONFIG_ATL2=m
CONFIG_ATL1=m
CONFIG_ATL1E=m
CONFIG_ATL1C=m
CONFIG_NET_VENDOR_BROADCOM=y
CONFIG_B44=m
CONFIG_B44_PCI_AUTOSELECT=y
CONFIG_B44_PCICORE_AUTOSELECT=y
CONFIG_B44_PCI=y
CONFIG_BNX2=m
CONFIG_CNIC=m
CONFIG_TIGON3=m
CONFIG_BNX2X=m
CONFIG_NET_VENDOR_BROCADE=y
CONFIG_BNA=m
CONFIG_NET_CALXEDA_XGMAC=m
CONFIG_NET_VENDOR_CHELSIO=y
CONFIG_CHELSIO_T1=m
CONFIG_CHELSIO_T1_1G=y
CONFIG_CHELSIO_T3=m
CONFIG_CHELSIO_T4=m
CONFIG_CHELSIO_T4VF=m
CONFIG_NET_VENDOR_CISCO=y
CONFIG_ENIC=m
CONFIG_DNET=m
CONFIG_NET_VENDOR_DEC=y
CONFIG_NET_TULIP=y
CONFIG_DE2104X=m
CONFIG_DE2104X_DSL=0
CONFIG_TULIP=m
# CONFIG_TULIP_MWI is not set
CONFIG_TULIP_MMIO=y
# CONFIG_TULIP_NAPI is not set
CONFIG_DE4X5=m
CONFIG_WINBOND_840=m
CONFIG_DM9102=m
CONFIG_ULI526X=m
CONFIG_PCMCIA_XIRCOM=m
CONFIG_NET_VENDOR_DLINK=y
CONFIG_DE600=m
CONFIG_DE620=m
CONFIG_DL2K=m
CONFIG_SUNDANCE=m
# CONFIG_SUNDANCE_MMIO is not set
CONFIG_NET_VENDOR_EMULEX=y
CONFIG_BE2NET=m
CONFIG_NET_VENDOR_EXAR=y
CONFIG_S2IO=m
CONFIG_VXGE=m
# CONFIG_VXGE_DEBUG_TRACE_ALL is not set
# CONFIG_NET_VENDOR_FUJITSU is not set
# CONFIG_NET_VENDOR_HP is not set
CONFIG_NET_VENDOR_INTEL=y
CONFIG_E100=m
CONFIG_E1000=m
CONFIG_E1000E=m
CONFIG_IGB=m
CONFIG_IGB_DCA=y
CONFIG_IGB_PTP=y
CONFIG_IGBVF=m
CONFIG_IXGB=m
CONFIG_IXGBE=m
CONFIG_IXGBE_HWMON=y
CONFIG_IXGBE_DCA=y
CONFIG_IXGBE_DCB=y
CONFIG_IXGBE_PTP=y
CONFIG_IXGBEVF=m
# CONFIG_NET_VENDOR_I825XX is not set
CONFIG_IP1000=m
CONFIG_JME=m
CONFIG_NET_VENDOR_MARVELL=y
CONFIG_SKGE=m
# CONFIG_SKGE_DEBUG is not set
CONFIG_SKGE_GENESIS=y
CONFIG_SKY2=m
# CONFIG_SKY2_DEBUG is not set
CONFIG_NET_VENDOR_MELLANOX=y
CONFIG_MLX4_EN=m
CONFIG_MLX4_EN_DCB=y
CONFIG_MLX4_CORE=m
CONFIG_MLX4_DEBUG=y
CONFIG_NET_VENDOR_MICREL=y
# CONFIG_KS8842 is not set
# CONFIG_KS8851_MLL is not set
CONFIG_KSZ884X_PCI=m
CONFIG_NET_VENDOR_MYRI=y
CONFIG_MYRI10GE=m
CONFIG_MYRI10GE_DCA=y
CONFIG_FEALNX=m
CONFIG_NET_VENDOR_NATSEMI=y
CONFIG_NATSEMI=m
CONFIG_NS83820=m
CONFIG_NET_VENDOR_8390=y
CONFIG_PCMCIA_AXNET=m
CONFIG_NE2K_PCI=m
CONFIG_PCMCIA_PCNET=m
CONFIG_NET_VENDOR_NVIDIA=y
CONFIG_FORCEDETH=m
CONFIG_NET_VENDOR_OKI=y
CONFIG_PCH_GBE=m
# CONFIG_PCH_PTP is not set
CONFIG_ETHOC=m
CONFIG_NET_PACKET_ENGINE=y
CONFIG_HAMACHI=m
CONFIG_YELLOWFIN=m
CONFIG_NET_VENDOR_QLOGIC=y
CONFIG_QLA3XXX=m
CONFIG_QLCNIC=m
CONFIG_QLGE=m
CONFIG_NETXEN_NIC=m
CONFIG_NET_VENDOR_REALTEK=y
CONFIG_ATP=m
CONFIG_8139CP=m
CONFIG_8139TOO=m
# CONFIG_8139TOO_PIO is not set
# CONFIG_8139TOO_TUNE_TWISTER is not set
CONFIG_8139TOO_8129=y
# CONFIG_8139_OLD_RX_RESET is not set
CONFIG_R8169=m
CONFIG_NET_VENDOR_RDC=y
CONFIG_R6040=m
# CONFIG_NET_VENDOR_SEEQ is not set
CONFIG_NET_VENDOR_SILAN=y
CONFIG_SC92031=m
CONFIG_NET_VENDOR_SIS=y
CONFIG_SIS900=m
CONFIG_SIS190=m
CONFIG_SFC=m
# CONFIG_SFC_MTD is not set
CONFIG_SFC_MCDI_MON=y
CONFIG_SFC_SRIOV=y
CONFIG_SFC_PTP=y
CONFIG_NET_VENDOR_SMSC=y
CONFIG_PCMCIA_SMC91C92=m
CONFIG_EPIC100=m
CONFIG_SMSC9420=m
CONFIG_NET_VENDOR_STMICRO=y
CONFIG_STMMAC_ETH=m
# CONFIG_STMMAC_PLATFORM is not set
# CONFIG_STMMAC_PCI is not set
# CONFIG_STMMAC_DEBUG_FS is not set
# CONFIG_STMMAC_DA is not set
CONFIG_STMMAC_RING=y
# CONFIG_STMMAC_CHAINED is not set
CONFIG_NET_VENDOR_SUN=y
CONFIG_HAPPYMEAL=m
CONFIG_SUNGEM=m
CONFIG_CASSINI=m
CONFIG_NIU=m
CONFIG_NET_VENDOR_TEHUTI=y
CONFIG_TEHUTI=m
CONFIG_NET_VENDOR_TI=y
CONFIG_TLAN=m
CONFIG_NET_VENDOR_VIA=y
CONFIG_VIA_RHINE=m
CONFIG_VIA_RHINE_MMIO=y
CONFIG_VIA_VELOCITY=m
CONFIG_NET_VENDOR_WIZNET=y
CONFIG_WIZNET_W5100=m
CONFIG_WIZNET_W5300=m
# CONFIG_WIZNET_BUS_DIRECT is not set
# CONFIG_WIZNET_BUS_INDIRECT is not set
CONFIG_WIZNET_BUS_ANY=y
CONFIG_NET_VENDOR_XIRCOM=y
CONFIG_PCMCIA_XIRC2PS=m
# CONFIG_FDDI is not set
# CONFIG_HIPPI is not set
# CONFIG_NET_SB1000 is not set
CONFIG_PHYLIB=y

#
# MII PHY device drivers
#
CONFIG_AMD_PHY=m
CONFIG_MARVELL_PHY=m
CONFIG_DAVICOM_PHY=m
CONFIG_QSEMI_PHY=m
CONFIG_LXT_PHY=m
CONFIG_CICADA_PHY=m
CONFIG_VITESSE_PHY=m
CONFIG_SMSC_PHY=m
CONFIG_BROADCOM_PHY=m
CONFIG_BCM87XX_PHY=m
CONFIG_ICPLUS_PHY=m
CONFIG_REALTEK_PHY=m
CONFIG_NATIONAL_PHY=m
CONFIG_STE10XP=m
CONFIG_LSI_ET1011C_PHY=m
CONFIG_MICREL_PHY=m
CONFIG_FIXED_PHY=y
CONFIG_MDIO_BITBANG=m
# CONFIG_PLIP is not set
CONFIG_PPP=m
CONFIG_PPP_BSDCOMP=m
CONFIG_PPP_DEFLATE=m
CONFIG_PPP_FILTER=y
CONFIG_PPP_MPPE=m
CONFIG_PPP_MULTILINK=y
CONFIG_PPPOATM=m
CONFIG_PPPOE=m
CONFIG_PPTP=m
CONFIG_PPPOL2TP=m
CONFIG_PPP_ASYNC=m
CONFIG_PPP_SYNC_TTY=m
CONFIG_SLIP=m
CONFIG_SLHC=m
CONFIG_SLIP_COMPRESSED=y
CONFIG_SLIP_SMART=y
# CONFIG_SLIP_MODE_SLIP6 is not set

#
# USB Network Adapters
#
CONFIG_USB_CATC=m
CONFIG_USB_KAWETH=m
CONFIG_USB_PEGASUS=m
CONFIG_USB_RTL8150=m
CONFIG_USB_USBNET=m
CONFIG_USB_NET_AX8817X=m
CONFIG_USB_NET_CDCETHER=m
CONFIG_USB_NET_CDC_EEM=m
CONFIG_USB_NET_CDC_NCM=m
CONFIG_USB_NET_DM9601=m
CONFIG_USB_NET_SMSC75XX=m
CONFIG_USB_NET_SMSC95XX=m
CONFIG_USB_NET_GL620A=m
CONFIG_USB_NET_NET1080=m
CONFIG_USB_NET_PLUSB=m
CONFIG_USB_NET_MCS7830=m
CONFIG_USB_NET_RNDIS_HOST=m
CONFIG_USB_NET_CDC_SUBSET=m
CONFIG_USB_ALI_M5632=y
CONFIG_USB_AN2720=y
CONFIG_USB_BELKIN=y
CONFIG_USB_ARMLINUX=y
CONFIG_USB_EPSON2888=y
CONFIG_USB_KC2190=y
CONFIG_USB_NET_ZAURUS=m
CONFIG_USB_NET_CX82310_ETH=m
CONFIG_USB_NET_KALMIA=m
CONFIG_USB_NET_QMI_WWAN=m
CONFIG_USB_HSO=m
CONFIG_USB_NET_INT51X1=m
CONFIG_USB_IPHETH=m
CONFIG_USB_SIERRA_NET=m
CONFIG_USB_VL600=m
CONFIG_WLAN=y
# CONFIG_PCMCIA_RAYCS is not set
# CONFIG_LIBERTAS_THINFIRM is not set
# CONFIG_AIRO is not set
# CONFIG_ATMEL is not set
CONFIG_AT76C50X_USB=m
# CONFIG_AIRO_CS is not set
# CONFIG_PCMCIA_WL3501 is not set
# CONFIG_PRISM54 is not set
# CONFIG_USB_ZD1201 is not set
CONFIG_USB_NET_RNDIS_WLAN=m
CONFIG_RTL8180=m
CONFIG_RTL8187=m
CONFIG_RTL8187_LEDS=y
# CONFIG_ADM8211 is not set
CONFIG_MAC80211_HWSIM=m
CONFIG_MWL8K=m
CONFIG_ATH_COMMON=m
# CONFIG_ATH_DEBUG is not set
CONFIG_ATH5K=m
CONFIG_ATH5K_DEBUG=y
# CONFIG_ATH5K_TRACER is not set
CONFIG_ATH5K_PCI=y
CONFIG_ATH9K_HW=m
CONFIG_ATH9K_COMMON=m
CONFIG_ATH9K_BTCOEX_SUPPORT=y
CONFIG_ATH9K=m
CONFIG_ATH9K_PCI=y
CONFIG_ATH9K_AHB=y
CONFIG_ATH9K_DEBUGFS=y
# CONFIG_ATH9K_MAC_DEBUG is not set
CONFIG_ATH9K_RATE_CONTROL=y
CONFIG_ATH9K_HTC=m
# CONFIG_ATH9K_HTC_DEBUGFS is not set
CONFIG_CARL9170=m
CONFIG_CARL9170_LEDS=y
# CONFIG_CARL9170_DEBUGFS is not set
CONFIG_CARL9170_WPC=y
# CONFIG_CARL9170_HWRNG is not set
CONFIG_ATH6KL=m
CONFIG_ATH6KL_SDIO=m
CONFIG_ATH6KL_USB=m
CONFIG_ATH6KL_DEBUG=y
CONFIG_B43=m
CONFIG_B43_BCMA=y
# CONFIG_B43_BCMA_EXTRA is not set
CONFIG_B43_SSB=y
CONFIG_B43_PCI_AUTOSELECT=y
CONFIG_B43_PCICORE_AUTOSELECT=y
CONFIG_B43_PCMCIA=y
CONFIG_B43_SDIO=y
CONFIG_B43_BCMA_PIO=y
CONFIG_B43_PIO=y
CONFIG_B43_PHY_N=y
CONFIG_B43_PHY_LP=y
CONFIG_B43_PHY_HT=y
CONFIG_B43_LEDS=y
CONFIG_B43_HWRNG=y
# CONFIG_B43_DEBUG is not set
CONFIG_B43LEGACY=m
CONFIG_B43LEGACY_PCI_AUTOSELECT=y
CONFIG_B43LEGACY_PCICORE_AUTOSELECT=y
CONFIG_B43LEGACY_LEDS=y
CONFIG_B43LEGACY_HWRNG=y
# CONFIG_B43LEGACY_DEBUG is not set
CONFIG_B43LEGACY_DMA=y
CONFIG_B43LEGACY_PIO=y
CONFIG_B43LEGACY_DMA_AND_PIO_MODE=y
# CONFIG_B43LEGACY_DMA_MODE is not set
# CONFIG_B43LEGACY_PIO_MODE is not set
CONFIG_BRCMUTIL=m
CONFIG_BRCMSMAC=m
CONFIG_BRCMFMAC=m
CONFIG_BRCMFMAC_SDIO=y
CONFIG_BRCMFMAC_SDIO_OOB=y
CONFIG_BRCMFMAC_USB=y
# CONFIG_BRCMISCAN is not set
# CONFIG_BRCMDBG is not set
# CONFIG_HOSTAP is not set
CONFIG_IPW2100=m
CONFIG_IPW2100_MONITOR=y
# CONFIG_IPW2100_DEBUG is not set
CONFIG_IPW2200=m
CONFIG_IPW2200_MONITOR=y
CONFIG_IPW2200_RADIOTAP=y
CONFIG_IPW2200_PROMISCUOUS=y
CONFIG_IPW2200_QOS=y
# CONFIG_IPW2200_DEBUG is not set
CONFIG_LIBIPW=m
# CONFIG_LIBIPW_DEBUG is not set
CONFIG_IWLWIFI=m
CONFIG_IWLDVM=m

#
# Debugging Options
#
CONFIG_IWLWIFI_DEBUG=y
CONFIG_IWLWIFI_DEBUGFS=y
# CONFIG_IWLWIFI_DEBUG_EXPERIMENTAL_UCODE is not set
# CONFIG_IWLWIFI_DEVICE_TRACING is not set
# CONFIG_IWLWIFI_P2P is not set
# CONFIG_IWLWIFI_EXPERIMENTAL_MFP is not set
CONFIG_IWLEGACY=m
CONFIG_IWL4965=m
CONFIG_IWL3945=m

#
# iwl3945 / iwl4965 Debugging Options
#
CONFIG_IWLEGACY_DEBUG=y
CONFIG_IWLEGACY_DEBUGFS=y
CONFIG_LIBERTAS=m
CONFIG_LIBERTAS_USB=m
CONFIG_LIBERTAS_CS=m
CONFIG_LIBERTAS_SDIO=m
# CONFIG_LIBERTAS_DEBUG is not set
CONFIG_LIBERTAS_MESH=y
# CONFIG_HERMES is not set
CONFIG_P54_COMMON=m
CONFIG_P54_USB=m
CONFIG_P54_PCI=m
CONFIG_P54_LEDS=y
CONFIG_RT2X00=m
CONFIG_RT2400PCI=m
CONFIG_RT2500PCI=m
CONFIG_RT61PCI=m
CONFIG_RT2800PCI=m
CONFIG_RT2800PCI_RT33XX=y
CONFIG_RT2800PCI_RT35XX=y
CONFIG_RT2800PCI_RT53XX=y
CONFIG_RT2800PCI_RT3290=y
CONFIG_RT2500USB=m
CONFIG_RT73USB=m
CONFIG_RT2800USB=m
CONFIG_RT2800USB_RT33XX=y
CONFIG_RT2800USB_RT35XX=y
CONFIG_RT2800USB_RT53XX=y
CONFIG_RT2800USB_UNKNOWN=y
CONFIG_RT2800_LIB=m
CONFIG_RT2X00_LIB_PCI=m
CONFIG_RT2X00_LIB_USB=m
CONFIG_RT2X00_LIB=m
CONFIG_RT2X00_LIB_FIRMWARE=y
CONFIG_RT2X00_LIB_CRYPTO=y
CONFIG_RT2X00_LIB_LEDS=y
CONFIG_RT2X00_LIB_DEBUGFS=y
# CONFIG_RT2X00_DEBUG is not set
CONFIG_RTL8192CE=m
CONFIG_RTL8192SE=m
CONFIG_RTL8192DE=m
CONFIG_RTL8192CU=m
CONFIG_RTLWIFI=m
# CONFIG_RTLWIFI_DEBUG is not set
CONFIG_RTL8192C_COMMON=m
# CONFIG_WL_TI is not set
CONFIG_ZD1211RW=m
# CONFIG_ZD1211RW_DEBUG is not set
CONFIG_MWIFIEX=m
CONFIG_MWIFIEX_SDIO=m
CONFIG_MWIFIEX_PCIE=m
CONFIG_MWIFIEX_USB=m

#
# WiMAX Wireless Broadband devices
#
CONFIG_WIMAX_I2400M=m
CONFIG_WIMAX_I2400M_USB=m
CONFIG_WIMAX_I2400M_DEBUG_LEVEL=8
# CONFIG_WAN is not set
CONFIG_IEEE802154_DRIVERS=m
CONFIG_IEEE802154_FAKEHARD=m
CONFIG_IEEE802154_FAKELB=m
CONFIG_XEN_NETDEV_FRONTEND=m
CONFIG_XEN_NETDEV_BACKEND=m
CONFIG_VMXNET3=m
CONFIG_HYPERV_NET=m
CONFIG_ISDN=y
CONFIG_ISDN_I4L=m
CONFIG_ISDN_PPP=y
CONFIG_ISDN_PPP_VJ=y
CONFIG_ISDN_MPP=y
CONFIG_IPPP_FILTER=y
# CONFIG_ISDN_PPP_BSDCOMP is not set
CONFIG_ISDN_AUDIO=y
CONFIG_ISDN_TTY_FAX=y

#
# ISDN feature submodules
#
CONFIG_ISDN_DIVERSION=m

#
# ISDN4Linux hardware drivers
#

#
# Passive cards
#
CONFIG_ISDN_DRV_HISAX=m

#
# D-channel protocol features
#
CONFIG_HISAX_EURO=y
CONFIG_DE_AOC=y
CONFIG_HISAX_NO_SENDCOMPLETE=y
CONFIG_HISAX_NO_LLC=y
CONFIG_HISAX_NO_KEYPAD=y
CONFIG_HISAX_1TR6=y
CONFIG_HISAX_NI1=y
CONFIG_HISAX_MAX_CARDS=8

#
# HiSax supported cards
#
CONFIG_HISAX_16_3=y
CONFIG_HISAX_TELESPCI=y
CONFIG_HISAX_S0BOX=y
CONFIG_HISAX_FRITZPCI=y
CONFIG_HISAX_AVM_A1_PCMCIA=y
CONFIG_HISAX_ELSA=y
CONFIG_HISAX_DIEHLDIVA=y
CONFIG_HISAX_SEDLBAUER=y
CONFIG_HISAX_NETJET=y
CONFIG_HISAX_NETJET_U=y
CONFIG_HISAX_NICCY=y
CONFIG_HISAX_BKM_A4T=y
CONFIG_HISAX_SCT_QUADRO=y
CONFIG_HISAX_GAZEL=y
CONFIG_HISAX_HFC_PCI=y
CONFIG_HISAX_W6692=y
CONFIG_HISAX_HFC_SX=y
CONFIG_HISAX_ENTERNOW_PCI=y
# CONFIG_HISAX_DEBUG is not set

#
# HiSax PCMCIA card service modules
#
CONFIG_HISAX_SEDLBAUER_CS=m
CONFIG_HISAX_ELSA_CS=m
CONFIG_HISAX_AVM_A1_CS=m
CONFIG_HISAX_TELES_CS=m

#
# HiSax sub driver modules
#
CONFIG_HISAX_ST5481=m
# CONFIG_HISAX_HFCUSB is not set
CONFIG_HISAX_HFC4S8S=m
CONFIG_HISAX_FRITZ_PCIPNP=m

#
# Active cards
#
CONFIG_ISDN_CAPI=m
CONFIG_ISDN_DRV_AVMB1_VERBOSE_REASON=y
# CONFIG_CAPI_TRACE is not set
CONFIG_ISDN_CAPI_MIDDLEWARE=y
CONFIG_ISDN_CAPI_CAPI20=m
CONFIG_ISDN_CAPI_CAPIDRV=m

#
# CAPI hardware drivers
#
CONFIG_CAPI_AVM=y
CONFIG_ISDN_DRV_AVMB1_B1PCI=m
CONFIG_ISDN_DRV_AVMB1_B1PCIV4=y
CONFIG_ISDN_DRV_AVMB1_B1PCMCIA=m
CONFIG_ISDN_DRV_AVMB1_AVM_CS=m
CONFIG_ISDN_DRV_AVMB1_T1PCI=m
CONFIG_ISDN_DRV_AVMB1_C4=m
CONFIG_CAPI_EICON=y
CONFIG_ISDN_DIVAS=m
CONFIG_ISDN_DIVAS_BRIPCI=y
CONFIG_ISDN_DIVAS_PRIPCI=y
CONFIG_ISDN_DIVAS_DIVACAPI=m
CONFIG_ISDN_DIVAS_USERIDI=m
CONFIG_ISDN_DIVAS_MAINT=m
CONFIG_ISDN_DRV_GIGASET=m
CONFIG_GIGASET_CAPI=y
# CONFIG_GIGASET_I4L is not set
# CONFIG_GIGASET_DUMMYLL is not set
CONFIG_GIGASET_BASE=m
CONFIG_GIGASET_M105=m
CONFIG_GIGASET_M101=m
# CONFIG_GIGASET_DEBUG is not set
CONFIG_HYSDN=m
CONFIG_HYSDN_CAPI=y
CONFIG_MISDN=m
CONFIG_MISDN_DSP=m
CONFIG_MISDN_L1OIP=m

#
# mISDN hardware drivers
#
CONFIG_MISDN_HFCPCI=m
CONFIG_MISDN_HFCMULTI=m
CONFIG_MISDN_HFCUSB=m
CONFIG_MISDN_AVMFRITZ=m
CONFIG_MISDN_SPEEDFAX=m
CONFIG_MISDN_INFINEON=m
CONFIG_MISDN_W6692=m
CONFIG_MISDN_NETJET=m
CONFIG_MISDN_IPAC=m
CONFIG_MISDN_ISAR=m
CONFIG_ISDN_HDLC=m

#
# Input device support
#
CONFIG_INPUT=y
CONFIG_INPUT_FF_MEMLESS=y
CONFIG_INPUT_POLLDEV=m
CONFIG_INPUT_SPARSEKMAP=m
# CONFIG_INPUT_MATRIXKMAP is not set

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
# CONFIG_INPUT_MOUSEDEV_PSAUX is not set
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
CONFIG_INPUT_JOYDEV=m
CONFIG_INPUT_EVDEV=y
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
# CONFIG_KEYBOARD_ADP5588 is not set
# CONFIG_KEYBOARD_ADP5589 is not set
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_QT1070 is not set
# CONFIG_KEYBOARD_QT2160 is not set
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_TCA6416 is not set
# CONFIG_KEYBOARD_TCA8418 is not set
# CONFIG_KEYBOARD_LM8323 is not set
# CONFIG_KEYBOARD_LM8333 is not set
# CONFIG_KEYBOARD_MAX7359 is not set
# CONFIG_KEYBOARD_MCS is not set
# CONFIG_KEYBOARD_MPR121 is not set
# CONFIG_KEYBOARD_NEWTON is not set
# CONFIG_KEYBOARD_OPENCORES is not set
# CONFIG_KEYBOARD_STOWAWAY is not set
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_OMAP4 is not set
# CONFIG_KEYBOARD_XTKBD is not set
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
CONFIG_MOUSE_PS2_ALPS=y
CONFIG_MOUSE_PS2_LOGIPS2PP=y
CONFIG_MOUSE_PS2_SYNAPTICS=y
CONFIG_MOUSE_PS2_LIFEBOOK=y
CONFIG_MOUSE_PS2_TRACKPOINT=y
CONFIG_MOUSE_PS2_ELANTECH=y
CONFIG_MOUSE_PS2_SENTELIC=y
# CONFIG_MOUSE_PS2_TOUCHKIT is not set
CONFIG_MOUSE_SERIAL=m
CONFIG_MOUSE_APPLETOUCH=m
CONFIG_MOUSE_BCM5974=m
CONFIG_MOUSE_VSXXXAA=m
CONFIG_MOUSE_SYNAPTICS_I2C=m
CONFIG_MOUSE_SYNAPTICS_USB=m
CONFIG_INPUT_JOYSTICK=y
CONFIG_JOYSTICK_ANALOG=m
CONFIG_JOYSTICK_A3D=m
CONFIG_JOYSTICK_ADI=m
CONFIG_JOYSTICK_COBRA=m
CONFIG_JOYSTICK_GF2K=m
CONFIG_JOYSTICK_GRIP=m
CONFIG_JOYSTICK_GRIP_MP=m
CONFIG_JOYSTICK_GUILLEMOT=m
CONFIG_JOYSTICK_INTERACT=m
CONFIG_JOYSTICK_SIDEWINDER=m
CONFIG_JOYSTICK_TMDC=m
CONFIG_JOYSTICK_IFORCE=m
CONFIG_JOYSTICK_IFORCE_USB=y
CONFIG_JOYSTICK_IFORCE_232=y
CONFIG_JOYSTICK_WARRIOR=m
CONFIG_JOYSTICK_MAGELLAN=m
CONFIG_JOYSTICK_SPACEORB=m
CONFIG_JOYSTICK_SPACEBALL=m
CONFIG_JOYSTICK_STINGER=m
CONFIG_JOYSTICK_TWIDJOY=m
CONFIG_JOYSTICK_ZHENHUA=m
CONFIG_JOYSTICK_DB9=m
CONFIG_JOYSTICK_GAMECON=m
CONFIG_JOYSTICK_TURBOGRAFX=m
# CONFIG_JOYSTICK_AS5011 is not set
CONFIG_JOYSTICK_JOYDUMP=m
CONFIG_JOYSTICK_XPAD=m
CONFIG_JOYSTICK_XPAD_FF=y
CONFIG_JOYSTICK_XPAD_LEDS=y
CONFIG_JOYSTICK_WALKERA0701=m
CONFIG_INPUT_TABLET=y
CONFIG_TABLET_USB_ACECAD=m
CONFIG_TABLET_USB_AIPTEK=m
CONFIG_TABLET_USB_GTCO=m
CONFIG_TABLET_USB_HANWANG=m
CONFIG_TABLET_USB_KBTAB=m
CONFIG_TABLET_USB_WACOM=m
CONFIG_INPUT_TOUCHSCREEN=y
# CONFIG_TOUCHSCREEN_AD7879 is not set
CONFIG_TOUCHSCREEN_ATMEL_MXT=m
# CONFIG_TOUCHSCREEN_BU21013 is not set
# CONFIG_TOUCHSCREEN_CYTTSP_CORE is not set
CONFIG_TOUCHSCREEN_DYNAPRO=m
# CONFIG_TOUCHSCREEN_HAMPSHIRE is not set
CONFIG_TOUCHSCREEN_EETI=m
CONFIG_TOUCHSCREEN_EGALAX=m
CONFIG_TOUCHSCREEN_FUJITSU=m
CONFIG_TOUCHSCREEN_ILI210X=m
CONFIG_TOUCHSCREEN_GUNZE=m
CONFIG_TOUCHSCREEN_ELO=m
CONFIG_TOUCHSCREEN_WACOM_W8001=m
CONFIG_TOUCHSCREEN_WACOM_I2C=m
# CONFIG_TOUCHSCREEN_MAX11801 is not set
CONFIG_TOUCHSCREEN_MCS5000=m
CONFIG_TOUCHSCREEN_MMS114=m
CONFIG_TOUCHSCREEN_MTOUCH=m
CONFIG_TOUCHSCREEN_INEXIO=m
CONFIG_TOUCHSCREEN_MK712=m
CONFIG_TOUCHSCREEN_PENMOUNT=m
CONFIG_TOUCHSCREEN_EDT_FT5X06=m
CONFIG_TOUCHSCREEN_TOUCHRIGHT=m
CONFIG_TOUCHSCREEN_TOUCHWIN=m
CONFIG_TOUCHSCREEN_PIXCIR=m
# CONFIG_TOUCHSCREEN_WM97XX is not set
CONFIG_TOUCHSCREEN_USB_COMPOSITE=m
CONFIG_TOUCHSCREEN_USB_EGALAX=y
CONFIG_TOUCHSCREEN_USB_PANJIT=y
CONFIG_TOUCHSCREEN_USB_3M=y
CONFIG_TOUCHSCREEN_USB_ITM=y
CONFIG_TOUCHSCREEN_USB_ETURBO=y
CONFIG_TOUCHSCREEN_USB_GUNZE=y
CONFIG_TOUCHSCREEN_USB_DMC_TSC10=y
CONFIG_TOUCHSCREEN_USB_IRTOUCH=y
CONFIG_TOUCHSCREEN_USB_IDEALTEK=y
CONFIG_TOUCHSCREEN_USB_GENERAL_TOUCH=y
CONFIG_TOUCHSCREEN_USB_GOTOP=y
CONFIG_TOUCHSCREEN_USB_JASTEC=y
CONFIG_TOUCHSCREEN_USB_ELO=y
CONFIG_TOUCHSCREEN_USB_E2I=y
CONFIG_TOUCHSCREEN_USB_ZYTRONIC=y
CONFIG_TOUCHSCREEN_USB_ETT_TC45USB=y
CONFIG_TOUCHSCREEN_USB_NEXIO=y
CONFIG_TOUCHSCREEN_USB_EASYTOUCH=y
CONFIG_TOUCHSCREEN_TOUCHIT213=m
CONFIG_TOUCHSCREEN_TSC_SERIO=m
CONFIG_TOUCHSCREEN_TSC2007=m
CONFIG_TOUCHSCREEN_ST1232=m
# CONFIG_TOUCHSCREEN_TPS6507X is not set
CONFIG_INPUT_MISC=y
# CONFIG_INPUT_AD714X is not set
# CONFIG_INPUT_BMA150 is not set
CONFIG_INPUT_PCSPKR=m
CONFIG_INPUT_MMA8450=m
CONFIG_INPUT_MPU3050=m
CONFIG_INPUT_APANEL=m
CONFIG_INPUT_ATLAS_BTNS=m
CONFIG_INPUT_ATI_REMOTE2=m
CONFIG_INPUT_KEYSPAN_REMOTE=m
CONFIG_INPUT_KXTJ9=m
# CONFIG_INPUT_KXTJ9_POLLED_MODE is not set
CONFIG_INPUT_POWERMATE=m
CONFIG_INPUT_YEALINK=m
CONFIG_INPUT_CM109=m
CONFIG_INPUT_UINPUT=m
# CONFIG_INPUT_PCF8574 is not set
# CONFIG_INPUT_ADXL34X is not set
CONFIG_INPUT_CMA3000=m
CONFIG_INPUT_CMA3000_I2C=m
CONFIG_INPUT_XEN_KBDDEV_FRONTEND=y

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=y
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PARKBD is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
CONFIG_SERIO_RAW=m
CONFIG_SERIO_ALTERA_PS2=m
# CONFIG_SERIO_PS2MULT is not set
CONFIG_GAMEPORT=m
CONFIG_GAMEPORT_NS558=m
CONFIG_GAMEPORT_L4=m
CONFIG_GAMEPORT_EMU10K1=m
CONFIG_GAMEPORT_FM801=m

#
# Character devices
#
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_VT_CONSOLE_SLEEP=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
CONFIG_UNIX98_PTYS=y
CONFIG_DEVPTS_MULTIPLE_INSTANCES=y
# CONFIG_LEGACY_PTYS is not set
CONFIG_SERIAL_NONSTANDARD=y
CONFIG_ROCKETPORT=m
CONFIG_CYCLADES=m
# CONFIG_CYZ_INTR is not set
# CONFIG_MOXA_INTELLIO is not set
# CONFIG_MOXA_SMARTIO is not set
CONFIG_SYNCLINK=m
CONFIG_SYNCLINKMP=m
CONFIG_SYNCLINK_GT=m
CONFIG_NOZOMI=m
# CONFIG_ISI is not set
CONFIG_N_HDLC=m
CONFIG_N_GSM=m
# CONFIG_TRACE_SINK is not set
# CONFIG_DEVKMEM is not set
# CONFIG_STALDRV is not set

#
# Serial drivers
#
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_PNP=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_CS=m
CONFIG_SERIAL_8250_NR_UARTS=32
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
CONFIG_SERIAL_8250_SHARE_IRQ=y
# CONFIG_SERIAL_8250_DETECT_IRQ is not set
CONFIG_SERIAL_8250_RSA=y

#
# Non-8250 serial port support
#
# CONFIG_SERIAL_KGDB_NMI is not set
# CONFIG_SERIAL_MFD_HSU is not set
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
CONFIG_CONSOLE_POLL=y
CONFIG_SERIAL_JSM=m
# CONFIG_SERIAL_SCCNXP is not set
# CONFIG_SERIAL_TIMBERDALE is not set
# CONFIG_SERIAL_ALTERA_JTAGUART is not set
# CONFIG_SERIAL_ALTERA_UART is not set
# CONFIG_SERIAL_PCH_UART is not set
# CONFIG_SERIAL_XILINX_PS_UART is not set
CONFIG_PRINTER=m
CONFIG_LP_CONSOLE=y
CONFIG_PPDEV=m
CONFIG_HVC_DRIVER=y
CONFIG_HVC_IRQ=y
CONFIG_HVC_XEN=y
CONFIG_HVC_XEN_FRONTEND=y
CONFIG_VIRTIO_CONSOLE=y
CONFIG_IPMI_HANDLER=m
# CONFIG_IPMI_PANIC_EVENT is not set
CONFIG_IPMI_DEVICE_INTERFACE=m
CONFIG_IPMI_SI=m
CONFIG_IPMI_WATCHDOG=m
CONFIG_IPMI_POWEROFF=m
CONFIG_HW_RANDOM=y
CONFIG_HW_RANDOM_TIMERIOMEM=m
CONFIG_HW_RANDOM_INTEL=m
CONFIG_HW_RANDOM_AMD=m
CONFIG_HW_RANDOM_VIA=m
CONFIG_HW_RANDOM_VIRTIO=m
CONFIG_HW_RANDOM_TPM=m
CONFIG_NVRAM=y
CONFIG_R3964=m
# CONFIG_APPLICOM is not set

#
# PCMCIA character devices
#
# CONFIG_SYNCLINK_CS is not set
CONFIG_CARDMAN_4000=m
CONFIG_CARDMAN_4040=m
CONFIG_IPWIRELESS=m
CONFIG_MWAVE=m
CONFIG_RAW_DRIVER=y
CONFIG_MAX_RAW_DEVS=8192
CONFIG_HPET=y
# CONFIG_HPET_MMAP is not set
CONFIG_HANGCHECK_TIMER=m
CONFIG_TCG_TPM=m
CONFIG_TCG_TIS=m
# CONFIG_TCG_TIS_I2C_INFINEON is not set
CONFIG_TCG_NSC=m
CONFIG_TCG_ATMEL=m
CONFIG_TCG_INFINEON=m
CONFIG_TELCLOCK=m
CONFIG_DEVPORT=y
CONFIG_I2C=m
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_COMPAT=y
CONFIG_I2C_CHARDEV=m
# CONFIG_I2C_MUX is not set
CONFIG_I2C_HELPER_AUTO=y
CONFIG_I2C_SMBUS=m
CONFIG_I2C_ALGOBIT=m
CONFIG_I2C_ALGOPCA=m

#
# I2C Hardware Bus support
#

#
# PC SMBus host controller drivers
#
# CONFIG_I2C_ALI1535 is not set
# CONFIG_I2C_ALI1563 is not set
# CONFIG_I2C_ALI15X3 is not set
CONFIG_I2C_AMD756=m
CONFIG_I2C_AMD756_S4882=m
CONFIG_I2C_AMD8111=m
CONFIG_I2C_I801=m
CONFIG_I2C_ISCH=m
CONFIG_I2C_PIIX4=m
CONFIG_I2C_NFORCE2=m
CONFIG_I2C_NFORCE2_S4985=m
# CONFIG_I2C_SIS5595 is not set
# CONFIG_I2C_SIS630 is not set
CONFIG_I2C_SIS96X=m
CONFIG_I2C_VIA=m
CONFIG_I2C_VIAPRO=m

#
# ACPI drivers
#
CONFIG_I2C_SCMI=m

#
# I2C system bus drivers (mostly embedded / system-on-chip)
#
# CONFIG_I2C_DESIGNWARE_PCI is not set
# CONFIG_I2C_EG20T is not set
# CONFIG_I2C_INTEL_MID is not set
# CONFIG_I2C_OCORES is not set
CONFIG_I2C_PCA_PLATFORM=m
# CONFIG_I2C_PXA_PCI is not set
CONFIG_I2C_SIMTEC=m
# CONFIG_I2C_XILINX is not set

#
# External I2C/SMBus adapter drivers
#
CONFIG_I2C_DIOLAN_U2C=m
CONFIG_I2C_PARPORT=m
CONFIG_I2C_PARPORT_LIGHT=m
# CONFIG_I2C_TAOS_EVM is not set
CONFIG_I2C_TINY_USB=m

#
# Other I2C/SMBus bus drivers
#
CONFIG_I2C_STUB=m
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
# CONFIG_SPI is not set
# CONFIG_HSI is not set

#
# PPS support
#
CONFIG_PPS=m
# CONFIG_PPS_DEBUG is not set

#
# PPS clients support
#
# CONFIG_PPS_CLIENT_KTIMER is not set
CONFIG_PPS_CLIENT_LDISC=m
CONFIG_PPS_CLIENT_PARPORT=m
CONFIG_PPS_CLIENT_GPIO=m

#
# PPS generators support
#

#
# PTP clock support
#
CONFIG_PTP_1588_CLOCK=m
CONFIG_DP83640_PHY=m
CONFIG_PTP_1588_CLOCK_PCH=m
CONFIG_ARCH_WANT_OPTIONAL_GPIOLIB=y
# CONFIG_GPIOLIB is not set
CONFIG_W1=m
CONFIG_W1_CON=y

#
# 1-wire Bus Masters
#
# CONFIG_W1_MASTER_MATROX is not set
CONFIG_W1_MASTER_DS2490=m
CONFIG_W1_MASTER_DS2482=m
CONFIG_W1_MASTER_DS1WM=m
# CONFIG_HDQ_MASTER_OMAP is not set

#
# 1-wire Slaves
#
CONFIG_W1_SLAVE_THERM=m
CONFIG_W1_SLAVE_SMEM=m
CONFIG_W1_SLAVE_DS2408=m
CONFIG_W1_SLAVE_DS2423=m
CONFIG_W1_SLAVE_DS2431=m
CONFIG_W1_SLAVE_DS2433=m
CONFIG_W1_SLAVE_DS2433_CRC=y
CONFIG_W1_SLAVE_DS2760=m
CONFIG_W1_SLAVE_DS2780=m
CONFIG_W1_SLAVE_DS2781=m
CONFIG_W1_SLAVE_DS28E04=m
CONFIG_W1_SLAVE_BQ27000=m
CONFIG_POWER_SUPPLY=y
# CONFIG_POWER_SUPPLY_DEBUG is not set
# CONFIG_PDA_POWER is not set
# CONFIG_TEST_POWER is not set
# CONFIG_BATTERY_DS2760 is not set
# CONFIG_BATTERY_DS2780 is not set
# CONFIG_BATTERY_DS2781 is not set
# CONFIG_BATTERY_DS2782 is not set
# CONFIG_BATTERY_SBS is not set
# CONFIG_BATTERY_BQ27x00 is not set
# CONFIG_BATTERY_MAX17040 is not set
# CONFIG_BATTERY_MAX17042 is not set
# CONFIG_CHARGER_ISP1704 is not set
# CONFIG_CHARGER_MAX8903 is not set
# CONFIG_CHARGER_LP8727 is not set
CONFIG_CHARGER_SMB347=m
# CONFIG_POWER_AVS is not set
CONFIG_HWMON=y
CONFIG_HWMON_VID=m
# CONFIG_HWMON_DEBUG_CHIP is not set

#
# Native drivers
#
CONFIG_SENSORS_ABITUGURU=m
CONFIG_SENSORS_ABITUGURU3=m
CONFIG_SENSORS_AD7414=m
CONFIG_SENSORS_AD7418=m
CONFIG_SENSORS_ADM1021=m
CONFIG_SENSORS_ADM1025=m
CONFIG_SENSORS_ADM1026=m
CONFIG_SENSORS_ADM1029=m
CONFIG_SENSORS_ADM1031=m
CONFIG_SENSORS_ADM9240=m
# CONFIG_SENSORS_ADT7410 is not set
CONFIG_SENSORS_ADT7411=m
CONFIG_SENSORS_ADT7462=m
CONFIG_SENSORS_ADT7470=m
CONFIG_SENSORS_ADT7475=m
CONFIG_SENSORS_ASC7621=m
CONFIG_SENSORS_K8TEMP=m
CONFIG_SENSORS_K10TEMP=m
CONFIG_SENSORS_FAM15H_POWER=m
CONFIG_SENSORS_ASB100=m
CONFIG_SENSORS_ATXP1=m
CONFIG_SENSORS_DS620=m
CONFIG_SENSORS_DS1621=m
CONFIG_SENSORS_I5K_AMB=m
CONFIG_SENSORS_F71805F=m
CONFIG_SENSORS_F71882FG=m
CONFIG_SENSORS_F75375S=m
CONFIG_SENSORS_FSCHMD=m
CONFIG_SENSORS_G760A=m
CONFIG_SENSORS_GL518SM=m
CONFIG_SENSORS_GL520SM=m
# CONFIG_SENSORS_HIH6130 is not set
CONFIG_SENSORS_CORETEMP=m
CONFIG_SENSORS_IBMAEM=m
CONFIG_SENSORS_IBMPEX=m
CONFIG_SENSORS_IT87=m
# CONFIG_SENSORS_JC42 is not set
CONFIG_SENSORS_LINEAGE=m
CONFIG_SENSORS_LM63=m
CONFIG_SENSORS_LM73=m
CONFIG_SENSORS_LM75=m
CONFIG_SENSORS_LM77=m
CONFIG_SENSORS_LM78=m
CONFIG_SENSORS_LM80=m
CONFIG_SENSORS_LM83=m
CONFIG_SENSORS_LM85=m
CONFIG_SENSORS_LM87=m
CONFIG_SENSORS_LM90=m
CONFIG_SENSORS_LM92=m
CONFIG_SENSORS_LM93=m
CONFIG_SENSORS_LTC4151=m
CONFIG_SENSORS_LTC4215=m
CONFIG_SENSORS_LTC4245=m
CONFIG_SENSORS_LTC4261=m
CONFIG_SENSORS_LM95241=m
CONFIG_SENSORS_LM95245=m
CONFIG_SENSORS_MAX16065=m
CONFIG_SENSORS_MAX1619=m
CONFIG_SENSORS_MAX1668=m
# CONFIG_SENSORS_MAX197 is not set
CONFIG_SENSORS_MAX6639=m
CONFIG_SENSORS_MAX6642=m
CONFIG_SENSORS_MAX6650=m
CONFIG_SENSORS_MCP3021=m
CONFIG_SENSORS_NTC_THERMISTOR=m
CONFIG_SENSORS_PC87360=m
CONFIG_SENSORS_PC87427=m
CONFIG_SENSORS_PCF8591=m
CONFIG_PMBUS=m
CONFIG_SENSORS_PMBUS=m
CONFIG_SENSORS_ADM1275=m
CONFIG_SENSORS_LM25066=m
CONFIG_SENSORS_LTC2978=m
CONFIG_SENSORS_MAX16064=m
CONFIG_SENSORS_MAX34440=m
CONFIG_SENSORS_MAX8688=m
CONFIG_SENSORS_UCD9000=m
CONFIG_SENSORS_UCD9200=m
CONFIG_SENSORS_ZL6100=m
CONFIG_SENSORS_SHT21=m
CONFIG_SENSORS_SIS5595=m
# CONFIG_SENSORS_SMM665 is not set
CONFIG_SENSORS_DME1737=m
CONFIG_SENSORS_EMC1403=m
# CONFIG_SENSORS_EMC2103 is not set
CONFIG_SENSORS_EMC6W201=m
CONFIG_SENSORS_SMSC47M1=m
CONFIG_SENSORS_SMSC47M192=m
CONFIG_SENSORS_SMSC47B397=m
CONFIG_SENSORS_SCH56XX_COMMON=m
CONFIG_SENSORS_SCH5627=m
CONFIG_SENSORS_SCH5636=m
CONFIG_SENSORS_ADS1015=m
CONFIG_SENSORS_ADS7828=m
CONFIG_SENSORS_AMC6821=m
CONFIG_SENSORS_INA2XX=m
CONFIG_SENSORS_THMC50=m
CONFIG_SENSORS_TMP102=m
CONFIG_SENSORS_TMP401=m
CONFIG_SENSORS_TMP421=m
CONFIG_SENSORS_VIA_CPUTEMP=m
CONFIG_SENSORS_VIA686A=m
CONFIG_SENSORS_VT1211=m
CONFIG_SENSORS_VT8231=m
CONFIG_SENSORS_W83781D=m
CONFIG_SENSORS_W83791D=m
CONFIG_SENSORS_W83792D=m
CONFIG_SENSORS_W83793=m
CONFIG_SENSORS_W83795=m
# CONFIG_SENSORS_W83795_FANCTRL is not set
CONFIG_SENSORS_W83L785TS=m
CONFIG_SENSORS_W83L786NG=m
CONFIG_SENSORS_W83627HF=m
CONFIG_SENSORS_W83627EHF=m
CONFIG_SENSORS_APPLESMC=m

#
# ACPI drivers
#
CONFIG_SENSORS_ACPI_POWER=m
CONFIG_SENSORS_ATK0110=m
CONFIG_THERMAL=y
CONFIG_THERMAL_HWMON=y
# CONFIG_CPU_THERMAL is not set
CONFIG_WATCHDOG=y
CONFIG_WATCHDOG_CORE=y
# CONFIG_WATCHDOG_NOWAYOUT is not set

#
# Watchdog Device Drivers
#
CONFIG_SOFT_WATCHDOG=m
# CONFIG_ACQUIRE_WDT is not set
# CONFIG_ADVANTECH_WDT is not set
CONFIG_ALIM1535_WDT=m
CONFIG_ALIM7101_WDT=m
CONFIG_F71808E_WDT=m
CONFIG_SP5100_TCO=m
# CONFIG_SC520_WDT is not set
CONFIG_SBC_FITPC2_WATCHDOG=m
# CONFIG_EUROTECH_WDT is not set
CONFIG_IB700_WDT=m
CONFIG_IBMASR=m
# CONFIG_WAFER_WDT is not set
CONFIG_I6300ESB_WDT=m
CONFIG_IE6XX_WDT=m
CONFIG_ITCO_WDT=m
CONFIG_ITCO_VENDOR_SUPPORT=y
CONFIG_IT8712F_WDT=m
CONFIG_IT87_WDT=m
CONFIG_HP_WATCHDOG=m
CONFIG_HPWDT_NMI_DECODING=y
# CONFIG_SC1200_WDT is not set
# CONFIG_PC87413_WDT is not set
CONFIG_NV_TCO=m
# CONFIG_60XX_WDT is not set
# CONFIG_SBC8360_WDT is not set
# CONFIG_CPU5_WDT is not set
CONFIG_SMSC_SCH311X_WDT=m
# CONFIG_SMSC37B787_WDT is not set
CONFIG_VIA_WDT=m
CONFIG_W83627HF_WDT=m
CONFIG_W83697HF_WDT=m
CONFIG_W83697UG_WDT=m
CONFIG_W83877F_WDT=m
CONFIG_W83977F_WDT=m
CONFIG_MACHZ_WDT=m
# CONFIG_SBC_EPX_C3_WATCHDOG is not set
CONFIG_XEN_WDT=m

#
# PCI-based Watchdog Cards
#
CONFIG_PCIPCWATCHDOG=m
CONFIG_WDTPCI=m

#
# USB-based Watchdog Cards
#
CONFIG_USBPCWATCHDOG=m
CONFIG_SSB_POSSIBLE=y

#
# Sonics Silicon Backplane
#
CONFIG_SSB=m
CONFIG_SSB_SPROM=y
CONFIG_SSB_BLOCKIO=y
CONFIG_SSB_PCIHOST_POSSIBLE=y
CONFIG_SSB_PCIHOST=y
CONFIG_SSB_B43_PCI_BRIDGE=y
CONFIG_SSB_PCMCIAHOST_POSSIBLE=y
CONFIG_SSB_PCMCIAHOST=y
CONFIG_SSB_SDIOHOST_POSSIBLE=y
CONFIG_SSB_SDIOHOST=y
# CONFIG_SSB_DEBUG is not set
CONFIG_SSB_DRIVER_PCICORE_POSSIBLE=y
CONFIG_SSB_DRIVER_PCICORE=y
CONFIG_BCMA_POSSIBLE=y

#
# Broadcom specific AMBA
#
CONFIG_BCMA=m
CONFIG_BCMA_BLOCKIO=y
CONFIG_BCMA_HOST_PCI_POSSIBLE=y
CONFIG_BCMA_HOST_PCI=y
CONFIG_BCMA_DRIVER_GMAC_CMN=y
# CONFIG_BCMA_DEBUG is not set

#
# Multifunction device drivers
#
CONFIG_MFD_CORE=m
CONFIG_MFD_SM501=m
# CONFIG_HTC_PASIC3 is not set
# CONFIG_MFD_LM3533 is not set
# CONFIG_TPS6105X is not set
# CONFIG_TPS6507X is not set
# CONFIG_MFD_TPS65217 is not set
# CONFIG_MFD_TMIO is not set
# CONFIG_MFD_ARIZONA_I2C is not set
# CONFIG_MFD_PCF50633 is not set
# CONFIG_MFD_MC13XXX_I2C is not set
# CONFIG_ABX500_CORE is not set
# CONFIG_MFD_CS5535 is not set
CONFIG_LPC_SCH=m
CONFIG_LPC_ICH=m
# CONFIG_MFD_RDC321X is not set
# CONFIG_MFD_JANZ_CMODIO is not set
CONFIG_MFD_VX855=m
CONFIG_MFD_WL1273_CORE=m
# CONFIG_REGULATOR is not set
CONFIG_MEDIA_SUPPORT=m

#
# Multimedia core support
#
CONFIG_MEDIA_CAMERA_SUPPORT=y
CONFIG_MEDIA_ANALOG_TV_SUPPORT=y
CONFIG_MEDIA_DIGITAL_TV_SUPPORT=y
CONFIG_MEDIA_RADIO_SUPPORT=y
CONFIG_MEDIA_RC_SUPPORT=y
CONFIG_MEDIA_CONTROLLER=y
CONFIG_VIDEO_DEV=m
CONFIG_VIDEO_V4L2_SUBDEV_API=y
CONFIG_VIDEO_V4L2=m
# CONFIG_VIDEO_ADV_DEBUG is not set
# CONFIG_VIDEO_FIXED_MINOR_RANGES is not set
CONFIG_DVB_CORE=m
CONFIG_DVB_NET=y
CONFIG_DVB_MAX_ADAPTERS=8
CONFIG_DVB_DYNAMIC_MINORS=y

#
# Media drivers
#
CONFIG_RC_CORE=m
CONFIG_RC_MAP=m
CONFIG_RC_DECODERS=y
CONFIG_LIRC=m
CONFIG_IR_LIRC_CODEC=m
CONFIG_IR_NEC_DECODER=m
CONFIG_IR_RC5_DECODER=m
CONFIG_IR_RC6_DECODER=m
CONFIG_IR_JVC_DECODER=m
CONFIG_IR_SONY_DECODER=m
CONFIG_IR_RC5_SZ_DECODER=m
CONFIG_IR_SANYO_DECODER=m
CONFIG_IR_MCE_KBD_DECODER=m
CONFIG_RC_DEVICES=y
CONFIG_RC_ATI_REMOTE=m
CONFIG_IR_ENE=m
CONFIG_IR_IMON=m
CONFIG_IR_MCEUSB=m
CONFIG_IR_ITE_CIR=m
CONFIG_IR_FINTEK=m
CONFIG_IR_NUVOTON=m
CONFIG_IR_REDRAT3=m
CONFIG_IR_STREAMZAP=m
CONFIG_IR_WINBOND_CIR=m
CONFIG_IR_IGUANA=m
# CONFIG_IR_TTUSBIR is not set
CONFIG_RC_LOOPBACK=m
CONFIG_IR_GPIO_CIR=m
# CONFIG_MEDIA_USB_SUPPORT is not set
# CONFIG_MEDIA_PCI_SUPPORT is not set
# CONFIG_V4L_PLATFORM_DRIVERS is not set
CONFIG_V4L_MEM2MEM_DRIVERS=y
# CONFIG_VIDEO_MEM2MEM_DEINTERLACE is not set
# CONFIG_V4L_TEST_DRIVERS is not set

#
# Supported MMC/SDIO adapters
#
CONFIG_SMS_SDIO_DRV=m
# CONFIG_MEDIA_PARPORT_SUPPORT is not set
CONFIG_RADIO_ADAPTERS=y
CONFIG_RADIO_SI470X=y
CONFIG_USB_SI470X=m
CONFIG_I2C_SI470X=m
CONFIG_USB_MR800=m
CONFIG_USB_DSBR=m
CONFIG_RADIO_MAXIRADIO=m
CONFIG_RADIO_SHARK=m
CONFIG_RADIO_SHARK2=m
CONFIG_I2C_SI4713=m
CONFIG_RADIO_SI4713=m
CONFIG_USB_KEENE=m
CONFIG_RADIO_TEA5764=m
CONFIG_RADIO_SAA7706H=m
# CONFIG_RADIO_TEF6862 is not set
CONFIG_RADIO_WL1273=m

#
# Texas Instruments WL128x FM driver (ST based)
#

#
# Supported FireWire (IEEE 1394) Adapters
#
CONFIG_DVB_FIREDTV=m
CONFIG_DVB_FIREDTV_INPUT=y
CONFIG_SMS_SIANO_MDTV=m
CONFIG_MEDIA_SUBDRV_AUTOSELECT=y

#
# Media ancillary drivers (tuners, sensors, i2c, frontends)
#
CONFIG_VIDEO_IR_I2C=m

#
# Audio decoders, processors and mixers
#

#
# RDS decoders
#

#
# Video decoders
#

#
# Video and audio decoders
#

#
# MPEG video encoders
#

#
# Video encoders
#

#
# Camera sensor devices
#

#
# Flash devices
#

#
# Video improvement chips
#

#
# Miscelaneous helper chips
#

#
# Sensors used on soc_camera driver
#
CONFIG_MEDIA_ATTACH=y
CONFIG_MEDIA_TUNER=m
CONFIG_MEDIA_TUNER_SIMPLE=m
CONFIG_MEDIA_TUNER_TDA8290=m
CONFIG_MEDIA_TUNER_TDA827X=m
CONFIG_MEDIA_TUNER_TDA18271=m
CONFIG_MEDIA_TUNER_TDA9887=m
CONFIG_MEDIA_TUNER_TEA5761=m
CONFIG_MEDIA_TUNER_TEA5767=m
CONFIG_MEDIA_TUNER_MT20XX=m
CONFIG_MEDIA_TUNER_XC2028=m
CONFIG_MEDIA_TUNER_XC5000=m
CONFIG_MEDIA_TUNER_XC4000=m
CONFIG_MEDIA_TUNER_MC44S803=m

#
# Multistandard (satellite) frontends
#

#
# Multistandard (cable + terrestrial) frontends
#

#
# DVB-S (satellite) frontends
#

#
# DVB-T (terrestrial) frontends
#

#
# DVB-C (cable) frontends
#

#
# ATSC (North American/Korean Terrestrial/Cable DTV) frontends
#

#
# ISDB-T (terrestrial) frontends
#

#
# Digital terrestrial only tuners/PLL
#

#
# SEC control devices for DVB-S
#

#
# Tools to develop new frontends
#
# CONFIG_DVB_DUMMY_FE is not set

#
# Graphics support
#
CONFIG_AGP=y
CONFIG_AGP_AMD64=y
CONFIG_AGP_INTEL=y
CONFIG_AGP_SIS=y
CONFIG_AGP_VIA=y
CONFIG_VGA_ARB=y
CONFIG_VGA_ARB_MAX_GPUS=16
CONFIG_VGA_SWITCHEROO=y
CONFIG_DRM=m
CONFIG_DRM_USB=m
CONFIG_DRM_KMS_HELPER=m
CONFIG_DRM_LOAD_EDID_FIRMWARE=y
CONFIG_DRM_TTM=m
# CONFIG_DRM_TDFX is not set
# CONFIG_DRM_R128 is not set
CONFIG_DRM_RADEON=m
CONFIG_DRM_RADEON_KMS=y
CONFIG_DRM_NOUVEAU=m
CONFIG_NOUVEAU_DEBUG=5
CONFIG_NOUVEAU_DEBUG_DEFAULT=3
CONFIG_DRM_NOUVEAU_BACKLIGHT=y

#
# I2C encoder or helper chips
#
CONFIG_DRM_I2C_CH7006=m
CONFIG_DRM_I2C_SIL164=m
# CONFIG_DRM_I810 is not set
CONFIG_DRM_I915=m
CONFIG_DRM_I915_KMS=y
# CONFIG_DRM_MGA is not set
# CONFIG_DRM_SIS is not set
CONFIG_DRM_VIA=m
# CONFIG_DRM_SAVAGE is not set
CONFIG_DRM_VMWGFX=m
# CONFIG_DRM_VMWGFX_FBCON is not set
CONFIG_DRM_GMA500=m
# CONFIG_DRM_GMA600 is not set
CONFIG_DRM_GMA3600=y
CONFIG_DRM_UDL=m
CONFIG_DRM_AST=m
CONFIG_DRM_MGAG200=m
CONFIG_DRM_CIRRUS_QEMU=m
# CONFIG_STUB_POULSBO is not set
CONFIG_VGASTATE=m
CONFIG_VIDEO_OUTPUT_CONTROL=m
CONFIG_FB=y
# CONFIG_FIRMWARE_EDID is not set
# CONFIG_FB_DDC is not set
CONFIG_FB_BOOT_VESA_SUPPORT=y
CONFIG_FB_CFB_FILLRECT=y
CONFIG_FB_CFB_COPYAREA=y
CONFIG_FB_CFB_IMAGEBLIT=y
# CONFIG_FB_CFB_REV_PIXELS_IN_BYTE is not set
CONFIG_FB_SYS_FILLRECT=y
CONFIG_FB_SYS_COPYAREA=y
CONFIG_FB_SYS_IMAGEBLIT=y
# CONFIG_FB_FOREIGN_ENDIAN is not set
CONFIG_FB_SYS_FOPS=y
# CONFIG_FB_WMT_GE_ROPS is not set
CONFIG_FB_DEFERRED_IO=y
# CONFIG_FB_SVGALIB is not set
# CONFIG_FB_MACMODES is not set
CONFIG_FB_BACKLIGHT=y
# CONFIG_FB_MODE_HELPERS is not set
CONFIG_FB_TILEBLITTING=y

#
# Frame buffer hardware drivers
#
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ARC is not set
# CONFIG_FB_ASILIANT is not set
# CONFIG_FB_IMSTT is not set
CONFIG_FB_VGA16=m
# CONFIG_FB_UVESA is not set
CONFIG_FB_VESA=y
CONFIG_FB_EFI=y
# CONFIG_FB_N411 is not set
# CONFIG_FB_HGA is not set
# CONFIG_FB_S1D13XXX is not set
# CONFIG_FB_NVIDIA is not set
# CONFIG_FB_RIVA is not set
# CONFIG_FB_I740 is not set
# CONFIG_FB_LE80578 is not set
# CONFIG_FB_MATROX is not set
# CONFIG_FB_RADEON is not set
# CONFIG_FB_ATY128 is not set
# CONFIG_FB_ATY is not set
# CONFIG_FB_S3 is not set
# CONFIG_FB_SAVAGE is not set
# CONFIG_FB_SIS is not set
# CONFIG_FB_VIA is not set
# CONFIG_FB_NEOMAGIC is not set
# CONFIG_FB_KYRO is not set
# CONFIG_FB_3DFX is not set
CONFIG_FB_VOODOO1=m
# CONFIG_FB_VT8623 is not set
# CONFIG_FB_TRIDENT is not set
# CONFIG_FB_ARK is not set
# CONFIG_FB_PM3 is not set
# CONFIG_FB_CARMINE is not set
# CONFIG_FB_GEODE is not set
# CONFIG_FB_TMIO is not set
# CONFIG_FB_SM501 is not set
# CONFIG_FB_SMSCUFX is not set
# CONFIG_FB_UDL is not set
CONFIG_FB_VIRTUAL=m
CONFIG_XEN_FBDEV_FRONTEND=y
# CONFIG_FB_METRONOME is not set
# CONFIG_FB_MB862XX is not set
# CONFIG_FB_BROADSHEET is not set
# CONFIG_FB_AUO_K190X is not set
# CONFIG_EXYNOS_VIDEO is not set
CONFIG_BACKLIGHT_LCD_SUPPORT=y
CONFIG_LCD_CLASS_DEVICE=m
CONFIG_LCD_PLATFORM=m
CONFIG_BACKLIGHT_CLASS_DEVICE=y
# CONFIG_BACKLIGHT_GENERIC is not set
CONFIG_BACKLIGHT_APPLE=m
# CONFIG_BACKLIGHT_SAHARA is not set
# CONFIG_BACKLIGHT_ADP8860 is not set
# CONFIG_BACKLIGHT_ADP8870 is not set
# CONFIG_BACKLIGHT_LM3630 is not set
# CONFIG_BACKLIGHT_LM3639 is not set
CONFIG_BACKLIGHT_LP855X=m

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
CONFIG_VGACON_SOFT_SCROLLBACK=y
CONFIG_VGACON_SOFT_SCROLLBACK_SIZE=64
CONFIG_DUMMY_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY=y
CONFIG_FRAMEBUFFER_CONSOLE_ROTATION=y
# CONFIG_FONTS is not set
CONFIG_FONT_8x8=y
CONFIG_FONT_8x16=y
CONFIG_LOGO=y
# CONFIG_LOGO_LINUX_MONO is not set
# CONFIG_LOGO_LINUX_VGA16 is not set
CONFIG_LOGO_LINUX_CLUT224=y
CONFIG_SOUND=m
CONFIG_SOUND_OSS_CORE=y
CONFIG_SOUND_OSS_CORE_PRECLAIM=y
CONFIG_SND=m
CONFIG_SND_TIMER=m
CONFIG_SND_PCM=m
CONFIG_SND_HWDEP=m
CONFIG_SND_RAWMIDI=m
CONFIG_SND_JACK=y
CONFIG_SND_SEQUENCER=m
CONFIG_SND_SEQ_DUMMY=m
CONFIG_SND_OSSEMUL=y
CONFIG_SND_MIXER_OSS=m
CONFIG_SND_PCM_OSS=m
CONFIG_SND_PCM_OSS_PLUGINS=y
CONFIG_SND_SEQUENCER_OSS=y
CONFIG_SND_HRTIMER=m
CONFIG_SND_SEQ_HRTIMER_DEFAULT=y
CONFIG_SND_DYNAMIC_MINORS=y
# CONFIG_SND_SUPPORT_OLD_API is not set
CONFIG_SND_VERBOSE_PROCFS=y
CONFIG_SND_VERBOSE_PRINTK=y
CONFIG_SND_DEBUG=y
# CONFIG_SND_DEBUG_VERBOSE is not set
CONFIG_SND_PCM_XRUN_DEBUG=y
CONFIG_SND_VMASTER=y
CONFIG_SND_KCTL_JACK=y
CONFIG_SND_DMA_SGBUF=y
CONFIG_SND_RAWMIDI_SEQ=m
CONFIG_SND_OPL3_LIB_SEQ=m
# CONFIG_SND_OPL4_LIB_SEQ is not set
# CONFIG_SND_SBAWE_SEQ is not set
CONFIG_SND_EMU10K1_SEQ=m
CONFIG_SND_MPU401_UART=m
CONFIG_SND_OPL3_LIB=m
CONFIG_SND_VX_LIB=m
CONFIG_SND_AC97_CODEC=m
CONFIG_SND_DRIVERS=y
CONFIG_SND_PCSP=m
CONFIG_SND_DUMMY=m
CONFIG_SND_ALOOP=m
CONFIG_SND_VIRMIDI=m
CONFIG_SND_MTPAV=m
CONFIG_SND_MTS64=m
CONFIG_SND_SERIAL_U16550=m
CONFIG_SND_MPU401=m
CONFIG_SND_PORTMAN2X4=m
CONFIG_SND_AC97_POWER_SAVE=y
CONFIG_SND_AC97_POWER_SAVE_DEFAULT=0
CONFIG_SND_SB_COMMON=m
CONFIG_SND_SB16_DSP=m
CONFIG_SND_TEA575X=m
CONFIG_SND_PCI=y
CONFIG_SND_AD1889=m
CONFIG_SND_ALS300=m
CONFIG_SND_ALS4000=m
CONFIG_SND_ALI5451=m
CONFIG_SND_ASIHPI=m
CONFIG_SND_ATIIXP=m
CONFIG_SND_ATIIXP_MODEM=m
CONFIG_SND_AU8810=m
CONFIG_SND_AU8820=m
CONFIG_SND_AU8830=m
# CONFIG_SND_AW2 is not set
CONFIG_SND_AZT3328=m
CONFIG_SND_BT87X=m
# CONFIG_SND_BT87X_OVERCLOCK is not set
CONFIG_SND_CA0106=m
CONFIG_SND_CMIPCI=m
CONFIG_SND_OXYGEN_LIB=m
CONFIG_SND_OXYGEN=m
CONFIG_SND_CS4281=m
CONFIG_SND_CS46XX=m
CONFIG_SND_CS46XX_NEW_DSP=y
CONFIG_SND_CS5530=m
CONFIG_SND_CS5535AUDIO=m
CONFIG_SND_CTXFI=m
CONFIG_SND_DARLA20=m
CONFIG_SND_GINA20=m
CONFIG_SND_LAYLA20=m
CONFIG_SND_DARLA24=m
CONFIG_SND_GINA24=m
CONFIG_SND_LAYLA24=m
CONFIG_SND_MONA=m
CONFIG_SND_MIA=m
CONFIG_SND_ECHO3G=m
CONFIG_SND_INDIGO=m
CONFIG_SND_INDIGOIO=m
CONFIG_SND_INDIGODJ=m
CONFIG_SND_INDIGOIOX=m
CONFIG_SND_INDIGODJX=m
CONFIG_SND_EMU10K1=m
CONFIG_SND_EMU10K1X=m
CONFIG_SND_ENS1370=m
CONFIG_SND_ENS1371=m
CONFIG_SND_ES1938=m
CONFIG_SND_ES1968=m
CONFIG_SND_ES1968_INPUT=y
CONFIG_SND_ES1968_RADIO=y
CONFIG_SND_FM801=m
CONFIG_SND_FM801_TEA575X_BOOL=y
CONFIG_SND_HDA_INTEL=m
CONFIG_SND_HDA_PREALLOC_SIZE=4096
CONFIG_SND_HDA_HWDEP=y
CONFIG_SND_HDA_RECONFIG=y
CONFIG_SND_HDA_INPUT_BEEP=y
CONFIG_SND_HDA_INPUT_BEEP_MODE=0
CONFIG_SND_HDA_INPUT_JACK=y
CONFIG_SND_HDA_PATCH_LOADER=y
CONFIG_SND_HDA_CODEC_REALTEK=y
CONFIG_SND_HDA_CODEC_ANALOG=y
CONFIG_SND_HDA_CODEC_SIGMATEL=y
CONFIG_SND_HDA_CODEC_VIA=y
CONFIG_SND_HDA_CODEC_HDMI=y
CONFIG_SND_HDA_CODEC_CIRRUS=y
CONFIG_SND_HDA_CODEC_CONEXANT=y
CONFIG_SND_HDA_CODEC_CA0110=y
CONFIG_SND_HDA_CODEC_CA0132=y
CONFIG_SND_HDA_CODEC_CMEDIA=y
CONFIG_SND_HDA_CODEC_SI3054=y
CONFIG_SND_HDA_GENERIC=y
CONFIG_SND_HDA_POWER_SAVE_DEFAULT=0
CONFIG_SND_HDSP=m
CONFIG_SND_HDSPM=m
CONFIG_SND_ICE1712=m
CONFIG_SND_ICE1724=m
CONFIG_SND_INTEL8X0=m
CONFIG_SND_INTEL8X0M=m
CONFIG_SND_KORG1212=m
CONFIG_SND_LOLA=m
CONFIG_SND_LX6464ES=m
CONFIG_SND_MAESTRO3=m
CONFIG_SND_MAESTRO3_INPUT=y
CONFIG_SND_MIXART=m
CONFIG_SND_NM256=m
CONFIG_SND_PCXHR=m
CONFIG_SND_RIPTIDE=m
CONFIG_SND_RME32=m
CONFIG_SND_RME96=m
CONFIG_SND_RME9652=m
CONFIG_SND_SONICVIBES=m
CONFIG_SND_TRIDENT=m
CONFIG_SND_VIA82XX=m
CONFIG_SND_VIA82XX_MODEM=m
CONFIG_SND_VIRTUOSO=m
CONFIG_SND_VX222=m
CONFIG_SND_YMFPCI=m
CONFIG_SND_USB=y
CONFIG_SND_USB_AUDIO=m
CONFIG_SND_USB_UA101=m
CONFIG_SND_USB_USX2Y=m
CONFIG_SND_USB_CAIAQ=m
CONFIG_SND_USB_CAIAQ_INPUT=y
CONFIG_SND_USB_US122L=m
CONFIG_SND_USB_6FIRE=m
CONFIG_SND_FIREWIRE=y
CONFIG_SND_FIREWIRE_LIB=m
CONFIG_SND_FIREWIRE_SPEAKERS=m
CONFIG_SND_ISIGHT=m
# CONFIG_SND_PCMCIA is not set
# CONFIG_SND_SOC is not set
# CONFIG_SOUND_PRIME is not set
CONFIG_AC97_BUS=m

#
# HID support
#
CONFIG_HID=y
CONFIG_HID_BATTERY_STRENGTH=y
CONFIG_HIDRAW=y
CONFIG_UHID=m
CONFIG_HID_GENERIC=y

#
# Special HID drivers
#
CONFIG_HID_A4TECH=y
CONFIG_HID_ACRUX=m
CONFIG_HID_ACRUX_FF=y
CONFIG_HID_APPLE=y
CONFIG_HID_AUREAL=m
CONFIG_HID_BELKIN=y
CONFIG_HID_CHERRY=y
CONFIG_HID_CHICONY=y
CONFIG_HID_PRODIKEYS=m
CONFIG_HID_CYPRESS=y
CONFIG_HID_DRAGONRISE=m
CONFIG_DRAGONRISE_FF=y
CONFIG_HID_EMS_FF=m
CONFIG_HID_ELECOM=m
CONFIG_HID_EZKEY=y
CONFIG_HID_HOLTEK=m
CONFIG_HOLTEK_FF=y
CONFIG_HID_KEYTOUCH=m
CONFIG_HID_KYE=m
CONFIG_HID_UCLOGIC=m
CONFIG_HID_WALTOP=m
CONFIG_HID_GYRATION=m
CONFIG_HID_TWINHAN=m
CONFIG_HID_KENSINGTON=y
CONFIG_HID_LCPOWER=m
CONFIG_HID_LENOVO_TPKBD=m
CONFIG_HID_LOGITECH=y
CONFIG_HID_LOGITECH_DJ=m
CONFIG_LOGITECH_FF=y
CONFIG_LOGIRUMBLEPAD2_FF=y
CONFIG_LOGIG940_FF=y
CONFIG_LOGIWHEELS_FF=y
CONFIG_HID_MAGICMOUSE=m
CONFIG_HID_MICROSOFT=y
CONFIG_HID_MONTEREY=y
CONFIG_HID_MULTITOUCH=m
CONFIG_HID_NTRIG=y
CONFIG_HID_ORTEK=m
CONFIG_HID_PANTHERLORD=m
CONFIG_PANTHERLORD_FF=y
CONFIG_HID_PETALYNX=m
CONFIG_HID_PICOLCD=m
CONFIG_HID_PICOLCD_FB=y
CONFIG_HID_PICOLCD_BACKLIGHT=y
CONFIG_HID_PICOLCD_LCD=y
CONFIG_HID_PICOLCD_LEDS=y
CONFIG_HID_PICOLCD_CIR=y
CONFIG_HID_PRIMAX=m
# CONFIG_HID_PS3REMOTE is not set
CONFIG_HID_ROCCAT=m
CONFIG_HID_SAITEK=m
CONFIG_HID_SAMSUNG=m
CONFIG_HID_SONY=m
CONFIG_HID_SPEEDLINK=m
CONFIG_HID_SUNPLUS=m
CONFIG_HID_GREENASIA=m
CONFIG_GREENASIA_FF=y
CONFIG_HID_HYPERV_MOUSE=m
CONFIG_HID_SMARTJOYPLUS=m
CONFIG_SMARTJOYPLUS_FF=y
CONFIG_HID_TIVO=m
CONFIG_HID_TOPSEED=m
CONFIG_HID_THRUSTMASTER=m
CONFIG_THRUSTMASTER_FF=y
CONFIG_HID_WACOM=m
CONFIG_HID_WIIMOTE=m
CONFIG_HID_WIIMOTE_EXT=y
CONFIG_HID_ZEROPLUS=m
CONFIG_ZEROPLUS_FF=y
CONFIG_HID_ZYDACRON=m
# CONFIG_HID_SENSOR_HUB is not set

#
# USB HID support
#
CONFIG_USB_HID=y
CONFIG_HID_PID=y
CONFIG_USB_HIDDEV=y
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB_ARCH_HAS_EHCI=y
CONFIG_USB_ARCH_HAS_XHCI=y
CONFIG_USB_SUPPORT=y
CONFIG_USB_COMMON=y
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB=y
# CONFIG_USB_DEBUG is not set
CONFIG_USB_ANNOUNCE_NEW_DEVICES=y

#
# Miscellaneous USB options
#
# CONFIG_USB_DYNAMIC_MINORS is not set
CONFIG_USB_SUSPEND=y
# CONFIG_USB_OTG is not set
CONFIG_USB_MON=y
CONFIG_USB_WUSB=m
CONFIG_USB_WUSB_CBAF=m
# CONFIG_USB_WUSB_CBAF_DEBUG is not set

#
# USB Host Controller Drivers
#
# CONFIG_USB_C67X00_HCD is not set
CONFIG_USB_XHCI_HCD=y
# CONFIG_USB_XHCI_HCD_DEBUGGING is not set
CONFIG_USB_EHCI_HCD=y
CONFIG_USB_EHCI_ROOT_HUB_TT=y
CONFIG_USB_EHCI_TT_NEWSCHED=y
# CONFIG_USB_OXU210HP_HCD is not set
# CONFIG_USB_ISP116X_HCD is not set
# CONFIG_USB_ISP1760_HCD is not set
CONFIG_USB_ISP1362_HCD=m
CONFIG_USB_OHCI_HCD=y
# CONFIG_USB_OHCI_HCD_PLATFORM is not set
# CONFIG_USB_EHCI_HCD_PLATFORM is not set
# CONFIG_USB_OHCI_BIG_ENDIAN_DESC is not set
# CONFIG_USB_OHCI_BIG_ENDIAN_MMIO is not set
CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_UHCI_HCD=y
CONFIG_USB_U132_HCD=m
CONFIG_USB_SL811_HCD=m
CONFIG_USB_SL811_HCD_ISO=y
# CONFIG_USB_SL811_CS is not set
# CONFIG_USB_R8A66597_HCD is not set
CONFIG_USB_WHCI_HCD=m
CONFIG_USB_HWA_HCD=m
# CONFIG_USB_HCD_BCMA is not set
# CONFIG_USB_HCD_SSB is not set
# CONFIG_USB_CHIPIDEA is not set

#
# USB Device Class drivers
#
CONFIG_USB_ACM=m
CONFIG_USB_PRINTER=m
CONFIG_USB_WDM=m
CONFIG_USB_TMC=m

#
# NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may
#

#
# also be needed; see USB_STORAGE Help for more info
#
CONFIG_USB_STORAGE=m
# CONFIG_USB_STORAGE_DEBUG is not set
CONFIG_USB_STORAGE_REALTEK=m
CONFIG_REALTEK_AUTOPM=y
CONFIG_USB_STORAGE_DATAFAB=m
CONFIG_USB_STORAGE_FREECOM=m
CONFIG_USB_STORAGE_ISD200=m
CONFIG_USB_STORAGE_USBAT=m
CONFIG_USB_STORAGE_SDDR09=m
CONFIG_USB_STORAGE_SDDR55=m
CONFIG_USB_STORAGE_JUMPSHOT=m
CONFIG_USB_STORAGE_ALAUDA=m
CONFIG_USB_STORAGE_ONETOUCH=m
CONFIG_USB_STORAGE_KARMA=m
CONFIG_USB_STORAGE_CYPRESS_ATACB=m
CONFIG_USB_STORAGE_ENE_UB6250=m
# CONFIG_USB_UAS is not set

#
# USB Imaging devices
#
CONFIG_USB_MDC800=m
CONFIG_USB_MICROTEK=m

#
# USB port drivers
#
CONFIG_USB_USS720=m
CONFIG_USB_SERIAL=y
CONFIG_USB_SERIAL_CONSOLE=y
CONFIG_USB_SERIAL_GENERIC=y
CONFIG_USB_SERIAL_AIRCABLE=m
CONFIG_USB_SERIAL_ARK3116=m
CONFIG_USB_SERIAL_BELKIN=m
CONFIG_USB_SERIAL_CH341=m
CONFIG_USB_SERIAL_WHITEHEAT=m
CONFIG_USB_SERIAL_DIGI_ACCELEPORT=m
CONFIG_USB_SERIAL_CP210X=m
CONFIG_USB_SERIAL_CYPRESS_M8=m
CONFIG_USB_SERIAL_EMPEG=m
CONFIG_USB_SERIAL_FTDI_SIO=m
CONFIG_USB_SERIAL_FUNSOFT=m
CONFIG_USB_SERIAL_VISOR=m
CONFIG_USB_SERIAL_IPAQ=m
CONFIG_USB_SERIAL_IR=m
CONFIG_USB_SERIAL_EDGEPORT=m
CONFIG_USB_SERIAL_EDGEPORT_TI=m
# CONFIG_USB_SERIAL_F81232 is not set
CONFIG_USB_SERIAL_GARMIN=m
CONFIG_USB_SERIAL_IPW=m
CONFIG_USB_SERIAL_IUU=m
CONFIG_USB_SERIAL_KEYSPAN_PDA=m
CONFIG_USB_SERIAL_KEYSPAN=m
CONFIG_USB_SERIAL_KLSI=m
CONFIG_USB_SERIAL_KOBIL_SCT=m
CONFIG_USB_SERIAL_MCT_U232=m
# CONFIG_USB_SERIAL_METRO is not set
CONFIG_USB_SERIAL_MOS7720=m
CONFIG_USB_SERIAL_MOS7715_PARPORT=y
CONFIG_USB_SERIAL_MOS7840=m
CONFIG_USB_SERIAL_MOTOROLA=m
CONFIG_USB_SERIAL_NAVMAN=m
CONFIG_USB_SERIAL_PL2303=m
CONFIG_USB_SERIAL_OTI6858=m
CONFIG_USB_SERIAL_QCAUX=m
CONFIG_USB_SERIAL_QUALCOMM=m
CONFIG_USB_SERIAL_SPCP8X5=m
CONFIG_USB_SERIAL_HP4X=m
CONFIG_USB_SERIAL_SAFE=m
CONFIG_USB_SERIAL_SAFE_PADDED=y
CONFIG_USB_SERIAL_SIEMENS_MPI=m
CONFIG_USB_SERIAL_SIERRAWIRELESS=m
CONFIG_USB_SERIAL_SYMBOL=m
CONFIG_USB_SERIAL_TI=m
CONFIG_USB_SERIAL_CYBERJACK=m
CONFIG_USB_SERIAL_XIRCOM=m
CONFIG_USB_SERIAL_WWAN=m
CONFIG_USB_SERIAL_OPTION=m
CONFIG_USB_SERIAL_OMNINET=m
CONFIG_USB_SERIAL_OPTICON=m
CONFIG_USB_SERIAL_VIVOPAY_SERIAL=m
# CONFIG_USB_SERIAL_ZIO is not set
# CONFIG_USB_SERIAL_ZTE is not set
CONFIG_USB_SERIAL_SSU100=m
CONFIG_USB_SERIAL_QT2=m
CONFIG_USB_SERIAL_DEBUG=m

#
# USB Miscellaneous drivers
#
CONFIG_USB_EMI62=m
CONFIG_USB_EMI26=m
CONFIG_USB_ADUTUX=m
CONFIG_USB_SEVSEG=m
# CONFIG_USB_RIO500 is not set
CONFIG_USB_LEGOTOWER=m
CONFIG_USB_LCD=m
CONFIG_USB_LED=m
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
CONFIG_USB_IDMOUSE=m
CONFIG_USB_FTDI_ELAN=m
CONFIG_USB_APPLEDISPLAY=m
CONFIG_USB_SISUSBVGA=m
CONFIG_USB_SISUSBVGA_CON=y
CONFIG_USB_LD=m
CONFIG_USB_TRANCEVIBRATOR=m
CONFIG_USB_IOWARRIOR=m
# CONFIG_USB_TEST is not set
CONFIG_USB_ISIGHTFW=m
CONFIG_USB_YUREX=m
CONFIG_USB_EZUSB_FX2=m

#
# USB Physical Layer drivers
#
# CONFIG_OMAP_USB2 is not set
# CONFIG_USB_ISP1301 is not set
CONFIG_USB_ATM=m
CONFIG_USB_SPEEDTOUCH=m
CONFIG_USB_CXACRU=m
CONFIG_USB_UEAGLEATM=m
CONFIG_USB_XUSBATM=m
# CONFIG_USB_GADGET is not set

#
# OTG and related infrastructure
#
CONFIG_USB_OTG_UTILS=y
CONFIG_NOP_USB_XCEIV=m
CONFIG_UWB=m
CONFIG_UWB_HWA=m
CONFIG_UWB_WHCI=m
CONFIG_UWB_I1480U=m
CONFIG_MMC=m
# CONFIG_MMC_DEBUG is not set
# CONFIG_MMC_UNSAFE_RESUME is not set
# CONFIG_MMC_CLKGATE is not set

#
# MMC/SD/SDIO Card Drivers
#
CONFIG_MMC_BLOCK=m
CONFIG_MMC_BLOCK_MINORS=8
CONFIG_MMC_BLOCK_BOUNCE=y
CONFIG_SDIO_UART=m
# CONFIG_MMC_TEST is not set

#
# MMC/SD/SDIO Host Controller Drivers
#
CONFIG_MMC_SDHCI=m
CONFIG_MMC_SDHCI_PCI=m
CONFIG_MMC_RICOH_MMC=y
CONFIG_MMC_SDHCI_PLTFM=m
CONFIG_MMC_WBSD=m
CONFIG_MMC_TIFM_SD=m
CONFIG_MMC_SDRICOH_CS=m
CONFIG_MMC_CB710=m
CONFIG_MMC_VIA_SDMMC=m
CONFIG_MMC_VUB300=m
CONFIG_MMC_USHC=m
CONFIG_MEMSTICK=m
# CONFIG_MEMSTICK_DEBUG is not set

#
# MemoryStick drivers
#
# CONFIG_MEMSTICK_UNSAFE_RESUME is not set
CONFIG_MSPRO_BLOCK=m

#
# MemoryStick Host Controller Drivers
#
CONFIG_MEMSTICK_TIFM_MS=m
CONFIG_MEMSTICK_JMICRON_38X=m
CONFIG_MEMSTICK_R592=m
CONFIG_NEW_LEDS=y
CONFIG_LEDS_CLASS=y

#
# LED drivers
#
CONFIG_LEDS_LM3530=m
# CONFIG_LEDS_LM3642 is not set
# CONFIG_LEDS_PCA9532 is not set
CONFIG_LEDS_LP3944=m
CONFIG_LEDS_LP5521=m
CONFIG_LEDS_LP5523=m
CONFIG_LEDS_CLEVO_MAIL=m
# CONFIG_LEDS_PCA955X is not set
# CONFIG_LEDS_PCA9633 is not set
# CONFIG_LEDS_BD2802 is not set
CONFIG_LEDS_INTEL_SS4200=m
CONFIG_LEDS_DELL_NETBOOKS=m
# CONFIG_LEDS_TCA6507 is not set
# CONFIG_LEDS_LM355x is not set
# CONFIG_LEDS_OT200 is not set
CONFIG_LEDS_BLINKM=m
CONFIG_LEDS_TRIGGERS=y

#
# LED Triggers
#
CONFIG_LEDS_TRIGGER_TIMER=m
CONFIG_LEDS_TRIGGER_ONESHOT=m
CONFIG_LEDS_TRIGGER_HEARTBEAT=m
CONFIG_LEDS_TRIGGER_BACKLIGHT=m
# CONFIG_LEDS_TRIGGER_CPU is not set
CONFIG_LEDS_TRIGGER_DEFAULT_ON=m

#
# iptables trigger is under Netfilter config (LED target)
#
CONFIG_LEDS_TRIGGER_TRANSIENT=m
CONFIG_ACCESSIBILITY=y
CONFIG_A11Y_BRAILLE_CONSOLE=y
CONFIG_INFINIBAND=m
CONFIG_INFINIBAND_USER_MAD=m
CONFIG_INFINIBAND_USER_ACCESS=m
CONFIG_INFINIBAND_USER_MEM=y
CONFIG_INFINIBAND_ADDR_TRANS=y
CONFIG_INFINIBAND_MTHCA=m
CONFIG_INFINIBAND_MTHCA_DEBUG=y
CONFIG_INFINIBAND_IPATH=m
CONFIG_INFINIBAND_QIB=m
CONFIG_INFINIBAND_AMSO1100=m
# CONFIG_INFINIBAND_AMSO1100_DEBUG is not set
CONFIG_INFINIBAND_CXGB3=m
# CONFIG_INFINIBAND_CXGB3_DEBUG is not set
CONFIG_INFINIBAND_CXGB4=m
CONFIG_MLX4_INFINIBAND=m
CONFIG_INFINIBAND_NES=m
# CONFIG_INFINIBAND_NES_DEBUG is not set
# CONFIG_INFINIBAND_OCRDMA is not set
CONFIG_INFINIBAND_IPOIB=m
CONFIG_INFINIBAND_IPOIB_CM=y
CONFIG_INFINIBAND_IPOIB_DEBUG=y
CONFIG_INFINIBAND_IPOIB_DEBUG_DATA=y
CONFIG_INFINIBAND_SRP=m
CONFIG_INFINIBAND_SRPT=m
CONFIG_INFINIBAND_ISER=m
CONFIG_EDAC=y

#
# Reporting subsystems
#
CONFIG_EDAC_LEGACY_SYSFS=y
# CONFIG_EDAC_DEBUG is not set
CONFIG_EDAC_DECODE_MCE=m
CONFIG_EDAC_MCE_INJ=m
CONFIG_EDAC_MM_EDAC=m
CONFIG_EDAC_AMD64=m
# CONFIG_EDAC_AMD64_ERROR_INJECTION is not set
CONFIG_EDAC_E752X=m
CONFIG_EDAC_I82975X=m
CONFIG_EDAC_I3000=m
CONFIG_EDAC_I3200=m
CONFIG_EDAC_X38=m
CONFIG_EDAC_I5400=m
CONFIG_EDAC_I7CORE=m
CONFIG_EDAC_I5000=m
CONFIG_EDAC_I5100=m
CONFIG_EDAC_I7300=m
CONFIG_EDAC_SBRIDGE=m
CONFIG_RTC_LIB=y
CONFIG_RTC_CLASS=y
CONFIG_RTC_HCTOSYS=y
CONFIG_RTC_HCTOSYS_DEVICE="rtc0"
# CONFIG_RTC_DEBUG is not set

#
# RTC interfaces
#
CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y
# CONFIG_RTC_INTF_DEV_UIE_EMUL is not set
# CONFIG_RTC_DRV_TEST is not set

#
# I2C RTC drivers
#
CONFIG_RTC_DRV_DS1307=m
CONFIG_RTC_DRV_DS1374=m
CONFIG_RTC_DRV_DS1672=m
CONFIG_RTC_DRV_DS3232=m
CONFIG_RTC_DRV_MAX6900=m
CONFIG_RTC_DRV_RS5C372=m
CONFIG_RTC_DRV_ISL1208=m
CONFIG_RTC_DRV_ISL12022=m
CONFIG_RTC_DRV_X1205=m
CONFIG_RTC_DRV_PCF8563=m
CONFIG_RTC_DRV_PCF8583=m
CONFIG_RTC_DRV_M41T80=m
CONFIG_RTC_DRV_M41T80_WDT=y
CONFIG_RTC_DRV_BQ32K=m
# CONFIG_RTC_DRV_S35390A is not set
CONFIG_RTC_DRV_FM3130=m
CONFIG_RTC_DRV_RX8581=m
CONFIG_RTC_DRV_RX8025=m
CONFIG_RTC_DRV_EM3027=m
CONFIG_RTC_DRV_RV3029C2=m

#
# SPI RTC drivers
#

#
# Platform RTC drivers
#
CONFIG_RTC_DRV_CMOS=y
CONFIG_RTC_DRV_DS1286=m
CONFIG_RTC_DRV_DS1511=m
CONFIG_RTC_DRV_DS1553=m
CONFIG_RTC_DRV_DS1742=m
CONFIG_RTC_DRV_STK17TA8=m
# CONFIG_RTC_DRV_M48T86 is not set
CONFIG_RTC_DRV_M48T35=m
CONFIG_RTC_DRV_M48T59=m
CONFIG_RTC_DRV_MSM6242=m
CONFIG_RTC_DRV_BQ4802=m
CONFIG_RTC_DRV_RP5C01=m
CONFIG_RTC_DRV_V3020=m
# CONFIG_RTC_DRV_DS2404 is not set

#
# on-CPU RTC drivers
#
CONFIG_DMADEVICES=y
# CONFIG_DMADEVICES_DEBUG is not set

#
# DMA Devices
#
# CONFIG_INTEL_MID_DMAC is not set
CONFIG_INTEL_IOATDMA=m
# CONFIG_TIMB_DMA is not set
CONFIG_PCH_DMA=m
CONFIG_DMA_ENGINE=y

#
# DMA Clients
#
CONFIG_NET_DMA=y
CONFIG_ASYNC_TX_DMA=y
# CONFIG_DMATEST is not set
CONFIG_DCA=m
CONFIG_AUXDISPLAY=y
CONFIG_KS0108=m
CONFIG_KS0108_PORT=0x378
CONFIG_KS0108_DELAY=2
CONFIG_CFAG12864B=m
CONFIG_CFAG12864B_RATE=20
CONFIG_UIO=m
CONFIG_UIO_CIF=m
# CONFIG_UIO_PDRV is not set
# CONFIG_UIO_PDRV_GENIRQ is not set
CONFIG_UIO_AEC=m
CONFIG_UIO_SERCOS3=m
CONFIG_UIO_PCI_GENERIC=m
# CONFIG_UIO_NETX is not set
# CONFIG_VFIO is not set
CONFIG_VIRTIO=y

#
# Virtio drivers
#
CONFIG_VIRTIO_PCI=y
CONFIG_VIRTIO_BALLOON=m
CONFIG_VIRTIO_MMIO=m
# CONFIG_VIRTIO_MMIO_CMDLINE_DEVICES is not set

#
# Microsoft Hyper-V guest support
#
CONFIG_HYPERV=m
CONFIG_HYPERV_UTILS=m

#
# Xen driver support
#
CONFIG_XEN_BALLOON=y
CONFIG_XEN_SELFBALLOONING=y
CONFIG_XEN_SCRUB_PAGES=y
CONFIG_XEN_DEV_EVTCHN=m
CONFIG_XEN_BACKEND=y
CONFIG_XENFS=m
CONFIG_XEN_COMPAT_XENFS=y
CONFIG_XEN_SYS_HYPERVISOR=y
CONFIG_XEN_XENBUS_FRONTEND=y
CONFIG_XEN_GNTDEV=m
CONFIG_XEN_GRANT_DEV_ALLOC=m
CONFIG_SWIOTLB_XEN=y
CONFIG_XEN_TMEM=y
CONFIG_XEN_PCIDEV_BACKEND=m
CONFIG_XEN_PRIVCMD=m
CONFIG_XEN_ACPI_PROCESSOR=m
# CONFIG_XEN_MCE_LOG is not set
CONFIG_STAGING=y
# CONFIG_ET131X is not set
# CONFIG_SLICOSS is not set
# CONFIG_USBIP_CORE is not set
# CONFIG_W35UND is not set
# CONFIG_PRISM2_USB is not set
# CONFIG_ECHO is not set
# CONFIG_COMEDI is not set
# CONFIG_ASUS_OLED is not set
# CONFIG_PANEL is not set
# CONFIG_R8187SE is not set
# CONFIG_RTL8192U is not set
# CONFIG_RTLLIB is not set
CONFIG_R8712U=m
# CONFIG_RTS_PSTOR is not set
# CONFIG_RTS5139 is not set
# CONFIG_TRANZPORT is not set
# CONFIG_IDE_PHISON is not set
# CONFIG_LINE6_USB is not set
# CONFIG_USB_SERIAL_QUATECH2 is not set
# CONFIG_VT6655 is not set
# CONFIG_VT6656 is not set
# CONFIG_DX_SEP is not set
# CONFIG_ZSMALLOC is not set
# CONFIG_WLAGS49_H2 is not set
# CONFIG_WLAGS49_H25 is not set
# CONFIG_FB_SM7XX is not set
CONFIG_CRYSTALHD=m
# CONFIG_FB_XGI is not set
# CONFIG_ACPI_QUICKSTART is not set
# CONFIG_USB_ENESTORAGE is not set
# CONFIG_BCM_WIMAX is not set
# CONFIG_FT1000 is not set

#
# Speakup console speech
#
# CONFIG_SPEAKUP is not set
# CONFIG_TOUCHSCREEN_SYNAPTICS_I2C_RMI4 is not set
CONFIG_STAGING_MEDIA=y
# CONFIG_DVB_AS102 is not set
# CONFIG_DVB_CXD2099 is not set
# CONFIG_VIDEO_DT3155 is not set
# CONFIG_VIDEO_GO7007 is not set
# CONFIG_SOLO6X10 is not set
CONFIG_LIRC_STAGING=y
CONFIG_LIRC_BT829=m
CONFIG_LIRC_IGORPLUGUSB=m
CONFIG_LIRC_IMON=m
CONFIG_LIRC_PARALLEL=m
CONFIG_LIRC_SASEM=m
CONFIG_LIRC_SERIAL=m
CONFIG_LIRC_SERIAL_TRANSMITTER=y
CONFIG_LIRC_SIR=m
CONFIG_LIRC_ZILOG=m

#
# Android
#
# CONFIG_ANDROID is not set
# CONFIG_PHONE is not set
# CONFIG_USB_WPAN_HCD is not set
# CONFIG_IPACK_BUS is not set
# CONFIG_WIMAX_GDM72XX is not set
# CONFIG_CSR_WIFI is not set
# CONFIG_ZCACHE2 is not set
CONFIG_NET_VENDOR_SILICOM=y
# CONFIG_SBYPASS is not set
# CONFIG_BPCTL is not set
# CONFIG_CED1401 is not set
# CONFIG_DGRP is not set
CONFIG_X86_PLATFORM_DEVICES=y
CONFIG_ACER_WMI=m
CONFIG_ACERHDF=m
CONFIG_ASUS_LAPTOP=m
CONFIG_DELL_LAPTOP=m
CONFIG_DELL_WMI=m
CONFIG_DELL_WMI_AIO=m
CONFIG_FUJITSU_LAPTOP=m
# CONFIG_FUJITSU_LAPTOP_DEBUG is not set
CONFIG_FUJITSU_TABLET=m
CONFIG_AMILO_RFKILL=m
CONFIG_HP_ACCEL=m
CONFIG_HP_WMI=m
CONFIG_MSI_LAPTOP=m
CONFIG_PANASONIC_LAPTOP=m
CONFIG_COMPAL_LAPTOP=m
CONFIG_SONY_LAPTOP=m
CONFIG_SONYPI_COMPAT=y
CONFIG_IDEAPAD_LAPTOP=m
CONFIG_THINKPAD_ACPI=m
CONFIG_THINKPAD_ACPI_ALSA_SUPPORT=y
# CONFIG_THINKPAD_ACPI_DEBUGFACILITIES is not set
# CONFIG_THINKPAD_ACPI_DEBUG is not set
# CONFIG_THINKPAD_ACPI_UNSAFE_LEDS is not set
CONFIG_THINKPAD_ACPI_VIDEO=y
CONFIG_THINKPAD_ACPI_HOTKEY_POLL=y
CONFIG_SENSORS_HDAPS=m
# CONFIG_INTEL_MENLOW is not set
CONFIG_EEEPC_LAPTOP=m
CONFIG_ASUS_WMI=m
CONFIG_ASUS_NB_WMI=m
CONFIG_EEEPC_WMI=m
CONFIG_ACPI_WMI=m
CONFIG_MSI_WMI=m
CONFIG_TOPSTAR_LAPTOP=m
CONFIG_ACPI_TOSHIBA=m
CONFIG_TOSHIBA_BT_RFKILL=m
CONFIG_ACPI_CMPC=m
CONFIG_INTEL_IPS=m
# CONFIG_IBM_RTL is not set
# CONFIG_XO15_EBOOK is not set
CONFIG_SAMSUNG_LAPTOP=m
CONFIG_MXM_WMI=m
CONFIG_INTEL_OAKTRAIL=m
CONFIG_SAMSUNG_Q10=m
CONFIG_APPLE_GMUX=m

#
# Hardware Spinlock drivers
#
CONFIG_CLKEVT_I8253=y
CONFIG_I8253_LOCK=y
CONFIG_CLKBLD_I8253=y
CONFIG_IOMMU_API=y
CONFIG_IOMMU_SUPPORT=y
CONFIG_AMD_IOMMU=y
CONFIG_AMD_IOMMU_STATS=y
CONFIG_AMD_IOMMU_V2=m
CONFIG_DMAR_TABLE=y
CONFIG_INTEL_IOMMU=y
# CONFIG_INTEL_IOMMU_DEFAULT_ON is not set
CONFIG_INTEL_IOMMU_FLOPPY_WA=y
CONFIG_IRQ_REMAP=y

#
# Remoteproc drivers (EXPERIMENTAL)
#
# CONFIG_STE_MODEM_RPROC is not set

#
# Rpmsg drivers (EXPERIMENTAL)
#
# CONFIG_VIRT_DRIVERS is not set
# CONFIG_PM_DEVFREQ is not set
# CONFIG_EXTCON is not set
# CONFIG_MEMORY is not set
# CONFIG_IIO is not set
# CONFIG_VME_BUS is not set
# CONFIG_PWM is not set

#
# Firmware Drivers
#
CONFIG_EDD=m
# CONFIG_EDD_OFF is not set
CONFIG_FIRMWARE_MEMMAP=y
CONFIG_EFI_VARS=y
CONFIG_DELL_RBU=m
CONFIG_DCDBAS=m
CONFIG_DMIID=y
CONFIG_DMI_SYSFS=y
CONFIG_ISCSI_IBFT_FIND=y
CONFIG_ISCSI_IBFT=m
# CONFIG_GOOGLE_FIRMWARE is not set

#
# File systems
#
CONFIG_DCACHE_WORD_ACCESS=y
# CONFIG_EXT2_FS is not set
# CONFIG_EXT3_FS is not set
CONFIG_EXT4_FS=y
CONFIG_EXT4_USE_FOR_EXT23=y
CONFIG_EXT4_FS_XATTR=y
CONFIG_EXT4_FS_POSIX_ACL=y
CONFIG_EXT4_FS_SECURITY=y
# CONFIG_EXT4_DEBUG is not set
CONFIG_JBD2=y
# CONFIG_JBD2_DEBUG is not set
CONFIG_FS_MBCACHE=y
CONFIG_REISERFS_FS=m
# CONFIG_REISERFS_CHECK is not set
CONFIG_REISERFS_PROC_INFO=y
CONFIG_REISERFS_FS_XATTR=y
CONFIG_REISERFS_FS_POSIX_ACL=y
CONFIG_REISERFS_FS_SECURITY=y
CONFIG_JFS_FS=m
CONFIG_JFS_POSIX_ACL=y
CONFIG_JFS_SECURITY=y
# CONFIG_JFS_DEBUG is not set
# CONFIG_JFS_STATISTICS is not set
CONFIG_XFS_FS=m
CONFIG_XFS_QUOTA=y
CONFIG_XFS_POSIX_ACL=y
# CONFIG_XFS_RT is not set
# CONFIG_XFS_DEBUG is not set
CONFIG_GFS2_FS=m
CONFIG_GFS2_FS_LOCKING_DLM=y
CONFIG_OCFS2_FS=m
CONFIG_OCFS2_FS_O2CB=m
CONFIG_OCFS2_FS_USERSPACE_CLUSTER=m
# CONFIG_OCFS2_FS_STATS is not set
# CONFIG_OCFS2_DEBUG_MASKLOG is not set
# CONFIG_OCFS2_DEBUG_FS is not set
CONFIG_BTRFS_FS=m
CONFIG_BTRFS_FS_POSIX_ACL=y
# CONFIG_BTRFS_FS_CHECK_INTEGRITY is not set
CONFIG_NILFS2_FS=m
CONFIG_FS_POSIX_ACL=y
CONFIG_EXPORTFS=y
CONFIG_FILE_LOCKING=y
CONFIG_FSNOTIFY=y
CONFIG_DNOTIFY=y
CONFIG_INOTIFY_USER=y
CONFIG_FANOTIFY=y
CONFIG_FANOTIFY_ACCESS_PERMISSIONS=y
CONFIG_QUOTA=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
# CONFIG_PRINT_QUOTA_WARNING is not set
# CONFIG_QUOTA_DEBUG is not set
CONFIG_QUOTA_TREE=y
# CONFIG_QFMT_V1 is not set
CONFIG_QFMT_V2=y
CONFIG_QUOTACTL=y
CONFIG_QUOTACTL_COMPAT=y
CONFIG_AUTOFS4_FS=y
CONFIG_FUSE_FS=m
CONFIG_CUSE=m
CONFIG_GENERIC_ACL=y

#
# Caches
#
CONFIG_FSCACHE=m
CONFIG_FSCACHE_STATS=y
# CONFIG_FSCACHE_HISTOGRAM is not set
# CONFIG_FSCACHE_DEBUG is not set
CONFIG_FSCACHE_OBJECT_LIST=y
CONFIG_CACHEFILES=m
# CONFIG_CACHEFILES_DEBUG is not set
# CONFIG_CACHEFILES_HISTOGRAM is not set

#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=y
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
CONFIG_UDF_FS=m
CONFIG_UDF_NLS=y

#
# DOS/FAT/NT Filesystems
#
CONFIG_FAT_FS=m
CONFIG_MSDOS_FS=m
CONFIG_VFAT_FS=m
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="ascii"
# CONFIG_NTFS_FS is not set

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_VMCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_PROC_PAGE_MONITOR=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_TMPFS_XATTR=y
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
CONFIG_CONFIGFS_FS=y
CONFIG_MISC_FILESYSTEMS=y
# CONFIG_ADFS_FS is not set
CONFIG_AFFS_FS=m
CONFIG_ECRYPT_FS=m
CONFIG_HFS_FS=m
CONFIG_HFSPLUS_FS=m
CONFIG_BEFS_FS=m
# CONFIG_BEFS_DEBUG is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
# CONFIG_JFFS2_FS is not set
CONFIG_UBIFS_FS=m
# CONFIG_UBIFS_FS_ADVANCED_COMPR is not set
CONFIG_UBIFS_FS_LZO=y
CONFIG_UBIFS_FS_ZLIB=y
# CONFIG_LOGFS is not set
CONFIG_CRAMFS=m
CONFIG_SQUASHFS=m
CONFIG_SQUASHFS_XATTR=y
CONFIG_SQUASHFS_ZLIB=y
CONFIG_SQUASHFS_LZO=y
CONFIG_SQUASHFS_XZ=y
# CONFIG_SQUASHFS_4K_DEVBLK_SIZE is not set
# CONFIG_SQUASHFS_EMBEDDED is not set
CONFIG_SQUASHFS_FRAGMENT_CACHE_SIZE=3
# CONFIG_VXFS_FS is not set
CONFIG_MINIX_FS=m
# CONFIG_OMFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
# CONFIG_QNX6FS_FS is not set
CONFIG_ROMFS_FS=m
CONFIG_ROMFS_BACKED_BY_BLOCK=y
# CONFIG_ROMFS_BACKED_BY_MTD is not set
# CONFIG_ROMFS_BACKED_BY_BOTH is not set
CONFIG_ROMFS_ON_BLOCK=y
CONFIG_PSTORE=y
# CONFIG_PSTORE_CONSOLE is not set
# CONFIG_PSTORE_FTRACE is not set
CONFIG_PSTORE_RAM=m
CONFIG_SYSV_FS=m
CONFIG_UFS_FS=m
# CONFIG_UFS_FS_WRITE is not set
# CONFIG_UFS_DEBUG is not set
# CONFIG_EXOFS_FS is not set
CONFIG_ORE=m
CONFIG_NETWORK_FILESYSTEMS=y
CONFIG_NFS_FS=m
CONFIG_NFS_V2=m
CONFIG_NFS_V3=m
CONFIG_NFS_V3_ACL=y
CONFIG_NFS_V4=m
# CONFIG_NFS_SWAP is not set
CONFIG_NFS_V4_1=y
CONFIG_PNFS_FILE_LAYOUT=m
CONFIG_PNFS_BLOCK=m
CONFIG_PNFS_OBJLAYOUT=m
CONFIG_NFS_V4_1_IMPLEMENTATION_ID_DOMAIN="kernel.org"
CONFIG_NFS_FSCACHE=y
# CONFIG_NFS_USE_LEGACY_DNS is not set
CONFIG_NFS_USE_KERNEL_DNS=y
CONFIG_NFS_DEBUG=y
CONFIG_NFSD=m
CONFIG_NFSD_V2_ACL=y
CONFIG_NFSD_V3=y
CONFIG_NFSD_V3_ACL=y
CONFIG_NFSD_V4=y
# CONFIG_NFSD_FAULT_INJECTION is not set
CONFIG_LOCKD=m
CONFIG_LOCKD_V4=y
CONFIG_NFS_ACL_SUPPORT=m
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=m
CONFIG_SUNRPC_GSS=m
CONFIG_SUNRPC_BACKCHANNEL=y
CONFIG_SUNRPC_XPRT_RDMA=m
CONFIG_RPCSEC_GSS_KRB5=m
CONFIG_SUNRPC_DEBUG=y
CONFIG_CEPH_FS=m
CONFIG_CIFS=m
CONFIG_CIFS_STATS=y
# CONFIG_CIFS_STATS2 is not set
CONFIG_CIFS_WEAK_PW_HASH=y
CONFIG_CIFS_UPCALL=y
CONFIG_CIFS_XATTR=y
CONFIG_CIFS_POSIX=y
CONFIG_CIFS_ACL=y
# CONFIG_CIFS_DEBUG2 is not set
CONFIG_CIFS_DFS_UPCALL=y
# CONFIG_CIFS_SMB2 is not set
CONFIG_CIFS_FSCACHE=y
CONFIG_NCP_FS=m
CONFIG_NCPFS_PACKET_SIGNING=y
CONFIG_NCPFS_IOCTL_LOCKING=y
CONFIG_NCPFS_STRONG=y
CONFIG_NCPFS_NFS_NS=y
CONFIG_NCPFS_OS2_NS=y
CONFIG_NCPFS_SMALLDOS=y
CONFIG_NCPFS_NLS=y
CONFIG_NCPFS_EXTRAS=y
CONFIG_CODA_FS=m
# CONFIG_AFS_FS is not set
CONFIG_9P_FS=m
CONFIG_9P_FSCACHE=y
CONFIG_9P_FS_POSIX_ACL=y
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="utf8"
CONFIG_NLS_CODEPAGE_437=y
CONFIG_NLS_CODEPAGE_737=m
CONFIG_NLS_CODEPAGE_775=m
CONFIG_NLS_CODEPAGE_850=m
CONFIG_NLS_CODEPAGE_852=m
CONFIG_NLS_CODEPAGE_855=m
CONFIG_NLS_CODEPAGE_857=m
CONFIG_NLS_CODEPAGE_860=m
CONFIG_NLS_CODEPAGE_861=m
CONFIG_NLS_CODEPAGE_862=m
CONFIG_NLS_CODEPAGE_863=m
CONFIG_NLS_CODEPAGE_864=m
CONFIG_NLS_CODEPAGE_865=m
CONFIG_NLS_CODEPAGE_866=m
CONFIG_NLS_CODEPAGE_869=m
CONFIG_NLS_CODEPAGE_936=m
CONFIG_NLS_CODEPAGE_950=m
CONFIG_NLS_CODEPAGE_932=m
CONFIG_NLS_CODEPAGE_949=m
CONFIG_NLS_CODEPAGE_874=m
CONFIG_NLS_ISO8859_8=m
CONFIG_NLS_CODEPAGE_1250=m
CONFIG_NLS_CODEPAGE_1251=m
CONFIG_NLS_ASCII=y
CONFIG_NLS_ISO8859_1=m
CONFIG_NLS_ISO8859_2=m
CONFIG_NLS_ISO8859_3=m
CONFIG_NLS_ISO8859_4=m
CONFIG_NLS_ISO8859_5=m
CONFIG_NLS_ISO8859_6=m
CONFIG_NLS_ISO8859_7=m
CONFIG_NLS_ISO8859_9=m
CONFIG_NLS_ISO8859_13=m
CONFIG_NLS_ISO8859_14=m
CONFIG_NLS_ISO8859_15=m
CONFIG_NLS_KOI8_R=m
CONFIG_NLS_KOI8_U=m
CONFIG_NLS_MAC_ROMAN=m
CONFIG_NLS_MAC_CELTIC=m
CONFIG_NLS_MAC_CENTEURO=m
CONFIG_NLS_MAC_CROATIAN=m
CONFIG_NLS_MAC_CYRILLIC=m
CONFIG_NLS_MAC_GAELIC=m
CONFIG_NLS_MAC_GREEK=m
CONFIG_NLS_MAC_ICELAND=m
CONFIG_NLS_MAC_INUIT=m
CONFIG_NLS_MAC_ROMANIAN=m
CONFIG_NLS_MAC_TURKISH=m
CONFIG_NLS_UTF8=m
CONFIG_DLM=m
CONFIG_DLM_DEBUG=y

#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_PRINTK_TIME=y
CONFIG_DEFAULT_MESSAGE_LOGLEVEL=4
# CONFIG_ENABLE_WARN_DEPRECATED is not set
CONFIG_ENABLE_MUST_CHECK=y
CONFIG_FRAME_WARN=2048
CONFIG_MAGIC_SYSRQ=y
CONFIG_STRIP_ASM_SYMS=y
# CONFIG_READABLE_ASM is not set
CONFIG_UNUSED_SYMBOLS=y
CONFIG_DEBUG_FS=y
CONFIG_HEADERS_CHECK=y
# CONFIG_DEBUG_SECTION_MISMATCH is not set
CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_SHIRQ=y
CONFIG_LOCKUP_DETECTOR=y
CONFIG_HARDLOCKUP_DETECTOR=y
# CONFIG_BOOTPARAM_HARDLOCKUP_PANIC is not set
CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=0
# CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=0
# CONFIG_PANIC_ON_OOPS is not set
CONFIG_PANIC_ON_OOPS_VALUE=0
# CONFIG_DETECT_HUNG_TASK is not set
CONFIG_SCHED_DEBUG=y
CONFIG_SCHEDSTATS=y
CONFIG_TIMER_STATS=y
# CONFIG_DEBUG_OBJECTS is not set
# CONFIG_SLUB_DEBUG_ON is not set
# CONFIG_SLUB_STATS is not set
CONFIG_HAVE_DEBUG_KMEMLEAK=y
# CONFIG_DEBUG_KMEMLEAK is not set
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_RT_MUTEX_TESTER is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_MUTEXES is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
# CONFIG_PROVE_LOCKING is not set
CONFIG_SPARSE_RCU_POINTER=y
# CONFIG_LOCK_STAT is not set
# CONFIG_DEBUG_ATOMIC_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
CONFIG_STACKTRACE=y
# CONFIG_DEBUG_STACK_USAGE is not set
# CONFIG_DEBUG_KOBJECT is not set
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_INFO=y
# CONFIG_DEBUG_INFO_REDUCED is not set
CONFIG_DEBUG_VM=y
# CONFIG_DEBUG_VM_RB is not set
# CONFIG_DEBUG_VIRTUAL is not set
# CONFIG_DEBUG_WRITECOUNT is not set
CONFIG_DEBUG_MEMORY_INIT=y
CONFIG_DEBUG_LIST=y
# CONFIG_TEST_LIST_SORT is not set
# CONFIG_DEBUG_SG is not set
# CONFIG_DEBUG_NOTIFIERS is not set
# CONFIG_DEBUG_CREDENTIALS is not set
CONFIG_ARCH_WANT_FRAME_POINTERS=y
CONFIG_FRAME_POINTER=y
CONFIG_BOOT_PRINTK_DELAY=y
# CONFIG_RCU_TORTURE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=60
# CONFIG_RCU_CPU_STALL_INFO is not set
# CONFIG_RCU_TRACE is not set
# CONFIG_KPROBES_SANITY_TEST is not set
# CONFIG_BACKTRACE_SELF_TEST is not set
# CONFIG_DEBUG_BLOCK_EXT_DEVT is not set
# CONFIG_DEBUG_FORCE_WEAK_PER_CPU is not set
# CONFIG_DEBUG_PER_CPU_MAPS is not set
# CONFIG_LKDTM is not set
# CONFIG_NOTIFIER_ERROR_INJECTION is not set
# CONFIG_FAULT_INJECTION is not set
CONFIG_LATENCYTOP=y
# CONFIG_DEBUG_PAGEALLOC is not set
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_FP_TEST=y
CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_SYSCALL_TRACEPOINTS=y
CONFIG_HAVE_FENTRY=y
CONFIG_HAVE_C_RECORDMCOUNT=y
CONFIG_TRACER_MAX_TRACE=y
CONFIG_TRACE_CLOCK=y
CONFIG_RING_BUFFER=y
CONFIG_EVENT_TRACING=y
CONFIG_EVENT_POWER_TRACING_DEPRECATED=y
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_RING_BUFFER_ALLOW_SWAP=y
CONFIG_TRACING=y
CONFIG_GENERIC_TRACER=y
CONFIG_TRACING_SUPPORT=y
CONFIG_FTRACE=y
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
# CONFIG_IRQSOFF_TRACER is not set
CONFIG_SCHED_TRACER=y
CONFIG_FTRACE_SYSCALLS=y
CONFIG_BRANCH_PROFILE_NONE=y
# CONFIG_PROFILE_ANNOTATED_BRANCHES is not set
# CONFIG_PROFILE_ALL_BRANCHES is not set
CONFIG_STACK_TRACER=y
CONFIG_BLK_DEV_IO_TRACE=y
CONFIG_KPROBE_EVENT=y
CONFIG_UPROBE_EVENT=y
CONFIG_PROBE_EVENTS=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_FUNCTION_PROFILER=y
CONFIG_FTRACE_MCOUNT_RECORD=y
# CONFIG_FTRACE_STARTUP_TEST is not set
# CONFIG_MMIOTRACE is not set
CONFIG_RING_BUFFER_BENCHMARK=m
# CONFIG_RBTREE_TEST is not set
# CONFIG_INTERVAL_TREE_TEST is not set
CONFIG_PROVIDE_OHCI1394_DMA_INIT=y
# CONFIG_FIREWIRE_OHCI_REMOTE_DMA is not set
CONFIG_BUILD_DOCSRC=y
CONFIG_DYNAMIC_DEBUG=y
# CONFIG_DMA_API_DEBUG is not set
CONFIG_ATOMIC64_SELFTEST=y
CONFIG_ASYNC_RAID6_TEST=m
# CONFIG_SAMPLES is not set
CONFIG_HAVE_ARCH_KGDB=y
CONFIG_KGDB=y
CONFIG_KGDB_SERIAL_CONSOLE=y
CONFIG_KGDB_TESTS=y
# CONFIG_KGDB_TESTS_ON_BOOT is not set
CONFIG_KGDB_LOW_LEVEL_TRAP=y
CONFIG_KGDB_KDB=y
CONFIG_KDB_KEYBOARD=y
CONFIG_HAVE_ARCH_KMEMCHECK=y
CONFIG_TEST_KSTRTOX=y
CONFIG_STRICT_DEVMEM=y
# CONFIG_X86_VERBOSE_BOOTUP is not set
CONFIG_EARLY_PRINTK=y
CONFIG_EARLY_PRINTK_DBGP=y
CONFIG_DEBUG_STACKOVERFLOW=y
# CONFIG_X86_PTDUMP is not set
CONFIG_DEBUG_RODATA=y
CONFIG_DEBUG_RODATA_TEST=y
CONFIG_DEBUG_SET_MODULE_RONX=y
CONFIG_DEBUG_NX_TEST=m
# CONFIG_DEBUG_TLBFLUSH is not set
# CONFIG_IOMMU_DEBUG is not set
# CONFIG_IOMMU_STRESS is not set
CONFIG_HAVE_MMIOTRACE_SUPPORT=y
CONFIG_X86_DECODER_SELFTEST=y
CONFIG_IO_DELAY_TYPE_0X80=0
CONFIG_IO_DELAY_TYPE_0XED=1
CONFIG_IO_DELAY_TYPE_UDELAY=2
CONFIG_IO_DELAY_TYPE_NONE=3
CONFIG_IO_DELAY_0X80=y
# CONFIG_IO_DELAY_0XED is not set
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
CONFIG_DEFAULT_IO_DELAY_TYPE=0
CONFIG_DEBUG_BOOT_PARAMS=y
# CONFIG_CPA_DEBUG is not set
CONFIG_OPTIMIZE_INLINING=y
# CONFIG_DEBUG_STRICT_USER_COPY_CHECKS is not set
# CONFIG_DEBUG_NMI_SELFTEST is not set

#
# Security options
#
CONFIG_KEYS=y
CONFIG_TRUSTED_KEYS=m
CONFIG_ENCRYPTED_KEYS=m
CONFIG_KEYS_DEBUG_PROC_KEYS=y
# CONFIG_SECURITY_DMESG_RESTRICT is not set
CONFIG_SECURITY=y
CONFIG_SECURITYFS=y
CONFIG_SECURITY_NETWORK=y
CONFIG_SECURITY_NETWORK_XFRM=y
# CONFIG_SECURITY_PATH is not set
CONFIG_INTEL_TXT=y
CONFIG_LSM_MMAP_MIN_ADDR=65536
CONFIG_SECURITY_SELINUX=y
CONFIG_SECURITY_SELINUX_BOOTPARAM=y
CONFIG_SECURITY_SELINUX_BOOTPARAM_VALUE=1
CONFIG_SECURITY_SELINUX_DISABLE=y
CONFIG_SECURITY_SELINUX_DEVELOP=y
CONFIG_SECURITY_SELINUX_AVC_STATS=y
CONFIG_SECURITY_SELINUX_CHECKREQPROT_VALUE=1
# CONFIG_SECURITY_SELINUX_POLICYDB_VERSION_MAX is not set
# CONFIG_SECURITY_SMACK is not set
# CONFIG_SECURITY_TOMOYO is not set
# CONFIG_SECURITY_APPARMOR is not set
# CONFIG_SECURITY_YAMA is not set
# CONFIG_IMA is not set
CONFIG_DEFAULT_SECURITY_SELINUX=y
# CONFIG_DEFAULT_SECURITY_DAC is not set
CONFIG_DEFAULT_SECURITY="selinux"
CONFIG_XOR_BLOCKS=m
CONFIG_ASYNC_CORE=m
CONFIG_ASYNC_MEMCPY=m
CONFIG_ASYNC_XOR=m
CONFIG_ASYNC_PQ=m
CONFIG_ASYNC_RAID6_RECOV=m
CONFIG_ASYNC_TX_DISABLE_PQ_VAL_DMA=y
CONFIG_ASYNC_TX_DISABLE_XOR_VAL_DMA=y
CONFIG_CRYPTO=y

#
# Crypto core or helper
#
CONFIG_CRYPTO_FIPS=y
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_ALGAPI2=y
CONFIG_CRYPTO_AEAD=y
CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_BLKCIPHER=y
CONFIG_CRYPTO_BLKCIPHER2=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_RNG=y
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_PCOMP=m
CONFIG_CRYPTO_PCOMP2=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_MANAGER2=y
CONFIG_CRYPTO_USER=m
# CONFIG_CRYPTO_MANAGER_DISABLE_TESTS is not set
CONFIG_CRYPTO_GF128MUL=y
CONFIG_CRYPTO_NULL=m
CONFIG_CRYPTO_PCRYPT=m
CONFIG_CRYPTO_WORKQUEUE=y
CONFIG_CRYPTO_CRYPTD=y
CONFIG_CRYPTO_AUTHENC=m
CONFIG_CRYPTO_TEST=m
CONFIG_CRYPTO_ABLK_HELPER_X86=y
CONFIG_CRYPTO_GLUE_HELPER_X86=m

#
# Authenticated Encryption with Associated Data
#
CONFIG_CRYPTO_CCM=m
CONFIG_CRYPTO_GCM=m
CONFIG_CRYPTO_SEQIV=y

#
# Block modes
#
CONFIG_CRYPTO_CBC=y
CONFIG_CRYPTO_CTR=y
CONFIG_CRYPTO_CTS=m
CONFIG_CRYPTO_ECB=y
CONFIG_CRYPTO_LRW=y
CONFIG_CRYPTO_PCBC=m
CONFIG_CRYPTO_XTS=y

#
# Hash modes
#
CONFIG_CRYPTO_HMAC=y
CONFIG_CRYPTO_XCBC=m
CONFIG_CRYPTO_VMAC=m

#
# Digest
#
CONFIG_CRYPTO_CRC32C=y
CONFIG_CRYPTO_CRC32C_INTEL=m
CONFIG_CRYPTO_GHASH=m
CONFIG_CRYPTO_MD4=m
CONFIG_CRYPTO_MD5=y
CONFIG_CRYPTO_MICHAEL_MIC=m
CONFIG_CRYPTO_RMD128=m
CONFIG_CRYPTO_RMD160=m
CONFIG_CRYPTO_RMD256=m
CONFIG_CRYPTO_RMD320=m
CONFIG_CRYPTO_SHA1=y
CONFIG_CRYPTO_SHA1_SSSE3=m
CONFIG_CRYPTO_SHA256=y
CONFIG_CRYPTO_SHA512=m
CONFIG_CRYPTO_TGR192=m
CONFIG_CRYPTO_WP512=m
CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL=m

#
# Ciphers
#
CONFIG_CRYPTO_AES=y
CONFIG_CRYPTO_AES_X86_64=y
CONFIG_CRYPTO_AES_NI_INTEL=y
CONFIG_CRYPTO_ANUBIS=m
CONFIG_CRYPTO_ARC4=m
CONFIG_CRYPTO_BLOWFISH=m
CONFIG_CRYPTO_BLOWFISH_COMMON=m
CONFIG_CRYPTO_BLOWFISH_X86_64=m
CONFIG_CRYPTO_CAMELLIA=m
CONFIG_CRYPTO_CAMELLIA_X86_64=m
CONFIG_CRYPTO_CAST5=m
# CONFIG_CRYPTO_CAST5_AVX_X86_64 is not set
CONFIG_CRYPTO_CAST6=m
# CONFIG_CRYPTO_CAST6_AVX_X86_64 is not set
CONFIG_CRYPTO_DES=m
CONFIG_CRYPTO_FCRYPT=m
CONFIG_CRYPTO_KHAZAD=m
CONFIG_CRYPTO_SALSA20=m
CONFIG_CRYPTO_SALSA20_X86_64=m
CONFIG_CRYPTO_SEED=m
CONFIG_CRYPTO_SERPENT=m
CONFIG_CRYPTO_SERPENT_SSE2_X86_64=m
CONFIG_CRYPTO_SERPENT_AVX_X86_64=m
CONFIG_CRYPTO_TEA=m
CONFIG_CRYPTO_TWOFISH=m
CONFIG_CRYPTO_TWOFISH_COMMON=m
CONFIG_CRYPTO_TWOFISH_X86_64=m
CONFIG_CRYPTO_TWOFISH_X86_64_3WAY=m
CONFIG_CRYPTO_TWOFISH_AVX_X86_64=m

#
# Compression
#
CONFIG_CRYPTO_DEFLATE=m
CONFIG_CRYPTO_ZLIB=m
CONFIG_CRYPTO_LZO=m

#
# Random Number Generation
#
CONFIG_CRYPTO_ANSI_CPRNG=m
CONFIG_CRYPTO_USER_API=y
CONFIG_CRYPTO_USER_API_HASH=y
CONFIG_CRYPTO_USER_API_SKCIPHER=y
CONFIG_CRYPTO_HW=y
CONFIG_CRYPTO_DEV_PADLOCK=m
CONFIG_CRYPTO_DEV_PADLOCK_AES=m
CONFIG_CRYPTO_DEV_PADLOCK_SHA=m
CONFIG_ASYMMETRIC_KEY_TYPE=y
CONFIG_ASYMMETRIC_PUBLIC_KEY_SUBTYPE=y
CONFIG_PUBLIC_KEY_ALGO_RSA=y
CONFIG_X509_CERTIFICATE_PARSER=y
CONFIG_HAVE_KVM=y
CONFIG_HAVE_KVM_IRQCHIP=y
CONFIG_HAVE_KVM_EVENTFD=y
CONFIG_KVM_APIC_ARCHITECTURE=y
CONFIG_KVM_MMIO=y
CONFIG_KVM_ASYNC_PF=y
CONFIG_HAVE_KVM_MSI=y
CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT=y
CONFIG_VIRTUALIZATION=y
CONFIG_KVM=m
CONFIG_KVM_INTEL=m
CONFIG_KVM_AMD=m
CONFIG_KVM_MMU_AUDIT=y
CONFIG_VHOST_NET=m
CONFIG_TCM_VHOST=m
CONFIG_BINARY_PRINTF=y

#
# Library routines
#
CONFIG_RAID6_PQ=m
CONFIG_BITREVERSE=y
CONFIG_GENERIC_STRNCPY_FROM_USER=y
CONFIG_GENERIC_STRNLEN_USER=y
CONFIG_GENERIC_FIND_FIRST_BIT=y
CONFIG_GENERIC_PCI_IOMAP=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_IO=y
CONFIG_CRC_CCITT=m
CONFIG_CRC16=y
CONFIG_CRC_T10DIF=y
CONFIG_CRC_ITU_T=m
CONFIG_CRC32=y
# CONFIG_CRC32_SELFTEST is not set
CONFIG_CRC32_SLICEBY8=y
# CONFIG_CRC32_SLICEBY4 is not set
# CONFIG_CRC32_SARWATE is not set
# CONFIG_CRC32_BIT is not set
# CONFIG_CRC7 is not set
CONFIG_LIBCRC32C=m
CONFIG_CRC8=m
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=m
CONFIG_LZO_COMPRESS=y
CONFIG_LZO_DECOMPRESS=y
CONFIG_XZ_DEC=y
CONFIG_XZ_DEC_X86=y
CONFIG_XZ_DEC_POWERPC=y
CONFIG_XZ_DEC_IA64=y
CONFIG_XZ_DEC_ARM=y
CONFIG_XZ_DEC_ARMTHUMB=y
CONFIG_XZ_DEC_SPARC=y
CONFIG_XZ_DEC_BCJ=y
# CONFIG_XZ_DEC_TEST is not set
CONFIG_DECOMPRESS_GZIP=y
CONFIG_DECOMPRESS_BZIP2=y
CONFIG_DECOMPRESS_LZMA=y
CONFIG_DECOMPRESS_XZ=y
CONFIG_DECOMPRESS_LZO=y
CONFIG_GENERIC_ALLOCATOR=y
CONFIG_REED_SOLOMON=m
CONFIG_REED_SOLOMON_ENC8=y
CONFIG_REED_SOLOMON_DEC8=y
CONFIG_TEXTSEARCH=y
CONFIG_TEXTSEARCH_KMP=m
CONFIG_TEXTSEARCH_BM=m
CONFIG_TEXTSEARCH_FSM=m
CONFIG_BTREE=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_DMA=y
CONFIG_CHECK_SIGNATURE=y
CONFIG_CPU_RMAP=y
CONFIG_DQL=y
CONFIG_NLATTR=y
CONFIG_ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE=y
CONFIG_LRU_CACHE=m
CONFIG_AVERAGE=y
CONFIG_CLZ_TAB=y
CONFIG_CORDIC=m
# CONFIG_DDR is not set
CONFIG_MPILIB=y
CONFIG_OID_REGISTRY=y

[-- Attachment #3: dmesg.log --]
[-- Type: text/x-log, Size: 20970 bytes --]

    Welcome to Fedora 18 (Spherical Cow)!
     
    [  OK  ] Reached target Remote File Systems.
    [  OK  ] Listening on Syslog Socket.
    [  OK  ] Reached target Syslog.
             Starting Load legacy module configuration...
    [  OK  ] Listening on Delayed Shutdown Socket.
    [  OK  ] Listening on /dev/initctl Compatibility Named Pipe.
    [   68.651905] SELinux: initialized (dev autofs, type autofs), uses genfs_contexts
    [  OK  ] Listening on LVM2 metadata daemon socket.
             Starting Setup Virtual Console...
             Starting Apply Kernel Variables...
    [  OK  ] Set up automount Arbitrary Executable File Formats File System Automount Point.
    [   69.052143] SELinux: initialized (dev hugetlbfs, type hugetlbfs), uses transition SIDs
     
    [  OK  ] Reached target Encrypted Volumes.
             Starting Load Kernel Modules...
             Mounting Debug File System...
    [   69.326800] kvm: VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL does not work properly. Using workaround
             Mounting Huge Pages File System...
    [   69.519392] tun: Universal TUN/TAP device driver, 1.6
     
    [   69.519393] tun: (C) 1999-2004 Max Krasnyansky <maxk@qualcomm.com>
    [  OK  ] Listening on Device-mapper event daemon FIFOs.
             Expecting device dev-disk-by\x2duuid-8429f30c\x2d11e9\x2d4747\x2d8567\x2d163851aa58cd.device...
             Starting Configure read-only root support...
    [   69.932728] systemd-udevd[294]: starting version 194
    [  OK  ] Listening on udev Kernel Socket.
    [   69.956079] EXT4-fs (sda4): re-mounted. Opts: (null)
    [  OK  ] Listening on udev Control Socket.
             Starting udev Coldplug all Devices...
             Starting udev Kernel Device Manager...
             Starting Remount Root and Kernel File Systems...
             Expecting device dev-disk-by\x2duuid-2A8E\x2d753D.device...
             Expecting device dev-disk-by\x2duuid-95266d2c\x2d14af\x2d4bcb\x2d86b5\x2dd6e517680083.device...
             Expecting device dev-disk-by\x2duuid-db550ad3\x2d1bca\x2d4b61\x2db378\x2dac29e1279f65.device...
    [  OK  ] Started udev Kernel Device Manager.
    [  OK  ] Started Load legacy module configuration.
    [  OK  ] Started Setup Virtual Console.
    [  OK  ] Started Apply Kernel Variables.
    [^[[1;3[   70.831909] microcode: CPU0 sig=0x206c2, pf=0x1, revision=0x13
    2m  OK  ] Mounted POSIX Message Queue File S[   70.869425] microcode: CPU1 sig=0x206c2, pf=0x1, revision=0x13
    ystem.
    [  OK  ] Started Load Kernel[   70.870293] microcode: CPU2 sig=0x206c2, pf=0x1, revision=0x13
     Modules.
    [  OK  ] Mounted Debug Fi[   70.871149] microcode: CPU3 sig=0x206c2, pf=0x1, revision=0x13
    le System.
    [  OK  ] Mounted Huge Pa[   70.964589] microcode: CPU4 sig=0x206c2, pf=0x1, revision=0x13
    ges File System.
    [  OK  ] Started S[   70.965017] microcode: CPU5 sig=0x206c2, pf=0x1, revision=0x13
    oftware RAID Monitor Takeover.
    [  OK  ^[[   70.965740] microcode: CPU6 sig=0x206c2, pf=0x1, revision=0x13
    [   70.966384] microcode: CPU7 sig=0x206c2, pf=0x1, revision=0x13
     
    [  OK  ] Started Remount Root and Ke[   70.967130] microcode: Microcode Update Driver: v2.00 <tigran@aivazian.fsnet.co.uk>, Peter Oruba
    rnel File Systems.
    [  OK  ] Reached[   71.957411] dca service started, version 1.12.1
     target Local File Systems (Pre).
             Mou[   71.957412] EDAC MC: Ver: 3.0.0
    nting Temporary Directory...
    [   72.152175] i801_smbus 0000:00:1f.3: enabling device (0140 -> 0143)
    [   72.152374] i801_smbus 0000:00:1f.3: SMBus using PCI Interrupt
    [   72.165140] tpm_tis 00:09: 1.2 TPM (device-id 0xFE, rev-id 70)
    [   72.174116] SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
             Starting Load Random Seed...
    [   72.669517] bnx2: Broadcom NetXtreme II Gigabit Ethernet Driver bnx2 v2.2.3 (June 27, 2012)
    [   72.669615] input: PC Speaker as /devices/platform/pcspkr/input/input4
    [   72.669871] EDAC MC1: Giving out device to 'i7core_edac.c' 'i7 core #1': DEV 0000:fe:03.0
    [   72.669898] EDAC PCI0: Giving out device to module 'i7core_edac' controller 'EDAC PCI controller': DEV '0000:fe:03.0' (POLLED)
    [   73.265230] EDAC MC0: Giving out device to 'i7core_edac.c' 'i7 core #0': DEV 0000:ff:03.0
    [   73.265254] EDAC PCI1: Giving out device to module 'i7core_edac' controller 'EDAC PCI controller': DEV '0000:ff:03.0' (POLLED)
    [   73.265333] bnx2 0000:0b:00.0 eth0: Broadcom NetXtreme II BCM5716 1000Base-T (C0) PCI Express found at mem 92000000, IRQ 28, node addr 5c:f3:fc:55:d4:48
    [   73.265641] EDAC i7core: Driver loaded, 2 memory controller(s) found.
    [   73.266165] bnx2 0000:0b:00.1 eth1: Broadcom NetXtreme II BCM5716 1000Base-T (C0) PCI Express found at mem 94000000, IRQ 40, node addr 5c:f3:fc:55:d4:49
    [   73.266299] ACPI Warning: 0x0000000000000430-0x000000000000043f SystemIO conflicts with Region \_SB_.GPUS 1 (20120913/utaddress-251)
    [   73.266300] ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
    [   73.266968] shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
    [   73.290273] ioatdma: Intel(R) QuickData Technology Driver 4.00
    [   73.290322] ioatdma 0000:00:16.0: enabling device (0000 -> 0002)
    [   73.290451] ioatdma 0000:00:16.0: irq 68 for MSI/MSI-X
    [   73.290833] ioatdma 0000:00:16.1: enabling device (0000 -> 0002)
    [   73.290925] ioatdma 0000:00:16.1: irq 69 for MSI/MSI-X
    [   73.291151] ioatdma 0000:00:16.2: enabling device (0000 -> 0002)
    [   73.291212] ioatdma 0000:00:16.2: irq 70 for MSI/MSI-X
    [   73.291433] ioatdma 0000:00:16.3: enabling device (0000 -> 0002)
    [   73.291494] ioatdma 0000:00:16.3: irq 71 for MSI/MSI-X
    [   73.291957] ioatdma 0000:00:16.4: enabling device (0000 -> 0002)
    [   73.292057] ioatdma 0000:00:16.4: irq 72 for MSI/MSI-X
    [   73.292277] ioatdma 0000:00:16.5: enabling device (0000 -> 0002)
    [   73.292355] ioatdma 0000:00:16.5: irq 73 for MSI/MSI-X
    [   73.292597] ioatdma 0000:00:16.6: enabling device (0000 -> 0002)
    [   73.292735] ioatdma 0000:00:16.6: irq 74 for MSI/MSI-X
    [   73.292959] ioatdma 0000:00:16.7: enabling device (0000 -> 0002)
    [   73.293020] ioatdma 0000:00:16.7: irq 75 for MSI/MSI-X
    [   75.691285] iTCO_vendor_support: vendor-support=0
    [   75.813413] cdc_ether 4-2:1.0 usb0: register 'cdc_ether' at usb-0000:00:1a.1-2, CDC Ethernet Device, 5e:f3:fc:55:d4:4d
    [   75.813441] usbcore: registered new interface driver cdc_ether
    [   77.020152] iTCO_wdt: Intel TCO WatchDog Timer Driver v1.10
    [   77.020234] iTCO_wdt: unable to reset NO_REBOOT flag, device disabled by hardware/BIOS
             Mounting Configuration File System...
    [   77.454931] SELinux: initialized (dev configfs, type configfs), uses genfs_contexts
    [  OK  ] Stopped Trigger Flushing of Journal to Persistent Storage.
             Stopping Journal Service...
    [  OK  ] Stopped Journal Service.
             Starting Journal Service...
    [  OK  ] Started Journal Service.
    [  OK  ] Mounted Temporary Directory.
    [  OK  ] Mounted Configuration File System.
    [  OK  ] Started udev Coldplug all Devices[   77.943792] Adding 8175612k swap on /dev/sda3.  Priority:-1 extents:1 across:8175612k
    .
    [  OK  ] Started Load Random Seed.
             Starting udev Wait for Complete Device Initialization...
    [  OK  ] Found device ServeRAID_M5015.
             Activating swap /dev/disk/by-uuid/8429f30c-11e9-4747-8567-163851aa58cd...
    [  OK  ] Found device ServeRAID_M5015.
             Starting File System Check[   78.407610] EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: (null)
     on /dev/disk/by-uuid/95266d2c-14af-4bcb-86b5-d6e517680083...
    [   78.410550] SELinux: initialized (dev sda2, type ext4), uses xattr
    systemd-fsck[389]: /dev/sda2: recovering journal
    systemd-fsck[389]: /dev/sda2: clean, 70/128016 files, 153973/512000 blocks
    [  OK  ] Started udev Wait for Complete Device Initialization.
    [  OK  ] Activated swap /dev/disk/by-uuid/8429f30c-11e9-4747-8567-163851aa58cd.
    [  OK  ] Started File System Check on /dev/disk/by-uuid/95266d2c-14af-4bcb-86b5-d6e517680083.
             Mounting /boot...
    [  OK  ] Reached target Swap.
             Starting Wait for storage scan...
    [  OK  ] Mounted /boot.
    [  OK  ] Found device ServeRAID_M5015.
             Starting File System Check on /dev/disk/by-uuid/db550ad3-1bca-4b61-b378-ac29e1279f65...
    [  OK  ] Started Wait for storage scan.
    systemd-fsck[405]: /dev/sda5: recovering journal
             Starting Initialize storage subsystems (RAID, LVM, etc.)...
    systemd-fsck[   79.573761] SELinux: initialized (dev sda1, type vfat), uses genfs_contexts
    [405]: /dev/sda5: clean, 230/5079040 files, 382649/20313600 blocks
    [  OK  ] Found device ServeRAID_M5015.
             Mounting /boot/efi...
    [  OK  ] Started File System Che[   79.945416] EXT4-fs (sda5): mounted filesystem with ordered data mode. Opts: (null)
    ck on /dev/disk/by-uuid/db550ad3-1bca-4b61-b378-ac29e1279f65.
    [   79.945421] SELinux: initialized (dev sda5, type ext4), uses xattr
    [  OK  ] Started Initialize storage subsystems (RAID, LVM, etc.).
             Starting Initialize storage subsystems (RAID, LVM, etc.)...
             Mounting /home...
    [  OK  ] Mounted /boot/efi.
    [  OK  ] Started Initialize storage subsystems (RAID, LVM, etc.).
    [  OK  ] Mounted /home.
             Starting Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling...
    [  OK  ] Started Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling.
    [  [   80.764347] type=1400 audit(1351260878.586:4): avc:  denied  { read } for  pid=433 comm="systemd-tmpfile" name="lock" dev="sda4" ino=3145784 scontext=system_u:system_r:systemd_tmpfiles_t:s0 tcontext=system_u:object_r:var_t:s0 tclass=lnk_file
    OK  ] Reached target Local File Systems.
             Starting Recreate Volatile Files and Directories...
             Starting [   80.764572] type=1400 audit(1351260878.586:5): avc:  denied  { read } for  pid=433 comm="systemd-tmpfile" name="lock" dev="sda4" ino=3145784 scontext=system_u:system_r:systemd_tmpfiles_t:s0 tcontext=system_u:object_r:var_t:s0 tclass=lnk_file
    Trigger Flushing of Journal to Persistent Storage...
    [   81.643401] systemd-journald[385]: Received SIGUSR1
             Starting Tell Plymouth To Write Out Runtime Data...
    [  OK  ] Started Recreate Volatile Files and Directories.
    [  OK  ] Started Trigger Flushing of Journal to Persistent Storage.
    [  OK  ] Started Tell Plymouth To Write Out Runtime Data.
    [  OK  ] Reached target System Initialization.
             Starting Restore Sound Card State...
    [  OK  ] Listening on RPCbind Server Activation Socket.
    [  OK  ] Listening on CUPS Printing Service Sockets.
    [  OK  ] Listening on Avahi mDNS/DNS-SD Stack Activation Socket.
    [  OK  ] Listening on PC/SC Smart Card Daemon Activation Socket.
    [  OK  ] Listening on D-Bus System Message Bus Socket.
    [  OK  ] Reached target Sockets.
    [  OK  ] Reached target Basic System.
             Starting LSB: Starts and stops login iSCSI daem[   82.766212] Loading iSCSI transport class v2.0-870.
    on....
             Starting firewalld - dynamic firewall daemon...
             Starting Security Auditing Service...
             Starting ABRT Automated Bug Reporting Tool...
    [  OK  ] Started ABRT Automated Bug Reporting Tool.
             Starting irqbalance daemon...
    [   83.186928] iscsi: registered transport (tcp)
             Starting Machine Check Exception Logging Daemon...
             Starting Harvest vmcores for ABRT...
             Starting Install ABRT coredump hook...
             Starting ABRT Xorg log watcher...
    [  OK  ] Started ABRT Xorg log watcher.
             Starting Kernel Samepage Merging...
             Starting ABRT kernel log watcher...
    [   83.693943] iscsi: registered transport (iser)
    [  OK  ] Started ABRT kernel log watcher.
             Starting Self Monitoring and Reporting Technology (SMART) Daemon...
    [   83.881448] libcxgbi:libcxgbi_init_module: tag itt 0x1fff, 13 bits, age 0xf, 4 bits.
    [   83.881449] libcxgbi:ddp_setup_host_page_size: system PAGE 4096, ddp idx 0.
    [  OK  ] Started Self Monitoring and Reporting Technology (SMART) Daemon.
             Starting NTP client/server...
             Starting Login Service...
             Starting Avahi mDNS/DNS-SD Stack...
    [   84.398509] Chelsio T3 iSCSI Driver cxgb3i v2.0.0 (Jun. 2010)
     
    [   84.398537] iscsi: registered transport (cxgb3i)
    [   84.667763] Chelsio T4 iSCSI Driver cxgb4i v0.9.1 (Aug. 2010)
    [   84.667791] iscsi: registered transport (cxgb4i)
    [  OK  ] Started System Logging Service.
             Starting Permit User Sessions...
             Starting D-Bus System Message Bus...
    [  OK  ] Started D-Bus System Message Bus.
    [   85.038927] cnic: Broadcom NetXtreme II CNIC Driver cnic v2.5.14 (Sep 30, 2012)
    [   85.214112] Broadcom NetXtreme II iSCSI Driver bnx2i v2.7.2.2 (Apr 25, 2012)
    [   85.214173] iscsi: registered transport (bnx2i)
    [  OK  ] Started Restore Sound Card State.
    [  OK  ] Started Security Aud[   85.552436] iscsi: registered transport (be2iscsi)
    iting Service.
    [  OK  ] Started irq[   85.552437] In beiscsi_module_init, tt=ffffffffa035b000
    balance daemon.
    [  OK  ] Started Machine Check Exception Logging Daemon.
    [  OK  ] Started Harvest vmcores for ABRT.
    [  OK  ] Started Install ABRT coredump hook.
    [  OK  ] Started Kernel Samepage Merging.
    [  OK  ] Started Permit User Sessions.
             Starting Job spooling tools...
    [  OK  ] Started Job spooling tools.
             Starting Wait for Plymouth Boot Screen to Quit...
             Starting Terminate Plymouth Boot Screen...
             Starting Kernel Samepage Merging (KSM) Tuning Daemon...
             Starting Command Scheduler...
    [  OK  ] Started Command Scheduler.
    [  OK  ] Started LSB: Starts and stops login iSCSI daemon..
    [  OK  ] Started Kernel Samepage Merging (KSM) Tuning Daemon.
    [  OK  ] Started Login Service.
    [  OK  ] Started Avahi mDNS/DNS-SD Stack.
    [   87.491448] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
    [   88.337065] bnx2 0000:0b:00.0: irq 76 for MSI/MSI-X
    [   88.433811] bnx2 0000:0b:00.0: irq 77 for MSI/MSI-X
    [   88.530554] bnx2 0000:0b:00.0: irq 78 for MSI/MSI-X
    [   88.626160] bnx2 0000:0b:00.0: irq 79 for MSI/MSI-X
    [   88.721806] bnx2 0000:0b:00.0: irq 80 for MSI/MSI-X
    [   88.816751] bnx2 0000:0b:00.0: irq 81 for MSI/MSI-X
    [   88.911473] bnx2 0000:0b:00.0: irq 82 for MSI/MSI-X
    [   89.005730] bnx2 0000:0b:00.0: irq 83 for MSI/MSI-X
    [   89.100194] bnx2 0000:0b:00.0: irq 84 for MSI/MSI-X
    [   89.241620] bnx2 0000:0b:00.0 eth0: using MSIX
    [   89.329601] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
    [   89.435553] Bluetooth: Core ver 2.16
    [   89.513154] NET: Registered protocol family 31
    [   89.530802] bnx2 0000:0b:00.1: irq 85 for MSI/MSI-X
    [   89.530817] bnx2 0000:0b:00.1: irq 86 for MSI/MSI-X
    [   89.530832] bnx2 0000:0b:00.1: irq 87 for MSI/MSI-X
    [   89.530846] bnx2 0000:0b:00.1: irq 88 for MSI/MSI-X
    [   89.530862] bnx2 0000:0b:00.1: irq 89 for MSI/MSI-X
    [   89.530876] bnx2 0000:0b:00.1: irq 90 for MSI/MSI-X
    [   89.530891] bnx2 0000:0b:00.1: irq 91 for MSI/MSI-X
    [   89.530905] bnx2 0000:0b:00.1: irq 92 for MSI/MSI-X
    [   89.530919] bnx2 0000:0b:00.1: irq 93 for MSI/MSI-X
    [   89.578639] bnx2 0000:0b:00.1 eth1: using MSIX
    [   89.578693] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
    [   90.611119] Bluetooth: HCI device and connection manager initialized
    [   90.718006] Bluetooth: HCI socket layer initialized
    [   90.807276] Bluetooth: L2CAP socket layer initialized
    [   90.897726] Bluetooth: SCO socket layer initialized
    [   91.047382] Bluetooth: BNEP (Ethernet Emulation) ver 1.3
    [   91.140772] Bluetooth: BNEP filters: protocol multicast
    [   91.233105] Bluetooth: BNEP socket layer initialized
    [   92.497381] bnx2 0000:0b:00.0 eth0: NIC Copper Link is Up, 1000 Mbps full duplex
    [   92.587116]
    [   92.605283] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
    [   92.768491] bnx2 0000:0b:00.
                                   1 eth1: NIC Copper Link is Up, 1000 Mbps full duplex
    [   92.857692]
    [   92.875815] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
    [   94.355591] Ebtables v2.0 registered
    [   94.468819] ip6_tables: (C) 2000-2006 Netfilter Core Team
    [   94.656930] cgroup: libvirtd (860) created nested cgroup for controller "memory" which has incomplete hierarchy support. Nested cgroups may change behavior in the future.
    [   94.841229] cgroup: "memory" requires setting use_hierarchy to 1 on the root.
    [   94.841332] cgroup: libvirtd (860) created nested cgroup for controller "devi[   94.841380] cgroup: libvirtd (860) created nested cgroup for controller "freezer" which has incomplete hierarchy support. Nested cgroups may change behavior in the future.
    [   94.841430] cgroup: libvirtd (860) created nested cgroup for controller "blkio" which has incomplete hierarchy support. Nested cgroups may change behavior in the future.
    [  158.926026] [drm] mga base 0
    [  160.492547] fuse init (API version 7.20)
    [  160.541437] SELinux: initialized (dev fuse, type fuse), uses genfs_contexts
    [  179.734405] ------------[ cut here ]------------
    [  179.804754] kernel BUG at mm/memcontrol.c:3263!
    [  179.874356] invalid opcode: 0000 [#1] SMP
    [  179.939377] Modules linked in: fuse ip6table_filter ip6_tables ebtable_nat ebtables bnep bluetooth rfkill iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfat fat iTCO_wdt cdc_ether coretemp iTCO_vendor_support usbnet mii ioatdma lpc_ich crc32c_intel bnx2 shpchp i7core_edac pcspkr tpm_tis tpm i2c_i801 mfd_core tpm_bios edac_core dca serio_raw microcode vhost_net tun macvtap macvlan kvm_intel kvm uinput mgag200 i2c_algo_bit drm_kms_helper ttm drm megaraid_sas i2c_core
    [  180.737647] CPU 7
    [  180.759586] Pid: 1316, comm: X Not tainted 3.7.0-rc2+ #3 IBM IBM System x3400 M3 Server -[7379I08]-/69Y4356     
    [  180.918591] RIP: 0010:[<ffffffff8118c39a>]  [<ffffffff8118c39a>] mem_cgroup_prepare_migration+0xba/0xd0
    [  181.047572] RSP: 0000:ffff880179113d38  EFLAGS: 00013202
    [  181.127009] RAX: 0040100000084069 RBX: ffffea0005b28000 RCX: ffffea00099a805c
    [  181.228674] RDX: ffff880179113d90 RSI: ffffea00099a8000 RDI: ffffea0005b28000
    [  181.331080] RBP: ffff880179113d58 R08: 0000000000280000 R09: ffff88027fffff80
    [  181.433163] R10: 00000000000000d4 R11: 00000037e9f7bd90 R12: ffff880179113d90
    [  181.533866] R13: 00007fc5ffa00000 R14: ffff880178001fe8 R15: 000000016ca001e0
    [  181.635264] FS:  00007fc600ddb940(0000) GS:ffff88027fc60000(0000) knlGS:0000000000000000
    [  181.753726] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [  181.842013] CR2: 00007fc5ffa00000 CR3: 00000001779d2000 CR4: 00000000000007e0
    [  181.945346] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [  182.049416] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    [  182.153796] Process X (pid: 1316, threadinfo ffff880179112000, task ffff880179364620)
    [  182.266464] Stack:
    [  182.309943]  ffff880177d2c980 00007fc5ffa00000 ffffea0005b28000 ffff880177d2c980
    [  182.418164]  ffff880179113dc8 ffffffff81183b60 ffff880177d2c9dc 0000000178001fe0
    [  182.526366]  ffff880177856a50 ffffea00099a8000 ffff880177d2cc38 0000000000000000
    [  182.633709] Call Trace:
    [  182.681450]  [<ffffffff81183b60>] do_huge_pmd_numa_page+0x180/0x500
    [  182.775090]  [<ffffffff811585c9>] handle_mm_fault+0x1e9/0x360
    [  182.863038]  [<ffffffff81632b62>] __do_page_fault+0x172/0x4e0
    [  182.950574]  [<ffffffff8101c283>] ? __switch_to_xtra+0x163/0x1a0
    [  183.041512]  [<ffffffff8101281e>] ? __switch_to+0x3ce/0x4a0
    [  183.126832]  [<ffffffff8162d686>] ? __schedule+0x3c6/0x7a0
    [  183.211216]  [<ffffffff81632ede>] do_page_fault+0xe/0x10
    [  183.293705]  [<ffffffff8162f518>] page_fault+0x28/0x30
    [  183.373909] Code: 00 48 8b 78 08 48 8b 57 10 83 e2 01 75 05 f0 83 47 08 01 f6 43 08 01 74 bb f0 80 08 04 eb b5 f3 90 48 8b 10 80 e2 01 75 f6 eb 94 <0f> 0b 0f 1f 40 00 e8 9c b4 49 00 66 66 2e 0f 1f 84 00 00 00 00
    [  183.651946] RIP  [<ffffffff8118c39a>] mem_cgroup_prepare_migration+0xba/0xd0
    [  183.760378]  RSP <ffff880179113d38> 

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/31] numa/core patches
  2012-10-26  9:07 ` [PATCH 00/31] numa/core patches Zhouping Liu
@ 2012-10-26  9:08   ` Peter Zijlstra
  2012-10-26  9:20     ` Ingo Molnar
  2012-10-28 17:56     ` Johannes Weiner
  0 siblings, 2 replies; 135+ messages in thread
From: Peter Zijlstra @ 2012-10-26  9:08 UTC (permalink / raw)
  To: Zhouping Liu
  Cc: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton, linux-kernel,
	linux-mm, Ingo Molnar

On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
> [  180.918591] RIP: 0010:[<ffffffff8118c39a>]  [<ffffffff8118c39a>] mem_cgroup_prepare_migration+0xba/0xd0

> [  182.681450]  [<ffffffff81183b60>] do_huge_pmd_numa_page+0x180/0x500
> [  182.775090]  [<ffffffff811585c9>] handle_mm_fault+0x1e9/0x360
> [  182.863038]  [<ffffffff81632b62>] __do_page_fault+0x172/0x4e0
> [  182.950574]  [<ffffffff8101c283>] ? __switch_to_xtra+0x163/0x1a0
> [  183.041512]  [<ffffffff8101281e>] ? __switch_to+0x3ce/0x4a0
> [  183.126832]  [<ffffffff8162d686>] ? __schedule+0x3c6/0x7a0
> [  183.211216]  [<ffffffff81632ede>] do_page_fault+0xe/0x10
> [  183.293705]  [<ffffffff8162f518>] page_fault+0x28/0x30 

Johannes, this looks like the thp migration memcg hookery gone bad,
could you have a look at this?

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/31] numa/core patches
  2012-10-26  9:08   ` Peter Zijlstra
@ 2012-10-26  9:20     ` Ingo Molnar
  2012-10-26  9:41       ` Zhouping Liu
  2012-10-26 10:20       ` Zhouping Liu
  2012-10-28 17:56     ` Johannes Weiner
  1 sibling, 2 replies; 135+ messages in thread
From: Ingo Molnar @ 2012-10-26  9:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Zhouping Liu, Rik van Riel, Andrea Arcangeli, Mel Gorman,
	Johannes Weiner, Thomas Gleixner, Linus Torvalds, Andrew Morton,
	linux-kernel, linux-mm


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
> > [  180.918591] RIP: 0010:[<ffffffff8118c39a>]  [<ffffffff8118c39a>] mem_cgroup_prepare_migration+0xba/0xd0
> 
> > [  182.681450]  [<ffffffff81183b60>] do_huge_pmd_numa_page+0x180/0x500
> > [  182.775090]  [<ffffffff811585c9>] handle_mm_fault+0x1e9/0x360
> > [  182.863038]  [<ffffffff81632b62>] __do_page_fault+0x172/0x4e0
> > [  182.950574]  [<ffffffff8101c283>] ? __switch_to_xtra+0x163/0x1a0
> > [  183.041512]  [<ffffffff8101281e>] ? __switch_to+0x3ce/0x4a0
> > [  183.126832]  [<ffffffff8162d686>] ? __schedule+0x3c6/0x7a0
> > [  183.211216]  [<ffffffff81632ede>] do_page_fault+0xe/0x10
> > [  183.293705]  [<ffffffff8162f518>] page_fault+0x28/0x30 
> 
> Johannes, this looks like the thp migration memcg hookery gone bad,
> could you have a look at this?

Meanwhile, Zhouping Liu, could you please not apply the last 
patch:

  [PATCH] sched, numa, mm: Add memcg support to do_huge_pmd_numa_page()

and see whether it boots/works without that?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/31] numa/core patches
  2012-10-26  9:20     ` Ingo Molnar
@ 2012-10-26  9:41       ` Zhouping Liu
  2012-10-26 10:20       ` Zhouping Liu
  1 sibling, 0 replies; 135+ messages in thread
From: Zhouping Liu @ 2012-10-26  9:41 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Rik van Riel, Andrea Arcangeli, Mel Gorman,
	Johannes Weiner, Thomas Gleixner, Linus Torvalds, Andrew Morton,
	linux-kernel, linux-mm, CAI Qian

On 10/26/2012 05:20 PM, Ingo Molnar wrote:
> * Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
>> On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
>>> [  180.918591] RIP: 0010:[<ffffffff8118c39a>]  [<ffffffff8118c39a>] mem_cgroup_prepare_migration+0xba/0xd0
>>> [  182.681450]  [<ffffffff81183b60>] do_huge_pmd_numa_page+0x180/0x500
>>> [  182.775090]  [<ffffffff811585c9>] handle_mm_fault+0x1e9/0x360
>>> [  182.863038]  [<ffffffff81632b62>] __do_page_fault+0x172/0x4e0
>>> [  182.950574]  [<ffffffff8101c283>] ? __switch_to_xtra+0x163/0x1a0
>>> [  183.041512]  [<ffffffff8101281e>] ? __switch_to+0x3ce/0x4a0
>>> [  183.126832]  [<ffffffff8162d686>] ? __schedule+0x3c6/0x7a0
>>> [  183.211216]  [<ffffffff81632ede>] do_page_fault+0xe/0x10
>>> [  183.293705]  [<ffffffff8162f518>] page_fault+0x28/0x30
>> Johannes, this looks like the thp migration memcg hookery gone bad,
>> could you have a look at this?
> Meanwhile, Zhouping Liu, could you please not apply the last
> patch:
>
>    [PATCH] sched, numa, mm: Add memcg support to do_huge_pmd_numa_page()
>
> and see whether it boots/works without that?

Ok, I reverted the 31st patch; I will provide the results here after I
finish testing.

Thanks,
Zhouping

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/31] numa/core patches
  2012-10-26  9:20     ` Ingo Molnar
  2012-10-26  9:41       ` Zhouping Liu
@ 2012-10-26 10:20       ` Zhouping Liu
  2012-10-26 10:24         ` Ingo Molnar
  1 sibling, 1 reply; 135+ messages in thread
From: Zhouping Liu @ 2012-10-26 10:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Rik van Riel, Andrea Arcangeli, Mel Gorman,
	Johannes Weiner, Thomas Gleixner, Linus Torvalds, Andrew Morton,
	linux-kernel, linux-mm, CAI Qian

On 10/26/2012 05:20 PM, Ingo Molnar wrote:
> * Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
>> On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
>>> [  180.918591] RIP: 0010:[<ffffffff8118c39a>]  [<ffffffff8118c39a>] mem_cgroup_prepare_migration+0xba/0xd0
>>> [  182.681450]  [<ffffffff81183b60>] do_huge_pmd_numa_page+0x180/0x500
>>> [  182.775090]  [<ffffffff811585c9>] handle_mm_fault+0x1e9/0x360
>>> [  182.863038]  [<ffffffff81632b62>] __do_page_fault+0x172/0x4e0
>>> [  182.950574]  [<ffffffff8101c283>] ? __switch_to_xtra+0x163/0x1a0
>>> [  183.041512]  [<ffffffff8101281e>] ? __switch_to+0x3ce/0x4a0
>>> [  183.126832]  [<ffffffff8162d686>] ? __schedule+0x3c6/0x7a0
>>> [  183.211216]  [<ffffffff81632ede>] do_page_fault+0xe/0x10
>>> [  183.293705]  [<ffffffff8162f518>] page_fault+0x28/0x30
>> Johannes, this looks like the thp migration memcg hookery gone bad,
>> could you have a look at this?
> Meanwhile, Zhouping Liu, could you please not apply the last
> patch:
>
>    [PATCH] sched, numa, mm: Add memcg support to do_huge_pmd_numa_page()
>
> and see whether it boots/works without that?

Hi Ingo,

Your supposition is right: after reverting the 31st patch (sched, numa, mm:
Add memcg support to do_huge_pmd_numa_page()),
the issue is gone, thank you.


Thanks,
Zhouping

>
> Thanks,
>
> 	Ingo
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/31] numa/core patches
  2012-10-26 10:20       ` Zhouping Liu
@ 2012-10-26 10:24         ` Ingo Molnar
  0 siblings, 0 replies; 135+ messages in thread
From: Ingo Molnar @ 2012-10-26 10:24 UTC (permalink / raw)
  To: Zhouping Liu
  Cc: Peter Zijlstra, Rik van Riel, Andrea Arcangeli, Mel Gorman,
	Johannes Weiner, Thomas Gleixner, Linus Torvalds, Andrew Morton,
	linux-kernel, linux-mm, CAI Qian


* Zhouping Liu <zliu@redhat.com> wrote:

> On 10/26/2012 05:20 PM, Ingo Molnar wrote:
> >* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> >
> >>On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
> >>>[  180.918591] RIP: 0010:[<ffffffff8118c39a>]  [<ffffffff8118c39a>] mem_cgroup_prepare_migration+0xba/0xd0
> >>>[  182.681450]  [<ffffffff81183b60>] do_huge_pmd_numa_page+0x180/0x500
> >>>[  182.775090]  [<ffffffff811585c9>] handle_mm_fault+0x1e9/0x360
> >>>[  182.863038]  [<ffffffff81632b62>] __do_page_fault+0x172/0x4e0
> >>>[  182.950574]  [<ffffffff8101c283>] ? __switch_to_xtra+0x163/0x1a0
> >>>[  183.041512]  [<ffffffff8101281e>] ? __switch_to+0x3ce/0x4a0
> >>>[  183.126832]  [<ffffffff8162d686>] ? __schedule+0x3c6/0x7a0
> >>>[  183.211216]  [<ffffffff81632ede>] do_page_fault+0xe/0x10
> >>>[  183.293705]  [<ffffffff8162f518>] page_fault+0x28/0x30
> >>Johannes, this looks like the thp migration memcg hookery gone bad,
> >>could you have a look at this?
> >Meanwhile, Zhouping Liu, could you please not apply the last
> >patch:
> >
> >   [PATCH] sched, numa, mm: Add memcg support to do_huge_pmd_numa_page()
> >
> >and see whether it boots/works without that?
> 
> Hi Ingo,
> 
> your supposed is right, after reverting the 31st patch(sched, numa,
> mm: Add memcg support to do_huge_pmd_numa_page())
> the issue is gone, thank you.

The tested bits you can find in the numa/core tree:

  git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git numa/core

It includes all changes (patches #1-#30) except patch #31 - I 
wanted to test and apply that last patch today, but won't do it 
now that you've reported this regression.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 05/31] x86/mm: Reduce tlb flushes from ptep_set_access_flags()
  2012-10-26  4:23           ` Linus Torvalds
  2012-10-26  6:42             ` Ingo Molnar
@ 2012-10-26 12:34             ` Michel Lespinasse
  2012-10-26 12:48               ` Andi Kleen
  2012-10-26 17:01               ` Linus Torvalds
  1 sibling, 2 replies; 135+ messages in thread
From: Michel Lespinasse @ 2012-10-26 12:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, Peter Zijlstra, Andrea Arcangeli, Mel Gorman,
	Johannes Weiner, Thomas Gleixner, Andrew Morton, linux-kernel,
	linux-mm, Ingo Molnar

On Thu, Oct 25, 2012 at 9:23 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Oct 25, 2012 at 8:57 PM, Rik van Riel <riel@redhat.com> wrote:
>>
>> That may not even be needed.  Apparently Intel chips
>> automatically flush an entry from the TLB when it
>> causes a page fault.  I assume AMD chips do the same,
>> because flush_tlb_fix_spurious_fault evaluates to
>> nothing on x86.
>
> Yes. It's not architected as far as I know, though. But I agree, it's
> possible - even likely - we could avoid TLB flushing entirely on x86.

Actually, it is architected on x86. This was first described in the
intel appnote 317080 "TLBs, Paging-Structure Caches, and Their
Invalidation", last paragraph of section 5.1. Nowadays, the same
contents are buried somewhere in Volume 3 of the architecture manual
(in my copy: 4.10.4.1 Operations that Invalidate TLBs and
Paging-Structure Caches)

> If you want to try it, I would seriously suggest you do it as a
> separate commit though, just in case.
>
>> Are there architectures where we do need to flush
>> remote TLBs on upgrading the permissions on a PTE?
>
> I *suspect* that whole TLB flush just magically became an SMP one
> without anybody ever really thinking about it.

I would be very worried about assuming every non-x86 arch has similar
TLB semantics. However, if their fault handlers always invalidate TLB
for pages that get spurious faults, then skipping the remote
invalidation would be fine. (I believe this is what
tlb_fix_spurious_fault() is for ?)
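
For reference, the two definitions in question look roughly like this
(paraphrased from include/asm-generic/pgtable.h and the x86 override in
arch/x86/include/asm/pgtable.h; treat the exact spelling as an
approximation rather than a quote):

/* generic fallback: flush the single faulting page on the local CPU */
#ifndef flush_tlb_fix_spurious_fault
#define flush_tlb_fix_spurious_fault(vma, address) \
	flush_tlb_page(vma, address)
#endif

/* x86 override: the CPU already dropped the stale TLB entry when it
 * raised the fault, so there is nothing left to do */
#define flush_tlb_fix_spurious_fault(vma, address) do { } while (0)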

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 05/31] x86/mm: Reduce tlb flushes from ptep_set_access_flags()
  2012-10-26 12:34             ` Michel Lespinasse
@ 2012-10-26 12:48               ` Andi Kleen
  2012-10-26 13:16                 ` Rik van Riel
  2012-10-26 13:23                 ` [PATCH 05/31] x86/mm: Reduce tlb flushes from ptep_set_access_flags() Michel Lespinasse
  2012-10-26 17:01               ` Linus Torvalds
  1 sibling, 2 replies; 135+ messages in thread
From: Andi Kleen @ 2012-10-26 12:48 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Linus Torvalds, Rik van Riel, Peter Zijlstra, Andrea Arcangeli,
	Mel Gorman, Johannes Weiner, Thomas Gleixner, Andrew Morton,
	linux-kernel, linux-mm, Ingo Molnar

Michel Lespinasse <walken@google.com> writes:

> On Thu, Oct 25, 2012 at 9:23 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Thu, Oct 25, 2012 at 8:57 PM, Rik van Riel <riel@redhat.com> wrote:
>>>
>>> That may not even be needed.  Apparently Intel chips
>>> automatically flush an entry from the TLB when it
>>> causes a page fault.  I assume AMD chips do the same,
>>> because flush_tlb_fix_spurious_fault evaluates to
>>> nothing on x86.
>>
>> Yes. It's not architected as far as I know, though. But I agree, it's
>> possible - even likely - we could avoid TLB flushing entirely on x86.
>
> Actually, it is architected on x86. This was first described in the
> intel appnote 317080 "TLBs, Paging-Structure Caches, and Their
> Invalidation", last paragraph of section 5.1. Nowadays, the same
> contents are buried somewhere in Volume 3 of the architecture manual
> (in my copy: 4.10.4.1 Operations that Invalidate TLBs and
> Paging-Structure Caches)

This unfortunately would only work for processes with no threads
because it only works on the current logical CPU.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 05/31] x86/mm: Reduce tlb flushes from ptep_set_access_flags()
  2012-10-26 12:48               ` Andi Kleen
@ 2012-10-26 13:16                 ` Rik van Riel
  2012-10-26 13:26                   ` Ingo Molnar
  2012-10-26 13:23                 ` [PATCH 05/31] x86/mm: Reduce tlb flushes from ptep_set_access_flags() Michel Lespinasse
  1 sibling, 1 reply; 135+ messages in thread
From: Rik van Riel @ 2012-10-26 13:16 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Michel Lespinasse, Linus Torvalds, Peter Zijlstra,
	Andrea Arcangeli, Mel Gorman, Johannes Weiner, Thomas Gleixner,
	Andrew Morton, linux-kernel, linux-mm, Ingo Molnar

On 10/26/2012 08:48 AM, Andi Kleen wrote:
> Michel Lespinasse <walken@google.com> writes:
>
>> On Thu, Oct 25, 2012 at 9:23 PM, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>> On Thu, Oct 25, 2012 at 8:57 PM, Rik van Riel <riel@redhat.com> wrote:
>>>>
>>>> That may not even be needed.  Apparently Intel chips
>>>> automatically flush an entry from the TLB when it
>>>> causes a page fault.  I assume AMD chips do the same,
>>>> because flush_tlb_fix_spurious_fault evaluates to
>>>> nothing on x86.
>>>
>>> Yes. It's not architected as far as I know, though. But I agree, it's
>>> possible - even likely - we could avoid TLB flushing entirely on x86.
>>
>> Actually, it is architected on x86. This was first described in the
>> intel appnote 317080 "TLBs, Paging-Structure Caches, and Their
>> Invalidation", last paragraph of section 5.1. Nowadays, the same
>> contents are buried somewhere in Volume 3 of the architecture manual
>> (in my copy: 4.10.4.1 Operations that Invalidate TLBs and
>> Paging-Structure Caches)
>
> This unfortunately would only work for processes with no threads
> because it only works on the current logical CPU.

That is fine.

Potentially triggering a spurious page fault on
another CPU is bound to be better than always
doing a synchronous remote TLB flush, waiting
for who knows how many CPUs to acknowledge the
IPI...


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 05/31] x86/mm: Reduce tlb flushes from ptep_set_access_flags()
  2012-10-26 12:48               ` Andi Kleen
  2012-10-26 13:16                 ` Rik van Riel
@ 2012-10-26 13:23                 ` Michel Lespinasse
  1 sibling, 0 replies; 135+ messages in thread
From: Michel Lespinasse @ 2012-10-26 13:23 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Rik van Riel, Peter Zijlstra, Andrea Arcangeli,
	Mel Gorman, Johannes Weiner, Thomas Gleixner, Andrew Morton,
	linux-kernel, linux-mm, Ingo Molnar

On Fri, Oct 26, 2012 at 5:48 AM, Andi Kleen <andi@firstfloor.org> wrote:
> Michel Lespinasse <walken@google.com> writes:
>
>> On Thu, Oct 25, 2012 at 9:23 PM, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>> On Thu, Oct 25, 2012 at 8:57 PM, Rik van Riel <riel@redhat.com> wrote:
>>>>
>>>> That may not even be needed.  Apparently Intel chips
>>>> automatically flush an entry from the TLB when it
>>>> causes a page fault.  I assume AMD chips do the same,
>>>> because flush_tlb_fix_spurious_fault evaluates to
>>>> nothing on x86.
>>>
>>> Yes. It's not architected as far as I know, though. But I agree, it's
>>> possible - even likely - we could avoid TLB flushing entirely on x86.
>>
>> Actually, it is architected on x86. This was first described in the
>> intel appnote 317080 "TLBs, Paging-Structure Caches, and Their
>> Invalidation", last paragraph of section 5.1. Nowadays, the same
>> contents are buried somewhere in Volume 3 of the architecture manual
>> (in my copy: 4.10.4.1 Operations that Invalidate TLBs and
>> Paging-Structure Caches)
>
> This unfortunately would only work for processes with no threads
> because it only works on the current logical CPU.

No, the point is, if we are only *increasing* permissions on a page,
we can skip the remote TLB invalidations. Later on each remote CPU
might possibly get a spurious fault on that page, but that spurious
fault will resynchronize its TLB for that page, so that the
instruction retry after the fault won't fault again.

It is often cheaper to let remote CPUs get an occasional spurious
fault than to synchronize with them on every permission change.

Of course, none of the above applies if we are *reducing* permissions
on a page (we really can't skip TLB invalidations there)
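
To make that concrete, here is a minimal sketch of what an x86
ptep_set_access_flags() doing only a local invalidation could look like.
This is an illustration of the idea, not an actual patch, and the choice
of __flush_tlb_one() here is an assumption:

int ptep_set_access_flags(struct vm_area_struct *vma,
			  unsigned long address, pte_t *ptep,
			  pte_t entry, int dirty)
{
	int changed = !pte_same(*ptep, entry);

	if (changed && dirty) {
		*ptep = entry;
		pte_update_defer(vma->vm_mm, address, ptep);
		/*
		 * This path only ever makes the PTE more permissive, so
		 * a local flush is enough: a remote CPU still holding
		 * the stale entry takes at most one spurious fault and
		 * then reloads the up-to-date entry.
		 */
		__flush_tlb_one(address);
	}

	return changed;
}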

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 05/31] x86/mm: Reduce tlb flushes from ptep_set_access_flags()
  2012-10-26 13:16                 ` Rik van Riel
@ 2012-10-26 13:26                   ` Ingo Molnar
  2012-10-26 13:28                     ` Ingo Molnar
                                       ` (3 more replies)
  0 siblings, 4 replies; 135+ messages in thread
From: Ingo Molnar @ 2012-10-26 13:26 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andi Kleen, Michel Lespinasse, Linus Torvalds, Peter Zijlstra,
	Andrea Arcangeli, Mel Gorman, Johannes Weiner, Thomas Gleixner,
	Andrew Morton, linux-kernel, linux-mm


* Rik van Riel <riel@redhat.com> wrote:

> On 10/26/2012 08:48 AM, Andi Kleen wrote:
> >Michel Lespinasse <walken@google.com> writes:
> >
> >>On Thu, Oct 25, 2012 at 9:23 PM, Linus Torvalds
> >><torvalds@linux-foundation.org> wrote:
> >>>On Thu, Oct 25, 2012 at 8:57 PM, Rik van Riel <riel@redhat.com> wrote:
> >>>>
> >>>>That may not even be needed.  Apparently Intel chips
> >>>>automatically flush an entry from the TLB when it
> >>>>causes a page fault.  I assume AMD chips do the same,
> >>>>because flush_tlb_fix_spurious_fault evaluates to
> >>>>nothing on x86.
> >>>
> >>>Yes. It's not architected as far as I know, though. But I agree, it's
> >>>possible - even likely - we could avoid TLB flushing entirely on x86.
> >>
> >>Actually, it is architected on x86. This was first described in the
> >>intel appnote 317080 "TLBs, Paging-Structure Caches, and Their
> >>Invalidation", last paragraph of section 5.1. Nowadays, the same
> >>contents are buried somewhere in Volume 3 of the architecture manual
> >>(in my copy: 4.10.4.1 Operations that Invalidate TLBs and
> >>Paging-Structure Caches)
> >
> > This unfortunately would only work for processes with no 
> > threads because it only works on the current logical CPU.
> 
> That is fine.
> 
> Potentially triggering a spurious page fault on
> another CPU is bound to be better than always
> doing a synchronous remote TLB flush, waiting
> for who knows how many CPUs to acknowledge the
> IPI...

The other killer is the fundamental IPI delay - which makes the cost
'invisible' to regular profiling and hard to analyze.

So yes, even the local flush is a win, a major one - and the 
flush-less one is likely a win too, because INVLPG has some 
TLB-cache-walking costs.

Rik, mind sending an updated patch that addresses Linus's 
concerns, or should I code it up if you are busy?

We can also certainly try the second patch, but I'd do it at the 
end of the series, to put some tree distance between the two 
patches, to not concentrate regression risks too tightly in the 
Git space, to help out with hard-to-bisect problems...

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 05/31] x86/mm: Reduce tlb flushes from ptep_set_access_flags()
  2012-10-26 13:26                   ` Ingo Molnar
@ 2012-10-26 13:28                     ` Ingo Molnar
  2012-10-26 18:44                     ` [PATCH 1/3] x86/mm: only do a local TLB flush in ptep_set_access_flags() Rik van Riel
                                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 135+ messages in thread
From: Ingo Molnar @ 2012-10-26 13:28 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andi Kleen, Michel Lespinasse, Linus Torvalds, Peter Zijlstra,
	Andrea Arcangeli, Mel Gorman, Johannes Weiner, Thomas Gleixner,
	Andrew Morton, linux-kernel, linux-mm


* Ingo Molnar <mingo@kernel.org> wrote:

> [...]
> 
> Rik, mind sending an updated patch that addresses Linus's 
> concerns, or should I code it up if you are busy?
> 
> We can also certainly try the second patch, but I'd do it at 
> the end of the series, to put some tree distance between the 
> two patches, to not concentrate regression risks too tightly 
> in the Git space, to help out with hard to bisect problems...

I'd also like to have the second patch separately because I'd 
like to measure spurious fault frequency before and after the 
change, with a reference workload.

Just a single page fault, even if it's a minor one, might make a 
micro-optimization a net loss. INVLPG might be the cheaper 
option on average - it needs to be measured. (I'll do that, just 
please keep it separate from the main TLB-flush optimization.)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 26/31] sched, numa, mm: Add fault driven placement and migration policy
  2012-10-26  7:15     ` Ingo Molnar
@ 2012-10-26 13:50       ` Ingo Molnar
  2012-10-26 14:11         ` Peter Zijlstra
  0 siblings, 1 reply; 135+ messages in thread
From: Ingo Molnar @ 2012-10-26 13:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Rik van Riel, Andrea Arcangeli, Mel Gorman,
	Johannes Weiner, Thomas Gleixner, Andrew Morton, linux-kernel,
	linux-mm


* Ingo Molnar <mingo@kernel.org> wrote:

> [
>   task_numa_work() performance side note:
> 
>   We are also *very* close to be able to use down_read() instead
>   of down_write() in the sampling-unmap code in 
>   task_numa_work(), as it should be safe in theory to call 
>   change_protection(PROT_NONE) in parallel - but there's one 
>   regression that disagrees with this theory so we use 
>   down_write() at the moment.
> 
>   Maybe you could help us there: can you see a reason why the
>   change_prot_none()->change_protection() call in
>   task_numa_work() can not occur in parallel to a page fault in
>   another thread on another CPU? It should be safe - yet if we 
>   change it I can see occasional corruption of user-space state: 
>   segfaults and register corruption.
> ]

Oh, just found the reason:

the ptep_modify_prot_start()/modify()/commit() sequence is 
SMP-unsafe - it has to be done with the mmap_sem held for writing.

It is safe against *hardware* updates to the PTE, but not safe 
against itself.

This is apparently a hidden cost of paravirt: it forces that 
weird sequence and thus the down_write() ...
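
For readers following along, the sequence is (roughly) the per-PTE
pattern below; the default ptep_modify_prot_start() atomically clears
the entry and returns the old value so that a paravirt hypervisor can
batch the update, which is why two unsynchronized callers on the same
PTE could stomp on each other. The wrapper name is made up, purely for
illustration:

static void sketch_change_pte_prot(struct mm_struct *mm, unsigned long addr,
				   pte_t *ptep, pgprot_t newprot)
{
	pte_t oldpte, newpte;

	oldpte = ptep_modify_prot_start(mm, addr, ptep);	/* entry cleared */
	newpte = pte_modify(oldpte, newprot);			/* recompute bits */
	ptep_modify_prot_commit(mm, addr, ptep, newpte);	/* write it back */
}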

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 26/31] sched, numa, mm: Add fault driven placement and migration policy
  2012-10-26 13:50       ` Ingo Molnar
@ 2012-10-26 14:11         ` Peter Zijlstra
  2012-10-26 14:14           ` Ingo Molnar
  0 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2012-10-26 14:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Rik van Riel, Andrea Arcangeli, Mel Gorman,
	Johannes Weiner, Thomas Gleixner, Andrew Morton, linux-kernel,
	linux-mm

On Fri, 2012-10-26 at 15:50 +0200, Ingo Molnar wrote:
> 
> Oh, just found the reason:
> 
> the ptep_modify_prot_start()/modify()/commit() sequence is 
> SMP-unsafe - it has to be done under the mmap_sem write-locked.
> 
> It is safe against *hardware* updates to the PTE, but not safe 
> against itself. 

Shouldn't the pte_lock serialize all that still? All sites that modify
PTE contents should hold the pte_lock (and do afaict).

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 26/31] sched, numa, mm: Add fault driven placement and migration policy
  2012-10-26 14:11         ` Peter Zijlstra
@ 2012-10-26 14:14           ` Ingo Molnar
  2012-10-26 16:47             ` Linus Torvalds
  0 siblings, 1 reply; 135+ messages in thread
From: Ingo Molnar @ 2012-10-26 14:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Rik van Riel, Andrea Arcangeli, Mel Gorman,
	Johannes Weiner, Thomas Gleixner, Andrew Morton, linux-kernel,
	linux-mm


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Fri, 2012-10-26 at 15:50 +0200, Ingo Molnar wrote:
> > 
> > Oh, just found the reason:
> > 
> > the ptep_modify_prot_start()/modify()/commit() sequence is 
> > SMP-unsafe - it has to be done under the mmap_sem 
> > write-locked.
> > 
> > It is safe against *hardware* updates to the PTE, but not 
> > safe against itself.
> 
> Shouldn't the pte_lock serialize all that still? All sites 
> that modify PTE contents should hold the pte_lock (and do 
> afaict).

Hm, indeed.

Is there no code under down_read() (in the page fault path) that 
modifies the pte via just pure atomics?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 26/31] sched, numa, mm: Add fault driven placement and migration policy
  2012-10-26 14:14           ` Ingo Molnar
@ 2012-10-26 16:47             ` Linus Torvalds
  0 siblings, 0 replies; 135+ messages in thread
From: Linus Torvalds @ 2012-10-26 16:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Rik van Riel, Andrea Arcangeli, Mel Gorman,
	Johannes Weiner, Thomas Gleixner, Andrew Morton, linux-kernel,
	linux-mm

On Fri, Oct 26, 2012 at 7:14 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>>
>> Shouldn't the pte_lock serialize all that still? All sites
>> that modify PTE contents should hold the pte_lock (and do
>> afaict).
>
> Hm, indeed.
>
> Is there no code under down_read() (in the page fault path) that
> modifies the pte via just pure atomics?

Well, the ptep_set_access_flags() thing modifies the pte under
down_read(). Not using atomics, though. If it races with itself or
with a hardware page walk, that's fine, but if it races with something
changing other bits than A/D, that would be horribly horribly bad - it
could undo any other bit changes exactly because it's a unlocked
read-do-other-things-write sequence.

But it's always run under the page table lock - as should all other SW
page table modifications - so it *should* be fine. The down_read() is
for protecting other VM data structures (notably the vma lists etc),
not the page table bit-twiddling.

In fact, the whole SW page table modification scheme *depends* on the
page table lock, because the ptep_modify_prot_start/commit thing does
a "atomically clear the page table pointer to protect against hardware
walkers". And if another software walker were to see that cleared
state, it would do bad things (the exception, as usual, is the GUP
code, which does the optimistic unlocked accesses and conceptually
emulates a hardware page table walk).

So I really think that the mmap_sem should be entirely a non-issue for
this kind of code.
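
Purely as an illustration of the above (invented function name; the
helpers are the standard ones, and change_pte_range() in mm/mprotect.c
follows essentially this shape): the per-PTE read-modify-write is
bracketed by the page table lock, so mmap_sem only needs to keep the vma
stable, not the PTE contents.

static void sketch_update_prot_locked(struct mm_struct *mm, pmd_t *pmd,
				      unsigned long addr, unsigned long end,
				      pgprot_t newprot)
{
	spinlock_t *ptl;
	pte_t *pte;

	/*
	 * The page table lock, not mmap_sem, is what serializes
	 * software writers of the PTE against each other.
	 */
	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	do {
		if (pte_present(*pte)) {
			pte_t ptent = ptep_modify_prot_start(mm, addr, pte);
			ptent = pte_modify(ptent, newprot);
			ptep_modify_prot_commit(mm, addr, pte, ptent);
		}
	} while (pte++, addr += PAGE_SIZE, addr != end);
	pte_unmap_unlock(pte - 1, ptl);
}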

            Linus

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 05/31] x86/mm: Reduce tlb flushes from ptep_set_access_flags()
  2012-10-26 12:34             ` Michel Lespinasse
  2012-10-26 12:48               ` Andi Kleen
@ 2012-10-26 17:01               ` Linus Torvalds
  2012-10-26 17:54                 ` Rik van Riel
  1 sibling, 1 reply; 135+ messages in thread
From: Linus Torvalds @ 2012-10-26 17:01 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Rik van Riel, Peter Zijlstra, Andrea Arcangeli, Mel Gorman,
	Johannes Weiner, Thomas Gleixner, Andrew Morton, linux-kernel,
	linux-mm, Ingo Molnar

On Fri, Oct 26, 2012 at 5:34 AM, Michel Lespinasse <walken@google.com> wrote:
> On Thu, Oct 25, 2012 at 9:23 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>
>> Yes. It's not architected as far as I know, though. But I agree, it's
>> possible - even likely - we could avoid TLB flushing entirely on x86.
>
> Actually, it is architected on x86. This was first described in the
> intel appnote 317080 "TLBs, Paging-Structure Caches, and Their
> Invalidation", last paragraph of section 5.1. Nowadays, the same
> contents are buried somewhere in Volume 3 of the architecture manual
> (in my copy: 4.10.4.1 Operations that Invalidate TLBs and
> Paging-Structure Caches)

Good. I should have known it must be architected, because we've gone
back-and-forth on this in the kernel historically. We used to have
some TLB invalidates in the faulting path because I wasn't sure
whether they were needed or not, but we clearly don't have them any
more (and I suspect coverage was always spotty).

And Intel (and AMD) have been very good at documenting as architected
these kinds of details that people end up relying on even if they
weren't necessarily originally explicitly documented.

>> I *suspect* that whole TLB flush just magically became an SMP one
>> without anybody ever really thinking about it.
>
> I would be very worried about assuming every non-x86 arch has similar
> TLB semantics. However, if their fault handlers always invalidate TLB
> for pages that get spurious faults, then skipping the remote
> invalidation would be fine. (I believe this is what
> tlb_fix_spurious_fault() is for ?)

Yes. Of course, there may be some case where we unintentionally don't
necessarily flush a faulting address (on some architecture that needs
it), and then removing the cross-cpu invalidate could expose that
pre-existing bug-let, and cause an infinite loop of page faults due to
a TLB entry that never gets invalidated even if the page tables are
actually up-to-date.

So changing the mm/pgtable-generic.c function sounds like the right
thing to do, but would be a bit more scary.

Changing the x86 version sounds safe, *especially* since you point out
that the "fault-causes-tlb-invalidate" is architected behavior.

So I'd almost be willing to drop the invalidate in just one single
commit, because it really should be safe. The only thing it does is
guarantee that the accessed bit gets updated, and the accessed bit
just isn't that important. If we never flush the TLB on another CPU
that continues to use a TLB entry where the accessed bit is set (even
if it's cleared in the in-memory page tables), the worst that can
happen is that the accessed bit doesn't ever get set even if that CPU
constantly uses the page.

And nobody will *ever* care. The A bit is purely a heuristic for the
page LRU thing, we don't care about irrelevant special cases that
won't even affect correctness (much less performance - if that thing
is really hot and stays in the TLB, if we evict it, it will
immediately get reloaded anyway).

And doing a TLB invalidate even locally is worthless: sure, setting
the dirty bit and not invalidating the TLB can cause a local micro-tlb
fault (not a software-visible one, just microarchitectural pipeline
restart with TLB reload) on the next write access (because the TLB
would still contain D=0), so *even if* the CPU didn't
invalidate-on-fault, there's no reason we should invalidate in
software on x86.

Again, this can be different on non-x86 architectures with software
dirty bits, where a stale TLB entry that never gets flushed could
cause infinite TLB faults that never make progress, but that's really
a TLB _walker_ issue, not a generic VM issue.

          Linus

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 05/31] x86/mm: Reduce tlb flushes from ptep_set_access_flags()
  2012-10-26 17:01               ` Linus Torvalds
@ 2012-10-26 17:54                 ` Rik van Riel
  2012-10-26 18:02                   ` Linus Torvalds
  0 siblings, 1 reply; 135+ messages in thread
From: Rik van Riel @ 2012-10-26 17:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michel Lespinasse, Peter Zijlstra, Andrea Arcangeli, Mel Gorman,
	Johannes Weiner, Thomas Gleixner, Andrew Morton, linux-kernel,
	linux-mm, Ingo Molnar

On 10/26/2012 01:01 PM, Linus Torvalds wrote:
> On Fri, Oct 26, 2012 at 5:34 AM, Michel Lespinasse <walken@google.com> wrote:
>> On Thu, Oct 25, 2012 at 9:23 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>>
>>> Yes. It's not architected as far as I know, though. But I agree, it's
>>> possible - even likely - we could avoid TLB flushing entirely on x86.
>>
>> Actually, it is architected on x86. This was first described in the
>> intel appnote 317080 "TLBs, Paging-Structure Caches, and Their
>> Invalidation", last paragraph of section 5.1. Nowadays, the same
>> contents are buried somewhere in Volume 3 of the architecture manual
>> (in my copy: 4.10.4.1 Operations that Invalidate TLBs and
>> Paging-Structure Caches)
>
> Good. I should have known it must be architected, because we've gone
> back-and-forth on this in the kernel historically. We used to have
> some TLB invalidates in the faulting path because I wasn't sure
> whether they were needed or not, but we clearly don't have them any
> more (and I suspect coverage was always spotty).
>
> And Intel (and AMD) have been very good at documenting as architected
> these kinds of details that people end up relying on even if they
> weren't necessarily originally explicitly documented.
>
>>> I *suspect* that whole TLB flush just magically became an SMP one
>>> without anybody ever really thinking about it.
>>
>> I would be very worried about assuming every non-x86 arch has similar
>> TLB semantics. However, if their fault handlers always invalidate TLB
>> for pages that get spurious faults, then skipping the remote
>> invalidation would be fine. (I believe this is what
>> tlb_fix_spurious_fault() is for ?)
>
> Yes. Of course, there may be some case where we unintentionally don't
> necessarily flush a faulting address (on some architecture that needs
> it), and then removing the cross-cpu invalidate could expose that
> pre-existing bug-let, and cause an infinite loop of page faults due to
> a TLB entry that never gets invalidated even if the page tables are
> actually up-to-date.
>
> So changing the mm/pgtable-generic.c function sounds like the right
> thing to do, but would be a bit more scary.
>
> Changing the x86 version sounds safe, *especially* since you point out
> that the "fault-causes-tlb-invalidate" is architected behavior.
>
> So I'd almost be willing to drop the invalidate in just one single
> commit, because it really should be safe. The only thing it does is
> guarantee that the accessed bit gets updated, and the accessed bit
> just isn't that important. If we never flush the TLB on another CPU
> that continues to use a TLB entry where the accessed bit is set (even
> if it's cleared in the in-memory page tables), the worst that can
> happen is that the accessed bit doesn't ever get set even if that CPU
> constantly uses the page.

I suspect it would be safe to simply call tlb_fix_spurious_fault()
both on x86 and in the generic version.

If tlb_fix_spurious_fault is broken on some architecture, it
would already be running into issues like "write page fault
loops until the next context switch" :)

> Again, this can be different on non-x86 architectures with software
> dirty bits, where a stale TLB entry that never gets flushed could
> cause infinite TLB faults that never make progress, but that's really
> a TLB _walker_ issue, not a generic VM issue.

Would tlb_fix_spurious_fault take care of that on those
architectures?


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 05/31] x86/mm: Reduce tlb flushes from ptep_set_access_flags()
  2012-10-26 17:54                 ` Rik van Riel
@ 2012-10-26 18:02                   ` Linus Torvalds
  2012-10-26 18:14                     ` Rik van Riel
  0 siblings, 1 reply; 135+ messages in thread
From: Linus Torvalds @ 2012-10-26 18:02 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Michel Lespinasse, Peter Zijlstra, Andrea Arcangeli, Mel Gorman,
	Johannes Weiner, Thomas Gleixner, Andrew Morton, linux-kernel,
	linux-mm, Ingo Molnar

On Fri, Oct 26, 2012 at 10:54 AM, Rik van Riel <riel@redhat.com> wrote:
>
> Would tlb_fix_spurious_fault take care of that on those
> architectures?

.. assuming that they implement it as a real TLB flush, yes.

But maybe the architecture never noticed that it happened to depend on
the fact that we do a cross-CPU invalidate? So a missing
tlb_fix_spurious_fault() implementation could cause a short loop of
repeated page faults, until the IPI happens. And it would be so
incredibly rare that nobody would ever have noticed.

And if that could have happened, then with the cross-cpu invalidate
removed, the "incredibly rare short-lived constant page fault retry"
could turn into "incredibly rare lockup due to infinite page fault
retry due to TLB entry that never turns dirty despite it being marked
dirty by SW in the in-memory page tables".

Very unlikely, I agree. And this is only relevant for the non-x86
case, so changing the x86-specific optimized version is an independent
issue.

              Linus

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 05/31] x86/mm: Reduce tlb flushes from ptep_set_access_flags()
  2012-10-26 18:02                   ` Linus Torvalds
@ 2012-10-26 18:14                     ` Rik van Riel
  2012-10-26 18:41                       ` Linus Torvalds
  0 siblings, 1 reply; 135+ messages in thread
From: Rik van Riel @ 2012-10-26 18:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michel Lespinasse, Peter Zijlstra, Andrea Arcangeli, Mel Gorman,
	Johannes Weiner, Thomas Gleixner, Andrew Morton, linux-kernel,
	linux-mm, Ingo Molnar

On 10/26/2012 02:02 PM, Linus Torvalds wrote:
> On Fri, Oct 26, 2012 at 10:54 AM, Rik van Riel <riel@redhat.com> wrote:
>>
>> Would tlb_fix_spurious_fault take care of that on those
>> architectures?
>
> .. assuming that they implement it as a real TLB flush, yes.
>
> But maybe the architecture never noticed that it happened to depend on
> the fact that we do a cross-CPU invalidate? So a missing
> tlb_fix_spurious_fault() implementation could cause a short loop of
> repeated page faults, until the IPI happens. And it would be so
> incredibly rare that nobody would ever have noticed.
>
> And if that could have happened, then with the cross-cpu invalidate
> removed, the "incredibly rare short-lived constant page fault retry"
> could turn into "incredibly rare lockup due to infinite page fault
> retry due to TLB entry that never turns dirty despite it being marked
> dirty by SW in the in-memory page tables".

I suspect the next context switch would flush out the TLB,
making it a slowdown, not a lockup.

Still a good reason to make such a change in its own commit,
so it can be bisected and tracked down.

The commit message could tell architecture maintainers what
to do if this particular commit got them into trouble:
implement a proper local TLB flush in tlb_fix_spurious_fault.

I'll send this in as a separate patch.


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 05/31] x86/mm: Reduce tlb flushes from ptep_set_access_flags()
  2012-10-26 18:14                     ` Rik van Riel
@ 2012-10-26 18:41                       ` Linus Torvalds
  0 siblings, 0 replies; 135+ messages in thread
From: Linus Torvalds @ 2012-10-26 18:41 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Michel Lespinasse, Peter Zijlstra, Andrea Arcangeli, Mel Gorman,
	Johannes Weiner, Thomas Gleixner, Andrew Morton, linux-kernel,
	linux-mm, Ingo Molnar

On Fri, Oct 26, 2012 at 11:14 AM, Rik van Riel <riel@redhat.com> wrote:
>
> I suspect the next context switch would flush out the TLB,
> making it a slowdown, not a lockup.

Common case, yes. But the page fault might happen in kernel space (due
to a "put_user()" call, say), and with CONFIG_PREEMPT=n.

Sure, put_user() is always done in a context where blocking (and
scheduling) is legal, but that doesn't necessarily equate to scheduling
actually happening. If we're returning to kernel space and don't have
any IO, it might never happen.
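
A hypothetical illustration of that scenario (the syscall below is made
up purely to make the shape concrete):

#include <linux/syscalls.h>
#include <linux/uaccess.h>
#include <linux/errno.h>

/* Hypothetical syscall, for illustration only. */
SYSCALL_DEFINE1(poke_user, int __user *, p)
{
	/*
	 * If this CPU's TLB still caches a stale read-only entry for *p
	 * while the in-memory PTE is already writable, the write below
	 * keeps faulting: each fault is "spurious" (the PTE needs no
	 * change), nothing in this path blocks, and with CONFIG_PREEMPT=n
	 * no context switch ever flushes the stale entry unless
	 * flush_tlb_fix_spurious_fault() does a real local flush.
	 */
	return put_user(1, p) ? -EFAULT : 0;
}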

Anyway, I suspect such behavior is almost impossible to trigger.
Which would just make it rather hard to find.

             Linus

^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 1/3] x86/mm: only do a local TLB flush in ptep_set_access_flags()
  2012-10-26 13:26                   ` Ingo Molnar
  2012-10-26 13:28                     ` Ingo Molnar
@ 2012-10-26 18:44                     ` Rik van Riel
  2012-10-26 18:49                       ` Linus Torvalds
  2012-10-26 18:45                     ` [PATCH 2/3] x86,mm: drop TLB flush from ptep_set_access_flags Rik van Riel
  2012-10-26 18:46                     ` [PATCH 3/3] mm,generic: only flush the local TLB in ptep_set_access_flags Rik van Riel
  3 siblings, 1 reply; 135+ messages in thread
From: Rik van Riel @ 2012-10-26 18:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Michel Lespinasse, Linus Torvalds, Peter Zijlstra,
	Andrea Arcangeli, Mel Gorman, Johannes Weiner, Thomas Gleixner,
	Andrew Morton, linux-kernel, linux-mm

Here are the TLB patches as requested:
---8<---
    
The function ptep_set_access_flags() is only ever invoked to upgrade
access permissions on a PTE. That makes it safe to skip flushing the
TLBs on remote TLBs. The worst that can happen is a spurious page
fault on other CPUs, which would flush that TLB entry.

Lazily letting another CPU incur a spurious page fault occasionally
is (much!) cheaper than aggressively flushing everybody else's TLB.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/pgtable.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 8573b83..b3b852c 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -310,7 +310,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
 	if (changed && dirty) {
 		*ptep = entry;
 		pte_update_defer(vma->vm_mm, address, ptep);
-		flush_tlb_page(vma, address);
+		__flush_tlb_one(address);
 	}
 
 	return changed;



^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [PATCH 2/3] x86,mm: drop TLB flush from ptep_set_access_flags
  2012-10-26 13:26                   ` Ingo Molnar
  2012-10-26 13:28                     ` Ingo Molnar
  2012-10-26 18:44                     ` [PATCH 1/3] x86/mm: only do a local TLB flush in ptep_set_access_flags() Rik van Riel
@ 2012-10-26 18:45                     ` Rik van Riel
  2012-10-26 21:12                       ` Alan Cox
  2012-10-26 18:46                     ` [PATCH 3/3] mm,generic: only flush the local TLB in ptep_set_access_flags Rik van Riel
  3 siblings, 1 reply; 135+ messages in thread
From: Rik van Riel @ 2012-10-26 18:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Michel Lespinasse, Linus Torvalds, Peter Zijlstra,
	Andrea Arcangeli, Mel Gorman, Johannes Weiner, Thomas Gleixner,
	Andrew Morton, linux-kernel, linux-mm

Intel has an architectural guarantee that the TLB entry causing
a page fault gets invalidated automatically. This means
we should be able to drop the local TLB invalidation.

Because of the way other areas of the page fault code work,
chances are good that all x86 CPUs do this.  However, if
someone somewhere has an x86 CPU that does not invalidate
the TLB entry causing a page fault, this one-liner should
be easy to revert.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
---
 arch/x86/mm/pgtable.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index b3b852c..15e5953 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -310,7 +310,6 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
 	if (changed && dirty) {
 		*ptep = entry;
 		pte_update_defer(vma->vm_mm, address, ptep);
-		__flush_tlb_one(address);
 	}
 
 	return changed;

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [PATCH 3/3] mm,generic: only flush the local TLB in ptep_set_access_flags
  2012-10-26 13:26                   ` Ingo Molnar
                                       ` (2 preceding siblings ...)
  2012-10-26 18:45                     ` [PATCH 2/3] x86,mm: drop TLB flush from ptep_set_access_flags Rik van Riel
@ 2012-10-26 18:46                     ` Rik van Riel
  2012-10-26 18:48                       ` Linus Torvalds
  3 siblings, 1 reply; 135+ messages in thread
From: Rik van Riel @ 2012-10-26 18:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Michel Lespinasse, Linus Torvalds, Peter Zijlstra,
	Andrea Arcangeli, Mel Gorman, Johannes Weiner, Thomas Gleixner,
	Andrew Morton, linux-kernel, linux-mm

The function ptep_set_access_flags is only ever used to upgrade
access permissions to a page. That means the only negative side
effect of not flushing remote TLBs is that other CPUs may incur
spurious page faults, if they happen to access the same address,
and still have a PTE with the old permissions cached in their
TLB.

Having another CPU maybe incur a spurious page fault is faster
than always incurring the cost of a remote TLB flush, so replace
the remote TLB flush with a purely local one.

This should be safe on every architecture that correctly
implements flush_tlb_fix_spurious_fault() to actually invalidate
the local TLB entry that caused a page fault, as well as on
architectures where the hardware invalidates TLB entries that
cause page faults.

In the unlikely event that you are hitting what appears to be
an infinite loop of page faults, and 'git bisect' took you to
this changeset, your architecture needs to implement
flush_tlb_fix_spurious_fault to actually flush the TLB entry.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 mm/pgtable-generic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index e642627..0361369 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -27,7 +27,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
 	int changed = !pte_same(*ptep, entry);
 	if (changed) {
 		set_pte_at(vma->vm_mm, address, ptep, entry);
-		flush_tlb_page(vma, address);
+		flush_tlb_fix_spurious_fault(vma, address);
 	}
 	return changed;
 }

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* Re: [PATCH 3/3] mm,generic: only flush the local TLB in ptep_set_access_flags
  2012-10-26 18:46                     ` [PATCH 3/3] mm,generic: only flush the local TLB in ptep_set_access_flags Rik van Riel
@ 2012-10-26 18:48                       ` Linus Torvalds
  2012-10-26 18:53                         ` Linus Torvalds
  2012-10-26 18:57                         ` Rik van Riel
  0 siblings, 2 replies; 135+ messages in thread
From: Linus Torvalds @ 2012-10-26 18:48 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ingo Molnar, Andi Kleen, Michel Lespinasse, Peter Zijlstra,
	Andrea Arcangeli, Mel Gorman, Johannes Weiner, Thomas Gleixner,
	Andrew Morton, linux-kernel, linux-mm

On Fri, Oct 26, 2012 at 11:46 AM, Rik van Riel <riel@redhat.com> wrote:
>
> The function ptep_set_access_flags is only ever used to upgrade
> access permissions to a page.

NOTE: It's *not* "access permissions". It's "access flags".

Big difference. This is not about permissions at all.

The access flags are the Accessed and Dirty bits. And the dirty bit is
never *cleared* by this function, it's only ever potentially set.
That, together with the fact that the accessed flag is "best effort"
rather than exact, is what makes this function so special to begin
with.
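
Concretely, the flags in question are the bits set by the usual PTE
helpers (x86 bit names shown; illustrative snippet only):

	pte_t entry = orig_pte;		/* current PTE value */
	entry = pte_mkyoung(entry);	/* Accessed bit (_PAGE_ACCESSED) */
	entry = pte_mkdirty(entry);	/* Dirty bit (_PAGE_DIRTY), only ever set here */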

                    Linus

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 1/3] x86/mm: only do a local TLB flush in ptep_set_access_flags()
  2012-10-26 18:44                     ` [PATCH 1/3] x86/mm: only do a local TLB flush in ptep_set_access_flags() Rik van Riel
@ 2012-10-26 18:49                       ` Linus Torvalds
  2012-10-26 19:16                         ` Rik van Riel
  0 siblings, 1 reply; 135+ messages in thread
From: Linus Torvalds @ 2012-10-26 18:49 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ingo Molnar, Andi Kleen, Michel Lespinasse, Peter Zijlstra,
	Andrea Arcangeli, Mel Gorman, Johannes Weiner, Thomas Gleixner,
	Andrew Morton, linux-kernel, linux-mm

On Fri, Oct 26, 2012 at 11:44 AM, Rik van Riel <riel@redhat.com> wrote:
>
> The function ptep_set_access_flags() is only ever invoked to upgrade
> access permissions on a PTE

Same deal. Please don't call these "access permissions". That confuses
the issue.

           Linus

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 3/3] mm,generic: only flush the local TLB in ptep_set_access_flags
  2012-10-26 18:48                       ` Linus Torvalds
@ 2012-10-26 18:53                         ` Linus Torvalds
  2012-10-26 18:57                         ` Rik van Riel
  1 sibling, 0 replies; 135+ messages in thread
From: Linus Torvalds @ 2012-10-26 18:53 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ingo Molnar, Andi Kleen, Michel Lespinasse, Peter Zijlstra,
	Andrea Arcangeli, Mel Gorman, Johannes Weiner, Thomas Gleixner,
	Andrew Morton, linux-kernel, linux-mm

On Fri, Oct 26, 2012 at 11:48 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> NOTE: It's *not* "access permissions". It's "access flags".
>
> Big difference. This is not about permissions at all.

Anyway, modulo the misleading commit messages, ACK from me on the series.

                 Linus

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 3/3] mm,generic: only flush the local TLB in ptep_set_access_flags
  2012-10-26 18:48                       ` Linus Torvalds
  2012-10-26 18:53                         ` Linus Torvalds
@ 2012-10-26 18:57                         ` Rik van Riel
  2012-10-26 19:16                           ` Linus Torvalds
  1 sibling, 1 reply; 135+ messages in thread
From: Rik van Riel @ 2012-10-26 18:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Andi Kleen, Michel Lespinasse, Peter Zijlstra,
	Andrea Arcangeli, Mel Gorman, Johannes Weiner, Thomas Gleixner,
	Andrew Morton, linux-kernel, linux-mm

On 10/26/2012 02:48 PM, Linus Torvalds wrote:
> On Fri, Oct 26, 2012 at 11:46 AM, Rik van Riel <riel@redhat.com> wrote:
>>
>> The function ptep_set_access_flags is only ever used to upgrade
>> access permissions to a page.
>
> NOTE: It's *not* "access permissions". It's "access flags".
>
> Big difference. This is not about permissions at all.

It looks like do_wp_page also sets the write bit in the pte
"entry" before passing it to ptep_set_access_flags, making
that the place where the write bit is set in the pte.

Is this a bug in do_wp_page?

Am I reading things wrong?

reuse:
                 flush_cache_page(vma, address, pte_pfn(orig_pte));
                 entry = pte_mkyoung(orig_pte);
                 entry = maybe_mkwrite(pte_mkdirty(entry), vma);
                 if (ptep_set_access_flags(vma, address, page_table, entry,1))
                         update_mmu_cache(vma, address, page_table);


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 3/3] mm,generic: only flush the local TLB in ptep_set_access_flags
  2012-10-26 18:57                         ` Rik van Riel
@ 2012-10-26 19:16                           ` Linus Torvalds
  2012-10-26 19:33                             ` [PATCH -v2 " Rik van Riel
  0 siblings, 1 reply; 135+ messages in thread
From: Linus Torvalds @ 2012-10-26 19:16 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ingo Molnar, Andi Kleen, Michel Lespinasse, Peter Zijlstra,
	Andrea Arcangeli, Mel Gorman, Johannes Weiner, Thomas Gleixner,
	Andrew Morton, linux-kernel, linux-mm

On Fri, Oct 26, 2012 at 11:57 AM, Rik van Riel <riel@redhat.com> wrote:
>
> It looks like do_wp_page also sets the write bit in the pte
> "entry" before passing it to ptep_set_access_flags, making
> that the place where the write bit is set in the pte.
>
> Is this a bug in do_wp_page?

Hmm. Right you are. That's indeed worth noting that it can indeed
change access permissions in that particular way (ie enabling writes).

So yeah, good catch. And it's ok to add the writeable bits like this
(and it can't race with hardware like the dirty bit can, since
hardware never sets writability).

In fact, it should probably be documented in the source code
somewhere. In particular, there's a very subtle requirement that you
can only set the writable bit if the dirty bit is also set at the same
time, for example.
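
For what it's worth, a comment along those lines might read (suggested
wording only):

	/*
	 * ptep_set_access_flags() only ever *adds* bits to a PTE: the
	 * Accessed and Dirty flags, plus (from do_wp_page) the write bit.
	 * Subtle: the write bit may only be set together with the dirty
	 * bit.
	 */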

                 Linus

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 1/3] x86/mm: only do a local TLB flush in ptep_set_access_flags()
  2012-10-26 18:49                       ` Linus Torvalds
@ 2012-10-26 19:16                         ` Rik van Riel
  2012-10-26 19:18                           ` Linus Torvalds
  0 siblings, 1 reply; 135+ messages in thread
From: Rik van Riel @ 2012-10-26 19:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Andi Kleen, Michel Lespinasse, Peter Zijlstra,
	Andrea Arcangeli, Mel Gorman, Johannes Weiner, Thomas Gleixner,
	Andrew Morton, linux-kernel, linux-mm

On 10/26/2012 02:49 PM, Linus Torvalds wrote:
> On Fri, Oct 26, 2012 at 11:44 AM, Rik van Riel <riel@redhat.com> wrote:
>>
>> The function ptep_set_access_flags() is only ever invoked to upgrade
>> access permissions on a PTE
>
> Same deal. Please don't call these "access permissions". That confuses
> the issue.

I can change the text of the changelog, however it looks
like do_wp_page does actually use ptep_set_access_flags
to set the write bit in the pte...

I guess both need to be reflected in the changelog text
somehow?

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 1/3] x86/mm: only do a local TLB flush in ptep_set_access_flags()
  2012-10-26 19:16                         ` Rik van Riel
@ 2012-10-26 19:18                           ` Linus Torvalds
  2012-10-26 19:21                             ` Rik van Riel
  2012-10-29 15:23                             ` Rik van Riel
  0 siblings, 2 replies; 135+ messages in thread
From: Linus Torvalds @ 2012-10-26 19:18 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ingo Molnar, Andi Kleen, Michel Lespinasse, Peter Zijlstra,
	Andrea Arcangeli, Mel Gorman, Johannes Weiner, Thomas Gleixner,
	Andrew Morton, linux-kernel, linux-mm

On Fri, Oct 26, 2012 at 12:16 PM, Rik van Riel <riel@redhat.com> wrote:
>
> I can change the text of the changelog, however it looks
> like do_wp_page does actually use ptep_set_access_flags
> to set the write bit in the pte...
>
> I guess both need to be reflected in the changelog text
> somehow?

Yeah, and by now, after all this discussion, I suspect it should be
committed with a comment too. Commit messages are good and all, but
unless chasing a particular bug they introduced, we shouldn't expect
people to read them for background information.

                    Linus

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 1/3] x86/mm: only do a local TLB flush in ptep_set_access_flags()
  2012-10-26 19:18                           ` Linus Torvalds
@ 2012-10-26 19:21                             ` Rik van Riel
  2012-10-29 15:23                             ` Rik van Riel
  1 sibling, 0 replies; 135+ messages in thread
From: Rik van Riel @ 2012-10-26 19:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Andi Kleen, Michel Lespinasse, Peter Zijlstra,
	Andrea Arcangeli, Mel Gorman, Johannes Weiner, Thomas Gleixner,
	Andrew Morton, linux-kernel, linux-mm

On 10/26/2012 03:18 PM, Linus Torvalds wrote:
> On Fri, Oct 26, 2012 at 12:16 PM, Rik van Riel <riel@redhat.com> wrote:
>>
>> I can change the text of the changelog, however it looks
>> like do_wp_page does actually use ptep_set_access_flags
>> to set the write bit in the pte...
>>
>> I guess both need to be reflected in the changelog text
>> somehow?
>
> Yeah, and by now, after all this discussion, I suspect it should be
> committed with a comment too. Commit messages are good and all, but
> unless chasing a particular bug they introduced, we shouldn't expect
> people to read them for background information.

Will do :)


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH -v2 3/3] mm,generic: only flush the local TLB in ptep_set_access_flags
  2012-10-26 19:16                           ` Linus Torvalds
@ 2012-10-26 19:33                             ` Rik van Riel
  0 siblings, 0 replies; 135+ messages in thread
From: Rik van Riel @ 2012-10-26 19:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Andi Kleen, Michel Lespinasse, Peter Zijlstra,
	Andrea Arcangeli, Mel Gorman, Johannes Weiner, Thomas Gleixner,
	Andrew Morton, linux-kernel, linux-mm

The function ptep_set_access_flags is only ever used to upgrade
access permissions to a page. That means the only negative side
effect of not flushing remote TLBs is that other CPUs may incur
spurious page faults, if they happen to access the same address,
and still have a PTE with the old permissions cached in their
TLB.

Having another CPU maybe incur a spurious page fault is faster
than always incurring the cost of a remote TLB flush, so replace
the remote TLB flush with a purely local one.

This should be safe on every architecture that correctly
implements flush_tlb_fix_spurious_fault() to actually invalidate
the local TLB entry that caused a page fault, as well as on
architectures where the hardware invalidates TLB entries that
cause page faults.

In the unlikely event that you are hitting what appears to be
an infinite loop of page faults, and 'git bisect' took you to
this changeset, your architecture needs to implement
flush_tlb_fix_spurious_fault to actually flush the TLB entry.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 mm/pgtable-generic.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index e642627..d8397da 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -12,8 +12,8 @@
 
 #ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
 /*
- * Only sets the access flags (dirty, accessed, and
- * writable). Furthermore, we know it always gets set to a "more
+ * Only sets the access flags (dirty, accessed), as well as write 
+ * permission. Furthermore, we know it always gets set to a "more
  * permissive" setting, which allows most architectures to optimize
  * this. We return whether the PTE actually changed, which in turn
  * instructs the caller to do things like update__mmu_cache.  This
@@ -27,7 +27,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
 	int changed = !pte_same(*ptep, entry);
 	if (changed) {
 		set_pte_at(vma->vm_mm, address, ptep, entry);
-		flush_tlb_page(vma, address);
+		flush_tlb_fix_spurious_fault(vma, address);
 	}
 	return changed;
 }



^ permalink raw reply related	[flat|nested] 135+ messages in thread

* Re: [PATCH 2/3] x86,mm: drop TLB flush from ptep_set_access_flags
  2012-10-26 18:45                     ` [PATCH 2/3] x86,mm: drop TLB flush from ptep_set_access_flags Rik van Riel
@ 2012-10-26 21:12                       ` Alan Cox
  2012-10-27  3:49                         ` Rik van Riel
  2012-10-27 13:40                         ` Rik van Riel
  0 siblings, 2 replies; 135+ messages in thread
From: Alan Cox @ 2012-10-26 21:12 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ingo Molnar, Andi Kleen, Michel Lespinasse, Linus Torvalds,
	Peter Zijlstra, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Andrew Morton, linux-kernel, linux-mm

On Fri, 26 Oct 2012 14:45:02 -0400
Rik van Riel <riel@redhat.com> wrote:

> Intel has an architectural guarantee that the TLB entry causing
> a page fault gets invalidated automatically. This means
> we should be able to drop the local TLB invalidation.
> 
> Because of the way other areas of the page fault code work,
> chances are good that all x86 CPUs do this.  However, if
> someone somewhere has an x86 CPU that does not invalidate
> the TLB entry causing a page fault, this one-liner should
> be easy to revert.

This does not strike me as a good standard of validation for such a change

At the very least we should have an ACK from AMD and from VIA, and
preferably ping RDC and some of the other embedded folks. Given an AMD
and VIA ACK I'd be fine. I doubt anyone knows any more what Cyrix CPUs
did or cared about and I imagine H Peter or Linus can answer for
Transmeta ;-)

Alan



^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 2/3] x86,mm: drop TLB flush from ptep_set_access_flags
  2012-10-26 21:12                       ` Alan Cox
@ 2012-10-27  3:49                         ` Rik van Riel
  2012-10-27 10:29                           ` Ingo Molnar
  2012-10-27 13:40                         ` Rik van Riel
  1 sibling, 1 reply; 135+ messages in thread
From: Rik van Riel @ 2012-10-27  3:49 UTC (permalink / raw)
  To: Alan Cox
  Cc: Ingo Molnar, Andi Kleen, Michel Lespinasse, Linus Torvalds,
	Peter Zijlstra, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Andrew Morton, linux-kernel, linux-mm

On 10/26/2012 05:12 PM, Alan Cox wrote:
> On Fri, 26 Oct 2012 14:45:02 -0400
> Rik van Riel <riel@redhat.com> wrote:
>
>> Intel has an architectural guarantee that the TLB entry causing
>> a page fault gets invalidated automatically. This means
>> we should be able to drop the local TLB invalidation.
>>
>> Because of the way other areas of the page fault code work,
>> chances are good that all x86 CPUs do this.  However, if
>> someone somewhere has an x86 CPU that does not invalidate
>> the TLB entry causing a page fault, this one-liner should
>> be easy to revert.
>
> This does not strike me as a good standard of validation for such a change
>
> At the very least we should have an ACK from AMD and from VIA, and
> preferably ping RDC and some of the other embedded folks. Given an AMD
> and VIA ACK I'd be fine. I doubt anyone knows any more what Cyrix CPUs
> did or cared about and I imagine H Peter or Linus can answer for
> Transmeta ;-)

Fair enough.

If it turns out any of those CPUs need an explicit
flush, then we can also adjust flush_tlb_fix_spurious_fault
to actually do a local flush on x86 (or at least on those
CPUs).
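
A sketch of what such an adjustment could look like; the predicate below
is made up for illustration and does not exist in the tree:

/*
 * Sketch only: cpu_needs_spurious_fault_flush() is a hypothetical
 * predicate meant to be true only on CPU models that do not
 * auto-invalidate the TLB entry that caused a page fault.
 */
#define flush_tlb_fix_spurious_fault(vma, address)		\
do {								\
	if (cpu_needs_spurious_fault_flush())			\
		__flush_tlb_one(address);			\
} while (0)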


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 2/3] x86,mm: drop TLB flush from ptep_set_access_flags
  2012-10-27  3:49                         ` Rik van Riel
@ 2012-10-27 10:29                           ` Ingo Molnar
  0 siblings, 0 replies; 135+ messages in thread
From: Ingo Molnar @ 2012-10-27 10:29 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Alan Cox, Andi Kleen, Michel Lespinasse, Linus Torvalds,
	Peter Zijlstra, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Andrew Morton, linux-kernel, linux-mm


* Rik van Riel <riel@redhat.com> wrote:

> On 10/26/2012 05:12 PM, Alan Cox wrote:
> >On Fri, 26 Oct 2012 14:45:02 -0400
> >Rik van Riel <riel@redhat.com> wrote:
> >
> >>Intel has an architectural guarantee that the TLB entry causing
> >>a page fault gets invalidated automatically. This means
> >>we should be able to drop the local TLB invalidation.
> >>
> >>Because of the way other areas of the page fault code work,
> >>chances are good that all x86 CPUs do this.  However, if
> >>someone somewhere has an x86 CPU that does not invalidate
> >>the TLB entry causing a page fault, this one-liner should
> >>be easy to revert.
> >
> >This does not strike me as a good standard of validation for such a change
> >
> >At the very least we should have an ACK from AMD and from VIA, and
> >preferably ping RDC and some of the other embedded folks. Given an AMD
> >and VIA ACK I'd be fine. I doubt anyone knows any more what Cyrix CPUs
> >did or cared about and I imagine H Peter or Linus can answer for
> >Transmeta ;-)
> 
> Fair enough.
> 
> If it turns out any of those CPUs need an explicit flush, then 
> we can also adjust flush_tlb_fix_spurious_fault to actually do 
> a local flush on x86 (or at least on those CPUs).

Yes. And even if we have 'confirmation' from documentation and 
elsewhere, testing has to be done to see actual real behavior of 
CPUs, so this is going to be a separate, bisectable commit put 
under surveillance ;-)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 2/3] x86,mm: drop TLB flush from ptep_set_access_flags
  2012-10-26 21:12                       ` Alan Cox
  2012-10-27  3:49                         ` Rik van Riel
@ 2012-10-27 13:40                         ` Rik van Riel
  2012-10-29 16:57                           ` Borislav Petkov
  1 sibling, 1 reply; 135+ messages in thread
From: Rik van Riel @ 2012-10-27 13:40 UTC (permalink / raw)
  To: Alan Cox
  Cc: Ingo Molnar, Andi Kleen, Michel Lespinasse, Linus Torvalds,
	Peter Zijlstra, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Andrew Morton, linux-kernel, linux-mm, florian,
	Borislav Petkov

On 10/26/2012 05:12 PM, Alan Cox wrote:
> On Fri, 26 Oct 2012 14:45:02 -0400
> Rik van Riel <riel@redhat.com> wrote:
>
>> Intel has an architectural guarantee that the TLB entry causing
>> a page fault gets invalidated automatically. This means
>> we should be able to drop the local TLB invalidation.
>>
>> Because of the way other areas of the page fault code work,
>> chances are good that all x86 CPUs do this.  However, if
>> someone somewhere has an x86 CPU that does not invalidate
>> the TLB entry causing a page fault, this one-liner should
>> be easy to revert.
>
> This does not strike me as a good standard of validation for such a change
>
> At the very least we should have an ACK from AMD and from VIA, and
> preferably ping RDC and some of the other embedded folks. Given an AMD
> and VIA ACK I'd be fine. I doubt anyone knows any more what Cyrix CPUs
> did or cared about and I imagine H Peter or Linus can answer for
> Transmeta ;-)

Florian, would you happen to know who at RDC could be contacted
to verify whether a TLB entry causing a page fault gets
invalidated automatically, upon entering the page fault path?

Borislav, would you happen to know whether AMD (and VIA) CPUs
automatically invalidate TLB entries that cause page faults?
If you do not know, would you happen to know who to ask? :)

If these CPUs do not invalidate a TLB entry causing a page
fault (a write fault on a read-only PTE), then we may have to
change the kernel so flush_tlb_fix_spurious_fault does
something on the CPU models in question...

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/31] numa/core patches
  2012-10-26  9:08   ` Peter Zijlstra
  2012-10-26  9:20     ` Ingo Molnar
@ 2012-10-28 17:56     ` Johannes Weiner
  2012-10-29  2:44       ` Zhouping Liu
  2012-10-30  6:29       ` [PATCH 00/31] numa/core patches Zhouping Liu
  1 sibling, 2 replies; 135+ messages in thread
From: Johannes Weiner @ 2012-10-28 17:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Zhouping Liu, Rik van Riel, Andrea Arcangeli, Mel Gorman,
	Thomas Gleixner, Linus Torvalds, Andrew Morton, linux-kernel,
	linux-mm, Ingo Molnar

On Fri, Oct 26, 2012 at 11:08:00AM +0200, Peter Zijlstra wrote:
> On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
> > [  180.918591] RIP: 0010:[<ffffffff8118c39a>]  [<ffffffff8118c39a>] mem_cgroup_prepare_migration+0xba/0xd0
> 
> > [  182.681450]  [<ffffffff81183b60>] do_huge_pmd_numa_page+0x180/0x500
> > [  182.775090]  [<ffffffff811585c9>] handle_mm_fault+0x1e9/0x360
> > [  182.863038]  [<ffffffff81632b62>] __do_page_fault+0x172/0x4e0
> > [  182.950574]  [<ffffffff8101c283>] ? __switch_to_xtra+0x163/0x1a0
> > [  183.041512]  [<ffffffff8101281e>] ? __switch_to+0x3ce/0x4a0
> > [  183.126832]  [<ffffffff8162d686>] ? __schedule+0x3c6/0x7a0
> > [  183.211216]  [<ffffffff81632ede>] do_page_fault+0xe/0x10
> > [  183.293705]  [<ffffffff8162f518>] page_fault+0x28/0x30 
> 
> Johannes, this looks like the thp migration memcg hookery gone bad,
> could you have a look at this?

Oops.  Here is an incremental fix, feel free to fold it into #31.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5c30a14..0d7ebd3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -801,8 +801,6 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!new_page)
 		goto alloc_fail;
 
-	mem_cgroup_prepare_migration(page, new_page, &memcg);
-
 	lru = PageLRU(page);
 
 	if (lru && isolate_lru_page(page)) /* does an implicit get_page() */
@@ -835,6 +833,14 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 		return;
 	}
+	/*
+	 * Traditional migration needs to prepare the memcg charge
+	 * transaction early to prevent the old page from being
+	 * uncharged when installing migration entries.  Here we can
+	 * save the potential rollback and start the charge transfer
+	 * only when migration is already known to end successfully.
+	 */
+	mem_cgroup_prepare_migration(page, new_page, &memcg);
 
 	entry = mk_pmd(new_page, vma->vm_page_prot);
 	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
@@ -845,6 +851,12 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	set_pmd_at(mm, haddr, pmd, entry);
 	update_mmu_cache_pmd(vma, address, entry);
 	page_remove_rmap(page);
+	/*
+	 * Finish the charge transaction under the page table lock to
+	 * prevent split_huge_page() from dividing up the charge
+	 * before it's fully transferred to the new page.
+	 */
+	mem_cgroup_end_migration(memcg, page, new_page, true);
 	spin_unlock(&mm->page_table_lock);
 
 	put_page(page);			/* Drop the rmap reference */
@@ -856,18 +868,14 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	unlock_page(new_page);
 
-	mem_cgroup_end_migration(memcg, page, new_page, true);
-
 	unlock_page(page);
 	put_page(page);			/* Drop the local reference */
 
 	return;
 
 alloc_fail:
-	if (new_page) {
-		mem_cgroup_end_migration(memcg, page, new_page, false);
+	if (new_page)
 		put_page(new_page);
-	}
 
 	unlock_page(page);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7acf43b..011e510 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3255,15 +3255,18 @@ void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
 				  struct mem_cgroup **memcgp)
 {
 	struct mem_cgroup *memcg = NULL;
+	unsigned int nr_pages = 1;
 	struct page_cgroup *pc;
 	enum charge_type ctype;
 
 	*memcgp = NULL;
 
-	VM_BUG_ON(PageTransHuge(page));
 	if (mem_cgroup_disabled())
 		return;
 
+	if (PageTransHuge(page))
+		nr_pages <<= compound_order(page);
+
 	pc = lookup_page_cgroup(page);
 	lock_page_cgroup(pc);
 	if (PageCgroupUsed(pc)) {
@@ -3325,7 +3328,7 @@ void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
 	 * charged to the res_counter since we plan on replacing the
 	 * old one and only one page is going to be left afterwards.
 	 */
-	__mem_cgroup_commit_charge(memcg, newpage, 1, ctype, false);
+	__mem_cgroup_commit_charge(memcg, newpage, nr_pages, ctype, false);
 }
 
 /* remove redundant charge if migration failed*/
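
For scale, the compound_order() shift above charges the whole huge page
in one go; assuming x86-64 with 4K base pages and 2MB transparent huge
pages, that works out to:

	unsigned int nr_pages = 1;
	if (PageTransHuge(page))
		nr_pages <<= compound_order(page);	/* 1 << 9 == 512 base pages */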

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/31] numa/core patches
  2012-10-28 17:56     ` Johannes Weiner
@ 2012-10-29  2:44       ` Zhouping Liu
  2012-10-29  6:50         ` [PATCH] sched, numa, mm: Add memcg support to do_huge_pmd_numa_page() Ingo Molnar
  2012-10-30  6:29       ` [PATCH 00/31] numa/core patches Zhouping Liu
  1 sibling, 1 reply; 135+ messages in thread
From: Zhouping Liu @ 2012-10-29  2:44 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Peter Zijlstra, Rik van Riel, Andrea Arcangeli, Mel Gorman,
	Thomas Gleixner, Linus Torvalds, Andrew Morton, linux-kernel,
	linux-mm, Ingo Molnar

On 10/29/2012 01:56 AM, Johannes Weiner wrote:
> On Fri, Oct 26, 2012 at 11:08:00AM +0200, Peter Zijlstra wrote:
>> On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
>>> [  180.918591] RIP: 0010:[<ffffffff8118c39a>]  [<ffffffff8118c39a>] mem_cgroup_prepare_migration+0xba/0xd0
>>> [  182.681450]  [<ffffffff81183b60>] do_huge_pmd_numa_page+0x180/0x500
>>> [  182.775090]  [<ffffffff811585c9>] handle_mm_fault+0x1e9/0x360
>>> [  182.863038]  [<ffffffff81632b62>] __do_page_fault+0x172/0x4e0
>>> [  182.950574]  [<ffffffff8101c283>] ? __switch_to_xtra+0x163/0x1a0
>>> [  183.041512]  [<ffffffff8101281e>] ? __switch_to+0x3ce/0x4a0
>>> [  183.126832]  [<ffffffff8162d686>] ? __schedule+0x3c6/0x7a0
>>> [  183.211216]  [<ffffffff81632ede>] do_page_fault+0xe/0x10
>>> [  183.293705]  [<ffffffff8162f518>] page_fault+0x28/0x30
>> Johannes, this looks like the thp migration memcg hookery gone bad,
>> could you have a look at this?
> Oops.  Here is an incremental fix, feel free to fold it into #31.

Hi Johannes,

Tested the below patch, and I'm sure it has fixed the above issue, thank 
you.

  Zhouping

>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 5c30a14..0d7ebd3 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -801,8 +801,6 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>   	if (!new_page)
>   		goto alloc_fail;
>   
> -	mem_cgroup_prepare_migration(page, new_page, &memcg);
> -
>   	lru = PageLRU(page);
>   
>   	if (lru && isolate_lru_page(page)) /* does an implicit get_page() */
> @@ -835,6 +833,14 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>   
>   		return;
>   	}
> +	/*
> +	 * Traditional migration needs to prepare the memcg charge
> +	 * transaction early to prevent the old page from being
> +	 * uncharged when installing migration entries.  Here we can
> +	 * save the potential rollback and start the charge transfer
> +	 * only when migration is already known to end successfully.
> +	 */
> +	mem_cgroup_prepare_migration(page, new_page, &memcg);
>   
>   	entry = mk_pmd(new_page, vma->vm_page_prot);
>   	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> @@ -845,6 +851,12 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>   	set_pmd_at(mm, haddr, pmd, entry);
>   	update_mmu_cache_pmd(vma, address, entry);
>   	page_remove_rmap(page);
> +	/*
> +	 * Finish the charge transaction under the page table lock to
> +	 * prevent split_huge_page() from dividing up the charge
> +	 * before it's fully transferred to the new page.
> +	 */
> +	mem_cgroup_end_migration(memcg, page, new_page, true);
>   	spin_unlock(&mm->page_table_lock);
>   
>   	put_page(page);			/* Drop the rmap reference */
> @@ -856,18 +868,14 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>   
>   	unlock_page(new_page);
>   
> -	mem_cgroup_end_migration(memcg, page, new_page, true);
> -
>   	unlock_page(page);
>   	put_page(page);			/* Drop the local reference */
>   
>   	return;
>   
>   alloc_fail:
> -	if (new_page) {
> -		mem_cgroup_end_migration(memcg, page, new_page, false);
> +	if (new_page)
>   		put_page(new_page);
> -	}
>   
>   	unlock_page(page);
>   
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7acf43b..011e510 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3255,15 +3255,18 @@ void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
>   				  struct mem_cgroup **memcgp)
>   {
>   	struct mem_cgroup *memcg = NULL;
> +	unsigned int nr_pages = 1;
>   	struct page_cgroup *pc;
>   	enum charge_type ctype;
>   
>   	*memcgp = NULL;
>   
> -	VM_BUG_ON(PageTransHuge(page));
>   	if (mem_cgroup_disabled())
>   		return;
>   
> +	if (PageTransHuge(page))
> +		nr_pages <<= compound_order(page);
> +
>   	pc = lookup_page_cgroup(page);
>   	lock_page_cgroup(pc);
>   	if (PageCgroupUsed(pc)) {
> @@ -3325,7 +3328,7 @@ void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
>   	 * charged to the res_counter since we plan on replacing the
>   	 * old one and only one page is going to be left afterwards.
>   	 */
> -	__mem_cgroup_commit_charge(memcg, newpage, 1, ctype, false);
> +	__mem_cgroup_commit_charge(memcg, newpage, nr_pages, ctype, false);
>   }
>   
>   /* remove redundant charge if migration failed*/


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH] sched, numa, mm: Add memcg support to do_huge_pmd_numa_page()
  2012-10-29  2:44       ` Zhouping Liu
@ 2012-10-29  6:50         ` Ingo Molnar
  2012-10-29  8:24           ` Johannes Weiner
  0 siblings, 1 reply; 135+ messages in thread
From: Ingo Molnar @ 2012-10-29  6:50 UTC (permalink / raw)
  To: Zhouping Liu
  Cc: Johannes Weiner, Peter Zijlstra, Rik van Riel, Andrea Arcangeli,
	Mel Gorman, Thomas Gleixner, Linus Torvalds, Andrew Morton,
	linux-kernel, linux-mm


* Zhouping Liu <zliu@redhat.com> wrote:

> Hi Johannes,
> 
> Tested the below patch, and I'm sure it has fixed the above 
> issue, thank you.

Thanks. Below is the folded up patch.

	Ingo

---------------------------->
Subject: sched, numa, mm: Add memcg support to do_huge_pmd_numa_page()
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Thu Oct 25 12:49:51 CEST 2012

Add memory control group support to hugepage migration.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Zhouping Liu <zliu@redhat.com>
Link: http://lkml.kernel.org/n/tip-rDk9mgpoyhZlwh2xhlykvgnp@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/huge_memory.c |   15 +++++++++++++++
 1 file changed, 15 insertions(+)

Index: tip/mm/huge_memory.c
===================================================================
--- tip.orig/mm/huge_memory.c
+++ tip/mm/huge_memory.c
@@ -743,6 +743,7 @@ void do_huge_pmd_numa_page(struct mm_str
 			   unsigned int flags, pmd_t entry)
 {
 	unsigned long haddr = address & HPAGE_PMD_MASK;
+	struct mem_cgroup *memcg = NULL;
 	struct page *new_page = NULL;
 	struct page *page = NULL;
 	int node, lru;
@@ -833,6 +834,14 @@ migrate:
 
 		return;
 	}
+	/*
+	 * Traditional migration needs to prepare the memcg charge
+	 * transaction early to prevent the old page from being
+	 * uncharged when installing migration entries.  Here we can
+	 * save the potential rollback and start the charge transfer
+	 * only when migration is already known to end successfully.
+	 */
+	mem_cgroup_prepare_migration(page, new_page, &memcg);
 
 	entry = mk_pmd(new_page, vma->vm_page_prot);
 	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
@@ -843,6 +852,12 @@ migrate:
 	set_pmd_at(mm, haddr, pmd, entry);
 	update_mmu_cache_pmd(vma, address, entry);
 	page_remove_rmap(page);
+	/*
+	 * Finish the charge transaction under the page table lock to
+	 * prevent split_huge_page() from dividing up the charge
+	 * before it's fully transferred to the new page.
+	 */
+	mem_cgroup_end_migration(memcg, page, new_page, true);
 	spin_unlock(&mm->page_table_lock);
 
 	put_page(page);			/* Drop the rmap reference */


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] sched, numa, mm: Add memcg support to do_huge_pmd_numa_page()
  2012-10-29  6:50         ` [PATCH] sched, numa, mm: Add memcg support to do_huge_pmd_numa_page() Ingo Molnar
@ 2012-10-29  8:24           ` Johannes Weiner
  2012-10-29  8:36             ` Zhouping Liu
  2012-10-29 11:15             ` Ingo Molnar
  0 siblings, 2 replies; 135+ messages in thread
From: Johannes Weiner @ 2012-10-29  8:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zhouping Liu, Peter Zijlstra, Rik van Riel, Andrea Arcangeli,
	Mel Gorman, Thomas Gleixner, Linus Torvalds, Andrew Morton,
	linux-kernel, linux-mm

Hello Ingo!

On Mon, Oct 29, 2012 at 07:50:44AM +0100, Ingo Molnar wrote:
> 
> * Zhouping Liu <zliu@redhat.com> wrote:
> 
> > Hi Johannes,
> > 
> > Tested the below patch, and I'm sure it has fixed the above 
> > issue, thank you.
> 
> Thanks. Below is the folded up patch.
> 
> 	Ingo
> 
> ---------------------------->
> Subject: sched, numa, mm: Add memcg support to do_huge_pmd_numa_page()
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Thu Oct 25 12:49:51 CEST 2012
> 
> Add memory control group support to hugepage migration.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Tested-by: Zhouping Liu <zliu@redhat.com>
> Link: http://lkml.kernel.org/n/tip-rDk9mgpoyhZlwh2xhlykvgnp@git.kernel.org
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  mm/huge_memory.c |   15 +++++++++++++++
>  1 file changed, 15 insertions(+)

Did the mm/memcontrol.c part go missing?

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] sched, numa, mm: Add memcg support to do_huge_pmd_numa_page()
  2012-10-29  8:24           ` Johannes Weiner
@ 2012-10-29  8:36             ` Zhouping Liu
  2012-10-29 11:15             ` Ingo Molnar
  1 sibling, 0 replies; 135+ messages in thread
From: Zhouping Liu @ 2012-10-29  8:36 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Peter Zijlstra, Rik van Riel, Andrea Arcangeli,
	Mel Gorman, Thomas Gleixner, Linus Torvalds, Andrew Morton,
	linux-kernel, linux-mm

On 10/29/2012 04:24 PM, Johannes Weiner wrote:
> Hello Ingo!
>
> On Mon, Oct 29, 2012 at 07:50:44AM +0100, Ingo Molnar wrote:
>> * Zhouping Liu <zliu@redhat.com> wrote:
>>
>>> Hi Johannes,
>>>
>>> Tested the below patch, and I'm sure it has fixed the above
>>> issue, thank you.
>> Thanks. Below is the folded up patch.
>>
>> 	Ingo
>>
>> ---------------------------->
>> Subject: sched, numa, mm: Add memcg support to do_huge_pmd_numa_page()
>> From: Johannes Weiner <hannes@cmpxchg.org>
>> Date: Thu Oct 25 12:49:51 CEST 2012
>>
>> Add memory control group support to hugepage migration.
>>
>> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>> Tested-by: Zhouping Liu <zliu@redhat.com>
>> Link: http://lkml.kernel.org/n/tip-rDk9mgpoyhZlwh2xhlykvgnp@git.kernel.org
>> Signed-off-by: Ingo Molnar <mingo@kernel.org>
>> ---
>>   mm/huge_memory.c |   15 +++++++++++++++
>>   1 file changed, 15 insertions(+)
> Did the mm/memcontrol.c part go missing?

I think so, as the issue still exists with this patch

Thanks,
Zhouping

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] sched, numa, mm: Add memcg support to do_huge_pmd_numa_page()
  2012-10-29  8:24           ` Johannes Weiner
  2012-10-29  8:36             ` Zhouping Liu
@ 2012-10-29 11:15             ` Ingo Molnar
  1 sibling, 0 replies; 135+ messages in thread
From: Ingo Molnar @ 2012-10-29 11:15 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Zhouping Liu, Peter Zijlstra, Rik van Riel, Andrea Arcangeli,
	Mel Gorman, Thomas Gleixner, Linus Torvalds, Andrew Morton,
	linux-kernel, linux-mm


* Johannes Weiner <hannes@cmpxchg.org> wrote:

> Hello Ingo!
> 
> On Mon, Oct 29, 2012 at 07:50:44AM +0100, Ingo Molnar wrote:
> > 
> > * Zhouping Liu <zliu@redhat.com> wrote:
> > 
> > > Hi Johannes,
> > > 
> > > Tested the below patch, and I'm sure it has fixed the above 
> > > issue, thank you.
> > 
> > Thanks. Below is the folded up patch.
> > 
> > 	Ingo
> > 
> > ---------------------------->
> > Subject: sched, numa, mm: Add memcg support to do_huge_pmd_numa_page()
> > From: Johannes Weiner <hannes@cmpxchg.org>
> > Date: Thu Oct 25 12:49:51 CEST 2012
> > 
> > Add memory control group support to hugepage migration.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > Tested-by: Zhouping Liu <zliu@redhat.com>
> > Link: http://lkml.kernel.org/n/tip-rDk9mgpoyhZlwh2xhlykvgnp@git.kernel.org
> > Signed-off-by: Ingo Molnar <mingo@kernel.org>
> > ---
> >  mm/huge_memory.c |   15 +++++++++++++++
> >  1 file changed, 15 insertions(+)
> 
> Did the mm/memcontrol.c part go missing?

Yes :-/

Fixing it up now.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 1/3] x86/mm: only do a local TLB flush in ptep_set_access_flags()
  2012-10-26 19:18                           ` Linus Torvalds
  2012-10-26 19:21                             ` Rik van Riel
@ 2012-10-29 15:23                             ` Rik van Riel
  2012-12-21  9:57                               ` trailing flush_tlb_fix_spurious_fault in handle_pte_fault (was Re: [PATCH 1/3] x86/mm: only do a local TLB flush in ptep_set_access_flags()) Vineet Gupta
  1 sibling, 1 reply; 135+ messages in thread
From: Rik van Riel @ 2012-10-29 15:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Andi Kleen, Michel Lespinasse, Peter Zijlstra,
	Andrea Arcangeli, Mel Gorman, Johannes Weiner, Thomas Gleixner,
	Andrew Morton, linux-kernel, linux-mm

On 10/26/2012 03:18 PM, Linus Torvalds wrote:
> On Fri, Oct 26, 2012 at 12:16 PM, Rik van Riel <riel@redhat.com> wrote:
>>
>> I can change the text of the changelog, however it looks
>> like do_wp_page does actually use ptep_set_access_flags
>> to set the write bit in the pte...
>>
>> I guess both need to be reflected in the changelog text
>> somehow?
>
> Yeah, and by now, after all this discussion, I suspect it should be
> committed with a comment too. Commit messages are good and all, but
> unless chasing a particular bug they introduced, we shouldn't expect
> people to read them for background information.

Now that we have the TLB things taken care of, and
comments to patches 10/31 and 26/31 have been addressed,
is there anything else that needs to be done before
these NUMA patches can be merged?

Anyone, this is a good time to speak up. We have some
time to address whatever concern you may have.

(besides waiting for the next merge window)


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 2/3] x86,mm: drop TLB flush from ptep_set_access_flags
  2012-10-27 13:40                         ` Rik van Riel
@ 2012-10-29 16:57                           ` Borislav Petkov
  2012-10-29 17:06                             ` Linus Torvalds
  0 siblings, 1 reply; 135+ messages in thread
From: Borislav Petkov @ 2012-10-29 16:57 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Alan Cox, Ingo Molnar, Andi Kleen, Michel Lespinasse,
	Linus Torvalds, Peter Zijlstra, Andrea Arcangeli, Mel Gorman,
	Johannes Weiner, Thomas Gleixner, Andrew Morton, linux-kernel,
	linux-mm, florian, Borislav Petkov

On Sat, Oct 27, 2012 at 09:40:41AM -0400, Rik van Riel wrote:
> Borislav, would you happen to know whether AMD (and VIA) CPUs
> automatically invalidate TLB entries that cause page faults? If you do
> not know, would you happen who to ask? :)

Short answer: yes.

Long answer (from APM v2, section 5.5.2):

Use of Cached Entries When Reporting a Page Fault Exception.

On current AMD64 processors, when any type of page fault exception is
encountered by the MMU, any cached upper-level entries that lead to the
faulting entry are flushed (along with the TLB entry, if already cached)
and the table walk is repeated to confirm the page fault using the
table entries in memory. This is done because a table entry is allowed
to be upgraded (by marking it as present, or by removing its write,
execute or supervisor restrictions) without explicitly maintaining TLB
coherency. Such an upgrade will be found when the table is re-walked,
which resolves the fault. If the fault is confirmed on the re-walk
however, a page fault exception is reported, and upper level entries
that may have been cached on the re-walk are flushed.

HTH.

-- 
Regards/Gruss,
Boris.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 2/3] x86,mm: drop TLB flush from ptep_set_access_flags
  2012-10-29 16:57                           ` Borislav Petkov
@ 2012-10-29 17:06                             ` Linus Torvalds
  2012-11-17 14:50                               ` Borislav Petkov
  0 siblings, 1 reply; 135+ messages in thread
From: Linus Torvalds @ 2012-10-29 17:06 UTC (permalink / raw)
  To: Borislav Petkov, Rik van Riel, Alan Cox, Ingo Molnar, Andi Kleen,
	Michel Lespinasse, Peter Zijlstra, Andrea Arcangeli, Mel Gorman,
	Johannes Weiner, Thomas Gleixner, Andrew Morton, linux-kernel,
	linux-mm, florian, Borislav Petkov

On Mon, Oct 29, 2012 at 9:57 AM, Borislav Petkov <bp@alien8.de> wrote:
>
> On current AMD64 processors,

Can you verify that this is true for older cpu's too (ie the old
pre-64-bit ones, say K6 and original Athlon)?

>                 This is done because a table entry is allowed
> to be upgraded (by marking it as present

Well, that was traditionally solved by not caching not-present entries
at all. Which can be a problem for some things (prefetch of NULL etc),
so caching and then re-checking on faults is potentially the correct
thing, but I'm just mentioning it because it might not be much of an
argument for older microarchitectures..

>, or by removing its write,
> execute or supervisor restrictions) without explicitly maintaining TLB
> coherency. Such an upgrade will be found when the table is re-walked,
> which resolves the fault.

.. but this is obviously what we're interested in. And since AMD has
documented it (as well as Intel), I have this strong suspicion that
operating systems have traditionally relied on this behavior.

I don't remember the test coverage details from my Transmeta days, and
while I certainly saw the page table walker, it wasn't my code.

My gut feel is that this is likely something x86 just always does
(because it's the right thing to do to keep things simple for
software), but getting explicit confirmation about older AMD cpu's
would definitely be good.

                  Linus

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/31] numa/core patches
  2012-10-28 17:56     ` Johannes Weiner
  2012-10-29  2:44       ` Zhouping Liu
@ 2012-10-30  6:29       ` Zhouping Liu
  2012-10-31  0:48         ` Johannes Weiner
  1 sibling, 1 reply; 135+ messages in thread
From: Zhouping Liu @ 2012-10-30  6:29 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Peter Zijlstra, Rik van Riel, Andrea Arcangeli, Mel Gorman,
	Thomas Gleixner, Linus Torvalds, Andrew Morton, linux-kernel,
	linux-mm, Ingo Molnar, CAI Qian

On 10/29/2012 01:56 AM, Johannes Weiner wrote:
> On Fri, Oct 26, 2012 at 11:08:00AM +0200, Peter Zijlstra wrote:
>> On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
>>> [  180.918591] RIP: 0010:[<ffffffff8118c39a>]  [<ffffffff8118c39a>] mem_cgroup_prepare_migration+0xba/0xd0
>>> [  182.681450]  [<ffffffff81183b60>] do_huge_pmd_numa_page+0x180/0x500
>>> [  182.775090]  [<ffffffff811585c9>] handle_mm_fault+0x1e9/0x360
>>> [  182.863038]  [<ffffffff81632b62>] __do_page_fault+0x172/0x4e0
>>> [  182.950574]  [<ffffffff8101c283>] ? __switch_to_xtra+0x163/0x1a0
>>> [  183.041512]  [<ffffffff8101281e>] ? __switch_to+0x3ce/0x4a0
>>> [  183.126832]  [<ffffffff8162d686>] ? __schedule+0x3c6/0x7a0
>>> [  183.211216]  [<ffffffff81632ede>] do_page_fault+0xe/0x10
>>> [  183.293705]  [<ffffffff8162f518>] page_fault+0x28/0x30
>> Johannes, this looks like the thp migration memcg hookery gone bad,
>> could you have a look at this?
> Oops.  Here is an incremental fix, feel free to fold it into #31.
Hello Johannes,

maybe I don't think the below patch completely fix this issue, as I 
found a new error(maybe similar with this):

[88099.923724] ------------[ cut here ]------------
[88099.924036] kernel BUG at mm/memcontrol.c:1134!
[88099.924036] invalid opcode: 0000 [#1] SMP
[88099.924036] Modules linked in: lockd sunrpc kvm_amd kvm 
amd64_edac_mod edac_core ses enclosure serio_raw bnx2 pcspkr shpchp 
joydev i2c_piix4 edac_mce_amd k8temp dcdbas ata_generic pata_acpi 
megaraid_sas pata_serverworks usb_storage radeon i2c_algo_bit 
drm_kms_helper ttm drm i2c_core
[88099.924036] CPU 7
[88099.924036] Pid: 3441, comm: stress Not tainted 3.7.0-rc2Jons+ #3 
Dell Inc. PowerEdge 6950/0WN213
[88099.924036] RIP: 0010:[<ffffffff81188e97>] [<ffffffff81188e97>] 
mem_cgroup_update_lru_size+0x27/0x30
[88099.924036] RSP: 0000:ffff88021b247ca8  EFLAGS: 00010082
[88099.924036] RAX: ffff88011d310138 RBX: ffffea0002f18000 RCX: 
0000000000000001
[88099.924036] RDX: fffffffffffffe00 RSI: 000000000000000e RDI: 
ffff88011d310138
[88099.924036] RBP: ffff88021b247ca8 R08: 0000000000000000 R09: 
a8000bc600000000
[88099.924036] R10: 0000000000000000 R11: 0000000000000000 R12: 
00000000fffffe00
[88099.924036] R13: ffff88011ffecb40 R14: 0000000000000286 R15: 
0000000000000000
[88099.924036] FS:  00007f787d0bf740(0000) GS:ffff88021fc80000(0000) 
knlGS:0000000000000000
[88099.924036] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[88099.924036] CR2: 00007f7873a00010 CR3: 000000021bda0000 CR4: 
00000000000007e0
[88099.924036] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[88099.924036] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 
0000000000000400
[88099.924036] Process stress (pid: 3441, threadinfo ffff88021b246000, 
task ffff88021b399760)
[88099.924036] Stack:
[88099.924036]  ffff88021b247cf8 ffffffff8113a9cd ffffea0002f18000 
ffff88011d310138
[88099.924036]  0000000000000200 ffffea0002f18000 ffff88019bace580 
00007f7873c00000
[88099.924036]  ffff88021aca0cf0 ffffea00081e0000 ffff88021b247d18 
ffffffff8113aa7d
[88099.924036] Call Trace:
[88099.924036]  [<ffffffff8113a9cd>] __page_cache_release.part.11+0xdd/0x140
[88099.924036]  [<ffffffff8113aa7d>] __put_compound_page+0x1d/0x30
[88099.924036]  [<ffffffff8113ac4d>] put_compound_page+0x5d/0x1e0
[88099.924036]  [<ffffffff8113b1a5>] put_page+0x45/0x50
[88099.924036]  [<ffffffff8118378c>] do_huge_pmd_numa_page+0x2ec/0x4e0
[88099.924036]  [<ffffffff81158089>] handle_mm_fault+0x1e9/0x360
[88099.924036]  [<ffffffff8162cd22>] __do_page_fault+0x172/0x4e0
[88099.924036]  [<ffffffff810958b9>] ? task_numa_work+0x1c9/0x220
[88099.924036]  [<ffffffff8107c56c>] ? task_work_run+0xac/0xe0
[88099.924036]  [<ffffffff8162d09e>] do_page_fault+0xe/0x10
[88099.924036]  [<ffffffff816296d8>] page_fault+0x28/0x30
[88099.924036] Code: 00 00 00 00 66 66 66 66 90 44 8b 1d 1c 90 b5 00 55 48 89 e5 45 85 db 75 10 89 f6 48 63 d2 48 83 c6 0e 48 01 54 f7 08 78 02 5d c3 <0f> 0b 0f 1f 80 00 00 00 00 66 66 66 66 90 55 48 89 e5 48 83 ec
[88099.924036] RIP  [<ffffffff81188e97>] mem_cgroup_update_lru_size+0x27/0x30
[88099.924036]  RSP <ffff88021b247ca8>
[88099.924036] ---[ end trace c8d6b169e0c3f25a ]---
[88108.054610] ------------[ cut here ]------------
[88108.054610] WARNING: at kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xd0()
[88108.054610] Hardware name: PowerEdge 6950
[88108.054610] Watchdog detected hard LOCKUP on cpu 3
[88108.054610] Modules linked in: lockd sunrpc kvm_amd kvm 
amd64_edac_mod edac_core ses enclosure serio_raw bnx2 pcspkr shpchp 
joydev i2c_piix4 edac_mce_amd k8temp dcdbas ata_generic pata_acpi 
megaraid_sas pata_serverworks usb_storage radeon i2c_algo_bit 
drm_kms_helper ttm drm i2c_core
[88108.054610] Pid: 3429, comm: stress Tainted: G      D 3.7.0-rc2Jons+ #3
[88108.054610] Call Trace:
[88108.054610]  <NMI>  [<ffffffff8105c29f>] warn_slowpath_common+0x7f/0xc0
[88108.054610]  [<ffffffff8105c396>] warn_slowpath_fmt+0x46/0x50
[88108.054610]  [<ffffffff81093fa8>] ? sched_clock_cpu+0xa8/0x120
[88108.054610]  [<ffffffff810e95c0>] ? touch_nmi_watchdog+0x80/0x80
[88108.054610]  [<ffffffff810e965c>] watchdog_overflow_callback+0x9c/0xd0
[88108.054610]  [<ffffffff81124e6d>] __perf_event_overflow+0x9d/0x230
[88108.054610]  [<ffffffff81121f44>] ? perf_event_update_userpage+0x24/0x110
[88108.054610]  [<ffffffff81125a74>] perf_event_overflow+0x14/0x20
[88108.054610]  [<ffffffff8102440a>] x86_pmu_handle_irq+0x10a/0x160
[88108.054610]  [<ffffffff8162ac4d>] perf_event_nmi_handler+0x1d/0x20
[88108.054610]  [<ffffffff8162a411>] nmi_handle.isra.0+0x51/0x80
[88108.054610]  [<ffffffff8162a5b9>] do_nmi+0x179/0x350
[88108.054610]  [<ffffffff81629a30>] end_repeat_nmi+0x1e/0x2e
[88108.054610]  [<ffffffff816290c2>] ? _raw_spin_lock_irqsave+0x32/0x40
[88108.054610]  [<ffffffff816290c2>] ? _raw_spin_lock_irqsave+0x32/0x40
[88108.054610]  [<ffffffff816290c2>] ? _raw_spin_lock_irqsave+0x32/0x40
[88108.054610]  <<EOE>>  [<ffffffff8113b087>] pagevec_lru_move_fn+0x97/0x110
[88108.054610]  [<ffffffff8113a5f0>] ? pagevec_move_tail_fn+0x80/0x80
[88108.054610]  [<ffffffff8113b11c>] __pagevec_lru_add+0x1c/0x20
[88108.054610]  [<ffffffff8113b4e8>] __lru_cache_add+0x68/0x90
[88108.054610]  [<ffffffff8113b71b>] lru_cache_add_lru+0x3b/0x60
[88108.054610]  [<ffffffff81161151>] page_add_new_anon_rmap+0xc1/0x170
[88108.054610]  [<ffffffff811854b2>] do_huge_pmd_anonymous_page+0x242/0x330
[88108.054610]  [<ffffffff81158162>] handle_mm_fault+0x2c2/0x360
[88108.054610]  [<ffffffff8162cd22>] __do_page_fault+0x172/0x4e0
[88108.054610]  [<ffffffff8109520f>] ? __dequeue_entity+0x2f/0x50
[88108.054610]  [<ffffffff810125d1>] ? __switch_to+0x181/0x4a0
[88108.054610]  [<ffffffff8162d09e>] do_page_fault+0xe/0x10
[88108.054610]  [<ffffffff816296d8>] page_fault+0x28/0x30
[88108.054610] ---[ end trace c8d6b169e0c3f25b ]---
......
......

it's easy to reproduce with the stress[1] workload.
The command I used is '# stress -i 20 -m 30 -v'

I will report it under a new subject if it turns out to be a new issue.

Let me know if you need any other info.

[1] http://weather.ou.edu/~apw/projects/stress/

Thanks,
Zhouping
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 5c30a14..0d7ebd3 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -801,8 +801,6 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>   	if (!new_page)
>   		goto alloc_fail;
>   
> -	mem_cgroup_prepare_migration(page, new_page, &memcg);
> -
>   	lru = PageLRU(page);
>   
>   	if (lru && isolate_lru_page(page)) /* does an implicit get_page() */
> @@ -835,6 +833,14 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>   
>   		return;
>   	}
> +	/*
> +	 * Traditional migration needs to prepare the memcg charge
> +	 * transaction early to prevent the old page from being
> +	 * uncharged when installing migration entries.  Here we can
> +	 * save the potential rollback and start the charge transfer
> +	 * only when migration is already known to end successfully.
> +	 */
> +	mem_cgroup_prepare_migration(page, new_page, &memcg);
>   
>   	entry = mk_pmd(new_page, vma->vm_page_prot);
>   	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> @@ -845,6 +851,12 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>   	set_pmd_at(mm, haddr, pmd, entry);
>   	update_mmu_cache_pmd(vma, address, entry);
>   	page_remove_rmap(page);
> +	/*
> +	 * Finish the charge transaction under the page table lock to
> +	 * prevent split_huge_page() from dividing up the charge
> +	 * before it's fully transferred to the new page.
> +	 */
> +	mem_cgroup_end_migration(memcg, page, new_page, true);
>   	spin_unlock(&mm->page_table_lock);
>   
>   	put_page(page);			/* Drop the rmap reference */
> @@ -856,18 +868,14 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>   
>   	unlock_page(new_page);
>   
> -	mem_cgroup_end_migration(memcg, page, new_page, true);
> -
>   	unlock_page(page);
>   	put_page(page);			/* Drop the local reference */
>   
>   	return;
>   
>   alloc_fail:
> -	if (new_page) {
> -		mem_cgroup_end_migration(memcg, page, new_page, false);
> +	if (new_page)
>   		put_page(new_page);
> -	}
>   
>   	unlock_page(page);
>   
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7acf43b..011e510 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3255,15 +3255,18 @@ void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
>   				  struct mem_cgroup **memcgp)
>   {
>   	struct mem_cgroup *memcg = NULL;
> +	unsigned int nr_pages = 1;
>   	struct page_cgroup *pc;
>   	enum charge_type ctype;
>   
>   	*memcgp = NULL;
>   
> -	VM_BUG_ON(PageTransHuge(page));
>   	if (mem_cgroup_disabled())
>   		return;
>   
> +	if (PageTransHuge(page))
> +		nr_pages <<= compound_order(page);
> +
>   	pc = lookup_page_cgroup(page);
>   	lock_page_cgroup(pc);
>   	if (PageCgroupUsed(pc)) {
> @@ -3325,7 +3328,7 @@ void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
>   	 * charged to the res_counter since we plan on replacing the
>   	 * old one and only one page is going to be left afterwards.
>   	 */
> -	__mem_cgroup_commit_charge(memcg, newpage, 1, ctype, false);
> +	__mem_cgroup_commit_charge(memcg, newpage, nr_pages, ctype, false);
>   }
>   
>   /* remove redundant charge if migration failed*/


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/31] numa/core patches
  2012-10-25 12:16 [PATCH 00/31] numa/core patches Peter Zijlstra
                   ` (31 preceding siblings ...)
  2012-10-26  9:07 ` [PATCH 00/31] numa/core patches Zhouping Liu
@ 2012-10-30 12:20 ` Mel Gorman
  2012-10-30 15:28   ` Andrew Morton
                     ` (2 more replies)
  2012-11-05 17:11 ` Srikar Dronamraju
  33 siblings, 3 replies; 135+ messages in thread
From: Mel Gorman @ 2012-10-30 12:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, linux-kernel, linux-mm,
	Ingo Molnar

On Thu, Oct 25, 2012 at 02:16:17PM +0200, Peter Zijlstra wrote:
> Hi all,
> 
> Here's a re-post of the NUMA scheduling and migration improvement
> patches that we are working on. These include techniques from
> AutoNUMA and the sched/numa tree and form a unified basis - it
> has got all the bits that look good and mergeable.
> 

Thanks for the repost. I have not even started a review yet as I was
travelling and just online today. It will be another day or two before I can
start but I was at least able to do a comparison test between autonuma and
schednuma today to see which actually performs the best. Even without the
review I was able to stick on similar vmstat patches as were applied to autonuma
to give a rough estimate of the relative overhead of both implementations.

Machine was a 4-node box running autonumabench and specjbb.

Three kernels are

3.7-rc2-stats-v2r1	vmstat patches on top
3.7-rc2-autonuma-v27	latest autonuma with stats on top
3.7-rc2-schednuma-v1r3	schednuma series minus the last patch + stats

AUTONUMA BENCH
                                          3.7.0                 3.7.0                 3.7.0
                                 rc2-stats-v2r1    rc2-autonuma-v27r8    rc2-schednuma-v1r3
User    NUMA01               67145.71 (  0.00%)    30110.13 ( 55.16%)    61666.46 (  8.16%)
User    NUMA01_THEADLOCAL    55104.60 (  0.00%)    17285.49 ( 68.63%)    17135.48 ( 68.90%)
User    NUMA02                7074.54 (  0.00%)     2219.11 ( 68.63%)     2226.09 ( 68.53%)
User    NUMA02_SMT            2916.86 (  0.00%)      999.19 ( 65.74%)     1038.06 ( 64.41%)
System  NUMA01                  42.28 (  0.00%)      469.07 (-1009.44%)     2808.08 (-6541.63%)
System  NUMA01_THEADLOCAL       41.71 (  0.00%)      183.24 (-339.32%)      174.92 (-319.37%)
System  NUMA02                  34.67 (  0.00%)       27.85 ( 19.67%)       15.03 ( 56.65%)
System  NUMA02_SMT               0.89 (  0.00%)       18.36 (-1962.92%)        5.05 (-467.42%)
Elapsed NUMA01                1512.97 (  0.00%)      698.18 ( 53.85%)     1422.71 (  5.97%)
Elapsed NUMA01_THEADLOCAL     1264.23 (  0.00%)      389.51 ( 69.19%)      377.51 ( 70.14%)
Elapsed NUMA02                 181.52 (  0.00%)       60.65 ( 66.59%)       52.86 ( 70.88%)
Elapsed NUMA02_SMT             163.59 (  0.00%)       58.57 ( 64.20%)       48.82 ( 70.16%)
CPU     NUMA01                4440.00 (  0.00%)     4379.00 (  1.37%)     4531.00 ( -2.05%)
CPU     NUMA01_THEADLOCAL     4362.00 (  0.00%)     4484.00 ( -2.80%)     4585.00 ( -5.11%)
CPU     NUMA02                3916.00 (  0.00%)     3704.00 (  5.41%)     4239.00 ( -8.25%)
CPU     NUMA02_SMT            1783.00 (  0.00%)     1737.00 (  2.58%)     2136.00 (-19.80%)

Two figures really matter here - System CPU usage and Elapsed time.

autonuma was known to hurt system CPU usage for the NUMA01 test case but
schednuma does *far* worse. I do not have a breakdown of where this time
is being spent but the raw figure is bad. autonuma is 10 times worse
than a vanilla kernel and schednuma is 5 times worse than autonuma.

For the overhead of the other test cases, schednuma is roughly
comparable with autonuma -- i.e. both pretty high overhead.

In terms of elapsed time, autonuma in the NUMA01 test case massively
improves elapsed time while schednuma barely makes a dent on it. Looking
at the memory usage per node (I generated a graph offline), it appears
that schednuma does not migrate pages to other nodes fast enough. The
convergence figures do not reflect this because the convergence seems
high (towards 1) but it may be because the approximation using faults is
insufficient.

In the other cases, schednuma does well and is comparable to autonuma.

MMTests Statistics: duration
               3.7.0       3.7.0       3.7.0
        rc2-stats-v2r1  rc2-autonuma-v27r8  rc2-schednuma-v1r3
User       132248.88    50620.50    82073.11
System        120.19      699.12     3003.83
Elapsed      3131.10     1215.63     1911.55

This is the overall time to complete the test. autonuma is way better
than schednuma but this is all due to how it handles the NUMA01 test
case.

MMTests Statistics: vmstat
                                 3.7.0       3.7.0       3.7.0
                          rc2-stats-v2r1  rc2-autonuma-v27r8  rc2-schednuma-v1r3
Page Ins                         37256       37508       37360
Page Outs                        28888       13372       19488
Swap Ins                             0           0           0
Swap Outs                            0           0           0
Direct pages scanned                 0           0           0
Kswapd pages scanned                 0           0           0
Kswapd pages reclaimed               0           0           0
Direct pages reclaimed               0           0           0
Kswapd efficiency                 100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000
Direct efficiency                 100%        100%        100%
Direct velocity                  0.000       0.000       0.000
Percentage direct scans             0%          0%          0%
Page writes by reclaim               0           0           0
Page writes file                     0           0           0
Page writes anon                     0           0           0
Page reclaim immediate               0           0           0
Page rescued immediate               0           0           0
Slabs scanned                        0           0           0
Direct inode steals                  0           0           0
Kswapd inode steals                  0           0           0
Kswapd skipped wait                  0           0           0
THP fault alloc                  17370       17923       13399
THP collapse alloc                   6       12385           3
THP splits                           3       12577           0
THP fault fallback                   0           0           0
THP collapse fail                    0           0           0
Compaction stalls                    0           0           0
Compaction success                   0           0           0
Compaction failures                  0           0           0
Page migrate success                 0     7061327       57167
Page migrate failure                 0           0           0
Compaction pages isolated            0           0           0
Compaction migrate scanned           0           0           0
Compaction free scanned              0           0           0
Compaction cost                      0        7329          59
NUMA PTE updates                     0      191503      123214
NUMA hint faults                     0    13322261      818762
NUMA hint local faults               0     9813514      756797
NUMA pages migrated                  0     7061327       57167
AutoNUMA cost                        0       66746        4095

The "THP collapse alloc" figures are interesting but reflect the fact
that schednuma can migrate THP pages natively whereas autonuma does
not. 

The "Page migrate success" figure is more interesting. autonuma migrates
much more aggressively even though "NUMA PTE updates" are not that
different.

For reasons that are not immediately obvious, autonuma incurs far more
"NUMA hint faults" even though the PTE updates are not that different. I
expect when I actually review the code this will be due to differences
in how and when the two implementations decide to mark a PTE PROT_NONE.
A stronger possibility is that autonuma is not natively migrating THP
pages.  I also expect autonuma is continually scanning whereas schednuma is
reacting to some other external event or at least less frequently scanning.
Obviously, I cannot rule out the possibility that the stats patch was buggy.

Because of the fewer faults, the "cost model" figure for schednuma is lower.
Obviously there is a disconnect here because System CPU usage is high
but the cost model only takes a few limited variables into account.

In terms of absolute performance (elapsed time), autonuma is currently
better than schednuma. schednuma has high System CPU overhead in one case
for some unknown reason and introduces a lot of overhead there, but in
general it did less work than autonuma as it incurred fewer faults.

Finally, I recorded node-load-misses,node-store-misses events. These are
the total number of events recorded

stats-v2r1	  94600194
autonuma	 945370766
schednuma	2828322708

It was surprising to me that the number of events recorded was higher -
page table accesses maybe? Either way, schednuma missed a *LOT* more
than autonuma but maybe I'm misinterpreting the meaning of the
node-load-misses,node-store-misses events as I haven't had the chance
yet to dig down and see what perf maps those events onto.

SPECJBB BOPS
                          3.7.0                 3.7.0                 3.7.0
                 rc2-stats-v2r1    rc2-autonuma-v27r8    rc2-schednuma-v1r3
Mean   1      25960.00 (  0.00%)     24884.25 ( -4.14%)     25056.00 ( -3.48%)
Mean   2      53997.50 (  0.00%)     55744.25 (  3.23%)     52165.75 ( -3.39%)
Mean   3      78454.25 (  0.00%)     82321.75 (  4.93%)     76939.25 ( -1.93%)
Mean   4     101131.25 (  0.00%)    106996.50 (  5.80%)     99365.00 ( -1.75%)
Mean   5     120807.00 (  0.00%)    129999.50 (  7.61%)    118492.00 ( -1.92%)
Mean   6     135793.50 (  0.00%)    152013.25 ( 11.94%)    133139.75 ( -1.95%)
Mean   7     137686.75 (  0.00%)    158556.00 ( 15.16%)    136070.25 ( -1.17%)
Mean   8     135802.25 (  0.00%)    160725.50 ( 18.35%)    140158.75 (  3.21%)
Mean   9     129194.00 (  0.00%)    161531.00 ( 25.03%)    137308.00 (  6.28%)
Mean   10    125457.00 (  0.00%)    156800.00 ( 24.98%)    136357.50 (  8.69%)
Mean   11    121733.75 (  0.00%)    154211.25 ( 26.68%)    138089.50 ( 13.44%)
Mean   12    110556.25 (  0.00%)    149009.75 ( 34.78%)    138835.50 ( 25.58%)
Mean   13    107484.75 (  0.00%)    144792.25 ( 34.71%)    128099.50 ( 19.18%)
Mean   14    105733.00 (  0.00%)    141304.75 ( 33.64%)    118950.50 ( 12.50%)
Mean   15    104492.00 (  0.00%)    138179.00 ( 32.24%)    119325.75 ( 14.20%)
Mean   16    103312.75 (  0.00%)    136635.00 ( 32.25%)    116104.50 ( 12.38%)
Mean   17    101999.25 (  0.00%)    134625.00 ( 31.99%)    114375.75 ( 12.13%)
Mean   18    100107.75 (  0.00%)    132831.25 ( 32.69%)    114352.25 ( 14.23%)
TPut   1     103840.00 (  0.00%)     99537.00 ( -4.14%)    100224.00 ( -3.48%)
TPut   2     215990.00 (  0.00%)    222977.00 (  3.23%)    208663.00 ( -3.39%)
TPut   3     313817.00 (  0.00%)    329287.00 (  4.93%)    307757.00 ( -1.93%)
TPut   4     404525.00 (  0.00%)    427986.00 (  5.80%)    397460.00 ( -1.75%)
TPut   5     483228.00 (  0.00%)    519998.00 (  7.61%)    473968.00 ( -1.92%)
TPut   6     543174.00 (  0.00%)    608053.00 ( 11.94%)    532559.00 ( -1.95%)
TPut   7     550747.00 (  0.00%)    634224.00 ( 15.16%)    544281.00 ( -1.17%)
TPut   8     543209.00 (  0.00%)    642902.00 ( 18.35%)    560635.00 (  3.21%)
TPut   9     516776.00 (  0.00%)    646124.00 ( 25.03%)    549232.00 (  6.28%)
TPut   10    501828.00 (  0.00%)    627200.00 ( 24.98%)    545430.00 (  8.69%)
TPut   11    486935.00 (  0.00%)    616845.00 ( 26.68%)    552358.00 ( 13.44%)
TPut   12    442225.00 (  0.00%)    596039.00 ( 34.78%)    555342.00 ( 25.58%)
TPut   13    429939.00 (  0.00%)    579169.00 ( 34.71%)    512398.00 ( 19.18%)
TPut   14    422932.00 (  0.00%)    565219.00 ( 33.64%)    475802.00 ( 12.50%)
TPut   15    417968.00 (  0.00%)    552716.00 ( 32.24%)    477303.00 ( 14.20%)
TPut   16    413251.00 (  0.00%)    546540.00 ( 32.25%)    464418.00 ( 12.38%)
TPut   17    407997.00 (  0.00%)    538500.00 ( 31.99%)    457503.00 ( 12.13%)
TPut   18    400431.00 (  0.00%)    531325.00 ( 32.69%)    457409.00 ( 14.23%)

In reality, this report is larger but I chopped it down a bit for
brevity. autonuma beats schednuma *heavily* on this benchmark both in
terms of average operations per numa node and overall throughput.

SPECJBB PEAKS
                                       3.7.0                      3.7.0                      3.7.0
                              rc2-stats-v2r1         rc2-autonuma-v27r8         rc2-schednuma-v1r3
 Expctd Warehouse                   12.00 (  0.00%)                   12.00 (  0.00%)                   12.00 (  0.00%)
 Expctd Peak Bops               442225.00 (  0.00%)               596039.00 ( 34.78%)               555342.00 ( 25.58%)
 Actual Warehouse                    7.00 (  0.00%)                    9.00 ( 28.57%)                    8.00 ( 14.29%)
 Actual Peak Bops               550747.00 (  0.00%)               646124.00 ( 17.32%)               560635.00 (  1.80%)

autonuma was also able to handle more simultaneous warehouses, peaking at
9 warehouses in comparison to schednuma's 8 and the vanilla kernel's 7. Of
course all fell short of the expected peak of 12 but that's neither here
nor there.

MMTests Statistics: duration
               3.7.0       3.7.0       3.7.0
        rc2-stats-v2r1  rc2-autonuma-v27r8  rc2-schednuma-v1r3
User       481580.26   478759.35   464261.89
System        179.35      803.59    16577.76
Elapsed     10398.85    10354.08    10383.61

Duration is the same but the benchmark should run for roughly the same
length of time each time so that is not earth shattering.

However, look at the System CPU usage. autonuma was bad but schednuma is
*completely* out of control.

MMTests Statistics: vmstat
                                 3.7.0       3.7.0       3.7.0
                          rc2-stats-v2r1  rc2-autonuma-v27r8  rc2-schednuma-v1r3
Page Ins                         33220       33896       33664
Page Outs                       111332      113116      115972
Swap Ins                             0           0           0
Swap Outs                            0           0           0
Direct pages scanned                 0           0           0
Kswapd pages scanned                 0           0           0
Kswapd pages reclaimed               0           0           0
Direct pages reclaimed               0           0           0
Kswapd efficiency                 100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000
Direct efficiency                 100%        100%        100%
Direct velocity                  0.000       0.000       0.000
Percentage direct scans             0%          0%          0%
Page writes by reclaim               0           0           0
Page writes file                     0           0           0
Page writes anon                     0           0           0
Page reclaim immediate               0           0           0
Page rescued immediate               0           0           0
Slabs scanned                        0           0           0
Direct inode steals                  0           0           0
Kswapd inode steals                  0           0           0
Kswapd skipped wait                  0           0           0
THP fault alloc                      1           2           1
THP collapse alloc                   0          21           0
THP splits                           0           1           0
THP fault fallback                   0           0           0
THP collapse fail                    0           0           0
Compaction stalls                    0           0           0
Compaction success                   0           0           0
Compaction failures                  0           0           0
Page migrate success                 0     8070314   399095844
Page migrate failure                 0           0           0
Compaction pages isolated            0           0           0
Compaction migrate scanned           0           0           0
Compaction free scanned              0           0           0
Compaction cost                      0        8376      414261
NUMA PTE updates                     0        3841     1110729
NUMA hint faults                     0  2033295070  2945111212
NUMA hint local faults               0  1895230022  2545845756
NUMA pages migrated                  0     8070314   399095844
AutoNUMA cost                        0    10166628    14733146

Interesting to note that native THP migration makes no difference here.

schednuma migrated a lot more aggressively in this test, and incurred
*way* more PTE updates. I have no explanation for this but overall
schednuma was far heavier than autonuma.

So, without reviewing the code at all, it seems to me that schednuma is
not the obvious choice for merging above autonuma as the merge to -tip
implied -- at least based on these figures. By and large, autonuma seems
to perform better and while I know that some of its paths are heavy, it
was also clear during review of the code that the overhead could have been
reduced incrementally. Maybe the same can be said for schednuma, we'll see,
but I expect that the actual performance will be taken into account during
merging, as well as the relative maintenance effort.


> Please review .. once again and holler if you see anything funny! :-)
> 

Consider the figures above to be a hollering that I think something
might be screwy in schednuma :)

I'll do a release of mmtests if you want to use the same benchmarks or
see if I messed up how it was benchmarked which is quite possible as
this was a rush job while I was travelling.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/31] numa/core patches
  2012-10-30 12:20 ` Mel Gorman
@ 2012-10-30 15:28   ` Andrew Morton
  2012-10-30 16:59     ` Mel Gorman
  2012-11-03 11:04   ` Alex Shi
  2012-11-09  8:51   ` Rik van Riel
  2 siblings, 1 reply; 135+ messages in thread
From: Andrew Morton @ 2012-10-30 15:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Rik van Riel, Andrea Arcangeli, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, linux-kernel, linux-mm,
	Ingo Molnar


On Tue, 30 Oct 2012 12:20:32 +0000 Mel Gorman <mgorman@suse.de> wrote:

> ...

Useful testing - thanks.  Did I miss the description of what
autonumabench actually does?  How representative is it of real-world
things?

> I also expect autonuma is continually scanning where as schednuma is
> reacting to some other external event or at least less frequently scanning.

Might this imply that autonuma is consuming more CPU in kernel threads,
the cost of which didn't get included in these results?

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/31] numa/core patches
  2012-10-30 15:28   ` Andrew Morton
@ 2012-10-30 16:59     ` Mel Gorman
  0 siblings, 0 replies; 135+ messages in thread
From: Mel Gorman @ 2012-10-30 16:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, Rik van Riel, Andrea Arcangeli, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, linux-kernel, linux-mm,
	Ingo Molnar

On Tue, Oct 30, 2012 at 08:28:10AM -0700, Andrew Morton wrote:
> 
> On Tue, 30 Oct 2012 12:20:32 +0000 Mel Gorman <mgorman@suse.de> wrote:
> 
> > ...
> 
> Useful testing - thanks.  Did I miss the description of what
> autonumabench actually does?  How representitive is it of real-world
> things?
> 

It's not representative of anything at all. It's a synthetic benchmark
that just measures if automatic NUMA migration (whatever the mechanism)
is working as expected. I'm not aware of a decent description of what
the test does and why. Here is my current interpretation and hopefully
Andrea will correct me if I'm wrong.

NUMA01
  Two processes
  NUM_CPUS/2 number of threads so all CPUs are in use
  
  On startup, the process forks
  Each process mallocs a 3G buffer but there is no communication
      between the processes.
  Threads are created that zero out the full buffer 1000 times

  The objective of the test is that initially the two processes
  allocate their memory on the same node. As the threads are
  created, the memory will migrate from the initial node to
  nodes that are closer to the referencing thread.

  It is worth noting that this benchmark is specifically tuned
  for two nodes and the expectation is that the two processes
  and their threads split so that all of process A runs on node 0
  and all threads of process B run on node 1

  With 4 or more nodes, this is actually an adverse workload.
  As all the buffer is zeroed in both processes, there is an
  expectation that it will continually bounce between two nodes.

  So, on 2 nodes, this benchmark tests convergence. On 4 or more
  nodes, this partially measures how much busy work automatic
  NUMA migration does and it'll be very noisy due to cache conflicts.

NUMA01_THREADLOCAL
  Two processes
  NUM_CPUS/2 number of threads so all CPUs are in use

  On startup, the process forks
  Each process mallocs a 3G buffer but there is no communication
      between the processes
  Threads are created that zero out their own subset of the buffer.
      Each buffer is 3G/NR_THREADS in size
  
  This benchmark is more realistic. In an ideal situation, each
  thread will migrate its data to its local node. The test really
  is to see whether it converges and how quickly.
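
  For illustration only, here is a rough userspace sketch of the NUMA01
  and NUMA01_THREADLOCAL access patterns described above. It is not the
  autonumabench source; the buffer size, thread count and iteration
  count are simply the values mentioned here.

#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE   (3UL << 30)		/* 3G per process */
#define ITERATIONS 1000

static char *buf;
static long nthreads;
static int threadlocal;	/* 0: NUMA01 pattern, 1: NUMA01_THREADLOCAL pattern */

static void *worker(void *arg)
{
	long id = (long)arg;
	size_t chunk = BUF_SIZE / nthreads;
	/*
	 * NUMA01: every thread writes the whole buffer, so both processes
	 * keep touching all of their memory from every CPU they run on.
	 * NUMA01_THREADLOCAL: each thread only writes its own slice, so
	 * the memory can converge to the thread's node.
	 */
	char *start = threadlocal ? buf + id * chunk : buf;
	size_t len = threadlocal ? chunk : BUF_SIZE;
	int i;

	for (i = 0; i < ITERATIONS; i++)
		memset(start, 0, len);
	return NULL;
}

int main(int argc, char **argv)
{
	long i;

	threadlocal = argc > 1;
	nthreads = sysconf(_SC_NPROCESSORS_ONLN) / 2;

	if (fork() < 0)			/* two processes, no shared memory */
		return 1;

	pthread_t threads[nthreads];
	buf = malloc(BUF_SIZE);		/* each process gets its own 3G buffer */
	if (!buf)
		return 1;

	for (i = 0; i < nthreads; i++)
		pthread_create(&threads[i], NULL, worker, (void *)i);
	for (i = 0; i < nthreads; i++)
		pthread_join(threads[i], NULL);
	return 0;
}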

NUMA02
 One process, NR_CPU threads

 On startup, malloc a 1G buffer
 Create threads that zero out a thread-local portion of the buffer.
      Zeroes it multiple times - the number of iterations is fixed and
      seems to exist just to make the test take a period of time

 This is similar in principle to NUMA01_THREADLOCAL except that only
 one process is involved. I think it was aimed at being more JVM-like.

NUMA02_SMT
 One process, NR_CPU/2 threads

 This is a variation of NUMA02 except that with half the cores idle it
 checks whether the system migrates the memory to two or more nodes or
 whether it tries to fit everything in one node, even though the memory
 should migrate to be close to the CPU that is using it

> > I also expect autonuma is continually scanning where as schednuma is
> > reacting to some other external event or at least less frequently scanning.
> 
> Might this imply that autonuma is consuming more CPU in kernel threads,
> the cost of which didn't get included in these results?

It might but according to top, knuma_scand only used 7.86 seconds of CPU
time during the whole test and the time used by the migration threads is
also very low. Most migration threads used less than 1 second of CPU
time. Two migration threads used 2 seconds of CPU time each but that
still seems low.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 26/31] sched, numa, mm: Add fault driven placement and migration policy
  2012-10-25 12:16 ` [PATCH 26/31] sched, numa, mm: Add fault driven placement and migration policy Peter Zijlstra
  2012-10-25 20:53   ` Linus Torvalds
@ 2012-10-30 19:23   ` Rik van Riel
  2012-11-01 15:40   ` Mel Gorman
  2 siblings, 0 replies; 135+ messages in thread
From: Rik van Riel @ 2012-10-30 19:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Arcangeli, Mel Gorman, Johannes Weiner, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, linux-kernel, linux-mm,
	Ingo Molnar

On 10/25/2012 08:16 AM, Peter Zijlstra wrote:
> +/*
> + * Drive the periodic memory faults..
> + */
> +void task_tick_numa(struct rq *rq, struct task_struct *curr)
> +{
> +	struct callback_head *work = &curr->numa_work;
> +	u64 period, now;
> +
> +	/*
> +	 * We don't care about NUMA placement if we don't have memory.
> +	 */
> +	if (!curr->mm || (curr->flags & PF_EXITING) || work->next != work)
> +		return;

We should probably skip the whole unmap-and-refault
business if we are running on a system that is not
NUMA.  Ie. a system with just one node...
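
Something like this early in task_tick_numa() would do it (untested
sketch; the exact condition is of course up to you, num_online_nodes()
is just the obvious helper):

	/*
	 * Untested sketch of the above: on a single-node system there
	 * are no placement decisions to make, so skip the
	 * unmap-and-refault machinery entirely.
	 */
	if (num_online_nodes() <= 1)
		return;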

> +	/*
> +	 * Using runtime rather than walltime has the dual advantage that
> +	 * we (mostly) drive the selection from busy threads and that the
> +	 * task needs to have done some actual work before we bother with
> +	 * NUMA placement.
> +	 */
> +	now = curr->se.sum_exec_runtime;
> +	period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
> +



^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/31] numa/core patches
  2012-10-30  6:29       ` [PATCH 00/31] numa/core patches Zhouping Liu
@ 2012-10-31  0:48         ` Johannes Weiner
  2012-10-31  7:26           ` Hugh Dickins
  0 siblings, 1 reply; 135+ messages in thread
From: Johannes Weiner @ 2012-10-31  0:48 UTC (permalink / raw)
  To: Zhouping Liu
  Cc: Peter Zijlstra, Rik van Riel, Andrea Arcangeli, Mel Gorman,
	Thomas Gleixner, Linus Torvalds, Andrew Morton, linux-kernel,
	linux-mm, Ingo Molnar, CAI Qian, Hugh Dickins

On Tue, Oct 30, 2012 at 02:29:25PM +0800, Zhouping Liu wrote:
> On 10/29/2012 01:56 AM, Johannes Weiner wrote:
> >On Fri, Oct 26, 2012 at 11:08:00AM +0200, Peter Zijlstra wrote:
> >>On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
> >>>[  180.918591] RIP: 0010:[<ffffffff8118c39a>]  [<ffffffff8118c39a>] mem_cgroup_prepare_migration+0xba/0xd0
> >>>[  182.681450]  [<ffffffff81183b60>] do_huge_pmd_numa_page+0x180/0x500
> >>>[  182.775090]  [<ffffffff811585c9>] handle_mm_fault+0x1e9/0x360
> >>>[  182.863038]  [<ffffffff81632b62>] __do_page_fault+0x172/0x4e0
> >>>[  182.950574]  [<ffffffff8101c283>] ? __switch_to_xtra+0x163/0x1a0
> >>>[  183.041512]  [<ffffffff8101281e>] ? __switch_to+0x3ce/0x4a0
> >>>[  183.126832]  [<ffffffff8162d686>] ? __schedule+0x3c6/0x7a0
> >>>[  183.211216]  [<ffffffff81632ede>] do_page_fault+0xe/0x10
> >>>[  183.293705]  [<ffffffff8162f518>] page_fault+0x28/0x30
> >>Johannes, this looks like the thp migration memcg hookery gone bad,
> >>could you have a look at this?
> >Oops.  Here is an incremental fix, feel free to fold it into #31.
> Hello Johannes,
> 
> maybe I don't think the below patch completely fix this issue, as I
> found a new error(maybe similar with this):
> 
> [88099.923724] ------------[ cut here ]------------
> [88099.924036] kernel BUG at mm/memcontrol.c:1134!
> [88099.924036] invalid opcode: 0000 [#1] SMP
> [88099.924036] Modules linked in: lockd sunrpc kvm_amd kvm
> amd64_edac_mod edac_core ses enclosure serio_raw bnx2 pcspkr shpchp
> joydev i2c_piix4 edac_mce_amd k8temp dcdbas ata_generic pata_acpi
> megaraid_sas pata_serverworks usb_storage radeon i2c_algo_bit
> drm_kms_helper ttm drm i2c_core
> [88099.924036] CPU 7
> [88099.924036] Pid: 3441, comm: stress Not tainted 3.7.0-rc2Jons+ #3
> Dell Inc. PowerEdge 6950/0WN213
> [88099.924036] RIP: 0010:[<ffffffff81188e97>] [<ffffffff81188e97>]
> mem_cgroup_update_lru_size+0x27/0x30

Thanks a lot for your testing efforts, I really appreciate it.

I'm looking into it, but I don't expect power to get back for several
days where I live, so it's hard to reproduce it locally.

But that looks like an LRU accounting imbalance that I wasn't able to
tie to this patch yet.  Do you see weird numbers for the lru counters
in /proc/vmstat even without this memory cgroup patch?  Ccing Hugh as
well.

Thanks,
Johannes

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/31] numa/core patches
  2012-10-31  0:48         ` Johannes Weiner
@ 2012-10-31  7:26           ` Hugh Dickins
  2012-10-31 13:15             ` Zhouping Liu
  0 siblings, 1 reply; 135+ messages in thread
From: Hugh Dickins @ 2012-10-31  7:26 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Zhouping Liu, Peter Zijlstra, Rik van Riel, Andrea Arcangeli,
	Mel Gorman, Thomas Gleixner, Linus Torvalds, Andrew Morton,
	linux-kernel, linux-mm, Ingo Molnar, CAI Qian

On Tue, 30 Oct 2012, Johannes Weiner wrote:
> On Tue, Oct 30, 2012 at 02:29:25PM +0800, Zhouping Liu wrote:
> > On 10/29/2012 01:56 AM, Johannes Weiner wrote:
> > >On Fri, Oct 26, 2012 at 11:08:00AM +0200, Peter Zijlstra wrote:
> > >>On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
> > >>>[  180.918591] RIP: 0010:[<ffffffff8118c39a>]  [<ffffffff8118c39a>] mem_cgroup_prepare_migration+0xba/0xd0
> > >>>[  182.681450]  [<ffffffff81183b60>] do_huge_pmd_numa_page+0x180/0x500
> > >>>[  182.775090]  [<ffffffff811585c9>] handle_mm_fault+0x1e9/0x360
> > >>>[  182.863038]  [<ffffffff81632b62>] __do_page_fault+0x172/0x4e0
> > >>>[  182.950574]  [<ffffffff8101c283>] ? __switch_to_xtra+0x163/0x1a0
> > >>>[  183.041512]  [<ffffffff8101281e>] ? __switch_to+0x3ce/0x4a0
> > >>>[  183.126832]  [<ffffffff8162d686>] ? __schedule+0x3c6/0x7a0
> > >>>[  183.211216]  [<ffffffff81632ede>] do_page_fault+0xe/0x10
> > >>>[  183.293705]  [<ffffffff8162f518>] page_fault+0x28/0x30
> > >>Johannes, this looks like the thp migration memcg hookery gone bad,
> > >>could you have a look at this?
> > >Oops.  Here is an incremental fix, feel free to fold it into #31.
> > Hello Johannes,
> > 
> > maybe I don't think the below patch completely fix this issue, as I
> > found a new error(maybe similar with this):
> > 
> > [88099.923724] ------------[ cut here ]------------
> > [88099.924036] kernel BUG at mm/memcontrol.c:1134!
> > [88099.924036] invalid opcode: 0000 [#1] SMP
> > [88099.924036] Modules linked in: lockd sunrpc kvm_amd kvm
> > amd64_edac_mod edac_core ses enclosure serio_raw bnx2 pcspkr shpchp
> > joydev i2c_piix4 edac_mce_amd k8temp dcdbas ata_generic pata_acpi
> > megaraid_sas pata_serverworks usb_storage radeon i2c_algo_bit
> > drm_kms_helper ttm drm i2c_core
> > [88099.924036] CPU 7
> > [88099.924036] Pid: 3441, comm: stress Not tainted 3.7.0-rc2Jons+ #3
> > Dell Inc. PowerEdge 6950/0WN213
> > [88099.924036] RIP: 0010:[<ffffffff81188e97>] [<ffffffff81188e97>]
> > mem_cgroup_update_lru_size+0x27/0x30
> 
> Thanks a lot for your testing efforts, I really appreciate it.
> 
> I'm looking into it, but I don't expect power to get back for several
> days where I live, so it's hard to reproduce it locally.
> 
> But that looks like an LRU accounting imbalance that I wasn't able to
> tie to this patch yet.  Do you see weird numbers for the lru counters
> in /proc/vmstat even without this memory cgroup patch?  Ccing Hugh as
> well.

Sorry, I didn't get very far with it tonight.

Almost certain to be a page which was added to lru while it looked like
a 4k page, but taken off lru as a 2M page: we are taking a 2M page off
lru here, it's likely to be the page in question, but not necessarily.

There's quite a few put_page()s in do_huge_pmd_numa_page(), and it
would help if we could focus on the one which is giving the trouble,
but I don't know which that is.  Zhouping, if you can, please would
you do an "objdump -ld vmlinux >bigfile" of your kernel, then extract
from bigfile just the lines from "<do_huge_pmd_numa_page>:" to whatever
is the next function, and post or mail privately just that disassembly.
That should be good to identify which of the put_page()s is involved.

do_huge_pmd_numa_page() does look a bit worrying, but I've not pinned
the misaccounting seen to the aspects which have worried me so far.
Where is a check for page_mapcount(page) being 1?  And surely it's
unsafe to be migrating the page when it was found !PageLRU?  It's
quite likely to be sitting in a pagevec or on a local list somewhere,
about to be added to lru at any moment.
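
Purely to illustrate the checks being asked about -- the 'out' label
and its exact placement in do_huge_pmd_numa_page() are hypothetical:

	/*
	 * Illustration only: bail out and leave the page where it is
	 * unless we are the only mapper and it is really on the LRU.
	 */
	if (page_mapcount(page) != 1)
		goto out;
	if (!PageLRU(page))
		goto out;	/* may still be sitting in a pagevec */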

Hugh

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/31] numa/core patches
  2012-10-31  7:26           ` Hugh Dickins
@ 2012-10-31 13:15             ` Zhouping Liu
  2012-10-31 17:31               ` Hugh Dickins
  0 siblings, 1 reply; 135+ messages in thread
From: Zhouping Liu @ 2012-10-31 13:15 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Johannes Weiner, Peter Zijlstra, Rik van Riel, Andrea Arcangeli,
	Mel Gorman, Thomas Gleixner, Linus Torvalds, Andrew Morton,
	linux-kernel, linux-mm, Ingo Molnar, CAI Qian

On 10/31/2012 03:26 PM, Hugh Dickins wrote:
> On Tue, 30 Oct 2012, Johannes Weiner wrote:
>
> [88099.923724] ------------[ cut here ]------------
> [88099.924036] kernel BUG at mm/memcontrol.c:1134!
> [88099.924036] invalid opcode: 0000 [#1] SMP
> [88099.924036] Modules linked in: lockd sunrpc kvm_amd kvm
> amd64_edac_mod edac_core ses enclosure serio_raw bnx2 pcspkr shpchp
> joydev i2c_piix4 edac_mce_amd k8temp dcdbas ata_generic pata_acpi
> megaraid_sas pata_serverworks usb_storage radeon i2c_algo_bit
> drm_kms_helper ttm drm i2c_core
> [88099.924036] CPU 7
> [88099.924036] Pid: 3441, comm: stress Not tainted 3.7.0-rc2Jons+ #3
> Dell Inc. PowerEdge 6950/0WN213
> [88099.924036] RIP: 0010:[<ffffffff81188e97>] [<ffffffff81188e97>]
> mem_cgroup_update_lru_size+0x27/0x30
>> Thanks a lot for your testing efforts, I really appreciate it.
>>
>> I'm looking into it, but I don't expect power to get back for several
>> days where I live, so it's hard to reproduce it locally.
>>
>> But that looks like an LRU accounting imbalance that I wasn't able to
>> tie to this patch yet.  Do you see weird numbers for the lru counters
>> in /proc/vmstat even without this memory cgroup patch?  Ccing Hugh as
>> well.
> Sorry, I didn't get very far with it tonight.
>
> Almost certain to be a page which was added to lru while it looked like
> a 4k page, but taken off lru as a 2M page: we are taking a 2M page off
> lru here, it's likely to be the page in question, but not necessarily.
>
> There's quite a few put_page()s in do_huge_pmd_numa_page(), and it
> would help if we could focus on the one which is giving the trouble,
> but I don't know which that is.  Zhouping, if you can, please would
> you do an "objdump -ld vmlinux >bigfile" of your kernel, then extract
> from bigfile just the lines from "<do_huge_pmd_numa_page>:" to whatever
> is the next function, and post or mail privately just that disassembly.
> That should be good to identify which of the put_page()s is involved.

Hugh, I didn't find the next function, as I can't find any text that
matches "do_huge_pmd_numa_page".
Is there any other method? Also, I tried to use kdump to dump a vmcore
file, but unluckily kdump didn't work well. If you think a vmcore file
would be useful, I can try it again and provide more info.

Thanks,
Zhouping

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/31] numa/core patches
  2012-10-31 13:15             ` Zhouping Liu
@ 2012-10-31 17:31               ` Hugh Dickins
  2012-11-01 13:41                 ` Hugh Dickins
  0 siblings, 1 reply; 135+ messages in thread
From: Hugh Dickins @ 2012-10-31 17:31 UTC (permalink / raw)
  To: Zhouping Liu
  Cc: Johannes Weiner, Peter Zijlstra, Rik van Riel, Andrea Arcangeli,
	Mel Gorman, Thomas Gleixner, Linus Torvalds, Andrew Morton,
	linux-kernel, linux-mm, Ingo Molnar, CAI Qian

On Wed, 31 Oct 2012, Zhouping Liu wrote:
> On 10/31/2012 03:26 PM, Hugh Dickins wrote:
> > 
> > There's quite a few put_page()s in do_huge_pmd_numa_page(), and it
> > would help if we could focus on the one which is giving the trouble,
> > but I don't know which that is.  Zhouping, if you can, please would
> > you do an "objdump -ld vmlinux >bigfile" of your kernel, then extract
> > from bigfile just the lines from "<do_huge_pmd_numa_page>:" to whatever
> > is the next function, and post or mail privately just that disassembly.
> > That should be good to identify which of the put_page()s is involved.
> 
> Hugh, I didn't find the next function, as I can't find any words that matched
> "do_huge_pmd_numa_page".
> is there any other methods?

Hmm, do_huge_pmd_numa_page does appear in your stacktrace,
unless I've made a typo but am blind to it.

Were you applying objdump to the vmlinux which gave you the
BUG at mm/memcontrol.c:1134! ?

Maybe just do "objdump -ld mm/huge_memory.o >notsobigfile"
and mail me an attachment of the notsobigfile.

I did try building your config here last night, but ran out of disk
space on this partition, and it was already clear that my gcc version
differs from yours, so not quite matching.

> also I tried to use kdump to dump vmcore file,
> but unluckily kdump didn't
> work well, if you think it useful to dump vmcore file, I can try it again and
> provide more info.

It would take me awhile to get up to speed on using that,
I'd prefer to start with just the objdump of huge_memory.o.

I forgot last night to say that I did try stress (but not on a kernel
of your config), but didn't see the BUG: I expect there are too many
differences in our environments, and I'd have to tweak things one way
or another to get it to happen - probably a waste of time.

Thanks,
Hugh

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 02/31] sched, numa, mm: Describe the NUMA scheduling problem formally
  2012-10-25 12:16 ` [PATCH 02/31] sched, numa, mm: Describe the NUMA scheduling problem formally Peter Zijlstra
@ 2012-11-01  9:56   ` Mel Gorman
  2012-11-01 13:13     ` Rik van Riel
  0 siblings, 1 reply; 135+ messages in thread
From: Mel Gorman @ 2012-11-01  9:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, linux-kernel, linux-mm,
	H. Peter Anvin, Mike Galbraith, Ingo Molnar

On Thu, Oct 25, 2012 at 02:16:19PM +0200, Peter Zijlstra wrote:
> This is probably a first: formal description of a complex high-level
> computing problem, within the kernel source.
> 

Who does not love the smell of formal methods first thing in the
morning?

> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Mike Galbraith <efault@gmx.de>
> Rik van Riel <riel@redhat.com>
> [ Next step: generate the kernel source from such formal descriptions and retire to a tropical island! ]

You can use the computing award monies as a pension.

> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  Documentation/scheduler/numa-problem.txt |  230 +++++++++++++++++++++++++++++++
>  1 file changed, 230 insertions(+)
>  create mode 100644 Documentation/scheduler/numa-problem.txt
> 
> Index: tip/Documentation/scheduler/numa-problem.txt
> ===================================================================
> --- /dev/null
> +++ tip/Documentation/scheduler/numa-problem.txt
> @@ -0,0 +1,230 @@
> +
> +
> +Effective NUMA scheduling problem statement, described formally:
> +
> + * minimize interconnect traffic
> +
> +For each task 't_i' we have memory, this memory can be spread over multiple
> +physical nodes, let us denote this as: 'p_i,k', the memory task 't_i' has on
> +node 'k' in [pages].  
> +
> +If a task shares memory with another task let us denote this as:
> +'s_i,k', the memory shared between tasks including 't_i' residing on node
> +'k'.
> +

This does not take into account false sharing. T_0 and T_1 could map a
region MAP_SHARED that is not page-aligned. It is approximately, but not
quite, a shared page and how it is detected matters. For the purposes of
the optimisation, it sounds like it should not matter but as the NUMA01
test case is a worst-case scenario for false-sharing and sched-numa suffers
badly there, it might be important.

> +Let 'M' be the distribution that governs all 'p' and 's', ie. the page placement.
> +
> +Similarly, lets define 'fp_i,k' and 'fs_i,k' resp. as the (average) usage
> +frequency over those memory regions [1/s] such that the product gives an
> +(average) bandwidth 'bp' and 'bs' in [pages/s].
> +

We cannot directly measure this without using profiles all of the time.
I assume we will approximate this with sampling but it does mean we depend
very heavily on that sampling being representative to make the correct
decisions.

> +(note: multiple tasks sharing memory naturally avoid duplicat accounting
> +       because each task will have its own access frequency 'fs')
> +
> +(pjt: I think this frequency is more numerically consistent if you explicitly 
> +      restrict p/s above to be the working-set. (It also makes explicit the 
> +      requirement for <C0,M0> to change about a change in the working set.)
> +

Do you mean p+s? i.e. explicitly restrict p and s to be all task-local
and task-shared pages currently used by the system? If so, I agree that it
would be numerically more consistent. If p_i is all mapped pages instead
of the working set then depending on exactly how it is calculated, the
"average usage" can appear to drop if the process maps more regions. This
would be unhelpful because it sets up perverse incentives for tasks to
game the algorithm.
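
To put made-up numbers on that (assuming the average is taken over
everything counted in p_i, with 4K pages):

  1G working set, 1000 faults/s:  f = 1000 / 262144  ~= 3.8e-3 [faults/s/page]
  same accesses, but 4G mapped:   f = 1000 / 1048576 ~= 9.5e-4 [faults/s/page]

The same real access rate looks roughly 4 times "colder" purely because
more address space happens to be mapped.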

> +      Doing this does have the nice property that it lets you use your frequency
> +      measurement as a weak-ordering for the benefit a task would receive when
> +      we can't fit everything.
> +
> +      e.g. task1 has working set 10mb, f=90%
> +           task2 has working set 90mb, f=10%
> +
> +      Both are using 9mb/s of bandwidth, but we'd expect a much larger benefit
> +      from task1 being on the right node than task2. )
> +
> +Let 'C' map every task 't_i' to a cpu 'c_i' and its corresponding node 'n_i':
> +
> +  C: t_i -> {c_i, n_i}
> +
> +This gives us the total interconnect traffic between nodes 'k' and 'l',
> +'T_k,l', as:
> +
> +  T_k,l = \Sum_i bp_i,l + bs_i,l + \Sum bp_j,k + bs_j,k where n_i == k, n_j == l
> +

Task in this case must really mean a kernel task. It does not and should not
distinguish between processes and threads because for the purposes of p_i and
s_i, it doesn't matter. If this is right, there is no harm in clarifying it.

> +And our goal is to obtain C0 and M0 such that:
> +
> +  T_k,l(C0, M0) =< T_k,l(C, M) for all C, M where k != l
> +

You could add "Informally, the goal is to minimise interconnect
traffic".

> +(note: we could introduce 'nc(k,l)' as the cost function of accessing memory
> +       on node 'l' from node 'k', this would be useful for bigger NUMA systems
> +
> + pjt: I agree nice to have, but intuition suggests diminishing returns on more
> +      usual systems given factors like things like Haswell's enormous 35mb l3
> +      cache and QPI being able to do a direct fetch.)
> +

Besides, even if the NUMA distance is fixed that does not mean the cost
of a NUMA miss from the perspective of a process is fixed because it
could be prefetched on one hand or cached locally in L1 or L2 cache on the
other. It's just not worth taking the weight into account.

> +(note: do we need a limit on the total memory per node?)
> +

It only needs to be taken into account if the task is using more memory
than fits into a node. You may be able to indirectly measure this in
practice using the numa_foreign counter.

If schednuma is enabled and numa_foreign is rapidly increasing it might
indicate that the total memory available and \Sum_i p_i,k + s_i,k has to
be taken into account. Hopefully such a thing can be avoided because it
would be expensive to calculate.

> +
> + * fairness
> +
> +For each task 't_i' we have a weight 'w_i' (related to nice), and each cpu
> +'c_n' has a compute capacity 'P_n', again, using our map 'C' we can formulate a
> +load 'L_n':
> +
> +  L_n = 1/P_n * \Sum_i w_i for all c_i = n
> +
> +using that we can formulate a load difference between CPUs
> +
> +  L_n,m = | L_n - L_m |
> +

This is showing a strong bias towards the scheduler. As you take w_i
into account, it potentially means that higher priority tasks can "push"
lower priority tasks and their memory off a node. This can lead to a
situation where a high priority task can starve a lower priority task as
the lower priority task now must dedicate cycles to moving its memory
around.

I understand your motivation for taking weight into account here but I
wonder if it's premature?

> +Which allows us to state the fairness goal like:
> +
> +  L_n,m(C0) =< L_n,m(C) for all C, n != m
> +
> +(pjt: It can also be usefully stated that, having converged at C0:
> +
> +   | L_n(C0) - L_m(C0) | <= 4/3 * | G_n( U(t_i, t_j) ) - G_m( U(t_i, t_j) ) |
> +
> +      Where G_n,m is the greedy partition of tasks between L_n and L_m. This is
> +      the "worst" partition we should accept; but having it gives us a useful 
> +      bound on how much we can reasonably adjust L_n/L_m at a Pareto point to 
> +      favor T_n,m. )
> +
> +Together they give us the complete multi-objective optimization problem:
> +
> +  min_C,M [ L_n,m(C), T_k,l(C,M) ]
> +

I admire the goal and the apparent simplicity but I think there are
potentially unexpected outcomes when you try to minimise both.

For example
  2-node machine, 24 cores, 4G per node
  1 compute process, 24 threads - workload is adaptive mesh computation
	of some sort that fits in 3G

  Early in the lifetime of this, it will be balanced between the two nodes
  (minimising interconnect traffic and CPU load). As it computes it might
  refine the mesh such that all 24 threads are operating on just 1G of
  memory on one node with a lot of false sharing within pages. In this case
  you effectively want the memory to be pinned on one node even if all
  threads are in use. As these pages are all probably in p (but maybe s,
  it's a gray area in this case) it potentially leads to a ping-pong effect
  when we minimise for L_n,m(C).  I know we would try to solve for both but
  as T_k,l is based on something we cannot accurately measure at runtime,
  there will be drift.

Another example
  2-node machine, 24 cores, 4G per node
  1 compute process, 2 threads, 1 thread needs 6G

  In this case we cannot minimise for T_k,l as spillover is inevitable and
  interconnect traffic will never be 0. To get around this, the scheduled
  CPU would always have to follow memory i.e. the thread uses cpu on node
  0 when operating on that memory and switching to a cpu on node 1 otherwise.
  I'm not sure how this can be modelled under the optimization problem
  as presented.

At this point I'm not proposing a solution - I'm just pointing out that
there are potential corner cases where this can get screwy.

FWIW, we still benefit from having this formally described even if it
cannot cover all the cases.

> +
> +
> +Notes:
> +
> + - the memory bandwidth problem is very much an inter-process problem, in
> +   particular there is no such concept as a process in the above problem.
> +

Yep.

> + - the naive solution would completely prefer fairness over interconnect
> +   traffic, the more complicated solution could pick another Pareto point using
> +   an aggregate objective function such that we balance the loss of work
> +   efficiency against the gain of running, we'd want to more or less suggest
> +   there to be a fixed bound on the error from the Pareto line for any
> +   such solution.
> +

I suspect that the Pareto point and objective function will depend on
the workload, whether it fits in the node and whether its usage of memory
between nodes and tasks is balanced or not.

It would be ideal though to have such a function.

> +References:
> +
> +  http://en.wikipedia.org/wiki/Mathematical_optimization
> +  http://en.wikipedia.org/wiki/Multi-objective_optimization
> +
> +
> +* warning, significant hand-waving ahead, improvements welcome *
> +
> +
> +Partial solutions / approximations:
> +
> + 1) have task node placement be a pure preference from the 'fairness' pov.
> +
> +This means we always prefer fairness over interconnect bandwidth. This reduces
> +the problem to:
> +
> +  min_C,M [ T_k,l(C,M) ]
> +

Is this not preferring interconnect bandwidth over fairness? i.e. always
reduce interconnect bandwidth regardless of how the CPUs are being used?

> + 2a) migrate memory towards 'n_i' (the task's node).
> +
> +This creates memory movement such that 'p_i,k for k != n_i' becomes 0 -- 
> +provided 'n_i' stays stable enough and there's sufficient memory (looks like
> +we might need memory limits for this).
> +

Not just memory limits, you may need to detect if p_i fits in k. The last
thing you need is an effect like zone_reclaim_mode==1 where t_i is reclaiming
its own memory in order to migrate pages to a local node. Such a thing cannot
be happening currently as the benchmarks would have shown the extra scanning.

> +This does however not provide us with any 's_i' (shared) information. It does
> +however remove 'M' since it defines memory placement in terms of task
> +placement.
> +
> +XXX properties of this M vs a potential optimal
> +
> + 2b) migrate memory towards 'n_i' using 2 samples.
> +
> +This separates pages into those that will migrate and those that will not due
> +to the two samples not matching. We could consider the first to be of 'p_i'
> +(private) and the second to be of 's_i' (shared).
> +

When minimising for L_n,m this could cause problems. Let's say we are dealing
with the first example above. The memory should be effectively pinned on
one node. If we minimise for L_n, we are using CPUs on a remote node and
if it samples twice, it will migrate the memory setting up a potential
ping-pong effect if there is any false sharing of pages.

Of course, this is not a problem if the memory of a task can be partitioned
into s_i and p_i but it heavily depends on detecting s_i correctly.

> +This interpretation can be motivated by the previously observed property that
> +'p_i,k for k != n_i' should become 0 under sufficient memory, leaving only
> +'s_i' (shared). (here we lose the need for memory limits again, since it
> +becomes indistinguishable from shared).
> +
> +XXX include the statistical babble on double sampling somewhere near
> +

Minimally, I do not see an obvious way of describing why 3, 4, 7 or eleventy
samples would be better than 2. Too high and it migrates too slowly; too low
and it ping-pongs, and the ideal number of samples is workload-dependent.

> +This reduces the problem further; we lose 'M' as per 2a, it further reduces
> +the 'T_k,l' (interconnect traffic) term to only include shared (since per the
> +above all private will be local):
> +
> +  T_k,l = \Sum_i bs_i,l for every n_i = k, l != k
> +
> +[ more or less matches the state of sched/numa and describes its remaining
> +  problems and assumptions. It should work well for tasks without significant
> +  shared memory usage between tasks. ]
> +
> +Possible future directions:
> +
> +Motivated by the form of 'T_k,l', try and obtain each term of the sum, so we
> +can evaluate it;
> +
> + 3a) add per-task per node counters
> +
> +At fault time, count the number of pages the task faults on for each node.
> +This should give an approximation of 'p_i' for the local node and 's_i,k' for
> +all remote nodes.
> +

Yes. The rate of sampling will determine how accurate it is.
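
As a rough sketch of what 3a amounts to (the numa_fault_counts[] array and
the helper name are made up, this is not the series' code):

/*
 * Account one NUMA fault for task 'p' against 'node', covering 'pages'
 * pages.  The local-node count approximates 'p_i', the remote-node counts
 * approximate 's_i,k'.
 */
static void account_numa_fault(struct task_struct *p, int node, int pages)
{
	if (p->numa_fault_counts)		/* hypothetical per-task array */
		p->numa_fault_counts[node] += pages;
}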

> +While these numbers provide pages per scan, and so have the unit [pages/s] they
> +don't count repeat accesses and thus aren't actually representative for our
> +bandwidth numbers.
> +

Counting repeat accesses would require either continual profiling (and
hardware that can identify the data address, not just the IP) or aggressive
sampling - neither of which is appealing. In other words, an approximation
of p_i is as good as we are going to get.

> + 3b) additional frequency term
> +
> +Additionally (or instead if it turns out we don't need the raw 'p' and 's' 
> +numbers) we can approximate the repeat accesses by using the time since marking
> +the pages as indication of the access frequency.
> +

Risky. If the process blocks on IO this could skew in unexpected ways
because such an approximation assumes the task is CPU or memory bound.

> +Let 'I' be the interval of marking pages and 'e' the elapsed time since the
> +last marking, then we could estimate the number of accesses 'a' as 'a = I / e'.
> +If we then increment the node counters using 'a' instead of 1 we might get
> +a better estimate of bandwidth terms.
> +

Not keen on this one. It really assumes that there is more or less
constant use of the CPU.
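
Just to make the arithmetic concrete: with a marking interval of I=1000ms
and a fault arriving e=50ms after marking, the estimate is a = 1000/50 = 20,
so the node counter would be bumped by 20 instead of 1.  A sketch of the
helper (the name and the clamps are mine):

/* 'a = I / e', clamped so a stalled task cannot drop to zero and a fault
 * arriving immediately after marking cannot produce a huge increment. */
static unsigned long estimate_accesses(unsigned long interval_ms,
				       unsigned long elapsed_ms)
{
	if (!elapsed_ms)
		elapsed_ms = 1;
	return clamp(interval_ms / elapsed_ms, 1UL, 1024UL);
}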

> + 3c) additional averaging; can be applied on top of either a/b.
> +
> +[ Rik argues that decaying averages on 3a might be sufficient for bandwidth since
> +  the decaying avg includes the old accesses and therefore has a measure of repeat
> +  accesses.
> +

Minimally, it would be less vulnerable to spikes.

> +  Rik also argued that the sample frequency is too low to get accurate access
> +  frequency measurements, I'm not entirely convinced, even at low sample
> +  frequencies the avg elapsed time 'e' over multiple samples should still
> +  give us a fair approximation of the avg access frequency 'a'.
> +

Sampling too high also increases the risk of a ping-pong effect.

> +  So doing both b&c has a fair chance of working and allowing us to distinguish
> +  between important and less important memory accesses.
> +
> +  Experimentation has shown no benefit from the added frequency term so far. ]
> +

At this point, I prefer a&c over b&c, but that could just be because I'm
wary of time-based heuristics, having been bitten by them once or twice.

> +This will give us 'bp_i' and 'bs_i,k' so that we can approximately compute
> +'T_k,l' Our optimization problem now reads:
> +
> +  min_C [ \Sum_i bs_i,l for every n_i = k, l != k ]
> +
> +And includes only shared terms, this makes sense since all task private memory
> +will become local as per 2.
> +
> +This suggests that if there is significant shared memory, we should try and
> +move towards it.
> +
> + 4) move towards where 'most' memory is
> +
> +The simplest significance test is comparing the biggest shared 's_i,k' against
> +the private 'p_i'. If we have more shared than private, move towards it.
> +

Depending on how s_i is calculated, this might also mitigate the ping-pong
problem.
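
A minimal sketch of the significance test in 4, reusing the hypothetical
per-task counters from 3a (none of these names are from the series):

/*
 * Pick the node with the most observed faults.  The local-node count stands
 * in for 'p_i', the remote counts for 's_i,k'; accepting only a strictly
 * larger count, or an equal count on a lower-numbered node, gives the
 * tie-break described above.
 */
static int pick_home_node(unsigned long *faults, int home)
{
	int node, best = home;

	for_each_online_node(node) {
		if (faults[node] > faults[best] ||
		    (faults[node] == faults[best] && node < best))
			best = node;
	}
	return best;
}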

> +This effectively makes us move towards where most our memory is and forms a
> +feed-back loop with 2. We migrate memory towards us and we migrate towards
> +where 'most' memory is.
> +
> +(Note: even if there were two tasks fully thrashing the same shared memory, it
> +       is very rare for there to be a 50/50 split in memory; lacking a perfect
> +       split, the smaller will move towards the larger. In case of the perfect
> +       split, we'll tie-break towards the lower node number.)
> +
> + 5) 'throttle' 4's node placement
> +
> +Since per 2b our 's_i,k' and 'p_i' require at least two scans to 'stabilize'
> +and show representative numbers, we should limit node-migration to not be
> +faster than this.
> +
> + n) poke holes in previous that require more stuff and describe it.
> 

Even as it is, this is a helpful description of the problem! Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 03/31] mm/thp: Preserve pgprot across huge page split
  2012-10-25 12:16 ` [PATCH 03/31] mm/thp: Preserve pgprot across huge page split Peter Zijlstra
@ 2012-11-01 10:22   ` Mel Gorman
  0 siblings, 0 replies; 135+ messages in thread
From: Mel Gorman @ 2012-11-01 10:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, linux-kernel, linux-mm,
	Paul Turner, Ingo Molnar

On Thu, Oct 25, 2012 at 02:16:20PM +0200, Peter Zijlstra wrote:
> We're going to play games with page-protections, ensure we don't lose
> them over a THP split.
> 

Why?

If PROT_NONE becomes a present pte, we lose samples. If a present pte
becomes PROT_NONE, we get spurious faults and some sampling oddities.
Both situations only apply when a THP is being split, which implies
disruption anyway (mprotect, page reclaim etc.), and neither seems that
serious. It seems premature at this point of the series and looks like it
could have been dropped entirely.

> Collapse seems to always allocate a new (huge) page which should
> already end up on the new target node so losing protections there
> isn't a problem.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Cc: Paul Turner <pjt@google.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  arch/x86/include/asm/pgtable.h |    1 
>  mm/huge_memory.c               |  105 +++++++++++++++++++----------------------
>  2 files changed, 51 insertions(+), 55 deletions(-)
> 
> Index: tip/arch/x86/include/asm/pgtable.h
> ===================================================================
> --- tip.orig/arch/x86/include/asm/pgtable.h
> +++ tip/arch/x86/include/asm/pgtable.h
> @@ -349,6 +349,7 @@ static inline pgprot_t pgprot_modify(pgp
>  }
>  
>  #define pte_pgprot(x) __pgprot(pte_flags(x) & PTE_FLAGS_MASK)
> +#define pmd_pgprot(x) __pgprot(pmd_val(x) & ~_HPAGE_CHG_MASK)
>  
>  #define canon_pgprot(p) __pgprot(massage_pgprot(p))
>  
> Index: tip/mm/huge_memory.c
> ===================================================================
> --- tip.orig/mm/huge_memory.c
> +++ tip/mm/huge_memory.c
> @@ -1343,63 +1343,60 @@ static int __split_huge_page_map(struct
>  	int ret = 0, i;
>  	pgtable_t pgtable;
>  	unsigned long haddr;
> +	pgprot_t prot;
>  
>  	spin_lock(&mm->page_table_lock);
>  	pmd = page_check_address_pmd(page, mm, address,
>  				     PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
> -	if (pmd) {
> -		pgtable = pgtable_trans_huge_withdraw(mm);
> -		pmd_populate(mm, &_pmd, pgtable);
> -
> -		haddr = address;
> -		for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
> -			pte_t *pte, entry;
> -			BUG_ON(PageCompound(page+i));
> -			entry = mk_pte(page + i, vma->vm_page_prot);
> -			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> -			if (!pmd_write(*pmd))
> -				entry = pte_wrprotect(entry);
> -			else
> -				BUG_ON(page_mapcount(page) != 1);
> -			if (!pmd_young(*pmd))
> -				entry = pte_mkold(entry);
> -			pte = pte_offset_map(&_pmd, haddr);
> -			BUG_ON(!pte_none(*pte));
> -			set_pte_at(mm, haddr, pte, entry);
> -			pte_unmap(pte);
> -		}
> -
> -		smp_wmb(); /* make pte visible before pmd */
> -		/*
> -		 * Up to this point the pmd is present and huge and
> -		 * userland has the whole access to the hugepage
> -		 * during the split (which happens in place). If we
> -		 * overwrite the pmd with the not-huge version
> -		 * pointing to the pte here (which of course we could
> -		 * if all CPUs were bug free), userland could trigger
> -		 * a small page size TLB miss on the small sized TLB
> -		 * while the hugepage TLB entry is still established
> -		 * in the huge TLB. Some CPU doesn't like that. See
> -		 * http://support.amd.com/us/Processor_TechDocs/41322.pdf,
> -		 * Erratum 383 on page 93. Intel should be safe but is
> -		 * also warns that it's only safe if the permission
> -		 * and cache attributes of the two entries loaded in
> -		 * the two TLB is identical (which should be the case
> -		 * here). But it is generally safer to never allow
> -		 * small and huge TLB entries for the same virtual
> -		 * address to be loaded simultaneously. So instead of
> -		 * doing "pmd_populate(); flush_tlb_range();" we first
> -		 * mark the current pmd notpresent (atomically because
> -		 * here the pmd_trans_huge and pmd_trans_splitting
> -		 * must remain set at all times on the pmd until the
> -		 * split is complete for this pmd), then we flush the
> -		 * SMP TLB and finally we write the non-huge version
> -		 * of the pmd entry with pmd_populate.
> -		 */
> -		pmdp_invalidate(vma, address, pmd);
> -		pmd_populate(mm, pmd, pgtable);
> -		ret = 1;
> +	if (!pmd)
> +		goto unlock;
> +

*whinge*

Changing the pmd check like this churned the code more than necessary,
making it harder to review. It forces me to move back and forth to figure
out exactly what it is you added. If you wanted to do this cleanup, it
should have been a separate patch.

> +	prot = pmd_pgprot(*pmd);
> +	pgtable = pgtable_trans_huge_withdraw(mm);
> +	pmd_populate(mm, &_pmd, pgtable);
> +
> +	for (i = 0, haddr = address; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
> +		pte_t *pte, entry;
> +
> +		BUG_ON(PageCompound(page+i));
> +		entry = mk_pte(page + i, prot);
> +		entry = pte_mkdirty(entry);

For example, because of the churn, it's not obvious that the

                       entry = maybe_mkwrite(pte_mkdirty(entry), vma);
                       if (!pmd_write(*pmd))
                               entry = pte_wrprotect(entry);
                       else
                               BUG_ON(page_mapcount(page) != 1);

checks went away and that you are instead using the prot flags retrieved by
pmd_pgprot to preserve _PAGE_RW, which I think is the actual point of the
patch even if it's not obvious from the diff.

> +		if (!pmd_young(*pmd))
> +			entry = pte_mkold(entry);
> +		pte = pte_offset_map(&_pmd, haddr);
> +		BUG_ON(!pte_none(*pte));
> +		set_pte_at(mm, haddr, pte, entry);
> +		pte_unmap(pte);
>  	}
> +
> +	smp_wmb(); /* make ptes visible before pmd, see __pte_alloc */
> +	/*
> +	 * Up to this point the pmd is present and huge.
> +	 *
> +	 * If we overwrite the pmd with the not-huge version, we could trigger
> +	 * a small page size TLB miss on the small sized TLB while the hugepage
> +	 * TLB entry is still established in the huge TLB.
> +	 *
> +	 * Some CPUs don't like that. See
> +	 * http://support.amd.com/us/Processor_TechDocs/41322.pdf, Erratum 383
> +	 * on page 93.
> +	 *
> +	 * Thus it is generally safer to never allow small and huge TLB entries
> +	 * for overlapping virtual addresses to be loaded. So we first mark the
> +	 * current pmd not present, then we flush the TLB and finally we write
> +	 * the non-huge version of the pmd entry with pmd_populate.
> +	 *
> +	 * The above needs to be done under the ptl because pmd_trans_huge and
> +	 * pmd_trans_splitting must remain set on the pmd until the split is
> +	 * complete. The ptl also protects against concurrent faults due to
> +	 * making the pmd not-present.
> +	 */
> +	set_pmd_at(mm, address, pmd, pmd_mknotpresent(*pmd));
> +	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> +	pmd_populate(mm, pmd, pgtable);
> +	ret = 1;
> +
> +unlock:
>  	spin_unlock(&mm->page_table_lock);
>  
>  	return ret;
> @@ -2287,10 +2284,8 @@ static void khugepaged_do_scan(void)
>  {
>  	struct page *hpage = NULL;
>  	unsigned int progress = 0, pass_through_head = 0;
> -	unsigned int pages = khugepaged_pages_to_scan;
>  	bool wait = true;
> -
> -	barrier(); /* write khugepaged_pages_to_scan to local stack */
> +	unsigned int pages = ACCESS_ONCE(khugepaged_pages_to_scan);
>  
>  	while (progress < pages) {
>  		if (!khugepaged_prealloc_page(&hpage, &wait))
> 

This hunk looks fine but has nothing to do with the patch or the changelog.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 04/31] x86/mm: Introduce pte_accessible()
  2012-10-25 12:16 ` [PATCH 04/31] x86/mm: Introduce pte_accessible() Peter Zijlstra
  2012-10-25 20:10   ` Linus Torvalds
@ 2012-11-01 10:42   ` Mel Gorman
  1 sibling, 0 replies; 135+ messages in thread
From: Mel Gorman @ 2012-11-01 10:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, linux-kernel, linux-mm,
	Ingo Molnar

On Thu, Oct 25, 2012 at 02:16:21PM +0200, Peter Zijlstra wrote:
> From: Rik van Riel <riel@redhat.com>
> 
> We need pte_present to return true for _PAGE_PROTNONE pages, to indicate that
> the pte is associated with a page.
> 
> However, for TLB flushing purposes, we would like to know whether the pte
> points to an actually accessible page.  This allows us to skip remote TLB
> flushes for pages that are not actually accessible.
> 

It feels like we are putting the cart before the horse to be taking TLB
flushing optimisations into account this early in the series. That
aside, what was wrong with the following patches?

autonuma: define _PAGE_NUMA
	arch-dependent definition of a flag that happens to be PROT_NONE
	on x86 but could be anything at all really, which would help portability

autonuma: pte_numa() and pmd_numa()
	makes pte_present do what you want
	adds pte_numa and pmd_numa which potentially could have been
	used instead of pte_accessible

autonuma: teach gup_fast about pmd_numa
	sort of self-explanatory

and building on those? The arch-dependent nature of _PAGE_NUMA might have
avoided Linus sending the children's college fund to the swear jar and
avoided this complaint;

===
because you have no idea if other architectures do

 (a) the same trick as x86 does for PROT_NONE (I can already tell you
     from a quick grep that ia64, m32r, m68k and sh do it)
 (b) might not perhaps be caching non-present pte's anyway
====

The "autonuma: define _PAGE_NUMA" happens to use PROT_NONE but as an
implementation detail rather than by design and as a bonus point describes
what it is doing. The "autonuma" part in the title is misleading, it's
not autonuma-specific at all and could have been dropped or just renamed
"numa:"

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 07/31] sched, numa, mm, s390/thp: Implement pmd_pgprot() for s390
  2012-10-25 12:16 ` [PATCH 07/31] sched, numa, mm, s390/thp: Implement pmd_pgprot() for s390 Peter Zijlstra
@ 2012-11-01 10:49   ` Mel Gorman
  0 siblings, 0 replies; 135+ messages in thread
From: Mel Gorman @ 2012-11-01 10:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, linux-kernel, linux-mm,
	Gerald Schaefer, Martin Schwidefsky, Heiko Carstens,
	Peter Zijlstra, Ralf Baechle, Ingo Molnar

On Thu, Oct 25, 2012 at 02:16:24PM +0200, Peter Zijlstra wrote:
> From: Gerald Schaefer <gerald.schaefer@de.ibm.com>
> 
> This patch adds an implementation of pmd_pgprot() for s390,
> in preparation to future THP changes.
> 

The additional pmd_pgprot implementations are only necessary if we want
to preserve the PROT_NONE protections across a split, but that somewhat
forces PROT_NONE to be used as the protection bit across all
architectures. Is that possible? I think I would prefer that
preserving protections across splits just went away until it is proven
necessary, and is then recoded in terms of _PAGE_NUMA and friends
instead.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 11/31] mm/mpol: Make MPOL_LOCAL a real policy
  2012-10-25 12:16 ` [PATCH 11/31] mm/mpol: Make MPOL_LOCAL a real policy Peter Zijlstra
@ 2012-11-01 10:58   ` Mel Gorman
  0 siblings, 0 replies; 135+ messages in thread
From: Mel Gorman @ 2012-11-01 10:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, linux-kernel, linux-mm,
	Lee Schermerhorn, Ingo Molnar, Michael Kerrisk

On Thu, Oct 25, 2012 at 02:16:28PM +0200, Peter Zijlstra wrote:
> Make MPOL_LOCAL a real and exposed policy such that applications that
> relied on the previous default behaviour can explicitly request it.
> 
> Requested-by: Christoph Lameter <cl@linux.com>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Ingo Molnar <mingo@kernel.org>

Seems reasonable but Michael Kerrisk should be cc'd because when the dust
settles on this there may be a manual page update required.

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 12/31] mm/mpol: Add MPOL_MF_NOOP
  2012-10-25 12:16 ` [PATCH 12/31] mm/mpol: Add MPOL_MF_NOOP Peter Zijlstra
@ 2012-11-01 11:10   ` Mel Gorman
  0 siblings, 0 replies; 135+ messages in thread
From: Mel Gorman @ 2012-11-01 11:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, linux-kernel, linux-mm,
	Lee Schermerhorn, Ingo Molnar

On Thu, Oct 25, 2012 at 02:16:29PM +0200, Peter Zijlstra wrote:
> From: Lee Schermerhorn <lee.schermerhorn@hp.com>
> 
> This patch augments the MPOL_MF_LAZY feature by adding a "NOOP" policy

The MPOL_MF_LAZY feature doesn't exist yet so it's hard to augment at this
point :)

> to mbind().  When the NOOP policy is used with the 'MOVE and 'LAZY
> flags, mbind() will map the pages PROT_NONE so that they will be
> migrated on the next touch.
> 

This implies that a user-space application has a two-stage process.  Stage 1
marks a range NOOP and stage 2 marks the range lazy.  That feels like it
might violate Rusty's API design rule of "The obvious use is wrong."
What is the motivation for exposing NOOP to userspace? Instead why does
mbind(addr, len, MPOL_MF_LAZY, nodemask, maxnode, flags) not imply that
the range gets marked PROT_NONE (or PROT_NUMA or some other variant that
is not arch-specific)?

It also seems that MPOL_MF_LAZY must imply MPOL_MF_MOVE, or it's a bit
pointless; otherwise the application has to specify the exact flag
combination.

> This allows an application to prepare for a new phase of operation
> where different regions of shared storage will be assigned to
> worker threads, w/o changing policy.  Note that we could just use
> "default" policy in this case.  However, this also allows an
> application to request that pages be migrated, only if necessary,
> to follow any arbitrary policy that might currently apply to a
> range of pages, without knowing the policy, or without specifying
> multiple mbind()s for ranges with different policies.
> 

I very much like the idea because potentially a motivated developer could use
this mechanism to avoid any ping-pong problems with an automatic migration
scheme. It could even be argued that any application using MPOL_MF_LAZY
should get unsubscribed from any automatic mechanism to avoid interference.

> [ Bug in early version of mpol_parse_str() reported by Fengguang Wu. ]
> 
> Bug-Reported-by: Reported-by: Fengguang Wu <fengguang.wu@intel.com>
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  include/uapi/linux/mempolicy.h |    1 +
>  mm/mempolicy.c                 |   11 ++++++-----
>  2 files changed, 7 insertions(+), 5 deletions(-)
> 
> Index: tip/include/uapi/linux/mempolicy.h
> ===================================================================
> --- tip.orig/include/uapi/linux/mempolicy.h
> +++ tip/include/uapi/linux/mempolicy.h
> @@ -21,6 +21,7 @@ enum {
>  	MPOL_BIND,
>  	MPOL_INTERLEAVE,
>  	MPOL_LOCAL,
> +	MPOL_NOOP,		/* retain existing policy for range */
>  	MPOL_MAX,	/* always last member of enum */
>  };
>  
> Index: tip/mm/mempolicy.c
> ===================================================================
> --- tip.orig/mm/mempolicy.c
> +++ tip/mm/mempolicy.c
> @@ -251,10 +251,10 @@ static struct mempolicy *mpol_new(unsign
>  	pr_debug("setting mode %d flags %d nodes[0] %lx\n",
>  		 mode, flags, nodes ? nodes_addr(*nodes)[0] : -1);
>  
> -	if (mode == MPOL_DEFAULT) {
> +	if (mode == MPOL_DEFAULT || mode == MPOL_NOOP) {
>  		if (nodes && !nodes_empty(*nodes))
>  			return ERR_PTR(-EINVAL);
> -		return NULL;	/* simply delete any existing policy */
> +		return NULL;
>  	}
>  	VM_BUG_ON(!nodes);
>  
> @@ -1146,7 +1146,7 @@ static long do_mbind(unsigned long start
>  	if (start & ~PAGE_MASK)
>  		return -EINVAL;
>  
> -	if (mode == MPOL_DEFAULT)
> +	if (mode == MPOL_DEFAULT || mode == MPOL_NOOP)
>  		flags &= ~MPOL_MF_STRICT;
>  
>  	len = (len + PAGE_SIZE - 1) & PAGE_MASK;
> @@ -2381,7 +2381,8 @@ static const char * const policy_modes[]
>  	[MPOL_PREFERRED]  = "prefer",
>  	[MPOL_BIND]       = "bind",
>  	[MPOL_INTERLEAVE] = "interleave",
> -	[MPOL_LOCAL]      = "local"
> +	[MPOL_LOCAL]      = "local",
> +	[MPOL_NOOP]	  = "noop",	/* should not actually be used */

If it should not be used, why it is exposed to userspace?

>  };
>  
>  
> @@ -2432,7 +2433,7 @@ int mpol_parse_str(char *str, struct mem
>  			break;
>  		}
>  	}
> -	if (mode >= MPOL_MAX)
> +	if (mode >= MPOL_MAX || mode == MPOL_NOOP)
>  		goto out;
>  
>  	switch (mode) {
> 
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 14/31] mm/mpol: Create special PROT_NONE infrastructure
  2012-10-25 12:16 ` [PATCH 14/31] mm/mpol: Create special PROT_NONE infrastructure Peter Zijlstra
@ 2012-11-01 11:51   ` Mel Gorman
  0 siblings, 0 replies; 135+ messages in thread
From: Mel Gorman @ 2012-11-01 11:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, linux-kernel, linux-mm,
	Paul Turner, Ingo Molnar

On Thu, Oct 25, 2012 at 02:16:31PM +0200, Peter Zijlstra wrote:
> In order to facilitate a lazy -- fault driven -- migration of pages,
> create a special transient PROT_NONE variant, we can then use the
> 'spurious' protection faults to drive our migrations from.
> 

The changelog should mention that fault-driven migration also means that
the full cost of migration is incurred by the process. If someone in the
future tries to do the migration in a kernel thread they should be reminded
that the fault-driven choice was deliberate.

> Pages that already had an effective PROT_NONE mapping will not
> be detected to generate these 'spurious' faults for the simple reason
> that we cannot distinguish them on their protection bits, see
> pte_numa().
> 
> This isn't a problem since PROT_NONE (and possible PROT_WRITE with
> dirty tracking) aren't used or are rare enough for us to not care
> about their placement.
> 
> Suggested-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Cc: Paul Turner <pjt@google.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> [ fixed various cross-arch and THP/!THP details ]
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  include/linux/huge_mm.h |   19 ++++++++++++
>  include/linux/mm.h      |   18 +++++++++++
>  mm/huge_memory.c        |   32 ++++++++++++++++++++
>  mm/memory.c             |   75 +++++++++++++++++++++++++++++++++++++++++++-----
>  mm/mprotect.c           |   24 ++++++++++-----
>  5 files changed, 154 insertions(+), 14 deletions(-)
> 
> Index: tip/include/linux/huge_mm.h
> ===================================================================
> --- tip.orig/include/linux/huge_mm.h
> +++ tip/include/linux/huge_mm.h
> @@ -159,6 +159,13 @@ static inline struct page *compound_tran
>  	}
>  	return page;
>  }
> +
> +extern bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd);
> +
> +extern void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
> +				  unsigned long address, pmd_t *pmd,
> +				  unsigned int flags, pmd_t orig_pmd);
> +
>  #else /* CONFIG_TRANSPARENT_HUGEPAGE */
>  #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
>  #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
> @@ -195,6 +202,18 @@ static inline int pmd_trans_huge_lock(pm
>  {
>  	return 0;
>  }
> +
> +static inline bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd)
> +{
> +	return false;
> +}
> +
> +static inline void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
> +				  unsigned long address, pmd_t *pmd,
> +				  unsigned int flags, pmd_t orig_pmd)
> +{
> +}
> +
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
>  #endif /* _LINUX_HUGE_MM_H */
> Index: tip/include/linux/mm.h
> ===================================================================
> --- tip.orig/include/linux/mm.h
> +++ tip/include/linux/mm.h
> @@ -1091,6 +1091,9 @@ extern unsigned long move_page_tables(st
>  extern unsigned long do_mremap(unsigned long addr,
>  			       unsigned long old_len, unsigned long new_len,
>  			       unsigned long flags, unsigned long new_addr);
> +extern void change_protection(struct vm_area_struct *vma, unsigned long start,
> +			      unsigned long end, pgprot_t newprot,
> +			      int dirty_accountable);
>  extern int mprotect_fixup(struct vm_area_struct *vma,
>  			  struct vm_area_struct **pprev, unsigned long start,
>  			  unsigned long end, unsigned long newflags);
> @@ -1561,6 +1564,21 @@ static inline pgprot_t vm_get_page_prot(
>  }
>  #endif
>  
> +static inline pgprot_t vma_prot_none(struct vm_area_struct *vma)
> +{
> +	/*
> +	 * obtain PROT_NONE by removing READ|WRITE|EXEC privs
> +	 */
> +	vm_flags_t vmflags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
> +	return pgprot_modify(vma->vm_page_prot, vm_get_page_prot(vmflags));
> +}
> +

Again, this very much hard-codes the choice of prot_none as the
_PAGE_NUMA bit.

> +static inline void
> +change_prot_none(struct vm_area_struct *vma, unsigned long start, unsigned long end)
> +{
> +	change_protection(vma, start, end, vma_prot_none(vma), 0);
> +}
> +

And this is somewhat explicit too. Steal pte_mknuma and shove this into
the arch layer?

>  struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
>  int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
>  			unsigned long pfn, unsigned long size, pgprot_t);
> Index: tip/mm/huge_memory.c
> ===================================================================
> --- tip.orig/mm/huge_memory.c
> +++ tip/mm/huge_memory.c
> @@ -725,6 +725,38 @@ out:
>  	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
>  }
>  
> +bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd)
> +{
> +	/*
> +	 * See pte_numa().
> +	 */
> +	if (pmd_same(pmd, pmd_modify(pmd, vma->vm_page_prot)))
> +		return false;
> +
> +	return pmd_same(pmd, pmd_modify(pmd, vma_prot_none(vma)));
> +}
> +
> +void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
> +			   unsigned long address, pmd_t *pmd,
> +			   unsigned int flags, pmd_t entry)
> +{
> +	unsigned long haddr = address & HPAGE_PMD_MASK;
> +
> +	spin_lock(&mm->page_table_lock);
> +	if (unlikely(!pmd_same(*pmd, entry)))
> +		goto out_unlock;
> +
> +	/* do fancy stuff */
> +

Joking aside, 

> +	/* change back to regular protection */
> +	entry = pmd_modify(entry, vma->vm_page_prot);
> +	if (pmdp_set_access_flags(vma, haddr, pmd, entry, 1))
> +		update_mmu_cache_pmd(vma, address, entry);
> +
> +out_unlock:
> +	spin_unlock(&mm->page_table_lock);
> +}
> +
>  int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
>  		  struct vm_area_struct *vma)
> Index: tip/mm/memory.c
> ===================================================================
> --- tip.orig/mm/memory.c
> +++ tip/mm/memory.c
> @@ -1464,6 +1464,25 @@ int zap_vma_ptes(struct vm_area_struct *
>  }
>  EXPORT_SYMBOL_GPL(zap_vma_ptes);
>  
> +static bool pte_numa(struct vm_area_struct *vma, pte_t pte)
> +{
> +	/*
> +	 * If we have the normal vma->vm_page_prot protections we're not a
> +	 * 'special' PROT_NONE page.
> +	 *
> +	 * This means we cannot get 'special' PROT_NONE faults from genuine
> +	 * PROT_NONE maps, nor from PROT_WRITE file maps that do dirty
> +	 * tracking.
> +	 *
> +	 * Neither case is really interesting for our current use though so we
> +	 * don't care.
> +	 */
> +	if (pte_same(pte, pte_modify(pte, vma->vm_page_prot)))
> +		return false;
> +
> +	return pte_same(pte, pte_modify(pte, vma_prot_none(vma)));
> +}
> +
>  /**
>   * follow_page - look up a page descriptor from a user-virtual address
>   * @vma: vm_area_struct mapping @address
> @@ -3433,6 +3452,41 @@ static int do_nonlinear_fault(struct mm_
>  	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
>  }
>  
> +static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
> +			unsigned long address, pte_t *ptep, pmd_t *pmd,
> +			unsigned int flags, pte_t entry)
> +{
> +	spinlock_t *ptl;
> +	int ret = 0;
> +
> +	if (!pte_unmap_same(mm, pmd, ptep, entry))
> +		goto out;
> +
> +	/*
> +	 * Do fancy stuff...
> +	 */
> +

Ok, so we should not have to check for a splitting huge page at this
point because it has been checked already.

> +	/*
> +	 * OK, nothing to do,.. change the protection back to what it
> +	 * ought to be.
> +	 */
> +	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
> +	if (unlikely(!pte_same(*ptep, entry)))
> +		goto unlock;
> +
> +	flush_cache_page(vma, address, pte_pfn(entry));
> +

This page was marked PROT_NONE so why is it necessary to flush the
cache? Needs a comment.

> +	ptep_modify_prot_start(mm, address, ptep);
> +	entry = pte_modify(entry, vma->vm_page_prot);
> +	ptep_modify_prot_commit(mm, address, ptep, entry);
> +

could have used pte_mknonnuma() if it was pulled in.

> +	update_mmu_cache(vma, address, ptep);
> +unlock:
> +	pte_unmap_unlock(ptep, ptl);
> +out:
> +	return ret;
> +}
> +
>  /*
>   * These routines also need to handle stuff like marking pages dirty
>   * and/or accessed for architectures that don't do it in hardware (most
> @@ -3471,6 +3525,9 @@ int handle_pte_fault(struct mm_struct *m
>  					pte, pmd, flags, entry);
>  	}
>  
> +	if (pte_numa(vma, entry))
> +		return do_numa_page(mm, vma, address, pte, pmd, flags, entry);
> +
>  	ptl = pte_lockptr(mm, pmd);
>  	spin_lock(ptl);
>  	if (unlikely(!pte_same(*pte, entry)))
> @@ -3535,13 +3592,16 @@ retry:
>  							  pmd, flags);
>  	} else {
>  		pmd_t orig_pmd = *pmd;
> -		int ret;
> +		int ret = 0;
>  
>  		barrier();
> -		if (pmd_trans_huge(orig_pmd)) {
> -			if (flags & FAULT_FLAG_WRITE &&
> -			    !pmd_write(orig_pmd) &&
> -			    !pmd_trans_splitting(orig_pmd)) {
> +		if (pmd_trans_huge(orig_pmd) && !pmd_trans_splitting(orig_pmd)) {

Hmm, ok, if it's trans_huge and is splitting, it now falls through

> +			if (pmd_numa(vma, orig_pmd)) {
> +				do_huge_pmd_numa_page(mm, vma, address, pmd,
> +						      flags, orig_pmd);
> +			}
> +

When this thing returns you are not holding the page_table_lock, and mmap_sem
on its own is not enough to protect against a split.  Should you not recheck
pmd_trans_splitting and potentially return 0 to retry the fault if it is?

I guess it does not matter per se. If it's a write, you call
do_huge_pmd_wp_page() which will eventually check pmd_same (which will
fail as PROT_NONE was fixed up) and retry the whole fault after a bunch
of work like allocating a huge page. To avoid that, I strongly suspect
you should re-read orig_pmd after handling the NUMA fault or something
similar. If the fault is a read fault, it'll fall through and return 0.

> +			if ((flags & FAULT_FLAG_WRITE) && !pmd_write(orig_pmd)) {
>  				ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
>  							  orig_pmd);
>  				/*
> @@ -3551,12 +3611,13 @@ retry:
>  				 */
>  				if (unlikely(ret & VM_FAULT_OOM))
>  					goto retry;
> -				return ret;
>  			}
> -			return 0;
> +
> +			return ret;
>  		}
>  	}
>  
> +
>  	/*
>  	 * Use __pte_alloc instead of pte_alloc_map, because we can't
>  	 * run pte_offset_map on the pmd, if an huge pmd could
> Index: tip/mm/mprotect.c
> ===================================================================
> --- tip.orig/mm/mprotect.c
> +++ tip/mm/mprotect.c
> @@ -112,7 +112,7 @@ static inline void change_pud_range(stru
>  	} while (pud++, addr = next, addr != end);
>  }
>  
> -static void change_protection(struct vm_area_struct *vma,
> +static void change_protection_range(struct vm_area_struct *vma,
>  		unsigned long addr, unsigned long end, pgprot_t newprot,
>  		int dirty_accountable)
>  {
> @@ -134,6 +134,20 @@ static void change_protection(struct vm_
>  	flush_tlb_range(vma, start, end);
>  }
>  
> +void change_protection(struct vm_area_struct *vma, unsigned long start,
> +		       unsigned long end, pgprot_t newprot,
> +		       int dirty_accountable)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +
> +	mmu_notifier_invalidate_range_start(mm, start, end);
> +	if (is_vm_hugetlb_page(vma))
> +		hugetlb_change_protection(vma, start, end, newprot);
> +	else
> +		change_protection_range(vma, start, end, newprot, dirty_accountable);
> +	mmu_notifier_invalidate_range_end(mm, start, end);
> +}
> +
>  int
>  mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
>  	unsigned long start, unsigned long end, unsigned long newflags)
> @@ -206,12 +220,8 @@ success:
>  		dirty_accountable = 1;
>  	}
>  
> -	mmu_notifier_invalidate_range_start(mm, start, end);
> -	if (is_vm_hugetlb_page(vma))
> -		hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
> -	else
> -		change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
> -	mmu_notifier_invalidate_range_end(mm, start, end);
> +	change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
> +
>  	vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
>  	vm_stat_account(mm, newflags, vma->vm_file, nrpages);
>  	perf_event_mmap(vma);
> 
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 15/31] mm/mpol: Add MPOL_MF_LAZY
  2012-10-25 12:16 ` [PATCH 15/31] mm/mpol: Add MPOL_MF_LAZY Peter Zijlstra
@ 2012-11-01 12:01   ` Mel Gorman
  0 siblings, 0 replies; 135+ messages in thread
From: Mel Gorman @ 2012-11-01 12:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, linux-kernel, linux-mm,
	Lee Schermerhorn, Ingo Molnar

On Thu, Oct 25, 2012 at 02:16:32PM +0200, Peter Zijlstra wrote:
> From: Lee Schermerhorn <lee.schermerhorn@hp.com>
> 
> This patch adds another mbind() flag to request "lazy migration".  The
> flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected
> pages are marked PROT_NONE. The pages will be migrated in the fault
> path on "first touch", if the policy dictates at that time.
> 
> "Lazy Migration" will allow testing of migrate-on-fault via mbind().
> Also allows applications to specify that only subsequently touched
> pages be migrated to obey new policy, instead of all pages in range.
> This can be useful for multi-threaded applications working on a
> large shared data area that is initialized by an initial thread
> resulting in all pages on one [or a few, if overflowed] nodes.
> After PROT_NONE, the pages in regions assigned to the worker threads
> will be automatically migrated local to the threads on 1st touch.
> 

I like the idea. Applications will need some way of programmatically detecting
if MPOL_MF_LAZY is available. I guess they could always try the flag and, if
it fails, do nothing (see the sketch after the suggested changelog below).
When/if the manual page for it happens it probably will be a lot more obvious.
The changelog does not actually describe what this patch does, though.

This patch is a preparation step for "lazy migration". When the flag is
used, the range is marked prot_none. A later patch will detect at fault
time that the page is misplaced and migrate it at that point.
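
On the detection question above, a rough userspace sketch of "try the flag
and fall back" -- the MPOL_MF_LAZY value is assumed from this patch, and the
probe is done on a scratch mapping so any side effect stays confined to it:

#include <errno.h>
#include <unistd.h>
#include <sys/mman.h>
#include <numaif.h>

#ifndef MPOL_MF_LAZY
#define MPOL_MF_LAZY	(1 << 3)	/* value proposed by this series */
#endif

/* Returns 1 if lazy migration appears supported, 0 if not, -1 on error. */
static int have_lazy_migration(void)
{
	long page = sysconf(_SC_PAGESIZE);
	void *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int rc, saved_errno;

	if (p == MAP_FAILED)
		return -1;
	rc = mbind(p, page, MPOL_DEFAULT, NULL, 0,
		   MPOL_MF_MOVE | MPOL_MF_LAZY);
	saved_errno = errno;
	munmap(p, page);
	if (rc == 0)
		return 1;
	return saved_errno == EINVAL ? 0 : -1;
}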

> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> [ nearly complete rewrite.. ]
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Ingo Molnar <mingo@kernel.org>

> ---
>  include/uapi/linux/mempolicy.h |   13 ++++++++--
>  mm/mempolicy.c                 |   49 ++++++++++++++++++++++++++---------------
>  2 files changed, 42 insertions(+), 20 deletions(-)
> 
> Index: tip/include/uapi/linux/mempolicy.h
> ===================================================================
> --- tip.orig/include/uapi/linux/mempolicy.h
> +++ tip/include/uapi/linux/mempolicy.h
> @@ -49,9 +49,16 @@ enum mpol_rebind_step {
>  
>  /* Flags for mbind */
>  #define MPOL_MF_STRICT	(1<<0)	/* Verify existing pages in the mapping */
> -#define MPOL_MF_MOVE	(1<<1)	/* Move pages owned by this process to conform to mapping */
> -#define MPOL_MF_MOVE_ALL (1<<2)	/* Move every page to conform to mapping */
> -#define MPOL_MF_INTERNAL (1<<3)	/* Internal flags start here */
> +#define MPOL_MF_MOVE	 (1<<1)	/* Move pages owned by this process to conform
> +				   to policy */
> +#define MPOL_MF_MOVE_ALL (1<<2)	/* Move every page to conform to policy */
> +#define MPOL_MF_LAZY	 (1<<3)	/* Modifies '_MOVE:  lazy migrate on fault */
> +#define MPOL_MF_INTERNAL (1<<4)	/* Internal flags start here */
> +
> +#define MPOL_MF_VALID	(MPOL_MF_STRICT   | 	\
> +			 MPOL_MF_MOVE     | 	\
> +			 MPOL_MF_MOVE_ALL |	\
> +			 MPOL_MF_LAZY)
>  
>  /*
>   * Internal flags that share the struct mempolicy flags word with
> Index: tip/mm/mempolicy.c
> ===================================================================
> --- tip.orig/mm/mempolicy.c
> +++ tip/mm/mempolicy.c
> @@ -583,22 +583,32 @@ check_range(struct mm_struct *mm, unsign
>  		return ERR_PTR(-EFAULT);
>  	prev = NULL;
>  	for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
> +		unsigned long endvma = vma->vm_end;
> +
> +		if (endvma > end)
> +			endvma = end;
> +		if (vma->vm_start > start)
> +			start = vma->vm_start;
> +
>  		if (!(flags & MPOL_MF_DISCONTIG_OK)) {
>  			if (!vma->vm_next && vma->vm_end < end)
>  				return ERR_PTR(-EFAULT);
>  			if (prev && prev->vm_end < vma->vm_start)
>  				return ERR_PTR(-EFAULT);
>  		}
> -		if (!is_vm_hugetlb_page(vma) &&
> -		    ((flags & MPOL_MF_STRICT) ||
> +
> +		if (is_vm_hugetlb_page(vma))
> +			goto next;
> +
> +		if (flags & MPOL_MF_LAZY) {
> +			change_prot_none(vma, start, endvma);
> +			goto next;
> +		}
> +

More hard-coding of prot_none.

> +		if ((flags & MPOL_MF_STRICT) ||
>  		     ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
> -				vma_migratable(vma)))) {
> -			unsigned long endvma = vma->vm_end;
> +		      vma_migratable(vma))) {
>  
> -			if (endvma > end)
> -				endvma = end;
> -			if (vma->vm_start > start)
> -				start = vma->vm_start;
>  			err = check_pgd_range(vma, start, endvma, nodes,
>  						flags, private);
>  			if (err) {
> @@ -606,6 +616,7 @@ check_range(struct mm_struct *mm, unsign
>  				break;
>  			}
>  		}
> +next:
>  		prev = vma;
>  	}
>  	return first;
> @@ -1137,8 +1148,7 @@ static long do_mbind(unsigned long start
>  	int err;
>  	LIST_HEAD(pagelist);
>  
> -	if (flags & ~(unsigned long)(MPOL_MF_STRICT |
> -				     MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
> +  	if (flags & ~(unsigned long)MPOL_MF_VALID)
>  		return -EINVAL;
>  	if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
>  		return -EPERM;
> @@ -1161,6 +1171,9 @@ static long do_mbind(unsigned long start
>  	if (IS_ERR(new))
>  		return PTR_ERR(new);
>  
> +	if (flags & MPOL_MF_LAZY)
> +		new->flags |= MPOL_F_MOF;
> +
>  	/*
>  	 * If we are using the default policy then operation
>  	 * on discontinuous address spaces is okay after all
> @@ -1197,21 +1210,23 @@ static long do_mbind(unsigned long start
>  	vma = check_range(mm, start, end, nmask,
>  			  flags | MPOL_MF_INVERT, &pagelist);
>  
> -	err = PTR_ERR(vma);
> -	if (!IS_ERR(vma)) {
> -		int nr_failed = 0;
> -
> +	err = PTR_ERR(vma);	/* maybe ... */
> +	if (!IS_ERR(vma) && mode != MPOL_NOOP)
>  		err = mbind_range(mm, start, end, new);
>  
> +	if (!err) {
> +		int nr_failed = 0;
> +
>  		if (!list_empty(&pagelist)) {
> +			WARN_ON_ONCE(flags & MPOL_MF_LAZY);
>  			nr_failed = migrate_pages(&pagelist, new_vma_page,
> -						(unsigned long)vma,
> -						false, MIGRATE_SYNC);
> +						  (unsigned long)vma,
> +						  false, MIGRATE_SYNC);
>  			if (nr_failed)
>  				putback_lru_pages(&pagelist);
>  		}
>  
> -		if (!err && nr_failed && (flags & MPOL_MF_STRICT))
> +		if (nr_failed && (flags & MPOL_MF_STRICT))
>  			err = -EIO;
>  	} else
>  		putback_lru_pages(&pagelist);
> 
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 17/31] mm/migrate: Introduce migrate_misplaced_page()
  2012-10-25 12:16 ` [PATCH 17/31] mm/migrate: Introduce migrate_misplaced_page() Peter Zijlstra
@ 2012-11-01 12:20   ` Mel Gorman
  0 siblings, 0 replies; 135+ messages in thread
From: Mel Gorman @ 2012-11-01 12:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, linux-kernel, linux-mm,
	Paul Turner, Ingo Molnar

On Thu, Oct 25, 2012 at 02:16:34PM +0200, Peter Zijlstra wrote:
> Add migrate_misplaced_page() which deals with migrating pages from
> faults. 
> 
> This includes adding a new MIGRATE_FAULT migration mode to
> deal with the extra page reference required due to having to look up
> the page.
> 
> Based-on-work-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Cc: Paul Turner <pjt@google.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  include/linux/migrate.h      |    7 +++
>  include/linux/migrate_mode.h |    3 +
>  mm/migrate.c                 |   85 ++++++++++++++++++++++++++++++++++++++-----
>  3 files changed, 87 insertions(+), 8 deletions(-)
> 
> Index: tip/include/linux/migrate.h
> ===================================================================
> --- tip.orig/include/linux/migrate.h
> +++ tip/include/linux/migrate.h
> @@ -30,6 +30,7 @@ extern int migrate_vmas(struct mm_struct
>  extern void migrate_page_copy(struct page *newpage, struct page *page);
>  extern int migrate_huge_page_move_mapping(struct address_space *mapping,
>  				  struct page *newpage, struct page *page);
> +extern int migrate_misplaced_page(struct page *page, int node);
>  #else
>  
>  static inline void putback_lru_pages(struct list_head *l) {}
> @@ -63,5 +64,11 @@ static inline int migrate_huge_page_move
>  #define migrate_page NULL
>  #define fail_migrate_page NULL
>  
> +static inline
> +int migrate_misplaced_page(struct page *page, int node)
> +{
> +	return -EAGAIN; /* can't migrate now */
> +}
>  #endif /* CONFIG_MIGRATION */
> +
>  #endif /* _LINUX_MIGRATE_H */
> Index: tip/include/linux/migrate_mode.h
> ===================================================================
> --- tip.orig/include/linux/migrate_mode.h
> +++ tip/include/linux/migrate_mode.h
> @@ -6,11 +6,14 @@
>   *	on most operations but not ->writepage as the potential stall time
>   *	is too significant
>   * MIGRATE_SYNC will block when migrating pages
> + * MIGRATE_FAULT called from the fault path to migrate-on-fault for mempolicy
> + *	this path has an extra reference count
>   */
>  enum migrate_mode {
>  	MIGRATE_ASYNC,
>  	MIGRATE_SYNC_LIGHT,
>  	MIGRATE_SYNC,
> +	MIGRATE_FAULT,
>  };
>  
>  #endif		/* MIGRATE_MODE_H_INCLUDED */
> Index: tip/mm/migrate.c
> ===================================================================
> --- tip.orig/mm/migrate.c
> +++ tip/mm/migrate.c
> @@ -225,7 +225,7 @@ static bool buffer_migrate_lock_buffers(
>  	struct buffer_head *bh = head;
>  
>  	/* Simple case, sync compaction */
> -	if (mode != MIGRATE_ASYNC) {
> +	if (mode != MIGRATE_ASYNC && mode != MIGRATE_FAULT) {
>  		do {
>  			get_bh(bh);
>  			lock_buffer(bh);
> @@ -279,12 +279,22 @@ static int migrate_page_move_mapping(str
>  		struct page *newpage, struct page *page,
>  		struct buffer_head *head, enum migrate_mode mode)
>  {
> -	int expected_count;
> +	int expected_count = 0;
>  	void **pslot;
>  
> +	if (mode == MIGRATE_FAULT) {
> +		/*
> +		 * MIGRATE_FAULT has an extra reference on the page and
> +		 * otherwise acts like ASYNC, no point in delaying the
> +		 * fault, we'll try again next time.
> +		 */
> +		expected_count++;
> +	}
> +
>  	if (!mapping) {
>  		/* Anonymous page without mapping */
> -		if (page_count(page) != 1)
> +		expected_count += 1;
> +		if (page_count(page) != expected_count)
>  			return -EAGAIN;
>  		return 0;
>  	}
> @@ -294,7 +304,7 @@ static int migrate_page_move_mapping(str
>  	pslot = radix_tree_lookup_slot(&mapping->page_tree,
>   					page_index(page));
>  
> -	expected_count = 2 + page_has_private(page);
> +	expected_count += 2 + page_has_private(page);
>  	if (page_count(page) != expected_count ||
>  		radix_tree_deref_slot_protected(pslot, &mapping->tree_lock) != page) {
>  		spin_unlock_irq(&mapping->tree_lock);
> @@ -313,7 +323,7 @@ static int migrate_page_move_mapping(str
>  	 * the mapping back due to an elevated page count, we would have to
>  	 * block waiting on other references to be dropped.
>  	 */
> -	if (mode == MIGRATE_ASYNC && head &&
> +	if ((mode == MIGRATE_ASYNC || mode == MIGRATE_FAULT) && head &&
>  			!buffer_migrate_lock_buffers(head, mode)) {
>  		page_unfreeze_refs(page, expected_count);
>  		spin_unlock_irq(&mapping->tree_lock);
> @@ -521,7 +531,7 @@ int buffer_migrate_page(struct address_s
>  	 * with an IRQ-safe spinlock held. In the sync case, the buffers
>  	 * need to be locked now
>  	 */
> -	if (mode != MIGRATE_ASYNC)
> +	if (mode != MIGRATE_ASYNC && mode != MIGRATE_FAULT)
>  		BUG_ON(!buffer_migrate_lock_buffers(head, mode));
>  
>  	ClearPagePrivate(page);
> @@ -687,7 +697,7 @@ static int __unmap_and_move(struct page
>  	struct anon_vma *anon_vma = NULL;
>  
>  	if (!trylock_page(page)) {
> -		if (!force || mode == MIGRATE_ASYNC)
> +		if (!force || mode == MIGRATE_ASYNC || mode == MIGRATE_FAULT)
>  			goto out;
>  
>  		/*
> @@ -1403,4 +1413,63 @@ int migrate_vmas(struct mm_struct *mm, c
>   	}
>   	return err;
>  }
> -#endif
> +
> +/*
> + * Attempt to migrate a misplaced page to the specified destination
> + * node.
> + */
> +int migrate_misplaced_page(struct page *page, int node)
> +{
> +	struct address_space *mapping = page_mapping(page);
> +	int page_lru = page_is_file_cache(page);
> +	struct page *newpage;
> +	int ret = -EAGAIN;
> +	gfp_t gfp = GFP_HIGHUSER_MOVABLE;
> +
> +	/*
> +	 * Don't migrate pages that are mapped in multiple processes.
> +	 */
> +	if (page_mapcount(page) != 1)
> +		goto out;
> +

Why?

I know why -- it's because we don't want to ping-pong shared library
pages.

However, what if it's a shmem mapping? We might want to consider migrating
those but with this check that will never happen.

> +	/*
> +	 * Never wait for allocations just to migrate on fault, but don't dip
> +	 * into reserves. And, only accept pages from the specified node. No
> +	 * sense migrating to a different "misplaced" page!
> +	 */

Not only that,  we do not want to reclaim on a remote node to allow
migration. The cost of reclaim will exceed the benefit from local memory
accesses.

> +	if (mapping)
> +		gfp = mapping_gfp_mask(mapping);
> +	gfp &= ~__GFP_WAIT;
> +	gfp |= __GFP_NOMEMALLOC | GFP_THISNODE;
> +
> +	newpage = alloc_pages_node(node, gfp, 0);
> +	if (!newpage) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	if (isolate_lru_page(page)) {
> +		ret = -EBUSY;
> +		goto put_new;
> +	}
> +

We are going to need to keep an eye on the hotness of the LRU lock during
heavy migration. Actually, a process using MPOL_MF_LAZY and then creating
a large number of threads is going to hammer that lock. I've no proposed
solution to this, but if we see a bug report related to stalls after using
that system call then this will be a candidate.

Compaction had to use trylock-like logic to get around this problem but
it's not necessarily the right choice for lazy migration unless you are
willing to depend on the automatic detection to fix it up later.

> +	inc_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
> +	ret = __unmap_and_move(page, newpage, 0, 0, MIGRATE_FAULT);

By calling __unmap_and_move() directly, you bypass the PageTransHuge() check
in unmap_and_move() and the splitting when a transhuge page is encountered.
Was that deliberate? Why?

> +	/*
> +	 * A page that has been migrated has all references removed and will be
> +	 * freed. A page that has not been migrated will have kepts its
> +	 * references and be restored.
> +	 */
> +	dec_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
> +	putback_lru_page(page);
> +put_new:
> +	/*
> +	 * Move the new page to the LRU. If migration was not successful
> +	 * then this will free the page.
> +	 */
> +	putback_lru_page(newpage);
> +out:
> +	return ret;
> +}
> +
> +#endif /* CONFIG_NUMA */
> 
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 02/31] sched, numa, mm: Describe the NUMA scheduling problem formally
  2012-11-01  9:56   ` Mel Gorman
@ 2012-11-01 13:13     ` Rik van Riel
  0 siblings, 0 replies; 135+ messages in thread
From: Rik van Riel @ 2012-11-01 13:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton, linux-kernel,
	linux-mm, H. Peter Anvin, Mike Galbraith, Ingo Molnar

On 11/01/2012 05:56 AM, Mel Gorman wrote:
> On Thu, Oct 25, 2012 at 02:16:19PM +0200, Peter Zijlstra wrote:
>> This is probably a first: formal description of a complex high-level
>> computing problem, within the kernel source.
>>
>
> Who does not love the smell of formal methods first thing in the
> morning?

The only issue I have with this document is that it does not have
any description of how the source code tries to solve the problem
at hand.

A description of how the problem is solved will make the documentation
useful to people trying to figure out why the NUMA code does what
it does.

Of course, since we still do not know what sched-numa needs to do
in order to match autonuma performance, that description would have
to be updated later, anyway.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/31] numa/core patches
  2012-10-31 17:31               ` Hugh Dickins
@ 2012-11-01 13:41                 ` Hugh Dickins
  2012-11-02  3:23                   ` Zhouping Liu
  0 siblings, 1 reply; 135+ messages in thread
From: Hugh Dickins @ 2012-11-01 13:41 UTC (permalink / raw)
  To: Zhouping Liu
  Cc: Johannes Weiner, Peter Zijlstra, Rik van Riel, Andrea Arcangeli,
	Mel Gorman, Thomas Gleixner, Linus Torvalds, Andrew Morton,
	linux-kernel, linux-mm, Ingo Molnar, CAI Qian

On Wed, 31 Oct 2012, Hugh Dickins wrote:
> On Wed, 31 Oct 2012, Zhouping Liu wrote:
> > On 10/31/2012 03:26 PM, Hugh Dickins wrote:
> > > 
> > > There's quite a few put_page()s in do_huge_pmd_numa_page(), and it
> > > would help if we could focus on the one which is giving the trouble,
> > > but I don't know which that is.  Zhouping, if you can, please would
> > > you do an "objdump -ld vmlinux >bigfile" of your kernel, then extract
> > > from bigfile just the lines from "<do_huge_pmd_numa_page>:" to whatever
> > > is the next function, and post or mail privately just that disassembly.
> > > That should be good to identify which of the put_page()s is involved.
> > 
> > Hugh, I didn't find the next function, as I can't find any words that matched
> > "do_huge_pmd_numa_page".
> > Is there any other method?
> 
> Hmm, do_huge_pmd_numa_page does appear in your stacktrace,
> unless I've made a typo but am blind to it.
> 
> Were you applying objdump to the vmlinux which gave you the
> BUG at mm/memcontrol.c:1134! ?

Thanks for the further info you then sent privately: I have not made any
more effort to reproduce the issue, but your objdump did tell me that the
put_page hitting the problem is the one on line 872 of mm/huge_memory.c,
"Drop the local reference", just before successful return after migration.

I didn't really get the inspiration I'd hoped for out of knowing that,
but it did make me wonder whether you're suffering from one of the issues
I already mentioned, and I can now see a way in which it might cause
the mm/memcontrol.c:1134 BUG:-

migrate_page_copy() does TestClearPageActive on the source page:
so given the unsafe way in which do_huge_pmd_numa_page() was proceeding
with a !PageLRU page, it's quite possible that the page was sitting in
a pagevec, and added to the active lru (so added to the lru_size of the
active lru), but our final put_page removes it from lru, active flag has
been cleared, so we subtract it from the lru_size of the inactive lru -
that could indeed make it go negative and trigger the BUG.

Here's a patch fixing and tidying up that and a few other things there.
But I'm not signing it off yet, partly because I've barely tested it
(quite probably I didn't even have any numa pmd migration happening
at all), and partly because just a moment ago I ran across this
instructive comment in __collapse_huge_page_isolate():
	/* cannot use mapcount: can't collapse if there's a gup pin */
	if (page_count(page) != 1) {

Hmm, yes, below I've added the page_mapcount() check I proposed to
do_huge_pmd_numa_page(), but is even that safe enough?  Do we actually
need a page_count() check (for 2?) to guard against get_user_pages()?
I suspect we do, but then do we have enough locking to stabilize such
a check?  Probably, but...

This will take more time, and I doubt get_user_pages() is an issue in
your testing, so please would you try the patch below, to see if it
does fix the BUGs you are seeing?  Thanks a lot.

Not-Yet-Signed-off-by: Hugh Dickins <hughd@google.com>
---

 mm/huge_memory.c |   24 +++++++++---------------
 1 file changed, 9 insertions(+), 15 deletions(-)

--- 3.7-rc2+schednuma+johannes/mm/huge_memory.c	2012-11-01 04:10:43.812155671 -0700
+++ linux/mm/huge_memory.c	2012-11-01 05:52:19.512153771 -0700
@@ -745,7 +745,7 @@ void do_huge_pmd_numa_page(struct mm_str
 	struct mem_cgroup *memcg = NULL;
 	struct page *new_page = NULL;
 	struct page *page = NULL;
-	int node, lru;
+	int node = -1;
 
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(*pmd, entry)))
@@ -762,7 +762,8 @@ void do_huge_pmd_numa_page(struct mm_str
 		VM_BUG_ON(!PageCompound(page) || !PageHead(page));
 
 		get_page(page);
-		node = mpol_misplaced(page, vma, haddr);
+		if (page_mapcount(page) == 1)	/* Only do exclusively mapped */
+			node = mpol_misplaced(page, vma, haddr);
 		if (node != -1)
 			goto migrate;
 	}
@@ -801,13 +802,11 @@ migrate:
 	if (!new_page)
 		goto alloc_fail;
 
-	lru = PageLRU(page);
-
-	if (lru && isolate_lru_page(page)) /* does an implicit get_page() */
+	if (isolate_lru_page(page))	/* Does an implicit get_page() */
 		goto alloc_fail;
 
-	if (!trylock_page(new_page))
-		BUG();
+	__set_page_locked(new_page);
+	SetPageSwapBacked(new_page);
 
 	/* anon mapping, we can simply copy page->mapping to the new page: */
 	new_page->mapping = page->mapping;
@@ -820,8 +819,6 @@ migrate:
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(*pmd, entry))) {
 		spin_unlock(&mm->page_table_lock);
-		if (lru)
-			putback_lru_page(page);
 
 		unlock_page(new_page);
 		ClearPageActive(new_page);	/* Set by migrate_page_copy() */
@@ -829,6 +826,7 @@ migrate:
 		put_page(new_page);		/* Free it */
 
 		unlock_page(page);
+		putback_lru_page(page);
 		put_page(page);			/* Drop the local reference */
 
 		return;
@@ -859,16 +857,12 @@ migrate:
 	mem_cgroup_end_migration(memcg, page, new_page, true);
 	spin_unlock(&mm->page_table_lock);
 
-	put_page(page);			/* Drop the rmap reference */
-
 	task_numa_fault(node, HPAGE_PMD_NR);
 
-	if (lru)
-		put_page(page);		/* drop the LRU isolation reference */
-
 	unlock_page(new_page);
-
 	unlock_page(page);
+	put_page(page);			/* Drop the rmap reference */
+	put_page(page);			/* Drop the LRU isolation reference */
 	put_page(page);			/* Drop the local reference */
 
 	return;

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 19/31] sched, numa, mm: Introduce tsk_home_node()
  2012-10-25 12:16 ` [PATCH 19/31] sched, numa, mm: Introduce tsk_home_node() Peter Zijlstra
@ 2012-11-01 13:48   ` Mel Gorman
  0 siblings, 0 replies; 135+ messages in thread
From: Mel Gorman @ 2012-11-01 13:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, linux-kernel, linux-mm,
	Lee Schermerhorn, Ingo Molnar

On Thu, Oct 25, 2012 at 02:16:36PM +0200, Peter Zijlstra wrote:
> Introduce the home-node concept for tasks. In order to keep memory
> locality we need to have a something to stay local to, we define the
> home-node of a task as the node we prefer to allocate memory from and
> prefer to execute on.
> 

That implies that at some point or the other we must be hooking into
alloc_pages_current() and modifying where it calls numa_node_id() to
take the home node into account. Otherwise a process that faults while
running temporarily off the home node will allocate a page in the wrong
node forcing a migration later.

If we don't do that, why not and how do we cope with a task being
temporarily scheduled on a CPU that is not on the home node?
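
Roughly the kind of hook I'd expect (hypothetical sketch, the helper name
is mine; tsk_home_node() is the accessor this patch introduces):

	/* Where alloc_pages_current() currently uses numa_node_id(),
	 * prefer the home node over whatever node we happen to run on. */
	static int numa_preferred_nid(struct task_struct *p)
	{
		int nid = tsk_home_node(p);	/* -1 when no home node is set */

		return nid == -1 ? numa_node_id() : nid;
	}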

> These are no hard guarantees, merely soft preferences. This allows for
> optimal resource usage, we can run a task away from the home-node, the
> remote memory hit -- while expensive -- is less expensive than not
> running at all, or very little, due to severe cpu overload.
> 
> Similarly, we can allocate memory from another node if our home-node
> is depleted, again, some memory is better than no memory.
> 

Yes.

> This patch merely introduces the basic infrastructure, all policy
> comes later.
> 
> NOTE: we introduce the concept of EMBEDDED_NUMA, these are
> architectures where the memory access cost doesn't depend on the cpu
> but purely on the physical address -- embedded boards with cheap
> (slow) and expensive (fast) memory banks.
> 

This is a bit left-of-center. Is it necessary to deal with this now?

The name EMBEDDED here sucks a bit too as it has nothing to do with
whether the machine is embedded or not. Based on the description
NUMA_LATENCY_VARIABLE or something might have been a better name with a
description saying that. Not sure, as it's not obvious yet how it gets
used.

> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  arch/sh/mm/Kconfig        |    1 +
>  include/linux/init_task.h |    8 ++++++++
>  include/linux/sched.h     |   12 ++++++++++++
>  init/Kconfig              |   14 ++++++++++++++
>  kernel/sched/core.c       |   36 ++++++++++++++++++++++++++++++++++++
>  5 files changed, 71 insertions(+)
> 
> Index: tip/arch/sh/mm/Kconfig
> ===================================================================
> --- tip.orig/arch/sh/mm/Kconfig
> +++ tip/arch/sh/mm/Kconfig
> @@ -111,6 +111,7 @@ config VSYSCALL
>  config NUMA
>  	bool "Non Uniform Memory Access (NUMA) Support"
>  	depends on MMU && SYS_SUPPORTS_NUMA && EXPERIMENTAL
> +	select EMBEDDED_NUMA
>  	default n
>  	help
>  	  Some SH systems have many various memories scattered around
> Index: tip/include/linux/init_task.h
> ===================================================================
> --- tip.orig/include/linux/init_task.h
> +++ tip/include/linux/init_task.h
> @@ -143,6 +143,13 @@ extern struct task_group root_task_group
>  
>  #define INIT_TASK_COMM "swapper"
>  
> +#ifdef CONFIG_SCHED_NUMA
> +# define INIT_TASK_NUMA(tsk)						\
> +	.node = -1,
> +#else
> +# define INIT_TASK_NUMA(tsk)
> +#endif
> +
>  /*
>   *  INIT_TASK is used to set up the first task table, touch at
>   * your own risk!. Base=0, limit=0x1fffff (=2MB)
> @@ -210,6 +217,7 @@ extern struct task_group root_task_group
>  	INIT_TRACE_RECURSION						\
>  	INIT_TASK_RCU_PREEMPT(tsk)					\
>  	INIT_CPUSET_SEQ							\
> +	INIT_TASK_NUMA(tsk)						\
>  }
>  
>  
> Index: tip/include/linux/sched.h
> ===================================================================
> --- tip.orig/include/linux/sched.h
> +++ tip/include/linux/sched.h
> @@ -1479,6 +1479,9 @@ struct task_struct {
>  	short il_next;
>  	short pref_node_fork;
>  #endif
> +#ifdef CONFIG_SCHED_NUMA
> +	int node;
> +#endif

int home_node and a comment. node might be ok in parts of the VM where it
is clear from context what it means but in task_struct, "node" gives very
little hint as to what it is for.
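
i.e. something like this sketch (wording mine):

	#ifdef CONFIG_SCHED_NUMA
		int home_node;	/* preferred node to run on and allocate from, -1 if unset */
	#endif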

>  	struct rcu_head rcu;
>  
>  	/*
> @@ -1553,6 +1556,15 @@ struct task_struct {
>  /* Future-safe accessor for struct task_struct's cpus_allowed. */
>  #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
>  
> +static inline int tsk_home_node(struct task_struct *p)
> +{
> +#ifdef CONFIG_SCHED_NUMA
> +	return p->node;
> +#else
> +	return -1;
> +#endif
> +}
> +
>  /*
>   * Priority of a process goes from 0..MAX_PRIO-1, valid RT
>   * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
> Index: tip/init/Kconfig
> ===================================================================
> --- tip.orig/init/Kconfig
> +++ tip/init/Kconfig
> @@ -696,6 +696,20 @@ config LOG_BUF_SHIFT
>  config HAVE_UNSTABLE_SCHED_CLOCK
>  	bool
>  
> +#
> +# For architectures that (ab)use NUMA to represent different memory regions
> +# all cpu-local but of different latencies, such as SuperH.
> +#
> +config EMBEDDED_NUMA
> +	bool
> +
> +config SCHED_NUMA
> +	bool "Memory placement aware NUMA scheduler"
> +	default n
> +	depends on SMP && NUMA && MIGRATION && !EMBEDDED_NUMA
> +	help
> +	  This option adds support for automatic NUMA aware memory/task placement.
> +

I see why you introduce EMBEDDED_NUMA now.  This should have been a separate
patch though, explaining why, when NUMA is abused like this, automatic NUMA
placement is the wrong thing to do: presumably the lower-latency regions are
being manually managed and should not be interfered with.
That, or such architectures need to add a pgdat field that excludes such
nodes from automatic migration.
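
(The pgdat alternative would only need something along these lines -
purely a sketch, the field name is made up:)

	static inline bool node_numa_placement_allowed(int nid)
	{
		/* hypothetical per-node opt-out from automatic placement */
		return !NODE_DATA(nid)->numa_placement_disabled;
	}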

>  menuconfig CGROUPS
>  	boolean "Control Group support"
>  	depends on EVENTFD
> Index: tip/kernel/sched/core.c
> ===================================================================
> --- tip.orig/kernel/sched/core.c
> +++ tip/kernel/sched/core.c
> @@ -5959,6 +5959,42 @@ static struct sched_domain_topology_leve
>  
>  static struct sched_domain_topology_level *sched_domain_topology = default_topology;
>  
> +#ifdef CONFIG_SCHED_NUMA
> +
> +/*
> + * Requeues a task ensuring its on the right load-balance list so
> + * that it might get migrated to its new home.
> + *
> + * Note that we cannot actively migrate ourselves since our callers
> + * can be from atomic context. We rely on the regular load-balance
> + * mechanisms to move us around -- its all preference anyway.
> + */
> +void sched_setnode(struct task_struct *p, int node)
> +{
> +	unsigned long flags;
> +	int on_rq, running;
> +	struct rq *rq;
> +
> +	rq = task_rq_lock(p, &flags);
> +	on_rq = p->on_rq;
> +	running = task_current(rq, p);
> +
> +	if (on_rq)
> +		dequeue_task(rq, p, 0);
> +	if (running)
> +		p->sched_class->put_prev_task(rq, p);
> +
> +	p->node = node;
> +
> +	if (running)
> +		p->sched_class->set_curr_task(rq);
> +	if (on_rq)
> +		enqueue_task(rq, p, 0);
> +	task_rq_unlock(rq, p, &flags);
> +}
> +

Presumably this thing is called rarely enough that rq lock contention will
not be a problem. If it is, it'll be quickly obvious.

> +#endif /* CONFIG_SCHED_NUMA */
> +
>  #ifdef CONFIG_NUMA
>  
>  static int sched_domains_numa_levels;
> 
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 20/31] sched, numa, mm/mpol: Make mempolicy home-node aware
  2012-10-25 12:16 ` [PATCH 20/31] sched, numa, mm/mpol: Make mempolicy home-node aware Peter Zijlstra
@ 2012-11-01 13:58   ` Mel Gorman
  2012-11-01 14:10     ` Don Morris
  0 siblings, 1 reply; 135+ messages in thread
From: Mel Gorman @ 2012-11-01 13:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, linux-kernel, linux-mm,
	Paul Turner, Lee Schermerhorn, Christoph Lameter, Ingo Molnar

On Thu, Oct 25, 2012 at 02:16:37PM +0200, Peter Zijlstra wrote:
> Add another layer of fallback policy to make the home node concept
> useful from a memory allocation PoV.
> 
> This changes the mpol order to:
> 
>  - vma->vm_ops->get_policy	[if applicable]
>  - vma->vm_policy		[if applicable]
>  - task->mempolicy
>  - tsk_home_node() preferred	[NEW]
>  - default_policy
> 
> Note that the tsk_home_node() policy has Migrate-on-Fault enabled to
> facilitate efficient on-demand memory migration.
> 

Makes sense and it looks like a VMA policy, if set, will still override
the home_node policy as you'd expect. At some point this may need to cope
with node hot-remove. Also, at some point this must be dealing with the
case where mbind() is called but the home_node is not in the nodemask.
Does that happen somewhere else in the series? (maybe I'll see it later)

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 21/31] sched, numa, mm: Introduce sched_feat_numa()
  2012-10-25 12:16 ` [PATCH 21/31] sched, numa, mm: Introduce sched_feat_numa() Peter Zijlstra
@ 2012-11-01 14:00   ` Mel Gorman
  0 siblings, 0 replies; 135+ messages in thread
From: Mel Gorman @ 2012-11-01 14:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, linux-kernel, linux-mm,
	Paul Turner, Lee Schermerhorn, Christoph Lameter, Ingo Molnar

On Thu, Oct 25, 2012 at 02:16:38PM +0200, Peter Zijlstra wrote:
> Avoid a few #ifdef's later on.
> 

It does mean that schednuma cannot be enabled or disabled from the command
line (or similarly easy mechanism) and that debugfs must be mounted to
control it. This would be awkward from an admin perspective if they wanted
to force disable schednuma because it hurt their workload for whatever
reason. Yes, they can disable it with an init script of some sort but
there will be some moaning.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 20/31] sched, numa, mm/mpol: Make mempolicy home-node aware
  2012-11-01 13:58   ` Mel Gorman
@ 2012-11-01 14:10     ` Don Morris
  0 siblings, 0 replies; 135+ messages in thread
From: Don Morris @ 2012-11-01 14:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Rik van Riel, Andrea Arcangeli, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton, linux-kernel,
	linux-mm, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Ingo Molnar

On 11/01/2012 06:58 AM, Mel Gorman wrote:
> On Thu, Oct 25, 2012 at 02:16:37PM +0200, Peter Zijlstra wrote:
>> Add another layer of fallback policy to make the home node concept
>> useful from a memory allocation PoV.
>>
>> This changes the mpol order to:
>>
>>  - vma->vm_ops->get_policy	[if applicable]
>>  - vma->vm_policy		[if applicable]
>>  - task->mempolicy
>>  - tsk_home_node() preferred	[NEW]
>>  - default_policy
>>
>> Note that the tsk_home_node() policy has Migrate-on-Fault enabled to
>> facilitate efficient on-demand memory migration.
>>
> 
> Makes sense and it looks like a VMA policy, if set, will still override
> the home_node policy as you'd expect. At some point this may need to cope
> with node hot-remove. Also, at some point this must be dealing with the
> case where mbind() is called but the home_node is not in the nodemask.
> Does that happen somewhere else in the series? (maybe I'll see it later)
> 

I'd expect one of the first things to be done in the sequence of
hot-removing a node would be to take the cpus offline (at least
out of being schedulable). Hence the tasks would be migrated
to other nodes/processors, which should result in a home node
update the same as if the scheduler had simply chosen a better
home for them anyway. The memory would then migrate either
via the home node change by the tasks themselves or via
migration to evacuate the outgoing node (with the preferred
migration target using the new home node).

As long as no one wants to do something crazy like offline
a node before taking the resources away from the scheduler
and memory management, it should all work out.

Don Morris

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 22/31] sched, numa, mm: Implement THP migration
  2012-10-25 12:16 ` [PATCH 22/31] sched, numa, mm: Implement THP migration Peter Zijlstra
@ 2012-11-01 14:16   ` Mel Gorman
  0 siblings, 0 replies; 135+ messages in thread
From: Mel Gorman @ 2012-11-01 14:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, linux-kernel, linux-mm,
	Ingo Molnar

On Thu, Oct 25, 2012 at 02:16:39PM +0200, Peter Zijlstra wrote:
> Add THP migration for the NUMA working set scanning fault case.
> 
> It uses the page lock to serialize.

Serialize against what?

> No migration pte dance is
> necessary because the pte is already unmapped when we decide
> to migrate.
> 

Without the migration PTE dance it does mean that parallel faults could
attempt to migrate at the same time and do a lot of busy work. Is this
what the page lock is meant to serialise against?

Either way, this feels like an optimisation that should have appeared later
in the series because if there is any bug due to this patch it'll be a bitch
to debug. The bisection will point to where schednuma finally comes into play
and not this patch where the actual bug is (if any, maybe this is perfect).

Also, I'm slightly confused by this comment because we have this

        page = pmd_page(entry);
        if (page) {
                VM_BUG_ON(!PageCompound(page) || !PageHead(page));

                get_page(page);
                node = mpol_misplaced(page, vma, haddr);
                if (node != -1)
                        goto migrate;

It's not obvious at all why the "pte is already unmapped when we decide
to migrate".

> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> [ Significant fixes and changelog. ]
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  mm/huge_memory.c |  133 ++++++++++++++++++++++++++++++++++++++++++-------------
>  mm/migrate.c     |    2 
>  2 files changed, 104 insertions(+), 31 deletions(-)
> 
> Index: tip/mm/huge_memory.c
> ===================================================================
> --- tip.orig/mm/huge_memory.c
> +++ tip/mm/huge_memory.c
> @@ -742,12 +742,13 @@ void do_huge_pmd_numa_page(struct mm_str
>  			   unsigned int flags, pmd_t entry)
>  {
>  	unsigned long haddr = address & HPAGE_PMD_MASK;
> +	struct page *new_page = NULL;
>  	struct page *page = NULL;
> -	int node;
> +	int node, lru;
>  
>  	spin_lock(&mm->page_table_lock);
>  	if (unlikely(!pmd_same(*pmd, entry)))
> -		goto out_unlock;
> +		goto unlock;
>  
>  	if (unlikely(pmd_trans_splitting(entry))) {
>  		spin_unlock(&mm->page_table_lock);
> @@ -755,45 +756,117 @@ void do_huge_pmd_numa_page(struct mm_str
>  		return;
>  	}
>  
> -#ifdef CONFIG_NUMA
>  	page = pmd_page(entry);
> -	VM_BUG_ON(!PageCompound(page) || !PageHead(page));
> +	if (page) {
> +		VM_BUG_ON(!PageCompound(page) || !PageHead(page));
>  
> -	get_page(page);
> +		get_page(page);
> +		node = mpol_misplaced(page, vma, haddr);
> +		if (node != -1)
> +			goto migrate;
> +	}
> +
> +fixup:
> +	/* change back to regular protection */
> +	entry = pmd_modify(entry, vma->vm_page_prot);
> +	set_pmd_at(mm, haddr, pmd, entry);
> +	update_mmu_cache_pmd(vma, address, entry);
> +
> +unlock:
>  	spin_unlock(&mm->page_table_lock);
> +	if (page)
> +		put_page(page);
>  
> -	/*
> -	 * XXX should we serialize against split_huge_page ?
> -	 */
> -
> -	node = mpol_misplaced(page, vma, haddr);
> -	if (node == -1)
> -		goto do_fixup;
> -
> -	/*
> -	 * Due to lacking code to migrate thp pages, we'll split
> -	 * (which preserves the special PROT_NONE) and re-take the
> -	 * fault on the normal pages.
> -	 */
> -	split_huge_page(page);
> -	put_page(page);
>  	return;
>  
> -do_fixup:
> +migrate:
> +	spin_unlock(&mm->page_table_lock);
> +
> +	lock_page(page);
>  	spin_lock(&mm->page_table_lock);
> -	if (unlikely(!pmd_same(*pmd, entry)))
> -		goto out_unlock;
> -#endif
> +	if (unlikely(!pmd_same(*pmd, entry))) {
> +		spin_unlock(&mm->page_table_lock);
> +		unlock_page(page);
> +		put_page(page);
> +		return;
> +	}
> +	spin_unlock(&mm->page_table_lock);
>  
> -	/* change back to regular protection */
> -	entry = pmd_modify(entry, vma->vm_page_prot);
> -	if (pmdp_set_access_flags(vma, haddr, pmd, entry, 1))
> -		update_mmu_cache_pmd(vma, address, entry);
> +	new_page = alloc_pages_node(node,
> +	    (GFP_TRANSHUGE | GFP_THISNODE) & ~__GFP_WAIT,
> +	    HPAGE_PMD_ORDER);
> +
> +	if (!new_page)
> +		goto alloc_fail;
> +
> +	lru = PageLRU(page);
> +
> +	if (lru && isolate_lru_page(page)) /* does an implicit get_page() */
> +		goto alloc_fail;
> +
> +	if (!trylock_page(new_page))
> +		BUG();
> +
> +	/* anon mapping, we can simply copy page->mapping to the new page: */
> +	new_page->mapping = page->mapping;
> +	new_page->index = page->index;
>  
> -out_unlock:
> +	migrate_page_copy(new_page, page);
> +
> +	WARN_ON(PageLRU(new_page));
> +
> +	spin_lock(&mm->page_table_lock);
> +	if (unlikely(!pmd_same(*pmd, entry))) {


So the page lock serialises against parallel faults by the looks of
things but where does it serialise against the transhuge page being
split underneath you and turning it into a regular pmd? I guess the
pmd_same check sort of catches that, but it feels like it is caught by
accident instead of on purpose.

> +		spin_unlock(&mm->page_table_lock);
> +		if (lru)
> +			putback_lru_page(page);
> +
> +		unlock_page(new_page);
> +		ClearPageActive(new_page);	/* Set by migrate_page_copy() */
> +		new_page->mapping = NULL;
> +		put_page(new_page);		/* Free it */
> +
> +		unlock_page(page);
> +		put_page(page);			/* Drop the local reference */
> +
> +		return;
> +	}
> +
> +	entry = mk_pmd(new_page, vma->vm_page_prot);
> +	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> +	entry = pmd_mkhuge(entry);
> +
> +	page_add_new_anon_rmap(new_page, vma, haddr);
> +
> +	set_pmd_at(mm, haddr, pmd, entry);
> +	update_mmu_cache_pmd(vma, address, entry);
> +	page_remove_rmap(page);
>  	spin_unlock(&mm->page_table_lock);
> -	if (page)
> +
> +	put_page(page);			/* Drop the rmap reference */
> +
> +	if (lru)
> +		put_page(page);		/* drop the LRU isolation reference */
> +
> +	unlock_page(new_page);
> +	unlock_page(page);
> +	put_page(page);			/* Drop the local reference */
> +
> +	return;
> +
> +alloc_fail:
> +	if (new_page)
> +		put_page(new_page);
> +
> +	unlock_page(page);
> +
> +	spin_lock(&mm->page_table_lock);
> +	if (unlikely(!pmd_same(*pmd, entry))) {
>  		put_page(page);
> +		page = NULL;
> +		goto unlock;
> +	}
> +	goto fixup;
>  }
>  
>  int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> Index: tip/mm/migrate.c
> ===================================================================
> --- tip.orig/mm/migrate.c
> +++ tip/mm/migrate.c
> @@ -417,7 +417,7 @@ int migrate_huge_page_move_mapping(struc
>   */
>  void migrate_page_copy(struct page *newpage, struct page *page)
>  {
> -	if (PageHuge(page))
> +	if (PageHuge(page) || PageTransHuge(page))
>  		copy_huge_page(newpage, page);
>  	else
>  		copy_highpage(newpage, page);
> 
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 23/31] sched, numa, mm: Implement home-node awareness
  2012-10-25 12:16 ` [PATCH 23/31] sched, numa, mm: Implement home-node awareness Peter Zijlstra
@ 2012-11-01 15:06   ` Mel Gorman
  0 siblings, 0 replies; 135+ messages in thread
From: Mel Gorman @ 2012-11-01 15:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, linux-kernel, linux-mm,
	Paul Turner, Lee Schermerhorn, Christoph Lameter, Ingo Molnar

On Thu, Oct 25, 2012 at 02:16:40PM +0200, Peter Zijlstra wrote:
> Implement home node preference in the scheduler's load-balancer.
> 
> This is done in four pieces:
> 
>  - task_numa_hot(); make it harder to migrate tasks away from their
>    home-node, controlled using the NUMA_HOT feature flag.
> 

We don't actually know if it's hot, we're guessing. task_numa_stick()?

>  - select_task_rq_fair(); prefer placing the task in their home-node,
>    controlled using the NUMA_TTWU_BIAS feature flag. Disabled by
>    default for we found this to be far too agressive. 
> 

Separate patch then?

>  - load_balance(); during the regular pull load-balance pass, try
>    pulling tasks that are on the wrong node first with a preference
>    of moving them nearer to their home-node through task_numa_hot(),
>    controlled through the NUMA_PULL feature flag.
> 

Sounds sensible.

>  - load_balance(); when the balancer finds no imbalance, introduce
>    some such that it still prefers to move tasks towards their
>    home-node, using active load-balance if needed, controlled through
>    the NUMA_PULL_BIAS feature flag.
> 
>    In particular, only introduce this BIAS if the system is otherwise
>    properly (weight) balanced and we either have an offnode or !numa
>    task to trade for it.
> 

Again, sounds reasonable.

> In order to easily find off-node tasks, split the per-cpu task list
> into two parts.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Paul Turner <pjt@google.com>
> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  include/linux/sched.h   |    3 
>  kernel/sched/core.c     |   28 +++
>  kernel/sched/debug.c    |    3 
>  kernel/sched/fair.c     |  349 +++++++++++++++++++++++++++++++++++++++++++++---
>  kernel/sched/features.h |   10 +
>  kernel/sched/sched.h    |   17 ++
>  6 files changed, 384 insertions(+), 26 deletions(-)
> 
> Index: tip/include/linux/sched.h
> ===================================================================
> --- tip.orig/include/linux/sched.h
> +++ tip/include/linux/sched.h
> @@ -823,6 +823,7 @@ enum cpu_idle_type {
>  #define SD_ASYM_PACKING		0x0800  /* Place busy groups earlier in the domain */
>  #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
>  #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
> +#define SD_NUMA			0x4000	/* cross-node balancing */
>  
>  extern int __weak arch_sd_sibiling_asym_packing(void);
>  
> @@ -1481,6 +1482,7 @@ struct task_struct {
>  #endif
>  #ifdef CONFIG_SCHED_NUMA
>  	int node;
> +	unsigned long numa_contrib;

comment!

/*
 * numa_contrib records how much of this task's load factor was due to
 * running away from its home node
 */

It contributes to numa_offnode_weight but where do we make any decisions
based on it? Superficially this is for stats but it gets bubbled all the
way up to the sched domain where the actual decisions are made. The
comment could be a lot more helpful in spelling this out.

>  #endif
>  	struct rcu_head rcu;
>  
> @@ -2084,6 +2086,7 @@ extern int sched_setscheduler(struct tas
>  			      const struct sched_param *);
>  extern int sched_setscheduler_nocheck(struct task_struct *, int,
>  				      const struct sched_param *);
> +extern void sched_setnode(struct task_struct *p, int node);
>  extern struct task_struct *idle_task(int cpu);
>  /**
>   * is_idle_task - is the specified task an idle task?
> Index: tip/kernel/sched/core.c
> ===================================================================
> --- tip.orig/kernel/sched/core.c
> +++ tip/kernel/sched/core.c
> @@ -5484,7 +5484,9 @@ static void destroy_sched_domains(struct
>  DEFINE_PER_CPU(struct sched_domain *, sd_llc);
>  DEFINE_PER_CPU(int, sd_llc_id);
>  
> -static void update_top_cache_domain(int cpu)
> +DEFINE_PER_CPU(struct sched_domain *, sd_node);
> +
> +static void update_domain_cache(int cpu)
>  {
>  	struct sched_domain *sd;
>  	int id = cpu;
> @@ -5495,6 +5497,15 @@ static void update_top_cache_domain(int
>  
>  	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>  	per_cpu(sd_llc_id, cpu) = id;
> +
> +	for_each_domain(cpu, sd) {
> +		if (cpumask_equal(sched_domain_span(sd),
> +				  cpumask_of_node(cpu_to_node(cpu))))
> +			goto got_node;
> +	}
> +	sd = NULL;
> +got_node:
> +	rcu_assign_pointer(per_cpu(sd_node, cpu), sd);
>  }

Not obvious how this connects to the rest of the patch at all.

>  
>  /*
> @@ -5537,7 +5548,7 @@ cpu_attach_domain(struct sched_domain *s
>  	rcu_assign_pointer(rq->sd, sd);
>  	destroy_sched_domains(tmp, cpu);
>  
> -	update_top_cache_domain(cpu);
> +	update_domain_cache(cpu);
>  }
>  
>  /* cpus with isolated domains */
> @@ -5965,9 +5976,9 @@ static struct sched_domain_topology_leve
>   * Requeues a task ensuring its on the right load-balance list so
>   * that it might get migrated to its new home.
>   *
> - * Note that we cannot actively migrate ourselves since our callers
> - * can be from atomic context. We rely on the regular load-balance
> - * mechanisms to move us around -- its all preference anyway.
> + * Since home-node is pure preference there's no hard migrate to force
> + * us anywhere, this also allows us to call this from atomic context if
> + * required.
>   */
>  void sched_setnode(struct task_struct *p, int node)
>  {
> @@ -6040,6 +6051,7 @@ sd_numa_init(struct sched_domain_topolog
>  					| 0*SD_SHARE_PKG_RESOURCES
>  					| 1*SD_SERIALIZE
>  					| 0*SD_PREFER_SIBLING
> +					| 1*SD_NUMA
>  					| sd_local_flags(level)
>  					,
>  		.last_balance		= jiffies,
> @@ -6901,6 +6913,12 @@ void __init sched_init(void)
>  		rq->avg_idle = 2*sysctl_sched_migration_cost;
>  
>  		INIT_LIST_HEAD(&rq->cfs_tasks);
> +#ifdef CONFIG_SCHED_NUMA
> +		INIT_LIST_HEAD(&rq->offnode_tasks);
> +		rq->onnode_running = 0;
> +		rq->offnode_running = 0;
> +		rq->offnode_weight = 0;
> +#endif
>  
>  		rq_attach_root(rq, &def_root_domain);
>  #ifdef CONFIG_NO_HZ
> Index: tip/kernel/sched/debug.c
> ===================================================================
> --- tip.orig/kernel/sched/debug.c
> +++ tip/kernel/sched/debug.c
> @@ -132,6 +132,9 @@ print_task(struct seq_file *m, struct rq
>  	SEQ_printf(m, "%15Ld %15Ld %15Ld.%06ld %15Ld.%06ld %15Ld.%06ld",
>  		0LL, 0LL, 0LL, 0L, 0LL, 0L, 0LL, 0L);
>  #endif
> +#ifdef CONFIG_SCHED_NUMA
> +	SEQ_printf(m, " %d/%d", p->node, cpu_to_node(task_cpu(p)));
> +#endif
>  #ifdef CONFIG_CGROUP_SCHED
>  	SEQ_printf(m, " %s", task_group_path(task_group(p)));
>  #endif
> Index: tip/kernel/sched/fair.c
> ===================================================================
> --- tip.orig/kernel/sched/fair.c
> +++ tip/kernel/sched/fair.c
> @@ -26,6 +26,7 @@
>  #include <linux/slab.h>
>  #include <linux/profile.h>
>  #include <linux/interrupt.h>
> +#include <linux/random.h>
>  
>  #include <trace/events/sched.h>
>  
> @@ -773,6 +774,51 @@ update_stats_curr_start(struct cfs_rq *c
>  }
>  
>  /**************************************************
> + * Scheduling class numa methods.
> + */
> +
> +#ifdef CONFIG_SMP
> +static unsigned long task_h_load(struct task_struct *p);
> +#endif
> +
> +#ifdef CONFIG_SCHED_NUMA
> +static struct list_head *account_numa_enqueue(struct rq *rq, struct task_struct *p)
> +{
> +	struct list_head *tasks = &rq->cfs_tasks;
> +
> +	if (tsk_home_node(p) != cpu_to_node(task_cpu(p))) {
> +		p->numa_contrib = task_h_load(p);
> +		rq->offnode_weight += p->numa_contrib;
> +		rq->offnode_running++;
> +		tasks = &rq->offnode_tasks;
> +	} else
> +		rq->onnode_running++;
> +
> +	return tasks;
> +}
> +
> +static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
> +{
> +	if (tsk_home_node(p) != cpu_to_node(task_cpu(p))) {
> +		rq->offnode_weight -= p->numa_contrib;
> +		rq->offnode_running--;
> +	} else
> +		rq->onnode_running--;
> +}
> +#else
> +#ifdef CONFIG_SMP
> +static struct list_head *account_numa_enqueue(struct rq *rq, struct task_struct *p)
> +{
> +	return NULL;
> +}
> +#endif
> +
> +static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
> +{
> +}
> +#endif /* CONFIG_SCHED_NUMA */
> +
> +/**************************************************
>   * Scheduling class queueing methods:
>   */
>  
> @@ -783,9 +829,17 @@ account_entity_enqueue(struct cfs_rq *cf
>  	if (!parent_entity(se))
>  		update_load_add(&rq_of(cfs_rq)->load, se->load.weight);
>  #ifdef CONFIG_SMP
> -	if (entity_is_task(se))
> -		list_add(&se->group_node, &rq_of(cfs_rq)->cfs_tasks);
> -#endif
> +	if (entity_is_task(se)) {
> +		struct rq *rq = rq_of(cfs_rq);
> +		struct task_struct *p = task_of(se);
> +		struct list_head *tasks = &rq->cfs_tasks;
> +
> +		if (tsk_home_node(p) != -1)
> +			tasks = account_numa_enqueue(rq, p);
> +
> +		list_add(&se->group_node, tasks);
> +	}
> +#endif /* CONFIG_SMP */
>  	cfs_rq->nr_running++;
>  }
>  
> @@ -795,8 +849,14 @@ account_entity_dequeue(struct cfs_rq *cf
>  	update_load_sub(&cfs_rq->load, se->load.weight);
>  	if (!parent_entity(se))
>  		update_load_sub(&rq_of(cfs_rq)->load, se->load.weight);
> -	if (entity_is_task(se))
> +	if (entity_is_task(se)) {
> +		struct task_struct *p = task_of(se);
> +
>  		list_del_init(&se->group_node);
> +
> +		if (tsk_home_node(p) != -1)
> +			account_numa_dequeue(rq_of(cfs_rq), p);
> +	}
>  	cfs_rq->nr_running--;
>  }
>  
> @@ -2681,6 +2741,35 @@ done:
>  	return target;
>  }
>  
> +#ifdef CONFIG_SCHED_NUMA
> +static inline bool pick_numa_rand(int n)
> +{
> +	return !(get_random_int() % n);
> +}
> +

"return get_random_int() % n" I could understand but this thing looks like
it only returns 1 if the random number is 0 or some multiple of n. Hard
to see how this is going to randomly select a node as such.

> +/*
> + * Pick a random elegible CPU in the target node, hopefully faster
> + * than doing a least-loaded scan.
> + */
> +static int numa_select_node_cpu(struct task_struct *p, int node)
> +{
> +	int weight = cpumask_weight(cpumask_of_node(node));
> +	int i, cpu = -1;
> +
> +	for_each_cpu_and(i, cpumask_of_node(node), tsk_cpus_allowed(p)) {
> +		if (cpu < 0 || pick_numa_rand(weight))
> +			cpu = i;
> +	}
> +
> +	return cpu;
> +}
> +#else
> +static int numa_select_node_cpu(struct task_struct *p, int node)
> +{
> +	return -1;
> +}
> +#endif /* CONFIG_SCHED_NUMA */
> +
>  /*
>   * sched_balance_self: balance the current task (running on cpu) in domains
>   * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
> @@ -2701,6 +2790,7 @@ select_task_rq_fair(struct task_struct *
>  	int new_cpu = cpu;
>  	int want_affine = 0;
>  	int sync = wake_flags & WF_SYNC;
> +	int node = tsk_home_node(p);
>  
>  	if (p->nr_cpus_allowed == 1)
>  		return prev_cpu;
> @@ -2712,6 +2802,36 @@ select_task_rq_fair(struct task_struct *
>  	}
>  
>  	rcu_read_lock();
> +	if (sched_feat_numa(NUMA_TTWU_BIAS) && node != -1) {
> +		/*
> +		 * For fork,exec find the idlest cpu in the home-node.
> +		 */
> +		if (sd_flag & (SD_BALANCE_FORK|SD_BALANCE_EXEC)) {
> +			int node_cpu = numa_select_node_cpu(p, node);
> +			if (node_cpu < 0)
> +				goto find_sd;
> +
> +			new_cpu = cpu = node_cpu;
> +			sd = per_cpu(sd_node, cpu);
> +			goto pick_idlest;
> +		}
> +
> +		/*
> +		 * For wake, pretend we were running in the home-node.
> +		 */
> +		if (cpu_to_node(prev_cpu) != node) {
> +			int node_cpu = numa_select_node_cpu(p, node);
> +			if (node_cpu < 0)
> +				goto find_sd;
> +
> +			if (sched_feat_numa(NUMA_TTWU_TO))
> +				cpu = node_cpu;
> +			else
> +				prev_cpu = node_cpu;
> +		}
> +	}
> +
> +find_sd:
>  	for_each_domain(cpu, tmp) {
>  		if (!(tmp->flags & SD_LOAD_BALANCE))
>  			continue;
> @@ -2738,6 +2858,7 @@ select_task_rq_fair(struct task_struct *
>  		goto unlock;
>  	}
>  
> +pick_idlest:
>  	while (sd) {
>  		int load_idx = sd->forkexec_idx;
>  		struct sched_group *group;
> @@ -3060,6 +3181,8 @@ struct lb_env {
>  
>  	unsigned int		flags;
>  
> +	struct list_head	*tasks;
> +
>  	unsigned int		loop;
>  	unsigned int		loop_break;
>  	unsigned int		loop_max;
> @@ -3080,11 +3203,28 @@ static void move_task(struct task_struct
>  	check_preempt_curr(env->dst_rq, p, 0);
>  }
>  
> +static int task_numa_hot(struct task_struct *p, struct lb_env *env)
> +{

bool

document return value.

It's not actually returning whether the node is "hot" or "cold"; it's
returning whether the task should stick with its current node or not.
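
Something along these lines would be clearer (the name and kerneldoc
wording are only a suggestion):

	/*
	 * Return true if @p should stick to its current node, false if
	 * moving it to env->dst_cpu is acceptable (gets it closer to home).
	 */
	static bool task_numa_stick(struct task_struct *p, struct lb_env *env);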

> +	int from_dist, to_dist;
> +	int node = tsk_home_node(p);
> +
> +	if (!sched_feat_numa(NUMA_HOT) || node == -1)
> +		return 0; /* no node preference */
> +
> +	from_dist = node_distance(cpu_to_node(env->src_cpu), node);
> +	to_dist = node_distance(cpu_to_node(env->dst_cpu), node);
> +
> +	if (to_dist < from_dist)
> +		return 0; /* getting closer is ok */
> +
> +	return 1; /* stick to where we are */
> +}
> +

Ok.

>  /*
>   * Is this task likely cache-hot:
>   */
>  static int
> -task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
> +task_hot(struct task_struct *p, struct lb_env *env)
>  {
>  	s64 delta;
>  
> @@ -3107,7 +3247,7 @@ task_hot(struct task_struct *p, u64 now,
>  	if (sysctl_sched_migration_cost == 0)
>  		return 0;
>  
> -	delta = now - p->se.exec_start;
> +	delta = env->src_rq->clock_task - p->se.exec_start;
>  

This looks like a cleanup. Not obviously connected with the rest of the
patch.

>  	return delta < (s64)sysctl_sched_migration_cost;
>  }
> @@ -3164,7 +3304,9 @@ int can_migrate_task(struct task_struct
>  	 * 2) too many balance attempts have failed.
>  	 */
>  
> -	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
> +	tsk_cache_hot = task_hot(p, env);
> +	if (env->idle == CPU_NOT_IDLE)
> +		tsk_cache_hot |= task_numa_hot(p, env);
>  	if (!tsk_cache_hot ||
>  		env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
>  #ifdef CONFIG_SCHEDSTATS
> @@ -3190,11 +3332,11 @@ int can_migrate_task(struct task_struct
>   *
>   * Called with both runqueues locked.
>   */
> -static int move_one_task(struct lb_env *env)
> +static int __move_one_task(struct lb_env *env)
>  {
>  	struct task_struct *p, *n;
>  
> -	list_for_each_entry_safe(p, n, &env->src_rq->cfs_tasks, se.group_node) {
> +	list_for_each_entry_safe(p, n, env->tasks, se.group_node) {
>  		if (throttled_lb_pair(task_group(p), env->src_rq->cpu, env->dst_cpu))
>  			continue;
>  
> @@ -3213,7 +3355,20 @@ static int move_one_task(struct lb_env *
>  	return 0;
>  }
>  
> -static unsigned long task_h_load(struct task_struct *p);
> +static int move_one_task(struct lb_env *env)
> +{

This function is not actually used in this patch. Glancing forward I see
that you later call this from the load balancer which makes sense but overall
this patch is hard to follow because it's not clear which parts are relevant
and which are not as this hunk is not critical to the concept of home-node
awareness.  It's part of the CPU migration policy when schednuma is enabled.

> +	if (sched_feat_numa(NUMA_PULL)) {
> +		env->tasks = offnode_tasks(env->src_rq);
> +		if (__move_one_task(env))
> +			return 1;
> +	}
> +
> +	env->tasks = &env->src_rq->cfs_tasks;
> +	if (__move_one_task(env))
> +		return 1;
> +
> +	return 0;
> +}
>  
>  static const unsigned int sched_nr_migrate_break = 32;
>  
> @@ -3226,7 +3381,6 @@ static const unsigned int sched_nr_migra
>   */
>  static int move_tasks(struct lb_env *env)
>  {
> -	struct list_head *tasks = &env->src_rq->cfs_tasks;
>  	struct task_struct *p;
>  	unsigned long load;
>  	int pulled = 0;
> @@ -3234,8 +3388,9 @@ static int move_tasks(struct lb_env *env
>  	if (env->imbalance <= 0)
>  		return 0;
>  
> -	while (!list_empty(tasks)) {
> -		p = list_first_entry(tasks, struct task_struct, se.group_node);
> +again:
> +	while (!list_empty(env->tasks)) {
> +		p = list_first_entry(env->tasks, struct task_struct, se.group_node);
>  
>  		env->loop++;
>  		/* We've more or less seen every task there is, call it quits */
> @@ -3246,7 +3401,7 @@ static int move_tasks(struct lb_env *env
>  		if (env->loop > env->loop_break) {
>  			env->loop_break += sched_nr_migrate_break;
>  			env->flags |= LBF_NEED_BREAK;
> -			break;
> +			goto out;
>  		}
>  
>  		if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
> @@ -3274,7 +3429,7 @@ static int move_tasks(struct lb_env *env
>  		 * the critical section.
>  		 */
>  		if (env->idle == CPU_NEWLY_IDLE)
> -			break;
> +			goto out;
>  #endif
>  
>  		/*
> @@ -3282,13 +3437,20 @@ static int move_tasks(struct lb_env *env
>  		 * weighted load.
>  		 */
>  		if (env->imbalance <= 0)
> -			break;
> +			goto out;
>  
>  		continue;
>  next:
> -		list_move_tail(&p->se.group_node, tasks);
> +		list_move_tail(&p->se.group_node, env->tasks);
> +	}
> +
> +	if (env->tasks == offnode_tasks(env->src_rq)) {
> +		env->tasks = &env->src_rq->cfs_tasks;
> +		env->loop = 0;
> +		goto again;
>  	}
>  
> +out:
>  	/*
>  	 * Right now, this is one of only two places move_task() is called,
>  	 * so we can safely collect move_task() stats here rather than
> @@ -3407,12 +3569,13 @@ static inline void update_shares(int cpu
>  static inline void update_h_load(long cpu)
>  {
>  }
> -
> +#ifdef CONFIG_SMP
>  static unsigned long task_h_load(struct task_struct *p)
>  {
>  	return p->se.load.weight;
>  }
>  #endif
> +#endif
>  
>  /********** Helpers for find_busiest_group ************************/
>  /*
> @@ -3443,6 +3606,14 @@ struct sd_lb_stats {
>  	unsigned int  busiest_group_weight;
>  
>  	int group_imb; /* Is there imbalance in this sd */
> +#ifdef CONFIG_SCHED_NUMA
> +	struct sched_group *numa_group; /* group which has offnode_tasks */
> +	unsigned long numa_group_weight;
> +	unsigned long numa_group_running;
> +
> +	unsigned long this_offnode_running;
> +	unsigned long this_onnode_running;
> +#endif

So from here is where the actual home-node awareness part kicks in.

>  };
>  
>  /*
> @@ -3458,6 +3629,11 @@ struct sg_lb_stats {
>  	unsigned long group_weight;
>  	int group_imb; /* Is there an imbalance in the group ? */
>  	int group_has_capacity; /* Is there extra capacity in the group? */
> +#ifdef CONFIG_SCHED_NUMA
> +	unsigned long numa_offnode_weight;
> +	unsigned long numa_offnode_running;
> +	unsigned long numa_onnode_running;
> +#endif
>  };
>  
>  /**
> @@ -3486,6 +3662,121 @@ static inline int get_sd_load_idx(struct
>  	return load_idx;
>  }
>  
> +#ifdef CONFIG_SCHED_NUMA
> +static inline void update_sg_numa_stats(struct sg_lb_stats *sgs, struct rq *rq)
> +{
> +	sgs->numa_offnode_weight += rq->offnode_weight;
> +	sgs->numa_offnode_running += rq->offnode_running;
> +	sgs->numa_onnode_running += rq->onnode_running;
> +}
> +
> +/*
> + * Since the offnode lists are indiscriminate (they contain tasks for all other
> + * nodes) it is impossible to say if there's any task on there that wants to
> + * move towards the pulling cpu. Therefore select a random offnode list to pull
> + * from such that eventually we'll try them all.
> + *
> + * Select a random group that has offnode tasks as sds->numa_group
> + */

The comment says we select a random group but this thing returns void.
We're not selecting anything.

> +static inline void update_sd_numa_stats(struct sched_domain *sd,
> +		struct sched_group *group, struct sd_lb_stats *sds,
> +		int local_group, struct sg_lb_stats *sgs)
> +{
> +	if (!(sd->flags & SD_NUMA))
> +		return;
> +
> +	if (local_group) {
> +		sds->this_offnode_running = sgs->numa_offnode_running;
> +		sds->this_onnode_running  = sgs->numa_onnode_running;
> +		return;
> +	}
> +
> +	if (!sgs->numa_offnode_running)
> +		return;
> +
> +	if (!sds->numa_group || pick_numa_rand(sd->span_weight / group->group_weight)) {

What does passing in sd->span_weight / group->group_weight mean?

> +		sds->numa_group = group;
> +		sds->numa_group_weight = sgs->numa_offnode_weight;
> +		sds->numa_group_running = sgs->numa_offnode_running;
> +	}
> +}
> +
> +/*
> + * Pick a random queue from the group that has offnode tasks.
> + */
> +static struct rq *find_busiest_numa_queue(struct lb_env *env,
> +					  struct sched_group *group)
> +{
> +	struct rq *busiest = NULL, *rq;
> +	int cpu;
> +
> +	for_each_cpu_and(cpu, sched_group_cpus(group), env->cpus) {
> +		rq = cpu_rq(cpu);
> +		if (!rq->offnode_running)
> +			continue;
> +		if (!busiest || pick_numa_rand(group->group_weight))
> +			busiest = rq;
> +	}
> +
> +	return busiest;
> +}

So if the random number is 0 or some multiple of group_weight it will be
randomly considered the busiest runqueue. Any idea how often that happens?
I have a suspicion
that the random perturb logic to avoid worst-case scenarios may not be
operating as expected.
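
For what it's worth, here's a standalone userspace simulation of that
selection loop (N_CANDIDATES standing in for group_weight) which makes
it easy to check how even the pick actually is:

	#include <stdio.h>
	#include <stdlib.h>

	#define N_CANDIDATES	8
	#define TRIALS		1000000

	static int pick_numa_rand(int n)
	{
		return !(rand() % n);
	}

	int main(void)
	{
		long hits[N_CANDIDATES] = { 0 };
		int t, i;

		for (t = 0; t < TRIALS; t++) {
			int chosen = -1;

			/* same shape as the loop in find_busiest_numa_queue() */
			for (i = 0; i < N_CANDIDATES; i++) {
				if (chosen < 0 || pick_numa_rand(N_CANDIDATES))
					chosen = i;
			}
			hits[chosen]++;
		}

		for (i = 0; i < N_CANDIDATES; i++)
			printf("candidate %d picked %.3f of the time\n",
			       i, (double)hits[i] / TRIALS);
		return 0;
	}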

> +
> +/*
> + * Called in case of no other imbalance, if there is a queue running offnode
> + * tasksk we'll say we're imbalanced anyway to nudge these tasks towards their
> + * proper node.
> + */
> +static inline int check_numa_busiest_group(struct lb_env *env, struct sd_lb_stats *sds)
> +{
> +	if (!sched_feat(NUMA_PULL_BIAS))
> +		return 0;
> +
> +	if (!sds->numa_group)
> +		return 0;
> +
> +	/*
> +	 * Only pull an offnode task home if we've got offnode or !numa tasks to trade for it.
> +	 */
> +	if (!sds->this_offnode_running &&
> +	    !(sds->this_nr_running - sds->this_onnode_running - sds->this_offnode_running))
> +		return 0;
> +
> +	env->imbalance = sds->numa_group_weight / sds->numa_group_running;
> +	sds->busiest = sds->numa_group;
> +	env->find_busiest_queue = find_busiest_numa_queue;
> +	return 1;
> +}
> +
> +static inline bool need_active_numa_balance(struct lb_env *env)
> +{
> +	return env->find_busiest_queue == find_busiest_numa_queue &&
> +			env->src_rq->offnode_running == 1 &&
> +			env->src_rq->nr_running == 1;
> +}
> +
> +#else /* CONFIG_SCHED_NUMA */
> +
> +static inline void update_sg_numa_stats(struct sg_lb_stats *sgs, struct rq *rq)
> +{
> +}
> +
> +static inline void update_sd_numa_stats(struct sched_domain *sd,
> +		struct sched_group *group, struct sd_lb_stats *sds,
> +		int local_group, struct sg_lb_stats *sgs)
> +{
> +}
> +
> +static inline int check_numa_busiest_group(struct lb_env *env, struct sd_lb_stats *sds)
> +{
> +	return 0;
> +}
> +
> +static inline bool need_active_numa_balance(struct lb_env *env)
> +{
> +	return false;
> +}
> +#endif /* CONFIG_SCHED_NUMA */
> +
>  unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu)
>  {
>  	return SCHED_POWER_SCALE;
> @@ -3701,6 +3992,8 @@ static inline void update_sg_lb_stats(st
>  		sgs->sum_weighted_load += weighted_cpuload(i);
>  		if (idle_cpu(i))
>  			sgs->idle_cpus++;
> +
> +		update_sg_numa_stats(sgs, rq);
>  	}
>  
>  	/*
> @@ -3854,6 +4147,8 @@ static inline void update_sd_lb_stats(st
>  			sds->group_imb = sgs.group_imb;
>  		}
>  
> +		update_sd_numa_stats(env->sd, sg, sds, local_group, &sgs);
> +
>  		sg = sg->next;
>  	} while (sg != env->sd->groups);
>  }
> @@ -4084,7 +4379,7 @@ find_busiest_group(struct lb_env *env, i
>  
>  	/* There is no busy sibling group to pull tasks from */
>  	if (!sds.busiest || sds.busiest_nr_running == 0)
> -		goto out_balanced;
> +		goto ret;
>  
>  	sds.avg_load = (SCHED_POWER_SCALE * sds.total_load) / sds.total_pwr;
>  
> @@ -4106,14 +4401,14 @@ find_busiest_group(struct lb_env *env, i
>  	 * don't try and pull any tasks.
>  	 */
>  	if (sds.this_load >= sds.max_load)
> -		goto out_balanced;
> +		goto ret;
>  
>  	/*
>  	 * Don't pull any tasks if this group is already above the domain
>  	 * average load.
>  	 */
>  	if (sds.this_load >= sds.avg_load)
> -		goto out_balanced;
> +		goto ret;
>  
>  	if (env->idle == CPU_IDLE) {
>  		/*
> @@ -4140,6 +4435,9 @@ force_balance:
>  	return sds.busiest;
>  
>  out_balanced:
> +	if (check_numa_busiest_group(env, &sds))
> +		return sds.busiest;
> +
>  ret:
>  	env->imbalance = 0;
>  	return NULL;
> @@ -4218,6 +4516,9 @@ static int need_active_balance(struct lb
>  			return 1;
>  	}
>  
> +	if (need_active_numa_balance(env))
> +		return 1;
> +
>  	return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
>  }
>  
> @@ -4270,6 +4571,8 @@ redo:
>  		schedstat_inc(sd, lb_nobusyq[idle]);
>  		goto out_balanced;
>  	}
> +	env.src_rq  = busiest;
> +	env.src_cpu = busiest->cpu;
>  
>  	BUG_ON(busiest == env.dst_rq);
>  
> @@ -4288,6 +4591,10 @@ redo:
>  		env.src_cpu   = busiest->cpu;
>  		env.src_rq    = busiest;
>  		env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
> +		if (sched_feat_numa(NUMA_PULL))
> +			env.tasks = offnode_tasks(busiest);
> +		else
> +			env.tasks = &busiest->cfs_tasks;
>  
>  		update_h_load(env.src_cpu);
>  more_balance:
> Index: tip/kernel/sched/features.h
> ===================================================================
> --- tip.orig/kernel/sched/features.h
> +++ tip/kernel/sched/features.h
> @@ -61,3 +61,13 @@ SCHED_FEAT(TTWU_QUEUE, true)
>  SCHED_FEAT(FORCE_SD_OVERLAP, false)
>  SCHED_FEAT(RT_RUNTIME_SHARE, true)
>  SCHED_FEAT(LB_MIN, false)
> +
> +#ifdef CONFIG_SCHED_NUMA
> +SCHED_FEAT(NUMA,           true)
> +SCHED_FEAT(NUMA_HOT,       true)
> +SCHED_FEAT(NUMA_TTWU_BIAS, false)
> +SCHED_FEAT(NUMA_TTWU_TO,   false)
> +SCHED_FEAT(NUMA_PULL,      true)
> +SCHED_FEAT(NUMA_PULL_BIAS, true)
> +#endif

Many of the other SCHED_FEAT flags got a nice explanation.
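
Something like this (my wording, taken from the changelogs) would go a
long way:

	#ifdef CONFIG_SCHED_NUMA
	/* Master switch for automatic NUMA placement */
	SCHED_FEAT(NUMA,           true)
	/* Treat the home node as "cache hot": resist migrating tasks away from it */
	SCHED_FEAT(NUMA_HOT,       true)
	/* Bias select_task_rq_fair() towards the home node (off: too aggressive) */
	SCHED_FEAT(NUMA_TTWU_BIAS, false)
	/* On the pull load-balance pass, try offnode tasks first */
	SCHED_FEAT(NUMA_PULL,      true)
	/* If otherwise balanced, fake an imbalance to nudge offnode tasks home */
	SCHED_FEAT(NUMA_PULL_BIAS, true)
	#endif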

> +
> Index: tip/kernel/sched/sched.h
> ===================================================================
> --- tip.orig/kernel/sched/sched.h
> +++ tip/kernel/sched/sched.h
> @@ -418,6 +418,13 @@ struct rq {
>  
>  	struct list_head cfs_tasks;
>  
> +#ifdef CONFIG_SCHED_NUMA
> +	unsigned long    onnode_running;
> +	unsigned long    offnode_running;
> +	unsigned long	 offnode_weight;
> +	struct list_head offnode_tasks;
> +#endif
> +
>  	u64 rt_avg;
>  	u64 age_stamp;
>  	u64 idle_stamp;
> @@ -469,6 +476,15 @@ struct rq {
>  #endif
>  };
>  
> +static inline struct list_head *offnode_tasks(struct rq *rq)
> +{
> +#ifdef CONFIG_SCHED_NUMA
> +	return &rq->offnode_tasks;
> +#else
> +	return NULL;
> +#endif
> +}
> +
>  static inline int cpu_of(struct rq *rq)
>  {
>  #ifdef CONFIG_SMP
> @@ -529,6 +545,7 @@ static inline struct sched_domain *highe
>  
>  DECLARE_PER_CPU(struct sched_domain *, sd_llc);
>  DECLARE_PER_CPU(int, sd_llc_id);
> +DECLARE_PER_CPU(struct sched_domain *, sd_node);
>  
>  extern int group_balance_cpu(struct sched_group *sg);
>  
> 
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 24/31] sched, numa, mm: Introduce last_nid in the pageframe
  2012-10-25 12:16 ` [PATCH 24/31] sched, numa, mm: Introduce last_nid in the pageframe Peter Zijlstra
@ 2012-11-01 15:17   ` Mel Gorman
  0 siblings, 0 replies; 135+ messages in thread
From: Mel Gorman @ 2012-11-01 15:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, linux-kernel, linux-mm,
	Ingo Molnar

On Thu, Oct 25, 2012 at 02:16:41PM +0200, Peter Zijlstra wrote:
> Introduce a per-page last_nid field, fold this into the struct
> page::flags field whenever possible.
> 

This is used later in the series.

> The unlikely/rare 32bit NUMA configs will likely grow the page-frame.
> 

If someone files a bug report that lowmem pressure is higher on their
32-bit NUMA machine I would be a bit surprised. Maybe abusers of NUMA
care but screw it.

> Completely dropping 32bit support for CONFIG_SCHED_NUMA would simplify
> things, but it would also remove the warning if we grow enough 64bit
> only page-flags to push the last-nid out.
> 
> Suggested-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  include/linux/mm.h                |   90 ++++++++++++++++++++------------------
>  include/linux/mm_types.h          |    5 ++
>  include/linux/mmzone.h            |   14 -----
>  include/linux/page-flags-layout.h |   83 +++++++++++++++++++++++++++++++++++
>  mm/huge_memory.c                  |    1 
>  mm/memory.c                       |    4 +
>  6 files changed, 143 insertions(+), 54 deletions(-)
>  create mode 100644 include/linux/page-flags-layout.h
> 
> Index: tip/include/linux/mm.h
> ===================================================================
> --- tip.orig/include/linux/mm.h
> +++ tip/include/linux/mm.h
> @@ -594,50 +594,11 @@ static inline pte_t maybe_mkwrite(pte_t
>   * sets it, so none of the operations on it need to be atomic.
>   */
>  
> -
> -/*
> - * page->flags layout:
> - *
> - * There are three possibilities for how page->flags get
> - * laid out.  The first is for the normal case, without
> - * sparsemem.  The second is for sparsemem when there is
> - * plenty of space for node and section.  The last is when
> - * we have run out of space and have to fall back to an
> - * alternate (slower) way of determining the node.
> - *
> - * No sparsemem or sparsemem vmemmap: |       NODE     | ZONE | ... | FLAGS |
> - * classic sparse with space for node:| SECTION | NODE | ZONE | ... | FLAGS |
> - * classic sparse no space for node:  | SECTION |     ZONE    | ... | FLAGS |
> - */

The move to page-flags-layout.h should have been a separate patch!
Figuring out what you actually added to the layout is going to be a
complete headache (headache was not the first word I was going to use).

In other words I did not try very hard and I'll just be scanning for
something obvious.

> -#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
> -#define SECTIONS_WIDTH		SECTIONS_SHIFT
> -#else
> -#define SECTIONS_WIDTH		0
> -#endif
> -
> -#define ZONES_WIDTH		ZONES_SHIFT
> -
> -#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
> -#define NODES_WIDTH		NODES_SHIFT
> -#else
> -#ifdef CONFIG_SPARSEMEM_VMEMMAP
> -#error "Vmemmap: No space for nodes field in page flags"
> -#endif
> -#define NODES_WIDTH		0
> -#endif
> -
> -/* Page flags: | [SECTION] | [NODE] | ZONE | ... | FLAGS | */
> +/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_NID] | ... | FLAGS | */
>  #define SECTIONS_PGOFF		((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
>  #define NODES_PGOFF		(SECTIONS_PGOFF - NODES_WIDTH)
>  #define ZONES_PGOFF		(NODES_PGOFF - ZONES_WIDTH)
> -
> -/*
> - * We are going to use the flags for the page to node mapping if its in
> - * there.  This includes the case where there is no node, so it is implicit.
> - */
> -#if !(NODES_WIDTH > 0 || NODES_SHIFT == 0)
> -#define NODE_NOT_IN_PAGE_FLAGS
> -#endif
> +#define LAST_NID_PGOFF		(ZONES_PGOFF - LAST_NID_WIDTH)
>  
>  /*
>   * Define the bit shifts to access each section.  For non-existent
> @@ -647,6 +608,7 @@ static inline pte_t maybe_mkwrite(pte_t
>  #define SECTIONS_PGSHIFT	(SECTIONS_PGOFF * (SECTIONS_WIDTH != 0))
>  #define NODES_PGSHIFT		(NODES_PGOFF * (NODES_WIDTH != 0))
>  #define ZONES_PGSHIFT		(ZONES_PGOFF * (ZONES_WIDTH != 0))
> +#define LAST_NID_PGSHIFT	(LAST_NID_PGOFF * (LAST_NID_WIDTH != 0))
>  

Why is LAST_NID_PGSHIFT != NODES_PGSHIFT?

Oh, it more or less is, but it's hidden. Screw that, get rid of
LAST_NID_SHIFT/LAST_NID_WIDTH and express this entirely in terms of
NODES_PGSHIFT and friends.

>  /* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */
>  #ifdef NODE_NOT_IN_PAGE_FLAGS
> @@ -668,6 +630,7 @@ static inline pte_t maybe_mkwrite(pte_t
>  #define ZONES_MASK		((1UL << ZONES_WIDTH) - 1)
>  #define NODES_MASK		((1UL << NODES_WIDTH) - 1)
>  #define SECTIONS_MASK		((1UL << SECTIONS_WIDTH) - 1)
> +#define LAST_NID_MASK		((1UL << LAST_NID_WIDTH) - 1)
>  #define ZONEID_MASK		((1UL << ZONEID_SHIFT) - 1)
>  
>  static inline enum zone_type page_zonenum(const struct page *page)
> @@ -706,6 +669,51 @@ static inline int page_to_nid(const stru
>  }
>  #endif
>  
> +#ifdef CONFIG_SCHED_NUMA
> +#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
> +static inline int page_xchg_last_nid(struct page *page, int nid)
> +{
> +	return xchg(&page->_last_nid, nid);
> +}
> +
> +static inline int page_last_nid(struct page *page)
> +{
> +	return page->_last_nid;
> +}
> +#else
> +static inline int page_xchg_last_nid(struct page *page, int nid)
> +{
> +	unsigned long old_flags, flags;
> +	int last_nid;
> +
> +	do {
> +		old_flags = flags = page->flags;
> +		last_nid = (flags >> LAST_NID_PGSHIFT) & LAST_NID_MASK;
> +
> +		flags &= ~(LAST_NID_MASK << LAST_NID_PGSHIFT);
> +		flags |= (nid & LAST_NID_MASK) << LAST_NID_PGSHIFT;
> +	} while (unlikely(cmpxchg(&page->flags, old_flags, flags) != old_flags));
> +

This opens a very small window where this function messes up the flags
and fixes them up very shortly afterwards. For most page flags it will
not matter, but it potentially causes weirdness with the extended page
flags - or are we protected by something else?

> +	return last_nid;
> +}
> +
> +static inline int page_last_nid(struct page *page)
> +{
> +	return (page->flags >> LAST_NID_PGSHIFT) & LAST_NID_MASK;
> +}
> +#endif /* LAST_NID_NOT_IN_PAGE_FLAGS */
> +#else /* CONFIG_SCHED_NUMA */
> +static inline int page_xchg_last_nid(struct page *page, int nid)
> +{
> +	return page_to_nid(page);
> +}
> +
> +static inline int page_last_nid(struct page *page)
> +{
> +	return page_to_nid(page);
> +}
> +#endif /* CONFIG_SCHED_NUMA */
> +
>  static inline struct zone *page_zone(const struct page *page)
>  {
>  	return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
> Index: tip/include/linux/mm_types.h
> ===================================================================
> --- tip.orig/include/linux/mm_types.h
> +++ tip/include/linux/mm_types.h
> @@ -12,6 +12,7 @@
>  #include <linux/cpumask.h>
>  #include <linux/page-debug-flags.h>
>  #include <linux/uprobes.h>
> +#include <linux/page-flags-layout.h>
>  #include <asm/page.h>
>  #include <asm/mmu.h>
>  
> @@ -175,6 +176,10 @@ struct page {
>  	 */
>  	void *shadow;
>  #endif
> +
> +#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
> +	int _last_nid;
> +#endif
>  }
>  /*
>   * The struct page can be forced to be double word aligned so that atomic ops
> Index: tip/include/linux/mmzone.h
> ===================================================================
> --- tip.orig/include/linux/mmzone.h
> +++ tip/include/linux/mmzone.h
> @@ -15,7 +15,7 @@
>  #include <linux/seqlock.h>
>  #include <linux/nodemask.h>
>  #include <linux/pageblock-flags.h>
> -#include <generated/bounds.h>
> +#include <linux/page-flags-layout.h>
>  #include <linux/atomic.h>
>  #include <asm/page.h>
>  
> @@ -317,16 +317,6 @@ enum zone_type {
>   * match the requested limits. See gfp_zone() in include/linux/gfp.h
>   */
>  
> -#if MAX_NR_ZONES < 2
> -#define ZONES_SHIFT 0
> -#elif MAX_NR_ZONES <= 2
> -#define ZONES_SHIFT 1
> -#elif MAX_NR_ZONES <= 4
> -#define ZONES_SHIFT 2
> -#else
> -#error ZONES_SHIFT -- too many zones configured adjust calculation
> -#endif
> -
>  struct zone {
>  	/* Fields commonly accessed by the page allocator */
>  
> @@ -1029,8 +1019,6 @@ static inline unsigned long early_pfn_to
>   * PA_SECTION_SHIFT		physical address to/from section number
>   * PFN_SECTION_SHIFT		pfn to/from section number
>   */
> -#define SECTIONS_SHIFT		(MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
> -
>  #define PA_SECTION_SHIFT	(SECTION_SIZE_BITS)
>  #define PFN_SECTION_SHIFT	(SECTION_SIZE_BITS - PAGE_SHIFT)
>  
> Index: tip/include/linux/page-flags-layout.h
> ===================================================================
> --- /dev/null
> +++ tip/include/linux/page-flags-layout.h
> @@ -0,0 +1,83 @@
> +#ifndef _LINUX_PAGE_FLAGS_LAYOUT
> +#define _LINUX_PAGE_FLAGS_LAYOUT
> +
> +#include <linux/numa.h>
> +#include <generated/bounds.h>
> +
> +#if MAX_NR_ZONES < 2
> +#define ZONES_SHIFT 0
> +#elif MAX_NR_ZONES <= 2
> +#define ZONES_SHIFT 1
> +#elif MAX_NR_ZONES <= 4
> +#define ZONES_SHIFT 2
> +#else
> +#error ZONES_SHIFT -- too many zones configured adjust calculation
> +#endif
> +
> +#ifdef CONFIG_SPARSEMEM
> +#include <asm/sparsemem.h>
> +
> +/* 
> + * SECTION_SHIFT    		#bits space required to store a section #
> + */
> +#define SECTIONS_SHIFT         (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
> +#endif
> +
> +/*
> + * page->flags layout:
> + *
> + * There are five possibilities for how page->flags get laid out.  The first
> + * (and second) is for the normal case, without sparsemem. The third is for
> + * sparsemem when there is plenty of space for node and section. The last is
> + * when we have run out of space and have to fall back to an alternate (slower)
> + * way of determining the node.
> + *
> + * No sparsemem or sparsemem vmemmap: |       NODE     | ZONE |            ... | FLAGS |
> + *     "      plus space for last_nid:|       NODE     | ZONE | LAST_NID | ... | FLAGS |
> + * classic sparse with space for node:| SECTION | NODE | ZONE |            ... | FLAGS |
> + *     "      plus space for last_nid:| SECTION | NODE | ZONE | LAST_NID | ... | FLAGS |
> + * classic sparse no space for node:  | SECTION |     ZONE    |            ... | FLAGS |
> + */
> +#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
> +
> +#define SECTIONS_WIDTH		SECTIONS_SHIFT
> +#else
> +#define SECTIONS_WIDTH		0
> +#endif
> +
> +#define ZONES_WIDTH		ZONES_SHIFT
> +
> +#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
> +#define NODES_WIDTH		NODES_SHIFT
> +#else
> +#ifdef CONFIG_SPARSEMEM_VMEMMAP
> +#error "Vmemmap: No space for nodes field in page flags"
> +#endif
> +#define NODES_WIDTH		0
> +#endif
> +
> +#ifdef CONFIG_SCHED_NUMA
> +#define LAST_NID_SHIFT	NODES_SHIFT
> +#else
> +#define LAST_NID_SHIFT	0
> +#endif
> +
> +#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_NID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
> +#define LAST_NID_WIDTH	LAST_NID_SHIFT
> +#else
> +#define LAST_NID_WIDTH	0
> +#endif
> +
> +/*
> + * We are going to use the flags for the page to node mapping if it's in
> + * there.  This includes the case where there is no node, so it is implicit.
> + */
> +#if !(NODES_WIDTH > 0 || NODES_SHIFT == 0)
> +#define NODE_NOT_IN_PAGE_FLAGS
> +#endif
> +
> +#if defined(CONFIG_SCHED_NUMA) && LAST_NID_WIDTH == 0
> +#define LAST_NID_NOT_IN_PAGE_FLAGS
> +#endif
> +
> +#endif /* _LINUX_PAGE_FLAGS_LAYOUT */
> Index: tip/mm/huge_memory.c
> ===================================================================
> --- tip.orig/mm/huge_memory.c
> +++ tip/mm/huge_memory.c
> @@ -1440,6 +1440,7 @@ static void __split_huge_page_refcount(s
>  		page_tail->mapping = page->mapping;
>  
>  		page_tail->index = page->index + i;
> +		page_xchg_last_nid(page, page_last_nid(page_tail));
>  
>  		BUG_ON(!PageAnon(page_tail));
>  		BUG_ON(!PageUptodate(page_tail));
> Index: tip/mm/memory.c
> ===================================================================
> --- tip.orig/mm/memory.c
> +++ tip/mm/memory.c
> @@ -68,6 +68,10 @@
>  
>  #include "internal.h"
>  
> +#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
> +#warning Unfortunate NUMA config, growing page-frame for last_nid.
> +#endif
> +
>  #ifndef CONFIG_NEED_MULTIPLE_NODES
>  /* use the per-pgdat data instead for discontigmem - mbligh */
>  unsigned long max_mapnr;
> 
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 26/31] sched, numa, mm: Add fault driven placement and migration policy
  2012-10-25 12:16 ` [PATCH 26/31] sched, numa, mm: Add fault driven placement and migration policy Peter Zijlstra
  2012-10-25 20:53   ` Linus Torvalds
  2012-10-30 19:23   ` Rik van Riel
@ 2012-11-01 15:40   ` Mel Gorman
  2 siblings, 0 replies; 135+ messages in thread
From: Mel Gorman @ 2012-11-01 15:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, linux-kernel, linux-mm,
	Ingo Molnar

On Thu, Oct 25, 2012 at 02:16:43PM +0200, Peter Zijlstra wrote:
> As per the problem/design document Documentation/scheduler/numa-problem.txt
> implement 3ac & 4.
> 
> ( A pure 3a was found too unstable, I did briefly try 3bc
>   but found no significant improvement. )
> 
> Implement a per-task memory placement scheme relying on a regular
> PROT_NONE 'migration' fault to scan the memory space of the process
> and uses a two stage migration scheme to reduce the influence of
> unlikely usage relations.
> 
> It relies on the assumption that the compute part is tied to a
> particular task and builds a task<->page relation set to model the
> compute<->data relation.
> 
> In the previous patch we made memory migrate towards where the task
> is running, here we select the node on which most memory is located
> as the preferred node to run on.
> 
> This creates a feed-back control loop between trying to schedule a
> task on a node and migrating memory towards the node the task is
> scheduled on. 
> 

Ok.

> Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
> Suggested-by: Rik van Riel <riel@redhat.com>
> Fixes-by: David Rientjes <rientjes@google.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  include/linux/mm_types.h |    4 +
>  include/linux/sched.h    |   35 +++++++--
>  kernel/sched/core.c      |   16 ++++
>  kernel/sched/fair.c      |  175 +++++++++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/features.h  |    1 
>  kernel/sched/sched.h     |   31 +++++---
>  kernel/sysctl.c          |   31 +++++++-
>  mm/huge_memory.c         |    7 +
>  mm/memory.c              |    4 -
>  9 files changed, 282 insertions(+), 22 deletions(-)
> Index: tip/include/linux/mm_types.h
> ===================================================================
> --- tip.orig/include/linux/mm_types.h
> +++ tip/include/linux/mm_types.h
> @@ -403,6 +403,10 @@ struct mm_struct {
>  #ifdef CONFIG_CPUMASK_OFFSTACK
>  	struct cpumask cpumask_allocation;
>  #endif
> +#ifdef CONFIG_SCHED_NUMA
> +	unsigned long numa_next_scan;

comment. 

> +	int numa_scan_seq;

comment! at least the other one is easy to guess. This thing looks like
it's preventing multiple threads in a process space scanning and
updating PTEs at the same time. Effectively it's a type of barrier but
without a comment I'm not sure if what it's doing is what you expect it
to be doing or something else entirely.
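
As a sketch of the sort of comment I mean, inferred from how the fields are
used later in the patch (so the wording is my guess, not a statement of
intent):

	unsigned long numa_next_scan;	/* earliest jiffies at which the next
					 * hinting-fault scan of this mm may run */
	int numa_scan_seq;		/* number of scan passes so far; tasks keep
					 * a private copy so placement is only
					 * recomputed once per pass */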

> +#endif
>  	struct uprobes_state uprobes_state;
>  };
>  
> Index: tip/include/linux/sched.h
> ===================================================================
> --- tip.orig/include/linux/sched.h
> +++ tip/include/linux/sched.h
> @@ -1481,9 +1481,16 @@ struct task_struct {
>  	short pref_node_fork;
>  #endif
>  #ifdef CONFIG_SCHED_NUMA
> -	int node;
> +	int node;			/* task home node   */
> +	int numa_scan_seq;
> +	int numa_migrate_seq;
> +	unsigned int numa_scan_period;
> +	u64 node_stamp;			/* migration stamp  */
>  	unsigned long numa_contrib;
> -#endif
> +	unsigned long *numa_faults;
> +	struct callback_head numa_work;
> +#endif /* CONFIG_SCHED_NUMA */
> +
>  	struct rcu_head rcu;
>  
>  	/*
> @@ -1558,15 +1565,24 @@ struct task_struct {
>  /* Future-safe accessor for struct task_struct's cpus_allowed. */
>  #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
>  
> +#ifdef CONFIG_SCHED_NUMA
>  static inline int tsk_home_node(struct task_struct *p)
>  {
> -#ifdef CONFIG_SCHED_NUMA
>  	return p->node;
> +}
> +
> +extern void task_numa_fault(int node, int pages);
>  #else
> +static inline int tsk_home_node(struct task_struct *p)
> +{
>  	return -1;
> -#endif
>  }
>  
> +static inline void task_numa_fault(int node, int pages)
> +{
> +}
> +#endif /* CONFIG_SCHED_NUMA */
> +
>  /*
>   * Priority of a process goes from 0..MAX_PRIO-1, valid RT
>   * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
> @@ -2004,6 +2020,10 @@ enum sched_tunable_scaling {
>  };
>  extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
>  
> +extern unsigned int sysctl_sched_numa_scan_period_min;
> +extern unsigned int sysctl_sched_numa_scan_period_max;
> +extern unsigned int sysctl_sched_numa_settle_count;
> +
>  #ifdef CONFIG_SCHED_DEBUG
>  extern unsigned int sysctl_sched_migration_cost;
>  extern unsigned int sysctl_sched_nr_migrate;
> @@ -2014,18 +2034,17 @@ extern unsigned int sysctl_sched_shares_
>  int sched_proc_update_handler(struct ctl_table *table, int write,
>  		void __user *buffer, size_t *length,
>  		loff_t *ppos);
> -#endif
> -#ifdef CONFIG_SCHED_DEBUG
> +
>  static inline unsigned int get_sysctl_timer_migration(void)
>  {
>  	return sysctl_timer_migration;
>  }
> -#else
> +#else /* CONFIG_SCHED_DEBUG */
>  static inline unsigned int get_sysctl_timer_migration(void)
>  {
>  	return 1;
>  }
> -#endif
> +#endif /* CONFIG_SCHED_DEBUG */
>  extern unsigned int sysctl_sched_rt_period;
>  extern int sysctl_sched_rt_runtime;
>  
> Index: tip/kernel/sched/core.c
> ===================================================================
> --- tip.orig/kernel/sched/core.c
> +++ tip/kernel/sched/core.c
> @@ -1533,6 +1533,21 @@ static void __sched_fork(struct task_str
>  #ifdef CONFIG_PREEMPT_NOTIFIERS
>  	INIT_HLIST_HEAD(&p->preempt_notifiers);
>  #endif
> +
> +#ifdef CONFIG_SCHED_NUMA
> +	if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
> +		p->mm->numa_next_scan = jiffies;
> +		p->mm->numa_scan_seq = 0;
> +	}
> +
> +	p->node = -1;
> +	p->node_stamp = 0ULL;
> +	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
> +	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
> +	p->numa_faults = NULL;
> +	p->numa_scan_period = sysctl_sched_numa_scan_period_min;
> +	p->numa_work.next = &p->numa_work;
> +#endif /* CONFIG_SCHED_NUMA */
>  }
>  
>  /*
> @@ -1774,6 +1789,7 @@ static void finish_task_switch(struct rq
>  	if (mm)
>  		mmdrop(mm);
>  	if (unlikely(prev_state == TASK_DEAD)) {
> +		task_numa_free(prev);
>  		/*
>  		 * Remove function-return probe instances associated with this
>  		 * task and put them back on the free list.
> Index: tip/kernel/sched/fair.c
> ===================================================================
> --- tip.orig/kernel/sched/fair.c
> +++ tip/kernel/sched/fair.c
> @@ -27,6 +27,8 @@
>  #include <linux/profile.h>
>  #include <linux/interrupt.h>
>  #include <linux/random.h>
> +#include <linux/mempolicy.h>
> +#include <linux/task_work.h>
>  
>  #include <trace/events/sched.h>
>  
> @@ -775,6 +777,21 @@ update_stats_curr_start(struct cfs_rq *c
>  
>  /**************************************************
>   * Scheduling class numa methods.
> + *
> + * The purpose of the NUMA bits is to maintain compute (task) and data
> + * (memory) locality. We try and achieve this by making tasks stick to
> + * a particular node (their home node), but if fairness mandates they run
> + * elsewhere for long enough, we let the memory follow them.
> + *
> + * Tasks start out with their home-node unset (-1); this effectively means
> + * they act !NUMA until we've established the task is busy enough to bother
> + * with placement.
> + *
> + * We keep a home-node per task and use periodic fault scans to try and
> + * establish a task<->page relation. This assumes the task<->page relation is
> + * a compute<->data relation; this is false for things like virt. and n:m
> + * threading solutions, but it's the best we can do given the information we
> + * have.
>   */
>  
>  #ifdef CONFIG_SMP
> @@ -805,6 +822,157 @@ static void account_numa_dequeue(struct
>  	} else
>  		rq->onnode_running--;
>  }
> +
> +/*
> + * numa task sample period in ms: 5s
> + */
> +unsigned int sysctl_sched_numa_scan_period_min = 5000;
> +unsigned int sysctl_sched_numa_scan_period_max = 5000*16;
> +
> +/*
> + * Wait for the 2-sample stuff to settle before migrating again
> + */
> +unsigned int sysctl_sched_numa_settle_count = 2;
> +
> +static void task_numa_placement(struct task_struct *p)
> +{
> +	unsigned long faults, max_faults = 0;
> +	int node, max_node = -1;
> +	int seq = ACCESS_ONCE(p->mm->numa_scan_seq);
> +
> +	if (p->numa_scan_seq == seq)
> +		return;
> +
> +	p->numa_scan_seq = seq;
> +
> +	for (node = 0; node < nr_node_ids; node++) {
> +		faults = p->numa_faults[node];
> +
> +		if (faults > max_faults) {
> +			max_faults = faults;
> +			max_node = node;
> +		}
> +
> +		p->numa_faults[node] /= 2;
> +	}

No comments explaining the logic behind the decaying average. It can be
inferred if someone reads Documentation/scheduler/numa-problem.txt and
point 3c carefully enough. At the very least point them at it.
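
For anyone trying to follow the arithmetic, here is a standalone toy program
(not kernel code, the numbers are invented) showing what the halving does:
with a steady 100 faults recorded per scan pass, the per-node count converges
towards 200, i.e. an exponential moving average in which recent passes
dominate.

	#include <stdio.h>

	int main(void)
	{
		unsigned long faults = 0;
		int pass;

		for (pass = 0; pass < 6; pass++) {
			faults += 100;		/* faults recorded this pass */
			printf("pass %d: %lu\n", pass, faults);
			faults /= 2;		/* the decay step above */
		}
		return 0;
	}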


> +
> +	if (max_node == -1)
> +		return;
> +
> +	if (p->node != max_node) {
> +		p->numa_scan_period = sysctl_sched_numa_scan_period_min;
> +		if (sched_feat(NUMA_SETTLE) &&
> +		    (seq - p->numa_migrate_seq) <= (int)sysctl_sched_numa_settle_count)
> +			return;
> +		p->numa_migrate_seq = seq;
> +		sched_setnode(p, max_node);

Ok, so at a guess even if we do ping-pong it will only take effect every
10 seconds which could be far worse.

> +	} else {
> +		p->numa_scan_period = min(sysctl_sched_numa_scan_period_max,
> +				p->numa_scan_period * 2);
> +	}
> +}
> +
> +/*
> + * Got a PROT_NONE fault for a page on @node.
> + */
> +void task_numa_fault(int node, int pages)
> +{
> +	struct task_struct *p = current;
> +
> +	if (unlikely(!p->numa_faults)) {
> +		int size = sizeof(unsigned long) * nr_node_ids;
> +
> +		p->numa_faults = kzalloc(size, GFP_KERNEL);
> +		if (!p->numa_faults)
> +			return;
> +	}
> +

On a maximally configured machine this will be an order-4 allocation and
you need at least 512 nodes before it's an order-1 allocation. As unlikely
as it is, should this be GFP_NOWARN?
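
i.e. something like this (sketch only):

	p->numa_faults = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
	if (!p->numa_faults)
		return;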

> +	task_numa_placement(p);
> +
> +	p->numa_faults[node] += pages;
> +}
> +
> +/*
> + * The expensive part of numa migration is done from task_work context.
> + * Triggered from task_tick_numa().
> + */
> +void task_numa_work(struct callback_head *work)
> +{
> +	unsigned long migrate, next_scan, now = jiffies;
> +	struct task_struct *p = current;
> +	struct mm_struct *mm = p->mm;
> +
> +	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
> +
> +	work->next = work; /* protect against double add */
> +	/*
> +	 * Who cares about NUMA placement when they're dying.
> +	 *
> +	 * NOTE: make sure not to dereference p->mm before this check,
> +	 * exit_task_work() happens _after_ exit_mm() so we could be called
> +	 * without p->mm even though we still had it when we enqueued this
> +	 * work.
> +	 */
> +	if (p->flags & PF_EXITING)
> +		return;
> +
> +	/*
> +	 * Enforce maximal scan/migration frequency..
> +	 */
> +	migrate = mm->numa_next_scan;
> +	if (time_before(now, migrate))
> +		return;
> +
> +	next_scan = now + 2*msecs_to_jiffies(sysctl_sched_numa_scan_period_min);
> +	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
> +		return;
> +
> +	ACCESS_ONCE(mm->numa_scan_seq)++;
> +	{
> +		struct vm_area_struct *vma;
> +
> +		down_write(&mm->mmap_sem);
> +		for (vma = mm->mmap; vma; vma = vma->vm_next) {
> +			if (!vma_migratable(vma))
> +				continue;
> +			change_protection(vma, vma->vm_start, vma->vm_end, vma_prot_none(vma), 0);
> +		}
> +		up_write(&mm->mmap_sem);
> +	}
> +}

Ok, I like the idea of the scanning cost being incurred by the process.
I was going to complain though that for very large processes the length of
time it takes to complete this scan could be considerable.  However,
a quick glance forward indicates that you cope with this problem later by
limiting how much is scanned each time.

> +
> +/*
> + * Drive the periodic memory faults..
> + */
> +void task_tick_numa(struct rq *rq, struct task_struct *curr)
> +{
> +	struct callback_head *work = &curr->numa_work;
> +	u64 period, now;
> +
> +	/*
> +	 * We don't care about NUMA placement if we don't have memory.
> +	 */
> +	if (!curr->mm || (curr->flags & PF_EXITING) || work->next != work)
> +		return;
> +
> +	/*
> +	 * Using runtime rather than walltime has the dual advantage that
> +	 * we (mostly) drive the selection from busy threads and that the
> +	 * task needs to have done some actual work before we bother with
> +	 * NUMA placement.
> +	 */

Makes sense.

> +	now = curr->se.sum_exec_runtime;
> +	period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
> +
> +	if (now - curr->node_stamp > period) {
> +		curr->node_stamp = now;
> +
> +		if (!time_before(jiffies, curr->mm->numa_next_scan)) {
> +			init_task_work(work, task_numa_work); /* TODO: move this into sched_fork() */
> +			task_work_add(curr, work, true);
> +		}
> +	}
> +}
>  #else
>  #ifdef CONFIG_SMP
>  static struct list_head *account_numa_enqueue(struct rq *rq, struct task_struct *p)
> @@ -816,6 +984,10 @@ static struct list_head *account_numa_en
>  static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
>  {
>  }
> +
> +static void task_tick_numa(struct rq *rq, struct task_struct *curr)
> +{
> +}
>  #endif /* CONFIG_SCHED_NUMA */
>  
>  /**************************************************
> @@ -5265,6 +5437,9 @@ static void task_tick_fair(struct rq *rq
>  		cfs_rq = cfs_rq_of(se);
>  		entity_tick(cfs_rq, se, queued);
>  	}
> +
> +	if (sched_feat_numa(NUMA))
> +		task_tick_numa(rq, curr);
>  }
>  
>  /*
> Index: tip/kernel/sched/features.h
> ===================================================================
> --- tip.orig/kernel/sched/features.h
> +++ tip/kernel/sched/features.h
> @@ -69,5 +69,6 @@ SCHED_FEAT(NUMA_TTWU_BIAS, false)
>  SCHED_FEAT(NUMA_TTWU_TO,   false)
>  SCHED_FEAT(NUMA_PULL,      true)
>  SCHED_FEAT(NUMA_PULL_BIAS, true)
> +SCHED_FEAT(NUMA_SETTLE,    true)
>  #endif
>  
> Index: tip/kernel/sched/sched.h
> ===================================================================
> --- tip.orig/kernel/sched/sched.h
> +++ tip/kernel/sched/sched.h
> @@ -3,6 +3,7 @@
>  #include <linux/mutex.h>
>  #include <linux/spinlock.h>
>  #include <linux/stop_machine.h>
> +#include <linux/slab.h>
>  
>  #include "cpupri.h"
>  
> @@ -476,15 +477,6 @@ struct rq {
>  #endif
>  };
>  
> -static inline struct list_head *offnode_tasks(struct rq *rq)
> -{
> -#ifdef CONFIG_SCHED_NUMA
> -	return &rq->offnode_tasks;
> -#else
> -	return NULL;
> -#endif
> -}
> -
>  static inline int cpu_of(struct rq *rq)
>  {
>  #ifdef CONFIG_SMP
> @@ -502,6 +494,27 @@ DECLARE_PER_CPU(struct rq, runqueues);
>  #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
>  #define raw_rq()		(&__raw_get_cpu_var(runqueues))
>  
> +#ifdef CONFIG_SCHED_NUMA
> +static inline struct list_head *offnode_tasks(struct rq *rq)
> +{
> +	return &rq->offnode_tasks;
> +}
> +
> +static inline void task_numa_free(struct task_struct *p)
> +{
> +	kfree(p->numa_faults);
> +}
> +#else /* CONFIG_SCHED_NUMA */
> +static inline struct list_head *offnode_tasks(struct rq *rq)
> +{
> +	return NULL;
> +}
> +
> +static inline void task_numa_free(struct task_struct *p)
> +{
> +}
> +#endif /* CONFIG_SCHED_NUMA */
> +
>  #ifdef CONFIG_SMP
>  
>  #define rcu_dereference_check_sched_domain(p) \
> Index: tip/kernel/sysctl.c
> ===================================================================
> --- tip.orig/kernel/sysctl.c
> +++ tip/kernel/sysctl.c
> @@ -256,9 +256,11 @@ static int min_sched_granularity_ns = 10
>  static int max_sched_granularity_ns = NSEC_PER_SEC;	/* 1 second */
>  static int min_wakeup_granularity_ns;			/* 0 usecs */
>  static int max_wakeup_granularity_ns = NSEC_PER_SEC;	/* 1 second */
> +#ifdef CONFIG_SMP
>  static int min_sched_tunable_scaling = SCHED_TUNABLESCALING_NONE;
>  static int max_sched_tunable_scaling = SCHED_TUNABLESCALING_END-1;
> -#endif
> +#endif /* CONFIG_SMP */
> +#endif /* CONFIG_SCHED_DEBUG */
>  
>  #ifdef CONFIG_COMPACTION
>  static int min_extfrag_threshold;
> @@ -301,6 +303,7 @@ static struct ctl_table kern_table[] = {
>  		.extra1		= &min_wakeup_granularity_ns,
>  		.extra2		= &max_wakeup_granularity_ns,
>  	},
> +#ifdef CONFIG_SMP
>  	{
>  		.procname	= "sched_tunable_scaling",
>  		.data		= &sysctl_sched_tunable_scaling,
> @@ -347,7 +350,31 @@ static struct ctl_table kern_table[] = {
>  		.extra1		= &zero,
>  		.extra2		= &one,
>  	},
> -#endif
> +#endif /* CONFIG_SMP */
> +#ifdef CONFIG_SCHED_NUMA
> +	{
> +		.procname	= "sched_numa_scan_period_min_ms",
> +		.data		= &sysctl_sched_numa_scan_period_min,
> +		.maxlen		= sizeof(unsigned int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +	},
> +	{
> +		.procname	= "sched_numa_scan_period_max_ms",
> +		.data		= &sysctl_sched_numa_scan_period_max,
> +		.maxlen		= sizeof(unsigned int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +	},
> +	{
> +		.procname	= "sched_numa_settle_count",
> +		.data		= &sysctl_sched_numa_settle_count,
> +		.maxlen		= sizeof(unsigned int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +	},
> +#endif /* CONFIG_SCHED_NUMA */
> +#endif /* CONFIG_SCHED_DEBUG */
>  	{
>  		.procname	= "sched_rt_period_us",
>  		.data		= &sysctl_sched_rt_period,
> Index: tip/mm/huge_memory.c
> ===================================================================
> --- tip.orig/mm/huge_memory.c
> +++ tip/mm/huge_memory.c
> @@ -774,9 +774,10 @@ fixup:
>  
>  unlock:
>  	spin_unlock(&mm->page_table_lock);
> -	if (page)
> +	if (page) {
> +		task_numa_fault(page_to_nid(page), HPAGE_PMD_NR);
>  		put_page(page);
> -
> +	}
>  	return;
>  
>  migrate:
> @@ -845,6 +846,8 @@ migrate:
>  
>  	put_page(page);			/* Drop the rmap reference */
>  
> +	task_numa_fault(node, HPAGE_PMD_NR);
> +
>  	if (lru)
>  		put_page(page);		/* drop the LRU isolation reference */
>  
> Index: tip/mm/memory.c
> ===================================================================
> --- tip.orig/mm/memory.c
> +++ tip/mm/memory.c
> @@ -3512,8 +3512,10 @@ out_pte_upgrade_unlock:
>  out_unlock:
>  	pte_unmap_unlock(ptep, ptl);
>  out:
> -	if (page)
> +	if (page) {
> +		task_numa_fault(page_nid, 1);
>  		put_page(page);
> +	}
>  
>  	return 0;
>  
> 
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 28/31] sched, numa, mm: Implement constant, per task Working Set Sampling (WSS) rate
  2012-10-25 12:16 ` [PATCH 28/31] sched, numa, mm: Implement constant, per task Working Set Sampling (WSS) rate Peter Zijlstra
@ 2012-11-01 15:48   ` Mel Gorman
  0 siblings, 0 replies; 135+ messages in thread
From: Mel Gorman @ 2012-11-01 15:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, linux-kernel, linux-mm,
	Ingo Molnar

On Thu, Oct 25, 2012 at 02:16:45PM +0200, Peter Zijlstra wrote:
> Previously, to probe the working set of a task, we'd use
> a very simple and crude method: mark all of its address
> space PROT_NONE.
> 
> That method has various (obvious) disadvantages:
> 
>  - it samples the working set at dissimilar rates,
>    giving some tasks a sampling quality advantage
>    over others.
> 
>  - creates performance problems for tasks with very
>    large working sets
> 
>  - over-samples processes with large address spaces but
>    which only very rarely execute
> 
> Improve that method by keeping a rotating offset into the
> address space that marks the current position of the scan,
> and advance it by a constant rate (in a CPU cycles execution
> proportional manner). If the offset reaches the last mapped
> address of the mm then it starts over at the first
> address.
> 
> The per-task nature of the working set sampling functionality
> in this tree allows such constant rate, per task,
> execution-weight proportional sampling of the working set,
> with an adaptive sampling interval/frequency that goes from
> once per 100 msecs up to just once per 1.6 seconds.
> The current sampling volume is 256 MB per interval.
> 
> As tasks mature and converge their working set, so does the
> sampling rate slow down to just a trickle, 256 MB per 1.6
> seconds of CPU time executed.
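
(For reference, taking the numbers above at face value: at the 100 msec floor
that is 256 MB per 0.1 s, i.e. up to roughly 2.5 GB of address space scanned
per second of CPU time, while at the 1.6 s ceiling it is the 256 MB per 1.6 s
trickle, about 160 MB per second of CPU time.)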
> 
> This, beyond being adaptive, also rate-limits rarely
> executing systems and does not over-sample on overloaded
> systems.
> 
> [ In AutoNUMA speak, this patch deals with the effective sampling
>   rate of the 'hinting page fault'. AutoNUMA's scanning is
>   currently rate-limited, but it is also fundamentally
>   single-threaded, executing in the knuma_scand kernel thread,
>   so the limit in AutoNUMA is global and does not scale up with
>   the number of CPUs, nor does it scan tasks in an execution
>   proportional manner.
> 
>   So the idea of rate-limiting the scanning was first implemented
>   in the AutoNUMA tree via a global rate limit. This patch goes
>   beyond that by implementing an execution rate proportional
>   working set sampling rate that is not implemented via a single
>   global scanning daemon. ]
> 
> [ Dan Carpenter pointed out a possible NULL pointer dereference in the
>   first version of this patch. ]
> 
> Based-on-idea-by: Andrea Arcangeli <aarcange@redhat.com>
> Bug-Found-By: Dan Carpenter <dan.carpenter@oracle.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Rik van Riel <riel@redhat.com>
> [ Wrote changelog and fixed bug. ]
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  include/linux/mm_types.h |    1 +
>  include/linux/sched.h    |    1 +
>  kernel/sched/fair.c      |   43 ++++++++++++++++++++++++++++++-------------
>  kernel/sysctl.c          |    7 +++++++
>  4 files changed, 39 insertions(+), 13 deletions(-)
> 
> Index: tip/include/linux/mm_types.h
> ===================================================================
> --- tip.orig/include/linux/mm_types.h
> +++ tip/include/linux/mm_types.h
> @@ -405,6 +405,7 @@ struct mm_struct {
>  #endif
>  #ifdef CONFIG_SCHED_NUMA
>  	unsigned long numa_next_scan;
> +	unsigned long numa_scan_offset;
>  	int numa_scan_seq;
>  #endif
>  	struct uprobes_state uprobes_state;
> Index: tip/include/linux/sched.h
> ===================================================================
> --- tip.orig/include/linux/sched.h
> +++ tip/include/linux/sched.h
> @@ -2022,6 +2022,7 @@ extern enum sched_tunable_scaling sysctl
>  
>  extern unsigned int sysctl_sched_numa_scan_period_min;
>  extern unsigned int sysctl_sched_numa_scan_period_max;
> +extern unsigned int sysctl_sched_numa_scan_size;
>  extern unsigned int sysctl_sched_numa_settle_count;
>  
>  #ifdef CONFIG_SCHED_DEBUG
> Index: tip/kernel/sched/fair.c
> ===================================================================
> --- tip.orig/kernel/sched/fair.c
> +++ tip/kernel/sched/fair.c
> @@ -829,8 +829,9 @@ static void account_numa_dequeue(struct
>  /*
>   * numa task sample period in ms: 5s
>   */
> -unsigned int sysctl_sched_numa_scan_period_min = 5000;
> -unsigned int sysctl_sched_numa_scan_period_max = 5000*16;
> +unsigned int sysctl_sched_numa_scan_period_min = 100;
> +unsigned int sysctl_sched_numa_scan_period_max = 100*16;
> +unsigned int sysctl_sched_numa_scan_size = 256;   /* MB */
>  
>  /*
>   * Wait for the 2-sample stuff to settle before migrating again
> @@ -904,6 +905,9 @@ void task_numa_work(struct callback_head
>  	unsigned long migrate, next_scan, now = jiffies;
>  	struct task_struct *p = current;
>  	struct mm_struct *mm = p->mm;
> +	struct vm_area_struct *vma;
> +	unsigned long offset, end;
> +	long length;
>  
>  	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
>  
> @@ -930,18 +934,31 @@ void task_numa_work(struct callback_head
>  	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
>  		return;
>  
> -	ACCESS_ONCE(mm->numa_scan_seq)++;
> -	{
> -		struct vm_area_struct *vma;
> -
> -		down_write(&mm->mmap_sem);
> -		for (vma = mm->mmap; vma; vma = vma->vm_next) {
> -			if (!vma_migratable(vma))
> -				continue;
> -			change_protection(vma, vma->vm_start, vma->vm_end, vma_prot_none(vma), 0);
> -		}
> -		up_write(&mm->mmap_sem);
> +	offset = mm->numa_scan_offset;
> +	length = sysctl_sched_numa_scan_size;
> +	length <<= 20;
> +
> +	down_write(&mm->mmap_sem);

I should have spotted this during the last patch but we have to take
mmap_sem for write?!? Why? Parallel mmap and fault performance is
potentially mutilated by this depending on how often this task_numa_work
thing is running.
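
To illustrate what I mean (an untested sketch, not something I have measured):
the scan only tightens the protection on already-mapped PTEs and never touches
the vma attributes themselves, so as far as I can see mmap_sem held for read
should be enough to keep the vma list stable for the walk, e.g.

	down_read(&mm->mmap_sem);
	vma = find_vma(mm, offset);
	if (!vma) {
		ACCESS_ONCE(mm->numa_scan_seq)++;
		offset = 0;
		vma = mm->mmap;
	}
	for (; vma && length > 0; vma = vma->vm_next) {
		if (!vma_migratable(vma))
			continue;

		offset = max(offset, vma->vm_start);
		end = min(ALIGN(offset + length, HPAGE_SIZE), vma->vm_end);
		length -= end - offset;

		change_prot_none(vma, offset, end);

		offset = end;
	}
	mm->numa_scan_offset = offset;
	up_read(&mm->mmap_sem);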

> +	vma = find_vma(mm, offset);

and a find_vma every scan restart. That sucks too.

Cache the vma as well as the offset. Compare vma->vm_mm under mmap_sem held
for read and check that the offset still matches. Will that avoid the expense
of the lookup?

> +	if (!vma) {
> +		ACCESS_ONCE(mm->numa_scan_seq)++;
> +		offset = 0;
> +		vma = mm->mmap;
> +	}
> +	for (; vma && length > 0; vma = vma->vm_next) {
> +		if (!vma_migratable(vma))
> +			continue;
> +
> +		offset = max(offset, vma->vm_start);
> +		end = min(ALIGN(offset + length, HPAGE_SIZE), vma->vm_end);
> +		length -= end - offset;
> +
> +		change_prot_none(vma, offset, end);
> +
> +		offset = end;
>  	}
> +	mm->numa_scan_offset = offset;
> +	up_write(&mm->mmap_sem);
>  }
>  
>  /*
> Index: tip/kernel/sysctl.c
> ===================================================================
> --- tip.orig/kernel/sysctl.c
> +++ tip/kernel/sysctl.c
> @@ -367,6 +367,13 @@ static struct ctl_table kern_table[] = {
>  		.proc_handler	= proc_dointvec,
>  	},
>  	{
> +		.procname	= "sched_numa_scan_size_mb",
> +		.data		= &sysctl_sched_numa_scan_size,
> +		.maxlen		= sizeof(unsigned int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +	},
> +	{

If some muppet writes 0 into this, it effectively disables scanning. I
guess who cares, but maybe a minimum value of a hugepage size would
make some sort of sense.
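
Something along these lines is what I had in mind (untested, the floor value
and the variable name are made up):

	/* floor so that writing 0 cannot silently disable scanning */
	static int min_sched_numa_scan_size = 2;	/* MB, about one 2M hugepage */

	/* and in the kern_table[] entry above: */
	{
		.procname	= "sched_numa_scan_size_mb",
		.data		= &sysctl_sched_numa_scan_size,
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= &min_sched_numa_scan_size,
	},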

>  		.procname	= "sched_numa_settle_count",
>  		.data		= &sysctl_sched_numa_settle_count,
>  		.maxlen		= sizeof(unsigned int),
> 
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 30/31] sched, numa, mm: Implement slow start for working set sampling
  2012-10-25 12:16 ` [PATCH 30/31] sched, numa, mm: Implement slow start for working set sampling Peter Zijlstra
@ 2012-11-01 15:52   ` Mel Gorman
  0 siblings, 0 replies; 135+ messages in thread
From: Mel Gorman @ 2012-11-01 15:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Andrea Arcangeli, Johannes Weiner, Thomas Gleixner,
	Linus Torvalds, Andrew Morton, linux-kernel, linux-mm,
	Ingo Molnar

On Thu, Oct 25, 2012 at 02:16:47PM +0200, Peter Zijlstra wrote:
> Add a 1 second delay before starting to scan the working set of
> a task and starting to balance it amongst nodes.
> 
> [ note that before the constant per task WSS sampling rate patch
>   the initial scan would happen much later still, in effect that
>   patch caused this regression. ]
> 
> The theory is that short-run tasks benefit very little from NUMA
> placement: they come and go, and they better stick to the node
> they were started on. As tasks mature and rebalance to other CPUs
> and nodes, so does their NUMA placement have to change and so
> does it start to matter more and more.
> 

Yeah, ok. It's done by wall time, right? Should it be CPU time in case
it spent the first second asleep?
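
e.g. something along these lines in task_tick_numa() (untested, just to show
what I mean by CPU time):

	/* don't bother with placement until a full second of execution */
	if (curr->se.sum_exec_runtime < NSEC_PER_SEC)
		return;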

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/31] numa/core patches
  2012-11-01 13:41                 ` Hugh Dickins
@ 2012-11-02  3:23                   ` Zhouping Liu
  2012-11-02 23:06                     ` Hugh Dickins
  0 siblings, 1 reply; 135+ messages in thread
From: Zhouping Liu @ 2012-11-02  3:23 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Johannes Weiner, Peter Zijlstra, Rik van Riel, Andrea Arcangeli,
	Mel Gorman, Thomas Gleixner, Linus Torvalds, Andrew Morton,
	linux-kernel, linux-mm, Ingo Molnar, CAI Qian

On 11/01/2012 09:41 PM, Hugh Dickins wrote:
> On Wed, 31 Oct 2012, Hugh Dickins wrote:
>> On Wed, 31 Oct 2012, Zhouping Liu wrote:
>>> On 10/31/2012 03:26 PM, Hugh Dickins wrote:
>>>> There's quite a few put_page()s in do_huge_pmd_numa_page(), and it
>>>> would help if we could focus on the one which is giving the trouble,
>>>> but I don't know which that is.  Zhouping, if you can, please would
>>>> you do an "objdump -ld vmlinux >bigfile" of your kernel, then extract
>>>> from bigfile just the lines from "<do_huge_pmd_numa_page>:" to whatever
>>>> is the next function, and post or mail privately just that disassembly.
>>>> That should be good to identify which of the put_page()s is involved.
>>> Hugh, I didn't find the next function, as I can't find any words that matched
>>> "do_huge_pmd_numa_page".
>>> Is there any other method?
>> Hmm, do_huge_pmd_numa_page does appear in your stacktrace,
>> unless I've made a typo but am blind to it.
>>
>> Were you applying objdump to the vmlinux which gave you the
>> BUG at mm/memcontrol.c:1134! ?
> Thanks for the further info you then sent privately: I have not made any
> more effort to reproduce the issue, but your objdump did tell me that the
> put_page hitting the problem is the one on line 872 of mm/huge_memory.c,
> "Drop the local reference", just before successful return after migration.
>
> I didn't really get the inspiration I'd hoped for out of knowing that,
> but it did make wonder whether you're suffering from one of the issues
> I already mentioned, and I can now see a way in which it might cause
> the mm/memcontrol.c:1134 BUG:-
>
> migrate_page_copy() does TestClearPageActive on the source page. So, given
> the unsafe way in which do_huge_pmd_numa_page() was proceeding with a
> !PageLRU page, it's quite possible that the page was sitting in a pagevec
> and then got added to the active lru (so added to the lru_size of the
> active lru); but our final put_page removes it from the lru with the active
> flag already cleared, so we subtract it from the lru_size of the inactive
> lru - that could indeed make it go negative and trigger the BUG.
>
> Here's a patch fixing and tidying up that and a few other things there.
> But I'm not signing it off yet, partly because I've barely tested it
> (quite probably I didn't even have any numa pmd migration happening
> at all), and partly because just a moment ago I ran across this
> instructive comment in __collapse_huge_page_isolate():
> 	/* cannot use mapcount: can't collapse if there's a gup pin */
> 	if (page_count(page) != 1) {
>
> Hmm, yes, below I've added the page_mapcount() check I proposed to
> do_huge_pmd_numa_page(), but is even that safe enough?  Do we actually
> need a page_count() check (for 2?) to guard against get_user_pages()?
> I suspect we do, but then do we have enough locking to stabilize such
> a check?  Probably, but...
>
> This will take more time, and I doubt get_user_pages() is an issue in
> your testing, so please would you try the patch below, to see if it
> does fix the BUGs you are seeing?  Thanks a lot.

Hugh, I have tested the patch for 5 more hours and the issue can't be
reproduced any more, so I think it has fixed the issue. Thank you :)

Zhouping

>
> Not-Yet-Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>
>   mm/huge_memory.c |   24 +++++++++---------------
>   1 file changed, 9 insertions(+), 15 deletions(-)
>
> --- 3.7-rc2+schednuma+johannes/mm/huge_memory.c	2012-11-01 04:10:43.812155671 -0700
> +++ linux/mm/huge_memory.c	2012-11-01 05:52:19.512153771 -0700
> @@ -745,7 +745,7 @@ void do_huge_pmd_numa_page(struct mm_str
>   	struct mem_cgroup *memcg = NULL;
>   	struct page *new_page = NULL;
>   	struct page *page = NULL;
> -	int node, lru;
> +	int node = -1;
>   
>   	spin_lock(&mm->page_table_lock);
>   	if (unlikely(!pmd_same(*pmd, entry)))
> @@ -762,7 +762,8 @@ void do_huge_pmd_numa_page(struct mm_str
>   		VM_BUG_ON(!PageCompound(page) || !PageHead(page));
>   
>   		get_page(page);
> -		node = mpol_misplaced(page, vma, haddr);
> +		if (page_mapcount(page) == 1)	/* Only do exclusively mapped */
> +			node = mpol_misplaced(page, vma, haddr);
>   		if (node != -1)
>   			goto migrate;
>   	}
> @@ -801,13 +802,11 @@ migrate:
>   	if (!new_page)
>   		goto alloc_fail;
>   
> -	lru = PageLRU(page);
> -
> -	if (lru && isolate_lru_page(page)) /* does an implicit get_page() */
> +	if (isolate_lru_page(page))	/* Does an implicit get_page() */
>   		goto alloc_fail;
>   
> -	if (!trylock_page(new_page))
> -		BUG();
> +	__set_page_locked(new_page);
> +	SetPageSwapBacked(new_page);
>   
>   	/* anon mapping, we can simply copy page->mapping to the new page: */
>   	new_page->mapping = page->mapping;
> @@ -820,8 +819,6 @@ migrate:
>   	spin_lock(&mm->page_table_lock);
>   	if (unlikely(!pmd_same(*pmd, entry))) {
>   		spin_unlock(&mm->page_table_lock);
> -		if (lru)
> -			putback_lru_page(page);
>   
>   		unlock_page(new_page);
>   		ClearPageActive(new_page);	/* Set by migrate_page_copy() */
> @@ -829,6 +826,7 @@ migrate:
>   		put_page(new_page);		/* Free it */
>   
>   		unlock_page(page);
> +		putback_lru_page(page);
>   		put_page(page);			/* Drop the local reference */
>   
>   		return;
> @@ -859,16 +857,12 @@ migrate:
>   	mem_cgroup_end_migration(memcg, page, new_page, true);
>   	spin_unlock(&mm->page_table_lock);
>   
> -	put_page(page);			/* Drop the rmap reference */
> -
>   	task_numa_fault(node, HPAGE_PMD_NR);
>   
> -	if (lru)
> -		put_page(page);		/* drop the LRU isolation reference */
> -
>   	unlock_page(new_page);
> -
>   	unlock_page(page);
> +	put_page(page);			/* Drop the rmap reference */
> +	put_page(page);			/* Drop the LRU isolation reference */
>   	put_page(page);			/* Drop the local reference */
>   
>   	return;
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/31] numa/core patches
  2012-11-02  3:23                   ` Zhouping Liu
@ 2012-11-02 23:06                     ` Hugh Dickins
  0 siblings, 0 replies; 135+ messages in thread
From: Hugh Dickins @ 2012-11-02 23:06 UTC (permalink / raw)
  To: Zhouping Liu
  Cc: Johannes Weiner, Peter Zijlstra, Rik van Riel, Andrea Arcangeli,
	Mel Gorman, Thomas Gleixner, Linus Torvalds, Andrew Morton,
	linux-kernel, linux-mm, Ingo Molnar, CAI Qian

On Fri, 2 Nov 2012, Zhouping Liu wrote:
> On 11/01/2012 09:41 PM, Hugh Dickins wrote:
> > 
> > Here's a patch fixing and tidying up that and a few other things there.
> > But I'm not signing it off yet, partly because I've barely tested it
> > (quite probably I didn't even have any numa pmd migration happening
> > at all), and partly because just a moment ago I ran across this
> > instructive comment in __collapse_huge_page_isolate():
> > 	/* cannot use mapcount: can't collapse if there's a gup pin */
> > 	if (page_count(page) != 1) {
> > 
> > Hmm, yes, below I've added the page_mapcount() check I proposed to
> > do_huge_pmd_numa_page(), but is even that safe enough?  Do we actually
> > need a page_count() check (for 2?) to guard against get_user_pages()?
> > I suspect we do, but then do we have enough locking to stabilize such
> > a check?  Probably, but...
> > 
> > This will take more time, and I doubt get_user_pages() is an issue in
> > your testing, so please would you try the patch below, to see if it
> > does fix the BUGs you are seeing?  Thanks a lot.
> 
> Hugh, I have tested the patch for 5 more hours and the issue can't be
> reproduced any more, so I think it has fixed the issue. Thank you :)

Thanks a lot for testing and reporting back, that's good news.

However, I've meanwhile become convinced that more fixes are needed here,
to be safe against get_user_pages() (including get_user_pages_fast());
to get the Mlocked count right; and to recover correctly when !pmd_same
with an Unevictable page.

Won't now have time to update the patch today,
but these additional fixes shouldn't hold up your testing.

Hugh

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/31] numa/core patches
  2012-10-30 12:20 ` Mel Gorman
  2012-10-30 15:28   ` Andrew Morton
@ 2012-11-03 11:04   ` Alex Shi
  2012-11-03 12:21     ` Mel Gorman
  2012-11-09  8:51   ` Rik van Riel
  2 siblings, 1 reply; 135+ messages in thread
From: Alex Shi @ 2012-11-03 11:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Rik van Riel, Andrea Arcangeli, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton, linux-kernel,
	linux-mm, Ingo Molnar

>
> In reality, this report is larger but I chopped it down a bit for
> brevity. autonuma beats schednuma *heavily* on this benchmark both in
> terms of average operations per numa node and overall throughput.
>
> SPECJBB PEAKS
>                                        3.7.0                      3.7.0                      3.7.0
>                               rc2-stats-v2r1         rc2-autonuma-v27r8         rc2-schednuma-v1r3
>  Expctd Warehouse                   12.00 (  0.00%)                   12.00 (  0.00%)                   12.00 (  0.00%)
>  Expctd Peak Bops               442225.00 (  0.00%)               596039.00 ( 34.78%)               555342.00 ( 25.58%)
>  Actual Warehouse                    7.00 (  0.00%)                    9.00 ( 28.57%)                    8.00 ( 14.29%)
>  Actual Peak Bops               550747.00 (  0.00%)               646124.00 ( 17.32%)               560635.00 (  1.80%)

It is an impressive report!

Could you share what JVM and options you are using in the testing, and
on what kind of platform?

-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/31] numa/core patches
  2012-11-03 11:04   ` Alex Shi
@ 2012-11-03 12:21     ` Mel Gorman
  2012-11-10  2:47       ` Alex Shi
  0 siblings, 1 reply; 135+ messages in thread
From: Mel Gorman @ 2012-11-03 12:21 UTC (permalink / raw)
  To: Alex Shi
  Cc: Peter Zijlstra, Rik van Riel, Andrea Arcangeli, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton, linux-kernel,
	linux-mm, Ingo Molnar

On Sat, Nov 03, 2012 at 07:04:04PM +0800, Alex Shi wrote:
> >
> > In reality, this report is larger but I chopped it down a bit for
> > brevity. autonuma beats schednuma *heavily* on this benchmark both in
> > terms of average operations per numa node and overall throughput.
> >
> > SPECJBB PEAKS
> >                                        3.7.0                      3.7.0                      3.7.0
> >                               rc2-stats-v2r1         rc2-autonuma-v27r8         rc2-schednuma-v1r3
> >  Expctd Warehouse                   12.00 (  0.00%)                   12.00 (  0.00%)                   12.00 (  0.00%)
> >  Expctd Peak Bops               442225.00 (  0.00%)               596039.00 ( 34.78%)               555342.00 ( 25.58%)
> >  Actual Warehouse                    7.00 (  0.00%)                    9.00 ( 28.57%)                    8.00 ( 14.29%)
> >  Actual Peak Bops               550747.00 (  0.00%)               646124.00 ( 17.32%)               560635.00 (  1.80%)
> 
> It is an impressive report!
> 
> Could you share what JVM and options you are using in the testing, and
> on what kind of platform?
> 

Oracle JVM version "1.7.0_07"
Java(TM) SE Runtime Environment (build 1.7.0_07-b10)
Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode)

4 JVMs were run, one for each node.

JVM switch specified was -Xmx12901m so it would consume roughly 80% of
memory overall.

Machine is x86-64 4-node, 64G of RAM, CPUs are E7-4807, 48 cores in
total with HT enabled.
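
(As a sanity check on the sizing: 4 JVMs x 12901 MB is roughly 50G, which is
about 80% of the 64G in the machine.)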

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/31] numa/core patches
  2012-10-25 12:16 [PATCH 00/31] numa/core patches Peter Zijlstra
                   ` (32 preceding siblings ...)
  2012-10-30 12:20 ` Mel Gorman
@ 2012-11-05 17:11 ` Srikar Dronamraju
  33 siblings, 0 replies; 135+ messages in thread
From: Srikar Dronamraju @ 2012-11-05 17:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton, linux-kernel,
	linux-mm, Ingo Molnar

Hey Peter, 


Here are the results from a 2-node and an 8-node machine running the autonuma
benchmark.
----------------------------------------------------------------------------
On 2 node, 12 core 24GB 
----------------------------------------------------------------------------
KernelVersion: 3.7.0-rc3
                        Testcase:      Min      Max      Avg
                          numa01:   121.23   122.43   121.53
                numa01_HARD_BIND:    80.90    81.07    80.96
             numa01_INVERSE_BIND:   145.91   146.06   145.97
             numa01_THREAD_ALLOC:   395.81   398.30   397.47
   numa01_THREAD_ALLOC_HARD_BIND:   264.09   264.27   264.18
numa01_THREAD_ALLOC_INVERSE_BIND:   476.36   476.65   476.53
                          numa02:    53.11    53.19    53.15
                numa02_HARD_BIND:    35.20    35.29    35.25
             numa02_INVERSE_BIND:    63.52    63.55    63.54
                      numa02_SMT:    60.28    62.00    61.33
            numa02_SMT_HARD_BIND:    42.63    43.61    43.22
         numa02_SMT_INVERSE_BIND:    76.27    78.06    77.31

KernelVersion: numasched (i.e 3.7.0-rc3 + your patches)
                        Testcase:      Min      Max      Avg  %Change
                          numa01:   121.28   121.71   121.47    0.05%
                numa01_HARD_BIND:    80.89    81.01    80.96    0.00%
             numa01_INVERSE_BIND:   145.87   146.04   145.96    0.01%
             numa01_THREAD_ALLOC:   398.07   400.27   398.90   -0.36%
   numa01_THREAD_ALLOC_HARD_BIND:   264.02   264.21   264.14    0.02%
numa01_THREAD_ALLOC_INVERSE_BIND:   476.13   476.62   476.41    0.03%
                          numa02:    52.97    53.25    53.13    0.04%
                numa02_HARD_BIND:    35.21    35.28    35.24    0.03%
             numa02_INVERSE_BIND:    63.51    63.54    63.53    0.02%
                      numa02_SMT:    61.35    62.46    61.97   -1.03%
            numa02_SMT_HARD_BIND:    42.89    43.85    43.22    0.00%
         numa02_SMT_INVERSE_BIND:    76.53    77.68    77.08    0.30%

----------------------------------------------------------------------------

KernelVersion: 3.7.0-rc3(with HT enabled )
                        Testcase:      Min      Max      Avg
                          numa01:   242.58   244.39   243.68
                numa01_HARD_BIND:   169.36   169.40   169.38
             numa01_INVERSE_BIND:   299.69   299.73   299.71
             numa01_THREAD_ALLOC:   399.86   404.10   401.50
   numa01_THREAD_ALLOC_HARD_BIND:   278.72   278.77   278.75
numa01_THREAD_ALLOC_INVERSE_BIND:   493.46   493.59   493.54
                          numa02:    53.00    53.33    53.19
                numa02_HARD_BIND:    36.77    36.88    36.82
             numa02_INVERSE_BIND:    66.07    66.10    66.09
                      numa02_SMT:    53.23    53.51    53.35
            numa02_SMT_HARD_BIND:    35.19    35.27    35.24
         numa02_SMT_INVERSE_BIND:    63.50    63.54    63.52

KernelVersion: numasched (i.e 3.7.0-rc3 + your patches) (with HT enabled)
                        Testcase:      Min      Max      Avg  %Change
                          numa01:   242.68   244.59   243.53    0.06%
                numa01_HARD_BIND:   169.37   169.42   169.40   -0.01%
             numa01_INVERSE_BIND:   299.83   299.96   299.91   -0.07%
             numa01_THREAD_ALLOC:   399.53   403.13   401.62   -0.03%
   numa01_THREAD_ALLOC_HARD_BIND:   278.78   278.80   278.79   -0.01%
numa01_THREAD_ALLOC_INVERSE_BIND:   493.63   493.90   493.78   -0.05%
                          numa02:    53.06    53.42    53.22   -0.06%
                numa02_HARD_BIND:    36.78    36.87    36.82    0.00%
             numa02_INVERSE_BIND:    66.09    66.10    66.10   -0.02%
                      numa02_SMT:    53.34    53.55    53.42   -0.13%
            numa02_SMT_HARD_BIND:    35.22    35.29    35.25   -0.03%
         numa02_SMT_INVERSE_BIND:    63.50    63.58    63.53   -0.02%
----------------------------------------------------------------------------



On 8 node, 64 core, 320 GB 
----------------------------------------------------------------------------

KernelVersion: 3.7.0-rc3()
                        Testcase:      Min      Max      Avg
                          numa01:  1550.56  1596.03  1574.24
                numa01_HARD_BIND:   915.25  2540.64  1392.42
             numa01_INVERSE_BIND:  2964.66  3716.33  3149.10
             numa01_THREAD_ALLOC:   922.99  1003.31   972.99
   numa01_THREAD_ALLOC_HARD_BIND:   579.54  1266.65   896.75
numa01_THREAD_ALLOC_INVERSE_BIND:  1794.51  2057.16  1922.86
                          numa02:   126.22   133.01   130.91
                numa02_HARD_BIND:    25.85    26.25    26.06
             numa02_INVERSE_BIND:   341.38   350.35   345.82
                      numa02_SMT:   153.06   175.41   163.47
            numa02_SMT_HARD_BIND:    27.10   212.39   114.37
         numa02_SMT_INVERSE_BIND:   285.70  1542.83   540.62

KernelVersion: numasched()
                        Testcase:      Min      Max      Avg  %Change
                          numa01:  1542.69  1601.81  1569.68    0.29%
                numa01_HARD_BIND:   867.35  1094.00   966.05   44.14%
             numa01_INVERSE_BIND:  2835.71  3030.36  2966.99    6.14%
             numa01_THREAD_ALLOC:   326.35   379.43   347.01  180.39%
   numa01_THREAD_ALLOC_HARD_BIND:   611.55   720.09   657.06   36.48%
numa01_THREAD_ALLOC_INVERSE_BIND:  1839.60  1999.58  1919.36    0.18%
                          numa02:    35.35    55.09    40.81  220.78%
                numa02_HARD_BIND:    26.58    26.81    26.68   -2.32%
             numa02_INVERSE_BIND:   341.86   355.36   347.68   -0.53%
                      numa02_SMT:    37.65    48.65    43.08  279.46%
            numa02_SMT_HARD_BIND:    28.29   157.66    84.29   35.69%
         numa02_SMT_INVERSE_BIND:   313.07   346.72   333.69   62.01%
----------------------------------------------------------------------------

-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/31] numa/core patches
  2012-10-30 12:20 ` Mel Gorman
  2012-10-30 15:28   ` Andrew Morton
  2012-11-03 11:04   ` Alex Shi
@ 2012-11-09  8:51   ` Rik van Riel
  2 siblings, 0 replies; 135+ messages in thread
From: Rik van Riel @ 2012-11-09  8:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Andrea Arcangeli, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton, linux-kernel,
	linux-mm, Ingo Molnar

On 10/30/2012 08:20 AM, Mel Gorman wrote:
> On Thu, Oct 25, 2012 at 02:16:17PM +0200, Peter Zijlstra wrote:
>> Hi all,
>>
>> Here's a re-post of the NUMA scheduling and migration improvement
>> patches that we are working on. These include techniques from
>> AutoNUMA and the sched/numa tree and form a unified basis - it
>> has got all the bits that look good and mergeable.
>>
>
> Thanks for the repost. I have not even started a review yet as I was
> travelling and just online today. It will be another day or two before I can
> start but I was at least able to do a comparison test between autonuma and
> schednuma today to see which actually performs the best. Even without the
> review I was able to stick on similar vmstats as was applied to autonuma
> to give a rough estimate of the relative overhead of both implementations.

Peter, Ingo,

do you have any comments on the performance measurements
by Mel?

Any ideas on how to fix sched/numa or numa/core?

At this point, I suspect the easiest way forward might be
to merge the basic infrastructure from Mel's combined
tree (in -mm? in -tip?), so we can experiment with different
NUMA placement policies on top.

That way we can do apples to apples comparison of the
policies, and figure out what works best, and why.



^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/31] numa/core patches
  2012-11-03 12:21     ` Mel Gorman
@ 2012-11-10  2:47       ` Alex Shi
  2012-11-12  9:50         ` Mel Gorman
  0 siblings, 1 reply; 135+ messages in thread
From: Alex Shi @ 2012-11-10  2:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Rik van Riel, Andrea Arcangeli, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton, linux-kernel,
	linux-mm, Ingo Molnar, Alex Shi, Chen, Tim C

On Sat, Nov 3, 2012 at 8:21 PM, Mel Gorman <mgorman@suse.de> wrote:
> On Sat, Nov 03, 2012 at 07:04:04PM +0800, Alex Shi wrote:
>> >
>> > In reality, this report is larger but I chopped it down a bit for
>> > brevity. autonuma beats schednuma *heavily* on this benchmark both in
>> > terms of average operations per numa node and overall throughput.
>> >
>> > SPECJBB PEAKS
>> >                                        3.7.0                      3.7.0                      3.7.0
>> >                               rc2-stats-v2r1         rc2-autonuma-v27r8         rc2-schednuma-v1r3
>> >  Expctd Warehouse                   12.00 (  0.00%)                   12.00 (  0.00%)                   12.00 (  0.00%)
>> >  Expctd Peak Bops               442225.00 (  0.00%)               596039.00 ( 34.78%)               555342.00 ( 25.58%)
>> >  Actual Warehouse                    7.00 (  0.00%)                    9.00 ( 28.57%)                    8.00 ( 14.29%)
>> >  Actual Peak Bops               550747.00 (  0.00%)               646124.00 ( 17.32%)               560635.00 (  1.80%)
>>
>> It is an impressive report!
>>
>> Could you share what JVM and options you are using in the testing, and
>> on what kind of platform?
>>
>
> Oracle JVM version "1.7.0_07"
> Java(TM) SE Runtime Environment (build 1.7.0_07-b10)
> Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode)
>
> 4 JVMs were run, one for each node.
>
> JVM switch specified was -Xmx12901m so it would consume roughly 80% of
> memory overall.
>
> Machine is x86-64 4-node, 64G of RAM, CPUs are E7-4807, 48 cores in
> total with HT enabled.
>

Thanks for configuration sharing!

I used JRockit and OpenJDK with hugepages, plus pinning the JVM to a CPU
socket. With a previous sched/numa version I had found a 20% drop with
JRockit in our configuration, but with this version no clear regression is
found; no benefit is found either.

Seems we need to expand the testing configurations. :)
-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/31] numa/core patches
  2012-11-10  2:47       ` Alex Shi
@ 2012-11-12  9:50         ` Mel Gorman
  0 siblings, 0 replies; 135+ messages in thread
From: Mel Gorman @ 2012-11-12  9:50 UTC (permalink / raw)
  To: Alex Shi
  Cc: Peter Zijlstra, Rik van Riel, Andrea Arcangeli, Johannes Weiner,
	Thomas Gleixner, Linus Torvalds, Andrew Morton, linux-kernel,
	linux-mm, Ingo Molnar, Alex Shi, Chen, Tim C

On Sat, Nov 10, 2012 at 10:47:41AM +0800, Alex Shi wrote:
> On Sat, Nov 3, 2012 at 8:21 PM, Mel Gorman <mgorman@suse.de> wrote:
> > On Sat, Nov 03, 2012 at 07:04:04PM +0800, Alex Shi wrote:
> >> >
> >> > In reality, this report is larger but I chopped it down a bit for
> >> > brevity. autonuma beats schednuma *heavily* on this benchmark both in
> >> > terms of average operations per numa node and overall throughput.
> >> >
> >> > SPECJBB PEAKS
> >> >                                        3.7.0                      3.7.0                      3.7.0
> >> >                               rc2-stats-v2r1         rc2-autonuma-v27r8         rc2-schednuma-v1r3
> >> >  Expctd Warehouse                   12.00 (  0.00%)                   12.00 (  0.00%)                   12.00 (  0.00%)
> >> >  Expctd Peak Bops               442225.00 (  0.00%)               596039.00 ( 34.78%)               555342.00 ( 25.58%)
> >> >  Actual Warehouse                    7.00 (  0.00%)                    9.00 ( 28.57%)                    8.00 ( 14.29%)
> >> >  Actual Peak Bops               550747.00 (  0.00%)               646124.00 ( 17.32%)               560635.00 (  1.80%)
> >>
> >> That is an impressive report!
> >>
> >> Could you share which JVM and options you used in the testing, and
> >> what kind of platform it was run on?
> >>
> >
> > Oracle JVM version "1.7.0_07"
> > Java(TM) SE Runtime Environment (build 1.7.0_07-b10)
> > Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode)
> >
> > 4 JVMs were run, one for each node.
> >
> > JVM switch specified was -Xmx12901m so it would consume roughly 80% of
> > memory overall.
> >
> > Machine is x86-64 4-node, 64G of RAM, CPUs are E7-4807, 48 cores in
> > total with HT enabled.
> >
> 
> Thanks for sharing the configuration!
> 
> I used JRockit and OpenJDK with hugepages, plus pinning each JVM to a
> CPU socket.

If you are using hugepages then automatic NUMA balancing is not
migrating those pages. If you are pinning the JVMs to the socket then
automatic NUMA balancing is unnecessary anyway, as their pages are
already on the correct node.

> With the previous sched/numa version I had found a 20% drop with
> JRockit in our configuration, but with this version there is no clear
> regression - and no benefit either.
> 

You are only checking for regressions with your configuration, which is
important because it showed that schednuma introduced only overhead in
an already optimised NUMA configuration.

In your case, you will see little or no benefit with any automatic NUMA
balancing implementation, as the most important pages neither can
migrate nor need to.
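
To make that concrete, the kind of VMA filter involved looks roughly like
vma_migratable() - the sketch below is illustrative only, the helper name is
made up here, and the exact set of flags differs between kernel versions and
between the two patch sets:

#include <linux/mm.h>	/* struct vm_area_struct, VM_* flags */

/* Regions failing this test never take NUMA hinting faults at all. */
static inline bool numa_balancing_candidate(struct vm_area_struct *vma)
{
	/* hugetlbfs, IO and raw PFN mappings are never migrated */
	if (vma->vm_flags & (VM_IO | VM_HUGETLB | VM_PFNMAP))
		return false;

	/* anonymous memory and regular file mappings can be considered */
	return true;
}

A hugetlbfs-style hugepage heap (as opposed to THP) fails that test, so it
never takes hinting faults and never moves - which is why a pinned,
hugepage-backed setup sees neither the overhead nor the benefit of
automatic balancing.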

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 2/3] x86,mm: drop TLB flush from ptep_set_access_flags
  2012-10-29 17:06                             ` Linus Torvalds
@ 2012-11-17 14:50                               ` Borislav Petkov
  2012-11-17 14:56                                 ` Linus Torvalds
  0 siblings, 1 reply; 135+ messages in thread
From: Borislav Petkov @ 2012-11-17 14:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, Alan Cox, Ingo Molnar, Andi Kleen,
	Michel Lespinasse, Peter Zijlstra, Andrea Arcangeli, Mel Gorman,
	Johannes Weiner, Thomas Gleixner, Andrew Morton, linux-kernel,
	linux-mm, florian, Borislav Petkov

On Mon, Oct 29, 2012 at 10:06:15AM -0700, Linus Torvalds wrote:
> On Mon, Oct 29, 2012 at 9:57 AM, Borislav Petkov <bp@alien8.de> wrote:
> >
> > On current AMD64 processors,
> 
> Can you verify that this is true for older cpu's too (ie the old
> pre-64-bit ones, say K6 and original Athlon)?

Albeit with a slight delay, the answer is yes: all AMD cpus
automatically invalidate cached TLB entries (and intermediate walk
results, for that matter) on a #PF.

I don't know, however, whether it would be prudent to have some sort of
a cheap assertion in the code (cheaper than INVLPG %ADDR, although on
older cpus we do MOV CR3) just in case. This should be enabled only with
DEBUG_VM on, of course...

HTH.

-- 
Regards/Gruss,
Boris.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 2/3] x86,mm: drop TLB flush from ptep_set_access_flags
  2012-11-17 14:50                               ` Borislav Petkov
@ 2012-11-17 14:56                                 ` Linus Torvalds
  2012-11-17 15:17                                   ` Borislav Petkov
  2012-11-17 15:24                                   ` Rik van Riel
  0 siblings, 2 replies; 135+ messages in thread
From: Linus Torvalds @ 2012-11-17 14:56 UTC (permalink / raw)
  To: Borislav Petkov, Linus Torvalds, Rik van Riel, Alan Cox,
	Ingo Molnar, Andi Kleen, Michel Lespinasse, Peter Zijlstra,
	Andrea Arcangeli, Mel Gorman, Johannes Weiner, Thomas Gleixner,
	Andrew Morton, Linux Kernel Mailing List, linux-mm,
	Florian Fainelli, Borislav Petkov

On Sat, Nov 17, 2012 at 6:50 AM, Borislav Petkov <bp@alien8.de> wrote:
>
> Albeit with a slight delay, the answer is yes: all AMD cpus
> automatically invalidate cached TLB entries (and intermediate walk
> results, for that matter) on a #PF.

Thanks. I suspect it ends up being basically architectural, and that
Windows (and quite possibly Linux versions too) have depended on the
behavior.

> I don't know, however, whether it would be prudent to have some sort of
> a cheap assertion in the code (cheaper than INVLPG %ADDR, although on
> older cpus we do MOV CR3) just in case. This should be enabled only with
> DEBUG_VM on, of course...

I wonder how we could actually test for it. We'd have to have some
per-cpu page-fault address check (along with a generation count on the
mm or similar). I doubt we'd figure out anything that works reliably
and efficiently and would actually show any problems (plus we would
have no way to ever know we even got the code right, since presumably
we'd never find hardware that actually shows the behavior we'd be
looking for..)

               Linus

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 2/3] x86,mm: drop TLB flush from ptep_set_access_flags
  2012-11-17 14:56                                 ` Linus Torvalds
@ 2012-11-17 15:17                                   ` Borislav Petkov
  2012-11-17 15:24                                   ` Rik van Riel
  1 sibling, 0 replies; 135+ messages in thread
From: Borislav Petkov @ 2012-11-17 15:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, Alan Cox, Ingo Molnar, Andi Kleen,
	Michel Lespinasse, Peter Zijlstra, Andrea Arcangeli, Mel Gorman,
	Johannes Weiner, Thomas Gleixner, Andrew Morton,
	Linux Kernel Mailing List, linux-mm, Florian Fainelli,
	Borislav Petkov

On Sat, Nov 17, 2012 at 06:56:10AM -0800, Linus Torvalds wrote:
> I wonder how we could actually test for it. We'd have to have some
> per-cpu page-fault address check (along with a generation count on the
> mm or similar). I doubt we'd figure out anything that works reliably
> and efficiently and would actually show any problems (plus we would
> have no way to ever know we even got the code right, since presumably
> we'd never find hardware that actually shows the behavior we'd be
> looking for..)

Hmm, touching some wrong page through a stale TLB entry could be a
pretty nasty issue to debug. But you're probably right: how does one
test cheaply whether a PTE just got kicked out of the TLB? Maybe mark it
not-present, but that would force a re-walk in the case where it is
shared, which is a penalty we don't want to pay.

Oh well...

-- 
Regards/Gruss,
Boris.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 2/3] x86,mm: drop TLB flush from ptep_set_access_flags
  2012-11-17 14:56                                 ` Linus Torvalds
  2012-11-17 15:17                                   ` Borislav Petkov
@ 2012-11-17 15:24                                   ` Rik van Riel
  2012-11-17 21:53                                     ` Shentino
  1 sibling, 1 reply; 135+ messages in thread
From: Rik van Riel @ 2012-11-17 15:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Alan Cox, Ingo Molnar, Andi Kleen,
	Michel Lespinasse, Peter Zijlstra, Andrea Arcangeli, Mel Gorman,
	Johannes Weiner, Thomas Gleixner, Andrew Morton,
	Linux Kernel Mailing List, linux-mm, Florian Fainelli,
	Borislav Petkov

On 11/17/2012 09:56 AM, Linus Torvalds wrote:
> On Sat, Nov 17, 2012 at 6:50 AM, Borislav Petkov <bp@alien8.de> wrote:
>> I don't know, however, whether it would be prudent to have some sort of
>> a cheap assertion in the code (cheaper than INVLPG %ADDR, although on
>> older cpus we do MOV CR3) just in case. This should be enabled only with
>> DEBUG_VM on, of course...
>
> I wonder how we could actually test for it. We'd have to have some
> per-cpu page-fault address check (along with a generation count on the
> mm or similar). I doubt we'd figure out anything that works reliably
> and efficiently and would actually show any problems

Would it be enough to simply print out a warning if we fault
on the same address twice (or three times) in a row, and then
flush the local TLB?

I realize this would not just trigger on CPUs that fail to
invalidate TLB entries that cause faults, but also on kernel
paths that cause a page fault to be re-taken...

... but then again, don't we want to find those paths and
fix them, anyway? :)
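
Roughly, I am thinking of something like the sketch below - purely
illustrative, the per-cpu variables, the helper name and the two-repeats
threshold are all invented here, and it would be compiled out without
CONFIG_DEBUG_VM:

#include <linux/percpu.h>
#include <linux/bug.h>
#include <asm/tlbflush.h>

#ifdef CONFIG_DEBUG_VM
static DEFINE_PER_CPU(unsigned long, last_fault_addr);
static DEFINE_PER_CPU(unsigned int, repeat_fault_count);

/* Call from the page fault path with the faulting address. */
static void check_repeated_fault(unsigned long address)
{
	if (this_cpu_read(last_fault_addr) != address) {
		/* New address: remember it and reset the counter. */
		this_cpu_write(last_fault_addr, address);
		this_cpu_write(repeat_fault_count, 0);
		return;
	}

	/* Same address faulting again on this CPU: suspicious. */
	if (this_cpu_inc_return(repeat_fault_count) >= 2) {
		WARN_ONCE(1, "repeated fault at %lx, flushing local TLB\n",
			  address);
		local_flush_tlb();	/* x86: flush non-global entries */
		this_cpu_write(repeat_fault_count, 0);
	}
}
#else
static inline void check_repeated_fault(unsigned long address) { }
#endif

WARN_ONCE keeps it from spamming the logs if some path really does
re-fault in a loop, and the local flush papers over the problem so the
machine keeps running while we investigate.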

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 2/3] x86,mm: drop TLB flush from ptep_set_access_flags
  2012-11-17 15:24                                   ` Rik van Riel
@ 2012-11-17 21:53                                     ` Shentino
  2012-11-18 15:29                                       ` Michel Lespinasse
  0 siblings, 1 reply; 135+ messages in thread
From: Shentino @ 2012-11-17 21:53 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Borislav Petkov, Alan Cox, Ingo Molnar,
	Andi Kleen, Michel Lespinasse, Peter Zijlstra, Andrea Arcangeli,
	Mel Gorman, Johannes Weiner, Thomas Gleixner, Andrew Morton,
	Linux Kernel Mailing List, linux-mm, Florian Fainelli,
	Borislav Petkov

On Sat, Nov 17, 2012 at 7:24 AM, Rik van Riel <riel@redhat.com> wrote:
> On 11/17/2012 09:56 AM, Linus Torvalds wrote:
>>
>> On Sat, Nov 17, 2012 at 6:50 AM, Borislav Petkov <bp@alien8.de> wrote:
>>>
>>> I don't know, however, whether it would be prudent to have some sort of
>>> a cheap assertion in the code (cheaper than INVLPG %ADDR, although on
>>> older cpus we do MOV CR3) just in case. This should be enabled only with
>>> DEBUG_VM on, of course...
>>
>>
>> I wonder how we could actually test for it. We'd have to have some
>> per-cpu page-fault address check (along with a generation count on the
>> mm or similar). I doubt we'd figure out anything that works reliably
>> and efficiently and would actually show any problems
>
> Would it be enough to simply print out a warning if we fault
> on the same address twice (or three times) in a row, and then
> flush the local TLB?
>
> I realize this would not just trigger on CPUs that fail to
> invalidate TLB entries that cause faults, but also on kernel
> paths that cause a page fault to be re-taken...

I'm actually curious if the architecture docs/software developer
manuals for IA-32 mandate any TLB invalidations on a #PF

Is there any official vendor documentation on the subject?

And perhaps equally valid, should we trust it if it exists?

> ... but then again, don't we want to find those paths and
> fix them, anyway? :)
>
> --
> All rights reversed

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 2/3] x86,mm: drop TLB flush from ptep_set_access_flags
  2012-11-17 21:53                                     ` Shentino
@ 2012-11-18 15:29                                       ` Michel Lespinasse
  0 siblings, 0 replies; 135+ messages in thread
From: Michel Lespinasse @ 2012-11-18 15:29 UTC (permalink / raw)
  To: Shentino
  Cc: Rik van Riel, Linus Torvalds, Borislav Petkov, Alan Cox,
	Ingo Molnar, Andi Kleen, Peter Zijlstra, Andrea Arcangeli,
	Mel Gorman, Johannes Weiner, Thomas Gleixner, Andrew Morton,
	Linux Kernel Mailing List, linux-mm, Florian Fainelli,
	Borislav Petkov

On Sat, Nov 17, 2012 at 1:53 PM, Shentino <shentino@gmail.com> wrote:
> I'm actually curious if the architecture docs/software developer
> manuals for IA-32 mandate any TLB invalidations on a #PF
>
> Is there any official vendor documentation on the subject?

Yes. Quoting a prior email:

Actually, it is architected on x86. This was first described in the
intel appnote 317080 "TLBs, Paging-Structure Caches, and Their
Invalidation", last paragraph of section 5.1. Nowadays, the same
contents are buried somewhere in Volume 3 of the architecture manual
(in my copy: 4.10.4.1 Operations that Invalidate TLBs and
Paging-Structure Caches)

> And perhaps equally valid, should we trust it if it exists?

I know that Intel has been very careful in documenting the architected
TLB behaviors and did it with the understanding that people should be
able to depend on what's being written up there.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* trailing flush_tlb_fix_spurious_fault in handle_pte_fault (was Re: [PATCH 1/3] x86/mm: only do a local TLB flush in ptep_set_access_flags())
  2012-10-29 15:23                             ` Rik van Riel
@ 2012-12-21  9:57                               ` Vineet Gupta
  0 siblings, 0 replies; 135+ messages in thread
From: Vineet Gupta @ 2012-12-21  9:57 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Ingo Molnar, Andi Kleen, Michel Lespinasse,
	Peter Zijlstra, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Thomas Gleixner, Andrew Morton, linux-kernel, linux-mm,
	Gilad Ben Yossef, Andrea Arcangeli

On Monday 29 October 2012 08:53 PM, Rik van Riel wrote:
> On 10/26/2012 03:18 PM, Linus Torvalds wrote:
>> On Fri, Oct 26, 2012 at 12:16 PM, Rik van Riel <riel@redhat.com> wrote:
>>>
>>> I can change the text of the changelog, however it looks
>>> like do_wp_page does actually use ptep_set_access_flags
>>> to set the write bit in the pte...
>>>
>>> I guess both need to be reflected in the changelog text
>>> somehow?
>>
>> Yeah, and by now, after all this discussion, I suspect it should be
>> committed with a comment too. Commit messages are good and all, but
>> unless chasing a particular bug they introduced, we shouldn't expect
>> people to read them for background information.
> 
> Now that we have the TLB things taken care of, and
> comments to patches 10/31 and 26/31 have been addressed,
> is there anything else that needs to be done before
> these NUMA patches can be merged?
> 
> Anyone, this is a good time to speak up. We have some
> time to address whatever concern you may have.
> 

Hi,

I know I'm very late in speaking up, but I'll still hazard a try. This is not
exactly the same topic, but it is closely related.

There is a different call, to flush_tlb_fix_spurious_fault(), towards the end of
handle_pte_fault(), which commit 61c77326d ("x86, mm: Avoid unnecessary TLB flush")
made a no-op for x86. However, is this really needed for any arch at all - even if
we don't know all the arch-specific quirks?

Given the code flow below:

handle_pte_fault()
    ...
    if (ptep_set_access_flags())        -> PTE changed: remote TLB shootdown
                                           (pgtable-generic.c version)
        update_mmu_cache()              -> PTE changed: local TLB possibly
                                           shot down too
    else
        flush_tlb_fix_spurious_fault()  -> PTE did not change: still a
                                           remote TLB shootdown

So for the PTE-unchanged case, we default to doing remote TLB IPIs (barring x86) -
unless the arch defines this macro to be a no-op.
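
For reference, the tail of handle_pte_fault() that I am talking about looks
roughly like this (paraphrased from mm/memory.c after the 61c77326d change;
the comment is abridged):

	if (ptep_set_access_flags(vma, address, pte, entry,
				  flags & FAULT_FLAG_WRITE)) {
		update_mmu_cache(vma, address, pte);
	} else {
		/*
		 * Needed only for protection faults, but the arch code does
		 * not yet tell us whether this is one; it still avoids
		 * useless TLB flushes for .text page faults with threads.
		 */
		if (flags & FAULT_FLAG_WRITE)
			flush_tlb_fix_spurious_fault(vma, address);
	}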

The thing is, in case of SMP races - where the PTE turns out to be different - any
local/remote fixups will be handled within ptep_set_access_flags(), in either its
arch-specific or generic version. What I fail to understand is the need to do
anything at all - especially a remote shootdown - for the PTE-not-changed case.

I could shut up and just make it a no-op for ARC, but ....

For the record, this special case was added by the following change, which might
help in answering what I'm failing to comprehend.

2005-10-29 1a44e14 [PATCH] .text page fault SMP scalability optimization

I might be totally off track, so please feel free to bash me - but at least I would
end up knowing more!

Thx,
-Vineet

^ permalink raw reply	[flat|nested] 135+ messages in thread

end of thread, other threads:[~2012-12-21  9:58 UTC | newest]

Thread overview: 135+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-10-25 12:16 [PATCH 00/31] numa/core patches Peter Zijlstra
2012-10-25 12:16 ` [PATCH 01/31] sched, numa, mm: Make find_busiest_queue() a method Peter Zijlstra
2012-10-25 12:16 ` [PATCH 02/31] sched, numa, mm: Describe the NUMA scheduling problem formally Peter Zijlstra
2012-11-01  9:56   ` Mel Gorman
2012-11-01 13:13     ` Rik van Riel
2012-10-25 12:16 ` [PATCH 03/31] mm/thp: Preserve pgprot across huge page split Peter Zijlstra
2012-11-01 10:22   ` Mel Gorman
2012-10-25 12:16 ` [PATCH 04/31] x86/mm: Introduce pte_accessible() Peter Zijlstra
2012-10-25 20:10   ` Linus Torvalds
2012-10-26  6:24     ` [PATCH 04/31, v2] " Ingo Molnar
2012-11-01 10:42   ` [PATCH 04/31] " Mel Gorman
2012-10-25 12:16 ` [PATCH 05/31] x86/mm: Reduce tlb flushes from ptep_set_access_flags() Peter Zijlstra
2012-10-25 20:17   ` Linus Torvalds
2012-10-26  2:30     ` Rik van Riel
2012-10-26  2:56       ` Linus Torvalds
2012-10-26  3:57         ` Rik van Riel
2012-10-26  4:23           ` Linus Torvalds
2012-10-26  6:42             ` Ingo Molnar
2012-10-26 12:34             ` Michel Lespinasse
2012-10-26 12:48               ` Andi Kleen
2012-10-26 13:16                 ` Rik van Riel
2012-10-26 13:26                   ` Ingo Molnar
2012-10-26 13:28                     ` Ingo Molnar
2012-10-26 18:44                     ` [PATCH 1/3] x86/mm: only do a local TLB flush in ptep_set_access_flags() Rik van Riel
2012-10-26 18:49                       ` Linus Torvalds
2012-10-26 19:16                         ` Rik van Riel
2012-10-26 19:18                           ` Linus Torvalds
2012-10-26 19:21                             ` Rik van Riel
2012-10-29 15:23                             ` Rik van Riel
2012-12-21  9:57                               ` trailing flush_tlb_fix_spurious_fault in handle_pte_fault (was Re: [PATCH 1/3] x86/mm: only do a local TLB flush in ptep_set_access_flags()) Vineet Gupta
2012-10-26 18:45                     ` [PATCH 2/3] x86,mm: drop TLB flush from ptep_set_access_flags Rik van Riel
2012-10-26 21:12                       ` Alan Cox
2012-10-27  3:49                         ` Rik van Riel
2012-10-27 10:29                           ` Ingo Molnar
2012-10-27 13:40                         ` Rik van Riel
2012-10-29 16:57                           ` Borislav Petkov
2012-10-29 17:06                             ` Linus Torvalds
2012-11-17 14:50                               ` Borislav Petkov
2012-11-17 14:56                                 ` Linus Torvalds
2012-11-17 15:17                                   ` Borislav Petkov
2012-11-17 15:24                                   ` Rik van Riel
2012-11-17 21:53                                     ` Shentino
2012-11-18 15:29                                       ` Michel Lespinasse
2012-10-26 18:46                     ` [PATCH 3/3] mm,generic: only flush the local TLB in ptep_set_access_flags Rik van Riel
2012-10-26 18:48                       ` Linus Torvalds
2012-10-26 18:53                         ` Linus Torvalds
2012-10-26 18:57                         ` Rik van Riel
2012-10-26 19:16                           ` Linus Torvalds
2012-10-26 19:33                             ` [PATCH -v2 " Rik van Riel
2012-10-26 13:23                 ` [PATCH 05/31] x86/mm: Reduce tlb flushes from ptep_set_access_flags() Michel Lespinasse
2012-10-26 17:01               ` Linus Torvalds
2012-10-26 17:54                 ` Rik van Riel
2012-10-26 18:02                   ` Linus Torvalds
2012-10-26 18:14                     ` Rik van Riel
2012-10-26 18:41                       ` Linus Torvalds
2012-10-25 12:16 ` [PATCH 06/31] mm: Only flush the TLB when clearing an accessible pte Peter Zijlstra
2012-10-25 12:16 ` [PATCH 07/31] sched, numa, mm, s390/thp: Implement pmd_pgprot() for s390 Peter Zijlstra
2012-11-01 10:49   ` Mel Gorman
2012-10-25 12:16 ` [PATCH 08/31] sched, numa, mm, MIPS/thp: Add pmd_pgprot() implementation Peter Zijlstra
2012-10-25 12:16 ` [PATCH 09/31] mm/pgprot: Move the pgprot_modify() fallback definition to mm.h Peter Zijlstra
2012-10-25 12:16 ` [PATCH 10/31] mm/mpol: Remove NUMA_INTERLEAVE_HIT Peter Zijlstra
2012-10-25 20:58   ` Andi Kleen
2012-10-26  7:59     ` Ingo Molnar
2012-10-25 12:16 ` [PATCH 11/31] mm/mpol: Make MPOL_LOCAL a real policy Peter Zijlstra
2012-11-01 10:58   ` Mel Gorman
2012-10-25 12:16 ` [PATCH 12/31] mm/mpol: Add MPOL_MF_NOOP Peter Zijlstra
2012-11-01 11:10   ` Mel Gorman
2012-10-25 12:16 ` [PATCH 13/31] mm/mpol: Check for misplaced page Peter Zijlstra
2012-10-25 12:16 ` [PATCH 14/31] mm/mpol: Create special PROT_NONE infrastructure Peter Zijlstra
2012-11-01 11:51   ` Mel Gorman
2012-10-25 12:16 ` [PATCH 15/31] mm/mpol: Add MPOL_MF_LAZY Peter Zijlstra
2012-11-01 12:01   ` Mel Gorman
2012-10-25 12:16 ` [PATCH 16/31] numa, mm: Support NUMA hinting page faults from gup/gup_fast Peter Zijlstra
2012-10-25 12:16 ` [PATCH 17/31] mm/migrate: Introduce migrate_misplaced_page() Peter Zijlstra
2012-11-01 12:20   ` Mel Gorman
2012-10-25 12:16 ` [PATCH 18/31] mm/mpol: Use special PROT_NONE to migrate pages Peter Zijlstra
2012-10-25 12:16 ` [PATCH 19/31] sched, numa, mm: Introduce tsk_home_node() Peter Zijlstra
2012-11-01 13:48   ` Mel Gorman
2012-10-25 12:16 ` [PATCH 20/31] sched, numa, mm/mpol: Make mempolicy home-node aware Peter Zijlstra
2012-11-01 13:58   ` Mel Gorman
2012-11-01 14:10     ` Don Morris
2012-10-25 12:16 ` [PATCH 21/31] sched, numa, mm: Introduce sched_feat_numa() Peter Zijlstra
2012-11-01 14:00   ` Mel Gorman
2012-10-25 12:16 ` [PATCH 22/31] sched, numa, mm: Implement THP migration Peter Zijlstra
2012-11-01 14:16   ` Mel Gorman
2012-10-25 12:16 ` [PATCH 23/31] sched, numa, mm: Implement home-node awareness Peter Zijlstra
2012-11-01 15:06   ` Mel Gorman
2012-10-25 12:16 ` [PATCH 24/31] sched, numa, mm: Introduce last_nid in the pageframe Peter Zijlstra
2012-11-01 15:17   ` Mel Gorman
2012-10-25 12:16 ` [PATCH 25/31] sched, numa, mm/mpol: Add_MPOL_F_HOME Peter Zijlstra
2012-10-25 12:16 ` [PATCH 26/31] sched, numa, mm: Add fault driven placement and migration policy Peter Zijlstra
2012-10-25 20:53   ` Linus Torvalds
2012-10-26  7:15     ` Ingo Molnar
2012-10-26 13:50       ` Ingo Molnar
2012-10-26 14:11         ` Peter Zijlstra
2012-10-26 14:14           ` Ingo Molnar
2012-10-26 16:47             ` Linus Torvalds
2012-10-30 19:23   ` Rik van Riel
2012-11-01 15:40   ` Mel Gorman
2012-10-25 12:16 ` [PATCH 27/31] sched, numa, mm: Add credits for NUMA placement Peter Zijlstra
2012-10-25 12:16 ` [PATCH 28/31] sched, numa, mm: Implement constant, per task Working Set Sampling (WSS) rate Peter Zijlstra
2012-11-01 15:48   ` Mel Gorman
2012-10-25 12:16 ` [PATCH 29/31] sched, numa, mm: Add NUMA_MIGRATION feature flag Peter Zijlstra
2012-10-25 12:16 ` [PATCH 30/31] sched, numa, mm: Implement slow start for working set sampling Peter Zijlstra
2012-11-01 15:52   ` Mel Gorman
2012-10-25 12:16 ` [PATCH 31/31] sched, numa, mm: Add memcg support to do_huge_pmd_numa_page() Peter Zijlstra
2012-10-26  9:07 ` [PATCH 00/31] numa/core patches Zhouping Liu
2012-10-26  9:08   ` Peter Zijlstra
2012-10-26  9:20     ` Ingo Molnar
2012-10-26  9:41       ` Zhouping Liu
2012-10-26 10:20       ` Zhouping Liu
2012-10-26 10:24         ` Ingo Molnar
2012-10-28 17:56     ` Johannes Weiner
2012-10-29  2:44       ` Zhouping Liu
2012-10-29  6:50         ` [PATCH] sched, numa, mm: Add memcg support to do_huge_pmd_numa_page() Ingo Molnar
2012-10-29  8:24           ` Johannes Weiner
2012-10-29  8:36             ` Zhouping Liu
2012-10-29 11:15             ` Ingo Molnar
2012-10-30  6:29       ` [PATCH 00/31] numa/core patches Zhouping Liu
2012-10-31  0:48         ` Johannes Weiner
2012-10-31  7:26           ` Hugh Dickins
2012-10-31 13:15             ` Zhouping Liu
2012-10-31 17:31               ` Hugh Dickins
2012-11-01 13:41                 ` Hugh Dickins
2012-11-02  3:23                   ` Zhouping Liu
2012-11-02 23:06                     ` Hugh Dickins
2012-10-30 12:20 ` Mel Gorman
2012-10-30 15:28   ` Andrew Morton
2012-10-30 16:59     ` Mel Gorman
2012-11-03 11:04   ` Alex Shi
2012-11-03 12:21     ` Mel Gorman
2012-11-10  2:47       ` Alex Shi
2012-11-12  9:50         ` Mel Gorman
2012-11-09  8:51   ` Rik van Riel
2012-11-05 17:11 ` Srikar Dronamraju

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).