[RFC PATCH 0/4] TLB flush multiple pages with a single IPI v2

All of lore.kernel.org
 help / color / mirror / Atom feed

* [RFC PATCH 0/4] TLB flush multiple pages with a single IPI v2
@ 2015-04-16 10:22 ` Mel Gorman
  0 siblings, 0 replies; 36+ messages in thread
From: Mel Gorman @ 2015-04-16 10:22 UTC (permalink / raw)
  To: Linux-MM
  Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen,
	LKML, Mel Gorman

Changelog since V1
o Structure and variable renaming				(hughd)
o Defer flushes even if the unmapping process is sleeping	(huged)
o Alternative sizing of structure				(peterz)
o Use GFP_KERNEL instead of GFP_ATOMIC, PF_MEMALLOC protects	(andi)
o Immediately flush dirty PTEs to avoid corruption		(mel)
o Further clarify docs on the required arch guarantees		(mel)

When unmapping pages it is necessary to flush the TLB. If that page was
accessed by another CPU then an IPI is used to flush the remote CPU. That
is a lot of IPIs if kswapd is scanning and unmapping >100K pages per second.

There already is a window between when a page is unmapped and when it is
TLB flushed. This series simply increases the window so multiple pages can
be flushed using a single IPI.

Patch 1 simply made the rest of the series easier to write as ftrace
	could identify all the senders of TLB flush IPIS.

Patch 2 collects a list of PFNs and sends one IPI to flush them all

Patch 3 uses more memory so further defer when the IPI gets sent

Patch 4 uses the same infrastructure as patch 2 to batch IPIs sent during
	page migration.

The performance impact is documented in the changelogs but in the optimistic
case on a 4-socket machine the full series reduces interrupts from 900K
interrupts/second to 60K interrupts/second.

 arch/x86/Kconfig                |  1 +
 arch/x86/include/asm/tlbflush.h |  2 +
 arch/x86/mm/tlb.c               |  1 +
 include/linux/init_task.h       |  8 ++++
 include/linux/mm_types.h        |  1 +
 include/linux/rmap.h            |  3 ++
 include/linux/sched.h           | 15 +++++++
 include/trace/events/tlb.h      |  3 +-
 init/Kconfig                    |  8 ++++
 kernel/fork.c                   |  7 +++
 kernel/sched/core.c             |  3 ++
 mm/internal.h                   | 16 +++++++
 mm/migrate.c                    |  6 ++-
 mm/rmap.c                       | 99 ++++++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                     | 33 +++++++++++++-
 15 files changed, 201 insertions(+), 5 deletions(-)

-- 
2.1.2


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [RFC PATCH 0/4] TLB flush multiple pages with a single IPI v2
@ 2015-04-16 10:22 ` Mel Gorman
  0 siblings, 0 replies; 36+ messages in thread
From: Mel Gorman @ 2015-04-16 10:22 UTC (permalink / raw)
  To: Linux-MM
  Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen,
	LKML, Mel Gorman

Changelog since V1
o Structure and variable renaming				(hughd)
o Defer flushes even if the unmapping process is sleeping	(huged)
o Alternative sizing of structure				(peterz)
o Use GFP_KERNEL instead of GFP_ATOMIC, PF_MEMALLOC protects	(andi)
o Immediately flush dirty PTEs to avoid corruption		(mel)
o Further clarify docs on the required arch guarantees		(mel)

When unmapping pages it is necessary to flush the TLB. If that page was
accessed by another CPU then an IPI is used to flush the remote CPU. That
is a lot of IPIs if kswapd is scanning and unmapping >100K pages per second.

There already is a window between when a page is unmapped and when it is
TLB flushed. This series simply increases the window so multiple pages can
be flushed using a single IPI.

Patch 1 simply made the rest of the series easier to write as ftrace
	could identify all the senders of TLB flush IPIS.

Patch 2 collects a list of PFNs and sends one IPI to flush them all

Patch 3 uses more memory so further defer when the IPI gets sent

Patch 4 uses the same infrastructure as patch 2 to batch IPIs sent during
	page migration.

The performance impact is documented in the changelogs but in the optimistic
case on a 4-socket machine the full series reduces interrupts from 900K
interrupts/second to 60K interrupts/second.

 arch/x86/Kconfig                |  1 +
 arch/x86/include/asm/tlbflush.h |  2 +
 arch/x86/mm/tlb.c               |  1 +
 include/linux/init_task.h       |  8 ++++
 include/linux/mm_types.h        |  1 +
 include/linux/rmap.h            |  3 ++
 include/linux/sched.h           | 15 +++++++
 include/trace/events/tlb.h      |  3 +-
 init/Kconfig                    |  8 ++++
 kernel/fork.c                   |  7 +++
 kernel/sched/core.c             |  3 ++
 mm/internal.h                   | 16 +++++++
 mm/migrate.c                    |  6 ++-
 mm/rmap.c                       | 99 ++++++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                     | 33 +++++++++++++-
 15 files changed, 201 insertions(+), 5 deletions(-)

-- 
2.1.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent
  2015-04-16 10:22 ` Mel Gorman
@ 2015-04-16 10:22   ` Mel Gorman
  -1 siblings, 0 replies; 36+ messages in thread
From: Mel Gorman @ 2015-04-16 10:22 UTC (permalink / raw)
  To: Linux-MM
  Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen,
	LKML, Mel Gorman

It is easy to trace when an IPI is received to flush a TLB but harder to
detect what event sent it. This patch makes it easy to identify the source
of IPIs being transmitted for TLB flushes on x86.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/x86/mm/tlb.c          | 1 +
 include/linux/mm_types.h   | 1 +
 include/trace/events/tlb.h | 3 ++-
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 3250f2371aea..2da824c1c140 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -140,6 +140,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 	info.flush_end = end;
 
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
+	trace_tlb_flush(TLB_REMOTE_SEND_IPI, end - start);
 	if (is_uv_system()) {
 		unsigned int cpu;
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 199a03aab8dc..856038aa166e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -532,6 +532,7 @@ enum tlb_flush_reason {
 	TLB_REMOTE_SHOOTDOWN,
 	TLB_LOCAL_SHOOTDOWN,
 	TLB_LOCAL_MM_SHOOTDOWN,
+	TLB_REMOTE_SEND_IPI,
 	NR_TLB_FLUSH_REASONS,
 };
 
diff --git a/include/trace/events/tlb.h b/include/trace/events/tlb.h
index 0e7635765153..0fc101472988 100644
--- a/include/trace/events/tlb.h
+++ b/include/trace/events/tlb.h
@@ -11,7 +11,8 @@
 	{ TLB_FLUSH_ON_TASK_SWITCH,	"flush on task switch" },	\
 	{ TLB_REMOTE_SHOOTDOWN,		"remote shootdown" },		\
 	{ TLB_LOCAL_SHOOTDOWN,		"local shootdown" },		\
-	{ TLB_LOCAL_MM_SHOOTDOWN,	"local mm shootdown" }
+	{ TLB_LOCAL_MM_SHOOTDOWN,	"local mm shootdown" },		\
+	{ TLB_REMOTE_SEND_IPI,		"remote ipi send" }
 
 TRACE_EVENT_CONDITION(tlb_flush,
 
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent
@ 2015-04-16 10:22   ` Mel Gorman
  0 siblings, 0 replies; 36+ messages in thread
From: Mel Gorman @ 2015-04-16 10:22 UTC (permalink / raw)
  To: Linux-MM
  Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen,
	LKML, Mel Gorman

It is easy to trace when an IPI is received to flush a TLB but harder to
detect what event sent it. This patch makes it easy to identify the source
of IPIs being transmitted for TLB flushes on x86.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/x86/mm/tlb.c          | 1 +
 include/linux/mm_types.h   | 1 +
 include/trace/events/tlb.h | 3 ++-
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 3250f2371aea..2da824c1c140 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -140,6 +140,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 	info.flush_end = end;
 
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
+	trace_tlb_flush(TLB_REMOTE_SEND_IPI, end - start);
 	if (is_uv_system()) {
 		unsigned int cpu;
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 199a03aab8dc..856038aa166e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -532,6 +532,7 @@ enum tlb_flush_reason {
 	TLB_REMOTE_SHOOTDOWN,
 	TLB_LOCAL_SHOOTDOWN,
 	TLB_LOCAL_MM_SHOOTDOWN,
+	TLB_REMOTE_SEND_IPI,
 	NR_TLB_FLUSH_REASONS,
 };
 
diff --git a/include/trace/events/tlb.h b/include/trace/events/tlb.h
index 0e7635765153..0fc101472988 100644
--- a/include/trace/events/tlb.h
+++ b/include/trace/events/tlb.h
@@ -11,7 +11,8 @@
 	{ TLB_FLUSH_ON_TASK_SWITCH,	"flush on task switch" },	\
 	{ TLB_REMOTE_SHOOTDOWN,		"remote shootdown" },		\
 	{ TLB_LOCAL_SHOOTDOWN,		"local shootdown" },		\
-	{ TLB_LOCAL_MM_SHOOTDOWN,	"local mm shootdown" }
+	{ TLB_LOCAL_MM_SHOOTDOWN,	"local mm shootdown" },		\
+	{ TLB_REMOTE_SEND_IPI,		"remote ipi send" }
 
 TRACE_EVENT_CONDITION(tlb_flush,
 
-- 
2.1.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 2/4] mm: Send a single IPI to TLB flush multiple pages when unmapping
  2015-04-16 10:22 ` Mel Gorman
@ 2015-04-16 10:22   ` Mel Gorman
  -1 siblings, 0 replies; 36+ messages in thread
From: Mel Gorman @ 2015-04-16 10:22 UTC (permalink / raw)
  To: Linux-MM
  Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen,
	LKML, Mel Gorman

An IPI is sent to flush remote TLBs when a page is unmapped that was
recently accessed by other CPUs. There are many circumstances where this
happens but the obvious one is kswapd reclaiming pages belonging to a
running process as kswapd and the task are likely running on separate CPUs.

On small machines, this is not a significant problem but as machine
gets larger with more cores and more memory, the cost of these IPIs can
be high. This patch uses a structure similar in principle to a pagevec
to collect a list of PFNs and CPUs that require flushing. It then sends
one IPI to flush the list of PFNs. A new TLB flush helper is required for
this and one is added for x86. Other architectures will need to decide if
batching like this is both safe and worth the memory overhead. Specifically
the requirement is;

	If a clean page is unmapped and not immediately flushed, the
	architecture must guarantee that a write to that page from a CPU
	with a cached TLB entry will trap a page fault.

This is essentially what the kernel already depends on but the window is
much larger with this patch applied and is worth highlighting.

The impact of this patch depends on the workload as measuring any benefit
requires both mapped pages co-located on the LRU and memory pressure. The
case with the biggest impact is multiple processes reading mapped pages
taken from the vm-scalability test suite. The test case uses NR_CPU readers
of mapped files that consume 10*RAM.

vmscale on a 4-node machine with 64G RAM and 48 CPUs
                                                4.0.0                      4.0.0
                                              vanilla              batchunmap-v2
lru-file-mmap-read-elapsed           159.76 (  0.00%)           119.26 ( 26.91%)

               4.0.0       4.0.0
             vanilla   batchunmap-v2
User 	      567.53	      613.38
System 	     5949.65	     4120.03
Elapsed	      161.08	      120.60

This is showing that the readers completed 26% with 30% less CPU time. From
vmstats, it is known that the vanilla kernel was interrupted roughly 900K
times per second during the steady phase of the test and the patched kernel
was interrupts 180K times per second.

The impact is much lower on a small machine

vmscale on a 1-node machine with 8G RAM and 1 CPU
                                                4.0.0                      4.0.0
                                              vanilla              batchunmap-v2
Ops lru-file-mmap-read-elapsed        22.50 (  0.00%)            19.91 ( 11.51%)

               4.0.0       4.0.0
             vanilla   batchunmap-v1
User           33.64       33.22
System         36.22       33.63
Elapsed        24.11       21.46

It's still a noticeable improvement with vmstat showing interrupts went
from roughly 500K per second to 45K per second.

The patch will have no impact on workloads with no memory pressure or
have relatively few mapped pages.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/x86/Kconfig                |  1 +
 arch/x86/include/asm/tlbflush.h |  2 +
 include/linux/init_task.h       |  8 ++++
 include/linux/rmap.h            |  3 ++
 include/linux/sched.h           | 14 ++++++
 init/Kconfig                    |  8 ++++
 kernel/fork.c                   |  5 +++
 kernel/sched/core.c             |  3 ++
 mm/internal.h                   | 11 +++++
 mm/rmap.c                       | 99 ++++++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                     | 32 ++++++++++++-
 11 files changed, 183 insertions(+), 3 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b7d31ca55187..290844263218 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -30,6 +30,7 @@ config X86
 	select ARCH_MIGHT_HAVE_PC_SERIO
 	select HAVE_AOUT if X86_32
 	select HAVE_UNSTABLE_SCHED_CLOCK
+	select ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
 	select ARCH_SUPPORTS_NUMA_BALANCING if X86_64
 	select ARCH_SUPPORTS_INT128 if X86_64
 	select HAVE_IDE
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index cd791948b286..96a27051a70a 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -152,6 +152,8 @@ static inline void __flush_tlb_one(unsigned long addr)
  * and page-granular flushes are available only on i486 and up.
  */
 
+#define flush_local_tlb_addr(addr) __flush_tlb_one(addr)
+
 #ifndef CONFIG_SMP
 
 /* "_up" is for UniProcessor.
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 696d22312b31..0771937b47e1 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -175,6 +175,13 @@ extern struct task_group root_task_group;
 # define INIT_NUMA_BALANCING(tsk)
 #endif
 
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)				\
+	.tlb_ubc = NULL,
+#else
+# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)
+#endif
+
 #ifdef CONFIG_KASAN
 # define INIT_KASAN(tsk)						\
 	.kasan_depth = 1,
@@ -257,6 +264,7 @@ extern struct task_group root_task_group;
 	INIT_RT_MUTEXES(tsk)						\
 	INIT_VTIME(tsk)							\
 	INIT_NUMA_BALANCING(tsk)					\
+	INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)				\
 	INIT_KASAN(tsk)							\
 }
 
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index c4c559a45dc8..8d23914b219e 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -89,6 +89,9 @@ enum ttu_flags {
 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
 	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
 	TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
+	TTU_BATCH_FLUSH = (1 << 11),	/* Batch TLB flushes where possible
+					 * and caller guarantees they will
+					 * do a final flush if necessary */
 };
 
 #ifdef CONFIG_MMU
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a419b65770d6..5c09db02fe78 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1275,6 +1275,16 @@ enum perf_event_task_context {
 	perf_nr_task_contexts,
 };
 
+/* Matches SWAP_CLUSTER_MAX but refined to limit header dependencies */
+#define BATCH_TLBFLUSH_SIZE 32UL
+
+/* Track pages that require TLB flushes */
+struct tlbflush_unmap_batch {
+	struct cpumask cpumask;
+	unsigned long nr_pages;
+	unsigned long pfns[BATCH_TLBFLUSH_SIZE];
+};
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	void *stack;
@@ -1634,6 +1644,10 @@ struct task_struct {
 	unsigned long numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
 
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+	struct tlbflush_unmap_batch *tlb_ubc;
+#endif
+
 	struct rcu_head rcu;
 
 	/*
diff --git a/init/Kconfig b/init/Kconfig
index f5dbc6d4261b..f519fbb6ac35 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -889,6 +889,14 @@ config ARCH_SUPPORTS_NUMA_BALANCING
 	bool
 
 #
+# For architectures that have a local TLB flush for a PFN without knowledge
+# of the VMA. The architecture must provide guarantees on what happens if
+# a clean TLB cache entry is written after the unmap. Details are in mm/rmap.c
+# near the check for should_defer_flush.
+config ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+	bool
+
+#
 # For architectures that know their GCC __int128 support is sound
 #
 config ARCH_SUPPORTS_INT128
diff --git a/kernel/fork.c b/kernel/fork.c
index cf65139615a0..86c872fec9fb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -246,6 +246,11 @@ void __put_task_struct(struct task_struct *tsk)
 	delayacct_tsk_free(tsk);
 	put_signal_struct(tsk->signal);
 
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+	kfree(tsk->tlb_ubc);
+	tsk->tlb_ubc = NULL;
+#endif
+
 	if (!profile_handoff_task(tsk))
 		free_task(tsk);
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 62671f53202a..9836a28d001b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1823,6 +1823,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 
 	p->numa_group = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+	p->tlb_ubc = NULL;
+#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
 }
 
 #ifdef CONFIG_NUMA_BALANCING
diff --git a/mm/internal.h b/mm/internal.h
index a96da5b0029d..35aba439c275 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -431,4 +431,15 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_CMA		0x80 /* allow allocations from CMA areas */
 #define ALLOC_FAIR		0x100 /* fair zone allocation */
 
+enum ttu_flags;
+struct tlbflush_unmap_batch;
+
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+void try_to_unmap_flush(void);
+#else
+static inline void try_to_unmap_flush(void)
+{
+}
+
+#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/rmap.c b/mm/rmap.c
index c161a14b6a8f..576267f64655 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -60,6 +60,8 @@
 
 #include <asm/tlbflush.h>
 
+#include <trace/events/tlb.h>
+
 #include "internal.h"
 
 static struct kmem_cache *anon_vma_cachep;
@@ -581,6 +583,74 @@ vma_address(struct page *page, struct vm_area_struct *vma)
 	return address;
 }
 
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+static void percpu_flush_tlb_batch_pages(void *data)
+{
+	struct tlbflush_unmap_batch *tlb_ubc = data;
+	int i;
+
+	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
+	for (i = 0; i < tlb_ubc->nr_pages; i++)
+		flush_local_tlb_addr(tlb_ubc->pfns[i] << PAGE_SHIFT);
+}
+
+void try_to_unmap_flush(void)
+{
+	struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc;
+
+	if (!tlb_ubc || !tlb_ubc->nr_pages)
+		return;
+
+	trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, tlb_ubc->nr_pages);
+	smp_call_function_many(&tlb_ubc->cpumask, percpu_flush_tlb_batch_pages,
+		(void *)tlb_ubc, true);
+	cpumask_clear(&tlb_ubc->cpumask);
+	tlb_ubc->nr_pages = 0;
+}
+
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
+		struct page *page)
+{
+	struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc;
+
+	cpumask_or(&tlb_ubc->cpumask, &tlb_ubc->cpumask, mm_cpumask(mm));
+	tlb_ubc->pfns[tlb_ubc->nr_pages] = page_to_pfn(page);
+	tlb_ubc->nr_pages++;
+
+	if (tlb_ubc->nr_pages == BATCH_TLBFLUSH_SIZE)
+		try_to_unmap_flush();
+}
+
+/*
+ * Returns true if the TLB flush should be deferred to the end of a batch of
+ * unmap operations to reduce IPIs.
+ */
+static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
+{
+	bool should_defer = false;
+
+	if (!current->tlb_ubc || !(flags & TTU_BATCH_FLUSH))
+		return false;
+
+	/* If remote CPUs need to be flushed then defer batch the flush */
+	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
+		should_defer = true;
+	put_cpu();
+
+	return should_defer;
+}
+#else
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
+		struct page *page)
+{
+}
+
+static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
+{
+	return false;
+}
+#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
+
 /*
  * At what user virtual address is page expected in vma?
  * Caller should check the page is actually part of the vma.
@@ -1186,6 +1256,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	pte_t pteval;
 	spinlock_t *ptl;
 	int ret = SWAP_AGAIN;
+	bool deferred;
 	enum ttu_flags flags = (enum ttu_flags)arg;
 
 	pte = page_check_address(page, mm, address, &ptl, 0);
@@ -1213,11 +1284,35 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 
 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address, page_to_pfn(page));
-	pteval = ptep_clear_flush(vma, address, pte);
+	deferred = should_defer_flush(mm, flags);
+	if (deferred) {
+		/*
+		 * We clear the PTE but do not flush so potentially a remote
+		 * CPU could still be writing to the page. If the entry was
+		 * previously clean then the architecture must guarantee that
+		 * a clear->dirty transition on a cached TLB entry is written
+		 * through and traps if the PTE is unmapped. If the entry is
+		 * already dirty then it's handled below by the pte_dirty check.
+		 */
+		pteval = ptep_get_and_clear(mm, address, pte);
+		set_tlb_ubc_flush_pending(mm, page);
+	} else {
+		pteval = ptep_clear_flush(vma, address, pte);
+	}
 
 	/* Move the dirty bit to the physical page now the pte is gone. */
-	if (pte_dirty(pteval))
+	if (pte_dirty(pteval)) {
+		/*
+		 * If the PTE was dirty then the TLB must be flushed before
+		 * the page is unlocked as IO can start in parallel. Without
+		 * the flush, writes could still happen and data would be
+		 * potentially lost.
+		 */
+		if (deferred)
+			flush_tlb_page(vma, address);
+
 		set_page_dirty(page);
+	}
 
 	/* Update high watermark before we lower rss */
 	update_hiwater_rss(mm);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e8eadd71bac..f668c8fa78fd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1024,7 +1024,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 * processes. Try to unmap it here.
 		 */
 		if (page_mapped(page) && mapping) {
-			switch (try_to_unmap(page, ttu_flags)) {
+			switch (try_to_unmap(page,
+					ttu_flags|TTU_BATCH_FLUSH)) {
 			case SWAP_FAIL:
 				goto activate_locked;
 			case SWAP_AGAIN:
@@ -1211,6 +1212,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 	ret = shrink_page_list(&clean_pages, zone, &sc,
 			TTU_UNMAP|TTU_IGNORE_ACCESS,
 			&dummy1, &dummy2, &dummy3, &dummy4, &dummy5, true);
+	try_to_unmap_flush();
 	list_splice(&clean_pages, page_list);
 	mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret);
 	return ret;
@@ -2223,6 +2225,7 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
 		scan_adjusted = true;
 	}
 	blk_finish_plug(&plug);
+	try_to_unmap_flush();
 	sc->nr_reclaimed += nr_reclaimed;
 
 	/*
@@ -2762,6 +2765,30 @@ out:
 	return false;
 }
 
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+static inline void alloc_tlb_ubc(void)
+{
+	if (current->tlb_ubc)
+		return;
+
+	/*
+	 * Allocate the control structure for batch TLB flushing. Harmless if
+	 * the allocation fails as reclaimer will just send more IPIs.
+	 */
+	current->tlb_ubc = kmalloc(sizeof(struct tlbflush_unmap_batch),
+						GFP_ATOMIC | __GFP_NOWARN);
+	if (!current->tlb_ubc)
+		return;
+
+	cpumask_clear(&current->tlb_ubc->cpumask);
+	current->tlb_ubc->nr_pages = 0;
+}
+#else
+static inline void alloc_tlb_ubc(void)
+{
+}
+#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
+
 unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 				gfp_t gfp_mask, nodemask_t *nodemask)
 {
@@ -2789,6 +2816,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 				sc.may_writepage,
 				gfp_mask);
 
+	alloc_tlb_ubc();
 	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
 
 	trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
@@ -3364,6 +3392,8 @@ static int kswapd(void *p)
 
 	lockdep_set_current_reclaim_state(GFP_KERNEL);
 
+	alloc_tlb_ubc();
+
 	if (!cpumask_empty(cpumask))
 		set_cpus_allowed_ptr(tsk, cpumask);
 	current->reclaim_state = &reclaim_state;
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 2/4] mm: Send a single IPI to TLB flush multiple pages when unmapping
@ 2015-04-16 10:22   ` Mel Gorman
  0 siblings, 0 replies; 36+ messages in thread
From: Mel Gorman @ 2015-04-16 10:22 UTC (permalink / raw)
  To: Linux-MM
  Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen,
	LKML, Mel Gorman

An IPI is sent to flush remote TLBs when a page is unmapped that was
recently accessed by other CPUs. There are many circumstances where this
happens but the obvious one is kswapd reclaiming pages belonging to a
running process as kswapd and the task are likely running on separate CPUs.

On small machines, this is not a significant problem but as machine
gets larger with more cores and more memory, the cost of these IPIs can
be high. This patch uses a structure similar in principle to a pagevec
to collect a list of PFNs and CPUs that require flushing. It then sends
one IPI to flush the list of PFNs. A new TLB flush helper is required for
this and one is added for x86. Other architectures will need to decide if
batching like this is both safe and worth the memory overhead. Specifically
the requirement is;

	If a clean page is unmapped and not immediately flushed, the
	architecture must guarantee that a write to that page from a CPU
	with a cached TLB entry will trap a page fault.

This is essentially what the kernel already depends on but the window is
much larger with this patch applied and is worth highlighting.

The impact of this patch depends on the workload as measuring any benefit
requires both mapped pages co-located on the LRU and memory pressure. The
case with the biggest impact is multiple processes reading mapped pages
taken from the vm-scalability test suite. The test case uses NR_CPU readers
of mapped files that consume 10*RAM.

vmscale on a 4-node machine with 64G RAM and 48 CPUs
                                                4.0.0                      4.0.0
                                              vanilla              batchunmap-v2
lru-file-mmap-read-elapsed           159.76 (  0.00%)           119.26 ( 26.91%)

               4.0.0       4.0.0
             vanilla   batchunmap-v2
User 	      567.53	      613.38
System 	     5949.65	     4120.03
Elapsed	      161.08	      120.60

This is showing that the readers completed 26% with 30% less CPU time. From
vmstats, it is known that the vanilla kernel was interrupted roughly 900K
times per second during the steady phase of the test and the patched kernel
was interrupts 180K times per second.

The impact is much lower on a small machine

vmscale on a 1-node machine with 8G RAM and 1 CPU
                                                4.0.0                      4.0.0
                                              vanilla              batchunmap-v2
Ops lru-file-mmap-read-elapsed        22.50 (  0.00%)            19.91 ( 11.51%)

               4.0.0       4.0.0
             vanilla   batchunmap-v1
User           33.64       33.22
System         36.22       33.63
Elapsed        24.11       21.46

It's still a noticeable improvement with vmstat showing interrupts went
from roughly 500K per second to 45K per second.

The patch will have no impact on workloads with no memory pressure or
have relatively few mapped pages.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/x86/Kconfig                |  1 +
 arch/x86/include/asm/tlbflush.h |  2 +
 include/linux/init_task.h       |  8 ++++
 include/linux/rmap.h            |  3 ++
 include/linux/sched.h           | 14 ++++++
 init/Kconfig                    |  8 ++++
 kernel/fork.c                   |  5 +++
 kernel/sched/core.c             |  3 ++
 mm/internal.h                   | 11 +++++
 mm/rmap.c                       | 99 ++++++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                     | 32 ++++++++++++-
 11 files changed, 183 insertions(+), 3 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b7d31ca55187..290844263218 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -30,6 +30,7 @@ config X86
 	select ARCH_MIGHT_HAVE_PC_SERIO
 	select HAVE_AOUT if X86_32
 	select HAVE_UNSTABLE_SCHED_CLOCK
+	select ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
 	select ARCH_SUPPORTS_NUMA_BALANCING if X86_64
 	select ARCH_SUPPORTS_INT128 if X86_64
 	select HAVE_IDE
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index cd791948b286..96a27051a70a 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -152,6 +152,8 @@ static inline void __flush_tlb_one(unsigned long addr)
  * and page-granular flushes are available only on i486 and up.
  */
 
+#define flush_local_tlb_addr(addr) __flush_tlb_one(addr)
+
 #ifndef CONFIG_SMP
 
 /* "_up" is for UniProcessor.
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 696d22312b31..0771937b47e1 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -175,6 +175,13 @@ extern struct task_group root_task_group;
 # define INIT_NUMA_BALANCING(tsk)
 #endif
 
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)				\
+	.tlb_ubc = NULL,
+#else
+# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)
+#endif
+
 #ifdef CONFIG_KASAN
 # define INIT_KASAN(tsk)						\
 	.kasan_depth = 1,
@@ -257,6 +264,7 @@ extern struct task_group root_task_group;
 	INIT_RT_MUTEXES(tsk)						\
 	INIT_VTIME(tsk)							\
 	INIT_NUMA_BALANCING(tsk)					\
+	INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)				\
 	INIT_KASAN(tsk)							\
 }
 
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index c4c559a45dc8..8d23914b219e 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -89,6 +89,9 @@ enum ttu_flags {
 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
 	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
 	TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
+	TTU_BATCH_FLUSH = (1 << 11),	/* Batch TLB flushes where possible
+					 * and caller guarantees they will
+					 * do a final flush if necessary */
 };
 
 #ifdef CONFIG_MMU
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a419b65770d6..5c09db02fe78 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1275,6 +1275,16 @@ enum perf_event_task_context {
 	perf_nr_task_contexts,
 };
 
+/* Matches SWAP_CLUSTER_MAX but refined to limit header dependencies */
+#define BATCH_TLBFLUSH_SIZE 32UL
+
+/* Track pages that require TLB flushes */
+struct tlbflush_unmap_batch {
+	struct cpumask cpumask;
+	unsigned long nr_pages;
+	unsigned long pfns[BATCH_TLBFLUSH_SIZE];
+};
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	void *stack;
@@ -1634,6 +1644,10 @@ struct task_struct {
 	unsigned long numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
 
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+	struct tlbflush_unmap_batch *tlb_ubc;
+#endif
+
 	struct rcu_head rcu;
 
 	/*
diff --git a/init/Kconfig b/init/Kconfig
index f5dbc6d4261b..f519fbb6ac35 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -889,6 +889,14 @@ config ARCH_SUPPORTS_NUMA_BALANCING
 	bool
 
 #
+# For architectures that have a local TLB flush for a PFN without knowledge
+# of the VMA. The architecture must provide guarantees on what happens if
+# a clean TLB cache entry is written after the unmap. Details are in mm/rmap.c
+# near the check for should_defer_flush.
+config ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+	bool
+
+#
 # For architectures that know their GCC __int128 support is sound
 #
 config ARCH_SUPPORTS_INT128
diff --git a/kernel/fork.c b/kernel/fork.c
index cf65139615a0..86c872fec9fb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -246,6 +246,11 @@ void __put_task_struct(struct task_struct *tsk)
 	delayacct_tsk_free(tsk);
 	put_signal_struct(tsk->signal);
 
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+	kfree(tsk->tlb_ubc);
+	tsk->tlb_ubc = NULL;
+#endif
+
 	if (!profile_handoff_task(tsk))
 		free_task(tsk);
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 62671f53202a..9836a28d001b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1823,6 +1823,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 
 	p->numa_group = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+	p->tlb_ubc = NULL;
+#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
 }
 
 #ifdef CONFIG_NUMA_BALANCING
diff --git a/mm/internal.h b/mm/internal.h
index a96da5b0029d..35aba439c275 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -431,4 +431,15 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_CMA		0x80 /* allow allocations from CMA areas */
 #define ALLOC_FAIR		0x100 /* fair zone allocation */
 
+enum ttu_flags;
+struct tlbflush_unmap_batch;
+
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+void try_to_unmap_flush(void);
+#else
+static inline void try_to_unmap_flush(void)
+{
+}
+
+#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/rmap.c b/mm/rmap.c
index c161a14b6a8f..576267f64655 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -60,6 +60,8 @@
 
 #include <asm/tlbflush.h>
 
+#include <trace/events/tlb.h>
+
 #include "internal.h"
 
 static struct kmem_cache *anon_vma_cachep;
@@ -581,6 +583,74 @@ vma_address(struct page *page, struct vm_area_struct *vma)
 	return address;
 }
 
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+static void percpu_flush_tlb_batch_pages(void *data)
+{
+	struct tlbflush_unmap_batch *tlb_ubc = data;
+	int i;
+
+	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
+	for (i = 0; i < tlb_ubc->nr_pages; i++)
+		flush_local_tlb_addr(tlb_ubc->pfns[i] << PAGE_SHIFT);
+}
+
+void try_to_unmap_flush(void)
+{
+	struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc;
+
+	if (!tlb_ubc || !tlb_ubc->nr_pages)
+		return;
+
+	trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, tlb_ubc->nr_pages);
+	smp_call_function_many(&tlb_ubc->cpumask, percpu_flush_tlb_batch_pages,
+		(void *)tlb_ubc, true);
+	cpumask_clear(&tlb_ubc->cpumask);
+	tlb_ubc->nr_pages = 0;
+}
+
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
+		struct page *page)
+{
+	struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc;
+
+	cpumask_or(&tlb_ubc->cpumask, &tlb_ubc->cpumask, mm_cpumask(mm));
+	tlb_ubc->pfns[tlb_ubc->nr_pages] = page_to_pfn(page);
+	tlb_ubc->nr_pages++;
+
+	if (tlb_ubc->nr_pages == BATCH_TLBFLUSH_SIZE)
+		try_to_unmap_flush();
+}
+
+/*
+ * Returns true if the TLB flush should be deferred to the end of a batch of
+ * unmap operations to reduce IPIs.
+ */
+static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
+{
+	bool should_defer = false;
+
+	if (!current->tlb_ubc || !(flags & TTU_BATCH_FLUSH))
+		return false;
+
+	/* If remote CPUs need to be flushed then defer batch the flush */
+	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
+		should_defer = true;
+	put_cpu();
+
+	return should_defer;
+}
+#else
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
+		struct page *page)
+{
+}
+
+static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
+{
+	return false;
+}
+#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
+
 /*
  * At what user virtual address is page expected in vma?
  * Caller should check the page is actually part of the vma.
@@ -1186,6 +1256,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	pte_t pteval;
 	spinlock_t *ptl;
 	int ret = SWAP_AGAIN;
+	bool deferred;
 	enum ttu_flags flags = (enum ttu_flags)arg;
 
 	pte = page_check_address(page, mm, address, &ptl, 0);
@@ -1213,11 +1284,35 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 
 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address, page_to_pfn(page));
-	pteval = ptep_clear_flush(vma, address, pte);
+	deferred = should_defer_flush(mm, flags);
+	if (deferred) {
+		/*
+		 * We clear the PTE but do not flush so potentially a remote
+		 * CPU could still be writing to the page. If the entry was
+		 * previously clean then the architecture must guarantee that
+		 * a clear->dirty transition on a cached TLB entry is written
+		 * through and traps if the PTE is unmapped. If the entry is
+		 * already dirty then it's handled below by the pte_dirty check.
+		 */
+		pteval = ptep_get_and_clear(mm, address, pte);
+		set_tlb_ubc_flush_pending(mm, page);
+	} else {
+		pteval = ptep_clear_flush(vma, address, pte);
+	}
 
 	/* Move the dirty bit to the physical page now the pte is gone. */
-	if (pte_dirty(pteval))
+	if (pte_dirty(pteval)) {
+		/*
+		 * If the PTE was dirty then the TLB must be flushed before
+		 * the page is unlocked as IO can start in parallel. Without
+		 * the flush, writes could still happen and data would be
+		 * potentially lost.
+		 */
+		if (deferred)
+			flush_tlb_page(vma, address);
+
 		set_page_dirty(page);
+	}
 
 	/* Update high watermark before we lower rss */
 	update_hiwater_rss(mm);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e8eadd71bac..f668c8fa78fd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1024,7 +1024,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 * processes. Try to unmap it here.
 		 */
 		if (page_mapped(page) && mapping) {
-			switch (try_to_unmap(page, ttu_flags)) {
+			switch (try_to_unmap(page,
+					ttu_flags|TTU_BATCH_FLUSH)) {
 			case SWAP_FAIL:
 				goto activate_locked;
 			case SWAP_AGAIN:
@@ -1211,6 +1212,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 	ret = shrink_page_list(&clean_pages, zone, &sc,
 			TTU_UNMAP|TTU_IGNORE_ACCESS,
 			&dummy1, &dummy2, &dummy3, &dummy4, &dummy5, true);
+	try_to_unmap_flush();
 	list_splice(&clean_pages, page_list);
 	mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret);
 	return ret;
@@ -2223,6 +2225,7 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
 		scan_adjusted = true;
 	}
 	blk_finish_plug(&plug);
+	try_to_unmap_flush();
 	sc->nr_reclaimed += nr_reclaimed;
 
 	/*
@@ -2762,6 +2765,30 @@ out:
 	return false;
 }
 
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+static inline void alloc_tlb_ubc(void)
+{
+	if (current->tlb_ubc)
+		return;
+
+	/*
+	 * Allocate the control structure for batch TLB flushing. Harmless if
+	 * the allocation fails as reclaimer will just send more IPIs.
+	 */
+	current->tlb_ubc = kmalloc(sizeof(struct tlbflush_unmap_batch),
+						GFP_ATOMIC | __GFP_NOWARN);
+	if (!current->tlb_ubc)
+		return;
+
+	cpumask_clear(&current->tlb_ubc->cpumask);
+	current->tlb_ubc->nr_pages = 0;
+}
+#else
+static inline void alloc_tlb_ubc(void)
+{
+}
+#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
+
 unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 				gfp_t gfp_mask, nodemask_t *nodemask)
 {
@@ -2789,6 +2816,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 				sc.may_writepage,
 				gfp_mask);
 
+	alloc_tlb_ubc();
 	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
 
 	trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
@@ -3364,6 +3392,8 @@ static int kswapd(void *p)
 
 	lockdep_set_current_reclaim_state(GFP_KERNEL);
 
+	alloc_tlb_ubc();
+
 	if (!cpumask_empty(cpumask))
 		set_cpus_allowed_ptr(tsk, cpumask);
 	current->reclaim_state = &reclaim_state;
-- 
2.1.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 3/4] mm: Gather more PFNs before sending a TLB to flush unmapped pages
  2015-04-16 10:22 ` Mel Gorman
@ 2015-04-16 10:22   ` Mel Gorman
  -1 siblings, 0 replies; 36+ messages in thread
From: Mel Gorman @ 2015-04-16 10:22 UTC (permalink / raw)
  To: Linux-MM
  Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen,
	LKML, Mel Gorman

The patch "mm: Send a single IPI to TLB flush multiple pages when unmapping"
would batch 32 pages before sending an IPI. This patch increases the size of
the data structure to hold a pages worth of PFNs before sending an IPI. This
is a trade-off between memory usage and reducing IPIS sent. In the ideal
case where multiple processes are reading large mapped files, this patch
reduces interrupts/second from roughly 180K per second to 60K per second.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  9 +++++----
 kernel/fork.c         |  6 ++++--
 mm/vmscan.c           | 13 +++++++------
 3 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5c09db02fe78..3e4d3f545005 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1275,16 +1275,17 @@ enum perf_event_task_context {
 	perf_nr_task_contexts,
 };
 
-/* Matches SWAP_CLUSTER_MAX but refined to limit header dependencies */
-#define BATCH_TLBFLUSH_SIZE 32UL
-
 /* Track pages that require TLB flushes */
 struct tlbflush_unmap_batch {
 	struct cpumask cpumask;
 	unsigned long nr_pages;
-	unsigned long pfns[BATCH_TLBFLUSH_SIZE];
+	unsigned long pfns[0];
 };
 
+/* alloc_tlb_ubc() always allocates a page */
+#define BATCH_TLBFLUSH_SIZE \
+	((PAGE_SIZE - sizeof(struct tlbflush_unmap_batch)) / sizeof(unsigned long))
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	void *stack;
diff --git a/kernel/fork.c b/kernel/fork.c
index 86c872fec9fb..f260663f209a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -247,8 +247,10 @@ void __put_task_struct(struct task_struct *tsk)
 	put_signal_struct(tsk->signal);
 
 #ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
-	kfree(tsk->tlb_ubc);
-	tsk->tlb_ubc = NULL;
+	if (tsk->tlb_ubc) {
+		free_page((unsigned long)tsk->tlb_ubc);
+		tsk->tlb_ubc = NULL;
+	}
 #endif
 
 	if (!profile_handoff_task(tsk))
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f668c8fa78fd..a8dde281652a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2766,17 +2766,18 @@ out:
 }
 
 #ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+/*
+ * Allocate the control structure for batch TLB flushing. An allocation
+ * failure is harmless as the reclaimer will send IPIs where necessary.
+ * If the allocation size changes then update BATCH_TLBFLUSH_SIZE.
+ */
 static inline void alloc_tlb_ubc(void)
 {
 	if (current->tlb_ubc)
 		return;
 
-	/*
-	 * Allocate the control structure for batch TLB flushing. Harmless if
-	 * the allocation fails as reclaimer will just send more IPIs.
-	 */
-	current->tlb_ubc = kmalloc(sizeof(struct tlbflush_unmap_batch),
-						GFP_ATOMIC | __GFP_NOWARN);
+	current->tlb_ubc = (struct tlbflush_unmap_batch *)
+				__get_free_page(GFP_KERNEL | __GFP_NOWARN);
 	if (!current->tlb_ubc)
 		return;
 
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 3/4] mm: Gather more PFNs before sending a TLB to flush unmapped pages
@ 2015-04-16 10:22   ` Mel Gorman
  0 siblings, 0 replies; 36+ messages in thread
From: Mel Gorman @ 2015-04-16 10:22 UTC (permalink / raw)
  To: Linux-MM
  Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen,
	LKML, Mel Gorman

The patch "mm: Send a single IPI to TLB flush multiple pages when unmapping"
would batch 32 pages before sending an IPI. This patch increases the size of
the data structure to hold a pages worth of PFNs before sending an IPI. This
is a trade-off between memory usage and reducing IPIS sent. In the ideal
case where multiple processes are reading large mapped files, this patch
reduces interrupts/second from roughly 180K per second to 60K per second.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  9 +++++----
 kernel/fork.c         |  6 ++++--
 mm/vmscan.c           | 13 +++++++------
 3 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5c09db02fe78..3e4d3f545005 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1275,16 +1275,17 @@ enum perf_event_task_context {
 	perf_nr_task_contexts,
 };
 
-/* Matches SWAP_CLUSTER_MAX but refined to limit header dependencies */
-#define BATCH_TLBFLUSH_SIZE 32UL
-
 /* Track pages that require TLB flushes */
 struct tlbflush_unmap_batch {
 	struct cpumask cpumask;
 	unsigned long nr_pages;
-	unsigned long pfns[BATCH_TLBFLUSH_SIZE];
+	unsigned long pfns[0];
 };
 
+/* alloc_tlb_ubc() always allocates a page */
+#define BATCH_TLBFLUSH_SIZE \
+	((PAGE_SIZE - sizeof(struct tlbflush_unmap_batch)) / sizeof(unsigned long))
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	void *stack;
diff --git a/kernel/fork.c b/kernel/fork.c
index 86c872fec9fb..f260663f209a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -247,8 +247,10 @@ void __put_task_struct(struct task_struct *tsk)
 	put_signal_struct(tsk->signal);
 
 #ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
-	kfree(tsk->tlb_ubc);
-	tsk->tlb_ubc = NULL;
+	if (tsk->tlb_ubc) {
+		free_page((unsigned long)tsk->tlb_ubc);
+		tsk->tlb_ubc = NULL;
+	}
 #endif
 
 	if (!profile_handoff_task(tsk))
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f668c8fa78fd..a8dde281652a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2766,17 +2766,18 @@ out:
 }
 
 #ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+/*
+ * Allocate the control structure for batch TLB flushing. An allocation
+ * failure is harmless as the reclaimer will send IPIs where necessary.
+ * If the allocation size changes then update BATCH_TLBFLUSH_SIZE.
+ */
 static inline void alloc_tlb_ubc(void)
 {
 	if (current->tlb_ubc)
 		return;
 
-	/*
-	 * Allocate the control structure for batch TLB flushing. Harmless if
-	 * the allocation fails as reclaimer will just send more IPIs.
-	 */
-	current->tlb_ubc = kmalloc(sizeof(struct tlbflush_unmap_batch),
-						GFP_ATOMIC | __GFP_NOWARN);
+	current->tlb_ubc = (struct tlbflush_unmap_batch *)
+				__get_free_page(GFP_KERNEL | __GFP_NOWARN);
 	if (!current->tlb_ubc)
 		return;
 
-- 
2.1.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 4/4] mm: migrate: Batch TLB flushing when unmapping pages for migration
  2015-04-16 10:22 ` Mel Gorman
@ 2015-04-16 10:22   ` Mel Gorman
  -1 siblings, 0 replies; 36+ messages in thread
From: Mel Gorman @ 2015-04-16 10:22 UTC (permalink / raw)
  To: Linux-MM
  Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen,
	LKML, Mel Gorman

Page reclaim batches multiple TLB flushes into one IPI and this patch teaches
page migration to also batch any necessary flushes. MMtests has a THP scale
microbenchmark that deliberately fragments memory and then allocates THPs
to stress compaction. It's not a page reclaim benchmark and recent kernels
avoid excessive compaction but this patch reduced system CPU usage

               4.0.0       4.0.0
            baseline batchmigrate-v1
User          970.70     1012.24
System       2067.48     1840.00
Elapsed      1520.63     1529.66

Note that this particular workload was not TLB flush intensive with peaks
in interrupts during the compaction phase. The 4.0 kernel peaked at 345K
interrupts/second, the kernel that batches reclaim TLB entries peaked at
13K interrupts/second and this patch peaked at 10K interrupts/second.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/internal.h | 5 +++++
 mm/migrate.c  | 6 +++++-
 mm/vmscan.c   | 2 +-
 3 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 35aba439c275..c2481574b41a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -436,10 +436,15 @@ struct tlbflush_unmap_batch;
 
 #ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
 void try_to_unmap_flush(void);
+void alloc_tlb_ubc(void);
 #else
 static inline void try_to_unmap_flush(void)
 {
 }
 
+static inline void alloc_tlb_ubc(void)
+{
+}
+
 #endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/migrate.c b/mm/migrate.c
index 85e042686031..fda7b320ac00 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -879,7 +879,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	/* Establish migration ptes or remove ptes */
 	if (page_mapped(page)) {
 		try_to_unmap(page,
-			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS|TTU_BATCH_FLUSH);
 		page_was_mapped = 1;
 	}
 
@@ -1098,6 +1098,8 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 	if (!swapwrite)
 		current->flags |= PF_SWAPWRITE;
 
+	alloc_tlb_ubc();
+
 	for(pass = 0; pass < 10 && retry; pass++) {
 		retry = 0;
 
@@ -1144,6 +1146,8 @@ out:
 	if (!swapwrite)
 		current->flags &= ~PF_SWAPWRITE;
 
+	try_to_unmap_flush();
+
 	return rc;
 }
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a8dde281652a..361bf59e0594 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2771,7 +2771,7 @@ out:
  * failure is harmless as the reclaimer will send IPIs where necessary.
  * If the allocation size changes then update BATCH_TLBFLUSH_SIZE.
  */
-static inline void alloc_tlb_ubc(void)
+void alloc_tlb_ubc(void)
 {
 	if (current->tlb_ubc)
 		return;
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 4/4] mm: migrate: Batch TLB flushing when unmapping pages for migration
@ 2015-04-16 10:22   ` Mel Gorman
  0 siblings, 0 replies; 36+ messages in thread
From: Mel Gorman @ 2015-04-16 10:22 UTC (permalink / raw)
  To: Linux-MM
  Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen,
	LKML, Mel Gorman

Page reclaim batches multiple TLB flushes into one IPI and this patch teaches
page migration to also batch any necessary flushes. MMtests has a THP scale
microbenchmark that deliberately fragments memory and then allocates THPs
to stress compaction. It's not a page reclaim benchmark and recent kernels
avoid excessive compaction but this patch reduced system CPU usage

               4.0.0       4.0.0
            baseline batchmigrate-v1
User          970.70     1012.24
System       2067.48     1840.00
Elapsed      1520.63     1529.66

Note that this particular workload was not TLB flush intensive with peaks
in interrupts during the compaction phase. The 4.0 kernel peaked at 345K
interrupts/second, the kernel that batches reclaim TLB entries peaked at
13K interrupts/second and this patch peaked at 10K interrupts/second.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/internal.h | 5 +++++
 mm/migrate.c  | 6 +++++-
 mm/vmscan.c   | 2 +-
 3 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 35aba439c275..c2481574b41a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -436,10 +436,15 @@ struct tlbflush_unmap_batch;
 
 #ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
 void try_to_unmap_flush(void);
+void alloc_tlb_ubc(void);
 #else
 static inline void try_to_unmap_flush(void)
 {
 }
 
+static inline void alloc_tlb_ubc(void)
+{
+}
+
 #endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/migrate.c b/mm/migrate.c
index 85e042686031..fda7b320ac00 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -879,7 +879,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	/* Establish migration ptes or remove ptes */
 	if (page_mapped(page)) {
 		try_to_unmap(page,
-			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS|TTU_BATCH_FLUSH);
 		page_was_mapped = 1;
 	}
 
@@ -1098,6 +1098,8 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 	if (!swapwrite)
 		current->flags |= PF_SWAPWRITE;
 
+	alloc_tlb_ubc();
+
 	for(pass = 0; pass < 10 && retry; pass++) {
 		retry = 0;
 
@@ -1144,6 +1146,8 @@ out:
 	if (!swapwrite)
 		current->flags &= ~PF_SWAPWRITE;
 
+	try_to_unmap_flush();
+
 	return rc;
 }
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a8dde281652a..361bf59e0594 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2771,7 +2771,7 @@ out:
  * failure is harmless as the reclaimer will send IPIs where necessary.
  * If the allocation size changes then update BATCH_TLBFLUSH_SIZE.
  */
-static inline void alloc_tlb_ubc(void)
+void alloc_tlb_ubc(void)
 {
 	if (current->tlb_ubc)
 		return;
-- 
2.1.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] mm: migrate: Batch TLB flushing when unmapping pages for migration
  2015-04-16 10:22   ` Mel Gorman
@ 2015-04-16 10:51     ` Mel Gorman
  -1 siblings, 0 replies; 36+ messages in thread
From: Mel Gorman @ 2015-04-16 10:51 UTC (permalink / raw)
  To: Linux-MM
  Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen, LKML

On Thu, Apr 16, 2015 at 11:22:46AM +0100, Mel Gorman wrote:
> Page reclaim batches multiple TLB flushes into one IPI and this patch teaches
> page migration to also batch any necessary flushes. MMtests has a THP scale
> microbenchmark that deliberately fragments memory and then allocates THPs
> to stress compaction. It's not a page reclaim benchmark and recent kernels
> avoid excessive compaction but this patch reduced system CPU usage
> 
>                4.0.0       4.0.0
>             baseline batchmigrate-v1
> User          970.70     1012.24
> System       2067.48     1840.00
> Elapsed      1520.63     1529.66
> 
> Note that this particular workload was not TLB flush intensive with peaks
> in interrupts during the compaction phase. The 4.0 kernel peaked at 345K
> interrupts/second, the kernel that batches reclaim TLB entries peaked at
> 13K interrupts/second and this patch peaked at 10K interrupts/second.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

The following is needed on !x86 although it's pointless to test on !x86
at the moment.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 361bf59e0594..548c94834112 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2784,10 +2784,6 @@ void alloc_tlb_ubc(void)
 	cpumask_clear(&current->tlb_ubc->cpumask);
 	current->tlb_ubc->nr_pages = 0;
 }
-#else
-static inline void alloc_tlb_ubc(void)
-{
-}
 #endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
 
 unsigned long try_to_free_pages(struct zonelist *zonelist, int order,

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] mm: migrate: Batch TLB flushing when unmapping pages for migration
@ 2015-04-16 10:51     ` Mel Gorman
  0 siblings, 0 replies; 36+ messages in thread
From: Mel Gorman @ 2015-04-16 10:51 UTC (permalink / raw)
  To: Linux-MM
  Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen, LKML

On Thu, Apr 16, 2015 at 11:22:46AM +0100, Mel Gorman wrote:
> Page reclaim batches multiple TLB flushes into one IPI and this patch teaches
> page migration to also batch any necessary flushes. MMtests has a THP scale
> microbenchmark that deliberately fragments memory and then allocates THPs
> to stress compaction. It's not a page reclaim benchmark and recent kernels
> avoid excessive compaction but this patch reduced system CPU usage
> 
>                4.0.0       4.0.0
>             baseline batchmigrate-v1
> User          970.70     1012.24
> System       2067.48     1840.00
> Elapsed      1520.63     1529.66
> 
> Note that this particular workload was not TLB flush intensive with peaks
> in interrupts during the compaction phase. The 4.0 kernel peaked at 345K
> interrupts/second, the kernel that batches reclaim TLB entries peaked at
> 13K interrupts/second and this patch peaked at 10K interrupts/second.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

The following is needed on !x86 although it's pointless to test on !x86
at the moment.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 361bf59e0594..548c94834112 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2784,10 +2784,6 @@ void alloc_tlb_ubc(void)
 	cpumask_clear(&current->tlb_ubc->cpumask);
 	current->tlb_ubc->nr_pages = 0;
 }
-#else
-static inline void alloc_tlb_ubc(void)
-{
-}
 #endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
 
 unsigned long try_to_free_pages(struct zonelist *zonelist, int order,

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent
  2015-04-16 10:22   ` Mel Gorman
@ 2015-04-16 15:51     ` Rik van Riel
  -1 siblings, 0 replies; 36+ messages in thread
From: Rik van Riel @ 2015-04-16 15:51 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM
  Cc: Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen, LKML

On 04/16/2015 06:22 AM, Mel Gorman wrote:
> It is easy to trace when an IPI is received to flush a TLB but harder to
> detect what event sent it. This patch makes it easy to identify the source
> of IPIs being transmitted for TLB flushes on x86.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent
@ 2015-04-16 15:51     ` Rik van Riel
  0 siblings, 0 replies; 36+ messages in thread
From: Rik van Riel @ 2015-04-16 15:51 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM
  Cc: Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen, LKML

On 04/16/2015 06:22 AM, Mel Gorman wrote:
> It is easy to trace when an IPI is received to flush a TLB but harder to
> detect what event sent it. This patch makes it easy to identify the source
> of IPIs being transmitted for TLB flushes on x86.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/4] mm: Send a single IPI to TLB flush multiple pages when unmapping
  2015-04-16 10:22   ` Mel Gorman
@ 2015-04-16 15:52     ` Rik van Riel
  -1 siblings, 0 replies; 36+ messages in thread
From: Rik van Riel @ 2015-04-16 15:52 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM
  Cc: Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen, LKML

On 04/16/2015 06:22 AM, Mel Gorman wrote:

> 	If a clean page is unmapped and not immediately flushed, the
> 	architecture must guarantee that a write to that page from a CPU
> 	with a cached TLB entry will trap a page fault.
> 
> This is essentially what the kernel already depends on but the window is
> much larger with this patch applied and is worth highlighting.

> Signed-off-by: Mel Gorman <mgorman@suse.de>

Given the architectural guarantee ...

Reviewed-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/4] mm: Send a single IPI to TLB flush multiple pages when unmapping
@ 2015-04-16 15:52     ` Rik van Riel
  0 siblings, 0 replies; 36+ messages in thread
From: Rik van Riel @ 2015-04-16 15:52 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM
  Cc: Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen, LKML

On 04/16/2015 06:22 AM, Mel Gorman wrote:

> 	If a clean page is unmapped and not immediately flushed, the
> 	architecture must guarantee that a write to that page from a CPU
> 	with a cached TLB entry will trap a page fault.
> 
> This is essentially what the kernel already depends on but the window is
> much larger with this patch applied and is worth highlighting.

> Signed-off-by: Mel Gorman <mgorman@suse.de>

Given the architectural guarantee ...

Reviewed-by: Rik van Riel <riel@redhat.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/4] mm: Gather more PFNs before sending a TLB to flush unmapped pages
  2015-04-16 10:22   ` Mel Gorman
@ 2015-04-16 16:00     ` Rik van Riel
  -1 siblings, 0 replies; 36+ messages in thread
From: Rik van Riel @ 2015-04-16 16:00 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM
  Cc: Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen, LKML

On 04/16/2015 06:22 AM, Mel Gorman wrote:
> The patch "mm: Send a single IPI to TLB flush multiple pages when unmapping"
> would batch 32 pages before sending an IPI. This patch increases the size of
> the data structure to hold a pages worth of PFNs before sending an IPI. This
> is a trade-off between memory usage and reducing IPIS sent. In the ideal
> case where multiple processes are reading large mapped files, this patch
> reduces interrupts/second from roughly 180K per second to 60K per second.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/4] mm: Gather more PFNs before sending a TLB to flush unmapped pages
@ 2015-04-16 16:00     ` Rik van Riel
  0 siblings, 0 replies; 36+ messages in thread
From: Rik van Riel @ 2015-04-16 16:00 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM
  Cc: Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen, LKML

On 04/16/2015 06:22 AM, Mel Gorman wrote:
> The patch "mm: Send a single IPI to TLB flush multiple pages when unmapping"
> would batch 32 pages before sending an IPI. This patch increases the size of
> the data structure to hold a pages worth of PFNs before sending an IPI. This
> is a trade-off between memory usage and reducing IPIS sent. In the ideal
> case where multiple processes are reading large mapped files, this patch
> reduces interrupts/second from roughly 180K per second to 60K per second.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] mm: migrate: Batch TLB flushing when unmapping pages for migration
  2015-04-16 10:22   ` Mel Gorman
@ 2015-04-16 16:01     ` Rik van Riel
  -1 siblings, 0 replies; 36+ messages in thread
From: Rik van Riel @ 2015-04-16 16:01 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM
  Cc: Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen, LKML

On 04/16/2015 06:22 AM, Mel Gorman wrote:
> Page reclaim batches multiple TLB flushes into one IPI and this patch teaches
> page migration to also batch any necessary flushes. MMtests has a THP scale
> microbenchmark that deliberately fragments memory and then allocates THPs
> to stress compaction. It's not a page reclaim benchmark and recent kernels
> avoid excessive compaction but this patch reduced system CPU usage
> 
>                4.0.0       4.0.0
>             baseline batchmigrate-v1
> User          970.70     1012.24
> System       2067.48     1840.00
> Elapsed      1520.63     1529.66
> 
> Note that this particular workload was not TLB flush intensive with peaks
> in interrupts during the compaction phase. The 4.0 kernel peaked at 345K
> interrupts/second, the kernel that batches reclaim TLB entries peaked at
> 13K interrupts/second and this patch peaked at 10K interrupts/second.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] mm: migrate: Batch TLB flushing when unmapping pages for migration
@ 2015-04-16 16:01     ` Rik van Riel
  0 siblings, 0 replies; 36+ messages in thread
From: Rik van Riel @ 2015-04-16 16:01 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM
  Cc: Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen, LKML

On 04/16/2015 06:22 AM, Mel Gorman wrote:
> Page reclaim batches multiple TLB flushes into one IPI and this patch teaches
> page migration to also batch any necessary flushes. MMtests has a THP scale
> microbenchmark that deliberately fragments memory and then allocates THPs
> to stress compaction. It's not a page reclaim benchmark and recent kernels
> avoid excessive compaction but this patch reduced system CPU usage
> 
>                4.0.0       4.0.0
>             baseline batchmigrate-v1
> User          970.70     1012.24
> System       2067.48     1840.00
> Elapsed      1520.63     1529.66
> 
> Note that this particular workload was not TLB flush intensive with peaks
> in interrupts during the compaction phase. The 4.0 kernel peaked at 345K
> interrupts/second, the kernel that batches reclaim TLB entries peaked at
> 13K interrupts/second and this patch peaked at 10K interrupts/second.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent
  2015-04-16 10:22   ` Mel Gorman
@ 2015-04-16 16:55     ` Dave Hansen
  -1 siblings, 0 replies; 36+ messages in thread
From: Dave Hansen @ 2015-04-16 16:55 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM
  Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Andi Kleen, LKML

On 04/16/2015 03:22 AM, Mel Gorman wrote:
> It is easy to trace when an IPI is received to flush a TLB but harder to
> detect what event sent it. This patch makes it easy to identify the source
> of IPIs being transmitted for TLB flushes on x86.

Looks fine to me.  I think I even thought about adding this but didn't
see an immediate need for it.  I guess this does let you see how many
IPIs are sent vs. received.

Reviewed-by: Dave Hansen <dave.hansen@intel.com>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent
@ 2015-04-16 16:55     ` Dave Hansen
  0 siblings, 0 replies; 36+ messages in thread
From: Dave Hansen @ 2015-04-16 16:55 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM
  Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Andi Kleen, LKML

On 04/16/2015 03:22 AM, Mel Gorman wrote:
> It is easy to trace when an IPI is received to flush a TLB but harder to
> detect what event sent it. This patch makes it easy to identify the source
> of IPIs being transmitted for TLB flushes on x86.

Looks fine to me.  I think I even thought about adding this but didn't
see an immediate need for it.  I guess this does let you see how many
IPIs are sent vs. received.

Reviewed-by: Dave Hansen <dave.hansen@intel.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent
  2015-04-16 16:55     ` Dave Hansen
@ 2015-04-16 17:39       ` Mel Gorman
  -1 siblings, 0 replies; 36+ messages in thread
From: Mel Gorman @ 2015-04-16 17:39 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linux-MM, Rik van Riel, Hugh Dickins, Minchan Kim, Andi Kleen, LKML

On Thu, Apr 16, 2015 at 09:55:43AM -0700, Dave Hansen wrote:
> On 04/16/2015 03:22 AM, Mel Gorman wrote:
> > It is easy to trace when an IPI is received to flush a TLB but harder to
> > detect what event sent it. This patch makes it easy to identify the source
> > of IPIs being transmitted for TLB flushes on x86.
> 
> Looks fine to me.  I think I even thought about adding this but didn't
> see an immediate need for it.  I guess this does let you see how many
> IPIs are sent vs. received.
> 

It would but that's not why I wanted it. I wanted a stack track of who
was sending the IPI and I can't get that on the receive side. I could
have used perf probe and some hackery but this seemed useful in itself.

> Reviewed-by: Dave Hansen <dave.hansen@intel.com>

Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent
@ 2015-04-16 17:39       ` Mel Gorman
  0 siblings, 0 replies; 36+ messages in thread
From: Mel Gorman @ 2015-04-16 17:39 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linux-MM, Rik van Riel, Hugh Dickins, Minchan Kim, Andi Kleen, LKML

On Thu, Apr 16, 2015 at 09:55:43AM -0700, Dave Hansen wrote:
> On 04/16/2015 03:22 AM, Mel Gorman wrote:
> > It is easy to trace when an IPI is received to flush a TLB but harder to
> > detect what event sent it. This patch makes it easy to identify the source
> > of IPIs being transmitted for TLB flushes on x86.
> 
> Looks fine to me.  I think I even thought about adding this but didn't
> see an immediate need for it.  I guess this does let you see how many
> IPIs are sent vs. received.
> 

It would but that's not why I wanted it. I wanted a stack track of who
was sending the IPI and I can't get that on the receive side. I could
have used perf probe and some hackery but this seemed useful in itself.

> Reviewed-by: Dave Hansen <dave.hansen@intel.com>

Thanks.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] mm: migrate: Batch TLB flushing when unmapping pages for migration
  2015-04-16 10:22   ` Mel Gorman
@ 2015-04-16 18:57     ` Hugh Dickins
  -1 siblings, 0 replies; 36+ messages in thread
From: Hugh Dickins @ 2015-04-16 18:57 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen,
	Andi Kleen, LKML

On Thu, 16 Apr 2015, Mel Gorman wrote:

> Page reclaim batches multiple TLB flushes into one IPI and this patch teaches
> page migration to also batch any necessary flushes. MMtests has a THP scale
> microbenchmark that deliberately fragments memory and then allocates THPs
> to stress compaction. It's not a page reclaim benchmark and recent kernels
> avoid excessive compaction but this patch reduced system CPU usage
> 
>                4.0.0       4.0.0
>             baseline batchmigrate-v1
> User          970.70     1012.24
> System       2067.48     1840.00
> Elapsed      1520.63     1529.66
> 
> Note that this particular workload was not TLB flush intensive with peaks
> in interrupts during the compaction phase. The 4.0 kernel peaked at 345K
> interrupts/second, the kernel that batches reclaim TLB entries peaked at
> 13K interrupts/second and this patch peaked at 10K interrupts/second.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  mm/internal.h | 5 +++++
>  mm/migrate.c  | 6 +++++-
>  mm/vmscan.c   | 2 +-
>  3 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 35aba439c275..c2481574b41a 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -436,10 +436,15 @@ struct tlbflush_unmap_batch;
>  
>  #ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
>  void try_to_unmap_flush(void);
> +void alloc_tlb_ubc(void);
>  #else
>  static inline void try_to_unmap_flush(void)
>  {
>  }
>  
> +static inline void alloc_tlb_ubc(void)
> +{
> +}
> +
>  #endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
>  #endif	/* __MM_INTERNAL_H */
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 85e042686031..fda7b320ac00 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -879,7 +879,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>  	/* Establish migration ptes or remove ptes */
>  	if (page_mapped(page)) {
>  		try_to_unmap(page,
> -			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
> +			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS|TTU_BATCH_FLUSH);
>  		page_was_mapped = 1;
>  	}
>  
> @@ -1098,6 +1098,8 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>  	if (!swapwrite)
>  		current->flags |= PF_SWAPWRITE;
>  
> +	alloc_tlb_ubc();
> +
>  	for(pass = 0; pass < 10 && retry; pass++) {
>  		retry = 0;
>  
> @@ -1144,6 +1146,8 @@ out:
>  	if (!swapwrite)
>  		current->flags &= ~PF_SWAPWRITE;
>  
> +	try_to_unmap_flush();

This is the right place to aim to flush, but I think you have to make
more changes before it is safe to do so here.

The putback_lru_page(page) in unmap_and_move() is commented "A page
that has been migrated has all references removed and will be freed".

If you leave TLB flushing until after the page has been freed, then
there's a risk that userspace will see, not the data it expects at
whatever virtual address, but data placed in there by the next user
of this freed page.

So you'll need to do a little restructuring first.

> +
>  	return rc;
>  }
>  
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a8dde281652a..361bf59e0594 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2771,7 +2771,7 @@ out:
>   * failure is harmless as the reclaimer will send IPIs where necessary.
>   * If the allocation size changes then update BATCH_TLBFLUSH_SIZE.
>   */
> -static inline void alloc_tlb_ubc(void)
> +void alloc_tlb_ubc(void)
>  {
>  	if (current->tlb_ubc)
>  		return;
> -- 
> 2.1.2

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] mm: migrate: Batch TLB flushing when unmapping pages for migration
@ 2015-04-16 18:57     ` Hugh Dickins
  0 siblings, 0 replies; 36+ messages in thread
From: Hugh Dickins @ 2015-04-16 18:57 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen,
	Andi Kleen, LKML

On Thu, 16 Apr 2015, Mel Gorman wrote:

> Page reclaim batches multiple TLB flushes into one IPI and this patch teaches
> page migration to also batch any necessary flushes. MMtests has a THP scale
> microbenchmark that deliberately fragments memory and then allocates THPs
> to stress compaction. It's not a page reclaim benchmark and recent kernels
> avoid excessive compaction but this patch reduced system CPU usage
> 
>                4.0.0       4.0.0
>             baseline batchmigrate-v1
> User          970.70     1012.24
> System       2067.48     1840.00
> Elapsed      1520.63     1529.66
> 
> Note that this particular workload was not TLB flush intensive with peaks
> in interrupts during the compaction phase. The 4.0 kernel peaked at 345K
> interrupts/second, the kernel that batches reclaim TLB entries peaked at
> 13K interrupts/second and this patch peaked at 10K interrupts/second.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  mm/internal.h | 5 +++++
>  mm/migrate.c  | 6 +++++-
>  mm/vmscan.c   | 2 +-
>  3 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 35aba439c275..c2481574b41a 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -436,10 +436,15 @@ struct tlbflush_unmap_batch;
>  
>  #ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
>  void try_to_unmap_flush(void);
> +void alloc_tlb_ubc(void);
>  #else
>  static inline void try_to_unmap_flush(void)
>  {
>  }
>  
> +static inline void alloc_tlb_ubc(void)
> +{
> +}
> +
>  #endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
>  #endif	/* __MM_INTERNAL_H */
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 85e042686031..fda7b320ac00 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -879,7 +879,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>  	/* Establish migration ptes or remove ptes */
>  	if (page_mapped(page)) {
>  		try_to_unmap(page,
> -			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
> +			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS|TTU_BATCH_FLUSH);
>  		page_was_mapped = 1;
>  	}
>  
> @@ -1098,6 +1098,8 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>  	if (!swapwrite)
>  		current->flags |= PF_SWAPWRITE;
>  
> +	alloc_tlb_ubc();
> +
>  	for(pass = 0; pass < 10 && retry; pass++) {
>  		retry = 0;
>  
> @@ -1144,6 +1146,8 @@ out:
>  	if (!swapwrite)
>  		current->flags &= ~PF_SWAPWRITE;
>  
> +	try_to_unmap_flush();

This is the right place to aim to flush, but I think you have to make
more changes before it is safe to do so here.

The putback_lru_page(page) in unmap_and_move() is commented "A page
that has been migrated has all references removed and will be freed".

If you leave TLB flushing until after the page has been freed, then
there's a risk that userspace will see, not the data it expects at
whatever virtual address, but data placed in there by the next user
of this freed page.

So you'll need to do a little restructuring first.

> +
>  	return rc;
>  }
>  
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a8dde281652a..361bf59e0594 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2771,7 +2771,7 @@ out:
>   * failure is harmless as the reclaimer will send IPIs where necessary.
>   * If the allocation size changes then update BATCH_TLBFLUSH_SIZE.
>   */
> -static inline void alloc_tlb_ubc(void)
> +void alloc_tlb_ubc(void)
>  {
>  	if (current->tlb_ubc)
>  		return;
> -- 
> 2.1.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/4] mm: Send a single IPI to TLB flush multiple pages when unmapping
  2015-04-16 10:22   ` Mel Gorman
@ 2015-04-16 19:21     ` Hugh Dickins
  -1 siblings, 0 replies; 36+ messages in thread
From: Hugh Dickins @ 2015-04-16 19:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen,
	Andi Kleen, LKML

On Thu, 16 Apr 2015, Mel Gorman wrote:
>  
>  	/* Move the dirty bit to the physical page now the pte is gone. */
> -	if (pte_dirty(pteval))
> +	if (pte_dirty(pteval)) {
> +		/*
> +		 * If the PTE was dirty then the TLB must be flushed before
> +		 * the page is unlocked as IO can start in parallel. Without
> +		 * the flush, writes could still happen and data would be
> +		 * potentially lost.
> +		 */
> +		if (deferred)
> +			flush_tlb_page(vma, address);

Okay, yes, that should deal with it; and you're probably right that the
safe pte_dirty !pte_write case is too uncommon to be worth another test.

But it would be better to batch even in the pte_dirty case: noting that
it has occurred in the tlb_ubc, then if so, doing try_to_unmap_flush()
before leaving try_to_unmap().

Particularly as you have already set_tlb_ubc_flush_pending() above,
so shrink_lruvec() may then follow with an unnecessary flush; though
I guess a little rearrangement here could stop that.

> +
>  		set_page_dirty(page);
> +	}
>  
>  	/* Update high watermark before we lower rss */
>  	update_hiwater_rss(mm);

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/4] mm: Send a single IPI to TLB flush multiple pages when unmapping
@ 2015-04-16 19:21     ` Hugh Dickins
  0 siblings, 0 replies; 36+ messages in thread
From: Hugh Dickins @ 2015-04-16 19:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen,
	Andi Kleen, LKML

On Thu, 16 Apr 2015, Mel Gorman wrote:
>  
>  	/* Move the dirty bit to the physical page now the pte is gone. */
> -	if (pte_dirty(pteval))
> +	if (pte_dirty(pteval)) {
> +		/*
> +		 * If the PTE was dirty then the TLB must be flushed before
> +		 * the page is unlocked as IO can start in parallel. Without
> +		 * the flush, writes could still happen and data would be
> +		 * potentially lost.
> +		 */
> +		if (deferred)
> +			flush_tlb_page(vma, address);

Okay, yes, that should deal with it; and you're probably right that the
safe pte_dirty !pte_write case is too uncommon to be worth another test.

But it would be better to batch even in the pte_dirty case: noting that
it has occurred in the tlb_ubc, then if so, doing try_to_unmap_flush()
before leaving try_to_unmap().

Particularly as you have already set_tlb_ubc_flush_pending() above,
so shrink_lruvec() may then follow with an unnecessary flush; though
I guess a little rearrangement here could stop that.

> +
>  		set_page_dirty(page);
> +	}
>  
>  	/* Update high watermark before we lower rss */
>  	update_hiwater_rss(mm);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] mm: migrate: Batch TLB flushing when unmapping pages for migration
  2015-04-16 18:57     ` Hugh Dickins
@ 2015-04-16 19:34       ` Mel Gorman
  -1 siblings, 0 replies; 36+ messages in thread
From: Mel Gorman @ 2015-04-16 19:34 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linux-MM, Rik van Riel, Minchan Kim, Dave Hansen, Andi Kleen, LKML

On Thu, Apr 16, 2015 at 11:57:15AM -0700, Hugh Dickins wrote:
> > @@ -1098,6 +1098,8 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
> >  	if (!swapwrite)
> >  		current->flags |= PF_SWAPWRITE;
> >  
> > +	alloc_tlb_ubc();
> > +
> >  	for(pass = 0; pass < 10 && retry; pass++) {
> >  		retry = 0;
> >  
> > @@ -1144,6 +1146,8 @@ out:
> >  	if (!swapwrite)
> >  		current->flags &= ~PF_SWAPWRITE;
> >  
> > +	try_to_unmap_flush();
> 
> This is the right place to aim to flush, but I think you have to make
> more changes before it is safe to do so here.
> 
> The putback_lru_page(page) in unmap_and_move() is commented "A page
> that has been migrated has all references removed and will be freed".
> 
> If you leave TLB flushing until after the page has been freed, then
> there's a risk that userspace will see, not the data it expects at
> whatever virtual address, but data placed in there by the next user
> of this freed page.
> 
> So you'll need to do a little restructuring first.
> 

Well spotted. I believe you are correct and it almost certainly applies to
patch 2 as well for similar reasons. It also impacts the maximum reasonable
batch size that can be managed while maintaing safety. I'll do the necessary
shuffling tomorrow or Monday.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/4] mm: migrate: Batch TLB flushing when unmapping pages for migration
@ 2015-04-16 19:34       ` Mel Gorman
  0 siblings, 0 replies; 36+ messages in thread
From: Mel Gorman @ 2015-04-16 19:34 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linux-MM, Rik van Riel, Minchan Kim, Dave Hansen, Andi Kleen, LKML

On Thu, Apr 16, 2015 at 11:57:15AM -0700, Hugh Dickins wrote:
> > @@ -1098,6 +1098,8 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
> >  	if (!swapwrite)
> >  		current->flags |= PF_SWAPWRITE;
> >  
> > +	alloc_tlb_ubc();
> > +
> >  	for(pass = 0; pass < 10 && retry; pass++) {
> >  		retry = 0;
> >  
> > @@ -1144,6 +1146,8 @@ out:
> >  	if (!swapwrite)
> >  		current->flags &= ~PF_SWAPWRITE;
> >  
> > +	try_to_unmap_flush();
> 
> This is the right place to aim to flush, but I think you have to make
> more changes before it is safe to do so here.
> 
> The putback_lru_page(page) in unmap_and_move() is commented "A page
> that has been migrated has all references removed and will be freed".
> 
> If you leave TLB flushing until after the page has been freed, then
> there's a risk that userspace will see, not the data it expects at
> whatever virtual address, but data placed in there by the next user
> of this freed page.
> 
> So you'll need to do a little restructuring first.
> 

Well spotted. I believe you are correct and it almost certainly applies to
patch 2 as well for similar reasons. It also impacts the maximum reasonable
batch size that can be managed while maintaing safety. I'll do the necessary
shuffling tomorrow or Monday.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent
  2015-07-06 13:39 [PATCH 0/4] TLB flush multiple pages per IPI v7 Mel Gorman
@ 2015-07-06 13:39   ` Mel Gorman
  0 siblings, 0 replies; 36+ messages in thread
From: Mel Gorman @ 2015-07-06 13:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Dave Hansen, Ingo Molnar, Linus Torvalds, Linux-MM,
	LKML, Mel Gorman

It is easy to trace when an IPI is received to flush a TLB but harder to
detect what event sent it. This patch makes it easy to identify the source
of IPIs being transmitted for TLB flushes on x86.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
---
 arch/x86/mm/tlb.c          | 1 +
 include/linux/mm_types.h   | 1 +
 include/trace/events/tlb.h | 3 ++-
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 3250f2371aea..2da824c1c140 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -140,6 +140,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 	info.flush_end = end;
 
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
+	trace_tlb_flush(TLB_REMOTE_SEND_IPI, end - start);
 	if (is_uv_system()) {
 		unsigned int cpu;
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0038ac7466fd..84ef58543e2b 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -552,6 +552,7 @@ enum tlb_flush_reason {
 	TLB_REMOTE_SHOOTDOWN,
 	TLB_LOCAL_SHOOTDOWN,
 	TLB_LOCAL_MM_SHOOTDOWN,
+	TLB_REMOTE_SEND_IPI,
 	NR_TLB_FLUSH_REASONS,
 };
 
diff --git a/include/trace/events/tlb.h b/include/trace/events/tlb.h
index 4250f364a6ca..bc8815f45f3b 100644
--- a/include/trace/events/tlb.h
+++ b/include/trace/events/tlb.h
@@ -11,7 +11,8 @@
 	EM(  TLB_FLUSH_ON_TASK_SWITCH,	"flush on task switch" )	\
 	EM(  TLB_REMOTE_SHOOTDOWN,	"remote shootdown" )		\
 	EM(  TLB_LOCAL_SHOOTDOWN,	"local shootdown" )		\
-	EMe( TLB_LOCAL_MM_SHOOTDOWN,	"local mm shootdown" )
+	EM(  TLB_LOCAL_MM_SHOOTDOWN,	"local mm shootdown" )		\
+	EMe( TLB_REMOTE_SEND_IPI,	"remote ipi send" )
 
 /*
  * First define the enums in TLB_FLUSH_REASON to be exported to userspace
-- 
2.3.5


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent
@ 2015-07-06 13:39   ` Mel Gorman
  0 siblings, 0 replies; 36+ messages in thread
From: Mel Gorman @ 2015-07-06 13:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Dave Hansen, Ingo Molnar, Linus Torvalds, Linux-MM,
	LKML, Mel Gorman

It is easy to trace when an IPI is received to flush a TLB but harder to
detect what event sent it. This patch makes it easy to identify the source
of IPIs being transmitted for TLB flushes on x86.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
---
 arch/x86/mm/tlb.c          | 1 +
 include/linux/mm_types.h   | 1 +
 include/trace/events/tlb.h | 3 ++-
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 3250f2371aea..2da824c1c140 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -140,6 +140,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 	info.flush_end = end;
 
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
+	trace_tlb_flush(TLB_REMOTE_SEND_IPI, end - start);
 	if (is_uv_system()) {
 		unsigned int cpu;
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0038ac7466fd..84ef58543e2b 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -552,6 +552,7 @@ enum tlb_flush_reason {
 	TLB_REMOTE_SHOOTDOWN,
 	TLB_LOCAL_SHOOTDOWN,
 	TLB_LOCAL_MM_SHOOTDOWN,
+	TLB_REMOTE_SEND_IPI,
 	NR_TLB_FLUSH_REASONS,
 };
 
diff --git a/include/trace/events/tlb.h b/include/trace/events/tlb.h
index 4250f364a6ca..bc8815f45f3b 100644
--- a/include/trace/events/tlb.h
+++ b/include/trace/events/tlb.h
@@ -11,7 +11,8 @@
 	EM(  TLB_FLUSH_ON_TASK_SWITCH,	"flush on task switch" )	\
 	EM(  TLB_REMOTE_SHOOTDOWN,	"remote shootdown" )		\
 	EM(  TLB_LOCAL_SHOOTDOWN,	"local shootdown" )		\
-	EMe( TLB_LOCAL_MM_SHOOTDOWN,	"local mm shootdown" )
+	EM(  TLB_LOCAL_MM_SHOOTDOWN,	"local mm shootdown" )		\
+	EMe( TLB_REMOTE_SEND_IPI,	"remote ipi send" )
 
 /*
  * First define the enums in TLB_FLUSH_REASON to be exported to userspace
-- 
2.3.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent
  2015-06-09 17:31 [PATCH 0/3] TLB flush multiple pages per IPI v6 Mel Gorman
@ 2015-06-09 17:31   ` Mel Gorman
  0 siblings, 0 replies; 36+ messages in thread
From: Mel Gorman @ 2015-06-09 17:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen,
	H Peter Anvin, Ingo Molnar, Linus Torvalds, Thomas Gleixner,
	Peter Zijlstra, Linux-MM, LKML, Mel Gorman

It is easy to trace when an IPI is received to flush a TLB but harder to
detect what event sent it. This patch makes it easy to identify the source
of IPIs being transmitted for TLB flushes on x86.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
---
 arch/x86/mm/tlb.c          | 1 +
 include/linux/mm_types.h   | 1 +
 include/trace/events/tlb.h | 3 ++-
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 3250f2371aea..2da824c1c140 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -140,6 +140,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 	info.flush_end = end;
 
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
+	trace_tlb_flush(TLB_REMOTE_SEND_IPI, end - start);
 	if (is_uv_system()) {
 		unsigned int cpu;
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8d37e26a1007..86ad9f902042 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -534,6 +534,7 @@ enum tlb_flush_reason {
 	TLB_REMOTE_SHOOTDOWN,
 	TLB_LOCAL_SHOOTDOWN,
 	TLB_LOCAL_MM_SHOOTDOWN,
+	TLB_REMOTE_SEND_IPI,
 	NR_TLB_FLUSH_REASONS,
 };
 
diff --git a/include/trace/events/tlb.h b/include/trace/events/tlb.h
index 4250f364a6ca..bc8815f45f3b 100644
--- a/include/trace/events/tlb.h
+++ b/include/trace/events/tlb.h
@@ -11,7 +11,8 @@
 	EM(  TLB_FLUSH_ON_TASK_SWITCH,	"flush on task switch" )	\
 	EM(  TLB_REMOTE_SHOOTDOWN,	"remote shootdown" )		\
 	EM(  TLB_LOCAL_SHOOTDOWN,	"local shootdown" )		\
-	EMe( TLB_LOCAL_MM_SHOOTDOWN,	"local mm shootdown" )
+	EM(  TLB_LOCAL_MM_SHOOTDOWN,	"local mm shootdown" )		\
+	EMe( TLB_REMOTE_SEND_IPI,	"remote ipi send" )
 
 /*
  * First define the enums in TLB_FLUSH_REASON to be exported to userspace
-- 
2.3.5


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent
@ 2015-06-09 17:31   ` Mel Gorman
  0 siblings, 0 replies; 36+ messages in thread
From: Mel Gorman @ 2015-06-09 17:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen,
	H Peter Anvin, Ingo Molnar, Linus Torvalds, Thomas Gleixner,
	Peter Zijlstra, Linux-MM, LKML, Mel Gorman

It is easy to trace when an IPI is received to flush a TLB but harder to
detect what event sent it. This patch makes it easy to identify the source
of IPIs being transmitted for TLB flushes on x86.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
---
 arch/x86/mm/tlb.c          | 1 +
 include/linux/mm_types.h   | 1 +
 include/trace/events/tlb.h | 3 ++-
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 3250f2371aea..2da824c1c140 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -140,6 +140,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 	info.flush_end = end;
 
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
+	trace_tlb_flush(TLB_REMOTE_SEND_IPI, end - start);
 	if (is_uv_system()) {
 		unsigned int cpu;
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8d37e26a1007..86ad9f902042 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -534,6 +534,7 @@ enum tlb_flush_reason {
 	TLB_REMOTE_SHOOTDOWN,
 	TLB_LOCAL_SHOOTDOWN,
 	TLB_LOCAL_MM_SHOOTDOWN,
+	TLB_REMOTE_SEND_IPI,
 	NR_TLB_FLUSH_REASONS,
 };
 
diff --git a/include/trace/events/tlb.h b/include/trace/events/tlb.h
index 4250f364a6ca..bc8815f45f3b 100644
--- a/include/trace/events/tlb.h
+++ b/include/trace/events/tlb.h
@@ -11,7 +11,8 @@
 	EM(  TLB_FLUSH_ON_TASK_SWITCH,	"flush on task switch" )	\
 	EM(  TLB_REMOTE_SHOOTDOWN,	"remote shootdown" )		\
 	EM(  TLB_LOCAL_SHOOTDOWN,	"local shootdown" )		\
-	EMe( TLB_LOCAL_MM_SHOOTDOWN,	"local mm shootdown" )
+	EM(  TLB_LOCAL_MM_SHOOTDOWN,	"local mm shootdown" )		\
+	EMe( TLB_REMOTE_SEND_IPI,	"remote ipi send" )
 
 /*
  * First define the enums in TLB_FLUSH_REASON to be exported to userspace
-- 
2.3.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent
  2015-04-15 10:42 [RFC PATCH 0/4] TLB flush multiple pages with a single IPI Mel Gorman
@ 2015-04-15 10:42   ` Mel Gorman
  0 siblings, 0 replies; 36+ messages in thread
From: Mel Gorman @ 2015-04-15 10:42 UTC (permalink / raw)
  To: Linux-MM
  Cc: Rik van Riel, Johannes Weiner, Dave Hansen, Andi Kleen, LKML, Mel Gorman

It is easy to trace when an IPI is received to flush a TLB but harder to
detect what event sent it. This patch makes it easy to identify the source
of IPIs being transmitted for TLB flushes on x86.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/x86/mm/tlb.c          | 1 +
 include/linux/mm_types.h   | 1 +
 include/trace/events/tlb.h | 3 ++-
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 3250f2371aea..2da824c1c140 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -140,6 +140,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 	info.flush_end = end;
 
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
+	trace_tlb_flush(TLB_REMOTE_SEND_IPI, end - start);
 	if (is_uv_system()) {
 		unsigned int cpu;
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 199a03aab8dc..856038aa166e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -532,6 +532,7 @@ enum tlb_flush_reason {
 	TLB_REMOTE_SHOOTDOWN,
 	TLB_LOCAL_SHOOTDOWN,
 	TLB_LOCAL_MM_SHOOTDOWN,
+	TLB_REMOTE_SEND_IPI,
 	NR_TLB_FLUSH_REASONS,
 };
 
diff --git a/include/trace/events/tlb.h b/include/trace/events/tlb.h
index 0e7635765153..0fc101472988 100644
--- a/include/trace/events/tlb.h
+++ b/include/trace/events/tlb.h
@@ -11,7 +11,8 @@
 	{ TLB_FLUSH_ON_TASK_SWITCH,	"flush on task switch" },	\
 	{ TLB_REMOTE_SHOOTDOWN,		"remote shootdown" },		\
 	{ TLB_LOCAL_SHOOTDOWN,		"local shootdown" },		\
-	{ TLB_LOCAL_MM_SHOOTDOWN,	"local mm shootdown" }
+	{ TLB_LOCAL_MM_SHOOTDOWN,	"local mm shootdown" },		\
+	{ TLB_REMOTE_SEND_IPI,		"remote ipi send" }
 
 TRACE_EVENT_CONDITION(tlb_flush,
 
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent
@ 2015-04-15 10:42   ` Mel Gorman
  0 siblings, 0 replies; 36+ messages in thread
From: Mel Gorman @ 2015-04-15 10:42 UTC (permalink / raw)
  To: Linux-MM
  Cc: Rik van Riel, Johannes Weiner, Dave Hansen, Andi Kleen, LKML, Mel Gorman

It is easy to trace when an IPI is received to flush a TLB but harder to
detect what event sent it. This patch makes it easy to identify the source
of IPIs being transmitted for TLB flushes on x86.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/x86/mm/tlb.c          | 1 +
 include/linux/mm_types.h   | 1 +
 include/trace/events/tlb.h | 3 ++-
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 3250f2371aea..2da824c1c140 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -140,6 +140,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 	info.flush_end = end;
 
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
+	trace_tlb_flush(TLB_REMOTE_SEND_IPI, end - start);
 	if (is_uv_system()) {
 		unsigned int cpu;
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 199a03aab8dc..856038aa166e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -532,6 +532,7 @@ enum tlb_flush_reason {
 	TLB_REMOTE_SHOOTDOWN,
 	TLB_LOCAL_SHOOTDOWN,
 	TLB_LOCAL_MM_SHOOTDOWN,
+	TLB_REMOTE_SEND_IPI,
 	NR_TLB_FLUSH_REASONS,
 };
 
diff --git a/include/trace/events/tlb.h b/include/trace/events/tlb.h
index 0e7635765153..0fc101472988 100644
--- a/include/trace/events/tlb.h
+++ b/include/trace/events/tlb.h
@@ -11,7 +11,8 @@
 	{ TLB_FLUSH_ON_TASK_SWITCH,	"flush on task switch" },	\
 	{ TLB_REMOTE_SHOOTDOWN,		"remote shootdown" },		\
 	{ TLB_LOCAL_SHOOTDOWN,		"local shootdown" },		\
-	{ TLB_LOCAL_MM_SHOOTDOWN,	"local mm shootdown" }
+	{ TLB_LOCAL_MM_SHOOTDOWN,	"local mm shootdown" },		\
+	{ TLB_REMOTE_SEND_IPI,		"remote ipi send" }
 
 TRACE_EVENT_CONDITION(tlb_flush,
 
-- 
2.1.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2015-07-06 13:40 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-16 10:22 [RFC PATCH 0/4] TLB flush multiple pages with a single IPI v2 Mel Gorman
2015-04-16 10:22 ` Mel Gorman
2015-04-16 10:22 ` [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent Mel Gorman
2015-04-16 10:22   ` Mel Gorman
2015-04-16 15:51   ` Rik van Riel
2015-04-16 15:51     ` Rik van Riel
2015-04-16 16:55   ` Dave Hansen
2015-04-16 16:55     ` Dave Hansen
2015-04-16 17:39     ` Mel Gorman
2015-04-16 17:39       ` Mel Gorman
2015-04-16 10:22 ` [PATCH 2/4] mm: Send a single IPI to TLB flush multiple pages when unmapping Mel Gorman
2015-04-16 10:22   ` Mel Gorman
2015-04-16 15:52   ` Rik van Riel
2015-04-16 15:52     ` Rik van Riel
2015-04-16 19:21   ` Hugh Dickins
2015-04-16 19:21     ` Hugh Dickins
2015-04-16 10:22 ` [PATCH 3/4] mm: Gather more PFNs before sending a TLB to flush unmapped pages Mel Gorman
2015-04-16 10:22   ` Mel Gorman
2015-04-16 16:00   ` Rik van Riel
2015-04-16 16:00     ` Rik van Riel
2015-04-16 10:22 ` [PATCH 4/4] mm: migrate: Batch TLB flushing when unmapping pages for migration Mel Gorman
2015-04-16 10:22   ` Mel Gorman
2015-04-16 10:51   ` Mel Gorman
2015-04-16 10:51     ` Mel Gorman
2015-04-16 16:01   ` Rik van Riel
2015-04-16 16:01     ` Rik van Riel
2015-04-16 18:57   ` Hugh Dickins
2015-04-16 18:57     ` Hugh Dickins
2015-04-16 19:34     ` Mel Gorman
2015-04-16 19:34       ` Mel Gorman
  -- strict thread matches above, loose matches on Subject: below --
2015-07-06 13:39 [PATCH 0/4] TLB flush multiple pages per IPI v7 Mel Gorman
2015-07-06 13:39 ` [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent Mel Gorman
2015-07-06 13:39   ` Mel Gorman
2015-06-09 17:31 [PATCH 0/3] TLB flush multiple pages per IPI v6 Mel Gorman
2015-06-09 17:31 ` [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent Mel Gorman
2015-06-09 17:31   ` Mel Gorman
2015-04-15 10:42 [RFC PATCH 0/4] TLB flush multiple pages with a single IPI Mel Gorman
2015-04-15 10:42 ` [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent Mel Gorman
2015-04-15 10:42   ` Mel Gorman

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.