* [RFC PATCH 0/4] TLB flush multiple pages with a single IPI
  From: Mel Gorman
  Date: 2015-04-15 10:42 UTC
  To: Linux-MM
  Cc: Rik van Riel, Johannes Weiner, Dave Hansen, Andi Kleen, LKML, Mel Gorman

When unmapping pages it is necessary to flush the TLB. If that page was
accessed by another CPU then an IPI is used to flush the remote CPU. That
is a lot of IPIs if kswapd is scanning and unmapping >100K pages per second.

There already is a window between when a page is unmapped and when it is
TLB flushed. This series simply increases the window so multiple pages
can be flushed using a single IPI.

Patch 1 simply made the rest of the series easier to write as ftrace
	could identify all the senders of TLB flush IPIs.

Patch 2 collects a list of PFNs and sends one IPI to flush them all.

Patch 3 uses more memory to further defer when the IPI gets sent.

Patch 4 uses the same infrastructure as patch 2 to batch IPIs sent during
	page migration.

The performance impact is documented in the changelogs, but in the
optimistic case on a 4-socket machine the full series reduces interrupts
from 900K interrupts/second to 60K interrupts/second.

Last minute note: It occurred to me just before sending that a TLB flush
cannot be batched if the PTE was dirty at unmap time, as the page lock is
released before the TLB flush occurs. That allows IO to be started in
parallel while writes can still take place through a cached entry. I
decided not to delay the series as it's an RFC and I want to see if there
is interest in this. Note however that there is a difficult-to-hit
potential corruption race here.

 arch/x86/Kconfig                |  1 +
 arch/x86/include/asm/tlbflush.h |  2 +
 arch/x86/mm/tlb.c               |  1 +
 include/linux/init_task.h       |  8 ++++
 include/linux/mm_types.h        |  1 +
 include/linux/rmap.h            |  3 ++
 include/linux/sched.h           | 20 ++++++++++
 include/trace/events/tlb.h      |  3 +-
 init/Kconfig                    |  5 +++
 kernel/fork.c                   |  5 +++
 kernel/sched/core.c             |  3 ++
 mm/internal.h                   | 16 ++++++++
 mm/migrate.c                    |  8 +++-
 mm/rmap.c                       | 85 ++++++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                     | 29 +++++++++++++-
 15 files changed, 186 insertions(+), 4 deletions(-)

--
2.1.2
* [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent
  From: Mel Gorman
  Date: 2015-04-15 10:42 UTC
  To: Linux-MM
  Cc: Rik van Riel, Johannes Weiner, Dave Hansen, Andi Kleen, LKML, Mel Gorman

It is easy to trace when an IPI is received to flush a TLB but harder to
detect what event sent it. This patch makes it easy to identify the source
of IPIs being transmitted for TLB flushes on x86.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/x86/mm/tlb.c          | 1 +
 include/linux/mm_types.h   | 1 +
 include/trace/events/tlb.h | 3 ++-
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 3250f2371aea..2da824c1c140 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -140,6 +140,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 		info.flush_end = end;
 
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
+	trace_tlb_flush(TLB_REMOTE_SEND_IPI, end - start);
 	if (is_uv_system()) {
 		unsigned int cpu;
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 199a03aab8dc..856038aa166e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -532,6 +532,7 @@ enum tlb_flush_reason {
 	TLB_REMOTE_SHOOTDOWN,
 	TLB_LOCAL_SHOOTDOWN,
 	TLB_LOCAL_MM_SHOOTDOWN,
+	TLB_REMOTE_SEND_IPI,
 	NR_TLB_FLUSH_REASONS,
 };
 
diff --git a/include/trace/events/tlb.h b/include/trace/events/tlb.h
index 0e7635765153..0fc101472988 100644
--- a/include/trace/events/tlb.h
+++ b/include/trace/events/tlb.h
@@ -11,7 +11,8 @@
 	{ TLB_FLUSH_ON_TASK_SWITCH,	"flush on task switch" },	\
 	{ TLB_REMOTE_SHOOTDOWN,		"remote shootdown" },		\
 	{ TLB_LOCAL_SHOOTDOWN,		"local shootdown" },		\
-	{ TLB_LOCAL_MM_SHOOTDOWN,	"local mm shootdown" }
+	{ TLB_LOCAL_MM_SHOOTDOWN,	"local mm shootdown" },		\
+	{ TLB_REMOTE_SEND_IPI,		"remote ipi send" }
 
 TRACE_EVENT_CONDITION(tlb_flush,

--
2.1.2
* [PATCH 2/4] mm: Send a single IPI to TLB flush multiple pages when unmapping
  From: Mel Gorman
  Date: 2015-04-15 10:42 UTC
  To: Linux-MM
  Cc: Rik van Riel, Johannes Weiner, Dave Hansen, Andi Kleen, LKML, Mel Gorman

An IPI is sent to flush remote TLBs when a page is unmapped that was
recently accessed by other CPUs. There are many circumstances where this
happens but the obvious one is kswapd reclaiming pages belonging to a
running process as kswapd and the task are likely running on separate CPUs.

On small machines, this is not a significant problem but as machines get
larger with more cores and more memory, the cost of these IPIs can
be high. This patch uses a structure similar in principle to a pagevec
to collect a list of PFNs and CPUs that require flushing. It then sends
one IPI to flush the list of PFNs. A new TLB flush helper is required for
this and one is added for x86. Other architectures will need to decide if
batching like this is both safe and worth the memory overhead. Specifically
the requirement is;

	If a clean page is unmapped and not immediately flushed, the
	architecture must guarantee that a write to that page from a CPU
	with a cached TLB entry will trap a page fault.

This is essentially what the kernel already depends on but the window is
much larger with this patch applied and is worth highlighting.

The impact of this patch depends on the workload as measuring any benefit
requires both mapped pages co-located on the LRU and memory pressure. The
case with the biggest impact is multiple processes reading mapped pages
taken from the vm-scalability test suite. The test case uses NR_CPU readers
of mapped files that consume 10*RAM.

vmscale on a 4-node machine with 64G RAM and 48 CPUs

                                        4.0.0             4.0.0
                                      vanilla     batchunmap-v1
lru-file-mmap-read-elapsed   161.08 (  0.00%)  117.73 ( 26.91%)

                 4.0.0          4.0.0
               vanilla  batchunmap-v1
User            571.38         602.93
System         5990.12        4072.56
Elapsed         162.39         119.06

This shows that the readers completed 26% faster with 32% less CPU time.
From vmstat, it is known that the vanilla kernel was interrupted roughly
900K times per second during the steady phase of the test and the patched
kernel was interrupted roughly 180K times per second.

The impact is much lower on a small machine.

vmscale on a 1-node machine with 8G RAM and 1 CPU

                                            4.0.0            4.0.0
                                          vanilla    batchunmap-v1
Ops lru-file-mmap-read-elapsed   22.50 (  0.00%)  19.60 ( 12.89%)

                 4.0.0          4.0.0
               vanilla  batchunmap-v1
User             33.64          32.72
System           36.22          33.22
Elapsed          24.11          21.21

It's still a noticeable improvement with vmstat showing interrupts went
from roughly 500K per second to 45K per second.

The patch will have no impact on workloads with no memory pressure or that
have relatively few mapped pages.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/x86/Kconfig                |  1 +
 arch/x86/include/asm/tlbflush.h |  2 +
 include/linux/init_task.h       |  8 ++++
 include/linux/rmap.h            |  3 ++
 include/linux/sched.h           | 15 +++++++
 init/Kconfig                    |  5 +++
 kernel/fork.c                   |  5 +++
 kernel/sched/core.c             |  3 ++
 mm/internal.h                   | 11 ++++++
 mm/rmap.c                       | 85 ++++++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                     | 33 +++++++++++++++-
 11 files changed, 169 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b7d31ca55187..290844263218 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -30,6 +30,7 @@ config X86
 	select ARCH_MIGHT_HAVE_PC_SERIO
 	select HAVE_AOUT if X86_32
 	select HAVE_UNSTABLE_SCHED_CLOCK
+	select ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
 	select ARCH_SUPPORTS_NUMA_BALANCING if X86_64
 	select ARCH_SUPPORTS_INT128 if X86_64
 	select HAVE_IDE
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index cd791948b286..96a27051a70a 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -152,6 +152,8 @@ static inline void __flush_tlb_one(unsigned long addr)
  * and page-granular flushes are available only on i486 and up.
  */
 
+#define flush_local_tlb_addr(addr) __flush_tlb_one(addr)
+
 #ifndef CONFIG_SMP
 
 /* "_up" is for UniProcessor.
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 696d22312b31..8127a46d3b9c 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -175,6 +175,13 @@ extern struct task_group root_task_group;
 # define INIT_NUMA_BALANCING(tsk)
 #endif
 
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+# define INIT_UNMAP_BATCH_CONTROL(tsk)					\
+	.ubc = NULL,
+#else
+# define INIT_UNMAP_BATCH_CONTROL(tsk)
+#endif
+
 #ifdef CONFIG_KASAN
 # define INIT_KASAN(tsk)						\
 	.kasan_depth = 1,
@@ -257,6 +264,7 @@ extern struct task_group root_task_group;
 	INIT_RT_MUTEXES(tsk)						\
 	INIT_VTIME(tsk)							\
 	INIT_NUMA_BALANCING(tsk)					\
+	INIT_UNMAP_BATCH_CONTROL(tsk)					\
 	INIT_KASAN(tsk)							\
 }
 
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index c4c559a45dc8..8d23914b219e 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -89,6 +89,9 @@ enum ttu_flags {
 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
 	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
 	TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
+	TTU_BATCH_FLUSH = (1 << 11),	/* Batch TLB flushes where possible
+					 * and caller guarantees they will
+					 * do a final flush if necessary */
 };
 
 #ifdef CONFIG_MMU
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a419b65770d6..9d51841806f4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1275,6 +1275,16 @@ enum perf_event_task_context {
 	perf_nr_task_contexts,
 };
 
+/* Matches SWAP_CLUSTER_MAX but refined to limit header dependencies */
+#define BATCH_TLBFLUSH_SIZE 32UL
+
+/* Track pages that require TLB flushes */
+struct unmap_batch {
+	struct cpumask cpumask;
+	unsigned long nr_pages;
+	unsigned long pfns[BATCH_TLBFLUSH_SIZE];
+};
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	void *stack;
@@ -1634,6 +1644,11 @@ struct task_struct {
 	unsigned long numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
 
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+	/* For batched TLB flushes of unmapped pages */
+	struct unmap_batch *ubc;
+#endif
+
 	struct rcu_head rcu;
 
 	/*
diff --git a/init/Kconfig b/init/Kconfig
index f5dbc6d4261b..4827d742bfeb 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -889,6 +889,11 @@ config ARCH_SUPPORTS_NUMA_BALANCING
 	bool
 
 #
+# For architectures that have a local TLB flush for a PFN without VMA knowledge
+config ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+	bool
+
+#
 # For architectures that know their GCC __int128 support is sound
 #
 config ARCH_SUPPORTS_INT128
diff --git a/kernel/fork.c b/kernel/fork.c
index cf65139615a0..de9d35434863 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -246,6 +246,11 @@ void __put_task_struct(struct task_struct *tsk)
 	delayacct_tsk_free(tsk);
 	put_signal_struct(tsk->signal);
 
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+	kfree(tsk->ubc);
+	tsk->ubc = NULL;
+#endif
+
 	if (!profile_handoff_task(tsk))
 		free_task(tsk);
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 62671f53202a..d17f8864c25d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1823,6 +1823,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->numa_group = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
 
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+	p->ubc = NULL;
+#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
 }
 
 #ifdef CONFIG_NUMA_BALANCING
diff --git a/mm/internal.h b/mm/internal.h
index a96da5b0029d..fe69dd159e34 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -431,4 +431,15 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_CMA		0x80 /* allow allocations from CMA areas */
 #define ALLOC_FAIR		0x100 /* fair zone allocation */
 
+enum ttu_flags;
+struct unmap_batch;
+
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+void try_to_unmap_flush(void);
+#else
+static inline void try_to_unmap_flush(void)
+{
+}
+
+#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/rmap.c b/mm/rmap.c
index c161a14b6a8f..abb5e5373354 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -60,6 +60,8 @@
 
 #include <asm/tlbflush.h>
 
+#include <trace/events/tlb.h>
+
 #include "internal.h"
 
 static struct kmem_cache *anon_vma_cachep;
@@ -581,6 +583,74 @@ vma_address(struct page *page, struct vm_area_struct *vma)
 	return address;
 }
 
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+static void percpu_flush_tlb_batch_pages(void *data)
+{
+	struct unmap_batch *ubc = data;
+	int i;
+
+	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
+	for (i = 0; i < ubc->nr_pages; i++)
+		flush_local_tlb_addr(ubc->pfns[i] << PAGE_SHIFT);
+}
+
+void try_to_unmap_flush(void)
+{
+	struct unmap_batch *ubc = current->ubc;
+
+	if (!ubc || !ubc->nr_pages)
+		return;
+
+	trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, ubc->nr_pages);
+	smp_call_function_many(&ubc->cpumask, percpu_flush_tlb_batch_pages,
+		(void *)ubc, true);
+	cpumask_clear(&ubc->cpumask);
+	ubc->nr_pages = 0;
+}
+
+static void set_ubc_flush_pending(struct mm_struct *mm,
+		struct page *page)
+{
+	struct unmap_batch *ubc = current->ubc;
+
+	cpumask_or(&ubc->cpumask, &ubc->cpumask, mm_cpumask(mm));
+	ubc->pfns[ubc->nr_pages] = page_to_pfn(page);
+	ubc->nr_pages++;
+
+	if (ubc->nr_pages == BATCH_TLBFLUSH_SIZE)
+		try_to_unmap_flush();
+}
+
+/*
+ * Returns true if the TLB flush should be deferred to the end of a batch of
+ * unmap operations to reduce IPIs.
+ */
+static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
+{
+	bool should_defer = false;
+
+	if (!current->ubc || !(flags & TTU_BATCH_FLUSH))
+		return false;
+
+	/* If remote CPUs need to be flushed then defer batch the flush */
+	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
+		should_defer = true;
+	put_cpu();
+
+	return should_defer;
+}
+#else
+static void set_ubc_flush_pending(struct mm_struct *mm,
+		struct page *page)
+{
+}
+
+static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
+{
+	return false;
+}
+#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
+
 /*
  * At what user virtual address is page expected in vma?
  * Caller should check the page is actually part of the vma.
@@ -1213,7 +1283,20 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 
 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address, page_to_pfn(page));
-	pteval = ptep_clear_flush(vma, address, pte);
+	if (should_defer_flush(mm, flags)) {
+		/*
+		 * We clear the PTE but do not flush so potentially a remote
+		 * CPU could still be writing to the page. If the entry was
+		 * already dirty then no data is lost. If the dirty bit was
+		 * previously clear then the architecture must guarantee that
+		 * a clear->dirty transition on a cached TLB entry is written
+		 * through and traps if the PTE is unmapped.
+		 */
+		pteval = ptep_get_and_clear(mm, address, pte);
+		set_ubc_flush_pending(mm, page);
+	} else {
+		pteval = ptep_clear_flush(vma, address, pte);
+	}
 
 	/* Move the dirty bit to the physical page now the pte is gone. */
 	if (pte_dirty(pteval))
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e8eadd71bac..68bcc0b73a76 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1024,7 +1024,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 * processes. Try to unmap it here.
 		 */
 		if (page_mapped(page) && mapping) {
-			switch (try_to_unmap(page, ttu_flags)) {
+			switch (try_to_unmap(page,
+					ttu_flags|TTU_BATCH_FLUSH)) {
 			case SWAP_FAIL:
 				goto activate_locked;
 			case SWAP_AGAIN:
@@ -1065,6 +1066,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				goto keep_locked;
 
 			/* Page is dirty, try to write it out here */
+			try_to_unmap_flush();
 			switch (pageout(page, mapping, sc)) {
 			case PAGE_KEEP:
 				goto keep_locked;
@@ -1211,6 +1213,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 	ret = shrink_page_list(&clean_pages, zone, &sc,
 			TTU_UNMAP|TTU_IGNORE_ACCESS,
 			&dummy1, &dummy2, &dummy3, &dummy4, &dummy5, true);
+	try_to_unmap_flush();
 	list_splice(&clean_pages, page_list);
 	mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret);
 	return ret;
@@ -2223,6 +2226,7 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
 		scan_adjusted = true;
 	}
 	blk_finish_plug(&plug);
+	try_to_unmap_flush();
 	sc->nr_reclaimed += nr_reclaimed;
 
 	/*
@@ -2762,6 +2766,30 @@ out:
 	return false;
 }
 
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+static inline void alloc_ubc(void)
+{
+	if (current->ubc)
+		return;
+
+	/*
+	 * Allocate the control structure for batch TLB flushing. Harmless if
+	 * the allocation fails as reclaimer will just send more IPIs.
+	 */
+	current->ubc = kmalloc(sizeof(struct unmap_batch),
+				GFP_ATOMIC | __GFP_NOWARN);
+	if (!current->ubc)
+		return;
+
+	cpumask_clear(&current->ubc->cpumask);
+	current->ubc->nr_pages = 0;
+}
+#else
+static inline void alloc_ubc(void)
+{
+}
+#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
+
 unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 				gfp_t gfp_mask, nodemask_t *nodemask)
 {
@@ -2789,6 +2817,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 				sc.may_writepage,
 				gfp_mask);
 
+	alloc_ubc();
 	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
 
 	trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
@@ -3364,6 +3393,8 @@ static int kswapd(void *p)
 
 	lockdep_set_current_reclaim_state(GFP_KERNEL);
 
+	alloc_ubc();
+
 	if (!cpumask_empty(cpumask))
 		set_cpus_allowed_ptr(tsk, cpumask);
 	current->reclaim_state = &reclaim_state;
--
2.1.2
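For illustration only, here is a minimal sketch of the calling convention this
patch establishes: TTU_BATCH_FLUSH tells try_to_unmap() that the PTE may be
cleared without an immediate flush, and the caller then owns issuing
try_to_unmap_flush() before doing anything that depends on the mapping really
being gone, and again at the end of the batch. The function below and its
page-list handling are simplified stand-ins, not the actual shrink_page_list()
code.

/*
 * Sketch of a reclaim-side caller of the batched-flush interface.
 * Hypothetical helper; the real users are shrink_page_list()/shrink_lruvec().
 */
static void reclaim_batch_example(struct list_head *page_list)
{
	struct page *page, *next;

	list_for_each_entry_safe(page, next, page_list, lru) {
		/* Clear the PTEs; the TLB flush may be deferred. */
		try_to_unmap(page, TTU_UNMAP | TTU_BATCH_FLUSH);

		/*
		 * Before starting IO on a dirty page, force the deferred
		 * flush so no CPU can still write through a stale entry.
		 */
		if (PageDirty(page))
			try_to_unmap_flush();
	}

	/* One IPI flushes every PFN accumulated in current->ubc. */
	try_to_unmap_flush();
}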
* Re: [PATCH 2/4] mm: Send a single IPI to TLB flush multiple pages when unmapping
  From: Rik van Riel
  Date: 2015-04-15 21:03 UTC
  To: Mel Gorman, Linux-MM
  Cc: Johannes Weiner, Dave Hansen, Andi Kleen, LKML

On 04/15/2015 06:42 AM, Mel Gorman wrote:
> An IPI is sent to flush remote TLBs when a page is unmapped that was
> recently accessed by other CPUs. There are many circumstances where this
> happens but the obvious one is kswapd reclaiming pages belonging to a
> running process as kswapd and the task are likely running on separate CPUs.
>
> On small machines, this is not a significant problem but as machine
> gets larger with more cores and more memory, the cost of these IPIs can
> be high. This patch uses a structure similar in principle to a pagevec
> to collect a list of PFNs and CPUs that require flushing. It then sends
> one IPI to flush the list of PFNs. A new TLB flush helper is required for
> this and one is added for x86. Other architectures will need to decide if
> batching like this is both safe and worth the memory overhead. Specifically
> the requirement is;
>
> 	If a clean page is unmapped and not immediately flushed, the
> 	architecture must guarantee that a write to that page from a CPU
> 	with a cached TLB entry will trap a page fault.
>
> This is essentially what the kernel already depends on but the window is
> much larger with this patch applied and is worth highlighting.

This means we already have a (hard to hit?) data corruption
issue in the kernel. We can lose data if we unmap a writable
but not dirty pte from a file page, and the task writes before
we flush the TLB.

I can only see one way to completely close the window, and that is to
make the pte(s) read-only, and flush the TLB before unmapping and then
flushing the TLB again. Luckily this is only true for ptes that are both
writeable and clean.

This would of course not be acceptable overhead when flushing things one
page at a time, but if we are moving to batched TLB flushes anyway, there
may be a way around this (a rough sketch follows this message):

1) Check whether the to-be-unmapped pte is read-only, or the page is
   already marked dirty; if either is true, we can go straight to (4).

2) Mark a larger number of ptes read-only in one go (one page table page
   worth of ptes, perhaps?)

3) Flush the TLBs for the task(s) with recently turned read-only ptes.

4) Unmap PTEs like your patch series does.

5) Flush the TLBs like your patch series does.

This might require some protection in the page fault code, to ensure
do_wp_page does not mark the pte read-write again in-between (2) and (4).
Then again, do_wp_page does mark the page dirty so we may be ok.

As an aside, it may be worth just doing a global tlb flush if the number
of entries in a ubc exceeds a certain number.

It may also be worth moving try_to_unmap_flush() from shrink_lruvec() to
shrink_zone(), so it is called once per zone and not once per cgroup
inside the zone. I guess we do need to call it before we call
should_continue_reclaim(), though :)
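In code, the write-protect-first ordering outlined above might look roughly
like the sketch below. The helpers lookup_pte(), mark_ptes_readonly() and
flush_mm_tlbs() are hypothetical names used only for illustration; nothing in
this series implements them.

/* Pseudocode for the five-step scheme above; helpers are hypothetical. */
static void unmap_batch_write_protect_first(struct mm_struct *mm,
					    struct page **pages, int nr)
{
	bool need_wp_flush = false;
	int i;

	for (i = 0; i < nr; i++) {
		pte_t *pte = lookup_pte(mm, pages[i]);	/* hypothetical */

		/* (1) only writable-but-clean ptes can lose a write */
		if (pte_write(*pte) && !pte_dirty(*pte)) {
			/* (2) make the pte (or its neighbours) read-only */
			mark_ptes_readonly(mm, pte);	/* hypothetical */
			need_wp_flush = true;
		}
	}

	/* (3) flush so no CPU still holds a writable TLB entry */
	if (need_wp_flush)
		flush_mm_tlbs(mm);			/* hypothetical */

	/* (4) unmap with the flush deferred, as the series does */
	for (i = 0; i < nr; i++)
		try_to_unmap(pages[i], TTU_UNMAP | TTU_BATCH_FLUSH);

	/* (5) final batched flush, one IPI */
	try_to_unmap_flush();
}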
* Re: [PATCH 2/4] mm: Send a single IPI to TLB flush multiple pages when unmapping
  From: Hugh Dickins
  Date: 2015-04-15 21:16 UTC
  To: Rik van Riel
  Cc: Mel Gorman, Linux-MM, Johannes Weiner, Dave Hansen, Andi Kleen, LKML

On Wed, 15 Apr 2015, Rik van Riel wrote:
> On 04/15/2015 06:42 AM, Mel Gorman wrote:
> > An IPI is sent to flush remote TLBs when a page is unmapped that was
> > recently accessed by other CPUs. There are many circumstances where this
> > happens but the obvious one is kswapd reclaiming pages belonging to a
> > running process as kswapd and the task are likely running on separate CPUs.
> >
> > On small machines, this is not a significant problem but as machine
> > gets larger with more cores and more memory, the cost of these IPIs can
> > be high. This patch uses a structure similar in principle to a pagevec
> > to collect a list of PFNs and CPUs that require flushing. It then sends
> > one IPI to flush the list of PFNs. A new TLB flush helper is required for
> > this and one is added for x86. Other architectures will need to decide if
> > batching like this is both safe and worth the memory overhead. Specifically
> > the requirement is;
> >
> > 	If a clean page is unmapped and not immediately flushed, the
> > 	architecture must guarantee that a write to that page from a CPU
> > 	with a cached TLB entry will trap a page fault.
> >
> > This is essentially what the kernel already depends on but the window is
> > much larger with this patch applied and is worth highlighting.
>
> This means we already have a (hard to hit?) data corruption
> issue in the kernel. We can lose data if we unmap a writable
> but not dirty pte from a file page, and the task writes before
> we flush the TLB.

I don't think so.  IIRC, when the CPU needs to set the dirty bit,
it doesn't just do that in its TLB entry, but has to fetch and update
the actual pte entry - and at that point discovers it's no longer
valid so traps, as Mel says.

(I'm now reading that paragraph differently from when I replied to 4/4.)

Hugh
* Re: [PATCH 2/4] mm: Send a single IPI to TLB flush multiple pages when unmapping
  From: Mel Gorman
  Date: 2015-04-15 21:28 UTC
  To: Hugh Dickins
  Cc: Rik van Riel, Linux-MM, Johannes Weiner, Dave Hansen, Andi Kleen, LKML

On Wed, Apr 15, 2015 at 02:16:49PM -0700, Hugh Dickins wrote:
> On Wed, 15 Apr 2015, Rik van Riel wrote:
> > On 04/15/2015 06:42 AM, Mel Gorman wrote:
> > > An IPI is sent to flush remote TLBs when a page is unmapped that was
> > > recently accessed by other CPUs. There are many circumstances where this
> > > happens but the obvious one is kswapd reclaiming pages belonging to a
> > > running process as kswapd and the task are likely running on separate CPUs.
> > >
> > > On small machines, this is not a significant problem but as machine
> > > gets larger with more cores and more memory, the cost of these IPIs can
> > > be high. This patch uses a structure similar in principle to a pagevec
> > > to collect a list of PFNs and CPUs that require flushing. It then sends
> > > one IPI to flush the list of PFNs. A new TLB flush helper is required for
> > > this and one is added for x86. Other architectures will need to decide if
> > > batching like this is both safe and worth the memory overhead. Specifically
> > > the requirement is;
> > >
> > > 	If a clean page is unmapped and not immediately flushed, the
> > > 	architecture must guarantee that a write to that page from a CPU
> > > 	with a cached TLB entry will trap a page fault.
> > >
> > > This is essentially what the kernel already depends on but the window is
> > > much larger with this patch applied and is worth highlighting.
> >
> > This means we already have a (hard to hit?) data corruption
> > issue in the kernel. We can lose data if we unmap a writable
> > but not dirty pte from a file page, and the task writes before
> > we flush the TLB.
>
> I don't think so.  IIRC, when the CPU needs to set the dirty bit,
> it doesn't just do that in its TLB entry, but has to fetch and update
> the actual pte entry - and at that point discovers it's no longer
> valid so traps, as Mel says.
>

This is what I'm expecting i.e. clean->dirty transition is write-through
to the PTE which is now unmapped and it traps. I'm assuming there is an
architectural guarantee that it happens but could not find an explicit
statement in the docs. I'm hoping Dave or Andi can check with the relevant
people on my behalf.

--
Mel Gorman
SUSE Labs
* Re: [PATCH 2/4] mm: Send a single IPI to TLB flush multiple pages when unmapping
  From: Dave Hansen
  Date: 2015-04-15 21:32 UTC
  To: Mel Gorman, Hugh Dickins
  Cc: Rik van Riel, Linux-MM, Johannes Weiner, Andi Kleen, LKML

On 04/15/2015 02:28 PM, Mel Gorman wrote:
> This is what I'm expecting i.e. clean->dirty transition is write-through
> to the PTE which is now unmapped and it traps. I'm assuming there is an
> architectural guarantee that it happens but could not find an explicit
> statement in the docs. I'm hoping Dave or Andi can check with the relevant
> people on my behalf.

The docs do look a bit ambiguous to me.  I'm working on getting some
clarified language in there.
* Re: [PATCH 2/4] mm: Send a single IPI to TLB flush multiple pages when unmapping
  From: Minchan Kim
  Date: 2015-04-16 06:38 UTC
  To: Mel Gorman
  Cc: Hugh Dickins, Rik van Riel, Linux-MM, Johannes Weiner, Dave Hansen, Andi Kleen, LKML

Hello Mel,

On Wed, Apr 15, 2015 at 10:28:55PM +0100, Mel Gorman wrote:
> On Wed, Apr 15, 2015 at 02:16:49PM -0700, Hugh Dickins wrote:
> > On Wed, 15 Apr 2015, Rik van Riel wrote:
> > > On 04/15/2015 06:42 AM, Mel Gorman wrote:
> > > > An IPI is sent to flush remote TLBs when a page is unmapped that was
> > > > recently accessed by other CPUs. There are many circumstances where this
> > > > happens but the obvious one is kswapd reclaiming pages belonging to a
> > > > running process as kswapd and the task are likely running on separate CPUs.
> > > >
> > > > On small machines, this is not a significant problem but as machine
> > > > gets larger with more cores and more memory, the cost of these IPIs can
> > > > be high. This patch uses a structure similar in principle to a pagevec
> > > > to collect a list of PFNs and CPUs that require flushing. It then sends
> > > > one IPI to flush the list of PFNs. A new TLB flush helper is required for
> > > > this and one is added for x86. Other architectures will need to decide if
> > > > batching like this is both safe and worth the memory overhead. Specifically
> > > > the requirement is;
> > > >
> > > > 	If a clean page is unmapped and not immediately flushed, the
> > > > 	architecture must guarantee that a write to that page from a CPU
> > > > 	with a cached TLB entry will trap a page fault.
> > > >
> > > > This is essentially what the kernel already depends on but the window is
> > > > much larger with this patch applied and is worth highlighting.
> > >
> > > This means we already have a (hard to hit?) data corruption
> > > issue in the kernel. We can lose data if we unmap a writable
> > > but not dirty pte from a file page, and the task writes before
> > > we flush the TLB.
> >
> > I don't think so.  IIRC, when the CPU needs to set the dirty bit,
> > it doesn't just do that in its TLB entry, but has to fetch and update
> > the actual pte entry - and at that point discovers it's no longer
> > valid so traps, as Mel says.
> >
> This is what I'm expecting i.e. clean->dirty transition is write-through
> to the PTE which is now unmapped and it traps. I'm assuming there is an
> architectural guarantee that it happens but could not find an explicit
> statement in the docs. I'm hoping Dave or Andi can check with the relevant
> people on my behalf.

A dumb question. It's not related to your patch but to MADV_FREE.

Is the clean->dirty transition *atomic* as well as write-through? I'm
really confused. It seems most arches use xchg for ptep_get_and_clear so
it's atomic, but arches that do not define __HAVE_ARCH_PTEP_GET_AND_CLEAR
will use the non-atomic version in include/asm-generic/pgtable.h:

#ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
				       unsigned long address,
				       pte_t *ptep)
{
	pte_t pte = *ptep;
	pte_clear(mm, address, ptep);
	return pte;
}
#endif

I hope they have their own lock or something to protect against a race
between software and hardware (i.e., the CPU setting the dirty bit by
itself). Anyway, if there were a problem with that, we would have seen
data corruption, but we haven't, so I guess it's atomic. Otherwise,
MADV_FREE would break too. I'd like to confirm that.

Thanks.
-- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 58+ messages in thread
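For reference, the SMP variant Minchan alludes to is an atomic exchange: on x86, for example, ptep_get_and_clear() clears the PTE with a single xchg, so the value it returns either already carries the dirty bit or the CPU trying to set that bit finds the entry gone and must fault. A rough sketch of that shape (modelled on the x86 implementation; the function name here is illustrative, not a quote of any particular tree):

    static inline pte_t atomic_ptep_get_and_clear(pte_t *ptep)
    {
    #ifdef CONFIG_SMP
            /*
             * Single atomic exchange: the old value returned here either
             * already has the dirty bit set, or a CPU that tries to set it
             * re-walks the now-zero PTE and takes a fault instead.
             */
            return native_make_pte(xchg(&ptep->pte, 0));
    #else
            /* UP: nothing can race with us, a plain copy-and-clear is fine */
            pte_t pte = *ptep;
            native_pte_clear(NULL, 0, ptep);
            return pte;
    #endif
    }

The generic fallback quoted above is only used by architectures that either run UP or serialise against the hardware walker by some other means, which is the point Mel makes in the reply below.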
* Re: [PATCH 2/4] mm: Send a single IPI to TLB flush multiple pages when unmapping 2015-04-16 6:38 ` Minchan Kim @ 2015-04-16 8:07 ` Mel Gorman -1 siblings, 0 replies; 58+ messages in thread From: Mel Gorman @ 2015-04-16 8:07 UTC (permalink / raw) To: Minchan Kim Cc: Hugh Dickins, Rik van Riel, Linux-MM, Johannes Weiner, Dave Hansen, Andi Kleen, LKML On Thu, Apr 16, 2015 at 03:38:26PM +0900, Minchan Kim wrote: > Hello Mel, > > On Wed, Apr 15, 2015 at 10:28:55PM +0100, Mel Gorman wrote: > > On Wed, Apr 15, 2015 at 02:16:49PM -0700, Hugh Dickins wrote: > > > On Wed, 15 Apr 2015, Rik van Riel wrote: > > > > On 04/15/2015 06:42 AM, Mel Gorman wrote: > > > > > An IPI is sent to flush remote TLBs when a page is unmapped that was > > > > > recently accessed by other CPUs. There are many circumstances where this > > > > > happens but the obvious one is kswapd reclaiming pages belonging to a > > > > > running process as kswapd and the task are likely running on separate CPUs. > > > > > > > > > > On small machines, this is not a significant problem but as machine > > > > > gets larger with more cores and more memory, the cost of these IPIs can > > > > > be high. This patch uses a structure similar in principle to a pagevec > > > > > to collect a list of PFNs and CPUs that require flushing. It then sends > > > > > one IPI to flush the list of PFNs. A new TLB flush helper is required for > > > > > this and one is added for x86. Other architectures will need to decide if > > > > > batching like this is both safe and worth the memory overhead. Specifically > > > > > the requirement is; > > > > > > > > > > If a clean page is unmapped and not immediately flushed, the > > > > > architecture must guarantee that a write to that page from a CPU > > > > > with a cached TLB entry will trap a page fault. > > > > > > > > > > This is essentially what the kernel already depends on but the window is > > > > > much larger with this patch applied and is worth highlighting. > > > > > > > > This means we already have a (hard to hit?) data corruption > > > > issue in the kernel. We can lose data if we unmap a writable > > > > but not dirty pte from a file page, and the task writes before > > > > we flush the TLB. > > > > > > I don't think so. IIRC, when the CPU needs to set the dirty bit, > > > it doesn't just do that in its TLB entry, but has to fetch and update > > > the actual pte entry - and at that point discovers it's no longer > > > valid so traps, as Mel says. > > > > > > > This is what I'm expecting i.e. clean->dirty transition is write-through > > to the PTE which is now unmapped and it traps. I'm assuming there is an > > architectural guarantee that it happens but could not find an explicit > > statement in the docs. I'm hoping Dave or Andi can check with the relevant > > people on my behalf. > > A dumb question. It's not related to your patch but MADV_FREE. > > clean->dirty transition is *atomic* as well as write-through? This is the TLB cache clean->dirty transition so it's not 100% clear what you are asking. It both needs to be write-through and the TLB updates must happen before the actual data write to cache or memory and it must be ordered. > I'm really confusing. > It seems most arches use xchg for ptep_get_and_clear so it's > atomic but some of arches without defining __HAVE_ARCH_PTEP_GET_AND_CLEAR > will use non-atomic version in include/asm-generic/pgtable.h. 
> > #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR > static inline pte_t ptep_get_and_clear(struct mm_struct *mm, > unsigned long address, > pte_t *ptep) > { > pte_t pte = *ptep; > pte_clear(mm, address, ptep); > return pte; > } > #endif > And if they are using this, they need to be ok that it's not atomic but it's not clear what you are asking. > I hope they have own lock or something to protect a race between software > and hardware(ie, CPU set dirty bit by itself). > Or they're UP. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 2/4] mm: Send a single IPI to TLB flush multiple pages when unmapping 2015-04-16 8:07 ` Mel Gorman @ 2015-04-16 8:29 ` Minchan Kim -1 siblings, 0 replies; 58+ messages in thread From: Minchan Kim @ 2015-04-16 8:29 UTC (permalink / raw) To: Mel Gorman Cc: Hugh Dickins, Rik van Riel, Linux-MM, Johannes Weiner, Dave Hansen, Andi Kleen, LKML On Thu, Apr 16, 2015 at 09:07:22AM +0100, Mel Gorman wrote: > On Thu, Apr 16, 2015 at 03:38:26PM +0900, Minchan Kim wrote: > > Hello Mel, > > > > On Wed, Apr 15, 2015 at 10:28:55PM +0100, Mel Gorman wrote: > > > On Wed, Apr 15, 2015 at 02:16:49PM -0700, Hugh Dickins wrote: > > > > On Wed, 15 Apr 2015, Rik van Riel wrote: > > > > > On 04/15/2015 06:42 AM, Mel Gorman wrote: > > > > > > An IPI is sent to flush remote TLBs when a page is unmapped that was > > > > > > recently accessed by other CPUs. There are many circumstances where this > > > > > > happens but the obvious one is kswapd reclaiming pages belonging to a > > > > > > running process as kswapd and the task are likely running on separate CPUs. > > > > > > > > > > > > On small machines, this is not a significant problem but as machine > > > > > > gets larger with more cores and more memory, the cost of these IPIs can > > > > > > be high. This patch uses a structure similar in principle to a pagevec > > > > > > to collect a list of PFNs and CPUs that require flushing. It then sends > > > > > > one IPI to flush the list of PFNs. A new TLB flush helper is required for > > > > > > this and one is added for x86. Other architectures will need to decide if > > > > > > batching like this is both safe and worth the memory overhead. Specifically > > > > > > the requirement is; > > > > > > > > > > > > If a clean page is unmapped and not immediately flushed, the > > > > > > architecture must guarantee that a write to that page from a CPU > > > > > > with a cached TLB entry will trap a page fault. > > > > > > > > > > > > This is essentially what the kernel already depends on but the window is > > > > > > much larger with this patch applied and is worth highlighting. > > > > > > > > > > This means we already have a (hard to hit?) data corruption > > > > > issue in the kernel. We can lose data if we unmap a writable > > > > > but not dirty pte from a file page, and the task writes before > > > > > we flush the TLB. > > > > > > > > I don't think so. IIRC, when the CPU needs to set the dirty bit, > > > > it doesn't just do that in its TLB entry, but has to fetch and update > > > > the actual pte entry - and at that point discovers it's no longer > > > > valid so traps, as Mel says. > > > > > > > > > > This is what I'm expecting i.e. clean->dirty transition is write-through > > > to the PTE which is now unmapped and it traps. I'm assuming there is an > > > architectural guarantee that it happens but could not find an explicit > > > statement in the docs. I'm hoping Dave or Andi can check with the relevant > > > people on my behalf. > > > > A dumb question. It's not related to your patch but MADV_FREE. > > > > clean->dirty transition is *atomic* as well as write-through? > > This is the TLB cache clean->dirty transition so it's not 100% clear what you > are asking. It both needs to be write-through and the TLB updates must happen > before the actual data write to cache or memory and it must be ordered. Sorry for not clear. I will try again. In try_to_unmap_one, pteval = ptep_clear_flush(vma, address, pte); { pte = ptep_get_and_clear(mm, address, ptep); <-------------- A application write on other CPU. 
flush_tlb_page(vma, address); } /* Move the dirty bit to the physical page now the pte is gone. */ dirty = pte_dirty(pteval); if (dirty) set_page_dirty(page); ... In above, ptep_clear_flush just does xchg operation to make pte zero in ptep_get_and_clear and return old pte_val but didn't flush TLB yet. Let's assume old pte_val doesn't have dirty bit(ie, it was clean). If application on other CPU does write the memory at the same time, what happens? I mean (pte cleaning/return old) and (dirty bit setting by CPU itself) should be exclusive so application on another CPU should encounter page fault or we should see the dirty bit. Is it guaranteed? > > > I'm really confusing. > > It seems most arches use xchg for ptep_get_and_clear so it's > > atomic but some of arches without defining __HAVE_ARCH_PTEP_GET_AND_CLEAR > > will use non-atomic version in include/asm-generic/pgtable.h. > > > > #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR > > static inline pte_t ptep_get_and_clear(struct mm_struct *mm, > > unsigned long address, > > pte_t *ptep) > > { > > pte_t pte = *ptep; > > pte_clear(mm, address, ptep); > > return pte; > > } > > #endif > > > > And if they are using this, they need to be ok that it's not atomic but > it's not clear what you are asking. > > > I hope they have own lock or something to protect a race between software > > and hardware(ie, CPU set dirty bit by itself). > > > > Or they're UP. Yeb. > > -- > Mel Gorman > SUSE Labs -- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 2/4] mm: Send a single IPI to TLB flush multiple pages when unmapping 2015-04-16 8:29 ` Minchan Kim @ 2015-04-16 9:19 ` Mel Gorman -1 siblings, 0 replies; 58+ messages in thread From: Mel Gorman @ 2015-04-16 9:19 UTC (permalink / raw) To: Minchan Kim Cc: Hugh Dickins, Rik van Riel, Linux-MM, Johannes Weiner, Dave Hansen, Andi Kleen, LKML On Thu, Apr 16, 2015 at 05:29:55PM +0900, Minchan Kim wrote: > On Thu, Apr 16, 2015 at 09:07:22AM +0100, Mel Gorman wrote: > > On Thu, Apr 16, 2015 at 03:38:26PM +0900, Minchan Kim wrote: > > > Hello Mel, > > > > > > On Wed, Apr 15, 2015 at 10:28:55PM +0100, Mel Gorman wrote: > > > > On Wed, Apr 15, 2015 at 02:16:49PM -0700, Hugh Dickins wrote: > > > > > On Wed, 15 Apr 2015, Rik van Riel wrote: > > > > > > On 04/15/2015 06:42 AM, Mel Gorman wrote: > > > > > > > An IPI is sent to flush remote TLBs when a page is unmapped that was > > > > > > > recently accessed by other CPUs. There are many circumstances where this > > > > > > > happens but the obvious one is kswapd reclaiming pages belonging to a > > > > > > > running process as kswapd and the task are likely running on separate CPUs. > > > > > > > > > > > > > > On small machines, this is not a significant problem but as machine > > > > > > > gets larger with more cores and more memory, the cost of these IPIs can > > > > > > > be high. This patch uses a structure similar in principle to a pagevec > > > > > > > to collect a list of PFNs and CPUs that require flushing. It then sends > > > > > > > one IPI to flush the list of PFNs. A new TLB flush helper is required for > > > > > > > this and one is added for x86. Other architectures will need to decide if > > > > > > > batching like this is both safe and worth the memory overhead. Specifically > > > > > > > the requirement is; > > > > > > > > > > > > > > If a clean page is unmapped and not immediately flushed, the > > > > > > > architecture must guarantee that a write to that page from a CPU > > > > > > > with a cached TLB entry will trap a page fault. > > > > > > > > > > > > > > This is essentially what the kernel already depends on but the window is > > > > > > > much larger with this patch applied and is worth highlighting. > > > > > > > > > > > > This means we already have a (hard to hit?) data corruption > > > > > > issue in the kernel. We can lose data if we unmap a writable > > > > > > but not dirty pte from a file page, and the task writes before > > > > > > we flush the TLB. > > > > > > > > > > I don't think so. IIRC, when the CPU needs to set the dirty bit, > > > > > it doesn't just do that in its TLB entry, but has to fetch and update > > > > > the actual pte entry - and at that point discovers it's no longer > > > > > valid so traps, as Mel says. > > > > > > > > > > > > > This is what I'm expecting i.e. clean->dirty transition is write-through > > > > to the PTE which is now unmapped and it traps. I'm assuming there is an > > > > architectural guarantee that it happens but could not find an explicit > > > > statement in the docs. I'm hoping Dave or Andi can check with the relevant > > > > people on my behalf. > > > > > > A dumb question. It's not related to your patch but MADV_FREE. > > > > > > clean->dirty transition is *atomic* as well as write-through? > > > > This is the TLB cache clean->dirty transition so it's not 100% clear what you > > are asking. It both needs to be write-through and the TLB updates must happen > > before the actual data write to cache or memory and it must be ordered. > > Sorry for not clear. 
I will try again. > > In try_to_unmap_one, > > > pteval = ptep_clear_flush(vma, address, pte); > { > pte = ptep_get_and_clear(mm, address, ptep); > <-------------- A application write on other CPU. > flush_tlb_page(vma, address); > } > > /* Move the dirty bit to the physical page now the pte is gone. */ > dirty = pte_dirty(pteval); > if (dirty) > set_page_dirty(page); > ... > > > In above, ptep_clear_flush just does xchg operation to make pte zero > in ptep_get_and_clear and return old pte_val but didn't flush TLB yet. Correct. > Let's assume old pte_val doesn't have dirty bit(ie, it was clean). > If application on other CPU does write the memory at the same time, > what happens? The comments describe the architectural guarantee I'm looking for. Dave says he's asking the relevant people within Intel. I revised the comment in the unreleased V2 so it reads /* * We clear the PTE but do not flush so potentially a remote * CPU could still be writing to the page. If the entry was * previously clean then the architecture must guarantee that * a clear->dirty transition on a cached TLB entry is written * through and traps if the PTE is unmapped. If the entry is * already dirty then it's handled below by the * pte_dirty check. */ > I mean (pte cleaning/return old) and (dirty bit setting by CPU itself) > should be exclusive so application on another CPU should encounter > page fault or we should see the dirty bit. > Is it guaranteed? > This is the key question. I think "yes it must be" but Dave is going to get the definite answer in the x86 case. Each architecture will need to examine the issue separately. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 58+ messages in thread
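To put the revised comment in context, the batched path under discussion replaces the synchronous ptep_clear_flush() with a plain atomic clear plus a record of the PFN, deferring the IPI. Roughly along these lines (the helper and flag names below are illustrative, not taken from the series):

    if (defer_tlb_flush) {
            /*
             * Clear the PTE but do not flush.  A remote CPU may still hold
             * a clean TLB entry; per the comment above, a write through it
             * must trap because the clean->dirty transition is written
             * through to the PTE, which is no longer present.
             */
            pteval = ptep_get_and_clear(mm, address, pte);
            queue_deferred_tlb_flush(mm, page_to_pfn(page));
    } else {
            pteval = ptep_clear_flush(vma, address, pte);
    }

    /* Move the dirty bit to the physical page now the pte is gone. */
    if (pte_dirty(pteval))
            set_page_dirty(page);

The pte_dirty() check covers the already-dirty case; the architectural guarantee is only needed when the PTE was clean at unmap time.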
* Re: [PATCH 2/4] mm: Send a single IPI to TLB flush multiple pages when unmapping 2015-04-16 9:19 ` Mel Gorman @ 2015-04-16 23:30 ` Minchan Kim -1 siblings, 0 replies; 58+ messages in thread From: Minchan Kim @ 2015-04-16 23:30 UTC (permalink / raw) To: Mel Gorman Cc: Hugh Dickins, Rik van Riel, Linux-MM, Johannes Weiner, Dave Hansen, Andi Kleen, LKML Hello Mel, On Thu, Apr 16, 2015 at 10:19:22AM +0100, Mel Gorman wrote: > On Thu, Apr 16, 2015 at 05:29:55PM +0900, Minchan Kim wrote: > > On Thu, Apr 16, 2015 at 09:07:22AM +0100, Mel Gorman wrote: > > > On Thu, Apr 16, 2015 at 03:38:26PM +0900, Minchan Kim wrote: > > > > Hello Mel, > > > > > > > > On Wed, Apr 15, 2015 at 10:28:55PM +0100, Mel Gorman wrote: > > > > > On Wed, Apr 15, 2015 at 02:16:49PM -0700, Hugh Dickins wrote: > > > > > > On Wed, 15 Apr 2015, Rik van Riel wrote: > > > > > > > On 04/15/2015 06:42 AM, Mel Gorman wrote: > > > > > > > > An IPI is sent to flush remote TLBs when a page is unmapped that was > > > > > > > > recently accessed by other CPUs. There are many circumstances where this > > > > > > > > happens but the obvious one is kswapd reclaiming pages belonging to a > > > > > > > > running process as kswapd and the task are likely running on separate CPUs. > > > > > > > > > > > > > > > > On small machines, this is not a significant problem but as machine > > > > > > > > gets larger with more cores and more memory, the cost of these IPIs can > > > > > > > > be high. This patch uses a structure similar in principle to a pagevec > > > > > > > > to collect a list of PFNs and CPUs that require flushing. It then sends > > > > > > > > one IPI to flush the list of PFNs. A new TLB flush helper is required for > > > > > > > > this and one is added for x86. Other architectures will need to decide if > > > > > > > > batching like this is both safe and worth the memory overhead. Specifically > > > > > > > > the requirement is; > > > > > > > > > > > > > > > > If a clean page is unmapped and not immediately flushed, the > > > > > > > > architecture must guarantee that a write to that page from a CPU > > > > > > > > with a cached TLB entry will trap a page fault. > > > > > > > > > > > > > > > > This is essentially what the kernel already depends on but the window is > > > > > > > > much larger with this patch applied and is worth highlighting. > > > > > > > > > > > > > > This means we already have a (hard to hit?) data corruption > > > > > > > issue in the kernel. We can lose data if we unmap a writable > > > > > > > but not dirty pte from a file page, and the task writes before > > > > > > > we flush the TLB. > > > > > > > > > > > > I don't think so. IIRC, when the CPU needs to set the dirty bit, > > > > > > it doesn't just do that in its TLB entry, but has to fetch and update > > > > > > the actual pte entry - and at that point discovers it's no longer > > > > > > valid so traps, as Mel says. > > > > > > > > > > > > > > > > This is what I'm expecting i.e. clean->dirty transition is write-through > > > > > to the PTE which is now unmapped and it traps. I'm assuming there is an > > > > > architectural guarantee that it happens but could not find an explicit > > > > > statement in the docs. I'm hoping Dave or Andi can check with the relevant > > > > > people on my behalf. > > > > > > > > A dumb question. It's not related to your patch but MADV_FREE. > > > > > > > > clean->dirty transition is *atomic* as well as write-through? > > > > > > This is the TLB cache clean->dirty transition so it's not 100% clear what you > > > are asking. 
It both needs to be write-through and the TLB updates must happen > > > before the actual data write to cache or memory and it must be ordered. > > > > Sorry for not clear. I will try again. > > > > In try_to_unmap_one, > > > > > > pteval = ptep_clear_flush(vma, address, pte); > > { > > pte = ptep_get_and_clear(mm, address, ptep); > > <-------------- A application write on other CPU. > > flush_tlb_page(vma, address); > > } > > > > /* Move the dirty bit to the physical page now the pte is gone. */ > > dirty = pte_dirty(pteval); > > if (dirty) > > set_page_dirty(page); > > ... > > > > > > In above, ptep_clear_flush just does xchg operation to make pte zero > > in ptep_get_and_clear and return old pte_val but didn't flush TLB yet. > > Correct. > > > Let's assume old pte_val doesn't have dirty bit(ie, it was clean). > > If application on other CPU does write the memory at the same time, > > what happens? > > The comments describe the architectural guarantee I'm looking for. Dave > says he's asking the relevant people within Intel. I revised the comment > in the unreleased V2 so it reads > > /* > * We clear the PTE but do not flush so potentially a remote > * CPU could still be writing to the page. If the entry was > * previously clean then the architecture must guarantee that > * a clear->dirty transition on a cached TLB entry is written > * through and traps if the PTE is unmapped. If the entry is > * already dirty then it's handled below by the > * pte_dirty check. > */ > > > I mean (pte cleaning/return old) and (dirty bit setting by CPU itself) > > should be exclusive so application on another CPU should encounter > > page fault or we should see the dirty bit. > > Is it guaranteed? > > > > This is the key question. I think "yes it must be" but Dave is going to > get the definite answer in the x86 case. Each architecture will need to > examine the issue separately. If other architectures didn't guarantee, it will happen data loss by memory-mapped file page write. And that code stayed for many years so I guess every architecture guarantees it. Otherwise, mmaped-file page write and MADV_FREE will be broken. Thanks for the answer, Mel! > > -- > Mel Gorman > SUSE Labs -- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 2/4] mm: Send a single IPI to TLB flush multiple pages when unmapping 2015-04-15 10:42 ` Mel Gorman @ 2015-04-15 22:20 ` Andi Kleen -1 siblings, 0 replies; 58+ messages in thread From: Andi Kleen @ 2015-04-15 22:20 UTC (permalink / raw) To: Mel Gorman Cc: Linux-MM, Rik van Riel, Johannes Weiner, Dave Hansen, Andi Kleen, LKML I did a quick read and it looks good to me. It's a bit ugly to bloat current with the ubc pointer, but i guess there's no good way around that. Also not nice to use GFP_ATOMIC for the allocation, but again there's no way around it and it will eventually recover if it fails. There may be a slightly better GFP flag for this situation that doesn't dip into the interrupt pools? -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 2/4] mm: Send a single IPI to TLB flush multiple pages when unmapping 2015-04-15 22:20 ` Andi Kleen @ 2015-04-15 22:53 ` Mel Gorman -1 siblings, 0 replies; 58+ messages in thread From: Mel Gorman @ 2015-04-15 22:53 UTC (permalink / raw) To: Andi Kleen; +Cc: Linux-MM, Rik van Riel, Johannes Weiner, Dave Hansen, LKML On Thu, Apr 16, 2015 at 12:20:06AM +0200, Andi Kleen wrote: > > I did a quick read and it looks good to me. > Thanks. Does that also include a guarantee that a write to a clean TLB entry will fault if the underlying PTE is unmapped? > It's a bit ugly to bloat current with the ubc pointer, > but i guess there's no good way around that. > I didn't see a better alternative. > Also not nice to use GFP_ATOMIC for the allocation, > but again there's no way around it and it will > eventually recover if it fails. There may be > a slightly better GFP flag for this situation that > doesn't dip into the interrupt pools? > I can use GFP_KERNEL|__GFP_NOWARN. In the kswapd case, it's early in the lifetime of the system so it's not going to enter direct reclaim. In the direct reclaim path, the allocation will not recurse due to PF_MEMALLOC. It ends up achieving the same effect without being as obvious as GFP_ATOMIC was. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 58+ messages in thread
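As a concrete sketch of what that relaxed allocation might look like (the field and function names below are assumptions for illustration, not code from the series), an allocation failure simply means carrying on with the existing flush-per-page behaviour:

    /* Illustrative only: allocate the per-task batch lazily on the reclaim path */
    static void alloc_unmap_batch(struct task_struct *tsk)
    {
            if (tsk->unmap_batch)
                    return;

            /*
             * GFP_KERNEL is safe here: kswapd hits this early in its life
             * before it could need to reclaim, and direct reclaim cannot
             * recurse because PF_MEMALLOC is set.  __GFP_NOWARN because
             * failure is harmless -- we just keep flushing one page at a time.
             */
            tsk->unmap_batch = kmalloc(sizeof(struct unmap_batch),
                                       GFP_KERNEL | __GFP_NOWARN);
    }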
* [PATCH 3/4] mm: Gather more PFNs before sending a TLB to flush unmapped pages 2015-04-15 10:42 ` Mel Gorman @ 2015-04-15 10:42 ` Mel Gorman -1 siblings, 0 replies; 58+ messages in thread From: Mel Gorman @ 2015-04-15 10:42 UTC (permalink / raw) To: Linux-MM Cc: Rik van Riel, Johannes Weiner, Dave Hansen, Andi Kleen, LKML, Mel Gorman The patch "mm: Send a single IPI to TLB flush multiple pages when unmapping" would batch 32 pages before sending an IPI. This patch increases the size of the data structure to hold a pages worth of PFNs before sending an IPI. This is a trade-off between memory usage and reducing IPIS sent. In the ideal case where multiple processes are reading large mapped files, this patch reduces interrupts/second from roughly 180K per second to 60K per second. Signed-off-by: Mel Gorman <mgorman@suse.de> --- include/linux/sched.h | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 9d51841806f4..abff66ecc302 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1275,11 +1275,16 @@ enum perf_event_task_context { perf_nr_task_contexts, }; -/* Matches SWAP_CLUSTER_MAX but refined to limit header dependencies */ -#define BATCH_TLBFLUSH_SIZE 32UL +/* + * Use a page to store as many PFNs as possible for batch unmapping. Adjusting + * this trades memory usage for number of IPIs sent + */ +#define BATCH_TLBFLUSH_SIZE \ + ((PAGE_SIZE - sizeof(struct cpumask) - sizeof(unsigned long)) / sizeof(unsigned long)) /* Track pages that require TLB flushes */ struct unmap_batch { + /* Update BATCH_TLBFLUSH_SIZE when adjusting this structure */ struct cpumask cpumask; unsigned long nr_pages; unsigned long pfns[BATCH_TLBFLUSH_SIZE]; -- 2.1.2 ^ permalink raw reply related [flat|nested] 58+ messages in thread
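To put rough numbers on the trade-off: with 4KiB pages and, say, NR_CPUS=512 (so sizeof(struct cpumask) is 64 bytes), BATCH_TLBFLUSH_SIZE works out to (4096 - 64 - 8) / 8 = 503 PFNs per batch, compared with the previous 32 -- roughly a 16x reduction in IPIs over a long run of unmaps. A quick userspace check of the arithmetic (page size and cpumask size are assumptions for the example):

    #include <stdio.h>

    int main(void)
    {
            unsigned long page_size = 4096;         /* assumed PAGE_SIZE */
            unsigned long cpumask_bytes = 512 / 8;  /* sizeof(struct cpumask) with NR_CPUS=512 */
            unsigned long batch = (page_size - cpumask_bytes - sizeof(unsigned long))
                                    / sizeof(unsigned long);

            printf("%lu PFNs per batch, was 32\n", batch);  /* prints 503 on LP64 */
            return 0;
    }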
* Re: [PATCH 3/4] mm: Gather more PFNs before sending a TLB to flush unmapped pages 2015-04-15 10:42 ` Mel Gorman @ 2015-04-15 11:42 ` Peter Zijlstra -1 siblings, 0 replies; 58+ messages in thread From: Peter Zijlstra @ 2015-04-15 11:42 UTC (permalink / raw) To: Mel Gorman Cc: Linux-MM, Rik van Riel, Johannes Weiner, Dave Hansen, Andi Kleen, LKML On Wed, Apr 15, 2015 at 11:42:55AM +0100, Mel Gorman wrote: > +/* > + * Use a page to store as many PFNs as possible for batch unmapping. Adjusting > + * this trades memory usage for number of IPIs sent > + */ > +#define BATCH_TLBFLUSH_SIZE \ > + ((PAGE_SIZE - sizeof(struct cpumask) - sizeof(unsigned long)) / sizeof(unsigned long)) > > /* Track pages that require TLB flushes */ > struct unmap_batch { > + /* Update BATCH_TLBFLUSH_SIZE when adjusting this structure */ > struct cpumask cpumask; > unsigned long nr_pages; > unsigned long pfns[BATCH_TLBFLUSH_SIZE]; The alternative is something like: struct unmap_batch { struct cpumask cpumask; unsigned long nr_pages; unsigned long pfnsp[0]; }; #define BATCH_TLBFLUSH_SIZE ((PAGE_SIZE - sizeof(struct unmap_batch)) / sizeof(unsigned long)) and unconditionally allocate 1 page. This saves you from having to worry about the layout of struct unmap_batch. ^ permalink raw reply [flat|nested] 58+ messages in thread
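Concretely, Peter's variant pairs the zero-length trailing array with an unconditional one-page allocation, something like this (again a sketch of the suggestion rather than code from the series):

    struct unmap_batch *ubc;

    /* One page, laid out as the header followed by as many PFNs as fit */
    ubc = (struct unmap_batch *)__get_free_page(GFP_KERNEL | __GFP_NOWARN);
    if (ubc) {
            cpumask_clear(&ubc->cpumask);
            ubc->nr_pages = 0;
            /* BATCH_TLBFLUSH_SIZE entries of the trailing PFN array fill the rest */
    }

The win is that BATCH_TLBFLUSH_SIZE is derived from sizeof(struct unmap_batch) itself, so adding a member cannot silently overflow the page; the cost, as Mel notes in the reply below, is that the allocation site and the size calculation still have to agree that one page is the backing store.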
* Re: [PATCH 3/4] mm: Gather more PFNs before sending a TLB to flush unmapped pages 2015-04-15 11:42 ` Peter Zijlstra @ 2015-04-15 12:15 ` Mel Gorman -1 siblings, 0 replies; 58+ messages in thread From: Mel Gorman @ 2015-04-15 12:15 UTC (permalink / raw) To: Peter Zijlstra Cc: Linux-MM, Rik van Riel, Johannes Weiner, Dave Hansen, Andi Kleen, LKML On Wed, Apr 15, 2015 at 01:42:20PM +0200, Peter Zijlstra wrote: > On Wed, Apr 15, 2015 at 11:42:55AM +0100, Mel Gorman wrote: > > +/* > > + * Use a page to store as many PFNs as possible for batch unmapping. Adjusting > > + * this trades memory usage for number of IPIs sent > > + */ > > +#define BATCH_TLBFLUSH_SIZE \ > > + ((PAGE_SIZE - sizeof(struct cpumask) - sizeof(unsigned long)) / sizeof(unsigned long)) > > > > /* Track pages that require TLB flushes */ > > struct unmap_batch { > > + /* Update BATCH_TLBFLUSH_SIZE when adjusting this structure */ > > struct cpumask cpumask; > > unsigned long nr_pages; > > unsigned long pfns[BATCH_TLBFLUSH_SIZE]; > > The alternative is something like: > > struct unmap_batch { > struct cpumask cpumask; > unsigned long nr_pages; > unsigned long pfnsp[0]; > }; > > #define BATCH_TLBFLUSH_SIZE ((PAGE_SIZE - sizeof(struct unmap_batch)) / sizeof(unsigned long)) > > and unconditionally allocate 1 page. This saves you from having to worry > about the layout of struct unmap_batch. True but then I need to calculate the size of the real array so it's similar in terms of readability. The plus would be that if the structure changes then the size calculation is not changed but then the allocation site and the size calculation must be kept in sync. I did not see a clear win of one approach over the other so flipped a coin. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 3/4] mm: Gather more PFNs before sending a TLB to flush unmapped pages 2015-04-15 12:15 ` Mel Gorman @ 2015-04-15 12:24 ` Peter Zijlstra -1 siblings, 0 replies; 58+ messages in thread From: Peter Zijlstra @ 2015-04-15 12:24 UTC (permalink / raw) To: Mel Gorman Cc: Linux-MM, Rik van Riel, Johannes Weiner, Dave Hansen, Andi Kleen, LKML On Wed, Apr 15, 2015 at 01:15:53PM +0100, Mel Gorman wrote: > On Wed, Apr 15, 2015 at 01:42:20PM +0200, Peter Zijlstra wrote: > > On Wed, Apr 15, 2015 at 11:42:55AM +0100, Mel Gorman wrote: > > > +/* > > > + * Use a page to store as many PFNs as possible for batch unmapping. Adjusting > > > + * this trades memory usage for number of IPIs sent > > > + */ > > > +#define BATCH_TLBFLUSH_SIZE \ > > > + ((PAGE_SIZE - sizeof(struct cpumask) - sizeof(unsigned long)) / sizeof(unsigned long)) > > > > > > /* Track pages that require TLB flushes */ > > > struct unmap_batch { > > > + /* Update BATCH_TLBFLUSH_SIZE when adjusting this structure */ > > > struct cpumask cpumask; > > > unsigned long nr_pages; > > > unsigned long pfns[BATCH_TLBFLUSH_SIZE]; > > > > The alternative is something like: > > > > struct unmap_batch { > > struct cpumask cpumask; > > unsigned long nr_pages; > > unsigned long pfnsp[0]; > > }; > > > > #define BATCH_TLBFLUSH_SIZE ((PAGE_SIZE - sizeof(struct unmap_batch)) / sizeof(unsigned long)) > > > > and unconditionally allocate 1 page. This saves you from having to worry > > about the layout of struct unmap_batch. > > True but then I need to calculate the size of the real array so it's > similar in terms of readability. The plus would be that if the structure > changes then the size calculation is not changed but then the allocation > site and the size calculation must be kept in sync. I did not see a clear > win of one approach over the other so flipped a coin. I'm not seeing your argument: in both your variant and mine, the allocation is hard-assumed to be 1 page, right? But even then, what's more likely to change, extra members in our struct or growing the allocation to two (or more) pages? ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 3/4] mm: Gather more PFNs before sending a TLB to flush unmapped pages 2015-04-15 12:24 ` Peter Zijlstra @ 2015-04-15 12:56 ` Mel Gorman -1 siblings, 0 replies; 58+ messages in thread From: Mel Gorman @ 2015-04-15 12:56 UTC (permalink / raw) To: Peter Zijlstra Cc: Linux-MM, Rik van Riel, Johannes Weiner, Dave Hansen, Andi Kleen, LKML On Wed, Apr 15, 2015 at 02:24:40PM +0200, Peter Zijlstra wrote: > On Wed, Apr 15, 2015 at 01:15:53PM +0100, Mel Gorman wrote: > > On Wed, Apr 15, 2015 at 01:42:20PM +0200, Peter Zijlstra wrote: > > > On Wed, Apr 15, 2015 at 11:42:55AM +0100, Mel Gorman wrote: > > > > +/* > > > > + * Use a page to store as many PFNs as possible for batch unmapping. Adjusting > > > > + * this trades memory usage for number of IPIs sent > > > > + */ > > > > +#define BATCH_TLBFLUSH_SIZE \ > > > > + ((PAGE_SIZE - sizeof(struct cpumask) - sizeof(unsigned long)) / sizeof(unsigned long)) > > > > > > > > /* Track pages that require TLB flushes */ > > > > struct unmap_batch { > > > > + /* Update BATCH_TLBFLUSH_SIZE when adjusting this structure */ > > > > struct cpumask cpumask; > > > > unsigned long nr_pages; > > > > unsigned long pfns[BATCH_TLBFLUSH_SIZE]; > > > > > > The alternative is something like: > > > > > > struct unmap_batch { > > > struct cpumask cpumask; > > > unsigned long nr_pages; > > > unsigned long pfnsp[0]; > > > }; > > > > > > #define BATCH_TLBFLUSH_SIZE ((PAGE_SIZE - sizeof(struct unmap_batch)) / sizeof(unsigned long)) > > > > > > and unconditionally allocate 1 page. This saves you from having to worry > > > about the layout of struct unmap_batch. > > > > True but then I need to calculate the size of the real array so it's > > similar in terms of readability. The plus would be that if the structure > > changes then the size calculation is not changed but then the allocation > > site and the size calculation must be kept in sync. I did not see a clear > > win of one approach over the other so flipped a coin. > > I'm not seeing your argument, in both your an mine variant the > allocation is hard assumed to be 1 page, right? No, in mine I can use sizeof to "discover" it even though the answer is always a page. > But even then, what's > more likely to change, extra members in our struct or growing the > allocation to two (or more) pages? Either approach requires careful treatment. I can switch to your method in V2 because to me, they're equivalent in terms of readability and maintenance. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 58+ messages in thread
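For readers comparing the two layouts debated above, a minimal sketch of how the flexible-array variant could look at the allocation site follows. It is illustrative only: alloc_unmap_batch() is a hypothetical helper name (the posted series uses alloc_ubc()), the usual kernel headers (linux/mm.h, linux/gfp.h, linux/cpumask.h) are assumed, and the capacity macro simply restates Peter's suggestion.

/* Flexible-array layout: capacity is derived from the header size, so
 * adding a member to the struct automatically shrinks the PFN array. */
struct unmap_batch {
	struct cpumask	cpumask;
	unsigned long	nr_pages;
	unsigned long	pfns[0];	/* fills the rest of the allocation */
};

#define BATCH_TLBFLUSH_SIZE \
	((PAGE_SIZE - sizeof(struct unmap_batch)) / sizeof(unsigned long))

/* Hypothetical allocation-site helper: unconditionally one page */
static struct unmap_batch *alloc_unmap_batch(gfp_t gfp)
{
	struct unmap_batch *ubc = (void *)__get_free_page(gfp);

	if (ubc) {
		cpumask_clear(&ubc->cpumask);
		ubc->nr_pages = 0;
	}
	return ubc;
}

The trade-off Mel points out is visible here: the one-page assumption now lives in alloc_unmap_batch() rather than in the struct definition, so the allocation site and the capacity macro still have to be kept in sync by hand.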
* [PATCH 4/4] mm: migrate: Batch TLB flushing when unmapping pages for migration 2015-04-15 10:42 ` Mel Gorman @ 2015-04-15 10:42 ` Mel Gorman -1 siblings, 0 replies; 58+ messages in thread From: Mel Gorman @ 2015-04-15 10:42 UTC (permalink / raw) To: Linux-MM Cc: Rik van Riel, Johannes Weiner, Dave Hansen, Andi Kleen, LKML, Mel Gorman Page reclaim batches multiple TLB flushes into one IPI and this patch teaches page migration to also batch any necessary flushes. MMtests has a THP scale microbenchmark that deliberately fragments memory and then allocates THPs to stress compaction. It's not a page reclaim benchmark and recent kernels avoid excessive compaction but this patch reduced system CPU usage 4.0.0 4.0.0 baseline batchmigrate-v1 User 970.70 1012.24 System 2067.48 1840.00 Elapsed 1520.63 1529.66 Note that this particular workload was not TLB flush intensive with peaks in interrupts during the compaction phase. The 4.0 kernel peaked at 345K interrupts/second, the kernel that batches reclaim TLB entries peaked at 13K interrupts/second and this patch peaked at 10K interrupts/second. Signed-off-by: Mel Gorman <mgorman@suse.de> --- mm/internal.h | 5 +++++ mm/migrate.c | 8 +++++++- mm/vmscan.c | 6 +----- 3 files changed, 13 insertions(+), 6 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index fe69dd159e34..cb70555a7291 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -436,10 +436,15 @@ struct unmap_batch; #ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH void try_to_unmap_flush(void); +void alloc_ubc(void); #else static inline void try_to_unmap_flush(void) { } +static inline void alloc_ubc(void) +{ +} + #endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */ #endif /* __MM_INTERNAL_H */ diff --git a/mm/migrate.c b/mm/migrate.c index 85e042686031..973d8befe528 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -789,6 +789,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage, if (current->flags & PF_MEMALLOC) goto out; + try_to_unmap_flush(); lock_page(page); } @@ -805,6 +806,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage, } if (!force) goto out_unlock; + try_to_unmap_flush(); wait_on_page_writeback(page); } /* @@ -879,7 +881,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage, /* Establish migration ptes or remove ptes */ if (page_mapped(page)) { try_to_unmap(page, - TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS); + TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS|TTU_BATCH_FLUSH); page_was_mapped = 1; } @@ -1098,6 +1100,8 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page, if (!swapwrite) current->flags |= PF_SWAPWRITE; + alloc_ubc(); + for(pass = 0; pass < 10 && retry; pass++) { retry = 0; @@ -1144,6 +1148,8 @@ out: if (!swapwrite) current->flags &= ~PF_SWAPWRITE; + try_to_unmap_flush(); + return rc; } diff --git a/mm/vmscan.c b/mm/vmscan.c index 68bcc0b73a76..d659e3655575 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2767,7 +2767,7 @@ out: } #ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH -static inline void alloc_ubc(void) +void alloc_ubc(void) { if (current->ubc) return; @@ -2784,10 +2784,6 @@ static inline void alloc_ubc(void) cpumask_clear(¤t->ubc->cpumask); current->ubc->nr_pages = 0; } -#else -static inline void alloc_ubc(void) -{ -} #endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */ unsigned long try_to_free_pages(struct zonelist *zonelist, int order, -- 2.1.2 ^ permalink raw reply related [flat|nested] 58+ messages in thread
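To make the control flow of the patch easier to follow, here is a hedged sketch of the batching pattern it applies: unmap a list of pages with TTU_BATCH_FLUSH so each unmap only records a PFN and a CPU mask, then issue one deferred flush at the end. migrate_list_batched() is a made-up name for illustration; it is not the actual __unmap_and_move()/migrate_pages() code, and it assumes the series' alloc_ubc(), try_to_unmap_flush() and TTU_BATCH_FLUSH from patch 2.

/* Illustrative only: batch many unmaps, then flush once */
static int migrate_list_batched(struct list_head *from)
{
	struct page *page, *next;
	int nr_unmapped = 0;

	alloc_ubc();	/* ensure current->ubc exists; no-op without arch support */

	list_for_each_entry_safe(page, next, from, lru) {
		if (!page_mapped(page))
			continue;

		/* Records the PFN and the mm's CPU mask instead of
		 * sending an IPI for every page. */
		try_to_unmap(page, TTU_MIGRATION | TTU_IGNORE_MLOCK |
				   TTU_IGNORE_ACCESS | TTU_BATCH_FLUSH);
		nr_unmapped++;
	}

	/* A single IPI flushes every PFN gathered above. */
	try_to_unmap_flush();
	return nr_unmapped;
}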
* Re: [PATCH 4/4] mm: migrate: Batch TLB flushing when unmapping pages for migration 2015-04-15 10:42 ` Mel Gorman @ 2015-04-15 21:06 ` Hugh Dickins -1 siblings, 0 replies; 58+ messages in thread From: Hugh Dickins @ 2015-04-15 21:06 UTC (permalink / raw) To: Mel Gorman Cc: Linux-MM, Rik van Riel, Johannes Weiner, Dave Hansen, Andi Kleen, LKML On Wed, 15 Apr 2015, Mel Gorman wrote: > Page reclaim batches multiple TLB flushes into one IPI and this patch teaches > page migration to also batch any necessary flushes. MMtests has a THP scale > microbenchmark that deliberately fragments memory and then allocates THPs > to stress compaction. It's not a page reclaim benchmark and recent kernels > avoid excessive compaction but this patch reduced system CPU usage > > 4.0.0 4.0.0 > baseline batchmigrate-v1 > User 970.70 1012.24 > System 2067.48 1840.00 > Elapsed 1520.63 1529.66 > > Note that this particular workload was not TLB flush intensive with peaks > in interrupts during the compaction phase. The 4.0 kernel peaked at 345K > interrupts/second, the kernel that batches reclaim TLB entries peaked at > 13K interrupts/second and this patch peaked at 10K interrupts/second. > > Signed-off-by: Mel Gorman <mgorman@suse.de> > --- > mm/internal.h | 5 +++++ > mm/migrate.c | 8 +++++++- > mm/vmscan.c | 6 +----- > 3 files changed, 13 insertions(+), 6 deletions(-) > > diff --git a/mm/internal.h b/mm/internal.h > index fe69dd159e34..cb70555a7291 100644 > --- a/mm/internal.h > +++ b/mm/internal.h > @@ -436,10 +436,15 @@ struct unmap_batch; > > #ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH > void try_to_unmap_flush(void); > +void alloc_ubc(void); > #else > static inline void try_to_unmap_flush(void) > { > } > > +static inline void alloc_ubc(void) > +{ > +} > + > #endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */ > #endif /* __MM_INTERNAL_H */ > diff --git a/mm/migrate.c b/mm/migrate.c > index 85e042686031..973d8befe528 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -789,6 +789,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage, > if (current->flags & PF_MEMALLOC) > goto out; > > + try_to_unmap_flush(); I have a vested interest in minimizing page migration overhead, enthusiastic for more batching if it can be done, so took a quick look at this patch (the earliers not so much); but am mystified by your placement of the try_to_unmap_flush()s. Why would one be needed here, yet not before the trylock_page() above? Oh, when might sleep? Though I still don't grasp why that's necessary, and try_to_unmap() below may itself sleep. > lock_page(page); > } > > @@ -805,6 +806,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage, > } > if (!force) > goto out_unlock; > + try_to_unmap_flush(); > wait_on_page_writeback(page); > } > /* > @@ -879,7 +881,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage, > /* Establish migration ptes or remove ptes */ > if (page_mapped(page)) { > try_to_unmap(page, > - TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS); > + TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS|TTU_BATCH_FLUSH); But isn't this the only place for the try_to_unmap_flush(), unless you make much more change to the way page migration works? Would batch together the TLB flushes from multiple mappings of the same page, though that's not a very ambitious goal. 
Delayed much later than this point, and user modifications to the old page could continue while we're copying it into the new page and after, so the new page receives only some undefined part of the modifications. Or perhaps this is the last minute point you were making about page lock in the 0/4, though page lock not so relevant here. Or your paragraph in the 0/4 "If a clean page is unmapped and not immediately flushed..." but I don't see where that is being enforced. I can imagine more optimization possible on !pte_write pages than on pte_write pages, but don't see any sign of that. Or am I just skimming this series too carelessly, and making a fool of myself by missing the important bits? Sorry if I'm wasting your time. > page_was_mapped = 1; > } > > @@ -1098,6 +1100,8 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page, > if (!swapwrite) > current->flags |= PF_SWAPWRITE; > > + alloc_ubc(); > + > for(pass = 0; pass < 10 && retry; pass++) { > retry = 0; > > @@ -1144,6 +1148,8 @@ out: > if (!swapwrite) > current->flags &= ~PF_SWAPWRITE; > > + try_to_unmap_flush(); > + > return rc; > } > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 68bcc0b73a76..d659e3655575 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -2767,7 +2767,7 @@ out: > } > > #ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH > -static inline void alloc_ubc(void) > +void alloc_ubc(void) Looking at this patch first, I wondered what on earth a ubc is. The letters "tlb" in the name might help people to locate its place in the world better. And then curious that it works with pfns rather than page pointers, as its natural cousin mmu_gather does (oops, no "tlb" there either, though that's compensated by naming its pointer "tlb" everywhere). pfns: are you thinking ahead to struct page-less persistent memory considerations? Though would they ever arrive here? I'd have thought it better to carry on with struct pages at least for now - or are they becoming unfashionable? (I think some tracing struct page pointers were converted to pfns recently.) But no big deal. > { > if (current->ubc) > return; > @@ -2784,10 +2784,6 @@ static inline void alloc_ubc(void) > cpumask_clear(¤t->ubc->cpumask); > current->ubc->nr_pages = 0; > } > -#else > -static inline void alloc_ubc(void) > -{ > -} > #endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */ > > unsigned long try_to_free_pages(struct zonelist *zonelist, int order, > -- > 2.1.2 ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 4/4] mm: migrate: Batch TLB flushing when unmapping pages for migration 2015-04-15 21:06 ` Hugh Dickins @ 2015-04-15 21:44 ` Mel Gorman -1 siblings, 0 replies; 58+ messages in thread From: Mel Gorman @ 2015-04-15 21:44 UTC (permalink / raw) To: Hugh Dickins Cc: Linux-MM, Rik van Riel, Johannes Weiner, Dave Hansen, Andi Kleen, LKML On Wed, Apr 15, 2015 at 02:06:19PM -0700, Hugh Dickins wrote: > On Wed, 15 Apr 2015, Mel Gorman wrote: > > > Page reclaim batches multiple TLB flushes into one IPI and this patch teaches > > page migration to also batch any necessary flushes. MMtests has a THP scale > > microbenchmark that deliberately fragments memory and then allocates THPs > > to stress compaction. It's not a page reclaim benchmark and recent kernels > > avoid excessive compaction but this patch reduced system CPU usage > > > > 4.0.0 4.0.0 > > baseline batchmigrate-v1 > > User 970.70 1012.24 > > System 2067.48 1840.00 > > Elapsed 1520.63 1529.66 > > > > Note that this particular workload was not TLB flush intensive with peaks > > in interrupts during the compaction phase. The 4.0 kernel peaked at 345K > > interrupts/second, the kernel that batches reclaim TLB entries peaked at > > 13K interrupts/second and this patch peaked at 10K interrupts/second. > > > > Signed-off-by: Mel Gorman <mgorman@suse.de> > > --- > > mm/internal.h | 5 +++++ > > mm/migrate.c | 8 +++++++- > > mm/vmscan.c | 6 +----- > > 3 files changed, 13 insertions(+), 6 deletions(-) > > > > diff --git a/mm/internal.h b/mm/internal.h > > index fe69dd159e34..cb70555a7291 100644 > > --- a/mm/internal.h > > +++ b/mm/internal.h > > @@ -436,10 +436,15 @@ struct unmap_batch; > > > > #ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH > > void try_to_unmap_flush(void); > > +void alloc_ubc(void); > > #else > > static inline void try_to_unmap_flush(void) > > { > > } > > > > +static inline void alloc_ubc(void) > > +{ > > +} > > + > > #endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */ > > #endif /* __MM_INTERNAL_H */ > > diff --git a/mm/migrate.c b/mm/migrate.c > > index 85e042686031..973d8befe528 100644 > > --- a/mm/migrate.c > > +++ b/mm/migrate.c > > @@ -789,6 +789,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage, > > if (current->flags & PF_MEMALLOC) > > goto out; > > > > + try_to_unmap_flush(); > > I have a vested interest in minimizing page migration overhead, > enthusiastic for more batching if it can be done, so took a quick > look at this patch (the earliers not so much); but am mystified by > your placement of the try_to_unmap_flush()s. > The placement is to flush the TLB before sleeping for a long time. If the whole approach is safe then it's not necessary but I saw little reason to leave it as-is. It should be perfectly safe to not flush before locking the page (which might sleep) or waiting on writeback (also might sleep). I'll drop these if they're confusing and similarly I can drop the flush before entering writeback in mm/vmscan.c > Why would one be needed here, yet not before the trylock_page() above? > Oh, when might sleep? Though I still don't grasp why that's necessary, > and try_to_unmap() below may itself sleep. > It's not necessary, I just was matching the expectation that when we unmap we should flush "soon". 
> > lock_page(page); > > } > > > > @@ -805,6 +806,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage, > > } > > if (!force) > > goto out_unlock; > > + try_to_unmap_flush(); > > wait_on_page_writeback(page); > > } > > /* > > @@ -879,7 +881,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage, > > /* Establish migration ptes or remove ptes */ > > if (page_mapped(page)) { > > try_to_unmap(page, > > - TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS); > > + TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS|TTU_BATCH_FLUSH); > > But isn't this the only place for the try_to_unmap_flush(), unless you > make much more change to the way page migration works? Would batch > together the TLB flushes from multiple mappings of the same page, > though that's not a very ambitious goal. > Hmm, I don't quite get this. When the page is unmapped, the masks for the CPU will be or'd together so the PFN will be flushed from the TLB of any CPU that was accessing it. > Delayed much later than this point, and user modifications to the old > page could continue while we're copying it into the new page and after, > so the new page receives only some undefined part of the modifications. > For patch 2 or 4 to be safe, there must be an architectural guarantee that clean->dirty transitions after an unmap triggers a fault. I accept that in this series that previously dirty PTE can indeed leak through causing corruption and I've noted it in the leader. It's already in V2 which currently is being tested. > Or perhaps this is the last minute point you were making about > page lock in the 0/4, though page lock not so relevant here. > Yes for the writes leaking through after the unmap if it was previously dirty. The flush before lock page is not related. > Or your paragraph in the 0/4 "If a clean page is unmapped and not > immediately flushed..." but I don't see where that is being enforced. > I'm assuming hardware but I need the architecture guys to confirm that. > I can imagine more optimization possible on !pte_write pages than > on pte_write pages, but don't see any sign of that. > It's in rmap.c near the should_defer_flush part. I think that's what you're looking for or I'm misunderstanding the question. > Or am I just skimming this series too carelessly, and making a fool of > myself by missing the important bits? Sorry if I'm wasting your time. > Not at all. The more eyes on this the better. > > page_was_mapped = 1; > > } > > > > @@ -1098,6 +1100,8 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page, > > if (!swapwrite) > > current->flags |= PF_SWAPWRITE; > > > > + alloc_ubc(); > > + > > for(pass = 0; pass < 10 && retry; pass++) { > > retry = 0; > > > > @@ -1144,6 +1148,8 @@ out: > > if (!swapwrite) > > current->flags &= ~PF_SWAPWRITE; > > > > + try_to_unmap_flush(); > > + > > return rc; > > } > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index 68bcc0b73a76..d659e3655575 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -2767,7 +2767,7 @@ out: > > } > > > > #ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH > > -static inline void alloc_ubc(void) > > +void alloc_ubc(void) > > Looking at this patch first, I wondered what on earth a ubc is. > The letters "tlb" in the name might help people to locate its > place in the world better. > I can do that. It'll be struct tlb_unmap_batch and tlb_ubc; > And then curious that it works with pfns rather than page pointers, Because the TLB flush is about the physical address, not the page pointer. 
I felt that the PFN was both a more natural interface and this avoids a page_to_pfn lookup in the per-cpu TLB flush handler. > as its natural cousin mmu_gather does (oops, no "tlb" there either, > though that's compensated by naming its pointer "tlb" everywhere). > > pfns: are you thinking ahead to struct page-less persistent memory > considerations? Nothing so fancy, I wanted to avoid the page_to_pfn lookup. On VMEMMAP, that is a negligible cost but even so. > Though would they ever arrive here? I'd have > thought it better to carry on with struct pages at least for now - > or are they becoming unfashionable? (I think some tracing struct > page pointers were converted to pfns recently.) But no big deal. > FWIW, I did not consider the current debate on whether persistent memory would use struct pages or not. I simply see zero advantage to using the struct page unnecessarily. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 58+ messages in thread
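A small sketch of the recording step Mel describes above (or'ing the mm's CPU mask into the batch and storing the PFN directly) may help. ubc_note_unmap() is a simplified, hypothetical stand-in for the rmap-side hook added earlier in the series, not a literal excerpt, and it assumes try_to_unmap_flush() resets nr_pages after sending the IPI, as the series does.

/* Illustrative only: record one unmapped page in the per-task batch */
static void ubc_note_unmap(struct unmap_batch *ubc, struct mm_struct *mm,
			   unsigned long pfn)
{
	/* Remember every CPU that may still cache a translation for mm */
	cpumask_or(&ubc->cpumask, &ubc->cpumask, mm_cpumask(mm));

	/* Record the physical frame directly; the flush operates on PFNs,
	 * so the IPI handler never needs a page_to_pfn() lookup. */
	ubc->pfns[ubc->nr_pages++] = pfn;

	/* Batch full: send the single IPI now (this resets nr_pages). */
	if (ubc->nr_pages == BATCH_TLBFLUSH_SIZE)
		try_to_unmap_flush();
}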
* Re: [PATCH 4/4] mm: migrate: Batch TLB flushing when unmapping pages for migration 2015-04-15 21:44 ` Mel Gorman @ 2015-04-15 23:50 ` Hugh Dickins -1 siblings, 0 replies; 58+ messages in thread From: Hugh Dickins @ 2015-04-15 23:50 UTC (permalink / raw) To: Mel Gorman Cc: Hugh Dickins, Linux-MM, Rik van Riel, Johannes Weiner, Dave Hansen, Andi Kleen, LKML On Wed, 15 Apr 2015, Mel Gorman wrote: > On Wed, Apr 15, 2015 at 02:06:19PM -0700, Hugh Dickins wrote: > > On Wed, 15 Apr 2015, Mel Gorman wrote: > > > > > diff --git a/mm/migrate.c b/mm/migrate.c > > > index 85e042686031..973d8befe528 100644 > > > --- a/mm/migrate.c > > > +++ b/mm/migrate.c > > > @@ -789,6 +789,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage, > > > if (current->flags & PF_MEMALLOC) > > > goto out; > > > > > > + try_to_unmap_flush(); > > > > I have a vested interest in minimizing page migration overhead, > > enthusiastic for more batching if it can be done, so took a quick > > look at this patch (the earliers not so much); but am mystified by > > your placement of the try_to_unmap_flush()s. > > > > The placement is to flush the TLB before sleeping for a long time. If the > whole approach is safe then it's not necessary but I saw little reason to > leave it as-is. It should be perfectly safe to not flush before locking > the page (which might sleep) or waiting on writeback (also might sleep). > I'll drop these if they're confusing and similarly I can drop the flush > before entering writeback in mm/vmscan.c Yes, I think I would prefer you to drop them: if it's unnecessary to flush in these places, why choose to do so and lose the batching? > > > Why would one be needed here, yet not before the trylock_page() above? > > Oh, when might sleep? Though I still don't grasp why that's necessary, > > and try_to_unmap() below may itself sleep. > > > > It's not necessary, I just was matching the expectation that when we unmap > we should flush "soon". But no sooner than necessary: that's why you're batching. > > > > lock_page(page); > > > } > > > > > > @@ -805,6 +806,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage, > > > } > > > if (!force) > > > goto out_unlock; > > > + try_to_unmap_flush(); > > > wait_on_page_writeback(page); > > > } > > > /* > > > @@ -879,7 +881,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage, > > > /* Establish migration ptes or remove ptes */ > > > if (page_mapped(page)) { > > > try_to_unmap(page, > > > - TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS); > > > + TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS|TTU_BATCH_FLUSH); > > > > But isn't this the only place for the try_to_unmap_flush(), unless you > > make much more change to the way page migration works? Would batch > > together the TLB flushes from multiple mappings of the same page, > > though that's not a very ambitious goal. > > > > Hmm, I don't quite get this. When the page is unmapped, the masks for the > CPU will be or'd together so the PFN will be flushed from the TLB of any > CPU that was accessing it. It appears that I am focused on the pte_write pte_dirty case, whereas you're thinking that you already admitted to that error with your "Last minute note" about page lock and IO in 0/4. Right, they're different aspects of the same issue, which I didn't catch at first. 
> > > Delayed much later than this point, and user modifications to the old > > page could continue while we're copying it into the new page and after, > > so the new page receives only some undefined part of the modifications. > > > > For patch 2 or 4 to be safe, there must be an architectural guarantee > that clean->dirty transitions after an unmap triggers a fault. I accept > that in this series that previously dirty PTE can indeed leak through > causing corruption and I've noted it in the leader. It's already in V2 > which currently is being tested. Right, I've only been telling you what you had already realized. I should wait for V2 or V3 before commenting further. > > > Or perhaps this is the last minute point you were making about > > page lock in the 0/4, though page lock not so relevant here. > > > > Yes for the writes leaking through after the unmap if it was previously > dirty. The flush before lock page is not related. > > > Or your paragraph in the 0/4 "If a clean page is unmapped and not > > immediately flushed..." but I don't see where that is being enforced. > > > > I'm assuming hardware but I need the architecture guys to confirm that. > > > I can imagine more optimization possible on !pte_write pages than > > on pte_write pages, but don't see any sign of that. > > > > It's in rmap.c near the should_defer_flush part. I think that's what you're > looking for or I'm misunderstanding the question. "It" being what? Again I think we're misunderstanding each other. Seeing that the pte_write case looked dangerous (and yes, it's actually only the pte_write pte_dirty case), I was expecting to find some special treatment of pte_write pages versus !pte_write pages somewhere; whereas that's the case which your "Last minute note" admits is not handled in this version. > > > Or am I just skimming this series too carelessly, and making a fool of > > myself by missing the important bits? Sorry if I'm wasting your time. > > > > Not at all. The more eyes on this the better. > > > > page_was_mapped = 1; > > > } > > > > > > @@ -1098,6 +1100,8 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page, > > > if (!swapwrite) > > > current->flags |= PF_SWAPWRITE; > > > > > > + alloc_ubc(); > > > + > > > for(pass = 0; pass < 10 && retry; pass++) { > > > retry = 0; > > > > > > @@ -1144,6 +1148,8 @@ out: > > > if (!swapwrite) > > > current->flags &= ~PF_SWAPWRITE; > > > > > > + try_to_unmap_flush(); > > > + > > > return rc; > > > } > > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > > index 68bcc0b73a76..d659e3655575 100644 > > > --- a/mm/vmscan.c > > > +++ b/mm/vmscan.c > > > @@ -2767,7 +2767,7 @@ out: > > > } > > > > > > #ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH > > > -static inline void alloc_ubc(void) > > > +void alloc_ubc(void) > > > > Looking at this patch first, I wondered what on earth a ubc is. > > The letters "tlb" in the name might help people to locate its > > place in the world better. > > > > I can do that. It'll be struct tlb_unmap_batch and tlb_ubc; Thanks. > > > And then curious that it works with pfns rather than page pointers, > > Because the TLB flush is about the physical address, not the page pointer. I > felt that the PFN was both a more natural interface and this avoids a > page_to_pfn lookup in the per-cpu TLB flush handler. Right, that's a good reason, I missed that. > > > as its natural cousin mmu_gather does (oops, no "tlb" there either, > > though that's compensated by naming its pointer "tlb" everywhere). 
> > > > pfns: are you thinking ahead to struct page-less persistent memory > > considerations? > > Nothing so fancy, I wanted to avoid the page_to_pfn lookup. On VMEMMAP, > that is a negligible cost but even so. > > > Though would they ever arrive here? I'd have > > thought it better to carry on with struct pages at least for now - > > or are they becoming unfashionable? (I think some tracing struct > > page pointers were converted to pfns recently.) But no big deal. > > > > FWIW, I did not consider the current debate on whether persistent memory > would use struct pages or not. I simply see zero advantage to using the > struct page unnecessarily. Agreed. Hugh ^ permalink raw reply [flat|nested] 58+ messages in thread
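The safety condition being circled in this exchange (only the pte_write/pte_dirty case is dangerous) can be stated as a tiny predicate. The sketch below is a hedged reading of the direction Mel describes for V2, not a quote from any posted patch; can_defer_tlb_flush() is a hypothetical name.

/* Sketch only: a batched (deferred) flush is assumed safe for clean PTEs,
 * on the expectation that a clean->dirty transition re-walks the page
 * tables and faults on the now-missing PTE.  A dirty PTE may still have
 * live writable TLB entries on other CPUs, so it must be flushed before
 * the page is copied or queued for IO. */
static bool can_defer_tlb_flush(pte_t pteval)
{
	if (pte_dirty(pteval))
		return false;	/* flush immediately */

	return true;		/* safe to batch into the per-task ubc */
}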
* Re: [PATCH 4/4] mm: migrate: Batch TLB flushing when unmapping pages for migration @ 2015-04-15 23:50 ` Hugh Dickins 0 siblings, 0 replies; 58+ messages in thread From: Hugh Dickins @ 2015-04-15 23:50 UTC (permalink / raw) To: Mel Gorman Cc: Hugh Dickins, Linux-MM, Rik van Riel, Johannes Weiner, Dave Hansen, Andi Kleen, LKML On Wed, 15 Apr 2015, Mel Gorman wrote: > On Wed, Apr 15, 2015 at 02:06:19PM -0700, Hugh Dickins wrote: > > On Wed, 15 Apr 2015, Mel Gorman wrote: > > > > > diff --git a/mm/migrate.c b/mm/migrate.c > > > index 85e042686031..973d8befe528 100644 > > > --- a/mm/migrate.c > > > +++ b/mm/migrate.c > > > @@ -789,6 +789,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage, > > > if (current->flags & PF_MEMALLOC) > > > goto out; > > > > > > + try_to_unmap_flush(); > > > > I have a vested interest in minimizing page migration overhead, > > enthusiastic for more batching if it can be done, so took a quick > > look at this patch (the earliers not so much); but am mystified by > > your placement of the try_to_unmap_flush()s. > > > > The placement is to flush the TLB before sleeping for a long time. If the > whole approach is safe then it's not necessary but I saw little reason to > leave it as-is. It should be perfectly safe to not flush before locking > the page (which might sleep) or waiting on writeback (also might sleep). > I'll drop these if they're confusing and similarly I can drop the flush > before entering writeback in mm/vmscan.c Yes, I think I would prefer you to drop them: if it's unnecessary to flush in these places, why choose to do so and lose the batching? > > > Why would one be needed here, yet not before the trylock_page() above? > > Oh, when might sleep? Though I still don't grasp why that's necessary, > > and try_to_unmap() below may itself sleep. > > > > It's not necessary, I just was matching the expectation that when we unmap > we should flush "soon". But no sooner than necessary: that's why you're batching. > > > > lock_page(page); > > > } > > > > > > @@ -805,6 +806,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage, > > > } > > > if (!force) > > > goto out_unlock; > > > + try_to_unmap_flush(); > > > wait_on_page_writeback(page); > > > } > > > /* > > > @@ -879,7 +881,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage, > > > /* Establish migration ptes or remove ptes */ > > > if (page_mapped(page)) { > > > try_to_unmap(page, > > > - TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS); > > > + TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS|TTU_BATCH_FLUSH); > > > > But isn't this the only place for the try_to_unmap_flush(), unless you > > make much more change to the way page migration works? Would batch > > together the TLB flushes from multiple mappings of the same page, > > though that's not a very ambitious goal. > > > > Hmm, I don't quite get this. When the page is unmapped, the masks for the > CPU will be or'd together so the PFN will be flushed from the TLB of any > CPU that was accessing it. It appears that I am focused on the pte_write pte_dirty case, whereas you're thinking that you already admitted to that error with your "Last minute note" about page lock and IO in 0/4. Right, they're different aspects of the same issue, which I didn't catch at first. > > > Delayed much later than this point, and user modifications to the old > > page could continue while we're copying it into the new page and after, > > so the new page receives only some undefined part of the modifications. 
> > > > For patch 2 or 4 to be safe, there must be an architectural guarantee > that clean->dirty transitions after an unmap triggers a fault. I accept > that in this series that previously dirty PTE can indeed leak through > causing corruption and I've noted it in the leader. It's already in V2 > which currently is being tested. Right, I've only been telling you what you had already realized. I should wait for V2 or V3 before commenting further. > > > Or perhaps this is the last minute point you were making about > > page lock in the 0/4, though page lock not so relevant here. > > > > Yes for the writes leaking through after the unmap if it was previously > dirty. The flush before lock page is not related. > > > Or your paragraph in the 0/4 "If a clean page is unmapped and not > > immediately flushed..." but I don't see where that is being enforced. > > > > I'm assuming hardware but I need the architecture guys to confirm that. > > > I can imagine more optimization possible on !pte_write pages than > > on pte_write pages, but don't see any sign of that. > > > > It's in rmap.c near the should_defer_flush part. I think that's what you're > looking for or I'm misunderstanding the question. "It" being what? Again I think we're misunderstanding each other. Seeing that the pte_write case looked dangerous (and yes, it's actually only the pte_write pte_dirty case), I was expecting to find some special treatment of pte_write pages versus !pte_write pages somewhere; whereas that's the case which your "Last minute note" admits is not handled in this version. > > > Or am I just skimming this series too carelessly, and making a fool of > > myself by missing the important bits? Sorry if I'm wasting your time. > > > > Not at all. The more eyes on this the better. > > > > page_was_mapped = 1; > > > } > > > > > > @@ -1098,6 +1100,8 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page, > > > if (!swapwrite) > > > current->flags |= PF_SWAPWRITE; > > > > > > + alloc_ubc(); > > > + > > > for(pass = 0; pass < 10 && retry; pass++) { > > > retry = 0; > > > > > > @@ -1144,6 +1148,8 @@ out: > > > if (!swapwrite) > > > current->flags &= ~PF_SWAPWRITE; > > > > > > + try_to_unmap_flush(); > > > + > > > return rc; > > > } > > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > > index 68bcc0b73a76..d659e3655575 100644 > > > --- a/mm/vmscan.c > > > +++ b/mm/vmscan.c > > > @@ -2767,7 +2767,7 @@ out: > > > } > > > > > > #ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH > > > -static inline void alloc_ubc(void) > > > +void alloc_ubc(void) > > > > Looking at this patch first, I wondered what on earth a ubc is. > > The letters "tlb" in the name might help people to locate its > > place in the world better. > > > > I can do that. It'll be struct tlb_unmap_batch and tlb_ubc; Thanks. > > > And then curious that it works with pfns rather than page pointers, > > Because the TLB flush is about the physical address, not the page pointer. I > felt that the PFN was both a more natural interface and this avoids a > page_to_pfn lookup in the per-cpu TLB flush handler. Right, that's a good reason, I missed that. > > > as its natural cousin mmu_gather does (oops, no "tlb" there either, > > though that's compensated by naming its pointer "tlb" everywhere). > > > > pfns: are you thinking ahead to struct page-less persistent memory > > considerations? > > Nothing so fancy, I wanted to avoid the page_to_pfn lookup. On VMEMMAP, > that is a negligible cost but even so. > > > Though would they ever arrive here? 
I'd have > > thought it better to carry on with struct pages at least for now - > > or are they becoming unfashionable? (I think some tracing struct > > page pointers were converted to pfns recently.) But no big deal. > > > > FWIW, I did not consider the current debate on whether persistent memory > would use struct pages or not. I simply see zero advantage to using the > struct page unnecessarily. Agreed. Hugh ^ permalink raw reply [flat|nested] 58+ messages in thread
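As a side note on the batching mechanics discussed above, a minimal user-space model of the idea may help: collect the PFNs of recently unmapped pages together with a mask of CPUs that might still cache a stale translation, then issue one flush for the whole batch rather than one IPI per page. Mel's structures ended up being named struct tlb_unmap_batch and tlb_ubc; every name in the sketch below (unmap_batch, batch_add, batch_flush, BATCH_MAX) is invented for illustration and the "IPI" is just a printf, so this is a model of the idea, not the kernel code.

/*
 * Illustrative user-space model of deferred TLB flush batching: remember the
 * PFNs of unmapped pages plus the CPUs that may hold stale entries, and do
 * one "flush" per batch instead of one per page.
 */
#include <stdio.h>
#include <stdint.h>

#define BATCH_MAX 32			/* pages accumulated before a forced flush */

struct unmap_batch {
	uint64_t cpumask;		/* CPUs that may cache a stale translation */
	unsigned long pfns[BATCH_MAX];	/* physical frames, no struct page needed */
	unsigned int nr;
};

/* Pretend IPI: in the kernel this would be a single remote flush request. */
static void batch_flush(struct unmap_batch *b)
{
	if (!b->nr)
		return;
	printf("one IPI to CPUs 0x%llx flushing %u pfns\n",
	       (unsigned long long)b->cpumask, b->nr);
	b->cpumask = 0;
	b->nr = 0;
}

/* Record one unmapped page; only flush when the batch fills up. */
static void batch_add(struct unmap_batch *b, unsigned long pfn, uint64_t cpus)
{
	b->pfns[b->nr++] = pfn;
	b->cpumask |= cpus;
	if (b->nr == BATCH_MAX)
		batch_flush(b);
}

int main(void)
{
	struct unmap_batch b = { 0 };
	unsigned long pfn;

	/* 100 unmapped pages, all mapped by CPUs 0 and 2 -> ~4 flushes, not 100 */
	for (pfn = 0x1000; pfn < 0x1064; pfn++)
		batch_add(&b, pfn, 0x5);
	batch_flush(&b);		/* drain the partial batch before sleeping/IO */
	return 0;
}

Storing PFNs rather than struct page pointers matches the reasoning above: the flush only needs the physical frame, so the page_to_pfn() lookup can be skipped entirely.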
* [RFC PATCH 0/4] TLB flush multiple pages with a single IPI v2 @ 2015-04-16 10:22 Mel Gorman 2015-04-16 10:22 ` Mel Gorman 0 siblings, 1 reply; 58+ messages in thread From: Mel Gorman @ 2015-04-16 10:22 UTC (permalink / raw) To: Linux-MM Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen, LKML, Mel Gorman Changelog since V1 o Structure and variable renaming (hughd) o Defer flushes even if the unmapping process is sleeping (hughd) o Alternative sizing of structure (peterz) o Use GFP_KERNEL instead of GFP_ATOMIC, PF_MEMALLOC protects (andi) o Immediately flush dirty PTEs to avoid corruption (mel) o Further clarify docs on the required arch guarantees (mel) When unmapping pages it is necessary to flush the TLB. If that page was accessed by another CPU then an IPI is used to flush the remote CPU. That is a lot of IPIs if kswapd is scanning and unmapping >100K pages per second. There already is a window between when a page is unmapped and when it is TLB flushed. This series simply increases the window so multiple pages can be flushed using a single IPI. Patch 1 simply made the rest of the series easier to write as ftrace could identify all the senders of TLB flush IPIs. Patch 2 collects a list of PFNs and sends one IPI to flush them all Patch 3 uses more memory to further defer when the IPI gets sent Patch 4 uses the same infrastructure as patch 2 to batch IPIs sent during page migration. The performance impact is documented in the changelogs but in the optimistic case on a 4-socket machine the full series reduces interrupts from 900K interrupts/second to 60K interrupts/second. arch/x86/Kconfig | 1 + arch/x86/include/asm/tlbflush.h | 2 + arch/x86/mm/tlb.c | 1 + include/linux/init_task.h | 8 ++++ include/linux/mm_types.h | 1 + include/linux/rmap.h | 3 ++ include/linux/sched.h | 15 +++++++ include/trace/events/tlb.h | 3 +- init/Kconfig | 8 ++++ kernel/fork.c | 7 +++ kernel/sched/core.c | 3 ++ mm/internal.h | 16 +++++++ mm/migrate.c | 6 ++- mm/rmap.c | 99 ++++++++++++++++++++++++++++++++++++++++- mm/vmscan.c | 33 +++++++++++++- 15 files changed, 201 insertions(+), 5 deletions(-) -- 2.1.2 ^ permalink raw reply [flat|nested] 58+ messages in thread
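One changelog item above, "Immediately flush dirty PTEs to avoid corruption", encodes the rule that fell out of Hugh's review: a flush can only be deferred for a mapping through which no further writes can leak. A toy decision helper makes the rule explicit. The series does have a should_defer_flush() helper in mm/rmap.c, but the signature, types, and check placement below are invented for illustration and do not mirror the actual patch.

/*
 * Toy stand-in for the defer-or-flush decision from the V2 changelog.
 * Deferring is only safe when no CPU can keep modifying the page through a
 * stale writable entry; a PTE that was dirty at unmap time may still be
 * cached writable in some TLB, so IO started after the page lock is dropped
 * could race with further writes: flush immediately in that case.
 */
#include <stdio.h>
#include <stdbool.h>

struct fake_pte {
	bool present;
	bool writable;
	bool dirty;	/* set by hardware on a write through this translation */
};

static bool can_defer_flush(const struct fake_pte *pte, bool arch_batching)
{
	if (!arch_batching)
		return false;	/* arch gives no batching guarantee */
	if (pte->dirty)
		return false;	/* writes may leak through: flush now */
	return true;		/* clean mapping: batch the flush */
}

int main(void)
{
	struct fake_pte clean = { .present = true, .writable = true, .dirty = false };
	struct fake_pte dirty = { .present = true, .writable = true, .dirty = true };

	printf("clean pte: defer=%d\n", can_defer_flush(&clean, true));
	printf("dirty pte: defer=%d\n", can_defer_flush(&dirty, true));
	return 0;
}

The clean case is what the series asks the architecture people to confirm above: a CPU must take a fault (or re-walk the page tables) before turning a clean cached entry dirty once the PTE has been cleared, otherwise even this conservative rule would not hold.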
* [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent 2015-04-16 10:22 [RFC PATCH 0/4] TLB flush multiple pages with a single IPI v2 Mel Gorman @ 2015-04-16 10:22 ` Mel Gorman 0 siblings, 0 replies; 58+ messages in thread From: Mel Gorman @ 2015-04-16 10:22 UTC (permalink / raw) To: Linux-MM Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen, LKML, Mel Gorman It is easy to trace when an IPI is received to flush a TLB but harder to detect what event sent it. This patch makes it easy to identify the source of IPIs being transmitted for TLB flushes on x86. Signed-off-by: Mel Gorman <mgorman@suse.de> --- arch/x86/mm/tlb.c | 1 + include/linux/mm_types.h | 1 + include/trace/events/tlb.h | 3 ++- 3 files changed, 4 insertions(+), 1 deletion(-) diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 3250f2371aea..2da824c1c140 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -140,6 +140,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask, info.flush_end = end; count_vm_tlb_event(NR_TLB_REMOTE_FLUSH); + trace_tlb_flush(TLB_REMOTE_SEND_IPI, end - start); if (is_uv_system()) { unsigned int cpu; diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 199a03aab8dc..856038aa166e 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -532,6 +532,7 @@ enum tlb_flush_reason { TLB_REMOTE_SHOOTDOWN, TLB_LOCAL_SHOOTDOWN, TLB_LOCAL_MM_SHOOTDOWN, + TLB_REMOTE_SEND_IPI, NR_TLB_FLUSH_REASONS, }; diff --git a/include/trace/events/tlb.h b/include/trace/events/tlb.h index 0e7635765153..0fc101472988 100644 --- a/include/trace/events/tlb.h +++ b/include/trace/events/tlb.h @@ -11,7 +11,8 @@ { TLB_FLUSH_ON_TASK_SWITCH, "flush on task switch" }, \ { TLB_REMOTE_SHOOTDOWN, "remote shootdown" }, \ { TLB_LOCAL_SHOOTDOWN, "local shootdown" }, \ - { TLB_LOCAL_MM_SHOOTDOWN, "local mm shootdown" } + { TLB_LOCAL_MM_SHOOTDOWN, "local mm shootdown" }, \ + { TLB_REMOTE_SEND_IPI, "remote ipi send" } TRACE_EVENT_CONDITION(tlb_flush, -- 2.1.2 ^ permalink raw reply related [flat|nested] 58+ messages in thread
* Re: [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent 2015-04-16 10:22 ` Mel Gorman @ 2015-04-16 15:51 ` Rik van Riel -1 siblings, 0 replies; 58+ messages in thread From: Rik van Riel @ 2015-04-16 15:51 UTC (permalink / raw) To: Mel Gorman, Linux-MM Cc: Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen, LKML On 04/16/2015 06:22 AM, Mel Gorman wrote: > It is easy to trace when an IPI is received to flush a TLB but harder to > detect what event sent it. This patch makes it easy to identify the source > of IPIs being transmitted for TLB flushes on x86. > > Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent 2015-04-16 10:22 ` Mel Gorman @ 2015-04-16 16:55 ` Dave Hansen -1 siblings, 0 replies; 58+ messages in thread From: Dave Hansen @ 2015-04-16 16:55 UTC (permalink / raw) To: Mel Gorman, Linux-MM Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Andi Kleen, LKML On 04/16/2015 03:22 AM, Mel Gorman wrote: > It is easy to trace when an IPI is received to flush a TLB but harder to > detect what event sent it. This patch makes it easy to identify the source > of IPIs being transmitted for TLB flushes on x86. Looks fine to me. I think I even thought about adding this but didn't see an immediate need for it. I guess this does let you see how many IPIs are sent vs. received. Reviewed-by: Dave Hansen <dave.hansen@intel.com> ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent 2015-04-16 16:55 ` Dave Hansen @ 2015-04-16 17:39 ` Mel Gorman -1 siblings, 0 replies; 58+ messages in thread From: Mel Gorman @ 2015-04-16 17:39 UTC (permalink / raw) To: Dave Hansen Cc: Linux-MM, Rik van Riel, Hugh Dickins, Minchan Kim, Andi Kleen, LKML On Thu, Apr 16, 2015 at 09:55:43AM -0700, Dave Hansen wrote: > On 04/16/2015 03:22 AM, Mel Gorman wrote: > > It is easy to trace when an IPI is received to flush a TLB but harder to > > detect what event sent it. This patch makes it easy to identify the source > > of IPIs being transmitted for TLB flushes on x86. > > Looks fine to me. I think I even thought about adding this but didn't > see an immediate need for it. I guess this does let you see how many > IPIs are sent vs. received. > It would but that's not why I wanted it. I wanted a stack trace of who was sending the IPI and I can't get that on the receive side. I could have used perf probe and some hackery but this seemed useful in itself. > Reviewed-by: Dave Hansen <dave.hansen@intel.com> Thanks. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 58+ messages in thread
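For completeness, one way the new tracepoint can be used to get the stack of the sender Mel mentions: arm a stacktrace trigger on the tlb_flush event, enable it, and read trace_pipe. The snippet below simply writes the standard tracefs control files from C (the paths assume tracefs is mounted at /sys/kernel/debug/tracing and the kernel supports event triggers); in practice the same effect is a couple of echo commands from a root shell.

/*
 * Arm a stacktrace trigger on the tlb_flush tracepoint and enable it.
 * With the patch above applied, senders appear with the reason string
 * "remote ipi send" and the trigger dumps their kernel stack.
 */
#include <stdio.h>
#include <stdlib.h>

static void write_file(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		exit(1);
	}
	fprintf(f, "%s\n", val);
	fclose(f);
}

int main(void)
{
	const char *base = "/sys/kernel/debug/tracing";
	char path[256];

	/* Dump a kernel stack every time tlb_flush fires... */
	snprintf(path, sizeof(path), "%s/events/tlb/tlb_flush/trigger", base);
	write_file(path, "stacktrace");

	/* ...and enable the event itself. */
	snprintf(path, sizeof(path), "%s/events/tlb/tlb_flush/enable", base);
	write_file(path, "1");

	printf("now read %s/trace_pipe\n", base);
	return 0;
}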
* [PATCH 0/3] TLB flush multiple pages per IPI v6 @ 2015-06-09 17:31 Mel Gorman 2015-06-09 17:31 ` Mel Gorman 0 siblings, 1 reply; 58+ messages in thread From: Mel Gorman @ 2015-06-09 17:31 UTC (permalink / raw) To: Andrew Morton Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen, H Peter Anvin, Ingo Molnar, Linus Torvalds, Thomas Gleixner, Peter Zijlstra, Linux-MM, LKML, Mel Gorman Changelog since V5 o Split series to first do a full TLB flush and then targeted flushing Changelog since V4 o Rebase to 4.1-rc6 Changelog since V3 o Drop batching of TLB flush from migration o Redo how larger batching is managed o Batch TLB flushes when writable entries exist When unmapping pages it is necessary to flush the TLB. If that page was accessed by another CPU then an IPI is used to flush the remote CPU. That is a lot of IPIs if kswapd is scanning and unmapping >100K pages per second. There already is a window between when a page is unmapped and when it is TLB flushed. This series increases the window so multiple pages can be flushed using a single IPI. This should be safe or the kernel is hosed already. Patch 1 simply made the rest of the series easier to write as ftrace could identify all the senders of TLB flush IPIs. Patch 2 tracks what CPUs potentially map a PFN and then sends an IPI to flush the entire TLB. Patch 3 tracks when there potentially are writable TLB entries that need to be batched differently Patch 4 notes that a full TLB flush could clear active entries and incur a penalty in the near future while the TLB is being refilled. The IPI flushes just the individual PFNs which incurs a direct cost to avoid an indirect cost. The performance impact is documented in the changelogs but in the optimistic case on a 4-socket machine the full series reduces interrupts from 900K interrupts/second to 60K interrupts/second. arch/x86/Kconfig | 1 + arch/x86/include/asm/tlbflush.h | 2 + arch/x86/mm/tlb.c | 1 + include/linux/mm_types.h | 1 + include/linux/rmap.h | 3 + include/linux/sched.h | 31 +++++++++++ include/trace/events/tlb.h | 3 +- init/Kconfig | 8 +++ kernel/fork.c | 5 ++ kernel/sched/core.c | 3 + mm/internal.h | 15 +++++ mm/rmap.c | 118 +++++++++++++++++++++++++++++++++++++++- mm/vmscan.c | 33 ++++++++++- 13 files changed, 220 insertions(+), 4 deletions(-) -- 2.3.5 ^ permalink raw reply [flat|nested] 58+ messages in thread
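The shape of patch 2 as described above differs from the earlier PFN-list sketch: no PFNs are remembered at all, only the CPUs that might hold a stale entry, and each of those CPUs flushes its entire TLB when the single IPI arrives. A compact user-space model of that trade-off (lower direct cost in the IPI handler, at the price of refilling entries that were still hot) follows; as before the names are invented and the flush is just a printf, not the kernel implementation.

/*
 * Model of the "track CPUs, full flush per CPU" approach: cheap bookkeeping
 * on the unmap side, one IPI on the flush side, full TLB refill as the cost.
 */
#include <stdio.h>
#include <stdint.h>

struct flush_batch {
	uint64_t cpumask;	/* CPUs that may map any of the unmapped pages */
	int pending;		/* pages unmapped since the last flush */
};

static void note_unmap(struct flush_batch *b, uint64_t mm_cpumask)
{
	b->cpumask |= mm_cpumask;	/* no per-PFN list is maintained */
	b->pending++;
}

static void flush(struct flush_batch *b)
{
	if (!b->pending)
		return;
	/* One IPI; each receiving CPU does a full TLB flush. */
	printf("full flush on CPUs 0x%llx covers %d unmapped pages\n",
	       (unsigned long long)b->cpumask, b->pending);
	b->cpumask = 0;
	b->pending = 0;
}

int main(void)
{
	struct flush_batch b = { 0 };
	int i;

	for (i = 0; i < 64; i++)
		note_unmap(&b, 0x3);	/* pages mapped on CPUs 0 and 1 */
	flush(&b);			/* one IPI instead of 64 */
	return 0;
}

Patch 4 then layers per-PFN flushing on top of this, which is exactly the "direct cost to avoid an indirect cost" trade-off described in the cover letter.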
* [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent 2015-06-09 17:31 [PATCH 0/3] TLB flush multiple pages per IPI v6 Mel Gorman @ 2015-06-09 17:31 ` Mel Gorman 0 siblings, 0 replies; 58+ messages in thread From: Mel Gorman @ 2015-06-09 17:31 UTC (permalink / raw) To: Andrew Morton Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen, H Peter Anvin, Ingo Molnar, Linus Torvalds, Thomas Gleixner, Peter Zijlstra, Linux-MM, LKML, Mel Gorman It is easy to trace when an IPI is received to flush a TLB but harder to detect what event sent it. This patch makes it easy to identify the source of IPIs being transmitted for TLB flushes on x86. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Dave Hansen <dave.hansen@intel.com> --- arch/x86/mm/tlb.c | 1 + include/linux/mm_types.h | 1 + include/trace/events/tlb.h | 3 ++- 3 files changed, 4 insertions(+), 1 deletion(-) diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 3250f2371aea..2da824c1c140 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -140,6 +140,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask, info.flush_end = end; count_vm_tlb_event(NR_TLB_REMOTE_FLUSH); + trace_tlb_flush(TLB_REMOTE_SEND_IPI, end - start); if (is_uv_system()) { unsigned int cpu; diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 8d37e26a1007..86ad9f902042 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -534,6 +534,7 @@ enum tlb_flush_reason { TLB_REMOTE_SHOOTDOWN, TLB_LOCAL_SHOOTDOWN, TLB_LOCAL_MM_SHOOTDOWN, + TLB_REMOTE_SEND_IPI, NR_TLB_FLUSH_REASONS, }; diff --git a/include/trace/events/tlb.h b/include/trace/events/tlb.h index 4250f364a6ca..bc8815f45f3b 100644 --- a/include/trace/events/tlb.h +++ b/include/trace/events/tlb.h @@ -11,7 +11,8 @@ EM( TLB_FLUSH_ON_TASK_SWITCH, "flush on task switch" ) \ EM( TLB_REMOTE_SHOOTDOWN, "remote shootdown" ) \ EM( TLB_LOCAL_SHOOTDOWN, "local shootdown" ) \ - EMe( TLB_LOCAL_MM_SHOOTDOWN, "local mm shootdown" ) + EM( TLB_LOCAL_MM_SHOOTDOWN, "local mm shootdown" ) \ + EMe( TLB_REMOTE_SEND_IPI, "remote ipi send" ) /* * First define the enums in TLB_FLUSH_REASON to be exported to userspace -- 2.3.5 ^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH 0/4] TLB flush multiple pages per IPI v7 @ 2015-07-06 13:39 Mel Gorman 2015-07-06 13:39 ` Mel Gorman 0 siblings, 1 reply; 58+ messages in thread From: Mel Gorman @ 2015-07-06 13:39 UTC (permalink / raw) To: Andrew Morton Cc: Rik van Riel, Dave Hansen, Ingo Molnar, Linus Torvalds, Linux-MM, LKML, Mel Gorman This is hopefully the final version that was agreed on. Ingo, you had sent an ack but I had to add a new arch helper after that for accounting purposes and there was a new patch added for the swap cluster suggestion. With the changes I did not include the ack just in case it was no longer valid. Changelog since V6 o Rebase to v4.2-rc1 o Fix TLB flush counter accounting o Drop dynamic allocation patch, no benefit and very messy o Drop targeted flushing, expected to be of dubious merit o Increase swap cluster max Changelog since V5 o Split series to first do a full TLB flush and then targeted flushing Changelog since V4 o Rebase to 4.1-rc6 Changelog since V3 o Drop batching of TLB flush from migration o Redo how larger batching is managed o Batch TLB flushes when writable entries exist When unmapping pages it is necessary to flush the TLB. If that page was accessed by another CPU then an IPI is used to flush the remote CPU. That is a lot of IPIs if kswapd is scanning and unmapping >100K pages per second. There already is a window between when a page is unmapped and when it is TLB flushed. This series increases the window so multiple pages can be flushed using a single IPI. This should be safe or the kernel is hosed already. Patch 1 simply made the rest of the series easier to write as ftrace could identify all the senders of TLB flush IPIs. Patch 2 tracks what CPUs potentially map a PFN and then sends an IPI to flush the entire TLB. Patch 3 tracks when there potentially are writable TLB entries that need to be batched differently Patch 4 increases SWAP_CLUSTER_MAX to further batch flushes The performance impact is documented in the changelogs but in the optimistic case on a 4-socket machine the full series reduces interrupts from 900K interrupts/second to 60K interrupts/second. arch/x86/Kconfig | 1 + arch/x86/include/asm/tlbflush.h | 6 +++ arch/x86/mm/tlb.c | 1 + include/linux/mm_types.h | 1 + include/linux/rmap.h | 3 ++ include/linux/sched.h | 23 ++++++++ include/linux/swap.h | 2 +- include/trace/events/tlb.h | 3 +- init/Kconfig | 10 ++++ mm/internal.h | 15 ++++++ mm/rmap.c | 117 +++++++++++++++++++++++++++++++++++++++- mm/vmscan.c | 30 ++++++++++- 12 files changed, 207 insertions(+), 5 deletions(-) -- 2.3.5 ^ permalink raw reply [flat|nested] 58+ messages in thread
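A rough worked example of the batching arithmetic (the per-second figures other than the quoted 900K and 60K are illustrative assumptions): if reclaim unmaps 100K pages per second and each unmap triggered its own remote flush, that is on the order of 100K IPI sends per second; batching SWAP_CLUSTER_MAX pages per send — 32 by default, larger after patch 4 — cuts that to roughly 100K / 32 ≈ 3K sends per second, and each send still fans out to every CPU that might map one of the pages, which is consistent with the order-of-magnitude drop from 900K to 60K interrupts per second reported above.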
* [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent 2015-07-06 13:39 [PATCH 0/4] TLB flush multiple pages per IPI v7 Mel Gorman @ 2015-07-06 13:39 ` Mel Gorman 0 siblings, 0 replies; 58+ messages in thread From: Mel Gorman @ 2015-07-06 13:39 UTC (permalink / raw) To: Andrew Morton Cc: Rik van Riel, Dave Hansen, Ingo Molnar, Linus Torvalds, Linux-MM, LKML, Mel Gorman It is easy to trace when an IPI is received to flush a TLB but harder to detect what event sent it. This patch makes it easy to identify the source of IPIs being transmitted for TLB flushes on x86. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Dave Hansen <dave.hansen@intel.com> --- arch/x86/mm/tlb.c | 1 + include/linux/mm_types.h | 1 + include/trace/events/tlb.h | 3 ++- 3 files changed, 4 insertions(+), 1 deletion(-) diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 3250f2371aea..2da824c1c140 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -140,6 +140,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask, info.flush_end = end; count_vm_tlb_event(NR_TLB_REMOTE_FLUSH); + trace_tlb_flush(TLB_REMOTE_SEND_IPI, end - start); if (is_uv_system()) { unsigned int cpu; diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 0038ac7466fd..84ef58543e2b 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -552,6 +552,7 @@ enum tlb_flush_reason { TLB_REMOTE_SHOOTDOWN, TLB_LOCAL_SHOOTDOWN, TLB_LOCAL_MM_SHOOTDOWN, + TLB_REMOTE_SEND_IPI, NR_TLB_FLUSH_REASONS, }; diff --git a/include/trace/events/tlb.h b/include/trace/events/tlb.h index 4250f364a6ca..bc8815f45f3b 100644 --- a/include/trace/events/tlb.h +++ b/include/trace/events/tlb.h @@ -11,7 +11,8 @@ EM( TLB_FLUSH_ON_TASK_SWITCH, "flush on task switch" ) \ EM( TLB_REMOTE_SHOOTDOWN, "remote shootdown" ) \ EM( TLB_LOCAL_SHOOTDOWN, "local shootdown" ) \ - EMe( TLB_LOCAL_MM_SHOOTDOWN, "local mm shootdown" ) + EM( TLB_LOCAL_MM_SHOOTDOWN, "local mm shootdown" ) \ + EMe( TLB_REMOTE_SEND_IPI, "remote ipi send" ) /* * First define the enums in TLB_FLUSH_REASON to be exported to userspace -- 2.3.5 ^ permalink raw reply related [flat|nested] 58+ messages in thread
end of thread, other threads:[~2015-07-06 13:40 UTC | newest] Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2015-04-15 10:42 [RFC PATCH 0/4] TLB flush multiple pages with a single IPI Mel Gorman 2015-04-15 10:42 ` Mel Gorman 2015-04-15 10:42 ` [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent Mel Gorman 2015-04-15 10:42 ` Mel Gorman 2015-04-15 10:42 ` [PATCH 2/4] mm: Send a single IPI to TLB flush multiple pages when unmapping Mel Gorman 2015-04-15 10:42 ` Mel Gorman 2015-04-15 21:03 ` Rik van Riel 2015-04-15 21:03 ` Rik van Riel 2015-04-15 21:16 ` Hugh Dickins 2015-04-15 21:16 ` Hugh Dickins 2015-04-15 21:28 ` Mel Gorman 2015-04-15 21:28 ` Mel Gorman 2015-04-15 21:32 ` Dave Hansen 2015-04-15 21:32 ` Dave Hansen 2015-04-16 6:38 ` Minchan Kim 2015-04-16 6:38 ` Minchan Kim 2015-04-16 8:07 ` Mel Gorman 2015-04-16 8:07 ` Mel Gorman 2015-04-16 8:29 ` Minchan Kim 2015-04-16 8:29 ` Minchan Kim 2015-04-16 9:19 ` Mel Gorman 2015-04-16 9:19 ` Mel Gorman 2015-04-16 23:30 ` Minchan Kim 2015-04-16 23:30 ` Minchan Kim 2015-04-15 22:20 ` Andi Kleen 2015-04-15 22:20 ` Andi Kleen 2015-04-15 22:53 ` Mel Gorman 2015-04-15 22:53 ` Mel Gorman 2015-04-15 10:42 ` [PATCH 3/4] mm: Gather more PFNs before sending a TLB to flush unmapped pages Mel Gorman 2015-04-15 10:42 ` Mel Gorman 2015-04-15 11:42 ` Peter Zijlstra 2015-04-15 11:42 ` Peter Zijlstra 2015-04-15 12:15 ` Mel Gorman 2015-04-15 12:15 ` Mel Gorman 2015-04-15 12:24 ` Peter Zijlstra 2015-04-15 12:24 ` Peter Zijlstra 2015-04-15 12:56 ` Mel Gorman 2015-04-15 12:56 ` Mel Gorman 2015-04-15 10:42 ` [PATCH 4/4] mm: migrate: Batch TLB flushing when unmapping pages for migration Mel Gorman 2015-04-15 10:42 ` Mel Gorman 2015-04-15 21:06 ` Hugh Dickins 2015-04-15 21:06 ` Hugh Dickins 2015-04-15 21:44 ` Mel Gorman 2015-04-15 21:44 ` Mel Gorman 2015-04-15 23:50 ` Hugh Dickins 2015-04-15 23:50 ` Hugh Dickins 2015-04-16 10:22 [RFC PATCH 0/4] TLB flush multiple pages with a single IPI v2 Mel Gorman 2015-04-16 10:22 ` [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent Mel Gorman 2015-04-16 10:22 ` Mel Gorman 2015-04-16 15:51 ` Rik van Riel 2015-04-16 15:51 ` Rik van Riel 2015-04-16 16:55 ` Dave Hansen 2015-04-16 16:55 ` Dave Hansen 2015-04-16 17:39 ` Mel Gorman 2015-04-16 17:39 ` Mel Gorman 2015-06-09 17:31 [PATCH 0/3] TLB flush multiple pages per IPI v6 Mel Gorman 2015-06-09 17:31 ` [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent Mel Gorman 2015-06-09 17:31 ` Mel Gorman 2015-07-06 13:39 [PATCH 0/4] TLB flush multiple pages per IPI v7 Mel Gorman 2015-07-06 13:39 ` [PATCH 1/4] x86, mm: Trace when an IPI is about to be sent Mel Gorman 2015-07-06 13:39 ` Mel Gorman