linux-mm.kvack.org archive mirror
* [PATCH 0/5] mm: support parallel free of memory
@ 2017-02-24 11:40 Aaron Lu
  2017-02-24 11:40 ` [PATCH 1/5] mm: add tlb_flush_mmu_free_batches Aaron Lu
                   ` (5 more replies)
  0 siblings, 6 replies; 9+ messages in thread
From: Aaron Lu @ 2017-02-24 11:40 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Tim Chen, Andrew Morton, Ying Huang, Aaron Lu

For a regular process, the time taken in its exit() path to free its
used memory is not a problem. But there are heavy processes that consume
several terabytes of memory, and the time taken to free their memory can
exceed ten minutes.

To optimize this use case, a parallel free method is proposed here.
For a detailed explanation, please refer to patch 2/5.

I'm not sure if we need patch 4/5, which avoids page accumulation being
interrupted in some cases (the patch description has more information).
My test case, which only deals with anonymous memory, does not benefit
from it, of course. It can be safely dropped if it is deemed not useful.

A test program that does a single malloc() of 320G memory is used to see
how useful the proposed parallel free solution is; the time measured is
for the free() call. The test machine is a Haswell-EX with
4 nodes/72 cores/144 threads and 512G memory. All tests are done with
THP disabled.

kernel                             time
v4.10                              10.8s  ±2.8%
this patch(with default setting)   5.795s ±5.8%
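
For reference, below is a minimal sketch of the kind of test described
above (not the exact program used): one big malloc(), touch every 4K
page so the memory is actually populated, then time the free() call that
hands the memory back to the kernel. The 320G size is taken from the
description; scale it down on smaller machines.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
	size_t size = 320UL << 30;	/* 320G, as in the test above */
	char *buf = malloc(size);
	struct timespec t1, t2;

	if (!buf)
		return 1;

	/* fault in each 4K page so there is real memory for the kernel to free */
	for (size_t off = 0; off < size; off += 4096)
		buf[off] = 1;

	clock_gettime(CLOCK_MONOTONIC, &t1);
	free(buf);	/* glibc munmap()s the chunk -> kernel zap/free path */
	clock_gettime(CLOCK_MONOTONIC, &t2);

	printf("free() took %.3fs\n", (t2.tv_sec - t1.tv_sec) +
	       (t2.tv_nsec - t1.tv_nsec) / 1e9);
	return 0;
}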

Patch 3/5 introduces a dedicated workqueue for the free workers; here
are more results with different values for this workqueue's max_active:

max_active:   time
1             8.9s   ±0.5%
2             5.65s  ±5.5%
4             4.84s  ±0.16%
8             4.77s  ±0.97%
16            4.85s  ±0.77%
32            6.21s  ±0.46%

Comments are welcome.

Aaron Lu (5):
  mm: add tlb_flush_mmu_free_batches
  mm: parallel free pages
  mm: use a dedicated workqueue for the free workers
  mm: add force_free_pages in zap_pte_range
  mm: add debugfs interface for parallel free tuning

 include/asm-generic/tlb.h |  12 ++--
 mm/memory.c               | 138 +++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 122 insertions(+), 28 deletions(-)

-- 
2.9.3


* [PATCH 1/5] mm: add tlb_flush_mmu_free_batches
  2017-02-24 11:40 [PATCH 0/5] mm: support parallel free of memory Aaron Lu
@ 2017-02-24 11:40 ` Aaron Lu
  2017-02-24 11:40 ` [PATCH 2/5] mm: parallel free pages Aaron Lu
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Aaron Lu @ 2017-02-24 11:40 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Tim Chen, Andrew Morton, Ying Huang, Aaron Lu

There are currently two places that free pages: tlb_flush_mmu_free frees
the pages pointed to by the mmu_gather_batches, while tlb_finish_mmu
frees the batch pages themselves. The following patch adds yet another
place, the parallel free worker, which frees both the pages pointed to
by the mmu_gather_batches and the batch pages themselves. To avoid code
duplication, add a new function for this purpose.

Another reason to add this function is that, after the following patch,
cond_resched() will need to be called wherever more than 10K pages can
be freed, i.e. in tlb_flush_mmu_free and the worker function. Using a
single function avoids adding cond_resched() in multiple places.

No functionality change.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 mm/memory.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 6bf2b471e30c..2b88196841b9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -251,14 +251,25 @@ static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 	__tlb_reset_range(tlb);
 }
 
-static void tlb_flush_mmu_free(struct mmu_gather *tlb)
+static void tlb_flush_mmu_free_batches(struct mmu_gather_batch *batch_start,
+				       int free_batch_page)
 {
-	struct mmu_gather_batch *batch;
+	struct mmu_gather_batch *batch, *next;
 
-	for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
-		free_pages_and_swap_cache(batch->pages, batch->nr);
-		batch->nr = 0;
+	for (batch = batch_start; batch; batch = next) {
+		next = batch->next;
+		if (batch->nr) {
+			free_pages_and_swap_cache(batch->pages, batch->nr);
+			batch->nr = 0;
+		}
+		if (free_batch_page)
+			free_pages((unsigned long)batch, 0);
 	}
+}
+
+static void tlb_flush_mmu_free(struct mmu_gather *tlb)
+{
+	tlb_flush_mmu_free_batches(&tlb->local, 0);
 	tlb->active = &tlb->local;
 }
 
@@ -274,17 +285,12 @@ void tlb_flush_mmu(struct mmu_gather *tlb)
  */
 void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
 {
-	struct mmu_gather_batch *batch, *next;
-
 	tlb_flush_mmu(tlb);
 
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
 
-	for (batch = tlb->local.next; batch; batch = next) {
-		next = batch->next;
-		free_pages((unsigned long)batch, 0);
-	}
+	tlb_flush_mmu_free_batches(tlb->local.next, 1);
 	tlb->local.next = NULL;
 }
 
-- 
2.9.3


* [PATCH 2/5] mm: parallel free pages
  2017-02-24 11:40 [PATCH 0/5] mm: support parallel free of memory Aaron Lu
  2017-02-24 11:40 ` [PATCH 1/5] mm: add tlb_flush_mmu_free_batches Aaron Lu
@ 2017-02-24 11:40 ` Aaron Lu
  2017-02-24 11:40 ` [PATCH 3/5] mm: use a dedicated workqueue for the free workers Aaron Lu
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Aaron Lu @ 2017-02-24 11:40 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Tim Chen, Andrew Morton, Ying Huang, Aaron Lu

For a regular process, the time taken in its exit() path to free its
used memory is not a problem. But there are heavy processes that consume
several terabytes of memory, and the time taken to free their memory can
exceed ten minutes.

To optimize this use case, a parallel free method is proposed; it is
based on the current gather batch free.

The current gather batch free works like this:
For each struct mmu_gather *tlb, there is a static buffer to store the
to-be-freed page pointers. Its size is MMU_GATHER_BUNDLE, which is
defined to be 8, so if a TLB teardown doesn't free more than 8 pages,
that is all we need. If more than 8 pages are to be freed, new pages
will need to be allocated to store the to-be-freed page pointers.

The structure used to describe the saved page pointers is
struct mmu_gather_batch, and tlb->local is of this type. tlb->local
differs from the other struct mmu_gather_batch(es) in that its page
pointer array points to the previously described static buffer, while
the other struct mmu_gather_batch(es) store page pointers on
dynamically allocated pages.

These batches form a singly linked list, starting from &tlb->local.

tlb->local.pages  => tlb->__pages (8 pointers)
      \|/
      next => *batch1->pages => about 500 pointers
                \|/
                next => batch2->pages => about 500 pointers
                          \|/
                          next => batch3->pages => about 500 pointers
                                    ... ...
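
For reference, the batch descriptor described above looks roughly like
this in include/asm-generic/tlb.h (paraphrased from v4.10; the pages[]
storage of tlb->local is the static tlb->__pages buffer):

struct mmu_gather_batch {
	struct mmu_gather_batch	*next;	/* forms the singly linked list above */
	unsigned int		nr;	/* number of page pointers stored */
	unsigned int		max;	/* capacity of pages[] */
	struct page		*pages[0];
};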

The proposed parallel free works like this: if the process has many
pages to be freed, accumulate them in these struct mmu_gather_batch(es)
one after another until 256K pages are accumulated. Then take this
singly linked list, starting from tlb->local.next, off struct mmu_gather
*tlb and free the pages in a worker thread. The main thread can return
to continue zapping other pages (after freeing the pages pointed to by
tlb->local.pages).

Note that since we may now be accumulating as many as 256K pages, the
soft lockup on !CONFIG_PREEMPT issue fixed by commit 53a59fc67f97
("mm: limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT")
can reappear. To address that, add cond_resched() in
tlb_flush_mmu_free_batches, where many pages can be freed.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 include/asm-generic/tlb.h | 12 +++++------
 mm/memory.c               | 55 ++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 59 insertions(+), 8 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 7eed8cf3130a..07229b48e7f9 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -78,13 +78,9 @@ struct mmu_gather_batch {
 #define MAX_GATHER_BATCH	\
 	((PAGE_SIZE - sizeof(struct mmu_gather_batch)) / sizeof(void *))
 
-/*
- * Limit the maximum number of mmu_gather batches to reduce a risk of soft
- * lockups for non-preemptible kernels on huge machines when a lot of memory
- * is zapped during unmapping.
- * 10K pages freed at once should be safe even without a preemption point.
- */
-#define MAX_GATHER_BATCH_COUNT	(10000UL/MAX_GATHER_BATCH)
+#define ASYNC_FREE_THRESHOLD (256*1024UL)
+#define MAX_GATHER_BATCH_COUNT	DIV_ROUND_UP(ASYNC_FREE_THRESHOLD, MAX_GATHER_BATCH)
+#define PAGE_FREE_NR_TO_YIELD (10000UL)
 
 /* struct mmu_gather is an opaque type used by the mm code for passing around
  * any data needed by arch specific code for tlb_remove_page.
@@ -108,6 +104,8 @@ struct mmu_gather {
 	struct page		*__pages[MMU_GATHER_BUNDLE];
 	unsigned int		batch_count;
 	int page_size;
+	unsigned int            page_nr; /* how many pages we have gathered to be freed */
+	struct list_head        worker_list; /* list for spawned workers that do the free job */
 };
 
 #define HAVE_GENERIC_MMU_GATHER
diff --git a/mm/memory.c b/mm/memory.c
index 2b88196841b9..b98cd25075f0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -229,6 +229,9 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long
 	tlb->local.max  = ARRAY_SIZE(tlb->__pages);
 	tlb->active     = &tlb->local;
 	tlb->batch_count = 0;
+	tlb->page_nr    = 0;
+
+	INIT_LIST_HEAD(&tlb->worker_list);
 
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 	tlb->batch = NULL;
@@ -255,22 +258,63 @@ static void tlb_flush_mmu_free_batches(struct mmu_gather_batch *batch_start,
 				       int free_batch_page)
 {
 	struct mmu_gather_batch *batch, *next;
+	int nr = 0;
 
 	for (batch = batch_start; batch; batch = next) {
 		next = batch->next;
 		if (batch->nr) {
 			free_pages_and_swap_cache(batch->pages, batch->nr);
+			nr += batch->nr;
 			batch->nr = 0;
 		}
-		if (free_batch_page)
+		if (free_batch_page) {
 			free_pages((unsigned long)batch, 0);
+			nr++;
+		}
+		if (nr >= PAGE_FREE_NR_TO_YIELD) {
+			cond_resched();
+			nr = 0;
+		}
 	}
 }
 
+struct batch_free_struct {
+	struct work_struct work;
+	struct mmu_gather_batch *batch_start;
+	struct list_head list;
+};
+
+static void batch_free_work(struct work_struct *work)
+{
+	struct batch_free_struct *batch_free = container_of(work, struct batch_free_struct, work);
+	tlb_flush_mmu_free_batches(batch_free->batch_start, 1);
+}
+
 static void tlb_flush_mmu_free(struct mmu_gather *tlb)
 {
+	struct batch_free_struct *batch_free = NULL;
+
+	if (tlb->page_nr >= ASYNC_FREE_THRESHOLD)
+		batch_free = kmalloc(sizeof(*batch_free), GFP_NOWAIT | __GFP_NOWARN);
+
+	if (batch_free) {
+		/*
+		 * Start a worker to free pages stored
+		 * in batches following tlb->local.
+		 */
+		batch_free->batch_start = tlb->local.next;
+		INIT_WORK(&batch_free->work, batch_free_work);
+		list_add(&batch_free->list, &tlb->worker_list);
+		queue_work(system_unbound_wq, &batch_free->work);
+
+		tlb->batch_count = 0;
+		tlb->local.next = NULL;
+		/* fall through to free pages stored in tlb->local */
+	}
+
 	tlb_flush_mmu_free_batches(&tlb->local, 0);
 	tlb->active = &tlb->local;
+	tlb->page_nr = 0;
 }
 
 void tlb_flush_mmu(struct mmu_gather *tlb)
@@ -285,11 +329,18 @@ void tlb_flush_mmu(struct mmu_gather *tlb)
  */
 void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
 {
+	struct batch_free_struct *batch_free, *n;
+
 	tlb_flush_mmu(tlb);
 
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
 
+	list_for_each_entry_safe(batch_free, n, &tlb->worker_list, list) {
+		flush_work(&batch_free->work);
+		kfree(batch_free);
+	}
+
 	tlb_flush_mmu_free_batches(tlb->local.next, 1);
 	tlb->local.next = NULL;
 }
@@ -308,6 +359,8 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_
 	VM_BUG_ON(!tlb->end);
 	VM_WARN_ON(tlb->page_size != page_size);
 
+	tlb->page_nr++;
+
 	batch = tlb->active;
 	/*
 	 * Add the page and check if we are full. If so
-- 
2.9.3


* [PATCH 3/5] mm: use a dedicated workqueue for the free workers
  2017-02-24 11:40 [PATCH 0/5] mm: support parallel free of memory Aaron Lu
  2017-02-24 11:40 ` [PATCH 1/5] mm: add tlb_flush_mmu_free_batches Aaron Lu
  2017-02-24 11:40 ` [PATCH 2/5] mm: parallel free pages Aaron Lu
@ 2017-02-24 11:40 ` Aaron Lu
  2017-02-24 11:40 ` [PATCH 4/5] mm: add force_free_pages in zap_pte_range Aaron Lu
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Aaron Lu @ 2017-02-24 11:40 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Tim Chen, Andrew Morton, Ying Huang, Aaron Lu

Introduce a workqueue for all the free workers so that users can
fine-tune how many workers can be active through the workqueue's sysfs
interface: max_active. More workers will normally lead to better
performance, but too many can cause severe lock contention.

Note that since the zone lock is global, the workqueue is also global
for all processes, i.e. if max_active is set to 8, there will be at most
8 workers active across all processes that are doing munmap()/exit()/etc.
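
Not part of the patch, but for illustration: with WQ_SYSFS the workqueue
is expected to show up under /sys/devices/virtual/workqueue/, so
max_active can be changed from userspace roughly like this (the exact
sysfs path is an assumption based on standard WQ_SYSFS behaviour):

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/devices/virtual/workqueue/batch_free_wq/max_active", "w");

	if (!f) {
		perror("max_active");
		return 1;
	}
	fprintf(f, "8\n");	/* e.g. 8 workers, near the sweet spot in the cover letter */
	return fclose(f) ? 1 : 0;
}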

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 mm/memory.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index b98cd25075f0..eb8b17fc1b2b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -254,6 +254,18 @@ static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 	__tlb_reset_range(tlb);
 }
 
+static struct workqueue_struct *batch_free_wq;
+static int __init batch_free_wq_init(void)
+{
+	batch_free_wq = alloc_workqueue("batch_free_wq", WQ_UNBOUND | WQ_SYSFS, 0);
+	if (!batch_free_wq) {
+		pr_warn("failed to create workqueue batch_free_wq\n");
+		return -ENOMEM;
+	}
+	return 0;
+}
+subsys_initcall(batch_free_wq_init);
+
 static void tlb_flush_mmu_free_batches(struct mmu_gather_batch *batch_start,
 				       int free_batch_page)
 {
@@ -305,7 +317,7 @@ static void tlb_flush_mmu_free(struct mmu_gather *tlb)
 		batch_free->batch_start = tlb->local.next;
 		INIT_WORK(&batch_free->work, batch_free_work);
 		list_add(&batch_free->list, &tlb->worker_list);
-		queue_work(system_unbound_wq, &batch_free->work);
+		queue_work(batch_free_wq, &batch_free->work);
 
 		tlb->batch_count = 0;
 		tlb->local.next = NULL;
-- 
2.9.3


* [PATCH 4/5] mm: add force_free_pages in zap_pte_range
  2017-02-24 11:40 [PATCH 0/5] mm: support parallel free of memory Aaron Lu
                   ` (2 preceding siblings ...)
  2017-02-24 11:40 ` [PATCH 3/5] mm: use a dedicated workqueue for the free workers Aaron Lu
@ 2017-02-24 11:40 ` Aaron Lu
  2017-02-24 11:40 ` [PATCH 5/5] mm: add debugfs interface for parallel free tuning Aaron Lu
  2017-03-01  0:39 ` [PATCH 0/5] mm: support parallel free of memory Andrew Morton
  5 siblings, 0 replies; 9+ messages in thread
From: Aaron Lu @ 2017-02-24 11:40 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Tim Chen, Andrew Morton, Ying Huang, Aaron Lu

force_flush in zap_pte_range is set under the following 2 conditions:
1 When no more batches can be allocated (either due to lack of memory or
  because MAX_GATHER_BATCH_COUNT has been reached) to store the
  to-be-freed page pointers;
2 When a TLB-only flush is needed before dropping the PTE lock to avoid
  a race condition as explained in commit 1cf35d47712d
  ("mm: split 'tlb_flush_mmu()' into tlb flushing and memory freeing parts").

Once force_flush is set, the pages accumulated thus far will all be
freed. Since there is no need to free pages when condition 2 occurs, add
a new variable named force_free_pages to decide whether page free should
be done; it will only be set when condition 1 occurs.

With this change, page accumulation is no longer interrupted by
condition 2. Also, rename force_flush to force_flush_tlb since it now
covers only condition 2.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 mm/memory.c | 20 ++++++++------------
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index eb8b17fc1b2b..7d1fe74084be 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1185,7 +1185,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				struct zap_details *details)
 {
 	struct mm_struct *mm = tlb->mm;
-	int force_flush = 0;
+	int force_flush_tlb = 0, force_free_pages = 0;
 	int rss[NR_MM_COUNTERS];
 	spinlock_t *ptl;
 	pte_t *start_pte;
@@ -1232,7 +1232,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 					 */
 					if (unlikely(details && details->ignore_dirty))
 						continue;
-					force_flush = 1;
+					force_flush_tlb = 1;
 					set_page_dirty(page);
 				}
 				if (pte_young(ptent) &&
@@ -1244,7 +1244,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			if (unlikely(page_mapcount(page) < 0))
 				print_bad_pte(vma, addr, ptent, page);
 			if (unlikely(__tlb_remove_page(tlb, page))) {
-				force_flush = 1;
+				force_free_pages = 1;
 				addr += PAGE_SIZE;
 				break;
 			}
@@ -1272,18 +1272,14 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	arch_leave_lazy_mmu_mode();
 
 	/* Do the actual TLB flush before dropping ptl */
-	if (force_flush)
+	if (force_flush_tlb) {
+		force_flush_tlb = 0;
 		tlb_flush_mmu_tlbonly(tlb);
+	}
 	pte_unmap_unlock(start_pte, ptl);
 
-	/*
-	 * If we forced a TLB flush (either due to running out of
-	 * batch buffers or because we needed to flush dirty TLB
-	 * entries before releasing the ptl), free the batched
-	 * memory too. Restart if we didn't do everything.
-	 */
-	if (force_flush) {
-		force_flush = 0;
+	if (force_free_pages) {
+		force_free_pages = 0;
 		tlb_flush_mmu_free(tlb);
 		if (addr != end)
 			goto again;
-- 
2.9.3


* [PATCH 5/5] mm: add debugfs interface for parallel free tuning
  2017-02-24 11:40 [PATCH 0/5] mm: support parallel free of memory Aaron Lu
                   ` (3 preceding siblings ...)
  2017-02-24 11:40 ` [PATCH 4/5] mm: add force_free_pages in zap_pte_range Aaron Lu
@ 2017-02-24 11:40 ` Aaron Lu
  2017-03-01  0:39 ` [PATCH 0/5] mm: support parallel free of memory Andrew Morton
  5 siblings, 0 replies; 9+ messages in thread
From: Aaron Lu @ 2017-02-24 11:40 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Tim Chen, Andrew Morton, Ying Huang, Aaron Lu

Make it possible to set different values for async_free_threshold and
max_gather_batch_count through debugfs (a userspace sketch for writing
these knobs follows the list below).

With this, we can do tests for different purposes:
1 Restore vanilla kernel behaviour for performance comparison.
  Set max_gather_batch_count to a value like 20 to effectively restore
  the behaviour of the vanilla kernel, since this keeps the number of
  gathered pages below async_free_threshold (effectively disabling
  parallel free);
2 Debugging.
  Set async_free_threshold to a very small value (like 128) to trigger
  parallel free even for ordinary processes, which is ideal for
  debugging with a virtual machine that doesn't have much memory
  assigned to it;
3 Performance tuning.
  Use values for async_free_threshold and max_gather_batch_count other
  than the defaults to test whether parallel free performs better or
  worse.
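
As mentioned above, here is a small userspace sketch (not part of the
patch) for writing these knobs. It assumes debugfs is mounted at
/sys/kernel/debug; the helper name is made up for illustration:

#include <stdio.h>

static int set_parallel_free_knob(const char *knob, unsigned long val)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/kernel/debug/parallel_free/%s", knob);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%lu\n", val);
	return fclose(f);
}

int main(void)
{
	/* case 2 above: trigger parallel free even for small processes */
	return set_parallel_free_knob("async_free_threshold", 128) ? 1 : 0;
}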

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 mm/memory.c | 33 +++++++++++++++++++++++++++++++--
 1 file changed, 31 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7d1fe74084be..9ca07c59e525 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -184,6 +184,35 @@ static void check_sync_rss_stat(struct task_struct *task)
 
 #ifdef HAVE_GENERIC_MMU_GATHER
 
+static unsigned long async_free_threshold = ASYNC_FREE_THRESHOLD;
+static unsigned long max_gather_batch_count = MAX_GATHER_BATCH_COUNT;
+
+#ifdef CONFIG_DEBUG_FS
+static int __init tlb_mmu_parallel_free_debugfs(void)
+{
+	umode_t mode = S_IFREG | S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH;
+	struct dentry *dir;
+
+	dir = debugfs_create_dir("parallel_free", NULL);
+	if (!dir)
+		return -ENOMEM;
+
+	if (!debugfs_create_ulong("async_free_threshold", mode, dir,
+				&async_free_threshold))
+		goto fail;
+	if (!debugfs_create_ulong("max_gather_batch_count", mode, dir,
+				&max_gather_batch_count))
+		goto fail;
+
+	return 0;
+
+fail:
+	debugfs_remove_recursive(dir);
+	return -ENOMEM;
+}
+late_initcall(tlb_mmu_parallel_free_debugfs);
+#endif
+
 static bool tlb_next_batch(struct mmu_gather *tlb)
 {
 	struct mmu_gather_batch *batch;
@@ -194,7 +223,7 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
 		return true;
 	}
 
-	if (tlb->batch_count == MAX_GATHER_BATCH_COUNT)
+	if (tlb->batch_count == max_gather_batch_count)
 		return false;
 
 	batch = (void *)__get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0);
@@ -306,7 +335,7 @@ static void tlb_flush_mmu_free(struct mmu_gather *tlb)
 {
 	struct batch_free_struct *batch_free = NULL;
 
-	if (tlb->page_nr >= ASYNC_FREE_THRESHOLD)
+	if (tlb->page_nr >= async_free_threshold)
 		batch_free = kmalloc(sizeof(*batch_free), GFP_NOWAIT | __GFP_NOWARN);
 
 	if (batch_free) {
-- 
2.9.3


* Re: [PATCH 0/5] mm: support parallel free of memory
  2017-02-24 11:40 [PATCH 0/5] mm: support parallel free of memory Aaron Lu
                   ` (4 preceding siblings ...)
  2017-02-24 11:40 ` [PATCH 5/5] mm: add debugfs interface for parallel free tuning Aaron Lu
@ 2017-03-01  0:39 ` Andrew Morton
  2017-03-01  0:43   ` Dave Hansen
  5 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2017-03-01  0:39 UTC (permalink / raw)
  To: Aaron Lu; +Cc: linux-mm, linux-kernel, Dave Hansen, Tim Chen, Ying Huang

On Fri, 24 Feb 2017 19:40:31 +0800 Aaron Lu <aaron.lu@intel.com> wrote:

> For a regular process, the time taken in its exit() path to free its
> used memory is not a problem. But there are heavy processes that
> consume several terabytes of memory, and the time taken to free their
> memory can exceed ten minutes.
> 
> To optimize this use case, a parallel free method is proposed here.
> For a detailed explanation, please refer to patch 2/5.
> 
> I'm not sure if we need patch 4/5, which avoids page accumulation
> being interrupted in some cases (the patch description has more
> information). My test case, which only deals with anonymous memory,
> does not benefit from it, of course. It can be safely dropped if it is
> deemed not useful.
> 
> A test program that does a single malloc() of 320G memory is used to
> see how useful the proposed parallel free solution is; the time
> measured is for the free() call. The test machine is a Haswell-EX with
> 4 nodes/72 cores/144 threads and 512G memory. All tests are done with
> THP disabled.
> 
> kernel                             time
> v4.10                              10.8s  ±2.8%
> this patch(with default setting)   5.795s ±5.8%

Dumb question: why not do this in userspace, presumably as part of the
malloc() library?  malloc knows where all the memory is and should be
able to kick off N threads to run around munmapping everything?


* Re: [PATCH 0/5] mm: support parallel free of memory
  2017-03-01  0:39 ` [PATCH 0/5] mm: support parallel free of memory Andrew Morton
@ 2017-03-01  0:43   ` Dave Hansen
  2017-03-01  1:17     ` Aaron Lu
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Hansen @ 2017-03-01  0:43 UTC (permalink / raw)
  To: Andrew Morton, Aaron Lu; +Cc: linux-mm, linux-kernel, Tim Chen, Ying Huang

On 02/28/2017 04:39 PM, Andrew Morton wrote:
> Dumb question: why not do this in userspace, presumably as part of the
> malloc() library?  malloc knows where all the memory is and should be
> able to kick off N threads to run around munmapping everything?

One of the places we saw this happen was when an app crashed and was
exit()'ing under duress without cleaning up nicely.  The time that it
takes to unmap a few TB of 4k pages is pretty excessive.


* Re: [PATCH 0/5] mm: support parallel free of memory
  2017-03-01  0:43   ` Dave Hansen
@ 2017-03-01  1:17     ` Aaron Lu
  0 siblings, 0 replies; 9+ messages in thread
From: Aaron Lu @ 2017-03-01  1:17 UTC (permalink / raw)
  To: Dave Hansen, Andrew Morton; +Cc: linux-mm, linux-kernel, Tim Chen, Ying Huang

On 03/01/2017 08:43 AM, Dave Hansen wrote:
> On 02/28/2017 04:39 PM, Andrew Morton wrote:
>> Dumb question: why not do this in userspace, presumably as part of the
>> malloc() library?  malloc knows where all the memory is and should be
>> able to kick off N threads to run around munmapping everything?
> 
> One of the places we saw this happen was when an app crashed and was
> exit()'ing under duress without cleaning up nicely.  The time that it
> takes to unmap a few TB of 4k pages is pretty excessive.
 
Thanks Dave for the answer; I should have put this in the changelog (I
will do that in the next revision). Sorry about that, Andrew, and I hope
Dave's answer clears things up about the patch's intention.

Regards,
Aaron

