* + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree
@ 2016-04-27 21:17 akpm
  2016-04-28 15:19 ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: akpm @ 2016-04-27 21:17 UTC (permalink / raw)
  To: ebru.akagunduz, aarcange, aneesh.kumar, boaz, gorcunov, hannes,
	hughd, iamjoonsoo.kim, kirill.shutemov, mgorman, mhocko,
	n-horiguchi, riel, rientjes, vbabka, mm-commits


The patch titled
     Subject: mm, thp: avoid unnecessary swapin in khugepaged
has been added to the -mm tree.  Its filename is
     mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/SubmitChecklist when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Subject: mm, thp: avoid unnecessary swapin in khugepaged

Currently khugepaged performs swapin readahead to improve the THP collapse
rate.  This patch checks VM statistics to skip the swapin work when it is
unnecessary, so that when the system is under memory pressure khugepaged
neither consumes resources on swapin nor triggers direct reclaim through
swapin readahead.

The patch was tested with a test program that allocates 800MB of memory,
writes to it, and then sleeps.  The system was forced to swap out all of
the memory.  Afterwards, the test program touches the area by writing
again, skipping one page in each 20 pages of the area.  While the test
waits for the remaining part to be swapped in by readahead, the system is
kept busy doing page reclaim.  Even though there was enough free memory
during the test, khugepaged did not perform swapin readahead because of
that busyness.
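
(For illustration, a minimal sketch of the kind of test program described
above, reconstructed from this changelog; the exact sleep time and the
constants are assumptions, not the author's original code:)

#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define SIZE	(800UL << 20)	/* 800MB test area */
#define PAGE	4096UL

int main(void)
{
	unsigned long i;
	char *area = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (area == MAP_FAILED)
		return 1;

	memset(area, 1, SIZE);		/* populate all 800MB */
	sleep(600);			/* external pressure swaps it all out */

	/* touch the area again by writing, skipping 1 page in each 20 */
	for (i = 0; i < SIZE / PAGE; i++)
		if (i % 20)
			area[i * PAGE] = 1;

	pause();			/* leave time for khugepaged to collapse */
	return 0;
}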

Test results:

                        After swapped out
-------------------------------------------------------------------
              | Anonymous | AnonHugePages | Swap      | Fraction  |
-------------------------------------------------------------------
With patch    | 0 kB      |  0 kB         | 800000 kB |    100%   |
-------------------------------------------------------------------
Without patch | 0 kB      |  0 kB         | 800000 kB |    100%   |
-------------------------------------------------------------------

                        After swapped in
-------------------------------------------------------------------
              | Anonymous | AnonHugePages | Swap      | Fraction  |
-------------------------------------------------------------------
With patch    | 385120 kB | 102400 kB     | 414880 kB |    26%    |
-------------------------------------------------------------------
Without patch | 389728 kB | 194560 kB     | 410272 kB |    49%    |
-------------------------------------------------------------------

Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/huge_memory.c |   18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff -puN mm/huge_memory.c~mm-thp-avoid-unnecessary-swapin-in-khugepaged mm/huge_memory.c
--- a/mm/huge_memory.c~mm-thp-avoid-unnecessary-swapin-in-khugepaged
+++ a/mm/huge_memory.c
@@ -102,6 +102,7 @@ static DECLARE_WAIT_QUEUE_HEAD(khugepage
  */
 static unsigned int khugepaged_max_ptes_none __read_mostly;
 static unsigned int khugepaged_max_ptes_swap __read_mostly = HPAGE_PMD_NR/8;
+static unsigned long allocstall;
 
 static int khugepaged(void *none);
 static int khugepaged_slab_init(void);
@@ -2429,7 +2430,7 @@ static void collapse_huge_page(struct mm
 	struct page *new_page;
 	spinlock_t *pmd_ptl, *pte_ptl;
 	int isolated = 0, result = 0;
-	unsigned long hstart, hend;
+	unsigned long hstart, hend, swap, curr_allocstall;
 	struct mem_cgroup *memcg;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
@@ -2484,7 +2485,14 @@ static void collapse_huge_page(struct mm
 		goto out;
 	}
 
-	__collapse_huge_page_swapin(mm, vma, address, pmd);
+	swap = get_mm_counter(mm, MM_SWAPENTS);
+	curr_allocstall = sum_vm_event(ALLOCSTALL);
+	/*
+	 * When system under pressure, don't swapin readahead.
+	 * So that avoid unnecessary resource consuming.
+	 */
+	if (allocstall == curr_allocstall && swap != 0)
+		__collapse_huge_page_swapin(mm, vma, address, pmd);
 
 	anon_vma_lock_write(vma->anon_vma);
 
@@ -2878,14 +2886,17 @@ static void khugepaged_wait_work(void)
 		if (!khugepaged_scan_sleep_millisecs)
 			return;
 
+		allocstall = sum_vm_event(ALLOCSTALL);
 		wait_event_freezable_timeout(khugepaged_wait,
 					     kthread_should_stop(),
 			msecs_to_jiffies(khugepaged_scan_sleep_millisecs));
 		return;
 	}
 
-	if (khugepaged_enabled())
+	if (khugepaged_enabled()) {
+		allocstall = sum_vm_event(ALLOCSTALL);
 		wait_event_freezable(khugepaged_wait, khugepaged_wait_event());
+	}
 }
 
 static int khugepaged(void *none)
@@ -2894,6 +2905,7 @@ static int khugepaged(void *none)
 
 	set_freezable();
 	set_user_nice(current, MAX_NICE);
+	allocstall = sum_vm_event(ALLOCSTALL);
 
 	while (!kthread_should_stop()) {
 		khugepaged_do_scan();
_

Patches currently in -mm which might be from ebru.akagunduz@gmail.com are

mm-make-optimistic-check-for-swapin-readahead.patch
mm-make-swapin-readahead-to-improve-thp-collapse-rate.patch
mm-vmstat-calculate-particular-vm-event.patch
mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch


* Re: + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree
  2016-04-27 21:17 + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree akpm
@ 2016-04-28 15:19 ` Michal Hocko
  2016-05-17  7:58   ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2016-04-28 15:19 UTC (permalink / raw)
  To: akpm
  Cc: ebru.akagunduz, aarcange, aneesh.kumar, boaz, gorcunov, hannes,
	hughd, iamjoonsoo.kim, kirill.shutemov, mgorman, n-horiguchi,
	riel, rientjes, vbabka, mm-commits, linux-mm

On Wed 27-04-16 14:17:20, Andrew Morton wrote:
[...]
> @@ -2484,7 +2485,14 @@ static void collapse_huge_page(struct mm
>  		goto out;
>  	}
>  
> -	__collapse_huge_page_swapin(mm, vma, address, pmd);
> +	swap = get_mm_counter(mm, MM_SWAPENTS);
> +	curr_allocstall = sum_vm_event(ALLOCSTALL);
> +	/*
> +	 * When system under pressure, don't swapin readahead.
> +	 * So that avoid unnecessary resource consuming.
> +	 */
> +	if (allocstall == curr_allocstall && swap != 0)
> +		__collapse_huge_page_swapin(mm, vma, address, pmd);
>  
>  	anon_vma_lock_write(vma->anon_vma);
>  

I have mentioned that before already but this seems like a rather weak
heuristic. Don't we really rather teach __collapse_huge_page_swapin
(resp. do_swap_page) to do optimistic GFP_NOWAIT allocations and
back off under the memory pressure?
-- 
Michal Hocko
SUSE Labs

* Re: + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree
  2016-04-28 15:19 ` Michal Hocko
@ 2016-05-17  7:58   ` Michal Hocko
  2016-05-17  9:02     ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2016-05-17  7:58 UTC (permalink / raw)
  To: akpm
  Cc: ebru.akagunduz, aarcange, aneesh.kumar, boaz, gorcunov, hannes,
	hughd, iamjoonsoo.kim, kirill.shutemov, mgorman, n-horiguchi,
	riel, rientjes, vbabka, mm-commits, linux-mm

On Thu 28-04-16 17:19:21, Michal Hocko wrote:
> On Wed 27-04-16 14:17:20, Andrew Morton wrote:
> [...]
> > @@ -2484,7 +2485,14 @@ static void collapse_huge_page(struct mm
> >  		goto out;
> >  	}
> >  
> > -	__collapse_huge_page_swapin(mm, vma, address, pmd);
> > +	swap = get_mm_counter(mm, MM_SWAPENTS);
> > +	curr_allocstall = sum_vm_event(ALLOCSTALL);
> > +	/*
> > +	 * When system under pressure, don't swapin readahead.
> > +	 * So that avoid unnecessary resource consuming.
> > +	 */
> > +	if (allocstall == curr_allocstall && swap != 0)
> > +		__collapse_huge_page_swapin(mm, vma, address, pmd);
> >  
> >  	anon_vma_lock_write(vma->anon_vma);
> >  
> 
> I have mentioned that before already but this seems like a rather weak
> heuristic. Don't we really rather teach __collapse_huge_page_swapin
> (resp. do_swap_page) to do optimistic GFP_NOWAIT allocations and
> back off under the memory pressure?

I gave it a try and it doesn't seem really bad. Untested and I might
have missed something really obvious but what do you think about this
approach rather than relying on ALLOCSTALL, which is a really weak
heuristic:
---
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 87f09dc986ab..1a4d4c807d92 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2389,7 +2389,8 @@ static void __collapse_huge_page_swapin(struct mm_struct *mm,
 		swapped_in++;
 		ret = do_swap_page(mm, vma, _address, pte, pmd,
 				   FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT,
-				   pteval);
+				   pteval,
+				   GFP_HIGHUSER_MOVABLE & ~__GFP_DIRECT_RECLAIM);
 		if (ret & VM_FAULT_ERROR) {
 			trace_mm_collapse_huge_page_swapin(mm, swapped_in, 0);
 			return;
diff --git a/mm/memory.c b/mm/memory.c
index d79c6db41502..f897ec89bd79 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2490,7 +2490,7 @@ EXPORT_SYMBOL(unmap_mapping_range);
  */
 int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long address, pte_t *page_table, pmd_t *pmd,
-		unsigned int flags, pte_t orig_pte)
+		unsigned int flags, pte_t orig_pte, gfp_t gfp_mask)
 {
 	spinlock_t *ptl;
 	struct page *page, *swapcache;
@@ -2519,8 +2519,7 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
 	page = lookup_swap_cache(entry);
 	if (!page) {
-		page = swapin_readahead(entry,
-					GFP_HIGHUSER_MOVABLE, vma, address);
+		page = swapin_readahead(entry, gfp_mask, vma, address);
 		if (!page) {
 			/*
 			 * Back out if somebody else faulted in this pte
@@ -2573,7 +2572,7 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto out_page;
 	}
 
-	if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg, false)) {
+	if (mem_cgroup_try_charge(page, mm, gfp_mask, &memcg, false)) {
 		ret = VM_FAULT_OOM;
 		goto out_page;
 	}
@@ -3349,7 +3348,7 @@ static int handle_pte_fault(struct mm_struct *mm,
 						flags, entry);
 		}
 		return do_swap_page(mm, vma, address,
-					pte, pmd, flags, entry);
+					pte, pmd, flags, entry, GFP_HIGHUSER_MOVABLE);
 	}
 
 	if (pte_protnone(entry))
-- 
Michal Hocko
SUSE Labs

* Re: + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree
  2016-05-17  7:58   ` Michal Hocko
@ 2016-05-17  9:02     ` Michal Hocko
  2016-05-17 11:31       ` Kirill A. Shutemov
  2016-05-19  5:00       ` Minchan Kim
  0 siblings, 2 replies; 17+ messages in thread
From: Michal Hocko @ 2016-05-17  9:02 UTC (permalink / raw)
  To: akpm
  Cc: ebru.akagunduz, aarcange, aneesh.kumar, boaz, gorcunov, hannes,
	hughd, iamjoonsoo.kim, kirill.shutemov, mgorman, n-horiguchi,
	riel, rientjes, vbabka, mm-commits, linux-mm

On Tue 17-05-16 09:58:15, Michal Hocko wrote:
> On Thu 28-04-16 17:19:21, Michal Hocko wrote:
> > On Wed 27-04-16 14:17:20, Andrew Morton wrote:
> > [...]
> > > @@ -2484,7 +2485,14 @@ static void collapse_huge_page(struct mm
> > >  		goto out;
> > >  	}
> > >  
> > > -	__collapse_huge_page_swapin(mm, vma, address, pmd);
> > > +	swap = get_mm_counter(mm, MM_SWAPENTS);
> > > +	curr_allocstall = sum_vm_event(ALLOCSTALL);
> > > +	/*
> > > +	 * When system under pressure, don't swapin readahead.
> > > +	 * So that avoid unnecessary resource consuming.
> > > +	 */
> > > +	if (allocstall == curr_allocstall && swap != 0)
> > > +		__collapse_huge_page_swapin(mm, vma, address, pmd);
> > >  
> > >  	anon_vma_lock_write(vma->anon_vma);
> > >  
> > 
> > I have mentioned that before already but this seems like a rather weak
> > heuristic. Don't we really rather teach __collapse_huge_page_swapin
> > (resp. do_swap_page) to do optimistic GFP_NOWAIT allocations and
> > back off under the memory pressure?
> 
> I gave it a try and it doesn't seem really bad. Untested and I might
> have missed something really obvious but what do you think about this
> approach rather than relying on ALLOCSTALL, which is a really weak
> heuristic:

Oops, forgot to add mm/internal.h to the git index
---
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 87f09dc986ab..1a4d4c807d92 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2389,7 +2389,8 @@ static void __collapse_huge_page_swapin(struct mm_struct *mm,
 		swapped_in++;
 		ret = do_swap_page(mm, vma, _address, pte, pmd,
 				   FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT,
-				   pteval);
+				   pteval,
+				   GFP_HIGHUSER_MOVABLE & ~__GFP_DIRECT_RECLAIM);
 		if (ret & VM_FAULT_ERROR) {
 			trace_mm_collapse_huge_page_swapin(mm, swapped_in, 0);
 			return;
diff --git a/mm/internal.h b/mm/internal.h
index b6ead95a0184..0b3cc643eced 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -37,7 +37,7 @@
 
 extern int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, pte_t *page_table, pmd_t *pmd,
-			unsigned int flags, pte_t orig_pte);
+			unsigned int flags, pte_t orig_pte, gfp_t gfp_mask);
 
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
diff --git a/mm/memory.c b/mm/memory.c
index d79c6db41502..f897ec89bd79 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2490,7 +2490,7 @@ EXPORT_SYMBOL(unmap_mapping_range);
  */
 int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long address, pte_t *page_table, pmd_t *pmd,
-		unsigned int flags, pte_t orig_pte)
+		unsigned int flags, pte_t orig_pte, gfp_t gfp_mask)
 {
 	spinlock_t *ptl;
 	struct page *page, *swapcache;
@@ -2519,8 +2519,7 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
 	page = lookup_swap_cache(entry);
 	if (!page) {
-		page = swapin_readahead(entry,
-					GFP_HIGHUSER_MOVABLE, vma, address);
+		page = swapin_readahead(entry, gfp_mask, vma, address);
 		if (!page) {
 			/*
 			 * Back out if somebody else faulted in this pte
@@ -2573,7 +2572,7 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto out_page;
 	}
 
-	if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg, false)) {
+	if (mem_cgroup_try_charge(page, mm, gfp_mask, &memcg, false)) {
 		ret = VM_FAULT_OOM;
 		goto out_page;
 	}
@@ -3349,7 +3348,7 @@ static int handle_pte_fault(struct mm_struct *mm,
 						flags, entry);
 		}
 		return do_swap_page(mm, vma, address,
-					pte, pmd, flags, entry);
+					pte, pmd, flags, entry, GFP_HIGHUSER_MOVABLE);
 	}
 
 	if (pte_protnone(entry))
-- 
Michal Hocko
SUSE Labs

* Re: + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree
  2016-05-17  9:02     ` Michal Hocko
@ 2016-05-17 11:31       ` Kirill A. Shutemov
  2016-05-17 12:25         ` Michal Hocko
  2016-05-19  5:00       ` Minchan Kim
  1 sibling, 1 reply; 17+ messages in thread
From: Kirill A. Shutemov @ 2016-05-17 11:31 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, ebru.akagunduz, aarcange, aneesh.kumar, boaz, gorcunov,
	hannes, hughd, iamjoonsoo.kim, kirill.shutemov, mgorman,
	n-horiguchi, riel, rientjes, vbabka, mm-commits, linux-mm

On Tue, May 17, 2016 at 11:02:54AM +0200, Michal Hocko wrote:
> On Tue 17-05-16 09:58:15, Michal Hocko wrote:
> > On Thu 28-04-16 17:19:21, Michal Hocko wrote:
> > > On Wed 27-04-16 14:17:20, Andrew Morton wrote:
> > > [...]
> > > > @@ -2484,7 +2485,14 @@ static void collapse_huge_page(struct mm
> > > >  		goto out;
> > > >  	}
> > > >  
> > > > -	__collapse_huge_page_swapin(mm, vma, address, pmd);
> > > > +	swap = get_mm_counter(mm, MM_SWAPENTS);
> > > > +	curr_allocstall = sum_vm_event(ALLOCSTALL);
> > > > +	/*
> > > > +	 * When system under pressure, don't swapin readahead.
> > > > +	 * So that avoid unnecessary resource consuming.
> > > > +	 */
> > > > +	if (allocstall == curr_allocstall && swap != 0)
> > > > +		__collapse_huge_page_swapin(mm, vma, address, pmd);
> > > >  
> > > >  	anon_vma_lock_write(vma->anon_vma);
> > > >  
> > > 
> > > I have mentioned that before already but this seems like a rather weak
> > > heuristic. Don't we really rather teach __collapse_huge_page_swapin
> > > (resp. do_swap_page) to do optimistic GFP_NOWAIT allocations and
> > > back off under the memory pressure?
> > 
> > I gave it a try and it doesn't seem really bad. Untested and I might
> > have missed something really obvious but what do you think about this
> > approach rather than relying on ALLOCSTALL, which is a really weak
> > heuristic:
> 
> Oops, forgot to add mm/internal.h to the git index
> ---
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 87f09dc986ab..1a4d4c807d92 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2389,7 +2389,8 @@ static void __collapse_huge_page_swapin(struct mm_struct *mm,
>  		swapped_in++;
>  		ret = do_swap_page(mm, vma, _address, pte, pmd,
>  				   FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT,
> -				   pteval);
> +				   pteval,
> +				   GFP_HIGHUSER_MOVABLE & ~__GFP_DIRECT_RECLAIM);

Why only direct reclaim? I'm not sure if triggering kswapd is justified
for swapin. Maybe ~__GFP_RECLAIM?
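
(For reference, a sketch of what the two masks mean, assuming the 4.6-era
include/linux/gfp.h definitions; paraphrased, not a literal copy of the
header:)

/* __GFP_RECLAIM covers both reclaim modes: */
#define __GFP_RECLAIM	(___GFP_DIRECT_RECLAIM | ___GFP_KSWAPD_RECLAIM)

gfp_t a = GFP_HIGHUSER_MOVABLE & ~__GFP_DIRECT_RECLAIM; /* still wakes kswapd */
gfp_t b = GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM;	/* no reclaim at all */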

That said, I like the approach. The ALLOCSTALL approach has a locking issue[1].

[1] http://lkml.kernel.org/r/20160505013245.GB10429@yexl-desktop

-- 
 Kirill A. Shutemov

* Re: + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree
  2016-05-17 11:31       ` Kirill A. Shutemov
@ 2016-05-17 12:25         ` Michal Hocko
  0 siblings, 0 replies; 17+ messages in thread
From: Michal Hocko @ 2016-05-17 12:25 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: akpm, ebru.akagunduz, aarcange, aneesh.kumar, boaz, gorcunov,
	hannes, hughd, iamjoonsoo.kim, kirill.shutemov, mgorman,
	n-horiguchi, riel, rientjes, vbabka, mm-commits, linux-mm

On Tue 17-05-16 14:31:14, Kirill A. Shutemov wrote:
> On Tue, May 17, 2016 at 11:02:54AM +0200, Michal Hocko wrote:
[...]
> > ---
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 87f09dc986ab..1a4d4c807d92 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2389,7 +2389,8 @@ static void __collapse_huge_page_swapin(struct mm_struct *mm,
> >  		swapped_in++;
> >  		ret = do_swap_page(mm, vma, _address, pte, pmd,
> >  				   FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT,
> > -				   pteval);
> > +				   pteval,
> > +				   GFP_HIGHUSER_MOVABLE & ~__GFP_DIRECT_RECLAIM);
> 
> Why only direct reclaim? I'm not sure if triggering kswapd is justified
> for swapin. Maybe ~__GFP_RECLAIM?

Dunno, skipping kswapd sounds like we might consume a lot of memory
and the next request might hit the direct reclaim too early. But this is
only a speculation. This has to be measured properly and evaluated
against real usecases.

-- 
Michal Hocko
SUSE Labs

* Re: + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree
  2016-05-17  9:02     ` Michal Hocko
  2016-05-17 11:31       ` Kirill A. Shutemov
@ 2016-05-19  5:00       ` Minchan Kim
  2016-05-19  7:03         ` Michal Hocko
  1 sibling, 1 reply; 17+ messages in thread
From: Minchan Kim @ 2016-05-19  5:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, ebru.akagunduz, aarcange, aneesh.kumar, boaz, gorcunov,
	hannes, hughd, iamjoonsoo.kim, kirill.shutemov, mgorman,
	n-horiguchi, riel, rientjes, vbabka, mm-commits, linux-mm

On Tue, May 17, 2016 at 11:02:54AM +0200, Michal Hocko wrote:
> On Tue 17-05-16 09:58:15, Michal Hocko wrote:
> > On Thu 28-04-16 17:19:21, Michal Hocko wrote:
> > > On Wed 27-04-16 14:17:20, Andrew Morton wrote:
> > > [...]
> > > > @@ -2484,7 +2485,14 @@ static void collapse_huge_page(struct mm
> > > >  		goto out;
> > > >  	}
> > > >  
> > > > -	__collapse_huge_page_swapin(mm, vma, address, pmd);
> > > > +	swap = get_mm_counter(mm, MM_SWAPENTS);
> > > > +	curr_allocstall = sum_vm_event(ALLOCSTALL);
> > > > +	/*
> > > > +	 * When system under pressure, don't swapin readahead.
> > > > +	 * So that avoid unnecessary resource consuming.
> > > > +	 */
> > > > +	if (allocstall == curr_allocstall && swap != 0)
> > > > +		__collapse_huge_page_swapin(mm, vma, address, pmd);
> > > >  
> > > >  	anon_vma_lock_write(vma->anon_vma);
> > > >  
> > > 
> > > I have mentioned that before already but this seems like a rather weak
> > > heuristic. Don't we really rather teach __collapse_huge_page_swapin
> > > (resp. do_swap_page) to do optimistic GFP_NOWAIT allocations and
> > > back off under the memory pressure?
> > 
> > I gave it a try and it doesn't seem really bad. Untested and I might
> > have missed something really obvious but what do you think about this
> > approach rather than relying on ALLOCSTALL, which is a really weak
> > heuristic:

I like this approach rather than playing with allocstall diff of vmevent
which can be disabled in some configuration and it's not a good indicator
to represent current memory pressure situation.

However, I agree with Rik's requirement which doesn't want to turn over
page cache for collapsing THP page via swapin. So, your suggestion cannot
prevent it because khugepaged can consume memory through this swapin
operation continuously while kswapd is doing aging of LRU list in parallel.
IOW, fluctuation between HIGH and LOW watermark.

So, how about using waitqueue_active(&pgdat->kswapd_wait) to detect
current memory pressure? So if kswapd is active, we could avoid swapin
for THP collapsing.
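
(A hypothetical sketch of that check in collapse_huge_page(), ignoring NUMA
placement; kswapd sleeps on kswapd_wait while idle, so an active waitqueue
means there is currently no reclaim in progress on the node:)

pg_data_t *pgdat = NODE_DATA(numa_node_id());

/* only swap in while kswapd is asleep, i.e. no memory pressure */
if (waitqueue_active(&pgdat->kswapd_wait))
	__collapse_huge_page_swapin(mm, vma, address, pmd);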


* Re: + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree
  2016-05-19  5:00       ` Minchan Kim
@ 2016-05-19  7:03         ` Michal Hocko
  2016-05-19  7:27           ` Minchan Kim
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2016-05-19  7:03 UTC (permalink / raw)
  To: Minchan Kim
  Cc: akpm, ebru.akagunduz, aarcange, aneesh.kumar, boaz, gorcunov,
	hannes, hughd, iamjoonsoo.kim, kirill.shutemov, mgorman,
	n-horiguchi, riel, rientjes, vbabka, mm-commits, linux-mm

On Thu 19-05-16 14:00:38, Minchan Kim wrote:
> On Tue, May 17, 2016 at 11:02:54AM +0200, Michal Hocko wrote:
> > On Tue 17-05-16 09:58:15, Michal Hocko wrote:
> > > On Thu 28-04-16 17:19:21, Michal Hocko wrote:
> > > > On Wed 27-04-16 14:17:20, Andrew Morton wrote:
> > > > [...]
> > > > > @@ -2484,7 +2485,14 @@ static void collapse_huge_page(struct mm
> > > > >  		goto out;
> > > > >  	}
> > > > >  
> > > > > -	__collapse_huge_page_swapin(mm, vma, address, pmd);
> > > > > +	swap = get_mm_counter(mm, MM_SWAPENTS);
> > > > > +	curr_allocstall = sum_vm_event(ALLOCSTALL);
> > > > > +	/*
> > > > > +	 * When system under pressure, don't swapin readahead.
> > > > > +	 * So that avoid unnecessary resource consuming.
> > > > > +	 */
> > > > > +	if (allocstall == curr_allocstall && swap != 0)
> > > > > +		__collapse_huge_page_swapin(mm, vma, address, pmd);
> > > > >  
> > > > >  	anon_vma_lock_write(vma->anon_vma);
> > > > >  
> > > > 
> > > > I have mentioned that before already but this seems like a rather weak
> > > > heuristic. Don't we really rather teach __collapse_huge_page_swapin
> > > > (resp. do_swap_page) to do optimistic GFP_NOWAIT allocations and
> > > > back off under the memory pressure?
> > > 
> > > I gave it a try and it doesn't seem really bad. Untested and I might
> > > have missed something really obvious but what do you think about this
> > > approach rather than relying on ALLOCSTALL, which is a really weak
> > > heuristic:
> 
> I like this approach rather than playing with allocstall diff of vmevent
> which can be disabled in some configuration and it's not a good indicator
> to represent current memory pressure situation.

Not only that it won't work for e.g. memcg configurations because we
would end up reclaiming that memcg as the gfp mask tells us to do so and
ALLOCSTALL would be quiet about that.

> However, I agree with Rik's requirement which doesn't want to turn over
> page cache for collapsing THP page via swapin. So, your suggestion cannot
> prevent it because khugepaged can consume memory through this swapin
> operation continuously while kswapd is doing aging of LRU list in parallel.
> IOW, fluctuation between HIGH and LOW watermark.

I am not sure this is actually a problem. We have other sources of
opportunistic allocations with some fallback and those wake up kswapd
(they only clear __GFP_DIRECT_RECLAIM). Also this swapin should happen
only when a certain portion of the huge page is already populated so
it won't happen all the time and sounds like we would benefit from the
reclaimed page cache in favor of the THP.
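
(An illustrative example of that pattern, not taken from any particular
caller: try the expensive allocation without direct reclaim, which may
still wake kswapd, and fall back to order-0 on failure:)

struct page *page;

/* opportunistic attempt: fail fast instead of stalling in direct reclaim */
page = alloc_pages((GFP_KERNEL | __GFP_NOWARN) & ~__GFP_DIRECT_RECLAIM, order);
if (!page)
	page = alloc_pages(GFP_KERNEL, 0);	/* plain 4K fallback */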

> So, how about using waitqueue_active(&pgdat->kswapd_wait) to detect
> current memory pressure? So if kswapd is active, we could avoid swapin
> for THP collapsing.

Dunno, this sounds quite arbitrary. And I am even not sure this all
optimistic swap in is a huge win to be honest. I just really hate the
ALLOCSTALL heuristic because it simply doesn't work.
-- 
Michal Hocko
SUSE Labs

* Re: + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree
  2016-05-19  7:03         ` Michal Hocko
@ 2016-05-19  7:27           ` Minchan Kim
  2016-05-19  7:39             ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Minchan Kim @ 2016-05-19  7:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, ebru.akagunduz, aarcange, aneesh.kumar, boaz, gorcunov,
	hannes, hughd, iamjoonsoo.kim, kirill.shutemov, mgorman,
	n-horiguchi, riel, rientjes, vbabka, mm-commits, linux-mm

On Thu, May 19, 2016 at 09:03:57AM +0200, Michal Hocko wrote:
> On Thu 19-05-16 14:00:38, Minchan Kim wrote:
> > On Tue, May 17, 2016 at 11:02:54AM +0200, Michal Hocko wrote:
> > > On Tue 17-05-16 09:58:15, Michal Hocko wrote:
> > > > On Thu 28-04-16 17:19:21, Michal Hocko wrote:
> > > > > On Wed 27-04-16 14:17:20, Andrew Morton wrote:
> > > > > [...]
> > > > > > @@ -2484,7 +2485,14 @@ static void collapse_huge_page(struct mm
> > > > > >  		goto out;
> > > > > >  	}
> > > > > >  
> > > > > > -	__collapse_huge_page_swapin(mm, vma, address, pmd);
> > > > > > +	swap = get_mm_counter(mm, MM_SWAPENTS);
> > > > > > +	curr_allocstall = sum_vm_event(ALLOCSTALL);
> > > > > > +	/*
> > > > > > +	 * When system under pressure, don't swapin readahead.
> > > > > > +	 * So that avoid unnecessary resource consuming.
> > > > > > +	 */
> > > > > > +	if (allocstall == curr_allocstall && swap != 0)
> > > > > > +		__collapse_huge_page_swapin(mm, vma, address, pmd);
> > > > > >  
> > > > > >  	anon_vma_lock_write(vma->anon_vma);
> > > > > >  
> > > > > 
> > > > > I have mentioned that before already but this seems like a rather weak
> > > > > heuristic. Don't we really rather teach __collapse_huge_page_swapin
> > > > > (resp. do_swap_page) to do optimistic GFP_NOWAIT allocations and
> > > > > back off under the memory pressure?
> > > > 
> > > > I gave it a try and it doesn't seem really bad. Untested and I might
> > > > have missed something really obvious but what do you think about this
> > > > approach rather than relying on ALLOCSTALL, which is a really weak
> > > > heuristic:
> > 
> > I like this approach rather than playing with allocstall diff of vmevent
> > which can be disabled in some configuration and it's not a good indicator
> > to represent current memory pressure situation.
> 
> Not only that it won't work for e.g. memcg configurations because we
> would end up reclaiming that memcg as the gfp mask tells us to do so and
> ALLOCSTALL would be quiet about that.

Right you are. I didn't consider memcg. Thanks for pointing out.

> 
> > However, I agree with Rik's requirement which doesn't want to turn over
> > page cache for collapsing THP page via swapin. So, your suggestion cannot
> > prevent it because khugepaged can consume memory through this swapin
> > operation continuously while kswapd is doing aging of LRU list in parallel.
> > IOW, fluctuation between HIGH and LOW watermark.
> 
> I am not sure this is actually a problem. We have other sources of
> opportunistic allocations with some fallback and those wake up kswapd
> (they only clear __GFP_DIRECT_RECLAIM). Also this swapin should happen
> only when a certain portion of the huge page is already populated so

I can't find any logic you mentioned "a certain portion of the huge page
is already populated" in next-20160517. What am I missing now?

> it won't happen all the time and sounds like we would benefit from the
> reclaimed page cache in favor of the THP.

It depends on storage speed. If a page is swapped out, it means it's not a
workingset, so we might read a cold page at the cost of evicting a warm page.
Additionally, if the huge page was swapped out, it is likely to swap out
again because it's not a hot * 512 page. For those pages, shouldn't we
evict page cache? I think it's not a good tradeoff.

> 
> > So, how about using waitqueue_active(&pgdat->kswapd_wait) to detect
> > current memory pressure? So if kswapd is active, we could avoid swapin
> > for THP collapsing.
> 
> Dunno, this sounds quite arbitrary. And I am even not sure this all
> optimistic swap in is a huge win to be honest. I just really hate the
> ALLOCSTALL heuristic because it simply doesn't work.

Agree.
 
* Re: + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree
  2016-05-19  7:27           ` Minchan Kim
@ 2016-05-19  7:39             ` Michal Hocko
  2016-05-20  0:21               ` Minchan Kim
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2016-05-19  7:39 UTC (permalink / raw)
  To: Minchan Kim
  Cc: akpm, ebru.akagunduz, aarcange, aneesh.kumar, boaz, gorcunov,
	hannes, hughd, iamjoonsoo.kim, kirill.shutemov, mgorman,
	n-horiguchi, riel, rientjes, vbabka, mm-commits, linux-mm

On Thu 19-05-16 16:27:51, Minchan Kim wrote:
> On Thu, May 19, 2016 at 09:03:57AM +0200, Michal Hocko wrote:
> > On Thu 19-05-16 14:00:38, Minchan Kim wrote:
> > > On Tue, May 17, 2016 at 11:02:54AM +0200, Michal Hocko wrote:
> > > > On Tue 17-05-16 09:58:15, Michal Hocko wrote:
> > > > > On Thu 28-04-16 17:19:21, Michal Hocko wrote:
> > > > > > On Wed 27-04-16 14:17:20, Andrew Morton wrote:
> > > > > > [...]
> > > > > > > @@ -2484,7 +2485,14 @@ static void collapse_huge_page(struct mm
> > > > > > >  		goto out;
> > > > > > >  	}
> > > > > > >  
> > > > > > > -	__collapse_huge_page_swapin(mm, vma, address, pmd);
> > > > > > > +	swap = get_mm_counter(mm, MM_SWAPENTS);
> > > > > > > +	curr_allocstall = sum_vm_event(ALLOCSTALL);
> > > > > > > +	/*
> > > > > > > +	 * When system under pressure, don't swapin readahead.
> > > > > > > +	 * So that avoid unnecessary resource consuming.
> > > > > > > +	 */
> > > > > > > +	if (allocstall == curr_allocstall && swap != 0)
> > > > > > > +		__collapse_huge_page_swapin(mm, vma, address, pmd);
> > > > > > >  
> > > > > > >  	anon_vma_lock_write(vma->anon_vma);
> > > > > > >  
> > > > > > 
> > > > > > I have mentioned that before already but this seems like a rather weak
> > > > > > heuristic. Don't we really rather teach __collapse_huge_page_swapin
> > > > > > (resp. do_swap_page) to do optimistic GFP_NOWAIT allocations and
> > > > > > back off under the memory pressure?
> > > > > 
> > > > > I gave it a try and it doesn't seem really bad. Untested and I might
> > > > > have missed something really obvious but what do you think about this
> > > > > approach rather than relying on ALLOCSTALL, which is a really weak
> > > > > heuristic:
> > > 
> > > I like this approach rather than playing with allocstall diff of vmevent
> > > which can be disabled in some configuration and it's not a good indicator
> > > to represent current memory pressure situation.
> > 
> > Not only that it won't work for e.g. memcg configurations because we
> > would end up reclaiming that memcg as the gfp mask tells us to do so and
> > ALLOCSTALL would be quiet about that.
> 
> Right you are. I didn't consider memcg. Thanks for pointing out.
> 
> > 
> > > However, I agree with Rik's requirement which doesn't want to turn over
> > > page cache for collapsing THP page via swapin. So, your suggestion cannot
> > > prevent it because khugepaged can consume memory through this swapin
> > > operation continuously while kswapd is doing aging of LRU list in parallel.
> > > IOW, fluctuation between HIGH and LOW watermark.
> > 
> > I am not sure this is actually a problem. We have other sources of
> > opportunistic allocations with some fallback and those wake up kswapd
> > (they only clear __GFP_DIRECT_RECLAIM). Also this swapin should happen
> > only when a certain portion of the huge page is already populated so
> 
> I can't find any logic you mentioned "a certain portion of the huge page
> is already populated" in next-20160517. What am I missing now?

khugepaged_max_ptes_swap. I didn't look closer but from a quick glance
this is the threshold for the optimistic swapin.

> > it won't happen all the time and sounds like we would benefit from the
> > reclaimed page cache in favor of the THP.
> 
> It depends on storage speed. If a page is swapped out, it means it's not a
> workingset, so we might read a cold page at the cost of evicting a warm page.
> Additionally, if the huge page was swapped out, it is likely to swap out
> again because it's not a hot * 512 page. For those pages, shouldn't we
> evict page cache? I think it's not a good tradeoff.

This is exactly the problem of the optimistic THP swapin. We just do
not know whether it is worth it. But I guess that a reasonable threshold
would solve this. It is really ineffective to keep small pages when only
a few holes are swapped out (for whatever reasons). HPAGE_PMD_NR/8, which
we use right now, is not documented, but I guess 64 pages sounds like a
reasonable value which shouldn't cause way too much reclaim.
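
(For the arithmetic: on x86-64 with 4K base pages, HPAGE_PMD_NR =
HPAGE_PMD_SIZE / PAGE_SIZE = 2MB / 4KB = 512, so HPAGE_PMD_NR/8 = 64, i.e.
at most 64 of the 512 ptes under a pmd may be swapped out for the collapse
to proceed.)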

-- 
Michal Hocko
SUSE Labs

* Re: + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree
  2016-05-19  7:39             ` Michal Hocko
@ 2016-05-20  0:21               ` Minchan Kim
  2016-05-20  6:39                 ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Minchan Kim @ 2016-05-20  0:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, ebru.akagunduz, aarcange, aneesh.kumar, boaz, gorcunov,
	hannes, hughd, iamjoonsoo.kim, kirill.shutemov, mgorman,
	n-horiguchi, riel, rientjes, vbabka, mm-commits, linux-mm

On Thu, May 19, 2016 at 09:39:57AM +0200, Michal Hocko wrote:
> On Thu 19-05-16 16:27:51, Minchan Kim wrote:
> > On Thu, May 19, 2016 at 09:03:57AM +0200, Michal Hocko wrote:
> > > On Thu 19-05-16 14:00:38, Minchan Kim wrote:
> > > > On Tue, May 17, 2016 at 11:02:54AM +0200, Michal Hocko wrote:
> > > > > On Tue 17-05-16 09:58:15, Michal Hocko wrote:
> > > > > > On Thu 28-04-16 17:19:21, Michal Hocko wrote:
> > > > > > > On Wed 27-04-16 14:17:20, Andrew Morton wrote:
> > > > > > > [...]
> > > > > > > > @@ -2484,7 +2485,14 @@ static void collapse_huge_page(struct mm
> > > > > > > >  		goto out;
> > > > > > > >  	}
> > > > > > > >  
> > > > > > > > -	__collapse_huge_page_swapin(mm, vma, address, pmd);
> > > > > > > > +	swap = get_mm_counter(mm, MM_SWAPENTS);
> > > > > > > > +	curr_allocstall = sum_vm_event(ALLOCSTALL);
> > > > > > > > +	/*
> > > > > > > > +	 * When system under pressure, don't swapin readahead.
> > > > > > > > +	 * So that avoid unnecessary resource consuming.
> > > > > > > > +	 */
> > > > > > > > +	if (allocstall == curr_allocstall && swap != 0)
> > > > > > > > +		__collapse_huge_page_swapin(mm, vma, address, pmd);
> > > > > > > >  
> > > > > > > >  	anon_vma_lock_write(vma->anon_vma);
> > > > > > > >  
> > > > > > > 
> > > > > > > I have mentioned that before already but this seems like a rather weak
> > > > > > > heuristic. Don't we really rather teach __collapse_huge_page_swapin
> > > > > > > (resp. do_swap_page) to do optimistic GFP_NOWAIT allocations and
> > > > > > > back off under the memory pressure?
> > > > > > 
> > > > > > I gave it a try and it doesn't seem really bad. Untested and I might
> > > > > > have missed something really obvious but what do you think about this
> > > > > > approach rather than relying on ALLOCSTALL, which is a really weak
> > > > > > heuristic:
> > > > 
> > > > I like this approach rather than playing with allocstall diff of vmevent
> > > > which can be disabled in some configuration and it's not a good indicator
> > > > to represent current memory pressure situation.
> > > 
> > > Not only that it won't work for e.g. memcg configurations because we
> > > would end up reclaiming that memcg as the gfp mask tells us to do so and
> > > ALLOCSTALL would be quiet about that.
> > 
> > Right you are. I didn't consider memcg. Thanks for pointing out.
> > 
> > > 
> > > > However, I agree with Rik's requirement which doesn't want to turn over
> > > > page cache for collapsing THP page via swapin. So, your suggestion cannot
> > > > prevent it because khugepaged can consume memory through this swapin
> > > > operation continuously while kswapd is doing aging of LRU list in parallel.
> > > > IOW, fluctuation between HIGH and LOW watermark.
> > > 
> > > I am not sure this is actually a problem. We have other sources of
> > > opportunistic allocations with some fallback and those wake up kswapd
> > > (they only clear __GFP_DIRECT_RECLAIM). Also this swapin should happen
> > > only when a certain portion of the huge page is already populated so
> > 
> > I can't find any logic you mentioned "a certain portion of the huge page
> > is already populated" in next-20160517. What am I missing now?
> 
> khugepaged_max_ptes_swap. I didn't look closer but from a quick glance
> this is the threshold for the optimistic swapin.

Thanks. I see it now.

> 
> > > it won't happen all the time and sounds like we would benefit from the
> > > reclaimed page cache in favor of the THP.
> > 
> > It depends on storage speed. If a page is swapped out, it means it's not a
> > workingset, so we might read a cold page at the cost of evicting a warm page.
> > Additionally, if the huge page was swapped out, it is likely to swap out
> > again because it's not a hot * 512 page. For those pages, shouldn't we
> > evict page cache? I think it's not a good tradeoff.
> 
> This is exactly the problem of the optimistic THP swapin. We just do
> not know whether it is worth it. But I guess that a reasonable threshold
> would solve this. It is really ineffective to keep small pages when only
> a few holes are swapped out (for whatever reasons). HPAGE_PMD_NR/8, which
> we use right now, is not documented, but I guess 64 pages sounds like a
> reasonable value which shouldn't cause way too much reclaim.

I don't know what the best default value is. Anyway, I agree we should
introduce a threshold for collapsing THP through swapin I/O.

I think other important thing we should consider is how the THP page is likely
to be hot without split in a short time like KSM is doing double checking to
merge stable page. Of course, it wouldn't be easy to implement, but I think
algorithm is based on such *hotness* basically and then consider the number
of max_swap_ptes. IOW, I think we should approach more conservative rather
than optimistic because a page I/O overhead by wrong choice could be bigger
than benefit of a few TLB hit.
If we approach that way, maybe we don't need to detect memory pressure.

For that, how about raising the bar for swapin allowance?
I mean, now we allow swapin

from
        64 pages below swap ptes and 1 page young in 512 ptes 
to
        64 pages below swap ptes and 256 page young in 512 ptes 
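
(A hypothetical sketch of such a stricter gate in khugepaged_scan_pmd();
the `unmapped' counter checked against khugepaged_max_ptes_swap exists in
this series, while the `referenced' threshold below is invented for
illustration:)

/* bail out unless the region is mostly hot and only lightly swapped */
if (unmapped > khugepaged_max_ptes_swap ||
    referenced < HPAGE_PMD_NR / 2)	/* want 256 of 512 ptes young */
	goto out_unmap;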

* Re: + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree
  2016-05-20  0:21               ` Minchan Kim
@ 2016-05-20  6:39                 ` Michal Hocko
  2016-05-20  7:26                   ` Minchan Kim
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2016-05-20  6:39 UTC (permalink / raw)
  To: Minchan Kim
  Cc: akpm, ebru.akagunduz, aarcange, aneesh.kumar, boaz, gorcunov,
	hannes, hughd, iamjoonsoo.kim, kirill.shutemov, mgorman,
	n-horiguchi, riel, rientjes, vbabka, mm-commits, linux-mm

On Fri 20-05-16 09:21:55, Minchan Kim wrote:
[...]
> I think other important thing we should consider is how the THP page is likely
> to be hot without split in a short time like KSM is doing double checking to
> merge stable page. Of course, it wouldn't be easy to implement, but I think
> algorithm is based on such *hotness* basically and then consider the number
> of max_swap_ptes. IOW, I think we should approach more conservative rather
> than optimistic because a page I/O overhead by wrong choice could be bigger
> than benefit of a few TLB hit.
> If we approach that way, maybe we don't need to detect memory pressure.
> 
> For that, how about raising the bar for swapin allowance?
> I mean, now we allow swapin
> 
> from
>         64 pages below swap ptes and 1 page young in 512 ptes 
> to
>         64 pages below swap ptes and 256 page young in 512 ptes 

I agree that the current 1 page threshold for collapsing is way too
optimistic. Same as the defaults we had for THP faulting at page fault
time, which have caused many issues. So I would be all for changing it. I do
not have good benchmarks to back any "good" number unfortunately. So
such a change would be quite arbitrary based on feeling... If you have
some workload where collapsing THP pages causes some real issues that
would be great justification though.

That being said khugepaged_max_ptes_none = HPAGE_PMD_NR/2 sounds like a
good start to me. Whether all of the present pages have to be young I am
not so sure.

-- 
Michal Hocko
SUSE Labs

* Re: + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree
  2016-05-20  6:39                 ` Michal Hocko
@ 2016-05-20  7:26                   ` Minchan Kim
  2016-05-20  7:34                     ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Minchan Kim @ 2016-05-20  7:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, ebru.akagunduz, aarcange, aneesh.kumar, boaz, gorcunov,
	hannes, hughd, iamjoonsoo.kim, kirill.shutemov, mgorman,
	n-horiguchi, riel, rientjes, vbabka, mm-commits, linux-mm

On Fri, May 20, 2016 at 08:39:17AM +0200, Michal Hocko wrote:
> On Fri 20-05-16 09:21:55, Minchan Kim wrote:
> [...]
> > I think other important thing we should consider is how the THP page is likely
> > to be hot without split in a short time like KSM is doing double checking to
> > merge stable page. Of course, it wouldn't be easy to implement, but I think
> > algorithm is based on such *hotness* basically and then consider the number
> > of max_swap_ptes. IOW, I think we should approach more conservative rather
> > than optimistic because a page I/O overhead by wrong choice could be bigger
> > than benefit of a few TLB hit.
> > If we approach that way, maybe we don't need to detect memory pressure.
> > 
> > For that, how about raising the bar for swapin allowance?
> > I mean, now we allow swapin
> > 
> > from
> >         64 pages below swap ptes and 1 page young in 512 ptes 
> > to
> >         64 pages below swap ptes and 256 page young in 512 ptes 
> 
> I agree that the current 1 page threshold for collapsing is way too
> optimistic. Same as the defaults we had for THP faulting at page fault
> time, which have caused many issues. So I would be all for changing it. I do

I don't know whether we should change all of them, but if we change it for
THP faulting, I believe THP swapin should use a more conservative value than
THP faulting, because the cost of THP swapin collapsing would be heavier.

> not have good benchmarks to back any "good" number unfortunately. So
> such a change would be quite arbitrary based on feeling... If you have
> some workload where collapsing THP pages causes some real issues that
> would be great justification though.

I hope to, but our products have never turned THP on. :(
I just wanted to state the current problem and a suggestion so that a THP
guy can take an interest in it.
If it's not worth doing, simply ignore it.

> 
> That being said khugepaged_max_ptes_none = HPAGE_PMD_NR/2 sounds like a

max_ptes_none?
               
> good start to me. Whether all of the present pages have to be young I am
> not so sure.
> 

* Re: + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree
  2016-05-20  7:26                   ` Minchan Kim
@ 2016-05-20  7:34                     ` Michal Hocko
  2016-05-20  7:44                       ` Minchan Kim
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2016-05-20  7:34 UTC (permalink / raw)
  To: Minchan Kim
  Cc: akpm, ebru.akagunduz, aarcange, aneesh.kumar, boaz, gorcunov,
	hannes, hughd, iamjoonsoo.kim, kirill.shutemov, mgorman,
	n-horiguchi, riel, rientjes, vbabka, mm-commits, linux-mm

On Fri 20-05-16 16:26:24, Minchan Kim wrote:
> On Fri, May 20, 2016 at 08:39:17AM +0200, Michal Hocko wrote:
> > On Fri 20-05-16 09:21:55, Minchan Kim wrote:
> > [...]
> > > I think other important thing we should consider is how the THP page is likely
> > > to be hot without split in a short time like KSM is doing double checking to
> > > merge stable page. Of course, it wouldn't be easy to implement, but I think
> > > algorithm is based on such *hotness* basically and then consider the number
> > > of max_swap_ptes. IOW, I think we should approach more conservative rather
> > > than optimistic because a page I/O overhead by wrong choice could be bigger
> > > than benefit of a few TLB hit.
> > > If we approach that way, maybe we don't need to detect memory pressure.
> > > 
> > > For that, how about raising the bar for swapin allowance?
> > > I mean, now we allow swapin
> > > 
> > > from
> > >         64 pages below swap ptes and 1 page young in 512 ptes 
> > > to
> > >         64 pages below swap ptes and 256 page young in 512 ptes 
> > 
> > I agree that the current 1 page threshold for collapsing is way too
> > optimistic. Same as the defaults we had for THP faulting at page fault
> > time, which have caused many issues. So I would be all for changing it. I do
> 
> I don't know whether we should change all of them, but if we change it for
> THP faulting, I believe THP swapin should use a more conservative value than
> THP faulting, because the cost of THP swapin collapsing would be heavier.
> 
> > not have good benchmarks to back any "good" number unfortunately. So
> > such a change would be quite arbitrary based on feeling... If you have
> > some workload where collapsing THP pages causes some real issues that
> > would be great justification though.
> 
> I hope to, but our products have never turned THP on. :(
> I just wanted to state the current problem and a suggestion so that a THP
> guy can take an interest in it.
> If it's not worth doing, simply ignore it.

I guess this is worth changing; I am just not sure about the
justification.
 
> > That being said khugepaged_max_ptes_none = HPAGE_PMD_NR/2 sounds like a
> 
> max_ptes_none?

Not sure I understand what you mean here.
-- 
Michal Hocko
SUSE Labs

* Re: + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree
  2016-05-20  7:34                     ` Michal Hocko
@ 2016-05-20  7:44                       ` Minchan Kim
  2016-05-20  8:02                         ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Minchan Kim @ 2016-05-20  7:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, ebru.akagunduz, aarcange, aneesh.kumar, boaz, gorcunov,
	hannes, hughd, iamjoonsoo.kim, kirill.shutemov, mgorman,
	n-horiguchi, riel, rientjes, vbabka, mm-commits, linux-mm

On Fri, May 20, 2016 at 09:34:32AM +0200, Michal Hocko wrote:
> On Fri 20-05-16 16:26:24, Minchan Kim wrote:
> > On Fri, May 20, 2016 at 08:39:17AM +0200, Michal Hocko wrote:
> > > On Fri 20-05-16 09:21:55, Minchan Kim wrote:
> > > [...]
> > > > I think other important thing we should consider is how the THP page is likely
> > > > to be hot without split in a short time like KSM is doing double checking to
> > > > merge stable page. Of course, it wouldn't be easy to implement, but I think
> > > > algorithm is based on such *hotness* basically and then consider the number
> > > > of max_swap_ptes. IOW, I think we should approach more conservative rather
> > > > than optimistic because a page I/O overhead by wrong choice could be bigger
> > > > than benefit of a few TLB hit.
> > > > If we approach that way, maybe we don't need to detect memory pressure.
> > > > 
> > > > For that, how about raising the bar for swapin allowance?
> > > > I mean, now we allow swapin
> > > > 
> > > > from
> > > >         64 pages below swap ptes and 1 page young in 512 ptes 
> > > > to
> > > >         64 pages below swap ptes and 256 page young in 512 ptes 
> > > 
> > > I agree that the current 1 page threshold for collapsing is way too
> > > optimistic. Same as the defaults we had for THP faulting at page fault
> > > time, which have caused many issues. So I would be all for changing it. I do
> > 
> > I don't know whether we should change all of them, but if we change it for
> > THP faulting, I believe THP swapin should use a more conservative value than
> > THP faulting, because the cost of THP swapin collapsing would be heavier.
> > 
> > > not have good benchmarks to back any "good" number unfortunately. So
> > > such a change would be quite arbitrary based on feeling... If you have
> > > some workload where collapsing THP pages causes some real issues that
> > > would be great justification though.
> > 
> > I hope to, but our products have never turned THP on. :(
> > I just wanted to state the current problem and a suggestion so that a THP
> > guy can take an interest in it.
> > If it's not worth doing, simply ignore it.
> 
> I guess this is worth changing; I am just not sure about the
> justification.

I agree with you about THP faulting, which has been there for several
years, so changing it without justification has risk. However, swapin
collapsing is a new feature, so it might be worth it for the feature.

>  
> > > That being said khugepaged_max_ptes_none = HPAGE_PMD_NR/2 sounds like a
> > 
> > max_ptes_none?
> 
> Not sure I understand what you mean here.

We are talking about max_ptes_swap and max_active_pages(i.e., pte_young)
but suddenly you are saying max_ptes_none, so I was curious whether it
was just a typo.


* Re: + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree
  2016-05-20  7:44                       ` Minchan Kim
@ 2016-05-20  8:02                         ` Michal Hocko
  2016-05-20  8:26                           ` Minchan Kim
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2016-05-20  8:02 UTC (permalink / raw)
  To: Minchan Kim
  Cc: akpm, ebru.akagunduz, aarcange, aneesh.kumar, boaz, gorcunov,
	hannes, hughd, iamjoonsoo.kim, kirill.shutemov, mgorman,
	n-horiguchi, riel, rientjes, vbabka, mm-commits, linux-mm

On Fri 20-05-16 16:44:50, Minchan Kim wrote:
> > > > That being said khugepaged_max_ptes_none = HPAGE_PMD_NR/2 sounds like a
> > > 
> > > max_ptes_none?
> > 
> > Not sure I understand what you mean here.
> 
> We are talking about max_ptes_swap and max_active_pages(i.e., pte_young)
> > but suddenly you are saying max_ptes_none, so I was curious whether it
> > was just a typo.

Because the default for pte_none resp. zero pages collapsing into THP is
khugepaged_max_ptes_none and the current default means that a single
present page is sufficient. That is way too optimistic. So I consider
this to be a good start. I am not so sure about minimum young pages
because that would probably require yet another tunable and we have more
than enough of them. Anyway I guess we are getting off-topic here...
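
(For reference, a sketch of the defaults in question, assuming the 4.6-era
mm/huge_memory.c; the HPAGE_PMD_NR/2 line is the suggested change, not
current code:)

/* current defaults: collapse may proceed with up to 511 empty ptes
 * (i.e. a single present page) and up to 64 swapped-out ptes */
khugepaged_max_ptes_none = HPAGE_PMD_NR - 1;	/* 511 */
khugepaged_max_ptes_swap = HPAGE_PMD_NR / 8;	/* 64 */

/* suggested, more conservative starting point */
khugepaged_max_ptes_none = HPAGE_PMD_NR / 2;	/* 256 */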
-- 
Michal Hocko
SUSE Labs

* Re: + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree
  2016-05-20  8:02                         ` Michal Hocko
@ 2016-05-20  8:26                           ` Minchan Kim
  0 siblings, 0 replies; 17+ messages in thread
From: Minchan Kim @ 2016-05-20  8:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, ebru.akagunduz, aarcange, aneesh.kumar, boaz, gorcunov,
	hannes, hughd, iamjoonsoo.kim, kirill.shutemov, mgorman,
	n-horiguchi, riel, rientjes, vbabka, mm-commits, linux-mm

On Fri, May 20, 2016 at 10:02:18AM +0200, Michal Hocko wrote:
> On Fri 20-05-16 16:44:50, Minchan Kim wrote:
> > > > > That being said khugepaged_max_ptes_none = HPAGE_PMD_NR/2 sounds like a
> > > > 
> > > > max_ptes_none?
> > > 
> > > Not sure I understand what you mean here.
> > 
> > We are talking about max_ptes_swap and max_active_pages(i.e., pte_young)
> > but suddenly you are saying max_ptes_none so I was curious it was just
> > typo.
> 
> Because the default for pte_none resp. zero pages collapsing into THP is
> khugepaged_max_ptes_none and the current default means that a single
> present page is sufficient. That is way too optimistic. So I consider
> this to be a good start. I am not so sure about minimum young pages
> because that would probably require yet another tunable and we have more
> than enough of them. Anyway I guess we are getting off-topic here...

Optimistic swapin collapsing

1. it could be too optimistic, losing the gain by evicting the workingset
2. let's detect memory pressure
3. the current allocstall magic is not a good idea
4. let's change the design from optimistic to conservative
5. how can we be conservative?
6. two things - detect hot pages and a threshold of swap ptes
7. the threshold of swap ptes is already done, so what remains is detecting hot pages
8. how to detect hot pages - let's use the young bit
9. now we are conservative, so we will swap in only when it's worth it
10. let's remove the allocstall magic

I think it's not off-topic.
Anyway, it's just my thought; I don't have a real workload or an objection.
Feel free to ignore.
