* + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree
@ 2016-04-27 21:17 akpm
  2016-04-28 15:19 ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: akpm @ 2016-04-27 21:17 UTC (permalink / raw)
  To: ebru.akagunduz, aarcange, aneesh.kumar, boaz, gorcunov, hannes,
      hughd, iamjoonsoo.kim, kirill.shutemov, mgorman, mhocko,
      n-horiguchi, riel, rientjes, vbabka, mm-commits


The patch titled
     Subject: mm, thp: avoid unnecessary swapin in khugepaged
has been added to the -mm tree.  Its filename is
     mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/SubmitChecklist when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Subject: mm, thp: avoid unnecessary swapin in khugepaged

Currently khugepaged does swapin readahead to improve the THP collapse
rate.  This patch checks vm statistics and skips the swapin work when it
is unnecessary, so that khugepaged neither consumes resources on swapin
nor triggers direct reclaim through swapin readahead while the system is
under memory pressure.

The patch was tested with a test program that allocates 800MB of memory,
writes to it, and then sleeps.  The system was forced to swap all of it
out.  Afterwards, the test program touches the area by writing again,
skipping one page in every 20 pages of the area.
While the test was waiting to swap the remaining part back in, memory was
deliberately kept busy doing page reclaim.  Although there was enough free
memory during the test, khugepaged did not perform swapin readahead because
of that reclaim activity.

Test results:

                        After swapped out
-------------------------------------------------------------------
              | Anonymous | AnonHugePages |    Swap   | Fraction |
-------------------------------------------------------------------
With patch    |    0 kB   |      0 kB     | 800000 kB |   100%   |
-------------------------------------------------------------------
Without patch |    0 kB   |      0 kB     | 800000 kB |   100%   |
-------------------------------------------------------------------

                        After swapped in
-------------------------------------------------------------------
              | Anonymous | AnonHugePages |    Swap   | Fraction |
-------------------------------------------------------------------
With patch    | 385120 kB |   102400 kB   | 414880 kB |    26%   |
-------------------------------------------------------------------
Without patch | 389728 kB |   194560 kB   | 410272 kB |    49%   |
-------------------------------------------------------------------

Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/huge_memory.c |   18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff -puN mm/huge_memory.c~mm-thp-avoid-unnecessary-swapin-in-khugepaged mm/huge_memory.c
--- a/mm/huge_memory.c~mm-thp-avoid-unnecessary-swapin-in-khugepaged
+++ a/mm/huge_memory.c
@@ -102,6 +102,7 @@ static DECLARE_WAIT_QUEUE_HEAD(khugepage
  */
 static unsigned int khugepaged_max_ptes_none __read_mostly;
 static unsigned int khugepaged_max_ptes_swap __read_mostly = HPAGE_PMD_NR/8;
+static unsigned long allocstall;

 static int khugepaged(void *none);
 static int khugepaged_slab_init(void);
@@ -2429,7 +2430,7 @@ static void collapse_huge_page(struct mm
 	struct page *new_page;
 	spinlock_t *pmd_ptl, *pte_ptl;
 	int isolated = 0, result = 0;
-	unsigned long hstart, hend;
+	unsigned long hstart, hend, swap, curr_allocstall;
 	struct mem_cgroup *memcg;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;	/* For mmu_notifiers */
@@ -2484,7 +2485,14 @@ static void collapse_huge_page(struct mm
 		goto out;
 	}

-	__collapse_huge_page_swapin(mm, vma, address, pmd);
+	swap = get_mm_counter(mm, MM_SWAPENTS);
+	curr_allocstall = sum_vm_event(ALLOCSTALL);
+	/*
+	 * When system under pressure, don't swapin readahead.
+	 * So that avoid unnecessary resource consuming.
+	 */
+	if (allocstall == curr_allocstall && swap != 0)
+		__collapse_huge_page_swapin(mm, vma, address, pmd);

 	anon_vma_lock_write(vma->anon_vma);
@@ -2878,14 +2886,17 @@ static void khugepaged_wait_work(void)
 		if (!khugepaged_scan_sleep_millisecs)
 			return;

+		allocstall = sum_vm_event(ALLOCSTALL);
 		wait_event_freezable_timeout(khugepaged_wait,
					     kthread_should_stop(),
			msecs_to_jiffies(khugepaged_scan_sleep_millisecs));
 		return;
 	}

-	if (khugepaged_enabled())
+	if (khugepaged_enabled()) {
+		allocstall = sum_vm_event(ALLOCSTALL);
 		wait_event_freezable(khugepaged_wait, khugepaged_wait_event());
+	}
 }

 static int khugepaged(void *none)
@@ -2894,6 +2905,7 @@ static int khugepaged(void *none)

 	set_freezable();
 	set_user_nice(current, MAX_NICE);
+	allocstall = sum_vm_event(ALLOCSTALL);

 	while (!kthread_should_stop()) {
 		khugepaged_do_scan();
_

Patches currently in -mm which might be from ebru.akagunduz@gmail.com are

mm-make-optimistic-check-for-swapin-readahead.patch
mm-make-swapin-readahead-to-improve-thp-collapse-rate.patch
mm-vmstat-calculate-particular-vm-event.patch
mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree
  2016-04-27 21:17 + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree akpm
@ 2016-04-28 15:19 ` Michal Hocko
  2016-05-17  7:58   ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2016-04-28 15:19 UTC (permalink / raw)
  To: akpm
  Cc: ebru.akagunduz, aarcange, aneesh.kumar, boaz, gorcunov, hannes,
      hughd, iamjoonsoo.kim, kirill.shutemov, mgorman, n-horiguchi,
      riel, rientjes, vbabka, mm-commits, linux-mm

On Wed 27-04-16 14:17:20, Andrew Morton wrote:
[...]
> @@ -2484,7 +2485,14 @@ static void collapse_huge_page(struct mm
>  		goto out;
>  	}
>
> -	__collapse_huge_page_swapin(mm, vma, address, pmd);
> +	swap = get_mm_counter(mm, MM_SWAPENTS);
> +	curr_allocstall = sum_vm_event(ALLOCSTALL);
> +	/*
> +	 * When system under pressure, don't swapin readahead.
> +	 * So that avoid unnecessary resource consuming.
> +	 */
> +	if (allocstall == curr_allocstall && swap != 0)
> +		__collapse_huge_page_swapin(mm, vma, address, pmd);
>
>  	anon_vma_lock_write(vma->anon_vma);
>

I have mentioned that before already but this seems like a rather weak
heuristic.  Don't we really rather teach __collapse_huge_page_swapin
(resp. do_swap_page) to do optimistic GFP_NOWAIT allocations and back
off under memory pressure?
--
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree 2016-04-28 15:19 ` Michal Hocko @ 2016-05-17 7:58 ` Michal Hocko 2016-05-17 9:02 ` Michal Hocko 0 siblings, 1 reply; 17+ messages in thread From: Michal Hocko @ 2016-05-17 7:58 UTC (permalink / raw) To: akpm Cc: ebru.akagunduz, aarcange, aneesh.kumar, boaz, gorcunov, hannes, hughd, iamjoonsoo.kim, kirill.shutemov, mgorman, n-horiguchi, riel, rientjes, vbabka, mm-commits, linux-mm On Thu 28-04-16 17:19:21, Michal Hocko wrote: > On Wed 27-04-16 14:17:20, Andrew Morton wrote: > [...] > > @@ -2484,7 +2485,14 @@ static void collapse_huge_page(struct mm > > goto out; > > } > > > > - __collapse_huge_page_swapin(mm, vma, address, pmd); > > + swap = get_mm_counter(mm, MM_SWAPENTS); > > + curr_allocstall = sum_vm_event(ALLOCSTALL); > > + /* > > + * When system under pressure, don't swapin readahead. > > + * So that avoid unnecessary resource consuming. > > + */ > > + if (allocstall == curr_allocstall && swap != 0) > > + __collapse_huge_page_swapin(mm, vma, address, pmd); > > > > anon_vma_lock_write(vma->anon_vma); > > > > I have mentioned that before already but this seems like a rather weak > heuristic. Don't we really rather teach __collapse_huge_page_swapin > (resp. do_swap_page) do to an optimistic GFP_NOWAIT allocations and > back off under the memory pressure? I gave it a try and it doesn't seem really bad. 
Untested and I might have missed something really obvious but what do you think about this approach rather than relying on ALLOCSTALL which is really weak heuristic: --- diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 87f09dc986ab..1a4d4c807d92 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2389,7 +2389,8 @@ static void __collapse_huge_page_swapin(struct mm_struct *mm, swapped_in++; ret = do_swap_page(mm, vma, _address, pte, pmd, FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT, - pteval); + pteval, + GFP_HIGHUSER_MOVABLE | ~__GFP_DIRECT_RECLAIM); if (ret & VM_FAULT_ERROR) { trace_mm_collapse_huge_page_swapin(mm, swapped_in, 0); return; diff --git a/mm/memory.c b/mm/memory.c index d79c6db41502..f897ec89bd79 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2490,7 +2490,7 @@ EXPORT_SYMBOL(unmap_mapping_range); */ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pte_t *page_table, pmd_t *pmd, - unsigned int flags, pte_t orig_pte) + unsigned int flags, pte_t orig_pte, gfp_t gfp_mask) { spinlock_t *ptl; struct page *page, *swapcache; @@ -2519,8 +2519,7 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma, delayacct_set_flag(DELAYACCT_PF_SWAPIN); page = lookup_swap_cache(entry); if (!page) { - page = swapin_readahead(entry, - GFP_HIGHUSER_MOVABLE, vma, address); + page = swapin_readahead(entry, gfp_mask, vma, address); if (!page) { /* * Back out if somebody else faulted in this pte @@ -2573,7 +2572,7 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma, goto out_page; } - if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg, false)) { + if (mem_cgroup_try_charge(page, mm, gfp_mask, &memcg, false)) { ret = VM_FAULT_OOM; goto out_page; } @@ -3349,7 +3348,7 @@ static int handle_pte_fault(struct mm_struct *mm, flags, entry); } return do_swap_page(mm, vma, address, - pte, pmd, flags, entry); + pte, pmd, flags, entry, GFP_HIGHUSER_MOVABLE); } if (pte_protnone(entry)) -- Michal Hocko SUSE 
Labs
* Re: + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree 2016-05-17 7:58 ` Michal Hocko @ 2016-05-17 9:02 ` Michal Hocko 2016-05-17 11:31 ` Kirill A. Shutemov 2016-05-19 5:00 ` Minchan Kim 0 siblings, 2 replies; 17+ messages in thread From: Michal Hocko @ 2016-05-17 9:02 UTC (permalink / raw) To: akpm Cc: ebru.akagunduz, aarcange, aneesh.kumar, boaz, gorcunov, hannes, hughd, iamjoonsoo.kim, kirill.shutemov, mgorman, n-horiguchi, riel, rientjes, vbabka, mm-commits, linux-mm On Tue 17-05-16 09:58:15, Michal Hocko wrote: > On Thu 28-04-16 17:19:21, Michal Hocko wrote: > > On Wed 27-04-16 14:17:20, Andrew Morton wrote: > > [...] > > > @@ -2484,7 +2485,14 @@ static void collapse_huge_page(struct mm > > > goto out; > > > } > > > > > > - __collapse_huge_page_swapin(mm, vma, address, pmd); > > > + swap = get_mm_counter(mm, MM_SWAPENTS); > > > + curr_allocstall = sum_vm_event(ALLOCSTALL); > > > + /* > > > + * When system under pressure, don't swapin readahead. > > > + * So that avoid unnecessary resource consuming. > > > + */ > > > + if (allocstall == curr_allocstall && swap != 0) > > > + __collapse_huge_page_swapin(mm, vma, address, pmd); > > > > > > anon_vma_lock_write(vma->anon_vma); > > > > > > > I have mentioned that before already but this seems like a rather weak > > heuristic. Don't we really rather teach __collapse_huge_page_swapin > > (resp. do_swap_page) do to an optimistic GFP_NOWAIT allocations and > > back off under the memory pressure? > > I gave it a try and it doesn't seem really bad. 
Untested and I might > have missed something really obvious but what do you think about this > approach rather than relying on ALLOCSTALL which is really weak > heuristic: Ups forgot to add mm/internal.h to the git index --- diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 87f09dc986ab..1a4d4c807d92 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2389,7 +2389,8 @@ static void __collapse_huge_page_swapin(struct mm_struct *mm, swapped_in++; ret = do_swap_page(mm, vma, _address, pte, pmd, FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT, - pteval); + pteval, + GFP_HIGHUSER_MOVABLE | ~__GFP_DIRECT_RECLAIM); if (ret & VM_FAULT_ERROR) { trace_mm_collapse_huge_page_swapin(mm, swapped_in, 0); return; diff --git a/mm/internal.h b/mm/internal.h index b6ead95a0184..0b3cc643eced 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -37,7 +37,7 @@ extern int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pte_t *page_table, pmd_t *pmd, - unsigned int flags, pte_t orig_pte); + unsigned int flags, pte_t orig_pte, gfp_t gfp_mask); void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma, unsigned long floor, unsigned long ceiling); diff --git a/mm/memory.c b/mm/memory.c index d79c6db41502..f897ec89bd79 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2490,7 +2490,7 @@ EXPORT_SYMBOL(unmap_mapping_range); */ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pte_t *page_table, pmd_t *pmd, - unsigned int flags, pte_t orig_pte) + unsigned int flags, pte_t orig_pte, gfp_t gfp_mask) { spinlock_t *ptl; struct page *page, *swapcache; @@ -2519,8 +2519,7 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma, delayacct_set_flag(DELAYACCT_PF_SWAPIN); page = lookup_swap_cache(entry); if (!page) { - page = swapin_readahead(entry, - GFP_HIGHUSER_MOVABLE, vma, address); + page = swapin_readahead(entry, gfp_mask, vma, address); if (!page) { /* * Back out if somebody else faulted in 
this pte @@ -2573,7 +2572,7 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma, goto out_page; } - if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg, false)) { + if (mem_cgroup_try_charge(page, mm, gfp_mask, &memcg, false)) { ret = VM_FAULT_OOM; goto out_page; } @@ -3349,7 +3348,7 @@ static int handle_pte_fault(struct mm_struct *mm, flags, entry); } return do_swap_page(mm, vma, address, - pte, pmd, flags, entry); + pte, pmd, flags, entry, GFP_HIGHUSER_MOVABLE); } if (pte_protnone(entry)) -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree 2016-05-17 9:02 ` Michal Hocko @ 2016-05-17 11:31 ` Kirill A. Shutemov 2016-05-17 12:25 ` Michal Hocko 2016-05-19 5:00 ` Minchan Kim 1 sibling, 1 reply; 17+ messages in thread From: Kirill A. Shutemov @ 2016-05-17 11:31 UTC (permalink / raw) To: Michal Hocko Cc: akpm, ebru.akagunduz, aarcange, aneesh.kumar, boaz, gorcunov, hannes, hughd, iamjoonsoo.kim, kirill.shutemov, mgorman, n-horiguchi, riel, rientjes, vbabka, mm-commits, linux-mm On Tue, May 17, 2016 at 11:02:54AM +0200, Michal Hocko wrote: > On Tue 17-05-16 09:58:15, Michal Hocko wrote: > > On Thu 28-04-16 17:19:21, Michal Hocko wrote: > > > On Wed 27-04-16 14:17:20, Andrew Morton wrote: > > > [...] > > > > @@ -2484,7 +2485,14 @@ static void collapse_huge_page(struct mm > > > > goto out; > > > > } > > > > > > > > - __collapse_huge_page_swapin(mm, vma, address, pmd); > > > > + swap = get_mm_counter(mm, MM_SWAPENTS); > > > > + curr_allocstall = sum_vm_event(ALLOCSTALL); > > > > + /* > > > > + * When system under pressure, don't swapin readahead. > > > > + * So that avoid unnecessary resource consuming. > > > > + */ > > > > + if (allocstall == curr_allocstall && swap != 0) > > > > + __collapse_huge_page_swapin(mm, vma, address, pmd); > > > > > > > > anon_vma_lock_write(vma->anon_vma); > > > > > > > > > > I have mentioned that before already but this seems like a rather weak > > > heuristic. Don't we really rather teach __collapse_huge_page_swapin > > > (resp. do_swap_page) do to an optimistic GFP_NOWAIT allocations and > > > back off under the memory pressure? > > > > I gave it a try and it doesn't seem really bad. 
Untested and I might > > have missed something really obvious but what do you think about this > > approach rather than relying on ALLOCSTALL which is really weak > > heuristic: > > Ups forgot to add mm/internal.h to the git index > --- > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 87f09dc986ab..1a4d4c807d92 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -2389,7 +2389,8 @@ static void __collapse_huge_page_swapin(struct mm_struct *mm, > swapped_in++; > ret = do_swap_page(mm, vma, _address, pte, pmd, > FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT, > - pteval); > + pteval, > + GFP_HIGHUSER_MOVABLE | ~__GFP_DIRECT_RECLAIM); Why only direct reclaim? I'm not sure if triggering kswapd is justified for swapin. Maybe ~__GFP_RECLAIM? That said, I like the approach. The ALLOCSTALL approach has a locking issue[1]. [1] http://lkml.kernel.org/r/20160505013245.GB10429@yexl-desktop -- Kirill A. Shutemov
* Re: + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree 2016-05-17 11:31 ` Kirill A. Shutemov @ 2016-05-17 12:25 ` Michal Hocko 0 siblings, 0 replies; 17+ messages in thread From: Michal Hocko @ 2016-05-17 12:25 UTC (permalink / raw) To: Kirill A. Shutemov Cc: akpm, ebru.akagunduz, aarcange, aneesh.kumar, boaz, gorcunov, hannes, hughd, iamjoonsoo.kim, kirill.shutemov, mgorman, n-horiguchi, riel, rientjes, vbabka, mm-commits, linux-mm On Tue 17-05-16 14:31:14, Kirill A. Shutemov wrote: > On Tue, May 17, 2016 at 11:02:54AM +0200, Michal Hocko wrote: [...] > > --- > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > index 87f09dc986ab..1a4d4c807d92 100644 > > --- a/mm/huge_memory.c > > +++ b/mm/huge_memory.c > > @@ -2389,7 +2389,8 @@ static void __collapse_huge_page_swapin(struct mm_struct *mm, > > swapped_in++; > > ret = do_swap_page(mm, vma, _address, pte, pmd, > > FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT, > > - pteval); > > + pteval, > > + GFP_HIGHUSER_MOVABLE | ~__GFP_DIRECT_RECLAIM); > > Why only direct recliam? I'm not sure if triggering kswapd is justified > for swapin. Maybe ~__GFP_RECLAIM? Dunno, skipping kswapd sounds like we might consume a lot of memory and the next request might hit the direct reclaim too early. But this is only a speculation. This has to be measured properly and evaluated against real usecases. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree 2016-05-17 9:02 ` Michal Hocko 2016-05-17 11:31 ` Kirill A. Shutemov @ 2016-05-19 5:00 ` Minchan Kim 2016-05-19 7:03 ` Michal Hocko 1 sibling, 1 reply; 17+ messages in thread From: Minchan Kim @ 2016-05-19 5:00 UTC (permalink / raw) To: Michal Hocko Cc: akpm, ebru.akagunduz, aarcange, aneesh.kumar, boaz, gorcunov, hannes, hughd, iamjoonsoo.kim, kirill.shutemov, mgorman, n-horiguchi, riel, rientjes, vbabka, mm-commits, linux-mm On Tue, May 17, 2016 at 11:02:54AM +0200, Michal Hocko wrote: > On Tue 17-05-16 09:58:15, Michal Hocko wrote: > > On Thu 28-04-16 17:19:21, Michal Hocko wrote: > > > On Wed 27-04-16 14:17:20, Andrew Morton wrote: > > > [...] > > > > @@ -2484,7 +2485,14 @@ static void collapse_huge_page(struct mm > > > > goto out; > > > > } > > > > > > > > - __collapse_huge_page_swapin(mm, vma, address, pmd); > > > > + swap = get_mm_counter(mm, MM_SWAPENTS); > > > > + curr_allocstall = sum_vm_event(ALLOCSTALL); > > > > + /* > > > > + * When system under pressure, don't swapin readahead. > > > > + * So that avoid unnecessary resource consuming. > > > > + */ > > > > + if (allocstall == curr_allocstall && swap != 0) > > > > + __collapse_huge_page_swapin(mm, vma, address, pmd); > > > > > > > > anon_vma_lock_write(vma->anon_vma); > > > > > > > > > > I have mentioned that before already but this seems like a rather weak > > > heuristic. Don't we really rather teach __collapse_huge_page_swapin > > > (resp. do_swap_page) do to an optimistic GFP_NOWAIT allocations and > > > back off under the memory pressure? > > > > I gave it a try and it doesn't seem really bad. 
Untested and I might > > have missed something really obvious but what do you think about this > > approach rather than relying on ALLOCSTALL which is really weak > > heuristic: I like this approach rather than playing with allocstall diff of vmevent which can be disabled in some configuration and it's not a good indicator to represent current memory pressure situation. However, I agree with Rik's requirement which doesn't want to turn over page cache for collapsing THP page via swapin. So, your suggestion cannot prevent it because khugepaged can consume memory through this swapin operation continuously while kswapd is doing aging of LRU list in parallel. IOW, fluctuation between HIGH and LOW watermark. So, How about using waitqueue_active(&pgdat->kswapd_wait) to detect current memory pressure? So if kswapd is active, we could avoid swapin for THP collapsing. > > Ups forgot to add mm/internal.h to the git index > --- > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 87f09dc986ab..1a4d4c807d92 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -2389,7 +2389,8 @@ static void __collapse_huge_page_swapin(struct mm_struct *mm, > swapped_in++; > ret = do_swap_page(mm, vma, _address, pte, pmd, > FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT, > - pteval); > + pteval, > + GFP_HIGHUSER_MOVABLE | ~__GFP_DIRECT_RECLAIM); > if (ret & VM_FAULT_ERROR) { > trace_mm_collapse_huge_page_swapin(mm, swapped_in, 0); > return; > diff --git a/mm/internal.h b/mm/internal.h > index b6ead95a0184..0b3cc643eced 100644 > --- a/mm/internal.h > +++ b/mm/internal.h > @@ -37,7 +37,7 @@ > > extern int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma, > unsigned long address, pte_t *page_table, pmd_t *pmd, > - unsigned int flags, pte_t orig_pte); > + unsigned int flags, pte_t orig_pte, gfp_t gfp_mask); > > void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma, > unsigned long floor, unsigned long ceiling); > diff --git a/mm/memory.c 
b/mm/memory.c > index d79c6db41502..f897ec89bd79 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -2490,7 +2490,7 @@ EXPORT_SYMBOL(unmap_mapping_range); > */ > int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma, > unsigned long address, pte_t *page_table, pmd_t *pmd, > - unsigned int flags, pte_t orig_pte) > + unsigned int flags, pte_t orig_pte, gfp_t gfp_mask) > { > spinlock_t *ptl; > struct page *page, *swapcache; > @@ -2519,8 +2519,7 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma, > delayacct_set_flag(DELAYACCT_PF_SWAPIN); > page = lookup_swap_cache(entry); > if (!page) { > - page = swapin_readahead(entry, > - GFP_HIGHUSER_MOVABLE, vma, address); > + page = swapin_readahead(entry, gfp_mask, vma, address); > if (!page) { > /* > * Back out if somebody else faulted in this pte > @@ -2573,7 +2572,7 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma, > goto out_page; > } > > - if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg, false)) { > + if (mem_cgroup_try_charge(page, mm, gfp_mask, &memcg, false)) { > ret = VM_FAULT_OOM; > goto out_page; > } > @@ -3349,7 +3348,7 @@ static int handle_pte_fault(struct mm_struct *mm, > flags, entry); > } > return do_swap_page(mm, vma, address, > - pte, pmd, flags, entry); > + pte, pmd, flags, entry, GFP_HIGHUSER_MOVABLE); > } > > if (pte_protnone(entry)) > -- > Michal Hocko > SUSE Labs > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree 2016-05-19 5:00 ` Minchan Kim @ 2016-05-19 7:03 ` Michal Hocko 2016-05-19 7:27 ` Minchan Kim 0 siblings, 1 reply; 17+ messages in thread From: Michal Hocko @ 2016-05-19 7:03 UTC (permalink / raw) To: Minchan Kim Cc: akpm, ebru.akagunduz, aarcange, aneesh.kumar, boaz, gorcunov, hannes, hughd, iamjoonsoo.kim, kirill.shutemov, mgorman, n-horiguchi, riel, rientjes, vbabka, mm-commits, linux-mm On Thu 19-05-16 14:00:38, Minchan Kim wrote: > On Tue, May 17, 2016 at 11:02:54AM +0200, Michal Hocko wrote: > > On Tue 17-05-16 09:58:15, Michal Hocko wrote: > > > On Thu 28-04-16 17:19:21, Michal Hocko wrote: > > > > On Wed 27-04-16 14:17:20, Andrew Morton wrote: > > > > [...] > > > > > @@ -2484,7 +2485,14 @@ static void collapse_huge_page(struct mm > > > > > goto out; > > > > > } > > > > > > > > > > - __collapse_huge_page_swapin(mm, vma, address, pmd); > > > > > + swap = get_mm_counter(mm, MM_SWAPENTS); > > > > > + curr_allocstall = sum_vm_event(ALLOCSTALL); > > > > > + /* > > > > > + * When system under pressure, don't swapin readahead. > > > > > + * So that avoid unnecessary resource consuming. > > > > > + */ > > > > > + if (allocstall == curr_allocstall && swap != 0) > > > > > + __collapse_huge_page_swapin(mm, vma, address, pmd); > > > > > > > > > > anon_vma_lock_write(vma->anon_vma); > > > > > > > > > > > > > I have mentioned that before already but this seems like a rather weak > > > > heuristic. Don't we really rather teach __collapse_huge_page_swapin > > > > (resp. do_swap_page) do to an optimistic GFP_NOWAIT allocations and > > > > back off under the memory pressure? > > > > > > I gave it a try and it doesn't seem really bad. 
Untested and I might > > > have missed something really obvious but what do you think about this > > > approach rather than relying on ALLOCSTALL which is really weak > > > heuristic: > > I like this approach rather than playing with allocstall diff of vmevent > which can be disabled in some configuration and it's not a good indicator > to represent current memory pressure situation. Not only that it won't work for e.g. memcg configurations because we would end up reclaiming that memcg as the gfp mask tells us to do so and ALLOCSTALL would be quite about that. > However, I agree with Rik's requirement which doesn't want to turn over > page cache for collapsing THP page via swapin. So, your suggestion cannot > prevent it because khugepaged can consume memory through this swapin > operation continuously while kswapd is doing aging of LRU list in parallel. > IOW, fluctuation between HIGH and LOW watermark. I am not sure this is actually a problem. We have other sources of opportunistic allocations with some fallback and those wake up kswapd (they only clear __GFP_DIRECT_RECLAIM). Also this swapin should happen only when a certain portion of the huge page is already populated so it won't happen all the time and sounds like we would benefit from the reclaimed page cache in favor of the THP. > So, How about using waitqueue_active(&pgdat->kswapd_wait) to detect > current memory pressure? So if kswapd is active, we could avoid swapin > for THP collapsing. Dunno, this sounds quite arbitrary. And I am even not sure this all optimistic swap in is a huge win to be honest. I just really hate the ALLOCSTALL heuristic because it simply doesn't work. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree 2016-05-19 7:03 ` Michal Hocko @ 2016-05-19 7:27 ` Minchan Kim 2016-05-19 7:39 ` Michal Hocko 0 siblings, 1 reply; 17+ messages in thread From: Minchan Kim @ 2016-05-19 7:27 UTC (permalink / raw) To: Michal Hocko Cc: akpm, ebru.akagunduz, aarcange, aneesh.kumar, boaz, gorcunov, hannes, hughd, iamjoonsoo.kim, kirill.shutemov, mgorman, n-horiguchi, riel, rientjes, vbabka, mm-commits, linux-mm On Thu, May 19, 2016 at 09:03:57AM +0200, Michal Hocko wrote: > On Thu 19-05-16 14:00:38, Minchan Kim wrote: > > On Tue, May 17, 2016 at 11:02:54AM +0200, Michal Hocko wrote: > > > On Tue 17-05-16 09:58:15, Michal Hocko wrote: > > > > On Thu 28-04-16 17:19:21, Michal Hocko wrote: > > > > > On Wed 27-04-16 14:17:20, Andrew Morton wrote: > > > > > [...] > > > > > > @@ -2484,7 +2485,14 @@ static void collapse_huge_page(struct mm > > > > > > goto out; > > > > > > } > > > > > > > > > > > > - __collapse_huge_page_swapin(mm, vma, address, pmd); > > > > > > + swap = get_mm_counter(mm, MM_SWAPENTS); > > > > > > + curr_allocstall = sum_vm_event(ALLOCSTALL); > > > > > > + /* > > > > > > + * When system under pressure, don't swapin readahead. > > > > > > + * So that avoid unnecessary resource consuming. > > > > > > + */ > > > > > > + if (allocstall == curr_allocstall && swap != 0) > > > > > > + __collapse_huge_page_swapin(mm, vma, address, pmd); > > > > > > > > > > > > anon_vma_lock_write(vma->anon_vma); > > > > > > > > > > > > > > > > I have mentioned that before already but this seems like a rather weak > > > > > heuristic. Don't we really rather teach __collapse_huge_page_swapin > > > > > (resp. do_swap_page) do to an optimistic GFP_NOWAIT allocations and > > > > > back off under the memory pressure? > > > > > > > > I gave it a try and it doesn't seem really bad. 
Untested and I might > > > > have missed something really obvious but what do you think about this > > > > approach rather than relying on ALLOCSTALL which is really weak > > > > heuristic: > > > > I like this approach rather than playing with allocstall diff of vmevent > > > which can be disabled in some configuration and it's not a good indicator > > > to represent current memory pressure situation. > > > > Not only that it won't work for e.g. memcg configurations because we > > would end up reclaiming that memcg as the gfp mask tells us to do so and > > ALLOCSTALL would be quite about that. Right you are. I didn't consider memcg. Thanks for pointing out. > > > However, I agree with Rik's requirement which doesn't want to turn over > > page cache for collapsing THP page via swapin. So, your suggestion cannot > > prevent it because khugepaged can consume memory through this swapin > > operation continuously while kswapd is doing aging of LRU list in parallel. > > IOW, fluctuation between HIGH and LOW watermark. > > I am not sure this is actually a problem. We have other sources of > opportunistic allocations with some fallback and those wake up kswapd > (they only clear __GFP_DIRECT_RECLAIM). Also this swapin should happen > only when a certain portion of the huge page is already populated so I can't find any logic you mentioned "a certain portion of the huge page is already populated" in next-20160517. What am I missing now? > it won't happen all the time and sounds like we would benefit from the > reclaimed page cache in favor of the THP. It depends on storage speed. If a page is swapped out, it means it's not a workingset so we might read cold page at the cost of evicting warm page. Additionally, if the huge page was swapped out, it is likely to swap out again because it's not a hot * 512 page. For those pages, shouldn't we evict page cache? I think it's not a good tradeoff.
> > So, How about using waitqueue_active(&pgdat->kswapd_wait) to detect > current memory pressure? So if kswapd is active, we could avoid swapin > for THP collapsing. > > Dunno, this sounds quite arbitrary. And I am even not sure this all > optimistic swap in is a huge win to be honest. I just really hate the > ALLOCSTALL heuristic because it simply doesn't work. Agree. > -- > Michal Hocko > SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree 2016-05-19 7:27 ` Minchan Kim @ 2016-05-19 7:39 ` Michal Hocko 2016-05-20 0:21 ` Minchan Kim 0 siblings, 1 reply; 17+ messages in thread From: Michal Hocko @ 2016-05-19 7:39 UTC (permalink / raw) To: Minchan Kim Cc: akpm, ebru.akagunduz, aarcange, aneesh.kumar, boaz, gorcunov, hannes, hughd, iamjoonsoo.kim, kirill.shutemov, mgorman, n-horiguchi, riel, rientjes, vbabka, mm-commits, linux-mm On Thu 19-05-16 16:27:51, Minchan Kim wrote: > On Thu, May 19, 2016 at 09:03:57AM +0200, Michal Hocko wrote: > > On Thu 19-05-16 14:00:38, Minchan Kim wrote: > > > On Tue, May 17, 2016 at 11:02:54AM +0200, Michal Hocko wrote: > > > > On Tue 17-05-16 09:58:15, Michal Hocko wrote: > > > > > On Thu 28-04-16 17:19:21, Michal Hocko wrote: > > > > > > On Wed 27-04-16 14:17:20, Andrew Morton wrote: > > > > > > [...] > > > > > > > @@ -2484,7 +2485,14 @@ static void collapse_huge_page(struct mm > > > > > > > goto out; > > > > > > > } > > > > > > > > > > > > > > - __collapse_huge_page_swapin(mm, vma, address, pmd); > > > > > > > + swap = get_mm_counter(mm, MM_SWAPENTS); > > > > > > > + curr_allocstall = sum_vm_event(ALLOCSTALL); > > > > > > > + /* > > > > > > > + * When system under pressure, don't swapin readahead. > > > > > > > + * So that avoid unnecessary resource consuming. > > > > > > > + */ > > > > > > > + if (allocstall == curr_allocstall && swap != 0) > > > > > > > + __collapse_huge_page_swapin(mm, vma, address, pmd); > > > > > > > > > > > > > > anon_vma_lock_write(vma->anon_vma); > > > > > > > > > > > > > > > > > > > I have mentioned that before already but this seems like a rather weak > > > > > > heuristic. Don't we really rather teach __collapse_huge_page_swapin > > > > > > (resp. do_swap_page) do to an optimistic GFP_NOWAIT allocations and > > > > > > back off under the memory pressure? > > > > > > > > > > I gave it a try and it doesn't seem really bad. 
Untested and I might > > > > > have missed something really obvious but what do you think about this > > > > > approach rather than relying on ALLOCSTALL which is really weak > > > > > heuristic: > > > > > > I like this approach rather than playing with allocstall diff of vmevent > > > which can be disabled in some configuration and it's not a good indicator > > > to represent current memory pressure situation. > > > > Not only that it won't work for e.g. memcg configurations because we > > would end up reclaiming that memcg as the gfp mask tells us to do so and > > ALLOCSTALL would be quite about that. > > Right you are. I didn't consider memcg. Thanks for pointing out. > > > > > > However, I agree with Rik's requirement which doesn't want to turn over > > > page cache for collapsing THP page via swapin. So, your suggestion cannot > > > prevent it because khugepaged can consume memory through this swapin > > > operation continuously while kswapd is doing aging of LRU list in parallel. > > > IOW, fluctuation between HIGH and LOW watermark. > > > > I am not sure this is actually a problem. We have other sources of > > opportunistic allocations with some fallback and those wake up kswapd > > (they only clear __GFP_DIRECT_RECLAIM). Also this swapin should happen > > only when a certain portion of the huge page is already populated so > > I can't find any logic you mentioned "a certain portion of the huge page > is already populated" in next-20160517. What am I missing now? khugepaged_max_ptes_swap. I didn't look closer but from a quick glance this is the threshold for the optimistic swapin. > > it won't happen all the time and sounds like we would benefit from the > > reclaimed page cache in favor of the THP. > > It depends on storage speed. If a page is swapped out, it means it's not a > workingset so we might read cold page at the cost of evciting warm page. 
> Additionally, if the huge page was swapped out, it is likely to swap out > again because it's not a hot * 512 page. For those pages, shouldn't we > evict page cache? I think it's not a good tradeoff. This is exactly the problem of the optimistic THP swap in. We just do not know whether it is worth it. But I guess that a reasonable threshold would solve this. It is really ineffective to keep small pages when only a few holes are swapped out (for whatever reasons). HPAGE_PMD_NR/8, which we use right now, is not documented, but I guess 64 pages sounds like a reasonable value which shouldn't cause way too much reclaim. -- Michal Hocko SUSE Labs
* Re: + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree 2016-05-19 7:39 ` Michal Hocko @ 2016-05-20 0:21 ` Minchan Kim 2016-05-20 6:39 ` Michal Hocko 0 siblings, 1 reply; 17+ messages in thread From: Minchan Kim @ 2016-05-20 0:21 UTC (permalink / raw) To: Michal Hocko Cc: akpm, ebru.akagunduz, aarcange, aneesh.kumar, boaz, gorcunov, hannes, hughd, iamjoonsoo.kim, kirill.shutemov, mgorman, n-horiguchi, riel, rientjes, vbabka, mm-commits, linux-mm On Thu, May 19, 2016 at 09:39:57AM +0200, Michal Hocko wrote: > On Thu 19-05-16 16:27:51, Minchan Kim wrote: > > On Thu, May 19, 2016 at 09:03:57AM +0200, Michal Hocko wrote: > > > On Thu 19-05-16 14:00:38, Minchan Kim wrote: > > > > On Tue, May 17, 2016 at 11:02:54AM +0200, Michal Hocko wrote: > > > > > On Tue 17-05-16 09:58:15, Michal Hocko wrote: > > > > > > On Thu 28-04-16 17:19:21, Michal Hocko wrote: > > > > > > > On Wed 27-04-16 14:17:20, Andrew Morton wrote: > > > > > > > [...] > > > > > > > > @@ -2484,7 +2485,14 @@ static void collapse_huge_page(struct mm > > > > > > > > goto out; > > > > > > > > } > > > > > > > > > > > > > > > > - __collapse_huge_page_swapin(mm, vma, address, pmd); > > > > > > > > + swap = get_mm_counter(mm, MM_SWAPENTS); > > > > > > > > + curr_allocstall = sum_vm_event(ALLOCSTALL); > > > > > > > > + /* > > > > > > > > + * When system under pressure, don't swapin readahead. > > > > > > > > + * So that avoid unnecessary resource consuming. > > > > > > > > + */ > > > > > > > > + if (allocstall == curr_allocstall && swap != 0) > > > > > > > > + __collapse_huge_page_swapin(mm, vma, address, pmd); > > > > > > > > > > > > > > > > anon_vma_lock_write(vma->anon_vma); > > > > > > > > > > > > > > > > > > > > > > I have mentioned that before already but this seems like a rather weak > > > > > > > heuristic. Don't we really rather teach __collapse_huge_page_swapin > > > > > > > (resp. 
do_swap_page) do to an optimistic GFP_NOWAIT allocations and > > > > > > > back off under the memory pressure? > > > > > > > > > > > > I gave it a try and it doesn't seem really bad. Untested and I might > > > > > > have missed something really obvious but what do you think about this > > > > > > approach rather than relying on ALLOCSTALL which is really weak > > > > > > heuristic: > > > > > > > > I like this approach rather than playing with allocstall diff of vmevent > > > > which can be disabled in some configuration and it's not a good indicator > > > > to represent current memory pressure situation. > > > > > > Not only that it won't work for e.g. memcg configurations because we > > > would end up reclaiming that memcg as the gfp mask tells us to do so and > > > ALLOCSTALL would be quite about that. > > > > Right you are. I didn't consider memcg. Thanks for pointing out. > > > > > > > > > However, I agree with Rik's requirement which doesn't want to turn over > > > > page cache for collapsing THP page via swapin. So, your suggestion cannot > > > > prevent it because khugepaged can consume memory through this swapin > > > > operation continuously while kswapd is doing aging of LRU list in parallel. > > > > IOW, fluctuation between HIGH and LOW watermark. > > > > > > I am not sure this is actually a problem. We have other sources of > > > opportunistic allocations with some fallback and those wake up kswapd > > > (they only clear __GFP_DIRECT_RECLAIM). Also this swapin should happen > > > only when a certain portion of the huge page is already populated so > > > > I can't find any logic you mentioned "a certain portion of the huge page > > is already populated" in next-20160517. What am I missing now? > > khugepaged_max_ptes_swap. I didn't look closer but from a quick glance > this is the threshold for the optimistic swapin. Thanks. I see it now. 
> > > > it won't happen all the time and sounds like we would benefit from the > > > reclaimed page cache in favor of the THP. > > > > It depends on storage speed. If a page is swapped out, it means it's not a > > workingset so we might read cold page at the cost of evciting warm page. > > Additionally, if the huge page was swapped out, it is likely to swap out > > again because it's not a hot * 512 page. For those pages, shouldn't we > > evict page cache? I think it's not a good tradeoff. > > This is exactly the problem of the optimistic THP swap in. We just do > not know whether it is worth it. But I guess that a reasonable threshold > would solve this. It is really ineffective to keep small pages when only > few holes are swapped out (for what ever reasons). HPAGE_PMD_NR/8 which > we use right now is not documented but I guess 64 pages sounds like a > reasonable value which shouldn't cause way too much of reclaim. I don't know what the best default value is. Anyway, I agree we should introduce a threshold for collapsing THP through a swapin I/O operation. Another important thing we should consider is how likely the THP page is to stay hot without being split for a while, like the double checking KSM does before merging a stable page. Of course, it wouldn't be easy to implement, but I think the algorithm should be based on such *hotness* first and then consider the number of max_swap_ptes. IOW, I think we should be conservative rather than optimistic, because the page I/O overhead of a wrong choice could outweigh the benefit of a few TLB hits. If we approach it that way, maybe we don't need to detect memory pressure at all. To that end, how about raising the bar for swapin allowance? I mean, we currently allow swapin with at most 64 swap ptes and at least 1 young page in 512 ptes; we could move to at most 64 swap ptes and at least 256 young pages in 512 ptes. > > -- > Michal Hocko > SUSE Labs
* Re: + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree 2016-05-20 0:21 ` Minchan Kim @ 2016-05-20 6:39 ` Michal Hocko 2016-05-20 7:26 ` Minchan Kim 0 siblings, 1 reply; 17+ messages in thread From: Michal Hocko @ 2016-05-20 6:39 UTC (permalink / raw) To: Minchan Kim Cc: akpm, ebru.akagunduz, aarcange, aneesh.kumar, boaz, gorcunov, hannes, hughd, iamjoonsoo.kim, kirill.shutemov, mgorman, n-horiguchi, riel, rientjes, vbabka, mm-commits, linux-mm On Fri 20-05-16 09:21:55, Minchan Kim wrote: [...] > I think other important thing we should consider is how the THP page is likely > to be hot without split in a short time like KSM is doing double checking to > merge stable page. Of course, it wouldn't be easyI to implement but I think > algorithm is based on such *hotness* basically and then consider the number > of max_swap_ptes. IOW, I think we should approach more conservative rather > than optimistic because a page I/O overhead by wrong choice could be bigger > than benefit of a few TLB hit. > If we approach that way, maybe we don't need to detect memory pressure. > > For that way, how about raising bar for swapin allowance? > I mean now we allows swapin > > from > 64 pages below swap ptes and 1 page young in 512 ptes > to > 64 pages below swap ptes and 256 page young in 512 ptes I agree that the current 1 page threshold for collapsing is way too optimistic. Same as the defaults we had for the page fault THP faulting which has caused many issues. So I would be all for changing it. I do not have good benchmarks to back any "good" number unfortunately. So such a change would be quite arbitrary based on feeling... If you have some workload where collapsing THP pages causes some real issues that would be great justification though. That being said khugepaged_max_ptes_none = HPAGE_PMD_NR/2 sounds like a good start to me. Whether all of the present pages have to be young I am not so sure. 
-- Michal Hocko SUSE Labs
* Re: + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree 2016-05-20 6:39 ` Michal Hocko @ 2016-05-20 7:26 ` Minchan Kim 2016-05-20 7:34 ` Michal Hocko 0 siblings, 1 reply; 17+ messages in thread From: Minchan Kim @ 2016-05-20 7:26 UTC (permalink / raw) To: Michal Hocko Cc: akpm, ebru.akagunduz, aarcange, aneesh.kumar, boaz, gorcunov, hannes, hughd, iamjoonsoo.kim, kirill.shutemov, mgorman, n-horiguchi, riel, rientjes, vbabka, mm-commits, linux-mm On Fri, May 20, 2016 at 08:39:17AM +0200, Michal Hocko wrote: > On Fri 20-05-16 09:21:55, Minchan Kim wrote: > [...] > > I think other important thing we should consider is how the THP page is likely > > to be hot without split in a short time like KSM is doing double checking to > > merge stable page. Of course, it wouldn't be easyI to implement but I think > > algorithm is based on such *hotness* basically and then consider the number > > of max_swap_ptes. IOW, I think we should approach more conservative rather > > than optimistic because a page I/O overhead by wrong choice could be bigger > > than benefit of a few TLB hit. > > If we approach that way, maybe we don't need to detect memory pressure. > > > > For that way, how about raising bar for swapin allowance? > > I mean now we allows swapin > > > > from > > 64 pages below swap ptes and 1 page young in 512 ptes > > to > > 64 pages below swap ptes and 256 page young in 512 ptes > > I agree that the current 1 page threshold for collapsing is way too > optimistic. Same as the defaults we had for the page fault THP faulting > which has caused many issues. So I would be all for changing it. I do I don't know that we should change all of them, but if we change it for THP faulting, I believe THP swapin should use a more conservative value than THP faulting, because the cost of swapin collapsing is heavier. > > not have good benchmarks to back any "good" number unfortunately. So > such a change would be quite arbitrary based on feeling...
If you have > some workload where collapsing THP pages causes some real issues that > would be great justification though. Hope to, but our products have never turned THP on. :( I just wanted to describe the current problem and a suggestion so that a THP person can take an interest in it. If it's not worth doing, simply ignore it. > > That being said khugepaged_max_ptes_none = HPAGE_PMD_NR/2 sounds like a max_ptes_none? > good start to me. Whether all of the present pages have to be young I am > not so sure. > > -- > Michal Hocko > SUSE Labs
* Re: + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree 2016-05-20 7:26 ` Minchan Kim @ 2016-05-20 7:34 ` Michal Hocko 2016-05-20 7:44 ` Minchan Kim 0 siblings, 1 reply; 17+ messages in thread From: Michal Hocko @ 2016-05-20 7:34 UTC (permalink / raw) To: Minchan Kim Cc: akpm, ebru.akagunduz, aarcange, aneesh.kumar, boaz, gorcunov, hannes, hughd, iamjoonsoo.kim, kirill.shutemov, mgorman, n-horiguchi, riel, rientjes, vbabka, mm-commits, linux-mm On Fri 20-05-16 16:26:24, Minchan Kim wrote: > On Fri, May 20, 2016 at 08:39:17AM +0200, Michal Hocko wrote: > > On Fri 20-05-16 09:21:55, Minchan Kim wrote: > > [...] > > > I think other important thing we should consider is how the THP page is likely > > > to be hot without split in a short time like KSM is doing double checking to > > > merge stable page. Of course, it wouldn't be easyI to implement but I think > > > algorithm is based on such *hotness* basically and then consider the number > > > of max_swap_ptes. IOW, I think we should approach more conservative rather > > > than optimistic because a page I/O overhead by wrong choice could be bigger > > > than benefit of a few TLB hit. > > > If we approach that way, maybe we don't need to detect memory pressure. > > > > > > For that way, how about raising bar for swapin allowance? > > > I mean now we allows swapin > > > > > > from > > > 64 pages below swap ptes and 1 page young in 512 ptes > > > to > > > 64 pages below swap ptes and 256 page young in 512 ptes > > > > I agree that the current 1 page threshold for collapsing is way too > > optimistic. Same as the defaults we had for the page fault THP faulting > > which has caused many issues. So I would be all for changing it. I do > > I don't know we should change all but if we change it for THP faulting, > I believe THP swapin should be more conservative value rather than THP > faulting because cost of THP swapin collapsing would be heavier. 
> > not have good benchmarks to back any "good" number unfortunately. So > > such a change would be quite arbitrary based on feeling... If you have > > some workload where collapsing THP pages causes some real issues that > > would be great justification though. > > Hope to have but our products never have turned on THP. :( > I just wanted to say current problem and suggestion so a THP guy > can have an interest on that. > If it's not worth to do, simple igonore. I guess this is worth changing; I am just not sure about the justification. > > That being said khugepaged_max_ptes_none = HPAGE_PMD_NR/2 sounds like a > > max_ptes_none? Not sure I understand what you mean here. -- Michal Hocko SUSE Labs
* Re: + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree 2016-05-20 7:34 ` Michal Hocko @ 2016-05-20 7:44 ` Minchan Kim 2016-05-20 8:02 ` Michal Hocko 0 siblings, 1 reply; 17+ messages in thread From: Minchan Kim @ 2016-05-20 7:44 UTC (permalink / raw) To: Michal Hocko Cc: akpm, ebru.akagunduz, aarcange, aneesh.kumar, boaz, gorcunov, hannes, hughd, iamjoonsoo.kim, kirill.shutemov, mgorman, n-horiguchi, riel, rientjes, vbabka, mm-commits, linux-mm On Fri, May 20, 2016 at 09:34:32AM +0200, Michal Hocko wrote: > On Fri 20-05-16 16:26:24, Minchan Kim wrote: > > On Fri, May 20, 2016 at 08:39:17AM +0200, Michal Hocko wrote: > > > On Fri 20-05-16 09:21:55, Minchan Kim wrote: > > > [...] > > > > I think other important thing we should consider is how the THP page is likely > > > > to be hot without split in a short time like KSM is doing double checking to > > > > merge stable page. Of course, it wouldn't be easyI to implement but I think > > > > algorithm is based on such *hotness* basically and then consider the number > > > > of max_swap_ptes. IOW, I think we should approach more conservative rather > > > > than optimistic because a page I/O overhead by wrong choice could be bigger > > > > than benefit of a few TLB hit. > > > > If we approach that way, maybe we don't need to detect memory pressure. > > > > > > > > For that way, how about raising bar for swapin allowance? > > > > I mean now we allows swapin > > > > > > > > from > > > > 64 pages below swap ptes and 1 page young in 512 ptes > > > > to > > > > 64 pages below swap ptes and 256 page young in 512 ptes > > > > > > I agree that the current 1 page threshold for collapsing is way too > > > optimistic. Same as the defaults we had for the page fault THP faulting > > > which has caused many issues. So I would be all for changing it. 
I do > > > > I don't know we should change all but if we change it for THP faulting, > > I believe THP swapin should be more conservative value rather than THP > > faulting because cost of THP swapin collapsing would be heavier. > > > > > not have good benchmarks to back any "good" number unfortunately. So > > > such a change would be quite arbitrary based on feeling... If you have > > > some workload where collapsing THP pages causes some real issues that > > > would be great justification though. > > > > Hope to have but our products never have turned on THP. :( > > I just wanted to say current problem and suggestion so a THP guy > > can have an interest on that. > > If it's not worth to do, simple igonore. > > I guess this is worth changing I am just not sure about the > justification I agree with you about THP faulting, which has been there for several years, so changing it without justification carries risk. However, swapin collapsing is a new feature, so it might be worth it there. > > > > That being said khugepaged_max_ptes_none = HPAGE_PMD_NR/2 sounds like a > > > > max_ptes_none? > > Not sure I understand what you mean here. We are talking about max_ptes_swap and max_active_pages (i.e., pte_young), but suddenly you are saying max_ptes_none, so I was curious whether it was just a typo. > -- > Michal Hocko > SUSE Labs
* Re: + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree 2016-05-20 7:44 ` Minchan Kim @ 2016-05-20 8:02 ` Michal Hocko 2016-05-20 8:26 ` Minchan Kim 0 siblings, 1 reply; 17+ messages in thread From: Michal Hocko @ 2016-05-20 8:02 UTC (permalink / raw) To: Minchan Kim Cc: akpm, ebru.akagunduz, aarcange, aneesh.kumar, boaz, gorcunov, hannes, hughd, iamjoonsoo.kim, kirill.shutemov, mgorman, n-horiguchi, riel, rientjes, vbabka, mm-commits, linux-mm On Fri 20-05-16 16:44:50, Minchan Kim wrote: > > > > That being said khugepaged_max_ptes_none = HPAGE_PMD_NR/2 sounds like a > > > > > > max_ptes_none? > > > > Not sure I understand what you mean here. > > We are talking about max_ptes_swap and max_active_pages(i.e., pte_young) > > but suddenly you are saying max_ptes_none so I was curious it was just > > typo. Because the default for pte_none resp. zero pages collapsing into THP is khugepaged_max_ptes_none and the current default means that a single present page is sufficient. That is way too optimistic. So I consider this to be a good start. I am not so sure about minimum young pages because that would probably require yet another tunable and we have more than enough of them. Anyway I guess we are getting off-topic here... -- Michal Hocko SUSE Labs
* Re: + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree 2016-05-20 8:02 ` Michal Hocko @ 2016-05-20 8:26 ` Minchan Kim 0 siblings, 0 replies; 17+ messages in thread From: Minchan Kim @ 2016-05-20 8:26 UTC (permalink / raw) To: Michal Hocko Cc: akpm, ebru.akagunduz, aarcange, aneesh.kumar, boaz, gorcunov, hannes, hughd, iamjoonsoo.kim, kirill.shutemov, mgorman, n-horiguchi, riel, rientjes, vbabka, mm-commits, linux-mm On Fri, May 20, 2016 at 10:02:18AM +0200, Michal Hocko wrote: > On Fri 20-05-16 16:44:50, Minchan Kim wrote: > > > > > That being said khugepaged_max_ptes_none = HPAGE_PMD_NR/2 sounds like a > > > > > > > > max_ptes_none? > > > > > > Not sure I understand what you mean here. > > > > We are talking about max_ptes_swap and max_active_pages(i.e., pte_young) > > but suddenly you are saying max_ptes_none so I was curious it was just > > typo. > > Because the default for pte_none resp. zero pages collapsing into THP is > khugepaged_max_ptes_none and the current default means that a single > present page is sufficient. That is way too optimistic. So I consider > this to be a good start. I am not so sure about minimum young pages > because that would probably require yet another tunable and we have more > than enough of them. Anyway I guess we are getting off-topic here... Optimistic swapin collapsing:
1. It could be too optimistic, losing the gain by evicting the workingset.
2. So let's detect memory pressure.
3. The current allocstall magic is not a good idea.
4. So let's change the design from optimistic to conservative.
5. How can we be conservative?
6. Two things: detect hot pages, and a threshold of swap ptes.
7. The threshold of swap ptes is already done, so the remaining thing is detecting hot pages.
8. How to detect hot pages: let's use the young bit.
9. Now we are conservative, so we swap in only when it's worth it.
10. Then let's remove the allocstall magic.
I think it's not off-topic. Anyway, it's just my thought; I don't have any real workload, and no objection.
Feel free to ignore. > -- > Michal Hocko > SUSE Labs
end of thread, other threads:[~2016-05-20 8:26 UTC | newest]

Thread overview: 17+ messages
2016-04-27 21:17 + mm-thp-avoid-unnecessary-swapin-in-khugepaged.patch added to -mm tree akpm
2016-04-28 15:19 ` Michal Hocko
2016-05-17 7:58 ` Michal Hocko
2016-05-17 9:02 ` Michal Hocko
2016-05-17 11:31 ` Kirill A. Shutemov
2016-05-17 12:25 ` Michal Hocko
2016-05-19 5:00 ` Minchan Kim
2016-05-19 7:03 ` Michal Hocko
2016-05-19 7:27 ` Minchan Kim
2016-05-19 7:39 ` Michal Hocko
2016-05-20 0:21 ` Minchan Kim
2016-05-20 6:39 ` Michal Hocko
2016-05-20 7:26 ` Minchan Kim
2016-05-20 7:34 ` Michal Hocko
2016-05-20 7:44 ` Minchan Kim
2016-05-20 8:02 ` Michal Hocko
2016-05-20 8:26 ` Minchan Kim