* [PATCH v2] mm: incorporate read-only pages into transparent huge pages
From: Ebru Akagunduz @ 2015-01-24 15:38 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, kirill, mhocko, mgorman, rientjes, sasha.levin, hughd,
	hannes, vbabka, linux-kernel, riel, aarcange, Ebru Akagunduz

This patch aims to improve THP collapse rates, by allowing
THP collapse in the presence of read-only ptes, like those
left in place by do_swap_page after a read fault.

Currently khugepaged can collapse 4kB pages into a THP when
up to khugepaged_max_ptes_none of the ptes in a 2MB range
are pte_none. This patch applies the same limit to
read-only ptes.

The patch was tested with a test program that allocates
800MB of memory, writes to it, and then sleeps. I force
the system to swap out all but 190MB of the program by
touching other memory. Afterwards, the test program does
a mix of reads and writes to its memory, and the memory
gets swapped back in.

Without the patch, only the memory that did not get
swapped out remained in THPs, which corresponds to 24% of
the memory of the program. The percentage did not increase
over time.

With this patch, after 5 minutes of waiting khugepaged had
collapsed 48% of the program's memory back into THPs.

Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
Changes in v2:
 - Remove extra code indent
 - Add fast path optimistic check to
   __collapse_huge_page_isolate()
 - Add a comment explaining the page_count() check
 - Move the page_count() check below trylock_page()

I've written down test results:
With the patch:
After swapped out:
cat /proc/pid/smaps:
Anonymous:	42804 kB
AnonHugePages:	38912 kB
Swap:		757200 kB
Fraction:	90,90

cat /proc/meminfo:
AnonPages:	1843956 kB
AnonHugePages:	1712128 kB
Fraction:	92,85

After swapped in:
In a few seconds:
cat /proc/pid/smaps:
Anonymous:	800004 kB
AnonHugePages:	104448 kB
Swap:		0 kB
Fraction:	13,05

cat /proc/meminfo:
AnonPages:	2605728 kB
AnonHugePages:	1777664 kB
Fraction:	68,22

In 5 minutes:
cat /proc/pid/smaps
Anonymous:	800004 kB
AnonHugePages:	389120 kB
Swap:		0 kB
Fraction:	48,63

cat /proc/meminfo:
AnonPages:	2607824 kB
AnonHugePages:	2041856 kB
Fraction:	78,29

Without the patch:
After swapped out:
cat /proc/pid/smaps:
Anonymous:      190660 kB
AnonHugePages:  190464 kB
Swap:           609344 kB
Fraction:       99,89

cat /proc/meminfo:
AnonPages:      1740456 kB
AnonHugePages:  1667072 kB
Fraction:       95,78

After swapped in:
cat /proc/pid/smaps:
Anonymous:      800004 kB
AnonHugePages:  190464 kB
Swap:           0 kB
Fraction:       23,80

cat /proc/meminfo:
AnonPages:      2350032 kB
AnonHugePages:  1667072 kB
Fraction:       70,93

I waited 10 minutes; the fractions
did not change without the patch.

 mm/huge_memory.c | 48 +++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 39 insertions(+), 9 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 817a875..5e3e9b9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2148,7 +2148,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 {
 	struct page *page;
 	pte_t *_pte;
-	int referenced = 0, none = 0;
+	int referenced = 0, none = 0, ro = 0;
 	for (_pte = pte; _pte < pte+HPAGE_PMD_NR;
 	     _pte++, address += PAGE_SIZE) {
 		pte_t pteval = *_pte;
@@ -2158,7 +2158,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			else
 				goto out;
 		}
-		if (!pte_present(pteval) || !pte_write(pteval))
+		if (!pte_present(pteval))
 			goto out;
 		page = vm_normal_page(vma, address, pteval);
 		if (unlikely(!page))
@@ -2168,9 +2168,6 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		VM_BUG_ON_PAGE(!PageAnon(page), page);
 		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
 
-		/* cannot use mapcount: can't collapse if there's a gup pin */
-		if (page_count(page) != 1)
-			goto out;
 		/*
 		 * We can do it before isolate_lru_page because the
 		 * page can't be freed from under us. NOTE: PG_lock
@@ -2179,6 +2176,31 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		 */
 		if (!trylock_page(page))
 			goto out;
+
+		/*
+		 * cannot use mapcount: can't collapse if there's a gup pin.
+		 * The page must only be referenced by the scanned process
+		 * and page swap cache.
+		 */
+		if (page_count(page) != 1 + !!PageSwapCache(page)) {
+			unlock_page(page);
+			goto out;
+		}
+		if (!pte_write(pteval)) {
+			if (++ro > khugepaged_max_ptes_none) {
+				unlock_page(page);
+				goto out;
+			}
+			if (PageSwapCache(page) && !reuse_swap_page(page)) {
+				unlock_page(page);
+				goto out;
+			}
+			/*
+			 * Page is not in the swap cache, and page count is
+			 * one (see above). It can be collapsed into a THP.
+			 */
+		}
+
 		/*
 		 * Isolate the page to avoid collapsing an hugepage
 		 * currently in use by the VM.
@@ -2550,7 +2572,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
-	int ret = 0, referenced = 0, none = 0;
+	int ret = 0, referenced = 0, none = 0, ro = 0;
 	struct page *page;
 	unsigned long _address;
 	spinlock_t *ptl;
@@ -2573,8 +2595,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 			else
 				goto out_unmap;
 		}
-		if (!pte_present(pteval) || !pte_write(pteval))
+		if (!pte_present(pteval))
 			goto out_unmap;
+		if (!pte_write(pteval)) {
+			if (++ro > khugepaged_max_ptes_none)
+				goto out_unmap;
+		}
 		page = vm_normal_page(vma, _address, pteval);
 		if (unlikely(!page))
 			goto out_unmap;
@@ -2591,8 +2617,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		VM_BUG_ON_PAGE(PageCompound(page), page);
 		if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
 			goto out_unmap;
-		/* cannot use mapcount: can't collapse if there's a gup pin */
-		if (page_count(page) != 1)
+		/*
+		 * cannot use mapcount: can't collapse if there's a gup pin.
+		 * The page must only be referenced by the scanned process
+		 * and page swap cache.
+		 */
+		if (page_count(page) != 1 + !!PageSwapCache(page))
 			goto out_unmap;
 		if (pte_young(pteval) || PageReferenced(page) ||
 		    mmu_notifier_test_young(vma->vm_mm, address))
-- 
1.9.1


* Re: [PATCH v2] mm: incorporate read-only pages into transparent huge pages
From: Vlastimil Babka @ 2015-01-26  7:36 UTC (permalink / raw)
  To: Ebru Akagunduz, linux-mm
  Cc: akpm, kirill, mhocko, mgorman, rientjes, sasha.levin, hughd,
	hannes, linux-kernel, riel, aarcange

On 01/24/2015 04:38 PM, Ebru Akagunduz wrote:
> This patch aims to improve THP collapse rates, by allowing
> THP collapse in the presence of read-only ptes, like those
> left in place by do_swap_page after a read fault.
> 
> Currently khugepaged can collapse 4kB pages into a THP when
> up to khugepaged_max_ptes_none of the ptes in a 2MB range
> are pte_none. This patch applies the same limit to
> read-only ptes.
> 
> The patch was tested with a test program that allocates
> 800MB of memory, writes to it, and then sleeps. I force
> the system to swap out all but 190MB of the program by
> touching other memory. Afterwards, the test program does
> a mix of reads and writes to its memory, and the memory
> gets swapped back in.
> 
> Without the patch, only the memory that did not get
> swapped out remained in THPs, which corresponds to 24% of
> the memory of the program. The percentage did not increase
> over time.
> 
> With this patch, after 5 minutes of waiting khugepaged had
> collapsed 48% of the program's memory back into THPs.
> 
> Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> Changes in v2:
>  - Remove extra code indent
>  - Add fast path optimistic check to
>    __collapse_huge_page_isolate()

My interpretation is that the optimistic check is in khugepaged_scan_pmd() while
in __collapse_huge_page_isolate() it's protected by lock, as Andrea suggested?

>  - Add a comment explaining the page_count() check
>  - Move the page_count() check below trylock_page()
> 
> I've written down test results:
> With the patch:
> After swapped out:
> cat /proc/pid/smaps:
> Anonymous:	42804 kB
> AnonHugePages:	38912 kB
> Swap:		757200 kB
> Fraction:	90,90
> 
> cat /proc/meminfo:
> AnonPages:	1843956 kB
> AnonHugePages:	1712128 kB
> Fraction:	92,85
> 
> After swapped in:
> In a few seconds:
> cat /proc/pid/smaps:
> Anonymous:	800004 kB
> AnonHugePages:	104448 kB
> Swap:		0 kB
> Fraction:	13,05
> 
> cat /proc/meminfo:
> AnonPages:	2605728 kB
> AnonHugePages:	1777664 kB
> Fraction:	68,22
> 
> In 5 minutes:
> cat /proc/pid/smaps
> Anonymous:	800004 kB
> AnonHugePages:	389120 kB
> Swap:		0 kB
> Fraction:	48,63
> 
> cat /proc/meminfo:
> AnonPages:	2607824 kB
> AnonHugePages:	2041856 kB
> Fraction:	78,29
> 
> Without the patch:
> After swapped out:
> cat /proc/pid/smaps:
> Anonymous:      190660 kB
> AnonHugePages:  190464 kB
> Swap:           609344 kB
> Fraction:       99,89
> 
> cat /proc/meminfo:
> AnonPages:      1740456 kB
> AnonHugePages:  1667072 kB
> Fraction:       95,78
> 
> After swapped in:
> cat /proc/pid/smaps:
> Anonymous:      800004 kB
> AnonHugePages:  190464 kB
> Swap:           0 kB
> Fraction:       23,80
> 
> cat /proc/meminfo:
> AnonPages:      2350032 kB
> AnonHugePages:  1667072 kB
> Fraction:       70,93
> 
> I waited 10 minutes; the fractions
> did not change without the patch.
> 
>  mm/huge_memory.c | 48 +++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 39 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 817a875..5e3e9b9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2148,7 +2148,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  {
>  	struct page *page;
>  	pte_t *_pte;
> -	int referenced = 0, none = 0;
> +	int referenced = 0, none = 0, ro = 0;
>  	for (_pte = pte; _pte < pte+HPAGE_PMD_NR;
>  	     _pte++, address += PAGE_SIZE) {
>  		pte_t pteval = *_pte;
> @@ -2158,7 +2158,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  			else
>  				goto out;
>  		}
> -		if (!pte_present(pteval) || !pte_write(pteval))
> +		if (!pte_present(pteval))
>  			goto out;
>  		page = vm_normal_page(vma, address, pteval);
>  		if (unlikely(!page))
> @@ -2168,9 +2168,6 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  		VM_BUG_ON_PAGE(!PageAnon(page), page);
>  		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
>  
> -		/* cannot use mapcount: can't collapse if there's a gup pin */
> -		if (page_count(page) != 1)
> -			goto out;
>  		/*
>  		 * We can do it before isolate_lru_page because the
>  		 * page can't be freed from under us. NOTE: PG_lock
> @@ -2179,6 +2176,31 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  		 */
>  		if (!trylock_page(page))
>  			goto out;
> +
> +		/*
> +		 * cannot use mapcount: can't collapse if there's a gup pin.
> +		 * The page must only be referenced by the scanned process
> +		 * and page swap cache.
> +		 */
> +		if (page_count(page) != 1 + !!PageSwapCache(page)) {
> +			unlock_page(page);
> +			goto out;
> +		}
> +		if (!pte_write(pteval)) {
> +			if (++ro > khugepaged_max_ptes_none) {
> +				unlock_page(page);
> +				goto out;

So just for completeness, as I said in the v1 discussion, I think this can
leave us with a read-only VMA: consider ro == 256 and none == 256; referenced
can still be > 0 (up to 256). I think the referenced check that follows this
for loop should also verify that (ro + none < HPAGE_PMD_NR).

> +			}
> +			if (PageSwapCache(page) && !reuse_swap_page(page)) {
> +				unlock_page(page);
> +				goto out;
> +			}
> +			/*
> +			 * Page is not in the swap cache, and page count is
> +			 * one (see above). It can be collapsed into a THP.
> +			 */

I would still put the VM_BUG_ON(page_count(page) != 1) here, as I suggested
previously, especially since I think it would have been able to catch the
problem that Andrea pointed out in v1.

> +		}
> +
>  		/*
>  		 * Isolate the page to avoid collapsing an hugepage
>  		 * currently in use by the VM.
> @@ -2550,7 +2572,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  {
>  	pmd_t *pmd;
>  	pte_t *pte, *_pte;
> -	int ret = 0, referenced = 0, none = 0;
> +	int ret = 0, referenced = 0, none = 0, ro = 0;
>  	struct page *page;
>  	unsigned long _address;
>  	spinlock_t *ptl;
> @@ -2573,8 +2595,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  			else
>  				goto out_unmap;
>  		}
> -		if (!pte_present(pteval) || !pte_write(pteval))
> +		if (!pte_present(pteval))
>  			goto out_unmap;
> +		if (!pte_write(pteval)) {
> +			if (++ro > khugepaged_max_ptes_none)
> +				goto out_unmap;
> +		}
>  		page = vm_normal_page(vma, _address, pteval);
>  		if (unlikely(!page))
>  			goto out_unmap;
> @@ -2591,8 +2617,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  		VM_BUG_ON_PAGE(PageCompound(page), page);
>  		if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
>  			goto out_unmap;
> -		/* cannot use mapcount: can't collapse if there's a gup pin */
> -		if (page_count(page) != 1)
> +		/*
> +		 * cannot use mapcount: can't collapse if there's a gup pin.
> +		 * The page must only be referenced by the scanned process
> +		 * and page swap cache.
> +		 */
> +		if (page_count(page) != 1 + !!PageSwapCache(page))
>  			goto out_unmap;
>  		if (pte_young(pteval) || PageReferenced(page) ||
>  		    mmu_notifier_test_young(vma->vm_mm, address))
> 



* Re: [PATCH v2] mm: incorporate read-only pages into transparent huge pages
From: Andrea Arcangeli @ 2015-01-26 15:19 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Ebru Akagunduz, linux-mm, akpm, kirill, mhocko, mgorman,
	rientjes, sasha.levin, hughd, hannes, linux-kernel, riel

Hi,

On Mon, Jan 26, 2015 at 08:36:06AM +0100, Vlastimil Babka wrote:
> >  - Add fast path optimistic check to
> >    __collapse_huge_page_isolate()
> 
> My interpretation is that the optimistic check is in khugepaged_scan_pmd() while
> in __collapse_huge_page_isolate() it's protected by lock, as Andrea suggested?

Correct, __collapse_huge_page_isolate is the "accurate" check that was
missing in v1. The optimistic check was the one in khugepaged_scan_pmd
and it was already present.

> > @@ -2168,9 +2168,6 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >  		VM_BUG_ON_PAGE(!PageAnon(page), page);
> >  		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
> >  
> > -		/* cannot use mapcount: can't collapse if there's a gup pin */
> > -		if (page_count(page) != 1)
> > -			goto out;
> >  		/*
> >  		 * We can do it before isolate_lru_page because the
> >  		 * page can't be freed from under us. NOTE: PG_lock
> > @@ -2179,6 +2176,31 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >  		 */
> >  		if (!trylock_page(page))
> >  			goto out;
> > +
> > +		/*
> > +		 * cannot use mapcount: can't collapse if there's a gup pin.
> > +		 * The page must only be referenced by the scanned process
> > +		 * and page swap cache.
> > +		 */
> > +		if (page_count(page) != 1 + !!PageSwapCache(page)) {
> > +			unlock_page(page);
> > +			goto out;
> > +		}
> > +		if (!pte_write(pteval)) {
> > +			if (++ro > khugepaged_max_ptes_none) {
> > +				unlock_page(page);
> > +				goto out;
> 
> So just for completeness, as I said in the v1 discussion, I think this can
> leave us with a read-only VMA: consider ro == 256 and none == 256; referenced
> can still be > 0 (up to 256). I think the referenced check that follows this
> for loop should also verify that (ro + none < HPAGE_PMD_NR).

The moment "ro" becomes 256 or "none" becomes 256, we immediately goto out.

	if (likely(referenced))
		return 1;
out:
	release_pte_pages(pte, _pte);
	return 0;

"out" is past the referenced check and we fail the collapse
immediately.

If ro is < 255 then "none" is also < 255 (if pte_write is true, then
pte_none cannot be true).

Overall I don't see how we could collapse in readonly vma and where
the bug is for this case, but I may be overlooking something obvious.

> > +			}
> > +			if (PageSwapCache(page) && !reuse_swap_page(page)) {
> > +				unlock_page(page);
> > +				goto out;
> > +			}
> > +			/*
> > +			 * Page is not in the swap cache, and page count is
> > +			 * one (see above). It can be collapsed into a THP.
> > +			 */
> 
> I would still put the VM_BUG_ON(page_count(page) != 1) here, as I suggested
> previously, especially since I think it would have been able to catch the
> problem that Andrea pointed out in v1.

This is subtle, but we can't do VM_BUG_ON because it's ok if the VM
gets there before us on another CPU and takes a pin on the page to
isolate it.

In short if you changed the current upstream code like below:

		/* cannot use mapcount: can't collapse if there's a gup pin */
		if (page_count(page) != 1)
			goto out;
		VM_BUG_ON(page_count(page) != 1);
		trylock...

the VM_BUG_ON could fire as a false positive. It's not even related to
the trylock_page; it's related to the LRU lock and isolate_lru_page,
which the VM could run on a different CPU, and that's not a bug.

The VM is free to pin the page. We only need to be sure there are no
GUP-pins after we blocked all variants of GUP and the check is enough
for that.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] mm: incorporate read-only pages into transparent huge pages
@ 2015-01-26 15:19     ` Andrea Arcangeli
  0 siblings, 0 replies; 8+ messages in thread
From: Andrea Arcangeli @ 2015-01-26 15:19 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Ebru Akagunduz, linux-mm, akpm, kirill, mhocko, mgorman,
	rientjes, sasha.levin, hughd, hannes, linux-kernel, riel

Hi,

On Mon, Jan 26, 2015 at 08:36:06AM +0100, Vlastimil Babka wrote:
> >  - Add fast path optimistic check to
> >    __collapse_huge_page_isolate()
> 
> My interpretation is that the optimistic check is in khugepaged_scan_pmd() while
> in __collapse_huge_page_isolate() it's protected by lock, as Andrea suggested?

Correct, __collapse_huge_page_isolate is the "accurate" check that was
missing in v1. The optimistic check was the one in khugepaged_scan_pmd
and it was already present.

> > @@ -2168,9 +2168,6 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >  		VM_BUG_ON_PAGE(!PageAnon(page), page);
> >  		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
> >  
> > -		/* cannot use mapcount: can't collapse if there's a gup pin */
> > -		if (page_count(page) != 1)
> > -			goto out;
> >  		/*
> >  		 * We can do it before isolate_lru_page because the
> >  		 * page can't be freed from under us. NOTE: PG_lock
> > @@ -2179,6 +2176,31 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >  		 */
> >  		if (!trylock_page(page))
> >  			goto out;
> > +
> > +		/*
> > +		 * cannot use mapcount: can't collapse if there's a gup pin.
> > +		 * The page must only be referenced by the scanned process
> > +		 * and page swap cache.
> > +		 */
> > +		if (page_count(page) != 1 + !!PageSwapCache(page)) {
> > +			unlock_page(page);
> > +			goto out;
> > +		}
> > +		if (!pte_write(pteval)) {
> > +			if (++ro > khugepaged_max_ptes_none) {
> > +				unlock_page(page);
> > +				goto out;
> 
> So just for completeness, as I said later for v1 I think this can leave us with
> read-only VMA: consider ro == 256 and none == 256, referenced can still be >0
> (up to 256). I think that the check for referenced that follows this for loop
> should also check if (ro + none < HPAGE_PMD_NR).

The moment "ro" becomes 256 or "none" becomes 256, we immediately goto out.

	if (likely(referenced))
		return 1;
out:
	release_pte_pages(pte, _pte);
	return 0;

"out" is past the referenced check and we fail the collapse
immediately.

If ro is < 255 then "none" is also < 255 (if pte_write is true, then
pte_none cannot be true).

Overall I don't see how we could collapse in readonly vma and where
the bug is for this case, but I may be overlooking something obvious.

> > +			}
> > +			if (PageSwapCache(page) && !reuse_swap_page(page)) {
> > +				unlock_page(page);
> > +				goto out;
> > +			}
> > +			/*
> > +			 * Page is not in the swap cache, and page count is
> > +			 * one (see above). It can be collapsed into a THP.
> > +			 */
> 
> I would still put the VM_BUG_ON(page_count(page) != 1) here as I suggested
> previously. Even more so that I think it would have been able to catch the
> problem that Andrea pointed out in v1.

This is subtle, but we can't do the VM_BUG_ON, because it's ok if the
VM gets there before us on another CPU and takes a pin on the page in
order to isolate it.

In short if you changed the current upstream code like below:

		/* cannot use mapcount: can't collapse if there's a gup pin */
		if (page_count(page) != 1)
			goto out;
		VM_BUG_ON(page_count(page) != 1);
		trylock...

the VM_BUG_ON could fire as a false positive. It's not even related to
trylock_page; it's related to the LRU lock and to isolate_lru_page,
which the VM could run on a different CPU, and that's not a bug.

The VM is free to pin the page. We only need to be sure there are no
GUP pins left after we've blocked all variants of GUP, and the check
is enough for that.

Thanks,
Andrea
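For readers following the thread, the early-exit flow described above can be modeled as a standalone sketch. This is plain C with simplified stand-ins (`pte_kind`, `isolate_model`, an elided page_count/swapcache step), not the kernel code itself:

```c
#include <assert.h>

#define HPAGE_PMD_NR  512   /* ptes covered by one 2MB pmd */
#define MAX_PTES_NONE 511   /* khugepaged_max_ptes_none default */

enum pte_kind { PTE_NONE, PTE_RO, PTE_RW };

/*
 * Simplified model of __collapse_huge_page_isolate(): every early
 * failure jumps past the "referenced" check, so the collapse fails
 * immediately -- no path reaches "return 1" after "goto out",
 * mirroring the real out: label.
 */
static int isolate_model(const enum pte_kind *pte, int referenced)
{
	int none = 0, ro = 0, i;

	for (i = 0; i < HPAGE_PMD_NR; i++) {
		if (pte[i] == PTE_NONE) {
			if (++none > MAX_PTES_NONE)
				goto out;
			continue;
		}
		if (pte[i] == PTE_RO && ++ro > MAX_PTES_NONE)
			goto out;
		/* page_count()/swapcache checks elided in this model */
	}
	if (referenced)
		return 1;
out:
	return 0;	/* release_pte_pages() and fail the collapse */
}
```

Note that with Vlastimil's numbers (ro == 256, none == 256) this model still returns 1 whenever referenced > 0, since neither counter exceeds the threshold on its own.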


* Re: [PATCH v2] mm: incorporate read-only pages into transparent huge pages
  2015-01-26 15:19     ` Andrea Arcangeli
@ 2015-01-26 15:25       ` Andrea Arcangeli
  -1 siblings, 0 replies; 8+ messages in thread
From: Andrea Arcangeli @ 2015-01-26 15:25 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Ebru Akagunduz, linux-mm, akpm, kirill, mhocko, mgorman,
	rientjes, sasha.levin, hughd, hannes, linux-kernel, riel

On Mon, Jan 26, 2015 at 04:19:06PM +0100, Andrea Arcangeli wrote:
> Overall I don't see how we could collapse in readonly vma and where
> the bug is for this case, but I may be overlooking something obvious.

I just realized what the problem was: "ro" is not the total
number of readonly ptes mapped by the pmd, because we don't count
the none ptes as readonly as well.

It's missing a "ro" increase, or an equivalent adjustment:

		if (pte_none(pteval)) {
+			ro++;
			if (++none <= khugepaged_max_ptes_none)
[..]

Andrea
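The effect of that one-line adjustment can be checked with a standalone model. This is plain C with made-up helper names (`scan_allows_v2`, `scan_allows_fixed`) and the default khugepaged_max_ptes_none of 511, not the kernel code itself:

```c
#include <assert.h>
#include <stdbool.h>

#define HPAGE_PMD_NR  512
#define MAX_PTES_NONE 511   /* khugepaged_max_ptes_none default */

/* v2 counting: "none" and "ro" tracked independently. */
static bool scan_allows_v2(int none, int ro)
{
	return none <= MAX_PTES_NONE && ro <= MAX_PTES_NONE;
}

/* With the adjustment: pte_none ptes also count toward "ro",
 * so a pmd with no writable pte at all can no longer collapse. */
static bool scan_allows_fixed(int none, int ro)
{
	return none <= MAX_PTES_NONE && ro + none <= MAX_PTES_NONE;
}
```

Vlastimil's scenario (ro == 256, none == 256, no writable pte) passes the v2 check but is rejected once the none ptes are counted as readonly too.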



end of thread (newest: 2015-01-26 16:02 UTC)

Thread overview: 8+ messages
2015-01-24 15:38 [PATCH v2] mm: incorporate read-only pages into transparent huge pages Ebru Akagunduz
2015-01-26  7:36 ` Vlastimil Babka
2015-01-26 15:19   ` Andrea Arcangeli
2015-01-26 15:25     ` Andrea Arcangeli
