* [PATCH] mm: incorporate read-only pages into transparent huge pages
@ 2015-01-23  7:47 ` Ebru Akagunduz
  0 siblings, 0 replies; 20+ messages in thread
From: Ebru Akagunduz @ 2015-01-23  7:47 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, kirill, mhocko, mgorman, rientjes, sasha.levin, hughd,
	hannes, vbabka, linux-kernel, riel, aarcange, Ebru Akagunduz

This patch aims to improve THP collapse rates, by allowing
THP collapse in the presence of read-only ptes, like those
left in place by do_swap_page after a read fault.

Currently THP can collapse 4kB pages into a THP when
there are up to khugepaged_max_ptes_none pte_none ptes
in a 2MB range. This patch applies the same limit for
read-only ptes.

The patch was tested with a test program that allocates
800MB of memory, writes to it, and then sleeps. I force
the system to swap out all but 190MB of the program by
touching other memory. Afterwards, the test program does
a mix of reads and writes to its memory, and the memory
gets swapped back in.

Without the patch, only the memory that did not get
swapped out remained in THPs, which corresponds to 24% of
the memory of the program. The percentage did not increase
over time.

With this patch, after 5 minutes of waiting khugepaged had
collapsed 55% of the program's memory back into THPs.

Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
I've written down test results:

With the patch:
After swapped out:
cat /proc/pid/smaps:
Anonymous:      100352 kB
AnonHugePages:  98304 kB
Swap:           699652 kB
Fraction:       97,95

cat /proc/meminfo:
AnonPages:      1763732 kB
AnonHugePages:  1716224 kB
Fraction:       97,30

After swapped in:
In a few seconds:
cat /proc/pid/smaps
Anonymous:      800004 kB
AnonHugePages:  235520 kB
Swap:           0 kB
Fraction:       29,43

cat /proc/meminfo:
AnonPages:      2464336 kB
AnonHugePages:  1853440 kB
Fraction:       75,21

In five minutes:
cat /proc/pid/smaps:
Anonymous:      800004 kB
AnonHugePages:  440320 kB
Swap:           0 kB
Fraction:       55,0

cat /proc/meminfo:
AnonPages:      2464340
AnonHugePages:  2058240
Fraction:       83,52

Without the patch:
After swapped out:
cat /proc/pid/smaps:
Anonymous:      190660 kB
AnonHugePages:  190464 kB
Swap:           609344 kB
Fraction:       99,89

cat /proc/meminfo:
AnonPages:      1740456 kB
AnonHugePages:  1667072 kB
Fraction:       95,78

After swapped in:
cat /proc/pid/smaps:
Anonymous:      800004 kB
AnonHugePages:  190464 kB
Swap:           0 kB
Fraction:       23,80

cat /proc/meminfo:
AnonPages:      2350032 kB
AnonHugePages:  1667072 kB
Fraction:       70,93

I waited 10 minutes; the fractions
did not change without the patch.

 mm/huge_memory.c | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 817a875..af750d9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2158,7 +2158,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			else
 				goto out;
 		}
-		if (!pte_present(pteval) || !pte_write(pteval))
+		if (!pte_present(pteval))
 			goto out;
 		page = vm_normal_page(vma, address, pteval);
 		if (unlikely(!page))
@@ -2169,7 +2169,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
 
 		/* cannot use mapcount: can't collapse if there's a gup pin */
-		if (page_count(page) != 1)
+		if (page_count(page) != 1 + !!PageSwapCache(page))
 			goto out;
 		/*
 		 * We can do it before isolate_lru_page because the
@@ -2179,6 +2179,17 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		 */
 		if (!trylock_page(page))
 			goto out;
+		if (!pte_write(pteval)) {
+			if (PageSwapCache(page) && !reuse_swap_page(page)) {
+					unlock_page(page);
+					goto out;
+			}
+			/*
+			 * Page is not in the swap cache, and page count is
+			 * one (see above). It can be collapsed into a THP.
+			 */
+		}
+
 		/*
 		 * Isolate the page to avoid collapsing an hugepage
 		 * currently in use by the VM.
@@ -2550,7 +2561,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
-	int ret = 0, referenced = 0, none = 0;
+	int ret = 0, referenced = 0, none = 0, ro = 0;
 	struct page *page;
 	unsigned long _address;
 	spinlock_t *ptl;
@@ -2573,8 +2584,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 			else
 				goto out_unmap;
 		}
-		if (!pte_present(pteval) || !pte_write(pteval))
+		if (!pte_present(pteval))
 			goto out_unmap;
+		if (!pte_write(pteval)) {
+			if (++ro > khugepaged_max_ptes_none)
+				goto out_unmap;
+		}
 		page = vm_normal_page(vma, _address, pteval);
 		if (unlikely(!page))
 			goto out_unmap;
@@ -2592,7 +2607,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
 			goto out_unmap;
 		/* cannot use mapcount: can't collapse if there's a gup pin */
-		if (page_count(page) != 1)
+		if (page_count(page) != 1 + !!PageSwapCache(page))
 			goto out_unmap;
 		if (pte_young(pteval) || PageReferenced(page) ||
 		    mmu_notifier_test_young(vma->vm_mm, address))
-- 
1.9.1



* Re: [PATCH] mm: incorporate read-only pages into transparent huge pages
  2015-01-23  7:47 ` Ebru Akagunduz
@ 2015-01-23 11:37   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 20+ messages in thread
From: Kirill A. Shutemov @ 2015-01-23 11:37 UTC (permalink / raw)
  To: Ebru Akagunduz
  Cc: linux-mm, akpm, mhocko, mgorman, rientjes, sasha.levin, hughd,
	hannes, vbabka, linux-kernel, riel, aarcange

On Fri, Jan 23, 2015 at 09:47:36AM +0200, Ebru Akagunduz wrote:
> This patch aims to improve THP collapse rates, by allowing
> THP collapse in the presence of read-only ptes, like those
> left in place by do_swap_page after a read fault.
> 
> Currently THP can collapse 4kB pages into a THP when
> there are up to khugepaged_max_ptes_none pte_none ptes
> in a 2MB range. This patch applies the same limit for
> read-only ptes.
> 
> The patch was tested with a test program that allocates
> 800MB of memory, writes to it, and then sleeps. I force
> the system to swap out all but 190MB of the program by
> touching other memory. Afterwards, the test program does
> a mix of reads and writes to its memory, and the memory
> gets swapped back in.
> 
> Without the patch, only the memory that did not get
> swapped out remained in THPs, which corresponds to 24% of
> the memory of the program. The percentage did not increase
> over time.
> 
> With this patch, after 5 minutes of waiting khugepaged had
> collapsed 55% of the program's memory back into THPs.
> 
> Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> ---
> I've written down test results:
> 
> With the patch:
> After swapped out:
> cat /proc/pid/smaps:
> Anonymous:      100352 kB
> AnonHugePages:  98304 kB
> Swap:           699652 kB
> Fraction:       97,95
> 
> cat /proc/meminfo:
> AnonPages:      1763732 kB
> AnonHugePages:  1716224 kB
> Fraction:       97,30
> 
> After swapped in:
> In a few seconds:
> cat /proc/pid/smaps
> Anonymous:      800004 kB
> AnonHugePages:  235520 kB
> Swap:           0 kB
> Fraction:       29,43
> 
> cat /proc/meminfo:
> AnonPages:      2464336 kB
> AnonHugePages:  1853440 kB
> Fraction:       75,21
> 
> In five minutes:
> cat /proc/pid/smaps:
> Anonymous:      800004 kB
> AnonHugePages:  440320 kB
> Swap:           0 kB
> Fraction:       55,0
> 
> cat /proc/meminfo:
> AnonPages:      2464340
> AnonHugePages:  2058240
> Fraction:       83,52
> 
> Without the patch:
> After swapped out:
> cat /proc/pid/smaps:
> Anonymous:      190660 kB
> AnonHugePages:  190464 kB
> Swap:           609344 kB
> Fraction:       99,89
> 
> cat /proc/meminfo:
> AnonPages:      1740456 kB
> AnonHugePages:  1667072 kB
> Fraction:       95,78
> 
> After swapped in:
> cat /proc/pid/smaps:
> Anonymous:      800004 kB
> AnonHugePages:  190464 kB
> Swap:           0 kB
> Fraction:       23,80
> 
> cat /proc/meminfo:
> AnonPages:      2350032 kB
> AnonHugePages:  1667072 kB
> Fraction:       70,93
> 
> I waited 10 minutes; the fractions
> did not change without the patch.
> 
>  mm/huge_memory.c | 25 ++++++++++++++++++++-----
>  1 file changed, 20 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 817a875..af750d9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2158,7 +2158,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  			else
>  				goto out;
>  		}
> -		if (!pte_present(pteval) || !pte_write(pteval))
> +		if (!pte_present(pteval))
>  			goto out;
>  		page = vm_normal_page(vma, address, pteval);
>  		if (unlikely(!page))
> @@ -2169,7 +2169,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
>  
>  		/* cannot use mapcount: can't collapse if there's a gup pin */
> -		if (page_count(page) != 1)
> +		if (page_count(page) != 1 + !!PageSwapCache(page))
>  			goto out;
>  		/*
>  		 * We can do it before isolate_lru_page because the
> @@ -2179,6 +2179,17 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  		 */
>  		if (!trylock_page(page))
>  			goto out;
> +		if (!pte_write(pteval)) {
> +			if (PageSwapCache(page) && !reuse_swap_page(page)) {
> +					unlock_page(page);
> +					goto out;
> +			}
> +			/*
> +			 * Page is not in the swap cache, and page count is
> +			 * one (see above). It can be collapsed into a THP.
> +			 */
> +		}

Hm. As a side effect it will effectively allow collapse in PROT_READ vmas,
right? I'm not convinced it's a good idea.

-- 
 Kirill A. Shutemov


* Re: [PATCH] mm: incorporate read-only pages into transparent huge pages
  2015-01-23 11:37   ` Kirill A. Shutemov
@ 2015-01-23 14:57     ` Rik van Riel
  -1 siblings, 0 replies; 20+ messages in thread
From: Rik van Riel @ 2015-01-23 14:57 UTC (permalink / raw)
  To: Kirill A. Shutemov, Ebru Akagunduz
  Cc: linux-mm, akpm, mhocko, mgorman, rientjes, sasha.levin, hughd,
	hannes, vbabka, linux-kernel, aarcange

On 01/23/2015 06:37 AM, Kirill A. Shutemov wrote:
> On Fri, Jan 23, 2015 at 09:47:36AM +0200, Ebru Akagunduz wrote:
>> This patch aims to improve THP collapse rates, by allowing
>> THP collapse in the presence of read-only ptes, like those
>> left in place by do_swap_page after a read fault.
>>
>> Currently THP can collapse 4kB pages into a THP when
>> there are up to khugepaged_max_ptes_none pte_none ptes
>> in a 2MB range. This patch applies the same limit for
>> read-only ptes.

>> @@ -2179,6 +2179,17 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>  		 */
>>  		if (!trylock_page(page))
>>  			goto out;
>> +		if (!pte_write(pteval)) {
>> +			if (PageSwapCache(page) && !reuse_swap_page(page)) {
>> +					unlock_page(page);
>> +					goto out;
>> +			}
>> +			/*
>> +			 * Page is not in the swap cache, and page count is
>> +			 * one (see above). It can be collapsed into a THP.
>> +			 */
>> +		}
> 
> Hm. As a side effect it will effectevely allow collapse in PROT_READ vmas,
> right? I'm not convinced it's a good idea.

It will only allow a THP collapse if there is at least one
read-write pte.

I suspect that excludes read-only VMAs automatically.


* Re: [PATCH] mm: incorporate read-only pages into transparent huge pages
  2015-01-23 14:57     ` Rik van Riel
@ 2015-01-23 15:58       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 20+ messages in thread
From: Kirill A. Shutemov @ 2015-01-23 15:58 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Ebru Akagunduz, linux-mm, akpm, mhocko, mgorman, rientjes,
	sasha.levin, hughd, hannes, vbabka, linux-kernel, aarcange

On Fri, Jan 23, 2015 at 09:57:03AM -0500, Rik van Riel wrote:
> On 01/23/2015 06:37 AM, Kirill A. Shutemov wrote:
> > On Fri, Jan 23, 2015 at 09:47:36AM +0200, Ebru Akagunduz wrote:
> >> This patch aims to improve THP collapse rates, by allowing
> >> THP collapse in the presence of read-only ptes, like those
> >> left in place by do_swap_page after a read fault.
> >>
> >> Currently THP can collapse 4kB pages into a THP when
> >> there are up to khugepaged_max_ptes_none pte_none ptes
> >> in a 2MB range. This patch applies the same limit for
> >> read-only ptes.
> 
> >> @@ -2179,6 +2179,17 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >>  		 */
> >>  		if (!trylock_page(page))
> >>  			goto out;
> >> +		if (!pte_write(pteval)) {
> >> +			if (PageSwapCache(page) && !reuse_swap_page(page)) {
> >> +					unlock_page(page);
> >> +					goto out;
> >> +			}
> >> +			/*
> >> +			 * Page is not in the swap cache, and page count is
> >> +			 * one (see above). It can be collapsed into a THP.
> >> +			 */
> >> +		}
> > 
> > Hm. As a side effect it will effectevely allow collapse in PROT_READ vmas,
> > right? I'm not convinced it's a good idea.
> 
> It will only allow a THP collapse if there is at least one
> read-write pte.
> 
> I suspect that excludes read-only VMAs automatically.

Ah. Okay. I missed that condition.

Looks good to me.

-- 
 Kirill A. Shutemov


* Re: [PATCH] mm: incorporate read-only pages into transparent huge pages
  2015-01-23  7:47 ` Ebru Akagunduz
@ 2015-01-23 16:12   ` Vlastimil Babka
  -1 siblings, 0 replies; 20+ messages in thread
From: Vlastimil Babka @ 2015-01-23 16:12 UTC (permalink / raw)
  To: Ebru Akagunduz, linux-mm
  Cc: akpm, kirill, mhocko, mgorman, rientjes, sasha.levin, hughd,
	hannes, linux-kernel, riel, aarcange

On 01/23/2015 08:47 AM, Ebru Akagunduz wrote:
> This patch aims to improve THP collapse rates, by allowing
> THP collapse in the presence of read-only ptes, like those
> left in place by do_swap_page after a read fault.

Any other examples? What about zero pages?

> Currently THP can collapse 4kB pages into a THP when
> there are up to khugepaged_max_ptes_none pte_none ptes
> in a 2MB range. This patch applies the same limit for
> read-only ptes.
>
> The patch was tested with a test program that allocates
> 800MB of memory, writes to it, and then sleeps. I force
> the system to swap out all but 190MB of the program by
> touching other memory. Afterwards, the test program does
> a mix of reads and writes to its memory, and the memory
> gets swapped back in.
>
> Without the patch, only the memory that did not get
> swapped out remained in THPs, which corresponds to 24% of
> the memory of the program. The percentage did not increase
> over time.
>
> With this patch, after 5 minutes of waiting khugepaged had
> collapsed 55% of the program's memory back into THPs.
>
> Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
> Reviewed-by: Rik van Riel <riel@redhat.com>

Sounds like a good idea.
Acked-by: Vlastimil Babka <vbabka@suse.cz>
nits below:

> ---
> I've written down test results:
>
> With the patch:
> After swapped out:
> cat /proc/pid/smaps:
> Anonymous:      100352 kB
> AnonHugePages:  98304 kB
> Swap:           699652 kB
> Fraction:       97,95
>
> cat /proc/meminfo:
> AnonPages:      1763732 kB
> AnonHugePages:  1716224 kB
> Fraction:       97,30
>
> After swapped in:
> In a few seconds:
> cat /proc/pid/smaps
> Anonymous:      800004 kB
> AnonHugePages:  235520 kB
> Swap:           0 kB
> Fraction:       29,43
>
> cat /proc/meminfo:
> AnonPages:      2464336 kB
> AnonHugePages:  1853440 kB
> Fraction:       75,21
>
> In five minutes:
> cat /proc/pid/smaps:
> Anonymous:      800004 kB
> AnonHugePages:  440320 kB
> Swap:           0 kB
> Fraction:       55,0
>
> cat /proc/meminfo:
> AnonPages:      2464340
> AnonHugePages:  2058240
> Fraction:       83,52
>
> Without the patch:
> After swapped out:
> cat /proc/pid/smaps:
> Anonymous:      190660 kB
> AnonHugePages:  190464 kB
> Swap:           609344 kB
> Fraction:       99,89
>
> cat /proc/meminfo:
> AnonPages:      1740456 kB
> AnonHugePages:  1667072 kB
> Fraction:       95,78
>
> After swapped in:
> cat /proc/pid/smaps:
> Anonymous:      800004 kB
> AnonHugePages:  190464 kB
> Swap:           0 kB
> Fraction:       23,80
>
> cat /proc/meminfo:
> AnonPages:      2350032 kB
> AnonHugePages:  1667072 kB
> Fraction:       70,93
>
> I waited 10 minutes; the fractions
> did not change without the patch.
>
>   mm/huge_memory.c | 25 ++++++++++++++++++++-----
>   1 file changed, 20 insertions(+), 5 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 817a875..af750d9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2158,7 +2158,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>   			else
>   				goto out;
>   		}
> -		if (!pte_present(pteval) || !pte_write(pteval))
> +		if (!pte_present(pteval))
>   			goto out;
>   		page = vm_normal_page(vma, address, pteval);
>   		if (unlikely(!page))
> @@ -2169,7 +2169,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>   		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
>
>   		/* cannot use mapcount: can't collapse if there's a gup pin */
> -		if (page_count(page) != 1)
> +		if (page_count(page) != 1 + !!PageSwapCache(page))

Took me a while to grok this !!PageSwapCache(page) part. Perhaps expand 
the comment?

>   			goto out;
>   		/*
>   		 * We can do it before isolate_lru_page because the
> @@ -2179,6 +2179,17 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>   		 */
>   		if (!trylock_page(page))
>   			goto out;
> +		if (!pte_write(pteval)) {
> +			if (PageSwapCache(page) && !reuse_swap_page(page)) {
> +					unlock_page(page);
> +					goto out;

Too much indent on the 2 lines above.

> +			}
> +			/*
> +			 * Page is not in the swap cache, and page count is
> +			 * one (see above). It can be collapsed into a THP.
> +			 */

Such comment sounds like a good place for:

			VM_BUG_ON(page_count(page) != 1));

> +		}
> +
>   		/*
>   		 * Isolate the page to avoid collapsing an hugepage
>   		 * currently in use by the VM.
> @@ -2550,7 +2561,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>   {
>   	pmd_t *pmd;
>   	pte_t *pte, *_pte;
> -	int ret = 0, referenced = 0, none = 0;
> +	int ret = 0, referenced = 0, none = 0, ro = 0;
>   	struct page *page;
>   	unsigned long _address;
>   	spinlock_t *ptl;
> @@ -2573,8 +2584,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>   			else
>   				goto out_unmap;
>   		}
> -		if (!pte_present(pteval) || !pte_write(pteval))
> +		if (!pte_present(pteval))
>   			goto out_unmap;
> +		if (!pte_write(pteval)) {
> +			if (++ro > khugepaged_max_ptes_none)
> +				goto out_unmap;
> +		}
>   		page = vm_normal_page(vma, _address, pteval);
>   		if (unlikely(!page))
>   			goto out_unmap;
> @@ -2592,7 +2607,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>   		if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
>   			goto out_unmap;
>   		/* cannot use mapcount: can't collapse if there's a gup pin */
> -		if (page_count(page) != 1)
> +		if (page_count(page) != 1 + !!PageSwapCache(page))

Same as above. Even more so, as there's no other page swap cache 
handling code in this function.

Thanks.

>   			goto out_unmap;
>   		if (pte_young(pteval) || PageReferenced(page) ||
>   		    mmu_notifier_test_young(vma->vm_mm, address))
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm: incorporate read-only pages into transparent huge pages
@ 2015-01-23 16:12   ` Vlastimil Babka
  0 siblings, 0 replies; 20+ messages in thread
From: Vlastimil Babka @ 2015-01-23 16:12 UTC (permalink / raw)
  To: Ebru Akagunduz, linux-mm
  Cc: akpm, kirill, mhocko, mgorman, rientjes, sasha.levin, hughd,
	hannes, linux-kernel, riel, aarcange

On 01/23/2015 08:47 AM, Ebru Akagunduz wrote:
> This patch aims to improve THP collapse rates, by allowing
> THP collapse in the presence of read-only ptes, like those
> left in place by do_swap_page after a read fault.

Any other examples? What about zero pages?

> Currently THP can collapse 4kB pages into a THP when
> there are up to khugepaged_max_ptes_none pte_none ptes
> in a 2MB range. This patch applies the same limit for
> read-only ptes.
>
> The patch was tested with a test program that allocates
> 800MB of memory, writes to it, and then sleeps. I force
> the system to swap out all but 190MB of the program by
> touching other memory. Afterwards, the test program does
> a mix of reads and writes to its memory, and the memory
> gets swapped back in.
>
> Without the patch, only the memory that did not get
> swapped out remained in THPs, which corresponds to 24% of
> the memory of the program. The percentage did not increase
> over time.
>
> With this patch, after 5 minutes of waiting khugepaged had
> collapsed 55% of the program's memory back into THPs.
>
> Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
> Reviewed-by: Rik van Riel <riel@redhat.com>

Sounds like a good idea.
Acked-by: Vlastimil Babka <vbabka@suse.cz>
nits below:

> ---
> I've written down test results:
>
> With the patch:
> After swapped out:
> cat /proc/pid/smaps:
> Anonymous:      100352 kB
> AnonHugePages:  98304 kB
> Swap:           699652 kB
> Fraction:       97,95
>
> cat /proc/meminfo:
> AnonPages:      1763732 kB
> AnonHugePages:  1716224 kB
> Fraction:       97,30
>
> After swapped in:
> In a few seconds:
> cat /proc/pid/smaps
> Anonymous:      800004 kB
> AnonHugePages:  235520 kB
> Swap:           0 kB
> Fraction:       29,43
>
> cat /proc/meminfo:
> AnonPages:      2464336 kB
> AnonHugePages:  1853440 kB
> Fraction:       75,21
>
> In five minutes:
> cat /proc/pid/smaps:
> Anonymous:      800004 kB
> AnonHugePages:  440320 kB
> Swap:           0 kB
> Fraction:       55,0
>
> cat /proc/meminfo:
> AnonPages:      2464340
> AnonHugePages:  2058240
> Fraction:       83,52
>
> Without the patch:
> After swapped out:
> cat /proc/pid/smaps:
> Anonymous:      190660 kB
> AnonHugePages:  190464 kB
> Swap:           609344 kB
> Fraction:       99,89
>
> cat /proc/meminfo:
> AnonPages:      1740456 kB
> AnonHugePages:  1667072 kB
> Fraction:       95,78
>
> After swapped in:
> cat /proc/pid/smaps:
> Anonymous:      800004 kB
> AnonHugePages:  190464 kB
> Swap:           0 kB
> Fraction:       23,80
>
> cat /proc/meminfo:
> AnonPages:      2350032 kB
> AnonHugePages:  1667072 kB
> Fraction:       70,93
>
> I waited 10 minutes; the fractions
> did not change without the patch.
>
>   mm/huge_memory.c | 25 ++++++++++++++++++++-----
>   1 file changed, 20 insertions(+), 5 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 817a875..af750d9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2158,7 +2158,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>   			else
>   				goto out;
>   		}
> -		if (!pte_present(pteval) || !pte_write(pteval))
> +		if (!pte_present(pteval))
>   			goto out;
>   		page = vm_normal_page(vma, address, pteval);
>   		if (unlikely(!page))
> @@ -2169,7 +2169,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>   		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
>
>   		/* cannot use mapcount: can't collapse if there's a gup pin */
> -		if (page_count(page) != 1)
> +		if (page_count(page) != 1 + !!PageSwapCache(page))

Took me a while to grok this !!PageSwapCache(page) part. Perhaps expand 
the comment?

>   			goto out;
>   		/*
>   		 * We can do it before isolate_lru_page because the
> @@ -2179,6 +2179,17 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>   		 */
>   		if (!trylock_page(page))
>   			goto out;
> +		if (!pte_write(pteval)) {
> +			if (PageSwapCache(page) && !reuse_swap_page(page)) {
> +					unlock_page(page);
> +					goto out;

Too much indent on the 2 lines above.

> +			}
> +			/*
> +			 * Page is not in the swap cache, and page count is
> +			 * one (see above). It can be collapsed into a THP.
> +			 */

Such comment sounds like a good place for:

			VM_BUG_ON(page_count(page) != 1);

> +		}
> +
>   		/*
>   		 * Isolate the page to avoid collapsing an hugepage
>   		 * currently in use by the VM.
> @@ -2550,7 +2561,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>   {
>   	pmd_t *pmd;
>   	pte_t *pte, *_pte;
> -	int ret = 0, referenced = 0, none = 0;
> +	int ret = 0, referenced = 0, none = 0, ro = 0;
>   	struct page *page;
>   	unsigned long _address;
>   	spinlock_t *ptl;
> @@ -2573,8 +2584,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>   			else
>   				goto out_unmap;
>   		}
> -		if (!pte_present(pteval) || !pte_write(pteval))
> +		if (!pte_present(pteval))
>   			goto out_unmap;
> +		if (!pte_write(pteval)) {
> +			if (++ro > khugepaged_max_ptes_none)
> +				goto out_unmap;
> +		}
>   		page = vm_normal_page(vma, _address, pteval);
>   		if (unlikely(!page))
>   			goto out_unmap;
> @@ -2592,7 +2607,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>   		if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
>   			goto out_unmap;
>   		/* cannot use mapcount: can't collapse if there's a gup pin */
> -		if (page_count(page) != 1)
> +		if (page_count(page) != 1 + !!PageSwapCache(page))

Same as above. Even more so, as there's no other page swap cache 
handling code in this function.

Thanks.

>   			goto out_unmap;
>   		if (pte_young(pteval) || PageReferenced(page) ||
>   		    mmu_notifier_test_young(vma->vm_mm, address))
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [PATCH] mm: incorporate read-only pages into transparent huge pages
  2015-01-23 16:12   ` Vlastimil Babka
@ 2015-01-23 16:15     ` Rik van Riel
  -1 siblings, 0 replies; 20+ messages in thread
From: Rik van Riel @ 2015-01-23 16:15 UTC (permalink / raw)
  To: Vlastimil Babka, Ebru Akagunduz, linux-mm
  Cc: akpm, kirill, mhocko, mgorman, rientjes, sasha.levin, hughd,
	hannes, linux-kernel, aarcange

On 01/23/2015 11:12 AM, Vlastimil Babka wrote:
> On 01/23/2015 08:47 AM, Ebru Akagunduz wrote:
>> This patch aims to improve THP collapse rates, by allowing
>> THP collapse in the presence of read-only ptes, like those
>> left in place by do_swap_page after a read fault.
> 
> Any other examples? What about zero pages?

I don't think this patch handles the zero page, due to
the reference count being higher than 1.

Handling the zero page could be a good next case to handle
in Ebru's OPW project to improve the THP collapse rate.

>> Currently THP can collapse 4kB pages into a THP when
>> there are up to khugepaged_max_ptes_none pte_none ptes
>> in a 2MB range. This patch applies the same limit for
>> read-only ptes.
>>
>> The patch was tested with a test program that allocates
>> 800MB of memory, writes to it, and then sleeps. I force
>> the system to swap out all but 190MB of the program by
>> touching other memory. Afterwards, the test program does
>> a mix of reads and writes to its memory, and the memory
>> gets swapped back in.
>>
>> Without the patch, only the memory that did not get
>> swapped out remained in THPs, which corresponds to 24% of
>> the memory of the program. The percentage did not increase
>> over time.
>>
>> With this patch, after 5 minutes of waiting khugepaged had
>> collapsed 55% of the program's memory back into THPs.
>>
>> Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
>> Reviewed-by: Rik van Riel <riel@redhat.com>
> 
> Sounds like a good idea.
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> nits below:
> 
>> ---
>> I've written down test results:
>>
>> With the patch:
>> After swapped out:
>> cat /proc/pid/smaps:
>> Anonymous:      100352 kB
>> AnonHugePages:  98304 kB
>> Swap:           699652 kB
>> Fraction:       97,95
>>
>> cat /proc/meminfo:
>> AnonPages:      1763732 kB
>> AnonHugePages:  1716224 kB
>> Fraction:       97,30
>>
>> After swapped in:
>> In a few seconds:
>> cat /proc/pid/smaps
>> Anonymous:      800004 kB
>> AnonHugePages:  235520 kB
>> Swap:           0 kB
>> Fraction:       29,43
>>
>> cat /proc/meminfo:
>> AnonPages:      2464336 kB
>> AnonHugePages:  1853440 kB
>> Fraction:       75,21
>>
>> In five minutes:
>> cat /proc/pid/smaps:
>> Anonymous:      800004 kB
>> AnonHugePages:  440320 kB
>> Swap:           0 kB
>> Fraction:       55,0
>>
>> cat /proc/meminfo:
>> AnonPages:      2464340
>> AnonHugePages:  2058240
>> Fraction:       83,52
>>
>> Without the patch:
>> After swapped out:
>> cat /proc/pid/smaps:
>> Anonymous:      190660 kB
>> AnonHugePages:  190464 kB
>> Swap:           609344 kB
>> Fraction:       99,89
>>
>> cat /proc/meminfo:
>> AnonPages:      1740456 kB
>> AnonHugePages:  1667072 kB
>> Fraction:       95,78
>>
>> After swapped in:
>> cat /proc/pid/smaps:
>> Anonymous:      800004 kB
>> AnonHugePages:  190464 kB
>> Swap:           0 kB
>> Fraction:       23,80
>>
>> cat /proc/meminfo:
>> AnonPages:      2350032 kB
>> AnonHugePages:  1667072 kB
>> Fraction:       70,93
>>
>> I waited 10 minutes; the fractions
>> did not change without the patch.
>>
>>   mm/huge_memory.c | 25 ++++++++++++++++++++-----
>>   1 file changed, 20 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 817a875..af750d9 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2158,7 +2158,7 @@ static int __collapse_huge_page_isolate(struct
>> vm_area_struct *vma,
>>               else
>>                   goto out;
>>           }
>> -        if (!pte_present(pteval) || !pte_write(pteval))
>> +        if (!pte_present(pteval))
>>               goto out;
>>           page = vm_normal_page(vma, address, pteval);
>>           if (unlikely(!page))
>> @@ -2169,7 +2169,7 @@ static int __collapse_huge_page_isolate(struct
>> vm_area_struct *vma,
>>           VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
>>
>>           /* cannot use mapcount: can't collapse if there's a gup pin */
>> -        if (page_count(page) != 1)
>> +        if (page_count(page) != 1 + !!PageSwapCache(page))
> 
> Took me a while to grok this !!PageSwapCache(page) part. Perhaps expand
> the comment?
> 
>>               goto out;
>>           /*
>>            * We can do it before isolate_lru_page because the
>> @@ -2179,6 +2179,17 @@ static int __collapse_huge_page_isolate(struct
>> vm_area_struct *vma,
>>            */
>>           if (!trylock_page(page))
>>               goto out;
>> +        if (!pte_write(pteval)) {
>> +            if (PageSwapCache(page) && !reuse_swap_page(page)) {
>> +                    unlock_page(page);
>> +                    goto out;
> 
> Too much indent on the 2 lines above.
> 
>> +            }
>> +            /*
>> +             * Page is not in the swap cache, and page count is
>> +             * one (see above). It can be collapsed into a THP.
>> +             */
> 
> Such comment sounds like a good place for:
> 
>             VM_BUG_ON(page_count(page) != 1);
> 
>> +        }
>> +
>>           /*
>>            * Isolate the page to avoid collapsing an hugepage
>>            * currently in use by the VM.
>> @@ -2550,7 +2561,7 @@ static int khugepaged_scan_pmd(struct mm_struct
>> *mm,
>>   {
>>       pmd_t *pmd;
>>       pte_t *pte, *_pte;
>> -    int ret = 0, referenced = 0, none = 0;
>> +    int ret = 0, referenced = 0, none = 0, ro = 0;
>>       struct page *page;
>>       unsigned long _address;
>>       spinlock_t *ptl;
>> @@ -2573,8 +2584,12 @@ static int khugepaged_scan_pmd(struct mm_struct
>> *mm,
>>               else
>>                   goto out_unmap;
>>           }
>> -        if (!pte_present(pteval) || !pte_write(pteval))
>> +        if (!pte_present(pteval))
>>               goto out_unmap;
>> +        if (!pte_write(pteval)) {
>> +            if (++ro > khugepaged_max_ptes_none)
>> +                goto out_unmap;
>> +        }
>>           page = vm_normal_page(vma, _address, pteval);
>>           if (unlikely(!page))
>>               goto out_unmap;
>> @@ -2592,7 +2607,7 @@ static int khugepaged_scan_pmd(struct mm_struct
>> *mm,
>>           if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
>>               goto out_unmap;
>>           /* cannot use mapcount: can't collapse if there's a gup pin */
>> -        if (page_count(page) != 1)
>> +        if (page_count(page) != 1 + !!PageSwapCache(page))
> 
> Same as above. Even more so, as there's no other page swap cache
> handling code in this function.
> 
> Thanks.
> 
>>               goto out_unmap;
>>           if (pte_young(pteval) || PageReferenced(page) ||
>>               mmu_notifier_test_young(vma->vm_mm, address))
>>
> 



* Re: [PATCH] mm: incorporate read-only pages into transparent huge pages
  2015-01-23  7:47 ` Ebru Akagunduz
@ 2015-01-23 19:04   ` Rik van Riel
  -1 siblings, 0 replies; 20+ messages in thread
From: Rik van Riel @ 2015-01-23 19:04 UTC (permalink / raw)
  To: Ebru Akagunduz, linux-mm
  Cc: akpm, kirill, mhocko, mgorman, rientjes, sasha.levin, hughd,
	hannes, vbabka, linux-kernel, aarcange

On 01/23/2015 02:47 AM, Ebru Akagunduz wrote:

> @@ -2169,7 +2169,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
>  
>  		/* cannot use mapcount: can't collapse if there's a gup pin */
> -		if (page_count(page) != 1)
> +		if (page_count(page) != 1 + !!PageSwapCache(page))
>  			goto out;
>  		/*
>  		 * We can do it before isolate_lru_page because the
> @@ -2179,6 +2179,17 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  		 */
>  		if (!trylock_page(page))
>  			goto out;
> +		if (!pte_write(pteval)) {
> +			if (PageSwapCache(page) && !reuse_swap_page(page)) {
> +					unlock_page(page);
> +					goto out;
> +			}
> +			/*
> +			 * Page is not in the swap cache, and page count is
> +			 * one (see above). It can be collapsed into a THP.
> +			 */
> +		}

Andrea pointed out a bug between the above two parts of
the patch.

Between the point where we check page_count(page) and the point
where we check whether the page got added to the swap cache, the
page count may change, putting us in a race with
get_user_pages_fast, the pageout code, etc.

It is necessary to check the page count again right after
the trylock_page(page) above, to make sure it was not changed
while the page was not yet locked.

That second check should have a comment explaining that
the first "page_count(page) != 1 + !!PageSwapCache(page)"
check could be unsafe because the page was not yet locked,
so the check needs to be repeated. Maybe something along
the lines of:

     /* Re-check the page count with the page locked */


* Re: [PATCH] mm: incorporate read-only pages into transparent huge pages
  2015-01-23  7:47 ` Ebru Akagunduz
@ 2015-01-23 19:18   ` Andrea Arcangeli
  -1 siblings, 0 replies; 20+ messages in thread
From: Andrea Arcangeli @ 2015-01-23 19:18 UTC (permalink / raw)
  To: Ebru Akagunduz
  Cc: linux-mm, akpm, kirill, mhocko, mgorman, rientjes, sasha.levin,
	hughd, hannes, vbabka, linux-kernel, riel

Hello everyone,

On Fri, Jan 23, 2015 at 09:47:36AM +0200, Ebru Akagunduz wrote:
> This patch aims to improve THP collapse rates, by allowing
> THP collapse in the presence of read-only ptes, like those
> left in place by do_swap_page after a read fault.
> 
> Currently THP can collapse 4kB pages into a THP when
> there are up to khugepaged_max_ptes_none pte_none ptes
> in a 2MB range. This patch applies the same limit for
> read-only ptes.
> 
> The patch was tested with a test program that allocates
> 800MB of memory, writes to it, and then sleeps. I force
> the system to swap out all but 190MB of the program by
> touching other memory. Afterwards, the test program does
> a mix of reads and writes to its memory, and the memory
> gets swapped back in.
> 
> Without the patch, only the memory that did not get
> swapped out remained in THPs, which corresponds to 24% of
> the memory of the program. The percentage did not increase
> over time.
> 
> With this patch, after 5 minutes of waiting khugepaged had
> collapsed 55% of the program's memory back into THPs.

This is a nice improvement, thanks!

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 817a875..af750d9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2158,7 +2158,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  			else
>  				goto out;
>  		}
> -		if (!pte_present(pteval) || !pte_write(pteval))
> +		if (!pte_present(pteval))
>  			goto out;
>  		page = vm_normal_page(vma, address, pteval);
>  		if (unlikely(!page))
> @@ -2169,7 +2169,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
>  
>  		/* cannot use mapcount: can't collapse if there's a gup pin */
> -		if (page_count(page) != 1)
> +		if (page_count(page) != 1 + !!PageSwapCache(page))
>  			goto out;
>  		/*
>  		 * We can do it before isolate_lru_page because the
> @@ -2179,6 +2179,17 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  		 */
>  		if (!trylock_page(page))
>  			goto out;

No gup pin can be taken from under us, because we hold the mmap_sem
for writing and the PT lock, and we have stopped any gup-fast with
pmdp_clear_flush.

The problem is that PageSwapCache, if read before taking the page
lock, is unstable/racy: page_count could return 2 because there's a
real gup pin, but then the page is added to the swap cache by
another CPU and we pass the check because !!PageSwapCache becomes 1
(the page count also becomes 3, but we happened to read it just a
bit earlier).

I'm not sure whether we should keep a fast-path check before
trylock_page to reduce the cacheline bouncing on the trylock
operation; we already have a fast-path check in the scanning loop
before invoking collapse_huge_page. We may simply move the check to
after trylock_page, adding a comment that the page lock is needed
for PageSwapCache to be stable.

The PageSwapCache flag (and the matching page count increase)
cannot appear or disappear from under us while we hold the page lock.

> +		if (!pte_write(pteval)) {
> +			if (++ro > khugepaged_max_ptes_none)
> +				goto out_unmap;
> +		}

It's true this is capped at 511, so there must be at least one pte
that is writable and not none (as a result of the two "ro" and
"none" counter checks).

However, this is applied only to the fast-path scanning loop (which
holds the mmap_sem for reading) that identifies candidate THPs to
collapse.

After this check, we release the mmap_sem (hidden in up_read) and then
we take it for writing. After the mmap_sem is released, all vma state
can change from under us.

So I'm afraid this check alone doesn't guarantee we won't collapse
a THP inside VM_READ vmas.

We've got two ++none checks too, for the same reason; otherwise
we'd potentially allocate a THP by mistake after a concurrent
MADV_DONTNEED (which would be even less problematic, as it would
just allocate a THP by mistake with no other side effect).

critical check:

static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
					unsigned long address,
					pte_t *pte)
		if (pte_none(pteval)) {
			if (++none <= khugepaged_max_ptes_none)
				continue;
			else
				goto out;
		}

fast path optimistic check:

static int khugepaged_scan_pmd(struct mm_struct *mm,
		pte_t pteval = *_pte;
		if (pte_none(pteval)) {
			if (++none <= khugepaged_max_ptes_none)
				continue;
			else
				goto out_unmap;

We need two of them for ++ro too, I think.

The +!!PageSwapCache addition to khugepaged_scan_pmd is instead
fine, as it's just optimistic: if we end up in collapse_huge_page
because the race hits, there's no problem (it's an incredibly
low-probability event). Only in __collapse_huge_page_isolate do we
need full accuracy.

Aside from these two points, which shouldn't be hard to adjust,
the rest looks fine!

Andrea

>  		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
>  
>  		/* cannot use mapcount: can't collapse if there's a gup pin */
> -		if (page_count(page) != 1)
> +		if (page_count(page) != 1 + !!PageSwapCache(page))
>  			goto out;
>  		/*
>  		 * We can do it before isolate_lru_page because the
> @@ -2179,6 +2179,17 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  		 */
>  		if (!trylock_page(page))
>  			goto out;

No gup pin can be taken from under us because we hold the mmap_sem for
writing and the PT lock, and we stopped any gup-fast with
pmdp_clear_flush.

The problem is that PageSwapCache, if read before taking the page
lock, is unstable/racy: page_count could return 2 because there's a
real gup pin, but then the page is added to the swapcache by another
CPU and we pass the check because !!PageSwapCache becomes 1 (page_count
also becomes 3, but we happened to read it just a bit earlier).

I'm not sure if we should keep a fast-path check before trylock_page
to reduce the cacheline bouncing on the trylock operation; we already
have a fast-path check in the scanning loop before invoking
collapse_huge_page. We may just move the check to after trylock_page,
adding a comment that the page lock is needed for PageSwapCache to be
stable.

The PageSwapCache (and matching page count increase) cannot appear or
disappear from under us if we hold the page lock.

> +		if (!pte_write(pteval)) {
> +			if (++ro > khugepaged_max_ptes_none)
> +				goto out_unmap;
> +		}

It's true this is maxed out at 511, so there must be at least one
writable, not-none pte (as a result of the two "ro" and "none"
counter checks).

However, this is applied only to the "mmap_sem held for reading"
"fast-path" scanning loop that identifies candidate THPs to collapse.

After this check, we release the mmap_sem (hidden in up_read) and then
take it for writing. Once the mmap_sem is released, all vma state can
change from under us.

So this check alone doesn't guarantee we won't collapse a THP inside
VM_READ vmas, I'm afraid.

We've got two ++none checks too, for the same reason, or we'd
potentially allocate a THP by mistake after a concurrent MADV_DONTNEED
(which would be even less problematic, as it would only waste a THP
allocation with no other side effect).

critical check:

static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
					unsigned long address,
					pte_t *pte)
		if (pte_none(pteval)) {
			if (++none <= khugepaged_max_ptes_none)
				continue;
			else
				goto out;
		}

fast path optimistic check:

static int khugepaged_scan_pmd(struct mm_struct *mm,
		pte_t pteval = *_pte;
		if (pte_none(pteval)) {
			if (++none <= khugepaged_max_ptes_none)
				continue;
			else
				goto out_unmap;

We need two of them for ++ro too, I think.

The +!!PageSwapCache addition to khugepaged_scan_pmd is instead fine,
as it's just optimistic: if we end up in collapse_huge_page because the
race hits, it's no problem (it's an incredibly low-probability event).
Only in __collapse_huge_page_isolate do we need full accuracy.

Aside from these two points which shouldn't be problematic to adjust,
the rest looks fine!

Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm: incorporate read-only pages into transparent huge pages
  2015-01-23 19:18   ` Andrea Arcangeli
@ 2015-01-25  9:25     ` Vlastimil Babka
  -1 siblings, 0 replies; 20+ messages in thread
From: Vlastimil Babka @ 2015-01-25  9:25 UTC (permalink / raw)
  To: Andrea Arcangeli, Ebru Akagunduz
  Cc: linux-mm, akpm, kirill, mhocko, mgorman, rientjes, sasha.levin,
	hughd, hannes, linux-kernel, riel

On 23.1.2015 20:18, Andrea Arcangeli wrote:
>> >+		if (!pte_write(pteval)) {
>> >+			if (++ro > khugepaged_max_ptes_none)
>> >+				goto out_unmap;
>> >+		}
> It's true this is maxed out at 511, so there must be at least one
> writable and not none pte (as results of the two "ro" and "none"
> counters checks).

Hm, but if we consider ro and pte_none separately, both can be lower
than 512, but the sum of the two can be 512, so we can actually be in
a read-only VMA?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm: incorporate read-only pages into transparent huge pages
  2015-01-25  9:25     ` Vlastimil Babka
@ 2015-01-25 14:42       ` Zhang Yanfei
  -1 siblings, 0 replies; 20+ messages in thread
From: Zhang Yanfei @ 2015-01-25 14:42 UTC (permalink / raw)
  To: Vlastimil Babka, Andrea Arcangeli, Ebru Akagunduz
  Cc: linux-mm, akpm, kirill, mhocko, mgorman, rientjes, sasha.levin,
	hughd, hannes, linux-kernel, riel

Hello

On 2015/1/25 17:25, Vlastimil Babka wrote:
> On 23.1.2015 20:18, Andrea Arcangeli wrote:
>>> >+        if (!pte_write(pteval)) {
>>> >+            if (++ro > khugepaged_max_ptes_none)
>>> >+                goto out_unmap;
>>> >+        }
>> It's true this is maxed out at 511, so there must be at least one
>> writable and not none pte (as results of the two "ro" and "none"
>> counters checks).
> 
> Hm, but if we consider ro and pte_none separately, both can be lower
> than 512, but the sum of the two can be 512, so we can actually be in
> read-only VMA?

Yes, I also think so.

So is it necessary to add an at-least-one-writable-pte check, just like
the existing at-least-one-page-referenced check?

Thanks.

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2015-01-25 14:48 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-23  7:47 [PATCH] mm: incorporate read-only pages into transparent huge pages Ebru Akagunduz
2015-01-23  7:47 ` Ebru Akagunduz
2015-01-23 11:37 ` Kirill A. Shutemov
2015-01-23 11:37   ` Kirill A. Shutemov
2015-01-23 14:57   ` Rik van Riel
2015-01-23 14:57     ` Rik van Riel
2015-01-23 15:58     ` Kirill A. Shutemov
2015-01-23 15:58       ` Kirill A. Shutemov
2015-01-23 16:12 ` Vlastimil Babka
2015-01-23 16:12   ` Vlastimil Babka
2015-01-23 16:15   ` Rik van Riel
2015-01-23 16:15     ` Rik van Riel
2015-01-23 19:04 ` Rik van Riel
2015-01-23 19:04   ` Rik van Riel
2015-01-23 19:18 ` Andrea Arcangeli
2015-01-23 19:18   ` Andrea Arcangeli
2015-01-25  9:25   ` Vlastimil Babka
2015-01-25  9:25     ` Vlastimil Babka
2015-01-25 14:42     ` Zhang Yanfei
2015-01-25 14:42       ` Zhang Yanfei
