* [bug/regression] libhugetlbfs testsuite failures and OOMs eventually kill my system
@ 2016-10-13 12:19 ` Jan Stancek
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Stancek @ 2016-10-13 12:19 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: mike.kravetz, hillf.zj, dave.hansen, kirill.shutemov, mhocko,
	n-horiguchi, aneesh.kumar, iamjoonsoo.kim

Hi,

I'm running into ENOMEM failures with the libhugetlbfs testsuite [1] on
a power8 LPAR system running 4.8 or the latest git [2]. Repeated runs of
this suite trigger multiple OOMs that eventually kill the entire system;
it usually takes 3-5 runs:

 * Total System Memory......:  18024 MB
 * Shared Mem Max Mapping...:    320 MB
 * System Huge Page Size....:     16 MB
 * Available Huge Pages.....:     20
 * Total size of Huge Pages.:    320 MB
 * Remaining System Memory..:  17704 MB
 * Huge Page User Group.....:  hugepages (1001)

I see this only on ppc (BE/LE); x86_64 seems unaffected and successfully
ran the tests for ~12 hours.

A bisect has identified the following patch as the culprit:
  commit 67961f9db8c477026ea20ce05761bde6f8bf85b0
  Author: Mike Kravetz <mike.kravetz@oracle.com>
  Date:   Wed Jun 8 15:33:42 2016 -0700
    mm/hugetlb: fix huge page reserve accounting for private mappings


The following patch (made with my limited insight), applied to the
latest git [2], fixes the problem for me:

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ec49d9e..7261583 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1876,7 +1876,7 @@ static long __vma_reservation_common(struct hstate *h,
                 * return value of this routine is the opposite of the
                 * value returned from reserve map manipulation routines above.
                 */
-               if (ret)
+               if (ret >= 0)
                        return 0;
                else
                        return 1;

Regards,
Jan

[1] https://github.com/libhugetlbfs/libhugetlbfs
[2] v4.8-14230-gb67be92

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [bug/regression] libhugetlbfs testsuite failures and OOMs eventually kill my system
  2016-10-13 12:19 ` Jan Stancek
@ 2016-10-13 15:24   ` Mike Kravetz
  -1 siblings, 0 replies; 26+ messages in thread
From: Mike Kravetz @ 2016-10-13 15:24 UTC (permalink / raw)
  To: Jan Stancek, linux-mm, linux-kernel
  Cc: hillf.zj, dave.hansen, kirill.shutemov, mhocko, n-horiguchi,
	aneesh.kumar, iamjoonsoo.kim

On 10/13/2016 05:19 AM, Jan Stancek wrote:
> Hi,
> 
> I'm running into ENOMEM failures with libhugetlbfs testsuite [1] on
> a power8 lpar system running 4.8 or latest git [2]. Repeated runs of
> this suite trigger multiple OOMs, that eventually kill entire system,
> it usually takes 3-5 runs:
> 
>  * Total System Memory......:  18024 MB
>  * Shared Mem Max Mapping...:    320 MB
>  * System Huge Page Size....:     16 MB
>  * Available Huge Pages.....:     20
>  * Total size of Huge Pages.:    320 MB
>  * Remaining System Memory..:  17704 MB
>  * Huge Page User Group.....:  hugepages (1001)
> 
> I see this only on ppc (BE/LE), x86_64 seems unaffected and successfully
> ran the tests for ~12 hours.
> 
> Bisect has identified following patch as culprit:
>   commit 67961f9db8c477026ea20ce05761bde6f8bf85b0
>   Author: Mike Kravetz <mike.kravetz@oracle.com>
>   Date:   Wed Jun 8 15:33:42 2016 -0700
>     mm/hugetlb: fix huge page reserve accounting for private mappings
> 

Thanks Jan, I'll take a look.

> 
> Following patch (made with my limited insight) applied to
> latest git [2] fixes the problem for me:
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ec49d9e..7261583 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1876,7 +1876,7 @@ static long __vma_reservation_common(struct hstate *h,
>                  * return value of this routine is the opposite of the
>                  * value returned from reserve map manipulation routines above.
>                  */
> -               if (ret)
> +               if (ret >= 0)
>                         return 0;
>                 else
>                         return 1;
> 

Do note that this code is only executed if this condition is true:

	else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) && ret >= 0) {

So, we would always return 0.  This always tells the calling code that a
reservation exists.
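
A minimal standalone sketch of that control flow (an illustration only,
not the kernel code; resv_result() below is a made-up stand-in for the
tail of __vma_reservation_common()):

#include <stdio.h>

/* Model the branch above: it is entered only when ret >= 0, so the
 * proposed inner test "ret >= 0" is always true there. */
static long resv_result(long ret, int patched)
{
	if (ret >= 0) {				/* the guard quoted above */
		if (patched ? (ret >= 0) : (ret != 0))
			return 0;		/* "reservation exists" */
		else
			return 1;		/* "no reservation" */
	}
	return ret;
}

int main(void)
{
	long ret;

	for (ret = 0; ret <= 2; ret++)
		printf("ret=%ld original=%ld patched=%ld\n",
		       ret, resv_result(ret, 0), resv_result(ret, 1));
	return 0;
}

For ret == 0 the original code returns 1 ("no reservation") while the
patched code returns 0, and for ret > 0 both return 0; that is why the
change reports an existing reservation unconditionally.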

-- 
Mike Kravetz

> Regards,
> Jan
> 
> [1] https://github.com/libhugetlbfs/libhugetlbfs
> [2] v4.8-14230-gb67be92
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [bug/regression] libhugetlbfs testsuite failures and OOMs eventually kill my system
  2016-10-13 15:24   ` Mike Kravetz
@ 2016-10-13 23:26     ` Mike Kravetz
  -1 siblings, 0 replies; 26+ messages in thread
From: Mike Kravetz @ 2016-10-13 23:26 UTC (permalink / raw)
  To: Jan Stancek, linux-mm, linux-kernel
  Cc: hillf.zj, dave.hansen, kirill.shutemov, mhocko, n-horiguchi,
	aneesh.kumar, iamjoonsoo.kim

On 10/13/2016 08:24 AM, Mike Kravetz wrote:
> On 10/13/2016 05:19 AM, Jan Stancek wrote:
>> Hi,
>>
>> I'm running into ENOMEM failures with libhugetlbfs testsuite [1] on
>> a power8 lpar system running 4.8 or latest git [2]. Repeated runs of
>> this suite trigger multiple OOMs, that eventually kill entire system,
>> it usually takes 3-5 runs:
>>
>>  * Total System Memory......:  18024 MB
>>  * Shared Mem Max Mapping...:    320 MB
>>  * System Huge Page Size....:     16 MB
>>  * Available Huge Pages.....:     20
>>  * Total size of Huge Pages.:    320 MB
>>  * Remaining System Memory..:  17704 MB
>>  * Huge Page User Group.....:  hugepages (1001)
>>

Hi Jan,

Any chance you can get the contents of /sys/kernel/mm/hugepages
before and after the first run of libhugetlbfs testsuite on Power?
Perhaps a script like:

cd /sys/kernel/mm/hugepages
for f in hugepages-*/*; do
	n=`cat $f`;
	echo -e "$n\t$f";
done

Just want to make sure the numbers look as they should.

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [bug/regression] libhugetlbfs testsuite failures and OOMs eventually kill my system
  2016-10-13 23:26     ` Mike Kravetz
@ 2016-10-14  8:48       ` Jan Stancek
  -1 siblings, 0 replies; 26+ messages in thread
From: Jan Stancek @ 2016-10-14  8:48 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: hillf.zj, dave.hansen, kirill.shutemov, mhocko, n-horiguchi,
	aneesh.kumar, iamjoonsoo.kim

On 10/14/2016 01:26 AM, Mike Kravetz wrote:
> 
> Hi Jan,
> 
> Any chance you can get the contents of /sys/kernel/mm/hugepages
> before and after the first run of libhugetlbfs testsuite on Power?
> Perhaps a script like:
> 
> cd /sys/kernel/mm/hugepages
> for f in hugepages-*/*; do
> 	n=`cat $f`;
> 	echo -e "$n\t$f";
> done
> 
> Just want to make sure the numbers look as they should.
> 

Hi Mike,

Numbers are below. I have also isolated a single testcase from "func"
group of tests: corrupt-by-cow-opt [1]. This test stops working if I
run it 19 times (with 20 hugepages). And if I disable this test,
"func" group tests can all pass repeatedly.

[1] https://github.com/libhugetlbfs/libhugetlbfs/blob/master/tests/corrupt-by-cow-opt.c
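
For reference, a rough standalone sketch of the core sequence in
corrupt-by-cow-opt [1], pieced together from its output further below
and the linked source; the /mnt/huge mount path and the 16MB page size
are assumptions, and the real test uses the libhugetlbfs helpers rather
than a raw open():

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE	(16UL * 1024 * 1024)	/* assumed 16MB huge pages */

int main(void)
{
	char *s, *p;
	int fd = open("/mnt/huge/cow-opt", O_CREAT | O_RDWR, 0600);

	if (fd < 0) { perror("open"); return 1; }

	/* Shared mapping of one huge page; the write instantiates it. */
	s = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (s == MAP_FAILED) { perror("mmap shared"); return 1; }
	*s = 's';

	/* Private mapping of the same page; the write triggers CoW and
	 * should consume (and later release) a private reservation. */
	p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED) { perror("mmap private"); return 1; }
	*p = 'p';

	/* The shared mapping must still read back 's'. */
	printf("shared mapping reads back '%c'\n", *s);

	munmap(p, HPAGE_SIZE);
	munmap(s, HPAGE_SIZE);
	close(fd);
	unlink("/mnt/huge/cow-opt");
	return 0;
}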

Regards,
Jan

Kernel is v4.8-14230-gb67be92, with reboot between each run.
1) Only func tests
System boot
After setup:
20      hugepages-16384kB/free_hugepages
20      hugepages-16384kB/nr_hugepages
20      hugepages-16384kB/nr_hugepages_mempolicy
0       hugepages-16384kB/nr_overcommit_hugepages
0       hugepages-16384kB/resv_hugepages
0       hugepages-16384kB/surplus_hugepages
0       hugepages-16777216kB/free_hugepages
0       hugepages-16777216kB/nr_hugepages
0       hugepages-16777216kB/nr_hugepages_mempolicy
0       hugepages-16777216kB/nr_overcommit_hugepages
0       hugepages-16777216kB/resv_hugepages
0       hugepages-16777216kB/surplus_hugepages

After func tests:
********** TEST SUMMARY
*                      16M
*                      32-bit 64-bit
*     Total testcases:     0     85
*             Skipped:     0      0
*                PASS:     0     81
*                FAIL:     0      4
*    Killed by signal:     0      0
*   Bad configuration:     0      0
*       Expected FAIL:     0      0
*     Unexpected PASS:     0      0
* Strange test result:     0      0

26      hugepages-16384kB/free_hugepages
26      hugepages-16384kB/nr_hugepages
26      hugepages-16384kB/nr_hugepages_mempolicy
0       hugepages-16384kB/nr_overcommit_hugepages
1       hugepages-16384kB/resv_hugepages
0       hugepages-16384kB/surplus_hugepages
0       hugepages-16777216kB/free_hugepages
0       hugepages-16777216kB/nr_hugepages
0       hugepages-16777216kB/nr_hugepages_mempolicy
0       hugepages-16777216kB/nr_overcommit_hugepages
0       hugepages-16777216kB/resv_hugepages
0       hugepages-16777216kB/surplus_hugepages

After test cleanup:
 umount -a -t hugetlbfs
 hugeadm --pool-pages-max ${HPSIZE}:0

1       hugepages-16384kB/free_hugepages
1       hugepages-16384kB/nr_hugepages
1       hugepages-16384kB/nr_hugepages_mempolicy
0       hugepages-16384kB/nr_overcommit_hugepages
1       hugepages-16384kB/resv_hugepages
1       hugepages-16384kB/surplus_hugepages
0       hugepages-16777216kB/free_hugepages
0       hugepages-16777216kB/nr_hugepages
0       hugepages-16777216kB/nr_hugepages_mempolicy
0       hugepages-16777216kB/nr_overcommit_hugepages
0       hugepages-16777216kB/resv_hugepages
0       hugepages-16777216kB/surplus_hugepages

---

2) Only stress tests
System boot
After setup:
20      hugepages-16384kB/free_hugepages
20      hugepages-16384kB/nr_hugepages
20      hugepages-16384kB/nr_hugepages_mempolicy
0       hugepages-16384kB/nr_overcommit_hugepages
0       hugepages-16384kB/resv_hugepages
0       hugepages-16384kB/surplus_hugepages
0       hugepages-16777216kB/free_hugepages
0       hugepages-16777216kB/nr_hugepages
0       hugepages-16777216kB/nr_hugepages_mempolicy
0       hugepages-16777216kB/nr_overcommit_hugepages
0       hugepages-16777216kB/resv_hugepages
0       hugepages-16777216kB/surplus_hugepages

After stress tests:
20      hugepages-16384kB/free_hugepages
20      hugepages-16384kB/nr_hugepages
20      hugepages-16384kB/nr_hugepages_mempolicy
0       hugepages-16384kB/nr_overcommit_hugepages
17      hugepages-16384kB/resv_hugepages
0       hugepages-16384kB/surplus_hugepages
0       hugepages-16777216kB/free_hugepages
0       hugepages-16777216kB/nr_hugepages
0       hugepages-16777216kB/nr_hugepages_mempolicy
0       hugepages-16777216kB/nr_overcommit_hugepages
0       hugepages-16777216kB/resv_hugepages
0       hugepages-16777216kB/surplus_hugepages

After cleanup:
17      hugepages-16384kB/free_hugepages
17      hugepages-16384kB/nr_hugepages
17      hugepages-16384kB/nr_hugepages_mempolicy
0       hugepages-16384kB/nr_overcommit_hugepages
17      hugepages-16384kB/resv_hugepages
17      hugepages-16384kB/surplus_hugepages
0       hugepages-16777216kB/free_hugepages
0       hugepages-16777216kB/nr_hugepages
0       hugepages-16777216kB/nr_hugepages_mempolicy
0       hugepages-16777216kB/nr_overcommit_hugepages
0       hugepages-16777216kB/resv_hugepages
0       hugepages-16777216kB/surplus_hugepages

---

3) only corrupt-by-cow-opt

System boot
After setup:
20      hugepages-16384kB/free_hugepages
20      hugepages-16384kB/nr_hugepages
20      hugepages-16384kB/nr_hugepages_mempolicy
0       hugepages-16384kB/nr_overcommit_hugepages
0       hugepages-16384kB/resv_hugepages
0       hugepages-16384kB/surplus_hugepages
0       hugepages-16777216kB/free_hugepages
0       hugepages-16777216kB/nr_hugepages
0       hugepages-16777216kB/nr_hugepages_mempolicy
0       hugepages-16777216kB/nr_overcommit_hugepages
0       hugepages-16777216kB/resv_hugepages
0       hugepages-16777216kB/surplus_hugepages

libhugetlbfs-2.18# env LD_LIBRARY_PATH=./obj64 ./tests/obj64/corrupt-by-cow-opt; /root/grab.sh
Starting testcase "./tests/obj64/corrupt-by-cow-opt", pid 3298
Write s to 0x3effff000000 via shared mapping
Write p to 0x3effff000000 via private mapping
Read s from 0x3effff000000 via shared mapping
PASS
20      hugepages-16384kB/free_hugepages
20      hugepages-16384kB/nr_hugepages
20      hugepages-16384kB/nr_hugepages_mempolicy
0       hugepages-16384kB/nr_overcommit_hugepages
1       hugepages-16384kB/resv_hugepages
0       hugepages-16384kB/surplus_hugepages
0       hugepages-16777216kB/free_hugepages
0       hugepages-16777216kB/nr_hugepages
0       hugepages-16777216kB/nr_hugepages_mempolicy
0       hugepages-16777216kB/nr_overcommit_hugepages
0       hugepages-16777216kB/resv_hugepages
0       hugepages-16777216kB/surplus_hugepages

# env LD_LIBRARY_PATH=./obj64 ./tests/obj64/corrupt-by-cow-opt; /root/grab.sh
Starting testcase "./tests/obj64/corrupt-by-cow-opt", pid 3312
Write s to 0x3effff000000 via shared mapping
Write p to 0x3effff000000 via private mapping
Read s from 0x3effff000000 via shared mapping
PASS
20      hugepages-16384kB/free_hugepages
20      hugepages-16384kB/nr_hugepages
20      hugepages-16384kB/nr_hugepages_mempolicy
0       hugepages-16384kB/nr_overcommit_hugepages
2       hugepages-16384kB/resv_hugepages
0       hugepages-16384kB/surplus_hugepages
0       hugepages-16777216kB/free_hugepages
0       hugepages-16777216kB/nr_hugepages
0       hugepages-16777216kB/nr_hugepages_mempolicy
0       hugepages-16777216kB/nr_overcommit_hugepages
0       hugepages-16777216kB/resv_hugepages
0       hugepages-16777216kB/surplus_hugepages

(... output cut from ~17 iterations ...)

# env LD_LIBRARY_PATH=./obj64 ./tests/obj64/corrupt-by-cow-opt; /root/grab.sh
Starting testcase "./tests/obj64/corrupt-by-cow-opt", pid 3686
Write s to 0x3effff000000 via shared mapping
Bus error
20      hugepages-16384kB/free_hugepages
20      hugepages-16384kB/nr_hugepages
20      hugepages-16384kB/nr_hugepages_mempolicy
0       hugepages-16384kB/nr_overcommit_hugepages
19      hugepages-16384kB/resv_hugepages
0       hugepages-16384kB/surplus_hugepages
0       hugepages-16777216kB/free_hugepages
0       hugepages-16777216kB/nr_hugepages
0       hugepages-16777216kB/nr_hugepages_mempolicy
0       hugepages-16777216kB/nr_overcommit_hugepages
0       hugepages-16777216kB/resv_hugepages
0       hugepages-16777216kB/surplus_hugepages

# env LD_LIBRARY_PATH=./obj64 ./tests/obj64/corrupt-by-cow-opt; /root/grab.sh
Starting testcase "./tests/obj64/corrupt-by-cow-opt", pid 3700
Write s to 0x3effff000000 via shared mapping
FAIL    mmap() 2: Cannot allocate memory
20      hugepages-16384kB/free_hugepages
20      hugepages-16384kB/nr_hugepages
20      hugepages-16384kB/nr_hugepages_mempolicy
0       hugepages-16384kB/nr_overcommit_hugepages
19      hugepages-16384kB/resv_hugepages
0       hugepages-16384kB/surplus_hugepages
0       hugepages-16777216kB/free_hugepages
0       hugepages-16777216kB/nr_hugepages
0       hugepages-16777216kB/nr_hugepages_mempolicy
0       hugepages-16777216kB/nr_overcommit_hugepages
0       hugepages-16777216kB/resv_hugepages
0       hugepages-16777216kB/surplus_hugepages

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [bug/regression] libhugetlbfs testsuite failures and OOMs eventually kill my system
  2016-10-14  8:48       ` Jan Stancek
@ 2016-10-14 23:57         ` Mike Kravetz
  -1 siblings, 0 replies; 26+ messages in thread
From: Mike Kravetz @ 2016-10-14 23:57 UTC (permalink / raw)
  To: Jan Stancek, linux-mm, linux-kernel
  Cc: hillf.zj, dave.hansen, kirill.shutemov, mhocko, n-horiguchi,
	aneesh.kumar, iamjoonsoo.kim

On 10/14/2016 01:48 AM, Jan Stancek wrote:
> On 10/14/2016 01:26 AM, Mike Kravetz wrote:
>>
>> Hi Jan,
>>
>> Any chance you can get the contents of /sys/kernel/mm/hugepages
>> before and after the first run of libhugetlbfs testsuite on Power?
>> Perhaps a script like:
>>
>> cd /sys/kernel/mm/hugepages
>> for f in hugepages-*/*; do
>> 	n=`cat $f`;
>> 	echo -e "$n\t$f";
>> done
>>
>> Just want to make sure the numbers look as they should.
>>
> 
> Hi Mike,
> 
> Numbers are below. I have also isolated a single testcase from "func"
> group of tests: corrupt-by-cow-opt [1]. This test stops working if I
> run it 19 times (with 20 hugepages). And if I disable this test,
> "func" group tests can all pass repeatedly.

Thanks Jan,

I appreciate your efforts.

> 
> [1] https://github.com/libhugetlbfs/libhugetlbfs/blob/master/tests/corrupt-by-cow-opt.c
> 
> Regards,
> Jan
> 
> Kernel is v4.8-14230-gb67be92, with reboot between each run.
> 1) Only func tests
> System boot
> After setup:
> 20      hugepages-16384kB/free_hugepages
> 20      hugepages-16384kB/nr_hugepages
> 20      hugepages-16384kB/nr_hugepages_mempolicy
> 0       hugepages-16384kB/nr_overcommit_hugepages
> 0       hugepages-16384kB/resv_hugepages
> 0       hugepages-16384kB/surplus_hugepages
> 0       hugepages-16777216kB/free_hugepages
> 0       hugepages-16777216kB/nr_hugepages
> 0       hugepages-16777216kB/nr_hugepages_mempolicy
> 0       hugepages-16777216kB/nr_overcommit_hugepages
> 0       hugepages-16777216kB/resv_hugepages
> 0       hugepages-16777216kB/surplus_hugepages
> 
> After func tests:
> ********** TEST SUMMARY
> *                      16M
> *                      32-bit 64-bit
> *     Total testcases:     0     85
> *             Skipped:     0      0
> *                PASS:     0     81
> *                FAIL:     0      4
> *    Killed by signal:     0      0
> *   Bad configuration:     0      0
> *       Expected FAIL:     0      0
> *     Unexpected PASS:     0      0
> * Strange test result:     0      0
> 
> 26      hugepages-16384kB/free_hugepages
> 26      hugepages-16384kB/nr_hugepages
> 26      hugepages-16384kB/nr_hugepages_mempolicy
> 0       hugepages-16384kB/nr_overcommit_hugepages
> 1       hugepages-16384kB/resv_hugepages
> 0       hugepages-16384kB/surplus_hugepages
> 0       hugepages-16777216kB/free_hugepages
> 0       hugepages-16777216kB/nr_hugepages
> 0       hugepages-16777216kB/nr_hugepages_mempolicy
> 0       hugepages-16777216kB/nr_overcommit_hugepages
> 0       hugepages-16777216kB/resv_hugepages
> 0       hugepages-16777216kB/surplus_hugepages
> 
> After test cleanup:
>  umount -a -t hugetlbfs
>  hugeadm --pool-pages-max ${HPSIZE}:0
> 
> 1       hugepages-16384kB/free_hugepages
> 1       hugepages-16384kB/nr_hugepages
> 1       hugepages-16384kB/nr_hugepages_mempolicy
> 0       hugepages-16384kB/nr_overcommit_hugepages
> 1       hugepages-16384kB/resv_hugepages
> 1       hugepages-16384kB/surplus_hugepages
> 0       hugepages-16777216kB/free_hugepages
> 0       hugepages-16777216kB/nr_hugepages
> 0       hugepages-16777216kB/nr_hugepages_mempolicy
> 0       hugepages-16777216kB/nr_overcommit_hugepages
> 0       hugepages-16777216kB/resv_hugepages
> 0       hugepages-16777216kB/surplus_hugepages
> 

I am guessing the leaked reserve page is triggered by running the
test you isolated, corrupt-by-cow-opt.


> ---
> 
> 2) Only stress tests
> System boot
> After setup:
> 20      hugepages-16384kB/free_hugepages
> 20      hugepages-16384kB/nr_hugepages
> 20      hugepages-16384kB/nr_hugepages_mempolicy
> 0       hugepages-16384kB/nr_overcommit_hugepages
> 0       hugepages-16384kB/resv_hugepages
> 0       hugepages-16384kB/surplus_hugepages
> 0       hugepages-16777216kB/free_hugepages
> 0       hugepages-16777216kB/nr_hugepages
> 0       hugepages-16777216kB/nr_hugepages_mempolicy
> 0       hugepages-16777216kB/nr_overcommit_hugepages
> 0       hugepages-16777216kB/resv_hugepages
> 0       hugepages-16777216kB/surplus_hugepages
> 
> After stress tests:
> 20      hugepages-16384kB/free_hugepages
> 20      hugepages-16384kB/nr_hugepages
> 20      hugepages-16384kB/nr_hugepages_mempolicy
> 0       hugepages-16384kB/nr_overcommit_hugepages
> 17      hugepages-16384kB/resv_hugepages
> 0       hugepages-16384kB/surplus_hugepages
> 0       hugepages-16777216kB/free_hugepages
> 0       hugepages-16777216kB/nr_hugepages
> 0       hugepages-16777216kB/nr_hugepages_mempolicy
> 0       hugepages-16777216kB/nr_overcommit_hugepages
> 0       hugepages-16777216kB/resv_hugepages
> 0       hugepages-16777216kB/surplus_hugepages
> 
> After cleanup:
> 17      hugepages-16384kB/free_hugepages
> 17      hugepages-16384kB/nr_hugepages
> 17      hugepages-16384kB/nr_hugepages_mempolicy
> 0       hugepages-16384kB/nr_overcommit_hugepages
> 17      hugepages-16384kB/resv_hugepages
> 17      hugepages-16384kB/surplus_hugepages
> 0       hugepages-16777216kB/free_hugepages
> 0       hugepages-16777216kB/nr_hugepages
> 0       hugepages-16777216kB/nr_hugepages_mempolicy
> 0       hugepages-16777216kB/nr_overcommit_hugepages
> 0       hugepages-16777216kB/resv_hugepages
> 0       hugepages-16777216kB/surplus_hugepages
> 

This looks worse than the summary after running the functional tests.

> ---
> 
> 3) only corrupt-by-cow-opt
> 
> System boot
> After setup:
> 20      hugepages-16384kB/free_hugepages
> 20      hugepages-16384kB/nr_hugepages
> 20      hugepages-16384kB/nr_hugepages_mempolicy
> 0       hugepages-16384kB/nr_overcommit_hugepages
> 0       hugepages-16384kB/resv_hugepages
> 0       hugepages-16384kB/surplus_hugepages
> 0       hugepages-16777216kB/free_hugepages
> 0       hugepages-16777216kB/nr_hugepages
> 0       hugepages-16777216kB/nr_hugepages_mempolicy
> 0       hugepages-16777216kB/nr_overcommit_hugepages
> 0       hugepages-16777216kB/resv_hugepages
> 0       hugepages-16777216kB/surplus_hugepages
> 
> libhugetlbfs-2.18# env LD_LIBRARY_PATH=./obj64 ./tests/obj64/corrupt-by-cow-opt; /root/grab.sh
> Starting testcase "./tests/obj64/corrupt-by-cow-opt", pid 3298
> Write s to 0x3effff000000 via shared mapping
> Write p to 0x3effff000000 via private mapping
> Read s from 0x3effff000000 via shared mapping
> PASS
> 20      hugepages-16384kB/free_hugepages
> 20      hugepages-16384kB/nr_hugepages
> 20      hugepages-16384kB/nr_hugepages_mempolicy
> 0       hugepages-16384kB/nr_overcommit_hugepages
> 1       hugepages-16384kB/resv_hugepages
> 0       hugepages-16384kB/surplus_hugepages
> 0       hugepages-16777216kB/free_hugepages
> 0       hugepages-16777216kB/nr_hugepages
> 0       hugepages-16777216kB/nr_hugepages_mempolicy
> 0       hugepages-16777216kB/nr_overcommit_hugepages
> 0       hugepages-16777216kB/resv_hugepages
> 0       hugepages-16777216kB/surplus_hugepages

Leaked one reserve page

> 
> # env LD_LIBRARY_PATH=./obj64 ./tests/obj64/corrupt-by-cow-opt; /root/grab.sh
> Starting testcase "./tests/obj64/corrupt-by-cow-opt", pid 3312
> Write s to 0x3effff000000 via shared mapping
> Write p to 0x3effff000000 via private mapping
> Read s from 0x3effff000000 via shared mapping
> PASS
> 20      hugepages-16384kB/free_hugepages
> 20      hugepages-16384kB/nr_hugepages
> 20      hugepages-16384kB/nr_hugepages_mempolicy
> 0       hugepages-16384kB/nr_overcommit_hugepages
> 2       hugepages-16384kB/resv_hugepages
> 0       hugepages-16384kB/surplus_hugepages
> 0       hugepages-16777216kB/free_hugepages
> 0       hugepages-16777216kB/nr_hugepages
> 0       hugepages-16777216kB/nr_hugepages_mempolicy
> 0       hugepages-16777216kB/nr_overcommit_hugepages
> 0       hugepages-16777216kB/resv_hugepages
> 0       hugepages-16777216kB/surplus_hugepages

It is pretty consistent that we leak a reserve page every time this
test is run.

The interesting thing is that corrupt-by-cow-opt is a very simple
test case.  Commit 67961f9db8c4 potentially changes the return value
of the functions vma_has_reserves() and vma_needs/commit_reservation()
for the owner (HPAGE_RESV_OWNER) of private mappings.  Running the
test with and without the commit results in the same return values for
these routines on x86, and no reserve pages are leaked.

Is it possible to revert this commit and run the libhugetlbfs tests
(func and stress) again while monitoring the counts in /sys?  The
counts should go to zero after the cleanup you describe above.  I just
want to make sure that this commit is causing all the problems you
are seeing.  If it is, then we can consider reverting it and I can try
to think of another way to address the original issue.
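
For example, rerunning the isolated test in a loop on the reverted
kernel and dumping resv_hugepages after each run should show whether
the count still creeps up (the libhugetlbfs path and the iteration
count below are assumptions):

cd /path/to/libhugetlbfs-2.18
for i in $(seq 1 20); do
	echo "=== iteration $i ==="
	env LD_LIBRARY_PATH=./obj64 ./tests/obj64/corrupt-by-cow-opt
	cat /sys/kernel/mm/hugepages/hugepages-16384kB/resv_hugepages
done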

Thanks for your efforts on this.  I cannot reproduce this on x86 or sparc
and do not see any similar symptoms on those architectures.

-- 
Mike Kravetz

> 
> (... output cut from ~17 iterations ...)
> 
> # env LD_LIBRARY_PATH=./obj64 ./tests/obj64/corrupt-by-cow-opt; /root/grab.sh
> Starting testcase "./tests/obj64/corrupt-by-cow-opt", pid 3686
> Write s to 0x3effff000000 via shared mapping
> Bus error
> 20      hugepages-16384kB/free_hugepages
> 20      hugepages-16384kB/nr_hugepages
> 20      hugepages-16384kB/nr_hugepages_mempolicy
> 0       hugepages-16384kB/nr_overcommit_hugepages
> 19      hugepages-16384kB/resv_hugepages
> 0       hugepages-16384kB/surplus_hugepages
> 0       hugepages-16777216kB/free_hugepages
> 0       hugepages-16777216kB/nr_hugepages
> 0       hugepages-16777216kB/nr_hugepages_mempolicy
> 0       hugepages-16777216kB/nr_overcommit_hugepages
> 0       hugepages-16777216kB/resv_hugepages
> 0       hugepages-16777216kB/surplus_hugepages
> 
> # env LD_LIBRARY_PATH=./obj64 ./tests/obj64/corrupt-by-cow-opt; /root/grab.sh
> Starting testcase "./tests/obj64/corrupt-by-cow-opt", pid 3700
> Write s to 0x3effff000000 via shared mapping
> FAIL    mmap() 2: Cannot allocate memory
> 20      hugepages-16384kB/free_hugepages
> 20      hugepages-16384kB/nr_hugepages
> 20      hugepages-16384kB/nr_hugepages_mempolicy
> 0       hugepages-16384kB/nr_overcommit_hugepages
> 19      hugepages-16384kB/resv_hugepages
> 0       hugepages-16384kB/surplus_hugepages
> 0       hugepages-16777216kB/free_hugepages
> 0       hugepages-16777216kB/nr_hugepages
> 0       hugepages-16777216kB/nr_hugepages_mempolicy
> 0       hugepages-16777216kB/nr_overcommit_hugepages
> 0       hugepages-16777216kB/resv_hugepages
> 0       hugepages-16777216kB/surplus_hugepages
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [bug/regression] libhugetlbfs testsuite failures and OOMs eventually kill my system
  2016-10-14 23:57         ` Mike Kravetz
@ 2016-10-17  5:04           ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 26+ messages in thread
From: Aneesh Kumar K.V @ 2016-10-17  5:04 UTC (permalink / raw)
  To: Mike Kravetz, Jan Stancek, linux-mm, linux-kernel
  Cc: hillf.zj, dave.hansen, kirill.shutemov, mhocko, n-horiguchi,
	iamjoonsoo.kim

Mike Kravetz <mike.kravetz@oracle.com> writes:

> On 10/14/2016 01:48 AM, Jan Stancek wrote:
>> On 10/14/2016 01:26 AM, Mike Kravetz wrote:
>>>
>>> Hi Jan,
>>>
>>> Any chance you can get the contents of /sys/kernel/mm/hugepages
>>> before and after the first run of libhugetlbfs testsuite on Power?
>>> Perhaps a script like:
>>>
>>> cd /sys/kernel/mm/hugepages
>>> for f in hugepages-*/*; do
>>> 	n=`cat $f`;
>>> 	echo -e "$n\t$f";
>>> done
>>>
>>> Just want to make sure the numbers look as they should.
>>>
>> 
>> Hi Mike,
>> 
>> Numbers are below. I have also isolated a single testcase from "func"
>> group of tests: corrupt-by-cow-opt [1]. This test stops working if I
>> run it 19 times (with 20 hugepages). And if I disable this test,
>> "func" group tests can all pass repeatedly.
>
> Thanks Jan,
>
> I appreciate your efforts.
>
>> 
>> [1] https://github.com/libhugetlbfs/libhugetlbfs/blob/master/tests/corrupt-by-cow-opt.c
>> 
>> Regards,
>> Jan
>> 
>> Kernel is v4.8-14230-gb67be92, with reboot between each run.
>> 1) Only func tests
>> System boot
>> After setup:
>> 20      hugepages-16384kB/free_hugepages
>> 20      hugepages-16384kB/nr_hugepages
>> 20      hugepages-16384kB/nr_hugepages_mempolicy
>> 0       hugepages-16384kB/nr_overcommit_hugepages
>> 0       hugepages-16384kB/resv_hugepages
>> 0       hugepages-16384kB/surplus_hugepages
>> 0       hugepages-16777216kB/free_hugepages
>> 0       hugepages-16777216kB/nr_hugepages
>> 0       hugepages-16777216kB/nr_hugepages_mempolicy
>> 0       hugepages-16777216kB/nr_overcommit_hugepages
>> 0       hugepages-16777216kB/resv_hugepages
>> 0       hugepages-16777216kB/surplus_hugepages
>> 
>> After func tests:
>> ********** TEST SUMMARY
>> *                      16M
>> *                      32-bit 64-bit
>> *     Total testcases:     0     85
>> *             Skipped:     0      0
>> *                PASS:     0     81
>> *                FAIL:     0      4
>> *    Killed by signal:     0      0
>> *   Bad configuration:     0      0
>> *       Expected FAIL:     0      0
>> *     Unexpected PASS:     0      0
>> * Strange test result:     0      0
>> 
>> 26      hugepages-16384kB/free_hugepages
>> 26      hugepages-16384kB/nr_hugepages
>> 26      hugepages-16384kB/nr_hugepages_mempolicy
>> 0       hugepages-16384kB/nr_overcommit_hugepages
>> 1       hugepages-16384kB/resv_hugepages
>> 0       hugepages-16384kB/surplus_hugepages
>> 0       hugepages-16777216kB/free_hugepages
>> 0       hugepages-16777216kB/nr_hugepages
>> 0       hugepages-16777216kB/nr_hugepages_mempolicy
>> 0       hugepages-16777216kB/nr_overcommit_hugepages
>> 0       hugepages-16777216kB/resv_hugepages
>> 0       hugepages-16777216kB/surplus_hugepages
>> 
>> After test cleanup:
>>  umount -a -t hugetlbfs
>>  hugeadm --pool-pages-max ${HPSIZE}:0
>> 
>> 1       hugepages-16384kB/free_hugepages
>> 1       hugepages-16384kB/nr_hugepages
>> 1       hugepages-16384kB/nr_hugepages_mempolicy
>> 0       hugepages-16384kB/nr_overcommit_hugepages
>> 1       hugepages-16384kB/resv_hugepages
>> 1       hugepages-16384kB/surplus_hugepages
>> 0       hugepages-16777216kB/free_hugepages
>> 0       hugepages-16777216kB/nr_hugepages
>> 0       hugepages-16777216kB/nr_hugepages_mempolicy
>> 0       hugepages-16777216kB/nr_overcommit_hugepages
>> 0       hugepages-16777216kB/resv_hugepages
>> 0       hugepages-16777216kB/surplus_hugepages
>> 
>
> I am guessing the leaked reserve page is what is triggered by
> running the test you isolated, corrupt-by-cow-opt.
>
>
>> ---
>> 
>> 2) Only stress tests
>> System boot
>> After setup:
>> 20      hugepages-16384kB/free_hugepages
>> 20      hugepages-16384kB/nr_hugepages
>> 20      hugepages-16384kB/nr_hugepages_mempolicy
>> 0       hugepages-16384kB/nr_overcommit_hugepages
>> 0       hugepages-16384kB/resv_hugepages
>> 0       hugepages-16384kB/surplus_hugepages
>> 0       hugepages-16777216kB/free_hugepages
>> 0       hugepages-16777216kB/nr_hugepages
>> 0       hugepages-16777216kB/nr_hugepages_mempolicy
>> 0       hugepages-16777216kB/nr_overcommit_hugepages
>> 0       hugepages-16777216kB/resv_hugepages
>> 0       hugepages-16777216kB/surplus_hugepages
>> 
>> After stress tests:
>> 20      hugepages-16384kB/free_hugepages
>> 20      hugepages-16384kB/nr_hugepages
>> 20      hugepages-16384kB/nr_hugepages_mempolicy
>> 0       hugepages-16384kB/nr_overcommit_hugepages
>> 17      hugepages-16384kB/resv_hugepages
>> 0       hugepages-16384kB/surplus_hugepages
>> 0       hugepages-16777216kB/free_hugepages
>> 0       hugepages-16777216kB/nr_hugepages
>> 0       hugepages-16777216kB/nr_hugepages_mempolicy
>> 0       hugepages-16777216kB/nr_overcommit_hugepages
>> 0       hugepages-16777216kB/resv_hugepages
>> 0       hugepages-16777216kB/surplus_hugepages
>> 
>> After cleanup:
>> 17      hugepages-16384kB/free_hugepages
>> 17      hugepages-16384kB/nr_hugepages
>> 17      hugepages-16384kB/nr_hugepages_mempolicy
>> 0       hugepages-16384kB/nr_overcommit_hugepages
>> 17      hugepages-16384kB/resv_hugepages
>> 17      hugepages-16384kB/surplus_hugepages
>> 0       hugepages-16777216kB/free_hugepages
>> 0       hugepages-16777216kB/nr_hugepages
>> 0       hugepages-16777216kB/nr_hugepages_mempolicy
>> 0       hugepages-16777216kB/nr_overcommit_hugepages
>> 0       hugepages-16777216kB/resv_hugepages
>> 0       hugepages-16777216kB/surplus_hugepages
>> 
>
> This looks worse than the summary after running the functional tests.
>
>> ---
>> 
>> 3) only corrupt-by-cow-opt
>> 
>> System boot
>> After setup:
>> 20      hugepages-16384kB/free_hugepages
>> 20      hugepages-16384kB/nr_hugepages
>> 20      hugepages-16384kB/nr_hugepages_mempolicy
>> 0       hugepages-16384kB/nr_overcommit_hugepages
>> 0       hugepages-16384kB/resv_hugepages
>> 0       hugepages-16384kB/surplus_hugepages
>> 0       hugepages-16777216kB/free_hugepages
>> 0       hugepages-16777216kB/nr_hugepages
>> 0       hugepages-16777216kB/nr_hugepages_mempolicy
>> 0       hugepages-16777216kB/nr_overcommit_hugepages
>> 0       hugepages-16777216kB/resv_hugepages
>> 0       hugepages-16777216kB/surplus_hugepages
>> 
>> libhugetlbfs-2.18# env LD_LIBRARY_PATH=./obj64 ./tests/obj64/corrupt-by-cow-opt; /root/grab.sh
>> Starting testcase "./tests/obj64/corrupt-by-cow-opt", pid 3298
>> Write s to 0x3effff000000 via shared mapping
>> Write p to 0x3effff000000 via private mapping
>> Read s from 0x3effff000000 via shared mapping
>> PASS
>> 20      hugepages-16384kB/free_hugepages
>> 20      hugepages-16384kB/nr_hugepages
>> 20      hugepages-16384kB/nr_hugepages_mempolicy
>> 0       hugepages-16384kB/nr_overcommit_hugepages
>> 1       hugepages-16384kB/resv_hugepages
>> 0       hugepages-16384kB/surplus_hugepages
>> 0       hugepages-16777216kB/free_hugepages
>> 0       hugepages-16777216kB/nr_hugepages
>> 0       hugepages-16777216kB/nr_hugepages_mempolicy
>> 0       hugepages-16777216kB/nr_overcommit_hugepages
>> 0       hugepages-16777216kB/resv_hugepages
>> 0       hugepages-16777216kB/surplus_hugepages
>
> Leaked one reserve page
>
>> 
>> # env LD_LIBRARY_PATH=./obj64 ./tests/obj64/corrupt-by-cow-opt; /root/grab.sh
>> Starting testcase "./tests/obj64/corrupt-by-cow-opt", pid 3312
>> Write s to 0x3effff000000 via shared mapping
>> Write p to 0x3effff000000 via private mapping
>> Read s from 0x3effff000000 via shared mapping
>> PASS
>> 20      hugepages-16384kB/free_hugepages
>> 20      hugepages-16384kB/nr_hugepages
>> 20      hugepages-16384kB/nr_hugepages_mempolicy
>> 0       hugepages-16384kB/nr_overcommit_hugepages
>> 2       hugepages-16384kB/resv_hugepages
>> 0       hugepages-16384kB/surplus_hugepages
>> 0       hugepages-16777216kB/free_hugepages
>> 0       hugepages-16777216kB/nr_hugepages
>> 0       hugepages-16777216kB/nr_hugepages_mempolicy
>> 0       hugepages-16777216kB/nr_overcommit_hugepages
>> 0       hugepages-16777216kB/resv_hugepages
>> 0       hugepages-16777216kB/surplus_hugepages
>
> It is pretty consistent that we leak a reserve page every time this
> test is run.
>
> The interesting thing is that corrupt-by-cow-opt is a very simple
> test case.  commit 67961f9db8c4 potentially changes the return value
> of the functions vma_has_reserves() and vma_needs/commit_reservation()
> for the owner (HPAGE_RESV_OWNER) of private mappings.  running the
> test with and without the commit results in the same return values for
> these routines on x86.  And, no leaked reserve pages.

Looking at that commit, I am not sure the region_chg output indicates a
hole punched. That is, w.r.t. a private mapping we don't do a region_chg
at mmap time (in hugetlb_reserve_pages()). So on a later fault, when we
call vma_needs_reservation, we will find region_chg returning >= 0, right?

>
> Is it possible to revert this commit and run the libhugetlbfs tests
> (func and stress) again while monitoring the counts in /sys?  The
> counts should go to zero after cleanup as you describe above.  I just
> want to make sure that this commit is causing all the problems you
> are seeing.  If it is, then we can consider reverting and I can try
> to think of another way to address the original issue.
>
> Thanks for your efforts on this.  I can not reproduce on x86 or sparc
> and do not see any similar symptoms on these architectures.
>

Not sure how any of this is arch specific. So on both x86 and sparc we
don't find the count going wrong as above ?

-aneesh

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [bug/regression] libhugetlbfs testsuite failures and OOMs eventually kill my system
  2016-10-14 23:57         ` Mike Kravetz
@ 2016-10-17 14:44           ` Jan Stancek
  -1 siblings, 0 replies; 26+ messages in thread
From: Jan Stancek @ 2016-10-17 14:44 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, hillf zj, dave hansen, kirill shutemov,
	mhocko, n-horiguchi, aneesh kumar, iamjoonsoo kim


----- Original Message -----
> From: "Mike Kravetz" <mike.kravetz@oracle.com>
> To: "Jan Stancek" <jstancek@redhat.com>, linux-mm@kvack.org, linux-kernel@vger.kernel.org
> Cc: "hillf zj" <hillf.zj@alibaba-inc.com>, "dave hansen" <dave.hansen@linux.intel.com>, "kirill shutemov"
> <kirill.shutemov@linux.intel.com>, mhocko@suse.cz, n-horiguchi@ah.jp.nec.com, "aneesh kumar"
> <aneesh.kumar@linux.vnet.ibm.com>, "iamjoonsoo kim" <iamjoonsoo.kim@lge.com>
> Sent: Saturday, 15 October, 2016 1:57:31 AM
> Subject: Re: [bug/regression] libhugetlbfs testsuite failures and OOMs eventually kill my system
> 
> 
> It is pretty consistent that we leak a reserve page every time this
> test is run.
> 
> The interesting thing is that corrupt-by-cow-opt is a very simple
> test case.  commit 67961f9db8c4 potentially changes the return value
> of the functions vma_has_reserves() and vma_needs/commit_reservation()
> for the owner (HPAGE_RESV_OWNER) of private mappings.  running the
> test with and without the commit results in the same return values for
> these routines on x86.  And, no leaked reserve pages.
> 
> Is it possible to revert this commit and run the libhugetlbfs tests
> (func and stress) again while monitoring the counts in /sys?  The
> counts should go to zero after cleanup as you describe above.  I just
> want to make sure that this commit is causing all the problems you
> are seeing.  If it is, then we can consider reverting and I can try
> to think of another way to address the original issue.
> 
> Thanks for your efforts on this.  I can not reproduce on x86 or sparc
> and do not see any similar symptoms on these architectures.
> 
> --
> Mike Kravetz
> 

Hi Mike,

Reverting 67961f9db8c4 helps: I let the whole suite run for 100
iterations and there were no issues.

I cut down the reproducer and removed the last mmap/write/munmap, as what
remains is enough to reproduce the problem. Then I started introducing some
traces into the kernel and noticed that on ppc I get 3 faults, while on x86
I get only 2.
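
For reference, here is a minimal standalone sketch of that cut-down
sequence (a sketch only, not the actual libhugetlbfs test; the hugetlbfs
mount point, file name and the 16MB huge page size are assumptions for
illustration):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE	(16UL * 1024 * 1024)	/* assumed 16MB huge pages */
#define HUGE_FILE	"/mnt/huge/cow-test"	/* assumed hugetlbfs mount */

int main(void)
{
	char *s, *p;
	int fd = open(HUGE_FILE, O_CREAT | O_RDWR, 0600);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* 1st fault: instantiate the page through the shared mapping */
	s = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (s == MAP_FAILED) {
		perror("mmap shared");
		return 1;
	}
	*s = 's';

	/*
	 * 2nd fault: first write through a private mapping of the same
	 * offset, which takes the COW path traced below.
	 */
	p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap private");
		return 1;
	}
	*p = 'p';

	munmap(p, HPAGE_SIZE);
	munmap(s, HPAGE_SIZE);
	unlink(HUGE_FILE);
	close(fd);
	return 0;
}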

The interesting one is the 2nd fault, i.e. the first write after mapping
as PRIVATE. The following condition fails on ppc the first time:
    if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) {
but it is immediately followed by a fault that looks identical,
and in that one it evaluates as true.

Same with alloc_huge_page(): on x86_64 it's called twice, on ppc three
times. In the 2nd call vma_needs_reservation() returns 0, in the 3rd it
returns 1.

---- ppc -> 2nd and 3rd fault ---
mmap(MAP_PRIVATE)
hugetlb_fault address: 3effff000000, flags: 55
hugetlb_cow old_page: f0000000010fc000
alloc_huge_page ret: f000000001100000
hugetlb_cow ptep: c000000455b27cf8, pte_same: 0
free_huge_page page: f000000001100000, restore_reserve: 1
hugetlb_fault address: 3effff000000, flags: 55
hugetlb_cow old_page: f0000000010fc000
alloc_huge_page ret: f000000001100000
hugetlb_cow ptep: c000000455b27cf8, pte_same: 1

--- x86_64 -> 2nd fault ---
mmap(MAP_PRIVATE)
hugetlb_fault address: 7f71a4200000, flags: 55
hugetlb_cow address 0x7f71a4200000, old_page: ffffea0008d20000
alloc_huge_page ret: ffffea0008d38000
hugetlb_cow ptep: ffff8802314c7908, pte_same: 1

Regards,
Jan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [bug/regression] libhugetlbfs testsuite failures and OOMs eventually kill my system
  2016-10-17 14:44           ` Jan Stancek
@ 2016-10-17 18:27             ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 26+ messages in thread
From: Aneesh Kumar K.V @ 2016-10-17 18:27 UTC (permalink / raw)
  To: Jan Stancek, Mike Kravetz
  Cc: linux-mm, linux-kernel, hillf zj, dave hansen, kirill shutemov,
	mhocko, n-horiguchi, iamjoonsoo kim

Jan Stancek <jstancek@redhat.com> writes:


> Hi Mike,
>
> Revert of 67961f9db8c4 helps, I let whole suite run for 100 iterations,
> there were no issues.
>
> I cut down reproducer and removed last mmap/write/munmap as that is enough
> to reproduce the problem. Then I started introducing some traces into kernel
> and noticed that on ppc I get 3 faults, while on x86 I get only 2.
>
> Interesting is the 2nd fault, that is first write after mapping as PRIVATE.
> Following condition fails on ppc first time:
>     if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) {
> but it's immediately followed by fault that looks identical
> and in that one it evaluates as true.

ok, we miss the _PAGE_PTE in new_pte there. 

	new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
				&& (vma->vm_flags & VM_SHARED)));
	set_huge_pte_at(mm, address, ptep, new_pte);

	hugetlb_count_add(pages_per_huge_page(h), mm);
	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
		/* Optimization, do the COW without a second fault */
		ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page, ptl);
	}

IMHO that new_pte usage is wrong, because we don't consider flags that
can possibly be added by set_huge_pte_at() there. For ppc64 we add
_PAGE_PTE.
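
Purely to illustrate that point (a sketch of the idea, not necessarily the
fix that ends up being merged): the COW optimization could compare against
the PTE value that was actually installed, e.g.

	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
		/*
		 * Sketch: pass the PTE as re-read from the page table, so
		 * that hugetlb_cow()'s pte_same() check compares against
		 * what set_huge_pte_at() actually installed (on ppc64 that
		 * includes _PAGE_PTE), instead of the raw new_pte value.
		 */
		ret = hugetlb_cow(mm, vma, address, ptep,
				  huge_ptep_get(ptep), page, ptl);
	}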

>
> Same with alloc_huge_page(), on x86_64 it's called twice, on ppc three times.
> In 2nd call vma_needs_reservation() returns 0, in 3rd it returns 1.
>
> ---- ppc -> 2nd and 3rd fault ---
> mmap(MAP_PRIVATE)
> hugetlb_fault address: 3effff000000, flags: 55
> hugetlb_cow old_page: f0000000010fc000
> alloc_huge_page ret: f000000001100000
> hugetlb_cow ptep: c000000455b27cf8, pte_same: 0
> free_huge_page page: f000000001100000, restore_reserve: 1
> hugetlb_fault address: 3effff000000, flags: 55
> hugetlb_cow old_page: f0000000010fc000
> alloc_huge_page ret: f000000001100000
> hugetlb_cow ptep: c000000455b27cf8, pte_same: 1
>
> --- x86_64 -> 2nd fault ---
> mmap(MAP_PRIVATE)
> hugetlb_fault address: 7f71a4200000, flags: 55
> hugetlb_cow address 0x7f71a4200000, old_page: ffffea0008d20000
> alloc_huge_page ret: ffffea0008d38000
> hugetlb_cow ptep: ffff8802314c7908, pte_same: 1
>

But I guess we still have an issue with respecting reservations here.

I will look at _PAGE_PTE and see what best we can do w.r.t hugetlb.

-aneesh

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [bug/regression] libhugetlbfs testsuite failures and OOMs eventually kill my system
  2016-10-17  5:04           ` Aneesh Kumar K.V
@ 2016-10-17 22:53             ` Mike Kravetz
  -1 siblings, 0 replies; 26+ messages in thread
From: Mike Kravetz @ 2016-10-17 22:53 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Jan Stancek, linux-mm, linux-kernel
  Cc: hillf.zj, dave.hansen, kirill.shutemov, mhocko, n-horiguchi,
	iamjoonsoo.kim

On 10/16/2016 10:04 PM, Aneesh Kumar K.V wrote:
> Mike Kravetz <mike.kravetz@oracle.com> writes:
> 
>> On 10/14/2016 01:48 AM, Jan Stancek wrote:
>>> On 10/14/2016 01:26 AM, Mike Kravetz wrote:
>>>>
>>>> Hi Jan,
>>>>
>>>> Any chance you can get the contents of /sys/kernel/mm/hugepages
>>>> before and after the first run of libhugetlbfs testsuite on Power?
>>>> Perhaps a script like:
>>>>
>>>> cd /sys/kernel/mm/hugepages
>>>> for f in hugepages-*/*; do
>>>> 	n=`cat $f`;
>>>> 	echo -e "$n\t$f";
>>>> done
>>>>
>>>> Just want to make sure the numbers look as they should.
>>>>
>>>
>>> Hi Mike,
>>>
>>> Numbers are below. I have also isolated a single testcase from "func"
>>> group of tests: corrupt-by-cow-opt [1]. This test stops working if I
>>> run it 19 times (with 20 hugepages). And if I disable this test,
>>> "func" group tests can all pass repeatedly.
>>
>> Thanks Jan,
>>
>> I appreciate your efforts.
>>
>>>
>>> [1] https://github.com/libhugetlbfs/libhugetlbfs/blob/master/tests/corrupt-by-cow-opt.c
>>>
>>> Regards,
>>> Jan
>>>
>>> Kernel is v4.8-14230-gb67be92, with reboot between each run.
>>> 1) Only func tests
>>> System boot
>>> After setup:
>>> 20      hugepages-16384kB/free_hugepages
>>> 20      hugepages-16384kB/nr_hugepages
>>> 20      hugepages-16384kB/nr_hugepages_mempolicy
>>> 0       hugepages-16384kB/nr_overcommit_hugepages
>>> 0       hugepages-16384kB/resv_hugepages
>>> 0       hugepages-16384kB/surplus_hugepages
>>> 0       hugepages-16777216kB/free_hugepages
>>> 0       hugepages-16777216kB/nr_hugepages
>>> 0       hugepages-16777216kB/nr_hugepages_mempolicy
>>> 0       hugepages-16777216kB/nr_overcommit_hugepages
>>> 0       hugepages-16777216kB/resv_hugepages
>>> 0       hugepages-16777216kB/surplus_hugepages
>>>
>>> After func tests:
>>> ********** TEST SUMMARY
>>> *                      16M
>>> *                      32-bit 64-bit
>>> *     Total testcases:     0     85
>>> *             Skipped:     0      0
>>> *                PASS:     0     81
>>> *                FAIL:     0      4
>>> *    Killed by signal:     0      0
>>> *   Bad configuration:     0      0
>>> *       Expected FAIL:     0      0
>>> *     Unexpected PASS:     0      0
>>> * Strange test result:     0      0
>>>
>>> 26      hugepages-16384kB/free_hugepages
>>> 26      hugepages-16384kB/nr_hugepages
>>> 26      hugepages-16384kB/nr_hugepages_mempolicy
>>> 0       hugepages-16384kB/nr_overcommit_hugepages
>>> 1       hugepages-16384kB/resv_hugepages
>>> 0       hugepages-16384kB/surplus_hugepages
>>> 0       hugepages-16777216kB/free_hugepages
>>> 0       hugepages-16777216kB/nr_hugepages
>>> 0       hugepages-16777216kB/nr_hugepages_mempolicy
>>> 0       hugepages-16777216kB/nr_overcommit_hugepages
>>> 0       hugepages-16777216kB/resv_hugepages
>>> 0       hugepages-16777216kB/surplus_hugepages
>>>
>>> After test cleanup:
>>>  umount -a -t hugetlbfs
>>>  hugeadm --pool-pages-max ${HPSIZE}:0
>>>
>>> 1       hugepages-16384kB/free_hugepages
>>> 1       hugepages-16384kB/nr_hugepages
>>> 1       hugepages-16384kB/nr_hugepages_mempolicy
>>> 0       hugepages-16384kB/nr_overcommit_hugepages
>>> 1       hugepages-16384kB/resv_hugepages
>>> 1       hugepages-16384kB/surplus_hugepages
>>> 0       hugepages-16777216kB/free_hugepages
>>> 0       hugepages-16777216kB/nr_hugepages
>>> 0       hugepages-16777216kB/nr_hugepages_mempolicy
>>> 0       hugepages-16777216kB/nr_overcommit_hugepages
>>> 0       hugepages-16777216kB/resv_hugepages
>>> 0       hugepages-16777216kB/surplus_hugepages
>>>
>>
>> I am guessing the leaked reserve page is what is triggered by
>> running the test you isolated, corrupt-by-cow-opt.
>>
>>
>>> ---
>>>
>>> 2) Only stress tests
>>> System boot
>>> After setup:
>>> 20      hugepages-16384kB/free_hugepages
>>> 20      hugepages-16384kB/nr_hugepages
>>> 20      hugepages-16384kB/nr_hugepages_mempolicy
>>> 0       hugepages-16384kB/nr_overcommit_hugepages
>>> 0       hugepages-16384kB/resv_hugepages
>>> 0       hugepages-16384kB/surplus_hugepages
>>> 0       hugepages-16777216kB/free_hugepages
>>> 0       hugepages-16777216kB/nr_hugepages
>>> 0       hugepages-16777216kB/nr_hugepages_mempolicy
>>> 0       hugepages-16777216kB/nr_overcommit_hugepages
>>> 0       hugepages-16777216kB/resv_hugepages
>>> 0       hugepages-16777216kB/surplus_hugepages
>>>
>>> After stress tests:
>>> 20      hugepages-16384kB/free_hugepages
>>> 20      hugepages-16384kB/nr_hugepages
>>> 20      hugepages-16384kB/nr_hugepages_mempolicy
>>> 0       hugepages-16384kB/nr_overcommit_hugepages
>>> 17      hugepages-16384kB/resv_hugepages
>>> 0       hugepages-16384kB/surplus_hugepages
>>> 0       hugepages-16777216kB/free_hugepages
>>> 0       hugepages-16777216kB/nr_hugepages
>>> 0       hugepages-16777216kB/nr_hugepages_mempolicy
>>> 0       hugepages-16777216kB/nr_overcommit_hugepages
>>> 0       hugepages-16777216kB/resv_hugepages
>>> 0       hugepages-16777216kB/surplus_hugepages
>>>
>>> After cleanup:
>>> 17      hugepages-16384kB/free_hugepages
>>> 17      hugepages-16384kB/nr_hugepages
>>> 17      hugepages-16384kB/nr_hugepages_mempolicy
>>> 0       hugepages-16384kB/nr_overcommit_hugepages
>>> 17      hugepages-16384kB/resv_hugepages
>>> 17      hugepages-16384kB/surplus_hugepages
>>> 0       hugepages-16777216kB/free_hugepages
>>> 0       hugepages-16777216kB/nr_hugepages
>>> 0       hugepages-16777216kB/nr_hugepages_mempolicy
>>> 0       hugepages-16777216kB/nr_overcommit_hugepages
>>> 0       hugepages-16777216kB/resv_hugepages
>>> 0       hugepages-16777216kB/surplus_hugepages
>>>
>>
>> This looks worse than the summary after running the functional tests.
>>
>>> ---
>>>
>>> 3) only corrupt-by-cow-opt
>>>
>>> System boot
>>> After setup:
>>> 20      hugepages-16384kB/free_hugepages
>>> 20      hugepages-16384kB/nr_hugepages
>>> 20      hugepages-16384kB/nr_hugepages_mempolicy
>>> 0       hugepages-16384kB/nr_overcommit_hugepages
>>> 0       hugepages-16384kB/resv_hugepages
>>> 0       hugepages-16384kB/surplus_hugepages
>>> 0       hugepages-16777216kB/free_hugepages
>>> 0       hugepages-16777216kB/nr_hugepages
>>> 0       hugepages-16777216kB/nr_hugepages_mempolicy
>>> 0       hugepages-16777216kB/nr_overcommit_hugepages
>>> 0       hugepages-16777216kB/resv_hugepages
>>> 0       hugepages-16777216kB/surplus_hugepages
>>>
>>> libhugetlbfs-2.18# env LD_LIBRARY_PATH=./obj64 ./tests/obj64/corrupt-by-cow-opt; /root/grab.sh
>>> Starting testcase "./tests/obj64/corrupt-by-cow-opt", pid 3298
>>> Write s to 0x3effff000000 via shared mapping
>>> Write p to 0x3effff000000 via private mapping
>>> Read s from 0x3effff000000 via shared mapping
>>> PASS
>>> 20      hugepages-16384kB/free_hugepages
>>> 20      hugepages-16384kB/nr_hugepages
>>> 20      hugepages-16384kB/nr_hugepages_mempolicy
>>> 0       hugepages-16384kB/nr_overcommit_hugepages
>>> 1       hugepages-16384kB/resv_hugepages
>>> 0       hugepages-16384kB/surplus_hugepages
>>> 0       hugepages-16777216kB/free_hugepages
>>> 0       hugepages-16777216kB/nr_hugepages
>>> 0       hugepages-16777216kB/nr_hugepages_mempolicy
>>> 0       hugepages-16777216kB/nr_overcommit_hugepages
>>> 0       hugepages-16777216kB/resv_hugepages
>>> 0       hugepages-16777216kB/surplus_hugepages
>>
>> Leaked one reserve page
>>
>>>
>>> # env LD_LIBRARY_PATH=./obj64 ./tests/obj64/corrupt-by-cow-opt; /root/grab.sh
>>> Starting testcase "./tests/obj64/corrupt-by-cow-opt", pid 3312
>>> Write s to 0x3effff000000 via shared mapping
>>> Write p to 0x3effff000000 via private mapping
>>> Read s from 0x3effff000000 via shared mapping
>>> PASS
>>> 20      hugepages-16384kB/free_hugepages
>>> 20      hugepages-16384kB/nr_hugepages
>>> 20      hugepages-16384kB/nr_hugepages_mempolicy
>>> 0       hugepages-16384kB/nr_overcommit_hugepages
>>> 2       hugepages-16384kB/resv_hugepages
>>> 0       hugepages-16384kB/surplus_hugepages
>>> 0       hugepages-16777216kB/free_hugepages
>>> 0       hugepages-16777216kB/nr_hugepages
>>> 0       hugepages-16777216kB/nr_hugepages_mempolicy
>>> 0       hugepages-16777216kB/nr_overcommit_hugepages
>>> 0       hugepages-16777216kB/resv_hugepages
>>> 0       hugepages-16777216kB/surplus_hugepages
>>
>> It is pretty consistent that we leak a reserve page every time this
>> test is run.
>>
>> The interesting thing is that corrupt-by-cow-opt is a very simple
>> test case.  commit 67961f9db8c4 potentially changes the return value
>> of the functions vma_has_reserves() and vma_needs/commit_reservation()
>> for the owner (HPAGE_RESV_OWNER) of private mappings.  running the
>> test with and without the commit results in the same return values for
>> these routines on x86.  And, no leaked reserve pages.
> 
> Looking at that commit, I am not sure the region_chg output indicates a
> hole punched. That is, w.r.t. a private mapping we don't do a region_chg
> at mmap time (in hugetlb_reserve_pages()). So on a later fault, when we
> call vma_needs_reservation, we will find region_chg returning >= 0, right?
> 

Let me try to explain.

When a private mapping is created, hugetlb_reserve_pages() is called to
reserve huge pages for the mapping.  A reserve map is created and
installed in the VMA (vm_private_data).  No reservation entries are
actually created for the mapping.  But, hugetlb_acct_memory() is called
to reserve pages for the mapping in the global pool.  This will adjust
(increment) the global reserved huge page counter (resv_huge_pages).

As pages within the private mapping are faulted in, alloc_huge_page() is
called to allocate the pages.  Within alloc_huge_page, vma_needs_reservation
is called to determine if there is a reservation for this allocation.
If there is a reservation, the global count is adjusted (decremented).
In any case where a page is returned to the caller, vma_commit_reservation
is called and an entry for the page is created in the reserve map
(VMA vm_private_data) of the mapping.

Once a page is instantiated within the private mapping, an entry exists
in the reserve map and the reserve count has been adjusted to indicate
that the reserve has been consumed.  Subsequent faults will not instantiate
a new page unless the original is somehow removed from the mapping.  The
only way a user can remove a page from the mapping is via a hole punch or
truncate operation.  Note that hole punch and truncate for huge pages
only apply to hugetlbfs backed mappings and not to anonymous mappings.

Hole punch and truncate will unmap huge pages from any private
mapping associated with the same offset in the hugetlbfs file.  However,
they will not remove entries from the VMA vm_private_data reserve maps,
nor will they adjust global reserve counts based on private mappings.

Now suppose a subsequent fault happens for a page in a private mapping that
was removed via hole punch or truncate.  Prior to commit 67961f9db8c4,
vma_needs_reservation ALWAYS returned false to indicate that a reservation
existed for the page.  So, alloc_huge_page would consume a reserved page.
The problem is that the reservation was already consumed at the time of the
first fault and no longer exists.  This caused the global reserve count to
become incorrect.

Commit 67961f9db8c4 looks at the VMA private reserve map to determine if
the original reservation was consumed.  If an entry exists in the map, it
is assumed the reservation was consumed and no longer exists.
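
The lifecycle described above can be watched from userspace through the
resv_hugepages counter that appears in the /sys dumps earlier in this
thread. A rough sketch (the hugetlbfs mount point, file name and the
hugepages-16384kB path / 16MB page size are assumptions matching those
dumps; the expected counter movement follows from the description above):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE	(16UL * 1024 * 1024)
#define RESV_FILE	"/sys/kernel/mm/hugepages/hugepages-16384kB/resv_hugepages"

static long resv_hugepages(void)
{
	long val = -1;
	FILE *f = fopen(RESV_FILE, "r");

	if (f) {
		if (fscanf(f, "%ld", &val) != 1)
			val = -1;
		fclose(f);
	}
	return val;
}

int main(void)
{
	int fd = open("/mnt/huge/resv-test", O_CREAT | O_RDWR, 0600);
	char *p;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	printf("before mmap: resv_hugepages = %ld\n", resv_hugepages());

	/* hugetlb_reserve_pages() runs here: resv_hugepages should go up */
	p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	printf("after mmap:  resv_hugepages = %ld\n", resv_hugepages());

	/* the fault consumes the reserve: resv_hugepages should drop again */
	*p = 1;
	printf("after fault: resv_hugepages = %ld\n", resv_hugepages());

	munmap(p, HPAGE_SIZE);
	unlink("/mnt/huge/resv-test");
	close(fd);
	return 0;
}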

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [bug/regression] libhugetlbfs testsuite failures and OOMs eventually kill my system
  2016-10-17 18:27             ` Aneesh Kumar K.V
@ 2016-10-17 23:19               ` Mike Kravetz
  -1 siblings, 0 replies; 26+ messages in thread
From: Mike Kravetz @ 2016-10-17 23:19 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Jan Stancek
  Cc: linux-mm, linux-kernel, hillf zj, dave hansen, kirill shutemov,
	mhocko, n-horiguchi, iamjoonsoo kim

On 10/17/2016 11:27 AM, Aneesh Kumar K.V wrote:
> Jan Stancek <jstancek@redhat.com> writes:
> 
> 
>> Hi Mike,
>>
>> Revert of 67961f9db8c4 helps, I let whole suite run for 100 iterations,
>> there were no issues.
>>
>> I cut down reproducer and removed last mmap/write/munmap as that is enough
>> to reproduce the problem. Then I started introducing some traces into kernel
>> and noticed that on ppc I get 3 faults, while on x86 I get only 2.
>>
>> Interesting is the 2nd fault, that is first write after mapping as PRIVATE.
>> Following condition fails on ppc first time:
>>     if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) {
>> but it's immediately followed by fault that looks identical
>> and in that one it evaluates as true.
> 
> ok, we miss the _PAGE_PTE in new_pte there. 
> 
> 	new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
> 				&& (vma->vm_flags & VM_SHARED)));
> 	set_huge_pte_at(mm, address, ptep, new_pte);
> 
> 	hugetlb_count_add(pages_per_huge_page(h), mm);
> 	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
> 		/* Optimization, do the COW without a second fault */
> 		ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page, ptl);
> 	}
> 
> IMHO that new_pte usage is wrong, because we don't consider flags that
> can possibly be added by set_huge_pte_at() there. For ppc64 we add _PAGE_PTE.
> 

Thanks for looking at this Aneesh.

>>
>> Same with alloc_huge_page(), on x86_64 it's called twice, on ppc three times.
>> In 2nd call vma_needs_reservation() returns 0, in 3rd it returns 1.
>>
>> ---- ppc -> 2nd and 3rd fault ---
>> mmap(MAP_PRIVATE)
>> hugetlb_fault address: 3effff000000, flags: 55
>> hugetlb_cow old_page: f0000000010fc000
>> alloc_huge_page ret: f000000001100000
>> hugetlb_cow ptep: c000000455b27cf8, pte_same: 0
>> free_huge_page page: f000000001100000, restore_reserve: 1

So, the interesting thing is that since we do not take the optimized path
there is an additional fault.  It looks like the additional fault results
in the originally allocated page being freed and the reserve count being
incremented.  As mentioned in the description of commit 67961f9db8c4, the
VMA private reserve map will still contain an entry for the page.
Therefore, when a page allocation happens as the result of the next fault,
it will think the reserved page has already been consumed and not use it.
This is how we are 'leaking' reserved pages.

>> hugetlb_fault address: 3effff000000, flags: 55
>> hugetlb_cow old_page: f0000000010fc000
>> alloc_huge_page ret: f000000001100000
>> hugetlb_cow ptep: c000000455b27cf8, pte_same: 1
>>
>> --- x86_64 -> 2nd fault ---
>> mmap(MAP_PRIVATE)
>> hugetlb_fault address: 7f71a4200000, flags: 55
>> hugetlb_cow address 0x7f71a4200000, old_page: ffffea0008d20000
>> alloc_huge_page ret: ffffea0008d38000
>> hugetlb_cow ptep: ffff8802314c7908, pte_same: 1
>>
> 
> But I guess we still have issue with respecting reservation here.
> 
> I will look at _PAGE_PTE and see what best we can do w.r.t hugetlb.
> 
> -aneesh

If there was not the additional fault, we would not perform the additional
free and alloc and would not see this issue.  However, the logic in
67961f9db8c4 also missed this error case (and, I think, any time we do not
take the optimized code path).

I suspect it would be desirable to fix the code path for Power such that
it does not do the additional fault (free/alloc).  I'll take a look at the
code for commit 67961f9db8c4.  It certainly misses the error case, and
seems 'too fragile' to depend on the optimized code paths.

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [bug/regression] libhugetlbfs testsuite failures and OOMs eventually kill my system
  2016-10-17 22:53             ` Mike Kravetz
@ 2016-10-18  1:18               ` Mike Kravetz
  -1 siblings, 0 replies; 26+ messages in thread
From: Mike Kravetz @ 2016-10-18  1:18 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Jan Stancek, linux-mm, linux-kernel
  Cc: hillf.zj, dave.hansen, kirill.shutemov, mhocko, n-horiguchi,
	iamjoonsoo.kim

On 10/17/2016 03:53 PM, Mike Kravetz wrote:
> On 10/16/2016 10:04 PM, Aneesh Kumar K.V wrote:
>>
>> Looking at that commit, I am not sure the region_chg output indicates a
>> punched hole.  I.e., w.r.t. a private mapping, when we mmap we don't do a
>> region_chg (hugetlb_reserve_pages()).  So with a later fault, when we call
>> vma_needs_reservation, we will find region_chg returning >= 0, right?
>>
> 
> Let me try to explain.
> 
> When a private mapping is created, hugetlb_reserve_pages() is called
> to reserve huge pages for the mapping.  A reserve map is created and
> installed in the (vma_private) VMA.  No reservation entries are
> actually created for the mapping.  But, hugetlb_acct_memory() is
> called to reserve pages for the mapping in the global pool.  This
> will adjust (increment) the global reserved huge page counter
> (resv_huge_pages).
> 
> As pages within the private mapping are faulted in, alloc_huge_page() is
> called to allocate the pages.  Within alloc_huge_page, vma_needs_reservation
> is called to determine if there is a reservation for this allocation.
> If there is a reservation, the global count is adjusted (decremented).
> In any case where a page is returned to the caller, vma_commit_reservation
> is called and an entry for the page is created in the reserve map (VMA
> vma_private) of the mapping.
> 
> Once a page is instantiated within the private mapping, an entry exists
> in the reserve map and the reserve count has been adjusted to indicate
> that the reserve has been consumed.  Subsequent faults will not instantiate
> a new page unless the original is somehow removed from the mapping.  The
> only way a user can remove a page from the mapping is via a hole punch or
> truncate operation.  Note that hole punch and truncate for huge pages
> only apply to hugetlbfs-backed mappings and not to anonymous mappings.
> 
> Hole punch and truncate will unmap huge pages from any private
> mapping associated with the same offset in the hugetlbfs file.  However,
> they will not remove entries from the VMA private_data reserve maps.
> Nor will they adjust global reserve counts based on private mappings.

Question.  Should hole punch and truncate unmap private mappings?
Commit 67961f9db8c4 is just trying to correctly handle that situation.
If we do not unmap the private pages, then there is no need for this code.
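
To make the scenario concrete, here is a minimal userspace sketch of the
sequence in question.  It assumes a hugetlbfs mount at /mnt/huge, 16MB huge
pages and fallocate() hole punch support; the file name and sizes are only
illustrative:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE (16UL * 1024 * 1024)    /* system dependent */

int main(void)
{
    char *p;
    int fd = open("/mnt/huge/holepunch", O_CREAT | O_RDWR, 0600);

    if (fd < 0) {
        perror("open");
        return 1;
    }

    p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    p[0] = 1;    /* first fault: consumes the reservation made at mmap time */

    /* Hole punch on the file; the question above is whether this should
     * also unmap the page already instantiated in the private mapping. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  0, HPAGE_SIZE))
        perror("fallocate");

    p[0] = 2;    /* if it does, this is the subsequent fault whose reserve
                  * accounting commit 67961f9db8c4 tries to get right */

    munmap(p, HPAGE_SIZE);
    close(fd);
    unlink("/mnt/huge/holepunch");
    return 0;
}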

-- 
Mike Kravetz

> 
> Now suppose a subsequent fault happens for a page in a private mapping
> that was removed via hole punch or truncate.  Prior to commit 67961f9db8c4,
> vma_needs_reservation ALWAYS returned false to indicate that a reservation
> existed for the page.  So, alloc_huge_page would consume a reserved page.
> The problem is that the reservation was consumed at the time of the first
> fault and no longer exists.  This caused the global reserve count to be
> incorrect.
> 
> Commit 67961f9db8c4 looks at the VMA private reserve map to determine if
> the original reservation was consumed.  If an entry exists in the map, it
> is assumed the reservation was consumed and no longer exists.
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [bug/regression] libhugetlbfs testsuite failures and OOMs eventually kill my system
  2016-10-17 14:44           ` Jan Stancek
@ 2016-10-18  8:31             ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 26+ messages in thread
From: Aneesh Kumar K.V @ 2016-10-18  8:31 UTC (permalink / raw)
  To: Jan Stancek, Mike Kravetz
  Cc: linux-mm, linux-kernel, hillf zj, dave hansen, kirill shutemov,
	mhocko, n-horiguchi, iamjoonsoo kim

Jan Stancek <jstancek@redhat.com> writes:
> Hi Mike,
>
> Revert of 67961f9db8c4 helps, I let whole suite run for 100 iterations,
> there were no issues.
>
> I cut down reproducer and removed last mmap/write/munmap as that is enough
> to reproduce the problem. Then I started introducing some traces into kernel
> and noticed that on ppc I get 3 faults, while on x86 I get only 2.
>
> Interesting is the 2nd fault, that is first write after mapping as PRIVATE.
> Following condition fails on ppc first time:
>     if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) {
> but it's immediately followed by fault that looks identical
> and in that one it evaluates as true.
>
> Same with alloc_huge_page(), on x86_64 it's called twice, on ppc three times.
> In 2nd call vma_needs_reservation() returns 0, in 3rd it returns 1.
>
> ---- ppc -> 2nd and 3rd fault ---
> mmap(MAP_PRIVATE)
> hugetlb_fault address: 3effff000000, flags: 55
> hugetlb_cow old_page: f0000000010fc000
> alloc_huge_page ret: f000000001100000
> hugetlb_cow ptep: c000000455b27cf8, pte_same: 0
> free_huge_page page: f000000001100000, restore_reserve: 1
> hugetlb_fault address: 3effff000000, flags: 55
> hugetlb_cow old_page: f0000000010fc000
> alloc_huge_page ret: f000000001100000
> hugetlb_cow ptep: c000000455b27cf8, pte_same: 1
>
> --- x86_64 -> 2nd fault ---
> mmap(MAP_PRIVATE)
> hugetlb_fault address: 7f71a4200000, flags: 55
> hugetlb_cow address 0x7f71a4200000, old_page: ffffea0008d20000
> alloc_huge_page ret: ffffea0008d38000
> hugetlb_cow ptep: ffff8802314c7908, pte_same: 1
>
> Regards,
> Jan
>

Can you check with the patch below?  I ran the corrupt-by-cow-opt test with
this patch and the resv count got correctly updated.

commit fb2e0c081d2922c8aaa49bbe166472aac68ef5e1
Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Date:   Tue Oct 18 11:23:11 2016 +0530

    mm/hugetlb: Use the right pte val for compare in hugetlb_cow
    
    We cannot use the pte value passed to set_pte_at for the pte_same
    comparison, because archs like ppc64 filter/add new pte flags in
    set_pte_at. Instead, fetch the pte value inside hugetlb_cow. We
    compare the pte value to make sure the pte didn't change since we
    dropped the page table lock. hugetlb_cow gets called with the page
    table lock held, and we can take a copy of the pte value before we
    drop the lock.
    
    Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ec49d9ef1eef..da8fbd02b92e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3386,15 +3386,17 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
  * Keep the pte_same checks anyway to make transition from the mutex easier.
  */
 static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
-			unsigned long address, pte_t *ptep, pte_t pte,
-			struct page *pagecache_page, spinlock_t *ptl)
+		       unsigned long address, pte_t *ptep,
+		       struct page *pagecache_page, spinlock_t *ptl)
 {
+	pte_t pte;
 	struct hstate *h = hstate_vma(vma);
 	struct page *old_page, *new_page;
 	int ret = 0, outside_reserve = 0;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
 
+	pte = huge_ptep_get(ptep);
 	old_page = pte_page(pte);
 
 retry_avoidcopy:
@@ -3668,7 +3670,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	hugetlb_count_add(pages_per_huge_page(h), mm);
 	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
 		/* Optimization, do the COW without a second fault */
-		ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page, ptl);
+		ret = hugetlb_cow(mm, vma, address, ptep, page, ptl);
 	}
 
 	spin_unlock(ptl);
@@ -3822,8 +3824,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	if (flags & FAULT_FLAG_WRITE) {
 		if (!huge_pte_write(entry)) {
-			ret = hugetlb_cow(mm, vma, address, ptep, entry,
-					pagecache_page, ptl);
+			ret = hugetlb_cow(mm, vma, address, ptep,
+					  pagecache_page, ptl);
 			goto out_put_page;
 		}
 		entry = huge_pte_mkdirty(entry);

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [bug/regression] libhugetlbfs testsuite failures and OOMs eventually kill my system
  2016-10-18  8:31             ` Aneesh Kumar K.V
@ 2016-10-18 11:28               ` Jan Stancek
  -1 siblings, 0 replies; 26+ messages in thread
From: Jan Stancek @ 2016-10-18 11:28 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Mike Kravetz
  Cc: linux-mm, linux-kernel, hillf zj, dave hansen, kirill shutemov,
	mhocko, n-horiguchi, iamjoonsoo kim





----- Original Message -----
> Jan Stancek <jstancek@redhat.com> writes:
> > Hi Mike,
> >
> > Revert of 67961f9db8c4 helps, I let whole suite run for 100 iterations,
> > there were no issues.
> >
> > I cut down reproducer and removed last mmap/write/munmap as that is enough
> > to reproduce the problem. Then I started introducing some traces into
> > kernel
> > and noticed that on ppc I get 3 faults, while on x86 I get only 2.
> >
> > Interesting is the 2nd fault, that is first write after mapping as PRIVATE.
> > Following condition fails on ppc first time:
> >     if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) {
> > but it's immediately followed by fault that looks identical
> > and in that one it evaluates as true.
> >
> > Same with alloc_huge_page(), on x86_64 it's called twice, on ppc three
> > times.
> > In 2nd call vma_needs_reservation() returns 0, in 3rd it returns 1.
> >
> > ---- ppc -> 2nd and 3rd fault ---
> > mmap(MAP_PRIVATE)
> > hugetlb_fault address: 3effff000000, flags: 55
> > hugetlb_cow old_page: f0000000010fc000
> > alloc_huge_page ret: f000000001100000
> > hugetlb_cow ptep: c000000455b27cf8, pte_same: 0
> > free_huge_page page: f000000001100000, restore_reserve: 1
> > hugetlb_fault address: 3effff000000, flags: 55
> > hugetlb_cow old_page: f0000000010fc000
> > alloc_huge_page ret: f000000001100000
> > hugetlb_cow ptep: c000000455b27cf8, pte_same: 1
> >
> > --- x86_64 -> 2nd fault ---
> > mmap(MAP_PRIVATE)
> > hugetlb_fault address: 7f71a4200000, flags: 55
> > hugetlb_cow address 0x7f71a4200000, old_page: ffffea0008d20000
> > alloc_huge_page ret: ffffea0008d38000
> > hugetlb_cow ptep: ffff8802314c7908, pte_same: 1
> >
> > Regards,
> > Jan
> >
> 
> Can you check with the patch below?  I ran the corrupt-by-cow-opt test with
> this patch and the resv count got correctly updated.

I have been running the libhugetlbfs suite with the patch below in a loop
for ~2 hours now and I don't see any problems/ENOMEMs/OOMs or leaked
resv pages:

0       hugepages-16384kB/free_hugepages
0       hugepages-16384kB/nr_hugepages
0       hugepages-16384kB/nr_hugepages_mempolicy
0       hugepages-16384kB/nr_overcommit_hugepages
0       hugepages-16384kB/resv_hugepages
0       hugepages-16384kB/surplus_hugepages
0       hugepages-16777216kB/free_hugepages
0       hugepages-16777216kB/nr_hugepages
0       hugepages-16777216kB/nr_hugepages_mempolicy
0       hugepages-16777216kB/nr_overcommit_hugepages
0       hugepages-16777216kB/resv_hugepages
0       hugepages-16777216kB/surplus_hugepages
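
For reference, the counters above come from the standard sysfs layout under
/sys/kernel/mm/hugepages.  A small C sketch that prints the same kind of
post-run summary:

#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    static const char *files[] = { "free_hugepages", "nr_hugepages",
                                   "resv_hugepages", "surplus_hugepages" };
    DIR *d = opendir("/sys/kernel/mm/hugepages");
    struct dirent *de;

    if (!d) {
        perror("opendir");
        return 1;
    }

    while ((de = readdir(d)) != NULL) {
        if (strncmp(de->d_name, "hugepages-", 10) != 0)
            continue;
        for (size_t i = 0; i < sizeof(files) / sizeof(files[0]); i++) {
            char path[256];
            unsigned long val;
            FILE *f;

            snprintf(path, sizeof(path), "/sys/kernel/mm/hugepages/%s/%s",
                     de->d_name, files[i]);
            f = fopen(path, "r");
            if (!f)
                continue;
            if (fscanf(f, "%lu", &val) == 1)
                printf("%lu\t%s/%s\n", val, de->d_name, files[i]);
            fclose(f);
        }
    }
    closedir(d);
    return 0;
}

A non-zero resv_hugepages left over after the suite finishes is the kind of
reservation leak discussed earlier in the thread.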

Regards,
Jan

> 
> commit fb2e0c081d2922c8aaa49bbe166472aac68ef5e1
> Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> Date:   Tue Oct 18 11:23:11 2016 +0530
> 
>     mm/hugetlb: Use the right pte val for compare in hugetlb_cow
>     
>     We cannot use the pte value passed to set_pte_at for the pte_same
>     comparison, because archs like ppc64 filter/add new pte flags in
>     set_pte_at. Instead, fetch the pte value inside hugetlb_cow. We
>     compare the pte value to make sure the pte didn't change since we
>     dropped the page table lock. hugetlb_cow gets called with the page
>     table lock held, and we can take a copy of the pte value before we
>     drop the lock.
>     
>     Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ec49d9ef1eef..da8fbd02b92e 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3386,15 +3386,17 @@ static void unmap_ref_private(struct mm_struct *mm,
> struct vm_area_struct *vma,
>   * Keep the pte_same checks anyway to make transition from the mutex easier.
>   */
>  static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
> -			unsigned long address, pte_t *ptep, pte_t pte,
> -			struct page *pagecache_page, spinlock_t *ptl)
> +		       unsigned long address, pte_t *ptep,
> +		       struct page *pagecache_page, spinlock_t *ptl)
>  {
> +	pte_t pte;
>  	struct hstate *h = hstate_vma(vma);
>  	struct page *old_page, *new_page;
>  	int ret = 0, outside_reserve = 0;
>  	unsigned long mmun_start;	/* For mmu_notifiers */
>  	unsigned long mmun_end;		/* For mmu_notifiers */
>  
> +	pte = huge_ptep_get(ptep);
>  	old_page = pte_page(pte);
>  
>  retry_avoidcopy:
> @@ -3668,7 +3670,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct
> vm_area_struct *vma,
>  	hugetlb_count_add(pages_per_huge_page(h), mm);
>  	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
>  		/* Optimization, do the COW without a second fault */
> -		ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page, ptl);
> +		ret = hugetlb_cow(mm, vma, address, ptep, page, ptl);
>  	}
>  
>  	spin_unlock(ptl);
> @@ -3822,8 +3824,8 @@ int hugetlb_fault(struct mm_struct *mm, struct
> vm_area_struct *vma,
>  
>  	if (flags & FAULT_FLAG_WRITE) {
>  		if (!huge_pte_write(entry)) {
> -			ret = hugetlb_cow(mm, vma, address, ptep, entry,
> -					pagecache_page, ptl);
> +			ret = hugetlb_cow(mm, vma, address, ptep,
> +					  pagecache_page, ptl);
>  			goto out_put_page;
>  		}
>  		entry = huge_pte_mkdirty(entry);
> 
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2016-10-18 11:34 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-13 12:19 [bug/regression] libhugetlbfs testsuite failures and OOMs eventually kill my system Jan Stancek
2016-10-13 12:19 ` Jan Stancek
2016-10-13 15:24 ` Mike Kravetz
2016-10-13 15:24   ` Mike Kravetz
2016-10-13 23:26   ` Mike Kravetz
2016-10-13 23:26     ` Mike Kravetz
2016-10-14  8:48     ` Jan Stancek
2016-10-14  8:48       ` Jan Stancek
2016-10-14 23:57       ` Mike Kravetz
2016-10-14 23:57         ` Mike Kravetz
2016-10-17  5:04         ` Aneesh Kumar K.V
2016-10-17  5:04           ` Aneesh Kumar K.V
2016-10-17 22:53           ` Mike Kravetz
2016-10-17 22:53             ` Mike Kravetz
2016-10-18  1:18             ` Mike Kravetz
2016-10-18  1:18               ` Mike Kravetz
2016-10-17 14:44         ` Jan Stancek
2016-10-17 14:44           ` Jan Stancek
2016-10-17 18:27           ` Aneesh Kumar K.V
2016-10-17 18:27             ` Aneesh Kumar K.V
2016-10-17 23:19             ` Mike Kravetz
2016-10-17 23:19               ` Mike Kravetz
2016-10-18  8:31           ` Aneesh Kumar K.V
2016-10-18  8:31             ` Aneesh Kumar K.V
2016-10-18 11:28             ` Jan Stancek
2016-10-18 11:28               ` Jan Stancek
