[MM Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background

All of lore.kernel.org
 help / color / mirror / Atom feed

* [MM Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background
@ 2019-07-29  5:17 ` Li Wang
  0 siblings, 0 replies; 32+ messages in thread
From: Li Wang @ 2019-07-29  5:17 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Linux-MM, LTP List, mike.kravetz, xishi.qiuxishi, mhocko, Cyril Hrubis

[-- Attachment #1: Type: text/plain, Size: 2557 bytes --]

Hi Naoya and Linux-MMers,

The LTP/move_page12 V2 triggers SIGBUS in the kernel-v5.2.3 testing.
https://github.com/wangli5665/ltp/blob/master/testcases/kernel/syscalls/move_pages/move_pages12.c

It seems like the retry mmap() triggers SIGBUS while doing the
numa_move_pages()
in background. That is very similar to the kernel bug which was mentioned
by commit 6bc9b56433b76e40d(mm: fix race on soft-offlining ): A race
condition between soft offline and hugetlb_fault which causes unexpected
process SIGBUS killing.

I'm not sure if that below patch is making sene to memory-failures.c, but after
building a new kernel-5.2.3 with this change, the problem can NOT be
reproduced.

Any comments?

----------------------------------
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1695,15 +1695,16 @@ static int soft_offline_huge_page(struct page
*page, int flags)
        unlock_page(hpage);

        ret = isolate_huge_page(hpage, &pagelist);
+       if (!ret) {
+               pr_info("soft offline: %#lx hugepage failed to isolate\n",
pfn);
+               return -EBUSY;
+       }
+
        /*
         * get_any_page() and isolate_huge_page() takes a refcount each,
         * so need to drop one here.
         */
        put_hwpoison_page(hpage);
-       if (!ret) {
-               pr_info("soft offline: %#lx hugepage failed to isolate\n",
pfn);
-               return -EBUSY;
-       }


----- test on kernel-v5.2.3 ------
# ./move_pages12
tst_test.c:1100: INFO: Timeout per run is 0h 05m 00s
move_pages12.c:251: INFO: Free RAM 194212832 kB
move_pages12.c:269: INFO: Increasing 2048kB hugepages pool on node 0 to 4
move_pages12.c:279: INFO: Increasing 2048kB hugepages pool on node 1 to 6
move_pages12.c:195: INFO: Allocating and freeing 4 hugepages on node 0
move_pages12.c:195: INFO: Allocating and freeing 4 hugepages on node 1
move_pages12.c:185: PASS: Bug not reproduced
tst_test.c:1145: BROK: Test killed by SIGBUS!
move_pages12.c:114: FAIL: move_pages failed: ESRCH

----- test on kernel-v5.2.3  + above patch------
# ./move_pages12
tst_test.c:1100: INFO: Timeout per run is 0h 05m 00s
move_pages12.c:252: INFO: Free RAM 64780164 kB
move_pages12.c:270: INFO: Increasing 2048kB hugepages pool on node 0 to 7
move_pages12.c:280: INFO: Increasing 2048kB hugepages pool on node 1 to 10
move_pages12.c:196: INFO: Allocating and freeing 4 hugepages on node 0
move_pages12.c:196: INFO: Allocating and freeing 4 hugepages on node 1
move_pages12.c:186: PASS: Bug not reproduced
move_pages12.c:186: PASS: Bug not reproduced
--
Regards,
Li Wang

[-- Attachment #2: Type: text/html, Size: 3934 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [LTP] [MM Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background
@ 2019-07-29  5:17 ` Li Wang
  0 siblings, 0 replies; 32+ messages in thread
From: Li Wang @ 2019-07-29  5:17 UTC (permalink / raw)
  To: ltp

Hi Naoya and Linux-MMers,

The LTP/move_page12 V2 triggers SIGBUS in the kernel-v5.2.3 testing.
https://github.com/wangli5665/ltp/blob/master/testcases/kernel/syscalls/move_pages/move_pages12.c

It seems like the retry mmap() triggers SIGBUS while doing the
numa_move_pages()
in background. That is very similar to the kernel bug which was mentioned
by commit 6bc9b56433b76e40d(mm: fix race on soft-offlining ): A race
condition between soft offline and hugetlb_fault which causes unexpected
process SIGBUS killing.

I'm not sure if that below patch is making sene to memory-failures.c, but after
building a new kernel-5.2.3 with this change, the problem can NOT be
reproduced.

Any comments?

----------------------------------
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1695,15 +1695,16 @@ static int soft_offline_huge_page(struct page
*page, int flags)
        unlock_page(hpage);

        ret = isolate_huge_page(hpage, &pagelist);
+       if (!ret) {
+               pr_info("soft offline: %#lx hugepage failed to isolate\n",
pfn);
+               return -EBUSY;
+       }
+
        /*
         * get_any_page() and isolate_huge_page() takes a refcount each,
         * so need to drop one here.
         */
        put_hwpoison_page(hpage);
-       if (!ret) {
-               pr_info("soft offline: %#lx hugepage failed to isolate\n",
pfn);
-               return -EBUSY;
-       }


----- test on kernel-v5.2.3 ------
# ./move_pages12
tst_test.c:1100: INFO: Timeout per run is 0h 05m 00s
move_pages12.c:251: INFO: Free RAM 194212832 kB
move_pages12.c:269: INFO: Increasing 2048kB hugepages pool on node 0 to 4
move_pages12.c:279: INFO: Increasing 2048kB hugepages pool on node 1 to 6
move_pages12.c:195: INFO: Allocating and freeing 4 hugepages on node 0
move_pages12.c:195: INFO: Allocating and freeing 4 hugepages on node 1
move_pages12.c:185: PASS: Bug not reproduced
tst_test.c:1145: BROK: Test killed by SIGBUS!
move_pages12.c:114: FAIL: move_pages failed: ESRCH

----- test on kernel-v5.2.3  + above patch------
# ./move_pages12
tst_test.c:1100: INFO: Timeout per run is 0h 05m 00s
move_pages12.c:252: INFO: Free RAM 64780164 kB
move_pages12.c:270: INFO: Increasing 2048kB hugepages pool on node 0 to 7
move_pages12.c:280: INFO: Increasing 2048kB hugepages pool on node 1 to 10
move_pages12.c:196: INFO: Allocating and freeing 4 hugepages on node 0
move_pages12.c:196: INFO: Allocating and freeing 4 hugepages on node 1
move_pages12.c:186: PASS: Bug not reproduced
move_pages12.c:186: PASS: Bug not reproduced
--
Regards,
Li Wang
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.linux.it/pipermail/ltp/attachments/20190729/f91efb92/attachment.htm>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [MM Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background
  2019-07-29  5:17 ` [LTP] " Li Wang
@ 2019-07-29 19:00   ` Mike Kravetz
  -1 siblings, 0 replies; 32+ messages in thread
From: Mike Kravetz @ 2019-07-29 19:00 UTC (permalink / raw)
  To: Li Wang, Naoya Horiguchi
  Cc: Linux-MM, LTP List, xishi.qiuxishi, mhocko, Cyril Hrubis

On 7/28/19 10:17 PM, Li Wang wrote:
> Hi Naoya and Linux-MMers,
> 
> The LTP/move_page12 V2 triggers SIGBUS in the kernel-v5.2.3 testing.
> https://github.com/wangli5665/ltp/blob/master/testcases/kernel/syscalls/move_pages/move_pages12.c
> 
> It seems like the retry mmap() triggers SIGBUS while doing thenuma_move_pages() in background. That is very similar to the kernelbug which was mentioned by commit 6bc9b56433b76e40d(mm: fix race onsoft-offlining ): A race condition between soft offline andhugetlb_fault which causes unexpected process SIGBUS killing.
> 
> I'm not sure if that below patch is making sene to memory-failures.c, but after building a new kernel-5.2.3 with this change, the problem can NOT be reproduced. 
> 
> Any comments?

Something seems strange.  I can not reproduce with unmodified 5.2.3

[root@f23d move_pages]# uname -r
5.2.3
[root@f23d move_pages]# PATH=$PATH:$PWD ./move_pages12
tst_test.c:1096: INFO: Timeout per run is 0h 05m 00s
move_pages12.c:201: INFO: Free RAM 6725424 kB
move_pages12.c:219: INFO: Increasing 2048kB hugepages pool on node 0 to 4
move_pages12.c:229: INFO: Increasing 2048kB hugepages pool on node 1 to 4
move_pages12.c:145: INFO: Allocating and freeing 4 hugepages on node 0
move_pages12.c:145: INFO: Allocating and freeing 4 hugepages on node 1
move_pages12.c:135: PASS: Bug not reproduced

Summary:
passed   1
failed   0
skipped  0
warnings 0

Also, the soft_offline_huge_page() code should not come into play with
this specific test.
-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [LTP]  [MM Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background
@ 2019-07-29 19:00   ` Mike Kravetz
  0 siblings, 0 replies; 32+ messages in thread
From: Mike Kravetz @ 2019-07-29 19:00 UTC (permalink / raw)
  To: ltp

On 7/28/19 10:17 PM, Li Wang wrote:
> Hi Naoya and Linux-MMers,
> 
> The LTP/move_page12 V2 triggers SIGBUS in the kernel-v5.2.3 testing.
> https://github.com/wangli5665/ltp/blob/master/testcases/kernel/syscalls/move_pages/move_pages12.c
> 
> It seems like the retry mmap() triggers SIGBUS while doing thenuma_move_pages() in background. That is very similar to the kernelbug which was mentioned by commit 6bc9b56433b76e40d(mm: fix race onsoft-offlining ): A race condition between soft offline andhugetlb_fault which causes unexpected process SIGBUS killing.
> 
> I'm not sure if that below patch is making sene to memory-failures.c, but after building a new kernel-5.2.3 with this change, the problem can NOT be reproduced. 
> 
> Any comments?

Something seems strange.  I can not reproduce with unmodified 5.2.3

[root@f23d move_pages]# uname -r
5.2.3
[root@f23d move_pages]# PATH=$PATH:$PWD ./move_pages12
tst_test.c:1096: INFO: Timeout per run is 0h 05m 00s
move_pages12.c:201: INFO: Free RAM 6725424 kB
move_pages12.c:219: INFO: Increasing 2048kB hugepages pool on node 0 to 4
move_pages12.c:229: INFO: Increasing 2048kB hugepages pool on node 1 to 4
move_pages12.c:145: INFO: Allocating and freeing 4 hugepages on node 0
move_pages12.c:145: INFO: Allocating and freeing 4 hugepages on node 1
move_pages12.c:135: PASS: Bug not reproduced

Summary:
passed   1
failed   0
skipped  0
warnings 0

Also, the soft_offline_huge_page() code should not come into play with
this specific test.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [MM Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background
  2019-07-29 19:00   ` [LTP] " Mike Kravetz
@ 2019-07-30  6:29     ` Li Wang
  -1 siblings, 0 replies; 32+ messages in thread
From: Li Wang @ 2019-07-30  6:29 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Naoya Horiguchi, Linux-MM, LTP List, xishi.qiuxishi, mhocko,
	Cyril Hrubis

[-- Attachment #1: Type: text/plain, Size: 3983 bytes --]

Hi Mike,

Thanks for trying this.

On Tue, Jul 30, 2019 at 3:01 AM Mike Kravetz <mike.kravetz@oracle.com>
wrote:
>
> On 7/28/19 10:17 PM, Li Wang wrote:
> > Hi Naoya and Linux-MMers,
> >
> > The LTP/move_page12 V2 triggers SIGBUS in the kernel-v5.2.3 testing.
> >
https://github.com/wangli5665/ltp/blob/master/testcases/kernel/syscalls/move_pages/move_pages12.c
> >
> > It seems like the retry mmap() triggers SIGBUS while doing
thenuma_move_pages() in background. That is very similar to the kernelbug
which was mentioned by commit 6bc9b56433b76e40d(mm: fix race
onsoft-offlining ): A race condition between soft offline andhugetlb_fault
which causes unexpected process SIGBUS killing.
> >
> > I'm not sure if that below patch is making sene to memory-failures.c,
but after building a new kernel-5.2.3 with this change, the problem can NOT
be reproduced.
> >
> > Any comments?
>
> Something seems strange.  I can not reproduce with unmodified 5.2.3

It's not 100% reproducible, I tried ten times only hit 4~6 times fail.

Did you try the test case with patch V3(in my branch)?
https://github.com/wangli5665/ltp/commit/198fca89870c1b807a01b27bb1d2ec6e2af1c7b6

# git clone https://github.com/wangli5665/ltp ltp.wangli --depth=1
# cd ltp.wangli/; make autotools;
# ./configure ; make -j24
# cd testcases/kernel/syscalls/move_pages/
# ./move_pages12
tst_test.c:1100: INFO: Timeout per run is 0h 05m 00s
move_pages12.c:249: INFO: Free RAM 64386300 kB
move_pages12.c:267: INFO: Increasing 2048kB hugepages pool on node 0 to 4
move_pages12.c:277: INFO: Increasing 2048kB hugepages pool on node 1 to 4
move_pages12.c:193: INFO: Allocating and freeing 4 hugepages on node 0
move_pages12.c:193: INFO: Allocating and freeing 4 hugepages on node 1
move_pages12.c:183: PASS: Bug not reproduced
tst_test.c:1145: BROK: Test killed by SIGBUS!
move_pages12.c:117: FAIL: move_pages failed: ESRCH

# uname -r
5.2.3

# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
node 0 size: 16049 MB
node 0 free: 15736 MB
node 1 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
node 1 size: 16123 MB
node 1 free: 15850 MB
node 2 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
node 2 size: 16123 MB
node 2 free: 15989 MB
node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
node 3 size: 16097 MB
node 3 free: 15278 MB
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10

> Also, the soft_offline_huge_page() code should not come into play with
> this specific test.

I got the "soft offline xxx.. hugepage failed to isolate" message from
soft_offline_huge_page()
in dmesg log.

=== debug print info ===
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1701,7 +1701,7 @@ static int soft_offline_huge_page(struct page *page,
int flags)
         */
        put_hwpoison_page(hpage);
        if (!ret) {
-               pr_info("soft offline: %#lx hugepage failed to isolate\n",
pfn);
+               pr_info("liwang -- soft offline: %#lx hugepage failed to
isolate\n", pfn);
                return -EBUSY;
        }

# dmesg
...
[ 1068.947205] Soft offlining pfn 0x40b200 at process virtual address
0x7f9d8d000000
[ 1068.987054] Soft offlining pfn 0x40ac00 at process virtual address
0x7f9d8d200000
[ 1069.048478] Soft offlining pfn 0x40a800 at process virtual address
0x7f9d8d000000
[ 1069.087413] Soft offlining pfn 0x40ae00 at process virtual address
0x7f9d8d200000
[ 1069.123285] liwang -- soft offline: 0x40ae00 hugepage failed to isolate
[ 1069.160137] Soft offlining pfn 0x80f800 at process virtual address
0x7f9d8d000000
[ 1069.196009] Soft offlining pfn 0x80fe00 at process virtual address
0x7f9d8d200000
[ 1069.243436] Soft offlining pfn 0x40a400 at process virtual address
0x7f9d8d000000
[ 1069.281301] Soft offlining pfn 0x40a600 at process virtual address
0x7f9d8d200000
[ 1069.318171] liwang -- soft offline: 0x40a600 hugepage failed to isolate

-- 
Regards,
Li Wang

[-- Attachment #2: Type: text/html, Size: 5786 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [LTP]  [MM Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background
@ 2019-07-30  6:29     ` Li Wang
  0 siblings, 0 replies; 32+ messages in thread
From: Li Wang @ 2019-07-30  6:29 UTC (permalink / raw)
  To: ltp

Hi Mike,

Thanks for trying this.

On Tue, Jul 30, 2019 at 3:01 AM Mike Kravetz <mike.kravetz@oracle.com>
wrote:
>
> On 7/28/19 10:17 PM, Li Wang wrote:
> > Hi Naoya and Linux-MMers,
> >
> > The LTP/move_page12 V2 triggers SIGBUS in the kernel-v5.2.3 testing.
> >
https://github.com/wangli5665/ltp/blob/master/testcases/kernel/syscalls/move_pages/move_pages12.c
> >
> > It seems like the retry mmap() triggers SIGBUS while doing
thenuma_move_pages() in background. That is very similar to the kernelbug
which was mentioned by commit 6bc9b56433b76e40d(mm: fix race
onsoft-offlining ): A race condition between soft offline andhugetlb_fault
which causes unexpected process SIGBUS killing.
> >
> > I'm not sure if that below patch is making sene to memory-failures.c,
but after building a new kernel-5.2.3 with this change, the problem can NOT
be reproduced.
> >
> > Any comments?
>
> Something seems strange.  I can not reproduce with unmodified 5.2.3

It's not 100% reproducible, I tried ten times only hit 4~6 times fail.

Did you try the test case with patch V3(in my branch)?
https://github.com/wangli5665/ltp/commit/198fca89870c1b807a01b27bb1d2ec6e2af1c7b6

# git clone https://github.com/wangli5665/ltp ltp.wangli --depth=1
# cd ltp.wangli/; make autotools;
# ./configure ; make -j24
# cd testcases/kernel/syscalls/move_pages/
# ./move_pages12
tst_test.c:1100: INFO: Timeout per run is 0h 05m 00s
move_pages12.c:249: INFO: Free RAM 64386300 kB
move_pages12.c:267: INFO: Increasing 2048kB hugepages pool on node 0 to 4
move_pages12.c:277: INFO: Increasing 2048kB hugepages pool on node 1 to 4
move_pages12.c:193: INFO: Allocating and freeing 4 hugepages on node 0
move_pages12.c:193: INFO: Allocating and freeing 4 hugepages on node 1
move_pages12.c:183: PASS: Bug not reproduced
tst_test.c:1145: BROK: Test killed by SIGBUS!
move_pages12.c:117: FAIL: move_pages failed: ESRCH

# uname -r
5.2.3

# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
node 0 size: 16049 MB
node 0 free: 15736 MB
node 1 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
node 1 size: 16123 MB
node 1 free: 15850 MB
node 2 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
node 2 size: 16123 MB
node 2 free: 15989 MB
node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
node 3 size: 16097 MB
node 3 free: 15278 MB
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10

> Also, the soft_offline_huge_page() code should not come into play with
> this specific test.

I got the "soft offline xxx.. hugepage failed to isolate" message from
soft_offline_huge_page()
in dmesg log.

=== debug print info ===
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1701,7 +1701,7 @@ static int soft_offline_huge_page(struct page *page,
int flags)
         */
        put_hwpoison_page(hpage);
        if (!ret) {
-               pr_info("soft offline: %#lx hugepage failed to isolate\n",
pfn);
+               pr_info("liwang -- soft offline: %#lx hugepage failed to
isolate\n", pfn);
                return -EBUSY;
        }

# dmesg
...
[ 1068.947205] Soft offlining pfn 0x40b200 at process virtual address
0x7f9d8d000000
[ 1068.987054] Soft offlining pfn 0x40ac00 at process virtual address
0x7f9d8d200000
[ 1069.048478] Soft offlining pfn 0x40a800 at process virtual address
0x7f9d8d000000
[ 1069.087413] Soft offlining pfn 0x40ae00 at process virtual address
0x7f9d8d200000
[ 1069.123285] liwang -- soft offline: 0x40ae00 hugepage failed to isolate
[ 1069.160137] Soft offlining pfn 0x80f800 at process virtual address
0x7f9d8d000000
[ 1069.196009] Soft offlining pfn 0x80fe00 at process virtual address
0x7f9d8d200000
[ 1069.243436] Soft offlining pfn 0x40a400 at process virtual address
0x7f9d8d000000
[ 1069.281301] Soft offlining pfn 0x40a600 at process virtual address
0x7f9d8d200000
[ 1069.318171] liwang -- soft offline: 0x40a600 hugepage failed to isolate

-- 
Regards,
Li Wang
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.linux.it/pipermail/ltp/attachments/20190730/23d7a58b/attachment-0001.htm>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [MM Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background
  2019-07-29 19:00   ` [LTP] " Mike Kravetz
@ 2019-07-30  6:38     ` Li Wang
  -1 siblings, 0 replies; 32+ messages in thread
From: Li Wang @ 2019-07-30  6:38 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Naoya Horiguchi, Linux-MM, LTP List, xishi.qiuxishi, mhocko,
	Cyril Hrubis

On Tue, Jul 30, 2019 at 3:01 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
...
> Something seems strange.  I can not reproduce with unmodified 5.2.3
>
> [root@f23d move_pages]# uname -r
> 5.2.3
> [root@f23d move_pages]# PATH=$PATH:$PWD ./move_pages12
> tst_test.c:1096: INFO: Timeout per run is 0h 05m 00s
> move_pages12.c:201: INFO: Free RAM 6725424 kB
> move_pages12.c:219: INFO: Increasing 2048kB hugepages pool on node 0 to 4
> move_pages12.c:229: INFO: Increasing 2048kB hugepages pool on node 1 to 4
> move_pages12.c:145: INFO: Allocating and freeing 4 hugepages on node 0
> move_pages12.c:145: INFO: Allocating and freeing 4 hugepages on node 1
> move_pages12.c:135: PASS: Bug not reproduced
>
> Summary:
> passed   1
> failed   0
> skipped  0
> warnings 0

FYI:

And, from your test log, it looks like you were running an old LTP
version, the test#2 was added in move_page12 in the latest master
branch.

So, the completely test log should be included two-part:

# ./move_pages12
tst_test.c:1100: INFO: Timeout per run is 0h 05m 00s
move_pages12.c:252: INFO: Free RAM 63759028 kB
move_pages12.c:270: INFO: Increasing 2048kB hugepages pool on node 0 to 4
move_pages12.c:280: INFO: Increasing 2048kB hugepages pool on node 1 to 6
move_pages12.c:196: INFO: Allocating and freeing 4 hugepages on node 0
move_pages12.c:196: INFO: Allocating and freeing 4 hugepages on node 1
move_pages12.c:186: PASS: Bug not reproduced
move_pages12.c:186: PASS: Bug not reproduced

Summary:
passed   2
failed   0
skipped  0
warnings 0


-- 
Regards,
Li Wang


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [LTP]  [MM Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background
@ 2019-07-30  6:38     ` Li Wang
  0 siblings, 0 replies; 32+ messages in thread
From: Li Wang @ 2019-07-30  6:38 UTC (permalink / raw)
  To: ltp

On Tue, Jul 30, 2019 at 3:01 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
...
> Something seems strange.  I can not reproduce with unmodified 5.2.3
>
> [root@f23d move_pages]# uname -r
> 5.2.3
> [root@f23d move_pages]# PATH=$PATH:$PWD ./move_pages12
> tst_test.c:1096: INFO: Timeout per run is 0h 05m 00s
> move_pages12.c:201: INFO: Free RAM 6725424 kB
> move_pages12.c:219: INFO: Increasing 2048kB hugepages pool on node 0 to 4
> move_pages12.c:229: INFO: Increasing 2048kB hugepages pool on node 1 to 4
> move_pages12.c:145: INFO: Allocating and freeing 4 hugepages on node 0
> move_pages12.c:145: INFO: Allocating and freeing 4 hugepages on node 1
> move_pages12.c:135: PASS: Bug not reproduced
>
> Summary:
> passed   1
> failed   0
> skipped  0
> warnings 0

FYI:

And, from your test log, it looks like you were running an old LTP
version, the test#2 was added in move_page12 in the latest master
branch.

So, the completely test log should be included two-part:

# ./move_pages12
tst_test.c:1100: INFO: Timeout per run is 0h 05m 00s
move_pages12.c:252: INFO: Free RAM 63759028 kB
move_pages12.c:270: INFO: Increasing 2048kB hugepages pool on node 0 to 4
move_pages12.c:280: INFO: Increasing 2048kB hugepages pool on node 1 to 6
move_pages12.c:196: INFO: Allocating and freeing 4 hugepages on node 0
move_pages12.c:196: INFO: Allocating and freeing 4 hugepages on node 1
move_pages12.c:186: PASS: Bug not reproduced
move_pages12.c:186: PASS: Bug not reproduced

Summary:
passed   2
failed   0
skipped  0
warnings 0


-- 
Regards,
Li Wang

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [MM Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background
  2019-07-30  6:29     ` [LTP] " Li Wang
@ 2019-07-31  0:44       ` Mike Kravetz
  -1 siblings, 0 replies; 32+ messages in thread
From: Mike Kravetz @ 2019-07-31  0:44 UTC (permalink / raw)
  To: Li Wang
  Cc: Naoya Horiguchi, Linux-MM, LTP List, xishi.qiuxishi, mhocko,
	Cyril Hrubis

On 7/29/19 11:29 PM, Li Wang wrote:
> It's not 100% reproducible, I tried ten times only hit 4~6 times fail.
> 
> Did you try the test case with patch V3(in my branch)?
> https://github.com/wangli5665/ltp/commit/198fca89870c1b807a01b27bb1d2ec6e2af1c7b6
> 

My bad!  I was using an old version of the test without the soft offline
testing.

> # git clone https://github.com/wangli5665/ltp ltp.wangli --depth=1
> # cd ltp.wangli/; make autotools;
> # ./configure ; make -j24
> # cd testcases/kernel/syscalls/move_pages/
> # ./move_pages12 
> tst_test.c:1100: INFO: Timeout per run is 0h 05m 00s
> move_pages12.c:249: INFO: Free RAM 64386300 kB
> move_pages12.c:267: INFO: Increasing 2048kB hugepages pool on node 0 to 4
> move_pages12.c:277: INFO: Increasing 2048kB hugepages pool on node 1 to 4
> move_pages12.c:193: INFO: Allocating and freeing 4 hugepages on node 0
> move_pages12.c:193: INFO: Allocating and freeing 4 hugepages on node 1
> move_pages12.c:183: PASS: Bug not reproduced
> tst_test.c:1145: BROK: Test killed by SIGBUS!
> move_pages12.c:117: FAIL: move_pages failed: ESRCH

Yes, I can recreate.

When I see this failure, the SIGBUS is the result of a huge page allocation
failure.  The allocation was in response to a page fault.

Note that running the test will deplete memory of the system as huge pages
are marked 'poisoned' and can not be reused.  So, each run of the test will
take additional memory offline.

A SIGBUS is the normal behavior for a hugetlb page fault failure due to
lack of huge pages.  Ugly, but that is the design.  I do not believe this
test should not be experiencing this due to reservations taken at mmap
time.  However, the test is combining faults, soft offline and page
migrations, so the there are lots of moving parts.

I'll continue to investigate.

Naoya may have more context as he contributed to both the kernel code and
the testcase.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [LTP]  [MM Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background
@ 2019-07-31  0:44       ` Mike Kravetz
  0 siblings, 0 replies; 32+ messages in thread
From: Mike Kravetz @ 2019-07-31  0:44 UTC (permalink / raw)
  To: ltp

On 7/29/19 11:29 PM, Li Wang wrote:
> It's not 100% reproducible, I tried ten times only hit 4~6 times fail.
> 
> Did you try the test case with patch V3(in my branch)?
> https://github.com/wangli5665/ltp/commit/198fca89870c1b807a01b27bb1d2ec6e2af1c7b6
> 

My bad!  I was using an old version of the test without the soft offline
testing.

> # git clone https://github.com/wangli5665/ltp ltp.wangli --depth=1
> # cd ltp.wangli/; make autotools;
> # ./configure ; make -j24
> # cd testcases/kernel/syscalls/move_pages/
> # ./move_pages12 
> tst_test.c:1100: INFO: Timeout per run is 0h 05m 00s
> move_pages12.c:249: INFO: Free RAM 64386300 kB
> move_pages12.c:267: INFO: Increasing 2048kB hugepages pool on node 0 to 4
> move_pages12.c:277: INFO: Increasing 2048kB hugepages pool on node 1 to 4
> move_pages12.c:193: INFO: Allocating and freeing 4 hugepages on node 0
> move_pages12.c:193: INFO: Allocating and freeing 4 hugepages on node 1
> move_pages12.c:183: PASS: Bug not reproduced
> tst_test.c:1145: BROK: Test killed by SIGBUS!
> move_pages12.c:117: FAIL: move_pages failed: ESRCH

Yes, I can recreate.

When I see this failure, the SIGBUS is the result of a huge page allocation
failure.  The allocation was in response to a page fault.

Note that running the test will deplete memory of the system as huge pages
are marked 'poisoned' and can not be reused.  So, each run of the test will
take additional memory offline.

A SIGBUS is the normal behavior for a hugetlb page fault failure due to
lack of huge pages.  Ugly, but that is the design.  I do not believe this
test should not be experiencing this due to reservations taken at mmap
time.  However, the test is combining faults, soft offline and page
migrations, so the there are lots of moving parts.

I'll continue to investigate.

Naoya may have more context as he contributed to both the kernel code and
the testcase.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [MM Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background
  2019-07-31  0:44       ` [LTP] " Mike Kravetz
@ 2019-08-02  0:19         ` Mike Kravetz
  -1 siblings, 0 replies; 32+ messages in thread
From: Mike Kravetz @ 2019-08-02  0:19 UTC (permalink / raw)
  To: Li Wang
  Cc: Naoya Horiguchi, Linux-MM, LTP List, xishi.qiuxishi, mhocko,
	Cyril Hrubis

On 7/30/19 5:44 PM, Mike Kravetz wrote:
> A SIGBUS is the normal behavior for a hugetlb page fault failure due to
> lack of huge pages.  Ugly, but that is the design.  I do not believe this
> test should not be experiencing this due to reservations taken at mmap
> time.  However, the test is combining faults, soft offline and page
> migrations, so the there are lots of moving parts.
> 
> I'll continue to investigate.

There appears to be a race with hugetlb_fault and try_to_unmap_one of
the migration path.

Can you try this patch in your environment?  I am not sure if it will
be the final fix, but just wanted to see if it addresses issue for you.

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ede7e7f5d1ab..f3156c5432e3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 
 		page = alloc_huge_page(vma, haddr, 0);
 		if (IS_ERR(page)) {
+			/*
+			 * We could race with page migration (try_to_unmap_one)
+			 * which is modifying page table with lock.  However,
+			 * we are not holding lock here.  Before returning
+			 * error that will SIGBUS caller, get ptl and make
+			 * sure there really is no entry.
+			 */
+			ptl = huge_pte_lock(h, mm, ptep);
+			if (!huge_pte_none(huge_ptep_get(ptep))) {
+				ret = 0;
+				spin_unlock(ptl);
+				goto out;
+			}
+			spin_unlock(ptl);
 			ret = vmf_error(PTR_ERR(page));
 			goto out;
 		}


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [LTP]  [MM Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background
@ 2019-08-02  0:19         ` Mike Kravetz
  0 siblings, 0 replies; 32+ messages in thread
From: Mike Kravetz @ 2019-08-02  0:19 UTC (permalink / raw)
  To: ltp

On 7/30/19 5:44 PM, Mike Kravetz wrote:
> A SIGBUS is the normal behavior for a hugetlb page fault failure due to
> lack of huge pages.  Ugly, but that is the design.  I do not believe this
> test should not be experiencing this due to reservations taken at mmap
> time.  However, the test is combining faults, soft offline and page
> migrations, so the there are lots of moving parts.
> 
> I'll continue to investigate.

There appears to be a race with hugetlb_fault and try_to_unmap_one of
the migration path.

Can you try this patch in your environment?  I am not sure if it will
be the final fix, but just wanted to see if it addresses issue for you.

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ede7e7f5d1ab..f3156c5432e3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 
 		page = alloc_huge_page(vma, haddr, 0);
 		if (IS_ERR(page)) {
+			/*
+			 * We could race with page migration (try_to_unmap_one)
+			 * which is modifying page table with lock.  However,
+			 * we are not holding lock here.  Before returning
+			 * error that will SIGBUS caller, get ptl and make
+			 * sure there really is no entry.
+			 */
+			ptl = huge_pte_lock(h, mm, ptep);
+			if (!huge_pte_none(huge_ptep_get(ptep))) {
+				ret = 0;
+				spin_unlock(ptl);
+				goto out;
+			}
+			spin_unlock(ptl);
 			ret = vmf_error(PTR_ERR(page));
 			goto out;
 		}

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [MM Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background
  2019-07-29  5:17 ` [LTP] " Li Wang
@ 2019-08-02  3:48   ` Naoya Horiguchi
  -1 siblings, 0 replies; 32+ messages in thread
From: Naoya Horiguchi @ 2019-08-02  3:48 UTC (permalink / raw)
  To: Li Wang
  Cc: Linux-MM, LTP List, mike.kravetz, xishi.qiuxishi, mhocko, Cyril Hrubis

On Mon, Jul 29, 2019 at 01:17:27PM +0800, Li Wang wrote:
> Hi Naoya and Linux-MMers,
> 
> The LTP/move_page12 V2 triggers SIGBUS in the kernel-v5.2.3 testing.
> https://github.com/wangli5665/ltp/blob/master/testcases/kernel/syscalls/
> move_pages/move_pages12.c
> 
> It seems like the retry mmap() triggers SIGBUS while doing the numa_move_pages
> () in background. That is very similar to the kernel bug which was mentioned by
> commit 6bc9b56433b76e40d(mm: fix race on soft-offlining ): A race condition
> between soft offline and hugetlb_fault which causes unexpected process SIGBUS
> killing.
> 
> I'm not sure if that below patch is making sene to memory-failures.c, but after
> building a new kernel-5.2.3 with this change, the problem can NOT be reproduced
> . 
> 
> Any comments?
> 
> ----------------------------------
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1695,15 +1695,16 @@ static int soft_offline_huge_page(struct page *page,
> int flags)
>         unlock_page(hpage);
> 
>         ret = isolate_huge_page(hpage, &pagelist);
> +       if (!ret) {
> +               pr_info("soft offline: %#lx hugepage failed to isolate\n",
> pfn);
> +               return -EBUSY;
> +       }
> +
>         /*
>          * get_any_page() and isolate_huge_page() takes a refcount each,
>          * so need to drop one here.
>          */
>         put_hwpoison_page(hpage);
> -       if (!ret) {
> -               pr_info("soft offline: %#lx hugepage failed to isolate\n",
> pfn);
> -               return -EBUSY;
> -       }

Sorry for my late response.

This change skips put_hwpoison_page() in failure path, so soft_offline_page()
should return without releasing hpage's refcount taken by get_any_page(),
maybe which is not what we want.

- Naoya

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [LTP]  [MM Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background
@ 2019-08-02  3:48   ` Naoya Horiguchi
  0 siblings, 0 replies; 32+ messages in thread
From: Naoya Horiguchi @ 2019-08-02  3:48 UTC (permalink / raw)
  To: ltp

On Mon, Jul 29, 2019 at 01:17:27PM +0800, Li Wang wrote:
> Hi Naoya and Linux-MMers,
> 
> The LTP/move_page12 V2 triggers SIGBUS in the kernel-v5.2.3 testing.
> https://github.com/wangli5665/ltp/blob/master/testcases/kernel/syscalls/
> move_pages/move_pages12.c
> 
> It seems like the retry mmap() triggers SIGBUS while doing the numa_move_pages
> () in background. That is very similar to the kernel bug which was mentioned by
> commit 6bc9b56433b76e40d(mm: fix race on soft-offlining ): A race condition
> between soft offline and hugetlb_fault which causes unexpected process SIGBUS
> killing.
> 
> I'm not sure if that below patch is making sene to memory-failures.c, but after
> building a new kernel-5.2.3 with this change, the problem can NOT be reproduced
> . 
> 
> Any comments?
> 
> ----------------------------------
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1695,15 +1695,16 @@ static int soft_offline_huge_page(struct page *page,
> int flags)
>         unlock_page(hpage);
> 
>         ret = isolate_huge_page(hpage, &pagelist);
> +       if (!ret) {
> +               pr_info("soft offline: %#lx hugepage failed to isolate\n",
> pfn);
> +               return -EBUSY;
> +       }
> +
>         /*
>          * get_any_page() and isolate_huge_page() takes a refcount each,
>          * so need to drop one here.
>          */
>         put_hwpoison_page(hpage);
> -       if (!ret) {
> -               pr_info("soft offline: %#lx hugepage failed to isolate\n",
> pfn);
> -               return -EBUSY;
> -       }

Sorry for my late response.

This change skips put_hwpoison_page() in failure path, so soft_offline_page()
should return without releasing hpage's refcount taken by get_any_page(),
maybe which is not what we want.

- Naoya

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [MM Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background
  2019-08-02  0:19         ` [LTP] " Mike Kravetz
@ 2019-08-02  4:15           ` Naoya Horiguchi
  -1 siblings, 0 replies; 32+ messages in thread
From: Naoya Horiguchi @ 2019-08-02  4:15 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Li Wang, Linux-MM, LTP List, xishi.qiuxishi, mhocko, Cyril Hrubis

On Thu, Aug 01, 2019 at 05:19:41PM -0700, Mike Kravetz wrote:
> On 7/30/19 5:44 PM, Mike Kravetz wrote:
> > A SIGBUS is the normal behavior for a hugetlb page fault failure due to
> > lack of huge pages.  Ugly, but that is the design.  I do not believe this
> > test should not be experiencing this due to reservations taken at mmap
> > time.  However, the test is combining faults, soft offline and page
> > migrations, so the there are lots of moving parts.
> > 
> > I'll continue to investigate.
> 
> There appears to be a race with hugetlb_fault and try_to_unmap_one of
> the migration path.
> 
> Can you try this patch in your environment?  I am not sure if it will
> be the final fix, but just wanted to see if it addresses issue for you.
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ede7e7f5d1ab..f3156c5432e3 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>  
>  		page = alloc_huge_page(vma, haddr, 0);
>  		if (IS_ERR(page)) {
> +			/*
> +			 * We could race with page migration (try_to_unmap_one)
> +			 * which is modifying page table with lock.  However,
> +			 * we are not holding lock here.  Before returning
> +			 * error that will SIGBUS caller, get ptl and make
> +			 * sure there really is no entry.
> +			 */
> +			ptl = huge_pte_lock(h, mm, ptep);
> +			if (!huge_pte_none(huge_ptep_get(ptep))) {
> +				ret = 0;
> +				spin_unlock(ptl);
> +				goto out;
> +			}
> +			spin_unlock(ptl);

Thanks you for investigation, Mike.
I tried this change and found no SIGBUS, so it works well.

I'm still not clear about how !huge_pte_none() becomes true here,
because we enter hugetlb_no_page() only when huge_pte_none() is non-null
and (racy) try_to_unmap_one() from page migration should convert the
huge_pte into a migration entry, not null.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [LTP]  [MM Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background
@ 2019-08-02  4:15           ` Naoya Horiguchi
  0 siblings, 0 replies; 32+ messages in thread
From: Naoya Horiguchi @ 2019-08-02  4:15 UTC (permalink / raw)
  To: ltp

On Thu, Aug 01, 2019 at 05:19:41PM -0700, Mike Kravetz wrote:
> On 7/30/19 5:44 PM, Mike Kravetz wrote:
> > A SIGBUS is the normal behavior for a hugetlb page fault failure due to
> > lack of huge pages.  Ugly, but that is the design.  I do not believe this
> > test should not be experiencing this due to reservations taken at mmap
> > time.  However, the test is combining faults, soft offline and page
> > migrations, so the there are lots of moving parts.
> > 
> > I'll continue to investigate.
> 
> There appears to be a race with hugetlb_fault and try_to_unmap_one of
> the migration path.
> 
> Can you try this patch in your environment?  I am not sure if it will
> be the final fix, but just wanted to see if it addresses issue for you.
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ede7e7f5d1ab..f3156c5432e3 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>  
>  		page = alloc_huge_page(vma, haddr, 0);
>  		if (IS_ERR(page)) {
> +			/*
> +			 * We could race with page migration (try_to_unmap_one)
> +			 * which is modifying page table with lock.  However,
> +			 * we are not holding lock here.  Before returning
> +			 * error that will SIGBUS caller, get ptl and make
> +			 * sure there really is no entry.
> +			 */
> +			ptl = huge_pte_lock(h, mm, ptep);
> +			if (!huge_pte_none(huge_ptep_get(ptep))) {
> +				ret = 0;
> +				spin_unlock(ptl);
> +				goto out;
> +			}
> +			spin_unlock(ptl);

Thanks you for investigation, Mike.
I tried this change and found no SIGBUS, so it works well.

I'm still not clear about how !huge_pte_none() becomes true here,
because we enter hugetlb_no_page() only when huge_pte_none() is non-null
and (racy) try_to_unmap_one() from page migration should convert the
huge_pte into a migration entry, not null.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [MM Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background
  2019-08-02  0:19         ` [LTP] " Mike Kravetz
@ 2019-08-02  9:59           ` Li Wang
  -1 siblings, 0 replies; 32+ messages in thread
From: Li Wang @ 2019-08-02  9:59 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Naoya Horiguchi, Linux-MM, LTP List, xishi.qiuxishi, mhocko,
	Cyril Hrubis

Hi Mike,

Thanks for working on this.

On Fri, Aug 2, 2019 at 8:20 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 7/30/19 5:44 PM, Mike Kravetz wrote:
> > A SIGBUS is the normal behavior for a hugetlb page fault failure due to
> > lack of huge pages.  Ugly, but that is the design.  I do not believe this
> > test should not be experiencing this due to reservations taken at mmap
> > time.  However, the test is combining faults, soft offline and page
> > migrations, so the there are lots of moving parts.
> >
> > I'll continue to investigate.
>
> There appears to be a race with hugetlb_fault and try_to_unmap_one of
> the migration path.
>
> Can you try this patch in your environment?  I am not sure if it will
> be the final fix, but just wanted to see if it addresses issue for you.

It works for me. After rebuilding the kernel with your patch, SIGBUS
does not appear anymore.

And I'm also thinking that why the huge_pte is not none here when race
with page migration (try_to_unmap_one).

--
Regards,
Li Wang


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [LTP]  [MM Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background
@ 2019-08-02  9:59           ` Li Wang
  0 siblings, 0 replies; 32+ messages in thread
From: Li Wang @ 2019-08-02  9:59 UTC (permalink / raw)
  To: ltp

Hi Mike,

Thanks for working on this.

On Fri, Aug 2, 2019 at 8:20 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 7/30/19 5:44 PM, Mike Kravetz wrote:
> > A SIGBUS is the normal behavior for a hugetlb page fault failure due to
> > lack of huge pages.  Ugly, but that is the design.  I do not believe this
> > test should not be experiencing this due to reservations taken at mmap
> > time.  However, the test is combining faults, soft offline and page
> > migrations, so the there are lots of moving parts.
> >
> > I'll continue to investigate.
>
> There appears to be a race with hugetlb_fault and try_to_unmap_one of
> the migration path.
>
> Can you try this patch in your environment?  I am not sure if it will
> be the final fix, but just wanted to see if it addresses issue for you.

It works for me. After rebuilding the kernel with your patch, SIGBUS
does not appear anymore.

And I'm also thinking that why the huge_pte is not none here when race
with page migration (try_to_unmap_one).

--
Regards,
Li Wang

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [MM Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background
  2019-08-02  4:15           ` [LTP] " Naoya Horiguchi
@ 2019-08-02 17:42             ` Mike Kravetz
  -1 siblings, 0 replies; 32+ messages in thread
From: Mike Kravetz @ 2019-08-02 17:42 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Li Wang, Linux-MM, LTP List, xishi.qiuxishi, mhocko, Cyril Hrubis

On 8/1/19 9:15 PM, Naoya Horiguchi wrote:
> On Thu, Aug 01, 2019 at 05:19:41PM -0700, Mike Kravetz wrote:
>> There appears to be a race with hugetlb_fault and try_to_unmap_one of
>> the migration path.
>>
>> Can you try this patch in your environment?  I am not sure if it will
>> be the final fix, but just wanted to see if it addresses issue for you.
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index ede7e7f5d1ab..f3156c5432e3 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>>  
>>  		page = alloc_huge_page(vma, haddr, 0);
>>  		if (IS_ERR(page)) {
>> +			/*
>> +			 * We could race with page migration (try_to_unmap_one)
>> +			 * which is modifying page table with lock.  However,
>> +			 * we are not holding lock here.  Before returning
>> +			 * error that will SIGBUS caller, get ptl and make
>> +			 * sure there really is no entry.
>> +			 */
>> +			ptl = huge_pte_lock(h, mm, ptep);
>> +			if (!huge_pte_none(huge_ptep_get(ptep))) {
>> +				ret = 0;
>> +				spin_unlock(ptl);
>> +				goto out;
>> +			}
>> +			spin_unlock(ptl);
> 
> Thanks you for investigation, Mike.
> I tried this change and found no SIGBUS, so it works well.
> 
> I'm still not clear about how !huge_pte_none() becomes true here,
> because we enter hugetlb_no_page() only when huge_pte_none() is non-null
> and (racy) try_to_unmap_one() from page migration should convert the
> huge_pte into a migration entry, not null.

Thanks for taking a look Naoya.

In try_to_unmap_one(), there is this code block:

		/* Nuke the page table entry. */
		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
		if (should_defer_flush(mm, flags)) {
			/*
			 * We clear the PTE but do not flush so potentially
			 * a remote CPU could still be writing to the page.
			 * If the entry was previously clean then the
			 * architecture must guarantee that a clear->dirty
			 * transition on a cached TLB entry is written through
			 * and traps if the PTE is unmapped.
			 */
			pteval = ptep_get_and_clear(mm, address, pvmw.pte);

			set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
		} else {
			pteval = ptep_clear_flush(vma, address, pvmw.pte);
		}

That happens before setting the migration entry.  Therefore, for a period
of time the pte is NULL (huge_pte_none() returns true).

try_to_unmap_one holds the page table lock, but hugetlb_fault does not take
the lock to 'optimistically' check huge_pte_none().  When huge_pte_none
returns true, it calls hugetlb_no_page which is where we try to allocate
a page and fails.

Does that make sense, or am I missing something?

The patch checks for this specific condition: someone changing the pte
from NULL to non-NULL while holding the lock.  I am not sure if this is
the best way to fix.  But, it may be the easiest.
-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [LTP]  [MM Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background
@ 2019-08-02 17:42             ` Mike Kravetz
  0 siblings, 0 replies; 32+ messages in thread
From: Mike Kravetz @ 2019-08-02 17:42 UTC (permalink / raw)
  To: ltp

On 8/1/19 9:15 PM, Naoya Horiguchi wrote:
> On Thu, Aug 01, 2019 at 05:19:41PM -0700, Mike Kravetz wrote:
>> There appears to be a race with hugetlb_fault and try_to_unmap_one of
>> the migration path.
>>
>> Can you try this patch in your environment?  I am not sure if it will
>> be the final fix, but just wanted to see if it addresses issue for you.
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index ede7e7f5d1ab..f3156c5432e3 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>>  
>>  		page = alloc_huge_page(vma, haddr, 0);
>>  		if (IS_ERR(page)) {
>> +			/*
>> +			 * We could race with page migration (try_to_unmap_one)
>> +			 * which is modifying page table with lock.  However,
>> +			 * we are not holding lock here.  Before returning
>> +			 * error that will SIGBUS caller, get ptl and make
>> +			 * sure there really is no entry.
>> +			 */
>> +			ptl = huge_pte_lock(h, mm, ptep);
>> +			if (!huge_pte_none(huge_ptep_get(ptep))) {
>> +				ret = 0;
>> +				spin_unlock(ptl);
>> +				goto out;
>> +			}
>> +			spin_unlock(ptl);
> 
> Thanks you for investigation, Mike.
> I tried this change and found no SIGBUS, so it works well.
> 
> I'm still not clear about how !huge_pte_none() becomes true here,
> because we enter hugetlb_no_page() only when huge_pte_none() is non-null
> and (racy) try_to_unmap_one() from page migration should convert the
> huge_pte into a migration entry, not null.

Thanks for taking a look Naoya.

In try_to_unmap_one(), there is this code block:

		/* Nuke the page table entry. */
		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
		if (should_defer_flush(mm, flags)) {
			/*
			 * We clear the PTE but do not flush so potentially
			 * a remote CPU could still be writing to the page.
			 * If the entry was previously clean then the
			 * architecture must guarantee that a clear->dirty
			 * transition on a cached TLB entry is written through
			 * and traps if the PTE is unmapped.
			 */
			pteval = ptep_get_and_clear(mm, address, pvmw.pte);

			set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
		} else {
			pteval = ptep_clear_flush(vma, address, pvmw.pte);
		}

That happens before setting the migration entry.  Therefore, for a period
of time the pte is NULL (huge_pte_none() returns true).

try_to_unmap_one holds the page table lock, but hugetlb_fault does not take
the lock to 'optimistically' check huge_pte_none().  When huge_pte_none
returns true, it calls hugetlb_no_page which is where we try to allocate
a page and fails.

Does that make sense, or am I missing something?

The patch checks for this specific condition: someone changing the pte
from NULL to non-NULL while holding the lock.  I am not sure if this is
the best way to fix.  But, it may be the easiest.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [MM Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background
  2019-08-02 17:42             ` [LTP] " Mike Kravetz
@ 2019-08-05  0:40               ` Naoya Horiguchi
  -1 siblings, 0 replies; 32+ messages in thread
From: Naoya Horiguchi @ 2019-08-05  0:40 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Li Wang, Linux-MM, LTP List, xishi.qiuxishi, mhocko, Cyril Hrubis

On Fri, Aug 02, 2019 at 10:42:33AM -0700, Mike Kravetz wrote:
> On 8/1/19 9:15 PM, Naoya Horiguchi wrote:
> > On Thu, Aug 01, 2019 at 05:19:41PM -0700, Mike Kravetz wrote:
> >> There appears to be a race with hugetlb_fault and try_to_unmap_one of
> >> the migration path.
> >>
> >> Can you try this patch in your environment?  I am not sure if it will
> >> be the final fix, but just wanted to see if it addresses issue for you.
> >>
> >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >> index ede7e7f5d1ab..f3156c5432e3 100644
> >> --- a/mm/hugetlb.c
> >> +++ b/mm/hugetlb.c
> >> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> >>  
> >>  		page = alloc_huge_page(vma, haddr, 0);
> >>  		if (IS_ERR(page)) {
> >> +			/*
> >> +			 * We could race with page migration (try_to_unmap_one)
> >> +			 * which is modifying page table with lock.  However,
> >> +			 * we are not holding lock here.  Before returning
> >> +			 * error that will SIGBUS caller, get ptl and make
> >> +			 * sure there really is no entry.
> >> +			 */
> >> +			ptl = huge_pte_lock(h, mm, ptep);
> >> +			if (!huge_pte_none(huge_ptep_get(ptep))) {
> >> +				ret = 0;
> >> +				spin_unlock(ptl);
> >> +				goto out;
> >> +			}
> >> +			spin_unlock(ptl);
> > 
> > Thanks you for investigation, Mike.
> > I tried this change and found no SIGBUS, so it works well.
> > 
> > I'm still not clear about how !huge_pte_none() becomes true here,
> > because we enter hugetlb_no_page() only when huge_pte_none() is non-null
> > and (racy) try_to_unmap_one() from page migration should convert the
> > huge_pte into a migration entry, not null.
> 
> Thanks for taking a look Naoya.
> 
> In try_to_unmap_one(), there is this code block:
> 
> 		/* Nuke the page table entry. */
> 		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
> 		if (should_defer_flush(mm, flags)) {
> 			/*
> 			 * We clear the PTE but do not flush so potentially
> 			 * a remote CPU could still be writing to the page.
> 			 * If the entry was previously clean then the
> 			 * architecture must guarantee that a clear->dirty
> 			 * transition on a cached TLB entry is written through
> 			 * and traps if the PTE is unmapped.
> 			 */
> 			pteval = ptep_get_and_clear(mm, address, pvmw.pte);
> 
> 			set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
> 		} else {
> 			pteval = ptep_clear_flush(vma, address, pvmw.pte);
> 		}
> 
> That happens before setting the migration entry.  Therefore, for a period
> of time the pte is NULL (huge_pte_none() returns true).
> 
> try_to_unmap_one holds the page table lock, but hugetlb_fault does not take
> the lock to 'optimistically' check huge_pte_none().  When huge_pte_none
> returns true, it calls hugetlb_no_page which is where we try to allocate
> a page and fails.
> 
> Does that make sense, or am I missing something?

Make sense to me, thanks.

> 
> The patch checks for this specific condition: someone changing the pte
> from NULL to non-NULL while holding the lock.  I am not sure if this is
> the best way to fix.  But, it may be the easiest.

Yes, I think so.

- Naoya

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [LTP]  [MM Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background
@ 2019-08-05  0:40               ` Naoya Horiguchi
  0 siblings, 0 replies; 32+ messages in thread
From: Naoya Horiguchi @ 2019-08-05  0:40 UTC (permalink / raw)
  To: ltp

On Fri, Aug 02, 2019 at 10:42:33AM -0700, Mike Kravetz wrote:
> On 8/1/19 9:15 PM, Naoya Horiguchi wrote:
> > On Thu, Aug 01, 2019 at 05:19:41PM -0700, Mike Kravetz wrote:
> >> There appears to be a race with hugetlb_fault and try_to_unmap_one of
> >> the migration path.
> >>
> >> Can you try this patch in your environment?  I am not sure if it will
> >> be the final fix, but just wanted to see if it addresses issue for you.
> >>
> >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >> index ede7e7f5d1ab..f3156c5432e3 100644
> >> --- a/mm/hugetlb.c
> >> +++ b/mm/hugetlb.c
> >> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> >>  
> >>  		page = alloc_huge_page(vma, haddr, 0);
> >>  		if (IS_ERR(page)) {
> >> +			/*
> >> +			 * We could race with page migration (try_to_unmap_one)
> >> +			 * which is modifying page table with lock.  However,
> >> +			 * we are not holding lock here.  Before returning
> >> +			 * error that will SIGBUS caller, get ptl and make
> >> +			 * sure there really is no entry.
> >> +			 */
> >> +			ptl = huge_pte_lock(h, mm, ptep);
> >> +			if (!huge_pte_none(huge_ptep_get(ptep))) {
> >> +				ret = 0;
> >> +				spin_unlock(ptl);
> >> +				goto out;
> >> +			}
> >> +			spin_unlock(ptl);
> > 
> > Thanks you for investigation, Mike.
> > I tried this change and found no SIGBUS, so it works well.
> > 
> > I'm still not clear about how !huge_pte_none() becomes true here,
> > because we enter hugetlb_no_page() only when huge_pte_none() is non-null
> > and (racy) try_to_unmap_one() from page migration should convert the
> > huge_pte into a migration entry, not null.
> 
> Thanks for taking a look Naoya.
> 
> In try_to_unmap_one(), there is this code block:
> 
> 		/* Nuke the page table entry. */
> 		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
> 		if (should_defer_flush(mm, flags)) {
> 			/*
> 			 * We clear the PTE but do not flush so potentially
> 			 * a remote CPU could still be writing to the page.
> 			 * If the entry was previously clean then the
> 			 * architecture must guarantee that a clear->dirty
> 			 * transition on a cached TLB entry is written through
> 			 * and traps if the PTE is unmapped.
> 			 */
> 			pteval = ptep_get_and_clear(mm, address, pvmw.pte);
> 
> 			set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
> 		} else {
> 			pteval = ptep_clear_flush(vma, address, pvmw.pte);
> 		}
> 
> That happens before setting the migration entry.  Therefore, for a period
> of time the pte is NULL (huge_pte_none() returns true).
> 
> try_to_unmap_one holds the page table lock, but hugetlb_fault does not take
> the lock to 'optimistically' check huge_pte_none().  When huge_pte_none
> returns true, it calls hugetlb_no_page which is where we try to allocate
> a page and fails.
> 
> Does that make sense, or am I missing something?

Make sense to me, thanks.

> 
> The patch checks for this specific condition: someone changing the pte
> from NULL to non-NULL while holding the lock.  I am not sure if this is
> the best way to fix.  But, it may be the easiest.

Yes, I think so.

- Naoya

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [MM Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background
  2019-08-02 17:42             ` [LTP] " Mike Kravetz
@ 2019-08-05  8:57               ` Michal Hocko
  -1 siblings, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2019-08-05  8:57 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Naoya Horiguchi, Li Wang, Linux-MM, LTP List, xishi.qiuxishi,
	Cyril Hrubis

On Fri 02-08-19 10:42:33, Mike Kravetz wrote:
> On 8/1/19 9:15 PM, Naoya Horiguchi wrote:
> > On Thu, Aug 01, 2019 at 05:19:41PM -0700, Mike Kravetz wrote:
> >> There appears to be a race with hugetlb_fault and try_to_unmap_one of
> >> the migration path.
> >>
> >> Can you try this patch in your environment?  I am not sure if it will
> >> be the final fix, but just wanted to see if it addresses issue for you.
> >>
> >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >> index ede7e7f5d1ab..f3156c5432e3 100644
> >> --- a/mm/hugetlb.c
> >> +++ b/mm/hugetlb.c
> >> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> >>  
> >>  		page = alloc_huge_page(vma, haddr, 0);
> >>  		if (IS_ERR(page)) {
> >> +			/*
> >> +			 * We could race with page migration (try_to_unmap_one)
> >> +			 * which is modifying page table with lock.  However,
> >> +			 * we are not holding lock here.  Before returning
> >> +			 * error that will SIGBUS caller, get ptl and make
> >> +			 * sure there really is no entry.
> >> +			 */
> >> +			ptl = huge_pte_lock(h, mm, ptep);
> >> +			if (!huge_pte_none(huge_ptep_get(ptep))) {
> >> +				ret = 0;
> >> +				spin_unlock(ptl);
> >> +				goto out;
> >> +			}
> >> +			spin_unlock(ptl);
> > 
> > Thanks you for investigation, Mike.
> > I tried this change and found no SIGBUS, so it works well.
> > 
> > I'm still not clear about how !huge_pte_none() becomes true here,
> > because we enter hugetlb_no_page() only when huge_pte_none() is non-null
> > and (racy) try_to_unmap_one() from page migration should convert the
> > huge_pte into a migration entry, not null.
> 
> Thanks for taking a look Naoya.
> 
> In try_to_unmap_one(), there is this code block:
> 
> 		/* Nuke the page table entry. */
> 		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
> 		if (should_defer_flush(mm, flags)) {
> 			/*
> 			 * We clear the PTE but do not flush so potentially
> 			 * a remote CPU could still be writing to the page.
> 			 * If the entry was previously clean then the
> 			 * architecture must guarantee that a clear->dirty
> 			 * transition on a cached TLB entry is written through
> 			 * and traps if the PTE is unmapped.
> 			 */
> 			pteval = ptep_get_and_clear(mm, address, pvmw.pte);
> 
> 			set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
> 		} else {
> 			pteval = ptep_clear_flush(vma, address, pvmw.pte);
> 		}
> 
> That happens before setting the migration entry.  Therefore, for a period
> of time the pte is NULL (huge_pte_none() returns true).
> 
> try_to_unmap_one holds the page table lock, but hugetlb_fault does not take
> the lock to 'optimistically' check huge_pte_none().  When huge_pte_none
> returns true, it calls hugetlb_no_page which is where we try to allocate
> a page and fails.
> 
> Does that make sense, or am I missing something?
> 
> The patch checks for this specific condition: someone changing the pte
> from NULL to non-NULL while holding the lock.  I am not sure if this is
> the best way to fix.  But, it may be the easiest.

Please add a comment to explain this because this is quite subtle and
tricky. Unlike the regular page fault hugetlb_no_page is protected by a
large lock so a retry check seems unexpected.

Thanks!
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [LTP]  [MM Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background
@ 2019-08-05  8:57               ` Michal Hocko
  0 siblings, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2019-08-05  8:57 UTC (permalink / raw)
  To: ltp

On Fri 02-08-19 10:42:33, Mike Kravetz wrote:
> On 8/1/19 9:15 PM, Naoya Horiguchi wrote:
> > On Thu, Aug 01, 2019 at 05:19:41PM -0700, Mike Kravetz wrote:
> >> There appears to be a race with hugetlb_fault and try_to_unmap_one of
> >> the migration path.
> >>
> >> Can you try this patch in your environment?  I am not sure if it will
> >> be the final fix, but just wanted to see if it addresses issue for you.
> >>
> >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >> index ede7e7f5d1ab..f3156c5432e3 100644
> >> --- a/mm/hugetlb.c
> >> +++ b/mm/hugetlb.c
> >> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> >>  
> >>  		page = alloc_huge_page(vma, haddr, 0);
> >>  		if (IS_ERR(page)) {
> >> +			/*
> >> +			 * We could race with page migration (try_to_unmap_one)
> >> +			 * which is modifying page table with lock.  However,
> >> +			 * we are not holding lock here.  Before returning
> >> +			 * error that will SIGBUS caller, get ptl and make
> >> +			 * sure there really is no entry.
> >> +			 */
> >> +			ptl = huge_pte_lock(h, mm, ptep);
> >> +			if (!huge_pte_none(huge_ptep_get(ptep))) {
> >> +				ret = 0;
> >> +				spin_unlock(ptl);
> >> +				goto out;
> >> +			}
> >> +			spin_unlock(ptl);
> > 
> > Thanks you for investigation, Mike.
> > I tried this change and found no SIGBUS, so it works well.
> > 
> > I'm still not clear about how !huge_pte_none() becomes true here,
> > because we enter hugetlb_no_page() only when huge_pte_none() is non-null
> > and (racy) try_to_unmap_one() from page migration should convert the
> > huge_pte into a migration entry, not null.
> 
> Thanks for taking a look Naoya.
> 
> In try_to_unmap_one(), there is this code block:
> 
> 		/* Nuke the page table entry. */
> 		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
> 		if (should_defer_flush(mm, flags)) {
> 			/*
> 			 * We clear the PTE but do not flush so potentially
> 			 * a remote CPU could still be writing to the page.
> 			 * If the entry was previously clean then the
> 			 * architecture must guarantee that a clear->dirty
> 			 * transition on a cached TLB entry is written through
> 			 * and traps if the PTE is unmapped.
> 			 */
> 			pteval = ptep_get_and_clear(mm, address, pvmw.pte);
> 
> 			set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
> 		} else {
> 			pteval = ptep_clear_flush(vma, address, pvmw.pte);
> 		}
> 
> That happens before setting the migration entry.  Therefore, for a period
> of time the pte is NULL (huge_pte_none() returns true).
> 
> try_to_unmap_one holds the page table lock, but hugetlb_fault does not take
> the lock to 'optimistically' check huge_pte_none().  When huge_pte_none
> returns true, it calls hugetlb_no_page which is where we try to allocate
> a page and fails.
> 
> Does that make sense, or am I missing something?
> 
> The patch checks for this specific condition: someone changing the pte
> from NULL to non-NULL while holding the lock.  I am not sure if this is
> the best way to fix.  But, it may be the easiest.

Please add a comment to explain this because this is quite subtle and
tricky. Unlike the regular page fault hugetlb_no_page is protected by a
large lock so a retry check seems unexpected.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [MM Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background
  2019-08-05  8:57               ` [LTP] " Michal Hocko
@ 2019-08-05 17:36                 ` Mike Kravetz
  -1 siblings, 0 replies; 32+ messages in thread
From: Mike Kravetz @ 2019-08-05 17:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Naoya Horiguchi, Li Wang, Linux-MM, LTP List, xishi.qiuxishi,
	Cyril Hrubis

On 8/5/19 1:57 AM, Michal Hocko wrote:
> On Fri 02-08-19 10:42:33, Mike Kravetz wrote:
>> On 8/1/19 9:15 PM, Naoya Horiguchi wrote:
>>> On Thu, Aug 01, 2019 at 05:19:41PM -0700, Mike Kravetz wrote:
>>>> There appears to be a race with hugetlb_fault and try_to_unmap_one of
>>>> the migration path.
>>>>
>>>> Can you try this patch in your environment?  I am not sure if it will
>>>> be the final fix, but just wanted to see if it addresses issue for you.
>>>>
>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>> index ede7e7f5d1ab..f3156c5432e3 100644
>>>> --- a/mm/hugetlb.c
>>>> +++ b/mm/hugetlb.c
>>>> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>>>>  
>>>>  		page = alloc_huge_page(vma, haddr, 0);
>>>>  		if (IS_ERR(page)) {
>>>> +			/*
>>>> +			 * We could race with page migration (try_to_unmap_one)
>>>> +			 * which is modifying page table with lock.  However,
>>>> +			 * we are not holding lock here.  Before returning
>>>> +			 * error that will SIGBUS caller, get ptl and make
>>>> +			 * sure there really is no entry.
>>>> +			 */
>>>> +			ptl = huge_pte_lock(h, mm, ptep);
>>>> +			if (!huge_pte_none(huge_ptep_get(ptep))) {
>>>> +				ret = 0;
>>>> +				spin_unlock(ptl);
>>>> +				goto out;
>>>> +			}
>>>> +			spin_unlock(ptl);
>>>
>>> Thanks you for investigation, Mike.
>>> I tried this change and found no SIGBUS, so it works well.
>>>
>>> I'm still not clear about how !huge_pte_none() becomes true here,
>>> because we enter hugetlb_no_page() only when huge_pte_none() is non-null
>>> and (racy) try_to_unmap_one() from page migration should convert the
>>> huge_pte into a migration entry, not null.
>>
>> Thanks for taking a look Naoya.
>>
>> In try_to_unmap_one(), there is this code block:
>>
>> 		/* Nuke the page table entry. */
>> 		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
>> 		if (should_defer_flush(mm, flags)) {
>> 			/*
>> 			 * We clear the PTE but do not flush so potentially
>> 			 * a remote CPU could still be writing to the page.
>> 			 * If the entry was previously clean then the
>> 			 * architecture must guarantee that a clear->dirty
>> 			 * transition on a cached TLB entry is written through
>> 			 * and traps if the PTE is unmapped.
>> 			 */
>> 			pteval = ptep_get_and_clear(mm, address, pvmw.pte);
>>
>> 			set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
>> 		} else {
>> 			pteval = ptep_clear_flush(vma, address, pvmw.pte);
>> 		}
>>
>> That happens before setting the migration entry.  Therefore, for a period
>> of time the pte is NULL (huge_pte_none() returns true).
>>
>> try_to_unmap_one holds the page table lock, but hugetlb_fault does not take
>> the lock to 'optimistically' check huge_pte_none().  When huge_pte_none
>> returns true, it calls hugetlb_no_page which is where we try to allocate
>> a page and fails.
>>
>> Does that make sense, or am I missing something?
>>
>> The patch checks for this specific condition: someone changing the pte
>> from NULL to non-NULL while holding the lock.  I am not sure if this is
>> the best way to fix.  But, it may be the easiest.
> 
> Please add a comment to explain this because this is quite subtle and
> tricky. Unlike the regular page fault hugetlb_no_page is protected by a
> large lock so a retry check seems unexpected.

Will do.

Fixing up hugetlbfs locking is still 'on my list'.  There are known issues.
The last RFC/attempt was this:
http://lkml.kernel.org/r/20190201221705.15622-1-mike.kravetz@oracle.com
I believe that patch would have handled this issue.

However, as mentioned above it may better to just patch this issue exposed
by LTP and work on the more comprehensive change in the background.
-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [LTP]  [MM Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background
@ 2019-08-05 17:36                 ` Mike Kravetz
  0 siblings, 0 replies; 32+ messages in thread
From: Mike Kravetz @ 2019-08-05 17:36 UTC (permalink / raw)
  To: ltp

On 8/5/19 1:57 AM, Michal Hocko wrote:
> On Fri 02-08-19 10:42:33, Mike Kravetz wrote:
>> On 8/1/19 9:15 PM, Naoya Horiguchi wrote:
>>> On Thu, Aug 01, 2019 at 05:19:41PM -0700, Mike Kravetz wrote:
>>>> There appears to be a race with hugetlb_fault and try_to_unmap_one of
>>>> the migration path.
>>>>
>>>> Can you try this patch in your environment?  I am not sure if it will
>>>> be the final fix, but just wanted to see if it addresses issue for you.
>>>>
>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>> index ede7e7f5d1ab..f3156c5432e3 100644
>>>> --- a/mm/hugetlb.c
>>>> +++ b/mm/hugetlb.c
>>>> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>>>>  
>>>>  		page = alloc_huge_page(vma, haddr, 0);
>>>>  		if (IS_ERR(page)) {
>>>> +			/*
>>>> +			 * We could race with page migration (try_to_unmap_one)
>>>> +			 * which is modifying page table with lock.  However,
>>>> +			 * we are not holding lock here.  Before returning
>>>> +			 * error that will SIGBUS caller, get ptl and make
>>>> +			 * sure there really is no entry.
>>>> +			 */
>>>> +			ptl = huge_pte_lock(h, mm, ptep);
>>>> +			if (!huge_pte_none(huge_ptep_get(ptep))) {
>>>> +				ret = 0;
>>>> +				spin_unlock(ptl);
>>>> +				goto out;
>>>> +			}
>>>> +			spin_unlock(ptl);
>>>
>>> Thanks you for investigation, Mike.
>>> I tried this change and found no SIGBUS, so it works well.
>>>
>>> I'm still not clear about how !huge_pte_none() becomes true here,
>>> because we enter hugetlb_no_page() only when huge_pte_none() is non-null
>>> and (racy) try_to_unmap_one() from page migration should convert the
>>> huge_pte into a migration entry, not null.
>>
>> Thanks for taking a look Naoya.
>>
>> In try_to_unmap_one(), there is this code block:
>>
>> 		/* Nuke the page table entry. */
>> 		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
>> 		if (should_defer_flush(mm, flags)) {
>> 			/*
>> 			 * We clear the PTE but do not flush so potentially
>> 			 * a remote CPU could still be writing to the page.
>> 			 * If the entry was previously clean then the
>> 			 * architecture must guarantee that a clear->dirty
>> 			 * transition on a cached TLB entry is written through
>> 			 * and traps if the PTE is unmapped.
>> 			 */
>> 			pteval = ptep_get_and_clear(mm, address, pvmw.pte);
>>
>> 			set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
>> 		} else {
>> 			pteval = ptep_clear_flush(vma, address, pvmw.pte);
>> 		}
>>
>> That happens before setting the migration entry.  Therefore, for a period
>> of time the pte is NULL (huge_pte_none() returns true).
>>
>> try_to_unmap_one holds the page table lock, but hugetlb_fault does not take
>> the lock to 'optimistically' check huge_pte_none().  When huge_pte_none
>> returns true, it calls hugetlb_no_page which is where we try to allocate
>> a page and fails.
>>
>> Does that make sense, or am I missing something?
>>
>> The patch checks for this specific condition: someone changing the pte
>> from NULL to non-NULL while holding the lock.  I am not sure if this is
>> the best way to fix.  But, it may be the easiest.
> 
> Please add a comment to explain this because this is quite subtle and
> tricky. Unlike the regular page fault hugetlb_no_page is protected by a
> large lock so a retry check seems unexpected.

Will do.

Fixing up hugetlbfs locking is still 'on my list'.  There are known issues.
The last RFC/attempt was this:
http://lkml.kernel.org/r/20190201221705.15622-1-mike.kravetz@oracle.com
I believe that patch would have handled this issue.

However, as mentioned above it may better to just patch this issue exposed
by LTP and work on the more comprehensive change in the background.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [MM Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background
  2019-08-05 17:36                 ` [LTP] " Mike Kravetz
@ 2019-08-07  0:07                   ` Mike Kravetz
  -1 siblings, 0 replies; 32+ messages in thread
From: Mike Kravetz @ 2019-08-07  0:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Naoya Horiguchi, Li Wang, Linux-MM, LTP List, xishi.qiuxishi,
	Cyril Hrubis

On 8/5/19 10:36 AM, Mike Kravetz wrote:
>>>>> Can you try this patch in your environment?  I am not sure if it will
>>>>> be the final fix, but just wanted to see if it addresses issue for you.
>>>>>
>>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>>> index ede7e7f5d1ab..f3156c5432e3 100644
>>>>> --- a/mm/hugetlb.c
>>>>> +++ b/mm/hugetlb.c
>>>>> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>>>>>  
>>>>>  		page = alloc_huge_page(vma, haddr, 0);
>>>>>  		if (IS_ERR(page)) {
>>>>> +			/*
>>>>> +			 * We could race with page migration (try_to_unmap_one)
>>>>> +			 * which is modifying page table with lock.  However,
>>>>> +			 * we are not holding lock here.  Before returning
>>>>> +			 * error that will SIGBUS caller, get ptl and make
>>>>> +			 * sure there really is no entry.
>>>>> +			 */
>>>>> +			ptl = huge_pte_lock(h, mm, ptep);
>>>>> +			if (!huge_pte_none(huge_ptep_get(ptep))) {
>>>>> +				ret = 0;
>>>>> +				spin_unlock(ptl);
>>>>> +				goto out;
>>>>> +			}
>>>>> +			spin_unlock(ptl);
>>>>
>>>> Thanks you for investigation, Mike.
>>>> I tried this change and found no SIGBUS, so it works well.

Here is another way to address the issue.  Take the hugetlb fault mutex in
the migration code when modifying the page tables.  IIUC, the fault mutex
was introduced to prevent this same issue when there were two page faults
on the same page (and we were unable to allocate an 'extra' page).  The
downside to such an approach is that we add more hugetlbfs specific code
to try_to_unmap_one.

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index edf476c8cfb9..df0e74f9962e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -485,6 +485,17 @@ static inline int hstate_index(struct hstate *h)
 	return h - hstates;
 }
 
+/*
+ * Convert the address within this vma to the page offset within
+ * the mapping, in pagecache page units; huge pages here.
+ */
+static inline pgoff_t vma_hugecache_offset(struct hstate *h,
+			struct vm_area_struct *vma, unsigned long address)
+{
+	return ((address - vma->vm_start) >> huge_page_shift(h)) +
+		(vma->vm_pgoff >> huge_page_order(h));
+}
+
 pgoff_t __basepage_index(struct page *page);
 
 /* Return page->index in PAGE_SIZE units */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ede7e7f5d1ab..959aed5b7969 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -615,17 +615,6 @@ static long region_count(struct resv_map *resv, long f, long t)
 	return chg;
 }
 
-/*
- * Convert the address within this vma to the page offset within
- * the mapping, in pagecache page units; huge pages here.
- */
-static pgoff_t vma_hugecache_offset(struct hstate *h,
-			struct vm_area_struct *vma, unsigned long address)
-{
-	return ((address - vma->vm_start) >> huge_page_shift(h)) +
-			(vma->vm_pgoff >> huge_page_order(h));
-}
-
 pgoff_t linear_hugepage_index(struct vm_area_struct *vma,
 				     unsigned long address)
 {
diff --git a/mm/rmap.c b/mm/rmap.c
index e5dfe2ae6b0d..f8c95482c23e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1350,6 +1350,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	bool ret = true;
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)arg;
+	u32 hugetlb_hash = 0;
 
 	/* munlock has nothing to gain from examining un-locked vmas */
 	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
@@ -1377,6 +1378,19 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 				min(vma->vm_end, address +
 				    (PAGE_SIZE << compound_order(page))));
 	if (PageHuge(page)) {
+		struct hstate *h = hstate_vma(vma);
+
+		/*
+		 * Take the hugetlb fault mutex so that we do not race with
+		 * page faults while modifying page table.  Mutex must be
+		 * acquired before ptl below.
+		 */
+		hugetlb_hash = hugetlb_fault_mutex_hash(h,
+					vma->vm_file->f_mapping,
+					vma_hugecache_offset(h, vma, address),
+					address);
+		mutex_lock(&hugetlb_fault_mutex_table[hugetlb_hash]);
+
 		/*
 		 * If sharing is possible, start and end will be adjusted
 		 * accordingly.
@@ -1659,6 +1673,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	}
 
 	mmu_notifier_invalidate_range_end(&range);
+	if (PageHuge(page))
+		mutex_unlock(&hugetlb_fault_mutex_table[hugetlb_hash]);
 
 	return ret;
 }


Michal, Naoya any preferences on how this should be fixed?  I'll send a
proper patch if we agree on an approach.
-- 
Mike Kravetz


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [LTP]  [MM Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background
@ 2019-08-07  0:07                   ` Mike Kravetz
  0 siblings, 0 replies; 32+ messages in thread
From: Mike Kravetz @ 2019-08-07  0:07 UTC (permalink / raw)
  To: ltp

On 8/5/19 10:36 AM, Mike Kravetz wrote:
>>>>> Can you try this patch in your environment?  I am not sure if it will
>>>>> be the final fix, but just wanted to see if it addresses issue for you.
>>>>>
>>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>>> index ede7e7f5d1ab..f3156c5432e3 100644
>>>>> --- a/mm/hugetlb.c
>>>>> +++ b/mm/hugetlb.c
>>>>> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>>>>>  
>>>>>  		page = alloc_huge_page(vma, haddr, 0);
>>>>>  		if (IS_ERR(page)) {
>>>>> +			/*
>>>>> +			 * We could race with page migration (try_to_unmap_one)
>>>>> +			 * which is modifying page table with lock.  However,
>>>>> +			 * we are not holding lock here.  Before returning
>>>>> +			 * error that will SIGBUS caller, get ptl and make
>>>>> +			 * sure there really is no entry.
>>>>> +			 */
>>>>> +			ptl = huge_pte_lock(h, mm, ptep);
>>>>> +			if (!huge_pte_none(huge_ptep_get(ptep))) {
>>>>> +				ret = 0;
>>>>> +				spin_unlock(ptl);
>>>>> +				goto out;
>>>>> +			}
>>>>> +			spin_unlock(ptl);
>>>>
>>>> Thanks you for investigation, Mike.
>>>> I tried this change and found no SIGBUS, so it works well.

Here is another way to address the issue.  Take the hugetlb fault mutex in
the migration code when modifying the page tables.  IIUC, the fault mutex
was introduced to prevent this same issue when there were two page faults
on the same page (and we were unable to allocate an 'extra' page).  The
downside to such an approach is that we add more hugetlbfs specific code
to try_to_unmap_one.

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index edf476c8cfb9..df0e74f9962e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -485,6 +485,17 @@ static inline int hstate_index(struct hstate *h)
 	return h - hstates;
 }
 
+/*
+ * Convert the address within this vma to the page offset within
+ * the mapping, in pagecache page units; huge pages here.
+ */
+static inline pgoff_t vma_hugecache_offset(struct hstate *h,
+			struct vm_area_struct *vma, unsigned long address)
+{
+	return ((address - vma->vm_start) >> huge_page_shift(h)) +
+		(vma->vm_pgoff >> huge_page_order(h));
+}
+
 pgoff_t __basepage_index(struct page *page);
 
 /* Return page->index in PAGE_SIZE units */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ede7e7f5d1ab..959aed5b7969 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -615,17 +615,6 @@ static long region_count(struct resv_map *resv, long f, long t)
 	return chg;
 }
 
-/*
- * Convert the address within this vma to the page offset within
- * the mapping, in pagecache page units; huge pages here.
- */
-static pgoff_t vma_hugecache_offset(struct hstate *h,
-			struct vm_area_struct *vma, unsigned long address)
-{
-	return ((address - vma->vm_start) >> huge_page_shift(h)) +
-			(vma->vm_pgoff >> huge_page_order(h));
-}
-
 pgoff_t linear_hugepage_index(struct vm_area_struct *vma,
 				     unsigned long address)
 {
diff --git a/mm/rmap.c b/mm/rmap.c
index e5dfe2ae6b0d..f8c95482c23e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1350,6 +1350,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	bool ret = true;
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)arg;
+	u32 hugetlb_hash = 0;
 
 	/* munlock has nothing to gain from examining un-locked vmas */
 	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
@@ -1377,6 +1378,19 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 				min(vma->vm_end, address +
 				    (PAGE_SIZE << compound_order(page))));
 	if (PageHuge(page)) {
+		struct hstate *h = hstate_vma(vma);
+
+		/*
+		 * Take the hugetlb fault mutex so that we do not race with
+		 * page faults while modifying page table.  Mutex must be
+		 * acquired before ptl below.
+		 */
+		hugetlb_hash = hugetlb_fault_mutex_hash(h,
+					vma->vm_file->f_mapping,
+					vma_hugecache_offset(h, vma, address),
+					address);
+		mutex_lock(&hugetlb_fault_mutex_table[hugetlb_hash]);
+
 		/*
 		 * If sharing is possible, start and end will be adjusted
 		 * accordingly.
@@ -1659,6 +1673,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	}
 
 	mmu_notifier_invalidate_range_end(&range);
+	if (PageHuge(page))
+		mutex_unlock(&hugetlb_fault_mutex_table[hugetlb_hash]);
 
 	return ret;
 }


Michal, Naoya any preferences on how this should be fixed?  I'll send a
proper patch if we agree on an approach.
-- 
Mike Kravetz

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [MM Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background
  2019-08-07  0:07                   ` [LTP] " Mike Kravetz
@ 2019-08-07  7:39                     ` Michal Hocko
  -1 siblings, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2019-08-07  7:39 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Naoya Horiguchi, Li Wang, Linux-MM, LTP List, xishi.qiuxishi,
	Cyril Hrubis

On Tue 06-08-19 17:07:25, Mike Kravetz wrote:
> On 8/5/19 10:36 AM, Mike Kravetz wrote:
> >>>>> Can you try this patch in your environment?  I am not sure if it will
> >>>>> be the final fix, but just wanted to see if it addresses issue for you.
> >>>>>
> >>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >>>>> index ede7e7f5d1ab..f3156c5432e3 100644
> >>>>> --- a/mm/hugetlb.c
> >>>>> +++ b/mm/hugetlb.c
> >>>>> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> >>>>>  
> >>>>>  		page = alloc_huge_page(vma, haddr, 0);
> >>>>>  		if (IS_ERR(page)) {
> >>>>> +			/*
> >>>>> +			 * We could race with page migration (try_to_unmap_one)
> >>>>> +			 * which is modifying page table with lock.  However,
> >>>>> +			 * we are not holding lock here.  Before returning
> >>>>> +			 * error that will SIGBUS caller, get ptl and make
> >>>>> +			 * sure there really is no entry.
> >>>>> +			 */
> >>>>> +			ptl = huge_pte_lock(h, mm, ptep);
> >>>>> +			if (!huge_pte_none(huge_ptep_get(ptep))) {
> >>>>> +				ret = 0;
> >>>>> +				spin_unlock(ptl);
> >>>>> +				goto out;
> >>>>> +			}
> >>>>> +			spin_unlock(ptl);
> >>>>
> >>>> Thanks you for investigation, Mike.
> >>>> I tried this change and found no SIGBUS, so it works well.
> 
> Here is another way to address the issue.  Take the hugetlb fault mutex in
> the migration code when modifying the page tables.  IIUC, the fault mutex
> was introduced to prevent this same issue when there were two page faults
> on the same page (and we were unable to allocate an 'extra' page).  The
> downside to such an approach is that we add more hugetlbfs specific code
> to try_to_unmap_one.

I would rather go with the hugetlb_no_page which is better isolated.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [LTP]  [MM Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background
@ 2019-08-07  7:39                     ` Michal Hocko
  0 siblings, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2019-08-07  7:39 UTC (permalink / raw)
  To: ltp

On Tue 06-08-19 17:07:25, Mike Kravetz wrote:
> On 8/5/19 10:36 AM, Mike Kravetz wrote:
> >>>>> Can you try this patch in your environment?  I am not sure if it will
> >>>>> be the final fix, but just wanted to see if it addresses issue for you.
> >>>>>
> >>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >>>>> index ede7e7f5d1ab..f3156c5432e3 100644
> >>>>> --- a/mm/hugetlb.c
> >>>>> +++ b/mm/hugetlb.c
> >>>>> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> >>>>>  
> >>>>>  		page = alloc_huge_page(vma, haddr, 0);
> >>>>>  		if (IS_ERR(page)) {
> >>>>> +			/*
> >>>>> +			 * We could race with page migration (try_to_unmap_one)
> >>>>> +			 * which is modifying page table with lock.  However,
> >>>>> +			 * we are not holding lock here.  Before returning
> >>>>> +			 * error that will SIGBUS caller, get ptl and make
> >>>>> +			 * sure there really is no entry.
> >>>>> +			 */
> >>>>> +			ptl = huge_pte_lock(h, mm, ptep);
> >>>>> +			if (!huge_pte_none(huge_ptep_get(ptep))) {
> >>>>> +				ret = 0;
> >>>>> +				spin_unlock(ptl);
> >>>>> +				goto out;
> >>>>> +			}
> >>>>> +			spin_unlock(ptl);
> >>>>
> >>>> Thanks you for investigation, Mike.
> >>>> I tried this change and found no SIGBUS, so it works well.
> 
> Here is another way to address the issue.  Take the hugetlb fault mutex in
> the migration code when modifying the page tables.  IIUC, the fault mutex
> was introduced to prevent this same issue when there were two page faults
> on the same page (and we were unable to allocate an 'extra' page).  The
> downside to such an approach is that we add more hugetlbfs specific code
> to try_to_unmap_one.

I would rather go with the hugetlb_no_page which is better isolated.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [MM Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background
  2019-08-07  7:39                     ` [LTP] " Michal Hocko
@ 2019-08-07 15:10                       ` Mike Kravetz
  -1 siblings, 0 replies; 32+ messages in thread
From: Mike Kravetz @ 2019-08-07 15:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Naoya Horiguchi, Li Wang, Linux-MM, LTP List, xishi.qiuxishi,
	Cyril Hrubis

On 8/7/19 12:39 AM, Michal Hocko wrote:
> On Tue 06-08-19 17:07:25, Mike Kravetz wrote:
>> On 8/5/19 10:36 AM, Mike Kravetz wrote:
>>>>>>> Can you try this patch in your environment?  I am not sure if it will
>>>>>>> be the final fix, but just wanted to see if it addresses issue for you.
>>>>>>>
>>>>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>>>>> index ede7e7f5d1ab..f3156c5432e3 100644
>>>>>>> --- a/mm/hugetlb.c
>>>>>>> +++ b/mm/hugetlb.c
>>>>>>> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>>>>>>>  
>>>>>>>  		page = alloc_huge_page(vma, haddr, 0);
>>>>>>>  		if (IS_ERR(page)) {
>>>>>>> +			/*
>>>>>>> +			 * We could race with page migration (try_to_unmap_one)
>>>>>>> +			 * which is modifying page table with lock.  However,
>>>>>>> +			 * we are not holding lock here.  Before returning
>>>>>>> +			 * error that will SIGBUS caller, get ptl and make
>>>>>>> +			 * sure there really is no entry.
>>>>>>> +			 */
>>>>>>> +			ptl = huge_pte_lock(h, mm, ptep);
>>>>>>> +			if (!huge_pte_none(huge_ptep_get(ptep))) {
>>>>>>> +				ret = 0;
>>>>>>> +				spin_unlock(ptl);
>>>>>>> +				goto out;
>>>>>>> +			}
>>>>>>> +			spin_unlock(ptl);
>>>>>>
>>>>>> Thanks you for investigation, Mike.
>>>>>> I tried this change and found no SIGBUS, so it works well.
>>
>> Here is another way to address the issue.  Take the hugetlb fault mutex in
>> the migration code when modifying the page tables.  IIUC, the fault mutex
>> was introduced to prevent this same issue when there were two page faults
>> on the same page (and we were unable to allocate an 'extra' page).  The
>> downside to such an approach is that we add more hugetlbfs specific code
>> to try_to_unmap_one.
> 
> I would rather go with the hugetlb_no_page which is better isolated.

Sounds good.

And, after more thought modifying try_to_unmap_one will not work.  Why?
It violates lock ordering.  Current ordering is hugetlb_mutex, page lock
then page table lock.  The page lock is taken before calling try_to_unmap_one.
In addition, try_to_unmap is unmapping from multiple vmas so we can not
know the values for hugetlb hash before taking page lock as the hash values
are vma specific.  So, without many modifications we can not add hugetlb
fault mutex to this code path.
-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [LTP]  [MM Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background
@ 2019-08-07 15:10                       ` Mike Kravetz
  0 siblings, 0 replies; 32+ messages in thread
From: Mike Kravetz @ 2019-08-07 15:10 UTC (permalink / raw)
  To: ltp

On 8/7/19 12:39 AM, Michal Hocko wrote:
> On Tue 06-08-19 17:07:25, Mike Kravetz wrote:
>> On 8/5/19 10:36 AM, Mike Kravetz wrote:
>>>>>>> Can you try this patch in your environment?  I am not sure if it will
>>>>>>> be the final fix, but just wanted to see if it addresses issue for you.
>>>>>>>
>>>>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>>>>> index ede7e7f5d1ab..f3156c5432e3 100644
>>>>>>> --- a/mm/hugetlb.c
>>>>>>> +++ b/mm/hugetlb.c
>>>>>>> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>>>>>>>  
>>>>>>>  		page = alloc_huge_page(vma, haddr, 0);
>>>>>>>  		if (IS_ERR(page)) {
>>>>>>> +			/*
>>>>>>> +			 * We could race with page migration (try_to_unmap_one)
>>>>>>> +			 * which is modifying page table with lock.  However,
>>>>>>> +			 * we are not holding lock here.  Before returning
>>>>>>> +			 * error that will SIGBUS caller, get ptl and make
>>>>>>> +			 * sure there really is no entry.
>>>>>>> +			 */
>>>>>>> +			ptl = huge_pte_lock(h, mm, ptep);
>>>>>>> +			if (!huge_pte_none(huge_ptep_get(ptep))) {
>>>>>>> +				ret = 0;
>>>>>>> +				spin_unlock(ptl);
>>>>>>> +				goto out;
>>>>>>> +			}
>>>>>>> +			spin_unlock(ptl);
>>>>>>
>>>>>> Thanks you for investigation, Mike.
>>>>>> I tried this change and found no SIGBUS, so it works well.
>>
>> Here is another way to address the issue.  Take the hugetlb fault mutex in
>> the migration code when modifying the page tables.  IIUC, the fault mutex
>> was introduced to prevent this same issue when there were two page faults
>> on the same page (and we were unable to allocate an 'extra' page).  The
>> downside to such an approach is that we add more hugetlbfs specific code
>> to try_to_unmap_one.
> 
> I would rather go with the hugetlb_no_page which is better isolated.

Sounds good.

And, after more thought modifying try_to_unmap_one will not work.  Why?
It violates lock ordering.  Current ordering is hugetlb_mutex, page lock
then page table lock.  The page lock is taken before calling try_to_unmap_one.
In addition, try_to_unmap is unmapping from multiple vmas so we can not
know the values for hugetlb hash before taking page lock as the hash values
are vma specific.  So, without many modifications we can not add hugetlb
fault mutex to this code path.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2019-08-07 15:10 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-29  5:17 [MM Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background Li Wang
2019-07-29  5:17 ` [LTP] " Li Wang
2019-07-29 19:00 ` Mike Kravetz
2019-07-29 19:00   ` [LTP] " Mike Kravetz
2019-07-30  6:29   ` Li Wang
2019-07-30  6:29     ` [LTP] " Li Wang
2019-07-31  0:44     ` Mike Kravetz
2019-07-31  0:44       ` [LTP] " Mike Kravetz
2019-08-02  0:19       ` Mike Kravetz
2019-08-02  0:19         ` [LTP] " Mike Kravetz
2019-08-02  4:15         ` Naoya Horiguchi
2019-08-02  4:15           ` [LTP] " Naoya Horiguchi
2019-08-02 17:42           ` Mike Kravetz
2019-08-02 17:42             ` [LTP] " Mike Kravetz
2019-08-05  0:40             ` Naoya Horiguchi
2019-08-05  0:40               ` [LTP] " Naoya Horiguchi
2019-08-05  8:57             ` Michal Hocko
2019-08-05  8:57               ` [LTP] " Michal Hocko
2019-08-05 17:36               ` Mike Kravetz
2019-08-05 17:36                 ` [LTP] " Mike Kravetz
2019-08-07  0:07                 ` Mike Kravetz
2019-08-07  0:07                   ` [LTP] " Mike Kravetz
2019-08-07  7:39                   ` Michal Hocko
2019-08-07  7:39                     ` [LTP] " Michal Hocko
2019-08-07 15:10                     ` Mike Kravetz
2019-08-07 15:10                       ` [LTP] " Mike Kravetz
2019-08-02  9:59         ` Li Wang
2019-08-02  9:59           ` [LTP] " Li Wang
2019-07-30  6:38   ` Li Wang
2019-07-30  6:38     ` [LTP] " Li Wang
2019-08-02  3:48 ` Naoya Horiguchi
2019-08-02  3:48   ` [LTP] " Naoya Horiguchi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.