From: Li Wang <liwang@redhat.com>
To: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
	Linux-MM <linux-mm@kvack.org>,  LTP List <ltp@lists.linux.it>,
	xishi.qiuxishi@alibaba-inc.com, mhocko@kernel.org,
	 Cyril Hrubis <chrubis@suse.cz>
Subject: Re: [MM Bug?] mmap() triggers SIGBUS while doing the numa_move_pages() for offlined hugepage in background
Date: Tue, 30 Jul 2019 14:29:09 +0800	[thread overview]
Message-ID: <CAEemH2d=vEfppCbCgVoGdHed2kuY3GWnZGhymYT1rnxjoWNdcQ@mail.gmail.com> (raw)
In-Reply-To: <47999e20-ccbe-deda-c960-473db5b56ea0@oracle.com>

Hi Mike,

Thanks for trying this.

On Tue, Jul 30, 2019 at 3:01 AM Mike Kravetz <mike.kravetz@oracle.com>
wrote:
>
> On 7/28/19 10:17 PM, Li Wang wrote:
> > Hi Naoya and Linux-MMers,
> >
> > The LTP/move_pages12 V2 test triggers SIGBUS in kernel-v5.2.3 testing.
> >
> > https://github.com/wangli5665/ltp/blob/master/testcases/kernel/syscalls/move_pages/move_pages12.c
> >
> > It seems like the retry mmap() triggers SIGBUS while doing the
> > numa_move_pages() in background. That is very similar to the kernel bug
> > which was mentioned by commit 6bc9b56433b76e40d (mm: fix race on
> > soft-offlining): a race condition between soft offline and hugetlb_fault
> > which causes unexpected process SIGBUS killing.
> >
> > I'm not sure if the patch below makes sense for memory-failure.c,
> > but after building a new kernel-5.2.3 with this change, the problem can NOT
> > be reproduced.
> >
> > Any comments?
>
> Something seems strange.  I can not reproduce with unmodified 5.2.3

It's not 100% reproducible; in ten runs I only hit the failure 4~6 times.

Did you try the test case with patch V3 (in my branch)?
https://github.com/wangli5665/ltp/commit/198fca89870c1b807a01b27bb1d2ec6e2af1c7b6
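
To make the scenario concrete, below is a minimal sketch of the race
pattern as I understand it. It is NOT move_pages12 itself, only an
illustration: one process keeps re-mmap()ing and faulting a MAP_HUGETLB
range while a child soft-offlines its hugepages with
madvise(MADV_SOFT_OFFLINE). The 2MB hugepage size, the loop counts and
the fallback #defines are assumptions for the example; it needs root and
a populated hugepage pool.

=== illustrative sketch (not the LTP test) ===
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <string.h>
#include <stdlib.h>

#ifndef MAP_HUGETLB
# define MAP_HUGETLB		0x40000
#endif
#ifndef MADV_SOFT_OFFLINE
# define MADV_SOFT_OFFLINE	101
#endif

#define HPS	(2UL * 1024 * 1024)	/* assumed 2MB hugepage size */
#define LEN	(4 * HPS)
#define LOOPS	1000

static void *map_huge(void)
{
	return mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
}

int main(void)
{
	pid_t pid = fork();
	int i;

	if (pid == 0) {
		/* child: soft-offline hugepages in the background */
		for (i = 0; i < LOOPS; i++) {
			char *p = map_huge();

			if (p == MAP_FAILED)
				continue;
			memset(p, 0, LEN);
			/* may fail with EBUSY; that is part of the race */
			madvise(p, LEN, MADV_SOFT_OFFLINE);
			munmap(p, LEN);
		}
		exit(0);
	}

	/* parent: keep re-mmap()ing and faulting the hugetlb range */
	for (i = 0; i < LOOPS; i++) {
		char *p = map_huge();

		if (p == MAP_FAILED)
			continue;
		memset(p, 0, LEN);	/* an unexpected SIGBUS would hit here */
		munmap(p, LEN);
	}

	waitpid(pid, NULL, 0);
	return 0;
}

If the race is hit, the parent's memset() is where the unexpected SIGBUS
would show up.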

# git clone https://github.com/wangli5665/ltp ltp.wangli --depth=1
# cd ltp.wangli/; make autotools;
# ./configure ; make -j24
# cd testcases/kernel/syscalls/move_pages/
# ./move_pages12
tst_test.c:1100: INFO: Timeout per run is 0h 05m 00s
move_pages12.c:249: INFO: Free RAM 64386300 kB
move_pages12.c:267: INFO: Increasing 2048kB hugepages pool on node 0 to 4
move_pages12.c:277: INFO: Increasing 2048kB hugepages pool on node 1 to 4
move_pages12.c:193: INFO: Allocating and freeing 4 hugepages on node 0
move_pages12.c:193: INFO: Allocating and freeing 4 hugepages on node 1
move_pages12.c:183: PASS: Bug not reproduced
tst_test.c:1145: BROK: Test killed by SIGBUS!
move_pages12.c:117: FAIL: move_pages failed: ESRCH

# uname -r
5.2.3

# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
node 0 size: 16049 MB
node 0 free: 15736 MB
node 1 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
node 1 size: 16123 MB
node 1 free: 15850 MB
node 2 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
node 2 size: 16123 MB
node 2 free: 15989 MB
node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
node 3 size: 16097 MB
node 3 free: 15278 MB
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10

> Also, the soft_offline_huge_page() code should not come into play with
> this specific test.

I got the "soft offline xxx.. hugepage failed to isolate" message from
soft_offline_huge_page() in the dmesg log.

=== debug print info ===
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1701,7 +1701,7 @@ static int soft_offline_huge_page(struct page *page, int flags)
          */
         put_hwpoison_page(hpage);
         if (!ret) {
-               pr_info("soft offline: %#lx hugepage failed to isolate\n", pfn);
+               pr_info("liwang -- soft offline: %#lx hugepage failed to isolate\n", pfn);
                 return -EBUSY;
         }

# dmesg
...
[ 1068.947205] Soft offlining pfn 0x40b200 at process virtual address 0x7f9d8d000000
[ 1068.987054] Soft offlining pfn 0x40ac00 at process virtual address 0x7f9d8d200000
[ 1069.048478] Soft offlining pfn 0x40a800 at process virtual address 0x7f9d8d000000
[ 1069.087413] Soft offlining pfn 0x40ae00 at process virtual address 0x7f9d8d200000
[ 1069.123285] liwang -- soft offline: 0x40ae00 hugepage failed to isolate
[ 1069.160137] Soft offlining pfn 0x80f800 at process virtual address 0x7f9d8d000000
[ 1069.196009] Soft offlining pfn 0x80fe00 at process virtual address 0x7f9d8d200000
[ 1069.243436] Soft offlining pfn 0x40a400 at process virtual address 0x7f9d8d000000
[ 1069.281301] Soft offlining pfn 0x40a600 at process virtual address 0x7f9d8d200000
[ 1069.318171] liwang -- soft offline: 0x40a600 hugepage failed to isolate

-- 
Regards,
Li Wang

