From: "HORIGUCHI NAOYA(堀口 直也)" <naoya.horiguchi@nec.com>
To: Qian Cai <cai@lca.pw>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
Oscar Salvador <osalvador@suse.de>,
Michal Hocko <mhocko@kernel.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
David Hildenbrand <david@redhat.com>,
Mike Kravetz <mike.kravetz@oracle.com>
Subject: Re: memory offline infinite loop after soft offline
Date: Fri, 15 May 2020 03:48:10 +0000 [thread overview]
Message-ID: <20200515034809.GA27576@hori.linux.bs1.fc.nec.co.jp> (raw)
In-Reply-To: <DE0721DF-9E5C-4719-B382-01A4A74C04AD@lca.pw>
On Thu, May 14, 2020 at 10:46:33PM -0400, Qian Cai wrote:
>
>
> > On Oct 20, 2019, at 11:16 PM, Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
> >
> > On Fri, Oct 18, 2019 at 07:56:09AM -0400, Qian Cai wrote:
> >>
> >>
> >> On Oct 18, 2019, at 2:35 AM, Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> >> wrote:
> >>
> >>
> >> You're right, then I don't see how this happens. If the error hugepage was
> >> isolated without having PG_hwpoison set, it's unexpected and problematic.
> >> I'm testing myself with v5.4-rc2 (simply ran move_pages12 and did
> >> hotremove/hotadd) but can't reproduce the issue yet. Do we need a
> >> specific kernel version/config to trigger this?
> >>
> >>
> >> This is reproducible on linux-next with the config. Not sure if it is
> >> reproducible on x86.
> >>
> >> https://raw.githubusercontent.com/cailca/linux-mm/master/powerpc.config
> >>
> >> and kernel cmdline if that matters
> >>
> >> page_poison=on page_owner=on numa_balancing=enable \
> >> systemd.unified_cgroup_hierarchy=1 debug_guardpage_minorder=1 \
> >> page_alloc.shuffle=1
> >
> > Thanks for the info.
> >
> >>
> >> BTW, where does the code set PG_hwpoison for the head page?
> >
> > Precisely speaking, soft offline only sets PG_hwpoison after the target
> > hugepage is successfully dissolved (then it's not a hugepage any more),
> > so PG_hwpoison is set on the raw page in set_hwpoison_free_buddy_page().
> >
> > In move_pages12 case, madvise(MADV_SOFT_OFFLINE) is called for the range
> > of 2 hugepages, so the expected result is that page offset 0 and 512
> > are marked as PG_hwpoison after injection.
> >
> > Looking at your dump_page() output, the end_pfn is page offset 1
> > ("page:c00c000800458040" is likely to point to pfn 0x11601.)
> > The page belongs to high order buddy free page, but doesn't have
> > PageBuddy nor PageHWPoison because it was not the head page or
> > the raw error page.
> >
> >> Unfortunately, this does not solve the problem. It looks to me that in
> >> soft_offline_huge_page(), set_hwpoison_free_buddy_page() will only set
> >> PG_hwpoison for buddy pages, so even the compound_head() page has no
> >> PG_hwpoison set.
> >
> > Your analysis is totally correct, and this behavior will be fixed by
> > the change (https://lkml.org/lkml/2019/10/17/551) in Oscar's rework.
> > The raw error page will be taken off the buddy system and the other
> > subpages will be properly split into lower order pages (we'll properly
> > manage the PageBuddy flags). So all possible cases should be covered by
> > the branches in __test_page_isolated_in_pageblock().
>
> Naoya, Oscar, it looks like this series was stuck.
>
> https://lkml.org/lkml/2019/10/17/551
>
> I can still reproduce this issue as of today. Maybe it is best we post a single patch (which one?) to fix the loop first?
I'm very sorry for having been quiet for so long. I agree with this
patchset and will try to see what happens once it is merged into mmotm,
although it needs rebasing onto the latest mmotm and some basic testing first.
Thread overview: 25+ messages
2019-10-11 21:32 memory offline infinite loop after soft offline Qian Cai
2019-10-12 10:30 ` osalvador
2019-10-14 8:39 ` Michal Hocko
2019-10-17 9:34 ` Naoya Horiguchi
2019-10-17 10:01 ` Michal Hocko
2019-10-17 10:03 ` David Hildenbrand
2019-10-17 18:07 ` Qian Cai
2019-10-17 18:27 ` Michal Hocko
2019-10-18 2:19 ` Naoya Horiguchi
2019-10-18 6:06 ` Michal Hocko
2019-10-18 6:32 ` Naoya Horiguchi
2019-10-18 7:33 ` Michal Hocko
2019-10-18 8:46 ` Naoya Horiguchi
[not found] ` <64DC81FB-C1D2-44F2-981F-C6F766124B91@lca.pw>
2019-10-21 3:16 ` Naoya Horiguchi
2020-05-15 2:46 ` Qian Cai
2020-05-15 3:48 ` HORIGUCHI NAOYA(堀口 直也) [this message]
2020-05-19 4:17 ` Qian Cai
2019-10-18 8:13 ` David Hildenbrand
2019-10-18 8:24 ` Michal Hocko
2019-10-18 8:38 ` David Hildenbrand
2019-10-18 8:55 ` Michal Hocko
2019-10-18 11:00 ` David Hildenbrand
2019-10-18 11:05 ` David Hildenbrand
2019-10-18 11:34 ` Michal Hocko
2019-10-18 11:51 ` David Hildenbrand