Re: [PATCH v6 1/2] mm,hwpoison: fix race with hugetlb page allocation

From: "HORIGUCHI NAOYA(堀口　直也)" <naoya.horiguchi@nec.com>
To: "Luck, Tony" <tony.luck@intel.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>,
	Oscar Salvador <osalvador@suse.de>,
	Muchun Song <songmuchun@bytedance.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Michal Hocko <mhocko@suse.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v6 1/2] mm,hwpoison: fix race with hugetlb page allocation
Date: Fri, 13 Aug 2021 06:29:51 +0000	[thread overview]
Message-ID: <20210813062951.GA203438@hori.linux.bs1.fc.nec.co.jp> (raw)
In-Reply-To: <20210812152548.GA1579021@agluck-desk2.amr.corp.intel.com>

On Thu, Aug 12, 2021 at 08:25:48AM -0700, Luck, Tony wrote:
> On Thu, Aug 12, 2021 at 09:03:04AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> > Sorry for the failures. I think that the following patch (and dependencies)
> > should solve the issue.
> > https://lore.kernel.org/linux-mm/20210614021212.223326-6-nao.horiguchi@gmail.com/.
> > I'll submit the update (maybe the patchset will be smaller by feedbacks)
> > later soon.
>
> I was uncertain about which dependencies you meant. So I followed
> the advice in the cover letter for the patch series containing that
> patch and did:
>
> $ git fetch https://github.com/nhoriguchi/linux hwpoison
>
> This kernel still has some odd issues and the poison page
> did not get unmapped from my test user application.
>
> See git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git
> for my test program. In this case I was just running with default settings
> to inject an error into a user data page and then consume it.
>
> Here's the dmesg output. There are multiple calls to memory_failure()
> because the poison address is signalled both by the memory controller
> (CMCI with UCNA signature) and the DCU (#MC with SRAR signature).

Thanks for the details.

The following dmesg implies that the failure happened in einj_mem_uc
which has 13 sub-testcases (specified by "testname"). In which one
the "unknonw page" event was triggered?

>
> Note that the first message says: "recovery action for unknown page: Ignored"
>
> [   70.331253] EINJ: Error INJection is initialized.
> [   76.949490] process '/aegl/ras-tools/einj_mem_uc' started with executable stack
> [   77.481846] Disabling lock debugging due to kernel taint
> [   77.482004] mce: [Hardware Error]: Machine check events logged
> [   77.487176] mce: Uncorrected hardware memory error in user-access at 7e025e400
> [   77.493225] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
> [   77.508704] {1}[Hardware Error]: event severity: recoverable
> [   77.514361] {1}[Hardware Error]:  Error 0, type: recoverable
> [   77.520011] {1}[Hardware Error]:  fru_text: Card01, ChnG, DIMM0
> [   77.525921] {1}[Hardware Error]:   section_type: memory error
> [   77.531659] {1}[Hardware Error]:   error_status: 0x0000000000000400

Accourding to https://www.kernel.org/doc/html/latest/firmware-guide/acpi/apei/einj.html
this error status means "Platform Uncorrectable non-fatal", so memory_failure()
is called with MF_ACTION_REQUIRED set?

> [   77.537914] {1}[Hardware Error]:   physical_address: 0x00000007e025e400
> [   77.544518] {1}[Hardware Error]:   node: 0 card: 6 module: 0 rank: 1 bank: 8 device: 0 row: 15105 column: 896
> [   77.554503] {1}[Hardware Error]:   error_type: 4, single-symbol chipkill ECC
> [   77.561548] {1}[Hardware Error]:   DIMM location: NODE 3 CPU0_DIMM_G1
> [   77.568135] Memory failure: 0x7e025e: recovery action for unknown page: Ignored

This "unknown page" should come from the following line in memory_failure():

   int memory_failure(unsigned long pfn, int flags)
   ...
           if (!(flags & MF_COUNT_INCREASED)) {
                   res = get_hwpoison_page(p, flags);
                   if (!res) {
                           ...  // code for HWPoisonHandlable pages.
                   } else if (res < 0) {
                           action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED); /// HERE
                           res = -EBUSY;
                           goto unlock_mutex;
                   }
           }

This path is chosen when HWPoisonHandlable() returns false in get_hwpoison_page()
and HWPoisonHandlable() is like this in the current version:

    static inline bool HWPoisonHandlable(struct page *page)
    {
            return PageLRU(page) || __PageMovable(page) ||
                    PageSlab(page) || PageTable(page) || PageReserved(page);
    }

So I wonder why HWPoisonHandlable() returned false in your case ("to inject
an error into a user data page and then consume it" sounds a basic testcase
so it's expected to be detected by PageLRU() or __PageMovable().)

I think that we are going to the direction of stopping taking refcount of
error pages blindly to reduce the risk of critical races.  In order to
improve HWPoisonHandlable() check instead of reverting it, I'd like to
understand error page's state in the failure case.  Could you try testing
again with inserting a dump_page() like below?  (I'll try to reproduce by
myself somehow, but it might be hard because I have no machine with firmware
supporting EINJ.)

    @@ -1703,6 +1703,7 @@ int memory_failure(unsigned long pfn, int flags)
                            goto unlock_mutex;
                    } else if (res < 0) {
                            action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED);
    +                       dump_page(p, "hwpoison unknown page");
                            res = -EBUSY;
                            goto unlock_mutex;
                    }

Thanks,
Naoya Horiguchi

> [   77.575445] Memory failure: 0x7e025e: already hardware poisoned
> [   77.627894] EDAC skx MC3: HANDLING MCE MEMORY ERROR
> [   77.633600] EDAC skx MC3: CPU 0: Machine Check Event: 0x0 Bank 25: 0xac00000200a00090
> [   77.633601] EDAC skx MC3: TSC 0x18d57ce28f4f25
> [   77.633602] EDAC skx MC3: ADDR 0x7e025e400
> [   77.633603] EDAC skx MC3: MISC 0x9000201d809c086
> [   77.633603] EDAC skx MC3: PROCESSOR 0:0x606a6 TIME 1628780833 SOCKET 0 APIC 0x0
> [   77.633608] EDAC MC3: 1 UE memory read error on CPU_SrcID#0_MC#3_Chan#0_DIMM#0 (channel:0 slot:0 page:0x7e025e offset:0x400 grain:32 -  err_code:0x00a0:0x0090  SystemAddress:0x7e025e400 ProcessorSocketId:0x0 MemoryControllerId:0x3 ChannelAddress:0xec04bc00 ChannelId:0x0 RankAddress:0x76025c00 PhysicalRankId:0x1 DimmSlotId:0x0 Row:0x3b01 Column:0x380 Bank:0x0 BankGroup:0x2 ChipSelect:0x1 ChipId:0x0)
> [   77.633611] Memory failure: 0x7e025e: Sending SIGBUS to einj_mem_uc:12283 due to hardware memory corruption
> [   77.668827] mce: [Hardware Error]: Machine check events logged
> [   77.678605] Memory failure: 0x7e025e: already hardware poisoned
> [   77.736685] EDAC skx MC3: HANDLING MCE MEMORY ERROR
> [   77.742392] EDAC skx MC3: CPU 0: Machine Check Event: 0x0 Bank 255: 0xb40000000000009f
> [   77.742394] EDAC skx MC3: TSC 0x0
> [   77.742394] EDAC skx MC3: ADDR 0x7e025e400
> [   77.742395] EDAC skx MC3: MISC 0x0
> [   77.742395] EDAC skx MC3: PROCESSOR 0:0x606a6 TIME 1628780833 SOCKET 0 APIC 0x0
> [   77.742397] EDAC MC3: 1 UE memory read error on CPU_SrcID#0_MC#3_Chan#0_DIMM#0 (channel:0 slot:0 page:0x7e025e offset:0x400 grain:32 -  err_code:0x0000:0x009f  SystemAddress:0x7e025e400 ProcessorSocketId:0x0 MemoryControllerId:0x3 ChannelAddress:0xec04bc00 ChannelId:0x0 RankAddress:0x76025c00 PhysicalRankId:0x1 DimmSlotId:0x0 Row:0x3b01 Column:0x380 Bank:0x0 BankGroup:0x2 ChipSelect:0x1 ChipId:0x0)
> [   77.777612] Memory failure: 0x7e025e: already hardware poisoned
>
> -Tony
>