From: "Luck, Tony" <tony.luck@intel.com>
To: "HORIGUCHI NAOYA(堀口 直也)" <naoya.horiguchi@nec.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>,
	Oscar Salvador <osalvador@suse.de>,
	Muchun Song <songmuchun@bytedance.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Michal Hocko <mhocko@suse.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v6 1/2] mm,hwpoison: fix race with hugetlb page allocation
Date: Thu, 12 Aug 2021 08:25:48 -0700
Message-ID: <20210812152548.GA1579021@agluck-desk2.amr.corp.intel.com>
In-Reply-To: <20210812090303.GA153531@hori.linux.bs1.fc.nec.co.jp>

On Thu, Aug 12, 2021 at 09:03:04AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> Sorry for the failures. I think that the following patch (and dependencies)
> should solve the issue.
> https://lore.kernel.org/linux-mm/20210614021212.223326-6-nao.horiguchi@gmail.com/.
> I'll submit the update soon (the patchset may become smaller in response
> to feedback).

I was uncertain about which dependencies you meant. So I followed
the advice in the cover letter for the patch series containing that
patch and did:

$ git fetch https://github.com/nhoriguchi/linux hwpoison

This kernel still has some odd issues, and the poisoned page
was not unmapped from my test user application.

See git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git
for my test program. In this case I was just running with default settings
to inject an error into a user data page and then consume it.

Here's the dmesg output. There are multiple calls to memory_failure()
because the poison address is signalled both by the memory controller
(CMCI with UCNA signature) and the DCU (#MC with SRAR signature).

Note that the first message says: "recovery action for unknown page: Ignored"

[   70.331253] EINJ: Error INJection is initialized.
[   76.949490] process '/aegl/ras-tools/einj_mem_uc' started with executable stack
[   77.481846] Disabling lock debugging due to kernel taint
[   77.482004] mce: [Hardware Error]: Machine check events logged
[   77.487176] mce: Uncorrected hardware memory error in user-access at 7e025e400
[   77.493225] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[   77.508704] {1}[Hardware Error]: event severity: recoverable
[   77.514361] {1}[Hardware Error]:  Error 0, type: recoverable
[   77.520011] {1}[Hardware Error]:  fru_text: Card01, ChnG, DIMM0
[   77.525921] {1}[Hardware Error]:   section_type: memory error
[   77.531659] {1}[Hardware Error]:   error_status: 0x0000000000000400
[   77.537914] {1}[Hardware Error]:   physical_address: 0x00000007e025e400
[   77.544518] {1}[Hardware Error]:   node: 0 card: 6 module: 0 rank: 1 bank: 8 device: 0 row: 15105 column: 896
[   77.554503] {1}[Hardware Error]:   error_type: 4, single-symbol chipkill ECC
[   77.561548] {1}[Hardware Error]:   DIMM location: NODE 3 CPU0_DIMM_G1
[   77.568135] Memory failure: 0x7e025e: recovery action for unknown page: Ignored
[   77.575445] Memory failure: 0x7e025e: already hardware poisoned
[   77.627894] EDAC skx MC3: HANDLING MCE MEMORY ERROR
[   77.633600] EDAC skx MC3: CPU 0: Machine Check Event: 0x0 Bank 25: 0xac00000200a00090
[   77.633601] EDAC skx MC3: TSC 0x18d57ce28f4f25
[   77.633602] EDAC skx MC3: ADDR 0x7e025e400
[   77.633603] EDAC skx MC3: MISC 0x9000201d809c086
[   77.633603] EDAC skx MC3: PROCESSOR 0:0x606a6 TIME 1628780833 SOCKET 0 APIC 0x0
[   77.633608] EDAC MC3: 1 UE memory read error on CPU_SrcID#0_MC#3_Chan#0_DIMM#0 (channel:0 slot:0 page:0x7e025e offset:0x400 grain:32 -  err_code:0x00a0:0x0090  SystemAddress:0x7e025e400 ProcessorSocketId:0x0 MemoryControllerId:0x3 ChannelAddress:0xec04bc00 ChannelId:0x0 RankAddress:0x76025c00 PhysicalRankId:0x1 DimmSlotId:0x0 Row:0x3b01 Column:0x380 Bank:0x0 BankGroup:0x2 ChipSelect:0x1 ChipId:0x0)
[   77.633611] Memory failure: 0x7e025e: Sending SIGBUS to einj_mem_uc:12283 due to hardware memory corruption
[   77.668827] mce: [Hardware Error]: Machine check events logged
[   77.678605] Memory failure: 0x7e025e: already hardware poisoned
[   77.736685] EDAC skx MC3: HANDLING MCE MEMORY ERROR
[   77.742392] EDAC skx MC3: CPU 0: Machine Check Event: 0x0 Bank 255: 0xb40000000000009f
[   77.742394] EDAC skx MC3: TSC 0x0
[   77.742394] EDAC skx MC3: ADDR 0x7e025e400
[   77.742395] EDAC skx MC3: MISC 0x0
[   77.742395] EDAC skx MC3: PROCESSOR 0:0x606a6 TIME 1628780833 SOCKET 0 APIC 0x0
[   77.742397] EDAC MC3: 1 UE memory read error on CPU_SrcID#0_MC#3_Chan#0_DIMM#0 (channel:0 slot:0 page:0x7e025e offset:0x400 grain:32 -  err_code:0x0000:0x009f  SystemAddress:0x7e025e400 ProcessorSocketId:0x0 MemoryControllerId:0x3 ChannelAddress:0xec04bc00 ChannelId:0x0 RankAddress:0x76025c00 PhysicalRankId:0x1 DimmSlotId:0x0 Row:0x3b01 Column:0x380 Bank:0x0 BankGroup:0x2 ChipSelect:0x1 ChipId:0x0)
[   77.777612] Memory failure: 0x7e025e: already hardware poisoned
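
To make the sequence of memory_failure() outcomes in the log above easier
to see, here is a minimal sketch that tallies the "Memory failure:" lines
and checks that the reported page frame number matches the physical address.
It assumes 4 KiB base pages (PAGE_SHIFT = 12) and the message layout seen in
this log; the helper names pfn_of() and tally_memory_failure() are
hypothetical and exist only for this illustration.

```python
import re
from collections import Counter

PAGE_SHIFT = 12  # assumption: 4 KiB base pages, as is usual on x86-64

# Matches lines shaped like the ones in the log above, e.g.
#   [   77.575445] Memory failure: 0x7e025e: already hardware poisoned
MF_RE = re.compile(r"Memory failure: (0x[0-9a-f]+): (.*)$")


def pfn_of(phys_addr: int) -> int:
    """Page frame number that contains a physical address."""
    return phys_addr >> PAGE_SHIFT


def tally_memory_failure(lines):
    """Count (pfn, message) pairs over all 'Memory failure:' lines."""
    counts = Counter()
    for line in lines:
        m = MF_RE.search(line)
        if m:
            counts[(int(m.group(1), 16), m.group(2).strip())] += 1
    return counts


# The five "Memory failure:" lines copied from the dmesg output above.
sample = [
    "[   77.568135] Memory failure: 0x7e025e: recovery action for unknown page: Ignored",
    "[   77.575445] Memory failure: 0x7e025e: already hardware poisoned",
    "[   77.633611] Memory failure: 0x7e025e: Sending SIGBUS to einj_mem_uc:12283 due to hardware memory corruption",
    "[   77.678605] Memory failure: 0x7e025e: already hardware poisoned",
    "[   77.777612] Memory failure: 0x7e025e: already hardware poisoned",
]

if __name__ == "__main__":
    addr = 0x7E025E400  # physical_address reported by APEI and EDAC
    print(f"pfn={pfn_of(addr):#x} offset={addr & ((1 << PAGE_SHIFT) - 1):#x}")
    for (pfn, msg), n in tally_memory_failure(sample).items():
        print(f"{n}x {pfn:#x}: {msg}")
```

On a live system the same function could be fed the output of dmesg
instead of the hard-coded sample list.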

-Tony
