From: Shuai Xue <xueshuai@linux.alibaba.com>
To: bp@alien8.de, rafael@kernel.org, wangkefeng.wang@huawei.com,
	tanxiaofei@huawei.com, mawupeng1@huawei.com, tony.luck@intel.com,
	linmiaohe@huawei.com, naoya.horiguchi@nec.com,
	james.morse@arm.com, gregkh@linuxfoundation.org, will@kernel.org,
	jarkko@kernel.org
Cc: linux-acpi@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, akpm@linux-foundation.org,
	linux-edac@vger.kernel.org, x86@kernel.org,
	xueshuai@linux.alibaba.com, justin.he@arm.com, ardb@kernel.org,
	ying.huang@intel.com, ashish.kalra@amd.com,
	baolin.wang@linux.alibaba.com, tglx@linutronix.de,
	mingo@redhat.com, dave.hansen@linux.intel.com, lenb@kernel.org,
	hpa@zytor.com, robert.moore@intel.com, lvying6@huawei.com,
	xiexiuqi@huawei.com, zhuo.song@linux.alibaba.com
Subject: [PATCH v11 0/3] ACPI: APEI: handle synchronous exceptions in task work to send correct SIGBUS si_code
Date: Sun,  4 Feb 2024 16:01:41 +0800
Message-ID: <20240204080144.7977-1-xueshuai@linux.alibaba.com>
In-Reply-To: <20221027042445.60108-1-xueshuai@linux.alibaba.com>

## Changes Log
changes since v10:
- rebase to v6.8-rc2

changes since v9:
- split patch 2 to address exactly one issue in one patch (per Borislav)
- rewrite commit log according to template (per Borislav)
- pickup reviewed-by tag of patch 1 from James Morse
- alloc and free twcb through gen_pool_{alloc,free}() (per James)
- rewrite cover letter

changes since v8:
- remove the bug fix tag of patch 2 (per Jarkko Sakkinen)
- remove the declaration of memory_failure_queue_kick (per Naoya Horiguchi)
- rewrite the return value comments of memory_failure (per Naoya Horiguchi)

changes since v7:
- rebase to Linux v6.6-rc2 (no code changed)
- rewrite the cover letter to explain the motivation of this patchset

changes since v6:
- add a more explicit error message, as suggested by Xiaofei
- pick up reviewed-by tag from Xiaofei
- pick up internal reviewed-by tag from Baolin

changes since v5 by addressing comments from Kefeng:
- document return value of memory_failure()
- drop redundant comments in call site of memory_failure() 
- make ghes_do_proc void and handle abnormal case within it
- pick up reviewed-by tag from Kefeng Wang 

changes since v4 by addressing comments from Xiaofei:
- do a force kill only for abnormal sync errors

changes since v3 by addressing comments from Xiaofei:
- do a force kill for abnormal memory failure error such as invalid PA,
unexpected severity, OOM, etc
- pick up tested-by tag from Ma Wupeng

changes since v2 by addressing comments from Naoya:
- rename mce_task_work to sync_task_work
- drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
- add steps to reproduce this problem in cover letter

changes since v1:
- identify synchronous events by notify type
- Link: https://lore.kernel.org/lkml/20221206153354.92394-3-xueshuai@linux.alibaba.com/

## Cover Letter

There are two major types of uncorrected recoverable (UCR) errors:

- Synchronous error: The error is detected and raised at the point of
  consumption in the execution flow, e.g. when a CPU tries to access
  a poisoned cache line. The CPU takes a synchronous error exception
  such as Synchronous External Abort (SEA) on Arm64 or Machine Check
  Exception (MCE) on x86. The OS is required to take action (for example,
  offline the failing page or kill the failing thread) to recover from this
  uncorrectable error.

- Asynchronous error: The error is detected outside of processor execution
  context, e.g. when an error is detected by a background scrubber. Some data
  in memory is corrupted, but the data has not been consumed. The OS may
  optionally take action to recover from this uncorrectable error.

Since commit a70297d22132 ("ACPI: APEI: set memory failure flags as
MF_ACTION_REQUIRED on synchronous events"), the MF_ACTION_REQUIRED flag
can be used to determine whether a synchronous exception occurred on the
ARM64 platform. When a synchronous exception is detected, the kernel should
terminate the current process, which is accessing the poisoned page. This is
done by sending a SIGBUS signal with the si_code BUS_MCEERR_AR, indicating
an action-required machine check error on read.

However, memory failure recovery incorrectly sends SIGBUS with the wrong
si_code, BUS_MCEERR_AO, for synchronous errors in early kill mode, even
though MF_ACTION_REQUIRED is set. The main problem is that, on the ARM64
platform, synchronous errors are queued as memory_failure() work and
executed in kernel thread context, not in the context of the user-space
process that encountered the corrupted memory. As a result, when
kill_proc() is called to terminate the process, it sends the incorrect
SIGBUS si_code because the context in which it runs is not the one in
which the error was triggered.
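
To make the si_code selection concrete, here is a simplified user-space
model of the decision described above. It is only a sketch, not the
kernel's kill_proc(): the function name, the task ID parameters, and the
MODEL_ constants are assumptions made for illustration. Only the numeric
si_code values 4 and 5 match the reproduction output in this letter.

```c
#include <stdbool.h>

/* si_code values for SIGBUS as seen in the reproduction output. */
enum { MODEL_BUS_MCEERR_AR = 4, MODEL_BUS_MCEERR_AO = 5 };

/* Hypothetical stand-in for the kernel's MF_ACTION_REQUIRED flag bit. */
#define MODEL_MF_ACTION_REQUIRED (1 << 1)

/*
 * Simplified model: BUS_MCEERR_AR is chosen only when the error is
 * action-required AND the handler runs in the context of the task
 * being signalled. Otherwise BUS_MCEERR_AO is chosen.
 *
 * target_tid:  task that consumed the poisoned data
 * current_tid: task in whose context the recovery code runs
 */
int model_pick_si_code(int flags, int target_tid, int current_tid)
{
	bool action_required = (flags & MODEL_MF_ACTION_REQUIRED) != 0;
	bool same_context = (target_tid == current_tid);

	if (action_required && same_context)
		return MODEL_BUS_MCEERR_AR;
	return MODEL_BUS_MCEERR_AO;
}
```

In this model, running the recovery from a kernel worker corresponds to
current_tid differing from target_tid, which yields BUS_MCEERR_AO even
with the action-required flag set. Running it in the context of the
consuming task yields BUS_MCEERR_AR.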

This series fixes the problem as follows:

- Patch 1: force-kill the current task if no memory_failure() work is
	   queued for a synchronous error.
- Patch 2: move the memory_failure() return value documentation to the
	   function declaration (a minor comment improvement).
- Patch 3: queue memory_failure() as a task_work so that it runs in the
	   context of the process that is actually consuming the poisoned
	   data, so that SIGBUS is sent with si_code BUS_MCEERR_AR.

Lv Ying and XiuQi from Huawei also proposed patches addressing a similar
problem [2][4]. Thanks to them for the discussion.

## Steps to Reproduce This Problem

To reproduce this problem:

	# STEP1: enable early kill mode
	#sysctl -w vm.memory_failure_early_kill=1
	vm.memory_failure_early_kill = 1

	# STEP2: inject a UCE error and consume it to trigger a synchronous error
	#einj_mem_uc single
	0: single   vaddr = 0xffffb0d75400 paddr = 4092d55b400
	injecting ...
	triggering ...
	signal 7 code 5 addr 0xffffb0d75000
	page not present
	Test passed

The si_code (code 5) reported by einj_mem_uc is BUS_MCEERR_AO, which is
incorrect for a synchronous error.

After this patch set:

	# STEP1: enable early kill mode
	#sysctl -w vm.memory_failure_early_kill=1
	vm.memory_failure_early_kill = 1

	# STEP2: inject a UCE error and consume it to trigger a synchronous error
	#einj_mem_uc single
	0: single   vaddr = 0xffffb0d75400 paddr = 4092d55b400
	injecting ...
	triggering ...
	signal 7 code 4 addr 0xffffb0d75000
	page not present
	Test passed

The si_code (code 4) reported by einj_mem_uc is BUS_MCEERR_AR, as
expected.
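
For readers who want to check the si_code from their own test program,
the following is a minimal sketch of a SIGBUS handler that prints a
"signal N code N" line like the ones above. It is not the einj_mem_uc
source. The names mceerr_name() and install_sigbus_handler() are
hypothetical, and the fallback defines assume the Linux values 4 and 5
seen in the output above.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Fallbacks for older headers; assumed Linux values (4 and 5). */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

/* Map a SIGBUS si_code to a readable name (hypothetical helper). */
const char *mceerr_name(int si_code)
{
	switch (si_code) {
	case BUS_MCEERR_AR:
		return "BUS_MCEERR_AR";
	case BUS_MCEERR_AO:
		return "BUS_MCEERR_AO";
	default:
		return "other";
	}
}

/*
 * Report the signal and exit. snprintf() is not strictly
 * async-signal-safe; it is used here only to keep the sketch short.
 */
static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
	char buf[128];
	int len;

	(void)ctx;
	len = snprintf(buf, sizeof(buf), "signal %d code %d (%s) addr %p\n",
		       sig, si->si_code, mceerr_name(si->si_code),
		       si->si_addr);
	if (len > (int)sizeof(buf) - 1)
		len = (int)sizeof(buf) - 1;
	if (len > 0 && write(STDERR_FILENO, buf, (size_t)len) < 0)
		_exit(2);
	_exit(1);
}

/* Install the handler with SA_SIGINFO so si_code is delivered. */
int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sigemptyset(&sa.sa_mask);
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	return sigaction(SIGBUS, &sa, NULL);
}
```

A test program would call install_sigbus_handler() before touching the
injected address. A synchronous consumption is expected to report
BUS_MCEERR_AR with this patch set applied.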

[1] Add ARMv8 RAS virtualization support in QEMU https://patchew.org/QEMU/20200512030609.19593-1-gengdongjiu@huawei.com/
[2] https://lore.kernel.org/lkml/20221205115111.131568-3-lvying6@huawei.com/
[3] https://lkml.kernel.org/r/20220914064935.7851-1-xueshuai@linux.alibaba.com
[4] https://lore.kernel.org/lkml/20221209095407.383211-1-lvying6@huawei.com/

Shuai Xue (3):
  ACPI: APEI: send SIGBUS to current task if synchronous memory error
    not recovered
  mm: memory-failure: move return value documentation to function
    declaration
  ACPI: APEI: handle synchronous exceptions in task work to send correct
    SIGBUS si_code

 arch/x86/kernel/cpu/mce/core.c |  9 +---
 drivers/acpi/apei/ghes.c       | 84 +++++++++++++++++++++-------------
 include/acpi/ghes.h            |  3 --
 mm/memory-failure.c            | 22 +++------
 4 files changed, 59 insertions(+), 59 deletions(-)

-- 
2.39.3


Thread overview: 121+ messages
2022-10-27  4:24 [PATCH] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on action required events Shuai Xue
2022-10-28 17:08 ` Rafael J. Wysocki
2022-10-28 17:25   ` Luck, Tony
2022-11-02 11:53     ` Shuai Xue
2022-11-22 11:40       ` Shuai Xue
2022-11-02  7:07   ` Shuai Xue
2022-12-06 15:33 ` [RFC PATCH 0/2] ACPI: APEI: handle synchronous exceptions in task work Shuai Xue
2022-12-07  9:54   ` reply for " Lv Ying
2022-12-07 12:34     ` Bixuan Cui
2022-12-07 12:56     ` Shuai Xue
2022-12-07 14:04       ` Shuai Xue
2022-12-08  2:27         ` Lv Ying
2022-12-06 15:33 ` [RFC PATCH 1/2] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events Shuai Xue
2022-12-06 15:33 ` [RFC PATCH 2/2] ACPI: APEI: separate synchronous error handling into task work Shuai Xue
2023-02-27  5:03 ` [PATCH v2 0/2] ACPI: APEI: handle synchronous exceptions with proper si_code Shuai Xue
2023-03-06  0:45   ` Shuai Xue
2023-02-27  5:03 ` [PATCH v2 1/2] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events Shuai Xue
2023-03-16  7:21   ` HORIGUCHI NAOYA(堀口 直也)
2023-03-16  9:57     ` Shuai Xue
2023-03-16 16:45       ` Luck, Tony
2023-03-17  1:12         ` Shuai Xue
2023-02-27  5:03 ` [PATCH v2 2/2] ACPI: APEI: handle synchronous exceptions in task work Shuai Xue
2023-03-16  7:21   ` HORIGUCHI NAOYA(堀口 直也)
2023-03-16 11:10     ` Shuai Xue
2023-03-17  0:29       ` HORIGUCHI NAOYA(堀口 直也)
2023-03-17  1:24         ` Shuai Xue
2023-03-17  7:24 ` [PATCH v3 0/2] ACPI: APEI: handle synchronous exceptions with proper si_code Shuai Xue
2023-03-20 18:03   ` Rafael J. Wysocki
2023-03-30  6:11     ` Shuai Xue
2023-03-30  9:52       ` Rafael J. Wysocki
2023-03-21  7:17   ` mawupeng
2023-03-22  1:27     ` Shuai Xue
2023-03-17  7:24 ` [PATCH v3 1/2] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events Shuai Xue
2023-03-17  7:24 ` [PATCH v3 2/2] ACPI: APEI: handle synchronous exceptions in task work Shuai Xue
2023-04-06 12:39   ` Xiaofei Tan
2023-04-07  2:21     ` Shuai Xue
2023-04-08  9:13 ` [PATCH v4 0/2] ACPI: APEI: handle synchronous exceptions with proper si_code Shuai Xue
2023-04-08  9:13 ` [PATCH v4 1/2] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events Shuai Xue
2023-04-08  9:13 ` [PATCH v4 2/2] ACPI: APEI: handle synchronous exceptions in task work Shuai Xue
2023-04-11  1:44   ` Xiaofei Tan
2023-04-11  3:16     ` Shuai Xue
2023-04-11  9:02       ` Xiaofei Tan
2023-04-11  9:48         ` Shuai Xue
2023-04-11 10:48 ` [PATCH v5 0/2] ACPI: APEI: handle synchronous exceptions with proper si_code Shuai Xue
2023-04-11 10:48 ` [PATCH v5 1/2] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events Shuai Xue
2023-04-11 14:17   ` Kefeng Wang
2023-04-12  2:54     ` Shuai Xue
2023-04-12  3:55   ` Xiaofei Tan
2023-04-13  1:42     ` Shuai Xue
2023-04-11 10:48 ` [PATCH v5 2/2] ACPI: APEI: handle synchronous exceptions in task work Shuai Xue
2023-04-11 14:28   ` Kefeng Wang
2023-04-12  2:58     ` Shuai Xue
2023-04-12  4:05   ` Xiaofei Tan
2023-04-13  1:49     ` Shuai Xue
2023-04-12 11:27 ` [PATCH v6 0/2] ACPI: APEI: handle synchronous exceptions with proper si_code Shuai Xue
2023-04-12 11:28 ` [PATCH v6 1/2] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events Shuai Xue
2023-04-12 11:28 ` [PATCH v6 2/2] ACPI: APEI: handle synchronous exceptions in task work Shuai Xue
2023-04-17  1:14 ` [PATCH v7 0/2] ACPI: APEI: handle synchronous exceptions with proper si_code Shuai Xue
2023-04-24  6:24   ` Shuai Xue
2023-05-08  1:55     ` Shuai Xue
2023-04-17  1:14 ` [PATCH v7 1/2] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events Shuai Xue
2023-04-17  1:14 ` [PATCH v7 2/2] ACPI: APEI: handle synchronous exceptions in task work Shuai Xue
2023-09-19  2:21 ` [RESEND PATCH v8 0/2] ACPI: APEI: handle synchronous errors in task work with proper si_code Shuai Xue
2023-09-19  2:21 ` [RESEND PATCH v8 1/2] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events Shuai Xue
2023-09-25 14:43   ` Jarkko Sakkinen
2023-09-26  6:23     ` Shuai Xue
2023-09-19  2:21 ` [RESEND PATCH v8 2/2] ACPI: APEI: handle synchronous exceptions in task work Shuai Xue
2023-09-25 15:00   ` Jarkko Sakkinen
2023-09-26  6:38     ` Shuai Xue
2023-10-03  8:28   ` Naoya Horiguchi
2023-10-07  2:01     ` Shuai Xue
2023-10-07  7:28 ` [PATCH v9 0/2] ACPI: APEI: handle synchronous errors in task work with proper si_code Shuai Xue
2023-11-21  1:48   ` Shuai Xue
2023-11-23 15:07   ` Borislav Petkov
2023-11-25  6:44     ` Shuai Xue
2023-11-25 12:10       ` Borislav Petkov
2023-11-26 12:25         ` Shuai Xue
2023-11-29 18:54           ` Borislav Petkov
2023-11-30  2:58             ` Shuai Xue
2023-11-30 14:40               ` Borislav Petkov
2023-11-30 17:43                 ` James Morse
2023-12-01  2:58                   ` Shuai Xue
2023-11-30 17:39             ` James Morse
2023-12-01  3:37               ` Shuai Xue
2023-10-07  7:28 ` [PATCH v9 1/2] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events Shuai Xue
2023-11-30 17:39   ` James Morse
2023-12-01  5:22     ` Shuai Xue
2023-10-07  7:28 ` [PATCH v9 2/2] ACPI: APEI: handle synchronous exceptions in task work Shuai Xue
2023-11-30 17:39   ` James Morse
2023-12-01  7:03     ` Shuai Xue
2023-12-18  6:45 ` [PATCH v10 0/4] ACPI: APEI: handle synchronous errors in task work with proper si_code Shuai Xue
2023-12-18  6:45 ` [PATCH v10 1/4] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events Shuai Xue
2023-12-18  6:53   ` Greg KH
2023-12-21 13:55   ` Rafael J. Wysocki
2023-12-22  1:07     ` Shuai Xue
2023-12-18  6:45 ` [PATCH v10 2/4] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered Shuai Xue
2023-12-18  6:54   ` Greg KH
2023-12-18  6:45 ` [PATCH v10 3/4] mm: memory-failure: move memory_failure() return value documentation to function declaration Shuai Xue
2023-12-18  6:54   ` Greg KH
2023-12-18  6:45 ` [PATCH v10 4/4] ACPI: APEI: handle synchronous exceptions in task work Shuai Xue
2023-12-18  6:54   ` Greg KH
2024-02-04  8:01 ` Shuai Xue [this message]
2024-02-19  1:46   ` [PATCH v11 0/3] ACPI: APEI: handle synchronous exceptions in task work to send correct SIGBUS si_code Shuai Xue
2024-02-04  8:01 ` [PATCH v11 1/3] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered Shuai Xue
2024-02-19  9:25   ` Borislav Petkov
2024-02-22  2:07     ` Shuai Xue
2024-02-23  5:26       ` Dan Williams
2024-02-23 12:08         ` Jonathan Cameron
2024-02-23 12:17           ` Jonathan Cameron
2024-02-24  6:08             ` Shuai Xue
2024-02-26 10:29               ` Borislav Petkov
2024-02-27  1:23                 ` Shuai Xue
2024-02-24 19:42             ` Dan Williams
2024-02-24 19:40     ` Dan Williams
2024-02-04  8:01 ` [PATCH v11 2/3] mm: memory-failure: move return value documentation to function declaration Shuai Xue
2024-02-26 10:46   ` Borislav Petkov
2024-02-27  1:27     ` Shuai Xue
2024-02-04  8:01 ` [PATCH v11 3/3] ACPI: APEI: handle synchronous exceptions in task work to send correct SIGBUS si_code Shuai Xue
2024-02-29  7:05   ` Shuai Xue
2024-03-08 10:18   ` Borislav Petkov
2024-03-12  6:05     ` Shuai Xue
