Date: Fri, 23 Feb 2024 12:17:01 +0000
From: Jonathan Cameron
To: Dan Williams
CC: Shuai Xue, Borislav Petkov, Ira Weiny, "Luck, Tony", "james.morse@arm.com"
Subject: Re: [PATCH v11 1/3] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered
Message-ID: <20240223121701.00004bcf@Huawei.com>
In-Reply-To: <20240223120813.00005d1f@Huawei.com>
References: <20221027042445.60108-1-xueshuai@linux.alibaba.com> <20240204080144.7977-2-xueshuai@linux.alibaba.com> <20240219092528.GTZdMeiDWIDz613VeT@fat_crate.local> <65d82c9352e78_24f3f294d5@dwillia2-mobl3.amr.corp.intel.com.notmuch> <20240223120813.00005d1f@Huawei.com>
Organization: Huawei Technologies Research and Development (UK) Ltd.

On Fri, 23 Feb 2024 12:08:13 +0000
Jonathan Cameron wrote:

> On Thu, 22 Feb 2024 21:26:43 -0800
> Dan Williams wrote:
>
> > Shuai Xue wrote:
> > >
> > >
> > > On 2024/2/19 17:25, Borislav Petkov wrote:
> > > > On Sun, Feb 04, 2024 at 04:01:42PM +0800, Shuai Xue wrote:
> > > >> Synchronous error was detected as a result of user-space process accessing
> > > >> a 2-bit uncorrected error. The CPU will take a synchronous error exception
> > > >> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
> > > >> memory_failure() work which poisons the related page, unmaps the page, and
> > > >> then sends a SIGBUS to the process, so that a system wide panic can be
> > > >> avoided.
> > > >>
> > > >> However, no memory_failure() work will be queued when abnormal synchronous
> > > >> errors occur. These errors can include situations such as invalid PA,
> > > >> unexpected severity, no memory failure config support, invalid GUID
> > > >> section, etc. In such case, the user-space process will trigger SEA again.
> > > >> This loop can potentially exceed the platform firmware threshold or even
> > > >> trigger a kernel hard lockup, leading to a system reboot.
> > > >>
> > > >> Fix it by performing a force kill if no memory_failure() work is queued
> > > >> for synchronous errors.
> > > >>
> > > >> Signed-off-by: Shuai Xue
> > > >> ---
> > > >>  drivers/acpi/apei/ghes.c | 9 +++++++++
> > > >>  1 file changed, 9 insertions(+)
> > > >>
> > > >> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> > > >> index 7b7c605166e0..0892550732d4 100644
> > > >> --- a/drivers/acpi/apei/ghes.c
> > > >> +++ b/drivers/acpi/apei/ghes.c
> > > >> @@ -806,6 +806,15 @@ static bool ghes_do_proc(struct ghes *ghes,
> > > >>  		}
> > > >>  	}
> > > >>
> > > >> +	/*
> > > >> +	 * If no memory failure work is queued for abnormal synchronous
> > > >> +	 * errors, do a force kill.
> > > >> +	 */
> > > >> +	if (sync && !queued) {
> > > >> +		pr_err("Sending SIGBUS to current task due to memory error not recovered");
> > > >> +		force_sig(SIGBUS);
> > > >> +	}
> > > >
> > > > Except that there are a bunch of CXL GUIDs being handled there too and
> > > > this will sigbus those processes now automatically.
> > > >
> > > Before the CXL GUIDs added, @Tony confirmed that the HEST notifications are always
> > > asynchronous on x86 platform, so only Synchronous External Abort (SEA) on ARM is
> > > delivered as a synchronous notification.
> > >
> > > Will the CXL component trigger synchronous events for which we need to terminate the
> > > current process by sending sigbus to process?
> >
> > None of the CXL component errors should be handled as synchronous
> > events. They are either asynchronous protocol errors, or effectively
> > equivalent to CPER_SEC_PLATFORM_MEM notifications.
>
> Not a good example, CPER_SEC_PLATFORM_MEM is sometimes signaled via SEA.
>

Premature send. :(

One example I can point at is how we do signaling of memory errors
detected by the host into a VM on arm64.
https://elixir.bootlin.com/qemu/latest/source/hw/acpi/ghes.c#L391
CPER_SEC_PLATFORM_MEM via ARM Synchronous External Abort (SEA).

Right now we've only used async in QEMU for proposed CXL error
CPER records signalling but your reference to them being similar
to CPER_SEC_PLATFORM_MEM is valid so 'maybe' they will be
synchronous in some physical systems as it's one viable way to
provide rich information for synchronous reception of poison.

For the VM case my assumption today is we don't care about
providing the VM with rich data, so CPER_SEC_PLATFORM_MEM is fine as
a path for errors whether from CXL CPER records or not.

Jonathan
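
For anyone following the thread, the decision added by the quoted hunk can be
modelled in user space roughly as below. This is only a sketch under stated
assumptions, not the kernel code: `section_kind`, `struct section`,
`process_sections()` and `must_force_sigbus()` are hypothetical names invented
for illustration, and the `recoverable` flag stands in for "memory_failure()
work could be queued". It shows the case raised above: a record that carries
only a CXL section, if delivered via a synchronous notification, queues
nothing and so would hit the force-kill path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for CPER section types; the kernel matches GUIDs. */
enum section_kind {
	SEC_PLATFORM_MEM,	/* memory error section */
	SEC_CXL_EVENT,		/* CXL component event section */
	SEC_UNKNOWN,
};

struct section {
	enum section_kind kind;
	bool recoverable;	/* assumption: memory_failure() work can be queued */
};

/* Models the per-section loop: returns true if any section queued work. */
static bool process_sections(const struct section *secs, size_t n)
{
	bool queued = false;

	assert(secs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		/* In this model only memory sections can queue recovery work. */
		if (secs[i].kind == SEC_PLATFORM_MEM && secs[i].recoverable)
			queued = true;
	}
	return queued;
}

/* Models the added check: "if (sync && !queued)" means force a SIGBUS. */
static bool must_force_sigbus(bool sync, bool queued)
{
	return sync && !queued;
}
```

With this model, a synchronous record holding a recoverable memory section does
not force a kill, while a synchronous record holding only a CXL section does.
The same CXL record delivered asynchronously does not.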