From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 86F80C433FE for ; Thu, 13 Jan 2022 23:11:30 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235320AbiAMXL3 (ORCPT ); Thu, 13 Jan 2022 18:11:29 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41682 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232394AbiAMXL2 (ORCPT ); Thu, 13 Jan 2022 18:11:28 -0500 Received: from out2.migadu.com (out2.migadu.com [IPv6:2001:41d0:2:aacc::]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 20EC8C061574 for ; Thu, 13 Jan 2022 15:11:27 -0800 (PST) X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1642115485; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding; bh=+CoZiINFVqY4tRz0h+rd5YDtl6623JrbL8fyCa3EoQc=; b=PsvDcht36POJkSlyvVHWr7CnBLmJ86SbuCW325MfocVlna/6eXcuFKLzu1Ipb3Mg1gDJJl OyyRcGYYXdPJ8pX/lnBjTRk2/lYDCqvujtxZ71tmxqlnavG8otpVY5i9Uap9c744b9Qg47 yaoLELWA6bY5PwRtCQOrQZ25gH7vNh4= From: Naoya Horiguchi To: linux-mm@kvack.org Cc: Andrew Morton , Tony Luck , Youquan Song , Naoya Horiguchi , linux-kernel@vger.kernel.org Subject: [PATCH v2] mm/hwpoison: Fix error page recovered but reported "not recovered" Date: Fri, 14 Jan 2022 08:11:17 +0900 Message-Id: <20220113231117.1021405-1-naoya.horiguchi@linux.dev> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Migadu-Flow: FLOW_OUT X-Migadu-Auth-User: linux.dev Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Naoya Horiguchi When an uncorrected memory error is consumed there is a race between the CMCI from the memory controller reporting an uncorrected error with a UCNA signature, and the core reporting and SRAR signature machine check when the data is about to be consumed. If the CMCI wins that race, the page is marked poisoned when uc_decode_notifier() calls memory_failure() and the machine check processing code finds the page already poisoned. It calls kill_accessing_process() to make sure a SIGBUS is sent. But returns the wrong error code. Console log looks like this: [34775.674296] mce: Uncorrected hardware memory error in user-access at 3710b3400 [34775.675413] Memory failure: 0x3710b3: recovery action for dirty LRU page: Recovered [34775.690310] Memory failure: 0x3710b3: already hardware poisoned [34775.696247] Memory failure: 0x3710b3: Sending SIGBUS to einj_mem_uc:361438 due to hardware memory corruption [34775.706072] mce: Memory error not recovered kill_accessing_process() is supposed to return -EHWPOISON to notify that SIGBUS is already set to the process and kill_me_maybe() doesn't have to send it again. But current code simply fails to do this, so fix it to make sure to work as intended. This change avoids the noise message "Memory error not recovered" and skips duplicate SIGBUSs. [Tony: Reworded some parts of commit message] Fixes: a3f5d80ea401 ("mm,hwpoison: send SIGBUS with error virutal address") Reported-by: Youquan Song Cc: Tony Luck Signed-off-by: Naoya Horiguchi --- mm/memory-failure.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 14ae5c18e776..4c9bd1d37301 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -707,8 +707,10 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn, (void *)&priv); if (ret == 1 && priv.tk.addr) kill_proc(&priv.tk, pfn, flags); + else + ret = 0; mmap_read_unlock(p->mm); - return ret ? -EFAULT : -EHWPOISON; + return ret > 0 ? -EHWPOISON : -EFAULT; } static const char *action_name[] = { -- 2.25.1