From: Tony Luck
To: Naoya Horiguchi, Andrew Morton
Cc: Miaohe Lin, Matthew Wilcox, Shuai Xue, Dan Williams, Michael Ellerman,
    Nicholas Piggin, Christophe Leroy, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, Tony Luck
Subject: [PATCH v3 0/2] Copy-on-write poison recovery
Date: Fri, 21 Oct 2022 13:01:18 -0700
Message-Id: <20221021200120.175753-1-tony.luck@intel.com>
In-Reply-To: <20221019170835.155381-1-tony.luck@intel.com>
References: <20221019170835.155381-1-tony.luck@intel.com>

Part 1 deals with the process that triggered the copy-on-write fault
with a store to a shared read-only page. That process is sent a SIGBUS
with the usual machine check decoration to specify the virtual address
of the lost page, together with the scope.

Part 2 sets up to asynchronously take the page with the uncorrected
error offline to prevent additional machine check faults. H/t to Miaohe
Lin and Shuai Xue for pointing me to the existing function to queue a
call to memory_failure().

On x86 there is some duplicate reporting (because the error is
signalled by the memory controller as well as by the core that
triggered the machine check).
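For readers on the receiving end of the SIGBUS described in part 1, the
following is a minimal userspace sketch of decoding that "machine check
decoration" from a siginfo_t. It assumes the "scope" is what Linux
reports in si_addr_lsb (the least significant valid bit of si_addr) and
that the signal carries si_code BUS_MCEERR_AR or BUS_MCEERR_AO. The
names struct poison_report and decode_poison_sigbus() are hypothetical
helpers for illustration only; they are not part of this series.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the decoded fields (not a kernel type). */
struct poison_report {
    int is_poison;        /* 1 if si_code marks a hardware memory error */
    int action_required;  /* 1 for BUS_MCEERR_AR, 0 for BUS_MCEERR_AO */
    uintptr_t addr;       /* si_addr: virtual address of the lost data */
    size_t lost_bytes;    /* 1 << si_addr_lsb: assumed size of the lost region */
};

/*
 * Hypothetical helper: pull the memory-error fields out of a siginfo_t.
 * Returns an all-zero report for any signal that is not a SIGBUS with
 * one of the two memory-error si_code values.
 */
static struct poison_report decode_poison_sigbus(const siginfo_t *si)
{
    struct poison_report r = {0, 0, 0, 0};

    assert(si != NULL);
    if (si->si_signo != SIGBUS)
        return r;
    if (si->si_code != BUS_MCEERR_AR && si->si_code != BUS_MCEERR_AO)
        return r;

    r.is_poison = 1;
    r.action_required = (si->si_code == BUS_MCEERR_AR);
    r.addr = (uintptr_t)si->si_addr;
    r.lost_bytes = (size_t)1 << si->si_addr_lsb;
    return r;
}
```

A handler installed with sigaction() and SA_SIGINFO would pass its
siginfo_t argument to this helper. With si_addr_lsb equal to 12, the
report describes one lost 4 KiB page.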
Console logs on x86 look like this, with a note under each message:

[ 1647.723403] mce: [Hardware Error]: Machine check events logged
        Machine check from kernel copy routine

[ 1647.723414] MCE: Killing einj_mem_uc:3600 due to hardware memory corruption fault at 7f3309503400
        x86 fault handler sends SIGBUS to child process

[ 1647.735183] Memory failure: 0x905b92d: recovery action for dirty LRU page: Recovered
        Async call to memory_failure() from copy on write path

[ 1647.748397] Memory failure: 0x905b92d: already hardware poisoned
        uc_decode_notifier() processes memory controller report

[ 1647.761313] MCE: Killing einj_mem_uc:3599 due to hardware memory corruption fault at 7f3309503400
        Parent process tries to read poisoned page. Page has been
        unmapped, so #PF handler sends SIGBUS

Tony Luck (2):
  mm, hwpoison: Try to recover from copy-on write faults
  mm, hwpoison: When copy-on-write hits poison, take page offline

 include/linux/highmem.h | 24 ++++++++++++++++++++++++
 include/linux/mm.h      |  5 ++++-
 mm/memory.c             | 32 ++++++++++++++++++++++----------
 3 files changed, 50 insertions(+), 11 deletions(-)

-- 
2.37.3
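As a reading aid for the console log above: the value after "Memory
failure:" is a page frame number, and the MCE lines give the faulting
virtual address. The sketch below shows the arithmetic that relates
them, assuming 4 KiB base pages (a page shift of 12) and a base-page
mapping. The EXAMPLE_* macros and both helper names are hypothetical
and exist only for this illustration.

```c
#include <stdint.h>

/* Assumption: 4 KiB base pages, i.e. a page shift of 12. */
#define EXAMPLE_PAGE_SHIFT 12
#define EXAMPLE_PAGE_SIZE  (UINT64_C(1) << EXAMPLE_PAGE_SHIFT)

/* Hypothetical helper: physical address of the first byte of a page frame. */
static uint64_t pfn_to_phys(uint64_t pfn)
{
    return pfn << EXAMPLE_PAGE_SHIFT;
}

/* Hypothetical helper: byte offset of an address within its page. */
static uint64_t page_offset(uint64_t addr)
{
    return addr & (EXAMPLE_PAGE_SIZE - 1);
}
```

Under those assumptions, pfn 0x905b92d starts at physical address
0x905b92d000, and the fault address 7f3309503400 lies 0x400 bytes into
its page. Both the child (3600) and the parent (3599) report the same
virtual address.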