From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS,URIBL_BLOCKED autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id A57D9C11F7B for ; Tue, 6 Jul 2021 16:45:02 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 814376141C for ; Tue, 6 Jul 2021 16:45:02 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230219AbhGFQrj (ORCPT ); Tue, 6 Jul 2021 12:47:39 -0400 Received: from mga01.intel.com ([192.55.52.88]:52110 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230048AbhGFQrd (ORCPT ); Tue, 6 Jul 2021 12:47:33 -0400 X-IronPort-AV: E=McAfee;i="6200,9189,10037"; a="230883143" X-IronPort-AV: E=Sophos;i="5.83,328,1616482800"; d="scan'208";a="230883143" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by fmsmga101.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 06 Jul 2021 09:44:53 -0700 X-IronPort-AV: E=Sophos;i="5.83,328,1616482800"; d="scan'208";a="457133171" Received: from agluck-desk2.sc.intel.com (HELO agluck-desk2.amr.corp.intel.com) ([10.3.52.146]) by orsmga008-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 06 Jul 2021 09:44:52 -0700 Date: Tue, 6 Jul 2021 09:44:51 -0700 From: "Luck, Tony" To: Ding Hui Cc: bp@alien8.de, bp@suse.de, naoya.horiguchi@nec.com, osalvador@suse.de, peterz@infradead.org, linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org, tglx@linutronix.de, mingo@redhat.com, x86@kernel.org, hpa@zytor.com, youquan.song@intel.com, huangcun@sangfor.com.cn, stable@vger.kernel.org Subject: Re: [PATCH v2] x86/mce: Fix endless loop when run task works after #MC Message-ID: <20210706164451.GA1289248@agluck-desk2.amr.corp.intel.com> References: <20210706121606.15864-1-dinghui@sangfor.com.cn> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20210706121606.15864-1-dinghui@sangfor.com.cn> Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jul 06, 2021 at 08:16:06PM +0800, Ding Hui wrote: > Recently we encounter multi #MC on the same task when it's > task_work_run() has not been called, current->mce_kill_me was > added to task_works list more than once, that make a circular > linked task_works, so task_work_run() will do a endless loop. I saw the same and posted a similar fix a while back: https://www.spinics.net/lists/linux-mm/msg251006.html It didn't get merged because some validation tests began failing around the same time. I'm now pretty sure I understand what happened with those other tests. I'll post my updated version (second patch in a three part series) later today. > diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c > + if (!cmpxchg(¤t->mce_kill_me.func, NULL, ch.func)) { > + current->mce_addr = m->addr; > + current->mce_kflags = m->kflags; > + current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV); > + current->mce_whole_page = whole_page(m); You don't need an atomic cmpxchg here (nor the WRITE_ONCE() to clear it). The task is operating on its own task_struct. Nobody else should touch the mce_kill_me field. -Tony