Subject: Re: Question about SEA handling process happened in user space
From: James Morse
To: Xiaofei Tan
Cc: Catalin Marinas, Linuxarm, Will Deacon, Dave Martin, linux-arm-kernel@lists.infradead.org, Shiju Jose
Date: Tue, 7 Apr 2020 17:37:14 +0100
Message-ID: <7d6668d6-ec4a-e362-94a3-c31950651c02@arm.com>
In-Reply-To: <5E840F3B.6040803@huawei.com>
References: <5E81EFCD.6020605@huawei.com> <2b0e5507-ad75-9af1-6afe-aa87d8cf597f@arm.com> <5E83104A.7020803@huawei.com> <5E840F3B.6040803@huawei.com>

Hi Xiaofei,

On 01/04/2020 04:49, Xiaofei Tan wrote:
> On 2020/4/1 1:00, James Morse wrote:
>> On 3/31/20 10:41 AM, Xiaofei Tan wrote:
>>> 1.memory_failure() is only called for "memory error section" record. Then
>>> should we use this memory record for ghes sea report? Our platform is
>>> using "ARM processor error section".

>> For what classes of error?

> Both processor cache ecc error and memory error (marked by poison) can lead to SEA.

These are the errors that can cause the hardware to notify software via external abort.
For which classes of error does your firmware then use a 'processor error'?

It sounds like you assume everything reported in the CPU records must be a processor
error, and everything reported in the memory error records must be a memory error.

(digression: this isn't really true! The CPU could report that it read poison from memory.
Is this a memory error, or a CPU error? Equally the memory controller could report that a
PCIe device wrote to a not-present DIMM. Is this a memory error?)


>> If memory has become corrupted, you should tell the OS about the memory error.
>>
>> From (my) memory: linux will just print out 'processor errors', and panic() if
>> they are marked as fatal. I don't think you can use these to convey a memory
>> error...
>>
>
> OK. Then firmware should detect error source. If it is processor cache error,
> we use "ARM processor error section", else if it is memory error, we use "memory error section".
> Normally, we report memory error only from RAS node of DDRC or HHA module. For SEA,

Do you have patches to get linux to do something useful with the processor error nodes?
We'd need it to handle uncorrected cache errors with a physical address, as if they were
memory errors... A virtual address is no use, as the memory may have been re-mapped in the
meantime.

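As a rough illustration of the behaviour described above, the following Python sketch models how the OS reacts to each CPER section type. It is a toy model under stated assumptions, not kernel code: `CperRecord`, `os_action`, the section labels and the 4K page size are hypothetical names and values chosen for this example. Only `memory_failure()` and `panic()` come from the thread, and treating every fatal record as a panic regardless of section is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Optional

PAGE_SHIFT = 12  # assumption: 4K pages


@dataclass
class CperRecord:
    section: str                     # hypothetical labels: "memory" or "arm_processor"
    fatal: bool                      # severity as marked by firmware
    phys_addr: Optional[int] = None  # physical address, if firmware supplied one


def os_action(rec: CperRecord) -> str:
    """Toy model of the OS reaction to one firmware-first error record."""
    if rec.fatal:
        # Assumption of this sketch: fatal records panic whatever the section.
        return "panic"
    if rec.section == "memory" and rec.phys_addr is not None:
        # Only memory error sections reach memory_failure(), which works on a
        # page frame number derived from the physical address.
        pfn = rec.phys_addr >> PAGE_SHIFT
        return f"memory_failure(pfn={pfn:#x})"
    # Processor error sections are only printed today, even when they carry
    # a usable physical address.
    return "print only"
```

In this model, a non-fatal cache error at physical address 0x80001234 reported as `"arm_processor"` yields `"print only"`, while the same address reported as `"memory"` reaches `memory_failure()`. That difference is the pragmatic reason for describing the resulting memory corruption in a memory error section.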
> It is a little strange to report as memory error when collect errors from processor
> RAS node.

It's pragmatic: today linux ignores the processor errors.
If you suffer a cache error, the memory that backed that cached location is now also
corrupt, as you've lost the writes that made the cache-line dirty.

If you can describe this memory corruption, without treating it as 'the error', then an OS
that doesn't know about the processor error sections will still do the right thing.
(i.e. leave out the device/row/rank stuff to avoid it being attributed to a DIMM)

The downside is you have fake memory errors when nothing bad happened to the DIMM. These
should be uniform, and smaller than the errors actually occurring at the DIMM.

I've no idea if patches adding support for the processor error nodes would be considered
for stable.


>>> 2.Should we define an error source structure for each cpu core in HEST table?
>>> If not, there may be conflict if more than one cpu core fall into SEA.
>>
>> This is a question for the people who wrote your firmware.
>> For firmware first, you must have set SCR_EL3.EA. What does your firmware do if
>> two CPUs take an external abort at the same time?
>
> Will block the second one until first SEA finished and error source of HEST table free.

Okay, so one 'SEA' entry in the HEST describes the single region that CPER will be
written to.


>> Each CPU having its own area to read/write CPER would mean you need one
>> NOTIFY_SEA entry in the HEST for each area ... but how does the OS know which
>> CPU is which?
>
> Yes, OS don't know this.
> So, it is ok to share the only one area for all CPUs.

Yes, as there is no way to pair the memory with the CPUs.
If there is more than one region, then each CPU taking an external abort will walk the
list, checking each one. It's up to firmware to ensure this is serialised.
Sounds like you've got this sorted.

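The sharing scheme discussed above can be sketched as a small Python model: firmware holds back a second CPU while the single CPER region is busy, and the OS walks every SEA region on each external abort. Everything here (`SeaRegion`, `firmware_publish`, `os_handle_sea`) is a hypothetical name for illustration; real firmware would stall the CPU at EL3 rather than return `False`.

```python
from typing import List, Optional


class SeaRegion:
    """One hypothetical NOTIFY_SEA error source: a single shared CPER buffer."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.record: Optional[str] = None  # None means the region is free

    def firmware_publish(self, record: str) -> bool:
        """Firmware side: write a record only if the region is free.

        Returning False stands in for blocking the second CPU until the
        first SEA has been handled and the region released.
        """
        if self.record is not None:
            return False
        self.record = record
        return True


def os_handle_sea(regions: List[SeaRegion]) -> List[str]:
    """OS side: walk every SEA region, consume pending records, free the region."""
    handled = []
    for region in regions:
        if region.record is not None:
            handled.append(region.record)
            region.record = None  # acknowledge, so firmware may reuse the region
    return handled
```

With one region, a second `firmware_publish()` fails until `os_handle_sea()` has drained the first record. The OS never needs to know which CPU the record belongs to, because it checks every region regardless of which CPU took the abort.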

Thanks,

James