From mboxrd@z Thu Jan 1 00:00:00 1970
From: Xiongfeng Wang
Subject: Re: [PATCH v3 8/8] arm64: exception: check shared writable page in SEI handler
Date: Wed, 12 Apr 2017 16:35:07 +0800
Message-ID: <5b914702-7262-54d5-2f64-0521a04add03@huawei.com>
References: <1490869877-118713-1-git-send-email-xiexiuqi@huawei.com> <1490869877-118713-9-git-send-email-xiexiuqi@huawei.com> <58E7B6BD.3000401@arm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="windows-1252"
Content-Transfer-Encoding: 7bit
In-Reply-To: <58E7B6BD.3000401@arm.com>
Sender: linux-kernel-owner@vger.kernel.org
To: James Morse , Xie XiuQi
Cc: christoffer.dall@linaro.org, marc.zyngier@arm.com, catalin.marinas@arm.com, will.deacon@arm.com, fu.wei@linaro.org, rostedt@goodmis.org, hanjun.guo@linaro.org, shiju.jose@huawei.com, linux-arm-kernel@lists.infradead.org, kvmarm@lists.cs.columbia.edu, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org, gengdongjiu@huawei.com, zhengqiang10@huawei.com, wuquanming@huawei.com, Wang Xiongfeng
List-Id: linux-acpi@vger.kernel.org

Hi James,

On 2017/4/7 23:56, James Morse wrote:
> Hi Xie XiuQi,
>
> On 30/03/17 11:31, Xie XiuQi wrote:
>> From: Wang Xiongfeng
>>
>> Since SEI is asynchronous, the error data has been consumed. So we must
>> suppose that all the memory data current process can write are
>> contaminated. If the process doesn't have shared writable pages, the
>> process will be killed, and the system will continue running normally.
>> Otherwise, the system must be terminated, because the error has been
>> propagated to other processes running on other cores, and recursively
>> the error may be propagated to several another processes.
>
> This is pretty complicated. We can't guarantee that another CPU hasn't modified
> the page tables while we do this, (so its racy). We can't guarantee that the
> corrupt data hasn't been sent over the network or written to disk in the mean
> time (so its not enough).
>
> The scenario you have is a write of corrupt data to memory where another CPU
> reading it doesn't know the value is corrupt.
>
> The hardware gives us quite a lot of help containing errors. The RAS
> specification (DDI 0587A) describes your scenario as error propagation in '2.1.2
> Architectural error propagation', and then classifies it in '2.1.3
> Architecturally infected, containable and uncontainable' as uncontained because
> the value is no longer in the general-purpose registers. For uncontained errors
> we should panic().
>
> We shouldn't need to try to track errors after we get a notification as the
> hardware has done this for us.
>

Thanks for your comments. I think what you said is reasonable. We will remove
this patch and use the AET field of ESR_ELx to determine whether we should kill
the current process or just panic.

>
> Firmware-first does complicate this if events like this are not delivered using
> a synchronous external abort, as Linux may have PSTATE.A masked preventing
> SError Interrupts from being taken. It looks like PSTATE.A is masked much more
> often than is necessary. I will look into cleaning this up.
>
>
> Thanks,
>
> James
>
> .
>

Thanks,
Wang Xiongfeng
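
The plan stated in the reply (use the AET field of ESR_ELx to choose between killing the current process and panicking) could be sketched roughly as below. This is only an illustration, not the code from any version of the patch: the names sei_classify, enum sei_action and the ESR_* macros are hypothetical, and the mapping from AET value to action is an assumed policy. The bit positions (EC in bits [31:26], IDS in bit 24, AET in bits [12:10], DFSC in bits [5:0]) are taken from the SError syndrome encoding of the RAS extension as I understand it and should be checked against the architecture manual.

```c
#include <stdint.h>

/* Hypothetical names; field layout assumed from the SError ISS encoding. */
#define ESR_EC_SHIFT     26
#define ESR_EC_MASK      0x3Fu
#define ESR_EC_SERROR    0x2Fu        /* exception class: SError interrupt */
#define ESR_IDS          (1u << 24)   /* syndrome is implementation defined */
#define ESR_AET_SHIFT    10
#define ESR_AET_MASK     0x7u
#define ESR_DFSC_MASK    0x3Fu
#define ESR_DFSC_SERROR  0x11u        /* asynchronous SError interrupt */

/* Asynchronous Error Type values. */
enum aet {
	AET_UC  = 0,  /* uncontainable */
	AET_UEU = 1,  /* unrecoverable state */
	AET_UEO = 2,  /* restartable state */
	AET_UER = 3,  /* recoverable state */
	AET_CE  = 6,  /* corrected */
};

enum sei_action { SEI_CONTINUE, SEI_KILL_TASK, SEI_PANIC };

/* Decide what to do with an SError from its syndrome (assumed policy). */
static enum sei_action sei_classify(uint32_t esr)
{
	uint32_t ec = (esr >> ESR_EC_SHIFT) & ESR_EC_MASK;

	/* AET is only meaningful for an architected SError syndrome. */
	if (ec != ESR_EC_SERROR || (esr & ESR_IDS) ||
	    (esr & ESR_DFSC_MASK) != ESR_DFSC_SERROR)
		return SEI_PANIC;

	switch ((esr >> ESR_AET_SHIFT) & ESR_AET_MASK) {
	case AET_CE:
	case AET_UEO:
		/* Error was corrected or not yet consumed: carry on. */
		return SEI_CONTINUE;
	case AET_UEU:
	case AET_UER:
		/* Error is contained to the interrupted context. */
		return SEI_KILL_TASK;
	case AET_UC:
	default:
		/* Uncontained or unknown: the error may have propagated. */
		return SEI_PANIC;
	}
}
```

With this sketch, a syndrome of (0x2F << 26) | (3 << 10) | 0x11 (SError, AET = recoverable state) maps to SEI_KILL_TASK, while the same syndrome with AET = 0 (uncontainable) maps to SEI_PANIC. Anything that cannot be classified is treated as fatal, which follows the point made in the quoted review that uncontained errors should panic().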