All of lore.kernel.org
 help / color / mirror / Atom feed
From: Xie XiuQi <xiexiuqi@huawei.com>
To: Mark Rutland <mark.rutland@arm.com>
Cc: <catalin.marinas@arm.com>, <will@kernel.org>,
	<tglx@linutronix.de>, <james.morse@arm.com>,
	<tanxiaofei@huawei.com>, <wangxiongfeng2@huawei.com>,
	<linux-arm-kernel@lists.infradead.org>,
	<linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] arm64: panic on synchronous external abort in kernel context
Date: Tue, 14 Apr 2020 20:19:41 +0800	[thread overview]
Message-ID: <9982d344-328e-320d-020f-218ab74ae2b1@huawei.com> (raw)
In-Reply-To: <20200414105923.GA2486@C02TD0UTHF1T.local>

Hi Mark,

On 2020/4/14 18:59, Mark Rutland wrote:
> On Fri, Apr 10, 2020 at 09:52:45AM +0800, Xie XiuQi wrote:
>> We should panic even panic_on_oops is not set, when we can't recover
>> from synchronous external abort in kernel context.
>>
>> Othervise, there are two issues:
>> 1) fallback to do_exit() in exception context, cause this core hung up.
>>    do_sea()
>>    -> arm64_notify_die
>>       -> die
>>          -> do_exit
>> 2) errors may propagated.
>>
>> Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
>> Cc: Xiaofei Tan <tanxiaofei@huawei.com>
>> ---
>>  arch/arm64/include/asm/esr.h | 12 ++++++++++++
>>  arch/arm64/kernel/traps.c    |  2 ++
>>  2 files changed, 14 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
>> index cb29253ae86b..acfc71c6d148 100644
>> --- a/arch/arm64/include/asm/esr.h
>> +++ b/arch/arm64/include/asm/esr.h
>> @@ -326,6 +326,18 @@ static inline bool esr_is_data_abort(u32 esr)
>>  	return ec == ESR_ELx_EC_DABT_LOW || ec == ESR_ELx_EC_DABT_CUR;
>>  }
>>  
>> +static inline bool esr_is_inst_abort(u32 esr)
>> +{
>> +	const u32 ec = ESR_ELx_EC(esr);
>> +
>> +	return ec == ESR_ELx_EC_IABT_LOW || ec == ESR_ELx_EC_IABT_CUR;
>> +}
>> +
>> +static inline bool esr_is_ext_abort(u32 esr)
>> +{
>> +	return esr_is_data_abort(esr) || esr_is_inst_abort(esr);
>> +}
> 
> A data abort or an intstruction abort are not necessarily synchronus
> external aborts, so this isn't right.
> 
> What exactly are you trying to catch here? If you are seeing a problem
> in practice, can you please share your log from a crash?

Yes, I meet a problem in practice.

Tan Xiaofei report this issue when doing ras error inject testing:
1) panic_on_oops is not set;
2) trigger a sea by inject a ECC error;
3) a cpu core receive a sea in kernel context:
   do_mem_abort
     -> arm64_notify_die
        -> die                    # kernel context, call die() directly;
           -> do_exit             # kernel process context, call do_exit(SIGSEGV);
              -> do_task_dead()   # call do_task_dead(), and hung up this core;

Actually, we should not call do_exit() for kernel task, or before return from
external abort.

So, I send this patch intend to panic directly when receive a sea in kernel context.

crash log:

NOTICE:  [TotemRasIntCpuNodeEri]:[1879L]
NOTICE:  [RasEriInterrupt]:[173L]NodeTYP2Status = 0x0
[  387.740609] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 9
[  387.748837] {1}[Hardware Error]: event severity: recoverable
[  387.754470] {1}[Hardware Error]:  Error 0, type: recoverable
[  387.760103] {1}[Hardware Error]:   section_type: ARM processor error
[  387.766425] {1}[Hardware Error]:   MIDR: 0x00000000481fd010
[  387.771972] {1}[Hardware Error]:   Multiprocessor Affinity Register (MPIDR): 0x0000000081080000
[  387.780628] {1}[Hardware Error]:   error affinity level: 0
[  387.786088] {1}[Hardware Error]:   running state: 0x1
[  387.791115] {1}[Hardware Error]:   Power State Coordination Interface state: 0
[  387.798301] {1}[Hardware Error]:   Error info structure 0:
[  387.803761] {1}[Hardware Error]:   num errors: 1
[  387.808356] {1}[Hardware Error]:    error_type: 0, cache error
[  387.814160] {1}[Hardware Error]:    error_info: 0x0000000024400017
[  387.820311] {1}[Hardware Error]:     transaction type: Instruction
[  387.826461] {1}[Hardware Error]:     operation type: Generic error (type cannot be determined)
[  387.835031] {1}[Hardware Error]:     cache level: 1
[  387.839878] {1}[Hardware Error]:     the error has been corrected
[  387.845942] {1}[Hardware Error]:    physical fault address: 0x00000027caf50770
[  387.853162] Internal error: synchronous external abort: 96000610 [#1] PREEMPT SMP
[  387.860611] Modules linked in: l1l2_inject(O) vfio_iommu_type1 vfio_pci vfio_virqfd vfio ib_ipoib ib_umad rpcrdma ib_iser libiscsi scsi_transport_iscsi hns_roce_hw_v2 crct10dif_ce ses hns3 hclge hnae3 hisi_trng_v2 rng_core hisi_zip hisi_sec2 hisi_hpre hisi_qm uacce hisi_sas_v3_hw hisi_sas_main libsas scsi_transport_sas
[  387.888725] CPU: 0 PID: 940 Comm: kworker/0:3 Kdump: loaded Tainted: G           O      5.5.0-rc4-g5993cbe #1
[  387.898592] Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 CS V3.B210.01 03/12/2020
[
 Message from  387.907429] Workqueue: events error_inject [l1l2_inject]
[  387.914098] pstate: 80c00009 (Nzcv daif +PAN +UAO)
 s[yslogd@localho  387.918867] pc : lsu_inj_ue+0x58/0x70 [l1l2_inject]
[  387.925103] lr : error_inject+0x64/0xb0 [l1l2_inject]
[  387.930132] sp : ffff800020af3d90
[  387.933435] x29: ffff800020af3d90 x28: 0000000000000000
st[ at Mar 30 11:  387.938723] x27: ffff0027dae6e838 x26: ffff80001178bcc8
[  387.945391] x25: 0000000000000000 x24: ffffa2c77162e090
[  387.950680] x23: 0000000000000000 x22: ffff2027d7c33d40
33[:55 ...
 ker  387.955968] x21: ffff2027d7c37a00 x20: ffff0027d679c000
[  387.962636] x19: ffffa2c77162e088 x18: 0000000020cbf59a
ne[l:Internal err  387.967924] x17: 000000000000000e x16: ffffa2c7b812bc98
[  387.974592] x15: 0000000000000001 x14: 0000000000000000
[  387.979880] x13: ffff2027cf75a780 x12: ffffa2c7b8299c18
or[: synchronous   387.985168] x11: 0000000000000000 x10: 00000000000009f0
[  387.991836] x9 : ffff800020af3d50 x8 : fefefefefefefeff
ex[ternal abort:   387.997124] x7 : 0000000000000000 x6 : 001d4ed88e000000
[  388.003792] x5 : 000073746e657665 x4 : 0000080110f81381
[  388.009080] x3 : 000000000000002f x2 : 807fffffffffffff
96[000610 [#1] PR  388.014369] x1 : ffffa2c77162c518 x0 : 0000000000000081
[  388.021037] Call trace:
EEMPT SMP
[  388.023475]  lsu_inj_ue+0x58/0x70 [l1l2_inject]
[  388.029019]  error_inject+0x64/0xb0 [l1l2_inject]
[  388.033707]  process_one_work+0x158/0x4b8
[  388.037699]  worker_thread+0x50/0x498
[  388.041348]  kthread+0xfc/0x128
[  388.044480]  ret_from_fork+0x10/0x1c
[  388.048042] Code: b2790000 d519f780 f9800020 d5033f9f (58001001)
[  388.054109] ---[ end trace 39d51c21b0e42ba6 ]---

core 0 hung up at here.


> 
> Thanks,
> Mark.
> 
>> +
>>  const char *esr_get_class_string(u32 esr);
>>  #endif /* __ASSEMBLY */
>>  
>> diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
>> index cf402be5c573..08f7f7688d5b 100644
>> --- a/arch/arm64/kernel/traps.c
>> +++ b/arch/arm64/kernel/traps.c
>> @@ -202,6 +202,8 @@ void die(const char *str, struct pt_regs *regs, int err)
>>  		panic("Fatal exception in interrupt");
>>  	if (panic_on_oops)
>>  		panic("Fatal exception");
>> +	if (esr_is_ext_abort(err))
>> +		panic("Synchronous external abort in kernel context");
>>  
>>  	raw_spin_unlock_irqrestore(&die_lock, flags);
>>  
>> -- 
>> 2.20.1
>>
> .
> 


WARNING: multiple messages have this Message-ID (diff)
From: Xie XiuQi <xiexiuqi@huawei.com>
To: Mark Rutland <mark.rutland@arm.com>
Cc: catalin.marinas@arm.com, linux-kernel@vger.kernel.org,
	tanxiaofei@huawei.com, james.morse@arm.com, tglx@linutronix.de,
	will@kernel.org, wangxiongfeng2@huawei.com,
	linux-arm-kernel@lists.infradead.org
Subject: Re: [PATCH] arm64: panic on synchronous external abort in kernel context
Date: Tue, 14 Apr 2020 20:19:41 +0800	[thread overview]
Message-ID: <9982d344-328e-320d-020f-218ab74ae2b1@huawei.com> (raw)
In-Reply-To: <20200414105923.GA2486@C02TD0UTHF1T.local>

Hi Mark,

On 2020/4/14 18:59, Mark Rutland wrote:
> On Fri, Apr 10, 2020 at 09:52:45AM +0800, Xie XiuQi wrote:
>> We should panic even panic_on_oops is not set, when we can't recover
>> from synchronous external abort in kernel context.
>>
>> Othervise, there are two issues:
>> 1) fallback to do_exit() in exception context, cause this core hung up.
>>    do_sea()
>>    -> arm64_notify_die
>>       -> die
>>          -> do_exit
>> 2) errors may propagated.
>>
>> Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
>> Cc: Xiaofei Tan <tanxiaofei@huawei.com>
>> ---
>>  arch/arm64/include/asm/esr.h | 12 ++++++++++++
>>  arch/arm64/kernel/traps.c    |  2 ++
>>  2 files changed, 14 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
>> index cb29253ae86b..acfc71c6d148 100644
>> --- a/arch/arm64/include/asm/esr.h
>> +++ b/arch/arm64/include/asm/esr.h
>> @@ -326,6 +326,18 @@ static inline bool esr_is_data_abort(u32 esr)
>>  	return ec == ESR_ELx_EC_DABT_LOW || ec == ESR_ELx_EC_DABT_CUR;
>>  }
>>  
>> +static inline bool esr_is_inst_abort(u32 esr)
>> +{
>> +	const u32 ec = ESR_ELx_EC(esr);
>> +
>> +	return ec == ESR_ELx_EC_IABT_LOW || ec == ESR_ELx_EC_IABT_CUR;
>> +}
>> +
>> +static inline bool esr_is_ext_abort(u32 esr)
>> +{
>> +	return esr_is_data_abort(esr) || esr_is_inst_abort(esr);
>> +}
> 
> A data abort or an intstruction abort are not necessarily synchronus
> external aborts, so this isn't right.
> 
> What exactly are you trying to catch here? If you are seeing a problem
> in practice, can you please share your log from a crash?

Yes, I meet a problem in practice.

Tan Xiaofei report this issue when doing ras error inject testing:
1) panic_on_oops is not set;
2) trigger a sea by inject a ECC error;
3) a cpu core receive a sea in kernel context:
   do_mem_abort
     -> arm64_notify_die
        -> die                    # kernel context, call die() directly;
           -> do_exit             # kernel process context, call do_exit(SIGSEGV);
              -> do_task_dead()   # call do_task_dead(), and hung up this core;

Actually, we should not call do_exit() for kernel task, or before return from
external abort.

So, I send this patch intend to panic directly when receive a sea in kernel context.

crash log:

NOTICE:  [TotemRasIntCpuNodeEri]:[1879L]
NOTICE:  [RasEriInterrupt]:[173L]NodeTYP2Status = 0x0
[  387.740609] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 9
[  387.748837] {1}[Hardware Error]: event severity: recoverable
[  387.754470] {1}[Hardware Error]:  Error 0, type: recoverable
[  387.760103] {1}[Hardware Error]:   section_type: ARM processor error
[  387.766425] {1}[Hardware Error]:   MIDR: 0x00000000481fd010
[  387.771972] {1}[Hardware Error]:   Multiprocessor Affinity Register (MPIDR): 0x0000000081080000
[  387.780628] {1}[Hardware Error]:   error affinity level: 0
[  387.786088] {1}[Hardware Error]:   running state: 0x1
[  387.791115] {1}[Hardware Error]:   Power State Coordination Interface state: 0
[  387.798301] {1}[Hardware Error]:   Error info structure 0:
[  387.803761] {1}[Hardware Error]:   num errors: 1
[  387.808356] {1}[Hardware Error]:    error_type: 0, cache error
[  387.814160] {1}[Hardware Error]:    error_info: 0x0000000024400017
[  387.820311] {1}[Hardware Error]:     transaction type: Instruction
[  387.826461] {1}[Hardware Error]:     operation type: Generic error (type cannot be determined)
[  387.835031] {1}[Hardware Error]:     cache level: 1
[  387.839878] {1}[Hardware Error]:     the error has been corrected
[  387.845942] {1}[Hardware Error]:    physical fault address: 0x00000027caf50770
[  387.853162] Internal error: synchronous external abort: 96000610 [#1] PREEMPT SMP
[  387.860611] Modules linked in: l1l2_inject(O) vfio_iommu_type1 vfio_pci vfio_virqfd vfio ib_ipoib ib_umad rpcrdma ib_iser libiscsi scsi_transport_iscsi hns_roce_hw_v2 crct10dif_ce ses hns3 hclge hnae3 hisi_trng_v2 rng_core hisi_zip hisi_sec2 hisi_hpre hisi_qm uacce hisi_sas_v3_hw hisi_sas_main libsas scsi_transport_sas
[  387.888725] CPU: 0 PID: 940 Comm: kworker/0:3 Kdump: loaded Tainted: G           O      5.5.0-rc4-g5993cbe #1
[  387.898592] Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 CS V3.B210.01 03/12/2020
[
 Message from  387.907429] Workqueue: events error_inject [l1l2_inject]
[  387.914098] pstate: 80c00009 (Nzcv daif +PAN +UAO)
 s[yslogd@localho  387.918867] pc : lsu_inj_ue+0x58/0x70 [l1l2_inject]
[  387.925103] lr : error_inject+0x64/0xb0 [l1l2_inject]
[  387.930132] sp : ffff800020af3d90
[  387.933435] x29: ffff800020af3d90 x28: 0000000000000000
st[ at Mar 30 11:  387.938723] x27: ffff0027dae6e838 x26: ffff80001178bcc8
[  387.945391] x25: 0000000000000000 x24: ffffa2c77162e090
[  387.950680] x23: 0000000000000000 x22: ffff2027d7c33d40
33[:55 ...
 ker  387.955968] x21: ffff2027d7c37a00 x20: ffff0027d679c000
[  387.962636] x19: ffffa2c77162e088 x18: 0000000020cbf59a
ne[l:Internal err  387.967924] x17: 000000000000000e x16: ffffa2c7b812bc98
[  387.974592] x15: 0000000000000001 x14: 0000000000000000
[  387.979880] x13: ffff2027cf75a780 x12: ffffa2c7b8299c18
or[: synchronous   387.985168] x11: 0000000000000000 x10: 00000000000009f0
[  387.991836] x9 : ffff800020af3d50 x8 : fefefefefefefeff
ex[ternal abort:   387.997124] x7 : 0000000000000000 x6 : 001d4ed88e000000
[  388.003792] x5 : 000073746e657665 x4 : 0000080110f81381
[  388.009080] x3 : 000000000000002f x2 : 807fffffffffffff
96[000610 [#1] PR  388.014369] x1 : ffffa2c77162c518 x0 : 0000000000000081
[  388.021037] Call trace:
EEMPT SMP
[  388.023475]  lsu_inj_ue+0x58/0x70 [l1l2_inject]
[  388.029019]  error_inject+0x64/0xb0 [l1l2_inject]
[  388.033707]  process_one_work+0x158/0x4b8
[  388.037699]  worker_thread+0x50/0x498
[  388.041348]  kthread+0xfc/0x128
[  388.044480]  ret_from_fork+0x10/0x1c
[  388.048042] Code: b2790000 d519f780 f9800020 d5033f9f (58001001)
[  388.054109] ---[ end trace 39d51c21b0e42ba6 ]---

core 0 hung up at here.


> 
> Thanks,
> Mark.
> 
>> +
>>  const char *esr_get_class_string(u32 esr);
>>  #endif /* __ASSEMBLY */
>>  
>> diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
>> index cf402be5c573..08f7f7688d5b 100644
>> --- a/arch/arm64/kernel/traps.c
>> +++ b/arch/arm64/kernel/traps.c
>> @@ -202,6 +202,8 @@ void die(const char *str, struct pt_regs *regs, int err)
>>  		panic("Fatal exception in interrupt");
>>  	if (panic_on_oops)
>>  		panic("Fatal exception");
>> +	if (esr_is_ext_abort(err))
>> +		panic("Synchronous external abort in kernel context");
>>  
>>  	raw_spin_unlock_irqrestore(&die_lock, flags);
>>  
>> -- 
>> 2.20.1
>>
> .
> 


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

  parent reply	other threads:[~2020-04-14 12:20 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-04-10  1:52 [PATCH] arm64: panic on synchronous external abort in kernel context Xie XiuQi
2020-04-10  1:52 ` Xie XiuQi
2020-04-14 10:59 ` Mark Rutland
2020-04-14 10:59   ` Mark Rutland
2020-04-14 12:16   ` James Morse
2020-04-14 12:16     ` James Morse
2020-04-14 12:39     ` Xie XiuQi
2020-04-14 12:39       ` Xie XiuQi
2020-04-14 14:53       ` James Morse
2020-04-14 14:53         ` James Morse
2020-04-30  9:44         ` Xie XiuQi
2020-04-30  9:44           ` Xie XiuQi
2020-04-14 12:19   ` Xie XiuQi [this message]
2020-04-14 12:19     ` Xie XiuQi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9982d344-328e-320d-020f-218ab74ae2b1@huawei.com \
    --to=xiexiuqi@huawei.com \
    --cc=catalin.marinas@arm.com \
    --cc=james.morse@arm.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mark.rutland@arm.com \
    --cc=tanxiaofei@huawei.com \
    --cc=tglx@linutronix.de \
    --cc=wangxiongfeng2@huawei.com \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.