linux-kernel.vger.kernel.org archive mirror
* RE: ARC compact700 NPS platform - EZ_MachineCheck exception handler
@ 2018-05-21 14:14 Ofer Levi(SW)
  2018-05-21 16:59 ` Vineet Gupta
  0 siblings, 1 reply; 3+ messages in thread
From: Ofer Levi(SW) @ 2018-05-21 14:14 UTC (permalink / raw)
  To: vgupta; +Cc: linux-kernel, Meir Lichtinger

Resending due to a typo in the LKML mail address.


 Hi Vineet,
 
 The EV_MachineCheck exception handler halts the core for exceptions
 other than tlb_overlap_fault.
 Since on the NPS platform each core runs a single thread in ZOL (Zero
 Overhead Linux) isolation mode, we feel that most of the time it is safe to
 resume execution instead of halting the core.
 I would appreciate it if you could review the change below and let me know
 whether it is valid or whether we missed or overlooked something.
 We are not looking to push this change upstream, but it will be used on some
 systems.
 
 Please see below our implementation after label 1.
 
 Thanks
 Ofer
 
 ENTRY(EV_MachineCheck)
 
 	EXCEPTION_PROLOGUE
 
 #ifdef CONFIG_CONTEXT_TRACKING
 	bl context_tracking_user_exit
 #endif
 
 	lr  r2, [ecr]
 	lr  r0, [efa]
 	mov r1, sp
 
 	; hardware auto-disables MMU, re-enable it to allow kernel vaddr
 	; access for say stack unwinding of modules for crash dumps
 	lr r3, [ARC_REG_PID]
 	or r3, r3, MMU_ENABLE
 	sr r3, [ARC_REG_PID]
 
 	lsr  	r3, r2, 8
 	bmsk 	r3, r3, 7
 	brne    r3, ECR_C_MCHK_DUP_TLB, 1f
 
 	bl      do_tlb_overlap_fault
 	b       ret_from_exception
 
 1:
 	FAKE_RET_FROM_EXCPN
 	bl		do_machine_check  ; using DO_ERROR_INFO macro
 	b       ret_from_exception
 
 END(EV_MachineCheck)
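
 For readers less familiar with ARC assembly, the lsr/bmsk pair above pulls
 the cause-code field out of ECR before it is compared with
 ECR_C_MCHK_DUP_TLB. A rough C equivalent is sketched below. The field layout
 (vector in bits 23:16, cause code in bits 15:8, parameter in bits 7:0) and
 the two constant values are assumptions based on the ARCompact ECR format
 and arch/arc/include/asm/arcregs.h; the helper names are hypothetical and do
 not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match arch/arc/include/asm/arcregs.h; verify against your tree. */
#define ECR_V_MACH_CHK      0x20u  /* machine check vector (the "0x20 exception") */
#define ECR_C_MCHK_DUP_TLB  0x01u  /* cause code: duplicate (overlapping) TLB entry */

/* Hypothetical helpers: split a raw ECR value into its three 8-bit fields. */
static inline uint32_t ecr_vector(uint32_t ecr) { return (ecr >> 16) & 0xffu; }
static inline uint32_t ecr_cause(uint32_t ecr)  { return (ecr >> 8) & 0xffu; }  /* lsr 8; bmsk 7 */
static inline uint32_t ecr_param(uint32_t ecr)  { return ecr & 0xffu; }

/*
 * Mirrors "brne r3, ECR_C_MCHK_DUP_TLB, 1f": returns true when the handler
 * would take the do_tlb_overlap_fault path, false when it would go to label 1.
 */
static inline bool mchk_is_dup_tlb(uint32_t ecr)
{
    /* Only meaningful for ECR values captured in the machine check vector. */
    assert(ecr_vector(ecr) == ECR_V_MACH_CHK);
    return ecr_cause(ecr) == ECR_C_MCHK_DUP_TLB;
}
```

 For example, under these assumptions an ECR value of 0x00200100 decodes to
 vector 0x20 with cause code 0x01 and takes the TLB-overlap path, while
 0x00200200 falls through to label 1.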


* Re: ARC compact700 NPS platform - EZ_MachineCheck exception handler
  2018-05-21 14:14 ARC compact700 NPS platform - EZ_MachineCheck exception handler Ofer Levi(SW)
@ 2018-05-21 16:59 ` Vineet Gupta
  2018-05-22 14:03   ` Ofer Levi(SW)
  0 siblings, 1 reply; 3+ messages in thread
From: Vineet Gupta @ 2018-05-21 16:59 UTC (permalink / raw)
  To: Ofer Levi(SW); +Cc: linux-kernel, Meir Lichtinger, arcml

On 05/21/2018 07:14 AM, Ofer Levi(SW) wrote:
> Resending, due to typo in LKML mail  address.

Also please CC linux-snps-arc@lists.infradead.org for any ARC Linux related posts.

>   
>   The EV_MachineCheck exception handler is halting the core for exceptions
>   which are not tlb_overlap_fault.
>   Since for the NPS platform each core is running a single thread in ZOL (Zero
>   Overhead Linux) isolation mode, we feel that most of the time it is safe to
>   resume execution instead of halting the core.

Most of the time is not good enough when dealing with OS code :-(
A Machine Check exception implies something went terribly wrong. Some of those
cases can be handled gracefully (such as a duplicate TLB entry), but others can't, so
continuing despite it is a recipe for disaster. Perhaps your chip has some spurious
Machine Check exceptions?

>   I would appreciate it if you could review the change  below

Next time please send a real patch so I know right away what was changed.

> and let me know
>   what you think, if this change is valid or if we missed or overlooked
>   something.
>   We are not looking to push this change upstream, but will be used on some
>   systems.

Hmm, but you have to explain why those machine checks are fine !

>   
>   Please see below our implementation after label 1.
>   
>   Thanks
>   Ofer
>   
>   ENTRY(EV_MachineCheck)
>   
>   	EXCEPTION_PROLOGUE
>   
> ...
>   	brne    r3, ECR_C_MCHK_DUP_TLB, 1f
>   
>   	bl      do_tlb_overlap_fault
>   	b       ret_from_exception
>   
>   1:
>   	FAKE_RET_FROM_EXCPN

You don't need this.

>   	bl		do_machine_check  ; using DO_ERROR_INFO macro

We don't have the above function in the code. There's do_machine_check_fault(), which
calls die() -> flag 1, so it would halt the kernel and would never return here.
So your patch is broken in implementation as well.

>   	b       ret_from_exception
>   
>   END(EV_MachineCheck)
>
>


* RE: ARC compact700 NPS platform - EZ_MachineCheck exception handler
  2018-05-21 16:59 ` Vineet Gupta
@ 2018-05-22 14:03   ` Ofer Levi(SW)
  0 siblings, 0 replies; 3+ messages in thread
From: Ofer Levi(SW) @ 2018-05-22 14:03 UTC (permalink / raw)
  To: Vineet Gupta; +Cc: linux-kernel, Meir Lichtinger, arcml


There are two cases to consider for this exception:

> but others can't so continuing despite it is recipe for disaster. Perhaps your chip
> has some spurious Machine check exceptions ?

1. Except for core 0, which runs the Linux OS, all other cores run
packet processing code in ZOL isolation mode. If any of these cores hits the compact 700
0x20 exception, it is logical to assume all the other cores will hit it too.
It seems that eventually, in any case, we will have to reset the HW and reboot the system.
It might be beneficial for the user to try to collect more info for debugging the issue, even if it’s
a disaster for the system.

> Hmm, but you have to explain why those machine checks are fine !

2. The ARC compact700 instruction set was extended to support fast DMA
operations to various added HW accelerators, plus new asm ops to support network
packet processing.
In case of an error, some of these instructions are wired to the 0x20 exception.
There is a HW mechanism to partition the DDR between the Linux OS and the various accelerators.
This mechanism is unaware of the MMU or virtual memory handling.
In cases where an accelerator accesses memory outside its bounds, this exception is hit,
but there is no risk to system stability. A user signal handler can catch it, allowing easier
debugging.
This is one example.
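
As a sketch of the user-space side, a process could catch the resulting signal
roughly as below. This assumes the kernel delivers SIGBUS with si_code
BUS_MCEERR_AR, as in the DO_ERROR_INFO line of the patch further down. The
handler and helper names are hypothetical, and it is an assumption (not
verified on NPS hardware) that si_addr carries the faulting address.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <signal.h>
#include <string.h>

#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4  /* Linux value; fallback for headers that hide it */
#endif

/* Written only by the handler, read by normal code afterwards. */
static volatile sig_atomic_t sigbus_seen;
static volatile sig_atomic_t sigbus_code;
static void *volatile sigbus_addr;

/* Hypothetical handler: records what the kernel reported and returns. */
static void on_sigbus(int signo, siginfo_t *info, void *ucontext)
{
    (void)signo;
    (void)ucontext;
    sigbus_code = info->si_code;
    sigbus_addr = info->si_addr;
    sigbus_seen = 1;
    /*
     * After a real synchronous fault, returning may simply resume at the
     * faulting instruction, so a production handler would more likely log
     * the details and exit, or siglongjmp() to a recovery point.
     */
}

/* Installs the handler with SA_SIGINFO so si_code and si_addr are available. */
static int install_sigbus_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_sigbus;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}

/* True if the last SIGBUS caught was tagged as a machine check. */
static int sigbus_was_machine_check(void)
{
    return sigbus_seen && sigbus_code == BUS_MCEERR_AR;
}

/* Address reported with the last SIGBUS; only valid once one was caught. */
static void *sigbus_last_addr(void)
{
    assert(sigbus_seen);
    return sigbus_addr;
}
```

A packet-processing thread would call install_sigbus_handler() once at start-up
and could then use sigbus_was_machine_check() and sigbus_last_addr() when
producing a debug report.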



> >   1:
> >   	FAKE_RET_FROM_EXCPN
> 
> You don't need this.

When FAKE_RET_FROM_EXCPN is removed, the first EV_MachineCheck exception
causes the core running that thread to stall.
If it is not removed, multiple exceptions are handled and the system seems healthy.

Please note that the exception is generated by accessing an address of one of the NPS
accelerators that is outside its memory space, so no harm to the system is expected.


> Next time please send a real patch so I know right away what was changed.

My apologies, here is the patch based on linux-4.16.10:

diff -uprN linux-4.16.10/arch/arc/kernel/entry.S linux/arch/arc/kernel/entry.S
--- linux-4.16.10/arch/arc/kernel/entry.S	2018-05-19 11:19:37.000000000 +0300
+++ linux/arch/arc/kernel/entry.S	2018-05-22 14:12:18.065103918 +0300
@@ -106,13 +106,9 @@ ENTRY(EV_MachineCheck)
 	b       ret_from_exception
 
 1:
-	; DEAD END: can't do much, display Regs and HALT
-	SAVE_CALLEE_SAVED_USER
-
-	GET_CURR_TASK_FIELD_PTR   TASK_THREAD, r10
-	st  sp, [r10, THREAD_CALLEE_REG]
-
-	j  do_machine_check_fault
+	FAKE_RET_FROM_EXCPN
+	bl		do_machine_check
+	b       ret_from_exception
 
 END(EV_MachineCheck)
 
diff -uprN linux-4.16.10/arch/arc/kernel/traps.c linux/arch/arc/kernel/traps.c
--- linux-4.16.10/arch/arc/kernel/traps.c	2018-05-19 11:19:37.000000000 +0300
+++ linux/arch/arc/kernel/traps.c	2018-05-22 14:13:25.162748373 +0300
@@ -86,6 +86,7 @@ DO_ERROR_INFO(SIGBUS, "Invalid Mem Acces
 DO_ERROR_INFO(SIGTRAP, "Breakpoint Set", trap_is_brkpt, TRAP_BRKPT)
 DO_ERROR_INFO(SIGBUS, "Misaligned Access", do_misaligned_error, BUS_ADRALN)
 DO_ERROR_INFO(SIGSEGV, "gcc generated __builtin_trap", do_trap5_error, 0)
+DO_ERROR_INFO(SIGBUS, "Machine Check", do_machine_check, BUS_MCEERR_AR )
 
 /*
  * Entry Point for Misaligned Data access Exception, for emulating in software
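
To make the effect of the new DO_ERROR_INFO line easier to reason about, here
is a small user-space model of the dispatch it is understood to generate. It
assumes that, in 4.16, DO_ERROR_INFO expands to a function taking (address,
regs) that fills a siginfo and calls unhandled_exception(), which signals the
task for user-mode faults, tries fixup_exception() for kernel-mode faults, and
otherwise calls die(). All types and names below are stand-ins invented for
illustration, not kernel code.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>

#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4  /* Linux value; fallback for headers that hide it */
#endif

/* What the modelled kernel path would end up doing. */
enum mchk_outcome {
    MCHK_SIGNAL_SENT,  /* user mode: force_sig_info() to the current task */
    MCHK_FIXED_UP,     /* kernel mode: fixup_exception() found a handler */
    MCHK_DIE           /* kernel mode, no fixup: die(), which halts the core */
};

/* Stand-in for struct pt_regs: only the two facts the dispatch depends on. */
struct fake_regs {
    bool user_mode;
    bool has_fixup;
};

/* Stand-in for siginfo_t. */
struct fake_info {
    int signo;
    int code;
    unsigned long addr;
};

/* Model of unhandled_exception() under the assumptions stated above. */
static enum mchk_outcome model_unhandled_exception(const struct fake_regs *regs,
                                                   const struct fake_info *info)
{
    assert(regs != NULL && info != NULL);
    if (regs->user_mode)
        return MCHK_SIGNAL_SENT;
    if (regs->has_fixup)
        return MCHK_FIXED_UP;
    return MCHK_DIE;
}

/* Model of the generated do_machine_check(): r0 = efa, r1 = pt_regs. */
static enum mchk_outcome model_do_machine_check(unsigned long efa,
                                                const struct fake_regs *regs)
{
    struct fake_info info = { SIGBUS, BUS_MCEERR_AR, efa };

    return model_unhandled_exception(regs, &info);
}
```

If this model is accurate, the patched path only avoids the halt for machine
checks raised in user mode; a machine check taken in kernel mode without an
exception-table fixup would still reach die().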






