Re: [PATCH] xen: arm: Don't use stop_cpu() in halt_this_cpu()

From: Stefano Stabellini <sstabellini@kernel.org>
To: Bertrand Marquis <Bertrand.Marquis@arm.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>,
	Julien Grall <julien@xen.org>,
	 "dmitry.semenets@gmail.com" <dmitry.semenets@gmail.com>,
	 "xen-devel@lists.xenproject.org"
	<xen-devel@lists.xenproject.org>,
	 Dmytro Semenets <dmytro_semenets@epam.com>,
	 Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
Subject: Re: [PATCH] xen: arm: Don't use stop_cpu() in halt_this_cpu()
Date: Thu, 30 Jun 2022 14:14:14 -0700 (PDT)
Message-ID: <alpine.DEB.2.22.394.2206301404410.4389@ubuntu-linux-20-04-desktop>
In-Reply-To: <14736B47-2F17-4684-9162-17C3E55F8D15@arm.com>

On Thu, 30 Jun 2022, Bertrand Marquis wrote:
> > On 29 Jun 2022, at 18:19, Stefano Stabellini <sstabellini@kernel.org> wrote:
> > On Wed, 29 Jun 2022, Julien Grall wrote:
> >> On 28/06/2022 23:56, Stefano Stabellini wrote:
> >>>> The advantage of the panic() is it will remind us that something needs
> >>>> to be fixed.
> >>>> With a warning (or WARN()) people will tend to ignore it.
> >>> 
> >>> I know that this specific code path (cpu off) is probably not super
> >>> relevant for what I am about to say, but as we move closer to safety
> >>> certifiability we need to get away from using "panic" and BUG_ON as a
> >>> reminder that more work is needed to have a fully correct implementation
> >>> of something.
> >> 
> >> I don't think we have many places at runtime using BUG_ON()/panic(). They are
> >> often used because we think Xen would not be able to recover if the condition
> >> is hit.
> >> 
> >> I am happy to remove them, but this should not be at the expense of
> >> introducing other potential weird bugs.
> >> 
> >>> 
> >>> I also see your point and agree that ASSERT is not acceptable for
> >>> external input but from my point of view panic is the same (slightly
> >>> worse because it doesn't go away in production builds).
> >> 
> >> I think it depends on your target. Would you be happy if Xen continued to
> >> run with a potentially fatal flaw?
> > 
> > Actually, this is an excellent question. I don't know what the
> > expected behavior is from a safety perspective in case of serious
> > errors: how the error should be reported, and whether continuing or
> > not is recommended. I'll try to find out more information.
> 
> I think there are 2 answers to this:
> - as much as possible: those cases must be avoided, and it must be demonstrated that they are impossible and hence removed, or the system must be turned to a failsafe mode so that actions can be handled (usually a reboot after saving some data)
> - in some cases this can be robustness code (more for security)
> 
> I think that in our case, if we know that we are ending up in a situation where the system is unstable, we should:
> - stop the guest responsible for this (if a guest is the origin) or return an error to the guest and cancel the operation if suitable
> - panic if this is internal or dom0
> 
> A warning saying that something unsupported was done, and then ending up with unexpected behaviour, is for sure not acceptable.

Let's say that we demonstrate that a problematic case is impossible, can
we still have a panic in the code? For instance:

ret = firmware_call();
if (ret)
    panic();

We know ret is always zero unless firmware is buggy or not
spec-compliant. Can the panic() still be present?

And/or do we need to replace all instances of "panic" with going into
"failsafe mode", which saves state and reboots so it is not so
dissimilar from panic actually?

In case of guest-initiated unexpected errors we already try to crash the
guest responsible rather than the entire system, because it is also a
matter of security (possible DoS). That is clear.

So it is the other kinds of unexpected errors, mostly due to unexpected
hardware or firmware behavior, or to Xen finding itself in a state of a
state machine that should be impossible. Those are the ones for which we
don't have a clear way to proceed.
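
To make the distinction concrete, here is a rough sketch of the triage
discussed in this thread. None of this is existing Xen code: the enum
names and triage_error() are hypothetical, made up purely for
illustration. Guest-originated errors follow what we already do, dom0
follows Bertrand's proposal above, and the last two origins are returned
as "undecided" because that is exactly the open question.

```c
#include <stdbool.h>

/* Hypothetical: where an unexpected error came from. */
enum error_origin {
    ORIGIN_GUEST,            /* triggered by an unprivileged guest */
    ORIGIN_DOM0,             /* triggered by dom0 */
    ORIGIN_PLATFORM,         /* hardware or firmware misbehaving */
    ORIGIN_IMPOSSIBLE_STATE, /* state machine in a state that should not occur */
};

/* Hypothetical: what to do about it. */
enum error_action {
    ACTION_RETURN_ERROR,     /* cancel the operation, report error to the guest */
    ACTION_CRASH_GUEST,      /* stop only the guest responsible */
    ACTION_PANIC,            /* take the whole system down */
    ACTION_UNDECIDED,        /* panic? failsafe mode? not settled in this thread */
};

/*
 * Map an error origin to an action.
 * 'cancellable' says whether the operation can be safely undone, so that
 * returning an error to the guest is enough.
 */
enum error_action triage_error(enum error_origin origin, bool cancellable)
{
    switch (origin) {
    case ORIGIN_GUEST:
        /* Never let a guest bring down the whole system (DoS). */
        return cancellable ? ACTION_RETURN_ERROR : ACTION_CRASH_GUEST;
    case ORIGIN_DOM0:
        /* Per the proposal quoted above. */
        return ACTION_PANIC;
    case ORIGIN_PLATFORM:
    case ORIGIN_IMPOSSIBLE_STATE:
    default:
        /* The cases we have no clear answer for yet. */
        return ACTION_UNDECIDED;
    }
}
```

With this, the firmware_call() example above would fall under
ORIGIN_PLATFORM, i.e. the branch where we still need to decide between
panic and some form of failsafe mode.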