Re: Hypercall fault injection (Was [PATCH 0/3] xen/domain: More structured teardown)

From: Jan Beulich <jbeulich@suse.com>
To: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>, "Wei Liu" <wl@xen.org>,
	"Stefano Stabellini" <sstabellini@kernel.org>,
	"Julien Grall" <julien@xen.org>,
	"Volodymyr Babchuk" <Volodymyr_Babchuk@epam.com>,
	"Juergen Gross" <jgross@suse.com>,
	Xen-devel <xen-devel@lists.xenproject.org>
Subject: Re: Hypercall fault injection (Was [PATCH 0/3] xen/domain: More structured teardown)
Date: Tue, 22 Dec 2020 11:00:44 +0100
Message-ID: <983a3fef-c80f-ec2a-bf3c-5e054fc6a7a9@suse.com>
In-Reply-To: <ac552c84-144c-c213-7985-84d92cbb5601@citrix.com>

On 21.12.2020 20:36, Andrew Cooper wrote:
> Hello,
> 
> We have some very complicated hypercalls (createdomain, with max_vcpus a
> close second) with immense complexity and very hard-to-test error handling.
> 
> It is no surprise that the error handling is riddled with bugs.
> 
> Random failures from core functions are one way, but I'm not sure that
> will be especially helpful.  In particular, we'd need a way to exclude
> "dom0 critical" operations so we've got a usable system to run testing on.
> 
> As an alternative, how about adding a fault_ttl field into the hypercall?
> 
> The exact paths taken in {domain,vcpu}_create() are sensitive to the
> hardware, Xen Kconfig, and other parameters passed into the
> hypercall(s).  The testing logic doesn't really want to care about what
> failed; simply that the error was handled correctly.
> 
> So a test for this might look like:
> 
> cfg = { ... };
> while ( xc_create_domain(xch, cfg) < 0 )
>     cfg.fault_ttl++;
> 
> 
> The pro of this approach is that for a specific build of Xen on a
> piece of hardware, it ought to check every failure path in
> domain_create(), until the ttl finally gets higher than the number of
> fail-able actions required to construct a domain.  Also, the test
> doesn't need changing as the complexity of domain_create() changes.
> 
> The main con will most likely be the invasiveness of code in Xen, but
> I suppose any fault injection is going to be invasive to a certain extent.

While I like the idea in principle, the innocent-looking

cfg = { ... };

is quite a bit of a concern here as well: depending on the precise
settings, the paths taken in the hypervisor may vary heavily, and hence
such a test will only be useful if it covers a wide variety of
settings. Even if the number of tests to execute turns out to still be
manageable today, it may quickly prove insufficiently scalable as we
add new settings controllable right at domain creation (which I
understand is the plan).
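
To illustrate the scaling concern, here is a toy sketch in C. It is not
Xen code: struct sim_cfg, sim_create_domain(), the two boolean settings
and the step counts are all made up for illustration, and I'm assuming
that fault_ttl = N means "fail the Nth fallible step". It only models
how the number of injected-failure runs depends on the settings.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the creation settings; not Xen's real struct. */
struct sim_cfg {
    bool hvm;               /* hypothetical setting */
    bool iommu;             /* hypothetical setting */
    unsigned int fault_ttl; /* assumed meaning: fail the Nth fallible step */
};

/* Returns true when the current fallible step should be made to fail. */
static bool should_fail(unsigned int *ttl)
{
    return *ttl && --*ttl == 0;
}

/* Toy model of creation: the number of fallible steps depends on the settings. */
static int sim_create_domain(const struct sim_cfg *cfg)
{
    unsigned int ttl = cfg->fault_ttl;
    unsigned int steps = 3 + (cfg->hvm ? 2 : 0) + (cfg->iommu ? 1 : 0);

    for ( unsigned int i = 0; i < steps; i++ )
        if ( should_fail(&ttl) )
            return -1; /* real code would have to unwind here */

    return 0;
}

/* The proposed test loop: count the injected failures one config goes through. */
static unsigned int count_failure_paths(struct sim_cfg cfg)
{
    cfg.fault_ttl = 1;
    while ( sim_create_domain(&cfg) < 0 )
        cfg.fault_ttl++;

    return cfg.fault_ttl - 1;
}

/* Sweep every combination of the two boolean settings. */
static unsigned int sweep_all_settings(void)
{
    unsigned int total = 0;

    for ( unsigned int mask = 0; mask < 4; mask++ )
    {
        struct sim_cfg cfg = { .hvm = mask & 1, .iommu = mask & 2 };
        unsigned int n = count_failure_paths(cfg);

        assert(n >= 3); /* every variant runs at least the 3 common steps */
        total += n;
    }

    return total;
}
```

In this toy model each configuration exercises a different number of
failure points (3, 5, 4 and 6), so a single cfg covers only its own
subset. The outer sweep is what grows: with n independent boolean
settings it already needs 2^n iterations of the inner ttl loop.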

Jan