Re: [Qemu-devel] [PATCH v6 for 2.1 00/10] Modify block jobs to use node-names

From: Eric Blake <eblake@redhat.com>
To: Jeff Cody <jcody@redhat.com>, Stefan Hajnoczi <stefanha@gmail.com>
Cc: kwolf@redhat.com, benoit.canet@irqsave.net, pkrempa@redhat.com,
	famz@redhat.com, qemu-devel@nongnu.org, stefanha@redhat.com
Subject: Re: [Qemu-devel] [PATCH v6 for 2.1 00/10] Modify block jobs to use node-names
Date: Thu, 19 Jun 2014 10:49:52 -0600	[thread overview]
Message-ID: <53A314B0.9030508@redhat.com> (raw)
In-Reply-To: <20140619162600.GB6096@localhost.localdomain>

[-- Attachment #1: Type: text/plain, Size: 9294 bytes --]

On 06/19/2014 10:26 AM, Jeff Cody wrote:

> Having said that, to be fair, the new QAPI command change-backing-file
> does propagate this top-layer in-use flag semantic, but I would prefer
> that patch to be dropped rather than not committing this series.

Libvirt would prefer to have change-backing-file in the same release
that adds the optional backing-file parameters, as a witness that
backing-file is under libvirt's control (even if libvirt does not CALL
change-backing-file, the command needs to exist).  One possibility for
2.1: create the command (so it shows up in 'query-commands' and
satisfies libvirt probing) but make it fail unless the node being
changed is the active node (that is, NO way to rewrite the metadata of a
backing file).  Then, in 2.2, when we figure out op-blockers for child
nodes, we can enable the full power of change-backing-file to work on
any node, not just the active device node.

>> My questions are:
>> a. How do we fix resize, snapshot-sync, etc?  It seems like we need to
>>    propagate child op blockers.
>>
>> b. Is it a good idea to perform op blocker checks on the root node?
>>    It's inconsistent with resize, snapshot-sync, etc.  Permissions in
>>    BDS graphs with multiple root nodes (e.g. guest device and NBD
>>    run-time server) will be different depending on which root you
>>    specify.
> 
> I don't think (b) is the ultimate solution.  It is used as a stop-gap
> because op blockers in the current implementation is essentially
> analogous to the in-use flag.  But is it good enough for 2.1?  If
> *everything* checks the topmost node in 2.1, then I think we are OK in
> all cases except where images files share a common BDS.
> 
> The ability for internal BDSs to share a common base BDS makes some
> block jobs unsafe currently, I believe.  A crude and ugly fix is to
> only allow a single block-job to occur at any given time, but that
> doesn't seem feasible, so let's ignore that.

Can we even create a common base BDS in qemu 2.1?  I know that the
blockdev-add command was designed with this possibility in mind, but to
my knowledge, we have not actually wired it up yet.  If 2.1 still
creates a distinct BDS for every element of every chain, even if some of
those distinct BDS are visiting the same file, then blocking jobs based
on device is sufficient.  Then, in 2.2 when we finally figure out how to
do op-blockers on child nodes, we can also figure out sharing nodes.

Of course, until we have shared nodes, management apps must realize that
requesting an operation on a node that happens to have a duplicate BDS
visiting the same file from another device will NOT be flagged by qemu
as dangerous.  That's just a quality-of-implementation issue, though -
just because qemu can't flag ALL stupid actions of management doesn't
mean qemu is buggy.  Or put another way, if I have:

disk1: base <- img1
disk2: base <- img2

if qemu allows shared BDS, and I am able to open 'base' as a shared BDS
for both disks, then an attempt to block-commit img1 into base will be
blocked because base is still in use by img2.  But as a management app,
I shouldn't be attempting that in the first place.  Yes, it would be
nice if qemu blocks me from being stupid; but even if qemu allows shared
BDS, if I don't open a shared BDS on 'base' in the first place, it is no
different than if qemu doesn't allow shared BDS.  Either way, it's not
qemu's fault if I, as management, try to commit into a file that is in
use by more than one chain.

if I am not able to do a shared BDS, I'm still
and as the management, I did not request that qemu open 'base' as a
shared BDS between both

> 
> Perhaps, for 2.1, provide an overlay pointer list inside each BDS
> (some of my earlier patches in this series had a single overlay, but
> that is not enough).  We could then apply op blockers to the topmost
> nodes for any affected BDS image in a chain, by navigating upwards.
> Not sure how complex this would be in practice, though.  
> 
> We could also apply child blockers to all nodes in all directions in a
> graph, if we don't want to rely on the topmost image as a blocker
> proxy for the whole drive.
> 
>>
>> c. We're painting ourselves into a corner by using the root node for op
>>    blocker checks.  We'll have to apply the same op blockers to all
>>    nodes in a graph.  There's no opportunity to apply different op
>>    blockers to a subset of the child nodes.  I *think* this can be
>>    changed later without affecting the QMP API, so it's not a critical
>>    issue.
> 
> We've already painted ourselves in that corner, alas.
> 
> I agree that from a QAPI perspective, the change is not critical:
> once op blockers are correctly applied to all child nodes, any API
> change (e.g. commit or stream) would likely be optional only (such as,
> making 'device' optional instead of mandatory), and thus discoverable.

Well, discoverable if we ever get QAPI introspection added, or if we can
rely on the same hacks as we are already planning to use with
optional-'top' of getting a distinct GenericError vs. DeviceNotFound
class when abusing the QMP call with a bogus device name during probing
time.

> 
>>
>> The answer seems to be that op blockers should be propagated to all
>> child nodes and commands should test the node, not the drive/root node.
>> That gives us the flexibility for per-node op blockers in the future and
>> maintains compatibility with existing node-name users.
>>
> 
> Long term thoughts:
> 
> So if I think of operations that are done on block devices from a
> block job, and chuck them into categories, I think we have:
> 
> 1) Read of guest-visible data
> 2) Write of guest-visible data
> 3) Read of host-visible data (e.g. image file metadata)
> 4) Write of host-visible data (e.g. image file metadata, such as
> the backing-file)
> 5) Block chain manipulations (e.g. movement of a BDS, change to r/w
> instead of r/o, etc..)
> 6) I/O attribute changes (e.g. throttling, etc..)
> 
> Does this make sense, and did I miss any (likely so)?  It seems like
> we should issue blockers not based on specific commands (e.g.
> BLOCK_OP_TYPE_COMMIT), but rather based on what specific category of
> interaction on a BDS we want to prohibit / allow.

Yes, I like that train of thought - it's not the operation that we are
blocking, but the access category(ies) required by that operation.

> 
> I don't think specific command blockers provide enough granularity,
> and doesn't necessarily scale well as new commands are added. It
> forces a new block job author to go through the specific
> implementation of the other block job commands, and interpret what
> operations to prohibit based on what other jobs do.  Whereas if each
> command issues blockers based on the operation category, it takes care
> of itself, and I just issue blockers based on my block job behavior.
> 
> Each command would then issue appropriate operational blockers to each
> BDS affected by an operation.  For instance, at first blush, I think
> block-commit would want (at the very least) to block (and check) the
> following, in this example chain:
> 
> 
>      [base] <-- [int1] <--  [int2] <-- [int3] <-- [top] <-- [overlay]
> 
>      becomes:
> 
>      [base] <-- [overlay]
> 
> 
> Blocked operations per image:
> 
> * 'base' image
> 
> Operation       |  Blocked
> -----------------------------
> GUEST READ      |   Y
> GUEST WRITE     |   Y
> HOST READ       |    
> HOST WRITE      |    
> CHAIN           |   Y
> I/O ATTRIBUTE   |

Makes sense:

blocking guest read because we are actively modifying the contents; if
any other operation branches off of base, they will see data that is not
consistent with what the guest sees through overlay.

blocking guest write because it is a backing chain, and overlay still
depends on base image for sectors that have not yet been commited..

blocking chain manipulations because we are going to do a chain
manipulation once our command completes.

Maybe you also want to block HOST_WRITE (our chain manipulation is done
by writing host metadata; so if someone else writes that under our feet,
we may have races on what gets written).

> 
> 
> * Intermediate images between 'base' up to and including 'top':
> 
> Operation       |  Blocked
> -----------------------------
> GUEST READ      |   
> GUEST WRITE     |   Y
> HOST READ       |
> HOST WRITE      |   Y
> CHAIN           |   Y
> I/O ATTRIBUTE   |
> 

This needs to also block GUEST_READ - as soon as the commit starts
modifying base, ALL intermediate images are invalid (they no longer
represent a state of memory as was seen by the guest).  Once you start a
commit, you MUST NOT allow a fleecing operation on any of the
intermediate operations.

> 
> * The overlay of 'top', if it exists:
> 
> Operation       |  Blocked
> -----------------------------
> GUEST READ      |   
> GUEST WRITE     |    
> HOST READ       |
> HOST WRITE      |   Y
> CHAIN           |   Y
> I/O ATTRIBUTE   |
> 
> 
> 

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]