From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:48619)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <kwolf@redhat.com>) id 1a9vsg-0000sO-8L
	for qemu-devel@nongnu.org; Fri, 18 Dec 2015 09:19:39 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <kwolf@redhat.com>) id 1a9vse-0001iF-I3
	for qemu-devel@nongnu.org; Fri, 18 Dec 2015 09:19:38 -0500
Date: Fri, 18 Dec 2015 15:19:25 +0100
From: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20151218141925.GE31910@noname.redhat.com>
References: <20151216062529.GA25889@localhost.localdomain>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20151216062529.GA25889@localhost.localdomain>
Subject: Re: [Qemu-devel] RFC: Operation Blockers in QEMU Block Nodes
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Jeff Cody <jcody@redhat.com>
Cc: famz@redhat.com, qemu-block@nongnu.org, qemu-devel@nongnu.org, armbru@redhat.com, stefanha@redhat.com, jsnow@redhat.com

Am 16.12.2015 um 07:25 hat Jeff Cody geschrieben:
> Background:
> ------------
> Block jobs, and other QAPI operations, may modify and impact the
> BlockDriverState graph in QEMU.  In order to support multiple
> operations safely, we need a mechanism to block and gate operations,
> 
> We currently have op blockers, that are attached to each BDS.
> However, in practice we check this on the device level, rather than on
> the granularity of the individual BDS.  Also, due to limitations of
> the current system, we only allow a single block job at a time.

As I already said on IRC, I think a really important part in order to
make proper op blockers workable is that we actually enforce that you
have to request permission before you can operate on an image. That is,
if you were to write to an image, but you haven't requested write
permission from the op blocker system before, you'd get an assertion
failure.

The reason for this is that it's a long standing issue of the current
system that it's not only possible, but even easy to forget adding the
blocker code or updating it when the surrounding code changes. The
result is a mechanism that is too restrictive for many cases on the one
hand, but fails to actually provide the protection we need on the other
hand (for example, bs->file is still unprotected).

Blockers are subtle and at the same time pervasive enough that I'm
almost sure we wouldn't implement them correctly if we don't force
ourselves to do it right by letting qemu crash whenever we don't.

> Proposed New Design Features:
> ------------------------------
> This design would supersede the current op blocker system.
> 
> Rather than have the op blockers attached directly to each BDS, Block
> Job access will be done through a separate Node Access Control system.
> 
> This new system will:
> 
>     * Allow / Disallow operations to BDSs, (generally initiated by QAPI
>       actions; i.e. BDS node operations other than guest read/writes)
> 
>     * Not be limited to "block jobs" (e.g. block-commit, block-stream,
>       etc..) - will also apply to other operations, such as
>       blockdev-add, change-backing-file, etc.
> 
>     * Allow granularity in options, and provide the safety needed to
>       support multiple block jobs and operations.
> 
>     * Reference each BDS by its node-name
> 
>     * Be independent of the bdrv_states graph (i.e. does not reside in
>       the BDS structure)

I agree that the BDS structure is probably not the right place (except
possibly for caching the permission bitmasks), but I also wouldn't say
that it should be completely independent structures. Otherwise keeping
things in sync when the graph changes might turn out hard.

At the moment I can see two fundamentally different kinds of operations
that are interesting for op blockers:

1. I/O requests

   We need a place to check I/O requests that knows both the user (and
   the permissions it has acquired) and the node that is accessed.

   Any I/O request from outside the block layer (which includes guest
   devices, NBD servers and block jobs) should come through a
   BlockBackend. Even though we don't have it yet, we want to have a
   single BB per user. These requests could be dealt with by calling
   into the NAC whenever a BDS is attach to or detached from a BB;
   assertions can be checked in the appropriate blk_* functions.

   But there are also references between BDSes in the middle of the
   graph, like a qcow2 image referencing (and sending I/O requests to)
   its bs->file and bs->backing. It's important that we represent the
   requirements of such nodes in the NAC. As permissions are always for
   a specific user for a specific node, neither the parent node nor the
   child node are the correct place to manage the permissions. It seems
   that instead BdrvChild might be where they belong. The NAC needs to
   be informed whenever a child node is attached or detached from a BDS.

   As it happens, I've already been considering for a while to convert
   BlockBackend to using BdrvChild as well, so we could possibly unify
   both cases.

   The problem we have with BdrvChild is that I/O requests don't go
   through BdrvChild functions, so there is no place to add the
   assertions we need. We could possibly still add them to blk_*, but
   that would leave internal I/O requests unchecked.

2. Graph reconfiguration

   The problematic part here is that it's generally done by users who
   don't have a direct reference to all the nodes that are affected from
   the changes to the graph.

   For example, the commit job wants to drop all intermediate nodes
   after completing the copy - but even after properly converting it to
   BlockBackend, it would still only have a BB for the top and the base
   node, but not to any of the intermediate nodes that it drops.

   We don't seem to have convenient places like attaching/detaching from
   nodes where we could put the permission requests. Also we don't seem
   to have a central place for assertions; the best we can do is adding
   them to some functions like bdrv_drop_intermediates().

Or maybe the problem isn't really "I/O requests" vs. "reconfiguration",
but "nodes we have a reference for" vs. "nodes without explicit
reference".

> Node Access Control: Jobs
> --------------------------
> Every QAPI/HMP initiated operation is considered a "job", regardless
> if it is implemented via coroutines (e.g. block jobs such as
> block-commit), or handled synchronously in a QAPI handler (e.g.
> change-backing-file, blockdev-add).
> 
> A job consists of:
>     * Unique job ID
>     * A list of all node-names affected by the operation
>     * Action flags (see below) for each node in the above list
> 
> 
> Node Access Control: Action Flags
> -----------------------------------
> Every operation must set action flag pairs, for every node affected by
> the operation.
> 
> Action flags are set as a Require/Allow pair.  If the Require
> flag is set, then the operation requires the ability to take the
> specific action.  If the Allow flag is set, then the operation will
> allow other operations to perform same action in parallel.
> 
> The default is to prohibit, not allow, parallel actions.
> 
> The proposed actions are:
> 
>     1. Modify - Visible Data
>         * This is modification that would be detected by an external
>           read (for instance, a hypothetical qemu-io read from the
>           specific node), inclusive of its backing files.  A classic
>           example would be writing data into another node, as part of
>           block-commit.
> 
>     2. Modify - Metadata
>         * This is a write that is not necessarily visible to an external
>           read, but still modifies the underlying image format behind the
>           node.  I.e., change-backing-file to an identical backing file.
>           (n.b.: if changing the backing file to a non-identical
>           backing file, the "Write - Visible Data" action would also
>           be required).

Consider the case of block-streaming from B to C:

    A <- B <- C <--- virtio-blk

The block job writes data into C, but doesn't change the visible data.
Would this be covered by this action? If so, I think "Metadata" is a
misnomer because we actually copy data around.

>     3. Graph Reconfiguration
>         * Removing, adding, or reordering nodes in the bdrv_state
>           graph. (n.b.: Write - Visible Data may need to be set, if
>           appropriate)

We need to define this so that it refers to an operation on a specific
node. I think we can take this as changing the child node references of
a given BDS.

>     4. Read - Visible Data
>         * Read data visible to an external read.
> 
>     5. Read - Metadata
>         * Read node metadata

Is this a useful operation? I think everyone who does anything with an
image is automatically reading metadata.


<thinking aloud>

I think there is something missing. Initially I thought it's just
another action, but I'm not so sure any more. Let's check a block-commit
job this time, commit C back into A:

         +---------- NBD server
         v
    A <- B <- C <--- virtio-blk

Now as we all know, committing data from C back into A makes B invalid,
so this should be prohibited. "Read - Visible Data" could possibly
provide this, but the bs->backing read in C would use the same action,
so it's impossible to allow the backing file relationship, but not the
NBD server. This is the part that could possibly be solved if the commit
job converted all intermediate bs->backing links to something weaker,
like "Read Backing File", which doesn't care whether the image is
self-consistent.

However, we could instead have two backing file relationships. Then one
of them must be okay (the B <- C one) and the other one must block
the commit job. And in fact, if it's the block-commit job that changes
the link to weaker semantics, we would achieve this.

Hm... Does this make any sense? I'm not sure if this would turn out
completely correct.

</thinking aloud>

Kevin