Hi,

Re: XCP's use of blktap2:

> On Mon, 2010-11-15 at 13:27 -0500, Jeremy Fitzhardinge wrote:
> > On 11/12/2010 07:55 PM, Daniel Stodden wrote:
> > > The second issue I see is the XCP side of things. XenServer got a
> > > lot of benefit out of blktap2, and particularly because of the
> > > tapdevs. It promotes a fairly rigorous split between a blkback VBD,
> > > controlled by the agent, and tapdevs, controlled by XS's storage
> > > manager.
> > >
> > > That doesn't prevent blkback to go into userspace, but it better
> > > won't share a process with some libblktap, which in turn would
> > > better not be controlled under the same xenstore path.
> >
> > Could you elaborate on this? What was the benefit?
>
> It's been mainly a matter of who controls what. Blktap1 was basically a
> VBD, controlled by the agent. Blktap2 is a VDI represented as a block
> device. Leaving management of that to XCP's storage manager, which just
> hands that device node over to Xapi simplified many things. Before,
> the agent had to understand a lot about the type of storage, then talk
> to the right backend accordingly. Worse, in order to have storage
> management control a couple datapath features, you'd basically have to
> talk to Xapi, which would talk through xenstore to blktap, which was a
> bit tedious. :)

As Daniel says, XCP currently separates domain management (setting up,
rebooting VMs) from storage management (attaching disks, snapshot,
coalesce). In the current design the storage layer handles the storage
control-path (instigating snapshots, clones, coalesce, dedup in future)
through a storage API ("SMAPI"), and provides a uniform data-path
interface to qemu and blkback (currently in the form of a dom0 block
device). During a VM start, xapi will first ask the storage control-path
to make a disk available, and then pass this information to
blkback/qemu.

One of the trickiest things XCP handles is vhd "coalesce": merging a vhd
file into its "parent". This comes up because vhds are arranged in a
tree structure where the leaves are separate, independent VM disks and
the interior nodes represent shared common blocks, the result of (eg)
cloning a single VM lots of times. When guest disks are deleted and the
vhd leaves are removed, it sometimes becomes possible to save space by
merging nodes together. The tricky bit is doing this while I/O is still
being performed in parallel against logically separate (but related by
parentage/history) disks on different hosts. The thing doing the
coalescing needs to know where all the I/O is going on (eg to be able to
find the host and pid where the related tapdisks (or qemus) live), and
it needs to be able to signal those processes when they have to re-read
the vhd tree metadata.

In the bad old blktap1 days, the storage control-path didn't know enough
about the data-path to reliably signal the active tapdisks: IIRC the
tapdisks were spawned by blktapctrl as a side-effect of the domain
manager writing to xenstore. In the much better blktap2 days :) the
storage control-path sets up (registers?) the data-path (currently via
tap-ctl and a dom0 block device), and so it knows who to talk to in
order to co-ordinate a coalesce. So I think the critical thing is for
the storage control-path to be able to "register" a data-path, enabling
it later to find and signal any processes using that data-path.
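To make that a bit more concrete, here's a very rough sketch (Python,
purely illustrative: the registry and the helper names are invented, and
the tap-ctl / vhd-util invocations are just the blktap2-style mechanisms
I have in mind, not a proposal) of what co-ordinating a coalesce could
look like once the control-path has recorded who owns each data-path:

import subprocess

# Hypothetical registry, filled in by the storage control-path whenever
# it sets up a data-path for a VDI:
#   vdi_uuid -> {"host": ..., "pid": <tapdisk pid>, "minor": <tapdev minor>}
datapath_registry = {}

def tap_ctl(*args):
    # Thin wrapper around blktap2's tap-ctl; in reality this would be run
    # on the host recorded in the registry entry, not necessarily locally.
    subprocess.check_call(["tap-ctl"] + list(args))

def coalesce(child_vhd, affected_vdis):
    """Merge child_vhd into its parent while related disks stay active."""
    users = [datapath_registry[v] for v in affected_vdis
             if v in datapath_registry]

    # 1. Quiesce every tapdisk whose chain includes the node being merged.
    for u in users:
        tap_ctl("pause", "-p", str(u["pid"]), "-m", str(u["minor"]))
    try:
        # 2. Do the merge itself (eg vhd-util coalesce from blktap2).
        subprocess.check_call(["vhd-util", "coalesce", "-n", child_vhd])
    finally:
        # 3. Ask the tapdisks to re-read the (now shorter) vhd tree
        #    metadata and resume I/O.
        for u in users:
            tap_ctl("unpause", "-p", str(u["pid"]), "-m", str(u["minor"]))

The exact commands don't matter; the point is that steps 1 and 3 are
only possible because something recorded, at data-path setup time, which
processes are using which vhds.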
There are a bunch of different possibilities the storage control-path
could use instead of using tap-ctl to create a block device, including:

1. directly spawn a tapdisk2 userspace process. Some identifier (pid,
   unix domain socket) could be passed to qemu, allowing it to perform
   I/O. The block backend could be either in the tapdisk2 directly or in
   qemu?

2. return a (path to vhd file, callback unix domain socket) pair. This
   could be passed to qemu (or something else), and qemu could use the
   callback socket to register its intention to use the data-path (and
   hence that it needs to be signaled if something changes); there's a
   rough sketch of this in the P.S. below.

I'm sure there are lots of possibilities :-)

Cheers,
Dave
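P.S. For (2), here's roughly the kind of thing I mean by a callback
socket (again Python and entirely illustrative: the socket path and the
one-line JSON "protocol" are invented for the sake of the example):

import json
import os
import socket

SOCK_PATH = "/var/run/sm/datapath-callback.sock"   # invented path

def serve_registrations(registry):
    """Storage control-path side: record who is using which vhd."""
    if os.path.exists(SOCK_PATH):
        os.unlink(SOCK_PATH)
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(SOCK_PATH)
    srv.listen(16)
    while True:
        conn, _ = srv.accept()
        msg = json.loads(conn.makefile().readline())  # {"pid": ..., "vhd": ...}
        registry[msg["vhd"]] = msg["pid"]  # now we know whom to signal
                                           # when the vhd tree changes
        conn.close()

def register(vhd_path):
    """Data-path side (qemu, tapdisk2, ...): announce use of vhd_path."""
    c = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    c.connect(SOCK_PATH)
    c.sendall((json.dumps({"pid": os.getpid(), "vhd": vhd_path}) + "\n").encode())
    c.close()

Signalling in the other direction (eg "please re-read the vhd tree")
could go over the same socket or be keyed off the registered pid; the
important part is just that the registration happens when the data-path
is set up, so the control-path never loses track of its users.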