From mboxrd@z Thu Jan  1 00:00:00 1970
From: Daniel Stodden <daniel.stodden@citrix.com>
Subject: blktap: Sync with XCP, dropping zero-copy.
Date: Fri, 12 Nov 2010 15:31:42 -0800
Message-ID: <1289604707-13378-1-git-send-email-daniel.stodden@citrix.com>
Return-path: <xen-devel-bounces@lists.xensource.com>
List-Unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xensource.com>
List-Help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-Subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Xen <xen-devel@lists.xensource.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
List-Id: xen-devel@lists.xenproject.org

Hi all.

This is the better half of what XCP developments and testing brought
for blktap.

It's a fairly big change in how I/O buffers are managed. Prior to this
series, we had zero-copy I/O down to userspace. Unfortunately, blktap2
always had to jump through a couple of extra hoops to do so. The
present state is that we dropped zero-copy, so all tapdev I/O is
bounced to/from a bunch of normal pages, essentially replacing the old
VMA management with a couple of insert/zap VM calls.
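
To illustrate what "bounced" means here, a minimal userspace sketch
follows. It is not the driver code: the struct, the function names and
the single-page limit are all made up for illustration, and a plain
array stands in for the normal page.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096

/* Hypothetical request. 'guest' stands in for the data that the old
 * zero-copy path would have mapped straight into userspace. */
struct sketch_request {
    int            is_write;
    size_t         len;                      /* at most one page */
    unsigned char *guest;                    /* owned by the submitter */
    unsigned char  bounce[SKETCH_PAGE_SIZE]; /* the "normal page" */
};

/* Before the request is handed to userspace: write data is copied
 * into the bounce page, so userspace never sees the original. */
static void sketch_submit(struct sketch_request *rq)
{
    assert(rq->len <= SKETCH_PAGE_SIZE);
    if (rq->is_write)
        memcpy(rq->bounce, rq->guest, rq->len);
}

/* On completion: read data is copied back from the bounce page.
 * After this point nothing references the submitter's memory. */
static void sketch_complete(struct sketch_request *rq)
{
    if (!rq->is_write)
        memcpy(rq->guest, rq->bounce, rq->len);
}
```

The point of the sketch is only the direction of the copies: writes
copy on submission, reads copy on completion, and whatever userspace
does with the request touches the bounce page alone.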

One issue was that the kernel can't cope with recursive
I/O. Submitting an iovec on a tapdev, passing it to userspace and then
reissuing the same vector via AIO apparently doesn't fit well with the
lock protocol applied to those pages. This is the main reason why
blktap had to deal a lot with grant refs, about as much as blkback
already does before passing requests on. What happens there is that
it's aliasing those granted pages under a different PFN, and thereby
under a separate page struct. Not pretty, but it worked, so it's not
the reason why we chose to drop zero-copy at some point.

The more prevalent problem was network storage, especially anything
involving TCP. That includes VHD on both NFS and iSCSI. The problem
with those is that retransmits (by the transport) and I/O op
completion (on the application layer) are never synchronized. With
sufficiently bad timing and a bit of jitter on the network, it's
perfectly common for the kernel to complete an AIO request with a late
ack on the input queue just when the retransmission timer is about to
fire underneath. The completion will unmap the granted frame, crashing
any uncanceled retransmission on an empty page frame. There are different
ways to deal with that. Page destructors might be one way, but as far
as I heard they are not particularly popular upstream. Issuing the
block I/O on dom0 memory is straightforward and avoids the hassle. One
could argue that retransmits after DIO completion are still a
potential privacy problem (I did), but it's not Xen's problem after
all.

If zero-copy becomes more attractive again, the plan would rather be
to use grantdev in userspace instead, such as in a filter driver for
tapdisk. Until then, there's presumably a notable difference in L2
cache footprint. Then again, there's also a whole number of cycles not
spent in redundant hypercalls now, to deal with the pseudophysical
map.

There are also benefits or non-issues.

 - This blktap is rather Xen-independent. It certainly depends on the
   common ring macros, but lacking grant stuff it compiles on bare
   metal Linux with no CONFIG_XEN. That isn't done here, because
   it's going to move the source tree out of drivers/xen. But I'd
   like to post a new branch proposing to do so.

 - Blktap's size in dom0 didn't really change. Frames (now pages) were
   always pooled. We used to balloon memory to claim space for
   redundant grant mappings. Now we reserve, by default, the same
   volume in normal memory.

 - The previous code would run all I/O on a single pool. Typically
   two rings worth of requests. Sufficient for a whole lot of systems,
   especially with single storage backends, but not so nice when I/O
   on a number of otherwise independent filers or volumes collides.

   Pools are refcounted kobjects in sysfs. Toolstacks using the new
   code can thereby choose to eliminate bottlenecks by grouping taps
   on different buffer pools. Pools can also be resized, to accommodate
   greater queue depths. [Note that blkback still has the same issue,
   so guests won't take advantage of that before that's resolved as
   well.]

 - XCP started to make some use of stacking tapdevs. Think pointing
   the image chain of a bunch of "leaf" taps to a shared parent
   node. That works fairly well, but definitely takes independent
   resource pools to avoid deadlock by parent starvation.
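
For the pool items above, here is a rough userspace model of
refcounted, resizable pools and of why stacked taps want separate
ones. It is a sketch under assumptions, not the kernel code: the real
pools are kobjects in sysfs, whereas every name below is hypothetical
and a pool is reduced to a few counters.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical pool: counters only, no actual pages. */
struct sketch_pool {
    int refs;  /* holders: the creator plus attached taps */
    int size;  /* total pages in the pool */
    int free;  /* pages not currently in flight */
};

static struct sketch_pool *sketch_pool_create(int size)
{
    struct sketch_pool *p = malloc(sizeof(*p));
    if (!p)
        return NULL;
    p->refs = 1;
    p->size = size;
    p->free = size;
    return p;
}

/* A tap attaching to the pool takes a reference. */
static struct sketch_pool *sketch_pool_get(struct sketch_pool *p)
{
    p->refs++;
    return p;
}

/* Drop a reference; returns 1 if this freed the pool. */
static int sketch_pool_put(struct sketch_pool *p)
{
    if (--p->refs > 0)
        return 0;
    free(p);
    return 1;
}

/* Claim n pages for a request, or fail without blocking. */
static int sketch_pool_alloc(struct sketch_pool *p, int n)
{
    if (n > p->free)
        return -1;
    p->free -= n;
    return 0;
}

/* Return n pages once the request completed. */
static void sketch_pool_release(struct sketch_pool *p, int n)
{
    p->free += n;
    assert(p->free <= p->size);
}

/* Grow or shrink; refuse to shrink below what is in flight. */
static int sketch_pool_resize(struct sketch_pool *p, int new_size)
{
    int in_flight = p->size - p->free;

    if (new_size < in_flight)
        return -1;
    p->size = new_size;
    p->free = new_size - in_flight;
    return 0;
}
```

In this model the starvation case is easy to see: if leaf requests
have claimed every page of a shared pool and each of them can only
complete once the parent gets a page from that same pool, nothing
moves. Giving the parent its own pool, or resizing, breaks the cycle.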

Please pull upstream/xen/dom0/backend/blktap2 from
git://xenbits.xensource.com/people/dstodden/linux.git

.. and/or upstream/xen/next for a merge.

I also pulled in the pending warning fix from Teck Choon Giam.

Cheers,
Daniel