From mboxrd@z Thu Jan 1 00:00:00 1970
From: Daniel Stodden
Subject: blktap: Sync with XCP, dropping zero-copy.
Date: Fri, 12 Nov 2010 15:31:42 -0800
Message-ID: <1289604707-13378-1-git-send-email-daniel.stodden@citrix.com>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Xen
Cc: Jeremy Fitzhardinge
List-Id: xen-devel@lists.xenproject.org

Hi all.

This is the better half of what XCP developments and testing brought
for blktap.

It's a fairly big change in how I/O buffers are managed. Prior to this
series, we had zero-copy I/O down to userspace. Unfortunately, blktap2
always had to jump through a couple of extra hoops to do so. The
present state is that we dropped that, so all tapdev I/O is bounced
to/from a bunch of normal pages, essentially replacing the old VMA
management with a couple of insert/zap VM calls.

One issue was that the kernel can't cope with recursive
I/O. Submitting an iovec on a tapdev, passing it to userspace and then
reissuing the same vector via AIO apparently doesn't fit well with the
lock protocol applied to those pages. This is the main reason why
blktap had to deal a lot with grant refs. About as much as blkback
already does before passing requests on. What happens there is that
it's aliasing those granted pages under a different PFN, and thereby
under a separate page struct. Not pretty, but it worked, so it's not
the reason why we chose to drop that at some point.

The more prevalent problem was network storage, especially anything
involving TCP. That includes VHD on both NFS and iSCSI. The problem
with those is that retransmits (by the transport) and I/O op
completion (on the application layer) are never synchronized.
With sufficiently bad timing and a bit of jitter on the network, it's
perfectly common for the kernel to complete an AIO request with a late
ack on the input queue just when the retransmission timer is about to
fire underneath. The completion will unmap the granted frame, crashing
any uncanceled retransmission on an empty page frame. There are
different ways to deal with that. Page destructors might be one way,
but as far as I heard they are not particularly popular upstream.
Issuing the block I/O on dom0 memory is straightforward and avoids the
hassle. One could argue that retransmits after DIO completion are
still a potential privacy problem (I did), but it's not Xen's problem
after all.

If zero-copy becomes more attractive again, the plan would be to
rather use grantdev in userspace, such as a filter driver for tapdisk
instead. Until then, there's presumably a notable difference in L2
cache footprint. Then again, there's also a whole number of cycles not
spent in redundant hypercalls now, to deal with the pseudophysical
map.

There are also benefits or non-issues.

 - This blktap is rather Xen-independent. It certainly depends on the
   common ring macros, but lacking grant stuff it compiles on bare
   metal Linux with no CONFIG_XEN. Not consummated here, because
   that's going to move the source tree out of drivers/xen. But I'd
   like to post a new branch proposing to do so.

 - Blktap's size in dom0 didn't really change. Frames (now pages) were
   always pooled. We used to balloon memory to claim space for
   redundant grant mappings. Now we reserve, by default, the same
   volume in normal memory.

 - The previous code would run all I/O on a single pool. Typically
   two rings' worth of requests. Sufficient for a whole lot of systems,
   especially with single storage backends, but not so nice when I/O
   on a number of otherwise independent filers or volumes collides.

   Pools are refcounted kobjects in sysfs.
   Toolstacks using the new code can thereby choose to eliminate
   bottlenecks by grouping taps on different buffer pools. Pools can
   also be resized, to accommodate greater queue depths.

   [Note that blkback still has the same issue, so guests won't take
   advantage of that before that's resolved as well.]

 - XCP started to make some use of stacking tapdevs. Think pointing
   the image chain of a bunch of "leaf" taps to a shared parent
   node. That works fairly well, but definitely takes independent
   resource pools to avoid deadlock by parent starvation then.

Please pull upstream/xen/dom0/backend/blktap2 from
git://xenbits.xensource.com/people/dstodden/linux.git

.. and/or upstream/xen/next for a merge.

I also pulled in the pending warning fix from Teck Choon Giam.

Cheers,
Daniel