Re: blktap: Sync with XCP, dropping zero-copy.

From: Jeremy Fitzhardinge <jeremy@goop.org>
To: Daniel Stodden <daniel.stodden@citrix.com>
Cc: "Xen-devel@lists.xensource.com" <Xen-devel@lists.xensource.com>
Subject: Re: blktap: Sync with XCP, dropping zero-copy.
Date: Mon, 15 Nov 2010 10:27:12 -0800	[thread overview]
Message-ID: <4CE17B80.7080606@goop.org> (raw)
In-Reply-To: <1289620544.11102.373.camel@agari.van.xensource.com>

On 11/12/2010 07:55 PM, Daniel Stodden wrote:
>> Surely this can be dealt with by replacing the mapped granted page with
>> a local copy if the refcount is elevated?
> Yeah. We briefly discussed this when the problem started to pop up
> (again).
>
> I had a patch, for blktap1 in XS 5.5 iirc, which would fill mapping with
> a dummy page mapped in. You wouldn't need a copy, a R/O zero map easily
> does the job.

Hm, I'd be a bit concerned that that might cause problems if used
generically.  If the page is being used RO, then replacing with a copy
shouldn't be a problem, but getting a consistent snapshot of a mutable
page is obviously going to be a problem.

>  On UP that'd be just a matter of disabling interrupts for
> a while.

Disable for what purpose?  You mean to do an exchange of the mapping, or
something else?

> I dropped it after it became clear that XS was moving to SMP, where one
> would end up with a full barrier to orchestrate the TLB flushes
> everywhere. Now, the skb runs prone to crash all run in softirq context,
> I wouldn't exactly expect a huge performance win from syncing on that
> kind of thing across all nodes, compared to local memcpy. Nor would I
> want storage stuff to touch locks shared with TCP, that's just not our
> business. Correct me if I'm mistaken.

I don't follow what your high-level concern is here.   If we update the
pte to unmap the granted page, then its up to Xen to arrange for any TLB
flushes to make sure the page is no longer accessible to the domain.  We
don't need to do anything explicit, and its independent of whether we're
simply unmapping the page or replacing the mapping with another one (ie,
any TLB flushing necessary is already going on).

I was also under the impression that this is a relatively rare event
rather than something that's going to be necessary for every page, so
the overhead should be minimal.

>>> If zero-copy becomes more attractive again, the plan would be to
>>> rather use grantdev in userspace, such as a filter driver for tapdisk
>>> instead. Until then, there's presumably a notable difference in L2
>>> cache footprint. Then again, there's also a whole number of cycles not
>>> spent in redundant hypercalls now, to deal with the pseudophysical
>>> map.
>> Frankly, I think the idea of putting blkback+tapdisk entirely in
>> usermode is all upside with no (obvious) downsides.  It:
>>
>>    1. avoids having to upstream anything
>>    2. avoids having to upstream anything
>>    3. avoids having to upstream anything
>>
>>    4. gets us back zero-copy (if that's important)
> (No, unfortunately. DIO on a granted frame under blktap would be as
> vulnerable as DIO on a granted frame under a userland blkback, userland
> won't buy us anthing as far as the zcopy side of things is concerned).

Why's that?  Are you talking about the stray page reference
vulnerability, or something else?  You're right that it doesn't really
help with stray references because its still kernel code which ends up
doing the dereference rather than usermode.

>>    5. makes the IO path nice and straightforward
>>    6. seems to address all the other problems you mentioned
> I'm not at all against a userland blkback. Just wouldn't go as far as
> considering this a silver bullet.
>
> The main thing I'm scared of is ending up hacking cheesy stuff into the
> user ABI to take advantage of things immediately available to FSes on
> the bio layer, but harder (or at least slower) to get made available to
> userland.

But that hasn't been a problem for tapdisk so far.  Given that it is
using DIO then any completed write has at least been submitted to a
storage device, so then its just a question of how to make sure that any
buffers are fully flushed, which would be fdatasync() I guess.

Besides, our requirements are hardly unique; if there's some clear need
for a new API, then the course of action is to work with broader Linux
community to work out what it should be and how it should be
implemented.  There's no need for "hacking cheesy stuff" - there's been
enough of that already.

> DISCARD support is one currently hot example, do you see that in AIO
> somewhere? Ok, probably a good thing for everybody anyway, so maybe
> patching that is useful work. But it's extra work right now and probably
> no more fun to maintain than blktap is.

fallocate() is being extended to allow hole-punching in files.  I don't
know what work is being done to do discard on random block devices, but
that's clearly a generically useful thing to have.

> The second issue I see is the XCP side of things. XenServer got a lot of
> benefit out of blktap2, and particularly because of the tapdevs. It
> promotes a fairly rigorous split between a blkback VBD, controlled by
> the agent, and tapdevs, controlled by XS's storage manager.
>
> That doesn't prevent blkback to go into userspace, but it better won't
> share a process with some libblktap, which in turn would better not be
> controlled under the same xenstore path.

Could you elaborate on this?  What was the benefit?

>> The only caveat is the stray unmapping problem, but I think gntdev can
>> be modified to deal with that pretty easily.
> Not easier than anything else in kernel space, but when dealing only
> with the refcounts, that's as as good a place as anwhere else, yes.

I think the refcount test is pretty straightforward - if the refcount is
1, then we're the sole owner of the page and we don't need to worry
about any other users.  If its > 1, then somebody else has it, and we
need to make sure it no longer refers to a granted page (which is just a
matter of doing a set_pte_atomic() to remap from present to present).

Then we'd have a set of frames whose lifetimes are being determined by
some other subsystem.  We can either maintain a list of them and poll
waiting for them to become free, or just release them and let them be
managed by the normal kernel lifetime rules (which requires that the
memory attached to them be completely normal, of course).

    J