Re: blktap: Sync with XCP, dropping zero-copy. - Andres Lagar-Cavilla

From: Andres Lagar-Cavilla <andres@lagarcavilla.org>
To: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: xen-devel@lists.xensource.com,
	Daniel Stodden <daniel.stodden@citrix.com>
Subject: Re: blktap: Sync with XCP, dropping zero-copy.
Date: Wed, 17 Nov 2010 14:47:47 -0500	[thread overview]
Message-ID: <0C42FEE9-DA97-4053-B4F0-AD08554B416C@lagarcavilla.org> (raw)
In-Reply-To: <4CE41676.3070007@goop.org>

So, swapping mfns for write requests is a definite no-no. One could still live with copying write buffers and swapping read buffers by the end of the request. That still yields some benefit. 

As for kernel mappings, I though a solution would be to provide the hypervisor with both pte pointers. After all pte pointers are already provided for mapping grants in user-space. But that's a little too much to handle for the current interface.

Thanks for the feedback
Andres
On Nov 17, 2010, at 12:52 PM, Jeremy Fitzhardinge wrote:

> On 11/17/2010 08:36 AM, Andres Lagar-Cavilla wrote:
>> I'll throw an idea there and you educate me why it's lame.
>> 
>> Going back to the primary issue of dropping zero-copy, you want the block backend (tapdev w/AIO or otherwise) to operate on regular dom0 pages, because you run into all sorts of quirkiness otherwise: magical VM_FOREIGN incantations to back granted mfn's with fake page structs that make get_user_pages happy, quirky grant PTEs, etc.
>> 
>> Ok, so how about something along the lines of GNTTABOP_swap? Eerily reminiscent of (maligned?) GNTTABOP_transfer, but hear me out.
>> 
>> The observation is that for a blkfront read, you could do the read all along on a regular dom0 frame, and when stuffing the response into the ring, swap the dom0 frame (mfn) you used with the domU frame provided as a buffer. Then the algorithm folds out:
>> 
>> 1. Block backend, instead of get_empty_pages_and_pagevec at init time, creates a pool of reserved regular pages via get_free_page(s). These pages have their refcount pumped, no one in dom0 will ever touch them.
>> 
>> 2. When extracting a blkfront write from the ring, call GNTTABOP_swap immediately. One of the backend-reserved mfn's is swapped with the domU mfn. Pfn's and page struct's on both ends remain untouched.
> 
> Would GNTTABOP_swap also require the domU to have already unmapped the
> page from its own pagetables?  Presumably it would fail if it didn't,
> otherwise you'd end up with a domU mapping the same mfn as a
> dom0-private page.
> 
>> 3. For blkfront reads, call swap when stuffing the response back into the ring
>> 
>> 4. Because of 1, dom0 can a) calmly fix its p2m (and kvaddr) after swap, much like balloon and others do, without fear of races. More importantly, b) you don't have a weirdo granted PTE, or work with a frame from other domain. It's your page all along, dom0
>> 
>> 5. One assumption for domU is that pages allocated as blkfront buffers won't be touched by anybody, so a) it's safe for them to swap async with another frame with undef contents and b) domU can fix its p2m (and kvaddr) when pulling responses from the ring (the new mfn should be put on the response by dom0 directly or through an opaque handle)
>> 
>> 6. Scatter-gather vectors in ring requests give you a natural multicall batching for these GNTTABOP_swap's. I.e. all these hypercalls won't happen as often and at the granularity as skbuff's demanded for GNTTABOP_transfer
>> 
>> 7. Potentially domU may want to use the contents in a blkfront write buffer later for something else. So it's not really zero-copy. But the approach opens a window to async memcpy . From the point of swap when pulling the req to the point of pushing the response, you can do memcpy at any time. Don't know about how practical that is though.
> 
> I think that will be the common case - the kernel will always attempt to
> write dirty pagecache pages to make clean ones, and it will still want
> them around to access.  So it can't really give up the page altogether;
> if it hands it over to dom0, it needs to make a local copy first.
> 
>> Problems at first glance:
>> 1. To support GNTTABOP_swap you need to add more if(version) to blkfront and blkback.
>> 2. The kernel vaddr will need to be managed as well by dom0/U. Much like balloon or others: hypercall, fix p2m, and fix kvaddr all need to be taken care of. domU will probably need to neuter its kvaddr before granting, and then re-establish it when the response arrives. Weren't all these hypercalls ultimately more expensive than memcpy for GNTABOP_transfer for netback?
>> 3. Managing the pool of backend reserved pages may be a problem?
>> 
>> So in the end, perhaps more of an academic exercise than a palatable answer, but nonetheless I'd like to hear other problems people may find with this approach
> 
> It's not clear to me that its any improvement over just directly copying
> the data up front.
> 
>    J