From: Andres Lagar-Cavilla
Subject: Re: blktap: Sync with XCP, dropping zero-copy.
Date: Wed, 17 Nov 2010 11:36:49 -0500
In-Reply-To: <20101116215621.59FC2CF782@homiemail-mx7.g.dreamhost.com>
To: xen-devel@lists.xensource.com
Cc: Jeremy Fitzhardinge, Daniel Stodden

I'll throw an idea out there and you educate me why it's lame.

Going back to the primary issue of dropping zero-copy: you want the block backend (tapdev with AIO or otherwise) to operate on regular dom0 pages, because you run into all sorts of quirkiness otherwise: magical VM_FOREIGN incantations to back granted mfn's with fake page structs that make get_user_pages happy, quirky grant PTEs, etc.

Ok, so how about something along the lines of GNTTABOP_swap? Eerily reminiscent of the (maligned?) GNTTABOP_transfer, but hear me out.

The observation is that for a blkfront read you could do the read all along on a regular dom0 frame, and when stuffing the response into the ring, swap the dom0 frame (mfn) you used with the domU frame provided as the buffer. Then the algorithm folds out (rough sketch of the backend side below):

1. The block backend, instead of get_empty_pages_and_pagevec at init time, creates a pool of reserved regular pages via get_free_page(s). These pages have their refcount pumped, so no one else in dom0 will ever touch them.

2. When extracting a blkfront write from the ring, call GNTTABOP_swap immediately. One of the backend-reserved mfn's is swapped with the domU mfn. Pfn's and page structs on both ends remain untouched.

3. For blkfront reads, call swap when stuffing the response back into the ring.

4. Because of 1, dom0 can a) calmly fix its p2m (and kvaddr) after the swap, much like balloon and others do, without fear of races; and, more importantly, b) you don't have a weird granted PTE, nor do you work with a frame from another domain. It's your page all along, dom0.

5. One assumption for domU is that pages allocated as blkfront buffers won't be touched by anybody else, so a) it's safe for them to be swapped asynchronously with another frame of undefined contents, and b) domU can fix its p2m (and kvaddr) when pulling responses from the ring (the new mfn should be put in the response by dom0, directly or through an opaque handle).

6. Scatter-gather vectors in ring requests give you natural multicall batching for these GNTTABOP_swap's. I.e. the hypercalls won't happen as often, or at the granularity, that skbuff's demanded for GNTTABOP_transfer.

7. Potentially domU may want to use the contents of a blkfront write buffer later for something else, so it's not really zero-copy. But the approach opens a window for async memcpy: from the point of the swap when pulling the request to the point of pushing the response, you can do the memcpy at any time. Don't know how practical that is, though.
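To make the batching in 6 concrete, here's a very rough sketch of what the backend side could look like. GNTTABOP_swap doesn't exist, so the op number, the struct layout and the swap_request_frames() helper below are all invented; take it as a sketch of the intent, not an implementation:

/* Sketch only: GNTTABOP_swap, its op number and this struct are made up. */
#include <linux/mm.h>
#include <linux/errno.h>
#include <linux/bug.h>
#include <xen/interface/grant_table.h>
#include <xen/interface/io/blkif.h>
#include <asm/xen/hypercall.h>
#include <asm/xen/page.h>

#define GNTTABOP_swap 12                /* hypothetical op number */

struct gnttab_swap {
        /* IN */
        grant_ref_t ref;                /* domU buffer, as granted in the request */
        domid_t     domid;              /* frontend domain */
        xen_pfn_t   local_mfn;          /* backend-reserved frame we give away */
        /* OUT */
        int16_t     status;             /* GNTST_* */
        xen_pfn_t   new_mfn;            /* the domU frame received in exchange */
};

/*
 * One batched grant-table call per blkif request: the scatter-gather
 * vector (up to 11 segments) provides the batching, so this fires far
 * less often than GNTTABOP_transfer ever did for skbuffs.
 */
static int swap_request_frames(domid_t domid, struct blkif_request *req,
                               struct page **reserved_pages)
{
        struct gnttab_swap ops[BLKIF_MAX_SEGMENTS_PER_REQUEST];
        int i;

        for (i = 0; i < req->nr_segments; i++) {
                ops[i].ref       = req->seg[i].gref;
                ops[i].domid     = domid;
                ops[i].local_mfn = pfn_to_mfn(page_to_pfn(reserved_pages[i]));
        }

        if (HYPERVISOR_grant_table_op(GNTTABOP_swap, ops, req->nr_segments))
                BUG();

        for (i = 0; i < req->nr_segments; i++) {
                if (ops[i].status != GNTST_okay)
                        return -EIO;
                /*
                 * dom0 now owns new_mfn; fix the p2m (and kvaddr) balloon-style,
                 * at leisure, before handing the page to the tapdev/AIO path.
                 */
                set_phys_to_machine(page_to_pfn(reserved_pages[i]),
                                    ops[i].new_mfn);
        }
        return 0;
}

The frontend would do the mirror-image p2m/kvaddr fixup when it pulls the response, per 5.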
Problems at first glance:

1. To support GNTTABOP_swap you need to add more if(version) checks to blkfront and blkback.

2. The kernel vaddr will need to be managed as well by dom0/domU. Much like balloon or others: hypercall, fix p2m, and fix kvaddr all need to be taken care of. domU will probably need to neuter its kvaddr before granting, and then re-establish it when the response arrives. Weren't all these hypercalls ultimately more expensive than memcpy for GNTTABOP_transfer in netback?

3. Managing the pool of backend-reserved pages may be a problem?

So in the end, perhaps more of an academic exercise than a palatable answer, but nonetheless I'd like to hear what other problems people may find with this approach.

Andres

> Message: 3
> Date: Tue, 16 Nov 2010 13:28:51 -0800
> From: Daniel Stodden
> Subject: [Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
> To: Jeremy Fitzhardinge
> Cc: "Xen-devel@lists.xensource.com"
> Message-ID: <1289942932.11102.802.camel@agari.van.xensource.com>
> Content-Type: text/plain; charset="UTF-8"
>
> On Tue, 2010-11-16 at 12:56 -0500, Jeremy Fitzhardinge wrote:
>> On 11/16/2010 01:13 AM, Daniel Stodden wrote:
>>> On Mon, 2010-11-15 at 13:27 -0500, Jeremy Fitzhardinge wrote:
>>>> On 11/12/2010 07:55 PM, Daniel Stodden wrote:
>>>>>> Surely this can be dealt with by replacing the mapped granted page with
>>>>>> a local copy if the refcount is elevated?
>>>>> Yeah. We briefly discussed this when the problem started to pop up
>>>>> (again).
>>>>>
>>>>> I had a patch, for blktap1 in XS 5.5 iirc, which would fill the mapping
>>>>> with a dummy page instead. You wouldn't need a copy; a R/O zero map
>>>>> easily does the job.
>>>> Hm, I'd be a bit concerned that that might cause problems if used
>>>> generically.
>>> Yeah. It wasn't a problem because all the network backends are on TCP,
>>> where one can be rather sure that the dups are going to be properly
>>> dropped.
>>>
>>> Does this hold everywhere..? -- As mentioned below, the problem is
>>> rather in AIO/DIO than being Xen-specific, so you can see the same
>>> behavior on bare-metal kernels too. A userspace app seeing an AIO
>>> complete and then reusing that buffer elsewhere will occasionally
>>> resend garbage over the network.
>>
>> Yeah, that sounds like a generic security problem. I presume the
>> protocol will just discard the excess retransmit data, but it might mean
>> a usermode program ends up transmitting secrets it never intended to...
>>
>>> There are some important parts which would go missing. Such as
>>> rate-limiting gntdev accesses -- 200 thundering tapdisks each trying to
>>> gntmap 352 pages simultaneously isn't so good, so there still needs to
>>> be some bridge arbitrating them. I'd rather keep that in kernel space;
>>> okay to cram stuff like that into gntdev? It'd be much more
>>> straightforward than IPC.
>>
>> What's the problem? If you do nothing then it will appear to the kernel
>> as a bunch of processes doing memory allocations, and they'll get
>> blocked/rate-limited accordingly if memory is getting short.
>
> The problem is that just letting the page allocator work through
> allocations isn't going to scale anywhere.
>
> The worst case memory requested under load is N * (32 * 11 pages). As a
> (conservative) rule of thumb, N will be 200 or rather better.
>
> The number of I/O actually in flight at any point, in contrast, is
> derived from the queue/sg sizes of the physical device. For a simple
> disk, that's about a ring or two.
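(Back of the envelope, assuming 4 KiB pages: 200 * 32 * 11 pages is about 70,000 pages, i.e. roughly 275 MiB of worst-case pinned memory, against the ring or two actually in flight per disk.)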
>> There's plenty of existing mechanisms to control that sort of thing
>> (cgroups, etc) without adding anything new to the kernel. Or are you
>> talking about something other than simple memory pressure?
>>
>> And there's plenty of existing IPC mechanisms if you want them to
>> explicitly coordinate with each other, but I'd tend to think that's
>> premature unless you have something specific in mind.
>>
>>> Also, I was absolutely certain I once saw VM_FOREIGN support in gntdev..
>>> Can't find it now, what happened? Without it, there's presently still no
>>> zero-copy.
>>
>> gntdev doesn't need VM_FOREIGN any more - it uses the (relatively
>> new-ish) mmu notifier infrastructure, which is intended to allow a device
>> to sync an external MMU with usermode mappings. We're not using it in
>> precisely that way, but it allows us to wrangle grant mappings before
>> the generic code tries to do normal pte ops on them.
>
> The mmu notifiers were for safe teardown only. They are not sufficient
> for DIO, which wants gup() to work. If you want zcopy on gntdev, we'll
> need to back those VMAs with page structs. Or bounce again (gulp, just
> mentioning it). As with the blktap2 patches, note there is no difference
> in the dom0 memory bill: it takes page frames.
>
> This is pretty much exactly the pooling stuff in the current drivers/blktap.
> The interface could look as follows ([] designates users).
>
> * [toolstack]
>   Calling some ctls to create/destroy pools of frames.
>   (Blktap currently does this in sysfs.)
>
> * [toolstack]
>   Optionally resize them, according to the physical queue
>   depth [estimate] of the underlying storage.
>
> * [tapdisk]
>   A backend instance, when starting up, opens a gntdev, then
>   uses a ctl to bind its gntdev handle to a frame pool.
>
> * [tapdisk]
>   The .mmap call now will allocate frames to back the VMA.
>   This operation can fail/block under congestion. Neither
>   is desirable, so we need a .poll.
>
> * [tapdisk]
>   To integrate grant mappings with a single-threaded event loop,
>   use .poll. The handle fires as soon as a request can be mapped.
>
> Under congestion, the .poll code will queue waiting disks and wake
> them round-robin, once VMAs are released.
>
> (A [tapdisk] doesn't mean to dismiss a potential [qemu].)
>
> Still confident we want that? (Seriously asking.) A lot of the code to
> do so has been written for blktap; it wouldn't be hard to bend into a
> gntdev extension.
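For what it's worth, the tapdisk side of the interface sketched above seems like it would boil down to roughly the following. GNTDEV_IOC_BIND_POOL, struct gntdev_bind_pool and the helper names are invented here -- nothing below exists in gntdev today -- and the grant refs themselves would still be handed in through the usual map ioctl, elided for brevity:

/* Sketch of the proposed open/bind/poll/mmap flow; all names hypothetical. */
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

struct gntdev_bind_pool { char name[32]; };                 /* invented */
#define GNTDEV_IOC_BIND_POOL _IOW('G', 100, struct gntdev_bind_pool)

/* Open gntdev and bind the handle to a frame pool the toolstack created. */
static int tapdisk_open_gntdev(const char *pool)
{
        struct gntdev_bind_pool bind;
        int gfd = open("/dev/xen/gntdev", O_RDWR);

        if (gfd < 0)
                return -1;
        memset(&bind, 0, sizeof(bind));
        strncpy(bind.name, pool, sizeof(bind.name) - 1);
        if (ioctl(gfd, GNTDEV_IOC_BIND_POOL, &bind) < 0) {  /* hypothetical ctl */
                close(gfd);
                return -1;
        }
        return gfd;
}

/*
 * In the single-threaded event loop: wait until the pool can back a VMA
 * (the .poll handler fires once frames are available), then mmap, which
 * is where the grant mappings get established.
 */
static void *tapdisk_map_request(int gfd, size_t len)
{
        struct pollfd pfd = { .fd = gfd, .events = POLLIN };
        void *vma;

        if (poll(&pfd, 1, -1) != 1 || !(pfd.revents & POLLIN))
                return NULL;
        vma = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, gfd, 0);
        return vma == MAP_FAILED ? NULL : vma;
}

The point is only the open/bind/poll/mmap sequence: congestion shows up as the fd not polling ready, which slots naturally into tapdisk's event loop.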
>>> Once the issues were solved, it'd be kinda nice. Simplifies stuff like
>>> memshr for blktap, which depends on getting hold of the original grefs.
>>>
>>> We'd presumably still need the tapdev nodes, for qemu, etc. But those
>>> can stay non-Xen-aware then.
>>>
>>>>>> The only caveat is the stray unmapping problem, but I think gntdev can
>>>>>> be modified to deal with that pretty easily.
>>>>> Not easier than anything else in kernel space, but when dealing only
>>>>> with the refcounts, that's as good a place as anywhere else, yes.
>>>> I think the refcount test is pretty straightforward - if the refcount is
>>>> 1, then we're the sole owner of the page and we don't need to worry
>>>> about any other users. If it's > 1, then somebody else has it, and we
>>>> need to make sure it no longer refers to a granted page (which is just a
>>>> matter of doing a set_pte_atomic() to remap from present to present).
>>> [set_pte_atomic over grant ptes doesn't work, or does it?]
>>
>> No, I forgot about grant ptes' magic properties. But there is the hypercall.
>
> Yup.
>
>>>> Then we'd have a set of frames whose lifetimes are being determined by
>>>> some other subsystem. We can either maintain a list of them and poll,
>>>> waiting for them to become free, or just release them and let them be
>>>> managed by the normal kernel lifetime rules (which requires that the
>>>> memory attached to them be completely normal, of course).
>>> The latter sounds like a good alternative to polling. So an
>>> unmap_and_replace, and giving up ownership thereafter. The next run of
>>> the dispatcher thread can just refill the foreign pfn range via
>>> alloc_empty_pages(), to rebalance.
>>
>> Do we actually need a "foreign page range"? Won't any pfn do? If we
>> start with a specific range of foreign pfns and then start freeing those
>> pfns back to the kernel, we won't have one for long...
>
> I guess we've been meaning the same thing here, unless I'm
> misunderstanding you. Any pfn does, and the balloon pagevec allocations
> default to order-0 entries indeed. Sorry, you're right, that's not a
> 'range'. With a pending re-xmit, the backend can find that a couple (or
> all) of the request frames have count > 1. It can flip and abandon those
> as normal memory. But it will need those lost memory slots back, straight
> away or the next time it runs out of frames. As order-0 allocations.
>
> Foreign memory is deliberately short. Blkback still defaults to two rings'
> worth of address space, iirc, globally. That's what the mempool sysfs
> stuff in the later blktap2 patches aimed at -- making the size
> configurable where queue length matters, and isolating throughput between
> physical backends, where the toolstack wants to care.
>
> Daniel
>
> ------------------------------
>
> Message: 4
> Date: Tue, 16 Nov 2010 13:42:54 -0800 (PST)
> From: Boris Derzhavets
> Subject: Re: [Xen-devel] Re: 2.6.37-rc1 mainline domU - BUG: unable to
>         handle kernel paging request
> To: Konrad Rzeszutek Wilk
> Cc: Jeremy Fitzhardinge, xen-devel@lists.xensource.com, Bruce Edge
> Message-ID: <923132.8834.qm@web56101.mail.re3.yahoo.com>
> Content-Type: text/plain; charset="us-ascii"
>
>> So what I think you are saying is that you keep on getting the bug in DomU?
>> Is the stack-trace the same as in rc1?
>
> Yes.
> When I want to get 1-2 hr of stable work:
>
> # service network restart
> # service nfs restart
>
> at Dom0.
>
> I also believe that the presence of xen-pcifront.fix.patch is making things
> much more stable on F14.
>
> Boris.
>
> --- On Tue, 11/16/10, Konrad Rzeszutek Wilk wrote:
>
> From: Konrad Rzeszutek Wilk
> Subject: Re: [Xen-devel] Re: 2.6.37-rc1 mainline domU - BUG: unable to handle kernel paging request
> To: "Boris Derzhavets"
> Cc: "Jeremy Fitzhardinge", xen-devel@lists.xensource.com, "Bruce Edge"
> Date: Tuesday, November 16, 2010, 4:15 PM
>
> On Tue, Nov 16, 2010 at 12:43:28PM -0800, Boris Derzhavets wrote:
>>> Huh. I .. what? I am confused. I thought we established that the issue
>>> was not related to Xen PCI front? You also seem to uncomment the
>>> upstream.core.patches and the xen.pvhvm.patch - why?
>>
>> I cannot uncomment upstream.core.patches and the xen.pvhvm.patch;
>> it gives failed HUNKs.
>
> Uhh.. I am even more confused.
>>
>>> Ok, they are.. v2.6.37-rc2 which came out today has the fixes
>>
>> I am pretty sure rc2 doesn't contain everything from xen.next-2.6.37.patch,
>> gntdev's stuff for sure. I've built 2.6.37-rc2 kernel rpms and loaded
>> kernel-2.6.37-rc2.git0.xendom0.x86_64 under Xen 4.0.1.
>> Device /dev/xen/gntdev has not been created.
>> I understand that it's unrelated to DomU (related to Dom0), but once again
>> with rc2 in DomU I cannot get 3.2 GB copied over to DomU from an NFS share
>> at Dom0.
>
> So what I think you are saying is that you keep on getting the bug in DomU?
> Is the stack-trace the same as in rc1?
>
> ------------------------------
>
> Message: 5
> Date: Tue, 16 Nov 2010 13:49:14 -0800 (PST)
> From: Boris Derzhavets
> Subject: Re: [Xen-devel] Re: 2.6.37-rc1 mainline domU - BUG: unable to
>         handle kernel paging request
> To: Konrad Rzeszutek Wilk
> Cc: Jeremy Fitzhardinge, xen-devel@lists.xensource.com, Bruce Edge
> Message-ID: <228566.47308.qm@web56106.mail.re3.yahoo.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Yes, here we are
>
> [  186.975228] ------------[ cut here ]------------
> [  186.975245] kernel BUG at mm/mmap.c:2399!
> [  186.975254] invalid opcode: 0000 [#1] SMP
> [  186.975269] last sysfs file: /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map
> [  186.975284] CPU 0
> [  186.975290] Modules linked in: nfs fscache deflate zlib_deflate ctr camellia cast5 rmd160 crypto_null ccm serpent blowfish twofish_generic twofish_x86_64 twofish_common ecb xcbc cbc sha256_generic sha512_generic des_generic cryptd aes_x86_64 aes_generic ah6 ah4 esp6 esp4 xfrm4_mode_beet xfrm4_tunnel tunnel4 xfrm4_mode_tunnel xfrm4_mode_transport xfrm6_mode_transport xfrm6_mode_ro xfrm6_mode_beet xfrm6_mode_tunnel ipcomp ipcomp6 xfrm_ipcomp xfrm6_tunnel tunnel6 af_key nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc ipv6 uinput xen_netfront microcode xen_blkfront [last unloaded: scsi_wait_scan]
> [  186.975507]
> [  186.975515] Pid: 1562, comm: ls Not tainted 2.6.37-0.1.rc1.git8.xendom0.fc14.x86_64 #1 /
> [  186.975529] RIP: e030:[] [] exit_mmap+0x10c/0x119
> [  186.975550] RSP: e02b:ffff8800781bde18 EFLAGS: 00010202
> [  186.975560] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> [  186.975573] RDX: 00000000914a9149 RSI: 0000000000000001 RDI: ffffea00000c0280
> [  186.975585] RBP: ffff8800781bde48 R08: ffffea00000c0280 R09: 0000000000000001
> [  186.975598] R10: ffffffff8100750f R11: ffffea0000967778 R12: ffff880076c68b00
> [  186.975610] R13: ffff88007f83f1e0 R14: ffff880076c68b68 R15: 0000000000000001
> [  186.975625] FS:  00007f8e471d97c0(0000) GS:ffff88007f831000(0000) knlGS:0000000000000000
> [  186.975639] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  186.975650] CR2: 00007f8e464a9940 CR3: 0000000001a03000 CR4: 0000000000002660
> [  186.975663] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  186.976012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [  186.976012] Process ls (pid: 1562, threadinfo ffff8800781bc000, task ffff8800788223e0)
> [  186.976012] Stack:
> [  186.976012]  000000000000006b ffff88007f83f1e0 ffff8800781bde38 ffff880076c68b00
> [  186.976012]  ffff880076c68c40 ffff8800788229d0 ffff8800781bde68 ffffffff810505fc
> [  186.976012]  ffff8800788223e0 ffff880076c68b00 ffff8800781bdeb8 ffffffff81056747
> [  186.976012] Call Trace:
> [  186.976012]  [] mmput+0x65/0xd8
> [  186.976012]  [] exit_mm+0x13e/0x14b
> [  186.976012]  [] do_exit+0x222/0x7c6
> [  186.976012]  [] ? xen_restore_fl_direct_end+0x0/0x1
> [  186.976012]  [] ? arch_local_irq_restore+0xb/0xd
> [  186.976012]  [] ? lockdep_sys_exit_thunk+0x35/0x67
> [  186.976012]  [] do_group_exit+0x88/0xb6
> [  186.976012]  [] sys_exit_group+0x17/0x1b
> [  186.976012]  [] system_call_fastpath+0x16/0x1b
> [  186.976012] Code: 8d 7d 18 e8 c3 8a 00 00 41 c7 45 08 00 00 00 00 48 89 df e8 0d e9 ff ff 48 85 c0 48 89 c3 75 f0 49 83 bc 24 98 01 00 00 00 74 02 <0f> 0b 48 83 c4 18 5b 41 5c 41 5d c9 c3 55 48 89 e5 41 54 53 48
> [  186.976012] RIP  [] exit_mmap+0x10c/0x119
> [  186.976012]  RSP
> [  186.976012] ---[ end trace c0f4eff4054a67e4 ]---
> [  186.976012] Fixing recursive fault but reboot is needed!
>
> Message from syslogd@fedora14 at Nov 17 00:47:40 ...
> kernel:[  186.975228] ------------[ cut here ]------------
>
> Message from syslogd@fedora14 at Nov 17 00:47:40 ...
> kernel:[  186.975254] invalid opcode: 0000 [#1] SMP
>
> Message from syslogd@fedora14 at Nov 17 00:47:40 ...
> kernel:[  186.975269] last sysfs file: /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map
>
> Message from syslogd@fedora14 at Nov 17 00:47:40 ...
> kernel:[  186.976012] Stack:
>
> Message from syslogd@fedora14 at Nov 17 00:47:40 ...
> kernel:[  186.976012] Call Trace:
>
> Message from syslogd@fedora14 at Nov 17 00:47:40 ...
> kernel:[  186.976012] Code: 8d 7d 18 e8 c3 8a 00 00 41 c7 45 08 00 00 00 00 48 89 df e8 0d e9 ff ff 48 85 c0 48 89 c3 75 f0 49 83 bc 24 98 01 00 00 00 74 02 <0f> 0b 48 83 c4 18 5b 41 5c 41 5d c9 c3 55 48 89 e5 41 54 53 48