All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jeremy Fitzhardinge <jeremy@goop.org>
To: Daniel Stodden <daniel.stodden@citrix.com>
Cc: "Xen-devel@lists.xensource.com" <Xen-devel@lists.xensource.com>
Subject: Re: blktap: Sync with XCP, dropping zero-copy.
Date: Mon, 15 Nov 2010 10:27:12 -0800	[thread overview]
Message-ID: <4CE17B80.7080606@goop.org> (raw)
In-Reply-To: <1289620544.11102.373.camel@agari.van.xensource.com>

On 11/12/2010 07:55 PM, Daniel Stodden wrote:
>> Surely this can be dealt with by replacing the mapped granted page with
>> a local copy if the refcount is elevated?
> Yeah. We briefly discussed this when the problem started to pop up
> (again).
>
> I had a patch, for blktap1 in XS 5.5 iirc, which would fill mapping with
> a dummy page mapped in. You wouldn't need a copy, a R/O zero map easily
> does the job.

Hm, I'd be a bit concerned that that might cause problems if used
generically.  If the page is being used RO, then replacing with a copy
shouldn't be a problem, but getting a consistent snapshot of a mutable
page is obviously going to be a problem.

>  On UP that'd be just a matter of disabling interrupts for
> a while.

Disable for what purpose?  You mean to do an exchange of the mapping, or
something else?

> I dropped it after it became clear that XS was moving to SMP, where one
> would end up with a full barrier to orchestrate the TLB flushes
> everywhere. Now, the skb runs prone to crash all run in softirq context,
> I wouldn't exactly expect a huge performance win from syncing on that
> kind of thing across all nodes, compared to local memcpy. Nor would I
> want storage stuff to touch locks shared with TCP, that's just not our
> business. Correct me if I'm mistaken.

I don't follow what your high-level concern is here.   If we update the
pte to unmap the granted page, then its up to Xen to arrange for any TLB
flushes to make sure the page is no longer accessible to the domain.  We
don't need to do anything explicit, and its independent of whether we're
simply unmapping the page or replacing the mapping with another one (ie,
any TLB flushing necessary is already going on).

I was also under the impression that this is a relatively rare event
rather than something that's going to be necessary for every page, so
the overhead should be minimal.

>>> If zero-copy becomes more attractive again, the plan would be to
>>> rather use grantdev in userspace, such as a filter driver for tapdisk
>>> instead. Until then, there's presumably a notable difference in L2
>>> cache footprint. Then again, there's also a whole number of cycles not
>>> spent in redundant hypercalls now, to deal with the pseudophysical
>>> map.
>> Frankly, I think the idea of putting blkback+tapdisk entirely in
>> usermode is all upside with no (obvious) downsides.  It:
>>
>>    1. avoids having to upstream anything
>>    2. avoids having to upstream anything
>>    3. avoids having to upstream anything
>>
>>    4. gets us back zero-copy (if that's important)
> (No, unfortunately. DIO on a granted frame under blktap would be as
> vulnerable as DIO on a granted frame under a userland blkback, userland
> won't buy us anthing as far as the zcopy side of things is concerned).

Why's that?  Are you talking about the stray page reference
vulnerability, or something else?  You're right that it doesn't really
help with stray references because its still kernel code which ends up
doing the dereference rather than usermode.

>>    5. makes the IO path nice and straightforward
>>    6. seems to address all the other problems you mentioned
> I'm not at all against a userland blkback. Just wouldn't go as far as
> considering this a silver bullet.
>
> The main thing I'm scared of is ending up hacking cheesy stuff into the
> user ABI to take advantage of things immediately available to FSes on
> the bio layer, but harder (or at least slower) to get made available to
> userland.

But that hasn't been a problem for tapdisk so far.  Given that it is
using DIO then any completed write has at least been submitted to a
storage device, so then its just a question of how to make sure that any
buffers are fully flushed, which would be fdatasync() I guess.

Besides, our requirements are hardly unique; if there's some clear need
for a new API, then the course of action is to work with broader Linux
community to work out what it should be and how it should be
implemented.  There's no need for "hacking cheesy stuff" - there's been
enough of that already.

> DISCARD support is one currently hot example, do you see that in AIO
> somewhere? Ok, probably a good thing for everybody anyway, so maybe
> patching that is useful work. But it's extra work right now and probably
> no more fun to maintain than blktap is.

fallocate() is being extended to allow hole-punching in files.  I don't
know what work is being done to do discard on random block devices, but
that's clearly a generically useful thing to have.

> The second issue I see is the XCP side of things. XenServer got a lot of
> benefit out of blktap2, and particularly because of the tapdevs. It
> promotes a fairly rigorous split between a blkback VBD, controlled by
> the agent, and tapdevs, controlled by XS's storage manager.
>
> That doesn't prevent blkback to go into userspace, but it better won't
> share a process with some libblktap, which in turn would better not be
> controlled under the same xenstore path.


Could you elaborate on this?  What was the benefit?


>> The only caveat is the stray unmapping problem, but I think gntdev can
>> be modified to deal with that pretty easily.
> Not easier than anything else in kernel space, but when dealing only
> with the refcounts, that's as as good a place as anwhere else, yes.

I think the refcount test is pretty straightforward - if the refcount is
1, then we're the sole owner of the page and we don't need to worry
about any other users.  If its > 1, then somebody else has it, and we
need to make sure it no longer refers to a granted page (which is just a
matter of doing a set_pte_atomic() to remap from present to present).

Then we'd have a set of frames whose lifetimes are being determined by
some other subsystem.  We can either maintain a list of them and poll
waiting for them to become free, or just release them and let them be
managed by the normal kernel lifetime rules (which requires that the
memory attached to them be completely normal, of course).

    J

  parent reply	other threads:[~2010-11-15 18:27 UTC|newest]

Thread overview: 40+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-11-12 23:31 blktap: Sync with XCP, dropping zero-copy Daniel Stodden
2010-11-12 23:31 ` [PATCH 1/5] blktap: Manage segment buffers in mempools Daniel Stodden
2010-11-12 23:31 ` [PATCH 2/5] blktap: Make VMAs non-foreign and bounce buffered Daniel Stodden
2010-11-12 23:31 ` [PATCH 3/5] blktap: Add queue access macros Daniel Stodden
2010-11-12 23:31 ` [PATCH 4/5] blktap: Forward port to 2.6.32 Daniel Stodden
2010-11-12 23:31 ` [PATCH 5/5] Fix compilation format warning in drivers/xen/blktap/device.c Daniel Stodden
2010-11-13  0:50 ` blktap: Sync with XCP, dropping zero-copy Jeremy Fitzhardinge
2010-11-13  3:56   ` Daniel Stodden
     [not found]   ` <1289620544.11102.373.camel@agari.van.xensource.com>
2010-11-15 18:27     ` Jeremy Fitzhardinge [this message]
2010-11-15 19:19       ` Ian Campbell
2010-11-15 19:34         ` Jeremy Fitzhardinge
2010-11-15 20:07           ` Ian Campbell
2010-11-16  0:43             ` Daniel Stodden
2010-11-16  9:13       ` Daniel Stodden
2010-11-16 12:17         ` Stefano Stabellini
2010-11-16 16:11           ` Konrad Rzeszutek Wilk
2010-11-16 16:16             ` Stefano Stabellini
2010-11-17  2:40           ` Daniel Stodden
2010-11-17 12:35             ` Stefano Stabellini
2010-11-17 15:34               ` Jonathan Ludlam
2010-11-16 13:00         ` Dave Scott
2010-11-16 14:48           ` Stefano Stabellini
2010-11-16 17:56         ` Jeremy Fitzhardinge
2010-11-16 21:28           ` Daniel Stodden
2010-11-17 17:04             ` Ian Campbell
2010-11-17 19:27               ` Daniel Stodden
2010-11-18 13:56                 ` Ian Campbell
2010-11-18 19:37                   ` Daniel Stodden
2010-11-19 10:57                     ` Ian Campbell
2010-11-17 18:00             ` Jeremy Fitzhardinge
2010-11-17 20:21               ` Daniel Stodden
2010-11-17 21:02                 ` Jeremy Fitzhardinge
2010-11-17 21:57                   ` Daniel Stodden
2010-11-17 22:14                     ` Jeremy Fitzhardinge
     [not found]                       ` <1290035201.11102.1577.camel@agari.van.xensource.com>
     [not found]                         ` <4CE46A03.3010104@goop.org>
     [not found]                           ` <1290040898.11102.1709.camel@agari.van.xensource.com>
2010-11-18  2:29                             ` Jeremy Fitzhardinge
2010-11-17 23:32                     ` Daniel Stodden
     [not found] <20101116215621.59FC2CF782@homiemail-mx7.g.dreamhost.com>
2010-11-17 16:36 ` Andres Lagar-Cavilla
2010-11-17 17:52   ` Jeremy Fitzhardinge
2010-11-17 19:47     ` Andres Lagar-Cavilla
2010-11-17 23:42   ` Daniel Stodden

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4CE17B80.7080606@goop.org \
    --to=jeremy@goop.org \
    --cc=Xen-devel@lists.xensource.com \
    --cc=daniel.stodden@citrix.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.