archive mirror
 help / color / mirror / Atom feed
Subject: Re: Make pipe data structure be a circular list of pages, rather than
Date: 19 Jan 2005 21:12:50 -0000	[thread overview]
Message-ID: <> (raw)
In-Reply-To: <> wrote:
> You seem to have misunderstood the original proposal, it had little to do
> with file descriptors.  The idea was that different subsystems in the OS
> export pull() and push() interfaces and you use them.  The file decriptors
> are only involved if you provide them with those interfaces(which you
> would, it makes sense).  You are hung up on the pipe idea, the idea I
> see in my head is far more generic.  Anything can play and you don't
> need a pipe at all, you need 

I was fantasizing about more generality as well.  In particular, my
original fantasy allowed data to, in theory and with compatible devices,
be read from one PCI device, passed through a series of pipes, and
written to another without ever hitting main memory - only one PCI-PCI
DMA operation performed.

A slightly more common case would be zero-copy, where data gets DMAed
from the source into memory and from memory to the destination.
That's roughly Larry's pull/push model.

The direct DMA case requires buffer memory on one of the two cards.
(And would possibly be a fruitful source of hardware bugs, since I
suspect that Windows Doesn't Do That.)

Larry has the "full-page gift" optimization, which could in theory allow
data to be "renamed" straight into the page cache.  However, the page also
has to be properly aligned and not in some awkward highmem address space.
I'm not currently convinced that this would happen often enough to be
worth the extra implementation hair, but feel free to argue otherwise.

(And Larry, what's the "loan" bit for?   When is loan != !gift ?)

The big gotcha, as Larry's original paper properly points out, is handling
write errors.  We need some sort of "unpull" operation to put data back
of the destination can't accept it.  Otherwise, what do you return from
splice()?  If the source is seekable, that's easy, and a pipe isn't much
harder, but for a general character device, we need a bit of help.

The way I handle this in user-level software, to connect modules that
provide data buffering, is to split "pull" into two operations.
"Show me some buffered data" and "Consume some buffered data".
The first returns a buffer pointer (to a const buffer) and length.
(The length must be non-zero except at EOF, but may be 1 byte.)

The second advances the buffer pointer.  The advance distance must be
no more than the length returned previously, but may be less.

In typical single-threaded code, I allow not calling the advance function
or calling it multiple times, but they're typically called 1:1, and
requiring that would give you a good place to do locking.  A character
device, network stream, or the like, would acquire an exclusive lock.
A block device or file would not need to (or could make it a shared lock
or refcount).

The same technique can be used when writing data to a module that does
buffering: "Give me some bufer space" and "Okay, I filled some part of
it in."  In some devices, the latter call can fail, and the writer has
to be able to cope with that.

By allowing both of those (and, knowing that PCI writes are more efficient
than PCI reads, giving the latter preference if both are available),
you can do direct device-to-device copies on splice().

The problem with Larry's separate pull() and push() calls is that you
then need a user-visible abstraction for "pulled but not yet pushed"
data, which seems lile unnecessary abstraction violation.

The main infrastructure hassle you need to support this *universally*
is the unget() on "character devices" like pipes and network sockets.
Ideally, it would be some generic buffer front end that could be used
by the device for normal data as well as the special case.

Ooh.  Need to think.  If there's a -EIO problem on one of the file
descriptors, how does the caller know which one?  That's an argument for
separate pull and push (although the splice() library routine still
has the problem).  Any suggestions?  Does userland need to fall
back on read()/write() for a byte?

  parent reply	other threads:[~2005-01-19 21:13 UTC|newest]

Thread overview: 51+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2005-01-08  8:25 Make pipe data structure be a circular list of pages, rather linux
2005-01-08 18:41 ` Linus Torvalds
2005-01-08 21:47   ` Alan Cox
2005-01-13 21:46   ` Ingo Oeser
2005-01-13 22:32     ` Linus Torvalds
2005-01-14 21:03       ` Ingo Oeser
2005-01-14 21:29         ` Linus Torvalds
2005-01-14 22:12           ` Ingo Oeser
2005-01-14 22:44             ` Linus Torvalds
2005-01-14 23:34               ` Ingo Oeser
2005-01-15  0:16                 ` Linus Torvalds
2005-01-16  2:59                   ` Linus Torvalds
2005-01-17 16:03                     ` Ingo Oeser
2005-01-19 21:12                     ` linux [this message]
2005-01-20  2:06                       ` Make pipe data structure be a circular list of pages, rather than Robert White
  -- strict thread matches above, loose matches on Subject: below --
2005-01-20  2:14 Robert White
     [not found] <Pine.LNX.4.44.0501091946020.3620-100000@localhost.localdomain>
     [not found] ` <>
     [not found]   ` <>
2005-01-12 19:50     ` Davide Libenzi
2005-01-12 20:10       ` Linus Torvalds
2005-01-07 14:30 Oleg Nesterov
2005-01-07 15:45 ` Alan Cox
2005-01-07 17:23   ` Linus Torvalds
2005-01-08 18:25     ` Hugh Dickins
2005-01-08 18:54       ` Linus Torvalds
2005-01-07 16:17 ` Linus Torvalds
2005-01-07 16:06   ` Alan Cox
2005-01-07 17:33     ` Linus Torvalds
2005-01-07 17:48       ` Linus Torvalds
2005-01-07 20:59         ` Mike Waychison
2005-01-07 23:46           ` Chris Friesen
2005-01-08 21:38             ` Lee Revell
2005-01-08 21:51               ` Linus Torvalds
2005-01-08 22:02                 ` Lee Revell
2005-01-08 22:29                 ` Davide Libenzi
2005-01-09  4:07                 ` Linus Torvalds
2005-01-09 23:19                   ` Davide Libenzi
2005-01-14 10:15             ` Peter Chubb
2005-01-07 21:59         ` Linus Torvalds
2005-01-07 22:53           ` Diego Calleja
2005-01-07 23:15             ` Linus Torvalds
2005-01-10 23:23         ` Robert White
2005-01-07 17:45     ` Chris Friesen
2005-01-07 16:39   ` Davide Libenzi
2005-01-07 17:09     ` Linus Torvalds
2005-08-18  6:07   ` Coywolf Qi Hunt
     [not found] <>
2005-01-07  3:41 ` William Lee Irwin III
2005-01-07  6:35   ` Linus Torvalds
2005-01-07  6:37     ` Linus Torvalds
2005-01-19 16:29       ` Larry McVoy
2005-01-19 17:14         ` Linus Torvalds
2005-01-19 19:01           ` Larry McVoy
2005-01-20  0:01             ` Linus Torvalds

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).