All of lore.kernel.org
 help / color / mirror / Atom feed
From: Linus Torvalds <torvalds@linux-foundation.org>
To: Andy Lutomirski <luto@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>,
	Matthew Wilcox <willy@infradead.org>,
	Stefan Metzmacher <metze@samba.org>, Jens Axboe <axboe@kernel.dk>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Linux API Mailing List <linux-api@vger.kernel.org>,
	io-uring <io-uring@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Samba Technical <samba-technical@lists.samba.org>
Subject: Re: copy on write for splice() from file to pipe?
Date: Fri, 10 Feb 2023 08:33:56 -0800	[thread overview]
Message-ID: <CAHk-=wj66F6CdJUAAjqigXMBy7gHquFMzPNAwKCgkrb2mF6U7w@mail.gmail.com> (raw)
In-Reply-To: <CALCETrWjJisipSJA7tPu+h6B2gs3m+g0yPhZ4z+Atod+WOMkZg@mail.gmail.com>

On Fri, Feb 10, 2023 at 7:15 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> Frankly, I really don't like having non-immutable data in a pipe.

That statement is completely nonsensical.

A pipe is a kernel buffer. If you want the buffer to be immutable, you
do "read()" and "write()" on it. End of story.

In contrast, "splice()" is literally the "map" operation. It's
"mmap()", without the "m", because it turns out that memory mapping
has a few problems:

 (a) mmap fundamentally only works on a page granularity

 (b) mmap has huge setup and teardown costs with page tables and TLB's

and so splice() is just basically "strange mmap with that kernel
buffer that is pipe"

Really. That is what spice *is*. There's absolutely no question about it.

It has some advantages over mmap, in that because it's a software
mapping, we can "map" partial pages, which means that you can also map
data that might not be in full pages (it might be an incoming network
buffer, for example).

It has a lot of disadvantages over mmap too, of course. Not using
hardware means that it doesn't really show up as a linear mapping, and
you don't get the nice hardware lookup accelerations - but it also
means that you don't have the downsides (ie TLB costs etc).

So I'm not saying that it is the same thing as "mmap", but I very much
_am_ saying that there's a very real and direct similarity. There are
three fundamental IO models in Unix: read, write, and mmap. And mmap()
is very much a "you get a direct window into something that can change
under you".  And splice() is exactly that same thing.

It was literally designed to be "look, we want zero-copy networking,
and we could do 'sendfile()' by mmap'ing the file, but mmap - and
particularly munmap - is too expensive, so we map things into kernel
buffers instead".

So saying "I really don't like having non-immutable data in a pipe" is
complete nonsense. It's syntactically correct English, but it makes no
conceptual sense.

You can say "I don't like 'splice()'". That's fine. I used to think
splice was a really cool concept, but I kind of hate it these days.
Not liking splice() makes a ton of sense.

But given splice, saying "I don't like non-immutable data" really is
complete nonsense.

If you want a stable buffer, use read() and write(). It's that simple.
If you want to send data from a file to the network, and want a stable
buffer in between the two, then "read()" and "write()" is *exactly*
what you should do.

With read and write, there's no mmap()/munmap() overhead of the file,
and you already have the buffer (iwe call it "user address space").
The only downside is the extra copy.

So if  you want to send stable, unchanging file contents to the
network, there is absolutely *no* reason to ever involve a pipe at
all, and you should entirely ignore splice().

The *only* reason to ever use splice() is because you don't want to
copy data, and just want a reference to it, and want to keep it all in
kernel space, because the kernel<->user boundary ends up either
requiring copies, or that page alignment, and is generally fairly
expensive.

But once you decide to go that way, you need to understand that you
don't have "immutable data". You asked for a reference, you got a
reference, and it *will* change.

That's not something specific to "splice()". It's fundamental to the
whole *concept* of zero-copy. If you don't want copies, and the source
file changes, then you see those changes.

So saying "I really don't like having non-immutable data in a pipe"
really is nonsense. Don't use splice. Or do, and realize that it's a
*mapping*.

Because that is literally the whole - and ONLY - reason for splice in
the first place.

              Linus

  reply	other threads:[~2023-02-10 16:34 UTC|newest]

Thread overview: 64+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-09 13:55 copy on write for splice() from file to pipe? Stefan Metzmacher
2023-02-09 14:11 ` Matthew Wilcox
2023-02-09 14:29   ` Stefan Metzmacher
2023-02-09 16:41 ` Linus Torvalds
2023-02-09 19:17   ` Stefan Metzmacher
2023-02-09 19:36     ` Linus Torvalds
2023-02-09 19:48       ` Linus Torvalds
2023-02-09 20:33         ` Jeremy Allison
2023-02-10 20:45         ` Stefan Metzmacher
2023-02-10 20:51           ` Linus Torvalds
2023-02-10  2:16   ` Dave Chinner
2023-02-10  4:06     ` Dave Chinner
2023-02-10  4:44       ` Matthew Wilcox
2023-02-10  6:57         ` Dave Chinner
2023-02-10 15:14           ` Andy Lutomirski
2023-02-10 16:33             ` Linus Torvalds [this message]
2023-02-10 17:57               ` Andy Lutomirski
2023-02-10 18:19                 ` Jeremy Allison
2023-02-10 19:29                   ` Stefan Metzmacher
2023-02-10 18:37                 ` Linus Torvalds
2023-02-10 19:01                   ` Andy Lutomirski
2023-02-10 19:18                     ` Linus Torvalds
2023-02-10 19:27                       ` Jeremy Allison
2023-02-10 19:42                         ` Stefan Metzmacher
2023-02-10 19:42                         ` Linus Torvalds
2023-02-10 19:54                           ` Stefan Metzmacher
2023-02-10 19:29                       ` Linus Torvalds
2023-02-13  9:07                         ` Herbert Xu
2023-02-10 19:55                       ` Andy Lutomirski
2023-02-10 20:27                         ` Linus Torvalds
2023-02-10 20:32                           ` Jens Axboe
2023-02-10 20:36                             ` Linus Torvalds
2023-02-10 20:39                               ` Jens Axboe
2023-02-10 20:44                                 ` Linus Torvalds
2023-02-10 20:50                                   ` Jens Axboe
2023-02-10 21:14                                     ` Andy Lutomirski
2023-02-10 21:27                                       ` Jens Axboe
2023-02-10 21:51                                         ` Jens Axboe
2023-02-10 22:08                                           ` Linus Torvalds
2023-02-10 22:16                                             ` Jens Axboe
2023-02-10 22:17                                             ` Linus Torvalds
2023-02-10 22:25                                               ` Jens Axboe
2023-02-10 22:35                                                 ` Linus Torvalds
2023-02-10 22:51                                                   ` Jens Axboe
2023-02-11  3:18                                             ` Ming Lei
2023-02-11  6:17                                               ` Ming Lei
2023-02-11 14:13                                               ` Jens Axboe
2023-02-11 15:05                                                 ` Ming Lei
2023-02-11 15:33                                                   ` Jens Axboe
2023-02-11 18:57                                                     ` Linus Torvalds
2023-02-12  2:46                                                       ` Jens Axboe
2023-02-10  4:47       ` Linus Torvalds
2023-02-10  6:19         ` Dave Chinner
2023-02-10 17:23           ` Linus Torvalds
2023-02-10 17:47             ` Linus Torvalds
2023-02-13  9:28               ` Herbert Xu
2023-02-10 22:41             ` David Laight
2023-02-10 22:51               ` Jens Axboe
2023-02-13  9:30               ` Herbert Xu
2023-02-13  9:25           ` Herbert Xu
2023-02-13 18:01             ` Andy Lutomirski
2023-02-14  1:22               ` Herbert Xu
2023-02-17 23:13                 ` Andy Lutomirski
2023-02-20  4:54                   ` Herbert Xu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAHk-=wj66F6CdJUAAjqigXMBy7gHquFMzPNAwKCgkrb2mF6U7w@mail.gmail.com' \
    --to=torvalds@linux-foundation.org \
    --cc=axboe@kernel.dk \
    --cc=david@fromorbit.com \
    --cc=io-uring@vger.kernel.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@kernel.org \
    --cc=metze@samba.org \
    --cc=samba-technical@lists.samba.org \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.