All of lore.kernel.org
 help / color / mirror / Atom feed
* splice() from /dev/zero to a pipe does not work (5.9+)
@ 2021-05-07 18:05 Colin Ian King
  2021-05-07 18:21 ` Kees Cook
  0 siblings, 1 reply; 7+ messages in thread
From: Colin Ian King @ 2021-05-07 18:05 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Kees Cook, Al Viro, linux-fsdevel

Hi,

While doing some micro benchmarking with stress-ng I discovered that
since linux 5.9 the splicing from /dev/zero to a pipe now fails with
-EINVAL.

I bisected this down to the following commit:

36e2c7421f02a22f71c9283e55fdb672a9eb58e7 is the first bad commit
commit 36e2c7421f02a22f71c9283e55fdb672a9eb58e7
Author: Christoph Hellwig <hch@lst.de>
Date:   Thu Sep 3 16:22:34 2020 +0200

    fs: don't allow splice read/write without explicit ops

I'm not sure if this has been reported before, or if it's intentional
behavior or not. As it stands, it's a regression in the stress-ng splice
test case.

Prior to that commit, splicing worked from /dev/zero to a pipe. Below is
an example of the reproducer:

--- reproducer below ---

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

#define FAIL(x) { perror(x); exit(1); }

int main(void)
{
        int fd_in, fd_out, fds[2];
        ssize_t ret;
        loff_t off_in, off_out;
        const size_t len = 4096;

        if ((fd_in = open("/dev/zero", O_RDONLY)) < 0)
                FAIL("open /dev/zero failed");

        /*
         *   /dev/zero -> pipe splice -> pipe splice -> /dev/null
         */
        if (pipe(fds) < 0)
                FAIL("pipe FAILed\n");

        if ((fd_out = open("/dev/null", O_WRONLY)) < 0)
                FAIL("open /dev/null failed");

        ret = splice(fd_in, NULL, fds[1], NULL, len, SPLICE_F_MOVE);
        if (ret < 0)
                FAIL("splice failed");

        ret = splice(fds[0], NULL, fd_out, NULL, len, SPLICE_F_MOVE);
        if (ret < 0)
                FAIL("splice failed");

        return 0;
}

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: splice() from /dev/zero to a pipe does not work (5.9+)
  2021-05-07 18:05 splice() from /dev/zero to a pipe does not work (5.9+) Colin Ian King
@ 2021-05-07 18:21 ` Kees Cook
  2021-05-07 19:06   ` Linus Torvalds
  2021-05-12 15:13   ` Colin Ian King
  0 siblings, 2 replies; 7+ messages in thread
From: Kees Cook @ 2021-05-07 18:21 UTC (permalink / raw)
  To: Colin Ian King
  Cc: Christoph Hellwig, Al Viro, Linus Torvalds, Johannes Berg,
	linux-fsdevel, LKML

On Fri, May 07, 2021 at 07:05:51PM +0100, Colin Ian King wrote:
> Hi,
> 
> While doing some micro benchmarking with stress-ng I discovered that
> since linux 5.9 the splicing from /dev/zero to a pipe now fails with
> -EINVAL.
> 
> I bisected this down to the following commit:
> 
> 36e2c7421f02a22f71c9283e55fdb672a9eb58e7 is the first bad commit
> commit 36e2c7421f02a22f71c9283e55fdb672a9eb58e7
> Author: Christoph Hellwig <hch@lst.de>
> Date:   Thu Sep 3 16:22:34 2020 +0200
> 
>     fs: don't allow splice read/write without explicit ops
> 
> I'm not sure if this has been reported before, or if it's intentional
> behavior or not. As it stands, it's a regression in the stress-ng splice
> test case.

The general loss of generic splice read/write is known. Here's some
early conversations I was involved in:

https://lore.kernel.org/lkml/20200818200725.GA1081@lst.de/
https://lore.kernel.org/lkml/202009181443.C2179FB@keescook/
https://lore.kernel.org/lkml/20201005204517.2652730-1-keescook@chromium.org/

And it's been getting re-implemented in individual places:

$ git log --oneline --no-merges --grep 36e2c742
42984af09afc jffs2: Hook up splice_write callback
a35d8f016e0b nilfs2: make splice write available again
f8ad8187c3b5 fs/pipe: allow sendfile() to pipe again
f2d6c2708bd8 kernfs: wire up ->splice_read and ->splice_write
9bb48c82aced tty: implement write_iter
dd78b0c483e3 tty: implement read_iter
14e3e989f6a5 proc mountinfo: make splice available again
c1048828c3db orangefs: add splice file operations
960f4f8a4e60 fs: 9p: add generic splice_write file operation
cf03f316ad20 fs: 9p: add generic splice_read file operations
06a17bbe1d47 afs: Fix copy_file_range()

So the question is likely, "do we want this for /dev/zero?"

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: splice() from /dev/zero to a pipe does not work (5.9+)
  2021-05-07 18:21 ` Kees Cook
@ 2021-05-07 19:06   ` Linus Torvalds
  2021-05-07 19:17     ` Al Viro
  2021-05-12 15:13   ` Colin Ian King
  1 sibling, 1 reply; 7+ messages in thread
From: Linus Torvalds @ 2021-05-07 19:06 UTC (permalink / raw)
  To: Kees Cook
  Cc: Colin Ian King, Christoph Hellwig, Al Viro, Johannes Berg,
	linux-fsdevel, LKML

On Fri, May 7, 2021 at 11:21 AM Kees Cook <keescook@chromium.org> wrote:
>
> So the question is likely, "do we want this for /dev/zero?"

Well, /dev/zero should at least be safe, and I guess it's actually
interesting from a performance testing standpoint (ie useful for some
kind of "what is the overhead of the splice code with no data copy").

So I'll happily take a sane patch for /dev/zero, although I think it
probably only makes sense if it's made to use the zero page explicitly
(ie exactly for that "no data copy testing" case).

So very much *not* using generic_file_splice_read(), even if that
might be the one-liner.

/dev/zero should probably also use the (already existing)
splice_write_null() function for the .splice_write case.

Anybody willing to look into this? My gu feel is that it *should* be easy to do.

That said - looking at the current 'pipe_zero()', it uses
'push_pipe()' to actually allocation regular pages, and then clear
them.

Which is basically what a generic_file_splice_read() would do, and it
feels incredibly pointless and stupid to me.

I *think* we should be able to just do something like

    len = size;
    while (len > 0) {
        struct pipe_buffer *buf;
        unsigned int tail = pipe->tail;
        unsigned int head = pipe->head;
        unsigned int mask = pipe->ring_size - 1;

        if (pipe_full(head, tail, pipe->max_usage))
            break;
        buf = &pipe->bufs[iter_head & p_mask];
        buf->ops = &zero_pipe_buf_ops;
        buf->page = ZERO_PAGE(0);
        buf->offset = 0;
        buf->len = min_t(ssize_t, len, PAGE_SIZE);
        len -= buf->len;
        pipe->head = head+1;
    }
    return size - len;

but honestly, I haven't thought a lot about it.

Al? This is another of those "right up your alley" things.

Maybe it's not worth it, and just using generic_file_splice_read() is
the way to go, but I do get the feeling that if we are splicing
/dev/null, the whole _point_ of it is about benchmarking, not "make it
work".

            Linus

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: splice() from /dev/zero to a pipe does not work (5.9+)
  2021-05-07 19:06   ` Linus Torvalds
@ 2021-05-07 19:17     ` Al Viro
  2021-05-07 19:29       ` Linus Torvalds
  0 siblings, 1 reply; 7+ messages in thread
From: Al Viro @ 2021-05-07 19:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kees Cook, Colin Ian King, Christoph Hellwig, Johannes Berg,
	linux-fsdevel, LKML

On Fri, May 07, 2021 at 12:06:31PM -0700, Linus Torvalds wrote:

> That said - looking at the current 'pipe_zero()', it uses
> 'push_pipe()' to actually allocation regular pages, and then clear
> them.
> 
> Which is basically what a generic_file_splice_read() would do, and it
> feels incredibly pointless and stupid to me.
> 
> I *think* we should be able to just do something like
> 
>     len = size;
>     while (len > 0) {
>         struct pipe_buffer *buf;
>         unsigned int tail = pipe->tail;
>         unsigned int head = pipe->head;
>         unsigned int mask = pipe->ring_size - 1;
> 
>         if (pipe_full(head, tail, pipe->max_usage))
>             break;
>         buf = &pipe->bufs[iter_head & p_mask];
>         buf->ops = &zero_pipe_buf_ops;
>         buf->page = ZERO_PAGE(0);
>         buf->offset = 0;
>         buf->len = min_t(ssize_t, len, PAGE_SIZE);
>         len -= buf->len;
>         pipe->head = head+1;
>     }
>     return size - len;
> 
> but honestly, I haven't thought a lot about it.
> 
> Al? This is another of those "right up your alley" things.

Umm...  That would do wonders to anything that used to do
copy_to_user()/clear_user()/copy_to_user() and got converted
to copy_to_iter()/iov_iter_zero()/copy_to_iter()...

Are you sure we can shove zero page into pipe, anyway?
IIRC, get_page()/put_page() on that is not allowed, and
I'm not at all sure that nothing in e.g. fuse splice-related
logics would go ahead an do just that.  Or am I confused
about the page refcounting for those?

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: splice() from /dev/zero to a pipe does not work (5.9+)
  2021-05-07 19:17     ` Al Viro
@ 2021-05-07 19:29       ` Linus Torvalds
  2021-05-07 20:31         ` Al Viro
  0 siblings, 1 reply; 7+ messages in thread
From: Linus Torvalds @ 2021-05-07 19:29 UTC (permalink / raw)
  To: Al Viro
  Cc: Kees Cook, Colin Ian King, Christoph Hellwig, Johannes Berg,
	linux-fsdevel, LKML

On Fri, May 7, 2021 at 12:17 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Umm...  That would do wonders to anything that used to do
> copy_to_user()/clear_user()/copy_to_user() and got converted
> to copy_to_iter()/iov_iter_zero()/copy_to_iter()...

I didn't mean for iov_iter_zero doing this - only splice_read_zero().

> Are you sure we can shove zero page into pipe, anyway?
> IIRC, get_page()/put_page() on that is not allowed,

That's what the

    buf->ops = &zero_pipe_buf_ops;

is for. The zero_pipe_buf_ops would have empty get and release
functions, and a 'steal' function that always returns false.

That's how the pipe pages are supposed to work: there are people who
put non-page data (ie things like skbuff allocations etc) into a
splice pipe buffer. It's why we have those "ops" pointers.

              Linus

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: splice() from /dev/zero to a pipe does not work (5.9+)
  2021-05-07 19:29       ` Linus Torvalds
@ 2021-05-07 20:31         ` Al Viro
  0 siblings, 0 replies; 7+ messages in thread
From: Al Viro @ 2021-05-07 20:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kees Cook, Colin Ian King, Christoph Hellwig, Johannes Berg,
	linux-fsdevel, LKML

On Fri, May 07, 2021 at 12:29:44PM -0700, Linus Torvalds wrote:
> On Fri, May 7, 2021 at 12:17 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Umm...  That would do wonders to anything that used to do
> > copy_to_user()/clear_user()/copy_to_user() and got converted
> > to copy_to_iter()/iov_iter_zero()/copy_to_iter()...
> 
> I didn't mean for iov_iter_zero doing this - only splice_read_zero().
> 
> > Are you sure we can shove zero page into pipe, anyway?
> > IIRC, get_page()/put_page() on that is not allowed,
> 
> That's what the
> 
>     buf->ops = &zero_pipe_buf_ops;
> 
> is for. The zero_pipe_buf_ops would have empty get and release
> functions, and a 'steal' function that always returns false.
> 
> That's how the pipe pages are supposed to work: there are people who
> put non-page data (ie things like skbuff allocations etc) into a
> splice pipe buffer. It's why we have those "ops" pointers.

Supposed to - sure, but I'd like to verify that they actually do work
that way before we go there.  Let me RTFS a bit...

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: splice() from /dev/zero to a pipe does not work (5.9+)
  2021-05-07 18:21 ` Kees Cook
  2021-05-07 19:06   ` Linus Torvalds
@ 2021-05-12 15:13   ` Colin Ian King
  1 sibling, 0 replies; 7+ messages in thread
From: Colin Ian King @ 2021-05-12 15:13 UTC (permalink / raw)
  To: Kees Cook
  Cc: Christoph Hellwig, Al Viro, Linus Torvalds, Johannes Berg,
	linux-fsdevel, LKML

On 07/05/2021 19:21, Kees Cook wrote:
> On Fri, May 07, 2021 at 07:05:51PM +0100, Colin Ian King wrote:
>> Hi,
>>
>> While doing some micro benchmarking with stress-ng I discovered that
>> since linux 5.9 the splicing from /dev/zero to a pipe now fails with
>> -EINVAL.
>>
>> I bisected this down to the following commit:
>>
>> 36e2c7421f02a22f71c9283e55fdb672a9eb58e7 is the first bad commit
>> commit 36e2c7421f02a22f71c9283e55fdb672a9eb58e7
>> Author: Christoph Hellwig <hch@lst.de>
>> Date:   Thu Sep 3 16:22:34 2020 +0200
>>
>>     fs: don't allow splice read/write without explicit ops
>>
>> I'm not sure if this has been reported before, or if it's intentional
>> behavior or not. As it stands, it's a regression in the stress-ng splice
>> test case.
> 
> The general loss of generic splice read/write is known. Here's some
> early conversations I was involved in:
> 
> https://lore.kernel.org/lkml/20200818200725.GA1081@lst.de/
> https://lore.kernel.org/lkml/202009181443.C2179FB@keescook/
> https://lore.kernel.org/lkml/20201005204517.2652730-1-keescook@chromium.org/
> 
> And it's been getting re-implemented in individual places:
> 
> $ git log --oneline --no-merges --grep 36e2c742
> 42984af09afc jffs2: Hook up splice_write callback
> a35d8f016e0b nilfs2: make splice write available again
> f8ad8187c3b5 fs/pipe: allow sendfile() to pipe again
> f2d6c2708bd8 kernfs: wire up ->splice_read and ->splice_write
> 9bb48c82aced tty: implement write_iter
> dd78b0c483e3 tty: implement read_iter
> 14e3e989f6a5 proc mountinfo: make splice available again
> c1048828c3db orangefs: add splice file operations
> 960f4f8a4e60 fs: 9p: add generic splice_write file operation
> cf03f316ad20 fs: 9p: add generic splice_read file operations
> 06a17bbe1d47 afs: Fix copy_file_range()

Ah..so this explains why copy_file_range() also returns -EINVAL now on
some file systems, such a minix since that uses splicing too via
do_splice_direct().  :-/

> 
> So the question is likely, "do we want this for /dev/zero?"
> 


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2021-05-12 15:29 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-07 18:05 splice() from /dev/zero to a pipe does not work (5.9+) Colin Ian King
2021-05-07 18:21 ` Kees Cook
2021-05-07 19:06   ` Linus Torvalds
2021-05-07 19:17     ` Al Viro
2021-05-07 19:29       ` Linus Torvalds
2021-05-07 20:31         ` Al Viro
2021-05-12 15:13   ` Colin Ian King

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.