linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
@ 2017-01-12 20:26 Alan J. Wylie
  2017-01-12 20:31 ` Al Viro
  2017-01-12 22:26 ` Linus Torvalds
  0 siblings, 2 replies; 42+ messages in thread
From: Alan J. Wylie @ 2017-01-12 20:26 UTC (permalink / raw)
  To: Thorsten Leemhuis, Al Viro, linux-kernel

[-- Attachment #1: message body text --]
[-- Type: text/plain, Size: 5702 bytes --]


Some time after 4.9.0 was released, I noticed that a cron job running
systemd-nspawn was hanging trying to send mail.

I have trimmed it down to a minimal demo:

/* from crontab */
48	19	*	*	*	date; /work/chroot-shared/test.sh; date

/* script ------8<------8<------8<------8<------8<------8< */
# cat /work/chroot-shared/test.sh
#!/bin/bash
set -x

date

systemd-nspawn -q -D /work/chroot.32 --register=no date

date
/* ------8<------8<------8<------8<------8<------8<------8< */

and bisected it to

$ git --no-pager show 523ac9afc73acdcf9f00bd35b6ffb4a7c624a7d7
commit 523ac9afc73acdcf9f00bd35b6ffb4a7c624a7d7
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Fri Sep 23 15:34:57 2016 -0400

    switch default_file_splice_read() to use of pipe-backed iov_iter

    we only use iov_iter_get_pages_alloc() and iov_iter_advance() -
    pages are filled by kernel_readv() via a kvec array (as we used
    to do all along), so iov_iter here is used only as a way of
    arranging for those pages to be in pipe.

    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

The system is an up-to-date gentoo. Config attached.

$ eix -ce postfix
[I] mail-mta/postfix (3.1.2-r1@21/12/16): A fast and secure drop-in replacement for sendmail
$ eix -ce vixie-cron
[I] sys-process/vixie-cron (4.1-r14@24/09/15): Paul Vixie's cron daemon, a fully featured crond implementation
$ eix -ce systemd
[I] sys-apps/systemd (231@06/12/16): System and service manager for Linux

Strace shows that the processes are hanging in write() and read() calls.

====================================================================================================
# ps axfu | grep -A5 "[c]ron"
root       990  0.0  0.0  18708  2068 ?        Ss   20:34   0:00 /usr/sbin/cron
root      6399  0.0  0.0  27292  2172 ?        S    21:10   0:00  \_ /usr/sbin/cron
root      6402  0.0  0.0   9864  2480 ?        Ss   21:10   0:00      \_ /bin/bash -c date; /work/chroot-shared/test.sh; date
root      6406  0.0  0.0   9868  2472 ?        S    21:10   0:00      |   \_ /bin/bash /work/chroot-shared/test.sh
root      6405  0.0  0.0  76152  5428 ?        S    21:10   0:00      \_ /usr/sbin/sendmail -FCronDaemon -odi -oem -oi -t
root      6412  0.0  0.0  76140  5600 ?        S    21:10   0:00          \_ /usr/sbin/postdrop -r
at         993  0.0  0.0  16840   160 ?        Ss   20:34   0:00 /usr/sbin/atd

# strace -p 6406
strace: Process 6406 attached
write(2, "+ date\n", 7
^Cstrace: Process 6406 detached
 <detached ...>
#

# strace -p 6405
strace: Process 6405 attached
read(0,
^Cstrace: Process 6405 detached
 <detached ...>
#

====================================================================================================

git bisect start
# good: [c8d2bc9bc39ebea8437fd974fdbc21847bb897a3] Linux 4.8
git bisect good c8d2bc9bc39ebea8437fd974fdbc21847bb897a3
# good: [8bba2e2e62cbf2db0d03e4de1204f7850bc45c44] Linux 4.8.15
git bisect good 8bba2e2e62cbf2db0d03e4de1204f7850bc45c44
# bad: [1001354ca34179f3db924eb66672442a173147dc] Linux 4.9-rc1
git bisect bad 1001354ca34179f3db924eb66672442a173147dc
# good: [41844e36206be90cd4d962ea49b0abc3612a99d0] Merge tag 'staging-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
git bisect good 41844e36206be90cd4d962ea49b0abc3612a99d0
# bad: [6b5e09a748ad0a0b198d0e268c7e689044bfe48a] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
git bisect bad 6b5e09a748ad0a0b198d0e268c7e689044bfe48a
# bad: [2c34ff14bf1d03a705f5400888ecac5b6400e981] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/egtvedt/linux-avr32
git bisect bad 2c34ff14bf1d03a705f5400888ecac5b6400e981
# good: [d042380886fb2fc6c4b0fcfe1214ba473769a8e9] Merge tag 'mfd-for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd
git bisect good d042380886fb2fc6c4b0fcfe1214ba473769a8e9
# good: [bc75450cc3db3485db1e289fef8c1028ba38296a] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid
git bisect good bc75450cc3db3485db1e289fef8c1028ba38296a
# good: [2eee010d092903ee95716b6c2fbd9d3289839aa4] Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
git bisect good 2eee010d092903ee95716b6c2fbd9d3289839aa4
# skip: [ca2431633b414d112c39ec69d85d96f090e49453] powerpc/64s: Consolidate Directed Privileged Doorbell 0xa00 interrupt
git bisect skip ca2431633b414d112c39ec69d85d96f090e49453
# skip: [d312603a44eb9dc0dbb0a642a61899bac188ab71] powerpc/Makefile: CROSS32AS is unused, remove it
git bisect skip d312603a44eb9dc0dbb0a642a61899bac188ab71
# skip: [35066c0d798906d46c352c3f12844d2a162d057d] powerpc/eeh: Export confirm_error_lock
git bisect skip 35066c0d798906d46c352c3f12844d2a162d057d
# bad: [a779638cf622f069a484e8802134cca3c6c71415] pipe: add pipe_buf_release() helper
git bisect bad a779638cf622f069a484e8802134cca3c6c71415
# good: [25869262ef7af24ccde988867ac3eb1c3d4b88d4] skb_splice_bits(): get rid of callback
git bisect good 25869262ef7af24ccde988867ac3eb1c3d4b88d4
# bad: [523ac9afc73acdcf9f00bd35b6ffb4a7c624a7d7] switch default_file_splice_read() to use of pipe-backed iov_iter
git bisect bad 523ac9afc73acdcf9f00bd35b6ffb4a7c624a7d7
# good: [241699cd72a8489c9446ae3910ddd243e9b9061b] new iov_iter flavour: pipe-backed
git bisect good 241699cd72a8489c9446ae3910ddd243e9b9061b
# good: [82c156f853840645604acd7c2cebcb75ed1b6652] switch generic_file_splice_read() to use of ->read_iter()
git bisect good 82c156f853840645604acd7c2cebcb75ed1b6652
# first bad commit: [523ac9afc73acdcf9f00bd35b6ffb4a7c624a7d7] switch default_file_splice_read() to use of pipe-backed iov_iter
====================================================================================================


[-- Attachment #2: config-4.9.3.bz2 --]
[-- Type: application/octet-stream, Size: 21088 bytes --]

[-- Attachment #3: .signature --]
[-- Type: text/plain, Size: 194 bytes --]



-- 
Alan J. Wylie                                          http://www.wylie.me.uk/

Dance like no-one's watching. / Encrypt like everyone is.
Security is inversely proportional to convenience

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-12 20:26 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn Alan J. Wylie
@ 2017-01-12 20:31 ` Al Viro
  2017-01-12 20:38   ` Alan J. Wylie
  2017-01-12 22:26 ` Linus Torvalds
  1 sibling, 1 reply; 42+ messages in thread
From: Al Viro @ 2017-01-12 20:31 UTC (permalink / raw)
  To: Alan J. Wylie; +Cc: Thorsten Leemhuis, linux-kernel

On Thu, Jan 12, 2017 at 08:26:52PM +0000, Alan J. Wylie wrote:
> 
> Some time after 4.9.0 was released, I noticed that a cron job running
> systemd-nspawn was hanging trying to send mail.

Check if 8e54cadab447 (fix default_file_splice_read()) fixes it.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-12 20:31 ` Al Viro
@ 2017-01-12 20:38   ` Alan J. Wylie
  0 siblings, 0 replies; 42+ messages in thread
From: Alan J. Wylie @ 2017-01-12 20:38 UTC (permalink / raw)
  To: Al Viro; +Cc: Thorsten Leemhuis, linux-kernel

at 20:31 on Thu 12-Jan-2017 Al Viro (viro@ZenIV.linux.org.uk) wrote:
> On Thu, Jan 12, 2017 at 08:26:52PM +0000, Alan J. Wylie wrote:
> > 
> > Some time after 4.9.0 was released, I noticed that a cron job running
> > systemd-nspawn was hanging trying to send mail.
> 
> Check if 8e54cadab447 (fix default_file_splice_read()) fixes it.

That's already in 4.9.3, which still has the problem.

-- 
Alan J. Wylie                                          http://www.wylie.me.uk/

Dance like no-one's watching. / Encrypt like everyone is.
Security is inversely proportional to convenience

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-12 20:26 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn Alan J. Wylie
  2017-01-12 20:31 ` Al Viro
@ 2017-01-12 22:26 ` Linus Torvalds
  2017-01-12 22:37   ` Al Viro
  2017-01-12 22:46   ` Alan J. Wylie
  1 sibling, 2 replies; 42+ messages in thread
From: Linus Torvalds @ 2017-01-12 22:26 UTC (permalink / raw)
  To: Alan J. Wylie; +Cc: Thorsten Leemhuis, Al Viro, linux-kernel

On Thu, Jan 12, 2017 at 12:26 PM, Alan J. Wylie <alan@wylie.me.uk> wrote:
>
> Strace shows that the processes are hanging in write() and read() calls.

If this is splice-related, I'm assuming that they aren't actually the
two ends of the same pipe, and there is somebody doing splice in the
middle.

I'm not seeing that process. I'm assuming it's systemd.  Can you try
to find it and strace that one too? Because that middle man is likely
the one that has problems (and is not able to splice from one pipe to
the other).

Ugh. That one commit has had a lot of bugs in it already. We do not
have good splice test coverage, because almost nobody uses it.

               Linus

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-12 22:26 ` Linus Torvalds
@ 2017-01-12 22:37   ` Al Viro
  2017-01-12 22:46     ` Al Viro
  2017-01-12 22:46   ` Alan J. Wylie
  1 sibling, 1 reply; 42+ messages in thread
From: Al Viro @ 2017-01-12 22:37 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel

On Thu, Jan 12, 2017 at 02:26:42PM -0800, Linus Torvalds wrote:
> On Thu, Jan 12, 2017 at 12:26 PM, Alan J. Wylie <alan@wylie.me.uk> wrote:
> >
> > Strace shows that the processes are hanging in write() and read() calls.
> 
> If this is splice-related, I'm assuming that they aren't actually the
> two ends of the same pipe, and there is somebody doing splice in the
> middle.
> 
> I'm not seeing that process. I'm assuming it's systemd.  Can you try
> to find it and strace that one too? Because that middle man is likely
> the one that has problems (and is not able to splice from one pipe to
> the other).
> 
> Ugh. That one commit has had a lot of bugs in it already. We do not
> have good splice test coverage, because almost nobody uses it.

FWIW, I would really like to know what kind of files had been involved.
There are two paths that can lead to default_file_splice_read():
splice_direct_to_actor() -> do_splice_to() -> default_file_splice_read() and
do_splice() -> do_splice_to() -> default_file_splice_read().

The former only gets there for regular files and block devices.  The latter
is guaranteed that file is not a pipe.  So
	* not a socket (have ->splice_read() of their own)
	* not a pipe or FIFO (neither path allows those)
	* not a block device (have ->splice_read() of their own)
	* not a regular file on a normal local fs (ditto)

So what is it called for in that reproducer?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-12 22:26 ` Linus Torvalds
  2017-01-12 22:37   ` Al Viro
@ 2017-01-12 22:46   ` Alan J. Wylie
  2017-01-12 22:58     ` Al Viro
  2017-01-12 23:28     ` Linus Torvalds
  1 sibling, 2 replies; 42+ messages in thread
From: Alan J. Wylie @ 2017-01-12 22:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Thorsten Leemhuis, Al Viro, linux-kernel

at 14:26 on Thu 12-Jan-2017 Linus Torvalds (torvalds@linux-foundation.org) wrote:

> > Strace shows that the processes are hanging in write() and read() calls.
> 
> If this is splice-related, I'm assuming that they aren't actually the
> two ends of the same pipe, and there is somebody doing splice in the
> middle.
> 
> I'm not seeing that process. I'm assuming it's systemd.  Can you try
> to find it and strace that one too?

strace -f -o /var/tmp/strace.txt systemd-nspawn -q -D /work/chroot.32 --register=no date

full output at
https://wylie.me.uk/tmp/strace.txt

last few lines are:

13766 execve("/bin/date", ["date"], [/* 7 vars */] <unfinished ...>
...
13766 close(2)                          = 0
13766 exit_group(0)                     = ?
13766 +++ exited with 0 +++
13761 <... epoll_wait resumed> [{EPOLLIN, {u32=3843638720, u64=94179591538112}}], 7, -1) = 1
13761 clock_gettime(CLOCK_BOOTTIME, {9780, 528425388}) = 0
13761 read(13, "\21\0\0\0\0\0\0\0\1\0\0\0\3065\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 128) = 128
13761 epoll_ctl(11, EPOLL_CTL_DEL, 1, NULL) = 0
13761 epoll_ctl(11, EPOLL_CTL_DEL, 5, NULL) = 0
13761 signalfd4(13, [INT TERM CHLD], 8, SFD_CLOEXEC|SFD_NONBLOCK) = 13
13761 fcntl(0, F_GETFL)                 = 0 (flags O_RDONLY)
13761 fcntl(1, F_GETFL)                 = 0x1 (flags O_WRONLY)
13761 kill(13766, SIGKILL)              = 0
13761 waitid(P_PID, 13766, {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=13766, si_uid=0, si_status=0, si_utime=0, si_stime=0}, WEXITED, NULL) = 0
13761 close(6)                          = 0
13761 close(7)                          = 0
13761 close(8)                          = 0
13761 close(9)                          = 0
13761 close(18)                         = 0
13761 close(16)                         = 0
13761 close(14)                         = 0
13761 close(10)                         = 0
13761 copy_file_range(5, NULL, 1, NULL, 9223372036854775807, 0) = -1 EXDEV (Invalid cross-device link)
13761 sendfile(1, 5, NULL, 9223372036854775807) = -1 EINVAL (Invalid argument)
13761 splice(5, NULL, 1, NULL, 9223372036854775807, 0) = -1 EAGAIN (Resource temporarily unavailable)
13761 open("/run/systemd/nspawn/propagate/chroot.32", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_NOFOLLOW|O_NOATIME|O_CLOEXEC) = 6
13761 fstatfs(6, {f_type=TMPFS_MAGIC, f_bsize=4096, f_blocks=1020327, f_bfree=1015931, f_bavail=1015931, f_files=1020327, f_ffree=1019297, f_fsid={0, 0}, f_namelen=255, f_frsize=4096, f_flags=ST_VALID|ST_NOSUID|ST_NODEV}) = 0
13761 fstat(6, {st_mode=S_IFDIR|0600, st_size=40, ...}) = 0
13761 fcntl(6, F_GETFL)                 = 0x78800 (flags O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_NOFOLLOW|O_NOATIME)
13761 fcntl(6, F_SETFD, FD_CLOEXEC)     = 0
13761 getdents(6, /* 2 entries */, 32768) = 48
13761 getdents(6, /* 0 entries */, 32768) = 0
13761 close(6)                          = 0
13761 rmdir("/run/systemd/nspawn/propagate/chroot.32") = 0
13761 unlink("/work/.#chroot.32.lck")   = 0
13761 close(3)                          = 0
13761 unlink("/run/systemd/nspawn/locks/inode-65035:24641537") = 0
13761 close(4)                          = 0
13761 close(5)                          = 0
13761 exit_group(0)                     = ?
13761 +++ exited with 0 +++

I'm off to bed now with a stinking head cold I'm afraid.

-- 
Alan J. Wylie                                          http://www.wylie.me.uk/

Dance like no-one's watching. / Encrypt like everyone is.
Security is inversely proportional to convenience

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-12 22:37   ` Al Viro
@ 2017-01-12 22:46     ` Al Viro
  2017-01-12 23:02       ` Linus Torvalds
  0 siblings, 1 reply; 42+ messages in thread
From: Al Viro @ 2017-01-12 22:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel

On Thu, Jan 12, 2017 at 10:37:18PM +0000, Al Viro wrote:
> On Thu, Jan 12, 2017 at 02:26:42PM -0800, Linus Torvalds wrote:
> > On Thu, Jan 12, 2017 at 12:26 PM, Alan J. Wylie <alan@wylie.me.uk> wrote:
> > >
> > > Strace shows that the processes are hanging in write() and read() calls.
> > 
> > If this is splice-related, I'm assuming that they aren't actually the
> > two ends of the same pipe, and there is somebody doing splice in the
> > middle.
> > 
> > I'm not seeing that process. I'm assuming it's systemd.  Can you try
> > to find it and strace that one too? Because that middle man is likely
> > the one that has problems (and is not able to splice from one pipe to
> > the other).
> > 
> > Ugh. That one commit has had a lot of bugs in it already. We do not
> > have good splice test coverage, because almost nobody uses it.
> 
> FWIW, I would really like to know what kind of files had been involved.
> There are two paths that can lead to default_file_splice_read():
> splice_direct_to_actor() -> do_splice_to() -> default_file_splice_read() and
> do_splice() -> do_splice_to() -> default_file_splice_read().
> 
> The former only gets there for regular files and block devices.  The latter
> is guaranteed that file is not a pipe.  So
> 	* not a socket (have ->splice_read() of their own)
> 	* not a pipe or FIFO (neither path allows those)
> 	* not a block device (have ->splice_read() of their own)
> 	* not a regular file on a normal local fs (ditto)
> 
> So what is it called for in that reproducer?

PS: what about the /proc/mounts contents?  If it's something 9p-backed kvm,
your bisect might have been caught on the bug I'd mentioned - if the breakage
you are seeing in 4.9.3 has started after that commit and before the
backport of the fix, your bisect could converge there.  Does the
reproducer trigger on 523ac9afc73a + cherry-pick of 8e54cadab447?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-12 22:46   ` Alan J. Wylie
@ 2017-01-12 22:58     ` Al Viro
  2017-01-12 23:28     ` Linus Torvalds
  1 sibling, 0 replies; 42+ messages in thread
From: Al Viro @ 2017-01-12 22:58 UTC (permalink / raw)
  To: Alan J. Wylie; +Cc: Linus Torvalds, Thorsten Leemhuis, linux-kernel

On Thu, Jan 12, 2017 at 10:46:06PM +0000, Alan J. Wylie wrote:
> at 14:26 on Thu 12-Jan-2017 Linus Torvalds (torvalds@linux-foundation.org) wrote:

> 13761 copy_file_range(5, NULL, 1, NULL, 9223372036854775807, 0) = -1 EXDEV (Invalid cross-device link)
> 13761 sendfile(1, 5, NULL, 9223372036854775807) = -1 EINVAL (Invalid argument)
> 13761 splice(5, NULL, 1, NULL, 9223372036854775807, 0) = -1 EAGAIN (Resource temporarily unavailable)

*bleep*

splice from /dev/ptmx to stdout, whatever the latter had been.  IOW,
we are dealing with do_splice() -> do_splice_to() -> default_file_splice_read()
for pseudo-tty as source...

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-12 22:46     ` Al Viro
@ 2017-01-12 23:02       ` Linus Torvalds
  2017-01-12 23:14         ` Al Viro
  2017-01-12 23:14         ` Linus Torvalds
  0 siblings, 2 replies; 42+ messages in thread
From: Linus Torvalds @ 2017-01-12 23:02 UTC (permalink / raw)
  To: Al Viro; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel

On Thu, Jan 12, 2017 at 2:46 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> PS: what about the /proc/mounts contents?  If it's something 9p-backed kvm,
> your bisect might have been caught on the bug I'd mentioned - if the breakage
> you are seeing in 4.9.3 has started after that commit and before the
> backport of the fix, your bisect could converge there.  Does the
> reproducer trigger on 523ac9afc73a + cherry-pick of 8e54cadab447?

Looking at the strace that Alan just posted, I really think it's the
splice change that Alan bisected to.

In particular, this line:

    splice(5, NULL, 1, NULL, 9223372036854775807, 0) = -1 EAGAIN
(Resource temporarily unavailable)

and note that the commit in question introduces that -EAGAIN error code.

The old code never returned EAGAIN at all (well, it could do so later,
if NONBLOCK was set, obviously, but that doesn't seem to be the case
here).

So that commit seems to have introduced a new error case, and I
suspect systemd-nospawn simply doesn't handle it. It is expecting
splice_to_pipe() to actually block.

Ergo: I think we need to do a wait_for_space() somewhere, getting rid
of the EAGAIN.

Looking at the callers of "do_splice_to()", we already have the
wait_for_space() in do_splice(), but we do *not* have it in the
do_splice_from() case when both the input and output file descriptors
are pipes.

Hmm?

             Linus

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-12 23:02       ` Linus Torvalds
@ 2017-01-12 23:14         ` Al Viro
  2017-01-12 23:14         ` Linus Torvalds
  1 sibling, 0 replies; 42+ messages in thread
From: Al Viro @ 2017-01-12 23:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel

On Thu, Jan 12, 2017 at 03:02:13PM -0800, Linus Torvalds wrote:

>     splice(5, NULL, 1, NULL, 9223372036854775807, 0) = -1 EAGAIN
> (Resource temporarily unavailable)
> 
> and note that the commit in question introduces that -EAGAIN error code.
> 
> The old code never returned EAGAIN at all (well, it could do so later,
> if NONBLOCK was set, obviously, but that doesn't seem to be the case
> here).
> 
> So that commit seems to have introduced a new error case, and I
> suspect systemd-nospawn simply doesn't handle it. It is expecting
> splice_to_pipe() to actually block.
> 
> Ergo: I think we need to do a wait_for_space() somewhere, getting rid
> of the EAGAIN.
> 
> Looking at the callers of "do_splice_to()", we already have the
> wait_for_space() in do_splice(), but we do *not* have it in the
> do_splice_from() case when both the input and output file descriptors
> are pipes.

>From the look of his strace, the source is /dev/ptmx.  Pipe-to-pipe
splice goes into splice_pipe_to_pipe() anyway.  do_splice_from() is
pipe-to-non-pipe, and it doesn't go anywhere near default_file_splice_read()...

do_splice_to() is the only thing that can call default_file_splice_read()
and there are only two callers - do_splice() (with its wait_for_space())
and splice_direct_to_actor(), which has internal pipe for destination.
That certainly shouldn't be calling wait_for_space() - there's no other
thread that could possibly read from the destination getting more space
in there.

IOW, it's do_splice() -> do_splice_to() -> default_file_splice_read().

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-12 23:02       ` Linus Torvalds
  2017-01-12 23:14         ` Al Viro
@ 2017-01-12 23:14         ` Linus Torvalds
  2017-01-12 23:27           ` Al Viro
  1 sibling, 1 reply; 42+ messages in thread
From: Linus Torvalds @ 2017-01-12 23:14 UTC (permalink / raw)
  To: Al Viro; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel

On Thu, Jan 12, 2017 at 3:02 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Looking at the callers of "do_splice_to()", we already have the
> wait_for_space() in do_splice(), but we do *not* have it in the
> do_splice_from() case when both the input and output file descriptors
> are pipes.

Bah. That case doesn't even trigger the new code. I was lazy with my
grep. The two cases are "do_splice()" (which does have the
wait-for-space) and splice_direct_to_actor(). And
splice_direct_to_actor() shouldn't even need it, should it?

So ignore that. But I think there is something about the EAGAIN.

                  Linus

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-12 23:14         ` Linus Torvalds
@ 2017-01-12 23:27           ` Al Viro
  0 siblings, 0 replies; 42+ messages in thread
From: Al Viro @ 2017-01-12 23:27 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel

On Thu, Jan 12, 2017 at 03:14:41PM -0800, Linus Torvalds wrote:
> On Thu, Jan 12, 2017 at 3:02 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Looking at the callers of "do_splice_to()", we already have the
> > wait_for_space() in do_splice(), but we do *not* have it in the
> > do_splice_from() case when both the input and output file descriptors
> > are pipes.
> 
> Bah. That case doesn't even trigger the new code. I was lazy with my
> grep. The two cases are "do_splice()" (which does have the
> wait-for-space) and splice_direct_to_actor(). And
> splice_direct_to_actor() shouldn't even need it, should it?
> 
> So ignore that. But I think there is something about the EAGAIN.

It might, but I would really like to see where has that EAGAIN come
from.  I see several possibilities:
	* wait_for_space() with SPLICE_F_NONBLOCK in flags.  Shouldn't
happen with 0 in the last argument of splice(2).
	* default_file_splice_read() seeing pipe->nrbufs == pipe->buffers.
Shouldn't be possible after successful wait_for_space().
	* vfs_readv() returning -EAGAIN.  That might be possible,
actually - the damn thing has come from
13761 open("/dev/ptmx", O_RDWR|O_NOCTTY|O_NONBLOCK|O_CLOEXEC) = 5
and O_NONBLOCK had been present in open flags.

What I'd like to see is strace of the same thing on the working kernel,
ideally - just prior to the commit bisect has converged to.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-12 22:46   ` Alan J. Wylie
  2017-01-12 22:58     ` Al Viro
@ 2017-01-12 23:28     ` Linus Torvalds
  2017-01-13  4:00       ` Al Viro
  2017-01-13  7:23       ` Alan J. Wylie
  1 sibling, 2 replies; 42+ messages in thread
From: Linus Torvalds @ 2017-01-12 23:28 UTC (permalink / raw)
  To: Alan J. Wylie; +Cc: Thorsten Leemhuis, Al Viro, linux-kernel

On Thu, Jan 12, 2017 at 2:46 PM, Alan J. Wylie <alan@wylie.me.uk> wrote:
>
> I'm off to bed now with a stinking head cold I'm afraid.

So assuming Al hasn't figured it out by the time you get back, can you
try to send us the same strace for the working case?

That fd5 (ptmx) does have O_NONBLOCK, so maybe the EAGAIN actually
comes from reading the ptmx, rather than the new AGAIN in fs/splice.c.
But if so, I'd have expected the old code to do the same thing. Ho
humm.

               Linus

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-12 23:28     ` Linus Torvalds
@ 2017-01-13  4:00       ` Al Viro
  2017-01-13  7:38         ` Alan J. Wylie
  2017-01-13  7:23       ` Alan J. Wylie
  1 sibling, 1 reply; 42+ messages in thread
From: Al Viro @ 2017-01-13  4:00 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel

On Thu, Jan 12, 2017 at 03:28:32PM -0800, Linus Torvalds wrote:
> On Thu, Jan 12, 2017 at 2:46 PM, Alan J. Wylie <alan@wylie.me.uk> wrote:
> >
> > I'm off to bed now with a stinking head cold I'm afraid.
> 
> So assuming Al hasn't figured it out by the time you get back, can you
> try to send us the same strace for the working case?

That, and check if it is reproducible on commit 523ac9afc73a with
commit 8e54cadab447 cherry-picked on top of that.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-12 23:28     ` Linus Torvalds
  2017-01-13  4:00       ` Al Viro
@ 2017-01-13  7:23       ` Alan J. Wylie
  2017-01-13  9:33         ` Al Viro
  1 sibling, 1 reply; 42+ messages in thread
From: Alan J. Wylie @ 2017-01-13  7:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Thorsten Leemhuis, Al Viro, linux-kernel

at 15:28 on Thu 12-Jan-2017 Linus Torvalds (torvalds@linux-foundation.org) wrote:

> So assuming Al hasn't figured it out by the time you get back, can
> you try to send us the same strace for the working case?

Full output at https://wylie.me.uk/tmp/strace1.txt

# uname -a
Linux frodo 4.8.17 #1 SMP PREEMPT Fri Jan 13 06:54:26 GMT 2017 x86_64 AMD FX(tm)-8350 Eight-Core Processor AuthenticAMD GNU/Linux

1738  close(2)                          = 0
1738  exit_group(0)                     = ?
1738  +++ exited with 0 +++
1735  <... epoll_wait resumed> [{EPOLLIN, {u32=1570273664, u64=94155843336576}}], 7, -1) = 1
1735  clock_gettime(CLOCK_BOOTTIME, {292, 912429255}) = 0
1735  read(13, "\21\0\0\0\0\0\0\0\1\0\0\0\312\6\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 128) = 128
1735  epoll_ctl(11, EPOLL_CTL_DEL, 1, NULL) = 0
1735  epoll_ctl(11, EPOLL_CTL_DEL, 5, NULL) = 0
1735  signalfd4(13, [INT TERM CHLD], 8, SFD_CLOEXEC|SFD_NONBLOCK) = 13
1735  fcntl(0, F_GETFL)                 = 0 (flags O_RDONLY)
1735  fcntl(1, F_GETFL)                 = 0x1 (flags O_WRONLY)
1735  kill(1738, SIGKILL)               = 0
1735  waitid(P_PID, 1738, {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=1738, si_uid=0, si_status=0, si_utime=0, si_stime=0}, WEXITED, NULL) = 0
1735  close(6)                          = 0
1735  close(7)                          = 0
1735  close(8)                          = 0
1735  close(9)                          = 0
1735  close(18)                         = 0
1735  close(16)                         = 0
1735  close(14)                         = 0
1735  close(10)                         = 0
1735  copy_file_range(5, NULL, 1, NULL, 9223372036854775807, 0) = -1 EXDEV (Invalid cross-device link)
1735  sendfile(1, 5, NULL, 9223372036854775807) = -1 EINVAL (Invalid argument)
1735  splice(5, NULL, 1, NULL, 9223372036854775807, 0) = -1 EAGAIN (Resource temporarily unavailable)
1735  open("/run/systemd/nspawn/propagate/chroot.32", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_NOFOLLOW|O_NOATIME|O_CLOEXEC) = 6
1735  fstatfs(6, {f_type=TMPFS_MAGIC, f_bsize=4096, f_blocks=1020314, f_bfree=1015930, f_bavail=1015930, f_files=1020314, f_ffree=1019308, f_fsid={0, 0}, f_namelen=255, f_frsize=4096, f_flags=ST_VALID|ST_NOSUID|ST_NODEV}) = 0
1735  fstat(6, {st_mode=S_IFDIR|0600, st_size=40, ...}) = 0
1735  fcntl(6, F_GETFL)                 = 0x78800 (flags O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_NOFOLLOW|O_NOATIME)
1735  fcntl(6, F_SETFD, FD_CLOEXEC)     = 0
1735  getdents(6, /* 2 entries */, 32768) = 48
1735  getdents(6, /* 0 entries */, 32768) = 0
1735  close(6)                          = 0
1735  rmdir("/run/systemd/nspawn/propagate/chroot.32") = 0
1735  unlink("/work/.#chroot.32.lck")   = 0
1735  close(3)                          = 0
1735  unlink("/run/systemd/nspawn/locks/inode-65035:24641537") = 0
1735  close(4)                          = 0
1735  close(5)                          = 0
1735  exit_group(0)                     = ?
1735  +++ exited with 0 +++


>From within the systemd-nspawn (no KVM or similar virtualisation involved):

Fri Jan 13 07:08:01 GMT 2017
+ date
Fri Jan 13 07:08:01 GMT 2017
+ cat /proc/mounts
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
devtmpfs /dev devtmpfs rw,nosuid,size=4039360k,nr_inodes=1009840,mode=755 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,nodev,mode=755 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup
+rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
/dev/mapper/vg0-root / ext4 rw,relatime,data=ordered 0 0
/dev/mapper/vg0-usr /usr ext4 rw,relatime,data=ordered 0 0
mqueue /dev/mqueue mqueue rw,relatime 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=31,pgrp=1,timeout=0,minproto=5,maxproto=5,direct 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev 0 0
/dev/mapper/vg1-work /work ext4 rw,relatime,data=ordered 0 0
/dev/mapper/vg1-home /home ext4 rw,relatime,data=ordered 0 0
/dev/sda1 /boot ext4 rw,relatime,data=ordered 0 0
/dev/mapper/vg0-var /var ext4 rw,relatime,data=ordered 0 0
/dev/mapper/vg0-srv /srv ext4 rw,relatime,data=ordered 0 0
tmpfs /run/user/1000 tmpfs rw,nosuid,nodev,relatime,size=816248k,mode=700,uid=1000,gid=100 0 0
+ strace -f -o /var/tmp/strace.txt systemd-nspawn -q -D /work/chroot.32 --register=no date
Fri Jan 13 07:08:01 GMT 2017
+ date
Fri Jan 13 07:08:01 GMT 2017
Fri Jan 13 07:08:01 GMT 2017


-- 
Alan J. Wylie                                          http://www.wylie.me.uk/

Dance like no-one's watching. / Encrypt like everyone is.
Security is inversely proportional to convenience

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-13  4:00       ` Al Viro
@ 2017-01-13  7:38         ` Alan J. Wylie
  0 siblings, 0 replies; 42+ messages in thread
From: Alan J. Wylie @ 2017-01-13  7:38 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, Thorsten Leemhuis, linux-kernel

at 04:00 on Fri 13-Jan-2017 Al Viro (viro@ZenIV.linux.org.uk) wrote:

> That, and check if it is reproducible on commit 523ac9afc73a with
> commit 8e54cadab447 cherry-picked on top of that.

$ git checkout 523ac9afc73a

$ git cherry-pick 8e54cadab447
[detached HEAD 3826f27f6830] fix default_file_splice_read()
 Author: Al Viro <viro@zeniv.linux.org.uk>
 Date: Sat Nov 26 20:05:42 2016 -0500
 1 file changed, 2 insertions(+), 1 deletion(-)

$ uname -a
Linux frodo 4.8.0-rc8-00011-g3826f27f6830 #1 SMP PREEMPT Fri Jan 13 07:30:06 GMT 2017 x86_64 AMD FX(tm)-8350 Eight-Core Processor AuthenticAMD GNU/Linux

# ps axfu | grep -A10 cron
root       982  0.0  0.0  18764  1948 ?        Ss   07:32   0:00 /usr/sbin/cron
root      1593  0.0  0.0  27340  2164 ?        S    07:35   0:00  \_ /usr/sbin/cron
root      1594  0.0  0.0   9840  2572 ?        Ss   07:35   0:00      \_ /bin/bash -c date; /work/chroot-shared/test.sh; date
root      1597  0.0  0.0   9840  2668 ?        S    07:35   0:00      |   \_ /bin/bash /work/chroot-shared/test.sh
root      1596  0.0  0.0  76156  5496 ?        S    07:35   0:00      \_ /usr/sbin/sendmail -FCronDaemon -odi -oem -oi -t
root      1600  0.0  0.0  76140  5440 ?        S    07:35   0:00          \_ /usr/sbin/postdrop -r

Still hangs

-- 
Alan J. Wylie                                          http://www.wylie.me.uk/

Dance like no-one's watching. / Encrypt like everyone is.
Security is inversely proportional to convenience

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-13  7:23       ` Alan J. Wylie
@ 2017-01-13  9:33         ` Al Viro
  2017-01-13  9:54           ` Alan J. Wylie
  0 siblings, 1 reply; 42+ messages in thread
From: Al Viro @ 2017-01-13  9:33 UTC (permalink / raw)
  To: Alan J. Wylie; +Cc: Linus Torvalds, Thorsten Leemhuis, linux-kernel

On Fri, Jan 13, 2017 at 07:23:51AM +0000, Alan J. Wylie wrote:
> at 15:28 on Thu 12-Jan-2017 Linus Torvalds (torvalds@linux-foundation.org) wrote:
> 
> > So assuming Al hasn't figured it out by the time you get back, can
> > you try to send us the same strace for the working case?
> 
> Full output at https://wylie.me.uk/tmp/strace1.txt
> 
> 1735  splice(5, NULL, 1, NULL, 9223372036854775807, 0) = -1 EAGAIN (Resource temporarily unavailable)

Lovely...  So it was getting -EAGAIN all along.  Just in case - could you
try the delta below and see if it triggers?  Simply to exclude the possibility
that it *is* this call of splice() and the change has somehow buggered cleanup
after the kernel_readv() failure...

diff --git a/fs/splice.c b/fs/splice.c
index 873d83104e79..1a2d1bc7f19e 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -393,6 +393,9 @@ static ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
 	size_t offset, dummy, copied = 0;
 	ssize_t res;
 	int i;
+	unsigned nrbufs = pipe->nrbufs,
+		 curbuf = pipe->curbuf,
+		 buffers = pipe->buffers;
 
 	if (pipe->nrbufs == pipe->buffers)
 		return -EAGAIN;
@@ -445,6 +448,16 @@ static ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
 		put_page(pages[i]);
 	kvfree(pages);
 	iov_iter_advance(&to, copied);	/* truncates and discards */
+	if (res == -EAGAIN && (
+		pipe->nrbufs != nrbufs ||
+		pipe->curbuf != curbuf ||
+		pipe->buffers != buffers)
+	    ) {
+		printk(KERN_ERR "nr: %d->%d, cur: %d->%d, buffers: %d->%d\n",
+			nrbufs, pipe->nrbufs,
+			curbuf, pipe->curbuf,
+			buffers, pipe->buffers);
+	}
 	return res;
 }
 

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-13  9:33         ` Al Viro
@ 2017-01-13  9:54           ` Alan J. Wylie
  2017-01-13 10:20             ` Al Viro
  0 siblings, 1 reply; 42+ messages in thread
From: Alan J. Wylie @ 2017-01-13  9:54 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, Thorsten Leemhuis, linux-kernel

at 09:33 on Fri 13-Jan-2017 Al Viro (viro@ZenIV.linux.org.uk) wrote:

> > 1735  splice(5, NULL, 1, NULL, 9223372036854775807, 0) = -1 EAGAIN (Resource temporarily unavailable)
> 
> Lovely...  So it was getting -EAGAIN all along.  Just in case - could you
> try the delta below and see if it triggers?  Simply to exclude the possibility
> that it *is* this call of splice() and the change has somehow buggered cleanup
> after the kernel_readv() failure...


$ git checkout linux-4.9.y
$ git apply patch1

$ uname -a
Linux frodo 4.9.3-dirty #1 SMP PREEMPT Fri Jan 13 09:44:42 GMT 2017 x86_64 AMD FX(tm)-8350 Eight-Core Processor AuthenticAMD GNU/Linux

# ps axfu | grep -A10 cron
root       987  0.0  0.0  18764  2128 ?        Ss   09:47   0:00 /usr/sbin/cron
root      1662  0.0  0.0  27340  2160 ?        S    09:51   0:00  \_ /usr/sbin/cron
root      1664  0.0  0.0   9840  1148 ?        Ss   09:51   0:00      \_ /bin/bash -c date; /work/chroot-shared/test.sh; date
root      1668  0.0  0.0   9840  2652 ?        S    09:51   0:00      |   \_ /bin/bash /work/chroot-shared/test.sh
root      1667  0.0  0.0  76156  5576 ?        S    09:51   0:00      \_ /usr/sbin/sendmail -FCronDaemon -odi -oem -oi -t
root      1669  0.0  0.0  76144  5412 ?        S    09:51   0:00          \_ /usr/sbin/postdrop -r

Another hang.

# dmesg  | tail
[   22.352442] r8169 0000:03:00.0: loading /lib/firmware/4.9.3-dirty/rtl_nic/rtl8168e-3.fw failed with error -2
[   22.408814] r8169 0000:03:00.0: direct-loading rtl_nic/rtl8168e-3.fw
[   22.408821] fw_set_page_data: fw-rtl_nic/rtl8168e-3.fw buf=ffff92b7b1cb8c80 data=ffffad1641179000 size=3872
[   22.536043] r8169 0000:03:00.0 enp3s0: link down
[   22.536079] r8169 0000:03:00.0 enp3s0: link down
[   24.873801] r8169 0000:03:00.0 enp3s0: link up
[   24.874766] br0: port 1(enp3s0) entered blocking state
[   24.876622] br0: port 1(enp3s0) entered forwarding state
[   24.878560] IPv6: ADDRCONF(NETDEV_CHANGE): br0: link becomes ready
[  219.683974] nr: 0->16, cur: 5->5, buffers: 16->16
# 

> 
> diff --git a/fs/splice.c b/fs/splice.c
> index 873d83104e79..1a2d1bc7f19e 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -393,6 +393,9 @@ static ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
>  	size_t offset, dummy, copied = 0;
>  	ssize_t res;
>  	int i;
> +	unsigned nrbufs = pipe->nrbufs,
> +		 curbuf = pipe->curbuf,
> +		 buffers = pipe->buffers;
>  
>  	if (pipe->nrbufs == pipe->buffers)
>  		return -EAGAIN;
> @@ -445,6 +448,16 @@ static ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
>  		put_page(pages[i]);
>  	kvfree(pages);
>  	iov_iter_advance(&to, copied);	/* truncates and discards */
> +	if (res == -EAGAIN && (
> +		pipe->nrbufs != nrbufs ||
> +		pipe->curbuf != curbuf ||
> +		pipe->buffers != buffers)
> +	    ) {
> +		printk(KERN_ERR "nr: %d->%d, cur: %d->%d, buffers: %d->%d\n",
> +			nrbufs, pipe->nrbufs,
> +			curbuf, pipe->curbuf,
> +			buffers, pipe->buffers);
> +	}
>  	return res;
>  }
>  

-- 
Alan J. Wylie                                          http://www.wylie.me.uk/

Dance like no-one's watching. / Encrypt like everyone is.
Security is inversely proportional to convenience

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-13  9:54           ` Alan J. Wylie
@ 2017-01-13 10:20             ` Al Viro
  2017-01-13 10:32               ` Alan J. Wylie
  2017-01-13 11:18               ` Al Viro
  0 siblings, 2 replies; 42+ messages in thread
From: Al Viro @ 2017-01-13 10:20 UTC (permalink / raw)
  To: Alan J. Wylie; +Cc: Linus Torvalds, Thorsten Leemhuis, linux-kernel

On Fri, Jan 13, 2017 at 09:54:02AM +0000, Alan J. Wylie wrote:
> root      1669  0.0  0.0  76144  5412 ?        S    09:51   0:00          \_ /usr/sbin/postdrop -r
> 
> Another hang.
> 
> # dmesg  | tail
> [   22.352442] r8169 0000:03:00.0: loading /lib/firmware/4.9.3-dirty/rtl_nic/rtl8168e-3.fw failed with error -2
> [   22.408814] r8169 0000:03:00.0: direct-loading rtl_nic/rtl8168e-3.fw
> [   22.408821] fw_set_page_data: fw-rtl_nic/rtl8168e-3.fw buf=ffff92b7b1cb8c80 data=ffffad1641179000 size=3872
> [   22.536043] r8169 0000:03:00.0 enp3s0: link down
> [   22.536079] r8169 0000:03:00.0 enp3s0: link down
> [   24.873801] r8169 0000:03:00.0 enp3s0: link up
> [   24.874766] br0: port 1(enp3s0) entered blocking state
> [   24.876622] br0: port 1(enp3s0) entered forwarding state
> [   24.878560] IPv6: ADDRCONF(NETDEV_CHANGE): br0: link becomes ready
> [  219.683974] nr: 0->16, cur: 5->5, buffers: 16->16

OK, so it is iov_iter_advance() failing to free the shit allocated, either
due to some breakage in pipe_advance() or buggered 'copied'...  Let's
see which one; could you apply the following and run your reproducer?  The
only difference from the previous is that it collects and prints a bit more,
so it should be just as reproducible...

diff --git a/fs/splice.c b/fs/splice.c
index 873d83104e79..11477609e7f7 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -393,6 +393,10 @@ static ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
 	size_t offset, dummy, copied = 0;
 	ssize_t res;
 	int i;
+	unsigned nrbufs = pipe->nrbufs,
+		 curbuf = pipe->curbuf,
+		 buffers = pipe->buffers;
+	int idx, count, offs;
 
 	if (pipe->nrbufs == pipe->buffers)
 		return -EAGAIN;
@@ -444,7 +448,22 @@ static ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
 	for (i = 0; i < nr_pages; i++)
 		put_page(pages[i]);
 	kvfree(pages);
+	count = to.count;
+	idx = to.idx;
+	offs = to.iov_offset;
 	iov_iter_advance(&to, copied);	/* truncates and discards */
+	if (res == -EAGAIN && (
+		pipe->nrbufs != nrbufs ||
+		pipe->curbuf != curbuf ||
+		pipe->buffers != buffers)
+	    ) {
+		printk(KERN_ERR "nr: %d->%d, cur: %d->%d, buffers: %d->%d\n",
+			nrbufs, pipe->nrbufs,
+			curbuf, pipe->curbuf,
+			buffers, pipe->buffers);
+		printk(KERN_ERR "copied: %zd, count:%d, idx:%d, offs:%d\n",
+			copied, count, idx, offs);
+	}
 	return res;
 }
 

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-13 10:20             ` Al Viro
@ 2017-01-13 10:32               ` Alan J. Wylie
  2017-01-13 11:25                 ` Al Viro
  2017-01-13 11:18               ` Al Viro
  1 sibling, 1 reply; 42+ messages in thread
From: Alan J. Wylie @ 2017-01-13 10:32 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, Thorsten Leemhuis, linux-kernel

at 10:20 on Fri 13-Jan-2017 Al Viro (viro@ZenIV.linux.org.uk) wrote:

> OK, so it is iov_iter_advance() failing to free the shit allocated, either
> due to some breakage in pipe_advance() or buggered 'copied'...  Let's
> see which one; could you apply the following and run your reproducer?  The
> only difference from the previous is that it collects and prints a bit more,
> so it should be just as reproducible...

git reset --hard HEAD
vi ~/mail/INBOX patch2 
git apply patch2

# uname -a
Linux frodo 4.9.3-dirty #2 SMP PREEMPT Fri Jan 13 10:22:47 GMT 2017 x86_64 AMD FX(tm)-8350 Eight-Core Processor AuthenticAMD GNU/Linux

[hanging as expected]

# dmesg  | tail
...
[  141.553455] nr: 0->16, cur: 5->5, buffers: 16->16
[  141.553461] copied: 0, count:2147479552, idx:5, offs:0

> 
> diff --git a/fs/splice.c b/fs/splice.c
> index 873d83104e79..11477609e7f7 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -393,6 +393,10 @@ static ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
>  	size_t offset, dummy, copied = 0;
>  	ssize_t res;
>  	int i;
> +	unsigned nrbufs = pipe->nrbufs,
> +		 curbuf = pipe->curbuf,
> +		 buffers = pipe->buffers;
> +	int idx, count, offs;
>  
>  	if (pipe->nrbufs == pipe->buffers)
>  		return -EAGAIN;
> @@ -444,7 +448,22 @@ static ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
>  	for (i = 0; i < nr_pages; i++)
>  		put_page(pages[i]);
>  	kvfree(pages);
> +	count = to.count;
> +	idx = to.idx;
> +	offs = to.iov_offset;
>  	iov_iter_advance(&to, copied);	/* truncates and discards */
> +	if (res == -EAGAIN && (
> +		pipe->nrbufs != nrbufs ||
> +		pipe->curbuf != curbuf ||
> +		pipe->buffers != buffers)
> +	    ) {
> +		printk(KERN_ERR "nr: %d->%d, cur: %d->%d, buffers: %d->%d\n",
> +			nrbufs, pipe->nrbufs,
> +			curbuf, pipe->curbuf,
> +			buffers, pipe->buffers);
> +		printk(KERN_ERR "copied: %zd, count:%d, idx:%d, offs:%d\n",
> +			copied, count, idx, offs);
> +	}
>  	return res;
>  }
>  

-- 
Alan J. Wylie                                          http://www.wylie.me.uk/

Dance like no-one's watching. / Encrypt like everyone is.
Security is inversely proportional to convenience

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-13 10:20             ` Al Viro
  2017-01-13 10:32               ` Alan J. Wylie
@ 2017-01-13 11:18               ` Al Viro
  2017-01-13 19:33                 ` Linus Torvalds
  1 sibling, 1 reply; 42+ messages in thread
From: Al Viro @ 2017-01-13 11:18 UTC (permalink / raw)
  To: Alan J. Wylie; +Cc: Linus Torvalds, Thorsten Leemhuis, linux-kernel

On Fri, Jan 13, 2017 at 10:20:19AM +0000, Al Viro wrote:

> OK, so it is iov_iter_advance() failing to free the shit allocated, either
> due to some breakage in pipe_advance() or buggered 'copied'...  Let's
> see which one; could you apply the following and run your reproducer?  The
> only difference from the previous is that it collects and prints a bit more,
> so it should be just as reproducible...

Actually, I think I understand what's going on.  It's
        if (pipe->nrbufs) {
                int unused = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);
                /* [curbuf,unused) is in use.  Free [idx,unused) */
                while (idx != unused) {
                        pipe_buf_release(pipe, &pipe->bufs[idx]);
                        idx = next_idx(idx, pipe);
                        pipe->nrbufs--;
                }
        }
in case when pipe->nrbufs == pipe->buffers and idx == pipe->curbuf.  IOW,
we have a full pipe and want to empty it entirely; the fun question is,
of course, telling that case from having nothing to free with the same
full pipe...

OK, so we have either
	* off != 0 => something's being left in the pipe, i->idx points
to the last in-use buffer after that, idx points to the first buffer unused
after that.  In that case the current logics is correct.
	* off == 0 => we are emptying the damn thing.  i->idx and idx point
to the first buffer unused after that (== pipe->curbuf + pipe->nrbufs at the
time we'd set the iov_iter up).  The current logics is correct unless
pipe->nrbufs was originally 0 and now has become pipe->buffers.  IOW,
we screw up when off == 0, idx == unused, pipe->nrbufs == pipe->buffers...

	OK, we really ought to make sure that iov_iter_pipe() is never
done on a full pipe.  AFAICS, we do, and if so the following should
suffice.  WARNING: it's completely untested and it's a result of debugging
a fencepost bug in handling of cyclic buffers done at 6 in the morning,
on _way_ too long uptime.  Treat as very dangerous; it's not entirely
impossible that I hadn't fucked up, but don't consider it anything other
than "let's try and see if it immediately explodes" until I've got
a chance to reread it after getting some sleep.

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 25f572303801..7bc0b99d3c83 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -759,11 +759,12 @@ static void pipe_advance(struct iov_iter *i, size_t size)
 		idx = next_idx(idx, pipe);
 	if (pipe->nrbufs) {
 		int unused = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);
-		/* [curbuf,unused) is in use.  Free [idx,unused) */
-		while (idx != unused) {
-			pipe_buf_release(pipe, &pipe->bufs[idx]);
-			idx = next_idx(idx, pipe);
-			pipe->nrbufs--;
+		if (idx != unused || unlikely(idx == pipe->curbuf && !off)) {
+			do {
+				pipe_buf_release(pipe, &pipe->bufs[idx]);
+				idx = next_idx(idx, pipe);
+				pipe->nrbufs--;
+			} while (idx != unused);
 		}
 	}
 	i->count -= orig_sz;
@@ -826,6 +827,7 @@ void iov_iter_pipe(struct iov_iter *i, int direction,
 			size_t count)
 {
 	BUG_ON(direction != ITER_PIPE);
+	WARN_ON(pipe->nrbufs == pipe->buffers);
 	i->type = direction;
 	i->pipe = pipe;
 	i->idx = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-13 10:32               ` Alan J. Wylie
@ 2017-01-13 11:25                 ` Al Viro
  0 siblings, 0 replies; 42+ messages in thread
From: Al Viro @ 2017-01-13 11:25 UTC (permalink / raw)
  To: Alan J. Wylie; +Cc: Linus Torvalds, Thorsten Leemhuis, linux-kernel

On Fri, Jan 13, 2017 at 10:32:47AM +0000, Alan J. Wylie wrote:
> # dmesg  | tail
> ...
> [  141.553455] nr: 0->16, cur: 5->5, buffers: 16->16
> [  141.553461] copied: 0, count:2147479552, idx:5, offs:0

OK, that looks like pipe_advance() fencepost (hopefully) dealt with in
the lib/iov_iter.c patch I've just posted.  I'm going down right now, but
if you have a box you are not too fond of, it would be nice if you tried
that one.  Warning: the patch had been written in a barely awake state, and
is completely untested.  Treat accordingly...

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-13 11:18               ` Al Viro
@ 2017-01-13 19:33                 ` Linus Torvalds
  2017-01-13 20:08                   ` Al Viro
  2017-01-13 20:08                   ` Linus Torvalds
  0 siblings, 2 replies; 42+ messages in thread
From: Linus Torvalds @ 2017-01-13 19:33 UTC (permalink / raw)
  To: Al Viro; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel

On Fri, Jan 13, 2017 at 3:18 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> in case when pipe->nrbufs == pipe->buffers and idx == pipe->curbuf.  IOW,
> we have a full pipe and want to empty it entirely; the fun question is,
> of course, telling that case from having nothing to free with the same
> full pipe...

That's just because you're looking at all the wrong fields.

Doing this:

     int unused = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);

and then comparing the current index to the unused index is just
complete garbage.  Exactly because the "nrbufs % pipe->buffers"
ambiguity when nrbufs is either 0 or full.

So that comparison is actively crap, exactly *because* it has already
thrown all the actual data away.

You must *never* compare indexes in a circular buffer like this. It's
simply wrong.

Your patch might work, but it just keeps on doing the wrong thing and
then works around the fact that it does the wrong thing by adding more
disgusting hacks.

What am I missing?

Why is "pipe_advance()" written in that incomprehensible form? Why
don't we do the pipe_buf_release() as we advance through it, instead
of doing it at the end?

Also, the line

                buf->len = size;

in that pipe_advance() function looks buggy. "size" is how much we
*remove* from buf->len, shouldn't we update buf->len by subtracting
size?

This function looks so broken that I must be missing something. Why
doesn't pipe_advance() just look like the following:

  static void pipe_advance(struct iov_iter *i, size_t size)
  {
        struct pipe_inode_info *pipe = i->pipe;
        struct pipe_buffer *buf;
        int idx = i->idx;

        if (unlikely(i->count < size))
                size = i->count;
        if (!size)
                return;

        i->count -= size;

        while (pipe->nrbufs) {
                buf = &pipe->bufs[idx];

                /* We are left with a partial buffer.. */
                if (size < buf->len) {
                        buf->len -= size;
                        buf->offset += size;
                        i->idx = idx;
                        i->iov_offset = buf->offset;
                        return;
                }

                size -= buf->len;
                pipe_buf_release(pipe, buf);
                pipe->nrbufs--;
                idx = next_idx(idx, pipe);
        }

        /* We have exhausted all pipe buffers, reset everything */
        pipe->curbuf = 0;
        i->idx = 0;
        i->iov_offset = 0;
        i->count = 0;

        /* Did we try to advace past that? */
        WARN_ON_ONCE(size);
  }

which looks a lot more straightforward, freeing the buffers as it goes
along, and doesn't screw around comparing indexes?

But as I said, I'm probably missing something, because the current
"pipe_advance()" function looks so incomprehensible to me. So I'm
assuming there's some reason for the craziness.

                 Linus

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-13 19:33                 ` Linus Torvalds
@ 2017-01-13 20:08                   ` Al Viro
  2017-01-13 20:11                     ` Al Viro
  2017-01-13 20:08                   ` Linus Torvalds
  1 sibling, 1 reply; 42+ messages in thread
From: Al Viro @ 2017-01-13 20:08 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel

On Fri, Jan 13, 2017 at 11:33:18AM -0800, Linus Torvalds wrote:

> What am I missing?

The fact that we want to free the _tail_, not the beginning of the
damn thing.

> Why is "pipe_advance()" written in that incomprehensible form? Why
> don't we do the pipe_buf_release() as we advance through it, instead
> of doing it at the end?
> 
> Also, the line
> 
>                 buf->len = size;
> 
> in that pipe_advance() function looks buggy. "size" is how much we
> *remove* from buf->len, shouldn't we update buf->len by subtracting
> size?

Because it's "truncate to size", not "throw everything up to that point
out".

We have some amount of data pushed into pipe (in this case - 0) and we
have some buffers allocated by ..._get_pages() past the end of it.
Some of that we want to keep (again, in this case - none) and have the next
copy_to_iter() go after those, the rest we discard.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-13 19:33                 ` Linus Torvalds
  2017-01-13 20:08                   ` Al Viro
@ 2017-01-13 20:08                   ` Linus Torvalds
  2017-01-13 20:16                     ` Al Viro
  1 sibling, 1 reply; 42+ messages in thread
From: Linus Torvalds @ 2017-01-13 20:08 UTC (permalink / raw)
  To: Al Viro; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel

On Fri, Jan 13, 2017 at 11:33 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> This function looks so broken that I must be missing something. Why
> doesn't pipe_advance() just look like the following:
>
>   static void pipe_advance(struct iov_iter *i, size_t size)
>   {
...
>                 pipe_buf_release(pipe, buf);
>                 pipe->nrbufs--;
...

I think this part needs to update "curbufs" too, so something like

                pipe->curbuf = (pipe->curbuf + 1) & (pipe->buffers - 1);

although I think that "idx" has to track curbuf here anyway, so I
guess it could just be combined with the idx update and look something
like

                pipe->curbuf = idx = next_idx(idx, pipe);

in there. Otherwise we get out of sync with the pipe state.

Or maybe I'm just full of it.

                 Linus

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-13 20:08                   ` Al Viro
@ 2017-01-13 20:11                     ` Al Viro
  2017-01-13 20:32                       ` Linus Torvalds
  0 siblings, 1 reply; 42+ messages in thread
From: Al Viro @ 2017-01-13 20:11 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel

On Fri, Jan 13, 2017 at 08:08:27PM +0000, Al Viro wrote:

> Because it's "truncate to size", not "throw everything up to that point
> out".
> 
> We have some amount of data pushed into pipe (in this case - 0) and we
> have some buffers allocated by ..._get_pages() past the end of it.
> Some of that we want to keep (again, in this case - none) and have the next
> copy_to_iter() go after those, the rest we discard.

PS: 'size' argument of iov_iter_advance() is the second "some" in the
above - we tell it how much we want to advance by and everything past
that point is, in case of PIPE_ITER, discarded.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-13 20:08                   ` Linus Torvalds
@ 2017-01-13 20:16                     ` Al Viro
  0 siblings, 0 replies; 42+ messages in thread
From: Al Viro @ 2017-01-13 20:16 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel

On Fri, Jan 13, 2017 at 12:08:44PM -0800, Linus Torvalds wrote:
> On Fri, Jan 13, 2017 at 11:33 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > This function looks so broken that I must be missing something. Why
> > doesn't pipe_advance() just look like the following:
> >
> >   static void pipe_advance(struct iov_iter *i, size_t size)
> >   {
> ...
> >                 pipe_buf_release(pipe, buf);
> >                 pipe->nrbufs--;
> ...
> 
> I think this part needs to update "curbufs" too, so something like
> 
>                 pipe->curbuf = (pipe->curbuf + 1) & (pipe->buffers - 1);
> 
> although I think that "idx" has to track curbuf here anyway, so I
> guess it could just be combined with the idx update and look something
> like
> 
>                 pipe->curbuf = idx = next_idx(idx, pipe);
> 
> in there. Otherwise we get out of sync with the pipe state.

You are looking at the wrong end of that cyclic buffer.  ->curbuf is where
the data begins (and it might be well prior to anything we'd pushed there -
pipe might've been non-empty).  Then we have ->nrbufs allocated buffers and
i->idx points to the place where copy_to_iter() will put the data.

We want pipe_advance() to
	* move the point where copy_to_iter() would go by this much
	* free all preallocated buffers past that point.

->curbuf is for pipe readers; we are dealing with writing to pipe here.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-13 20:11                     ` Al Viro
@ 2017-01-13 20:32                       ` Linus Torvalds
  2017-01-13 20:47                         ` Al Viro
  0 siblings, 1 reply; 42+ messages in thread
From: Linus Torvalds @ 2017-01-13 20:32 UTC (permalink / raw)
  To: Al Viro; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel

On Fri, Jan 13, 2017 at 12:11 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> PS: 'size' argument of iov_iter_advance() is the second "some" in the
> above - we tell it how much we want to advance by and everything past
> that point is, in case of PIPE_ITER, discarded.

Ok. The naming threw me. It would be more logical to call that
operation a "truncate", not advance.

I notice that one of the comments in fs/splice.c actually says that:

        iov_iter_advance(&to, copied);  /* truncates and discards */

but yes, I see what it's trying to do now.

Ugh. I still think your patch is butt-ugly, and the index comparisons
make me nervous, but..

Let's see if Alan's issue actually goes away with your later patch.

             Linus

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-13 20:32                       ` Linus Torvalds
@ 2017-01-13 20:47                         ` Al Viro
  2017-01-13 21:55                           ` Al Viro
  0 siblings, 1 reply; 42+ messages in thread
From: Al Viro @ 2017-01-13 20:47 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel

On Fri, Jan 13, 2017 at 12:32:37PM -0800, Linus Torvalds wrote:

> Ugh. I still think your patch is butt-ugly, and the index comparisons
> make me nervous, but..

No arguments here - 6am on 20-odd hours of uptime is _not_ a good time
for writing, especially since the data structure needs better documentation
and probably a couple of inlined helpers.  I'll try to massage the damn
thing into more readable form.

> Let's see if Alan's issue actually goes away with your later patch.

*nod*

I'm fairly sure this is the root cause of what's going on there, but it would
be nice to have an experimental confirmation.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-13 20:47                         ` Al Viro
@ 2017-01-13 21:55                           ` Al Viro
  2017-01-13 21:59                             ` Al Viro
  0 siblings, 1 reply; 42+ messages in thread
From: Al Viro @ 2017-01-13 21:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel

On Fri, Jan 13, 2017 at 08:47:31PM +0000, Al Viro wrote:
> On Fri, Jan 13, 2017 at 12:32:37PM -0800, Linus Torvalds wrote:
> 
> > Ugh. I still think your patch is butt-ugly, and the index comparisons
> > make me nervous, but..
> 
> No arguments here - 6am on 20-odd hours of uptime is _not_ a good time
> for writing, especially since the data structure needs better documentation
> and probably a couple of inlined helpers.  I'll try to massage the damn
> thing into more readable form.

FWIW, I think it will be more readable if we separate the "advance" and
"truncate" parts like this (warning: not even build-tested).  Comments?

static inline void pipe_truncate(struct iov_iter *i)
{
        struct pipe_inode_info *pipe = i->pipe;
        if (pipe->nrbufs) {
                size_t off = i->iov_offset;
                int idx = i->idx;
                int n;

		n = (pipe->curbuf + pipe->nrbufs - idx) & (pipe->buffers - 1);
                if (off) {
                        pipe->bufs[idx].len = off - pipe->bufs[idx].offset;
                        /* free all after idx; n can't be 0 */
                        idx = next_idx(idx, pipe);
                        n--;
                } else {
                        /* free all _starting_ at idx.
                         * n is 0 when we have nothing to do
                         * *or* when we are truncating full pipe to empty.
                         */
                        if (pipe->nrbufs == pipe->buffers && !n)
                                n = pipe->buffers;
                }
                while (n--) {
                        pipe_buf_release(pipe, &pipe->bufs[idx]);
                        idx = next_idx(idx, pipe);
                        pipe->nrbufs--;
                }
        }
}

static void pipe_advance(struct iov_iter *i, size_t size)
{
        struct pipe_inode_info *pipe = i->pipe;
        if (unlikely(i->count < size))
                size = i->count;
        if (size) {
                struct pipe_buffer *buf;
                size_t off = i->iov_offset, left = size;
                int idx = i->idx;
                if (off) /* make it relative to the beginning of buffer */
                        left += off - pipe->bufs[idx].offset;
                while (1) {
                        buf = &pipe->bufs[idx];
                        if (left <= buf->len)
                                break;
                        left -= buf->len;
                        idx = next_idx(idx, pipe);
                }
                i->idx = idx;
                i->iov_offset = buf->offset + left;
		i->count -= size;
        }
        /* ... and discard everything past that point */
        pipe_truncate(i);
}

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-13 21:55                           ` Al Viro
@ 2017-01-13 21:59                             ` Al Viro
  2017-01-13 22:13                               ` Al Viro
  0 siblings, 1 reply; 42+ messages in thread
From: Al Viro @ 2017-01-13 21:59 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel

On Fri, Jan 13, 2017 at 09:55:04PM +0000, Al Viro wrote:
> On Fri, Jan 13, 2017 at 08:47:31PM +0000, Al Viro wrote:
> > On Fri, Jan 13, 2017 at 12:32:37PM -0800, Linus Torvalds wrote:
> > 
> > > Ugh. I still think your patch is butt-ugly, and the index comparisons
> > > make me nervous, but..
> > 
> > No arguments here - 6am on 20-odd hours of uptime is _not_ a good time
> > for writing, especially since the data structure needs better documentation
> > and probably a couple of inlined helpers.  I'll try to massage the damn
> > thing into more readable form.
> 
> FWIW, I think it will be more readable if we separate the "advance" and
> "truncate" parts like this (warning: not even build-tested).  Comments?

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 25f572303801..dae1ac940d5f 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -730,43 +730,60 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
 }
 EXPORT_SYMBOL(iov_iter_copy_from_user_atomic);
 
+static inline void pipe_truncate(struct iov_iter *i)
+{
+	struct pipe_inode_info *pipe = i->pipe;
+	if (pipe->nrbufs) {
+		size_t off = i->iov_offset;
+		int idx = i->idx;
+		int n;
+
+		n = (pipe->curbuf + pipe->nrbufs - idx) & (pipe->buffers - 1);
+		if (off) {
+			pipe->bufs[idx].len = off - pipe->bufs[idx].offset;
+			/* free all after idx; n can't be 0 */
+			idx = next_idx(idx, pipe);
+			n--;
+		} else {
+			/* free all _starting_ at idx.
+			 * n is 0 when we have nothing to do
+			 * *or* when we are truncating full pipe to empty.
+			 */
+			if (pipe->nrbufs == pipe->buffers && !n)
+				n = pipe->buffers;
+		}
+		while (n--) {
+			pipe_buf_release(pipe, &pipe->bufs[idx]);
+			idx = next_idx(idx, pipe);
+			pipe->nrbufs--;
+		}
+	}
+}
+
 static void pipe_advance(struct iov_iter *i, size_t size)
 {
 	struct pipe_inode_info *pipe = i->pipe;
-	struct pipe_buffer *buf;
-	int idx = i->idx;
-	size_t off = i->iov_offset, orig_sz;
-	
 	if (unlikely(i->count < size))
 		size = i->count;
-	orig_sz = size;
-
 	if (size) {
+		struct pipe_buffer *buf;
+		size_t off = i->iov_offset, left = size;
+		int idx = i->idx;
 		if (off) /* make it relative to the beginning of buffer */
-			size += off - pipe->bufs[idx].offset;
+			left += off - pipe->bufs[idx].offset;
 		while (1) {
 			buf = &pipe->bufs[idx];
-			if (size <= buf->len)
+			if (left <= buf->len)
 				break;
-			size -= buf->len;
+			left -= buf->len;
 			idx = next_idx(idx, pipe);
 		}
-		buf->len = size;
 		i->idx = idx;
-		off = i->iov_offset = buf->offset + size;
-	}
-	if (off)
-		idx = next_idx(idx, pipe);
-	if (pipe->nrbufs) {
-		int unused = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);
-		/* [curbuf,unused) is in use.  Free [idx,unused) */
-		while (idx != unused) {
-			pipe_buf_release(pipe, &pipe->bufs[idx]);
-			idx = next_idx(idx, pipe);
-			pipe->nrbufs--;
-		}
+		i->iov_offset = buf->offset + left;
 	}
-	i->count -= orig_sz;
+	i->count -= size;
+	/* ... and discard everything past that point */
+	pipe_truncate(i);
 }
 
 void iov_iter_advance(struct iov_iter *i, size_t size)
@@ -826,6 +843,7 @@ void iov_iter_pipe(struct iov_iter *i, int direction,
 			size_t count)
 {
 	BUG_ON(direction != ITER_PIPE);
+	WARN_ON(pipe->nrbufs == pipe->buffers);
 	i->type = direction;
 	i->pipe = pipe;
 	i->idx = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-13 21:59                             ` Al Viro
@ 2017-01-13 22:13                               ` Al Viro
  2017-01-13 22:50                                 ` Al Viro
  0 siblings, 1 reply; 42+ messages in thread
From: Al Viro @ 2017-01-13 22:13 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel

On Fri, Jan 13, 2017 at 09:59:19PM +0000, Al Viro wrote:
> +		if (off) {
> +			pipe->bufs[idx].len = off - pipe->bufs[idx].offset;
> +			/* free all after idx; n can't be 0 */

Yes, it can - full pipe, truncation to the part of the first buffer ;-/
OTOH, handling the wraparound can be done uniformly for both "empty all"
and "leave something" cases...  Fixed variant follows:

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 25f572303801..69ae97fb6b1a 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -730,43 +730,54 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
 }
 EXPORT_SYMBOL(iov_iter_copy_from_user_atomic);
 
+static inline void pipe_truncate(struct iov_iter *i)
+{
+	struct pipe_inode_info *pipe = i->pipe;
+	if (pipe->nrbufs) {
+		size_t off = i->iov_offset;
+		int idx = i->idx;
+		int n;
+
+		n = (pipe->curbuf + pipe->nrbufs - idx) & (pipe->buffers - 1);
+		if (off) {
+			pipe->bufs[idx].len = off - pipe->bufs[idx].offset;
+			idx = next_idx(idx, pipe);
+			n--;
+		}
+		if (pipe->nrbufs == pipe->buffers && n <= 0)
+			n += pipe->buffers;
+		while (n--) {
+			pipe_buf_release(pipe, &pipe->bufs[idx]);
+			idx = next_idx(idx, pipe);
+			pipe->nrbufs--;
+		}
+	}
+}
+
 static void pipe_advance(struct iov_iter *i, size_t size)
 {
 	struct pipe_inode_info *pipe = i->pipe;
-	struct pipe_buffer *buf;
-	int idx = i->idx;
-	size_t off = i->iov_offset, orig_sz;
-	
 	if (unlikely(i->count < size))
 		size = i->count;
-	orig_sz = size;
-
 	if (size) {
+		struct pipe_buffer *buf;
+		size_t off = i->iov_offset, left = size;
+		int idx = i->idx;
 		if (off) /* make it relative to the beginning of buffer */
-			size += off - pipe->bufs[idx].offset;
+			left += off - pipe->bufs[idx].offset;
 		while (1) {
 			buf = &pipe->bufs[idx];
-			if (size <= buf->len)
+			if (left <= buf->len)
 				break;
-			size -= buf->len;
+			left -= buf->len;
 			idx = next_idx(idx, pipe);
 		}
-		buf->len = size;
 		i->idx = idx;
-		off = i->iov_offset = buf->offset + size;
-	}
-	if (off)
-		idx = next_idx(idx, pipe);
-	if (pipe->nrbufs) {
-		int unused = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);
-		/* [curbuf,unused) is in use.  Free [idx,unused) */
-		while (idx != unused) {
-			pipe_buf_release(pipe, &pipe->bufs[idx]);
-			idx = next_idx(idx, pipe);
-			pipe->nrbufs--;
-		}
+		i->iov_offset = buf->offset + left;
 	}
-	i->count -= orig_sz;
+	i->count -= size;
+	/* ... and discard everything past that point */
+	pipe_truncate(i);
 }
 
 void iov_iter_advance(struct iov_iter *i, size_t size)
@@ -826,6 +837,7 @@ void iov_iter_pipe(struct iov_iter *i, int direction,
 			size_t count)
 {
 	BUG_ON(direction != ITER_PIPE);
+	WARN_ON(pipe->nrbufs == pipe->buffers);
 	i->type = direction;
 	i->pipe = pipe;
 	i->idx = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-13 22:13                               ` Al Viro
@ 2017-01-13 22:50                                 ` Al Viro
  2017-01-14  0:59                                   ` Linus Torvalds
  2017-01-14 13:16                                   ` Alan J. Wylie
  0 siblings, 2 replies; 42+ messages in thread
From: Al Viro @ 2017-01-13 22:50 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel

On Fri, Jan 13, 2017 at 10:13:08PM +0000, Al Viro wrote:
> On Fri, Jan 13, 2017 at 09:59:19PM +0000, Al Viro wrote:
> > +		if (off) {
> > +			pipe->bufs[idx].len = off - pipe->bufs[idx].offset;
> > +			/* free all after idx; n can't be 0 */
> 
> Yes, it can - full pipe, truncation to the part of the first buffer ;-/
> OTOH, handling the wraparound can be done uniformly for both "empty all"
> and "leave something" cases...  Fixed variant follows:

Or, even better, we can get rid of all wraparound-related crap if we
calculate the final value of pipe->nrbufs and watch for _that_ as
loop condition:

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 25f572303801..e68604ae3ced 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -730,43 +730,50 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
 }
 EXPORT_SYMBOL(iov_iter_copy_from_user_atomic);
 
+static inline void pipe_truncate(struct iov_iter *i)
+{
+	struct pipe_inode_info *pipe = i->pipe;
+	if (pipe->nrbufs) {
+		size_t off = i->iov_offset;
+		int idx = i->idx;
+		int nrbufs = (idx - pipe->curbuf) & (pipe->buffers - 1);
+		if (off) {
+			pipe->bufs[idx].len = off - pipe->bufs[idx].offset;
+			idx = next_idx(idx, pipe);
+			nrbufs++;
+		}
+		while (pipe->nrbufs > nrbufs) {
+			pipe_buf_release(pipe, &pipe->bufs[idx]);
+			idx = next_idx(idx, pipe);
+			pipe->nrbufs--;
+		}
+	}
+}
+
 static void pipe_advance(struct iov_iter *i, size_t size)
 {
 	struct pipe_inode_info *pipe = i->pipe;
-	struct pipe_buffer *buf;
-	int idx = i->idx;
-	size_t off = i->iov_offset, orig_sz;
-	
 	if (unlikely(i->count < size))
 		size = i->count;
-	orig_sz = size;
-
 	if (size) {
+		struct pipe_buffer *buf;
+		size_t off = i->iov_offset, left = size;
+		int idx = i->idx;
 		if (off) /* make it relative to the beginning of buffer */
-			size += off - pipe->bufs[idx].offset;
+			left += off - pipe->bufs[idx].offset;
 		while (1) {
 			buf = &pipe->bufs[idx];
-			if (size <= buf->len)
+			if (left <= buf->len)
 				break;
-			size -= buf->len;
+			left -= buf->len;
 			idx = next_idx(idx, pipe);
 		}
-		buf->len = size;
 		i->idx = idx;
-		off = i->iov_offset = buf->offset + size;
-	}
-	if (off)
-		idx = next_idx(idx, pipe);
-	if (pipe->nrbufs) {
-		int unused = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);
-		/* [curbuf,unused) is in use.  Free [idx,unused) */
-		while (idx != unused) {
-			pipe_buf_release(pipe, &pipe->bufs[idx]);
-			idx = next_idx(idx, pipe);
-			pipe->nrbufs--;
-		}
+		i->iov_offset = buf->offset + left;
 	}
-	i->count -= orig_sz;
+	i->count -= size;
+	/* ... and discard everything past that point */
+	pipe_truncate(i);
 }
 
 void iov_iter_advance(struct iov_iter *i, size_t size)
@@ -826,6 +833,7 @@ void iov_iter_pipe(struct iov_iter *i, int direction,
 			size_t count)
 {
 	BUG_ON(direction != ITER_PIPE);
+	WARN_ON(pipe->nrbufs == pipe->buffers);
 	i->type = direction;
 	i->pipe = pipe;
 	i->idx = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-13 22:50                                 ` Al Viro
@ 2017-01-14  0:59                                   ` Linus Torvalds
  2017-01-14  1:24                                     ` Al Viro
  2017-01-14 13:16                                   ` Alan J. Wylie
  1 sibling, 1 reply; 42+ messages in thread
From: Linus Torvalds @ 2017-01-14  0:59 UTC (permalink / raw)
  To: Al Viro; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel

On Fri, Jan 13, 2017 at 2:50 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Or, even better, we can get rid of all wraparound-related crap if we
> calculate the final value of pipe->nrbufs and watch for _that_ as
> loop condition:

Ok, I like the 'expected nrbufs' approach. It makes the actual freeing
loop trivial, and gets rid of that whole complexity.

I'm not totally enamored with the details of exactly how that expected
valie is calculated, though.

> +static inline void pipe_truncate(struct iov_iter *i)
> +{
> +       struct pipe_inode_info *pipe = i->pipe;
> +       if (pipe->nrbufs) {
> +               size_t off = i->iov_offset;
> +               int idx = i->idx;
> +               int nrbufs = (idx - pipe->curbuf) & (pipe->buffers - 1);

Ok, here we have what superficially appears to be the same wraparound
issue, except not.

The idx/iov_offset rules are subtly different from the pipe buffer
index rules, and point to the actual last buffer.

So here, 'idx' really points to the last buffer, and we're possibly
off by one, and then you fix that up with:

> +               if (off) {
> +                       pipe->bufs[idx].len = off - pipe->bufs[idx].offset;
> +                       idx = next_idx(idx, pipe);
> +                       nrbufs++;
> +               }

and I think this all even works, and I like how the logic here matches
what the PIPE_PARANOIA tests are. Fine.

EXCEPT.

I don't think "i->iov_offset" is actually correct. If you truncated
the whole thing, you should have cleared iov_offset too, and that
never happened. So truncating everything will not empty the buffers
for us, we'll still get to that "if (off)" case and have nrbufs=1.

So I'd expect something like

    if (!size)
        i->iov_offset = 0;

in pipe_advance(), in order to really free all buffers for that case. No?

Or is there some guarantee that iov_offset was already zero there? I
don't see it, but I'm batting zero on this code...

So I'm clearly still missing something. It looks better to me, but I
still don't get some of the cases.

Can you please explain when you are once more awake..

                Linus

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-14  0:59                                   ` Linus Torvalds
@ 2017-01-14  1:24                                     ` Al Viro
  2017-01-14  1:43                                       ` Al Viro
  2017-01-14  1:46                                       ` Linus Torvalds
  0 siblings, 2 replies; 42+ messages in thread
From: Al Viro @ 2017-01-14  1:24 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel

On Fri, Jan 13, 2017 at 04:59:37PM -0800, Linus Torvalds wrote:

> EXCEPT.
> 
> I don't think "i->iov_offset" is actually correct. If you truncated
> the whole thing, you should have cleared iov_offset too, and that
> never happened. So truncating everything will not empty the buffers
> for us, we'll still get to that "if (off)" case and have nrbufs=1.
> 
> So I'd expect something like
> 
>     if (!size)
>         i->iov_offset = 0;
> 
> in pipe_advance(), in order to really free all buffers for that case. No?

Why would advance by 0 change ->iov_offset here?

> Or is there some guarantee that iov_offset was already zero there? I
> don't see it, but I'm batting zero on this code...

It was zero initially (see iov_iter_pipe()).  It was not affected by
iov_iter_get_pages() and friends.  If there was copy_to_iter/zero_iter/
copy_page_to_iter for any non-zero amount of data, then it's _not_ zero and
should bloody well stay such.

Note that the normal case is something like O_DIRECT read grabbing pages,
doing a bunch of IO into such and, once it's done, iov_iter_advance()
for the actual amount read being done by
                retval = mapping->a_ops->direct_IO(iocb, &data);
                if (retval >= 0) {
                        iocb->ki_pos += retval;
                        iov_iter_advance(iter, retval);
                }
which advances iter past the actually read data.  If we proceed to
fall back to the do_generic_file_read(), it will do a bunch of
copy_page_to_iter() to the points past that.

default_file_splice_read() pretty much imitates O_DIRECT read in that
respect, with kernel_readv() being a counterpart of direct_IO.

generic_file_splice_read() is another PIPE_ITER user; that one really
calls ->read_iter().  On any non-error (including short reads) it will
not bother with iov_iter_advance() at all - ->read_iter() should've
called that already, if at all needed.

On error it does use iov_iter_advance(), pretty much as a way to
trigger pipe_truncate().  There we directly reset ->iov_offset to 0
and ->idx to its original value.  Strictly speaking, we could live
without calling it there at all - normally ->read_iter() instances follow
the usual "if you'd managed to read something, report a short read,
not an error" approach.  None of the exceptions are good candidates for
use of generic_file_splice_read() anyway.

However, theoretically it is possible that ->read_iter() instance does
successful copy_to_iter() and then decides to return an error.  This
        } else if (ret < 0) {
                to.idx = idx;
                to.iov_offset = 0;
                iov_iter_advance(&to, 0); /* to free what was emitted */
in generic_file_splice_read() catches any such cases.  Here the call
of iov_iter_advance(&to, 0) is guaranteed to be equivalent to
pipe_truncate(&to); we might make the latter non-static and call it
directly, but I'm not sure it's worth doing.
 
> Can you please explain when you are once more awake..

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-14  1:24                                     ` Al Viro
@ 2017-01-14  1:43                                       ` Al Viro
  2017-01-14  1:46                                       ` Linus Torvalds
  1 sibling, 0 replies; 42+ messages in thread
From: Al Viro @ 2017-01-14  1:43 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel

On Sat, Jan 14, 2017 at 01:24:28AM +0000, Al Viro wrote:
> On Fri, Jan 13, 2017 at 04:59:37PM -0800, Linus Torvalds wrote:
> 
> > EXCEPT.
> > 
> > I don't think "i->iov_offset" is actually correct. If you truncated
> > the whole thing, you should have cleared iov_offset too, and that
> > never happened. So truncating everything will not empty the buffers
> > for us, we'll still get to that "if (off)" case and have nrbufs=1.
> > 
> > So I'd expect something like
> > 
> >     if (!size)
> >         i->iov_offset = 0;
> > 
> > in pipe_advance(), in order to really free all buffers for that case. No?
> 
> Why would advance by 0 change ->iov_offset here?
> 
> > Or is there some guarantee that iov_offset was already zero there? I
> > don't see it, but I'm batting zero on this code...
> 
> It was zero initially (see iov_iter_pipe()).  It was not affected by
> iov_iter_get_pages() and friends.  If there was copy_to_iter/zero_iter/
> copy_page_to_iter for any non-zero amount of data, then it's _not_ zero and
> should bloody well stay such.

PS: note that after copy_page_to_iter()/copy_to_iter()/zero_iter() we have
->idx pointing to the last used buffer and ->iov_offset pointing to the end
of data in it, even if it's certain to be full and the next piece written
will go into the next buffer.  The only situation where we have zero
->iov_offset is when no copying had been done at all (via that iov_iter,
that is).  In that case ->idx points to the empty buffer.

If you start with empty pipe and copy_to_pipe() until it fills, you'll have
(assuming e.g. 4K pages and ->curbuf == 3 in the beginning)
	->idx == 3, ->iov_offset == 0 initially
	->idx == 3, ->iov_offset == 4096 after 4096 bytes copied
	->idx == 4, ->iov_offset == 1 after 4097 bytes
...
	->idx == 2, ->iov_offset == 4096 by the end of it.

That's what this "correction" is about...

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-14  1:24                                     ` Al Viro
  2017-01-14  1:43                                       ` Al Viro
@ 2017-01-14  1:46                                       ` Linus Torvalds
  2017-01-14  1:57                                         ` Al Viro
  1 sibling, 1 reply; 42+ messages in thread
From: Linus Torvalds @ 2017-01-14  1:46 UTC (permalink / raw)
  To: Al Viro; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel

On Fri, Jan 13, 2017 at 5:24 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Why would advance by 0 change ->iov_offset here?

That's not my worry. Advancing by zero obviously doesn't change the position.

But the _truncation_ of the rest requires iov_offset to be zero in
order to actually truncate everything.

So I was worrying about something updating it, and then wanting to
truncate things on error.

But you bring up the kinds of cases I worried about:

> On error it does use iov_iter_advance(), pretty much as a way to
> trigger pipe_truncate().  There we directly reset ->iov_offset to 0
> and ->idx to its original value.

Ok, this was the part I worried about. And this

> However, theoretically it is possible that ->read_iter() instance does
> successful copy_to_iter() and then decides to return an error.  This
>         } else if (ret < 0) {
>                 to.idx = idx;
>                 to.iov_offset = 0;
>                 iov_iter_advance(&to, 0); /* to free what was emitted */
> in generic_file_splice_read() catches any such cases.

So I'm happy with that last patch then, and my worries are laid to rest.

                Linus

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-14  1:46                                       ` Linus Torvalds
@ 2017-01-14  1:57                                         ` Al Viro
  2017-01-15  0:53                                           ` Al Viro
  0 siblings, 1 reply; 42+ messages in thread
From: Al Viro @ 2017-01-14  1:57 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel

On Fri, Jan 13, 2017 at 05:46:34PM -0800, Linus Torvalds wrote:
> On Fri, Jan 13, 2017 at 5:24 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Why would advance by 0 change ->iov_offset here?
> 
> That's not my worry. Advancing by zero obviously doesn't change the position.
> 
> But the _truncation_ of the rest requires iov_offset to be zero in
> order to actually truncate everything.
> 
> So I was worrying about something updating it, and then wanting to
> truncate things on error.
> 
> But you bring up the kinds of cases I worried about:
> 
> > On error it does use iov_iter_advance(), pretty much as a way to
> > trigger pipe_truncate().  There we directly reset ->iov_offset to 0
> > and ->idx to its original value.
> 
> Ok, this was the part I worried about. And this
> 
> > However, theoretically it is possible that ->read_iter() instance does
> > successful copy_to_iter() and then decides to return an error.  This
> >         } else if (ret < 0) {
> >                 to.idx = idx;
> >                 to.iov_offset = 0;
> >                 iov_iter_advance(&to, 0); /* to free what was emitted */
> > in generic_file_splice_read() catches any such cases.
> 
> So I'm happy with that last patch then, and my worries are laid to rest.

OK.  Let's wait for Alan to confirm that the last variant works and
I'll send a pull request (with Cc: stable # v4.9).

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-13 22:50                                 ` Al Viro
  2017-01-14  0:59                                   ` Linus Torvalds
@ 2017-01-14 13:16                                   ` Alan J. Wylie
  2017-01-14 16:29                                     ` Alan J. Wylie
  1 sibling, 1 reply; 42+ messages in thread
From: Alan J. Wylie @ 2017-01-14 13:16 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, Thorsten Leemhuis, linux-kernel

at 22:50 on Fri 13-Jan-2017 Al Viro (viro@ZenIV.linux.org.uk) wrote:

> Or, even better, we can get rid of all wraparound-related crap if we
> calculate the final value of pipe->nrbufs and watch for _that_ as
> loop condition:
> 
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 25f572303801..e68604ae3ced 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -730,43 +730,50 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
>  }
>  EXPORT_SYMBOL(iov_iter_copy_from_user_atomic);
>  
> +static inline void pipe_truncate(struct iov_iter *i)
...

I've dragged an old box out of the storeroom and upgraded and
configured it so it reproduced the problem.

With the patch, the mail is sent with no hang

# uname -a
Linux maglor 4.9.3-dirty #3 SMP Sat Jan 14 12:13:45 GMT 2017 x86_64 AMD Athlon(tm) II X2 270 Processor AuthenticAMD GNU/Linux

Jan 14 12:56:01 maglor cron[741]: (root) CMD (date; /work/chroot-shared/test.sh; date)
Jan 14 12:56:01 maglor postfix/pickup[649]: 8577A604F2: uid=0 from=<root>
Jan 14 12:56:01 maglor postfix/cleanup[752]: 8577A604F2: message-id=<20170114125601.8577A604F2@maglor.localdomain>
Jan 14 12:56:01 maglor postfix/qmgr[650]: 8577A604F2: from=<root@maglor.localdomain>, size=735, nrcpt=1 (queue active)
Jan 14 12:56:01 maglor postfix/smtp[754]: 8577A604F2: to=<alan@frodo.localnet>, orig_to=<root>, relay=frodo.localnet[192.168.21.2]:25, delay=0.54, delays=0.43/0/0.04/0.07, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as A08CC440065)
Jan 14 12:56:01 maglor postfix/qmgr[650]: 8577A604F2: removed

No other problems are apparent.

I'll run this for a bit, then apply it to my workstation (which I'm
rather fond of) and make sure there are no new regressions.

-- 
Alan J. Wylie                                          http://www.wylie.me.uk/

Dance like no-one's watching. / Encrypt like everyone is.
Security is inversely proportional to convenience

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-14 13:16                                   ` Alan J. Wylie
@ 2017-01-14 16:29                                     ` Alan J. Wylie
  2017-01-14 17:57                                       ` Linus Torvalds
  0 siblings, 1 reply; 42+ messages in thread
From: Alan J. Wylie @ 2017-01-14 16:29 UTC (permalink / raw)
  To: Al Viro, Linus Torvalds, Thorsten Leemhuis, linux-kernel

at 13:16 on Sat 14-Jan-2017 Alan J. Wylie (alan@wylie.me.uk) wrote:

> I'll run this for a bit, then apply it to my workstation (which I'm
> rather fond of) and make sure there are no new regressions.

Looking good.

-- 
Alan J. Wylie                                          http://www.wylie.me.uk/

Dance like no-one's watching. / Encrypt like everyone is.
Security is inversely proportional to convenience

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-14 16:29                                     ` Alan J. Wylie
@ 2017-01-14 17:57                                       ` Linus Torvalds
  0 siblings, 0 replies; 42+ messages in thread
From: Linus Torvalds @ 2017-01-14 17:57 UTC (permalink / raw)
  To: Alan J. Wylie; +Cc: Al Viro, Thorsten Leemhuis, linux-kernel

On Sat, Jan 14, 2017 at 8:29 AM, Alan J. Wylie <alan@wylie.me.uk> wrote:
> at 13:16 on Sat 14-Jan-2017 Alan J. Wylie (alan@wylie.me.uk) wrote:
>
>> I'll run this for a bit, then apply it to my workstation (which I'm
>> rather fond of) and make sure there are no new regressions.
>
> Looking good.

Thanks for the pinpointing and all the testing,

               Linus

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn
  2017-01-14  1:57                                         ` Al Viro
@ 2017-01-15  0:53                                           ` Al Viro
  0 siblings, 0 replies; 42+ messages in thread
From: Al Viro @ 2017-01-15  0:53 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel

On Sat, Jan 14, 2017 at 01:57:57AM +0000, Al Viro wrote:

> OK.  Let's wait for Alan to confirm that the last variant works and
> I'll send a pull request (with Cc: stable # v4.9).

... and here it is.

The following changes since commit b4b8664d291ac1998e0f0bcdc96b6397f0fe68b3:

  arm64: don't pull uaccess.h into *.S (2016-12-26 13:05:17 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-linus

for you to fetch changes up to b9dc6f65bc5e232d1c05fe34b5daadc7e8bbf1fb:

  fix a fencepost error in pipe_advance() (2017-01-14 19:50:41 -0500)

----------------------------------------------------------------
Al Viro (1):
      fix a fencepost error in pipe_advance()

Dave Kleikamp (1):
      coredump: Ensure proper size of sparse core files

Gu Zheng (1):
      tmpfs: clear S_ISGID when setting posix ACLs

Shaohua Li (1):
      aio: fix lock dep warning

 fs/aio.c                 |  6 ++++--
 fs/binfmt_elf.c          |  1 +
 fs/coredump.c            | 18 ++++++++++++++++
 fs/posix_acl.c           |  9 ++++----
 include/linux/coredump.h |  1 +
 lib/iov_iter.c           | 54 +++++++++++++++++++++++++++---------------------
 6 files changed, 59 insertions(+), 30 deletions(-)

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2017-01-15  0:53 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-12 20:26 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn Alan J. Wylie
2017-01-12 20:31 ` Al Viro
2017-01-12 20:38   ` Alan J. Wylie
2017-01-12 22:26 ` Linus Torvalds
2017-01-12 22:37   ` Al Viro
2017-01-12 22:46     ` Al Viro
2017-01-12 23:02       ` Linus Torvalds
2017-01-12 23:14         ` Al Viro
2017-01-12 23:14         ` Linus Torvalds
2017-01-12 23:27           ` Al Viro
2017-01-12 22:46   ` Alan J. Wylie
2017-01-12 22:58     ` Al Viro
2017-01-12 23:28     ` Linus Torvalds
2017-01-13  4:00       ` Al Viro
2017-01-13  7:38         ` Alan J. Wylie
2017-01-13  7:23       ` Alan J. Wylie
2017-01-13  9:33         ` Al Viro
2017-01-13  9:54           ` Alan J. Wylie
2017-01-13 10:20             ` Al Viro
2017-01-13 10:32               ` Alan J. Wylie
2017-01-13 11:25                 ` Al Viro
2017-01-13 11:18               ` Al Viro
2017-01-13 19:33                 ` Linus Torvalds
2017-01-13 20:08                   ` Al Viro
2017-01-13 20:11                     ` Al Viro
2017-01-13 20:32                       ` Linus Torvalds
2017-01-13 20:47                         ` Al Viro
2017-01-13 21:55                           ` Al Viro
2017-01-13 21:59                             ` Al Viro
2017-01-13 22:13                               ` Al Viro
2017-01-13 22:50                                 ` Al Viro
2017-01-14  0:59                                   ` Linus Torvalds
2017-01-14  1:24                                     ` Al Viro
2017-01-14  1:43                                       ` Al Viro
2017-01-14  1:46                                       ` Linus Torvalds
2017-01-14  1:57                                         ` Al Viro
2017-01-15  0:53                                           ` Al Viro
2017-01-14 13:16                                   ` Alan J. Wylie
2017-01-14 16:29                                     ` Alan J. Wylie
2017-01-14 17:57                                       ` Linus Torvalds
2017-01-13 20:08                   ` Linus Torvalds
2017-01-13 20:16                     ` Al Viro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).