* 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn @ 2017-01-12 20:26 Alan J. Wylie 2017-01-12 20:31 ` Al Viro 2017-01-12 22:26 ` Linus Torvalds 0 siblings, 2 replies; 42+ messages in thread From: Alan J. Wylie @ 2017-01-12 20:26 UTC (permalink / raw) To: Thorsten Leemhuis, Al Viro, linux-kernel [-- Attachment #1: message body text --] [-- Type: text/plain, Size: 5702 bytes --] Some time after 4.9.0 was released, I noticed that a cron job running systemd-nspawn was hanging trying to send mail. I have trimmed it down to a minimal demo: /* from crontab */ 48 19 * * * date; /work/chroot-shared/test.sh; date /* script ------8<------8<------8<------8<------8<------8< */ # cat /work/chroot-shared/test.sh #!/bin/bash set -x date systemd-nspawn -q -D /work/chroot.32 --register=no date date /* ------8<------8<------8<------8<------8<------8<------8< */ and bisected it to $ git --no-pager show 523ac9afc73acdcf9f00bd35b6ffb4a7c624a7d7 commit 523ac9afc73acdcf9f00bd35b6ffb4a7c624a7d7 Author: Al Viro <viro@zeniv.linux.org.uk> Date: Fri Sep 23 15:34:57 2016 -0400 switch default_file_splice_read() to use of pipe-backed iov_iter we only use iov_iter_get_pages_alloc() and iov_iter_advance() - pages are filled by kernel_readv() via a kvec array (as we used to do all along), so iov_iter here is used only as a way of arranging for those pages to be in pipe. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> The system is an up-to-date gentoo. Config attached. $ eix -ce postfix [I] mail-mta/postfix (3.1.2-r1@21/12/16): A fast and secure drop-in replacement for sendmail $ eix -ce vixie-cron [I] sys-process/vixie-cron (4.1-r14@24/09/15): Paul Vixie's cron daemon, a fully featured crond implementation $ eix -ce systemd [I] sys-apps/systemd (231@06/12/16): System and service manager for Linux Strace shows that the processes are hanging in write() and read() calls. ==================================================================================================== # ps axfu | grep -A5 "[c]ron" root 990 0.0 0.0 18708 2068 ? Ss 20:34 0:00 /usr/sbin/cron root 6399 0.0 0.0 27292 2172 ? S 21:10 0:00 \_ /usr/sbin/cron root 6402 0.0 0.0 9864 2480 ? Ss 21:10 0:00 \_ /bin/bash -c date; /work/chroot-shared/test.sh; date root 6406 0.0 0.0 9868 2472 ? S 21:10 0:00 | \_ /bin/bash /work/chroot-shared/test.sh root 6405 0.0 0.0 76152 5428 ? S 21:10 0:00 \_ /usr/sbin/sendmail -FCronDaemon -odi -oem -oi -t root 6412 0.0 0.0 76140 5600 ? S 21:10 0:00 \_ /usr/sbin/postdrop -r at 993 0.0 0.0 16840 160 ? Ss 20:34 0:00 /usr/sbin/atd # strace -p 6406 strace: Process 6406 attached write(2, "+ date\n", 7 ^Cstrace: Process 6406 detached <detached ...> # # strace -p 6405 strace: Process 6405 attached read(0, ^Cstrace: Process 6405 detached <detached ...> # ==================================================================================================== git bisect start # good: [c8d2bc9bc39ebea8437fd974fdbc21847bb897a3] Linux 4.8 git bisect good c8d2bc9bc39ebea8437fd974fdbc21847bb897a3 # good: [8bba2e2e62cbf2db0d03e4de1204f7850bc45c44] Linux 4.8.15 git bisect good 8bba2e2e62cbf2db0d03e4de1204f7850bc45c44 # bad: [1001354ca34179f3db924eb66672442a173147dc] Linux 4.9-rc1 git bisect bad 1001354ca34179f3db924eb66672442a173147dc # good: [41844e36206be90cd4d962ea49b0abc3612a99d0] Merge tag 'staging-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging git bisect good 41844e36206be90cd4d962ea49b0abc3612a99d0 # bad: [6b5e09a748ad0a0b198d0e268c7e689044bfe48a] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net git bisect bad 6b5e09a748ad0a0b198d0e268c7e689044bfe48a # bad: [2c34ff14bf1d03a705f5400888ecac5b6400e981] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/egtvedt/linux-avr32 git bisect bad 2c34ff14bf1d03a705f5400888ecac5b6400e981 # good: [d042380886fb2fc6c4b0fcfe1214ba473769a8e9] Merge tag 'mfd-for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd git bisect good d042380886fb2fc6c4b0fcfe1214ba473769a8e9 # good: [bc75450cc3db3485db1e289fef8c1028ba38296a] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid git bisect good bc75450cc3db3485db1e289fef8c1028ba38296a # good: [2eee010d092903ee95716b6c2fbd9d3289839aa4] Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 git bisect good 2eee010d092903ee95716b6c2fbd9d3289839aa4 # skip: [ca2431633b414d112c39ec69d85d96f090e49453] powerpc/64s: Consolidate Directed Privileged Doorbell 0xa00 interrupt git bisect skip ca2431633b414d112c39ec69d85d96f090e49453 # skip: [d312603a44eb9dc0dbb0a642a61899bac188ab71] powerpc/Makefile: CROSS32AS is unused, remove it git bisect skip d312603a44eb9dc0dbb0a642a61899bac188ab71 # skip: [35066c0d798906d46c352c3f12844d2a162d057d] powerpc/eeh: Export confirm_error_lock git bisect skip 35066c0d798906d46c352c3f12844d2a162d057d # bad: [a779638cf622f069a484e8802134cca3c6c71415] pipe: add pipe_buf_release() helper git bisect bad a779638cf622f069a484e8802134cca3c6c71415 # good: [25869262ef7af24ccde988867ac3eb1c3d4b88d4] skb_splice_bits(): get rid of callback git bisect good 25869262ef7af24ccde988867ac3eb1c3d4b88d4 # bad: [523ac9afc73acdcf9f00bd35b6ffb4a7c624a7d7] switch default_file_splice_read() to use of pipe-backed iov_iter git bisect bad 523ac9afc73acdcf9f00bd35b6ffb4a7c624a7d7 # good: [241699cd72a8489c9446ae3910ddd243e9b9061b] new iov_iter flavour: pipe-backed git bisect good 241699cd72a8489c9446ae3910ddd243e9b9061b # good: [82c156f853840645604acd7c2cebcb75ed1b6652] switch generic_file_splice_read() to use of ->read_iter() git bisect good 82c156f853840645604acd7c2cebcb75ed1b6652 # first bad commit: [523ac9afc73acdcf9f00bd35b6ffb4a7c624a7d7] switch default_file_splice_read() to use of pipe-backed iov_iter ==================================================================================================== [-- Attachment #2: config-4.9.3.bz2 --] [-- Type: application/octet-stream, Size: 21088 bytes --] [-- Attachment #3: .signature --] [-- Type: text/plain, Size: 194 bytes --] -- Alan J. Wylie http://www.wylie.me.uk/ Dance like no-one's watching. / Encrypt like everyone is. Security is inversely proportional to convenience ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-12 20:26 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn Alan J. Wylie @ 2017-01-12 20:31 ` Al Viro 2017-01-12 20:38 ` Alan J. Wylie 2017-01-12 22:26 ` Linus Torvalds 1 sibling, 1 reply; 42+ messages in thread From: Al Viro @ 2017-01-12 20:31 UTC (permalink / raw) To: Alan J. Wylie; +Cc: Thorsten Leemhuis, linux-kernel On Thu, Jan 12, 2017 at 08:26:52PM +0000, Alan J. Wylie wrote: > > Some time after 4.9.0 was released, I noticed that a cron job running > systemd-nspawn was hanging trying to send mail. Check if 8e54cadab447 (fix default_file_splice_read()) fixes it. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-12 20:31 ` Al Viro @ 2017-01-12 20:38 ` Alan J. Wylie 0 siblings, 0 replies; 42+ messages in thread From: Alan J. Wylie @ 2017-01-12 20:38 UTC (permalink / raw) To: Al Viro; +Cc: Thorsten Leemhuis, linux-kernel at 20:31 on Thu 12-Jan-2017 Al Viro (viro@ZenIV.linux.org.uk) wrote: > On Thu, Jan 12, 2017 at 08:26:52PM +0000, Alan J. Wylie wrote: > > > > Some time after 4.9.0 was released, I noticed that a cron job running > > systemd-nspawn was hanging trying to send mail. > > Check if 8e54cadab447 (fix default_file_splice_read()) fixes it. That's already in 4.9.3, which still has the problem. -- Alan J. Wylie http://www.wylie.me.uk/ Dance like no-one's watching. / Encrypt like everyone is. Security is inversely proportional to convenience ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-12 20:26 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn Alan J. Wylie 2017-01-12 20:31 ` Al Viro @ 2017-01-12 22:26 ` Linus Torvalds 2017-01-12 22:37 ` Al Viro 2017-01-12 22:46 ` Alan J. Wylie 1 sibling, 2 replies; 42+ messages in thread From: Linus Torvalds @ 2017-01-12 22:26 UTC (permalink / raw) To: Alan J. Wylie; +Cc: Thorsten Leemhuis, Al Viro, linux-kernel On Thu, Jan 12, 2017 at 12:26 PM, Alan J. Wylie <alan@wylie.me.uk> wrote: > > Strace shows that the processes are hanging in write() and read() calls. If this is splice-related, I'm assuming that they aren't actually the two ends of the same pipe, and there is somebody doing splice in the middle. I'm not seeing that process. I'm assuming it's systemd. Can you try to find it and strace that one too? Because that middle man is likely the one that has problems (and is not able to splice from one pipe to the other). Ugh. That one commit has had a lot of bugs in it already. We do not have good splice test coverage, because almost nobody uses it. Linus ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-12 22:26 ` Linus Torvalds @ 2017-01-12 22:37 ` Al Viro 2017-01-12 22:46 ` Al Viro 2017-01-12 22:46 ` Alan J. Wylie 1 sibling, 1 reply; 42+ messages in thread From: Al Viro @ 2017-01-12 22:37 UTC (permalink / raw) To: Linus Torvalds; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel On Thu, Jan 12, 2017 at 02:26:42PM -0800, Linus Torvalds wrote: > On Thu, Jan 12, 2017 at 12:26 PM, Alan J. Wylie <alan@wylie.me.uk> wrote: > > > > Strace shows that the processes are hanging in write() and read() calls. > > If this is splice-related, I'm assuming that they aren't actually the > two ends of the same pipe, and there is somebody doing splice in the > middle. > > I'm not seeing that process. I'm assuming it's systemd. Can you try > to find it and strace that one too? Because that middle man is likely > the one that has problems (and is not able to splice from one pipe to > the other). > > Ugh. That one commit has had a lot of bugs in it already. We do not > have good splice test coverage, because almost nobody uses it. FWIW, I would really like to know what kind of files had been involved. There are two paths that can lead to default_file_splice_read(): splice_direct_to_actor() -> do_splice_to() -> default_file_splice_read() and do_splice() -> do_splice_to() -> default_file_splice_read(). The former only gets there for regular files and block devices. The latter is guaranteed that file is not a pipe. So * not a socket (have ->splice_read() of their own) * not a pipe or FIFO (neither path allows those) * not a block device (have ->splice_read() of their own) * not a regular file on a normal local fs (ditto) So what is it called for in that reproducer? ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-12 22:37 ` Al Viro @ 2017-01-12 22:46 ` Al Viro 2017-01-12 23:02 ` Linus Torvalds 0 siblings, 1 reply; 42+ messages in thread From: Al Viro @ 2017-01-12 22:46 UTC (permalink / raw) To: Linus Torvalds; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel On Thu, Jan 12, 2017 at 10:37:18PM +0000, Al Viro wrote: > On Thu, Jan 12, 2017 at 02:26:42PM -0800, Linus Torvalds wrote: > > On Thu, Jan 12, 2017 at 12:26 PM, Alan J. Wylie <alan@wylie.me.uk> wrote: > > > > > > Strace shows that the processes are hanging in write() and read() calls. > > > > If this is splice-related, I'm assuming that they aren't actually the > > two ends of the same pipe, and there is somebody doing splice in the > > middle. > > > > I'm not seeing that process. I'm assuming it's systemd. Can you try > > to find it and strace that one too? Because that middle man is likely > > the one that has problems (and is not able to splice from one pipe to > > the other). > > > > Ugh. That one commit has had a lot of bugs in it already. We do not > > have good splice test coverage, because almost nobody uses it. > > FWIW, I would really like to know what kind of files had been involved. > There are two paths that can lead to default_file_splice_read(): > splice_direct_to_actor() -> do_splice_to() -> default_file_splice_read() and > do_splice() -> do_splice_to() -> default_file_splice_read(). > > The former only gets there for regular files and block devices. The latter > is guaranteed that file is not a pipe. So > * not a socket (have ->splice_read() of their own) > * not a pipe or FIFO (neither path allows those) > * not a block device (have ->splice_read() of their own) > * not a regular file on a normal local fs (ditto) > > So what is it called for in that reproducer? PS: what about the /proc/mounts contents? If it's something 9p-backed kvm, your bisect might have been caught on the bug I'd mentioned - if the breakage you are seeing in 4.9.3 has started after that commit and before the backport of the fix, your bisect could converge there. Does the reproducer trigger on 523ac9afc73a + cherry-pick of 8e54cadab447? ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-12 22:46 ` Al Viro @ 2017-01-12 23:02 ` Linus Torvalds 2017-01-12 23:14 ` Al Viro 2017-01-12 23:14 ` Linus Torvalds 0 siblings, 2 replies; 42+ messages in thread From: Linus Torvalds @ 2017-01-12 23:02 UTC (permalink / raw) To: Al Viro; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel On Thu, Jan 12, 2017 at 2:46 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > PS: what about the /proc/mounts contents? If it's something 9p-backed kvm, > your bisect might have been caught on the bug I'd mentioned - if the breakage > you are seeing in 4.9.3 has started after that commit and before the > backport of the fix, your bisect could converge there. Does the > reproducer trigger on 523ac9afc73a + cherry-pick of 8e54cadab447? Looking at the strace that Alan just posted, I really think it's the splice change that Alan bisected to. In particular, this line: splice(5, NULL, 1, NULL, 9223372036854775807, 0) = -1 EAGAIN (Resource temporarily unavailable) and note that the commit in question introduces that -EAGAIN error code. The old code never returned EAGAIN at all (well, it could do so later, if NONBLOCK was set, obviously, but that doesn't seem to be the case here). So that commit seems to have introduced a new error case, and I suspect systemd-nospawn simply doesn't handle it. It is expecting splice_to_pipe() to actually block. Ergo: I think we need to do a wait_for_space() somewhere, getting rid of the EAGAIN. Looking at the callers of "do_splice_to()", we already have the wait_for_space() in do_splice(), but we do *not* have it in the do_splice_from() case when both the input and output file descriptors are pipes. Hmm? Linus ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-12 23:02 ` Linus Torvalds @ 2017-01-12 23:14 ` Al Viro 2017-01-12 23:14 ` Linus Torvalds 1 sibling, 0 replies; 42+ messages in thread From: Al Viro @ 2017-01-12 23:14 UTC (permalink / raw) To: Linus Torvalds; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel On Thu, Jan 12, 2017 at 03:02:13PM -0800, Linus Torvalds wrote: > splice(5, NULL, 1, NULL, 9223372036854775807, 0) = -1 EAGAIN > (Resource temporarily unavailable) > > and note that the commit in question introduces that -EAGAIN error code. > > The old code never returned EAGAIN at all (well, it could do so later, > if NONBLOCK was set, obviously, but that doesn't seem to be the case > here). > > So that commit seems to have introduced a new error case, and I > suspect systemd-nospawn simply doesn't handle it. It is expecting > splice_to_pipe() to actually block. > > Ergo: I think we need to do a wait_for_space() somewhere, getting rid > of the EAGAIN. > > Looking at the callers of "do_splice_to()", we already have the > wait_for_space() in do_splice(), but we do *not* have it in the > do_splice_from() case when both the input and output file descriptors > are pipes. >From the look of his strace, the source is /dev/ptmx. Pipe-to-pipe splice goes into splice_pipe_to_pipe() anyway. do_splice_from() is pipe-to-non-pipe, and it doesn't go anywhere near default_file_splice_read()... do_splice_to() is the only thing that can call default_file_splice_read() and there are only two callers - do_splice() (with its wait_for_space()) and splice_direct_to_actor(), which has internal pipe for destination. That certainly shouldn't be calling wait_for_space() - there's no other thread that could possibly read from the destination getting more space in there. IOW, it's do_splice() -> do_splice_to() -> default_file_splice_read(). ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-12 23:02 ` Linus Torvalds 2017-01-12 23:14 ` Al Viro @ 2017-01-12 23:14 ` Linus Torvalds 2017-01-12 23:27 ` Al Viro 1 sibling, 1 reply; 42+ messages in thread From: Linus Torvalds @ 2017-01-12 23:14 UTC (permalink / raw) To: Al Viro; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel On Thu, Jan 12, 2017 at 3:02 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Looking at the callers of "do_splice_to()", we already have the > wait_for_space() in do_splice(), but we do *not* have it in the > do_splice_from() case when both the input and output file descriptors > are pipes. Bah. That case doesn't even trigger the new code. I was lazy with my grep. The two cases are "do_splice()" (which does have the wait-for-space) and splice_direct_to_actor(). And splice_direct_to_actor() shouldn't even need it, should it? So ignore that. But I think there is something about the EAGAIN. Linus ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-12 23:14 ` Linus Torvalds @ 2017-01-12 23:27 ` Al Viro 0 siblings, 0 replies; 42+ messages in thread From: Al Viro @ 2017-01-12 23:27 UTC (permalink / raw) To: Linus Torvalds; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel On Thu, Jan 12, 2017 at 03:14:41PM -0800, Linus Torvalds wrote: > On Thu, Jan 12, 2017 at 3:02 PM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > Looking at the callers of "do_splice_to()", we already have the > > wait_for_space() in do_splice(), but we do *not* have it in the > > do_splice_from() case when both the input and output file descriptors > > are pipes. > > Bah. That case doesn't even trigger the new code. I was lazy with my > grep. The two cases are "do_splice()" (which does have the > wait-for-space) and splice_direct_to_actor(). And > splice_direct_to_actor() shouldn't even need it, should it? > > So ignore that. But I think there is something about the EAGAIN. It might, but I would really like to see where has that EAGAIN come from. I see several possibilities: * wait_for_space() with SPLICE_F_NONBLOCK in flags. Shouldn't happen with 0 in the last argument of splice(2). * default_file_splice_read() seeing pipe->nrbufs == pipe->buffers. Shouldn't be possible after successful wait_for_space(). * vfs_readv() returning -EAGAIN. That might be possible, actually - the damn thing has come from 13761 open("/dev/ptmx", O_RDWR|O_NOCTTY|O_NONBLOCK|O_CLOEXEC) = 5 and O_NONBLOCK had been present in open flags. What I'd like to see is strace of the same thing on the working kernel, ideally - just prior to the commit bisect has converged to. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-12 22:26 ` Linus Torvalds 2017-01-12 22:37 ` Al Viro @ 2017-01-12 22:46 ` Alan J. Wylie 2017-01-12 22:58 ` Al Viro 2017-01-12 23:28 ` Linus Torvalds 1 sibling, 2 replies; 42+ messages in thread From: Alan J. Wylie @ 2017-01-12 22:46 UTC (permalink / raw) To: Linus Torvalds; +Cc: Thorsten Leemhuis, Al Viro, linux-kernel at 14:26 on Thu 12-Jan-2017 Linus Torvalds (torvalds@linux-foundation.org) wrote: > > Strace shows that the processes are hanging in write() and read() calls. > > If this is splice-related, I'm assuming that they aren't actually the > two ends of the same pipe, and there is somebody doing splice in the > middle. > > I'm not seeing that process. I'm assuming it's systemd. Can you try > to find it and strace that one too? strace -f -o /var/tmp/strace.txt systemd-nspawn -q -D /work/chroot.32 --register=no date full output at https://wylie.me.uk/tmp/strace.txt last few lines are: 13766 execve("/bin/date", ["date"], [/* 7 vars */] <unfinished ...> ... 13766 close(2) = 0 13766 exit_group(0) = ? 13766 +++ exited with 0 +++ 13761 <... epoll_wait resumed> [{EPOLLIN, {u32=3843638720, u64=94179591538112}}], 7, -1) = 1 13761 clock_gettime(CLOCK_BOOTTIME, {9780, 528425388}) = 0 13761 read(13, "\21\0\0\0\0\0\0\0\1\0\0\0\3065\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 128) = 128 13761 epoll_ctl(11, EPOLL_CTL_DEL, 1, NULL) = 0 13761 epoll_ctl(11, EPOLL_CTL_DEL, 5, NULL) = 0 13761 signalfd4(13, [INT TERM CHLD], 8, SFD_CLOEXEC|SFD_NONBLOCK) = 13 13761 fcntl(0, F_GETFL) = 0 (flags O_RDONLY) 13761 fcntl(1, F_GETFL) = 0x1 (flags O_WRONLY) 13761 kill(13766, SIGKILL) = 0 13761 waitid(P_PID, 13766, {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=13766, si_uid=0, si_status=0, si_utime=0, si_stime=0}, WEXITED, NULL) = 0 13761 close(6) = 0 13761 close(7) = 0 13761 close(8) = 0 13761 close(9) = 0 13761 close(18) = 0 13761 close(16) = 0 13761 close(14) = 0 13761 close(10) = 0 13761 copy_file_range(5, NULL, 1, NULL, 9223372036854775807, 0) = -1 EXDEV (Invalid cross-device link) 13761 sendfile(1, 5, NULL, 9223372036854775807) = -1 EINVAL (Invalid argument) 13761 splice(5, NULL, 1, NULL, 9223372036854775807, 0) = -1 EAGAIN (Resource temporarily unavailable) 13761 open("/run/systemd/nspawn/propagate/chroot.32", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_NOFOLLOW|O_NOATIME|O_CLOEXEC) = 6 13761 fstatfs(6, {f_type=TMPFS_MAGIC, f_bsize=4096, f_blocks=1020327, f_bfree=1015931, f_bavail=1015931, f_files=1020327, f_ffree=1019297, f_fsid={0, 0}, f_namelen=255, f_frsize=4096, f_flags=ST_VALID|ST_NOSUID|ST_NODEV}) = 0 13761 fstat(6, {st_mode=S_IFDIR|0600, st_size=40, ...}) = 0 13761 fcntl(6, F_GETFL) = 0x78800 (flags O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_NOFOLLOW|O_NOATIME) 13761 fcntl(6, F_SETFD, FD_CLOEXEC) = 0 13761 getdents(6, /* 2 entries */, 32768) = 48 13761 getdents(6, /* 0 entries */, 32768) = 0 13761 close(6) = 0 13761 rmdir("/run/systemd/nspawn/propagate/chroot.32") = 0 13761 unlink("/work/.#chroot.32.lck") = 0 13761 close(3) = 0 13761 unlink("/run/systemd/nspawn/locks/inode-65035:24641537") = 0 13761 close(4) = 0 13761 close(5) = 0 13761 exit_group(0) = ? 13761 +++ exited with 0 +++ I'm off to bed now with a stinking head cold I'm afraid. -- Alan J. Wylie http://www.wylie.me.uk/ Dance like no-one's watching. / Encrypt like everyone is. Security is inversely proportional to convenience ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-12 22:46 ` Alan J. Wylie @ 2017-01-12 22:58 ` Al Viro 2017-01-12 23:28 ` Linus Torvalds 1 sibling, 0 replies; 42+ messages in thread From: Al Viro @ 2017-01-12 22:58 UTC (permalink / raw) To: Alan J. Wylie; +Cc: Linus Torvalds, Thorsten Leemhuis, linux-kernel On Thu, Jan 12, 2017 at 10:46:06PM +0000, Alan J. Wylie wrote: > at 14:26 on Thu 12-Jan-2017 Linus Torvalds (torvalds@linux-foundation.org) wrote: > 13761 copy_file_range(5, NULL, 1, NULL, 9223372036854775807, 0) = -1 EXDEV (Invalid cross-device link) > 13761 sendfile(1, 5, NULL, 9223372036854775807) = -1 EINVAL (Invalid argument) > 13761 splice(5, NULL, 1, NULL, 9223372036854775807, 0) = -1 EAGAIN (Resource temporarily unavailable) *bleep* splice from /dev/ptmx to stdout, whatever the latter had been. IOW, we are dealing with do_splice() -> do_splice_to() -> default_file_splice_read() for pseudo-tty as source... ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-12 22:46 ` Alan J. Wylie 2017-01-12 22:58 ` Al Viro @ 2017-01-12 23:28 ` Linus Torvalds 2017-01-13 4:00 ` Al Viro 2017-01-13 7:23 ` Alan J. Wylie 1 sibling, 2 replies; 42+ messages in thread From: Linus Torvalds @ 2017-01-12 23:28 UTC (permalink / raw) To: Alan J. Wylie; +Cc: Thorsten Leemhuis, Al Viro, linux-kernel On Thu, Jan 12, 2017 at 2:46 PM, Alan J. Wylie <alan@wylie.me.uk> wrote: > > I'm off to bed now with a stinking head cold I'm afraid. So assuming Al hasn't figured it out by the time you get back, can you try to send us the same strace for the working case? That fd5 (ptmx) does have O_NONBLOCK, so maybe the EAGAIN actually comes from reading the ptmx, rather than the new AGAIN in fs/splice.c. But if so, I'd have expected the old code to do the same thing. Ho humm. Linus ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-12 23:28 ` Linus Torvalds @ 2017-01-13 4:00 ` Al Viro 2017-01-13 7:38 ` Alan J. Wylie 2017-01-13 7:23 ` Alan J. Wylie 1 sibling, 1 reply; 42+ messages in thread From: Al Viro @ 2017-01-13 4:00 UTC (permalink / raw) To: Linus Torvalds; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel On Thu, Jan 12, 2017 at 03:28:32PM -0800, Linus Torvalds wrote: > On Thu, Jan 12, 2017 at 2:46 PM, Alan J. Wylie <alan@wylie.me.uk> wrote: > > > > I'm off to bed now with a stinking head cold I'm afraid. > > So assuming Al hasn't figured it out by the time you get back, can you > try to send us the same strace for the working case? That, and check if it is reproducible on commit 523ac9afc73a with commit 8e54cadab447 cherry-picked on top of that. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-13 4:00 ` Al Viro @ 2017-01-13 7:38 ` Alan J. Wylie 0 siblings, 0 replies; 42+ messages in thread From: Alan J. Wylie @ 2017-01-13 7:38 UTC (permalink / raw) To: Al Viro; +Cc: Linus Torvalds, Thorsten Leemhuis, linux-kernel at 04:00 on Fri 13-Jan-2017 Al Viro (viro@ZenIV.linux.org.uk) wrote: > That, and check if it is reproducible on commit 523ac9afc73a with > commit 8e54cadab447 cherry-picked on top of that. $ git checkout 523ac9afc73a $ git cherry-pick 8e54cadab447 [detached HEAD 3826f27f6830] fix default_file_splice_read() Author: Al Viro <viro@zeniv.linux.org.uk> Date: Sat Nov 26 20:05:42 2016 -0500 1 file changed, 2 insertions(+), 1 deletion(-) $ uname -a Linux frodo 4.8.0-rc8-00011-g3826f27f6830 #1 SMP PREEMPT Fri Jan 13 07:30:06 GMT 2017 x86_64 AMD FX(tm)-8350 Eight-Core Processor AuthenticAMD GNU/Linux # ps axfu | grep -A10 cron root 982 0.0 0.0 18764 1948 ? Ss 07:32 0:00 /usr/sbin/cron root 1593 0.0 0.0 27340 2164 ? S 07:35 0:00 \_ /usr/sbin/cron root 1594 0.0 0.0 9840 2572 ? Ss 07:35 0:00 \_ /bin/bash -c date; /work/chroot-shared/test.sh; date root 1597 0.0 0.0 9840 2668 ? S 07:35 0:00 | \_ /bin/bash /work/chroot-shared/test.sh root 1596 0.0 0.0 76156 5496 ? S 07:35 0:00 \_ /usr/sbin/sendmail -FCronDaemon -odi -oem -oi -t root 1600 0.0 0.0 76140 5440 ? S 07:35 0:00 \_ /usr/sbin/postdrop -r Still hangs -- Alan J. Wylie http://www.wylie.me.uk/ Dance like no-one's watching. / Encrypt like everyone is. Security is inversely proportional to convenience ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-12 23:28 ` Linus Torvalds 2017-01-13 4:00 ` Al Viro @ 2017-01-13 7:23 ` Alan J. Wylie 2017-01-13 9:33 ` Al Viro 1 sibling, 1 reply; 42+ messages in thread From: Alan J. Wylie @ 2017-01-13 7:23 UTC (permalink / raw) To: Linus Torvalds; +Cc: Thorsten Leemhuis, Al Viro, linux-kernel at 15:28 on Thu 12-Jan-2017 Linus Torvalds (torvalds@linux-foundation.org) wrote: > So assuming Al hasn't figured it out by the time you get back, can > you try to send us the same strace for the working case? Full output at https://wylie.me.uk/tmp/strace1.txt # uname -a Linux frodo 4.8.17 #1 SMP PREEMPT Fri Jan 13 06:54:26 GMT 2017 x86_64 AMD FX(tm)-8350 Eight-Core Processor AuthenticAMD GNU/Linux 1738 close(2) = 0 1738 exit_group(0) = ? 1738 +++ exited with 0 +++ 1735 <... epoll_wait resumed> [{EPOLLIN, {u32=1570273664, u64=94155843336576}}], 7, -1) = 1 1735 clock_gettime(CLOCK_BOOTTIME, {292, 912429255}) = 0 1735 read(13, "\21\0\0\0\0\0\0\0\1\0\0\0\312\6\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 128) = 128 1735 epoll_ctl(11, EPOLL_CTL_DEL, 1, NULL) = 0 1735 epoll_ctl(11, EPOLL_CTL_DEL, 5, NULL) = 0 1735 signalfd4(13, [INT TERM CHLD], 8, SFD_CLOEXEC|SFD_NONBLOCK) = 13 1735 fcntl(0, F_GETFL) = 0 (flags O_RDONLY) 1735 fcntl(1, F_GETFL) = 0x1 (flags O_WRONLY) 1735 kill(1738, SIGKILL) = 0 1735 waitid(P_PID, 1738, {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=1738, si_uid=0, si_status=0, si_utime=0, si_stime=0}, WEXITED, NULL) = 0 1735 close(6) = 0 1735 close(7) = 0 1735 close(8) = 0 1735 close(9) = 0 1735 close(18) = 0 1735 close(16) = 0 1735 close(14) = 0 1735 close(10) = 0 1735 copy_file_range(5, NULL, 1, NULL, 9223372036854775807, 0) = -1 EXDEV (Invalid cross-device link) 1735 sendfile(1, 5, NULL, 9223372036854775807) = -1 EINVAL (Invalid argument) 1735 splice(5, NULL, 1, NULL, 9223372036854775807, 0) = -1 EAGAIN (Resource temporarily unavailable) 1735 open("/run/systemd/nspawn/propagate/chroot.32", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_NOFOLLOW|O_NOATIME|O_CLOEXEC) = 6 1735 fstatfs(6, {f_type=TMPFS_MAGIC, f_bsize=4096, f_blocks=1020314, f_bfree=1015930, f_bavail=1015930, f_files=1020314, f_ffree=1019308, f_fsid={0, 0}, f_namelen=255, f_frsize=4096, f_flags=ST_VALID|ST_NOSUID|ST_NODEV}) = 0 1735 fstat(6, {st_mode=S_IFDIR|0600, st_size=40, ...}) = 0 1735 fcntl(6, F_GETFL) = 0x78800 (flags O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_NOFOLLOW|O_NOATIME) 1735 fcntl(6, F_SETFD, FD_CLOEXEC) = 0 1735 getdents(6, /* 2 entries */, 32768) = 48 1735 getdents(6, /* 0 entries */, 32768) = 0 1735 close(6) = 0 1735 rmdir("/run/systemd/nspawn/propagate/chroot.32") = 0 1735 unlink("/work/.#chroot.32.lck") = 0 1735 close(3) = 0 1735 unlink("/run/systemd/nspawn/locks/inode-65035:24641537") = 0 1735 close(4) = 0 1735 close(5) = 0 1735 exit_group(0) = ? 1735 +++ exited with 0 +++ >From within the systemd-nspawn (no KVM or similar virtualisation involved): Fri Jan 13 07:08:01 GMT 2017 + date Fri Jan 13 07:08:01 GMT 2017 + cat /proc/mounts sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0 proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0 devtmpfs /dev devtmpfs rw,nosuid,size=4039360k,nr_inodes=1009840,mode=755 0 0 tmpfs /dev/shm tmpfs rw,nosuid,nodev 0 0 devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0 tmpfs /run tmpfs rw,nosuid,nodev,mode=755 0 0 tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0 cgroup /sys/fs/cgroup/systemd cgroup +rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0 pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0 cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0 cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0 cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0 cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0 cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0 cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0 /dev/mapper/vg0-root / ext4 rw,relatime,data=ordered 0 0 /dev/mapper/vg0-usr /usr ext4 rw,relatime,data=ordered 0 0 mqueue /dev/mqueue mqueue rw,relatime 0 0 systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=31,pgrp=1,timeout=0,minproto=5,maxproto=5,direct 0 0 debugfs /sys/kernel/debug debugfs rw,relatime 0 0 tmpfs /tmp tmpfs rw,nosuid,nodev 0 0 /dev/mapper/vg1-work /work ext4 rw,relatime,data=ordered 0 0 /dev/mapper/vg1-home /home ext4 rw,relatime,data=ordered 0 0 /dev/sda1 /boot ext4 rw,relatime,data=ordered 0 0 /dev/mapper/vg0-var /var ext4 rw,relatime,data=ordered 0 0 /dev/mapper/vg0-srv /srv ext4 rw,relatime,data=ordered 0 0 tmpfs /run/user/1000 tmpfs rw,nosuid,nodev,relatime,size=816248k,mode=700,uid=1000,gid=100 0 0 + strace -f -o /var/tmp/strace.txt systemd-nspawn -q -D /work/chroot.32 --register=no date Fri Jan 13 07:08:01 GMT 2017 + date Fri Jan 13 07:08:01 GMT 2017 Fri Jan 13 07:08:01 GMT 2017 -- Alan J. Wylie http://www.wylie.me.uk/ Dance like no-one's watching. / Encrypt like everyone is. Security is inversely proportional to convenience ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-13 7:23 ` Alan J. Wylie @ 2017-01-13 9:33 ` Al Viro 2017-01-13 9:54 ` Alan J. Wylie 0 siblings, 1 reply; 42+ messages in thread From: Al Viro @ 2017-01-13 9:33 UTC (permalink / raw) To: Alan J. Wylie; +Cc: Linus Torvalds, Thorsten Leemhuis, linux-kernel On Fri, Jan 13, 2017 at 07:23:51AM +0000, Alan J. Wylie wrote: > at 15:28 on Thu 12-Jan-2017 Linus Torvalds (torvalds@linux-foundation.org) wrote: > > > So assuming Al hasn't figured it out by the time you get back, can > > you try to send us the same strace for the working case? > > Full output at https://wylie.me.uk/tmp/strace1.txt > > 1735 splice(5, NULL, 1, NULL, 9223372036854775807, 0) = -1 EAGAIN (Resource temporarily unavailable) Lovely... So it was getting -EAGAIN all along. Just in case - could you try the delta below and see if it triggers? Simply to exclude the possibility that it *is* this call of splice() and the change has somehow buggered cleanup after the kernel_readv() failure... diff --git a/fs/splice.c b/fs/splice.c index 873d83104e79..1a2d1bc7f19e 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -393,6 +393,9 @@ static ssize_t default_file_splice_read(struct file *in, loff_t *ppos, size_t offset, dummy, copied = 0; ssize_t res; int i; + unsigned nrbufs = pipe->nrbufs, + curbuf = pipe->curbuf, + buffers = pipe->buffers; if (pipe->nrbufs == pipe->buffers) return -EAGAIN; @@ -445,6 +448,16 @@ static ssize_t default_file_splice_read(struct file *in, loff_t *ppos, put_page(pages[i]); kvfree(pages); iov_iter_advance(&to, copied); /* truncates and discards */ + if (res == -EAGAIN && ( + pipe->nrbufs != nrbufs || + pipe->curbuf != curbuf || + pipe->buffers != buffers) + ) { + printk(KERN_ERR "nr: %d->%d, cur: %d->%d, buffers: %d->%d\n", + nrbufs, pipe->nrbufs, + curbuf, pipe->curbuf, + buffers, pipe->buffers); + } return res; } ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-13 9:33 ` Al Viro @ 2017-01-13 9:54 ` Alan J. Wylie 2017-01-13 10:20 ` Al Viro 0 siblings, 1 reply; 42+ messages in thread From: Alan J. Wylie @ 2017-01-13 9:54 UTC (permalink / raw) To: Al Viro; +Cc: Linus Torvalds, Thorsten Leemhuis, linux-kernel at 09:33 on Fri 13-Jan-2017 Al Viro (viro@ZenIV.linux.org.uk) wrote: > > 1735 splice(5, NULL, 1, NULL, 9223372036854775807, 0) = -1 EAGAIN (Resource temporarily unavailable) > > Lovely... So it was getting -EAGAIN all along. Just in case - could you > try the delta below and see if it triggers? Simply to exclude the possibility > that it *is* this call of splice() and the change has somehow buggered cleanup > after the kernel_readv() failure... $ git checkout linux-4.9.y $ git apply patch1 $ uname -a Linux frodo 4.9.3-dirty #1 SMP PREEMPT Fri Jan 13 09:44:42 GMT 2017 x86_64 AMD FX(tm)-8350 Eight-Core Processor AuthenticAMD GNU/Linux # ps axfu | grep -A10 cron root 987 0.0 0.0 18764 2128 ? Ss 09:47 0:00 /usr/sbin/cron root 1662 0.0 0.0 27340 2160 ? S 09:51 0:00 \_ /usr/sbin/cron root 1664 0.0 0.0 9840 1148 ? Ss 09:51 0:00 \_ /bin/bash -c date; /work/chroot-shared/test.sh; date root 1668 0.0 0.0 9840 2652 ? S 09:51 0:00 | \_ /bin/bash /work/chroot-shared/test.sh root 1667 0.0 0.0 76156 5576 ? S 09:51 0:00 \_ /usr/sbin/sendmail -FCronDaemon -odi -oem -oi -t root 1669 0.0 0.0 76144 5412 ? S 09:51 0:00 \_ /usr/sbin/postdrop -r Another hang. # dmesg | tail [ 22.352442] r8169 0000:03:00.0: loading /lib/firmware/4.9.3-dirty/rtl_nic/rtl8168e-3.fw failed with error -2 [ 22.408814] r8169 0000:03:00.0: direct-loading rtl_nic/rtl8168e-3.fw [ 22.408821] fw_set_page_data: fw-rtl_nic/rtl8168e-3.fw buf=ffff92b7b1cb8c80 data=ffffad1641179000 size=3872 [ 22.536043] r8169 0000:03:00.0 enp3s0: link down [ 22.536079] r8169 0000:03:00.0 enp3s0: link down [ 24.873801] r8169 0000:03:00.0 enp3s0: link up [ 24.874766] br0: port 1(enp3s0) entered blocking state [ 24.876622] br0: port 1(enp3s0) entered forwarding state [ 24.878560] IPv6: ADDRCONF(NETDEV_CHANGE): br0: link becomes ready [ 219.683974] nr: 0->16, cur: 5->5, buffers: 16->16 # > > diff --git a/fs/splice.c b/fs/splice.c > index 873d83104e79..1a2d1bc7f19e 100644 > --- a/fs/splice.c > +++ b/fs/splice.c > @@ -393,6 +393,9 @@ static ssize_t default_file_splice_read(struct file *in, loff_t *ppos, > size_t offset, dummy, copied = 0; > ssize_t res; > int i; > + unsigned nrbufs = pipe->nrbufs, > + curbuf = pipe->curbuf, > + buffers = pipe->buffers; > > if (pipe->nrbufs == pipe->buffers) > return -EAGAIN; > @@ -445,6 +448,16 @@ static ssize_t default_file_splice_read(struct file *in, loff_t *ppos, > put_page(pages[i]); > kvfree(pages); > iov_iter_advance(&to, copied); /* truncates and discards */ > + if (res == -EAGAIN && ( > + pipe->nrbufs != nrbufs || > + pipe->curbuf != curbuf || > + pipe->buffers != buffers) > + ) { > + printk(KERN_ERR "nr: %d->%d, cur: %d->%d, buffers: %d->%d\n", > + nrbufs, pipe->nrbufs, > + curbuf, pipe->curbuf, > + buffers, pipe->buffers); > + } > return res; > } > -- Alan J. Wylie http://www.wylie.me.uk/ Dance like no-one's watching. / Encrypt like everyone is. Security is inversely proportional to convenience ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-13 9:54 ` Alan J. Wylie @ 2017-01-13 10:20 ` Al Viro 2017-01-13 10:32 ` Alan J. Wylie 2017-01-13 11:18 ` Al Viro 0 siblings, 2 replies; 42+ messages in thread From: Al Viro @ 2017-01-13 10:20 UTC (permalink / raw) To: Alan J. Wylie; +Cc: Linus Torvalds, Thorsten Leemhuis, linux-kernel On Fri, Jan 13, 2017 at 09:54:02AM +0000, Alan J. Wylie wrote: > root 1669 0.0 0.0 76144 5412 ? S 09:51 0:00 \_ /usr/sbin/postdrop -r > > Another hang. > > # dmesg | tail > [ 22.352442] r8169 0000:03:00.0: loading /lib/firmware/4.9.3-dirty/rtl_nic/rtl8168e-3.fw failed with error -2 > [ 22.408814] r8169 0000:03:00.0: direct-loading rtl_nic/rtl8168e-3.fw > [ 22.408821] fw_set_page_data: fw-rtl_nic/rtl8168e-3.fw buf=ffff92b7b1cb8c80 data=ffffad1641179000 size=3872 > [ 22.536043] r8169 0000:03:00.0 enp3s0: link down > [ 22.536079] r8169 0000:03:00.0 enp3s0: link down > [ 24.873801] r8169 0000:03:00.0 enp3s0: link up > [ 24.874766] br0: port 1(enp3s0) entered blocking state > [ 24.876622] br0: port 1(enp3s0) entered forwarding state > [ 24.878560] IPv6: ADDRCONF(NETDEV_CHANGE): br0: link becomes ready > [ 219.683974] nr: 0->16, cur: 5->5, buffers: 16->16 OK, so it is iov_iter_advance() failing to free the shit allocated, either due to some breakage in pipe_advance() or buggered 'copied'... Let's see which one; could you apply the following and run your reproducer? The only difference from the previous is that it collects and prints a bit more, so it should be just as reproducible... diff --git a/fs/splice.c b/fs/splice.c index 873d83104e79..11477609e7f7 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -393,6 +393,10 @@ static ssize_t default_file_splice_read(struct file *in, loff_t *ppos, size_t offset, dummy, copied = 0; ssize_t res; int i; + unsigned nrbufs = pipe->nrbufs, + curbuf = pipe->curbuf, + buffers = pipe->buffers; + int idx, count, offs; if (pipe->nrbufs == pipe->buffers) return -EAGAIN; @@ -444,7 +448,22 @@ static ssize_t default_file_splice_read(struct file *in, loff_t *ppos, for (i = 0; i < nr_pages; i++) put_page(pages[i]); kvfree(pages); + count = to.count; + idx = to.idx; + offs = to.iov_offset; iov_iter_advance(&to, copied); /* truncates and discards */ + if (res == -EAGAIN && ( + pipe->nrbufs != nrbufs || + pipe->curbuf != curbuf || + pipe->buffers != buffers) + ) { + printk(KERN_ERR "nr: %d->%d, cur: %d->%d, buffers: %d->%d\n", + nrbufs, pipe->nrbufs, + curbuf, pipe->curbuf, + buffers, pipe->buffers); + printk(KERN_ERR "copied: %zd, count:%d, idx:%d, offs:%d\n", + copied, count, idx, offs); + } return res; } ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-13 10:20 ` Al Viro @ 2017-01-13 10:32 ` Alan J. Wylie 2017-01-13 11:25 ` Al Viro 2017-01-13 11:18 ` Al Viro 1 sibling, 1 reply; 42+ messages in thread From: Alan J. Wylie @ 2017-01-13 10:32 UTC (permalink / raw) To: Al Viro; +Cc: Linus Torvalds, Thorsten Leemhuis, linux-kernel at 10:20 on Fri 13-Jan-2017 Al Viro (viro@ZenIV.linux.org.uk) wrote: > OK, so it is iov_iter_advance() failing to free the shit allocated, either > due to some breakage in pipe_advance() or buggered 'copied'... Let's > see which one; could you apply the following and run your reproducer? The > only difference from the previous is that it collects and prints a bit more, > so it should be just as reproducible... git reset --hard HEAD vi ~/mail/INBOX patch2 git apply patch2 # uname -a Linux frodo 4.9.3-dirty #2 SMP PREEMPT Fri Jan 13 10:22:47 GMT 2017 x86_64 AMD FX(tm)-8350 Eight-Core Processor AuthenticAMD GNU/Linux [hanging as expected] # dmesg | tail ... [ 141.553455] nr: 0->16, cur: 5->5, buffers: 16->16 [ 141.553461] copied: 0, count:2147479552, idx:5, offs:0 > > diff --git a/fs/splice.c b/fs/splice.c > index 873d83104e79..11477609e7f7 100644 > --- a/fs/splice.c > +++ b/fs/splice.c > @@ -393,6 +393,10 @@ static ssize_t default_file_splice_read(struct file *in, loff_t *ppos, > size_t offset, dummy, copied = 0; > ssize_t res; > int i; > + unsigned nrbufs = pipe->nrbufs, > + curbuf = pipe->curbuf, > + buffers = pipe->buffers; > + int idx, count, offs; > > if (pipe->nrbufs == pipe->buffers) > return -EAGAIN; > @@ -444,7 +448,22 @@ static ssize_t default_file_splice_read(struct file *in, loff_t *ppos, > for (i = 0; i < nr_pages; i++) > put_page(pages[i]); > kvfree(pages); > + count = to.count; > + idx = to.idx; > + offs = to.iov_offset; > iov_iter_advance(&to, copied); /* truncates and discards */ > + if (res == -EAGAIN && ( > + pipe->nrbufs != nrbufs || > + pipe->curbuf != curbuf || > + pipe->buffers != buffers) > + ) { > + printk(KERN_ERR "nr: %d->%d, cur: %d->%d, buffers: %d->%d\n", > + nrbufs, pipe->nrbufs, > + curbuf, pipe->curbuf, > + buffers, pipe->buffers); > + printk(KERN_ERR "copied: %zd, count:%d, idx:%d, offs:%d\n", > + copied, count, idx, offs); > + } > return res; > } > -- Alan J. Wylie http://www.wylie.me.uk/ Dance like no-one's watching. / Encrypt like everyone is. Security is inversely proportional to convenience ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-13 10:32 ` Alan J. Wylie @ 2017-01-13 11:25 ` Al Viro 0 siblings, 0 replies; 42+ messages in thread From: Al Viro @ 2017-01-13 11:25 UTC (permalink / raw) To: Alan J. Wylie; +Cc: Linus Torvalds, Thorsten Leemhuis, linux-kernel On Fri, Jan 13, 2017 at 10:32:47AM +0000, Alan J. Wylie wrote: > # dmesg | tail > ... > [ 141.553455] nr: 0->16, cur: 5->5, buffers: 16->16 > [ 141.553461] copied: 0, count:2147479552, idx:5, offs:0 OK, that looks like pipe_advance() fencepost (hopefully) dealt with in the lib/iov_iter.c patch I've just posted. I'm going down right now, but if you have a box you are not too fond of, it would be nice if you tried that one. Warning: the patch had been written in a barely awake state, and is completely untested. Treat accordingly... ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-13 10:20 ` Al Viro 2017-01-13 10:32 ` Alan J. Wylie @ 2017-01-13 11:18 ` Al Viro 2017-01-13 19:33 ` Linus Torvalds 1 sibling, 1 reply; 42+ messages in thread From: Al Viro @ 2017-01-13 11:18 UTC (permalink / raw) To: Alan J. Wylie; +Cc: Linus Torvalds, Thorsten Leemhuis, linux-kernel On Fri, Jan 13, 2017 at 10:20:19AM +0000, Al Viro wrote: > OK, so it is iov_iter_advance() failing to free the shit allocated, either > due to some breakage in pipe_advance() or buggered 'copied'... Let's > see which one; could you apply the following and run your reproducer? The > only difference from the previous is that it collects and prints a bit more, > so it should be just as reproducible... Actually, I think I understand what's going on. It's if (pipe->nrbufs) { int unused = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1); /* [curbuf,unused) is in use. Free [idx,unused) */ while (idx != unused) { pipe_buf_release(pipe, &pipe->bufs[idx]); idx = next_idx(idx, pipe); pipe->nrbufs--; } } in case when pipe->nrbufs == pipe->buffers and idx == pipe->curbuf. IOW, we have a full pipe and want to empty it entirely; the fun question is, of course, telling that case from having nothing to free with the same full pipe... OK, so we have either * off != 0 => something's being left in the pipe, i->idx points to the last in-use buffer after that, idx points to the first buffer unused after that. In that case the current logics is correct. * off == 0 => we are emptying the damn thing. i->idx and idx point to the first buffer unused after that (== pipe->curbuf + pipe->nrbufs at the time we'd set the iov_iter up). The current logics is correct unless pipe->nrbufs was originally 0 and now has become pipe->buffers. IOW, we screw up when off == 0, idx == unused, pipe->nrbufs == pipe->buffers... OK, we really ought to make sure that iov_iter_pipe() is never done on a full pipe. AFAICS, we do, and if so the following should suffice. WARNING: it's completely untested and it's a result of debugging a fencepost bug in handling of cyclic buffers done at 6 in the morning, on _way_ too long uptime. Treat as very dangerous; it's not entirely impossible that I hadn't fucked up, but don't consider it anything other than "let's try and see if it immediately explodes" until I've got a chance to reread it after getting some sleep. diff --git a/lib/iov_iter.c b/lib/iov_iter.c index 25f572303801..7bc0b99d3c83 100644 --- a/lib/iov_iter.c +++ b/lib/iov_iter.c @@ -759,11 +759,12 @@ static void pipe_advance(struct iov_iter *i, size_t size) idx = next_idx(idx, pipe); if (pipe->nrbufs) { int unused = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1); - /* [curbuf,unused) is in use. Free [idx,unused) */ - while (idx != unused) { - pipe_buf_release(pipe, &pipe->bufs[idx]); - idx = next_idx(idx, pipe); - pipe->nrbufs--; + if (idx != unused || unlikely(idx == pipe->curbuf && !off)) { + do { + pipe_buf_release(pipe, &pipe->bufs[idx]); + idx = next_idx(idx, pipe); + pipe->nrbufs--; + } while (idx != unused); } } i->count -= orig_sz; @@ -826,6 +827,7 @@ void iov_iter_pipe(struct iov_iter *i, int direction, size_t count) { BUG_ON(direction != ITER_PIPE); + WARN_ON(pipe->nrbufs == pipe->buffers); i->type = direction; i->pipe = pipe; i->idx = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1); ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-13 11:18 ` Al Viro @ 2017-01-13 19:33 ` Linus Torvalds 2017-01-13 20:08 ` Al Viro 2017-01-13 20:08 ` Linus Torvalds 0 siblings, 2 replies; 42+ messages in thread From: Linus Torvalds @ 2017-01-13 19:33 UTC (permalink / raw) To: Al Viro; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel On Fri, Jan 13, 2017 at 3:18 AM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > in case when pipe->nrbufs == pipe->buffers and idx == pipe->curbuf. IOW, > we have a full pipe and want to empty it entirely; the fun question is, > of course, telling that case from having nothing to free with the same > full pipe... That's just because you're looking at all the wrong fields. Doing this: int unused = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1); and then comparing the current index to the unused index is just complete garbage. Exactly because the "nrbufs % pipe->buffers" ambiguity when nrbufs is either 0 or full. So that comparison is actively crap, exactly *because* it has already thrown all the actual data away. You must *never* compare indexes in a circular buffer like this. It's simply wrong. Your patch might work, but it just keeps on doing the wrong thing and then works around the fact that it does the wrong thing by adding more disgusting hacks. What am I missing? Why is "pipe_advance()" written in that incomprehensible form? Why don't we do the pipe_buf_release() as we advance through it, instead of doing it at the end? Also, the line buf->len = size; in that pipe_advance() function looks buggy. "size" is how much we *remove* from buf->len, shouldn't we update buf->len by subtracting size? This function looks so broken that I must be missing something. Why doesn't pipe_advance() just look like the following: static void pipe_advance(struct iov_iter *i, size_t size) { struct pipe_inode_info *pipe = i->pipe; struct pipe_buffer *buf; int idx = i->idx; if (unlikely(i->count < size)) size = i->count; if (!size) return; i->count -= size; while (pipe->nrbufs) { buf = &pipe->bufs[idx]; /* We are left with a partial buffer.. */ if (size < buf->len) { buf->len -= size; buf->offset += size; i->idx = idx; i->iov_offset = buf->offset; return; } size -= buf->len; pipe_buf_release(pipe, buf); pipe->nrbufs--; idx = next_idx(idx, pipe); } /* We have exhausted all pipe buffers, reset everything */ pipe->curbuf = 0; i->idx = 0; i->iov_offset = 0; i->count = 0; /* Did we try to advace past that? */ WARN_ON_ONCE(size); } which looks a lot more straightforward, freeing the buffers as it goes along, and doesn't screw around comparing indexes? But as I said, I'm probably missing something, because the current "pipe_advance()" function looks so incomprehensible to me. So I'm assuming there's some reason for the craziness. Linus ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-13 19:33 ` Linus Torvalds @ 2017-01-13 20:08 ` Al Viro 2017-01-13 20:11 ` Al Viro 2017-01-13 20:08 ` Linus Torvalds 1 sibling, 1 reply; 42+ messages in thread From: Al Viro @ 2017-01-13 20:08 UTC (permalink / raw) To: Linus Torvalds; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel On Fri, Jan 13, 2017 at 11:33:18AM -0800, Linus Torvalds wrote: > What am I missing? The fact that we want to free the _tail_, not the beginning of the damn thing. > Why is "pipe_advance()" written in that incomprehensible form? Why > don't we do the pipe_buf_release() as we advance through it, instead > of doing it at the end? > > Also, the line > > buf->len = size; > > in that pipe_advance() function looks buggy. "size" is how much we > *remove* from buf->len, shouldn't we update buf->len by subtracting > size? Because it's "truncate to size", not "throw everything up to that point out". We have some amount of data pushed into pipe (in this case - 0) and we have some buffers allocated by ..._get_pages() past the end of it. Some of that we want to keep (again, in this case - none) and have the next copy_to_iter() go after those, the rest we discard. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-13 20:08 ` Al Viro @ 2017-01-13 20:11 ` Al Viro 2017-01-13 20:32 ` Linus Torvalds 0 siblings, 1 reply; 42+ messages in thread From: Al Viro @ 2017-01-13 20:11 UTC (permalink / raw) To: Linus Torvalds; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel On Fri, Jan 13, 2017 at 08:08:27PM +0000, Al Viro wrote: > Because it's "truncate to size", not "throw everything up to that point > out". > > We have some amount of data pushed into pipe (in this case - 0) and we > have some buffers allocated by ..._get_pages() past the end of it. > Some of that we want to keep (again, in this case - none) and have the next > copy_to_iter() go after those, the rest we discard. PS: 'size' argument of iov_iter_advance() is the second "some" in the above - we tell it how much we want to advance by and everything past that point is, in case of PIPE_ITER, discarded. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-13 20:11 ` Al Viro @ 2017-01-13 20:32 ` Linus Torvalds 2017-01-13 20:47 ` Al Viro 0 siblings, 1 reply; 42+ messages in thread From: Linus Torvalds @ 2017-01-13 20:32 UTC (permalink / raw) To: Al Viro; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel On Fri, Jan 13, 2017 at 12:11 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > PS: 'size' argument of iov_iter_advance() is the second "some" in the > above - we tell it how much we want to advance by and everything past > that point is, in case of PIPE_ITER, discarded. Ok. The naming threw me. It would be more logical to call that operation a "truncate", not advance. I notice that one of the comments in fs/splice.c actually says that: iov_iter_advance(&to, copied); /* truncates and discards */ but yes, I see what it's trying to do now. Ugh. I still think your patch is butt-ugly, and the index comparisons make me nervous, but.. Let's see if Alan's issue actually goes away with your later patch. Linus ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-13 20:32 ` Linus Torvalds @ 2017-01-13 20:47 ` Al Viro 2017-01-13 21:55 ` Al Viro 0 siblings, 1 reply; 42+ messages in thread From: Al Viro @ 2017-01-13 20:47 UTC (permalink / raw) To: Linus Torvalds; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel On Fri, Jan 13, 2017 at 12:32:37PM -0800, Linus Torvalds wrote: > Ugh. I still think your patch is butt-ugly, and the index comparisons > make me nervous, but.. No arguments here - 6am on 20-odd hours of uptime is _not_ a good time for writing, especially since the data structure needs better documentation and probably a couple of inlined helpers. I'll try to massage the damn thing into more readable form. > Let's see if Alan's issue actually goes away with your later patch. *nod* I'm fairly sure this is the root cause of what's going on there, but it would be nice to have an experimental confirmation. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-13 20:47 ` Al Viro @ 2017-01-13 21:55 ` Al Viro 2017-01-13 21:59 ` Al Viro 0 siblings, 1 reply; 42+ messages in thread From: Al Viro @ 2017-01-13 21:55 UTC (permalink / raw) To: Linus Torvalds; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel On Fri, Jan 13, 2017 at 08:47:31PM +0000, Al Viro wrote: > On Fri, Jan 13, 2017 at 12:32:37PM -0800, Linus Torvalds wrote: > > > Ugh. I still think your patch is butt-ugly, and the index comparisons > > make me nervous, but.. > > No arguments here - 6am on 20-odd hours of uptime is _not_ a good time > for writing, especially since the data structure needs better documentation > and probably a couple of inlined helpers. I'll try to massage the damn > thing into more readable form. FWIW, I think it will be more readable if we separate the "advance" and "truncate" parts like this (warning: not even build-tested). Comments? static inline void pipe_truncate(struct iov_iter *i) { struct pipe_inode_info *pipe = i->pipe; if (pipe->nrbufs) { size_t off = i->iov_offset; int idx = i->idx; int n; n = (pipe->curbuf + pipe->nrbufs - idx) & (pipe->buffers - 1); if (off) { pipe->bufs[idx].len = off - pipe->bufs[idx].offset; /* free all after idx; n can't be 0 */ idx = next_idx(idx, pipe); n--; } else { /* free all _starting_ at idx. * n is 0 when we have nothing to do * *or* when we are truncating full pipe to empty. */ if (pipe->nrbufs == pipe->buffers && !n) n = pipe->buffers; } while (n--) { pipe_buf_release(pipe, &pipe->bufs[idx]); idx = next_idx(idx, pipe); pipe->nrbufs--; } } } static void pipe_advance(struct iov_iter *i, size_t size) { struct pipe_inode_info *pipe = i->pipe; if (unlikely(i->count < size)) size = i->count; if (size) { struct pipe_buffer *buf; size_t off = i->iov_offset, left = size; int idx = i->idx; if (off) /* make it relative to the beginning of buffer */ left += off - pipe->bufs[idx].offset; while (1) { buf = &pipe->bufs[idx]; if (left <= buf->len) break; left -= buf->len; idx = next_idx(idx, pipe); } i->idx = idx; i->iov_offset = buf->offset + left; i->count -= size; } /* ... and discard everything past that point */ pipe_truncate(i); } ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-13 21:55 ` Al Viro @ 2017-01-13 21:59 ` Al Viro 2017-01-13 22:13 ` Al Viro 0 siblings, 1 reply; 42+ messages in thread From: Al Viro @ 2017-01-13 21:59 UTC (permalink / raw) To: Linus Torvalds; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel On Fri, Jan 13, 2017 at 09:55:04PM +0000, Al Viro wrote: > On Fri, Jan 13, 2017 at 08:47:31PM +0000, Al Viro wrote: > > On Fri, Jan 13, 2017 at 12:32:37PM -0800, Linus Torvalds wrote: > > > > > Ugh. I still think your patch is butt-ugly, and the index comparisons > > > make me nervous, but.. > > > > No arguments here - 6am on 20-odd hours of uptime is _not_ a good time > > for writing, especially since the data structure needs better documentation > > and probably a couple of inlined helpers. I'll try to massage the damn > > thing into more readable form. > > FWIW, I think it will be more readable if we separate the "advance" and > "truncate" parts like this (warning: not even build-tested). Comments? diff --git a/lib/iov_iter.c b/lib/iov_iter.c index 25f572303801..dae1ac940d5f 100644 --- a/lib/iov_iter.c +++ b/lib/iov_iter.c @@ -730,43 +730,60 @@ size_t iov_iter_copy_from_user_atomic(struct page *page, } EXPORT_SYMBOL(iov_iter_copy_from_user_atomic); +static inline void pipe_truncate(struct iov_iter *i) +{ + struct pipe_inode_info *pipe = i->pipe; + if (pipe->nrbufs) { + size_t off = i->iov_offset; + int idx = i->idx; + int n; + + n = (pipe->curbuf + pipe->nrbufs - idx) & (pipe->buffers - 1); + if (off) { + pipe->bufs[idx].len = off - pipe->bufs[idx].offset; + /* free all after idx; n can't be 0 */ + idx = next_idx(idx, pipe); + n--; + } else { + /* free all _starting_ at idx. + * n is 0 when we have nothing to do + * *or* when we are truncating full pipe to empty. + */ + if (pipe->nrbufs == pipe->buffers && !n) + n = pipe->buffers; + } + while (n--) { + pipe_buf_release(pipe, &pipe->bufs[idx]); + idx = next_idx(idx, pipe); + pipe->nrbufs--; + } + } +} + static void pipe_advance(struct iov_iter *i, size_t size) { struct pipe_inode_info *pipe = i->pipe; - struct pipe_buffer *buf; - int idx = i->idx; - size_t off = i->iov_offset, orig_sz; - if (unlikely(i->count < size)) size = i->count; - orig_sz = size; - if (size) { + struct pipe_buffer *buf; + size_t off = i->iov_offset, left = size; + int idx = i->idx; if (off) /* make it relative to the beginning of buffer */ - size += off - pipe->bufs[idx].offset; + left += off - pipe->bufs[idx].offset; while (1) { buf = &pipe->bufs[idx]; - if (size <= buf->len) + if (left <= buf->len) break; - size -= buf->len; + left -= buf->len; idx = next_idx(idx, pipe); } - buf->len = size; i->idx = idx; - off = i->iov_offset = buf->offset + size; - } - if (off) - idx = next_idx(idx, pipe); - if (pipe->nrbufs) { - int unused = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1); - /* [curbuf,unused) is in use. Free [idx,unused) */ - while (idx != unused) { - pipe_buf_release(pipe, &pipe->bufs[idx]); - idx = next_idx(idx, pipe); - pipe->nrbufs--; - } + i->iov_offset = buf->offset + left; } - i->count -= orig_sz; + i->count -= size; + /* ... and discard everything past that point */ + pipe_truncate(i); } void iov_iter_advance(struct iov_iter *i, size_t size) @@ -826,6 +843,7 @@ void iov_iter_pipe(struct iov_iter *i, int direction, size_t count) { BUG_ON(direction != ITER_PIPE); + WARN_ON(pipe->nrbufs == pipe->buffers); i->type = direction; i->pipe = pipe; i->idx = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1); ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-13 21:59 ` Al Viro @ 2017-01-13 22:13 ` Al Viro 2017-01-13 22:50 ` Al Viro 0 siblings, 1 reply; 42+ messages in thread From: Al Viro @ 2017-01-13 22:13 UTC (permalink / raw) To: Linus Torvalds; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel On Fri, Jan 13, 2017 at 09:59:19PM +0000, Al Viro wrote: > + if (off) { > + pipe->bufs[idx].len = off - pipe->bufs[idx].offset; > + /* free all after idx; n can't be 0 */ Yes, it can - full pipe, truncation to the part of the first buffer ;-/ OTOH, handling the wraparound can be done uniformly for both "empty all" and "leave something" cases... Fixed variant follows: diff --git a/lib/iov_iter.c b/lib/iov_iter.c index 25f572303801..69ae97fb6b1a 100644 --- a/lib/iov_iter.c +++ b/lib/iov_iter.c @@ -730,43 +730,54 @@ size_t iov_iter_copy_from_user_atomic(struct page *page, } EXPORT_SYMBOL(iov_iter_copy_from_user_atomic); +static inline void pipe_truncate(struct iov_iter *i) +{ + struct pipe_inode_info *pipe = i->pipe; + if (pipe->nrbufs) { + size_t off = i->iov_offset; + int idx = i->idx; + int n; + + n = (pipe->curbuf + pipe->nrbufs - idx) & (pipe->buffers - 1); + if (off) { + pipe->bufs[idx].len = off - pipe->bufs[idx].offset; + idx = next_idx(idx, pipe); + n--; + } + if (pipe->nrbufs == pipe->buffers && n <= 0) + n += pipe->buffers; + while (n--) { + pipe_buf_release(pipe, &pipe->bufs[idx]); + idx = next_idx(idx, pipe); + pipe->nrbufs--; + } + } +} + static void pipe_advance(struct iov_iter *i, size_t size) { struct pipe_inode_info *pipe = i->pipe; - struct pipe_buffer *buf; - int idx = i->idx; - size_t off = i->iov_offset, orig_sz; - if (unlikely(i->count < size)) size = i->count; - orig_sz = size; - if (size) { + struct pipe_buffer *buf; + size_t off = i->iov_offset, left = size; + int idx = i->idx; if (off) /* make it relative to the beginning of buffer */ - size += off - pipe->bufs[idx].offset; + left += off - pipe->bufs[idx].offset; while (1) { buf = &pipe->bufs[idx]; - if (size <= buf->len) + if (left <= buf->len) break; - size -= buf->len; + left -= buf->len; idx = next_idx(idx, pipe); } - buf->len = size; i->idx = idx; - off = i->iov_offset = buf->offset + size; - } - if (off) - idx = next_idx(idx, pipe); - if (pipe->nrbufs) { - int unused = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1); - /* [curbuf,unused) is in use. Free [idx,unused) */ - while (idx != unused) { - pipe_buf_release(pipe, &pipe->bufs[idx]); - idx = next_idx(idx, pipe); - pipe->nrbufs--; - } + i->iov_offset = buf->offset + left; } - i->count -= orig_sz; + i->count -= size; + /* ... and discard everything past that point */ + pipe_truncate(i); } void iov_iter_advance(struct iov_iter *i, size_t size) @@ -826,6 +837,7 @@ void iov_iter_pipe(struct iov_iter *i, int direction, size_t count) { BUG_ON(direction != ITER_PIPE); + WARN_ON(pipe->nrbufs == pipe->buffers); i->type = direction; i->pipe = pipe; i->idx = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1); ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-13 22:13 ` Al Viro @ 2017-01-13 22:50 ` Al Viro 2017-01-14 0:59 ` Linus Torvalds 2017-01-14 13:16 ` Alan J. Wylie 0 siblings, 2 replies; 42+ messages in thread From: Al Viro @ 2017-01-13 22:50 UTC (permalink / raw) To: Linus Torvalds; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel On Fri, Jan 13, 2017 at 10:13:08PM +0000, Al Viro wrote: > On Fri, Jan 13, 2017 at 09:59:19PM +0000, Al Viro wrote: > > + if (off) { > > + pipe->bufs[idx].len = off - pipe->bufs[idx].offset; > > + /* free all after idx; n can't be 0 */ > > Yes, it can - full pipe, truncation to the part of the first buffer ;-/ > OTOH, handling the wraparound can be done uniformly for both "empty all" > and "leave something" cases... Fixed variant follows: Or, even better, we can get rid of all wraparound-related crap if we calculate the final value of pipe->nrbufs and watch for _that_ as loop condition: diff --git a/lib/iov_iter.c b/lib/iov_iter.c index 25f572303801..e68604ae3ced 100644 --- a/lib/iov_iter.c +++ b/lib/iov_iter.c @@ -730,43 +730,50 @@ size_t iov_iter_copy_from_user_atomic(struct page *page, } EXPORT_SYMBOL(iov_iter_copy_from_user_atomic); +static inline void pipe_truncate(struct iov_iter *i) +{ + struct pipe_inode_info *pipe = i->pipe; + if (pipe->nrbufs) { + size_t off = i->iov_offset; + int idx = i->idx; + int nrbufs = (idx - pipe->curbuf) & (pipe->buffers - 1); + if (off) { + pipe->bufs[idx].len = off - pipe->bufs[idx].offset; + idx = next_idx(idx, pipe); + nrbufs++; + } + while (pipe->nrbufs > nrbufs) { + pipe_buf_release(pipe, &pipe->bufs[idx]); + idx = next_idx(idx, pipe); + pipe->nrbufs--; + } + } +} + static void pipe_advance(struct iov_iter *i, size_t size) { struct pipe_inode_info *pipe = i->pipe; - struct pipe_buffer *buf; - int idx = i->idx; - size_t off = i->iov_offset, orig_sz; - if (unlikely(i->count < size)) size = i->count; - orig_sz = size; - if (size) { + struct pipe_buffer *buf; + size_t off = i->iov_offset, left = size; + int idx = i->idx; if (off) /* make it relative to the beginning of buffer */ - size += off - pipe->bufs[idx].offset; + left += off - pipe->bufs[idx].offset; while (1) { buf = &pipe->bufs[idx]; - if (size <= buf->len) + if (left <= buf->len) break; - size -= buf->len; + left -= buf->len; idx = next_idx(idx, pipe); } - buf->len = size; i->idx = idx; - off = i->iov_offset = buf->offset + size; - } - if (off) - idx = next_idx(idx, pipe); - if (pipe->nrbufs) { - int unused = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1); - /* [curbuf,unused) is in use. Free [idx,unused) */ - while (idx != unused) { - pipe_buf_release(pipe, &pipe->bufs[idx]); - idx = next_idx(idx, pipe); - pipe->nrbufs--; - } + i->iov_offset = buf->offset + left; } - i->count -= orig_sz; + i->count -= size; + /* ... and discard everything past that point */ + pipe_truncate(i); } void iov_iter_advance(struct iov_iter *i, size_t size) @@ -826,6 +833,7 @@ void iov_iter_pipe(struct iov_iter *i, int direction, size_t count) { BUG_ON(direction != ITER_PIPE); + WARN_ON(pipe->nrbufs == pipe->buffers); i->type = direction; i->pipe = pipe; i->idx = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1); ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-13 22:50 ` Al Viro @ 2017-01-14 0:59 ` Linus Torvalds 2017-01-14 1:24 ` Al Viro 2017-01-14 13:16 ` Alan J. Wylie 1 sibling, 1 reply; 42+ messages in thread From: Linus Torvalds @ 2017-01-14 0:59 UTC (permalink / raw) To: Al Viro; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel On Fri, Jan 13, 2017 at 2:50 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > Or, even better, we can get rid of all wraparound-related crap if we > calculate the final value of pipe->nrbufs and watch for _that_ as > loop condition: Ok, I like the 'expected nrbufs' approach. It makes the actual freeing loop trivial, and gets rid of that whole complexity. I'm not totally enamored with the details of exactly how that expected valie is calculated, though. > +static inline void pipe_truncate(struct iov_iter *i) > +{ > + struct pipe_inode_info *pipe = i->pipe; > + if (pipe->nrbufs) { > + size_t off = i->iov_offset; > + int idx = i->idx; > + int nrbufs = (idx - pipe->curbuf) & (pipe->buffers - 1); Ok, here we have what superficially appears to be the same wraparound issue, except not. The idx/iov_offset rules are subtly different from the pipe buffer index rules, and point to the actual last buffer. So here, 'idx' really points to the last buffer, and we're possibly off by one, and then you fix that up with: > + if (off) { > + pipe->bufs[idx].len = off - pipe->bufs[idx].offset; > + idx = next_idx(idx, pipe); > + nrbufs++; > + } and I think this all even works, and I like how the logic here matches what the PIPE_PARANOIA tests are. Fine. EXCEPT. I don't think "i->iov_offset" is actually correct. If you truncated the whole thing, you should have cleared iov_offset too, and that never happened. So truncating everything will not empty the buffers for us, we'll still get to that "if (off)" case and have nrbufs=1. So I'd expect something like if (!size) i->iov_offset = 0; in pipe_advance(), in order to really free all buffers for that case. No? Or is there some guarantee that iov_offset was already zero there? I don't see it, but I'm batting zero on this code... So I'm clearly still missing something. It looks better to me, but I still don't get some of the cases. Can you please explain when you are once more awake.. Linus ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-14 0:59 ` Linus Torvalds @ 2017-01-14 1:24 ` Al Viro 2017-01-14 1:43 ` Al Viro 2017-01-14 1:46 ` Linus Torvalds 0 siblings, 2 replies; 42+ messages in thread From: Al Viro @ 2017-01-14 1:24 UTC (permalink / raw) To: Linus Torvalds; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel On Fri, Jan 13, 2017 at 04:59:37PM -0800, Linus Torvalds wrote: > EXCEPT. > > I don't think "i->iov_offset" is actually correct. If you truncated > the whole thing, you should have cleared iov_offset too, and that > never happened. So truncating everything will not empty the buffers > for us, we'll still get to that "if (off)" case and have nrbufs=1. > > So I'd expect something like > > if (!size) > i->iov_offset = 0; > > in pipe_advance(), in order to really free all buffers for that case. No? Why would advance by 0 change ->iov_offset here? > Or is there some guarantee that iov_offset was already zero there? I > don't see it, but I'm batting zero on this code... It was zero initially (see iov_iter_pipe()). It was not affected by iov_iter_get_pages() and friends. If there was copy_to_iter/zero_iter/ copy_page_to_iter for any non-zero amount of data, then it's _not_ zero and should bloody well stay such. Note that the normal case is something like O_DIRECT read grabbing pages, doing a bunch of IO into such and, once it's done, iov_iter_advance() for the actual amount read being done by retval = mapping->a_ops->direct_IO(iocb, &data); if (retval >= 0) { iocb->ki_pos += retval; iov_iter_advance(iter, retval); } which advances iter past the actually read data. If we proceed to fall back to the do_generic_file_read(), it will do a bunch of copy_page_to_iter() to the points past that. default_file_splice_read() pretty much imitates O_DIRECT read in that respect, with kernel_readv() being a counterpart of direct_IO. generic_file_splice_read() is another PIPE_ITER user; that one really calls ->read_iter(). On any non-error (including short reads) it will not bother with iov_iter_advance() at all - ->read_iter() should've called that already, if at all needed. On error it does use iov_iter_advance(), pretty much as a way to trigger pipe_truncate(). There we directly reset ->iov_offset to 0 and ->idx to its original value. Strictly speaking, we could live without calling it there at all - normally ->read_iter() instances follow the usual "if you'd managed to read something, report a short read, not an error" approach. None of the exceptions are good candidates for use of generic_file_splice_read() anyway. However, theoretically it is possible that ->read_iter() instance does successful copy_to_iter() and then decides to return an error. This } else if (ret < 0) { to.idx = idx; to.iov_offset = 0; iov_iter_advance(&to, 0); /* to free what was emitted */ in generic_file_splice_read() catches any such cases. Here the call of iov_iter_advance(&to, 0) is guaranteed to be equivalent to pipe_truncate(&to); we might make the latter non-static and call it directly, but I'm not sure it's worth doing. > Can you please explain when you are once more awake.. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-14 1:24 ` Al Viro @ 2017-01-14 1:43 ` Al Viro 2017-01-14 1:46 ` Linus Torvalds 1 sibling, 0 replies; 42+ messages in thread From: Al Viro @ 2017-01-14 1:43 UTC (permalink / raw) To: Linus Torvalds; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel On Sat, Jan 14, 2017 at 01:24:28AM +0000, Al Viro wrote: > On Fri, Jan 13, 2017 at 04:59:37PM -0800, Linus Torvalds wrote: > > > EXCEPT. > > > > I don't think "i->iov_offset" is actually correct. If you truncated > > the whole thing, you should have cleared iov_offset too, and that > > never happened. So truncating everything will not empty the buffers > > for us, we'll still get to that "if (off)" case and have nrbufs=1. > > > > So I'd expect something like > > > > if (!size) > > i->iov_offset = 0; > > > > in pipe_advance(), in order to really free all buffers for that case. No? > > Why would advance by 0 change ->iov_offset here? > > > Or is there some guarantee that iov_offset was already zero there? I > > don't see it, but I'm batting zero on this code... > > It was zero initially (see iov_iter_pipe()). It was not affected by > iov_iter_get_pages() and friends. If there was copy_to_iter/zero_iter/ > copy_page_to_iter for any non-zero amount of data, then it's _not_ zero and > should bloody well stay such. PS: note that after copy_page_to_iter()/copy_to_iter()/zero_iter() we have ->idx pointing to the last used buffer and ->iov_offset pointing to the end of data in it, even if it's certain to be full and the next piece written will go into the next buffer. The only situation where we have zero ->iov_offset is when no copying had been done at all (via that iov_iter, that is). In that case ->idx points to the empty buffer. If you start with empty pipe and copy_to_pipe() until it fills, you'll have (assuming e.g. 4K pages and ->curbuf == 3 in the beginning) ->idx == 3, ->iov_offset == 0 initially ->idx == 3, ->iov_offset == 4096 after 4096 bytes copied ->idx == 4, ->iov_offset == 1 after 4097 bytes ... ->idx == 2, ->iov_offset == 4096 by the end of it. That's what this "correction" is about... ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-14 1:24 ` Al Viro 2017-01-14 1:43 ` Al Viro @ 2017-01-14 1:46 ` Linus Torvalds 2017-01-14 1:57 ` Al Viro 1 sibling, 1 reply; 42+ messages in thread From: Linus Torvalds @ 2017-01-14 1:46 UTC (permalink / raw) To: Al Viro; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel On Fri, Jan 13, 2017 at 5:24 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > Why would advance by 0 change ->iov_offset here? That's not my worry. Advancing by zero obviously doesn't change the position. But the _truncation_ of the rest requires iov_offset to be zero in order to actually truncate everything. So I was worrying about something updating it, and then wanting to truncate things on error. But you bring up the kinds of cases I worried about: > On error it does use iov_iter_advance(), pretty much as a way to > trigger pipe_truncate(). There we directly reset ->iov_offset to 0 > and ->idx to its original value. Ok, this was the part I worried about. And this > However, theoretically it is possible that ->read_iter() instance does > successful copy_to_iter() and then decides to return an error. This > } else if (ret < 0) { > to.idx = idx; > to.iov_offset = 0; > iov_iter_advance(&to, 0); /* to free what was emitted */ > in generic_file_splice_read() catches any such cases. So I'm happy with that last patch then, and my worries are laid to rest. Linus ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-14 1:46 ` Linus Torvalds @ 2017-01-14 1:57 ` Al Viro 2017-01-15 0:53 ` Al Viro 0 siblings, 1 reply; 42+ messages in thread From: Al Viro @ 2017-01-14 1:57 UTC (permalink / raw) To: Linus Torvalds; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel On Fri, Jan 13, 2017 at 05:46:34PM -0800, Linus Torvalds wrote: > On Fri, Jan 13, 2017 at 5:24 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > Why would advance by 0 change ->iov_offset here? > > That's not my worry. Advancing by zero obviously doesn't change the position. > > But the _truncation_ of the rest requires iov_offset to be zero in > order to actually truncate everything. > > So I was worrying about something updating it, and then wanting to > truncate things on error. > > But you bring up the kinds of cases I worried about: > > > On error it does use iov_iter_advance(), pretty much as a way to > > trigger pipe_truncate(). There we directly reset ->iov_offset to 0 > > and ->idx to its original value. > > Ok, this was the part I worried about. And this > > > However, theoretically it is possible that ->read_iter() instance does > > successful copy_to_iter() and then decides to return an error. This > > } else if (ret < 0) { > > to.idx = idx; > > to.iov_offset = 0; > > iov_iter_advance(&to, 0); /* to free what was emitted */ > > in generic_file_splice_read() catches any such cases. > > So I'm happy with that last patch then, and my worries are laid to rest. OK. Let's wait for Alan to confirm that the last variant works and I'll send a pull request (with Cc: stable # v4.9). ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-14 1:57 ` Al Viro @ 2017-01-15 0:53 ` Al Viro 0 siblings, 0 replies; 42+ messages in thread From: Al Viro @ 2017-01-15 0:53 UTC (permalink / raw) To: Linus Torvalds; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel On Sat, Jan 14, 2017 at 01:57:57AM +0000, Al Viro wrote: > OK. Let's wait for Alan to confirm that the last variant works and > I'll send a pull request (with Cc: stable # v4.9). ... and here it is. The following changes since commit b4b8664d291ac1998e0f0bcdc96b6397f0fe68b3: arm64: don't pull uaccess.h into *.S (2016-12-26 13:05:17 -0500) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-linus for you to fetch changes up to b9dc6f65bc5e232d1c05fe34b5daadc7e8bbf1fb: fix a fencepost error in pipe_advance() (2017-01-14 19:50:41 -0500) ---------------------------------------------------------------- Al Viro (1): fix a fencepost error in pipe_advance() Dave Kleikamp (1): coredump: Ensure proper size of sparse core files Gu Zheng (1): tmpfs: clear S_ISGID when setting posix ACLs Shaohua Li (1): aio: fix lock dep warning fs/aio.c | 6 ++++-- fs/binfmt_elf.c | 1 + fs/coredump.c | 18 ++++++++++++++++ fs/posix_acl.c | 9 ++++---- include/linux/coredump.h | 1 + lib/iov_iter.c | 54 +++++++++++++++++++++++++++--------------------- 6 files changed, 59 insertions(+), 30 deletions(-) ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-13 22:50 ` Al Viro 2017-01-14 0:59 ` Linus Torvalds @ 2017-01-14 13:16 ` Alan J. Wylie 2017-01-14 16:29 ` Alan J. Wylie 1 sibling, 1 reply; 42+ messages in thread From: Alan J. Wylie @ 2017-01-14 13:16 UTC (permalink / raw) To: Al Viro; +Cc: Linus Torvalds, Thorsten Leemhuis, linux-kernel at 22:50 on Fri 13-Jan-2017 Al Viro (viro@ZenIV.linux.org.uk) wrote: > Or, even better, we can get rid of all wraparound-related crap if we > calculate the final value of pipe->nrbufs and watch for _that_ as > loop condition: > > diff --git a/lib/iov_iter.c b/lib/iov_iter.c > index 25f572303801..e68604ae3ced 100644 > --- a/lib/iov_iter.c > +++ b/lib/iov_iter.c > @@ -730,43 +730,50 @@ size_t iov_iter_copy_from_user_atomic(struct page *page, > } > EXPORT_SYMBOL(iov_iter_copy_from_user_atomic); > > +static inline void pipe_truncate(struct iov_iter *i) ... I've dragged an old box out of the storeroom and upgraded and configured it so it reproduced the problem. With the patch, the mail is sent with no hang # uname -a Linux maglor 4.9.3-dirty #3 SMP Sat Jan 14 12:13:45 GMT 2017 x86_64 AMD Athlon(tm) II X2 270 Processor AuthenticAMD GNU/Linux Jan 14 12:56:01 maglor cron[741]: (root) CMD (date; /work/chroot-shared/test.sh; date) Jan 14 12:56:01 maglor postfix/pickup[649]: 8577A604F2: uid=0 from=<root> Jan 14 12:56:01 maglor postfix/cleanup[752]: 8577A604F2: message-id=<20170114125601.8577A604F2@maglor.localdomain> Jan 14 12:56:01 maglor postfix/qmgr[650]: 8577A604F2: from=<root@maglor.localdomain>, size=735, nrcpt=1 (queue active) Jan 14 12:56:01 maglor postfix/smtp[754]: 8577A604F2: to=<alan@frodo.localnet>, orig_to=<root>, relay=frodo.localnet[192.168.21.2]:25, delay=0.54, delays=0.43/0/0.04/0.07, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as A08CC440065) Jan 14 12:56:01 maglor postfix/qmgr[650]: 8577A604F2: removed No other problems are apparent. I'll run this for a bit, then apply it to my workstation (which I'm rather fond of) and make sure there are no new regressions. -- Alan J. Wylie http://www.wylie.me.uk/ Dance like no-one's watching. / Encrypt like everyone is. Security is inversely proportional to convenience ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-14 13:16 ` Alan J. Wylie @ 2017-01-14 16:29 ` Alan J. Wylie 2017-01-14 17:57 ` Linus Torvalds 0 siblings, 1 reply; 42+ messages in thread From: Alan J. Wylie @ 2017-01-14 16:29 UTC (permalink / raw) To: Al Viro, Linus Torvalds, Thorsten Leemhuis, linux-kernel at 13:16 on Sat 14-Jan-2017 Alan J. Wylie (alan@wylie.me.uk) wrote: > I'll run this for a bit, then apply it to my workstation (which I'm > rather fond of) and make sure there are no new regressions. Looking good. -- Alan J. Wylie http://www.wylie.me.uk/ Dance like no-one's watching. / Encrypt like everyone is. Security is inversely proportional to convenience ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-14 16:29 ` Alan J. Wylie @ 2017-01-14 17:57 ` Linus Torvalds 0 siblings, 0 replies; 42+ messages in thread From: Linus Torvalds @ 2017-01-14 17:57 UTC (permalink / raw) To: Alan J. Wylie; +Cc: Al Viro, Thorsten Leemhuis, linux-kernel On Sat, Jan 14, 2017 at 8:29 AM, Alan J. Wylie <alan@wylie.me.uk> wrote: > at 13:16 on Sat 14-Jan-2017 Alan J. Wylie (alan@wylie.me.uk) wrote: > >> I'll run this for a bit, then apply it to my workstation (which I'm >> rather fond of) and make sure there are no new regressions. > > Looking good. Thanks for the pinpointing and all the testing, Linus ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-13 19:33 ` Linus Torvalds 2017-01-13 20:08 ` Al Viro @ 2017-01-13 20:08 ` Linus Torvalds 2017-01-13 20:16 ` Al Viro 1 sibling, 1 reply; 42+ messages in thread From: Linus Torvalds @ 2017-01-13 20:08 UTC (permalink / raw) To: Al Viro; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel On Fri, Jan 13, 2017 at 11:33 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > This function looks so broken that I must be missing something. Why > doesn't pipe_advance() just look like the following: > > static void pipe_advance(struct iov_iter *i, size_t size) > { ... > pipe_buf_release(pipe, buf); > pipe->nrbufs--; ... I think this part needs to update "curbufs" too, so something like pipe->curbuf = (pipe->curbuf + 1) & (pipe->buffers - 1); although I think that "idx" has to track curbuf here anyway, so I guess it could just be combined with the idx update and look something like pipe->curbuf = idx = next_idx(idx, pipe); in there. Otherwise we get out of sync with the pipe state. Or maybe I'm just full of it. Linus ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn 2017-01-13 20:08 ` Linus Torvalds @ 2017-01-13 20:16 ` Al Viro 0 siblings, 0 replies; 42+ messages in thread From: Al Viro @ 2017-01-13 20:16 UTC (permalink / raw) To: Linus Torvalds; +Cc: Alan J. Wylie, Thorsten Leemhuis, linux-kernel On Fri, Jan 13, 2017 at 12:08:44PM -0800, Linus Torvalds wrote: > On Fri, Jan 13, 2017 at 11:33 AM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > This function looks so broken that I must be missing something. Why > > doesn't pipe_advance() just look like the following: > > > > static void pipe_advance(struct iov_iter *i, size_t size) > > { > ... > > pipe_buf_release(pipe, buf); > > pipe->nrbufs--; > ... > > I think this part needs to update "curbufs" too, so something like > > pipe->curbuf = (pipe->curbuf + 1) & (pipe->buffers - 1); > > although I think that "idx" has to track curbuf here anyway, so I > guess it could just be combined with the idx update and look something > like > > pipe->curbuf = idx = next_idx(idx, pipe); > > in there. Otherwise we get out of sync with the pipe state. You are looking at the wrong end of that cyclic buffer. ->curbuf is where the data begins (and it might be well prior to anything we'd pushed there - pipe might've been non-empty). Then we have ->nrbufs allocated buffers and i->idx points to the place where copy_to_iter() will put the data. We want pipe_advance() to * move the point where copy_to_iter() would go by this much * free all preallocated buffers past that point. ->curbuf is for pipe readers; we are dealing with writing to pipe here. ^ permalink raw reply [flat|nested] 42+ messages in thread
end of thread, other threads:[~2017-01-15 0:53 UTC | newest] Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-01-12 20:26 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn Alan J. Wylie 2017-01-12 20:31 ` Al Viro 2017-01-12 20:38 ` Alan J. Wylie 2017-01-12 22:26 ` Linus Torvalds 2017-01-12 22:37 ` Al Viro 2017-01-12 22:46 ` Al Viro 2017-01-12 23:02 ` Linus Torvalds 2017-01-12 23:14 ` Al Viro 2017-01-12 23:14 ` Linus Torvalds 2017-01-12 23:27 ` Al Viro 2017-01-12 22:46 ` Alan J. Wylie 2017-01-12 22:58 ` Al Viro 2017-01-12 23:28 ` Linus Torvalds 2017-01-13 4:00 ` Al Viro 2017-01-13 7:38 ` Alan J. Wylie 2017-01-13 7:23 ` Alan J. Wylie 2017-01-13 9:33 ` Al Viro 2017-01-13 9:54 ` Alan J. Wylie 2017-01-13 10:20 ` Al Viro 2017-01-13 10:32 ` Alan J. Wylie 2017-01-13 11:25 ` Al Viro 2017-01-13 11:18 ` Al Viro 2017-01-13 19:33 ` Linus Torvalds 2017-01-13 20:08 ` Al Viro 2017-01-13 20:11 ` Al Viro 2017-01-13 20:32 ` Linus Torvalds 2017-01-13 20:47 ` Al Viro 2017-01-13 21:55 ` Al Viro 2017-01-13 21:59 ` Al Viro 2017-01-13 22:13 ` Al Viro 2017-01-13 22:50 ` Al Viro 2017-01-14 0:59 ` Linus Torvalds 2017-01-14 1:24 ` Al Viro 2017-01-14 1:43 ` Al Viro 2017-01-14 1:46 ` Linus Torvalds 2017-01-14 1:57 ` Al Viro 2017-01-15 0:53 ` Al Viro 2017-01-14 13:16 ` Alan J. Wylie 2017-01-14 16:29 ` Alan J. Wylie 2017-01-14 17:57 ` Linus Torvalds 2017-01-13 20:08 ` Linus Torvalds 2017-01-13 20:16 ` Al Viro
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).