* Pending splice(file -> FIFO) always blocks read(FIFO), regardless of O_NONBLOCK on read side? @ 2023-06-26 1:12 Ahelenia Ziemiańska 2023-06-26 9:32 ` Christian Brauner 0 siblings, 1 reply; 19+ messages in thread From: Ahelenia Ziemiańska @ 2023-06-26 1:12 UTC (permalink / raw) To: Alexander Viro, Christian Brauner, linux-fsdevel, linux-kernel [-- Attachment #1: Type: text/plain, Size: 5169 bytes --] Hi! (starting with get_maintainers.pl fs/splice.c, idk if that's right though) Per fs/splice.c: * The traditional unix read/write is extended with a "splice()" operation * that transfers data buffers to or from a pipe buffer. so I expect splice() to work just about the same as read()/write() (and, to a large extent, it does so). Thus, a refresher on pipe read() semantics (quoting Issue 8 Draft 3; Linux when writing with write()): 60746 When attempting to read from an empty pipe or FIFO: 60747 • If no process has the pipe open for writing, read( ) shall return 0 to indicate end-of-file. 60748 • If some process has the pipe open for writing and O_NONBLOCK is set, read( ) shall return 60749 −1 and set errno to [EAGAIN]. 60750 • If some process has the pipe open for writing and O_NONBLOCK is clear, read( ) shall 60751 block the calling thread until some data is written or the pipe is closed by all processes that 60752 had the pipe open for writing. However, I've observed that this is not the case when splicing from something that sleeps on read to a pipe, and that in that case all readers block, /including/ ones that are reading from fds with O_NONBLOCK set! As an example, consider these two programs: -- >8 -- // wr.c #define _GNU_SOURCE #include <fcntl.h> #include <stdio.h> int main() { while (splice(0, 0, 1, 0, 128 * 1024 * 1024, 0) > 0) ; fprintf(stderr, "wr: %m\n"); } -- >8 -- -- >8 -- // rd.c #define _GNU_SOURCE #include <errno.h> #include <fcntl.h> #include <stdio.h> #include <unistd.h> int main() { fcntl(0, F_SETFL, fcntl(0, F_GETFL) | O_NONBLOCK); char buf[64 * 1024] = {}; for (ssize_t rd;;) { #if 1 while ((rd = read(0, buf, sizeof(buf))) == -1 && errno == EINTR) ; #else while ((rd = splice(0, 0, 1, 0, 128 * 1024 * 1024, 0)) == -1 && errno == EINTR) ; #endif fprintf(stderr, "rd=%zd: %m\n", rd); write(1, buf, rd); errno = 0; sleep(1); } } -- >8 -- Thus: -- >8 -- a$ make rd wr a$ mkfifo fifo a$ ./rd < fifo b$ echo qwe > fifo rd=4: Success qwe rd=0: Success rd=0: Success b$ sleep 2 > fifo rd=-1: Resource temporarily unavailable rd=-1: Resource temporarily unavailable rd=0: Success rd=0: Success rd=-1: Resource temporarily unavailable b$ /bin/cat > fifo rd=-1: Resource temporarily unavailable rd=4: Success abc abc rd=-1: Resource temporarily unavailable rd=4: Success def def rd=0: Success ^D rd=0: Success rd=0: Success b$ ./wr > fifo -- >8 -- and nothing. Until you actually type a line (or a few) into teletype b so that the splice completes, at which point so does the read. An even simpler case is -- >8 -- $ ./wr | ./rd abc def rd=8: Success abc def ghi jkl rd=8: Success ghi jkl ^D wr: Success rd=-1: Resource temporarily unavailable rd=0: Success rd=0: Success -- >8 -- splice flags don't do anything. Tested on bookworm (6.1.27-1) and Linus' HEAD (v6.4-rc7-234-g547cc9be86f4). You could say this is a "denial of service", since this is a valid way of following pipes (and, sans SIGIO, the only portable one), and this makes it so the reader is completely blocked, and impervious to all deadly signals (incl. SIGKILL). I've also observed strace hanging (but it responded to SIGKILL). Rudimentary analysis shows that sys_splice() -> __do_splice() -> do_splice() -> {!(ipipe && opipe) -> !(ipipe) -> (ipipe)}: splice_file_to_pipe which then does -- >8 -- pipe_lock(opipe); ret = wait_for_space(opipe, flags); if (!ret) ret = do_splice_to(in, offset, opipe, len, flags); pipe_unlock(opipe); -- >8 -- conversely: -- >8 -- static ssize_t pipe_read(struct kiocb *iocb, struct iov_iter *to) { size_t total_len = iov_iter_count(to); struct file *filp = iocb->ki_filp; struct pipe_inode_info *pipe = filp->private_data; bool was_full, wake_next_reader = false; ssize_t ret; /* Null read succeeds. */ if (unlikely(total_len == 0)) return 0; ret = 0; __pipe_lock(pipe); -- >8 -- so, naturally(?), all non-empty reads wait for the splice to end. To validate this, I've applied the following (which I'm 100% sure is wrong and breaks unrelated stuff): -- >8 -- diff --git a/fs/pipe.c b/fs/pipe.c index 2d88f73f585a..a76535839d32 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -240,6 +240,11 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) if (unlikely(total_len == 0)) return 0; + if ((filp->f_flags & O_NONBLOCK) || (iocb->ki_flags & IOCB_NOWAIT)) { + if (mutex_is_locked(&pipe->mutex)) + return -EAGAIN; + } + ret = 0; __pipe_lock(pipe); -- >8 -- which does make the aforementioned cases less pathological (naturally this now means it always EAGAINs even if there is data to be read since the splice() loop keeps it locked, but you get the picture). [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: Pending splice(file -> FIFO) always blocks read(FIFO), regardless of O_NONBLOCK on read side? 2023-06-26 1:12 Pending splice(file -> FIFO) always blocks read(FIFO), regardless of O_NONBLOCK on read side? Ahelenia Ziemiańska @ 2023-06-26 9:32 ` Christian Brauner 2023-06-26 11:59 ` Pending splice(file -> FIFO) excludes all other FIFO operations forever (was: ... always blocks read(FIFO), regardless of O_NONBLOCK on read side?) Ahelenia Ziemiańska 0 siblings, 1 reply; 19+ messages in thread From: Christian Brauner @ 2023-06-26 9:32 UTC (permalink / raw) To: Ahelenia Ziemiańska Cc: Alexander Viro, linux-fsdevel, linux-kernel, David Howells, Jens Axboe On Mon, Jun 26, 2023 at 03:12:09AM +0200, Ahelenia Ziemiańska wrote: > Hi! (starting with get_maintainers.pl fs/splice.c, > idk if that's right though) > > Per fs/splice.c: > * The traditional unix read/write is extended with a "splice()" operation > * that transfers data buffers to or from a pipe buffer. > so I expect splice() to work just about the same as read()/write() > (and, to a large extent, it does so). > > Thus, a refresher on pipe read() semantics > (quoting Issue 8 Draft 3; Linux when writing with write()): > 60746 When attempting to read from an empty pipe or FIFO: > 60747 • If no process has the pipe open for writing, read( ) shall return 0 to indicate end-of-file. > 60748 • If some process has the pipe open for writing and O_NONBLOCK is set, read( ) shall return > 60749 −1 and set errno to [EAGAIN]. > 60750 • If some process has the pipe open for writing and O_NONBLOCK is clear, read( ) shall > 60751 block the calling thread until some data is written or the pipe is closed by all processes that > 60752 had the pipe open for writing. > > However, I've observed that this is not the case when splicing from > something that sleeps on read to a pipe, and that in that case all > readers block, /including/ ones that are reading from fds with > O_NONBLOCK set! > > As an example, consider these two programs: > -- >8 -- > // wr.c > #define _GNU_SOURCE > #include <fcntl.h> > #include <stdio.h> > int main() { > while (splice(0, 0, 1, 0, 128 * 1024 * 1024, 0) > 0) > ; > fprintf(stderr, "wr: %m\n"); > } > -- >8 -- > > -- >8 -- > // rd.c > #define _GNU_SOURCE > #include <errno.h> > #include <fcntl.h> > #include <stdio.h> > #include <unistd.h> > int main() { > fcntl(0, F_SETFL, fcntl(0, F_GETFL) | O_NONBLOCK); > > char buf[64 * 1024] = {}; > for (ssize_t rd;;) { > #if 1 > while ((rd = read(0, buf, sizeof(buf))) == -1 && errno == EINTR) > ; > #else > while ((rd = splice(0, 0, 1, 0, 128 * 1024 * 1024, 0)) == -1 && > errno == EINTR) > ; > #endif > fprintf(stderr, "rd=%zd: %m\n", rd); > write(1, buf, rd); > > errno = 0; > sleep(1); > } > } > -- >8 -- > > Thus: > -- >8 -- > a$ make rd wr > a$ mkfifo fifo > a$ ./rd < fifo b$ echo qwe > fifo > rd=4: Success > qwe > rd=0: Success > rd=0: Success b$ sleep 2 > fifo > rd=-1: Resource temporarily unavailable > rd=-1: Resource temporarily unavailable > rd=0: Success > rd=0: Success > rd=-1: Resource temporarily unavailable b$ /bin/cat > fifo > rd=-1: Resource temporarily unavailable > rd=4: Success abc > abc > rd=-1: Resource temporarily unavailable > rd=4: Success def > def > rd=0: Success ^D > rd=0: Success > rd=0: Success b$ ./wr > fifo > -- >8 -- > and nothing. Until you actually type a line (or a few) into teletype b > so that the splice completes, at which point so does the read. > > An even simpler case is > -- >8 -- > $ ./wr | ./rd > abc > def > rd=8: Success > abc > def > ghi > jkl > rd=8: Success > ghi > jkl > ^D > wr: Success > rd=-1: Resource temporarily unavailable > rd=0: Success > rd=0: Success > -- >8 -- > > splice flags don't do anything. > Tested on bookworm (6.1.27-1) and Linus' HEAD (v6.4-rc7-234-g547cc9be86f4). > > You could say this is a "denial of service", since this is a valid > way of following pipes (and, sans SIGIO, the only portable one), splice() may block for any of the two file descriptors if they don't have O_NONBLOCK set even if SPLICE_F_NONBLOCK is raised. SPLICE_F_NONBLOCK in splice_file_to_pipe() is only relevant if the pipe is full. If the pipe isn't full then the write is attempted. That of course involves reading the data to splice from the source file. If the source file isn't O_NONBLOCK that read may block holding pipe_lock(). If you raise O_NONBLOCK on the source fd in wr.c then your problems go away. This is pretty long-standing behavior. Splice would have to be refactored to not rely on pipe_lock(). That's likely major work with a good portion of regressions if the past is any indication. If you need that ability to fully async read from a pipe with splice rn then io_uring will at least allow you to punt that read into an async worker thread afaict. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Pending splice(file -> FIFO) excludes all other FIFO operations forever (was: ... always blocks read(FIFO), regardless of O_NONBLOCK on read side?) 2023-06-26 9:32 ` Christian Brauner @ 2023-06-26 11:59 ` Ahelenia Ziemiańska 2023-06-26 15:56 ` Christian Brauner 0 siblings, 1 reply; 19+ messages in thread From: Ahelenia Ziemiańska @ 2023-06-26 11:59 UTC (permalink / raw) To: Christian Brauner Cc: Alexander Viro, linux-fsdevel, linux-kernel, David Howells, Jens Axboe [-- Attachment #1: Type: text/plain, Size: 11512 bytes --] On Mon, Jun 26, 2023 at 11:32:16AM +0200, Christian Brauner wrote: > On Mon, Jun 26, 2023 at 03:12:09AM +0200, Ahelenia Ziemiańska wrote: > > Hi! (starting with get_maintainers.pl fs/splice.c, > > idk if that's right though) > > > > Per fs/splice.c: > > * The traditional unix read/write is extended with a "splice()" operation > > * that transfers data buffers to or from a pipe buffer. > > so I expect splice() to work just about the same as read()/write() > > (and, to a large extent, it does so). > > > > Thus, a refresher on pipe read() semantics > > (quoting Issue 8 Draft 3; Linux when writing with write()): > > 60746 When attempting to read from an empty pipe or FIFO: > > 60747 • If no process has the pipe open for writing, read( ) shall return 0 to indicate end-of-file. > > 60748 • If some process has the pipe open for writing and O_NONBLOCK is set, read( ) shall return > > 60749 −1 and set errno to [EAGAIN]. > > 60750 • If some process has the pipe open for writing and O_NONBLOCK is clear, read( ) shall > > 60751 block the calling thread until some data is written or the pipe is closed by all processes that > > 60752 had the pipe open for writing. > > > > However, I've observed that this is not the case when splicing from > > something that sleeps on read to a pipe, and that in that case all > > readers block, /including/ ones that are reading from fds with > > O_NONBLOCK set! > > > > As an example, consider these two programs: > > -- >8 -- > > // wr.c > > #define _GNU_SOURCE > > #include <fcntl.h> > > #include <stdio.h> > > int main() { > > while (splice(0, 0, 1, 0, 128 * 1024 * 1024, 0) > 0) > > ; > > fprintf(stderr, "wr: %m\n"); > > } > > -- >8 -- > > > > -- >8 -- > > // rd.c > > #define _GNU_SOURCE > > #include <errno.h> > > #include <fcntl.h> > > #include <stdio.h> > > #include <unistd.h> > > int main() { > > fcntl(0, F_SETFL, fcntl(0, F_GETFL) | O_NONBLOCK); > > > > char buf[64 * 1024] = {}; > > for (ssize_t rd;;) { > > #if 1 > > while ((rd = read(0, buf, sizeof(buf))) == -1 && errno == EINTR) > > ; > > #else > > while ((rd = splice(0, 0, 1, 0, 128 * 1024 * 1024, 0)) == -1 && > > errno == EINTR) > > ; > > #endif > > fprintf(stderr, "rd=%zd: %m\n", rd); > > write(1, buf, rd); > > > > errno = 0; > > sleep(1); > > } > > } > > -- >8 -- > > > > Thus: > > -- >8 -- > > a$ make rd wr > > a$ mkfifo fifo > > a$ ./rd < fifo b$ echo qwe > fifo > > rd=4: Success > > qwe > > rd=0: Success > > rd=0: Success b$ sleep 2 > fifo > > rd=-1: Resource temporarily unavailable > > rd=-1: Resource temporarily unavailable > > rd=0: Success > > rd=0: Success > > rd=-1: Resource temporarily unavailable b$ /bin/cat > fifo > > rd=-1: Resource temporarily unavailable > > rd=4: Success abc > > abc > > rd=-1: Resource temporarily unavailable > > rd=4: Success def > > def > > rd=0: Success ^D > > rd=0: Success > > rd=0: Success b$ ./wr > fifo > > -- >8 -- > > and nothing. Until you actually type a line (or a few) into teletype b > > so that the splice completes, at which point so does the read. > > > > An even simpler case is > > -- >8 -- > > $ ./wr | ./rd > > abc > > def > > rd=8: Success > > abc > > def > > ghi > > jkl > > rd=8: Success > > ghi > > jkl > > ^D > > wr: Success > > rd=-1: Resource temporarily unavailable > > rd=0: Success > > rd=0: Success > > -- >8 -- > > > > splice flags don't do anything. > > Tested on bookworm (6.1.27-1) and Linus' HEAD (v6.4-rc7-234-g547cc9be86f4). > > > > You could say this is a "denial of service", since this is a valid > > way of following pipes (and, sans SIGIO, the only portable one), > splice() may block for any of the two file descriptors if they don't > have O_NONBLOCK set even if SPLICE_F_NONBLOCK is raised. > > SPLICE_F_NONBLOCK in splice_file_to_pipe() is only relevant if the pipe > is full. If the pipe isn't full then the write is attempted. That of > course involves reading the data to splice from the source file. If the > source file isn't O_NONBLOCK that read may block holding pipe_lock(). > > If you raise O_NONBLOCK on the source fd in wr.c then your problems go > away. This is pretty long-standing behavior. I don't see how this is relevant here. Whether the writer splice blocks ‒ or how it behaves at all ‒ doesn't matter. The /reader/ demands non-blocking reads. Just by running a splice() we've managed to permanently hang the reader in a way that's fully impervious to everything. Actually, hold that: in testing this on an actual program that relies on this (nullmailer), I've found that trying to /open the FIFO/ also hangs forever, in that same signal-impervious state. To wit: $ ps 3766 PID TTY STAT TIME COMMAND 3766 ? Ss 0:01 /usr/sbin/nullmailer-send $ ls -l /proc/3766/fd total 0 lr-x------ 1 mail mail 64 Jun 14 15:03 0 -> /dev/null lrwx------ 1 mail mail 64 Jun 14 15:03 1 -> 'socket:[81721760]' lrwx------ 1 mail mail 64 Jun 14 15:03 2 -> 'socket:[81721760]' lr-x------ 1 mail mail 64 Apr 28 15:38 3 -> 'pipe:[81721763]' l-wx------ 1 mail mail 64 Jun 14 15:03 4 -> 'pipe:[81721763]' lr-x------ 1 mail mail 64 Jun 14 15:03 5 -> /var/spool/nullmailer/trigger lrwx------ 1 mail mail 64 Jun 14 15:03 9 -> /dev/null # cat /proc/3766/fdinfo/5 pos: 0 flags: 0104000 mnt_id: 64 ino: 393969 # < /proc/3766/fdinfo/5 fdinfo O_RDONLY O_NONBLOCK O_LARGEFILE # strace -yp 3766 & strace: Process 3766 attached $ strace out/cmd/cat > /var/spool/nullmailer/trigger [cat] (normal libc setup) [cat] splice(0, NULL, 1, NULL, 134217728, SPLICE_F_MOVE|SPLICE_F_MOREa [cat] ) = 2 [cat] splice(0, NULL, 1, NULL, 134217728, SPLICE_F_MOVE|SPLICE_F_MORE [nullmailer] pselect6(6, [5</var/spool/nullmailer/trigger>], NULL, NULL, {tv_sec=86397, tv_nsec=624894145}, NULL) = 1 (in [5], left {tv_sec=86394, tv_nsec=841299215}) [nullmailer] write(1<socket:[81721760]>, "Trigger pulled.\n", 16) = 16 [nullmailer] read(5</var/spool/nullmailer/trigger>, and $ strace -y sh -c 'echo zupa > /var/spool/nullmailer/trigger' (...whatever shell setup) rt_sigaction(SIGTERM, {sa_handler=SIG_DFL, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER, sa_restorer=0xf7d21bb0}, NULL, 8) = 0 openat(AT_FDCWD, "/var/spool/nullmailer/trigger", O_WRONLY|O_CREAT|O_TRUNC, 0666 This is a "you've lost" situation to me. This system will /never/ send mail now, and any mailer program will also hang forever (again, to wit: # echo zupa | strace -yfo /tmp/ss mail root does hang forever and /tmp/ss ends in 16915 close(6</var/spool/nullmailer/queue>) = 0 16915 unlink("/var/spool/nullmailer/tmp/16915") = 0 16915 openat(AT_FDCWD, "/var/spool/nullmailer/trigger", O_WRONLY|O_NONBLOCK ) which means that, on this system, I will never get events from smartd or ZED, so fuck me if I wanted to get "scrub errored" or "disk will die soon" notifications (in pre-2.0.0 ZED this would also have broken autoreplace=on since it waited synchronously), or from other monitoring, so again fuck me if I wanted to get overheating/packet drops/whatever notifications, or again fuck me if I wanted to get cron mail. In many ways I've brought the system down (or will have done in like a day once some mails go out) by sending a mail weird. Naturally systemd stopping nullmailer failed after a few minutes with × nullmailer.service - Nullmailer relay-only MTA Loaded: loaded (/lib/systemd/system/nullmailer.service; enabled; preset: enabled) Active: failed (Result: timeout) since Mon 2023-06-26 13:10:02 CEST; 6min ago Duration: 1month 4w 10h 55min 29.666s Docs: man:nullmailer(7) Main PID: 3766 Tasks: 1 (limit: 4673) Memory: 3.1M CPU: 1min 26.893s CGroup: /system.slice/nullmailer.service └─3766 /usr/sbin/nullmailer-send Jun 26 13:05:32 szarotka systemd[1]: nullmailer.service: State 'stop-sigterm' timed out. Killing. Jun 26 13:05:32 szarotka systemd[1]: nullmailer.service: Killing process 3766 (nullmailer-send) with signal SIGKILL. Jun 26 13:07:02 szarotka systemd[1]: nullmailer.service: Processes still around after SIGKILL. Ignoring. Jun 26 13:08:32 szarotka systemd[1]: nullmailer.service: State 'final-sigterm' timed out. Killing. Jun 26 13:08:32 szarotka systemd[1]: nullmailer.service: Killing process 3766 (nullmailer-send) with signal SIGKILL. Jun 26 13:10:02 szarotka systemd[1]: nullmailer.service: Processes still around after final SIGKILL. Entering failed mode. Jun 26 13:10:02 szarotka systemd[1]: nullmailer.service: Failed with result 'timeout'. Jun 26 13:10:02 szarotka systemd[1]: nullmailer.service: Unit process 3766 (nullmailer-send) remains running after unit s> Jun 26 13:10:02 szarotka systemd[1]: Stopped nullmailer.service - Nullmailer relay-only MTA. Jun 26 13:10:02 szarotka systemd[1]: nullmailer.service: Consumed 1min 26.893s CPU time. But not to fret! Maybe we can still kill it with the cgroup! No: # strace -y sh -c 'echo 1 > /sys/fs/cgroup/system.slice/nullmailer.service/cgroup.kill' ... dup2(3</sys/fs/cgroup/system.slice/nullmailer.service/cgroup.kill>, 1) = 1</sys/fs/cgroup/system.slice/nullmailer.service/cgroup.kill> close(3</sys/fs/cgroup/system.slice/nullmailer.service/cgroup.kill>) = 0 write(1</sys/fs/cgroup/system.slice/nullmailer.service/cgroup.kill>, "1\n", 2) = 2 ... This completes, sure, but doesn't do anything at all (admittedly, I'm not a cgroup expert, but it did work on other, non-poisoned, cgroups, so I'd expect it to work). Opening the FIFO with O_NONBLOCK also hangs, obviously. Killing the splicer restores order, as expected. > Splice would have to be > refactored to not rely on pipe_lock(). That's likely major work with a > good portion of regressions if the past is any indication. That's likely; however, it ‒ or an equivalent solution ‒ would probably be a good idea to do, on balance of all my points above, I think. > If you need that ability to fully async read from a pipe with splice > rn then io_uring will at least allow you to punt that read into an async > worker thread afaict. I need my system to not be hanged by any user with a magic syscall. I think you've confused the splice bit as being contentious ‒ I don't care, and couldn't care less about how this is triggered ‒ the issue is that a splice fully excludes /ALL OTHER/ operations on a pipe, and zombifies all processes that try. Consider the case where messages arrive at a collection pipe (this is well-specified and well-used with O_DIRECT and <=PIPE_MAX, I don't have a demo off-hand), or, hell, even a case where logs do: giving any user with append capability effectively exclusive control for as long as they want is, well, suboptimal; you could analyse this as a stronger variant of https://lore.kernel.org/linux-fsdevel/fa6de786ee1241c6ba54c3cce0b980aa@AcuMS.aculab.com/t/#e74be7131861099a7f3d82d51dfc96645d26e9a94 and indeed, my original use-case was this broke tail -f, but you know how it be. [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Pending splice(file -> FIFO) excludes all other FIFO operations forever (was: ... always blocks read(FIFO), regardless of O_NONBLOCK on read side?) 2023-06-26 11:59 ` Pending splice(file -> FIFO) excludes all other FIFO operations forever (was: ... always blocks read(FIFO), regardless of O_NONBLOCK on read side?) Ahelenia Ziemiańska @ 2023-06-26 15:56 ` Christian Brauner 2023-06-26 16:14 ` Ahelenia Ziemiańska 0 siblings, 1 reply; 19+ messages in thread From: Christian Brauner @ 2023-06-26 15:56 UTC (permalink / raw) To: Jens Axboe, David Howells, Linus Torvalds Cc: Alexander Viro, linux-fsdevel, linux-kernel, Ahelenia Ziemiańska On Mon, Jun 26, 2023 at 01:59:07PM +0200, Ahelenia Ziemiańska wrote: > On Mon, Jun 26, 2023 at 11:32:16AM +0200, Christian Brauner wrote: > > On Mon, Jun 26, 2023 at 03:12:09AM +0200, Ahelenia Ziemiańska wrote: > > > Hi! (starting with get_maintainers.pl fs/splice.c, > > > idk if that's right though) > > > > > > Per fs/splice.c: > > > * The traditional unix read/write is extended with a "splice()" operation > > > * that transfers data buffers to or from a pipe buffer. > > > so I expect splice() to work just about the same as read()/write() > > > (and, to a large extent, it does so). > > > > > > Thus, a refresher on pipe read() semantics > > > (quoting Issue 8 Draft 3; Linux when writing with write()): > > > 60746 When attempting to read from an empty pipe or FIFO: > > > 60747 • If no process has the pipe open for writing, read( ) shall return 0 to indicate end-of-file. > > > 60748 • If some process has the pipe open for writing and O_NONBLOCK is set, read( ) shall return > > > 60749 −1 and set errno to [EAGAIN]. > > > 60750 • If some process has the pipe open for writing and O_NONBLOCK is clear, read( ) shall > > > 60751 block the calling thread until some data is written or the pipe is closed by all processes that > > > 60752 had the pipe open for writing. > > > > > > However, I've observed that this is not the case when splicing from > > > something that sleeps on read to a pipe, and that in that case all > > > readers block, /including/ ones that are reading from fds with > > > O_NONBLOCK set! > > > > > > As an example, consider these two programs: > > > -- >8 -- > > > // wr.c > > > #define _GNU_SOURCE > > > #include <fcntl.h> > > > #include <stdio.h> > > > int main() { > > > while (splice(0, 0, 1, 0, 128 * 1024 * 1024, 0) > 0) > > > ; > > > fprintf(stderr, "wr: %m\n"); > > > } > > > -- >8 -- > > > > > > -- >8 -- > > > // rd.c > > > #define _GNU_SOURCE > > > #include <errno.h> > > > #include <fcntl.h> > > > #include <stdio.h> > > > #include <unistd.h> > > > int main() { > > > fcntl(0, F_SETFL, fcntl(0, F_GETFL) | O_NONBLOCK); > > > > > > char buf[64 * 1024] = {}; > > > for (ssize_t rd;;) { > > > #if 1 > > > while ((rd = read(0, buf, sizeof(buf))) == -1 && errno == EINTR) > > > ; > > > #else > > > while ((rd = splice(0, 0, 1, 0, 128 * 1024 * 1024, 0)) == -1 && > > > errno == EINTR) > > > ; > > > #endif > > > fprintf(stderr, "rd=%zd: %m\n", rd); > > > write(1, buf, rd); > > > > > > errno = 0; > > > sleep(1); > > > } > > > } > > > -- >8 -- > > > > > > Thus: > > > -- >8 -- > > > a$ make rd wr > > > a$ mkfifo fifo > > > a$ ./rd < fifo b$ echo qwe > fifo > > > rd=4: Success > > > qwe > > > rd=0: Success > > > rd=0: Success b$ sleep 2 > fifo > > > rd=-1: Resource temporarily unavailable > > > rd=-1: Resource temporarily unavailable > > > rd=0: Success > > > rd=0: Success > > > rd=-1: Resource temporarily unavailable b$ /bin/cat > fifo > > > rd=-1: Resource temporarily unavailable > > > rd=4: Success abc > > > abc > > > rd=-1: Resource temporarily unavailable > > > rd=4: Success def > > > def > > > rd=0: Success ^D > > > rd=0: Success > > > rd=0: Success b$ ./wr > fifo > > > -- >8 -- > > > and nothing. Until you actually type a line (or a few) into teletype b > > > so that the splice completes, at which point so does the read. > > > > > > An even simpler case is > > > -- >8 -- > > > $ ./wr | ./rd > > > abc > > > def > > > rd=8: Success > > > abc > > > def > > > ghi > > > jkl > > > rd=8: Success > > > ghi > > > jkl > > > ^D > > > wr: Success > > > rd=-1: Resource temporarily unavailable > > > rd=0: Success > > > rd=0: Success > > > -- >8 -- > > > > > > splice flags don't do anything. > > > Tested on bookworm (6.1.27-1) and Linus' HEAD (v6.4-rc7-234-g547cc9be86f4). > > > > > > You could say this is a "denial of service", since this is a valid > > > way of following pipes (and, sans SIGIO, the only portable one), > > splice() may block for any of the two file descriptors if they don't > > have O_NONBLOCK set even if SPLICE_F_NONBLOCK is raised. > > > > SPLICE_F_NONBLOCK in splice_file_to_pipe() is only relevant if the pipe > > is full. If the pipe isn't full then the write is attempted. That of > > course involves reading the data to splice from the source file. If the > > source file isn't O_NONBLOCK that read may block holding pipe_lock(). > > > > If you raise O_NONBLOCK on the source fd in wr.c then your problems go > > away. This is pretty long-standing behavior. > I don't see how this is relevant here. Whether the writer splice blocks > ‒ or how it behaves at all ‒ doesn't matter. > > The /reader/ demands non-blocking reads. Just by running a splice() > we've managed to permanently hang the reader in a way that's fully > impervious to everything. > > Actually, hold that: in testing this on an actual program that relies on > this (nullmailer), I've found that trying to /open the FIFO/ also hangs > forever, in that same signal-impervious state. > > To wit: > $ ps 3766 > PID TTY STAT TIME COMMAND > 3766 ? Ss 0:01 /usr/sbin/nullmailer-send > $ ls -l /proc/3766/fd > total 0 > lr-x------ 1 mail mail 64 Jun 14 15:03 0 -> /dev/null > lrwx------ 1 mail mail 64 Jun 14 15:03 1 -> 'socket:[81721760]' > lrwx------ 1 mail mail 64 Jun 14 15:03 2 -> 'socket:[81721760]' > lr-x------ 1 mail mail 64 Apr 28 15:38 3 -> 'pipe:[81721763]' > l-wx------ 1 mail mail 64 Jun 14 15:03 4 -> 'pipe:[81721763]' > lr-x------ 1 mail mail 64 Jun 14 15:03 5 -> /var/spool/nullmailer/trigger > lrwx------ 1 mail mail 64 Jun 14 15:03 9 -> /dev/null > # cat /proc/3766/fdinfo/5 > pos: 0 > flags: 0104000 > mnt_id: 64 > ino: 393969 > # < /proc/3766/fdinfo/5 fdinfo > O_RDONLY O_NONBLOCK O_LARGEFILE > # strace -yp 3766 & > strace: Process 3766 attached > $ strace out/cmd/cat > /var/spool/nullmailer/trigger > [cat] (normal libc setup) > [cat] splice(0, NULL, 1, NULL, 134217728, SPLICE_F_MOVE|SPLICE_F_MOREa > [cat] ) = 2 > [cat] splice(0, NULL, 1, NULL, 134217728, SPLICE_F_MOVE|SPLICE_F_MORE > [nullmailer] pselect6(6, [5</var/spool/nullmailer/trigger>], NULL, NULL, {tv_sec=86397, tv_nsec=624894145}, NULL) = 1 (in [5], left {tv_sec=86394, tv_nsec=841299215}) > [nullmailer] write(1<socket:[81721760]>, "Trigger pulled.\n", 16) = 16 > [nullmailer] read(5</var/spool/nullmailer/trigger>, > and > $ strace -y sh -c 'echo zupa > /var/spool/nullmailer/trigger' > (...whatever shell setup) > rt_sigaction(SIGTERM, {sa_handler=SIG_DFL, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER, sa_restorer=0xf7d21bb0}, NULL, 8) = 0 > openat(AT_FDCWD, "/var/spool/nullmailer/trigger", O_WRONLY|O_CREAT|O_TRUNC, 0666 > > This is a "you've lost" situation to me. This system will /never/ > send mail now, and any mailer program will also hang forever > (again, to wit: > # echo zupa | strace -yfo /tmp/ss mail root > does hang forever and /tmp/ss ends in > 16915 close(6</var/spool/nullmailer/queue>) = 0 > 16915 unlink("/var/spool/nullmailer/tmp/16915") = 0 > 16915 openat(AT_FDCWD, "/var/spool/nullmailer/trigger", O_WRONLY|O_NONBLOCK > ) > which means that, on this system, I will never get events from smartd > or ZED, so fuck me if I wanted to get "scrub errored" or "disk > will die soon" notifications (in pre-2.0.0 ZED this would also have > broken autoreplace=on since it waited synchronously), > or from other monitoring, so again fuck me if I wanted to get > overheating/packet drops/whatever notifications, > or again fuck me if I wanted to get cron mail. > In many ways I've brought the system down (or will have done in like a > day once some mails go out) by sending a mail weird. > > > Naturally systemd stopping nullmailer failed after a few minutes with > × nullmailer.service - Nullmailer relay-only MTA > Loaded: loaded (/lib/systemd/system/nullmailer.service; enabled; preset: enabled) > Active: failed (Result: timeout) since Mon 2023-06-26 13:10:02 CEST; 6min ago > Duration: 1month 4w 10h 55min 29.666s > Docs: man:nullmailer(7) > Main PID: 3766 > Tasks: 1 (limit: 4673) > Memory: 3.1M > CPU: 1min 26.893s > CGroup: /system.slice/nullmailer.service > └─3766 /usr/sbin/nullmailer-send > > Jun 26 13:05:32 szarotka systemd[1]: nullmailer.service: State 'stop-sigterm' timed out. Killing. > Jun 26 13:05:32 szarotka systemd[1]: nullmailer.service: Killing process 3766 (nullmailer-send) with signal SIGKILL. > Jun 26 13:07:02 szarotka systemd[1]: nullmailer.service: Processes still around after SIGKILL. Ignoring. > Jun 26 13:08:32 szarotka systemd[1]: nullmailer.service: State 'final-sigterm' timed out. Killing. > Jun 26 13:08:32 szarotka systemd[1]: nullmailer.service: Killing process 3766 (nullmailer-send) with signal SIGKILL. > Jun 26 13:10:02 szarotka systemd[1]: nullmailer.service: Processes still around after final SIGKILL. Entering failed mode. > Jun 26 13:10:02 szarotka systemd[1]: nullmailer.service: Failed with result 'timeout'. > Jun 26 13:10:02 szarotka systemd[1]: nullmailer.service: Unit process 3766 (nullmailer-send) remains running after unit s> > Jun 26 13:10:02 szarotka systemd[1]: Stopped nullmailer.service - Nullmailer relay-only MTA. > Jun 26 13:10:02 szarotka systemd[1]: nullmailer.service: Consumed 1min 26.893s CPU time. > > But not to fret! Maybe we can still kill it with the cgroup! No: > # strace -y sh -c 'echo 1 > /sys/fs/cgroup/system.slice/nullmailer.service/cgroup.kill' > ... > dup2(3</sys/fs/cgroup/system.slice/nullmailer.service/cgroup.kill>, 1) = 1</sys/fs/cgroup/system.slice/nullmailer.service/cgroup.kill> > close(3</sys/fs/cgroup/system.slice/nullmailer.service/cgroup.kill>) = 0 > write(1</sys/fs/cgroup/system.slice/nullmailer.service/cgroup.kill>, "1\n", 2) = 2 > ... > This completes, sure, but doesn't do anything at all > (admittedly, I'm not a cgroup expert, but it did work on other, > non-poisoned, cgroups, so I'd expect it to work). > > Opening the FIFO with O_NONBLOCK also hangs, obviously. > Killing the splicer restores order, as expected. > > > Splice would have to be > > refactored to not rely on pipe_lock(). That's likely major work with a > > good portion of regressions if the past is any indication. > That's likely; however, it ‒ or an equivalent solution ‒ would > probably be a good idea to do, on balance of all my points above, > I think. In-kernel consumers already have a way of detecting when the pipe isn't safe for non-blocking read anymore because splice has been called and cleared FMODE_NOWAIT. I mean, one workaround would probably be poll() even with O_NONBLOCK but I get why that's annoying and half of a solution. So there are three options afaict: (1) rewrite splice.c to kill its reliance on pipe_lock() Very involved and would need a splice + pipe expert. (2) Add pipe_lock_interruptible() to stop the bleeding and give userspace the ability to at least kill a hanging reader. Also a potentially sensitive change probably regression prone. (3) Somehow factor in FMODE_NOWAIT when acquiring pipe_lock(). If FMODE_NOWAIT is set, try to acquire the lock and if not report EAGAIN otherwise proceed as before. I think Jens proposed a version of this a while back. Adding Linus as well since he probably has thoughts on this. tl;dr it by splicing from a regular file to a pipe where the regular file in splice isn't O_NONBLOCK we can hold pipe_lock() as long as we want and hang pipe_read() even with O_NONBLOCK unkillable trying to acquire pipe_lock(). ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Pending splice(file -> FIFO) excludes all other FIFO operations forever (was: ... always blocks read(FIFO), regardless of O_NONBLOCK on read side?) 2023-06-26 15:56 ` Christian Brauner @ 2023-06-26 16:14 ` Ahelenia Ziemiańska 2023-07-06 21:56 ` Linus Torvalds 0 siblings, 1 reply; 19+ messages in thread From: Ahelenia Ziemiańska @ 2023-06-26 16:14 UTC (permalink / raw) To: Christian Brauner Cc: Jens Axboe, David Howells, Linus Torvalds, Alexander Viro, linux-fsdevel, linux-kernel [-- Attachment #1: Type: text/plain, Size: 926 bytes --] On Mon, Jun 26, 2023 at 05:56:28PM +0200, Christian Brauner wrote: > I mean, one workaround would probably be poll() even with O_NONBLOCK but > I get why that's annoying and half of a solution. poll() doesn't really change anything since it turns into a hotloop of .revents=POLLHUP with a disconnected writer, cf. https://lore.kernel.org/linux-fsdevel/jbyihkyk5dtaohdwjyivambb2gffyjs3dodpofafnkkunxq7bu@jngkdxx65pux/t/#m747e2bbd0c5cffb6baaf1955f6f8b0d97e216839 The only apparently-supported way to poll pipes under linux is to sleep()/read(O_NONBLOCK) like in the bad old days. And even if that was a working work-around, the fundamental problem of ./spl>fifo excluding all other access to fifo is quite unfortunate too. > tl;dr it by splicing from a regular file to a pipe where the regular > file in splice isn't O_NONBLOCK (Most noticeable when the "regular file" is a tty and thus the I/O never completes by design.) [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Pending splice(file -> FIFO) excludes all other FIFO operations forever (was: ... always blocks read(FIFO), regardless of O_NONBLOCK on read side?) 2023-06-26 16:14 ` Ahelenia Ziemiańska @ 2023-07-06 21:56 ` Linus Torvalds 2023-07-07 17:21 ` Christian Brauner 2023-07-08 0:00 ` Matthew Wilcox 0 siblings, 2 replies; 19+ messages in thread From: Linus Torvalds @ 2023-07-06 21:56 UTC (permalink / raw) To: Ahelenia Ziemiańska Cc: Christian Brauner, Jens Axboe, David Howells, Alexander Viro, linux-fsdevel, linux-kernel [-- Attachment #1: Type: text/plain, Size: 1018 bytes --] On Mon, 26 Jun 2023 at 09:14, Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> wrote: > > And even if that was a working work-around, the fundamental problem of > ./spl>fifo excluding all other access to fifo is quite unfortunate too. So I already sent an earlier broken version of this patch to Ahelenia and Christian, but here's an actually slightly tested version with the obvious bugs fixed. The keyword here being "obvious". It's probably broken in many non-obvious ways, but I'm sending it out in case anybody wants to play around with it. It boots for me. It even changes behavior of programs that splice() and used to keep the pipe lock over the IO and no longer do. But it might do unspeakable things to your pets, so caveat emptor. It "feels" right to me. But it's a rather quick hack, and really needs more eyes and more thought. I mention O_NDELAY in the (preliminary) commit message, but there are probably other issues that need thinking about. Linus [-- Attachment #2: 0001-splice-lock-the-pages-and-unlock-the-pipe-during-IO.patch --] [-- Type: text/x-patch, Size: 9266 bytes --] From 3475e4d56cefab9f8b061dc824db4e314d076a59 Mon Sep 17 00:00:00 2001 From: Linus Torvalds <torvalds@linux-foundation.org> Date: Thu, 6 Jul 2023 11:23:07 -0700 Subject: [PATCH] splice: lock the pages and unlock the pipe during IO This is already what we do for page cache pages, where we can add raw pages that are not up-to-date yet to the pipe, and the readers call the pipe buffer '->confirm()' function to wait for the data to be ready. So just do the same for splice reading, allowing us to unlock the pipe during the read. That then avoids problems with users that get blocked on the pipe lock. Now they get blocked on the pipe buffer ->confirm() instead. TODO: teach 'O_NDELAY' and select/poll about this too. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- fs/pipe.c | 18 +++--- fs/splice.c | 122 +++++++++++++++++++++++++++++----------- include/linux/pagemap.h | 1 + mm/filemap.c | 6 ++ 4 files changed, 105 insertions(+), 42 deletions(-) diff --git a/fs/pipe.c b/fs/pipe.c index 2d88f73f585a..71942d240c98 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -284,10 +284,17 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) if (!pipe_empty(head, tail)) { struct pipe_buffer *buf = &pipe->bufs[tail & mask]; - size_t chars = buf->len; - size_t written; + size_t chars, written; int error; + error = pipe_buf_confirm(pipe, buf); + if (error) { + if (!ret) + ret = error; + break; + } + + chars = buf->len; if (chars > total_len) { if (buf->flags & PIPE_BUF_FLAG_WHOLE) { if (ret == 0) @@ -297,13 +304,6 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) chars = total_len; } - error = pipe_buf_confirm(pipe, buf); - if (error) { - if (!ret) - ret = error; - break; - } - written = copy_page_to_iter(buf->page, buf->offset, chars, to); if (unlikely(written < chars)) { if (!ret) diff --git a/fs/splice.c b/fs/splice.c index 004eb1c4ce31..503f7eff41b6 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -300,6 +300,42 @@ void splice_shrink_spd(struct splice_pipe_desc *spd) kfree(spd->partial); } +static void finalize_pipe_buf(struct pipe_buffer *buf, unsigned int chunk) +{ + buf->len = chunk; + buf->ops = &default_pipe_buf_ops; + unlock_page(buf->page); +} + +static int busy_pipe_buf_confirm(struct pipe_inode_info *pipe, + struct pipe_buffer *buf) +{ + struct page *page = buf->page; + + if (folio_wait_bit_interruptible(page_folio(page), PG_locked)) + return -EINTR; + return 0; +} + +/* + * These are the same as the default pipe buf operations, + * but the '.confirm()' function requires that any user + * wait for the page to unlock before use. + * + * I guess we could use the whole PG_uptodate logic too, + * and treat these as some kind of special page table pages. + * + * PIPE_BUF_FLAG_CAN_MERGE must obviously not be set when + * using these, and it's important that any pipe reader + * look at buf->len only _after_ confirming the buffer! + */ +const struct pipe_buf_operations busy_pipe_buf_ops = { + .confirm = busy_pipe_buf_confirm, + .release = generic_pipe_buf_release, + .try_steal = generic_pipe_buf_try_steal, + .get = generic_pipe_buf_get, +}; + /** * copy_splice_read - Copy data from a file and splice the copy into a pipe * @in: The file to read from @@ -328,6 +364,7 @@ ssize_t copy_splice_read(struct file *in, loff_t *ppos, struct bio_vec *bv; struct kiocb kiocb; struct page **pages; + struct pipe_buffer **bufs; ssize_t ret; size_t used, npages, chunk, remain, keep = 0; int i; @@ -339,11 +376,13 @@ ssize_t copy_splice_read(struct file *in, loff_t *ppos, npages = DIV_ROUND_UP(len, PAGE_SIZE); bv = kzalloc(array_size(npages, sizeof(bv[0])) + - array_size(npages, sizeof(struct page *)), GFP_KERNEL); + array_size(npages, sizeof(struct page *)) + + array_size(npages, sizeof(struct pipe_buffer *)), GFP_KERNEL); if (!bv) return -ENOMEM; pages = (struct page **)(bv + npages); + bufs = (struct pipe_buffer **)(pages + npages); npages = alloc_pages_bulk_array(GFP_USER, npages, pages); if (!npages) { kfree(bv); @@ -352,14 +391,34 @@ ssize_t copy_splice_read(struct file *in, loff_t *ppos, remain = len = min_t(size_t, len, npages * PAGE_SIZE); + /* Push them into the pipe and build up the bio_vec */ for (i = 0; i < npages; i++) { + struct pipe_buffer *buf = pipe_head_buf(pipe); + struct page *page = pages[i]; + + pipe->head++; + lock_page(page); + *buf = (struct pipe_buffer) { + .ops = &busy_pipe_buf_ops, + .page = page, + .offset = 0, + .len = chunk, + }; + bufs[i] = buf; + chunk = min_t(size_t, PAGE_SIZE, remain); - bv[i].bv_page = pages[i]; + bv[i].bv_page = page; bv[i].bv_offset = 0; bv[i].bv_len = chunk; remain -= chunk; } + /* + * We have now reserved the space with locked pages, + * and can unlock the pipe during the IO. + */ + pipe_unlock(pipe); + /* Do the I/O */ iov_iter_bvec(&to, ITER_DEST, bv, npages, len); init_sync_kiocb(&kiocb, in); @@ -378,26 +437,22 @@ ssize_t copy_splice_read(struct file *in, loff_t *ppos, if (ret == -EFAULT) ret = -EAGAIN; - /* Free any pages that didn't get touched at all. */ - if (keep < npages) - release_pages(pages + keep, npages - keep); - - /* Push the remaining pages into the pipe. */ + /* Update the page state in the pipe */ remain = ret; - for (i = 0; i < keep; i++) { - struct pipe_buffer *buf = pipe_head_buf(pipe); + for (i = 0; i < npages; i++) { + struct pipe_buffer *buf = bufs[i]; chunk = min_t(size_t, remain, PAGE_SIZE); - *buf = (struct pipe_buffer) { - .ops = &default_pipe_buf_ops, - .page = bv[i].bv_page, - .offset = 0, - .len = chunk, - }; - pipe->head++; remain -= chunk; + + /* + * NOTE! The size might have changed, and + * chunk may be smaller or zero! + */ + finalize_pipe_buf(buf, chunk); } + pipe_lock(pipe); kfree(bv); return ret; } @@ -455,10 +510,6 @@ static int splice_from_pipe_feed(struct pipe_inode_info *pipe, struct splice_des while (!pipe_empty(head, tail)) { struct pipe_buffer *buf = &pipe->bufs[tail & mask]; - sd->len = buf->len; - if (sd->len > sd->total_len) - sd->len = sd->total_len; - ret = pipe_buf_confirm(pipe, buf); if (unlikely(ret)) { if (ret == -ENODATA) @@ -466,6 +517,10 @@ static int splice_from_pipe_feed(struct pipe_inode_info *pipe, struct splice_des return ret; } + sd->len = buf->len; + if (sd->len > sd->total_len) + sd->len = sd->total_len; + ret = actor(pipe, buf, sd); if (ret <= 0) return ret; @@ -715,12 +770,7 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out, left = sd.total_len; for (n = 0; !pipe_empty(head, tail) && left && n < nbufs; tail++) { struct pipe_buffer *buf = &pipe->bufs[tail & mask]; - size_t this_len = buf->len; - - /* zero-length bvecs are not supported, skip them */ - if (!this_len) - continue; - this_len = min(this_len, left); + size_t this_len; ret = pipe_buf_confirm(pipe, buf); if (unlikely(ret)) { @@ -729,6 +779,12 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out, goto done; } + /* zero-length bvecs are not supported, skip them */ + this_len = buf->len; + if (!this_len) + continue; + this_len = min(this_len, left); + bvec_set_page(&array[n], buf->page, this_len, buf->offset); left -= this_len; @@ -847,13 +903,6 @@ ssize_t splice_to_socket(struct pipe_inode_info *pipe, struct file *out, struct pipe_buffer *buf = &pipe->bufs[tail & mask]; size_t seg; - if (!buf->len) { - tail++; - continue; - } - - seg = min_t(size_t, remain, buf->len); - ret = pipe_buf_confirm(pipe, buf); if (unlikely(ret)) { if (ret == -ENODATA) @@ -861,6 +910,13 @@ ssize_t splice_to_socket(struct pipe_inode_info *pipe, struct file *out, break; } + if (!buf->len) { + tail++; + continue; + } + + seg = min_t(size_t, remain, buf->len); + bvec_set_page(&bvec[bc++], buf->page, seg, buf->offset); remain -= seg; if (remain == 0 || bc >= ARRAY_SIZE(bvec)) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 716953ee1ebd..cc51ea910a91 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -1018,6 +1018,7 @@ static inline bool folio_lock_or_retry(struct folio *folio, */ void folio_wait_bit(struct folio *folio, int bit_nr); int folio_wait_bit_killable(struct folio *folio, int bit_nr); +int folio_wait_bit_interruptible(struct folio *folio, int bit_nr); /* * Wait for a folio to be unlocked. diff --git a/mm/filemap.c b/mm/filemap.c index 9e44a49bbd74..e36e052bb991 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1450,6 +1450,12 @@ int folio_wait_bit_killable(struct folio *folio, int bit_nr) } EXPORT_SYMBOL(folio_wait_bit_killable); +int folio_wait_bit_interruptible(struct folio *folio, int bit_nr) +{ + return folio_wait_bit_common(folio, bit_nr, TASK_INTERRUPTIBLE, SHARED); +} +EXPORT_SYMBOL(folio_wait_bit_interruptible); + /** * folio_put_wait_locked - Drop a reference and wait for it to be unlocked * @folio: The folio to wait for. -- 2.41.0.203.ga4f2cd32bb.dirty ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: Pending splice(file -> FIFO) excludes all other FIFO operations forever (was: ... always blocks read(FIFO), regardless of O_NONBLOCK on read side?) 2023-07-06 21:56 ` Linus Torvalds @ 2023-07-07 17:21 ` Christian Brauner 2023-07-07 19:10 ` Linus Torvalds 2023-07-08 0:00 ` Matthew Wilcox 1 sibling, 1 reply; 19+ messages in thread From: Christian Brauner @ 2023-07-07 17:21 UTC (permalink / raw) To: Linus Torvalds Cc: Ahelenia Ziemiańska, Jens Axboe, David Howells, Alexander Viro, linux-fsdevel, linux-kernel On Thu, Jul 06, 2023 at 02:56:45PM -0700, Linus Torvalds wrote: > On Mon, 26 Jun 2023 at 09:14, Ahelenia Ziemiańska > <nabijaczleweli@nabijaczleweli.xyz> wrote: > > > > And even if that was a working work-around, the fundamental problem of > > ./spl>fifo excluding all other access to fifo is quite unfortunate too. > > So I already sent an earlier broken version of this patch to Ahelenia > and Christian, but here's an actually slightly tested version with the > obvious bugs fixed. > > The keyword here being "obvious". It's probably broken in many > non-obvious ways, but I'm sending it out in case anybody wants to play > around with it. > > It boots for me. It even changes behavior of programs that splice() > and used to keep the pipe lock over the IO and no longer do. > > But it might do unspeakable things to your pets, so caveat emptor. It > "feels" right to me. But it's a rather quick hack, and really needs > more eyes and more thought. I mention O_NDELAY in the (preliminary) > commit message, but there are probably other issues that need thinking > about. Forgot to say, fwiw, I've been running this through the LTP splice, pipe, and ipc tests without issues. A hanging reader can be signaled away cleanly with this. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Pending splice(file -> FIFO) excludes all other FIFO operations forever (was: ... always blocks read(FIFO), regardless of O_NONBLOCK on read side?) 2023-07-07 17:21 ` Christian Brauner @ 2023-07-07 19:10 ` Linus Torvalds 2023-07-07 19:57 ` Jens Axboe 2023-07-07 22:41 ` Ahelenia Ziemiańska 0 siblings, 2 replies; 19+ messages in thread From: Linus Torvalds @ 2023-07-07 19:10 UTC (permalink / raw) To: Christian Brauner Cc: Ahelenia Ziemiańska, Jens Axboe, David Howells, Alexander Viro, linux-fsdevel, linux-kernel [-- Attachment #1: Type: text/plain, Size: 2589 bytes --] On Fri, 7 Jul 2023 at 10:21, Christian Brauner <brauner@kernel.org> wrote: > > Forgot to say, fwiw, I've been running this through the LTP splice, > pipe, and ipc tests without issues. A hanging reader can be signaled > away cleanly with this. So that patch still has a couple of "wait for this" cases remaining. In particular, when we do a read, and we do have pipe buffers, both the read() system call and a number of internal splice functions will go "Ahh, I have data", and then do pipe_buf_confirm() and read it. Which then results in pipe_buf_confirm() blocking. It now blocks interruptibly, which is much nicer, but several of these users *could* just do a non-blocking confirmation instead, and wait for pipe readability. HOWEVER, that's slightly less trivial than you'd expect, because the "wait for readability" needs to be done without the pipe lock held - so you can't actually check the pipe buffer state at that point (since you need the pipe lock to look up the buffer). That's true even of "trivial" cases like actual user-space "read() with O_NONBLOCK and poll()" situations. Now, the solution to all this is *fairly* straightforward: (a) don't use "!pipe_empty()" for a readability check. We already have "pipe_readable()", but it's hidden in fs/pipe.c, so all the splice() code ended up writing the "does this pipe have data" using "!pipe_empty()" instead. (b) make "pipe_buf_confirm()" take a "non-blocking" boolean argument, and if it is non-blocking but hits one of those blocked pages, set "pipe->not_ready", and return -EAGAIN. This is ok, because "pipe_buf_confirm()" is always under the pipe lock, and we'll just clear "pipe->not_ready" under the pipe lock after finalizing all those pages (and before waking up readers) (c) make "pipe_wait_readable()" and "poll()" know about this all, so that we wait properly for a pipe that was not ready to become ready This all makes *most* users deal properly with these blocking events. In particular, things like splice_to_socket() can now do the whole proper "wait without holding the pipe lock" sequence, even when the pipe is not empty, just in this blocked state. This *may* also make all the cases Jens had with io_uring and splicing JustWork(tm). NOTE! NOTE! NOTE! Once more, this "feels right to me", and I'd argue that the basic approach is fairly straightfoward. The patch is also not horrendous. It all makes a fair amount of sense. BUT! I haven't tested this, and like the previous patch, I really would want people to think about this a lot. Comments? Jens? Linus [-- Attachment #2: patch.diff --] [-- Type: text/x-patch, Size: 11046 bytes --] fs/fuse/dev.c | 4 +-- fs/pipe.c | 23 ++++--------- fs/splice.c | 82 +++++++++++++++++++++++++++++++++++------------ include/linux/pipe_fs_i.h | 25 +++++++++++++-- 4 files changed, 93 insertions(+), 41 deletions(-) diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index 1a8f82f478cb..b891468dee06 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -700,7 +700,7 @@ static int fuse_copy_fill(struct fuse_copy_state *cs) struct pipe_buffer *buf = cs->pipebufs; if (!cs->write) { - err = pipe_buf_confirm(cs->pipe, buf); + err = pipe_buf_confirm(cs->pipe, buf, false); if (err) return err; @@ -800,7 +800,7 @@ static int fuse_try_move_page(struct fuse_copy_state *cs, struct page **pagep) fuse_copy_finish(cs); - err = pipe_buf_confirm(cs->pipe, buf); + err = pipe_buf_confirm(cs->pipe, buf, false); if (err) goto out_put_old; diff --git a/fs/pipe.c b/fs/pipe.c index 71942d240c98..25abf0c5c169 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -217,23 +217,13 @@ static const struct pipe_buf_operations anon_pipe_buf_ops = { .get = generic_pipe_buf_get, }; -/* Done while waiting without holding the pipe lock - thus the READ_ONCE() */ -static inline bool pipe_readable(const struct pipe_inode_info *pipe) -{ - unsigned int head = READ_ONCE(pipe->head); - unsigned int tail = READ_ONCE(pipe->tail); - unsigned int writers = READ_ONCE(pipe->writers); - - return !pipe_empty(head, tail) || !writers; -} - static ssize_t pipe_read(struct kiocb *iocb, struct iov_iter *to) { size_t total_len = iov_iter_count(to); struct file *filp = iocb->ki_filp; struct pipe_inode_info *pipe = filp->private_data; - bool was_full, wake_next_reader = false; + bool non_blocking, was_full, wake_next_reader = false; ssize_t ret; /* Null read succeeds. */ @@ -252,6 +242,8 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) * data for us. */ was_full = pipe_full(pipe->head, pipe->tail, pipe->max_usage); + non_blocking = (filp->f_flags & O_NONBLOCK) || + (iocb->ki_flags & IOCB_NOWAIT); for (;;) { /* Read ->head with a barrier vs post_one_notification() */ unsigned int head = smp_load_acquire(&pipe->head); @@ -287,7 +279,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) size_t chars, written; int error; - error = pipe_buf_confirm(pipe, buf); + error = pipe_buf_confirm(pipe, buf, non_blocking); if (error) { if (!ret) ret = error; @@ -342,8 +334,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) break; if (ret) break; - if ((filp->f_flags & O_NONBLOCK) || - (iocb->ki_flags & IOCB_NOWAIT)) { + if (non_blocking) { ret = -EAGAIN; break; } @@ -462,7 +453,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) if ((buf->flags & PIPE_BUF_FLAG_CAN_MERGE) && offset + chars <= PAGE_SIZE) { - ret = pipe_buf_confirm(pipe, buf); + ret = pipe_buf_confirm(pipe, buf, false); if (ret) goto out; @@ -678,7 +669,7 @@ pipe_poll(struct file *filp, poll_table *wait) mask = 0; if (filp->f_mode & FMODE_READ) { - if (!pipe_empty(head, tail)) + if (!pipe_empty(head, tail) && !READ_ONCE(pipe->not_ready)) mask |= EPOLLIN | EPOLLRDNORM; if (!pipe->writers && filp->f_version != pipe->w_counter) mask |= EPOLLHUP; diff --git a/fs/splice.c b/fs/splice.c index 503f7eff41b6..49139413457d 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -116,9 +116,13 @@ static void page_cache_pipe_buf_release(struct pipe_inode_info *pipe, /* * Check whether the contents of buf is OK to access. Since the content * is a page cache page, IO may be in flight. + * + * Note: we don't react to 'non_blocking', because we have no wakeup + * event for any polling. */ static int page_cache_pipe_buf_confirm(struct pipe_inode_info *pipe, - struct pipe_buffer *buf) + struct pipe_buffer *buf, + bool non_blocking) { struct page *page = buf->page; int err; @@ -307,12 +311,31 @@ static void finalize_pipe_buf(struct pipe_buffer *buf, unsigned int chunk) unlock_page(buf->page); } +/* + * This is called with the pipe locked by the read path, so will + * not race with 'finalize_pipe_buf()'. + * + * If it finds a locked page, and we're doing a non-blocking read, + * just set 'pipe->not_ready' and return -EINTR. We will clear that + * and wake up readers after finalizing the pending IO. + * + * NOTE! We have to do it this indirect way because 'pipe_poll()' + * is run without any pipe locking, so we can't check the buffer + * state at polling time. + */ static int busy_pipe_buf_confirm(struct pipe_inode_info *pipe, - struct pipe_buffer *buf) + struct pipe_buffer *buf, + bool non_blocking) { - struct page *page = buf->page; + struct folio *folio = page_folio(buf->page); - if (folio_wait_bit_interruptible(page_folio(page), PG_locked)) + if (!folio_test_locked(folio)) + return 0; + if (non_blocking) { + pipe->not_ready = true; + return -EAGAIN; + } + if (folio_wait_bit_interruptible(folio, PG_locked)) return -EINTR; return 0; } @@ -453,6 +476,14 @@ ssize_t copy_splice_read(struct file *in, loff_t *ppos, } pipe_lock(pipe); + /* + * Now that we have finalized any pending pipe buffers + * and re-taken the pipe lock, make sure to tell poll() + * and pipe_readable() that the buffers are usable again. + * + * The caller will be doing the pipe->rd_wait wakeup. + */ + pipe->not_ready = 0; kfree(bv); return ret; } @@ -510,7 +541,11 @@ static int splice_from_pipe_feed(struct pipe_inode_info *pipe, struct splice_des while (!pipe_empty(head, tail)) { struct pipe_buffer *buf = &pipe->bufs[tail & mask]; - ret = pipe_buf_confirm(pipe, buf); + /* + * Do this as a non-blocking confirm. We'll wait for + * readability if required in splice_from_pipe_next(). + */ + ret = pipe_buf_confirm(pipe, buf, true); if (unlikely(ret)) { if (ret == -ENODATA) ret = 0; @@ -584,10 +619,7 @@ static int splice_from_pipe_next(struct pipe_inode_info *pipe, struct splice_des return -ERESTARTSYS; repeat: - while (pipe_empty(pipe->head, pipe->tail)) { - if (!pipe->writers) - return 0; - + while (!pipe_readable(pipe)) { if (sd->num_spliced) return 0; @@ -608,6 +640,9 @@ static int splice_from_pipe_next(struct pipe_inode_info *pipe, struct splice_des if (eat_empty_buffer(pipe)) goto repeat; + if (!pipe->writers) + return 0; + return 1; } @@ -772,7 +807,7 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out, struct pipe_buffer *buf = &pipe->bufs[tail & mask]; size_t this_len; - ret = pipe_buf_confirm(pipe, buf); + ret = pipe_buf_confirm(pipe, buf, false); if (unlikely(ret)) { if (ret == -ENODATA) ret = 0; @@ -863,6 +898,7 @@ ssize_t splice_to_socket(struct pipe_inode_info *pipe, struct file *out, unsigned int head, tail, mask, bc = 0; size_t remain = len; +repeat: /* * Check for signal early to make process killable when there * are always buffers available @@ -871,11 +907,8 @@ ssize_t splice_to_socket(struct pipe_inode_info *pipe, struct file *out, if (signal_pending(current)) break; - while (pipe_empty(pipe->head, pipe->tail)) { + while (!pipe_readable(pipe)) { ret = 0; - if (!pipe->writers) - goto out; - if (spliced) goto out; @@ -894,17 +927,28 @@ ssize_t splice_to_socket(struct pipe_inode_info *pipe, struct file *out, pipe_wait_readable(pipe); } + ret = 0; + if (!pipe->writers) + goto out; + head = pipe->head; tail = pipe->tail; mask = pipe->ring_size - 1; - while (!pipe_empty(head, tail)) { + while (pipe_readable(pipe)) { struct pipe_buffer *buf = &pipe->bufs[tail & mask]; size_t seg; - ret = pipe_buf_confirm(pipe, buf); + ret = pipe_buf_confirm(pipe, buf, true); if (unlikely(ret)) { + if (ret == -EAGAIN) { + ret = 0; + if (bc || spliced) + break; + goto repeat; + } + if (ret == -ENODATA) ret = 0; break; @@ -1658,19 +1702,17 @@ static int ipipe_prep(struct pipe_inode_info *pipe, unsigned int flags) * Check the pipe occupancy without the inode lock first. This function * is speculative anyways, so missing one is ok. */ - if (!pipe_empty(pipe->head, pipe->tail)) + if (pipe_readable(pipe)) return 0; ret = 0; pipe_lock(pipe); - while (pipe_empty(pipe->head, pipe->tail)) { + while (!pipe_readable(pipe)) { if (signal_pending(current)) { ret = -ERESTARTSYS; break; } - if (!pipe->writers) - break; if (flags & SPLICE_F_NONBLOCK) { ret = -EAGAIN; break; diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h index 02e0086b10f6..7759779a3561 100644 --- a/include/linux/pipe_fs_i.h +++ b/include/linux/pipe_fs_i.h @@ -41,6 +41,7 @@ struct pipe_buffer { * @note_loss: The next read() should insert a data-lost message * @max_usage: The maximum number of slots that may be used in the ring * @ring_size: total number of buffers (should be a power of 2) + * @not_ready: pipe has buffers, but poll shouldn't return success yet * @nr_accounted: The amount this pipe accounts for in user->pipe_bufs * @tmp_page: cached released page * @readers: number of current readers of this pipe @@ -62,6 +63,7 @@ struct pipe_inode_info { unsigned int tail; unsigned int max_usage; unsigned int ring_size; + bool not_ready; #ifdef CONFIG_WATCH_QUEUE bool note_loss; #endif @@ -100,7 +102,7 @@ struct pipe_buf_operations { * hook. Returns 0 for good, or a negative error value in case of * error. If not present all pages are considered good. */ - int (*confirm)(struct pipe_inode_info *, struct pipe_buffer *); + int (*confirm)(struct pipe_inode_info *, struct pipe_buffer *, bool); /* * When the contents of this pipe buffer has been completely @@ -156,6 +158,22 @@ static inline bool pipe_full(unsigned int head, unsigned int tail, return pipe_occupancy(head, tail) >= limit; } +/** + * pipe_readable - Return true of there is data to be read + * @pipe: The pipe to check + * + * This can be done while waiting without holding the pipe lock - thus the READ_ONCE() + */ +static inline bool pipe_readable(const struct pipe_inode_info *pipe) +{ + unsigned int head = READ_ONCE(pipe->head); + unsigned int tail = READ_ONCE(pipe->tail); + unsigned int writers = READ_ONCE(pipe->writers); + bool ready = !READ_ONCE(pipe->not_ready); + + return (ready && !pipe_empty(head, tail)) || !writers; +} + /** * pipe_buf - Return the pipe buffer for the specified slot in the pipe ring * @pipe: The pipe to access @@ -209,11 +227,12 @@ static inline void pipe_buf_release(struct pipe_inode_info *pipe, * @buf: the buffer to confirm */ static inline int pipe_buf_confirm(struct pipe_inode_info *pipe, - struct pipe_buffer *buf) + struct pipe_buffer *buf, + bool non_blocking) { if (!buf->ops->confirm) return 0; - return buf->ops->confirm(pipe, buf); + return buf->ops->confirm(pipe, buf, non_blocking); } /** ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: Pending splice(file -> FIFO) excludes all other FIFO operations forever (was: ... always blocks read(FIFO), regardless of O_NONBLOCK on read side?) 2023-07-07 19:10 ` Linus Torvalds @ 2023-07-07 19:57 ` Jens Axboe 2023-07-07 22:41 ` Ahelenia Ziemiańska 1 sibling, 0 replies; 19+ messages in thread From: Jens Axboe @ 2023-07-07 19:57 UTC (permalink / raw) To: Linus Torvalds, Christian Brauner Cc: Ahelenia Ziemiańska, David Howells, Alexander Viro, linux-fsdevel, linux-kernel On 7/7/23 1:10?PM, Linus Torvalds wrote: > On Fri, 7 Jul 2023 at 10:21, Christian Brauner <brauner@kernel.org> wrote: >> >> Forgot to say, fwiw, I've been running this through the LTP splice, >> pipe, and ipc tests without issues. A hanging reader can be signaled >> away cleanly with this. > > So that patch still has a couple of "wait for this" cases remaining. > > In particular, when we do a read, and we do have pipe buffers, both > the read() system call and a number of internal splice functions will > go "Ahh, I have data", and then do pipe_buf_confirm() and read it. > > Which then results in pipe_buf_confirm() blocking. It now blocks > interruptibly, which is much nicer, but several of these users *could* > just do a non-blocking confirmation instead, and wait for pipe > readability. > > HOWEVER, that's slightly less trivial than you'd expect, because the > "wait for readability" needs to be done without the pipe lock held - > so you can't actually check the pipe buffer state at that point (since > you need the pipe lock to look up the buffer). > > That's true even of "trivial" cases like actual user-space "read() > with O_NONBLOCK and poll()" situations. > > Now, the solution to all this is *fairly* straightforward: > > (a) don't use "!pipe_empty()" for a readability check. > > We already have "pipe_readable()", but it's hidden in fs/pipe.c, > so all the splice() code ended up writing the "does this pipe have > data" using "!pipe_empty()" instead. > > (b) make "pipe_buf_confirm()" take a "non-blocking" boolean argument, > and if it is non-blocking but hits one of those blocked pages, set > "pipe->not_ready", and return -EAGAIN. > > This is ok, because "pipe_buf_confirm()" is always under the pipe > lock, and we'll just clear "pipe->not_ready" under the pipe lock after > finalizing all those pages (and before waking up readers) > > (c) make "pipe_wait_readable()" and "poll()" know about this all, so > that we wait properly for a pipe that was not ready to become ready > > This all makes *most* users deal properly with these blocking events. > In particular, things like splice_to_socket() can now do the whole > proper "wait without holding the pipe lock" sequence, even when the > pipe is not empty, just in this blocked state. > > This *may* also make all the cases Jens had with io_uring and splicing > JustWork(tm). Exactly! I was reading this thread with excitement just now, would be nice to get rid of that kludge. > NOTE! NOTE! NOTE! Once more, this "feels right to me", and I'd argue > that the basic approach is fairly straightfoward. The patch is also > not horrendous. It all makes a fair amount of sense. BUT! I haven't > tested this, and like the previous patch, I really would want people > to think about this a lot. > > Comments? Jens? I'll take a closer look at this, but won't be until Monday most likely. But the approach seems sane, and going in a more idiomatic direction than before. So seems promising. -- Jens Axboe ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Pending splice(file -> FIFO) excludes all other FIFO operations forever (was: ... always blocks read(FIFO), regardless of O_NONBLOCK on read side?) 2023-07-07 19:10 ` Linus Torvalds 2023-07-07 19:57 ` Jens Axboe @ 2023-07-07 22:41 ` Ahelenia Ziemiańska 2023-07-07 22:57 ` Linus Torvalds 1 sibling, 1 reply; 19+ messages in thread From: Ahelenia Ziemiańska @ 2023-07-07 22:41 UTC (permalink / raw) To: Linus Torvalds Cc: Christian Brauner, Jens Axboe, David Howells, Alexander Viro, linux-fsdevel, linux-kernel [-- Attachment #1.1: Type: text/plain, Size: 3484 bytes --] On Fri, Jul 07, 2023 at 12:10:36PM -0700, Linus Torvalds wrote: > On Fri, 7 Jul 2023 at 10:21, Christian Brauner <brauner@kernel.org> wrote: > > Forgot to say, fwiw, I've been running this through the LTP splice, > > pipe, and ipc tests without issues. A hanging reader can be signaled > > away cleanly with this. > NOTE! NOTE! NOTE! Once more, this "feels right to me", and I'd argue > that the basic approach is fairly straightfoward. The patch is also > not horrendous. It all makes a fair amount of sense. BUT! I haven't > tested this, and like the previous patch, I really would want people > to think about this a lot. > > Comments? Jens? I applied the patch upthread + this diff to 4f6b6c2b2f86b7878a770736bf478d8a263ff0bc; during test setup I got a null deref (building defconfig minus graphics). Reproducible, full BUG dump attached; trace of [ 149.878931] <TASK> [ 149.879533] ? __die+0x1e/0x60 [ 149.880309] ? page_fault_oops+0x17c/0x470 [ 149.881313] ? search_module_extables+0x14/0x50 [ 149.882422] ? exc_page_fault+0x67/0x150 [ 149.883397] ? asm_exc_page_fault+0x26/0x30 [ 149.884426] ? __pfx_pipe_to_null+0x10/0x10 [ 149.885451] ? splice_from_pipe_next+0x129/0x150 [ 149.886580] __splice_from_pipe+0x39/0x1c0 [ 149.887594] ? __pfx_pipe_to_null+0x10/0x10 [ 149.888615] ? __pfx_pipe_to_null+0x10/0x10 [ 149.889635] splice_from_pipe+0x5c/0x90 [ 149.890579] do_splice+0x35c/0x840 [ 149.891407] __do_splice+0x1eb/0x210 [ 149.892176] __x64_sys_splice+0xad/0x120 [ 149.893019] do_syscall_64+0x3e/0x90 [ 149.893798] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 $ scripts/faddr2line vmlinux splice_from_pipe_next+0x129 splice_from_pipe_next+0x129/0x150: pipe_buf_release at include/linux/pipe_fs_i.h:221 (inlined by) eat_empty_buffer at fs/splice.c:594 (inlined by) splice_from_pipe_next at fs/splice.c:640 I gamed this down to echo c | grep c >/dev/null where grep is ii grep 3.8-5 amd64 GNU grep, egrep and fgrep and strace of the same invocation (on the host) ends with newfstatat(1, "", {st_mode=S_IFCHR|0666, st_rdev=makedev(0x1, 0x3), ...}, AT_EMPTY_PATH) = 0 newfstatat(AT_FDCWD, "/dev/null", {st_mode=S_IFCHR|0666, st_rdev=makedev(0x1, 0x3), ...}, 0) = 0 newfstatat(0, "", {st_mode=S_IFIFO|0600, st_size=0, ...}, AT_EMPTY_PATH) = 0 lseek(0, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek) read(0, "c\n", 98304) = 2 splice(0, NULL, 1, NULL, 98304, SPLICE_F_MOVE) = 0 close(1) = 0 close(2) = 0 exit_group(0) = ? +++ exited with 0 +++ And can also reproduce it with echo | { read -r _; exec ./wr; } > /dev/null (where ./wr is "while (splice(0, 0, 1, 0, 128 * 1024 * 1024, 0) > 0) {}"). However: echo | ./wr > /dev/null does NOT crash. Besides that, this doesn't solve the original issue, inasmuch as ./v > fifo & head fifo & echo zupa > fifo (where ./v splices from an empty pty to stdout; v.c attached) echo still sleeps until ./v dies, though it also succumbs to ^C now. "OTOH, on 4f6b6c2b2f86b7878a770736bf478d8a263ff0bc, "timeout 10 ./v > fifo &" (then lines 2 and 3 as above) does kill ./v -> unblock echo -> head copies "zupa", i.e. life resumes as normal after the splicer went away. With the patches, echo zupa is stuck forever (until you signal it)! This is kinda worse. [-- Attachment #1.2: BUG --] [-- Type: text/plain, Size: 3959 bytes --] [ 149.843966] BUG: kernel NULL pointer dereference, address: 0000000000000008 [ 149.845820] #PF: supervisor read access in kernel mode [ 149.847190] #PF: error_code(0x0000) - not-present page [ 149.848540] PGD 0 P4D 0 [ 149.849231] Oops: 0000 [#1] PREEMPT SMP PTI [ 149.850345] CPU: 0 PID: 230 Comm: grep Not tainted 6.4.0-12317-gabf530ed3e36-dirty #3 [ 149.852411] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 149.854900] RIP: 0010:splice_from_pipe_next+0x129/0x150 [ 149.856328] Code: ff c6 45 38 00 eb af 5b b8 00 fe ff ff 5d 41 5c 41 5d c3 cc cc cc cc 48 8b 46 10 41 83 c5 01 48 89 df 48 c7 46 10 00 00 00 00 <48> 8b 40 08 e8 ce a5 9a [ 149.861118] RSP: 0018:ffffb2ed40347d70 EFLAGS: 00010202 [ 149.862488] RAX: 0000000000000000 RBX: ffff8c06c1d9a0c0 RCX: 0000000000000000 [ 149.864357] RDX: 0000000000000005 RSI: ffff8c06c8c98028 RDI: ffff8c06c1d9a0c0 [ 149.866217] RBP: ffffb2ed40347de0 R08: 0000000000000001 R09: ffffffffaa428db0 [ 149.868088] R10: 0000000000018000 R11: 0000000000000000 R12: ffff8c06c2625580 [ 149.869950] R13: 0000000000000002 R14: ffff8c06c1d9a0c0 R15: ffffb2ed40347de0 [ 149.871828] FS: 00007fa5a6b3e740(0000) GS:ffff8c06dd800000(0000) knlGS:0000000000000000 [ 149.873937] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 149.875459] CR2: 0000000000000008 CR3: 000000000269a000 CR4: 00000000000006f0 [ 149.877327] Call Trace: [ 149.878931] <TASK> [ 149.879533] ? __die+0x1e/0x60 [ 149.880309] ? page_fault_oops+0x17c/0x470 [ 149.881313] ? search_module_extables+0x14/0x50 [ 149.882422] ? exc_page_fault+0x67/0x150 [ 149.883397] ? asm_exc_page_fault+0x26/0x30 [ 149.884426] ? __pfx_pipe_to_null+0x10/0x10 [ 149.885451] ? splice_from_pipe_next+0x129/0x150 [ 149.886580] __splice_from_pipe+0x39/0x1c0 [ 149.887594] ? __pfx_pipe_to_null+0x10/0x10 [ 149.888615] ? __pfx_pipe_to_null+0x10/0x10 [ 149.889635] splice_from_pipe+0x5c/0x90 [ 149.890579] do_splice+0x35c/0x840 [ 149.891407] __do_splice+0x1eb/0x210 [ 149.892176] __x64_sys_splice+0xad/0x120 [ 149.893019] do_syscall_64+0x3e/0x90 [ 149.893798] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 149.894881] RIP: 0033:0x7fa5a6c49dd3 [ 149.895682] Code: 64 89 02 48 c7 c0 ff ff ff ff eb b9 66 2e 0f 1f 84 00 00 00 00 00 90 80 3d 11 18 0d 00 00 49 89 ca 74 14 b8 13 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 74 [ 149.899538] RSP: 002b:00007ffc83d77768 EFLAGS: 00000202 ORIG_RAX: 0000000000000113 [ 149.901116] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fa5a6c49dd3 [ 149.902602] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000 [ 149.904048] RBP: 0000564d8aaeb000 R08: 0000000000018000 R09: 0000000000000001 [ 149.905439] R10: 0000000000000000 R11: 0000000000000202 R12: 000000000000000a [ 149.906832] R13: 0000564d8aaeb010 R14: 0000564d8aaeb000 R15: 0000000000000000 [ 149.908239] </TASK> [ 149.908692] Modules linked in: [ 149.909326] CR2: 0000000000000008 [ 149.910050] ---[ end trace 0000000000000000 ]--- [ 149.910986] RIP: 0010:splice_from_pipe_next+0x129/0x150 [ 149.912063] Code: ff c6 45 38 00 eb af 5b b8 00 fe ff ff 5d 41 5c 41 5d c3 cc cc cc cc 48 8b 46 10 41 83 c5 01 48 89 df 48 c7 46 10 00 00 00 00 <48> 8b 40 08 e8 ce a5 9a [ 149.915639] RSP: 0018:ffffb2ed40347d70 EFLAGS: 00010202 [ 149.916589] RAX: 0000000000000000 RBX: ffff8c06c1d9a0c0 RCX: 0000000000000000 [ 149.917877] RDX: 0000000000000005 RSI: ffff8c06c8c98028 RDI: ffff8c06c1d9a0c0 [ 149.919172] RBP: ffffb2ed40347de0 R08: 0000000000000001 R09: ffffffffaa428db0 [ 149.920457] R10: 0000000000018000 R11: 0000000000000000 R12: ffff8c06c2625580 [ 149.921737] R13: 0000000000000002 R14: ffff8c06c1d9a0c0 R15: ffffb2ed40347de0 [ 149.923021] FS: 00007fa5a6b3e740(0000) GS:ffff8c06dd800000(0000) knlGS:0000000000000000 [ 149.924481] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 149.925529] CR2: 0000000000000008 CR3: 000000000269a000 CR4: 00000000000006f0 [-- Attachment #1.3: v.c --] [-- Type: text/x-csrc, Size: 315 bytes --] #define _GNU_SOURCE #include <fcntl.h> #include <stdlib.h> #include <sys/sendfile.h> int main() { int pt = posix_openpt(O_RDWR); grantpt(pt); unlockpt(pt); int cl = open(ptsname(pt), O_RDONLY); for(;;) splice(cl, 0, 1, 0, 128 * 1024 * 1024, 0); // sendfile(1, 0, 0, 128 * 1024 * 1024); } [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Pending splice(file -> FIFO) excludes all other FIFO operations forever (was: ... always blocks read(FIFO), regardless of O_NONBLOCK on read side?) 2023-07-07 22:41 ` Ahelenia Ziemiańska @ 2023-07-07 22:57 ` Linus Torvalds 2023-07-08 0:30 ` Ahelenia Ziemiańska 0 siblings, 1 reply; 19+ messages in thread From: Linus Torvalds @ 2023-07-07 22:57 UTC (permalink / raw) To: Ahelenia Ziemiańska Cc: Christian Brauner, Jens Axboe, David Howells, Alexander Viro, linux-fsdevel, linux-kernel [-- Attachment #1: Type: text/plain, Size: 1906 bytes --] On Fri, 7 Jul 2023 at 15:41, Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> wrote: > > (inlined by) eat_empty_buffer at fs/splice.c:594 Ahh, eat_empty_buffer() ends up releasing the buffer without waiting for it. And the reason for that is actually somewhat interesting: we do have that while (!pipe_readable(pipe)) { .. above it, but the logic for this all is that pipes with pipe buffers are by *default* considered readable until they try to actually confirm the buffer, and at that point they might say "oh, I have to return -EAGAIN and set 'not_ready'". And that splice_from_pipe_next() doesn't do that. End result: it will happily free that pipe buffer that is still in the process of being filled. The good news is that I think the fix is probably trivial. Something like the attached? Again - NOT TESTED. > Besides that, this doesn't solve the original issue, inasmuch as > ./v > fifo & > head fifo & > echo zupa > fifo > (where ./v splices from an empty pty to stdout; v.c attached) > echo still sleeps until ./v dies, though it also succumbs to ^C now. Yeah, I concentrated on just making everything interruptible, But the fact that the echo has to wait for the previous write to finish is kind of fundamental. We can't just magically do writes out of order. 'v' is busy writing to the fifo, we can't let some other write just come in. (We *could* make the splice in ./v not fill the whole pipe buffer, and allow some other writes to fill in buffers afterwards, but at _some_ point you'll hit the "pipe buffers are full and busy, can't add any more without waiting for them to empty"). One thing we could possibly do is to say that we just don't accept any new writes if there are old busy splices in process. So we could make new writes return -EBUSY or something, I guess. Linus [-- Attachment #2: patch.diff --] [-- Type: text/x-patch, Size: 633 bytes --] fs/splice.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/fs/splice.c b/fs/splice.c index 49139413457d..df6d34dbf116 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -590,6 +590,13 @@ static inline bool eat_empty_buffer(struct pipe_inode_info *pipe) unsigned int mask = pipe->ring_size - 1; struct pipe_buffer *buf = &pipe->bufs[tail & mask]; + /* + * Do a non-blocking buffer confirm. We may need + * to go back to waiting for the pipe to be readable. + */ + if (pipe_buf_confirm(pipe, buf, true) == -EAGAIN) + return true; + if (unlikely(!buf->len)) { pipe_buf_release(pipe, buf); pipe->tail = tail+1; ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: Pending splice(file -> FIFO) excludes all other FIFO operations forever (was: ... always blocks read(FIFO), regardless of O_NONBLOCK on read side?) 2023-07-07 22:57 ` Linus Torvalds @ 2023-07-08 0:30 ` Ahelenia Ziemiańska 2023-07-08 20:06 ` Linus Torvalds 0 siblings, 1 reply; 19+ messages in thread From: Ahelenia Ziemiańska @ 2023-07-08 0:30 UTC (permalink / raw) To: Linus Torvalds Cc: Christian Brauner, Jens Axboe, David Howells, Alexander Viro, linux-fsdevel, linux-kernel [-- Attachment #1.1: Type: text/plain, Size: 6490 bytes --] On Fri, Jul 07, 2023 at 03:57:40PM -0700, Linus Torvalds wrote: > On Fri, 7 Jul 2023 at 15:41, Ahelenia Ziemiańska > <nabijaczleweli@nabijaczleweli.xyz> wrote: > > (inlined by) eat_empty_buffer at fs/splice.c:594 > Ahh, eat_empty_buffer() ends up releasing the buffer without waiting for it. > > And the reason for that is actually somewhat interesting: we do have that > > while (!pipe_readable(pipe)) { > .. > > above it, but the logic for this all is that pipes with pipe buffers > are by *default* considered readable until they try to actually > confirm the buffer, and at that point they might say "oh, I have to > return -EAGAIN and set 'not_ready'". > > And that splice_from_pipe_next() doesn't do that. > > End result: it will happily free that pipe buffer that is still in the > process of being filled. > > The good news is that I think the fix is probably trivial. Something > like the attached? > > Again - NOT TESTED Same reproducer, backtrace attached: $ scripts/faddr2line vmlinux splice_from_pipe_next+0x6e splice_from_pipe_next+0x6e/0x180: pipe_buf_confirm at include/linux/pipe_fs_i.h:233 (inlined by) eat_empty_buffer at fs/splice.c:597 (inlined by) splice_from_pipe_next at fs/splice.c:647 Looks like the same place. > > Besides that, this doesn't solve the original issue, inasmuch as > > ./v > fifo & > > head fifo & > > echo zupa > fifo > > (where ./v splices from an empty pty to stdout; v.c attached) > > echo still sleeps until ./v dies, though it also succumbs to ^C now. > Yeah, I concentrated on just making everything interruptible, > > But the fact that the echo has to wait for the previous write to > finish is kind of fundamental. We can't just magically do writes out > of order. 'v' is busy writing to the fifo, we can't let some other > write just come in. (It's no longer busy writing to it when it gets killed by timeout in my second example, but echo still doesn't wake up.) > (We *could* make the splice in ./v not fill the whole pipe buffer, and > allow some other writes to fill in buffers afterwards, but at _some_ > point you'll hit the "pipe buffers are full and busy, can't add any > more without waiting for them to empty"). You are, but, well, that's also the case when the pipe is full. As it stands, the pipe is /empty/ and yet /no-one can write to it/. This is the crux of the issue at hand. (Coincidentally, you're describing what looks quite similar to pt 1. from <naljsvzzemr6pjiwwcdpdsdh5vtycdr6fmi2fk2dlr4nn4kq6o@ycanbgxhslti>.) I think we got away with it for so long because most files behave like regular files/blockdevs and the read is always guaranteed to complete ~instantly, but splice is, fundamentally, /not/ writing the whole time because it's not /reading/ the whole time when it's reading an empty socket/a chardev/whatever. Or, rather: splice() from a non-seekable (non-mmap()pable?) fundamentally doesn't really make much sense, because you're not gluing a bit of the pro-verbial page cache (forgive me my term abuse here, you see what I'm getting at tho) to the end of the pipe, you're just telling the kernel to run a read()/write() loop for you. sendfile() works around this by reading and then separately writing to its special buffer (in the form of an anonymous process-local pipe). splice() just raw-dogs the read with the write lock held, but just doesn't check if it can do it. That's how it's, honestly, shaking out to me: someone just forgot a check the first time they wrote it. Because the precondition for "does reading directly into the pipe buffer make sense" is "is it directly equivalent to read(f)/write(p)", and that holds only for seekables. /Maybe/ for, like, sockets if there's already data, or as a special case similar to pipe->pipe. But then for sockets you're already using sendfile(), so? To that end, I'm including a patch based on 4f6b6c2b2f86b7878a770736bf478d8a263ff0bc that does just that: EINVAL. Maybe if you're worried about breaking compatibility (which idk if it's an issue since splice and sendfile are temperamental w.r.t. what they take anyway across versions; you need a read()/write() fallback anyway) that case could become sendfile-like by first reading into a buffer once then pipe->pipe splicing out of it separately. Though, even the usual sendfile read/write loop only works on seekables. > One thing we could possibly do is to say that we just don't accept any > new writes if there are old busy splices in process. So we could make > new writes return -EBUSY or something, I guess. Same logic as above applies + no-one's really expecting EBUSY + what /would/ you do about EBUSY in this case? -- >8 -- From 552d93bee8e1b8333ce84ed0ca8490f6712c5b8b Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ahelenia=20Ziemia=C5=84ska?= <nabijaczleweli@nabijaczleweli.xyz> Date: Sat, 8 Jul 2023 01:47:59 +0200 Subject: [PATCH] splice: file->pipe: -EINVAL if file non-seekable MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Both the logical semantic of "tie a page from the page cache to the pipe" and the implemented semantic of "lock the pipe, then read into it" (thus holding the write lock for as as long as the read is outstanding) only make sense if the read is guaranteed to complete instantly. This has been a long-standing omission and, when the read doesn't immediately complete (because it's a tty, socket, &c.), the write lock ‒ which for pipes is a pipe-global mutex ‒ is held until, thus excluding all mutual users of the pipe, until it does. Refuse it. Use read()/write() in userspace instead of getting the kernel to do it for you, badly, when there's no point to doing so. Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wjEC_Rh8+-rtEi8C45upO-Ffw=M_i1211qS_3AvWZCbOg@mail.gmail.com/t/#u Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> --- fs/splice.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/fs/splice.c b/fs/splice.c index 004eb1c4ce31..14cf3cea1091 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -1309,6 +1309,8 @@ long do_splice(struct file *in, loff_t *off_in, struct file *out, if (opipe) { if (off_out) return -ESPIPE; + if (!(in->f_mode & FMODE_LSEEK)) + return -EINVAL; if (off_in) { if (!(in->f_mode & FMODE_PREAD)) return -EINVAL; -- 2.39.2 [-- Attachment #1.2: BUG --] [-- Type: text/plain, Size: 3992 bytes --] sh-5.2# echo a | grep a > /dev/null [ 137.724183] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 137.726038] #PF: supervisor read access in kernel mode [ 137.727408] #PF: error_code(0x0000) - not-present page [ 137.728757] PGD 0 P4D 0 [ 137.729454] Oops: 0000 [#1] PREEMPT SMP PTI [ 137.730574] CPU: 1 PID: 227 Comm: grep Not tainted 6.4.0-12317-g1b28a3c9606a-dirty #1 [ 137.732638] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 137.735124] RIP: 0010:splice_from_pipe_next+0x6e/0x180 [ 137.736514] Code: 80 fe 01 75 04 85 c9 75 62 8b 43 5c 44 8b 73 54 83 e8 01 44 21 f0 48 8d 14 80 48 8b 83 98 00 00 00 4c 8d 24 d0 49 8b 44 24 10 <48> 8b 00 48 85 c0 74 16 [ 137.741360] RSP: 0018:ffff93c28037fd68 EFLAGS: 00010202 [ 137.742755] RAX: 0000000000000000 RBX: ffff8ed7012bc600 RCX: 0000000000000000 [ 137.744647] RDX: 0000000000000005 RSI: 0000000000000000 RDI: ffff8ed7012bc600 [ 137.746525] RBP: ffff93c28037fde0 R08: 0000000000000001 R09: ffffffffa6228df0 [ 137.748414] R10: 0000000000018000 R11: 0000000000000000 R12: ffff8ed701e04828 [ 137.750288] R13: ffff8ed701d68000 R14: 0000000000000001 R15: ffff93c28037fde0 [ 137.752178] FS: 00007fe8d094d740(0000) GS:ffff8ed71d900000(0000) knlGS:0000000000000000 [ 137.754293] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 137.755845] CR2: 0000000000000000 CR3: 0000000002aba000 CR4: 00000000000006e0 [ 137.757734] Call Trace: [ 137.758436] <TASK> [ 137.759035] ? __die+0x1e/0x60 [ 137.759893] ? page_fault_oops+0x17c/0x470 [ 137.760989] ? search_module_extables+0x14/0x50 [ 137.762201] ? exc_page_fault+0x67/0x150 [ 137.763251] ? asm_exc_page_fault+0x26/0x30 [ 137.764378] ? __pfx_pipe_to_null+0x10/0x10 [ 137.765494] ? splice_from_pipe_next+0x6e/0x180 [ 137.766703] __splice_from_pipe+0x39/0x1c0 [ 137.767807] ? __pfx_pipe_to_null+0x10/0x10 [ 137.768922] ? __pfx_pipe_to_null+0x10/0x10 [ 137.770039] splice_from_pipe+0x5c/0x90 [ 137.771069] do_splice+0x35c/0x840 [ 137.772002] __do_splice+0x1eb/0x210 [ 137.772975] __x64_sys_splice+0xad/0x120 [ 137.774027] do_syscall_64+0x3e/0x90 [ 137.774999] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 137.776363] RIP: 0033:0x7fe8d0a58dd3 [ 137.777335] Code: 64 89 02 48 c7 c0 ff ff ff ff eb b9 66 2e 0f 1f 84 00 00 00 00 00 90 80 3d 11 18 0d 00 00 49 89 ca 74 14 b8 13 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 74 [ 137.781970] RSP: 002b:00007ffe3e611af8 EFLAGS: 00000202 ORIG_RAX: 0000000000000113 [ 137.783801] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fe8d0a58dd3 [ 137.785516] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000 [ 137.787229] RBP: 0000558b41c02000 R08: 0000000000018000 R09: 0000000000000001 [ 137.788954] R10: 0000000000000000 R11: 0000000000000202 R12: 000000000000000a [ 137.790666] R13: 0000558b41c02002 R14: 0000558b41c02000 R15: 0000000000000000 [ 137.792280] </TASK> [ 137.792767] Modules linked in: [ 137.793445] CR2: 0000000000000000 [ 137.794224] ---[ end trace 0000000000000000 ]--- [ 137.795235] RIP: 0010:splice_from_pipe_next+0x6e/0x180 [ 137.796378] Code: 80 fe 01 75 04 85 c9 75 62 8b 43 5c 44 8b 73 54 83 e8 01 44 21 f0 48 8d 14 80 48 8b 83 98 00 00 00 4c 8d 24 d0 49 8b 44 24 10 <48> 8b 00 48 85 c0 74 16 [ 137.800291] RSP: 0018:ffff93c28037fd68 EFLAGS: 00010202 [ 137.801430] RAX: 0000000000000000 RBX: ffff8ed7012bc600 RCX: 0000000000000000 [ 137.802958] RDX: 0000000000000005 RSI: 0000000000000000 RDI: ffff8ed7012bc600 [ 137.804462] RBP: ffff93c28037fde0 R08: 0000000000000001 R09: ffffffffa6228df0 [ 137.805915] R10: 0000000000018000 R11: 0000000000000000 R12: ffff8ed701e04828 [ 137.807389] R13: ffff8ed701d68000 R14: 0000000000000001 R15: ffff93c28037fde0 [ 137.808837] FS: 00007fe8d094d740(0000) GS:ffff8ed71d900000(0000) knlGS:0000000000000000 [ 137.810467] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 137.811662] CR2: 0000000000000000 CR3: 0000000002aba000 CR4: 00000000000006e0 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: Pending splice(file -> FIFO) excludes all other FIFO operations forever (was: ... always blocks read(FIFO), regardless of O_NONBLOCK on read side?) 2023-07-08 0:30 ` Ahelenia Ziemiańska @ 2023-07-08 20:06 ` Linus Torvalds 2023-07-09 1:03 ` Ahelenia Ziemiańska 0 siblings, 1 reply; 19+ messages in thread From: Linus Torvalds @ 2023-07-08 20:06 UTC (permalink / raw) To: Ahelenia Ziemiańska Cc: Christian Brauner, Jens Axboe, David Howells, Alexander Viro, linux-fsdevel, linux-kernel On Fri, 7 Jul 2023 at 17:30, Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> wrote: > > Same reproducer, backtrace attached: > $ scripts/faddr2line vmlinux splice_from_pipe_next+0x6e > splice_from_pipe_next+0x6e/0x180: > pipe_buf_confirm at include/linux/pipe_fs_i.h:233 Bah. I should have looked more closely at your case. This is a buffer without an 'ops' pointer. So it looks like was already released. And the reason is that the pipe was readable because there were no writers, and I had put the if (!pipe->writers) return 0; check in splice_from_pipe_next() in the wrong place. It needs to be *before* the eat_empty_buffer() call. Duh. Anyway, while I think that fixes your NULL pointer thing, having looked at my original patch I realized that keeping the pointer to the pipe buffer in copy_splice_read() across the dropping of the pipe lock is completely broken. I thought (because I didn't remember the code) that pipe resizing will just copy the pipe buffer pointers around. That would have made it ok to remember a pipe buffer pointer. But it is not so. It will actually create new pipe buffers and copy the buffer contents around. So while fixing your NULL pointer check should be trivial, I think that first patch is actually fundamentally broken wrt pipe resizing, and I see no really sane way to fix it. We could add a new lock just for that, but I don't think it's worth it. > You are, but, well, that's also the case when the pipe is full. > As it stands, the pipe is /empty/ and yet /no-one can write to it/. > This is the crux of the issue at hand. No, I think you are mis-representing things. The pipe isn't empty. It's full of things that just aren't finalized yet. > Or, rather: splice() from a non-seekable (non-mmap()pable?) Please stop with the non-seekable nonsense. Any time I see a patch like this: > + if (!(in->f_mode & FMODE_LSEEK)) > + return -EINVAL; I will just go "that person is not competent". This has absolutely nothing to do with seekability. But it is possible that we need to just bite the bullet and say "copy_splice_read() needs to use a non-blocking kiocb for the IO". Of course, that then doesn't work, because while doing this is trivial: --- a/fs/splice.c +++ b/fs/splice.c @@ -364,6 +364,7 @@ ssize_t copy_splice_read(struct file *in, loff_t *ppos, iov_iter_bvec(&to, ITER_DEST, bv, npages, len); init_sync_kiocb(&kiocb, in); kiocb.ki_pos = *ppos; + kiocb.ki_flags |= IOCB_NOWAIT; ret = call_read_iter(in, &kiocb, &to); if (ret > 0) { I suspectr you'll find that it makes no difference, because the tty layer doesn't actually honor the IOCB_NOWAIT flag for various historical reasons. In fact, the kiocb isn't even passed down to the low-level routine, which only gets the 'struct file *', and instead it looks at tty_io_nonblock(), which just does that legacy file->f_flags & O_NONBLOCK test. I guess combined with something like if (!(in->f_mode & FMODE_NOWAIT)) return -EINVAL; it might all work. Linus ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Pending splice(file -> FIFO) excludes all other FIFO operations forever (was: ... always blocks read(FIFO), regardless of O_NONBLOCK on read side?) 2023-07-08 20:06 ` Linus Torvalds @ 2023-07-09 1:03 ` Ahelenia Ziemiańska 2023-07-09 22:33 ` Ahelenia Ziemiańska 0 siblings, 1 reply; 19+ messages in thread From: Ahelenia Ziemiańska @ 2023-07-09 1:03 UTC (permalink / raw) To: Linus Torvalds Cc: Christian Brauner, Jens Axboe, David Howells, Alexander Viro, linux-fsdevel, linux-kernel [-- Attachment #1.1: Type: text/plain, Size: 8441 bytes --] On Sat, Jul 08, 2023 at 01:06:56PM -0700, Linus Torvalds wrote: > On Fri, 7 Jul 2023 at 17:30, Ahelenia Ziemiańska > <nabijaczleweli@nabijaczleweli.xyz> wrote: > > > > Same reproducer, backtrace attached: > > $ scripts/faddr2line vmlinux splice_from_pipe_next+0x6e > > splice_from_pipe_next+0x6e/0x180: > > pipe_buf_confirm at include/linux/pipe_fs_i.h:233 > Bah. I should have looked more closely at your case. > > This is a buffer without an 'ops' pointer. So it looks like was > already released. > > And the reason is that the pipe was readable because there were no > writers, and I had put the > > if (!pipe->writers) > return 0; > > check in splice_from_pipe_next() in the wrong place. It needs to be > *before* the eat_empty_buffer() call. > > Anyway, while I think that fixes your NULL pointer thing, It does. > So while fixing your NULL pointer check should be trivial, I think > that first patch is actually fundamentally broken wrt pipe resizing, > and I see no really sane way to fix it. We could add a new lock just > for that, but I don't think it's worth it. > > > You are, but, well, that's also the case when the pipe is full. > > As it stands, the pipe is /empty/ and yet /no-one can write to it/. > > This is the crux of the issue at hand. > No, I think you are mis-representing things. The pipe isn't empty. > It's full of things that just aren't finalized yet. Being full of no data (as part of some hidden state) doesn't make it any less empty, but meh; neither here not there. > > Or, rather: splice() from a non-seekable (non-mmap()pable?) > Please stop with the non-seekable nonsense. > > Any time I see a patch like this: > > > + if (!(in->f_mode & FMODE_LSEEK)) > > + return -EINVAL; > > I will just go "that person is not competent". Accurate assessment. > This has absolutely nothing to do with seekability. Yes, and as noted, I was using it as a stand-in for "I/O won't block" due to the above (and splice_direct_to_actor() already uses it). Glad to see you've managed to synthesise my drivel into something workable. > But it is possible that we need to just bite the bullet and say > "copy_splice_read() needs to use a non-blocking kiocb for the IO". > > Of course, that then doesn't work, because while doing this is trivial: > > --- a/fs/splice.c > +++ b/fs/splice.c > @@ -364,6 +364,7 @@ ssize_t copy_splice_read(struct file *in, loff_t *ppos, > iov_iter_bvec(&to, ITER_DEST, bv, npages, len); > init_sync_kiocb(&kiocb, in); > kiocb.ki_pos = *ppos; > + kiocb.ki_flags |= IOCB_NOWAIT; > ret = call_read_iter(in, &kiocb, &to); > > if (ret > 0) { > > I suspectr you'll find that it makes no difference, because the tty > layer doesn't actually honor the IOCB_NOWAIT flag for various > historical reasons. Indeed, neither when splicing from a tty, nor from a socket (same setup but socketpair(AF_UNIX, SOCK_STREAM, 0); w.c). > In fact, the kiocb isn't even passed down to the > low-level routine, which only gets the 'struct file *', and instead it > looks at tty_io_nonblock(), which just does that legacy > > file->f_flags & O_NONBLOCK > > test. > > I guess combined with something like > > if (!(in->f_mode & FMODE_NOWAIT)) > return -EINVAL; > > it might all work. Yes, that makes the splice instantly -EINVAL for ttys, but doesn't affect the socketpair() case above, which still blocks forever. This appears to be because vfs_splice_read() does if ((in->f_flags & O_DIRECT) || IS_DAX(in->f_mapping->host)) return copy_splice_read(in, ppos, pipe, len, flags); return in->f_op->splice_read(in, ppos, pipe, len, flags); so the c_s_r() isn't even called for the socket, which uses unix_stream_splice_read(). Thus, diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index 123b35ddfd71..384d5a479b4a 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -2880,15 +2880,12 @@ static ssize_t unix_stream_splice_read(struct socket *sock, loff_t *ppos, .pipe = pipe, .size = size, .splice_flags = flags, + .flags = MSG_DONTWAIT, }; if (unlikely(*ppos)) return -ESPIPE; - if (sock->file->f_flags & O_NONBLOCK || - flags & SPLICE_F_NONBLOCK) - state.flags = MSG_DONTWAIT; - return unix_stream_read_generic(&state, false); } makes the splice -EAGAIN w/o data and splice whatever's in the socket when there is data. git grep '\.splice_read.*=' | cut -d= -f2 | tr -s ',;[:space:]' '\n' | sort -u gives me 27 distinct splice_read implementations and 1 cocci config. These are simple delegations: nfs_file_splice_read filemap_splice_read afs_file_splice_read filemap_splice_read ceph_splice_read filemap_splice_read ecryptfs_splice_read_update_atime filemap_splice_read ext4_file_splice_read filemap_splice_read f2fs_file_splice_read filemap_splice_read ntfs_file_splice_read filemap_splice_read ocfs2_file_splice_read filemap_splice_read orangefs_file_splice_read filemap_splice_read v9fs_file_splice_read filemap_splice_read xfs_file_splice_read filemap_splice_read zonefs_file_splice_read filemap_splice_read sock_splice_read copy_splice_read or a socket-specific one coda_file_splice_read vfs_splice_read ovl_splice_read vfs_splice_read vfs_splice_read calls copy_splice_read (not used as a .splice_read). And the rest are: 01. copy_splice_read you've fixed 02. filemap_splice_read is, as I understand it, only applicable to files/blockdevs and already has the required semantics? 03. unix_stream_splice_read can be fixed as above 04. fuse_dev_splice_read allocates a buffer and fuse_dev_do_read()s into it, fuse_dev_do_read does if (file->f_flags & O_NONBLOCK) return -EAGAIN; so this is likely a similarly easy fix 05. tracing_buffers_splice_read, when it doesn't read anything does ret = -EAGAIN; if ((file->f_flags & O_NONBLOCK) || (flags & SPLICE_F_NONBLOCK)) goto out; and waits for at least one thing to read; can probably just goto out instantly 06. tracing_splice_read_pipe delegates to whatever it's tracing, and errors if that errored, so it's fine(?) 07. shmem_file_splice_read is always nonblocking 08. relay_file_splice_read only checks SPLICE_F_NONBLOCK to turn 0 into -EAGAIN so I think it also just doesn't block 09. smc_splice_read falls back to an embedded socket's splice_read, or uses if (flags & SPLICE_F_NONBLOCK) flags = MSG_DONTWAIT; else flags = 0; SMC_STAT_INC(smc, splice_cnt); rc = smc_rx_recvmsg(smc, NULL, pipe, len, flags); so also probably remove the condition 10. kcm_splice_read: 10a. passes flags (SPLICE_F_...) to skb_recv_datagram(), which does timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT); verbatim!? and forwards them to try_recv which also checks for socket-style bits 10b. give it MSG_DONTWAIT, call it a day? 10c. it also passes flags to skb_splice_bits() to actually splice to the pipe, but that discards flags, so no change needed 11. tls_sw_splice_read I don't really understand but it does err = tls_rx_reader_lock(sk, ctx, flags & SPLICE_F_NONBLOCK); and err = tls_rx_rec_wait(sk, NULL, flags & SPLICE_F_NONBLOCK, true); (and skb_splice_bits()) so make them both true? 12. tcp_splice_read uses skb_splice_bits() and timeout governed by timeo = sock_rcvtimeo(sk, sock->file->f_flags & O_NONBLOCK); and re-tries on empty read if timeo!=0 (i.e. !(sock->file->f_flags & O_NONBLOCK)); so turning that into true (timeo = 0) and propagating that would make it behave accd'g to the new semantic Would that make sense? [-- Attachment #1.2: w.c --] [-- Type: text/x-csrc, Size: 215 bytes --] #define _GNU_SOURCE #include <fcntl.h> #include <stdlib.h> #include <sys/socket.h> int main() { int sp[2]; socketpair(AF_UNIX, SOCK_STREAM, 0, sp); for(;;) splice(sp[0], 0, 1, 0, 128 * 1024 * 1024, 0); } [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Pending splice(file -> FIFO) excludes all other FIFO operations forever (was: ... always blocks read(FIFO), regardless of O_NONBLOCK on read side?) 2023-07-09 1:03 ` Ahelenia Ziemiańska @ 2023-07-09 22:33 ` Ahelenia Ziemiańska 2023-07-10 13:22 ` Ahelenia Ziemiańska 0 siblings, 1 reply; 19+ messages in thread From: Ahelenia Ziemiańska @ 2023-07-09 22:33 UTC (permalink / raw) To: Linus Torvalds Cc: Christian Brauner, Jens Axboe, David Howells, Alexander Viro, linux-fsdevel, linux-kernel [-- Attachment #1.1: Type: text/plain, Size: 2212 bytes --] On Sun, Jul 09, 2023 at 03:03:22AM +0200, Ahelenia Ziemiańska wrote: > On Sat, Jul 08, 2023 at 01:06:56PM -0700, Linus Torvalds wrote: > > I guess combined with something like > > > > if (!(in->f_mode & FMODE_NOWAIT)) > > return -EINVAL; > > > > it might all work. > Yes, that makes the splice instantly -EINVAL for ttys, but doesn't > affect the socketpair() case above, which still blocks forever. This also triggers for regular file -> pipe splices, which is probably a no-go. Attaching a summary diff that does all I said in the previous mail. filemap_get_pages() does use and inspect IOCB_NOWAIT if set in filemap_splice_read(), but it appears to not really make much sense, inasmuch it returns EAGAIN for the first splice() from a blockdev and then instantly return with data on the next go-around. This doesn't really make much sense ‒ and open(2) says blockdevs don't have O_NONBLOCK semantics, so I'm assuming this is not supposed to be exposed to userspace ‒ so I'm not setting it in the diff. I've tested this for: * tty: -EINVAL * socketpair(AF_UNIX, SOCK_STREAM): works as expected $ wc -c fifo & [1] 261 $ ./af_unix ./s > fifo 5 Success 6454 Resource temporarily unavailable 5 Success 6445 Resource temporarily unavailable 0 Success 10 fifo * socket(AF_INET, SOCK_STREAM, 0): works as expected $ wc fifo & [1] 249 $ ./tcp ./s > fifo 23099 Resource temporarily unavailable 7 Success 2099 Resource temporarily unavailable 4 Success 1751 Resource temporarily unavailable 3 Success 21655 Resource temporarily unavailable 95 Success 19589 Resource temporarily unavailable 0 Success 4 15 109 fifo corresponding to host$ nc 127.0.0.1 14640 żupan asd ad asdda dasj aiojd askdl; jasiopdij as[pkdo[p askd9p ias90dk aso[pjd 890pasid90[ jaskl;dj il;asd ^C * blockdev (/dev/vda): as expected (with filemap_splice_read() unchanged), copies it all * regular file: -EINVAL(!) Splicers still sleep if the pipe's full, of course, unless SPLICE_F_NONBLOCK. Test drivers attached as well. [-- Attachment #1.2: nowait.diff --] [-- Type: text/x-diff, Size: 7459 bytes --] diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index 1a8f82f478cb..4e8caf66c01e 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -1202,7 +1202,8 @@ __releases(fiq->lock) * the 'sent' flag. */ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file, - struct fuse_copy_state *cs, size_t nbytes) + struct fuse_copy_state *cs, size_t nbytes, + bool nonblock) { ssize_t err; struct fuse_conn *fc = fud->fc; @@ -1238,7 +1239,7 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file, break; spin_unlock(&fiq->lock); - if (file->f_flags & O_NONBLOCK) + if (nonblock) return -EAGAIN; err = wait_event_interruptible_exclusive(fiq->waitq, !fiq->connected || request_pending(fiq)); @@ -1364,7 +1365,8 @@ static ssize_t fuse_dev_read(struct kiocb *iocb, struct iov_iter *to) fuse_copy_init(&cs, 1, to); - return fuse_dev_do_read(fud, file, &cs, iov_iter_count(to)); + return fuse_dev_do_read(fud, file, &cs, iov_iter_count(to), + file->f_flags & O_NONBLOCK); } static ssize_t fuse_dev_splice_read(struct file *in, loff_t *ppos, @@ -1388,7 +1390,7 @@ static ssize_t fuse_dev_splice_read(struct file *in, loff_t *ppos, fuse_copy_init(&cs, 1, NULL); cs.pipebufs = bufs; cs.pipe = pipe; - ret = fuse_dev_do_read(fud, in, &cs, len); + ret = fuse_dev_do_read(fud, in, &cs, len, true); if (ret < 0) goto out; diff --git a/fs/splice.c b/fs/splice.c index 004eb1c4ce31..e52cc42fff46 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -364,6 +364,7 @@ ssize_t copy_splice_read(struct file *in, loff_t *ppos, iov_iter_bvec(&to, ITER_DEST, bv, npages, len); init_sync_kiocb(&kiocb, in); kiocb.ki_pos = *ppos; + kiocb.ki_flags |= IOCB_NOWAIT; ret = call_read_iter(in, &kiocb, &to); if (ret > 0) { @@ -1309,6 +1310,8 @@ long do_splice(struct file *in, loff_t *off_in, struct file *out, if (opipe) { if (off_out) return -ESPIPE; + if (!(in->f_mode & FMODE_NOWAIT)) + return -EINVAL; if (off_in) { if (!(in->f_mode & FMODE_PREAD)) return -EINVAL; diff --git a/kernel/relay.c b/kernel/relay.c index a80fa01042e9..d3f5682c4a12 100644 --- a/kernel/relay.c +++ b/kernel/relay.c @@ -1215,8 +1215,7 @@ static ssize_t relay_file_splice_read(struct file *in, if (ret < 0) break; else if (!ret) { - if (flags & SPLICE_F_NONBLOCK) - ret = -EAGAIN; + ret = -EAGAIN; break; } diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 4529e264cb86..821bcbcd9e35 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -8391,7 +8391,6 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos, if (splice_grow_spd(pipe, &spd)) return -ENOMEM; - again: trace_access_lock(iter->cpu_file); entries = ring_buffer_entries_cpu(iter->array_buffer->buffer, iter->cpu_file); @@ -8442,35 +8441,12 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos, /* did we read anything? */ if (!spd.nr_pages) { - long wait_index; - - if (ret) - goto out; - - ret = -EAGAIN; - if ((file->f_flags & O_NONBLOCK) || (flags & SPLICE_F_NONBLOCK)) - goto out; - - wait_index = READ_ONCE(iter->wait_index); - - ret = wait_on_pipe(iter, iter->tr->buffer_percent); - if (ret) - goto out; - - /* No need to wait after waking up when tracing is off */ - if (!tracer_tracing_is_on(iter->tr)) - goto out; - - /* Make sure we see the new wait_index */ - smp_rmb(); - if (wait_index != iter->wait_index) - goto out; - - goto again; + if (!ret) + ret = -EAGAIN; + } else { + ret = splice_to_pipe(pipe, &spd); } - ret = splice_to_pipe(pipe, &spd); -out: splice_shrink_spd(&spd); return ret; diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index e03e08745308..92a2be52123e 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -780,7 +780,6 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos, .len = len, .flags = flags, }; - long timeo; ssize_t spliced; int ret; @@ -795,7 +794,6 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos, lock_sock(sk); - timeo = sock_rcvtimeo(sk, sock->file->f_flags & O_NONBLOCK); while (tss.len) { ret = __tcp_splice_read(sk, &tss); if (ret < 0) @@ -819,35 +817,13 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos, ret = -ENOTCONN; break; } - if (!timeo) { - ret = -EAGAIN; - break; - } - /* if __tcp_splice_read() got nothing while we have - * an skb in receive queue, we do not want to loop. - * This might happen with URG data. - */ - if (!skb_queue_empty(&sk->sk_receive_queue)) - break; - sk_wait_data(sk, &timeo, NULL); - if (signal_pending(current)) { - ret = sock_intr_errno(timeo); - break; - } - continue; + ret = -EAGAIN; + break; } tss.len -= ret; spliced += ret; - if (!tss.len || !timeo) - break; - release_sock(sk); - lock_sock(sk); - - if (sk->sk_err || sk->sk_state == TCP_CLOSE || - (sk->sk_shutdown & RCV_SHUTDOWN) || - signal_pending(current)) - break; + break; } release_sock(sk); diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c index 393f01b2a7e6..f96b52a8be0e 100644 --- a/net/kcm/kcmsock.c +++ b/net/kcm/kcmsock.c @@ -1025,7 +1025,7 @@ static ssize_t kcm_splice_read(struct socket *sock, loff_t *ppos, /* Only support splice for SOCKSEQPACKET */ - skb = skb_recv_datagram(sk, flags, &err); + skb = skb_recv_datagram(sk, MSG_DONTWAIT, &err); if (!skb) goto err_out; diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c index a7f887d91d89..4ba8f93ddbe5 100644 --- a/net/smc/af_smc.c +++ b/net/smc/af_smc.c @@ -3172,12 +3172,8 @@ static ssize_t smc_splice_read(struct socket *sock, loff_t *ppos, rc = -ESPIPE; goto out; } - if (flags & SPLICE_F_NONBLOCK) - flags = MSG_DONTWAIT; - else - flags = 0; SMC_STAT_INC(smc, splice_cnt); - rc = smc_rx_recvmsg(smc, NULL, pipe, len, flags); + rc = smc_rx_recvmsg(smc, NULL, pipe, len, MSG_DONTWAIT); } out: release_sock(sk); diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c index 53f944e6d8ef..7df1ea6a62a5 100644 --- a/net/tls/tls_sw.c +++ b/net/tls/tls_sw.c @@ -2136,7 +2136,7 @@ ssize_t tls_sw_splice_read(struct socket *sock, loff_t *ppos, int chunk; int err; - err = tls_rx_reader_lock(sk, ctx, flags & SPLICE_F_NONBLOCK); + err = tls_rx_reader_lock(sk, ctx, true); if (err < 0) return err; @@ -2145,8 +2145,7 @@ ssize_t tls_sw_splice_read(struct socket *sock, loff_t *ppos, } else { struct tls_decrypt_arg darg; - err = tls_rx_rec_wait(sk, NULL, flags & SPLICE_F_NONBLOCK, - true); + err = tls_rx_rec_wait(sk, NULL, true, true); if (err <= 0) goto splice_read_end; diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index 123b35ddfd71..384d5a479b4a 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -2880,15 +2880,12 @@ static ssize_t unix_stream_splice_read(struct socket *sock, loff_t *ppos, .pipe = pipe, .size = size, .splice_flags = flags, + .flags = MSG_DONTWAIT, }; if (unlikely(*ppos)) return -ESPIPE; - if (sock->file->f_flags & O_NONBLOCK || - flags & SPLICE_F_NONBLOCK) - state.flags = MSG_DONTWAIT; - return unix_stream_read_generic(&state, false); } [-- Attachment #1.3: s.c --] [-- Type: text/x-csrc, Size: 485 bytes --] #define _GNU_SOURCE #include <fcntl.h> #include <stdio.h> #include <errno.h> int main() { int lasterr = -1; unsigned ctr = 0; for(;;) { errno = 0; ssize_t ret = splice(0, 0, 1, 0, 128 * 1024 * 1024, 0); if(ret >= 0 || errno != lasterr) { fprintf(stderr, "\n\t%m" + (lasterr == -1)); lasterr = errno; ctr = 0; } if(ret == -1) { ++ctr; fprintf(stderr, "\r%u", ctr); } else fprintf(stderr, "\r%zu", ret); if(!ret) break; } fprintf(stderr, "\n"); } [-- Attachment #1.4: tcp.c --] [-- Type: text/x-csrc, Size: 507 bytes --] #define _GNU_SOURCE #include <stdlib.h> #include <sys/socket.h> #include <sys/types.h> #include <netinet/in.h> #include <unistd.h> int main(int argc, char ** argv) { int s = socket(AF_INET, SOCK_STREAM | SOCK_CLOEXEC, 0); struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = 12345, .sin_addr = 0, }; if(bind(s, &addr, sizeof(addr)) == -1 || listen(s, 128)) abort(); int fd = accept4(s, NULL, NULL, SOCK_CLOEXEC); if(fd == -1) abort(); dup2(fd, 0); execvp(argv[1], argv + 1); } [-- Attachment #1.5: af_unix.c --] [-- Type: text/x-csrc, Size: 393 bytes --] #include <stdlib.h> #include <sys/socket.h> #include <sys/types.h> #include <sys/un.h> #include <unistd.h> int main(int argc, char ** argv) { int fds[2]; if(socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0, fds)) abort(); if(!vfork()) { dup2(fds[0], 0); _exit(execvp(argv[1], argv + 1)); } dup2(fds[1], 1); write(1, "dupa\n", 5); sleep(1); write(1, "dupa\n", 5); sleep(1); } [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: Pending splice(file -> FIFO) excludes all other FIFO operations forever (was: ... always blocks read(FIFO), regardless of O_NONBLOCK on read side?) 2023-07-09 22:33 ` Ahelenia Ziemiańska @ 2023-07-10 13:22 ` Ahelenia Ziemiańska 0 siblings, 0 replies; 19+ messages in thread From: Ahelenia Ziemiańska @ 2023-07-10 13:22 UTC (permalink / raw) To: Linus Torvalds Cc: Christian Brauner, Jens Axboe, David Howells, Alexander Viro, linux-fsdevel, linux-kernel [-- Attachment #1: Type: text/plain, Size: 2357 bytes --] On Mon, Jul 10, 2023 at 12:33:07AM +0200, Ahelenia Ziemiańska wrote: > On Sun, Jul 09, 2023 at 03:03:22AM +0200, Ahelenia Ziemiańska wrote: > > On Sat, Jul 08, 2023 at 01:06:56PM -0700, Linus Torvalds wrote: > > > I guess combined with something like > > > > > > if (!(in->f_mode & FMODE_NOWAIT)) > > > return -EINVAL; > > > > > > it might all work. > > Yes, that makes the splice instantly -EINVAL for ttys, but doesn't > > affect the socketpair() case above, which still blocks forever. > This also triggers for regular file -> pipe splices, > which is probably a no-go. Actually, that's only the case for regular files on some filesystems? I originally tested on tmpfs, and now on vfat, ramfs, procfs, and sysfs, and none have FMODE_NOWAIT set. Conversely, ext4 and XFS files both have FMODE_NOWAIT set, and behave like blockdevs incl. the filemap_splice_read() oddities below. Indeed, it looks like Some filesystems (btrfs/ext4/f2fs/ocfs2/xfs, blockdevs, /dev/{null,zero,random,urandom}, pipes, tun/tap) set FMODE_NOWAIT, but that's by far not All of them, so maybe /* File is capable of returning -EAGAIN if I/O will block */ is not the right check for regular files. > filemap_get_pages() does use and inspect IOCB_NOWAIT if set in > filemap_splice_read(), but it appears to not really make much sense, > inasmuch it returns EAGAIN for the first splice() from a > blockdev and then instantly return with data on the next go-around. Indeed, this is inconsistent to both: * readv2(off=-1, RWF_NOWAIT), which always returns EAGAIN, and * fcntl(0, F_SETFL, O_NONBLOCK), read(), which always reads. Thus, > This doesn't really make much sense ‒ and open(2) says blockdevs > don't have O_NONBLOCK semantics, so I'm assuming this is not supposed > to be exposed to userspace ‒ so I'm not setting it in the diff. not specifying IOCB_NOWAIT in filemap_splice_read() provides consistent semantics to "file is read as-if it had O_NONBLOCK set". I've tentatively updated the check to if (!((in->f_mode & FMODE_NOWAIT) || S_ISREG(in->f_inode->i_mode))) (with the reasoning, as previously, that regular files don't /have/ a distinct O_NONBLOCK mode, because they always behave non-blockingly) and with that > I've tested this for: * regular file: works as expected [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Pending splice(file -> FIFO) excludes all other FIFO operations forever (was: ... always blocks read(FIFO), regardless of O_NONBLOCK on read side?) 2023-07-06 21:56 ` Linus Torvalds 2023-07-07 17:21 ` Christian Brauner @ 2023-07-08 0:00 ` Matthew Wilcox 2023-07-08 0:07 ` Linus Torvalds 1 sibling, 1 reply; 19+ messages in thread From: Matthew Wilcox @ 2023-07-08 0:00 UTC (permalink / raw) To: Linus Torvalds Cc: Ahelenia Ziemiańska, Christian Brauner, Jens Axboe, David Howells, Alexander Viro, linux-fsdevel, linux-kernel On Thu, Jul 06, 2023 at 02:56:45PM -0700, Linus Torvalds wrote: > +static int busy_pipe_buf_confirm(struct pipe_inode_info *pipe, > + struct pipe_buffer *buf) > +{ > + struct page *page = buf->page; > + > + if (folio_wait_bit_interruptible(page_folio(page), PG_locked)) > + return -EINTR; Do we really want interruptible here rather than killable? That is, do we want SIGWINCH or SIGALRM to result in a short read? I assume it's OK to return a short read because userspace has explicitly asked for O_NONBLOCK and can therefore be expected to actually check the return value from read(). ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Pending splice(file -> FIFO) excludes all other FIFO operations forever (was: ... always blocks read(FIFO), regardless of O_NONBLOCK on read side?) 2023-07-08 0:00 ` Matthew Wilcox @ 2023-07-08 0:07 ` Linus Torvalds 2023-07-08 0:21 ` Linus Torvalds 0 siblings, 1 reply; 19+ messages in thread From: Linus Torvalds @ 2023-07-08 0:07 UTC (permalink / raw) To: Matthew Wilcox Cc: Ahelenia Ziemiańska, Christian Brauner, Jens Axboe, David Howells, Alexander Viro, linux-fsdevel, linux-kernel On Fri, 7 Jul 2023 at 17:00, Matthew Wilcox <willy@infradead.org> wrote: > > Do we really want interruptible here rather than killable? Yes, we actually do need to be just regular interruptible, This is a bog-standard "IO to/from pipe" situation, which is interruptible. > That is, do we want SIGWINCH or SIGALRM to result in a short read? Now, that's a different issue, and is actually handled by the signal layer: a signal that is ignored (where "ignored" includes the case of "default handler") will be dropped early, exactly because we don't want to interrupt things like tty or pipe reads when you resize the window. Of course, if you actually *catch* SIGWINCH, then you will get that short read on a window change. Linus ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Pending splice(file -> FIFO) excludes all other FIFO operations forever (was: ... always blocks read(FIFO), regardless of O_NONBLOCK on read side?) 2023-07-08 0:07 ` Linus Torvalds @ 2023-07-08 0:21 ` Linus Torvalds 0 siblings, 0 replies; 19+ messages in thread From: Linus Torvalds @ 2023-07-08 0:21 UTC (permalink / raw) To: Matthew Wilcox Cc: Ahelenia Ziemiańska, Christian Brauner, Jens Axboe, David Howells, Alexander Viro, linux-fsdevel, linux-kernel On Fri, 7 Jul 2023 at 17:07, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > That is, do we want SIGWINCH or SIGALRM to result in a short read? > > Now, that's a different issue, and is actually handled by the signal > layer: a signal that is ignored (where "ignored" includes the case of > "default handler") will be dropped early, exactly because we don't > want to interrupt things like tty or pipe reads when you resize the > window. In case you care, it's prepare_signal() -> sig_ignored() -> sig_task_ignored() -> sig_handler_ignored() logic that does this short-circuiting. And while I don't think it's required by POSIX, this was definitely a case where lots of programs that *don't* use any signal handlers at all are very much not expecting to see -EINTR as a return value, and used to break exactly on things like SIGWINCH when reading from a tty or a pipe. But that logic goes back to before linux-1.0. In fact, I think it goes back to at least 0.99.10 (June '93). Linus ^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2023-07-10 13:22 UTC | newest] Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2023-06-26 1:12 Pending splice(file -> FIFO) always blocks read(FIFO), regardless of O_NONBLOCK on read side? Ahelenia Ziemiańska 2023-06-26 9:32 ` Christian Brauner 2023-06-26 11:59 ` Pending splice(file -> FIFO) excludes all other FIFO operations forever (was: ... always blocks read(FIFO), regardless of O_NONBLOCK on read side?) Ahelenia Ziemiańska 2023-06-26 15:56 ` Christian Brauner 2023-06-26 16:14 ` Ahelenia Ziemiańska 2023-07-06 21:56 ` Linus Torvalds 2023-07-07 17:21 ` Christian Brauner 2023-07-07 19:10 ` Linus Torvalds 2023-07-07 19:57 ` Jens Axboe 2023-07-07 22:41 ` Ahelenia Ziemiańska 2023-07-07 22:57 ` Linus Torvalds 2023-07-08 0:30 ` Ahelenia Ziemiańska 2023-07-08 20:06 ` Linus Torvalds 2023-07-09 1:03 ` Ahelenia Ziemiańska 2023-07-09 22:33 ` Ahelenia Ziemiańska 2023-07-10 13:22 ` Ahelenia Ziemiańska 2023-07-08 0:00 ` Matthew Wilcox 2023-07-08 0:07 ` Linus Torvalds 2023-07-08 0:21 ` Linus Torvalds
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).