* [GIT PULL] pipe: Notification queue preparation
@ 2019-11-25 22:39 David Howells
2019-11-30 23:05 ` pr-tracker-bot
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: David Howells @ 2019-11-25 22:39 UTC (permalink / raw)
To: torvalds
Cc: dhowells, Rasmus Villemoes, Greg Kroah-Hartman, Peter Zijlstra,
raven, Christian Brauner, keyrings, linux-usb, linux-block,
linux-security-module, linux-fsdevel, linux-api, linux-kernel
Hi Linus,
Can you pull this please? This is my set of preparatory patches for
building a general notification queue on top of pipes. It makes a number
of significant changes:
(1) It removes the nr_exclusive argument from __wake_up_sync_key() as this
is always 1. This prepares for step 2.
(2) Adds wake_up_interruptible_sync_poll_locked() so that poll can be
woken up from a function that's holding the poll waitqueue spinlock.
[btw, I realise that I haven't un-sync'd the
wake_up_interruptible_sync_poll() calls as you tentatively suggested.
I can send a follow up patch to fix that if you still want it]
(3) Change the pipe buffer ring to be managed in terms of unbounded head
and tail indices rather than bounded index and length. This means
that reading the pipe only needs to modify one index, not two.
(4) A selection of helper functions are provided to query the state of the
pipe buffer, plus a couple to apply updates to the pipe indices.
(5) The pipe ring is allowed to have kernel-reserved slots. This allows
many notification messages to be spliced in by the kernel without
allowing userspace to pin too many pages if it writes to the same
pipe.
(6) Advance the head and tail indices inside the pipe waitqueue lock and
use step 2 to poke poll without having to take the lock twice.
(7) Rearrange pipe_write() to preallocate the buffer it is going to write
into and then drop the spinlock. This allows kernel notifications to
then be added the ring whilst it is filling the buffer it allocated.
The read side is stalled because the pipe mutex is still held.
(8) Don't wake up readers on a pipe if there was already data in it when
we added more.
(9) Don't wake up writers on a pipe if the ring wasn't full before we
removed a buffer.
PATCHES BENCHMARK BEST TOTAL BYTES AVG BYTES STDDEV
======= =============== =============== =============== =============== ===============
- pipe 307457969 36348556755 302904639 10622403
- splice 287117614 26933658717 224447155 160777958
- vmsplice 435180375 51302964090 427524700 19083037
rm-nrx pipe 311091179 37093181356 309109844 7221622
rm-nrx splice 285628049 27916298942 232635824 158296431
rm-nrx vmsplice 417703153 47570362546 396419687 33960822
wakesl pipe 310698731 36772541631 306437846 8249347
wakesl splice 286193726 28600435451 238336962 141169318
wakesl vmsplice 436175803 50723895824 422699131 40724240
ht pipe 305534565 36426079543 303550662 5673885
ht splice 243632025 23319439010 194328658 150479853
ht vmsplice 432825176 49101781001 409181508 44102509
k-rsv pipe 308691523 36652267561 305435563 12972559
k-rsv splice 244793528 23625172865 196876440 125319143
k-rsv vmsplice 436119082 49460808579 412173404 55547525
r-adv-t pipe 310094218 36860182219 307168185 8081101
r-adv-t splice 285527382 27085052687 225708772 206918887
r-adv-t vmsplice 336885948 40128756927 334406307 5895935
r-cond pipe 308727804 36635828180 305298568 9976806
r-cond splice 284467568 28445793054 237048275 200284329
r-cond vmsplice 449679489 51134833848 426123615 66790875
w-preal pipe 307416578 36662086426 305517386 6216663
w-preal splice 282655051 28455249109 237127075 194154549
w-preal vmsplice 437002601 47832160621 398601338 96513019
w-redun pipe 307279630 36329750422 302747920 8913567
w-redun splice 284324488 27327152734 227726272 219735663
w-redun vmsplice 451141971 51485257719 429043814 51388217
w-ckful pipe 305055247 36374947350 303124561 5400728
w-ckful splice 281575308 26841554544 223679621 215942886
w-ckful vmsplice 436653588 47564907110 396374225 82255342
The patches column indicates the point in the patchset at which the benchmarks
were taken:
0 No patches
rm-nrx "Remove the nr_exclusive argument from __wake_up_sync_key()"
wakesl "Add wake_up_interruptible_sync_poll_locked()"
ht "pipe: Use head and tail pointers for the ring, not cursor and length"
k-rsv "pipe: Allow pipes to have kernel-reserved slots"
r-adv-t "pipe: Advance tail pointer inside of wait spinlock in pipe_read()"
r-cond "pipe: Conditionalise wakeup in pipe_read()"
w-preal "pipe: Rearrange sequence in pipe_write() to preallocate slot"
w-redun "pipe: Remove redundant wakeup from pipe_write()"
w-ckful "pipe: Check for ring full inside of the spinlock in pipe_write()"
Changes:
(*) Fix some bugs spotted by kbuild.
ver #3:
(*) Get rid of pipe_commit_{read,write}.
(*) Port the virtio_console driver.
(*) Fix pipe_zero().
(*) Amend some comments.
(*) Added an additional patch that changes the threshold at which readers
wake writers for Konstantin Khlebnikov.
ver #2:
(*) Split the notification patches out into a separate branch.
(*) Removed the nr_exclusive parameter from __wake_up_sync_key().
(*) Renamed the locked wakeup function.
(*) Add helpers for empty, full, occupancy.
(*) Split the addition of ->max_usage out into its own patch.
(*) Fixed some bits pointed out by Rasmus Villemoes.
ver #1:
(*) Build on top of standard pipes instead of having a driver.
David
---
The following changes since commit da0c9ea146cbe92b832f1b0f694840ea8eb33cce:
Linux 5.4-rc2 (2019-10-06 14:27:30 -0700)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git tags/notifications-pipe-prep-20191115
for you to fetch changes up to 3c0edea9b29f9be6c093f236f762202b30ac9431:
pipe: Remove sync on wake_ups (2019-11-15 16:22:54 +0000)
----------------------------------------------------------------
Pipework for general notification queue
----------------------------------------------------------------
David Howells (12):
pipe: Reduce #inclusion of pipe_fs_i.h
Remove the nr_exclusive argument from __wake_up_sync_key()
Add wake_up_interruptible_sync_poll_locked()
pipe: Use head and tail pointers for the ring, not cursor and length
pipe: Allow pipes to have kernel-reserved slots
pipe: Advance tail pointer inside of wait spinlock in pipe_read()
pipe: Conditionalise wakeup in pipe_read()
pipe: Rearrange sequence in pipe_write() to preallocate slot
pipe: Remove redundant wakeup from pipe_write()
pipe: Check for ring full inside of the spinlock in pipe_write()
pipe: Increase the writer-wakeup threshold to reduce context-switch count
pipe: Remove sync on wake_ups
drivers/char/virtio_console.c | 16 ++-
fs/exec.c | 1 -
fs/fuse/dev.c | 31 +++--
fs/ocfs2/aops.c | 1 -
fs/pipe.c | 232 +++++++++++++++++++++---------------
fs/splice.c | 190 +++++++++++++++++------------
include/linux/pipe_fs_i.h | 64 +++++++++-
include/linux/uio.h | 4 +-
include/linux/wait.h | 11 +-
kernel/exit.c | 2 +-
kernel/sched/wait.c | 37 ++++--
lib/iov_iter.c | 269 ++++++++++++++++++++++++------------------
security/smack/smack_lsm.c | 1 -
13 files changed, 529 insertions(+), 330 deletions(-)
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [GIT PULL] pipe: Notification queue preparation
2019-11-25 22:39 [GIT PULL] pipe: Notification queue preparation David Howells
@ 2019-11-30 23:05 ` pr-tracker-bot
2019-12-05 12:58 ` David Sterba
2019-12-05 13:56 ` David Howells
2 siblings, 0 replies; 10+ messages in thread
From: pr-tracker-bot @ 2019-11-30 23:05 UTC (permalink / raw)
To: David Howells
Cc: torvalds, dhowells, Rasmus Villemoes, Greg Kroah-Hartman,
Peter Zijlstra, raven, Christian Brauner, keyrings, linux-usb,
linux-block, linux-security-module, linux-fsdevel, linux-api,
linux-kernel
The pull request you sent on Mon, 25 Nov 2019 22:39:49 +0000:
> git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git tags/notifications-pipe-prep-20191115
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/6a965666b7e7475c2f8c8e724703db58b8a8a445
Thank you!
--
Deet-doot-dot, I am a bot.
https://korg.wiki.kernel.org/userdoc/prtracker
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [GIT PULL] pipe: Notification queue preparation
2019-11-25 22:39 [GIT PULL] pipe: Notification queue preparation David Howells
2019-11-30 23:05 ` pr-tracker-bot
@ 2019-12-05 12:58 ` David Sterba
2019-12-05 13:56 ` David Howells
2 siblings, 0 replies; 10+ messages in thread
From: David Sterba @ 2019-12-05 12:58 UTC (permalink / raw)
To: David Howells
Cc: torvalds, Rasmus Villemoes, Greg Kroah-Hartman, Peter Zijlstra,
raven, Christian Brauner, keyrings, linux-usb, linux-block,
linux-security-module, linux-fsdevel, linux-api, linux-kernel
On Mon, Nov 25, 2019 at 10:39:49PM +0000, David Howells wrote:
> David Howells (12):
> pipe: Reduce #inclusion of pipe_fs_i.h
> Remove the nr_exclusive argument from __wake_up_sync_key()
> Add wake_up_interruptible_sync_poll_locked()
> pipe: Use head and tail pointers for the ring, not cursor and length
This commit (8cefc107ca54c8b06438b7dc9) causes hangs of 'btrfs send'
commands, that's eg. inside fstests or in btrfs-progs testsuite. The
'send' uses pipe/splice to get data from kernel to userspace.
The test that reliably worked for me leaves the process hanging with
this stacktrace (no CPU or IO activity):
[<0>] pipe_write+0x1be/0x4b0
[<0>] new_sync_write+0x11e/0x1c0
[<0>] vfs_write+0xc1/0x1d0
[<0>] kernel_write+0x2c/0x40
[<0>] send_cmd+0x78/0xf0 [btrfs]
[<0>] send_extent_data+0x4b1/0x52c [btrfs]
[<0>] process_extent+0xe46/0xe9d [btrfs]
[<0>] changed_cb+0xcf5/0xd2f [btrfs]
[<0>] send_subvol+0x8ad/0xc0b [btrfs]
[<0>] btrfs_ioctl_send+0xe50/0xf30 [btrfs]
[<0>] _btrfs_ioctl_send+0x80/0x110 [btrfs]
[<0>] btrfs_ioctl+0x18e1/0x3450 [btrfs]
[<0>] do_vfs_ioctl+0xa5/0x710
[<0>] ksys_ioctl+0x70/0x80
[<0>] __x64_sys_ioctl+0x16/0x20
[<0>] do_syscall_64+0x56/0x220
[<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
I've bisected that to the mentioned commit, using test in btrfs-progs
testsuite with command "make TEST=016\* test-misc". Normally the test
can run up to 10 seconds, in the buggy case it stays there.
I can help testing fixes or gathering more information, please let me
know.
Full bisect log:
# bad: [3c0edea9b29f9be6c093f236f762202b30ac9431] pipe: Remove sync on wake_ups
# good: [da0c9ea146cbe92b832f1b0f694840ea8eb33cce] Linux 5.4-rc2
git bisect start '3c0edea9b29f9be6c093f236f762202b30ac9431' 'd055b4fb4d165b06d912e7f846610d120c3bb9fb^'
# bad: [b667b867344301e24f21d4a4c844675ff61d89e1] pipe: Advance tail pointer inside of wait spinlock in pipe_read()
git bisect bad b667b867344301e24f21d4a4c844675ff61d89e1
# good: [f94df9890e98f2090c6a8d70c795134863b70201] Add wake_up_interruptible_sync_poll_locked()
git bisect good f94df9890e98f2090c6a8d70c795134863b70201
# bad: [6718b6f855a0b4962d54bd625be2718cb820cec6] pipe: Allow pipes to have kernel-reserved slots
git bisect bad 6718b6f855a0b4962d54bd625be2718cb820cec6
# bad: [8cefc107ca54c8b06438b7dc9cc08bc0a11d5b98] pipe: Use head and tail pointers for the ring, not cursor and length
git bisect bad 8cefc107ca54c8b06438b7dc9cc08bc0a11d5b98
# first bad commit: [8cefc107ca54c8b06438b7dc9cc08bc0a11d5b98] pipe: Use head and tail pointers for the ring, not cursor and length
> pipe: Allow pipes to have kernel-reserved slots
> pipe: Advance tail pointer inside of wait spinlock in pipe_read()
> pipe: Conditionalise wakeup in pipe_read()
> pipe: Rearrange sequence in pipe_write() to preallocate slot
> pipe: Remove redundant wakeup from pipe_write()
> pipe: Check for ring full inside of the spinlock in pipe_write()
> pipe: Increase the writer-wakeup threshold to reduce context-switch count
> pipe: Remove sync on wake_ups
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [GIT PULL] pipe: Notification queue preparation
2019-11-25 22:39 [GIT PULL] pipe: Notification queue preparation David Howells
2019-11-30 23:05 ` pr-tracker-bot
2019-12-05 12:58 ` David Sterba
@ 2019-12-05 13:56 ` David Howells
2019-12-05 17:12 ` Linus Torvalds
` (2 more replies)
2 siblings, 3 replies; 10+ messages in thread
From: David Howells @ 2019-12-05 13:56 UTC (permalink / raw)
To: dsterba
Cc: dhowells, torvalds, Rasmus Villemoes, Greg Kroah-Hartman,
Peter Zijlstra, raven, Christian Brauner, keyrings, linux-usb,
linux-block, linux-security-module, linux-fsdevel, linux-api,
linux-kernel
David Sterba <dsterba@suse.cz> wrote:
> [<0>] pipe_write+0x1be/0x4b0
Can you get me a line number of that? Assuming you've built with -g, load
vmlinux into gdb and do "i li pipe_write+0x1be".
David
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [GIT PULL] pipe: Notification queue preparation
2019-12-05 13:56 ` David Howells
@ 2019-12-05 17:12 ` Linus Torvalds
2019-12-05 17:24 ` Linus Torvalds
2019-12-05 17:21 ` David Sterba
2019-12-05 17:25 ` David Howells
2 siblings, 1 reply; 10+ messages in thread
From: Linus Torvalds @ 2019-12-05 17:12 UTC (permalink / raw)
To: David Howells
Cc: David Sterba, Rasmus Villemoes, Greg Kroah-Hartman,
Peter Zijlstra, raven, Christian Brauner, keyrings, linux-usb,
linux-block, LSM List, linux-fsdevel, Linux API,
Linux Kernel Mailing List
On Thu, Dec 5, 2019 at 5:57 AM David Howells <dhowells@redhat.com> wrote:
>
> David Sterba <dsterba@suse.cz> wrote:
>
> > [<0>] pipe_write+0x1be/0x4b0
>
> Can you get me a line number of that? Assuming you've built with -g, load
> vmlinux into gdb and do "i li pipe_write+0x1be".
If the kernel is built with debug info (which you need for the gdb
command anyway), it's much better to just use
./scripts/decode_stacktrace.sh
which gives all the information for the whole backtrace.
It would be interesting to hear if somebody else is waiting on the
read side too.
Linus
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [GIT PULL] pipe: Notification queue preparation
2019-12-05 17:12 ` Linus Torvalds
@ 2019-12-05 17:24 ` Linus Torvalds
0 siblings, 0 replies; 10+ messages in thread
From: Linus Torvalds @ 2019-12-05 17:24 UTC (permalink / raw)
To: David Howells
Cc: David Sterba, Rasmus Villemoes, Greg Kroah-Hartman,
Peter Zijlstra, raven, Christian Brauner, keyrings, linux-usb,
linux-block, LSM List, linux-fsdevel, Linux API,
Linux Kernel Mailing List
On Thu, Dec 5, 2019 at 9:12 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> It would be interesting to hear if somebody else is waiting on the
> read side too.
Looking once more at that commit, I find at least one bug, but I'm
pretty sure that's not the cause of this problem.
DavidH, watch out for things like this:
- for (idx = 0; idx < pipe->nrbufs && rem < len; idx++)
+ for (idx = tail; idx < head && rem < len; idx++)
which is a completely buggy conversion.
You can't compare tail and head with an inequality, it gets the
wraparound case wrong.
But I found only one of those, and it's fuse-specific, plus the
overflow would take a long time to trigger, so I'm pretty sure this
isn't what DavidS is reporting.
Linus
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [GIT PULL] pipe: Notification queue preparation
2019-12-05 13:56 ` David Howells
2019-12-05 17:12 ` Linus Torvalds
@ 2019-12-05 17:21 ` David Sterba
2019-12-05 17:33 ` Linus Torvalds
2019-12-05 17:25 ` David Howells
2 siblings, 1 reply; 10+ messages in thread
From: David Sterba @ 2019-12-05 17:21 UTC (permalink / raw)
To: David Howells
Cc: torvalds, Rasmus Villemoes, Greg Kroah-Hartman, Peter Zijlstra,
raven, Christian Brauner, keyrings, linux-usb, linux-block,
linux-security-module, linux-fsdevel, linux-api, linux-kernel
On Thu, Dec 05, 2019 at 01:56:57PM +0000, David Howells wrote:
> David Sterba <dsterba@suse.cz> wrote:
>
> > [<0>] pipe_write+0x1be/0x4b0
>
> Can you get me a line number of that? Assuming you've built with -g, load
> vmlinux into gdb and do "i li pipe_write+0x1be".
I built it with -g (DEBUG_INFO) but there's no output for the command (gdb 8.2):
(gdb) i li pipe_write+0x1be
Function "pipe_write+0x1be" not defined.
But the address can tell something:
(gdb) l *(pipe_write+0x1be)
0xffffffff81390b8e is in pipe_write (fs/pipe.c:509).
warning: Source file is more recent than executable.
504 kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
505 do_wakeup = 0;
506 }
507 pipe->waiting_writers++;
508 pipe_wait(pipe);
509 pipe->waiting_writers--;
510 }
511 out:
512 __pipe_unlock(pipe);
513 if (do_wakeup) {
I rerun the test again (with a different address where it's stuck), there's
nothing better I can get from the debug info, it always points to pipe_wait,
disassembly points to:
0xffffffff81390b71 <+417>: jne 0xffffffff81390c23 <pipe_write+595>
0xffffffff81390b77 <+423>: test %ecx,%ecx
0xffffffff81390b79 <+425>: jne 0xffffffff81390b95 <pipe_write+453>
0xffffffff81390b7b <+427>: addl $0x1,0x110(%rbx)
0xffffffff81390b82 <+434>: mov %rbx,%rdi
0xffffffff81390b85 <+437>: callq 0xffffffff813908c0 <pipe_wait>
0xffffffff81390b8a <+442>: subl $0x1,0x110(%rbx)
(pipe_write+0x1ba == 0xffffffff81390b8a)
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [GIT PULL] pipe: Notification queue preparation
2019-12-05 17:21 ` David Sterba
@ 2019-12-05 17:33 ` Linus Torvalds
0 siblings, 0 replies; 10+ messages in thread
From: Linus Torvalds @ 2019-12-05 17:33 UTC (permalink / raw)
To: David Sterba, David Howells, Linus Torvalds, Rasmus Villemoes,
Greg Kroah-Hartman, Peter Zijlstra, raven, Christian Brauner,
keyrings, linux-usb, linux-block, LSM List, linux-fsdevel,
Linux API, Linux Kernel Mailing List
On Thu, Dec 5, 2019 at 9:22 AM David Sterba <dsterba@suse.cz> wrote:
>
> I rerun the test again (with a different address where it's stuck), there's
> nothing better I can get from the debug info, it always points to pipe_wait,
> disassembly points to.
Hah. I see another bug.
"pipe_wait()" depends on the fact that all events that wake it up
happen with the pipe lock held.
But we do some of the "do_wakeup()" handling outside the pipe lock now
on the reader side
__pipe_unlock(pipe);
/* Signal writers asynchronously that there is more room. */
if (do_wakeup) {
wake_up_interruptible_poll(&pipe->wait, EPOLLOUT | EPOLLWRNORM);
kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
}
However, that isn't new to this series _either_, so I don't think
that's it. It does wake up things inside the lock _too_ if it ended up
emptying a whole buffer.
So it could be triggered by timing and behavior changes, but I doubt
this pipe_wait() thing is it either. The fact that it bisects to the
thing that changes things to use head/tail pointers makes me think
there's some other incorrect update or comparison somewhere.
That said, "pipe_wait()" is an abomination. It should use a proper
wait condition and use wait_event(), but the code predates all of
that. I suspect pipe_wait() goes back to the dark ages with the BKL
and no actual races between kernel code.
Linus
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [GIT PULL] pipe: Notification queue preparation
2019-12-05 13:56 ` David Howells
2019-12-05 17:12 ` Linus Torvalds
2019-12-05 17:21 ` David Sterba
@ 2019-12-05 17:25 ` David Howells
2019-12-05 18:18 ` David Sterba
2 siblings, 1 reply; 10+ messages in thread
From: David Howells @ 2019-12-05 17:25 UTC (permalink / raw)
To: dsterba
Cc: dhowells, torvalds, Rasmus Villemoes, Greg Kroah-Hartman,
Peter Zijlstra, raven, Christian Brauner, keyrings, linux-usb,
linux-block, linux-security-module, linux-fsdevel, linux-api,
linux-kernel
I've just posted a couple of patches - can you check to see if they fix your
problem?
https://lore.kernel.org/linux-fsdevel/157556649610.20869.8537079649495343567.stgit@warthog.procyon.org.uk/T/#t
Thanks,
David
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [GIT PULL] pipe: Notification queue preparation
2019-12-05 17:25 ` David Howells
@ 2019-12-05 18:18 ` David Sterba
0 siblings, 0 replies; 10+ messages in thread
From: David Sterba @ 2019-12-05 18:18 UTC (permalink / raw)
To: David Howells
Cc: torvalds, Rasmus Villemoes, Greg Kroah-Hartman, Peter Zijlstra,
raven, Christian Brauner, keyrings, linux-usb, linux-block,
linux-security-module, linux-fsdevel, linux-api, linux-kernel
On Thu, Dec 05, 2019 at 05:25:20PM +0000, David Howells wrote:
> I've just posted a couple of patches - can you check to see if they fix your
> problem?
>
> https://lore.kernel.org/linux-fsdevel/157556649610.20869.8537079649495343567.stgit@warthog.procyon.org.uk/T/#t
Not fixed, the test still hangs with the same call stack.
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2019-12-05 18:18 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-25 22:39 [GIT PULL] pipe: Notification queue preparation David Howells
2019-11-30 23:05 ` pr-tracker-bot
2019-12-05 12:58 ` David Sterba
2019-12-05 13:56 ` David Howells
2019-12-05 17:12 ` Linus Torvalds
2019-12-05 17:24 ` Linus Torvalds
2019-12-05 17:21 ` David Sterba
2019-12-05 17:33 ` Linus Torvalds
2019-12-05 17:25 ` David Howells
2019-12-05 18:18 ` David Sterba
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).