Linux-rt-users Archive on lore.kernel.org
* 5.9.1-rt18: issues with Firewire card on AMD hardware
@ 2020-10-21 17:50 David Runge
  2020-10-23 11:04 ` [PATCH RFC] blk-mq: Don't IPI requests on PREEMPT_RT Sebastian Andrzej Siewior
  2020-10-26  0:37 ` 5.9.1-rt18: issues with Firewire card on AMD hardware David Runge
  0 siblings, 2 replies; 34+ messages in thread
From: David Runge @ 2020-10-21 17:50 UTC (permalink / raw)
  To: linux-rt-users



Hi!

I'm currently trying to get the latest linux-rt kernel (v5.9.1-rt18) to
work on my AMD hardware. I'm providing the kernel as a prebuilt binary
to the Arch Linux community, as we do not have that specific kernel in
the repositories (yet).

I'm facing issues with hardware discovery, as well as crashes, which I
do not get on the vanilla kernel (currently at 5.9.1) on Arch Linux.

I have a PCIe Firewire card that I use for an external audio interface
(RME Fireface800):

05:00.0 FireWire (IEEE 1394): Texas Instruments XIO2213A/B/XIO2221
IEEE-1394b OHCI Controller [Cheetah Express] (rev 01)

Unfortunately, the card does not show up anymore; instead I run into a
bug:

Oct 21 19:08:03 hmbx kernel: ------------[ cut here ]------------
Oct 21 19:08:03 hmbx kernel: DEBUG_LOCKS_WARN_ON(val > preempt_count())
Oct 21 19:08:03 hmbx kernel: WARNING: CPU: 19 PID: 0 at
kernel/sched/core.c:4763 preempt_count_sub+0x5a/0x90
Oct 21 19:08:03 hmbx kernel: Modules linked in: crypto_user ip_tables
x_tables dm_crypt encrypted_keys trusted tpm hid_logitech_hidpp
hid_logitech_dj sd_mod hid_generic usbhid hid amdgpu gpu_sched ttm
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
crct10dif_pclmul cec ahci r8169 crc32_pclmul ghash_clmulni_intel libahci
realtek aesni_intel mdio_devres crypto_simd dm_mod btrfs of_mdio
firewire_ohci blake2b_generic drm libcrc32c cryptd libata ccp xhci_pci
fixed_phy crc32c_generic ehci_pci igb glue_helper firewire_core
crc32c_intel agpgart rng_core crc_itu_t scsi_mod xhci_hcd xor dca libphy
ehci_hcd i2c_algo_bit wmi raid6_pq
Oct 21 19:08:03 hmbx kernel: CPU: 19 PID: 0 Comm: swapper/19 Tainted: G
W         5.9.1-rt18-1-rt #1
Oct 21 19:08:03 hmbx kernel: Hardware name: System manufacturer System
Product Name/Pro WS X570-ACE, BIOS 1302 01/20/2020
Oct 21 19:08:03 hmbx kernel: RIP: 0010:preempt_count_sub+0x5a/0x90
Oct 21 19:08:03 hmbx kernel: Code: 2d d5 62 c3 e8 a7 2d 8c 00 85 c0 74
f6 8b 15 55 77 50 01 85 d2 75 ec 48 c7 c6 df 5c 30 9e 48 c7 c7 ea e9 2e
9e e8 ab ee fc ff <0f> 0b c3 84 c0 75 c9 e8 7a 2d 8c 00 85 c0 74 c9 8b
05 28 77 50 01
Oct 21 19:08:03 hmbx kernel: RSP: 0018:ffff9304401e7dc0 EFLAGS: 00010082
Oct 21 19:08:03 hmbx kernel: RAX: 0000000000000000 RBX: ffff90e257e53f00
RCX: 0000000000000000
Oct 21 19:08:03 hmbx kernel: RDX: 0000000000000001 RSI: ffffffff9e313727
RDI: 00000000ffffffff
Oct 21 19:08:03 hmbx kernel: RBP: ffff9304401e7df8 R08: ffffffff9ed443e0
R09: 0000000000000001
Oct 21 19:08:03 hmbx kernel: R10: ffff9304401e7ce8 R11: 3fffffffffffffff
R12: 0000000000000001
Oct 21 19:08:03 hmbx kernel: R13: 0000000000000000 R14: 0000000000000000
R15: 0000000000000000
Oct 21 19:08:03 hmbx kernel: FS:  0000000000000000(0000)
GS:ffff90e25eec0000(0000) knlGS:0000000000000000
Oct 21 19:08:03 hmbx kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
0000000080050033
Oct 21 19:08:03 hmbx kernel: CR2: 00007f085fb38c20 CR3: 0000000fb829c000
CR4: 0000000000350ee0
Oct 21 19:08:03 hmbx kernel: Call Trace:
Oct 21 19:08:03 hmbx kernel:  irq_exit_rcu+0x28/0xe0
Oct 21 19:08:03 hmbx kernel:  sysvec_call_function_single+0x47/0xe0
Oct 21 19:08:03 hmbx kernel:  asm_sysvec_call_function_single+0x12/0x20
Oct 21 19:08:03 hmbx kernel: RIP: 0010:cpuidle_enter_state+0xd9/0x440
Oct 21 19:08:03 hmbx kernel: Code: 95 ff 49 89 c5 8b 05 6e ad e2 00 85
c0 0f 8f f1 01 00 00 31 ff e8 c7 26 95 ff 45 84 f6 0f 85 9f 01 00 00 fb
66 0f 1f 44 00 00 <45> 85 e4 0f 88 de 00 00 00 49 63 d4 4d 29 fd 48 8d
04 52 48 8d 04
Oct 21 19:08:03 hmbx kernel: RSP: 0018:ffff9304401e7ea0 EFLAGS: 00000246
Oct 21 19:08:03 hmbx kernel: RAX: ffff90e25eec0000 RBX: ffff90e240dc7400
RCX: 000000000000001f
Oct 21 19:08:03 hmbx kernel: RDX: 0000000000000001 RSI: ffffffff9e313727
RDI: ffffffff9e340fb6
Oct 21 19:08:03 hmbx kernel: RBP: ffffffff9e745bc0 R08: 0000000000000002
R09: 0000000000000020
Oct 21 19:08:03 hmbx kernel: R10: 0000000000000165 R11: 0000000000006045
R12: 0000000000000002
Oct 21 19:08:03 hmbx kernel: R13: 0000000398671b89 R14: 0000000000000000
R15: 00000003986701bf
Oct 21 19:08:03 hmbx kernel:  cpuidle_enter+0x29/0x40
Oct 21 19:08:03 hmbx kernel:  do_idle+0x232/0x2d0
Oct 21 19:08:03 hmbx kernel:  cpu_startup_entry+0x19/0x20
Oct 21 19:08:03 hmbx kernel:  secondary_startup_64+0xb6/0xc0
Oct 21 19:08:03 hmbx kernel: ---[ end trace 0000000000000002 ]---

I have the kernel log and lspci output (from the working vanilla kernel)
here: https://pkgbuild.com/~dvzrv/bugs/linux-rt-5.9.1.18/

Any pointers or help would be much appreciated! If you need more
information, just ask! :)

Best,
David

P.S.: I was trying the 5.6.x series of realtime kernels on this machine
as well, but although the hardware gets initialized properly (at least
without crashes) my system completely freezes as soon as I start jackd
on the firewire interface (this does not happen with a non-rt kernel).

-- 
https://sleepmap.de


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH RFC] blk-mq: Don't IPI requests on PREEMPT_RT
  2020-10-21 17:50 5.9.1-rt18: issues with Firewire card on AMD hardware David Runge
@ 2020-10-23 11:04 ` Sebastian Andrzej Siewior
  2020-10-23 11:21   ` Christoph Hellwig
  2020-10-26  0:37 ` 5.9.1-rt18: issues with Firewire card on AMD hardware David Runge
  1 sibling, 1 reply; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-10-23 11:04 UTC (permalink / raw)
  To: David Runge
  Cc: linux-rt-users, Jens Axboe, linux-block, linux-kernel,
	Peter Zijlstra, Thomas Gleixner, Daniel Wagner

blk_mq_complete_request_remote() will dispatch request completion to
another CPU via IPI if the CPU belongs to a different cache domain.

This breaks on PREEMPT_RT because the IPI function will complete the
request in IRQ context, which includes acquiring spinlock_t typed locks.
Completing the request in softirq on the remote CPU is probably less
efficient because it would require waking ksoftirqd for this task
(which runs at SCHED_OTHER).

Ignoring the IPI request and completing the request locally is probably
the best option. It can be completed either in the IRQ thread or at the
end of the routine in softirq context.

Let blk_mq_complete_need_ipi() return that there is no need for IPI on
PREEMPT_RT.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 block/blk-mq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e37aa31332b70..99d2fb51e0e84 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -647,7 +647,7 @@ static inline bool blk_mq_complete_need_ipi(struct request *rq)
 {
 	int cpu = raw_smp_processor_id();
 
-	if (!IS_ENABLED(CONFIG_SMP) ||
+	if (!IS_ENABLED(CONFIG_SMP) || IS_ENABLED(CONFIG_PREEMPT_RT) ||
 	    !test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags))
 		return false;
 
-- 
2.28.0



* Re: [PATCH RFC] blk-mq: Don't IPI requests on PREEMPT_RT
  2020-10-23 11:04 ` [PATCH RFC] blk-mq: Don't IPI requests on PREEMPT_RT Sebastian Andrzej Siewior
@ 2020-10-23 11:21   ` Christoph Hellwig
  2020-10-23 13:52     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2020-10-23 11:21 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: David Runge, linux-rt-users, Jens Axboe, linux-block,
	linux-kernel, Peter Zijlstra, Thomas Gleixner, Daniel Wagner

> -	if (!IS_ENABLED(CONFIG_SMP) ||
> +	if (!IS_ENABLED(CONFIG_SMP) || IS_ENABLED(CONFIG_PREEMPT_RT) ||
>  	    !test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags))

This needs a big fat comment explaining your rationale.  And probably
a separate if statement to make it obvious as well.


* Re: [PATCH RFC] blk-mq: Don't IPI requests on PREEMPT_RT
  2020-10-23 11:21   ` Christoph Hellwig
@ 2020-10-23 13:52     ` Sebastian Andrzej Siewior
  2020-10-27  9:26       ` Christoph Hellwig
  0 siblings, 1 reply; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-10-23 13:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: David Runge, linux-rt-users, Jens Axboe, linux-block,
	linux-kernel, Peter Zijlstra, Thomas Gleixner, Daniel Wagner

On 2020-10-23 12:21:30 [+0100], Christoph Hellwig wrote:
> > -	if (!IS_ENABLED(CONFIG_SMP) ||
> > +	if (!IS_ENABLED(CONFIG_SMP) || IS_ENABLED(CONFIG_PREEMPT_RT) ||
> >  	    !test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags))
> 
> This needs a big fat comment explaining your rationale.  And probably
> a separate if statement to make it obvious as well.

Okay.
How much difference does it make between completing in-softirq vs
in-IPI? I'm asking because acquiring a spinlock_t in an IPI shouldn't be
done (as per Documentation/locking/locktypes.rst). We don't have
anything in lockdep that will complain here on !RT, and with the above we
avoid the case on RT.

Sebastian


* Re: 5.9.1-rt18: issues with Firewire card on AMD hardware
  2020-10-21 17:50 5.9.1-rt18: issues with Firewire card on AMD hardware David Runge
  2020-10-23 11:04 ` [PATCH RFC] blk-mq: Don't IPI requests on PREEMPT_RT Sebastian Andrzej Siewior
@ 2020-10-26  0:37 ` David Runge
  1 sibling, 0 replies; 34+ messages in thread
From: David Runge @ 2020-10-26  0:37 UTC (permalink / raw)
  To: linux-rt-users



On 2020-10-21 19:50:59 (+0200), David Runge wrote:
> P.S.: I was trying the 5.6.x series of realtime kernels on this machine
> as well, but although the hardware gets initialized properly (at least
> without crashes) my system completely freezes as soon as I start jackd
> on the firewire interface (this does not happen with a non-rt kernel).

Thanks! 5.9.1-rt19 fixes the crash for me.
Unfortunately, I now see the old 5.6.x-rt* behavior again. The Firewire
card initializes properly, but as soon as I start jackd using the device
and initialize playback, firewire_core blocks forever and a reboot is
required:

Oct 26 01:13:35 hmbx kernel: firewire_core 0000:05:00.0: created device
fw1: GUID 000a3500ada83262, S800
Oct 26 01:14:47 hmbx kernel: logitech-hidpp-device 0003:046D:101A.0006:
HID++ 1.0 device connected.
Oct 26 01:20:11 hmbx kernel: INFO: task jackd:6873 blocked for more than
122 seconds.
Oct 26 01:20:11 hmbx kernel:       Not tainted 5.9.1-rt19-1-rt #1
Oct 26 01:20:11 hmbx kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 26 01:20:11 hmbx kernel: task:jackd           state:D stack:    0
pid: 6873 ppid:  1664 flags:0x80004086
Oct 26 01:20:11 hmbx kernel: Call Trace:
Oct 26 01:20:11 hmbx kernel:  __schedule+0x2c0/0x8f0
Oct 26 01:20:11 hmbx kernel:  schedule+0x60/0x100
Oct 26 01:20:11 hmbx kernel:  fw_device_op_release+0x230/0x290
[firewire_core]
Oct 26 01:20:11 hmbx kernel:  ? wait_woken+0x80/0x80
Oct 26 01:20:11 hmbx kernel:  __fput+0x8e/0x240
Oct 26 01:20:11 hmbx kernel:  task_work_run+0x5c/0x90
Oct 26 01:20:11 hmbx kernel:  do_exit+0x383/0xaf0
Oct 26 01:20:11 hmbx kernel:  ? finish_task_switch.isra.0+0x91/0x4c0
Oct 26 01:20:11 hmbx kernel:  do_group_exit+0x39/0xb0
Oct 26 01:20:11 hmbx kernel:  get_signal+0x14f/0x990
Oct 26 01:20:11 hmbx kernel:  ? preempt_count_add+0x68/0xa0
Oct 26 01:20:11 hmbx kernel:  ? _raw_spin_lock_irqsave+0x26/0x50
Oct 26 01:20:11 hmbx kernel:  arch_do_signal+0x3d/0x750
Oct 26 01:20:11 hmbx kernel:  ? do_epoll_wait+0xda/0x670
Oct 26 01:20:11 hmbx kernel:  exit_to_user_mode_prepare+0x12d/0x1a0
Oct 26 01:20:11 hmbx kernel:  syscall_exit_to_user_mode+0x2c/0x1b0
Oct 26 01:20:11 hmbx kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Oct 26 01:20:11 hmbx kernel: RIP: 0033:0x7f42186895de
Oct 26 01:20:11 hmbx kernel: Code: Bad RIP value.
Oct 26 01:20:11 hmbx kernel: RSP: 002b:00007fff3dcdea40 EFLAGS: 00000246
ORIG_RAX: 00000000000000e8
Oct 26 01:20:11 hmbx kernel: RAX: fffffffffffffffc RBX: 00005596b98d6310
RCX: 00007f42186895de
Oct 26 01:20:11 hmbx kernel: RDX: 0000000000000020 RSI: 00007fff3dcdea70
RDI: 000000000000000d
Oct 26 01:20:11 hmbx kernel: RBP: 00005596b98d6310 R08: 0000000000000002
R09: 00007fff3dcded80
Oct 26 01:20:11 hmbx kernel: R10: 00000000ffffffff R11: 0000000000000246
R12: 00005596b98f5040
Oct 26 01:20:11 hmbx kernel: R13: 00007fff3dcdea70 R14: 00000000801c0000
R15: 0000000000000001
Oct 26 01:22:13 hmbx kernel: INFO: task jackd:6873 blocked for more than
245 seconds.
Oct 26 01:22:13 hmbx kernel:       Not tainted 5.9.1-rt19-1-rt #1
Oct 26 01:22:13 hmbx kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 26 01:22:13 hmbx kernel: task:jackd           state:D stack:    0
pid: 6873 ppid:  1664 flags:0x80004086
Oct 26 01:22:13 hmbx kernel: Call Trace:
Oct 26 01:22:13 hmbx kernel:  __schedule+0x2c0/0x8f0
Oct 26 01:22:13 hmbx kernel:  schedule+0x60/0x100
Oct 26 01:22:13 hmbx kernel:  fw_device_op_release+0x230/0x290
[firewire_core]
Oct 26 01:22:13 hmbx kernel:  ? wait_woken+0x80/0x80
Oct 26 01:22:13 hmbx kernel:  __fput+0x8e/0x240
Oct 26 01:22:13 hmbx kernel:  task_work_run+0x5c/0x90
Oct 26 01:22:13 hmbx kernel:  do_exit+0x383/0xaf0
Oct 26 01:22:13 hmbx kernel:  ? finish_task_switch.isra.0+0x91/0x4c0
Oct 26 01:22:13 hmbx kernel:  do_group_exit+0x39/0xb0
Oct 26 01:22:13 hmbx kernel:  get_signal+0x14f/0x990
Oct 26 01:22:13 hmbx kernel:  ? preempt_count_add+0x68/0xa0
Oct 26 01:22:13 hmbx kernel:  ? _raw_spin_lock_irqsave+0x26/0x50
Oct 26 01:22:13 hmbx kernel:  arch_do_signal+0x3d/0x750
Oct 26 01:22:13 hmbx kernel:  ? do_epoll_wait+0xda/0x670
Oct 26 01:22:13 hmbx kernel:  exit_to_user_mode_prepare+0x12d/0x1a0
Oct 26 01:22:13 hmbx kernel:  syscall_exit_to_user_mode+0x2c/0x1b0
Oct 26 01:22:13 hmbx kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Oct 26 01:22:13 hmbx kernel: RIP: 0033:0x7f42186895de
Oct 26 01:22:13 hmbx kernel: Code: Bad RIP value.
Oct 26 01:22:13 hmbx kernel: RSP: 002b:00007fff3dcdea40 EFLAGS: 00000246
ORIG_RAX: 00000000000000e8
Oct 26 01:22:13 hmbx kernel: RAX: fffffffffffffffc RBX: 00005596b98d6310
RCX: 00007f42186895de
Oct 26 01:22:13 hmbx kernel: RDX: 0000000000000020 RSI: 00007fff3dcdea70
RDI: 000000000000000d
Oct 26 01:22:13 hmbx kernel: RBP: 00005596b98d6310 R08: 0000000000000002
R09: 00007fff3dcded80
Oct 26 01:22:13 hmbx kernel: R10: 00000000ffffffff R11: 0000000000000246
R12: 00005596b98f5040
Oct 26 01:22:13 hmbx kernel: R13: 00007fff3dcdea70 R14: 00000000801c0000
R15: 0000000000000001

This does not happen on the vanilla 5.9.x kernel.

Any hints on what to try or which patch to omit would be greatly
appreciated!


Best,
David

-- 
https://sleepmap.de



* Re: [PATCH RFC] blk-mq: Don't IPI requests on PREEMPT_RT
  2020-10-23 13:52     ` Sebastian Andrzej Siewior
@ 2020-10-27  9:26       ` Christoph Hellwig
  2020-10-27 10:11         ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2020-10-27  9:26 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Christoph Hellwig, David Runge, linux-rt-users, Jens Axboe,
	linux-block, linux-kernel, Peter Zijlstra, Thomas Gleixner,
	Daniel Wagner

On Fri, Oct 23, 2020 at 03:52:19PM +0200, Sebastian Andrzej Siewior wrote:
> On 2020-10-23 12:21:30 [+0100], Christoph Hellwig wrote:
> > > -	if (!IS_ENABLED(CONFIG_SMP) ||
> > > +	if (!IS_ENABLED(CONFIG_SMP) || IS_ENABLED(CONFIG_PREEMPT_RT) ||
> > >  	    !test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags))
> > 
> > This needs a big fat comment explaining your rationale.  And probably
> > a separate if statement to make it obvious as well.
> 
> Okay.
> How much difference does it make between completing in-softirq vs
> in-IPI?

For normal non-RT builds?  This introduces another context switch, which
for the latencies we are aiming for is noticeable.

> I'm asking because acquiring a spinlock_t in an IPI shouldn't be
> done (as per Documentation/locking/locktypes.rst). We don't have
> anything in lockdep that will complain here on !RT, and with the above we
> avoid the case on RT.

At least for NVMe we aren't taking locks, but with the number of drivers


* Re: [PATCH RFC] blk-mq: Don't IPI requests on PREEMPT_RT
  2020-10-27  9:26       ` Christoph Hellwig
@ 2020-10-27 10:11         ` Sebastian Andrzej Siewior
  2020-10-27 16:07           ` Christoph Hellwig
  0 siblings, 1 reply; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-10-27 10:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: David Runge, linux-rt-users, Jens Axboe, linux-block,
	linux-kernel, Peter Zijlstra, Thomas Gleixner, Daniel Wagner

On 2020-10-27 09:26:06 [+0000], Christoph Hellwig wrote:
> On Fri, Oct 23, 2020 at 03:52:19PM +0200, Sebastian Andrzej Siewior wrote:
> > On 2020-10-23 12:21:30 [+0100], Christoph Hellwig wrote:
> > > > -	if (!IS_ENABLED(CONFIG_SMP) ||
> > > > +	if (!IS_ENABLED(CONFIG_SMP) || IS_ENABLED(CONFIG_PREEMPT_RT) ||
> > > >  	    !test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags))
> > > 
> > > This needs a big fat comment explaining your rationale.  And probably
> > > a separate if statement to make it obvious as well.
> > 
> > Okay.
> > How much difference does it make between completing in-softirq vs
> > in-IPI?
> 
> For normal non-RT builds?  This introduces another context switch, which
> for the latencies we are aiming for is noticeable.

There should be no context switch. The pending softirq should be
executed on irq_exit() from that IPI, that is
  irq_exit()
  -> __irq_exit_rcu()
    -> invoke_softirq()
      -> __do_softirq() || do_softirq_own_stack() 

With the command line switch `threadirqs' enabled, by contrast,
invoke_softirq() would wake the `ksoftirqd' thread, which would involve
a context switch.

> > I'm asking because acquiring a spinlock_t in an IPI shouldn't be
> > done (as per Documentation/locking/locktypes.rst). We don't have
> > anything in lockdep that will complain here on !RT, and with the above we
> > avoid the case on RT.
> 
> At least for NVMe we aren't taking locks, but with the number of drivers

Right. I found this in David Runge's log:

|BUG: scheduling while atomic: swapper/19/0/0x00010002
|CPU: 19 PID: 0 Comm: swapper/19 Not tainted 5.9.1-rt18-1-rt #1
|Hardware name: System manufacturer System Product Name/Pro WS X570-ACE, BIOS 1302 01/20/2020
|Call Trace:
| <IRQ>
| dump_stack+0x6b/0x88
| __schedule_bug.cold+0x89/0x97
| __schedule+0x6a4/0xa10
| preempt_schedule_lock+0x23/0x40
| rt_spin_lock_slowlock_locked+0x117/0x2c0
| rt_spin_lock_slowlock+0x58/0x80
| rt_spin_lock+0x2a/0x40
| test_clear_page_writeback+0xcd/0x310
| end_page_writeback+0x43/0x70
| end_bio_extent_buffer_writepage+0xb2/0x100 [btrfs]
| btrfs_end_bio+0x83/0x140 [btrfs]
| clone_endio+0x84/0x1f0 [dm_mod]
| blk_update_request+0x254/0x470
| blk_mq_end_request+0x1c/0x130
| flush_smp_call_function_queue+0xd5/0x1a0
| __sysvec_call_function_single+0x36/0x150
| asm_call_irq_on_stack+0x12/0x20
| </IRQ>

so the NVMe driver isn't taking any locks, but lock_page_memcg() (and
xa_lock_irqsave()) in test_clear_page_writeback() is.

Sebastian


* Re: [PATCH RFC] blk-mq: Don't IPI requests on PREEMPT_RT
  2020-10-27 10:11         ` Sebastian Andrzej Siewior
@ 2020-10-27 16:07           ` Christoph Hellwig
  2020-10-27 17:05             ` Thomas Gleixner
  0 siblings, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2020-10-27 16:07 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Christoph Hellwig, David Runge, linux-rt-users, Jens Axboe,
	linux-block, linux-kernel, Peter Zijlstra, Thomas Gleixner,
	Daniel Wagner

On Tue, Oct 27, 2020 at 11:11:02AM +0100, Sebastian Andrzej Siewior wrote:
> Right. I found this in David Runge's log:

True, ->bi_end_io instances can do a lot of things as long as they
are hardirq safe.

And in the end the IPI case isn't the super fast path anyway, as it
means we don't use a queue per CPU.

Is there a way to raise a softirq and preferably place it on a given
CPU without our IPI dance?  That should be a win-win situation for
everyone.


* Re: [PATCH RFC] blk-mq: Don't IPI requests on PREEMPT_RT
  2020-10-27 16:07           ` Christoph Hellwig
@ 2020-10-27 17:05             ` Thomas Gleixner
  2020-10-27 17:23               ` Christoph Hellwig
  0 siblings, 1 reply; 34+ messages in thread
From: Thomas Gleixner @ 2020-10-27 17:05 UTC (permalink / raw)
  To: Christoph Hellwig, Sebastian Andrzej Siewior
  Cc: Christoph Hellwig, David Runge, linux-rt-users, Jens Axboe,
	linux-block, linux-kernel, Peter Zijlstra, Daniel Wagner

On Tue, Oct 27 2020 at 16:07, Christoph Hellwig wrote:
> On Tue, Oct 27, 2020 at 11:11:02AM +0100, Sebastian Andrzej Siewior wrote:
>> Right. I found this in David Runge's log:
>
> True, ->bi_end_io instances can do a lot of things as long as they
> are hardirq safe.
>
> And in the end the IPI case isn't the super fast path anyway, as it
> means we don't use a queue per CPU.
>
> Is there a way to raise a softirq and preferably place it on a given
> CPU without our IPI dance?  That should be a win-win situation for
> everyone.

Not really. Softirq pending bits are strictly per cpu and we don't have
locking or atomics to set them remotely. Even if we had that, then you'd
still need a mechanism to make sure that the remote CPU actually
processes them. So you'd still need an IPI of some sorts.

Thanks,

        tglx

* Re: [PATCH RFC] blk-mq: Don't IPI requests on PREEMPT_RT
  2020-10-27 17:05             ` Thomas Gleixner
@ 2020-10-27 17:23               ` Christoph Hellwig
  2020-10-27 17:59                 ` Sebastian Andrzej Siewior
                                   ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Christoph Hellwig @ 2020-10-27 17:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Christoph Hellwig, Sebastian Andrzej Siewior, David Runge,
	linux-rt-users, Jens Axboe, linux-block, linux-kernel,
	Peter Zijlstra, Daniel Wagner

On Tue, Oct 27, 2020 at 06:05:15PM +0100, Thomas Gleixner wrote:
> > Is there a way to raise a softirq and preferably place it on a given
> > CPU without our IPI dance?  That should be a win-win situation for
> > everyone.
> 
> Not really. Softirq pending bits are strictly per cpu and we don't have
> locking or atomics to set them remotely. Even if we had that, then you'd
> still need a mechanism to make sure that the remote CPU actually
> processes them. So you'd still need an IPI of some sorts.

Ok.  I was hoping we could hide this in core code somehow, especially
as peterz didn't like the use of smp_call_function_single_async in the
blk-mq completion code very much.

Sebastian, would this solve your preempt-rt and lockdep issues?


diff --git a/block/blk-mq.c b/block/blk-mq.c
index cdced4aca2e812..5c125fb11b5691 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -626,19 +626,7 @@ static void __blk_mq_complete_request_remote(void *data)
 {
 	struct request *rq = data;
 
-	/*
-	 * For most of single queue controllers, there is only one irq vector
-	 * for handling I/O completion, and the only irq's affinity is set
-	 * to all possible CPUs.  On most of ARCHs, this affinity means the irq
-	 * is handled on one specific CPU.
-	 *
-	 * So complete I/O requests in softirq context in case of single queue
-	 * devices to avoid degrading I/O performance due to irqsoff latency.
-	 */
-	if (rq->q->nr_hw_queues == 1)
-		blk_mq_trigger_softirq(rq);
-	else
-		rq->q->mq_ops->complete(rq);
+	blk_mq_trigger_softirq(rq);
 }
 
 static inline bool blk_mq_complete_need_ipi(struct request *rq)


* Re: [PATCH RFC] blk-mq: Don't IPI requests on PREEMPT_RT
  2020-10-27 17:23               ` Christoph Hellwig
@ 2020-10-27 17:59                 ` Sebastian Andrzej Siewior
  2020-10-27 20:58                 ` Sebastian Andrzej Siewior
  2020-10-28 10:04                 ` [PATCH RFC] blk-mq: Don't IPI requests on PREEMPT_RT Peter Zijlstra
  2 siblings, 0 replies; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-10-27 17:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Thomas Gleixner, David Runge, linux-rt-users, Jens Axboe,
	linux-block, linux-kernel, Peter Zijlstra, Daniel Wagner

On 2020-10-27 17:23:09 [+0000], Christoph Hellwig wrote:
> On Tue, Oct 27, 2020 at 06:05:15PM +0100, Thomas Gleixner wrote:
> > > Is there a way to raise a softirq and preferably place it on a given
> > > CPU without our IPI dance?  That should be a win-win situation for
> > > everyone.
> > 
> > Not really. Softirq pending bits are strictly per cpu and we don't have
> > locking or atomics to set them remotely. Even if we had that, then you'd
> > still need a mechanism to make sure that the remote CPU actually
> > processes them. So you'd still need an IPI of some sorts.
> 
> Ok.  I was hoping we could hide this in core code somehow, especially
> as peterz didn't like the use of smp_call_function_single_async in the
> blk-mq completion code very much.
> 
> Sebastian, would this solve your preempt-rt and lockdep issues?

second. I'm cooking something.

Sebastian


* Re: [PATCH RFC] blk-mq: Don't IPI requests on PREEMPT_RT
  2020-10-27 17:23               ` Christoph Hellwig
  2020-10-27 17:59                 ` Sebastian Andrzej Siewior
@ 2020-10-27 20:58                 ` Sebastian Andrzej Siewior
  2020-10-28  6:56                   ` Christoph Hellwig
  2020-10-28 10:04                 ` [PATCH RFC] blk-mq: Don't IPI requests on PREEMPT_RT Peter Zijlstra
  2 siblings, 1 reply; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-10-27 20:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Thomas Gleixner, David Runge, linux-rt-users, Jens Axboe,
	linux-block, linux-kernel, Peter Zijlstra, Daniel Wagner

On 2020-10-27 17:23:09 [+0000], Christoph Hellwig wrote:
> Ok.  I was hoping we could hide this in core code somehow, especially
> as peterz didn't like the use of smp_call_function_single_async in the
> blk-mq completion code very much.

No idea how you could efficiently avoid smp_call_function_single_async().
The alternatives would be:
- a workqueue (okay)
- a timer_list timer which expires immediately. More code, plus it may
  delay by up to one jiffy.

> Sebastian, would this solve your preempt-rt and lockdep issues?

the problem with that is that on RT/force-threaded it will always wake
the ksoftirqd thread and complete there. Those are extra steps which
should probably be avoided.

Now.
The hunk in blk_mq_complete_need_ipi() will avoid waking a thread
with force-threading enabled.

The remaining part is a switch to llist, which avoids locking (IRQ
off/on) and allows invoking the IPI/raising the softirq only if
something was added. The entries are now processed in reverse order,
but this shouldn't matter, right?

I would split this into two patches (the blk_mq_complete_need_ipi() hunk
and the llist part) unless there are objections.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 55bcee5dc0320..d2452ee9b0e2c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -41,7 +41,7 @@
 #include "blk-mq-sched.h"
 #include "blk-rq-qos.h"
 
-static DEFINE_PER_CPU(struct list_head, blk_cpu_done);
+static DEFINE_PER_CPU(struct llist_head, blk_cpu_done);
 
 static void blk_mq_poll_stats_start(struct request_queue *q);
 static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
@@ -565,80 +565,31 @@ void blk_mq_end_request(struct request *rq, blk_status_t error)
 }
 EXPORT_SYMBOL(blk_mq_end_request);
 
-/*
- * Softirq action handler - move entries to local list and loop over them
- * while passing them to the queue registered handler.
- */
-static __latent_entropy void blk_done_softirq(struct softirq_action *h)
+static void blk_complete_reqs(struct llist_head *cpu_list)
 {
-	struct list_head *cpu_list, local_list;
+	struct llist_node *entry;
+	struct request *rq, *rq_next;
 
-	local_irq_disable();
-	cpu_list = this_cpu_ptr(&blk_cpu_done);
-	list_replace_init(cpu_list, &local_list);
-	local_irq_enable();
+	entry = llist_del_all(cpu_list);
 
-	while (!list_empty(&local_list)) {
-		struct request *rq;
-
-		rq = list_entry(local_list.next, struct request, ipi_list);
-		list_del_init(&rq->ipi_list);
+	llist_for_each_entry_safe(rq, rq_next, entry, ipi_list)
 		rq->q->mq_ops->complete(rq);
-	}
 }
 
-static void blk_mq_trigger_softirq(struct request *rq)
+static __latent_entropy void blk_done_softirq(struct softirq_action *h)
 {
-	struct list_head *list;
-	unsigned long flags;
-
-	local_irq_save(flags);
-	list = this_cpu_ptr(&blk_cpu_done);
-	list_add_tail(&rq->ipi_list, list);
-
-	/*
-	 * If the list only contains our just added request, signal a raise of
-	 * the softirq.  If there are already entries there, someone already
-	 * raised the irq but it hasn't run yet.
-	 */
-	if (list->next == &rq->ipi_list)
-		raise_softirq_irqoff(BLOCK_SOFTIRQ);
-	local_irq_restore(flags);
+	blk_complete_reqs(this_cpu_ptr(&blk_cpu_done));
 }
 
 static int blk_softirq_cpu_dead(unsigned int cpu)
 {
-	/*
-	 * If a CPU goes away, splice its entries to the current CPU
-	 * and trigger a run of the softirq
-	 */
-	local_irq_disable();
-	list_splice_init(&per_cpu(blk_cpu_done, cpu),
-			 this_cpu_ptr(&blk_cpu_done));
-	raise_softirq_irqoff(BLOCK_SOFTIRQ);
-	local_irq_enable();
-
+	blk_complete_reqs(&per_cpu(blk_cpu_done, cpu));
 	return 0;
 }
 
-
 static void __blk_mq_complete_request_remote(void *data)
 {
-	struct request *rq = data;
-
-	/*
-	 * For most of single queue controllers, there is only one irq vector
-	 * for handling I/O completion, and the only irq's affinity is set
-	 * to all possible CPUs.  On most of ARCHs, this affinity means the irq
-	 * is handled on one specific CPU.
-	 *
-	 * So complete I/O requests in softirq context in case of single queue
-	 * devices to avoid degrading I/O performance due to irqsoff latency.
-	 */
-	if (rq->q->nr_hw_queues == 1)
-		blk_mq_trigger_softirq(rq);
-	else
-		rq->q->mq_ops->complete(rq);
+	__raise_softirq_irqoff(BLOCK_SOFTIRQ);
 }
 
 static inline bool blk_mq_complete_need_ipi(struct request *rq)
@@ -648,6 +599,14 @@ static inline bool blk_mq_complete_need_ipi(struct request *rq)
 	if (!IS_ENABLED(CONFIG_SMP) ||
 	    !test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags))
 		return false;
+	/*
+	 * With force threaded interrupts enabled, raising softirq from an SMP
+	 * function call will always result in waking the ksoftirqd thread.
+	 * This is probably worse than completing the request on a different
+	 * cache domain.
+	 */
+	if (force_irqthreads)
+		return false;
 
 	/* same CPU or cache domain?  Complete locally */
 	if (cpu == rq->mq_ctx->cpu ||
@@ -661,6 +620,7 @@ static inline bool blk_mq_complete_need_ipi(struct request *rq)
 
 bool blk_mq_complete_request_remote(struct request *rq)
 {
+	struct llist_head *cpu_list;
 	WRITE_ONCE(rq->state, MQ_RQ_COMPLETE);
 
 	/*
@@ -671,14 +631,21 @@ bool blk_mq_complete_request_remote(struct request *rq)
 		return false;
 
 	if (blk_mq_complete_need_ipi(rq)) {
-		rq->csd.func = __blk_mq_complete_request_remote;
-		rq->csd.info = rq;
-		rq->csd.flags = 0;
-		smp_call_function_single_async(rq->mq_ctx->cpu, &rq->csd);
+		unsigned int cpu;
+
+		cpu = rq->mq_ctx->cpu;
+		cpu_list = &per_cpu(blk_cpu_done, cpu);
+		if (llist_add(&rq->ipi_list, cpu_list)) {
+			rq->csd.func = __blk_mq_complete_request_remote;
+			rq->csd.flags = 0;
+			smp_call_function_single_async(cpu, &rq->csd);
+		}
 	} else {
 		if (rq->q->nr_hw_queues > 1)
 			return false;
-		blk_mq_trigger_softirq(rq);
+		cpu_list = this_cpu_ptr(&blk_cpu_done);
+		if (llist_add(&rq->ipi_list, cpu_list))
+			raise_softirq(BLOCK_SOFTIRQ);
 	}
 
 	return true;
@@ -3909,7 +3876,7 @@ static int __init blk_mq_init(void)
 	int i;
 
 	for_each_possible_cpu(i)
-		INIT_LIST_HEAD(&per_cpu(blk_cpu_done, i));
+		init_llist_head(&per_cpu(blk_cpu_done, i));
 	open_softirq(BLOCK_SOFTIRQ, blk_done_softirq);
 
 	cpuhp_setup_state_nocalls(CPUHP_BLOCK_SOFTIRQ_DEAD,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 639cae2c158b5..331b2b675b417 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -156,7 +156,7 @@ struct request {
 	 */
 	union {
 		struct hlist_node hash;	/* merge hash */
-		struct list_head ipi_list;
+		struct llist_node ipi_list;
 	};
 
 	/*

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC] blk-mq: Don't IPI requests on PREEMPT_RT
  2020-10-27 20:58                 ` Sebastian Andrzej Siewior
@ 2020-10-28  6:56                   ` Christoph Hellwig
  2020-10-28 14:12                     ` [PATCH 1/3] blk-mq: Don't complete on a remote CPU in force threaded mode Sebastian Andrzej Siewior
  0 siblings, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2020-10-28  6:56 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Christoph Hellwig, Thomas Gleixner, David Runge, linux-rt-users,
	Jens Axboe, linux-block, linux-kernel, Peter Zijlstra,
	Daniel Wagner

> The remaining part is a switch to llist which avoids locking (IRQ
> off/on) and allows invoking the IPI / raising the softirq only if
> something was added. The entries are now processed in reverse order,
> but this shouldn't matter, right?

For correctness it should not matter, but I think it could have
performance implications.  I think you'll have to throw in a
llist_reverse_order.

> I would split this into two patches (the blk_mq_complete_need_ipi() hunk
> and the llist part) unless there are objections.

Yes, please do.
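The llist semantics under discussion here — producers push LIFO with a single atomic exchange, the consumer grabs the whole list at once and reverses it to recover completion order — can be sketched in plain C11 userspace. Names mirror include/linux/llist.h, but this is an illustrative model, not the kernel code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct llist_node { struct llist_node *next; };
struct llist_head { _Atomic(struct llist_node *) first; };

/* Returns true if the list was empty before the add -- the property the
 * series uses to send the IPI / raise the softirq only once per batch. */
static bool llist_add(struct llist_node *n, struct llist_head *h)
{
	struct llist_node *old = atomic_load(&h->first);
	do {
		n->next = old;
	} while (!atomic_compare_exchange_weak(&h->first, &old, n));
	return old == NULL;
}

/* Consumer: detach the whole list in one atomic exchange (no IRQ-off). */
static struct llist_node *llist_del_all(struct llist_head *h)
{
	return atomic_exchange(&h->first, NULL);
}

/* Pushes are LIFO, so reverse to get back submission (FIFO) order. */
static struct llist_node *llist_reverse_order(struct llist_node *head)
{
	struct llist_node *rev = NULL;
	while (head) {
		struct llist_node *next = head->next;
		head->next = rev;
		rev = head;
		head = next;
	}
	return rev;
}
```

Pushing a, b, c and then draining yields c, b, a; llist_reverse_order() restores a, b, c, which is the fix-up requested above.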


* Re: [PATCH RFC] blk-mq: Don't IPI requests on PREEMPT_RT
  2020-10-27 17:23               ` Christoph Hellwig
  2020-10-27 17:59                 ` Sebastian Andrzej Siewior
  2020-10-27 20:58                 ` Sebastian Andrzej Siewior
@ 2020-10-28 10:04                 ` Peter Zijlstra
  2 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2020-10-28 10:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Thomas Gleixner, Sebastian Andrzej Siewior, David Runge,
	linux-rt-users, Jens Axboe, linux-block, linux-kernel,
	Daniel Wagner

On Tue, Oct 27, 2020 at 05:23:09PM +0000, Christoph Hellwig wrote:
> Ok.  I was hoping we could hide this in core code somehow, especially
> a peterz didn't like the use of smp_call_function_single_async in the
> blk-mq completion code very much.

It's smp_call_function_single_async() in general that I don't much like.
But Linus seemed unconvinced, so we'll keep it for a while I suppose.


* [PATCH 1/3] blk-mq: Don't complete on a remote CPU in force threaded mode
  2020-10-28  6:56                   ` Christoph Hellwig
@ 2020-10-28 14:12                     ` Sebastian Andrzej Siewior
  2020-10-28 14:12                       ` [PATCH 2/3] blk-mq: Always complete remote completions requests in softirq Sebastian Andrzej Siewior
  2020-10-28 14:12                       ` [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done Sebastian Andrzej Siewior
  0 siblings, 2 replies; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-10-28 14:12 UTC (permalink / raw)
  To: linux-block
  Cc: Christoph Hellwig, Thomas Gleixner, David Runge, linux-rt-users,
	Jens Axboe, linux-kernel, Peter Zijlstra, Daniel Wagner,
	Sebastian Andrzej Siewior

With force threaded interrupts enabled, raising softirq from an SMP
function call will always result in waking the ksoftirqd thread. This is
not optimal given that the thread runs at SCHED_OTHER priority.

Completing the request in hard IRQ-context on PREEMPT_RT (which enforces
the force threaded mode) is bad because the completion handler may
acquire sleeping locks which violate the locking context.

Disable request completing on a remote CPU in force threaded mode.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 block/blk-mq.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 55bcee5dc0320..421a40968c9ff 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -648,6 +648,14 @@ static inline bool blk_mq_complete_need_ipi(struct request *rq)
 	if (!IS_ENABLED(CONFIG_SMP) ||
 	    !test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags))
 		return false;
+	/*
+	 * With force threaded interrupts enabled, raising softirq from an SMP
+	 * function call will always result in waking the ksoftirqd thread.
+	 * This is probably worse than completing the request on a different
+	 * cache domain.
+	 */
+	if (force_irqthreads)
+		return false;
 
 	/* same CPU or cache domain?  Complete locally */
 	if (cpu == rq->mq_ctx->cpu ||
-- 
2.28.0



* [PATCH 2/3] blk-mq: Always complete remote completions requests in softirq
  2020-10-28 14:12                     ` [PATCH 1/3] blk-mq: Don't complete on a remote CPU in force threaded mode Sebastian Andrzej Siewior
@ 2020-10-28 14:12                       ` Sebastian Andrzej Siewior
  2020-10-28 14:12                       ` [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done Sebastian Andrzej Siewior
  1 sibling, 0 replies; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-10-28 14:12 UTC (permalink / raw)
  To: linux-block
  Cc: Christoph Hellwig, Thomas Gleixner, David Runge, linux-rt-users,
	Jens Axboe, linux-kernel, Peter Zijlstra, Daniel Wagner,
	Sebastian Andrzej Siewior

Controllers with multiple queues have their IRQ handlers pinned to a
CPU. The core shouldn't need to complete the request on a remote CPU.

Remove this case and always raise the softirq to complete the request.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 block/blk-mq.c | 14 +-------------
 1 file changed, 1 insertion(+), 13 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 421a40968c9ff..769d2d532a825 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -626,19 +626,7 @@ static void __blk_mq_complete_request_remote(void *data)
 {
 	struct request *rq = data;
 
-	/*
-	 * For most of single queue controllers, there is only one irq vector
-	 * for handling I/O completion, and the only irq's affinity is set
-	 * to all possible CPUs.  On most of ARCHs, this affinity means the irq
-	 * is handled on one specific CPU.
-	 *
-	 * So complete I/O requests in softirq context in case of single queue
-	 * devices to avoid degrading I/O performance due to irqsoff latency.
-	 */
-	if (rq->q->nr_hw_queues == 1)
-		blk_mq_trigger_softirq(rq);
-	else
-		rq->q->mq_ops->complete(rq);
+	blk_mq_trigger_softirq(rq);
 }
 
 static inline bool blk_mq_complete_need_ipi(struct request *rq)
-- 
2.28.0



* [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-10-28 14:12                     ` [PATCH 1/3] blk-mq: Don't complete on a remote CPU in force threaded mode Sebastian Andrzej Siewior
  2020-10-28 14:12                       ` [PATCH 2/3] blk-mq: Always complete remote completions requests in softirq Sebastian Andrzej Siewior
@ 2020-10-28 14:12                       ` Sebastian Andrzej Siewior
  2020-10-28 14:44                         ` Christoph Hellwig
  2020-10-29 13:12                         ` Sebastian Andrzej Siewior
  1 sibling, 2 replies; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-10-28 14:12 UTC (permalink / raw)
  To: linux-block
  Cc: Christoph Hellwig, Thomas Gleixner, David Runge, linux-rt-users,
	Jens Axboe, linux-kernel, Peter Zijlstra, Daniel Wagner,
	Sebastian Andrzej Siewior

With llist_head it is possible to avoid the locking (the irq-off region)
when items are added. This makes it possible to add items on a remote
CPU.
llist_add() returns true if the list was previously empty. This can be
used to invoke the SMP function call / raise the softirq only if the
first item was added (otherwise it is already pending).
This simplifies the code a little and reduces the IRQ-off regions. With
this change it is possible to reduce the SMP function call to a simple
__raise_softirq_irqoff().

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 block/blk-mq.c         | 78 ++++++++++++++----------------------------
 include/linux/blkdev.h |  2 +-
 2 files changed, 26 insertions(+), 54 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 769d2d532a825..4f53de48e5038 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -41,7 +41,7 @@
 #include "blk-mq-sched.h"
 #include "blk-rq-qos.h"
 
-static DEFINE_PER_CPU(struct list_head, blk_cpu_done);
+static DEFINE_PER_CPU(struct llist_head, blk_cpu_done);
 
 static void blk_mq_poll_stats_start(struct request_queue *q);
 static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
@@ -565,68 +565,32 @@ void blk_mq_end_request(struct request *rq, blk_status_t error)
 }
 EXPORT_SYMBOL(blk_mq_end_request);
 
-/*
- * Softirq action handler - move entries to local list and loop over them
- * while passing them to the queue registered handler.
- */
-static __latent_entropy void blk_done_softirq(struct softirq_action *h)
+static void blk_complete_reqs(struct llist_head *cpu_list)
 {
-	struct list_head *cpu_list, local_list;
+	struct llist_node *entry;
+	struct request *rq, *rq_next;
 
-	local_irq_disable();
-	cpu_list = this_cpu_ptr(&blk_cpu_done);
-	list_replace_init(cpu_list, &local_list);
-	local_irq_enable();
+	entry = llist_del_all(cpu_list);
+	entry = llist_reverse_order(entry);
 
-	while (!list_empty(&local_list)) {
-		struct request *rq;
-
-		rq = list_entry(local_list.next, struct request, ipi_list);
-		list_del_init(&rq->ipi_list);
+	llist_for_each_entry_safe(rq, rq_next, entry, ipi_list)
 		rq->q->mq_ops->complete(rq);
-	}
 }
 
-static void blk_mq_trigger_softirq(struct request *rq)
+static __latent_entropy void blk_done_softirq(struct softirq_action *h)
 {
-	struct list_head *list;
-	unsigned long flags;
-
-	local_irq_save(flags);
-	list = this_cpu_ptr(&blk_cpu_done);
-	list_add_tail(&rq->ipi_list, list);
-
-	/*
-	 * If the list only contains our just added request, signal a raise of
-	 * the softirq.  If there are already entries there, someone already
-	 * raised the irq but it hasn't run yet.
-	 */
-	if (list->next == &rq->ipi_list)
-		raise_softirq_irqoff(BLOCK_SOFTIRQ);
-	local_irq_restore(flags);
+	blk_complete_reqs(this_cpu_ptr(&blk_cpu_done));
 }
 
 static int blk_softirq_cpu_dead(unsigned int cpu)
 {
-	/*
-	 * If a CPU goes away, splice its entries to the current CPU
-	 * and trigger a run of the softirq
-	 */
-	local_irq_disable();
-	list_splice_init(&per_cpu(blk_cpu_done, cpu),
-			 this_cpu_ptr(&blk_cpu_done));
-	raise_softirq_irqoff(BLOCK_SOFTIRQ);
-	local_irq_enable();
-
+	blk_complete_reqs(&per_cpu(blk_cpu_done, cpu));
 	return 0;
 }
 
-
 static void __blk_mq_complete_request_remote(void *data)
 {
-	struct request *rq = data;
-
-	blk_mq_trigger_softirq(rq);
+	__raise_softirq_irqoff(BLOCK_SOFTIRQ);
 }
 
 static inline bool blk_mq_complete_need_ipi(struct request *rq)
@@ -657,6 +621,7 @@ static inline bool blk_mq_complete_need_ipi(struct request *rq)
 
 bool blk_mq_complete_request_remote(struct request *rq)
 {
+	struct llist_head *cpu_list;
 	WRITE_ONCE(rq->state, MQ_RQ_COMPLETE);
 
 	/*
@@ -667,14 +632,21 @@ bool blk_mq_complete_request_remote(struct request *rq)
 		return false;
 
 	if (blk_mq_complete_need_ipi(rq)) {
-		rq->csd.func = __blk_mq_complete_request_remote;
-		rq->csd.info = rq;
-		rq->csd.flags = 0;
-		smp_call_function_single_async(rq->mq_ctx->cpu, &rq->csd);
+		unsigned int cpu;
+
+		cpu = rq->mq_ctx->cpu;
+		cpu_list = &per_cpu(blk_cpu_done, cpu);
+		if (llist_add(&rq->ipi_list, cpu_list)) {
+			rq->csd.func = __blk_mq_complete_request_remote;
+			rq->csd.flags = 0;
+			smp_call_function_single_async(cpu, &rq->csd);
+		}
 	} else {
 		if (rq->q->nr_hw_queues > 1)
 			return false;
-		blk_mq_trigger_softirq(rq);
+		cpu_list = this_cpu_ptr(&blk_cpu_done);
+		if (llist_add(&rq->ipi_list, cpu_list))
+			raise_softirq(BLOCK_SOFTIRQ);
 	}
 
 	return true;
@@ -3905,7 +3877,7 @@ static int __init blk_mq_init(void)
 	int i;
 
 	for_each_possible_cpu(i)
-		INIT_LIST_HEAD(&per_cpu(blk_cpu_done, i));
+		init_llist_head(&per_cpu(blk_cpu_done, i));
 	open_softirq(BLOCK_SOFTIRQ, blk_done_softirq);
 
 	cpuhp_setup_state_nocalls(CPUHP_BLOCK_SOFTIRQ_DEAD,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 639cae2c158b5..331b2b675b417 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -156,7 +156,7 @@ struct request {
 	 */
 	union {
 		struct hlist_node hash;	/* merge hash */
-		struct list_head ipi_list;
+		struct llist_node ipi_list;
 	};
 
 	/*
-- 
2.28.0
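The batching the patch above relies on — queue the completion, and only the caller that finds the per-CPU list empty raises the softirq — can be modelled with a minimal single-CPU userspace sketch. A plain array stands in for the llist and all names are illustrative, not the kernel implementation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 16

/* Per-CPU completion list, modelled as a simple bounded array. */
struct cpu_done {
	int pending[MAX_PENDING];
	size_t count;
};

static int softirq_raised;	/* how often raise_softirq() ran */
static int completed;		/* how many requests were completed */

static void raise_softirq(void) { softirq_raised++; }

/* Queue a completion; signal only if the list was empty, since a
 * non-empty list means the softirq is already pending. */
static void complete_request(struct cpu_done *cd, int rq)
{
	bool was_empty = (cd->count == 0);
	cd->pending[cd->count++] = rq;
	if (was_empty)
		raise_softirq();
}

/* The softirq handler: drain everything queued in one pass. */
static void blk_done_softirq(struct cpu_done *cd)
{
	for (size_t i = 0; i < cd->count; i++)
		completed++;
	cd->count = 0;
}
```

Queueing three requests triggers exactly one raise; once the handler drains the list, the next add raises again.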



* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-10-28 14:12                       ` [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done Sebastian Andrzej Siewior
@ 2020-10-28 14:44                         ` Christoph Hellwig
  2020-10-28 14:47                           ` Sebastian Andrzej Siewior
  2020-10-29 13:12                         ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2020-10-28 14:44 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-block, Christoph Hellwig, Thomas Gleixner, David Runge,
	linux-rt-users, Jens Axboe, linux-kernel, Peter Zijlstra,
	Daniel Wagner

On Wed, Oct 28, 2020 at 03:12:51PM +0100, Sebastian Andrzej Siewior wrote:
>  static int blk_softirq_cpu_dead(unsigned int cpu)
>  {
> -	/*
> -	 * If a CPU goes away, splice its entries to the current CPU
> -	 * and trigger a run of the softirq
> -	 */
> -	local_irq_disable();
> -	list_splice_init(&per_cpu(blk_cpu_done, cpu),
> -			 this_cpu_ptr(&blk_cpu_done));
> -	raise_softirq_irqoff(BLOCK_SOFTIRQ);
> -	local_irq_enable();
> -
> +	blk_complete_reqs(&per_cpu(blk_cpu_done, cpu));
>  	return 0;

How can this be preempted?  Can't we keep using this_cpu_ptr here?


* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-10-28 14:44                         ` Christoph Hellwig
@ 2020-10-28 14:47                           ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-10-28 14:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, Thomas Gleixner, David Runge, linux-rt-users,
	Jens Axboe, linux-kernel, Peter Zijlstra, Daniel Wagner

On 2020-10-28 14:44:53 [+0000], Christoph Hellwig wrote:
> On Wed, Oct 28, 2020 at 03:12:51PM +0100, Sebastian Andrzej Siewior wrote:
> >  static int blk_softirq_cpu_dead(unsigned int cpu)
> >  {
> > -	/*
> > -	 * If a CPU goes away, splice its entries to the current CPU
> > -	 * and trigger a run of the softirq
> > -	 */
> > -	local_irq_disable();
> > -	list_splice_init(&per_cpu(blk_cpu_done, cpu),
> > -			 this_cpu_ptr(&blk_cpu_done));
> > -	raise_softirq_irqoff(BLOCK_SOFTIRQ);
> > -	local_irq_enable();
> > -
> > +	blk_complete_reqs(&per_cpu(blk_cpu_done, cpu));
> >  	return 0;
> 
> How can this be preempted?  Can't we keep using this_cpu_ptr here?

cpu of the dead CPU != this CPU.

Sebastian


* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-10-28 14:12                       ` [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done Sebastian Andrzej Siewior
  2020-10-28 14:44                         ` Christoph Hellwig
@ 2020-10-29 13:12                         ` Sebastian Andrzej Siewior
  2020-10-29 14:05                           ` Christoph Hellwig
  1 sibling, 1 reply; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-10-29 13:12 UTC (permalink / raw)
  To: linux-block
  Cc: Christoph Hellwig, Thomas Gleixner, David Runge, linux-rt-users,
	Jens Axboe, linux-kernel, Peter Zijlstra, Daniel Wagner,
	Mike Galbraith

On 2020-10-28 15:12:51 [+0100], To linux-block@vger.kernel.org wrote:
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -667,14 +632,21 @@ bool blk_mq_complete_request_remote(struct request *rq)
>  		return false;
>  
>  	if (blk_mq_complete_need_ipi(rq)) {
>  	} else {
>  		if (rq->q->nr_hw_queues > 1)
>  			return false;
> -		blk_mq_trigger_softirq(rq);
> +		cpu_list = this_cpu_ptr(&blk_cpu_done);
> +		if (llist_add(&rq->ipi_list, cpu_list))
> +			raise_softirq(BLOCK_SOFTIRQ);
>  	}
>  
>  	return true;

So Mike posted this:
| BUG: using smp_processor_id() in preemptible [00000000] code: usb-storage/841
| caller is blk_mq_complete_request_remote.part.0+0xa2/0x120
| CPU: 0 PID: 841 Comm: usb-storage Not tainted 5.10.0-rc1+ #61
| Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-1 04/01/2014
| Call Trace:
|  dump_stack+0x77/0x97
|  check_preemption_disabled+0xbe/0xc0
|  blk_mq_complete_request_remote.part.0+0xa2/0x120
|  blk_mq_complete_request+0x2e/0x40
|  usb_stor_control_thread+0x29a/0x300
|  kthread+0x14b/0x170
|  ret_from_fork+0x22/0x30

This comes from this_cpu_ptr() because usb_stor_control_thread() runs
with enabled preemption.

Adding preempt_disable() around it will make the warning go away but
will wake the ksoftirqd (this happens now, too).
Adding local_bh_disable() around it would perform the completion
immediately (instead of waking ksoftirqd) but local_bh_enable() feels
slightly more expensive.

Are there many drivers completing SCSI requests in preemptible
context? In that case it would be more efficient to complete the request
directly (usb_stor_control_thread() goes to sleep after that anyway and
there is only one request at a time).

Sebastian


* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-10-29 13:12                         ` Sebastian Andrzej Siewior
@ 2020-10-29 14:05                           ` Christoph Hellwig
  2020-10-29 14:56                             ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2020-10-29 14:05 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-block, Christoph Hellwig, Thomas Gleixner, David Runge,
	linux-rt-users, Jens Axboe, linux-kernel, Peter Zijlstra,
	Daniel Wagner, Mike Galbraith

On Thu, Oct 29, 2020 at 02:12:12PM +0100, Sebastian Andrzej Siewior wrote:
> Are there many drivers completing the SCSI requests in preemtible
> context? In this case it would be more efficient to complete the request
> directly (usb_stor_control_thread() goes to sleep after that anyway and
> there is only one request at a time).

Well, usb-storage obviously seems to do it, and the block layer
does not prohibit it.


* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-10-29 14:05                           ` Christoph Hellwig
@ 2020-10-29 14:56                             ` Sebastian Andrzej Siewior
  2020-10-29 14:57                               ` Christoph Hellwig
  0 siblings, 1 reply; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-10-29 14:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, Thomas Gleixner, David Runge, linux-rt-users,
	Jens Axboe, linux-kernel, Peter Zijlstra, Daniel Wagner,
	Mike Galbraith

On 2020-10-29 14:05:36 [+0000], Christoph Hellwig wrote:
> Well, usb-storage obviously seems to do it, and the block layer
> does not prohibit it.

Also loop, nvme-tcp and then I stopped looking.
Any objections about adding local_bh_disable() around it?

Sebastian


* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-10-29 14:56                             ` Sebastian Andrzej Siewior
@ 2020-10-29 14:57                               ` Christoph Hellwig
  2020-10-29 20:03                                 ` Sagi Grimberg
  0 siblings, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2020-10-29 14:57 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Christoph Hellwig, linux-block, Thomas Gleixner, David Runge,
	linux-rt-users, Jens Axboe, linux-kernel, Peter Zijlstra,
	Daniel Wagner, Mike Galbraith, Sagi Grimberg

On Thu, Oct 29, 2020 at 03:56:23PM +0100, Sebastian Andrzej Siewior wrote:
> On 2020-10-29 14:05:36 [+0000], Christoph Hellwig wrote:
> > Well, usb-storage obviously seems to do it, and the block layer
> > does not prohibit it.
> 
> Also loop, nvme-tcp and then I stopped looking.
> Any objections about adding local_bh_disable() around it?

To me it seems like the whole IPI plus potentially softirq dance is
a little pointless when completing from process context.

Sagi, any opinion on that from the nvme-tcp POV?


* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-10-29 14:57                               ` Christoph Hellwig
@ 2020-10-29 20:03                                 ` Sagi Grimberg
  2020-10-29 21:01                                   ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 34+ messages in thread
From: Sagi Grimberg @ 2020-10-29 20:03 UTC (permalink / raw)
  To: Christoph Hellwig, Sebastian Andrzej Siewior
  Cc: linux-block, Thomas Gleixner, David Runge, linux-rt-users,
	Jens Axboe, linux-kernel, Peter Zijlstra, Daniel Wagner,
	Mike Galbraith


>>> Well, usb-storage obviously seems to do it, and the block layer
>>> does not prohibit it.
>>
>> Also loop, nvme-tcp and then I stopped looking.
>> Any objections about adding local_bh_disable() around it?
> 
> To me it seems like the whole IPI plus potentially softirq dance is
> a little pointless when completing from process context.

I agree.

> Sagi, any opinion on that from the nvme-tcp POV?

nvme-tcp should (almost) always complete from the context that matches
the rq->mq_ctx->cpu as the thread that processes incoming
completions (per hctx) should be affinitized to match it (unless cpus
come and go).

So for nvme-tcp I don't expect blk_mq_complete_need_ipi to return true
in normal operation. That leaves the teardowns+aborts, which aren't very
interesting here.

I would note that nvme-tcp does not go to sleep after completing every
I/O the way Sebastian indicated usb does.

Having said that, today the network stack is calling nvme_tcp_data_ready
in napi context (softirq) which in turn triggers the queue thread to
handle network rx (and complete the I/O). It's been measured recently
that running the rx context directly in softirq will save some
latency (possible because nvme-tcp rx context is non-blocking).

So I'd think that patch #2 is unnecessary and just adds overhead for
nvme-tcp. Do note that the napi softirq cpu mapping depends on the RSS
steering, which is unlikely to match rq->mq_ctx->cpu, hence if completed
from napi context, nvme-tcp will probably always go to the IPI path.



* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-10-29 20:03                                 ` Sagi Grimberg
@ 2020-10-29 21:01                                   ` Sebastian Andrzej Siewior
  2020-10-29 21:07                                     ` Sagi Grimberg
  0 siblings, 1 reply; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-10-29 21:01 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, linux-block, Thomas Gleixner, David Runge,
	linux-rt-users, Jens Axboe, linux-kernel, Peter Zijlstra,
	Daniel Wagner, Mike Galbraith

On 2020-10-29 13:03:26 [-0700], Sagi Grimberg wrote:
> 
> > > > Well, usb-storage obviously seems to do it, and the block layer
> > > > does not prohibit it.
> > > 
> > > Also loop, nvme-tcp and then I stopped looking.
> > > Any objections about adding local_bh_disable() around it?
> > 
> > To me it seems like the whole IPI plus potentially softirq dance is
> > a little pointless when completing from process context.
> 
> I agree.
> 
> > Sagi, any opinion on that from the nvme-tcp POV?
> 
> nvme-tcp should (almost) always complete from the context that matches
> the rq->mq_ctx->cpu as the thread that processes incoming
> completions (per hctx) should be affinitized to match it (unless cpus
> come and go).

in which context? But this is probably nr_hw_queues > 1?

> So for nvme-tcp I don't expect blk_mq_complete_need_ipi to return true
> in normal operation. That leaves the teardowns+aborts, which aren't very
> interesting here.

The process context invocation is nvme_tcp_complete_timed_out().

> I would note that nvme-tcp does not go to sleep after completing every
> I/O like how sebastian indicated usb does.
> 
> Having said that, today the network stack is calling nvme_tcp_data_ready
> in napi context (softirq) which in turn triggers the queue thread to
> handle network rx (and complete the I/O). It's been measured recently
> that running the rx context directly in softirq will save some
> latency (possible because nvme-tcp rx context is non-blocking).
> 
> So I'd think that patch #2 is unnecessary and just add overhead for
> nvme-tcp.. do note that the napi softirq cpu mapping depends on the RSS
> steering, which is unlikely to match rq->mq_ctx->cpu, hence if completed
> from napi context, nvme-tcp will probably always go to the IPI path.

but running it in softirq on the remote CPU would still allow other
packets to come in on the remote CPU (which would block the BLOCK
softirq if NET_RX is already running).

Sebastian


* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-10-29 21:01                                   ` Sebastian Andrzej Siewior
@ 2020-10-29 21:07                                     ` Sagi Grimberg
  2020-10-31 10:41                                       ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 34+ messages in thread
From: Sagi Grimberg @ 2020-10-29 21:07 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Christoph Hellwig, linux-block, Thomas Gleixner, David Runge,
	linux-rt-users, Jens Axboe, linux-kernel, Peter Zijlstra,
	Daniel Wagner, Mike Galbraith


>>>>> Well, usb-storage obviously seems to do it, and the block layer
>>>>> does not prohibit it.
>>>>
>>>> Also loop, nvme-tcp and then I stopped looking.
>>>> Any objections about adding local_bh_disable() around it?
>>>
>>> To me it seems like the whole IPI plus potentially softirq dance is
>>> a little pointless when completing from process context.
>>
>> I agree.
>>
>>> Sagi, any opinion on that from the nvme-tcp POV?
>>
>> nvme-tcp should (almost) always complete from the context that matches
>> the rq->mq_ctx->cpu as the thread that processes incoming
>> completions (per hctx) should be affinitized to match it (unless cpus
>> come and go).
> 
> in which context?

Not sure what is the question.

> But this is probably nr_hw_queues > 1?

Yes.

>> So for nvme-tcp I don't expect blk_mq_complete_need_ipi to return true
>> in normal operation. That leaves the teardowns+aborts, which aren't very
>> interesting here.
> 
> The process context invocation is nvme_tcp_complete_timed_out().

Yes.

>> I would note that nvme-tcp does not go to sleep after completing every
>> I/O like how sebastian indicated usb does.
>>
>> Having said that, today the network stack is calling nvme_tcp_data_ready
>> in napi context (softirq) which in turn triggers the queue thread to
>> handle network rx (and complete the I/O). It's been measured recently
>> that running the rx context directly in softirq will save some
>> latency (possible because nvme-tcp rx context is non-blocking).
>>
>> So I'd think that patch #2 is unnecessary and just add overhead for
>> nvme-tcp.. do note that the napi softirq cpu mapping depends on the RSS
>> steering, which is unlikely to match rq->mq_ctx->cpu, hence if completed
>> from napi context, nvme-tcp will probably always go to the IPI path.
> 
> but running it in softirq on the remote CPU would still allow of other
> packets to come on the remote CPU (which would block BLOCK sofirq if
> NET_RX is already running).

Not sure I understand your comment: if napi triggers on core X and we
complete from that, it will trigger an IPI to core Y, and there, with
patch #2, it will trigger the softirq instead of calling ->complete
directly, no?


* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-10-29 21:07                                     ` Sagi Grimberg
@ 2020-10-31 10:41                                       ` Sebastian Andrzej Siewior
  2020-10-31 15:00                                         ` Jens Axboe
  0 siblings, 1 reply; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-10-31 10:41 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, linux-block, Thomas Gleixner, David Runge,
	linux-rt-users, Jens Axboe, linux-kernel, Peter Zijlstra,
	Daniel Wagner, Mike Galbraith

On 2020-10-29 14:07:59 [-0700], Sagi Grimberg wrote:
> > in which context?
> 
> Not sure what is the question.

The question is in which context you complete your requests. My guess
by now is "usually softirq/NAPI, and process context in the rare error
case".

> > But this is probably nr_hw_queues > 1?
> 
> Yes.

So this means it will either complete directly or issue an IPI.

> > but running it in softirq on the remote CPU would still allow of other
> > packets to come on the remote CPU (which would block BLOCK sofirq if
> > NET_RX is already running).
> 
> Not sure I understand your comment, if napi triggers on core X and we
> complete from that, it will trigger IPI to core Y, and there with patch #2
> is will trigger softirq instead of calling ->complete directly no?

This is correct. But triggering the softirq does not mean that it will
wake `ksoftirqd' as is the case for usb-storage right now. In your
case (completing from NAPI/softirq (or for most other drivers which
complete in their IRQ handler)) it means:
- trigger IPI
- IPI will OR the BLOCK-softirq bit.
- on exit from IPI it will invoke do_softirq() (unless softirq is
  already pending and got interrupted by the IPI) and complete the
  Block request.

Sebastian
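The three steps above — the IPI handler only ORs the BLOCK bit into the pending mask, and do_softirq() on IPI exit completes the request without waking ksoftirqd — can be modelled in userspace. The bit value and all names here are illustrative, not the kernel's actual definitions:

```c
#include <assert.h>

#define BLOCK_SOFTIRQ (1u << 2)	/* arbitrary bit for this sketch */

static unsigned int softirq_pending;	/* per-CPU pending mask */
static int block_handler_runs;		/* completions performed */

/* Step 2: the IPI handler merely sets the pending bit. */
static void raise_softirq_bit(unsigned int nr) { softirq_pending |= nr; }

/* Step 3: on exit from the IPI, pending softirqs run inline. */
static void do_softirq(void)
{
	if (softirq_pending & BLOCK_SOFTIRQ) {
		softirq_pending &= ~BLOCK_SOFTIRQ;
		block_handler_runs++;	/* complete the block request */
	}
}

/* What the remote CPU does on receiving the IPI. */
static void ipi_handler(void)
{
	raise_softirq_bit(BLOCK_SOFTIRQ);
	do_softirq();
}
```

Each IPI therefore ends with the block completion having already run on the target CPU, with no ksoftirqd wakeup involved.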


* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-10-31 10:41                                       ` Sebastian Andrzej Siewior
@ 2020-10-31 15:00                                         ` Jens Axboe
  2020-10-31 15:01                                           ` Jens Axboe
  2020-11-02  9:55                                           ` Sebastian Andrzej Siewior
  0 siblings, 2 replies; 34+ messages in thread
From: Jens Axboe @ 2020-10-31 15:00 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Sagi Grimberg
  Cc: Christoph Hellwig, linux-block, Thomas Gleixner, David Runge,
	linux-rt-users, linux-kernel, Peter Zijlstra, Daniel Wagner,
	Mike Galbraith

On 10/31/20 4:41 AM, Sebastian Andrzej Siewior wrote:
> On 2020-10-29 14:07:59 [-0700], Sagi Grimberg wrote:
>>> in which context?
>>
>> Not sure what is the question.
> 
> The question is in which context do you complete your requests. My guess
> by now is "usually softirq/NAPI and context in rare error case".

There really aren't any rules for this, and it's perfectly legit to
complete from process context. Maybe you're a kthread driven driver and
that's how you handle completions. The block completion path has always
been hard IRQ safe, but possible to call from anywhere.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-10-31 15:00                                         ` Jens Axboe
@ 2020-10-31 15:01                                           ` Jens Axboe
  2020-10-31 18:09                                             ` Christoph Hellwig
  2020-11-02  9:55                                           ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2020-10-31 15:01 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Sagi Grimberg
  Cc: Christoph Hellwig, linux-block, Thomas Gleixner, David Runge,
	linux-rt-users, linux-kernel, Peter Zijlstra, Daniel Wagner,
	Mike Galbraith

On 10/31/20 9:00 AM, Jens Axboe wrote:
> On 10/31/20 4:41 AM, Sebastian Andrzej Siewior wrote:
>> On 2020-10-29 14:07:59 [-0700], Sagi Grimberg wrote:
>>>> in which context?
>>>
>>> Not sure what is the question.
>>
>> The question is in which context you complete your requests. My guess
>> by now is "usually softirq/NAPI, and process context in rare error
>> cases".
> 
> There really aren't any rules for this, and it's perfectly legit to
> complete from process context. Maybe you're a kthread driven driver and
> that's how you handle completions. The block completion path has always
> been hard IRQ safe, but possible to call from anywhere.

A more recent example is polled IO, which will always complete from
process/task context and very much is fast path.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-10-31 15:01                                           ` Jens Axboe
@ 2020-10-31 18:09                                             ` Christoph Hellwig
  0 siblings, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2020-10-31 18:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Sebastian Andrzej Siewior, Sagi Grimberg, Christoph Hellwig,
	linux-block, Thomas Gleixner, David Runge, linux-rt-users,
	linux-kernel, Peter Zijlstra, Daniel Wagner, Mike Galbraith

On Sat, Oct 31, 2020 at 09:01:45AM -0600, Jens Axboe wrote:
> On 10/31/20 9:00 AM, Jens Axboe wrote:
> > On 10/31/20 4:41 AM, Sebastian Andrzej Siewior wrote:
> >> On 2020-10-29 14:07:59 [-0700], Sagi Grimberg wrote:
> >>>> in which context?
> >>>
> >>> Not sure what is the question.
> >>
> >> The question is in which context you complete your requests. My guess
> >> by now is "usually softirq/NAPI, and process context in rare error
> >> cases".
> > 
> > There really aren't any rules for this, and it's perfectly legit to
> > complete from process context. Maybe you're a kthread driven driver and
> > that's how you handle completions. The block completion path has always
> > been hard IRQ safe, but possible to call from anywhere.
> 
> A more recent example is polled IO, which will always complete from
> process/task context and very much is fast path.

But we never IPI for that anyway, so it is the easy case.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-10-31 15:00                                         ` Jens Axboe
  2020-10-31 15:01                                           ` Jens Axboe
@ 2020-11-02  9:55                                           ` Sebastian Andrzej Siewior
  2020-11-02 18:12                                             ` Christoph Hellwig
  1 sibling, 1 reply; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-11-02  9:55 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Sagi Grimberg, Christoph Hellwig, linux-block, Thomas Gleixner,
	David Runge, linux-rt-users, linux-kernel, Peter Zijlstra,
	Daniel Wagner, Mike Galbraith

On 2020-10-31 09:00:49 [-0600], Jens Axboe wrote:
> There really aren't any rules for this, and it's perfectly legit to
> complete from process context. Maybe you're a kthread driven driver and
> that's how you handle completions. The block completion path has always
> been hard IRQ safe, but possible to call from anywhere.

I'm not trying to put restrictions and forbidding completions from a
kthread. I'm trying to avoid the pointless softirq dance for no added
value. We could:

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4f53de48e5038..c4693b3750878 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -644,9 +644,11 @@ bool blk_mq_complete_request_remote(struct request *rq)
 	} else {
 		if (rq->q->nr_hw_queues > 1)
 			return false;
+		preempt_disable();
 		cpu_list = this_cpu_ptr(&blk_cpu_done);
 		if (llist_add(&rq->ipi_list, cpu_list))
 			raise_softirq(BLOCK_SOFTIRQ);
+		preempt_enable();
 	}
 
 	return true;

to not break that assumption you just mentioned and provide 
|static inline void blk_mq_complete_request_local(struct request *rq)
|{
|                 rq->q->mq_ops->complete(rq);
|}

so that completions issued from process context (like those from
usb-storage) don't end up waking `ksoftirqd' (running at SCHED_OTHER)
to complete the requests, but rather perform them right away. The
softirq dance makes no sense here.

As mentioned earlier, the alternative _could_ be to
	s/preempt_/local_bh_/

in the above patch. This would ensure that any invocation outside of
IRQ/Softirq context would invoke the softirq _directly_ at
local_bh_enable() time rather than waking the daemon for that purpose.
It would also avoid another completion function for the direct case
which could be abused if used from outside the thread context.
The last one is currently my favorite.

Sebastian

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-11-02  9:55                                           ` Sebastian Andrzej Siewior
@ 2020-11-02 18:12                                             ` Christoph Hellwig
  2020-11-04 19:15                                               ` Sagi Grimberg
  2020-11-06 15:23                                               ` Sebastian Andrzej Siewior
  0 siblings, 2 replies; 34+ messages in thread
From: Christoph Hellwig @ 2020-11-02 18:12 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Jens Axboe, Sagi Grimberg, Christoph Hellwig, linux-block,
	Thomas Gleixner, David Runge, linux-rt-users, linux-kernel,
	Peter Zijlstra, Daniel Wagner, Mike Galbraith

On Mon, Nov 02, 2020 at 10:55:33AM +0100, Sebastian Andrzej Siewior wrote:
> On 2020-10-31 09:00:49 [-0600], Jens Axboe wrote:
> > There really aren't any rules for this, and it's perfectly legit to
> > complete from process context. Maybe you're a kthread driven driver and
> > that's how you handle completions. The block completion path has always
> > been hard IRQ safe, but possible to call from anywhere.
> 
> I'm not trying to put restrictions and forbidding completions from a
> kthread. I'm trying to avoid the pointless softirq dance for no added
> value. We could:

> to not break that assumption you just mentioned and provide 
> |static inline void blk_mq_complete_request_local(struct request *rq)
> |{
> |                 rq->q->mq_ops->complete(rq);
> |}
> 
> so that completions issued from process context (like those from
> usb-storage) don't end up waking `ksoftirqd' (running at SCHED_OTHER)
> to complete the requests, but rather perform them right away. The
> softirq dance makes no sense here.

Agreed.  But I don't think your above blk_mq_complete_request_local
is all that useful either as ->complete is defined by the caller,
so we could just do a direct call.  Basically we should just
return false from blk_mq_complete_request_remote after updating
the state when called from process context.  But given that IIRC
we are not supposed to check what state we are called from
we'll need a helper just for updating the state instead and
ensure the driver uses the right helper.  Now of course we might
have process context callers that still want to bounce to the
submitting CPU, but in that case we should go directly to a
workqueue or similar.

Either way, doing this properly will probably involve an audit of all
drivers, but I think that is worth it.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-11-02 18:12                                             ` Christoph Hellwig
@ 2020-11-04 19:15                                               ` Sagi Grimberg
  2020-11-06 15:23                                               ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 34+ messages in thread
From: Sagi Grimberg @ 2020-11-04 19:15 UTC (permalink / raw)
  To: Christoph Hellwig, Sebastian Andrzej Siewior
  Cc: Jens Axboe, linux-block, Thomas Gleixner, David Runge,
	linux-rt-users, linux-kernel, Peter Zijlstra, Daniel Wagner,
	Mike Galbraith


>>> There really aren't any rules for this, and it's perfectly legit to
>>> complete from process context. Maybe you're a kthread driven driver and
>>> that's how you handle completions. The block completion path has always
>>> been hard IRQ safe, but possible to call from anywhere.
>>
>> I'm not trying to put restrictions and forbidding completions from a
>> kthread. I'm trying to avoid the pointless softirq dance for no added
>> value. We could:
> 
>> to not break that assumption you just mentioned and provide
>> |static inline void blk_mq_complete_request_local(struct request *rq)
>> |{
>> |                 rq->q->mq_ops->complete(rq);
>> |}
>>
>> so that completions issued from process context (like those from
>> usb-storage) don't end up waking `ksoftirqd' (running at SCHED_OTHER)
>> to complete the requests, but rather perform them right away. The
>> softirq dance makes no sense here.
> 
> Agreed.  But I don't think your above blk_mq_complete_request_local
> is all that useful either as ->complete is defined by the caller,
> so we could just do a direct call.  Basically we should just
> return false from blk_mq_complete_request_remote after updating
> the state when called from process context.

Agreed.

> But given that IIRC
> we are not supposed to check what state we are called from
> we'll need a helper just for updating the state instead and
> ensure the driver uses the right helper.  Now of course we might
> have process context callers that still want to bounce to the
> submitting CPU, but in that case we should go directly to a
> workqueue or similar.

This would mean that it may be suboptimal for nvme-tcp to complete
requests in softirq context from the network context (determined by
NIC steering), because in this case it would trigger a workqueue
schedule on a per-request basis rather than once per .data_ready
call like we do today. Is that correct?

It has been observed that completing commands in softirq context
(on the network-determined CPU) helps, because basically the
completion does IPI + local complete, not IPI + softirq or
IPI + workqueue.

> Either way, doing this properly will probably involve an audit of all
> drivers, but I think that is worth it.

Agree.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-11-02 18:12                                             ` Christoph Hellwig
  2020-11-04 19:15                                               ` Sagi Grimberg
@ 2020-11-06 15:23                                               ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-11-06 15:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Sagi Grimberg, linux-block, Thomas Gleixner,
	David Runge, linux-rt-users, linux-kernel, Peter Zijlstra,
	Daniel Wagner, Mike Galbraith

On 2020-11-02 18:12:38 [+0000], Christoph Hellwig wrote:
> > to not break that assumption you just mentioned and provide 
> > |static inline void blk_mq_complete_request_local(struct request *rq)
> > |{
> > |                 rq->q->mq_ops->complete(rq);
> > |}
> > 
> > so that completions issued from process context (like those from
> > usb-storage) don't end up waking `ksoftirqd' (running at SCHED_OTHER)
> > to complete the requests, but rather perform them right away. The
> > softirq dance makes no sense here.
> 
> Agreed.  But I don't think your above blk_mq_complete_request_local
> is all that useful either as ->complete is defined by the caller,
> so we could just do a direct call.
In the usb-storage case it is hidden somewhere in the SCSI stack, but
this can probably be changed later on.

>                                     Basically we should just
> return false from blk_mq_complete_request_remote after updating
> the state when called from process context.  But given that IIRC
> we are not supposed to check what state we are called from
> we'll need a helper just for updating the state instead and
> ensure the driver uses the right helper.  Now of course we might
> have process context callers that still want to bounce to the
> submitting CPU, but in that case we should go directly to a
> workqueue or similar.

So instead of blk_mq_complete_request_local() you want a helper to set
the state in which the completion function is invoked. Sounds more like
an argument :)

> Either way, doing this properly will probably involve an audit of all
> drivers, but I think that is worth it.

I'm lost. Should I repost the three patches with a preempt_disable()
section (as suggested) to not break preemptible callers? And then move
on from there to provide an alternative for callers from preemptible
context?

Sebastian

^ permalink raw reply	[flat|nested] 34+ messages in thread


Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-21 17:50 5.9.1-rt18: issues with Firewire card on AMD hardware David Runge
2020-10-23 11:04 ` [PATCH RFC] blk-mq: Don't IPI requests on PREEMPT_RT Sebastian Andrzej Siewior
2020-10-23 11:21   ` Christoph Hellwig
2020-10-23 13:52     ` Sebastian Andrzej Siewior
2020-10-27  9:26       ` Christoph Hellwig
2020-10-27 10:11         ` Sebastian Andrzej Siewior
2020-10-27 16:07           ` Christoph Hellwig
2020-10-27 17:05             ` Thomas Gleixner
2020-10-27 17:23               ` Christoph Hellwig
2020-10-27 17:59                 ` Sebastian Andrzej Siewior
2020-10-27 20:58                 ` Sebastian Andrzej Siewior
2020-10-28  6:56                   ` Christoph Hellwig
2020-10-28 14:12                     ` [PATCH 1/3] blk-mq: Don't complete on a remote CPU in force threaded mode Sebastian Andrzej Siewior
2020-10-28 14:12                       ` [PATCH 2/3] blk-mq: Always complete remote completions requests in softirq Sebastian Andrzej Siewior
2020-10-28 14:12                       ` [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done Sebastian Andrzej Siewior
2020-10-28 14:44                         ` Christoph Hellwig
2020-10-28 14:47                           ` Sebastian Andrzej Siewior
2020-10-29 13:12                         ` Sebastian Andrzej Siewior
2020-10-29 14:05                           ` Christoph Hellwig
2020-10-29 14:56                             ` Sebastian Andrzej Siewior
2020-10-29 14:57                               ` Christoph Hellwig
2020-10-29 20:03                                 ` Sagi Grimberg
2020-10-29 21:01                                   ` Sebastian Andrzej Siewior
2020-10-29 21:07                                     ` Sagi Grimberg
2020-10-31 10:41                                       ` Sebastian Andrzej Siewior
2020-10-31 15:00                                         ` Jens Axboe
2020-10-31 15:01                                           ` Jens Axboe
2020-10-31 18:09                                             ` Christoph Hellwig
2020-11-02  9:55                                           ` Sebastian Andrzej Siewior
2020-11-02 18:12                                             ` Christoph Hellwig
2020-11-04 19:15                                               ` Sagi Grimberg
2020-11-06 15:23                                               ` Sebastian Andrzej Siewior
2020-10-28 10:04                 ` [PATCH RFC] blk-mq: Don't IPI requests on PREEMPT_RT Peter Zijlstra
2020-10-26  0:37 ` 5.9.1-rt18: issues with Firewire card on AMD hardware David Runge
