All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3 v2] blk-mq: Don't complete in IRQ, use llist_head
@ 2020-12-04 19:13 Sebastian Andrzej Siewior
  2020-12-04 19:13 ` [PATCH 1/3] blk-mq: Don't complete on a remote CPU in force threaded mode Sebastian Andrzej Siewior
                   ` (2 more replies)
  0 siblings, 3 replies; 54+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-12-04 19:13 UTC (permalink / raw)
  To: linux-block
  Cc: Jens Axboe, Thomas Gleixner, Peter Zijlstra, Daniel Wagner,
	Mike Galbraith, Christoph Hellwig, Sagi Grimberg

This a repost of the patches in the old thread [0] which died, rebase against
-next.

The series avoids completing the requests on a remote CPU if booted with
threadirqs. It avoids completing requests in hard-IRQ context on remote
CPU by deferring it to the the softirq context.

One change since the last post: preempt-disable() around llist_add() +
raise_softirq() to ensure that request is added on the same CPU where
the softirq is raised. Some callers (like usb-storage) invoke this
function from preemtible context and this preserves the current "call me
from any context" semantic.
My understanding is that in a later attempt we may change such callers
to complete directly and avoid the softirq ping-pong.

[0] https://lkml.kernel.org/r/20201028141251.3608598-1-bigeasy@linutronix.de

Sebastian



^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH 1/3] blk-mq: Don't complete on a remote CPU in force threaded mode
  2020-12-04 19:13 [PATCH 0/3 v2] blk-mq: Don't complete in IRQ, use llist_head Sebastian Andrzej Siewior
@ 2020-12-04 19:13 ` Sebastian Andrzej Siewior
  2020-12-08 13:10   ` Christoph Hellwig
  2020-12-04 19:13 ` [PATCH 2/3] blk-mq: Always complete remote completions requests in softirq Sebastian Andrzej Siewior
  2020-12-04 19:13 ` [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done Sebastian Andrzej Siewior
  2 siblings, 1 reply; 54+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-12-04 19:13 UTC (permalink / raw)
  To: linux-block
  Cc: Jens Axboe, Thomas Gleixner, Peter Zijlstra, Daniel Wagner,
	Mike Galbraith, Christoph Hellwig, Sagi Grimberg,
	Sebastian Andrzej Siewior

With force threaded interrupts enabled, raising softirq from an SMP
function call will always result in waking the ksoftirqd thread. This is
not optimal given that the thread runs at SCHED_OTHER priority.

Completing the request in hard IRQ-context on PREEMPT_RT (which enforces
the force threaded mode) is bad because the completion handler may
acquire sleeping locks which violate the locking context.

Disable request completing on a remote CPU in force threaded mode.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 block/blk-mq.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b9c2efab5db78..7091bb097c63f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -650,6 +650,14 @@ static inline bool blk_mq_complete_need_ipi(struct request *rq)
 	if (!IS_ENABLED(CONFIG_SMP) ||
 	    !test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags))
 		return false;
+	/*
+	 * With force threaded interrupts enabled, raising softirq from an SMP
+	 * function call will always result in waking the ksoftirqd thread.
+	 * This is probably worse than completing the request on a different
+	 * cache domain.
+	 */
+	if (force_irqthreads)
+		return false;
 
 	/* same CPU or cache domain?  Complete locally */
 	if (cpu == rq->mq_ctx->cpu ||
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 2/3] blk-mq: Always complete remote completions requests in softirq
  2020-12-04 19:13 [PATCH 0/3 v2] blk-mq: Don't complete in IRQ, use llist_head Sebastian Andrzej Siewior
  2020-12-04 19:13 ` [PATCH 1/3] blk-mq: Don't complete on a remote CPU in force threaded mode Sebastian Andrzej Siewior
@ 2020-12-04 19:13 ` Sebastian Andrzej Siewior
  2020-12-07 23:52   ` Jens Axboe
  2020-12-04 19:13 ` [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done Sebastian Andrzej Siewior
  2 siblings, 1 reply; 54+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-12-04 19:13 UTC (permalink / raw)
  To: linux-block
  Cc: Jens Axboe, Thomas Gleixner, Peter Zijlstra, Daniel Wagner,
	Mike Galbraith, Christoph Hellwig, Sagi Grimberg,
	Sebastian Andrzej Siewior

Controllers with multiple queues have their IRQ-handelers pinned to a
CPU. The core shouldn't need to complete the request on a remote CPU.

Remove this case and always raise the softirq to complete the request.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 block/blk-mq.c | 14 +-------------
 1 file changed, 1 insertion(+), 13 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7091bb097c63f..3c0e94913d874 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -628,19 +628,7 @@ static void __blk_mq_complete_request_remote(void *data)
 {
 	struct request *rq = data;
 
-	/*
-	 * For most of single queue controllers, there is only one irq vector
-	 * for handling I/O completion, and the only irq's affinity is set
-	 * to all possible CPUs.  On most of ARCHs, this affinity means the irq
-	 * is handled on one specific CPU.
-	 *
-	 * So complete I/O requests in softirq context in case of single queue
-	 * devices to avoid degrading I/O performance due to irqsoff latency.
-	 */
-	if (rq->q->nr_hw_queues == 1)
-		blk_mq_trigger_softirq(rq);
-	else
-		rq->q->mq_ops->complete(rq);
+	blk_mq_trigger_softirq(rq);
 }
 
 static inline bool blk_mq_complete_need_ipi(struct request *rq)
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-12-04 19:13 [PATCH 0/3 v2] blk-mq: Don't complete in IRQ, use llist_head Sebastian Andrzej Siewior
  2020-12-04 19:13 ` [PATCH 1/3] blk-mq: Don't complete on a remote CPU in force threaded mode Sebastian Andrzej Siewior
  2020-12-04 19:13 ` [PATCH 2/3] blk-mq: Always complete remote completions requests in softirq Sebastian Andrzej Siewior
@ 2020-12-04 19:13 ` Sebastian Andrzej Siewior
  2020-12-08 13:20   ` Christoph Hellwig
  2 siblings, 1 reply; 54+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-12-04 19:13 UTC (permalink / raw)
  To: linux-block
  Cc: Jens Axboe, Thomas Gleixner, Peter Zijlstra, Daniel Wagner,
	Mike Galbraith, Christoph Hellwig, Sagi Grimberg,
	Sebastian Andrzej Siewior

With llist_head it is possible to avoid the locking (the irq-off region)
when items are added. This makes it possible to add items on a remote
CPU.
llist_add() returns true if the list was previously empty. This can be
used to invoke the SMP function call / raise sofirq only if the first
item was added (otherwise it is already pending).
This simplifies the code a little and reduces the IRQ-off regions. With
this change it possible to reduce the SMP-function call a simple
__raise_softirq_irqoff().
blk_mq_complete_request_remote() needs a preempt-disable section if the
request needs to complete on the local CPU. Some callers (USB-storage)
invoke this preemptible context and the request needs to be enqueued on
the same CPU as the softirq is raised.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 block/blk-mq.c         | 77 ++++++++++++++----------------------------
 include/linux/blkdev.h |  2 +-
 2 files changed, 27 insertions(+), 52 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3c0e94913d874..b5138327952a4 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -41,7 +41,7 @@
 #include "blk-mq-sched.h"
 #include "blk-rq-qos.h"
 
-static DEFINE_PER_CPU(struct list_head, blk_cpu_done);
+static DEFINE_PER_CPU(struct llist_head, blk_cpu_done);
 
 static void blk_mq_poll_stats_start(struct request_queue *q);
 static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
@@ -567,68 +567,32 @@ void blk_mq_end_request(struct request *rq, blk_status_t error)
 }
 EXPORT_SYMBOL(blk_mq_end_request);
 
-/*
- * Softirq action handler - move entries to local list and loop over them
- * while passing them to the queue registered handler.
- */
-static __latent_entropy void blk_done_softirq(struct softirq_action *h)
+static void blk_complete_reqs(struct llist_head *cpu_list)
 {
-	struct list_head *cpu_list, local_list;
+	struct llist_node *entry;
+	struct request *rq, *rq_next;
 
-	local_irq_disable();
-	cpu_list = this_cpu_ptr(&blk_cpu_done);
-	list_replace_init(cpu_list, &local_list);
-	local_irq_enable();
+	entry = llist_del_all(cpu_list);
+	entry = llist_reverse_order(entry);
 
-	while (!list_empty(&local_list)) {
-		struct request *rq;
-
-		rq = list_entry(local_list.next, struct request, ipi_list);
-		list_del_init(&rq->ipi_list);
+	llist_for_each_entry_safe(rq, rq_next, entry, ipi_list)
 		rq->q->mq_ops->complete(rq);
-	}
 }
 
-static void blk_mq_trigger_softirq(struct request *rq)
+static __latent_entropy void blk_done_softirq(struct softirq_action *h)
 {
-	struct list_head *list;
-	unsigned long flags;
-
-	local_irq_save(flags);
-	list = this_cpu_ptr(&blk_cpu_done);
-	list_add_tail(&rq->ipi_list, list);
-
-	/*
-	 * If the list only contains our just added request, signal a raise of
-	 * the softirq.  If there are already entries there, someone already
-	 * raised the irq but it hasn't run yet.
-	 */
-	if (list->next == &rq->ipi_list)
-		raise_softirq_irqoff(BLOCK_SOFTIRQ);
-	local_irq_restore(flags);
+	blk_complete_reqs(this_cpu_ptr(&blk_cpu_done));
 }
 
 static int blk_softirq_cpu_dead(unsigned int cpu)
 {
-	/*
-	 * If a CPU goes away, splice its entries to the current CPU
-	 * and trigger a run of the softirq
-	 */
-	local_irq_disable();
-	list_splice_init(&per_cpu(blk_cpu_done, cpu),
-			 this_cpu_ptr(&blk_cpu_done));
-	raise_softirq_irqoff(BLOCK_SOFTIRQ);
-	local_irq_enable();
-
+	blk_complete_reqs(&per_cpu(blk_cpu_done, cpu));
 	return 0;
 }
 
-
 static void __blk_mq_complete_request_remote(void *data)
 {
-	struct request *rq = data;
-
-	blk_mq_trigger_softirq(rq);
+	__raise_softirq_irqoff(BLOCK_SOFTIRQ);
 }
 
 static inline bool blk_mq_complete_need_ipi(struct request *rq)
@@ -659,6 +623,7 @@ static inline bool blk_mq_complete_need_ipi(struct request *rq)
 
 bool blk_mq_complete_request_remote(struct request *rq)
 {
+	struct llist_head *cpu_list;
 	WRITE_ONCE(rq->state, MQ_RQ_COMPLETE);
 
 	/*
@@ -669,12 +634,22 @@ bool blk_mq_complete_request_remote(struct request *rq)
 		return false;
 
 	if (blk_mq_complete_need_ipi(rq)) {
-		INIT_CSD(&rq->csd, __blk_mq_complete_request_remote, rq);
-		smp_call_function_single_async(rq->mq_ctx->cpu, &rq->csd);
+		unsigned int cpu;
+
+		cpu = rq->mq_ctx->cpu;
+		cpu_list = &per_cpu(blk_cpu_done, cpu);
+		if (llist_add(&rq->ipi_list, cpu_list)) {
+			INIT_CSD(&rq->csd, __blk_mq_complete_request_remote, rq);
+			smp_call_function_single_async(cpu, &rq->csd);
+		}
 	} else {
 		if (rq->q->nr_hw_queues > 1)
 			return false;
-		blk_mq_trigger_softirq(rq);
+		preempt_disable();
+		cpu_list = this_cpu_ptr(&blk_cpu_done);
+		if (llist_add(&rq->ipi_list, cpu_list))
+			raise_softirq(BLOCK_SOFTIRQ);
+		preempt_enable();
 	}
 
 	return true;
@@ -3905,7 +3880,7 @@ static int __init blk_mq_init(void)
 	int i;
 
 	for_each_possible_cpu(i)
-		INIT_LIST_HEAD(&per_cpu(blk_cpu_done, i));
+		init_llist_head(&per_cpu(blk_cpu_done, i));
 	open_softirq(BLOCK_SOFTIRQ, blk_done_softirq);
 
 	cpuhp_setup_state_nocalls(CPUHP_BLOCK_SOFTIRQ_DEAD,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 17cedf0dc83db..7b05390766eec 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -156,7 +156,7 @@ struct request {
 	 */
 	union {
 		struct hlist_node hash;	/* merge hash */
-		struct list_head ipi_list;
+		struct llist_node ipi_list;
 	};
 
 	/*
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/3] blk-mq: Always complete remote completions requests in softirq
  2020-12-04 19:13 ` [PATCH 2/3] blk-mq: Always complete remote completions requests in softirq Sebastian Andrzej Siewior
@ 2020-12-07 23:52   ` Jens Axboe
  2020-12-08  8:22     ` Sebastian Andrzej Siewior
  2020-12-08 13:13     ` Christoph Hellwig
  0 siblings, 2 replies; 54+ messages in thread
From: Jens Axboe @ 2020-12-07 23:52 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-block
  Cc: Thomas Gleixner, Peter Zijlstra, Daniel Wagner, Mike Galbraith,
	Christoph Hellwig, Sagi Grimberg

On 12/4/20 12:13 PM, Sebastian Andrzej Siewior wrote:
> Controllers with multiple queues have their IRQ-handelers pinned to a
> CPU. The core shouldn't need to complete the request on a remote CPU.
> 
> Remove this case and always raise the softirq to complete the request.

I don't like this one at all, it'll add a softirq jump for the fast path
for eg nvme devices. Did you run any performance testing with this? I
can give it a spin, will do so anyway, but was curious if anything but
"this still works" testing was done.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/3] blk-mq: Always complete remote completions requests in softirq
  2020-12-07 23:52   ` Jens Axboe
@ 2020-12-08  8:22     ` Sebastian Andrzej Siewior
  2020-12-08  8:44       ` Daniel Wagner
  2020-12-08 13:13     ` Christoph Hellwig
  1 sibling, 1 reply; 54+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-12-08  8:22 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Thomas Gleixner, Peter Zijlstra, Daniel Wagner,
	Mike Galbraith, Christoph Hellwig, Sagi Grimberg

On 2020-12-07 16:52:57 [-0700], Jens Axboe wrote:
> On 12/4/20 12:13 PM, Sebastian Andrzej Siewior wrote:
> > Controllers with multiple queues have their IRQ-handelers pinned to a
> > CPU. The core shouldn't need to complete the request on a remote CPU.
> > 
> > Remove this case and always raise the softirq to complete the request.
> 
> I don't like this one at all, it'll add a softirq jump for the fast path
> for eg nvme devices. Did you run any performance testing with this? I
> can give it a spin, will do so anyway, but was curious if anything but
> "this still works" testing was done.

I understood that the nvme devices have a queue per CPU and don't need
to complete on a remote-CPU in general.
Also, they don't jump to softirq but invoke softirq on the return from
IRQ. So if softirq is already busy because of network then softirq of
the block layer is delayed. Otherwise there should be no significant
delay if the CPU is idle (so the IRQ is handled and softirq right after
it).

Sagi mentioned nvme-tcp as a user of this remote completion and Daniel
has been kind to run some nvme-tcp tests.

> -- 
> Jens Axboe
> 

Sebastian

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/3] blk-mq: Always complete remote completions requests in softirq
  2020-12-08  8:22     ` Sebastian Andrzej Siewior
@ 2020-12-08  8:44       ` Daniel Wagner
  2020-12-08 11:36         ` Daniel Wagner
  2020-12-17 16:45         ` Jens Axboe
  0 siblings, 2 replies; 54+ messages in thread
From: Daniel Wagner @ 2020-12-08  8:44 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Jens Axboe, linux-block, Thomas Gleixner, Peter Zijlstra,
	Mike Galbraith, Christoph Hellwig, Sagi Grimberg

On Tue, Dec 08, 2020 at 09:22:20AM +0100, Sebastian Andrzej Siewior wrote:
> Sagi mentioned nvme-tcp as a user of this remote completion and Daniel
> has been kind to run some nvme-tcp tests.

I've started with some benchmarking. The first thing I tried is to find
a setup where the remote path is taken. I found a setup with nvme-fc
with a workload which results in ca 10% remote completion.

Setup:
 - NVMe over FC
 - 2x Emulex LPe36000 32Gb PCIe Fibre Channel Adapter
 - 8 mpaths
 - 4 E7-4820 v3, 80 cores

Workload:
 - fio --rw=randwrite --name=test --size=50M \
       --iodepth=32 --direct=0 --bs=4k --numjobs=40 \
       --time_based --runtime=1h --ioengine=libaio \
       --group_reporting

(I played a bit around with different workloads, most of them
 wont use the remote completion path)

I've annotated the code with a counter and exported it via
debugfs.

@@ -671,6 +673,8 @@ bool blk_mq_complete_request_remote(struct request *rq)
                return false;

        if (blk_mq_complete_need_ipi(rq)) {
+               ctx->rq_remote++;
+
                rq->csd.func = __blk_mq_complete_request_remote;
                rq->csd.info = rq;
                rq->csd.flags = 0;


And hacked a small script to collect the data.

- Baseline (5.10-rc7)

  Starting 40 processes
  Jobs: 40 (f=40): [w(40)][100.0%][r=0KiB/s,w=12.0GiB/s][r=0,w=3405k IOPS][eta 00m:00s]
  test: (groupid=0, jobs=40): err= 0: pid=14225: Mon Dec  7 20:09:57 2020
    write: IOPS=3345k, BW=12.8GiB/s (13.7GB/s)(44.9TiB/3600002msec)
      slat (usec): min=2, max=90772, avg= 9.43, stdev=10.67
      clat (usec): min=2, max=91343, avg=371.79, stdev=119.52
       lat (usec): min=5, max=91358, avg=381.31, stdev=122.45
      clat percentiles (usec):
       |  1.00th=[  231],  5.00th=[  245], 10.00th=[  253], 20.00th=[  273],
       | 30.00th=[  293], 40.00th=[  322], 50.00th=[  351], 60.00th=[  388],
       | 70.00th=[  420], 80.00th=[  465], 90.00th=[  529], 95.00th=[  570],
       | 99.00th=[  644], 99.50th=[  676], 99.90th=[  750], 99.95th=[  783],
       | 99.99th=[  873]
     bw (  KiB/s): min=107333, max=749152, per=2.51%, avg=335200.07, stdev=87628.57, samples=288000
     iops        : min=26833, max=187286, avg=83799.66, stdev=21907.09, samples=288000
    lat (usec)   : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
    lat (usec)   : 250=8.04%, 500=6.79%, 750=13.75%, 1000=0.09%
    lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
    lat (msec)   : 100=0.01%
    cpu          : usr=29.14%, sys=70.83%, ctx=320219, majf=0, minf=13056
    IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=28.7%, >=64=0.0%
       submit    : 0=0.0%, 4=28.7%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
       complete  : 0=0.0%, 4=28.7%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
       issued rwts: total=0,12042333583,0,0 short=0,0,0,0 dropped=0,0,0,0
       latency   : target=0, window=0, percentile=100.00%, depth=32

  Run status group 0 (all jobs):
    WRITE: bw=12.8GiB/s (13.7GB/s), 12.8GiB/s-12.8GiB/s (13.7GB/s-13.7GB/s), io=44.9TiB (49.3TB), run=3600002-3600002msec

  Disk stats (read/write):
    nvme5n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

- Patched

  Jobs: 40 (f=40): [w(40)][100.0%][r=0KiB/s,w=12.9GiB/s][r=0,w=3383k IOPS][eta 00m:00s]
  test: (groupid=0, jobs=40): err= 0: pid=13413: Mon Dec  7 21:31:01 2020
    write: IOPS=3371k, BW=12.9GiB/s (13.8GB/s)(45.2TiB/3600004msec)
      slat (nsec): min=1984, max=90341k, avg=9308.73, stdev=7068.58
      clat (usec): min=2, max=91259, avg=368.94, stdev=118.31
       lat (usec): min=5, max=91269, avg=378.34, stdev=121.43
      clat percentiles (usec):
       |  1.00th=[  231],  5.00th=[  245], 10.00th=[  255], 20.00th=[  277],
       | 30.00th=[  302], 40.00th=[  318], 50.00th=[  334], 60.00th=[  359],
       | 70.00th=[  392], 80.00th=[  433], 90.00th=[  562], 95.00th=[  635],
       | 99.00th=[  693], 99.50th=[  709], 99.90th=[  766], 99.95th=[  816],
       | 99.99th=[  914]
     bw (  KiB/s): min=124304, max=770204, per=2.50%, avg=337559.07, stdev=91383.66, samples=287995
     iops        : min=31076, max=192551, avg=84389.45, stdev=22845.91, samples=287995
    lat (usec)   : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
    lat (usec)   : 250=7.44%, 500=7.84%, 750=13.79%, 1000=0.15%
    lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
    lat (msec)   : 100=0.01%
    cpu          : usr=30.30%, sys=69.69%, ctx=179950, majf=0, minf=7403
    IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=29.2%, >=64=0.0%
       submit    : 0=0.0%, 4=29.2%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
       complete  : 0=0.0%, 4=29.2%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
       issued rwts: total=0,12135617715,0,0 short=0,0,0,0 dropped=0,0,0,0
       latency   : target=0, window=0, percentile=100.00%, depth=32

  Run status group 0 (all jobs):
    WRITE: bw=12.9GiB/s (13.8GB/s), 12.9GiB/s-12.9GiB/s (13.8GB/s-13.8GB/s), io=45.2TiB (49.7TB), run=3600004-3600004msec

  Disk stats (read/write):
    nvme1n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

- Baseline

  nvme5c0n1 completed     411218 remote      38777 9.43%
  nvme5c0n2 completed          0 remote          0 0.00%
  nvme5c1n1 completed     411270 remote      38770 9.43%
  nvme5c1n2 completed         50 remote          0 0.00%
  nvme5c2n1 completed          0 remote          0 0.00%
  nvme5c2n2 completed          0 remote          0 0.00%
  nvme5c3n1 completed     411220 remote      38751 9.42%
  nvme5c3n2 completed          0 remote          0 0.00%
  nvme5c4n1 completed          0 remote          0 0.00%
  nvme5c4n2 completed          0 remote          0 0.00%
  nvme5c5n1 completed          0 remote          0 0.00%
  nvme5c5n2 completed          0 remote          0 0.00%
  nvme5c6n1 completed     411216 remote      38759 9.43%
  nvme5c6n2 completed          0 remote          0 0.00%
  nvme5c7n1 completed          0 remote          0 0.00%
  nvme5c7n2 completed          0 remote          0 0.00%

- Patched

  nvme1c0n1 completed          0 remote          0 0.00%
  nvme1c0n2 completed          0 remote          0 0.00%
  nvme1c1n1 completed     172202 remote      17813 10.34%
  nvme1c1n2 completed         50 remote          0 0.00%
  nvme1c2n1 completed     172147 remote      17831 10.36%
  nvme1c2n2 completed          0 remote          0 0.00%
  nvme1c3n1 completed          0 remote          0 0.00%
  nvme1c3n2 completed          0 remote          0 0.00%
  nvme1c4n1 completed     172159 remote      17825 10.35%
  nvme1c4n2 completed          0 remote          0 0.00%
  nvme1c5n1 completed          0 remote          0 0.00%
  nvme1c5n2 completed          0 remote          0 0.00%
  nvme1c6n1 completed          0 remote          0 0.00%
  nvme1c6n2 completed          0 remote          0 0.00%
  nvme1c7n1 completed     172156 remote      17781 10.33%
  nvme1c7n2 completed          0 remote          0 0.00%


It looks like the patched version show tiny bit better numbers for this
workload. slat seems to be one of the major difference between the two
runs. But that is the only thing I really spotted to be really off.

I keep going with some more testing. Let what kind of tests you would
also want to see. I'll do a few plain NVMe tests next.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/3] blk-mq: Always complete remote completions requests in softirq
  2020-12-08  8:44       ` Daniel Wagner
@ 2020-12-08 11:36         ` Daniel Wagner
  2020-12-08 11:49           ` Sebastian Andrzej Siewior
  2020-12-17 16:45         ` Jens Axboe
  1 sibling, 1 reply; 54+ messages in thread
From: Daniel Wagner @ 2020-12-08 11:36 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Jens Axboe, linux-block, Thomas Gleixner, Peter Zijlstra,
	Mike Galbraith, Christoph Hellwig, Sagi Grimberg,
	Hannes Reinecke

On Tue, Dec 08, 2020 at 09:44:10AM +0100, Daniel Wagner wrote:
> On Tue, Dec 08, 2020 at 09:22:20AM +0100, Sebastian Andrzej Siewior wrote:
> > Sagi mentioned nvme-tcp as a user of this remote completion and Daniel
> > has been kind to run some nvme-tcp tests.
> 
> I've started with some benchmarking. The first thing I tried is to find
> a setup where the remote path is taken. I found a setup with nvme-fc
> with a workload which results in ca 10% remote completion.

Setup:
  - Dell Express Flash NVMe PM1725 800GB SFF
  - 2 Gold 6130, 64 cores

Workload:
  - fio --rw=randread --name=test --filename=/dev/nvme0n1 \
        --iodepth=64 --direct=1 --bs=4k --numjobs=32 \
        --time_based --runtime=5m --ioengine=libaio --group_reporting

(Searched for a workload with the highest IOPs which seems to be
randread)

Obvious in this configuration there are no remote completions (verified
it).

- baseline 5.10-rc7

Jobs: 32 (f=32): [r(32)][100.0%][r=2544MiB/s][r=651k IOPS][eta 00m:00s]
test: (groupid=0, jobs=32): err= 0: pid=24118: Tue Dec  8 11:33:21 2020
  read: IOPS=636k, BW=2485MiB/s (2605MB/s)(728GiB/300006msec)
    slat (nsec): min=1502, max=450956, avg=5576.99, stdev=1475.94
    clat (usec): min=195, max=59296, avg=3212.51, stdev=1640.48
     lat (usec): min=201, max=59302, avg=3218.23, stdev=1640.58
    clat percentiles (usec):
     |  1.00th=[ 2573],  5.00th=[ 2671], 10.00th=[ 2769], 20.00th=[ 2868],
     | 30.00th=[ 2933], 40.00th=[ 2999], 50.00th=[ 3064], 60.00th=[ 3163],
     | 70.00th=[ 3261], 80.00th=[ 3359], 90.00th=[ 3589], 95.00th=[ 3818],
     | 99.00th=[ 4948], 99.50th=[ 5669], 99.90th=[40633], 99.95th=[44303],
     | 99.99th=[49021]
   bw (  MiB/s): min=  444, max= 2598, per=99.99%, avg=2484.33, stdev= 8.36, samples=19200
   iops        : min=113782, max=665312, avg=635988.04, stdev=2139.63, samples=19200
  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=96.85%, 10=2.96%, 20=0.01%, 50=0.16%
  lat (msec)   : 100=0.01%
  cpu          : usr=9.11%, sys=14.58%, ctx=131930047, majf=0, minf=28434
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=190817510,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=2485MiB/s (2605MB/s), 2485MiB/s-2485MiB/s (2605MB/s-2605MB/s), io=728GiB (782GB), run=300006-300006msec

Disk stats (read/write):
  nvme0n1: ios=190707084/0, merge=0/0, ticks=611781701/0, in_queue=611781702, util=100.00%


- patched

Jobs: 32 (f=32): [r(32)][100.0%][r=2548MiB/s][r=652k IOPS][eta 00m:00s]
test: (groupid=0, jobs=32): err= 0: pid=3059: Tue Dec  8 12:11:25 2020
  read: IOPS=637k, BW=2489MiB/s (2610MB/s)(729GiB/300006msec)
    slat (nsec): min=1453, max=4793.6k, avg=5662.01, stdev=1960.75
    clat (usec): min=77, max=59685, avg=3207.13, stdev=1633.85
     lat (usec): min=82, max=59696, avg=3212.92, stdev=1633.95
    clat percentiles (usec):
     |  1.00th=[ 2573],  5.00th=[ 2671], 10.00th=[ 2737], 20.00th=[ 2835],
     | 30.00th=[ 2933], 40.00th=[ 2999], 50.00th=[ 3064], 60.00th=[ 3163],
     | 70.00th=[ 3228], 80.00th=[ 3359], 90.00th=[ 3556], 95.00th=[ 3785],
     | 99.00th=[ 4948], 99.50th=[ 5669], 99.90th=[40633], 99.95th=[43779],
     | 99.99th=[49021]
   bw (  MiB/s): min=  560, max= 2617, per=99.99%, avg=2488.34, stdev= 8.39, samples=19199
   iops        : min=143452, max=670006, avg=637013.93, stdev=2148.64, samples=19199
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=96.92%, 10=2.89%, 20=0.01%, 50=0.16%
  lat (msec)   : 100=0.01%
  cpu          : usr=9.32%, sys=14.88%, ctx=130862793, majf=0, minf=38825
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=191130719,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=2489MiB/s (2610MB/s), 2489MiB/s-2489MiB/s (2610MB/s-2610MB/s), io=729GiB (783GB), run=300006-300006msec

Disk stats (read/write):
  nvme0n1: ios=191019060/0, merge=0/0, ticks=611718395/0, in_queue=611718395, util=100.00%


Again, the numbers look very alike.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/3] blk-mq: Always complete remote completions requests in softirq
  2020-12-08 11:36         ` Daniel Wagner
@ 2020-12-08 11:49           ` Sebastian Andrzej Siewior
  2020-12-08 12:41             ` Daniel Wagner
  0 siblings, 1 reply; 54+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-12-08 11:49 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Jens Axboe, linux-block, Thomas Gleixner, Peter Zijlstra,
	Mike Galbraith, Christoph Hellwig, Sagi Grimberg,
	Hannes Reinecke

On 2020-12-08 12:36:53 [+0100], Daniel Wagner wrote:
> Obvious in this configuration there are no remote completions (verified
> it).

do you complete on a remote CPU if you limit the queues to one (this is
untested of course)?

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3be352403839a..f35224a64a56e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2126,7 +2126,7 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	 * If tags are shared with admin queue (Apple bug), then
 	 * make sure we only use one IO queue.
 	 */
-	if (dev->ctrl.quirks & NVME_QUIRK_SHARED_TAGS)
+	if (1 || dev->ctrl.quirks & NVME_QUIRK_SHARED_TAGS)
 		nr_io_queues = 1;
 	else
 		nr_io_queues = min(nvme_max_io_queues(dev),

> Thanks,
> Daniel

Sebastian

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/3] blk-mq: Always complete remote completions requests in softirq
  2020-12-08 11:49           ` Sebastian Andrzej Siewior
@ 2020-12-08 12:41             ` Daniel Wagner
  2020-12-08 12:52               ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 54+ messages in thread
From: Daniel Wagner @ 2020-12-08 12:41 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Jens Axboe, linux-block, Thomas Gleixner, Peter Zijlstra,
	Mike Galbraith, Christoph Hellwig, Sagi Grimberg,
	Hannes Reinecke

On Tue, Dec 08, 2020 at 12:49:36PM +0100, Sebastian Andrzej Siewior wrote:
> On 2020-12-08 12:36:53 [+0100], Daniel Wagner wrote:
> > Obvious in this configuration there are no remote completions (verified
> > it).
> 
> do you complete on a remote CPU if you limit the queues to one (this is
> untested of course)?

nvme0n1/ completed   11913011 remote    6718563 56.40%

yes, but how is this relevant? I thought Jens complain was about the
additional indirection via the softirq context

-		rq->q->mq_ops->complete(rq);
+	blk_mq_trigger_softirq(rq);

and not the remote completion path. I can benchmark it out but I don't
know if it's really helping in the discussion.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/3] blk-mq: Always complete remote completions requests in softirq
  2020-12-08 12:41             ` Daniel Wagner
@ 2020-12-08 12:52               ` Sebastian Andrzej Siewior
  2020-12-08 12:57                 ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 54+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-12-08 12:52 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Jens Axboe, linux-block, Thomas Gleixner, Peter Zijlstra,
	Mike Galbraith, Christoph Hellwig, Sagi Grimberg,
	Hannes Reinecke

On 2020-12-08 13:41:48 [+0100], Daniel Wagner wrote:
> On Tue, Dec 08, 2020 at 12:49:36PM +0100, Sebastian Andrzej Siewior wrote:
> > On 2020-12-08 12:36:53 [+0100], Daniel Wagner wrote:
> > > Obvious in this configuration there are no remote completions (verified
> > > it).
> > 
> > do you complete on a remote CPU if you limit the queues to one (this is
> > untested of course)?
> 
> nvme0n1/ completed   11913011 remote    6718563 56.40%
> 
> yes, but how is this relevant? I thought Jens complain was about the
> additional indirection via the softirq context
> 
> -		rq->q->mq_ops->complete(rq);
> +	blk_mq_trigger_softirq(rq);
> 
> and not the remote completion path. I can benchmark it out but I don't
> know if it's really helping in the discussion.

The only additional softirq path is for cross-CPU completion. If I
understood you correctly then your NVME device always completes locally
because the queue interrupt fires on the correct CPU.
If you take away the queues then you should have cross-CPU completion
since you have only one queue and this will now complete on the remote
CPU in softirq context (and not in IRQ as it used to).
If this single queue NVME device, which may complete on another CPU, is
not an issue / interesting because it is already limited then ignore
this.

Sebastian

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/3] blk-mq: Always complete remote completions requests in softirq
  2020-12-08 12:52               ` Sebastian Andrzej Siewior
@ 2020-12-08 12:57                 ` Sebastian Andrzej Siewior
  2020-12-08 13:27                   ` Daniel Wagner
  0 siblings, 1 reply; 54+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-12-08 12:57 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Jens Axboe, linux-block, Thomas Gleixner, Peter Zijlstra,
	Mike Galbraith, Christoph Hellwig, Sagi Grimberg,
	Hannes Reinecke

On 2020-12-08 13:52:25 [+0100], To Daniel Wagner wrote:
> On 2020-12-08 13:41:48 [+0100], Daniel Wagner wrote:
> > On Tue, Dec 08, 2020 at 12:49:36PM +0100, Sebastian Andrzej Siewior wrote:
> > > On 2020-12-08 12:36:53 [+0100], Daniel Wagner wrote:
> > > > Obvious in this configuration there are no remote completions (verified
> > > > it).
> > > 
> > > do you complete on a remote CPU if you limit the queues to one (this is
> > > untested of course)?
> > 
> > nvme0n1/ completed   11913011 remote    6718563 56.40%
> > 
> > yes, but how is this relevant? I thought Jens complain was about the
> > additional indirection via the softirq context
> > 
> > -		rq->q->mq_ops->complete(rq);
> > +	blk_mq_trigger_softirq(rq);
> > 
> > and not the remote completion path. I can benchmark it out but I don't
> > know if it's really helping in the discussion.
… blurp

Yes, you are right. Even cross-CPU completion for single-queue was
already completing in softirq. So the only change is for multiqueue
devices which you just demonstrated that it does not happen.
Thank you!

Sebastian

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 1/3] blk-mq: Don't complete on a remote CPU in force threaded mode
  2020-12-04 19:13 ` [PATCH 1/3] blk-mq: Don't complete on a remote CPU in force threaded mode Sebastian Andrzej Siewior
@ 2020-12-08 13:10   ` Christoph Hellwig
  0 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2020-12-08 13:10 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-block, Jens Axboe, Thomas Gleixner, Peter Zijlstra,
	Daniel Wagner, Mike Galbraith, Christoph Hellwig, Sagi Grimberg

On Fri, Dec 04, 2020 at 08:13:54PM +0100, Sebastian Andrzej Siewior wrote:
> With force threaded interrupts enabled, raising softirq from an SMP
> function call will always result in waking the ksoftirqd thread. This is
> not optimal given that the thread runs at SCHED_OTHER priority.
> 
> Completing the request in hard IRQ-context on PREEMPT_RT (which enforces
> the force threaded mode) is bad because the completion handler may
> acquire sleeping locks which violate the locking context.
> 
> Disable request completing on a remote CPU in force threaded mode.
> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/3] blk-mq: Always complete remote completions requests in softirq
  2020-12-07 23:52   ` Jens Axboe
  2020-12-08  8:22     ` Sebastian Andrzej Siewior
@ 2020-12-08 13:13     ` Christoph Hellwig
  2020-12-17 16:43       ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2020-12-08 13:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Sebastian Andrzej Siewior, linux-block, Thomas Gleixner,
	Peter Zijlstra, Daniel Wagner, Mike Galbraith, Christoph Hellwig,
	Sagi Grimberg

On Mon, Dec 07, 2020 at 04:52:57PM -0700, Jens Axboe wrote:
> On 12/4/20 12:13 PM, Sebastian Andrzej Siewior wrote:
> > Controllers with multiple queues have their IRQ-handelers pinned to a
> > CPU. The core shouldn't need to complete the request on a remote CPU.
> > 
> > Remove this case and always raise the softirq to complete the request.
> 
> I don't like this one at all, it'll add a softirq jump for the fast path
> for eg nvme devices.

For the real fast path, that is either a polled queue or irq driven
queues that only map to a single CPU we are never reaching this code,
so I'm not too worried.  Not that I'd complain about numbers, preferably
in the commit log.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-12-04 19:13 ` [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done Sebastian Andrzej Siewior
@ 2020-12-08 13:20   ` Christoph Hellwig
  2020-12-08 13:28     ` Christoph Hellwig
  2020-12-14 20:20     ` Sebastian Andrzej Siewior
  0 siblings, 2 replies; 54+ messages in thread
From: Christoph Hellwig @ 2020-12-08 13:20 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-block, Jens Axboe, Thomas Gleixner, Peter Zijlstra,
	Daniel Wagner, Mike Galbraith, Christoph Hellwig, Sagi Grimberg

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=unknown-8bit, Size: 3467 bytes --]

On Fri, Dec 04, 2020 at 08:13:56PM +0100, Sebastian Andrzej Siewior wrote:
> With llist_head it is possible to avoid the locking (the irq-off region)
> when items are added. This makes it possible to add items on a remote
> CPU.
> llist_add() returns true if the list was previously empty. This can be
> used to invoke the SMP function call / raise sofirq only if the first
> item was added (otherwise it is already pending).
> This simplifies the code a little and reduces the IRQ-off regions. With
> this change it possible to reduce the SMP-function call a simple
> __raise_softirq_irqoff().
> blk_mq_complete_request_remote() needs a preempt-disable section if the
> request needs to complete on the local CPU. Some callers (USB-storage)
> invoke this preemptible context and the request needs to be enqueued on
> the same CPU as the softirq is raised.
> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  block/blk-mq.c         | 77 ++++++++++++++----------------------------
>  include/linux/blkdev.h |  2 +-
>  2 files changed, 27 insertions(+), 52 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 3c0e94913d874..b5138327952a4 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -41,7 +41,7 @@
>  #include "blk-mq-sched.h"
>  #include "blk-rq-qos.h"
>  
> +static DEFINE_PER_CPU(struct llist_head, blk_cpu_done);
>  
>  static void blk_mq_poll_stats_start(struct request_queue *q);
>  static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
> @@ -567,68 +567,32 @@ void blk_mq_end_request(struct request *rq, blk_status_t error)
>  }
>  EXPORT_SYMBOL(blk_mq_end_request);
>  
> +static void blk_complete_reqs(struct llist_head *cpu_list)
>  {
> +	struct llist_node *entry;
> +	struct request *rq, *rq_next;
>  
> +	entry = llist_del_all(cpu_list);
> +	entry = llist_reverse_order(entry);

I find the variable naming and split of the assignments a little
strange.  What about:

static void blk_complete_reqs(struct llist_head *list)
{
	struct llist_node *first = llist_reverse_order(llist_del_all(list));
	struct request *rq, *next;

?

> +	llist_for_each_entry_safe(rq, rq_next, entry, ipi_list)
>  		rq->q->mq_ops->complete(rq);
>  }

Aren't some sanitizers going to be unhappy if we never delete the
request from the list?

>  bool blk_mq_complete_request_remote(struct request *rq)
>  {
> +	struct llist_head *cpu_list;
>  	WRITE_ONCE(rq->state, MQ_RQ_COMPLETE);
>  
>  	/*
> @@ -669,12 +634,22 @@ bool blk_mq_complete_request_remote(struct request *rq)
>  		return false;
>  
>  	if (blk_mq_complete_need_ipi(rq)) {
> +		unsigned int cpu;
> +
> +		cpu = rq->mq_ctx->cpu;
> +		cpu_list = &per_cpu(blk_cpu_done, cpu);
> +		if (llist_add(&rq->ipi_list, cpu_list)) {
> +			INIT_CSD(&rq->csd, __blk_mq_complete_request_remote, rq);
> +			smp_call_function_single_async(cpu, &rq->csd);
> +		}

I think the above code section inside the conditional should go into a
little helper instead of being open coded here in the fast path routine.
I laso don't really see the ¶oint of the cpu and cpulist locl variables.

>  	} else {
>  		if (rq->q->nr_hw_queues > 1)
>  			return false;
> +		preempt_disable();
> +		cpu_list = this_cpu_ptr(&blk_cpu_done);
> +		if (llist_add(&rq->ipi_list, cpu_list))
> +			raise_softirq(BLOCK_SOFTIRQ);
> +		preempt_enable();

I think the section after the return false here also would benefit from
a little helper with a descriptive name.

Otherwise this looks good to me.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/3] blk-mq: Always complete remote completions requests in softirq
  2020-12-08 12:57                 ` Sebastian Andrzej Siewior
@ 2020-12-08 13:27                   ` Daniel Wagner
  0 siblings, 0 replies; 54+ messages in thread
From: Daniel Wagner @ 2020-12-08 13:27 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Jens Axboe, linux-block, Thomas Gleixner, Peter Zijlstra,
	Mike Galbraith, Christoph Hellwig, Sagi Grimberg,
	Hannes Reinecke

On Tue, Dec 08, 2020 at 01:57:09PM +0100, Sebastian Andrzej Siewior wrote:
> Yes, you are right. Even cross-CPU completion for single-queue was
> already completing in softirq. So the only change is for multiqueue
> devices which you just demonstrated that it does not happen.

It can happen if there are less hardware queues than CPUs. But I'd bet
that application which care about performance are well aware of this and
try to keep the work vertical aligned.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-12-08 13:20   ` Christoph Hellwig
@ 2020-12-08 13:28     ` Christoph Hellwig
  2020-12-14 20:20     ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2020-12-08 13:28 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-block, Jens Axboe, Thomas Gleixner, Peter Zijlstra,
	Daniel Wagner, Mike Galbraith, Christoph Hellwig, Sagi Grimberg

Just to clarify what I mean, I think the flow in
blk_mq_complete_request_remote should turn into something like:


	...

	if (rq->cmd_flags & REQ_HIPRI)
		return false;

	if (blk_mq_complete_need_ipi(rq))
		blk_mq_complete_send_ipi(rq);
		return true;
	}

	if (rq->q->nr_hw_queues == 1) {
		blk_mq_raise_softirq(rq);
		return true;
	}

	return false;
}

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-12-08 13:20   ` Christoph Hellwig
  2020-12-08 13:28     ` Christoph Hellwig
@ 2020-12-14 20:20     ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 54+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-12-14 20:20 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, Jens Axboe, Thomas Gleixner, Peter Zijlstra,
	Daniel Wagner, Mike Galbraith, Sagi Grimberg

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=utf-8, Size: 8624 bytes --]

On 2020-12-08 13:20:04 [+0000], Christoph Hellwig wrote:
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -41,7 +41,7 @@
> >  #include "blk-mq-sched.h"
> >  #include "blk-rq-qos.h"
> >  
> > +static DEFINE_PER_CPU(struct llist_head, blk_cpu_done);
> >  
> >  static void blk_mq_poll_stats_start(struct request_queue *q);
> >  static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
> > @@ -567,68 +567,32 @@ void blk_mq_end_request(struct request *rq, blk_status_t error)
> >  }
> >  EXPORT_SYMBOL(blk_mq_end_request);
> >  
> > +static void blk_complete_reqs(struct llist_head *cpu_list)
> >  {
> > +	struct llist_node *entry;
> > +	struct request *rq, *rq_next;
> >  
> > +	entry = llist_del_all(cpu_list);
> > +	entry = llist_reverse_order(entry);
> 
> I find the variable naming and split of the assignments a little
> strange.  What about:
> 
> static void blk_complete_reqs(struct llist_head *list)
> {
> 	struct llist_node *first = llist_reverse_order(llist_del_all(list));
> 	struct request *rq, *next;
> 
> ?

Sure.

> > +	llist_for_each_entry_safe(rq, rq_next, entry, ipi_list)
> >  		rq->q->mq_ops->complete(rq);
> >  }
> 
> Aren't some sanitizers going to be unhappy if we never delete the
> request from the list?

I don't think so. If so there is more to complain about like,
flush_smp_call_function_queue(), delayed_mntput(),
irq_work_run_list(), ...


> >  bool blk_mq_complete_request_remote(struct request *rq)
> >  {
> > +	struct llist_head *cpu_list;
> >  	WRITE_ONCE(rq->state, MQ_RQ_COMPLETE);
> >  
> >  	/*
> > @@ -669,12 +634,22 @@ bool blk_mq_complete_request_remote(struct request *rq)
> >  		return false;
> >  
> >  	if (blk_mq_complete_need_ipi(rq)) {
> > +		unsigned int cpu;
> > +
> > +		cpu = rq->mq_ctx->cpu;
> > +		cpu_list = &per_cpu(blk_cpu_done, cpu);
> > +		if (llist_add(&rq->ipi_list, cpu_list)) {
> > +			INIT_CSD(&rq->csd, __blk_mq_complete_request_remote, rq);
> > +			smp_call_function_single_async(cpu, &rq->csd);
> > +		}
> 
> I think the above code section inside the conditional should go into a
> little helper instead of being open coded here in the fast path routine.
> I laso don't really see the ¶oint of the cpu and cpulist locl variables.
> 
> >  	} else {
> >  		if (rq->q->nr_hw_queues > 1)
> >  			return false;
> > +		preempt_disable();
> > +		cpu_list = this_cpu_ptr(&blk_cpu_done);
> > +		if (llist_add(&rq->ipi_list, cpu_list))
> > +			raise_softirq(BLOCK_SOFTIRQ);
> > +		preempt_enable();
> 
> I think the section after the return false here also would benefit from
> a little helper with a descriptive name.
> 
> Otherwise this looks good to me.

Please see below.

----->8-------

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date: Wed, 28 Oct 2020 11:08:21 +0100
Subject: [PATCH] blk-mq: Use llist_head for blk_cpu_done

With llist_head it is possible to avoid the locking (the irq-off region)
when items are added. This makes it possible to add items on a remote
CPU without additional locking.
llist_add() returns true if the list was previously empty. This can be
used to invoke the SMP function call / raise sofirq only if the first
item was added (otherwise it is already pending).
This simplifies the code a little and reduces the IRQ-off regions.

blk_mq_raise_softirq() needs a preempt-disable section to ensure the
request is enqueued on the same CPU as the softirq is raised.
Some callers (USB-storage) invoke this path in preemptible context.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 block/blk-mq.c         | 97 ++++++++++++++++++------------------------
 include/linux/blkdev.h |  2 +-
 2 files changed, 42 insertions(+), 57 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9baa681f6ee67..959b45fd41882 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -41,7 +41,7 @@
 #include "blk-mq-sched.h"
 #include "blk-rq-qos.h"
 
-static DEFINE_PER_CPU(struct list_head, blk_cpu_done);
+static DEFINE_PER_CPU(struct llist_head, blk_cpu_done);
 
 static void blk_mq_poll_stats_start(struct request_queue *q);
 static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
@@ -567,68 +567,29 @@ void blk_mq_end_request(struct request *rq, blk_status_t error)
 }
 EXPORT_SYMBOL(blk_mq_end_request);
 
-/*
- * Softirq action handler - move entries to local list and loop over them
- * while passing them to the queue registered handler.
- */
-static __latent_entropy void blk_done_softirq(struct softirq_action *h)
+static void blk_complete_reqs(struct llist_head *list)
 {
-	struct list_head *cpu_list, local_list;
+	struct llist_node *entry = llist_reverse_order(llist_del_all(list));
+	struct request *rq, *next;
 
-	local_irq_disable();
-	cpu_list = this_cpu_ptr(&blk_cpu_done);
-	list_replace_init(cpu_list, &local_list);
-	local_irq_enable();
-
-	while (!list_empty(&local_list)) {
-		struct request *rq;
-
-		rq = list_entry(local_list.next, struct request, ipi_list);
-		list_del_init(&rq->ipi_list);
+	llist_for_each_entry_safe(rq, next, entry, ipi_list)
 		rq->q->mq_ops->complete(rq);
-	}
 }
 
-static void blk_mq_trigger_softirq(struct request *rq)
+static __latent_entropy void blk_done_softirq(struct softirq_action *h)
 {
-	struct list_head *list;
-	unsigned long flags;
-
-	local_irq_save(flags);
-	list = this_cpu_ptr(&blk_cpu_done);
-	list_add_tail(&rq->ipi_list, list);
-
-	/*
-	 * If the list only contains our just added request, signal a raise of
-	 * the softirq.  If there are already entries there, someone already
-	 * raised the irq but it hasn't run yet.
-	 */
-	if (list->next == &rq->ipi_list)
-		raise_softirq_irqoff(BLOCK_SOFTIRQ);
-	local_irq_restore(flags);
+	blk_complete_reqs(this_cpu_ptr(&blk_cpu_done));
 }
 
 static int blk_softirq_cpu_dead(unsigned int cpu)
 {
-	/*
-	 * If a CPU goes away, splice its entries to the current CPU
-	 * and trigger a run of the softirq
-	 */
-	local_irq_disable();
-	list_splice_init(&per_cpu(blk_cpu_done, cpu),
-			 this_cpu_ptr(&blk_cpu_done));
-	raise_softirq_irqoff(BLOCK_SOFTIRQ);
-	local_irq_enable();
-
+	blk_complete_reqs(&per_cpu(blk_cpu_done, cpu));
 	return 0;
 }
 
-
 static void __blk_mq_complete_request_remote(void *data)
 {
-	struct request *rq = data;
-
-	blk_mq_trigger_softirq(rq);
+	__raise_softirq_irqoff(BLOCK_SOFTIRQ);
 }
 
 static inline bool blk_mq_complete_need_ipi(struct request *rq)
@@ -657,6 +618,30 @@ static inline bool blk_mq_complete_need_ipi(struct request *rq)
 	return cpu_online(rq->mq_ctx->cpu);
 }
 
+static void blk_mq_complete_send_ipi(struct request *rq)
+{
+	struct llist_head *list;
+	unsigned int cpu;
+
+	cpu = rq->mq_ctx->cpu;
+	list = &per_cpu(blk_cpu_done, cpu);
+	if (llist_add(&rq->ipi_list, list)) {
+		INIT_CSD(&rq->csd, __blk_mq_complete_request_remote, rq);
+		smp_call_function_single_async(cpu, &rq->csd);
+	}
+}
+
+static void blk_mq_raise_softirq(struct request *rq)
+{
+	struct llist_head *list;
+
+	preempt_disable();
+	list = this_cpu_ptr(&blk_cpu_done);
+	if (llist_add(&rq->ipi_list, list))
+		raise_softirq(BLOCK_SOFTIRQ);
+	preempt_enable();
+}
+
 bool blk_mq_complete_request_remote(struct request *rq)
 {
 	WRITE_ONCE(rq->state, MQ_RQ_COMPLETE);
@@ -669,15 +654,15 @@ bool blk_mq_complete_request_remote(struct request *rq)
 		return false;
 
 	if (blk_mq_complete_need_ipi(rq)) {
-		INIT_CSD(&rq->csd, __blk_mq_complete_request_remote, rq);
-		smp_call_function_single_async(rq->mq_ctx->cpu, &rq->csd);
-	} else {
-		if (rq->q->nr_hw_queues > 1)
-			return false;
-		blk_mq_trigger_softirq(rq);
+		blk_mq_complete_send_ipi(rq);
+		return true;
 	}
 
-	return true;
+	if (rq->q->nr_hw_queues == 1) {
+		blk_mq_raise_softirq(rq);
+		return true;
+	}
+	return false;
 }
 EXPORT_SYMBOL_GPL(blk_mq_complete_request_remote);
 
@@ -3917,7 +3902,7 @@ static int __init blk_mq_init(void)
 	int i;
 
 	for_each_possible_cpu(i)
-		INIT_LIST_HEAD(&per_cpu(blk_cpu_done, i));
+		init_llist_head(&per_cpu(blk_cpu_done, i));
 	open_softirq(BLOCK_SOFTIRQ, blk_done_softirq);
 
 	cpuhp_setup_state_nocalls(CPUHP_BLOCK_SOFTIRQ_DEAD,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f94ee3089e015..89a444c5a5833 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -153,7 +153,7 @@ struct request {
 	 */
 	union {
 		struct hlist_node hash;	/* merge hash */
-		struct list_head ipi_list;
+		struct llist_node ipi_list;
 	};
 
 	/*
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/3] blk-mq: Always complete remote completions requests in softirq
  2020-12-08 13:13     ` Christoph Hellwig
@ 2020-12-17 16:43       ` Sebastian Andrzej Siewior
  2020-12-17 16:55         ` Jens Axboe
  0 siblings, 1 reply; 54+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-12-17 16:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Thomas Gleixner, Peter Zijlstra,
	Daniel Wagner, Mike Galbraith, Sagi Grimberg

On 2020-12-08 13:13:19 [+0000], Christoph Hellwig wrote:
> On Mon, Dec 07, 2020 at 04:52:57PM -0700, Jens Axboe wrote:
> > On 12/4/20 12:13 PM, Sebastian Andrzej Siewior wrote:
> > > Controllers with multiple queues have their IRQ-handelers pinned to a
> > > CPU. The core shouldn't need to complete the request on a remote CPU.
> > > 
> > > Remove this case and always raise the softirq to complete the request.
> > 
> > I don't like this one at all, it'll add a softirq jump for the fast path
> > for eg nvme devices.
> 
> For the real fast path, that is either a polled queue or irq driven
> queues that only map to a single CPU we are never reaching this code,
> so I'm not too worried.  Not that I'd complain about numbers, preferably
> in the commit log.

Did Daniel provide all the numbers you/Jens were looking for or do we
still wait for some?

Sebastian

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/3] blk-mq: Always complete remote completions requests in softirq
  2020-12-08  8:44       ` Daniel Wagner
  2020-12-08 11:36         ` Daniel Wagner
@ 2020-12-17 16:45         ` Jens Axboe
  2020-12-17 16:49           ` Daniel Wagner
  1 sibling, 1 reply; 54+ messages in thread
From: Jens Axboe @ 2020-12-17 16:45 UTC (permalink / raw)
  To: Daniel Wagner, Sebastian Andrzej Siewior
  Cc: linux-block, Thomas Gleixner, Peter Zijlstra, Mike Galbraith,
	Christoph Hellwig, Sagi Grimberg

On 12/8/20 1:44 AM, Daniel Wagner wrote:
> It looks like the patched version show tiny bit better numbers for this
> workload. slat seems to be one of the major difference between the two
> runs. But that is the only thing I really spotted to be really off.

slat is the same, just one is in nsec and the other is in usec.

> I keep going with some more testing. Let what kind of tests you would
> also want to see. I'll do a few plain NVMe tests next.

This is a good test, thanks.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/3] blk-mq: Always complete remote completions requests in softirq
  2020-12-17 16:45         ` Jens Axboe
@ 2020-12-17 16:49           ` Daniel Wagner
  2020-12-17 16:54             ` Jens Axboe
  0 siblings, 1 reply; 54+ messages in thread
From: Daniel Wagner @ 2020-12-17 16:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Sebastian Andrzej Siewior, linux-block, Thomas Gleixner,
	Peter Zijlstra, Mike Galbraith, Christoph Hellwig, Sagi Grimberg

On Thu, Dec 17, 2020 at 09:45:47AM -0700, Jens Axboe wrote:
> On 12/8/20 1:44 AM, Daniel Wagner wrote:
> > It looks like the patched version show tiny bit better numbers for this
> > workload. slat seems to be one of the major difference between the two
> > runs. But that is the only thing I really spotted to be really off.
> 
> slat is the same, just one is in nsec and the other is in usec.

Ah, good eyes. Need to remember this :)

> > I keep going with some more testing. Let what kind of tests you would
> > also want to see. I'll do a few plain NVMe tests next.
> 
> This is a good test, thanks.

Got sidetracked. Haven't started yet with these tests.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/3] blk-mq: Always complete remote completions requests in softirq
  2020-12-17 16:49           ` Daniel Wagner
@ 2020-12-17 16:54             ` Jens Axboe
  0 siblings, 0 replies; 54+ messages in thread
From: Jens Axboe @ 2020-12-17 16:54 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Sebastian Andrzej Siewior, linux-block, Thomas Gleixner,
	Peter Zijlstra, Mike Galbraith, Christoph Hellwig, Sagi Grimberg

On 12/17/20 9:49 AM, Daniel Wagner wrote:
> On Thu, Dec 17, 2020 at 09:45:47AM -0700, Jens Axboe wrote:
>> On 12/8/20 1:44 AM, Daniel Wagner wrote:
>>> It looks like the patched version show tiny bit better numbers for this
>>> workload. slat seems to be one of the major difference between the two
>>> runs. But that is the only thing I really spotted to be really off.
>>
>> slat is the same, just one is in nsec and the other is in usec.
> 
> Ah, good eyes. Need to remember this :)
> 
>>> I keep going with some more testing. Let what kind of tests you would
>>> also want to see. I'll do a few plain NVMe tests next.
>>
>> This is a good test, thanks.
> 
> Got sidetracked. Haven't started yet with these tests.

I just ran some here, and as expected, didn't see much change. Single
core IOPS at 2.8M in both cases, and not expecting any remote IPI for
that test case.

So I'd say that's good enough, wasn't expecting a change, and we don't
see one even for your case with remote IPI.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/3] blk-mq: Always complete remote completions requests in softirq
  2020-12-17 16:43       ` Sebastian Andrzej Siewior
@ 2020-12-17 16:55         ` Jens Axboe
  2020-12-17 16:58           ` Sebastian Andrzej Siewior
  2020-12-17 18:16           ` Daniel Wagner
  0 siblings, 2 replies; 54+ messages in thread
From: Jens Axboe @ 2020-12-17 16:55 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Christoph Hellwig
  Cc: linux-block, Thomas Gleixner, Peter Zijlstra, Daniel Wagner,
	Mike Galbraith, Sagi Grimberg

On 12/17/20 9:43 AM, Sebastian Andrzej Siewior wrote:
> On 2020-12-08 13:13:19 [+0000], Christoph Hellwig wrote:
>> On Mon, Dec 07, 2020 at 04:52:57PM -0700, Jens Axboe wrote:
>>> On 12/4/20 12:13 PM, Sebastian Andrzej Siewior wrote:
>>>> Controllers with multiple queues have their IRQ-handelers pinned to a
>>>> CPU. The core shouldn't need to complete the request on a remote CPU.
>>>>
>>>> Remove this case and always raise the softirq to complete the request.
>>>
>>> I don't like this one at all, it'll add a softirq jump for the fast path
>>> for eg nvme devices.
>>
>> For the real fast path, that is either a polled queue or irq driven
>> queues that only map to a single CPU we are never reaching this code,
>> so I'm not too worried.  Not that I'd complain about numbers, preferably
>> in the commit log.
> 
> Did Daniel provide all the numbers you/Jens were looking for or do we
> still wait for some?

Yeah, I think we're good at this point. I'll queue this up for 5.11.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/3] blk-mq: Always complete remote completions requests in softirq
  2020-12-17 16:55         ` Jens Axboe
@ 2020-12-17 16:58           ` Sebastian Andrzej Siewior
  2020-12-17 17:05             ` Daniel Wagner
  2020-12-17 18:16           ` Daniel Wagner
  1 sibling, 1 reply; 54+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-12-17 16:58 UTC (permalink / raw)
  To: Jens Axboe, Daniel Wagner
  Cc: Christoph Hellwig, linux-block, Thomas Gleixner, Peter Zijlstra,
	Mike Galbraith, Sagi Grimberg

On 2020-12-17 09:55:08 [-0700], Jens Axboe wrote:
> 
> Yeah, I think we're good at this point. I'll queue this up for 5.11.
Thank you very much.
Thank you Daniel for running all these tests.

Sebastian

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/3] blk-mq: Always complete remote completions requests in softirq
  2020-12-17 16:58           ` Sebastian Andrzej Siewior
@ 2020-12-17 17:05             ` Daniel Wagner
  0 siblings, 0 replies; 54+ messages in thread
From: Daniel Wagner @ 2020-12-17 17:05 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Jens Axboe, Christoph Hellwig, linux-block, Thomas Gleixner,
	Peter Zijlstra, Mike Galbraith, Sagi Grimberg

On Thu, Dec 17, 2020 at 05:58:44PM +0100, Sebastian Andrzej Siewior wrote:
> On 2020-12-17 09:55:08 [-0700], Jens Axboe wrote:
> > 
> > Yeah, I think we're good at this point. I'll queue this up for 5.11.
> Thank you very much.
> Thank you Daniel for running all these tests.

np. Glad I can contribute to the -rt project :)

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/3] blk-mq: Always complete remote completions requests in softirq
  2020-12-17 16:55         ` Jens Axboe
  2020-12-17 16:58           ` Sebastian Andrzej Siewior
@ 2020-12-17 18:16           ` Daniel Wagner
  2020-12-17 18:22             ` Jens Axboe
  1 sibling, 1 reply; 54+ messages in thread
From: Daniel Wagner @ 2020-12-17 18:16 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Sebastian Andrzej Siewior, Christoph Hellwig, linux-block,
	Thomas Gleixner, Peter Zijlstra, Mike Galbraith, Sagi Grimberg

On Thu, Dec 17, 2020 at 09:55:08AM -0700, Jens Axboe wrote:
> Yeah, I think we're good at this point. I'll queue this up for 5.11.

If I am not complete mistaken you queued v2 of patch 3. Sebastian sent out
a v3 (slightly hidden):

  https://lore.kernel.org/linux-block/20201214202030.izhm4byeznfjoobe@linutronix.de/

which includes the changes Christoph asked for.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/3] blk-mq: Always complete remote completions requests in softirq
  2020-12-17 18:16           ` Daniel Wagner
@ 2020-12-17 18:22             ` Jens Axboe
  2020-12-17 18:41               ` Daniel Wagner
  0 siblings, 1 reply; 54+ messages in thread
From: Jens Axboe @ 2020-12-17 18:22 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Sebastian Andrzej Siewior, Christoph Hellwig, linux-block,
	Thomas Gleixner, Peter Zijlstra, Mike Galbraith, Sagi Grimberg

On 12/17/20 11:16 AM, Daniel Wagner wrote:
> On Thu, Dec 17, 2020 at 09:55:08AM -0700, Jens Axboe wrote:
>> Yeah, I think we're good at this point. I'll queue this up for 5.11.
> 
> If I am not complete mistaken you queued v2 of patch 3. Sebastian sent out
> a v3 (slightly hidden):
> 
>   https://lore.kernel.org/linux-block/20201214202030.izhm4byeznfjoobe@linutronix.de/
> 
> which includes the changes Christoph asked for.

Not only slightly hidden, b4 gets me v2 as well. Which isn't surprising,
since it's just a patch in a reply. I'll fix it up, but would've been
great if a v3 series had been posted, or at least a v3 of patch 3 in
that thread sent properly.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/3] blk-mq: Always complete remote completions requests in softirq
  2020-12-17 18:22             ` Jens Axboe
@ 2020-12-17 18:41               ` Daniel Wagner
  2020-12-17 18:46                 ` Jens Axboe
  0 siblings, 1 reply; 54+ messages in thread
From: Daniel Wagner @ 2020-12-17 18:41 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Sebastian Andrzej Siewior, Christoph Hellwig, linux-block,
	Thomas Gleixner, Peter Zijlstra, Mike Galbraith, Sagi Grimberg

On Thu, Dec 17, 2020 at 11:22:58AM -0700, Jens Axboe wrote:
> Not only slightly hidden, b4 gets me v2 as well. Which isn't surprising,
> since it's just a patch in a reply. I'll fix it up, but would've been
> great if a v3 series had been posted, or at least a v3 of patch 3 in
> that thread sent properly.

Yep.

BTW, if you like you can add my

Reviewed-by: Daniel Wagner <dwagner@suse.de>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/3] blk-mq: Always complete remote completions requests in softirq
  2020-12-17 18:41               ` Daniel Wagner
@ 2020-12-17 18:46                 ` Jens Axboe
  2020-12-17 19:07                   ` Daniel Wagner
  0 siblings, 1 reply; 54+ messages in thread
From: Jens Axboe @ 2020-12-17 18:46 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Sebastian Andrzej Siewior, Christoph Hellwig, linux-block,
	Thomas Gleixner, Peter Zijlstra, Mike Galbraith, Sagi Grimberg

On 12/17/20 11:41 AM, Daniel Wagner wrote:
> On Thu, Dec 17, 2020 at 11:22:58AM -0700, Jens Axboe wrote:
>> Not only slightly hidden, b4 gets me v2 as well. Which isn't surprising,
>> since it's just a patch in a reply. I'll fix it up, but would've been
>> great if a v3 series had been posted, or at least a v3 of patch 3 in
>> that thread sent properly.
> 
> Yep.
> 
> BTW, if you like you can add my
> 
> Reviewed-by: Daniel Wagner <dwagner@suse.de>

To the series, or just that one patch?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/3] blk-mq: Always complete remote completions requests in softirq
  2020-12-17 18:46                 ` Jens Axboe
@ 2020-12-17 19:07                   ` Daniel Wagner
  2020-12-17 19:13                     ` Jens Axboe
  0 siblings, 1 reply; 54+ messages in thread
From: Daniel Wagner @ 2020-12-17 19:07 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Sebastian Andrzej Siewior, Christoph Hellwig, linux-block,
	Thomas Gleixner, Peter Zijlstra, Mike Galbraith, Sagi Grimberg

On 17.12.20 19:46, Jens Axboe wrote:
> On 12/17/20 11:41 AM, Daniel Wagner wrote:
>> On Thu, Dec 17, 2020 at 11:22:58AM -0700, Jens Axboe wrote:
>>> Not only slightly hidden, b4 gets me v2 as well. Which isn't surprising,
>>> since it's just a patch in a reply. I'll fix it up, but would've been
>>> great if a v3 series had been posted, or at least a v3 of patch 3 in
>>> that thread sent properly.
>>
>> Yep.
>>
>> BTW, if you like you can add my
>>
>> Reviewed-by: Daniel Wagner <dwagner@suse.de>
> 
> To the series, or just that one patch?

to the series but if I am late for this, it's also okau if I miss out to 
the party :)

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/3] blk-mq: Always complete remote completions requests in softirq
  2020-12-17 19:07                   ` Daniel Wagner
@ 2020-12-17 19:13                     ` Jens Axboe
  2020-12-17 19:15                       ` Daniel Wagner
  0 siblings, 1 reply; 54+ messages in thread
From: Jens Axboe @ 2020-12-17 19:13 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Sebastian Andrzej Siewior, Christoph Hellwig, linux-block,
	Thomas Gleixner, Peter Zijlstra, Mike Galbraith, Sagi Grimberg

On 12/17/20 12:07 PM, Daniel Wagner wrote:
> On 17.12.20 19:46, Jens Axboe wrote:
>> On 12/17/20 11:41 AM, Daniel Wagner wrote:
>>> On Thu, Dec 17, 2020 at 11:22:58AM -0700, Jens Axboe wrote:
>>>> Not only slightly hidden, b4 gets me v2 as well. Which isn't surprising,
>>>> since it's just a patch in a reply. I'll fix it up, but would've been
>>>> great if a v3 series had been posted, or at least a v3 of patch 3 in
>>>> that thread sent properly.
>>>
>>> Yep.
>>>
>>> BTW, if you like you can add my
>>>
>>> Reviewed-by: Daniel Wagner <dwagner@suse.de>
>>
>> To the series, or just that one patch?
> 
> to the series but if I am late for this, it's also okau if I miss out to 
> the party :)

Well, had to update #3 anyway, so done.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/3] blk-mq: Always complete remote completions requests in softirq
  2020-12-17 19:13                     ` Jens Axboe
@ 2020-12-17 19:15                       ` Daniel Wagner
  0 siblings, 0 replies; 54+ messages in thread
From: Daniel Wagner @ 2020-12-17 19:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Sebastian Andrzej Siewior, Christoph Hellwig, linux-block,
	Thomas Gleixner, Peter Zijlstra, Mike Galbraith, Sagi Grimberg

> Well, had to update #3 anyway, so done.

Thanks!

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2021-01-25  8:32     ` Sebastian Andrzej Siewior
@ 2021-01-25  8:39       ` Christoph Hellwig
  0 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2021-01-25  8:39 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Christoph Hellwig, linux-block, linux-kernel, Jens Axboe,
	Thomas Gleixner, Peter Zijlstra, Ingo Molnar

On Mon, Jan 25, 2021 at 09:32:04AM +0100, Sebastian Andrzej Siewior wrote:
> On 2021-01-25 08:30:12 [+0000], Christoph Hellwig wrote:
> > > +static void blk_mq_complete_send_ipi(struct request *rq)
> > > +{
> > > +	struct llist_head *list;
> > > +	unsigned int cpu;
> > > +
> > > +	cpu = rq->mq_ctx->cpu;
> > > +	list = &per_cpu(blk_cpu_done, cpu);
> > > +	if (llist_add(&rq->ipi_list, list)) {
> > > +		INIT_CSD(&rq->csd, __blk_mq_complete_request_remote, rq);
> > > +		smp_call_function_single_async(cpu, &rq->csd);
> > > +	}
> > > +}
> > 
> > Nit: it would be nice to initialize cpu and list in the declaration
> > lines.
> 
> Why? They get initialized later.

Because:

	unsigned int cpu = rq->mq_ctx->cpu;
	struct llist_head *list = &per_cpu(blk_cpu_done, cpu);

is a lot easier to follow than:

	struct llist_head *list;
	unsigned int cpu;

	cpu = rq->mq_ctx->cpu;
	list = &per_cpu(blk_cpu_done, cpu);


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2021-01-25  8:30   ` Christoph Hellwig
@ 2021-01-25  8:32     ` Sebastian Andrzej Siewior
  2021-01-25  8:39       ` Christoph Hellwig
  0 siblings, 1 reply; 54+ messages in thread
From: Sebastian Andrzej Siewior @ 2021-01-25  8:32 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, linux-kernel, Jens Axboe, Thomas Gleixner,
	Peter Zijlstra, Ingo Molnar

On 2021-01-25 08:30:12 [+0000], Christoph Hellwig wrote:
> > +static void blk_mq_complete_send_ipi(struct request *rq)
> > +{
> > +	struct llist_head *list;
> > +	unsigned int cpu;
> > +
> > +	cpu = rq->mq_ctx->cpu;
> > +	list = &per_cpu(blk_cpu_done, cpu);
> > +	if (llist_add(&rq->ipi_list, list)) {
> > +		INIT_CSD(&rq->csd, __blk_mq_complete_request_remote, rq);
> > +		smp_call_function_single_async(cpu, &rq->csd);
> > +	}
> > +}
> 
> Nit: it would be nice to initialize cpu and list in the declaration
> lines.

Why? They get initialized later.

> Otherwise looks good:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Sebastian

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2021-01-23 20:10 ` [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done Sebastian Andrzej Siewior
@ 2021-01-25  8:30   ` Christoph Hellwig
  2021-01-25  8:32     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2021-01-25  8:30 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-block, linux-kernel, Jens Axboe, Thomas Gleixner,
	Peter Zijlstra, Ingo Molnar

> +static void blk_mq_complete_send_ipi(struct request *rq)
> +{
> +	struct llist_head *list;
> +	unsigned int cpu;
> +
> +	cpu = rq->mq_ctx->cpu;
> +	list = &per_cpu(blk_cpu_done, cpu);
> +	if (llist_add(&rq->ipi_list, list)) {
> +		INIT_CSD(&rq->csd, __blk_mq_complete_request_remote, rq);
> +		smp_call_function_single_async(cpu, &rq->csd);
> +	}
> +}

Nit: it would be nice to initialize cpu and list in the declaration
lines.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2021-01-23 20:10 [PATCH v3 0/3] blk-mq: Don't complete in IRQ, use llist_head Sebastian Andrzej Siewior
@ 2021-01-23 20:10 ` Sebastian Andrzej Siewior
  2021-01-25  8:30   ` Christoph Hellwig
  0 siblings, 1 reply; 54+ messages in thread
From: Sebastian Andrzej Siewior @ 2021-01-23 20:10 UTC (permalink / raw)
  To: linux-block, linux-kernel
  Cc: Jens Axboe, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Sebastian Andrzej Siewior

With llist_head it is possible to avoid the locking (the irq-off region)
when items are added. This makes it possible to add items on a remote
CPU without additional locking.
llist_add() returns true if the list was previously empty. This can be
used to invoke the SMP function call / raise sofirq only if the first
item was added (otherwise it is already pending).
This simplifies the code a little and reduces the IRQ-off regions.

blk_mq_raise_softirq() needs a preempt-disable section to ensure the
request is enqueued on the same CPU as the softirq is raised.
Some callers (USB-storage) invoke this path in preemptible context.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 block/blk-mq.c         | 97 ++++++++++++++++++------------------------
 include/linux/blkdev.h |  2 +-
 2 files changed, 42 insertions(+), 57 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 90348ae518461..463de2981df8a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -41,7 +41,7 @@
 #include "blk-mq-sched.h"
 #include "blk-rq-qos.h"
 
-static DEFINE_PER_CPU(struct list_head, blk_cpu_done);
+static DEFINE_PER_CPU(struct llist_head, blk_cpu_done);
 
 static void blk_mq_poll_stats_start(struct request_queue *q);
 static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
@@ -567,68 +567,29 @@ void blk_mq_end_request(struct request *rq, blk_status_t error)
 }
 EXPORT_SYMBOL(blk_mq_end_request);
 
-/*
- * Softirq action handler - move entries to local list and loop over them
- * while passing them to the queue registered handler.
- */
-static __latent_entropy void blk_done_softirq(struct softirq_action *h)
+static void blk_complete_reqs(struct llist_head *list)
 {
-	struct list_head *cpu_list, local_list;
+	struct llist_node *entry = llist_reverse_order(llist_del_all(list));
+	struct request *rq, *next;
 
-	local_irq_disable();
-	cpu_list = this_cpu_ptr(&blk_cpu_done);
-	list_replace_init(cpu_list, &local_list);
-	local_irq_enable();
-
-	while (!list_empty(&local_list)) {
-		struct request *rq;
-
-		rq = list_entry(local_list.next, struct request, ipi_list);
-		list_del_init(&rq->ipi_list);
+	llist_for_each_entry_safe(rq, next, entry, ipi_list)
 		rq->q->mq_ops->complete(rq);
-	}
 }
 
-static void blk_mq_trigger_softirq(struct request *rq)
+static __latent_entropy void blk_done_softirq(struct softirq_action *h)
 {
-	struct list_head *list;
-	unsigned long flags;
-
-	local_irq_save(flags);
-	list = this_cpu_ptr(&blk_cpu_done);
-	list_add_tail(&rq->ipi_list, list);
-
-	/*
-	 * If the list only contains our just added request, signal a raise of
-	 * the softirq.  If there are already entries there, someone already
-	 * raised the irq but it hasn't run yet.
-	 */
-	if (list->next == &rq->ipi_list)
-		raise_softirq_irqoff(BLOCK_SOFTIRQ);
-	local_irq_restore(flags);
+	blk_complete_reqs(this_cpu_ptr(&blk_cpu_done));
 }
 
 static int blk_softirq_cpu_dead(unsigned int cpu)
 {
-	/*
-	 * If a CPU goes away, splice its entries to the current CPU
-	 * and trigger a run of the softirq
-	 */
-	local_irq_disable();
-	list_splice_init(&per_cpu(blk_cpu_done, cpu),
-			 this_cpu_ptr(&blk_cpu_done));
-	raise_softirq_irqoff(BLOCK_SOFTIRQ);
-	local_irq_enable();
-
+	blk_complete_reqs(&per_cpu(blk_cpu_done, cpu));
 	return 0;
 }
 
-
 static void __blk_mq_complete_request_remote(void *data)
 {
-	struct request *rq = data;
-
-	blk_mq_trigger_softirq(rq);
+	__raise_softirq_irqoff(BLOCK_SOFTIRQ);
 }
 
 static inline bool blk_mq_complete_need_ipi(struct request *rq)
@@ -657,6 +618,30 @@ static inline bool blk_mq_complete_need_ipi(struct request *rq)
 	return cpu_online(rq->mq_ctx->cpu);
 }
 
+static void blk_mq_complete_send_ipi(struct request *rq)
+{
+	struct llist_head *list;
+	unsigned int cpu;
+
+	cpu = rq->mq_ctx->cpu;
+	list = &per_cpu(blk_cpu_done, cpu);
+	if (llist_add(&rq->ipi_list, list)) {
+		INIT_CSD(&rq->csd, __blk_mq_complete_request_remote, rq);
+		smp_call_function_single_async(cpu, &rq->csd);
+	}
+}
+
+static void blk_mq_raise_softirq(struct request *rq)
+{
+	struct llist_head *list;
+
+	preempt_disable();
+	list = this_cpu_ptr(&blk_cpu_done);
+	if (llist_add(&rq->ipi_list, list))
+		raise_softirq(BLOCK_SOFTIRQ);
+	preempt_enable();
+}
+
 bool blk_mq_complete_request_remote(struct request *rq)
 {
 	WRITE_ONCE(rq->state, MQ_RQ_COMPLETE);
@@ -669,15 +654,15 @@ bool blk_mq_complete_request_remote(struct request *rq)
 		return false;
 
 	if (blk_mq_complete_need_ipi(rq)) {
-		INIT_CSD(&rq->csd, __blk_mq_complete_request_remote, rq);
-		smp_call_function_single_async(rq->mq_ctx->cpu, &rq->csd);
-	} else {
-		if (rq->q->nr_hw_queues > 1)
-			return false;
-		blk_mq_trigger_softirq(rq);
+		blk_mq_complete_send_ipi(rq);
+		return true;
 	}
 
-	return true;
+	if (rq->q->nr_hw_queues == 1) {
+		blk_mq_raise_softirq(rq);
+		return true;
+	}
+	return false;
 }
 EXPORT_SYMBOL_GPL(blk_mq_complete_request_remote);
 
@@ -3892,7 +3877,7 @@ static int __init blk_mq_init(void)
 	int i;
 
 	for_each_possible_cpu(i)
-		INIT_LIST_HEAD(&per_cpu(blk_cpu_done, i));
+		init_llist_head(&per_cpu(blk_cpu_done, i));
 	open_softirq(BLOCK_SOFTIRQ, blk_done_softirq);
 
 	cpuhp_setup_state_nocalls(CPUHP_BLOCK_SOFTIRQ_DEAD,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f94ee3089e015..89a444c5a5833 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -153,7 +153,7 @@ struct request {
 	 */
 	union {
 		struct hlist_node hash;	/* merge hash */
-		struct list_head ipi_list;
+		struct llist_node ipi_list;
 	};
 
 	/*
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-11-02 18:12                         ` Christoph Hellwig
  2020-11-04 19:15                           ` Sagi Grimberg
@ 2020-11-06 15:23                           ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 54+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-11-06 15:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Sagi Grimberg, linux-block, Thomas Gleixner,
	David Runge, linux-rt-users, linux-kernel, Peter Zijlstra,
	Daniel Wagner, Mike Galbraith

On 2020-11-02 18:12:38 [+0000], Christoph Hellwig wrote:
> > to not break that assumption you just mentioned and provide 
> > |static inline void blk_mq_complete_request_local(struct request *rq)
> > |{
> > |                 rq->q->mq_ops->complete(rq);
> > |}
> > 
> > so that completion issued from from process context (like those from
> > usb-storage) don't end up waking `ksoftird' (running at SCHED_OTHER)
> > completing the requests but rather performing it right away. The softirq
> > dance makes no sense here.
> 
> Agreed.  But I don't think your above blk_mq_complete_request_local
> is all that useful either as ->complete is defined by the caller,
> so we could just do a direct call.
In usb-storage case it is hidden somewhere in the SCSI stack but this
can probably be changed later on.

>                                     Basically we should just
> return false from blk_mq_complete_request_remote after updating
> the state when called from process context.  But given that IIRC
> we are not supposed to check what state we are called from
> we'll need a helper just for updating the state instead and
> ensure the driver uses the right helper.  Now of course we might
> have process context callers that still want to bounce to the
> submitting CPU, but in that case we should go directly to a
> workqueue or similar.

So instead blk_mq_complete_request_local() you want a helper to set the
state in which the completion function is invoked. Sounds more like an
argument :)

> Either way doing this properly will probabl involve an audit of all
> drivers, but I think that is worth it.

I'm lost. Should I repost the three patches with a preempt_disable()
section (as suggested) to not break preemptible callers? And then move
from there to provide callers from preemtible context an alternative?

Sebastian

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-11-02 18:12                         ` Christoph Hellwig
@ 2020-11-04 19:15                           ` Sagi Grimberg
  2020-11-06 15:23                           ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 54+ messages in thread
From: Sagi Grimberg @ 2020-11-04 19:15 UTC (permalink / raw)
  To: Christoph Hellwig, Sebastian Andrzej Siewior
  Cc: Jens Axboe, linux-block, Thomas Gleixner, David Runge,
	linux-rt-users, linux-kernel, Peter Zijlstra, Daniel Wagner,
	Mike Galbraith


>>> There really aren't any rules for this, and it's perfectly legit to
>>> complete from process context. Maybe you're a kthread driven driver and
>>> that's how you handle completions. The block completion path has always
>>> been hard IRQ safe, but possible to call from anywhere.
>>
>> I'm not trying to put restrictions and forbidding completions from a
>> kthread. I'm trying to avoid the pointless softirq dance for no added
>> value. We could:
> 
>> to not break that assumption you just mentioned and provide
>> |static inline void blk_mq_complete_request_local(struct request *rq)
>> |{
>> |                 rq->q->mq_ops->complete(rq);
>> |}
>>
>> so that completion issued from from process context (like those from
>> usb-storage) don't end up waking `ksoftird' (running at SCHED_OTHER)
>> completing the requests but rather performing it right away. The softirq
>> dance makes no sense here.
> 
> Agreed.  But I don't think your above blk_mq_complete_request_local
> is all that useful either as ->complete is defined by the caller,
> so we could just do a direct call.  Basically we should just
> return false from blk_mq_complete_request_remote after updating
> the state when called from process context.

Agreed.

> But given that IIRC
> we are not supposed to check what state we are called from
> we'll need a helper just for updating the state instead and
> ensure the driver uses the right helper.  Now of course we might
> have process context callers that still want to bounce to the
> submitting CPU, but in that case we should go directly to a
> workqueue or similar.

This would mean that it may be suboptimal for nvme-tcp to complete
requests in softirq context from the network context (determined by
NIC steering). Because in this case, this would trigger workqueue
schedule on a per-request basis rather than once per .data_ready
call like we do today. Is that correct?

It has been observed that completing commands in softirq context
(network determined cpu) because basically the completion does
IPI + local complete, not IPI + softirq or IPI + workqueue.

> Either way doing this properly will probabl involve an audit of all
> drivers, but I think that is worth it.

Agree.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-11-02  9:55                       ` Sebastian Andrzej Siewior
@ 2020-11-02 18:12                         ` Christoph Hellwig
  2020-11-04 19:15                           ` Sagi Grimberg
  2020-11-06 15:23                           ` Sebastian Andrzej Siewior
  0 siblings, 2 replies; 54+ messages in thread
From: Christoph Hellwig @ 2020-11-02 18:12 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Jens Axboe, Sagi Grimberg, Christoph Hellwig, linux-block,
	Thomas Gleixner, David Runge, linux-rt-users, linux-kernel,
	Peter Zijlstra, Daniel Wagner, Mike Galbraith

On Mon, Nov 02, 2020 at 10:55:33AM +0100, Sebastian Andrzej Siewior wrote:
> On 2020-10-31 09:00:49 [-0600], Jens Axboe wrote:
> > There really aren't any rules for this, and it's perfectly legit to
> > complete from process context. Maybe you're a kthread driven driver and
> > that's how you handle completions. The block completion path has always
> > been hard IRQ safe, but possible to call from anywhere.
> 
> I'm not trying to put restrictions and forbidding completions from a
> kthread. I'm trying to avoid the pointless softirq dance for no added
> value. We could:

> to not break that assumption you just mentioned and provide 
> |static inline void blk_mq_complete_request_local(struct request *rq)
> |{
> |                 rq->q->mq_ops->complete(rq);
> |}
> 
> so that completion issued from from process context (like those from
> usb-storage) don't end up waking `ksoftird' (running at SCHED_OTHER)
> completing the requests but rather performing it right away. The softirq
> dance makes no sense here.

Agreed.  But I don't think your above blk_mq_complete_request_local
is all that useful either as ->complete is defined by the caller,
so we could just do a direct call.  Basically we should just
return false from blk_mq_complete_request_remote after updating
the state when called from process context.  But given that IIRC
we are not supposed to check what state we are called from
we'll need a helper just for updating the state instead and
ensure the driver uses the right helper.  Now of course we might
have process context callers that still want to bounce to the
submitting CPU, but in that case we should go directly to a
workqueue or similar.

Either way doing this properly will probabl involve an audit of all
drivers, but I think that is worth it.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-10-31 15:00                     ` Jens Axboe
  2020-10-31 15:01                       ` Jens Axboe
@ 2020-11-02  9:55                       ` Sebastian Andrzej Siewior
  2020-11-02 18:12                         ` Christoph Hellwig
  1 sibling, 1 reply; 54+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-11-02  9:55 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Sagi Grimberg, Christoph Hellwig, linux-block, Thomas Gleixner,
	David Runge, linux-rt-users, linux-kernel, Peter Zijlstra,
	Daniel Wagner, Mike Galbraith

On 2020-10-31 09:00:49 [-0600], Jens Axboe wrote:
> There really aren't any rules for this, and it's perfectly legit to
> complete from process context. Maybe you're a kthread driven driver and
> that's how you handle completions. The block completion path has always
> been hard IRQ safe, but possible to call from anywhere.

I'm not trying to put restrictions and forbidding completions from a
kthread. I'm trying to avoid the pointless softirq dance for no added
value. We could:

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4f53de48e5038..c4693b3750878 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -644,9 +644,11 @@ bool blk_mq_complete_request_remote(struct request *rq)
 	} else {
 		if (rq->q->nr_hw_queues > 1)
 			return false;
+		preempt_disable();
 		cpu_list = this_cpu_ptr(&blk_cpu_done);
 		if (llist_add(&rq->ipi_list, cpu_list))
 			raise_softirq(BLOCK_SOFTIRQ);
+		preempt_enable();
 	}
 
 	return true;

to not break that assumption you just mentioned and provide 
|static inline void blk_mq_complete_request_local(struct request *rq)
|{
|                 rq->q->mq_ops->complete(rq);
|}

so that completion issued from from process context (like those from
usb-storage) don't end up waking `ksoftird' (running at SCHED_OTHER)
completing the requests but rather performing it right away. The softirq
dance makes no sense here.

As mentioned earlier, the alternative _could_ be to
	s/preempt_/local_bh_/

in the above patch. This would ensure that any invocation outside of
IRQ/Softirq context would invoke the softirq _directly_ at
local_bh_enable() time rather than waking the daemon for that purpose.
It would also avoid another completion function for the direct case
which could be abused if used from outside the thread context.
The last one is currently my favorite.

Sebastian

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-10-31 15:01                       ` Jens Axboe
@ 2020-10-31 18:09                         ` Christoph Hellwig
  0 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2020-10-31 18:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Sebastian Andrzej Siewior, Sagi Grimberg, Christoph Hellwig,
	linux-block, Thomas Gleixner, David Runge, linux-rt-users,
	linux-kernel, Peter Zijlstra, Daniel Wagner, Mike Galbraith

On Sat, Oct 31, 2020 at 09:01:45AM -0600, Jens Axboe wrote:
> On 10/31/20 9:00 AM, Jens Axboe wrote:
> > On 10/31/20 4:41 AM, Sebastian Andrzej Siewior wrote:
> >> On 2020-10-29 14:07:59 [-0700], Sagi Grimberg wrote:
> >>>> in which context?
> >>>
> >>> Not sure what is the question.
> >>
> >> The question is in which context do you complete your requests. My guess
> >> by now is "usually softirq/NAPI and context in rare error case".
> > 
> > There really aren't any rules for this, and it's perfectly legit to
> > complete from process context. Maybe you're a kthread driven driver and
> > that's how you handle completions. The block completion path has always
> > been hard IRQ safe, but possible to call from anywhere.
> 
> A more recent example is polled IO, which will always complete from
> process/task context and very much is fast path.

But we never IPI for that anyway, so it is the easy case.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-10-31 15:00                     ` Jens Axboe
@ 2020-10-31 15:01                       ` Jens Axboe
  2020-10-31 18:09                         ` Christoph Hellwig
  2020-11-02  9:55                       ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 54+ messages in thread
From: Jens Axboe @ 2020-10-31 15:01 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Sagi Grimberg
  Cc: Christoph Hellwig, linux-block, Thomas Gleixner, David Runge,
	linux-rt-users, linux-kernel, Peter Zijlstra, Daniel Wagner,
	Mike Galbraith

On 10/31/20 9:00 AM, Jens Axboe wrote:
> On 10/31/20 4:41 AM, Sebastian Andrzej Siewior wrote:
>> On 2020-10-29 14:07:59 [-0700], Sagi Grimberg wrote:
>>>> in which context?
>>>
>>> Not sure what is the question.
>>
>> The question is in which context do you complete your requests. My guess
>> by now is "usually softirq/NAPI and context in rare error case".
> 
> There really aren't any rules for this, and it's perfectly legit to
> complete from process context. Maybe you're a kthread driven driver and
> that's how you handle completions. The block completion path has always
> been hard IRQ safe, but possible to call from anywhere.

A more recent example is polled IO, which will always complete from
process/task context and very much is fast path.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-10-31 10:41                   ` Sebastian Andrzej Siewior
@ 2020-10-31 15:00                     ` Jens Axboe
  2020-10-31 15:01                       ` Jens Axboe
  2020-11-02  9:55                       ` Sebastian Andrzej Siewior
  0 siblings, 2 replies; 54+ messages in thread
From: Jens Axboe @ 2020-10-31 15:00 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Sagi Grimberg
  Cc: Christoph Hellwig, linux-block, Thomas Gleixner, David Runge,
	linux-rt-users, linux-kernel, Peter Zijlstra, Daniel Wagner,
	Mike Galbraith

On 10/31/20 4:41 AM, Sebastian Andrzej Siewior wrote:
> On 2020-10-29 14:07:59 [-0700], Sagi Grimberg wrote:
>>> in which context?
>>
>> Not sure what is the question.
> 
> The question is in which context do you complete your requests. My guess
> by now is "usually softirq/NAPI and context in rare error case".

There really aren't any rules for this, and it's perfectly legit to
complete from process context. Maybe you're a kthread driven driver and
that's how you handle completions. The block completion path has always
been hard IRQ safe, but possible to call from anywhere.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-10-29 21:07                 ` Sagi Grimberg
@ 2020-10-31 10:41                   ` Sebastian Andrzej Siewior
  2020-10-31 15:00                     ` Jens Axboe
  0 siblings, 1 reply; 54+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-10-31 10:41 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, linux-block, Thomas Gleixner, David Runge,
	linux-rt-users, Jens Axboe, linux-kernel, Peter Zijlstra,
	Daniel Wagner, Mike Galbraith

On 2020-10-29 14:07:59 [-0700], Sagi Grimberg wrote:
> > in which context?
> 
> Not sure what is the question.

The question is in which context do you complete your requests. My guess
by now is "usually softirq/NAPI and context in rare error case".

> > But this is probably nr_hw_queues > 1?
> 
> Yes.

So this means it will either complete directly or issue an IPI.

> > but running it in softirq on the remote CPU would still allow of other
> > packets to come on the remote CPU (which would block BLOCK sofirq if
> > NET_RX is already running).
> 
> Not sure I understand your comment, if napi triggers on core X and we
> complete from that, it will trigger IPI to core Y, and there with patch #2
> is will trigger softirq instead of calling ->complete directly no?

This is correct. But trigger softirq does not mean that it will wake
`ksoftirqd' as it is the case for the usb-storage right now. In your
case (completing from NAPI/sofitrq (or for most other driver which
complete in their IRQ handler)) it means:
- trigger IPI
- IPI will OR the BLOCK-softirq bit.
- on exit from IPI it will invoke do_softirq() (unless softirq is
  already pending and got interrupted by the IPI) and complete the
  Block request.

Sebastian

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-10-29 21:01               ` Sebastian Andrzej Siewior
@ 2020-10-29 21:07                 ` Sagi Grimberg
  2020-10-31 10:41                   ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 54+ messages in thread
From: Sagi Grimberg @ 2020-10-29 21:07 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Christoph Hellwig, linux-block, Thomas Gleixner, David Runge,
	linux-rt-users, Jens Axboe, linux-kernel, Peter Zijlstra,
	Daniel Wagner, Mike Galbraith


>>>>> Well, usb-storage obviously seems to do it, and the block layer
>>>>> does not prohibit it.
>>>>
>>>> Also loop, nvme-tcp and then I stopped looking.
>>>> Any objections about adding local_bh_disable() around it?
>>>
>>> To me it seems like the whole IPI plus potentially softirq dance is
>>> a little pointless when completing from process context.
>>
>> I agree.
>>
>>> Sagi, any opinion on that from the nvme-tcp POV?
>>
>> nvme-tcp should (almost) always complete from the context that matches
>> the rq->mq_ctx->cpu as the thread that processes incoming
>> completions (per hctx) should be affinitized to match it (unless cpus
>> come and go).
> 
> in which context?

Not sure what is the question.

> But this is probably nr_hw_queues > 1?

Yes.

>> So for nvme-tcp I don't expect blk_mq_complete_need_ipi to return true
>> in normal operation. That leaves the teardowns+aborts, which aren't very
>> interesting here.
> 
> The process context invocation is nvme_tcp_complete_timed_out().

Yes.

>> I would note that nvme-tcp does not go to sleep after completing every
>> I/O like how sebastian indicated usb does.
>>
>> Having said that, today the network stack is calling nvme_tcp_data_ready
>> in napi context (softirq) which in turn triggers the queue thread to
>> handle network rx (and complete the I/O). It's been measured recently
>> that running the rx context directly in softirq will save some
>> latency (possible because nvme-tcp rx context is non-blocking).
>>
>> So I'd think that patch #2 is unnecessary and just add overhead for
>> nvme-tcp.. do note that the napi softirq cpu mapping depends on the RSS
>> steering, which is unlikely to match rq->mq_ctx->cpu, hence if completed
>> from napi context, nvme-tcp will probably always go to the IPI path.
> 
> but running it in softirq on the remote CPU would still allow of other
> packets to come on the remote CPU (which would block BLOCK sofirq if
> NET_RX is already running).

Not sure I understand your comment, if napi triggers on core X and we
complete from that, it will trigger IPI to core Y, and there with patch 
#2 is will trigger softirq instead of calling ->complete directly no?

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-10-29 20:03             ` Sagi Grimberg
@ 2020-10-29 21:01               ` Sebastian Andrzej Siewior
  2020-10-29 21:07                 ` Sagi Grimberg
  0 siblings, 1 reply; 54+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-10-29 21:01 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, linux-block, Thomas Gleixner, David Runge,
	linux-rt-users, Jens Axboe, linux-kernel, Peter Zijlstra,
	Daniel Wagner, Mike Galbraith

On 2020-10-29 13:03:26 [-0700], Sagi Grimberg wrote:
> 
> > > > Well, usb-storage obviously seems to do it, and the block layer
> > > > does not prohibit it.
> > > 
> > > Also loop, nvme-tcp and then I stopped looking.
> > > Any objections about adding local_bh_disable() around it?
> > 
> > To me it seems like the whole IPI plus potentially softirq dance is
> > a little pointless when completing from process context.
> 
> I agree.
> 
> > Sagi, any opinion on that from the nvme-tcp POV?
> 
> nvme-tcp should (almost) always complete from the context that matches
> the rq->mq_ctx->cpu as the thread that processes incoming
> completions (per hctx) should be affinitized to match it (unless cpus
> come and go).

in which context? But this is probably nr_hw_queues > 1?

> So for nvme-tcp I don't expect blk_mq_complete_need_ipi to return true
> in normal operation. That leaves the teardowns+aborts, which aren't very
> interesting here.

The process context invocation is nvme_tcp_complete_timed_out().

> I would note that nvme-tcp does not go to sleep after completing every
> I/O like how sebastian indicated usb does.
> 
> Having said that, today the network stack is calling nvme_tcp_data_ready
> in napi context (softirq) which in turn triggers the queue thread to
> handle network rx (and complete the I/O). It's been measured recently
> that running the rx context directly in softirq will save some
> latency (possible because nvme-tcp rx context is non-blocking).
> 
> So I'd think that patch #2 is unnecessary and just add overhead for
> nvme-tcp.. do note that the napi softirq cpu mapping depends on the RSS
> steering, which is unlikely to match rq->mq_ctx->cpu, hence if completed
> from napi context, nvme-tcp will probably always go to the IPI path.

but running it in softirq on the remote CPU would still allow of other
packets to come on the remote CPU (which would block BLOCK sofirq if
NET_RX is already running).

Sebastian

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-10-29 14:57           ` Christoph Hellwig
@ 2020-10-29 20:03             ` Sagi Grimberg
  2020-10-29 21:01               ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 54+ messages in thread
From: Sagi Grimberg @ 2020-10-29 20:03 UTC (permalink / raw)
  To: Christoph Hellwig, Sebastian Andrzej Siewior
  Cc: linux-block, Thomas Gleixner, David Runge, linux-rt-users,
	Jens Axboe, linux-kernel, Peter Zijlstra, Daniel Wagner,
	Mike Galbraith


>>> Well, usb-storage obviously seems to do it, and the block layer
>>> does not prohibit it.
>>
>> Also loop, nvme-tcp and then I stopped looking.
>> Any objections about adding local_bh_disable() around it?
> 
> To me it seems like the whole IPI plus potentially softirq dance is
> a little pointless when completing from process context.

I agree.

> Sagi, any opinion on that from the nvme-tcp POV?

nvme-tcp should (almost) always complete from the context that matches
the rq->mq_ctx->cpu as the thread that processes incoming
completions (per hctx) should be affinitized to match it (unless cpus
come and go).

So for nvme-tcp I don't expect blk_mq_complete_need_ipi to return true
in normal operation. That leaves the teardowns+aborts, which aren't very
interesting here.

I would note that nvme-tcp does not go to sleep after completing every
I/O like how sebastian indicated usb does.

Having said that, today the network stack is calling nvme_tcp_data_ready
in napi context (softirq) which in turn triggers the queue thread to
handle network rx (and complete the I/O). It's been measured recently
that running the rx context directly in softirq will save some
latency (possible because nvme-tcp rx context is non-blocking).

So I'd think that patch #2 is unnecessary and just add overhead for
nvme-tcp.. do note that the napi softirq cpu mapping depends on the RSS
steering, which is unlikely to match rq->mq_ctx->cpu, hence if completed
from napi context, nvme-tcp will probably always go to the IPI path.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-10-29 14:56         ` Sebastian Andrzej Siewior
@ 2020-10-29 14:57           ` Christoph Hellwig
  2020-10-29 20:03             ` Sagi Grimberg
  0 siblings, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2020-10-29 14:57 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Christoph Hellwig, linux-block, Thomas Gleixner, David Runge,
	linux-rt-users, Jens Axboe, linux-kernel, Peter Zijlstra,
	Daniel Wagner, Mike Galbraith, Sagi Grimberg

On Thu, Oct 29, 2020 at 03:56:23PM +0100, Sebastian Andrzej Siewior wrote:
> On 2020-10-29 14:05:36 [+0000], Christoph Hellwig wrote:
> > Well, usb-storage obviously seems to do it, and the block layer
> > does not prohibit it.
> 
> Also loop, nvme-tcp and then I stopped looking.
> Any objections about adding local_bh_disable() around it?

To me it seems like the whole IPI plus potentially softirq dance is
a little pointless when completing from process context.

Sagi, any opinion on that from the nvme-tcp POV?

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-10-29 14:05       ` Christoph Hellwig
@ 2020-10-29 14:56         ` Sebastian Andrzej Siewior
  2020-10-29 14:57           ` Christoph Hellwig
  0 siblings, 1 reply; 54+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-10-29 14:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, Thomas Gleixner, David Runge, linux-rt-users,
	Jens Axboe, linux-kernel, Peter Zijlstra, Daniel Wagner,
	Mike Galbraith

On 2020-10-29 14:05:36 [+0000], Christoph Hellwig wrote:
> Well, usb-storage obviously seems to do it, and the block layer
> does not prohibit it.

Also loop, nvme-tcp and then I stopped looking.
Any objections about adding local_bh_disable() around it?

Sebastian

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-10-29 13:12     ` Sebastian Andrzej Siewior
@ 2020-10-29 14:05       ` Christoph Hellwig
  2020-10-29 14:56         ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2020-10-29 14:05 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-block, Christoph Hellwig, Thomas Gleixner, David Runge,
	linux-rt-users, Jens Axboe, linux-kernel, Peter Zijlstra,
	Daniel Wagner, Mike Galbraith

On Thu, Oct 29, 2020 at 02:12:12PM +0100, Sebastian Andrzej Siewior wrote:
> Are there many drivers completing the SCSI requests in preemtible
> context? In this case it would be more efficient to complete the request
> directly (usb_stor_control_thread() goes to sleep after that anyway and
> there is only one request at a time).

Well, usb-storage obviously seems to do it, and the block layer
does not prohibit it.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-10-28 14:12   ` [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done Sebastian Andrzej Siewior
  2020-10-28 14:44     ` Christoph Hellwig
@ 2020-10-29 13:12     ` Sebastian Andrzej Siewior
  2020-10-29 14:05       ` Christoph Hellwig
  1 sibling, 1 reply; 54+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-10-29 13:12 UTC (permalink / raw)
  To: linux-block
  Cc: Christoph Hellwig, Thomas Gleixner, David Runge, linux-rt-users,
	Jens Axboe, linux-kernel, Peter Zijlstra, Daniel Wagner,
	Mike Galbraith

On 2020-10-28 15:12:51 [+0100], To linux-block@vger.kernel.org wrote:
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -667,14 +632,21 @@ bool blk_mq_complete_request_remote(struct request *rq)
>  		return false;
>  
>  	if (blk_mq_complete_need_ipi(rq)) {
>  	} else {
>  		if (rq->q->nr_hw_queues > 1)
>  			return false;
> -		blk_mq_trigger_softirq(rq);
> +		cpu_list = this_cpu_ptr(&blk_cpu_done);
> +		if (llist_add(&rq->ipi_list, cpu_list))
> +			raise_softirq(BLOCK_SOFTIRQ);
>  	}
>  
>  	return true;

So Mike posted this:
| BUG: using smp_processor_id() in preemptible [00000000] code: usb-storage/841
| caller is blk_mq_complete_request_remote.part.0+0xa2/0x120
| CPU: 0 PID: 841 Comm: usb-storage Not tainted 5.10.0-rc1+ #61
| Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-1 04/01/2014
| Call Trace:
|  dump_stack+0x77/0x97
|  check_preemption_disabled+0xbe/0xc0
|  blk_mq_complete_request_remote.part.0+0xa2/0x120
|  blk_mq_complete_request+0x2e/0x40
|  usb_stor_control_thread+0x29a/0x300
|  kthread+0x14b/0x170
|  ret_from_fork+0x22/0x30

This comes from this_cpu_ptr() because usb_stor_control_thread() runs
with enabled preemption.

Adding preempt_disable() around it will make the warning go away but
will wake the ksoftirqd (this happens now, too).
Adding local_bh_disable() around it would perform the completion
immediately (instead of waking kssoftirqd) but local_bh_enable() feels
slightly more expensive.

Are there many drivers completing the SCSI requests in preemtible
context? In this case it would be more efficient to complete the request
directly (usb_stor_control_thread() goes to sleep after that anyway and
there is only one request at a time).

Sebastian

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-10-28 14:44     ` Christoph Hellwig
@ 2020-10-28 14:47       ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 54+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-10-28 14:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, Thomas Gleixner, David Runge, linux-rt-users,
	Jens Axboe, linux-kernel, Peter Zijlstra, Daniel Wagner

On 2020-10-28 14:44:53 [+0000], Christoph Hellwig wrote:
> On Wed, Oct 28, 2020 at 03:12:51PM +0100, Sebastian Andrzej Siewior wrote:
> >  static int blk_softirq_cpu_dead(unsigned int cpu)
> >  {
> > -	/*
> > -	 * If a CPU goes away, splice its entries to the current CPU
> > -	 * and trigger a run of the softirq
> > -	 */
> > -	local_irq_disable();
> > -	list_splice_init(&per_cpu(blk_cpu_done, cpu),
> > -			 this_cpu_ptr(&blk_cpu_done));
> > -	raise_softirq_irqoff(BLOCK_SOFTIRQ);
> > -	local_irq_enable();
> > -
> > +	blk_complete_reqs(&per_cpu(blk_cpu_done, cpu));
> >  	return 0;
> 
> How can this be preempted?  Can't we keep using this_cpu_ptr here?

cpu of the dead CPU != this CPU.

Sebastian

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-10-28 14:12   ` [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done Sebastian Andrzej Siewior
@ 2020-10-28 14:44     ` Christoph Hellwig
  2020-10-28 14:47       ` Sebastian Andrzej Siewior
  2020-10-29 13:12     ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2020-10-28 14:44 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-block, Christoph Hellwig, Thomas Gleixner, David Runge,
	linux-rt-users, Jens Axboe, linux-kernel, Peter Zijlstra,
	Daniel Wagner

On Wed, Oct 28, 2020 at 03:12:51PM +0100, Sebastian Andrzej Siewior wrote:
>  static int blk_softirq_cpu_dead(unsigned int cpu)
>  {
> -	/*
> -	 * If a CPU goes away, splice its entries to the current CPU
> -	 * and trigger a run of the softirq
> -	 */
> -	local_irq_disable();
> -	list_splice_init(&per_cpu(blk_cpu_done, cpu),
> -			 this_cpu_ptr(&blk_cpu_done));
> -	raise_softirq_irqoff(BLOCK_SOFTIRQ);
> -	local_irq_enable();
> -
> +	blk_complete_reqs(&per_cpu(blk_cpu_done, cpu));
>  	return 0;

How can this be preempted?  Can't we keep using this_cpu_ptr here?

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
  2020-10-28 14:12 ` [PATCH 1/3] blk-mq: Don't complete on a remote CPU in force threaded mode Sebastian Andrzej Siewior
@ 2020-10-28 14:12   ` Sebastian Andrzej Siewior
  2020-10-28 14:44     ` Christoph Hellwig
  2020-10-29 13:12     ` Sebastian Andrzej Siewior
  0 siblings, 2 replies; 54+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-10-28 14:12 UTC (permalink / raw)
  To: linux-block
  Cc: Christoph Hellwig, Thomas Gleixner, David Runge, linux-rt-users,
	Jens Axboe, linux-kernel, Peter Zijlstra, Daniel Wagner,
	Sebastian Andrzej Siewior

With llist_head it is possible to avoid the locking (the irq-off region)
when items are added. This makes it possible to add items on a remote
CPU.
llist_add() returns true if the list was previously empty. This can be
used to invoke the SMP function call / raise sofirq only if the first
item was added (otherwise it is already pending).
This simplifies the code a little and reduces the IRQ-off regions. With
this change it possible to reduce the SMP-function call a simple
__raise_softirq_irqoff().

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 block/blk-mq.c         | 78 ++++++++++++++----------------------------
 include/linux/blkdev.h |  2 +-
 2 files changed, 26 insertions(+), 54 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 769d2d532a825..4f53de48e5038 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -41,7 +41,7 @@
 #include "blk-mq-sched.h"
 #include "blk-rq-qos.h"
 
-static DEFINE_PER_CPU(struct list_head, blk_cpu_done);
+static DEFINE_PER_CPU(struct llist_head, blk_cpu_done);
 
 static void blk_mq_poll_stats_start(struct request_queue *q);
 static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
@@ -565,68 +565,32 @@ void blk_mq_end_request(struct request *rq, blk_status_t error)
 }
 EXPORT_SYMBOL(blk_mq_end_request);
 
-/*
- * Softirq action handler - move entries to local list and loop over them
- * while passing them to the queue registered handler.
- */
-static __latent_entropy void blk_done_softirq(struct softirq_action *h)
+static void blk_complete_reqs(struct llist_head *cpu_list)
 {
-	struct list_head *cpu_list, local_list;
+	struct llist_node *entry;
+	struct request *rq, *rq_next;
 
-	local_irq_disable();
-	cpu_list = this_cpu_ptr(&blk_cpu_done);
-	list_replace_init(cpu_list, &local_list);
-	local_irq_enable();
+	entry = llist_del_all(cpu_list);
+	entry = llist_reverse_order(entry);
 
-	while (!list_empty(&local_list)) {
-		struct request *rq;
-
-		rq = list_entry(local_list.next, struct request, ipi_list);
-		list_del_init(&rq->ipi_list);
+	llist_for_each_entry_safe(rq, rq_next, entry, ipi_list)
 		rq->q->mq_ops->complete(rq);
-	}
 }
 
-static void blk_mq_trigger_softirq(struct request *rq)
+static __latent_entropy void blk_done_softirq(struct softirq_action *h)
 {
-	struct list_head *list;
-	unsigned long flags;
-
-	local_irq_save(flags);
-	list = this_cpu_ptr(&blk_cpu_done);
-	list_add_tail(&rq->ipi_list, list);
-
-	/*
-	 * If the list only contains our just added request, signal a raise of
-	 * the softirq.  If there are already entries there, someone already
-	 * raised the irq but it hasn't run yet.
-	 */
-	if (list->next == &rq->ipi_list)
-		raise_softirq_irqoff(BLOCK_SOFTIRQ);
-	local_irq_restore(flags);
+	blk_complete_reqs(this_cpu_ptr(&blk_cpu_done));
 }
 
 static int blk_softirq_cpu_dead(unsigned int cpu)
 {
-	/*
-	 * If a CPU goes away, splice its entries to the current CPU
-	 * and trigger a run of the softirq
-	 */
-	local_irq_disable();
-	list_splice_init(&per_cpu(blk_cpu_done, cpu),
-			 this_cpu_ptr(&blk_cpu_done));
-	raise_softirq_irqoff(BLOCK_SOFTIRQ);
-	local_irq_enable();
-
+	blk_complete_reqs(&per_cpu(blk_cpu_done, cpu));
 	return 0;
 }
 
-
 static void __blk_mq_complete_request_remote(void *data)
 {
-	struct request *rq = data;
-
-	blk_mq_trigger_softirq(rq);
+	__raise_softirq_irqoff(BLOCK_SOFTIRQ);
 }
 
 static inline bool blk_mq_complete_need_ipi(struct request *rq)
@@ -657,6 +621,7 @@ static inline bool blk_mq_complete_need_ipi(struct request *rq)
 
 bool blk_mq_complete_request_remote(struct request *rq)
 {
+	struct llist_head *cpu_list;
 	WRITE_ONCE(rq->state, MQ_RQ_COMPLETE);
 
 	/*
@@ -667,14 +632,21 @@ bool blk_mq_complete_request_remote(struct request *rq)
 		return false;
 
 	if (blk_mq_complete_need_ipi(rq)) {
-		rq->csd.func = __blk_mq_complete_request_remote;
-		rq->csd.info = rq;
-		rq->csd.flags = 0;
-		smp_call_function_single_async(rq->mq_ctx->cpu, &rq->csd);
+		unsigned int cpu;
+
+		cpu = rq->mq_ctx->cpu;
+		cpu_list = &per_cpu(blk_cpu_done, cpu);
+		if (llist_add(&rq->ipi_list, cpu_list)) {
+			rq->csd.func = __blk_mq_complete_request_remote;
+			rq->csd.flags = 0;
+			smp_call_function_single_async(cpu, &rq->csd);
+		}
 	} else {
 		if (rq->q->nr_hw_queues > 1)
 			return false;
-		blk_mq_trigger_softirq(rq);
+		cpu_list = this_cpu_ptr(&blk_cpu_done);
+		if (llist_add(&rq->ipi_list, cpu_list))
+			raise_softirq(BLOCK_SOFTIRQ);
 	}
 
 	return true;
@@ -3905,7 +3877,7 @@ static int __init blk_mq_init(void)
 	int i;
 
 	for_each_possible_cpu(i)
-		INIT_LIST_HEAD(&per_cpu(blk_cpu_done, i));
+		init_llist_head(&per_cpu(blk_cpu_done, i));
 	open_softirq(BLOCK_SOFTIRQ, blk_done_softirq);
 
 	cpuhp_setup_state_nocalls(CPUHP_BLOCK_SOFTIRQ_DEAD,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 639cae2c158b5..331b2b675b417 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -156,7 +156,7 @@ struct request {
 	 */
 	union {
 		struct hlist_node hash;	/* merge hash */
-		struct list_head ipi_list;
+		struct llist_node ipi_list;
 	};
 
 	/*
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2021-01-26 19:54 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-04 19:13 [PATCH 0/3 v2] blk-mq: Don't complete in IRQ, use llist_head Sebastian Andrzej Siewior
2020-12-04 19:13 ` [PATCH 1/3] blk-mq: Don't complete on a remote CPU in force threaded mode Sebastian Andrzej Siewior
2020-12-08 13:10   ` Christoph Hellwig
2020-12-04 19:13 ` [PATCH 2/3] blk-mq: Always complete remote completions requests in softirq Sebastian Andrzej Siewior
2020-12-07 23:52   ` Jens Axboe
2020-12-08  8:22     ` Sebastian Andrzej Siewior
2020-12-08  8:44       ` Daniel Wagner
2020-12-08 11:36         ` Daniel Wagner
2020-12-08 11:49           ` Sebastian Andrzej Siewior
2020-12-08 12:41             ` Daniel Wagner
2020-12-08 12:52               ` Sebastian Andrzej Siewior
2020-12-08 12:57                 ` Sebastian Andrzej Siewior
2020-12-08 13:27                   ` Daniel Wagner
2020-12-17 16:45         ` Jens Axboe
2020-12-17 16:49           ` Daniel Wagner
2020-12-17 16:54             ` Jens Axboe
2020-12-08 13:13     ` Christoph Hellwig
2020-12-17 16:43       ` Sebastian Andrzej Siewior
2020-12-17 16:55         ` Jens Axboe
2020-12-17 16:58           ` Sebastian Andrzej Siewior
2020-12-17 17:05             ` Daniel Wagner
2020-12-17 18:16           ` Daniel Wagner
2020-12-17 18:22             ` Jens Axboe
2020-12-17 18:41               ` Daniel Wagner
2020-12-17 18:46                 ` Jens Axboe
2020-12-17 19:07                   ` Daniel Wagner
2020-12-17 19:13                     ` Jens Axboe
2020-12-17 19:15                       ` Daniel Wagner
2020-12-04 19:13 ` [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done Sebastian Andrzej Siewior
2020-12-08 13:20   ` Christoph Hellwig
2020-12-08 13:28     ` Christoph Hellwig
2020-12-14 20:20     ` Sebastian Andrzej Siewior
  -- strict thread matches above, loose matches on Subject: below --
2021-01-23 20:10 [PATCH v3 0/3] blk-mq: Don't complete in IRQ, use llist_head Sebastian Andrzej Siewior
2021-01-23 20:10 ` [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done Sebastian Andrzej Siewior
2021-01-25  8:30   ` Christoph Hellwig
2021-01-25  8:32     ` Sebastian Andrzej Siewior
2021-01-25  8:39       ` Christoph Hellwig
2020-10-28  6:56 [PATCH RFC] blk-mq: Don't IPI requests on PREEMPT_RT Christoph Hellwig
2020-10-28 14:12 ` [PATCH 1/3] blk-mq: Don't complete on a remote CPU in force threaded mode Sebastian Andrzej Siewior
2020-10-28 14:12   ` [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done Sebastian Andrzej Siewior
2020-10-28 14:44     ` Christoph Hellwig
2020-10-28 14:47       ` Sebastian Andrzej Siewior
2020-10-29 13:12     ` Sebastian Andrzej Siewior
2020-10-29 14:05       ` Christoph Hellwig
2020-10-29 14:56         ` Sebastian Andrzej Siewior
2020-10-29 14:57           ` Christoph Hellwig
2020-10-29 20:03             ` Sagi Grimberg
2020-10-29 21:01               ` Sebastian Andrzej Siewior
2020-10-29 21:07                 ` Sagi Grimberg
2020-10-31 10:41                   ` Sebastian Andrzej Siewior
2020-10-31 15:00                     ` Jens Axboe
2020-10-31 15:01                       ` Jens Axboe
2020-10-31 18:09                         ` Christoph Hellwig
2020-11-02  9:55                       ` Sebastian Andrzej Siewior
2020-11-02 18:12                         ` Christoph Hellwig
2020-11-04 19:15                           ` Sagi Grimberg
2020-11-06 15:23                           ` Sebastian Andrzej Siewior

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.