* [PATCHSET 0/6] Various block optimizations
@ 2018-11-10 15:13 Jens Axboe
  2018-11-10 15:13   ` Jens Axboe
                   ` (5 more replies)
  0 siblings, 6 replies; 28+ messages in thread
From: Jens Axboe @ 2018-11-10 15:13 UTC (permalink / raw)
  To: linux-block

A mixed bag of optimizations that are now possible, on top of my
for-4.21/block branch.


 block/blk-cgroup.c         |   8 ++--
 block/blk-core.c           |  16 ++++----
 block/blk-flush.c          |   3 +-
 block/blk-mq-debugfs.c     |   2 +-
 block/blk-mq.c             |  24 +++++------
 block/blk-mq.h             |  12 +++---
 block/blk-rq-qos.c         |  90 +++++++++++-----------------------------
 block/blk-rq-qos.h         |  35 ++++++++++++----
 block/blk-softirq.c        |   4 +-
 block/blk-sysfs.c          |  18 ++++----
 block/blk-wbt.c            |   2 +-
 block/elevator.c           |   9 ++--
 block/genhd.c              |   8 ++--
 drivers/md/dm-table.c      |   2 +-
 drivers/nvme/host/pci.c    |  17 ++++++--
 drivers/scsi/scsi_lib.c    |   2 +-
 fs/block_dev.c             |  29 ++++++++++---
 fs/iomap.c                 |   2 +-
 include/linux/blk-mq-ops.h | 100 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/blk-mq.h     |  94 +-----------------------------------------
 include/linux/blkdev.h     |  27 ++++++++++--
 21 files changed, 268 insertions(+), 236 deletions(-)

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/6] nvme: don't disable local ints for polled queue
  2018-11-10 15:13 [PATCHSET 0/6] Various block optimizations Jens Axboe
@ 2018-11-10 15:13   ` Jens Axboe
  2018-11-10 15:13 ` [PATCH 2/6] block: add queue_is_mq() helper Jens Axboe
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: Jens Axboe @ 2018-11-10 15:13 UTC (permalink / raw)
  To: linux-block; +Cc: Jens Axboe, Keith Busch, linux-nvme

A polled queue doesn't trigger interrupts, so it's always safe
to grab the queue lock without disabling interrupts.

Cc: Keith Busch <keith.busch@intel.com>
Cc: linux-nvme@lists.infradead.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 drivers/nvme/host/pci.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6aa86dfcb32c..a6e3fbddfadf 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1061,15 +1061,26 @@ static irqreturn_t nvme_irq_check(int irq, void *data)
 
 static int __nvme_poll(struct nvme_queue *nvmeq, unsigned int tag)
 {
+	unsigned long flags = 0; /* gcc 7.x fail */
 	u16 start, end;
-	bool found;
+	bool found, has_irq;
 
 	if (!nvme_cqe_pending(nvmeq))
 		return 0;
 
-	spin_lock_irq(&nvmeq->cq_lock);
+	/*
+	 * Polled queue doesn't have an IRQ, no need to disable ints
+	 */
+	has_irq = !nvmeq->polled;
+	if (has_irq)
+		local_irq_save(flags);
+
+	spin_lock(&nvmeq->cq_lock);
 	found = nvme_process_cq(nvmeq, &start, &end, tag);
-	spin_unlock_irq(&nvmeq->cq_lock);
+	spin_unlock(&nvmeq->cq_lock);
+
+	if (has_irq)
+		local_irq_restore(flags);
 
 	nvme_complete_cqes(nvmeq, start, end);
 	return found;
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 2/6] block: add queue_is_mq() helper
  2018-11-10 15:13 [PATCHSET 0/6] Various block optimizations Jens Axboe
  2018-11-10 15:13   ` Jens Axboe
@ 2018-11-10 15:13 ` Jens Axboe
  2018-11-10 15:13 ` [PATCH 3/6] blk-mq: embed blk_mq_ops directly in the request queue Jens Axboe
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: Jens Axboe @ 2018-11-10 15:13 UTC (permalink / raw)
  To: linux-block; +Cc: Jens Axboe

Various spots check for q->mq_ops being non-NULL; provide a
helper to do this instead.

Where the ->mq_ops != NULL check is redundant, remove it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-cgroup.c     |  8 ++++----
 block/blk-core.c       | 10 +++++-----
 block/blk-flush.c      |  3 +--
 block/blk-mq.c         |  2 +-
 block/blk-sysfs.c      | 14 +++++++-------
 block/blk-wbt.c        |  2 +-
 block/elevator.c       |  9 ++++-----
 block/genhd.c          |  8 ++++----
 drivers/md/dm-table.c  |  2 +-
 include/linux/blkdev.h | 10 +++++++---
 10 files changed, 35 insertions(+), 33 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 6c65791bc3fe..8da8d3773ecf 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1349,7 +1349,7 @@ int blkcg_activate_policy(struct request_queue *q,
 	if (blkcg_policy_enabled(q, pol))
 		return 0;
 
-	if (q->mq_ops)
+	if (queue_is_mq(q))
 		blk_mq_freeze_queue(q);
 pd_prealloc:
 	if (!pd_prealloc) {
@@ -1388,7 +1388,7 @@ int blkcg_activate_policy(struct request_queue *q,
 
 	spin_unlock_irq(q->queue_lock);
 out_bypass_end:
-	if (q->mq_ops)
+	if (queue_is_mq(q))
 		blk_mq_unfreeze_queue(q);
 	if (pd_prealloc)
 		pol->pd_free_fn(pd_prealloc);
@@ -1412,7 +1412,7 @@ void blkcg_deactivate_policy(struct request_queue *q,
 	if (!blkcg_policy_enabled(q, pol))
 		return;
 
-	if (q->mq_ops)
+	if (queue_is_mq(q))
 		blk_mq_freeze_queue(q);
 
 	spin_lock_irq(q->queue_lock);
@@ -1430,7 +1430,7 @@ void blkcg_deactivate_policy(struct request_queue *q,
 
 	spin_unlock_irq(q->queue_lock);
 
-	if (q->mq_ops)
+	if (queue_is_mq(q))
 		blk_mq_unfreeze_queue(q);
 }
 EXPORT_SYMBOL_GPL(blkcg_deactivate_policy);
diff --git a/block/blk-core.c b/block/blk-core.c
index fdc0ad2686c4..ab6675fd3568 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -268,7 +268,7 @@ void blk_sync_queue(struct request_queue *q)
 	del_timer_sync(&q->timeout);
 	cancel_work_sync(&q->timeout_work);
 
-	if (q->mq_ops) {
+	if (queue_is_mq(q)) {
 		struct blk_mq_hw_ctx *hctx;
 		int i;
 
@@ -317,7 +317,7 @@ void blk_set_queue_dying(struct request_queue *q)
 	 */
 	blk_freeze_queue_start(q);
 
-	if (q->mq_ops)
+	if (queue_is_mq(q))
 		blk_mq_wake_waiters(q);
 
 	/* Make blk_queue_enter() reexamine the DYING flag. */
@@ -410,7 +410,7 @@ void blk_cleanup_queue(struct request_queue *q)
 	 * blk_freeze_queue() should be enough for cases of passthrough
 	 * request.
 	 */
-	if (q->mq_ops && blk_queue_init_done(q))
+	if (queue_is_mq(q) && blk_queue_init_done(q))
 		blk_mq_quiesce_queue(q);
 
 	/* for synchronous bio-based driver finish in-flight integrity i/o */
@@ -428,7 +428,7 @@ void blk_cleanup_queue(struct request_queue *q)
 
 	blk_exit_queue(q);
 
-	if (q->mq_ops)
+	if (queue_is_mq(q))
 		blk_mq_free_queue(q);
 
 	percpu_ref_exit(&q->q_usage_counter);
@@ -1736,7 +1736,7 @@ EXPORT_SYMBOL_GPL(rq_flush_dcache_pages);
  */
 int blk_lld_busy(struct request_queue *q)
 {
-	if (q->mq_ops && q->mq_ops->busy)
+	if (queue_is_mq(q) && q->mq_ops->busy)
 		return q->mq_ops->busy(q);
 
 	return 0;
diff --git a/block/blk-flush.c b/block/blk-flush.c
index c53197dcdd70..3b79bea03462 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -273,8 +273,7 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
 	 * assigned to empty flushes, and we deadlock if we are expecting
 	 * other requests to make progress. Don't defer for that case.
 	 */
-	if (!list_empty(&fq->flush_data_in_flight) &&
-	    !(q->mq_ops && q->elevator) &&
+	if (!list_empty(&fq->flush_data_in_flight) && q->elevator &&
 	    time_before(jiffies,
 			fq->flush_pending_since + FLUSH_PENDING_TIMEOUT))
 		return;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 411be60d0cb6..eb9b9596d3de 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -150,7 +150,7 @@ void blk_freeze_queue_start(struct request_queue *q)
 	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
 	if (freeze_depth == 1) {
 		percpu_ref_kill(&q->q_usage_counter);
-		if (q->mq_ops)
+		if (queue_is_mq(q))
 			blk_mq_run_hw_queues(q, false);
 	}
 }
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index d4b1b84ba8ca..93635a693314 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -68,7 +68,7 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
 	unsigned long nr;
 	int ret, err;
 
-	if (!q->mq_ops)
+	if (!queue_is_mq(q))
 		return -EINVAL;
 
 	ret = queue_var_store(&nr, page, count);
@@ -839,12 +839,12 @@ static void __blk_release_queue(struct work_struct *work)
 
 	blk_queue_free_zone_bitmaps(q);
 
-	if (q->mq_ops)
+	if (queue_is_mq(q))
 		blk_mq_release(q);
 
 	blk_trace_shutdown(q);
 
-	if (q->mq_ops)
+	if (queue_is_mq(q))
 		blk_mq_debugfs_unregister(q);
 
 	bioset_exit(&q->bio_split);
@@ -918,7 +918,7 @@ int blk_register_queue(struct gendisk *disk)
 		goto unlock;
 	}
 
-	if (q->mq_ops) {
+	if (queue_is_mq(q)) {
 		__blk_mq_register_dev(dev, q);
 		blk_mq_debugfs_register(q);
 	}
@@ -929,7 +929,7 @@ int blk_register_queue(struct gendisk *disk)
 
 	blk_throtl_register_queue(q);
 
-	if ((q->mq_ops && q->elevator)) {
+	if (q->elevator) {
 		ret = elv_register_queue(q);
 		if (ret) {
 			mutex_unlock(&q->sysfs_lock);
@@ -978,7 +978,7 @@ void blk_unregister_queue(struct gendisk *disk)
 	 * Remove the sysfs attributes before unregistering the queue data
 	 * structures that can be modified through sysfs.
 	 */
-	if (q->mq_ops)
+	if (queue_is_mq(q))
 		blk_mq_unregister_dev(disk_to_dev(disk), q);
 	mutex_unlock(&q->sysfs_lock);
 
@@ -987,7 +987,7 @@ void blk_unregister_queue(struct gendisk *disk)
 	blk_trace_remove_sysfs(disk_to_dev(disk));
 
 	mutex_lock(&q->sysfs_lock);
-	if (q->mq_ops && q->elevator)
+	if (q->elevator)
 		elv_unregister_queue(q);
 	mutex_unlock(&q->sysfs_lock);
 
diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 0fc222d4194b..b580763e8b1e 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -709,7 +709,7 @@ void wbt_enable_default(struct request_queue *q)
 	if (!test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags))
 		return;
 
-	if (q->mq_ops && IS_ENABLED(CONFIG_BLK_WBT_MQ))
+	if (queue_is_mq(q) && IS_ENABLED(CONFIG_BLK_WBT_MQ))
 		wbt_init(q);
 }
 EXPORT_SYMBOL_GPL(wbt_enable_default);
diff --git a/block/elevator.c b/block/elevator.c
index 19351ffa56b1..9df9ae6b038e 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -674,7 +674,7 @@ static int __elevator_change(struct request_queue *q, const char *name)
 	/*
 	 * Special case for mq, turn off scheduling
 	 */
-	if (q->mq_ops && !strncmp(name, "none", 4))
+	if (!strncmp(name, "none", 4))
 		return elevator_switch(q, NULL);
 
 	strlcpy(elevator_name, name, sizeof(elevator_name));
@@ -692,8 +692,7 @@ static int __elevator_change(struct request_queue *q, const char *name)
 
 static inline bool elv_support_iosched(struct request_queue *q)
 {
-	if (q->mq_ops && q->tag_set && (q->tag_set->flags &
-				BLK_MQ_F_NO_SCHED))
+	if (q->tag_set && (q->tag_set->flags & BLK_MQ_F_NO_SCHED))
 		return false;
 	return true;
 }
@@ -703,7 +702,7 @@ ssize_t elv_iosched_store(struct request_queue *q, const char *name,
 {
 	int ret;
 
-	if (!q->mq_ops || !elv_support_iosched(q))
+	if (!queue_is_mq(q) || !elv_support_iosched(q))
 		return count;
 
 	ret = __elevator_change(q, name);
@@ -739,7 +738,7 @@ ssize_t elv_iosched_show(struct request_queue *q, char *name)
 	}
 	spin_unlock(&elv_list_lock);
 
-	if (q->mq_ops && q->elevator)
+	if (q->elevator)
 		len += sprintf(name+len, "none");
 
 	len += sprintf(len+name, "\n");
diff --git a/block/genhd.c b/block/genhd.c
index cff6bdf27226..0145bcb0cc76 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -47,7 +47,7 @@ static void disk_release_events(struct gendisk *disk);
 
 void part_inc_in_flight(struct request_queue *q, struct hd_struct *part, int rw)
 {
-	if (q->mq_ops)
+	if (queue_is_mq(q))
 		return;
 
 	atomic_inc(&part->in_flight[rw]);
@@ -57,7 +57,7 @@ void part_inc_in_flight(struct request_queue *q, struct hd_struct *part, int rw)
 
 void part_dec_in_flight(struct request_queue *q, struct hd_struct *part, int rw)
 {
-	if (q->mq_ops)
+	if (queue_is_mq(q))
 		return;
 
 	atomic_dec(&part->in_flight[rw]);
@@ -68,7 +68,7 @@ void part_dec_in_flight(struct request_queue *q, struct hd_struct *part, int rw)
 void part_in_flight(struct request_queue *q, struct hd_struct *part,
 		    unsigned int inflight[2])
 {
-	if (q->mq_ops) {
+	if (queue_is_mq(q)) {
 		blk_mq_in_flight(q, part, inflight);
 		return;
 	}
@@ -85,7 +85,7 @@ void part_in_flight(struct request_queue *q, struct hd_struct *part,
 void part_in_flight_rw(struct request_queue *q, struct hd_struct *part,
 		       unsigned int inflight[2])
 {
-	if (q->mq_ops) {
+	if (queue_is_mq(q)) {
 		blk_mq_in_flight_rw(q, part, inflight);
 		return;
 	}
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 9038c302d5c2..e42739177107 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -919,7 +919,7 @@ static int device_is_rq_based(struct dm_target *ti, struct dm_dev *dev,
 	struct request_queue *q = bdev_get_queue(dev->bdev);
 	struct verify_rq_based_data *v = data;
 
-	if (q->mq_ops)
+	if (queue_is_mq(q))
 		v->mq_count++;
 	else
 		v->sq_count++;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e67ad2dd025e..3712d1fe48d4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -671,13 +671,17 @@ static inline bool blk_account_rq(struct request *rq)
 
 #define rq_data_dir(rq)		(op_is_write(req_op(rq)) ? WRITE : READ)
 
+static inline bool queue_is_mq(struct request_queue *q)
+{
+	return q->mq_ops;
+}
+
 /*
- * Driver can handle struct request, if it either has an old style
- * request_fn defined, or is blk-mq based.
+ * Only blk-mq based drivers are rq based
  */
 static inline bool queue_is_rq_based(struct request_queue *q)
 {
-	return q->mq_ops;
+	return queue_is_mq(q);
 }
 
 static inline unsigned int blk_queue_cluster(struct request_queue *q)
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 3/6] blk-mq: embed blk_mq_ops directly in the request queue
  2018-11-10 15:13 [PATCHSET 0/6] Various block optimizations Jens Axboe
  2018-11-10 15:13   ` Jens Axboe
  2018-11-10 15:13 ` [PATCH 2/6] block: add queue_is_mq() helper Jens Axboe
@ 2018-11-10 15:13 ` Jens Axboe
  2018-11-14 15:20   ` Christoph Hellwig
  2018-11-10 15:13 ` [PATCH 4/6] block: avoid ordered task state change for polled IO Jens Axboe
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 28+ messages in thread
From: Jens Axboe @ 2018-11-10 15:13 UTC (permalink / raw)
  To: linux-block; +Cc: Jens Axboe

This saves an indirect function call every time we have to
call one of the strategy functions. We keep it const, and just
hack around that a bit in blk_mq_init_allocated_queue(), which
is where we copy the ops in.
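
For illustration, this is what the change does to the queue_rq call site
in the diff below (before vs after):

	/* before: load the ops pointer, then the function pointer
	 * through it, before making the indirect call */
	ret = q->mq_ops->queue_rq(hctx, &bd);

	/* after: the ops are embedded in the request queue, so one
	 * dependent pointer load goes away */
	ret = q->mq_ops.queue_rq(hctx, &bd);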

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-core.c           |   8 +--
 block/blk-mq-debugfs.c     |   2 +-
 block/blk-mq.c             |  22 ++++----
 block/blk-mq.h             |  12 ++---
 block/blk-softirq.c        |   4 +-
 block/blk-sysfs.c          |   4 +-
 drivers/scsi/scsi_lib.c    |   2 +-
 include/linux/blk-mq-ops.h | 100 +++++++++++++++++++++++++++++++++++++
 include/linux/blk-mq.h     |  94 +---------------------------------
 include/linux/blkdev.h     |   6 ++-
 10 files changed, 132 insertions(+), 122 deletions(-)
 create mode 100644 include/linux/blk-mq-ops.h

diff --git a/block/blk-core.c b/block/blk-core.c
index ab6675fd3568..88400ab166ac 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -656,8 +656,8 @@ struct request *blk_get_request(struct request_queue *q, unsigned int op,
 	WARN_ON_ONCE(flags & ~(BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_PREEMPT));
 
 	req = blk_mq_alloc_request(q, op, flags);
-	if (!IS_ERR(req) && q->mq_ops->initialize_rq_fn)
-		q->mq_ops->initialize_rq_fn(req);
+	if (!IS_ERR(req) && q->mq_ops.initialize_rq_fn)
+		q->mq_ops.initialize_rq_fn(req);
 
 	return req;
 }
@@ -1736,8 +1736,8 @@ EXPORT_SYMBOL_GPL(rq_flush_dcache_pages);
  */
 int blk_lld_busy(struct request_queue *q)
 {
-	if (queue_is_mq(q) && q->mq_ops->busy)
-		return q->mq_ops->busy(q);
+	if (q->mq_ops.busy)
+		return q->mq_ops.busy(q);
 
 	return 0;
 }
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index f021f4817b80..efdfb6258e03 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -354,7 +354,7 @@ static const char *blk_mq_rq_state_name(enum mq_rq_state rq_state)
 
 int __blk_mq_debugfs_rq_show(struct seq_file *m, struct request *rq)
 {
-	const struct blk_mq_ops *const mq_ops = rq->q->mq_ops;
+	const struct blk_mq_ops *const mq_ops = &rq->q->mq_ops;
 	const unsigned int op = rq->cmd_flags & REQ_OP_MASK;
 
 	seq_printf(m, "%p {.op=", rq);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index eb9b9596d3de..6e0cb6adfc90 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -558,7 +558,7 @@ static void __blk_mq_complete_request_remote(void *data)
 	struct request *rq = data;
 	struct request_queue *q = rq->q;
 
-	q->mq_ops->complete(rq);
+	q->mq_ops.complete(rq);
 }
 
 static void __blk_mq_complete_request(struct request *rq)
@@ -586,7 +586,7 @@ static void __blk_mq_complete_request(struct request *rq)
 	}
 
 	if (!test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags)) {
-		q->mq_ops->complete(rq);
+		q->mq_ops.complete(rq);
 		return;
 	}
 
@@ -600,7 +600,7 @@ static void __blk_mq_complete_request(struct request *rq)
 		rq->csd.flags = 0;
 		smp_call_function_single_async(ctx->cpu, &rq->csd);
 	} else {
-		q->mq_ops->complete(rq);
+		q->mq_ops.complete(rq);
 	}
 	put_cpu();
 }
@@ -818,10 +818,10 @@ EXPORT_SYMBOL_GPL(blk_mq_queue_busy);
 static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 {
 	req->rq_flags |= RQF_TIMED_OUT;
-	if (req->q->mq_ops->timeout) {
+	if (req->q->mq_ops.timeout) {
 		enum blk_eh_timer_return ret;
 
-		ret = req->q->mq_ops->timeout(req, reserved);
+		ret = req->q->mq_ops.timeout(req, reserved);
 		if (ret == BLK_EH_DONE)
 			return;
 		WARN_ON_ONCE(ret != BLK_EH_RESET_TIMER);
@@ -1221,7 +1221,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 			bd.last = !blk_mq_get_driver_tag(nxt);
 		}
 
-		ret = q->mq_ops->queue_rq(hctx, &bd);
+		ret = q->mq_ops.queue_rq(hctx, &bd);
 		if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
 			/*
 			 * If an I/O scheduler has been configured and we got a
@@ -1746,7 +1746,7 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
 	 * Any other error (busy), just add it to our list as we
 	 * previously would have done.
 	 */
-	ret = q->mq_ops->queue_rq(hctx, &bd);
+	ret = q->mq_ops.queue_rq(hctx, &bd);
 	switch (ret) {
 	case BLK_STS_OK:
 		blk_mq_update_dispatch_busy(hctx, false);
@@ -2723,7 +2723,7 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 						  struct request_queue *q)
 {
 	/* mark the queue as mq asap */
-	q->mq_ops = set->ops;
+	memcpy((void *) &q->mq_ops, set->ops, sizeof(q->mq_ops));
 
 	q->poll_cb = blk_stat_alloc_callback(blk_mq_poll_stats_fn,
 					     blk_mq_poll_stats_bkt,
@@ -2765,7 +2765,7 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	spin_lock_init(&q->requeue_lock);
 
 	blk_queue_make_request(q, blk_mq_make_request);
-	if (q->mq_ops->poll)
+	if (q->mq_ops.poll)
 		q->poll_fn = blk_mq_poll;
 
 	/*
@@ -2797,7 +2797,7 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 err_percpu:
 	free_percpu(q->queue_ctx);
 err_exit:
-	q->mq_ops = NULL;
+	memset((void *) &q->mq_ops, 0, sizeof(q->mq_ops));
 	return ERR_PTR(-ENOMEM);
 }
 EXPORT_SYMBOL(blk_mq_init_allocated_queue);
@@ -3328,7 +3328,7 @@ static bool __blk_mq_poll(struct blk_mq_hw_ctx *hctx, struct request *rq)
 
 		hctx->poll_invoked++;
 
-		ret = q->mq_ops->poll(hctx, rq->tag);
+		ret = q->mq_ops.poll(hctx, rq->tag);
 		if (ret > 0) {
 			hctx->poll_success++;
 			set_current_state(TASK_RUNNING);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index facb6e9ddce4..1eb6a3e8af58 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -99,8 +99,8 @@ static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
 {
 	int hctx_type = 0;
 
-	if (q->mq_ops->rq_flags_to_type)
-		hctx_type = q->mq_ops->rq_flags_to_type(q, flags);
+	if (q->mq_ops.rq_flags_to_type)
+		hctx_type = q->mq_ops.rq_flags_to_type(q, flags);
 
 	return blk_mq_map_queue_type(q, hctx_type, cpu);
 }
@@ -187,16 +187,16 @@ static inline void blk_mq_put_dispatch_budget(struct blk_mq_hw_ctx *hctx)
 {
 	struct request_queue *q = hctx->queue;
 
-	if (q->mq_ops->put_budget)
-		q->mq_ops->put_budget(hctx);
+	if (q->mq_ops.put_budget)
+		q->mq_ops.put_budget(hctx);
 }
 
 static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
 {
 	struct request_queue *q = hctx->queue;
 
-	if (q->mq_ops->get_budget)
-		return q->mq_ops->get_budget(hctx);
+	if (q->mq_ops.get_budget)
+		return q->mq_ops.get_budget(hctx);
 	return true;
 }
 
diff --git a/block/blk-softirq.c b/block/blk-softirq.c
index 1534066e306e..2f4176668470 100644
--- a/block/blk-softirq.c
+++ b/block/blk-softirq.c
@@ -34,7 +34,7 @@ static __latent_entropy void blk_done_softirq(struct softirq_action *h)
 
 		rq = list_entry(local_list.next, struct request, ipi_list);
 		list_del_init(&rq->ipi_list);
-		rq->q->mq_ops->complete(rq);
+		rq->q->mq_ops.complete(rq);
 	}
 }
 
@@ -102,7 +102,7 @@ void __blk_complete_request(struct request *req)
 	unsigned long flags;
 	bool shared = false;
 
-	BUG_ON(!q->mq_ops->complete);
+	BUG_ON(!q->mq_ops.complete);
 
 	local_irq_save(flags);
 	cpu = smp_processor_id();
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 93635a693314..9661ef5b390f 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -380,7 +380,7 @@ static ssize_t queue_poll_delay_store(struct request_queue *q, const char *page,
 {
 	int err, val;
 
-	if (!q->mq_ops || !q->mq_ops->poll)
+	if (!q->mq_ops.poll)
 		return -EINVAL;
 
 	err = kstrtoint(page, 10, &val);
@@ -406,7 +406,7 @@ static ssize_t queue_poll_store(struct request_queue *q, const char *page,
 	unsigned long poll_on;
 	ssize_t ret;
 
-	if (!q->mq_ops || !q->mq_ops->poll)
+	if (!q->mq_ops.poll)
 		return -EINVAL;
 
 	ret = queue_var_store(&poll_on, page, count);
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 5d83a162d03b..61babcb269ab 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1907,7 +1907,7 @@ struct scsi_device *scsi_device_from_queue(struct request_queue *q)
 {
 	struct scsi_device *sdev = NULL;
 
-	if (q->mq_ops == &scsi_mq_ops)
+	if (q->mq_ops.queue_rq == scsi_mq_ops.queue_rq)
 		sdev = q->queuedata;
 	if (!sdev || !get_device(&sdev->sdev_gendev))
 		sdev = NULL;
diff --git a/include/linux/blk-mq-ops.h b/include/linux/blk-mq-ops.h
new file mode 100644
index 000000000000..0940c26875ca
--- /dev/null
+++ b/include/linux/blk-mq-ops.h
@@ -0,0 +1,100 @@
+#ifndef BLK_MQ_OPS_H
+#define BLK_MQ_OPS_H
+
+struct blk_mq_queue_data;
+struct blk_mq_hw_ctx;
+struct blk_mq_tag_set;
+
+typedef blk_status_t (queue_rq_fn)(struct blk_mq_hw_ctx *,
+		const struct blk_mq_queue_data *);
+/* takes rq->cmd_flags as input, returns a hardware type index */
+typedef int (rq_flags_to_type_fn)(struct request_queue *, unsigned int);
+typedef bool (get_budget_fn)(struct blk_mq_hw_ctx *);
+typedef void (put_budget_fn)(struct blk_mq_hw_ctx *);
+typedef enum blk_eh_timer_return (timeout_fn)(struct request *, bool);
+typedef int (init_hctx_fn)(struct blk_mq_hw_ctx *, void *, unsigned int);
+typedef void (exit_hctx_fn)(struct blk_mq_hw_ctx *, unsigned int);
+typedef int (init_request_fn)(struct blk_mq_tag_set *set, struct request *,
+		unsigned int, unsigned int);
+typedef void (exit_request_fn)(struct blk_mq_tag_set *set, struct request *,
+		unsigned int);
+
+typedef bool (busy_iter_fn)(struct blk_mq_hw_ctx *, struct request *, void *,
+		bool);
+typedef bool (busy_tag_iter_fn)(struct request *, void *, bool);
+typedef int (poll_fn)(struct blk_mq_hw_ctx *, unsigned int);
+typedef int (map_queues_fn)(struct blk_mq_tag_set *set);
+typedef bool (busy_fn)(struct request_queue *);
+typedef void (complete_fn)(struct request *);
+
+struct blk_mq_ops {
+	/*
+	 * Queue request
+	 */
+	queue_rq_fn		*queue_rq;
+
+	/*
+	 * Return a queue map type for the given request/bio flags
+	 */
+	rq_flags_to_type_fn	*rq_flags_to_type;
+
+	/*
+	 * Reserve budget before queue request, once .queue_rq is
+	 * run, it is driver's responsibility to release the
+	 * reserved budget. Also we have to handle failure case
+	 * of .get_budget for avoiding I/O deadlock.
+	 */
+	get_budget_fn		*get_budget;
+	put_budget_fn		*put_budget;
+
+	/*
+	 * Called on request timeout
+	 */
+	timeout_fn		*timeout;
+
+	/*
+	 * Called to poll for completion of a specific tag.
+	 */
+	poll_fn			*poll;
+
+	complete_fn		*complete;
+
+	/*
+	 * Called when the block layer side of a hardware queue has been
+	 * set up, allowing the driver to allocate/init matching structures.
+	 * Ditto for exit/teardown.
+	 */
+	init_hctx_fn		*init_hctx;
+	exit_hctx_fn		*exit_hctx;
+
+	/*
+	 * Called for every command allocated by the block layer to allow
+	 * the driver to set up driver specific data.
+	 *
+	 * Tag greater than or equal to queue_depth is for setting up
+	 * flush request.
+	 *
+	 * Ditto for exit/teardown.
+	 */
+	init_request_fn		*init_request;
+	exit_request_fn		*exit_request;
+	/* Called from inside blk_get_request() */
+	void (*initialize_rq_fn)(struct request *rq);
+
+	/*
+	 * If set, returns whether or not this queue currently is busy
+	 */
+	busy_fn			*busy;
+
+	map_queues_fn		*map_queues;
+
+#ifdef CONFIG_BLK_DEBUG_FS
+	/*
+	 * Used by the debugfs implementation to show driver-specific
+	 * information about a request.
+	 */
+	void (*show_rq)(struct seq_file *m, struct request *rq);
+#endif
+};
+
+#endif
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 929e8abc5535..e32e9293e5a0 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -5,6 +5,7 @@
 #include <linux/blkdev.h>
 #include <linux/sbitmap.h>
 #include <linux/srcu.h>
+#include <linux/blk-mq-ops.h>
 
 struct blk_mq_tags;
 struct blk_flush_queue;
@@ -115,99 +116,6 @@ struct blk_mq_queue_data {
 	bool last;
 };
 
-typedef blk_status_t (queue_rq_fn)(struct blk_mq_hw_ctx *,
-		const struct blk_mq_queue_data *);
-/* takes rq->cmd_flags as input, returns a hardware type index */
-typedef int (rq_flags_to_type_fn)(struct request_queue *, unsigned int);
-typedef bool (get_budget_fn)(struct blk_mq_hw_ctx *);
-typedef void (put_budget_fn)(struct blk_mq_hw_ctx *);
-typedef enum blk_eh_timer_return (timeout_fn)(struct request *, bool);
-typedef int (init_hctx_fn)(struct blk_mq_hw_ctx *, void *, unsigned int);
-typedef void (exit_hctx_fn)(struct blk_mq_hw_ctx *, unsigned int);
-typedef int (init_request_fn)(struct blk_mq_tag_set *set, struct request *,
-		unsigned int, unsigned int);
-typedef void (exit_request_fn)(struct blk_mq_tag_set *set, struct request *,
-		unsigned int);
-
-typedef bool (busy_iter_fn)(struct blk_mq_hw_ctx *, struct request *, void *,
-		bool);
-typedef bool (busy_tag_iter_fn)(struct request *, void *, bool);
-typedef int (poll_fn)(struct blk_mq_hw_ctx *, unsigned int);
-typedef int (map_queues_fn)(struct blk_mq_tag_set *set);
-typedef bool (busy_fn)(struct request_queue *);
-typedef void (complete_fn)(struct request *);
-
-
-struct blk_mq_ops {
-	/*
-	 * Queue request
-	 */
-	queue_rq_fn		*queue_rq;
-
-	/*
-	 * Return a queue map type for the given request/bio flags
-	 */
-	rq_flags_to_type_fn	*rq_flags_to_type;
-
-	/*
-	 * Reserve budget before queue request, once .queue_rq is
-	 * run, it is driver's responsibility to release the
-	 * reserved budget. Also we have to handle failure case
-	 * of .get_budget for avoiding I/O deadlock.
-	 */
-	get_budget_fn		*get_budget;
-	put_budget_fn		*put_budget;
-
-	/*
-	 * Called on request timeout
-	 */
-	timeout_fn		*timeout;
-
-	/*
-	 * Called to poll for completion of a specific tag.
-	 */
-	poll_fn			*poll;
-
-	complete_fn		*complete;
-
-	/*
-	 * Called when the block layer side of a hardware queue has been
-	 * set up, allowing the driver to allocate/init matching structures.
-	 * Ditto for exit/teardown.
-	 */
-	init_hctx_fn		*init_hctx;
-	exit_hctx_fn		*exit_hctx;
-
-	/*
-	 * Called for every command allocated by the block layer to allow
-	 * the driver to set up driver specific data.
-	 *
-	 * Tag greater than or equal to queue_depth is for setting up
-	 * flush request.
-	 *
-	 * Ditto for exit/teardown.
-	 */
-	init_request_fn		*init_request;
-	exit_request_fn		*exit_request;
-	/* Called from inside blk_get_request() */
-	void (*initialize_rq_fn)(struct request *rq);
-
-	/*
-	 * If set, returns whether or not this queue currently is busy
-	 */
-	busy_fn			*busy;
-
-	map_queues_fn		*map_queues;
-
-#ifdef CONFIG_BLK_DEBUG_FS
-	/*
-	 * Used by the debugfs implementation to show driver-specific
-	 * information about a request.
-	 */
-	void (*show_rq)(struct seq_file *m, struct request *rq);
-#endif
-};
-
 enum {
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
 	BLK_MQ_F_TAG_SHARED	= 1 << 1,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 3712d1fe48d4..ad8474ec8c58 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -28,6 +28,8 @@
 #include <linux/scatterlist.h>
 #include <linux/blkzoned.h>
 
+#include <linux/blk-mq-ops.h>
+
 struct module;
 struct scsi_ioctl_command;
 
@@ -406,7 +408,7 @@ struct request_queue {
 	poll_q_fn		*poll_fn;
 	dma_drain_needed_fn	*dma_drain_needed;
 
-	const struct blk_mq_ops	*mq_ops;
+	const struct blk_mq_ops	mq_ops;
 
 	/* sw queues */
 	struct blk_mq_ctx __percpu	*queue_ctx;
@@ -673,7 +675,7 @@ static inline bool blk_account_rq(struct request *rq)
 
 static inline bool queue_is_mq(struct request_queue *q)
 {
-	return q->mq_ops;
+	return q->mq_ops.queue_rq != NULL;
 }
 
 /*
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 4/6] block: avoid ordered task state change for polled IO
  2018-11-10 15:13 [PATCHSET 0/6] Various block optimizations Jens Axboe
                   ` (2 preceding siblings ...)
  2018-11-10 15:13 ` [PATCH 3/6] blk-mq: embed blk_mq_ops directly in the request queue Jens Axboe
@ 2018-11-10 15:13 ` Jens Axboe
  2018-11-12  9:35   ` jianchao.wang
  2018-11-10 15:13 ` [PATCH 5/6] blk-rq-qos: inline check for q->rq_qos functions Jens Axboe
  2018-11-10 15:13 ` [PATCH 6/6] block: add polled wakeup task helper Jens Axboe
  5 siblings, 1 reply; 28+ messages in thread
From: Jens Axboe @ 2018-11-10 15:13 UTC (permalink / raw)
  To: linux-block; +Cc: Jens Axboe

We only really need the barrier if we're going to be sleeping,
if we're just polling we're fine with the __set_current_state()
variant.
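
For context, a sketch of the distinction being relied on here, loosely
paraphrasing the scheduler helpers of this era (simplified, not the
literal kernel source):

	/* plain store of the task state, no implied barrier */
	#define __set_current_state(state_value)			\
		do { current->state = (state_value); } while (0)

	/* store plus full memory barrier, ordering the state change
	 * against the subsequent condition check */
	#define set_current_state(state_value)				\
		smp_store_mb(current->state, (state_value))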

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/block_dev.c | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index c039abfb2052..e7550b1f9670 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -237,12 +237,23 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 
 	qc = submit_bio(&bio);
 	for (;;) {
-		set_current_state(TASK_UNINTERRUPTIBLE);
+		/*
+		 * Using non-atomic task state for polling
+		 */
+		__set_current_state(TASK_UNINTERRUPTIBLE);
+
 		if (!READ_ONCE(bio.bi_private))
 			break;
+
 		if (!(iocb->ki_flags & IOCB_HIPRI) ||
-		    !blk_poll(bdev_get_queue(bdev), qc))
+		    !blk_poll(bdev_get_queue(bdev), qc)) {
+			/*
+			 * If we're scheduling, ensure that task state
+			 * change is ordered and seen.
+			 */
+			smp_mb();
 			io_schedule();
+		}
 	}
 	__set_current_state(TASK_RUNNING);
 
@@ -403,13 +414,19 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 		return -EIOCBQUEUED;
 
 	for (;;) {
-		set_current_state(TASK_UNINTERRUPTIBLE);
+		/*
+		 * See __blkdev_direct_IO_simple() loop
+		 */
+		__set_current_state(TASK_UNINTERRUPTIBLE);
+
 		if (!READ_ONCE(dio->waiter))
 			break;
 
 		if (!(iocb->ki_flags & IOCB_HIPRI) ||
-		    !blk_poll(bdev_get_queue(bdev), qc))
+		    !blk_poll(bdev_get_queue(bdev), qc)) {
+			smp_mb();
 			io_schedule();
+		}
 	}
 	__set_current_state(TASK_RUNNING);
 
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 5/6] blk-rq-qos: inline check for q->rq_qos functions
  2018-11-10 15:13 [PATCHSET 0/6] Various block optimizations Jens Axboe
                   ` (3 preceding siblings ...)
  2018-11-10 15:13 ` [PATCH 4/6] block: avoid ordered task state change for polled IO Jens Axboe
@ 2018-11-10 15:13 ` Jens Axboe
  2018-11-14 15:21   ` Christoph Hellwig
  2018-11-10 15:13 ` [PATCH 6/6] block: add polled wakeup task helper Jens Axboe
  5 siblings, 1 reply; 28+ messages in thread
From: Jens Axboe @ 2018-11-10 15:13 UTC (permalink / raw)
  To: linux-block; +Cc: Jens Axboe, Josef Bacik

Put the short code in the fast path, where we don't have any
functions attached to the queue. This minimizes the impact on
the hot path in the core code.

Clean up duplicated code by having a macro set up both the inline
check and the actual functions.
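
For reference, this is roughly what one generated pair expands to, using
the cleanup hook as the example (the other hooks follow the same pattern):

	/* from RQ_QOS_FUNC_ONE(cleanup, struct bio *) in blk-rq-qos.h:
	 * the cheap q->rq_qos check stays inline in the caller */
	void __rq_qos_cleanup(struct rq_qos *rqos, struct bio *arg);
	static inline void rq_qos_cleanup(struct request_queue *q, struct bio *arg)
	{
		if (q->rq_qos)
			__rq_qos_cleanup(q->rq_qos, arg);
	}

	/* from __RQ_QOS_FUNC_ONE(cleanup, struct bio *) in blk-rq-qos.c:
	 * the list walk stays out of line */
	void __rq_qos_cleanup(struct rq_qos *rqos, struct bio *arg)
	{
		do {
			if (rqos->ops->cleanup)
				rqos->ops->cleanup(rqos, arg);
			rqos = rqos->next;
		} while (rqos);
	}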

Cc: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-rq-qos.c | 90 +++++++++++++---------------------------------
 block/blk-rq-qos.h | 35 ++++++++++++++----
 2 files changed, 52 insertions(+), 73 deletions(-)

diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
index 0005dfd568dd..266c9e111475 100644
--- a/block/blk-rq-qos.c
+++ b/block/blk-rq-qos.c
@@ -27,76 +27,34 @@ bool rq_wait_inc_below(struct rq_wait *rq_wait, unsigned int limit)
 	return atomic_inc_below(&rq_wait->inflight, limit);
 }
 
-void rq_qos_cleanup(struct request_queue *q, struct bio *bio)
-{
-	struct rq_qos *rqos;
-
-	for (rqos = q->rq_qos; rqos; rqos = rqos->next) {
-		if (rqos->ops->cleanup)
-			rqos->ops->cleanup(rqos, bio);
-	}
-}
-
-void rq_qos_done(struct request_queue *q, struct request *rq)
-{
-	struct rq_qos *rqos;
-
-	for (rqos = q->rq_qos; rqos; rqos = rqos->next) {
-		if (rqos->ops->done)
-			rqos->ops->done(rqos, rq);
-	}
-}
-
-void rq_qos_issue(struct request_queue *q, struct request *rq)
-{
-	struct rq_qos *rqos;
-
-	for(rqos = q->rq_qos; rqos; rqos = rqos->next) {
-		if (rqos->ops->issue)
-			rqos->ops->issue(rqos, rq);
-	}
-}
-
-void rq_qos_requeue(struct request_queue *q, struct request *rq)
-{
-	struct rq_qos *rqos;
-
-	for(rqos = q->rq_qos; rqos; rqos = rqos->next) {
-		if (rqos->ops->requeue)
-			rqos->ops->requeue(rqos, rq);
-	}
+#define __RQ_QOS_FUNC_ONE(__OP, type)					\
+void __rq_qos_##__OP(struct rq_qos *rqos, type arg)			\
+{									\
+	do {								\
+		if ((rqos)->ops->__OP)					\
+			(rqos)->ops->__OP((rqos), arg);			\
+		(rqos) = (rqos)->next;					\
+	} while (rqos);							\
 }
 
-void rq_qos_throttle(struct request_queue *q, struct bio *bio,
-		     spinlock_t *lock)
-{
-	struct rq_qos *rqos;
-
-	for(rqos = q->rq_qos; rqos; rqos = rqos->next) {
-		if (rqos->ops->throttle)
-			rqos->ops->throttle(rqos, bio, lock);
-	}
+__RQ_QOS_FUNC_ONE(cleanup, struct bio *);
+__RQ_QOS_FUNC_ONE(done, struct request *);
+__RQ_QOS_FUNC_ONE(issue, struct request *);
+__RQ_QOS_FUNC_ONE(requeue, struct request *);
+__RQ_QOS_FUNC_ONE(done_bio, struct bio *);
+
+#define __RQ_QOS_FUNC_TWO(__OP, type1, type2)				\
+void __rq_qos_##__OP(struct rq_qos *rqos, type1 arg1, type2 arg2)	\
+{									\
+	do {								\
+		if ((rqos)->ops->__OP)					\
+			(rqos)->ops->__OP((rqos), arg1, arg2);		\
+		(rqos) = (rqos)->next;					\
+	} while (rqos);							\
 }
 
-void rq_qos_track(struct request_queue *q, struct request *rq, struct bio *bio)
-{
-	struct rq_qos *rqos;
-
-	for(rqos = q->rq_qos; rqos; rqos = rqos->next) {
-		if (rqos->ops->track)
-			rqos->ops->track(rqos, rq, bio);
-	}
-}
-
-void rq_qos_done_bio(struct request_queue *q, struct bio *bio)
-{
-	struct rq_qos *rqos;
-
-	for(rqos = q->rq_qos; rqos; rqos = rqos->next) {
-		if (rqos->ops->done_bio)
-			rqos->ops->done_bio(rqos, bio);
-	}
-}
+__RQ_QOS_FUNC_TWO(throttle, struct bio *, spinlock_t *);
+__RQ_QOS_FUNC_TWO(track, struct request *, struct bio *);
 
 /*
  * Return true, if we can't increase the depth further by scaling
diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
index 32b02efbfa66..50558a6ea248 100644
--- a/block/blk-rq-qos.h
+++ b/block/blk-rq-qos.h
@@ -98,12 +98,33 @@ void rq_depth_scale_up(struct rq_depth *rqd);
 void rq_depth_scale_down(struct rq_depth *rqd, bool hard_throttle);
 bool rq_depth_calc_max_depth(struct rq_depth *rqd);
 
-void rq_qos_cleanup(struct request_queue *, struct bio *);
-void rq_qos_done(struct request_queue *, struct request *);
-void rq_qos_issue(struct request_queue *, struct request *);
-void rq_qos_requeue(struct request_queue *, struct request *);
-void rq_qos_done_bio(struct request_queue *q, struct bio *bio);
-void rq_qos_throttle(struct request_queue *, struct bio *, spinlock_t *);
-void rq_qos_track(struct request_queue *q, struct request *, struct bio *);
+#define RQ_QOS_FUNC_ONE(__OP, type)					\
+void __rq_qos_##__OP(struct rq_qos *rqos, type arg);			\
+static inline void rq_qos_##__OP(struct request_queue *q, type arg)	\
+{									\
+	if ((q)->rq_qos)						\
+		__rq_qos_##__OP((q)->rq_qos, arg);			\
+}
+
+#define RQ_QOS_FUNC_TWO(__OP, type1, type2)				\
+void __rq_qos_##__OP(struct rq_qos *rqos, type1 arg1, type2 arg2);	\
+static inline void rq_qos_##__OP(struct request_queue *q, type1 arg1,	\
+				 type2 arg2)				\
+{									\
+	if ((q)->rq_qos)						\
+		__rq_qos_##__OP((q)->rq_qos, arg1, arg2);		\
+}
+
+RQ_QOS_FUNC_ONE(cleanup, struct bio *);
+RQ_QOS_FUNC_ONE(done, struct request *);
+RQ_QOS_FUNC_ONE(issue, struct request *);
+RQ_QOS_FUNC_ONE(requeue, struct request *);
+RQ_QOS_FUNC_ONE(done_bio, struct bio *);
+RQ_QOS_FUNC_TWO(throttle, struct bio *, spinlock_t *);
+RQ_QOS_FUNC_TWO(track, struct request *, struct bio *);
+#undef RQ_QOS_FUNC_ONE
+#undef RQ_QOS_FUNC_TWO
+
 void rq_qos_exit(struct request_queue *);
+
 #endif
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 6/6] block: add polled wakeup task helper
  2018-11-10 15:13 [PATCHSET 0/6] Various block optimizations Jens Axboe
                   ` (4 preceding siblings ...)
  2018-11-10 15:13 ` [PATCH 5/6] blk-rq-qos: inline check for q->rq_qos functions Jens Axboe
@ 2018-11-10 15:13 ` Jens Axboe
  5 siblings, 0 replies; 28+ messages in thread
From: Jens Axboe @ 2018-11-10 15:13 UTC (permalink / raw)
  To: linux-block; +Cc: Jens Axboe

If we're polling for IO on a device that doesn't use interrupts, then
the IO completion loop (and waking of the task) is done by the submitting
task itself. In that case we don't need to enter wake_up_process(); we
can simply mark ourselves as TASK_RUNNING.
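
In the polled case the completion runs in the context of the task that is
waiting on it, so the helper boils down to a plain state store; a condensed
view of how the pieces in the diff below fit together:

	/* completion side, blkdev_bio_end_io_simple() */
	WRITE_ONCE(bio->bi_private, NULL);
	blk_wake_io_task(waiter);

	/* the helper: waiter == current when we are the polling task */
	if (waiter == current)
		__set_current_state(TASK_RUNNING);
	else
		wake_up_process(waiter);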

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/block_dev.c         |  4 ++--
 fs/iomap.c             |  2 +-
 include/linux/blkdev.h | 13 +++++++++++++
 3 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index e7550b1f9670..eea37d801f1c 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -181,7 +181,7 @@ static void blkdev_bio_end_io_simple(struct bio *bio)
 	struct task_struct *waiter = bio->bi_private;
 
 	WRITE_ONCE(bio->bi_private, NULL);
-	wake_up_process(waiter);
+	blk_wake_io_task(waiter);
 }
 
 static ssize_t
@@ -316,7 +316,7 @@ static void blkdev_bio_end_io(struct bio *bio)
 			struct task_struct *waiter = dio->waiter;
 
 			WRITE_ONCE(dio->waiter, NULL);
-			wake_up_process(waiter);
+			blk_wake_io_task(waiter);
 		}
 	}
 
diff --git a/fs/iomap.c b/fs/iomap.c
index f61d13dfdf09..b0462b363bad 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1525,7 +1525,7 @@ static void iomap_dio_bio_end_io(struct bio *bio)
 		if (dio->wait_for_completion) {
 			struct task_struct *waiter = dio->submit.waiter;
 			WRITE_ONCE(dio->submit.waiter, NULL);
-			wake_up_process(waiter);
+			blk_wake_io_task(waiter);
 		} else if (dio->flags & IOMAP_DIO_WRITE) {
 			struct inode *inode = file_inode(dio->iocb->ki_filp);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ad8474ec8c58..b0ce931cc734 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1798,4 +1798,17 @@ static inline int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
 
 #endif /* CONFIG_BLOCK */
 
+static inline void blk_wake_io_task(struct task_struct *waiter)
+{
+	/*
+	 * If we're polling, the task itself is doing the completions. For
+	 * that case, we don't need to signal a wakeup, it's enough to just
+	 * mark us as RUNNING.
+	 */
+	if (waiter == current)
+		__set_current_state(TASK_RUNNING);
+	else
+		wake_up_process(waiter);
+}
+
 #endif
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/6] block: avoid ordered task state change for polled IO
  2018-11-10 15:13 ` [PATCH 4/6] block: avoid ordered task state change for polled IO Jens Axboe
@ 2018-11-12  9:35   ` jianchao.wang
  2018-11-12 16:26     ` Jens Axboe
  0 siblings, 1 reply; 28+ messages in thread
From: jianchao.wang @ 2018-11-12  9:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block

Hi Jens

On 11/10/18 11:13 PM, Jens Axboe wrote:
> We only really need the barrier if we're going to be sleeping,
> if we're just polling we're fine with the __set_current_state()
> variant.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  fs/block_dev.c | 25 +++++++++++++++++++++----
>  1 file changed, 21 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index c039abfb2052..e7550b1f9670 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -237,12 +237,23 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
>  
>  	qc = submit_bio(&bio);
>  	for (;;) {
> -		set_current_state(TASK_UNINTERRUPTIBLE);
> +		/*
> +		 * Using non-atomic task state for polling
> +		 */
> +		__set_current_state(TASK_UNINTERRUPTIBLE);
> +
>  		if (!READ_ONCE(bio.bi_private))
>  			break;

When we don't use polling, blkdev_bio_end_io_simple() may run on a different CPU
before submit_bio() returns.
For example,
__blkdev_direct_IO_simple
  qc = submit_bio(&bio) 
                                          blkdev_bio_end_io_simple
                                            WRITE_ONCE(bio.bi_private, NULL)
                                            wake_up_process(waiter)
  __set_current_state(TASK_UNINTERRUPTIBLE)                              
  READ_ONCE(bio.bi_private)

At the moment, at least an smp_rmb() may be needed before READ_ONCE(bio.bi_private)
to ensure the WRITE_ONCE(bio.bi_private, NULL) is seen if we miss the wakeup.

> +
>  		if (!(iocb->ki_flags & IOCB_HIPRI) ||
> -		    !blk_poll(bdev_get_queue(bdev), qc))
> +		    !blk_poll(bdev_get_queue(bdev), qc)) {
> +			/*
> +			 * If we're scheduling, ensure that task state
> +			 * change is ordered and seen.
> +			 */
> +			smp_mb();
>  			io_schedule();
> +		}
>  	}
>  	__set_current_state(TASK_RUNNING);
>  
> @@ -403,13 +414,19 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>  		return -EIOCBQUEUED;
>  
>  	for (;;) {
> -		set_current_state(TASK_UNINTERRUPTIBLE);
> +		/*
> +		 * See __blkdev_direct_IO_simple() loop
> +		 */
> +		__set_current_state(TASK_UNINTERRUPTIBLE);
> +

Similar with above.

>  		if (!READ_ONCE(dio->waiter))
>  			break;
>  
>  		if (!(iocb->ki_flags & IOCB_HIPRI) ||
> -		    !blk_poll(bdev_get_queue(bdev), qc))
> +		    !blk_poll(bdev_get_queue(bdev), qc)) {
> +			smp_mb();
>  			io_schedule();
> +		}
>  	}
>  	__set_current_state(TASK_RUNNING);
>  
> 

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/6] block: avoid ordered task state change for polled IO
  2018-11-12  9:35   ` jianchao.wang
@ 2018-11-12 16:26     ` Jens Axboe
  2018-11-13  2:02       ` jianchao.wang
  0 siblings, 1 reply; 28+ messages in thread
From: Jens Axboe @ 2018-11-12 16:26 UTC (permalink / raw)
  To: jianchao.wang, linux-block

On 11/12/18 2:35 AM, jianchao.wang wrote:
> Hi Jens
> 
> On 11/10/18 11:13 PM, Jens Axboe wrote:
>> We only really need the barrier if we're going to be sleeping,
>> if we're just polling we're fine with the __set_current_state()
>> variant.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  fs/block_dev.c | 25 +++++++++++++++++++++----
>>  1 file changed, 21 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index c039abfb2052..e7550b1f9670 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -237,12 +237,23 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
>>  
>>  	qc = submit_bio(&bio);
>>  	for (;;) {
>> -		set_current_state(TASK_UNINTERRUPTIBLE);
>> +		/*
>> +		 * Using non-atomic task state for polling
>> +		 */
>> +		__set_current_state(TASK_UNINTERRUPTIBLE);
>> +
>>  		if (!READ_ONCE(bio.bi_private))
>>  			break;
> 
> When we don't use polling, the blkdev_bio_end_io_simple may come on a different cpu before
> submit_bio returns 
> For example,
> __blkdev_direct_IO_simple
>   qc = submit_bio(&bio) 
>                                           blkdev_bio_end_io_simple
>                                             WRITE_ONCE(bio.bi_private, NULL)
>                                             wake_up_process(waiter)
>   __set_current_state(TASK_UNINTERRUPTIBLE)                              
>   READ_ONCE(bio.bi_private)

While I agree that you are right, that's an existing issue. The store
barrier implied by set_current_state() doesn't fix that.

How about this variant:

http://git.kernel.dk/cgit/linux-block/commit/?h=mq-perf&id=036ba7a8d1334889c0fe55d4858d78f4e8022f12

which then changes the later wake up helper patch to be:
 
http://git.kernel.dk/cgit/linux-block/commit/?h=mq-perf&id=f8c3f188425967adb040cfefb799b0a5a1df769d

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/6] block: avoid ordered task state change for polled IO
  2018-11-12 16:26     ` Jens Axboe
@ 2018-11-13  2:02       ` jianchao.wang
  0 siblings, 0 replies; 28+ messages in thread
From: jianchao.wang @ 2018-11-13  2:02 UTC (permalink / raw)
  To: Jens Axboe, linux-block

Hi Jens

On 11/13/18 12:26 AM, Jens Axboe wrote:
> On 11/12/18 2:35 AM, jianchao.wang wrote:
>> Hi Jens
>>
>> On 11/10/18 11:13 PM, Jens Axboe wrote:
>>> We only really need the barrier if we're going to be sleeping,
>>> if we're just polling we're fine with the __set_current_state()
>>> variant.
>>>
>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>> ---
>>>  fs/block_dev.c | 25 +++++++++++++++++++++----
>>>  1 file changed, 21 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>>> index c039abfb2052..e7550b1f9670 100644
>>> --- a/fs/block_dev.c
>>> +++ b/fs/block_dev.c
>>> @@ -237,12 +237,23 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
>>>  
>>>  	qc = submit_bio(&bio);
>>>  	for (;;) {
>>> -		set_current_state(TASK_UNINTERRUPTIBLE);
>>> +		/*
>>> +		 * Using non-atomic task state for polling
>>> +		 */
>>> +		__set_current_state(TASK_UNINTERRUPTIBLE);
>>> +
>>>  		if (!READ_ONCE(bio.bi_private))
>>>  			break;
>>
>> When we don't use polling, the blkdev_bio_end_io_simple may come on a different cpu before
>> submit_bio returns 
>> For example,
>> __blkdev_direct_IO_simple
>>   qc = submit_bio(&bio) 
>>                                           blkdev_bio_end_io_simple
>>                                             WRITE_ONCE(bio.bi_private, NULL)
>>                                             wake_up_process(waiter)
>>   __set_current_state(TASK_UNINTERRUPTIBLE)                              
>>   READ_ONCE(bio.bi_private)
> 
> While I agree that you are right, that's an existing issue. The store
> barrier implied by set_current_state() doesn't fix that.
> 
> How about this variant:
> 
> http://git.kernel.dk/cgit/linux-block/commit/?h=mq-perf&id=036ba7a8d1334889c0fe55d4858d78f4e8022f12
> 
> which then changes the later wake up helper patch to be:
>  
> http://git.kernel.dk/cgit/linux-block/commit/?h=mq-perf&id=f8c3f188425967adb040cfefb799b0a5a1df769d
> 

The wakeup path already contains a full memory barrier, via the raw_spin_lock and the following smp_mb__after_spinlock():

wake_up_process
  -> try_to_wake_up

	raw_spin_lock_irqsave(&p->pi_lock, flags);
	smp_mb__after_spinlock();

So a smp_rmb() could be enough. :)

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/6] nvme: don't disable local ints for polled queue
  2018-11-10 15:13   ` Jens Axboe
@ 2018-11-14  8:43     ` Christoph Hellwig
  -1 siblings, 0 replies; 28+ messages in thread
From: Christoph Hellwig @ 2018-11-14  8:43 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Keith Busch, linux-nvme

> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 6aa86dfcb32c..a6e3fbddfadf 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1061,15 +1061,26 @@ static irqreturn_t nvme_irq_check(int irq, void *data)
>  
>  static int __nvme_poll(struct nvme_queue *nvmeq, unsigned int tag)
>  {
> +	unsigned long flags = 0; /* gcc 7.x fail */
>  	u16 start, end;
> -	bool found;
> +	bool found, has_irq;
>  
>  	if (!nvme_cqe_pending(nvmeq))
>  		return 0;
>  
> -	spin_lock_irq(&nvmeq->cq_lock);
> +	/*
> +	 * Polled queue doesn't have an IRQ, no need to disable ints
> +	 */
> +	has_irq = !nvmeq->polled;
> +	if (has_irq)
> +		local_irq_save(flags);
> +
> +	spin_lock(&nvmeq->cq_lock);
>  	found = nvme_process_cq(nvmeq, &start, &end, tag);
> -	spin_unlock_irq(&nvmeq->cq_lock);
> +	spin_unlock(&nvmeq->cq_lock);
> +
> +	if (has_irq)
> +		local_irq_restore(flags);

Eww, this just looks ugly.  Given that we are micro-optimizing here
I'd rather just use a different ->poll implementation and thus blk_mq_ops
if we have separate poll code.  Then we could leave the existing
__nvme_poll as-is and have a new nvme_poll_noirq that looks something like this:

static int nvme_poll_noirq(struct blk_mq_hw_ctx *hctx, unsigned int tag)
{
	struct nvme_queue *nvmeq = hctx->driver_data;
	u16 start, end;
	bool found;

	if (!nvme_cqe_pending(nvmeq))
		return 0;

	spin_lock(&nvmeq->cq_lock);
	found = nvme_process_cq(nvmeq, &start, &end, tag);
	spin_unlock(&nvmeq->cq_lock);

	nvme_complete_cqes(nvmeq, start, end);
	return found;
}

And while we are at it:  I think for the irq-driven queues in a
separate-queue-for-poll setup we might not even need to take the
CQ lock.  Which might be an argument for only allowing polling
if we have the separate queues, just to keep everything simple.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/6] nvme: don't disable local ints for polled queue
  2018-11-10 15:13   ` Jens Axboe
@ 2018-11-14 10:18     ` Max Gurtovoy
  -1 siblings, 0 replies; 28+ messages in thread
From: Max Gurtovoy @ 2018-11-14 10:18 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: Keith Busch, linux-nvme


On 11/10/2018 5:13 PM, Jens Axboe wrote:
> A polled queue doesn't trigger interrupts, so it's always safe
> to grab the queue lock without disabling interrupts.


Jens,

can you share the added value in performance for this change?



> Cc: Keith Busch <keith.busch@intel.com>
> Cc: linux-nvme@lists.infradead.org
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>   drivers/nvme/host/pci.c | 17 ++++++++++++++---
>   1 file changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 6aa86dfcb32c..a6e3fbddfadf 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1061,15 +1061,26 @@ static irqreturn_t nvme_irq_check(int irq, void *data)
>   
>   static int __nvme_poll(struct nvme_queue *nvmeq, unsigned int tag)
>   {
> +	unsigned long flags = 0; /* gcc 7.x fail */
>   	u16 start, end;
> -	bool found;
> +	bool found, has_irq;
>   
>   	if (!nvme_cqe_pending(nvmeq))
>   		return 0;
>   
> -	spin_lock_irq(&nvmeq->cq_lock);
> +	/*
> +	 * Polled queue doesn't have an IRQ, no need to disable ints
> +	 */
> +	has_irq = !nvmeq->polled;
> +	if (has_irq)
> +		local_irq_save(flags);
> +
> +	spin_lock(&nvmeq->cq_lock);
>   	found = nvme_process_cq(nvmeq, &start, &end, tag);
> -	spin_unlock_irq(&nvmeq->cq_lock);
> +	spin_unlock(&nvmeq->cq_lock);
> +
> +	if (has_irq)
> +		local_irq_restore(flags);
>   
>   	nvme_complete_cqes(nvmeq, start, end);
>   	return found;

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/6] blk-mq: embed blk_mq_ops directly in the request queue
  2018-11-10 15:13 ` [PATCH 3/6] blk-mq: embed blk_mq_ops directly in the request queue Jens Axboe
@ 2018-11-14 15:20   ` Christoph Hellwig
  2018-11-14 15:33     ` Jens Axboe
  0 siblings, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2018-11-14 15:20 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block

On Sat, Nov 10, 2018 at 08:13:14AM -0700, Jens Axboe wrote:
> This saves an indirect function call every time we have to
> call one of the strategy functions. We keep it const, and just
> hack around that a bit in blk_mq_init_allocated_queue(), which
> is where we copy the ops in.

What amount of difference does this patch make?

This really defeats the whole constification, which doesn't hinge
on the const attribute itself, but on having the static ops in
a read-only mapped section.  So unless it is a real, measurable, and
big enough difference I'm against this patch, and if there is a real
and measurable difference we might have to find other ways to achieve
this goal.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 5/6] blk-rq-qos: inline check for q->rq_qos functions
  2018-11-10 15:13 ` [PATCH 5/6] blk-rq-qos: inline check for q->rq_qos functions Jens Axboe
@ 2018-11-14 15:21   ` Christoph Hellwig
  2018-11-14 15:33     ` Jens Axboe
  0 siblings, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2018-11-14 15:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Josef Bacik

On Sat, Nov 10, 2018 at 08:13:16AM -0700, Jens Axboe wrote:
> Put the short code in the fast path, where we don't have any
> functions attached to the queue. This minimizes the impact on
> the hot path in the core code.
> 
> Clean up duplicated code by having a macro set up both the inline
> check and the actual functions.

The inlining sounds fine, but please do it without the obfuscating
macros.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/6] nvme: don't disable local ints for polled queue
  2018-11-14  8:43     ` Christoph Hellwig
@ 2018-11-14 15:31       ` Jens Axboe
  -1 siblings, 0 replies; 28+ messages in thread
From: Jens Axboe @ 2018-11-14 15:31 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-block, Keith Busch, linux-nvme

On 11/14/18 1:43 AM, Christoph Hellwig wrote:
>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>> index 6aa86dfcb32c..a6e3fbddfadf 100644
>> --- a/drivers/nvme/host/pci.c
>> +++ b/drivers/nvme/host/pci.c
>> @@ -1061,15 +1061,26 @@ static irqreturn_t nvme_irq_check(int irq, void *data)
>>  
>>  static int __nvme_poll(struct nvme_queue *nvmeq, unsigned int tag)
>>  {
>> +	unsigned long flags = 0; /* gcc 7.x fail */
>>  	u16 start, end;
>> -	bool found;
>> +	bool found, has_irq;
>>  
>>  	if (!nvme_cqe_pending(nvmeq))
>>  		return 0;
>>  
>> -	spin_lock_irq(&nvmeq->cq_lock);
>> +	/*
>> +	 * Polled queue doesn't have an IRQ, no need to disable ints
>> +	 */
>> +	has_irq = !nvmeq->polled;
>> +	if (has_irq)
>> +		local_irq_save(flags);
>> +
>> +	spin_lock(&nvmeq->cq_lock);
>>  	found = nvme_process_cq(nvmeq, &start, &end, tag);
>> -	spin_unlock_irq(&nvmeq->cq_lock);
>> +	spin_unlock(&nvmeq->cq_lock);
>> +
>> +	if (has_irq)
>> +		local_irq_restore(flags);
> 
> Eww, this just looks ugly.  Given that we are micro-optimizing here
> I'd rather just use a different ->poll implementation and thus blk_mq_ops
> if we have a separate poll code.  Then we could leave the existing
> __nvme_poll as-is and have a new nvme_poll_noirq something like this:

Yeah good point, gets rid of those branches too. I'll make that change.

> And while we are at it:  I think for the irq-driven queues in a
> separate queue for poll setup we might not even need to take the
> CQ lock.  Which might be an argument for only allowing polling
> if we have the separate queues just to keep everything simple.

We can definitely move in that direction, I'll take a look at what
is feasible.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/6] blk-mq: embed blk_mq_ops directly in the request queue
  2018-11-14 15:20   ` Christoph Hellwig
@ 2018-11-14 15:33     ` Jens Axboe
  2018-11-14 15:39       ` Christoph Hellwig
  0 siblings, 1 reply; 28+ messages in thread
From: Jens Axboe @ 2018-11-14 15:33 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-block

On 11/14/18 8:20 AM, Christoph Hellwig wrote:
> On Sat, Nov 10, 2018 at 08:13:14AM -0700, Jens Axboe wrote:
>> This saves an indirect function call every time we have to
>> call one of the strategy functions. We keep it const, and just
>> hack around that a bit in blk_mq_init_allocated_queue(), which
>> is where we copy the ops in.
> 
> What amount of difference does this patch make?
> 
> This really defeats the whole constification, which doesn't hinge
> on the const attribute itself, but on having the static ops in
> a read-only mapped section.  So unless it is a real, measurable, and
> big enough difference I'm against this patch, and if there is a real
> and measurable difference we might have to find other ways to achieve
> this goal.

It's measurable. It doesn't defeat the const at all, the mq_ops
is still const and is never modified outside of when the queue
is allocated. The only difference is of course that it is no
longer residing in a read only mapped section, which is a shame.

I'll try and do some specific numbers.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 5/6] blk-rq-qos: inline check for q->rq_qos functions
  2018-11-14 15:21   ` Christoph Hellwig
@ 2018-11-14 15:33     ` Jens Axboe
  2018-11-14 15:40       ` Christoph Hellwig
  0 siblings, 1 reply; 28+ messages in thread
From: Jens Axboe @ 2018-11-14 15:33 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-block, Josef Bacik

On 11/14/18 8:21 AM, Christoph Hellwig wrote:
> On Sat, Nov 10, 2018 at 08:13:16AM -0700, Jens Axboe wrote:
>> Put the short code in the fast path, where we don't have any
>> functions attached to the queue. This minimizes the impact on
>> the hot path in the core code.
>>
>> Clean up duplicated code by having a macro set up both the inline
>> check and the actual functions.
> 
> The inlining sounds fine, but please do it without the obfuscating
> macros.

Why? It's a lot of code duplication.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/6] nvme: don't disable local ints for polled queue
  2018-11-14  8:43     ` Christoph Hellwig
@ 2018-11-14 15:37       ` Keith Busch
  -1 siblings, 0 replies; 28+ messages in thread
From: Keith Busch @ 2018-11-14 15:37 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-block, linux-nvme

On Wed, Nov 14, 2018 at 12:43:22AM -0800, Christoph Hellwig wrote:
> static int nvme_poll_noirq(struct blk_mq_hw_ctx *hctx, unsigned int tag)
> {
> 	struct nvme_queue *nvmeq = hctx->driver_data;
> 	u16 start, end;
> 	bool found;
> 
> 	if (!nvme_cqe_pending(nvmeq))
> 		return 0;
> 
> 	spin_lock(&nvmeq->cq_lock);
> 	found = nvme_process_cq(nvmeq, &start, &end, tag);
> 	spin_unlock(&nvmeq->cq_lock);
> 
> 	nvme_complete_cqes(nvmeq, start, end);
> 	return found;
> }
> 
> And while we are at it:  I think for the irq-driven queues in a
> separate queue for poll setup we might not even need to take the
> CQ lock.  Which might be an argument for only allowing polling
> if we have the separate queues just to keep everything simple.

That's a pretty cool observation. We still poll interrupt-driven queues
in the timeout path as a sanity check (it has really helped in debugging
timeout issues), but we can temporarily disable the cq's irq and be
lockless.
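
Purely as an illustrative sketch of that idea (hypothetical, not posted
code; it assumes the standard disable_irq()/enable_irq() and
pci_irq_vector() helpers and the nvmeq/nvme_dev fields from the quoted
driver), the timeout-path poll of an irq-driven queue could look like:

static int nvme_poll_irqoff(struct nvme_queue *nvmeq, unsigned int tag)
{
	struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
	int vector = pci_irq_vector(pdev, nvmeq->cq_vector);
	u16 start, end;
	bool found;

	if (!nvme_cqe_pending(nvmeq))
		return 0;

	/* keep the irq handler away while we reap the CQ locklessly */
	disable_irq(vector);
	found = nvme_process_cq(nvmeq, &start, &end, tag);
	enable_irq(vector);

	nvme_complete_cqes(nvmeq, start, end);
	return found;
}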

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/6] blk-mq: embed blk_mq_ops directly in the request queue
  2018-11-14 15:33     ` Jens Axboe
@ 2018-11-14 15:39       ` Christoph Hellwig
  2018-11-14 15:47         ` Jens Axboe
  0 siblings, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2018-11-14 15:39 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Christoph Hellwig, linux-block

On Wed, Nov 14, 2018 at 08:33:06AM -0700, Jens Axboe wrote:
> It's measurable. It doesn't defeat the const at all, the mq_ops
> is still const and is never modified outside of when the queue
> is allocated. The only difference is of course that it is no
> longer residing in a read only mapped section, which is a shame.

And the separate section is very much the point of the const.

Without it attackers can easily overwrite the function pointer and
use it for exploits, as the C const attribute by itself doesn't
protect from that in any way.  But once it is in a read-only section
it can't easily be modified.
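
To make the distinction concrete, here is a toy, self-contained
userspace C illustration (not kernel code) of why the C-level const
qualifier alone doesn't help once the table has been copied into
writable memory:

#include <stdio.h>
#include <string.h>

struct ops {
	int (*poll)(void);
};

static int real_poll(void)
{
	return 42;
}

/*
 * File-scope const object: the linker places it in a read-only data
 * section, so the function pointer cannot be patched at runtime.
 */
static const struct ops static_ops = { .poll = real_poll };

int main(void)
{
	/*
	 * A per-instance copy is still "const" when accessed through a
	 * const pointer, but its bytes live in writable memory, which
	 * is exactly what an arbitrary-write exploit needs.
	 */
	struct ops copied_ops;
	const struct ops *q_ops;

	memcpy(&copied_ops, &static_ops, sizeof(copied_ops));
	q_ops = &copied_ops;

	printf("%d\n", q_ops->poll());
	return 0;
}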

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 5/6] blk-rq-qos: inline check for q->rq_qos functions
  2018-11-14 15:33     ` Jens Axboe
@ 2018-11-14 15:40       ` Christoph Hellwig
  2018-11-14 17:16         ` Jens Axboe
  0 siblings, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2018-11-14 15:40 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Christoph Hellwig, linux-block, Josef Bacik

On Wed, Nov 14, 2018 at 08:33:33AM -0700, Jens Axboe wrote:
> > The inlining sounds fine, but please do it without the obfuscating
> > macros.
> 
> Why? It's a lot of code duplication.

Just source code, not binary code.  And remember that humans want
to understand this code by grepping it, finding it in source browsers,
etc.  And all that macro obfuscation makes it much harder.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/6] blk-mq: embed blk_mq_ops directly in the request queue
  2018-11-14 15:39       ` Christoph Hellwig
@ 2018-11-14 15:47         ` Jens Axboe
  2018-11-14 15:58           ` Christoph Hellwig
  0 siblings, 1 reply; 28+ messages in thread
From: Jens Axboe @ 2018-11-14 15:47 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-block

On 11/14/18 8:39 AM, Christoph Hellwig wrote:
> On Wed, Nov 14, 2018 at 08:33:06AM -0700, Jens Axboe wrote:
>> It's measurable. It doesn't defeat the const at all, the mq_ops
>> is still const and is never modified outside of when the queue
>> is allocated. The only difference is of course that it is no
>> longer residing in a read only mapped section, which is a shame.
> 
> And the separate section is very much the point of the const.
> 
> Without it attackers can easily overwrite the function pointer and
> use it for exploits, as the C const attribute by itself doesn't
> protect from that in any way.  But once it is in a read-only section
> it can't easily be modified.

Are we really worried about this as an attack surface? I get it
for other types of ops, but for mq_ops, I'd be more worried about
callers changing it inadvertently, and they still can't do that.

It's not a huge deal to me, I can drop this one for now and
revisit at a later time.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/6] blk-mq: embed blk_mq_ops directly in the request queue
  2018-11-14 15:47         ` Jens Axboe
@ 2018-11-14 15:58           ` Christoph Hellwig
  0 siblings, 0 replies; 28+ messages in thread
From: Christoph Hellwig @ 2018-11-14 15:58 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Christoph Hellwig, linux-block

On Wed, Nov 14, 2018 at 08:47:58AM -0700, Jens Axboe wrote:
> Are we really worried about this as an attack surface? I get it
> for other types of ops, but for mq_ops, I'd be more worried about
> callers changing it inadvertently, and they still can't do that.

It isn't quite as bad as for things like file ops, but for modern
attacks, finding some piece that is easily patchable is what starts
it all.  So the fewer function pointers we have that are ever
runtime writable, the better.

> It's not a huge deal to me, I can drop this one for now and
> revisit at a later time.

Yes, I'd be much happier if we dropped this for now.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 5/6] blk-rq-qos: inline check for q->rq_qos functions
  2018-11-14 15:40       ` Christoph Hellwig
@ 2018-11-14 17:16         ` Jens Axboe
  0 siblings, 0 replies; 28+ messages in thread
From: Jens Axboe @ 2018-11-14 17:16 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-block, Josef Bacik

On 11/14/18 8:40 AM, Christoph Hellwig wrote:
> On Wed, Nov 14, 2018 at 08:33:33AM -0700, Jens Axboe wrote:
>>> The inlining sounds fine, but please do it without the obfuscating
>>> macros.
>>
>> Why? It's a lot of code duplication.
> 
> Just source code, not binary code.  And remember that humans want
> to understand this code by grepping it, finding it in source browsers,
> etc.  And all that macro obsfucation makes it much harder.

I've changed it to just spell them all out.
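
For reference, a hedged sketch of what one such spelled-out wrapper
might end up looking like in blk-rq-qos.h (illustrative only; the
signature is simplified and may not match the posted patch exactly):

void __rq_qos_throttle(struct rq_qos *rqos, struct bio *bio);

static inline void rq_qos_throttle(struct request_queue *q, struct bio *bio)
{
	/* fast path: nothing attached to the queue, no out-of-line call */
	if (q->rq_qos)
		__rq_qos_throttle(q->rq_qos, bio);
}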

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2018-11-14 17:16 UTC | newest]

Thread overview: 28+ messages
-- links below jump to the message on this page --
2018-11-10 15:13 [PATCHSET 0/6] Various block optimizations Jens Axboe
2018-11-10 15:13 ` [PATCH 1/6] nvme: don't disable local ints for polled queue Jens Axboe
2018-11-10 15:13   ` Jens Axboe
2018-11-14  8:43   ` Christoph Hellwig
2018-11-14  8:43     ` Christoph Hellwig
2018-11-14 15:31     ` Jens Axboe
2018-11-14 15:31       ` Jens Axboe
2018-11-14 15:37     ` Keith Busch
2018-11-14 15:37       ` Keith Busch
2018-11-14 10:18   ` Max Gurtovoy
2018-11-14 10:18     ` Max Gurtovoy
2018-11-10 15:13 ` [PATCH 2/6] block: add queue_is_mq() helper Jens Axboe
2018-11-10 15:13 ` [PATCH 3/6] blk-mq: embed blk_mq_ops directly in the request queue Jens Axboe
2018-11-14 15:20   ` Christoph Hellwig
2018-11-14 15:33     ` Jens Axboe
2018-11-14 15:39       ` Christoph Hellwig
2018-11-14 15:47         ` Jens Axboe
2018-11-14 15:58           ` Christoph Hellwig
2018-11-10 15:13 ` [PATCH 4/6] block: avoid ordered task state change for polled IO Jens Axboe
2018-11-12  9:35   ` jianchao.wang
2018-11-12 16:26     ` Jens Axboe
2018-11-13  2:02       ` jianchao.wang
2018-11-10 15:13 ` [PATCH 5/6] blk-rq-qos: inline check for q->rq_qos functions Jens Axboe
2018-11-14 15:21   ` Christoph Hellwig
2018-11-14 15:33     ` Jens Axboe
2018-11-14 15:40       ` Christoph Hellwig
2018-11-14 17:16         ` Jens Axboe
2018-11-10 15:13 ` [PATCH 6/6] block: add polled wakeup task helper Jens Axboe
