linux-block.vger.kernel.org archive mirror
* [PATCH v5 0/4] Add support Weighted Round Robin for blkcg and nvme
@ 2020-02-04  3:30 Weiping Zhang
  2020-02-04  3:31 ` [PATCH v5 1/4] block: add weighted round robin for blkcgroup Weiping Zhang
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Weiping Zhang @ 2020-02-04  3:30 UTC (permalink / raw)
  To: axboe, tj, hch, bvanassche, kbusch, minwoo.im.dev, tglx,
	ming.lei, edmund.nadolski
  Cc: linux-block, cgroups, linux-nvme

Hi,

This series tries to add Weighted Round Robin support for the block cgroup
layer and the nvme driver. When multiple containers share a single nvme
device, we want to protect the IO-critical container from being interfered
with by other containers. We add a blkio.wrr interface so users can control
their IO priority. blkio.wrr accepts five priority levels, "urgent", "high",
"medium", "low" and "none", where "none" disables WRR for that cgroup.

The first patch adds a WRR infrastructure for the block cgroup layer.

We add four extra hardware context types at the blk-mq layer,
HCTX_TYPE_WRR_URGENT/HIGH/MEDIUM/LOW, to allow device drivers to map
different hardware queues to different hardware contexts.

The second patch adds an nvme_ctrl_ops callback named get_ams to get the
expected Arbitration Mechanism Selected; for now this series only supports
nvme-pci. This operation checks both CAP.AMS and the nvme-pci WRR queue
counts to decide whether to enable WRR or RR.

The third patch renames the write_queues module parameter to read_queues,
which simplifies calculating the number of default, read, poll and WRR
queues.

The last patch adds support for nvme-pci Weighted Round Robin with Urgent
Priority Class; we add four module parameters as follows:
	wrr_urgent_queues
	wrr_high_queues
	wrr_medium_queues
	wrr_low_queues
nvme-pci will set CC.AMS=001b if CAP.AMS (bit 17) is 1 and any of the
wrr_xxx_queues parameters is larger than 0. The nvme driver will split the
hardware queues based on read/poll/wrr_xxx_queues, then set the proper
value for Queue Priority (QPRIO) in Dword 11 of the Create I/O Submission
Queue command. This patch also extends IRQ_AFFINITY_MAX_SETS to 6, since
nvme may use 6 IRQ sets if the default, read and WRR queue counts are all
non-zero.
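
As a usage illustration (the queue counts below are arbitrary example
values, not defaults of this series), WRR could be enabled at module load
time with something like:

	modprobe nvme poll_queues=2 wrr_high_queues=4 wrr_medium_queues=2 wrr_low_queues=1

The resulting queue split is then reported by the driver's
"default/read/poll/low/medium/high/urgent queues" dev_info line added in
the last patch.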

fio test:

CPU:	Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
NVME:	Intel SSDPE2KX020T8 P4510 2TB

[root@tmp-201812-d1802-818396173 low]# nvme show-regs /dev/nvme0n1
cap     : 2078030fff
version : 10200
intms   : 0
intmc   : 0
cc      : 460801
csts    : 1
nssr    : 0
aqa     : 1f001f
asq     : 5f7cc08000
acq     : 5f5ac23000
cmbloc  : 0
cmbsz   : 0

Run fio-1, fio-2 and fio-3 in parallel.

With RR (round robin), these three fio jobs get nearly the same iops or bps;
if we set blkio.wrr to different priorities, the WRR "high" cgroup will
get more iops/bps than "medium" and "low".

RR:
fio-1: echo "259:0 none" > /sys/fs/cgroup/blkio/high/blkio.wrr
fio-2: echo "259:0 none" > /sys/fs/cgroup/blkio/medium/blkio.wrr
fio-3: echo "259:0 none" > /sys/fs/cgroup/blkio/low/blkio.wrr

WRR:
fio-1: echo "259:0 high" > /sys/fs/cgroup/blkio/high/blkio.wrr
fio-2: echo "259:0 medium" > /sys/fs/cgroup/blkio/medium/blkio.wrr
fio-3: echo "259:0 low" > /sys/fs/cgroup/blkio/low/blkio.wrr
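
(In these tests each fio job is attached to its cgroup before it runs; with
cgroup v1 this can be done, for example, via
"echo $FIO_PID > /sys/fs/cgroup/blkio/high/cgroup.procs"; see the test
script below for the exact setup.)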

    Test script:
    https://github.com/dublio/nvme-wrr/blob/master/test_wrr.sh

    Test result:
    randread             (RR)IOPS        (RR)latency     (WRR)IOPS       (WRR)latency
    --------------------------------------------------------------------------------
    randread_high        217474          3528.49         404451          1897.17
    randread_medium      217473          3528.56         202349          3793.54
    randread_low         217978          3520.98         67419           11386.43

    randwrite            (RR)IOPS        (RR)latency     (WRR)IOPS       (WRR)latency
    --------------------------------------------------------------------------------
    randwrite_high       144946          5295.34         277401          2766.66
    randwrite_medium     144861          5296.85         138710          5532.28
    randwrite_low        145105          5289.36         46316           16569.54

    read                 (RR)BW          (RR)latency     (WRR)BW         (WRR)latency
    --------------------------------------------------------------------------------
    read_high            956191          410823.48       1790273         219427.11
    read_medium          920096          426887.25       897644          437760.17
    read_low             928076          423248.05       302899          1297195.34

    write                (RR)BW          (RR)latency     (WRR)BW         (WRR)latency
    --------------------------------------------------------------------------------
    write_high           737211          532359.31       1194013         328970.70
    write_medium         759052          516902.66       600626          653876.69
    write_low            782348          501309.47       203754          1928779.39

Changes since V4:
 * calculate the number of IRQ sets from the queue count of each HCTX type,
   and drop the patch "genirq/affinity: allow driver's discontiguous affinity set"
 * extend IRQ_AFFINITY_MAX_SETS to 6 instead of 7.

Changes since V3:
 * only show blkio.wrr in non-root cgroups.
 * give bw/iops and latency in the test results.

Changes since V2:
 * drop the null_blk related patch, which added a new NULL_Q_IRQ_WRR to
	simulate the nvme WRR policy
 * add an urgent tagset map for the nvme driver
 * fix some problems in V2, as suggested by Minwoo

Changes since V1:
 * reorder HCTX_TYPE_POLL to be the last type so the nvme driver can adopt it easily.
 * add WRR (Weighted Round Robin) support for the nvme driver

Weiping Zhang (4):
  block: add weighted round robin for blkcgroup
  nvme: add get_ams for nvme_ctrl_ops
  nvme-pci: rename module parameter write_queues to read_queues
  nvme: add support weighted round robin queue

 block/blk-cgroup.c         |  91 ++++++++++++++++
 block/blk-mq-debugfs.c     |   4 +
 block/blk-mq-sched.c       |   5 +-
 block/blk-mq-tag.c         |   4 +-
 block/blk-mq-tag.h         |   2 +-
 block/blk-mq.c             |  12 ++-
 block/blk-mq.h             |  20 +++-
 block/blk.h                |   2 +-
 drivers/nvme/host/core.c   |   9 +-
 drivers/nvme/host/nvme.h   |   2 +
 drivers/nvme/host/pci.c    | 260 ++++++++++++++++++++++++++++++++++++---------
 include/linux/blk-cgroup.h |   2 +
 include/linux/blk-mq.h     |  18 ++++
 include/linux/interrupt.h  |   2 +-
 include/linux/nvme.h       |   3 +
 15 files changed, 375 insertions(+), 61 deletions(-)

-- 
2.14.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v5 1/4] block: add weighted round robin for blkcgroup
  2020-02-04  3:30 [PATCH v5 0/4] Add support Weighted Round Robin for blkcg and nvme Weiping Zhang
@ 2020-02-04  3:31 ` Weiping Zhang
  2020-02-04  3:31 ` [PATCH v5 2/4] nvme: add get_ams for nvme_ctrl_ops Weiping Zhang
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: Weiping Zhang @ 2020-02-04  3:31 UTC (permalink / raw)
  To: axboe, tj, hch, bvanassche, kbusch, minwoo.im.dev, tglx,
	ming.lei, edmund.nadolski
  Cc: linux-block, cgroups, linux-nvme

Each block cgroup can select a weighted round robin type to make
its IO requests go to the specified hardware queue. Now we support
four round robin types, urgent, high, medium and low, like the NVMe
specification does.
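
For example (the device numbers and the cgroup path here are illustrative
only):

  echo "259:0 high" > /sys/fs/cgroup/blkio/<cgroup>/blkio.wrr
  echo "259:0 none" > /sys/fs/cgroup/blkio/<cgroup>/blkio.wrr   # disable WRR again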

Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
---
 block/blk-cgroup.c         | 91 ++++++++++++++++++++++++++++++++++++++++++++++
 block/blk-mq-debugfs.c     |  4 ++
 block/blk-mq-sched.c       |  5 ++-
 block/blk-mq-tag.c         |  4 +-
 block/blk-mq-tag.h         |  2 +-
 block/blk-mq.c             | 12 ++++--
 block/blk-mq.h             | 20 +++++++++-
 block/blk.h                |  2 +-
 include/linux/blk-cgroup.h |  2 +
 include/linux/blk-mq.h     | 18 +++++++++
 10 files changed, 150 insertions(+), 10 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index a229b94d5390..a81888c7cb2d 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -830,6 +830,91 @@ static int blkcg_print_stat(struct seq_file *sf, void *v)
 	return 0;
 }
 
+static const char *blk_wrr_name[BLK_WRR_COUNT] = {
+	[BLK_WRR_NONE]		= "none",
+	[BLK_WRR_LOW]		= "low",
+	[BLK_WRR_MEDIUM]	= "medium",
+	[BLK_WRR_HIGH]		= "high",
+	[BLK_WRR_URGENT]	= "urgent",
+};
+
+static inline const char *blk_wrr_to_name(int wrr)
+{
+	if (wrr < BLK_WRR_NONE || wrr >= BLK_WRR_COUNT)
+		return "wrong";
+
+	return blk_wrr_name[wrr];
+}
+
+static ssize_t blkcg_wrr_write(struct kernfs_open_file *of,
+			 char *buf, size_t nbytes, loff_t off)
+{
+	struct blkcg *blkcg = css_to_blkcg(of_css(of));
+	struct gendisk *disk;
+	struct request_queue *q;
+	struct blkcg_gq *blkg;
+	unsigned int major, minor;
+	int wrr, key_len, part, ret;
+	char *body;
+
+	if (sscanf(buf, "%u:%u%n", &major, &minor, &key_len) != 2)
+		return -EINVAL;
+
+	body = buf + key_len;
+	if (!isspace(*body))
+		return -EINVAL;
+	body = skip_spaces(body);
+	wrr = sysfs_match_string(blk_wrr_name, body);
+	if (wrr == BLK_WRR_COUNT)
+		return -EINVAL;
+
+	disk = get_gendisk(MKDEV(major, minor), &part);
+	if (!disk)
+		return -ENODEV;
+	if (part) {
+		ret = -EINVAL;
+		goto fail;
+	}
+
+	q = disk->queue;
+
+	blkg = blkg_lookup_create(blkcg, q);
+
+	atomic_set(&blkg->wrr, wrr);
+	put_disk_and_module(disk);
+
+	return nbytes;
+fail:
+	put_disk_and_module(disk);
+	return ret;
+}
+
+static int blkcg_wrr_show(struct seq_file *sf, void *v)
+{
+	struct blkcg *blkcg = css_to_blkcg(seq_css(sf));
+	struct blkcg_gq *blkg;
+
+	rcu_read_lock();
+
+	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
+		const char *dname;
+		char *buf;
+		size_t size = seq_get_buf(sf, &buf), off = 0;
+
+		dname = blkg_dev_name(blkg);
+		if (!dname)
+			continue;
+
+		off += scnprintf(buf+off, size-off, "%s %s\n", dname,
+			blk_wrr_to_name(atomic_read(&blkg->wrr)));
+		seq_commit(sf, off);
+	}
+
+	rcu_read_unlock();
+	return 0;
+}
+
+
 static struct cftype blkcg_files[] = {
 	{
 		.name = "stat",
@@ -844,6 +929,12 @@ static struct cftype blkcg_legacy_files[] = {
 		.name = "reset_stats",
 		.write_u64 = blkcg_reset_stats,
 	},
+	{
+		.name = "wrr",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.write = blkcg_wrr_write,
+		.seq_show = blkcg_wrr_show,
+	},
 	{ }	/* terminate */
 };
 
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index b3f2ba483992..455e5a21ee0c 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -420,6 +420,10 @@ static int hctx_busy_show(void *data, struct seq_file *m)
 static const char *const hctx_types[] = {
 	[HCTX_TYPE_DEFAULT]	= "default",
 	[HCTX_TYPE_READ]	= "read",
+	[HCTX_TYPE_WRR_LOW]	= "wrr_low",
+	[HCTX_TYPE_WRR_MEDIUM]	= "wrr_medium",
+	[HCTX_TYPE_WRR_HIGH]	= "wrr_high",
+	[HCTX_TYPE_WRR_URGENT]	= "wrr_urgent",
 	[HCTX_TYPE_POLL]	= "poll",
 };
 
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index ca22afd47b3d..32e948445eb0 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -7,6 +7,7 @@
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/blk-mq.h>
+#include <linux/blk-cgroup.h>
 
 #include <trace/events/block.h>
 
@@ -326,7 +327,9 @@ bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio,
 {
 	struct elevator_queue *e = q->elevator;
 	struct blk_mq_ctx *ctx = blk_mq_get_ctx(q);
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, bio->bi_opf, ctx);
+	struct blkcg_gq *blkg = bio->bi_blkg;
+	int wrr = blkg ? atomic_read(&blkg->wrr) : BLK_WRR_NONE;
+	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, bio->bi_opf, ctx, wrr);
 	bool ret = false;
 	enum hctx_type type;
 
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index fbacde454718..e46d2c34a27f 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -99,7 +99,7 @@ static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
 		return __sbitmap_queue_get(bt);
 }
 
-unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
+unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data, int wrr)
 {
 	struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
 	struct sbitmap_queue *bt;
@@ -159,7 +159,7 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 
 		data->ctx = blk_mq_get_ctx(data->q);
 		data->hctx = blk_mq_map_queue(data->q, data->cmd_flags,
-						data->ctx);
+						data->ctx, wrr);
 		tags = blk_mq_tags_from_data(data);
 		if (data->flags & BLK_MQ_REQ_RESERVED)
 			bt = &tags->breserved_tags;
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 15bc74acb57e..5d951a0f32fe 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -25,7 +25,7 @@ struct blk_mq_tags {
 extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags, unsigned int reserved_tags, int node, int alloc_policy);
 extern void blk_mq_free_tags(struct blk_mq_tags *tags);
 
-extern unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data);
+extern unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data, int wrr);
 extern void blk_mq_put_tag(struct blk_mq_hw_ctx *hctx, struct blk_mq_tags *tags,
 			   struct blk_mq_ctx *ctx, unsigned int tag);
 extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
diff --git a/block/blk-mq.c b/block/blk-mq.c
index a12b1763508d..26383bde2792 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -340,6 +340,12 @@ static struct request *blk_mq_get_request(struct request_queue *q,
 	unsigned int tag;
 	bool clear_ctx_on_error = false;
 	u64 alloc_time_ns = 0;
+	int wrr;
+
+	if (bio && bio->bi_blkg)
+		wrr = atomic_read(&bio->bi_blkg->wrr);
+	else
+		wrr = BLK_WRR_NONE;
 
 	blk_queue_enter_live(q);
 
@@ -354,7 +360,7 @@ static struct request *blk_mq_get_request(struct request_queue *q,
 	}
 	if (likely(!data->hctx))
 		data->hctx = blk_mq_map_queue(q, data->cmd_flags,
-						data->ctx);
+						data->ctx, wrr);
 	if (data->cmd_flags & REQ_NOWAIT)
 		data->flags |= BLK_MQ_REQ_NOWAIT;
 
@@ -374,7 +380,7 @@ static struct request *blk_mq_get_request(struct request_queue *q,
 		blk_mq_tag_busy(data->hctx);
 	}
 
-	tag = blk_mq_get_tag(data);
+	tag = blk_mq_get_tag(data, wrr);
 	if (tag == BLK_MQ_TAG_FAIL) {
 		if (clear_ctx_on_error)
 			data->ctx = NULL;
@@ -1044,7 +1050,7 @@ bool blk_mq_get_driver_tag(struct request *rq)
 		data.flags |= BLK_MQ_REQ_RESERVED;
 
 	shared = blk_mq_tag_busy(data.hctx);
-	rq->tag = blk_mq_get_tag(&data);
+	rq->tag = blk_mq_get_tag(&data, BLK_WRR_NONE);
 	if (rq->tag >= 0) {
 		if (shared) {
 			rq->rq_flags |= RQF_MQ_INFLIGHT;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index eaaca8fc1c28..e6aac5b46edb 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -101,7 +101,8 @@ static inline struct blk_mq_hw_ctx *blk_mq_map_queue_type(struct request_queue *
  */
 static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
 						     unsigned int flags,
-						     struct blk_mq_ctx *ctx)
+						     struct blk_mq_ctx *ctx,
+						     int wrr)
 {
 	enum hctx_type type = HCTX_TYPE_DEFAULT;
 
@@ -110,7 +111,22 @@ static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
 	 */
 	if (flags & REQ_HIPRI)
 		type = HCTX_TYPE_POLL;
-	else if ((flags & REQ_OP_MASK) == REQ_OP_READ)
+	else if (wrr > BLK_WRR_NONE && wrr < BLK_WRR_COUNT) {
+		switch (wrr) {
+		case BLK_WRR_LOW:
+			type = HCTX_TYPE_WRR_LOW;
+			break;
+		case BLK_WRR_MEDIUM:
+			type = HCTX_TYPE_WRR_MEDIUM;
+			break;
+		case BLK_WRR_HIGH:
+			type = HCTX_TYPE_WRR_HIGH;
+			break;
+		default:
+			type = HCTX_TYPE_WRR_URGENT;
+			break;
+		}
+	} else if ((flags & REQ_OP_MASK) == REQ_OP_READ)
 		type = HCTX_TYPE_READ;
 	
 	return ctx->hctxs[type];
diff --git a/block/blk.h b/block/blk.h
index 6842f28c033e..ba97a6a35a73 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -40,7 +40,7 @@ extern struct ida blk_queue_ida;
 static inline struct blk_flush_queue *
 blk_get_flush_queue(struct request_queue *q, struct blk_mq_ctx *ctx)
 {
-	return blk_mq_map_queue(q, REQ_OP_FLUSH, ctx)->fq;
+	return blk_mq_map_queue(q, REQ_OP_FLUSH, ctx, BLK_WRR_NONE)->fq;
 }
 
 static inline void __blk_get_queue(struct request_queue *q)
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index e4a6949fd171..aab168a36d88 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -138,6 +138,8 @@ struct blkcg_gq {
 	atomic64_t			delay_start;
 	u64				last_delay;
 	int				last_use;
+	/* weighted round robin */
+	atomic_t			wrr;
 
 	struct rcu_head			rcu_head;
 };
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 11cfd6470b1a..e210778a94f0 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -193,17 +193,35 @@ struct blk_mq_queue_map {
  * enum hctx_type - Type of hardware queue
  * @HCTX_TYPE_DEFAULT:	All I/O not otherwise accounted for.
  * @HCTX_TYPE_READ:	Just for READ I/O.
+ * @HCTX_TYPE_WRR_LOW:     Weighted Round Robin low priority, when I/O is not polled.
+ * @HCTX_TYPE_WRR_MEDIUM:  Weighted Round Robin medium priority, when I/O is not polled.
+ * @HCTX_TYPE_WRR_HIGH:    Weighted Round Robin high priority, when I/O is not polled.
+ * @HCTX_TYPE_WRR_URGENT:  Weighted Round Robin urgent priority, when I/O is not polled.
  * @HCTX_TYPE_POLL:	Polled I/O of any kind.
  * @HCTX_MAX_TYPES:	Number of types of hctx.
  */
 enum hctx_type {
 	HCTX_TYPE_DEFAULT,
 	HCTX_TYPE_READ,
+	HCTX_TYPE_WRR_LOW,
+	HCTX_TYPE_WRR_MEDIUM,
+	HCTX_TYPE_WRR_HIGH,
+	HCTX_TYPE_WRR_URGENT,
 	HCTX_TYPE_POLL,
 
 	HCTX_MAX_TYPES,
 };
 
+enum blk_wrr {
+	BLK_WRR_NONE,
+	BLK_WRR_LOW,
+	BLK_WRR_MEDIUM,
+	BLK_WRR_HIGH,
+	BLK_WRR_URGENT,
+
+	BLK_WRR_COUNT,
+};
+
 /**
  * struct blk_mq_tag_set - tag set that can be shared between request queues
  * @map:	   One or more ctx -> hctx mappings. One map exists for each
-- 
2.14.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v5 2/4] nvme: add get_ams for nvme_ctrl_ops
  2020-02-04  3:30 [PATCH v5 0/4] Add support Weighted Round Robin for blkcg and nvme Weiping Zhang
  2020-02-04  3:31 ` [PATCH v5 1/4] block: add weighted round robin for blkcgroup Weiping Zhang
@ 2020-02-04  3:31 ` Weiping Zhang
  2020-02-04  3:31 ` [PATCH v5 3/4] nvme-pci: rename module parameter write_queues to read_queues Weiping Zhang
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: Weiping Zhang @ 2020-02-04  3:31 UTC (permalink / raw)
  To: axboe, tj, hch, bvanassche, kbusch, minwoo.im.dev, tglx,
	ming.lei, edmund.nadolski
  Cc: linux-block, cgroups, linux-nvme

The get_ams() callback returns the AMS (Arbitration Mechanism Selected)
value from the driver.

Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
---
 drivers/nvme/host/core.c | 9 ++++++++-
 drivers/nvme/host/nvme.h | 1 +
 drivers/nvme/host/pci.c  | 6 ++++++
 include/linux/nvme.h     | 1 +
 4 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 6ec03507da68..2275f1756369 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2119,6 +2119,7 @@ int nvme_enable_ctrl(struct nvme_ctrl *ctrl)
 	 */
 	unsigned dev_page_min, page_shift = 12;
 	int ret;
+	u32 ams = NVME_CC_AMS_RR;
 
 	ret = ctrl->ops->reg_read64(ctrl, NVME_REG_CAP, &ctrl->cap);
 	if (ret) {
@@ -2134,11 +2135,17 @@ int nvme_enable_ctrl(struct nvme_ctrl *ctrl)
 		return -ENODEV;
 	}
 
+	/* get Arbitration Mechanism Selected */
+	if (ctrl->ops->get_ams) {
+		ctrl->ops->get_ams(ctrl, &ams);
+		ams &= NVME_CC_AMS_MASK;
+	}
+
 	ctrl->page_size = 1 << page_shift;
 
 	ctrl->ctrl_config = NVME_CC_CSS_NVM;
 	ctrl->ctrl_config |= (page_shift - 12) << NVME_CC_MPS_SHIFT;
-	ctrl->ctrl_config |= NVME_CC_AMS_RR | NVME_CC_SHN_NONE;
+	ctrl->ctrl_config |= ams | NVME_CC_SHN_NONE;
 	ctrl->ctrl_config |= NVME_CC_IOSQES | NVME_CC_IOCQES;
 	ctrl->ctrl_config |= NVME_CC_ENABLE;
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 1024fec7914c..a1df74f2eed3 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -407,6 +407,7 @@ struct nvme_ctrl_ops {
 	void (*submit_async_event)(struct nvme_ctrl *ctrl);
 	void (*delete_ctrl)(struct nvme_ctrl *ctrl);
 	int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size);
+	void (*get_ams)(struct nvme_ctrl *ctrl, u32 *ams);
 };
 
 #ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 445c2ee2a01d..e460c7310187 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2688,6 +2688,11 @@ static int nvme_pci_get_address(struct nvme_ctrl *ctrl, char *buf, int size)
 	return snprintf(buf, size, "%s", dev_name(&pdev->dev));
 }
 
+static void nvme_pci_get_ams(struct nvme_ctrl *ctrl, u32 *ams)
+{
+	*ams = NVME_CC_AMS_RR;
+}
+
 static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
 	.name			= "pcie",
 	.module			= THIS_MODULE,
@@ -2699,6 +2704,7 @@ static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
 	.free_ctrl		= nvme_pci_free_ctrl,
 	.submit_async_event	= nvme_pci_submit_async_event,
 	.get_address		= nvme_pci_get_address,
+	.get_ams		= nvme_pci_get_ams,
 };
 
 static int nvme_dev_map(struct nvme_dev *dev)
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 3d5189f46cb1..6fe9121e4d27 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -171,6 +171,7 @@ enum {
 	NVME_CC_AMS_RR		= 0 << NVME_CC_AMS_SHIFT,
 	NVME_CC_AMS_WRRU	= 1 << NVME_CC_AMS_SHIFT,
 	NVME_CC_AMS_VS		= 7 << NVME_CC_AMS_SHIFT,
+	NVME_CC_AMS_MASK	= 7 << NVME_CC_AMS_SHIFT,
 	NVME_CC_SHN_NONE	= 0 << NVME_CC_SHN_SHIFT,
 	NVME_CC_SHN_NORMAL	= 1 << NVME_CC_SHN_SHIFT,
 	NVME_CC_SHN_ABRUPT	= 2 << NVME_CC_SHN_SHIFT,
-- 
2.14.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v5 3/4] nvme-pci: rename module parameter write_queues to read_queues
  2020-02-04  3:30 [PATCH v5 0/4] Add support Weighted Round Robin for blkcg and nvme Weiping Zhang
  2020-02-04  3:31 ` [PATCH v5 1/4] block: add weighted round robin for blkcgroup Weiping Zhang
  2020-02-04  3:31 ` [PATCH v5 2/4] nvme: add get_ams for nvme_ctrl_ops Weiping Zhang
@ 2020-02-04  3:31 ` Weiping Zhang
  2020-02-04  3:31 ` [PATCH v5 4/4] nvme: add support weighted round robin queue Weiping Zhang
  2020-02-04 15:42 ` [PATCH v5 0/4] Add support Weighted Round Robin for blkcg and nvme Keith Busch
  4 siblings, 0 replies; 17+ messages in thread
From: Weiping Zhang @ 2020-02-04  3:31 UTC (permalink / raw)
  To: axboe, tj, hch, bvanassche, kbusch, minwoo.im.dev, tglx,
	ming.lei, edmund.nadolski
  Cc: linux-block, cgroups, linux-nvme

Now nvme supports three types of hardware queues: read, poll and default.
This patch renames write_queues to read_queues to set the number of
read queues more explicitly. It also prepares for nvme WRR (weighted
round robin) support, so that we can get the number of each queue type
easily.

Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
---
 drivers/nvme/host/pci.c | 24 +++++++++++-------------
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index e460c7310187..1002f3f0349c 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -68,10 +68,10 @@ static int io_queue_depth = 1024;
 module_param_cb(io_queue_depth, &io_queue_depth_ops, &io_queue_depth, 0644);
 MODULE_PARM_DESC(io_queue_depth, "set io queue depth, should >= 2");
 
-static unsigned int write_queues;
-module_param(write_queues, uint, 0644);
-MODULE_PARM_DESC(write_queues,
-	"Number of queues to use for writes. If not set, reads and writes "
+static unsigned int read_queues;
+module_param(read_queues, uint, 0644);
+MODULE_PARM_DESC(read_queues,
+	"Number of queues to use for read. If not set, reads and writes "
 	"will share a queue set.");
 
 static unsigned int poll_queues;
@@ -211,7 +211,7 @@ struct nvme_iod {
 
 static unsigned int max_io_queues(void)
 {
-	return num_possible_cpus() + write_queues + poll_queues;
+	return num_possible_cpus() + read_queues + poll_queues;
 }
 
 static unsigned int max_queue_count(void)
@@ -2016,18 +2016,16 @@ static void nvme_calc_irq_sets(struct irq_affinity *affd, unsigned int nrirqs)
 	 * If only one interrupt is available or 'write_queue' == 0, combine
 	 * write and read queues.
 	 *
-	 * If 'write_queues' > 0, ensure it leaves room for at least one read
+	 * If 'read_queues' > 0, ensure it leaves room for at least one write
 	 * queue.
 	 */
-	if (!nrirqs) {
+	if (!nrirqs || nrirqs == 1) {
 		nrirqs = 1;
 		nr_read_queues = 0;
-	} else if (nrirqs == 1 || !write_queues) {
-		nr_read_queues = 0;
-	} else if (write_queues >= nrirqs) {
-		nr_read_queues = 1;
+	} else if (read_queues >= nrirqs) {
+		nr_read_queues = nrirqs - 1;
 	} else {
-		nr_read_queues = nrirqs - write_queues;
+		nr_read_queues = read_queues;
 	}
 
 	dev->io_queues[HCTX_TYPE_DEFAULT] = nrirqs - nr_read_queues;
@@ -3143,7 +3141,7 @@ static int __init nvme_init(void)
 	BUILD_BUG_ON(sizeof(struct nvme_delete_queue) != 64);
 	BUILD_BUG_ON(IRQ_AFFINITY_MAX_SETS < 2);
 
-	write_queues = min(write_queues, num_possible_cpus());
+	read_queues = min(read_queues, num_possible_cpus());
 	poll_queues = min(poll_queues, num_possible_cpus());
 	return pci_register_driver(&nvme_driver);
 }
-- 
2.14.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v5 4/4] nvme: add support weighted round robin queue
  2020-02-04  3:30 [PATCH v5 0/4] Add support Weighted Round Robin for blkcg and nvme Weiping Zhang
                   ` (2 preceding siblings ...)
  2020-02-04  3:31 ` [PATCH v5 3/4] nvme-pci: rename module parameter write_queues to read_queues Weiping Zhang
@ 2020-02-04  3:31 ` Weiping Zhang
  2020-02-04 15:42 ` [PATCH v5 0/4] Add support Weighted Round Robin for blkcg and nvme Keith Busch
  4 siblings, 0 replies; 17+ messages in thread
From: Weiping Zhang @ 2020-02-04  3:31 UTC (permalink / raw)
  To: axboe, tj, hch, bvanassche, kbusch, minwoo.im.dev, tglx,
	ming.lei, edmund.nadolski
  Cc: linux-block, cgroups, linux-nvme

This patch enables Weighted Round Robin if the nvme device supports it.
We add four module parameters, wrr_urgent_queues, wrr_high_queues,
wrr_medium_queues and wrr_low_queues, to control the number of queues
for each priority. If the device doesn't support WRR, all four
parameters are forced back to 0. If the device supports WRR but all
four parameters are 0, the nvme driver will not enable WRR when
setting CC.EN to 1.

Now nvme supports seven types of hardware queues:
poll:		if the io was marked for polling
wrr_low:	weighted round robin low
wrr_medium:	weighted round robin medium
wrr_high:	weighted round robin high
wrr_urgent:	weighted round robin urgent
read:		io reads, if the blkcg's wrr is none and the io is not polled
default:	writes/flushes, if the blkcg's wrr is none and the io is not polled

The read, default and poll submission queues' priority is medium when
nvme's WRR is enabled.

Test result:

CPU:    Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
NVME:   Intel SSDPE2KX020T8 P4510 2TB

[root@tmp-201812-d1802-818396173 low]# nvme show-regs /dev/nvme0n1
cap     : 2078030fff
version : 10200
intms   : 0
intmc   : 0
cc      : 460801
csts    : 1
nssr    : 0
aqa     : 1f001f
asq     : 5f7cc08000
acq     : 5f5ac23000
cmbloc  : 0
cmbsz   : 0

Run fio-1, fio-2, fio-3 in parallel,

For RR(round robin) these three fio nearly get same iops or bps,
if we set blkio.wrr for different priority, the WRR "high" will
get more iops/bps than "medium" and "low".

RR:
fio-1: echo "259:0 none" > /sys/fs/cgroup/blkio/high/blkio.wrr
fio-2: echo "259:0 none" > /sys/fs/cgroup/blkio/medium/blkio.wrr
fio-3: echo "259:0 none" > /sys/fs/cgroup/blkio/low/blkio.wrr

WRR:
fio-1: echo "259:0 high" > /sys/fs/cgroup/blkio/high/blkio.wrr
fio-2: echo "259:0 medium" > /sys/fs/cgroup/blkio/medium/blkio.wrr
fio-3: echo "259:0 low" > /sys/fs/cgroup/blkio/low/blkio.wrr

Test script:
https://github.com/dublio/nvme-wrr

Test result:
randread             (RR)IOPS        (RR)latency     (WRR)IOPS       (WRR)latency
--------------------------------------------------------------------------------
randread_high        217474          3528.49         404451          1897.17
randread_medium      217473          3528.56         202349          3793.54
randread_low         217978          3520.98         67419           11386.43

randwrite            (RR)IOPS        (RR)latency     (WRR)IOPS       (WRR)latency
--------------------------------------------------------------------------------
randwrite_high       144946          5295.34         277401          2766.66
randwrite_medium     144861          5296.85         138710          5532.28
randwrite_low        145105          5289.36         46316           16569.54

read                 (RR)BW          (RR)latency     (WRR)BW         (WRR)latency
--------------------------------------------------------------------------------
read_high            956191          410823.48       1790273         219427.11
read_medium          920096          426887.25       897644          437760.17
read_low             928076          423248.05       302899          1297195.34

write                (RR)BW          (RR)latency     (WRR)BW         (WRR)latency
--------------------------------------------------------------------------------
write_high           737211          532359.31       1194013         328970.70
write_medium         759052          516902.66       600626          653876.69
write_low            782348          501309.47       203754          1928779.39

Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
---
 drivers/nvme/host/nvme.h  |   1 +
 drivers/nvme/host/pci.c   | 240 ++++++++++++++++++++++++++++++++++++++--------
 include/linux/interrupt.h |   2 +-
 include/linux/nvme.h      |   2 +
 4 files changed, 203 insertions(+), 42 deletions(-)

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index a1df74f2eed3..4ebbefe4281d 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -250,6 +250,7 @@ struct nvme_ctrl {
 	unsigned int shutdown_timeout;
 	unsigned int kato;
 	bool subsystem;
+	bool wrr_enabled;
 	unsigned long quirks;
 	struct nvme_id_power_state psd[32];
 	struct nvme_effects_log *effects;
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 1002f3f0349c..77476f1bf943 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -78,6 +78,22 @@ static unsigned int poll_queues;
 module_param(poll_queues, uint, 0644);
 MODULE_PARM_DESC(poll_queues, "Number of queues to use for polled IO.");
 
+static int wrr_low_queues;
+module_param(wrr_low_queues, int, 0644);
+MODULE_PARM_DESC(wrr_low_queues, "Number of WRR low queues.");
+
+static int wrr_medium_queues;
+module_param(wrr_medium_queues, int, 0644);
+MODULE_PARM_DESC(wrr_medium_queues, "Number of WRR medium queues.");
+
+static int wrr_high_queues;
+module_param(wrr_high_queues, int, 0644);
+MODULE_PARM_DESC(wrr_high_queues, "Number of WRR high queues.");
+
+static int wrr_urgent_queues;
+module_param(wrr_urgent_queues, int, 0644);
+MODULE_PARM_DESC(wrr_urgent_queues, "Number of WRR urgent queues.");
+
 struct nvme_dev;
 struct nvme_queue;
 
@@ -209,6 +225,14 @@ struct nvme_iod {
 	struct scatterlist *sg;
 };
 
+static inline bool nvme_is_wrr_allocated(struct nvme_dev *dev)
+{
+	return dev->io_queues[HCTX_TYPE_WRR_LOW] +
+		dev->io_queues[HCTX_TYPE_WRR_MEDIUM] +
+		dev->io_queues[HCTX_TYPE_WRR_HIGH] +
+		dev->io_queues[HCTX_TYPE_WRR_URGENT] > 0;
+}
+
 static unsigned int max_io_queues(void)
 {
 	return num_possible_cpus() + read_queues + poll_queues;
@@ -1131,19 +1155,24 @@ static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid,
 }
 
 static int adapter_alloc_sq(struct nvme_dev *dev, u16 qid,
-						struct nvme_queue *nvmeq)
+				struct nvme_queue *nvmeq, int wrr_flag)
 {
 	struct nvme_ctrl *ctrl = &dev->ctrl;
 	struct nvme_command c;
 	int flags = NVME_QUEUE_PHYS_CONTIG;
 
-	/*
-	 * Some drives have a bug that auto-enables WRRU if MEDIUM isn't
-	 * set. Since URGENT priority is zeroes, it makes all queues
-	 * URGENT.
-	 */
-	if (ctrl->quirks & NVME_QUIRK_MEDIUM_PRIO_SQ)
-		flags |= NVME_SQ_PRIO_MEDIUM;
+
+	if (!dev->ctrl.wrr_enabled && !nvme_is_wrr_allocated(dev)) {
+		/*
+		 * Some drives have a bug that auto-enables WRRU if MEDIUM isn't
+		 * set. Since URGENT priority is zeroes, it makes all queues
+		 * URGENT.
+		 */
+		if (ctrl->quirks & NVME_QUIRK_MEDIUM_PRIO_SQ)
+			flags |= NVME_SQ_PRIO_MEDIUM;
+	} else {
+		flags |= wrr_flag;
+	}
 
 	/*
 	 * Note: we (ab)use the fact that the prp fields survive if no data
@@ -1513,11 +1542,51 @@ static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
 	wmb(); /* ensure the first interrupt sees the initialization */
 }
 
-static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
+static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 {
 	struct nvme_dev *dev = nvmeq->dev;
-	int result;
 	u16 vector = 0;
+	int start, end, result, wrr_flag;
+	bool polled = false;
+	enum hctx_type type;
+
+	/* 0 for admin queue, io queue index >= 1 */
+	start = 1;
+	/* get hardware context type based on qid */
+	for (type = HCTX_TYPE_DEFAULT; type < HCTX_MAX_TYPES; type++) {
+		end = start + dev->io_queues[type] - 1;
+		if (qid >= start && qid <= end)
+			break;
+		start = end + 1;
+	}
+
+	if (type == HCTX_TYPE_POLL)
+		polled = true;
+
+	if (dev->ctrl.wrr_enabled && nvme_is_wrr_allocated(dev)) {
+		/* set read,poll,default to medium by default */
+		switch (type) {
+		case HCTX_TYPE_WRR_LOW:
+			wrr_flag = NVME_SQ_PRIO_LOW;
+			break;
+		case HCTX_TYPE_WRR_MEDIUM:
+		case HCTX_TYPE_POLL:
+		case HCTX_TYPE_DEFAULT:
+		case HCTX_TYPE_READ:
+			wrr_flag = NVME_SQ_PRIO_MEDIUM;
+			break;
+		case HCTX_TYPE_WRR_HIGH:
+			wrr_flag = NVME_SQ_PRIO_HIGH;
+			break;
+		case HCTX_TYPE_WRR_URGENT:
+			wrr_flag = NVME_SQ_PRIO_URGENT;
+			break;
+		default:
+			return -EINVAL;
+		}
+	} else {
+		wrr_flag = NVME_SQ_PRIO_IGNORE;
+	}
 
 	clear_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags);
 
@@ -1534,7 +1603,7 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
 	if (result)
 		return result;
 
-	result = adapter_alloc_sq(dev, qid, nvmeq);
+	result = adapter_alloc_sq(dev, qid, nvmeq, wrr_flag);
 	if (result < 0)
 		return result;
 	if (result)
@@ -1704,7 +1773,7 @@ static int nvme_pci_configure_admin_queue(struct nvme_dev *dev)
 
 static int nvme_create_io_queues(struct nvme_dev *dev)
 {
-	unsigned i, max, rw_queues;
+	unsigned i, max;
 	int ret = 0;
 
 	for (i = dev->ctrl.queue_count; i <= dev->max_qid; i++) {
@@ -1715,17 +1784,8 @@ static int nvme_create_io_queues(struct nvme_dev *dev)
 	}
 
 	max = min(dev->max_qid, dev->ctrl.queue_count - 1);
-	if (max != 1 && dev->io_queues[HCTX_TYPE_POLL]) {
-		rw_queues = dev->io_queues[HCTX_TYPE_DEFAULT] +
-				dev->io_queues[HCTX_TYPE_READ];
-	} else {
-		rw_queues = max;
-	}
-
 	for (i = dev->online_queues; i <= max; i++) {
-		bool polled = i > rw_queues;
-
-		ret = nvme_create_queue(&dev->queues[i], i, polled);
+		ret = nvme_create_queue(&dev->queues[i], i);
 		if (ret)
 			break;
 	}
@@ -2006,7 +2066,9 @@ static int nvme_setup_host_mem(struct nvme_dev *dev)
 static void nvme_calc_irq_sets(struct irq_affinity *affd, unsigned int nrirqs)
 {
 	struct nvme_dev *dev = affd->priv;
-	unsigned int nr_read_queues;
+	unsigned int nr_total, nr, nr_read, nr_default;
+	unsigned int nr_wrr_urgent, nr_wrr_high, nr_wrr_medium, nr_wrr_low;
+	unsigned int nr_sets;
 
 	/*
 	 * If there is no interupt available for queues, ensure that
@@ -2019,20 +2081,85 @@ static void nvme_calc_irq_sets(struct irq_affinity *affd, unsigned int nrirqs)
 	 * If 'read_queues' > 0, ensure it leaves room for at least one write
 	 * queue.
 	 */
-	if (!nrirqs || nrirqs == 1) {
+	if (!nrirqs)
 		nrirqs = 1;
-		nr_read_queues = 0;
-	} else if (read_queues >= nrirqs) {
-		nr_read_queues = nrirqs - 1;
-	} else {
-		nr_read_queues = read_queues;
-	}
 
-	dev->io_queues[HCTX_TYPE_DEFAULT] = nrirqs - nr_read_queues;
-	affd->set_size[HCTX_TYPE_DEFAULT] = nrirqs - nr_read_queues;
-	dev->io_queues[HCTX_TYPE_READ] = nr_read_queues;
-	affd->set_size[HCTX_TYPE_READ] = nr_read_queues;
-	affd->nr_sets = nr_read_queues ? 2 : 1;
+	nr_total = nrirqs;
+
+	nr_read = nr_wrr_urgent = nr_wrr_high = nr_wrr_medium = nr_wrr_low = 0;
+
+	/* set default to 1, add all the rest queue to default at last */
+	nr = nr_default = 1;
+	nr_sets = 1;
+	nr_total -= nr;
+	if (!nr_total)
+		goto done;
+
+	/* read queues */
+	nr_read = nr = read_queues > nr_total ? nr_total : read_queues;
+	nr_total -= nr;
+	if (!nr_total)
+		goto done;
+
+	/* wrr low queues */
+	nr_wrr_low = nr = wrr_low_queues > nr_total ? nr_total : wrr_low_queues;
+	nr_total -= nr;
+	if (!nr_total)
+		goto done;
+
+	/* wrr medium queues */
+	nr_wrr_medium = nr =
+		wrr_medium_queues > nr_total ? nr_total : wrr_medium_queues;
+	nr_total -= nr;
+	if (!nr_total)
+		goto done;
+
+	/* wrr high queues */
+	nr_wrr_high = nr =
+		wrr_high_queues > nr_total ? nr_total : wrr_high_queues;
+	nr_total -= nr;
+	if (!nr_total)
+		goto done;
+
+	/* wrr urgent queues */
+	nr_wrr_urgent = nr =
+		wrr_urgent_queues > nr_total ? nr_total : wrr_urgent_queues;
+	nr_total -= nr;
+	if (!nr_total)
+		goto done;
+
+	/* add all the rest queue to default */
+	nr_default += nr_total;
+
+done:
+	dev->io_queues[HCTX_TYPE_DEFAULT] = nr_default;
+	affd->set_size[nr_sets - 1] = nr_default;
+	dev->io_queues[HCTX_TYPE_READ] = nr_read;
+	if (nr_read) {
+		nr_sets++;
+		affd->set_size[nr_sets - 1] = nr_read;
+	}
+	dev->io_queues[HCTX_TYPE_WRR_LOW] = nr_wrr_low;
+	if (nr_wrr_low) {
+		nr_sets++;
+		affd->set_size[nr_sets - 1] = nr_wrr_low;
+	}
+	dev->io_queues[HCTX_TYPE_WRR_MEDIUM] = nr_wrr_medium;
+	if (nr_wrr_medium) {
+		nr_sets++;
+		affd->set_size[nr_sets - 1] = nr_wrr_medium;
+	}
+	dev->io_queues[HCTX_TYPE_WRR_HIGH] = nr_wrr_high;
+	if (nr_wrr_high) {
+		nr_sets++;
+		affd->set_size[nr_sets - 1] = nr_wrr_high;
+	}
+	dev->io_queues[HCTX_TYPE_WRR_URGENT] = nr_wrr_urgent;
+	if (nr_wrr_urgent) {
+		nr_sets++;
+		affd->set_size[nr_sets - 1] = nr_wrr_urgent;
+	}
+	affd->nr_sets = nr_sets;
 }
 
 static int nvme_setup_irqs(struct nvme_dev *dev, unsigned int nr_io_queues)
@@ -2061,6 +2188,10 @@ static int nvme_setup_irqs(struct nvme_dev *dev, unsigned int nr_io_queues)
 	/* Initialize for the single interrupt case */
 	dev->io_queues[HCTX_TYPE_DEFAULT] = 1;
 	dev->io_queues[HCTX_TYPE_READ] = 0;
+	dev->io_queues[HCTX_TYPE_WRR_LOW] = 0;
+	dev->io_queues[HCTX_TYPE_WRR_MEDIUM] = 0;
+	dev->io_queues[HCTX_TYPE_WRR_HIGH] = 0;
+	dev->io_queues[HCTX_TYPE_WRR_URGENT] = 0;
 
 	/*
 	 * Some Apple controllers require all queues to use the
@@ -2162,10 +2293,16 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 		nvme_suspend_io_queues(dev);
 		goto retry;
 	}
-	dev_info(dev->ctrl.device, "%d/%d/%d default/read/poll queues\n",
+
+	dev_info(dev->ctrl.device, "%d/%d/%d/%d/%d/%d/%d "
+			"default/read/poll/low/medium/high/urgent queues\n",
 					dev->io_queues[HCTX_TYPE_DEFAULT],
 					dev->io_queues[HCTX_TYPE_READ],
-					dev->io_queues[HCTX_TYPE_POLL]);
+					dev->io_queues[HCTX_TYPE_POLL],
+					dev->io_queues[HCTX_TYPE_WRR_LOW],
+					dev->io_queues[HCTX_TYPE_WRR_MEDIUM],
+					dev->io_queues[HCTX_TYPE_WRR_HIGH],
+					dev->io_queues[HCTX_TYPE_WRR_URGENT]);
 	return 0;
 }
 
@@ -2251,9 +2388,7 @@ static void nvme_dev_add(struct nvme_dev *dev)
 	if (!dev->ctrl.tagset) {
 		dev->tagset.ops = &nvme_mq_ops;
 		dev->tagset.nr_hw_queues = dev->online_queues - 1;
-		dev->tagset.nr_maps = 2; /* default + read */
-		if (dev->io_queues[HCTX_TYPE_POLL])
-			dev->tagset.nr_maps++;
+		dev->tagset.nr_maps = HCTX_MAX_TYPES;
 		dev->tagset.timeout = NVME_IO_TIMEOUT;
 		dev->tagset.numa_node = dev_to_node(dev->dev);
 		dev->tagset.queue_depth =
@@ -2688,7 +2823,30 @@ static int nvme_pci_get_address(struct nvme_ctrl *ctrl, char *buf, int size)
 
 static void nvme_pci_get_ams(struct nvme_ctrl *ctrl, u32 *ams)
 {
-	*ams = NVME_CC_AMS_RR;
+	/* if the device doesn't support WRR, force reset wrr queue counts to 0 */
+	if (!NVME_CAP_AMS_WRRU(ctrl->cap)) {
+		wrr_low_queues = 0;
+		wrr_medium_queues = 0;
+		wrr_high_queues = 0;
+		wrr_urgent_queues = 0;
+
+		*ams = NVME_CC_AMS_RR;
+		ctrl->wrr_enabled = false;
+		return;
+	}
+
+	/*
+	 * if the device supports WRR but all wrr queue counts are 0, don't
+	 * enable the device's WRR.
+	 */
+	if ((wrr_low_queues + wrr_medium_queues + wrr_high_queues +
+				wrr_urgent_queues) > 0) {
+		*ams = NVME_CC_AMS_WRRU;
+		ctrl->wrr_enabled = true;
+	} else {
+		*ams = NVME_CC_AMS_RR;
+		ctrl->wrr_enabled = false;
+	}
 }
 
 static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index c5fe60ec6b84..887f8e36eacc 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -273,7 +273,7 @@ struct irq_affinity_notify {
 	void (*release)(struct kref *ref);
 };
 
-#define	IRQ_AFFINITY_MAX_SETS  4
+#define	IRQ_AFFINITY_MAX_SETS  6
 
 /**
  * struct irq_affinity - Description for automatic irq affinity assignements
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 6fe9121e4d27..695a7bb5b5d8 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -127,6 +127,7 @@ enum {
 };
 
 #define NVME_CAP_MQES(cap)	((cap) & 0xffff)
+#define NVME_CAP_AMS_WRRU(cap)	((cap) & (1 << 17))
 #define NVME_CAP_TIMEOUT(cap)	(((cap) >> 24) & 0xff)
 #define NVME_CAP_STRIDE(cap)	(((cap) >> 32) & 0xf)
 #define NVME_CAP_NSSRC(cap)	(((cap) >> 36) & 0x1)
@@ -895,6 +896,7 @@ enum {
 	NVME_SQ_PRIO_HIGH	= (1 << 1),
 	NVME_SQ_PRIO_MEDIUM	= (2 << 1),
 	NVME_SQ_PRIO_LOW	= (3 << 1),
+	NVME_SQ_PRIO_IGNORE	= NVME_SQ_PRIO_URGENT,
 	NVME_FEAT_ARBITRATION	= 0x01,
 	NVME_FEAT_POWER_MGMT	= 0x02,
 	NVME_FEAT_LBA_RANGE	= 0x03,
-- 
2.14.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v5 0/4] Add support Weighted Round Robin for blkcg and nvme
  2020-02-04  3:30 [PATCH v5 0/4] Add support Weighted Round Robin for blkcg and nvme Weiping Zhang
                   ` (3 preceding siblings ...)
  2020-02-04  3:31 ` [PATCH v5 4/4] nvme: add support weighted round robin queue Weiping Zhang
@ 2020-02-04 15:42 ` Keith Busch
  2020-02-16  8:09   ` Weiping Zhang
  4 siblings, 1 reply; 17+ messages in thread
From: Keith Busch @ 2020-02-04 15:42 UTC (permalink / raw)
  To: axboe, tj, hch, bvanassche, minwoo.im.dev, tglx, ming.lei,
	edmund.nadolski, linux-block, cgroups, linux-nvme

On Tue, Feb 04, 2020 at 11:30:45AM +0800, Weiping Zhang wrote:
> This series try to add Weighted Round Robin for block cgroup and nvme
> driver. When multiple containers share a single nvme device, we want
> to protect IO critical container from not be interfernced by other
> containers. We add blkio.wrr interface to user to control their IO
> priority. The blkio.wrr accept five level priorities, which contains
> "urgent", "high", "medium", "low" and "none", the "none" is used for
> disable WRR for this cgroup.

The NVMe protocol really doesn't define WRR to be a mechanism to mitigate
interference, though. It defines credits among the weighted queues
only for command fetching, and an urgent strict priority class that
starves the rest. It has nothing to do with how the controller should
prioritize completion of those commands, even if it may be reasonable to
assume influencing when the command is fetched should affect its
completion.

On the "weighted" strict priority, there's nothing separating "high"
from "low" other than the name: the "set features" credit assignment
can invert which queues have higher command fetch rates such that the
"low" is favoured over the "high".

There's no protection against the "urgent" class starving others: normal
IO will timeout and trigger repeated controller resets, while polled IO
will consume 100% of CPU cycles without making any progress if we make
this type of queue available without any additional code to ensure the
host behaves.

On the driver implementation, the number of module parameters being
added here is problematic. We already have 2 special classes of queues,
and defining this at the module level is considered too coarse when
the system has different devices on opposite ends of the capability
spectrum. For example, users want polled queues for the fast devices,
and none for the slower tier. We just don't have a good mechanism to
define per-controller resources, and more queue classes will make this
problem worse.

On the blk-mq side, this implementation doesn't work with the IO
schedulers. If one is in use, requests may be reordered such that a
request on your high-priority hctx may be dispatched later than more
recent ones associated with lower priority. I don't think that's what
you'd want to happen, so priority should be considered with schedulers
too.

But really, though, NVMe's WRR is too heavy weight and difficult to use.
The technical work group can come up with something better, but it looks
like they've lost interest in TPAR 4011 (no discussion in 2 years, afaics).

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v5 0/4] Add support Weighted Round Robin for blkcg and nvme
  2020-02-04 15:42 ` [PATCH v5 0/4] Add support Weighted Round Robin for blkcg and nvme Keith Busch
@ 2020-02-16  8:09   ` Weiping Zhang
  2020-03-31  6:17     ` Weiping Zhang
  0 siblings, 1 reply; 17+ messages in thread
From: Weiping Zhang @ 2020-02-16  8:09 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, Tejun Heo, Christoph Hellwig, Bart Van Assche,
	Minwoo Im, Thomas Gleixner, Ming Lei, Nadolski, Edmund,
	linux-block, cgroups, linux-nvme

On Tue, Feb 4, 2020 at 11:42 PM, Keith Busch <kbusch@kernel.org> wrote:
>
> On Tue, Feb 04, 2020 at 11:30:45AM +0800, Weiping Zhang wrote:
> > This series try to add Weighted Round Robin for block cgroup and nvme
> > driver. When multiple containers share a single nvme device, we want
> > to protect IO critical container from not be interfernced by other
> > containers. We add blkio.wrr interface to user to control their IO
> > priority. The blkio.wrr accept five level priorities, which contains
> > "urgent", "high", "medium", "low" and "none", the "none" is used for
> > disable WRR for this cgroup.
>
Hi Keith,

> The NVMe protocol really doesn't define WRR to be a mechanism to mitigate
> interference, though. It defines credits among the weighted queues
> only for command fetching, and an urgent strict priority class that
> starves the rest. It has nothing to do with how the controller should
> prioritize completion of those commands, even if it may be reasonable to
> assume influencing when the command is fetched should affect its
> completion.
>
Thanks for your feedback. The fio test results for WRR show that the
high-wrr fio job gets more bandwidth/iops and lower latency. I think
it's a good feature for the case of running multiple workloads with
different priorities, especially for container colocation.

> On the "weighted" strict priority, there's nothing separating "high"
> from "low" other than the name: the "set features" credit assignment
> can invert which queues have higher command fetch rates such that the
> "low" is favoured over the "high".
>
If there is no limitation in the hardware controller, we can add more
checking to the "set features" command handling. I think most people
won't give "low" more credits than "high"; it really does not make
sense.

> There's no protection against the "urgent" class starving others: normal
> IO will timeout and trigger repeated controller resets, while polled IO
> will consume 100% of CPU cycles without making any progress if we make
> this type of queue available without any additional code to ensure the
> host behaves..
>
I think we can just disable it in the software layer; actually, I have
no real application that needs this.

> On the driver implementation, the number of module parameters being
> added here is problematic. We already have 2 special classes of queues,
> and defining this at the module level is considered too coarse when
> the system has different devices on opposite ends of the capability
> spectrum. For example, users want polled queues for the fast devices,
> and none for the slower tier. We just don't have a good mechanism to
> define per-controller resources, and more queue classes will make this
> problem worse.
>
We can add a new "string" module parameter which contains a model
number; in most cases the same product shares a common model number
prefix, so nvme can distinguish devices with different performance
(high or low end) this way. Before creating the io queues, the nvme
driver can get the device's Model Number (40 bytes), then compare it
with the module parameter to decide how many io queues to use for each
disk:

/* if model_number is MODEL_ANY, these parameters are applied to all nvme devices */
char dev_io_queues[1024] = "model_number=MODEL_ANY,poll=0,read=0,wrr_low=0,wrr_medium=0,wrr_high=0,wrr_urgent=0";
/* these parameters only affect the nvme disk whose model number is "XXX" */
char dev_io_queues[1024] = "model_number=XXX,poll=1,read=2,wrr_low=3,wrr_medium=4,wrr_high=5,wrr_urgent=0";

struct dev_io_queues {
        char model_number[40];
        unsigned int poll;
        unsigned int read;
        unsigned int wrr_low;
        unsigned int wrr_medium;
        unsigned int wrr_high;
        unsigned int wrr_urgent;
};

We can use these two variables to store the io queue configurations:

/* default values for all disks whose model number is not in io_queues_cfg */
struct dev_io_queues io_queues_def = {};

/* user defined values for a specific model number */
struct dev_io_queues io_queues_cfg = {};

If we need multiple configurations (> 2), we can also extend
dev_io_queues to support that.
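
A rough sketch of how such an option string could be parsed (an
illustration only, not part of this series; the helper name and the exact
option grammar are my assumptions):

/* uses strsep()/skip_spaces()/strscpy() from <linux/string.h> and sscanf()
 * from <linux/kernel.h>; str must be a writable copy of the option string */
static void parse_dev_io_queues(char *str, struct dev_io_queues *cfg)
{
	char *opt;

	/* options are comma separated, e.g. "model_number=XXX,poll=1,read=2" */
	while ((opt = strsep(&str, ",")) != NULL) {
		opt = skip_spaces(opt);
		if (!strncmp(opt, "model_number=", 13))
			strscpy(cfg->model_number, opt + 13,
				sizeof(cfg->model_number));
		else if (sscanf(opt, "poll=%u", &cfg->poll) == 1 ||
			 sscanf(opt, "read=%u", &cfg->read) == 1 ||
			 sscanf(opt, "wrr_low=%u", &cfg->wrr_low) == 1 ||
			 sscanf(opt, "wrr_medium=%u", &cfg->wrr_medium) == 1 ||
			 sscanf(opt, "wrr_high=%u", &cfg->wrr_high) == 1 ||
			 sscanf(opt, "wrr_urgent=%u", &cfg->wrr_urgent) == 1)
			continue; /* unknown options are simply ignored */
	}
}

The driver could then compare the controller's reported Model Number with
cfg->model_number and fall back to io_queues_def when there is no match.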

> On the blk-mq side, this implementation doesn't work with the IO
> schedulers. If one is in use, requests may be reordered such that a
> request on your high-priority hctx may be dispatched later than more
> recent ones associated with lower priority. I don't think that's what
> you'd want to happen, so priority should be considered with schedulers
> too.
>
Currently, nvme does not use an io scheduler by default; if users want
to make WRR compatible with io schedulers, we can add other patches to
handle this.

> But really, though, NVMe's WRR is too heavy weight and difficult to use.
> The techincal work group can come up with something better, but it looks
> like they've lost interest in TPAR 4011 (no discussion in 2 years, afaics).

From the test results, I think it's a useful feature. It really gives
high-priority applications high iops/bandwidth and low latency, and it
keeps the software very thin and simple.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v5 0/4] Add support Weighted Round Robin for blkcg and nvme
  2020-02-16  8:09   ` Weiping Zhang
@ 2020-03-31  6:17     ` Weiping Zhang
  2020-03-31 10:29       ` Paolo Valente
  2020-03-31 14:36       ` Tejun Heo
  0 siblings, 2 replies; 17+ messages in thread
From: Weiping Zhang @ 2020-03-31  6:17 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, Tejun Heo, Christoph Hellwig, Bart Van Assche,
	Minwoo Im, Thomas Gleixner, Ming Lei, Nadolski, Edmund,
	linux-block, cgroups, linux-nvme

> > On the driver implementation, the number of module parameters being
> > added here is problematic. We already have 2 special classes of queues,
> > and defining this at the module level is considered too coarse when
> > the system has different devices on opposite ends of the capability
> > spectrum. For example, users want polled queues for the fast devices,
> > and none for the slower tier. We just don't have a good mechanism to
> > define per-controller resources, and more queue classes will make this
> > problem worse.
> >
> We can add a new "string" module parameter, which contains a model number,
> in most cases, the save product with a common prefix model number, so
> in this way
> nvme can distinguish the different performance devices(hign or low end).
> Before create io queue, nvme driver can get the device's Model number(40 Bytes),
> then nvme driver can compare device's model number with module parameter, to
> decide how many io queues for each disk;
>
> /* if model_number is MODEL_ANY, these parameters will be applied to
> all nvme devices. */
> char dev_io_queues[1024] = "model_number=MODEL_ANY,
> poll=0,read=0,wrr_low=0,wrr_medium=0,wrr_high=0,wrr_urgent=0";
> /* these paramters only affect nvme disk whose model number is "XXX" */
> char dev_io_queues[1024] = "model_number=XXX,
> poll=1,read=2,wrr_low=3,wrr_medium=4,wrr_high=5,wrr_urgent=0;";
>
> struct dev_io_queues {
>         char model_number[40];
>         unsigned int poll;
>         unsgined int read;
>         unsigned int wrr_low;
>         unsigned int wrr_medium;
>         unsigned int wrr_high;
>         unsigned int wrr_urgent;
> };
>
> We can use these two variable to store io queue configurations:
>
> /* default values for the all disk, except whose model number is not
> in io_queues_cfg */
> struct dev_io_queues io_queues_def = {};
>
> /* user defined values for a specific model number */
> struct dev_io_queues io_queues_cfg = {};
>
> If we need multiple configurations( > 2), we can also extend
> dev_io_queues to support it.
>

Hi Maintainers,

If we add a patch to support these queue counts at the controller level
instead of the module level, shall we add WRR?

Recently I did some cgroup io weight testing:
https://github.com/dublio/iotrack/wiki/cgroup-io-weight-test
I think a proper io weight policy should consider the high-weight
cgroup's iops and latency and also take the whole disk's throughput
into account; that is to say, the policy should carefully trade off
between a cgroup's IO performance and the whole disk's throughput. I
know one policy cannot do all things perfectly, but from the test
results nvme-wrr works well.

From the following test results, nvme-wrr works well for both the
cgroup's latency and iops and the whole disk's throughput.

Notes:
blk-iocost: only set qos.model, without setting the percentage latency.
nvme-wrr: set weight by:
    h=64;m=32;l=8;ab=0; nvme set-feature /dev/nvme1n1 -f 1 -v $(printf "0x%x\n" $(($ab<<0|$l<<8|$m<<16|$h<<24)))
    echo "$major:$minor high" > /sys/fs/cgroup/test1/io.wrr
    echo "$major:$minor low" > /sys/fs/cgroup/test2/io.wrr
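    # In the nvme set-feature command above, h/m/l/ab correspond to the
    # HPW/MPW/LPW/Arbitration Burst fields of the NVMe Arbitration feature
    # (Feature ID 01h), packed into bits 31:24, 23:16, 15:8 and 2:0.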


Randread vs Randread:
cgroup.test1.weight : cgroup.test2.weight = 8 : 1
high weight cgroup test1: randread, fio: numjobs=8, iodepth=32, bs=4K
low  weight cgroup test2: randread, fio: numjobs=8, iodepth=32, bs=4K

test case         bw         iops       rd_avg_lat   wr_avg_lat   rd_p99_lat   wr_p99_lat
==========================================================================================
bfq_test1         767226     191806     1333.30      0.00         536.00       0.00
bfq_test2         94607      23651      10816.06     0.00         610.00       0.00
iocost_test1      1457718    364429     701.76       0.00         1630.00      0.00
iocost_test2      1466337    366584     697.62       0.00         1613.00      0.00
none_test1        1456585    364146     702.22       0.00         1646.00      0.00
none_test2        1463090    365772     699.12       0.00         1613.00      0.00
wrr_test1         2635391    658847     387.94       0.00         1236.00      0.00
wrr_test2         365428     91357      2801.00      0.00         5537.00      0.00

https://github.com/dublio/iotrack/wiki/cgroup-io-weight-test#215-summary-fio-output


Randread vs Seq Write:
cgroup.test1.weight : cgroup.test2.weight = 8 : 1
high weight cgroup test1: randread, fio: numjobs=8, iodepth=32, bs=4K
low  weight cgroup test2: seq write, fio: numjobs=1, iodepth=32, bs=256K

test case      bw         iops       rd_avg_lat   wr_avg_lat   rd_p99_lat   wr_p99_lat
=======================================================================================
bfq_test1      814327     203581     1256.19      0.00         593.00       0.00
bfq_test2      104758     409        0.00         78196.32     0.00         1052770.00
iocost_test1   270467     67616      3784.02      0.00         9371.00      0.00
iocost_test2   1541575    6021       0.00         5313.02      0.00         6848.00
none_test1     271708     67927      3767.01      0.00         9502.00      0.00
none_test2     1541951    6023       0.00         5311.50      0.00         6848.00
wrr_test1      775005     193751     1320.17      0.00         4112.00      0.00
wrr_test2      1198319    4680       0.00         6835.30      0.00         8847.00


https://github.com/dublio/iotrack/wiki/cgroup-io-weight-test#225-summary-fio-output

Thanks
Weiping

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v5 0/4] Add support Weighted Round Robin for blkcg and nvme
  2020-03-31  6:17     ` Weiping Zhang
@ 2020-03-31 10:29       ` Paolo Valente
  2020-03-31 14:36       ` Tejun Heo
  1 sibling, 0 replies; 17+ messages in thread
From: Paolo Valente @ 2020-03-31 10:29 UTC (permalink / raw)
  To: Weiping Zhang
  Cc: Keith Busch, Jens Axboe, Tejun Heo, Christoph Hellwig,
	Bart Van Assche, Minwoo Im, Thomas Gleixner, Ming Lei, Nadolski,
	Edmund, linux-block, cgroups, linux-nvme



> On 31 Mar 2020, at 08:17, Weiping Zhang <zwp10758@gmail.com> wrote:
> 
>>> On the driver implementation, the number of module parameters being
>>> added here is problematic. We already have 2 special classes of queues,
>>> and defining this at the module level is considered too coarse when
>>> the system has different devices on opposite ends of the capability
>>> spectrum. For example, users want polled queues for the fast devices,
>>> and none for the slower tier. We just don't have a good mechanism to
>>> define per-controller resources, and more queue classes will make this
>>> problem worse.
>>> 
>> We can add a new "string" module parameter that contains a model
>> number; in most cases devices of the same product share a common
>> model-number prefix, so nvme can distinguish devices of different
>> performance classes (high or low end) this way. Before creating the
>> io queues, the nvme driver can read the device's Model Number (40
>> bytes) and compare it with the module parameter to decide how many
>> io queues to use for each disk:
>> 
>> /* if model_number is MODEL_ANY, these parameters will be applied to
>> all nvme devices. */
>> char dev_io_queues[1024] = "model_number=MODEL_ANY,
>> poll=0,read=0,wrr_low=0,wrr_medium=0,wrr_high=0,wrr_urgent=0";
>> /* these parameters only affect nvme disks whose model number is "XXX" */
>> char dev_io_queues[1024] = "model_number=XXX,
>> poll=1,read=2,wrr_low=3,wrr_medium=4,wrr_high=5,wrr_urgent=0;";
>> 
>> struct dev_io_queues {
>>        char model_number[40];
>>        unsigned int poll;
>>        unsigned int read;
>>        unsigned int wrr_low;
>>        unsigned int wrr_medium;
>>        unsigned int wrr_high;
>>        unsigned int wrr_urgent;
>> };
>> 
>> We can use these two variables to store the io queue configurations:
>> 
>> /* default values for all disks whose model number is not in
>> io_queues_cfg */
>> struct dev_io_queues io_queues_def = {};
>> 
>> /* user defined values for a specific model number */
>> struct dev_io_queues io_queues_cfg = {};
>> 
>> If we need more than two configurations, we can also extend
>> dev_io_queues to support that.
>> 
> 
> Hi Maintainers,
> 
> If we add patch to support these queue count at controller level,
> instead moudle level,
> shall we add WRR ?
> 
> Recently I did some cgroup io weight testing:
> https://github.com/dublio/iotrack/wiki/cgroup-io-weight-test
> I think a proper io weight policy should consider the high-weight
> cgroup's iops and latency, and also take the whole disk's throughput
> into account; that is to say, the policy should make a careful
> trade-off between a cgroup's IO performance and the whole disk's
> throughput. I know one policy cannot do all things perfectly, but
> from the test results nvme-wrr works well.
> 
> From the following test results, nvme-wrr does well on both the
> cgroups' latency and iops and the whole disk's throughput.
> 
> Notes:
> blk-iocost: only the cost model was set; the latency percentile QoS
> targets were not configured.
> nvme-wrr: weights were set by:
>     h=64;m=32;l=8;ab=0; nvme set-feature /dev/nvme1n1 -f 1 \
>         -v $(printf "0x%x\n" $(($ab<<0|$l<<8|$m<<16|$h<<24)))
>    echo "$major:$minor high" > /sys/fs/cgroup/test1/io.wrr
>    echo "$major:$minor low" > /sys/fs/cgroup/test2/io.wrr
> 
> 
> Randread vs Randread:
> cgroup.test1.weight : cgroup.test2.weight = 8 : 1
> high weight cgroup test1: randread, fio: numjobs=8, iodepth=32, bs=4K
> low  weight cgroup test2: randread, fio: numjobs=8, iodepth=32, bs=4K
> 
> test case         bw         iops       rd_avg_lat   wr_avg_lat   rd_p99_lat   wr_p99_lat
> ==========================================================================================
> bfq_test1         767226     191806     1333.30      0.00         536.00       0.00
> bfq_test2         94607      23651      10816.06     0.00         610.00       0.00
> iocost_test1      1457718    364429     701.76       0.00         1630.00      0.00
> iocost_test2      1466337    366584     697.62       0.00         1613.00      0.00
> none_test1        1456585    364146     702.22       0.00         1646.00      0.00
> none_test2        1463090    365772     699.12       0.00         1613.00      0.00
> wrr_test1         2635391    658847     387.94       0.00         1236.00      0.00
> wrr_test2         365428     91357      2801.00      0.00         5537.00      0.00
> 
> https://github.com/dublio/iotrack/wiki/cgroup-io-weight-test#215-summary-fio-output
> 
> 

Glad to see that BFQ honors the weights.  Sad to see how much it
suffers in terms of IOPS on your system.

Good job with your scheduler!

However, as for I/O control, the hard-to-control cases are not the
ones with constantly full, deep queues.  BFQ's complexity stems from
the need to control the tough cases too.  An example is sync I/O at
an I/O depth of one competing against async I/O.  On the other hand,
those use cases may not be of interest for your scheduler.

Thanks,
Paolo

> Randread vs Seq Write:
> cgroup.test1.weight : cgroup.test2.weight = 8 : 1
> high weight cgroup test1: randread, fio: numjobs=8, iodepth=32, bs=4K
> low  weight cgroup test2: seq write, fio: numjobs=1, iodepth=32, bs=256K
> 
> test case      bw         iops       rd_avg_lat   wr_avg_lat   rd_p99_lat   wr_p99_lat
> =======================================================================================
> bfq_test1      814327     203581     1256.19      0.00         593.00       0.00
> bfq_test2      104758     409        0.00         78196.32     0.00         1052770.00
> iocost_test1   270467     67616      3784.02      0.00         9371.00      0.00
> iocost_test2   1541575    6021       0.00         5313.02      0.00         6848.00
> none_test1     271708     67927      3767.01      0.00         9502.00      0.00
> none_test2     1541951    6023       0.00         5311.50      0.00         6848.00
> wrr_test1      775005     193751     1320.17      0.00         4112.00      0.00
> wrr_test2      1198319    4680       0.00         6835.30      0.00         8847.00
> 
> 
> https://github.com/dublio/iotrack/wiki/cgroup-io-weight-test#225-summary-fio-output
> 
> Thanks
> Weiping


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v5 0/4] Add support Weighted Round Robin for blkcg and nvme
  2020-03-31  6:17     ` Weiping Zhang
  2020-03-31 10:29       ` Paolo Valente
@ 2020-03-31 14:36       ` Tejun Heo
  2020-03-31 15:47         ` Weiping Zhang
  1 sibling, 1 reply; 17+ messages in thread
From: Tejun Heo @ 2020-03-31 14:36 UTC (permalink / raw)
  To: Weiping Zhang
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Bart Van Assche,
	Minwoo Im, Thomas Gleixner, Ming Lei, Nadolski, Edmund,
	linux-block, cgroups, linux-nvme

Hello, Weiping.

On Tue, Mar 31, 2020 at 02:17:06PM +0800, Weiping Zhang wrote:
> Recently I did some cgroup io weight testing:
> https://github.com/dublio/iotrack/wiki/cgroup-io-weight-test
> I think a proper io weight policy should consider the high-weight
> cgroup's iops and latency, and also take the whole disk's throughput
> into account; that is to say, the policy should make a careful
> trade-off between a cgroup's IO performance and the whole disk's
> throughput. I know one policy cannot do all things perfectly, but
> from the test results nvme-wrr works well.

That's w/o iocost QoS targets configured, right? iocost should be able to
achieve similar results as wrr with QoS configured.

> From the following test results, nvme-wrr does well on both the
> cgroups' latency and iops and the whole disk's throughput.

As I wrote before, the issues I see with wrr are the followings.

* Hardware dependent. Some will work ok or even fantastic. Many others will do
  horribly.

* Lack of configuration granularity. We can't configure it at a fine enough
  granularity to serve hierarchical configuration.

* Likely not a huge problem with the deep QD of nvmes, but the lack of queue
  depth control can lead to loss of latency control and thus loss of protection
  for low concurrency workloads when pitted against workloads which can
  saturate the QD.

All that said, given that the feature is available, I don't see any reason not
to allow using it, but I don't think it fits the cgroup interface model given
the hardware dependency and coarse granularity. For these cases, I think the
right thing to do is using cgroups to provide tagging information - ie. build a
dedicated interface which takes a cgroup fd or ino as the tag and associate
configurations that way. There already are other use cases which use cgroups
this way (e.g. perf).

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v5 0/4] Add support Weighted Round Robin for blkcg and nvme
  2020-03-31 14:36       ` Tejun Heo
@ 2020-03-31 15:47         ` Weiping Zhang
  2020-03-31 15:51           ` Tejun Heo
  0 siblings, 1 reply; 17+ messages in thread
From: Weiping Zhang @ 2020-03-31 15:47 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Bart Van Assche,
	Minwoo Im, Thomas Gleixner, Ming Lei, Nadolski, Edmund,
	linux-block, cgroups, linux-nvme

Tejun Heo <tj@kernel.org> wrote on Tue, Mar 31, 2020 at 10:36 PM:
>
> Hello, Weiping.
>
> On Tue, Mar 31, 2020 at 02:17:06PM +0800, Weiping Zhang wrote:
> > Recently I did some cgroup io weight testing:
> > https://github.com/dublio/iotrack/wiki/cgroup-io-weight-test
> > I think a proper io weight policy should consider the high-weight
> > cgroup's iops and latency, and also take the whole disk's throughput
> > into account; that is to say, the policy should make a careful
> > trade-off between a cgroup's IO performance and the whole disk's
> > throughput. I know one policy cannot do all things perfectly, but
> > from the test results nvme-wrr works well.
>
> That's w/o iocost QoS targets configured, right? iocost should be able to
> achieve similar results as wrr with QoS configured.
>
Yes, I have not set the QoS targets.
> > From the following test results, nvme-wrr does well on both the
> > cgroups' latency and iops and the whole disk's throughput.
>
> As I wrote before, the issues I see with wrr are the followings.
>
> * Hardware dependent. Some will work ok or even fantastic. Many others will do
>   horribly.
>
> * Lack of configuration granularity. We can't configure it granular enough to
>   serve hierarchical configuration.
>
> * Likely not a huge problem with the deep QD of nvmes but lack of queue depth
>   control can lead to loss of latency control and thus loss of protection for
>   low concurrency workloads when pitched against workloads which can saturate
>   QD.
>
> All that said, given the feature is available, I don't see any reason to not
> allow to use it, but I don't think it fits the cgroup interface model given the
> hardware dependency and coarse granularity. For these cases, I think the right
> thing to do is using cgroups to provide tagging information - ie. build a
> dedicated interface which takes cgroup fd or ino as the tag and associate
> configurations that way. There already are other use cases which use cgroup this
> way (e.g. perf).
>
Do you mean dropping the "io.wrr" / "blkio.wrr" files from cgroup and
using a dedicated interface like /dev/xxx or /proc/xxx?

I see the perf code does:

    struct fd f = fdget(fd);
    struct cgroup_subsys_state *css =
            css_tryget_online_from_dir(f.file->f_path.dentry,
                                       &perf_event_cgrp_subsys);

It looks like the same approach can be applied to the block cgroup.
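
A minimal, untested sketch of the blkcg analogue, assuming io_cgrp_subsys
and css_to_blkcg() from block/blk-cgroup (the caller would still have to
css_put() the css when it is done with the returned blkcg):

    /* sketch: resolve a cgroup fd to its blkcg, perf-style */
    static struct blkcg *blkcg_from_cgroup_fd(int fd)
    {
            struct fd f = fdget(fd);
            struct cgroup_subsys_state *css;

            if (!f.file)
                    return ERR_PTR(-EBADF);

            css = css_tryget_online_from_dir(f.file->f_path.dentry,
                                             &io_cgrp_subsys);
            fdput(f);

            if (IS_ERR(css))
                    return ERR_CAST(css);

            return css_to_blkcg(css);
    }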

Thanks for your help.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v5 0/4] Add support Weighted Round Robin for blkcg and nvme
  2020-03-31 15:47         ` Weiping Zhang
@ 2020-03-31 15:51           ` Tejun Heo
  2020-03-31 15:52             ` Christoph Hellwig
  0 siblings, 1 reply; 17+ messages in thread
From: Tejun Heo @ 2020-03-31 15:51 UTC (permalink / raw)
  To: Weiping Zhang
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Bart Van Assche,
	Minwoo Im, Thomas Gleixner, Ming Lei, Nadolski, Edmund,
	linux-block, cgroups, linux-nvme

Hello,

On Tue, Mar 31, 2020 at 11:47:41PM +0800, Weiping Zhang wrote:
> Do you mean dropping the "io.wrr" / "blkio.wrr" files from cgroup and
> using a dedicated interface like /dev/xxx or /proc/xxx?

Yes, something along that line. Given that it's nvme specific, it'd be best if
the interface reflects that too - e.g. through a file under
/sys/block/nvme*/device/. Jens, Christoph, what do you guys think?
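
For illustration only (not something from the patchset), such a
per-controller knob could be a plain read-only device attribute in the
nvme driver; the attribute name and the ctrl->wrr_*_queues fields below
are made up to show the shape of the interface:

    static ssize_t wrr_queues_show(struct device *dev,
                                   struct device_attribute *attr, char *buf)
    {
            struct nvme_ctrl *ctrl = dev_get_drvdata(dev);

            return snprintf(buf, PAGE_SIZE,
                            "low=%u medium=%u high=%u urgent=%u\n",
                            ctrl->wrr_low_queues, ctrl->wrr_medium_queues,
                            ctrl->wrr_high_queues, ctrl->wrr_urgent_queues);
    }
    static DEVICE_ATTR_RO(wrr_queues);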

> I see the perf code does:
> 
>     struct fd f = fdget(fd);
>     struct cgroup_subsys_state *css =
>             css_tryget_online_from_dir(f.file->f_path.dentry,
>                                        &perf_event_cgrp_subsys);
> 
> It looks like the same approach can be applied to the block cgroup.

Yeah, either fd or ino can be used to identify a cgroup.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v5 0/4] Add support Weighted Round Robin for blkcg and nvme
  2020-03-31 15:51           ` Tejun Heo
@ 2020-03-31 15:52             ` Christoph Hellwig
  2020-03-31 15:54               ` Tejun Heo
  2020-03-31 16:31               ` Weiping Zhang
  0 siblings, 2 replies; 17+ messages in thread
From: Christoph Hellwig @ 2020-03-31 15:52 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Weiping Zhang, Keith Busch, Jens Axboe, Christoph Hellwig,
	Bart Van Assche, Minwoo Im, Thomas Gleixner, Ming Lei, Nadolski,
	Edmund, linux-block, cgroups, linux-nvme

On Tue, Mar 31, 2020 at 11:51:39AM -0400, Tejun Heo wrote:
> Hello,
> 
> On Tue, Mar 31, 2020 at 11:47:41PM +0800, Weiping Zhang wrote:
> > Do you mean dropping the "io.wrr" / "blkio.wrr" files from cgroup and
> > using a dedicated interface like /dev/xxx or /proc/xxx?
> 
> Yes, something along that line. Given that it's nvme specific, it'd be best if
> the interface reflects that too - e.g. through a file under
> /sys/block/nvme*/device/. Jens, Christoph, what do you guys think?

I'm pretty sure I voiced my opinion before - I think the NVMe WRR
queueing concept is completely broken and I do not think we should
support it at all.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v5 0/4] Add support Weighted Round Robin for blkcg and nvme
  2020-03-31 15:52             ` Christoph Hellwig
@ 2020-03-31 15:54               ` Tejun Heo
  2020-03-31 16:31               ` Weiping Zhang
  1 sibling, 0 replies; 17+ messages in thread
From: Tejun Heo @ 2020-03-31 15:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Weiping Zhang, Keith Busch, Jens Axboe, Bart Van Assche,
	Minwoo Im, Thomas Gleixner, Ming Lei, Nadolski, Edmund,
	linux-block, cgroups, linux-nvme

On Tue, Mar 31, 2020 at 05:52:57PM +0200, Christoph Hellwig wrote:
> On Tue, Mar 31, 2020 at 11:51:39AM -0400, Tejun Heo wrote:
> > Hello,
> > 
> > On Tue, Mar 31, 2020 at 11:47:41PM +0800, Weiping Zhang wrote:
> > > Do you mean dropping the "io.wrr" / "blkio.wrr" files from cgroup and
> > > using a dedicated interface like /dev/xxx or /proc/xxx?
> > 
> > Yes, something along that line. Given that it's nvme specific, it'd be best if
> > the interface reflects that too - e.g. through a file under
> > /sys/block/nvme*/device/. Jens, Christoph, what do you guys think?
> 
> I'm pretty sure I voiced my opinion before - I think the NVMe WRR
> queueing concept is completely broken and I do not think we should
> support it at all.

Ah, okay, I completely forgot about that. I don't have a strong opinion either
way.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v5 0/4] Add support Weighted Round Robin for blkcg and nvme
  2020-03-31 15:52             ` Christoph Hellwig
  2020-03-31 15:54               ` Tejun Heo
@ 2020-03-31 16:31               ` Weiping Zhang
  2020-03-31 16:33                 ` Christoph Hellwig
  1 sibling, 1 reply; 17+ messages in thread
From: Weiping Zhang @ 2020-03-31 16:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tejun Heo, Keith Busch, Jens Axboe, Bart Van Assche, Minwoo Im,
	Thomas Gleixner, Ming Lei, Nadolski, Edmund, linux-block,
	cgroups, linux-nvme

Christoph Hellwig <hch@lst.de> wrote on Tue, Mar 31, 2020 at 11:52 PM:
>
> On Tue, Mar 31, 2020 at 11:51:39AM -0400, Tejun Heo wrote:
> > Hello,
> >
> > On Tue, Mar 31, 2020 at 11:47:41PM +0800, Weiping Zhang wrote:
> > > Do you mean dropping the "io.wrr" / "blkio.wrr" files from cgroup and
> > > using a dedicated interface like /dev/xxx or /proc/xxx?
> >
> > Yes, something along that line. Given that it's nvme specific, it'd be best if
> > the interface reflects that too - e.g. through a file under
> > /sys/block/nvme*/device/. Jens, Christoph, what do you guys think?
>
> I'm pretty sure I voiced my opinion before - I think the NVMe WRR
> queueing concept is completely broken and I do not think we should
> support it at all.
Hi Christoph,

Would you like to share more detail about why NVMe WRR is broken?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v5 0/4] Add support Weighted Round Robin for blkcg and nvme
  2020-03-31 16:31               ` Weiping Zhang
@ 2020-03-31 16:33                 ` Christoph Hellwig
  2020-03-31 16:52                   ` Weiping Zhang
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2020-03-31 16:33 UTC (permalink / raw)
  To: Weiping Zhang
  Cc: Christoph Hellwig, Tejun Heo, Keith Busch, Jens Axboe,
	Bart Van Assche, Minwoo Im, Thomas Gleixner, Ming Lei, Nadolski,
	Edmund, linux-block, cgroups, linux-nvme

On Wed, Apr 01, 2020 at 12:31:17AM +0800, Weiping Zhang wrote:
> Would you like to share more detail about why NVMe WRR is broken?

Because it only weights command fetching.  It says absolutely nothing
about command execution.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v5 0/4] Add support Weighted Round Robin for blkcg and nvme
  2020-03-31 16:33                 ` Christoph Hellwig
@ 2020-03-31 16:52                   ` Weiping Zhang
  0 siblings, 0 replies; 17+ messages in thread
From: Weiping Zhang @ 2020-03-31 16:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tejun Heo, Keith Busch, Jens Axboe, Bart Van Assche, Minwoo Im,
	Thomas Gleixner, Ming Lei, Nadolski, Edmund, linux-block,
	cgroups, linux-nvme

Christoph Hellwig <hch@lst.de> wrote on Wed, Apr 1, 2020 at 12:33 AM:
>
> On Wed, Apr 01, 2020 at 12:31:17AM +0800, Weiping Zhang wrote:
> > Would you like to share more detail about why NVMe WRR is broken?
>
> Because it only weights command fetching.  It says absolutely nothing
> about command execution.

I know that different SSD vendors may have different implementations,
and some devices can work well, but since there is no plan to support
it, I will drop this patchset.

@Tejun Heo  @Keith Busch
Thanks to all of you.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2020-03-31 16:52 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-04  3:30 [PATCH v5 0/4] Add support Weighted Round Robin for blkcg and nvme Weiping Zhang
2020-02-04  3:31 ` [PATCH v5 1/4] block: add weighted round robin for blkcgroup Weiping Zhang
2020-02-04  3:31 ` [PATCH v5 2/4] nvme: add get_ams for nvme_ctrl_ops Weiping Zhang
2020-02-04  3:31 ` [PATCH v5 3/4] nvme-pci: rename module parameter write_queues to read_queues Weiping Zhang
2020-02-04  3:31 ` [PATCH v5 4/4] nvme: add support weighted round robin queue Weiping Zhang
2020-02-04 15:42 ` [PATCH v5 0/4] Add support Weighted Round Robin for blkcg and nvme Keith Busch
2020-02-16  8:09   ` Weiping Zhang
2020-03-31  6:17     ` Weiping Zhang
2020-03-31 10:29       ` Paolo Valente
2020-03-31 14:36       ` Tejun Heo
2020-03-31 15:47         ` Weiping Zhang
2020-03-31 15:51           ` Tejun Heo
2020-03-31 15:52             ` Christoph Hellwig
2020-03-31 15:54               ` Tejun Heo
2020-03-31 16:31               ` Weiping Zhang
2020-03-31 16:33                 ` Christoph Hellwig
2020-03-31 16:52                   ` Weiping Zhang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).