linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/3] cfq-iosched: Fair cross-group preemption
@ 2011-03-22  1:10 Chad Talbott
  2011-03-22  1:10 ` [PATCH 1/3] cfq-iosched: Fair cross-group preemption (interface and documentation) Chad Talbott
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Chad Talbott @ 2011-03-22  1:10 UTC (permalink / raw)
  To: jaxboe, vgoyal; +Cc: linux-kernel, mrubin, teravest, Chad Talbott

This patchset introduces fair cross-group preemption.  Right now, we
don't have strict isolation between processes in different cgroups.
For example: currently an RT ioprio thread in one group can preempt a
BE ioprio thread in another group.  We would like to have an
application specify the relative priorities of its threads, but still
allow strict isolation between applications.

With this patch series, we can configure an entire cgroup as needing
low-latency service.  Then that group will be able to preempt another
group.  To prevent a runaway application from starving other
applications, we allow this preemption only until it has exceeded its
fair share (as specified by its relative weight).  So a rate-limited,
but latency sensitive application (like streaming audio or video) can
get front-of-the-line service without fear of it hogging a whole
disk's IO.

Note that this series is targeted to shallow-queue devices (e.g. a
single spinning disk).  Without hardware support, it is much more
difficult to provide low latency on a device with a deep request
queue.
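
A device's hardware queue depth can be checked via sysfs, e.g. for a
SCSI disk (device path illustrative):

  cat /sys/block/sdb/device/queue_depth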

The following bash script demonstrates the feature:

# assumes the blkio cgroup hierarchy is mounted at /dev/cgroup
cd /dev/cgroup
mkdir c1 c2

# c1: unthrottled sequential reader; weight 900, best-effort class (2)
(
    echo '8:16 900' > c1/blkio.weight_device
    echo '8:16 2' > c1/blkio.class
    echo $BASHPID > c1/tasks
    dd if=/dev/sdb of=/dev/null bs=1M iflag=direct
) &
# c2: rate-limited reader (~10MB/s); weight 100, real-time class (1)
(
    echo '8:16 100' > c2/blkio.weight_device
    echo '8:16 1' > c2/blkio.class
    echo $BASHPID > c2/tasks
    while : ; do dd if=/dev/sdb of=/dev/null bs=1M count=1 iflag=direct; sleep .1 ; done
) &
sleep 10
kill %1 %2

grep 8:16 c?/blkio*
# end

Since the c2 reader is "well behaved" and reads only ~10MB/s, it will
be able to consistently preempt c1, and therefore see low latency for
its requests.


* [PATCH 1/3] cfq-iosched: Fair cross-group preemption (interface and documentation)
  2011-03-22  1:10 [PATCH 0/3] cfq-iosched: Fair cross-group preemption Chad Talbott
@ 2011-03-22  1:10 ` Chad Talbott
  2011-03-22 10:03   ` Gui Jianfeng
  2011-03-22  1:10 ` [PATCH 2/3] cfq-iosched: Fair cross-group preemption (implementation) Chad Talbott
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Chad Talbott @ 2011-03-22  1:10 UTC (permalink / raw)
  To: jaxboe, vgoyal; +Cc: linux-kernel, mrubin, teravest, Chad Talbott

blkio.class is a new interface that allows you to specify a cgroup as
requiring low-latency service on a specific block device.  This patch
includes only the interface and documentation.

Signed-off-by: Chad Talbott <ctalbott@google.com>
---
 Documentation/cgroups/blkio-controller.txt |   16 ++++
 block/blk-cgroup.c                         |  121 +++++++++++++++++++++-------
 block/blk-cgroup.h                         |   36 +++++++--
 block/cfq-iosched.c                        |   12 +++-
 4 files changed, 145 insertions(+), 40 deletions(-)

diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
index 4ed7b5c..7acf9d2 100644
--- a/Documentation/cgroups/blkio-controller.txt
+++ b/Documentation/cgroups/blkio-controller.txt
@@ -169,6 +169,22 @@ Proportional weight policy files
 	  dev     weight
 	  8:16    300
 
+- blkio.class
+	- Specifies the per-cgroup, per-device service class.  There
+	  are currently two service classes: real-time (1) and
+	  best-effort (2).  By default, all groups start as
+	  best-effort.
+
+	  # echo dev_maj:dev_minor class > /path/to/cgroup/blkio.class
+	  Configure class=real-time on /dev/sdb (8:16) in this cgroup:
+	  # echo '8:16 1' > blkio.class
+	  # cat blkio.class
+	  8:16    1
+
+	  A real-time group can preempt a best-effort group, but only
+	  as long as it stays within its assigned weight; i.e. fairness
+	  is preserved.
+
 - blkio.time
 	- disk time allocated to cgroup per device in milliseconds. First
 	  two fields specify the major and minor number of the device and
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 455768a..f2c553e 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -129,6 +129,21 @@ blkio_update_group_weight(struct blkio_group *blkg, unsigned int weight)
 	}
 }
 
+static inline void
+blkio_update_group_class(struct blkio_group *blkg, unsigned int class)
+{
+	struct blkio_policy_type *blkiop;
+
+	list_for_each_entry(blkiop, &blkio_list, list) {
+		/* If this policy does not own the blkg, do not send updates */
+		if (blkiop->plid != blkg->plid)
+			continue;
+		if (blkiop->ops.blkio_update_group_class_fn)
+			blkiop->ops.blkio_update_group_class_fn(blkg->key,
+								blkg, class);
+	}
+}
+
 static inline void blkio_update_group_bps(struct blkio_group *blkg, u64 bps,
 				int fileid)
 {
@@ -708,15 +723,29 @@ static int blkio_policy_parse_and_set(char *buf,
 
 	switch (plid) {
 	case BLKIO_POLICY_PROP:
-		ret = strict_strtoul(s[1], 10, &temp);
-		if (ret || (temp < BLKIO_WEIGHT_MIN && temp > 0) ||
-			temp > BLKIO_WEIGHT_MAX)
-			return -EINVAL;
-
-		newpn->plid = plid;
-		newpn->fileid = fileid;
-		newpn->val.weight = temp;
-		break;
+		switch (fileid) {
+		case BLKIO_PROP_weight_device:
+			ret = strict_strtoul(s[1], 10, &temp);
+			if (ret || (temp < BLKIO_WEIGHT_MIN && temp > 0) ||
+			    temp > BLKIO_WEIGHT_MAX)
+				return -EINVAL;
+
+			newpn->plid = plid;
+			newpn->fileid = fileid;
+			newpn->val.prop.weight = temp;
+			break;
+		case BLKIO_PROP_class:
+			ret = strict_strtoul(s[1], 10, &temp);
+			if (ret ||
+			    temp < BLKIO_RT_CLASS ||
+			    temp > BLKIO_BE_CLASS)
+				return -EINVAL;
+
+			newpn->plid = plid;
+			newpn->fileid = fileid;
+			newpn->val.prop.class = temp;
+			break;
+		}
+		break;
 	case BLKIO_POLICY_THROTL:
 		switch(fileid) {
 		case BLKIO_THROTL_read_bps_device:
@@ -727,7 +756,7 @@ static int blkio_policy_parse_and_set(char *buf,
 
 			newpn->plid = plid;
 			newpn->fileid = fileid;
-			newpn->val.bps = bps;
+			newpn->val.throtl.bps = bps;
 			break;
 		case BLKIO_THROTL_read_iops_device:
 		case BLKIO_THROTL_write_iops_device:
@@ -740,7 +769,7 @@ static int blkio_policy_parse_and_set(char *buf,
 
 			newpn->plid = plid;
 			newpn->fileid = fileid;
-			newpn->val.iops = (unsigned int)iops;
+			newpn->val.throtl.iops = (unsigned int)iops;
 			break;
 		}
 		break;
@@ -759,20 +788,34 @@ unsigned int blkcg_get_weight(struct blkio_cgroup *blkcg,
 	pn = blkio_policy_search_node(blkcg, dev, BLKIO_POLICY_PROP,
 				BLKIO_PROP_weight_device);
 	if (pn)
-		return pn->val.weight;
+		return pn->val.prop.weight;
 	else
 		return blkcg->weight;
 }
 EXPORT_SYMBOL_GPL(blkcg_get_weight);
 
-uint64_t blkcg_get_read_bps(struct blkio_cgroup *blkcg, dev_t dev)
+enum blkcg_class blkcg_get_class(struct blkio_cgroup *blkcg,
+			     dev_t dev)
 {
 	struct blkio_policy_node *pn;
+	enum blkcg_class class = BLKIO_DEFAULT_CLASS;
+
+	pn = blkio_policy_search_node(blkcg, dev, BLKIO_POLICY_PROP,
+				BLKIO_PROP_class);
+	if (pn)
+		class = pn->val.prop.class;
 
+	return class;
+}
+EXPORT_SYMBOL_GPL(blkcg_get_class);
+
+uint64_t blkcg_get_read_bps(struct blkio_cgroup *blkcg, dev_t dev)
+{
+	struct blkio_policy_node *pn;
 	pn = blkio_policy_search_node(blkcg, dev, BLKIO_POLICY_THROTL,
 				BLKIO_THROTL_read_bps_device);
 	if (pn)
-		return pn->val.bps;
+		return pn->val.throtl.bps;
 	else
 		return -1;
 }
@@ -783,7 +826,7 @@ uint64_t blkcg_get_write_bps(struct blkio_cgroup *blkcg, dev_t dev)
 	pn = blkio_policy_search_node(blkcg, dev, BLKIO_POLICY_THROTL,
 				BLKIO_THROTL_write_bps_device);
 	if (pn)
-		return pn->val.bps;
+		return pn->val.throtl.bps;
 	else
 		return -1;
 }
@@ -795,7 +838,7 @@ unsigned int blkcg_get_read_iops(struct blkio_cgroup *blkcg, dev_t dev)
 	pn = blkio_policy_search_node(blkcg, dev, BLKIO_POLICY_THROTL,
 				BLKIO_THROTL_read_iops_device);
 	if (pn)
-		return pn->val.iops;
+		return pn->val.throtl.iops;
 	else
 		return -1;
 }
@@ -806,7 +849,7 @@ unsigned int blkcg_get_write_iops(struct blkio_cgroup *blkcg, dev_t dev)
 	pn = blkio_policy_search_node(blkcg, dev, BLKIO_POLICY_THROTL,
 				BLKIO_THROTL_write_iops_device);
 	if (pn)
-		return pn->val.iops;
+		return pn->val.throtl.iops;
 	else
 		return -1;
 }
@@ -816,19 +859,19 @@ static bool blkio_delete_rule_command(struct blkio_policy_node *pn)
 {
 	switch(pn->plid) {
 	case BLKIO_POLICY_PROP:
-		if (pn->val.weight == 0)
+		if (pn->val.prop.weight == 0)
 			return 1;
 		break;
 	case BLKIO_POLICY_THROTL:
 		switch(pn->fileid) {
 		case BLKIO_THROTL_read_bps_device:
 		case BLKIO_THROTL_write_bps_device:
-			if (pn->val.bps == 0)
+			if (pn->val.throtl.bps == 0)
 				return 1;
 			break;
 		case BLKIO_THROTL_read_iops_device:
 		case BLKIO_THROTL_write_iops_device:
-			if (pn->val.iops == 0)
+			if (pn->val.throtl.iops == 0)
 				return 1;
 		}
 		break;
@@ -844,17 +887,17 @@ static void blkio_update_policy_rule(struct blkio_policy_node *oldpn,
 {
 	switch(oldpn->plid) {
 	case BLKIO_POLICY_PROP:
-		oldpn->val.weight = newpn->val.weight;
+		oldpn->val.prop.weight = newpn->val.prop.weight;
 		break;
 	case BLKIO_POLICY_THROTL:
 		switch(newpn->fileid) {
 		case BLKIO_THROTL_read_bps_device:
 		case BLKIO_THROTL_write_bps_device:
-			oldpn->val.bps = newpn->val.bps;
+			oldpn->val.throtl.bps = newpn->val.throtl.bps;
 			break;
 		case BLKIO_THROTL_read_iops_device:
 		case BLKIO_THROTL_write_iops_device:
-			oldpn->val.iops = newpn->val.iops;
+			oldpn->val.throtl.iops = newpn->val.throtl.iops;
 		}
 		break;
 	default:
@@ -872,22 +915,24 @@ static void blkio_update_blkg_policy(struct blkio_cgroup *blkcg,
 	unsigned int weight, iops;
 	u64 bps;
 
 	switch(pn->plid) {
 	case BLKIO_POLICY_PROP:
-		weight = pn->val.weight ? pn->val.weight :
+		weight = pn->val.prop.weight ? pn->val.prop.weight :
 				blkcg->weight;
 		blkio_update_group_weight(blkg, weight);
+		blkio_update_group_class(blkg, pn->val.prop.class);
 		break;
 	case BLKIO_POLICY_THROTL:
 		switch(pn->fileid) {
 		case BLKIO_THROTL_read_bps_device:
 		case BLKIO_THROTL_write_bps_device:
-			bps = pn->val.bps ? pn->val.bps : (-1);
+			bps = pn->val.throtl.bps ? pn->val.throtl.bps : (-1);
 			blkio_update_group_bps(blkg, bps, pn->fileid);
 			break;
 		case BLKIO_THROTL_read_iops_device:
 		case BLKIO_THROTL_write_iops_device:
-			iops = pn->val.iops ? pn->val.iops : (-1);
+			iops = pn->val.throtl.iops ? pn->val.throtl.iops : (-1);
 			blkio_update_group_iops(blkg, iops, pn->fileid);
 			break;
 		}
@@ -984,21 +1029,28 @@ blkio_print_policy_node(struct seq_file *m, struct blkio_policy_node *pn)
 {
 	switch(pn->plid) {
 		case BLKIO_POLICY_PROP:
-			if (pn->fileid == BLKIO_PROP_weight_device)
+			switch(pn->fileid) {
+			case BLKIO_PROP_weight_device:
+				seq_printf(m, "%u:%u\t%u\n", MAJOR(pn->dev),
+					   MINOR(pn->dev), pn->val.prop.weight);
+				break;
+			case BLKIO_PROP_class:
 				seq_printf(m, "%u:%u\t%u\n", MAJOR(pn->dev),
-					MINOR(pn->dev), pn->val.weight);
+					   MINOR(pn->dev), pn->val.prop.class);
+				break;
+			}
 			break;
 		case BLKIO_POLICY_THROTL:
 			switch(pn->fileid) {
 			case BLKIO_THROTL_read_bps_device:
 			case BLKIO_THROTL_write_bps_device:
 				seq_printf(m, "%u:%u\t%llu\n", MAJOR(pn->dev),
-					MINOR(pn->dev), pn->val.bps);
+					   MINOR(pn->dev), pn->val.throtl.bps);
 				break;
 			case BLKIO_THROTL_read_iops_device:
 			case BLKIO_THROTL_write_iops_device:
 				seq_printf(m, "%u:%u\t%u\n", MAJOR(pn->dev),
-					MINOR(pn->dev), pn->val.iops);
+					   MINOR(pn->dev), pn->val.throtl.iops);
 				break;
 			}
 			break;
@@ -1037,6 +1089,7 @@ static int blkiocg_file_read(struct cgroup *cgrp, struct cftype *cft,
 	case BLKIO_POLICY_PROP:
 		switch(name) {
 		case BLKIO_PROP_weight_device:
+		case BLKIO_PROP_class:
 			blkio_read_policy_node_files(cft, blkcg, m);
 			return 0;
 		default:
@@ -1243,6 +1296,14 @@ struct cftype blkio_files[] = {
 		.max_write_len = 256,
 	},
 	{
+		.name = "class",
+		.private = BLKIOFILE_PRIVATE(BLKIO_POLICY_PROP,
+				BLKIO_PROP_class),
+		.read_seq_string = blkiocg_file_read,
+		.write_string = blkiocg_file_write,
+		.max_write_len = 256,
+	},
+	{
 		.name = "weight",
 		.private = BLKIOFILE_PRIVATE(BLKIO_POLICY_PROP,
 				BLKIO_PROP_weight),
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index ea4861b..0e9cd32 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -73,10 +73,18 @@ enum blkg_state_flags {
 	BLKG_empty,
 };
 
+enum blkcg_class {
+	BLKIO_RT_CLASS = 1,
+	BLKIO_BE_CLASS = 2,
+};
+
+#define BLKIO_DEFAULT_CLASS BLKIO_BE_CLASS
+
 /* cgroup files owned by proportional weight policy */
 enum blkcg_file_name_prop {
 	BLKIO_PROP_weight = 1,
 	BLKIO_PROP_weight_device,
+	BLKIO_PROP_class,
 	BLKIO_PROP_io_service_bytes,
 	BLKIO_PROP_io_serviced,
 	BLKIO_PROP_time,
@@ -157,6 +165,21 @@ struct blkio_group {
 	struct blkio_group_stats stats;
 };
 
+struct proportion_policy {
+	unsigned int weight;
+	unsigned int class;
+};
+
+union throttle_policy {
+	/*
+	 * Rate read/write in terms of bytes per second.
+	 * Whether this rate represents read or write is determined
+	 * by file type "fileid".
+	 */
+	u64 bps;
+	unsigned int iops;
+};
+
 struct blkio_policy_node {
 	struct list_head node;
 	dev_t dev;
@@ -166,14 +189,8 @@ struct blkio_policy_node {
 	int fileid;
 
 	union {
-		unsigned int weight;
-		/*
-		 * Rate read/write in terms of byptes per second
-		 * Whether this rate represents read or write is determined
-		 * by file type "fileid".
-		 */
-		u64 bps;
-		unsigned int iops;
+		struct proportion_policy prop;
+		union throttle_policy throtl;
 	} val;
 };
 
@@ -192,6 +209,8 @@ typedef void (blkio_unlink_group_fn) (void *key, struct blkio_group *blkg);
 
 typedef void (blkio_update_group_weight_fn) (void *key,
 			struct blkio_group *blkg, unsigned int weight);
+typedef void (blkio_update_group_class_fn) (void *key,
+			struct blkio_group *blkg, unsigned int class);
 typedef void (blkio_update_group_read_bps_fn) (void * key,
 			struct blkio_group *blkg, u64 read_bps);
 typedef void (blkio_update_group_write_bps_fn) (void *key,
@@ -204,6 +223,7 @@ typedef void (blkio_update_group_write_iops_fn) (void *key,
 struct blkio_policy_ops {
 	blkio_unlink_group_fn *blkio_unlink_group_fn;
 	blkio_update_group_weight_fn *blkio_update_group_weight_fn;
+	blkio_update_group_class_fn *blkio_update_group_class_fn;
 	blkio_update_group_read_bps_fn *blkio_update_group_read_bps_fn;
 	blkio_update_group_write_bps_fn *blkio_update_group_write_bps_fn;
 	blkio_update_group_read_iops_fn *blkio_update_group_read_iops_fn;
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index ea83a4f..55e78b7 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -179,6 +179,7 @@ struct cfq_group {
 	/* group service_tree key */
 	u64 vdisktime;
 	unsigned int weight;
+	enum blkcg_class class;
 
 	/* number of cfqq currently on this group */
 	int nr_cfqq;
@@ -988,6 +989,12 @@ void cfq_update_blkio_group_weight(void *key, struct blkio_group *blkg,
 	cfqg_of_blkg(blkg)->weight = weight;
 }
 
+void cfq_update_blkio_group_class(void *key, struct blkio_group *blkg,
+					unsigned int class)
+{
+	cfqg_of_blkg(blkg)->class = class;
+}
+
 static struct cfq_group *
 cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
 {
@@ -4115,8 +4122,9 @@ static struct elevator_type iosched_cfq = {
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
 static struct blkio_policy_type blkio_policy_cfq = {
 	.ops = {
-		.blkio_unlink_group_fn =	cfq_unlink_blkio_group,
-		.blkio_update_group_weight_fn =	cfq_update_blkio_group_weight,
+		.blkio_unlink_group_fn =	cfq_unlink_blkio_group,
+		.blkio_update_group_weight_fn =	cfq_update_blkio_group_weight,
+		.blkio_update_group_class_fn =	cfq_update_blkio_group_class,
 	},
 	.plid = BLKIO_POLICY_PROP,
 };
-- 
1.7.3.1



* [PATCH 2/3] cfq-iosched: Fair cross-group preemption (implementation)
  2011-03-22  1:10 [PATCH 0/3] cfq-iosched: Fair cross-group preemption Chad Talbott
  2011-03-22  1:10 ` [PATCH 1/3] cfq-iosched: Fair cross-group preemption (interface and documentation) Chad Talbott
@ 2011-03-22  1:10 ` Chad Talbott
  2011-03-22  1:10 ` [PATCH 3/3] cfq-iosched: Fair cross-group preemption (stats) Chad Talbott
  2011-03-22 15:09 ` [PATCH 0/3] cfq-iosched: Fair cross-group preemption Vivek Goyal
  3 siblings, 0 replies; 22+ messages in thread
From: Chad Talbott @ 2011-03-22  1:10 UTC (permalink / raw)
  To: jaxboe, vgoyal; +Cc: linux-kernel, mrubin, teravest, Chad Talbott

Add a new function cfq_group_should_preempt() which uses the new
service classes to decide if a task in one cgroup should preempt a
task in another.  We allow this only if the preempting group's
vdisktime is "behind" that of the preempted group.

Signed-off-by: Chad Talbott <ctalbott@google.com>
---
 block/cfq-iosched.c |   41 ++++++++++++++++++++++++++++++++++++++---
 1 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 55e78b7..dfcce80 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1157,6 +1157,32 @@ void cfq_unlink_blkio_group(void *key, struct blkio_group *blkg)
 	spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
 }
 
+/*
+ * Return true if new_cfqg should preempt cfqg.  A return value of
+ * false means "don't care."  In that case, cfq has other heuristics
+ * to decide.
+ */
+static bool
+cfq_group_should_preempt(struct cfq_queue *new_cfqq, struct cfq_queue *cfqq,
+			 struct request *rq)
+{
+	struct cfq_group *cfqg = cfqq->cfqg;
+	struct cfq_group *new_cfqg = new_cfqq->cfqg;
+	u64 grace_period;
+
+	/* in-group preemption is handled elsewhere */
+	if (new_cfqg == cfqg)
+		return false;
+
+	if (!(new_cfqg->class == BLKIO_RT_CLASS &&
+	      cfqg->class == BLKIO_BE_CLASS))
+		return false;
+
+	grace_period = cfq_scale_slice(cfqq->allocated_slice, cfqg);
+	return time_before64(new_cfqg->vdisktime,
+			     cfqg->vdisktime + grace_period);
+}
+
 #else /* GROUP_IOSCHED */
 static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
 {
@@ -1176,6 +1202,12 @@ cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg) {
 static void cfq_release_cfq_groups(struct cfq_data *cfqd) {}
 static inline void cfq_put_cfqg(struct cfq_group *cfqg) {}
 
+static bool
+cfq_group_should_preempt(struct cfq_queue *new_cfqq, struct cfq_queue *cfqq,
+			 struct request *rq)
+{
+	return false;
+}
 #endif /* GROUP_IOSCHED */
 
 /*
@@ -3234,6 +3266,12 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 	if (!cfqq)
 		return false;
 
+	if (cfq_group_should_preempt(new_cfqq, cfqq, rq))
+		return true;
+
+	if (new_cfqq->cfqg != cfqq->cfqg)
+		return false;
+
 	if (cfq_class_idle(new_cfqq))
 		return false;
 
@@ -3253,9 +3291,6 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
 		return true;
 
-	if (new_cfqq->cfqg != cfqq->cfqg)
-		return false;
-
 	if (cfq_slice_used(cfqq))
 		return true;
 
-- 
1.7.3.1



* [PATCH 3/3] cfq-iosched: Fair cross-group preemption (stats)
  2011-03-22  1:10 [PATCH 0/3] cfq-iosched: Fair cross-group preemption Chad Talbott
  2011-03-22  1:10 ` [PATCH 1/3] cfq-iosched: Fair cross-group preemption (interface and documentation) Chad Talbott
  2011-03-22  1:10 ` [PATCH 2/3] cfq-iosched: Fair cross-group preemption (implementation) Chad Talbott
@ 2011-03-22  1:10 ` Chad Talbott
  2011-03-22 15:09 ` [PATCH 0/3] cfq-iosched: Fair cross-group preemption Vivek Goyal
  3 siblings, 0 replies; 22+ messages in thread
From: Chad Talbott @ 2011-03-22  1:10 UTC (permalink / raw)
  To: jaxboe, vgoyal; +Cc: linux-kernel, mrubin, teravest, Chad Talbott

Add three counters to monitor the cross-group preemption behavior.

preempt_count:
  counts the number of times this group preempted another

preempt_active:
  counts the number of times this group issued an IO but
  was already the active group

preempt_throttle:
  counts the number of times this group would have preempted another
  had it not already used up its assigned weight
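
After running the demo from the cover letter, these show up as
per-device cgroup files, e.g. (a sketch; the files are compiled in
only with CONFIG_DEBUG_BLK_CGROUP):

  grep 8:16 /dev/cgroup/c2/blkio.preempt_count
  grep 8:16 /dev/cgroup/c2/blkio.preempt_active
  grep 8:16 /dev/cgroup/c2/blkio.preempt_throttle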

Signed-off-by: Chad Talbott <ctalbott@google.com>
---
 block/blk-cgroup.c  |   46 ++++++++++++++++++++++++++++++++++++++++++++++
 block/blk-cgroup.h  |   18 ++++++++++++++++++
 block/cfq-iosched.c |   20 +++++++++++++++++---
 3 files changed, 81 insertions(+), 3 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index f2c553e..41ee61c 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -347,6 +347,25 @@ void blkiocg_set_start_empty_time(struct blkio_group *blkg)
 }
 EXPORT_SYMBOL_GPL(blkiocg_set_start_empty_time);
 
+void blkiocg_update_preempt_stats(struct blkio_group *blkg,
+			enum stat_type which, bool direction, bool sync)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&blkg->stats_lock, flags);
+	switch (which) {
+	case BLKIO_STAT_PREEMPT_COUNT:
+	case BLKIO_STAT_PREEMPT_ACTIVE:
+	case BLKIO_STAT_PREEMPT_THROTTLE:
+		blkio_add_stat(blkg->stats.stat_arr[which], 1, direction, sync);
+		break;
+	default:
+		BUG();
+	}
+	spin_unlock_irqrestore(&blkg->stats_lock, flags);
+}
+EXPORT_SYMBOL_GPL(blkiocg_update_preempt_stats);
+
 void blkiocg_update_dequeue_stats(struct blkio_group *blkg,
 			unsigned long dequeue)
 {
@@ -1193,6 +1212,15 @@ static int blkiocg_file_read_map(struct cgroup *cgrp, struct cftype *cft,
 		case BLKIO_PROP_empty_time:
 			return blkio_read_blkg_stats(blkcg, cft, cb,
 						BLKIO_STAT_EMPTY_TIME, 0);
+		case BLKIO_PROP_preempt_count:
+			return blkio_read_blkg_stats(blkcg, cft, cb,
+						BLKIO_STAT_PREEMPT_COUNT, 0);
+		case BLKIO_PROP_preempt_active:
+			return blkio_read_blkg_stats(blkcg, cft, cb,
+						BLKIO_STAT_PREEMPT_ACTIVE, 0);
+		case BLKIO_PROP_preempt_throttle:
+			return blkio_read_blkg_stats(blkcg, cft, cb,
+						BLKIO_STAT_PREEMPT_THROTTLE, 0);
 #endif
 		default:
 			BUG();
@@ -1443,6 +1471,24 @@ struct cftype blkio_files[] = {
 				BLKIO_PROP_dequeue),
 		.read_map = blkiocg_file_read_map,
 	},
+	{
+		.name = "preempt_count",
+		.private = BLKIOFILE_PRIVATE(BLKIO_POLICY_PROP,
+				BLKIO_PROP_preempt_count),
+		.read_map = blkiocg_file_read_map,
+	},
+	{
+		.name = "preempt_active",
+		.private = BLKIOFILE_PRIVATE(BLKIO_POLICY_PROP,
+				BLKIO_PROP_preempt_active),
+		.read_map = blkiocg_file_read_map,
+	},
+	{
+		.name = "preempt_throttle",
+		.private = BLKIOFILE_PRIVATE(BLKIO_POLICY_PROP,
+				BLKIO_PROP_preempt_throttle),
+		.read_map = blkiocg_file_read_map,
+	},
 #endif
 };
 
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 0e9cd32..d2cbdcd 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -44,6 +44,14 @@ enum stat_type {
 	BLKIO_STAT_WAIT_TIME,
 	/* Number of IOs merged */
 	BLKIO_STAT_MERGED,
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+	/* Number of IOs Added to active group */
+	BLKIO_STAT_PREEMPT_ACTIVE,
+	/* Number of cross-group preemptions */
+	BLKIO_STAT_PREEMPT_COUNT,
+	/* Number of cross-group preemptions disallowed by weight */
+	BLKIO_STAT_PREEMPT_THROTTLE,
+#endif
 	/* Number of IOs queued up */
 	BLKIO_STAT_QUEUED,
 	/* All the single valued stats go below this */
@@ -98,6 +106,9 @@ enum blkcg_file_name_prop {
 	BLKIO_PROP_idle_time,
 	BLKIO_PROP_empty_time,
 	BLKIO_PROP_dequeue,
+	BLKIO_PROP_preempt_count,
+	BLKIO_PROP_preempt_active,
+	BLKIO_PROP_preempt_throttle,
 };
 
 /* cgroup files owned by throttle policy */
@@ -271,6 +282,10 @@ void blkiocg_update_dequeue_stats(struct blkio_group *blkg,
 void blkiocg_update_set_idle_time_stats(struct blkio_group *blkg);
 void blkiocg_update_idle_time_stats(struct blkio_group *blkg);
 void blkiocg_set_start_empty_time(struct blkio_group *blkg);
+void blkiocg_update_preempt_stats(struct blkio_group *blkg,
+				enum stat_type which,
+				bool direction,
+				bool sync);
 
 #define BLKG_FLAG_FNS(name)						\
 static inline void blkio_mark_blkg_##name(				\
@@ -301,6 +316,9 @@ static inline void blkiocg_update_set_idle_time_stats(struct blkio_group *blkg)
 {}
 static inline void blkiocg_update_idle_time_stats(struct blkio_group *blkg) {}
 static inline void blkiocg_set_start_empty_time(struct blkio_group *blkg) {}
+static inline void blkiocg_update_preempt_stats(
+		struct blkio_group *blkg, enum stat_type which, bool direction,
+		bool sync) {}
 #endif
 
 #if defined(CONFIG_BLK_CGROUP) || defined(CONFIG_BLK_CGROUP_MODULE)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index dfcce80..0d55be3 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1171,16 +1171,30 @@ cfq_group_should_preempt(struct cfq_queue *new_cfqq, struct cfq_queue *cfqq,
 	u64 grace_period;
 
 	/* in-group preemption is handled elsewhere */
-	if (new_cfqg == cfqg)
+	if (new_cfqg == cfqg) {
+		blkiocg_update_preempt_stats(&new_cfqg->blkg,
+			BLKIO_STAT_PREEMPT_ACTIVE,
+			rq_data_dir(rq), rq_is_sync(rq));
 		return false;
+	}
 
 	if (!(new_cfqg->class == BLKIO_RT_CLASS &&
 	      cfqg->class == BLKIO_BE_CLASS))
 		return false;
 
 	grace_period = cfq_scale_slice(cfqq->allocated_slice, cfqg);
-	return time_before64(new_cfqg->vdisktime,
-			     cfqg->vdisktime + grace_period);
+	if (time_before64(new_cfqg->vdisktime,
+			  cfqg->vdisktime + grace_period)) {
+		blkiocg_update_preempt_stats(&new_cfqg->blkg,
+			BLKIO_STAT_PREEMPT_COUNT,
+			rq_data_dir(rq), rq_is_sync(rq));
+		return true;
+	}
+
+	blkiocg_update_preempt_stats(&new_cfqg->blkg,
+		BLKIO_STAT_PREEMPT_THROTTLE,
+		rq_data_dir(rq), rq_is_sync(rq));
+	return false;
 }
 
 #else /* GROUP_IOSCHED */
-- 
1.7.3.1



* Re: [PATCH 1/3] cfq-iosched: Fair cross-group preemption (interface and documentation)
  2011-03-22  1:10 ` [PATCH 1/3] cfq-iosched: Fair cross-group preemption (interface and documentation) Chad Talbott
@ 2011-03-22 10:03   ` Gui Jianfeng
  2011-03-22 18:07     ` Chad Talbott
  0 siblings, 1 reply; 22+ messages in thread
From: Gui Jianfeng @ 2011-03-22 10:03 UTC (permalink / raw)
  To: Chad Talbott; +Cc: jaxboe, vgoyal, linux-kernel, mrubin, teravest

Chad Talbott wrote:
> blkio.class is a new interface that allows you to specify a cgroup as
> requiring low-latency service on a specific block device.  This patch
> includes only the interface and documentation.
> 
> Signed-off-by: Chad Talbott <ctalbott@google.com>
> ---
>  Documentation/cgroups/blkio-controller.txt |   16 ++++
>  block/blk-cgroup.c                         |  121 +++++++++++++++++++++-------
>  block/blk-cgroup.h                         |   36 +++++++--
>  block/cfq-iosched.c                        |   12 +++-
>  4 files changed, 145 insertions(+), 40 deletions(-)
> 
> diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
> index 4ed7b5c..7acf9d2 100644
> --- a/Documentation/cgroups/blkio-controller.txt
> +++ b/Documentation/cgroups/blkio-controller.txt
> @@ -169,6 +169,22 @@ Proportional weight policy files
>  	  dev     weight
>  	  8:16    300
>  
> +- blkio.class
> +	- Specifies the per-cgroup, per-device service class.  There

Hi Chad,

Do we need to implement a global blkio.class interface as well as
a device-specific one, something like "blkio.class_device"?  That way
a user doesn't need to check every device number if he wants to
specify a global value.
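
That would mirror the existing weight/weight_device split, e.g. (a
sketch of the proposed interface):

  echo 1 > blkio.class                  # cgroup-wide default: real-time
  echo '8:16 2' > blkio.class_device    # per-device override: best-effort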

Thanks,
Gui


* Re: [PATCH 0/3] cfq-iosched: Fair cross-group preemption
  2011-03-22  1:10 [PATCH 0/3] cfq-iosched: Fair cross-group preemption Chad Talbott
                   ` (2 preceding siblings ...)
  2011-03-22  1:10 ` [PATCH 3/3] cfq-iosched: Fair cross-group preemption (stats) Chad Talbott
@ 2011-03-22 15:09 ` Vivek Goyal
  2011-03-22 17:39   ` Chad Talbott
  3 siblings, 1 reply; 22+ messages in thread
From: Vivek Goyal @ 2011-03-22 15:09 UTC (permalink / raw)
  To: Chad Talbott; +Cc: jaxboe, linux-kernel, mrubin, teravest

On Mon, Mar 21, 2011 at 06:10:42PM -0700, Chad Talbott wrote:
> This patchset introduces fair cross-group preemption.  Right now, we
> don't have strict isolation between processes in different cgroups.
> For example: currently an RT ioprio thread in one group can preempt a
> BE ioprio thread in another group.  We would like to have an
> application specify the relative priorities of its threads, but still
> allow strict isolation between applications.
> 
> With this patch series, we can configure an entire cgroup as needing
> low-latency service.  Then that group will be able to preempt another
> group.  To prevent a runaway application from starving other
> applications, we allow this preemption only until it has exceeded its
> fair share (as specified by its relative weight).  So a rate-limited,
> but latency sensitive application (like streaming audio or video) can
> get front-of-the-line service without fear of it hogging a whole
> disk's IO.

Chad,

Why not simply implement RT class groups and always allow an RT
group to preempt a BE class?  Same thing we do for cfq queues.  I will
not worry too much about a runaway application consuming all the
bandwidth.  If that's a concern we could use the blkio controller to
limit the IO rate of a latency sensitive application to make sure it
does not starve BE applications.
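
For example (a sketch; the group name and rate are illustrative):

  # cap the latency sensitive group at ~20MB/s of reads on 8:16
  echo '8:16 20971520' > /dev/cgroup/rt/blkio.throttle.read_bps_device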

If RT starving BE is an issue, then it is an issue with plain cfq queue
also. First we shall have to fix it there.

This definition that a latency sensitive task gets prioritized only
till it is consuming its fair share, and if the task starts using more
than its fair share then CFQ automatically stops prioritizing it,
sounds a little odd to me. If you are looking for predictability, then
we lost it. We shall have to know very well that a task is not eating
more than its fair share before we can guarantee any kind of latencies
to that task. And if we know that a task is not hogging the disk, there
is anyway no risk of it starving other groups/tasks completely.

Or we could implement something where once in a while we do allow BE
class tasks to dispatch some IO so that RT tasks do not completely
starve BE class tasks. I guess this will be similar to what the cpu
scheduler was trying, where an RT task got 95% of the cpu and 5% was
reserved for the admin to kill RT tasks if some RT task goes wild and
locks up the cpus.

Thanks
Vivek


* Re: [PATCH 0/3] cfq-iosched: Fair cross-group preemption
  2011-03-22 15:09 ` [PATCH 0/3] cfq-iosched: Fair cross-group preemption Vivek Goyal
@ 2011-03-22 17:39   ` Chad Talbott
  2011-03-22 18:12     ` Vivek Goyal
  0 siblings, 1 reply; 22+ messages in thread
From: Chad Talbott @ 2011-03-22 17:39 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: jaxboe, linux-kernel, mrubin, teravest

On Tue, Mar 22, 2011 at 8:09 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> Why not simply implement RT class groups and always allow an RT
> group to preempt a BE class?  Same thing we do for cfq queues.  I will
> not worry too much about a runaway application consuming all the
> bandwidth.  If that's a concern we could use the blkio controller to
> limit the IO rate of a latency sensitive application to make sure it
> does not starve BE applications.

That is not quite the same semantics.  This limited preemption patch
is still work-conserving.  If the RT task is the only task on the
system with IO, it will be able to use all available disk time.

> If RT starving BE is an issue, then it is an issue with plain cfq queue
> also. First we shall have to fix it there.
>
> > This definition that a latency sensitive task gets prioritized only
> > till it is consuming its fair share, and if the task starts using more
> > than its fair share then CFQ automatically stops prioritizing it,
> > sounds a little odd to me. If you are looking for predictability, then
> > we lost it. We shall have to know very well that a task is not eating
> > more than its fair share before we can guarantee any kind of latencies
> > to that task. And if we know that a task is not hogging the disk, there
> > is anyway no risk of it starving other groups/tasks completely.

In a shared environment, we have to be a little bit defensive.  We
hope that a latency sensitive task is well characterized and won't
exceed its share of the disk, and that we haven't over-committed the
disk.  If the app does do more IO than expected, then we'd like it
to bear the burden.  We have a choice of two outcomes.  A single job
sometimes failing to achieve low disk latency when it's very busy.  Or
all jobs on a disk sometimes being very slow when another (unrelated)
job is very busy.  The first is easier to understand and debug.

Chad


* Re: [PATCH 1/3] cfq-iosched: Fair cross-group preemption (interface and documentation)
  2011-03-22 10:03   ` Gui Jianfeng
@ 2011-03-22 18:07     ` Chad Talbott
  0 siblings, 0 replies; 22+ messages in thread
From: Chad Talbott @ 2011-03-22 18:07 UTC (permalink / raw)
  To: Gui Jianfeng; +Cc: jaxboe, vgoyal, linux-kernel, mrubin, teravest

On Tue, Mar 22, 2011 at 3:03 AM, Gui Jianfeng
<guijianfeng@cn.fujitsu.com> wrote:
> Do we need to implement a global blkio.class interface as well as
> a device-specific one, something like "blkio.class_device"?  That way
> a user doesn't need to check every device number if he wants to
> specify a global value.

I think you're right.  It's more symmetrical with the weight
interface, so it will be less confusing.  I'll add it.

Chad


* Re: [PATCH 0/3] cfq-iosched: Fair cross-group preemption
  2011-03-22 17:39   ` Chad Talbott
@ 2011-03-22 18:12     ` Vivek Goyal
  2011-03-22 23:46       ` Chad Talbott
  2011-03-23 20:10       ` Chad Talbott
  0 siblings, 2 replies; 22+ messages in thread
From: Vivek Goyal @ 2011-03-22 18:12 UTC (permalink / raw)
  To: Chad Talbott; +Cc: jaxboe, linux-kernel, mrubin, teravest

On Tue, Mar 22, 2011 at 10:39:36AM -0700, Chad Talbott wrote:
> On Tue, Mar 22, 2011 at 8:09 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > Why not simply implement RT class groups and always allow an RT
> > group to preempt a BE class?  Same thing we do for cfq queues.  I will
> > not worry too much about a runaway application consuming all the
> > bandwidth.  If that's a concern we could use the blkio controller to
> > limit the IO rate of a latency sensitive application to make sure it
> > does not starve BE applications.
> 
> That is not quite the same semantics.  This limited preemption patch
> is still work-conserving.  If the RT task is the only task on the
> system with IO, it will be able to use all available disk time.
> 

It is not the same semantics, but it feels like too much special
casing for a single use case.

You are using the generic notion of an RT thread (which in general
means that it gets all the cpu or all the disk ahead of a BE task).
But you have changed the definition of RT for this special use case.
And now the group RT definition is different from the queue RT
definition.

Why not have a similar mechanism for the cpu scheduler also, then?
This application should first be able to get cpu bandwidth in the
same predictable manner before it gets the disk bandwidth.

And I think your generation number patch should address this issue to
a great extent, shouldn't it?  If a latency sensitive task is not using
its fair quota, it will get a lower vdisktime and get to dispatch soon.

If that is not soon enough, then we could operate with a reduced base
slice length so that we allocate smaller slices to groups and get
better IO latencies at the cost of total throughput.
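
For example, the base slice could be shrunk with the existing cfq
sysfs tunable (value illustrative):

  echo 20 > /sys/block/sdb/queue/iosched/slice_sync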

> > If RT starving BE is an issue, then it is an issue with plain cfq queue
> > also. First we shall have to fix it there.
> >
> > This definition that a latency sensitive task gets prioritized only
> > till it is consuming its fair share, and if the task starts using more
> > than its fair share then CFQ automatically stops prioritizing it,
> > sounds a little odd to me. If you are looking for predictability, then
> > we lost it. We shall have to know very well that a task is not eating
> > more than its fair share before we can guarantee any kind of latencies
> > to that task. And if we know that a task is not hogging the disk, there
> > is anyway no risk of it starving other groups/tasks completely.
> 
> In a shared environment, we have to be a little bit defensive.  We
> hope that a latency sensitive task is well characterized and won't
> exceed its share of the disk, and that we haven't over-committed the
> disk.  If the app does do more IO than expected, then we'd like it
> to bear the burden.  We have a choice of two outcomes.  A single job
> sometimes failing to achieve low disk latency when it's very busy.  Or
> all jobs on a disk sometimes being very slow when another (unrelated)
> job is very busy.  The first is easier to understand and debug.

To me, you are trying to come up with a new scheduling class which is
not RT, and you are trying to overload the meaning of RT for your use
case; that's the issue I have.

Coming up with a new scheduling class is also not desirable, as that
will demand another service tree and we already have too many.  Also,
it should probably be done for tasks and not just groups, otherwise
extending this concept to a hierarchical setup will get complicated.
Queues and groups will just not gel well.

Frankly speaking, the problem you are having should be solved by your
generation number patch and by having smaller base slices.

Or you could put latency sensitive applications in an RT class and
then throttle them using the blkio controller.  That way you get good
latencies and you don't starve other tasks.

But I don't think overloading the meaning of RT for this specific use
case is a good idea.

Thanks
Vivek


* Re: [PATCH 0/3] cfq-iosched: Fair cross-group preemption
  2011-03-22 18:12     ` Vivek Goyal
@ 2011-03-22 23:46       ` Chad Talbott
  2011-03-23  1:43         ` Vivek Goyal
  2011-03-23 20:10       ` Chad Talbott
  1 sibling, 1 reply; 22+ messages in thread
From: Chad Talbott @ 2011-03-22 23:46 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: jaxboe, linux-kernel, mrubin, teravest

Would it be a better approach to avoid calling the feature real-time?
Or perhaps to use another service tree to allow strict preemption (as
is done at the task level today)?

I don't like the approach of tuning tiny slices and taking a
throughput hit all the time - even if there is no latency sensitive
group on the system.

Chad



* Re: [PATCH 0/3] cfq-iosched: Fair cross-group preemption
  2011-03-22 23:46       ` Chad Talbott
@ 2011-03-23  1:43         ` Vivek Goyal
  0 siblings, 0 replies; 22+ messages in thread
From: Vivek Goyal @ 2011-03-23  1:43 UTC (permalink / raw)
  To: Chad Talbott; +Cc: jaxboe, linux-kernel, mrubin, teravest

On Tue, Mar 22, 2011 at 04:46:59PM -0700, Chad Talbott wrote:
> Would it be a better approach to avoid calling the feature real-time?
> Or perhaps to use another service tree to allow strict preemption (as
> is done at the task level today)?
> 
> I don't like the approach of tuning tiny slices and taking a
> throughput hit all the time - even if there is no latency sensitive
> group on the system.

I would appreciate it if you could respond to all the points raised in
the previous mail; otherwise I will just end up asking them again.

- Why is your cfq generation number patch not sufficient? If you have a
  workload which consumes only a little bandwidth once in a while, it
  will be placed at the beginning of the service tree and you should
  experience smaller latencies.

- Why not mark the group as an RT group and, to handle the worst case where
  the user runs away with all the bandwidth, use blkio throttle to set an
  upper limit on bandwidth? (A rough sketch of that combination follows
  this list.)
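
  A minimal sketch of that combination, assuming a hypothetical RT marking
  via the blkio.class interface (the class value is illustrative); the
  throttle file is the existing blk-throttle one, and the 10MB/s cap is
  only an example:

  cd /dev/cgroup
  mkdir lowlat
  echo '8:16 1' > lowlat/blkio.class                            # hypothetical: RT on 8:16
  echo '8:16 10485760' > lowlat/blkio.throttle.read_bps_device  # cap reads at 10MB/s
  echo $$ > lowlat/tasks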

Coming up with another prio class or service tree just makes things
even more complicated for this specialized case. In fact the notion of
fairness or disk share exists only for the BE class, so coming up with
another class will not help anyway.

If you come up with another service tree, then a big question arises:
how do you decide fairness between the two service trees?

IMHO, this is too much special-casing for a particular workload and will
really make CFQ even more complex.

Thanks
Vivek

> 
> Chad
> 
> On Tue, Mar 22, 2011 at 11:12 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Tue, Mar 22, 2011 at 10:39:36AM -0700, Chad Talbott wrote:
> >> On Tue, Mar 22, 2011 at 8:09 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> >> > Why not simply implement RT class groups and always allow an RT
> >> > group to preempt a BE class? Same thing we do for cfq queues. I will
> >> > not worry too much about a runaway application consuming all the
> >> > bandwidth. If that's a concern we could use the blkio controller to limit
> >> > the IO rate of a latency sensitive application to make sure it does
> >> > not starve BE applications.
> >>
> >> That is not quite the same semantics.  This limited preemption patch
> >> is still work-conserving.  If the RT task is the only task on the
> >> system with IO, it will be able to use all available disk time.
> >>
> >
> > It is not the same semantics, but it feels like too much special
> > casing for a single use case.
> >
> > You are using the generic notion of an RT thread (which in general means
> > that it gets all the cpu or all the disk ahead of a BE task). But you have
> > changed the definition of RT for this special use case. And also now
> > the group RT definition is different from the queue RT definition.
> >
> > Why not have a similar mechanism for the cpu scheduler also then? This
> > application should first be able to get cpu bandwidth in the same predictable
> > manner before it gets the disk bandwidth.
> >
> > And I think your generation number patch should address this issue to
> > a great extent, shouldn't it? If a latency sensitive task is not using
> > its fair quota, it will get a lower vdisktime and get to dispatch soon?
> >
> > If that soon is not enough, then we could operate with a reduced base slice
> > length so that we allocate smaller slices to groups and get better IO
> > latencies at the cost of total throughput.
> >
> >> > If RT starving BE is an issue, then it is an issue with plain cfq queues
> >> > also. First we shall have to fix it there.
> >> >
> >> > This definition, that a latency sensitive task gets prioritized only
> >> > till it is consuming its fair share and that if the task starts using more
> >> > than its fair share then CFQ automatically stops prioritizing it, sounds a
> >> > little odd to me. If you are looking for predictability, then we lost it. We
> >> > shall have to know very well that the task is not eating more than its
> >> > fair share before we can guarantee any kind of latencies to that task. And
> >> > if we know that the task is not hogging the disk, there is anyway no risk
> >> > of it starving other groups/tasks completely.
> >>
> >> In a shared environment, we have to be a little bit defensive.  We
> >> hope that a latency sensitive task is well characterized and won't
> >> exceed its share of the disk, and that we haven't over-committed the
> >> disk.  If the app does do more IO than expected, then we'd like them
> >> to bear the burden.  We have a choice of two outcomes.  A single job
> >> sometimes failing to achieve low disk latency when it's very busy.  Or
> >> all jobs on a disk sometimes being very slow when another (unrelated)
> >> job is very busy.  The first is easier to understand and debug.
> >
> > To me you are trying to come up with a new scheduling class which is
> > not RT, and you are overloading the meaning of RT for your use
> > case; that's the issue I have.
> >
> > Coming up with a new scheduling class is also not desirable, as that
> > will demand another service tree and we already have too many. Also
> > it should probably be done for tasks and not just groups; otherwise
> > extending this concept to a hierarchical setup will get complicated. Queues
> > and groups will just not gel well.
> >
> > Frankly speaking, the problem you are having should be solved by your
> > generation number patch and by having smaller base slices.
> >
> > Or you could put latency sensitive applications in an RT class and
> > then throttle them using the blkio controller. That way you get good
> > latencies while not starving other tasks.
> >
> > But I don't think overloading the meaning of RT for this specific use
> > case is a good idea.
> >
> > Thanks
> > Vivek
> >

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/3] cfq-iosched: Fair cross-group preemption
  2011-03-22 18:12     ` Vivek Goyal
  2011-03-22 23:46       ` Chad Talbott
@ 2011-03-23 20:10       ` Chad Talbott
  2011-03-23 20:41         ` Vivek Goyal
  1 sibling, 1 reply; 22+ messages in thread
From: Chad Talbott @ 2011-03-23 20:10 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: jaxboe, linux-kernel, mrubin, teravest

On Tue, Mar 22, 2011 at 11:12 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Tue, Mar 22, 2011 at 10:39:36AM -0700, Chad Talbott wrote:
>> On Tue, Mar 22, 2011 at 8:09 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
>> > Why not simply implement RT class groups and always allow an RT
>> > group to preempt a BE class? Same thing we do for cfq queues. I will
>> > not worry too much about a runaway application consuming all the
>> > bandwidth. If that's a concern we could use the blkio controller to limit
>> > the IO rate of a latency sensitive application to make sure it does
>> > not starve BE applications.
>>
>> That is not quite the same semantics.  This limited preemption patch
>> is still work-conserving.  If the RT task is the only task on the
>> system with IO, it will be able to use all available disk time.
>>
>
> It is not the same semantics, but it feels like too much special
> casing for a single use case.

How are you counting use cases?

> You are using the generic notion of an RT thread (which in general means
> that it gets all the cpu or all the disk ahead of a BE task). But you have
> changed the definition of RT for this special use case. And also now
> the group RT definition is different from the queue RT definition.

Perhaps the name RT has too much of a "this group should be able to
starve all other groups" connotation.  Is there a better name?  Maybe
latency sensitive?

> Why not have a similar mechanism for the cpu scheduler also then? This
> application should first be able to get cpu bandwidth in the same predictable
> manner before it gets the disk bandwidth.

Perhaps this is a good idea.  If the CPU scheduler folks like it, I'll
be happy to support that.

> And I think your generation number patch should address this issue to
> a great extent, shouldn't it? If a latency sensitive task is not using
> its fair quota, it will get a lower vdisktime and get to dispatch soon?

It will get to dispatch as soon as the current task's timeslice
expires.  This could be a long time, depending on the number of other
tasks and groups on the system.  We'd like to provide a latency
guarantee that's dependent only on the behavior of the low-latency
application.

> If that soon is not enough, then we could operate with a reduced base slice
> length so that we allocate smaller slices to groups and get better IO
> latencies at the cost of total throughput.

With the limited preemption patch, I can still achieve good throughput
for many tasks, as long as the low-latency task is "quiet" or when
there is no low-latency task on the system.  If I use very small
timeslices, then I always pay a throughput price, even when there is
no low-latency task on the system or that task isn't doing any IO.

>> > If RT starving BE is an issue, then it is an issue with plain cfq queues
>> > also. First we shall have to fix it there.
>> >
>> > This definition, that a latency sensitive task gets prioritized only
>> > till it is consuming its fair share and that if the task starts using more
>> > than its fair share then CFQ automatically stops prioritizing it, sounds a
>> > little odd to me. If you are looking for predictability, then we lost it. We
>> > shall have to know very well that the task is not eating more than its
>> > fair share before we can guarantee any kind of latencies to that task. And
>> > if we know that the task is not hogging the disk, there is anyway no risk
>> > of it starving other groups/tasks completely.
>>
>> In a shared environment, we have to be a little bit defensive.  We
>> hope that a latency sensitive task is well characterized and won't
>> exceed its share of the disk, and that we haven't over-committed the
>> disk.  If the app does do more IO than expected, then we'd like them
>> to bear the burden.  We have a choice of two outcomes.  A single job
>> sometimes failing to achieve low disk latency when it's very busy.  Or
>> all jobs on a disk sometimes being very slow when another (unrelated)
>> job is very busy.  The first is easier to understand and debug.
>
> To me you are trying to come up with a new scheduling class which is
> not RT, and you are overloading the meaning of RT for your use
> case; that's the issue I have.

Can we come up with a better name?  I've used low-latency and
latency-sensitive in this email, and it's not too cumbersome.

> Coming up with a new scheduling class is also not desirable, as that
> will demand another service tree and we already have too many. Also
> it should probably be done for tasks and not just groups; otherwise
> extending this concept to a hierarchical setup will get complicated. Queues
> and groups will just not gel well.

Is there a plan to provide an RT class for groups in the hierarchical
future to allow full symmetry with RT tasks?

> Frankly speaking, the problem you are having should be solved by your
> generation number patch and by having smaller base slices.

Again, the throughput price is quite high to pay for all disks - even
when they have no latency sensitive groups, or those groups are not
issuing IO.

> Or you could put latency sensitive applications in an RT class and
> then throttle them using the blkio controller. That way you get good
> latencies while not starving other tasks.

This is closer to the semantics offered by this patchset, but requires
debugging the complex interactions between two scheduling policies to
understand the resulting behavior.

> But I don't think overloading the meaning of RT for this specific use
> case is a good idea.

I hear you loud and clear, but I disagree.

Chad

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/3] cfq-iosched: Fair cross-group preemption
  2011-03-23 20:10       ` Chad Talbott
@ 2011-03-23 20:41         ` Vivek Goyal
  2011-03-24 21:47           ` Chad Talbott
  0 siblings, 1 reply; 22+ messages in thread
From: Vivek Goyal @ 2011-03-23 20:41 UTC (permalink / raw)
  To: Chad Talbott; +Cc: jaxboe, linux-kernel, mrubin, teravest

On Wed, Mar 23, 2011 at 01:10:32PM -0700, Chad Talbott wrote:
> On Tue, Mar 22, 2011 at 11:12 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Tue, Mar 22, 2011 at 10:39:36AM -0700, Chad Talbott wrote:
> >> On Tue, Mar 22, 2011 at 8:09 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> >> > Why not simply implement RT class groups and always allow an RT
> >> > group to preempt a BE class? Same thing we do for cfq queues. I will
> >> > not worry too much about a runaway application consuming all the
> >> > bandwidth. If that's a concern we could use the blkio controller to limit
> >> > the IO rate of a latency sensitive application to make sure it does
> >> > not starve BE applications.
> >>
> >> That is not quite the same semantics.  This limited preemption patch
> >> is still work-conserving.  If the RT task is the only task on the
> >> system with IO, it will be able to use all available disk time.
> >>
> >
> > It is not the same semantics, but it feels like too much special
> > casing for a single use case.
> 
> How are you counting use cases?

This is the first time I have heard this requirement. So if 2-3 different
folks come up with a similar concern, then I have an idea that this
is a generic need.

You also have not explained what the workload is and what the
acceptable latencies are.

> 
> > You are using the generic notion of an RT thread (which in general means
> > that it gets all the cpu or all the disk ahead of a BE task). But you have
> > changed the definition of RT for this special use case. And also now
> > the group RT definition is different from the queue RT definition.
> 
> Perhaps the name RT has too much of a "this group should be able to
> starve all other groups" connotation.  Is there a better name?  Maybe
> latency sensitive?

I think what you are trying to achieve is that you want to define an
additional task and group property, say latency sensitive. This is a
third property apart from ioclass and ioprio. To me you still want
the task/group to be in the BE class so that it shares the disk in a
proportional weight manner, but this additional property will make sure
that the task can preempt non latency sensitive tasks/groups.

We can't do this additional property for groups alone, because once we
move to a hierarchical setup everything is an entity (be it task or queue),
and then we need to decide whether one entity can preempt another
entity or not. By not defining this property for tasks, a latency
sensitive group will always preempt a task on the same tree. (Maybe
that's what you want for your use case.) But it is still odd to add
additional properties only for groups and not tasks.

This is a new paradigm (at least to me). It introduces additional
complexity in an already complicated system. So it makes sense to make
sure that there is more than one user of this functionality.

> 
> > Why not have a similar mechanism for the cpu scheduler also then? This
> > application should first be able to get cpu bandwidth in the same predictable
> > manner before it gets the disk bandwidth.
> 
> Perhaps this is a good idea.  If the CPU scheduler folks like it, I'll
> be happy to support that.

> 
> > And I think your generation number patch should address this issue to
> > a great extent, shouldn't it? If a latency sensitive task is not using
> > its fair quota, it will get a lower vdisktime and get to dispatch soon?
> 
> It will get to dispatch as soon as the current task's timeslice
> expires.  This could be a long time, depending on the number of other
> tasks and groups on the system.  We'd like to provide a latency
> guarantee that's dependent only on the behavior of the low-latency
> application.

What are your latency requirements? I believe the maximum slice length can
be 180ms with default settings. You can change the base slice to 50 and that
will make the maximum slice length 90ms. For the default case it will be 50ms.
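
(For reference, the base slice here corresponds to cfq's slice_sync tunable;
a sketch of shrinking it on sdb, with 50 as the example value:)

echo 50 > /sys/block/sdb/queue/iosched/slice_sync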

So the question is: what workload is it which cannot tolerate these latencies?

> 
> > If that soon is not enough, then we could operate with a reduced base slice
> > length so that we allocate smaller slices to groups and get better IO
> > latencies at the cost of total throughput.
> 
> With the limited preemption patch, I can still achieve good throughput
> for many tasks, as long as the low-latency task is "quiet" or when
> there is no low-latency task on the system.  If I use very small
> timeslices, then I always pay a throughput price, even when there is
> no low-latency task on the system or that task isn't doing any IO.

Ok, that's fine. So within the BE class you are trying to define another
type of group that is "low latency". That's why I think this is a third
property apart from ioprio and ioclass.

> 
> >> > If RT starving BE is an issue, then it is an issue with plain cfq queues
> >> > also. First we shall have to fix it there.
> >> >
> >> > This definition, that a latency sensitive task gets prioritized only
> >> > till it is consuming its fair share and that if the task starts using more
> >> > than its fair share then CFQ automatically stops prioritizing it, sounds a
> >> > little odd to me. If you are looking for predictability, then we lost it. We
> >> > shall have to know very well that the task is not eating more than its
> >> > fair share before we can guarantee any kind of latencies to that task. And
> >> > if we know that the task is not hogging the disk, there is anyway no risk
> >> > of it starving other groups/tasks completely.
> >>
> >> In a shared environment, we have to be a little bit defensive.  We
> >> hope that a latency sensitive task is well characterized and won't
> >> exceed its share of the disk, and that we haven't over-committed the
> >> disk.  If the app does do more IO than expected, then we'd like them
> >> to bear the burden.  We have a choice of two outcomes.  A single job
> >> sometimes failing to achieve low disk latency when it's very busy.  Or
> >> all jobs on a disk sometimes being very slow when another (unrelated)
> >> job is very busy.  The first is easier to understand and debug.
> >
> > To me you are trying to come up with a new scheduling class which is
> > not RT, and you are overloading the meaning of RT for your use
> > case; that's the issue I have.
> 
> Can we come up with a better name?  I've used low-latency and
> latency-sensitive in this email, and it's not too cumbersome.
> 
> > Coming up with a new scheduling class is also not desirable, as that
> > will demand another service tree and we already have too many. Also
> > it should probably be done for tasks and not just groups; otherwise
> > extending this concept to a hierarchical setup will get complicated. Queues
> > and groups will just not gel well.
> 
> Is there a plan to provide an RT class for groups in the hierarchical
> future to allow full symmetry with RT tasks?
> 
> > Frankly speaking, the problem you are having should be solved by your
> > generation number patch and by having smaller base slices.
> 
> Again, the throughput price is quite high to pay for all disks - even
> when they have no latency sensitive groups, or those groups are not
> issuing IO.
> 
> > Or you could put latency sensitive applications in an RT class and
> > then throttle them using the blkio controller. That way you get good
> > latencies while not starving other tasks.
> 
> This is closer to the semantics offered by this patchset, but requires
> debugging the complex interactions between two scheduling policies to
> understand the resulting behavior.

Can you explain that a bit more? The throttling behavior is very clear:
a group is allowed to dispatch as long as it does not cross the limit;
otherwise the bio is put in a queue and submitted to the underlying device later.

So as long as the latency sensitive task is within its rate limit, it will get
the latency you want. The moment it tries to do a lot of IO, it will get
throttled and practically becomes an ordinary BE task. I believe that's
what your patchset does: latency sensitive gets priority only if it
is consuming its fair share of disk. The only difference here is that
the definition of fair share is absolute (specified in terms of bps or iops)
instead of being dynamic depending on how many groups are doing IO.
 
blktrace results show the throttle as well as cfq logs in the same file, so
correlating the two policies is really easy. So I really don't think that
understanding the resulting behavior is hard. I will be happy to be
proven wrong though.
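
(To illustrate, both logs can be captured in a single stream; a sketch,
assuming the device under test is sdb:)

blktrace -d /dev/sdb -o - | blkparse -i -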

> 
> > But I don't think overloading the meaning of RT for this specific use
> > case is a good idea.
> 
> I hear you loud and clear, but I disagree.

You disagree with what? That changing the definition of RT is fine? That
ioclass RT means one thing for tasks and another thing for groups, is that fine?

If we really end up doing it, I think we shall have to define an
additional group file, say blkio.preempt_fair_share. This will mean
that this is a BE group but with an additional property which allows it to
preempt an existing entity on the service tree as long as it does not exceed
its fair share. That way we don't have to define a new class or
come up with an additional service tree.
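
(Hypothetically, usage would mirror the other per-device blkio files; a
sketch only, since this file does not exist today:)

# hypothetical: let this group preempt on 8:16 while within its fair share
echo '8:16 1' > /dev/cgroup/grp/blkio.preempt_fair_share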

But I would prefer that you seriously consider implementing an RT group class
and rate limiting it with the throttling logic, because I believe it should solve
your issue. The only question would be what the upper limit should be, and I
think that will depend on the type of storage you are using and what your
workload is.

Also, if you can give a better example where this kind of latency matters,
it will help in understanding the problem better.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/3] cfq-iosched: Fair cross-group preemption
  2011-03-23 20:41         ` Vivek Goyal
@ 2011-03-24 21:47           ` Chad Talbott
  2011-03-25  5:43             ` Gui Jianfeng
  2011-03-25 21:32             ` Vivek Goyal
  0 siblings, 2 replies; 22+ messages in thread
From: Chad Talbott @ 2011-03-24 21:47 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: jaxboe, linux-kernel, mrubin, teravest

On Wed, Mar 23, 2011 at 1:41 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Wed, Mar 23, 2011 at 01:10:32PM -0700, Chad Talbott wrote:
>> On Tue, Mar 22, 2011 at 11:12 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
>> > On Tue, Mar 22, 2011 at 10:39:36AM -0700, Chad Talbott wrote:
>> >> On Tue, Mar 22, 2011 at 8:09 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
>> >> > Why not simply implement RT class groups and always allow an RT
>> >> > group to preempt a BE class? Same thing we do for cfq queues. I will
>> >> > not worry too much about a runaway application consuming all the
>> >> > bandwidth. If that's a concern we could use the blkio controller to limit
>> >> > the IO rate of a latency sensitive application to make sure it does
>> >> > not starve BE applications.
>> >>
>> >> That is not quite the same semantics.  This limited preemption patch
>> >> is still work-conserving.  If the RT task is the only task on the
>> >> system with IO, it will be able to use all available disk time.
>> >>
>> >
>> > It is not the same semantics, but it feels like too much special
>> > casing for a single use case.
>>
>> How are you counting use cases?
>
> > This is the first time I have heard this requirement. So if 2-3 different
> > folks come up with a similar concern, then I have an idea that this
> > is a generic need.
> >
> > You also have not explained what the workload is and what the
> > acceptable latencies are.
>
>>
> >> > You are using the generic notion of an RT thread (which in general means
> >> > that it gets all the cpu or all the disk ahead of a BE task). But you have
> >> > changed the definition of RT for this special use case. And also now
> >> > the group RT definition is different from the queue RT definition.
>>
> >> Perhaps the name RT has too much of a "this group should be able to
> >> starve all other groups" connotation.  Is there a better name?  Maybe
> >> latency sensitive?
>
> > I think what you are trying to achieve is that you want to define an
> > additional task and group property, say latency sensitive. This is a
> > third property apart from ioclass and ioprio. To me you still want
> > the task/group to be in the BE class so that it shares the disk in a
> > proportional weight manner, but this additional property will make sure
> > that the task can preempt non latency sensitive tasks/groups.
> >
> > We can't do this additional property for groups alone, because once we
> > move to a hierarchical setup everything is an entity (be it task or queue),
> > and then we need to decide whether one entity can preempt another
> > entity or not. By not defining this property for tasks, a latency
> > sensitive group will always preempt a task on the same tree. (Maybe
> > that's what you want for your use case.) But it is still odd to add
> > additional properties only for groups and not tasks.

You raise a good point about hierarchy.  We'd like to use Gui's
hierarchy patches or similar functionality.  As you point out there is
currently an asymmetry between groups and tasks.  Tasks can be RT, but
groups cannot.  This complicates the hierarchy implementation.

How about adding a blkio.class and blkio.class_device interface to a
truly RT service class?  This class would be able to starve a BE class
(thus be more like the traditional RT/BE divide), and could be
implemented similarly to RT/BE cfqqs today.  This way groups and
queues could easily be scheduled as peers.
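
(A sketch of how that interface might look; hypothetical, since neither file
exists in this form yet and the class encoding is illustrative:)

echo 1 > /dev/cgroup/grp/blkio.class                # hypothetical: RT on all devices
echo '8:16 1' > /dev/cgroup/grp/blkio.class_device  # hypothetical: RT on sdb (8:16) only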

> This is a new paradigm (at least to me). It introduces additional
> complexity in an already complicated system. So it makes sense to make
> sure that there is more than one user of this functionality.
>
>>
>> > Why not have a similar mechanism for the cpu scheduler also then? This
>> > application should first be able to get cpu bandwidth in the same predictable
>> > manner before it gets the disk bandwidth.
>>
>> Perhaps this is a good idea.  If the CPU scheduler folks like it, I'll
>> be happy to support that.
>
>>
>> > And I think your generation number patch should address this issue to
>> > a great extent, shouldn't it? If a latency sensitive task is not using
>> > its fair quota, it will get a lower vdisktime and get to dispatch soon?
>>
>> It will get to dispatch as soon as the current task's timeslice
>> expires.  This could be a long time, depending on the number of other
>> tasks and groups on the system.  We'd like to provide a latency
>> guarantee that's dependent only on the behavior of the low-latency
>> application.
>
> What are your latency requirements? I believe the maximum slice length can
> be 180ms with default settings. You can change the base slice to 50 and that
> will make the maximum slice length 90ms. For the default case it will be 50ms.
>
> So the question is: what workload is it which cannot tolerate these latencies?
>
>>
>> > If that soon is not enough, then we could operate with a reduced base slice
>> > length so that we allocate smaller slices to groups and get better IO
>> > latencies at the cost of total throughput.
>>
>> With the limited preemption patch, I can still achieve good throughput
>> for many tasks, as long as the low-latency task is "quiet" or when
>> there is no low-latency task on the system.  If I use very small
>> timeslices, then I always pay a throughput price, even when there is
>> no low-latency task on the system or that task isn't doing any IO.
>
> Ok, that's fine. So within the BE class you are trying to define another
> type of group that is "low latency". That's why I think this is a third
> property apart from ioprio and ioclass.
>
>>
>> >> > If RT starving BE is an issue, then it is an issue with plain cfq queues
>> >> > also. First we shall have to fix it there.
>> >> >
>> >> > This definition, that a latency sensitive task gets prioritized only
>> >> > till it is consuming its fair share and that if the task starts using more
>> >> > than its fair share then CFQ automatically stops prioritizing it, sounds a
>> >> > little odd to me. If you are looking for predictability, then we lost it. We
>> >> > shall have to know very well that the task is not eating more than its
>> >> > fair share before we can guarantee any kind of latencies to that task. And
>> >> > if we know that the task is not hogging the disk, there is anyway no risk
>> >> > of it starving other groups/tasks completely.
>> >>
>> >> In a shared environment, we have to be a little bit defensive.  We
>> >> hope that a latency sensitive task is well characterized and won't
>> >> exceed its share of the disk, and that we haven't over-committed the
>> >> disk.  If the app does do more IO than expected, then we'd like them
>> >> to bear the burden.  We have a choice of two outcomes.  A single job
>> >> sometimes failing to achieve low disk latency when it's very busy.  Or
>> >> all jobs on a disk sometimes being very slow when another (unrelated)
>> >> job is very busy.  The first is easier to understand and debug.
>> >
>> > To me you are trying to come up with a new scheduling class which is
>> > not RT, and you are overloading the meaning of RT for your use
>> > case; that's the issue I have.
>>
>> Can we come up with a better name?  I've used low-latency and
>> latency-sensitive in this email, and it's not too cumbersome.
>>
>> > Coming up with a new scheduling class is also not desirable, as that
>> > will demand another service tree and we already have too many. Also
>> > it should probably be done for tasks and not just groups; otherwise
>> > extending this concept to a hierarchical setup will get complicated. Queues
>> > and groups will just not gel well.
>>
>> Is there a plan to provide an RT class for groups in the hierarchical
>> future to allow full symmetry with RT tasks?

I'm still interested in the answer to this question.  If there's
currently no plan, is there at least an interest in seeing an
implementation?

>> > Or you could put latency sensitive applications in an RT class and
>> > then throttle them using the blkio controller. That way you get good
>> > latencies while not starving other tasks.
>>
>> This is closer to the semantics offered by this patchset, but requires
>> debugging the complex interactions between two scheduling policies to
>> understand the resulting behavior.
>
> Can you explain that a bit more? The throttling behavior is very clear:
> a group is allowed to dispatch as long as it does not cross the limit;
> otherwise the bio is put in a queue and submitted to the underlying device later.
>
> So as long as the latency sensitive task is within its rate limit, it will get
> the latency you want. The moment it tries to do a lot of IO, it will get
> throttled and practically becomes an ordinary BE task. I believe that's
> what your patchset does: latency sensitive gets priority only if it
> is consuming its fair share of disk. The only difference here is that
> the definition of fair share is absolute (specified in terms of bps or iops)
> instead of being dynamic depending on how many groups are doing IO.
>
> blktrace results show the throttle as well as cfq logs in the same file, so
> correlating the two policies is really easy. So I really don't think that
> understanding the resulting behavior is hard. I will be happy to be
> proven wrong though.

Once we have a true-RT class implementation, I'll give it a shot.

>> > But I don't think overloading the meaning of RT for this specific use
>> > case is a good idea.
>>
>> I hear you loud and clear, but I disagree.
>
> You disagree with what? That changing the definition of RT is fine? That
> ioclass RT means one thing for tasks and another thing for groups, is that fine?

I read your comments so far as "I think this implementation for this
specific use case is a bad idea."  This is what I disagree with.  This
implementation nicely provides the needed behavior.

I'd like to provide the lowest possible latency to a single privileged
group per disk.  At the same time, I need to be able to ensure that
the privileged group isn't able to completely consume the throughput
on the disk.  It will likely share that disk with system daemons and
other "critical" functionality.  It's not important that those daemons
get the same latency guarantees, but they must be guaranteed some disk
time.

> If we really end up doing it, I think we shall have to define an
> additional group file, say blkio.preempt_fair_share. This will mean
> that this is a BE group but with an additional property which allows it to
> preempt an existing entity on the service tree as long as it does not exceed
> its fair share. That way we don't have to define a new class or
> come up with an additional service tree.

I think I hear you objecting more to the name RT.  And that if we had
this "limited preemption" functionality, it should be called by a
different name.

> But I would prefer that you seriously consider implementing an RT group class
> and rate limiting it with the throttling logic, because I believe it should solve
> your issue. The only question would be what the upper limit should be, and I
> think that will depend on the type of storage you are using and what your
> workload is.
>
> Also, if you can give a better example where this kind of latency matters,
> it will help in understanding the problem better.

The general problem is that a distributed system is generally made up
of multiple machines, and that any significant operation against that
system will involve multiple machines.  The response to any external
request will likely be determined by the sum of the latencies of the
components.  So I want to reduce the latency on a single drive as much
as possible.

This thread is getting tangled.  I see a few options:

  a) Pursue the functionality in my original patchset with a different name.
  b) Build a true RT class for groups and try with blk-throttle.

You seem pretty unenthusiastic about a).  How do you feel about b)?

Chad

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/3] cfq-iosched: Fair cross-group preemption
  2011-03-24 21:47           ` Chad Talbott
@ 2011-03-25  5:43             ` Gui Jianfeng
  2011-03-25 21:32             ` Vivek Goyal
  1 sibling, 0 replies; 22+ messages in thread
From: Gui Jianfeng @ 2011-03-25  5:43 UTC (permalink / raw)
  To: Chad Talbott; +Cc: Vivek Goyal, jaxboe, linux-kernel, mrubin, teravest

Chad Talbott wrote:
> On Wed, Mar 23, 2011 at 1:41 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
>> On Wed, Mar 23, 2011 at 01:10:32PM -0700, Chad Talbott wrote:
>>> On Tue, Mar 22, 2011 at 11:12 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
>>>> On Tue, Mar 22, 2011 at 10:39:36AM -0700, Chad Talbott wrote:
>>>>> On Tue, Mar 22, 2011 at 8:09 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
>>>>>> Why not simply implement RT class groups and always allow an RT
>>>>>> group to preempt a BE class? Same thing we do for cfq queues. I will
>>>>>> not worry too much about a runaway application consuming all the
>>>>>> bandwidth. If that's a concern we could use the blkio controller to limit
>>>>>> the IO rate of a latency sensitive application to make sure it does
>>>>>> not starve BE applications.
>>>>> That is not quite the same semantics.  This limited preemption patch
>>>>> is still work-conserving.  If the RT task is the only task on the
>>>>> system with IO, it will be able to use all available disk time.
>>>>>
>>>> It is not the same semantics, but it feels like too much special
>>>> casing for a single use case.
>>> How are you counting use cases?
>> This is the first time I have heard this requirement. So if 2-3 different
>> folks come up with a similar concern, then I have an idea that this
>> is a generic need.
>>
>> You also have not explained what the workload is and what the
>> acceptable latencies are.
>>
>>>> You are using the generic notion of an RT thread (which in general means
>>>> that it gets all the cpu or all the disk ahead of a BE task). But you have
>>>> changed the definition of RT for this special use case. And also now
>>>> the group RT definition is different from the queue RT definition.
>>> Perhaps the name RT has too much of a "this group should be able to
>>> starve all other groups" connotation.  Is there a better name?  Maybe
>>> latency sensitive?
>> I think what you are trying to achieve is that you want to define an
>> additional task and group property, say latency sensitive. This is a
>> third property apart from ioclass and ioprio. To me you still want
>> the task/group to be in the BE class so that it shares the disk in a
>> proportional weight manner, but this additional property will make sure
>> that the task can preempt non latency sensitive tasks/groups.
>>
>> We can't do this additional property for groups alone, because once we
>> move to a hierarchical setup everything is an entity (be it task or queue),
>> and then we need to decide whether one entity can preempt another
>> entity or not. By not defining this property for tasks, a latency
>> sensitive group will always preempt a task on the same tree. (Maybe
>> that's what you want for your use case.) But it is still odd to add
>> additional properties only for groups and not tasks.
> 
> You raise a good point about hierarchy.  We'd like to use Gui's
> hierarchy patches or similar functionality.  As you point out there is
> currently an asymmetry between groups and tasks.  Tasks can be RT, but
> groups cannot.  This complicates the hierarchy implementation.
> 
> How about adding a blkio.class and blkio.class_device interface to a
> truly RT service class?  This class would be able to starve a BE class
> (thus be more like the traditional RT/BE divide), and could be
> implemented similarly to RT/BE cfqqs today.  This way groups and
> queues could easily be scheduled as peers.

For the current "cfq group hierarchy" implementation, I just put the cfqg on
the "BE:SYNC" workload tree for the sake of simplicity. I think we need
to implement ioclass for cfq groups to support *fully* hierarchical
scheduling.

Gui

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/3] cfq-iosched: Fair cross-group preemption
  2011-03-24 21:47           ` Chad Talbott
  2011-03-25  5:43             ` Gui Jianfeng
@ 2011-03-25 21:32             ` Vivek Goyal
  2011-03-25 23:53               ` Chad Talbott
  1 sibling, 1 reply; 22+ messages in thread
From: Vivek Goyal @ 2011-03-25 21:32 UTC (permalink / raw)
  To: Chad Talbott; +Cc: jaxboe, linux-kernel, mrubin, teravest

On Thu, Mar 24, 2011 at 02:47:54PM -0700, Chad Talbott wrote:
> On Wed, Mar 23, 2011 at 1:41 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Wed, Mar 23, 2011 at 01:10:32PM -0700, Chad Talbott wrote:
> >> On Tue, Mar 22, 2011 at 11:12 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> >> > On Tue, Mar 22, 2011 at 10:39:36AM -0700, Chad Talbott wrote:
> >> >> On Tue, Mar 22, 2011 at 8:09 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> >> >> > Why not simply implement RT class groups and always allow an RT
> >> >> > group to preempt a BE class? Same thing we do for cfq queues. I will
> >> >> > not worry too much about a runaway application consuming all the
> >> >> > bandwidth. If that's a concern we could use the blkio controller to limit
> >> >> > the IO rate of a latency sensitive application to make sure it does
> >> >> > not starve BE applications.
> >> >>
> >> >> That is not quite the same semantics.  This limited preemption patch
> >> >> is still work-conserving.  If the RT task is the only task on the
> >> >> system with IO, it will be able to use all available disk time.
> >> >>
> >> >
> >> > It is not the same semantics, but it feels like too much special
> >> > casing for a single use case.
> >>
> >> How are you counting use cases?
> >
> > This is the first time I have heard this requirement. So if 2-3 different
> > folks come up with a similar concern, then I have an idea that this
> > is a generic need.
> >
> > You also have not explained what the workload is and what the
> > acceptable latencies are.
> >
> >>
> >> > You are using the generic notion of an RT thread (which in general means
> >> > that it gets all the cpu or all the disk ahead of a BE task). But you have
> >> > changed the definition of RT for this special use case. And also now
> >> > the group RT definition is different from the queue RT definition.
> >>
> >> Perhaps the name RT has too much of a "this group should be able to
> >> starve all other groups" connotation.  Is there a better name?  Maybe
> >> latency sensitive?

This is not just the RT name. It is also about using the term blkio.class. At
some point it will be a good idea to be able to define an ioclass
for groups too (like tasks). So we could use blkio.class to define the
ioclass of a group.

> >
> > I think what you are trying to achieve is that you want to define an
> > additional task and group property, say latency sensitive. This is a
> > third property apart from ioclass and ioprio. To me you still want
> > the task/group to be in the BE class so that it shares the disk in a
> > proportional weight manner, but this additional property will make sure
> > that the task can preempt non latency sensitive tasks/groups.
> >
> > We can't do this additional property for groups alone, because once we
> > move to a hierarchical setup everything is an entity (be it task or queue),
> > and then we need to decide whether one entity can preempt another
> > entity or not. By not defining this property for tasks, a latency
> > sensitive group will always preempt a task on the same tree. (Maybe
> > that's what you want for your use case.) But it is still odd to add
> > additional properties only for groups and not tasks.
> 
> You raise a good point about hierarchy.  We'd like to use Gui's
> hierarchy patches or similar functionality.  As you point out there is
> currently an asymmetry between groups and tasks.  Tasks can be RT, but
> groups cannot.  This complicates the hierarchy implementation.
> 
> How about adding a blkio.class and blkio.class_device interface to a
> truly RT service class?  This class would be able to starve a BE class
> (thus be more like the traditional RT/BE divide), and could be
> implemented similarly to RT/BE cfqqs today.  This way groups and
> queues could easily be scheduled as peers.

I think defining blkio.class and coming up with RT and IDLE (if needed)
groups separately makes sense to me. So in cfqd we can define a new
service tree where RT groups get queued up. Once the hierarchical
implementation happens, RT queue and RT group entities will go on a
single tree.

[..]
> >> Is there a plan to provide an RT class for groups in the hierarchical
> >> future to allow full symmetry with RT tasks?
> 
> I'm still interested in the answer to this question.  If there's
> currently no plan, is there at least an interest in seeing an
> implementation?

This is one of the possible expansions I have in mind. It's just that I did
not plan to do it immediately, as I did not have any immediate need. So
feel free to implement the notion of RT cfq groups.

[..]
> I'd like to provide the lowest possible latency to a single privileged
> group per disk.  At the same time, I need to be able to ensure that
> the privileged group isn't able to completely consume the throughput
> on the disk.  It will likely share that disk with system daemons and
> other "critical" functionality.  It's not important that those daemons
> get the same latency guarantees, but they must be guaranteed some disk
> time.
> 
> > If we really end up doing it, I think we shall have to define an
> > additional group file, say blkio.preempt_fair_share. This will mean
> > that this is a BE group but with an additional property which allows it to
> > preempt an existing entity on the service tree as long as it does not exceed
> > its fair share. That way we don't have to define a new class or
> > come up with an additional service tree.
> 
> I think I hear you objecting more to the name RT.  And that if we had
> this "limited preemption" functionality, it should be called by a
> different name.
> 

I think it was a combination of a few things:

- use of the RT namespace
- usage of the blkio.class namespace
- asymmetry between task and group properties

So 1 and 2 can be easily fixed by using a different name,
say blkio.preempt_fair_share. The only remaining issue will be the third piece.

> > But I would prefer that you seriously consider implementing an RT group class
> > and rate limiting it with the throttling logic, because I believe it should solve
> > your issue. The only question would be what the upper limit should be, and I
> > think that will depend on the type of storage you are using and what your
> > workload is.
> >
> > Also, if you can give a better example where this kind of latency matters,
> > it will help in understanding the problem better.
> 
> The general problem is that a distributed system is generally made up
> of multiple machines, and that any significant operation against that
> system will involve multiple machines.  The response to any external
> request will likely be determined by the sum of the latencies of the
> components.  So I want to reduce the latency on a single drive as much
> as possible.
> 
> This thread is getting tangled.  I see a few options:
> 
>   a) Pursue the functionality in my original patchset with a different name.
>   b) Build a true RT class for groups and try with blk-throttle.
> 
> You seem pretty unenthusiastic about a).  How do you feel about b)?

IMHO, using an RT group with throttling avoids introducing asymmetry between
task and group attributes, so I would prefer that approach. Though it means
more code, as we will be introducing RT groups, that might be useful
in general for something else too. (I am assuming that somebody makes
use of the RT class for cfqqs.)

One more downside of trying to use throttling is that one needs to
come up with an absolute limit. So one shall have to know the disk capacity,
and if there are no BE tasks running then the latency sensitive task will
be unnecessarily throttled (unless some management software
can monitor it and change the limit dynamically).
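
(A sketch of what such a management loop could look like; the idle check is a
placeholder, not a real tool, and the limits are arbitrary example values:)

while : ; do
    if disk_is_idle /dev/sdb ; then                                    # placeholder check
        echo '8:16 104857600' > lowlat/blkio.throttle.read_bps_device  # relax to 100MB/s
    else
        echo '8:16 10485760' > lowlat/blkio.throttle.read_bps_device   # enforce 10MB/s
    fi
    sleep 1
done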

So if you are worried about having to set an absolute limit, then I guess
I am fine with option a). But if you think that setting an absolute limit
is not a problem, then option b) is preferred.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/3] cfq-iosched: Fair cross-group preemption
  2011-03-25 21:32             ` Vivek Goyal
@ 2011-03-25 23:53               ` Chad Talbott
  2011-03-28 13:15                 ` Vivek Goyal
  2011-03-28 13:17                 ` Vivek Goyal
  0 siblings, 2 replies; 22+ messages in thread
From: Chad Talbott @ 2011-03-25 23:53 UTC (permalink / raw)
  To: Vivek Goyal, Gui Jianfeng; +Cc: jaxboe, linux-kernel, mrubin, teravest

On Fri, Mar 25, 2011 at 2:32 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
>> You seem pretty unenthusiastic about a).  How do you feel about b)?
>
> IMHO, using an RT group with throttling avoids introducing asymmetry between
> task and group attributes, so I would prefer that approach. Though it means
> more code, as we will be introducing RT groups, that might be useful
> in general for something else too. (I am assuming that somebody makes
> use of the RT class for cfqqs.)
>
> One more downside of trying to use throttling is that one needs to
> come up with an absolute limit. So one shall have to know the disk capacity,
> and if there are no BE tasks running then the latency sensitive task will
> be unnecessarily throttled (unless some management software
> can monitor it and change the limit dynamically).
>
> So if you are worried about having to set an absolute limit, then I guess
> I am fine with option a). But if you think that setting an absolute limit
> is not a problem, then option b) is preferred.

I prefer option a) - so much so that even with the older CFQ group
implementation we did work to merge the RT and BE service trees to
achieve that behavior.  But I see that using blkio.class is a poor
choice of interface name.  I will rename the interface and resubmit
the patch series (also with Gui's suggestion to keep the "_device"
suffix for consistency).

Also, we'll soon be working to adopt the hierarchy series, and we'll
likely revisit using an RT service tree there.  It's difficult to
justify introducing RT service tree before those patches arrive.

Thanks,
Chad

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/3] cfq-iosched: Fair cross-group preemption
  2011-03-25 23:53               ` Chad Talbott
@ 2011-03-28 13:15                 ` Vivek Goyal
  2011-03-28 16:59                   ` Chad Talbott
  2011-03-28 13:17                 ` Vivek Goyal
  1 sibling, 1 reply; 22+ messages in thread
From: Vivek Goyal @ 2011-03-28 13:15 UTC (permalink / raw)
  To: Chad Talbott; +Cc: Gui Jianfeng, jaxboe, linux-kernel, mrubin, teravest

On Fri, Mar 25, 2011 at 04:53:13PM -0700, Chad Talbott wrote:
> On Fri, Mar 25, 2011 at 2:32 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> >> You seem pretty unenthusiastic about a).  How do you feel about b)?
> >
> > IMHO, using an RT group with throttling avoids introducing asymmetry between
> > task and group attributes, so I would prefer that approach. Though it means
> > more code, as we will be introducing RT groups, that might be useful
> > in general for something else too. (I am assuming that somebody makes
> > use of the RT class for cfqqs.)
> >
> > One more downside of trying to use throttling is that one needs to
> > come up with an absolute limit. So one shall have to know the disk capacity,
> > and if there are no BE tasks running then the latency sensitive task will
> > be unnecessarily throttled (unless some management software
> > can monitor it and change the limit dynamically).
> >
> > So if you are worried about having to set an absolute limit, then I guess
> > I am fine with option a). But if you think that setting an absolute limit
> > is not a problem, then option b) is preferred.
> 
> I prefer option a) - so much so that even with the older CFQ group
> implementation we did work to merge the RT and BE service trees to
> achieve that behavior.  But I see that using blkio.class is a poor
> choice of interface name.  I will rename the interface and resubmit
> the patch series (also with Gui's suggestion to keep the "_device"
> suffix for consistency).
> 
> Also, we'll soon be working to adopt the hierarchy series, and we'll
> likely revisit using an RT service tree there.  It's difficult to
> justify introducing RT service tree before those patches arrive.

So once you move to the hierarchy series, will you be using the RT class
and throttling, and abandon this functionality?

Thanks
Vivek

> 
> Thanks,
> Chad

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/3] cfq-iosched: Fair cross-group preemption
  2011-03-25 23:53               ` Chad Talbott
  2011-03-28 13:15                 ` Vivek Goyal
@ 2011-03-28 13:17                 ` Vivek Goyal
  2011-03-28 17:02                   ` Chad Talbott
  1 sibling, 1 reply; 22+ messages in thread
From: Vivek Goyal @ 2011-03-28 13:17 UTC (permalink / raw)
  To: Chad Talbott; +Cc: Gui Jianfeng, jaxboe, linux-kernel, mrubin, teravest

On Fri, Mar 25, 2011 at 04:53:13PM -0700, Chad Talbott wrote:
> On Fri, Mar 25, 2011 at 2:32 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> >> You seem pretty unenthusiastic about a).  How do you feel about b)?
> >
> > IMHO, using an RT group with throttling avoids introducing asymmetry between
> > task and group attributes, so I would prefer that approach. Though it means
> > more code, as we will be introducing RT groups, that might be useful
> > in general for something else too. (I am assuming that somebody makes
> > use of the RT class for cfqqs.)
> >
> > One more downside of trying to use throttling is that one needs to
> > come up with an absolute limit. So one shall have to know the disk capacity,
> > and if there are no BE tasks running then the latency sensitive task will
> > be unnecessarily throttled (unless some management software
> > can monitor it and change the limit dynamically).
> >
> > So if you are worried about having to set an absolute limit, then I guess
> > I am fine with option a). But if you think that setting an absolute limit
> > is not a problem, then option b) is preferred.
> 
> I prefer option a) - so much so that even with the older CFQ group
> implementation we did work to merge the RT and BE service trees to
> achieve that behavior.  But I see that using blkio.class is a poor
> choice of interface name.  I will rename the interface and resubmit
> the patch series (also with Gui's suggestion to keep the "_device"
> suffix for consistency).

Do you need this feature to be global or per device or both?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/3] cfq-iosched: Fair cross-group preemption
  2011-03-28 13:15                 ` Vivek Goyal
@ 2011-03-28 16:59                   ` Chad Talbott
  2011-03-28 17:24                     ` Vivek Goyal
  0 siblings, 1 reply; 22+ messages in thread
From: Chad Talbott @ 2011-03-28 16:59 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Gui Jianfeng, jaxboe, linux-kernel, mrubin, teravest

On Mon, Mar 28, 2011 at 6:15 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Fri, Mar 25, 2011 at 04:53:13PM -0700, Chad Talbott wrote:
>> Also, we'll soon be working to adopt the hierarchy series, and we'll
>> likely revisit using an RT service tree there.  It's difficult to
>> justify introducing RT service tree before those patches arrive.
>
> So once you move to the hierarchy series, will you be using the RT class
> and throttling, and abandon this functionality?

I'll certainly investigate using RT class and throttling, but it's too
early to know.  It's a known change in behavior, so it will need
testing (as well as the user-space work to handle the different
interface).

Chad

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/3] cfq-iosched: Fair cross-group preemption
  2011-03-28 13:17                 ` Vivek Goyal
@ 2011-03-28 17:02                   ` Chad Talbott
  0 siblings, 0 replies; 22+ messages in thread
From: Chad Talbott @ 2011-03-28 17:02 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Gui Jianfeng, jaxboe, linux-kernel, mrubin, teravest

On Mon, Mar 28, 2011 at 6:17 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Fri, Mar 25, 2011 at 04:53:13PM -0700, Chad Talbott wrote:
>> I prefer option a) - so much so that even with the older CFQ group
>> implementation we did work to merge the RT and BE service trees to
>> achieve that behavior.  But I see that using blkio.class is a poor
>> choice of interface name.  I will rename the interface and resubmit
>> the patch series (also with Gui's suggestion to keep the "_device"
>> suffix for consistency).
>
> Do you need this feature to be global or per device or both?

Per-device.

Chad

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/3] cfq-iosched: Fair cross-group preemption
  2011-03-28 16:59                   ` Chad Talbott
@ 2011-03-28 17:24                     ` Vivek Goyal
  0 siblings, 0 replies; 22+ messages in thread
From: Vivek Goyal @ 2011-03-28 17:24 UTC (permalink / raw)
  To: Chad Talbott; +Cc: Gui Jianfeng, jaxboe, linux-kernel, mrubin, teravest

On Mon, Mar 28, 2011 at 09:59:54AM -0700, Chad Talbott wrote:
> On Mon, Mar 28, 2011 at 6:15 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Fri, Mar 25, 2011 at 04:53:13PM -0700, Chad Talbott wrote:
> >> Also, we'll soon be working to adopt the hierarchy series, and we'll
> >> likely revisit using an RT service tree there.  It's difficult to
> >> justify introducing RT service tree before those patches arrive.
> >
> > So once you move to the hierarchy series, will you be using the RT class
> > and throttling, and abandon this functionality?
> 
> I'll certainly investigate using RT class and throttling, but it's too
> early to know.  It's a known change in behavior, so it will need
> testing (as well as the user-space work to handle the different
> interface).

So why not assess that now and then go for the one which meets your
requirements in the long term? If we introduce this patch now, it
introduces a new user-visible interface and we can't get rid of it
later. If two releases down the line you implement RT group functionality
and move to that, then we have this dead interface not being used.

So why not implement RT groups internally now (it will be a big patch)
and test and see how well it works with the throttling logic and whether
it meets your needs or not.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread

Thread overview: 22+ messages
2011-03-22  1:10 [PATCH 0/3] cfq-iosched: Fair cross-group preemption Chad Talbott
2011-03-22  1:10 ` [PATCH 1/3] cfq-iosched: Fair cross-group preemption (interface and documentation) Chad Talbott
2011-03-22 10:03   ` Gui Jianfeng
2011-03-22 18:07     ` Chad Talbott
2011-03-22  1:10 ` [PATCH 2/3] cfq-iosched: Fair cross-group preemption (implementation) Chad Talbott
2011-03-22  1:10 ` [PATCH 3/3] cfq-iosched: Fair cross-group preemption (stats) Chad Talbott
2011-03-22 15:09 ` [PATCH 0/3] cfq-iosched: Fair cross-group preemption Vivek Goyal
2011-03-22 17:39   ` Chad Talbott
2011-03-22 18:12     ` Vivek Goyal
2011-03-22 23:46       ` Chad Talbott
2011-03-23  1:43         ` Vivek Goyal
2011-03-23 20:10       ` Chad Talbott
2011-03-23 20:41         ` Vivek Goyal
2011-03-24 21:47           ` Chad Talbott
2011-03-25  5:43             ` Gui Jianfeng
2011-03-25 21:32             ` Vivek Goyal
2011-03-25 23:53               ` Chad Talbott
2011-03-28 13:15                 ` Vivek Goyal
2011-03-28 16:59                   ` Chad Talbott
2011-03-28 17:24                     ` Vivek Goyal
2011-03-28 13:17                 ` Vivek Goyal
2011-03-28 17:02                   ` Chad Talbott
