* [PATCHSET] blkcg: kill policy node and blkg->dev, take#2
@ 2012-01-22  3:25 Tejun Heo
  2012-01-22  3:25 ` [PATCH 01/17] blkcg: make CONFIG_BLK_CGROUP bool Tejun Heo
                   ` (16 more replies)
  0 siblings, 17 replies; 45+ messages in thread
From: Tejun Heo @ 2012-01-22  3:25 UTC (permalink / raw)
  To: axboe, vgoyal; +Cc: ctalbott, rni, linux-kernel

Hello, guys.

This is the second take of blkcg-kill-policy-node-and-blkg_dev
patchset.  Changes from the last take[L] are,

* 0001-blkcg-make-CONFIG_BLK_CGROUP-bool.patch added to allow calling
  into blkcg from block core to clear blkg's on elevator switch.

* Patches 0005-0008 added.  These ensure that blkcg policies are
  quiescent and no blkg exists during elevator switch.  0005-0006 are
  lifted from the pending second patchset and will be used for policy
  registration too.

* 0013-blkcg-use-the-usual-get-blkg-path-for-root-blkio_gro.patch
  fixed for !CONFIG_CFQ_GROUP_IOSCHED.

* 0009-blkcg-move-rcu_read_lock-outside-of-blkio_group-get-.patch is
  the RCU fix patch which was 0001 in the last take.  It turns out the
  RCU lock isn't protecting blkg at all but blkcg.  Patch description
  updated accordingly.

The next patchset will update blkcg such that there's one blkg per
cgroup - request_queue association.  Initially, all blkg's will
contain data for all policies registered on the system but policy
registration will be made per-request_queue.

The original patchset description follows.

blk-cgroup, blk-throttle and cfq-iosched are pretty tightly tangled
and untangling requires seemingly unrelated changes and addition of
some transitional stuff.  Some are in this patchset but they'll be
more prominent in the second patchset.  With that said, this patchset
concentrates on removing blkio_policy_node and blkio_group->dev.

blkio_policy_node carries configuration per cgroup-request_queue
association separately from the matching blkio_group.  The goal seems
to be allowing configuration of missing devices and retaining
configuration across hot unplug/plug cycles.  Such behavior is very
different from existing conventions, misleading and unnecessary.  This
patchset moves configuration to blkio_group and removes
blkio_policy_node.
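
Roughly, the end state looks something like this (field names here are
illustrative, not necessarily the exact ones the patches use):

  /* sketch: per-device configuration carried by blkio_group itself */
  struct blkio_group_conf {
          unsigned int weight;            /* proportional weight (CFQ) */
          u64 bps[2];                     /* throttle read/write bytes/sec */
          unsigned int iops[2];           /* throttle read/write IOs/sec */
  };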

blkg->dev removal is much simpler.  blkg->dev is used to print out
device number when printing out per-device configurations or
statistics.  The device name can be easily determined by following the
associated request_queue from blkg instead.
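
For example, something along the following lines can stand in for the
stored device number (a minimal sketch; it assumes the queue's
backing_dev_info has a registered device, and the actual helper may
differ):

  /* illustrative only: derive the device name from blkg->q */
  static const char *blkg_dev_name(struct blkio_group *blkg)
  {
          /* some queues (e.g. floppy) may not have a bdi device */
          if (blkg->q->backing_dev_info.dev)
                  return dev_name(blkg->q->backing_dev_info.dev);
          return NULL;
  }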

This patchset contains the following 17 patches.

0001-blkcg-make-CONFIG_BLK_CGROUP-bool.patch
0002-cfq-don-t-register-propio-policy-if-CONFIG_CFQ_GROUP.patch
0003-elevator-clear-auxiliary-data-earlier-during-elevato.patch
0004-elevator-make-elevator_init_fn-return-0-errno.patch
0005-block-implement-blk_queue_bypass_start-end.patch
0006-block-extend-queue-bypassing-to-cover-blkcg-policies.patch
0007-blkcg-make-blkio_list_lock-irq-safe.patch
0008-blkcg-shoot-down-blkio_groups-on-elevator-switch.patch
0009-blkcg-move-rcu_read_lock-outside-of-blkio_group-get-.patch
0010-blkcg-update-blkg-get-functions-take-blkio_cgroup-as.patch
0011-blkcg-use-q-and-plid-instead-of-opaque-void-for-blki.patch
0012-blkcg-add-blkio_policy-array-and-allow-one-policy-pe.patch
0013-blkcg-use-the-usual-get-blkg-path-for-root-blkio_gro.patch
0014-blkcg-factor-out-blkio_group-creation.patch
0015-blkcg-don-t-allow-or-retain-configuration-of-missing.patch
0016-blkcg-kill-blkio_policy_node.patch
0017-blkcg-kill-the-mind-bending-blkg-dev.patch

0001-0004: misc preps including elevator switch update

0005-0008: ensure blkcg policies are quiescent across elvswitch

0009-0014: move common part of blkg management into blkcg core

0015-0016: kill blkio_policy_node

0017     : kill blkg->dev

This patchset is on top of v3.3-rc1 and also available in the
following git branch.

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git blkcg-kill-pn

 block/Kconfig.iosched    |    5 
 block/blk-cgroup.c       |  684 +++++++++++++++--------------------------------
 block/blk-cgroup.h       |  106 ++-----
 block/blk-core.c         |   53 +++
 block/blk-throttle.c     |  266 +++++-------------
 block/blk.h              |    6 
 block/cfq-iosched.c      |  286 +++++++------------
 block/cfq.h              |    7 
 block/deadline-iosched.c |    8 
 block/elevator.c         |  115 +++----
 block/noop-iosched.c     |    8 
 include/linux/blkdev.h   |    5 
 include/linux/elevator.h |    2 
 init/Kconfig             |    2 
 14 files changed, 568 insertions(+), 985 deletions(-)

Thanks.

--
tejun

[L] http://thread.gmane.org/gmane.linux.kernel/1241094

* [PATCH 01/17] blkcg: make CONFIG_BLK_CGROUP bool
  2012-01-22  3:25 [PATCHSET] blkcg: kill policy node and blkg->dev, take#2 Tejun Heo
@ 2012-01-22  3:25 ` Tejun Heo
  2012-01-23 15:00   ` Vivek Goyal
  2012-01-22  3:25 ` [PATCH 02/17] cfq: don't register propio policy if !CONFIG_CFQ_GROUP_IOSCHED Tejun Heo
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 45+ messages in thread
From: Tejun Heo @ 2012-01-22  3:25 UTC (permalink / raw)
  To: axboe, vgoyal; +Cc: ctalbott, rni, linux-kernel, Tejun Heo

Block cgroup core can be built as a module; however, it isn't too
useful as blk-throttle can only be built-in and cfq-iosched is usually
the default built-in scheduler.  Scheduled blkcg cleanup requires
calling into blkcg from block core.  To simplify that, disallow
building blkcg as a module by making CONFIG_BLK_CGROUP bool.

If building blkcg core as module really matters, which I doubt, we can
revisit it after blkcg API cleanup.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched |    5 +----
 block/blk-cgroup.c    |   17 -----------------
 block/blk-cgroup.h    |   10 ++--------
 init/Kconfig          |    2 +-
 4 files changed, 4 insertions(+), 30 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 3199b76..8dca96d 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -23,8 +23,7 @@ config IOSCHED_DEADLINE
 
 config IOSCHED_CFQ
 	tristate "CFQ I/O scheduler"
-	# If BLK_CGROUP is a module, CFQ has to be built as module.
-	depends on (BLK_CGROUP=m && m) || !BLK_CGROUP || BLK_CGROUP=y
+	depends on BLK_CGROUP
 	default y
 	---help---
 	  The CFQ I/O scheduler tries to distribute bandwidth equally
@@ -34,8 +33,6 @@ config IOSCHED_CFQ
 
 	  This is the default I/O scheduler.
 
-	  Note: If BLK_CGROUP=m, then CFQ can be built only as module.
-
 config CFQ_GROUP_IOSCHED
 	bool "CFQ Group Scheduling support"
 	depends on IOSCHED_CFQ && BLK_CGROUP
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index fa8f263..7bfc574 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -50,10 +50,7 @@ struct cgroup_subsys blkio_subsys = {
 	.attach = blkiocg_attach,
 	.destroy = blkiocg_destroy,
 	.populate = blkiocg_populate,
-#ifdef CONFIG_BLK_CGROUP
-	/* note: blkio_subsys_id is otherwise defined in blk-cgroup.h */
 	.subsys_id = blkio_subsys_id,
-#endif
 	.use_id = 1,
 	.module = THIS_MODULE,
 };
@@ -1679,17 +1676,3 @@ void blkio_policy_unregister(struct blkio_policy_type *blkiop)
 	spin_unlock(&blkio_list_lock);
 }
 EXPORT_SYMBOL_GPL(blkio_policy_unregister);
-
-static int __init init_cgroup_blkio(void)
-{
-	return cgroup_load_subsys(&blkio_subsys);
-}
-
-static void __exit exit_cgroup_blkio(void)
-{
-	cgroup_unload_subsys(&blkio_subsys);
-}
-
-module_init(init_cgroup_blkio);
-module_exit(exit_cgroup_blkio);
-MODULE_LICENSE("GPL");
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 6f3ace7..3551687 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -24,13 +24,7 @@ enum blkio_policy_id {
 /* Max limits for throttle policy */
 #define THROTL_IOPS_MAX		UINT_MAX
 
-#if defined(CONFIG_BLK_CGROUP) || defined(CONFIG_BLK_CGROUP_MODULE)
-
-#ifndef CONFIG_BLK_CGROUP
-/* When blk-cgroup is a module, its subsys_id isn't a compile-time constant */
-extern struct cgroup_subsys blkio_subsys;
-#define blkio_subsys_id blkio_subsys.subsys_id
-#endif
+#ifdef CONFIG_BLK_CGROUP
 
 enum stat_type {
 	/* Total time spent (in ns) between request dispatch to the driver and
@@ -303,7 +297,7 @@ static inline void blkiocg_update_idle_time_stats(struct blkio_group *blkg) {}
 static inline void blkiocg_set_start_empty_time(struct blkio_group *blkg) {}
 #endif
 
-#if defined(CONFIG_BLK_CGROUP) || defined(CONFIG_BLK_CGROUP_MODULE)
+#ifdef CONFIG_BLK_CGROUP
 extern struct blkio_cgroup blkio_root_cgroup;
 extern struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup);
 extern struct blkio_cgroup *task_blkio_cgroup(struct task_struct *tsk);
diff --git a/init/Kconfig b/init/Kconfig
index 3f42cd6..da9222d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -766,7 +766,7 @@ config RT_GROUP_SCHED
 endif #CGROUP_SCHED
 
 config BLK_CGROUP
-	tristate "Block IO controller"
+	bool "Block IO controller"
 	depends on BLOCK
 	default n
 	---help---
-- 
1.7.7.3


* [PATCH 02/17] cfq: don't register propio policy if !CONFIG_CFQ_GROUP_IOSCHED
  2012-01-22  3:25 [PATCHSET] blkcg: kill policy node and blkg->dev, take#2 Tejun Heo
  2012-01-22  3:25 ` [PATCH 01/17] blkcg: make CONFIG_BLK_CGROUP bool Tejun Heo
@ 2012-01-22  3:25 ` Tejun Heo
  2012-01-22  3:25 ` [PATCH 03/17] elevator: clear auxiliary data earlier during elevator switch Tejun Heo
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 45+ messages in thread
From: Tejun Heo @ 2012-01-22  3:25 UTC (permalink / raw)
  To: axboe, vgoyal; +Cc: ctalbott, rni, linux-kernel, Tejun Heo

cfq has been registering a zeroed blkio_policy_cfq if CFQ_GROUP_IOSCHED
is disabled.  This fortunately doesn't collide with blk-throtl as
BLKIO_POLICY_PROP is zero, but it is unnecessary and risky.  Just don't
register it if not enabled.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/cfq-iosched.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index ee55019..529dfba 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -3884,8 +3884,6 @@ static struct blkio_policy_type blkio_policy_cfq = {
 	},
 	.plid = BLKIO_POLICY_PROP,
 };
-#else
-static struct blkio_policy_type blkio_policy_cfq;
 #endif
 
 static int __init cfq_init(void)
@@ -3916,14 +3914,17 @@ static int __init cfq_init(void)
 		return ret;
 	}
 
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
 	blkio_policy_register(&blkio_policy_cfq);
-
+#endif
 	return 0;
 }
 
 static void __exit cfq_exit(void)
 {
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
 	blkio_policy_unregister(&blkio_policy_cfq);
+#endif
 	elv_unregister(&iosched_cfq);
 	kmem_cache_destroy(cfq_pool);
 }
-- 
1.7.7.3


* [PATCH 03/17] elevator: clear auxiliary data earlier during elevator switch
  2012-01-22  3:25 [PATCHSET] blkcg: kill policy node and blkg->dev, take#2 Tejun Heo
  2012-01-22  3:25 ` [PATCH 01/17] blkcg: make CONFIG_BLK_CGROUP bool Tejun Heo
  2012-01-22  3:25 ` [PATCH 02/17] cfq: don't register propio policy if !CONFIG_CFQ_GROUP_IOSCHED Tejun Heo
@ 2012-01-22  3:25 ` Tejun Heo
  2012-01-22  3:25 ` [PATCH 04/17] elevator: make elevator_init_fn() return 0/-errno Tejun Heo
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 45+ messages in thread
From: Tejun Heo @ 2012-01-22  3:25 UTC (permalink / raw)
  To: axboe, vgoyal; +Cc: ctalbott, rni, linux-kernel, Tejun Heo

Elevator switch tries hard to keep as much context as possible until
the new elevator is ready so that it can revert to the original state
if initializing the new elevator fails for some reason.  Unfortunately,
with more auxiliary contexts to manage, this makes the elevator init
and exit paths too complex and fragile.

This patch makes elevator_switch() unregister the current elevator and
flush icq's before starting to initialize the new one.  As we still
keep the old elevator itself, the only difference is that we lose
icq's on rare occasions of switching failure, which isn't critical at
all.

Note that this makes the explicit elevator parameter to
elevator_init_queue() and __elv_register_queue() unnecessary as they
can always use the current elevator.

This patch enables block cgroup cleanups.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/elevator.c |   88 +++++++++++++++++++++++++++---------------------------
 1 files changed, 44 insertions(+), 44 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index 91e18f8..42543e3 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -168,11 +168,10 @@ static struct elevator_type *elevator_get(const char *name)
 	return e;
 }
 
-static int elevator_init_queue(struct request_queue *q,
-			       struct elevator_queue *eq)
+static int elevator_init_queue(struct request_queue *q)
 {
-	eq->elevator_data = eq->type->ops.elevator_init_fn(q);
-	if (eq->elevator_data)
+	q->elevator->elevator_data = q->elevator->type->ops.elevator_init_fn(q);
+	if (q->elevator->elevator_data)
 		return 0;
 	return -ENOMEM;
 }
@@ -235,7 +234,6 @@ static void elevator_release(struct kobject *kobj)
 int elevator_init(struct request_queue *q, char *name)
 {
 	struct elevator_type *e = NULL;
-	struct elevator_queue *eq;
 	int err;
 
 	if (unlikely(q->elevator))
@@ -269,17 +267,16 @@ int elevator_init(struct request_queue *q, char *name)
 		}
 	}
 
-	eq = elevator_alloc(q, e);
-	if (!eq)
+	q->elevator = elevator_alloc(q, e);
+	if (!q->elevator)
 		return -ENOMEM;
 
-	err = elevator_init_queue(q, eq);
+	err = elevator_init_queue(q);
 	if (err) {
-		kobject_put(&eq->kobj);
+		kobject_put(&q->elevator->kobj);
 		return err;
 	}
 
-	q->elevator = eq;
 	return 0;
 }
 EXPORT_SYMBOL(elevator_init);
@@ -848,8 +845,9 @@ static struct kobj_type elv_ktype = {
 	.release	= elevator_release,
 };
 
-int __elv_register_queue(struct request_queue *q, struct elevator_queue *e)
+int elv_register_queue(struct request_queue *q)
 {
+	struct elevator_queue *e = q->elevator;
 	int error;
 
 	error = kobject_add(&e->kobj, &q->kobj, "%s", "iosched");
@@ -867,11 +865,6 @@ int __elv_register_queue(struct request_queue *q, struct elevator_queue *e)
 	}
 	return error;
 }
-
-int elv_register_queue(struct request_queue *q)
-{
-	return __elv_register_queue(q, q->elevator);
-}
 EXPORT_SYMBOL(elv_register_queue);
 
 void elv_unregister_queue(struct request_queue *q)
@@ -954,39 +947,47 @@ EXPORT_SYMBOL_GPL(elv_unregister);
  */
 static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 {
-	struct elevator_queue *old_elevator, *e;
+	struct elevator_queue *old = q->elevator;
+	bool registered = old->registered;
 	int err;
 
-	/* allocate new elevator */
-	e = elevator_alloc(q, new_e);
-	if (!e)
-		return -ENOMEM;
-
-	err = elevator_init_queue(q, e);
-	if (err) {
-		kobject_put(&e->kobj);
-		return err;
-	}
-
-	/* turn on BYPASS and drain all requests w/ elevator private data */
+	/*
+	 * Turn on BYPASS and drain all requests w/ elevator private data.
+	 * Block layer doesn't call into a quiesced elevator - all requests
+	 * are directly put on the dispatch list without elevator data
+	 * using INSERT_BACK.  All requests have SOFTBARRIER set and no
+	 * merge happens either.
+	 */
 	elv_quiesce_start(q);
 
-	/* unregister old queue, register new one and kill old elevator */
-	if (q->elevator->registered) {
+	/* unregister and clear all auxiliary data of the old elevator */
+	if (registered)
 		elv_unregister_queue(q);
-		err = __elv_register_queue(q, e);
-		if (err)
-			goto fail_register;
-	}
 
-	/* done, clear io_cq's, switch elevators and turn off BYPASS */
 	spin_lock_irq(q->queue_lock);
 	ioc_clear_queue(q);
-	old_elevator = q->elevator;
-	q->elevator = e;
 	spin_unlock_irq(q->queue_lock);
 
-	elevator_exit(old_elevator);
+	/* allocate, init and register new elevator */
+	err = -ENOMEM;
+	q->elevator = elevator_alloc(q, new_e);
+	if (!q->elevator)
+		goto fail_init;
+
+	err = elevator_init_queue(q);
+	if (err) {
+		kobject_put(&q->elevator->kobj);
+		goto fail_init;
+	}
+
+	if (registered) {
+		err = elv_register_queue(q);
+		if (err)
+			goto fail_register;
+	}
+
+	/* done, kill the old one and finish */
+	elevator_exit(old);
 	elv_quiesce_end(q);
 
 	blk_add_trace_msg(q, "elv switch: %s", e->type->elevator_name);
@@ -994,11 +995,10 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	return 0;
 
 fail_register:
-	/*
-	 * switch failed, exit the new io scheduler and reattach the old
-	 * one again (along with re-adding the sysfs dir)
-	 */
-	elevator_exit(e);
+	elevator_exit(q->elevator);
+fail_init:
+	/* switch failed, restore and re-register old elevator */
+	q->elevator = old;
 	elv_register_queue(q);
 	elv_quiesce_end(q);
 
-- 
1.7.7.3


* [PATCH 04/17] elevator: make elevator_init_fn() return 0/-errno
  2012-01-22  3:25 [PATCHSET] blkcg: kill policy node and blkg->dev, take#2 Tejun Heo
                   ` (2 preceding siblings ...)
  2012-01-22  3:25 ` [PATCH 03/17] elevator: clear auxiliary data earlier during elevator switch Tejun Heo
@ 2012-01-22  3:25 ` Tejun Heo
  2012-01-22  3:25 ` [PATCH 05/17] block: implement blk_queue_bypass_start/end() Tejun Heo
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 45+ messages in thread
From: Tejun Heo @ 2012-01-22  3:25 UTC (permalink / raw)
  To: axboe, vgoyal; +Cc: ctalbott, rni, linux-kernel, Tejun Heo

elevator_ops->elevator_init_fn() has a weird return value.  It returns
a void * which the caller should assign to q->elevator->elevator_data,
and a %NULL return denotes init failure.

Update such that it returns integer 0/-errno and sets elevator_data
directly as necessary.

This makes the interface more conventional and eases further cleanup.
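
Under the new convention an elevator's init function follows the usual
0/-errno pattern; a minimal sketch mirroring the noop conversion below
(foo_data is a placeholder name):

  static int foo_init_queue(struct request_queue *q)
  {
          struct foo_data *fd;

          fd = kmalloc_node(sizeof(*fd), GFP_KERNEL, q->node);
          if (!fd)
                  return -ENOMEM;

          q->elevator->elevator_data = fd;
          return 0;
  }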

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/cfq-iosched.c      |    9 +++++----
 block/deadline-iosched.c |    8 +++++---
 block/elevator.c         |   12 ++----------
 block/noop-iosched.c     |    8 +++++---
 include/linux/elevator.h |    2 +-
 5 files changed, 18 insertions(+), 21 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 529dfba..f3d6b04 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -3662,7 +3662,7 @@ static void cfq_exit_queue(struct elevator_queue *e)
 	kfree(cfqd);
 }
 
-static void *cfq_init_queue(struct request_queue *q)
+static int cfq_init_queue(struct request_queue *q)
 {
 	struct cfq_data *cfqd;
 	int i, j;
@@ -3671,7 +3671,7 @@ static void *cfq_init_queue(struct request_queue *q)
 
 	cfqd = kmalloc_node(sizeof(*cfqd), GFP_KERNEL | __GFP_ZERO, q->node);
 	if (!cfqd)
-		return NULL;
+		return -ENOMEM;
 
 	/* Init root service tree */
 	cfqd->grp_service_tree = CFQ_RB_ROOT;
@@ -3698,7 +3698,7 @@ static void *cfq_init_queue(struct request_queue *q)
 	if (blkio_alloc_blkg_stats(&cfqg->blkg)) {
 		kfree(cfqg);
 		kfree(cfqd);
-		return NULL;
+		return -ENOMEM;
 	}
 
 	rcu_read_lock();
@@ -3729,6 +3729,7 @@ static void *cfq_init_queue(struct request_queue *q)
 	cfq_link_cfqq_cfqg(&cfqd->oom_cfqq, &cfqd->root_group);
 
 	cfqd->queue = q;
+	q->elevator->elevator_data = cfqd;
 
 	init_timer(&cfqd->idle_slice_timer);
 	cfqd->idle_slice_timer.function = cfq_idle_slice_timer;
@@ -3753,7 +3754,7 @@ static void *cfq_init_queue(struct request_queue *q)
 	 * second, in order to have larger depth for async operations.
 	 */
 	cfqd->last_delayed_sync = jiffies - HZ;
-	return cfqd;
+	return 0;
 }
 
 /*
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 7bf12d7..599b12e 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -337,13 +337,13 @@ static void deadline_exit_queue(struct elevator_queue *e)
 /*
  * initialize elevator private data (deadline_data).
  */
-static void *deadline_init_queue(struct request_queue *q)
+static int deadline_init_queue(struct request_queue *q)
 {
 	struct deadline_data *dd;
 
 	dd = kmalloc_node(sizeof(*dd), GFP_KERNEL | __GFP_ZERO, q->node);
 	if (!dd)
-		return NULL;
+		return -ENOMEM;
 
 	INIT_LIST_HEAD(&dd->fifo_list[READ]);
 	INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
@@ -354,7 +354,9 @@ static void *deadline_init_queue(struct request_queue *q)
 	dd->writes_starved = writes_starved;
 	dd->front_merges = 1;
 	dd->fifo_batch = fifo_batch;
-	return dd;
+
+	q->elevator->elevator_data = dd;
+	return 0;
 }
 
 /*
diff --git a/block/elevator.c b/block/elevator.c
index 42543e3..8dc65b1 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -168,14 +168,6 @@ static struct elevator_type *elevator_get(const char *name)
 	return e;
 }
 
-static int elevator_init_queue(struct request_queue *q)
-{
-	q->elevator->elevator_data = q->elevator->type->ops.elevator_init_fn(q);
-	if (q->elevator->elevator_data)
-		return 0;
-	return -ENOMEM;
-}
-
 static char chosen_elevator[ELV_NAME_MAX];
 
 static int __init elevator_setup(char *str)
@@ -271,7 +263,7 @@ int elevator_init(struct request_queue *q, char *name)
 	if (!q->elevator)
 		return -ENOMEM;
 
-	err = elevator_init_queue(q);
+	err = e->ops.elevator_init_fn(q);
 	if (err) {
 		kobject_put(&q->elevator->kobj);
 		return err;
@@ -974,7 +966,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	if (!q->elevator)
 		goto fail_init;
 
-	err = elevator_init_queue(q);
+	err = new_e->ops.elevator_init_fn(q);
 	if (err) {
 		kobject_put(&q->elevator->kobj);
 		goto fail_init;
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index 413a0b1..5d1bf70 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -59,15 +59,17 @@ noop_latter_request(struct request_queue *q, struct request *rq)
 	return list_entry(rq->queuelist.next, struct request, queuelist);
 }
 
-static void *noop_init_queue(struct request_queue *q)
+static int noop_init_queue(struct request_queue *q)
 {
 	struct noop_data *nd;
 
 	nd = kmalloc_node(sizeof(*nd), GFP_KERNEL, q->node);
 	if (!nd)
-		return NULL;
+		return -ENOMEM;
+
 	INIT_LIST_HEAD(&nd->queue);
-	return nd;
+	q->elevator->elevator_data = nd;
+	return 0;
 }
 
 static void noop_exit_queue(struct elevator_queue *e)
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index c24f3d7..d7726c9 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -33,7 +33,7 @@ typedef void (elevator_put_req_fn) (struct request *);
 typedef void (elevator_activate_req_fn) (struct request_queue *, struct request *);
 typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct request *);
 
-typedef void *(elevator_init_fn) (struct request_queue *);
+typedef int (elevator_init_fn) (struct request_queue *);
 typedef void (elevator_exit_fn) (struct elevator_queue *);
 
 struct elevator_ops
-- 
1.7.7.3


* [PATCH 05/17] block: implement blk_queue_bypass_start/end()
  2012-01-22  3:25 [PATCHSET] blkcg: kill policy node and blkg->dev, take#2 Tejun Heo
                   ` (3 preceding siblings ...)
  2012-01-22  3:25 ` [PATCH 04/17] elevator: make elevator_init_fn() return 0/-errno Tejun Heo
@ 2012-01-22  3:25 ` Tejun Heo
  2012-01-22  3:25 ` [PATCH 06/17] block: extend queue bypassing to cover blkcg policies Tejun Heo
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 45+ messages in thread
From: Tejun Heo @ 2012-01-22  3:25 UTC (permalink / raw)
  To: axboe, vgoyal; +Cc: ctalbott, rni, linux-kernel, Tejun Heo

Rename and extend elv_quiesce_start/end() to
blk_queue_bypass_start/end(), which are exported and support nesting
via @q->bypass_depth.  Also add blk_queue_bypass() to test bypass
state.

This will be further extended and used for blkio_group management.
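
Caller-side usage, condensed from the elevator_switch() conversion
below:

  blk_queue_bypass_start(q);      /* nestable: bumps q->bypass_depth,
                                     sets QUEUE_FLAG_BYPASS and drains */
  /* ... no request with ELVPRIV exists beyond this point ... */
  blk_queue_bypass_end(q);        /* clears the flag when depth reaches 0 */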

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-core.c       |   39 +++++++++++++++++++++++++++++++++++++--
 block/blk.h            |    6 ++----
 block/elevator.c       |   25 +++----------------------
 include/linux/blkdev.h |    5 ++++-
 4 files changed, 46 insertions(+), 29 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index e6c05a9..c2e39de 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -403,6 +403,42 @@ void blk_drain_queue(struct request_queue *q, bool drain_all)
 }
 
 /**
+ * blk_queue_bypass_start - enter queue bypass mode
+ * @q: queue of interest
+ *
+ * In bypass mode, only the dispatch FIFO queue of @q is used.  This
+ * function makes @q enter bypass mode and drains all requests which were
+ * issued before.  On return, it's guaranteed that no request has ELVPRIV
+ * set.
+ */
+void blk_queue_bypass_start(struct request_queue *q)
+{
+	spin_lock_irq(q->queue_lock);
+	q->bypass_depth++;
+	queue_flag_set(QUEUE_FLAG_BYPASS, q);
+	spin_unlock_irq(q->queue_lock);
+
+	blk_drain_queue(q, false);
+}
+EXPORT_SYMBOL_GPL(blk_queue_bypass_start);
+
+/**
+ * blk_queue_bypass_end - leave queue bypass mode
+ * @q: queue of interest
+ *
+ * Leave bypass mode and restore the normal queueing behavior.
+ */
+void blk_queue_bypass_end(struct request_queue *q)
+{
+	spin_lock_irq(q->queue_lock);
+	if (!--q->bypass_depth)
+		queue_flag_clear(QUEUE_FLAG_BYPASS, q);
+	WARN_ON_ONCE(q->bypass_depth < 0);
+	spin_unlock_irq(q->queue_lock);
+}
+EXPORT_SYMBOL_GPL(blk_queue_bypass_end);
+
+/**
  * blk_cleanup_queue - shutdown a request queue
  * @q: request queue to shutdown
  *
@@ -859,8 +895,7 @@ retry:
 	 * Also, lookup icq while holding queue_lock.  If it doesn't exist,
 	 * it will be created after releasing queue_lock.
 	 */
-	if (blk_rq_should_init_elevator(bio) &&
-	    !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags)) {
+	if (blk_rq_should_init_elevator(bio) && !blk_queue_bypass(q)) {
 		rw_flags |= REQ_ELVPRIV;
 		rl->elvpriv++;
 		if (et->icq_cache && ioc)
diff --git a/block/blk.h b/block/blk.h
index 7efd772..33897f6 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -23,7 +23,8 @@ void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
 			struct bio *bio);
 int blk_rq_append_bio(struct request_queue *q, struct request *rq,
 		      struct bio *bio);
-void blk_drain_queue(struct request_queue *q, bool drain_all);
+void blk_queue_bypass_start(struct request_queue *q);
+void blk_queue_bypass_end(struct request_queue *q);
 void blk_dequeue_request(struct request *rq);
 void __blk_queue_free_tags(struct request_queue *q);
 bool __blk_end_bidi_request(struct request *rq, int error,
@@ -142,9 +143,6 @@ void blk_queue_congestion_threshold(struct request_queue *q);
 
 int blk_dev_init(void);
 
-void elv_quiesce_start(struct request_queue *q);
-void elv_quiesce_end(struct request_queue *q);
-
 
 /*
  * Return the threshold (number of used requests) at which the queue is
diff --git a/block/elevator.c b/block/elevator.c
index 8dc65b1..078a491 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -600,25 +600,6 @@ void elv_drain_elevator(struct request_queue *q)
 	}
 }
 
-void elv_quiesce_start(struct request_queue *q)
-{
-	if (!q->elevator)
-		return;
-
-	spin_lock_irq(q->queue_lock);
-	queue_flag_set(QUEUE_FLAG_ELVSWITCH, q);
-	spin_unlock_irq(q->queue_lock);
-
-	blk_drain_queue(q, false);
-}
-
-void elv_quiesce_end(struct request_queue *q)
-{
-	spin_lock_irq(q->queue_lock);
-	queue_flag_clear(QUEUE_FLAG_ELVSWITCH, q);
-	spin_unlock_irq(q->queue_lock);
-}
-
 void __elv_add_request(struct request_queue *q, struct request *rq, int where)
 {
 	trace_block_rq_insert(q, rq);
@@ -950,7 +931,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	 * using INSERT_BACK.  All requests have SOFTBARRIER set and no
 	 * merge happens either.
 	 */
-	elv_quiesce_start(q);
+	blk_queue_bypass_start(q);
 
 	/* unregister and clear all auxiliary data of the old elevator */
 	if (registered)
@@ -980,7 +961,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 
 	/* done, kill the old one and finish */
 	elevator_exit(old);
-	elv_quiesce_end(q);
+	blk_queue_bypass_end(q);
 
 	blk_add_trace_msg(q, "elv switch: %s", e->type->elevator_name);
 
@@ -992,7 +973,7 @@ fail_init:
 	/* switch failed, restore and re-register old elevator */
 	q->elevator = old;
 	elv_register_queue(q);
-	elv_quiesce_end(q);
+	blk_queue_bypass_end(q);
 
 	return err;
 }
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6c6a1f0..f10958b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -389,6 +389,8 @@ struct request_queue {
 
 	struct mutex		sysfs_lock;
 
+	int			bypass_depth;
+
 #if defined(CONFIG_BLK_DEV_BSG)
 	bsg_job_fn		*bsg_job_fn;
 	int			bsg_job_size;
@@ -409,7 +411,7 @@ struct request_queue {
 #define	QUEUE_FLAG_SYNCFULL	3	/* read queue has been filled */
 #define QUEUE_FLAG_ASYNCFULL	4	/* write queue has been filled */
 #define QUEUE_FLAG_DEAD		5	/* queue being torn down */
-#define QUEUE_FLAG_ELVSWITCH	6	/* don't use elevator, just do FIFO */
+#define QUEUE_FLAG_BYPASS	6	/* act as dumb FIFO queue */
 #define QUEUE_FLAG_BIDI		7	/* queue supports bidi requests */
 #define QUEUE_FLAG_NOMERGES     8	/* disable merge attempts */
 #define QUEUE_FLAG_SAME_COMP	9	/* complete on same CPU-group */
@@ -497,6 +499,7 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
 #define blk_queue_tagged(q)	test_bit(QUEUE_FLAG_QUEUED, &(q)->queue_flags)
 #define blk_queue_stopped(q)	test_bit(QUEUE_FLAG_STOPPED, &(q)->queue_flags)
 #define blk_queue_dead(q)	test_bit(QUEUE_FLAG_DEAD, &(q)->queue_flags)
+#define blk_queue_bypass(q)	test_bit(QUEUE_FLAG_BYPASS, &(q)->queue_flags)
 #define blk_queue_nomerges(q)	test_bit(QUEUE_FLAG_NOMERGES, &(q)->queue_flags)
 #define blk_queue_noxmerges(q)	\
 	test_bit(QUEUE_FLAG_NOXMERGES, &(q)->queue_flags)
-- 
1.7.7.3


* [PATCH 06/17] block: extend queue bypassing to cover blkcg policies
  2012-01-22  3:25 [PATCHSET] blkcg: kill policy node and blkg->dev, take#2 Tejun Heo
                   ` (4 preceding siblings ...)
  2012-01-22  3:25 ` [PATCH 05/17] block: implement blk_queue_bypass_start/end() Tejun Heo
@ 2012-01-22  3:25 ` Tejun Heo
  2012-01-22  3:25 ` [PATCH 07/17] blkcg: make blkio_list_lock irq-safe Tejun Heo
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 45+ messages in thread
From: Tejun Heo @ 2012-01-22  3:25 UTC (permalink / raw)
  To: axboe, vgoyal; +Cc: ctalbott, rni, linux-kernel, Tejun Heo

Extend queue bypassing such that a dying queue is always bypassing and
blk-throttle is drained on bypass.  With blkcg policies updated to
test blk_queue_bypass() instead of blk_queue_dead(), this ensures that
no bio or request is held by or going through blkcg policies on a
bypassing queue.

This will be used to implement blkg cleanup on elevator switches and
policy changes.
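
The policy-side check then becomes (as in the blk-throttle hunks
below):

  /* no new groups while the queue is bypassing; covers dead queues too */
  if (unlikely(blk_queue_bypass(q)))
          return NULL;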

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-core.c     |   12 ++++++++----
 block/blk-throttle.c |    4 ++--
 2 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index c2e39de..c6c61c0 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -366,8 +366,7 @@ void blk_drain_queue(struct request_queue *q, bool drain_all)
 		spin_lock_irq(q->queue_lock);
 
 		elv_drain_elevator(q);
-		if (drain_all)
-			blk_throtl_drain(q);
+		blk_throtl_drain(q);
 
 		/*
 		 * This function might be called on a queue which failed
@@ -408,8 +407,8 @@ void blk_drain_queue(struct request_queue *q, bool drain_all)
  *
  * In bypass mode, only the dispatch FIFO queue of @q is used.  This
  * function makes @q enter bypass mode and drains all requests which were
- * issued before.  On return, it's guaranteed that no request has ELVPRIV
- * set.
+ * throttled or issued before.  On return, it's guaranteed that no request
+ * is being throttled or has ELVPRIV set.
  */
 void blk_queue_bypass_start(struct request_queue *q)
 {
@@ -454,6 +453,11 @@ void blk_cleanup_queue(struct request_queue *q)
 	queue_flag_set_unlocked(QUEUE_FLAG_DEAD, q);
 
 	spin_lock_irq(lock);
+
+	/* dead queue is permanently in bypass mode till released */
+	q->bypass_depth++;
+	queue_flag_set(QUEUE_FLAG_BYPASS, q);
+
 	queue_flag_set(QUEUE_FLAG_NOMERGES, q);
 	queue_flag_set(QUEUE_FLAG_NOXMERGES, q);
 	queue_flag_set(QUEUE_FLAG_DEAD, q);
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 5eed6a7..702c0e6 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -310,7 +310,7 @@ static struct throtl_grp * throtl_get_tg(struct throtl_data *td)
 	struct request_queue *q = td->queue;
 
 	/* no throttling for dead queue */
-	if (unlikely(blk_queue_dead(q)))
+	if (unlikely(blk_queue_bypass(q)))
 		return NULL;
 
 	rcu_read_lock();
@@ -335,7 +335,7 @@ static struct throtl_grp * throtl_get_tg(struct throtl_data *td)
 	spin_lock_irq(q->queue_lock);
 
 	/* Make sure @q is still alive */
-	if (unlikely(blk_queue_dead(q))) {
+	if (unlikely(blk_queue_bypass(q))) {
 		kfree(tg);
 		return NULL;
 	}
-- 
1.7.7.3


* [PATCH 07/17] blkcg: make blkio_list_lock irq-safe
  2012-01-22  3:25 [PATCHSET] blkcg: kill policy node and blkg->dev, take#2 Tejun Heo
                   ` (5 preceding siblings ...)
  2012-01-22  3:25 ` [PATCH 06/17] block: extend queue bypassing to cover blkcg policies Tejun Heo
@ 2012-01-22  3:25 ` Tejun Heo
  2012-01-22  3:25 ` [PATCH 08/17] blkcg: shoot down blkio_groups on elevator switch Tejun Heo
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 45+ messages in thread
From: Tejun Heo @ 2012-01-22  3:25 UTC (permalink / raw)
  To: axboe, vgoyal; +Cc: ctalbott, rni, linux-kernel, Tejun Heo

Scheduled blkcg API cleanup requires nesting queue lock under
blkio_list_lock.  Make it irq-safe.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-cgroup.c |   28 ++++++++++++++--------------
 1 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 7bfc574..cbaad8c 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1046,8 +1046,8 @@ static void blkio_update_policy_node_blkg(struct blkio_cgroup *blkcg,
 	struct blkio_group *blkg;
 	struct hlist_node *n;
 
-	spin_lock(&blkio_list_lock);
-	spin_lock_irq(&blkcg->lock);
+	spin_lock_irq(&blkio_list_lock);
+	spin_lock(&blkcg->lock);
 
 	hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node) {
 		if (pn->dev != blkg->dev || pn->plid != blkg->plid)
@@ -1055,8 +1055,8 @@ static void blkio_update_policy_node_blkg(struct blkio_cgroup *blkcg,
 		blkio_update_blkg_policy(blkcg, blkg, pn);
 	}
 
-	spin_unlock_irq(&blkcg->lock);
-	spin_unlock(&blkio_list_lock);
+	spin_unlock(&blkcg->lock);
+	spin_unlock_irq(&blkio_list_lock);
 }
 
 static int blkiocg_file_write(struct cgroup *cgrp, struct cftype *cft,
@@ -1321,8 +1321,8 @@ static int blkio_weight_write(struct blkio_cgroup *blkcg, u64 val)
 	if (val < BLKIO_WEIGHT_MIN || val > BLKIO_WEIGHT_MAX)
 		return -EINVAL;
 
-	spin_lock(&blkio_list_lock);
-	spin_lock_irq(&blkcg->lock);
+	spin_lock_irq(&blkio_list_lock);
+	spin_lock(&blkcg->lock);
 	blkcg->weight = (unsigned int)val;
 
 	hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node) {
@@ -1333,8 +1333,8 @@ static int blkio_weight_write(struct blkio_cgroup *blkcg, u64 val)
 
 		blkio_update_group_weight(blkg, blkcg->weight);
 	}
-	spin_unlock_irq(&blkcg->lock);
-	spin_unlock(&blkio_list_lock);
+	spin_unlock(&blkcg->lock);
+	spin_unlock_irq(&blkio_list_lock);
 	return 0;
 }
 
@@ -1575,13 +1575,13 @@ static void blkiocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
 		 * going away. Let all the IO controlling policies know about
 		 * this event.
 		 */
-		spin_lock(&blkio_list_lock);
+		spin_lock_irqsave(&blkio_list_lock, flags);
 		list_for_each_entry(blkiop, &blkio_list, list) {
 			if (blkiop->plid != blkg->plid)
 				continue;
 			blkiop->ops.blkio_unlink_group_fn(key, blkg);
 		}
-		spin_unlock(&blkio_list_lock);
+		spin_unlock_irqrestore(&blkio_list_lock, flags);
 	} while (1);
 
 	list_for_each_entry_safe(pn, pntmp, &blkcg->policy_list, node) {
@@ -1663,16 +1663,16 @@ static void blkiocg_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 
 void blkio_policy_register(struct blkio_policy_type *blkiop)
 {
-	spin_lock(&blkio_list_lock);
+	spin_lock_irq(&blkio_list_lock);
 	list_add_tail(&blkiop->list, &blkio_list);
-	spin_unlock(&blkio_list_lock);
+	spin_unlock_irq(&blkio_list_lock);
 }
 EXPORT_SYMBOL_GPL(blkio_policy_register);
 
 void blkio_policy_unregister(struct blkio_policy_type *blkiop)
 {
-	spin_lock(&blkio_list_lock);
+	spin_lock_irq(&blkio_list_lock);
 	list_del_init(&blkiop->list);
-	spin_unlock(&blkio_list_lock);
+	spin_unlock_irq(&blkio_list_lock);
 }
 EXPORT_SYMBOL_GPL(blkio_policy_unregister);
-- 
1.7.7.3


* [PATCH 08/17] blkcg: shoot down blkio_groups on elevator switch
  2012-01-22  3:25 [PATCHSET] blkcg: kill policy node and blkg->dev, take#2 Tejun Heo
                   ` (6 preceding siblings ...)
  2012-01-22  3:25 ` [PATCH 07/17] blkcg: make blkio_list_lock irq-safe Tejun Heo
@ 2012-01-22  3:25 ` Tejun Heo
  2012-01-23 15:20   ` Vivek Goyal
  2012-01-22  3:25 ` [PATCH 09/17] blkcg: move rcu_read_lock() outside of blkio_group get functions Tejun Heo
                   ` (8 subsequent siblings)
  16 siblings, 1 reply; 45+ messages in thread
From: Tejun Heo @ 2012-01-22  3:25 UTC (permalink / raw)
  To: axboe, vgoyal; +Cc: ctalbott, rni, linux-kernel, Tejun Heo

Elevator switch may involve changes to blkcg policies.  Shoot down all
blkg's while preparing for elevator switch.  Combined with the
previous bypass updates, this ensures that no blkg exists until
elevator switch is complete and will allow moving blkg management to
blkcg core layer.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-cgroup.c   |   12 ++++++++++++
 block/blk-cgroup.h   |    6 +++++-
 block/blk-throttle.c |    7 +++++++
 block/cfq-iosched.c  |   11 +++++++++++
 block/elevator.c     |    2 ++
 5 files changed, 37 insertions(+), 1 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index cbaad8c..f344b53 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1619,6 +1619,18 @@ done:
 	return &blkcg->css;
 }
 
+void blkcg_clear_queue(struct request_queue *q)
+{
+	struct blkio_policy_type *pol;
+
+	lockdep_assert_held(q->queue_lock);
+
+	spin_lock(&blkio_list_lock);
+	list_for_each_entry(pol, &blkio_list, list)
+		pol->ops.blkio_clear_queue_fn(q);
+	spin_unlock(&blkio_list_lock);
+}
+
 /*
  * We cannot support shared io contexts, as we have no mean to support
  * two tasks with the same ioc in two different groups without major rework
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 3551687..6fa49cb 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -203,7 +203,7 @@ extern unsigned int blkcg_get_write_iops(struct blkio_cgroup *blkcg,
 				     dev_t dev);
 
 typedef void (blkio_unlink_group_fn) (void *key, struct blkio_group *blkg);
-
+typedef void (blkio_clear_queue_fn)(struct request_queue *q);
 typedef void (blkio_update_group_weight_fn) (void *key,
 			struct blkio_group *blkg, unsigned int weight);
 typedef void (blkio_update_group_read_bps_fn) (void * key,
@@ -217,6 +217,7 @@ typedef void (blkio_update_group_write_iops_fn) (void *key,
 
 struct blkio_policy_ops {
 	blkio_unlink_group_fn *blkio_unlink_group_fn;
+	blkio_clear_queue_fn *blkio_clear_queue_fn;
 	blkio_update_group_weight_fn *blkio_update_group_weight_fn;
 	blkio_update_group_read_bps_fn *blkio_update_group_read_bps_fn;
 	blkio_update_group_write_bps_fn *blkio_update_group_write_bps_fn;
@@ -230,6 +231,8 @@ struct blkio_policy_type {
 	enum blkio_policy_id plid;
 };
 
+extern void blkcg_clear_queue(struct request_queue *q);
+
 /* Blkio controller policy registration */
 extern void blkio_policy_register(struct blkio_policy_type *);
 extern void blkio_policy_unregister(struct blkio_policy_type *);
@@ -247,6 +250,7 @@ struct blkio_group {
 struct blkio_policy_type {
 };
 
+static inline void blkcg_clear_queue(struct request_queue *q) { }
 static inline void blkio_policy_register(struct blkio_policy_type *blkiop) { }
 static inline void blkio_policy_unregister(struct blkio_policy_type *blkiop) { }
 
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 702c0e6..3bc1cc9 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1029,6 +1029,12 @@ void throtl_unlink_blkio_group(void *key, struct blkio_group *blkg)
 	spin_unlock_irqrestore(td->queue->queue_lock, flags);
 }
 
+static void throtl_clear_queue(struct request_queue *q)
+{
+	lockdep_assert_held(q->queue_lock);
+	throtl_release_tgs(q->td);
+}
+
 static void throtl_update_blkio_group_common(struct throtl_data *td,
 				struct throtl_grp *tg)
 {
@@ -1097,6 +1103,7 @@ static void throtl_shutdown_wq(struct request_queue *q)
 static struct blkio_policy_type blkio_policy_throtl = {
 	.ops = {
 		.blkio_unlink_group_fn = throtl_unlink_blkio_group,
+		.blkio_clear_queue_fn = throtl_clear_queue,
 		.blkio_update_group_read_bps_fn =
 					throtl_update_blkio_group_read_bps,
 		.blkio_update_group_write_bps_fn =
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index f3d6b04..b110cd8 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1265,6 +1265,16 @@ static void cfq_unlink_blkio_group(void *key, struct blkio_group *blkg)
 	spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
 }
 
+static struct elevator_type iosched_cfq;
+
+static void cfq_clear_queue(struct request_queue *q)
+{
+	lockdep_assert_held(q->queue_lock);
+	/* shoot down blkgs iff the current elevator is cfq */
+	if (q->elevator->type == &iosched_cfq)
+		cfq_release_cfq_groups(q->elevator->elevator_data);
+}
+
 #else /* GROUP_IOSCHED */
 static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd)
 {
@@ -3881,6 +3891,7 @@ static struct elevator_type iosched_cfq = {
 static struct blkio_policy_type blkio_policy_cfq = {
 	.ops = {
 		.blkio_unlink_group_fn =	cfq_unlink_blkio_group,
+		.blkio_clear_queue_fn = cfq_clear_queue,
 		.blkio_update_group_weight_fn =	cfq_update_blkio_group_weight,
 	},
 	.plid = BLKIO_POLICY_PROP,
diff --git a/block/elevator.c b/block/elevator.c
index 078a491..5e371e4 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -38,6 +38,7 @@
 #include <trace/events/block.h>
 
 #include "blk.h"
+#include "blk-cgroup.h"
 
 static DEFINE_SPINLOCK(elv_list_lock);
 static LIST_HEAD(elv_list);
@@ -939,6 +940,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 
 	spin_lock_irq(q->queue_lock);
 	ioc_clear_queue(q);
+	blkcg_clear_queue(q);
 	spin_unlock_irq(q->queue_lock);
 
 	/* allocate, init and register new elevator */
-- 
1.7.7.3


* [PATCH 09/17] blkcg: move rcu_read_lock() outside of blkio_group get functions
  2012-01-22  3:25 [PATCHSET] blkcg: kill policy node and blkg->dev, take#2 Tejun Heo
                   ` (7 preceding siblings ...)
  2012-01-22  3:25 ` [PATCH 08/17] blkcg: shoot down blkio_groups on elevator switch Tejun Heo
@ 2012-01-22  3:25 ` Tejun Heo
  2012-01-22  3:25 ` [PATCH 10/17] blkcg: update blkg get functions take blkio_cgroup as parameter Tejun Heo
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 45+ messages in thread
From: Tejun Heo @ 2012-01-22  3:25 UTC (permalink / raw)
  To: axboe, vgoyal; +Cc: ctalbott, rni, linux-kernel, Tejun Heo

rcu_read_lock() in throtl_get_tg() and cfq_get_cfqg() holds onto
@blkcg while looking up blkg.  For API cleanup, the next patch will
make the caller responsible for determining the @blkcg to look up the
blkg from and let it be specified as a parameter.  Move rcu read
locking out to the callers to prepare for the change.

-v2: Originally this patch was described as a fix for RCU read locking
     bug around @blkg, which Vivek pointed out to be incorrect.  It
     was from misunderstanding the role of rcu locking as protecting
     @blkg not @blkcg.  Patch description updated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-throttle.c |   18 ++++++------------
 block/cfq-iosched.c  |   11 +++++------
 2 files changed, 11 insertions(+), 18 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 3bc1cc9..16fb95b 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -313,25 +313,23 @@ static struct throtl_grp * throtl_get_tg(struct throtl_data *td)
 	if (unlikely(blk_queue_bypass(q)))
 		return NULL;
 
-	rcu_read_lock();
 	blkcg = task_blkio_cgroup(current);
 	tg = throtl_find_tg(td, blkcg);
-	if (tg) {
-		rcu_read_unlock();
+	if (tg)
 		return tg;
-	}
 
 	/*
 	 * Need to allocate a group. Allocation of group also needs allocation
 	 * of per cpu stats which in-turn takes a mutex() and can block. Hence
 	 * we need to drop rcu lock and queue_lock before we call alloc.
 	 */
-	rcu_read_unlock();
 	spin_unlock_irq(q->queue_lock);
+	rcu_read_unlock();
 
 	tg = throtl_alloc_tg(td);
 
 	/* Group allocated and queue is still alive. take the lock */
+	rcu_read_lock();
 	spin_lock_irq(q->queue_lock);
 
 	/* Make sure @q is still alive */
@@ -343,7 +341,6 @@ static struct throtl_grp * throtl_get_tg(struct throtl_data *td)
 	/*
 	 * Initialize the new group. After sleeping, read the blkcg again.
 	 */
-	rcu_read_lock();
 	blkcg = task_blkio_cgroup(current);
 
 	/*
@@ -354,7 +351,6 @@ static struct throtl_grp * throtl_get_tg(struct throtl_data *td)
 
 	if (__tg) {
 		kfree(tg);
-		rcu_read_unlock();
 		return __tg;
 	}
 
@@ -365,7 +361,6 @@ static struct throtl_grp * throtl_get_tg(struct throtl_data *td)
 	}
 
 	throtl_init_add_tg_lists(td, tg, blkcg);
-	rcu_read_unlock();
 	return tg;
 }
 
@@ -1134,7 +1129,6 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
 	 * basic fields like stats and io rates. If a group has no rules,
 	 * just update the dispatch stats in lockless manner and return.
 	 */
-
 	rcu_read_lock();
 	blkcg = task_blkio_cgroup(current);
 	tg = throtl_find_tg(td, blkcg);
@@ -1144,11 +1138,9 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
 		if (tg_no_rule_group(tg, rw)) {
 			blkiocg_update_dispatch_stats(&tg->blkg, bio->bi_size,
 					rw, rw_is_sync(bio->bi_rw));
-			rcu_read_unlock();
-			goto out;
+			goto out_unlock_rcu;
 		}
 	}
-	rcu_read_unlock();
 
 	/*
 	 * Either group has not been allocated yet or it is not an unlimited
@@ -1206,6 +1198,8 @@ queue_bio:
 
 out_unlock:
 	spin_unlock_irq(q->queue_lock);
+out_unlock_rcu:
+	rcu_read_unlock();
 out:
 	return throttled;
 }
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index b110cd8..1cc3392 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1128,13 +1128,10 @@ static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd)
 	struct cfq_group *cfqg = NULL, *__cfqg = NULL;
 	struct request_queue *q = cfqd->queue;
 
-	rcu_read_lock();
 	blkcg = task_blkio_cgroup(current);
 	cfqg = cfq_find_cfqg(cfqd, blkcg);
-	if (cfqg) {
-		rcu_read_unlock();
+	if (cfqg)
 		return cfqg;
-	}
 
 	/*
 	 * Need to allocate a group. Allocation of group also needs allocation
@@ -1164,7 +1161,6 @@ static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd)
 
 	if (__cfqg) {
 		kfree(cfqg);
-		rcu_read_unlock();
 		return __cfqg;
 	}
 
@@ -1172,7 +1168,6 @@ static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd)
 		cfqg = &cfqd->root_group;
 
 	cfq_init_add_cfqg_lists(cfqd, cfqg, blkcg);
-	rcu_read_unlock();
 	return cfqg;
 }
 
@@ -2870,6 +2865,8 @@ cfq_find_alloc_queue(struct cfq_data *cfqd, bool is_sync,
 	struct cfq_group *cfqg;
 
 retry:
+	rcu_read_lock();
+
 	cfqg = cfq_get_cfqg(cfqd);
 	cic = cfq_cic_lookup(cfqd, ioc);
 	/* cic always exists here */
@@ -2885,6 +2882,7 @@ retry:
 			cfqq = new_cfqq;
 			new_cfqq = NULL;
 		} else if (gfp_mask & __GFP_WAIT) {
+			rcu_read_unlock();
 			spin_unlock_irq(cfqd->queue->queue_lock);
 			new_cfqq = kmem_cache_alloc_node(cfq_pool,
 					gfp_mask | __GFP_ZERO,
@@ -2910,6 +2908,7 @@ retry:
 	if (new_cfqq)
 		kmem_cache_free(cfq_pool, new_cfqq);
 
+	rcu_read_unlock();
 	return cfqq;
 }
 
-- 
1.7.7.3


* [PATCH 10/17] blkcg: update blkg get functions take blkio_cgroup as parameter
  2012-01-22  3:25 [PATCHSET] blkcg: kill policy node and blkg->dev, take#2 Tejun Heo
                   ` (8 preceding siblings ...)
  2012-01-22  3:25 ` [PATCH 09/17] blkcg: move rcu_read_lock() outside of blkio_group get functions Tejun Heo
@ 2012-01-22  3:25 ` Tejun Heo
  2012-01-22  3:25 ` [PATCH 11/17] blkcg: use q and plid instead of opaque void * for blkio_group association Tejun Heo
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 45+ messages in thread
From: Tejun Heo @ 2012-01-22  3:25 UTC (permalink / raw)
  To: axboe, vgoyal; +Cc: ctalbott, rni, linux-kernel, Tejun Heo

In both blkg get functions - throtl_get_tg() and cfq_get_cfqg() -
instead of obtaining the blkcg of %current explicitly, let the caller
specify the blkcg to use as a parameter and make both functions hold
on to the blkcg.

This is part of the block cgroup interface cleanup and will help make
the blkcg API more modular.
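
The reference handling, condensed from the throtl_get_tg() hunks below
(error paths omitted):

  if (!css_tryget(&blkcg->css))   /* pin blkcg across the blocking alloc */
          return NULL;
  spin_unlock_irq(q->queue_lock);
  rcu_read_unlock();

  tg = throtl_alloc_tg(td);       /* may sleep */

  rcu_read_lock();
  spin_lock_irq(q->queue_lock);
  css_put(&blkcg->css);           /* locks reacquired, drop the pin */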

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-throttle.c |   16 +++++++---------
 block/cfq-iosched.c  |   20 ++++++++++++--------
 2 files changed, 19 insertions(+), 17 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 16fb95b..254c8cb 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -303,21 +303,23 @@ throtl_grp *throtl_find_tg(struct throtl_data *td, struct blkio_cgroup *blkcg)
 	return tg;
 }
 
-static struct throtl_grp * throtl_get_tg(struct throtl_data *td)
+static struct throtl_grp *throtl_get_tg(struct throtl_data *td,
+					struct blkio_cgroup *blkcg)
 {
 	struct throtl_grp *tg = NULL, *__tg = NULL;
-	struct blkio_cgroup *blkcg;
 	struct request_queue *q = td->queue;
 
 	/* no throttling for dead queue */
 	if (unlikely(blk_queue_bypass(q)))
 		return NULL;
 
-	blkcg = task_blkio_cgroup(current);
 	tg = throtl_find_tg(td, blkcg);
 	if (tg)
 		return tg;
 
+	if (!css_tryget(&blkcg->css))
+		return NULL;
+
 	/*
 	 * Need to allocate a group. Allocation of group also needs allocation
 	 * of per cpu stats which in-turn takes a mutex() and can block. Hence
@@ -331,6 +333,7 @@ static struct throtl_grp * throtl_get_tg(struct throtl_data *td)
 	/* Group allocated and queue is still alive. take the lock */
 	rcu_read_lock();
 	spin_lock_irq(q->queue_lock);
+	css_put(&blkcg->css);
 
 	/* Make sure @q is still alive */
 	if (unlikely(blk_queue_bypass(q))) {
@@ -339,11 +342,6 @@ static struct throtl_grp * throtl_get_tg(struct throtl_data *td)
 	}
 
 	/*
-	 * Initialize the new group. After sleeping, read the blkcg again.
-	 */
-	blkcg = task_blkio_cgroup(current);
-
-	/*
 	 * If some other thread already allocated the group while we were
 	 * not holding queue lock, free up the group
 	 */
@@ -1147,7 +1145,7 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
 	 * IO group
 	 */
 	spin_lock_irq(q->queue_lock);
-	tg = throtl_get_tg(td);
+	tg = throtl_get_tg(td, blkcg);
 	if (unlikely(!tg))
 		goto out_unlock;
 
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 1cc3392..d89d582 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1122,17 +1122,19 @@ cfq_find_cfqg(struct cfq_data *cfqd, struct blkio_cgroup *blkcg)
  * Search for the cfq group current task belongs to. request_queue lock must
  * be held.
  */
-static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd)
+static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd,
+				      struct blkio_cgroup *blkcg)
 {
-	struct blkio_cgroup *blkcg;
 	struct cfq_group *cfqg = NULL, *__cfqg = NULL;
 	struct request_queue *q = cfqd->queue;
 
-	blkcg = task_blkio_cgroup(current);
 	cfqg = cfq_find_cfqg(cfqd, blkcg);
 	if (cfqg)
 		return cfqg;
 
+	if (!css_tryget(&blkcg->css))
+		return NULL;
+
 	/*
 	 * Need to allocate a group. Allocation of group also needs allocation
 	 * of per cpu stats which in-turn takes a mutex() and can block. Hence
@@ -1142,16 +1144,14 @@ static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd)
 	 * around by the time we return. CFQ queue allocation code does
 	 * the same. It might be racy though.
 	 */
-
 	rcu_read_unlock();
 	spin_unlock_irq(q->queue_lock);
 
 	cfqg = cfq_alloc_cfqg(cfqd);
 
 	spin_lock_irq(q->queue_lock);
-
 	rcu_read_lock();
-	blkcg = task_blkio_cgroup(current);
+	css_put(&blkcg->css);
 
 	/*
 	 * If some other thread already allocated the group while we were
@@ -1271,7 +1271,8 @@ static void cfq_clear_queue(struct request_queue *q)
 }
 
 #else /* GROUP_IOSCHED */
-static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd)
+static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd,
+				      struct blkio_cgroup *blkcg)
 {
 	return &cfqd->root_group;
 }
@@ -2860,6 +2861,7 @@ static struct cfq_queue *
 cfq_find_alloc_queue(struct cfq_data *cfqd, bool is_sync,
 		     struct io_context *ioc, gfp_t gfp_mask)
 {
+	struct blkio_cgroup *blkcg;
 	struct cfq_queue *cfqq, *new_cfqq = NULL;
 	struct cfq_io_cq *cic;
 	struct cfq_group *cfqg;
@@ -2867,7 +2869,9 @@ cfq_find_alloc_queue(struct cfq_data *cfqd, bool is_sync,
 retry:
 	rcu_read_lock();
 
-	cfqg = cfq_get_cfqg(cfqd);
+	blkcg = task_blkio_cgroup(current);
+
+	cfqg = cfq_get_cfqg(cfqd, blkcg);
 	cic = cfq_cic_lookup(cfqd, ioc);
 	/* cic always exists here */
 	cfqq = cic_to_cfqq(cic, is_sync);
-- 
1.7.7.3


* [PATCH 11/17] blkcg: use q and plid instead of opaque void * for blkio_group association
  2012-01-22  3:25 [PATCHSET] blkcg: kill policy node and blkg->dev, take#2 Tejun Heo
                   ` (9 preceding siblings ...)
  2012-01-22  3:25 ` [PATCH 10/17] blkcg: update blkg get functions take blkio_cgroup as parameter Tejun Heo
@ 2012-01-22  3:25 ` Tejun Heo
  2012-01-22  3:25 ` [PATCH 12/17] blkcg: add blkio_policy[] array and allow one policy per policy ID Tejun Heo
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 45+ messages in thread
From: Tejun Heo @ 2012-01-22  3:25 UTC (permalink / raw)
  To: axboe, vgoyal; +Cc: ctalbott, rni, linux-kernel, Tejun Heo

blkio_group is the association between a block cgroup and a queue for
a given policy.  Using an opaque void * for the association makes
things confusing and hinders factoring of common code.  Use
request_queue * and, if necessary, policy id instead.

This will help block cgroup API cleanup.
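
For example, a lookup that used to take an opaque key now reads (per
the blkiocg_lookup_group() change below):

  blkg = blkiocg_lookup_group(blkcg, q, BLKIO_POLICY_PROP);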

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-cgroup.c   |   32 +++++++++++++++-----------------
 block/blk-cgroup.h   |   22 ++++++++++++----------
 block/blk-throttle.c |   50 +++++++++++++++++++++++---------------------------
 block/cfq-iosched.c  |   30 ++++++++++++++++--------------
 block/cfq.h          |    7 ++++---
 5 files changed, 70 insertions(+), 71 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index f344b53..572499a 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -128,7 +128,7 @@ blkio_update_group_weight(struct blkio_group *blkg, unsigned int weight)
 		if (blkiop->plid != blkg->plid)
 			continue;
 		if (blkiop->ops.blkio_update_group_weight_fn)
-			blkiop->ops.blkio_update_group_weight_fn(blkg->key,
+			blkiop->ops.blkio_update_group_weight_fn(blkg->q,
 							blkg, weight);
 	}
 }
@@ -146,12 +146,12 @@ static inline void blkio_update_group_bps(struct blkio_group *blkg, u64 bps,
 
 		if (fileid == BLKIO_THROTL_read_bps_device
 		    && blkiop->ops.blkio_update_group_read_bps_fn)
-			blkiop->ops.blkio_update_group_read_bps_fn(blkg->key,
+			blkiop->ops.blkio_update_group_read_bps_fn(blkg->q,
 								blkg, bps);
 
 		if (fileid == BLKIO_THROTL_write_bps_device
 		    && blkiop->ops.blkio_update_group_write_bps_fn)
-			blkiop->ops.blkio_update_group_write_bps_fn(blkg->key,
+			blkiop->ops.blkio_update_group_write_bps_fn(blkg->q,
 								blkg, bps);
 	}
 }
@@ -169,12 +169,12 @@ static inline void blkio_update_group_iops(struct blkio_group *blkg,
 
 		if (fileid == BLKIO_THROTL_read_iops_device
 		    && blkiop->ops.blkio_update_group_read_iops_fn)
-			blkiop->ops.blkio_update_group_read_iops_fn(blkg->key,
+			blkiop->ops.blkio_update_group_read_iops_fn(blkg->q,
 								blkg, iops);
 
 		if (fileid == BLKIO_THROTL_write_iops_device
 		    && blkiop->ops.blkio_update_group_write_iops_fn)
-			blkiop->ops.blkio_update_group_write_iops_fn(blkg->key,
+			blkiop->ops.blkio_update_group_write_iops_fn(blkg->q,
 								blkg,iops);
 	}
 }
@@ -477,14 +477,14 @@ int blkio_alloc_blkg_stats(struct blkio_group *blkg)
 EXPORT_SYMBOL_GPL(blkio_alloc_blkg_stats);
 
 void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
-		struct blkio_group *blkg, void *key, dev_t dev,
+		struct blkio_group *blkg, struct request_queue *q, dev_t dev,
 		enum blkio_policy_id plid)
 {
 	unsigned long flags;
 
 	spin_lock_irqsave(&blkcg->lock, flags);
 	spin_lock_init(&blkg->stats_lock);
-	rcu_assign_pointer(blkg->key, key);
+	rcu_assign_pointer(blkg->q, q);
 	blkg->blkcg_id = css_id(&blkcg->css);
 	hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
 	blkg->plid = plid;
@@ -530,18 +530,16 @@ int blkiocg_del_blkio_group(struct blkio_group *blkg)
 EXPORT_SYMBOL_GPL(blkiocg_del_blkio_group);
 
 /* called under rcu_read_lock(). */
-struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg, void *key)
+struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg,
+					 struct request_queue *q,
+					 enum blkio_policy_id plid)
 {
 	struct blkio_group *blkg;
 	struct hlist_node *n;
-	void *__key;
 
-	hlist_for_each_entry_rcu(blkg, n, &blkcg->blkg_list, blkcg_node) {
-		__key = blkg->key;
-		if (__key == key)
+	hlist_for_each_entry_rcu(blkg, n, &blkcg->blkg_list, blkcg_node)
+		if (blkg->q == q && blkg->plid == plid)
 			return blkg;
-	}
-
 	return NULL;
 }
 EXPORT_SYMBOL_GPL(blkiocg_lookup_group);
@@ -1550,7 +1548,7 @@ static void blkiocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
 	struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
 	unsigned long flags;
 	struct blkio_group *blkg;
-	void *key;
+	struct request_queue *q;
 	struct blkio_policy_type *blkiop;
 	struct blkio_policy_node *pn, *pntmp;
 
@@ -1565,7 +1563,7 @@ static void blkiocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
 
 		blkg = hlist_entry(blkcg->blkg_list.first, struct blkio_group,
 					blkcg_node);
-		key = rcu_dereference(blkg->key);
+		q = rcu_dereference(blkg->q);
 		__blkiocg_del_blkio_group(blkg);
 
 		spin_unlock_irqrestore(&blkcg->lock, flags);
@@ -1579,7 +1577,7 @@ static void blkiocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
 		list_for_each_entry(blkiop, &blkio_list, list) {
 			if (blkiop->plid != blkg->plid)
 				continue;
-			blkiop->ops.blkio_unlink_group_fn(key, blkg);
+			blkiop->ops.blkio_unlink_group_fn(q, blkg);
 		}
 		spin_unlock_irqrestore(&blkio_list_lock, flags);
 	} while (1);
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 6fa49cb..d91718b 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -153,8 +153,8 @@ struct blkio_group_stats_cpu {
 };
 
 struct blkio_group {
-	/* An rcu protected unique identifier for the group */
-	void *key;
+	/* Pointer to the associated request_queue, RCU protected */
+	struct request_queue __rcu *q;
 	struct hlist_node blkcg_node;
 	unsigned short blkcg_id;
 	/* Store cgroup path */
@@ -202,17 +202,18 @@ extern unsigned int blkcg_get_read_iops(struct blkio_cgroup *blkcg,
 extern unsigned int blkcg_get_write_iops(struct blkio_cgroup *blkcg,
 				     dev_t dev);
 
-typedef void (blkio_unlink_group_fn) (void *key, struct blkio_group *blkg);
+typedef void (blkio_unlink_group_fn)(struct request_queue *q,
+			struct blkio_group *blkg);
 typedef void (blkio_clear_queue_fn)(struct request_queue *q);
-typedef void (blkio_update_group_weight_fn) (void *key,
+typedef void (blkio_update_group_weight_fn)(struct request_queue *q,
 			struct blkio_group *blkg, unsigned int weight);
-typedef void (blkio_update_group_read_bps_fn) (void * key,
+typedef void (blkio_update_group_read_bps_fn)(struct request_queue *q,
 			struct blkio_group *blkg, u64 read_bps);
-typedef void (blkio_update_group_write_bps_fn) (void *key,
+typedef void (blkio_update_group_write_bps_fn)(struct request_queue *q,
 			struct blkio_group *blkg, u64 write_bps);
-typedef void (blkio_update_group_read_iops_fn) (void *key,
+typedef void (blkio_update_group_read_iops_fn)(struct request_queue *q,
 			struct blkio_group *blkg, unsigned int read_iops);
-typedef void (blkio_update_group_write_iops_fn) (void *key,
+typedef void (blkio_update_group_write_iops_fn)(struct request_queue *q,
 			struct blkio_group *blkg, unsigned int write_iops);
 
 struct blkio_policy_ops {
@@ -306,12 +307,13 @@ extern struct blkio_cgroup blkio_root_cgroup;
 extern struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup);
 extern struct blkio_cgroup *task_blkio_cgroup(struct task_struct *tsk);
 extern void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
-	struct blkio_group *blkg, void *key, dev_t dev,
+	struct blkio_group *blkg, struct request_queue *q, dev_t dev,
 	enum blkio_policy_id plid);
 extern int blkio_alloc_blkg_stats(struct blkio_group *blkg);
 extern int blkiocg_del_blkio_group(struct blkio_group *blkg);
 extern struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg,
-						void *key);
+						struct request_queue *q,
+						enum blkio_policy_id plid);
 void blkiocg_update_timeslice_used(struct blkio_group *blkg,
 					unsigned long time,
 					unsigned long unaccounted_time);
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 254c8cb..47680f5 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -252,7 +252,7 @@ static void throtl_init_add_tg_lists(struct throtl_data *td,
 	__throtl_tg_fill_dev_details(td, tg);
 
 	/* Add group onto cgroup list */
-	blkiocg_add_blkio_group(blkcg, &tg->blkg, (void *)td,
+	blkiocg_add_blkio_group(blkcg, &tg->blkg, td->queue,
 				tg->blkg.dev, BLKIO_POLICY_THROTL);
 
 	tg->bps[READ] = blkcg_get_read_bps(blkcg, tg->blkg.dev);
@@ -288,7 +288,6 @@ static struct
 throtl_grp *throtl_find_tg(struct throtl_data *td, struct blkio_cgroup *blkcg)
 {
 	struct throtl_grp *tg = NULL;
-	void *key = td;
 
 	/*
 	 * This is the common case when there are no blkio cgroups.
@@ -297,7 +296,8 @@ throtl_grp *throtl_find_tg(struct throtl_data *td, struct blkio_cgroup *blkcg)
 	if (blkcg == &blkio_root_cgroup)
 		tg = td->root_tg;
 	else
-		tg = tg_of_blkg(blkiocg_lookup_group(blkcg, key));
+		tg = tg_of_blkg(blkiocg_lookup_group(blkcg, td->queue,
+						     BLKIO_POLICY_THROTL));
 
 	__throtl_tg_fill_dev_details(td, tg);
 	return tg;
@@ -1004,22 +1004,22 @@ static void throtl_release_tgs(struct throtl_data *td)
  * no new IO will come in this group. So get rid of this group as soon as
  * any pending IO in the group is finished.
  *
- * This function is called under rcu_read_lock(). key is the rcu protected
- * pointer. That means "key" is a valid throtl_data pointer as long as we are
- * rcu read lock.
+ * This function is called under rcu_read_lock(). @q is the rcu protected
+ * pointer. That means @q is a valid request_queue pointer as long as we
+ * are rcu read lock.
  *
- * "key" was fetched from blkio_group under blkio_cgroup->lock. That means
+ * @q was fetched from blkio_group under blkio_cgroup->lock. That means
  * it should not be NULL as even if queue was going away, cgroup deltion
  * path got to it first.
  */
-void throtl_unlink_blkio_group(void *key, struct blkio_group *blkg)
+void throtl_unlink_blkio_group(struct request_queue *q,
+			       struct blkio_group *blkg)
 {
 	unsigned long flags;
-	struct throtl_data *td = key;
 
-	spin_lock_irqsave(td->queue->queue_lock, flags);
-	throtl_destroy_tg(td, tg_of_blkg(blkg));
-	spin_unlock_irqrestore(td->queue->queue_lock, flags);
+	spin_lock_irqsave(q->queue_lock, flags);
+	throtl_destroy_tg(q->td, tg_of_blkg(blkg));
+	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 
 static void throtl_clear_queue(struct request_queue *q)
@@ -1038,52 +1038,48 @@ static void throtl_update_blkio_group_common(struct throtl_data *td,
 }
 
 /*
- * For all update functions, key should be a valid pointer because these
+ * For all update functions, @q should be a valid pointer because these
  * update functions are called under blkcg_lock, that means, blkg is
- * valid and in turn key is valid. queue exit path can not race because
+ * valid and in turn @q is valid. queue exit path can not race because
  * of blkcg_lock
  *
  * Can not take queue lock in update functions as queue lock under blkcg_lock
  * is not allowed. Under other paths we take blkcg_lock under queue_lock.
  */
-static void throtl_update_blkio_group_read_bps(void *key,
+static void throtl_update_blkio_group_read_bps(struct request_queue *q,
 				struct blkio_group *blkg, u64 read_bps)
 {
-	struct throtl_data *td = key;
 	struct throtl_grp *tg = tg_of_blkg(blkg);
 
 	tg->bps[READ] = read_bps;
-	throtl_update_blkio_group_common(td, tg);
+	throtl_update_blkio_group_common(q->td, tg);
 }
 
-static void throtl_update_blkio_group_write_bps(void *key,
+static void throtl_update_blkio_group_write_bps(struct request_queue *q,
 				struct blkio_group *blkg, u64 write_bps)
 {
-	struct throtl_data *td = key;
 	struct throtl_grp *tg = tg_of_blkg(blkg);
 
 	tg->bps[WRITE] = write_bps;
-	throtl_update_blkio_group_common(td, tg);
+	throtl_update_blkio_group_common(q->td, tg);
 }
 
-static void throtl_update_blkio_group_read_iops(void *key,
+static void throtl_update_blkio_group_read_iops(struct request_queue *q,
 			struct blkio_group *blkg, unsigned int read_iops)
 {
-	struct throtl_data *td = key;
 	struct throtl_grp *tg = tg_of_blkg(blkg);
 
 	tg->iops[READ] = read_iops;
-	throtl_update_blkio_group_common(td, tg);
+	throtl_update_blkio_group_common(q->td, tg);
 }
 
-static void throtl_update_blkio_group_write_iops(void *key,
+static void throtl_update_blkio_group_write_iops(struct request_queue *q,
 			struct blkio_group *blkg, unsigned int write_iops)
 {
-	struct throtl_data *td = key;
 	struct throtl_grp *tg = tg_of_blkg(blkg);
 
 	tg->iops[WRITE] = write_iops;
-	throtl_update_blkio_group_common(td, tg);
+	throtl_update_blkio_group_common(q->td, tg);
 }
 
 static void throtl_shutdown_wq(struct request_queue *q)
@@ -1290,7 +1286,7 @@ void blk_throtl_exit(struct request_queue *q)
 	spin_unlock_irq(q->queue_lock);
 
 	/*
-	 * Wait for tg->blkg->key accessors to exit their grace periods.
+	 * Wait for tg->blkg->q accessors to exit their grace periods.
 	 * Do this wait only if there are other undestroyed groups out
 	 * there (other than root group). This can happen if cgroup deletion
 	 * path claimed the responsibility of cleaning up a group before
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index d89d582..ec6ade7 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1020,7 +1020,8 @@ static inline struct cfq_group *cfqg_of_blkg(struct blkio_group *blkg)
 	return NULL;
 }
 
-static void cfq_update_blkio_group_weight(void *key, struct blkio_group *blkg,
+static void cfq_update_blkio_group_weight(struct request_queue *q,
+					  struct blkio_group *blkg,
 					  unsigned int weight)
 {
 	struct cfq_group *cfqg = cfqg_of_blkg(blkg);
@@ -1043,10 +1044,10 @@ static void cfq_init_add_cfqg_lists(struct cfq_data *cfqd,
 	if (bdi->dev) {
 		sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
 		cfq_blkiocg_add_blkio_group(blkcg, &cfqg->blkg,
-					(void *)cfqd, MKDEV(major, minor));
+					cfqd->queue, MKDEV(major, minor));
 	} else
 		cfq_blkiocg_add_blkio_group(blkcg, &cfqg->blkg,
-					(void *)cfqd, 0);
+					cfqd->queue, 0);
 
 	cfqd->nr_blkcg_linked_grps++;
 	cfqg->weight = blkcg_get_weight(blkcg, cfqg->blkg.dev);
@@ -1097,7 +1098,6 @@ static struct cfq_group *
 cfq_find_cfqg(struct cfq_data *cfqd, struct blkio_cgroup *blkcg)
 {
 	struct cfq_group *cfqg = NULL;
-	void *key = cfqd;
 	struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
 	unsigned int major, minor;
 
@@ -1108,7 +1108,8 @@ cfq_find_cfqg(struct cfq_data *cfqd, struct blkio_cgroup *blkcg)
 	if (blkcg == &blkio_root_cgroup)
 		cfqg = &cfqd->root_group;
 	else
-		cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
+		cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, cfqd->queue,
+							 BLKIO_POLICY_PROP));
 
 	if (cfqg && !cfqg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
 		sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
@@ -1243,21 +1244,22 @@ static void cfq_release_cfq_groups(struct cfq_data *cfqd)
  * any pending IO in the group is finished.
  *
  * This function is called under rcu_read_lock(). key is the rcu protected
- * pointer. That means "key" is a valid cfq_data pointer as long as we are rcu
- * read lock.
+ * pointer. That means @q is a valid request_queue pointer as long as we
+ * are rcu read lock.
  *
- * "key" was fetched from blkio_group under blkio_cgroup->lock. That means
+ * @q was fetched from blkio_group under blkio_cgroup->lock. That means
  * it should not be NULL as even if elevator was exiting, cgroup deltion
  * path got to it first.
  */
-static void cfq_unlink_blkio_group(void *key, struct blkio_group *blkg)
+static void cfq_unlink_blkio_group(struct request_queue *q,
+				   struct blkio_group *blkg)
 {
-	unsigned long  flags;
-	struct cfq_data *cfqd = key;
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+	unsigned long flags;
 
-	spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+	spin_lock_irqsave(q->queue_lock, flags);
 	cfq_destroy_cfqg(cfqd, cfqg_of_blkg(blkg));
-	spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
+	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 
 static struct elevator_type iosched_cfq;
@@ -3717,7 +3719,7 @@ static int cfq_init_queue(struct request_queue *q)
 	rcu_read_lock();
 
 	cfq_blkiocg_add_blkio_group(&blkio_root_cgroup, &cfqg->blkg,
-					(void *)cfqd, 0);
+				    cfqd->queue, 0);
 	rcu_read_unlock();
 	cfqd->nr_blkcg_linked_grps++;
 
diff --git a/block/cfq.h b/block/cfq.h
index 2a15592..343b78a 100644
--- a/block/cfq.h
+++ b/block/cfq.h
@@ -68,8 +68,9 @@ static inline void cfq_blkiocg_update_completion_stats(struct blkio_group *blkg,
 }
 
 static inline void cfq_blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
-			struct blkio_group *blkg, void *key, dev_t dev) {
-	blkiocg_add_blkio_group(blkcg, blkg, key, dev, BLKIO_POLICY_PROP);
+		struct blkio_group *blkg, struct request_queue *q, dev_t dev)
+{
+	blkiocg_add_blkio_group(blkcg, blkg, q, dev, BLKIO_POLICY_PROP);
 }
 
 static inline int cfq_blkiocg_del_blkio_group(struct blkio_group *blkg)
@@ -105,7 +106,7 @@ static inline void cfq_blkiocg_update_dispatch_stats(struct blkio_group *blkg,
 static inline void cfq_blkiocg_update_completion_stats(struct blkio_group *blkg, uint64_t start_time, uint64_t io_start_time, bool direction, bool sync) {}
 
 static inline void cfq_blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
-			struct blkio_group *blkg, void *key, dev_t dev) {}
+		struct blkio_group *blkg, struct request_queue *q, dev_t dev) {}
 static inline int cfq_blkiocg_del_blkio_group(struct blkio_group *blkg)
 {
 	return 0;
-- 
1.7.7.3



* [PATCH 12/17] blkcg: add blkio_policy[] array and allow one policy per policy ID
  2012-01-22  3:25 [PATCHSET] blkcg: kill policy node and blkg->dev, take#2 Tejun Heo
                   ` (10 preceding siblings ...)
  2012-01-22  3:25 ` [PATCH 11/17] blkcg: use q and plid instead of opaque void * for blkio_group association Tejun Heo
@ 2012-01-22  3:25 ` Tejun Heo
  2012-01-22  3:25 ` [PATCH 13/17] blkcg: use the usual get blkg path for root blkio_group Tejun Heo
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 45+ messages in thread
From: Tejun Heo @ 2012-01-22  3:25 UTC (permalink / raw)
  To: axboe, vgoyal; +Cc: ctalbott, rni, linux-kernel, Tejun Heo

Block cgroup policies are maintained in a linked list and,
theoretically, multiple policies sharing the same policy ID are
allowed.

This patch temporarily restricts each plid to a single policy and adds
a blkio_policy[] array which indexes registered policy types by plid.
Both the restriction and the blkio_policy[] array are transitional and
will be removed once the API cleanup is complete.
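
For illustration, a standalone sketch of the transitional scheme
(locking and the existing policy list are left out; this is only the
shape of it, not the kernel code):

#include <assert.h>
#include <stddef.h>

enum blkio_policy_id {
	BLKIO_POLICY_PROP,
	BLKIO_POLICY_THROTL,

	BLKIO_NR_POLICIES,
};

struct blkio_policy_type {
	enum blkio_policy_id plid;
	/* ops, list linkage, ... */
};

/* one slot per plid; registration fills it, unregistration clears it */
static struct blkio_policy_type *blkio_policy_sketch[BLKIO_NR_POLICIES];

static void register_sketch(struct blkio_policy_type *pol)
{
	assert(!blkio_policy_sketch[pol->plid]);	/* one policy per plid */
	blkio_policy_sketch[pol->plid] = pol;
}

static void unregister_sketch(struct blkio_policy_type *pol)
{
	assert(blkio_policy_sketch[pol->plid] == pol);
	blkio_policy_sketch[pol->plid] = NULL;
}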

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-cgroup.c |   10 ++++++++++
 block/blk-cgroup.h |    2 ++
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 572499a..a175551 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -28,6 +28,8 @@ static LIST_HEAD(blkio_list);
 struct blkio_cgroup blkio_root_cgroup = { .weight = 2*BLKIO_WEIGHT_DEFAULT };
 EXPORT_SYMBOL_GPL(blkio_root_cgroup);
 
+static struct blkio_policy_type *blkio_policy[BLKIO_NR_POLICIES];
+
 static struct cgroup_subsys_state *blkiocg_create(struct cgroup_subsys *,
 						  struct cgroup *);
 static int blkiocg_can_attach(struct cgroup_subsys *, struct cgroup *,
@@ -1674,7 +1676,11 @@ static void blkiocg_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 void blkio_policy_register(struct blkio_policy_type *blkiop)
 {
 	spin_lock_irq(&blkio_list_lock);
+
+	BUG_ON(blkio_policy[blkiop->plid]);
+	blkio_policy[blkiop->plid] = blkiop;
 	list_add_tail(&blkiop->list, &blkio_list);
+
 	spin_unlock_irq(&blkio_list_lock);
 }
 EXPORT_SYMBOL_GPL(blkio_policy_register);
@@ -1682,7 +1688,11 @@ EXPORT_SYMBOL_GPL(blkio_policy_register);
 void blkio_policy_unregister(struct blkio_policy_type *blkiop)
 {
 	spin_lock_irq(&blkio_list_lock);
+
+	BUG_ON(blkio_policy[blkiop->plid] != blkiop);
+	blkio_policy[blkiop->plid] = NULL;
 	list_del_init(&blkiop->list);
+
 	spin_unlock_irq(&blkio_list_lock);
 }
 EXPORT_SYMBOL_GPL(blkio_policy_unregister);
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index d91718b..fd72890 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -19,6 +19,8 @@
 enum blkio_policy_id {
 	BLKIO_POLICY_PROP = 0,		/* Proportional Bandwidth division */
 	BLKIO_POLICY_THROTL,		/* Throttling */
+
+	BLKIO_NR_POLICIES,
 };
 
 /* Max limits for throttle policy */
-- 
1.7.7.3



* [PATCH 13/17] blkcg: use the usual get blkg path for root blkio_group
  2012-01-22  3:25 [PATCHSET] blkcg: kill policy node and blkg->dev, take#2 Tejun Heo
                   ` (11 preceding siblings ...)
  2012-01-22  3:25 ` [PATCH 12/17] blkcg: add blkio_policy[] array and allow one policy per policy ID Tejun Heo
@ 2012-01-22  3:25 ` Tejun Heo
  2012-01-22  3:25 ` [PATCH 14/17] blkcg: factor out blkio_group creation Tejun Heo
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 45+ messages in thread
From: Tejun Heo @ 2012-01-22  3:25 UTC (permalink / raw)
  To: axboe, vgoyal; +Cc: ctalbott, rni, linux-kernel, Tejun Heo

For the root blkg, blk_throtl_init() was calling throtl_alloc_tg()
explicitly and cfq_init_queue() was manually initializing the embedded
cfqd->root_group, adding needlessly divergent code paths to blkg
handling.

Make both use the usual blkio_group get functions - throtl_get_tg()
and cfq_get_cfqg() - for the root blkio_group too.  Note that the
blk_throtl_init() call site is moved further down in
blk_alloc_queue_node() so that @q is sufficiently initialized for
throtl_get_tg().

This simplifies root blkg handling noticeably for cfq and will allow
further modularization of the blkcg API.

-v2: Vivek pointed out that using cfq_get_cfqg() won't work if
     CONFIG_CFQ_GROUP_IOSCHED is disabled.  Fix it by factoring out
     initialization of the base part of cfqg into cfq_init_cfqg_base()
     and allocating/initializing/freeing it explicitly if
     !CONFIG_CFQ_GROUP_IOSCHED.
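
The shape of the resulting init code, as a simplified userspace sketch
(stand-in types, no locking; not the kernel code): the root group now
comes from the same lookup/create helper as every other group and init
only has to check the result.

#include <stdlib.h>

/* heavily reduced stand-ins for throtl_data / throtl_grp */
struct group { int weight; };
struct tdata { struct group *root; };

/*
 * Stands in for throtl_get_tg() / cfq_get_cfqg(); in the kernel this
 * runs under rcu_read_lock() and q->queue_lock and may return an
 * already existing group.
 */
static struct group *lookup_create_group(void)
{
	return calloc(1, sizeof(struct group));
}

static int init_root_sketch(struct tdata *td)
{
	td->root = lookup_create_group();	/* same path as any cgroup */
	if (!td->root)
		return -1;			/* -ENOMEM in the kernel */
	return 0;
}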

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-core.c     |    6 +-
 block/blk-throttle.c |   18 ++++----
 block/cfq-iosched.c  |  105 +++++++++++++++++++++++++-------------------------
 3 files changed, 65 insertions(+), 64 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index c6c61c0..1b73d06 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -538,9 +538,6 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	if (err)
 		goto fail_id;
 
-	if (blk_throtl_init(q))
-		goto fail_id;
-
 	setup_timer(&q->backing_dev_info.laptop_mode_wb_timer,
 		    laptop_mode_timer_fn, (unsigned long) q);
 	setup_timer(&q->timeout, blk_rq_timed_out_timer, (unsigned long) q);
@@ -562,6 +559,9 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	 */
 	q->queue_lock = &q->__queue_lock;
 
+	if (blk_throtl_init(q))
+		goto fail_id;
+
 	return q;
 
 fail_id:
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 47680f5..779d20d 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1236,7 +1236,6 @@ void blk_throtl_drain(struct request_queue *q)
 int blk_throtl_init(struct request_queue *q)
 {
 	struct throtl_data *td;
-	struct throtl_grp *tg;
 
 	td = kzalloc_node(sizeof(*td), GFP_KERNEL, q->node);
 	if (!td)
@@ -1249,19 +1248,20 @@ int blk_throtl_init(struct request_queue *q)
 
 	/* alloc and Init root group. */
 	td->queue = q;
-	tg = throtl_alloc_tg(td);
 
-	if (!tg) {
-		kfree(td);
-		return -ENOMEM;
-	}
+	rcu_read_lock();
+	spin_lock_irq(q->queue_lock);
 
-	td->root_tg = tg;
+	td->root_tg = throtl_get_tg(td, &blkio_root_cgroup);
 
-	rcu_read_lock();
-	throtl_init_add_tg_lists(td, tg, &blkio_root_cgroup);
+	spin_unlock_irq(q->queue_lock);
 	rcu_read_unlock();
 
+	if (!td->root_tg) {
+		kfree(td);
+		return -ENOMEM;
+	}
+
 	/* Attach throtl data to request queue */
 	q->td = td;
 	return 0;
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index ec6ade7..8ad6395 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -229,7 +229,7 @@ struct cfq_data {
 	struct request_queue *queue;
 	/* Root service tree for cfq_groups */
 	struct cfq_rb_root grp_service_tree;
-	struct cfq_group root_group;
+	struct cfq_group *root_group;
 
 	/*
 	 * The priority currently being served
@@ -1012,6 +1012,25 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
 	cfq_blkiocg_set_start_empty_time(&cfqg->blkg);
 }
 
+/**
+ * cfq_init_cfqg_base - initialize base part of a cfq_group
+ * @cfqg: cfq_group to initialize
+ *
+ * Initialize the base part which is used whether %CONFIG_CFQ_GROUP_IOSCHED
+ * is enabled or not.
+ */
+static void cfq_init_cfqg_base(struct cfq_group *cfqg)
+{
+	struct cfq_rb_root *st;
+	int i, j;
+
+	for_each_cfqg_st(cfqg, i, j, st)
+		*st = CFQ_RB_ROOT;
+	RB_CLEAR_NODE(&cfqg->rb_node);
+
+	cfqg->ttime.last_end_request = jiffies;
+}
+
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
 static inline struct cfq_group *cfqg_of_blkg(struct blkio_group *blkg)
 {
@@ -1063,19 +1082,14 @@ static void cfq_init_add_cfqg_lists(struct cfq_data *cfqd,
  */
 static struct cfq_group * cfq_alloc_cfqg(struct cfq_data *cfqd)
 {
-	struct cfq_group *cfqg = NULL;
-	int i, j, ret;
-	struct cfq_rb_root *st;
+	struct cfq_group *cfqg;
+	int ret;
 
 	cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->queue->node);
 	if (!cfqg)
 		return NULL;
 
-	for_each_cfqg_st(cfqg, i, j, st)
-		*st = CFQ_RB_ROOT;
-	RB_CLEAR_NODE(&cfqg->rb_node);
-
-	cfqg->ttime.last_end_request = jiffies;
+	cfq_init_cfqg_base(cfqg);
 
 	/*
 	 * Take the initial reference that will be released on destroy
@@ -1106,7 +1120,7 @@ cfq_find_cfqg(struct cfq_data *cfqd, struct blkio_cgroup *blkcg)
 	 * Avoid lookup in this case
 	 */
 	if (blkcg == &blkio_root_cgroup)
-		cfqg = &cfqd->root_group;
+		cfqg = cfqd->root_group;
 	else
 		cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, cfqd->queue,
 							 BLKIO_POLICY_PROP));
@@ -1166,7 +1180,7 @@ static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd,
 	}
 
 	if (!cfqg)
-		cfqg = &cfqd->root_group;
+		cfqg = cfqd->root_group;
 
 	cfq_init_add_cfqg_lists(cfqd, cfqg, blkcg);
 	return cfqg;
@@ -1182,7 +1196,7 @@ static void cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg)
 {
 	/* Currently, all async queues are mapped to root group */
 	if (!cfq_cfqq_sync(cfqq))
-		cfqg = &cfqq->cfqd->root_group;
+		cfqg = cfqq->cfqd->root_group;
 
 	cfqq->cfqg = cfqg;
 	/* cfqq reference on cfqg */
@@ -1276,7 +1290,7 @@ static void cfq_clear_queue(struct request_queue *q)
 static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd,
 				      struct blkio_cgroup *blkcg)
 {
-	return &cfqd->root_group;
+	return cfqd->root_group;
 }
 
 static inline struct cfq_group *cfq_ref_get_cfqg(struct cfq_group *cfqg)
@@ -3670,9 +3684,8 @@ static void cfq_exit_queue(struct elevator_queue *e)
 	if (wait)
 		synchronize_rcu();
 
-#ifdef CONFIG_CFQ_GROUP_IOSCHED
-	/* Free up per cpu stats for root group */
-	free_percpu(cfqd->root_group.blkg.stats_cpu);
+#ifndef CONFIG_CFQ_GROUP_IOSCHED
+	kfree(cfqd->root_group);
 #endif
 	kfree(cfqd);
 }
@@ -3680,52 +3693,40 @@ static void cfq_exit_queue(struct elevator_queue *e)
 static int cfq_init_queue(struct request_queue *q)
 {
 	struct cfq_data *cfqd;
-	int i, j;
-	struct cfq_group *cfqg;
-	struct cfq_rb_root *st;
+	int i;
 
 	cfqd = kmalloc_node(sizeof(*cfqd), GFP_KERNEL | __GFP_ZERO, q->node);
 	if (!cfqd)
 		return -ENOMEM;
 
+	cfqd->queue = q;
+	q->elevator->elevator_data = cfqd;
+
 	/* Init root service tree */
 	cfqd->grp_service_tree = CFQ_RB_ROOT;
 
-	/* Init root group */
-	cfqg = &cfqd->root_group;
-	for_each_cfqg_st(cfqg, i, j, st)
-		*st = CFQ_RB_ROOT;
-	RB_CLEAR_NODE(&cfqg->rb_node);
-
-	/* Give preference to root group over other groups */
-	cfqg->weight = 2*BLKIO_WEIGHT_DEFAULT;
-
+	/* Init root group and prefer root group over other groups by default */
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
-	/*
-	 * Set root group reference to 2. One reference will be dropped when
-	 * all groups on cfqd->cfqg_list are being deleted during queue exit.
-	 * Other reference will remain there as we don't want to delete this
-	 * group as it is statically allocated and gets destroyed when
-	 * throtl_data goes away.
-	 */
-	cfqg->ref = 2;
+	rcu_read_lock();
+	spin_lock_irq(q->queue_lock);
 
-	if (blkio_alloc_blkg_stats(&cfqg->blkg)) {
-		kfree(cfqg);
+	cfqd->root_group = cfq_get_cfqg(cfqd, &blkio_root_cgroup);
+
+	spin_unlock_irq(q->queue_lock);
+	rcu_read_unlock();
+#else
+	cfqd->root_group = kzalloc_node(sizeof(*cfqd->root_group),
+					GFP_KERNEL, cfqd->queue->node);
+	if (cfqd->root_group)
+		cfq_init_cfqg_base(cfqd->root_group);
+#endif
+	if (!cfqd->root_group) {
 		kfree(cfqd);
 		return -ENOMEM;
 	}
 
-	rcu_read_lock();
+	cfqd->root_group->weight = 2*BLKIO_WEIGHT_DEFAULT;
 
-	cfq_blkiocg_add_blkio_group(&blkio_root_cgroup, &cfqg->blkg,
-				    cfqd->queue, 0);
-	rcu_read_unlock();
-	cfqd->nr_blkcg_linked_grps++;
-
-	/* Add group on cfqd->cfqg_list */
-	hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
-#endif
 	/*
 	 * Not strictly needed (since RB_ROOT just clears the node and we
 	 * zeroed cfqd on alloc), but better be safe in case someone decides
@@ -3737,14 +3738,14 @@ static int cfq_init_queue(struct request_queue *q)
 	/*
 	 * Our fallback cfqq if cfq_find_alloc_queue() runs into OOM issues.
 	 * Grab a permanent reference to it, so that the normal code flow
-	 * will not attempt to free it.
+	 * will not attempt to free it.  oom_cfqq is linked to root_group
+	 * but shouldn't hold a reference as it'll never be unlinked.  Lose
+	 * the reference from linking right away.
 	 */
 	cfq_init_cfqq(cfqd, &cfqd->oom_cfqq, 1, 0);
 	cfqd->oom_cfqq.ref++;
-	cfq_link_cfqq_cfqg(&cfqd->oom_cfqq, &cfqd->root_group);
-
-	cfqd->queue = q;
-	q->elevator->elevator_data = cfqd;
+	cfq_link_cfqq_cfqg(&cfqd->oom_cfqq, cfqd->root_group);
+	cfq_put_cfqg(cfqd->root_group);
 
 	init_timer(&cfqd->idle_slice_timer);
 	cfqd->idle_slice_timer.function = cfq_idle_slice_timer;
-- 
1.7.7.3



* [PATCH 14/17] blkcg: factor out blkio_group creation
  2012-01-22  3:25 [PATCHSET] blkcg: kill policy node and blkg->dev, take#2 Tejun Heo
                   ` (12 preceding siblings ...)
  2012-01-22  3:25 ` [PATCH 13/17] blkcg: use the usual get blkg path for root blkio_group Tejun Heo
@ 2012-01-22  3:25 ` Tejun Heo
  2012-01-22  3:25 ` [PATCH 15/17] blkcg: don't allow or retain configuration of missing devices Tejun Heo
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 45+ messages in thread
From: Tejun Heo @ 2012-01-22  3:25 UTC (permalink / raw)
  To: axboe, vgoyal; +Cc: ctalbott, rni, linux-kernel, Tejun Heo

Currently both blk-throttle and cfq-iosched implement their own
blkio_group creation code in throtl_get_tg() and cfq_get_cfqg().  This
patch factors out the common code.

* New blkio_policy_ops methods blkio_alloc_group_fn() and
  blkio_link_group_fn() are added.  Both are transitional and will be
  removed once the blkg management code is fully moved into
  blk-cgroup.c.

* blkio_alloc_group_fn() allocates the policy-specific blkg, which is
  usually a larger data structure with blkg as the first entry, and
  initializes it.  Note that initialization of the blkg proper,
  including percpu stats, is the responsibility of blk-cgroup proper.

  Note that default config (weight, bps...) initialization is done
  from this method; otherwise, we end up violating the locking order
  between the blkcg and q locks via the blkcg_get_CONF() functions.

* blkio_link_group_fn() is called under queue_lock and is responsible
  for linking the blkg to the queue.  The blkcg side is handled by
  blk-cgroup proper.

* The common blkg creation function is named blkg_lookup_create() and
  blkiocg_lookup_group() is renamed to blkg_lookup() for consistency.
  Also, throtl / cfq related functions are similarly [re]named for
  consistency.

* Code to fill cfqg->blkg.dev is open-coded in cfq_find_alloc_queue().
  This is transitional and will be removed soon.

This simplifies blkcg policy implementations and enables further
cleanup.
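
A reduced sketch of how the two new hooks are meant to be driven by
the shared creation path (signatures simplified; the lock dropping and
re-checking the real blkg_lookup_create() does is only noted in a
comment - this is not the kernel code):

#include <stddef.h>

struct request_queue;
struct blkio_cgroup;
struct blkio_group;

struct blkio_policy_ops_sketch {
	struct blkio_group *(*alloc_group)(struct request_queue *q,
					   struct blkio_cgroup *blkcg);
	void (*link_group)(struct request_queue *q, struct blkio_group *blkg);
};

/* the common creation path, heavily abridged */
static struct blkio_group *
create_blkg_sketch(const struct blkio_policy_ops_sketch *ops,
		   struct request_queue *q, struct blkio_cgroup *blkcg)
{
	struct blkio_group *blkg = ops->alloc_group(q, blkcg);

	if (!blkg)
		return NULL;
	/*
	 * The real code drops and re-takes the locks around the
	 * allocation and re-checks for a racing creator here.
	 */
	ops->link_group(q, blkg);
	return blkg;
}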

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-cgroup.c   |  113 ++++++++++++++++++++++++++++-----------
 block/blk-cgroup.h   |   29 +++++-----
 block/blk-throttle.c |  144 ++++++++++++++------------------------------------
 block/cfq-iosched.c  |  134 ++++++++++++----------------------------------
 block/cfq.h          |    8 ---
 5 files changed, 172 insertions(+), 256 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index a175551..da59c9e 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -464,38 +464,89 @@ void blkiocg_update_io_merged_stats(struct blkio_group *blkg, bool direction,
 }
 EXPORT_SYMBOL_GPL(blkiocg_update_io_merged_stats);
 
-/*
- * This function allocates the per cpu stats for blkio_group. Should be called
- * from sleepable context as alloc_per_cpu() requires that.
- */
-int blkio_alloc_blkg_stats(struct blkio_group *blkg)
+struct blkio_group *blkg_lookup_create(struct blkio_cgroup *blkcg,
+				       struct request_queue *q,
+				       enum blkio_policy_id plid)
+	__releases(q->queue_lock) __acquires(q->queue_lock)
 {
-	/* Allocate memory for per cpu stats */
-	blkg->stats_cpu = alloc_percpu(struct blkio_group_stats_cpu);
-	if (!blkg->stats_cpu)
-		return -ENOMEM;
-	return 0;
-}
-EXPORT_SYMBOL_GPL(blkio_alloc_blkg_stats);
+	struct blkio_policy_type *pol = blkio_policy[plid];
+	struct blkio_group *blkg = NULL;
+	struct blkio_group *new_blkg = NULL;
 
-void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
-		struct blkio_group *blkg, struct request_queue *q, dev_t dev,
-		enum blkio_policy_id plid)
-{
-	unsigned long flags;
+	WARN_ON_ONCE(!rcu_read_lock_held());
+	lockdep_assert_held(q->queue_lock);
 
-	spin_lock_irqsave(&blkcg->lock, flags);
-	spin_lock_init(&blkg->stats_lock);
-	rcu_assign_pointer(blkg->q, q);
-	blkg->blkcg_id = css_id(&blkcg->css);
+	/*
+	 * This could be the first entry point of blkcg implementation and
+	 * we shouldn't allow anything to go through for a dead queue.  The
+	 * following can be removed if blkg lookup is guaranteed to fail on
+	 * a dead queue.
+	 */
+	if (unlikely(blk_queue_dead(q)))
+		return NULL;
+
+	blkg = blkg_lookup(blkcg, q, plid);
+	if (blkg)
+		return blkg;
+
+	if (!css_tryget(&blkcg->css))
+		return NULL;
+
+	/*
+	 * Allocate and initialize.
+	 *
+	 * FIXME: The following is broken.  Percpu memory allocation
+	 * requires %GFP_KERNEL context and can't be performed from IO
+	 * path.  Allocation here should inherently be atomic and the
+	 * following lock dancing can be removed once the broken percpu
+	 * allocation is fixed.
+	 */
+	spin_unlock_irq(q->queue_lock);
+	rcu_read_unlock();
+
+	new_blkg = pol->ops.blkio_alloc_group_fn(q, blkcg);
+	if (new_blkg) {
+		new_blkg->stats_cpu = alloc_percpu(struct blkio_group_stats_cpu);
+
+		spin_lock_init(&new_blkg->stats_lock);
+		rcu_assign_pointer(new_blkg->q, q);
+		new_blkg->blkcg_id = css_id(&blkcg->css);
+		new_blkg->plid = plid;
+		cgroup_path(blkcg->css.cgroup, new_blkg->path,
+			    sizeof(new_blkg->path));
+	}
+
+	rcu_read_lock();
+	spin_lock_irq(q->queue_lock);
+	css_put(&blkcg->css);
+
+	/* did the device die inbetween? */
+	if (unlikely(blk_queue_dead(q)))
+		goto out;
+
+	/* did someone beat us to it? */
+	blkg = blkg_lookup(blkcg, q, plid);
+	if (unlikely(blkg))
+		goto out;
+
+	/* did alloc fail? */
+	if (unlikely(!new_blkg || !new_blkg->stats_cpu))
+		goto out;
+
+	/* insert */
+	spin_lock(&blkcg->lock);
+	swap(blkg, new_blkg);
 	hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
-	blkg->plid = plid;
-	spin_unlock_irqrestore(&blkcg->lock, flags);
-	/* Need to take css reference ? */
-	cgroup_path(blkcg->css.cgroup, blkg->path, sizeof(blkg->path));
-	blkg->dev = dev;
+	pol->ops.blkio_link_group_fn(q, blkg);
+	spin_unlock(&blkcg->lock);
+out:
+	if (new_blkg) {
+		free_percpu(new_blkg->stats_cpu);
+		kfree(new_blkg);
+	}
+	return blkg;
 }
-EXPORT_SYMBOL_GPL(blkiocg_add_blkio_group);
+EXPORT_SYMBOL_GPL(blkg_lookup_create);
 
 static void __blkiocg_del_blkio_group(struct blkio_group *blkg)
 {
@@ -532,9 +583,9 @@ int blkiocg_del_blkio_group(struct blkio_group *blkg)
 EXPORT_SYMBOL_GPL(blkiocg_del_blkio_group);
 
 /* called under rcu_read_lock(). */
-struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg,
-					 struct request_queue *q,
-					 enum blkio_policy_id plid)
+struct blkio_group *blkg_lookup(struct blkio_cgroup *blkcg,
+				struct request_queue *q,
+				enum blkio_policy_id plid)
 {
 	struct blkio_group *blkg;
 	struct hlist_node *n;
@@ -544,7 +595,7 @@ struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg,
 			return blkg;
 	return NULL;
 }
-EXPORT_SYMBOL_GPL(blkiocg_lookup_group);
+EXPORT_SYMBOL_GPL(blkg_lookup);
 
 static void blkio_reset_stats_cpu(struct blkio_group *blkg)
 {
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index fd72890..9dff67a 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -204,6 +204,10 @@ extern unsigned int blkcg_get_read_iops(struct blkio_cgroup *blkcg,
 extern unsigned int blkcg_get_write_iops(struct blkio_cgroup *blkcg,
 				     dev_t dev);
 
+typedef struct blkio_group *(blkio_alloc_group_fn)(struct request_queue *q,
+						   struct blkio_cgroup *blkcg);
+typedef void (blkio_link_group_fn)(struct request_queue *q,
+			struct blkio_group *blkg);
 typedef void (blkio_unlink_group_fn)(struct request_queue *q,
 			struct blkio_group *blkg);
 typedef void (blkio_clear_queue_fn)(struct request_queue *q);
@@ -219,6 +223,8 @@ typedef void (blkio_update_group_write_iops_fn)(struct request_queue *q,
 			struct blkio_group *blkg, unsigned int write_iops);
 
 struct blkio_policy_ops {
+	blkio_alloc_group_fn *blkio_alloc_group_fn;
+	blkio_link_group_fn *blkio_link_group_fn;
 	blkio_unlink_group_fn *blkio_unlink_group_fn;
 	blkio_clear_queue_fn *blkio_clear_queue_fn;
 	blkio_update_group_weight_fn *blkio_update_group_weight_fn;
@@ -308,14 +314,13 @@ static inline void blkiocg_set_start_empty_time(struct blkio_group *blkg) {}
 extern struct blkio_cgroup blkio_root_cgroup;
 extern struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup);
 extern struct blkio_cgroup *task_blkio_cgroup(struct task_struct *tsk);
-extern void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
-	struct blkio_group *blkg, struct request_queue *q, dev_t dev,
-	enum blkio_policy_id plid);
-extern int blkio_alloc_blkg_stats(struct blkio_group *blkg);
 extern int blkiocg_del_blkio_group(struct blkio_group *blkg);
-extern struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg,
-						struct request_queue *q,
-						enum blkio_policy_id plid);
+extern struct blkio_group *blkg_lookup(struct blkio_cgroup *blkcg,
+				       struct request_queue *q,
+				       enum blkio_policy_id plid);
+struct blkio_group *blkg_lookup_create(struct blkio_cgroup *blkcg,
+				       struct request_queue *q,
+				       enum blkio_policy_id plid);
 void blkiocg_update_timeslice_used(struct blkio_group *blkg,
 					unsigned long time,
 					unsigned long unaccounted_time);
@@ -336,17 +341,11 @@ cgroup_to_blkio_cgroup(struct cgroup *cgroup) { return NULL; }
 static inline struct blkio_cgroup *
 task_blkio_cgroup(struct task_struct *tsk) { return NULL; }
 
-static inline void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
-		struct blkio_group *blkg, void *key, dev_t dev,
-		enum blkio_policy_id plid) {}
-
-static inline int blkio_alloc_blkg_stats(struct blkio_group *blkg) { return 0; }
-
 static inline int
 blkiocg_del_blkio_group(struct blkio_group *blkg) { return 0; }
 
-static inline struct blkio_group *
-blkiocg_lookup_group(struct blkio_cgroup *blkcg, void *key) { return NULL; }
+static inline struct blkio_group *blkg_lookup(struct blkio_cgroup *blkcg,
+					      void *key) { return NULL; }
 static inline void blkiocg_update_timeslice_used(struct blkio_group *blkg,
 						unsigned long time,
 						unsigned long unaccounted_time)
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 779d20d..d573cb1 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -181,17 +181,25 @@ static void throtl_put_tg(struct throtl_grp *tg)
 	call_rcu(&tg->rcu_head, throtl_free_tg);
 }
 
-static void throtl_init_group(struct throtl_grp *tg)
+static struct blkio_group *throtl_alloc_blkio_group(struct request_queue *q,
+						    struct blkio_cgroup *blkcg)
 {
+	struct throtl_grp *tg;
+
+	tg = kzalloc_node(sizeof(*tg), GFP_ATOMIC, q->node);
+	if (!tg)
+		return NULL;
+
 	INIT_HLIST_NODE(&tg->tg_node);
 	RB_CLEAR_NODE(&tg->rb_node);
 	bio_list_init(&tg->bio_lists[0]);
 	bio_list_init(&tg->bio_lists[1]);
 	tg->limits_changed = false;
 
-	/* Practically unlimited BW */
-	tg->bps[0] = tg->bps[1] = -1;
-	tg->iops[0] = tg->iops[1] = -1;
+	tg->bps[READ] = blkcg_get_read_bps(blkcg, tg->blkg.dev);
+	tg->bps[WRITE] = blkcg_get_write_bps(blkcg, tg->blkg.dev);
+	tg->iops[READ] = blkcg_get_read_iops(blkcg, tg->blkg.dev);
+	tg->iops[WRITE] = blkcg_get_write_iops(blkcg, tg->blkg.dev);
 
 	/*
 	 * Take the initial reference that will be released on destroy
@@ -200,14 +208,8 @@ static void throtl_init_group(struct throtl_grp *tg)
 	 * exit or cgroup deletion path depending on who is exiting first.
 	 */
 	atomic_set(&tg->ref, 1);
-}
 
-/* Should be called with rcu read lock held (needed for blkcg) */
-static void
-throtl_add_group_to_td_list(struct throtl_data *td, struct throtl_grp *tg)
-{
-	hlist_add_head(&tg->tg_node, &td->tg_list);
-	td->nr_undestroyed_grps++;
+	return &tg->blkg;
 }
 
 static void
@@ -246,46 +248,20 @@ throtl_tg_fill_dev_details(struct throtl_data *td, struct throtl_grp *tg)
 	spin_unlock_irq(td->queue->queue_lock);
 }
 
-static void throtl_init_add_tg_lists(struct throtl_data *td,
-			struct throtl_grp *tg, struct blkio_cgroup *blkcg)
+static void throtl_link_blkio_group(struct request_queue *q,
+				    struct blkio_group *blkg)
 {
-	__throtl_tg_fill_dev_details(td, tg);
-
-	/* Add group onto cgroup list */
-	blkiocg_add_blkio_group(blkcg, &tg->blkg, td->queue,
-				tg->blkg.dev, BLKIO_POLICY_THROTL);
-
-	tg->bps[READ] = blkcg_get_read_bps(blkcg, tg->blkg.dev);
-	tg->bps[WRITE] = blkcg_get_write_bps(blkcg, tg->blkg.dev);
-	tg->iops[READ] = blkcg_get_read_iops(blkcg, tg->blkg.dev);
-	tg->iops[WRITE] = blkcg_get_write_iops(blkcg, tg->blkg.dev);
-
-	throtl_add_group_to_td_list(td, tg);
-}
-
-/* Should be called without queue lock and outside of rcu period */
-static struct throtl_grp *throtl_alloc_tg(struct throtl_data *td)
-{
-	struct throtl_grp *tg = NULL;
-	int ret;
-
-	tg = kzalloc_node(sizeof(*tg), GFP_ATOMIC, td->queue->node);
-	if (!tg)
-		return NULL;
-
-	ret = blkio_alloc_blkg_stats(&tg->blkg);
+	struct throtl_data *td = q->td;
+	struct throtl_grp *tg = tg_of_blkg(blkg);
 
-	if (ret) {
-		kfree(tg);
-		return NULL;
-	}
+	__throtl_tg_fill_dev_details(td, tg);
 
-	throtl_init_group(tg);
-	return tg;
+	hlist_add_head(&tg->tg_node, &td->tg_list);
+	td->nr_undestroyed_grps++;
 }
 
 static struct
-throtl_grp *throtl_find_tg(struct throtl_data *td, struct blkio_cgroup *blkcg)
+throtl_grp *throtl_lookup_tg(struct throtl_data *td, struct blkio_cgroup *blkcg)
 {
 	struct throtl_grp *tg = NULL;
 
@@ -296,69 +272,29 @@ throtl_grp *throtl_find_tg(struct throtl_data *td, struct blkio_cgroup *blkcg)
 	if (blkcg == &blkio_root_cgroup)
 		tg = td->root_tg;
 	else
-		tg = tg_of_blkg(blkiocg_lookup_group(blkcg, td->queue,
-						     BLKIO_POLICY_THROTL));
+		tg = tg_of_blkg(blkg_lookup(blkcg, td->queue,
+					    BLKIO_POLICY_THROTL));
 
 	__throtl_tg_fill_dev_details(td, tg);
 	return tg;
 }
 
-static struct throtl_grp *throtl_get_tg(struct throtl_data *td,
-					struct blkio_cgroup *blkcg)
+static struct throtl_grp *throtl_lookup_create_tg(struct throtl_data *td,
+						  struct blkio_cgroup *blkcg)
 {
-	struct throtl_grp *tg = NULL, *__tg = NULL;
 	struct request_queue *q = td->queue;
+	struct throtl_grp *tg = NULL;
+	struct blkio_group *blkg;
 
-	/* no throttling for dead queue */
-	if (unlikely(blk_queue_bypass(q)))
-		return NULL;
-
-	tg = throtl_find_tg(td, blkcg);
-	if (tg)
-		return tg;
-
-	if (!css_tryget(&blkcg->css))
-		return NULL;
-
-	/*
-	 * Need to allocate a group. Allocation of group also needs allocation
-	 * of per cpu stats which in-turn takes a mutex() and can block. Hence
-	 * we need to drop rcu lock and queue_lock before we call alloc.
-	 */
-	spin_unlock_irq(q->queue_lock);
-	rcu_read_unlock();
-
-	tg = throtl_alloc_tg(td);
-
-	/* Group allocated and queue is still alive. take the lock */
-	rcu_read_lock();
-	spin_lock_irq(q->queue_lock);
-	css_put(&blkcg->css);
-
-	/* Make sure @q is still alive */
-	if (unlikely(blk_queue_bypass(q))) {
-		kfree(tg);
-		return NULL;
-	}
-
-	/*
-	 * If some other thread already allocated the group while we were
-	 * not holding queue lock, free up the group
-	 */
-	__tg = throtl_find_tg(td, blkcg);
-
-	if (__tg) {
-		kfree(tg);
-		return __tg;
-	}
+	blkg = blkg_lookup_create(blkcg, q, BLKIO_POLICY_THROTL);
 
-	/* Group allocation failed. Account the IO to root group */
-	if (!tg) {
+	/* if %NULL and @q is alive, fall back to root_tg */
+	if (blkg)
+		tg = tg_of_blkg(blkg);
+	else if (!blk_queue_dead(q))
 		tg = td->root_tg;
-		return tg;
-	}
 
-	throtl_init_add_tg_lists(td, tg, blkcg);
+	__throtl_tg_fill_dev_details(td, tg);
 	return tg;
 }
 
@@ -1091,6 +1027,8 @@ static void throtl_shutdown_wq(struct request_queue *q)
 
 static struct blkio_policy_type blkio_policy_throtl = {
 	.ops = {
+		.blkio_alloc_group_fn = throtl_alloc_blkio_group,
+		.blkio_link_group_fn = throtl_link_blkio_group,
 		.blkio_unlink_group_fn = throtl_unlink_blkio_group,
 		.blkio_clear_queue_fn = throtl_clear_queue,
 		.blkio_update_group_read_bps_fn =
@@ -1125,7 +1063,7 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
 	 */
 	rcu_read_lock();
 	blkcg = task_blkio_cgroup(current);
-	tg = throtl_find_tg(td, blkcg);
+	tg = throtl_lookup_tg(td, blkcg);
 	if (tg) {
 		throtl_tg_fill_dev_details(td, tg);
 
@@ -1141,7 +1079,7 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
 	 * IO group
 	 */
 	spin_lock_irq(q->queue_lock);
-	tg = throtl_get_tg(td, blkcg);
+	tg = throtl_lookup_create_tg(td, blkcg);
 	if (unlikely(!tg))
 		goto out_unlock;
 
@@ -1246,13 +1184,14 @@ int blk_throtl_init(struct request_queue *q)
 	td->limits_changed = false;
 	INIT_DELAYED_WORK(&td->throtl_work, blk_throtl_work);
 
-	/* alloc and Init root group. */
+	q->td = td;
 	td->queue = q;
 
+	/* alloc and init root group. */
 	rcu_read_lock();
 	spin_lock_irq(q->queue_lock);
 
-	td->root_tg = throtl_get_tg(td, &blkio_root_cgroup);
+	td->root_tg = throtl_lookup_create_tg(td, &blkio_root_cgroup);
 
 	spin_unlock_irq(q->queue_lock);
 	rcu_read_unlock();
@@ -1261,9 +1200,6 @@ int blk_throtl_init(struct request_queue *q)
 		kfree(td);
 		return -ENOMEM;
 	}
-
-	/* Attach throtl data to request queue */
-	q->td = td;
 	return 0;
 }
 
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 8ad6395..b4dd125 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1048,10 +1048,12 @@ static void cfq_update_blkio_group_weight(struct request_queue *q,
 	cfqg->needs_update = true;
 }
 
-static void cfq_init_add_cfqg_lists(struct cfq_data *cfqd,
-			struct cfq_group *cfqg, struct blkio_cgroup *blkcg)
+static void cfq_link_blkio_group(struct request_queue *q,
+				 struct blkio_group *blkg)
 {
-	struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct backing_dev_info *bdi = &q->backing_dev_info;
+	struct cfq_group *cfqg = cfqg_of_blkg(blkg);
 	unsigned int major, minor;
 
 	/*
@@ -1062,34 +1064,26 @@ static void cfq_init_add_cfqg_lists(struct cfq_data *cfqd,
 	 */
 	if (bdi->dev) {
 		sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
-		cfq_blkiocg_add_blkio_group(blkcg, &cfqg->blkg,
-					cfqd->queue, MKDEV(major, minor));
-	} else
-		cfq_blkiocg_add_blkio_group(blkcg, &cfqg->blkg,
-					cfqd->queue, 0);
+		blkg->dev = MKDEV(major, minor);
+	}
 
 	cfqd->nr_blkcg_linked_grps++;
-	cfqg->weight = blkcg_get_weight(blkcg, cfqg->blkg.dev);
 
 	/* Add group on cfqd list */
 	hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
 }
 
-/*
- * Should be called from sleepable context. No request queue lock as per
- * cpu stats are allocated dynamically and alloc_percpu needs to be called
- * from sleepable context.
- */
-static struct cfq_group * cfq_alloc_cfqg(struct cfq_data *cfqd)
+static struct blkio_group *cfq_alloc_blkio_group(struct request_queue *q,
+						 struct blkio_cgroup *blkcg)
 {
 	struct cfq_group *cfqg;
-	int ret;
 
-	cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->queue->node);
+	cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, q->node);
 	if (!cfqg)
 		return NULL;
 
 	cfq_init_cfqg_base(cfqg);
+	cfqg->weight = blkcg_get_weight(blkcg, cfqg->blkg.dev);
 
 	/*
 	 * Take the initial reference that will be released on destroy
@@ -1099,91 +1093,20 @@ static struct cfq_group * cfq_alloc_cfqg(struct cfq_data *cfqd)
 	 */
 	cfqg->ref = 1;
 
-	ret = blkio_alloc_blkg_stats(&cfqg->blkg);
-	if (ret) {
-		kfree(cfqg);
-		return NULL;
-	}
-
-	return cfqg;
-}
-
-static struct cfq_group *
-cfq_find_cfqg(struct cfq_data *cfqd, struct blkio_cgroup *blkcg)
-{
-	struct cfq_group *cfqg = NULL;
-	struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
-	unsigned int major, minor;
-
-	/*
-	 * This is the common case when there are no blkio cgroups.
-	 * Avoid lookup in this case
-	 */
-	if (blkcg == &blkio_root_cgroup)
-		cfqg = cfqd->root_group;
-	else
-		cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, cfqd->queue,
-							 BLKIO_POLICY_PROP));
-
-	if (cfqg && !cfqg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
-		sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
-		cfqg->blkg.dev = MKDEV(major, minor);
-	}
-
-	return cfqg;
+	return &cfqg->blkg;
 }
 
 /*
  * Search for the cfq group current task belongs to. request_queue lock must
  * be held.
  */
-static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd,
-				      struct blkio_cgroup *blkcg)
+static struct cfq_group *cfq_lookup_create_cfqg(struct cfq_data *cfqd,
+						struct blkio_cgroup *blkcg)
 {
-	struct cfq_group *cfqg = NULL, *__cfqg = NULL;
-	struct request_queue *q = cfqd->queue;
+	struct blkio_group *blkg;
 
-	cfqg = cfq_find_cfqg(cfqd, blkcg);
-	if (cfqg)
-		return cfqg;
-
-	if (!css_tryget(&blkcg->css))
-		return NULL;
-
-	/*
-	 * Need to allocate a group. Allocation of group also needs allocation
-	 * of per cpu stats which in-turn takes a mutex() and can block. Hence
-	 * we need to drop rcu lock and queue_lock before we call alloc.
-	 *
-	 * Not taking any queue reference here and assuming that queue is
-	 * around by the time we return. CFQ queue allocation code does
-	 * the same. It might be racy though.
-	 */
-	rcu_read_unlock();
-	spin_unlock_irq(q->queue_lock);
-
-	cfqg = cfq_alloc_cfqg(cfqd);
-
-	spin_lock_irq(q->queue_lock);
-	rcu_read_lock();
-	css_put(&blkcg->css);
-
-	/*
-	 * If some other thread already allocated the group while we were
-	 * not holding queue lock, free up the group
-	 */
-	__cfqg = cfq_find_cfqg(cfqd, blkcg);
-
-	if (__cfqg) {
-		kfree(cfqg);
-		return __cfqg;
-	}
-
-	if (!cfqg)
-		cfqg = cfqd->root_group;
-
-	cfq_init_add_cfqg_lists(cfqd, cfqg, blkcg);
-	return cfqg;
+	blkg = blkg_lookup_create(blkcg, cfqd->queue, BLKIO_POLICY_PROP);
+	return cfqg_of_blkg(blkg);
 }
 
 static inline struct cfq_group *cfq_ref_get_cfqg(struct cfq_group *cfqg)
@@ -1287,8 +1210,8 @@ static void cfq_clear_queue(struct request_queue *q)
 }
 
 #else /* GROUP_IOSCHED */
-static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd,
-				      struct blkio_cgroup *blkcg)
+static struct cfq_group *cfq_lookup_create_cfqg(struct cfq_data *cfqd,
+						struct blkio_cgroup *blkcg)
 {
 	return cfqd->root_group;
 }
@@ -2877,6 +2800,7 @@ static struct cfq_queue *
 cfq_find_alloc_queue(struct cfq_data *cfqd, bool is_sync,
 		     struct io_context *ioc, gfp_t gfp_mask)
 {
+	struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
 	struct blkio_cgroup *blkcg;
 	struct cfq_queue *cfqq, *new_cfqq = NULL;
 	struct cfq_io_cq *cic;
@@ -2887,7 +2811,19 @@ retry:
 
 	blkcg = task_blkio_cgroup(current);
 
-	cfqg = cfq_get_cfqg(cfqd, blkcg);
+	/* avoid lookup for the common case where there's no blkio cgroup */
+	if (blkcg == &blkio_root_cgroup)
+		cfqg = cfqd->root_group;
+	else
+		cfqg = cfq_lookup_create_cfqg(cfqd, blkcg);
+
+	if (cfqg && !cfqg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
+		unsigned int major, minor;
+
+		sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
+		cfqg->blkg.dev = MKDEV(major, minor);
+	}
+
 	cic = cfq_cic_lookup(cfqd, ioc);
 	/* cic always exists here */
 	cfqq = cic_to_cfqq(cic, is_sync);
@@ -3710,7 +3646,7 @@ static int cfq_init_queue(struct request_queue *q)
 	rcu_read_lock();
 	spin_lock_irq(q->queue_lock);
 
-	cfqd->root_group = cfq_get_cfqg(cfqd, &blkio_root_cgroup);
+	cfqd->root_group = cfq_lookup_create_cfqg(cfqd, &blkio_root_cgroup);
 
 	spin_unlock_irq(q->queue_lock);
 	rcu_read_unlock();
@@ -3896,6 +3832,8 @@ static struct elevator_type iosched_cfq = {
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
 static struct blkio_policy_type blkio_policy_cfq = {
 	.ops = {
+		.blkio_alloc_group_fn =		cfq_alloc_blkio_group,
+		.blkio_link_group_fn =		cfq_link_blkio_group,
 		.blkio_unlink_group_fn =	cfq_unlink_blkio_group,
 		.blkio_clear_queue_fn = cfq_clear_queue,
 		.blkio_update_group_weight_fn =	cfq_update_blkio_group_weight,
diff --git a/block/cfq.h b/block/cfq.h
index 343b78a..3987601 100644
--- a/block/cfq.h
+++ b/block/cfq.h
@@ -67,12 +67,6 @@ static inline void cfq_blkiocg_update_completion_stats(struct blkio_group *blkg,
 				direction, sync);
 }
 
-static inline void cfq_blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
-		struct blkio_group *blkg, struct request_queue *q, dev_t dev)
-{
-	blkiocg_add_blkio_group(blkcg, blkg, q, dev, BLKIO_POLICY_PROP);
-}
-
 static inline int cfq_blkiocg_del_blkio_group(struct blkio_group *blkg)
 {
 	return blkiocg_del_blkio_group(blkg);
@@ -105,8 +99,6 @@ static inline void cfq_blkiocg_update_dispatch_stats(struct blkio_group *blkg,
 				uint64_t bytes, bool direction, bool sync) {}
 static inline void cfq_blkiocg_update_completion_stats(struct blkio_group *blkg, uint64_t start_time, uint64_t io_start_time, bool direction, bool sync) {}
 
-static inline void cfq_blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
-		struct blkio_group *blkg, struct request_queue *q, dev_t dev) {}
 static inline int cfq_blkiocg_del_blkio_group(struct blkio_group *blkg)
 {
 	return 0;
-- 
1.7.7.3



* [PATCH 15/17] blkcg: don't allow or retain configuration of missing devices
  2012-01-22  3:25 [PATCHSET] blkcg: kill policy node and blkg->dev, take#2 Tejun Heo
                   ` (13 preceding siblings ...)
  2012-01-22  3:25 ` [PATCH 14/17] blkcg: factor out blkio_group creation Tejun Heo
@ 2012-01-22  3:25 ` Tejun Heo
  2012-01-22  3:25 ` [PATCH 16/17] blkcg: kill blkio_policy_node Tejun Heo
  2012-01-22  3:25 ` [PATCH 17/17] blkcg: kill the mind-bending blkg->dev Tejun Heo
  16 siblings, 0 replies; 45+ messages in thread
From: Tejun Heo @ 2012-01-22  3:25 UTC (permalink / raw)
  To: axboe, vgoyal; +Cc: ctalbott, rni, linux-kernel, Tejun Heo, Kay Sievers

blkcg is very peculiar in that it allows setting and remembering
configurations for non-existent devices by maintaining separate data
structures for configuration.

This behavior is completely outside the usual norms and outright
confusing; furthermore, it uses the dev_t number to match the
configuration to devices, which is unpredictable to begin with and
becomes completely unusable if EXT_DEVT is fully used.

It is wholly unnecessary - we already have a fully functional userland
mechanism for programming hotplugged devices which has full access to
device identification, connection topology and filesystem information.

Add a new struct blkio_group_conf, which contains all blkcg
configuration, to blkio_group and let blkio_group carry the
configuration.  A blkio_group can be created only if the associated
device exists and is removed when the device goes away.

Note that, after this patch, all newly created blkg's will always have
the default configuration (unlimited for throttling and blkcg's weight
for propio).

This patch makes blkio_policy_node meaningless but doesn't remove it.
The next patch will.
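
As a standalone sketch of the resulting data flow (userspace
stand-ins, not the kernel code): configuration lives in the group that
is created and destroyed with the device, and unset fields simply fall
back to the defaults.

#include <stdint.h>

#define READ	0	/* same 0/1 convention as the kernel */
#define WRITE	1

/* mirrors the struct this patch adds (u64 approximated as uint64_t) */
struct blkio_group_conf {
	unsigned int weight;	/* propio */
	unsigned int iops[2];	/* throttle read/write iops limits */
	uint64_t bps[2];	/* throttle read/write bps limits */
};

/* writing a rule updates the live group's conf ... */
static void set_write_bps(struct blkio_group_conf *conf, uint64_t bps)
{
	conf->bps[WRITE] = bps;
}

/* ... and an unset value falls back to "unlimited" (-1), in the same
 * way the patch uses conf->bps[...] ?: -1 */
static uint64_t effective_write_bps(const struct blkio_group_conf *conf)
{
	return conf->bps[WRITE] ?: (uint64_t)-1;
}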

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
---
 block/blk-cgroup.c   |   78 ++++++++++++++++++++++++++++++++++++--------------
 block/blk-cgroup.h   |    9 ++++++
 block/blk-throttle.c |    8 ++--
 block/cfq-iosched.c  |    2 +-
 4 files changed, 70 insertions(+), 27 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index da59c9e..f9b4976 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -819,9 +819,12 @@ static uint64_t blkio_get_stat(struct blkio_group *blkg,
 }
 
 static int blkio_policy_parse_and_set(char *buf,
-	struct blkio_policy_node *newpn, enum blkio_policy_id plid, int fileid)
+				      struct blkio_policy_node *newpn,
+				      enum blkio_policy_id plid, int fileid,
+				      struct blkio_cgroup *blkcg)
 {
 	struct gendisk *disk = NULL;
+	struct blkio_group *blkg = NULL;
 	char *s[4], *p, *major_s = NULL, *minor_s = NULL;
 	unsigned long major, minor;
 	int i = 0, ret = -EINVAL;
@@ -867,12 +870,19 @@ static int blkio_policy_parse_and_set(char *buf,
 		goto out;
 
 	/* For rule removal, do not check for device presence. */
-	if (temp) {
-		disk = get_gendisk(dev, &part);
-		if (!disk || part) {
-			ret = -ENODEV;
-			goto out;
-		}
+	disk = get_gendisk(dev, &part);
+
+	if ((!disk || part) && temp) {
+		ret = -ENODEV;
+		goto out;
+	}
+
+	rcu_read_lock();
+
+	if (disk && !part) {
+		spin_lock_irq(disk->queue->queue_lock);
+		blkg = blkg_lookup_create(blkcg, disk->queue, plid);
+		spin_unlock_irq(disk->queue->queue_lock);
 	}
 
 	newpn->dev = dev;
@@ -881,25 +891,46 @@ static int blkio_policy_parse_and_set(char *buf,
 	case BLKIO_POLICY_PROP:
 		if ((temp < BLKIO_WEIGHT_MIN && temp > 0) ||
 		     temp > BLKIO_WEIGHT_MAX)
-			goto out;
+			goto out_unlock;
 
 		newpn->plid = plid;
 		newpn->fileid = fileid;
 		newpn->val.weight = temp;
+		if (blkg)
+			blkg->conf.weight = temp;
 		break;
 	case BLKIO_POLICY_THROTL:
 		switch(fileid) {
 		case BLKIO_THROTL_read_bps_device:
+			if (blkg)
+				blkg->conf.bps[READ] = temp;
+			newpn->plid = plid;
+			newpn->fileid = fileid;
+			newpn->val.bps = temp;
+			break;
 		case BLKIO_THROTL_write_bps_device:
+			if (blkg)
+				blkg->conf.bps[WRITE] = temp;
 			newpn->plid = plid;
 			newpn->fileid = fileid;
 			newpn->val.bps = temp;
 			break;
 		case BLKIO_THROTL_read_iops_device:
+			if (temp > THROTL_IOPS_MAX)
+				goto out_unlock;
+
+			if (blkg)
+				blkg->conf.iops[READ] = temp;
+			newpn->plid = plid;
+			newpn->fileid = fileid;
+			newpn->val.iops = (unsigned int)temp;
+			break;
 		case BLKIO_THROTL_write_iops_device:
 			if (temp > THROTL_IOPS_MAX)
-				goto out;
+				goto out_unlock;
 
+			if (blkg)
+				blkg->conf.iops[WRITE] = temp;
 			newpn->plid = plid;
 			newpn->fileid = fileid;
 			newpn->val.iops = (unsigned int)temp;
@@ -910,6 +941,8 @@ static int blkio_policy_parse_and_set(char *buf,
 		BUG();
 	}
 	ret = 0;
+out_unlock:
+	rcu_read_unlock();
 out:
 	put_disk(disk);
 	return ret;
@@ -1059,26 +1092,29 @@ static void blkio_update_policy_rule(struct blkio_policy_node *oldpn,
 static void blkio_update_blkg_policy(struct blkio_cgroup *blkcg,
 		struct blkio_group *blkg, struct blkio_policy_node *pn)
 {
-	unsigned int weight, iops;
-	u64 bps;
+	struct blkio_group_conf *conf = &blkg->conf;
 
 	switch(pn->plid) {
 	case BLKIO_POLICY_PROP:
-		weight = pn->val.weight ? pn->val.weight :
-				blkcg->weight;
-		blkio_update_group_weight(blkg, weight);
+		blkio_update_group_weight(blkg, conf->weight ?: blkcg->weight);
 		break;
 	case BLKIO_POLICY_THROTL:
 		switch(pn->fileid) {
 		case BLKIO_THROTL_read_bps_device:
+			blkio_update_group_bps(blkg, conf->bps[READ] ?: -1,
+					       pn->fileid);
+			break;
 		case BLKIO_THROTL_write_bps_device:
-			bps = pn->val.bps ? pn->val.bps : (-1);
-			blkio_update_group_bps(blkg, bps, pn->fileid);
+			blkio_update_group_bps(blkg, conf->bps[WRITE] ?: -1,
+					       pn->fileid);
 			break;
 		case BLKIO_THROTL_read_iops_device:
+			blkio_update_group_iops(blkg, conf->iops[READ] ?: -1,
+						pn->fileid);
+			break;
 		case BLKIO_THROTL_write_iops_device:
-			iops = pn->val.iops ? pn->val.iops : (-1);
-			blkio_update_group_iops(blkg, iops, pn->fileid);
+			blkio_update_group_iops(blkg, conf->iops[WRITE] ?: -1,
+						pn->fileid);
 			break;
 		}
 		break;
@@ -1116,7 +1152,7 @@ static int blkiocg_file_write(struct cgroup *cgrp, struct cftype *cft,
 	int ret = 0;
 	char *buf;
 	struct blkio_policy_node *newpn, *pn;
-	struct blkio_cgroup *blkcg;
+	struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgrp);
 	int keep_newpn = 0;
 	enum blkio_policy_id plid = BLKIOFILE_POLICY(cft->private);
 	int fileid = BLKIOFILE_ATTR(cft->private);
@@ -1131,12 +1167,10 @@ static int blkiocg_file_write(struct cgroup *cgrp, struct cftype *cft,
 		goto free_buf;
 	}
 
-	ret = blkio_policy_parse_and_set(buf, newpn, plid, fileid);
+	ret = blkio_policy_parse_and_set(buf, newpn, plid, fileid, blkcg);
 	if (ret)
 		goto free_newpn;
 
-	blkcg = cgroup_to_blkio_cgroup(cgrp);
-
 	spin_lock_irq(&blkcg->lock);
 
 	pn = blkio_policy_search_node(blkcg, newpn->dev, plid, fileid);
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 9dff67a..280e55c 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -154,6 +154,12 @@ struct blkio_group_stats_cpu {
 	struct u64_stats_sync syncp;
 };
 
+struct blkio_group_conf {
+	unsigned int weight;
+	unsigned int iops[2];
+	u64 bps[2];
+};
+
 struct blkio_group {
 	/* Pointer to the associated request_queue, RCU protected */
 	struct request_queue __rcu *q;
@@ -166,6 +172,9 @@ struct blkio_group {
 	/* policy which owns this blk group */
 	enum blkio_policy_id plid;
 
+	/* Configuration */
+	struct blkio_group_conf conf;
+
 	/* Need to serialize the stats in the case of reset/update */
 	spinlock_t stats_lock;
 	struct blkio_group_stats stats;
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index d573cb1..c7b3aea 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -196,10 +196,10 @@ static struct blkio_group *throtl_alloc_blkio_group(struct request_queue *q,
 	bio_list_init(&tg->bio_lists[1]);
 	tg->limits_changed = false;
 
-	tg->bps[READ] = blkcg_get_read_bps(blkcg, tg->blkg.dev);
-	tg->bps[WRITE] = blkcg_get_write_bps(blkcg, tg->blkg.dev);
-	tg->iops[READ] = blkcg_get_read_iops(blkcg, tg->blkg.dev);
-	tg->iops[WRITE] = blkcg_get_write_iops(blkcg, tg->blkg.dev);
+	tg->bps[READ] = -1;
+	tg->bps[WRITE] = -1;
+	tg->iops[READ] = -1;
+	tg->iops[WRITE] = -1;
 
 	/*
 	 * Take the initial reference that will be released on destroy
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index b4dd125..5498513 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1083,7 +1083,7 @@ static struct blkio_group *cfq_alloc_blkio_group(struct request_queue *q,
 		return NULL;
 
 	cfq_init_cfqg_base(cfqg);
-	cfqg->weight = blkcg_get_weight(blkcg, cfqg->blkg.dev);
+	cfqg->weight = blkcg->weight;
 
 	/*
 	 * Take the initial reference that will be released on destroy
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 16/17] blkcg: kill blkio_policy_node
  2012-01-22  3:25 [PATCHSET] blkcg: kill policy node and blkg->dev, take#2 Tejun Heo
                   ` (14 preceding siblings ...)
  2012-01-22  3:25 ` [PATCH 15/17] blkcg: don't allow or retain configuration of missing devices Tejun Heo
@ 2012-01-22  3:25 ` Tejun Heo
  2012-01-22  3:25 ` [PATCH 17/17] blkcg: kill the mind-bending blkg->dev Tejun Heo
  16 siblings, 0 replies; 45+ messages in thread
From: Tejun Heo @ 2012-01-22  3:25 UTC (permalink / raw)
  To: axboe, vgoyal; +Cc: ctalbott, rni, linux-kernel, Tejun Heo

Now that blkcg configuration lives in blkg's, blkio_policy_node is no
longer necessary.  Kill it.

blkio_policy_parse_and_set() now fails if invoked for a missing device,
and the functions that print out configurations are updated to print
from blkg's.

cftype_blkg_same_policy() is dropped along with other policy functions
for consistency.  Its one line is open coded in the only user -
blkio_read_blkg_stats().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-cgroup.c |  430 +++++++---------------------------------------------
 block/blk-cgroup.h |   32 ----
 2 files changed, 58 insertions(+), 404 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index f9b4976..a983c43 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -58,54 +58,6 @@ struct cgroup_subsys blkio_subsys = {
 };
 EXPORT_SYMBOL_GPL(blkio_subsys);
 
-static inline void blkio_policy_insert_node(struct blkio_cgroup *blkcg,
-					    struct blkio_policy_node *pn)
-{
-	list_add(&pn->node, &blkcg->policy_list);
-}
-
-static inline bool cftype_blkg_same_policy(struct cftype *cft,
-			struct blkio_group *blkg)
-{
-	enum blkio_policy_id plid = BLKIOFILE_POLICY(cft->private);
-
-	if (blkg->plid == plid)
-		return 1;
-
-	return 0;
-}
-
-/* Determines if policy node matches cgroup file being accessed */
-static inline bool pn_matches_cftype(struct cftype *cft,
-			struct blkio_policy_node *pn)
-{
-	enum blkio_policy_id plid = BLKIOFILE_POLICY(cft->private);
-	int fileid = BLKIOFILE_ATTR(cft->private);
-
-	return (plid == pn->plid && fileid == pn->fileid);
-}
-
-/* Must be called with blkcg->lock held */
-static inline void blkio_policy_delete_node(struct blkio_policy_node *pn)
-{
-	list_del(&pn->node);
-}
-
-/* Must be called with blkcg->lock held */
-static struct blkio_policy_node *
-blkio_policy_search_node(const struct blkio_cgroup *blkcg, dev_t dev,
-		enum blkio_policy_id plid, int fileid)
-{
-	struct blkio_policy_node *pn;
-
-	list_for_each_entry(pn, &blkcg->policy_list, node) {
-		if (pn->dev == dev && pn->plid == plid && pn->fileid == fileid)
-			return pn;
-	}
-
-	return NULL;
-}
-
 struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
 {
 	return container_of(cgroup_subsys_state(cgroup, blkio_subsys_id),
@@ -818,10 +770,8 @@ static uint64_t blkio_get_stat(struct blkio_group *blkg,
 	return disk_total;
 }
 
-static int blkio_policy_parse_and_set(char *buf,
-				      struct blkio_policy_node *newpn,
-				      enum blkio_policy_id plid, int fileid,
-				      struct blkio_cgroup *blkcg)
+static int blkio_policy_parse_and_set(char *buf, enum blkio_policy_id plid,
+				      int fileid, struct blkio_cgroup *blkcg)
 {
 	struct gendisk *disk = NULL;
 	struct blkio_group *blkg = NULL;
@@ -869,23 +819,18 @@ static int blkio_policy_parse_and_set(char *buf,
 	if (strict_strtoull(s[1], 10, &temp))
 		goto out;
 
-	/* For rule removal, do not check for device presence. */
 	disk = get_gendisk(dev, &part);
-
-	if ((!disk || part) && temp) {
-		ret = -ENODEV;
+	if (!disk || part)
 		goto out;
-	}
 
 	rcu_read_lock();
 
-	if (disk && !part) {
-		spin_lock_irq(disk->queue->queue_lock);
-		blkg = blkg_lookup_create(blkcg, disk->queue, plid);
-		spin_unlock_irq(disk->queue->queue_lock);
-	}
+	spin_lock_irq(disk->queue->queue_lock);
+	blkg = blkg_lookup_create(blkcg, disk->queue, plid);
+	spin_unlock_irq(disk->queue->queue_lock);
 
-	newpn->dev = dev;
+	if (!blkg)
+		goto out_unlock;
 
 	switch (plid) {
 	case BLKIO_POLICY_PROP:
@@ -893,47 +838,30 @@ static int blkio_policy_parse_and_set(char *buf,
 		     temp > BLKIO_WEIGHT_MAX)
 			goto out_unlock;
 
-		newpn->plid = plid;
-		newpn->fileid = fileid;
-		newpn->val.weight = temp;
-		if (blkg)
-			blkg->conf.weight = temp;
+		blkg->conf.weight = temp;
+		blkio_update_group_weight(blkg, temp ?: blkcg->weight);
 		break;
 	case BLKIO_POLICY_THROTL:
 		switch(fileid) {
 		case BLKIO_THROTL_read_bps_device:
-			if (blkg)
-				blkg->conf.bps[READ] = temp;
-			newpn->plid = plid;
-			newpn->fileid = fileid;
-			newpn->val.bps = temp;
+			blkg->conf.bps[READ] = temp;
+			blkio_update_group_bps(blkg, temp ?: -1, fileid);
 			break;
 		case BLKIO_THROTL_write_bps_device:
-			if (blkg)
-				blkg->conf.bps[WRITE] = temp;
-			newpn->plid = plid;
-			newpn->fileid = fileid;
-			newpn->val.bps = temp;
+			blkg->conf.bps[WRITE] = temp;
+			blkio_update_group_bps(blkg, temp ?: -1, fileid);
 			break;
 		case BLKIO_THROTL_read_iops_device:
 			if (temp > THROTL_IOPS_MAX)
 				goto out_unlock;
-
-			if (blkg)
-				blkg->conf.iops[READ] = temp;
-			newpn->plid = plid;
-			newpn->fileid = fileid;
-			newpn->val.iops = (unsigned int)temp;
+			blkg->conf.iops[READ] = temp;
+			blkio_update_group_iops(blkg, temp ?: -1, fileid);
 			break;
 		case BLKIO_THROTL_write_iops_device:
 			if (temp > THROTL_IOPS_MAX)
 				goto out_unlock;
-
-			if (blkg)
-				blkg->conf.iops[WRITE] = temp;
-			newpn->plid = plid;
-			newpn->fileid = fileid;
-			newpn->val.iops = (unsigned int)temp;
+			blkg->conf.iops[WRITE] = temp;
+			blkio_update_group_iops(blkg, temp ?: -1, fileid);
 			break;
 		}
 		break;
@@ -948,212 +876,12 @@ out:
 	return ret;
 }
 
-unsigned int blkcg_get_weight(struct blkio_cgroup *blkcg,
-			      dev_t dev)
-{
-	struct blkio_policy_node *pn;
-	unsigned long flags;
-	unsigned int weight;
-
-	spin_lock_irqsave(&blkcg->lock, flags);
-
-	pn = blkio_policy_search_node(blkcg, dev, BLKIO_POLICY_PROP,
-				BLKIO_PROP_weight_device);
-	if (pn)
-		weight = pn->val.weight;
-	else
-		weight = blkcg->weight;
-
-	spin_unlock_irqrestore(&blkcg->lock, flags);
-
-	return weight;
-}
-EXPORT_SYMBOL_GPL(blkcg_get_weight);
-
-uint64_t blkcg_get_read_bps(struct blkio_cgroup *blkcg, dev_t dev)
-{
-	struct blkio_policy_node *pn;
-	unsigned long flags;
-	uint64_t bps = -1;
-
-	spin_lock_irqsave(&blkcg->lock, flags);
-	pn = blkio_policy_search_node(blkcg, dev, BLKIO_POLICY_THROTL,
-				BLKIO_THROTL_read_bps_device);
-	if (pn)
-		bps = pn->val.bps;
-	spin_unlock_irqrestore(&blkcg->lock, flags);
-
-	return bps;
-}
-
-uint64_t blkcg_get_write_bps(struct blkio_cgroup *blkcg, dev_t dev)
-{
-	struct blkio_policy_node *pn;
-	unsigned long flags;
-	uint64_t bps = -1;
-
-	spin_lock_irqsave(&blkcg->lock, flags);
-	pn = blkio_policy_search_node(blkcg, dev, BLKIO_POLICY_THROTL,
-				BLKIO_THROTL_write_bps_device);
-	if (pn)
-		bps = pn->val.bps;
-	spin_unlock_irqrestore(&blkcg->lock, flags);
-
-	return bps;
-}
-
-unsigned int blkcg_get_read_iops(struct blkio_cgroup *blkcg, dev_t dev)
-{
-	struct blkio_policy_node *pn;
-	unsigned long flags;
-	unsigned int iops = -1;
-
-	spin_lock_irqsave(&blkcg->lock, flags);
-	pn = blkio_policy_search_node(blkcg, dev, BLKIO_POLICY_THROTL,
-				BLKIO_THROTL_read_iops_device);
-	if (pn)
-		iops = pn->val.iops;
-	spin_unlock_irqrestore(&blkcg->lock, flags);
-
-	return iops;
-}
-
-unsigned int blkcg_get_write_iops(struct blkio_cgroup *blkcg, dev_t dev)
-{
-	struct blkio_policy_node *pn;
-	unsigned long flags;
-	unsigned int iops = -1;
-
-	spin_lock_irqsave(&blkcg->lock, flags);
-	pn = blkio_policy_search_node(blkcg, dev, BLKIO_POLICY_THROTL,
-				BLKIO_THROTL_write_iops_device);
-	if (pn)
-		iops = pn->val.iops;
-	spin_unlock_irqrestore(&blkcg->lock, flags);
-
-	return iops;
-}
-
-/* Checks whether user asked for deleting a policy rule */
-static bool blkio_delete_rule_command(struct blkio_policy_node *pn)
-{
-	switch(pn->plid) {
-	case BLKIO_POLICY_PROP:
-		if (pn->val.weight == 0)
-			return 1;
-		break;
-	case BLKIO_POLICY_THROTL:
-		switch(pn->fileid) {
-		case BLKIO_THROTL_read_bps_device:
-		case BLKIO_THROTL_write_bps_device:
-			if (pn->val.bps == 0)
-				return 1;
-			break;
-		case BLKIO_THROTL_read_iops_device:
-		case BLKIO_THROTL_write_iops_device:
-			if (pn->val.iops == 0)
-				return 1;
-		}
-		break;
-	default:
-		BUG();
-	}
-
-	return 0;
-}
-
-static void blkio_update_policy_rule(struct blkio_policy_node *oldpn,
-					struct blkio_policy_node *newpn)
-{
-	switch(oldpn->plid) {
-	case BLKIO_POLICY_PROP:
-		oldpn->val.weight = newpn->val.weight;
-		break;
-	case BLKIO_POLICY_THROTL:
-		switch(newpn->fileid) {
-		case BLKIO_THROTL_read_bps_device:
-		case BLKIO_THROTL_write_bps_device:
-			oldpn->val.bps = newpn->val.bps;
-			break;
-		case BLKIO_THROTL_read_iops_device:
-		case BLKIO_THROTL_write_iops_device:
-			oldpn->val.iops = newpn->val.iops;
-		}
-		break;
-	default:
-		BUG();
-	}
-}
-
-/*
- * Some rules/values in blkg have changed. Propagate those to respective
- * policies.
- */
-static void blkio_update_blkg_policy(struct blkio_cgroup *blkcg,
-		struct blkio_group *blkg, struct blkio_policy_node *pn)
-{
-	struct blkio_group_conf *conf = &blkg->conf;
-
-	switch(pn->plid) {
-	case BLKIO_POLICY_PROP:
-		blkio_update_group_weight(blkg, conf->weight ?: blkcg->weight);
-		break;
-	case BLKIO_POLICY_THROTL:
-		switch(pn->fileid) {
-		case BLKIO_THROTL_read_bps_device:
-			blkio_update_group_bps(blkg, conf->bps[READ] ?: -1,
-					       pn->fileid);
-			break;
-		case BLKIO_THROTL_write_bps_device:
-			blkio_update_group_bps(blkg, conf->bps[WRITE] ?: -1,
-					       pn->fileid);
-			break;
-		case BLKIO_THROTL_read_iops_device:
-			blkio_update_group_iops(blkg, conf->iops[READ] ?: -1,
-						pn->fileid);
-			break;
-		case BLKIO_THROTL_write_iops_device:
-			blkio_update_group_iops(blkg, conf->iops[WRITE] ?: -1,
-						pn->fileid);
-			break;
-		}
-		break;
-	default:
-		BUG();
-	}
-}
-
-/*
- * A policy node rule has been updated. Propagate this update to all the
- * block groups which might be affected by this update.
- */
-static void blkio_update_policy_node_blkg(struct blkio_cgroup *blkcg,
-				struct blkio_policy_node *pn)
-{
-	struct blkio_group *blkg;
-	struct hlist_node *n;
-
-	spin_lock_irq(&blkio_list_lock);
-	spin_lock(&blkcg->lock);
-
-	hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node) {
-		if (pn->dev != blkg->dev || pn->plid != blkg->plid)
-			continue;
-		blkio_update_blkg_policy(blkcg, blkg, pn);
-	}
-
-	spin_unlock(&blkcg->lock);
-	spin_unlock_irq(&blkio_list_lock);
-}
-
 static int blkiocg_file_write(struct cgroup *cgrp, struct cftype *cft,
  				       const char *buffer)
 {
 	int ret = 0;
 	char *buf;
-	struct blkio_policy_node *newpn, *pn;
 	struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgrp);
-	int keep_newpn = 0;
 	enum blkio_policy_id plid = BLKIOFILE_POLICY(cft->private);
 	int fileid = BLKIOFILE_ATTR(cft->private);
 
@@ -1161,69 +889,42 @@ static int blkiocg_file_write(struct cgroup *cgrp, struct cftype *cft,
 	if (!buf)
 		return -ENOMEM;
 
-	newpn = kzalloc(sizeof(*newpn), GFP_KERNEL);
-	if (!newpn) {
-		ret = -ENOMEM;
-		goto free_buf;
-	}
-
-	ret = blkio_policy_parse_and_set(buf, newpn, plid, fileid, blkcg);
-	if (ret)
-		goto free_newpn;
-
-	spin_lock_irq(&blkcg->lock);
-
-	pn = blkio_policy_search_node(blkcg, newpn->dev, plid, fileid);
-	if (!pn) {
-		if (!blkio_delete_rule_command(newpn)) {
-			blkio_policy_insert_node(blkcg, newpn);
-			keep_newpn = 1;
-		}
-		spin_unlock_irq(&blkcg->lock);
-		goto update_io_group;
-	}
-
-	if (blkio_delete_rule_command(newpn)) {
-		blkio_policy_delete_node(pn);
-		kfree(pn);
-		spin_unlock_irq(&blkcg->lock);
-		goto update_io_group;
-	}
-	spin_unlock_irq(&blkcg->lock);
-
-	blkio_update_policy_rule(pn, newpn);
-
-update_io_group:
-	blkio_update_policy_node_blkg(blkcg, newpn);
-
-free_newpn:
-	if (!keep_newpn)
-		kfree(newpn);
-free_buf:
+	ret = blkio_policy_parse_and_set(buf, plid, fileid, blkcg);
 	kfree(buf);
 	return ret;
 }
 
-static void
-blkio_print_policy_node(struct seq_file *m, struct blkio_policy_node *pn)
+static void blkio_print_group_conf(struct cftype *cft, struct blkio_group *blkg,
+				   struct seq_file *m)
 {
-	switch(pn->plid) {
+	int fileid = BLKIOFILE_ATTR(cft->private);
+	int rw = WRITE;
+
+	switch (blkg->plid) {
 		case BLKIO_POLICY_PROP:
-			if (pn->fileid == BLKIO_PROP_weight_device)
-				seq_printf(m, "%u:%u\t%u\n", MAJOR(pn->dev),
-					MINOR(pn->dev), pn->val.weight);
+			if (blkg->conf.weight)
+				seq_printf(m, "%u:%u\t%u\n", MAJOR(blkg->dev),
+					MINOR(blkg->dev), blkg->conf.weight);
 			break;
 		case BLKIO_POLICY_THROTL:
-			switch(pn->fileid) {
+			switch (fileid) {
 			case BLKIO_THROTL_read_bps_device:
+				rw = READ;
 			case BLKIO_THROTL_write_bps_device:
-				seq_printf(m, "%u:%u\t%llu\n", MAJOR(pn->dev),
-					MINOR(pn->dev), pn->val.bps);
+				if (blkg->conf.bps[rw])
+					seq_printf(m, "%u:%u\t%llu\n",
+						   MAJOR(blkg->dev),
+						   MINOR(blkg->dev),
+						   blkg->conf.bps[rw]);
 				break;
 			case BLKIO_THROTL_read_iops_device:
+				rw = READ;
 			case BLKIO_THROTL_write_iops_device:
-				seq_printf(m, "%u:%u\t%u\n", MAJOR(pn->dev),
-					MINOR(pn->dev), pn->val.iops);
+				if (blkg->conf.iops[rw])
+					seq_printf(m, "%u:%u\t%u\n",
+						   MAJOR(blkg->dev),
+						   MINOR(blkg->dev),
+						   blkg->conf.iops[rw]);
 				break;
 			}
 			break;
@@ -1233,20 +934,17 @@ blkio_print_policy_node(struct seq_file *m, struct blkio_policy_node *pn)
 }
 
 /* cgroup files which read their data from policy nodes end up here */
-static void blkio_read_policy_node_files(struct cftype *cft,
-			struct blkio_cgroup *blkcg, struct seq_file *m)
+static void blkio_read_conf(struct cftype *cft, struct blkio_cgroup *blkcg,
+			    struct seq_file *m)
 {
-	struct blkio_policy_node *pn;
+	struct blkio_group *blkg;
+	struct hlist_node *n;
 
-	if (!list_empty(&blkcg->policy_list)) {
-		spin_lock_irq(&blkcg->lock);
-		list_for_each_entry(pn, &blkcg->policy_list, node) {
-			if (!pn_matches_cftype(cft, pn))
-				continue;
-			blkio_print_policy_node(m, pn);
-		}
-		spin_unlock_irq(&blkcg->lock);
-	}
+	spin_lock_irq(&blkcg->lock);
+	hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node)
+		if (BLKIOFILE_POLICY(cft->private) == blkg->plid)
+			blkio_print_group_conf(cft, blkg, m);
+	spin_unlock_irq(&blkcg->lock);
 }
 
 static int blkiocg_file_read(struct cgroup *cgrp, struct cftype *cft,
@@ -1262,7 +960,7 @@ static int blkiocg_file_read(struct cgroup *cgrp, struct cftype *cft,
 	case BLKIO_POLICY_PROP:
 		switch(name) {
 		case BLKIO_PROP_weight_device:
-			blkio_read_policy_node_files(cft, blkcg, m);
+			blkio_read_conf(cft, blkcg, m);
 			return 0;
 		default:
 			BUG();
@@ -1274,7 +972,7 @@ static int blkiocg_file_read(struct cgroup *cgrp, struct cftype *cft,
 		case BLKIO_THROTL_write_bps_device:
 		case BLKIO_THROTL_read_iops_device:
 		case BLKIO_THROTL_write_iops_device:
-			blkio_read_policy_node_files(cft, blkcg, m);
+			blkio_read_conf(cft, blkcg, m);
 			return 0;
 		default:
 			BUG();
@@ -1298,7 +996,7 @@ static int blkio_read_blkg_stats(struct blkio_cgroup *blkcg,
 	rcu_read_lock();
 	hlist_for_each_entry_rcu(blkg, n, &blkcg->blkg_list, blkcg_node) {
 		if (blkg->dev) {
-			if (!cftype_blkg_same_policy(cft, blkg))
+			if (BLKIOFILE_POLICY(cft->private) != blkg->plid)
 				continue;
 			if (pcpu)
 				cgroup_total += blkio_get_stat_cpu(blkg, cb,
@@ -1397,11 +1095,10 @@ static int blkiocg_file_read_map(struct cgroup *cgrp, struct cftype *cft,
 	return 0;
 }
 
-static int blkio_weight_write(struct blkio_cgroup *blkcg, u64 val)
+static int blkio_weight_write(struct blkio_cgroup *blkcg, int plid, u64 val)
 {
 	struct blkio_group *blkg;
 	struct hlist_node *n;
-	struct blkio_policy_node *pn;
 
 	if (val < BLKIO_WEIGHT_MIN || val > BLKIO_WEIGHT_MAX)
 		return -EINVAL;
@@ -1410,14 +1107,10 @@ static int blkio_weight_write(struct blkio_cgroup *blkcg, u64 val)
 	spin_lock(&blkcg->lock);
 	blkcg->weight = (unsigned int)val;
 
-	hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node) {
-		pn = blkio_policy_search_node(blkcg, blkg->dev,
-				BLKIO_POLICY_PROP, BLKIO_PROP_weight_device);
-		if (pn)
-			continue;
+	hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node)
+		if (blkg->plid == plid && !blkg->conf.weight)
+			blkio_update_group_weight(blkg, blkcg->weight);
 
-		blkio_update_group_weight(blkg, blkcg->weight);
-	}
 	spin_unlock(&blkcg->lock);
 	spin_unlock_irq(&blkio_list_lock);
 	return 0;
@@ -1456,7 +1149,7 @@ blkiocg_file_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
 	case BLKIO_POLICY_PROP:
 		switch(name) {
 		case BLKIO_PROP_weight:
-			return blkio_weight_write(blkcg, val);
+			return blkio_weight_write(blkcg, plid, val);
 		}
 		break;
 	default:
@@ -1637,7 +1330,6 @@ static void blkiocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
 	struct blkio_group *blkg;
 	struct request_queue *q;
 	struct blkio_policy_type *blkiop;
-	struct blkio_policy_node *pn, *pntmp;
 
 	rcu_read_lock();
 	do {
@@ -1669,11 +1361,6 @@ static void blkiocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
 		spin_unlock_irqrestore(&blkio_list_lock, flags);
 	} while (1);
 
-	list_for_each_entry_safe(pn, pntmp, &blkcg->policy_list, node) {
-		blkio_policy_delete_node(pn);
-		kfree(pn);
-	}
-
 	free_css_id(&blkio_subsys, &blkcg->css);
 	rcu_read_unlock();
 	if (blkcg != &blkio_root_cgroup)
@@ -1700,7 +1387,6 @@ done:
 	spin_lock_init(&blkcg->lock);
 	INIT_HLIST_HEAD(&blkcg->blkg_list);
 
-	INIT_LIST_HEAD(&blkcg->policy_list);
 	return &blkcg->css;
 }
 
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 280e55c..9e5a5ff 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -112,7 +112,6 @@ struct blkio_cgroup {
 	unsigned int weight;
 	spinlock_t lock;
 	struct hlist_head blkg_list;
-	struct list_head policy_list; /* list of blkio_policy_node */
 };
 
 struct blkio_group_stats {
@@ -182,37 +181,6 @@ struct blkio_group {
 	struct blkio_group_stats_cpu __percpu *stats_cpu;
 };
 
-struct blkio_policy_node {
-	struct list_head node;
-	dev_t dev;
-	/* This node belongs to max bw policy or porportional weight policy */
-	enum blkio_policy_id plid;
-	/* cgroup file to which this rule belongs to */
-	int fileid;
-
-	union {
-		unsigned int weight;
-		/*
-		 * Rate read/write in terms of bytes per second
-		 * Whether this rate represents read or write is determined
-		 * by file type "fileid".
-		 */
-		u64 bps;
-		unsigned int iops;
-	} val;
-};
-
-extern unsigned int blkcg_get_weight(struct blkio_cgroup *blkcg,
-				     dev_t dev);
-extern uint64_t blkcg_get_read_bps(struct blkio_cgroup *blkcg,
-				     dev_t dev);
-extern uint64_t blkcg_get_write_bps(struct blkio_cgroup *blkcg,
-				     dev_t dev);
-extern unsigned int blkcg_get_read_iops(struct blkio_cgroup *blkcg,
-				     dev_t dev);
-extern unsigned int blkcg_get_write_iops(struct blkio_cgroup *blkcg,
-				     dev_t dev);
-
 typedef struct blkio_group *(blkio_alloc_group_fn)(struct request_queue *q,
 						   struct blkio_cgroup *blkcg);
 typedef void (blkio_link_group_fn)(struct request_queue *q,
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 17/17] blkcg: kill the mind-bending blkg->dev
  2012-01-22  3:25 [PATCHSET] blkcg: kill policy node and blkg->dev, take#2 Tejun Heo
                   ` (15 preceding siblings ...)
  2012-01-22  3:25 ` [PATCH 16/17] blkcg: kill blkio_policy_node Tejun Heo
@ 2012-01-22  3:25 ` Tejun Heo
  16 siblings, 0 replies; 45+ messages in thread
From: Tejun Heo @ 2012-01-22  3:25 UTC (permalink / raw)
  To: axboe, vgoyal; +Cc: ctalbott, rni, linux-kernel, Tejun Heo

blkg->dev is a dev_t recording the device number of the block device
for the associated request_queue.  It is used to identify the
associated block device when printing out configurations or stats.

This is redundant to begin with.  A blkg is an association between a
cgroup and a request_queue, the request_queue is of course reachable
from the blkg, and synchronization conventions are in place for safe q
dereferencing, so blkg->dev shouldn't have been necessary in the first
place.  Furthermore, it's initialized by sscanf()ing the device name of
backing_dev_info.  The mind boggles.

Anyways, if a blkg is visible under rcu lock, we *know* that the
associated request_queue hasn't gone away yet and its bdi is registered
and alive - a blkg can't be created for a request_queue which hasn't
been fully initialized, and the queue can't go away before the blkg is
removed.

Let stat and conf read functions get device name from
blkg->q->backing_dev_info.dev and pass it down to printing functions
and remove blkg->dev.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-cgroup.c   |   86 ++++++++++++++++++++++++++------------------------
 block/blk-cgroup.h   |    2 -
 block/blk-throttle.c |   51 +----------------------------
 block/cfq-iosched.c  |   21 ------------
 4 files changed, 47 insertions(+), 113 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index a983c43..1a42546 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -626,10 +626,10 @@ blkiocg_reset_stats(struct cgroup *cgroup, struct cftype *cftype, u64 val)
 	return 0;
 }
 
-static void blkio_get_key_name(enum stat_sub_type type, dev_t dev, char *str,
-				int chars_left, bool diskname_only)
+static void blkio_get_key_name(enum stat_sub_type type, const char *dname,
+			       char *str, int chars_left, bool diskname_only)
 {
-	snprintf(str, chars_left, "%d:%d", MAJOR(dev), MINOR(dev));
+	snprintf(str, chars_left, "%s", dname);
 	chars_left -= strlen(str);
 	if (chars_left <= 0) {
 		printk(KERN_WARNING
@@ -660,9 +660,9 @@ static void blkio_get_key_name(enum stat_sub_type type, dev_t dev, char *str,
 }
 
 static uint64_t blkio_fill_stat(char *str, int chars_left, uint64_t val,
-				struct cgroup_map_cb *cb, dev_t dev)
+				struct cgroup_map_cb *cb, const char *dname)
 {
-	blkio_get_key_name(0, dev, str, chars_left, true);
+	blkio_get_key_name(0, dname, str, chars_left, true);
 	cb->fill(cb, str, val);
 	return val;
 }
@@ -694,7 +694,8 @@ static uint64_t blkio_read_stat_cpu(struct blkio_group *blkg,
 }
 
 static uint64_t blkio_get_stat_cpu(struct blkio_group *blkg,
-		struct cgroup_map_cb *cb, dev_t dev, enum stat_type_cpu type)
+				   struct cgroup_map_cb *cb, const char *dname,
+				   enum stat_type_cpu type)
 {
 	uint64_t disk_total, val;
 	char key_str[MAX_KEY_LEN];
@@ -702,12 +703,14 @@ static uint64_t blkio_get_stat_cpu(struct blkio_group *blkg,
 
 	if (type == BLKIO_STAT_CPU_SECTORS) {
 		val = blkio_read_stat_cpu(blkg, type, 0);
-		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1, val, cb, dev);
+		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1, val, cb,
+				       dname);
 	}
 
 	for (sub_type = BLKIO_STAT_READ; sub_type < BLKIO_STAT_TOTAL;
 			sub_type++) {
-		blkio_get_key_name(sub_type, dev, key_str, MAX_KEY_LEN, false);
+		blkio_get_key_name(sub_type, dname, key_str, MAX_KEY_LEN,
+				   false);
 		val = blkio_read_stat_cpu(blkg, type, sub_type);
 		cb->fill(cb, key_str, val);
 	}
@@ -715,14 +718,16 @@ static uint64_t blkio_get_stat_cpu(struct blkio_group *blkg,
 	disk_total = blkio_read_stat_cpu(blkg, type, BLKIO_STAT_READ) +
 			blkio_read_stat_cpu(blkg, type, BLKIO_STAT_WRITE);
 
-	blkio_get_key_name(BLKIO_STAT_TOTAL, dev, key_str, MAX_KEY_LEN, false);
+	blkio_get_key_name(BLKIO_STAT_TOTAL, dname, key_str, MAX_KEY_LEN,
+			   false);
 	cb->fill(cb, key_str, disk_total);
 	return disk_total;
 }
 
 /* This should be called with blkg->stats_lock held */
 static uint64_t blkio_get_stat(struct blkio_group *blkg,
-		struct cgroup_map_cb *cb, dev_t dev, enum stat_type type)
+			       struct cgroup_map_cb *cb, const char *dname,
+			       enum stat_type type)
 {
 	uint64_t disk_total;
 	char key_str[MAX_KEY_LEN];
@@ -730,11 +735,11 @@ static uint64_t blkio_get_stat(struct blkio_group *blkg,
 
 	if (type == BLKIO_STAT_TIME)
 		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
-					blkg->stats.time, cb, dev);
+					blkg->stats.time, cb, dname);
 #ifdef CONFIG_DEBUG_BLK_CGROUP
 	if (type == BLKIO_STAT_UNACCOUNTED_TIME)
 		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
-					blkg->stats.unaccounted_time, cb, dev);
+				       blkg->stats.unaccounted_time, cb, dname);
 	if (type == BLKIO_STAT_AVG_QUEUE_SIZE) {
 		uint64_t sum = blkg->stats.avg_queue_size_sum;
 		uint64_t samples = blkg->stats.avg_queue_size_samples;
@@ -742,30 +747,33 @@ static uint64_t blkio_get_stat(struct blkio_group *blkg,
 			do_div(sum, samples);
 		else
 			sum = 0;
-		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1, sum, cb, dev);
+		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
+				       sum, cb, dname);
 	}
 	if (type == BLKIO_STAT_GROUP_WAIT_TIME)
 		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
-					blkg->stats.group_wait_time, cb, dev);
+				       blkg->stats.group_wait_time, cb, dname);
 	if (type == BLKIO_STAT_IDLE_TIME)
 		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
-					blkg->stats.idle_time, cb, dev);
+				       blkg->stats.idle_time, cb, dname);
 	if (type == BLKIO_STAT_EMPTY_TIME)
 		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
-					blkg->stats.empty_time, cb, dev);
+				       blkg->stats.empty_time, cb, dname);
 	if (type == BLKIO_STAT_DEQUEUE)
 		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
-					blkg->stats.dequeue, cb, dev);
+				       blkg->stats.dequeue, cb, dname);
 #endif
 
 	for (sub_type = BLKIO_STAT_READ; sub_type < BLKIO_STAT_TOTAL;
 			sub_type++) {
-		blkio_get_key_name(sub_type, dev, key_str, MAX_KEY_LEN, false);
+		blkio_get_key_name(sub_type, dname, key_str, MAX_KEY_LEN,
+				   false);
 		cb->fill(cb, key_str, blkg->stats.stat_arr[type][sub_type]);
 	}
 	disk_total = blkg->stats.stat_arr[type][BLKIO_STAT_READ] +
 			blkg->stats.stat_arr[type][BLKIO_STAT_WRITE];
-	blkio_get_key_name(BLKIO_STAT_TOTAL, dev, key_str, MAX_KEY_LEN, false);
+	blkio_get_key_name(BLKIO_STAT_TOTAL, dname, key_str, MAX_KEY_LEN,
+			   false);
 	cb->fill(cb, key_str, disk_total);
 	return disk_total;
 }
@@ -897,14 +905,15 @@ static int blkiocg_file_write(struct cgroup *cgrp, struct cftype *cft,
 static void blkio_print_group_conf(struct cftype *cft, struct blkio_group *blkg,
 				   struct seq_file *m)
 {
+	const char *dname = dev_name(blkg->q->backing_dev_info.dev);
 	int fileid = BLKIOFILE_ATTR(cft->private);
 	int rw = WRITE;
 
 	switch (blkg->plid) {
 		case BLKIO_POLICY_PROP:
 			if (blkg->conf.weight)
-				seq_printf(m, "%u:%u\t%u\n", MAJOR(blkg->dev),
-					MINOR(blkg->dev), blkg->conf.weight);
+				seq_printf(m, "%s\t%u\n",
+					   dname, blkg->conf.weight);
 			break;
 		case BLKIO_POLICY_THROTL:
 			switch (fileid) {
@@ -912,19 +921,15 @@ static void blkio_print_group_conf(struct cftype *cft, struct blkio_group *blkg,
 				rw = READ;
 			case BLKIO_THROTL_write_bps_device:
 				if (blkg->conf.bps[rw])
-					seq_printf(m, "%u:%u\t%llu\n",
-						   MAJOR(blkg->dev),
-						   MINOR(blkg->dev),
-						   blkg->conf.bps[rw]);
+					seq_printf(m, "%s\t%llu\n",
+						   dname, blkg->conf.bps[rw]);
 				break;
 			case BLKIO_THROTL_read_iops_device:
 				rw = READ;
 			case BLKIO_THROTL_write_iops_device:
 				if (blkg->conf.iops[rw])
-					seq_printf(m, "%u:%u\t%u\n",
-						   MAJOR(blkg->dev),
-						   MINOR(blkg->dev),
-						   blkg->conf.iops[rw]);
+					seq_printf(m, "%s\t%u\n",
+						   dname, blkg->conf.iops[rw]);
 				break;
 			}
 			break;
@@ -995,18 +1000,17 @@ static int blkio_read_blkg_stats(struct blkio_cgroup *blkcg,
 
 	rcu_read_lock();
 	hlist_for_each_entry_rcu(blkg, n, &blkcg->blkg_list, blkcg_node) {
-		if (blkg->dev) {
-			if (BLKIOFILE_POLICY(cft->private) != blkg->plid)
-				continue;
-			if (pcpu)
-				cgroup_total += blkio_get_stat_cpu(blkg, cb,
-						blkg->dev, type);
-			else {
-				spin_lock_irq(&blkg->stats_lock);
-				cgroup_total += blkio_get_stat(blkg, cb,
-						blkg->dev, type);
-				spin_unlock_irq(&blkg->stats_lock);
-			}
+		const char *dname = dev_name(blkg->q->backing_dev_info.dev);
+
+		if (BLKIOFILE_POLICY(cft->private) != blkg->plid)
+			continue;
+		if (pcpu)
+			cgroup_total += blkio_get_stat_cpu(blkg, cb, dname,
+							   type);
+		else {
+			spin_lock_irq(&blkg->stats_lock);
+			cgroup_total += blkio_get_stat(blkg, cb, dname, type);
+			spin_unlock_irq(&blkg->stats_lock);
 		}
 	}
 	if (show_total)
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 9e5a5ff..7e0102c 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -166,8 +166,6 @@ struct blkio_group {
 	unsigned short blkcg_id;
 	/* Store cgroup path */
 	char path[128];
-	/* The device MKDEV(major, minor), this group has been created for */
-	dev_t dev;
 	/* policy which owns this blk group */
 	enum blkio_policy_id plid;
 
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index c7b3aea..f50ce5a 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -212,50 +212,12 @@ static struct blkio_group *throtl_alloc_blkio_group(struct request_queue *q,
 	return &tg->blkg;
 }
 
-static void
-__throtl_tg_fill_dev_details(struct throtl_data *td, struct throtl_grp *tg)
-{
-	struct backing_dev_info *bdi = &td->queue->backing_dev_info;
-	unsigned int major, minor;
-
-	if (!tg || tg->blkg.dev)
-		return;
-
-	/*
-	 * Fill in device details for a group which might not have been
-	 * filled at group creation time as queue was being instantiated
-	 * and driver had not attached a device yet
-	 */
-	if (bdi->dev && dev_name(bdi->dev)) {
-		sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
-		tg->blkg.dev = MKDEV(major, minor);
-	}
-}
-
-/*
- * Should be called with without queue lock held. Here queue lock will be
- * taken rarely. It will be taken only once during life time of a group
- * if need be
- */
-static void
-throtl_tg_fill_dev_details(struct throtl_data *td, struct throtl_grp *tg)
-{
-	if (!tg || tg->blkg.dev)
-		return;
-
-	spin_lock_irq(td->queue->queue_lock);
-	__throtl_tg_fill_dev_details(td, tg);
-	spin_unlock_irq(td->queue->queue_lock);
-}
-
 static void throtl_link_blkio_group(struct request_queue *q,
 				    struct blkio_group *blkg)
 {
 	struct throtl_data *td = q->td;
 	struct throtl_grp *tg = tg_of_blkg(blkg);
 
-	__throtl_tg_fill_dev_details(td, tg);
-
 	hlist_add_head(&tg->tg_node, &td->tg_list);
 	td->nr_undestroyed_grps++;
 }
@@ -263,20 +225,14 @@ static void throtl_link_blkio_group(struct request_queue *q,
 static struct
 throtl_grp *throtl_lookup_tg(struct throtl_data *td, struct blkio_cgroup *blkcg)
 {
-	struct throtl_grp *tg = NULL;
-
 	/*
 	 * This is the common case when there are no blkio cgroups.
  	 * Avoid lookup in this case
  	 */
 	if (blkcg == &blkio_root_cgroup)
-		tg = td->root_tg;
-	else
-		tg = tg_of_blkg(blkg_lookup(blkcg, td->queue,
-					    BLKIO_POLICY_THROTL));
+		return td->root_tg;
 
-	__throtl_tg_fill_dev_details(td, tg);
-	return tg;
+	return tg_of_blkg(blkg_lookup(blkcg, td->queue, BLKIO_POLICY_THROTL));
 }
 
 static struct throtl_grp *throtl_lookup_create_tg(struct throtl_data *td,
@@ -294,7 +250,6 @@ static struct throtl_grp *throtl_lookup_create_tg(struct throtl_data *td,
 	else if (!blk_queue_dead(q))
 		tg = td->root_tg;
 
-	__throtl_tg_fill_dev_details(td, tg);
 	return tg;
 }
 
@@ -1065,8 +1020,6 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
 	blkcg = task_blkio_cgroup(current);
 	tg = throtl_lookup_tg(td, blkcg);
 	if (tg) {
-		throtl_tg_fill_dev_details(td, tg);
-
 		if (tg_no_rule_group(tg, rw)) {
 			blkiocg_update_dispatch_stats(&tg->blkg, bio->bi_size,
 					rw, rw_is_sync(bio->bi_rw));
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 5498513..6970c5e 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1052,20 +1052,7 @@ static void cfq_link_blkio_group(struct request_queue *q,
 				 struct blkio_group *blkg)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
-	struct backing_dev_info *bdi = &q->backing_dev_info;
 	struct cfq_group *cfqg = cfqg_of_blkg(blkg);
-	unsigned int major, minor;
-
-	/*
-	 * Add group onto cgroup list. It might happen that bdi->dev is
-	 * not initialized yet. Initialize this new group without major
-	 * and minor info and this info will be filled in once a new thread
-	 * comes for IO.
-	 */
-	if (bdi->dev) {
-		sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
-		blkg->dev = MKDEV(major, minor);
-	}
 
 	cfqd->nr_blkcg_linked_grps++;
 
@@ -2800,7 +2787,6 @@ static struct cfq_queue *
 cfq_find_alloc_queue(struct cfq_data *cfqd, bool is_sync,
 		     struct io_context *ioc, gfp_t gfp_mask)
 {
-	struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
 	struct blkio_cgroup *blkcg;
 	struct cfq_queue *cfqq, *new_cfqq = NULL;
 	struct cfq_io_cq *cic;
@@ -2817,13 +2803,6 @@ retry:
 	else
 		cfqg = cfq_lookup_create_cfqg(cfqd, blkcg);
 
-	if (cfqg && !cfqg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
-		unsigned int major, minor;
-
-		sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
-		cfqg->blkg.dev = MKDEV(major, minor);
-	}
-
 	cic = cfq_cic_lookup(cfqd, ioc);
 	/* cic always exists here */
 	cfqq = cic_to_cfqq(cic, is_sync);
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 01/17] blkcg: make CONFIG_BLK_CGROUP bool
  2012-01-22  3:25 ` [PATCH 01/17] blkcg: make CONFIG_BLK_CGROUP bool Tejun Heo
@ 2012-01-23 15:00   ` Vivek Goyal
  2012-01-23 15:34     ` Tejun Heo
  0 siblings, 1 reply; 45+ messages in thread
From: Vivek Goyal @ 2012-01-23 15:00 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, ctalbott, rni, linux-kernel

On Sat, Jan 21, 2012 at 07:25:09PM -0800, Tejun Heo wrote:
> Block cgroup core can be built as module; however, it isn't too useful
> as blk-throttle can only be built-in and cfq-iosched is usually the
> default built-in scheduler.  Scheduled blkcg cleanup requires calling
> into blkcg from block core.  To simplify that, disallow building blkcg
> as module by making CONFIG_BLK_CGROUP bool.
> 
> If building blkcg core as module really matters, which I doubt, we can
> revisit it after blkcg API cleanup.

I think not allowing blk-cgroup to be compiled as a module is a good thing.

> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Vivek Goyal <vgoyal@redhat.com>
> ---
>  block/Kconfig.iosched |    5 +----
>  block/blk-cgroup.c    |   17 -----------------
>  block/blk-cgroup.h    |   10 ++--------
>  init/Kconfig          |    2 +-
>  4 files changed, 4 insertions(+), 30 deletions(-)
> 
> diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
> index 3199b76..8dca96d 100644
> --- a/block/Kconfig.iosched
> +++ b/block/Kconfig.iosched
> @@ -23,8 +23,7 @@ config IOSCHED_DEADLINE
>  
>  config IOSCHED_CFQ
>  	tristate "CFQ I/O scheduler"
> -	# If BLK_CGROUP is a module, CFQ has to be built as module.
> -	depends on (BLK_CGROUP=m && m) || !BLK_CGROUP || BLK_CGROUP=y
> +	depends on BLK_CGROUP

You don't need the above dependency now.  Otherwise, if BLK_CGROUP=n,
one can't use CFQ.  We just want cfq group scheduling to depend on
BLK_CGROUP.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 08/17] blkcg: shoot down blkio_groups on elevator switch
  2012-01-22  3:25 ` [PATCH 08/17] blkcg: shoot down blkio_groups on elevator switch Tejun Heo
@ 2012-01-23 15:20   ` Vivek Goyal
  2012-01-23 15:36     ` Vivek Goyal
  2012-01-23 15:39     ` Tejun Heo
  0 siblings, 2 replies; 45+ messages in thread
From: Vivek Goyal @ 2012-01-23 15:20 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, ctalbott, rni, linux-kernel

On Sat, Jan 21, 2012 at 07:25:16PM -0800, Tejun Heo wrote:

[..]
> diff --git a/block/elevator.c b/block/elevator.c
> index 078a491..5e371e4 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -38,6 +38,7 @@
>  #include <trace/events/block.h>
>  
>  #include "blk.h"
> +#include "blk-cgroup.h"
>  
>  static DEFINE_SPINLOCK(elv_list_lock);
>  static LIST_HEAD(elv_list);
> @@ -939,6 +940,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
>  
>  	spin_lock_irq(q->queue_lock);
>  	ioc_clear_queue(q);
> +	blkcg_clear_queue(q);

What happens to per-device rules (per-device weight and per-device
throttling rules)?  IIUC, in the new scheme of things, these rules will
be gone upon an elevator switch.

I can understand per-device weight rules disappearing, but what about
throttling rules?  They are independent of the IO scheduler, and a
change of IO scheduler should not cause per-device throttling rules to
get lost.

This gets even worse if the elevator switch fails and we fall back to
the old elevator.  Now we have lost the rules, as they were part of
blkg->conf and the old groups are gone.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 01/17] blkcg: make CONFIG_BLK_CGROUP bool
  2012-01-23 15:00   ` Vivek Goyal
@ 2012-01-23 15:34     ` Tejun Heo
  0 siblings, 0 replies; 45+ messages in thread
From: Tejun Heo @ 2012-01-23 15:34 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, ctalbott, rni, linux-kernel

Hello,

On Mon, Jan 23, 2012 at 10:00:51AM -0500, Vivek Goyal wrote:
> On Sat, Jan 21, 2012 at 07:25:09PM -0800, Tejun Heo wrote:
> > Block cgroup core can be built as module; however, it isn't too useful
> > as blk-throttle can only be built-in and cfq-iosched is usually the
> > default built-in scheduler.  Scheduled blkcg cleanup requires calling
> > into blkcg from block core.  To simplify that, disallow building blkcg
> > as module by making CONFIG_BLK_CGROUP bool.
> > 
> > If building blkcg core as module really matters, which I doubt, we can
> > revisit it after blkcg API cleanup.
> 
> I think not allowing compiling blk-cgroup as module is a good thing.

Sure, it's nice.  It just is mostly pointless at this point and
hinders API cleanup.  As written above, if this really matters, let's
add it back after cleanup is complete.

> >  config IOSCHED_CFQ
> >  	tristate "CFQ I/O scheduler"
> > -	# If BLK_CGROUP is a module, CFQ has to be built as module.
> > -	depends on (BLK_CGROUP=m && m) || !BLK_CGROUP || BLK_CGROUP=y
> > +	depends on BLK_CGROUP
> 
> You don't need above dependency now. Otherwise if BLK_CGROUP=n then one
> can't use CFQ. We just want cfq group scheduling to be dependent on
> BLK_CGROUP.

Ooh, right, will drop the line.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 08/17] blkcg: shoot down blkio_groups on elevator switch
  2012-01-23 15:20   ` Vivek Goyal
@ 2012-01-23 15:36     ` Vivek Goyal
  2012-01-23 15:49       ` Tejun Heo
  2012-01-23 15:39     ` Tejun Heo
  1 sibling, 1 reply; 45+ messages in thread
From: Vivek Goyal @ 2012-01-23 15:36 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, ctalbott, rni, linux-kernel

On Mon, Jan 23, 2012 at 10:20:55AM -0500, Vivek Goyal wrote:
> On Sat, Jan 21, 2012 at 07:25:16PM -0800, Tejun Heo wrote:
> 
> [..]
> > diff --git a/block/elevator.c b/block/elevator.c
> > index 078a491..5e371e4 100644
> > --- a/block/elevator.c
> > +++ b/block/elevator.c
> > @@ -38,6 +38,7 @@
> >  #include <trace/events/block.h>
> >  
> >  #include "blk.h"
> > +#include "blk-cgroup.h"
> >  
> >  static DEFINE_SPINLOCK(elv_list_lock);
> >  static LIST_HEAD(elv_list);
> > @@ -939,6 +940,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
> >  
> >  	spin_lock_irq(q->queue_lock);
> >  	ioc_clear_queue(q);
> > +	blkcg_clear_queue(q);
> 
> What happens to per device rules (per device weight and per device
> throttling rules)? IIUC, in new scheme of things, upon elevator switch
> these rules will be gone.
> 
> I can understand per device weight rules disappearing but what about 
> throttling rules. They are independent of IO scheduler and change of
> io scheduler should not cause per device throttling rules to get lost.
> 
> This just gets worse if elevator switch fails and we fall back to old
> elevator. But now we have lost of the rules as rules were part of
> blkg.conf and old groups are gone.

IIUC, the above is racy w.r.t. cgroup removal vs. elevator switch.
Assume that an elevator switch is taking place, we hold the queue lock
and we try to clear the groups on the queue.  In parallel, somebody is
trying to delete a cgroup and has been partially successful in doing so
by taking the group off the blkcg list (blkiocg_destroy()).

Now clear_queue() will complete with one or more groups possibly still
left on the cfqd list because of the cgroup deletion race, and that can
cause problems.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 08/17] blkcg: shoot down blkio_groups on elevator switch
  2012-01-23 15:20   ` Vivek Goyal
  2012-01-23 15:36     ` Vivek Goyal
@ 2012-01-23 15:39     ` Tejun Heo
  2012-01-23 15:52       ` Vivek Goyal
  1 sibling, 1 reply; 45+ messages in thread
From: Tejun Heo @ 2012-01-23 15:39 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, ctalbott, rni, linux-kernel

Hello,

On Mon, Jan 23, 2012 at 10:20:55AM -0500, Vivek Goyal wrote:
> What happens to per device rules (per device weight and per device
> throttling rules)? IIUC, in new scheme of things, upon elevator switch
> these rules will be gone.
> 
> I can understand per device weight rules disappearing but what about 
> throttling rules. They are independent of IO scheduler and change of
> io scheduler should not cause per device throttling rules to get lost.
> 
> This just gets worse if elevator switch fails and we fall back to old
> elevator. But now we have lost of the rules as rules were part of
> blkg.conf and old groups are gone.

As with any other choice, it's about trade-offs and there are multiple
aspects to consider.  Both elevator and policy changes are extremely
low-frequency and inherently disruptive operations, and having
persistency across them doesn't justify the amount of complexity and
design convolutions we have now, especially not when rule persistency
across those switches can be implemented with a mostly trivial
switching script from userland.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 08/17] blkcg: shoot down blkio_groups on elevator switch
  2012-01-23 15:36     ` Vivek Goyal
@ 2012-01-23 15:49       ` Tejun Heo
  0 siblings, 0 replies; 45+ messages in thread
From: Tejun Heo @ 2012-01-23 15:49 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, ctalbott, rni, linux-kernel

Hello,

On Mon, Jan 23, 2012 at 10:36:48AM -0500, Vivek Goyal wrote:
> IIUC, above is racy w.r.t cgroup removal and elevator switch. Assume that
> elevator swtich is taking place and we have queue lock held and we try to
> clear the groups on the queue. Parallely somebody is trying to delete a
> cgroup and has been partially successful in doing so by taking off the
> group from blkcg list (blkiocg_destroy()). 
> 
> Now clear_queue() will complete with one or more groups possibly still
> left on cfqd list because of cgroup deletion race and that can cause
> problmes.

Yeah, the fun of smart-ass locking.  Ultimately, the same locking
scheme as ioc's will be used - ie. any modification takes both locks
and there's no limbo state.  Things are so tightly entangled that I'm
finding it very challenging to sequence the patches in the exact right
order.  I'll see if I can re-sequence the locking update before this
one, but I might just as well declare that there's a transitional race
condition in the patch.

There are also a couple of other issues that I found yesterday while
updating further patches.

* blkio_list_lock has a locking order reversal.  This isn't difficult
  to fix.

* root_group too gets shot down across an elevator switch.  It needs to
  be reinitialized afterwards.  This one too turns out to be pretty
  tricky to sequence right.

It probably isn't too easy to see the direction at this point, so...

* There will be a single blkg per cgroup - request_queue pair
  regardless of the number of policies.  Each blkg carries a common
  part plus an opaque data part for each policy and is managed by the
  blkcg core layer (rough sketch below).

* The set of enabled policies will become a per-queue property.
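
Purely as an illustration of that layout - the field and constant names
below are made up and not from any posted patch:

struct blkio_group {
	/* common part, owned and managed by blkcg core */
	struct request_queue __rcu *q;	/* associated request_queue */
	struct hlist_node blkcg_node;	/* anchored on blkcg->blkg_list */
	struct blkio_group_conf conf;	/* per-device configuration */

	/* one opaque per-policy data area, indexed by policy id */
	void *pd[BLKIO_NR_POLICIES];
};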

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 08/17] blkcg: shoot down blkio_groups on elevator switch
  2012-01-23 15:39     ` Tejun Heo
@ 2012-01-23 15:52       ` Vivek Goyal
  2012-01-23 15:57         ` Tejun Heo
  0 siblings, 1 reply; 45+ messages in thread
From: Vivek Goyal @ 2012-01-23 15:52 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, ctalbott, rni, linux-kernel

On Mon, Jan 23, 2012 at 07:39:13AM -0800, Tejun Heo wrote:
> Hello,
> 
> On Mon, Jan 23, 2012 at 10:20:55AM -0500, Vivek Goyal wrote:
> > What happens to per device rules (per device weight and per device
> > throttling rules)? IIUC, in new scheme of things, upon elevator switch
> > these rules will be gone.
> > 
> > I can understand per device weight rules disappearing but what about 
> > throttling rules. They are independent of IO scheduler and change of
> > io scheduler should not cause per device throttling rules to get lost.
> > 
> > This just gets worse if elevator switch fails and we fall back to old
> > elevator. But now we have lost of the rules as rules were part of
> > blkg.conf and old groups are gone.
> 
> As any other choices, it's about trade off and there multiple aspects
> to consider.  Both elevator and policy changes are extremely low
> frequency and inherently disruptive operations and having persistency
> across them doesn't justify adding the amount of complexity and design
> convolutions we have now, especially not when rule persistency across
> those switches can be implemented with mostly trivial switching script
> from userland.

At least throttling rules should not disappear over an elevator switch.
They are per device and not per IO scheduler.  Think of losing the
nr_requests setting just because an elevator switch happened.

Elevator switches may be low frequency, but how would user space know
that an elevator switch failed, that this is why the rules were lost,
and that it now needs to put the rules back?

I am not sure how we would justify that, for ease of programming in the
kernel, user space is now supposed to make sure that any programmed
rules are still there and to reprogram the cgroup if the rules are gone
for some reason.

Thanks
Vivek 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 08/17] blkcg: shoot down blkio_groups on elevator switch
  2012-01-23 15:52       ` Vivek Goyal
@ 2012-01-23 15:57         ` Tejun Heo
  2012-01-23 16:10           ` Vivek Goyal
  0 siblings, 1 reply; 45+ messages in thread
From: Tejun Heo @ 2012-01-23 15:57 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, ctalbott, rni, linux-kernel

On Mon, Jan 23, 2012 at 10:52:16AM -0500, Vivek Goyal wrote:
> Atleast throttling rules should not disappear over elevator switch. They
> are per device and not per IO scheduler. Think of losing nr_requests
> settings just because elevator switch happened.
> 
> Elevator switch can be low frequency but how would a user space know
> that elevator switch failed that's why we lost our rules and now lets
> put the rules back.

It's simple - store all the policy rules before switching elevators
and restore them afterwards regardless of success / failure.
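
For illustration only, the save/restore step for a single cgroup and a
single config file could be sketched roughly as below; the paths, the
cgroup name and the target elevator are assumptions and error handling
is omitted:

#include <stdio.h>
#include <string.h>

/* hypothetical paths - adjust to the actual cgroup mount and device */
#define CONF  "/sys/fs/cgroup/blkio/grp/blkio.throttle.read_bps_device"
#define SCHED "/sys/block/sda/queue/scheduler"

static void read_file(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");
	size_t n = 0;

	if (f) {
		n = fread(buf, 1, len - 1, f);
		fclose(f);
	}
	buf[n] = '\0';
}

static void write_str(const char *path, const char *str)
{
	FILE *f = fopen(path, "w");

	if (f) {
		fputs(str, f);
		fclose(f);
	}
}

int main(void)
{
	static char saved[4096];
	char *line, *pos;

	/* 1. remember the current "MAJ:MIN value" rules */
	read_file(CONF, saved, sizeof(saved));

	/* 2. switch the elevator - this clears the blkg's and their config */
	write_str(SCHED, "deadline");

	/* 3. write the saved rules back, one line at a time */
	for (line = strtok_r(saved, "\n", &pos); line;
	     line = strtok_r(NULL, "\n", &pos))
		write_str(CONF, line);

	return 0;
}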

> I am not sure how would we justify this that because of ease of programming
> in kernel, now user space is supposed to make sure that any programmed
> rules are still there and reprogram the cgroup if rules are gone for
> some reason.

Sanity in trade off?

-- 
tejun

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 08/17] blkcg: shoot down blkio_groups on elevator switch
  2012-01-23 15:57         ` Tejun Heo
@ 2012-01-23 16:10           ` Vivek Goyal
  2012-01-23 16:13             ` Vivek Goyal
  2012-01-23 16:16             ` Tejun Heo
  0 siblings, 2 replies; 45+ messages in thread
From: Vivek Goyal @ 2012-01-23 16:10 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, ctalbott, rni, linux-kernel

On Mon, Jan 23, 2012 at 07:57:45AM -0800, Tejun Heo wrote:
> On Mon, Jan 23, 2012 at 10:52:16AM -0500, Vivek Goyal wrote:
> > Atleast throttling rules should not disappear over elevator switch. They
> > are per device and not per IO scheduler. Think of losing nr_requests
> > settings just because elevator switch happened.
> > 
> > Elevator switch can be low frequency but how would a user space know
> > that elevator switch failed that's why we lost our rules and now lets
> > put the rules back.
> 
> It's simple - store all the policy rules before switching elevators
> and restore them afterwards regardless of success / failure.

This does not work well in a hierarchical management scenario.  Say a
user gets an upper limit of 10MB/s on a device and can now manage his
own children groups, dividing the allocated 10MB/s among the children
the way he wants.

Now if root does the elevator switch, saves all the rules (including
the user's rules) and then restores them, this can very well race with
the user's scripts changing rules.  If the user changed a cgroup device
rule during the elevator switch and root restored the old rules
afterwards, the user's new rule will be lost, leading to confusion.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 08/17] blkcg: shoot down blkio_groups on elevator switch
  2012-01-23 16:10           ` Vivek Goyal
@ 2012-01-23 16:13             ` Vivek Goyal
  2012-01-23 16:20               ` Tejun Heo
  2012-01-23 16:16             ` Tejun Heo
  1 sibling, 1 reply; 45+ messages in thread
From: Vivek Goyal @ 2012-01-23 16:13 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, ctalbott, rni, linux-kernel

On Mon, Jan 23, 2012 at 11:10:42AM -0500, Vivek Goyal wrote:
> On Mon, Jan 23, 2012 at 07:57:45AM -0800, Tejun Heo wrote:
> > On Mon, Jan 23, 2012 at 10:52:16AM -0500, Vivek Goyal wrote:
> > > Atleast throttling rules should not disappear over elevator switch. They
> > > are per device and not per IO scheduler. Think of losing nr_requests
> > > settings just because elevator switch happened.
> > > 
> > > Elevator switch can be low frequency but how would a user space know
> > > that elevator switch failed that's why we lost our rules and now lets
> > > put the rules back.
> > 
> > It's simple - store all the policy rules before switching elevators
> > and restore them afterwards regardless of success / failure.
> 
> This does not work well in hierachical management/scenario. Think that
> a user gets an upper limit of 10MB/s on a device and now user can manage
> its own children groups and divide allocated 10MB/s in children the way
> he wants.
> 
> Now if root does the elevator switch, and saves all the rules (including
> user's rules) and then restores back, these can very well race with
> user's scripts of changing rules. If user changed a cgroup device rule
> during elevator switch and after elevator switch root restored back
> old rules, user's new rule will be lost leading to confusion.

How about draining throttle groups only on queue exit
(blk_cleanup_queue()) and not on elevator switch?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 08/17] blkcg: shoot down blkio_groups on elevator switch
  2012-01-23 16:10           ` Vivek Goyal
  2012-01-23 16:13             ` Vivek Goyal
@ 2012-01-23 16:16             ` Tejun Heo
  2012-01-23 16:25               ` Vivek Goyal
  1 sibling, 1 reply; 45+ messages in thread
From: Tejun Heo @ 2012-01-23 16:16 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, ctalbott, rni, linux-kernel

On Mon, Jan 23, 2012 at 11:10:42AM -0500, Vivek Goyal wrote:
> This does not work well in hierachical management/scenario. Think that
> a user gets an upper limit of 10MB/s on a device and now user can manage
> its own children groups and divide allocated 10MB/s in children the way
> he wants.
> 
> Now if root does the elevator switch, and saves all the rules (including
> user's rules) and then restores back, these can very well race with
> user's scripts of changing rules. If user changed a cgroup device rule
> during elevator switch and after elevator switch root restored back
> old rules, user's new rule will be lost leading to confusion.

It simply doesn't matter.  Just declare that elevator switches and
policy changes reset blkcg configurations.  If some people are crazy
enough to switch elevators regularly with a full system running and
parallel configuration going on, let them deal with synchronizing their
scripts.

-- 
tejun

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 08/17] blkcg: shoot down blkio_groups on elevator switch
  2012-01-23 16:13             ` Vivek Goyal
@ 2012-01-23 16:20               ` Tejun Heo
  2012-01-23 16:28                 ` Vivek Goyal
  0 siblings, 1 reply; 45+ messages in thread
From: Tejun Heo @ 2012-01-23 16:20 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, ctalbott, rni, linux-kernel

On Mon, Jan 23, 2012 at 11:13:08AM -0500, Vivek Goyal wrote:
> How about draining throttle groups only on queue exit (blk_cleanup_queue())
> and not on elevator switch?

It's for possible policy changes and to fully manage blkg from blkcg.
The blkg itself needs to change as the applied policies change, and so
needs to be flushed.

-- 
tejun


* Re: [PATCH 08/17] blkcg: shoot down blkio_groups on elevator switch
  2012-01-23 16:16             ` Tejun Heo
@ 2012-01-23 16:25               ` Vivek Goyal
  2012-01-23 17:10                 ` Tejun Heo
  0 siblings, 1 reply; 45+ messages in thread
From: Vivek Goyal @ 2012-01-23 16:25 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, ctalbott, rni, linux-kernel

On Mon, Jan 23, 2012 at 08:16:19AM -0800, Tejun Heo wrote:
> On Mon, Jan 23, 2012 at 11:10:42AM -0500, Vivek Goyal wrote:
> > This does not work well in a hierarchical management scenario. Think of
> > a user who gets an upper limit of 10MB/s on a device and can now manage
> > his own child groups, dividing the allocated 10MB/s among the children
> > the way he wants.
> >
> > Now if root does the elevator switch, saves all the rules (including the
> > user's rules) and then restores them, this can very well race with the
> > user's own rule-changing scripts. If the user changed a cgroup device rule
> > during the elevator switch, and after the elevator switch root restored
> > the old rules, the user's new rule will be lost, leading to confusion.
> 
> It simply doesn't matter.  Just declare that an elevator switch or a
> policy change resets blkcg configurations.  If some people are crazy
> enough to switch elevators regularly with the full system running and
> parallel configuration going on, let them deal with synchronizing their
> scripts.

It does not have to be a regular switch. Even one switch during boot
can create issues.

In RHEL we have a set of scripts (tuned) which do system tuning based
on a user-chosen profile. These scripts do various things, including
changing the elevator. Once you have chosen the profile, it gets applied
automatically on every boot (through init scripts).

Now assume that after a reboot libvirtd is running, resuming various
suspended virtual machines or starting new ones, and in parallel this
profile is being applied. There is no way to avoid races, as systemd
allows parallel execution of services. The only way left is strong
serialization: no cgroup operation may take place in the system while
some init script is changing the elevator (no new cgroup creation, no
cgroup deletion and no rule setting by any daemon), otherwise changes
might be lost. In practice, how would I program the various init
scripts for this?

Thanks
Vivek


* Re: [PATCH 08/17] blkcg: shoot down blkio_groups on elevator switch
  2012-01-23 16:20               ` Tejun Heo
@ 2012-01-23 16:28                 ` Vivek Goyal
  2012-01-23 16:32                   ` Tejun Heo
  0 siblings, 1 reply; 45+ messages in thread
From: Vivek Goyal @ 2012-01-23 16:28 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, ctalbott, rni, linux-kernel

On Mon, Jan 23, 2012 at 08:20:35AM -0800, Tejun Heo wrote:
> On Mon, Jan 23, 2012 at 11:13:08AM -0500, Vivek Goyal wrote:
> > How about draining throttle groups only on queue exit (blk_cleanup_queue())
> > and not on elevator switch?
> 
> It's for possible policy changes and to fully manage blkg from blkcg.
> The blkg itself needs to change as the applied policies change, and so
> needs to be flushed.

Can we avoid integrating everything into a single blkg? What's wrong with
a separate blkg per policy? In that case we just don't have the
flexibility to change the throttling policy: if it is compiled in, it gets
activated. The only configurable thing is the IO scheduler, and those
groups will be cleaned up.

So keeping a separate blkg per policy gives us the flexibility to not
clean up the throttling data, and to keep the throttling rules persistent
across an elevator switch.
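
Schematically (the struct names and field lists below are made up, just
to illustrate the difference):

#include <linux/types.h>

/* separate blkg per policy, roughly how it is today: blk-throttle and
 * cfq each embed their own group, so the throttle group and its rules
 * can stay put while the cfq side is torn down on elevator switch */
struct throtl_grp_sketch {
        struct blkio_group blkg;        /* throttle's own blkg */
        u64 bps[2];                     /* per-device limits live here */
};

struct cfq_group_sketch {
        struct blkio_group blkg;        /* cfq's own blkg, dies with cfq */
        unsigned int weight;
};

/* unified: one blkg per cgroup-queue pair owned by blkcg core; shooting
 * it down on elevator switch takes the throttle configuration with it */
struct blkio_group_unified_sketch {
        struct request_queue *q;
        u64 bps[2];                     /* throttle part */
        unsigned int weight;            /* proportional part */
};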

Thanks
Vivek


* Re: [PATCH 08/17] blkcg: shoot down blkio_groups on elevator switch
  2012-01-23 16:28                 ` Vivek Goyal
@ 2012-01-23 16:32                   ` Tejun Heo
  0 siblings, 0 replies; 45+ messages in thread
From: Tejun Heo @ 2012-01-23 16:32 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, ctalbott, rni, linux-kernel

On Mon, Jan 23, 2012 at 11:28:40AM -0500, Vivek Goyal wrote:
> Can we avoid integrating everything into a single blkg? What's wrong with
> a separate blkg per policy? In that case we just don't have the
> flexibility to change the throttling policy: if it is compiled in, it gets
> activated. The only configurable thing is the IO scheduler, and those
> groups will be cleaned up.
>
> So keeping a separate blkg per policy gives us the flexibility to not
> clean up the throttling data, and to keep the throttling rules persistent
> across an elevator switch.

I think that's the wrong trade-off.  We end up duplicating common
stuff all over the place, and the code to deal with the mess is
naturally horrible and almost incomprehensible.  Persistency is a much
more minor issue and should be handled as such.  It shouldn't contort
the whole design the way it does now.  Let's talk about persistency in
the other reply.

-- 
tejun


* Re: [PATCH 08/17] blkcg: shoot down blkio_groups on elevator switch
  2012-01-23 16:25               ` Vivek Goyal
@ 2012-01-23 17:10                 ` Tejun Heo
  2012-01-23 18:27                   ` Vivek Goyal
  2012-01-23 18:32                   ` Vivek Goyal
  0 siblings, 2 replies; 45+ messages in thread
From: Tejun Heo @ 2012-01-23 17:10 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, ctalbott, rni, linux-kernel

On Mon, Jan 23, 2012 at 11:25:53AM -0500, Vivek Goyal wrote:
> It does not have to be a regular switch. Even one switch during boot
> can create issues.
>
> In RHEL we have a set of scripts (tuned) which do system tuning based
> on a user-chosen profile. These scripts do various things, including
> changing the elevator. Once you have chosen the profile, it gets applied
> automatically on every boot (through init scripts).
>
> Now assume that after a reboot libvirtd is running, resuming various
> suspended virtual machines or starting new ones, and in parallel this
> profile is being applied. There is no way to avoid races, as systemd
> allows parallel execution of services. The only way left is strong
> serialization: no cgroup operation may take place in the system while
> some init script is changing the elevator (no new cgroup creation, no
> cgroup deletion and no rule setting by any daemon), otherwise changes
> might be lost. In practice, how would I program the various init
> scripts for this?

Why can't systemd order elevator switch before other actions?  It's
not really about switching elevators but about having the set of applied
policies set before configuring them.

It is natural to require the target of configuration to be set up
before configuring it, right?  You can't set attributes on eth0 or sda
when those don't exist.  This isn't very different.  You need to have
the set of policies and their parameters defined before going ahead with
their configuration, and there naturally is an ordering between the two
steps - e.g. it doesn't make any sense, and is actually misleading, to
allow configuration of propio when the elevator of choice doesn't
provide it.

Of course, details of such ordering requirement including granularity
have to be decided and we can decide that keeping things at per-policy
granularity is important enough to justify extra complexity, which I
don't think is the case here.

There are two separate points here.

1. Regardless of persistency granularity, which policies are enabled
   for a device must be determined before configuring the policies.
   The policy_node stuff worked around this by keeping per-policy
   configurations in the core separately violating proper layering and
   any usual conventions.  It's like keeping ata_N_conf or eth_N_conf
   in kernel for devices which may appear in the future.  It's silly
   at best.

2. The granularity of configuration reset is a separate issue and it
   might make sense to do it fine-grained if that is important enough,
   but given how elv/pol changes are used, I am very skeptical this is
   necessary.

No matter what we do for #2, #1 requires ordering between policy
selection and configuration.  You're saying that #2, combined with the
fact that blk-throtl can't be built as a module or disabled at runtime,
allows side-stepping the issue for at least blk-throtl.  That doesn't
sound like a good idea to me.  People are working on different
elevators implementing different cgroup strategies.  There is no sane
way around requiring "choosing of policies" to happen before
"configuration of chosen policies".

-- 
tejun


* Re: [PATCH 08/17] blkcg: shoot down blkio_groups on elevator switch
  2012-01-23 17:10                 ` Tejun Heo
@ 2012-01-23 18:27                   ` Vivek Goyal
  2012-01-23 18:43                     ` Tejun Heo
  2012-01-23 20:40                     ` Lennart Poettering
  2012-01-23 18:32                   ` Vivek Goyal
  1 sibling, 2 replies; 45+ messages in thread
From: Vivek Goyal @ 2012-01-23 18:27 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, ctalbott, rni, linux-kernel, Lennart Poettering

On Mon, Jan 23, 2012 at 09:10:49AM -0800, Tejun Heo wrote:
> On Mon, Jan 23, 2012 at 11:25:53AM -0500, Vivek Goyal wrote:
> > It does not have to be a regular switch. Even one switch during boot
> > can create issues.
> >
> > In RHEL we have a set of scripts (tuned) which do system tuning based
> > on a user-chosen profile. These scripts do various things, including
> > changing the elevator. Once you have chosen the profile, it gets applied
> > automatically on every boot (through init scripts).
> >
> > Now assume that after a reboot libvirtd is running, resuming various
> > suspended virtual machines or starting new ones, and in parallel this
> > profile is being applied. There is no way to avoid races, as systemd
> > allows parallel execution of services. The only way left is strong
> > serialization: no cgroup operation may take place in the system while
> > some init script is changing the elevator (no new cgroup creation, no
> > cgroup deletion and no rule setting by any daemon), otherwise changes
> > might be lost. In practice, how would I program the various init
> > scripts for this?
> 
> Why can't systemd order elevator switch before other actions?

Because systemd does not know. For systemd it is just launching services,
and what those services are doing is not known to systemd.

I think systemd does have some facilities so that services can express
dependencies on other services, and a dependent service blocks on
completion of the service it depends on. So maybe in this case any
service dealing with cgroups would have to depend on this service which
tunes the system and changes the elevator.

CCing Lennart for more info on systemd.

>  It's
> not really about switching elevators but about having the set of applied
> policies set before configuring them.
> 
> It is natural to require the target of configuration to be set up
> before configuring it, right?  You can't set attributes on eth0 or sda
> when those don't exist.  This isn't very different.  You need to have
> the set of policies and their parameters defined before going ahead with
> their configuration, and there naturally is an ordering between the two
> steps - e.g. it doesn't make any sense, and is actually misleading, to
> allow configuration of propio when the elevator of choice doesn't
> provide it.
> 
> Of course, details of such ordering requirement including granularity
> have to be decided and we can decide that keeping things at per-policy
> granularity is important enough to justify extra complexity, which I
> don't think is the case here.
> 
> There are two separate points here.
> 
> 1. Regardless of persistency granularity, which policies are enabled
>    for a device must be determined before configuring the policies.
>    The policy_node stuff worked around this by keeping per-policy
>    configurations in the core separately violating proper layering and
>    any usual conventions.  It's like keeping ata_N_conf or eth_N_conf
>    in kernel for devices which may appear in the future.  It's silly
>    at best.

Agreed. I understand now that keeping configuration around in kernel for
non-existent devices is not a good idea. So ripping the rules upon 
device tear down makes sense.

> 
> 2. The granularity of configuration reset is a separate issue and it
>    might make sense to do it fine-grained if that is important enough,
>    but given how elv/pol changes are used, I am very skeptical this is
>    necessary.
> 
> No matter what we do for #2, #1 requires ordering between policy
> selection and configuration.  You're saying that #2, combined with the
> fact that blk-throtl can't be built as a module or disabled at runtime,
> allows side-stepping the issue for at least blk-throtl.  That doesn't
> sound like a good idea to me.  People are working on different
> elevators implementing different cgroup strategies.  There is no sane
> way around requiring "choosing of policies" to happen before
> "configuration of chosen policies".

I agree on #1, that is, choosing the policy before configuring it.

I am concerned about silently removing the configuration of policy A
while some unrelated policy B is going away, and user space being asked
to handle it.

It is equivalent to saying that changing the IO scheduler also resets all
the request queue tunables to their defaults, and now a user space script
is supposed to set them back to the user-configured values. Or to writing
a user space script which first saves all the request queue tunables,
changes the elevator and then restores them.

Thanks
Vivek


* Re: [PATCH 08/17] blkcg: shoot down blkio_groups on elevator switch
  2012-01-23 17:10                 ` Tejun Heo
  2012-01-23 18:27                   ` Vivek Goyal
@ 2012-01-23 18:32                   ` Vivek Goyal
  2012-01-23 18:51                     ` Tejun Heo
  1 sibling, 1 reply; 45+ messages in thread
From: Vivek Goyal @ 2012-01-23 18:32 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, ctalbott, rni, linux-kernel

On Mon, Jan 23, 2012 at 09:10:49AM -0800, Tejun Heo wrote:

[..]
> No matter what we do for #2, #1 requires ordering between policy
> selection and configuration.  You're saying that #2, combined with the
> fact that blk-throtl can't be built as a module or disabled at runtime,
> allows side-stepping the issue for at least blk-throtl.  That doesn't
> sound like a good idea to me.  People are working on different
> elevators implementing different cgroup strategies.  There is no sane
> way around requiring "choosing of policies" to happen before
> "configuration of chosen policies".

This is not just specific to blk-throttle. If in the future the throttling
policy is removable like the elevator, then it will be fine to reset the
throttling-related configuration upon removal of the throttling policy.
But resetting the throttling configuration without the policy going
anywhere does not sound good.

Thanks
Vivek


* Re: [PATCH 08/17] blkcg: shoot down blkio_groups on elevator switch
  2012-01-23 18:27                   ` Vivek Goyal
@ 2012-01-23 18:43                     ` Tejun Heo
  2012-01-23 19:33                       ` Tejun Heo
  2012-01-23 20:43                       ` Lennart Poettering
  2012-01-23 20:40                     ` Lennart Poettering
  1 sibling, 2 replies; 45+ messages in thread
From: Tejun Heo @ 2012-01-23 18:43 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, ctalbott, rni, linux-kernel, Lennart Poettering

On Mon, Jan 23, 2012 at 01:27:45PM -0500, Vivek Goyal wrote:
> > Why can't systemd order elevator switch before other actions?
> 
> Because systemd does not know. For systemd it is just launching services,
> and what those services are doing is not known to systemd.
>
> I think systemd does have some facilities so that services can express
> dependencies on other services, and a dependent service blocks on
> completion of the service it depends on. So maybe in this case any
> service dealing with cgroups would have to depend on this service which
> tunes the system and changes the elevator.

I'm sure systemd has enough facility for expressing this dependency.
Where this configuration belongs is a different question tho.  I
don't know how the tuned thing works, but configurations like this are
bound to devices and should be part of the device discovery / hotplug
sequence.  IOW, it should be something which ultimately runs off udev
events as part of the device-found event.

> > 1. Regardless of persistency granularity, which policies are enabled
> >    for a device must be determined before configuring the policies.
> >    The policy_node stuff worked around this by keeping per-policy
> >    configurations in the core separately violating proper layering and
> >    any usual conventions.  It's like keeping ata_N_conf or eth_N_conf
> >    in kernel for devices which may appear in the future.  It's silly
> >    at best.
> 
> Agreed. I understand now that keeping configuration around in kernel for
> non-existent devices is not a good idea. So ripping the rules upon 
> device tear down makes sense.

Yeah, the kernel doesn't even have a way to reliably match
configurations to devices consistently.  Good that we agree on this.

> > 2. The granularity of configuration reset is a separate issue and it
> >    might make sense to do it fine-grained if that is important enough,
> >    but given how elv/pol changes are used, I am very skeptical this is
> >    necessary.
> > 
> > No matter what we do for #2, #1 requires ordering between policy
> > selection and configuration.  You're saying that #2, combined with the
> > fact that blk-throtl can't be built as a module or disabled at runtime,
> > allows side-stepping the issue for at least blk-throtl.  That doesn't
> > sound like a good idea to me.  People are working on different
> > elevators implementing different cgroup strategies.  There is no sane
> > way around requiring "choosing of policies" to happen before
> > "configuration of chosen policies".
> 
> I agree on #1, that is, choosing the policy before configuring it.
>
> I am concerned about silently removing the configuration of policy A
> while some unrelated policy B is going away, and user space being asked
> to handle it.
>
> It is equivalent to saying that changing the IO scheduler also resets all
> the request queue tunables to their defaults, and now a user space script
> is supposed to set them back to the user-configured values. Or to writing
> a user space script which first saves all the request queue tunables,
> changes the elevator and then restores them.

Yeah, this is much more arguable.  I don't think it would be too
complex to keep per-policy granularity even w/ unified blkg managed by
blkcg core (we'll just need to point to separately allocated
per-policy data from the unified blkg and clear them selectively).
I'm just not convinced of its necessity.  With initial config out of
the way, elvs and blkcg policies don't get molested all that often.

I'll see how complex it actually gets.  If it isn't too much
complexity, yeah, why not...
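
Something of this shape is what I mean (sketch only; all names below
are made up):

#include <linux/slab.h>

struct request_queue;
struct blkio_cgroup;

#define NR_POLICIES_SKETCH      2       /* prop, throtl */

struct blkio_group_sketch {
        struct request_queue    *q;
        struct blkio_cgroup     *blkcg;
        /* one separately allocated payload per policy */
        void                    *pdata[NR_POLICIES_SKETCH];
};

/* an elv/pol change would free only the payloads of the policies it
 * affects; e.g. blk-throttle's payload could stay put */
static void blkg_clear_policy_sketch(struct blkio_group_sketch *blkg,
                                     int plid)
{
        kfree(blkg->pdata[plid]);
        blkg->pdata[plid] = NULL;
}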

-- 
tejun


* Re: [PATCH 08/17] blkcg: shoot down blkio_groups on elevator switch
  2012-01-23 18:32                   ` Vivek Goyal
@ 2012-01-23 18:51                     ` Tejun Heo
  0 siblings, 0 replies; 45+ messages in thread
From: Tejun Heo @ 2012-01-23 18:51 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, ctalbott, rni, linux-kernel

On Mon, Jan 23, 2012 at 01:32:07PM -0500, Vivek Goyal wrote:
> This is not just specific to blk-throttle. If in the future the throttling
> policy is removable like the elevator, then it will be fine to reset the
> throttling-related configuration upon removal of the throttling policy.
> But resetting the throttling configuration without the policy going
> anywhere does not sound good.

Yeah, that is a compromise.  As long as those configurations are
per-queue, which is the way it's gonna be after the cleanup, I don't
think it's a big deal tho.  We'll be saying "if you make changes to
the set of blkcg policies applied to a device, all blkcg
configurations regarding the device will be reset" instead of "if any
change to blkcg policies happens system-wide, all blkcg configurations
will be reset", and the former doesn't sound crazy to me.  Anyways,
let's see how complex doing the finer-grained thing gets.

-- 
tejun


* Re: [PATCH 08/17] blkcg: shoot down blkio_groups on elevator switch
  2012-01-23 18:43                     ` Tejun Heo
@ 2012-01-23 19:33                       ` Tejun Heo
  2012-01-23 19:57                         ` Vivek Goyal
  2012-01-23 20:43                       ` Lennart Poettering
  1 sibling, 1 reply; 45+ messages in thread
From: Tejun Heo @ 2012-01-23 19:33 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, ctalbott, rni, linux-kernel, Lennart Poettering

On Mon, Jan 23, 2012 at 10:43:36AM -0800, Tejun Heo wrote:
> Yeah, this is much more arguable.  I don't think it would be too
> complex to keep per-policy granularity even w/ unified blkg managed by
> blkcg core (we'll just need to point to separately allocated
> per-policy data from the unified blkg and clear them selectively).
> I'm just not convinced of its necessity.  With initial config out of
> the way, elvs and blkcg policies don't get molested all that often.
> 
> I'll see how complex it actually gets.  If it isn't too much
> complexity, yeah, why not...

Hmmm... while this isn't terribly complex, it involves a considerable
amount of churn, as the core layer doesn't currently know which policies
are bound to which queues and how - we'll have to add some of that before
the shootdown change, use it there, and then later replace it with the
proper per-queue thing.  The conversion is already painful enough without
adding another chunk to juggle around.  Given that clearing everything on
pol change isn't too crazy, how about the following?

* For now, clear unconditionally on pol/elv change.

* But structure things such that policy-specific data is allocated
  separately, and on pol/elv change only that policy-specific part is
  flushed.

* Later, if deemed necessary, make the clearing of the pol-specific part
  selective.

-- 
tejun


* Re: [PATCH 08/17] blkcg: shoot down blkio_groups on elevator switch
  2012-01-23 19:33                       ` Tejun Heo
@ 2012-01-23 19:57                         ` Vivek Goyal
  2012-01-23 20:33                           ` Tejun Heo
  0 siblings, 1 reply; 45+ messages in thread
From: Vivek Goyal @ 2012-01-23 19:57 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, ctalbott, rni, linux-kernel, Lennart Poettering

On Mon, Jan 23, 2012 at 11:33:35AM -0800, Tejun Heo wrote:
> On Mon, Jan 23, 2012 at 10:43:36AM -0800, Tejun Heo wrote:
> > Yeah, this is much more arguable.  I don't think it would be too
> > complex to keep per-policy granularity even w/ unified blkg managed by
> > blkcg core (we'll just need to point to separately allocated
> > per-policy data from the unified blkg and clear them selectively).
> > I'm just not convinced of its necessity.  With initial config out of
> > the way, elvs and blkcg policies don't get molested all that often.
> > 
> > I'll see how complex it actually gets.  If it isn't too much
> > complexity, yeah, why not...
> 
> Hmmm... while this isn't terribly complex, it involves a considerable
> amount of churn, as the core layer doesn't currently know which policies
> are bound to which queues and how - we'll have to add some of that before
> the shootdown change, use it there, and then later replace it with the
> proper per-queue thing.  The conversion is already painful enough without
> adding another chunk to juggle around.  Given that clearing everything on
> pol change isn't too crazy, how about the following?
>
> * For now, clear unconditionally on pol/elv change.
>
> * But structure things such that policy-specific data is allocated
>   separately, and on pol/elv change only that policy-specific part is
>   flushed.
>
> * Later, if deemed necessary, make the clearing of the pol-specific part
>   selective.

It would be good if you add one more part to your series (say part 4) to
make it happen. We probably don't want to get into the mess where from
kernel version A to B we had behavior x, from B to C we changed it to y,
and in kernel version D we restored it back to x. User space then has to
go figure out what kernel version it is running on and behave accordingly.
For distributions, supporting these different kernels will become a mess.

So IMHO, keeping the pol-specific clearing as a separate series which
gets committed in the same kernel version would help a lot.

Thanks
Vivek


* Re: [PATCH 08/17] blkcg: shoot down blkio_groups on elevator switch
  2012-01-23 19:57                         ` Vivek Goyal
@ 2012-01-23 20:33                           ` Tejun Heo
  0 siblings, 0 replies; 45+ messages in thread
From: Tejun Heo @ 2012-01-23 20:33 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, ctalbott, rni, linux-kernel, Lennart Poettering

On Mon, Jan 23, 2012 at 02:57:17PM -0500, Vivek Goyal wrote:
> It would be good if you add one more part to your series (say part 4) to
> make it happen. We probably don't want to get into the mess where from
> kernel version A to B we had behavior x, from B to C we changed it to y,
> and in kernel version D we restored it back to x. User space then has to
> go figure out what kernel version it is running on and behave accordingly.
> For distributions, supporting these different kernels will become a mess.
>
> So IMHO, keeping the pol-specific clearing as a separate series which
> gets committed in the same kernel version would help a lot.

Sure, we're at the beginning of this devel window.  I'm fairly sure
we'll have enough time to sort it out before the next merge window.

-- 
tejun


* Re: [PATCH 08/17] blkcg: shoot down blkio_groups on elevator switch
  2012-01-23 18:27                   ` Vivek Goyal
  2012-01-23 18:43                     ` Tejun Heo
@ 2012-01-23 20:40                     ` Lennart Poettering
  1 sibling, 0 replies; 45+ messages in thread
From: Lennart Poettering @ 2012-01-23 20:40 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Tejun Heo, axboe, ctalbott, rni, linux-kernel

On Mon, 23.01.12 13:27, Vivek Goyal (vgoyal@redhat.com) wrote:

> > Why can't systemd order elevator switch before other actions?
> 
> Because systemd does not know. For systemd it is just launching services,
> and what those services are doing is not known to systemd.
>
> I think systemd does have some facilities so that services can express
> dependencies on other services, and a dependent service blocks on
> completion of the service it depends on. So maybe in this case any
> service dealing with cgroups would have to depend on this service which
> tunes the system and changes the elevator.
>
> CCing Lennart for more info on systemd.

Yes, the dependency logic is quite elaborate in systemd. If a service
really needs to, it can plug itself into the early-boot phase instead of
the normal boot. That's where we usually load drivers via udev, apply
sysctls and things like that, hence before any of the "real" services.
So far only udev itself (and plymouth in some way) is of this special
kind that runs this early, but other services may move themselves this
early as well, if they want to.

Note that sticking code into the early boot phase means you need to know
what you are doing, i.e. not rely on device nodes being there, drivers
probed, access modes fixed, temporary files set up and so on; after all,
it's where we do all those things in the first place. You can, however,
order yourself individually after those tasks if needed.

Lennart

-- 
Lennart Poettering - Red Hat, Inc.


* Re: [PATCH 08/17] blkcg: shoot down blkio_groups on elevator switch
  2012-01-23 18:43                     ` Tejun Heo
  2012-01-23 19:33                       ` Tejun Heo
@ 2012-01-23 20:43                       ` Lennart Poettering
  2012-01-23 20:47                         ` Tejun Heo
  1 sibling, 1 reply; 45+ messages in thread
From: Lennart Poettering @ 2012-01-23 20:43 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Vivek Goyal, axboe, ctalbott, rni, linux-kernel

On Mon, 23.01.12 10:43, Tejun Heo (tj@kernel.org) wrote:

> 
> On Mon, Jan 23, 2012 at 01:27:45PM -0500, Vivek Goyal wrote:
> > > Why can't systemd order elevator switch before other actions?
> > 
> > Because systemd does not know. For systemd it is just launching services,
> > and what those services are doing is not known to systemd.
> >
> > I think systemd does have some facilities so that services can express
> > dependencies on other services, and a dependent service blocks on
> > completion of the service it depends on. So maybe in this case any
> > service dealing with cgroups would have to depend on this service which
> > tunes the system and changes the elevator.
> 
> I'm sure systemd has enough facility for expressing this dependency.
> Where this configuration belongs is a different question tho.  I
> don't know how the tuned thing works, but configurations like this are
> bound to devices and should be part of the device discovery / hotplug
> sequence.  IOW, it should be something which ultimately runs off udev
> events as part of the device-found event.

I didn't really follow the whole discussion here, but if this is about
adjusting parameters of a block device as the block device shows up, this
must necessarily happen in a udev rule (and not in a daemon watching
block devices using libudev), since you most likely need the
synchronicity: i.e. you want to avoid normal (libudev-using) userspace
code ever seeing the device before those params are written to the
device. A daemon asynchronously watching the devices with libudev for
adjusting such tunables is necessarily racy.

Lennart

-- 
Lennart Poettering - Red Hat, Inc.


* Re: [PATCH 08/17] blkcg: shoot down blkio_groups on elevator switch
  2012-01-23 20:43                       ` Lennart Poettering
@ 2012-01-23 20:47                         ` Tejun Heo
  2012-01-23 21:03                           ` Vivek Goyal
  0 siblings, 1 reply; 45+ messages in thread
From: Tejun Heo @ 2012-01-23 20:47 UTC (permalink / raw)
  To: Lennart Poettering; +Cc: Vivek Goyal, axboe, ctalbott, rni, linux-kernel

Hello, Lennart.

On Mon, Jan 23, 2012 at 09:43:52PM +0100, Lennart Poettering wrote:
> I didn't really follow the whole discussion here, but if this is about
> adjusting parameters of a block device as the block device shows up, this
> must necessarily happen in a udev rule (and not in a daemon watching
> block devices using libudev), since you most likely need the
> synchronicity: i.e. you want to avoid normal (libudev-using) userspace
> code ever seeing the device before those params are written to the
> device. A daemon asynchronously watching the devices with libudev for
> adjusting such tunables is necessarily racy.

Yeap, this needs to be synchronous so that it's configured before the
rest of the system notices the device.  Vivek, can you please ask the
tuned guys to take a look?

Thanks.

-- 
tejun


* Re: [PATCH 08/17] blkcg: shoot down blkio_groups on elevator switch
  2012-01-23 20:47                         ` Tejun Heo
@ 2012-01-23 21:03                           ` Vivek Goyal
  0 siblings, 0 replies; 45+ messages in thread
From: Vivek Goyal @ 2012-01-23 21:03 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Lennart Poettering, axboe, ctalbott, rni, linux-kernel

On Mon, Jan 23, 2012 at 12:47:57PM -0800, Tejun Heo wrote:
> Hello, Lennart.
> 
> On Mon, Jan 23, 2012 at 09:43:52PM +0100, Lennart Poettering wrote:
> > I didn't really follow the whole discussion here, but if this is about
> > adjusting parameters of a block device as the block device shows up, this
> > must necessarily happen in a udev rule (and not in a daemon watching
> > block devices using libudev), since you most likely need the
> > synchronicity: i.e. you want to avoid normal (libudev-using) userspace
> > code ever seeing the device before those params are written to the
> > device. A daemon asynchronously watching the devices with libudev for
> > adjusting such tunables is necessarily racy.
> 
> Yeap, this needs to be synchronous so that it's configured before the
> rest of system notices the device.  Vivek, can you please ask the
> tuned guys to take a look?

Sure I will. Thanks.

Vivek
