From: Vivek Goyal <vgoyal@redhat.com>
To: linux-kernel@vger.kernel.org,
	containers@lists.linux-foundation.org, dm-devel@redhat.com,
	jens.axboe@oracle.com, ryov@valinux.co.jp,
	balbir@linux.vnet.ibm.com, righi.andrea@gmail.com
Cc: nauman@google.com, dpshah@google.com, lizf@cn.fujitsu.com,
	mikew@google.com, fchecconi@gmail.com, paolo.valente@unimore.it,
	fernando@oss.ntt.co.jp, s-uchida@ap.jp.nec.com,
	taka@valinux.co.jp, guijianfeng@cn.fujitsu.com,
	jmoyer@redhat.com, dhaval@linux.vnet.ibm.com,
	m-ikeda@ds.jp.nec.com, agk@redhat.com, vgoyal@redhat.com,
	akpm@linux-foundation.org, peterz@infradead.org,
	jmarchan@redhat.com
Subject: [PATCH 07/24] io-controller: Common hierarchical fair queuing code in elevator layer
Date: Sun, 16 Aug 2009 15:30:29 -0400	[thread overview]
Message-ID: <1250451046-9966-8-git-send-email-vgoyal@redhat.com> (raw)
In-Reply-To: <1250451046-9966-1-git-send-email-vgoyal@redhat.com>

o This patch enables hierarchical fair queuing in common layer. It is
  controlled by config option CONFIG_GROUP_IOSCHED.

o Requests keep a reference on the ioq, and the ioq keeps a reference
  on the group. For async queues in CFQ, and for the single ioq in other
  schedulers, the io_group also keeps a reference on the io_queue. This
  reference on the ioq is dropped when the queue is released
  (elv_release_ioq), so the queue can be freed.

  When a queue is released, it puts the reference to io_group and the
  io_group is released after all the queues are released. Child groups
  also take reference on parent groups, and release it when they are
  destroyed.
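
  The reference chain above (request -> ioq -> io_group -> parent
  io_group) can be sketched in plain user-space C. Everything below
  (the structs, demo(), the freed_groups counter) is an illustrative
  stand-in for this sketch only, not the kernel types or the patch:

```c
#include <assert.h>
#include <stdlib.h>

static int freed_groups;		/* instrumentation for this sketch only */

struct io_group {
	int ref;
	struct io_group *parent;	/* child holds one reference on parent */
};

struct io_queue {
	int ref;
	struct io_group *iog;		/* ioq holds one reference on its group */
};

static void get_iog(struct io_group *iog)
{
	iog->ref++;
}

static void put_iog(struct io_group *iog)
{
	struct io_group *parent;

	assert(iog->ref > 0);
	if (--iog->ref)
		return;
	parent = iog->parent;
	free(iog);
	freed_groups++;
	if (parent)
		put_iog(parent);	/* drop the child's reference on the parent */
}

static void put_ioq(struct io_queue *ioq)
{
	struct io_group *iog;

	assert(ioq->ref > 0);
	if (--ioq->ref)
		return;
	iog = ioq->iog;
	free(ioq);
	put_iog(iog);			/* the ioq's reference on its group */
}

static int demo(void)
{
	struct io_group *parent = calloc(1, sizeof(*parent));
	struct io_group *child = calloc(1, sizeof(*child));
	struct io_queue *ioq = calloc(1, sizeof(*ioq));

	parent->ref = 1;		/* creation reference */
	child->ref = 1;			/* creation reference */
	child->parent = parent;
	get_iog(parent);		/* child's reference on its parent */
	ioq->ref = 1;
	ioq->iog = child;
	get_iog(child);			/* ioq's reference on its group */

	/* Drop the creation references (cgroup deletion / elevator exit). */
	put_iog(child);
	put_iog(parent);
	assert(freed_groups == 0);	/* the busy queue still pins the chain */

	/* Only when the last queue goes away is the whole chain freed. */
	put_ioq(ioq);
	return freed_groups;
}
```

  Running demo() frees both groups only after the last queue reference
  is dropped, mirroring the lifetime rules described above.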

o Reads of iocg->group_data do not always happen under iocg->lock, so
  all the operations on that list are also protected by RCU. All
  modifications to iocg->group_data must be done under iocg->lock.

  Whenever iocg->lock and queue_lock can both be held, queue_lock should
  be held first. This avoids all deadlocks. In order to avoid race
  between cgroup deletion and elevator switch the following algorithm is
  used:

	- Cgroup deletion path holds iocg->lock and removes the iog entry
	  from the iocg->group_data list. Then it drops iocg->lock, holds
	  queue_lock and destroys iog. So in this path, we never hold
	  iocg->lock and queue_lock at the same time. Also, since we
	  remove iog from iocg->group_data under iocg->lock, we can't
	  race with elevator switch.

	- Elevator switch path does not remove iog from
	  iocg->group_data list directly. It first holds iocg->lock,
	  scans iocg->group_data again to see if iog is still there;
	  it removes iog only if it finds iog there. Otherwise, cgroup
	  deletion must have removed it from the list, and cgroup
	  deletion is responsible for removing iog.

  So the path which removes iog from iocg->group_data list does
  the final removal of iog by calling __io_destroy_group()
  function.
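
  The removal rule above boils down to: whichever path unlinks the iog
  from iocg->group_data is the one that runs __io_destroy_group(). A
  tiny sketch of that rule (illustrative names, no real locking; in the
  real code both paths test and clear list membership under iocg->lock):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * on_list stands in for membership of iocg->group_data; destroyed
 * counts how many times the destroy step ran.
 */
struct iog_state {
	bool on_list;
	int destroyed;
};

/* Returns true if this caller unlinked the entry and must destroy it. */
static bool try_remove_and_destroy(struct iog_state *iog)
{
	if (!iog->on_list)
		return false;		/* the other path already removed it */
	iog->on_list = false;
	iog->destroyed++;		/* stands in for __io_destroy_group() */
	return true;
}

static int demo(void)
{
	struct iog_state iog = { .on_list = true, .destroyed = 0 };

	bool by_cgroup = try_remove_and_destroy(&iog);	/* cgroup deletion */
	bool by_elv = try_remove_and_destroy(&iog);	/* elevator switch */

	assert(by_cgroup != by_elv);	/* exactly one path wins */
	return iog.destroyed;
}
```

  However the two paths interleave, exactly one of them sees the iog on
  the list and performs the single destroy; the loser simply backs off.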

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/cfq-iosched.c |    2 +
 block/elevator-fq.c |  479 +++++++++++++++++++++++++++++++++++++++++++++++++--
 block/elevator-fq.h |   35 ++++
 block/elevator.c    |    4 +
 4 files changed, 509 insertions(+), 11 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 4bde1c8..decb654 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1356,6 +1356,8 @@ alloc_cfqq:
 
 			/* call it after cfq has initialized queue prio */
 			elv_init_ioq_io_group(ioq, iog);
+			/* ioq reference on iog */
+			elv_get_iog(iog);
 			cfq_log_cfqq(cfqd, cfqq, "alloced");
 		} else {
 			cfqq = &cfqd->oom_cfqq;
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 3940c61..8d91ff7 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -515,6 +515,7 @@ void elv_put_ioq(struct io_queue *ioq)
 {
 	struct elv_fq_data *efqd = ioq->efqd;
 	struct elevator_queue *e = efqd->eq;
+	struct io_group *iog;
 
 	BUG_ON(atomic_read(&ioq->ref) <= 0);
 	if (!atomic_dec_and_test(&ioq->ref))
@@ -522,12 +523,14 @@ void elv_put_ioq(struct io_queue *ioq)
 	BUG_ON(ioq->nr_queued);
 	BUG_ON(elv_ioq_busy(ioq));
 	BUG_ON(efqd->active_queue == ioq);
+	iog = ioq_to_io_group(ioq);
 
 	/* Can be called by outgoing elevator. Don't use q */
 	BUG_ON(!e->ops->elevator_free_sched_queue_fn);
 	e->ops->elevator_free_sched_queue_fn(e, ioq->sched_queue);
 	elv_log_ioq(efqd, ioq, "put_queue");
 	elv_free_ioq(ioq);
+	elv_put_iog(iog);
 }
 EXPORT_SYMBOL(elv_put_ioq);
 
@@ -734,6 +737,27 @@ void elv_io_group_set_async_queue(struct io_group *iog, int ioprio_class,
 EXPORT_SYMBOL(elv_io_group_set_async_queue);
 
 #ifdef CONFIG_GROUP_IOSCHED
+static void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup);
+
+static void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
+{
+	struct io_entity *entity = &iog->entity;
+
+	entity->weight = iocg->weight;
+	entity->ioprio_class = iocg->ioprio_class;
+	entity->ioprio_changed = 1;
+	entity->my_sd = &iog->sched_data;
+}
+
+static void io_group_set_parent(struct io_group *iog, struct io_group *parent)
+{
+	struct io_entity *entity = &iog->entity;
+
+	init_io_entity_parent(entity, &parent->entity);
+
+	/* Child group reference on parent group. */
+	elv_get_iog(parent);
+}
 
 struct io_cgroup io_root_cgroup = {
 	.weight = IO_WEIGHT_DEFAULT,
@@ -746,6 +770,27 @@ static struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
 			    struct io_cgroup, css);
 }
 
+/*
+ * Search for the io_group associated with efqd in the iocg's hash
+ * table (for now only a list).  Must be called under rcu_read_lock().
+ */
+static struct io_group *
+io_cgroup_lookup_group(struct io_cgroup *iocg, void *key)
+{
+	struct io_group *iog;
+	struct hlist_node *n;
+	void *__key;
+
+	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+		__key = rcu_dereference(iog->key);
+		if (__key == key)
+			return iog;
+	}
+
+	return NULL;
+}
+
+
 #define SHOW_FUNCTION(__VAR)						\
 static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup,		\
 				       struct cftype *cftype)		\
@@ -885,12 +930,6 @@ static void iocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
 	task_unlock(tsk);
 }
 
-static void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
-{
-
-	/* Implemented in later patch */
-}
-
 struct cgroup_subsys io_subsys = {
 	.name = "io",
 	.create = iocg_create,
@@ -902,24 +941,210 @@ struct cgroup_subsys io_subsys = {
 	.use_id = 1,
 };
 
+static inline unsigned int iog_weight(struct io_group *iog)
+{
+	return iog->entity.weight;
+}
+
+static struct io_group *
+io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
+{
+	struct io_cgroup *iocg;
+	struct io_group *iog, *leaf = NULL, *prev = NULL;
+	gfp_t flags = GFP_ATOMIC |  __GFP_ZERO;
+
+	for (; cgroup != NULL; cgroup = cgroup->parent) {
+		iocg = cgroup_to_io_cgroup(cgroup);
+
+		iog = io_cgroup_lookup_group(iocg, key);
+		if (iog != NULL) {
+			/*
+			 * All the cgroups in the path from here to the
+			 * root must have an io_group for efqd, so we don't
+			 * need any more allocations.
+			 */
+			break;
+		}
+
+		iog = kzalloc_node(sizeof(*iog), flags, q->node);
+		if (!iog)
+			goto cleanup;
+
+		iog->iocg_id = css_id(&iocg->css);
+
+		io_group_init_entity(iocg, iog);
+
+		atomic_set(&iog->ref, 0);
+
+		/*
+		 * Take the initial reference that will be released on destroy.
+		 * This can be thought of as a joint reference by cgroup and
+		 * elevator which will be dropped by either elevator exit
+		 * or cgroup deletion path depending on who is exiting first.
+		 */
+		elv_get_iog(iog);
+
+		if (leaf == NULL) {
+			leaf = iog;
+			prev = leaf;
+		} else {
+			io_group_set_parent(prev, iog);
+			/*
+			 * Build a list of allocated nodes using the key
+			 * field, which is still unused and will be
+			 * initialized only after the node is connected.
+			 */
+			prev->key = iog;
+			prev = iog;
+		}
+	}
+
+	return leaf;
+
+cleanup:
+	while (leaf != NULL) {
+		prev = leaf;
+		leaf = leaf->key;
+		kfree(prev);
+	}
+
+	return NULL;
+}
+
+static void io_group_chain_link(struct request_queue *q, void *key,
+				struct cgroup *cgroup, struct io_group *leaf,
+				struct elv_fq_data *efqd)
+{
+	struct io_cgroup *iocg;
+	struct io_group *iog, *next, *prev = NULL;
+	unsigned long flags;
+
+	assert_spin_locked(q->queue_lock);
+
+	for (; cgroup != NULL && leaf != NULL; cgroup = cgroup->parent) {
+		iocg = cgroup_to_io_cgroup(cgroup);
+		next = leaf->key;
+
+		iog = io_cgroup_lookup_group(iocg, key);
+		BUG_ON(iog != NULL);
+
+		spin_lock_irqsave(&iocg->lock, flags);
+
+		rcu_assign_pointer(leaf->key, key);
+		hlist_add_head_rcu(&leaf->group_node, &iocg->group_data);
+		hlist_add_head(&leaf->elv_data_node, &efqd->group_list);
+
+		spin_unlock_irqrestore(&iocg->lock, flags);
+
+		prev = leaf;
+		leaf = next;
+	}
+
+	BUG_ON(cgroup == NULL && leaf != NULL);
+
+	/*
+	 * This connects the topmost element of the allocated chain to the
+	 * parent group.
+	 */
+	if (cgroup != NULL && prev != NULL) {
+		iocg = cgroup_to_io_cgroup(cgroup);
+		iog = io_cgroup_lookup_group(iocg, key);
+		io_group_set_parent(prev, iog);
+	}
+}
+
+static struct io_group *io_find_alloc_group(struct request_queue *q,
+			struct cgroup *cgroup, struct elv_fq_data *efqd,
+			int create)
+{
+	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
+	struct io_group *iog = NULL;
+	/* Note: Use efqd as key */
+	void *key = efqd;
+
+	/*
+	 * Take a reference to the css object. We don't want to map a bio
+	 * to a group that has been marked for deletion.
+	 */
+
+	if (!css_tryget(&iocg->css))
+		return iog;
+
+	iog = io_cgroup_lookup_group(iocg, key);
+	if (iog != NULL || !create)
+		goto end;
+
+	iog = io_group_chain_alloc(q, key, cgroup);
+	if (iog != NULL)
+		io_group_chain_link(q, key, cgroup, iog, efqd);
+
+end:
+	css_put(&iocg->css);
+	return iog;
+}
+
+/*
+ * Search for the io group current task belongs to. If create=1, then also
+ * create the io group if it is not already there.
+ *
+ * Note: This function should be called with queue lock held. It returns
+ * a pointer to io group without taking any reference. That group will
+ * be around as long as queue lock is not dropped (as group reclaim code
+ * needs to get hold of queue lock). So if somebody needs to use group
+ * pointer even after dropping queue lock, take a reference to the group
+ * before dropping queue lock.
+ */
+struct io_group *elv_io_get_io_group(struct request_queue *q, int create)
+{
+	struct cgroup *cgroup;
+	struct io_group *iog;
+	struct elv_fq_data *efqd = q->elevator->efqd;
+
+	assert_spin_locked(q->queue_lock);
+
+	rcu_read_lock();
+	cgroup = task_cgroup(current, io_subsys_id);
+	iog = io_find_alloc_group(q, cgroup, efqd, create);
+	if (!iog) {
+		if (create)
+			iog = efqd->root_group;
+		else
+			/*
+			 * bio merge functions doing lookup don't want to
+			 * map bio to root group by default
+			 */
+			iog = NULL;
+	}
+	rcu_read_unlock();
+	return iog;
+}
+EXPORT_SYMBOL(elv_io_get_io_group);
+
 static void io_free_root_group(struct elevator_queue *e)
 {
 	struct io_group *iog = e->efqd->root_group;
+	struct io_cgroup *iocg = &io_root_cgroup;
+
+	spin_lock_irq(&iocg->lock);
+	hlist_del_rcu(&iog->group_node);
+	spin_unlock_irq(&iocg->lock);
 
 	put_io_group_queues(e, iog);
-	kfree(iog);
+	elv_put_iog(iog);
 }
 
 static struct io_group *io_alloc_root_group(struct request_queue *q,
 					struct elevator_queue *e, void *key)
 {
 	struct io_group *iog;
+	struct io_cgroup *iocg = &io_root_cgroup;
 	int i;
 
 	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
 	if (iog == NULL)
 		return NULL;
 
+	elv_get_iog(iog);
 	iog->entity.parent = NULL;
 	iog->entity.my_sd = &iog->sched_data;
 	iog->key = key;
@@ -927,11 +1152,215 @@ static struct io_group *io_alloc_root_group(struct request_queue *q,
 	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
 		iog->sched_data.service_tree[i] = ELV_SERVICE_TREE_INIT;
 
+	spin_lock_irq(&iocg->lock);
+	rcu_assign_pointer(iog->key, key);
+	hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
+	iog->iocg_id = css_id(&iocg->css);
+	spin_unlock_irq(&iocg->lock);
+
 	return iog;
 }
 
+static void io_group_free_rcu(struct rcu_head *head)
+{
+	struct io_group *iog;
+
+	iog = container_of(head, struct io_group, rcu_head);
+	kfree(iog);
+}
+
+/*
+ * This cleanup function does the last bit of things to destroy cgroup.
+ * It should only get called after io_destroy_group has been invoked.
+ */
+static void io_group_cleanup(struct io_group *iog)
+{
+	struct io_service_tree *st;
+	int i;
+
+	BUG_ON(iog->sched_data.active_entity != NULL);
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+		BUG_ON(!RB_EMPTY_ROOT(&st->active));
+		BUG_ON(st->active_entity != NULL);
+	}
+
+	/*
+	 * Wait for any rcu readers to exit before freeing up the group.
+	 * Primarily useful when elv_io_get_io_group() is called without queue
+	 * lock to access some group data from bdi_congested_group() path.
+	 */
+	call_rcu(&iog->rcu_head, io_group_free_rcu);
+}
+
+void elv_put_iog(struct io_group *iog)
+{
+	struct io_group *parent_iog = NULL;
+	struct io_entity *parent;
+
+	BUG_ON(atomic_read(&iog->ref) <= 0);
+	if (!atomic_dec_and_test(&iog->ref))
+		return;
+
+	parent = parent_entity(&iog->entity);
+	if (parent)
+		parent_iog = iog_of(parent);
+
+	io_group_cleanup(iog);
+
+	if (parent_iog)
+		elv_put_iog(parent_iog);
+}
+EXPORT_SYMBOL(elv_put_iog);
+
+/*
+ * After the group is destroyed, no new sync IO should come to the group.
+ * It might still have pending IOs in some busy queues. It should be able to
+ * send those IOs down to the disk. The async IOs (due to dirty page writeback)
+ * would go in the root group queues after this, as the group does not exist
+ * anymore.
+ */
+static void __io_destroy_group(struct elv_fq_data *efqd, struct io_group *iog)
+{
+	hlist_del(&iog->elv_data_node);
+	put_io_group_queues(efqd->eq, iog);
+
+	/*
+	 * Put the reference taken at the time of creation so that when all
+	 * queues are gone, group can be destroyed.
+	 */
+	elv_put_iog(iog);
+}
+
+static void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
+	struct io_group *iog;
+	struct elv_fq_data *efqd;
+	unsigned long uninitialized_var(flags);
+
+	/*
+	 * io groups are linked in two lists. One list is maintained
+	 * in elevator (efqd->group_list) and other is maintained
+	 * per cgroup structure (iocg->group_data).
+	 *
+	 * While a cgroup is being deleted, elevator also might be
+	 * exiting and both might try to clean up the same io group,
+	 * so we need to be a little careful.
+	 *
+	 * (iocg->group_data) is protected by iocg->lock. To avoid deadlock,
+	 * we can't hold the queue lock while holding iocg->lock. So we first
+	 * remove iog from iocg->group_data under iocg->lock. Whoever removes
+	 * iog from iocg->group_data should call __io_destroy_group to remove
+	 * iog.
+	 */
+
+	rcu_read_lock();
+
+remove_entry:
+	spin_lock_irqsave(&iocg->lock, flags);
+
+	if (hlist_empty(&iocg->group_data)) {
+		spin_unlock_irqrestore(&iocg->lock, flags);
+		goto done;
+	}
+	iog = hlist_entry(iocg->group_data.first, struct io_group,
+			  group_node);
+	efqd = rcu_dereference(iog->key);
+	hlist_del_rcu(&iog->group_node);
+	iog->iocg_id = 0;
+	spin_unlock_irqrestore(&iocg->lock, flags);
+
+	spin_lock_irqsave(efqd->queue->queue_lock, flags);
+	__io_destroy_group(efqd, iog);
+	spin_unlock_irqrestore(efqd->queue->queue_lock, flags);
+	goto remove_entry;
+
+done:
+	free_css_id(&io_subsys, &iocg->css);
+	rcu_read_unlock();
+	BUG_ON(!hlist_empty(&iocg->group_data));
+	kfree(iocg);
+}
+
+/*
+ * This functions checks if iog is still in iocg->group_data, and removes it.
+ * If iog is not in that list, then cgroup destroy path has removed it, and
+ * we do not need to remove it.
+ */
+static void
+io_group_check_and_destroy(struct elv_fq_data *efqd, struct io_group *iog)
+{
+	struct io_cgroup *iocg;
+	unsigned long flags;
+	struct cgroup_subsys_state *css;
+
+	rcu_read_lock();
+
+	css = css_lookup(&io_subsys, iog->iocg_id);
+
+	if (!css)
+		goto out;
+
+	iocg = container_of(css, struct io_cgroup, css);
+
+	spin_lock_irqsave(&iocg->lock, flags);
+
+	if (iog->iocg_id) {
+		hlist_del_rcu(&iog->group_node);
+		__io_destroy_group(efqd, iog);
+	}
+
+	spin_unlock_irqrestore(&iocg->lock, flags);
+out:
+	rcu_read_unlock();
+}
+
+static void release_elv_io_groups(struct elevator_queue *e)
+{
+	struct hlist_node *pos, *n;
+	struct io_group *iog;
+	struct elv_fq_data *efqd = e->efqd;
+
+	hlist_for_each_entry_safe(iog, pos, n, &efqd->group_list,
+					elv_data_node) {
+		io_group_check_and_destroy(efqd, iog);
+	}
+}
+
+/*
+ * If the bio submitting task and rq don't belong to the same io_group,
+ * they can't be merged.
+ */
+int elv_io_group_allow_merge(struct request *rq, struct bio *bio)
+{
+	struct request_queue *q = rq->q;
+	struct io_queue *ioq = rq->ioq;
+	struct io_group *iog, *__iog;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return 1;
+
+	/* Determine the io group of the bio submitting task */
+	iog = elv_io_get_io_group(q, 0);
+	if (!iog) {
+		/* Maybe the task belongs to a different cgroup for which
+		 * an io group has not been set up yet. */
+		return 0;
+	}
+
+	/* Determine the io group of the ioq, rq belongs to*/
+	__iog = ioq_to_io_group(ioq);
+
+	return (iog == __iog);
+}
+
 #else /* CONFIG_GROUP_IOSCHED */
 
+static inline unsigned int iog_weight(struct io_group *iog) { return 0; }
+static inline void release_elv_io_groups(struct elevator_queue *e) {}
+
 static struct io_group *io_alloc_root_group(struct request_queue *q,
 					struct elevator_queue *e, void *key)
 {
@@ -1008,8 +1437,13 @@ __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq, int coop)
 	struct elevator_queue *eq = q->elevator;
 
 	if (ioq) {
-		elv_log_ioq(efqd, ioq, "set_active, busy=%d",
-						efqd->busy_queues);
+		struct io_group *iog = ioq_to_io_group(ioq);
+		elv_log_ioq(efqd, ioq, "set_active, busy=%d class=%hu prio=%hu"
+				" weight=%u group_weight=%u qued=%d",
+				efqd->busy_queues, ioq->entity.ioprio_class,
+				ioq->entity.ioprio, ioq->entity.weight,
+				iog_weight(iog), ioq->nr_queued);
+
 		ioq->slice_start = ioq->slice_end = 0;
 		ioq->dispatch_start = jiffies;
 
@@ -1174,6 +1608,7 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 	struct io_queue *ioq;
 	struct elevator_queue *eq = q->elevator;
 	struct io_entity *entity, *new_entity;
+	struct io_group *iog = NULL, *new_iog = NULL;
 
 	ioq = elv_active_ioq(eq);
 
@@ -1206,9 +1641,16 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 		return 1;
 
 	/*
-	 * Check with io scheduler if it has additional criterion based on
-	 * which it wants to preempt existing queue.
+	 * If both the queues belong to same group, check with io scheduler
+	 * if it has additional criterion based on which it wants to
+	 * preempt existing queue.
 	 */
+	iog = ioq_to_io_group(ioq);
+	new_iog = ioq_to_io_group(new_ioq);
+
+	if (iog != new_iog)
+		return 0;
+
 	if (eq->ops->elevator_should_preempt_fn) {
 		void *sched_queue = elv_ioq_sched_queue(new_ioq);
 
@@ -1354,6 +1796,10 @@ static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
 	if (new_ioq)
 		elv_log_ioq(e->efqd, ioq, "cooperating ioq=%d", new_ioq->pid);
 
+	/* Only select co-operating queue if it belongs to same group as ioq */
+	if (new_ioq && !is_same_group(&ioq->entity, &new_ioq->entity))
+		return NULL;
+
 	return new_ioq;
 }
 
@@ -1596,6 +2042,7 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
 	efqd->idle_slice_timer.data = (unsigned long) efqd;
 
 	INIT_WORK(&efqd->unplug_work, elv_kick_queue);
+	INIT_HLIST_HEAD(&efqd->group_list);
 
 	efqd->elv_slice[0] = elv_slice_async;
 	efqd->elv_slice[1] = elv_slice_sync;
@@ -1613,12 +2060,22 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
 void elv_exit_fq_data(struct elevator_queue *e)
 {
 	struct elv_fq_data *efqd = e->efqd;
+	struct request_queue *q = efqd->queue;
 
 	if (!elv_iosched_fair_queuing_enabled(e))
 		return;
 
 	elv_shutdown_timer_wq(e);
 
+	spin_lock_irq(q->queue_lock);
+	release_elv_io_groups(e);
+	spin_unlock_irq(q->queue_lock);
+
+	elv_shutdown_timer_wq(e);
+
+	/* Wait for iog->key accessors to exit their grace periods. */
+	synchronize_rcu();
+
 	BUG_ON(timer_pending(&efqd->idle_slice_timer));
 	io_free_root_group(e);
 }
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 6998e69..671f366 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -92,6 +92,7 @@ struct io_group {
 	atomic_t ref;
 	struct io_sched_data sched_data;
 	struct hlist_node group_node;
+	struct hlist_node elv_data_node;
 	unsigned short iocg_id;
 	/*
 	 * async queue for each priority case for RT and BE class.
@@ -101,6 +102,7 @@ struct io_group {
 	struct io_queue *async_queue[2][IOPRIO_BE_NR];
 	struct io_queue *async_idle_queue;
 	void *key;
+	struct rcu_head rcu_head;
 };
 
 struct io_cgroup {
@@ -134,6 +136,9 @@ struct io_group {
 struct elv_fq_data {
 	struct io_group *root_group;
 
+	/* List of io groups hanging on this elevator */
+	struct hlist_head group_list;
+
 	struct request_queue *queue;
 	struct elevator_queue *eq;
 	unsigned int busy_queues;
@@ -314,6 +319,28 @@ static inline struct io_queue *elv_get_oom_ioq(struct elevator_queue *eq)
 	return &eq->efqd->oom_ioq;
 }
 
+#ifdef CONFIG_GROUP_IOSCHED
+
+extern int elv_io_group_allow_merge(struct request *rq, struct bio *bio);
+extern void elv_put_iog(struct io_group *iog);
+extern struct io_group *elv_io_get_io_group(struct request_queue *q,
+						int create);
+
+static inline void elv_get_iog(struct io_group *iog)
+{
+	atomic_inc(&iog->ref);
+}
+
+#else /* !GROUP_IOSCHED */
+
+static inline int elv_io_group_allow_merge(struct request *rq, struct bio *bio)
+{
+	return 1;
+}
+
+static inline void elv_get_iog(struct io_group *iog) {}
+static inline void elv_put_iog(struct io_group *iog) {}
+
 static inline struct io_group *
 elv_io_get_io_group(struct request_queue *q, int create)
 {
@@ -321,6 +348,8 @@ elv_io_get_io_group(struct request_queue *q, int create)
 	return q->elevator->efqd->root_group;
 }
 
+#endif /* GROUP_IOSCHED */
+
 extern ssize_t elv_slice_sync_show(struct elevator_queue *q, char *name);
 extern ssize_t elv_slice_sync_store(struct elevator_queue *q, const char *name,
 						size_t count);
@@ -404,5 +433,11 @@ static inline void *elv_select_ioq(struct request_queue *q, int force)
 {
 	return NULL;
 }
+
+static inline int elv_io_group_allow_merge(struct request *rq, struct bio *bio)
+
+{
+	return 1;
+}
 #endif /* CONFIG_ELV_FAIR_QUEUING */
 #endif /* _ELV_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index ea4042e..b2725cd 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -122,6 +122,10 @@ int elv_rq_merge_ok(struct request *rq, struct bio *bio)
 	    !bio_failfast_driver(bio)	 != !blk_failfast_driver(rq))
 		return 0;
 
+	/* If rq and bio belong to different groups, don't allow merging */
+	if (!elv_io_group_allow_merge(rq, bio))
+		return 0;
+
 	if (!elv_iosched_allow_merge(rq, bio))
 		return 0;
 
-- 
1.6.0.6


WARNING: multiple messages have this Message-ID (diff)
From: Vivek Goyal <vgoyal@redhat.com>
To: linux-kernel@vger.kernel.org,
	containers@lists.linux-foundation.org, dm-devel@redhat.com,
	jens.axboe@oracle.com, ryov@valinux.co.jp,
	balbir@linux.vnet.ibm.com, righi.andrea@gmail.com
Cc: paolo.valente@unimore.it, jmarchan@redhat.com,
	dhaval@linux.vnet.ibm.com, peterz@infradead.org,
	guijianfeng@cn.fujitsu.com, fernando@oss.ntt.co.jp,
	lizf@cn.fujitsu.com, jmoyer@redhat.com, mikew@google.com,
	fchecconi@gmail.com, dpshah@google.com, vgoyal@redhat.com,
	nauman@google.com, s-uchida@ap.jp.nec.com,
	akpm@linux-foundation.org, agk@redhat.com, m-ikeda@ds.jp.nec.com
Subject: [PATCH 07/24] io-controller: Common hierarchical fair queuing code in elevaotor layer
Date: Sun, 16 Aug 2009 15:30:29 -0400	[thread overview]
Message-ID: <1250451046-9966-8-git-send-email-vgoyal@redhat.com> (raw)
In-Reply-To: <1250451046-9966-1-git-send-email-vgoyal@redhat.com>

o This patch enables hierarchical fair queuing in common layer. It is
  controlled by config option CONFIG_GROUP_IOSCHED.

o Requests keep a reference on ioq and ioq keeps  keep a reference
  on groups. For async queues in CFQ, and single ioq in other
  schedulers, io_group also keeps are reference on io_queue. This
  reference on ioq is dropped when the queue is released
  (elv_release_ioq). So the queue can be freed.

  When a queue is released, it puts the reference to io_group and the
  io_group is released after all the queues are released. Child groups
  also take reference on parent groups, and release it when they are
  destroyed.

o Reads of iocg->group_data are not always iocg->lock; so all the operations
  on that list are still protected by RCU. All modifications to
  iocg->group_data should always done under iocg->lock.

  Whenever iocg->lock and queue_lock can both be held, queue_lock should
  be held first. This avoids all deadlocks. In order to avoid race
  between cgroup deletion and elevator switch the following algorithm is
  used:

	- Cgroup deletion path holds iocg->lock and removes iog entry
	  to iocg->group_data list. Then it drops iocg->lock, holds
	  queue_lock and destroys iog. So in this path, we never hold
	  iocg->lock and queue_lock at the same time. Also, since we
	  remove iog from iocg->group_data under iocg->lock, we can't
	  race with elevator switch.

	- Elevator switch path does not remove iog from
	  iocg->group_data list directly. It first hold iocg->lock,
	  scans iocg->group_data again to see if iog is still there;
	  it removes iog only if it finds iog there. Otherwise, cgroup
	  deletion must have removed it from the list, and cgroup
	  deletion is responsible for removing iog.

  So the path which removes iog from iocg->group_data list does
  the final removal of iog by calling __io_destroy_group()
  function.

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/cfq-iosched.c |    2 +
 block/elevator-fq.c |  479 +++++++++++++++++++++++++++++++++++++++++++++++++--
 block/elevator-fq.h |   35 ++++
 block/elevator.c    |    4 +
 4 files changed, 509 insertions(+), 11 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 4bde1c8..decb654 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1356,6 +1356,8 @@ alloc_cfqq:
 
 			/* call it after cfq has initialized queue prio */
 			elv_init_ioq_io_group(ioq, iog);
+			/* ioq reference on iog */
+			elv_get_iog(iog);
 			cfq_log_cfqq(cfqd, cfqq, "alloced");
 		} else {
 			cfqq = &cfqd->oom_cfqq;
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 3940c61..8d91ff7 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -515,6 +515,7 @@ void elv_put_ioq(struct io_queue *ioq)
 {
 	struct elv_fq_data *efqd = ioq->efqd;
 	struct elevator_queue *e = efqd->eq;
+	struct io_group *iog;
 
 	BUG_ON(atomic_read(&ioq->ref) <= 0);
 	if (!atomic_dec_and_test(&ioq->ref))
@@ -522,12 +523,14 @@ void elv_put_ioq(struct io_queue *ioq)
 	BUG_ON(ioq->nr_queued);
 	BUG_ON(elv_ioq_busy(ioq));
 	BUG_ON(efqd->active_queue == ioq);
+	iog = ioq_to_io_group(ioq);
 
 	/* Can be called by outgoing elevator. Don't use q */
 	BUG_ON(!e->ops->elevator_free_sched_queue_fn);
 	e->ops->elevator_free_sched_queue_fn(e, ioq->sched_queue);
 	elv_log_ioq(efqd, ioq, "put_queue");
 	elv_free_ioq(ioq);
+	elv_put_iog(iog);
 }
 EXPORT_SYMBOL(elv_put_ioq);
 
@@ -734,6 +737,27 @@ void elv_io_group_set_async_queue(struct io_group *iog, int ioprio_class,
 EXPORT_SYMBOL(elv_io_group_set_async_queue);
 
 #ifdef CONFIG_GROUP_IOSCHED
+static void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup);
+
+static void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
+{
+	struct io_entity *entity = &iog->entity;
+
+	entity->weight = iocg->weight;
+	entity->ioprio_class = iocg->ioprio_class;
+	entity->ioprio_changed = 1;
+	entity->my_sd = &iog->sched_data;
+}
+
+static void io_group_set_parent(struct io_group *iog, struct io_group *parent)
+{
+	struct io_entity *entity = &iog->entity;
+
+	init_io_entity_parent(entity, &parent->entity);
+
+	/* Child group reference on parent group. */
+	elv_get_iog(parent);
+}
 
 struct io_cgroup io_root_cgroup = {
 	.weight = IO_WEIGHT_DEFAULT,
@@ -746,6 +770,27 @@ static struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
 			    struct io_cgroup, css);
 }
 
+/*
+ * Search the io_group for efqd into the hash table (by now only a list)
+ * of bgrp.  Must be called under rcu_read_lock().
+ */
+static struct io_group *
+io_cgroup_lookup_group(struct io_cgroup *iocg, void *key)
+{
+	struct io_group *iog;
+	struct hlist_node *n;
+	void *__key;
+
+	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+		__key = rcu_dereference(iog->key);
+		if (__key == key)
+			return iog;
+	}
+
+	return NULL;
+}
+
+
 #define SHOW_FUNCTION(__VAR)						\
 static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup,		\
 				       struct cftype *cftype)		\
@@ -885,12 +930,6 @@ static void iocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
 	task_unlock(tsk);
 }
 
-static void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
-{
-
-	/* Implemented in later patch */
-}
-
 struct cgroup_subsys io_subsys = {
 	.name = "io",
 	.create = iocg_create,
@@ -902,24 +941,210 @@ struct cgroup_subsys io_subsys = {
 	.use_id = 1,
 };
 
+static inline unsigned int iog_weight(struct io_group *iog)
+{
+	return iog->entity.weight;
+}
+
+static struct io_group *
+io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
+{
+	struct io_cgroup *iocg;
+	struct io_group *iog, *leaf = NULL, *prev = NULL;
+	gfp_t flags = GFP_ATOMIC |  __GFP_ZERO;
+
+	for (; cgroup != NULL; cgroup = cgroup->parent) {
+		iocg = cgroup_to_io_cgroup(cgroup);
+
+		iog = io_cgroup_lookup_group(iocg, key);
+		if (iog != NULL) {
+			/*
+			 * All the cgroups in the path from there to the
+			 * root must have a io_group for efqd, so we don't
+			 * need any more allocations.
+			 */
+			break;
+		}
+
+		iog = kzalloc_node(sizeof(*iog), flags, q->node);
+		if (!iog)
+			goto cleanup;
+
+		iog->iocg_id = css_id(&iocg->css);
+
+		io_group_init_entity(iocg, iog);
+
+		atomic_set(&iog->ref, 0);
+
+		/*
+		 * Take the initial reference that will be released on destroy
+		 * This can be thought of a joint reference by cgroup and
+		 * elevator which will be dropped by either elevator exit
+		 * or cgroup deletion path depending on who is exiting first.
+		 */
+		elv_get_iog(iog);
+
+		if (leaf == NULL) {
+			leaf = iog;
+			prev = leaf;
+		} else {
+			io_group_set_parent(prev, iog);
+			/*
+			 * Build a list of allocated nodes using the efqd
+			 * filed, that is still unused and will be initialized
+			 * only after the node will be connected.
+			 */
+			prev->key = iog;
+			prev = iog;
+		}
+	}
+
+	return leaf;
+
+cleanup:
+	while (leaf != NULL) {
+		prev = leaf;
+		leaf = leaf->key;
+		kfree(prev);
+	}
+
+	return NULL;
+}
+
+static void io_group_chain_link(struct request_queue *q, void *key,
+				struct cgroup *cgroup, struct io_group *leaf,
+				struct elv_fq_data *efqd)
+{
+	struct io_cgroup *iocg;
+	struct io_group *iog, *next, *prev = NULL;
+	unsigned long flags;
+
+	assert_spin_locked(q->queue_lock);
+
+	for (; cgroup != NULL && leaf != NULL; cgroup = cgroup->parent) {
+		iocg = cgroup_to_io_cgroup(cgroup);
+		next = leaf->key;
+
+		iog = io_cgroup_lookup_group(iocg, key);
+		BUG_ON(iog != NULL);
+
+		spin_lock_irqsave(&iocg->lock, flags);
+
+		rcu_assign_pointer(leaf->key, key);
+		hlist_add_head_rcu(&leaf->group_node, &iocg->group_data);
+		hlist_add_head(&leaf->elv_data_node, &efqd->group_list);
+
+		spin_unlock_irqrestore(&iocg->lock, flags);
+
+		prev = leaf;
+		leaf = next;
+	}
+
+	BUG_ON(cgroup == NULL && leaf != NULL);
+
+	/*
+	 * This connects the topmost element of the allocated chain to the
+	 * parent group.
+	 */
+	if (cgroup != NULL && prev != NULL) {
+		iocg = cgroup_to_io_cgroup(cgroup);
+		iog = io_cgroup_lookup_group(iocg, key);
+		io_group_set_parent(prev, iog);
+	}
+}
+
+static struct io_group *io_find_alloc_group(struct request_queue *q,
+			struct cgroup *cgroup, struct elv_fq_data *efqd,
+			int create)
+{
+	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
+	struct io_group *iog = NULL;
+	/* Note: Use efqd as key */
+	void *key = efqd;
+
+	/*
+	 * Take a reference to the css object. We don't want to map a bio
+	 * to a group that has been marked for deletion.
+	 */
+
+	if (!css_tryget(&iocg->css))
+		return iog;
+
+	iog = io_cgroup_lookup_group(iocg, key);
+	if (iog != NULL || !create)
+		goto end;
+
+	iog = io_group_chain_alloc(q, key, cgroup);
+	if (iog != NULL)
+		io_group_chain_link(q, key, cgroup, iog, efqd);
+
+end:
+	css_put(&iocg->css);
+	return iog;
+}
+
+/*
+ * Search for the io group the current task belongs to. If create=1, then
+ * also create the io group if it is not already there.
+ *
+ * Note: This function should be called with the queue lock held. It returns
+ * a pointer to the io group without taking any reference. That group will
+ * stay around as long as the queue lock is not dropped (since the group
+ * reclaim code needs to take the queue lock itself). So anybody who needs
+ * to use the group pointer after dropping the queue lock must take a
+ * reference on the group before doing so.
+ */
+struct io_group *elv_io_get_io_group(struct request_queue *q, int create)
+{
+	struct cgroup *cgroup;
+	struct io_group *iog;
+	struct elv_fq_data *efqd = q->elevator->efqd;
+
+	assert_spin_locked(q->queue_lock);
+
+	rcu_read_lock();
+	cgroup = task_cgroup(current, io_subsys_id);
+	iog = io_find_alloc_group(q, cgroup, efqd, create);
+	if (!iog) {
+		if (create)
+			iog = efqd->root_group;
+		else
+			/*
+			 * bio merge functions doing a lookup don't want
+			 * the bio mapped to the root group by default
+			 */
+			iog = NULL;
+	}
+	rcu_read_unlock();
+	return iog;
+}
+EXPORT_SYMBOL(elv_io_get_io_group);
+
 static void io_free_root_group(struct elevator_queue *e)
 {
 	struct io_group *iog = e->efqd->root_group;
+	struct io_cgroup *iocg = &io_root_cgroup;
+
+	spin_lock_irq(&iocg->lock);
+	hlist_del_rcu(&iog->group_node);
+	spin_unlock_irq(&iocg->lock);
 
 	put_io_group_queues(e, iog);
-	kfree(iog);
+	elv_put_iog(iog);
 }
 
 static struct io_group *io_alloc_root_group(struct request_queue *q,
 					struct elevator_queue *e, void *key)
 {
 	struct io_group *iog;
+	struct io_cgroup *iocg = &io_root_cgroup;
 	int i;
 
 	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
 	if (iog == NULL)
 		return NULL;
 
+	elv_get_iog(iog);
 	iog->entity.parent = NULL;
 	iog->entity.my_sd = &iog->sched_data;
 	iog->key = key;
@@ -927,11 +1152,215 @@ static struct io_group *io_alloc_root_group(struct request_queue *q,
 	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
 		iog->sched_data.service_tree[i] = ELV_SERVICE_TREE_INIT;
 
+	spin_lock_irq(&iocg->lock);
+	rcu_assign_pointer(iog->key, key);
+	hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
+	iog->iocg_id = css_id(&iocg->css);
+	spin_unlock_irq(&iocg->lock);
+
 	return iog;
 }
 
+static void io_group_free_rcu(struct rcu_head *head)
+{
+	struct io_group *iog;
+
+	iog = container_of(head, struct io_group, rcu_head);
+	kfree(iog);
+}
+
+/*
+ * This cleanup function does the last bit of work needed to destroy the
+ * group. It should only get called after __io_destroy_group has been invoked.
+ */
+static void io_group_cleanup(struct io_group *iog)
+{
+	struct io_service_tree *st;
+	int i;
+
+	BUG_ON(iog->sched_data.active_entity != NULL);
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+		BUG_ON(!RB_EMPTY_ROOT(&st->active));
+		BUG_ON(st->active_entity != NULL);
+	}
+
+	/*
+	 * Wait for any rcu readers to exit before freeing up the group.
+	 * Primarily useful when elv_io_get_io_group() is called without the
+	 * queue lock to access group data from the bdi_congested_group() path.
+	 */
+	call_rcu(&iog->rcu_head, io_group_free_rcu);
+}
+
+void elv_put_iog(struct io_group *iog)
+{
+	struct io_group *parent_iog = NULL;
+	struct io_entity *parent;
+
+	BUG_ON(atomic_read(&iog->ref) <= 0);
+	if (!atomic_dec_and_test(&iog->ref))
+		return;
+
+	parent = parent_entity(&iog->entity);
+	if (parent)
+		parent_iog = iog_of(parent);
+
+	io_group_cleanup(iog);
+
+	if (parent_iog)
+		elv_put_iog(parent_iog);
+}
+EXPORT_SYMBOL(elv_put_iog);
+
+/*
+ * After the group is destroyed, no new sync IO should come to the group.
+ * It might still have pending IOs in some busy queues. It should be able to
+ * send those IOs down to the disk. The async IOs (due to dirty page writeback)
+ * would go in the root group queues after this, as the group does not exist
+ * anymore.
+ */
+static void __io_destroy_group(struct elv_fq_data *efqd, struct io_group *iog)
+{
+	hlist_del(&iog->elv_data_node);
+	put_io_group_queues(efqd->eq, iog);
+
+	/*
+	 * Put the reference taken at creation time so that, once all the
+	 * queues are gone, the group can be destroyed.
+	 */
+	elv_put_iog(iog);
+}
+
+static void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
+	struct io_group *iog;
+	struct elv_fq_data *efqd;
+	unsigned long uninitialized_var(flags);
+
+	/*
+	 * io groups are linked in two lists. One list is maintained
+	 * in the elevator (efqd->group_list) and the other in the
+	 * per-cgroup structure (iocg->group_data).
+	 *
+	 * While a cgroup is being deleted, the elevator might also be
+	 * exiting, and both might try to clean up the same io group,
+	 * so we need to be a little careful.
+	 *
+	 * iocg->group_data is protected by iocg->lock. To avoid deadlock,
+	 * we can't take the queue lock while already holding iocg->lock.
+	 * So we first remove the iog from iocg->group_data under
+	 * iocg->lock; whoever removes an iog from iocg->group_data is
+	 * responsible for calling __io_destroy_group on it.
+	 */
+
+	rcu_read_lock();
+
+remove_entry:
+	spin_lock_irqsave(&iocg->lock, flags);
+
+	if (hlist_empty(&iocg->group_data)) {
+		spin_unlock_irqrestore(&iocg->lock, flags);
+		goto done;
+	}
+	iog = hlist_entry(iocg->group_data.first, struct io_group,
+			  group_node);
+	efqd = rcu_dereference(iog->key);
+	hlist_del_rcu(&iog->group_node);
+	iog->iocg_id = 0;
+	spin_unlock_irqrestore(&iocg->lock, flags);
+
+	spin_lock_irqsave(efqd->queue->queue_lock, flags);
+	__io_destroy_group(efqd, iog);
+	spin_unlock_irqrestore(efqd->queue->queue_lock, flags);
+	goto remove_entry;
+
+done:
+	free_css_id(&io_subsys, &iocg->css);
+	rcu_read_unlock();
+	BUG_ON(!hlist_empty(&iocg->group_data));
+	kfree(iocg);
+}
+
+/*
+ * This function checks whether iog is still on iocg->group_data and, if so,
+ * removes it. If iog is not on that list, the cgroup destroy path has
+ * already removed it and there is nothing left for us to do.
+ */
+static void
+io_group_check_and_destroy(struct elv_fq_data *efqd, struct io_group *iog)
+{
+	struct io_cgroup *iocg;
+	unsigned long flags;
+	struct cgroup_subsys_state *css;
+
+	rcu_read_lock();
+
+	css = css_lookup(&io_subsys, iog->iocg_id);
+
+	if (!css)
+		goto out;
+
+	iocg = container_of(css, struct io_cgroup, css);
+
+	spin_lock_irqsave(&iocg->lock, flags);
+
+	if (iog->iocg_id) {
+		hlist_del_rcu(&iog->group_node);
+		__io_destroy_group(efqd, iog);
+	}
+
+	spin_unlock_irqrestore(&iocg->lock, flags);
+out:
+	rcu_read_unlock();
+}
+
+static void release_elv_io_groups(struct elevator_queue *e)
+{
+	struct hlist_node *pos, *n;
+	struct io_group *iog;
+	struct elv_fq_data *efqd = e->efqd;
+
+	hlist_for_each_entry_safe(iog, pos, n, &efqd->group_list,
+					elv_data_node) {
+		io_group_check_and_destroy(efqd, iog);
+	}
+}
+
+/*
+ * If the bio submitting task and the rq don't belong to the same io_group,
+ * they can't be merged.
+ */
+int elv_io_group_allow_merge(struct request *rq, struct bio *bio)
+{
+	struct request_queue *q = rq->q;
+	struct io_queue *ioq = rq->ioq;
+	struct io_group *iog, *__iog;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return 1;
+
+	/* Determine the io group of the bio submitting task */
+	iog = elv_io_get_io_group(q, 0);
+	if (!iog) {
+		/* Maybe the task belongs to a different cgroup for which
+		 * the io group has not been set up yet. */
+		return 0;
+	}
+
+	/* Determine the io group of the ioq the rq belongs to */
+	__iog = ioq_to_io_group(ioq);
+
+	return (iog == __iog);
+}
+
 #else /* CONFIG_GROUP_IOSCHED */
 
+static inline unsigned int iog_weight(struct io_group *iog) { return 0; }
+static inline void release_elv_io_groups(struct elevator_queue *e) {}
+
 static struct io_group *io_alloc_root_group(struct request_queue *q,
 					struct elevator_queue *e, void *key)
 {
@@ -1008,8 +1437,13 @@ __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq, int coop)
 	struct elevator_queue *eq = q->elevator;
 
 	if (ioq) {
-		elv_log_ioq(efqd, ioq, "set_active, busy=%d",
-						efqd->busy_queues);
+		struct io_group *iog = ioq_to_io_group(ioq);
+		elv_log_ioq(efqd, ioq, "set_active, busy=%d class=%hu prio=%hu"
+				" weight=%u group_weight=%u qued=%d",
+				efqd->busy_queues, ioq->entity.ioprio_class,
+				ioq->entity.ioprio, ioq->entity.weight,
+				iog_weight(iog), ioq->nr_queued);
+
 		ioq->slice_start = ioq->slice_end = 0;
 		ioq->dispatch_start = jiffies;
 
@@ -1174,6 +1608,7 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 	struct io_queue *ioq;
 	struct elevator_queue *eq = q->elevator;
 	struct io_entity *entity, *new_entity;
+	struct io_group *iog = NULL, *new_iog = NULL;
 
 	ioq = elv_active_ioq(eq);
 
@@ -1206,9 +1641,16 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 		return 1;
 
 	/*
-	 * Check with io scheduler if it has additional criterion based on
-	 * which it wants to preempt existing queue.
+	 * If both the queues belong to same group, check with io scheduler
+	 * if it has additional criterion based on which it wants to
+	 * preempt existing queue.
 	 */
+	iog = ioq_to_io_group(ioq);
+	new_iog = ioq_to_io_group(new_ioq);
+
+	if (iog != new_iog)
+		return 0;
+
 	if (eq->ops->elevator_should_preempt_fn) {
 		void *sched_queue = elv_ioq_sched_queue(new_ioq);
 
@@ -1354,6 +1796,10 @@ static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
 	if (new_ioq)
 		elv_log_ioq(e->efqd, ioq, "cooperating ioq=%d", new_ioq->pid);
 
+	/* Only select a co-operating queue if it is in the same group as ioq */
+	if (new_ioq && !is_same_group(&ioq->entity, &new_ioq->entity))
+		return NULL;
+
 	return new_ioq;
 }
 
@@ -1596,6 +2042,7 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
 	efqd->idle_slice_timer.data = (unsigned long) efqd;
 
 	INIT_WORK(&efqd->unplug_work, elv_kick_queue);
+	INIT_HLIST_HEAD(&efqd->group_list);
 
 	efqd->elv_slice[0] = elv_slice_async;
 	efqd->elv_slice[1] = elv_slice_sync;
@@ -1613,12 +2060,22 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
 void elv_exit_fq_data(struct elevator_queue *e)
 {
 	struct elv_fq_data *efqd = e->efqd;
+	struct request_queue *q = efqd->queue;
 
 	if (!elv_iosched_fair_queuing_enabled(e))
 		return;
 
 	elv_shutdown_timer_wq(e);
 
+	spin_lock_irq(q->queue_lock);
+	release_elv_io_groups(e);
+	spin_unlock_irq(q->queue_lock);
+
+	elv_shutdown_timer_wq(e);
+
+	/* Wait for iog->key accessors to exit their grace periods. */
+	synchronize_rcu();
+
 	BUG_ON(timer_pending(&efqd->idle_slice_timer));
 	io_free_root_group(e);
 }
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 6998e69..671f366 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -92,6 +92,7 @@ struct io_group {
 	atomic_t ref;
 	struct io_sched_data sched_data;
 	struct hlist_node group_node;
+	struct hlist_node elv_data_node;
 	unsigned short iocg_id;
 	/*
 	 * async queue for each priority case for RT and BE class.
@@ -101,6 +102,7 @@ struct io_group {
 	struct io_queue *async_queue[2][IOPRIO_BE_NR];
 	struct io_queue *async_idle_queue;
 	void *key;
+	struct rcu_head rcu_head;
 };
 
 struct io_cgroup {
@@ -134,6 +136,9 @@ struct io_group {
 struct elv_fq_data {
 	struct io_group *root_group;
 
+	/* List of io groups hanging on this elevator */
+	struct hlist_head group_list;
+
 	struct request_queue *queue;
 	struct elevator_queue *eq;
 	unsigned int busy_queues;
@@ -314,6 +319,28 @@ static inline struct io_queue *elv_get_oom_ioq(struct elevator_queue *eq)
 	return &eq->efqd->oom_ioq;
 }
 
+#ifdef CONFIG_GROUP_IOSCHED
+
+extern int elv_io_group_allow_merge(struct request *rq, struct bio *bio);
+extern void elv_put_iog(struct io_group *iog);
+extern struct io_group *elv_io_get_io_group(struct request_queue *q,
+						int create);
+
+static inline void elv_get_iog(struct io_group *iog)
+{
+	atomic_inc(&iog->ref);
+}
+
+#else /* !GROUP_IOSCHED */
+
+static inline int elv_io_group_allow_merge(struct request *rq, struct bio *bio)
+{
+	return 1;
+}
+
+static inline void elv_get_iog(struct io_group *iog) {}
+static inline void elv_put_iog(struct io_group *iog) {}
+
 static inline struct io_group *
 elv_io_get_io_group(struct request_queue *q, int create)
 {
@@ -321,6 +348,8 @@ elv_io_get_io_group(struct request_queue *q, int create)
 	return q->elevator->efqd->root_group;
 }
 
+#endif /* GROUP_IOSCHED */
+
 extern ssize_t elv_slice_sync_show(struct elevator_queue *q, char *name);
 extern ssize_t elv_slice_sync_store(struct elevator_queue *q, const char *name,
 						size_t count);
@@ -404,5 +433,11 @@ static inline void *elv_select_ioq(struct request_queue *q, int force)
 {
 	return NULL;
 }
+
+static inline int elv_io_group_allow_merge(struct request *rq, struct bio *bio)
+{
+	return 1;
+}
 #endif /* CONFIG_ELV_FAIR_QUEUING */
 #endif /* _ELV_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index ea4042e..b2725cd 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -122,6 +122,10 @@ int elv_rq_merge_ok(struct request *rq, struct bio *bio)
 	    !bio_failfast_driver(bio)	 != !blk_failfast_driver(rq))
 		return 0;
 
+	/* If rq and bio belong to different groups, don't allow merging */
+	if (!elv_io_group_allow_merge(rq, bio))
+		return 0;
+
 	if (!elv_iosched_allow_merge(rq, bio))
 		return 0;
 
-- 
1.6.0.6

